© Burkhard Rost (TU Munich) /65
Protein Prediction - Part 1: Structure
Wednesday May 25, 2011
Announcements
Videos: SciVee www.rostlab.org
THANKS: Tim Karl + Haitam Sohby
NO lectures:
• Tue May 31(!): student assembly
• Thu Jun 2 (Ascension)
• Thu Jun 16 ?
LAST lecture: Jul 7
Exam: Jul 12 (?), 10:30 (likely this room)
• Makeup: likely: October 13 - morning
CONTACT: Marlena Drabik [email protected]
Today: Secondary structure prediction 1
LAST WEEKS
• Secondary structure prediction: principles on white board
THIS WEEK
• Secondary structure prediction methods - details
NEXT WEEK
• Student Assembly & Ascension Day -> no lecture
2 WEEKS from now
• Marc Ofmann: Molecular Dynamics (MD)
• Comparative modeling
1D prediction: gory details
PHDsec: the un-g(l)ory details
76% is an average over a distribution: standard deviation ≈ 10%
Prediction accuracy varies!
[Figure: histogram. x-axis: Per-residue accuracy (Q3), 0-100; y-axis: Number of protein chains, 0-70. <Q3>=72.3% ; sigma=10.5%. Labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
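As a minimal sketch of what such a histogram summarizes: Q3 can be computed per protein as the percentage of residues whose predicted state matches the observed one, and the mean and standard deviation are then taken over proteins. The function name `q3` and the toy strings below are hypothetical and chosen only for illustration; they are not data from the lecture.

```python
from statistics import mean, pstdev


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues predicted in the correct state (H, E, or L)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Hypothetical toy data: (observed, predicted) three-state strings.
toy_proteins = [
    ("HHHHLLEEEE", "HHHHLLEEEE"),
    ("HHHHLLEEEE", "HHHLLLEEEL"),
    ("LLEEEELLHH", "LLLLEELLHL"),
]

per_protein = [q3(obs, pred) for obs, pred in toy_proteins]
print("Q3 per protein:", per_protein)
print("<Q3> = %.1f%% ; sigma = %.1f%%" % (mean(per_protein), pstdev(per_protein)))
```

Averaging per protein (rather than pooling all residues) is what makes the spread between proteins visible.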
PHDsec: the un-g(l)ory details
76% is an average over a distribution: standard deviation ≈ 10%
stronger predictions more accurate
Stronger predictions more accurate!
[Figure: x-axis: Reliability index averaged over protein (3-9); y-axis: Q3 per protein (0-100). Legend: Q3 per protein; fit: Q3fit = 21 + 8.7 * Q3]
[Inset: input alphabet ACDEFGHIKLMNPQRSTVWY.; output states H, E, L; example residues with states: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E); example outputs: H=0.5 E=0.4 L=0.1 and H=0.8 E=0.1 L=0.1]
[Inset: histogram. x-axis: Per-residue accuracy (Q3); y-axis: Number of protein chains. <Q3>=72.3% ; sigma=10.5%. Labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
PHDsec: the un-g(l)ory details
76% is an average over a distribution: standard deviation ≈ 10%
stronger predictions more accurate
WARNING: reliability index almost factor 2 too large for single sequences
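To make "stronger prediction" concrete, here is a hedged sketch of a reliability index computed from the three output values. The slides do not spell out the formula; the definition used here, floor(10 × (largest output - second-largest output)), is an assumption based on how such indices are commonly described for PHD-style networks. The function names are hypothetical. The two output triplets are the examples shown on the slide.

```python
import math


def predicted_state(outputs: dict) -> str:
    """State (H, E, or L) with the largest output value."""
    return max(outputs, key=outputs.get)


def reliability_index(outputs: dict) -> int:
    """Assumed definition: floor(10 * (largest - second largest)), clamped to 0..9."""
    ranked = sorted(outputs.values(), reverse=True)
    margin = ranked[0] - ranked[1]
    # The small epsilon guards against floating-point round-off (e.g. 0.5 - 0.4).
    return max(0, min(9, math.floor(10 * margin + 1e-9)))


weak = {"H": 0.5, "E": 0.4, "L": 0.1}
strong = {"H": 0.8, "E": 0.1, "L": 0.1}

for name, outputs in (("weak", weak), ("strong", strong)):
    print(name, predicted_state(outputs), reliability_index(outputs))
```

Both examples predict helix, but under this assumed definition the second is a much stronger prediction (index 7 versus 1).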
Details PHDsec: Multiple alignment
single sequences => accuracy clearly lower
id     nali  Q3sec  Q2acc
AA                          KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS                         EEEE E E EEEEEE EEEEEE EEEEEEHHHEEEE
30 N   26    70     77      EEEEEEE EEE EEEEE EEEE EE EEE
self   1     63     72      EEEEEEE EEEE EEEEE EEEEEE HHHHH
FAQ of secondary structure prediction
What is the best alignment?
Limit of prediction accuracy reached?
Comparative modeling or de novo?
Ultimate rôle in structure prediction (1D-3D)?
Will secondary structure and 3D prediction merge completely?
FAQ of secondary structure prediction
What is the best alignment?
Evolution has it!
[Figure: x-axis: Number of residues aligned (0-250); y-axis: Percentage sequence identity (0-100). Region labels: "Sequence identity implies structural similarity!" and "Don't know region"]
C Sander & R Schneider 1991 Proteins 9:56-68
B Rost 1999 Prot Engin 12:85-94
Different alignment strategies
Columns: SWISS-PROT E<1, SWISS-PROT E<10^-3, BIG E<1, BIG E<10^-3
BLAST: 8.2, 7.6, 9.7, 9.2
simple ClustalW: 4.4, 5.4
profile ClustalW: 5.4, 7.1
MaxHom with McLachlan: 7.2, 7.5, 9.0, 8.9
MaxHom with BLOSUM62: 8.3, 7.9, 9.5, 9.1
BLAST-filter: 7.9, 7.6, 9.5, 9.2
profile-based BLAST: 8.2, 7.8, 9.6, 9.1
significant difference: > 0.44, > 0.44, > 0.44, > 0.44
Small vs. big
[Figure: two panels. x-axis: Q3(SWISS-PROT) - Q3(single sequence); y-axis: Q3(NRDB) - Q3(single sequence). A: BLAST-E cut-off < 1; B: BLAST-E cut-off < 10^-20]
Accuracy vs. E-value BLAST
E-value                 PHDsec  PHDacc
100                     8.7     4.4
20                      9.1     4.9
10                      9.5     5.0
1                       9.7     5.2
10^-1                   9.5     5.3
10^-2                   9.2     5.3
10^-3                   9.1     5.2
10^-4                   8.9     5.2
10^-7                   8.5     5.0
10^-20                  6.9     4.5
significant difference  >0.44   >0.39
Accuracy vs. E-value PSI-BLAST
Iteration E-value       PHDsec  PHDacc
10                      7.3     3.0
1                       9.3     4.2
10^-1                   10.1    4.8
10^-2                   10.1    5.0
10^-3                   10.0    5.0
10^-4                   10.1    5.1
10^-7                   9.9     5.1
10^-20                  9.6     5.2
10^-60                  9.4     5.0
significant difference  >0.44   >0.3
Accuracy vs. pollution
Number of iterations    h=10^-4                  h=10^-10
                        filtered  non-filtered   filtered  non-filtered
1                       9.5       9.7            9.5       9.7
2                       9.9       10.0           9.8       10.0
3                       10.1      9.8            10.1      10.0
4                       9.6       9.3            10.1      9.8
6                       9.3       8.8            9.9       9.7
10                      8.1       7.4            9.7       9.5
significant difference  >0.44     >0.44          >0.44     >0.44
PSI-BLAST not always best
[Figure: x-axis: Prediction accuracy using pairwise BLAST (50-90); y-axis: Prediction accuracy using iterated PSI-BLAST (50-90)]
More = better?
[Figure: x-axis: Number of proteins aligned (1-1000); y-axis: Q3 alignment - Q3 single sequence (-5 to 25)]
Data deluge not enough for greedy bioinformaticians
[Figure: x-axis: Percent of proteins (0-100); y-axis: Number of sequences in alignment (0-100). Series: BLAST on SWISS-PROT; BLAST on NRDB; PSI-BLAST on NRDB; PSI-BLAST excluding BLAST hits]
FAQ of secondary structure prediction
What is the best alignment? ... that depends
Limit of prediction accuracy reached?
Comparative modeling or de novo?
Ultimate rôle in structure prediction (1D-3D)?
Will secondary structure and 3D prediction merge completely?
1D secondary structure prediction: Quo vadis?
How to assess secondary structure prediction methods?
How to assess performance?
[Figure: a question mark and five labels reading "groups"]
Art 2 Assess: Truth from where
Standard of truth: PDB -> DSSP -> string HEL
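The last step of that pipeline, reducing DSSP's state codes to a three-letter HEL string, could look like the sketch below. The slide does not say which reduction is used; the mapping here (H, G, I -> H; E, B -> E; everything else -> L) is one widely used convention and is an assumption, as is the function name `dssp_to_hel`.

```python
# Assumed reduction of DSSP states to three states; other conventions exist.
DSSP_TO_HEL = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def dssp_to_hel(dssp_states: str) -> str:
    """Map a string of per-residue DSSP codes to H (helix), E (strand), L (other)."""
    return "".join(DSSP_TO_HEL.get(state, "L") for state in dssp_states)


# Hypothetical DSSP string; turns, bends, and blanks all become L.
print(dssp_to_hel("HHGGTT EEBS-"))
```

Because the reduction changes the "truth" that predictions are scored against, accuracy values are only comparable between methods that use the same mapping.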
Art 2 Assess: data set size
many many many?
[Figure: histogram. x-axis: Per-residue accuracy (Q3), 0-100; y-axis: Number of protein chains, 0-70. <Q3>=72.3% ; sigma=10.5%. Labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
Some 500+ proteins appear to work
ANY 500+ do?
NO
we need to sample the “true distribution”
(redundancy reduction/bias reduction)
Art 2 Assess: standard error
standard deviation ~ 10% (in Q3)
assume 2500 “effective” proteins in data set
method 1: Q3=76.521%
method 2: Q3=76.301%
Do they differ?
σ ~ 10% (in Q3), nprot = 2500
Rule of thumb: StdError = σ / √nprot = 10% / √2500 = 0.2%
Difference: 76.521% - 76.301% = 0.22%
YES, the difference is statistically significant (although borderline)
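The rule of thumb is easy to check numerically. The sketch below treats "significant" as "the difference exceeds one standard error", which is how the slide applies the rule; stricter criteria exist. The function names are hypothetical.

```python
import math


def standard_error(sigma: float, n_proteins: int) -> float:
    """Rule of thumb from the slide: sigma / sqrt(number of proteins)."""
    return sigma / math.sqrt(n_proteins)


def exceeds_one_standard_error(q3_a: float, q3_b: float,
                               sigma: float, n_proteins: int) -> bool:
    """True if the two averages differ by more than one standard error."""
    return abs(q3_a - q3_b) > standard_error(sigma, n_proteins)


for n in (2500, 100):
    se = standard_error(10.0, n)
    verdict = exceeds_one_standard_error(76.521, 76.301, 10.0, n)
    print(f"nprot={n}: StdError={se:.2f}, difference > StdError: {verdict}")
```

With nprot=2500 the 0.22-point gap just exceeds the standard error of 0.2; with nprot=100 the standard error grows to 1 point and the same gap falls well inside it.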
Art 2 Assess: data set size
ANY 500+ do?
NO
we need to sample the “true distribution”
(redundancy reduction/bias reduction)
is that ENOUGH?
Art 2 Assess: data set
anything else to consider in choosing the data set?
set of sequence-unique/unbiased proteins that are ideally also sequence-unique/unbiased with respect to anything used to develop the methods to assess
-> NEW proteins
Art 2 Assess: standard error
σ ~ 10% (in Q3), nprot = 100
method 1: Q3=76.521%
method 2: Q3=76.301%
Rule of thumb: StdError = σ / √nprot = 10% / √100 = 1%
NO, the 0.22% difference is well below the standard error: not statistically significant
Art 2 Assess: anything else?
Anything else to consider?
Evaluation alternatives
Method 1 predicts proteins P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11
Method 2 predicts P2, P4, P6, P8, P10, P11
Method 3: P1, P3, P5, P7, P9, P10, P11
Method 4: P0, P10, P11
Ranking not stable!
29 different worse than 11 identical
VA Eyrich, IYY Koh, D Przybylski, O Graña, F Pazos, A Valencia and B Rost (2003) Proteins 53 Suppl 6 548-60
© Burkhard Rost (Columbia New York)
Compare methods on identical data sets!!
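One simple way to obtain an identical data set is to restrict the comparison to the proteins that every method predicted. The sketch below applies this to the coverage listed under "Evaluation alternatives"; the helper name `common_subset` is hypothetical, and taking the intersection is only one possible strategy.

```python
# Proteins predicted by each method, as listed under "Evaluation alternatives".
coverage = {
    "Method 1": {f"P{i}" for i in range(1, 12)},
    "Method 2": {"P2", "P4", "P6", "P8", "P10", "P11"},
    "Method 3": {"P1", "P3", "P5", "P7", "P9", "P10", "P11"},
    "Method 4": {"P0", "P10", "P11"},
}


def common_subset(coverage_by_method: dict) -> set:
    """Proteins predicted by all methods: the only set on which all are comparable."""
    return set.intersection(*coverage_by_method.values())


shared = sorted(common_subset(coverage), key=lambda name: int(name[1:]))
print("Proteins predicted by all methods:", shared)
```

In this example only P10 and P11 remain, which illustrates the trade-off: identical sets make rankings fair, but the shared set can become too small for a meaningful standard error.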
EVA: automatic continuous EVAluation of structure prediction
[Figure: workflow diagram. Labels: Get PDB; Send sequences; Prediction servers (secondary structure, fold recognition; inter-residue contacts / distances; comparative modelling, fold recognition); Collect HTML; Update central pages; EVA-DB; Analyse: pairwise BLAST; Analyse: PSI-BLAST, MaxHom, sequence-unique sets; Compile results at: one protein (PDB vs prediction), week summary; Results: static pages; Satellites/Mirrors; User: browse, query, ftp; every day; every week]
EVA: secondary structure
Method      Q3    Q3 Claim   SOV  Info  CorrH  CorrE  CorrL  Class  BAD
PROF        76.0  -          72   0.35  0.67   0.63   0.55   82     2.7
PSIPRED     76.0  76.5-78.3  72   0.36  0.65   0.62   0.55   78     2.8
SSpro       76.0  76         71   0.35  0.67   0.63   0.56   83     2.8
JPred2      75.0  76.4       69   0.34  0.65   0.60   0.54   76     2.6
PHDpsi      75.0  -          71   0.33  0.65   0.60   0.54   81     3.0
PHD         71.4  71.6       68   0.28  0.59   0.58   0.49   77     4.3
Copenhagen  78    77.8
Wang/Yuan   53
76%
Secondary structure predictions differ
Accuracy varies for proteins!
[Figure: x-axis: Percentage correctly predicted residues per protein (30-100); y-axis: Percentage of all 150 proteins (0-25). Series: PSIPRED, SSpro, PROF, PHDpsi, JPred2, PHD]
Some proteins predicted better
[Figure: x-axis: Cumulative percentage of proteins (0-100); y-axis: Accuracy per protein (Q3) (30-90)]
Averaging over many methods not always good!
[Figure: x-axis: Per-protein prediction accuracy averaged over 6 methods (55-95); y-axis: Deviation of method from average (-30 to 30). Series: ave-PSIPRED, ave-SSpro, ave-PROF, ave-PHDpsi, ave-JPred2, ave-PHD]
Reliability correlates with accuracy!
[Figure: x-axis: Percentage of residues predicted (0-100); y-axis: Percentage of correctly predicted residues (70-100). Series: JPred2, PHD, PROF, PSIPRED]
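A curve of this kind can be produced by keeping only residues whose reliability index reaches a threshold and then reporting how many residues remain (coverage) and how many of those are correct (accuracy). The sketch below assumes an integer reliability index from 0 to 9; the function name and the toy residue list are hypothetical and not data from the lecture.

```python
def accuracy_vs_coverage(residues):
    """residues: list of (reliability_index, is_correct) pairs.

    Returns a list of (threshold, coverage %, accuracy %) for thresholds 0..9,
    stopping once no residue reaches the threshold.
    """
    total = len(residues)
    curve = []
    for threshold in range(10):
        kept = [ok for ri, ok in residues if ri >= threshold]
        if not kept:
            break
        coverage = 100.0 * len(kept) / total
        accuracy = 100.0 * sum(kept) / len(kept)
        curve.append((threshold, coverage, accuracy))
    return curve


# Hypothetical toy residues: (reliability index, prediction correct?)
toy_residues = [
    (9, True), (8, True), (7, True), (5, True), (5, False),
    (3, True), (2, False), (1, False), (0, True), (0, False),
]

curve = accuracy_vs_coverage(toy_residues)
for threshold, coverage, accuracy in curve:
    print(f"RI >= {threshold}: {coverage:5.1f}% of residues, {accuracy:5.1f}% correct")
```

In the toy data, raising the threshold trades coverage for accuracy, which is the behavior the slide title describes.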
Secondary structure prediction 2005
history
1st generation  1970s  50-55%  55
2nd generation  1980s  55-62%  62  + 7
3rd generation  1992   70-72%  72  +10
                2000   > 76%   76  + 4
                2011   > 78%   78  + 2
what improves (2002)?
database growth  +3
PSI-BLAST        +0.5
new training     +1
‘clever method’  +1
Quo vadis?
1980: 55% simple
1990: 60% less simple
1993: 70% evolution
2000: 76% more evolution
2011: 78% even more evolution
what is the limit?
88% for proteins of similar structure
80% for 1/5th of proteins with families > 100
missing:
• better definition of secondary structure including long-range interactions
• structural switches
• chameleon / folding
Conclusion: secondary structure prediction
big gain through using evolutionary information
are we going to reach above 80%? How high?
continuous secondary structure
better methods
other features
use secondary structure: ASP
M Young, Kirshenbaum, Dill, S Highsmith: Predicting conformational switches in proteins. Protein Sci 1999, 8:1752-1764.
FAQ of secondary structure prediction
What is the best alignment? that depends
Limit of prediction accuracy reached? no
Comparative modeling or de novo? specialist is best
Ultimate rôle in structure prediction (1D-3D)?
Will secondary structure and 3D prediction merge completely?