Protein Prediction - Part 1: Structure · 2011. 5. 26. · Today: Secondary structure prediction 1...

© Burkhard Rost (TU Munich)

Protein Prediction - Part 1: Structure

Wednesday May 25, 2011


Announcements

Videos: SciVee, www.rostlab.org
THANKS: Tim Karl + Haitam Sohby

NO lectures:
• Tue May 31 (!): student assembly (Studentische Vollversammlung)
• Thu Jun 2 (Ascension)
• Thu Jun 16 ?

LAST lecture: Jul 7
Exam: Jul 12 (?), 10:30 (likely this room)
• Makeup: likely October 13, morning

CONTACT: Marlena Drabik [email protected]




Today: Secondary structure prediction 1

LAST WEEKs
• Secondary structure prediction: principles on white board

THIS WEEK
• Secondary structure prediction methods - details

NEXT WEEK
• Student Assembly & Ascension Day -> no lecture

2 WEEKs from now
• Marc Ofmann: Molecular Dynamics (MD)
• Comparative modeling



1D prediction: gory details



PHDsec: the un-g(l)ory details

• 76% is an average over a distribution with a standard deviation of ≈ 10%



Prediction accuracy varies!

[Figure: histogram of the number of protein chains (y-axis, 0-70) by per-residue accuracy Q3 (x-axis, 0-100%); <Q3> = 72.3%, sigma = 10.5%; labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]




• 76% is an average over a distribution with a standard deviation of ≈ 10%
• stronger predictions are more accurate



Stronger predictions more accurate!

[Figure: Q3 per protein (y-axis, 0-100) vs. reliability index averaged over protein (x-axis, 3-9); legend: Q3 per protein, and a linear fit printed as Q3fit = 21 + 8.7 * Q]

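[Figure: prediction sketch. Input alphabet: ACDEFGHIKLMNPQRSTVWY. Output states: H, E, L. Example residues with their states: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E). Two example outputs: H=0.5, E=0.4, L=0.1 and H=0.8, E=0.1, L=0.1]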
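The two example outputs (H=0.5/E=0.4/L=0.1 and H=0.8/E=0.1/L=0.1) both predict helix, but with different strength. A minimal sketch of how such outputs could be turned into a reliability index follows. It assumes the commonly quoted definition RI = int(10 * (largest output - second largest output)); the slides do not spell the formula out, and the function name `reliability_index` is hypothetical.

```python
def reliability_index(outputs):
    """Return (predicted_state, RI) for one residue.

    outputs: dict mapping a state ("H", "E", "L") to an output value in [0, 1].
    ASSUMPTION: RI = int(10 * (largest - second largest)); not stated on the slides.
    """
    ranked = sorted(outputs.items(), key=lambda kv: kv[1], reverse=True)
    (best_state, best), (_, second) = ranked[0], ranked[1]
    # The small epsilon guards against floating-point round-off (0.5 - 0.4 < 0.1).
    return best_state, int(10 * (best - second) + 1e-9)


weak = reliability_index({"H": 0.5, "E": 0.4, "L": 0.1})
strong = reliability_index({"H": 0.8, "E": 0.1, "L": 0.1})
print("weak prediction:", weak)
print("strong prediction:", strong)
```

Under this assumed definition the first output gets a low index and the second a high one, which is the sense in which "stronger" predictions can be singled out.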

[Figure: distribution of per-protein Q3 (number of protein chains vs. Q3); <Q3> = 72.3%, sigma = 10.5%; labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]




• 76% is an average over a distribution with a standard deviation of ≈ 10%
• stronger predictions are more accurate
• WARNING: reliability index almost a factor of 2 too large for single sequences



Details PHDsec: Multiple alignment

single sequences => accuracy clearly lower

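id     nali   Q3sec   Q2acc
AA                             KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS                            EEEE E E EEEEEE EEEEEE EEEEEEHHHEEEE
30 N   26     70      77       EEEEEEE EEE EEEEE EEEE EE EEE
self   1      63      72       EEEEEEE EEEE EEEEE EEEEEE HHHHH

(The original spacing inside the state strings is not preserved, so they do not line up residue by residue with the sequence.)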
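Q3 is the per-residue three-state accuracy: the percentage of residues whose predicted state (H, E or L) matches the observed one. A minimal sketch, using short made-up strings rather than the strings on the slide (their spacing is lost); the function name `q3` is hypothetical.

```python
def q3(observed, predicted):
    """Percentage of residues predicted in the correct state (H, E or L)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy example (NOT the protein shown on the slide).
obs = "LEEEELLHHHHL"
pred = "LEEELLLHHHLL"
print(f"Q3 = {q3(obs, pred):.1f}%")
```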



FAQ of secondary structure prediction

• What is the best alignment?
• Limit of prediction accuracy reached?
• Comparative modeling or de novo?
• Ultimate rôle in structure prediction (1D-3D)?
• Will secondary structure and 3D prediction merge completely?




What is the best alignment?




Evolution has it!

[Figure: percentage sequence identity (y-axis, 0-100) vs. number of residues aligned (x-axis, 0-250); annotations: "Sequence identity implies structural similarity!" and "Don't know region"]

C Sander & R Schneider 1991 Proteins 9:56-68
B Rost 1999 Prot Engin 12:85-94
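The figure separates a region where sequence identity implies structural similarity from a "don't know" region, with a length-dependent curve in between. A minimal sketch of such a curve follows. The functional form and constants are quoted from memory of the Rost 1999 reference and are an assumption here, not taken from the slides; check them against the paper before relying on them. The function names are hypothetical.

```python
import math


def identity_threshold(length, offset=0.0):
    """Length-dependent percentage-identity threshold.

    ASSUMPTION: form and constants recalled from Rost 1999 (Prot Engin 12:85-94),
    not given on the slides. `offset` shifts the curve up or down.
    """
    if length <= 11:
        return offset + 100.0
    if length > 450:
        return offset + 19.5
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return offset + 480.0 * length ** exponent


def implies_structural_similarity(length, pct_identity, offset=0.0):
    """True if the pair lies on or above the curve, False for the 'don't know' region."""
    return pct_identity >= identity_threshold(length, offset)


for n_aligned in (20, 50, 100, 250):
    print(n_aligned, round(identity_threshold(n_aligned), 1))
```

The point of the sketch is the shape: short alignments need very high identity, long alignments much less.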



Different alignment strategies

Method                     SWISS-PROT            BIG
                           E<1      E<10^-3      E<1      E<10^-3
BLAST                      8.2      7.6          9.7      9.2
MaxHom with McLachlan      7.2      7.5          9.0      8.9
MaxHom with BLOSUM62       8.3      7.9          9.5      9.1
BLAST-filter               7.9      7.6          9.5      9.2
profile-based BLAST        8.2      7.8          9.6      9.1
significant difference     > 0.44   > 0.44       > 0.44   > 0.44

simple ClustalW: 4.4, 5.4 (only two values given)
profile ClustalW: 5.4, 7.1 (only two values given)



Different alignment strategies






Small vs. big

[Figure: scatter plots of Q3(NRDB) - Q3(single sequence) (y-axis, -30 to 40) vs. Q3(SWISS-PROT) - Q3(single sequence) (x-axis, -20 to 30); panel A: BLAST-E cut-off < 1; panel B: BLAST-E cut-off < 10^-20; panel C]



Different alignment strategies 3






Accuracy vs. E-value BLAST

E-value                  PHDsec   PHDacc
100                      8.7      4.4
20                       9.1      4.9
10                       9.5      5.0
1                        9.7      5.2
10^-1                    9.5      5.3
10^-2                    9.2      5.3
10^-3                    9.1      5.2
10^-4                    8.9      5.2
10^-7                    8.5      5.0
10^-20                   6.9      4.5
significant difference   >0.44    >0.39
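The table can be read as: which E-value cut-off gives the best PHDsec score, and which other cut-offs are not significantly worse? A minimal sketch, assuming that two scores differing by no more than 0.44 count as indistinguishable (my reading of the "significant difference" row); the variable names are hypothetical.

```python
# PHDsec column of the pairwise BLAST table, keyed by E-value cut-off.
blast_phdsec = {
    100: 8.7, 20: 9.1, 10: 9.5, 1: 9.7, 1e-1: 9.5,
    1e-2: 9.2, 1e-3: 9.1, 1e-4: 8.9, 1e-7: 8.5, 1e-20: 6.9,
}
SIGNIFICANT = 0.44  # from the "significant difference" row

best_e = max(blast_phdsec, key=blast_phdsec.get)
# Cut-offs whose score is within the significance margin of the best one.
not_worse = [e for e, score in blast_phdsec.items()
             if blast_phdsec[best_e] - score <= SIGNIFICANT]

print("best cut-off:", best_e)
print("not significantly worse:", not_worse)
```

With the tabulated numbers, only the cut-offs directly around the best one fall inside the margin; both very permissive and very strict cut-offs score lower.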



Accuracy vs. E-value PSI-BLAST

Iteration E-value        PHDsec   PHDacc
10                       7.3      3.0
1                        9.3      4.2
10^-1                    10.1     4.8
10^-2                    10.1     5.0
10^-3                    10.0     5.0
10^-4                    10.1     5.1
10^-7                    9.9      5.1
10^-20                   9.6      5.2
10^-60                   9.4      5.0
significant difference   >0.44    >0.3



Accuracy vs. pollution

Number of                h 10^-4                     h 10^-10
iterations               filtered   non-filtered     filtered   non-filtered
1                        9.5        9.7              9.5        9.7
2                        9.9        10.0             9.8        10.0
3                        10.1       9.8              10.1       10.0
4                        9.6        9.3              10.1       9.8
6                        9.3        8.8              9.9        9.7
10                       8.1        7.4              9.7        9.5
significant difference   >0.44      >0.44            >0.44      >0.44
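One way to read the pollution table is to ask, per column, at which iteration the score peaks and how much is lost by iterating on to 10 rounds. A minimal sketch over the tabulated values; the dictionary keys and the name `summary` are hypothetical labels.

```python
iterations = [1, 2, 3, 4, 6, 10]

# Columns of the table: (h threshold, filtering) -> scores per iteration count.
columns = {
    ("h=1e-4", "filtered"):      [9.5, 9.9, 10.1, 9.6, 9.3, 8.1],
    ("h=1e-4", "non-filtered"):  [9.7, 10.0, 9.8, 9.3, 8.8, 7.4],
    ("h=1e-10", "filtered"):     [9.5, 9.8, 10.1, 10.1, 9.9, 9.7],
    ("h=1e-10", "non-filtered"): [9.7, 10.0, 10.0, 9.8, 9.7, 9.5],
}

summary = {}
for key, scores in columns.items():
    peak = max(scores)
    peak_iteration = iterations[scores.index(peak)]  # first iteration reaching the peak
    loss_at_10 = round(peak - scores[-1], 1)         # drop from peak to 10 iterations
    summary[key] = (peak_iteration, loss_at_10)
    print(key, "peak at iteration", peak_iteration, "loss at 10 iterations:", loss_at_10)
```

In these numbers the loss from over-iterating is several times the 0.44 significance margin for the permissive threshold and close to the margin for the strict one.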



PSI-BLAST not always best

[Figure: prediction accuracy using iterated PSI-BLAST (y-axis, 50-90) vs. prediction accuracy using pairwise BLAST (x-axis, 50-90)]



More = better?

[Figure: Q3 alignment - Q3 single sequence (y-axis, -5 to 25) vs. number of proteins aligned (x-axis, logarithmic, 1-1000)]



Data deluge not enough for greedy bioinformaticians

[Figure: number of sequences in alignment (y-axis, 0-100) vs. percent of proteins (x-axis, 0-100); curves: BLAST on SWISS-PROT, BLAST on NRDB, PSI-BLAST on NRDB, PSI-BLAST excluding BLAST hits]




• What is the best alignment? ... that depends
• Limit of prediction accuracy reached?
• Comparative modeling or de novo?
• Ultimate rôle in structure prediction (1D-3D)?
• Will secondary structure and 3D prediction merge completely?



1D secondary structure prediction: Quo vadis?



How to assess secondary structure prediction methods?



How to assess performance?

[Figure: a question mark surrounded by several labels reading "groups"]



Art 2 Assess: Truth from where

Standard of truth: PDB -> DSSP -> string HEL
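The last step of this chain, from DSSP states to a three-letter HEL string, can be sketched as follows. The reduction used here (H, G, I -> H; E, B -> E; everything else -> L) is one common convention and an assumption on my part; the slide does not say which reduction is applied, and other reductions exist. The names `DSSP_TO_HEL` and `dssp_to_hel` are hypothetical.

```python
# ASSUMPTION: one common 8-to-3 state reduction; the slide does not specify it.
DSSP_TO_HEL = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def dssp_to_hel(dssp_states):
    """Map a string of DSSP states (blank = no assignment) to a HEL string."""
    return "".join(DSSP_TO_HEL.get(state, "L") for state in dssp_states)


example = "  EEEETTSHHHHGGG B "
print(dssp_to_hel(example))
```

The choice of reduction matters for assessment: mapping G or B to L instead changes the observed strings and therefore the measured accuracy.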



Art 2 Assess: data set size

many many many?




[Figure: distribution of per-protein Q3 (number of protein chains vs. Q3); <Q3> = 72.3%, sigma = 10.5%; labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]




Some 500+ proteins appear to work
ANY 500+ do?
NO: we need to sample the “true distribution” (redundancy reduction/bias reduction)
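Redundancy reduction can be sketched as a greedy filter: walk through the proteins and keep one only if it is not too similar to anything already kept. This is an illustration under stated assumptions, not the procedure used in the lecture: the similarity function is a toy position-by-position identity without alignment, the 30% cut-off is an arbitrary example value, and all names are hypothetical.

```python
def toy_percent_identity(a, b):
    """Toy similarity: identical positions over the shorter length (NO alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n


def sequence_unique_subset(proteins, max_identity=30.0, similarity=toy_percent_identity):
    """Greedy filter: keep a protein only if it is dissimilar to all kept ones.

    proteins: list of (name, sequence) pairs, visited in the given order.
    """
    kept = []
    for name, seq in proteins:
        if all(similarity(seq, other) <= max_identity for _, other in kept):
            kept.append((name, seq))
    return kept


toy = [("a", "ACDEFGHIKL"), ("b", "ACDEFGHIKV"), ("c", "WYWYWYWYWY")]
unique = sequence_unique_subset(toy)
print([name for name, _ in unique])
```

A greedy filter depends on the visiting order, so in practice the order (for example, best-resolved structures first) is itself a design choice.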



Art 2 Assess: standard error

• standard deviation ~ 10% (in Q3)
• assume 2500 “effective” proteins in data set
• method 1: Q3=76.521%
• method 2: Q3=76.301%

Do they differ?




σ ~ 10% (in Q3), nprot = 2500

Rule of thumb: StdError = σ / √nprot = 10% / √2500 = 0.2%

Difference between the methods: 76.521% - 76.301% = 0.22%

YES, the difference is statistically significant (although borderline: it barely exceeds one standard error)
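The rule of thumb is easy to check numerically. The sketch below applies it to the two Q3 values for data sets of 2500 and of 100 proteins; it implements only the rule as stated (difference larger than one standard error), and a stricter test, for example one on per-protein paired differences, may judge a borderline case differently. The function names are hypothetical.

```python
import math


def standard_error(sigma, n_proteins):
    """Rule of thumb from the slide: StdError = sigma / sqrt(nprot)."""
    return sigma / math.sqrt(n_proteins)


def exceeds_standard_error(q3_a, q3_b, sigma, n_proteins):
    """True if the two averages differ by more than one standard error."""
    return abs(q3_a - q3_b) > standard_error(sigma, n_proteins)


for n in (2500, 100):
    se = standard_error(10.0, n)
    differ = exceeds_standard_error(76.521, 76.301, 10.0, n)
    print(f"nprot={n}: StdError={se:.2f}%, difference exceeds it: {differ}")
```

The same 0.22% difference passes the rule with 2500 proteins and fails it with 100, which is why the size of the test set matters.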










Sampling the “true distribution” (redundancy reduction/bias reduction): is that ENOUGH?



Art 2 Assess: data set

anything else to consider in choosing the data set?

A set of sequence-unique/unbiased proteins that are ideally also sequence-unique/unbiased with respect to anything used to develop the methods being assessed

-> NEW proteins




σ ~ 10% (in Q3), nprot = 100

method 1: Q3=76.521%
method 2: Q3=76.301%

Rule of thumb: StdError = σ / √nprot = 10% / √100 = 1%

With only 100 proteins, the 0.22% difference is well below one standard error: the two methods cannot be distinguished.




Art 2 Assess: anything else?

Anything else to consider?




Evaluation alternatives

• Method 1 predicts proteins P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11
• Method 2 predicts P2, P4, P6, P8, P10, P11
• Method 3: P1, P3, P5, P7, P9, P10, P11
• Method 4: P0, P10, P11
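A fair comparison can only use proteins that every method has predicted. A minimal sketch that finds this common subset for the four methods listed above; the dictionary layout is a hypothetical choice.

```python
predicted = {
    "Method 1": {f"P{i}" for i in range(1, 12)},  # P1 .. P11
    "Method 2": {"P2", "P4", "P6", "P8", "P10", "P11"},
    "Method 3": {"P1", "P3", "P5", "P7", "P9", "P10", "P11"},
    "Method 4": {"P0", "P10", "P11"},
}

# Proteins predicted by ALL methods: the only set on which all four are comparable.
common = set.intersection(*predicted.values())
print("common to all methods:", sorted(common))

# Pairwise overlaps can be larger, e.g. Method 1 vs Method 2.
pair_overlap = predicted["Method 1"] & predicted["Method 2"]
print("Method 1 and Method 2 share", len(pair_overlap), "proteins")
```

The trade-off is visible in the numbers: the all-method subset is tiny, while pairwise subsets are larger but differ from pair to pair.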



Ranking not stable!

Evaluating on 29 different proteins is worse than evaluating on 11 identical proteins

VA Eyrich, IYY Koh, D Przybylski, O Graña, F Pazos, A Valencia and B Rost (2003) Proteins 53 Suppl 6 548-60

© Burkhard Rost (Columbia New York)



Compare methods on identical data sets!!



EVA: automatic continuous EVAluation of structure prediction

[Figure: EVA pipeline diagram. Labels shown:
• Get PDB
• Analyse: pairwise BLAST
• Analyse: PSI-BLAST, MaxHom, sequence-unique sets
• EVA-DB; Send sequences
• Prediction servers: secondary structure, fold recognition; inter-residue contacts / distances; comparative modelling, fold recognition
• Collect HTML; Update central pages
• Results: static pages; one protein: PDB vs prediction; week summary
• Compile results at Satellites/Mirrors
• User: browse, query, ftp
• update cycles: every day, every week]



EVA: secondary structure

Method    Q3     Q3 Claim        SOV   Info   CorrH   CorrE   CorrL   Class   BAD
PROF      76.0   -               72    0.35   0.67    0.63    0.55    82      2.7
PSIPRED   76.0   76.5-78.3 (M)   72    0.36   0.65    0.62    0.55    78      2.8
SSpro     76.0   76              71    0.35   0.67    0.63    0.56    83      2.8
JPred2    75.0   76.4            69    0.34   0.65    0.60    0.54    76      2.6
PHDpsi    75.0   -               71    0.33   0.65    0.60    0.54    81      3.0
PHD       71.4   71.6            68    0.28   0.59    0.58    0.49    77      4.3

Copenhagen: 78 (N), 77.8
Wang/Yuan: 53 (O)

76%
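The CorrH, CorrE and CorrL columns are per-state correlation scores. A minimal sketch of how such a score can be computed, assuming they are Matthews correlation coefficients (the table itself does not define them); the strings are made up and the function name is hypothetical.

```python
import math


def state_correlation(observed, predicted, state):
    """Matthews correlation for one state (e.g. 'H'), treating it as one-vs-rest.

    ASSUMPTION: this is what the Corr columns report; the slide does not say so.
    """
    tp = fp = fn = tn = 0
    for o, p in zip(observed, predicted):
        if o == state and p == state:
            tp += 1
        elif o != state and p == state:
            fp += 1
        elif o == state and p != state:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined (a margin is empty).
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


obs = "HHHHLLEE"   # toy observed states
pred = "HHHLLLEE"  # toy predicted states
for s in "HEL":
    print(s, round(state_correlation(obs, pred, s), 2))
```

Unlike Q3, a per-state correlation is not inflated by simply over-predicting the most frequent state.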



Secondary structure predictions differ



Accuracy varies for proteins!

[Figure: percentage of all 150 proteins (y-axis, 0-25) vs. percentage of correctly predicted residues per protein (x-axis, 30-100); curves: PSIPRED, SSpro, PROF, PHDpsi, JPred2, PHD]



Some proteins predicted better

[Figure: accuracy per protein (Q3) (y-axis, 30-90) vs. cumulative percentage of proteins (x-axis, 0-100)]



Averaging over many methods not always good!

[Figure: deviation of method from average (y-axis, -30 to 30) vs. per-protein prediction accuracy averaged over 6 methods (x-axis, 55-95); series: ave-PSIPRED, ave-SSpro, ave-PROF, ave-PHDpsi, ave-JPred2, ave-PHD]



Reliability correlates with accuracy!

[Figure: percentage of correctly predicted residues (y-axis, 70-100) vs. percentage of residues predicted (x-axis, 0-100); curves: JPred2, PHD, PROF, PSIPRED]
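A curve of this kind can be produced by keeping only residues predicted at or above a reliability threshold and recording, for each threshold, how many residues remain (coverage) and how many of those are correct (accuracy). A minimal sketch on made-up data; it assumes a reliability index on a 0-9 scale, and the function name is hypothetical.

```python
def accuracy_vs_coverage(residues):
    """residues: list of (reliability_index, is_correct) pairs, index assumed 0..9.

    Returns a list of (threshold, coverage %, accuracy %) for every threshold
    that still selects at least one residue.
    """
    total = len(residues)
    rows = []
    if total == 0:
        return rows
    for threshold in range(10):
        selected = [ok for ri, ok in residues if ri >= threshold]
        if not selected:
            break
        coverage = 100.0 * len(selected) / total
        accuracy = 100.0 * sum(selected) / len(selected)
        rows.append((threshold, coverage, accuracy))
    return rows


# Toy data (NOT from the lecture): (reliability index, prediction correct?)
toy = [(9, True), (8, True), (7, True), (5, True), (5, False),
       (3, True), (2, False), (1, False)]
rows = accuracy_vs_coverage(toy)
for threshold, coverage, accuracy in rows:
    print(f"RI >= {threshold}: {coverage:5.1f}% of residues, {accuracy:5.1f}% correct")
```

In the toy data, as in the figure, accuracy rises as the threshold is raised and fewer residues are kept.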



Secondary structure prediction 2005

history
1st generation   1970s   50-55%   55
2nd generation   1980s   55-62%   62   + 7
3rd generation   1992    70-72%   72   +10
                 2000    > 76%    76   + 4
                 2011    > 78%    78   + 2

what improves (2002)?
• database growth +3
• PSI-BLAST +0.5
• new training +1
• ‘clever method’ +1



Quo vadis?

1980: 55% simple
1990: 60% less simple
1993: 70% evolution
2000: 76% more evolution
2011: 78% even more evolution

what is the limit?



• 88% for proteins of similar structure
• 80% for 1/5th of proteins with families > 100

missing:
• better definition of secondary structure
• including long-range interactions
• structural switches
• chameleon / folding



Conclusion: secondary structure prediction

• big gain through using evolutionary information
• are we going to reach above 80%? How high?
• continuous secondary structure
• better methods
• other features
• use secondary structure: ASP

M Young, Kirshenbaum, Dill, S Highsmith: Predicting conformational switches in proteins. Protein Sci 1999, 8:1752-1764.




• What is the best alignment? that depends
• Limit of prediction accuracy reached? no
• Comparative modeling or de novo? specialist is best
• Ultimate rôle in structure prediction (1D-3D)?
• Will secondary structure and 3D prediction merge completely?


