Post on 02-Jan-2016
Burkhard Rost (Columbia New York)
Some gory details of protein secondary structure prediction
Burkhard Rost
CUBIC Columbia University
rost@columbia.edu
http://www.columbia.edu/~rost
http://cubic.bioc.columbia.edu/
FoRc
HoMo
1D
….the art of being humble
Goal of secondary structure prediction
LEDKSPDHNPTGID
AKGKPMDRNFTGRNHPPKDSS
AAQVKDALTK
LEQWGTLAQL
RAIWEQELTDFPEFLTMMARQETWLGWLTI
helix strand
loop
LAVIGVLMKW
FVFLMIE
KIYHKLT
DIRVGLTYYIAQ
VNTFVGTFAAVAHAL
Secondary structure predictions of 1. and 2. generation
• single residues (1. generation)
– Chou-Fasman, GOR: 1957-70/80, 50-55% accuracy
• segments (2. generation)
– GORIII: 1986-92, 55-60% accuracy
• problems
– < 100%; they said: 65% max
– < 40%; they said: strand non-local
– short segments
Helix formation is local
residues i and i+3
THYROID hormone receptor (2nll)
β-sheet formation is NOT local
Erabutoxin (3ebx)
Problems of secondary structure predictions (before 1994)
SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH
Simple neural network
$out_0 = in_1 J_{11} + in_2 J_{12}$
$out = \tanh(out_0)$
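As a minimal sketch of the two-input unit on this slide: the weighted sum of the inputs is passed through tanh. The function name and the junction values used in the call are illustrative assumptions, not taken from the slides.

```python
import math

def neuron_output(in1, in2, j11, j12):
    """Two-input unit: out0 = in1*J11 + in2*J12, out = tanh(out0)."""
    out0 = in1 * j11 + in2 * j12
    return math.tanh(out0)

# Hypothetical junction values, chosen only for illustration.
print(neuron_output(1.0, 0.0, j11=1.0, j12=1.0))
```

Because tanh saturates, the output always stays between -1 and 1, whatever the size of the junctions.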
Training a neural network 1
Training a neural network 2
Error = $(out_{net} - out_{want})^2$
[Figure: curve plotted against in, from -2 to 2, with values between -1 and 1]
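The squared error above can be reduced step by step by moving each junction against the gradient of the error. The following is a minimal sketch for a single tanh unit; the learning rate, the starting junctions and the training example are assumptions made for illustration, and the slides do not state which update rule was actually used.

```python
import math

def train_step(junctions, inputs, out_want, rate=0.5):
    """One gradient-descent step on error = (out_net - out_want)^2."""
    out_net = math.tanh(sum(x * j for x, j in zip(inputs, junctions)))
    error = (out_net - out_want) ** 2
    # d(error)/d(J_k) = 2 * (out_net - out_want) * (1 - out_net^2) * in_k
    common = 2.0 * (out_net - out_want) * (1.0 - out_net ** 2)
    updated = [j - rate * common * x for x, j in zip(inputs, junctions)]
    return updated, error

# Hypothetical training example: both inputs 1, desired output 0.9.
junctions = [0.0, 0.0]
errors = []
for _ in range(500):
    junctions, err = train_step(junctions, inputs=[1.0, 1.0], out_want=0.9)
    errors.append(err)
print(errors[0], errors[-1])
```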
Training a neural network 3
[Figure: Error as a function of the junctions]
Training a neural network 4
[Figure: network diagrams with example input and junction values, next to a curve plotted against in from -2 to 2]
Neural networks classify points
Simple neural network with hidden layer
$out_i = f\left( \sum_j J^{2}_{ij} \cdot f\left( \sum_k J^{1}_{jk} \cdot in_k \right) \right)$
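The nested sum for the network with one hidden layer can be sketched as two passes over the junction matrices. The choice of f = tanh follows the earlier single-unit slide; the matrix layout (one row per receiving unit) and the example values are assumptions made for illustration.

```python
import math

def forward(inputs, j1, j2, f=math.tanh):
    """out_i = f(sum_j J2[i][j] * f(sum_k J1[j][k] * in_k))."""
    hidden = [f(sum(w * x for w, x in zip(row, inputs))) for row in j1]
    return [f(sum(w * h for w, h in zip(row, hidden))) for row in j2]

# Hypothetical junctions: two inputs, two hidden units, one output unit.
j1 = [[1.0, 0.0], [0.0, 1.0]]
j2 = [[1.0, 1.0]]
print(forward([0.5, -0.5], j1, j2))
```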
Neural network for secondary structure
Input units per residue: ACDEFGHIKLMNPQRSTVWY.
Output units: H, E, L
Sequence window with observed states: D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E)
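A minimal sketch of how such a sequence window could be turned into network input: each residue switches on exactly one of 21 units (the 20 amino acids plus the '.' spacer shown on the slide). The window width of 13 matches the 13 residues listed; padding positions beyond the sequence ends with the spacer is an assumption.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."  # 20 amino acids plus '.' spacer

def encode_window(sequence, centre, width=13):
    """Binary input vector: one block of 21 units per window position."""
    half = width // 2
    bits = []
    for pos in range(centre - half, centre + half + 1):
        # Assumption: positions outside the sequence are coded as the spacer.
        residue = sequence[pos] if 0 <= pos < len(sequence) else "."
        bits.extend(1 if residue == a else 0 for a in ALPHABET)
    return bits

window = encode_window("DRQGFVPAAYVKK", centre=6)
print(len(window), sum(window))
```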
Secondary structure predictions of 1. and 2. generation
• single residues (1. generation)
– Chou-Fasman, GOR: 1957-70/80, 50-55% accuracy
• segments (2. generation)
– GORIII: 1986-92, 55-60% accuracy
• problems
– < 100%; they said: 65% max
– < 40%; they said: strand non-local
– short segments
Balanced training
normal training: $E = \sum_{\mu=\alpha,\beta,L} \sum_i \left( o_i^{\mu} - d_i^{\mu} \right)^2$
balanced training: $E^{\mu} = \sum_i \left( o_i^{\mu} - d_i^{\mu} \right)^2$, $\Delta J^{\mu} \propto - \partial E^{\mu}\{J\} / \partial J$
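One way to read balanced training is that examples of helix, strand and loop are presented equally often (33:33:33) instead of in their data bank proportions. The sketch below draws training examples that way; it is an assumption about how the balancing could be realised, not the procedure stated on the slides, and the example data are hypothetical.

```python
import random
from collections import Counter

def balanced_stream(examples, n_draws, seed=0):
    """Yield (window, state) pairs so that each state is drawn equally often."""
    rng = random.Random(seed)
    by_state = {}
    for window, state in examples:
        by_state.setdefault(state, []).append((window, state))
    states = sorted(by_state)
    for i in range(n_draws):
        state = states[i % len(states)]   # cycle through the states in turn
        yield rng.choice(by_state[state])

# Hypothetical, deliberately unbalanced training set.
examples = [("w%d" % i, s) for i, s in enumerate("H" + "EE" + "LLLLLLL")]
drawn = Counter(state for _, state in balanced_stream(examples, 300))
print(drawn)
```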
Overall accuracy by training method (columns on slide: method, helix, strand, other, overall accuracy):
unbalanced: 62%
comparison: data bank distribution
comparison: 33:33:33
balanced: 60%
PHDsec: structure-to-structure network
H, E, L
V (E), P (E), A (H)
Better prediction of segment lengths
[Figure, panels A and B: distributions over segment length for DSSP and PHD, and for helix, strand, loop; x-axis: segment length]
Evolution has it!
[Figure: plot against the number of residues aligned (0 to 250), vertical axis 0 to 100; labels: "Sequence identity implies structural similarity!" and "Don't know region"]
           1                                                   50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF
[Figure: PHD network. The alignment is converted into a profile table that feeds the input layer (s0); junctions J1 connect to the first or hidden layer (s1), junctions J2 to the second or output layer (s2) with units H, E, L. Pick maximal unit => current prediction.]
Protein:
:GYIY DPEDGDPDDGVNP GTDF:
Alignments:
:GYIY DPAVGDPDNGVEP GTEF:
:GYIY DPEVGDPTQNIPP GTKF:
:GYEY DPAEGDPDNGVKP GTSF:
:GYEY DPAEGDPDNGVKP GTAF:
Profile table (one row per residue of the protein; columns GSAPD NTEKQ CVHIR LMYFW):
   G S A P D  N T E K Q  C V H I R  L M Y F W
G  5 . . . .  . . . . .  . . . . .  . . . . .
Y  . . . . .  . . . . .  . . . . .  . . 5 . .
I  . . . . .  . . 2 . .  . . . 3 .  . . . . .
Y  . . . . .  . . . . .  . . . . .  . . 5 . .
D  . . . . 5  . . . . .  . . . . .  . . . . .
P  . . . 5 .  . . . . .  . . . . .  . . . . .
E  . . 3 . .  . . 2 . .  . . . . .  . . . . .
D  . . . . 1  . . 2 . .  . 2 . . .  . . . . .
G  5 . . . .  . . . . .  . . . . .  . . . . .
D  . . . . 5  . . . . .  . . . . .  . . . . .
P  . . . 5 .  . . . . .  . . . . .  . . . . .
D  . . . . 4  . 1 . . .  . . . . .  . . . . .
D  . . . . 1  3 . . . 1  . . . . .  . . . . .
G  4 . . . .  1 . . . .  . . . . .  . . . . .
V  . . . . .  . . . . .  . 4 . 1 .  . . . . .
N  . . . 1 .  1 . 1 2 .  . . . . .  . . . . .
P  . . . 5 .  . . . . .  . . . . .  . . . . .
G  5 . . . .  . . . . .  . . . . .  . . . . .
T  . . . . .  . 5 . . .  . . . . .  . . . . .
D  . 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
F  . . . . .  . . . . .  . . . . .  . . . 5 .
Each input block corresponds to the 21*3 bits coding for the profile of one residue.
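The profile table on this slide can be reproduced by counting, for each position, how often each amino acid occurs across the protein and its four aligned sequences. The sketch below does only that counting; the function name is illustrative, and how PHD turns the counts into the 21*3 input bits is not shown here.

```python
from collections import Counter

COLUMNS = "GSAPDNTEKQCVHIRLMYFW"  # column order of the profile table on the slide

def profile(aligned):
    """One Counter per alignment position, counting residues over all sequences."""
    length = len(aligned[0])
    return [Counter(seq[i] for seq in aligned) for i in range(length)]

# Protein and alignment from the slide, written without the segment breaks.
sequences = [
    "GYIYDPEDGDPDDGVNPGTDF",  # protein
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]
table = profile(sequences)
for residue, counts in zip(sequences[0], table):
    print(residue, " ".join(str(counts.get(a, ".")) for a in COLUMNS))
```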
[Figure: from the sequence data bank to the PHD input. Steps shown: BLAST, MaxHom, filter MaxHom (sequence identity threshold between 25% and 100%, plotted against the number of residues aligned, with a mark at 80), extract alignment, PHD]
PHDsec
[Figure: two-level network. First level: sequence-to-structure; second level: structure-to-structure. Each level has an input layer, a hidden layer and an output layer with units H, L, E (example output: H 0.5, L 0.1, E 0.4). Unit counts shown: 21+3 and 4+1 per residue; 20, 4, 4, 4 for the global input.]
input local in sequence: local alignment of 13 adjacent residues
:::
AAA
AA.
LLL
LII
AAG
CCS
GVV
:::
A    C   L    I   G   S   V   ins  del  cons
100  0   0    0   0   0   0   0    0    1.17
100  0   0    0   0   0   0   33   0    0.42
0    0   100  0   0   0   0   0    33   0.92
0    0   33   66  0   0   0   0    0    0.74
66   0   0    0   33  0   0   0    0    1.17
0    66  0    0   0   33  0   0    0    0.74
0    0   0    33  0   0   66  0    0    0.48
input global in sequence: global statistics over the whole protein
• % AA: percentage of each amino acid in protein
• Length: length of protein (≤60, ≤120, ≤240, >240)
• ∆ N-term: distance: centre, N-term (≤40,≤30,≤20,≤10)
• ∆ C-term: distance: centre, C-term (≤40,≤30,≤20,≤10)
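The global input listed on this slide (20 composition units plus 4 units each for length, distance to the N-terminus and distance to the C-terminus) could be assembled as in the sketch below. The thresholds are the ones on the slide; how each group of four units is coded (one bin for length, cumulative thresholds for the distances) is an assumption, as are the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """20 units: fraction of each amino acid in the protein."""
    return [sequence.count(a) / len(sequence) for a in AMINO_ACIDS]

def length_units(n):
    """4 units for the bins <=60, <=120, <=240, >240 (one-bin coding assumed)."""
    bins = [n <= 60, 60 < n <= 120, 120 < n <= 240, n > 240]
    return [int(b) for b in bins]

def terminus_units(distance):
    """4 units for the thresholds <=40, <=30, <=20, <=10 (cumulative coding assumed)."""
    return [int(distance <= t) for t in (40, 30, 20, 10)]

def global_input(sequence, centre):
    """Global features for the window centred on position `centre` (0-based)."""
    to_n_term = centre
    to_c_term = len(sequence) - 1 - centre
    return (composition(sequence) + length_units(len(sequence))
            + terminus_units(to_n_term) + terminus_units(to_c_term))

sh3 = "KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD"
features = global_input(sh3, centre=15)
print(len(features))
```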
Spectrin homology domain (SH3)
HEADER CYTOSKELETON
COMPND ALPHA SPECTRIN (SH3 DOMAIN)
SOURCE CHICKEN (GALLUS GALLUS) BRAIN
AUTHOR M.NOBLE,R.PAUPTIT,A.MUSACCHIO,M.SARASTE
59% 65% 72%
Prediction accuracy varies!
[Figure: histogram of the number of protein chains against per-residue accuracy (Q3); <Q3>=72.3% ; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
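Per-residue accuracy Q3 is the percentage of residues predicted in the correct one of the three states (helix, strand, other). A minimal sketch, using hypothetical observed and predicted strings rather than data from the slides:

```python
def q3(observed, predicted):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

# Hypothetical 12-residue example with two wrong residues.
observed = "HHHHEEEELLLL"
predicted = "HHHEEEEELLLH"
print(round(q3(observed, predicted), 1))
```

Averaging this value over many chains gives the <Q3> quoted on the slide, and the spread of the per-chain values gives the sigma.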
Why so bad?
1evw:A
      ....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,.
1evwA ALTNAQILAVIDSWEETVGQFPVITHHVPLGGGLQGTLHCYEIPLAAPYGVGFAKNGPTRWQYKRTINQVVHRWGS
DSSP HHHHHHHHHHHHHHHH EEEEEEEEE EEEEEEEEE EEEEE EEEEEEE EEEEE
JPred2 EEEEEE EEEEEEEE EEEEE E EEEHHHHEEEEEE
PHD EEEEEEE HHH EEEEEEEE EEEEEEEEE EEE EEEEEEEEEEEEE
PHDpsi EEEEEEE HHH EEEEEEE EEEEEEEE EEEE HHHHHE EEEEEE
PROFsec EEEEEE HHHH EEEEEE EEEEEEEE EE HHHHHHHHHEEEE
Prof_king EEEEEEE HHHH EEEEE EEEEEEE E EEEEEHHHHHHHH
PSIPRED EEEEEEE HHHHH EEEE EEEEEEE HHHHHHHHHHHHHH
SAM T99sec HHHHHHHHHHHHH EEE EEEEE E EEEEEEHHEEEE
SSpro HHHHHHHHH HHHHH EEEEE EEEE HH EEEEE HHHHEEEH

      ...8....,....9....,....10...,....11...,....12...,....13...,....14...,....15...,....16.
1evwA HTVPFLLEPDNINGKTCTASHLCHNTRCHNPLHLCWESADDNKGRNWCPGPNGGCVHAVVCLRQGPLYGPGATVAGPQQRGSHFVV
DSSP HHH EE EEEEEEE E HHHEEEEEHHHHHHHHH E
JPred2 EEEEE EEEEE EEE EEEEEEEE EEE
PHD EEEEEE EEEEE EEEEEEE EEEEEEEEE EEE EEEEE
PHDpsi EEEE EEEEEE EEEEE EEEEEEEEEE EEE EEEEE
PROFsec EEE EEEEEE EEEEEE EEEEEEEEE EE EEE
Prof_king EEEEEEE EEEEEEE EEEEE EEEEEEE EE EEEE
PSIPRED EEE EE HHH HHHHHHHH HHHHHHHHH HHH
SAM T99sec EEEEEE E EEEEEEE E EE
SSpro HHE H EEEE EEEEEEE EE EE
Stronger predictions more accurate!
[Figure: Q3 per protein against the reliability index averaged over protein (x-axis 3 to 9); fit line as printed: Q3fit = 21 + 8.7 * Q3. Second panel: number of protein chains against per-residue accuracy (Q3); <Q3>=72.3% ; sigma=10.5%; labelled chains: 1spf, 1bct, 1stu, 3ifm, 1psm]
Correct prediction of correctly predicted residues
[Figure: accuracy (70 to 100) against percentage of residues predicted (0 to 100) for PHDsec, PHDacc and PHDhtm; each curve runs from RI=9 to RI=0, with further marks at RI=4 and 7]
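The curves on this slide trade coverage for accuracy: keeping only residues predicted with a reliability index at or above a threshold leaves fewer residues, but those are more often correct. A minimal sketch of that calculation, with hypothetical per-residue data; the reliability index itself is taken as given here.

```python
def accuracy_at_threshold(residues, min_ri):
    """residues: list of (reliability_index, is_correct) pairs.

    Returns (percentage of residues predicted, accuracy on those residues).
    """
    kept = [ok for ri, ok in residues if ri >= min_ri]
    coverage = 100.0 * len(kept) / len(residues)
    accuracy = 100.0 * sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy

# Hypothetical residues: (reliability index 0..9, prediction correct?).
residues = [(9, True), (9, True), (7, True), (5, False),
            (4, True), (2, False), (1, True), (0, False)]
for threshold in (0, 4, 7):
    print(threshold, accuracy_at_threshold(residues, threshold))
```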
BAD errors are frequent!
[Figure: histogram of BAD errors (H for E, or E for H) per protein chain, with the cumulative percentage of protein chains; <BAD>=4.0% ; sigma=5.9%]
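A BAD error, as defined on this slide, is a helix predicted as strand or a strand predicted as helix. The sketch below counts such confusions; expressing them as a percentage of all residues in the chain is an assumption about the normalisation, and the example strings are hypothetical.

```python
BAD_PAIRS = {("H", "E"), ("E", "H")}

def bad_error(observed, predicted):
    """Percentage of residues with helix and strand confused (H for E, or E for H)."""
    bad = sum((o, p) in BAD_PAIRS for o, p in zip(observed, predicted))
    return 100.0 * bad / len(observed)

# Hypothetical example: one H predicted as E, one L predicted as H (not BAD).
print(round(bad_error("HHHHEEEELLLL", "HHHEEEEELLLH"), 1))
```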
False prediction for engineered proteins!
GB1: IgG-binding domain of protein G (CHAMELEON)
Kim & Berg, Nature, 366, 267-270, 1993
....,....1....,....2....,....3....,....4....,....5....,..
AA TTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
DSSP EEEEEEEEEEEEEEEEHHHHHHHHHHHHHHHHHEEEEEEEEEEEEEEE
PHD 30 EEEEEEEEEHHHHHHHHHHHHHHEEEEEEEEEEEEEE
PHD no EEEEEEEEEEEHHHHHHHHHHHHHHHHEEEEEEEEEEE
AATAEKVFKQY
AWTVEKAFKTF
PHD 30 EEEEEEEEEEEEEHHHHHHHHHEEEEEEEEEEEEE
PHD no EEEEEEEEEEEEHHHHHHHHHHHHHHHEEEEEEEEEEE
EWTYDDATKTF
AWTVEKAFKTF
PHD 30 EEEEEEEEEEHHHHHHHHHHHHHHHHEEEEEEEEEEE
PHD no EEEEEEEEEHHHHHHHHHHHHHHHHHHHHHHHEEEEE
AWTVEKAFKTF
HHHHH
PHDsec: the un-g(l)ory details
• average accuracy > 72% (helix, strand, other)
• 72% is average over distribution: sigma ≈ 10%
• stronger predictions more accurate
• WARNING: reliability index almost factor 2 too large for single sequences
Details PHDsec: Multiple alignment
• single sequences => accuracy clearly lower
id    nali  Q3sec  Q2acc
AA                        KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
OBS                       EEEE E E EEEEEE EEEEEE EEEEEEHHHEEEE
30 N  26    70     77     EEEEEEE EEE EEEEE EEEE EE EEE
self  1     63     72     EEEEEEE EEEE EEEEE EEEEEE HHHHH
Secondary structure prediction
• Limit of prediction accuracy reached?
• How does it complement other methods?
• Ultimate rôle in structure prediction (1D-3D)?
• Better to use "pure" secondary structure prediction methods, or to use 3D methods and read the secondary structure off the 3D model?
• Conversely, are 3D predictors making optimal use of secondary structure predictions?
• Will secondary structure and 3D prediction merge completely?
Secondary structure prediction 2000
• history
– 1st generation 50-55%
– 2nd generation 55-62%
– 3rd generation: 1992 70-72%, 2000 > 76%
• what improves?
– database growth +3
– PSI-BLAST +0.5
– new training +1
– ‘clever method’ +1
• limit?
– max 88% -> 12% to go
– 1/5 of proteins with families of more than 100 proteins -> >80%
– and from there?
Prediction of protein secondary structure
• 1980: 55% simple
• 1990: 60% less simple
• 1993: 70% evolution
• 2000: 76% more evolution
• what is the limit?
– 88% for proteins of similar structure
– 80% for 1/5th of proteins with families > 100
• missing through: better definition of secondary structure; including long-range interactions
– structural switches
– chameleon / folding
CAFASP statistics
• 29 proteins not similar to known PDB
– T0086,T0087,T0090,T0091,T0092,T0094,T0095,T0096,T0097,T0098,T0101,T0102,T0104,T0105,T0106,T0107,T0108,T0109,T0110,T0114,T0115,T0116,T0117,T0118,T0120,T0124,T0125,T0126,T0127
• 2 proteins with PSI-BLAST homologue
– T0089,T0103
• 9 proteins with trivial homologue to PDB
– T0099,T0100,T0111,T0112,T0113,T0121,T0122,T0123,T0128
CAFASP sec unique
Nprot Rank Method Q3 ERRsigQ3 SOV info class
11 1 PSIpred 77.6 +/-2.6 71.1 0.38 81.8
11 1 SAM-T99 78.9 +/-2.3 75.2 0.39 81.8
11 1 SSpro 76.2 +/-3.1 68.7 0.34 81.8
11 2 Isites 72.9 +/-2.2 63.5 0.31 72.7
11 2 Pred2ary 73.4 +/-3.5 61.4 0.30 90.9
11 2 PROF 73.7 +/-2.6 65.8 0.32 72.7
11 3 PSSP 68.9 +/-2.8 62.5 0.26 72.7
29 1 SAM-T99 78.3 +/-1.6 74.3 0.39 75.9
29 2 SSpro 76.3 +/-2.0 71.0 0.36 79.3
CAFASP sec homologous
Nprot Rank Method Q3 ERRsigQ3 SOV info class
9 1 PSIpred 79.6 +/-3.2 76.9 0.44 88.9
9 1 SAM-T99 78.5 +/-3.0 74.5 0.41 100.0
9 1 SSpro 80.4 +/-2.6 79.6 0.46 88.9
9 2 Pred2ary 74.1 +/-1.7 70.1 0.32 77.8
CAFASP concept
• Targets & Non-targets
– comparative modelling 85% > all current methods
• Never compare methods on different proteins
• Never rank when too few proteins
• (Never show numbers for one protein between different proteins)
What is significant
[Figure: average accuracy for one draw (66 to 80) against the number of random draws (0 to 30); panel A: 29 different proteins; panel B: 11 identical proteins]
Method      Rank A/29   Rank B/11
Prof_king   1-7         1-6
SSpro       1-8         2-8
SAMt99sec   1-7         1-7
PROFsec     1-8         1-5
PSIPRED     2-8         1-7
JPred2      1-8         1-7
PHD         3-8         4-8
PHDpsi      2-8         2-7
Rank only if significant
• e.g. M1 = 75, M2 = 73
• say 16 proteins
• rule-of-thumb: a difference is significant only if it exceeds sigma / sqrt(Number of proteins)
• -> 10/4 = 2.5
-> M1 and M2 cannot be distinguished
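The rule of thumb can be checked with a few lines of arithmetic. The sketch below uses the numbers from the slide (M1 = 75, M2 = 73, 16 proteins) together with sigma = 10, the per-protein spread quoted earlier in the talk; the function name is illustrative.

```python
import math

def distinguishable(m1, m2, sigma, n_proteins):
    """Compare the difference of two averages with sigma / sqrt(n_proteins)."""
    standard_error = sigma / math.sqrt(n_proteins)
    return abs(m1 - m2) > standard_error, standard_error

print(distinguishable(75, 73, sigma=10, n_proteins=16))
```

With 16 proteins the difference of 2 percentage points is smaller than 2.5, so the two methods cannot be ranked; the same difference would only exceed the threshold with more than 25 proteins.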
EVA: automatic continuous EVAluation of structure prediction
[Figure: EVA workflow. Components shown: get PDB; analyse: pairwise BLAST; analyse: PSI-BLAST, MaxHom, sequence-unique sets; send sequences to prediction servers (secondary structure; inter-residue contacts / distances; comparative modelling; fold recognition); collect HTML; update central pages; EVA-DB; compile results (one protein: PDB vs prediction; week summary); static pages; satellites/mirrors; user: browse, query, ftp. Update cycles: every day, every week]
EVA: automatic continuous EVAluation of structure prediction
• statistics: 31 weeks -> 1549 new structures, 352 new sequence-unique chains (of 2200)
• categories:
– secondary structure prediction (7 methods)
– comparative modelling (4)
– fold recognition (7)
– contact prediction (4)
EVA: secondary structure
• MAJOR lessons from EVA:
– no point comparing apples and oranges
– no point comparing < 20 apples
• EVA team:
– CUBIC, Columbia: Volker Eyrich, Dariusz Przybylski, Burkhard Rost
– Rockefeller: Marc Marti-Renom, Andras Fiser, Andrej Sali
– Madrid: Florencio Pazos, Alfonso Valencia
• URL:
– http://cubic.bioc.columbia.edu/eva/
– http://pipe.rockefeller.edu/~eva/
– http://montblanc.cnb.uam.es/eva/
EVA: secondary structure
Method      Q3    Q3 Claim     SOV  Info  CorrH  CorrE  CorrL  Class  BAD
PROF        76.0               72   0.35  0.67   0.63   0.55   82     2.7
PSIPRED     76.0  76.5-78.3 M  72   0.36  0.65   0.62   0.55   78     2.8
SSpro       76.0  76           71   0.35  0.67   0.63   0.56   83     2.8
JPred2      75.0  76.4         69   0.34  0.65   0.60   0.54   76     2.6
PHDpsi      75.0               71   0.33  0.65   0.60   0.54   81     3.0
PHD         71.4  71.6         68   0.28  0.59   0.58   0.49   77     4.3
Copenhagen 78 N 77.8
Wang/Yuan 53 O
76%
Accuracy varies for proteins!
[Figure: distribution of the percentage of correctly predicted residues per protein (30 to 100) for PSIPRED, SSpro, PROF, PHDpsi, JPred2, PHD]
Averaging over many methods not always a good idea!
[Figure: ave-PSIPRED, ave-SSpro, ave-PROF, ave-PHDpsi, ave-JPred2, ave-PHD (-30 to 30) against per-protein prediction accuracy averaged over 6 methods (55 to 95)]
Some proteins predicted better
[Figure: accuracy (30 to 90) against cumulative percentage of proteins (0 to 100)]
Reliability correlates with accuracy!
[Figure: accuracy (70 to 100) against percentage of residues predicted (0 to 100) for JPred2, PHD, PROF, PSIPRED]
Conclusion
• big gain through using evolutionary information
• are we going to reach above 80%? How high?
• continuous secondary structure
• better methods
• other features
• use secondary structure: ASP
Young M, Kirshenbaum K, Dill KA, Highsmith S: Predicting conformational switches in proteins. Protein Sci 1999, 8:1752-1764.
Availability of methods
• email: PredictProtein@columbia.edu
– subject: HELP
– file:
Email address
options
# protein name
SEQUENCE
• WWW: http://cubic.bioc.columbia.edu/predictprotein/
• META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html
• EVA: http://cubic.bioc.columbia.edu/eva
• CUBIC: http://cubic.bioc.columbia.edu/