7/16/16
Protein Sequence Comparison and Protein Evolution
(What BLAST does / Why BLAST works)
William R. Pearson
faculty.virginia.edu/wrpearson
Effective Similarity Searching in Practice
1. Always search protein databases (possibly with translated DNA)
2. Use E()-values, not percent identity, to infer homology – E() < 0.001 is significant in a single search
3. Search smaller (comprehensive) databases
4. Change the scoring matrix for:
– short sequences (exons, reads)
– short evolutionary distances (mammals, vertebrates, α-proteobacteria)
– high identity (>50% alignments) to reduce over-extension
5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss
Homology Fundamentals
• Homologous sequences are unexpectedly similar (excess similarity)
– excess compared to what? Random similarity (similarity by chance, measured by the E()-value)
• Non-significant similarity IS NOT evidence for non-homology
– significant similarity to a protein of different structure shows non-homology
• Homology at the (entire) sequence level is different from homology at the residue level
– sequence homology is inferred from statistics
– residue homology REQUIRES sequence homology
Establishing homology from statistically significant similarity
Why BLAST works
• For most proteins, homologs are easily found over long evolutionary distances (500 My – 2 By) using standard approaches (BLAST, FASTA)
• Difficult for distant relationships or very short domains
• Most default search parameters are optimized for distant relationships and work well
• Not every aligned residue is homologous
– but with significant similarity, there is a homologous domain
This talk is not about:
• Alignment
– Alignment quality may be more sensitive to parameter choice
– Multiple sequences for biologically accurate alignments
• Inferring Protein Function
– Homology (common ancestry) implies common structure (guaranteed), not necessarily common function
– Homologs can have different functions
– Non-homologs can have similar (or identical) functions
• The best sequences for building trees
– Protein sequences are clearly best for establishing homology, but DNA sequences may be better for resolving recent divergence
Homologues share a common ancestor
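[Figure: evolutionary timeline, time axis in billions of years (-4.0 to -0.1), running from chemical evolution and self-replicating systems through the prokaryote/eukaryote, plant/animal, and vertebrate/arthropod divergences. Taxa shown: human, horse, fish, insect, worm, wheat, yeast, E. coli. Numeric labels on the figure: 4,289; 6,530; 18,000.]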
When do we infer homology?
Bovine trypsin (5ptp)
Structure: E() < 10^-23; RMSD 0.0 Å
Sequence: E() < 10^-84; 100%; 223/223

S. griseus trypsin (1sgt)
Structure: E() < 10^-14; RMSD 1.6 Å
Sequence: E() < 10^-19; 36%; 226/223

S. griseus protease A (2sga)
Structure: E() < 10^-4; RMSD 2.6 Å
Sequence: E() < 2.6; 25%; 199/181

Homology <=> structural similarity? sequence similarity
When can we infer non-homology?
Bovine trypsin (5ptp)
Structure: E() < 10^-23; RMSD 0.0 Å
Sequence: E() < 10^-84; 100%; 223/223

Subtilisin (1sbt)
Structure: E() > 100
Sequence: E() < 280; 25%; 159/275

Cytochrome c4 (1etp)
Structure: E() > 100
Sequence: E() < 5.5; 23%; 171/190

Non-homologous proteins have different structures
Homology and EXCESS similarity
• Why are proteins (sequences or structures) similar – alternative hypotheses?
1. common ancestry – homology
2. convergence from independent origins due to functional constraints
• We infer homology from excess similarity
– if no excess, could happen by chance (independently)
• To recognize excess similarity we need to know what random similarity looks like
z-sc obs E()
< 20 9 0:=
22 1 0:= one = represents 23 library sequences
24 2 0:=
26 1 0:=
28 3 3:*
30 8 18:*
32 49 71:===*
34 145 192:======= *
36 342 395:=============== *
38 567 653:========================= *
40 882 911:=======================================*
42 1120 1114:================================================*
44 1274 1229:=====================================================*==
46 1367 1251:======================================================*=====
48 1299 1198:====================================================*====
50 1140 1093:===============================================*==
52 1049 961:=========================================*====
54 869 821:===================================*==
56 607 686:=========================== *
58 471 563:===================== *
60 419 456:===================*
62 336 366:===============*
64 263 291:============*
66 214 230:=========*
68 177 181:=======*
70 143 142:======*
72 124 111:====*=
74 85 86:===*
76 63 67:==*
78 47 52:==*
80 45 41:=*
82 33 31:=*
84 29 25:=*
86 20 19:*
88 19 15:* inset = represents 1 library sequences
90 16 11:*
92 18 9:* :========*=========
94 9 7:* :======*==
96 7 5:* :====*==
98 4 4:* :===*
100 13 3:* :==*==========
102 5 2:* :=*===
104 2 2:* :=*
106 5 1:* :*====
108 4 1:* :*===
110 2 1:* :*=
112 5 1:* :*====
114 6 1:* :*=====
116 2 0:= *==
118 1 0:= *=
>120 30 0:== *==============================
Query: atp6_human.aa ATP synthase a chain - 226 aa
Library: PIR1 Annotated (rel. 66)
5190103 residues in 13351 sequences
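The idea of "excess" in the tail of this histogram can be made concrete with a short sketch. It assumes the second column is the observed count of library sequences per z-score bin and the third column is the count expected from the fitted random-score distribution; the variable and function names are illustrative, and the ">120" bin is stored as 120.

```python
# Tail of the histogram above: (z-score bin, observed, expected).
# Values are copied from the output; ">120" is stored as 120.
TAIL = [
    (100, 13, 3), (102, 5, 2), (104, 2, 2), (106, 5, 1),
    (108, 4, 1), (110, 2, 1), (112, 5, 1), (114, 6, 1),
    (116, 2, 0), (118, 1, 0), (120, 30, 0),
]

def excess(bins, z_min):
    """Return (observed, expected, observed - expected) for bins with z >= z_min."""
    observed = sum(obs for z, obs, exp in bins if z >= z_min)
    expected = sum(exp for z, obs, exp in bins if z >= z_min)
    return observed, expected, observed - expected

tail_100 = excess(TAIL, 100)
tail_116 = excess(TAIL, 116)
print(tail_100, tail_116)
```

Below a z-score of about 100 the observed and expected counts track each other closely; the sequences counted in the far tail are the ones that score higher than chance predicts.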
Inferring Homology from Statistical Significance
• Real UNRELATED sequences have similarity scores that are indistinguishable from RANDOM sequences
• If a similarity is NOT RANDOM, then it must be NOT UNRELATED
• Therefore, NOT RANDOM (statistically significant) similarity must reflect RELATED sequences
Query: atp6_human.aa ATP synthase a chain - 226 aa
Library: 5190103 residues in 13351 sequences
The best scores are: ( len) s-w bits E(13351) %_id %_sim alen
sp|P00846|ATP6_HUMAN ATP synthase a chain (AT ( 226) 1400 325.8 5.8e-90 1.000 1.000 226
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 1157 270.5 2.5e-73 0.779 0.951 226
sp|P00848|ATP6_MOUSE ATP synthase a chain (AT ( 226) 1118 261.7 1.2e-70 0.757 0.916 226
sp|P00849|ATP6_XENLA ATP synthase a chain (AT ( 226) 745 176.8 4.0e-45 0.533 0.847 229
sp|P00851|ATP6_DROYA ATP synthase a chain (AT ( 224) 473 115.0 1.7e-26 0.378 0.721 222
sp|P00854|ATP6_YEAST ATP synthase a chain pre ( 259) 428 104.7 2.3e-23 0.353 0.694 232
sp|P00852|ATP6_EMENI ATP synthase a chain pre ( 256) 365 90.4 4.8e-19 0.304 0.691 230
sp|P14862|ATP6_COCHE ATP synthase a chain (AT ( 257) 353 87.7 3.2e-18 0.313 0.650 214
sp|P68526|ATP6_TRITI ATP synthase a chain (AT ( 386) 309 77.6 5.1e-15 0.289 0.651 235
sp|P05499|ATP6_TOBAC ATP synthase a chain (AT ( 395) 309 77.6 5.2e-15 0.283 0.635 233
sp|P07925|ATP6_MAIZE ATP synthase a chain (AT ( 291) 283 71.7 2.3e-13 0.311 0.667 180
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271) 178 47.9 3.2e-06 0.233 0.585 236
sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synth (A ( 247) 144 40.1 0.00062 0.242 0.580 231
sp|P06452|ATPI_PEA Chloroplast ATP synthase a ( 247) 143 39.9 0.00072 0.250 0.586 232
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276) 142 39.7 0.00095 0.265 0.571 170
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247) 138 38.8 0.0016 0.242 0.580 231
sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT ( 261) 127 36.3 0.0095 0.263 0.557 167
sp|P69371|ATPI_ATRBE Chloroplast ATP synthase ( 247) 126 36.0 0.01 0.221 0.571 231
sp|P06289|ATPI_MARPO Chloroplast ATP synthase ( 248) 126 36.0 0.011 0.240 0.575 167
sp|P30391|ATPI_EUGGR Chloroplast ATP synthase ( 251) 123 35.4 0.017 0.257 0.579 214

sp|P19568|TLCA_RICPR ADP,ATP carrier protein ( 498) 122 35.0 0.043 0.243 0.579 152
sp|P24966|CYB_TAYTA Cytochrome b ( 379) 113 33.0 0.13 0.234 0.532 158
sp|P03892|NU2M_BOVIN NADH-ubiquinone oxidored ( 347) 107 31.7 0.31 0.261 0.479 211
sp|P68092|CYB_STEAT Cytochrome b ( 379) 104 31.0 0.54 0.277 0.547 137
sp|P03891|NU2M_HUMAN NADH-ubiquinone oxidored ( 347) 103 30.8 0.58 0.201 0.537 149
sp|P00156|CYB_HUMAN Cytochrome b ( 380) 102 30.5 0.74 0.268 0.585 205
sp|P15993|AROP_ECOLI Aromatic amino acid tr ( 457) 103 30.7 0.78 0.234 0.622 111
sp|P24965|CYB_TRANA Cytochrome b ( 379) 101 30.3 0.87 0.234 0.563 158
sp|P29631|CYB_POMTE Cytochrome b ( 308) 99 29.9 0.95 0.274 0.584 113
sp|P24953|CYB_CAPHI Cytochrome b ( 379) 99 29.8 1.2 0.236 0.564 140
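As a rough illustration of the rule "use E()-values, not percent identity", the sketch below parses a handful of rows copied from the hit list above and applies the E() < 0.001 threshold. The parser is a minimal assumption about this one-line-per-hit layout (the last six whitespace-separated fields are s-w, bits, E(), %_id, %_sim, alen); it is not a general FASTA output parser.

```python
SIGNIFICANT_E = 1e-3  # single-search threshold quoted in the talk

# Five rows copied from the hit list above.
ROWS = """\
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 1157 270.5 2.5e-73 0.779 0.951 226
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271) 178 47.9 3.2e-06 0.233 0.585 236
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276) 142 39.7 0.00095 0.265 0.571 170
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247) 138 38.8 0.0016 0.242 0.580 231
sp|P19568|TLCA_RICPR ADP,ATP carrier protein ( 498) 122 35.0 0.043 0.243 0.579 152
"""

def parse_hit(line):
    """Split one hit line; the last six fields are s-w, bits, E(), %_id, %_sim, alen."""
    fields = line.split()
    sw, bits, evalue, pct_id, pct_sim, alen = fields[-6:]
    return {
        "id": fields[0],
        "bits": float(bits),
        "evalue": float(evalue),
        "pct_id": float(pct_id),
    }

hits = {h["id"]: h for h in map(parse_hit, ROWS.splitlines()) if h}
significant = [name for name, h in hits.items() if h["evalue"] < SIGNIFICANT_E]
print(significant)
```

In these rows the E. coli homolog (23.3% identity, E() = 3.2e-06) has a lower percent identity than the ADP,ATP carrier protein (24.3% identity, E() = 0.043), which is why the threshold is applied to E() rather than to identity.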
>sp|P00846|ATP6_HUMAN ATP synthase subunit a; F-ATPase protein 6
>sp|P0AB98|ATP6_ECOLI ATP synthase subunit a; F-ATPase subunit 6
Length=271

Score = 47.9 bits (178), Expect = 3e-06
Identities = 55/199 (27%), Positives = 113/199 (56%), Gaps = 37/199 (18%)

Query 8 SFIAPTILGLPAAVLIILFPPLLIPTSKYLINNRLITTQQWLIKLTSKQMMTMHNTKGRT 67
S +LGL ++++LF + + + ++ T + +I + + + M++ K +
Sbjct 45 SMFFSVVLGL---LFLVLFRSVAKKATSG-VPGKFQTAIELVIGFVNGSVKDMYHGKSKL 100

Query 68 WSLMLVSLIIFIATTNLLGLLP---------HSF-------TPTTQLSMNLAMAIPLWAG 111
+ + +++ +++ NL+ LLP H + P+ +++ L+MA+ ++
Sbjct 101 IAPLALTIFVWVFLMNLMDLLPIDLLPYIAEHVLGLPALRVVPSADVNVTLSMALGVF-- 158

Query 112 TVIMGFRSKIKNALAHFLPQGTPTPL-----IPMLVIIETISLLIQPMALAVRLTANITA 166
+++ F S + F + T P+ IP+ +I+E +SLL +P++L +RL N+ A
Sbjct 159 -ILILFYSIKMKGIGGFTKELTLQPFNHWAFIPVNLILEGVSLLSKPVSLGLRLFGNMYA 217

Query 167 GHLLMHLIGSATLAMSTINLPSTLIIFTILILLTILEIAVALIQAYVFTLLVSLYL 222
G L+ LI S L IF ILI+ +QA++F +L +YL
Sbjct 218 GELIFILIAGLLPWWSQWILNVPWAIFHILIIT---------LQAFIFMVLTIVYL 264
The PAM250 matrix
Cys   12
Ser    0   2
Thr   -2   1   3
Pro   -1   1   0   6
Ala   -2   1   1   1   2
Gly   -3   1   0  -1   1   5
Asn   -4   1   0  -1   0   0   2
Asp   -5   0   0  -1   0   1   2   4
Glu   -5   0   0  -1   0   0   1   3   4
Gln   -5  -1  -1   0   0  -1   1   2   2   4
His   -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
       C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Where do scoring matrices come from?

• Scoring matrices can be designed for different evolutionary distances (less=shallow; more=deep)
• Deep matrices allow more substitution

$\lambda S_{i,j} = \log\left(\frac{q_{i,j}}{p_i\,p_j}\right)$
q_{i,j}: frequency of replacement in homologs
p_i p_j: frequency of alignment by chance

Pam40
     A    R    N    D    E    I    L
A    8
R   -9   12
N   -4   -7   11
D   -4  -13    3   11
E   -3  -11   -2    4   11
I   -6   -7   -7  -10   -7   12
L   -8  -11   -9  -16  -12   -1   10

Pam250
     A    R    N    D    E    I    L
A    2
R   -2    6
N    0    0    2
D    0   -1    2    4
E    0   -1    1    3    4
I   -1   -2   -2   -2   -2    5
L   -2   -3   -3   -4   -3    2    6
Query: atp6_ecoli.aa ATP synthase a - 271 aa
Library: 5190103 residues in 13351 sequences
The best scores are: ( len) s-w bits E(13351) %_id %_sim alen
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271) 1774 416.8 3.e-117 1.000 1.000 271
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247) 274 70.4 5.8e-13 0.270 0.616 211
sp|P69371|ATPI_ATRBE Chloroplast ATP synthase ( 247) 271 69.7 9.3e-13 0.270 0.607 211
sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT ( 261) 271 69.7 9.9e-13 0.267 0.600 240
sp|P06452|ATPI_PEA Chloroplast ATP synthase a ( 247) 266 68.5 2.1e-12 0.274 0.614 223
sp|P30391|ATPI_EUGGR Chloroplast ATP synthase ( 251) 265 68.3 2.5e-12 0.298 0.596 225
sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synthase ( 247) 260 67.2 5.4e-12 0.259 0.603 239
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276) 260 67.1 6.1e-12 0.264 0.578 258
sp|P06289|ATPI_MARPO Chloroplast ATP synthase ( 248) 250 64.8 2.7e-11 0.261 0.621 211
sp|P07925|ATP6_MAIZE ATP synthase a chain (AT ( 291) 215 56.7 8.7e-09 0.259 0.578 232
sp|P68526|ATP6_TRITI ATP synthase a chain (AT ( 386) 209 55.3 3.1e-08 0.259 0.603 239
sp|P00854|ATP6_YEAST ATP synthase a chain pre ( 259) 204 54.2 4.5e-08 0.235 0.578 277
sp|P05499|ATP6_TOBAC ATP synthase a chain (AT ( 395) 189 50.7 7.8e-07 0.220 0.582 268
sp|P00846|ATP6_HUMAN ATP synthase a chain (AT ( 226) 178 48.2 2.5e-06 0.237 0.589 236
sp|P00852|ATP6_EMENI ATP synthase a chain pre ( 256) 178 48.2 2.8e-06 0.209 0.590 244
sp|P00849|ATP6_XENLA ATP synthase a chain (AT ( 226) 173 47.1 5.5e-06 0.261 0.630 165
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 172 46.8 6.5e-06 0.233 0.581 236
sp|P14862|ATP6_COCHE ATP synthase a chain (AT ( 257) 171 46.6 8.7e-06 0.204 0.608 265
sp|P00848|ATP6_MOUSE ATP synthase a chain (AT ( 226) 166 45.5 1.7e-05 0.259 0.617 193
sp|P00851|ATP6_DROYA ATP synthase a chain (AT ( 224) 139 39.2 0.0013 0.225 0.549 253

sp|P24962|CYB_STELO Cytochrome b ( 379) 125 35.9 0.021 0.223 0.575 193
sp|P09716|US17_HCMVA Hypothetical protein HVL ( 293) 109 32.3 0.21 0.260 0.565 131
sp|P68092|CYB_STEAT Cytochrome b ( 379) 109 32.2 0.27 0.211 0.562 194
sp|P24960|CYB_ODOHE Cytochrome b ( 379) 104 31.1 0.61 0.210 0.555 200
sp|P03887|NU1M_BOVIN NADH-ubiquinone oxidored ( 318) 98 29.7 1.3 0.287 0.545 167
sp|P24992|CYB_ANTAM Cytochrome b ( 379) 99 29.9 1.4 0.192 0.565 193
Homology is Transitive (on domains)
[Figure: tree of ATP synthase a-chain homologs, each labeled with a pair of E()-values, "vs human : E. coli". Human mito 10^-90 : 10^-6; E. coli 10^-6 : 10^-117. Animal mitochondrial sequences (bovine, mouse, frog, Drosophila) score 10^-73 to 10^-26 vs human and 10^-6 to 0.0013 vs E. coli. Fungal and plant mitochondrial sequences (yeast, Cochliobolus, Aspergillus, corn, wheat) score 10^-23 to 10^-13 vs human and 10^-9 to 10^-5 vs E. coli. Chloroplast and cyanobacterial sequences (Euglena, Marchantia, spinach, tobacco, pea, rice, Synechocystis, other cyanobacteria) score only 0.0006 to 0.02 vs human but 10^-13 to 10^-11 vs E. coli.]
Homology and EXCESS similarity
The importance of perspective
• We use the E()-value to infer homology based on excess similarity BETWEEN the QUERY and the SUBJECT sequences
• Two homologous sequences may not share excess similarity in a BLAST search with one or the other as query, but share significant similarity to a third sequence (or to a PSSM or HMM)
• If two sequences share significant similarity to the same sequence in the same region, we can infer homology.
As always, non-excess similarity does not imply non-homology
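The "same sequence, same region" condition can be sketched as a small check. Everything here is illustrative: the `Hit` record, the 50-residue minimum overlap, and the 0.001 threshold are assumptions for the example. The human/E. coli subject coordinates (45-264) and E()-values come from the searches shown earlier, while the spinach coordinates and the `HYPOTHETICAL_X` hit are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    s_start: int   # aligned region on the subject, 1-based inclusive
    s_end: int
    evalue: float

def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

def linked_by_third(hit_a, hit_b, max_e=1e-3, min_overlap=50):
    """True if two different queries both hit the same subject significantly
    and their aligned regions on that subject overlap."""
    return (
        hit_a.subject == hit_b.subject
        and hit_a.query != hit_b.query
        and hit_a.evalue < max_e
        and hit_b.evalue < max_e
        and overlap(hit_a.s_start, hit_a.s_end, hit_b.s_start, hit_b.s_end) >= min_overlap
    )

# Human vs E. coli: E() and subject coordinates from the alignment shown earlier.
human = Hit("ATP6_HUMAN", "ATP6_ECOLI", 45, 264, 3.2e-06)
# Spinach chloroplast vs E. coli: E() from the E. coli search; coordinates are hypothetical.
spinach = Hit("ATPI_SPIOL", "ATP6_ECOLI", 40, 250, 5.8e-13)
# A hypothetical hit to a different, non-overlapping region of the same subject.
other = Hit("HYPOTHETICAL_X", "ATP6_ECOLI", 1, 30, 1e-10)

print(linked_by_third(human, spinach), linked_by_third(human, other))
```

The human and spinach sequences have E() = 0.0016 when compared directly, which misses the 0.001 threshold, yet both are significantly similar to the same region of the E. coli sequence. The region test matters because two proteins can each match a different domain of a multi-domain third protein without being homologous to each other.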
Homology and Domains – Histone acetyltransferase KAT2B
The best scores are: s-w bits E(454402) %_id %_sim alen
KAT2B_HUMAN Histone acetyltransferase KAT2B ( 832) 3820 1456. 0 1.000 1.000 832
KAT2A_HUMAN Histone acetyltransferase KAT2A ( 837) 2747 1049. 0 0.721 0.870 813
GCN5_SCHPO Histone acetyltransferase gcn5 ( 454) 867 334.7 3e-90 0.483 0.768 354
GCN5_YEAST Histone acetyltransferase GCN5 ( 439) 792 306.2 1.1e-81 0.469 0.760 354
GCN5_ORYSJ Histone acetyltransferase GCN5 ( 511) 760 294.0 5.9e-78 0.436 0.755 376
GCN5_ARATH Histone acetyltransferase GCN5; ( 568) 719 278.4 3.3e-73 0.434 0.740 369
BPTF_HUMAN Nucleosome-remodeling factor sub (3046) 286 113.6 7.6e-23 0.495 0.804 97
NU301_DROME Nucleosome-remodeling factor su (2669) 276 109.8 9.1e-22 0.511 0.819 94
CECR2_HUMAN Cat eye syndrome critical regio (1484) 232 93.2 5e-17 0.371 0.790 105
BRD4_HUMAN Bromodomain-containing protein 4 (1362) 214 86.4 5.2e-15 0.379 0.698 116
BRD4_MOUSE Bromodomain-containing protein 4 (1400) 214 86.4 5.3e-15 0.379 0.698 116
BAZ2A_HUMAN Bromodomain adjacent to zinc fi (1905) 211 85.2 1.7e-14 0.382 0.683 123
BAZ2A_XENLA Bromodomain adjacent to zinc fi (1698) 206 83.3 5.5e-14 0.350 0.684 117
FSH_DROME Homeotic protein female sterile; (2038) 205 82.9 8.8e-14 0.341 0.667 129
BAZ2A_MOUSE Bromodomain adjacent to zinc fi (1889) 204 82.5 1e-13 0.368 0.680 125
BRDT_MACFA Bromodomain testis-specific prot ( 947) 197 80.0 3e-13 0.367 0.697 109
BRD3_HUMAN Bromodomain-containing protein 3 ( 726) 194 78.9 4.9e-13 0.362 0.664 116
Homology and Domains – Histone acetyltransferase KAT2B
[Figure: domain diagrams of KAT2B_HUMAN (832 aa; PCAF_N, C.Acetyltrans, Bromodomain) aligned to each subject, with the E() and alignment length of each alignment:]
KAT2B_HUMAN vs KAT2B_HUMAN (PCAF_N, C.Acetyltrans, Bromodomain): E() < 0; 832
KAT2B_HUMAN vs KAT2A_HUMAN (PCAF_N, C.Acetyltrans, Bromodomain): E() < 0; 813
KAT2B_HUMAN vs GCN5_YEAST (C.Acetyltrans, Bromodomain): 1e-81; 354
KAT2B_HUMAN vs GCN5_ARATH (C.Acetyltrans, Bromodomain): 3e-73; 369
KAT2B_HUMAN vs BPTF_HUMAN: 7e-23; 97
KAT2B_HUMAN vs BRD4_MOUSE (Bromodomain, Bromodomain): 5e-15; 116
KAT2B_HUMAN vs BRD4_MOUSE (Bromodomain, Bromodomain): 2e-8; 109
Similarity searches for homology
• Homologous sequences are unexpectedly similar (excess similarity)
– excess compared to what? Random similarity (similarity by chance, measured by the E()-value)
• In a similarity search, excess similarity reflects the perspective of the query sequence
– different queries can reveal excess similarity
– homology in the aligned region
• Non-significant similarity IS NOT evidence for non-homology
– significant similarity to a protein of different structure shows non-homology
Effective similarity searching
• Use protein/translated DNA comparisons
• Modern sequence similarity searching is highly efficient, sensitive, and reliable
– homologs are homologs
– similarity statistics are accurate
– databases are large
– most queries will find a significant match
• Improving similarity searches
– protein (translated DNA) similarity searches
– smaller databases
– appropriate scoring matrices for short reads/assemblies
– appropriate alignment boundaries
• Extracting more information from annotations
– homologous over-extension
– scoring sub-alignments to identify homologous domains
• All methods (pairwise, HMM, PSSM) miss homologs
– all methods find genuine homologs the other methods miss
DNA vs protein sequence comparison
The best scores are:                      DNA         tfastx3     prot.
                                          E(188,018)  E(187,524)  E(331,956)
DMGST D.melanogaster GST1-1               1.3e-164    4.1e-109    1.0e-109
MDGST1 M.domestica GST-1 gene             2e-77       3.0e-95     1.9e-76
LUCGLTR Lucilia cuprina GST               1.5e-72     5.2e-91     3.3e-73
MDGST2A M.domesticus GST-2 mRNA           9.3e-53     1.4e-77     1.6e-62
MDNF1 M.domestica nf1 gene. 10            4.6e-51     2.8e-77     2.2e-62
MDNF6 M.domestica nf6 gene. 10            2.8e-51     4.2e-77     3.1e-62
MDNF7 M.domestica nf7 gene. 10            6.1e-47     9.2e-77     6.7e-62
AGGST15 A.gambiae GST mRNA                3.1e-58     4.2e-76     4.3e-61
CVU87958 Culicoides GST                   1.8e-41     4.0e-73     3.6e-58
AGG3GST11 A.gambiae GST1-1 mRNA           1.5e-46     2.8e-55     1.1e-43
BMO6502 Bombyx mori GST mRNA              1.1e-23     8.8e-50     5.7e-40
AGSUGST12 A.gambiae GST1-1 gene           2.3e-16     4.5e-46     5.1e-37
MOTGLUSTRA Manduca sexta GST              5.7e-07     2.5e-30     8.0e-25
RLGSTARGN R.legominosarum gstA            0.0029      3.2e-13     1.4e-10
HUMGSTT2A H. sapiens GSTT2                0.32        3.3e-10     2.0e-09
HSGSTT1 H.sapiens GSTT1 mRNA              7.2         8.4e-13     3.6e-10
ECAE000319 E. coli hypothet. prot.        -           4.7e-10     1.1e-09
MYMDCMA Methyl. dichlorometh. DH          -           1.1e-09     6.9e-07
BCU19883 Burkholderia maleylacetate red.  -           1.2e-09     1.1e-08
NFU43126 Naegleria fowleri GST            -           3.2e-07     0.0056
SP505GST Sphingomonas paucim              -           1.8e-06     0.0002
EN1838 H. sapiens maleylaceto. iso.       -           2.1e-06     5.9e-06
HSU86529 Human GSTZ1                      -           3.0e-06     8.0e-06
SYCCPNC Synechocystis GST                 -           1.2e-05     9.5e-06
HSEF1GMR H.sapiens EF1g mRNA              -           9.0e-05     0.00065
Why smaller databases produce more sensitive searches – statistics

$S' = \lambda S_{raw} - \ln(K m n)$
$S_{bit} = (\lambda S_{raw} - \ln K)/\ln 2$
$P(S' > x) = 1 - \exp(-e^{-x})$
$P(S_{bit} > x) = 1 - \exp(-m n\, 2^{-x})$
$E(S' > x \mid D) = P \cdot D$

P(B bits) = m n 2^-B
P(40 bits) = 1.5 x 10^-7
E(40 | D=4000) = 6 x 10^-4
E(40 | D=12E6) = 1.8

[Figure: histogram of similarity scores (counts from 0 to 10,000), with x-axes in λS, bit score, and Z(s) units]
What is a “bit” score (I)?
1. Scoring matrices (PAM250, BLOSUM62, VTML40) contain “log-odds” scores:
– s_i,j (bits) = log2(q_i,j / p_i p_j) (q_i,j: freq. in homologs; p_i p_j: freq. by chance)
– s_i,j (bits) = 2 -> a residue is 2^2 = 4 times more likely to occur by homology compared with chance (at one residue)
– s_i,j (bits) = -1 -> a residue is 2^-1 = 1/2 as likely to occur by homology compared with chance (at one residue)
2. An alignment score is the maximum sum of s_i,j bit scores across the aligned residues.
– A 40-bit score is 2^40 times more likely to occur by homology than by chance.
3. How often should a score occur by chance? In a 400 x 400 alignment, there are ~160,000 places where the alignment could start by chance, so we expect a score of 40 bits would occur: P(S_bit > x) = 1 - exp(-m n 2^-x) ~ m n 2^-x
– 400 x 400 x 2^-40 = 160,000 / 2^40 (2^40 ≈ 1.1 x 10^12) = 1.5 x 10^-7 times
– Thus, the probability of a 40 bit score in ONE alignment is ~ 10^-7
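The arithmetic in points 1 to 3 can be checked with a few lines. This is a minimal sketch of the formulas as written on the slide; the function names are illustrative, and `math.expm1` is used only to keep precision when the probability is tiny.

```python
import math

def odds_from_bits(bits):
    """Odds (homology vs chance) implied by a log-odds score in bits."""
    return 2.0 ** bits

def p_bit_score(bits, m, n):
    """P(S_bit > bits) = 1 - exp(-m n 2^-bits) for ONE alignment of lengths m and n."""
    expected = m * n * 2.0 ** (-bits)
    return -math.expm1(-expected)   # equals 1 - exp(-expected), accurate for small values

p40 = p_bit_score(40, 400, 400)
print(odds_from_bits(2), odds_from_bits(-1), p40)
```

For probabilities this small, 1 - exp(-x) and x are indistinguishable, which is why the slide uses the simpler product m n 2^-x.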
What is a “bit” score (II)?
4. But we did not do ONE alignment; we did 4,000, 40,000, 500,000, or 20 million alignments when we searched the database:
– E(S_bit | D) = p(40 bits) x database size
– E(40 | 4,000) = 10^-7 x 4,000 = 4 x 10^-4 (significant)
– E(40 | 40,000) = 10^-7 x 4 x 10^4 = 4 x 10^-3 (not significant)
– E(40 | 500,000) = 10^-7 x 5 x 10^5 = 0.05 (not significant)
– E(40 | 20 million) = 10^-7 x 2.0 x 10^7 = 2.0 (not significant)
Not significant does not mean not-homologous
Bonferroni correction for multiple tests – each alignment is a test
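The database-size correction above is a single multiplication, sketched here with the slide's rounded single-alignment probability of 10^-7 and the 0.001 threshold from the start of the talk. The names `P_SINGLE`, `THRESHOLD`, and `expect_value` are illustrative.

```python
P_SINGLE = 1e-7     # probability of a 40-bit score in one 400 x 400 alignment (rounded, from the slide)
THRESHOLD = 1e-3    # E() below this is treated as significant

def expect_value(p_single, n_tests):
    """Bonferroni-style correction: expected number of chance hits in n_tests alignments."""
    return p_single * n_tests

results = {d: expect_value(P_SINGLE, d) for d in (4_000, 40_000, 500_000, 20_000_000)}
significant = {d: e < THRESHOLD for d, e in results.items()}

for d, e in results.items():
    print(f"D = {d:>10,}  E(40 bits) = {e:.1e}  significant: {significant[d]}")
```

The same 40-bit alignment is significant only in the smallest database, which is the argument for searching a small comprehensive database first.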
E()-values are conservative frequentist estimates that similarity occurred by chance

$S' = \lambda S_{raw} - \ln(K m n)$
$S_{bit} = (\lambda S_{raw} - \ln K)/\ln 2$
$P(S' > x) = 1 - \exp(-e^{-x})$
$P(S_{bit} > x) = 1 - \exp(-m n\, 2^{-x})$

Bonferroni correction: $E(S' > x \mid D) = P \cdot D$ (D = # of tests)

[Figure: score distribution with x-axes in λS, bit score, and Z(s) units]

With modern sequence databases (thousands of complete proteomes), E() < 10^-10 is routine for sequences >25% identical, after correcting for 10,000,000 sequences (tests)
How many “bits” do I need?
E() = p() x database size
E(40 | 4,000) = 10^-7 x 4,000 = 4 x 10^-4 (significant)
E(40 | 40,000) = 10^-7 x 4 x 10^4 = 4 x 10^-3 (not significant)
E(40 | 500,000) = 10^-7 x 5 x 10^5 = 0.05 (not significant)

To get E() ~ 10^-3, how many bits do I need?
p = m n 2^-bits
bits = -log2(p/(m n)) = -log2(E()/(database_size m n))
genome (10,000): p ~ 10^-3/10^4 = 10^-7; -log2(10^-7/160,000) ~ 40 bits
SwissProt (500,000): p ~ 10^-3/10^6 = 10^-9; -log2(10^-9/160,000) ~ 47 bits
Uniprot/NR (10^7): p ~ 10^-3/10^7 = 10^-10; -log2(10^-10/160,000) ~ 50 bits

very significant E() < 10^-50
significant E() < 10^-6
significant E() < 10^-3
not significant
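The inversion from a target E() to a required bit score can be sketched directly from bits = -log2(E()/(database_size m n)). The function name is illustrative; m = n = 400 follows the earlier example, and the SwissProt size is rounded up to 10^6 as on the slide.

```python
import math

def bits_needed(e_target, db_size, m=400, n=400):
    """Bit score at which E() equals e_target, from E = db_size * m * n * 2^-bits."""
    return -math.log2(e_target / (db_size * m * n))

genome = bits_needed(1e-3, 1e4)      # a single proteome, ~10,000 sequences
swissprot = bits_needed(1e-3, 1e6)   # slide rounds 500,000 up to 10^6
nr = bits_needed(1e-3, 1e7)

print(f"{genome:.1f} {swissprot:.1f} {nr:.1f}")
```

The computed values fall within a bit of the slide's rounded figures of 40, 47, and 50. Each tenfold increase in database size costs about 3.3 bits, which is log2(10).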
What database to search?
• Search the smallest comprehensive database likely to contain your protein
– vertebrates – human proteins (40,000)
– fungi – S. cerevisiae (6,000)
– bacteria – E. coli, gram positive, etc. (<100,000)
• Search a richly annotated protein set (SwissProt, >500,000)
• Always search NR (> 50 million) LAST
• Never Search “GenBank” (DNA)
Scoring matrices
• Scoring matrices can set the evolutionary look-back time for a search
– Lower PAM (PAM10/MDM10 … PAM60) for closer (10% … 50% identity)
– Higher BLOSUM for higher conservation (BLOSUM50 distant, BLOSUM80 conserved)
• Shallow scoring matrices for short domains/short queries (metagenomics)
– Matrices have “bits/position” (score/position); 40 aa at 0.45 bits/position (BLOSUM62) means an 18-bit average score (50 bits significant)
• Deep scoring matrices allow alignments to continue, possibly outside the homologous region
Where do scoring matrices come from?

Pam40
     A    R    N    D    E    I    L
A    8
R   -9   12
N   -4   -7   11
D   -4  -13    3   11
E   -3  -11   -2    4   11
I   -6   -7   -7  -10   -7   12
L   -8  -11   -9  -16  -12   -1   10

Pam250
     A    R    N    D    E    I    L
A    2
R   -2    6
N    0    0    2
D    0   -1    2    4
E    0   -1    1    3    4
I   -1   -2   -2   -2   -2    5
L   -2   -3   -3   -4   -3    2    6

q_ij: replacement frequency at PAM40, PAM250
q_R:N(40) = 0.000435     p_R = 0.051
q_R:N(250) = 0.002193    p_N = 0.043
p_R p_N = 0.002193
λ_2 S_ij = log2(q_ij / p_i p_j)     λ_e S_ij = ln(q_ij / p_i p_j)
λ_2 S_R:N(40) = log2(0.000435/0.00219) = -2.333
λ_2 = 1/3; S_R:N(40) = -2.333/λ_2 = -7
λ S_R:N(250) = log2(0.002193/0.002193) = 0

$\lambda S_{i,j} = \log_b\left(\frac{q_{i,j}}{p_i\,p_j}\right)$
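The R:N worked example can be reproduced in a few lines. This is a sketch of the log-odds formula with the frequencies quoted on the slide; the function name and the rounding to the nearest integer are assumptions about how the matrix entries were produced.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale_bits=1 / 3):
    """Integer matrix entry in units of scale_bits: round(log2(q_ij / (p_i p_j)) / scale_bits)."""
    return round(math.log2(q_ij / (p_i * p_j)) / scale_bits)

P_R, P_N = 0.051, 0.043                               # background frequencies from the slide
pam40_rn = log_odds_score(0.000435, P_R, P_N)         # q_R:N at PAM40
pam250_rn = log_odds_score(0.002193, P_R, P_N)        # q_R:N at PAM250

print(pam40_rn, pam250_rn)
```

The two results agree with the R:N entries in the Pam40 and Pam250 matrices above: at PAM40 an R:N replacement is about five times rarer among homologs than by chance, while at PAM250 it is as common as chance, so the score is zero.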
PAM matrices and alignment length
[Figure: plot relating PAM matrices to alignment length, with BLOSUM80, BLOSUM62, and BLOSUM50 marked]
Short domains require “shallow” scoring matrices
Empirical matrix performance (median results from random alignments)

Matrix     gaps    target %ident  bits/position  aln len (50 bits)
VT160      -12/-2  23.8           0.26           192
BLOSUM50   -10/-2  25.3           0.23           217
BLOSUM62*  -11/-1  28.9           0.45           111
VT120      -11/-1  27.4           1.03           48
VT80       -11/-1  51.9           1.55           32
PAM70*     -10/-1  33.8           0.64           78
PAM30*     -9/-1   45.5           1.06           47
VT40       -12/-1  72.7           2.76           18
VT20       -15/-2  84.6           3.62           13
VT10       -16/-2  90.9           4.32           12

HMMs can be very "deep"
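The last column of the table follows from the bits/position column: the alignment length needed to reach 50 bits is roughly 50 divided by the information content per position. The sketch below recomputes it; the dictionary and function names are illustrative, and because the tabulated bits/position values are rounded, the recomputed lengths agree with the table only to within about one residue.

```python
# bits/position and tabulated length for 50 bits, copied from the table above
MATRICES = {
    "VT160": (0.26, 192), "BLOSUM50": (0.23, 217), "BLOSUM62": (0.45, 111),
    "VT120": (1.03, 48), "VT80": (1.55, 32), "PAM70": (0.64, 78),
    "PAM30": (1.06, 47), "VT40": (2.76, 18), "VT20": (3.62, 13), "VT10": (4.32, 12),
}

def length_for_bits(target_bits, bits_per_position):
    """Approximate alignment length needed to accumulate target_bits."""
    return target_bits / bits_per_position

def expected_bits(length, bits_per_position):
    """Average score, in bits, of an alignment of the given length."""
    return length * bits_per_position

for name, (bpp, table_len) in MATRICES.items():
    print(f"{name:9s} {length_for_bits(50, bpp):6.1f}  (table: {table_len})")

short_domain_blosum62 = expected_bits(40, MATRICES["BLOSUM62"][0])
short_domain_vt40 = expected_bits(40, MATRICES["VT40"][0])
```

A 40-residue domain averages 18 bits with BLOSUM62, well short of 50, but would exceed 100 bits with VT40 if the sequences are similar enough for that matrix to apply. That is the case for shallow matrices on short domains and reads.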
Scoring matrices affect alignment boundaries (homologous over-extension)
[Figure: two dot-plot panels of sp|Q14247.2|SRC8_HUMAN (Src substrate cortactin) aligned against itself, one with BLOSUM62 -11/-1 and one with VTML80 -10/-1. Both axes run from 0 to 550 residues; alignments are shaded by E() (<0.0001, <0.01, <1, <1e+02, >1e+02); annotated domains on each axis are seven Hs1_Cortactin repeats followed by an SH3_domain.]
Scoring Matrices - Summary
• PAM and BLOSUM matrices greatly improve the sensitivity of protein sequence comparison – low identity with significant similarity
• PAM matrices have an evolutionary model - lower number, less divergence – lower=closer; higher=more distant
• BLOSUM matrices are sampled from conserved regions at different average identity – higher=more conservation
• Short alignments (domains, exons, reads) require shallow (higher information content) matrices
• Shallow matrices set maximum look-back time
Effective similarity searching
• Modern sequence similarity searching is highly efficient, sensitive, and reliable
– homologs are homologs
– similarity statistics are accurate
– databases are large
– most queries will find a significant match
• Improving similarity searches
– smaller databases
– appropriate scoring matrices for short reads/assemblies
– appropriate alignment boundaries
• Extracting more information from annotations
– homologous over-extension
– scoring sub-alignments to identify homologous domains
• All methods (pairwise, HMM, PSSM) miss homologs
– all methods find genuine homologs the other methods miss
Overextension into random sequence
> pf26|15978520|E6SGT6|E6SGT6_THEM7 Heavy metal translocating P-type
ATPase EC=3.6.3.4
Length=888
Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%)
Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143
+L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A
Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA 371
Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK 203
+ ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+
Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431
Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263
+PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP
Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491
Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323
V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G
Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG 551
Query 324 ILFKGGEHLENAG---------GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374
IL + G+ L+ A G T+G +++ ATG + L LTA
Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA------ 603
Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415
A ++ + L I + L R +A E+ +A P GV
Sbjct 604 --AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV 642
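[Figure: Pfam domain diagram for the alignment above. PF00122 covers query residues 113-335 and subject residues 340-562; a further annotated region covers 566-783.]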
Can homologous proteins have different structures?
Roessler C G et al. PNAS (2008) 105:2343

BLOSUM80 (40% id)
Region: 12-37:14-34 : score=50; bits=26.0; Qval=41 : DOMAIN_N: alpha
Region: 42-60:39-57 : score=26; bits=14.5; Qval= 7 : DOMAIN_N: beta
Smith-Waterman score: 82; E(1) < 2e-9; 49.0% identity (61.2% similar) in 49 aa overlap
10 20 30 40 50 60 70
3BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA
:. :::::.::: :::::. : : : :.: .: : :.::
2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQ-----MVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA
[ 10 20 30 ] [0 50 60 ]

VTML40 (70% id)
Region: 12-29:14-31 : score=82; bits=42.9; LPr=9.2 : DOMAIN_N: alpha
Smith-Waterman score: 82; E(1) < 6.5e-10; 77.8% identity (88.9% similar) in 18 aa overlap
10 20 30 40 50 60 70
3BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA
:. :::::.::: :::::
2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQMVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA
[ 10 20 30 ] 40 50 60
Scoring matrices affect alignment boundaries (homologous over-extension)
[Figure: dot plot of sp|Q14247.2|SRC8_HUMAN (Src substrate cortactin) aligned against itself with BLOSUM62 -11/-1. Both axes run from 0 to 550 residues; alignments are shaded by E() (<0.0001, <0.01, <1, <1e+02, >1e+02); annotated domains on each axis are seven Hs1_Cortactin repeats followed by an SH3_domain.]

BLOSUM62 -11/-1
32- 42: 69- 79 : Id=0.455; Q= 0.0 : NODOM :0
43- 79: 80-116 : Id=0.158; Q= 0.0 : Hs1_Cortactin
80-116:117-153 : Id=0.622; Q=37.4 : Hs1_Cortactin
117-153:154-190 : Id=0.757; Q=50.2 : Hs1_Cortactin
154-190:191-227 : Id=0.811; Q=61.0 : Hs1_Cortactin
191-227:228-264 : Id=0.568; Q=35.3 : Hs1_Cortactin
228-264:265-301 : Id=0.649; Q=41.5 : Hs1_Cortactin
265-287:302-324 : Id=0.565; Q= 8.9 : Hs1_Cortactin
288-458:325-491 : Id=0.165; Q= 0.0 : NODOM
459-473:492-506 : Id=0.200; Q= 0.0 : SH3

VTML80 -10/-1
82-116:119-153 : Id=0.657; Q=102.2 : Hs1_Cortactin
117-153:154-190 : Id=0.757; Q=138.0 : Hs1_Cortactin
154-190:191-227 : Id=0.811; Q=164.6 : Hs1_Cortactin
191-227:228-264 : Id=0.568; Q= 91.9 : Hs1_Cortactin
228-264:265-301 : Id=0.649; Q=112.4 : Hs1_Cortactin
265-287:302-324 : Id=0.565; Q= 36.7 : Hs1_Cortactin
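One way to read the two region lists is to flag every sub-alignment whose Q value is zero as a candidate over-extension. The sketch below does that for the lines shown above; the regular expression is an assumption about this particular line layout, and using Q == 0 as the flag is an illustrative choice rather than a rule stated in the talk.

```python
import re

REGION = re.compile(
    r"(\d+)-\s*(\d+):\s*(\d+)-\s*(\d+)\s*:\s*Id=([\d.]+);\s*Q=\s*([\d.]+)\s*:\s*(\S+)"
)

BLOSUM62 = """\
32- 42: 69- 79 : Id=0.455; Q= 0.0 : NODOM :0
43- 79: 80-116 : Id=0.158; Q= 0.0 : Hs1_Cortactin
80-116:117-153 : Id=0.622; Q=37.4 : Hs1_Cortactin
117-153:154-190 : Id=0.757; Q=50.2 : Hs1_Cortactin
154-190:191-227 : Id=0.811; Q=61.0 : Hs1_Cortactin
191-227:228-264 : Id=0.568; Q=35.3 : Hs1_Cortactin
228-264:265-301 : Id=0.649; Q=41.5 : Hs1_Cortactin
265-287:302-324 : Id=0.565; Q= 8.9 : Hs1_Cortactin
288-458:325-491 : Id=0.165; Q= 0.0 : NODOM
459-473:492-506 : Id=0.200; Q= 0.0 : SH3
"""

VTML80 = """\
82-116:119-153 : Id=0.657; Q=102.2 : Hs1_Cortactin
117-153:154-190 : Id=0.757; Q=138.0 : Hs1_Cortactin
154-190:191-227 : Id=0.811; Q=164.6 : Hs1_Cortactin
191-227:228-264 : Id=0.568; Q= 91.9 : Hs1_Cortactin
228-264:265-301 : Id=0.649; Q=112.4 : Hs1_Cortactin
265-287:302-324 : Id=0.565; Q= 36.7 : Hs1_Cortactin
"""

def parse_regions(text):
    """Parse 'q1-q2:s1-s2 : Id=..; Q=.. : label' lines into dictionaries."""
    regions = []
    for line in text.splitlines():
        m = REGION.match(line.strip())
        if m:
            q1, q2, s1, s2, ident, q, label = m.groups()
            regions.append({"query": (int(q1), int(q2)), "subject": (int(s1), int(s2)),
                            "identity": float(ident), "Q": float(q), "label": label})
    return regions

def overextended(regions):
    """Sub-alignments that contribute no significant score (Q == 0)."""
    return [r for r in regions if r["Q"] == 0.0]

flagged_b62 = [r["query"] for r in overextended(parse_regions(BLOSUM62))]
flagged_vt80 = [r["query"] for r in overextended(parse_regions(VTML80))]
print(flagged_b62, flagged_vt80)
```

With BLOSUM62 the alignment runs through four regions that carry no score, including 171 residues at 16.5% identity, while the shallower VTML80 alignment stops at the boundaries of the cortactin repeats.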
7/16/16
21
Overextension into random sequence
> pf26|15978520|E6SGT6|E6SGT6_THEM7 Heavy metal translocating P-type
ATPase EC=3.6.3.4
Length=888
Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%)
Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143
+L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A
Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA 371
Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK 203
+ ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+
Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431
Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263
+PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP
Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491
Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323
V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G
Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG 551
Query 324 ILFKGGEHLENAG---------GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374
IL + G+ L+ A G T+G +++ ATG + L LTA
Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA------ 603
Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415
A ++ + L I + L R +A E+ +A P GV
Sbjct 604 --AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV 642
[Figure: domain diagram for the alignment above; PF00122 is marked on both sequences; coordinates shown: 113-335, 340-562, 566-783]
>>sp|E6SGT6|E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC=3.6.3.4 (888 aa)
 qRegion: 81-112:309-340 : score=15; bits=12.3; Id=0.219; Q=0.0 : Shuffle
 qRegion: 113-335:341-563 : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122
 qRegion: 336-415:564-642 : score=14; bits=12.0; Id=0.236; Q=0.0 : Shuffle
 Region: 81-111:309-339 : score=11; bits=11.1; Id=0.194; Q=0.0 : NODOM :0
 Region: 112-334:340-562 : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122 Pfam
 Region: 338-415:566-642 : score=16; bits=12.6; Id=0.244; Q=0.0 : PF00702 Pfam
 s-w opt: 632 Z-score: 1048.6 bits: 204.2 E(274545): 3.7e-51
Smith-Waterman score: 765; 49.7% identity (73.3% similar) in 344 aa overlap (81-415:309-642)
[Figure: domain maps of the two sequences. E6SGT6 (axis 200-800): PF04945, PF00403, PF00122, PF00702. B0TE74 (axis 200-400): Shuffle, PF00122, Shuffle.]
Sub-alignment scoring detects overextension
50 60 70 80 90 100 110 ][ 120
B0TE74 LPIVPGTMALGVQSDGKDETLLVALEVPVDRERVGPAIVDAFRFLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLG
: .:. .: .:. . .:. . .: . :...:. ::
E6SGT6 ILGLLTLPVMLWSGSHFFNGMWQGLKHRQANMHTLISIGIAAAWLYSTVAVAFPQIFPSMALAEVFYDvtavvvalvnlg
270 280 290 300 310 320 330 3][
...
290 300 310 320 330 ][ 340 350
B0TE74 APENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKGILFKGGEHLENAGG---------GAHTEGAENKAELL
. ::. :...:.::::::::::::::. :: :::::.:::...:. :. :. :. :.: . ....
E6SGT6 PEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQGILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVV
510 520 530 540 550 560 ] [ 570 580

360 370 380 390 400 410 420 430
B0TE74 KTRATGISILVTLGLTAKGRDRSTVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGVVVDSLVTTAFLAVEEI
:::.. . : ::: :..... : : .. : : .: :. ..: : ::
E6SGT6 --AATGFDEDLILRLTA--------AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGVEAQVEGHHVLVGNERL
590 600 610 620 630 640 650
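The idea of scoring sub-alignments separately can be illustrated with a minimal Python sketch. It is not the FASTA implementation: it only computes percent identity per region (the slides also report a score-based Q-value), and the function name `region_identity` and its arguments are hypothetical. Gap columns in the query are counted with the region of the preceding query residue, which is a simplifying assumption.

```python
def region_identity(query_aln, subject_aln, query_start, regions):
    """Fraction of identical columns for each region of a pairwise alignment.

    query_aln, subject_aln: aligned strings of equal length, '-' marks a gap.
    query_start: 1-based query coordinate of the first aligned query residue.
    regions: list of (label, first, last), 1-based inclusive query coordinates.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")
    counts = {label: [0, 0] for label, _, _ in regions}  # [identical, columns]
    pos = query_start - 1  # query coordinate of the most recent query residue
    for q, s in zip(query_aln.upper(), subject_aln.upper()):
        if q != "-":
            pos += 1
        for label, first, last in regions:
            if first <= pos <= last:
                counts[label][1] += 1
                if q == s and q != "-":
                    counts[label][0] += 1
    return {
        label: (ident / cols if cols else 0.0)
        for label, (ident, cols) in counts.items()
    }


if __name__ == "__main__":
    # Toy alignment: a conserved "domain" followed by an unrelated extension.
    result = region_identity(
        "ACDEFGHIK", "ACDEFWWWW", 10,
        [("domain", 10, 14), ("extension", 15, 18)],
    )
    for label, identity in result.items():
        print(f"{label}: Id={identity:.3f}")
```

A region whose identity drops sharply relative to its neighbors, as in the Shuffle/NODOM rows above, is a candidate for over-extension and worth testing on its own.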
Scoring domains highlights over-extension
>>sp|SRC8_HUMAN Src substrate cortactin; (550 aa)
>>sp|SRC8_CHICK Src substrate p85; Cort (563 aa)
84.7% id (1-550:11-563) E(454402): 1.2e-159
  1- 79: 11- 88 Id=0.873; Q=281.4 : NODOM
 80-116: 89-125 Id=1.000; Q=133.2 : Hs1_Cortactin
117-153:126-162 Id=0.946; Q=121.0 : Hs1_Cortactin
154-190:163-199 Id=0.973; Q=127.1 : Hs1_Cortactin
191-227:200-236 Id=0.973; Q=128.3 : Hs1_Cortactin
228-264:237-273 Id=0.973; Q=137.5 : Hs1_Cortactin
265-301:274-310 Id=0.892; Q=117.3 : Hs1_Cortactin
302-324:311-333 Id=0.957; Q= 69.6 : Hs1_Cortactin
325-491:334-504 Id=0.632; Q=386.6 : NODOM
492-550:505-563 Id=0.966; Q=226.3 : SH3

>>sp|SRC8_HUMAN Src substrate cortactin (550 aa)
>>sp|HCLS1_MOUSE Hematopoiet ln cell-sp (486 aa)
44.1% id (1-548:1-485) E(454402): 4.1e-61
  1- 79:  1- 78 Id=0.671; Q=213.0 : NODOM
 80-116: 79-115 Id=0.757; Q= 97.9 : Hs1_Cortactin
117-153:116-152 Id=0.703; Q= 94.8 : Hs1_Cortactin
154-190:153-189 Id=0.703; Q= 97.3 : Hs1_Cortactin
191-213:190-212 Id=0.826; Q= 60.5 : Hs1_Cortactin
214-491:213-428 Id=0.179; Q=  0.0 : NODOM :0
492-548:429-485 Id=0.719; Q=173.2 : SH3
Q = -10 log10(p); Q > 30.0 implies p < 0.001
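As a rough illustration of the Q-value definition, the Python sketch below converts between Q and p and flags region rows whose Q falls below 30. The regular expression is written only for the row layout printed on these slides, and the helper names (`q_from_p`, `p_from_q`, `parse_region`, `weak_regions`) are hypothetical, not part of any FASTA tool.

```python
import math
import re


def q_from_p(p):
    """Q = -10 * log10(p); p must be greater than 0."""
    return -10.0 * math.log10(p)


def p_from_q(q):
    """Inverse of q_from_p."""
    return 10.0 ** (-q / 10.0)


# Matches rows such as "80-116: 79-115 Id=0.757; Q= 97.9 : Hs1_Cortactin"
ROW = re.compile(
    r"(\d+)-\s*(\d+):\s*(\d+)-\s*(\d+)\s*:?\s*"
    r"Id=([\d.]+);\s*Q=\s*([\d.]+)\s*:\s*(\S+)"
)


def parse_region(row):
    """Parse one region row into a dict; raise ValueError if it does not match."""
    m = ROW.search(row)
    if m is None:
        raise ValueError(f"unrecognized region row: {row!r}")
    q1, q2, s1, s2 = (int(m.group(i)) for i in range(1, 5))
    return {
        "query": (q1, q2),
        "subject": (s1, s2),
        "identity": float(m.group(5)),
        "q": float(m.group(6)),
        "domain": m.group(7),
    }


def weak_regions(rows, q_min=30.0):
    """Return parsed regions whose Q-value is below q_min (p above ~0.001)."""
    return [r for r in map(parse_region, rows) if r["q"] < q_min]


if __name__ == "__main__":
    rows = [
        "80-116: 79-115 Id=0.757; Q= 97.9 : Hs1_Cortactin",
        "214-491:213-428 Id=0.179; Q= 0.0 : NODOM :0",
        "492-548:429-485 Id=0.719; Q=173.2 : SH3",
    ]
    for region in weak_regions(rows):
        print(region["query"], region["domain"], "Q =", region["q"])
```

A low Q-value alone does not distinguish over-extension from a more distant but genuine relationship; it only marks the region for a follow-up test.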
Homology, non-homology, and over-extension
• Sequences that share statistically significant sequence similarity are homologous (simplest explanation)
• But not all regions of the alignment contribute uniformly to the score
  – lower identity/Q-value because of non-homology (over-extension)?
  – lower identity/Q-value because of a more distant relationship (domains have different ages)?
• Test by searching with isolated region
  – can the distant domain (?) find closer (significant) homologs?
• Similar (homology) or distinct (non-homology) structure is the gold standard
• Multiple sequence alignment can obscure over-extension
  – if the alignment is over-extended, part of the alignment is NOT homologous
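To test a questionable region by searching with it in isolation, the region first has to be cut out of the full sequence. A minimal Python sketch, assuming 1-based inclusive coordinates as used in the region tables; the function name `region_fasta` and the header format are hypothetical:

```python
def region_fasta(name, sequence, first, last, width=60):
    """Return a FASTA record for residues first..last (1-based, inclusive)."""
    if not (1 <= first <= last <= len(sequence)):
        raise ValueError("region lies outside the sequence")
    sub = sequence[first - 1:last]
    header = f">{name}_{first}-{last}"
    lines = [sub[i:i + width] for i in range(0, len(sub), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    # Toy sequence; a real query would be read from a sequence file.
    print(region_fasta("toy", "ACDEFGHIKLMNPQRSTVWY", 3, 12), end="")
```

The resulting record can then be used as the query for a new search; if the isolated region finds no significant homologs, over-extension is the more likely explanation.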
Improving sensitivity with protein/domain family models
• PSI-BLAST – method
  1. do BLAST search
  2. use query-based implied multiple sequence alignment to build Position Specific Scoring Matrix (PSSM)
  3. repeat steps 1 and 2 with PSSM, for 5 – 10 iterations
• PSI-BLAST – results:
  1. Typically 2X as sensitive as single sequence methods
  2. Over-extension can cause PSSM contamination
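The PSSM-building step can be sketched in Python as follows. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, all of which differ from the real program. The names `build_pssm` and `score_sequence` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Build log2-odds column scores from equal-length aligned sequences.

    alignment: list of aligned strings; gaps and unknown letters are ignored.
    background: dict of residue frequencies (assumed uniform if omitted).
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum position-specific scores for an ungapped sequence of PSSM length."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    # Toy alignment; in an iterative search these rows would be the
    # significant hits from the previous round, aligned to the query.
    pssm = build_pssm(["ACD", "ACE", "ACD"])
    print(score_sequence(pssm, "ACD"), score_sequence(pssm, "WWW"))
```

The sketch also shows why over-extension contaminates the model: any non-homologous residues included in the alignment rows contribute counts to their columns just like homologous ones.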
Improving sensitivity with protein/domain family models
• HMMER3 – jackhmmer – method
  1. do HMMER (Hidden Markov Model, HMM) search with single sequence
  2. use query-HMM-based implied multiple sequence alignment to build a more accurate HMM
  3. repeat steps 1 and 2 with HMM
• HMMER3 – results:
  1. Less over-extension because of probabilistic alignment
  2. Used to construct Pfam domain database
    • Many protein families are too diverse for one HMM; Pfam divides families into multiple HMMs and groups them in Clans
  3. Clearly homologous sequences are still missed
Missing homology beyond the HMM model
>>tr|Q8LNM4|Q8LNM4_ORYSJ Eukaryotic aspartyl protease family protein vs
>>tr|Q2QSI0|Q2QSI0_ORYSJ Glycosyl hydrolase family 9 protein, expressed OS=O (694 aa)
 qRegion: 134-277:172-311 : score=508; bits=240.8; LPr=67.0 : Aspartyl protease
 s-w opt: 508 Z-score: 1248.7 bits: 240.8 E(1): 9.6e-68
Smith-Waterman score: 508; 62.5% identity (79.2% similar) in 144 aa overlap

130 140 150 160 170 180 190 200
Q8LNM4 TDACKSIPTSNCSSNMCTYEGTINSKLGGHTLGIVATDTFAIGTATASLGFGCVVASGIDTMGGPSGLIGLGRAPSSLVS
::: :.: :: . :. : : : : :::::.::: :.: ::::::: :::: : ::..::::.: :::.
Q2QSI0 LCESISNDIHNCSGNVCMYEASTNA---GDTGGKVGTDTFAVGTAKANLAFGCVVASNIDTMDGSSGIVGLGRTPWSLVT
170 180 190 200 210 220 230

210 220 230 240 250 260 270 280
Q8LNM4 QMNITKFSYCLTPHDSGKNSRLLLGSSAKLAGGGNSTTTPFVKTSPGDDMSQYYPIQLDGIKAGDAAIALPPSGNTVLVQ
: .. :::::.:::.:::. :.:::.::::::: ...:::: : :.:.: :: .::. .::::: : :::::
Q2QSI0 QTGVAAFSYCLAPHDAGKNNALFLGSTAKLAGGGKTASTPFVNIS-GNDLSNYYKVQLEVLKAGDAMIPLPPSGVLWDNY
240 250 260 270 280 290 300 310
[Figure: domain maps of Q8LNM4 and Q2QSI0; domain label: Asp]
[Figure: tree of a diverse protein family; sequence labels: hamB2, hamA1a, humM1, humA2a, humD2, dogAd1, dogCCKB, ratCCKA, musEP2, musEP3, humTXA2, humMSH, humACTH, ratPOT, ratCGPCR, humEDG1, ratLH, bovOP, ratODOR, chkP2y, musP2u, gpPAF, chkGPCR, humRSC, dogRDC1, ratG10d, humfMLF, ratANG, herpesEC, humIL8, bovLCR1, ratRBS11, cmvHH3, cmvHH2, humSSR1, musdelto, humC5a, ratBK2, humTHR, ratRTA, humMRG, humMAS, ratNPYY1, ratNK1, flyNK, flyNPY, musGIR, ratNTR, musTRH, musGnRH, ratVIa, bovETA, musGRP, ratD1, bovH1, hum5HT1a]
Pfam misses/mis-aligns proteins distant from the model
• For diverse families, a single model can find, and miss, closely related homologs
• Even if homologs are found, alignments may be short
Homology Fundamentals
• Homologous sequences are unexpectedly similar (excess similarity)
  – excess compared to ??? – random similarity (similarity by chance, E()-value)
• Non-significant similarity IS NOT evidence for non-homology
  – significant similarity to a protein of different structure shows non-homology
• Homology at the (entire) sequence level is different from homology at the residue level
  – sequence homology is inferred from statistics
  – residue homology REQUIRES sequence homology
Effective Similarity Searching in Practice
1. Always search protein databases (possibly with translated DNA)
2. Use E()-values, not percent identity, to infer homology – E() < 0.001 is significant in a single search
3. Search smaller (comprehensive) databases
4. Change the scoring matrix for:
  – short sequences (exons, reads)
  – short evolutionary distances (mammals, vertebrates, α-proteobacteria)
  – high identity (>50% alignments) to reduce over-extension
5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss
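Points 2 and 3 can be illustrated with the standard relation between bit score and expectation, E = m · n · 2^(-S), where m is the query length and n the total database length. The Python sketch below applies that relation directly; it ignores the effective-length corrections that real search programs apply, and the function names and example numbers are hypothetical.

```python
import math


def expect_from_bits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bits_needed(e_threshold, query_len, db_len):
    """Bit score required to reach e_threshold in a search space m * n."""
    return math.log2(query_len * db_len / e_threshold)


if __name__ == "__main__":
    query_len = 200
    for db_len in (1e7, 1e8, 1e9):  # hypothetical database sizes (residues)
        e = expect_from_bits(40.0, query_len, db_len)
        need = bits_needed(0.001, query_len, db_len)
        print(f"db={db_len:.0e}  E(40 bits)={e:.3g}  bits for E<0.001: {need:.1f}")
```

Under this relation the same alignment score is ten times less significant in a database ten times larger, which is the motivation for searching smaller, comprehensive databases.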