Date post: | 22-Dec-2015 |
BIOL2119 Computational Biology
Iterative search
with Michael Cameron
Overview
• Multiple alignment
• Profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM
• Practical: building a simple bioinformatics search tool
Single sequence
• Given a pair of sequences:
HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL
• We perform a pairwise alignment:
Query: 5  ENILLSPVVVASSLG---LVSLG 24
          +N++LS   V   LG   L S+G
Sbjct: 5  QNVVLSAFSVLPPLGQLALASVG 27
• Related sequences or chance similarity?
Many sequences
• Given a collection of related sequences, we can construct a multiple alignment
HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV
Multiple alignments
• Multiple alignments can be generated automatically and then refined by hand
• Line up the sequences by shifting start/end locations and inserting gaps
• A complex but well-studied problem
• The most popular tool for multiple alignment is CLUSTALW (Higgins et al. 1994)
A larger multiple alignment
Increased sensitivity
• We can use a multiple alignment to detect homologies not previously possible with a single sequence
HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV

UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL
Multiple alignment search
• We can use a multiple alignment to search a database
• More sensitive than a single sequence
• Computationally difficult
• Common practice is to use profiles that describe the multiple alignment instead
Profiles
• A profile is constructed from a multiple alignment
• It describes which residues are preferred for each position/column in the alignment
• Two most common types of profiles:
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
PSSM:
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
D   3 -2 -1  2 -2 -1  2 -2 -1 -3 -3 -1 -2 -2 -2  1 -1 -2  3 -2
K  -1  4  4 -1 -3  0 -1 -2 -1 -4 -3  2 -2 -4 -2  2 -1 -4 -3 -3
N   1 -1  2 -1 -3  0  2 -2 -2 -3 -3  2 -3 -4  5 -1 -1 -4 -3 -3
M   1 -2 -1  3 -3 -1 -1  2  4 -2 -2 -2  3 -3 -2  1 -1 -3 -2 -2
E  -2  3  0  4 -4  3  2  2 -1 -4 -4  0 -3 -4 -2 -1 -2 -4 -3 -4
N  -2 -1  7  1 -4 -1 -1 -1  0 -4 -4 -1 -3 -4 -3  0 -1 -5 -3 -4
I  -1 -4 -4 -4 -2 -3 -4 -4 -4  4  0 -3  0 -1 -3 -3 -1 -4 -2  5
L   1 -3 -4 -4 -2 -3 -3 -3 -4  1  2 -3  0 -2 -3 -2 -1 -3 -2  5
L  -3 -3 -4 -4 -3 -4 -4 -4 -3  2  3 -4  0  6 -4 -3 -2 -1  1  0
S   0 -2  0 -1 -2 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  1 -4 -3 -2
P   1 -3 -3 -2 -3 -2 -2 -2 -3 -3 -3 -2 -3 -4  8 -1 -2 -4 -4 -3
V  -2 -3 -3 -4 -3 -2 -3 -4  4  0  1 -3 -1  4 -4 -3 -2  0  5  1
V   0 -3 -1 -2 -2 -2 -2  4 -3 -2 -3 -2 -2 -3 -2  3 -1 -4 -3  1
V  -1 -3 -4 -4 -2 -3 -4 -4 -4  3  3 -3  1 -1 -3 -3 -1 -3 -2  4
A   3 -2 -2 -2 -2 -1 -2 -2  4 -2  1 -2 -1 -2 -2  1  2 -3 -2 -1
S   0 -2 -1 -2 -2 -1 -1 -2 -2 -2 -2 -1 -2 -3  3  4  0 -4 -3  1
S   3 -3 -2 -3 -2 -2 -2 -2 -3  0 -1 -2 -1 -3  3  2 -1 -4 -3  3
L  -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5 -3  1  0 -4 -3 -2 -2 -2  0
G   2 -2 -1 -2 -2 -2 -2  5 -2 -4 -4 -2 -3 -4 -2  3 -1 -3 -3 -3
L  -2 -2 -3 -3 -2  2 -2 -4 -2  0  3 -2  5 -1 -3 -2 -2 -3 -2  1
V  -2 -3 -4 -4 -2 -3 -4 -5 -4  3  4 -3  1 -1 -4 -3 -2 -3 -2  2
S   1  2 -1 -2 -2  4  0 -2 -1 -3 -3  0 -2 -3 -2  2  2 -3 -2 -2
L  -2 -2 -3 -3 -2 -2 -3 -3 -3  2  4 -2  3  0 -3 -2 -1 -2 -2  0
G   4 -2 -2 -2 -2 -2 -2  3 -3 -3 -3 -2 -2 -3 -2  0  2 -3 -3 -2
G   2 -2 -1 -2 -2 -2 -2  4 -2 -3 -3 -2 -2 -3 -2  2  2 -3 -3 -2
K   1 -1 -1  2 -3 -1 -1  3 -2 -2 -3  3 -2 -3 -2 -1 -1 -4 -3  1
A   3 -2 -1 -2 -2 -2 -2  5 -2 -3 -3 -2 -2 -3 -2  0 -1 -3 -3 -2
T   1  2  2  0 -2  0  3 -2 -1 -3 -2  0 -2 -3 -2  0  2 -3 -2 -2
T  -1 -2 -1 -2 -2 -1 -2 -2 -2 -2 -2 -1 -2 -3 -2  2  6 -3 -2 -1
A   1  4 -1 -2 -3  0 -1 -2  4 -3 -3  3 -2 -3 -2 -1 -2 -4 -2 -3
S  -2  3  0  4 -4  2  0 -2 -1 -4 -4  2 -3 -4 -2  1 -1 -4 -3 -3
Q  -2  0 -1  0 -4  6  4 -3  0 -4 -3  0 -2 -4 -2 -1 -2 -3 -2 -3
A   1 -3 -4 -4 -2 -3 -3 -3 -4  4  3 -3  1 -1 -3 -2 -2 -3 -2  1
K   1 -1 -1  2 -3  3  0 -3 -2 -1  1  2 -1 -3 -2 -1 -1 -3 -2  1
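To make the idea of a profile concrete, here is a minimal sketch of deriving a PSSM from the six-sequence alignment shown earlier. It is not how PSI-BLAST computes its matrix: the uniform background frequency, the flat pseudocount of 1 and the half-bit rounding are all simplifying assumptions, and the function name `build_pssm` is made up for illustration. The resulting scores will therefore differ from the values on the slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # same column order as the slide's PSSM

# The six aligned sequences from the "Many sequences" slide.
ALIGNMENT = [
    "DKNMENILLSPVVVASSLGLVSLGGKATTASQAK",
    "YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID",
    "ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ",
    "ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ",
    "SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA",
    "ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV",
]


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")  # skip gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Log-odds in half-bit units, rounded to an integer.
            scores[aa] = round(2 * math.log2(freq / background))
        pssm.append(scores)
    return pssm


pssm = build_pssm(ALIGNMENT)
# Column 10 (0-based) is the fully conserved proline of the "SP" motif.
print(pssm[10]["P"], pssm[10]["W"])
```

Conserved columns (such as the N, S and P shared by all six sequences) receive a high score for the conserved residue and low scores for everything else, which is the behaviour the profile is meant to capture.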
Pros and cons of PSSMs
• Fast and simple to use
• Little modification to the BLAST algorithm required to use them
• PSSM replaces the query sequence and scoring matrix
• Statistical theory for scoring not as solid
• Not as detailed as HMMs
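The point that a PSSM replaces both the query and the scoring matrix can be sketched as follows: instead of looking up a score for (query residue, subject residue), the score is looked up by (profile position, subject residue). The six rows are copied from the slide's PSSM (the rows for query residues N, I, L, L, S, P); the ungapped sliding-window scan and the names `score_window` and `best_window` are illustrative assumptions, not BLAST's actual search procedure.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
COLUMN = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Rows 6 to 11 of the slide's PSSM, one row per profile position.
PSSM_ROWS = [
    [-2, -1, 7, 1, -4, -1, -1, -1, 0, -4, -4, -1, -3, -4, -3, 0, -1, -5, -3, -4],   # N
    [-1, -4, -4, -4, -2, -3, -4, -4, -4, 4, 0, -3, 0, -1, -3, -3, -1, -4, -2, 5],   # I
    [1, -3, -4, -4, -2, -3, -3, -3, -4, 1, 2, -3, 0, -2, -3, -2, -1, -3, -2, 5],    # L
    [-3, -3, -4, -4, -3, -4, -4, -4, -3, 2, 3, -4, 0, 6, -4, -3, -2, -1, 1, 0],     # L
    [0, -2, 0, -1, -2, -1, -1, -1, -2, -3, -3, -1, -2, -3, -2, 5, 1, -4, -3, -2],   # S
    [1, -3, -3, -2, -3, -2, -2, -2, -3, -3, -3, -2, -3, -4, 8, -1, -2, -4, -4, -3], # P
]

UNKNOWN = "ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL"


def score_window(rows, window):
    """Sum the position-specific score of each residue in the window."""
    return sum(row[COLUMN[res]] for row, res in zip(rows, window))


def best_window(rows, sequence):
    """Return (score, start) of the best ungapped match of the profile."""
    width = len(rows)
    return max(
        (score_window(rows, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


print(score_window(PSSM_ROWS, "NILLSP"))  # the query's own residues
print(score_window(PSSM_ROWS, "NVVLSA"))  # the matching stretch of UNKNOWN
print(best_window(PSSM_ROWS, UNKNOWN))
```

Note that no separate substitution matrix appears anywhere in the code: all scoring information lives in the profile rows.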
Hidden Markov Models
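[Figure: two-state HMM with states 1 and 2. Transition probabilities shown on the arrows: 0.1, 0.2, 0.8, 0.9.]
Emission probabilities, state 1: A 0.1, G 0.3, T 0.5, C 0.1
Emission probabilities, state 2: A 0.7, G 0.1, T 0.1, C 0.1
State sequence (hidden):
1 1 1 1 1 1 1 2 2 2 2 2 2 1 1
Symbol sequence:
T G T T C G T A A A C A A T G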
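As an illustration of what "hidden" means here, the following sketch recovers the most probable state sequence from the symbol sequence using the Viterbi algorithm. Several things are assumptions rather than facts from the slide: the figure's four transition values are read as "state 1 stays with 0.9 and leaves with 0.1; state 2 stays with 0.8 and leaves with 0.2", the first emission table is taken to belong to state 1, and the start distribution (0.5 / 0.5) is invented because the slide does not give one.

```python
import math

STATES = (1, 2)
START = {1: 0.5, 2: 0.5}  # assumption: not given on the slide
# Assumed reading of the figure's arrows.
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}
EMIT = {
    1: {"A": 0.1, "G": 0.3, "T": 0.5, "C": 0.1},
    2: {"A": 0.7, "G": 0.1, "T": 0.1, "C": 0.1},
}


def viterbi(symbols):
    """Return the most probable hidden state path for the symbols."""
    # score[s] = log-probability of the best path so far ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    backpointers = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this step.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][sym])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(" ".join(str(s) for s in viterbi("TGTTCGTAAACAATG")))
```

Under these assumed parameters the decoded path is the same as the hidden state sequence shown on the slide: the T-rich stretches are assigned to state 1 and the A-rich stretch to state 2.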
Profile HMMs
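[Figure: profile HMM with match states M1 to M4, insert states I1 to I3 and delete states D1 to D4.]
Emission probability tables shown in the figure (one per line):
A 0.01, C 0.04, D 0.31, …
A 0.03, C 0.15, D 0.02, …
A 0.22, C 0.02, D 0.13, …
A 0.05, C 0.03, D 0.09, …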
Pros and cons of HMMs
• Strong statistical foundation
• Can be trained on aligned and non-aligned data
• More detailed
• Alignment is more computationally expensive
Using profiles
• There exist many databases containing profiles of known families:
  • e.g. the Pfam database
• Tools exist for searching a profile database with a sequence query and vice versa:
  • HMMER (Durbin et al. 1998)
  • IMPALA (Schaffer et al. 1999)
• Profiles form the basis of iterative search
Iterative search
• Search database with query sequence
• Construct multiple alignment from high-scoring aligned sequences
• Construct a profile using the multiple alignment
• Search database with profile. Repeat.
• Popular tools for iterative search:
  • PSI-BLAST (Altschul et al. 1997) uses PSSMs
  • SAM (Karplus et al. 1998) uses HMMs
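The search / align / build-profile / repeat steps can be sketched as a small driver loop. This is only an outline under stated assumptions: `search` and `build_profile` are hypothetical callables supplied by the caller, and "converged" is taken to mean that a round adds no new sequences below the inclusion threshold. The toy data at the bottom reuses the E-values of three hits from the PSI-BLAST output in these slides so that the loop can be run without a real database.

```python
def iterative_search(query, search, build_profile, threshold=0.001, max_iterations=5):
    """Repeat search -> profile until the included hit set stops changing.

    search(model) returns {sequence_id: e_value}; build_profile(query, ids)
    returns the model for the next round. Both are supplied by the caller.
    """
    model = query          # round 1 searches with the plain query sequence
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(model)
        now_included = {sid for sid, e in hits.items() if e <= threshold}
        if now_included == included:
            return included, iteration, True    # converged
        included = now_included
        model = build_profile(query, included)  # alignment -> profile
    return included, max_iterations, False      # iteration limit reached


# Toy stand-ins: E-values depend on how many sequences are in the model.
ROUNDS = {
    0: {"NP_875918.1": 4e-17, "NP_436887.1": 0.020, "NP_494333.1": 0.022},
    1: {"NP_875918.1": 7e-75, "NP_436887.1": 2e-07, "NP_494333.1": 5e-04},
    3: {"NP_875918.1": 3e-68, "NP_436887.1": 3e-63, "NP_494333.1": 2e-17},
}


def toy_search(model):
    size = len(model) if isinstance(model, frozenset) else 0
    return ROUNDS[size]


def toy_build_profile(query, ids):
    return frozenset(ids)  # a real tool would build a PSSM or HMM here


result = iterative_search("QUERY", toy_search, toy_build_profile)
print(result)
```

With the toy data, one sequence is included after round 1, three after round 2, and round 3 adds nothing new, so the loop stops there.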
Iterative process flow chart
[Flow chart: Query sequence → SEARCH Database → Results → Converged? If No: Multiple alignment → PSSM → SEARCH again. If Yes: End.]
PSI-BLAST
• Iterative version of BLAST that uses PSSMs
• Each iteration takes about as long as a regular BLAST search
• Maximum number of iterations
  • Typically between 5 and 20
• Threshold for inclusion in multiple alignment and PSSM construction
  • Typically E-value of 0.001 or less
• A PSI-BLAST search takes considerably longer than BLAST but is much more sensitive
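For readers who want to try these two parameters, the sketch below assembles a command line for the `psiblast` program from NCBI BLAST+, where the iteration limit is `-num_iterations` and the inclusion threshold is `-inclusion_ethresh`. This assumes BLAST+ is the installed toolkit; the slides do not say which PSI-BLAST program or version was used. The file and database names are hypothetical placeholders, and the helper `psiblast_command` is made up for illustration.

```python
def psiblast_command(query_fasta, database, num_iterations=5,
                     inclusion_ethresh=0.001, out_pssm=None):
    """Build an argument list for the BLAST+ psiblast program."""
    cmd = [
        "psiblast",
        "-query", query_fasta,                         # protein query in FASTA format
        "-db", database,                               # formatted protein database
        "-num_iterations", str(num_iterations),        # maximum number of rounds
        "-inclusion_ethresh", str(inclusion_ethresh),  # PSSM inclusion threshold
    ]
    if out_pssm is not None:
        cmd += ["-out_pssm", out_pssm]                 # save the final PSSM
    return cmd


# Hypothetical file and database names, for illustration only.
cmd = psiblast_command("query.fasta", "nr", num_iterations=5,
                       inclusion_ethresh=0.001, out_pssm="query.pssm")
print(" ".join(cmd))
# To actually run it (requires BLAST+ and the database to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```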
SAM
• Iterative tool that uses Hidden Markov Models instead of Position-Specific Score Matrices
• Only uses 4 iterations
• Also uses BLAST for searching
  • Refines alignment scoring using an HMM
• Better accuracy than BLAST
• About 3 times slower than BLAST
Profile corruption
• False positives are high-scoring alignments to sequences that are not in fact related to the query
• A single false positive can corrupt the profile
• The profile now includes information about an alignment with an unrelated sequence
• Decreasing the E-value threshold reduces the likelihood of false positives, but decreases sensitivity
• Selectivity / sensitivity tradeoff
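The tradeoff can be seen by applying different inclusion thresholds to a fixed hit list. The sketch below uses the top five E-values from the round 1 PSI-BLAST output in these slides; the function name `included_hits` and the alternative thresholds are illustrative assumptions.

```python
# (accession, E-value) for the top five round 1 hits in the example output.
ROUND1_HITS = [
    ("NP_875918.1", 4e-17),   # Kef-type K+ transport system
    ("NP_436887.1", 0.020),   # putative ionic voltage-gated channel
    ("NP_494333.1", 0.022),   # TWiK family of potassium channels
    ("CAC21575.2", 0.90),     # MHC class I antigen
    ("XP_330590.1", 1.5),     # hypothetical protein
]


def included_hits(hits, threshold):
    """Return the accessions that would be used to build the profile."""
    return [acc for acc, e_value in hits if e_value <= threshold]


for threshold in (0.001, 0.05, 1.0):
    print(threshold, included_hits(ROUND1_HITS, threshold))
```

A strict threshold of 0.001 admits only the first hit. Relaxing it to 0.05 also admits the two channel proteins, and relaxing it to 1.0 additionally admits the MHC class I antigen, a much weaker hit that may well be a chance match and is the kind of sequence that could corrupt the profile.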
Searching....done
Results from round 1
                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...    82  4e-17
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    33  0.020
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    33  0.022
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28  0.90
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    27  1.5
gi|482884|gb|AAC46500.1| circumsporozoite protein                      27  1.6
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27  1.9
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27  1.9
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26  2.5
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    26  2.6
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26  3.1
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26  3.3
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25  7.2
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25  7.5

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.
 Score = 82.0 bits (201), Expect = 4e-17
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175
Searching....done
Results from round 2
                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value
Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   273  7e-75

Sequences not found previously or not previously below threshold:

gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    50  2e-07
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    39  5e-04
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28  0.64
gi|15922436|ref|NP_378105.1| 266aa long conserved hypothetical p...    28  0.83
gi|482884|gb|AAC46500.1| circumsporozoite protein                      28  1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28  1.2
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27  1.7
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27  1.8
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    27  1.9
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26  2.4
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26  2.5
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26  2.6
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25  4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25  5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25  7.7
gi|32417378|ref|XP_329167.1| predicted protein [Neurospora crass...    25  7.8

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.
 Score = 273 bits (699), Expect = 7e-75
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175

Query: 366 AERIGAQPNDPSASEHTHFKNIPIGFWWAVVTMTTLGYGDMYPQTWSGMLVGALCALAGV 425
           AE         S+ +     +IP   WW+V T++ +GYGD  P T  G ++ ++ +L G+
Sbjct: 176 AE---------SSIQPELLGSIPRCLWWSVTTVSAVGYGDSIPVTAIGKIIASVTSLLGI 226

Query: 426 LTIAMPVPVIVNNF 439
             IA+P  ++   F
Sbjct: 227 GAIAIPTGILAAGF 240
Searching....done
Results from round 3
                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value
Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   251  3e-68
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...   235  3e-63
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    83  2e-17

Sequences not found previously or not previously below threshold:

gi|482884|gb|AAC46500.1| circumsporozoite protein                      28  1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28  1.1
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27  1.6
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27  1.7
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26  2.5
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         26  4.0
gi|15805398|ref|NP_294092.1| hypothetical protein [Deinococcus r...    26  4.3
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25  4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25  5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25  7.6
gi|15889580|ref|NP_355261.1| AGR_C_4191p [Agrobacterium tumefaci...    25  8.4
gi|15598989|ref|NP_252483.1| hypothetical protein [Pseudomonas a...    25  9.8

CONVERGED!
>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.
 Score = 251 bits (642), Expect = 3e-68
 Identities = 53/205 (25%), Positives = 101/205 (49%), Gaps = 29/205 (14%)

Query: 245 LTYIEGVCVVWFTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVG 294
           + +++ V    F  E+L R+   P + ++ K          + + IID +AI+P ++ V
Sbjct: 55  IDFLDWVIGGLFCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV- 113
Summary
• Multiple alignment
  – Advantages over using a single sequence
• Constructing and using profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM
Practical exercise
• Build a simple bioinformatics search tool:
  – Perform Smith-Waterman search between query and each database sequence
  – Calculate E-value for each alignment, and if below cutoff then display alignment
• For code and detailed instructions go to:
  http://www.cs.rmit.edu.au/~mcam/prac
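As a rough idea of what such a tool involves, here is a minimal sketch of the two steps in the exercise. It is not the code from the practical: the match/mismatch/linear-gap scoring is a simplification (a real tool would normally use a substitution matrix such as BLOSUM62 with affine gaps), the E-value uses the Karlin-Altschul form E = K·m·n·e^(−λS) with placeholder values for λ and K, and all function names are made up for illustration.

```python
import math


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned a, aligned b), linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def e_value(score, query_len, db_len, lam=0.3, K=0.1):
    """Expected number of chance hits; lam and K are placeholder values."""
    return K * query_len * db_len * math.exp(-lam * score)


def search(query, database, cutoff=10.0):
    """Align the query to every sequence; keep alignments below the cutoff."""
    db_len = sum(len(seq) for seq in database.values())
    results = []
    for name, seq in database.items():
        score, top, bottom = smith_waterman(query, seq)
        e = e_value(score, len(query), db_len)
        if e <= cutoff:
            results.append((e, name, score, top, bottom))
    return sorted(results)


DATABASE = {
    "HS47_CHICK": "DKNMENILLSPVVVASSLGLVSLGGKATTASQAK",
    "SPI1_MYXVL": "YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID",
}
for e, name, score, top, bottom in search("ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL", DATABASE):
    print(f"{name}  score={score}  E={e:.2g}\n  {top}\n  {bottom}")
```

Meaningful E-values require λ and K estimated for the scoring system actually in use, so the numbers printed by this sketch should be treated as illustrative only.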