
1 BIOL2119 Computational Biology Iterative search with Michael Cameron.

Date post: 22-Dec-2015

BIOL2119 Computational Biology

Iterative search
with Michael Cameron


Overview

• Multiple alignment
• Profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM
• Practical: building a simple bioinformatics search tool


Single sequence

• Given a pair of sequences:

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL

• We perform a pairwise alignment:

Query: 5 ENILLSPVVVASSLG---LVSLG 24
         +N++LS   V   LG   L S+G
Sbjct: 5 QNVVLSAFSVLPPLGQLALASVG 27

• Related sequences or chance similarity?


Many sequences

• Given a collection of related sequences, we can construct a multiple alignment

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV


Multiple alignments

• Multiple alignments can be generated automatically and then refined by hand
• Line up the sequences by shifting start/end locations and inserting gaps
• A complex but well-studied problem
• Most popular tool for multiple alignment is CLUSTALW (Higgins et al. 1994)


A larger multiple alignment


Increased sensitivity

• We can use a multiple alignment to detect homologies that could not previously be detected with a single sequence

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV

UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL


Multiple alignment search

• We can use a multiple alignment to search a database
• More sensitive than a single sequence
• Computationally difficult
• Common practice is to use profiles that describe the multiple alignment instead


Profiles

• A profile is constructed from a multiple alignment
• It describes which residues are preferred for each position/column in the alignment
• Two most common types of profiles:
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
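
As a rough illustration of how a profile can record which residues are preferred in each column, the sketch below turns the six-sequence serpin alignment from the earlier slide into log-odds scores. Several things here are assumptions rather than anything stated in the slides: the function name `build_pssm`, a uniform background frequency of 1/20, add-one pseudocounts, skipping gap characters, and scores in bits. PSI-BLAST's real construction weights sequences and pseudocounts in a more elaborate way.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # same column order as the PSSM slide


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]  # gaps are skipped
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


alignment = [
    "DKNMENILLSPVVVASSLGLVSLGGKATTASQAK",  # HS47_CHICK
    "YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID",  # SPI1_MYXVL
    "ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ",  # PAI1_MOUSE
    "ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ",  # PAI1_BOVIN
    "SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA",  # GDN_HUMAN
    "ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV",  # PRTZ_HORVU
]
pssm = build_pssm(alignment)

# Column index 10 holds P in all six sequences, so P gets the top score there
print(max(pssm[10], key=pssm[10].get))
```

Fully conserved columns (such as the N, S and P columns here) end up with one strongly positive score and negative scores for every other residue, which is the pattern visible in the PSSM on the next slide.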


PSSM:

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
D  3 -2 -1  2 -2 -1  2 -2 -1 -3 -3 -1 -2 -2 -2  1 -1 -2  3 -2
K -1  4  4 -1 -3  0 -1 -2 -1 -4 -3  2 -2 -4 -2  2 -1 -4 -3 -3
N  1 -1  2 -1 -3  0  2 -2 -2 -3 -3  2 -3 -4  5 -1 -1 -4 -3 -3
M  1 -2 -1  3 -3 -1 -1  2  4 -2 -2 -2  3 -3 -2  1 -1 -3 -2 -2
E -2  3  0  4 -4  3  2  2 -1 -4 -4  0 -3 -4 -2 -1 -2 -4 -3 -4
N -2 -1  7  1 -4 -1 -1 -1  0 -4 -4 -1 -3 -4 -3  0 -1 -5 -3 -4
I -1 -4 -4 -4 -2 -3 -4 -4 -4  4  0 -3  0 -1 -3 -3 -1 -4 -2  5
L  1 -3 -4 -4 -2 -3 -3 -3 -4  1  2 -3  0 -2 -3 -2 -1 -3 -2  5
L -3 -3 -4 -4 -3 -4 -4 -4 -3  2  3 -4  0  6 -4 -3 -2 -1  1  0
S  0 -2  0 -1 -2 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  1 -4 -3 -2
P  1 -3 -3 -2 -3 -2 -2 -2 -3 -3 -3 -2 -3 -4  8 -1 -2 -4 -4 -3
V -2 -3 -3 -4 -3 -2 -3 -4  4  0  1 -3 -1  4 -4 -3 -2  0  5  1
V  0 -3 -1 -2 -2 -2 -2  4 -3 -2 -3 -2 -2 -3 -2  3 -1 -4 -3  1
V -1 -3 -4 -4 -2 -3 -4 -4 -4  3  3 -3  1 -1 -3 -3 -1 -3 -2  4
A  3 -2 -2 -2 -2 -1 -2 -2  4 -2  1 -2 -1 -2 -2  1  2 -3 -2 -1
S  0 -2 -1 -2 -2 -1 -1 -2 -2 -2 -2 -1 -2 -3  3  4  0 -4 -3  1
S  3 -3 -2 -3 -2 -2 -2 -2 -3  0 -1 -2 -1 -3  3  2 -1 -4 -3  3
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5 -3  1  0 -4 -3 -2 -2 -2  0
G  2 -2 -1 -2 -2 -2 -2  5 -2 -4 -4 -2 -3 -4 -2  3 -1 -3 -3 -3
L -2 -2 -3 -3 -2  2 -2 -4 -2  0  3 -2  5 -1 -3 -2 -2 -3 -2  1
V -2 -3 -4 -4 -2 -3 -4 -5 -4  3  4 -3  1 -1 -4 -3 -2 -3 -2  2
S  1  2 -1 -2 -2  4  0 -2 -1 -3 -3  0 -2 -3 -2  2  2 -3 -2 -2
L -2 -2 -3 -3 -2 -2 -3 -3 -3  2  4 -2  3  0 -3 -2 -1 -2 -2  0
G  4 -2 -2 -2 -2 -2 -2  3 -3 -3 -3 -2 -2 -3 -2  0  2 -3 -3 -2
G  2 -2 -1 -2 -2 -2 -2  4 -2 -3 -3 -2 -2 -3 -2  2  2 -3 -3 -2
K  1 -1 -1  2 -3 -1 -1  3 -2 -2 -3  3 -2 -3 -2 -1 -1 -4 -3  1
A  3 -2 -1 -2 -2 -2 -2  5 -2 -3 -3 -2 -2 -3 -2  0 -1 -3 -3 -2
T  1  2  2  0 -2  0  3 -2 -1 -3 -2  0 -2 -3 -2  0  2 -3 -2 -2
T -1 -2 -1 -2 -2 -1 -2 -2 -2 -2 -2 -1 -2 -3 -2  2  6 -3 -2 -1
A  1  4 -1 -2 -3  0 -1 -2  4 -3 -3  3 -2 -3 -2 -1 -2 -4 -2 -3
S -2  3  0  4 -4  2  0 -2 -1 -4 -4  2 -3 -4 -2  1 -1 -4 -3 -3
Q -2  0 -1  0 -4  6  4 -3  0 -4 -3  0 -2 -4 -2 -1 -2 -3 -2 -3
A  1 -3 -4 -4 -2 -3 -3 -3 -4  4  3 -3  1 -1 -3 -2 -2 -3 -2  1
K  1 -1 -1  2 -3  3  0 -3 -2 -1  1  2 -1 -3 -2 -1 -1 -3 -2  1



Pros and cons of PSSMs

• Fast and simple to use
• Little modification to the BLAST algorithm required to use them
• PSSM replaces the query sequence and scoring matrix
• Statistical theory for scoring not as solid
• Not as detailed as HMMs
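
To make "PSSM replaces the query sequence and scoring matrix" concrete, here is a small sketch that scores database residues directly against PSSM rows. The six rows are copied from the PSSM slide (query positions N, I, L, L, S, P); the function names `score_window` and `best_window`, and the restriction to ungapped windows, are simplifications assumed for illustration and are not how BLAST itself extends hits.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows copied from the PSSM slide for query positions N, I, L, L, S, P
ROWS = """
-2 -1  7  1 -4 -1 -1 -1  0 -4 -4 -1 -3 -4 -3  0 -1 -5 -3 -4
-1 -4 -4 -4 -2 -3 -4 -4 -4  4  0 -3  0 -1 -3 -3 -1 -4 -2  5
 1 -3 -4 -4 -2 -3 -3 -3 -4  1  2 -3  0 -2 -3 -2 -1 -3 -2  5
-3 -3 -4 -4 -3 -4 -4 -4 -3  2  3 -4  0  6 -4 -3 -2 -1  1  0
 0 -2  0 -1 -2 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  1 -4 -3 -2
 1 -3 -3 -2 -3 -2 -2 -2 -3 -3 -3 -2 -3 -4  8 -1 -2 -4 -4 -3
"""
PSSM = [dict(zip(AMINO_ACIDS, map(int, line.split())))
        for line in ROWS.strip().splitlines()]


def score_window(pssm, window):
    """Sum the position-specific score of each residue in an ungapped window."""
    return sum(row[residue] for row, residue in zip(pssm, window))


def best_window(pssm, sequence):
    """Return (score, offset) of the best ungapped window in the sequence.

    Assumes the sequence is at least as long as the PSSM.
    """
    width = len(pssm)
    candidates = [(score_window(pssm, sequence[i:i + width]), i)
                  for i in range(len(sequence) - width + 1)]
    return max(candidates)


print(score_window(PSSM, "NILLSP"))     # the query's own residues
print(score_window(PSSM, "NVVFSP"))     # residues seen in other family members
print(best_window(PSSM, "DRNVVFSPYG"))  # scan a longer sequence
```

No separate substitution matrix is consulted: the score for a database residue depends only on the PSSM row for that position, which is why the family consensus NVVFSP can outscore the original query residues NILLSP.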


Hidden Markov Models

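[Figure: two-state HMM with states 1 and 2; the transition arrows are labelled 0.1, 0.2, 0.8 and 0.9]

Emission probabilities, state 1: A 0.1, G 0.3, T 0.5, C 0.1
Emission probabilities, state 2: A 0.7, G 0.1, T 0.1, C 0.1

State sequence (hidden):

1 1 1 1 1 1 1 2 2 2 2 2 2 1 1

Symbol sequence:

T G T T C G T A A A C A A T G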
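
The example above can be explored with a short Viterbi sketch that recovers a most probable hidden state path from the symbol sequence. The diagram itself is not recoverable from this transcript, so the wiring is an assumption: state 1 is taken to be the T-rich state and state 2 the A-rich state, with self-transitions of 0.9 and 0.8 and switches of 0.1 and 0.2. The start probabilities of 0.5 each are also assumed, and the names `viterbi` and `joint_log_prob` are illustrative.

```python
import math

STATES = (1, 2)
START = {1: 0.5, 2: 0.5}                             # assumption, not on the slide
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}   # assumed arrow assignment
EMIT = {1: {"A": 0.1, "G": 0.3, "T": 0.5, "C": 0.1},
        2: {"A": 0.7, "G": 0.1, "T": 0.1, "C": 0.1}}


def joint_log_prob(states, symbols):
    """Log probability of a given state path emitting the given symbols."""
    logp = math.log(START[states[0]]) + math.log(EMIT[states[0]][symbols[0]])
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][sym])
    return logp


def viterbi(symbols):
    """Return (most probable state path, its log probability)."""
    best = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    back = []
    for sym in symbols[1:]:
        pointers, scores = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][cur]))
            pointers[cur] = prev
            scores[cur] = (best[prev] + math.log(TRANS[prev][cur])
                           + math.log(EMIT[cur][sym]))
        back.append(pointers)
        best = scores
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):   # trace the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


symbols = "T G T T C G T A A A C A A T G".split()
slide_path = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
path, logp = viterbi(symbols)
print(path, logp)
print(joint_log_prob(slide_path, symbols))
```

Comparing the two printed log probabilities shows how the path on the slide ranks against the best path under these assumed parameters; by construction the Viterbi value can never be lower.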


Profile HMMs

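[Figure: profile HMM with match states M1 to M4, insert states I1 to I3 and delete states D1 to D4]

Emission probabilities shown for the four match states, in order (residues A, C and D only):

A 0.01  C 0.04  D 0.31
A 0.03  C 0.15  D 0.02
A 0.22  C 0.02  D 0.13
A 0.05  C 0.03  D 0.09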


Pros and cons of HMMs

• Strong statistical foundation
• Can be trained on aligned and non-aligned data
• More detailed
• Alignment is more computationally expensive


Using profiles

• There exist many databases containing profiles of known families:
  – e.g. the Pfam database
• Tools exist for searching a profile database with a sequence query and vice versa:
  – HMMER (Durbin et al. 1998)
  – IMPALA (Schaffer et al. 1999)
• Profiles form the basis of iterative search


Iterative search

• Search database with query sequence
• Construct multiple alignment from high-scoring aligned sequences
• Construct a profile using the multiple alignment
• Search database with profile. Repeat.
• Popular tools for iterative search:
  – PSI-BLAST (Altschul et al. 1997) uses PSSMs
  – SAM (Karplus et al. 1998) uses HMMs
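
The loop described above can be sketched in a few lines. Everything concrete here is a stand-in: `iterative_search`, `toy_search` and `toy_build_profile` are hypothetical names, the "profile" is just the set of included sequence identifiers, and the toy E-values are invented so that related sequences score better as the profile grows. Convergence is taken to mean that a round includes exactly the same sequences as the previous round.

```python
def iterative_search(query, database, search, build_profile,
                     threshold=0.001, max_iterations=5):
    """Repeat search -> profile -> search until the included set stops changing.

    `search(model, database)` must return {sequence_id: e_value};
    `build_profile(query, hit_ids)` must return the model for the next round.
    Both are hypothetical callables supplied by the caller.
    """
    included = set()
    model = build_profile(query, included)
    for iteration in range(1, max_iterations + 1):
        results = search(model, database)
        hits = {sid for sid, e in results.items() if e <= threshold}
        if hits == included:          # nothing new was added: converged
            return included, iteration
        included = hits
        model = build_profile(query, included)
    return included, max_iterations


# Toy stand-ins: every sequence already in the profile improves E-values 100-fold
def toy_build_profile(query, hit_ids):
    return frozenset(hit_ids)


def toy_search(model, database):
    boost = 100 ** len(model)
    return {sid: base_e / boost for sid, base_e in database.items()}


toy_database = {"a": 1e-5, "b": 0.02, "c": 1000.0}  # invented first-round E-values
print(iterative_search("query", toy_database, toy_search, toy_build_profile))
```

In this toy run, sequence "b" misses the threshold in the first round but is pulled in once "a" has contributed to the profile, while "c" never qualifies, so the search stops in the third round.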


Iterative process flow chart

[Flow chart: Query sequence → SEARCH (against Database) → Results → Converged? Yes: End. No: Multiple alignment → PSSM → SEARCH again.]


PSI-BLAST

• Iterative version of BLAST that uses PSSMs
• Each iteration takes about as long as a regular BLAST search
• Maximum number of iterations
  – Typically between 5 and 20
• Threshold for inclusion in multiple alignment and PSSM construction
  – Typically E-value of 0.001 or less
• A PSI-BLAST search takes considerably longer than BLAST but is much more sensitive
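
The two parameters above map onto command-line options of the `psiblast` program in the NCBI BLAST+ suite, which postdates these slides. The sketch below only assembles such a command; the helper name `psiblast_command`, the file name `query.fasta`, the database name `nr` and the output file name are all placeholders, and the search is run only if the flag is switched on and the program is installed.

```python
import shutil
import subprocess


def psiblast_command(query_fasta, database, iterations=5,
                     inclusion_ethresh=0.001, out_file="psiblast_results.txt"):
    """Build a BLAST+ psiblast command line as a list of arguments."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),            # maximum number of rounds
        "-inclusion_ethresh", str(inclusion_ethresh),  # E-value cutoff for the PSSM
        "-out", out_file,
    ]


command = psiblast_command("query.fasta", "nr")  # placeholder file and database
print(" ".join(command))

RUN_SEARCH = False  # set to True only where psiblast and the database exist
if RUN_SEARCH and shutil.which("psiblast"):
    subprocess.run(command, check=True)
```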


SAM

• Iterative tool that uses Hidden Markov Models instead of Position-Specific Score Matrices
• Only uses 4 iterations
• Also uses BLAST for searching
  – Refines alignment scoring using an HMM
• Better accuracy than BLAST
• About 3 times slower than BLAST


Profile corruption

• False positives are high-scoring alignments with sequences that are not in fact related to the query
• A single false positive can corrupt the profile
• The profile now includes information about an alignment with an unrelated sequence
• Decreasing the E-value threshold reduces the likelihood of false positives, but decreases sensitivity
• Selectivity / sensitivity tradeoff


Searching....done
Results from round 1

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...    82   4e-17
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    33   0.020
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    33   0.022
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28   0.90
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    27   1.5
gi|482884|gb|AAC46500.1| circumsporozoite protein                      27   1.6
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.9
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.9
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    26   2.6
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26   3.1
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26   3.3
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   7.2
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.5

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 82.0 bits (201), Expect = 4e-17
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175


Searching....done
Results from round 2

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   273   7e-75

Sequences not found previously or not previously below threshold:

gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    50   2e-07
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    39   5e-04
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28   0.64
gi|15922436|ref|NP_378105.1| 266aa long conserved hypothetical p...    28   0.83
gi|482884|gb|AAC46500.1| circumsporozoite protein                      28   1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28   1.2
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.7
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.8
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    27   1.9
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26   2.4
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26   2.6
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25   4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.7
gi|32417378|ref|XP_329167.1| predicted protein [Neurospora crass...    25   7.8

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 273 bits (699), Expect = 7e-75
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175

Query: 366 AERIGAQPNDPSASEHTHFKNIPIGFWWAVVTMTTLGYGDMYPQTWSGMLVGALCALAGV 425
           AE         S+ +     +IP   WW+V T++ +GYGD  P T  G ++ ++ +L G+
Sbjct: 176 AE---------SSIQPELLGSIPRCLWWSVTTVSAVGYGDSIPVTAIGKIIASVTSLLGI 226

Query: 426 LTIAMPVPVIVNNF 439
             IA+P  ++   F
Sbjct: 227 GAIAIPTGILAAGF 240


Searching....done
Results from round 3

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   251   3e-68
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...   235   3e-63
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    83   2e-17

Sequences not found previously or not previously below threshold:

gi|482884|gb|AAC46500.1| circumsporozoite protein                      28   1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28   1.1
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.6
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.7
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         26   4.0
gi|15805398|ref|NP_294092.1| hypothetical protein [Deinococcus r...    26   4.3
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25   4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.6
gi|15889580|ref|NP_355261.1| AGR_C_4191p [Agrobacterium tumefaci...    25   8.4
gi|15598989|ref|NP_252483.1| hypothetical protein [Pseudomonas a...    25   9.8

CONVERGED!

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 251 bits (642), Expect = 3e-68
 Identities = 53/205 (25%), Positives = 101/205 (49%), Gaps = 29/205 (14%)

Query: 245 LTYIEGVCVVWFTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVG 294
           + +++ V    F  E+L R+   P + ++ K          + + IID +AI+P ++ V
Sbjct: 55  IDFLDWVIGGLFCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV- 113


Summary

• Multiple alignment
  – Advantages over using a single sequence
• Constructing and using profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM


Practical exercise

• Build a simple bioinformatics search tool:
  – Perform Smith-Waterman search between query and each database sequence
  – Calculate E-value for each alignment, and if below cutoff then display alignment
• For code and detailed instructions go to:

http://www.cs.rmit.edu.au/~mcam/prac
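
For orientation, the two steps of the exercise might look roughly like the sketch below; it is not the code distributed at the URL above. The scoring scheme (match +2, mismatch -1, linear gap -2) is an assumption, as are the function names and the values of `lam` and `k` in the Karlin-Altschul estimate E = K·m·n·e^(-λS). In practice those two constants depend on the scoring scheme and are estimated empirically for gapped alignments.

```python
import math


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def e_value(score, query_len, db_len, lam=0.3, k=0.1):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values, not fitted parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def search(query, database, cutoff=1.0):
    """Score the query against every sequence; keep hits with E <= cutoff."""
    db_len = sum(len(seq) for seq in database.values())
    hits = []
    for name, seq in database.items():
        score = smith_waterman_score(query, seq)
        e = e_value(score, len(query), db_len)
        if e <= cutoff:
            hits.append((e, score, name))
    return sorted(hits)  # most significant (lowest E-value) first


toy_db = {"x": "ACGTACGT", "y": "TTTTTTTT"}  # invented sequences
for e, score, name in search("ACGTACGT", toy_db):
    print(name, score, e)
```

Displaying the alignment itself, as the exercise asks, would additionally require a traceback through the full score matrix, which this score-only version does not keep.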

