BIOL2119 Computational Biology Iterative search with Michael Cameron.

Post on 22-Dec-2015


BIOL2119 Computational Biology

Iterative search with Michael Cameron

Overview

• Multiple alignment
• Profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM
• Practical: building a simple bioinformatics search tool

Single sequence

• Given a pair of sequences:

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL

• We perform a pairwise alignment:

Query: 5  ENILLSPVVVASSLG---LVSLG 24
          +N++LS   V   LG   L S+G
Sbjct: 5  QNVVLSAFSVLPPLGQLALASVG 27

• Related sequences or chance similarity?

Many sequences

• Given a collection of related sequences, we can construct a multiple alignment

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV

Multiple alignments

• Multiple alignments can be generated automatically and then refined by hand
• Line up the sequences by shifting start/end locations and inserting gaps
• A complex but well-studied problem
• Most popular tool for multiple alignment is CLUSTALW (Higgins et al. 1994)

A larger multiple alignment

Increased sensitivity

• We can use a multiple alignment to detect homologies that could not previously be detected with a single sequence

HS47_CHICK/23-396  DKNMENILLSPVVVASSLGLVSLGGKATTASQAK
SPI1_MYXVL/5-352   YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID
PAI1_MOUSE/24-402  ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ
PAI1_BOVIN/27-402  ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ
GDN_HUMAN/20-398   SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA
PRTZ_HORVU/6-395   ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV

UNKNOWN            ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL

Multiple alignment search

• We can use a multiple alignment to search a database
• More sensitive than a single sequence
• Computationally difficult
• Common practice is to use profiles that describe the multiple alignment instead

Profiles

• A profile is constructed from a multiple alignment
• It describes which residues are preferred for each position/column in the alignment
• Two most common types of profiles:
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)

PSSM:

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
D  3 -2 -1  2 -2 -1  2 -2 -1 -3 -3 -1 -2 -2 -2  1 -1 -2  3 -2
K -1  4  4 -1 -3  0 -1 -2 -1 -4 -3  2 -2 -4 -2  2 -1 -4 -3 -3
N  1 -1  2 -1 -3  0  2 -2 -2 -3 -3  2 -3 -4  5 -1 -1 -4 -3 -3
M  1 -2 -1  3 -3 -1 -1  2  4 -2 -2 -2  3 -3 -2  1 -1 -3 -2 -2
E -2  3  0  4 -4  3  2  2 -1 -4 -4  0 -3 -4 -2 -1 -2 -4 -3 -4
N -2 -1  7  1 -4 -1 -1 -1  0 -4 -4 -1 -3 -4 -3  0 -1 -5 -3 -4
I -1 -4 -4 -4 -2 -3 -4 -4 -4  4  0 -3  0 -1 -3 -3 -1 -4 -2  5
L  1 -3 -4 -4 -2 -3 -3 -3 -4  1  2 -3  0 -2 -3 -2 -1 -3 -2  5
L -3 -3 -4 -4 -3 -4 -4 -4 -3  2  3 -4  0  6 -4 -3 -2 -1  1  0
S  0 -2  0 -1 -2 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  1 -4 -3 -2
P  1 -3 -3 -2 -3 -2 -2 -2 -3 -3 -3 -2 -3 -4  8 -1 -2 -4 -4 -3
V -2 -3 -3 -4 -3 -2 -3 -4  4  0  1 -3 -1  4 -4 -3 -2  0  5  1
V  0 -3 -1 -2 -2 -2 -2  4 -3 -2 -3 -2 -2 -3 -2  3 -1 -4 -3  1
V -1 -3 -4 -4 -2 -3 -4 -4 -4  3  3 -3  1 -1 -3 -3 -1 -3 -2  4
A  3 -2 -2 -2 -2 -1 -2 -2  4 -2  1 -2 -1 -2 -2  1  2 -3 -2 -1
S  0 -2 -1 -2 -2 -1 -1 -2 -2 -2 -2 -1 -2 -3  3  4  0 -4 -3  1
S  3 -3 -2 -3 -2 -2 -2 -2 -3  0 -1 -2 -1 -3  3  2 -1 -4 -3  3
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5 -3  1  0 -4 -3 -2 -2 -2  0
G  2 -2 -1 -2 -2 -2 -2  5 -2 -4 -4 -2 -3 -4 -2  3 -1 -3 -3 -3
L -2 -2 -3 -3 -2  2 -2 -4 -2  0  3 -2  5 -1 -3 -2 -2 -3 -2  1
V -2 -3 -4 -4 -2 -3 -4 -5 -4  3  4 -3  1 -1 -4 -3 -2 -3 -2  2
S  1  2 -1 -2 -2  4  0 -2 -1 -3 -3  0 -2 -3 -2  2  2 -3 -2 -2
L -2 -2 -3 -3 -2 -2 -3 -3 -3  2  4 -2  3  0 -3 -2 -1 -2 -2  0
G  4 -2 -2 -2 -2 -2 -2  3 -3 -3 -3 -2 -2 -3 -2  0  2 -3 -3 -2
G  2 -2 -1 -2 -2 -2 -2  4 -2 -3 -3 -2 -2 -3 -2  2  2 -3 -3 -2
K  1 -1 -1  2 -3 -1 -1  3 -2 -2 -3  3 -2 -3 -2 -1 -1 -4 -3  1
A  3 -2 -1 -2 -2 -2 -2  5 -2 -3 -3 -2 -2 -3 -2  0 -1 -3 -3 -2
T  1  2  2  0 -2  0  3 -2 -1 -3 -2  0 -2 -3 -2  0  2 -3 -2 -2
T -1 -2 -1 -2 -2 -1 -2 -2 -2 -2 -2 -1 -2 -3 -2  2  6 -3 -2 -1
A  1  4 -1 -2 -3  0 -1 -2  4 -3 -3  3 -2 -3 -2 -1 -2 -4 -2 -3
S -2  3  0  4 -4  2  0 -2 -1 -4 -4  2 -3 -4 -2  1 -1 -4 -3 -3
Q -2  0 -1  0 -4  6  4 -3  0 -4 -3  0 -2 -4 -2 -1 -2 -3 -2 -3
A  1 -3 -4 -4 -2 -3 -3 -3 -4  4  3 -3  1 -1 -3 -2 -2 -3 -2  1
K  1 -1 -1  2 -3  3  0 -3 -2 -1  1  2 -1 -3 -2 -1 -1 -3 -2  1
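To make the idea of a PSSM concrete, here is a minimal sketch that derives log-odds column scores from the six-sequence multiple alignment shown earlier. It is a simplified illustration, not the scheme PSI-BLAST uses: the uniform background frequency and the add-one pseudocount are assumptions chosen for brevity, and the names `build_pssm` and `score` are hypothetical. Scoring is ungapped and position-by-position.

```python
import math
from collections import Counter

# The multiple alignment from the slides ("-" marks a gap).
ALIGNMENT = [
    "DKNMENILLSPVVVASSLGLVSLGGKATTASQAK",
    "YNESDNVVFSPYGLTSALSVLRIAAGGNTKREID",
    "ASKDRNVVFSPYGVSSVLAMLQMTT--KTRRQIQ",
    "ASKDRNVVFSPYGVASVLAMLQLTTGGETRQQIQ",
    "SRPHDNIVISPHGIASVLGMLQLGADGRTKKQLA",
    "ERAAGNVAFSPLSLHVALSLITAGA-AATRDQLV",
]
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                    # assumption: add-one smoothing

def build_pssm(alignment):
    """Return one {residue: log-odds score} dict per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + PSEUDOCOUNT * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Ungapped score of a sequence laid directly over the profile columns."""
    return sum(col[res] for col, res in zip(pssm, sequence))

pssm = build_pssm(ALIGNMENT)
unknown = "ANPGQNVVLSAFSVLPPLGQLALASVGESHDELL"
print(round(score(pssm, unknown), 2))
```

Columns where every sequence agrees (for example the conserved S and P) receive the highest scores for that residue, which is how a profile rewards matches at conserved positions more than a general-purpose scoring matrix would.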

Pros and cons of PSSMs

• Fast and simple to use
• Little modification to the BLAST algorithm required to use them
• PSSM replaces the query sequence and scoring matrix
• Statistical theory for scoring not as solid
• Not as detailed as HMMs

Hidden Markov Models

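[Figure: two-state HMM with states 1 and 2; transition probabilities shown: 0.1, 0.2, 0.8, 0.9]

Emission probabilities, state 1: A 0.1, G 0.3, T 0.5, C 0.1
Emission probabilities, state 2: A 0.7, G 0.1, T 0.1, C 0.1

State sequence (hidden):
1 1 1 1 1 1 1 2 2 2 2 2 2 1 1

Symbol sequence:
T G T T C G T A A A C A A T G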
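The "hidden" part of the model can be illustrated by recovering the most likely state sequence from the symbols alone, using the Viterbi algorithm. The sketch below uses the two emission tables from the figure. The transition assignment is an assumption, because the extracted figure does not say which arrow carries which value: state 1 is taken to stay with 0.9 and leave with 0.1, and state 2 to stay with 0.8 and leave with 0.2. The equal start probabilities are also an assumption.

```python
import math

STATES = (1, 2)
START = {1: 0.5, 2: 0.5}                            # assumption: equal start probabilities
TRANS = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}}  # assumption: see lead-in
EMIT = {
    1: {"A": 0.1, "G": 0.3, "T": 0.5, "C": 0.1},    # T-rich state
    2: {"A": 0.7, "G": 0.1, "T": 0.1, "C": 0.1},    # A-rich state
}

def viterbi(symbols):
    """Return the most probable state path for the symbols (log-space Viterbi)."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][symbols[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for sym in symbols[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][sym])
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("TGTTCGTAAACAATG"))
```

Under these assumed parameters the decoded path is seven 1s, six 2s, then two 1s, the same as the hidden state sequence on the slide.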

Profile HMMs

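[Figure: profile HMM with match states M1-M4, delete states D1-D4 and insert states I1-I3]

Emission probabilities shown in the figure (residues A, C and D only), in figure order:
  A 0.01, C 0.04, D 0.31
  A 0.03, C 0.15, D 0.02
  A 0.22, C 0.02, D 0.13
  A 0.05, C 0.03, D 0.09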

Pros and cons of HMMs

• Strong statistical foundation
• Can be trained on aligned and non-aligned data
• More detailed
• Alignment is more computationally expensive

Using profiles

• There exist many databases containing profiles of known families:
  – e.g. the Pfam database
• Tools exist for searching a profile database with a sequence query and vice versa:
  – HMMER (Durbin et al. 1998)
  – IMPALA (Schaffer et al. 1999)
• Profiles form the basis of iterative search

Iterative search

• Search database with query sequence
• Construct multiple alignment from high-scoring aligned sequences
• Construct a profile using the multiple alignment
• Search database with profile. Repeat.
• Popular tools for iterative search:
  – PSI-BLAST (Altschul et al. 1997) uses PSSMs
  – SAM (Karplus et al. 1998) uses HMMs
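The search/align/profile/repeat cycle can be sketched as a small driver loop. This is an illustrative skeleton only, not how PSI-BLAST or SAM is implemented: the `search` and `build_profile` callables, the toy E-value table and the convergence rule (stop when the set of included sequences stops changing) are all assumptions made for the example.

```python
def iterative_search(query, search, build_profile, threshold=0.001, max_iterations=5):
    """Repeat search -> select -> rebuild profile until the selected set is stable.

    Returns (included_ids, iterations_run, converged).
    """
    model = query              # round 1 searches with the bare query sequence
    included = frozenset()
    for iteration in range(1, max_iterations + 1):
        hits = search(model)   # list of (sequence_id, e_value) pairs
        selected = frozenset(sid for sid, e in hits if e <= threshold)
        if selected == included:
            return included, iteration, True   # nothing changed: converged
        included = selected
        model = build_profile(query, included)
    return included, max_iterations, False

# Hypothetical toy data: E-values improve as more sequences join the profile.
TOY_EVALUES = {
    0: [("seqA", 1e-20), ("seqB", 0.02), ("seqC", 0.5)],
    1: [("seqA", 1e-40), ("seqB", 1e-5), ("seqC", 0.01)],
    2: [("seqA", 1e-40), ("seqB", 1e-30), ("seqC", 0.3)],
}

def toy_search(model):
    # A bare query is a string; a "profile" here is just the set of included ids.
    size = len(model) if isinstance(model, frozenset) else 0
    return TOY_EVALUES[size]

def toy_build_profile(query, included):
    return included

result = iterative_search("QUERYSEQ", toy_search, toy_build_profile)
print(result)
```

In the toy run, seqB only falls below the threshold once seqA has been folded into the profile, which is the gain in sensitivity that iteration is meant to provide.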

Iterative process flow chart

[Flow chart: Query sequence → SEARCH (Database) → Results → Converged? Yes: End. No: Multiple alignment → PSSM → SEARCH again.]

PSI-BLAST

• Iterative version of BLAST that uses PSSMs
• Each iteration takes about as long as a regular BLAST search
• Maximum number of iterations
  – Typically between 5 and 20
• Threshold for inclusion in multiple alignment and PSSM construction
  – Typically E-value of 0.001 or less
• A PSI-BLAST search takes considerably longer than BLAST but is much more sensitive

SAM

• Iterative tool that uses Hidden Markov Models instead of Position-Specific Score Matrices
• Only uses 4 iterations
• Also uses BLAST for searching
  – Refines alignment scoring using an HMM
• Better accuracy than BLAST
• About 3 times slower than BLAST

Profile corruption

• False positives are high-scoring alignments that are not in fact related to the query
• A single false positive can corrupt the profile
• The profile now includes information about an alignment with an unrelated sequence
• Decreasing the E-value threshold reduces the likelihood of false positives, but decreases sensitivity
• Selectivity / sensitivity tradeoff

Searching....done
Results from round 1

                                                                    Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...    82   4e-17
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    33   0.020
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    33   0.022
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28   0.90
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    27   1.5
gi|482884|gb|AAC46500.1| circumsporozoite protein                      27   1.6
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.9
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.9
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    26   2.6
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26   3.1
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26   3.3
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   7.2
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.5

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 82.0 bits (201), Expect = 4e-17
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175

Searching....done
Results from round 2

                                                                    Score     E
Sequences producing significant alignments:                        (bits)  Value

Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   273   7e-75

Sequences not found previously or not previously below threshold:

gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...    50   2e-07
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    39   5e-04
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         28   0.64
gi|15922436|ref|NP_378105.1| 266aa long conserved hypothetical p...    28   0.83
gi|482884|gb|AAC46500.1| circumsporozoite protein                      28   1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28   1.2
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.7
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.8
gi|231413|sp|P30490|1B52_HUMAN HLA class I histocompatibility an...    27   1.9
gi|3522980|dbj|BAA32614.1| MHC class I antigen [Homo sapiens]          26   2.4
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|231350|sp|P30377|1A03_GORGO CLASS I HISTOCOMPATIBILITY ANTIGE...    26   2.6
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25   4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.7
gi|32417378|ref|XP_329167.1| predicted protein [Neurospora crass...    25   7.8

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 273 bits (699), Expect = 7e-75
 Identities = 52/194 (26%), Positives = 96/194 (49%), Gaps = 29/194 (14%)

Query: 256 FTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVGLSGLSSKAAKD 305
           F  E+L R+   P + ++ K          + + IID +AI+P ++ V          +
Sbjct: 66  FCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV----------RA 115

Query: 306 VLGFLRVVRFVRILRIFKLTRHFVGLRVLGHTLRASTNEFLLLIIFLALGVLIFATMIYY 365
            L  LRV+R +RIL+I +  +    +    + LR+ + E  +  ++  L +LI +T++Y
Sbjct: 116 ELKILRVIRLLRILKIGRSEKFKKSIFHFNYALRSKSQELQISTVYTVLLLLISSTLMYL 175

Query: 366 AERIGAQPNDPSASEHTHFKNIPIGFWWAVVTMTTLGYGDMYPQTWSGMLVGALCALAGV 425
           AE         S+ +     +IP   WW+V T++ +GYGD  P T  G ++ ++ +L G+
Sbjct: 176 AE---------SSIQPELLGSIPRCLWWSVTTVSAVGYGDSIPVTAIGKIIASVTSLLGI 226

Query: 426 LTIAMPVPVIVNNF 439
             IA+P  ++   F
Sbjct: 227 GAIAIPTGILAAGF 240

Searching....done
Results from round 3

                                                                    Score     E
Sequences producing significant alignments:                        (bits)  Value

Sequences used in model and found again:

gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predic...   251   3e-68
gi|16264095|ref|NP_436887.1| putative ionic voltage-gated channe...   235   3e-63
gi|17536613|ref|NP_494333.1| TWiK family of potassium channels (...    83   2e-17

Sequences not found previously or not previously below threshold:

gi|482884|gb|AAC46500.1| circumsporozoite protein                      28   1.1
gi|32420293|ref|XP_330590.1| hypothetical protein [Neurospora cr...    28   1.1
gi|6723566|emb|CAB66363.1| immunoglobulin mu heavy chain variabl...    27   1.6
gi|29836862|emb|CAD88668.1| immunoglobulin heavy chain [Homo sap...    27   1.7
gi|2127462|pir||S72598 sulfate permease T protein - Mycobacteriu...    26   2.5
gi|12232625|emb|CAC21575.2| MHC class I antigen [Homo sapiens]         26   4.0
gi|15805398|ref|NP_294092.1| hypothetical protein [Deinococcus r...    26   4.3
gi|15216247|dbj|BAB63254.1| PER3 [Homo sapiens]                        25   4.7
gi|32420913|ref|XP_330900.1| hypothetical protein [Neurospora cr...    25   5.1
gi|21356701|ref|NP_652739.1| Pp1-Y2 [Drosophila melanogaster] >g...    25   7.6
gi|15889580|ref|NP_355261.1| AGR_C_4191p [Agrobacterium tumefaci...    25   8.4
gi|15598989|ref|NP_252483.1| hypothetical protein [Pseudomonas a...    25   9.8

CONVERGED!

>gi|33240976|ref|NP_875918.1| Kef-type K+ transport system predicted NAD-binding component [Prochlorococcus marinus subsp.

 Score = 251 bits (642), Expect = 3e-68
 Identities = 53/205 (25%), Positives = 101/205 (49%), Gaps = 29/205 (14%)

Query: 245 LTYIEGVCVVWFTFEFLMRVVFCPNKVEFIK----------NSLNIIDFVAILPFYLEVG 294
           + +++ V    F  E+L R+   P + ++ K          + + IID +AI+P ++ V
Sbjct: 55  IDFLDWVIGGLFCIEYLCRLWVAPLQEKYGKGLKGIFRYVLSPMAIIDVIAIIPSFIGV- 113
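The three rounds above can be read as the inclusion threshold at work. The sketch below transcribes only the top hits of each round (accession and E-value) and applies a cutoff of 0.001, the "typical" value quoted earlier. That the run actually used exactly this cutoff is an assumption, although it is consistent with which sequences the output lists as "used in model". The helper name `included` is hypothetical.

```python
THRESHOLD = 0.001  # assumption: the typical inclusion E-value quoted earlier

# Top hits per round, transcribed from the PSI-BLAST output above.
ROUNDS = {
    1: {"NP_875918.1": 4e-17, "NP_436887.1": 0.020, "NP_494333.1": 0.022, "CAC21575.2": 0.90},
    2: {"NP_875918.1": 7e-75, "NP_436887.1": 2e-07, "NP_494333.1": 5e-04, "CAC21575.2": 0.64},
    3: {"NP_875918.1": 3e-68, "NP_436887.1": 3e-63, "NP_494333.1": 2e-17, "AAC46500.1": 1.1},
}

def included(hits, threshold=THRESHOLD):
    """Accessions whose E-value is at or below the inclusion threshold."""
    return {acc for acc, e in hits.items() if e <= threshold}

previous = set()
for rnd in sorted(ROUNDS):
    current = included(ROUNDS[rnd])
    newly_added = current - previous
    print(f"round {rnd}: included={sorted(current)} new={sorted(newly_added)}")
    previous = current
```

Only one sequence qualifies in round 1; the profile built from it pulls two more below the threshold in round 2; round 3 adds nothing new, which is the point at which the output reports convergence. Raising the cutoff to, say, 1.0 would have admitted the MHC class I antigen hit (E = 0.90) in round 1, the kind of unrelated sequence that can corrupt a profile.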

Summary

• Multiple alignment
  – Advantages over using a single sequence
• Constructing and using profiles
  – Position-Specific Score Matrices (PSSMs)
  – Hidden Markov Models (HMMs)
• Iterative search
  – PSI-BLAST
  – SAM

Practical exercise

• Build a simple bioinformatics search tool:
  – Perform Smith-Waterman search between query and each database sequence
  – Calculate E-value for each alignment, and if below cutoff then display alignment
• For code and detailed instructions go to: http://www.cs.rmit.edu.au/~mcam/prac
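As a rough idea of what such a tool involves, here is a minimal sketch and not the practical's actual code. It computes a Smith-Waterman local alignment score with a toy match/mismatch scheme and a linear gap penalty, then converts the score to an E-value with the Karlin-Altschul form E = K·m·n·e^(−λS). The scoring values and the K and λ constants are placeholder assumptions, so the E-values are only illustrative; real parameters depend on the scoring matrix and gap costs. The function names are hypothetical.

```python
import math

MATCH, MISMATCH, GAP = 2, -1, -2  # assumption: toy scoring, not a real substitution matrix
K, LAMBDA = 0.1, 0.3              # assumption: placeholder Karlin-Altschul parameters

def smith_waterman(a, b):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            best = max(best, curr[j])
        prev = curr
    return best

def e_value(score, query_len, db_len):
    """Expected number of chance hits scoring at least `score`."""
    return K * query_len * db_len * math.exp(-LAMBDA * score)

def search(query, database, cutoff=10.0):
    """Score the query against every database sequence; keep hits below the cutoff."""
    db_len = sum(len(seq) for seq in database.values())
    results = []
    for name, seq in database.items():
        s = smith_waterman(query, seq)
        e = e_value(s, len(query), db_len)
        if e <= cutoff:
            results.append((name, s, e))
    return sorted(results, key=lambda r: r[2])  # most significant first

database = {"hit": "TTACGTTT", "miss": "GGGGGGGG"}
for name, s, e in search("ACGT", database, cutoff=1.0):
    print(name, s, f"{e:.2g}")
```

This sketch reports only scores; displaying the alignment itself would additionally require keeping the full matrix and tracing back from the best-scoring cell.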