Page 1: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Probabilistic Sequence Alignment

BMI 877

Colin Dewey

cdewey@biostat.wisc.edu

February 25, 2014

Page 2: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

What you’ve seen thus far

• The pairwise sequence alignment task

• Notion of a “best” alignment: scoring schemes

• Dynamic programming algorithms for efficiently finding a “best” alignment

• Variants on the task: global, local, different gap penalty functions

• Heuristic methods for large-scale alignment: BLAST

Page 3: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Tasks addressed today

• How can we express the uncertainty of an alignment?

• How can we estimate the parameters for alignment?

• How can we align multiple sequences to each other?

Page 4: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Picking Alignments

Alignment 1

mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATA------------
pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC

mel -------CATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAG--------------------CGCAGTCGGCGGCGGCGGC
pse CATTCGGCATGTGAAAA-------------TCCTTATTAATCCAGAACGTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGC

mel GCGCAGTCAGC----------GGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCG---------------
pse -CGCAG-CAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAA--TCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCG

mel ---AAAGAGAGCG-TTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC
pse GCCAAAGAGAGCGATTTTATTTATGTG-----------------ACTGCGCTGCCTG----------GTCCTCGGC

Alignment summary: 27 mismatches, 12 gaps, 116 spaces

Alignment 2

mel TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATC
pse T----TTTGATGTTGATTTCTTTACGAGTTTGATAGAACTAAACCCATAAAGCATTCAATTCGTAGCATATAGCTCTCCTCTGC

mel CCAGCGAGA------ACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAAT
pse CATTCGGCATGTGAAAATCCTTATTAATCCAGAAC-------------------------------------------------

mel AAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTG
pse --------------------------------------------GTGTGCGCCAGCGTCAGCGCCAGCGCCGGCAGCAGCCGCA

mel GCAGCGA-----------------------------------------------------------------------------
pse GCAGCAAAACGGCACGCTGGCAGCGGAGTATATAAATAATCTTATAAGAAACTCGTGTGAGCGCAACCGGGCAGCGGCCAAAGA

mel ----------------------------------CTGCGAC
pse GAGCGATTTTATTTATGTGACTGCGCTGCCTGGTCCTCGGC

Alignment summary: 45 mismatches, 4 gaps, 214 spaces

Page 5: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

An Elusive Cis-Regulatory Element

[Figure: Drosophila melanogaster polytene chromosomes]

>chr3R:2824911-2825170
TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAATTCAAAGCATATACATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAGCGCAGTCGGCGGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACTCGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCAGCACTGGCAGCGACTGCGAC

Adf1→Antp:06447: Binding site for transcription factor Adf1

Antp: antennapedia

Page 6: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Conservation of Adf1→Antp:06447

[Figure: Alignment 1 and Alignment 2 of the mel and pse sequences, repeated from Page 4]

Alignment 1 summary: 27 mismatches, 12 gaps, 116 spaces

Alignment 2 summary: 45 mismatches, 4 gaps, 214 spaces

TGTGCGTCAGCGTCGGCCGCAACAGCG
TGTG-----------------ACTGCG
****                 ** ***

33% identity

TGTGCGTCAGCGTCGGCCGCAACAGCG
TGTGCGCCAGCGTCAGCGCCAGCGCCG
****** ******* **  ** *  **

74% identity
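The two identity figures can be checked directly from the excerpts. The sketch below assumes that identity is counted as identical, non-gap columns divided by the total number of alignment columns; the function name `percent_identity` is illustrative, not from the slides.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns whose two characters are identical.

    Columns containing a gap ('-') never count as identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


top = "TGTGCGTCAGCGTCGGCCGCAACAGCG"
gapped = "TGTG" + "-" * 17 + "ACTGCG"      # second row of the gapped excerpt
ungapped = "TGTGCGCCAGCGTCAGCGCCAGCGCCG"   # second row of the ungapped excerpt

print(round(percent_identity(top, gapped)))    # 33
print(round(percent_identity(top, ungapped)))  # 74
```

The same 27-base region is therefore either weakly or strongly conserved depending on which alignment is chosen.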

Page 7: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Polytope

TGTGCGTCAGCGTCGGCCGCAACAGCG
TGTGCGCCAGCGTCAGCGCCAGCGCCG
****** ******* **  ** *  **

TGTGCGTCAGCGTCGGCCGCAACAGCG
TGTG-----------------ACTGCG
****                 ** ***

364 Vertices, 760 Ridges, 398 Facets

Sequence lengths = 260bp, 280bp

Page 8: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Methodological machinery to be used

• Hidden Markov models (HMMs)
– Viterbi and Forward algorithms
– Profile HMMs
– Pair HMMs
– Connections to classical sequence alignment

Page 9: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Hidden Markov models

• Generative probabilistic models of sequences

• Explicitly model unobserved (hidden) states that “emit” the characters of the observed sequence

• Primary task of interest is to infer the hidden states given the observed sequence

• Alignment case: hidden states = alignment

Page 10: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Two HMM random variables

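• Observed sequence: $x = x_1 x_2 \ldots x_L$

• Hidden state sequence: $\pi = \pi_1 \pi_2 \ldots \pi_L$

• HMM:
– Markov chain over hidden sequence
– Dependence between each hidden state $\pi_i$ and the observed character $x_i$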

Page 11: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Parameters of an HMM

• as in Markov chain models, we have transition probabilities

$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$: probability of a transition from state k to l

$\pi$ represents a path (sequence of states) through the model

• since we’ve decoupled states and characters, we also have emission probabilities

$e_k(b) = P(x_i = b \mid \pi_i = k)$: probability of emitting character b in state k

Page 12: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

A Simple HMM with Emission Parameters

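[Figure: a simple HMM with a begin state (0), four emitting states (1 to 4), and an end state (5). Arrows carry transition probabilities and each emitting state carries an emission distribution over A, C, G, T. Two callouts mark the probability of emitting character A in state 2 and the probability of a transition from state 1 to state 3.]

Transition probabilities shown on the arrows: 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8

Emission distributions shown in the figure:
A 0.4, C 0.1, G 0.2, T 0.3
A 0.1, C 0.4, G 0.4, T 0.1
A 0.2, C 0.3, G 0.3, T 0.2
A 0.4, C 0.1, G 0.1, T 0.4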

Page 13: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Three Important Questions

• How likely is a given sequence?
the Forward algorithm

• What is the most probable “path” (sequence of hidden states) for generating a given sequence?
the Viterbi algorithm

• How can we learn the HMM parameters given a set of sequences?
the Forward-Backward (Baum-Welch) algorithm

Page 14: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

How Likely is a Given Path and Sequence?

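• the probability that the path $\pi_0 \ldots \pi_{L+1}$ is taken and the sequence $x_1 \ldots x_L$ is generated:

$$P(x_1 \ldots x_L, \pi_0 \ldots \pi_{L+1}) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

where $a_{kl}$ is the probability of a transition from state k to l, $e_k(b)$ is the probability of emitting character b in state k, $\pi_0$ is the begin state, and $\pi_{L+1}$ is the end state

(assuming begin/end are the only silent states on path)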
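As a minimal sketch of this product, the code below multiplies the transition and emission probabilities along one path. The model is an assumed reading of the simple HMM figure: the figure's numbers survive in the transcript, but which number belongs to which arrow or state does not, so `TRANSITIONS` and `EMISSIONS` are hypothetical assignments chosen for illustration.

```python
# Assumed model (hypothetical reading of the simple HMM figure):
# state 0 = begin, states 1-4 emit, state 5 = end.
TRANSITIONS = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.8, 4: 0.2},
    3: {3: 0.4, 5: 0.6},
    4: {4: 0.1, 5: 0.9},
}
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    3: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    4: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def joint_probability(sequence, path):
    """P(x, pi) for a path written as [begin, s_1, ..., s_L, end]."""
    if len(path) != len(sequence) + 2:
        raise ValueError("path needs one state per character plus begin and end")
    prob = 1.0
    # One transition factor for every consecutive pair of states on the path.
    for k, l in zip(path, path[1:]):
        prob *= TRANSITIONS.get(k, {}).get(l, 0.0)
    # One emission factor for every emitting state on the path.
    for state, ch in zip(path[1:-1], sequence):
        prob *= EMISSIONS[state][ch]
    return prob


# 0.5 * 0.4 * 0.2 * 0.4 * 0.8 * 0.3 * 0.6, roughly 0.0023
print(joint_probability("AAC", [0, 1, 1, 3, 5]))
```

A path that uses a transition absent from the model gets probability 0, which is why missing entries default to 0.0.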

Page 15: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

How Likely Is A Given Path and Sequence?

[Figure: the simple HMM from Page 12, with the same states, transition probabilities, and emission distributions]

Page 16: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

How Likely is a Given Sequence?

• We usually only observe the sequence, not the path

• To find the probability of a sequence, we must sum over all possible paths

• but the number of paths can be exponential in the length of the sequence...

• the Forward algorithm enables us to compute this efficiently

Page 17: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

How Likely is a Given Sequence: The Forward Algorithm

• A dynamic programming solution

• subproblem: define $f_k(i)$ to be the probability of generating the first i characters and ending in state k

• we want to compute $f_N(L)$, the probability of generating the entire sequence (x) and ending in the end state (state N)

• can define this recursively

Page 18: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Forward Algorithm

• because of the Markov property, don’t have to explicitly enumerate every path

• e.g. compute $f_l(i)$ using only the values $f_k(i-1)$ of the states k that have a transition into l

[Figure: the simple HMM from Page 12]

Page 19: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Forward Algorithm

• initialization:

$f_0(0) = 1$: probability that we’re in the start state and have observed 0 characters from the sequence

$f_k(0) = 0$ for every emitting state k

Page 20: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Forward Algorithm

• recursion for silent states:

$f_l(i) = \sum_k f_k(i)\, a_{kl}$

• recursion for emitting states (i = 1…L):

$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$

Page 21: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Forward Algorithm

• termination:

$P(x) = P(x_1 \ldots x_L) = f_N(L) = \sum_k f_k(L)\, a_{kN}$

probability that we’re in the end state and have observed the entire sequence
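The initialization, recursion, and termination steps can be sketched as follows. This is a minimal version that assumes begin and end are the only silent states, and it reuses an assumed assignment of the simple HMM figure's numbers to states (`TRANSITIONS`, `EMISSIONS` are hypothetical, since the figure itself is not recoverable from the transcript).

```python
# Assumed model (hypothetical reading of the simple HMM figure):
# state 0 = begin, states 1-4 emit, state 5 = end.
TRANSITIONS = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.8, 4: 0.2},
    3: {3: 0.4, 5: 0.6},
    4: {4: 0.1, 5: 0.9},
}
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    3: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    4: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward(sequence, transitions, emissions, begin=0, end=5):
    """Return P(x), the total probability of all begin-to-end paths emitting x."""
    # Initialization: before any character is read, all mass is in the begin state.
    prev = {begin: 1.0}
    # Recursion over emitting states, one sequence position at a time.
    for ch in sequence:
        cur = {}
        for l in emissions:
            incoming = sum(f_k * transitions.get(k, {}).get(l, 0.0)
                           for k, f_k in prev.items())
            cur[l] = emissions[l][ch] * incoming
        prev = cur
    # Termination: move from whichever state emitted the last character to end.
    return sum(f_k * transitions.get(k, {}).get(end, 0.0)
               for k, f_k in prev.items())


print(forward("TAGA", TRANSITIONS, EMISSIONS))
```

Only the previous column of forward values is kept, so memory is proportional to the number of states while time is proportional to the sequence length times the number of transitions.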

Page 22: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Forward Algorithm Example

[Figure: the simple HMM from Page 12]

• given the sequence x = TAGA

Page 23: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Forward Algorithm Example

• given the sequence x = TAGA

• initialization

• computing other values

Page 24: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Three Important Questions

• How likely is a given sequence?

• What is the most probable “path” for generating a given sequence?

• How can we learn the HMM parameters given a set of sequences?

Page 25: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Finding the Most Probable Path: The Viterbi Algorithm

• Dynamic programming approach, again!

• subproblem: define $v_k(i)$ to be the probability of the most probable path accounting for the first i characters of x and ending in state k

• we want to compute $v_N(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state

• can define $v_k(i)$ recursively

• can use DP to find $v_N(L)$ efficiently

Page 26: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Finding the Most Probable Path: The Viterbi Algorithm

• initialization: $v_0(0) = 1$ and $v_k(0) = 0$ for every emitting state k

Page 27: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Viterbi Algorithm

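• recursion for emitting states (i = 1…L):

$v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]$

$\mathrm{ptr}_i(l) = \arg\max_k \left[ v_k(i-1)\, a_{kl} \right]$ (keep track of most probable path)

• recursion for silent states:

$v_l(i) = \max_k \left[ v_k(i)\, a_{kl} \right]$

$\mathrm{ptr}_i(l) = \arg\max_k \left[ v_k(i)\, a_{kl} \right]$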

Page 28: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

The Viterbi Algorithm

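• termination:

$P(x, \pi^*) = \max_k \left[ v_k(L)\, a_{kN} \right]$

$\pi^*_L = \arg\max_k \left[ v_k(L)\, a_{kN} \right]$

• traceback: follow pointers back starting at $\pi^*_L$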
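A minimal sketch of Viterbi with traceback, again assuming begin and end are the only silent states. The model parameters are a hypothetical assignment of the simple HMM figure's numbers to states, not a confirmed reading of the figure.

```python
# Assumed model (hypothetical reading of the simple HMM figure):
# state 0 = begin, states 1-4 emit, state 5 = end.
TRANSITIONS = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.8, 4: 0.2},
    3: {3: 0.4, 5: 0.6},
    4: {4: 0.1, 5: 0.9},
}
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    3: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    4: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(sequence, transitions, emissions, begin=0, end=5):
    """Return (probability, path) for the most probable begin-to-end path."""
    v_prev = {begin: 1.0}
    pointers = []  # pointers[i][l]: best predecessor of state l at position i + 1
    for ch in sequence:
        v_cur, ptr = {}, {}
        for l in emissions:
            best_k, best = max(
                ((k, v_k * transitions.get(k, {}).get(l, 0.0))
                 for k, v_k in v_prev.items()),
                key=lambda pair: pair[1],
            )
            v_cur[l] = emissions[l][ch] * best
            ptr[l] = best_k
        pointers.append(ptr)
        v_prev = v_cur
    # Termination: pick the best state from which to enter the end state.
    last, best = max(
        ((k, v_k * transitions.get(k, {}).get(end, 0.0)) for k, v_k in v_prev.items()),
        key=lambda pair: pair[1],
    )
    # Traceback: follow the stored pointers from the last state back to begin.
    path = [end, last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return best, path


prob, path = viterbi("AC", TRANSITIONS, EMISSIONS)
print(prob, path)
```

The code has the same shape as the Forward sketch; the sum over predecessors is replaced by a max, and the arg max is stored so the path can be recovered.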

Page 29: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Forward & Viterbi Algorithms

• Forward/Viterbi algorithms effectively consider all possible paths for a sequence
– Forward to find probability of a sequence
– Viterbi to find most probable path

• consider a sequence of length 4…

[Figure: paths from begin to end]

Page 30: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

HMM parameter estimation

• Easy if the hidden path is known for each sequence

• In general, the paths are unknown

• Baum-Welch (Forward-Backward) algorithm is used to compute maximum likelihood estimates (it converges to a local maximum of the likelihood)

• Backward algorithm is analog of forward algorithm for computing probabilities of suffixes of a sequence
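The "easy" case can be sketched as simple counting: when every training sequence comes with its hidden path, the maximum likelihood estimates are the observed transition and emission frequencies. The function name and the toy training pairs below are hypothetical, and the sketch uses no pseudocounts.

```python
from collections import Counter, defaultdict


def estimate_from_labeled(examples):
    """Maximum likelihood estimates from (sequence, path) pairs.

    Each path is written as [begin, s_1, ..., s_L, end].
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sequence, path in examples:
        for k, l in zip(path, path[1:]):
            trans_counts[k][l] += 1
        for state, ch in zip(path[1:-1], sequence):
            emit_counts[state][ch] += 1

    def normalize(counts):
        # Turn each row of counts into relative frequencies.
        return {k: {key: n / sum(row.values()) for key, n in row.items()}
                for k, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)


# Hypothetical labeled data: states 0 and 5 are begin and end.
training = [("AAC", [0, 1, 1, 3, 5]), ("TC", [0, 2, 4, 5])]
transitions, emissions = estimate_from_labeled(training)
print(transitions[1])  # {1: 0.5, 3: 0.5}
```

In practice pseudocounts are often added before normalizing so that unseen transitions or characters do not get probability zero.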

Page 31: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Learning Parameters: The Baum-Welch Algorithm

• algorithm sketch:
– initialize the parameters of the model
– iterate until convergence
   • calculate the expected number of times each transition or emission is used
   • adjust the parameters to maximize the likelihood of these expected values
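One iteration of this loop might look like the sketch below. It assumes begin and end are the only silent states, works with plain probabilities (no scaling or log space, so it suits short sequences only), and starts from a hypothetical assignment of the simple HMM figure's numbers (`A0`, `E0`). Every training sequence must have nonzero probability under the current parameters.

```python
import math

# Assumed starting model (hypothetical reading of the simple HMM figure).
A0 = {0: {1: 0.5, 2: 0.5}, 1: {1: 0.2, 3: 0.8}, 2: {2: 0.8, 4: 0.2},
      3: {3: 0.4, 5: 0.6}, 4: {4: 0.1, 5: 0.9}}
E0 = {1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
      2: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
      3: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
      4: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def forward_table(x, a, e, begin):
    """f[i][k]: probability of emitting x[:i] and being in state k."""
    f = [{begin: 1.0}]
    for ch in x:
        prev = f[-1]
        f.append({l: e[l][ch] * sum(fk * a.get(k, {}).get(l, 0.0)
                                    for k, fk in prev.items()) for l in e})
    return f


def backward_table(x, a, e, end):
    """b[i][k]: probability of emitting x[i:] and reaching end, given state k at i."""
    n = len(x)
    b = [None] * (n + 1)
    b[n] = {k: a.get(k, {}).get(end, 0.0) for k in e}
    for i in range(n - 1, 0, -1):
        b[i] = {k: sum(a.get(k, {}).get(l, 0.0) * e[l][x[i]] * b[i + 1][l]
                       for l in e) for k in e}
    return b


def normalize(counts, fallback):
    total = sum(counts.values())
    # States that were never visited keep their old parameters.
    return {key: c / total for key, c in counts.items()} if total > 0 else dict(fallback)


def baum_welch_step(sequences, a, e, begin=0, end=5):
    """One EM iteration; returns (new_a, new_e, log-likelihood under the old parameters)."""
    exp_a = {k: {l: 0.0 for l in row} for k, row in a.items()}
    exp_e = {k: {ch: 0.0 for ch in row} for k, row in e.items()}
    log_likelihood = 0.0
    for x in sequences:
        n = len(x)
        f, b = forward_table(x, a, e, begin), backward_table(x, a, e, end)
        px = sum(fk * a.get(k, {}).get(end, 0.0) for k, fk in f[n].items())
        log_likelihood += math.log(px)
        # E-step: expected transition counts (position i is state k, i + 1 is state l).
        for i in range(n):
            for k, fk in f[i].items():
                for l, a_kl in a.get(k, {}).items():
                    if l != end:
                        exp_a[k][l] += fk * a_kl * e[l][x[i]] * b[i + 1][l] / px
        for k, fk in f[n].items():
            if end in a.get(k, {}):
                exp_a[k][end] += fk * a[k][end] / px
        # E-step: expected emission counts.
        for i in range(1, n + 1):
            for k in e:
                exp_e[k][x[i - 1]] += f[i][k] * b[i][k] / px
    # M-step: renormalize the expected counts.
    new_a = {k: normalize(exp_a[k], a[k]) for k in a}
    new_e = {k: normalize(exp_e[k], e[k]) for k in e}
    return new_a, new_e, log_likelihood


seqs = ["TAGA", "AAC", "CGCG", "TTAT"]  # hypothetical training data
a1, e1, ll0 = baum_welch_step(seqs, A0, E0)
a2, e2, ll1 = baum_welch_step(seqs, a1, e1)
print(ll0, ll1)  # the log-likelihood should not decrease between iterations
```

Repeating `baum_welch_step` until the log-likelihood stops improving gives the "iterate until convergence" loop; the result depends on the starting parameters.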

Page 32: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

How can we use HMMs for pairwise alignment?

• What is the observed sequence?
– one of the two sequences?
– both sequences?

• What is the hidden path?
– the alignment

Page 33: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Profile HMM for pairwise alignment

• Select one sequence to be observed (the query)

• The other sequence (the reference) defines the states of the HMM

• Three classes of states
– Match: corresponds to aligned positions
– Delete: positions of the reference that are deleted in the query
– Insert: positions on the query that are insertions relative to the reference

Page 34: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Profile HMMs

[Figure: profile HMM with start and end states, match states m1 to m3, insert states i0 to i3, and delete states d1 to d3]

Match states represent key conserved positions

Insert states account for extra characters in some sequences

Delete states are silent; they account for characters missing in some sequences

Insert and match states have emission distributions over sequence characters, e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01

Page 35: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Example Profile HMM

Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

match states

delete states (silent)

insert states

Page 36: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Profile HMM considerations

• Odd asymmetry: have to pick one sequence as reference

• Models conditional probability P(X|Y) of query sequence (X) given reference sequence (Y)

• Is there something more natural here?
– Yes, Pair HMMs

• We will revisit Profile HMMs for multiple alignment a bit later

Page 37: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Pair Hidden Markov Models

• each non-silent state emits one or a pair of characters

I: insert state

D: delete state

H: homology (match) state

Page 38: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

PHMM Paths = Alignments

hidden:       B  H  H  I  I  H  D  H  E

observed:
sequence 1:      A  A  G  C  G  -  C
sequence 2:      A  T  -  -  G  T  C

sequence 1: AAGCGC
sequence 2: ATGTC
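The correspondence between a state path and an alignment can be sketched as a small conversion routine. Following the example on this slide, it assumes H consumes one character from each sequence, I consumes a character of sequence 1 only, and D consumes a character of sequence 2 only; the function name `path_to_alignment` is illustrative.

```python
def path_to_alignment(path, seq1, seq2):
    """Render a pair HMM state path (string of H/I/D) as two gapped rows."""
    row1, row2 = [], []
    i = j = 0
    for state in path:
        if state == "H":      # homologous pair: one character from each sequence
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif state == "I":    # character of sequence 1 aligned to a gap
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif state == "D":    # character of sequence 2 aligned to a gap
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("path does not consume both sequences completely")
    return "".join(row1), "".join(row2)


row1, row2 = path_to_alignment("HHIIHDH", "AAGCGC", "ATGTC")
print(row1)  # AAGCG-C
print(row2)  # AT--GTC
```

The begin and end states are omitted from the path string because they emit nothing.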

Page 39: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Transition Probabilities

• probabilities of moving between states at each step

Rows: state i. Columns: state i+1.

      B    H         I    D    E
B          1-2δ-τ    δ    δ    τ
H          1-2δ-τ    δ    δ    τ
I          1-ε-τ     ε         τ
D          1-ε-τ          ε    τ
E
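A small sketch can confirm that each row of this table is a probability distribution. It assumes the usual layout in which the I row has entries for H, I, E and the D row has entries for H, D, E; the numeric values of δ, ε, τ below are hypothetical.

```python
def pair_hmm_transitions(delta, epsilon, tau):
    """Build the pair HMM transition table from gap-open (delta),
    gap-extend (epsilon), and termination (tau) probabilities."""
    from_match = 1 - 2 * delta - tau   # B or H moving to H
    from_gap = 1 - epsilon - tau       # I or D moving back to H
    if from_match < 0 or from_gap < 0:
        raise ValueError("parameters leave a negative probability")
    return {
        "B": {"H": from_match, "I": delta, "D": delta, "E": tau},
        "H": {"H": from_match, "I": delta, "D": delta, "E": tau},
        "I": {"H": from_gap, "I": epsilon, "E": tau},
        "D": {"H": from_gap, "D": epsilon, "E": tau},
    }


table = pair_hmm_transitions(delta=0.2, epsilon=0.1, tau=0.1)  # hypothetical values
for state, row in table.items():
    print(state, row, round(sum(row.values()), 10))
```

Small δ makes opening a gap expensive, and ε close to 1 makes extending an open gap cheap, which mirrors an affine gap penalty.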

Page 40: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Emission Probabilities

Insertion (I) and Deletion (D): single character

A 0.3   C 0.2   G 0.3   T 0.2

A 0.1   C 0.4   G 0.4   T 0.1

Homology (H): pairs of characters

     A     C     G     T
A  0.13  0.03  0.06  0.03
C  0.03  0.13  0.03  0.06
G  0.06  0.03  0.13  0.03
T  0.03  0.06  0.03  0.13

Page 41: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Pair HMM Viterbi

• probability of most likely sequence of hidden states generating length i prefix of x and length j prefix of y, with the last state being:

H

I

D

• note that the recurrence relations here allow ID and DI transitions

Page 42: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

PHMM Alignment

• calculate probability of most likely alignment

• traceback, as in Needleman-Wunsch (NW), to obtain sequence of states giving highest probability

HIDHHDDIIHH...
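A minimal log-space sketch of pair HMM Viterbi with traceback follows. Several things are assumptions rather than the slides' exact recurrences: transitions come from the table on Page 39 with hypothetical values δ = 0.2, ε = 0.1, τ = 0.1, so I→D and D→I moves have probability 0 unless entries are added to `TRANS`; I emits from sequence x and D from sequence y; and the assignment of the two single-character distributions to I and D is a guess.

```python
import math

NEG_INF = float("-inf")

# Hypothetical parameters: delta = 0.2, epsilon = 0.1, tau = 0.1.
TRANS = {
    "B": {"H": 0.5, "I": 0.2, "D": 0.2, "E": 0.1},
    "H": {"H": 0.5, "I": 0.2, "D": 0.2, "E": 0.1},
    "I": {"H": 0.8, "I": 0.1, "E": 0.1},
    "D": {"H": 0.8, "D": 0.1, "E": 0.1},
}
EMIT_I = {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2}  # assumed to belong to I
EMIT_D = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}  # assumed to belong to D
EMIT_H = {
    "A": {"A": 0.13, "C": 0.03, "G": 0.06, "T": 0.03},
    "C": {"A": 0.03, "C": 0.13, "G": 0.03, "T": 0.06},
    "G": {"A": 0.06, "C": 0.03, "G": 0.13, "T": 0.03},
    "T": {"A": 0.03, "C": 0.06, "G": 0.03, "T": 0.13},
}
STEP = {"H": (1, 1), "I": (1, 0), "D": (0, 1)}  # characters consumed from (x, y)


def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF


def pair_viterbi(x, y, trans=TRANS):
    """Return (log probability, state path) of the most probable alignment."""
    n, m = len(x), len(y)
    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STEP}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STEP}
    for i in range(n + 1):
        for j in range(m + 1):
            for s, (di, dj) in STEP.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                if s == "H":
                    emit = EMIT_H[x[i - 1]][y[j - 1]]
                elif s == "I":
                    emit = EMIT_I[x[i - 1]]
                else:
                    emit = EMIT_D[y[j - 1]]
                # The begin state is the only predecessor at the origin.
                if (pi, pj) == (0, 0):
                    candidates = [("B", 0.0)]
                else:
                    candidates = [(k, v[k][pi][pj]) for k in STEP]
                best_k, best = max(
                    ((k, score + safe_log(trans.get(k, {}).get(s, 0.0)))
                     for k, score in candidates),
                    key=lambda pair: pair[1],
                )
                v[s][i][j] = best + safe_log(emit)
                ptr[s][i][j] = best_k
    last, best = max(((s, v[s][n][m] + safe_log(trans[s].get("E", 0.0)))
                      for s in STEP), key=lambda pair: pair[1])
    if best == NEG_INF:
        raise ValueError("no path with nonzero probability")
    # Traceback from the final cell to the begin state.
    path, i, j, s = [], n, m, last
    while s != "B":
        path.append(s)
        di, dj = STEP[s]
        s, i, j = ptr[s][i][j], i - di, j - dj
    return best, "".join(reversed(path))


score, path = pair_viterbi("AAGCGC", "ATGTC")
print(score, path)
```

Working with sums of logs instead of products of probabilities avoids underflow on long sequences and makes the dynamic program look like a standard alignment score recurrence.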

Page 43: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Correspondence with Needleman-Wunsch (NW)

• NW values ≈ logarithms of Pair HMM Viterbi values
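One way to see the correspondence, sketched here under the transition parameters δ, ε, τ of the pair HMM and with $a$ and $e$ denoting transition and emission probabilities (notation assumed for this note): taking logarithms turns the product along a path into a sum of per-step terms, so the most probable path is also the highest-scoring one.

```latex
\log P(x, y, \pi)
  = \sum_{t} \log a_{\pi_{t-1}\pi_t} + \sum_{t} \log e_{\pi_t}(\cdot)
\qquad\Longrightarrow\qquad
\arg\max_{\pi} P(x, y, \pi) = \arg\max_{\pi} \log P(x, y, \pi)
```

Under this reading, the log emission terms of the H state play the role of match and mismatch scores, and a gap of length $g$ contributes transition factors $\delta\,\varepsilon^{g-1}$, whose logarithm $\log\delta + (g-1)\log\varepsilon$ has the form of an affine gap penalty, up to the emission terms and the transitions back to H.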

Page 44: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Posterior Probabilities

• there are similar recurrences for the Forward and Backward values

• from the Forward and Backward values, we can calculate the posterior probability of the event that the path passes through a certain state S, after generating length i and j prefixes
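Written out, and assuming the notation $f^S(i,j)$ for the Forward value (probability of generating the two prefixes and ending in state S) and $b^S(i,j)$ for the Backward value (probability of generating the remaining suffixes given state S at that point), the posterior takes the standard form:

```latex
P\bigl(\pi \text{ passes through } S \text{ after prefixes } x_{1..i},\, y_{1..j} \;\big|\; x, y\bigr)
  = \frac{f^{S}(i,j)\; b^{S}(i,j)}{P(x, y)}
```

Here $P(x, y)$ is the total probability of the sequence pair, obtained from the Forward termination step. With S = H, this is the posterior probability that $x_i$ and $y_j$ are aligned to each other.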

Page 45: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Uses for Posterior Probabilities

• sampling of suboptimal alignments

• posterior probability of pairs of residues being homologous (aligned to each other)

• posterior probability of a residue being gapped

• training model parameters (EM)

Page 46: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Posterior Probabilities

plot of posterior probability of each alignment column

Page 47: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Parameter Training

• supervised training
– given: sequences and correct alignments
– do: calculate parameter values that maximize joint likelihood of sequences and alignments

• unsupervised training
– given: sequence pairs, but no alignments
– do: calculate parameter values that maximize marginal likelihood of sequences (sum over all possible alignments)

Page 48: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Multiple Alignment with Profile HMMs

• given a set of sequences to be aligned
– use Baum-Welch to learn parameters of model
– may also adjust length of profile HMM during training

• to compute a multiple alignment given the profile HMM
– run the Viterbi algorithm on each sequence
– Viterbi paths indicate correspondences among sequences

Page 49: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Multiple Alignment with Profile HMMs

Page 50: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

More common multiple alignment strategy: Progressive alignment

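TGTAAC and TGTAC are aligned:

TGTAAC
TGT-AC

ATGTC and ATGTGGC are aligned:

ATGT--C
ATGTGGC

The two pairwise alignments are aligned to each other:

-TGTAAC
-TGT-AC
ATGT--C
ATGTGGC

TGTTAAC is added:

-TGTTAAC
-TGT-AAC
-TGT--AC
ATGT---C
ATGT-GGC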

Page 51: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Classification w/ Profile HMMs

• To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest

• Given a sequence x, use Bayes’ rule to make classification

β-galactosidase

β-glucanase

β-amylase

α-amylase
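Bayes' rule here gives P(family | x) = P(x | family) P(family) / Σ over families of P(x | family′) P(family′), where P(x | family) would come from running the Forward algorithm on that family's profile HMM. The sketch below assumes those log-likelihoods are already available; the numbers and the uniform priors are hypothetical, and the family names are spelled in ASCII.

```python
import math


def classify(log_likelihoods, priors):
    """Posterior P(family | x) from log P(x | family) and prior P(family)."""
    log_joint = {c: ll + math.log(priors[c]) for c, ll in log_likelihoods.items()}
    # Log-sum-exp: subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in log_joint.items()}


# Hypothetical Forward log-likelihoods of one sequence under four family models.
log_likelihoods = {
    "alpha-amylase": -120.0,
    "beta-amylase": -125.0,
    "beta-glucanase": -131.0,
    "beta-galactosidase": -140.0,
}
priors = {family: 0.25 for family in log_likelihoods}

posterior = classify(log_likelihoods, priors)
best_family = max(posterior, key=posterior.get)
print(best_family, posterior[best_family])
```

Sequence likelihoods are typically far too small to exponentiate directly, which is why the normalization is done in log space.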

Page 52: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

PFAM

• Large database of protein families

• Each family has a trained Profile HMM

• Example search with a globin sequence:

Page 53: Probabilistic Sequence Alignment BMI 877 Colin Dewey cdewey@biostat.wisc.edu February 25, 2014.

Summary

• Probabilistic models for alignment are more powerful than classical combinatorial alignment algorithms
– Capture uncertainty in alignment
– Allow for principled estimation of parameters
– Easily used in classification settings

