Introduction to Hidden Markov Models and Profiles in Sequence Alignment
Utah State University – Spring 2010
STAT 5570: Statistical Bioinformatics, Notes 6.3
References
• Chapters 3-6 of Biological Sequence Analysis (Durbin et al., 2001)
• Chapter 9 of A First Course in Probability (Ross, 1997)
• Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763.
The “occasionally dishonest casino”
• A casino usually uses a fair die, but sometimes (5% of the time) switches to a loaded die. Once using the loaded die, they usually keep using it (90% of the time).
• How do you know which die you’re playing?
  - not sure, but have to look at many plays to see pattern
  - the “state” here is “hidden”
• One possible representation of this “model”:
  Emission probabilities
    Roll   Fair   Loaded
    1      1/6    1/10
    2      1/6    1/10
    3      1/6    1/10
    4      1/6    1/10
    5      1/6    1/10
    6      1/6    1/2
  Transition probabilities
    Fair → Fair: 0.95      Fair → Loaded: 0.05
    Loaded → Loaded: 0.90  Loaded → Fair: 0.10
• Possible partial sequence of rolls:
  Roll ...62625335636616366646623...
  Die  ...FFFFFFFFLLLLLLLLLFFFFFF...
  F = “fair” state, L = “loaded” state
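The casino model above can be simulated directly. The following is a minimal sketch that uses the probabilities from the slide; the function name `simulate_casino`, the choice to start with the fair die, and the seed are assumptions made for illustration only.

```python
import random

# Transition and emission probabilities as given on the slide.
TRANSITIONS = {
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}
EMISSIONS = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 1 / 10 for face in "12345"}, "6": 1 / 2},
}


def simulate_casino(n_rolls, start_state="F", seed=None):
    """Return (rolls, dice): two strings of length n_rolls."""
    rng = random.Random(seed)
    state = start_state  # assumption: the slide does not say which die comes first
    rolls, dice = [], []
    for _ in range(n_rolls):
        # Emit a roll from the current (hidden) die.
        faces = list(EMISSIONS[state])
        weights = list(EMISSIONS[state].values())
        rolls.append(rng.choices(faces, weights=weights)[0])
        dice.append(state)
        # Move to the next hidden state.
        next_states = list(TRANSITIONS[state])
        next_weights = list(TRANSITIONS[state].values())
        state = rng.choices(next_states, weights=next_weights)[0]
    return "".join(rolls), "".join(dice)


rolls, dice = simulate_casino(60, seed=1)
print("Roll", rolls)
print("Die ", dice)
```

In a real game only the `Roll` line would be visible; the `Die` line is the hidden state path.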
Follow-up to a sequence alignment
• Consider pairwise (or multiple) alignment
• What does alignment mean?
  - possibly represents common ancestry
• Possible questions
  - Does alignment describe some “family”?
  - How can we describe its internal structure?
• Can sometimes characterize these “family” structures as a Markov Chain
“Family” Example: CpG islands
• In DNA: C & G are always matched up in helical structure
• CpG (C followed by G) in sequence is rare, but is more frequent in promoters or start regions of a gene
• General idea: CpG → at start of gene
• Possible question about our alignment:
  Does CpG frequency suggest we are in a promoter region?
Markov Chain Examples
• Occasionally dishonest casino: Probability you are using a fair die on one play depends on whether you were using a fair die on the previous play.
• Rain indicator: Chance of rain tomorrow depends on whether it rains today
• Gambler’s Ruin: A gambler starts out with a fortune and plays a game repeatedly (with same prob. of winning or losing $1 each time) until her fortune reaches either 0 or M. Her sequence of fortunes is a Markov chain.
• CpG Island: Probability you are in a CpG island at one point in alignment depends on whether you were in a CpG island at the previous point.
Markov Chains – a little more formally
• Sequence of random variables: $X_0, X_1, \ldots$
• Set of possible values: $\{0, 1, \ldots, M\}$
• Think of $X_t$ as state of process at time $t$
• This sequence is a Markov Chain if:
  $$P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \ldots, X_1=i_1, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} = P_{ij}$$
• So state at time $t+1$ depends only on state at time $t$
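As a concrete instance of the definition, the Gambler's Ruin chain can be written as a transition matrix with entries $P_{ij}$ and sampled one step at a time. This is a rough sketch; the values `M=5`, `p=0.5`, the starting fortune, and the function names are illustrative assumptions.

```python
import random


def gamblers_ruin_matrix(M, p):
    """Transition matrix P[i][j] for fortunes 0..M; states 0 and M are absorbing."""
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    P[0][0] = 1.0
    P[M][M] = 1.0
    for i in range(1, M):
        P[i][i + 1] = p        # win $1
        P[i][i - 1] = 1.0 - p  # lose $1
    return P


def simulate_chain(P, start, rng, max_steps=10_000):
    """Sample X0, X1, ...; each step looks only at the current state."""
    path = [start]
    states = list(range(len(P)))
    while len(path) <= max_steps:
        current = path[-1]
        if P[current][current] == 1.0:  # absorbing state reached, stop
            break
        path.append(rng.choices(states, weights=P[current])[0])
    return path


P = gamblers_ruin_matrix(M=5, p=0.5)
path = simulate_chain(P, start=2, rng=random.Random(0))
print(path)
```

Because the next fortune is drawn from row `P[current]` alone, the earlier history of the path never enters the calculation, which is exactly the Markov property.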
Hidden Markov Model (HMM) - vocabulary
• State: in which “family” the process is (CpG vs. not, fair vs. loaded die, etc.)
  - the “path”: $\pi$; the $i$th state in path: $\pi_i$
• Symbol: observed “outcome” $x$ (sequence, die, etc.) from unknown state
• Transition probabilities: $a_{kh} = P\{\pi_i = h \mid \pi_{i-1} = k\}$
• Emission probabilities: $e_k(b) = P\{x_i = b \mid \pi_i = k\}$
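• Joint probability of observed sequence $x$ & state sequence $\pi$:
  $$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{n} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
  (here $n$ is the length of $x$, and state 0 denotes the begin/end state)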
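The joint probability can be evaluated for the casino example by multiplying one emission and one transition term per position. The sketch below works in log space; the begin probabilities `A0` are an assumption (the slides do not give them), and no end state is modelled, so the final transition term is left out.

```python
import math

A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
E = {
    "F": {b: 1 / 6 for b in "123456"},
    "L": {**{b: 1 / 10 for b in "12345"}, "6": 1 / 2},
}
A0 = {"F": 0.5, "L": 0.5}  # assumption: begin transitions a_{0k}, not given on the slides


def joint_log_prob(x, path):
    """log P(x, pi) = log a_{0,pi_1} + sum_i [log e_{pi_i}(x_i) + log a_{pi_i,pi_{i+1}}].

    No end state is modelled here, so the last transition term is omitted.
    """
    if len(x) != len(path):
        raise ValueError("x and path must have the same length")
    logp = math.log(A0[path[0]])
    for i, (symbol, state) in enumerate(zip(x, path)):
        logp += math.log(E[state][symbol])
        if i + 1 < len(path):
            logp += math.log(A[state][path[i + 1]])
    return logp


x = "62625335636616366646623"   # rolls from the casino slide
pi = "FFFFFFFFLLLLLLLLLFFFFFF"  # die labels from the casino slide
print(joint_log_prob(x, pi))
print(joint_log_prob(x, "F" * len(x)))  # same rolls, all-fair path for comparison
```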
Estimating the HMM path
Several approaches to find the “most probable state path” $\pi^* = \arg\max_\pi P(x, \pi)$, or to weigh the evidence for each state:
• Viterbi algorithm: focuses on identifying the most probable state path
• Forward algorithm: computes $P(x)$, summing over all possible state paths
• Backward algorithm: combined with the forward algorithm, focuses more on posterior state probabilities (position-specific prob. that observation came from state $k$ given observed sequence)
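For the casino example, the most probable state path can be found with a short dynamic program in log space. This is a sketch of the Viterbi recursion under the same assumptions as before: uniform begin probabilities `A0` (not given on the slides), no end state, and a non-empty roll sequence.

```python
import math

STATES = ("F", "L")
A = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
E = {
    "F": {b: 1 / 6 for b in "123456"},
    "L": {**{b: 1 / 10 for b in "12345"}, "6": 1 / 2},
}
A0 = {"F": 0.5, "L": 0.5}  # assumption: begin probabilities are not given on the slides


def viterbi(x):
    """Return (path, log_prob) for pi* = argmax_pi P(x, pi); x must be non-empty."""
    # v[k]: log-probability of the best path that ends in state k at the current position
    v = {k: math.log(A0[k]) + math.log(E[k][x[0]]) for k in STATES}
    backpointers = []
    for symbol in x[1:]:
        new_v, ptr = {}, {}
        for h in STATES:
            best_k = max(STATES, key=lambda k: v[k] + math.log(A[k][h]))
            ptr[h] = best_k
            new_v[h] = v[best_k] + math.log(A[best_k][h]) + math.log(E[h][symbol])
        v = new_v
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda k: v[k])
    best_logp = v[state]
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), best_logp


rolls = "62625335636616366646623"  # rolls from the casino slide
path, logp = viterbi(rolls)
print(rolls)
print(path)
```

The decoded path need not match the true die labels; it is only the single most probable explanation of the rolls under the model.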
Estimation when paths unknown: Baum-Welch
• An iterative procedure estimates the transition and emission probabilities
• A special case of the EM algorithm (a general approach to deal with maximum likelihood with missing/incomplete/latent data)
  - think of missing covariate: state
• Can also consider an approximation based on iterations of the Viterbi algorithm
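One Baum-Welch iteration can be sketched as a scaled forward-backward pass (E-step) followed by renormalising the expected counts (M-step). The version below is a simplified illustration for a single sequence: the begin probabilities `a0` are held fixed, no pseudocounts are used, and the starting guesses for `a` and `e` are arbitrary assumptions.

```python
import math


def forward_backward(x, states, a0, a, e):
    """Scaled forward (f) and backward (b) tables, scale factors, and log P(x)."""
    n = len(x)
    f = [dict() for _ in range(n)]
    b = [dict() for _ in range(n)]
    scale = [0.0] * n
    for i in range(n):
        for h in states:
            prev = a0[h] if i == 0 else sum(f[i - 1][k] * a[k][h] for k in states)
            f[i][h] = prev * e[h][x[i]]
        scale[i] = sum(f[i].values())
        for h in states:
            f[i][h] /= scale[i]
    for k in states:
        b[n - 1][k] = 1.0
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(a[k][h] * e[h][x[i + 1]] * b[i + 1][h] for h in states) / scale[i + 1]
    return f, b, scale, sum(math.log(s) for s in scale)


def baum_welch_step(x, states, alphabet, a0, a, e):
    """One EM update of a and e; returns (new_a, new_e, log-likelihood before the update)."""
    f, b, scale, loglik = forward_backward(x, states, a0, a, e)
    n = len(x)
    # E-step: expected transition counts A[k][h] and emission counts E[k][symbol].
    A = {k: {h: 0.0 for h in states} for k in states}
    E = {k: {s: 0.0 for s in alphabet} for k in states}
    for i in range(n):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k]  # posterior P(pi_i = k | x)
            if i + 1 < n:
                for h in states:
                    A[k][h] += f[i][k] * a[k][h] * e[h][x[i + 1]] * b[i + 1][h] / scale[i + 1]
    # M-step: normalise expected counts into probabilities.
    new_a = {k: {h: A[k][h] / sum(A[k].values()) for h in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return new_a, new_e, loglik


x = "62625335636616366646623"  # rolls from the casino slide
states, alphabet = "FL", "123456"
a0 = {"F": 0.5, "L": 0.5}  # assumption: begin probabilities, held fixed
a = {"F": {"F": 0.8, "L": 0.2}, "L": {"F": 0.2, "L": 0.8}}  # arbitrary starting guess
e = {"F": {s: 1 / 6 for s in alphabet},
     "L": {s: (0.3 if s == "6" else 0.14) for s in alphabet}}  # arbitrary starting guess
logliks = []
for _ in range(20):
    a, e, ll = baum_welch_step(x, states, alphabet, a0, a, e)
    logliks.append(ll)
print(logliks[0], logliks[-1])
```

With only 23 rolls the fitted values are not meaningful estimates; the point is that the log-likelihood does not decrease from one iteration to the next, as expected for EM.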
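Pairwise Alignments as HMMs
Recall notation: x & y are sequences to be aligned, with gap opening penalty of d and gap extension penalty of e
Let (+w,+v) here represent change in sequence position, with M=match, and X,Y=insertion (gap) in x or y
[Diagram 1: alignment as a finite-state machine with additive scores]
  States: M(+1,+1), X(+1,+0), Y(+0,+1)
  M → M, X → M, Y → M: $s(x_i, y_j)$
  M → X, M → Y: $-d$
  X → X, Y → Y: $-e$
[Diagram 2: the same three states as a pair HMM with probabilities]
  Emissions: M emits the pair $(x_i, y_j)$ with probability $p_{x_i y_j}$; X emits $x_i$ with probability $q_{x_i}$; Y emits $y_j$ with probability $q_{y_j}$
  M → M: $1-2\delta$
  M → X, M → Y: $\delta$
  X → X, Y → Y: $\varepsilon$
  X → M, Y → M: $1-\varepsilon$
States: insertion (X or Y), match (M)
$\delta$ = probability of moving to a specific insertion state
$\varepsilon$ = prob. of staying in an insertion state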
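The transition structure of this pair HMM can be written down as a small table and checked for consistency. The sketch below omits begin and end states, and the values of `delta` and `epsilon` are illustrative assumptions, not estimates from the slides.

```python
def pair_hmm_transitions(delta, epsilon):
    """Transition probabilities among M, X, Y (begin/end states omitted)."""
    if not (0 < delta < 0.5 and 0 < epsilon < 1):
        raise ValueError("need 0 < delta < 0.5 and 0 < epsilon < 1")
    return {
        "M": {"M": 1 - 2 * delta, "X": delta, "Y": delta},
        "X": {"M": 1 - epsilon, "X": epsilon, "Y": 0.0},
        "Y": {"M": 1 - epsilon, "X": 0.0, "Y": epsilon},
    }


delta, epsilon = 0.2, 0.1  # illustrative values only
a = pair_hmm_transitions(delta, epsilon)
for state, row in a.items():
    print(state, row, "row sum =", sum(row.values()))

# Staying in an insertion state with probability epsilon makes the gap length
# geometric, so a gap, once opened, has expected length 1 / (1 - epsilon).
expected_gap_length = 1 / (1 - epsilon)
print(expected_gap_length)
```

In this layout $\delta$ plays the role of the gap-opening penalty $d$ and $\varepsilon$ the role of the gap-extension penalty $e$, but on a probability scale.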
What can be done with pairwise HMM
• Build HMM for a random (non-matched) model
• Evaluate likelihood of matched model by considering log-odds of matched vs. random models
• Search for other alignments: sub-optimal
• Consider posterior probability of alignment: $P\{x_i \text{ “is aligned with” } y_j \mid x, y\}$
Using HMMs to describe a “family”
• Suppose we have an alignment of multiple sequences – we can model their “relationship” as a family of sequences – call this the family’s “profile”
• PSSM – position-specific score matrix
  - estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)
• Allow for insertions and deletions, where “cost” could also be position-specific
• Use this profile to describe the alignment and look for other similar sequences
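A PSSM can be estimated from the columns of an ungapped alignment as log-odds of column frequencies against background frequencies. The sketch below is one simple way to do this; the toy alignment, the uniform background, and the pseudocount of 1 are all assumptions for illustration, and insertions and deletions are not handled.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_alignment(aligned, alphabet=AMINO_ACIDS, background=None, pseudocount=1.0):
    """Return one dict per column: residue -> log2(p_column(residue) / q(residue))."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    if background is None:
        background = {r: 1 / len(alphabet) for r in alphabet}  # assumption: uniform background
    pssm = []
    for j in range(length):
        counts = Counter(s[j] for s in aligned if s[j] in alphabet)  # gap characters are skipped
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background[r])
            for r in alphabet
        })
    return pssm


def score(pssm, segment):
    """Sum of position-specific scores for a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


toy_alignment = ["SRCSS", "SRCSS", "SRCST", "TRCSS"]  # hypothetical toy data
pssm = pssm_from_alignment(toy_alignment)
print(round(score(pssm, "SRCSS"), 2), round(score(pssm, "AAAAA"), 2))
```

The same residue can score differently at different positions, which is the sense in which ‘A’ may “count for more” at one column than another.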
Transition structure of a profile HMM
[Diagram: Begin → ... → End, showing the states $M_j$, $I_j$, and $D_j$ at position $j$]
$j$: specific position of profile
$M_j$: match state; $I_j$: insertion state; $D_j$: deletion state
How do we get this “family”?
• Multiple Sequence Alignment
  - many possible strategies to find and score possible alignments
• One common way: ClustalW
  - a “progressive alignment” approach
  - construct pairwise distances based on evolutionary distance
  - essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity
  - additional heuristics make final alignment more accurate
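The agglomerative idea behind progressive alignment can be sketched as follows: given pairwise distances, repeatedly merge the two closest clusters and record the order, which is the order in which sequences or groups would be aligned. This is only an illustration using average linkage on made-up distances; ClustalW itself builds its guide tree with neighbor joining, and the names `A` to `D` are hypothetical.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[frozenset((p, q))] for p in c1 for q in c2) / (len(c1) * len(c2))


def merge_order(names, dist):
    """Average-linkage agglomerative clustering; returns the merges in order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (most similar first).
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(pair[0], pair[1], dist))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical distances between four sequences (not computed from real data).
dist = {
    frozenset(("A", "B")): 0.1, frozenset(("A", "C")): 0.6, frozenset(("A", "D")): 0.7,
    frozenset(("B", "C")): 0.5, frozenset(("B", "D")): 0.7, frozenset(("C", "D")): 0.2,
}
for step, (left, right) in enumerate(merge_order("ABCD", dist), start=1):
    print(step, left, "+", right)
```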
Possible Strategies
[image from HMMER (on bioweb.pasteur.fr server)]
One possible analysis approach
• Obtain multiple alignment using ClustalW
  - http://www.ebi.ac.uk/Tools/clustalw2
  - creates alignment files in various formats
    - some specialized for tree-viewing, for example
    - can get FASTA format of alignment to pass to HMMER
• Obtain HMM model using HMMER
  - http://bioweb2.pasteur.fr/alignment/intro-en.html
  - creates a “consensus” sequence to summarize the profile (hmmbuild)
  - can use this profile to search database for similar sequences (hmmsearch)
Example
• (Source: JalView example at ClustalW; same as HW5 data)
• Five proteins from different species (FASTA format)
  - Mouse (2)
  - Human
  - Chicken
  - Rat
>FOSB_MOUSE Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_HUMAN Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSYTSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANFVPTVTAISTSPDLQWLVQPTLISSVAPSQNRGHPYGVPAPAPPAAYSRPAVLKAPGGRGQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEEEKSALQAEIANLLKEKEKLEFILAAHRPACKMPEELRFSEELAAATALDLGAPSPAAAEEAFALPLMTEAPPAVPPKEPSGSGLELKAEPFDELLFSAGPREASRSVPDMDLPGASSFYASDWEPLGAGSGGELEPLCTPVVTCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVVKTMSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTSLDLTGGLPEATTPESEEAFTLPLLNDPEPKPSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSGSFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMVKTVSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGFPEEMSVASLDLTGGLPEASTPESEEAFTLPLLNDPEPKPSLEPVKSISNVELKAEPFDDFLFPASSRPSGSETSRSVPDVDLSGSFYAADWEPLHSNSLGMGPMVTELEPLCTPVVTCTPGCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
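FASTA records like these are easy to read programmatically: a header line starting with `>` followed by one or more sequence lines. The sketch below uses a hypothetical helper `parse_fasta`; the example text holds only the first residues of two of the sequences above, truncated for brevity.

```python
def parse_fasta(text):
    """Return a dict mapping each record's identifier to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier = first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


# Truncated excerpts of two records from the slide (not the full proteins).
example = """>FOSB_MOUSE Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSS
VDSFGSPPTAAASQECAGLGEMPGSFVP
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLS
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq[:20])
```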
ClustalW – quick example
[Screenshot of the ClustalW web form, with callouts:]
- set alignment options here
- paste multiple sequences here (in FASTA format, e.g.)
- click Run to start alignment
ClustalW – format output in Jalview (Java applet)
>FOS_RAT/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVVKTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMVKTVSGGRAQSIG------------...
from “File” → “Output to Textbox” → “FASTA” format (others available)
Here, color is by BLOSUM62 score
paste FASTA format here (from ClustalW, for example)
http://bioweb2.pasteur.fr/alignment/intro-en.html
hmmbuild results
…
…
hmmbuild results (reformatted by hand)
HMM     A     C     D   ...     Q     R     S     T   ...
...
 15  2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   15
 16  3.08  4.91  3.57   ...  3.09  0.88  3.16  3.34   ...   16
 17  2.66  0.81  4.20   ...  4.12  3.89  2.93  3.13   ...   17
 18  2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   18
 19  2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   19
...
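One way to read these numbers is to pick, for each profile position, the residue with the smallest score. The sketch below assumes the scores are negative natural-log probabilities (so smaller means more probable), as in the profile files written by HMMER3; that interpretation is an assumption here, and only the residues shown on the slide are included.

```python
import math

# Scores for profile positions 15-19, limited to the residues shown on the slide.
scores = {
    15: {"A": 2.35, "C": 4.27, "D": 3.26, "Q": 3.50, "R": 3.44, "S": 0.99, "T": 2.83},
    16: {"A": 3.08, "C": 4.91, "D": 3.57, "Q": 3.09, "R": 0.88, "S": 3.16, "T": 3.34},
    17: {"A": 2.66, "C": 0.81, "D": 4.20, "Q": 4.12, "R": 3.89, "S": 2.93, "T": 3.13},
    18: {"A": 2.35, "C": 4.27, "D": 3.26, "Q": 3.50, "R": 3.44, "S": 0.99, "T": 2.83},
    19: {"A": 2.35, "C": 4.27, "D": 3.26, "Q": 3.50, "R": 3.44, "S": 0.99, "T": 2.83},
}


def consensus(scores):
    """Lowest-scoring residue at each position, in position order."""
    return "".join(min(row, key=row.get) for _, row in sorted(scores.items()))


print(consensus(scores))  # SRCSS

# Under the negative-log assumption, exp(-score) recovers the emission probability.
for position, row in sorted(scores.items()):
    best = min(row, key=row.get)
    print(position, best, round(math.exp(-row[best]), 2))
```

The consensus residue is only a summary: every other residue still has a nonzero probability at each position, which is what distinguishes the profile from a plain consensus string.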
HMMER – search for “family” members
hmmsearch results
…
                          12345678901234567890123456789012
alignfile_data          1 mmfqafagdyeasssrcssaspaadslsyyls
                          mmf++f++dyeasssrcssaspa+dslsyy+s
gp|BC029814|BC029814_1  1 MMFSGFNADYEASSSRCSSASPAGDSLSYYHS
(more to profile than just SRCSS)
Summary
• Hidden Markov Models
  - use to describe sequence alignments
  - main idea: how does each portion of alignment represent the “family profile”
• Idea of profile: general “family” characteristics
• Online resources
  - ClustalW – perform multiple alignments
  - HMMER – build (& use) HMM model from multiple alignment