
1

Introduction to Hidden Markov Models and Profiles in Sequence Alignment

Utah State University – Spring 2010

STAT 5570: Statistical Bioinformatics

Notes 6.3

2

References

• Chapters 3-6 of Biological Sequence Analysis (Durbin et al., 2001)

• Chapter 9 of A First Course in Probability (Ross, 1997)

• Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763

3

The “occasionally dishonest casino”

• A casino usually uses a fair die, but sometimes (5% of the time) switches to a loaded die. Once using the loaded die, it usually keeps using it (90% of the time).

• How do you know which die is in play?
  - not sure, but have to look at many plays to see a pattern
  - the “state” here is “hidden”

• One possible representation of this “model”:

  Emission probabilities
  Roll   Fair   Loaded
  1      1/6    1/10
  2      1/6    1/10
  3      1/6    1/10
  4      1/6    1/10
  5      1/6    1/10
  6      1/6    1/2

  Transition probabilities
  Fair → Fair:     0.95     Fair → Loaded: 0.05
  Loaded → Loaded: 0.90     Loaded → Fair: 0.10

Possible partial sequence of rolls:

Roll ...62625335636616366646623...
Die  ...FFFFFFFFLLLLLLLLLFFFFFF...

F = “fair” state, L = “loaded” state
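To make the casino model concrete, here is a minimal simulation sketch in Python. The transition and emission values are the ones on this slide; the start state (fair), the function name `simulate`, and the seed are assumptions made for illustration only.

```python
import random

# Transition and emission probabilities taken from the casino slide.
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}


def simulate(n, start="F", seed=None):
    """Return (rolls, dice) strings of length n.

    Starting in the fair state is an assumption; the slide does not say.
    """
    rng = random.Random(seed)
    state, rolls, dice = start, [], []
    for _ in range(n):
        # Emit a roll from the current (hidden) die.
        faces = list(EMIT[state])
        weights = [EMIT[state][f] for f in faces]
        rolls.append(rng.choices(faces, weights=weights)[0])
        dice.append(state)
        # Move to the next hidden state.
        nxt = list(TRANS[state])
        state = rng.choices(nxt, weights=[TRANS[state][s] for s in nxt])[0]
    return "".join(rolls), "".join(dice)


rolls, dice = simulate(60, seed=1)
print("Roll", rolls)
print("Die ", dice)
```

A player would only see the `Roll` line; the `Die` line is the hidden state path.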

4

Follow-up to a sequence alignment

• Consider pairwise (or multiple) alignment

• What does alignment mean?
  - possibly represents: common ancestry

• Possible questions
  - Does alignment describe some “family”?
  - How can we describe its internal structure?

• Can sometimes characterize these “family” structures as a Markov Chain

5

“Family” Example: CpG islands

• In DNA: C & G are always matched up in helical structure

• CpG (C followed by G) in sequence is rare, but is more frequent in promoters or start regions of a gene

• General idea: CpG → at start of gene

• Possible question about our alignment: Does CpG frequency suggest we are in a promoter region?
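As a crude first look at "CpG frequency", one could simply count how often C is followed by G. The sketch below does only that; the function name `cpg_fraction` and the toy sequences are hypothetical, and this is a descriptive count, not a statistical test or an HMM.

```python
def cpg_fraction(seq):
    """Fraction of adjacent base pairs in seq that are the dinucleotide CG."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    # "CG" cannot overlap itself, so str.count sees every occurrence.
    return seq.count("CG") / (len(seq) - 1)


# Hypothetical toy sequences, not real genomic data.
print(cpg_fraction("ACGCGTCGCG"))
print(cpg_fraction("ATTTACATGA"))
```

A higher fraction in one stretch than in its surroundings is the kind of pattern the HMM states ("island" vs. "not island") are meant to capture.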

6

Markov Chain Examples

• Occasionally dishonest casino: Probability you are using a fair die on one play depends on whether you were using a fair die on the previous play.

• Rain indicator: Chance of rain tomorrow depends on whether it rains today

• Gambler’s Ruin: A gambler starts out with a fortune and plays a game repeatedly (with same prob. of winning or losing $1 each time) until her fortune reaches either 0 or M. Her sequence of fortunes is a Markov chain.

• CpG Island: Probability you are in a CpG island at one point in alignment depends on whether you were in a CpG island at the previous point.

7

Markov Chains – a little more formally

• Sequence of random variables: $X_0, X_1, \dots$

• Set of possible values: $\{0, 1, \dots, M\}$

• Think of $X_t$ as state of process at time $t$

• This sequence is a Markov Chain if:

$$P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \dots, X_1=i_1, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} = P_{ij}$$

• So state at time $t+1$ depends only on: state at time $t$
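The Gambler's Ruin example gives a small, concrete transition matrix. The sketch below builds it and uses the Markov property to get the probability of a whole path as a product of one-step probabilities. The function names, the win probability `p`, and the choice M = 4 are assumptions for illustration.

```python
def gamblers_ruin_matrix(M, p):
    """Return the (M+1) x (M+1) matrix P with P[i][j] = P_ij.

    p is the (assumed constant) probability of winning $1 on a play.
    """
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    P[0][0] = 1.0  # fortune 0: play stops, so the state is absorbing
    P[M][M] = 1.0  # fortune M: play stops, so the state is absorbing
    for i in range(1, M):
        P[i][i + 1] = p
        P[i][i - 1] = 1 - p
    return P


def path_probability(P, path):
    """Probability of path[1:] given path[0], as a product of P_ij terms."""
    prob = 1.0
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]
    return prob


P = gamblers_ruin_matrix(M=4, p=0.5)
# Start with $2, win once, then lose three times in a row.
print(path_probability(P, [2, 3, 2, 1, 0]))
```

Each row of `P` sums to 1, and the product form is exactly what "depends only on the state at time t" buys.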

8

Hidden Markov Model (HMM) - vocabulary

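• State: in which “family” the process is (CpG vs. not, fair vs. loaded die, etc.)
  - the “path”: $\pi$; the $i$th state in path: $\pi_i$

• Symbol: observed “outcome” $x$ (sequence, die, etc.) from unknown state

• Transition probabilities: $a_{kh} = P\{\pi_i = h \mid \pi_{i-1} = k\}$

• Emission probabilities: $e_k(b) = P\{x_i = b \mid \pi_i = k\}$

• Joint probability of observed sequence $x$ & state sequence $\pi$:

$$P(x, \pi) = a_{0\pi_1} \prod_{i} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

  where the product runs over the positions $i = 1, 2, \dots$ of the sequence, and state 0 denotes the begin (and end) state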
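The joint probability is easy to compute directly once a path is given. Here is a minimal sketch for the casino model, working in logs to avoid underflow. Two assumptions are made that the slides do not state: the begin probabilities `START` (a_0k) are 0.5 each, and the final transition into the end state is taken to have probability 1 (i.e., it is ignored).

```python
import math

START = {"F": 0.5, "L": 0.5}  # assumption: a_0k is not given on the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}


def joint_log_prob(x, path):
    """log P(x, pi) for observed symbols x and a state path of equal length."""
    if len(x) != len(path) or not x:
        raise ValueError("x and path must be non-empty and of equal length")
    logp = math.log(START[path[0]])  # a_{0, pi_1}
    for i, (sym, k) in enumerate(zip(x, path)):
        logp += math.log(EMIT[k][sym])  # e_{pi_i}(x_i)
        if i + 1 < len(path):
            logp += math.log(TRANS[k][path[i + 1]])  # a_{pi_i, pi_{i+1}}
    return logp


# Three sixes in a row: compare "all loaded" with "all fair".
print(joint_log_prob("666", "LLL"), joint_log_prob("666", "FFF"))
```

Comparing the two printed values shows which of the two paths explains the rolls better under this model.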

9

Estimating the HMM path

Several approaches:

• Viterbi algorithm: focuses on identifying the “most probable state path” $\pi^* = \arg\max_{\pi} P(x, \pi)$

• Forward algorithm: computes the probability of the observed sequence, $P(x)$, summed over all paths

• Backward algorithm: combined with the forward algorithm, focuses more on posterior state probabilities (position-specific prob. that observation came from state $k$ given observed sequence)
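Both ideas can be sketched in a few lines for the casino model. `viterbi` returns the most probable state path (in log space), and `posterior` combines scaled forward and backward passes to give, at each position, the probability of each state given the whole sequence. The begin probabilities `START` are an assumption (0.5 each), the end state is ignored, and the function names are illustrative.

```python
import math

STATES = ["F", "L"]
START = {"F": 0.5, "L": 0.5}  # assumption: not given on the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}


def viterbi(x):
    """Most probable state path for the observed string x."""
    v = {k: math.log(START[k]) + math.log(EMIT[k][x[0]]) for k in STATES}
    back = []  # back[i][h] = best predecessor of state h at position i+1
    for sym in x[1:]:
        ptr, new_v = {}, {}
        for h in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][h]))
            ptr[h] = best
            new_v[h] = v[best] + math.log(TRANS[best][h]) + math.log(EMIT[h][sym])
        back.append(ptr)
        v = new_v
    state = max(STATES, key=lambda k: v[k])
    path = [state]
    for ptr in reversed(back):  # trace back from the last position
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


def posterior(x):
    """List of dicts: P(state at position i = k | x), via scaled forward-backward."""
    n = len(x)
    f, scale = [], []
    cur = {k: START[k] * EMIT[k][x[0]] for k in STATES}
    for i in range(n):
        if i > 0:
            cur = {h: EMIT[h][x[i]] * sum(f[-1][k] * TRANS[k][h] for k in STATES)
                   for h in STATES}
        s = sum(cur.values())  # scaling factor keeps values away from underflow
        f.append({k: p / s for k, p in cur.items()})
        scale.append(s)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in STATES}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][h] * EMIT[h][x[i + 1]] * b[i + 1][h]
                       for h in STATES) / scale[i + 1]
                for k in STATES}
    return [{k: f[i][k] * b[i][k] for k in STATES} for i in range(n)]


rolls = "62625335636616366646623"  # the partial roll sequence from the casino slide
print(viterbi(rolls))
print([round(p["L"], 2) for p in posterior(rolls)])
```

The Viterbi output is a single best labeling, while the posterior list shows how confident the model is at each roll; the two need not agree everywhere.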

10

Estimation when paths unknown: Baum-Welch

• An iterative procedure estimates the transition and emission probabilities

• A special case of the EM algorithm (a general approach to deal with maximum likelihood with missing/incomplete/latent data)
  - think of missing covariate: state

• Can also consider an approximation based on iterations of the Viterbi algorithm
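It helps to see the easy case first: if the state path were known, the maximum likelihood estimates would just be normalized counts. The sketch below does that for the labeled roll sequence from the casino slide. Baum-Welch follows the same recipe but replaces these observed counts with expected counts computed from the forward and backward algorithms, and iterates. The function name is illustrative, and no pseudocounts are used, which is an assumption that matters for short sequences.

```python
from collections import Counter, defaultdict


def estimate_from_labeled(rolls, dice):
    """Count-based estimates of transition and emission probabilities."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for k, h in zip(dice, dice[1:]):  # consecutive state pairs
        trans_counts[k][h] += 1
    for sym, k in zip(rolls, dice):  # symbol emitted in each state
        emit_counts[k][sym] += 1

    def normalize(counts):
        return {k: {key: n / sum(c.values()) for key, n in c.items()}
                for k, c in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)


# Labeled partial sequence from the casino slide.
rolls = "62625335636616366646623"
dice = "FFFFFFFFLLLLLLLLLFFFFFF"
a, e = estimate_from_labeled(rolls, dice)
print(a)
print(e)
```

With only 23 rolls the estimates are noisy and some faces get probability zero, which is one reason pseudocounts are commonly added in practice.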

11

Pairwise Alignments as HMMs

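Scoring (finite-state) view of pairwise alignment:

  States: M(+1,+1), X(+1,+0), Y(+0,+1)
  Transition scores:
    into M (from M, X, or Y): $s(x_i, y_j)$
    M → X, M → Y: $-d$
    X → X, Y → Y: $-e$

Recall notation: x & y are sequences to be aligned, with gap opening penalty of d and gap extension penalty of e

Let (+w,+v) here represent change in sequence position, with M=match, and X,Y=insertion (gap) in x or y

Probabilistic (HMM) view:

  Emissions:
    M: $p_{x_i y_j}$
    X: $q_{x_i}$
    Y: $q_{y_j}$
  Transition probabilities:
    M → M: $1-2\delta$
    M → X, M → Y: $\delta$
    X → X, Y → Y: $\varepsilon$
    X → M, Y → M: $1-\varepsilon$

States: insertion (X or Y), match (M)
δ = probability of moving to a specific insertion state
ε = prob. of staying in an insertion state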
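To see how the pair HMM assigns a probability to one particular alignment, the sketch below walks through a gapped alignment column by column, multiplying transition and emission probabilities (in logs). Everything numeric here is hypothetical: the values of δ and ε, the toy DNA emission probabilities `P_MATCH`, `P_MISMATCH`, and `Q`, and the assumption that the begin state behaves like M. The end state is ignored.

```python
import math

# Hypothetical toy DNA emissions: 4 matches * 0.19 + 12 mismatches * 0.02 = 1.
P_MATCH, P_MISMATCH, Q = 0.19, 0.02, 0.25


def transitions(delta, eps):
    """Transition probabilities of the three-state pair HMM."""
    return {
        "M": {"M": 1 - 2 * delta, "X": delta, "Y": delta},
        "X": {"M": 1 - eps, "X": eps},
        "Y": {"M": 1 - eps, "Y": eps},
    }


def alignment_log_prob(ax, ay, delta=0.1, eps=0.3):
    """log probability of a gapped alignment (ax over ay, '-' marks a gap)."""
    if len(ax) != len(ay):
        raise ValueError("aligned strings must have equal length")
    trans = transitions(delta, eps)
    state, logp = "M", 0.0  # assumption: begin state behaves like M
    for a, b in zip(ax, ay):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if b == "-":
            nxt, emit = "X", Q  # x_i emitted alone: (+1,+0)
        elif a == "-":
            nxt, emit = "Y", Q  # y_j emitted alone: (+0,+1)
        else:
            nxt, emit = "M", (P_MATCH if a == b else P_MISMATCH)
        if nxt not in trans[state]:
            raise ValueError("this model has no direct X <-> Y transition")
        logp += math.log(trans[state][nxt]) + math.log(emit)
        state = nxt
    return logp


print(alignment_log_prob("ACGT", "ACGT"))
print(alignment_log_prob("ACGT", "A-GT"))
```

Maximizing this quantity over all alignments is what the Viterbi algorithm does for the pair HMM, which mirrors the usual dynamic programming with scores s, d, and e.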

12

What can be done with pairwise HMM

• Build HMM for a random (non-matched) model

• Evaluate likelihood of matched model by considering log-odds of matched vs. random models

• Search for other alignments: sub-optimal

• Consider posterior probability of alignment: $P\{x_i \text{ “is aligned with” } y_j \mid x, y\}$

13

Using HMMs to describe a “family”

• Suppose we have an alignment of multiple sequences – we can model their “relationship” as a family of sequences – call this the family’s “profile”

• PSSM – position-specific score matrix
  - estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)

• Allow for insertions and deletions, where “cost” could also be position-specific

• Use this profile to describe the alignment and look for other similar sequences
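A PSSM can be sketched as one column of log-odds scores per alignment position. The example below is a minimal version under stated assumptions: a hypothetical toy DNA alignment, a uniform background, and a pseudocount of 1 per residue. It ignores insertions and deletions, which is exactly the part a profile HMM adds.

```python
import math
from collections import Counter


def pssm(alignment, alphabet, background, pseudocount=1.0):
    """One dict per column: residue -> log2(column prob / background prob)."""
    matrix = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")  # gaps are skipped
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            a: math.log2((counts[a] + pseudocount) / total / background[a])
            for a in alphabet
        })
    return matrix


def score(seq, matrix):
    """Sum of position-specific scores for an ungapped sequence."""
    return sum(col[a] for a, col in zip(seq, matrix))


# Hypothetical toy alignment, not real data.
toy = ["ACGT", "ACGA", "ACCT", "TCGT"]
bg = {a: 0.25 for a in "ACGT"}
m = pssm(toy, "ACGT", bg)
print(score("ACGT", m), score("TGCA", m))
```

A sequence that follows the column preferences scores higher than one that does not, which is the sense in which 'A' can "count for more" at one position than another.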

14

Transition structure of a profile HMM

[Diagram: Begin → ... → Mj → ... → End, with an insertion state Ij and a deletion state Dj at each position]

j: specific position of profile

Mj: match state
Ij: insertion state
Dj: deletion state

15

How do we get this “family”?

• Multiple Sequence Alignment
  - many possible strategies to find and score possible alignments

• One common way: ClustalW
  - a “progressive alignment” approach
  - construct pairwise distances based on evolutionary distance
  - essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity
  - additional heuristics make final alignment more accurate

16

Possible Strategies

image from HMMER (on bioweb.pasteur.fr server)

17

One possible analysis approach

• Obtain multiple alignment using ClustalW
  - http://www.ebi.ac.uk/Tools/clustalw2
  - creates alignment files in various formats
    - some specialized for tree-viewing, for example
    - can get FASTA format of alignment to pass to HMMER

• Obtain HMM using HMMER
  - http://bioweb2.pasteur.fr/alignment/intro-en.html
  - creates a “consensus” sequence to summarize the profile (hmmbuild)
  - can use this profile to search database for similar sequences (hmmsearch)

18

Example

• (Source: JalView example at ClustalW; same as HW5 data)

• Five proteins from different species (FASTA format)
  - Mouse (2)
  - Human
  - Chicken
  - Rat

>FOSB_MOUSE Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_HUMAN Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSYTSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANFVPTVTAISTSPDLQWLVQPTLISSVAPSQNRGHPYGVPAPAPPAAYSRPAVLKAPGGRGQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEEEKSALQAEIANLLKEKEKLEFILAAHRPACKMPEELRFSEELAAATALDLGAPSPAAAEEAFALPLMTEAPPAVPPKEPSGSGLELKAEPFDELLFSAGPREASRSVPDMDLPGASSFYASDWEPLGAGSGGELEPLCTPVVTCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVVKTMSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTSLDLTGGLPEATTPESEEAFTLPLLNDPEPKPSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSGSFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMVKTVSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGFPEEMSVASLDLTGGLPEASTPESEEAFTLPLLNDPEPKPSLEPVKSISNVELKAEPFDDFLFPASSRPSGSETSRSVPDVDLSGSFYAADWEPLHSNSLGMGPMVTELEPLCTPVVTCTPGCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL

19

ClustalW – quick example

set alignment options here

paste multiple sequences here (in FASTA format, e.g.)

click Run to start alignment

20

21

ClustalW – format output in Jalview (Java applet)

>FOS_RAT/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVVKTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMVKTVSGGRAQSIG------------...

from “File” → “Output to Textbox” → “FASTA” format (others available)

Here, color is by BLOSUM62 score

22

paste FASTA format here (from ClustalW, for example)

http://bioweb2.pasteur.fr/alignment/intro-en.html

23

hmmbuild results

…

…

24

hmmbuild results (reformatted by hand)

HMM     A     C     D    ...    Q     R     S     T    ...
...
 15   2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   15
 16   3.08  4.91  3.57   ...  3.09  0.88  3.16  3.34   ...   16
 17   2.66  0.81  4.20   ...  4.12  3.89  2.93  3.13   ...   17
 18   2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   18
 19   2.35  4.27  3.26   ...  3.50  3.44  0.99  2.83   ...   19
...
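One way to read these numbers is sketched below. It rests on two assumptions that the slide does not state: that each row lists, for profile positions 15 to 19, scores for the residues A, C, D, Q, R, S, T (the columns shown by hand), and that the scores are HMMER3-style negative natural-log probabilities, so a smaller score means a more probable residue. The names `SCORES`, `to_probs`, and `best` are illustrative.

```python
import math

# Assumed column order of the hand-reformatted table (only 7 of 20 residues).
RESIDUES = ["A", "C", "D", "Q", "R", "S", "T"]
SCORES = {
    15: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
    16: [3.08, 4.91, 3.57, 3.09, 0.88, 3.16, 3.34],
    17: [2.66, 0.81, 4.20, 4.12, 3.89, 2.93, 3.13],
    18: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
    19: [2.35, 4.27, 3.26, 3.50, 3.44, 0.99, 2.83],
}


def to_probs(scores):
    """Convert scores to probabilities, assuming score = -ln(probability)."""
    return [math.exp(-s) for s in scores]


def best(position):
    """Most probable residue at a position, among the columns shown."""
    row = SCORES[position]
    return RESIDUES[row.index(min(row))]


consensus = "".join(best(p) for p in sorted(SCORES))
print(consensus)  # SRCSS under the assumed column order
print([round(p, 2) for p in to_probs(SCORES[17])])
```

Under these assumptions, positions 15 to 19 favor S, R, C, S, S, which lines up with the SRCSS stretch noted on the hmmsearch results slide.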

25

HMMER – search for “family” members

26

HMMER – search for “family” members

27

28

hmmsearch results

…

29

                         12345678901234567890123456789012
alignfile_data         1 mmfqafagdyeasssrcssaspaadslsyyls
                         mmf++f++dyeasssrcssaspa+dslsyy+s
gp|BC029814|BC029814_1 1 MMFSGFNADYEASSSRCSSASPAGDSLSYYHS

(more to profile than just SRCSS)

30

Summary

• Hidden Markov Models
  - use to describe sequence alignments
  - main idea: how does each portion of alignment represent the “family profile”

• Idea of profile: general “family” characteristics

• Online resources
  - ClustalW – perform multiple alignments
  - HMMER – build (& use) HMM from multiple alignment
