
# Bioinformatics: Introduction to Hidden Markov Models


Bioinformatics

Introduction to Hidden Markov Models

Hidden Markov Models and Multiple Sequence Alignment

Slides borrowed from

Scott C. Schmidler

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Outline

- Probability Review

- Markov Chains

- Hidden Markov Chains

- Examples of HMMs for Protein Sequences

- Algorithm Review for HMMs


Motivation: Composing a Drama by Mimicking Shakespeare

- Assume we want to write a drama in the style of Shakespeare

- We collect a large set of Shakespeare's works

- Define a vocabulary V = {X1, X2, ..., XN}

- Build a model P(Xi|Xj) for i, j = 1, ..., N

- To compose a drama, generate words from the model P(Xi|Xj)

- Though this is too simplistic to be useful, this naive model can be extended and refined to mimic Shakespeare's writing style
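The recipe above can be sketched directly: count word bigrams to estimate P(Xi|Xj), then sample from the estimated model. A minimal sketch in Python, where the tiny corpus and all names are illustrative stand-ins for a real Shakespeare corpus:

```python
import random
from collections import defaultdict

# Tiny stand-in corpus for "a large set of Shakespeare's works".
corpus = "to be or not to be that is the question".split()

# Count bigram transitions: counts[prev][next] estimates P(next | prev).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(word):
    """Draw the next word from the estimated P(X | word)."""
    nxts = counts[word]
    return random.choices(list(nxts), weights=list(nxts.values()))[0]

def generate(start, length):
    """Compose text by repeatedly sampling from the bigram model."""
    out = [start]
    for _ in range(length - 1):
        if not counts[out[-1]]:  # dead end: word never seen with a successor
            break
        out.append(sample_next(out[-1]))
    return " ".join(out)

random.seed(0)
print(generate("to", 8))
```

With more data and a longer context window, the same counting-and-sampling scheme is the extension the slide alludes to.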


Markov Approximations to English

- From Shannon's original paper:

1. Zero-order approximation:

XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

2. First-order approximation:

OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL

3. Second-order approximation:

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CITSBE


Markov Approximations (cont.)

From Shannon's paper:

4. Third-order approximation:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE

Markov random field with 1000 "features," no underlying "machine" (Della Pietra et al., 1997):

WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONISTS AT RORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONSTRANDED STATER VILL MENTTERING AND OF IN VERATE OF TO


Word-Based Approximations

1. First-order approximation:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

2. Second-order approximation:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

Shannon’s comment:

"It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage."


Motivation: Composing a Symphony of Beethoven Style

- We want to compose a symphony of Beethoven style

- We collect a large set of Beethoven's works

- Define a vocabulary V = {X1, X2, ..., XN} of musical notes

- Build a model P(Xi|Xj) for i, j = 1, ..., N

- To compose a symphony, generate note symbols from the model P(Xi|Xj)


Modeling Biological Sequences

- Collect a set of sequences of interest

- Define a vocabulary V = {X1, X2, ..., XN}

  - For DNA sequences: N = 4 and V = {A, T, G, C}
  - For protein sequences: N = 20 and V = {amino acids}

- Build (learn) a model P(Xi|Xj) for i, j = 1, ..., N, or more generally P(X|w) with X = X1, X2, ..., XM and model parameter vector w

- The model can be used:

  - To generate typical sequences from the class of training sequences, e.g. a protein family
  - To compute the probability of an observed sequence O being generated from the model class
  - and others

- Hidden Markov models (HMMs) are a class of stochastic generative models effective for building such probabilistic models.
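As a concrete sketch of the "build (learn) a model P(Xi|Xj)" step for DNA, the snippet below estimates transition probabilities from dinucleotide counts; the training sequences are made up for illustration:

```python
# Maximum-likelihood estimate of P(next base | current base) from
# dinucleotide counts. Training sequences are illustrative only.
V = "ATGC"
train = ["ATGCGCGA", "ATTACGCG", "GCGCATAT"]

# Count observed dinucleotides.
counts = {a: {b: 0 for b in V} for a in V}
for seq in train:
    for x, y in zip(seq, seq[1:]):
        counts[x][y] += 1

# Normalize each row to get P(next | current); fall back to uniform
# if a base was never seen with a successor.
P = {}
for a in V:
    total = sum(counts[a].values())
    P[a] = ({b: counts[a][b] / total for b in V} if total
            else {b: 1.0 / len(V) for b in V})
```

Each row of `P` is a probability distribution over the next base, which is exactly the P(Xi|Xj) table the slide describes.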


Probability Review

- Probability notation:

  - Probability: P(A)
  - Joint probability: P(A, B)
  - Conditional probability: P(A|B) = P(A, B) / P(B)
  - Marginal probability: P(A) = Σ_B P(A, B)
  - Independence: P(A, B) = P(A) P(B)
  - Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
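The identities listed above can be checked numerically on a made-up joint distribution over two binary variables:

```python
# Illustrative joint distribution P(A, B) over binary A and B.
PAB = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginals: P(A=a) = sum over b of P(A=a, B=b), and likewise for B.
PA = {a: sum(PAB[(a, b)] for b in (0, 1)) for a in (0, 1)}
PB = {b: sum(PAB[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Conditional: P(A | B=1) = P(A, B=1) / P(B=1).
PA_given_B1 = {a: PAB[(a, 1)] / PB[1] for a in (0, 1)}

# Bayes' rule: P(A | B=1) = P(B=1 | A) P(A) / P(B=1) gives the same answer.
PB1_given_A = {a: PAB[(a, 1)] / PA[a] for a in (0, 1)}
bayes = {a: PB1_given_A[a] * PA[a] / PB[1] for a in (0, 1)}
assert all(abs(bayes[a] - PA_given_B1[a]) < 1e-12 for a in (0, 1))
```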


Markov Chains

- Markov property: the next state depends only on the current state: P(St | St-1, ..., S0) = P(St | St-1)

- Formally:

  - State space
  - Transition matrix
  - Initial distribution

- CS intuition: a stochastic finite automaton


Markovian Sequence

- States through which the chain passes form a sequence:

Example: S0, S1, S2, ..., SL

- Graphically: (figure: a chain of states S0 → S1 → ... → SL)

- By the Markov property:

P(Sequence) = P(S0, S1, ..., SL) = π(S0) P(S1|S0) ... P(SL|SL-1)


Example

- Markov chain for generating a DNA sequence: (figure: states A, T, G, C with transition arrows)

- Sequence probability:

P(x1 x2 ... xL) = π(x1) P(x2|x1) P(x3|x2) ... P(xL|xL-1)

The transition probabilities capture dinucleotide frequencies (e.g. base-stacking).
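A sketch of the sequence-probability computation above, with an initial distribution π and a transition matrix whose numbers are purely illustrative:

```python
# Illustrative initial distribution and dinucleotide transition matrix.
pi = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
P = {
    "A": {"A": 0.2, "T": 0.3, "G": 0.3, "C": 0.2},
    "T": {"A": 0.3, "T": 0.2, "G": 0.2, "C": 0.3},
    "G": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "C": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
}

def seq_prob(seq):
    """P(x1 ... xL) = pi(x1) * product of P(x_i | x_{i-1})."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= P[prev][cur]
    return p

print(seq_prob("ATG"))  # 0.25 * P(T|A) * P(G|T) = 0.25 * 0.3 * 0.2 = 0.015
```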


Hidden Markov Chains

- Observed sequence is a probabilistic function of an underlying Markov chain

  - Example: HMM for a (noisy) DNA sequence (see e.g. Churchill 1989)

The true state sequence is unknown, but the observation sequence gives us a clue:

(figure: unobserved true sequence above, observed noisy sequence data below)
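The noisy-sequence idea can be sketched as a generative process: a hidden true sequence is observed through a symmetric error channel. The error rate below is illustrative, not taken from Churchill (1989):

```python
import random

BASES = "ATGC"
eps = 0.1  # illustrative per-base error rate

def emit(true_base):
    """Observe the hidden base through a symmetric noise channel:
    report it correctly with probability 1 - eps, otherwise report
    a uniformly chosen wrong base."""
    if random.random() < 1 - eps:
        return true_base
    return random.choice([b for b in BASES if b != true_base])

def sample_noisy(true_seq):
    """Generate the observed sequence from the unobserved truth."""
    return "".join(emit(b) for b in true_seq)

random.seed(1)
hidden = "ATGCATGC"          # unobserved truth
observed = sample_noisy(hidden)  # observed noisy sequence data
```

Inference with an HMM runs this process in reverse: given `observed`, recover (probabilistically) the hidden sequence.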


Example: Hidden Markov Chain for Protein Sequence

- State space is backbone secondary structure

  - Used for prediction (Asai et al., Stultz et al.)

- State space is side-chain environment

  - Used for fold recognition (Hubbard et al.)


An HMM for Multiple Protein Sequences (Krogh et al.)

- "Match" states are model (consensus) positions

- Position-specific deletion penalties

- Position-specific insertion frequencies

- Path through states aligns sequence to model

Figure from (Krogh et al. 1994)


Example: Multiple Alignment of Globin Sequences

Figure from (Krogh et al. 1994)


HMM-based Multiple Sequence Alignment

- Multiple alignment of k sequences is O(n^k), so instead:

1. Estimate a statistical model for the sequences

   - Use an existing PROFILE alignment for a head start
   - Start from scratch with unaligned sequences (harder)

2. Align each remaining sequence to the model

3. The alignment yields assignments of equivalent sequence elements within the multiple alignment


Example: Aligning Sequence to Model

- Given an HMM model for a protein family:

Align a new sequence to the model (d states are gaps, i states are insertions)


Computing with HMMs

1. Probability of an observed sequence

Given O1, O2, ..., Or, find P(O1, O2, ..., Or)

(nontrivial since the state sequence is unobserved)

2. Most likely hidden state sequence

Given O1, O2, ..., Or, compute argmax over S1, ..., Sr of P(S1, ..., Sr | O1, ..., Or)

3. Most likely model parameters

Given an observed sequence {O1, ..., On}, find argmax over θ of P(O1, ..., On | θ)


Computing Likelihood of Observed Sequence

- Compute P(O1, O2, ..., Or)

  - True state sequence unknown
  - Must sum over all possible paths
  - Number of paths is O(N^T)
  - Markovian structure permits:
    - Recursive definition and hence
    - Efficient calculation by dynamic programming

- Key observation: any path must be in exactly one state at time t

P(O1, ..., Or) = Σ over S0, S1, ..., Sr of P(O1, ..., Or | S0, S1, ..., Sr) P(S0, S1, ..., Sr)
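The sum above can be computed brute-force by enumerating all N^T state paths, which makes the cost (and the case for the dynamic-programming recursion) concrete. The two-state HMM below is made up for illustration:

```python
from itertools import product

# Illustrative two-state HMM with two observation symbols.
states = [0, 1]
pi = [0.6, 0.4]               # initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]  # A[i][j] = P(S_j | S_i)
B = [[0.9, 0.1], [0.2, 0.8]]  # B[i][o] = P(O = o | S_i)
obs = [0, 1, 0]               # observed symbols

# Sum P(O | path) P(path) over all N^T paths -- feasible only for toys.
total = 0.0
for path in product(states, repeat=len(obs)):
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    total += p
```

For N = 20 amino-acid states and a realistic sequence length T, the N^T paths are hopeless, which is why the forward recursion below matters.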


Key Idea for HMM Computations

(figure: a trellis with the N possible states (amino acids) on one axis and time steps 1 ... t, t+1, ... T on the other; any path occupies exactly one state per time step)


Example: Searching Protein Database with HMM Profile

- For each sequence in the database:

  - Does the sequence "fit" the model?
  - Score by P(O1, O2, ..., Or); compute a Z-score adjusted for length

Protein kinases:

Globins:

Figure from (Krogh et al. 1994)


Estimate Alignment and Model Parameters Simultaneously

- Key idea: missing data

  - What if we knew the alignment? The parameters are easy to estimate:
    - Calculate (expected) number of transitions
    - Calculate (expected) frequency of amino acids

  - What if we knew the parameters? The alignment is easy to find:
    - Align each sequence to the model using the Viterbi algorithm
    - Align residues in match states


Other "details"

- How many states in the model?

- How to initialize parameters?

- How to avoid local maxima?

See (Krogh et al., 1994) for some suggestions.


Multiple Protein Sequence Alignment

- Given a set of sequences:

  - Estimate an HMM using optimization for parameter search (Baum-Welch, EM)
  - Align each sequence to the model (Viterbi)
  - Match states of the model provide the columns of the resulting multiple alignment


Extensions

- Clustering subfamilies

- Modeling domains

Figure from (Krogh et al. 1994)


- Advantages:

  - Explicit probabilistic model for the family
  - Position-specific residue distributions, gap penalties, and insertion frequencies

- Disadvantages:

  - Many parameters; requires more data and care
  - Traded one hard optimization problem for another


HMM Summary

- Powerful tool for modeling protein families

- Generalization of existing profile methods

- Data-intensive

- Widely applicable to problems in bioinformatics


References

- Bioinformatics classic: Krogh et al. (1994) Hidden Markov models in computational biology: applications to protein modeling, J. Mol. Biol. 235:1501-1531

- Book: Eddy & Durbin, 1999. See web site.

- Tutorial: Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2):257-286


Forward-backward Algorithm

- Forward pass:

  - Define α_t(j): the probability of the subsequence O1, O2, ..., Ot when in state Sj at time t
  - Key observation: any path must be in one of the N states at time t

α_t(j) = [Σ_{i=1}^{N} α_{t-1}(i) P(Sj|Si)] P(Ot|Sj)

(figure: trellis over time steps 1 ... t-1, t, ..., T)


Forward-backward Algorithm (cont.)

- Notice:

P(O1, O2, ..., Or) = Σ_{j=1}^{N} α_r(j)

- Define an analogous backward pass:

β_{t-1}(i) = Σ_{j=1}^{N} P(Sj|Si) P(Ot|Sj) β_t(j)

so that

P(Ot came from state Sj) = α_t(j) β_t(j) / Σ_{i=1}^{N} α_t(i) β_t(i)

(figure: trellis over time steps t-1, t, ..., T)
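The forward and backward recursions can be sketched for a small discrete HMM, using A[i][j] = P(Sj|Si) and B[j][o] = P(o|Sj); the parameters are illustrative:

```python
# Illustrative two-state discrete HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]  # A[i][j] = P(S_j | S_i)
B = [[0.9, 0.1], [0.2, 0.8]]  # B[j][o] = P(O = o | S_j)
N = 2

def forward(obs):
    """alpha[t][j] = P(O_1..O_t, state S_j at t)."""
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    return alpha

def backward(obs):
    """beta[t][i] = P(O_{t+1}..O_T | state S_i at t)."""
    T = len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
            for i in range(N)
        ]
    return beta

obs = [0, 1, 0]
alpha, beta = forward(obs), backward(obs)
likelihood = sum(alpha[-1][j] for j in range(N))

# Sanity check: sum_i alpha_t(i) * beta_t(i) equals the likelihood at every t,
# matching the "any path is in exactly one state at time t" observation.
for t in range(len(obs)):
    assert abs(sum(alpha[t][i] * beta[t][i] for i in range(N)) - likelihood) < 1e-12
```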


Finding Most Likely Path

- Forward pass:

  - Replace the summation with a maximization: δ_t(j) is the maximum probability of the subsequence O1, O2, ..., Ot over paths ending in state Sj at time t

- Again:

max over paths of P(O1, O2, ..., Or, path) = max_{1≤j≤N} δ_T(j), then trace back
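A sketch of the max-product (Viterbi) variant: the forward sum becomes a max, back-pointers record the argmax, and the best path is recovered by tracing back. Parameters are illustrative:

```python
# Illustrative two-state discrete HMM.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]  # A[i][j] = P(S_j | S_i)
B = [[0.9, 0.1], [0.2, 0.8]]  # B[j][o] = P(O = o | S_j)
N = 2

def viterbi(obs):
    """Return the most likely state path and its joint probability."""
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    back = []
    for t in range(1, len(obs)):
        prev = delta
        delta, ptr = [], []
        for j in range(N):
            # Best predecessor state for ending in j at time t.
            i_best = max(range(N), key=lambda i: prev[i] * A[i][j])
            delta.append(prev[i_best] * A[i_best][j] * B[j][obs[t]])
            ptr.append(i_best)
        back.append(ptr)
    # Trace back from the best final state.
    j = max(range(N), key=lambda j: delta[j])
    path = [j]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(delta)

path, p = viterbi([0, 1, 0])
```

For the multiple-alignment application, this traceback through match, insert, and delete states is what aligns a sequence to the model.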


Baum-Welch Algorithm (Expectation-Maximization)

- Set parameters to their expected values given the observed sequences:

  - State transition probabilities
  - Observation probabilities

- Recalculate the expectations with the new probabilities

- Iterate to convergence

  - The likelihood P(O1, ..., On | θ) is guaranteed to be strictly increasing; converges to a local maximum (see Rabiner, 1989 for details)
