Introduction to Hidden Markov
Models (HMMs)
But first, some probability and statistics
background
Important Topics

1. Random Variables and Probability
2. Probability Distributions
3. Parameter Estimation
4. Hypothesis Testing
5. Likelihood
6. Conditional Probability
7. Stochastic Processes
8. Inference for Stochastic Processes
Probability
The probability of a particular event occurring is
the frequency of that event over a very long series
of repetitions.
• P(tossing a head) = 0.50
• P(rolling a 6) = 0.167
• P(average age in a population sample is greater than 21) = 0.25
Random Variables
A random variable is a quantity that cannot be
measured or predicted with absolute accuracy.
p = fraction of nucleotides that are either G or C
X = age of an individual
Y = length of a gene
Probability Distributions
• The distribution of a random variable describes the
  possible values of the variable and the probabilities
  of each value.
• For discrete random variables, the distribution can be
  enumerated; for continuous ones we describe the
  distribution with a function.
Examples of Distributions

Binomial: X ~ Bin(3, 0.5)

x    P(X = x)
0    0.125
1    0.375
2    0.375
3    0.125

Normal: Z ~ N(µ, σ²)

P(a < Z < b) = ∫_a^b f(z; µ, σ²) dz

f(z; µ, σ²) = (1 / (σ√(2π))) e^{−(z − µ)² / (2σ²)}
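The binomial table above can be reproduced directly; a minimal sketch using only the Python standard library:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# X ~ Bin(3, 0.5): reproduces the table above
for x in range(4):
    print(x, binom_pmf(x, 3, 0.5))
```

For the normal distribution, `statistics.NormalDist(mu, sigma).cdf(b)` gives P(Z < b) without integrating the density by hand.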
Parameter Estimation
One of the primary goals of statistical inference is to
estimate unknown parameters.
For example, using a sample taken from the target
population, we might estimate the population mean
using several different statistics: the sample mean,
the sample median, or the sample mode. Different
statistics have different sampling properties.
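To see that different statistics have different sampling properties, consider a small hypothetical sample (the data below are invented for illustration): the sample mean chases an outlier while the sample median does not.

```python
import statistics

# Hypothetical ages sampled from a target population; 60 is an outlier
sample = [18, 19, 21, 22, 23, 25, 60]

mean = statistics.mean(sample)      # sensitive to the outlier
median = statistics.median(sample)  # robust to the outlier
print(mean, median)
```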
Hypothesis Testing
A second goal of statistical inference is testing the
validity of hypotheses about parameters using sample
data.
H_O: p = 0.5
H_A: p > 0.5
If the observed frequency is much greater
than 0.5, we should reject the null
hypothesis in favor of the alternative
hypothesis.
How do we decide what “much greater” is?
Likelihood
For our purposes, it is sufficient to define the
likelihood function as
L(θ) = Pr(data; parameter values) = Pr(X; θ)
Analyses based on the likelihood function are
well-studied and usually have excellent statistical
properties.
Maximum Likelihood
Estimation
The maximum likelihood estimate of an unknown
parameter is defined to be the value of that parameter
that maximizes the likelihood function:
θ̂ = argmax_θ L(θ) = argmax_θ Pr(X; θ)

We say that θ̂ is the maximum likelihood estimate of θ.
Example: Binomial Probability

If X ~ Bin(n, p), then

L(p) = [n! / (x!(n − x)!)] p^x (1 − p)^(n − x)

Some simple calculus shows that the MLE of p is p̂ = x/n,
the frequency of “successes” in our sample of size n.

If we had been unable to do the calculus, we could
still have found the MLE by plotting the likelihood:

[Plot: the likelihood p^7 (1 − p)^3 as a function of p,
for x = 7 successes in n = 10 trials; the curve peaks at
p = 0.7.]
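When the calculus is awkward, the plotting idea is easy to code as a grid search; a sketch for x = 7 successes in n = 10 trials (the same likelihood p^7 (1 − p)^3 behind the plotted curve):

```python
def likelihood(p, x=7, n=10):
    # L(p) is proportional to p^x (1-p)^(n-x); the binomial
    # coefficient does not change where the maximum occurs
    return p ** x * (1 - p) ** (n - x)

# Evaluate the likelihood on a fine grid of p values and take the argmax
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # the MLE x/n = 0.7
```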
Likelihood Ratio Tests

Consider testing the hypothesis:

H_O: θ = θ_O
H_A: θ > θ_O

The likelihood ratio test statistic is:

Λ = [max over H_A of L(θ)] / [max over H_O of L(θ)]
  = L(θ̂_A) / L(θ̂_O)
Distribution of the Likelihood
Ratio Test Statistic
Under quite general conditions,
2 ln Λ ~ χ²_{n-1}
where n-1 is the difference between the number of
free parameters in the two hypotheses.
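As a concrete check of this approximation, consider hypothetical binomial data (x = 7 successes in n = 10 trials, an invented example) and test H_O: p = 0.5 against the unrestricted alternative, a one-parameter difference:

```python
from math import log

x, n = 7, 10          # hypothetical data
p_hat = x / n         # MLE under the alternative

def loglik(p):
    # binomial log-likelihood, dropping the constant binomial coefficient
    return x * log(p) + (n - x) * log(1 - p)

# 2 ln(Lambda): twice the log-likelihood gap between the two hypotheses
stat = 2 * (loglik(p_hat) - loglik(0.5))
print(stat)  # ≈ 1.65, below the chi-squared(1) 5% cutoff of 3.84
```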
The Parametric Bootstrap

Why we need it: The conditions necessary for the
asymptotic chi-squared distribution are not always
satisfied.

What it is: A simulation-based approach for evaluating
the p-value of a statistical test (often a likelihood
ratio test).

Parametric Bootstrap Procedure

1. Compute the LRT statistic using the observed data.
2. Use the parameters estimated under the null hypothesis
   to simulate a new dataset of the same size as the
   observed data.
3. Compute the LRT statistic for the simulated dataset.
4. Repeat steps 2 and 3, say, 1000 times.
5. Construct a histogram of the simulated LRT statistics.
6. The p-value for the test is the frequency of simulated
   LRT statistics that exceed the observed LRT statistic.
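The six steps above can be sketched as follows, using invented data (x = 16 successes in n = 20 trials) and a binomial LRT of H_O: p = 0.5; the data and sample sizes are illustrative assumptions, not part of the original slides:

```python
import random
from math import log

def loglik(p, x, n):
    # binomial log-likelihood with the 0 * log(0) = 0 convention
    out = 0.0
    if x:
        out += x * log(p)
    if n - x:
        out += (n - x) * log(1 - p)
    return out

def lrt(x, n, p0=0.5):
    """2 ln(Lambda) for H_O: p = p0 versus the unrestricted MLE."""
    return 2 * (loglik(x / n, x, n) - loglik(p0, x, n))

random.seed(1)
x_obs, n = 16, 20
observed = lrt(x_obs, n)                      # step 1: LRT for the observed data
sims = []
for _ in range(1000):                         # step 4: repeat 1000 times
    # step 2: simulate a same-size dataset under the null (p = 0.5)
    x_sim = sum(random.random() < 0.5 for _ in range(n))
    sims.append(lrt(x_sim, n))                # step 3: LRT for the simulated data
# step 5 would histogram `sims`; step 6: frequency of simulated LRTs
# at or above the observed LRT
p_value = sum(s >= observed for s in sims) / len(sims)
print(observed, p_value)
```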
Conditional Probability
The conditional probability of event A given that
event B has happened is
Pr(A | B) = Pr(A and B) / Pr(B)

For one roll of a fair die:

Pr(2 | even number rolled) = P(2 and even) / P(even)
                           = (1/6) / (1/2) = 1/3
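The die example can be verified by enumerating the sample space; a small sketch with exact fractions:

```python
from fractions import Fraction

outcomes = set(range(1, 7))   # one roll of a fair die
even = {2, 4, 6}
two_and_even = {2} & even     # the event "2 and even"

p_even = Fraction(len(even), len(outcomes))
p_two_and_even = Fraction(len(two_and_even), len(outcomes))

# Conditional probability: Pr(A | B) = Pr(A and B) / Pr(B)
p_two_given_even = p_two_and_even / p_even
print(p_two_given_even)  # 1/3
```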
Stochastic Processes
A stochastic process is a series of random variables
measured over time. Values in the future typically
depend on current values.
• Closing value of the stock market
• Annual per capita murder rate
• Current temperature

Example: evolution of a DNA sequence over time.

t = 0   ACGGTTACGGATTGTCGAA
t = 1   ACaGTTACGGATTGTCGAA
t = 2   ACaGTTACGGATgGTCGAA
t = 3   ACcGTTACGGATgGTCGAA
Inference for Stochastic
Processes
We often need to make inferences that involve the
changes in molecular genetic sequences over time.
Given a model for the process of sequence evolution,
likelihood analyses can be performed.
Pr(sequence_i → sequence_j; θ, t)
Introduction to Hidden Markov
Models (HMMs)
HMM: Hidden Markov Model

• Does this sequence come from a particular “class”?
  – Does the sequence contain a beta sheet?
• What can we determine about the internal composition
  of the sequence if it belongs to this class?
  – Assuming this sequence contains a gene, where are
    the splice sites?
Example: A Dishonest Casino
• Suppose a casino usually uses “fair” dice (the
  probability of any side is 1/6), but occasionally
  changes briefly to unfair dice (the probability of a 6
  is 1/2; all other sides have probability 1/10).
• We only observe the results of the tosses.
• Can we identify the tosses made with the biased dice?
The data we actually observe look like the following:
2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5
Which (if any) tosses were made using an unfair
die?
Scenario 1: F F F F F F F F F F F F F F F F F F F F F F F
Scenario 2: F F F F F F U U U U F F U U U F F F F F F F F
If the tosses were made with a fair die (scenario 1), the
probability of observing the series of tosses is:

Pr(data) = (1/6)^23 = 1.266 × 10^-18

If the indicated tosses were made with an unfair die
(scenario 2), then the series of tosses has probability:

Pr(data) = (1/6)^16 × (1/2)^5 × (1/10)^2 = 1.108 × 10^-16

2 6 3 4 4 1 3 6 6 5 6 3 6 6 6 1 3 5 2 6 2 4 5

The series of tosses is 87.5 times more probable under
scenario 2 than under scenario 1.
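The two scenario probabilities and their ratio check out numerically:

```python
# Scenario 1: all 23 tosses from the fair die
p_fair = (1 / 6) ** 23

# Scenario 2: 16 fair tosses, plus 7 unfair tosses
# (five 6s at probability 1/2 each, two non-6s at 1/10 each)
p_mixed = (1 / 6) ** 16 * (1 / 2) ** 5 * (1 / 10) ** 2

print(p_fair)            # ≈ 1.266e-18
print(p_mixed)           # ≈ 1.108e-16
print(p_mixed / p_fair)  # ≈ 87.5
```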
The dishonest-casino HMM:

Hidden states: Fair, Unfair

Emission probabilities:
  Fair:   P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
  Unfair: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

Transition probabilities:
  Fair → Fair 0.9,     Fair → Unfair 0.1
  Unfair → Unfair 0.5, Unfair → Fair 0.5
The Likelihood Function

Pr(X; θ) = Σ_π Pr(π) Pr(X | π; θ)   (sum over all hidden-state paths π)

• Pr(X; θ): the probability of the data, in terms of one or
  more unknown parameters; compute via the forward algorithm.
• Pr(π): the probability of the hidden states (may depend on
  one or more unknown parameters).
• Pr(X | π; θ): the probability of the data GIVEN the hidden
  states, in terms of one or more unknown parameters.
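A sketch of the forward algorithm for the casino model; the transition and emission values follow the diagram above, and the uniform start distribution is an assumption:

```python
states = ["F", "U"]
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.5, "U": 0.5}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "U": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}
start = {"F": 0.5, "U": 0.5}  # assumed uniform initial distribution

def forward(obs):
    """Pr(X; theta), summing over all hidden-state paths."""
    # f[k] = probability of the observations so far AND being in state k
    f = {k: start[k] * emit[k][obs[0]] for k in states}
    for x in obs[1:]:
        f = {k: sum(f[j] * trans[j][k] for j in states) * emit[k][x]
             for k in states}
    return sum(f.values())

obs = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print(forward(obs))
```

This sums the same path probabilities as the likelihood function, but in time linear in the sequence length rather than exponential.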
Predicting the Hidden States

1. The most probable state path (compute via the Viterbi
   algorithm):

   π̂ = argmax_π P(X, π; θ) = argmax_π Pr(π) Pr(X | π; θ)

2. Posterior state probabilities (compute via the forward
   and backward algorithms):

   Pr(π_i = k | X; θ) = P(X, π_i = k; θ) / P(X; θ)
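The Viterbi recursion for the same casino model (again assuming a uniform start distribution) can be sketched as follows; it tracks, for each state, the best path ending there and backtracks at the end:

```python
states = ["F", "U"]
trans = {"F": {"F": 0.9, "U": 0.1}, "U": {"F": 0.5, "U": 0.5}}
emit = {"F": {s: 1 / 6 for s in range(1, 7)},
        "U": {**{s: 1 / 10 for s in range(1, 6)}, 6: 1 / 2}}
start = {"F": 0.5, "U": 0.5}  # assumed uniform initial distribution

def viterbi(obs):
    """Most probable hidden-state path, by dynamic programming."""
    v = {k: start[k] * emit[k][obs[0]] for k in states}  # best path prob ending in k
    back = []                                            # backpointers per step
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for k in states:
            best = max(states, key=lambda j: v[j] * trans[j][k])
            ptr[k] = best
            nxt[k] = v[best] * trans[best][k] * emit[k][x]
        back.append(ptr)
        v = nxt
    last = max(states, key=v.get)
    path = [last]
    for ptr in reversed(back):                           # backtrack
        path.append(ptr[path[-1]])
    return path[::-1]

obs = [2, 6, 3, 4, 4, 1, 3, 6, 6, 5, 6, 3, 6, 6, 6, 1, 3, 5, 2, 6, 2, 4, 5]
print("".join(viterbi(obs)))
```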
Simple Gene Structure Model

[Diagram: 5’ UTR, then exons and introns between the start
codon and the stop codon, then 3’ UTR.]

HMM Example: Gene Regions

[Diagram: states Start, Stop, Exon, Intron, 5’ UTR, 3’ UTR.]

Content sensor: a region of residues with similar
properties (introns, exons).

Signal sensor: a specific signal sequence; might be a
consensus sequence (start and stop codons).
Basic Gene-finding HMM

States (connected in a state diagram):

• B: Begin sequence
• S: Start translation
• D: Donor splice site
• A: Acceptor splice site
• T: Stop translation
• F: End sequence
• 5’: 5’ untranslated region
• EI: Initial exon
• ES: Single exon
• E: Exon
• I: Intron
• EF: Final exon
• 3’: 3’ untranslated region
OK, what do we do with it now?
The HMM must first be “trained” using a database of
known genes.
Consensus sequences for all signal sensors are
needed.
Compositional rules (i.e., emission probabilities) and
length distributions are necessary for content sensors.
Transition probabilities between all connected states
must be estimated.
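Transition probabilities are typically estimated by counting observed state changes in the training data; a toy sketch with an invented exon/intron annotation (real training uses a database of known genes):

```python
from collections import Counter

# Hypothetical labeled state path: E = exon, I = intron
path = "EEEEIIIIEEIIIEEEE"

# Count adjacent state pairs, then normalize each row to get the MLE
pairs = Counter(zip(path, path[1:]))
states = sorted(set(path))
trans = {a: {b: pairs[(a, b)] / sum(pairs[(a, c)] for c in states)
             for b in states}
         for a in states}
print(trans)
```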
GENSCAN

Chris Burge and Samuel Karlin (1997) “Prediction of
Complete Gene Structures in Human Genomic DNA.”
J. Mol. Biol. 268:78-94.
Basic features of GENSCAN

• HMM description of human genomic sequences, including:
  – Transcriptional, translational, and splicing signals
  – Length distributions and compositional features of
    introns, exons, and intergenic regions
  – Distinct model parameters for regions with different
    GC compositions
Accuracy per nucleotide         Accuracy per exon
Sn     Sp     AC     CC         Sn     Sp     Avg    ME     WE
0.93   0.93   0.91   0.92       0.78   0.81   0.80   0.09   0.05

Sn: Sensitivity = Prob(true nucleotide or exon predicted in a gene)
Sp: Specificity = Prob(predicted nucleotide or exon is in a gene)
AC and CC are overall measures of accuracy, including both
positive and negative predictions.
ME: Prob(true exon is completely missed in the prediction)
WE: Prob(predicted exon is not in a gene)
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A
PROSITE regular expression [AT][ATG][CT][T][CT]*[CT][GT][AG]
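The PROSITE pattern translates almost directly into a Python regular expression (the character classes carry over; `*` marks the optional column):

```python
import re

# The PROSITE expression above, as a Python regex
pattern = re.compile(r"[AT][ATG][CT]T[CT]*[CT][GT][AG]")

# Ungapped rows of the alignment match (zero optional residues),
# and so does an all-T query with one optional T
print(bool(pattern.fullmatch("AACTCTA")))   # True
print(bool(pattern.fullmatch("TTTTTTTG")))  # True
```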
Profile HMMs
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A
[AT][ATG][CT][T][CT]*[CT][GT][AG]
T T T T T T T G
The sequence matches the PROSITE expression at
every position, but does it really “match” the
profile?
A A C T - - C T A
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A

Profile built from the five aligned sequences: seven match
states (M1–M7, one per ungapped column) and one insert state
(I, between M4 and M5), with these emission probabilities:

     M1   M2   M3   M4   I    M5   M6   M7
A    .8   .4   0    0    0    0    0    .8
C    0    0    .8   0    .75  .8   0    0
G    0    .4   0    0    0    0    .6   .2
T    .2   .2   .2   1    .25  .2   .4   0

Transition probabilities: every match-to-match step is 1.0
except out of M4, where M4 → M5 = 0.4 and M4 → I = 0.6;
from the insert state, I → M5 = 0.75 and I → I = 0.25.
T T T T T T T G

Scoring the query against the profile (path M1 M2 M3 M4 → I
→ M5 M6 M7), multiplying each emission probability by the
transition probability into the next state:

Pr = .2 × 1 × .2 × 1 × .2 × 1 × 1 × .6 × .25 × .75 × .2 × 1 × .4 × 1 × .2
   ≈ 1.44 × 10^-5
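The path probability can be checked by multiplying the emissions and transitions read off the profile above (the path through the insert state is an interpretation of the figure):

```python
# Query T T T T T T T G along the path M1 M2 M3 M4 -> I -> M5 M6 M7
emissions = [0.2, 0.2, 0.2, 1.0, 0.25, 0.2, 0.4, 0.2]  # P(residue) at each state
transitions = [1.0, 1.0, 1.0, 0.6, 0.75, 1.0, 1.0]     # M4->I = 0.6, I->M5 = 0.75

score = 1.0
for e in emissions:
    score *= e
for t in transitions:
    score *= t
print(score)  # ≈ 1.44e-05
```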
General Profile HMM Structure

[Diagram: Begin and End states linked by a chain of match
states (Mj), with insert states (Ij) above and delete states
(Dj) below the chain.]