Date posted: 13-Mar-2016
Author: cecilia-hess
Description: Hidden Markov Model. CS570 Lecture Note, KAIST. Based on the notes of Prof. B.K. Shin (Pukyong Nat'l Univ) and Prof. Wilensky (UCB).

Hidden Markov Model

CS570 Lecture Note, KAIST

This lecture note was made based on the notes of Prof. B.K. Shin (Pukyong Nat'l Univ) and Prof. Wilensky (UCB).

Sequential Data

Often highly variable, but has an embedded structure.

Information is contained in the structure.


More examples

main() { char q=34, n=10, *a="main() { char q=34, n=10, *a=%c%s%c; printf(a,q,a,q,n);}%c"; printf(a,q,a,q,n); }


Example: Speech Recognition

Given a sequence of inputs (features of some kind extracted by some hardware), guess the words to which the features correspond.

Hard because the features depend on:
speaker, speed, noise, nearby features ("co-articulation" constraints), word boundaries

"How to wreck a nice beach."

"How to recognize speech."


Defining the problem

y, a string of acoustic features of some form

w, a string of words, from some fixed vocabulary

L, a language (defined as all possible strings in that language)

Given some features y, what is the most probable string w in L that the speaker uttered?


Analysis

By Bayes' rule, P(w|y) = P(w)P(y|w)/P(y)

Since y is the same for the different w's we might choose, the problem reduces to finding argmaxw∈L P(w)P(y|w).

So we need P(w), the probability of each possible string in our language, and P(y|w), the probability of a pronunciation, given an utterance.


P(w) where w is an utterance

Problem: There are a very large number of possible utterances!

Indeed, we create new utterances all the time, so we cannot hope to have their probabilities.

So, we will need to make some independence assumptions.

First attempt: Assume that words are uttered independently of one another.

Then P(w) becomes P(w1)… P(wn), where the wi are the individual words in the string.

It is easy to estimate these numbers: count the relative frequency of words in the language.


Assumptions

Words don't just follow each other randomly.

Second attempt: Assume each word depends only on the previous word.

E.g., "the" is more likely to be followed by "ball" than by "a", despite the fact that "a" would otherwise be a very common, and hence, highly probable word.

Of course, this is still not a great assumption, but it may be a decent approximation.


In General

This is typical of lots of problems in which we view the probability of some event as dependent on potentially many past events, of which there are too many actual dependencies to deal with.

So we simplify by making the assumption that:
- each event depends only on the previous event, and
- it doesn't make any difference when these events happen in the sequence.


Speech Example

"speech" = s p iy iy iy ch ch ch ch


Analysis Methods

Probability-based analysis?

Method I: treat each symbol independently of the others (symbol frequencies alone).
- A poor model for temporal structure.
- Model size = |V| = N


Analysis methods

Method II: a symbol is dependent only on the immediately preceding one:
- |V|×|V| matrix model
- 10^5 × 10^5 – doubly outrageous!!


Another analysis method

Method III: what you see is a clue to what lies behind and is not known a priori.
- The source that generated the observation.
- The source evolves and generates characteristic observation sequences.


More Formally

To clarify, let's write the sequence this way:

P(q1=Si, q2=Sj,…, qn-1=Sk, qn=Sl)

Here the qt indicate the t-th position of the sequence, and the Si the possible different words from our vocabulary.

E.g., if the string were "The girl saw the boy", we might have S1=the, S2=girl, S3=saw, S4=boy, and the state sequence

q1=S1 (the), q2=S2 (girl), q3=S3 (saw), q4=S1 (the), q5=S4 (boy)


Formalization (continued)

We want P(q1=Si, q2=Sj,…, qn-1=Sk, qn=Sl)

Let's break this down as we usually break down a joint probability:

= P(qn=Sl | q1=Si,…,qn-1=Sk) P(q1=Si,…,qn-1=Sk)
= P(qn=Sl | q1=Si,…,qn-1=Sk) P(qn-1=Sk | q1=Si,…,qn-2=Sm) … P(q2=Sj | q1=Si) P(q1=Si)

Our simplifying assumption is that each event is only dependent on the previous event, and that we don't care when the events happen, i.e.,

P(qt=Si | qt-1=Sk) = P(qu=Si | qu-1=Sk) for all times t, u

This is called the Markov assumption.


Markov Assumption

"The future does not depend on the past, given the present."

Sometimes this is called the first-order Markov assumption.

A second-order assumption would mean that each event depends on the previous two events.

This isn't really a crucial distinction. What's crucial is that there is some limit on how far we are willing to look back.


Markov Models

The Markov assumption means that there is only one probability to remember for each event type (e.g., word) to another event type, plus the probabilities of starting with a particular event.

This lets us define a Markov model as a finite state automaton in which:
- the states represent possible event types (e.g., the different words in our example)
- the transitions represent the probability of one event type following another.

It is easy to depict a Markov model as a graph.


Example: A Markov Model for a Tiny Fragment of English

(Figure: a state diagram over the words "the", "a", "little", "girl", with transition/start probabilities .7, .8, .9, .22, .3, .78, .2, .1 on the arcs.)

e.g., P(qi=girl | qi-1=the) = .8

The numbers on the initial arrows show the probability of starting in the given state.

Missing probabilities are assumed to be 0.


Example: A Markov Model for a Tiny Fragment of English

Reading the probabilities off the same diagram:

P("a little little girl") = .3 × .22 × .1 × .9 = .00594
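The path probability above can be sketched in a few lines of Python. Only the arcs the lecture's examples actually name are included (start("a") = .3, a→little = .22, little→little = .1, little→girl = .9, the→girl = .8); start("the") = .7 is an assumption, and the figure's remaining arcs (.78, .2, .7) are omitted.

```python
# A sketch of the Markov-chain path-probability computation.
# start("the") = .7 is an assumption; arcs not listed are treated as 0.
start = {"the": 0.7, "a": 0.3}          # initial-state probabilities
trans = {
    ("the", "girl"): 0.8,
    ("a", "little"): 0.22,
    ("little", "little"): 0.1,
    ("little", "girl"): 0.9,
}

def sequence_prob(words):
    """P(w1,...,wn) under the first-order Markov model."""
    p = start.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans.get((prev, cur), 0.0)  # missing arcs have probability 0
    return p

print(sequence_prob(["a", "little", "little", "girl"]))  # ≈ .00594, matching the slide
```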


Example (cont'd)

Consider P(q1=the, q2=little, q3=girl), where the, little, and girl are states.

"Given that the sentence begins with "a", what is the probability that the next words are "little girl"?"

P(q3=girl, q2=little | q1=a)
= P(q3=girl | q2=little, q1=a) P(q2=little | q1=a)
= P(q3=girl | q2=little) P(q2=little | q1=a)
= .9 × .22 = .198


Markov Models and Graphical Models

Markov models and belief networks can both be represented by nice graphs. Do the graphs mean the same thing?

No! In the graphs for Markov models, nodes do not represent random variables and carry no CPTs.

Suppose we wanted to encode the same information via a belief network. We would have to "unroll" it into a sequence of nodes (as many as there are elements in the sequence), each dependent on the previous, each with the same CPT.

This redrawing is valid, and sometimes useful, but doesn't explicitly represent useful facts, such as that the CPTs are the same everywhere.


Back to the Speech Recognition Problem

A Markov model for all of English would have one node for each word, which would be connected to the node for each word that can follow it.

Without loss of generality, we could have connections from every node to every node, some of which have transition probability 0.

Such a model is sometimes called a bigram model. This is equivalent to knowing the probability distribution of pairs of words in sequences (and the probability distribution for individual words).

A bigram model is an example of a language model, i.e., some (in this case, extremely simple) view of what sentences or sequences are likely to be seen.


Bigram Model

Bigram models are rather inaccurate language models.

E.g., the word after "a" is much more likely to be "missile" if the word preceding "a" is "launch"; i.e., the Markov assumption is pretty bad.

If we could condition on a few previous words, life gets a bit better:

E.g., we could predict "missile" is more likely to follow "launch a" than "saw a".

This would require a "second order" Markov model.
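Bigram probabilities are estimated by relative frequency. A minimal sketch on a made-up toy corpus (the corpus and the helper name `bigram_prob` are illustrative, not from the lecture):

```python
# Estimate P(w_t | w_{t-1}) by relative frequency from a toy corpus.
from collections import Counter

corpus = "the little girl saw the little boy".split()  # toy data, assumed

pair_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
word_counts = Counter(corpus[:-1])              # counts of each word as a predecessor

def bigram_prob(prev, cur):
    """MLE estimate: count(prev cur) / count(prev)."""
    if word_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / word_counts[prev]

print(bigram_prob("the", "little"))  # "the" is followed by "little" both times → 1.0
```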


Higher-Order Models

In the case of words, this is equivalent to going to trigrams.

Fundamentally, this isn't a big difference: we can convert a second order model into a first order model, but with a lot more states. And we would need much more data!

Note, though, that a second-order model still couldn't accurately predict what follows "launch a large", i.e., we are predicting the next word based on only the two previous words, so the useful information before "a large" is lost.

Nevertheless, such language models are very useful approximations.


Back to Our Spoken Sentence Recognition Problem

We are trying to find argmaxw∈L P(w)P(y|w). Now let's look at P(y|w).

- That is, how do we pronounce a sequence of words?
- We can make the simplification that how we pronounce words is independent of one another:

P(y|w) = Σ P(o1=vi, o2=vj,…, ok=vl | w1) × … × P(ox-m=vp, ox-m+1=vq,…, ox=vr | wn)

i.e., each word produces some of the sounds with some probability; we have to sum over possible different word boundaries.

So, what we need is a model of how we pronounce individual words.


A Model

We can represent this idea by complicating the Markov model:
- Add probabilistic emissions of outputs from each state.


Example: A (Simplistic) Model for Pronouncing "of"

Each state can emit a different sound, with some probability.

Variant: have the emissions on the transitions, rather than the states.


How the Model Works

We can't "see" the actual state transitions. But we can infer possible underlying transitions from the observations, and then assign a probability to them.

E.g., from "o v", we infer the transition "phone1 phone2", with probability .7 × .9 = .63.

I.e., the probability that the word "of" would be pronounced as "o v" is 63%.


Hidden Markov Models

This is a "hidden Markov model", or HMM.

Like (fully observable) Markov models, transitions from one state to another are independent of everything else.

Also, the emission of an output from a state depends only on that state, i.e.:

P(o1,o2,…,on | q1,…,qn) = P(o1|q1) × P(o2|q2) × … × P(on|qn)


HMMs Assign Probabilities

We want to know how probable a sequence of observations is, given an HMM.

But there may be many different ways to produce the observed output. So, we have to consider all possible ways an output might be produced, i.e., for a given HMM:

P(O) = ∑Q P(O|Q)P(Q)

where O is a given output sequence, and Q ranges over all possible sequences of states in the model.

P(Q) is computed as for (visible) Markov models.

P(O|Q) = P(o1,o2,…,on | q1,…,qn) = P(o1|q1) × P(o2|q2) × … × P(on|qn)

We’ll look at computing this efficiently in a while…
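The sum over state sequences can be computed directly, if very inefficiently, by enumeration. A minimal sketch with a made-up two-state, two-symbol model (the numbers are illustrative, not from the lecture):

```python
# Brute-force P(O) = Σ_Q P(O|Q)P(Q) over all state sequences Q.
from itertools import product

pi = [0.6, 0.4]                      # initial state distribution (made up)
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(q_{t+1}=S_j | q_t=S_i)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(o_t=v_k | q_t=S_i)

def prob_of_observation(O):
    total = 0.0
    for Q in product(range(len(pi)), repeat=len(O)):  # every state sequence
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t-1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

print(prob_of_observation([0]))  # 0.6*0.9 + 0.4*0.2 ≈ 0.62
```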


Finishing Solving the Speech Problem

To find argmaxw∈L P(w)P(y|w), just consider these probabilities over all possible strings of words.

We could "splice in" each word in the language model with its HMM pronunciation model to get one big HMM.
- This lets us incorporate more dependencies.
- E.g., we could have two models of "of", one of which has a much higher probability of transitioning to words beginning with consonants.

Similarly, sounds are broken up into acoustic vectors,
- but these are also HMMs.
- So we can make one gigantic HMM out of the whole thing.

So, given y, all we need is to find the most probable path through the model that generates it.


Example: Our Word Markov Model

(Figure: the word-level Markov model over "the", "a", "little", "girl" from before, with probabilities .7, .8, .9, .22, .3, .78, .2, .1.)


Example: Splicing in Pronunciation HMMs

(Figure: the word model with each word replaced by its pronunciation HMM; numbered states 1-10 emit output symbols v1-v12, and the word-level probabilities .7, .8, .9, .22, .3, .78, .2, .1 connect the sub-models.)


Example: Best Sequence

(Figure: the same spliced HMM as on the previous slide.)

Suppose the observation is "v1 v3 v4 v9 v8 v11 v7 v8 v10".

Suppose the most probable state sequence is determined to be "1,2,3,8,9,10,4,5,6" (which happens to be the only possibility in this example).

Then the interpretation is "the little girl".


Hidden Markov Models

Things we will want to do with HMMs:
- Determine the probability of a given sequence.
- Determine the probability of a model producing a sequence in a particular way (equivalent to recognizing or interpreting that sequence).
- Learn a model from some observations.


Why HMM?

Because the HMM is a very good model for such patterns:
- highly variable spatiotemporal data sequences
- often unclear, uncertain, and incomplete

Because it is very successful in many applications!

Because it is quite easy to use!
- Tools already exist…


The problem

Treating the observations themselves as Markovian is not quite a valid assumption:
- There are often errors or noise: noisy sound, sloppy handwriting, ungrammatical or "Konglish" sentences.
- There may be some true underlying process: a hidden sequence.


The Auxiliary Variable

The observation sequence itself is not Markovian, but we posit an auxiliary state sequence that is: a hidden "Markov chain".


Summary of the Concept


Hidden Markov Model

An HMM couples two processes:
- a stochastic chain process: { q(t) }
- an output process: { f(x|q) }

An HMM is also called a doubly stochastic process.


HMM Characterization

λ = (A, B, π)
- A : state transition probability, { aij | aij = p(qt+1=j | qt=i) }
- B : symbol output/observation probability, { bj(v) | bj(v) = p(x=v | qt=j) }
- π : initial state distribution probability, { πi | πi = p(q1=i) }


HMM, Formally

A set of states {S1,…,SN}; qt denotes the state at time t.

A transition probability matrix A, such that A[i,j] = aij = P(qt+1=Sj | qt=Si). This is an N x N matrix.

A set of symbols, {v1,…,vM}. For all purposes, these might as well just be {1,…,M}. ot denotes the observation at time t.

An observation symbol probability distribution matrix B, such that B[i,j] = bi,j = P(ot=vj | qt=Si). This is an N x M matrix.

An initial state distribution, π, such that πi = P(q1=Si).

For convenience, call the entire model λ = (A,B,π).

Note that N and M are important implicit parameters.


Graphical Example


Data interpretation

P(s s p p iy iy iy ch ch ch | λ)
= ΣQ P(s s p p iy iy iy ch ch ch, Q | λ)
= ΣQ P(Q | λ) p(s s p p iy iy iy ch ch ch | Q, λ)

Let Q = 1 1 2 2 3 3 3 4 4 4. The term for this Q is

P(1 1 2 2 3 3 3 4 4 4 | λ) p(s s p p iy iy iy ch ch ch | 1 1 2 2 3 3 3 4 4 4, λ)
= P(1|λ)P(s|1,λ) P(1|1,λ)P(s|1,λ) P(2|1,λ)P(p|2,λ) P(2|2,λ)P(p|2,λ) … × (.3×.6) × (1.×.6)²

with transition matrix A and emission matrix B:

A =
0.6 0.4 0.0 0.0
0.0 0.5 0.5 0.0
0.0 0.0 0.7 0.3
0.0 0.0 0.0 1.0

B =
0.2 0.2 0.0 0.6 …
0.0 0.2 0.5 0.3 …
0.0 0.8 0.1 0.1 …
0.6 0.0 0.2 0.2 …


Issues in HMM

Difficult problems


The Number of States

How many states should the model have for an observation sequence such as:

r r g b b g b b b r


(1) The simplest model


(2) Two state model


(3) Three state models


The Criterion is

The best topology comes from insight and experience; it depends on the number of classes, symbols, and samples.


A trained HMM

A =
.5 .4 .1
.0 .6 .4
.0 .0 1.

B =
.6 .2 .2
.2 .5 .3
.0 .3 .7

π = (1., 0., 0.)


Three Problems

1. What is the probability of the observation?
- Given an observed sequence and an HMM, how probable is that sequence?
- Forward algorithm

2. What is the best state sequence for the observation?
- Given an observed sequence and an HMM, what is the most likely state sequence that generated it?
- Viterbi algorithm

3. Given an observation, can we learn an HMM for it?
- Baum-Welch reestimation algorithm


Problem: How to Compute P(O|λ)

P(O|λ) = ∑all Q P(O|Q,λ)P(Q|λ)

where O is an observation sequence and Q is a sequence of states.

Naively, we could enumerate all paths through the model, and sum up their probabilities. This would involve O(T·N^T) operations:
- at every step, there are N possible transitions, and there are T steps.

However, we can take advantage of the fact that, at any given time, we can only be in one of N states, so we only have to keep track of paths to each state.


Basic Idea for P(O|λ) Algorithm

If we know the probability of being in each state at time t and producing the observation so far (αt(i)), then the probability of being in each state at the next clock tick, and then emitting the next output, is easy to compute.

(Figure: trellis of states at times t and t+1, with αt(i) attached to each state at time t.)


A Better Algorithm For Computing P(O|λ)

Let αt(i) be defined as P(o1o2…ot, qt=Si | λ).
- i.e., the probability of seeing a prefix of the observation, and ending in a particular state.

Algorithm:
- Initialization: α1(i) ← πibi,o1, 1 ≤ i ≤ N
- Induction: αt+1(j) ← (∑1≤i≤N αt(i)aij)bj,ot+1, 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
- Finish: Return ∑1≤i≤N αT(i)

This is called the forward procedure.

How efficient is it?
- At each time step, we do O(N) multiplications and additions for each of N nodes, or O(N²T) overall.
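The forward procedure is a short loop over the trellis. A minimal sketch with a made-up two-state, two-symbol model (the numbers are illustrative, not from the lecture):

```python
# Forward procedure: alpha holds α_t(·); O(N^2 T) overall.
pi = [0.6, 0.4]                      # initial state distribution (made up)
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(q_{t+1}=S_j | q_t=S_i)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(o_t=v_k | q_t=S_i)
N = 2

def forward(obs):
    """Return P(O | λ) = Σ_i α_T(i)."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]           # initialization
    for o in obs[1:]:                                          # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                          # finish

print(forward([0, 0]))  # ≈ 0.411
```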


Can We Do it Backwards?

βt(i) is the probability of outputting the tail of the observation, given that we are in state Si at time t.

If we know the probability of outputting the tail of the observation from each state at time t+1 (βt+1(i)), we can compute the probability that, at the previous clock tick, we emit the (one symbol bigger) tail.

(Figure: trellis showing βt+1(i) at time t+1.)


Going Backwards Instead

Define βt(i) as P(ot+1ot+2…oT | qt=Si, λ).
- i.e., the probability of seeing a tail of the observation, given that we were in a particular state.

Algorithm:
- Initialization: βT(i) ← 1, 1 ≤ i ≤ N
- Induction: βt(i) ← ∑1≤j≤N aijbj,ot+1βt+1(j), T-1 ≥ t ≥ 1, 1 ≤ i ≤ N

This is called the backward procedure.

We could use this to compute P(O|λ) too (β1(i) is P(o2o3…oT | q1=Si, λ), so P(O|λ) = ∑1≤i≤N πibi,o1β1(i)).
- But nobody does this.

How efficient is it?
- Same as the forward procedure.
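The backward procedure, sketched with a made-up two-state, two-symbol model (numbers illustrative, not from the lecture); computing P(O|λ) from β exercises the formula above:

```python
# Backward procedure: beta holds β_t(·); same cost as forward.
pi = [0.6, 0.4]                      # initial state distribution (made up)
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(q_{t+1}=S_j | q_t=S_i)
B = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(o_t=v_k | q_t=S_i)
N = 2

def backward_P(obs):
    """Compute P(O | λ) via β: P(O|λ) = Σ_i π_i b_i(o1) β_1(i)."""
    beta = [1.0] * N                                 # initialization: β_T(i) = 1
    for o in reversed(obs[1:]):                      # induction, t = T-1 … 1
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                for i in range(N)]
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))

print(backward_P([0, 0]))  # ≈ 0.411, the same value the forward procedure gives
```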


1. Model Evaluation

Solution: the forward/backward procedure.
- Define: forward probability -> FW procedure
- Define: backward probability -> BW procedure

These are probabilities of the partial events leading to/from a point in space-time.


Forward procedure


Numerical example: P(RRGB|λ)
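The numbers on this slide were lost in the transcript, but the computation can be reconstructed with the forward procedure and the trained HMM shown a few slides back, under some assumptions: the last row of its transition matrix is taken as (.0 .0 1.) so that every row sums to 1, the columns of B are taken to correspond to the symbols r, g, b in that order, and π = (1., 0., 0.):

```python
# Forward procedure applied to the "trained HMM" slide's model.
A = [[0.5, 0.4, 0.1],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]   # last row assumed (0, 0, 1): rows must sum to 1
B = [[0.6, 0.2, 0.2],   # rows: states 1..3; columns assumed to be r, g, b
     [0.2, 0.5, 0.3],
     [0.0, 0.3, 0.7]]
pi = [1.0, 0.0, 0.0]
R, G, Bl = 0, 1, 2      # symbol indices (assumed ordering)

def forward(obs):
    alpha = [pi[i] * B[i][obs[0]] for i in range(3)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(3)) * B[j][o]
                 for j in range(3)]
    return sum(alpha)

print(forward([R, R, G, Bl]))  # ≈ 0.0362
```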


Backward procedure


2nd Problem: What's the Most Likely State Sequence?

Actually, there is more than one possible interpretation of "most likely state sequence".

One is, which states are individually most likely:
- i.e., what is the most likely first state? The most likely second? And so on.
- Note, though, that we can end up with a "sequence" that isn't even a possible sequence through the HMM, much less a likely one.

The other is, which whole state sequence is most likely:
- i.e., find argmaxQ P(Q|O,λ)


Algorithm Idea

Suppose we knew the highest probability path ending in each state at time step t.

We can compute the highest probability path ending in a state at t+1 by considering each transition from each state at time t, and remembering only the best one.

This is another application of the dynamic programming principle.


Basic Idea for argmaxQP(Q|O,λ) Algorithm

If we know the probability of the best path to each state at time t producing the observation so far (δt(i)), then the probability of the best path to a state producing the next observation at the next time step is the best such probability, times the transition probability, times the probability of the emission.

(Figure: trellis showing the transitions a1,4, a2,4, …, aN-1,4, aN,4 into one state at the next time step.)


Definitions for Computing argmaxQ P(Q|O,λ)

Note that P(Q|O,λ) = P(Q,O|λ)/P(O|λ), so maximizing P(Q|O,λ) is equivalent to maximizing P(Q,O|λ).
- It turns out the latter is a bit easier to work with.

Define δt(i) as maxq1,q2,…,qt-1 P(q1,q2,…,qt=i, o1,o2,…,ot | λ).

We'll use these to inductively find P(Q,O|λ). But how do we find the actual path?

We'll maintain another array, ψt(i), which keeps track of the argument that maximizes δt(i).


Algorithm for Computing argmaxQ P(Q|O,λ)

Algorithm:
- Initialization: δ1(i) ← πibi,o1, ψ1(i) ← 0
- Induction: δt(j) ← maxi(δt-1(i)aij)bj,ot, ψt(j) ← argmaxi(δt-1(i)aij), 2 ≤ t ≤ T, 1 ≤ j ≤ N
- Finish: P* = maxi(δT(i)) is the probability of the best path, and qT* = argmaxi(δT(i)) is the best final state.
- Extract the path by backtracking: qt* = ψt+1(qt+1*), for t = T-1,…,1.

This is called the Viterbi algorithm.

Note that it is very similar to the forward procedure (and hence as efficient).


2. Decoding Problem

What is the best path Q* given an input X? It can be obtained by maximizing the joint probability over state sequences.

Path likelihood score: P(X, Q | λ)


Viterbi algorithm


Numerical Example: P(RRGB, Q*|λ)
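This slide's numbers were also lost in the transcript. A sketch of the Viterbi computation for RRGB, using the trained HMM shown earlier under the same kind of assumptions (last row of its transition matrix taken as (.0 .0 1.), B's columns taken to be the symbols r, g, b in that order, π = (1., 0., 0.)):

```python
# Viterbi algorithm applied to the "trained HMM" slide's model.
A = [[0.5, 0.4, 0.1],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]   # last row assumed (0, 0, 1)
B = [[0.6, 0.2, 0.2],   # columns assumed to be the symbols r, g, b
     [0.2, 0.5, 0.3],
     [0.0, 0.3, 0.7]]
pi = [1.0, 0.0, 0.0]

def viterbi(obs):
    """Return (best path as 0-based state indices, its probability)."""
    delta = [pi[i] * B[i][obs[0]] for i in range(3)]          # δ_1(i)
    psi = []                                                  # backpointers ψ_t(j)
    for o in obs[1:]:
        best = [max(range(3), key=lambda i: delta[i] * A[i][j])
                for j in range(3)]
        delta = [delta[best[j]] * A[best[j]][j] * B[j][o] for j in range(3)]
        psi.append(best)
    q = max(range(3), key=lambda i: delta[i])                 # best final state
    path = [q]
    for back in reversed(psi):                                # backtrack
        q = back[q]
        path.append(q)
    return path[::-1], max(delta)

path, p = viterbi([0, 0, 1, 2])   # observation R R G B
print(path, p)  # path [0, 0, 1, 2] (i.e., states 1,1,2,3), probability ≈ 0.01008
```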


3. Model Training Problem

No analytical solution exists.

MLE + EM algorithm developed:
- Baum-Welch reestimation [Baum+68,70]
- maximizes the probability estimate of observed events
- guarantees finite improvement


MLE Example

Experiment: draw two balls from an urn; both turn out red.
- Known: 3 balls inside (some white, some red; exact numbers unknown)
- Unknown: R = # red balls

Two models:
- p(2 red | R=2) = 2C2 / 3C2 = 1/3
- p(2 red | R=3) = 3C2 / 3C2 = 1

Which model? MLE picks R = 3, since it makes the observation most probable.


Learning an HMM

We will assume we have an observation O, and want to know the "best" HMM for it.

I.e., we would want to find argmaxλ P(λ|O), where λ = (A,B,π) is some HMM.
- I.e., what model is most likely, given the observation?

When we have a fixed observation that we use to pick a model, we call the observation training data.

Functions/values like P(λ|O) are called likelihoods.

What we want is the maximum likelihood estimate (MLE) of the parameters of our model, given some training data.

This is an estimate of the parameters, because it depends for its accuracy on how representative the observation is.


Maximizing Likelihoods

By Bayes' rule, P(λ|O) = P(λ)P(O|λ)/P(O).
- The observation is constant, so it is enough to maximize P(λ)P(O|λ).

P(λ) is the prior probability of a model; P(O|λ) is the probability of the observation, given a model.

Typically, we don't know much about P(λ).
- E.g., we might assume all models are equally likely.
- Or, we might stipulate that some subclass are equally likely, and the rest not worth considering.

We will ignore P(λ), and simply optimize P(O|λ). I.e., we can find the most probable model by asking which model makes the data most probable.


Maximum Likelihood Example

Simple example: We have a coin that may be biased. We would like to know the probability that it will come up heads.

Let's flip it a number of times, and use the percentage of times it comes up heads to estimate the desired probability.

Given that m out of n trials come up heads, what probability should be assigned?

In terms of likelihoods, we want to know P(λ|O), where our model is just the simple parameter, the probability of a coin coming up heads. As before, this amounts to maximizing P(λ)P(O|λ).


Maximizing P(λ)P(O|λ) For Coin Flips

Note that knowing something about P(λ) is knowing whether coins tend to be biased. If we don't, we just optimize P(O|λ).

I.e., let's pick the model that makes the observation most likely.

We can solve this analytically:
- P(m heads over n coin tosses | P(Heads)=p) = nCm p^m (1-p)^(n-m)
- Take the derivative, set it equal to 0, and solve for p.
- It turns out p is m/n. (This is probably what you guessed.)

So we have made a simple maximum likelihood estimate.
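The analytic result p = m/n is easy to check numerically; a small sketch that evaluates the binomial likelihood nCm p^m (1-p)^(n-m) on a grid, with grid search standing in for the calculus:

```python
# Numerical check that p = m/n maximizes the coin-flip likelihood.
from math import comb

def likelihood(p, m, n):
    """P(m heads in n tosses | P(Heads) = p)."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

m, n = 3, 10                              # e.g., 3 heads out of 10 flips
grid = [i / 1000 for i in range(1001)]    # candidate values of p
best = max(grid, key=lambda p: likelihood(p, m, n))
print(best)  # → 0.3, i.e. m/n
```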


Comments

Clearly, our result will just be an estimate,
- but one we hope will become increasingly accurate with more data.

Note, though, that via MLE, the probability of everything we haven't seen so far is 0. For modeling rare events, there will never be enough data for this problem to go away.

There are ways to smooth over this problem (in fact, one way is called "smoothing"!), but we won't worry about this now.


Back to HMMs

MLE tells us to optimize P(O|λ). We know how to compute P(O|λ) for a given model.
- E.g., use the "forward" procedure.

How do we find the best model? Unfortunately, we don't know how to solve this problem analytically.

However, there is a procedure to find a better model, given an existing one.
- So, if we have a good guess, we can make it better.
- Or, start out with a fully connected HMM (of given N, M) in which each state can emit every possible value, and set all probabilities to random non-zero values.

Will this guarantee us a best solution? No! So this is a form of …
- Hillclimbing!


Maximizing the Probability of the Observation

Basic idea is:

1. Start with some initial model λold.

2. Compute a new λnew based on λold and the observation O.

3. If P(O|λnew)-P(O| λold) < threshold

(or we’ve iterated enough), stop.

4. Otherwise, λold ← λnew and go to step 2.
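The loop above can be sketched generically; `reestimate_fn` and `likelihood_fn` are placeholders for the HMM-specific computations, not names from the lecture:

```python
def improve(model, observation, reestimate_fn, likelihood_fn,
            threshold=1e-6, max_iter=100):
    """Generic version of the loop above: keep reestimating until the
    likelihood gain drops below a threshold (or we've iterated enough)."""
    old = model
    for _ in range(max_iter):
        new = reestimate_fn(old, observation)
        # Stop when the improvement in likelihood is negligible.
        if likelihood_fn(observation, new) - likelihood_fn(observation, old) < threshold:
            return new
        old = new
    return old
```

Plugging in the Baum-Welch reestimation step (described below in the notes) for `reestimate_fn` gives HMM training.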


Increasing the Probability of the Observation

Let’s “count” the number of times, from each state, we

- start in a given state

- make a transition to each other state

- emit each different symbol.

If we knew these numbers, we can compute new probability

estimates.

We can’t really “count” these.

- We don’t know for sure which path through the model was

taken.

- But we know the probability of each path, given the model.

So we can compute the expected value of each figure.


Set Up

- i.e., the probability that, given observation and model, we

are in state Si at time t and state Sj at time t+1.

Here is how such a transition can happen:

Si

Sj


Computing ξt(i,j)

But we want ξt(i,j) = P(qt=Si,qt+1=Sj|O,λ)

By definition of conditional probability,

ξt(i,j) = αt(i)aijbj,ot+1βt+1(j)/P(O|λ)

i.e., given

- P(O), which we can compute by forward, but also, by

∑i ∑j αt(i)aijbj,ot+1βt+1(j)

- and A and B, which are given in the model – we can compute ξ.


One more preliminary…

Define γt(i) as probability of being in state Si at time t, given

the observation and the model.

Given our definition of ξ:

γt(i)=∑1≤j≤N ξt(i,j)

So:

- ∑1≤t≤T-1 γt(i) = expected number of transitions from Si

- ∑1≤t≤T-1 ξt(i,j) = expected number of transitions from

Si to Sj


Now We Can Reestimate Parameters

aij’ = expected no. of transitions from Si to Sj/expected no. of

transitions from Si

πi’ = probability of being in Si at time 1 = γ1(i)

bjk’ = expected no. of times in Sj, observing vk/expected no.

of times in Sj

= ∑1≤t≤T, s.t. ot=vk γt(j) / ∑1≤t≤T γt(j)
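Putting the γ/ξ definitions and the reestimation formulae together, one full update step can be sketched in plain Python (an unscaled toy sketch; real implementations rescale to avoid underflow):

```python
def forward_backward(A, B, pi, O):
    """Compute the alpha and beta tables for a discrete HMM λ=(A,B,π).
    A[i][j], B[i][k], pi[i] are nested lists; O is a list of symbol indices."""
    N, T = len(pi), len(O)
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    return alpha, beta

def reestimate(A, B, pi, O):
    """One Baum-Welch update: expected counts from gamma/xi, renormalized."""
    N, T = len(pi), len(O)
    alpha, beta = forward_backward(A, B, pi, O)
    PO = sum(alpha[T-1])
    # gamma[t][i] = P(q_t = S_i | O, λ)
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    # xi[t][i][j] = P(q_t = S_i, q_{t+1} = S_j | O, λ)
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T)) for k in range(len(B[0]))]
             for j in range(N)]
    return new_A, new_B, new_pi
```

Each reestimated row is automatically a probability distribution, and iterating the update never decreases P(O|λ).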


Iterative Reestimation Formulae


Baum-Welch Algorithm

or Forward-backward algorithm

It was proven (by Baum et al.) that this reestimation procedure

leads to increased likelihood.

But remember, it only guarantees climbing to a local maximum!

It is a special case of a very general algorithm for incremental

improvement by iteratively

- computing expected values of the hidden variables (here, the

transitions and state emissions),

- then reestimating the model parameters from those expected values.

- The general procedure is called Expectation-Maximization,

or the EM algorithm.


Implementation Considerations

One issue: the αt(i) values keep shrinking along the

observation, and eventually underflow.

- Fix: normalize at each step. I.e., instead of computing αt(i), at each time step, compute

αt(i)/∑iαt(i).

- Turns out that if you also used the ∑iαt(i)s to scale the

βt(j)s, you get a numerically nice value, although it doesn’t

have a nice probabilistic interpretation;

- Yet, when you use both scaled values, the rest of the ξt(i,j)

computation is exactly the same.
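A sketch of that scaling trick: normalize α at every step and accumulate the logs of the normalizers, which reconstructs log P(O|λ) without underflow. Function names here are illustrative:

```python
import math

def naive_forward(A, B, pi, O):
    """Unscaled forward pass; underflows for long observation sequences."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)

def scaled_forward_logprob(A, B, pi, O):
    """Forward pass that rescales alpha at every step; the scaling factors
    multiply together to give P(O|λ), so we return their summed logs."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    log_prob = 0.0
    for t in range(len(O)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                     for j in range(N)]
        c = sum(alpha)            # the normalizer ∑i αt(i)
        alpha = [a / c for a in alpha]
        log_prob += math.log(c)
    return log_prob
```

On short sequences, exp of the returned log-probability agrees with the unscaled forward pass; on long ones, only the scaled version survives.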

Another: Sometimes estimated probabilities will still get very

small, and it seems better to not let these fall all the way to 0.


Other issues


Other types of parametric structure

Continuous density HMM (CHMM)

Semi-continuous HMM

State-duration HMM


Graphical DHMM and CHMM


Statistical Decision Making System

Statistical K-class pattern recognition

Classes ω1, …, ωK

Class-conditional probabilities P(X|ωk)

Prior probabilities P(ωk)


An Application of HMMs: “Part of Speech” Tagging

A problem: Word sense disambiguation.

- I.e., words typically have multiple senses; need to determine

which one a given word is being used as.

For now, we just want to guess the “part of speech” of each

word in a sentence.

- Words behave differently as a function of their “grammatical class”.

- Grammatical classes are things like verb, noun, preposition,

determiner, etc.

- Words often have multiple grammatical classes.

» For example, the word “rock” can be a noun or a verb.

» Each of which can have a number of different meanings.

We want an algorithm that will “tag” each word with its most

likely part of speech.


Parsing, Briefly

[Parse tree: S → NP VP; NP → PRO “I”; VP → V “saw” NP; NP → D “a” N “bird”.]


However

Each of these words is listed in the dictionary as having

multiple POS entries:

- “saw”: noun, verb

- “bird”: noun, verb (as in “to

birdwatch”)

- “I”: pronoun, noun (the letter “I”, the square root of –1,

something shaped like an I (I-beam), symbol (I-80, Roman

numeral, iodine)

- “a”: article, noun (the letter “a”, something shaped like an “a”,

the grade “A”), preposition (“three times a day”), French

pronoun.


Moreover, there is a parse!

[Parse tree: S → NP VP; NP → N “I” N “saw” N “a”; VP → V “bird”.]


What’s It Mean?

The question is, can we avoid such nonsense cheaply?

Note that this parse corresponds to a very unlikely set of POS

tags.

So, just restricting our tags to reasonably probable ones might

eliminate such silly options.


Solving the Problem

Simplest approach: tag each word with its most frequent POS.

- Estimate the frequencies using “tagged corpora”.

How well does it work?

Turns out it will be correct about 91% of the time.

Good?

Humans will agree with each other about 97-98% of the time.

So, room for improvement.


A Better Algorithm

Make POS guess depend on context.

For example, if the previous word were “the”, then the word

“rock” is much more likely to be occurring as a noun than as

a verb.

We can incorporate context by setting up an HMM in which the

hidden states are POSs, and the words the emissions in those

states.


HMM For POS Tagging: Example


HMM For POS Tagging

Second-order equivalent to POS trigrams.

Generally works well if there is a hand-tagged corpus from

which to read off the probabilities.

- Lots of detailed issues: smoothing, etc.

If none available, train using Baum-Welch.

Usually start with some constraints:

- E.g., start with 0 emissions for words not listed in dictionary

as having a given POS; estimate transitions.

Best variations get about 96-97% accuracy, which is

approaching human performance.

Hidden Markov Model

CS570 Lecture Note

This lecture note was made based on the notes of

Prof. B.K.Shin(Pukyung Nat’l Univ) and Prof. Wilensky (UCB)


Sequential Data

Information is contained in the structure


More examples

main() { char q=34, n=10, *a=“main() {

char q=34, n=10, *a=%c%s%c; printf(

a,q,a,q,n);}%c”; printf(a,q,a,n); }


Example: Speech Recognition

Given a sequence of inputs-features of some kind extracted by some hardware, guess the words to which the features correspond.

Hard because features dependent on

Speaker, speed, noise, nearby features(“co-articulation” constraints), word boundaries

“How to wreak a nice beach.”

“How to recognize speech.”


Defining the problem

y, a string of acoustic features of some form,

w, a string of words, from some fixed vocabulary

L, a language (defined as all possible strings in that language),

Given some features, what is the most probable string the speaker uttered?


Analysis

P(w|y)= P(w)P(y|w)/P(y)

Since y is the same for different w’s we might choose, the

problem reduces to maximizing P(w)P(y|w), where

- P(w) is the prior probability of

each possible string in our language

- and P(y|w) is the probability of a

pronunciation, given an utterance.


P(w) where w is an utterance

Problem: There are a very large number of possible utterances!

Indeed, we create new utterances all the time, so we cannot hope to have their probabilities.

So, we will need to make some independence assumptions.

First attempt: Assume that words are uttered independently of one another.

Then P(w) becomes P(w1)… P(wn) , where wi are the individual words in the string.

Easy to estimate these numbers-count the relative frequency words in the language.
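A minimal sketch of such a relative-frequency estimate (the toy corpus is invented for illustration):

```python
from collections import Counter

def unigram_probs(corpus_words):
    """Estimate P(w) as the relative frequency of w in the corpus."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy corpus, just for illustration.
corpus = "the girl saw the boy the boy saw a bird".split()
P = unigram_probs(corpus)
```

Here "the" occurs 3 times out of 10 words, so P["the"] = 0.3, and the estimates sum to 1 by construction.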


Assumptions

Words don’t just follow each other randomly

Second attempt: Assume each word depends only on the previous word.

E.g., “the” is more likely to be followed by “ball” than by “a”,

despite the fact that “a” would otherwise be a very common, and hence, highly probable word.

Of course, this is still not a great assumption, but it may be a decent approximation


In General

This is typical of lots of problems, in which

we view the probability of some event as dependent on potentially many past events,

of which there are too many actual dependencies to deal with.

So we simplify by making assumption that

Each event depends only on previous event, and

it doesn’t make any difference when these events happen

in the sequence.


Speech Example

= s p iy iy iy ch ch ch ch


Analysis Methods

Probability-based analysis?

Method I

A poor model for temporal structure

Model size = |V| = N


Analysis methods

Method II

A symbol is dependent only on the immediately preceding:

|V|×|V| matrix model

10^5 × 10^5 – doubly outrageous!!


Another analysis method

Method III

What you see is a clue to what lies behind and is not known a priori

The source that generated the observation

The source evolves and generates characteristic observation sequences


More Formally

To clarify, let’s write the sequence this way:

P(q1=Si, q2=Sj,…, qn-1=Sk, qn=Si)

Here the qi indicate the i-th position of the sequence,

and the Si the possible different words from our

vocabulary.

E.g., if the string were “The girl saw the boy”, we might have

S1 = the, S2 = girl, S3 = saw, S4 = boy

q1 = S1, q2 = S2, q3 = S3, q4 = S1, q5 = S4


Formalization (continue)

We want P(q1=Si, q2=Sj, …, qn-1=Sk, qn=Sl)

Let’s break this down as we usually break down a joint:

= P(qn=Sl | q1=Si, …, qn-1=Sk) P(q1=Si, …, qn-1=Sk)

= P(qn=Sl | q1=Si, …, qn-1=Sk) P(qn-1=Sk | q1=Si, …, qn-2=Sm) … P(q2=Sj | q1=Si) P(q1=Si)

Our simplifying assumption is that each event is only

dependent on the previous event, and that we don’t care when

the events happen, i.e.,

P(qt=Si | qt-1=Sk) = P(qu=Si | qu-1=Sk) for any positions t and u

This is called the Markov assumption.


Markov Assumption

“The future does not depend on the past, given the present.”

Sometimes this is called the first-order Markov assumption.

A second-order assumption would mean that each event depends on the previous two events.

This isn’t really a crucial distinction.

What’s crucial is that there is some limit on how far we are willing to look back.


Markov Models

The Markov assumption means that there is only one probability to remember for transitions from each event type (e.g., word) to each other event type.

Plus the probabilities of starting with a particular event.

This lets us define a Markov model as:

a finite state automaton in which

the states represent possible event types(e.g., the different words in our example)

the transitions represent the probability of one event type following another.

It is easy to depict a Markov model as a graph.


Example: A Markov Model for a Tiny Fragment

of English

e.g., P(qi=girl|qi-1 =the)=.8

The numbers on the initial arrows show the probability of

starting in the given state.

Missing probabilities are assumed to be 0.

[Graph: states “the”, “a”, “girl”, “little”; initial probabilities .7 (the) and .3 (a); transition probabilities .8, .2, .78, .22, .9, .1 on the arrows.]


Example: A Markov Model for a Tiny Fragment

of English

We can compute the probability of a string with these probabilities:

P(“A little little girl”) = .3 × .22 × .1 × .9 = .00594

[Graph: the same model as on the previous slide.]
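Reading the numbers off the worked examples (start probabilities .7/.3, P(girl|the) = .8, and so on), the computation can be reproduced; the dictionary encoding below is just one possible representation:

```python
# Start and transition probabilities read off the example graph.
start = {"the": 0.7, "a": 0.3}
trans = {
    ("the", "girl"): 0.8, ("the", "little"): 0.2,
    ("a", "girl"): 0.78, ("a", "little"): 0.22,
    ("little", "girl"): 0.9, ("little", "little"): 0.1,
}

def string_prob(words):
    """P(w1...wn) = P(start=w1) * prod of P(wi | wi-1); missing arcs are 0."""
    p = start.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p
```

string_prob(["a", "little", "little", "girl"]) reproduces the slide's .3 × .22 × .1 × .9 = .00594.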


Example(con’t)

P(q1=the, q2=little, q3=girl)

where the, little, and girl name the corresponding states.

“Given that the sentence begins with “a”, what is the probability

that the next words are “little girl”?”

P(q3=girl, q2=little | q1=a)

= P(q3=girl | q2=little, q1=a) P(q2=little | q1=a)

= P(q3=girl | q2=little) P(q2=little | q1=a)

= .9 × .22 = .198


Markov Models and Graphical Models

Markov models and Belief Networks can both be represented by nice graphs.

Do the graphs mean the same thing?

No! In the graphs for Markov models, nodes do not represent random variables, and arcs do not carry CPTs.

Suppose we wanted to encode the same information via a belief network.

We would have to “unroll” it into a sequence of nodes-as many as there are elements in the sequence-each dependent on the previous, each with the same CPT.

This redrawing is valid, and sometimes useful, but doesn’t explicitly represent useful facts, such as that the CPTs are the same everywhere.


Back to the Speech Recognition Problem

A Markov model for all of English would have one node for each word, which would be connected to the node for each word that can follow it.

Without Loss Of Generality, we could have connections from every node to every node, some of which have transition probability 0.

Such a model is sometimes called a bigram model.

This is equivalent to knowing the probability distribution of pairs of words in sequences (and the probability distribution for individual words).

A bigram model is an example of a language model, i.e., some (in this case, extremely simple) view of what sentences or sequences are likely to be seen.


Bigram Model

Bigram models are rather inaccurate language models.

E.g., the word after “a” is much more likely to be “missile” if the word preceding “a” is “launch”.

- I.e., the Markov assumption is pretty bad.

If we could condition on a few previous words, life gets a bit better:

E.g., we could predict “missile” is more likely to follow “launch a” than “saw a”.

This would require a “second order” Markov model.


Higher-Order Models

In the case of words, this is equivalent to going to trigrams.

Fundamentally, this isn’t a big difference:

We can convert a second order model into a first order model, but with a lot more states.

And we would need much more data!

Note, though, that a second-order model still couldn’t accurately predict what follows “launch a large”

i.e., we are predicting the next word based on only the two previous words, so the useful information before “a large” is lost.

Nevertheless, such language models are very useful approximations.


Back to Our Spoken Sentence Recognition Problem

We are trying to find

argmaxw∈L P(w)P(y|w)

Now let’s look at P(y|w).

- That is, how do we pronounce a sequence of words?

- Can make the simplification that how we pronounce words

is independent of one another.

P(y|w)=ΣP(o1=vi,o2=vj,…,ok=vl|w1)×

… × P(ox-m=vp,ox-m+1=vq,…,ox=vr |wn)

i.e., each word produces some of the sounds with some

probability; we have to sum over possible different word

boundaries.

So, what we need is model of how we pronounce individual words.


A Model

We can represent this idea by complicating the Markov model:

- Let’s add probabilistic emissions of outputs from each state.


Example: A (Simplistic) Model for Pronouncing “of”

Each state can emit a different sound, with some probability.

Variant: Have the emissions on the transitions, rather than

the states.


How the Model Works

We can’t “see” the actual state transitions.

But we can infer possible underlying transitions from the

observations, and then assign a probability to them

E.g., from “o v”, we infer the transition “phone1 → phone2”

- with probability .7 x .9 = .63.

I.e., the probability that the word “of” would be pronounced

as “o v” is 63%.


Hidden Markov Models

This is a “hidden Markov model”, or HMM.

Like (fully observable) Markov models, transitions from one

state to another are independent of everything else.

Also, the emission of an output from a state depends only on

that state, i.e.:

P(o1,o2,…,on|q1,…,qn) = P(o1|q1)×P(o2|q2)×…×P(on|qn)


HMMs Assign Probabilities

We want to know how probable a sequence of observations

is given an HMM.

- Complication: there may be many different ways to produce the observed output.

So, we have to consider all possible ways an output might be

produced, i.e., for a given HMM:

P(O) = ∑Q P(O|Q)P(Q)

where O is a given output sequence, and Q ranges over all

possible sequence of states in the model.

P(Q) is computed as for (visible) Markov models.

P(O|Q) = P(o1,o2,…,on|q1,…qn)

= P(o1|q1)×P(o2|q2)×…×P(on|qn)

We’ll look at computing this efficiently in a while…
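Before the efficient version, the sum P(O) = ∑Q P(O|Q)P(Q) can be sketched by brute force, enumerating every state sequence Q (exponential in T, so only usable on tiny models):

```python
from itertools import product

def brute_force_prob(A, B, pi, O):
    """P(O|λ) = Σ_Q P(Q|λ) P(O|Q,λ), enumerating all N^T state sequences.
    A: transitions, B: emissions, pi: initial distribution, O: symbol indices."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):
        # P(Q)P(O|Q) for this particular state sequence.
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, T):
            p *= A[Q[t-1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total
```

Summed over all possible observation sequences of a fixed length, these probabilities total 1, which is a handy correctness check for the faster forward procedure.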


Finishing Solving the Speech Problem

To find argmaxw∈L P(w)P(y|w), just consider these

probabilities over all possible strings of words.

Could “splice in” each word in language model with its HMM

pronunciation model to get one big HMM.

- Lets us incorporate more dependencies.

- E.g., could have two models of “of”, one of which has a

much higher probability of transitioning to words beginning

with consonants.

- In real systems, the phones themselves are broken up into acoustic vectors.

- but these are also HMMs.

- So we can make one gigantic HMM out of the whole thing.

So, given y, all we need is to find most probable path through

the model that generates it.


Example: Our Word Markov Model

[Graph: the word Markov model over “the”, “a”, “girl”, “little” again.]


Example: Splicing in Pronunciation HMMs

[Figure: the word model with a pronunciation HMM spliced in for each word; numbered states 1–10 emit output symbols v1–v12, joined by the word-model transition probabilities (.7, .3, .8, .2, .78, .22, .9, .1).]


Example: Best Sequence

[Figure: the same spliced HMM.]

Suppose observation is “v1 v3 v4 v9 v8 v11 v7 v8 v10”

Suppose most probable sequence is determined to be

“1,2,3,8,9,10,4,5,6” (happens to be only way in example)

Then interpretation is “the little girl”.



Hidden Markov Models

Determine the probability of a given sequence

Determine the probability of a model producing a sequence in a particular way

equivalent to recognizing or interpreting that sequence

Learning a model from some observations.


Why HMM?

Because the HMM is a very good model for such patterns!

highly variable spatiotemporal data sequence

often unclear, uncertain, and incomplete

Because it is very successful in many applications!

Because it is quite easy to use!

Tools already exist…


The problem

Not quite a valid assumption

There are often errors or noise

Noisy sound, sloppy handwriting, ungrammatical or Kornglish sentence

There may be some truth process

Underlying hidden sequence


The Auxiliary Variable

The observation sequence itself is not Markovian,

but the underlying (hidden) state sequence is:

a “Markov chain”


Summary of the Concept


Hidden Markov Model

stochastic chain process : { q(t) }

output process : { f(x|q) }

An HMM is also called a “probabilistic function of a Markov chain”.


HMM Characterization

λ = (A, B, π)

A : state transition probability

B : symbol output/observation probability

π : initial state distribution probability

{ πi | πi = p(q1=i) }


HMM, Formally

A set of N states, {S1,…,SN}

qt denotes the state at time t.

A transition probability matrix A, such that

A[i,j]=aij=P(qt+1=Sj|qt=Si)

This is an N x N matrix.

A set of symbols, {v1,…,vM}

For all purposes, these might as well just be {1,…,M}

ot denotes the observation at time t.

An observation symbol probability distribution matrix B, such that

B[i,j]=bi,j=P(ot=vj|qt=Si)

This is an N x M matrix.

An initial state distribution, π, such that πi=P(q1=Si)

For convenience, call the entire model λ = (A,B,π)

Note that N and M are important implicit parameters.
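A small sketch of the constraints this definition imposes on λ = (A, B, π); `is_valid_hmm` is an illustrative helper, not part of the lecture:

```python
def is_valid_hmm(A, B, pi, tol=1e-9):
    """Check the stochastic constraints on λ=(A,B,π):
    every row of A and B sums to 1, as does π.
    N and M fall out of the shapes: A is N×N, B is N×M, π has length N."""
    N = len(A)
    return (len(B) == N and len(pi) == N
            and abs(sum(pi) - 1) < tol
            and all(abs(sum(row) - 1) < tol for row in A)
            and all(abs(sum(row) - 1) < tol for row in B))
```

Checks like this catch, for example, a transition row that fails to sum to 1 after a hand edit.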


Graphical Example


Data interpretation

P(s s p p iy iy iy ch ch ch | λ)

= ΣQ P(s s p p iy iy iy ch ch ch, Q | λ)

= ΣQ P(Q | λ) p(s s p p iy iy iy ch ch ch | Q, λ)

Taking the term for one state sequence Q:

P(1122333444 | λ) p(s s p p iy iy iy ch ch ch | 1122333444, λ)

= P(1|λ)P(s|1,λ) P(1|1,λ)P(s|1,λ) P(2|1,λ)P(p|2,λ)

P(2|2,λ)P(p|2,λ) …

× (.3×.6) × (1.×.6)^2 × …

A =

0.6 0.4 0.0 0.0

0.0 0.5 0.5 0.0

0.0 0.0 0.7 0.3

0.0 0.0 0.0 1.0

B =

0.2 0.2 0.0 0.6 …

0.0 0.2 0.5 0.3 …

0.0 0.8 0.1 0.1 …

0.6 0.0 0.2 0.2 …

Let Q = 1 1 2 2 3 3 3 4 4 4


Issues in HMM

Difficult problems


The Number of States

r r g b b g b b b r


(1) The simplest model


(2) Two state model


(3) Three state models


The Criterion is

The best topology comes from insight and experience

the # classes/symbols/samples


A trained HMM

A =

.5 .4 .1

.0 .6 .4

.0 .0 1.

B =

.6 .2 .2

.2 .5 .3

.0 .3 .7

π = 1. 0. 0.


Three Problems

What is the probability of the observation?

Given an observed sequence and an HMM, how probable is that sequence?

Forward algorithm

What is the best state sequence for the observation?

Given an observed sequence and an HMM, what is the most likely state sequence that generated it?

Viterbi algorithm

Given an observation, can we learn an HMM for it?

Baum-Welch reestimation algorithm


Problem: How to Compute P(O|λ)

P(O|λ)=∑all Q P(O|Q,λ)P(Q|λ)

where O is an observation sequence and Q is a sequence of

states.

- We could enumerate every possible state sequence through the model, and sum up their probabilities.

Naively, this would involve O(T·N^T) operations:

- At every step, there are N possible transitions, and there are

T steps.

However, we can take advantage of the fact that, at any given

time, we can only be in one of N states, so we only have to

keep track of paths to each state.


Basic Idea for P(O|λ) Algorithm

If we know the probability

of being in each state at

time t, and producing the

observation so far (αt(i)),

then the probability of

being in each state at the

next clock tick, and then

emitting the next output, is

easy to compute.



A Better Algorithm For Computing P(O|λ)

Let αt(i) be defined as P(o1o2…ot,qt=Si|λ).

- I.e., the probability of seeing a prefix of the observation, and

ending in a particular state.

Algorithm:

- Initialization: α1(i) ← πibi,o1

- Induction: αt+1(j) ← (∑1≤i≤N αt(i)aij)bj,ot+1

1 ≤ t ≤ T-1, 1 ≤ j ≤ N

- Finish: Return ∑1≤i≤N αT(i)

This is called the forward procedure.

How efficient is it?

- At each time step, do O(N) multiplications and additions for

each of N nodes, or O(N2T).
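The three steps (initialization, induction, termination) translate directly into code; a minimal sketch:

```python
def forward(A, B, pi, O):
    """The forward procedure: returns P(O|λ) in O(N^2 T) time.
    A: transitions, B: emissions, pi: initial distribution, O: symbol indices."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    for t in range(1, len(O)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    # Termination: P(O|λ) = sum_i alpha_T(i)
    return sum(alpha)
```

Only the current column of α is kept here; store every column if the α values are needed later (e.g., for reestimation).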


Can We Do it Backwards?

βt(i) is the probability of

outputting the tail of the

observation, given that we

are in state Si at time t.

If we know βt+1(j) for every

state, we can compute the

probability that, at the

previous clock tick, we emit

the (one symbol bigger) tail.


Going Backwards Instead

Define βt(i) as P(ot+1ot+2…oT|qt=Si,λ).

- I.e., the probability of seeing a tail of the observation, given

that we were in a particular state.

Algorithm:

- Initialization: βT(i) ← 1

- Induction: βt(i) ← ∑1≤j≤N aijbj,ot+1βt+1(j)

T-1 ≥ t ≥ 1, 1 ≤ i ≤ N

This is called the backward procedure.

We could use this to compute P(O|λ) too (β1(i) is

P(o2o3…oT|q1=Si,λ), so P(O|λ) = ∑1≤i≤N πibi,o1β1(i)).

- But nobody does this.

How efficient is it?

- Same as forward procedure.
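A matching sketch of the backward procedure, returning P(O|λ) via the β1 identity above:

```python
def backward_prob(A, B, pi, O):
    """The backward procedure; returns P(O|λ) = sum_i pi_i b_i(o_1) beta_1(i).
    Same O(N^2 T) cost as the forward procedure."""
    N, T = len(pi), len(O)
    beta = [1.0] * N                                   # beta_T(i) = 1
    for t in range(T - 2, -1, -1):                     # induction, T-1 >= t >= 1
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))
```

On any model it agrees with the forward procedure's answer, which makes the pair a useful cross-check during debugging.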


1. Model Evaluation

Solution: forward/backward procedure

Define: forward probability -> FW procedure

Define: backward probability -> BW procedure

These are probabilities of the partial events leading to/from a point in space-time


Forward procedure


Numerical example: P(RRGB|λ)


Backward procedure


2nd Problem: What’s the Most Likely State Sequence?

Actually, there is more than one possible interpretation of

“most likely state sequence”.

One is, which states are individually most likely.

- i.e., what is the most likely first state? The most likely second?

And so on.

- Note, though, that we can end up with a “sequence” that isn’t even a possible sequence through the HMM, much less a likely one.

The other interpretation is the single most likely state sequence:

- i.e., find argmaxQ P(Q|O,λ)


Algorithm Idea

Suppose we knew the highest probability path ending in each

state at time step t.

We can compute the highest probability path ending in a state

at t+1 by considering each transition from each state at time t,

and remembering only the best one.

This is another application of the dynamic programming

principle.


Basic Idea for argmaxQP(Q|O,λ) Algorithm

If we know the probability of the best path to each state at time t producing the observation so far (δt(i)), then the probability of the best path to a state producing the next observation at the next time step is the best such previous probability, times the transition probability, times the probability of the emission.

(Figure: transitions a1,4, a2,4, …, aN-1,4, aN,4 from each state at time t into state 4 at time t+1.)


Definitions for Computing argmaxQ P(Q|O,λ)

Note that P(Q|O,λ)=P(Q,O|λ)/P(O|λ), so maximizing

P(Q|O,λ) is equivalent to maximizing P(Q,O|λ).

- Turns out latter is a bit easier to work with.

Define δt(i) as

maxq1,q2,…,qt-1P(q1,q2,…,qt=i,o1,o2,…,ot|λ).

We’ll use these to inductively find P(Q,O|λ). But how do we

find the actual path?

We’ll maintain another array, ψt(i), which keeps track of the

argument that maximizes δt(i).


Algorithm for Computing argmaxQ P(Q|O,λ)

Algorithm:

- Initialization: δ1(i) ← πi bi,o1, ψ1(i) ← 0

- Induction: δt(j) ← maxi (δt-1(i) aij) bj,ot

ψt(j) ← argmaxi (δt-1(i) aij), 2 ≤ t ≤ T, 1 ≤ j ≤ N

- Finish: P* = maxi (δT(i)) is probability of best path

qT* = argmaxi (δT(i)) is best final state

- Extract path by backtracking: qt* = ψt+1(qt+1*), t = T-1, …, 1

This is called the Viterbi algorithm.

Note that it is very similar to the forward procedure (and hence

as efficient).
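The steps above can be sketched directly in Python; the toy model numbers are invented for illustration, not from the notes.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    # delta[t, j] = best-path probability ending in state j at time t
    # psi[t, j]   = backpointer: best predecessor of state j at time t
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):
        trans = delta[t-1][:, None] * A             # (i, j): from i into j
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]] # induction
    path = [int(delta[-1].argmax())]                # best final state
    for t in range(T - 1, 0, -1):                   # backtrack
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), delta[-1].max()

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
path, p_star = viterbi(pi, A, B, [0, 1, 2, 1])
```

Like the forward procedure, this is O(N²T): the only change is that the sum over predecessors becomes a max (plus a backpointer).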


2. Decoding Problem

The best path Q* given an input X ?

It can be obtained by maximizing the joint probability over state sequences

Path likelihood score:


Viterbi algorithm


Numerical Example: P(RRGB,Q*|λ)


3. Model Training Problem

No analytical solution exists

MLE + EM algorithm developed

Baum-Welch reestimation [Baum+68,70]

maximizes the probability estimate of observed events

guarantees finite improvement


MLE Example

Experiment: draw 2 balls from the urn; both turn out red

Known: 3 balls inside (some white, some red; exact numbers unknown)

Unknown: R = # red balls

Two models

p(2 red|R=2) = 2C2 / 3C2 = 1/3

p(2 red|R=3) = 3C2 / 3C2 = 1

Which model? MLE picks R=3, under which the observation is most probable.


Learning an HMM

We will assume we have an observation O, and want to know

the “best” HMM for it.

I.e., we would want to find argmaxλ P(λ|O), where

λ= (A,B,π) is some HMM.

- I.e., what model is most likely, given the observation?

When we have a fixed observation that we use to pick a model,

we call the observation training data.

Functions/values like P(λ|O) are called likelihoods.

What we want is the maximum likelihood estimate(MLE) of the

parameters of our model, given some training data.

This is an estimate of the parameters, because it depends for

its accuracy on how representative the observation is.


Maximizing Likelihoods

By Bayes’ rule, P(λ|O) = P(λ)P(O|λ)/P(O).

- The observation is constant, so it is enough to maximize P(λ)P(O|λ).

P(λ) is the prior probability of a model; P(O|λ) is the

probability of the observation, given a model.

Typically, we don’t know much about P(λ).

- E.g., we might assume all models are equally likely.

- Or, we might stipulate that some subclass are equally likely,

and the rest not worth considering.

We will ignore P(λ), and simply optimize P(O|λ).

I.e., we can find the most probable model by asking which

model makes the data most probable.


Maximum Likelihood Example

Simple example: We have a coin that may be biased. We

would like to know the probability that it will come up heads.

Let’s flip it a number of times; use percentage of times it comes

up heads to estimate the desired probability.

Given m out of n trials come up heads, what probability should

be assigned?

In terms of likelihoods, we want to know P(λ|O), where our model is just the single parameter p, the probability of the coin coming up heads.

As before, this is equivalent to maximizing P(λ)P(O|λ).


Maximizing P(λ)P(O|λ) For Coin Flips

Note that knowing something about P(λ) is knowing whether

coins tended to be biased.

If we don’t, we just optimize P(O|λ).

I.e., let’s pick the model that makes the observation most likely.

We can solve this analytically:

- P(m heads over n coin tosses | P(Heads)=p) = nCm p^m (1-p)^(n-m)

- Take derivative, set equal to 0, solve for p.

Turns out p is m/n.

- This is probably what you guessed.

So we have made a simple maximum likelihood estimate.
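The claim that p = m/n maximizes the binomial likelihood can also be checked numerically; this quick sketch (with made-up counts) does a grid search instead of taking the derivative.

```python
from math import comb

m, n = 7, 10  # hypothetical data: 7 heads in 10 tosses

def likelihood(p):
    # P(m heads over n tosses | P(Heads) = p) = nCm p^m (1-p)^(n-m)
    return comb(n, m) * p**m * (1 - p)**(n - m)

# grid search over candidate values of p
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
assert abs(best - m / n) < 1e-9   # maximum is at p = m/n
```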


Comments

Clearly, our result will just be an estimate.

- but one we hope will become increasingly accurate with

more data.

Note, though, that via MLE, the probability of everything we haven’t seen so far is 0.

For modeling rare events, there will never be enough data for

this problem to go away.

There are ways to smooth over this problem (in fact, one way

is called “smoothing”!), but we won’t worry about this now.


Back to HMMs

MLE tells us to optimize P(O|λ).

We know how to compute P(O|λ) for a given model.

- E.g., use the “forward” procedure.

How do we find the best one?

Unfortunately, we don’t know how to solve this problem

analytically.

However, there is a procedure to find a better model, given an

existing one.

- So, if we have a good guess, we can make it better.

- Or, start out with fully connected HMM (of given N, M) in

which each state can emit every possible value; set all

probabilities to random non-zero values.

Will this guarantee us a best solution?

No! So this is a form of …

- Hillclimbing!


Maximizing the Probability of the Observation

Basic idea is:

1. Start with some initial model λold.

2. Compute new λnew based on λold and observation O.

3. If P(O|λnew)-P(O| λold) < threshold

(or we’ve iterated enough), stop.

4. Otherwise, λold ← λnew and go to step 2.


Increasing the Probability of the Observation

Let’s “count” the number of times, from each state, we

- start in a given state

- make a transition to each other state

- emit each different symbol.

If we knew these numbers, we could compute new probability estimates.

We can’t really “count” these.

- We don’t know for sure which path through the model was

taken.

- But we know the probability of each path, given the model.

So we can compute the expected value of each figure.


Set Up

Define ξt(i,j) as P(qt=Si, qt+1=Sj|O,λ)

- i.e., the probability that, given observation and model, we are in state Si at time t and state Sj at time t+1.

Here is how such a transition can happen:

(Figure: state Si at time t transitioning to state Sj at time t+1.)


Computing ξt(i,j)

Note that αt(i) aij bj,ot+1 βt+1(j) = P(qt=Si, qt+1=Sj, O|λ).

But we want ξt(i,j) = P(qt=Si, qt+1=Sj|O,λ)

By definition of conditional probability,

ξt(i,j) = αt(i) aij bj,ot+1 βt+1(j) / P(O|λ)

i.e., given

- P(O|λ), which we can compute by forward, but also by

∑i ∑j αt(i) aij bj,ot+1 βt+1(j)

- and A, B, which are given in the model, we can compute ξ.


One more preliminary…

Define γt(i) as probability of being in state Si at time t, given

the observation and the model.

Given our definition of ξ:

γt(i)=∑1≤j≤N ξt(i,j)

So:

- ∑1≤t≤T-1 γt(i) = expected number of transitions from Si

- ∑1≤t≤T-1 ξt(i,j) = expected number of transitions from

Si to Sj


Now We Can Reestimate Parameters

aij’ = expected no. of transitions from Si to Sj / expected no. of transitions from Si

= ∑1≤t≤T-1 ξt(i,j) / ∑1≤t≤T-1 γt(i)

πi’ = probability of being in Si at time 1 = γ1(i)

bjk’ = expected no. of times in Sj, observing vk / expected no. of times in Sj

= ∑1≤t≤T, s.t. ot=vk γt(j) / ∑1≤t≤T γt(j)
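One full reestimation step (E-step: ξ and γ from forward/backward; M-step: the formulae above) can be sketched as follows. The toy model numbers are invented for illustration.

```python
import numpy as np

def reestimate(pi, A, B, obs):
    # One Baum-Welch iteration: compute xi and gamma, then reestimate.
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    p_obs = alpha[-1].sum()                        # P(O | model)
    # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, model)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :] / p_obs
    gamma = alpha * beta / p_obs                   # gamma[t, i] = P(q_t = S_i | O, model)
    # M-step: the reestimation formulae
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array(obs) == k
        B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi2, A2, B2 = reestimate(pi, A, B, [0, 1, 2, 1, 0])
```

A useful sanity check is that the reestimated parameters are still proper distributions: each row of A’ and B’, and π’, sums to 1.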


Iterative Reestimation Formulae


Baum-Welch Algorithm

or Forward-backward algorithm

It was proven (by Baum et al.) that this reestimation procedure

leads to increased likelihood.

But remember, it only guarantees climbing to a local maximum!

It is a special case of a very general algorithm for incremental improvement by iteratively computing expected values for the hidden variables (here, the transitions and state emissions), then reestimating the parameters to maximize likelihood.

- The general procedure is called Expectation-Maximization, or the EM algorithm.


Implementation Considerations

One problem: the αt(i) get smaller and smaller as we move through the observation, and eventually underflow.

- Standard fix: normalize the α values at each time step.

- I.e., instead of computing αt(i), at each time step, compute αt(i)/∑iαt(i).

- Turns out that if you also use the ∑iαt(i)s to scale the βt(j)s, you get a numerically nice value, although it doesn’t have a nice probabilistic interpretation;

- Yet, when you use both scaled values, the rest of the ξt(i,j) computation is exactly the same.

Another: Sometimes estimated probabilities will still get very small, and it seems better not to let these fall all the way to 0.
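The scaling idea can be sketched as follows: normalize α at every step and accumulate the log of the scale factors, which recovers log P(O|λ) even for observations long enough that the unscaled α would underflow. The model numbers are invented for illustration.

```python
import numpy as np

def log_prob_scaled(pi, A, B, obs):
    # Scale alpha_t by its sum at every step; log P(O | model)
    # is the sum of the log scale factors.
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_p += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return log_p

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

# agrees with the unscaled forward probability on a short sequence
short = [0, 1, 2, 1]
alpha = pi * B[:, short[0]]
for o in short[1:]:
    alpha = (alpha @ A) * B[:, o]
assert abs(np.exp(log_prob_scaled(pi, A, B, short)) - alpha.sum()) < 1e-9

# and stays finite on a length-1000 sequence, where unscaled alpha underflows
lp = log_prob_scaled(pi, A, B, short * 250)
assert np.isfinite(lp) and lp < 0
```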


Other issues


Other types of parametric structure

Continuous density HMM (CHMM)

Semi-continuous HMM

State-duration HMM


Graphical DHMM and CHMM


Statistical Decision Making System

Statistical K-class pattern recognition

- Classes ω1, …, ωK

- Class-conditional likelihoods P(X|ωk)

- Priors P(ωk)

- Decide the class k that maximizes P(X|ωk)P(ωk)


An Application of HMMs: “Part of Speech” Tagging

A problem: Word sense disambiguation.

- I.e., words typically have multiple senses; need to determine

which one a given word is being used as.

For now, we just want to guess the “part of speech” of each

word in a sentence.

- The syntactic behavior of words is largely a function of their “grammatical class”.

- Grammatical classes are things like verb, noun, preposition,

determiner, etc.

- Words often have multiple grammatical classes.

» For example, the word “rock” can be a noun or a verb.

» Each of which can have a number of different meanings.

We want an algorithm that will “tag” each word with its most

likely part of speech.


Parsing, Briefly

(Parse tree: S → NP VP; NP → PRO “I”; VP → V “saw” NP; NP → D “a” N “bird”.)


However

Each of these words is listed in the dictionary as having

multiple POS entries:

- “saw”: noun, verb

- “bird”: noun, verb (“to bird”, as in “to birdwatch”)

- “I”: pronoun, noun (the letter “I”, the square root of –1,

something shaped like an I (I-beam), symbol (I-80, Roman

numeral, iodine))

- “a”: article, noun (the letter “a”, something shaped like an “a”,

the grade “A”), preposition (“three times a day”), French

pronoun.


Moreover, there is a parse!

(Parse tree for the silly reading: S → NP VP, where NP → N N N “I saw a” and VP → V “bird”.)


What’s It Mean?

The question is: can we avoid such nonsense cheaply?

Note that this parse corresponds to a very unlikely set of POS

tags.

So, just restricting our tags to reasonably probable ones might eliminate such silly options.


Solving the Problem

A simple approach: tag each word with its most frequent POS.

- Using “tagged corpora” to estimate the frequencies.

How well does it work?

Turns out it will be correct about 91% of the time.

Good?

Humans will agree with each other about 97-98% of the time.

So, room for improvement.


A Better Algorithm

Make POS guess depend on context.

For example, if the previous word were “the”, then the word

“rock” is much more likely to be occurring as a noun than as

a verb.

We can incorporate context by setting up an HMM in which the hidden states are POSs, and the words are the emissions in those states.
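A minimal sketch of this setup: a hypothetical three-tag model (tags as hidden states, words as emissions, all probabilities invented), decoded with Viterbi. With these numbers, “rock” after “the” comes out as a noun, while “rocks” after a noun comes out as a verb.

```python
import numpy as np

# Hypothetical model: states are POS tags, emissions are words.
tags = ["DET", "NOUN", "VERB"]
words = ["the", "rock", "rocks"]
pi = np.array([0.8, 0.1, 0.1])
A  = np.array([[0.0, 0.9, 0.1],    # DET  -> mostly NOUN
               [0.1, 0.2, 0.7],    # NOUN -> often VERB
               [0.5, 0.4, 0.1]])   # VERB -> DET or NOUN
B  = np.array([[1.0, 0.0, 0.0],    # DET emits "the"
               [0.0, 0.6, 0.4],    # NOUN emits "rock"/"rocks"
               [0.0, 0.5, 0.5]])   # VERB emits "rock"/"rocks"

def viterbi_tags(obs):
    delta = pi * B[:, obs[0]]
    psi = []
    for o in obs[1:]:
        trans = delta[:, None] * A          # (from, to)
        psi.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]
    for bp in reversed(psi):                # backtrack
        path.append(int(bp[path[-1]]))
    return [tags[i] for i in reversed(path)]

sent = [words.index(w) for w in ["the", "rock", "rocks"]]
result = viterbi_tags(sent)
```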


HMM For POS Tagging: Example


HMM For POS Tagging

A second-order HMM is equivalent to using POS trigrams.

Generally works well if there is a hand-tagged corpus from

which to read off the probabilities.

- Lots of detailed issues: smoothing, etc.

If none available, train using Baum-Welch.

Usually start with some constraints:

- E.g., start with 0 emissions for words not listed in dictionary

as having a given POS; estimate transitions.

Best variations get about 96-97% accuracy, which is

approaching human performance.


Recommended