Hidden Markov Model

Date post: 13-Mar-2016
Upload: cecilia-hess

Hidden Markov Model

CS570 Lecture Note, KAIST

This lecture note was made based on the notes of Prof. B.K.Shin(Pukyung Nat’l Univ) and Prof. Wilensky (UCB)

Sequential Data

Often highly variable, but has an embedded structure

Information is contained in the structure

More examples

• Text, on-line handwriting, music notes, DNA sequences, program code

main() { char q=34, n=10, *a="main() { char q=34, n=10, *a=%c%s%c; printf(a,q,a,q,n);}%c"; printf(a,q,a,q,n);}

Example: Speech Recognition

• Given a sequence of inputs (features of some kind, extracted by some hardware), guess the words to which the features correspond.
• Hard because the features depend on
– speaker, speed, noise, nearby features (“co-articulation” constraints), word boundaries
• “How to wreak a nice beach.”
• “How to recognize speech.”

Defining the problem

• Find argmax w∈L P(w|y), where
– y is a string of acoustic features of some form,
– w is a string of words, from some fixed vocabulary,
– L is a language (defined as all possible strings in that language).
• Given some features, what is the most probable string the speaker uttered?

Analysis

By Bayes’ rule:

P(w|y) = P(w)P(y|w)/P(y)

Since P(y) is the same for every w we might choose, the problem reduces to

argmax w∈L P(w)P(y|w)

So we need to be able to estimate two things: the probability of each possible string in our language, P(w), and the probability of its pronunciation given that string, P(y|w).
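
Written out step by step, the reduction looks as follows. This is only a restatement of the slide in LaTeX; the symbol \hat{w} for the chosen word string is an assumption introduced here for convenience.

```latex
\begin{align*}
\hat{w} &= \arg\max_{w \in L} P(w \mid y) \\
        &= \arg\max_{w \in L} \frac{P(w)\,P(y \mid w)}{P(y)} \\
        &= \arg\max_{w \in L} P(w)\,P(y \mid w)
\end{align*}
```

The last step holds because P(y) is a positive constant with respect to w, so dividing by it does not change which w attains the maximum.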

P(w) where w is an utterance

• Problem: There are a very large number of possible utterances!

• Indeed, we create new utterances all the time, so we cannot hope to have their probabilities.
• So, we will need to make some independence assumptions.
• First attempt: Assume that words are uttered independently of one another.
– Then P(w) becomes P(w1)…P(wn), where the wi are the individual words in the string.
– Easy to estimate these numbers: count the relative frequency of words in the language.

Assumptions

• However, the assumption of independence is pretty bad.
– Words don’t just follow each other randomly.
• Second attempt: Assume each word depends only on the previous word.
– E.g., “the” is more likely to be followed by “ball” than by “a”,
– despite the fact that “a” would otherwise be a very common, and hence highly probable, word.
• Of course, this is still not a great assumption, but it may be a decent approximation.

In General

• This is typical of lots of problems, in which
– we view the probability of some event as dependent on potentially many past events,
– of which there are too many actual dependencies to deal with.
• So we simplify by making the assumption that
– each event depends only on the previous event, and
– it doesn’t make any difference when these events happen in the sequence.

Speech Example

• Representation
– X = x1 x2 x3 x4 x5 … xT-1 xT = s p iy iy iy ch ch ch ch

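Analysis Methods

• Probability-based analysis?
• Method I
– Observations are independent; no time/order
– A poor model for temporal structure
• Model size = |V| = N

$$P(\text{s p iy iy iy ch ch ch ch}) = \,?$$

Under Method I the answer is a product of unigram probabilities, $P(x_1 x_2 \cdots x_T) = \prod_{t=1}^{T} P(x_t)$, so repeated symbols contribute powers such as $P(\text{iy})^3$ and $P(\text{ch})^4$.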

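Analysis methods

• Method II
– A simple model of ordered sequence
• A symbol is dependent only on the immediately preceding one:

$$P(x_t \mid x_1 x_2 x_3 \cdots x_{t-1}) = P(x_t \mid x_{t-1})$$

– |V|×|V| matrix model
• 50×50 – not very bad …
• 10^5×10^5 – doubly outrageous!!

$$P(x_1 x_2 \cdots x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}), \quad \text{e.g. } P(\text{s})\,P(\text{s} \mid \text{s}) \cdots P(\text{iy} \mid \text{p})\,P(\text{iy} \mid \text{iy})^2 \cdots P(\text{ch} \mid \text{ch})^2$$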

Another analysis method

• Method III
– What you see is a clue to what lies behind and is not known a priori
• The source that generated the observation
• The source evolves and generates characteristic observation sequences

$$q_0\; q_1\; q_2 \cdots q_T$$

$$P(\text{s}, q_1)\,P(\text{s}, q_2 \mid q_1) \cdots P(\text{ch}, q_T \mid q_{T-1}) = \prod_t P(x_t, q_t \mid q_{t-1})$$

$$\sum_Q P(\text{s}, q_1)\,P(\text{s}, q_2 \mid q_1) \cdots P(\text{ch}, q_T \mid q_{T-1}) = \sum_Q \prod_t P(x_t, q_t \mid q_{t-1})$$

More Formally

We want to know P(w1,…,wn-1,wn). To clarify, let’s write the sequence this way:

P(q1=Si, q2=Sj,…, qn-1=Sk, qn=Si)

Here the qi indicate the i-th position of the sequence, and the Si the possible different words from our vocabulary.

E.g., if the string were “The girl saw the boy”, we might have

S1 = the    q1 = S1
S2 = girl   q2 = S2
S3 = saw    q3 = S3
S4 = boy    q4 = S1
            q5 = S4

Formalization (continued)

• We want P(q1=Si, q2=Sj,…, qn-1=Sk, qn=Si)
• Let’s break this down as we usually break down a joint:
= P(qn=Si | q1=Si,…,qn-1=Sk) × P(q1=Si,…,qn-1=Sk)
…
= P(qn=Si | q1=Si,…,qn-1=Sk) × P(qn-1=Sk | q1=Si,…,qn-2=Sm) × … × P(q2=Sj | q1=Si) × P(q1=Si)
• Our simplifying assumption is that each event is only dependent on the previous event, and that we don’t care when the events happen, i.e.,
P(qi=Si | q1=Sj,…,qi-1=Sk) = P(qi=Si | qi-1=Sk)
and
P(qi=Si | qi-1=Sk) = P(qj=Si | qj-1=Sk)
• This is called the Markov assumption.

Markov Assumption

• “The future does not depend on the past, given the present.”
• Sometimes this is called the first-order Markov assumption.
• A second-order assumption would mean that each event depends on the previous two events.
– This isn’t really a crucial distinction.
– What’s crucial is that there is some limit on how far we are willing to look back.

Markov Models

• The Markov assumption means that there is only one probability to remember for each transition from one event type (e.g., word) to another event type.
• Plus the probabilities of starting with a particular event.
• This lets us define a Markov model as:
– a finite state automaton in which
• the states represent possible event types (e.g., the different words in our example)
• the transitions represent the probability of one event type following another.
• It is easy to depict a Markov model as a graph.

Example: A Markov Model for a Tiny Fragment of English

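Numbers on arrows between nodes are “transition” probabilities, e.g., P(qi=girl|qi-1=the)=.8. The numbers on the initial arrows show the probability of starting in the given state. Missing probabilities are assumed to be 0.

[Figure: Markov model with states “the”, “a”, “little” and “girl”. Initial arrows: the .7, a .3. Transitions: the→girl .8, the→little .2, a→girl .78, a→little .22, little→little .1, little→girl .9.]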

Example: A Markov Model for a Tiny Fragment of English

Generates/recognizes a tiny (but infinite!) language, along with probabilities:

P(“The little girl”) = .7×.2×.9 = .126
P(“A little little girl”) = .3×.22×.1×.9 = .00594

[Figure: the same Markov model as on the previous slide.]

Example (cont’d)

P(“The little girl”) is really shorthand for P(q1=the, q2=little, q3=girl), where the, little and girl are states.

We can easily answer other questions, e.g.: “Given that the sentence begins with “a”, what is the probability that the next words were “little girl”?”

P(q3=girl, q2=little | q1=a)
= P(q3=girl | q2=little, q1=a) P(q2=little | q1=a)
= P(q3=girl | q2=little) P(q2=little | q1=a)
= .9×.22 = .198
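
As a rough sketch of how these numbers come out of the model, the graph can be written as two lookup tables and walked arc by arc. The dictionary layout and the function names below are illustrative assumptions, not part of the lecture note; the probabilities are the ones read off the graph.

```python
# Initial and transition probabilities of the tiny word model.
# Arcs that are not listed are treated as probability 0.
initial = {"the": 0.7, "a": 0.3}
transition = {
    "the": {"girl": 0.8, "little": 0.2},
    "a": {"girl": 0.78, "little": 0.22},
    "little": {"little": 0.1, "girl": 0.9},
}


def sequence_probability(words):
    """P(q1, ..., qn): initial probability times one transition per step."""
    prob = initial.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= transition.get(prev, {}).get(cur, 0.0)
    return prob


def continuation_probability(first, rest):
    """P(rest | q1 = first): transitions only, no initial term."""
    prob = 1.0
    prev = first
    for cur in rest:
        prob *= transition.get(prev, {}).get(cur, 0.0)
        prev = cur
    return prob


p_the_little_girl = sequence_probability(["the", "little", "girl"])
p_a_little_little_girl = sequence_probability(["a", "little", "little", "girl"])
p_little_girl_given_a = continuation_probability("a", ["little", "girl"])
print(p_the_little_girl, p_a_little_little_girl, p_little_girl_given_a)
```

The conditional question drops the initial probability because the first word is given, which is why only the two transition terms .22 and .9 appear.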

Markov Models and Graphical Models

Markov models and Belief Networks can both be represented by nice graphs.

Do the graphs mean the same thing? No! In the graphs for Markov models, nodes do not represent random variables with CPTs (conditional probability tables).

Suppose we wanted to encode the same information via a belief network. We would have to “unroll” it into a sequence of nodes (as many as there are elements in the sequence), each dependent on the previous, each with the same CPT.

This redrawing is valid, and sometimes useful, but doesn’t explicitly represent useful facts, such as that the CPTs are the same everywhere.

Back to the Speech Recognition Problem

• A Markov model for all of English would have one node for each word, which would be connected to the node for each word that can follow it.

• Without Loss Of Generality, we could have connections from every node to every node, some of which have transition probability 0.

• Such a model is sometimes called a bigram model.
– This is equivalent to knowing the probability distribution of pairs of words in sequences (and the probability distribution for individual words).
– A bigram model is an example of a language model, i.e., some (in this case, extremely simple) view of what sentences or sequences are likely to be seen.
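
A bigram model of this kind can be estimated by counting, in the same spirit as the relative-frequency estimate for single words. The sketch below assumes a toy two-sentence corpus invented for illustration and uses plain maximum-likelihood counts with no smoothing; `bigram_model` is a hypothetical helper name.

```python
from collections import Counter


def bigram_model(sentences):
    """Estimate P(next | previous) as count(previous, next) / count(previous)."""
    previous_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Every word except the last one acts as a "previous" word once.
        previous_counts.update(words[:-1])
        pair_counts.update(zip(words, words[1:]))
    return {
        (prev, nxt): count / previous_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }


# Toy corpus, made up for this example only.
corpus = ["the girl saw the boy", "the little girl saw a boy"]
model = bigram_model(corpus)

for (prev, nxt), prob in sorted(model.items()):
    print(f"P({nxt} | {prev}) = {prob:.3f}")
```

Pairs that never occur in the corpus are simply absent from the table, which corresponds to the transition probability 0 mentioned above.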

Bigram Model

• Bigram models are rather inaccurate language models.
– E.g., the word after “a” is much more likely to be “missile” if the word preceding “a” is “launch”.
• The Markov assumption is pretty bad.
• If we could condition on a few previous words, life gets a bit better:
– E.g., we could predict that “missile” is more likely to follow “launch a” than “saw a”.
• This would require a “second order” Markov model.

Higher-Order Models

• In the case of words, this is equivalent to going to trigrams.
• Fundamentally, this isn’t a big difference:
– We can convert a second order model into a first order model, but with a lot more states.
– And we would need much more data!
• Note, though, that a second-order model still couldn’t accurately predict what follows “launch a large”
– i.e., we are predicting the next word based on only the two previous words, so the useful information before “a large” is lost.
• Nevertheless, such language models are very useful approximations.

Back to Our Spoken Sentence Recognition Problem

We are trying to find argmax w∈L P(w)P(y|w)

We just discussed estimating P(w).

Now let’s look at P(y|w).
- That is, how do we pronounce a sequence of words?
- We can make the simplification that how we pronounce each word is independent of the others.

P(y|w) = Σ P(o1=vi, o2=vj, …, ok=vl | w1) × … × P(ox-m=vp, ox-m+1=vq, …, ox=vr | wn)

i.e., each word produces some of the sounds with some probability; we have to sum over possible different word boundaries.

So, what we need is a model of how we pronounce individual words.

A Model

Assume there are some underlying states, called “phones”, say, that get pronounced in slightly different ways. We can represent this idea by complicating the Markov model: - Let’s add probabilistic emissions of outputs from each state.

Example: A (Simplistic) Model for Pronouncing “of”

Each state can emit a different sound, with some probability. Variant: Have the emissions on the transitions, rather than the states.

[Figure: HMM with states Phone1, Phone2 and End. Arc labels: .9 and .1. Emission labels: “o” .7, “a” .3, “v” 1.]

How the Model Works

We see outputs, e.g., “o v”. We can’t “see” the actual state transitions. But we can infer possible underlying transitions from the observations, and then assign a probability to them.

E.g., from “o v”, we infer the transition “phone1 phone2”
- with probability .7 × .9 = .63.

I.e., the probability that the word “of” would be pronounced as “o v” is 63%.

Hidden Markov Models

This is a “hidden Markov model”, or HMM. Like (fully observable) Markov models, transitions from one state to another are independent of everything else. Also, the emission of an output from a state depends only on that state, i.e.:

P(O|Q) = P(o1,o2,…,on|q1,…,qn) = P(o1|q1)×P(o2|q2)×…×P(on|qn)

HMMs Assign Probabilities to Sequences

We want to know how probable a sequence of observations is given an HMM. This is slightly complicated because there might be multiple ways to produce the observed output. So, we have to consider all possible ways an output might be produced, i.e., for a given HMM:

P(O) = ∑Q P(O|Q)P(Q)

where O is a given output sequence, and Q ranges over all possible sequences of states in the model. P(Q) is computed as for (visible) Markov models.

P(O|Q) = P(o1,o2,…,on|q1,…,qn) = P(o1|q1)×P(o2|q2)×…×P(on|qn)

We’ll look at computing this efficiently in a while…

Finishing Solving the Speech Problem

To find argmax w∈L P(w)P(y|w), just consider these probabilities over all possible strings of words.

Could “splice in” each word in the language model with its HMM pronunciation model to get one big HMM.
- Lets us incorporate more dependencies.
- E.g., could have two models of “of”, one of which has a much higher probability of transitioning to words beginning with consonants.

Real speech systems have another level in which phonemes are broken up into acoustic vectors.
- But these are also HMMs.
- So we can make one gigantic HMM out of the whole thing.

So, given y, all we need is to find the most probable path through the model that generates it.

Example: Our Word Markov Model

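[Figure: the word Markov model from the earlier example, with states “the”, “a”, “little” and “girl” and the same initial and transition probabilities.]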

Example: Splicing in Pronunciation HMMs

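[Figure: the word model with each word replaced by its pronunciation HMM. States 1–3 belong to “the”, state 7 to “a”, states 8–10 to “little” and states 4–6 to “girl”; each state emits symbols from V1–V12; the word-level probabilities .7, .3, .8, .2, .78, .22, .1 and .9 label the arcs between the word models.]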

Example: Best Sequence

[Figure: the spliced HMM from the previous slide.]

Suppose the observation is “v1 v3 v4 v9 v8 v11 v7 v8 v10”.

Suppose the most probable sequence is determined to be “1,2,3,8,9,10,4,5,6” (happens to be the only way in this example).

Then the interpretation is “the little girl”.

Hidden Markov Models

• Modeling sequences of events
• Might want to
– Determine the probability of a given sequence
– Determine the probability of a model producing a sequence in a particular way
• equivalent to recognizing or interpreting that sequence
– Learn a model from some observations.

Why HMM?

• Because the HMM is a very good model for such patterns!
– highly variable spatiotemporal data sequence
– often unclear, uncertain, and incomplete
• Because it is very successful in many applications!
• Because it is quite easy to use!
– Tools already exist…

The problem

• “What you see is the truth”
– Not quite a valid assumption
– There are often errors or noise
• Noisy sound, sloppy handwriting, ungrammatical or Konglish sentences
– There may be some true underlying process
• Underlying hidden sequence
• Obscured by the incomplete observation

The Auxiliary Variable

• N is also conjectured
• {qt : t ≥ 0} is conjectured, not visible
– nor is $Q = q_1 q_2 \cdots q_T$
– is Markovian
– “Markov chain”

$$q_t \in S = \{1, \ldots, N\}$$

$$P(q_1 q_2 \cdots q_T) = P(q_1)\,P(q_2 \mid q_1) \cdots P(q_T \mid q_{T-1})$$

Summary of the Concept

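$$P(X) = \sum_Q P(X, Q) = \sum_Q P(Q)\,P(X \mid Q)$$

$$= \sum_Q P(q_1 q_2 \cdots q_T)\,P(x_1 x_2 \cdots x_T \mid q_1 q_2 \cdots q_T)$$

$$= \sum_Q \prod_{t=1}^{T} P(q_t \mid q_{t-1}) \prod_{t=1}^{T} p(x_t \mid q_t)$$

The first product is the Markov chain process; the second is the output process.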

Hidden Markov Model

• is a doubly stochastic process
– stochastic chain process: { q(t) }
– output process: { f(x|q) }
• is also called
– Hidden Markov chain
– Probabilistic function of Markov chain

HMM Characterization λ = (A, B, π)

– A : state transition probability { aij | aij = p(qt+1=j|qt=i) }
– B : symbol output/observation probability { bj(v) | bj(v) = p(x=v|qt=j) }
– π : initial state distribution probability { πi | πi = p(q1=i) }

$$P(X \mid \lambda) = \sum_Q P(Q \mid \lambda)\,P(X \mid Q, \lambda) = \sum_Q \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}\, b_{q_1}(x_1)\, b_{q_2}(x_2) \cdots b_{q_T}(x_T)$$

HMM, Formally

• A set of states {S1,…,SN}
– qt denotes the state at time t.
• A transition probability matrix A, such that A[i,j]=aij=P(qt+1=Sj|qt=Si)
– This is an N x N matrix.
• A set of symbols, {v1,…,vM}
– For all purposes, these might as well just be {1,…,M}
– ot denotes the observation at time t.
• An observation symbol probability distribution matrix B, such that B[i,j]=bi,j=P(ot=vj|qt=Si)
– This is an N x M matrix.
• An initial state distribution, π, such that πi=P(q1=Si)
• For convenience, call the entire model λ = (A,B,π)
– Note that N and M are important implicit parameters.

Graphical Example

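π = [ 1.0 0 0 0 ]

A =
        1    2    3    4
   1   0.6  0.4  0.0  0.0
   2   0.0  0.5  0.5  0.0
   3   0.0  0.0  0.7  0.3
   4   0.0  0.0  0.0  1.0

B =
        ch   iy   p    s    …
   1   0.2  0.2  0.0  0.6   …
   2   0.0  0.2  0.5  0.3   …
   3   0.0  0.8  0.1  0.1   …
   4   0.6  0.0  0.2  0.2   …

[Figure: left-to-right state diagram 1 → 2 → 3 → 4 for the matrices above, with the output symbols s, p, iy and ch shown under the states.]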

Data interpretation

P(s s p p iy iy iy ch ch ch | λ) = ∑Q P(s s p p iy iy iy ch ch ch, Q | λ) = ∑Q P(Q|λ) p(s s p p iy iy iy ch ch ch | Q, λ)

Let Q = 1 1 2 2 3 3 3 4 4 4. Then

P(Q|λ) p(s s p p iy iy iy ch ch ch | Q, λ)
= P(1 1 2 2 3 3 3 4 4 4 | λ) p(s s p p iy iy iy ch ch ch | 1 1 2 2 3 3 3 4 4 4, λ)
= P(1|λ)P(s|1,λ) P(1|1,λ)P(s|1,λ) P(2|1,λ)P(p|2,λ) P(2|2,λ)P(p|2,λ) …
= (1×.6)×(.6×.6)×(.4×.5)×(.5×.5)×(.5×.8)×(.7×.8)^2×(.3×.6)×(1.×.6)^2
≈ 0.0000878

using the A and B matrices of the previous slide.

#multiplications ~ 2T·N^T
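
The single-path term above can be checked mechanically. The following is a minimal sketch, assuming the A, B and π values of the graphical example and using only the four symbol columns printed there; the names `SYMBOLS` and `path_probability` are made up for illustration.

```python
# Parameters copied from the graphical example; only the four symbol
# columns shown there (ch, iy, p, s) are used.
SYMBOLS = {"ch": 0, "iy": 1, "p": 2, "s": 3}
pi = [1.0, 0.0, 0.0, 0.0]
A = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
B = [
    [0.2, 0.2, 0.0, 0.6],
    [0.0, 0.2, 0.5, 0.3],
    [0.0, 0.8, 0.1, 0.1],
    [0.6, 0.0, 0.2, 0.2],
]


def path_probability(states, observations):
    """P(Q | lambda) * p(X | Q, lambda) for one state path Q (1-based states)."""
    q = [s - 1 for s in states]
    x = [SYMBOLS[o] for o in observations]
    prob = pi[q[0]] * B[q[0]][x[0]]
    for t in range(1, len(q)):
        # One transition term and one emission term per time step.
        prob *= A[q[t - 1]][q[t]] * B[q[t]][x[t]]
    return prob


Q = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
X = "s s p p iy iy iy ch ch ch".split()
joint = path_probability(Q, X)
print(f"{joint:.7f}")
```

This is one term of the sum over Q; the full likelihood needs every path, which is what makes the naive count of multiplications grow like N^T.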

Issues in HMM

• Intuitive decisions
1. number of states (N)
2. topology (state inter-connection)
3. number of observation symbols (V)
• Difficult problems
4. efficient computation methods
5. probability parameters (λ)

The Number of States

• How many states?
– Model size
– Model topology/structure
• Factors
– Pattern complexity/length and variability
– The number of samples
• Ex: r r g b b g b b b r

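(1) The simplest model

• Model I
– N = 1
– a11 = 1.0
– B = [1/3, 1/6, 1/2]

$$P(\text{r r g b b g b b b r} \mid \lambda_1) = \left(1 \times \tfrac{1}{3}\right)\left(1 \times \tfrac{1}{3}\right)\left(1 \times \tfrac{1}{6}\right)\left(1 \times \tfrac{1}{2}\right)\left(1 \times \tfrac{1}{2}\right)\left(1 \times \tfrac{1}{6}\right)\left(1 \times \tfrac{1}{2}\right)\left(1 \times \tfrac{1}{2}\right)\left(1 \times \tfrac{1}{2}\right)\left(1 \times \tfrac{1}{3}\right)$$

[Figure: a single state with a self-loop of probability 1.0.]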
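
For the one-state model the product can be evaluated exactly. The sketch below assumes that the entries of B are ordered r, g, b, which is the ordering implied by the factors in the product; the variable names are illustrative only.

```python
from fractions import Fraction

# Emission probabilities of the single state (assumed order: r, g, b).
emission = {"r": Fraction(1, 3), "g": Fraction(1, 6), "b": Fraction(1, 2)}
self_loop = Fraction(1)  # a11 = 1.0, the only transition in the model

observation = "r r g b b g b b b r".split()

likelihood = Fraction(1)
for symbol in observation:
    # Each step contributes one transition term and one emission term.
    likelihood *= self_loop * emission[symbol]

print(likelihood, float(likelihood))
```

With a single state there is only one possible path, so no summation over state sequences is needed.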

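(2) Two state model

• Model II:
– N = 2

A =
   0.6  0.4
   0.6  0.4

B =
   1/2  1/3  1/6
   1/6  1/6  2/3

[Figure: two states, 1 and 2, with arcs labeled 0.6 (into state 1) and 0.4 (into state 2).]

$$P(\text{r r g b b g b b b r} \mid \lambda) = \left(.5 \times \tfrac{1}{2}\right)\left(.6 \times \tfrac{1}{2}\right)\left(.6 \times \tfrac{1}{3}\right)\left(.4 \times \tfrac{4}{6}\right)\left(.4 \times \tfrac{4}{6}\right)\left(.6 \times \tfrac{1}{3}\right)\left(.4 \times \tfrac{4}{6}\right)\left(.4 \times \tfrac{4}{6}\right)\left(.4 \times \tfrac{4}{6}\right)\left(.6 \times \tfrac{1}{2}\right)\ ?$$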

(3) Three state models

• N = 3:

[Figure: two three-state models (states 1, 2, 3) with different topologies and transition probabilities.]

The Criterion is

• Obtaining the best model ($\hat{\lambda}$) that maximizes $P(X \mid \hat{\lambda})$
• The best topology comes from insight and experience, guided by the # of classes/symbols/samples

A trained HMM

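π = [ 1. 0. 0. ]

A =
        1    2    3
   1   .5   .4   .1
   2   .0   .6   .4
   3   .0   .0   .0

B =
        R    G    B
   1   .6   .2   .2
   2   .2   .5   .3
   3   .0   .3   .7

[Figure: state diagram of the same model, with self-loops .5 on state 1 and .6 on state 2, arcs 1→2 .4, 1→3 .1 and 2→3 .4, and the R/G/B output probabilities listed at each state.]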

Three Problems

1. Model evaluation problem
– What is the probability of the observation?
– Given an observed sequence and an HMM, how probable is that sequence?
– Forward algorithm
2. Path decoding problem
– What is the best state sequence for the observation?
– Given an observed sequence and an HMM, what is the most likely state sequence that generated it?
– Viterbi algorithm
3. Model training problem
– How to estimate the model parameters?
– Given an observation, can we learn an HMM for it?
– Baum-Welch reestimation algorithm

Problem: How to Compute P(O|λ)

P(O|λ) = ∑all Q P(O|Q,λ)P(Q|λ), where O is an observation sequence and Q is a sequence of states.

$$P(O \mid \lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1, o_1}\, a_{q_1, q_2} b_{q_2, o_2} \cdots a_{q_{T-1}, q_T} b_{q_T, o_T}$$

So, we could just enumerate all possible sequences through the model, and sum up their probabilities.

Naively, this would involve O(T N^T) operations:
- At every step, there are N possible transitions, and there are T steps.

However, we can take advantage of the fact that, at any given time, we can only be in one of N states, so we only have to keep track of paths to each state.
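
To make the cost of the naive approach concrete, here is a small sketch that literally enumerates all N^T state sequences. It assumes the three-state model from the “A trained HMM” slide and the observation RRGB; `brute_force_likelihood` and `SYMBOLS` are hypothetical names.

```python
from itertools import product

# Parameters of the three-state model from the "A trained HMM" slide.
pi = [1.0, 0.0, 0.0]
A = [
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0],
]
B = [
    [0.6, 0.2, 0.2],
    [0.2, 0.5, 0.3],
    [0.0, 0.3, 0.7],
]
SYMBOLS = {"R": 0, "G": 1, "B": 2}


def brute_force_likelihood(observation):
    """Sum P(O | Q, lambda) P(Q | lambda) over every possible state path Q."""
    obs = [SYMBOLS[s] for s in observation]
    n_states = len(pi)
    total = 0.0
    n_paths = 0
    for path in product(range(n_states), repeat=len(obs)):
        prob = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += prob
        n_paths += 1
    return total, n_paths


likelihood, n_paths = brute_force_likelihood("RRGB")
print(likelihood, n_paths)
```

Even for this tiny case 3^4 = 81 paths are visited, most of them with probability 0; the count grows exponentially with the length of the observation.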

Basic Idea for P(O|λ) Algorithm

If we know the probability of being in each state at time t, and producing the observation so far (αt(i)), then the probability of being in a given state at the next clock tick, and then emitting the next output, is easy to compute.

αt+1(j) ← (∑i αt(i)aij) bj,ot+1

[Figure: trellis column of states 1…N at time t, each with αt(i), connected by arcs a1,4, a2,4, …, aN-1,4, aN,4 to state 4 at time t+1.]

A Better Algorithm For Computing P(O|λ)

Let αt(i) be defined as P(o1o2…ot, qt=Si|λ).
- I.e., the probability of seeing a prefix of the observation, and ending in a particular state.

Algorithm:
- Initialization: α1(i) ← πi bi,o1, 1 ≤ i ≤ N
- Induction: αt+1(j) ← (∑1≤i≤N αt(i)aij) bj,ot+1, 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
- Finish: Return ∑1≤i≤N αT(i)

This is called the forward procedure.

How efficient is it?
- At each time step, do O(N) multiplications and additions for each of N nodes, or O(N^2 T) in total.
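
The three steps of the forward procedure translate almost line for line into code. The sketch below is one possible rendering, assuming the three-state model from the “A trained HMM” slide and 0-based state indices; the function name `forward` and the list-of-lists layout are choices made for this example.

```python
# Parameters of the three-state model from the "A trained HMM" slide.
pi = [1.0, 0.0, 0.0]
A = [
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0],
]
B = [
    [0.6, 0.2, 0.2],
    [0.2, 0.5, 0.3],
    [0.0, 0.3, 0.7],
]
SYMBOLS = {"R": 0, "G": 1, "B": 2}


def forward(pi, A, B, obs):
    """Return P(O | lambda) and the table alpha[t][i] (0-based t and i)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append(
            [sum(prev[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
        )
    # Finish: sum over the last column.
    return sum(alpha[-1]), alpha


obs = [SYMBOLS[s] for s in "RRGB"]
likelihood, alpha = forward(pi, A, B, obs)
for t, column in enumerate(alpha, start=1):
    print(t, [round(a, 6) for a in column])
print(likelihood)
```

Each time step does N sums of N products, which is where the N^2 T operation count comes from.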

Can We Do it Backwards?

βt(i) is the probability that, given we are in state i at time t, we produce ot+1…oT.

If we know the probability of outputting the tail of the observation, given that we are in some state, we can compute the probability that, given we are in some state at the previous clock tick, we emit the (one symbol bigger) tail.

βt(i) = ∑1≤j≤N aij bj,ot+1 βt+1(j)

[Figure: trellis with state 4 at time t connected by arcs a4,1, a4,2, …, a4,N-1, a4,N to states 1…N at time t+1, each with βt+1(j).]

Going Backwards Instead

Define βt(i) as P(ot+1ot+2…oT|qt=Si,λ).
- I.e., the probability of seeing a tail of the observation, given that we were in a particular state.

Algorithm:
- Initialization: βT(i) ← 1, 1 ≤ i ≤ N
- Induction: βt(i) ← ∑1≤j≤N aij bj,ot+1 βt+1(j), T-1 ≥ t ≥ 1, 1 ≤ i ≤ N

This is called the backward procedure.

We could use this to compute P(O|λ) too (β1(i) is P(o2o3…oT|q1=Si,λ), so P(O|λ) = ∑1≤i≤N πi bi,o1 β1(i)).
- But nobody does this.
- Instead, we will have another use for it soon.

How efficient is it?
- Same as forward procedure.
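
The backward procedure can be sketched the same way. The example below again assumes the three-state model from the “A trained HMM” slide with 0-based indices, and the name `backward` is hypothetical; it also evaluates the sum over πi bi,o1 β1(i) to show that it gives the same likelihood as the forward pass.

```python
# Parameters of the three-state model from the "A trained HMM" slide.
pi = [1.0, 0.0, 0.0]
A = [
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0],
]
B = [
    [0.6, 0.2, 0.2],
    [0.2, 0.5, 0.3],
    [0.0, 0.3, 0.7],
]
SYMBOLS = {"R": 0, "G": 1, "B": 2}


def backward(A, B, obs):
    """Return the table beta[t][i] (0-based t and i)."""
    n = len(A)
    T = len(obs)
    # Initialization: beta_T(i) = 1
    beta = [[1.0] * n for _ in range(T)]
    # Induction, moving from the end of the observation to the start:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            )
    return beta


obs = [SYMBOLS[s] for s in "RRGB"]
beta = backward(A, B, obs)
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
print(beta[0], likelihood)
```

The value agrees with the forward result for the same model and observation, as it must, since both procedures compute P(O|λ).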

1. Model Evaluation

• Solution: forward/backward procedure
– Define: forward probability -> FW procedure

$$\alpha_t(j) = P(x_1 \cdots x_t,\ q_t = j \mid \lambda)$$

– Define: backward probability -> BW procedure

$$\beta_t(i) = P(x_{t+1} \cdots x_T \mid q_t = i,\ \lambda)$$

• These are probabilities of the partial events leading to/from a point in space-time

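Forward procedure

• Initialization:

$$\alpha_1(i) = \pi_i\, b_i(x_1), \quad 1 \le i \le N$$

• Recursion:

$$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(x_{t+1}), \quad 1 \le j \le N,\ t = 1, 2, \ldots, T-1$$

• Termination:

$$P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

[Figure: trellis of HMM states 1…N over times 1, …, t, t+1, with every state i at time t feeding state j at time t+1.]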

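Numerical example: P(RRGB|λ)

π = [1 0 0]^T, with A and B as in the trained HMM above.

Forward probabilities αt(i), one column per observed symbol:

   state   R      R      G        B
   1       .6     .18    .018     .0018
   2       .0     .048   .0504    .01123
   3       .0     .0     .01116   .01537

Each entry is (∑i αt-1(i) aij) × bj(xt); e.g. α2(1) = .6×.5×.6 = .18 and α2(2) = .6×.4×.2 = .048.

P(RRGB|λ) = .0018 + .01123 + .01537 ≈ .0284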

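Backward procedure

• Initialization:

$$\beta_T(i) = 1, \quad 1 \le i \le N$$

• Recursion:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ t = T-1, T-2, \ldots, 1$$

[Figure: trellis of HMM states 1…N over times t, t+1, …, T, with state i at time t feeding every state j at time t+1.]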

2nd Problem: What’s the Most Likely State Sequence?

Actually, there is more than one possible interpretation of “most likely state sequence”.

One is, which states are individually most likely.
- i.e., what is the most likely first state? The most likely second? And so on.
- Note, though, that we can end up with a “sequence” that isn’t even a possible sequence through the HMM, much less a likely one.

Another criterion is the single best state sequence.
- i.e., find argmaxQ P(Q|O,λ)

Algorithm Idea

Suppose we knew the highest probability path ending in each state at time step t. We can compute the highest probability path ending in a state at t+1 by considering each transition from each state at time t, and remembering only the best one. This is another application of the dynamic programming principle.

Basic Idea for argmaxQP(Q|O,λ) Algorithm

If we know the probability of the best path to each state at time t producing the observation so far (δt(i)), then the probability of the best path to a given state producing the next observation at the next clock tick is the max of this probability times the transition probability, times the probability of the right emission.

δt+1(j) ← (maxi δt(i)aij) bj,ot+1

[Figure: trellis column of states 1…N at time t, each with δt(i), connected by arcs a1,4, a2,4, …, aN-1,4, aN,4 to state 4 at time t+1.]

Definitions for Computing argmaxQ P(Q|O,λ)

Note that P(Q|O,λ) = P(Q,O|λ)/P(O|λ), so maximizing P(Q|O,λ) is equivalent to maximizing P(Q,O|λ).
- Turns out the latter is a bit easier to work with.

Define δt(i) as maxq1,q2,…,qt-1 P(q1,q2,…,qt=i, o1,o2,…,ot|λ).

We’ll use these to inductively find the maximum of P(Q,O|λ).

But how do we find the actual path? We’ll maintain another array, ψt(i), which keeps track of the argument that maximizes δt(i).

Page 66: Hidden Markov Model


Algorithm for Computing argmaxQ P(Q|O,λ)

Algorithm:
- Initialization: δ1(i) ← πi bi,o1, 1 ≤ i ≤ N
  ψ1(i) ← 0
- Induction: δt(j) ← maxi (δt-1(i)aij) bj,ot
  ψt(j) ← argmaxi (δt-1(i)aij), 2 ≤ t ≤ T, 1 ≤ j ≤ N
- Finish: P* = maxi (δT(i)) is the probability of the best path
  qT* = argmaxi (δT(i)) is the best final state
- Extract path by backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, …, 1
This is called the Viterbi algorithm. Note that it is very similar to the forward procedure (and hence as efficient).
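The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `viterbi`, the 0-based state and symbol indices, and the two-state toy model at the bottom are all assumptions made for the example.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, best_prob) for a list of observation indices.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability that state j emits symbol k
    States and symbols are 0-based here (the slides count from 1).
    """
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]

    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]

    # Induction: keep only the best predecessor of each state
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]

    # Finish: best final state and its probability
    last = max(range(N), key=lambda i: delta[T - 1][i])
    best_prob = delta[T - 1][last]

    # Backtracking through the stored arguments
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, best_prob


# Hypothetical two-state, three-symbol model used only to exercise the code.
toy_pi = [0.6, 0.4]
toy_A = [[0.7, 0.3], [0.4, 0.6]]
toy_B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, prob = viterbi(toy_pi, toy_A, toy_B, [0, 1, 2])
```

For long observation sequences the products underflow, so a practical version would usually add log probabilities instead of multiplying probabilities.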

Page 67: Hidden Markov Model


2. Decoding Problem
• The best path Q* given an input X?
– It can be obtained by maximizing the joint probability over state sequences
– Path likelihood score:

$$\delta_t(j) = \max_{q_1 q_2 \cdots q_{t-1}} P[q_1 q_2 \cdots q_{t-1},\, q_t = j,\, x_1 x_2 \cdots x_t \mid \lambda] = \max_i \delta_{t-1}(i)\, a_{ij}\, b_j(x_t)$$

– Viterbi algorithm

$$\psi_t(j) = \hat{i} = \arg\max_i \delta_{t-1}(i)\, a_{ij}\, b_j(x_t)$$

Q1,t = q1q2 ··· qt : a partial (best) state sequence

Page 68: Hidden Markov Model


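Viterbi algorithm
• Initialization:

$$\delta_1(i) = \pi_i\, b_i(x_1), \qquad \psi_1(i) = 0$$

• Recursion:

$$\delta_{t+1}(j) = \max_{1 \le i \le N} \delta_t(i)\, a_{ij}\, b_j(x_{t+1}), \qquad \psi_{t+1}(j) = \arg\max_{1 \le i \le N} \delta_t(i)\, a_{ij}$$

• Termination:

$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$

• Path backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, \ldots, 1$$

[Figure: trellis over states 1, 2, 3.]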

Page 69: Hidden Markov Model


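Numerical Example: P(RRGB, Q*|λ)

π = [1 0 0]T, observation sequence R R G B

Emission probabilities (rows: states 1 to 3):
state   R    G    B
1       .6   .2   .2
2       .2   .5   .3
3       .0   .3   .7

Transition probabilities shown in the figure: a11 = .5, a12 = .4, a13 = .1, a22 = .6, a23 = .4

Trellis values δt(j); each is the best predecessor value times aij × bj(ot), e.g. δ2(1) = .6 × (.5×.6) = .18:
state   R     R      G        B
1       .6    .18    .018     .0018
2       .0    .048   .036     .00648
3       .0    .0     .00576   .01008

The largest final value is δ4(3) = .01008.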
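The trellis values in this example can be cross-checked without the Viterbi recursion at all, by scoring every one of the 3^4 = 81 state sequences and keeping the best. The sketch below does that. The matrices are read off the numbers on the slide; a33 = 1.0 is not printed there and is an assumption (it makes the third row sum to 1). The 0-based indices and the names `joint`, `best_path`, and `best_prob` are illustrative.

```python
from itertools import product

# Model read off the slide (states 0..2 stand for 1..3; symbols R=0, G=1, B=2).
pi = [1.0, 0.0, 0.0]
A = [[0.5, 0.4, 0.1],
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]]   # a33 = 1.0 is assumed, not shown on the slide
B = [[0.6, 0.2, 0.2],
     [0.2, 0.5, 0.3],
     [0.0, 0.3, 0.7]]
obs = [0, 0, 1, 2]      # R R G B


def joint(path):
    """P(Q, O | lambda) for one complete state sequence Q."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


# Exhaustive search over all state sequences of length 4.
best_path = max(product(range(3), repeat=len(obs)), key=joint)
best_prob = joint(best_path)
print([s + 1 for s in best_path], best_prob)
```

The search should recover the state sequence 1, 1, 2, 3 with probability of about .01008, the largest value in the last trellis column. Exhaustive search costs N^T evaluations, which is why the dynamic-programming recursion matters for anything larger.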

Page 70: Hidden Markov Model


3. Model Training Problem
• Estimate λ = (A, B, π) that maximizes P(X|λ)
• No analytical solution exists
• MLE + EM algorithm developed
– Baum-Welch reestimation [Baum+68,70]
– a local maximization using iterative procedure
– maximizes the probability estimate of observed events
– guarantees finite improvement
– is based on forward-backward procedure

Page 71: Hidden Markov Model


MLE Example
• Experiment
– Known: 3 balls inside (some white, some red; exact numbers unknown)
– Unknown: R = # red balls
– Observation: one random sample of two balls: (two reds)
• Two models
– p(two reds | R=2) = 2C2 × 1C0 / 3C2 = 1/3
– p(two reds | R=3) = 3C2 / 3C2 = 1
• Which model?
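The two likelihoods in this example can be checked numerically. The sketch below evaluates the same counting formula for every possible number of red balls and picks the value of R that makes the observed sample most probable. The function name `likelihood` and its default arguments are illustrative choices.

```python
from math import comb


def likelihood(R, total=3, drawn=2, reds_seen=2):
    """P(seeing `reds_seen` reds among `drawn` balls | R of `total` balls are red)."""
    whites = total - R
    return comb(R, reds_seen) * comb(whites, drawn - reds_seen) / comb(total, drawn)


# Likelihood of the observation "two reds" under each candidate model R = 0..3.
likelihoods = {R: likelihood(R) for R in range(4)}
best_R = max(likelihoods, key=likelihoods.get)
```

Under these assumptions R = 3 gives likelihood 1 and R = 2 gives 1/3, so the maximum likelihood answer to "Which model?" is R = 3.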

Page 72: Hidden Markov Model


Learning an HMM

We will assume we have an observation O, and want to know the “best” HMM for it.
I.e., we would want to find argmaxλ P(λ|O), where λ = (A,B,π) is some HMM.
- I.e., what model is most likely, given the observation?
When we have a fixed observation that we use to pick a model, we call the observation training data.
P(O|λ), viewed as a function of the model λ for the fixed observation O, is called the likelihood.
What we want is the maximum likelihood estimate (MLE) of the parameters of our model, given some training data.
This is an estimate of the parameters, because it depends for its accuracy on how representative the observation is.

Page 73: Hidden Markov Model


Maximizing Likelihoods

We want to know argmaxλ P(λ|O). By Bayes’ rule, P(λ|O)= P(λ)P(O|λ)/P(O). - The observation is constant, so it is enough to maximize P(λ)P(O|λ). P(λ) is the prior probability of a model; P(O|λ) is the probability of the observation, given a model. Typically, we don’t know much about P(λ). - E.g., we might assume all models are equally likely. - Or, we might stipulate that some subclass are equally likely, and the rest not worth considering. We will ignore P(λ), and simply optimize P(O|λ). I.e., we can find the most probable model by asking which model makes the data most probable.

Page 74: Hidden Markov Model


Maximum Likelihood Example

Simple example: We have a coin that may be biased. We would like to know the probability that it will come up heads. Let’s flip it a number of times; use percentage of times it comes up heads to estimate the desired probability. Given m out of n trials come up heads, what probability should be assigned? In terms of likelihoods, we want to know P(λ|O), where our model is just the simple parameter, the probability of a coin coming up heads. As per our previous discussion, can maximize this by maximizing P(λ)P(O|λ).

Page 75: Hidden Markov Model


Maximizing P(λ)P(O|λ) For Coin Flips

Note that knowing something about P(λ) is knowing whether coins tended to be biased. If we don’t, we just optimize P(O|λ). I.e., let’s pick the model that makes the observation most likely. We can solve this analytically:
- P(m heads over n coin tosses | P(Heads)=p) = nCm p^m (1-p)^(n-m)
- Take derivative, set equal to 0, solve for p. Turns out p is m/n.
- This is probably what you guessed.
So we have made a simple maximum likelihood estimate.
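The "take derivative, set equal to 0" step can be written out as follows. Working with the logarithm of the likelihood is a convenience assumed here (the slide does not say which form to differentiate); it has the same maximizer because the logarithm is increasing.

```latex
\begin{aligned}
L(p) &= \binom{n}{m}\, p^{m} (1-p)^{n-m} \\
\log L(p) &= \log\binom{n}{m} + m \log p + (n-m)\log(1-p) \\
\frac{d}{dp}\log L(p) &= \frac{m}{p} - \frac{n-m}{1-p} = 0 \\
m(1-p) &= (n-m)\,p \quad\Longrightarrow\quad p = \frac{m}{n}
\end{aligned}
```

For 0 < m < n the second derivative, -m/p^2 - (n-m)/(1-p)^2, is negative, so this stationary point is a maximum.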

Page 76: Hidden Markov Model


Comments

Making an MLE seems sensible, but has its problems.
Clearly, our result will just be an estimate.
- but one we hope will become increasingly accurate with more data.
Note, though, that via MLE, the probability of everything we haven’t seen so far is 0.
For modeling rare events, there will never be enough data for this problem to go away.
There are ways to smooth over this problem (in fact, one way is called “smoothing”!), but we won’t worry about this now.
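To make the zero-probability problem concrete, the sketch below compares the plain MLE with add-one (Laplace) smoothing on a made-up sample in which one symbol never occurs. Add-one smoothing is only one of several smoothing schemes and is chosen here purely for illustration; the sample and the function names are hypothetical.

```python
from collections import Counter


def mle(counts, vocab):
    """Relative-frequency estimate: unseen symbols get probability 0."""
    total = sum(counts.values())
    return {v: counts[v] / total for v in vocab}


def add_one(counts, vocab):
    """Add-one smoothing: pretend every symbol was seen once more."""
    total = sum(counts.values()) + len(vocab)
    return {v: (counts[v] + 1) / total for v in vocab}


vocab = ["R", "G", "B"]
counts = Counter("RRGRRG")          # hypothetical sample: "B" never observed

p_mle = mle(counts, vocab)          # assigns probability 0 to "B"
p_smooth = add_one(counts, vocab)   # assigns probability 1/9 to "B"
```

The smoothed estimate moves a little probability mass from the seen symbols to the unseen one, so no event is ruled out just because the sample was small.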

Page 77: Hidden Markov Model


Back to HMMs

MLE tells us to optimize P(O|λ). We know how to compute P(O|λ) for a given model. - E.g., use the “forward” procedure. How do we find the best one? Unfortunately, we don’t know how to solve this problem analytically. However, there is a procedure to find a better model, given an existing one. - So, if we have a good guess, we can make it better. - Or, start out with fully connected HMM (of given N, M) in which each state can emit every possible value; set all probabilities to random non-zero values. Will this guarantee us a best solution? No! So this is a form of … - Hillclimbing!

Page 78: Hidden Markov Model


Maximizing the Probability of the Observation

Basic idea is:
1. Start with an initial model, λold.
2. Compute new λnew based on λold and observation O.
3. If P(O|λnew) - P(O|λold) < threshold (or we’ve iterated enough), stop.
4. Otherwise, λold ← λnew and go to step 2.

Page 79: Hidden Markov Model


Increasing the Probability of the Observation

Let’s “count” the number of times, from each state, we
- start in a given state
- make a transition to each other state
- emit each different symbol,
given the model and the observation.
If we knew these numbers, we could compute new probability estimates.
We can’t really “count” these.
- We don’t know for sure which path through the model was taken.
- But we know the probability of each path, given the model.
So we can compute the expected value of each figure.

Page 80: Hidden Markov Model


Set Up

Define ξt(i,j) = P(qt=Si,qt+1=Sj|O,λ) - i.e., the probability that, given observation and model, we are in state Si at time t and state Sj at time t+1. Here is how such a transition can happen:

[Figure: … → Si (time t) → Sj (time t+1) → …, on a time axis t-1, t, t+1, t+2; the arc from Si to Sj is labeled aij bj,ot+1.]
αt(i) = P(o1o2…ot, qt=Si|λ)
βt+1(j) = P(ot+2…oT|qt+1=Sj, λ)

Page 81: Hidden Markov Model


Computing ξt(i,j)

αt(i)aijbj,ot+1βt+1(j) = P(qt=Si,qt+1=Sj,O|λ) But we want ξt(i,j) = P(qt=Si,qt+1=Sj|O,λ) By definition of conditional probability, ξt(i,j) = αt(i)aijbj,ot+1βt+1(j)/P(O|λ) i.e., given - α, which we can compute by forward procedure, - β, which we can compute by backward procedure, - P(O), which we can compute by forward, but also, by ∑i ∑j αt(i)aijbj,ot+1βt+1(j) - and A,B, which are given in model, – we can compute ξ.

Page 82: Hidden Markov Model


One more preliminary…

Define γt(i) as probability of being in state Si at time t, given the observation and the model. Given our definition of ξ:

γt(i)=∑1≤j≤N ξt(i,j) So: - ∑1≤t≤T-1 γt(i) = expected number of transitions from Si

- ∑1≤t≤T-1 ξt(i,j) = expected number of transitions from Si to Sj

Page 83: Hidden Markov Model


Now We Can Reestimate Parameters

aij’ = expected no. of transitions from Si to Sj / expected no. of transitions from Si
     = ∑1≤t≤T-1 ξt(i,j) / ∑1≤t≤T-1 γt(i)
πi’ = probability of being in Si at time 1 = γ1(i)
bjk’ = expected no. of times in Sj, observing vk / expected no. of times in Sj
     = ∑1≤t≤T, s.t. Ot=vk γt(j) / ∑1≤t≤T γt(j)
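The reestimation formulas can be turned into a short, unscaled Python sketch. It uses γt(i) = αt(i)βt(i)/P(O|λ), which agrees with ∑j ξt(i,j) for t < T and also covers t = T. The function names, the 0-based indices, and the toy model are assumptions made for illustration; a production version would add the scaling needed for long sequences.

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(o_1..o_t, q_t = i | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


def reestimate(pi, A, B, obs):
    """One Baum-Welch update; returns (pi', A', B', P(O | old model))."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    prob = sum(alpha[T - 1])
    # gamma[t][i] = P(q_t = i | O, lambda)
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(N)] for t in range(T)]
    # xi[t][i][j] = P(q_t = i, q_{t+1} = j | O, lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B, prob


# Hypothetical starting model and observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 1, 0, 0, 2]

history = []
for _ in range(5):
    pi, A, B, prob = reestimate(pi, A, B, obs)
    history.append(prob)   # P(O | model before this update)
```

Each entry of `history` should be at least as large as the one before it, and every row of the reestimated A and B should still sum to 1.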

Page 84: Hidden Markov Model


Iterative Reestimation Formulae

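$$a_{ij}' = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \frac{\sum_{t=1}^{T-1} \frac{1}{P}\, \alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \sum_{j} \frac{1}{P}\, \alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}$$

(expected ratio of # of transitions to state j from/given state i)

$$b_{jk}' = \frac{\sum_{t=1}^{T} \gamma_t(j)\, \delta(x_t, k)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \frac{1}{P}\, \alpha_t(j)\, \beta_t(j)\, \delta(x_t, k)}{\sum_{t=1}^{T} \frac{1}{P}\, \alpha_t(j)\, \beta_t(j)}$$

$$\pi_i' = \gamma_1(i) = \frac{1}{P}\, \alpha_1(i)\, \beta_1(i)$$

Here P = P(X|λ), and δ(x_t, k) is 1 if x_t is symbol k and 0 otherwise. Repeat with the reestimated parameters.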

Page 85: Hidden Markov Model


Baum-Welch Algorithm

This algorithm is known as the Baum-Welch algorithm, or the forward-backward algorithm.
It was proven (by Baum et al.) that this reestimation procedure never decreases the likelihood, and increases it unless the parameters are already at a critical point.
But remember, it only guarantees climbing to a local maximum!
It is a special case of a very general algorithm for incremental improvement by iteratively
- computing expected values of some unobservables (e.g., transitions and state emissions),
- using these to compute new MLEs of parameters (A,B,π).
The general procedure is called Expectation-Maximization, or the EM algorithm.

Page 86: Hidden Markov Model


Implementation Considerations

This doesn’t quite work as advertised in practice.
One problem: αt(i), βt(j) get smaller with the length of the observation, and eventually underflow. This is readily fixed by “normalizing” these values.
- I.e., instead of computing αt(i), at each time step, compute αt(i)/∑iαt(i).
- It turns out that if you also use the ∑iαt(i)s to scale the βt(j)s, you get a numerically nice value, although it doesn’t have a nice probabilistic interpretation;
- Yet, when you use both scaled values, the rest of the ξt(i,j) computation is exactly the same.
Another: Sometimes estimated probabilities will still get very small, and it seems better to not let these fall all the way to 0.
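The normalization trick for α can be sketched as follows. At each time step the α values are divided by their sum, and the logarithms of those sums are accumulated; their total is log P(O|λ), since the product of the scale factors equals P(O|λ). The function name `log_likelihood` and the toy model are illustrative assumptions, and the matching scaling of β is not shown.

```python
import math


def log_likelihood(pi, A, B, obs):
    """log P(O | lambda) from a forward pass that rescales alpha at every step."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Ordinary forward induction, applied to the rescaled alphas.
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                     for j in range(N)]
        c = sum(alpha)                  # scale factor for this time step
        alpha = [a / c for a in alpha]  # rescaled alphas now sum to 1
        log_prob += math.log(c)         # product of all c's equals P(O | lambda)
    return log_prob


# Hypothetical model; all entries are positive, so no scale factor is zero.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

short = log_likelihood(pi, A, B, [0, 1, 2])          # matches the unscaled result
long_run = log_likelihood(pi, A, B, [0, 1, 2] * 2000)  # 6000 symbols, still finite
```

On the 6000-symbol sequence an unscaled forward pass would underflow to 0 in double precision, while the scaled version returns a finite (large negative) log probability.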

Page 87: Hidden Markov Model


Other issues
• Other methods of training
– MAP estimation – for adaptation
– MMI estimation
– MDI estimation
– Viterbi training
– Discriminant/reinforcement training

Page 88: Hidden Markov Model


• Other types of parametric structure
– Continuous density HMM (CHMM)
  • More accurate, but many more parameters to train
– Semi-continuous HMM
  • Mix of CHMM and DHMM, using parameter sharing
– State-duration HMM
  • More accurate temporal behavior
• Other extensions
– HMM+NN, Autoregressive HMM
– 2D models: MRF, Hidden Mesh model, pseudo-2D HMM

Page 89: Hidden Markov Model


Graphical DHMM and CHMM
• Models for ‘5’ and ‘2’

Page 90: Hidden Markov Model


Statistical Decision Making System

• Statistical K-class pattern recognition
– Prepare K HMMs λ1, …, λK trained with samples
– Integrate measurement and a priori knowledge
  • P(X|λk)
  • P(λk)
– Decision

$$\text{Choose } \lambda_k \text{ if } P(\lambda_k \mid X) > P(\lambda_i \mid X) \quad \text{for all } i \neq k$$

Page 91: Hidden Markov Model


An Application of HMMs: “Part of Speech” Tagging

A problem: Word sense disambiguation.
- I.e., words typically have multiple senses; need to determine which one a given word is being used as.
For now, we just want to guess the “part of speech” of each word in a sentence.
- Linguists propose that words have properties that are a function of their “grammatical class”.
- Grammatical classes are things like verb, noun, preposition, determiner, etc.
- Words often have multiple grammatical classes.
  » For example, the word “rock” can be a noun or a verb.
  » Each of which can have a number of different meanings.
We want an algorithm that will “tag” each word with its most likely part of speech.
- Why? Helping with parsing, pronunciation.

Page 92: Hidden Markov Model


Parsing, Briefly

Consider a simple sentence: “I saw a bird.” Let’s “diagram” this sentence, making a parse tree:

[S [NP [PRO I]] [VP [V saw] [NP [D a] [N bird]]]]

Page 93: Hidden Markov Model


However

Each of these words is listed in the dictionary as having multiple POS entries:
- “saw”: noun, verb
- “bird”: noun, intransitive verb (“to catch or hunt birds, birdwatch”)
- “I”: pronoun, noun (the letter “I”, the square root of –1, something shaped like an I (I-beam)), symbol (I-80, Roman numeral, iodine)
- “a”: article, noun (the letter “a”, something shaped like an “a”, the grade “A”), preposition (“three times a day”), French pronoun.

Page 94: Hidden Markov Model


Moreover, there is a parse!

[S [NP [N I] [N saw] [N a]] [VP [V bird]]]

Page 95: Hidden Markov Model


What’s It Mean?

It’s nonsense, of course. The question is, can we avoid such nonsense cheaply? Note that this parse corresponds to a very unlikely set of POS tags. So, just restricting our tags to reasonably probable ones might eliminate such silly options.

Page 96: Hidden Markov Model


Solving the Problem

A “baseline” solution: Just count the frequencies with which each word occurs as a particular part of speech.
- Using “tagged corpora”.
Pick the most frequent POS for each word.
How well does it work? Turns out it will be correct about 91% of the time.
Good? Humans will agree with each other about 97-98% of the time.
So, room for improvement.
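The baseline tagger amounts to a table lookup. The sketch below builds that table from a tiny corpus that was invented for this example (it is not real tagged data), and the tag names and the fallback tag for unknown words are likewise assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged corpus: (word, tag) pairs invented for illustration.
tagged = [("the", "D"), ("rock", "N"), ("fell", "V"),
          ("they", "PRO"), ("rock", "V"), ("the", "D"), ("boat", "N"),
          ("a", "D"), ("rock", "N"), ("band", "N")]

# counts[word][tag] = number of times `word` carried `tag` in the corpus
counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1


def baseline_tag(words, default="N"):
    """Tag each word with its most frequent POS; unknown words get `default`."""
    return [counts[w].most_common(1)[0][0] if w in counts else default
            for w in words]


tags = baseline_tag(["they", "rock", "the", "boat"])
```

Here “rock” is tagged as a noun even though it is used as a verb in “they rock the boat”, because the lookup ignores context. That is the kind of error a context-sensitive tagger is meant to reduce.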

Page 97: Hidden Markov Model


A Better Algorithm

Make the POS guess depend on context. For example, if the previous word were “the”, then the word “rock” is much more likely to be occurring as a noun than as a verb. We can incorporate context by setting up an HMM in which the hidden states are POSs, and the words are the emissions in those states.

Page 98: Hidden Markov Model


HMM For POS Tagging: Example

[Figure: example HMM with a start state and POS states (D, N, …); the states emit words such as “a”, “smiling”, “little”, “time”, “files”, “in”; transitions and emissions are labeled with probabilities.]

Page 99: Hidden Markov Model


HMM For POS Tagging

A first-order HMM is equivalent to POS bigrams; a second-order HMM is equivalent to POS trigrams.
Generally works well if there is a hand-tagged corpus from which to read off the probabilities.
- Lots of detailed issues: smoothing, etc.
If none is available, train using Baum-Welch.
Usually start with some constraints:
- E.g., start with 0 emissions for words not listed in the dictionary as having a given POS; estimate transitions.
Best variations get about 96-97% accuracy, which is approaching human performance.
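Reading the probabilities off a hand-tagged corpus is a counting exercise. The sketch below estimates first-order (bigram) transition probabilities and emission probabilities by relative frequency. The two sentences, the tag set, and the `<s>` start marker are invented for illustration, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical hand-tagged sentences (word, tag), invented for illustration.
sentences = [
    [("the", "D"), ("rock", "N"), ("fell", "V")],
    [("they", "PRO"), ("rock", "V"), ("the", "D"), ("boat", "N")],
]

trans, prev_total = Counter(), Counter()   # tag bigram counts
emit, tag_total = Counter(), Counter()     # (tag, word) counts

for sent in sentences:
    prev = "<s>"                           # assumed start-of-sentence state
    for word, tag in sent:
        trans[(prev, tag)] += 1
        prev_total[prev] += 1
        emit[(tag, word)] += 1
        tag_total[tag] += 1
        prev = tag


def trans_prob(prev, tag):
    """Estimate of P(tag | previous tag), unsmoothed."""
    return trans[(prev, tag)] / prev_total[prev]


def emit_prob(tag, word):
    """Estimate of P(word | tag), unsmoothed."""
    return emit[(tag, word)] / tag_total[tag]
```

With these counts `trans_prob("D", "N")` is 1.0 and `emit_prob("N", "rock")` is 0.5. Any tag bigram or word that was never seen gets probability 0, which is where the smoothing issues mentioned above come in.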

