Introduction to Hidden Markov Models
Page 1: Introduction to Hidden Markov Models

Introduction to Hidden Markov Models

Page 2:

Introduction

• A Hidden Markov Model (HMM) is a simple temporal Bayesian network.

• The structure of the network dictates how a system that varies over time moves from one state to the next.

• Generally, we start at an initial state at time zero and the process moves to each of its possible new states with a probability based solely on the current state.

• In Markov models, the states are directly observable – they are the values being measured.

• In Hidden Markov Models, the states are hidden, but lead to observable values with certain probabilities.

Page 3:

• Set of states: $\{s_1, s_2, \dots, s_N\}$

• Process moves from one state to another, generating a sequence of states: $s_{i_1}, s_{i_2}, \dots, s_{i_k}, \dots$

• Markov chain property: probability of each subsequent state depends only on what was the previous state:

$P(s_{i_k} \mid s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}})$

• To define a Markov model, the following probabilities have to be specified: initial probabilities $\pi_i = P(s_i)$ and transition probabilities $a_{ij} = P(s_i \mid s_j)$

Firstly, Markov Models
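The definition above is easy to turn into a small sampler; a minimal sketch in Python, using the ‘Rain’/‘Dry’ state names and probabilities from the example that follows:

```python
import random

# A Markov model is fully specified by initial and transition probabilities.
states = ["Rain", "Dry"]
initial = {"Rain": 0.4, "Dry": 0.6}                # pi_i = P(s_i)
transition = {"Rain": {"Rain": 0.3, "Dry": 0.7},   # row = current state,
              "Dry":  {"Rain": 0.2, "Dry": 0.8}}   # columns sum to 1

def sample_sequence(length, rng=random):
    """Generate a state sequence; each next state depends only on the current one."""
    seq = [rng.choices(states, weights=[initial[s] for s in states])[0]]
    while len(seq) < length:
        row = transition[seq[-1]]
        seq.append(rng.choices(states, weights=[row[s] for s in states])[0])
    return seq
```

Because the draw at each step looks only at `seq[-1]`, the sampler enforces the Markov chain property by construction.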

Page 4:

[Diagram: two-state chain, ‘Rain’ and ‘Dry’, with transition arrows labelled 0.3, 0.7, 0.2, 0.8.]

• Two states: ‘Rain’ and ‘Dry’.

• Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7, P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8

• Initial probabilities: say P(‘Rain’)=0.4, P(‘Dry’)=0.6.

Example of Markov Model

Page 5:

• By the Markov chain property, the probability of a state sequence can be found by the formula:

• Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’,’Dry’,’Rain’,’Rain’}.

P({‘Dry’,’Dry’,’Rain’,’Rain’})
= P(‘Rain’|’Rain’) P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’)
= 0.3*0.2*0.8*0.6 = 0.0288

Calculation of sequence probability

$P(s_{i_1}, s_{i_2}, \dots, s_{i_k}) = P(s_{i_k} \mid s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}})$

$= P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \dots, s_{i_{k-1}})$

$= \dots = P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_{k-1}} \mid s_{i_{k-2}}) \cdots P(s_{i_2} \mid s_{i_1}) \, P(s_{i_1})$
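The chain-rule product can be checked directly in code; a minimal sketch (Python), using the ‘Rain’/‘Dry’ transition and initial probabilities from the example:

```python
# Initial probabilities pi and transition probabilities a_ij = P(next | current).
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {"Rain": {"Rain": 0.3, "Dry": 0.7},
              "Dry":  {"Rain": 0.2, "Dry": 0.8}}

def sequence_probability(seq):
    """P(s1, ..., sk) = P(s1) * product over t of P(s_t | s_{t-1})."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

# P({'Dry','Dry','Rain','Rain'}) = 0.6 * 0.8 * 0.2 * 0.3 = 0.0288
```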

Page 6:

[Diagram: hidden states ‘Low’ and ‘High’ with transitions 0.3, 0.7, 0.2, 0.8; each state emits observations ‘Rain’/‘Dry’ with probabilities 0.6/0.4 (Low) and 0.4/0.6 (High).]

Example of Hidden Markov Model

Page 7:

• Two states: ‘Low’ and ‘High’ atmospheric pressure.

• Two observations: ‘Rain’ and ‘Dry’.

• Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8

• Observation probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6 (each state’s observation probabilities must sum to 1).

• Initial probabilities: say P(‘Low’)=0.4, P(‘High’)=0.6.

Example of Hidden Markov Model

Page 8:

• Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’,’Rain’}.

• Consider all possible hidden state sequences:

P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) + P({‘Dry’,’Rain’}, {‘Low’,’High’}) + P({‘Dry’,’Rain’}, {‘High’,’Low’}) + P({‘Dry’,’Rain’}, {‘High’,’High’})

where the first term is:

P({‘Dry’,’Rain’}, {‘Low’,’Low’})
= P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’)
= 0.4*0.6*0.4*0.3 = 0.0288

Calculation of observation sequence probability
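For short sequences the sum over hidden state sequences can be brute-forced; a minimal sketch (Python), assuming P(‘Dry’|‘High’)=0.6 so each state’s emission probabilities sum to 1:

```python
from itertools import product

initial = {"Low": 0.4, "High": 0.6}
trans   = {"Low":  {"Low": 0.3, "High": 0.7},
           "High": {"Low": 0.2, "High": 0.8}}
emit    = {"Low":  {"Rain": 0.6, "Dry": 0.4},
           "High": {"Rain": 0.4, "Dry": 0.6}}

def observation_probability(obs):
    """Sum the joint P(obs, hidden) over every possible hidden state sequence."""
    total = 0.0
    for hidden in product(initial, repeat=len(obs)):
        p = initial[hidden[0]] * emit[hidden[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[hidden[t - 1]][hidden[t]] * emit[hidden[t]][obs[t]]
        total += p
    return total
```

The four terms for {‘Dry’,’Rain’} are 0.0288, 0.0448, 0.0432 and 0.1152, summing to 0.232; note the loop enumerates all N^K paths, which is why the forward algorithm is preferred for long sequences.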

Page 9:

How HMM Works

• HMM is a stochastic generative model for sequences.

• Defined by:
– a finite set of states S
– a finite alphabet A
– a transition probability matrix T
– an emission probability matrix E

• Move from state to state according to T while emitting symbols according to E.

[Diagram: states s1, s2, …, sk, each emitting symbols a1, a2.]
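That generative loop is short to write down; a minimal sketch (Python) with hypothetical two-state, two-symbol matrices T and E, just to make the roles of the four ingredients concrete:

```python
import random

states = ["s1", "s2"]
alphabet = ["a1", "a2"]
T = {"s1": {"s1": 0.9, "s2": 0.1},   # transition matrix, rows sum to 1
     "s2": {"s1": 0.2, "s2": 0.8}}
E = {"s1": {"a1": 0.7, "a2": 0.3},   # emission matrix, rows sum to 1
     "s2": {"a1": 0.1, "a2": 0.9}}

def generate(length, start="s1", rng=random):
    """Walk the chain via T, emitting one symbol per visited state via E."""
    state, symbols = start, []
    for _ in range(length):
        symbols.append(rng.choices(alphabet, weights=[E[state][a] for a in alphabet])[0])
        state = rng.choices(states, weights=[T[state][s] for s in states])[0]
    return symbols
```

Only `symbols` is returned: the observer sees the emissions, while the state trajectory stays hidden, which is exactly what makes the model “hidden” Markov.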

Page 10:

Example: Dishonest Casino

• Game:
– You bet £1
– You roll your die
– Casino rolls their die
– Highest number wins £2

• Casino has two dice:
– Fair die: P(i) = 1/6, i = 1..6
– Loaded die: P(i) = 1/10, i = 1..5; P(6) = 1/2

• Casino switches between the fair & loaded die with probability 1/2.

• Initially, the die is always fair.

• Question: Suppose we played 2 games, and the sequence of rolls was 1, 6, 2, 6. Have we been cheated?

Page 11:

“Visualisation” of Dishonest Casino

Page 12:

1, 6, 2, 6?

We were probably cheated...

E(1|Fair) * T(?, Fair) *
E(6|Loaded) * T(Fair, Loaded) *
E(2|Fair) * T(Loaded, Fair) *
E(6|Loaded) * T(Fair, Loaded)
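To see why, compare the joint probability of the rolls under the all-fair path with the Fair→Loaded→Fair→Loaded path sketched above; a minimal check (Python), assuming the casino parameters from the previous slide and that play always starts on the fair die:

```python
FAIR, LOADED = "Fair", "Loaded"
emit = {FAIR:   {i: 1 / 6 for i in range(1, 7)},
        LOADED: {**{i: 1 / 10 for i in range(1, 6)}, 6: 1 / 2}}
SWITCH = 1 / 2  # stay and switch both have probability 1/2

def path_probability(rolls, path):
    """Joint P(rolls, path), starting in the Fair state with probability 1."""
    p = 1.0 if path[0] == FAIR else 0.0
    p *= emit[path[0]][rolls[0]]
    for t in range(1, len(rolls)):
        p *= SWITCH * emit[path[t]][rolls[t]]
    return p

rolls = [1, 6, 2, 6]
honest  = path_probability(rolls, [FAIR] * 4)
cheated = path_probability(rolls, [FAIR, LOADED, FAIR, LOADED])
```

Working the products through, the cheating path comes out exactly 9 times more likely than the honest one, which is what “we were probably cheated” is quantifying.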

Page 13:

• Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 ... oK, calculate the probability that model M has generated sequence O.

• Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence O.

• Learning problem. Given some training observation sequences O = o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data.

O = o1...oK denotes a sequence of observations with ok ∈ {v1,…,vM}.

Main uses of HMMs:
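The evaluation problem is normally solved with the forward algorithm rather than the exponential sum over hidden paths; a minimal sketch (Python), reusing the weather example’s numbers and assuming P(‘Dry’|‘High’)=0.6 so emissions normalize:

```python
def forward_probability(obs, states, initial, trans, emit):
    """Forward algorithm: P(O | M) in O(K * N^2) time instead of O(N^K)."""
    # alpha[s] = P(o1..ot, hidden state at time t is s)
    alpha = {s: initial[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}
```

On {‘Dry’,’Rain’} this returns 0.232, matching the brute-force sum over all four hidden state sequences, but the work grows only linearly in the sequence length.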

Page 14:

• Typed word recognition; assume all characters are separated.

• The character recognizer outputs the probability of the image being a particular character, P(image|character).

[Figure: a character image scored against each hidden state ‘a’, ‘b’, ‘c’, …, ‘z’, with example outputs 0.5, 0.03, 0.005, …, 0.31. Hidden state = character; Observation = image.]

Word recognition example(1).

Page 15:

• If a lexicon is given, we can construct a separate HMM model for each lexicon word.

[Figure: left-to-right HMMs for ‘Amherst’ (a m h e r s t) and ‘Buffalo’ (b u f f a l o), with example scores 0.5, 0.03 and 0.4, 0.6.]

• Here recognition of the word image is equivalent to the problem of evaluating a few HMM models.

• This is an application of the Evaluation problem.

Word recognition example(3).

Page 16:

• We can construct a single HMM for all words.

• Hidden states = all characters in the alphabet.

• Transition probabilities and initial probabilities are calculated from a language model.

• Observations and observation probabilities are as before.

[Figure: state graph connecting the characters a, m, h, e, r, s, t, b, u, f, o.]

• Here we have to determine the best sequence of hidden states, the one that most likely produced the word image.

• This is an application of the Decoding problem.

Word recognition example(4).
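Decoding is typically done with the Viterbi algorithm, which replaces the forward algorithm’s sum with a max and keeps back-pointers; a minimal sketch (Python), using the weather parameters as stand-ins for the character model (with P(‘Dry’|‘High’) taken as 0.6):

```python
def viterbi(obs, states, initial, trans, emit):
    """Return the most likely hidden state sequence for obs."""
    # delta[s] = best joint probability of any path ending in state s
    delta = {s: initial[s] * emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of s at step t
    for o in obs[1:]:
        prev, delta, pointers = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            pointers[s] = best
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
        back.append(pointers)
    # trace back from the best final state
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}
```

For {‘Dry’,’Rain’} the best path is {‘High’,’High’} (joint probability 0.1152, the largest of the four terms computed earlier).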

Page 17:

What makes a good HMM problem space?

Characteristics:

• Classification problems

There are two main types of output from an HMM:
– Scoring of sequences (for example, protein family modelling)
– Labeling of observations within a sequence (for example, identifying genes in a particular sequence)

Page 18:

HMM Requirements

So you’ve decided you want to build an HMM; here’s what you need:

• An architecture
– Probably the hardest part
– Should be sound & easy to interpret

• A well-defined success measure
– Necessary for any form of machine learning

Page 19:

HMM Requirements Continued

• Training data
– Labeled or unlabeled – it depends
• You do not always need a labeled training set to do observation labeling, but it helps
– The amount of training data needed is:
• Directly proportional to the number of free parameters in the model
• Inversely proportional to the size of the training sequences

Page 20:

HMM Advantages

• Statistical Grounding
– Statisticians are comfortable with the theory behind hidden Markov models
– Freedom to manipulate the training and verification processes
– Mathematical / theoretical analysis of the results and processes
– HMMs are still very powerful modeling tools – far more powerful than many statistical methods

Page 21:

HMM Advantages continued

• Modularity
– HMMs can be combined into larger HMMs

• Transparency of the Model
– Assuming an architecture with a good design
– People can read the model and make sense of it
– The model itself can help increase understanding

Page 22:

HMM Advantages continued

• Incorporation of Prior Knowledge

– Incorporate prior knowledge into the architecture

– Initialize the model close to something believed to be correct

– Use prior knowledge to constrain training process

Page 23:

HMM Disadvantages

• Markov Chains
– Each state is assumed to depend only on the immediately preceding state
– Likewise, an observation y is assumed independent of an earlier observation x given the hidden states, and vice versa
– This usually isn’t true of real data
– Can get around it when relationships are local

Page 24:

HMM Disadvantages continued

• …and then there are the standard Machine Learning problems

• Watch out for local maxima
– The model may not converge to a truly optimal parameter set for a given training set

• Avoid over-fitting
– You’re only as good as your training set
– More training is not always good

Page 25:

HMM Disadvantages continued

• Speed!!!
– Almost everything one does with an HMM involves “enumerating all possible paths through the model”
– There are efficient ways to do this
– Still slow in comparison to other methods

Page 26:

Conclusions

• HMMs have problems where they excel, and problems where they do not

• You should consider using one if:
– The problem can be phrased as classification
– The observations are ordered
– The observations follow some sort of grammatical structure (optional)

Page 27:

Conclusions

Advantages:

– Statistics

– Modularity

– Transparency

– Prior Knowledge

Disadvantages:

– State independence

– Over-fitting

– Local Maxima

– Speed

