MATH 829: Introduction to Data Mining and Analysis

Hidden Markov Models

Dominique Guillot

Department of Mathematical Sciences

University of Delaware

May 11, 2016


Hidden Markov Models

Recall: a (discrete-time, homogeneous) Markov chain (X_n)_{n ≥ 0} is a process that satisfies

P(X_{n+1} = j | X_0 = i_0, ..., X_{n-1} = i_{n-1}, X_n = i) = P(X_{n+1} = j | X_n = i) = P(X_1 = j | X_0 = i) =: p(i, j).

A Hidden Markov Model has two components:

1. A Markov chain that describes the state of the system and is unobserved.
2. An observed process where each output depends on the state of the chain.


Hidden Markov Models (cont.)

More precisely, a Hidden Markov Model consists of:

1. A Markov chain (Z_t : t = 1, ..., T) with states S := {s_1, ..., s_{|S|}}, say

   P(Z_{t+1} = s_j | Z_t = s_i) = A_{ij}.

2. An observation process (X_t : t = 1, ..., T) taking values in V := {v_1, ..., v_{|V|}} such that

   P(X_t = v_j | Z_t = s_i) = B_{ij}.

Remarks:

1. The observed variable X_t depends only on Z_t, the state of the Markov chain at time t.
2. The output is a random function of the current state.
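To make this notation concrete, here is a minimal Python/NumPy sketch (my own illustration, not part of the original slides) of an HMM with two hidden states and three observation symbols; the names A, B, pi, sample_hmm and all numerical values are assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters (assumed values, not from the lecture).
    A = np.array([[0.7, 0.3],        # A[i, j] = P(Z_{t+1} = s_j | Z_t = s_i)
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1],   # B[i, j] = P(X_t = v_j | Z_t = s_i)
                  [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])        # initial distribution of the hidden chain

    def sample_hmm(T):
        """Sample hidden states z and observations x of length T."""
        z = np.empty(T, dtype=int)
        x = np.empty(T, dtype=int)
        z[0] = rng.choice(len(pi), p=pi)
        x[0] = rng.choice(B.shape[1], p=B[z[0]])
        for t in range(1, T):
            z[t] = rng.choice(A.shape[1], p=A[z[t - 1]])  # hidden Markov transition
            x[t] = rng.choice(B.shape[1], p=B[z[t]])      # output depends only on z[t]
        return z, x

    z, x = sample_hmm(10)   # z is unobserved in practice; only x is seen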


Examples

An HMM with states S = {x_1, x_2, x_3} and possible observations V = {y_1, y_2, y_3, y_4}.

[Figure: state-transition diagram of the HMM. Source: Wikipedia.]

The a's are the state transition probabilities; the b's are the output probabilities.


Examples (cont.)

Examples of applications:

Recognizing human facial expressions from sequences of images (see e.g. Schmidt et al., 2010).
Speech recognition systems (see e.g. Gales and Young, 2007).
Longitudinal comparisons in medical studies (see e.g. Scott et al., 2005).
Many applications in finance (e.g. pricing options, valuation of life insurance policies, credit risk modeling).
Etc.


Three problems

Three (closely related) important problems naturally arise when working with HMMs:

1. What is the probability of a given observed sequence?
2. What is the most likely sequence of states that generated a given observed sequence?
3. What are the state transition probabilities and the observation probabilities of the model (i.e., how can we estimate the parameters of the model)?


Probability of an observed sequence

Suppose the parameters of the model are known.

Let x = (x_1, ..., x_T) ∈ V^T be a given observed sequence.

What is P(x; A, B)?

Conditioning on the hidden states, we obtain:

P(x; A, B) = \sum_{z ∈ S^T} P(x | z; A, B) P(z; A, B)
           = \sum_{z ∈ S^T} \prod_{i=1}^{T} P(x_i | z_i; B) · \prod_{i=1}^{T} P(z_i | z_{i-1}; A)
           = \sum_{z ∈ S^T} \prod_{i=1}^{T} B_{z_i, x_i} · \prod_{i=1}^{T} A_{z_{i-1}, z_i}.

Problem: Although the previous expression is simple, it involves summing over a set of size |S|^T, which is generally too computationally intensive.
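To see the formula in action, the following sketch (again my own illustration, not from the slides) evaluates P(x; A, B) by brute force, enumerating all |S|^T hidden paths. It reuses the A, B, pi arrays and the sampled x from the sketch above and is only feasible for very small T, which is exactly the problem noted above.

    from itertools import product

    def prob_obs_bruteforce(x, A, B, pi):
        """P(x; A, B) obtained by summing over all |S|^T hidden state sequences."""
        S, T = A.shape[0], len(x)
        total = 0.0
        for z in product(range(S), repeat=T):        # all |S|^T hidden paths
            p = pi[z[0]] * B[z[0], x[0]]             # P(z_1) * P(x_1 | z_1)
            for t in range(1, T):
                p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
            total += p
        return total

    # Example: prob_obs_bruteforce(x[:6], A, B, pi) already enumerates |S|^6 paths.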


Probability of an observed sequence (cont.)

We can compute P(x; A, B) efficiently using dynamic programming.

Idea: avoid computing the same quantities multiple times!

Let α_i(t) := P(x_1, x_2, ..., x_t, z_t = s_i; A, B).

The Forward Procedure for computing α_i(t):

1. Initialize α_i(0) := A_{0,i}, i = 1, ..., |S|.
2. Recursion: α_j(t) := \sum_{i=1}^{|S|} α_i(t-1) A_{ij} B_{j, x_t}, j = 1, ..., |S|, t = 1, ..., T.

Now,

P(x; A, B) = P(x_1, ..., x_T; A, B) = \sum_{i=1}^{|S|} P(x_1, ..., x_T, z_T = s_i; A, B) = \sum_{i=1}^{|S|} α_i(T).

The complexity is now O(|S|^2 · T) instead of O(|S|^T)!
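Below is a minimal NumPy sketch of the forward procedure (my own illustration; here the initial distribution pi plays the role of A_{0,·} in the slide's notation). For short sequences it agrees with the brute-force sum above.

    def forward(x, A, B, pi):
        """alpha[t, i] = P(x_1, ..., x_{t+1}, z_{t+1} = s_i; A, B), with 0-based index t."""
        S, T = A.shape[0], len(x)
        alpha = np.zeros((T, S))
        alpha[0] = pi * B[:, x[0]]                      # initialization
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]  # recursion: O(|S|^2) per step
        return alpha

    def prob_obs_forward(x, A, B, pi):
        """P(x; A, B) = sum_i alpha_i(T)."""
        return forward(x, A, B, pi)[-1].sum()

    # In practice one rescales alpha[t] (or works with log probabilities) to avoid
    # numerical underflow when T is large.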


Inferring the hidden states

One of the most natural questions one can ask about an HMM is: what is the most likely sequence of states that generated the observations?

In other words, we would like to compute

argmax_{z ∈ S^T} P(z | x; A, B).

Using Bayes' theorem:

argmax_{z ∈ S^T} P(z | x; A, B) = argmax_{z ∈ S^T} P(x | z; A, B) P(z; A) / P(x; A, B)
                                = argmax_{z ∈ S^T} P(x | z; A, B) P(z; A),

since the denominator does not depend on z.

Note: There are |S|^T possibilities for z, so there is no hope of examining all of them to pick the optimal one in practice.


The Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm that can be used to efficiently compute the most likely path for the states, given a sequence of observations x ∈ V^T.

Let v_i(t) denote the probability of the most probable path that ends in state s_i at time t:

v_i(t) := \max_{z_1, ..., z_{t-1}} P(z_1, ..., z_{t-1}, z_t = s_i, x_1, ..., x_t; A, B).

Key observation: We have

v_j(t) = \max_{1 ≤ i ≤ |S|} v_i(t-1) A_{ij} B_{j, x_t}.

In other words:

(Best path at time t that ends at j) = \max_{1 ≤ i ≤ |S|} (Best path at time t-1 that ends at i) × (Go from i to j) × (Observe x_t in state s_j).


The Viterbi algorithm

The Viterbi algorithm:

1. Initialize v_i(1) := π_i B_{i, x_1}, i = 1, ..., |S|, where π is the initial distribution of the Markov chain.
2. Compute v_i(t) recursively for i = 1, ..., |S| and t = 2, ..., T.
3. Finally, the most probable path is the path corresponding to \max_{1 ≤ i ≤ |S|} v_i(T); keeping back-pointers during the recursion allows one to recover the path itself (see the sketch below).
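A compact NumPy sketch of these three steps (my own illustration, continuing the earlier sketches; it also stores the back-pointers needed to recover the optimal path):

    def viterbi(x, A, B, pi):
        """Most probable hidden path (as state indices) for the observations x."""
        S, T = A.shape[0], len(x)
        v = np.zeros((T, S))                 # v[t, i] = probability of best path ending in i at time t
        back = np.zeros((T, S), dtype=int)   # back[t, j] = best predecessor of state j at time t
        v[0] = pi * B[:, x[0]]               # step 1: initialization
        for t in range(1, T):                # step 2: recursion
            scores = v[t - 1][:, None] * A   # scores[i, j] = v_i(t-1) * A_ij
            back[t] = scores.argmax(axis=0)
            v[t] = scores.max(axis=0) * B[:, x[t]]
        path = np.empty(T, dtype=int)        # step 3: backtrack from the best final state
        path[-1] = v[-1].argmax()
        for t in range(T - 1, 0, -1):
            path[t - 1] = back[t, path[t]]
        return path

    # As with the forward procedure, log probabilities are preferable numerically for large T.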


Estimating A, B, and π

So far, we assumed the parameters A, B, and π of the HMM were known. We now turn to the estimation of these parameters.

Let θ := (A, B, π).

We know how to compute:

1. P(x | θ): the forward algorithm.
2. argmax_z P(z | x; θ): the Viterbi algorithm.

We now want

argmax_θ P(x | θ),

i.e., the set of parameters for which the observed values are most likely to be obtained.

Note: if we could observe z, then we could easily compute A, B, π (see the sketch below).

We solve the problem using the EM algorithm (known in the HMM setting as the Baum-Welch algorithm).
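To illustrate the note above, here is a sketch (my own, with assumed names, using the z and x sampled in the first sketch) of the simple count-based estimates one would use if the hidden sequence z were observed. The EM iteration replaces these hard counts with expected counts computed under the current θ.

    def mle_complete_data(z, x, S, V):
        """Estimate (pi, A, B) from a fully observed pair (z, x).

        Assumes every hidden state appears at least once; with several sequences,
        counts would be accumulated over all of them."""
        pi_hat = np.zeros(S)
        A_hat = np.zeros((S, S))
        B_hat = np.zeros((S, V))
        pi_hat[z[0]] = 1.0                       # point mass for a single sequence
        for t in range(len(z) - 1):
            A_hat[z[t], z[t + 1]] += 1.0         # transition counts
        for t in range(len(z)):
            B_hat[z[t], x[t]] += 1.0             # emission counts
        A_hat /= A_hat.sum(axis=1, keepdims=True)
        B_hat /= B_hat.sum(axis=1, keepdims=True)
        return pi_hat, A_hat, B_hat

    pi_hat, A_hat, B_hat = mle_complete_data(z, x, S=2, V=3)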
