Hidden Markov Models
Slides adapted from Joyce Ho, David Sontag, Geoffrey Hinton, Eric Xing, and Nicholas Ruozzi
Sequential Data
• Time-series: Stock
market, weather, speech,
video
• Ordered: Text, genes
Sequential Data: Tracking
Observe noisy measurements of missile location
Where is the missile now? Where will it be in 1 minute?
Sequential Data: Weather
• Predict the weather tomorrow
using previous information
• If it rained yesterday, and the
previous day and historically it
has rained 7 times in the past
10 years on this date — does
this affect my prediction?
Sequential Data: Weather
• Use the product rule for the joint distribution of a sequence:
P(x1, ..., xT) = P(x1) P(x2 | x1) ... P(xT | x1, ..., xT-1)
• How do I solve this?
• Model how weather changes over time
• Model how observations are produced
• Reason about the model
Markov Chain
• Set S is called the state space
• Process moves from one state to another generating a
sequence of states: x1, x2, …, xt
• Markov chain property: the probability of each subsequent
state depends only on the previous state:
P(xt | x1, ..., xt-1) = P(xt | xt-1)
Markov Chain: Parameters
• State transition matrix A (|S| x |S|)
A is a stochastic matrix (all rows sum to one)
Time-homogeneous Markov chain: the transition probability between two states does not depend on time
• Initial (prior) state probabilities
• Two states: ‘Rain’ and ‘Dry’.
• Transition matrix (rows = current state Rain, Dry; columns = next state Rain, Dry):
  A = [0.3 0.7; 0.2 0.8]
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3, P(‘Dry’|‘Rain’)=0.7,
  P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: P(‘Rain’)=0.4, P(‘Dry’)=0.6
Example of Markov Model
Example: Weather Prediction
• Compute probability of tomorrow’s weather using Markov property
• Evaluation: given today is dry, what’s the probability that tomorrow is dry and the next day is rainy?
• Learning: given some observations, determine the transition probabilities
P({‘Dry’,’Dry’,’Rain’})
= P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’)
= 0.2 * 0.8 * 0.6 = 0.096
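The chain-rule computation above can be checked in a few lines of Python (a sketch using the slide's initial and transition probabilities):

```python
# Markov chain from the weather example: P(x1..xT) = P(x1) * prod_t P(x_t | x_{t-1})
init = {'Rain': 0.4, 'Dry': 0.6}
trans = {('Rain', 'Rain'): 0.3, ('Rain', 'Dry'): 0.7,
         ('Dry', 'Rain'): 0.2, ('Dry', 'Dry'): 0.8}

def sequence_prob(states):
    """Probability of a state sequence under the chain rule + Markov property."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(sequence_prob(['Dry', 'Dry', 'Rain']))  # 0.6 * 0.8 * 0.2 ≈ 0.096
```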
Hidden Markov Model (HMM)
• Stochastic model where
the states of the model
are hidden
• Each state can emit an
output which is observed
HMM: Parameters
• State transition matrix A
• Emission / observation
conditional output probabilities B
• Initial (prior) state probabilities
(Diagram: hidden states ‘Low’ and ‘High’ with transition matrix A = [0.3 0.7; 0.2 0.8];
emission probabilities B: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4,
P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6)
Example of Hidden Markov Model
• Two states : ‘Low’ and ‘High’ atmospheric pressure.
• Two observations : ‘Rain’ and ‘Dry’.
• Transition probabilities:
P(‘Low’|‘Low’)=0.3 , P(‘High’|‘Low’)=0.7
P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
• Observation probabilities :
P(‘Rain’|‘Low’)=0.6 , P(‘Dry’|‘Low’)=0.4
P(‘Rain’|‘High’)=0.4 , P(‘Dry’|‘High’)=0.6
• Initial probabilities:
P(‘Low’)=0.4 , P(‘High’)=0.6
Example of Hidden Markov Model
• Suppose we want to calculate the probability of a sequence of
observations in our example, {‘Dry’,’Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,’Rain’} ) = P({‘Dry’,’Rain’} , {‘Low’,’Low’}) +
P({‘Dry’,’Rain’} , {‘Low’,’High’}) + P({‘Dry’,’Rain’} ,
{‘High’,’Low’}) + P({‘Dry’,’Rain’} , {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’} , {‘Low’,’Low’})
= P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’)
= 0.4 * 0.6 * 0.4 * 0.3 = 0.0288
Calculation of observation sequence probability
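The sum over hidden state sequences can be checked by brute-force enumeration (a sketch; P(‘Dry’|‘High’) is taken as 0.6 so the emission probabilities normalize):

```python
from itertools import product

# Low/High pressure HMM from the slides
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}

def brute_force_likelihood(obs):
    """Sum the joint P(obs, states) over all K^T hidden state sequences."""
    total = 0.0
    for states in product(init, repeat=len(obs)):
        p = init[states[0]] * emit[(states[0], obs[0])]
        for t in range(1, len(obs)):
            p *= trans[(states[t - 1], states[t])] * emit[(states[t], obs[t])]
        total += p
    return total

print(brute_force_likelihood(['Dry', 'Rain']))  # sums the four terms above
```

The ‘Low’,‘Low’ term of this sum is exactly the 0.4 * 0.6 * 0.4 * 0.3 product worked out on the slide.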
Example: Dishonest Casino
• A casino has two dice that it switches between with
5% probability
• Fair die
• Loaded die
Example: Dishonest Casino
• Initial probabilities
• State transition matrix
• Emission probabilities
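A sketch of the dishonest-casino generative process. The 5% switch probability comes from the slides; the loaded die's distribution (P(6) = 0.5) and the uniform initial state are assumed illustrative values, since the slides leave these parameters unspecified:

```python
import random

# Transition matrix: stay with the current die with probability 0.95
TRANS = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.05, 'L': 0.95}}
# Emission probabilities: fair die uniform; loaded die favors six (assumed value)
EMIT = {'F': [1 / 6] * 6,
        'L': [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}

def roll_sequence(n, seed=0):
    """Sample n rolls: hidden die states and observed faces."""
    rng = random.Random(seed)
    state = rng.choice(['F', 'L'])  # assumed uniform initial distribution
    states, rolls = [], []
    for _ in range(n):
        states.append(state)
        rolls.append(rng.choices(range(1, 7), weights=EMIT[state])[0])
        state = rng.choices(list(TRANS[state]),
                            weights=list(TRANS[state].values()))[0]
    return states, rolls
```

The three questions on the next slide (evaluation, decoding, learning) are exactly the tasks of recovering this hidden `states` sequence and these parameters from `rolls` alone.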
Example: Dishonest Casino
• Given a sequence of rolls by the casino player
• How likely is this sequence given our model of how the casino works? – evaluation problem
• What sequence portion was generated with the fair die, and what portion with the loaded die? – decoding problem
• How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded and back? – learning problem
HMM: Problems
• Evaluation: Given parameters and observation sequence,
find probability (likelihood) of observed sequence
- forward algorithm
• Decoding: Given HMM parameters and observation
sequence, find the most probable sequence of hidden states
- Viterbi algorithm
• Learning: Given an HMM with unknown parameters and an
observation sequence, find the parameters that maximize the
likelihood of the data
- Forward-Backward algorithm
HMM: Evaluation Problem
• Given HMM parameters (A, B, π) and an observation sequence o1, ..., oT
• Compute the probability (likelihood) of the observed sequence, P(o1, ..., oT)
Summing over all possible hidden state values at all
times gives K^T terms: exponential in T
[Figure: trellis representation of an HMM. Each time step 1, ..., t, t+1, ..., T has a column of states s1, ..., sK; edges between consecutive columns carry transition probabilities a_ij; the observations o1, ..., oT sit below the columns.]
HMM: Forward Algorithm
• Instead, pose it as a recursive problem
• Dynamic program to compute the forward probability
  α_t(k) = P(o1, ..., ot, St = k): the probability of being in state k
  after observing the first t observations
• Algorithm:
  - Initialize (t = 1): α_1(k) = π_k b_k(o1)
  - Iterate with the recursion (t = 2, ..., T): α_t(k) = b_k(ot) Σ_i α_{t-1}(i) a_ik
  - Terminate (t = T): P(o1, ..., oT) = Σ_k α_T(k)
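A minimal forward-algorithm sketch on the Low/High example (P(‘Dry’|‘High’) taken as 0.6 so the emissions normalize):

```python
# Low/High pressure HMM from the slides
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}
STATES = list(init)

def forward(obs):
    """Forward probabilities alpha[t][k] = P(o_1..o_t, S_t = k)."""
    # initialize: alpha_1(k) = pi_k * b_k(o_1)
    alpha = [{k: init[k] * emit[(k, obs[0])] for k in STATES}]
    # recurse: alpha_t(k) = b_k(o_t) * sum_i alpha_{t-1}(i) * a_ik
    for t in range(1, len(obs)):
        alpha.append({k: emit[(k, obs[t])] *
                         sum(alpha[-1][i] * trans[(i, k)] for i in STATES)
                      for k in STATES})
    return alpha

# terminate: P(o_1..o_T) = sum_k alpha_T(k)
print(sum(forward(['Dry', 'Rain'])[-1].values()))  # ≈ 0.232
```

This matches the brute-force sum over hidden sequences, but the table fill costs only O(K^2 T) instead of O(K^T).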
HMM: Problems
• Evaluation: Given parameters and observation sequence,
find probability (likelihood) of observed sequence
- forward algorithm
• Decoding: Given HMM parameters and observation
sequence, find the most probable sequence of hidden states
- Viterbi algorithm
• Learning: Given an HMM with unknown parameters and an
observation sequence, find the parameters that maximize the
likelihood of the data
- Forward-Backward algorithm
HMM: Decoding Problem 1
• Given HMM parameters and an observation sequence o1, ..., oT
• Probability that the hidden state at time t was k:
  P(St = k, o1, ..., oT) = α_t(k) β_t(k)
• We know how to compute the first factor, α_t(k), using the
  forward algorithm
HMM: Backward Probability
• Similar to the forward probability, the backward probability
  β_t(k) = P(o_{t+1}, ..., oT | St = k) can be expressed as a recursion
• Dynamic program:
  - Initialize: β_T(k) = 1
  - Iterate with the recursion (t = T-1, ..., 1): β_t(k) = Σ_j a_kj b_j(o_{t+1}) β_{t+1}(j)
HMM: Decoding Problem 1
• Probability that the hidden state at time t was k:
  γ_t(k) = P(St = k | o1, ..., oT) = α_t(k) β_t(k) / Σ_j α_t(j) β_t(j)
• Most likely state assignment at time t: argmax_k γ_t(k)
• Computing all the γ_t(k) this way is the forward-backward algorithm
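A sketch of the forward-backward computation on the Low/High example (P(‘Dry’|‘High’) taken as 0.6 so the emissions normalize):

```python
# Low/High pressure HMM from the slides
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}
STATES = list(init)

def forward(obs):
    """alpha[t][k] = P(o_1..o_t, S_t = k)."""
    alpha = [{k: init[k] * emit[(k, obs[0])] for k in STATES}]
    for t in range(1, len(obs)):
        alpha.append({k: emit[(k, obs[t])] *
                         sum(alpha[-1][i] * trans[(i, k)] for i in STATES)
                      for k in STATES})
    return alpha

def backward(obs):
    """beta[t][k] = P(o_{t+1}..o_T | S_t = k)."""
    beta = [{k: 1.0 for k in STATES}]  # initialize: beta_T(k) = 1
    # recurse backwards: beta_t(k) = sum_j a_kj * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(len(obs) - 2, -1, -1):
        beta.insert(0, {k: sum(trans[(k, j)] * emit[(j, obs[t + 1])] * beta[0][j]
                               for j in STATES) for k in STATES})
    return beta

def posteriors(obs):
    """gamma[t][k] = P(S_t = k | o_1..o_T) = alpha_t(k) beta_t(k) / P(o_1..o_T)."""
    alpha, beta = forward(obs), backward(obs)
    evidence = sum(alpha[-1].values())
    return [{k: alpha[t][k] * beta[t][k] / evidence for k in STATES}
            for t in range(len(obs))]

gammas = posteriors(['Dry', 'Rain'])
# each gamma_t is a distribution over states, so it sums to one
```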
HMM: Decoding Problem 2
• Given HMM parameters and an observation sequence o1, ..., oT
• What is the most likely state sequence?
• Define V_t(k): the probability of the most likely sequence of
  hidden states ending in state St = k (jointly with o1, ..., ot)
HMM: Viterbi Algorithm
• Compute probability recursively over t
• Use dynamic programming again!
HMM: Viterbi Algorithm
• Initialize: V_1(k) = π_k b_k(o1)
• Iterate: V_t(k) = b_k(ot) max_i V_{t-1}(i) a_ik, storing the maximizing i as a back-pointer
• Terminate: max_k V_T(k)
• Traceback: follow the back-pointers from the best final state to recover the sequence
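A minimal Viterbi sketch on the Low/High example (P(‘Dry’|‘High’) taken as 0.6 so the emissions normalize):

```python
# Low/High pressure HMM from the slides
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}
STATES = list(init)

def viterbi(obs):
    """Most likely hidden state sequence for the observations."""
    # initialize: V_1(k) = pi_k * b_k(o_1)
    V = [{k: init[k] * emit[(k, obs[0])] for k in STATES}]
    back = []
    for t in range(1, len(obs)):
        step, ptr = {}, {}
        for k in STATES:
            # V_t(k) = b_k(o_t) * max_i V_{t-1}(i) * a_ik, remember the argmax
            best = max(STATES, key=lambda i: V[-1][i] * trans[(i, k)])
            ptr[k] = best
            step[k] = V[-1][best] * trans[(best, k)] * emit[(k, obs[t])]
        V.append(step)
        back.append(ptr)
    # terminate at the best final state, then trace back
    path = [max(STATES, key=lambda k: V[-1][k])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

print(viterbi(['Dry', 'Rain']))  # ['High', 'High']
```

The structure mirrors the forward algorithm exactly, with the sum over previous states replaced by a max plus a back-pointer.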
HMM: Computational Complexity
• What is the running time for the forward algorithm,
backward algorithm, and Viterbi?
O(K^2 T) vs O(K^T)!
HMM: Problems
• Evaluation: Given parameters and observation sequence,
find probability (likelihood) of observed sequence
- forward algorithm
• Decoding: Given HMM parameters and observation
sequence, find the most probable sequence of hidden states
- Viterbi algorithm
• Learning: Given an HMM with unknown parameters and an
observation sequence, find the parameters that maximize the
likelihood of the data
- Forward-Backward, Baum-Welch algorithm
HMM: Learning Problem
• Given only observations
• Find parameters that maximize likelihood
• Need to learn hidden state sequences as well
HMM: Baum-Welch (EM) Algorithm
• Randomly initialize parameters
• E-step: Fix parameters, find expected state assignment
Forward-backward
algorithm
HMM: Baum-Welch (EM) Algorithm
• Expected number of times in state i: Σ_{t=1}^{T} γ_t(i)
• Expected number of transitions out of state i: Σ_{t=1}^{T-1} γ_t(i)
• Expected number of transitions from state i to j: Σ_{t=1}^{T-1} ξ_t(i, j),
  where ξ_t(i, j) = P(St = i, St+1 = j | o1, ..., oT)
HMM: Baum-Welch (EM) Algorithm
• M-step: Fix expected state assignments, update
parameters
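In symbols, the standard Baum-Welch M-step re-estimates each parameter as a ratio of the expected counts from the E-step (γ_t(i) the state posterior, ξ_t(i, j) the transition posterior):

```latex
\hat{\pi}_i = \gamma_1(i), \qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\hat{b}_i(o) = \frac{\sum_{t \,:\, o_t = o} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
```

Each update is just "expected count of the event" divided by "expected count of the opportunity", so the E- and M-steps alternate until the likelihood stops improving.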
HMM: Problems
• Evaluation: Given parameters and observation sequence,
find probability (likelihood) of observed sequence
- forward algorithm
• Decoding: Given HMM parameters and observation
sequence, find the most probable sequence of hidden states
- Viterbi algorithm
• Learning: Given an HMM with unknown parameters and an
observation sequence, find the parameters that maximize the
likelihood of the data
- Forward-Backward (Baum-Welch) algorithm
HMM vs Linear Dynamical Systems
• HMM
• States are discrete
• Observations are discrete or continuous
• Linear dynamical systems
• Observations and states are multivariate Gaussians
• Can use Kalman Filters to solve
Linear State Space Models
• States & observations are Gaussian
• Kalman filter: (recursive) prediction and
update
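As a concrete illustration, a minimal scalar Kalman filter sketch. The model x_t = a·x_{t-1} + w, y_t = x_t + v with noise variances q and r is an assumed toy setup (not from the slides), showing only the recursive predict/update structure:

```python
def kalman_1d(ys, a=1.0, q=1.0, r=1.0, mu0=0.0, var0=1.0):
    """Filtered mean/variance of the hidden state after each observation."""
    mu, var = mu0, var0
    out = []
    for y in ys:
        # predict: propagate the state estimate through the dynamics
        mu_p, var_p = a * mu, a * a * var + q
        # update: blend prediction and observation via the Kalman gain
        gain = var_p / (var_p + r)
        mu = mu_p + gain * (y - mu_p)
        var = (1 - gain) * var_p
        out.append((mu, var))
    return out

estimates = kalman_1d([1.0, 1.0, 1.0])
```

This is the continuous-state analogue of the forward algorithm: the Gaussian (mu, var) plays the role of the α_t table, and both updates are recursive in t.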
More examples
• Location prediction
• Privacy preserving data monitoring
Next Location Prediction: Definitions
Source: A. Monreale, F. Pinelli, R. Trasarti, F. Giannotti. WhereNext: a Location Predictor on Trajectory Pattern Mining. KDD 2009
o Personalization
• Individual-based methods only utilize the history of one object to predict its
future locations.
• General-based methods use the movement history of other objects
additionally (e.g. similar objects or similar trajectories) to predict the object’s
future location.
Next Location Prediction: Classification of Methods
o Temporal Representation
• Location-series representations define trajectories as a set of
sequenced locations ordered in time.
• Fixed-interval time representations use a fixed time interval
between two consecutive locations
• Variable-interval time representations allow variable
transition times between sequenced locations
Next Location Prediction: Classification of Methods
o Spatial Representation
• Grid-based methods divide space into fixed-size cells which
can be simple rectangular regions
• Frequent/dense-region methods find regions using clustering
algorithms such as DBSCAN or hierarchical clustering.
• Semantic-based methods use semantic features of locations
in addition to the geographic information, e.g. home, bank,
school.
Next Location Prediction: Classification of Methods
o Mobility Learning Method
• Model-based methods (formulate the movement of moving objects
  using mathematical models):
  - Markov chains
  - Recursive Motion Function (Y. Tao et al., ACM SIGMOD 2004)
  - Semi-Lazy Hidden Markov Model (J. Zhou et al., ACM SIGKDD 2013)
  - Deep learning models
• Pattern-based methods (exploit pattern mining algorithms for prediction):
  - Trajectory Pattern Mining (A. Monreale et al., ACM SIGKDD 2009)
• Hybrid methods:
  - Recursive Motion Function + Sequential Pattern Mining (H. Jeung et al., ICDE 2008)
Next Location Prediction: Classification of Methods
Preliminary Results
[Figure: prediction error for different prediction lengths using (a) the Brinkhoff dataset and (b) the Periodical Synthetic dataset]