EM Algorithm
Jur van den Berg
Kalman Filtering vs. Smoothing
• Dynamics and Observation model
• Kalman Filter:
  – Compute X_t | y_0, …, y_t
  – Real-time, given data so far
• Kalman Smoother:
  – Compute X_t | y_0, …, y_T for all 0 ≤ t ≤ T
  – Post-processing, given all data
  x_{t+1} = A x_t + w_t,   w_t ~ N(0, Q)
  y_t = C x_t + v_t,   v_t ~ N(0, R)
EM Algorithm
• Kalman smoother:
  – Compute distributions X_0, …, X_T given parameters A, C, Q, R and data y_0, …, y_T.
• EM Algorithm:
  – Simultaneously optimize X_0, …, X_T and A, C, Q, R given data y_0, …, y_T.
Probability vs. Likelihood
• Probability: predict unknown outcomes based on known parameters:
  – p(x | θ)
• Likelihood: estimate unknown parameters based on known outcomes:
  – L(θ | x) = p(x | θ)
• Coin-flip example:
  – θ is the probability of "heads" (parameter)
  – x = HHHTTH is the outcome
Likelihood for Coin-flip Example
• Probability of outcome given parameter:
  – p(x = HHHTTH | θ = 0.5) = 0.5⁶ ≈ 0.016
• Likelihood of parameter given outcome:
  – L(θ = 0.5 | x = HHHTTH) = p(x | θ) ≈ 0.016
• Likelihood is maximal when θ = 0.6666…
• The likelihood function is not a probability density
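The coin-flip likelihood above is easy to check numerically; a minimal sketch (function and variable names are illustrative, not from the slides):

```python
# Likelihood of theta for the outcome HHHTTH: L(theta | x) = theta^4 * (1 - theta)^2
def likelihood(theta, outcome="HHHTTH"):
    heads = outcome.count("H")
    tails = outcome.count("T")
    return theta ** heads * (1 - theta) ** tails

# Likelihood of theta = 0.5 equals the probability of the outcome: 0.5^6
print(round(likelihood(0.5), 4))  # 0.0156

# A grid search confirms the maximum is near theta = 2/3
thetas = [i / 1000 for i in range(1, 1000)]
best = max(thetas, key=likelihood)
print(best)  # 0.667
```

Note that the likelihood values need not sum to one over θ, which is why the likelihood function is not a probability density.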
Likelihood for Continuous Distributions
• Six samples {-3, -2, -1, 1, 2, 3} believed to be drawn from some Gaussian N(0, σ²)
• Likelihood of σ:
  L(σ | {-3, -2, -1, 1, 2, 3}) = p(x = -3 | σ) · p(x = -2 | σ) · … · p(x = 3 | σ)
• Maximum likelihood:
  σ² = ((-3)² + (-2)² + (-1)² + 1² + 2² + 3²) / 6 = 28/6, so σ ≈ 2.16
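The closed-form maximizer can be verified numerically; a small sketch (names are my own):

```python
import math

# Samples from the slide, believed drawn from N(0, sigma^2)
samples = [-3, -2, -1, 1, 2, 3]

def log_likelihood(sigma):
    # log L(sigma | samples) = sum_i log N(x_i | 0, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)
               for x in samples)

# Closed-form maximizer: sigma^2 = (1/n) sum_i x_i^2 = 28/6
sigma_mle = math.sqrt(sum(x ** 2 for x in samples) / len(samples))
print(round(sigma_mle, 2))  # 2.16

# Nearby values of sigma have lower log-likelihood
assert log_likelihood(sigma_mle) > log_likelihood(sigma_mle + 0.01)
assert log_likelihood(sigma_mle) > log_likelihood(sigma_mle - 0.01)
```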
Likelihood for Stochastic Model
• Dynamics and observation model:
  x_{t+1} = A x_t + w_t,   w_t ~ N(0, Q)
  y_t = C x_t + v_t,   v_t ~ N(0, R)
• Suppose x_t and y_t are given for 0 ≤ t ≤ T; what is the likelihood of A, C, Q and R?
  L(A, C, Q, R | x, y) = p(x, y | A, C, Q, R) = ∏_{t=1}^{T} p(x_t | x_{t-1}) · ∏_{t=0}^{T} p(y_t | x_t)
• Compute the log-likelihood:
  log p(x, y | A, C, Q, R)
Log-likelihood
• Multivariate normal distribution N(μ, Σ) has pdf:
  p(x) = (2π)^{-k/2} |Σ|^{-1/2} exp( -½ (x - μ)ᵀ Σ⁻¹ (x - μ) )
• From the model:
  x_t ~ N(A x_{t-1}, Q),   y_t ~ N(C x_t, R)
• Therefore:
  log p(x, y | A, C, Q, R) = ∑_{t=1}^{T} log p(x_t | x_{t-1}) + ∑_{t=0}^{T} log p(y_t | x_t)
  = -½ ∑_{t=1}^{T} ( log|Q| + (x_t - A x_{t-1})ᵀ Q⁻¹ (x_t - A x_{t-1}) )
    -½ ∑_{t=0}^{T} ( log|R| + (y_t - C x_t)ᵀ R⁻¹ (y_t - C x_t) ) + const
Log-likelihood #2
• a = Tr(a) if a is a scalar
• Bring the summation inward

  = -(T/2) log|Q| - ½ ∑_{t=1}^{T} Tr( (x_t - A x_{t-1})ᵀ Q⁻¹ (x_t - A x_{t-1}) )
    -((T+1)/2) log|R| - ½ ∑_{t=0}^{T} Tr( (y_t - C x_t)ᵀ R⁻¹ (y_t - C x_t) ) + const
Log-likelihood #3
• Tr(AB) = Tr(BA)
• Tr(A) + Tr(B) = Tr(A + B)

  = -(T/2) log|Q| - ½ Tr( Q⁻¹ ∑_{t=1}^{T} (x_t - A x_{t-1})(x_t - A x_{t-1})ᵀ )
    -((T+1)/2) log|R| - ½ Tr( R⁻¹ ∑_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)ᵀ ) + const
Log-likelihood #4
• Expand the quadratic factors:

  l(A, C, Q, R | x, y)
  = -(T/2) log|Q| - ½ Tr( Q⁻¹ ∑_{t=1}^{T} ( x_t x_tᵀ - A x_{t-1} x_tᵀ - x_t x_{t-1}ᵀ Aᵀ + A x_{t-1} x_{t-1}ᵀ Aᵀ ) )
    -((T+1)/2) log|R| - ½ Tr( R⁻¹ ∑_{t=0}^{T} ( y_t y_tᵀ - C x_t y_tᵀ - y_t x_tᵀ Cᵀ + C x_t x_tᵀ Cᵀ ) ) + const
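The trace manipulation in the steps above can be sanity-checked numerically; a sketch with arbitrary illustrative values (names are my own):

```python
import numpy as np

# Random model and states (illustrative values only)
rng = np.random.default_rng(0)
n, T = 3, 20
A = rng.normal(size=(n, n))
Qinv = np.diag([1.0, 0.5, 2.0])   # some positive-definite Q^{-1}
x = rng.normal(size=(T + 1, n))

# Sum of scalar quadratic forms: sum_t (x_t - A x_{t-1})^T Q^{-1} (x_t - A x_{t-1})
quad = sum((x[t] - A @ x[t - 1]) @ Qinv @ (x[t] - A @ x[t - 1])
           for t in range(1, T + 1))

# Trace form: Tr(Q^{-1} sum_t (x_t - A x_{t-1})(x_t - A x_{t-1})^T)
S = sum(np.outer(x[t] - A @ x[t - 1], x[t] - A @ x[t - 1]) for t in range(1, T + 1))
trace_form = np.trace(Qinv @ S)

print(np.isclose(quad, trace_form))  # True
```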
Maximize likelihood
• log is a monotone function:
  – argmax log f(x) = argmax f(x)
• Maximize l(A, C, Q, R | x, y) in turn for A, C, Q and R:
  – Solve ∂l(A, C, Q, R | x, y)/∂A = 0 for A
  – Solve ∂l(A, C, Q, R | x, y)/∂C = 0 for C
  – Solve ∂l(A, C, Q, R | x, y)/∂Q = 0 for Q
  – Solve ∂l(A, C, Q, R | x, y)/∂R = 0 for R
Matrix derivatives
• Defined for scalar functions f : R^{n×m} → R:
  (∂f/∂A)_{ij} = ∂f/∂A_{ij}
• Key identities:
  ∂ Tr(BAᵀ)/∂A = B
  ∂ Tr(BA)/∂A = Bᵀ
  ∂ Tr(ABAᵀ)/∂A = A(B + Bᵀ)
  ∂ log|A|/∂A = A⁻ᵀ
  ∂ (xᵀ A x)/∂A = x xᵀ
Optimizing A
• Derivative:
  ∂l(A, C, Q, R | x, y)/∂A = -½ Q⁻¹ ∑_{t=1}^{T} ( 2 A x_{t-1} x_{t-1}ᵀ - 2 x_t x_{t-1}ᵀ )
• Maximizer (set the derivative to zero):
  A = ( ∑_{t=1}^{T} x_t x_{t-1}ᵀ ) ( ∑_{t=1}^{T} x_{t-1} x_{t-1}ᵀ )⁻¹
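The maximizer for A is an ordinary least-squares fit of x_t on x_{t-1}; a sketch recovering a known A from simulated states (the model values here are illustrative, not from the slides):

```python
import numpy as np

# Simulate states from a known A with small process noise
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
T = 2000
x = np.zeros((T + 1, 2))
for t in range(T):
    x[t + 1] = A_true @ x[t] + 0.1 * rng.normal(size=2)

# Maximizer from the slide: A = (sum_t x_t x_{t-1}^T)(sum_t x_{t-1} x_{t-1}^T)^{-1}
num = sum(np.outer(x[t], x[t - 1]) for t in range(1, T + 1))
den = sum(np.outer(x[t - 1], x[t - 1]) for t in range(1, T + 1))
A_hat = num @ np.linalg.inv(den)

print(np.round(A_hat, 2))  # close to A_true
```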
Optimizing C
• Derivative:
  ∂l(A, C, Q, R | x, y)/∂C = -½ R⁻¹ ∑_{t=0}^{T} ( 2 C x_t x_tᵀ - 2 y_t x_tᵀ )
• Maximizer:
  C = ( ∑_{t=0}^{T} y_t x_tᵀ ) ( ∑_{t=0}^{T} x_t x_tᵀ )⁻¹
Optimizing Q
• Derivative with respect to the inverse Q⁻¹:
  ∂l(A, C, Q, R | x, y)/∂Q⁻¹ = (T/2) Q - ½ ∑_{t=1}^{T} ( x_t x_tᵀ - A x_{t-1} x_tᵀ - x_t x_{t-1}ᵀ Aᵀ + A x_{t-1} x_{t-1}ᵀ Aᵀ )
• Maximizer:
  Q = (1/T) ∑_{t=1}^{T} ( x_t x_tᵀ - A x_{t-1} x_tᵀ - x_t x_{t-1}ᵀ Aᵀ + A x_{t-1} x_{t-1}ᵀ Aᵀ )
Optimizing R
• Derivative with respect to the inverse R⁻¹:
  ∂l(A, C, Q, R | x, y)/∂R⁻¹ = ((T+1)/2) R - ½ ∑_{t=0}^{T} ( y_t y_tᵀ - C x_t y_tᵀ - y_t x_tᵀ Cᵀ + C x_t x_tᵀ Cᵀ )
• Maximizer:
  R = (1/(T+1)) ∑_{t=0}^{T} ( y_t y_tᵀ - C x_t y_tᵀ - y_t x_tᵀ Cᵀ + C x_t x_tᵀ Cᵀ )
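The expanded sum in the Q-maximizer collapses back to an average of residual outer products, which makes it easy to check; a sketch with illustrative model values (names are my own):

```python
import numpy as np

# With A known, the Q-maximizer is the average outer product of process residuals
rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
Q_true = np.diag([0.3, 0.1])
T = 20000
w = rng.multivariate_normal(np.zeros(2), Q_true, size=T)
x = np.zeros((T + 1, 2))
for t in range(T):
    x[t + 1] = A @ x[t] + w[t]

resid = x[1:] - x[:-1] @ A.T   # rows are x_t - A x_{t-1}
Q_hat = resid.T @ resid / T    # (1/T) sum_t (x_t - A x_{t-1})(x_t - A x_{t-1})^T
print(np.round(Q_hat, 2))      # close to Q_true
```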
EM-algorithm
• Start with initial guesses of A, C, Q, R
• Kalman smoother (E-step):
  – Compute distributions X_0, …, X_T given data y_0, …, y_T and A, C, Q, R.
• Update parameters (M-step):
  – Update A, C, Q, R such that the expected log-likelihood is maximized
• Repeat until convergence (to a local optimum)
Kalman Smoother
• for (t = 0; t < T; ++t)  // Kalman filter (forward pass)
    x̂_{t+1|t} = A x̂_{t|t}
    P_{t+1|t} = A P_{t|t} Aᵀ + Q
    K_{t+1} = P_{t+1|t} Cᵀ ( C P_{t+1|t} Cᵀ + R )⁻¹
    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} ( y_{t+1} - C x̂_{t+1|t} )
    P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}
• for (t = T - 1; t ≥ 0; --t)  // Backward (smoothing) pass
    L_t = P_{t|t} Aᵀ P_{t+1|t}⁻¹
    x̂_{t|T} = x̂_{t|t} + L_t ( x̂_{t+1|T} - x̂_{t+1|t} )
    P_{t|T} = P_{t|t} + L_t ( P_{t+1|T} - P_{t+1|t} ) L_tᵀ
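The two passes above transcribe directly to code; a sketch in Python (array shapes, variable names, and the prior arguments x0, P0 are my own conventions):

```python
import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, P0):
    """Forward Kalman filter, then the backward pass; returns smoothed
    means x_{t|T}, covariances P_{t|T}, and the smoother gains L_t."""
    T = len(y) - 1
    n = A.shape[0]
    xf = np.zeros((T + 1, n)); Pf = np.zeros((T + 1, n, n))  # filtered x_{t|t}
    xp = np.zeros((T + 1, n)); Pp = np.zeros((T + 1, n, n))  # predicted x_{t|t-1}
    xf[0], Pf[0] = x0, P0
    for t in range(T):  # forward pass
        xp[t + 1] = A @ xf[t]
        Pp[t + 1] = A @ Pf[t] @ A.T + Q
        K = Pp[t + 1] @ C.T @ np.linalg.inv(C @ Pp[t + 1] @ C.T + R)
        xf[t + 1] = xp[t + 1] + K @ (y[t + 1] - C @ xp[t + 1])
        Pf[t + 1] = Pp[t + 1] - K @ C @ Pp[t + 1]
    xs, Ps = xf.copy(), Pf.copy()  # smoothed; x_{T|T} is already smoothed
    L = np.zeros((T, n, n))
    for t in range(T - 1, -1, -1):  # backward pass
        L[t] = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + L[t] @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + L[t] @ (Ps[t + 1] - Pp[t + 1]) @ L[t].T
    return xs, Ps, L
```

The gains L_t are returned because the M-step expectations below also need them.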
Update Parameters
• The likelihood is in terms of the states x, but only the distributions X_0, …, X_T are available
• The log-likelihood l(A, C, Q, R | x, y) is linear in x_t, x_t x_tᵀ and x_t x_{t-1}ᵀ
• Expected log-likelihood: replace these terms with their expectations:
  E(x_t | y) = x̂_{t|T}
  E(x_t x_tᵀ | y) = P_{t|T} + x̂_{t|T} x̂_{t|T}ᵀ
  E(x_t x_{t-1}ᵀ | y) = P_{t|T} L_{t-1}ᵀ + x̂_{t|T} x̂_{t-1|T}ᵀ
• Use the maximizers with these expectations to update A, C, Q and R.
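Putting the E-step expectations and the M-step maximizers together, here is a one-dimensional sketch of the full EM loop (all names are illustrative, and the prior x̂_{0|0} = 0, P_{0|0} = 1 is an assumption of mine, not from the slides):

```python
import numpy as np

def em_lds_1d(y, a, c, q, r, iters=20):
    """EM for the scalar model x_{t+1} = a x_t + w, y_t = c x_t + v."""
    T = len(y) - 1
    for _ in range(iters):
        # E-step: scalar Kalman filter + backward smoothing pass
        xf = np.zeros(T + 1); pf = np.zeros(T + 1)   # filtered
        xp = np.zeros(T + 1); pp = np.zeros(T + 1)   # predicted
        xf[0], pf[0] = 0.0, 1.0                      # assumed prior
        for t in range(T):
            xp[t + 1] = a * xf[t]
            pp[t + 1] = a * pf[t] * a + q
            k = pp[t + 1] * c / (c * pp[t + 1] * c + r)
            xf[t + 1] = xp[t + 1] + k * (y[t + 1] - c * xp[t + 1])
            pf[t + 1] = pp[t + 1] - k * c * pp[t + 1]
        xs = xf.copy(); ps = pf.copy(); L = np.zeros(T)
        for t in range(T - 1, -1, -1):
            L[t] = pf[t] * a / pp[t + 1]
            xs[t] = xf[t] + L[t] * (xs[t + 1] - xp[t + 1])
            ps[t] = pf[t] + L[t] * (ps[t + 1] - pp[t + 1]) * L[t]
        # Expected sufficient statistics
        Ex = xs                                   # E(x_t | y)
        Exx = ps + xs ** 2                        # E(x_t^2 | y)
        Exx1 = ps[1:] * L + xs[1:] * xs[:-1]      # E(x_t x_{t-1} | y)
        # M-step: plug the expectations into the maximizers
        a = Exx1.sum() / Exx[:-1].sum()
        c = (y * Ex).sum() / Exx.sum()
        q = (Exx[1:] - 2 * a * Exx1 + a * a * Exx[:-1]).sum() / T
        r = (y ** 2 - 2 * c * y * Ex + c * c * Exx).sum() / (T + 1)
    return a, c, q, r
```

Because the scale of the hidden state is not observable, c and the state variance trade off against each other, so the learned parameters need not match the "true" ones; EM only guarantees a local optimum of the likelihood.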
Convergence
• Convergence is guaranteed, but only to a local optimum
• Similar to coordinate ascent
Conclusion
• EM-algorithm to simultaneously optimize state estimates and model parameters
• Given "training data", the EM algorithm can be used (off-line) to learn the model for subsequent use in (real-time) Kalman filters
Next time
• Learning from demonstrations
• Dynamic Time Warping