Eidetic 3D LSTM: A Model for Video Prediction and Beyond
Yunbo Wang1, Lu Jiang2, Ming-Hsuan Yang2,3, Li-Jia Li4, Mingsheng Long1, Li Fei-Fei4
1Tsinghua University, 2Google AI, 3University of California, Merced, 4Stanford University
Summary
• We build space-time models of the world through predictive unsupervised learning.
• Task 1: Future frame prediction. Applications: urban computing, weather forecasting, learning the dynamics of complex environments.
• Task 2: Early action recognition. Predicting future percepts from the available information. We ask: can pixel-level predictive learning help percept-level tasks?
• Code/models available: github.com/google/e3d_lstm
Motivations
| Task         | 2D CNN         | 3D CNN   | LSTM              | Conv-in-LSTM | 3DConv-in-LSTM |
| Video Pred.  | Mathieu ICLR16 | VideoGAN | Srivastava NIPS15 | ConvLSTM     | E3D-LSTM       |
| Video Recog. | Two-Stream CNN | C3D, I3D | LRCN              | ConvGRU      | E3D-LSTM       |
• Common feature of pixel-level and percept-level future prediction → long- and short-term dependencies.
• Our point: jointly learning long- and short-term video representations via recurrent 3D convolutions.
Modeling Short-Term Video Representations
• 3DConv-in-LSTM: Integrating 3D convolutions into the LSTM's recurrent transitions, to reduce the vanishing-gradient problem.
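The gate computation of such a recurrent transition can be sketched as follows. This is our illustrative single-channel NumPy sketch, not the released TensorFlow code; `conv3d_same` and `gate` are hypothetical helper names, and the real model convolves multi-channel 5-D tensors.

```python
# Conceptual sketch of a "3DConv-in-LSTM" recurrent transition: each
# gate is produced by 3-D convolutions (cross-correlation, as in deep
# learning) over a short clip of frames and over the hidden state.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3d_same(x, w):
    """Naive single-channel 3-D convolution with zero 'same' padding."""
    kt, kh, kw = w.shape
    pad = ((kt // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    T, H, W = x.shape
    for t in range(T):
        for i in range(H):
            for j in range(W):
                out[t, i, j] = (xp[t:t + kt, i:i + kh, j:j + kw] * w).sum()
    return out

def gate(x_clip, h_prev, w_x, w_h):
    """One LSTM gate: sigma(W_x * X_t + W_h * H_{t-1}), '*' = 3-D conv."""
    return sigmoid(conv3d_same(x_clip, w_x) + conv3d_same(h_prev, w_h))
```

Because the convolution runs over the temporal axis as well, each gate sees short-term motion inside the clip, not just a single frame.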
Figure: (a) 3D-CNN at bottom; (b) 3D-CNN on top; (c) E3D-LSTM network.
Modeling Long-Term Video Representations
• Most prior work handles long-term video relations via the recursions of feed-forward networks (weak at learning temporal dependencies) or the temporal state transitions of recurrent networks (prone to saturated forget gates).
• Our point: we introduce the Recall Gate, a Transformer-like attention mechanism, into the LSTM's memory transitions, replacing the traditional forget gate.
Figure: (d) Spatiotemporal LSTM; (e) Eidetic 3D LSTM.
$$R_t = \sigma(W_{xr} \ast X_t + W_{hr} \ast H^k_{t-1} + b_r)$$
$$\mathrm{RECALL}(R_t, C^k_{t-\tau:t-1}) = \mathrm{softmax}\big(R_t \cdot (C^k_{t-\tau:t-1})^\top\big) \cdot C^k_{t-\tau:t-1}$$
$$C^k_t = I_t \odot G_t + \mathrm{LayerNorm}\big(C^k_{t-1} + \mathrm{RECALL}(R_t, C^k_{t-\tau:t-1})\big) \qquad (1)$$
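Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions (states flattened to 2-D matrices so attention reduces to matrix products); the function names are ours, not the released implementation, and the real model operates on 5-D spatiotemporal tensors with 3-D convolutions producing $R_t$, $I_t$, $G_t$.

```python
# Minimal sketch of the recall gate (Eq. 1): attend over the last tau
# cell states instead of forgetting them, then layer-normalize.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def recall(r_t, c_history):
    """RECALL(R_t, C_{t-tau:t-1}): attention over tau past cell states.
    r_t: (n, d) query from the recall gate; c_history: (tau, n, d)."""
    tau, n, d = c_history.shape
    keys = c_history.reshape(tau * n, d)     # stack the past states
    scores = softmax(r_t @ keys.T, axis=-1)  # (n, tau*n) attention weights
    return scores @ keys                     # (n, d) recalled memory

def eidetic_cell_update(i_t, g_t, r_t, c_prev, c_history):
    """C_t = I .* G + LayerNorm(C_{t-1} + RECALL(R_t, C_history))."""
    return i_t * g_t + layer_norm(c_prev + recall(r_t, c_history))
```

The design choice mirrors the bullet above: rather than a forget gate that multiplicatively decays $C^k_{t-1}$, the cell recalls relevant content from a window of $\tau$ past states, so distant memories stay reachable.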
Moving MNIST Dataset
| Model        | SSIM  | MSE  |
| ConvLSTM     | 0.713 | 96.5 |
| DFN          | 0.726 | 89.0 |
| FRNN         | 0.819 | 68.4 |
| VPN baseline | 0.870 | 64.1 |
| PredRNN      | 0.869 | 56.5 |
| E3D-LSTM     | 0.910 | 41.3 |
Figure: (f) 10 → 10 prediction; (g) copy test.
KTH Action Dataset: Video Prediction and Replay
Early Action Recognition
| Model                                                   | Front 25% | Front 50% |
| Baseline 1: 3D-CNN at bottom                            | 10.28     | 16.05     |
| Baseline 2: 3D-CNN on top                               | 9.63      | 14.82     |
| Baseline 3: Ours w/o 3D convolutions                    | 9.58      | 13.92     |
| Baseline 4: Ours w/o memory attention                   | 11.39     | 18.84     |
| Trained only on the recognition task                    | 13.78     | 20.91     |
| Pre-trained on the prediction task                      | 14.00     | 22.15     |
| Trained on both tasks with a fixed loss ratio           | 13.57     | 20.46     |
| E3D-LSTM (trained on both tasks with a scheduled ratio) | 14.59     | 22.73     |
Tsinghua University - Beijing - China - 100084 Mail: [email protected]
Figure: qualitative early-recognition comparisons of 3D-CNN vs. E3D-LSTM over the first 0–100% of each video, on action categories such as "Poking a stack of [sth.] so the stack collapses", "Poking a stack of [sth.] without the stack collapsing", "Pouring [sth.] into [sth.] until it overflows", and "Trying to pour [sth.] into [sth.], but missing so it spills next to it".