Efficient and Tractable System Identification through Supervised Learning
Ahmed Hefny
Machine Learning Department
Carnegie Mellon University
8/1/2017
Joint work with Geoffrey Gordon, Carlton Downey, Byron Boots (Georgia Tech), Zita Marinho, Wen Sun

Outline
- Problem Statement:
◦ Learning Dynamical Systems
◦ Solution Properties
- Formulation:
◦ A Taxonomy of Dynamical System Models
◦ Predictive State Models: Formulation and Learning
◦ Connection to Recurrent Networks
- Extensions:
◦ Controlled Systems
◦ Reinforcement Learning

Learning Dynamical Systems
[Diagram: the latent system state $s_t$ (e.g. position & speed) evolves to $s_{t+1}$ through the system dynamics; the observation model produces the observation $o_t$.]
Belief state: $q_t \equiv P(s_t \mid o_{1:t-1})$
The belief $q_t$ summarizes the history $o_{1:t-1}$ for predicting the future $o_{t:\infty}$:
$P(o_{t+\tau} \mid o_{1:t-1}) \equiv P(o_{t+\tau} \mid q_t)$

Learning a Recursive Filter
Given: training sequences $(o_1, o_2, \ldots, o_T)$
Output:
◦ Initial belief $q_1$
◦ Filtering function $f$: $q_{t+1} = f(q_t, o_t)$
◦ Observation function $g$: $E[o_t \mid o_{1:t-1}] = g(q_t)$

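A recursive filter is fully specified by the triple $(q_1, f, g)$. As a minimal sketch, the loop below runs such a filter over one sequence. The name `run_filter` and the toy running-average choice of `f` and `g` are illustrative assumptions, not part of the talk.

```python
from typing import Callable, List, Sequence, Tuple


def run_filter(
    q1: float,
    f: Callable[[float, float], float],
    g: Callable[[float], float],
    observations: Sequence[float],
) -> Tuple[List[float], float]:
    """Run a recursive filter; return one-step predictions and the final belief."""
    q = q1
    predictions = []
    for o in observations:
        predictions.append(g(q))  # E[o_t | o_{1:t-1}] = g(q_t)
        q = f(q, o)               # q_{t+1} = f(q_t, o_t)
    return predictions, q


# Toy (hypothetical) filter: the belief is a running average of the observations.
alpha = 0.5
f = lambda q, o: (1 - alpha) * q + alpha * o
g = lambda q: q

preds, q_final = run_filter(0.0, f, g, [1.0, 1.0, 1.0])
print(preds, q_final)  # [0.0, 0.5, 0.75] 0.875
```

Learning a dynamical system then amounts to choosing `q1`, `f` and `g` from data; the rest of the talk is about how to restrict and fit them.
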
Solution Properties
System:
- Non-linear
- Partially observable
- Controlled
Algorithm:
- Theoretical guarantees
- Scalability

A Taxonomy of Dynamical System Models
[Diagram: $f$ maps $(q_t, o_t)$ to $q_{t+1}$; $g$ maps $q_{t+1}$ to the predicted observation $o_{t+1}$.]

 | Fix g (Predictive State) | Learn g (Latent State)
Learn a model, then derive f | Predictive State Models [Method of moments: two-stage regression] | HMM [EM, Tensor Decomp.]
Directly learn f | PSIM [DAgger] | RNN [BPTT]

Predictive state: $q_t \equiv E[\psi(o_{t:\infty}) \mid o_{1:t-1}]$, i.e. state = E[sufficient future stats].

Why restrict $f$ and $g$?
* Predictive State (a.k.a. Observable Representation):
◦ State is a prediction of future observation statistics.
◦ Future statistics are noisy estimates of the state.
◦ Reduction to supervised learning.
* Additional assumptions on dynamics facilitate the development of an efficient algorithm with provable guarantees.
* Local improvement is still possible.
[Figure labels: PSM, PSIM, RNN]

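Predictive State Model (Formulation)
Predictive State: $q_t \equiv P(o_{t:t+k-1} \mid o_{1:t-1}) \equiv E[\psi_t \mid o_{1:t-1}]$
Extended Predictive State: $p_t \equiv P(o_{t:t+k} \mid o_{1:t-1}) \equiv E[\zeta_t \mid o_{1:t-1}]$
Linear Dynamics (learned): $p_t = W q_t$
Filtering (fixed): $f(q_t, o_t) = f_{filter}(p_t, o_t) = f_{filter}(W q_t, o_t)$
[Timeline: $o_{t-1}$, $o_t$, ..., $o_{t+k-1}$, $o_{t+k}$; the future $\psi_t$ covers $o_{t:t+k-1}$ and the extended future $\zeta_t$ covers $o_{t:t+k}$.]

$\psi_t$ | $q_t$ | $f_{filter}$
Indicator vector | Joint probability table | Bayes rule: $P(o_{t+1:t+k} \mid o_{1:t}) \propto P(o_{t:t+k} \mid o_{1:t-1})$
1st and 2nd moments | Gaussian distribution | Gaussian conditional mean and covariance
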
Why linear $W$? Crucial to the consistency of the learning algorithm.
Why this particular filtering formulation? Matches existing models (HMM, Kalman filter, PSR).

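To make the Gaussian choice of $f_{filter}$ concrete, here is a minimal sketch for scalar observations and $k = 1$. It assumes the extended state $p_t$ is read as the mean and covariance of the pair $(o_t, o_{t+1})$; the function name `gaussian_condition` and the numbers are hypothetical.

```python
def gaussian_condition(mu, cov, o_t):
    """Condition a 2-D Gaussian over (o_t, o_{t+1}) on the observed o_t.

    mu  = (mu_t, mu_next)
    cov = ((var_t, cov_tn), (cov_tn, var_next))
    Returns the conditional mean and variance of o_{t+1}.
    """
    (var_t, cov_tn), (_, var_next) = cov
    gain = cov_tn / var_t                 # regression coefficient of o_{t+1} on o_t
    mean = mu[1] + gain * (o_t - mu[0])   # conditional mean
    var = var_next - gain * cov_tn        # conditional variance
    return mean, var


mean, var = gaussian_condition((0.0, 0.0), ((2.0, 1.0), (1.0, 2.0)), o_t=2.0)
print(mean, var)  # 1.0 1.5
```

This step has no learned parameters, which matches the "fixed" label on $f_{filter}$: only $W$ has to be estimated from data.
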
Predictive State Model (Learning)
$\psi_t$ and $\zeta_t$ are unbiased estimates of $q_t$ and $p_t$:
• $\psi_t = q_t + \epsilon_t$
• $\zeta_t = p_t + \nu_t$
Learning Procedure:
• $\hat{q}_0 = \frac{1}{N} \sum_i \psi_i$
• Learn $W$ using linear regression with examples $(\psi_t, \zeta_t)$
However, $\epsilon_t$ and $\nu_t$ are correlated, so $Cov(q_t, p_t) \neq Cov(\psi_t, \zeta_t)$.

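The effect of the correlated noise can be made explicit with a short derivation. This is a sketch under one added assumption: $\epsilon_t$ and $\nu_t$ have zero mean given the history, so they are uncorrelated with $q_t$ and $p_t$, which are functions of the history. Using $p_t = W q_t$, the large-sample limit of regressing $\zeta_t$ directly on $\psi_t$ is:

```latex
\begin{aligned}
\hat{W}_{\text{direct}}
  &\to E[\zeta_t \psi_t^\top]\, E[\psi_t \psi_t^\top]^{-1} \\
  &= \left( E[p_t q_t^\top] + E[\nu_t \epsilon_t^\top] \right)
     \left( E[q_t q_t^\top] + E[\epsilon_t \epsilon_t^\top] \right)^{-1} \\
  &= \left( W\, E[q_t q_t^\top] + E[\nu_t \epsilon_t^\top] \right)
     \left( E[q_t q_t^\top] + E[\epsilon_t \epsilon_t^\top] \right)^{-1}
  \;\neq\; W \quad \text{in general.}
\end{aligned}
```

Both noise terms, $E[\nu_t \epsilon_t^\top]$ and $E[\epsilon_t \epsilon_t^\top]$, persist no matter how much data is used, so the direct regression is not a consistent estimator of $W$.
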
Predictive State Model (Learning)
Learning Procedure:
• $\hat{q}_0 = \frac{1}{N} \sum_i \psi_i$
• Denoise examples $(\psi_t, \zeta_t)$ to obtain $(\hat{q}_t, \hat{p}_t)$
• Learn $W$ using linear regression with examples $(\hat{q}_t, \hat{p}_t)$
How to denoise? Use regression on the recent history. Since $p_t = W q_t$:
$E[\zeta_t \mid o_{t-k:t-1}] = W E[\psi_t \mid o_{t-k:t-1}]$

Learning Dynamical Systems Using Instrumental Regression
$E[\zeta_t \mid o_{1:t-1}] = W E[\psi_t \mid o_{1:t-1}]$, i.e. $\hat{p}_t = W \hat{q}_t$
Training:
• S1A regression: history $h_t$ → denoised future $E[q_t \mid h_t]$
• S1B regression: history $h_t$ → denoised extended future $E[p_t \mid h_t]$
• S2 regression (learn $W$): denoised future → denoised extended future
Filtering:
• Apply $W$: estimated future $\hat{q}_t$ → estimated extended future $\hat{p}_t$
• Condition on $o_t$ (filter) to obtain $q_{t+1}$, or marginalize $o_t$ (predict) to obtain $q_{t+1 \mid t-1}$

In a nutshell
* Predictive State:
◦ State is a prediction of future observations
◦ Future observations are noisy estimates of the state
* Two stage regression:
◦ Use history features to “denoise” states (S1 Regression)
◦ Use denoised states to learn dynamics (S2 Regression)

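The two stages can be sketched on a toy scalar system. Everything specific here is an assumption made for illustration: a linear-Gaussian system $s_{t+1} = a s_t + w_t$, $o_t = s_t + v_t$, with $k = 1$, history feature $h_t = o_{t-1}$, future $\psi_t = o_t$, and only the new coordinate $o_{t+1}$ of the extended future being regressed. In this setting the entry of $W$ that maps the denoised $o_t$ to the denoised $o_{t+1}$ equals $a$.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (both centered internally)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
a_true, n = 0.9, 50000

# Simulate s_{t+1} = a s_t + w_t, o_t = s_t + v_t (stationary variance of s is 1).
s, obs = random.gauss(0.0, 1.0), []
for _ in range(n):
    obs.append(s + random.gauss(0.0, 1.0))
    s = a_true * s + random.gauss(0.0, 0.19 ** 0.5)

h = obs[:-2]      # history feature h_t = o_{t-1}
psi = obs[1:-1]   # future psi_t = o_t
zeta = obs[2:]    # new coordinate of the extended future, o_{t+1}

# Direct regression of zeta on psi: biased, because psi is a noisy state estimate.
a_direct = slope(psi, zeta)

# S1A / S1B: denoise psi and zeta by regressing each on the history.
b_psi, b_zeta = slope(h, psi), slope(h, zeta)
psi_hat = [b_psi * x for x in h]
zeta_hat = [b_zeta * x for x in h]

# S2: regress the denoised extended future on the denoised future.
a_2sr = slope(psi_hat, zeta_hat)

print(f"true {a_true}, direct {a_direct:.3f}, two-stage {a_2sr:.3f}")
```

With these settings the direct estimate is pulled toward roughly half the true value (the observation noise inflates the variance of $\psi_t$), while the two-stage estimate should land close to $a$. The history acts as an instrument: it is correlated with the state but not with the noise in the current future.
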
What do we gain?
* More understanding of existing algorithms:
◦ Spectral algorithms for learning HMMs, Kalman filters, and PSRs are two-stage regression algorithms with linear regression in all stages.
* Theoretical results (asymptotic and finite sample):
◦ Error in estimating $W$ is $\tilde{O}(1/\sqrt{N})$ [under mild assumptions]
◦ Exact rate depends on S1 regression error
* New flavors of dynamical systems learning algorithms:
◦ HMM with logistic regression.
◦ Online learning of linear dynamical systems (Sun et al. 2015).
◦ Linear dynamical systems with sparse dynamics (Hefny et al. 2015, Gus Xia 2016).

Predictive State Models as RNNs
Predictive state models define RNNs that are easy to initialize!

Back to Special Case: Modeling Discrete Systems with Indicator Vectors
Assume the discrete case: $o_t$ is an indicator vector.
Let $\psi_t = o_t$ and $\zeta_t = o_t \otimes o_{t+1}$.
Then:
$q_t$ | Probability vector
$p_t$ | Joint probability table
$f(p_t, o_t)$ | Choose the column of $p_t$ corresponding to $o_t$, then renormalize

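[Figure: one filtering step in the discrete case. $W$ maps the probability vector $q_t$ to the joint probability table $p_t$; the indicator vector $o_t$ selects a column of $p_t$, which is renormalized ($\|\cdot\|$) to give $q_{t+1}$. Values shown: $q_t = (0.2, 0.5, 0.3)$, $o_t = (0, 1, 0)$, $q_{t+1} = (0.2, 0.5, 0.3)$, and $p_t$ =]
0.1  0.2  0.1
0.05 0.05 0.25
0.05 0.15 0.05
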
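The "choose a column, then renormalize" rule can be sketched in a few lines. The table values are the ones shown in the figure; the function name `discrete_filter` and the indexing convention (columns index $o_t$, rows index $o_{t+1}$) are assumptions. The vectors in the figure appear to be illustrative, so the output here need not match the figure's $q_{t+1}$.

```python
def discrete_filter(p_t, o_t):
    """Select the column of the joint table p_t indicated by o_t, then renormalize.

    Assumes p_t[i][j] = P(o_{t+1} = i, o_t = j | history), so columns index o_t.
    """
    j = o_t.index(1)                   # position of the 1 in the indicator vector
    column = [row[j] for row in p_t]   # unnormalized P(o_{t+1}, o_t = j)
    total = sum(column)                # P(o_t = j | history)
    return [x / total for x in column]


p_t = [
    [0.10, 0.20, 0.10],
    [0.05, 0.05, 0.25],
    [0.05, 0.15, 0.05],
]
o_t = [0, 1, 0]
q_next = discrete_filter(p_t, o_t)
print(q_next)  # approximately [0.5, 0.125, 0.375]
```

Dividing by the column total is Bayes rule in table form: the selected column is the joint probability of the next observation and the current one, and its sum is the probability of the current observation.
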
Predictive State Models as RNNs
[Figure: the same filtering step, with $W$ and the renormalization drawn as a multiplicative unit.]
$q_{t+1}(k) \propto C^\top (A q_t \circ B o_t)$
Predictive units have a multiplicative structure, similar to LSTMs and GRUs.

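A minimal sketch of the multiplicative update, reading the expression $C^\top (A q_t \circ B o_t)$ as a vector update. The matrices below are hypothetical small examples, and normalizing by the sum of the entries is an assumption (the slide only indicates a normalization step).

```python
def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def multiplicative_unit(A, B, C, q, o):
    """One update q_next ∝ C^T (A q ∘ B o), normalized to sum to 1 (assumed)."""
    gated = [x * y for x, y in zip(matvec(A, q), matvec(B, o))]  # Hadamard product
    C_t = list(zip(*C))                                          # transpose of C
    unnorm = matvec(C_t, gated)
    total = sum(unnorm)
    return [x / total for x in unnorm]


# Hypothetical 2 x 3 factors (two multiplicative components, three outcomes).
A = [[1, 0, 0], [0, 1, 1]]
B = [[1, 1, 0], [0, 1, 1]]
C = [[1, 0, 1], [0, 1, 1]]
q = [0.2, 0.5, 0.3]
o = [0, 1, 0]

q_next = multiplicative_unit(A, B, C, q, o)
print(q_next)  # approximately [0.1, 0.4, 0.5]
```

The state and the observation interact only through the elementwise product, which is the sense in which the unit is multiplicative.
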
What about the continuous case?
Mean-maps for Continuous Observations
Mean-maps provide a powerful tool for modeling non-parametric distributions using the feature map of a universal kernel.
A discrete distribution is a special case that uses the delta kernel and the indicator feature map. Continuous distributions can be modeled using, e.g., an RBF kernel.

Discrete Case | General Case | RFF Approximation
Indicator vector | Kernel feature map $\phi(x)$ | RFF feature vector
Joint probability table $P(X,Y)$ | Covariance operator $C_{XY}$ | Covariance matrix
Conditional probability table | Conditional operator $C_{X|Y}$ | Conditional matrix
Normalization $P(X,Y) \to P(X \mid Y)$ | Kernel Bayes rule $C_{X|Y} = C_{XY} C_{YY}^{-1}$ | Solve linear system

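The RFF column replaces the infinite-dimensional kernel feature map with a finite random feature vector. As a minimal sketch, assuming scalar inputs and an RBF kernel $\exp(-(x-y)^2 / (2\sigma^2))$, the standard random Fourier feature construction looks like this; the names `make_rff`, `num_features` and `bandwidth` are hypothetical.

```python
import math
import random


def make_rff(num_features, bandwidth, rng):
    """Return a feature map z with z(x)·z(y) ≈ exp(-(x - y)^2 / (2 bandwidth^2))."""
    # Frequencies are drawn from the kernel's spectral density, N(0, 1 / bandwidth^2).
    ws = [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)
    return lambda x: [scale * math.cos(w * x + b) for w, b in zip(ws, bs)]


rng = random.Random(0)
z = make_rff(num_features=5000, bandwidth=1.0, rng=rng)

x, y = 0.3, 1.1
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = math.exp(-((x - y) ** 2) / 2.0)
print(approx, exact)
```

Once observations are mapped to such vectors, covariance operators become ordinary matrices and the kernel Bayes rule reduces to solving a (typically regularized) linear system, as the last column of the table indicates.
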
Results
[Figures: character prediction accuracy on English text; error in predicting handwriting trajectories.]

Extension to Controlled Systems
In controlled systems we have observations and actions (e.g. car velocity and pressure on pedals).
[Diagram: graphical model over the states $s_t$ and $s_{t+1}$, the action $a_t$, and the observation $o_t$.]
Recursive Filter:
$E[o_t \mid q_t, a_t] = g(q_t, a_t)$
$E[q_{t+1} \mid q_t, o_t, a_t] = f(q_t, a_t, o_t)$

Extension to Controlled Systems
Same principle: $P_t = W(Q_t)$
This time, the predictive state is a linear operator encoding the conditional distribution of future observations given future actions.
Example: Think of $Q_t$ and $P_t$ as conditional probability tables.
Requires appropriate modifications to S1 regression. S2 regression remains the same.

Extension to Controlled Systems
Two stage regression: It is all about finding
$E[Q_t \mid o_{t-k:t-1}, a_{t-k:t-1}]$
[Figure: $Q_t$ as a conditional probability table; rows index observations, columns index actions.]
0.1 0.8 0.5 0.2
0.3 0.1 0.2 0.1
0.6 0.1 0.3 0.7

Problem:
At each time step we observe a noisy version of a slice of $Q_t$: only the column of the action actually taken is seen.
[Figure: rows index observations, columns index actions.]
? ? 1 ?
? ? 0 ?
? ? 0 ?

The regression target is a function of the history: $E[Q_t \mid o_{t-k:t-1}, a_{t-k:t-1}] = Q(o_{t-k:t-1}, a_{t-k:t-1})$
Solution 1 (Joint Modeling):
Predict the joint distribution of observations and actions.
Manually convert to a conditional table (e.g. normalize columns).
[Figure: the joint indicator table used as the regression target; rows index observations, columns index actions.]
0 0 1 0
0 0 0 0
0 0 0 0

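The conversion step in Solution 1 can be sketched as a column normalization. The joint table below is hypothetical: it is built from the conditional probability table shown earlier by scaling each column with an assumed action probability, so that normalizing the columns should recover that table.

```python
def normalize_columns(joint):
    """Convert a joint table P(o, a) into a conditional table P(o | a).

    Rows index observations and columns index actions.
    """
    n_rows, n_cols = len(joint), len(joint[0])
    col_sums = [sum(joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[joint[i][j] / col_sums[j] for j in range(n_cols)] for i in range(n_rows)]


# Assumed action probabilities (hypothetical); they sum to 1.
action_probs = [0.4, 0.3, 0.2, 0.1]
cond = [
    [0.1, 0.8, 0.5, 0.2],
    [0.3, 0.1, 0.2, 0.1],
    [0.6, 0.1, 0.3, 0.7],
]
joint = [[c * p for c, p in zip(row, action_probs)] for row in cond]

recovered = normalize_columns(joint)
for row in recovered:
    print([round(x, 3) for x in row])
```

The division by the column sum removes the dependence on how often each action was taken, which is the information a joint model estimates and a conditional model does not need.
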
Solution 2 (Conditional Modeling):
Train a regression model to fit the observed slice of $Q_t$:
$\min_Q \sum_t \| Q(o_{t-k:t-1}, a_{t-k:t-1}) \, \psi_t^a - \psi_t^o \|^2$
[Figure: only the observed column enters the loss; the other entries are "don't care" (D/C). Rows index observations, columns index actions.]
D/C D/C 1 D/C
D/C D/C 0 D/C
D/C D/C 0 D/C

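A heavily simplified sketch of the conditional-modeling objective: here $Q$ is taken to be a constant table with no dependence on the history (an assumption made only to keep the example short), and $\psi_t^a$, $\psi_t^o$ are indicator vectors of the action and observation. Under those assumptions the least-squares minimizer has a closed form: each action's column is the average observation indicator over the steps where that action was taken. The name `fit_conditional_table` is hypothetical.

```python
def fit_conditional_table(pairs, n_obs, n_act):
    """Minimize sum_t ||Q psi_t^a - psi_t^o||^2 over a constant table Q.

    pairs holds one (action index, observation index) tuple per time step.
    Columns of actions that were never taken are unconstrained by the loss
    ("don't care") and are returned as None.
    """
    counts = [[0] * n_act for _ in range(n_obs)]
    taken = [0] * n_act
    for a, o in pairs:
        counts[o][a] += 1
        taken[a] += 1
    return [
        [counts[o][a] / taken[a] if taken[a] else None for a in range(n_act)]
        for o in range(n_obs)
    ]


pairs = [(0, 0), (0, 1), (0, 0), (1, 2), (2, 1), (2, 1)]
Q = fit_conditional_table(pairs, n_obs=3, n_act=4)
for row in Q:
    print(row)
```

Unlike the joint-modeling route, no column normalization is needed afterwards: each step contributes only to the column of the action taken, so the fitted columns are already conditional distributions.
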
Results
Predicting the pose of a swimming robot:
• Ground Truth
• Linear ARX
• RFFPSR (our model)

Reinforcement Learning with Predictive State Policy Networks
[Diagram: $(o_t, a_t)$ → PSRNN → feed-forward policy → $a_{t+1}$]
• Can be initialized
• States have a meaning: can be trained to match observations

Conclusions
- Predictive State Models for filtering in dynamical systems:
◦ Predictive State: State is a prediction of future observations.
◦ Two Stage Regression: Learning predictive state models can be reduced to supervised learning.
- Predictive State Models are a special type of recurrent network.
- Can be extended to controlled systems and employed in reinforcement learning.
[...] upon that to develop a principled and efficient approach for learning to predict and act in continuous systems?

Thank you!
* Hefny, Downey and Gordon, “Supervised Learning for Dynamical System Learning” (NIPS 2015), https://arxiv.org/abs/1505.05310
* Hefny, Downey and Gordon, “Supervised Learning for Controlled Dynamical System Learning”, https://arxiv.org/abs/1702.03537
* Downey, Hefny, Li, Boots and Gordon, “Predictive State Recurrent Neural Networks”, https://arxiv.org/abs/1705.09353