'Learning to Control' an Unknown System
Simons Institute RTDM Program - Societal Networks Workshop, March 26, 2018
Joint work with Mukul Gagrani (USC), Ashutosh Nayyar (USC) and Yi Ouyang (UC Berkeley)
Rahul Jain
University of Southern California
Outline
I. MDPs, Dynamic Programming
II. Bandit Models, Online Learning
III. PSDE: An RL Algorithm for Unknown MDPs
IV. PSDE Algorithm for Unknown Linear Stochastic Systems
A Markov Decision Process
Average reward of a policy 𝛑:
$$V_\pi(\theta) = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r(x_t, u_t)\Big]$$
[Diagram: under control policy 𝛑(u|x;𝜽), state x transitions to y with probability 𝜽(y|x,u), yielding reward r(x,u).]
Finite state space X, finite action space U
Transition kernel 𝜽 known
Dynamic Programming
Weakly communicating finite MDP
Optimal average reward: $V^*(\theta) = \sup_\pi V_\pi(\theta)$
Bellman equation:
$$V^*(\theta) + w^*(x, \theta) = \sup_u \Big[\, r(x, u) + \mathbb{E}[w^*(y, \theta) \mid x, u] \,\Big] = \sup_u \Big[\, r(x, u) + \sum_y \theta(y \mid x, u)\, w^*(y, \theta) \,\Big]$$
‣ w*(x,𝜽) is the relative value function
Solve by average-reward DP algorithms (see the sketch below)
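One concrete average-reward DP algorithm is relative value iteration; below is a minimal sketch, not from the talk, with an illustrative interface (`P` holds the kernel 𝜽(y|x,u), `r` the rewards):

```python
import numpy as np

def relative_value_iteration(P, r, tol=1e-8, max_iter=100_000):
    """Average-reward DP for a weakly communicating finite MDP (a sketch).

    P: transition kernel theta(y|x,u), shape (X, U, X); r: rewards, shape (X, U).
    Iterates the Bellman operator and normalizes at a reference state, so the
    offset converges to the optimal average reward V* and the residual to the
    relative value function w*.
    """
    X, U, _ = P.shape
    w = np.zeros(X)
    gain = 0.0
    for _ in range(max_iter):
        Q = r + P @ w              # Q(x,u) = r(x,u) + sum_y theta(y|x,u) w(y)
        w_new = Q.max(axis=1)
        gain = w_new[0]            # gain estimate at the reference state 0
        w_new = w_new - gain       # keep relative values bounded
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    policy = (r + P @ w).argmax(axis=1)
    return gain, w, policy
```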
Unknown Model
True parameter 𝜽o unknown, distributed according to a known prior μ
Learning policy 𝜙t(ht), where the history ht = (past states and actions)
Objective of learning: find a nearly optimal policy at the fastest possible rate
[Diagram: the same MDP, but with transition kernel 𝜽 unknown; the controller must learn 𝜽 (estimate 𝜽̂) while choosing actions via 𝛑(u|x;𝜽̂).]
[Borkar-Varaiya’82]
Bandit Models and Online Learning
Two coins with unknown biases 𝜽1 and 𝜽2; reward on Heads = $1, on Tails = $0
Objective: "max expected long-term total reward" ≡ min expected regret
$$R_T(\phi) = T\,\theta_{\max} - \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big]$$
Lai & Robbins (1985) lower bound: Ω(log T)
Optimism in the Face of Uncertainty (OFU): play the arm with the highest index $g_i(t, t_i) = \bar{X}_i + \sqrt{2 \log t / t_i}$
‣ The UCB1 algorithm achieves O(log T) regret [Agrawal'95, Auer et al.'02] (sketched below)
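A minimal sketch of UCB1 with the index above; `pull(i)` is an assumed environment callback (not from the talk) returning the 0/1 reward of arm i:

```python
import numpy as np

def ucb1(pull, num_arms, T):
    """UCB1: play the arm maximizing g_i(t, t_i) = Xbar_i + sqrt(2 log t / t_i)."""
    counts = np.zeros(num_arms)   # t_i: number of pulls of arm i
    means = np.zeros(num_arms)    # Xbar_i: empirical mean reward of arm i
    # Pull each arm once to initialize the indices.
    for i in range(num_arms):
        means[i] = pull(i)
        counts[i] = 1
    for t in range(num_arms + 1, T + 1):
        index = means + np.sqrt(2 * np.log(t) / counts)  # optimistic index
        i = int(np.argmax(index))
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]           # running average
    return means, counts
```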
Posterior Sampling Algorithms
The (Thompson) Posterior Sampling Algorithm
m arms with unknown parameters 𝜽1, 𝜽2, …, 𝜽m
‣ Maintain a belief (posterior distribution) µi over each 𝜽i
‣ Sample 𝜽̂i from µi
‣ Choose i* = arg maxi 𝜽̂i
Achieves expected regret RT(𝜙) = O(log T)
‣ Advantage: superior numerical performance, computationally simpler
‣ Thompson'33, Chapelle-Li'11, Agrawal-Goyal'12
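For the Bernoulli arms above, the Beta distribution is the conjugate posterior, so the Bayes update is just a pair of counters; a minimal sketch, with `pull(i)` again an assumed callback:

```python
import numpy as np

def thompson_bernoulli(pull, num_arms, T, seed=0):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    wins = np.ones(num_arms)    # Beta alpha parameters (prior: 1)
    losses = np.ones(num_arms)  # Beta beta parameters (prior: 1)
    for _ in range(T):
        theta_hat = rng.beta(wins, losses)  # sample theta_i from posterior mu_i
        i = int(np.argmax(theta_hat))       # i* = arg max_i sampled theta_i
        r = pull(i)                         # observe 0/1 reward
        wins[i] += r                        # Bayes'-rule posterior update
        losses[i] += 1 - r
    return wins, losses
```

Sampling from the posterior randomizes the arm choice in proportion to the probability that each arm is optimal, which is where the exploration comes from.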
Learning an Unknown MDP
Learning policy 𝜙t(ht) to search over parameter space 𝚯
Objective of learning: minimize expected regret
$$R_T(\phi) = T\, V^*(\theta_o) - \mathbb{E}\Big[\sum_{t=1}^{T} r(x_t, u_t)\Big]$$
Lower bound: Ω(√T) [Tsitsiklis et al. (2010)]
OFU vs. PS
[Diagram: the same MDP, with 𝜽 unknown and learned over the space 𝚯.]
The PSDE Algorithm: Posterior Sampling with Dynamic Episodes
At each t, update the posterior using Bayes' rule: $\mu_t(\theta) = \mathbb{P}(\theta \mid h_t)$
Resample 𝜃 from the posterior 𝜇t at the end of every episode
‣ Compute the policy 𝛑*(𝜃) optimal for the sampled 𝜃
Dynamic episodes: an episode ends at the first t satisfying either rule (see the sketch below)
‣ Stopping Rule 1: $t > t_k + T_{k-1}$, i.e., the episode exceeds the previous episode's length
‣ Stopping Rule 2: $N_t(x, u) > 2 N_{t_k}(x, u)$ for some (x, u), i.e., some visit count doubles
[Timeline: episodes of lengths T1, T2, … beginning at 0, t1, t2, …; at each boundary, 𝜃 is resampled and 𝛑*(𝜃) recomputed, alternating exploration and exploitation.]
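A minimal sketch of the PSDE episode loop under these two stopping rules; the `sample_posterior`, `update_posterior`, `plan`, and `env_step` interfaces are illustrative assumptions, not the paper's code:

```python
import numpy as np

def psde_mdp(env_step, sample_posterior, update_posterior, plan, X, U, T, x0=0):
    """Sketch of the PSDE episode loop for a finite MDP (X states, U actions).

    Assumed interfaces (illustrative):
      sample_posterior()        -> transition kernel theta drawn from the belief
      update_posterior(x, u, y) -> Bayes'-rule update after observing next state y
      plan(theta)               -> optimal policy (state -> action array) for
                                   theta, e.g. relative value iteration above
      env_step(x, u)            -> next state drawn from the true (unknown) MDP
    """
    N = np.zeros((X, U), dtype=int)   # visit counts N_t(x, u)
    x, t_k, T_prev = x0, 0, 0
    N_k = N.copy()
    policy = None
    for t in range(T):
        # Resample theta when either stopping rule fires:
        #  (1) t > t_k + T_{k-1}: episode exceeds the previous episode's length
        #  (2) N_t(x, u) > 2 N_{t_k}(x, u) for some (x, u): a count has doubled
        if policy is None or t - t_k > T_prev or np.any(N > 2 * N_k):
            T_prev, t_k, N_k = t - t_k, t, N.copy()
            policy = plan(sample_posterior())   # Thompson sample + planning
        u = policy[x]
        y = env_step(x, u)
        N[x, u] += 1
        update_posterior(x, u, y)
        x = y
```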
Non-asymptotic Regret Bound for PSDE
Theorem.* If the MDP is weakly communicating and its span ≤ H, then
$$R_T(\text{PSDE}) \leq O\big(HX\sqrt{UT}\big),$$
where X is the state space size and U is the action space size.
‣ Holds up to logarithmic factors, with exact constants known
‣ The PSDE algorithm also works with approximately optimal policies in each episode
‣ Episode lengths can't increase faster than one step per episode ($T_k \leq T_{k-1} + 1$, by Stopping Rule 1)
*Y. Ouyang, M. Gagrani, A. Nayyar and R. Jain, "Learning Unknown MDPs: A Thompson Sampling Approach", NIPS, 2017.
Numerical Performance
RiverSwim benchmark problem
[Figure: cumulative regret (0 to 6000) vs. time steps (0 to 10×10⁴) for PSDE, UCRL2, TSMDP, and Lazy-PSRL.]
UCRL2: [Jaksch, Ortner, Auer (2010)]; TSMDP: [Gopalan & Mannor (2015)]; Lazy-PSRL: [Yadkori & Szepesvari (2015)]
Proof Outline
For any function f and any $h_{t_k}$-measurable random variable X, the posterior sampling algorithm must satisfy
$$\mathbb{E}[f(\theta_k, X)] = \mathbb{E}[f(\theta_o, X)]$$
Upper bound on the number of episodes:
$$K_T \leq \sqrt{2XUT \log T}$$
Upper bound on the gap between true and sampled parameters:
$$T\,\mathbb{E}[V^*(\theta_o)] - \sum_{k=1}^{K_T} \mathbb{E}\big[T_k V^*(\theta_k)\big] \leq \mathbb{E}[K_T]$$
Unknown Stochastic Linear System
Dynamics: $x_{t+1} = A x_t + B u_t + w_t$, with per-step cost $c_t = x_t^\top Q x_t + u_t^\top R u_t$
Parameters $\theta^\top = [A, B]$ unknown; states $x_t$ and actions $u_t$ observed
Control policy: $u_t = \phi_t(h_t)$
Regret: $R_T(\phi) = \mathbb{E}\big[\sum_{t=1}^{T} c_t - T J(\theta)\big]$
Optimal control policy is linear: $u = G(\theta)x$, where $G(\theta) = -(R + B^\top S(\theta) B)^{-1} B^\top S(\theta) A$ and $S(\theta)$ solves the Riccati equation
Assumption 1: There is a set Θ such that for all θ ∈ Θ, there is a unique p.d. solution $S(\theta)$ to the Riccati equation
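Given a sampled or estimated θ = [A, B], the gain G(θ) can be computed from the Riccati solution, e.g. via SciPy's `solve_discrete_are`; a minimal sketch:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Gain of the optimal linear policy u = G x for the system above.

    S is the unique p.d. solution of the discrete algebraic Riccati equation
    (Assumption 1), and G = -(R + B^T S B)^{-1} B^T S A.
    """
    S = solve_discrete_are(A, B, Q, R)
    G = -np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    return G, S
```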
Stochastic Adaptive Control
Classical Adaptive Control…
Certainty equivalence principle
‣ Astrom-Wittenmark’94, Sastry’89, Narendra’89
Cost-biased Max Likelihood approach
‣ Campi and Kumar’98, Prandini-Campi’01,…
Optimism in the Face of Uncertainty (OFU)
‣ Yadkori-Szepesvari’11,’15, Van Roy, et al’12,’13,’16 (computation!)
‣ Abeille-Lazaric'17: Õ(T^{2/3})
The Posterior Sampling with Dynamic Episodes (PSDE) Learning Algorithm
Posterior Sampling: from data $z_t = [x_t, u_t]$, maintain a posterior $\mu_t$ over the parameters θ
At the start of episode k: sample parameters $\theta_k$ from $\mu_{t_k}$; solve the Riccati equation; compute the gain $G(\theta_k)$
Dynamic Episodes: episode k ends when $\det(\Sigma_t) < 0.5 \det(\Sigma_{t_k})$ (the posterior covariance determinant halves) or when the episode length reaches $T_{k-1} + 1$ (see the sketch below)
[Timeline: episodes beginning at 0, t2, …, tk, tk+1, … with lengths T1, …, Tk = Tk-1 + 1.]
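Putting the pieces together, a minimal sketch under simplifying assumptions (known noise variance, Gaussian posterior over θ updated by recursive least squares, `env_step` an assumed environment interface):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def psde_lq(env_step, n, m, Q, R, T, noise_var=1.0, seed=0):
    """Sketch of PSDE for x_{t+1} = theta^T z_t + w_t, with z_t = [x_t; u_t].

    theta has shape (n+m, n) with theta^T = [A, B]; a Gaussian posterior
    N(mu_i, Sigma) is kept over each column theta_i and updated recursively.
    `env_step(x, u)` is an assumed interface returning the next state.
    """
    rng = np.random.default_rng(seed)
    d = n + m
    mu = np.zeros((d, n))              # posterior mean of theta
    Sigma = np.eye(d)                  # shared posterior covariance per column
    x = np.zeros(n)
    t_k, T_prev = 0, 0
    det_k = np.linalg.det(Sigma)
    G = None
    for t in range(T):
        # New episode on either rule: the episode outgrows the previous one,
        # or the posterior covariance determinant has halved.
        if G is None or t - t_k > T_prev or np.linalg.det(Sigma) < 0.5 * det_k:
            T_prev, t_k, det_k = t - t_k, t, np.linalg.det(Sigma)
            L = np.linalg.cholesky(Sigma)
            theta = mu + L @ rng.standard_normal((d, n))  # Thompson sample
            # (The algorithm restricts samples to the set Theta of Assumption 1;
            #  omitted here for brevity.)
            A_s, B_s = theta[:n].T, theta[n:].T           # theta^T = [A, B]
            S = solve_discrete_are(A_s, B_s, Q, R)        # Riccati solution
            G = -np.linalg.solve(R + B_s.T @ S @ B_s, B_s.T @ S @ A_s)
        u = G @ x
        x_next = env_step(x, u)
        z = np.concatenate([x, u])                        # regressor z_t
        Sz = Sigma @ z
        denom = noise_var + z @ Sz
        mu = mu + np.outer(Sz, x_next - mu.T @ z) / denom # Bayes mean update
        Sigma = Sigma - np.outer(Sz, Sz) / denom          # Bayes cov. update
        x = x_next
    return mu, Sigma
```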
√T-Regret of PSDE Algorithm
Assumption 2. The state space X is compact. This implies that for all θ ∈ Θ, the spectral radius $\rho(A_1 + B_1 G(\theta)) < \delta < 1$.
Theorem.* The expected regret of PSDE satisfies
$$R_T(\text{PSDE}) \leq O(\sqrt{T}).$$
[Figure: $R(T)/T$ vs. horizon (0 to 5000), with curves for δ = 0.99 and δ = 2; left panel: open-loop stable system, right panel: open-loop unstable system.]
*Y. Ouyang, M. Gagrani and R. Jain, "Learning-based Control of Unknown Linear Systems with Thompson Sampling", arXiv:1709.04047.
Conclusions
Simple Posterior Sampling (PS)-based learning-to-control algorithms
‣ For MDPs and linear stochastic systems
Trade off 'exploration vs. exploitation' nearly optimally to get O(√T) regret
‣ Unlike OFU-type algorithms, computationally simple
‣ A natural design
‣ A deterministic schedule possible?
Extensions
‣ Continuous state space MDPs via function approximation
‣ Time-varying systems
*Y. Ouyang, M. Gagrani and R. Jain, "Learning-based Control of Unknown Linear Systems with Thompson Sampling", arXiv:1709.04047, 2017.
*Y. Ouyang, M. Gagrani, A. Nayyar and R. Jain, "Learning Unknown MDPs: A Thompson Sampling Approach", NIPS, 2017.