Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms∗
Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
∗Based on joint research with Adithya Devraj (UF)
August 28, 2017
Outline
1 Reinforcement Learning ⊂ Stochastic Approximation
2 Fastest Stochastic Approximation
3 Introducing Zap Q-Learning
4 Conclusions
5 References
Stochastic Approximation

Goal: find θ∗ such that E[f(θ,W)]|θ=θ∗ = 0
What is Stochastic Approximation?

A simple goal: Find the solution θ∗ to

    f̄(θ∗) := E[f(θ,W)]|θ=θ∗ = 0

What makes this hard?

1. The function f and the distribution of the random vector W may not be known
   – we may only know something about the structure of the problem.
2. Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ.
3. The recursive algorithms we come up with are often slow, and their variance may be infinite (this is typical).
What is Stochastic Approximation? Example: Monte-Carlo

Monte-Carlo Estimation

c is a real-valued function of a random variable X. We want to compute the mean: η = E[c(X)]

SA formulation: Find θ∗ solving 0 = E[f(θ,X)] = E[c(X) − θ]

Algorithm:   θ(n) = (1/n) Σ_{i=1}^{n} c(X(i))

    ⟹  (n+1) θ(n+1) = Σ_{i=1}^{n+1} c(X(i)) = n θ(n) + c(X(n+1))

    ⟹  (n+1) θ(n+1) = (n+1) θ(n) + [c(X(n+1)) − θ(n)]

SA Recursion:   θ(n+1) = θ(n) + a_n f(θ(n), X(n+1)),   with a_n = 1/(n+1)
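As a concrete illustration of the recursion above, here is a minimal Python sketch of the SA iteration θ(n+1) = θ(n) + a_n [c(X(n+1)) − θ(n)] with a_n = 1/(n+1); the choice c(x) = x² with X ∼ N(0,1) is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sa_monte_carlo(n_iter=100_000):
    """SA estimate of eta = E[c(X)], with c(x) = x**2 and X ~ N(0,1) (illustrative choices)."""
    theta = 0.0
    for n in range(n_iter):
        x = rng.standard_normal()
        a_n = 1.0 / (n + 1)              # step size a_n = 1/(n+1)
        theta += a_n * (x**2 - theta)    # f(theta, X) = c(X) - theta
    return theta

print(sa_monte_carlo())   # approaches E[X^2] = 1; identical to the running sample mean
```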
Optimal Control

MDP Model

X is a controlled Markov chain, with input U. For all states x and sets A,

    P{X(t+1) ∈ A | X(t) = x, U(t) = u, and prior history} = P_u(x, A)

c : X × U → R is a cost function, and β < 1 is a discount factor.

Value function:

    h∗(x) = min_U Σ_{t=0}^{∞} β^t E[c(X(t), U(t)) | X(0) = x]

Bellman equation:

    h∗(x) = min_u { c(x, u) + β E[h∗(X(t+1)) | X(t) = x, U(t) = u] }
Optimal Control: TD-Learning

Approximate value function obtained for a fixed policy:

    h^θ(x) ≈ h(x) = Σ_{t=0}^{∞} β^t E[c(X(t), U(t)) | X(0) = x]

Two flavors:

1. Mean-square criterion: min_θ E[(h(X) − h^θ(X))²]. Search for a solution to

       0 = E[(h(X) − h^θ(X)) ∇_θ h^θ(X)]

2. Galerkin relaxation of the Bellman equation: for a stationary vector-valued sequence {ζ_t},

       0 = E[(c(X(t), U(t)) + β h^θ(X(t+1)) − h^θ(X(t))) ζ_t]
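For concreteness, here is a minimal sketch of the Galerkin/TD(0) form of this recursion for a fixed policy, with a linear approximation h^θ(x) = θᵀψ(x) and ζ_t = ∇_θ h^θ(X(t)) = ψ(X(t)). The five-state chain, cost vector, and two-dimensional basis are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative fixed-policy Markov chain on 5 states (an assumption for this sketch)
P = np.array([[0.6, 0.4, 0.0, 0.0, 0.0],
              [0.3, 0.4, 0.3, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.3, 0.0],
              [0.0, 0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.0, 0.4, 0.6]])
cost = np.array([1.0, 0.5, 0.0, 0.5, 1.0])
beta = 0.9

def psi(x):
    """Two-dimensional linear basis; h_theta(x) = theta @ psi(x)."""
    return np.array([1.0, float(x)])

theta = np.zeros(2)
x = 0
for n in range(1, 200_001):
    x_next = rng.choice(5, p=P[x])
    zeta = psi(x)                                              # zeta_t = grad_theta h_theta(X(t))
    d = cost[x] + beta * theta @ psi(x_next) - theta @ psi(x)  # temporal-difference term
    theta += (1.0 / n) * d * zeta                              # SA step toward the Galerkin root
    x = x_next

print(theta)   # theta solving 0 = E[(c + beta h(X(t+1)) - h(X(t))) zeta_t]
```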
Q-Learning: Another Galerkin Relaxation

Bellman equation (with Q∗(x) := min_u Q∗(x, u)):

    Q∗(x, u) = c(x, u) + β E[Q∗(X(t+1)) | X(t) = x, U(t) = u]

Notation and derivation:

    h∗(x) = min_u Q∗(x, u),   Q∗(x, u) := c(x, u) + β E[h∗(X(t+1)) | X(t) = x, U(t) = u]

Galerkin relaxation of the Bellman equation: for a stationary vector-valued sequence {ζ_t},

    0 = E[(c(X(t), U(t)) + β Q^θ(X(t+1)) − Q^θ(X(t), U(t))) ζ_t]

General Q-Learning Algorithm: for t ≥ 0,

    θ(t+1) = θ(t) + a_t (c(X(t), U(t)) + β Q^{θ(t)}(X(t+1)) − Q^{θ(t)}(X(t), U(t))) ζ_t

Does this work?
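The following Python sketch is the tabular (Watkins) special case of the general recursion above, with ζ_t the indicator of the current state-action pair and the one-argument Q denoting the minimum over actions. The small random MDP, exploration rule, and per-pair step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 4-state, 2-action MDP with costs (minimization, as in the slides)
n_x, n_u, beta = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_u))   # P[x, u] = distribution of X(t+1)
cost = rng.uniform(0.0, 1.0, size=(n_x, n_u))

Q = np.zeros((n_x, n_u))
visits = np.zeros((n_x, n_u))
x = 0
for t in range(500_000):
    u = rng.integers(n_u)                                # randomized exploration (an assumption)
    x_next = rng.choice(n_x, p=P[x, u])
    visits[x, u] += 1
    a_t = 1.0 / visits[x, u]                             # one common per-pair step-size choice
    td = cost[x, u] + beta * Q[x_next].min() - Q[x, u]   # c + beta Q(X(t+1)) - Q(X(t),U(t))
    Q[x, u] += a_t * td                                  # zeta_t = indicator of (X(t), U(t))
    x = x_next

print(Q.min(axis=1))   # approximates the optimal value function h*
```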
ODE Method

Goal:    f̄(θ∗) := E[f(θ,W)]|θ=θ∗ = 0

Method:  θ(n+1) = θ(n) + a_n f(θ(n), X(n))

ODE:     (d/dt) ϑ(t) = f̄(ϑ(t)),   with θ∗ a stationary point

Assumptions (see Borkar's monograph):

1. Σ a_n = ∞, Σ a_n² < ∞. We will take a_n = (n+1)⁻¹.
2. f(θ(n), X(n)) = f̄(θ(n)) + ∆(n+1) + d(n), where ∆ is white noise and d(n) is a vanishing term.
3. The ODE is asymptotically stable, f̄ is Lipschitz, and conditions hold to ensure that {θ(n)} is a bounded sequence.
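A small sketch of the ODE intuition, using the Monte-Carlo example from earlier: the SA iterates (a_n = 1/(n+1)) are compared against an Euler discretization of the limiting ODE d/dt ϑ(t) = f̄(ϑ(t)) run on the same step-size sequence. The target value η = 1 and the Gaussian noise are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

eta = 1.0                              # illustrative target: E[c(X)]
f_bar = lambda theta: eta - theta      # averaged vector field, f(theta) = E[f(theta, X)]

theta_sa, theta_ode = 0.0, 0.0
for n in range(5000):
    a_n = 1.0 / (n + 1)
    # SA iterate: noisy observation of the drift
    c_obs = eta + rng.standard_normal()
    theta_sa += a_n * (c_obs - theta_sa)
    # Euler step of the ODE, taken on the "SA clock" t_n = a_0 + ... + a_{n-1}
    theta_ode += a_n * f_bar(theta_ode)

print(theta_sa, theta_ode)   # both approach the stationary point theta* = eta
```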
Fastest Stochastic Approximation
Algorithm Performance

Once we have settled that we have convergence, lim_{n→∞} θ(n) = θ∗, we are very happy to see that one of these algorithms is working great!

[Figure: sample paths θ(k) vs. k for two algorithms converging to θ∗; one is not so great]

Two standard approaches to evaluate performance, writing θ̃(n) := θ(n) − θ∗ for the error:

1. Finite-n bound:

       P{‖θ̃(n)‖ ≥ ε} ≤ exp(−I(ε, n)),   I(ε, n) = O(nε²)

2. Asymptotic covariance:

       Σ = lim_{k→∞} k E[θ̃(k) θ̃(k)ᵀ],   √n θ̃(n) ≈ N(0, Σ)
Finite-n Bounds

Beautiful analysis exists for:

- Watkins' Q-learning (Szepesvari)
- Speedy Q-learning (Azar et al.)
- See also Moulines & Bach for optimization.

Hope for other algorithms? Not today! Even for the Monte-Carlo estimate driven by a geometrically ergodic Markov chain X,

    θ(n) = (1/n) Σ_{i=1}^{n} c(X(i)),   n ≥ 1,

a useful bound on P{‖θ̃(n)‖ ≥ ε} is not available in general.

Fortunately, the asymptotic theory is complete.
Asymptotic Covariance

The asymptotic theory is as complete as for i.i.d. X. Consider the asymptotic covariance:

    Σ = lim_{n→∞} Σ_n := lim_{n→∞} n E[θ̃(n) θ̃(n)ᵀ]

Computation based on a Taylor series:

    θ̃(n+1) ≈ θ̃(n) + (1/n)(A θ̃(n) + ∆(n+1)),   A = (d/dθ) f̄(θ∗)

More Taylor series yields an SA recursion for {Σ_n}:

    Σ_{n+1} ≈ Σ_n + (1/n){(A + ½I) Σ_n + Σ_n (A + ½I)ᵀ + Σ_∆}
Asymptotic Covariance

SA recursion for {Σ_n}:

    Σ_{n+1} ≈ Σ_n + (1/n){(A + ½I) Σ_n + Σ_n (A + ½I)ᵀ + Σ_∆}

Conclusions:

1. If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite.
2. If Re λ(A) < −½ for every eigenvalue, then Σ = lim_{n→∞} Σ_n is the unique solution to the Lyapunov equation

       0 = (A + ½I) Σ + Σ (A + ½I)ᵀ + Σ_∆
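When A and Σ_∆ are known, the limiting covariance can be obtained directly from this Lyapunov equation; below is a small sketch using SciPy's continuous Lyapunov solver. The 2×2 matrices are made up for illustration, chosen so that every eigenvalue of A has real part below −1/2.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative linearization A = d/dtheta f(theta*) and noise covariance Sigma_Delta
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])

# Solve 0 = (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta
A_bar = A + 0.5 * np.eye(2)
Sigma = solve_continuous_lyapunov(A_bar, -Sigma_Delta)

print(Sigma)
print(A_bar @ Sigma + Sigma @ A_bar.T + Sigma_Delta)   # residual should be ~0
```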
Asymptotic Covariance

The asymptotic covariance of Q-learning is typically infinite, and it shows!

[Figure: histogram of the Q-learning estimate θ_n(15) at n = 10⁶, compared with θ∗(15); example from Devraj & Meyn 2017]
Hidden Theory

Seemingly lost knowledge: Reinforcement Learning ⊂ Stochastic Approximation.

There exist dozens of RL papers searching for the right matrix gain:

    θ(n+1) = θ(n) + G_n f(θ(n), X(n+1))

Less ambitious: the best choice of scalar gain, G_n = g/n.

Q-learning is often great, but only with the right choice of g.
Optimal Covariance

Introduce a matrix gain:

    θ(n+1) = θ(n) + (1/(n+1)) G_n f(θ(n), X(n))

Assume it converges, and linearize:

    θ̃(n+1) ≈ θ̃(n) + (1/(n+1)) G (A θ̃(n) + ∆(n+1)),   A = (d/dθ) f̄(θ∗)

This looks like the Monte-Carlo recursion if G = G∗ := −A⁻¹, and it results in the optimal covariance:

    Σ∗ = G∗ Σ_∆ G∗ᵀ

Example: LSTD(λ), though this was not its authors' motivation!
Optimal Variance

Example: return to Monte-Carlo,

    θ(n+1) = θ(n) + (g/(n+1)) (−θ(n) + X(n+1)),   ∆(n) := X(n) − E[X(n)]

Normalization for analysis (recenter at θ∗):

    θ̃(n+1) = θ̃(n) + (g/(n+1)) (−θ̃(n) + ∆(n+1))

Example: X(n) = W²(n) with W ∼ N(0, 1), σ²_∆ = 3.

Asymptotic variance as a function of g:

    Σ = σ²_∆ g² / (2g − 1)

    g      Σ
    0.5    ∞
    1      3
    10     15.8
    20     30.8

[Figures: Σ as a function of g, and SA estimates of E[W²] for several gains]
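A quick simulation sketch of the scalar-gain recursion above, checking the formula Σ = σ²_∆ g²/(2g − 1) against the empirical variance of √n (θ(n) − θ∗) over independent runs; here σ²_∆ is also estimated from samples. The gains, run length, and number of runs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def scalar_gain_runs(g, n_iter, n_runs, rng):
    """Vectorized runs of theta(n+1) = theta(n) + (g/(n+1)) (X(n+1) - theta(n)), X = W^2."""
    theta = np.zeros(n_runs)
    for n in range(n_iter):
        x = rng.standard_normal(n_runs) ** 2
        theta += (g / (n + 1)) * (x - theta)
    return theta

n_iter, n_runs, theta_star = 10_000, 2000, 1.0                # theta* = E[W^2] = 1
sigma2_delta = np.var(rng.standard_normal(1_000_000) ** 2)    # empirical Var(W^2)

for g in [1.0, 2.0, 10.0]:
    finals = scalar_gain_runs(g, n_iter, n_runs, rng)
    empirical = n_iter * np.var(finals - theta_star)
    predicted = sigma2_delta * g**2 / (2 * g - 1)
    print(f"g={g:5.1f}   empirical n*Var = {empirical:7.2f}   predicted = {predicted:7.2f}")
```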
Optimal Variance

Two ways to achieve Σ∗ = A⁻¹ Σ_∆ A⁻¹ᵀ:

1. Polyak-Ruppert technique. Forget the matrix, and instead
   a) crank up the gain, e.g. a_n = n^{-2/3}, and
   b) average {θ(n)} a second time.

2. Stochastic Newton-Raphson: G_n ≈ G∗, such as

       G_n⁻¹ = −(1/n) Σ_{k=1}^{n} A_k,   A_k = (d/dθ) f(θ(k), X(k))

Design choices: PR is super-simple to implement, but may have large transients. NR is not universally applicable, but is often amazing, just like ordinary Newton-Raphson. (A code sketch comparing the two follows below.)
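Below is a minimal sketch contrasting the two approaches on an illustrative linear root-finding problem f(θ, W) = b(W) − A(W)θ with E[A(W)] invertible, so θ∗ solves the averaged equation; the matrices, noise level, and step-size exponents are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative linear problem: f(theta, W) = b(W) - A(W) @ theta, theta* = Abar^{-1} bbar
A_bar = np.array([[2.0, 0.5],
                  [0.0, 1.0]])
b_bar = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A_bar, b_bar)

def sample(rng):
    """Noisy observations of A(W) and b(W)."""
    return A_bar + 0.3 * rng.standard_normal((2, 2)), b_bar + 0.3 * rng.standard_normal(2)

n_iter = 50_000

# (1) Polyak-Ruppert: high gain a_n = n^(-2/3), then average the iterates
theta, theta_pr = np.zeros(2), np.zeros(2)
for n in range(1, n_iter + 1):
    A_n, b_n = sample(rng)
    theta += n ** (-2 / 3) * (b_n - A_n @ theta)
    theta_pr += (theta - theta_pr) / n               # running average of {theta(n)}

# (2) Stochastic Newton-Raphson: gain built from the averaged observed Jacobian
theta_nr, A_hat = np.zeros(2), np.eye(2)
for n in range(1, n_iter + 1):
    A_n, b_n = sample(rng)
    A_hat += (A_n - A_hat) / n                       # estimate of Abar (note A = -Abar here)
    theta_nr += np.linalg.solve(A_hat, b_n - A_n @ theta_nr) / n   # gain G_n ~ -A^{-1}

print("theta* :", theta_star)
print("PR     :", theta_pr)
print("SNR    :", theta_nr)
```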
Zap Q-Learning

[Figure: Bellman error vs. n (up to 10⁶) for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]
Zap Q-Learning

Zap Q-Learning ≡ SNR for Q-Learning? Not quite. Do we really want to average uniformly over past samples?

    G_n⁻¹ = −(1/n) Σ_{k=1}^{n} A_k,   A_k = (d/dθ) f(θ(k), X(k))

To better emulate Newton-Raphson, crank up the gain:

    G_{n+1}⁻¹ = G_n⁻¹ + γ_n [A_n − G_n⁻¹],

with γ_n ≫ 1/n. Example: γ_n = (1/n)^{0.85}.

What gives us the courage? Newton-Raphson is not usually globally convergent, even for deterministic root-finding.
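A minimal sketch of the Zap-style two-time-scale gain update on the same illustrative linear problem as above: the matrix estimate is driven with the faster step γ_n = n^{-0.85}, while the parameter update uses step 1/n and gain −Â_n⁻¹. This is only a sketch of the gain recursion on a toy problem, not the full Zap Q-learning algorithm, and the problem data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Same illustrative linear problem: f(theta, W) = b(W) - A(W) @ theta
A_bar = np.array([[2.0, 0.5],
                  [0.0, 1.0]])
b_bar = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A_bar, b_bar)

theta = np.zeros(2)
A_hat = -np.eye(2)                       # running estimate of A = d/dtheta fbar (here -A_bar)

for n in range(1, 50_001):
    A_obs = A_bar + 0.3 * rng.standard_normal((2, 2))
    b_obs = b_bar + 0.3 * rng.standard_normal(2)

    A_n = -A_obs                         # noisy observation of d/dtheta f(theta(n), X(n))
    gamma_n = n ** (-0.85)               # high-gain step: A_hat moves faster than theta
    A_hat += gamma_n * (A_n - A_hat)

    f_n = b_obs - A_obs @ theta
    theta += -np.linalg.solve(A_hat, f_n) / n    # gain G_n = -A_hat^{-1}, step a_n = 1/n

print(theta, theta_star)
```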
Zap Q-Learning: What gives us the courage?

Assumptions:

1. Classical setting, with complete basis
2. Unique optimal policy
3. γ_n = (1/n)^δ
4. Technical condition to deal with discontinuities

Assumption 1 implies a one-to-one mapping between Q^θ and its associated cost function c^θ.

Theorem (ODE limit):

    (d/dt) c^{ϑ(t)} = −c^{ϑ(t)} + c
Zap Q-Learning

Theorem: Everything we hope for. Convergence, optimal asymptotic covariance, and what looks like a good approximation of Newton-Raphson.

Conclusions from numerical experiments:

- Watkins' algorithm is worthless
- Watkins using a_n = g∗/n, with g∗ optimizing the asymptotic variance, is often awesome
- Polyak-Ruppert induces large transients, with a long-term effect
- Zap ... let's see ...
Zap Q-Learning: Optimize Walk to Cafe

[Figure: six-state "walk to cafe" graph]

Convergence with Zap gain γ_n = n^{-0.85}:

- Watkins' algorithm has infinite asymptotic covariance with a_n = 1/n
- The optimal scalar gain is approximately a_n = 1500/n

[Figure: Bellman error vs. n (up to 10⁶) for Watkins (g = 1500), Speedy Q-learning, Polyak-Ruppert averaging, and Zap (γ_n = α_n and γ_n = n^{-0.85}); discount factor β = 0.99]
Zap Q-Learning: Optimize Walk to Cafe

[Figure: theoretical vs. empirical densities of W_n = √n θ̃_n (entries #10 and #18), at n = 10⁴ and n = 10⁶, from 1000 trials]

The CLT gives a good prediction of finite-n performance. Discount factor: β = 0.99.
Zap Q-Learning: Optimize Walk to Cafe

Local convergence: θ(0) initialized in a neighborhood of θ∗.

[Figure: Bellman error vs. n, and histograms at n = 10⁶, for Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γ_n = α_n and γ_n = α_n^{0.85}; 2σ confidence intervals for the Q-learning algorithms]
Zap Q-Learning: Model of Tsitsiklis and Van Roy, Option Call

State space: R¹⁰⁰. Parameterized Q-function: Q^θ with θ ∈ R¹⁰.

- Re λ_i(A) > −1/2 for every eigenvalue, so the asymptotic covariance is infinite.
- The authors observed slow convergence and proposed a matrix gain sequence {G_n} (see the references for details).

[Figure: eigenvalues of A and of GA for the finance example]

We discovered that the favorite choice of gain in [23] barely meets the criterion Re λ(GA) < −1/2.
Zap Q-Learning: Model of Tsitsiklis and Van Roy, Option Call

State space: R¹⁰⁰. Parameterized Q-function: Q^θ with θ ∈ R¹⁰.

[Figure: theoretical vs. empirical densities of W_n = √n θ̃_n (entries #1 and #7) at n = 2 × 10⁶, from 1000 trials, for Zap-Q and the matrix-gain G-Q algorithm]
Conclusions: Hidden Theory?

- Reinforcement Learning is cursed by variance, and not simply by dimension. We need better design tools to improve performance.
- There is little theory to support useful finite-n bounds, and these bounds give little insight for algorithm improvement.
- The asymptotic covariance is an awesome design tool. Example: g = 1500 was chosen based on the asymptotic covariance. It is also predictive of finite-n performance.

Other open questions:

- Algorithm design in specific contexts
- Adaptive optimization of algorithm parameters
Thank you!
References

[Images: covers of "Control Techniques for Complex Networks" (Meyn, 2007) and "Markov Chains and Stochastic Stability" (Meyn & Tweedie, 2009)]
Selected References

[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
[2] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447-469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See the last chapter on simulation and average-cost TD learning.
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236-245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages 98-107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796-819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451-459. Curran Associates, Inc., 2011.
[13] C. Szepesvari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9-44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674-690, 1997.
[18] C. Szepesvari. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th International Conference on Neural Information Processing Systems, pages 1064-1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840-1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207-239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33-57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233-246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79-110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In IEEE Conference on Decision and Control, pages 3598-3605, Dec. 2009.