Exploration-Exploitation in Reinforcement Learning
Part 2: Regret Minimization in Tabular MDPs
Mohammad Ghavamzadeh, Alessandro Lazaric and Matteo Pirotta (Facebook AI Research)
Outline
1. Tabular Model-Based Algorithms: Optimistic, Randomized
2. Tabular Model-Free Algorithms
Website: https://rlgammazero.github.io
Minimax Lower Bound
Theorem (adapted from Jaksch et al. [2010])
For any MDP M* = ⟨S, A, p_h, r_h, H⟩ with stationary transitions (p_1 = p_2 = … = p_H), any algorithm A after K episodes suffers a regret of at least Ω(√(HSAT)), with T = HK.

If the transitions are non-stationary:
• p_1, …, p_H can be arbitrarily different
• the effective number of states is S' = HS
• the lower bound becomes Ω(H√(SAT))
Tabular MDPs: Outline
1. Tabular Model-Based Algorithms: Optimistic, Randomized
2. Tabular Model-Free Algorithms
The Optimism Principle: Intuition
Exploration vs. Exploitation
Optimism in the Face of Uncertainty: when you are uncertain, consider the best possible world (reward-wise).
• If the best possible world is correct ⟹ no regret (exploitation)
• If the best possible world is wrong ⟹ learn useful information (exploration)
Optimism is applied to the value function.
History: OFU for Regret Minimization in Tabular MDPs
(FH: finite-horizon, AR: average reward)
• Agrawal [1990]
• Auer and Ortner [2006] (AR)
• Bartlett and Tewari [2009] (AR)
• Filippi et al. [2010] (AR)
• Jaksch et al. [2010] (AR)
• Talebi and Maillard [2018] (AR)
• Fruit et al. [2018b] (AR)
• Fruit et al. [2018a] (AR)
• Fruit et al. [2019] (arXiv, not published)
• Tossou et al. [2019] (arXiv, not published; possibly incorrect)
• Qian et al. [2019] (AR)
• Zhang and Ji [2019] (AR)
• Azar et al. [2017] (FH)
• Zanette and Brunskill [2018] (FH)
• Kakade et al. [2018] (arXiv, not published; possibly incorrect)
• Jin et al. [2018] (FH)
• Zanette and Brunskill [2019] (FH)
• Efroni et al. [2019] (FH)
Learning Problem
Input: S, A, r_h, p_h
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,…,H; D_1 = ∅
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    Compute (Q_hk)_{h=1}^H from D_k        ← this step defines the type of algorithm
    Define π_k based on (Q_hk)_{h=1}^H
    for h = 1,…,H do
        Execute a_hk = π_hk(s_hk); observe r_hk and s_{h+1,k}
    end
    Add trajectory (s_hk, a_hk, r_hk)_{h=1}^H to D_{k+1}
end
Model-based Learning
Input: S, A, r_h, p_h
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,…,H; D_1 = ∅
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k:
        p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1((s_hi, a_hi, s_{h+1,i}) = (s, a, s'))
        r̂_hk(s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1((s_hi, a_hi) = (s, a))
    Planning (by backward induction) for π_hk
    for h = 1,…,H do
        Execute a_hk = π_hk(s_hk); observe r_hk and s_{h+1,k}
    end
    Add trajectory (s_hk, a_hk, r_hk)_{h=1}^H to D_{k+1}
end
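As a concrete illustration (not part of the original slides), a minimal numpy sketch of the count-based empirical model above; the function name and array layout are ours.

```python
import numpy as np

def empirical_model(trajectories, S, A, H):
    """Count-based estimates (p_hat, r_hat, N) from a list of episodes.

    trajectories: list of episodes, each a list of (s, a, r, s_next) tuples, one per stage h.
    Returns p_hat[h, s, a, s'], r_hat[h, s, a] and visit counts N[h, s, a].
    """
    N = np.zeros((H, S, A))
    r_sum = np.zeros((H, S, A))
    p_count = np.zeros((H, S, A, S))
    for episode in trajectories:
        for h, (s, a, r, s_next) in enumerate(episode):
            N[h, s, a] += 1
            r_sum[h, s, a] += r
            p_count[h, s, a, s_next] += 1
    n = np.maximum(N, 1)            # unvisited pairs keep zero estimates; the optimism handles them
    r_hat = r_sum / n
    p_hat = p_count / n[..., None]
    return p_hat, r_hat, N
```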
Measuring Uncertainty
Bounded parameter MDP [Strehl and Littman, 2008]:
M_k = {⟨S, A, r_h, p_h, H⟩ : ∀h ∈ [H], r_h(s,a) ∈ B^r_hk(s,a), p_h(·|s,a) ∈ B^p_hk(s,a), ∀(s,a) ∈ S×A}

Compact confidence sets:
B^r_hk(s,a) := [r̂_hk(s,a) − β^r_hk(s,a), r̂_hk(s,a) + β^r_hk(s,a)]
B^p_hk(s,a) := {p(·|s,a) ∈ ∆(S) : ‖p(·|s,a) − p̂_hk(·|s,a)‖_1 ≤ β^p_hk(s,a)}

Confidence bounds based on [Hoeffding, 1963] and [Weissman et al., 2003]:
β^r_hk(s,a) ∝ √(log(N_hk(s,a)/δ) / N_hk(s,a)),   β^p_hk(s,a) ∝ √(S log(N_hk(s,a)/δ) / N_hk(s,a))
Bounded Parameter MDP: Optimism
Fix a policy π.
[Figure: the set of plausible MDPs M_t contains the true MDP M*; over M ∈ M_t the values V^π_{1,M} span an interval whose upper end is UCB_t(V^π_1).]
Optimism: UCB_t(V^π_1) = max_{M ∈ M_t} V^π_{1,M} ≥ V^π_{1,M*}
Extended Value Iteration [Jaksch et al., 2010]
Input: S, A, B^r_hk, B^p_hk
Set Q_{H+1,k}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,…,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = max_{r_h ∈ B^r_hk(s,a)} r_h(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[V_{h+1,k}(s')]
                  = r̂_hk(s,a) + β^r_hk(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[V_{h+1,k}(s')]
        V_hk(s) = min{H − (h−1), max_{a∈A} Q_hk(s,a)}
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)        ← the policy executed at episode k
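The only non-trivial step of EVI is the inner maximization over the L1 confidence ball. Below is a minimal numpy sketch (names and layout ours) of the standard solution from Jaksch et al. [2010]: add as much mass as allowed to the highest-valued next state and remove the excess from the lowest-valued ones.

```python
import numpy as np

def inner_max_l1(p_hat, beta_p, v_next):
    """max of p @ v_next over p in the simplex with ||p - p_hat||_1 <= beta_p."""
    q = p_hat.astype(float).copy()
    best = int(np.argmax(v_next))
    q[best] = min(1.0, p_hat[best] + beta_p / 2.0)   # shift mass towards the best next state
    for s in np.argsort(v_next):                     # remove the excess from the worst states
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[s] = max(0.0, q[s] - excess)
    return float(q @ v_next)

# One EVI backup for a single (s, a), following the slide:
# Q[h, s, a] = r_hat[h, s, a] + beta_r[h, s, a] + inner_max_l1(p_hat[h, s, a], beta_p[h, s, a], V[h + 1])
```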
Optimism
[Figure: the true Q*_h(s,a) lies between 1 and H; EVI adds the uncertainty of the confidence sets on top of plain VI on the empirical model, so that the resulting estimate Q_hk(s,a) lies above Q*_h(s,a).]
∀h ∈ [H], ∀(s,a):  Q_hk(s,a) ≥ Q*_h(s,a)
UCRL2-CH for Finite Horizon
Theorem (adapted from [Jaksch et al., 2010])
For any tabular MDP with stationary transitions, UCRL2 with Chernoff-Hoeffding confidence intervals (UCRL2-CH) suffers, with high probability, a regret
R(K, M*, UCRL2-CH) = O(HS√(AT) + H²SA)
• Order optimal in √(AT)
• A factor √(HS) worse than the lower bound
Lower bound: Ω(√(HSAT)) (stationary transitions)
Extended Value Iteration
Q_hk(s,a) = max_{(r,p) ∈ B^r_hk(s,a) × B^p_hk(s,a)} r + pᵀV_{h+1,k}
          = max_{r ∈ B^r_hk(s,a)} r + max_{p ∈ B^p_hk(s,a)} pᵀV_{h+1,k}
          = r̂_hk(s,a) + β^r_hk(s,a) + max_{p ∈ B^p_hk(s,a)} pᵀV_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + ‖p − p̂_hk(·|s,a)‖_1 ‖V_{h+1,k}‖_∞ + p̂_hk(·|s,a)ᵀV_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + H β^p_hk(s,a) + p̂_hk(·|s,a)ᵀV_{h+1,k}
• Equivalent (up to this upper bound) to value iteration with an exploration bonus of (1 + H√S)β^r_hk(s,a) on the reward
UCBVI [Azar et al., 2017]
Replace EVI with an exploration bonus.
Input: S, A, B^r_hk, B^p_hk, r̂_hk, p̂_hk, b_hk
Set Q_{H+1,k}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,…,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = r̂_hk(s,a) + b_hk(s,a) + E_{s'∼p̂_hk(·|s,a)}[V_{h+1,k}(s')]
        V_hk(s) = min{H − (h−1), max_{a∈A} Q_hk(s,a)}
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)
Equivalent to value iteration on M̂_k = (S, A, r̂_hk + b_hk, p̂_hk, H).
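A minimal numpy sketch of this bonus-based backward induction (array layout and names ours, not from the tutorial):

```python
import numpy as np

def ucbvi_planning(p_hat, r_hat, bonus, H):
    """Backward induction on the empirical MDP with an optimistic bonus.

    p_hat[h, s, a, s'], r_hat[h, s, a], bonus[h, s, a] -> greedy policy pi[h, s] and Q values.
    """
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H + 1, S, A))                       # Q[H] = 0 is the terminal condition
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q[h] = r_hat[h] + bonus[h] + p_hat[h] @ V[h + 1]
        V[h] = np.minimum(H - h, Q[h].max(axis=1))    # clip at the maximum return-to-go
        pi[h] = Q[h].argmax(axis=1)
    return pi, Q
```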
UCBVI: Measuring Uncertainty
Combine the uncertainties in rewards and transitions in a smart way:
b_hk(s,a) = (H + 1)√(log(N_hk(s,a)/δ) / N_hk(s,a))  <  β^r_hk + H β^p_hk
This saves a √S factor:
| (p_h(·|s,a) − p̂_hk(·|s,a))ᵀ V*_{h+1} | ≤ H √(log(N_hk(s,a)/δ) / N_hk(s,a)) = β^p_hk / √S
(using ‖V*_{h+1}‖_∞ ≤ H)
UCBVI-CH: Regret
Theorem (Thm. 1 of Azar et al. [2017])
For any tabular MDP with stationary transitions, UCBVI with Chernoff-Hoeffding confidence intervals (UCBVI-CH) suffers, with high probability, a regret
R(K, M*, UCBVI-CH) = O(H√(SAT) + H²S²A)
• Order optimal in √(SAT)
• A factor √H worse than the lower bound
• Long "warm up" phase (large second-order term)
If the transitions are non-stationary, the bound becomes O(H^{3/2}√(SAT)).
Lower bound: Ω(√(HSAT)) (stationary transitions)
Refined Confidence Bounds
UCRL2 with Bernstein-Freedman bounds (instead of Hoeffding/Weissman)*, see the tutorial website:
R(K, M*, UCRL2B) = O(√(H Γ S A T) + H²S²A),  with Γ = max_{h,s,a} ‖p_h(·|s,a)‖_0 ≤ S
• Still not matching the lower bound!
UCBVI with Bernstein-Freedman bounds*:
R(K, M*, UCBVI-BF) = O(√(HSAT) + H²S²A + H√T)
• Matching the lower bound!
• But a long "warm up" phase
* stationary model (p_1 = … = p_H)
Refined Confidence Bounds
EULER [Zanette and Brunskill, 2019] keeps upper and lower bounds on V*_h:
R(K, M*, EULER) = O(√(Q* S A T) + √S · S A H² (√S + √H))
• Problem-dependent bound based on the environmental norm [Maillard et al., 2014]:
Q* = max_{s,a,h} ( V(r_h(s,a)) + V_{x∼p_h(·|s,a)}(V*_{h+1}(x)) ),  where V_{x∼p}(f(x)) = E_{x∼p}[(f(x) − E_{y∼p}[f(y)])²]
• Can remove the dependence on H
• Matches the lower bound in the worst case
UCRL2: RiverSwim
Bonuses used (with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ)):
Hoeffding:  b^r_hk(s,a) = r_max √(L/N),   b^p_hk(s,a) = √(SL/N)
Bernstein:  b^r_hk(s,a) = √(L V(r̂_hk)/N) + r_max L/N,   b^p_hk(s,a) = √(L V(p̂_hk)/N) + L/N
where V(r̂_hk) = (1/N) Σ_i (r_hi − r̂_hk)² is the population variance.
[Figure: cumulative regret R(K) of UCRL2-H and UCRL2-B on RiverSwim over 4·10⁴ episodes.]
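To make the two reward bonuses above concrete, here is a small numpy sketch for a single (s,a) (not from the tutorial; constants are illustrative and the transition bonuses follow the same pattern):

```python
import numpy as np

def reward_bonuses(reward_samples, S, A, delta, r_max=1.0):
    """Hoeffding vs. empirical-Bernstein reward bonus for one (s, a), as in the formulas above."""
    n = max(len(reward_samples), 1)
    L = np.log(S * A * n / delta)
    hoeffding = r_max * np.sqrt(L / n)
    var = np.var(reward_samples) if len(reward_samples) > 0 else 0.0   # population variance
    bernstein = np.sqrt(L * var / n) + r_max * L / n
    return hoeffding, bernstein
```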
UCBVI: RiverSwim
Bonuses used (with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ)):
Hoeffding:  b_hk(s,a) = (H − h) L / √N
Bernstein:  b_hk(s,a) = √(L V_{p̂_hk}(V_{h+1,k}) / N) + (H − h)L/N + (H − h)/√N
where V_p(V) = E_{x∼p}[(V(x) − μ)²] with μ = E_{x∼p}[V(x)].
[Figure: cumulative regret R(K) on RiverSwim over 4·10⁴ episodes for UCBVI-H and UCBVI-B, then compared with UCRL2-H and UCRL2-B.]
Model-Based Advantages
• Learning efficiency: first-order optimal, matching the lower bound
• Counterfactual reasoning: optimistic/pessimistic value estimates for any π, useful for inference (e.g., safety)
Model-Based Issues
Complexity:
• Space: O(HS²A) for a non-stationary model ⟹ H(S²A for the transitions + SA for the rewards)
• Time: O(K · HS²A), planning by VI at every episode
Can we use incremental updates?
Real-Time Dynamic Programming (RTDP) [Barto et al., 1995]
Input: S, A, r_h, p_h
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    for h = 1,…,H do
        a_hk ∈ argmax_{a∈A} r_h(s_hk, a) + p_h(·|s_hk, a)ᵀ V_{h+1,k−1}   (1-step planning)
        V_hk(s_hk) = r_h(s_hk, a_hk) + p_h(·|s_hk, a_hk)ᵀ V_{h+1,k−1}
        Observe s_{h+1,k} ∼ p_h(·|s_hk, a_hk)
    end
end
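A minimal numpy sketch of one RTDP step (names and array layout ours); as on the slide, the model r, p is known and only the visited state is updated:

```python
import numpy as np

def rtdp_step(s, h, V_next, r, p):
    """One RTDP step at state s and stage h, given r[h, s, a] and p[h, s, a, s'].

    V_next is the value table for stage h + 1 from the previous episode; the step performs
    1-step planning, updates only the visited state, and simulates (or observes) the next state.
    """
    q = r[h, s] + p[h, s] @ V_next                              # Q(s, a) for all actions
    a = int(np.argmax(q))
    v_new = float(q[a])                                         # new value of the visited state
    s_next = int(np.random.choice(len(V_next), p=p[h, s, a]))   # next state ~ p_h(.|s, a)
    return a, v_new, s_next
```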
Opt-RTDP: Incremental Planning [Efroni et al., 2019]
Idea: start from the model-based template and replace full planning by backward induction with optimism + RTDP.
Input: S, A, r_h, p_h
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]; D_1 = ∅
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k
    for h = 1,…,H do
        Build an optimistic estimate Q̄(s_hk, a) for all a ∈ A, using p̂_hk, r̂_hk and V_{h+1,k−1}
            (a forward estimate, not backward: it uses the next stage but the previous episode!)
        Set V_hk(s_hk) = min{V_{h,k−1}(s_hk), max_{a'∈A} Q̄(s_hk, a')}
        Execute a_hk = π_hk(s_hk) = argmax_{a∈A} Q̄(s_hk, a); observe r_hk and s_{h+1,k}
    end
    Add trajectory (s_hk, a_hk, r_hk)_{h=1}^H to D_{k+1}
end
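A minimal numpy sketch of one Opt-RTDP step (ours, not from the tutorial). For brevity it uses a bonus-based optimistic Q, whereas the slide's UCRL2-like variant maximizes over confidence sets:

```python
import numpy as np

def opt_rtdp_step(s, h, V, p_hat, r_hat, bonus):
    """One Opt-RTDP step at the visited state s, stage h (bonus-based sketch).

    Because updates are made forward in h, V[h + 1] still holds the previous-episode value
    when stage h is processed (next stage, previous episode). Only V[h, s] is updated,
    it can only decrease (clipping), and it stays optimistic.
    """
    q = r_hat[h, s] + bonus[h, s] + p_hat[h, s] @ V[h + 1]   # optimistic Q(s, .)
    a = int(np.argmax(q))
    V[h, s] = min(V[h, s], float(q.max()))                   # clipped, non-increasing update
    return a
```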
Opt-RTDP: Properties
Non-increasing estimates: V_hk(s) ≤ V_{h,k−1}(s). How?
• Optimistic initialization: V_h0(s) = H − (h−1)
• Clipping: V_hk(s_hk) = min{V_{h,k−1}(s_hk), max_{a'∈A} Q̄(s_hk, a')}
Opt-RTDP: Properties
Optimistic estimates: V_hk(s) ≥ V*_h(s). How?
• Optimistic initialization: V_h0(s) = H − (h−1)
• Optimistic update. Example (UCRL2-like step):
  Q̄(s_hk, a) = max_{r ∈ B^r_hk(s_hk,a)} r(s_hk, a) + max_{p ∈ B^p_hk(s_hk,a)} E_{s'∼p(·|s_hk,a)}[V_{h+1,k−1}(s')]
  V_{h+1,k−1} is one episode behind but still optimistic, hence Q̄ is optimistic too.
Opt-RTDP: Regret
Theorem (Thm. 8 of Efroni et al. [2019])
For any tabular MDP with stationary transitions, UCRL2-GP (Opt-RTDP based on UCRL2 with Hoeffding bounds) suffers, with high probability, a regret
R(K, M*, UCRL2-GP) = O(HS√(AT) + H²S^{3/2}A)
• Same regret as UCRL2-CH
• Computationally more efficient: O(SA) time per step, O(HSAK) total runtime
• The scheme can be adapted to any optimistic algorithm (e.g., UCBVI, EULER)
UCRL2-GP: RiverSwim
Bonuses as for UCRL2 (with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ)):
Hoeffding:  b^r_hk(s,a) = r_max √(L/N),   b^p_hk(s,a) = √(SL/N)
Bernstein:  b^r_hk(s,a) = √(L V(r̂_hk)/N) + r_max L/N,   b^p_hk(s,a) = √(L V(p̂_hk)/N) + L/N
[Figure: cumulative regret R(K) on RiverSwim over 4·10⁴ episodes for UCRL2-GP-H and UCRL2-GP-B, then compared with UCRL2-H and UCRL2-B.]
Tabular MDPs: Outline
1. Tabular Model-Based Algorithms: Optimistic, Randomized
2. Tabular Model-Free Algorithms
Posterior Sampling (PS), a.k.a. Thompson Sampling [Thompson, 1933]
• Keep a Bayesian posterior over the unknown MDP
• A sample from the posterior is used as an estimate of the unknown MDP
• Few samples ⟹ uncertainty in the estimate ⟹ exploration
• More samples ⟹ the posterior concentrates on the true MDP ⟹ exploitation
[Figure: posterior distribution μ_t over the set of MDPs.]
History: PS for Regret Minimization in Tabular MDPs
(FH: finite-horizon, AR: average reward)
• Thompson [1933]
• Strens [2000]
• Osband et al. [2013] (FH)
• Gopalan and Mannor [2015] (AR)
• Abbasi-Yadkori and Szepesvári [2015] (AR, possibly incorrect)
• Osband and Roy [2016] (arXiv, not published)
• Osband and Roy [2017] (FH, possibly incorrect)
• Ouyang et al. [2017] (AR, possibly incorrect assumptions)
• Agrawal and Jia [2017] (AR, possibly incorrect)
• Theocharous et al. [2018] (AR, possibly incorrect assumptions)
• Russo [2019] (FH)
Bayesian Regret 34
RB(K,µ1,A) = EM?∼µ1
[R(K,M?,A)︸ ︷︷ ︸
:=E[R(K,M?,A)
]]
= EM?
[K∑k=1
V ?1,M?(s1k)− V πk
1,M?(s1k)
]
Ghavamzadeh, Lazaric and Pirotta
Posterior Sampling [Osband and Roy, 2017]
Input: S, A, r_h, p_h, prior μ_1
Initialize D_1 = ∅
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    Sample M_k ∼ μ_k(·|D_k)
    Compute π_k ∈ argmax_π V^π_{1,M_k}
    for h = 1,…,H do
        Execute a_hk = π_hk(s_hk); observe r_hk and s_{h+1,k}
    end
    Add trajectory (s_hk, a_hk, r_hk)_{h=1}^H to D_{k+1}
end

Prior distribution: ∀Θ, P(M* ∈ Θ) = μ_1(Θ)
Posterior distribution: ∀Θ, P(M* ∈ Θ | D_k, μ_1) = μ_k(Θ)
Priors: Dirichlet (transitions); Beta, Normal-Gamma, etc. (rewards)
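A minimal numpy sketch of the "sample, then plan" step of PSRL (ours, not from the tutorial), assuming a Dirichlet posterior on transitions and, as an illustrative choice, a Beta posterior on Bernoulli rewards:

```python
import numpy as np

def psrl_policy(alpha, beta_a, beta_b):
    """Sample an MDP from the posterior and plan on it by backward induction.

    alpha[h, s, a, s']       : Dirichlet parameters of the transition posterior
    beta_a, beta_b[h, s, a]  : Beta parameters of a Bernoulli-reward posterior (an assumption)
    Returns the greedy policy pi[h, s] of the sampled MDP.
    """
    H, S, A, _ = alpha.shape
    p = np.apply_along_axis(np.random.dirichlet, -1, alpha)   # sampled transitions
    r = np.random.beta(beta_a, beta_b)                        # sampled mean rewards
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):                            # backward induction on the sample
        Q = r[h] + p[h] @ V
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi
```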
Model Update with Dirichlet Priors (assume r is known)
Update: (μ_t, (s_t, a_t, s_{t+1})) ↦ μ_{t+1}, with (s_t, a_t, s_{t+1}) ∼ H_t
• μ_t(s,a) = Dirichlet(α_1, …, α_S) on p(·|s,a)
• Observe s_{t+1} ∼ p(·|s_t, a_t) (the outcome of a multivariate Bernoulli) such that s_{t+1} = i. The Bayesian posterior is
  μ_{t+1}(s,a) = Dirichlet(α_1, …, α_i + 1, …, α_S)
• Posterior mean vector: p̄_{t+1}(s_i|s,a) = α_i / n, with n = Σ_{i=1}^S α_i
• Variance bounded by 1/n
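The conjugate update above is a one-line counter increment; a minimal numpy sketch (names ours):

```python
import numpy as np

def dirichlet_update(alpha, s, a, s_next):
    """Posterior update after observing (s, a) -> s_next: Dirichlet(alpha_1, ..., alpha_i + 1, ..., alpha_S)."""
    alpha[s, a, s_next] += 1.0
    return alpha

def posterior_mean_and_sample(alpha, s, a):
    n = alpha[s, a].sum()
    mean = alpha[s, a] / n                    # posterior mean: alpha_i / n
    sample = np.random.dirichlet(alpha[s, a])  # a transition vector sampled for PSRL
    return mean, sample
```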
Posterior Sampling is Usually Better
[Figures: bandit experiments from Chapelle and Li [2011] and finite-horizon RL experiments from Osband and Roy [2017].]
PSRL: Regret
Theorem (Osband and Roy [2017], revisited)
For any prior μ_1 with independent Dirichlet priors over stationary transitions, the Bayesian regret of PSRL is bounded as
R_B(K, μ_1, PSRL) = O(HS√(AT))
• Order optimal in √(AT)
• A factor √(HS) suboptimal
Lower bound: Ω(√(HSAT)) (stationary transitions)
* The bound claimed in [Osband and Roy, 2017] is O(H√(SAT)) for stationary MDPs, but there is a mistake in their Lem. 3 (see [Qian et al., 2020]).
PSRL: RiverSwim
[Figure: cumulative regret R(K) of PSRL on RiverSwim over 4·10⁴ episodes.]
[Figure: the same experiment comparing PSRL with UCRL2-B and UCBVI-B.]
Tabular Randomized Least-Squares Value Iteration (RLSVI) [Russo, 2019]
Input: S, A, H
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    Run Tabular-RLSVI on D_k to get (Q_hk)_{h=1}^H
    for h = 1,…,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_hk(s_hk, a); observe r_hk and s_{h+1,k}
    end
    Add trajectory (s_hk, a_hk, r_hk)_{h=1}^H to D_{k+1}
end
* It is not necessary to store all the data: the estimates can be updated incrementally.
Tabular-RLSVI
Input: dataset D_k = (s_hi, a_hi, r_hi)_{h=1,i=1}^{H,k}
Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H):
  p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1((s_hi, a_hi, s_{h+1,i}) = (s, a, s'))
  r̂_hk(s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1((s_hi, a_hi) = (s, a))
for h = H,…,1 do  // backward induction
    Sample ξ_hk ∼ N(0, σ²_hk I)
    Compute, ∀(s,a) ∈ S×A:  Q_hk(s,a) = r̂_hk(s,a) + ξ_hk(s,a) + Σ_{s'∈S} p̂_hk(s'|s,a) V_{h+1,k}(s')
end
return (Q_hk)_{h=1}^H
⟹ planning on the MDP M̃_k = (S, A, p̂_hk, r̂_hk + ξ_hk, H)
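A minimal numpy sketch of this randomized backward induction (ours; array layout and the way σ is passed are assumptions):

```python
import numpy as np

def rlsvi_planning(p_hat, r_hat, sigma, H):
    """Backward induction with Gaussian reward perturbations (randomized value iteration).

    p_hat[h, s, a, s'], r_hat[h, s, a], sigma[h, s, a] (per-(h, s, a) noise scale).
    """
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H, S, A))
    V = np.zeros(S)
    for h in range(H - 1, -1, -1):
        xi = sigma[h] * np.random.randn(S, A)      # xi_hk ~ N(0, sigma_hk^2 I)
        Q[h] = r_hat[h] + xi + p_hat[h] @ V
        V = Q[h].max(axis=1)
    return Q                                       # act greedily: argmax_a Q[h, s, a]
```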
Tabular-RLSVI: Frequentist Regret
Theorem (Russo [2019])
For any tabular MDP with non-stationary transitions, Tab-RLSVI with
σ_hk(s,a) = O(√(SH³ / (N_hk(s,a) + 1)))
suffers, with high probability, a frequentist regret
R(K, M*, Tab-RLSVI) = O(H^{5/2} S^{3/2} √(AT))
• Order optimal in √(AT)
• A factor H^{3/2} S worse than the lower bound Ω(H√(SAT))
• The analysis can be improved!
Tab-RLSVI (varying σ): RiverSwim
Noise scales (with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ)):
σ1 (theory):  σ_h(s,a) = (1/4)√((H−h)³ S L / N)
σ2:           σ_h(s,a) = (1/4)√((H−h)² L / N)
σ3:           σ_h(s,a) = (1/4)√(L / N)
[Figure: cumulative regret R(K) on RiverSwim over 4·10⁴ episodes for Tab-RLSVI with σ1, σ2 and σ3.]
[Figure: the same experiment comparing Tab-RLSVI with σ3 against UCRL2-H.]
Tabular MDPs: Outline
1. Tabular Model-Based Algorithms: Optimistic, Randomized
2. Tabular Model-Free Algorithms
Model-Based Issues
Complexity:
• Space: O(HS²A) for a non-stationary model ⟹ H(S²A for the transitions + SA for the rewards)
• Time: O(K · HS²A), planning by VI at every episode
Solutions:
1. Time complexity: incremental planning (e.g., Opt-RTDP)
2. Space complexity: avoid estimating rewards and transitions ⟹ Optimistic Q-learning (Opt-QL), with space O(HSA) and time O(HAK)
RiverSwim: Q-learning with ε-greedy Exploration
ε schedules tried:
• ε_t = 1.0
• ε_t = 0.5
• ε_t = ε_0 / (N(s_t) − 1000)^{2/3}
• ε_t = 1.0 for t < 6000, then ε_0 / N(s_t)^{1/2}
• ε_t = 1.0 for t < 7000, then ε_0 / N(s_t)^{1/2}
Tuning the ε schedule is difficult and problem dependent.
Regret: Ω(min{T, A^{H/2}})
Optimistic Q-learning
Input: S, A, r_h, p_h
Initialize Q_h(s,a) = H − (h−1) and N_h(s,a) = 0 for all (s,a) ∈ S×A and h ∈ [H]
for k = 1,…,K do  // episodes
    Observe initial state s_1k (arbitrary)
    for h = 1,…,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_h(s_hk, a); observe r_hk and s_{h+1,k}
        Set N_h(s_hk, a_hk) = N_h(s_hk, a_hk) + 1 and t = N_h(s_hk, a_hk)
        Update  Q_h(s_hk, a_hk) = (1 − α_t) Q_h(s_hk, a_hk) + α_t (r_hk + V_{h+1}(s_{h+1,k}) + b_t)    ← b_t is an upper-confidence bonus
        Set V_h(s_hk) = min{H − (h−1), max_{a∈A} Q_h(s_hk, a)}
    end
end
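A minimal numpy sketch of one Opt-QL episode (ours; the env_step interface and the bonus constant c are assumptions, and the log term is the illustrative log(SAT/δ) from the slide below):

```python
import numpy as np

def optql_episode(env_step, s1, Q, V, N, H, T, c=1.0, delta=0.05):
    """One episode of optimistic Q-learning; env_step(h, s, a) -> (r, s_next) is assumed.

    Q[h, s, a] and V[h, s] must be initialized optimistically (Q[h] = V[h] = H - h, V[H] = 0),
    N[h, s, a] counts visits, and T = H * K only enters the log term of the bonus.
    """
    S, A = Q.shape[1], Q.shape[2]
    s = s1
    for h in range(H):
        a = int(np.argmax(Q[h, s]))                   # greedy w.r.t. the optimistic Q
        r, s_next = env_step(h, s, a)
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                     # the Opt-QL step size
        bonus = c * np.sqrt(H ** 3 * np.log(S * A * T / delta) / t)   # Hoeffding-style b_t
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V[h + 1, s_next] + bonus)
        V[h, s] = min(H - h, Q[h, s].max())
        s = s_next
```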
Step size α_t
Q-learning typically uses a step size α_t of order O(1/t) or O(1/√t), with t = N_hk(s,a).
Opt-QL uses α_t = (H + 1)/(H + t).
Step size α_t
Recursive Q-learning update (t = N_hk(s,a)):
Q_hk(s,a) = 1(t = 0)·H + Σ_{i=1}^t α^i_t (r_{k_i} + V_{h+1,k_i}(s_{h+1,k_i}) + b_i),  with α^i_t = α_i ∏_{j=i+1}^t (1 − α_j)
(optimistic initialization + weighted average of bootstrapped values; k_i = {k : N_hk(s,a) = i} is the episode of the i-th visit to (s,a) at stage h)
Idea: favor later updates. With α_t = (H+1)/(H+t), the last 1/H fraction of the samples of (s,a) carries non-negligible weight, while the earlier 1 − 1/H fraction is (mostly) forgotten.
Example: H = 10 and t = N_hk(s,a) = 1000.
[Figure: the weights α^i_{1000} as a function of i for α_i = (H+1)/(H+i), α_i = 1/i and α_i = 1/√i; the Opt-QL choice concentrates the weight on the most recent samples.]
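A small numpy sketch (ours) that computes the weights α^i_t from the recursion above and can be used to reproduce the comparison in the figure:

```python
import numpy as np

def update_weights(t, alpha_fn):
    """Weights alpha^i_t = alpha_i * prod_{j=i+1}^t (1 - alpha_j) of the i-th sample after t updates."""
    alphas = np.array([alpha_fn(i) for i in range(1, t + 1)])
    w = np.empty(t)
    tail = 1.0
    for i in range(t - 1, -1, -1):      # accumulate the product of (1 - alpha_j) for j > i
        w[i] = alphas[i] * tail
        tail *= (1.0 - alphas[i])
    return w

H, t = 10, 1000
w_optql = update_weights(t, lambda i: (H + 1) / (H + i))   # concentrates on the latest samples
w_avg   = update_weights(t, lambda i: 1.0 / i)             # uniform weights, all equal to 1/t
```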
Exploration Bonus b_t
Let t = N_hk(s,a). Then
| Σ_{i=1}^t α^i_t ( V*_{h+1}(s_{h+1,k_i}) − E_{s'|s,a}[V*_{h+1}(s')] ) | ≤ c √(H³ log(SAT/δ) / t) =: b_t
Note that Σ_{i=1}^t α^i_t = 1.
Optimistic Q-learning: Regret
Theorem ([Jin et al., 2018])
For any tabular MDP with non-stationary transitions, Opt-QL with Hoeffding-type bonuses (b_t = O(√(H³/t))) suffers, with high probability, a regret
R(K, M*, Opt-QL) = O(H²√(SAT) + H²SA)
• Order optimal in √(SAT)
• A factor H worse than the lower bound Ω(H√(SAT))
• A factor √H worse than model-based with Hoeffding bounds: UCBVI-CH with non-stationary p_h suffers O(H^{3/2}√(SAT))
• But better second-order terms
• The bound does not improve in stationary MDPs (i.e., p_1 = … = p_H)
Optimistic Q-learning: Example
[Figure: cumulative regret R(K) of Opt-QL over 4·10⁴ episodes.]
Refined Confidence Intervals
Opt-QL with Bernstein-Freedman bounds (instead of Hoeffding):
R(K) = O(H^{3/2}√(SAT) + √(H⁹S³A³))
• Still not matching the lower bound!
• A factor √H worse than model-based (e.g., UCBVI-BF)
Open Questions
1. Prove a frequentist regret bound for PSRL.
2. Should the gap between the regret of model-based and model-free algorithms exist?
3. Which algorithm is better in practice?
References
Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.
Rajeev Agrawal. Adaptive control of Markov chains under the weak accessibility. In 29th IEEE Conference on Decision and Control, pages 1426–1431. IEEE, 1990.
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49–56. MIT Press, 2006.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.
Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In NeurIPS, pages 12203–12213, 2019.
Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, 2010.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NeurIPS, pages 2998–3008, 2018a.
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018b.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. URL https://rlgammazero.github.io/docs/ucrl2b_improved.pdf.
Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL http://www.jstor.org/stable/2282952.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, pages 4868–4878, 2018.
Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.
Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. How hard is my MDP? "The distribution-norm to the rescue". In NIPS, pages 1835–1843, 2014.
Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.
Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In NeurIPS, pages 4891–4900, 2019.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. CoRR, abs/2001.11595, 2020.
Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In NeurIPS, pages 14410–14420, 2019.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950, 2000.
Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.
Georgios Theocharous, Zheng Wen, Yasin Abbasi, and Nikos Vlassis. Scalar posterior sampling with applications. In NeurIPS, pages 7696–7704, 2018.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. 2003.
Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5732–5740. JMLR.org, 2018.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.
Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In NeurIPS, pages 2823–2832, 2019.