Reinforcement Learning as Variational Inference: Two Recent Approaches
Rohith Kuditipudi
Duke University
11 August 2017
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Background: Reinforcement Learning
A Markov Decision Process (MDP) is a tuple (S, A, p, r, \gamma), where:
S is the state space
A is the action space
p(s_{t+1} | s_t, a_t) is the probability of the next state s_{t+1} \in S given the current state s_t \in S and action a_t \in A taken by the agent
r : S \times A \to [r_{\min}, r_{\max}] is the reward function
\gamma \in [0, 1] is the discount factor

A policy \pi(\cdot | s_t) is a distribution over A conditioned on the current state s_t.

The goal of reinforcement learning is to learn a policy \pi that maximizes the total expected (discounted) reward J(\pi), where:

J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]
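As a small illustration (mine, not from the slides), the discounted sum inside the expectation can be computed directly from a trajectory's reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a (finite) trajectory of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# J(pi) is the expectation of this quantity over trajectories drawn from pi.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.9**2 * 2.0 = 2.62
```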
Background: Stein Variational Gradient Descent (SVGD)
SVGD¹ is a variational inference algorithm that iteratively transports a set of particles \{x_i\}_{i=1}^n to match a target distribution.

At each step, particles are updated as follows:

x_i \leftarrow x_i + \epsilon \phi(x_i)

where \epsilon is the step size and \phi is a perturbation direction chosen to greedily minimize the KL divergence with the target distribution.

By restricting \phi to lie in the unit ball of an RKHS with corresponding kernel k, we can obtain a closed-form solution for the optimal perturbation direction...
¹Q. Liu and D. Wang. Stein Variational Gradient Descent: A General-Purpose Bayesian Inference Algorithm.
Background: Stein Variational Gradient Descent (SVGD)
Input: A target distribution with density function p(x) and a set of initial particles \{x_i^0\}_{i=1}^n
Output: A set of particles \{x_i\}_{i=1}^n that approximates the target distribution
for iteration \ell do
    x_i^{\ell+1} \leftarrow x_i^\ell + \epsilon_\ell \phi^*(x_i^\ell), where
    \phi^*(x) = \frac{1}{n} \sum_{j=1}^n \left[ k(x_j^\ell, x) \nabla_{x_j^\ell} \log p(x_j^\ell) + \nabla_{x_j^\ell} k(x_j^\ell, x) \right]
    and \epsilon_\ell is the step size at the \ell-th iteration
end
Algorithm 1: Stein Variational Gradient Descent (SVGD)
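To make Algorithm 1 concrete, here is a minimal NumPy sketch for a one-dimensional standard normal target with an RBF kernel; the bandwidth, step size, particle count, and initialization are illustrative choices, not from the slides.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel K[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i
    K = np.exp(-diff**2 / (2 * h**2))
    grad_K = -diff / h**2 * K             # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(x, grad_log_p, eps=0.1):
    """One SVGD update: x_i <- x_i + eps * phi*(x_i)."""
    K, grad_K = rbf_kernel(x)
    # phi*(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K * grad_log_p(x)[:, None] + grad_K).sum(axis=0) / len(x)
    return x + eps * phi

grad_log_p = lambda x: -x                 # target p(x) = N(0, 1)
x = np.random.uniform(-5, 5, size=50)     # initial particles
for _ in range(500):
    x = svgd_step(x, grad_log_p)
print(x.mean(), x.std())                  # should approach 0 and 1
```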
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Stein Variational Policy Gradient²: Preliminaries

Policy iteration: parametrize the policy as \pi(a_t | s_t; \theta) and iteratively update \theta to maximize J(\pi(a_t | s_t; \theta)) (a.k.a. J(\theta))

MaxEnt policy optimization: instead of searching for a single policy \theta, optimize a distribution q over \theta as follows:

\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] - \alpha D_{KL}(q \| q_0) \right\}

where q_0 is a prior on \theta.

If q_0 = constant, the objective simplifies to

\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] + \alpha H(q) \right\}

where H(q) is the entropy of q.

The optimal distribution is

q(\theta) \propto \exp\left( \frac{1}{\alpha} J(\theta) \right) q_0(\theta)

²Liu et al. Stein Variational Policy Gradient.
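As a one-step justification (not on the original slide), the optimal q follows by completing the KL divergence:

\mathbb{E}_{q(\theta)}[J(\theta)] - \alpha D_{KL}(q \| q_0)
= -\alpha \, \mathbb{E}_{q(\theta)}\left[ \log \frac{q(\theta)}{q_0(\theta) \exp(J(\theta)/\alpha)} \right]
= -\alpha \, D_{KL}\left( q \,\Big\|\, \tfrac{1}{Z} q_0 \exp(J/\alpha) \right) + \alpha \log Z

which is maximized exactly when q(\theta) \propto q_0(\theta) \exp(J(\theta)/\alpha).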
Stein Variational Policy Gradient: Algorithm
Input: Learning rate \epsilon, kernel k(x, x'), temperature \alpha, initial particles \{\theta_i\}
for t = 0, 1, ..., T do
    for i = 0, 1, ..., n do
        Compute \nabla_{\theta_i} J(\theta_i) (using RL method of choice)
    end
    for i = 0, 1, ..., n do
        \Delta\theta_i = \frac{1}{n} \sum_{j=1}^n \left[ \nabla_{\theta_j}\left( \frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j) \right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right]
        \theta_i \leftarrow \theta_i + \epsilon \Delta\theta_i
    end
end
Algorithm 2: Stein Variational Policy Gradient (SVPG)
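A minimal sketch of the SVPG particle update (mine, not the paper's code), assuming each policy's parameters are flattened into a row of `thetas`, that `grad_J[i]` is a policy-gradient estimate of \nabla_{\theta_i} J(\theta_i) supplied by an RL method of choice, and a flat prior q_0 (so its log-gradient vanishes) with an RBF kernel:

```python
import numpy as np

def svpg_update(thetas, grad_J, alpha=1.0, eps=1e-3, h=1.0):
    """One SVPG step on n policy-parameter particles.

    thetas: (n, d) array, one flattened parameter vector per policy.
    grad_J: (n, d) array of policy-gradient estimates, one per particle.
    Assumes a flat prior q_0, so the grad log q_0 term is dropped.
    """
    n = thetas.shape[0]
    diff = thetas[:, None, :] - thetas[None, :, :]       # diff[j, i] = theta_j - theta_i
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))   # K[j, i] = k(theta_j, theta_i)
    grad_K = -diff / h**2 * K[:, :, None]                # grad_{theta_j} k(theta_j, theta_i)
    # Delta theta_i = (1/n) sum_j [ (1/alpha) grad J(theta_j) k(theta_j, theta_i)
    #                               + grad_{theta_j} k(theta_j, theta_i) ]
    delta = (K[:, :, None] * grad_J[:, None, :] / alpha + grad_K).sum(axis=0) / n
    return thetas + eps * delta
```

The kernel term shares gradient information across policies, while the \nabla k term repels particles from one another, maintaining a diverse ensemble of policies.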
Stein Variational Policy Gradient: Results

[results figures]
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Soft Q-Learning³: Preliminaries

Recall the standard RL objective (assuming \gamma = 1):

\pi^*_{\mathrm{std}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t)]

In MaxEnt RL, we augment the reward with an entropy term that encourages exploration:

\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t) + \alpha H(\pi(\cdot | s_t))]

Note: when \gamma \neq 1, the MaxEnt RL objective becomes:

\arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ \sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}_{(s_l, a_l)}[r(s_l, a_l) + \alpha H(\pi(\cdot | s_l)) \mid s_t, a_t] \right]

³Haarnoja et al. Reinforcement Learning with Deep Energy-Based Policies.
Soft Q-Learning: Preliminaries
Theorem
Let the soft Q-function be defined by

Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{\pi^*_{\mathrm{MaxEnt}}}\left[ \sum_{l=1}^{\infty} \gamma^l \left( r_{t+l} + \alpha H(\pi^*_{\mathrm{MaxEnt}}(\cdot | s_{t+l})) \right) \right]

and the soft value function by

V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_A \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a') \right) da'

Then

\pi^*_{\mathrm{MaxEnt}}(a_t | s_t) = \exp\left( \frac{1}{\alpha} \left( Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t) \right) \right)
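For discrete actions the theorem reduces to a softmax over Q-values: the soft value is a log-sum-exp, and the optimal policy is the Boltzmann distribution over Q at temperature \alpha. A small numerical check (mine, with made-up Q-values):

```python
import numpy as np

alpha = 0.5
Q = np.array([1.0, 2.0, 0.5])                  # Q*_soft(s, a) for three actions
V = alpha * np.log(np.exp(Q / alpha).sum())    # soft value: alpha * log-sum-exp
pi = np.exp((Q - V) / alpha)                   # MaxEnt-optimal policy
print(pi, pi.sum())                            # a valid distribution (sums to 1)
# As alpha -> 0, V -> max(Q) and pi becomes greedy; large alpha gives near-uniform pi.
```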
Soft Q-Learning: Preliminaries
Theorem
The soft Q-function satisfies the soft Bellman equation

Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^*_{\mathrm{soft}}(s_{t+1})]

Theorem (Soft Q-iteration)
Let Q_{\mathrm{soft}}(\cdot, \cdot) and V_{\mathrm{soft}}(\cdot) be bounded and assume that \int_A \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(\cdot, a') \right) da' < \infty and that Q^*_{\mathrm{soft}} < \infty exists. Then the fixed-point iteration

Q_{\mathrm{soft}}(s_t, a_t) \leftarrow r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V_{\mathrm{soft}}(s_{t+1})], \quad \forall s_t, a_t
V_{\mathrm{soft}}(s_t) \leftarrow \alpha \log \int_A \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \right) da', \quad \forall s_t

converges to Q^*_{\mathrm{soft}} and V^*_{\mathrm{soft}}, respectively.
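A tabular sketch of soft Q-iteration on a toy MDP (the sizes, transitions, and rewards are made-up assumptions; this is my illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 3, 2, 0.9, 0.2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition probabilities
R = rng.standard_normal((nS, nA))              # r(s, a)

Q = np.zeros((nS, nA))
for _ in range(200):
    # V_soft(s) = alpha * log sum_a' exp(Q(s, a') / alpha)   (discrete actions)
    V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    # Q_soft(s, a) <- r(s, a) + gamma * E_{s' ~ p}[V_soft(s')]
    Q = R + gamma * P @ V
print(Q)  # the iterates converge, as the theorem guarantees
```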
Soft Q-Learning: Soft Q-iteration in Practice
In practice, we model the soft Q-function using a function approximator with parameters \theta, denoted Q^\theta_{\mathrm{soft}}.

To convert soft Q-iteration into a stochastic optimization problem, we can re-express V_{\mathrm{soft}}(s_t) as an expectation via importance sampling (instead of an integral) as follows:

V^\theta_{\mathrm{soft}}(s_t) = \alpha \log \mathbb{E}_{a' \sim q}\left[ \frac{\exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a') \right)}{q(a')} \right]

where q is the sampling distribution (e.g. the current policy).

Furthermore, we can update Q_{\mathrm{soft}} to minimize

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim q}\left[ \frac{1}{2} \left( \hat Q^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) - Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)^2 \right]

where \hat Q^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^{\bar\theta}_{\mathrm{soft}}(s_{t+1})] is a target value computed with target parameters \bar\theta.
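A quick check of the importance-sampling identity (my illustration; the Gaussian proposal q and toy Q-function are assumptions chosen to have a known closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, M = 0.5, 100_000

def Q_soft(a):
    """Toy stand-in for Q^theta_soft(s, a) at one fixed state."""
    return -a**2

a = rng.normal(size=M)                            # a' ~ q = N(0, 1)
q_pdf = np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)    # q(a')
# V_soft(s) = alpha * log E_q[ exp(Q(s, a') / alpha) / q(a') ]
V_hat = alpha * np.log(np.mean(np.exp(Q_soft(a) / alpha) / q_pdf))
# Exact value: alpha * log integral exp(-a^2 / alpha) da = alpha * log sqrt(pi * alpha)
print(V_hat, alpha * np.log(np.sqrt(np.pi * alpha)))
```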
Soft Q-Learning: Sampling from Soft Q-function
Goal: learn a state-conditioned stochastic neural network a_t = f^\phi(\xi; s_t), with parameters \phi, that maps Gaussian noise \xi to unbiased action samples from the EBM specified by Q^\theta_{\mathrm{soft}}.

Specifically, we want to minimize the following loss:

J_\pi(\phi; s_t) = D_{KL}\left( \pi^\phi(\cdot | s_t) \,\Big\|\, \exp\left( \frac{1}{\alpha} \left( Q^\theta_{\mathrm{soft}}(s_t, \cdot) - V^\theta_{\mathrm{soft}}(s_t) \right) \right) \right)

where \pi^\phi(\cdot | s_t) is the action distribution induced by \phi.

Strategy: sample actions a_t^{(i)} = f^\phi(\xi^{(i)}; s_t) and use SVGD to compute optimal greedy perturbations \Delta f^\phi(\xi^{(i)}; s_t) that minimize J_\pi(\phi; s_t), where:

\Delta f^\phi(\xi^{(i)}; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\left[ \kappa(a_t, f^\phi(\xi^{(i)}; s_t)) \, \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a') \big|_{a' = a_t} + \alpha \, \nabla_{a'} \kappa(a', f^\phi(\xi^{(i)}; s_t)) \big|_{a' = a_t} \right]
Soft Q-Learning: Sampling from Soft Q-function
The Stein variational gradient can be backpropagated into the sampling network using:

\frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_\xi\left[ \Delta f^\phi(\xi; s_t) \, \frac{\partial f^\phi(\xi; s_t)}{\partial \phi} \right]

And so we have:

\nabla_\phi J_\pi(\phi; s_t) = \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} \left( \kappa(a_t^{(i)}, a_t^{(j)}) \, \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a') \big|_{a' = a_t^{(i)}} + \alpha \, \nabla_{a'} \kappa(a', a_t^{(j)}) \big|_{a' = a_t^{(i)}} \right) \nabla_\phi f^\phi(\xi^{(j)}; s_t)

The ultimate update direction \nabla_\phi J_\pi(\phi) is the average of \nabla_\phi J_\pi(\phi; s_t) over a mini-batch sampled from replay memory.
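A toy sketch of this amortized SVGD update (mine, not the paper's code): one fixed state, a 1-D linear sampler f^\phi(\xi) = w\xi + b, a quadratic Q with \nabla_a Q = -a (so the target EBM is a standard normal), and an RBF kernel. All names and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, h, lr, M = 1.0, 1.0, 0.05, 32
w, b = 2.0, 3.0                         # sampler parameters phi = (w, b)

grad_Q = lambda a: -a                   # toy Q(s, a) = -a^2 / 2

for _ in range(2000):
    xi = rng.normal(size=M)
    a = w * xi + b                      # a^(j) = f^phi(xi^(j); s)
    diff = a[:, None] - a[None, :]      # diff[i, j] = a_i - a_j
    K = np.exp(-diff**2 / (2 * h**2))   # kappa(a_i, a_j)
    gK = -diff / h**2 * K               # grad_{a_i} kappa(a_i, a_j)
    # Delta f(xi^(j)) = (1/M) sum_i [ kappa(a_i, a_j) grad_a Q(a_i)
    #                                 + alpha * grad_{a_i} kappa(a_i, a_j) ]
    delta_f = (K * grad_Q(a)[:, None] + alpha * gK).sum(axis=0) / M
    # Chain rule through the sampler: df/dw = xi, df/db = 1.
    # Delta f is an ascent direction on the EBM, so step the parameters along it.
    w += lr * (delta_f * xi).mean()
    b += lr * delta_f.mean()

print(w, b)  # the sampler drifts toward the target N(0, 1): |w| near 1, b near 0
```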
Soft Q-Learning: Algorithm
for each epoch do
    for each t do
        Collect experience:
            a_t \leftarrow f^\phi(\xi; s_t) where \xi \sim N(0, I)
            s_{t+1} \sim p(s_{t+1} | s_t, a_t)
            D \leftarrow D \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}
        Sample minibatch from replay memory:
            \{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})\}_{i=0}^{N} \sim D
        Update soft Q-function parameters:
            Sample \{a^{(i,j)}\}_{j=0}^{M} \sim q for each s_{t+1}^{(i)}
            Compute V^{\bar\theta}_{\mathrm{soft}}(s_{t+1}^{(i)}) and \nabla_\theta J_Q; update \theta using ADAM
        Update policy:
            Sample \{\xi^{(i,j)}\}_{j=0}^{M} \sim N(0, I) for each s_t^{(i)}
            Compute actions a_t^{(i,j)} = f^\phi(\xi^{(i,j)}; s_t^{(i)})
            Compute \Delta f^\phi and \nabla_\phi J_\pi; update \phi using ADAM
    end
    if epoch mod update_interval = 0 then
        Update target parameters: \bar\theta \leftarrow \theta, \bar\phi \leftarrow \phi
    end
end
Algorithm 3: Soft Q-Learning
Soft Q-Learning: Results

[results figures]
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways