Reinforcement Learning as Variational Inference: Two Recent Approaches
Rohith Kuditipudi
Duke University
11 August 2017
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Background: Reinforcement Learning
A Markov Decision Process (MDP) is a tuple (S, A, p, r, \gamma), where:
S is the state space
A is the action space
p(s_{t+1} | s_t, a_t) is the probability of the next state s_{t+1} \in S given the current state s_t \in S and action a_t \in A taken by the agent
r : S \times A \to [r_{\min}, r_{\max}] is the reward function
\gamma \in [0, 1] is the discount factor

A policy \pi(\cdot | s_t) is a distribution over A conditioned on the current state s_t.

The goal of reinforcement learning is to learn a policy \pi that maximizes the total expected (discounted) reward J(\pi), where:

J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right]
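As a small illustration (mine, not from the slides), the discounted sum inside the expectation can be computed directly from a trajectory's reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a (finite) trajectory of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# J(pi) is the expectation of this quantity over trajectories drawn from pi.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.9**2 * 2.0 = 2.62
```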
Background: Stein Variational Gradient Descent (SVGD)
SVGD¹ is a variational inference algorithm that iteratively transports a set of particles \{x_i\}_{i=1}^n to match a target distribution.

At each step, particles are updated as follows:

x_i \leftarrow x_i + \epsilon \phi(x_i)

where \epsilon is the step size and \phi is a perturbation direction chosen to greedily minimize the KL divergence with the target distribution.

By restricting \phi to lie in the unit ball of an RKHS with corresponding kernel k, we can obtain a closed-form solution for the optimal perturbation direction...
¹Q. Liu and D. Wang. Stein Variational Gradient Descent: A General-Purpose Bayesian Inference Algorithm.
Background: Stein Variational Gradient Descent (SVGD)
Input: A target distribution with density function p(x) and a set of initial particles \{x_i^0\}_{i=1}^n
Output: A set of particles \{x_i\}_{i=1}^n that approximates the target distribution
for iteration \ell do
    x_i^{\ell+1} \leftarrow x_i^\ell + \epsilon_\ell \phi^*(x_i^\ell), where
    \phi^*(x) = \frac{1}{n} \sum_{j=1}^n \left[ k(x_j^\ell, x) \nabla_{x_j^\ell} \log p(x_j^\ell) + \nabla_{x_j^\ell} k(x_j^\ell, x) \right]
    and \epsilon_\ell is the step size at the \ell-th iteration
end
Algorithm 1: Stein Variational Gradient Descent (SVGD)
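To make Algorithm 1 concrete, here is a minimal NumPy sketch for a one-dimensional standard normal target with an RBF kernel; the bandwidth, step size, particle count, and initialization are illustrative choices, not from the slides.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """Pairwise RBF kernel K[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i
    K = np.exp(-diff**2 / (2 * h**2))
    grad_K = -diff / h**2 * K             # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(x, grad_log_p, eps=0.1):
    """One SVGD update: x_i <- x_i + eps * phi*(x_i)."""
    K, grad_K = rbf_kernel(x)
    # phi*(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K * grad_log_p(x)[:, None] + grad_K).sum(axis=0) / len(x)
    return x + eps * phi

grad_log_p = lambda x: -x                 # target p(x) = N(0, 1)
x = np.random.uniform(-5, 5, size=50)     # initial particles
for _ in range(500):
    x = svgd_step(x, grad_log_p)
print(x.mean(), x.std())                  # should approach 0 and 1
```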
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Stein Variational Policy Gradient²: Preliminaries

Policy iteration: parametrize the policy as \pi(a_t | s_t; \theta) and iteratively update \theta to maximize J(\pi(a_t | s_t; \theta)) (a.k.a. J(\theta))

MaxEnt policy optimization: instead of searching for a single policy \theta, optimize a distribution q over \theta as follows:

\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] - \alpha D_{KL}(q \| q_0) \right\}

where q_0 is a prior on \theta.

If q_0 = constant, the objective simplifies to

\max_q \left\{ \mathbb{E}_{q(\theta)}[J(\theta)] + \alpha H(q) \right\}

where H(q) is the entropy of q.

The optimal distribution is

q(\theta) \propto \exp\left( \frac{1}{\alpha} J(\theta) \right) q_0(\theta)

²Liu et al. Stein Variational Policy Gradient.
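As a one-step justification (not on the original slide), the optimal q follows by completing the KL divergence:

\mathbb{E}_{q(\theta)}[J(\theta)] - \alpha D_{KL}(q \| q_0)
= -\alpha \, \mathbb{E}_{q(\theta)}\left[ \log \frac{q(\theta)}{q_0(\theta) \exp(J(\theta)/\alpha)} \right]
= -\alpha \, D_{KL}\left( q \,\Big\|\, \tfrac{1}{Z} q_0 \exp(J/\alpha) \right) + \alpha \log Z

which is maximized exactly when q(\theta) \propto q_0(\theta) \exp(J(\theta)/\alpha).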
Stein Variational Policy Gradient: Algorithm
Input: Learning rate \epsilon, kernel k(x, x'), temperature \alpha, initial particles \{\theta_i\}
for t = 0, 1, ..., T do
    for i = 0, 1, ..., n do
        Compute \nabla_{\theta_i} J(\theta_i) (using RL method of choice)
    end
    for i = 0, 1, ..., n do
        \Delta\theta_i = \frac{1}{n} \sum_{j=1}^n \left[ \nabla_{\theta_j}\left( \frac{1}{\alpha} J(\theta_j) + \log q_0(\theta_j) \right) k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i) \right]
        \theta_i \leftarrow \theta_i + \epsilon \Delta\theta_i
    end
end
Algorithm 2: Stein Variational Policy Gradient (SVPG)
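A minimal sketch of the SVPG particle update (mine, not the paper's code), assuming each policy's parameters are flattened into a row of `thetas`, that `grad_J[i]` is a policy-gradient estimate of \nabla_{\theta_i} J(\theta_i) supplied by an RL method of choice, and a flat prior q_0 (so its log-gradient vanishes) with an RBF kernel:

```python
import numpy as np

def svpg_update(thetas, grad_J, alpha=1.0, eps=1e-3, h=1.0):
    """One SVPG step on n policy-parameter particles.

    thetas: (n, d) array, one flattened parameter vector per policy.
    grad_J: (n, d) array of policy-gradient estimates, one per particle.
    Assumes a flat prior q_0, so the grad log q_0 term is dropped.
    """
    n = thetas.shape[0]
    diff = thetas[:, None, :] - thetas[None, :, :]       # diff[j, i] = theta_j - theta_i
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))   # K[j, i] = k(theta_j, theta_i)
    grad_K = -diff / h**2 * K[:, :, None]                # grad_{theta_j} k(theta_j, theta_i)
    # Delta theta_i = (1/n) sum_j [ (1/alpha) grad J(theta_j) k(theta_j, theta_i)
    #                               + grad_{theta_j} k(theta_j, theta_i) ]
    delta = (K[:, :, None] * grad_J[:, None, :] / alpha + grad_K).sum(axis=0) / n
    return thetas + eps * delta
```

The kernel term shares gradient information across policies, while the \nabla k term repels particles from one another, maintaining a diverse ensemble of policies.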
Stein Variational Policy Gradient: Results

[results figures]
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways
Soft Q-Learning³: Preliminaries

Recall the standard RL objective (assuming \gamma = 1):

\pi^*_{\mathrm{std}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t)]

In MaxEnt RL, we augment the reward with an entropy term that encourages exploration:

\pi^*_{\mathrm{MaxEnt}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}[r(s_t, a_t) + \alpha H(\pi(\cdot | s_t))]

Note: when \gamma \neq 1, the MaxEnt RL objective becomes:

\arg\max_\pi \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ \sum_{l=t}^{\infty} \gamma^{l-t} \mathbb{E}_{(s_l, a_l)}[r(s_l, a_l) + \alpha H(\pi(\cdot | s_l)) \mid s_t, a_t] \right]

³Haarnoja et al. Reinforcement Learning with Deep Energy-Based Policies.
Soft Q-Learning: Preliminaries
Theorem
Let the soft Q-function be defined by

Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \mathbb{E}_{\pi^*_{\mathrm{MaxEnt}}}\left[ \sum_{l=1}^{\infty} \gamma^l \left( r_{t+l} + \alpha H(\pi^*_{\mathrm{MaxEnt}}(\cdot | s_{t+l})) \right) \right]

and the soft value function by

V^*_{\mathrm{soft}}(s_t) = \alpha \log \int_A \exp\left( \frac{1}{\alpha} Q^*_{\mathrm{soft}}(s_t, a') \right) da'

Then

\pi^*_{\mathrm{MaxEnt}}(a_t | s_t) = \exp\left( \frac{1}{\alpha} \left( Q^*_{\mathrm{soft}}(s_t, a_t) - V^*_{\mathrm{soft}}(s_t) \right) \right)
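For discrete actions the theorem reduces to a softmax over Q-values: the soft value is a log-sum-exp, and the optimal policy is the Boltzmann distribution over Q at temperature \alpha. A small numerical check (mine, with made-up Q-values):

```python
import numpy as np

alpha = 0.5
Q = np.array([1.0, 2.0, 0.5])                  # Q*_soft(s, a) for three actions
V = alpha * np.log(np.exp(Q / alpha).sum())    # soft value: alpha * log-sum-exp
pi = np.exp((Q - V) / alpha)                   # MaxEnt-optimal policy
print(pi, pi.sum())                            # a valid distribution (sums to 1)
# As alpha -> 0, V -> max(Q) and pi becomes greedy; large alpha gives near-uniform pi.
```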
Soft Q-Learning: Preliminaries
Theorem
The soft Q-function satisfies the soft Bellman equation

Q^*_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^*_{\mathrm{soft}}(s_{t+1})]

Theorem (Soft Q-iteration)
Let Q_{\mathrm{soft}}(\cdot, \cdot) and V_{\mathrm{soft}}(\cdot) be bounded and assume that \int_A \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(\cdot, a') \right) da' < \infty and that Q^*_{\mathrm{soft}} < \infty exists. Then the fixed-point iteration

Q_{\mathrm{soft}}(s_t, a_t) \leftarrow r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V_{\mathrm{soft}}(s_{t+1})], \quad \forall s_t, a_t
V_{\mathrm{soft}}(s_t) \leftarrow \alpha \log \int_A \exp\left( \frac{1}{\alpha} Q_{\mathrm{soft}}(s_t, a') \right) da', \quad \forall s_t

converges to Q^*_{\mathrm{soft}} and V^*_{\mathrm{soft}}, respectively.
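A tabular sketch of soft Q-iteration on a toy MDP (the sizes, transitions, and rewards are made-up assumptions; this is my illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 3, 2, 0.9, 0.2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition probabilities
R = rng.standard_normal((nS, nA))              # r(s, a)

Q = np.zeros((nS, nA))
for _ in range(200):
    # V_soft(s) = alpha * log sum_a' exp(Q(s, a') / alpha)   (discrete actions)
    V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    # Q_soft(s, a) <- r(s, a) + gamma * E_{s' ~ p}[V_soft(s')]
    Q = R + gamma * P @ V
print(Q)  # the iterates converge, as the theorem guarantees
```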
Soft Q-Learning: Soft Q-iteration in Practice
In practice, we model the soft Q-function using a function approximator with parameters \theta, denoted Q^\theta_{\mathrm{soft}}.

To convert soft Q-iteration into a stochastic optimization problem, we can re-express V_{\mathrm{soft}}(s_t) as an expectation via importance sampling (instead of an integral) as follows:

V^\theta_{\mathrm{soft}}(s_t) = \alpha \log \mathbb{E}_{a' \sim q}\left[ \frac{\exp\left( \frac{1}{\alpha} Q^\theta_{\mathrm{soft}}(s_t, a') \right)}{q(a')} \right]

where q is the sampling distribution (e.g. the current policy).

Furthermore, we can update Q_{\mathrm{soft}} to minimize

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim q}\left[ \frac{1}{2} \left( \hat Q^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) - Q^\theta_{\mathrm{soft}}(s_t, a_t) \right)^2 \right]

where \hat Q^{\bar\theta}_{\mathrm{soft}}(s_t, a_t) = r_t + \gamma \mathbb{E}_{s_{t+1} \sim p}[V^{\bar\theta}_{\mathrm{soft}}(s_{t+1})] is a target value computed with target parameters \bar\theta.
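A quick check of the importance-sampling identity (my illustration; the Gaussian proposal q and toy Q-function are assumptions chosen to have a known closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, M = 0.5, 100_000

def Q_soft(a):
    """Toy stand-in for Q^theta_soft(s, a) at one fixed state."""
    return -a**2

a = rng.normal(size=M)                            # a' ~ q = N(0, 1)
q_pdf = np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)    # q(a')
# V_soft(s) = alpha * log E_q[ exp(Q(s, a') / alpha) / q(a') ]
V_hat = alpha * np.log(np.mean(np.exp(Q_soft(a) / alpha) / q_pdf))
# Exact value: alpha * log integral exp(-a^2 / alpha) da = alpha * log sqrt(pi * alpha)
print(V_hat, alpha * np.log(np.sqrt(np.pi * alpha)))
```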
Soft Q-Learning: Sampling from Soft Q-function
Goal: learn a state-conditioned stochastic neural network a_t = f^\phi(\xi; s_t), with parameters \phi, that maps Gaussian noise \xi to unbiased action samples from the EBM specified by Q^\theta_{\mathrm{soft}}.

Specifically, we want to minimize the following loss:

J_\pi(\phi; s_t) = D_{KL}\left( \pi^\phi(\cdot | s_t) \,\Big\|\, \exp\left( \frac{1}{\alpha} \left( Q^\theta_{\mathrm{soft}}(s_t, \cdot) - V^\theta_{\mathrm{soft}}(s_t) \right) \right) \right)

where \pi^\phi(\cdot | s_t) is the action distribution induced by \phi.

Strategy: sample actions a_t^{(i)} = f^\phi(\xi^{(i)}; s_t) and use SVGD to compute optimal greedy perturbations \Delta f^\phi(\xi^{(i)}; s_t) that minimize J_\pi(\phi; s_t), where:

\Delta f^\phi(\xi^{(i)}; s_t) = \mathbb{E}_{a_t \sim \pi^\phi}\left[ \kappa(a_t, f^\phi(\xi^{(i)}; s_t)) \, \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a') \big|_{a' = a_t} + \alpha \, \nabla_{a'} \kappa(a', f^\phi(\xi^{(i)}; s_t)) \big|_{a' = a_t} \right]
Soft Q-Learning: Sampling from Soft Q-function
The Stein variational gradient can be backpropagated into the sampling network using:

\frac{\partial J_\pi(\phi; s_t)}{\partial \phi} \propto \mathbb{E}_\xi\left[ \Delta f^\phi(\xi; s_t) \, \frac{\partial f^\phi(\xi; s_t)}{\partial \phi} \right]

And so we have:

\nabla_\phi J_\pi(\phi; s_t) = \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} \left( \kappa(a_t^{(i)}, a_t^{(j)}) \, \nabla_{a'} Q^\theta_{\mathrm{soft}}(s_t, a') \big|_{a' = a_t^{(i)}} + \alpha \, \nabla_{a'} \kappa(a', a_t^{(j)}) \big|_{a' = a_t^{(i)}} \right) \nabla_\phi f^\phi(\xi^{(j)}; s_t)

The ultimate update direction \nabla_\phi J_\pi(\phi) is the average of \nabla_\phi J_\pi(\phi; s_t) over a mini-batch sampled from replay memory.
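A toy sketch of this amortized SVGD update (mine, not the paper's code): one fixed state, a 1-D linear sampler f^\phi(\xi) = w\xi + b, a quadratic Q with \nabla_a Q = -a (so the target EBM is a standard normal), and an RBF kernel. All names and hyperparameters here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, h, lr, M = 1.0, 1.0, 0.05, 32
w, b = 2.0, 3.0                         # sampler parameters phi = (w, b)

grad_Q = lambda a: -a                   # toy Q(s, a) = -a^2 / 2

for _ in range(2000):
    xi = rng.normal(size=M)
    a = w * xi + b                      # a^(j) = f^phi(xi^(j); s)
    diff = a[:, None] - a[None, :]      # diff[i, j] = a_i - a_j
    K = np.exp(-diff**2 / (2 * h**2))   # kappa(a_i, a_j)
    gK = -diff / h**2 * K               # grad_{a_i} kappa(a_i, a_j)
    # Delta f(xi^(j)) = (1/M) sum_i [ kappa(a_i, a_j) grad_a Q(a_i)
    #                                 + alpha * grad_{a_i} kappa(a_i, a_j) ]
    delta_f = (K * grad_Q(a)[:, None] + alpha * gK).sum(axis=0) / M
    # Chain rule through the sampler: df/dw = xi, df/db = 1.
    # Delta f is an ascent direction on the EBM, so step the parameters along it.
    w += lr * (delta_f * xi).mean()
    b += lr * delta_f.mean()

print(w, b)  # the sampler drifts toward the target N(0, 1): |w| near 1, b near 0
```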
Soft Q-Learning: Algorithm
for each epoch do
    for each t do
        Collect experience:
            a_t \leftarrow f^\phi(\xi; s_t) where \xi \sim N(0, I)
            s_{t+1} \sim p(s_{t+1} | s_t, a_t)
            D \leftarrow D \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}
        Sample minibatch from replay memory:
            \{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)}, s_{t+1}^{(i)})\}_{i=0}^{N} \sim D
        Update soft Q-function parameters:
            Sample \{a^{(i,j)}\}_{j=0}^{M} \sim q for each s_{t+1}^{(i)}
            Compute V^{\bar\theta}_{\mathrm{soft}}(s_{t+1}^{(i)}) and \nabla_\theta J_Q; update \theta using ADAM
        Update policy:
            Sample \{\xi^{(i,j)}\}_{j=0}^{M} \sim N(0, I) for each s_t^{(i)}
            Compute actions a_t^{(i,j)} = f^\phi(\xi^{(i,j)}; s_t^{(i)})
            Compute \Delta f^\phi and \nabla_\phi J_\pi; update \phi using ADAM
    end
    if epoch mod update_interval = 0 then
        Update target parameters: \bar\theta \leftarrow \theta, \bar\phi \leftarrow \phi
    end
end
Algorithm 3: Soft Q-Learning
Soft Q-Learning: Results

[results figures]
Outline
1 Background
2 Stein Variational Policy Gradient
3 Soft Q-Learning
4 Closing Thoughts/Takeaways