Affine Natural Proximal, then Multi-Agent Reinforcement Learning
For MPI Summer 2019
Speaker: Alex Tong Lin
July 23, 2019
1 Natural Gradient
2 Wasserstein Natural Gradient
3 Affine Natural Proximal
4 Numerical Examples
5 Multi-Agent Reinforcement Learning
6 Frameworks for MARL
Affine Natural Proximal (joint work with Wuchen Li, ATL, and Guido Montufar)
Deep Learning and Neural Networks
Deep Learning is a framework for learning data representations, and using these representations for tasks such as classification, generative modeling, and more.
Natural Gradient
Natural Gradient
- The point of Natural Gradient is to have your optimization be invariant to how you describe your problem (i.e., the choice of coordinates).
Rethinking Steepest Descent
- Two coordinate systems: $x$ and $\theta$, where $x = A^{-1}\theta$. Minimize $f(\theta)$.
- Gradient descent in $\theta$-coordinates:
  $$\theta_{k+1} = \theta_k - \alpha \nabla_\theta f(\theta_k)$$
- Gradient descent in $x$-coordinates:
  $$x_{k+1} = x_k - \alpha \nabla_x f(Ax_k)$$
- Question: if $Ax_k = \theta_k$, do we have $Ax_{k+1} = \theta_{k+1}$? NO! Using the chain rule $\nabla_x f(Ax) = A^\top \nabla_\theta f(Ax)$,
  $$Ax_{k+1} = Ax_k - \alpha A \nabla_x f(Ax_k) = Ax_k - \alpha A A^\top \nabla_\theta f(Ax_k) = \theta_k - \alpha A A^\top \nabla_\theta f(\theta_k) \neq \theta_{k+1}.$$
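As a quick numerical check, here is a minimal NumPy sketch (the matrices $A$ and $H$ and all numbers are made up for illustration): one gradient step taken directly in $\theta$ differs from the same step taken in $x$ and mapped back through $A$.

```python
# Minimal sketch: gradient descent is not invariant under x = A^{-1} theta.
# A and H are arbitrary illustrative choices; f(theta) = 0.5 theta^T H theta.
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 1.0]])
H = np.array([[3.0, 1.0], [1.0, 2.0]])

def grad_f_theta(theta):
    return H @ theta  # gradient of f in theta-coordinates

alpha = 0.1
theta = np.array([1.0, -1.0])
x = np.linalg.solve(A, theta)  # ensures A x = theta initially

theta_next = theta - alpha * grad_f_theta(theta)
# chain rule: grad_x f(A x) = A^T grad_theta f(A x)
x_next = x - alpha * (A.T @ grad_f_theta(A @ x))

print(theta_next)   # [0.8, -0.9]
print(A @ x_next)   # [0.4, -1.2] -- not the same step
```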
Rethinking Steepest Descent
- Lesson: the "steepest descent direction" in one coordinate system does not equal the "steepest descent direction" in another coordinate system.
- So we need to reinterpret the idea of "steepest descent" in a way that is invariant to the description (i.e., the choice of parameters).
Actually Steepest Descent
- Actually steepest descent: suppose we have a metric on the input space, $d(x, x')$, and a function $f : X \to Z$. Then a natural way to define the steepest descent direction (with step size $\alpha$) is
  $$\delta^* = \arg\min_{\delta : d(x, x+\delta) = \alpha} f(x + \delta)$$
- This is the (actual) steepest descent in a metric space (given a step size $\alpha$). It forms the basis of Natural Gradient.
Natural Gradient - Actually Steepest Descent in Probability Distributions
- In learning, we (more-or-less) want to find the best probability distribution that minimizes a loss function. Then we have:
- Actually steepest descent (in probability distributions):
  $$\delta^* = \arg\min_{\delta : d(p, p+\delta) = \alpha} L(p + \delta)$$
  (note: "+" is an abuse of notation)
- In deep learning, we have weights $\theta$ parametrizing a distribution $p_\theta$. Now we have:
- Actually steepest descent (in deep learning):
  $$\delta^* = \arg\min_{\delta : d(p_\theta, p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
Natural Gradient
- The Fisher Natural Gradient uses the KL divergence as the discrepancy between distributions:
  $$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
- So the direction of steepest descent will be
  $$\delta^* = \arg\min_{\delta : \mathrm{KL}(p_\theta \| p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
- After moving the constraint up into the objective (i.e., the Lagrangian) and approximating with Taylor expansions, we get
  $$\delta^* \approx \frac{1}{\alpha} F_{p_\theta}^{-1} \nabla_\theta L(\theta)$$
  where $F_{p_\theta}$ is the Fisher information matrix (it's basically the Hessian of KL).
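As a concrete illustration, here is a minimal sketch (not from the talk) for a categorical distribution $p_\theta = \mathrm{softmax}(\theta)$, where the Fisher information matrix has the closed form $F = \mathrm{diag}(p) - p p^\top$; since softmax is shift-invariant, $F$ is singular, so the sketch uses the pseudoinverse.

```python
# Minimal sketch of one Fisher natural-gradient step for p = softmax(theta).
# Closed form for this family: F = diag(p) - p p^T.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.5, -0.2, 0.1])
y = 2                                   # target class for the toy loss
p = softmax(theta)

grad = p - np.eye(3)[y]                 # gradient of L(theta) = -log p_y
F = np.diag(p) - np.outer(p, p)         # Fisher information matrix

h = 0.5
# F is singular (softmax is shift-invariant), so use the pseudoinverse.
theta_natural = theta - h * np.linalg.pinv(F) @ grad
theta_vanilla = theta - h * grad
print(theta_natural, theta_vanilla)     # the two updates differ
```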
Natural Gradient
- So the natural gradient update scheme will follow
  $$\theta_{k+1} = \theta_k - h F_{p_{\theta_k}}^{-1} \nabla_\theta L(\theta_k)$$
  where $h$ is the step size.
- We used KL as the discrepancy between probability distributions. What about Wasserstein?
Wasserstein Natural Gradient
Why Wasserstein Instead of KL?
- Wasserstein is more "continuous" than KL. Consider two-spike distributions $a = (1-\alpha)\delta_{a_1} + \alpha\delta_{a_2}$ and $b = (1-\alpha)\delta_{b_1} + \alpha\delta_{b_2}$ (this setup is inferred from the formulas; the talk's figure is not reproduced):
  - $W_2(a, b)^2 = (1-\alpha)(a_1 - b_1)^2 + \alpha(a_2 - b_2)^2$
  - $\mathrm{KL}(a \,\|\, b) = +\infty$ (because there is no overlap between the distributions)
  - $\mathrm{Euclidean}(a, b) = (a_1 - b_1)^2 + (a_2 - b_2)^2$
  - $L^2(a, b)^2 = +\infty$ (because we integrate over all of $\mathbb{R}$)
- So when $a^{(k)} \to b$, we have convergence under the $W_2$ and Euclidean metrics, but not the others.
- But the Euclidean metric overemphasizes the distance between $a_1$ and $b_1$, which should be weighted less.
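These claims are easy to check numerically. A small sketch, with assumed spike locations:

```python
# A small sketch with assumed spike locations: W2 stays finite while
# KL diverges when the supports are disjoint.
import numpy as np

alpha = 0.3
a_pts, b_pts = np.array([0.0, 5.0]), np.array([1.0, 6.0])
wts = np.array([1 - alpha, alpha])

# With matching weights, the monotone (optimal) coupling pairs a_i with b_i:
w2_sq = np.sum(wts * (a_pts - b_pts) ** 2)
print(w2_sq)   # (1 - alpha) * 1 + alpha * 1 = 1.0

# KL(a||b) over the union of the supports: a puts mass where b has none.
a = np.array([1 - alpha, 0.0, alpha, 0.0])   # mass at x = 0, 1, 5, 6
b = np.array([0.0, 1 - alpha, 0.0, alpha])
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
print(kl)      # inf
```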
Wasserstein Natural Gradient
- The Wasserstein Natural Gradient measures the distance between distributions by
  $$W_2(p, q)^2 = \inf_{\gamma \in \Gamma(p,q)} \int_{\Omega \times \Omega} \|x - y\|^2 \, d\gamma(x, y)$$
  where $\Gamma(p, q)$ is the collection of all measures on $\Omega \times \Omega$ with marginals $p$ and $q$, respectively.
- So the direction of steepest descent will be
  $$\delta^* = \arg\min_{\delta : W_2(p_\theta, p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
- Similar to the KL case, after writing the Lagrangian and approximating with Taylor expansions, we get
  $$\delta^* \approx \frac{1}{\alpha} G(\theta)^{-1} \nabla_\theta L(\theta)$$
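For small discrete distributions, the infimum over couplings $\Gamma(p, q)$ is a linear program. A hedged sketch using scipy.optimize.linprog (the problem data here are made up):

```python
# Sketch: W2^2 between two small discrete distributions, by solving the
# transport linear program directly.  Points and weights are illustrative.
import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.5])                  # source weights at points xs
q = np.array([0.25, 0.25, 0.5])           # target weights at points ys
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 2.0, 3.0])

C = (xs[:, None] - ys[None, :]) ** 2      # cost c[i, j] = |x_i - y_j|^2

m, n = C.shape
# Equality constraints: row sums of gamma equal p, column sums equal q.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j gamma[i, j] = p[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0               # sum_i gamma[i, j] = q[j]
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                            # W2(p, q)^2
```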
Affine Natural Proximal
Affine Natural Proximal
- Learning problems seek to minimize $\min_{\theta \in \Theta} F(\theta)$.
- In order to perform natural gradient descent, we want the update
  $$\theta_{k+1} = \theta_k - G(\theta_k)^{-1} \nabla_\theta F(\theta_k)$$
  where $G(\theta)$ is the matrix representation of the natural metric structure on the probability space $(\mathcal{P}(\Omega), g)$.
- The above is the (forward Euler) discretization of the gradient flow
  $$\dot\theta(t) = -G(\theta(t))^{-1} \nabla_\theta F(\theta(t))$$
Affine Natural Proximal
- So we have
  $$G(\theta) = \left( \nabla_\theta \rho_\theta, \, g(\rho_\theta) \nabla_\theta \rho_\theta \right),$$
  which can be considered the pull-back of $g$ into parameter space.
- If $g(\rho_\theta) = -(\Delta_{\rho_\theta})^{-1}$, where $\Delta_{\rho_\theta} = \nabla \cdot (\rho_\theta \nabla)$ is the weighted elliptic operator, then $G(\theta)$ is the Wasserstein metric tensor, and we have
  $$G_W(\theta)_{ij} = \left( \nabla_{\theta_i} \rho_\theta, \, (-\Delta_{\rho_\theta})^{-1} \nabla_{\theta_j} \rho_\theta \right)$$
- If $g(\rho_\theta) = \frac{1}{\rho_\theta}$, then $G(\theta)$ is the Fisher-Rao metric tensor, given by
  $$G_{FR}(\theta)_{ij} = \left( \nabla_{\theta_i} \rho_\theta, \, \frac{1}{\rho_\theta} \nabla_{\theta_j} \rho_\theta \right)$$
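As a sanity check on the Fisher-Rao formula, here is a minimal numerical sketch for the 1-D Gaussian family $\theta = (\mu, \sigma)$ (an assumed example, not from the talk): assembling $G_{FR}(\theta)$ on a grid recovers the classical Fisher information $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
# Sketch: G_FR(theta)_ij = int (d rho/d theta_i)(1/rho)(d rho/d theta_j) dx
# for a 1-D Gaussian family, assembled by finite differences on a grid.
import numpy as np

def rho(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-15.0, 15.0, 20001)
dx = x[1] - x[0]
mu, sigma, eps = 0.5, 1.3, 1e-5

# finite-difference derivatives of rho with respect to each parameter
d_mu = (rho(x, mu + eps, sigma) - rho(x, mu - eps, sigma)) / (2 * eps)
d_sig = (rho(x, mu, sigma + eps) - rho(x, mu, sigma - eps)) / (2 * eps)

grads = [d_mu, d_sig]
base = rho(x, mu, sigma)
G = np.array([[np.sum(gi * gj / base) * dx for gj in grads] for gi in grads])
print(G)                             # approx [[1/sigma^2, 0], [0, 2/sigma^2]]
print(1 / sigma**2, 2 / sigma**2)
```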
The need for the proximal
- In order to perform the gradient descent update
  $$\theta_{k+1} = \theta_k - G(\theta_k)^{-1} \nabla_\theta F(\theta_k)$$
  we would need to compute the inverse of the humongous matrix $G(\theta_k)$ (in deep learning, $\theta$ can be a vector with a billion entries).
- Another way is to consider the proximal update.
The need for the proximal
- Consider the proximal update:
  $$\theta_{k+1} = \mathrm{Prox}_{hF}(\theta_k) = \arg\min_\theta \, F(\theta) + \frac{1}{2h} D(\theta, \theta_k)$$
- $D$ is the distance between $\theta$ and $\theta_k$, where
  $$D(\theta, \theta_k) = \inf_{\theta(t)} \left\{ \int_0^1 \dot\theta(t)^\top G(\theta(t)) \, \dot\theta(t) \, dt : \theta(0) = \theta, \; \theta(1) = \theta_k \right\}$$
  $$= \inf_{\theta(t)} \left\{ \int_0^1 \left( \partial_t \rho_{\theta(t)}, \, g(\rho_{\theta(t)}) \, \partial_t \rho_{\theta(t)} \right) dt : \theta(0) = \theta, \; \theta(1) = \theta_k \right\}$$
The proximal and the affine space approximation
- But in order to practically use $D(\theta, \theta_k)$, we use the approximation
  $$\frac{1}{2} D(\theta, \theta_k) = \frac{1}{2} \left( \rho_\theta - \rho_{\theta_k}, \, g(\rho_\theta)(\rho_\theta - \rho_{\theta_k}) \right)$$
- We can turn this into a variational formulation,
  $$\frac{1}{2} D(\theta, \theta_k) = \sup_{\Phi : \Omega \to \mathbb{R}} \, (\Phi, \rho_\theta - \rho_{\theta_k}) - \frac{1}{2} \left( \Phi, g(\rho_\theta)^\dagger \Phi \right)$$
  (whose argsup solution is $\Phi = g(\rho_\theta)(\rho_\theta - \rho_{\theta_k})$, which recovers the previous formula).
- Then we restrict $\Phi$ to belong to the space of functions
  $$\mathcal{F}_\Psi = \left\{ \Phi(x) = \sum_{j=1}^n \xi_j \psi_j(x) = \xi^\top \Psi(x) : \xi \in \mathbb{R}^n \right\}$$
The proximal and the affine space approximation
- For the Wasserstein metric, we have
  $$\frac{1}{2} D^W_\Psi(\theta, \theta_k) = \sup_{\Phi = \xi^\top \Psi} \; \mathbb{E}_\theta[\Phi] - \mathbb{E}_{\theta_k}[\Phi] - \frac{1}{2} \mathbb{E}_\theta\!\left[ \|\nabla \Phi\|^2 \right]$$
- For the Fisher-Rao metric, we have
  $$\frac{1}{2} D^{FR}_\Psi(\theta, \theta_k) = \sup_{\Phi = \xi^\top \Psi} \; \mathbb{E}_\theta[\Phi] - \mathbb{E}_{\theta_k}[\Phi] - \frac{1}{2} \mathbb{E}_\theta\!\left[ (\Phi - \mathbb{E}_\theta[\Phi])^2 \right]$$
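Since $\Phi = \xi^\top \Psi$ makes the objective quadratic in $\xi$, the sup has the closed form $\frac{1}{2} b^\top M^{-1} b$, with $b = \mathbb{E}_\theta[\Psi] - \mathbb{E}_{\theta_k}[\Psi]$ and, in the Wasserstein case, $M = \mathbb{E}_\theta[\nabla\Psi \, \nabla\Psi^\top]$. A sample-based sketch with the assumed choice of linear features $\Psi(x) = x$, which reproduces the order-1 formula on the next slide:

```python
# Sketch: sample-based estimate of (1/2) D^W_Psi with Psi(x) = x, so that
# grad Psi = I and the closed form is 0.5 * ||E_theta[x] - E_thetak[x]||^2.
# The Gaussian sampling distributions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200_000
mu0, mu1 = np.zeros(d), np.array([1.0, -0.5, 2.0])
x_theta = rng.normal(mu1, 1.0, size=(N, d))     # samples from p_theta
x_thetak = rng.normal(mu0, 1.0, size=(N, d))    # samples from p_theta_k

b = x_theta.mean(0) - x_thetak.mean(0)          # E_theta[x] - E_thetak[x]
M = np.eye(d)                                   # gradient of Psi(x) = x is I

half_D_W = 0.5 * b @ np.linalg.solve(M, b)
print(half_D_W)                                 # approx 0.5 * ||mu1 - mu0||^2
print(0.5 * np.sum((mu1 - mu0) ** 2))           # = 0.5 * (1 + 0.25 + 4)
```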
Examples
- (Order 1 approximation) For the approximation with the space of affine functions $\mathcal{F} = \{\Phi(x) = a^\top x + b\}$, we have:
  - Wasserstein:
    $$\frac{1}{2} D^W(\theta, \theta_k) = \frac{1}{2} \left\| \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right\|^2$$
  - Fisher-Rao:
    $$D^{FR}(\theta, \theta_k) = \left( \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right)^\top \left( \mathbb{E}_\theta\!\left[ (x - \mathbb{E}_\theta x)(x - \mathbb{E}_\theta x)^\top \right] \right)^{-1} \left( \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right).$$
  And we have $\xi = (b, a)$ and $\Psi = (1, \psi_1)$ where $\psi_1 = \mathrm{Id}$.
- (Order 2 approximation) For the space of quadratic functions $\mathcal{F} = \{\Phi(x) = \frac{1}{2} x^\top Q x + a^\top x + b\}$, we have:
  - Wasserstein:
    $$D^W_2(\theta, \theta_k) = \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \mathbb{E}_\theta\!\begin{bmatrix} I_m & x^\top \otimes I_m \\ x \otimes I_m & I_m \otimes x x^\top \end{bmatrix}^{-1} \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right).$$
  - Fisher-Rao:
    $$D^{FR}_2(\theta, \theta_k) = \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \left( C_{FR}(\theta) \right)^{-1} \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right),$$
    where
    $$C_{FR}(\theta) = \mathbb{E}_\theta\!\left[ \left( \begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right) \left( \begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \right].$$
  In this case, $\xi = (b, a, \mathrm{vec}(Q))$, and $\Psi = (1, \psi_1, \psi_2)$, where $\psi_1 = \mathrm{Id}$ and $\psi_2 : x \mapsto x \otimes x$.
Numerical Examples
Classification for CIFAR-10
Figure: Learning curves for the image classification task on CIFAR-10. Each experiment was averaged over 5 runs; the bold lines represent the average, and the envelopes are the minimum and maximum achieved.
Classification for CIFAR-10
Algorithm 1: Wasserstein Proximal Natural Gradient for Neural Networks

Require: loss function $L$, neural network $f(x, \theta)$, order-1 or order-2 Wasserstein distance approximation $D$, and data-label pairs $(x, y)$ from dataset $\mathcal{D}$.
Require: $m$, the number of gradient descent steps, and $h$, the strength of the proximal term.
while stopping criterion not met do
    Sample a mini-batch of image-label pairs $\{(x_b, y_b)\}_{b=1}^B$ from $\mathcal{D}$
    Approximately solve (by performing SGD $m$ times)
    $$\theta_{k+1} \leftarrow \arg\min_\theta \frac{1}{B} \sum_{b=1}^B L(y_b, f(x_b, \theta)) + \frac{1}{2h} D(\theta, \theta_k)$$
end while
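A loose PyTorch rendering of this loop (hypothetical code, not the authors' implementation). Assumption: the order-1 penalty $\frac{1}{2}\|\mathbb{E}_\theta[\Psi] - \mathbb{E}_{\theta_k}[\Psi]\|^2$ is applied with $\Psi$ taken to be the network's output, estimated by mini-batch means, and $\theta_k$ is a frozen copy of the model from the previous outer step.

```python
# A loose sketch of Algorithm 1 (hypothetical, not the authors' code).
# Assumed order-1 penalty: 0.5 * ||E[f(x,theta)] - E[f(x,theta_k)]||^2,
# with expectations estimated by mini-batch means.
import copy
import torch
import torch.nn as nn

def wasserstein_proximal_train(model, loader, m, h, lr):
    criterion = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:                       # one outer step per mini-batch
        with torch.no_grad():
            out_k = copy.deepcopy(model)(x)   # statistics under frozen theta_k
        for _ in range(m):                    # approximately solve the prox step
            opt.zero_grad()
            out = model(x)
            # (1/2) D^W(theta, theta_k), order-1 approximation (assumed form)
            half_D = 0.5 * (out.mean(0) - out_k.mean(0)).pow(2).sum()
            loss = criterion(out, y) + half_D / h   # F + (1/2h) D
            loss.backward()
            opt.step()
    return model
```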
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning (MARL)
I Multi-Agent Reinforcement Learning involves many agents.
I In cooperative MARL, there is a shared reward.
- In non-cooperative MARL, each agent has its own reward that may be adversarial to another agent's reward.
Challenges to MARL
Curse of dimensionality of the joint state space and joint action space.
- For $N$ agents, the joint action space is $A_1 \times A_2 \times \cdots \times A_N$.
- So there are $\prod_{i=1}^N |A_i|$ joint actions in total, which is exponential in $N$ (e.g., 10 agents with 5 actions each already gives $5^{10} \approx 9.8$ million joint actions).
- Similarly for the state space.
Challenges to MARL
A non-Markovian (a.k.a. non-stationary) environment from the perspective of each agent, because the other agents are also updating their policies. The Dec-POMDP and Markov game frameworks below make this multi-agent structure explicit.
Challenges to MARL
Specifying a good MARL goal in the non-cooperative case.
Frameworks for MARL
Decentralized (PO)MDP

A Dec-POMDP can be described as a tuple $M = \langle I, S, A_i, P, R, \Omega_i, O, h \rangle$:
- $I$, the set of agents.
- $S$, the set of states.
- $A_i$, the set of actions for agent $i$, with $A = \times_i A_i$ the set of joint actions.
- $P$, the state transition probabilities: $P(s' \mid s, a)$, the probability of the environment transitioning to state $s'$ given it was in state $s$ and the agents took joint action $a$.
- $R$, the global reward function: $R(s, a)$, the immediate reward the system receives for being in state $s$ with the agents taking joint action $a$.
- $\Omega_i$, the set of observations for agent $i$, with $\Omega = \times_i \Omega_i$ the set of joint observations.
- $O$, the observation probabilities: $O(o \mid s, a)$, the probability of the agents seeing joint observation $o$ given the state is $s$ and the agents take joint action $a$.
- $h$, the horizon.

If the joint observations determine the state, then we call the above a Dec-MDP.
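For concreteness, here is a small Python sketch of this tuple (hypothetical names; callables stand in for the tables $P$, $R$, $O$):

```python
# Illustrative sketch of the Dec-POMDP tuple as a dataclass (names made up).
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = int
Action = Tuple[int, ...]        # one entry per agent: a joint action
Obs = Tuple[int, ...]           # one entry per agent: a joint observation

@dataclass
class DecPOMDP:
    agents: Sequence[int]                                  # I
    states: Sequence[State]                                # S
    actions: Sequence[Sequence[int]]                       # A_i, per agent
    P: Callable[[State, Action, State], float]             # P(s' | s, a)
    R: Callable[[State, Action], float]                    # global reward
    observations: Sequence[Sequence[int]]                  # Omega_i, per agent
    O: Callable[[Obs, State, Action], float]               # O(o | s, a)
    horizon: int                                           # h

    def num_joint_actions(self) -> int:
        # the curse of dimensionality: the product of the |A_i|
        n = 1
        for A_i in self.actions:
            n *= len(A_i)
        return n
```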
Markov Games
Similar to a Dec-POMDP, except the reward function is not global: each agent has its own individual reward function that it seeks to maximize.
The end
Thank you!