Affine Natural Proximal, then Multi-Agent Reinforcement Learning
For MPI Summer 2019
Speaker: Alex Tong Lin
July 23, 2019
1 Natural Gradient
2 Wasserstein Natural Gradient
3 Affine Natural Proximal
4 Numerical Examples
5 Multi-Agent Reinforcement Learning
6 Frameworks for MARL
Affine Natural Proximal (joint work with Wuchen Li, ATL, and Guido Montufar)
Deep Learning and Neural Networks
Deep Learning is a framework for learning data representations, and using these representations for tasks such as classification, generative modeling, and more.
Natural Gradient
Natural Gradient
- The point of Natural Gradient is to have your optimization be invariant to how you describe your problem (i.e., the choice of coordinates).
Rethinking Steepest Descent
- Two coordinate systems: $x$ and $\theta$, where $x = A^{-1}\theta$. Minimize $f(\theta)$.
- Gradient descent in $\theta$-coordinates:
  $$\theta_{k+1} = \theta_k - \alpha \nabla_\theta f(\theta_k)$$
- Gradient descent in $x$-coordinates:
  $$x_{k+1} = x_k - \alpha \nabla_x f(Ax_k)$$
- Question: if $Ax_k = \theta_k$, do we have $Ax_{k+1} = \theta_{k+1}$? NO! Using the chain rule $\nabla_x f(Ax) = A^\top \nabla_\theta f(Ax)$,
  $$Ax_{k+1} = Ax_k - \alpha A \nabla_x f(Ax_k) = Ax_k - \alpha A A^\top \nabla_\theta f(Ax_k) = \theta_k - \alpha A A^\top \nabla_\theta f(\theta_k) \neq \theta_{k+1}.$$
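As a quick numerical check, here is a minimal NumPy sketch (the matrices $A$ and $H$ and all numbers are made up for illustration): one gradient step taken directly in $\theta$ differs from the same step taken in $x$ and mapped back through $A$.

```python
# Minimal sketch: gradient descent is not invariant under x = A^{-1} theta.
# A and H are arbitrary illustrative choices; f(theta) = 0.5 theta^T H theta.
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 1.0]])
H = np.array([[3.0, 1.0], [1.0, 2.0]])

def grad_f_theta(theta):
    return H @ theta  # gradient of f in theta-coordinates

alpha = 0.1
theta = np.array([1.0, -1.0])
x = np.linalg.solve(A, theta)  # ensures A x = theta initially

theta_next = theta - alpha * grad_f_theta(theta)
# chain rule: grad_x f(A x) = A^T grad_theta f(A x)
x_next = x - alpha * (A.T @ grad_f_theta(A @ x))

print(theta_next)   # [0.8, -0.9]
print(A @ x_next)   # [0.4, -1.2] -- not the same step
```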
Rethinking Steepest Descent
- Lesson: the "steepest descent direction" in one coordinate system does not equal the "steepest descent direction" in another coordinate system.
- So we need to reinterpret the idea of "steepest descent" in a way that is invariant to the description (i.e., the choice of parameters).
Actually Steepest Descent
- Actually steepest descent: suppose we have a metric on the input space, $d(x, x')$, and a function $f : X \to Z$. Then a natural way to define the steepest descent direction (with step size $\alpha$) is
  $$\delta^* = \arg\min_{\delta : d(x, x+\delta) = \alpha} f(x + \delta)$$
- This is the (actual) steepest descent in a metric space (given a step size $\alpha$). It forms the basis of Natural Gradient.
Natural Gradient - Actually Steepest Descent in Probability Distributions
- In learning, we (more-or-less) want to find the best probability distribution that minimizes a loss function. Then we have:
- Actually steepest descent (in probability distributions):
  $$\delta^* = \arg\min_{\delta : d(p, p+\delta) = \alpha} L(p + \delta)$$
  (note: "+" is an abuse of notation)
- In deep learning, we have weights $\theta$ parametrizing a distribution $p_\theta$. Now we have:
- Actually steepest descent (in deep learning):
  $$\delta^* = \arg\min_{\delta : d(p_\theta, p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
Natural Gradient
- The Fisher Natural Gradient uses the KL divergence as the discrepancy between distributions:
  $$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
- So the direction of steepest descent will be
  $$\delta^* = \arg\min_{\delta : \mathrm{KL}(p_\theta \| p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
- After moving the constraint up into the objective (i.e., the Lagrangian) and approximating with Taylor expansions, we get
  $$\delta^* \approx \frac{1}{\alpha} F_{p_\theta}^{-1} \nabla_\theta L(\theta)$$
  where $F_{p_\theta}$ is the Fisher information matrix (it's basically the Hessian of KL).
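As a concrete illustration, here is a minimal sketch (not from the talk) for a categorical distribution $p_\theta = \mathrm{softmax}(\theta)$, where the Fisher information matrix has the closed form $F = \mathrm{diag}(p) - p p^\top$; since softmax is shift-invariant, $F$ is singular, so the sketch uses the pseudoinverse.

```python
# Minimal sketch of one Fisher natural-gradient step for p = softmax(theta).
# Closed form for this family: F = diag(p) - p p^T.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.5, -0.2, 0.1])
y = 2                                   # target class for the toy loss
p = softmax(theta)

grad = p - np.eye(3)[y]                 # gradient of L(theta) = -log p_y
F = np.diag(p) - np.outer(p, p)         # Fisher information matrix

h = 0.5
# F is singular (softmax is shift-invariant), so use the pseudoinverse.
theta_natural = theta - h * np.linalg.pinv(F) @ grad
theta_vanilla = theta - h * grad
print(theta_natural, theta_vanilla)     # the two updates differ
```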
Natural Gradient
- So the natural gradient update scheme will follow
  $$\theta_{k+1} = \theta_k - h F_{p_{\theta_k}}^{-1} \nabla_\theta L(\theta_k)$$
  where $h$ is the step size.
- We used KL as the discrepancy between probability distributions. What about Wasserstein?
Wasserstein Natural Gradient
Why Wasserstein Instead of KL?
- Wasserstein is more "continuous" than KL. Consider two-spike distributions $a = (1-\alpha)\delta_{a_1} + \alpha\delta_{a_2}$ and $b = (1-\alpha)\delta_{b_1} + \alpha\delta_{b_2}$ (this setup is inferred from the formulas; the talk's figure is not reproduced):
  - $W_2(a, b)^2 = (1-\alpha)(a_1 - b_1)^2 + \alpha(a_2 - b_2)^2$
  - $\mathrm{KL}(a \,\|\, b) = +\infty$ (because there is no overlap between the distributions)
  - $\mathrm{Euclidean}(a, b) = (a_1 - b_1)^2 + (a_2 - b_2)^2$
  - $L^2(a, b)^2 = +\infty$ (because we integrate over all of $\mathbb{R}$)
- So when $a^{(k)} \to b$, we have convergence under the $W_2$ and Euclidean metrics, but not the others.
- But the Euclidean metric overemphasizes the distance between $a_1$ and $b_1$, which should be weighted less.
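These claims are easy to check numerically. A small sketch, with assumed spike locations:

```python
# A small sketch with assumed spike locations: W2 stays finite while
# KL diverges when the supports are disjoint.
import numpy as np

alpha = 0.3
a_pts, b_pts = np.array([0.0, 5.0]), np.array([1.0, 6.0])
wts = np.array([1 - alpha, alpha])

# With matching weights, the monotone (optimal) coupling pairs a_i with b_i:
w2_sq = np.sum(wts * (a_pts - b_pts) ** 2)
print(w2_sq)   # (1 - alpha) * 1 + alpha * 1 = 1.0

# KL(a||b) over the union of the supports: a puts mass where b has none.
a = np.array([1 - alpha, 0.0, alpha, 0.0])   # mass at x = 0, 1, 5, 6
b = np.array([0.0, 1 - alpha, 0.0, alpha])
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
print(kl)      # inf
```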
Wasserstein Natural Gradient
- The Wasserstein Natural Gradient measures the distance between distributions by
  $$W_2(p, q)^2 = \inf_{\gamma \in \Gamma(p,q)} \int_{\Omega \times \Omega} \|x - y\|^2 \, d\gamma(x, y)$$
  where $\Gamma(p, q)$ is the collection of all measures on $\Omega \times \Omega$ with marginals $p$ and $q$, respectively.
- So the direction of steepest descent will be
  $$\delta^* = \arg\min_{\delta : W_2(p_\theta, p_{\theta+\delta}) = \alpha} L(\theta + \delta)$$
- Similar to the KL case, after writing the Lagrangian and approximating with Taylor expansions, we get
  $$\delta^* \approx \frac{1}{\alpha} G(\theta)^{-1} \nabla_\theta L(\theta)$$
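For small discrete distributions, the infimum over couplings $\Gamma(p, q)$ is a linear program. A hedged sketch using scipy.optimize.linprog (the problem data here are made up):

```python
# Sketch: W2^2 between two small discrete distributions, by solving the
# transport linear program directly.  Points and weights are illustrative.
import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 0.5])                  # source weights at points xs
q = np.array([0.25, 0.25, 0.5])           # target weights at points ys
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 2.0, 3.0])

C = (xs[:, None] - ys[None, :]) ** 2      # cost c[i, j] = |x_i - y_j|^2

m, n = C.shape
# Equality constraints: row sums of gamma equal p, column sums equal q.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j gamma[i, j] = p[i]
for j in range(n):
    A_eq[m + j, j::n] = 1.0               # sum_i gamma[i, j] = q[j]
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                            # W2(p, q)^2
```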
Affine Natural Proximal
Affine Natural Proximal
- Learning problems seek to minimize $\min_{\theta \in \Theta} F(\theta)$.
- In order to perform natural gradient descent, we want the update
  $$\theta_{k+1} = \theta_k - G(\theta_k)^{-1} \nabla_\theta F(\theta_k)$$
  where $G(\theta)$ is the matrix representation of the natural metric structure on the probability space $(\mathcal{P}(\Omega), g)$.
- The above is the (forward Euler) discretization of the gradient flow
  $$\dot\theta(t) = -G(\theta(t))^{-1} \nabla_\theta F(\theta(t))$$
Affine Natural Proximal
- So we have
  $$G(\theta) = \left( \nabla_\theta \rho_\theta, \, g(\rho_\theta) \nabla_\theta \rho_\theta \right),$$
  which can be considered the pull-back of $g$ into parameter space.
- If $g(\rho_\theta) = -(\Delta_{\rho_\theta})^{-1}$, where $\Delta_{\rho_\theta} = \nabla \cdot (\rho_\theta \nabla)$ is the weighted elliptic operator, then $G(\theta)$ is the Wasserstein metric tensor, and we have
  $$G_W(\theta)_{ij} = \left( \nabla_{\theta_i} \rho_\theta, \, (-\Delta_{\rho_\theta})^{-1} \nabla_{\theta_j} \rho_\theta \right)$$
- If $g(\rho_\theta) = \frac{1}{\rho_\theta}$, then $G(\theta)$ is the Fisher-Rao metric tensor, given by
  $$G_{FR}(\theta)_{ij} = \left( \nabla_{\theta_i} \rho_\theta, \, \frac{1}{\rho_\theta} \nabla_{\theta_j} \rho_\theta \right)$$
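As a sanity check on the Fisher-Rao formula, here is a minimal numerical sketch for the 1-D Gaussian family $\theta = (\mu, \sigma)$ (an assumed example, not from the talk): assembling $G_{FR}(\theta)$ on a grid recovers the classical Fisher information $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
# Sketch: G_FR(theta)_ij = int (d rho/d theta_i)(1/rho)(d rho/d theta_j) dx
# for a 1-D Gaussian family, assembled by finite differences on a grid.
import numpy as np

def rho(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-15.0, 15.0, 20001)
dx = x[1] - x[0]
mu, sigma, eps = 0.5, 1.3, 1e-5

# finite-difference derivatives of rho with respect to each parameter
d_mu = (rho(x, mu + eps, sigma) - rho(x, mu - eps, sigma)) / (2 * eps)
d_sig = (rho(x, mu, sigma + eps) - rho(x, mu, sigma - eps)) / (2 * eps)

grads = [d_mu, d_sig]
base = rho(x, mu, sigma)
G = np.array([[np.sum(gi * gj / base) * dx for gj in grads] for gi in grads])
print(G)                             # approx [[1/sigma^2, 0], [0, 2/sigma^2]]
print(1 / sigma**2, 2 / sigma**2)
```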
The need for the proximal
- In order to perform the gradient descent update
  $$\theta_{k+1} = \theta_k - G(\theta_k)^{-1} \nabla_\theta F(\theta_k)$$
  we would need to compute the inverse of the humongous matrix $G(\theta_k)$ (in deep learning, $\theta$ can be a vector with a billion entries).
- Another way is to consider the proximal update.
The need for the proximal
- Consider the proximal update:
  $$\theta_{k+1} = \mathrm{Prox}_{hF}(\theta_k) = \arg\min_\theta \, F(\theta) + \frac{1}{2h} D(\theta, \theta_k)$$
- $D$ is the distance between $\theta$ and $\theta_k$, where
  $$D(\theta, \theta_k) = \inf_{\theta(t)} \left\{ \int_0^1 \dot\theta(t)^\top G(\theta(t)) \, \dot\theta(t) \, dt : \theta(0) = \theta, \; \theta(1) = \theta_k \right\}$$
  $$= \inf_{\theta(t)} \left\{ \int_0^1 \left( \partial_t \rho_{\theta(t)}, \, g(\rho_{\theta(t)}) \, \partial_t \rho_{\theta(t)} \right) dt : \theta(0) = \theta, \; \theta(1) = \theta_k \right\}$$
The proximal and the affine space approximation
- But in order to practically use $D(\theta, \theta_k)$, we use the approximation
  $$\frac{1}{2} D(\theta, \theta_k) = \frac{1}{2} \left( \rho_\theta - \rho_{\theta_k}, \, g(\rho_\theta)(\rho_\theta - \rho_{\theta_k}) \right)$$
- We can turn this into a variational formulation,
  $$\frac{1}{2} D(\theta, \theta_k) = \sup_{\Phi : \Omega \to \mathbb{R}} \, (\Phi, \rho_\theta - \rho_{\theta_k}) - \frac{1}{2} \left( \Phi, g(\rho_\theta)^\dagger \Phi \right)$$
  (whose argsup solution is $\Phi = g(\rho_\theta)(\rho_\theta - \rho_{\theta_k})$, which recovers the previous formula).
- Then we restrict $\Phi$ to belong to the space of functions
  $$\mathcal{F}_\Psi = \left\{ \Phi(x) = \sum_{j=1}^n \xi_j \psi_j(x) = \xi^\top \Psi(x) : \xi \in \mathbb{R}^n \right\}$$
The proximal and the affine space approximation
- For the Wasserstein metric, we have
  $$\frac{1}{2} D^W_\Psi(\theta, \theta_k) = \sup_{\Phi = \xi^\top \Psi} \; \mathbb{E}_\theta[\Phi] - \mathbb{E}_{\theta_k}[\Phi] - \frac{1}{2} \mathbb{E}_\theta\!\left[ \|\nabla \Phi\|^2 \right]$$
- For the Fisher-Rao metric, we have
  $$\frac{1}{2} D^{FR}_\Psi(\theta, \theta_k) = \sup_{\Phi = \xi^\top \Psi} \; \mathbb{E}_\theta[\Phi] - \mathbb{E}_{\theta_k}[\Phi] - \frac{1}{2} \mathbb{E}_\theta\!\left[ (\Phi - \mathbb{E}_\theta[\Phi])^2 \right]$$
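Since $\Phi = \xi^\top \Psi$ makes the objective quadratic in $\xi$, the sup has the closed form $\frac{1}{2} b^\top M^{-1} b$, with $b = \mathbb{E}_\theta[\Psi] - \mathbb{E}_{\theta_k}[\Psi]$ and, in the Wasserstein case, $M = \mathbb{E}_\theta[\nabla\Psi \, \nabla\Psi^\top]$. A sample-based sketch with the assumed choice of linear features $\Psi(x) = x$, which reproduces the order-1 formula on the next slide:

```python
# Sketch: sample-based estimate of (1/2) D^W_Psi with Psi(x) = x, so that
# grad Psi = I and the closed form is 0.5 * ||E_theta[x] - E_thetak[x]||^2.
# The Gaussian sampling distributions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200_000
mu0, mu1 = np.zeros(d), np.array([1.0, -0.5, 2.0])
x_theta = rng.normal(mu1, 1.0, size=(N, d))     # samples from p_theta
x_thetak = rng.normal(mu0, 1.0, size=(N, d))    # samples from p_theta_k

b = x_theta.mean(0) - x_thetak.mean(0)          # E_theta[x] - E_thetak[x]
M = np.eye(d)                                   # gradient of Psi(x) = x is I

half_D_W = 0.5 * b @ np.linalg.solve(M, b)
print(half_D_W)                                 # approx 0.5 * ||mu1 - mu0||^2
print(0.5 * np.sum((mu1 - mu0) ** 2))           # = 0.5 * (1 + 0.25 + 4)
```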
Examples
- (Order 1 approximation) For the approximation with the space of affine functions $\mathcal{F} = \{\Phi(x) = a^\top x + b\}$, we have:
  - Wasserstein:
    $$\frac{1}{2} D^W(\theta, \theta_k) = \frac{1}{2} \left\| \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right\|^2$$
  - Fisher-Rao:
    $$D^{FR}(\theta, \theta_k) = \left( \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right)^\top \left( \mathbb{E}_\theta\!\left[ (x - \mathbb{E}_\theta x)(x - \mathbb{E}_\theta x)^\top \right] \right)^{-1} \left( \mathbb{E}_\theta[x] - \mathbb{E}_{\theta_k}[x] \right).$$
  And we have $\xi = (b, a)$ and $\Psi = (1, \psi_1)$ where $\psi_1 = \mathrm{Id}$.
- (Order 2 approximation) For the space of quadratic functions $\mathcal{F} = \{\Phi(x) = \frac{1}{2} x^\top Q x + a^\top x + b\}$, we have:
  - Wasserstein:
    $$D^W_2(\theta, \theta_k) = \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \mathbb{E}_\theta\!\begin{bmatrix} I_m & x^\top \otimes I_m \\ x \otimes I_m & I_m \otimes x x^\top \end{bmatrix}^{-1} \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right).$$
  - Fisher-Rao:
    $$D^{FR}_2(\theta, \theta_k) = \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \left( C_{FR}(\theta) \right)^{-1} \left( \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_{\theta_k}\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right),$$
    where
    $$C_{FR}(\theta) = \mathbb{E}_\theta\!\left[ \left( \begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right) \left( \begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} - \mathbb{E}_\theta\!\begin{bmatrix} x \\ \frac{x \otimes x}{2} \end{bmatrix} \right)^{\!\top} \right].$$
  In this case, $\xi = (b, a, \mathrm{vec}(Q))$, and $\Psi = (1, \psi_1, \psi_2)$, where $\psi_1 = \mathrm{Id}$ and $\psi_2 : x \mapsto x \otimes x$.
Numerical Examples
Classification for CIFAR-10
Figure: Learning curves for the image classification task on CIFAR-10. Each experiment was averaged over 5 runs; the bold lines represent the average, and the envelopes are the minimum and maximum achieved.
Classification for CIFAR-10
Algorithm 1: Wasserstein Proximal Natural Gradient for Neural Networks

Require: loss function $L$, neural network $f(x, \theta)$, order-1 or order-2 Wasserstein distance approximation $D$, and data-label pairs $(x, y)$ from dataset $\mathcal{D}$.
Require: $m$, the number of gradient descent steps, and $h$, the strength of the proximal term.
while stopping criterion not met do
    Sample a mini-batch of image-label pairs $\{(x_b, y_b)\}_{b=1}^B$ from $\mathcal{D}$
    Approximately solve (by performing SGD $m$ times)
    $$\theta_{k+1} \leftarrow \arg\min_\theta \frac{1}{B} \sum_{b=1}^B L(y_b, f(x_b, \theta)) + \frac{1}{2h} D(\theta, \theta_k)$$
end while
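A loose PyTorch rendering of this loop (hypothetical code, not the authors' implementation). Assumption: the order-1 penalty $\frac{1}{2}\|\mathbb{E}_\theta[\Psi] - \mathbb{E}_{\theta_k}[\Psi]\|^2$ is applied with $\Psi$ taken to be the network's output, estimated by mini-batch means, and $\theta_k$ is a frozen copy of the model from the previous outer step.

```python
# A loose sketch of Algorithm 1 (hypothetical, not the authors' code).
# Assumed order-1 penalty: 0.5 * ||E[f(x,theta)] - E[f(x,theta_k)]||^2,
# with expectations estimated by mini-batch means.
import copy
import torch
import torch.nn as nn

def wasserstein_proximal_train(model, loader, m, h, lr):
    criterion = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:                       # one outer step per mini-batch
        with torch.no_grad():
            out_k = copy.deepcopy(model)(x)   # statistics under frozen theta_k
        for _ in range(m):                    # approximately solve the prox step
            opt.zero_grad()
            out = model(x)
            # (1/2) D^W(theta, theta_k), order-1 approximation (assumed form)
            half_D = 0.5 * (out.mean(0) - out_k.mean(0)).pow(2).sum()
            loss = criterion(out, y) + half_D / h   # F + (1/2h) D
            loss.backward()
            opt.step()
    return model
```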
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning (MARL)
I Multi-Agent Reinforcement Learning involves many agents.
I In cooperative MARL, there is a shared reward.
- In non-cooperative MARL, each agent has its own reward that may be adversarial to another agent's reward.
Challenges to MARL
Curse of dimensionality of the joint state space and joint action space.
- For $N$ agents, the joint action space is $A_1 \times A_2 \times \cdots \times A_N$.
- So there are $\prod_{i=1}^N |A_i|$ joint actions in total, which is exponential in $N$ (e.g., 10 agents with 5 actions each already gives $5^{10} \approx 9.8$ million joint actions).
- Similarly for the state space.
Challenges to MARL
A non-Markovian (a.k.a. non-stationary) environment from the perspective of each agent, because the other agents are also updating their policies. The Dec-POMDP and Markov game frameworks below make this multi-agent structure explicit.
Challenges to MARL
Specifying a good MARL goal in the non-cooperative case.
Frameworks for MARL
Decentralized (PO)MDP

A Dec-POMDP can be described as a tuple $M = \langle I, S, A_i, P, R, \Omega_i, O, h \rangle$:
- $I$, the set of agents.
- $S$, the set of states.
- $A_i$, the set of actions for agent $i$, with $A = \times_i A_i$ the set of joint actions.
- $P$, the state transition probabilities: $P(s' \mid s, a)$, the probability of the environment transitioning to state $s'$ given it was in state $s$ and the agents took joint action $a$.
- $R$, the global reward function: $R(s, a)$, the immediate reward the system receives for being in state $s$ with the agents taking joint action $a$.
- $\Omega_i$, the set of observations for agent $i$, with $\Omega = \times_i \Omega_i$ the set of joint observations.
- $O$, the observation probabilities: $O(o \mid s, a)$, the probability of the agents seeing joint observation $o$ given the state is $s$ and the agents take joint action $a$.
- $h$, the horizon.

If the joint observations determine the state, then we call the above a Dec-MDP.
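For concreteness, here is a small Python sketch of this tuple (hypothetical names; callables stand in for the tables $P$, $R$, $O$):

```python
# Illustrative sketch of the Dec-POMDP tuple as a dataclass (names made up).
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = int
Action = Tuple[int, ...]        # one entry per agent: a joint action
Obs = Tuple[int, ...]           # one entry per agent: a joint observation

@dataclass
class DecPOMDP:
    agents: Sequence[int]                                  # I
    states: Sequence[State]                                # S
    actions: Sequence[Sequence[int]]                       # A_i, per agent
    P: Callable[[State, Action, State], float]             # P(s' | s, a)
    R: Callable[[State, Action], float]                    # global reward
    observations: Sequence[Sequence[int]]                  # Omega_i, per agent
    O: Callable[[Obs, State, Action], float]               # O(o | s, a)
    horizon: int                                           # h

    def num_joint_actions(self) -> int:
        # the curse of dimensionality: the product of the |A_i|
        n = 1
        for A_i in self.actions:
            n *= len(A_i)
        return n
```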
Markov Games
Similar to a Dec-POMDP, except the reward function is not global: each agent has its own individual reward function that it seeks to maximize.
The end
Thank you!