Wasserstein barycenters from a PDE perspective
Guillaume Carlier (CEREMADE, Université Paris Dauphine, and MOKAPLAN, Inria-Dauphine)
Based on joint works with M. Agueh, K. Eichinger and A. Kroshnin.
Erlangen, FAU, CAA online seminar, June 9th 2020.
Outline
➀ Definition and characterization
➁ Integrability, convexity
➂ Numerical approximation
➃ Regularized barycenters
➄ Asymptotics (LLN, CLT)
Definition and characterization
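Quadratic Wasserstein distance. Let $P_2(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ with a finite second moment, and let $\rho_0$ and $\rho_1$ be in $P_2(\mathbb{R}^d)$. The 2-Wasserstein distance $W_2(\rho_0,\rho_1)$ between $\rho_0$ and $\rho_1$ is defined by
\[ W_2^2(\rho_0,\rho_1) = \inf_{\gamma\in\Pi(\rho_0,\rho_1)} \int_{\mathbb{R}^d\times\mathbb{R}^d} |x-y|^2\,d\gamma(x,y), \]
where $\Pi(\rho_0,\rho_1)$ is the set of transport plans between $\rho_0$ and $\rho_1$, i.e. probability measures on $\mathbb{R}^d\times\mathbb{R}^d$ having $\rho_0$ and $\rho_1$ as marginals. $W_2$ is a distance on $P_2(\mathbb{R}^d)$, and $(P_2(\mathbb{R}^d), W_2)$ is called the Wasserstein space.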
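For intuition, in dimension one and for two empirical measures with the same number of equally weighted atoms, the optimal plan is the monotone (sorted) matching, so $W_2$ can be computed by sorting. The sketch below only covers this special setting; the function name `w2_empirical_1d` is illustrative and not part of the talk.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two empirical measures on the real line.

    Assumption: both measures have the same number n of atoms,
    each atom carrying mass 1/n.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("both samples must have the same size")
    # The sorted matching is an optimal plan in 1D for the quadratic cost.
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Translating a measure by 1 moves every atom by 1.
print(w2_empirical_1d([0.0, 1.0], [2.0, 1.0]))
```

For measures with unequal weights or in higher dimensions, the infimum over plans is a genuine linear program and this shortcut no longer applies.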
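Let $T:\mathbb{R}^d\to\mathbb{R}^k$ be Borel and $\nu\in P(\mathbb{R}^d)$. The push-forward of $\nu$ through $T$ (or image measure) is the measure $T_\#\nu\in P(\mathbb{R}^k)$ defined by
\[ T_\#\nu(A) = \nu(T^{-1}(A)) \quad \text{for every Borel subset } A \text{ of } \mathbb{R}^k, \]
or, equivalently,
\[ \int_{\mathbb{R}^k} \varphi\,dT_\#\nu = \int_{\mathbb{R}^d} \varphi(T(x))\,d\nu(x) \]
for every $\varphi\in C_b(\mathbb{R}^k)$. One says that $T$ transports $\nu$ to $\rho$ if $T_\#\nu=\rho$.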
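For a discrete measure the push-forward simply moves the atoms and keeps the weights, which makes the change-of-variables identity easy to check numerically. The helper names `push_forward` and `integrate` below are hypothetical, chosen only for this illustration.

```python
import numpy as np

def push_forward(points, weights, T):
    """Push sum_j w_j delta_{x_j} forward by T: atoms move to T(x_j), weights stay."""
    return np.array([T(p) for p in points]), np.asarray(weights, dtype=float)

def integrate(phi, points, weights):
    """Integral of phi against the discrete measure sum_j w_j delta_{x_j}."""
    return float(sum(w * phi(p) for p, w in zip(points, weights)))

# nu = (1/2) delta_{-1} + (1/2) delta_{2}, T(x) = x^2, phi(y) = y + 1
pts, w = [-1.0, 2.0], [0.5, 0.5]
T = lambda t: t ** 2
phi = lambda y: y + 1.0

new_pts, new_w = push_forward(pts, w, T)
lhs = integrate(phi, new_pts, new_w)             # integral of phi d(T#nu)
rhs = integrate(lambda t: phi(T(t)), pts, w)     # integral of phi(T(x)) d(nu)
print(lhs, rhs)
```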
Brenier, McCann: if $\rho_0$ does not charge Lipschitz hypersurfaces, there is a unique optimal transport plan, characterized by $\gamma = (\mathrm{id},\nabla\varphi)_\#\rho_0$ with $\varphi$ convex. Brenier’s map $\nabla\varphi$ extends the monotone rearrangement to several dimensions. Link with Monge-Ampère:
\[ \det(D^2\varphi)\,\rho_1(\nabla\varphi) = \rho_0, \quad \varphi \text{ convex}. \]
Regularity theory (Caffarelli, Figalli, De Philippis).
Interpolation (McCann): the curve of measures
\[ t\in[0,1] \mapsto \rho_t = ((1-t)\,\mathrm{id} + t\nabla\varphi)_\#\rho_0 \]
is the geodesic between $\rho_0$ and $\rho_1$.
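As a worked one-dimensional illustration (a standard computation, not taken from the slides), take $\rho_0 = N(m_0, s_0^2)$ and $\rho_1 = N(m_1, s_1^2)$ with $s_0, s_1 > 0$. The monotone map is affine, and both the Monge-Ampère equation and McCann's interpolation can be checked by hand:

```latex
\begin{align*}
\nabla\varphi(x) &= m_1 + \frac{s_1}{s_0}(x - m_0),
  \qquad \varphi''(x) = \frac{s_1}{s_0} > 0,\\
\varphi''(x)\,\rho_1(\nabla\varphi(x))
  &= \frac{s_1}{s_0}\cdot\frac{1}{s_1\sqrt{2\pi}}
     \exp\Big(-\frac{(x-m_0)^2}{2 s_0^2}\Big) = \rho_0(x),\\
\rho_t &= N\big((1-t)m_0 + t m_1,\ ((1-t)s_0 + t s_1)^2\big),\\
W_2^2(\rho_0,\rho_1) &= \int |\nabla\varphi(x) - x|^2\,d\rho_0(x)
  = (m_0 - m_1)^2 + (s_0 - s_1)^2.
\end{align*}
```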
Given $(\nu_1,\dots,\nu_N)\in P_2(\mathbb{R}^d)^N$ and weights $\lambda_i>0$ with $\sum_{i=1}^N\lambda_i=1$, the Wasserstein barycenter problem is
\[ \inf_{\rho\in P_2(\mathbb{R}^d)} \sum_{i=1}^N \lambda_i W_2^2(\nu_i,\rho). \tag{1} \]
This is a convex problem and existence is easy. Uniqueness holds as soon as one of the measures $\nu_i$ does not charge hypersurfaces (in this case $\rho\mapsto W_2^2(\rho,\nu_i)$ is strictly convex). The solution is called the Wasserstein barycenter. There is no regularizing effect. It is a special instance of a Fréchet mean.
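Variants:
• Riemannian manifold instead of $\mathbb{R}^d$ (Kim and Pass).
• Barycenter of a general probability measure $P$ over $P_2(\mathbb{R}^d)$ with
\[ \int_{P_2(\mathbb{R}^d)} m_2(\nu)\,dP(\nu) < +\infty, \qquad m_2(\nu) := \int_{\mathbb{R}^d} |x|^2\,d\nu(x), \]
defined through
\[ \inf_{\rho\in P_2(\mathbb{R}^d)} \int_{P_2(\mathbb{R}^d)} W_2^2(\nu,\rho)\,dP(\nu) \tag{2} \]
(Bigot and Klein, Le Gouic and Loubès).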
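Characterization of the barycenter $\rho$. Kantorovich duality formula:
\[ \frac12 W_2^2(\rho,\nu_i) = \sup\Big\{ \int_{\mathbb{R}^d} u_i\,d\rho + \int_{\mathbb{R}^d} v_i\,d\nu_i \;:\; u_i(x)+v_i(y) \le \frac12|x-y|^2 \Big\}, \]
achieved by a pair of potentials $(u_i,v_i)$ which are semiconcave and related through
\[ \varphi_i := \tfrac12|\cdot|^2 - u_i \ \text{convex}, \qquad \varphi_i^* := \tfrac12|\cdot|^2 - v_i, \]
where $\varphi_i^*$ is the Legendre transform of $\varphi_i$. Let $\gamma_i$ be an optimal transport plan between $\rho$ and $\nu_i$: on the support of $\gamma_i$ the constraint is an equality, $u_i(x)+v_i(y)=\frac12|x-y|^2$, i.e. $\varphi_i(x)+\varphi_i^*(y)=x\cdot y$, i.e. $y\in\partial\varphi_i(x)$.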
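Optimality condition for the barycenter $\rho$:
\[ \sum_{i=1}^N \lambda_i u_i \ge 0, \quad \text{with equality on } \mathrm{spt}(\rho), \]
i.e.
\[ \frac12|x|^2 \ge \sum_{i=1}^N \lambda_i\varphi_i(x) \quad \forall x\in\mathbb{R}^d, \ \text{with equality on } \mathrm{spt}(\rho). \]
Since the $\varphi_i$’s are convex, they are all differentiable on the contact set, in particular everywhere on $\mathrm{spt}(\rho)$, and
\[ x = \sum_{i=1}^N \lambda_i\nabla\varphi_i(x) \quad \forall x\in\mathrm{spt}(\rho), \tag{3} \]
with $(\nabla\varphi_i)_\#\rho=\nu_i$: there exists a (Lipschitz) optimal transport from the barycenter to each of the fixed measures $\nu_i$ (no assumption on $\nu_i$).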
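The barycenter (actually its support) is characterized by a free-boundary problem for a system of Monge-Ampère equations: $\varphi_i$ convex,
\[ \frac12|x|^2 \ge \sum_{i=1}^N \lambda_i\varphi_i(x) \quad \forall x\in\mathbb{R}^d, \ \text{with equality on } \mathrm{spt}(\rho), \]
and
\[ \det(D^2\varphi_i)\,\nu_i(\nabla\varphi_i) = \rho, \quad i=1,\dots,N. \]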
Explicit examples:
• $N=2$: the barycenters of two measures coincide with McCann’s interpolation,
• $d=1$: if $\nu_1$ is atomless and $T_i$ denotes the monotone map with $(T_i)_\#\nu_1=\nu_i$, then $\rho = \big(\sum_{i=1}^N\lambda_iT_i\big)_\#\nu_1$ (in particular barycenters are associative, which is false in higher dimensions),
• barycenters of Gaussians are Gaussians.
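The one-dimensional formula has a simple discrete analogue: for empirical measures with the same number of equally weighted atoms, the monotone maps pair order statistics, so the barycenter is obtained by averaging sorted samples. The sketch below assumes exactly this setting (which replaces the atomless assumption of the slide); `barycenter_1d` is a hypothetical helper name.

```python
import numpy as np

def barycenter_1d(samples, lambdas):
    """Atoms of the W2 barycenter of 1D empirical measures.

    samples : array-like (N, n), row i holds the n atoms of nu_i (mass 1/n each)
    lambdas : array-like (N,), positive weights summing to 1
    Returns the n atoms (mass 1/n each) of the barycenter.
    """
    q = np.sort(np.asarray(samples, dtype=float), axis=1)  # order statistics
    # Weighted average of the k-th order statistics, for each k.
    return np.asarray(lambdas, dtype=float) @ q

atoms = barycenter_1d([[0.0, 1.0, 2.0], [10.0, 12.0, 11.0]], [0.5, 0.5])
print(atoms)
```

Sorting plays the role of the monotone maps $T_i$: the $k$-th smallest atom of $\nu_1$ is sent to the $k$-th smallest atom of $\nu_i$.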
Integrability, convexity
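McCann’s displacement convexity: given $\nu_1$, $\nu_2$ and the optimal transport $\nabla\varphi$ between them, set $\rho_\lambda = ((1-\lambda)\,\mathrm{id}+\lambda\nabla\varphi)_\#\nu_1$, $\lambda\in[0,1]$. A functional $E: P_2(\mathbb{R}^d)\to\mathbb{R}\cup\{+\infty\}$ is displacement convex if
\[ E(\rho_\lambda) \le (1-\lambda)E(\nu_1) + \lambda E(\nu_2). \]
$E$ is convex along barycenters if, for $\rho$ a barycenter of the $\nu_i$’s with weights $\lambda_i$, one has
\[ E(\rho) \le \sum_{i=1}^N \lambda_iE(\nu_i). \]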
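Moments: let $V:\mathbb{R}^d\to\mathbb{R}$ be convex and such that $\int_{\mathbb{R}^d}V\,d\nu_i<+\infty$ for every $i$. Recall (3) and $(\nabla\varphi_i)_\#\rho=\nu_i$; by Jensen's inequality,
\[ \int_{\mathbb{R}^d} V\,d\rho = \int_{\mathbb{R}^d} V\Big(\sum_{i=1}^N\lambda_i\nabla\varphi_i(x)\Big)\,d\rho(x) \le \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d} V(\nabla\varphi_i(x))\,d\rho(x) = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d} V\,d\nu_i. \]
In particular
\[ m_2(\rho) \le \sum_{i=1}^N\lambda_i m_2(\nu_i), \qquad \int_{\mathbb{R}^d} x\,d\rho(x) = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d} y\,d\nu_i(y). \]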
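Integral estimates: consider the internal energy $E(\rho)=\int_{\mathbb{R}^d}U(\rho(x))\,dx$, where $U:\mathbb{R}_+\to\mathbb{R}$, $U(0)=0$, satisfies McCann’s condition:
\[ t \mapsto t^d\,U(t^{-d}) \ \text{is convex and nonincreasing,} \]
e.g. $U(t)=t^p$, $p>1$, or $U(t)=t\log(t)$. McCann showed displacement convexity of such energies. This generalizes to barycenters: assume the $\nu_i$ are absolutely continuous with finite energy. By the Monge-Ampère equations $\det(D^2\varphi_i)\,\nu_i(\nabla\varphi_i)=\rho$ and the change of variables $y=\nabla\varphi_i(x)$,
\[ \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U\Big(\frac{\rho}{\det D^2\varphi_i}\Big)\det D^2\varphi_i\,dx. \]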
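Set $\Phi_\rho(t) := t^d\,U(\rho/t^d)$. Then, by convexity of $\Phi_\rho$,
\[ \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}\Phi_\rho\big((\det D^2\varphi_i)^{1/d}\big)\,dx \ge \int_{\mathbb{R}^d}\Phi_\rho\Big(\sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d}\Big)\,dx. \]
Since $D^2\varphi_i\ge0$, Minkowski’s concavity inequality together with the optimality condition $\sum_i\lambda_iD^2\varphi_i=\mathrm{id}$ on $\mathrm{spt}(\rho)$ gives
\[ \sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d} \le \Big(\det\sum_{i=1}^N\lambda_iD^2\varphi_i\Big)^{1/d} = 1, \]
so that, since $\Phi_\rho$ is nonincreasing,
\[ \Phi_\rho\Big(\sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d}\Big) \ge \Phi_\rho(1) = U(\rho). \]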
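Minkowski's concavity inequality, $\sum_i\lambda_i\det(A_i)^{1/d}\le\det(\sum_i\lambda_iA_i)^{1/d}$ for symmetric nonnegative matrices $A_i$, can be sanity-checked numerically. In the sketch below the matrices stand in for the Hessians $D^2\varphi_i$ at a point and are chosen so that their weighted sum is the identity; `minkowski_gap` is a hypothetical helper name.

```python
import numpy as np

def minkowski_gap(mats, lambdas):
    """det(sum_i l_i A_i)^(1/d) - sum_i l_i det(A_i)^(1/d) for PSD matrices A_i."""
    d = mats[0].shape[0]
    lhs = sum(l * np.linalg.det(A) ** (1.0 / d) for l, A in zip(lambdas, mats))
    mean = sum(l * A for l, A in zip(lambdas, mats))
    rhs = np.linalg.det(mean) ** (1.0 / d)
    return float(rhs - lhs)

# Two diagonal matrices whose average is the identity, as in the
# optimality condition sum_i lambda_i D^2 phi_i = id.
A1 = np.diag([0.5, 1.5])
A2 = np.diag([1.5, 0.5])
print(minkowski_gap([A1, A2], [0.5, 0.5]))  # nonnegative gap
```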
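The desired convexity inequality follows:
\[ \int_{\mathbb{R}^d}U(\rho(x))\,dx \le \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx. \]
In particular:
• if $p\in(1,+\infty)$ and $\nu_i\in L^p$, then $\rho\in L^p$ with $\|\rho\|_{L^p}^p\le\sum_i\lambda_i\|\nu_i\|_{L^p}^p$,
• if the $\nu_i$ have finite entropy, so does $\rho$, and $\mathrm{Ent}(\rho)\le\sum_{i=1}^N\lambda_i\mathrm{Ent}(\nu_i)$,
• if $d=1$, one can use negative powers as well.
Limit cases:
• $\nu_i\in L^1\Rightarrow\rho\in L^1$,
• $\nu_1\in L^\infty\Rightarrow\rho\in L^\infty$ with $\|\rho\|_{L^\infty}\le\lambda_1^{-d}\|\nu_1\|_{L^\infty}$,
• nothing is known about higher regularity: does $\nu_i\in C^{k,\alpha}$ imply $\rho\in C^{k,\alpha}$?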
Numerical approximation
Reformulate the barycenter problem (1) as a linear problem in terms of transport plans $\gamma_i$ between the fixed measures $\nu_i$ and the unknown barycenter $\rho$: minimize with respect to $(\gamma_1,\dots,\gamma_N)\in P(\mathbb{R}^d\times\mathbb{R}^d)^N$ the cost
\[ \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d\times\mathbb{R}^d}\frac12|x_i-x|^2\,d\gamma_i(x_i,x) \]
subject to the fixed marginal constraints (where $\pi_1(a,b)=a$ and $\pi_2(a,b)=b$ denote the canonical projections)
\[ (\pi_1)_\#\gamma_i = \nu_i, \quad i=1,\dots,N, \tag{4} \]
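as well as the constraint that all the $\gamma_i$’s share the same second marginal (which is the barycenter):
\[ (\pi_2)_\#\gamma_1 = (\pi_2)_\#\gamma_2 = \dots = (\pi_2)_\#\gamma_N \ (=\rho). \tag{5} \]
Entropic regularization: given $\varepsilon>0$, minimize under the same constraints
\[ \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d\times\mathbb{R}^d}\Big(\frac12|x_i-x|^2+\varepsilon\log\gamma_i(x_i,x)\Big)\,d\gamma_i(x_i,x), \]
which is an entropic projection problem:
\[ \inf\ \sum_{i=1}^N\lambda_iH(\gamma_i|\theta_i), \qquad \theta_i(x_i,x) = \exp\Big(-\frac{|x_i-x|^2}{2\varepsilon}\Big), \]
where $H$ denotes the relative entropy (aka Kullback-Leibler divergence). In other words, we have to Kullback-Leibler project the Gaussian kernels $\theta_i$ onto the linear constraints (4)-(5).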
This is a popular, simple and efficient approximation method for computational OT: the Sinkhorn scaling algorithm (Cuturi; Cuturi and Peyré’s recent book). Alternate projection algorithm: perform one projection at a time (this is the same as coordinate descent, aka nonlinear Gauss-Seidel, on the dual). The key point is that these projections are totally explicit. KL projection onto a fixed marginal constraint:
\[ \inf\Big\{ H(\gamma_1|\theta_1) = \int\Big(\log\Big(\frac{\gamma_1}{\theta_1}\Big)-1\Big)\gamma_1 \;:\; (\pi_1)_\#\gamma_1=\nu_1 \Big\}. \]
Optimality condition (with $u_1$ a Lagrange multiplier for the fixed marginal constraint):
\[ \log\Big(\frac{\gamma_1(x_1,x)}{\theta_1(x_1,x)}\Big) = u_1(x_1), \quad \text{i.e.} \quad \gamma_1(x_1,x) = a_1(x_1)\,\theta_1(x_1,x). \]
One finds $a_1$ thanks to the marginal constraint:
\[ a_1(x_1) = \frac{\nu_1(x_1)}{\int_{\mathbb{R}^d}\theta_1(x_1,x)\,dx}; \]
this is just a scaling as in Sinkhorn. Consider now the KL projection onto the common marginal constraint (5):
\[ \inf\ \sum_{i=1}^N\lambda_iH(\gamma_i|\theta_i) \quad \text{subject to (5)}. \]
Trick: use Lagrange multipliers by observing that (5) is the same as
\[ \sum_{i=1}^N u_i = 0 \ \Rightarrow\ \sum_{i=1}^N\int_{\mathbb{R}^d\times\mathbb{R}^d}u_i(x)\,d\gamma_i(x_i,x) = 0. \]
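Optimality conditions:
\[ \lambda_i\log\Big(\frac{\gamma_i(x_i,x)}{\theta_i(x_i,x)}\Big) = u_i(x), \qquad \sum_{i=1}^N u_i = 0, \]
that is,
\[ \gamma_i(x_i,x) = \theta_i(x_i,x)\,a_i(x)^{1/\lambda_i}, \qquad \prod_{i=1}^N a_i = 1. \]
We now solve using the constraint that
\[ \rho(x) = \int_{\mathbb{R}^d}\gamma_i(x_i,x)\,dx_i = a_i(x)^{1/\lambda_i}\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i, \quad i=1,\dots,N, \]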
and find an explicit geometric mean formula for $\rho$:
\[ \rho(x) = \prod_{i=1}^N\rho(x)^{\lambda_i} = \prod_{i=1}^N a_i(x)\Big(\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i\Big)^{\lambda_i} = \prod_{i=1}^N\Big(\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i\Big)^{\lambda_i}. \]
All steps of Sinkhorn for barycenters are totally explicit. The method is easy to implement (a few lines of code) and converges pretty fast (for $\varepsilon$ not too small). The main cost is computing the integrals (which can be parallelized and is well suited for GPU).
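The alternating KL projections can be sketched on a fixed one-dimensional grid as follows. This is a minimal illustration under assumptions that are not in the talk: the measures are histograms on a common grid, the current plans are stored through scalings `u` and `v` of the kernel, and the function name `sinkhorn_barycenter`, the grid, `eps` and the iteration count are illustrative choices. The scaling `v[i]` plays the role of $a_i^{1/\lambda_i}$, so the constraint $\prod_i a_i=1$ reads $\prod_i v_i^{\lambda_i}=1$.

```python
import numpy as np

def sinkhorn_barycenter(nus, lambdas, x, eps, n_iter=300):
    """Entropic barycenter of histograms on a common 1D grid (sketch).

    nus     : array (N, n), each row a histogram nu_i summing to 1
    lambdas : array (N,), positive weights summing to 1
    x       : array (n,), grid points
    eps     : entropic regularization parameter
    """
    nus = np.asarray(nus, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    # Gaussian kernel theta(x_j, x_k) = exp(-|x_j - x_k|^2 / (2 eps)), symmetric.
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * eps))
    v = np.ones_like(nus)  # scalings on the barycenter side
    for _ in range(n_iter):
        # KL projection onto the fixed marginals (4): u_i = nu_i / (K v_i).
        u = nus / (v @ K)
        # Discrete version of the integrals of the scaled kernel over x_i.
        Ktu = u @ K
        # KL projection onto the common marginal (5): weighted geometric mean.
        rho = np.exp(lambdas @ np.log(Ktu))
        v = rho[None, :] / Ktu
    return rho

# Hypothetical test case: two Gaussian-shaped histograms on [0, 1].
x = np.linspace(0.0, 1.0, 101)

def gaussian_hist(m, s):
    h = np.exp(-(x - m) ** 2 / (2.0 * s ** 2))
    return h / h.sum()

nus = np.stack([gaussian_hist(0.25, 0.05), gaussian_hist(0.75, 0.05)])
rho = sinkhorn_barycenter(nus, [0.5, 0.5], x, eps=5e-3)
rho = rho / rho.sum()  # renormalize the (approximately unit-mass) iterate
print("mean of the barycenter:", float(rho @ x))
```

Each iteration costs two matrix products with the kernel, which is the part that parallelizes well. For very small `eps` the kernel entries underflow, and a log-domain implementation would be needed; that is outside the scope of this sketch.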
Regularized barycenters
Given $\Omega$ an open convex subset of $\mathbb{R}^d$ (possibly the whole of $\mathbb{R}^d$) and $\mu>0$, the entropic barycenter problem is
\[ \inf_{\rho\in P_2(\mathbb{R}^d),\ \int_\Omega\rho=1}\ \sum_{i=1}^N\lambda_iW_2^2(\nu_i,\rho) + \mu\int_\Omega\rho\log(\rho). \tag{6} \]
It was proposed by Bigot, Cazelles and Papadakis for numerical (it avoids discretization artifacts) and statistical purposes. It forces full support in $\Omega$. It is different from the entropic regularization using plans in Sinkhorn. There is a unique solution, the entropic regularized barycenter. As we shall see, this is a purely PDE problem.
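More generally, let $P$ be a Borel probability measure over the Wasserstein space $P_2(\mathbb{R}^d)$ with
\[ \int_{P_2(\mathbb{R}^d)} m_2(\nu)\,dP(\nu) < +\infty. \]
The entropic regularized barycenter of $P$, $\rho=\mathrm{Bar}_{\mu,\Omega}(P)$ (just $\mathrm{Bar}_\mu(P)$ if $\Omega=\mathbb{R}^d$), is the solution of
\[ \inf_{\rho\in P_2(\mathbb{R}^d),\ \int_\Omega\rho=1}\ \int_{P_2(\mathbb{R}^d)}W_2^2(\nu,\rho)\,dP(\nu) + \mu\int_\Omega\rho\log(\rho). \tag{7} \]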
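The Euler-Lagrange equation is a (possibly infinite) system of Monge-Ampère equations. Let $\nabla\varphi_\rho^\nu$ denote the OT from $\rho$ to $\nu$. Then
\[ \mu\log(\rho(x)) + \frac{|x|^2}{2} = \int_{P_2(\mathbb{R}^d)}\varphi_\rho^\nu(x)\,dP(\nu) \quad (\text{convex}), \]
i.e.
\[ \rho(x) = \exp\Big(-\frac{|x|^2}{2\mu} + \frac1\mu\int_{P_2(\mathbb{R}^d)}\varphi_\rho^\nu(x)\,dP(\nu)\Big), \]
which couples the family of Monge-Ampère equations
\[ \rho = \det(D^2\varphi_\rho^\nu)\,\nu(\nabla\varphi_\rho^\nu). \]
In particular $\rho$ is less log-concave than a Gaussian, $\log(\rho)$ is locally Lipschitz and has a locally BV gradient, and
\[ \mu\nabla\log(\rho) + \mathrm{id} = \int_{P_2(\mathbb{R}^d)}\nabla\varphi_\rho^\nu(x)\,dP(\nu), \qquad (\nabla\varphi_\rho^\nu)_\#\rho = \nu. \tag{8} \]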
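We expect regularizing effects and global bounds, which ones?
Moment bounds. Let $V:\mathbb{R}^d\to\mathbb{R}_+$ be convex and assume that
\[ \int_{P_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu) < +\infty. \]
On the one hand, convexity (Jensen) and (8) give
\[ \int_{\mathbb{R}^d}V(x+\mu\nabla\log\rho)\,\rho\,dx = \int_{\mathbb{R}^d}V\Big(\int_{P_2(\mathbb{R}^d)}\nabla\varphi_\rho^\nu(x)\,dP(\nu)\Big)\rho(x)\,dx \le \int_{\mathbb{R}^d}\int_{P_2(\mathbb{R}^d)}V(\nabla\varphi_\rho^\nu(x))\,dP(\nu)\,d\rho(x) = \int_{P_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu). \]
On the other hand, if $V$ is $C^{1,1}$ (say), since $V(x+\mu\nabla\log\rho)\ge V(x)+\mu\nabla V(x)\cdot\nabla\log(\rho)$, an integration by parts also gives
\[ \int_{\mathbb{R}^d}V(x+\mu\nabla\log\rho)\,\rho\,dx \ge \int_{\mathbb{R}^d}(V-\mu\Delta V)\,\rho\,dx. \]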
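So
\[ \int_{\mathbb{R}^d}(V-\mu\Delta V)\,\rho\,dx \le \int_{P_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu). \]
For instance, for $V$ quadratic this gives
\[ m_2(\rho) \le 2\mu d + \int_{P_2(\mathbb{R}^d)}m_2(\nu)\,dP(\nu). \]
More interestingly, if $P$ has higher moments, i.e. for some $p>2$, with $m_p(\nu):=\int|x|^p\,d\nu$,
\[ \int_{P_2(\mathbb{R}^d)}m_p(\nu)\,dP(\nu) < +\infty, \]
then $m_p(\rho)<+\infty$ (with an explicit bound).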
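To spell out the quadratic case, take $V(x)=|x|^2$, which is convex with $\Delta V=2d$, and assume for simplicity $\Omega=\mathbb{R}^d$ so that $\int_{\mathbb{R}^d}\rho=1$. The moment inequality then reads:

```latex
\begin{align*}
\int_{\mathbb{R}^d} (V - \mu\Delta V)\,\rho\,dx
  &= \int_{\mathbb{R}^d} |x|^2\rho(x)\,dx - 2\mu d\int_{\mathbb{R}^d}\rho(x)\,dx
   = m_2(\rho) - 2\mu d,\\
\int_{P_2(\mathbb{R}^d)}\int_{\mathbb{R}^d} V\,d\nu\,dP(\nu)
  &= \int_{P_2(\mathbb{R}^d)} m_2(\nu)\,dP(\nu),\\
\text{hence}\qquad m_2(\rho)
  &\le 2\mu d + \int_{P_2(\mathbb{R}^d)} m_2(\nu)\,dP(\nu).
\end{align*}
```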
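Sobolev regularity. Fisher information bound: rewrite (8) as $\mu\nabla\log(\rho)=\int(\nabla\varphi_\rho^\nu-\mathrm{id})\,dP(\nu)$, square, multiply by $\rho$, integrate, and use Jensen and Fubini:
\[ \mu^2\int_{\mathbb{R}^d}|\nabla\log(\rho)|^2\rho\,dx = \int_{\mathbb{R}^d}\Big|\int_{P_2(\mathbb{R}^d)}(\nabla\varphi_\rho^\nu(x)-x)\,dP(\nu)\Big|^2\rho(x)\,dx \le \int_{P_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}|\nabla\varphi_\rho^\nu(x)-x|^2\rho(x)\,dx\,dP(\nu) = \int_{P_2(\mathbb{R}^d)}W_2^2(\rho,\nu)\,dP(\nu), \]
so that $\sqrt\rho\in H^1(\mathbb{R}^d)$. If, for some $p>2$, $\int_{P_2(\mathbb{R}^d)}m_p(\nu)\,dP(\nu)<+\infty$, then by a similar argument and the bound on $m_p(\rho)$ one gets $\rho^{1/p}\in W^{1,p}(\mathbb{R}^d)$. In particular, if $p>d$, $\rho\in C^{0,1-d/p}(\mathbb{R}^d)$.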
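Strong stability. Wasserstein metric between probability measures on $P_2(\mathbb{R}^d)$ (with finite second moments):
\[ \mathcal{W}_2^2(P,Q) = \inf_{\Gamma\in\Pi(P,Q)}\int_{P_2(\mathbb{R}^d)\times P_2(\mathbb{R}^d)}W_2^2(\mu,\nu)\,d\Gamma(\mu,\nu) \tag{9} \]
(in this formula $\mu$ and $\nu$ denote generic elements of $P_2(\mathbb{R}^d)$, not the regularization parameter). This is Wasserstein over Wasserstein. Assume $\mathcal{W}_2(P_n,P)\to0$ and set $\rho_n=\mathrm{Bar}_\mu(P_n)$, $\rho=\mathrm{Bar}_\mu(P)$. Then not only $W_2(\rho_n,\rho)\to0$ but also $\sqrt{\rho_n}\to\sqrt\rho$ strongly in $H^1(\mathbb{R}^d)$.
This is useful for the law of large numbers (see later).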
Maximum principle. Again by the optimality condition (8), looking at a maximum point of $\log(\rho)$ (suitably regularizing the problem if necessary), one gets that if
\[ P\big(\{\nu\in L^\infty(\mathbb{R}^d),\ \nu\le M\}\big) = \alpha > 0 \]
then $\rho\in L^\infty(\mathbb{R}^d)$ with the bound
\[ \rho \le \frac{M}{\alpha^d}. \]
Higher regularity I: the bounded case. If the data are smooth and compactly supported, $\Omega=B=B(0,r)$:
\[ P\big(\{\nu : \mathrm{spt}(\nu)\subset B,\ \|\nu\|_{C^{k,\alpha}(B)}+\|\log(\nu)\|_{L^\infty(B)}\le M\}\big) = 1, \]
then by Caffarelli’s regularity theory one gets
\[ \rho\in C^{k+2,\alpha}(B). \]
Higher regularity II: the log-concave case. Observe that $\log(\rho)$ is less concave than $-\frac{1}{2\mu}|\cdot|^2$. Caffarelli’s contraction principle: the OT between a standard Gaussian $\gamma$ and $e^{-W}\gamma$ with $W$ convex is 1-Lipschitz. We can use this principle here to deduce that, if there is some $A>0$ such that $P$-almost every $\nu$ can be written as $d\nu=e^{-V(y)}dy$ with $D^2V\ge A\,\mathrm{id}$ (in the sense of distributions), then $\log\rho\in C^{1,1}(\mathbb{R}^d)$ and more precisely
\[ -\mathrm{id} \le \mu D^2\log\rho \le \Big(\frac{1}{\sqrt{\mu A}}-1\Big)\mathrm{id}. \tag{10} \]
Asymptotics (LLN, CLT)
A stochastic perspective on the problem (Bigot and Klein, Le Gouic and Loubès, Kroshnin). $P$ is a probability measure on $P_2(\mathbb{R}^d)$ (with a finite second moment), $N$ is large, and $\nu_1,\dots,\nu_N$ are drawn i.i.d. according to $P$. Given $\mu\ge0$ (regularization or not), the empirical barycenter (a random measure) is
\[ \rho_N = \mathrm{Bar}_\mu\Big(\frac1N\sum_{i=1}^N\delta_{\nu_i}\Big) \]
and the true barycenter is
\[ \rho = \mathrm{Bar}_\mu(P). \]
Law of large numbers: a.s. convergence of $\rho_N$ to $\rho$,
• Bigot and Klein, Le Gouic and Loubès: weak convergence, $W_2(\rho_N,\rho)\to0$ a.s.,
• $\mu>0$: strong convergence thanks to the control of the Fisher information (and its convergence), C., Eichinger, Kroshnin.
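The law of large numbers can be watched in a toy case where everything is explicit. The sketch below assumes $d=1$, $\mu=0$ and a hypothetical law $P$ supported by Gaussians $\nu_i=N(m_i,s_i^2)$ with $m_i\sim N(0,1)$ and $s_i\sim\mathrm{Uniform}(0.5,1.5)$; since one-dimensional barycenters average quantile functions, the true barycenter is then $N(0,1)$ and the empirical one is $N(\bar m,\bar s^2)$. The model and the function name are illustrative, not taken from the talk.

```python
import numpy as np

def empirical_barycenter_error(N, rng):
    """W2 distance between the empirical and the true barycenter (d = 1, mu = 0).

    Hypothetical model: nu_i = N(m_i, s_i^2) with m_i ~ N(0, 1) and
    s_i ~ Uniform(0.5, 1.5), so that the true barycenter is N(0, 1).
    """
    m = rng.normal(0.0, 1.0, size=N)
    s = rng.uniform(0.5, 1.5, size=N)
    # Empirical barycenter: N(mean(m), mean(s)^2). For 1D Gaussians,
    # W2^2 = (difference of means)^2 + (difference of standard deviations)^2.
    return float(np.hypot(m.mean() - 0.0, s.mean() - 1.0))

rng = np.random.default_rng(0)
for N in (10, 1000, 100000):
    err = empirical_barycenter_error(N, rng)
    # err should shrink with N, while sqrt(N) * err stays of order one.
    print(N, err, np.sqrt(N) * err)
```

The last printed column rescales the error by $\sqrt N$, which is the normalization relevant for a central limit theorem.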
Can we go one step further: speed of convergence, asymptotic normality of the error (CLT)? In the unregularized case there are serious nonsmoothness issues (free boundary problem for Monge-Ampère): it is not even clear that the barycenter behaves in a (Wasserstein) Lipschitz way with respect to $P$.
What we mean by CLT has to be made precise. There are two ways to look at the Wasserstein space:
• as a (formal) Riemannian manifold (Otto), where the tangent space at $\rho$ consists of $L^2(\rho)$ vector fields (one sees variations of $\rho$ as the image of $\rho$ by the corresponding flow over a short time); Benamou-Brenier dynamical formulation...
• as a convex subset of the vector space of measures (or of a space of more regular densities, $L^2$, $H^1$...).
With the geometric viewpoint, the LLN suggests working in the tangent space $L^2(\rho)$ and considering the OT map $T_N$ from $\rho$ to $\rho_N$ (an $L^2(\rho,\mathbb{R}^d)$-valued random variable, a natural framework for the CLT). Roughly, the LLN expresses that $\|T_N-\mathrm{id}\|_{L^2(\rho)}\to0$. The CLT holds if $h_N:=\sqrt N(T_N-\mathrm{id})$ converges in law to a Gaussian $\mathcal{N}(0,\Sigma)$, where $\Sigma$ is a self-adjoint, nonnegative, trace-class operator. There is no general result of this kind, except in particular cases we considered with Agueh ($\mu=0$):
• $d=1$,
• $P$ is supported by Gaussians.
For $\mu>0$, we saw that there are regularizing effects, so it makes sense to expect that $\sqrt N(\rho_N-\rho)$ is asymptotically Gaussian. This requires control on the linearization of Monge-Ampère.
Theorem 1 (C., Eichinger, Kroshnin). Under suitable assumptions on $P$, $\sqrt N(\rho_N-\rho)$ converges in law in $L^2$ to the Gaussian with covariance $\Sigma=G^{-1}\mathrm{Var}(\varphi_\rho^\nu)G^{-1}$, where
\[ G(u) = \mu\Big(\frac u\rho-\frac{1}{|B|}\int_B\frac u\rho\Big) - \mathbb{E}(\Phi^\nu)'(\rho). \]