Wasserstein barycenters from a PDE perspective · 2020-06-12 · 1 Wasserstein barycenters from a...


Wasserstein barycenters from a PDE perspective

Guillaume Carlier (a)

Based on joint works with M. Agueh, K. Eichinger and A. Kroshnin.

Erlangen, FAU, CAA online seminar, June 9th 2020.

(a) CEREMADE, Université Paris Dauphine and MOKAPLAN (Inria-Dauphine).


Outline

➀ Definition and characterization

➁ Integrability, convexity

➂ Numerical approximation

➃ Regularized barycenters

➄ Asymptotics (LLN, CLT)


Definition and characterization

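Quadratic Wasserstein distance: let $\mathcal{P}_2(\mathbb{R}^d)$ be the set of probability measures on $\mathbb{R}^d$ with a finite second moment, and let $\rho_0$ and $\rho_1$ be in $\mathcal{P}_2(\mathbb{R}^d)$. The 2-Wasserstein distance $W_2(\rho_0,\rho_1)$ between $\rho_0$ and $\rho_1$ is given by

$$W_2^2(\rho_0,\rho_1) = \inf_{\gamma\in\Pi(\rho_0,\rho_1)} \int_{\mathbb{R}^d\times\mathbb{R}^d} |x-y|^2 \, d\gamma(x,y)$$

where $\Pi(\rho_0,\rho_1)$ is the set of transport plans between $\rho_0$ and $\rho_1$, i.e. probability measures on $\mathbb{R}^d\times\mathbb{R}^d$ having $\rho_0$ and $\rho_1$ as marginals. $W_2$ is a distance on $\mathcal{P}_2(\mathbb{R}^d)$; $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ is the Wasserstein space.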
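As a small illustration of the definition, here is a minimal sketch for the simplest case: two empirical measures on the real line with the same number of equally weighted atoms. In that case the optimal plan pairs the sorted atoms, so no linear program is needed. The function name `w2_empirical_1d` and the sample values are illustrative assumptions, not part of the talk.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two empirical measures on R with the same number of
    equally weighted atoms: sort both samples and pair them in order."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return np.sqrt(np.mean((x - y) ** 2))

# Illustrative data: the order in which atoms are listed is irrelevant.
a = np.array([0.0, 1.0, 3.0])
b = np.array([5.0, 2.0, 1.0])
print(w2_empirical_1d(a, b))
```

For measures with unequal weights or in dimension $d > 1$ this shortcut no longer applies and one has to solve the transport problem itself.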



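Let $T:\mathbb{R}^d\to\mathbb{R}^k$ be Borel and $\nu\in\mathcal{P}(\mathbb{R}^d)$. The push-forward of $\nu$ through $T$ (or image measure) is the measure $T_\#\nu\in\mathcal{P}(\mathbb{R}^k)$ defined by

$$T_\#\nu(A) = \nu(T^{-1}(A)) \quad \text{for every Borel subset } A \text{ of } \mathbb{R}^k,$$

or, equivalently,

$$\int_{\mathbb{R}^k} \varphi \, dT_\#\nu = \int_{\mathbb{R}^d} \varphi(T(x)) \, d\nu(x)$$

for every $\varphi\in C_b(\mathbb{R}^k)$. One says that $T$ transports $\nu$ to $\rho$ if $T_\#\nu=\rho$.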
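To make the two equivalent definitions concrete, the following sketch computes the push-forward of a discrete measure and checks the change-of-variables identity for one test function. The atoms, weights, map `T` and test function are arbitrary choices made for illustration.

```python
import numpy as np

# Discrete measure nu = sum_j w_j * delta_{x_j} on R (illustrative values).
atoms = np.array([-2.0, -1.0, 1.0, 2.0])
weights = np.array([0.1, 0.4, 0.3, 0.2])

def T(x):
    # A non-injective map, so that several atoms share the same image.
    return x ** 2

def push_forward(atoms, weights, T):
    """Atoms and weights of T#nu: atoms with the same image are merged."""
    images = T(atoms)
    new_atoms, inverse = np.unique(images, return_inverse=True)
    new_weights = np.bincount(inverse, weights=weights)
    return new_atoms, new_weights

y, p = push_forward(atoms, weights, T)

# Check: integral of phi against T#nu equals integral of phi(T(x)) against nu.
phi = np.cos
lhs = np.sum(phi(y) * p)
rhs = np.sum(phi(T(atoms)) * weights)
print(y, p, lhs, rhs)
```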



Brenier, McCann: if $\rho_0$ does not charge Lipschitz hypersurfaces, the optimal transport problem has a unique solution, characterized by $\gamma=(\mathrm{id},\nabla\varphi)_\#\rho_0$ with $\varphi$ convex. Brenier's map $\nabla\varphi$ is the extension of the monotone rearrangement to several dimensions. Link with Monge-Ampère:

$$\det(D^2\varphi)\,\rho_1(\nabla\varphi) = \rho_0, \quad \varphi \text{ convex}.$$

Regularity theory (Caffarelli, Figalli, De Philippis).

Interpolation (McCann): the curve of measures $t\in[0,1]\mapsto \rho_t = ((1-t)\,\mathrm{id}+t\nabla\varphi)_\#\rho_0$ is the geodesic between $\rho_0$ and $\rho_1$.
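A quick sanity check of the geodesic property, in a case where everything is explicit: for one-dimensional Gaussians the Brenier map is affine and $W_2^2$ is the sum of the squared differences of means and standard deviations. The sketch below, with hypothetical helper names, verifies that $W_2(\rho_0,\rho_t) = t\,W_2(\rho_0,\rho_1)$ along McCann's interpolation.

```python
import numpy as np

def w2_gauss_1d(m0, s0, m1, s1):
    """Closed-form W2 between N(m0, s0^2) and N(m1, s1^2) on the real line."""
    return np.hypot(m0 - m1, s0 - s1)

def mccann_interpolant(m0, s0, m1, s1, t):
    """Mean and standard deviation of rho_t = ((1-t) id + t T)#rho_0, where
    T(x) = m1 + (s1 / s0) * (x - m0) is the monotone (Brenier) map."""
    return (1 - t) * m0 + t * m1, (1 - t) * s0 + t * s1

# Illustrative endpoints N(0, 1) and N(3, 4).
for t in (0.0, 0.25, 0.5, 1.0):
    m_t, s_t = mccann_interpolant(0.0, 1.0, 3.0, 2.0, t)
    print(t, w2_gauss_1d(0.0, 1.0, m_t, s_t), t * w2_gauss_1d(0.0, 1.0, 3.0, 2.0))
```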



Given $(\nu_1,\dots,\nu_N)\in\mathcal{P}_2(\mathbb{R}^d)^N$ and weights $\lambda_i>0$ with $\sum_{i=1}^N\lambda_i=1$, the Wasserstein barycenter problem is

$$\inf_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} \sum_{i=1}^N \lambda_i W_2^2(\nu_i,\rho). \tag{1}$$

This is a convex problem, existence is easy, and uniqueness holds as soon as one of the measures $\nu_i$ does not charge hypersurfaces (in this case $\rho\mapsto W_2^2(\rho,\nu_i)$ is strictly convex). The solution is the Wasserstein barycenter. No regularizing effect. Special instance of a Fréchet mean.



Variants:

• Riemannian manifold instead of Rd, Kim and Pass.

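• Barycenter of a general probability measure $P$ over $\mathcal{P}_2(\mathbb{R}^d)$ with

$$\int_{\mathcal{P}_2(\mathbb{R}^d)} m_2(\nu)\,dP(\nu) < +\infty, \qquad m_2(\nu) := \int_{\mathbb{R}^d} |x|^2\,d\nu(x),$$

$$\inf_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} \int_{\mathcal{P}_2(\mathbb{R}^d)} W_2^2(\nu,\rho)\,dP(\nu). \tag{2}$$

Bigot and Klein, Loubès and Le Gouic.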



Characterization of the barycenter $\rho$. Kantorovich duality formula:

$$\frac12 W_2^2(\rho,\nu_i) = \sup\Big\{ \int_{\mathbb{R}^d} u_i\,d\rho + \int_{\mathbb{R}^d} v_i\,d\nu_i \;:\; u_i(x)+v_i(y) \le \frac12|x-y|^2 \Big\},$$

achieved by a pair of potentials $(u_i,v_i)$ which are semiconcave and related through

$$\varphi_i := \frac12|\cdot|^2 - u_i \ \text{ convex, with } \ \varphi_i^* := \frac12|\cdot|^2 - v_i.$$

Optimal transport plan $\gamma_i$ between $\rho$ and $\nu_i$: on the support of $\gamma_i$ the constraint is an equality, $u_i(x)+v_i(y)=\frac12|x-y|^2$, i.e. $\varphi_i(x)+\varphi_i^*(y)=x\cdot y$, i.e.

$$y\in\partial\varphi_i(x).$$



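Optimality condition for the barycenter $\rho$:

$$\sum_{i=1}^N \lambda_i u_i \ge 0, \quad \text{with equality on } \operatorname{spt}(\rho),$$

i.e.

$$\frac12|x|^2 \ge \sum_{i=1}^N \lambda_i\varphi_i(x) \quad \forall x\in\mathbb{R}^d, \ \text{with equality on } \operatorname{spt}(\rho).$$

Since the $\varphi_i$'s are convex, they are all differentiable on the contact set, in particular everywhere on $\operatorname{spt}(\rho)$, and

$$x = \sum_{i=1}^N \lambda_i\nabla\varphi_i(x) \quad \forall x\in\operatorname{spt}(\rho), \tag{3}$$

and $(\nabla\varphi_i)_\#\rho=\nu_i$: there exists a (Lipschitz) optimal transport map from the barycenter to each of the fixed measures $\nu_i$ (no assumption on $\nu_i$).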



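The barycenter (actually its support) is characterized by a free-boundary problem for a system of Monge-Ampère equations: $\varphi_i$ convex,

$$\frac12|x|^2 \ge \sum_{i=1}^N \lambda_i\varphi_i(x) \quad \forall x\in\mathbb{R}^d, \ \text{with equality on } \operatorname{spt}(\rho),$$

and

$$\det(D^2\varphi_i)\,\nu_i(\nabla\varphi_i) = \rho, \quad i=1,\dots,N.$$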



Explicit examples:

• $N=2$, two measures: their barycenters coincide with McCann's interpolation,

• $d=1$, $\nu_1$ atomless, $T_i$ monotone with $T_{i\#}\nu_1=\nu_i$: $\rho=\big(\sum_{i=1}^N\lambda_iT_i\big)_\#\nu_1$ (in particular barycenters are associative: false in higher dimensions),

• barycenters of Gaussians are Gaussians.
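The one-dimensional formula can be tried out directly on samples: averaging monotone maps amounts to averaging quantile functions, which for equally sized samples means sorting each sample and taking the weighted average atom by atom. The sketch below does this for three Gaussians; the function name `barycenter_1d`, the parameters and the sample size are illustrative assumptions. Since the barycenter of the $N(m_i,s_i^2)$ in one dimension is $N(\sum_i\lambda_im_i,(\sum_i\lambda_is_i)^2)$, the printed mean and standard deviation should be close to 1.9 and 1.4 here.

```python
import numpy as np

def barycenter_1d(samples, weights):
    """Atoms of the W2 barycenter of empirical measures on R.

    samples: sequence of 1D arrays, all of the same length n.
    weights: nonnegative weights lambda_i summing to 1.
    Sorting each sample gives its quantiles; the barycenter's quantiles
    are the weighted averages of these.
    """
    sorted_samples = np.sort(np.asarray(samples, dtype=float), axis=1)
    return np.asarray(weights, dtype=float) @ sorted_samples

rng = np.random.default_rng(0)
n = 20000
params = [(-2.0, 0.5), (1.0, 1.0), (4.0, 2.0)]   # illustrative (mean, std) pairs
lam = np.array([0.2, 0.3, 0.5])
samples = [m + s * rng.standard_normal(n) for m, s in params]

bary = barycenter_1d(samples, lam)
print(bary.mean(), bary.std())
```

In dimension $d>1$ there is no such shortcut, which is one motivation for the numerical methods discussed later in the talk.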



Integrability, convexity

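McCann's displacement convexity: given $\nu_1$, $\nu_2$ and the optimal transport $\nabla\varphi$ between them, set $\rho_\lambda=((1-\lambda)\,\mathrm{id}+\lambda\nabla\varphi)_\#\nu_1$, $\lambda\in[0,1]$. A functional $E:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}\cup\{+\infty\}$ is displacement convex if

$$E(\rho_\lambda) \le (1-\lambda)E(\nu_1) + \lambda E(\nu_2).$$

$E$ is convex along barycenters if, for $\rho$ a barycenter of the $\nu_i$'s with weights $\lambda_i$, one has

$$E(\rho) \le \sum_{i=1}^N \lambda_iE(\nu_i).$$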



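Moments: let $V:\mathbb{R}^d\to\mathbb{R}$ be convex and such that $\int_{\mathbb{R}^d}V\,d\nu_i<+\infty$ for every $i$. Recall (3) and $(\nabla\varphi_i)_\#\rho=\nu_i$:

$$\begin{aligned}
\int_{\mathbb{R}^d} V\,d\rho &= \int_{\mathbb{R}^d} V\Big(\sum_{i=1}^N\lambda_i\nabla\varphi_i(x)\Big)\,d\rho(x) \\
&\le \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d} V(\nabla\varphi_i(x))\,d\rho(x) \\
&= \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d} V\,d\nu_i.
\end{aligned}$$

In particular

$$m_2(\rho) \le \sum_{i=1}^N\lambda_im_2(\nu_i), \qquad \int_{\mathbb{R}^d}x\,d\rho(x) = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}y\,d\nu_i(y).$$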



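Integral estimates: consider an internal energy $E(\rho)=\int_{\mathbb{R}^d}U(\rho(x))\,dx$, where $U:\mathbb{R}_+\to\mathbb{R}$, $U(0)=0$, satisfies McCann's condition:

$$t\mapsto t^dU(t^{-d}) \ \text{ is convex and nonincreasing},$$

e.g. $U(t)=t^p$, $p>1$, or $U(t)=t\log(t)$. McCann showed displacement convexity of such energies. This generalizes to barycenters: assume the $\nu_i$ are absolutely continuous with finite energy. The change of variables $y=\nabla\varphi_i(x)$ together with the Monge-Ampère equation $\det(D^2\varphi_i)\,\nu_i(\nabla\varphi_i)=\rho$ gives

$$\sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U\Big(\frac{\rho}{\det D^2\varphi_i}\Big)\det D^2\varphi_i\,dx.$$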



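Set $\Phi_\rho(t):=t^dU(\rho\,t^{-d})$. Then, by convexity of $\Phi_\rho$,

$$\sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx = \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}\Phi_\rho\big((\det D^2\varphi_i)^{1/d}\big)\,dx \ \ge\ \int_{\mathbb{R}^d}\Phi_\rho\Big(\sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d}\Big)\,dx.$$

Since $D^2\varphi_i\ge0$, Minkowski's concavity inequality together with the optimality condition $\sum_i\lambda_iD^2\varphi_i=\mathrm{id}$ on $\operatorname{spt}(\rho)$ gives

$$\sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d} \le \Big(\det\sum_{i=1}^N\lambda_iD^2\varphi_i\Big)^{1/d} = 1,$$

so that, since $\Phi_\rho$ is nonincreasing,

$$\Phi_\rho\Big(\sum_{i=1}^N\lambda_i(\det D^2\varphi_i)^{1/d}\Big) \ge \Phi_\rho(1) = U(\rho).$$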
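The determinant inequality used in this step is easy to test numerically. The sketch below draws random symmetric positive definite matrices (stand-ins for the Hessians $D^2\varphi_i$ at a point of the support), rescales them so that their weighted average is the identity, as in the optimality condition, and evaluates $\sum_i\lambda_i(\det A_i)^{1/d}$, which Minkowski's inequality bounds by 1. The helper name `minkowski_check` and the random setup are assumptions made for this illustration.

```python
import numpy as np

def minkowski_check(d=3, n_mats=4, seed=0):
    """Return sum_i lam_i det(A_i)^(1/d) and sum_i lam_i A_i for random
    SPD matrices A_i normalised so that their weighted average is id."""
    rng = np.random.default_rng(seed)
    lam = rng.dirichlet(np.ones(n_mats))          # weights summing to 1
    mats = []
    for _ in range(n_mats):
        a = rng.standard_normal((d, d))
        mats.append(a @ a.T + 1e-3 * np.eye(d))   # symmetric positive definite
    s = sum(l * m for l, m in zip(lam, mats))
    # Conjugate by S^(-1/2) so that the weighted average becomes the identity.
    w, v = np.linalg.eigh(s)
    s_inv_sqrt = (v / np.sqrt(w)) @ v.T
    hessians = [s_inv_sqrt @ m @ s_inv_sqrt for m in mats]
    average = sum(l * h for l, h in zip(lam, hessians))
    lhs = sum(l * np.linalg.det(h) ** (1.0 / d) for l, h in zip(lam, hessians))
    return lhs, average

lhs, average = minkowski_check()
print(lhs)   # expected to be at most 1, by Minkowski's inequality
```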



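The desired convexity inequality follows:

$$\int_{\mathbb{R}^d}U(\rho(x))\,dx \le \sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d}U(\nu_i(x))\,dx.$$

In particular:

• $p\in(1,+\infty)$, $\nu_i\in L^p$ ⇒ $\rho\in L^p$ with $\|\rho\|_{L^p}^p\le\sum_i\lambda_i\|\nu_i\|_{L^p}^p$,

• if the $\nu_i$ have finite entropy, so does $\rho$, and $\mathrm{Ent}(\rho)\le\sum_{i=1}^N\lambda_i\mathrm{Ent}(\nu_i)$,

• if $d=1$, one can use negative powers as well.

Limit cases:

• $\nu_i\in L^1$ ⇒ $\rho\in L^1$,

• $\nu_1\in L^\infty$ ⇒ $\rho\in L^\infty$ with $\|\rho\|_{L^\infty}\le\lambda_1^{-d}\|\nu_1\|_{L^\infty}$,

• nothing known about regularity: $\nu_i\in C^{k,\alpha}$ ⇒ $\rho\in C^{k,\alpha}$?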



Numerical approximation

Reformulate the barycenter problem (1) as a linear problem in terms of transport plans $\gamma_i$ between the fixed measures $\nu_i$ and the unknown barycenter $\rho$: minimize with respect to $(\gamma_1,\dots,\gamma_N)\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)^N$ the cost

$$\sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d\times\mathbb{R}^d}\frac12|x_i-x|^2\,d\gamma_i(x_i,x)$$

subject to the fixed marginal constraints ($\pi_1(a,b)=a$ and $\pi_2(a,b)=b$ denote the canonical projections):

$$\pi_{1\#}\gamma_i = \nu_i, \quad i=1,\dots,N, \tag{4}$$



as well as the constraint that all the $\gamma_i$'s share the same second marginal (which is the barycenter):

$$\pi_{2\#}\gamma_1 = \pi_{2\#}\gamma_2 = \dots = \pi_{2\#}\gamma_N \ (=\rho). \tag{5}$$

Entropic regularization: for $\varepsilon>0$, minimize under the same constraints

$$\sum_{i=1}^N\lambda_i\int_{\mathbb{R}^d\times\mathbb{R}^d}\Big(\frac12|x_i-x|^2+\varepsilon\log\gamma_i(x_i,x)\Big)\,d\gamma_i(x_i,x),$$

which is an entropic projection problem:

$$\inf\ \sum_{i=1}^N\lambda_iH(\gamma_i|\theta_i), \qquad \theta_i(x_i,x)=\exp\Big(-\frac{|x_i-x|^2}{2\varepsilon}\Big),$$

where $H$ denotes the relative entropy (aka Kullback-Leibler divergence). In other words, we have to Kullback-Leibler project the Gaussian kernels $\theta_i$ onto the linear constraints (4)-(5).



This is a popular, simple and efficient approximation method for computational OT: the Sinkhorn scaling algorithm (Cuturi; Cuturi and Peyré's recent book). Alternate projection algorithm: perform one projection at a time (same as coordinate descent, aka nonlinear Gauss-Seidel, on the dual). The key point is that these projections are totally explicit. KL projection onto a fixed marginal constraint:

$$\inf\Big\{ H(\gamma_1|\theta_1)=\int\Big(\log\Big(\frac{\gamma_1}{\theta_1}\Big)-1\Big)\gamma_1 \;:\; \pi_{1\#}\gamma_1=\nu_1 \Big\}.$$

Optimality condition (with $u_1$ a Lagrange multiplier for the fixed marginal constraint):

$$\log\Big(\frac{\gamma_1(x_1,x)}{\theta_1(x_1,x)}\Big) = u_1(x_1), \quad \text{i.e.} \quad \gamma_1(x_1,x)=a_1(x_1)\,\theta_1(x_1,x).$$



One finds $a_1$ thanks to the marginal constraint:

$$a_1(x_1) = \frac{\nu_1(x_1)}{\int_{\mathbb{R}^d}\theta_1(x_1,x)\,dx};$$

this is just a scaling as in Sinkhorn. Consider now the KL projection onto the common marginal constraint (5):

$$\inf\ \sum_{i=1}^N\lambda_iH(\gamma_i|\theta_i) \quad \text{subject to (5)}.$$

Trick: use Lagrange multipliers by observing that (5) is the same as:

$$\sum_{i=1}^Nu_i=0 \ \Rightarrow\ \sum_{i=1}^N\int_{\mathbb{R}^d\times\mathbb{R}^d}u_i(x)\,d\gamma_i(x_i,x)=0.$$



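Optimality conditions:

$$\lambda_i\log\Big(\frac{\gamma_i(x_i,x)}{\theta_i(x_i,x)}\Big)=u_i(x), \qquad \sum_{i=1}^Nu_i=0,$$

that is

$$\gamma_i(x_i,x)=\theta_i(x_i,x)\,a_i(x)^{1/\lambda_i}, \qquad \prod_{i=1}^Na_i=1.$$

Now we solve using the constraint that

$$\rho(x)=\int_{\mathbb{R}^d}\gamma_i(x_i,x)\,dx_i = a_i(x)^{1/\lambda_i}\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i, \quad i=1,\dots,N,$$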



and find an explicit geometric mean formula for $\rho$:

$$\rho(x)=\prod_{i=1}^N\rho(x)^{\lambda_i} = \prod_{i=1}^Na_i(x)\Big(\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i\Big)^{\lambda_i} = \prod_{i=1}^N\Big(\int_{\mathbb{R}^d}\theta_i(x_i,x)\,dx_i\Big)^{\lambda_i}.$$

All steps of Sinkhorn for barycenters are totally explicit. Easy to implement (a few lines of code), converges pretty fast (for $\varepsilon$ not too small). The main cost is computing the integrals (can be parallelized, well suited for GPU).
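To illustrate how short such an implementation can be, here is a minimal sketch of the alternate projection scheme on a one-dimensional grid shared by all measures. It is not the speaker's code: the function name `entropic_barycenter`, the grid, the value of $\varepsilon$ and the test histograms are assumptions. Each plan is stored implicitly as $\gamma_i = \mathrm{diag}(a_i)\,\theta\,\mathrm{diag}(b_i)$; the projection onto (4) rescales $a_i$, and the projection onto (5) sets the common marginal to the weighted geometric mean of the current second marginals.

```python
import numpy as np

def entropic_barycenter(nus, lam, x, eps, n_iter=500):
    """Alternate KL projections onto the fixed-marginal constraints (4)
    and the common-marginal constraint (5), on a grid x.

    nus: array (N, n) of strictly positive histograms summing to 1.
    lam: weights (N,) summing to 1.
    Returns the current common second marginal rho, array (n,).
    """
    nus = np.asarray(nus, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # Gaussian kernel theta[j, k] = exp(-|x_j - x_k|^2 / (2 eps)).
    theta = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * eps))
    b = np.ones_like(nus)                # scalings in the barycenter variable
    for _ in range(n_iter):
        # Projection onto (4): a_i = nu_i / (theta b_i).
        a = nus / (b @ theta.T)
        # (theta^T a_i)(x); b_i * theta^T a_i is the second marginal of plan i.
        theta_t_a = a @ theta
        # Projection onto (5): weighted geometric mean of the second marginals.
        rho = np.exp(lam @ np.log(b * theta_t_a))
        b = rho / theta_t_a
    return rho

# Illustrative data: two discretised Gaussians, symmetric about the origin.
x = np.linspace(-3.0, 3.0, 100)

def gauss_hist(m, s):
    g = np.exp(-(x - m) ** 2 / (2 * s ** 2))
    return g / g.sum()

rho = entropic_barycenter([gauss_hist(-1.0, 0.4), gauss_hist(1.0, 0.4)],
                          [0.5, 0.5], x, eps=0.1)
print(rho.sum(), np.sum(x * rho))
```

For small $\varepsilon$ the kernel underflows and the scalings become unstable; practical implementations then work in the log domain, which this sketch deliberately leaves out.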



Regularized barycenters

Given $\Omega$ an open convex subset of $\mathbb{R}^d$ (can be the whole of $\mathbb{R}^d$) and $\mu>0$, the entropic barycenter problem is

$$\inf_{\rho\in\mathcal{P}_2(\mathbb{R}^d),\ \int_\Omega\rho=1}\ \sum_{i=1}^N\lambda_iW_2^2(\nu_i,\rho)+\mu\int_\Omega\rho\log(\rho). \tag{6}$$

Proposed by Bigot, Cazelles and Papadakis for numerical (avoids discretization artifacts) and statistical purposes. Forces full support in $\Omega$. Different from the entropic regularization using plans in Sinkhorn. Unique solution: the entropic regularized barycenter. As we shall see, this is a purely PDE problem.



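More generally, let $P$ be a Borel probability measure over the Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$ with

$$\int_{\mathcal{P}_2(\mathbb{R}^d)}m_2(\nu)\,dP(\nu)<+\infty.$$

The entropic regularized barycenter of $P$, $\rho=\mathrm{Bar}_{\mu,\Omega}(P)$ (just $\mathrm{Bar}_\mu(P)$ if $\Omega=\mathbb{R}^d$), is the solution of

$$\inf_{\rho\in\mathcal{P}_2(\mathbb{R}^d),\ \int_\Omega\rho=1}\ \int_{\mathcal{P}_2(\mathbb{R}^d)}W_2^2(\nu,\rho)\,dP(\nu)+\mu\int_\Omega\rho\log(\rho). \tag{7}$$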



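The Euler-Lagrange equation is a (possibly infinite) system of Monge-Ampère equations. Let $\nabla\varphi^\nu_\rho$ denote the OT from $\rho$ to $\nu$. Then

$$\mu\log(\rho(x))+\frac{|x|^2}{2} = \int_{\mathcal{P}_2(\mathbb{R}^d)}\varphi^\nu_\rho(x)\,dP(\nu) \quad \text{(convex)},$$

i.e.

$$\rho(x)=\exp\Big(-\frac{|x|^2}{2\mu}+\frac1\mu\int_{\mathcal{P}_2(\mathbb{R}^d)}\varphi^\nu_\rho(x)\,dP(\nu)\Big),$$

which couples the family of Monge-Ampère equations:

$$\rho=\det(D^2\varphi^\nu_\rho)\,\nu(\nabla\varphi^\nu_\rho).$$

In particular $\rho$ is less log-concave than a Gaussian, $\log(\rho)$ is locally Lipschitz, has a locally BV gradient, and

$$\mu\nabla\log(\rho)+\mathrm{id} = \int_{\mathcal{P}_2(\mathbb{R}^d)}\nabla\varphi^\nu_\rho\,dP(\nu), \qquad (\nabla\varphi^\nu_\rho)_\#\rho=\nu. \tag{8}$$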
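A short worked example may help to read (8). It is a sketch under assumptions that are not in the slides: $d=1$, $\Omega=\mathbb{R}$, $P$ concentrated on centered Gaussians $\nu=N(0,s_\nu^2)$ with $\bar s:=\int s_\nu\,dP(\nu)>0$, and a Gaussian ansatz $\rho=N(0,\sigma^2)$. Equation (8) is taken at face value, including its normalization of the transport cost.

```latex
% Assumptions (illustrative, not from the talk): d = 1, \Omega = \mathbb{R},
% P-a.e. \nu = N(0, s_\nu^2), ansatz \rho = N(0, \sigma^2).
\begin{align*}
\nabla\log\rho(x) &= -\frac{x}{\sigma^{2}}, \qquad
\nabla\varphi^{\nu}_{\rho}(x) = \frac{s_\nu}{\sigma}\,x
  \quad\text{(monotone map from } N(0,\sigma^2) \text{ to } N(0,s_\nu^2)\text{)}, \\
\text{(8)}\ \Longrightarrow\quad
-\frac{\mu}{\sigma^{2}}\,x + x &= \frac{\bar s}{\sigma}\,x
  \quad\text{for all } x, \\
\sigma^{2} - \bar s\,\sigma - \mu &= 0
\quad\Longrightarrow\quad
\sigma = \frac{\bar s + \sqrt{\bar s^{2} + 4\mu}}{2}.
\end{align*}
```

Under these assumptions $\sigma\to\bar s$ as $\mu\to0$, which recovers the unregularized one-dimensional barycenter of Gaussians, while for $\mu>0$ the variance is inflated: $\sigma^2=\bar s\,\sigma+\mu\le\bar s^2+2\mu$.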



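We expect regularizing effects and global bounds: which ones?

Moment bounds. Let $V:\mathbb{R}^d\to\mathbb{R}_+$ be convex and assume that

$$\int_{\mathcal{P}_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu)<+\infty.$$

On the one hand, convexity (Jensen's inequality) and (8) give

$$\begin{aligned}
\int_{\mathbb{R}^d}V(x+\mu\nabla\log\rho)\,\rho &= \int_{\mathbb{R}^d}V\Big(\int_{\mathcal{P}_2(\mathbb{R}^d)}\nabla\varphi^\nu_\rho(x)\,dP(\nu)\Big)\rho(x)\,dx \\
&\le \int_{\mathbb{R}^d}\int_{\mathcal{P}_2(\mathbb{R}^d)}V(\nabla\varphi^\nu_\rho(x))\,dP(\nu)\,d\rho(x) = \int_{\mathcal{P}_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu).
\end{aligned}$$

On the other hand, if $V$ is $C^{1,1}$ (say), since $V(x+\mu\nabla\log\rho)\ge V(x)+\mu\nabla V(x)\cdot\nabla\log(\rho)$, an integration by parts (using $\rho\nabla\log\rho=\nabla\rho$) also gives

$$\int_{\mathbb{R}^d}V(x+\mu\nabla\log\rho)\,\rho \ \ge\ \int_{\mathbb{R}^d}(V-\mu\Delta V)\,\rho.$$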



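So

$$\int_{\mathbb{R}^d}(V-\mu\Delta V)\,\rho \ \le\ \int_{\mathcal{P}_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}V\,d\nu\,dP(\nu);$$

for instance, for $V$ quadratic this gives

$$m_2(\rho)\le2\mu d+\int_{\mathcal{P}_2(\mathbb{R}^d)}m_2(\nu)\,dP(\nu).$$

More interestingly, if $P$ has higher moments, i.e. for some $p>2$,

$$m_p(\nu):=\int|x|^p\,d\nu \quad\text{and}\quad \int_{\mathcal{P}_2(\mathbb{R}^d)}m_p(\nu)\,dP(\nu)<+\infty,$$

then $m_p(\rho)<+\infty$ (with an explicit bound).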



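Sobolev regularity. Fisher information bound: take the square of (8) to express $\mu^2|\nabla\log(\rho)|^2$, multiply by $\rho$, integrate and use Jensen and Fubini:

$$\begin{aligned}
\mu^2\int_{\mathbb{R}^d}|\nabla\log(\rho)|^2\rho &= \int_{\mathbb{R}^d}\Big|\int_{\mathcal{P}_2(\mathbb{R}^d)}(\nabla\varphi^\nu_\rho(x)-x)\,dP(\nu)\Big|^2\rho(x)\,dx \\
&\le \int_{\mathcal{P}_2(\mathbb{R}^d)}\int_{\mathbb{R}^d}|\nabla\varphi^\nu_\rho(x)-x|^2\rho(x)\,dx\,dP(\nu) \\
&= \int_{\mathcal{P}_2(\mathbb{R}^d)}W_2^2(\nu,\rho)\,dP(\nu),
\end{aligned}$$

so that $\sqrt\rho\in H^1(\mathbb{R}^d)$. If, for some $p>2$, $\int_{\mathcal{P}_2(\mathbb{R}^d)}m_p(\nu)\,dP(\nu)<+\infty$, then by a similar argument and the bound on $m_p(\rho)$ one gets $\rho^{1/p}\in W^{1,p}(\mathbb{R}^d)$. In particular if $p>d$, $\rho\in C^{0,1-d/p}(\mathbb{R}^d)$.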



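Strong stability. Wasserstein metric between measures on $\mathcal{P}_2(\mathbb{R}^d)$ (with finite second moments):

$$\mathcal{W}_2^2(P,Q)=\inf_{\Gamma\in\Pi(P,Q)}\int_{\mathcal{P}_2(\mathbb{R}^d)\times\mathcal{P}_2(\mathbb{R}^d)}W_2^2(\mu,\nu)\,d\Gamma(\mu,\nu) \tag{9}$$

(in (9), $\mu$ and $\nu$ are generic elements of $\mathcal{P}_2(\mathbb{R}^d)$, not the regularization parameter). Wasserstein over Wasserstein. Assume $\mathcal{W}_2^2(P_n,P)\to0$ and set $\rho_n=\mathrm{Bar}_\mu(P_n)$, $\rho=\mathrm{Bar}_\mu(P)$; then not only $W_2(\rho_n,\rho)\to0$ but also $\sqrt{\rho_n}\to\sqrt\rho$ strongly in $H^1(\mathbb{R}^d)$.

Useful for the law of large numbers (see later).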



Maximum principle. Again by the optimality condition (8), looking at a maximum point of $\log(\rho)$ (suitably regularizing the problem if necessary), one gets that if

$$P(\{\nu\in L^\infty(\mathbb{R}^d),\ \nu\le M\})=\alpha>0$$

then $\rho\in L^\infty(\mathbb{R}^d)$ with the bound

$$\rho\le\frac{M}{\alpha^d}.$$



Higher regularity I: the bounded case. If the data are smooth and compactly supported, $\Omega=B=B(0,r)$:

$$P\big(\{\nu:\ \operatorname{spt}(\nu)\subset B,\ \|\nu\|_{C^{k,\alpha}(B)}+\|\log(\nu)\|_{L^\infty(B)}\le M\}\big)=1,$$

then by Caffarelli's regularity theory one gets

$$\rho\in C^{k+2,\alpha}(B).$$



Higher regularity II: the log-concave case. Observe that $\log(\rho)$ is less concave than $-\frac{1}{2\mu}|\cdot|^2$. Caffarelli's contraction principle: the OT between a standard Gaussian $\gamma$ and $e^{-W}\gamma$ with $W$ convex is 1-Lipschitz. We can use this principle here to deduce that, if there is some $A>0$ such that $P$-almost every $\nu$ writes as $d\nu=e^{-V(y)}dy$ with $D^2V\ge A\,\mathrm{id}$ (in the sense of distributions), then $\log\rho\in C^{1,1}(\mathbb{R}^d)$ and more precisely there holds

$$-\mathrm{id}\ \le\ \mu D^2\log\rho\ \le\ \Big(\frac{1}{\sqrt{\mu A}}-1\Big)\mathrm{id}. \tag{10}$$



Asymptotics (LLN, CLT)

A stochastic perspective on the problem: Bigot and Klein, Le Gouic and Loubès, Kroshnin. $P$ is a probability measure on $\mathcal{P}_2(\mathbb{R}^d)$ (with a finite second moment), $N$ is large, and $\nu_1,\dots,\nu_N$ are drawn i.i.d. according to $P$. The empirical barycenter (a random measure), with $\mu\ge0$ (regularization or not), is

$$\rho_N=\mathrm{Bar}_\mu\Big(\frac1N\sum_{i=1}^N\delta_{\nu_i}\Big)$$

and the true barycenter is

$$\rho=\mathrm{Bar}_\mu(P).$$



Law of large numbers, a.s. convergence of $\rho_N$ to $\rho$:

• Bigot and Klein, Le Gouic and Loubès: weak convergence, $W_2(\rho_N,\rho)\to0$ a.s.,

• $\mu>0$: strong convergence thanks to the control of the Fisher information (and its convergence), C., Eichinger, Kroshnin.

Can we go one step further: speed of convergence, asymptotic normality of the error (CLT)? In the unregularized case there are serious nonsmoothness issues (free boundary problem for Monge-Ampère); it is not even clear that the barycenter behaves in a (Wasserstein) Lipschitz way with respect to $P$.
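The law of large numbers can be watched in the one case where everything is explicit: $d=1$, $\mu=0$ and $P$ supported by Gaussians. There the barycenter of $N(m_i,s_i^2)$ with equal weights is the Gaussian whose mean and standard deviation are the averages of the $m_i$ and $s_i$, so the $W_2$ error reduces to an error between averages. The law chosen for $(m,s)$ below and the function name are assumptions made for this illustration.

```python
import numpy as np

def empirical_barycenter_error(n_measures, rng):
    """W2 distance between the empirical and the population barycenter when
    P draws N(m, s^2) on the real line with m ~ N(0, 1) and s ~ U(0.5, 1.5),
    without regularization (mu = 0)."""
    m = rng.standard_normal(n_measures)
    s = rng.uniform(0.5, 1.5, size=n_measures)
    # Empirical barycenter: N(mean of m_i, (mean of s_i)^2).
    m_hat, s_hat = m.mean(), s.mean()
    # Population barycenter: N(E[m], E[s]^2) = N(0, 1).
    m_true, s_true = 0.0, 1.0
    # Closed-form W2 between one-dimensional Gaussians.
    return np.hypot(m_hat - m_true, s_hat - s_true)

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10000):
    err = empirical_barycenter_error(n, rng)
    print(n, err, np.sqrt(n) * err)
```

In this toy setting the rescaled error $\sqrt N\,W_2(\rho_N,\rho)$ stays of order one, which is the behaviour a CLT would quantify.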



What we mean by CLT has to be made precise. There are two ways to look at the Wasserstein space:

• as a (formal) Riemannian manifold (Otto), where the tangent space at $\rho$ consists of $L^2(\rho)$ vector fields (one sees variations of $\rho$ as the image of $\rho$ by the corresponding flow over a short time), Benamou-Brenier dynamical formulation...

• as a convex subset of the vector space of measures (or of a space of more regular densities, $L^2$, $H^1$...).



With the geometric viewpoint, the LLN suggests working in the tangent space $L^2(\rho)$ and considering $T_N$, the OT from $\rho$ to $\rho_N$ (an $L^2(\rho,\mathbb{R}^d)$-valued random variable, a natural framework for a CLT). Roughly, the LLN expresses that $\|T_N-\mathrm{id}\|_{L^2(\rho)}\to0$. The CLT holds if $h_N:=\sqrt N(T_N-\mathrm{id})$ converges in law to a Gaussian $\mathcal{N}(0,\Sigma)$, where $\Sigma$ is a self-adjoint, nonnegative, trace class operator. No general result of this kind, except in particular cases we considered with Agueh ($\mu=0$):

• $d=1$,

• $P$ is supported by Gaussians.



For $\mu>0$, we saw that there are regularizing effects, so it makes sense to expect that $\sqrt N(\rho_N-\rho)$ is asymptotically Gaussian. This requires control on the linearization of Monge-Ampère.

Theorem 1 (C., Eichinger, Kroshnin). Under suitable assumptions on $P$, $\sqrt N(\rho_N-\rho)$ converges in law in $L^2$ to the Gaussian with covariance $\Sigma=G^{-1}\,\mathrm{Var}(\varphi^\nu_\rho)\,G^{-1}$, where

$$G(u)=\mu\Big(\frac u\rho-\frac{1}{|B|}\int_B\frac u\rho\Big)-\mathbb{E}(\Phi^\nu)'(\rho).$$


Date post:	18-Jun-2020
