
Sinkhorn Natural Gradient for Generative Models

Zebang Shen*   Zhenfu Wang†   Alejandro Ribeiro*   Hamed Hassani*
*Department of Electrical and Systems Engineering   †Department of Mathematics

University of Pennsylvania
{zebang@seas, zwang423@math, aribeiro@seas, hassani@seas}.upenn.edu

Abstract

We consider the problem of minimizing a functional over a parametric family of probability measures, where the parameterization is characterized via a push-forward structure. An important application of this problem is the training of generative adversarial networks. In this regard, we propose a novel Sinkhorn Natural Gradient (SiNG) algorithm which acts as a steepest descent method on the probability space endowed with the Sinkhorn divergence. We show that the Sinkhorn information matrix (SIM), a key component of SiNG, has an explicit expression and can be evaluated accurately with complexity that scales logarithmically with respect to the desired accuracy. This is in sharp contrast to existing natural gradient methods that can only be carried out approximately. Moreover, in practical applications where only Monte-Carlo type integration is available, we design an empirical estimator for SIM and provide a stability analysis. In our experiments, we quantitatively compare SiNG with state-of-the-art SGD-type solvers on generative tasks to demonstrate the efficiency and efficacy of our method.

1 Introduction

Consider the minimization of a functional $\mathcal{F}$ over a parameterized family of probability measures $\{\alpha_\theta\}$:

$\min_{\theta\in\Theta} \big\{ F(\theta) := \mathcal{F}(\alpha_\theta) \big\}$,    (1)

where $\Theta \subseteq \mathbb{R}^d$ is the feasible domain of the parameter $\theta$. We assume that the measures $\alpha_\theta$ are defined over a common ground set $\mathcal{X} \subseteq \mathbb{R}^q$ with the following structure: $\alpha_\theta = T_\theta\sharp\mu$, where $\mu$ is a fixed and known measure and $T_\theta$ is a push-forward mapping. More specifically, $\mu$ is a simple measure on a latent space $\mathcal{Z} \subseteq \mathbb{R}^{\bar q}$, such as the standard Gaussian measure $\mu = \mathcal{N}(0_{\bar q}, I_{\bar q})$, and the parameterized map $T_\theta : \mathcal{Z} \to \mathcal{X}$ transforms the measure $\mu$ to $\alpha_\theta$. This type of push-forward parameterization is commonly used in deep generative models, where $T_\theta$ represents a neural network parametrized by weights $\theta$ [Goodfellow et al., 2014, Salimans et al., 2018, Genevay et al., 2018]. Consequently, methods to efficiently and accurately solve problem (1) are of great importance in machine learning.
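To fix ideas, sampling from such a push-forward model only requires drawing $z \sim \mu$ and returning $T_\theta(z)$. The following minimal PyTorch sketch uses an illustrative two-layer map rather than the DC-GAN generator used later in the paper:

import torch
import torch.nn as nn

class PushForward(nn.Module):
    # alpha_theta = T_theta # mu, with mu = N(0, I) on the latent space (illustrative only).
    def __init__(self, latent_dim=16, data_dim=2):
        super().__init__()
        self.latent_dim = latent_dim
        self.T = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

    def sample(self, n):
        z = torch.randn(n, self.latent_dim)   # z ~ mu
        return self.T(z)                      # samples from alpha_theta = T_theta # mu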

The de facto solvers for problem (1) are generic nonconvex optimizers such as Stochastic Gradient Descent (SGD) and its variants, Adam [Kingma and Ba, 2014], Amsgrad [Reddi et al., 2019], RMSProp [Hinton et al.], etc. These optimization algorithms work directly on the parameter space and are agnostic to the fact that the $\alpha_\theta$'s are probability measures. Consequently, SGD-type solvers suffer from the complex optimization landscape induced by the neural-network mappings $T_\theta$.

An alternative to SGD-type methods is the natural gradient method, which is originally motivated by Information Geometry [Amari, 1998, Amari et al., 1987]. Instead of simply using the Euclidean structure of the parameter space $\Theta$ as in the usual SGD, the natural gradient method endows the parameter space with a "natural" metric structure by pulling back a known metric on the probability space, and then searches for the steepest descent direction of $F(\theta)$ in the "curved" neighborhood of $\theta$. In particular, the natural gradient update is invariant to reparametrization.

This allows natural gradient to avoid the undesirable saddle points or local minima that are artificially created by the highly nonlinear maps $T_\theta$. The classical Fisher-Rao Natural Gradient (FNG) [Amari, 1998] as well as its many variants [Martens and Grosse, 2015, Thomas et al., 2016, Song et al., 2018] endows the probability space with the KL divergence and admits an update direction in closed form. However, the update rules of these methods all require the evaluation of the score function of the variable measure. Leaving aside its existence, this quantity is in general difficult to compute for push-forward measures, which limits the application of FNG-type methods to generative models. Recently, Li and Montúfar [2018] proposed to replace the KL divergence in FNG by the Wasserstein distance, yielding the Wasserstein Natural Gradient (WNG) algorithm. WNG shares the merit of reparameterization invariance with FNG while avoiding the need for the score function. However, the Wasserstein information matrix (WIM) is very difficult to compute, as it does not attain a closed-form expression when the dimension $d$ of the parameters is greater than 1, rendering WNG impractical.

Following the line of natural gradient methods, in this paper we propose the Sinkhorn Natural Gradient (SiNG), an algorithm that performs steepest descent of the objective functional $\mathcal{F}$ on the probability space with the Sinkhorn divergence as the underlying metric. Unlike FNG, SiNG only requires sampling from the variable measure $\alpha_\theta$. Moreover, the Sinkhorn information matrix (SIM), a key component of SiNG, can be computed in logarithmic time, in contrast to the WIM in WNG. Concretely, we list our contributions as follows:

1. We derive the Sinkhorn Natural Gradient (SiNG) update rule as the exact direction that minimizes the objective functional $\mathcal{F}$ within the Sinkhorn ball of radius $\varepsilon$ centered at the current measure. In the asymptotic case $\varepsilon \to 0$, we show that the SiNG direction only depends on the Hessian of the Sinkhorn divergence and the gradient of the function $F$, while the effect of the Hessian of $F$ becomes negligible. Further, we prove that SiNG is invariant to reparameterization in its continuous-time limit (i.e. using an infinitesimal step size).

2. We explicitly derive the expression of the Sinkhorn information matrix (SIM), i.e. the Hessian of the Sinkhorn divergence with respect to the parameter $\theta$. We then show that SIM can be computed using logarithmically many (w.r.t. the target accuracy) function operations and integrals with respect to $\alpha_\theta$.

3. When only Monte-Carlo integration w.r.t. $\alpha_\theta$ is available, we propose to approximate SIM with its empirical counterpart (eSIM), i.e. the Hessian of the empirical Sinkhorn divergence. Further, we prove the stability of eSIM. Our analysis relies on the fact that the Fréchet derivative of the Sinkhorn potential with respect to the parameter $\theta$ is continuous with respect to the underlying measure $\mu$. This result can be of general interest.

In our experiments, we pretrain the discriminators for the CelebA and Cifar10 datasets. Fixing the discriminator, we compare SiNG with state-of-the-art SGD-type solvers in terms of the generator loss. The results show the remarkable superiority of SiNG in both efficacy and efficiency.

Notation: Let $\mathcal{X} \subseteq \mathbb{R}^q$ be a compact ground set. We use $\mathcal{M}^+_1(\mathcal{X})$ to denote the space of probability measures on $\mathcal{X}$ and $C(\mathcal{X})$ to denote the family of continuous functions mapping from $\mathcal{X}$ to $\mathbb{R}$. For a function $f \in C(\mathcal{X})$, we denote its $L_\infty$ norm by $\|f\|_\infty := \max_{x\in\mathcal{X}} |f(x)|$ and its gradient by $\nabla f$. For a functional on general vector spaces, the Fréchet derivative is formally defined as follows. Let $V$ and $W$ be normed vector spaces, and let $U \subseteq V$ be an open subset of $V$. A function $\mathcal{F} : U \to W$ is called Fréchet differentiable at $x \in U$ if there exists a bounded linear operator $A : V \to W$ such that

$\lim_{\|h\|\to 0} \dfrac{\|\mathcal{F}(x+h) - \mathcal{F}(x) - Ah\|_W}{\|h\|_V} = 0$.    (2)

If such an operator $A$ exists, it is unique, so we write $D\mathcal{F}(x) = A$ and call it the Fréchet derivative. From the above definition, we know that $D\mathcal{F} : U \to \mathcal{T}(V, W)$, where $\mathcal{T}(V, W)$ is the family of bounded linear operators from $V$ to $W$. Given $x \in U$, the linear map $D\mathcal{F}(x)$ takes one input $y \in V$ and outputs $z \in W$; this is denoted by $z = D\mathcal{F}(x)[y]$. We then define the operator norm of $D\mathcal{F}$ at $x$ as $\|D\mathcal{F}(x)\|_{op} := \max_{h\in V} \dfrac{\|D\mathcal{F}(x)[h]\|_W}{\|h\|_V}$. Further, the second-order Fréchet derivative of $\mathcal{F}$ is denoted by $D^2\mathcal{F} : U \to \mathcal{L}_2(V\times V, W)$, where $\mathcal{L}_2(V\times V, W)$ is the family of all continuous bilinear maps from $V\times V$ to $W$. Given $x \in U$, the bilinear map $D^2\mathcal{F}(x)$ takes two inputs $y_1, y_2 \in V$ and outputs $z \in W$; we denote this by $z = D^2\mathcal{F}(x)[y_1, y_2]$. If a function $\mathcal{F}$ has multiple variables, we use $D_i\mathcal{F}$ to denote the Fréchet derivative with respect to its $i$th variable and $D^2_{ij}\mathcal{F}$ to denote the corresponding second-order terms. Finally, $\circ$ denotes the composition of functions.


2 Related Work on Natural Gradient

The Fisher-Rao natural gradient (FNG) [Amari, 1998] is a now classical algorithm for functional minimization over a class of parameterized probability measures. However, unlike SiNG, FNG as well as its many variants [Martens and Grosse, 2015, Thomas et al., 2016, Song et al., 2018] requires evaluating the score function $\nabla_\theta \log p_\theta$ ($p_\theta$ denotes the p.d.f. of $\alpha_\theta$). Leaving aside its existence issue, the score function of the generative model $\alpha_\theta$ is difficult to compute as it involves $T_\theta^{-1}$, the inverse of the push-forward mapping, and $\det(J_{T_\theta^{-1}})$, the determinant of the Jacobian of $T_\theta^{-1}(z)$. One can possibly recast the computation of the score function as a dual functional minimization problem over all continuous functions on $\mathcal{X}$ [Essid et al., 2019]. However, such a functional minimization problem is itself difficult to solve. As a result, FNG has limited applicability to our problem of interest.

Instead of using the KL divergence, Li and Montúfar [2018] propose to measure the distance between (discrete) probability distributions using optimal transport and develop the Wasserstein Natural Gradient (WNG). WNG inherits FNG's merit of reparameterization invariance. However, WNG requires computing the Wasserstein information matrix (WIM), which does not attain a closed-form expression when $d > 1$, rendering WNG impractical [Li and Zhao, 2019, Li and Montúfar, 2020]. As a workaround, one can recast a single WNG step as a dual functional maximization problem via Legendre duality. While this subproblem remains challenging and can hardly be globally optimized, Li et al. [2019] simplify it by restricting the optimization domain to an affine space of functions (linear combinations of several basis functions). Clearly, the quality of this solver depends heavily on the accuracy of this affine approximation. Alternatively, Arbel et al. [2019] restrict the dual functional optimization to a Reproducing Kernel Hilbert Space (RKHS). By adding two additional regularization terms, the simplified dual subproblem admits a closed-form solution. However, in this way, the gap between the original WNG update and its kernelized version cannot be properly quantified without overstretched assumptions.

3 Preliminaries

We first introduce the entropy-regularized optimal transport distance and then its debiased version, i.e. the Sinkhorn divergence. Given two probability measures $\alpha, \beta \in \mathcal{M}^+_1(\mathcal{X})$, the entropy-regularized optimal transport distance $\mathrm{OT}_\gamma(\alpha, \beta) : \mathcal{M}^+_1(\mathcal{X}) \times \mathcal{M}^+_1(\mathcal{X}) \to \mathbb{R}_+$ is defined as

$\mathrm{OT}_\gamma(\alpha, \beta) = \min_{\pi\in\Pi(\alpha,\beta)} \langle c, \pi\rangle + \gamma\,\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta)$.    (3)

Here, $\gamma > 0$ is a fixed regularization parameter, $\Pi(\alpha, \beta)$ is the set of joint distributions over $\mathcal{X}^2$ with marginals $\alpha$ and $\beta$, and we write $\langle c, \pi\rangle = \int_{\mathcal{X}^2} c(x, y)\, d\pi(x, y)$. We also use $\mathrm{KL}(\pi\,\|\,\alpha\otimes\beta)$ to denote the Kullback-Leibler divergence between the candidate transport plan $\pi$ and the product measure $\alpha\otimes\beta$.

Note that $\mathrm{OT}_\gamma(\alpha, \beta)$ is not a valid metric, as there exists $\alpha \in \mathcal{M}^+_1(\mathcal{X})$ such that $\mathrm{OT}_\gamma(\alpha, \alpha) \ne 0$ when $\gamma \ne 0$. To remove this bias, consider the Sinkhorn divergence $S(\alpha, \beta) : \mathcal{M}^+_1(\mathcal{X}) \times \mathcal{M}^+_1(\mathcal{X}) \to \mathbb{R}_+$ introduced in Peyré et al. [2019]:

$S(\alpha, \beta) := \mathrm{OT}_\gamma(\alpha, \beta) - \dfrac{\mathrm{OT}_\gamma(\alpha, \alpha) + \mathrm{OT}_\gamma(\beta, \beta)}{2}$,    (4)

which can be regarded as a debiased version of $\mathrm{OT}_\gamma(\alpha, \beta)$. Since $\gamma$ is fixed throughout this paper, we omit the subscript $\gamma$ for simplicity. It has been proved that $S(\alpha, \beta)$ is nonnegative, bi-convex, and metrizes convergence in law for a compact $\mathcal{X}$ and a Lipschitz cost $c$ [Peyré et al., 2019].
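In code, (4) is a one-liner once $\mathrm{OT}_\gamma$ between empirical measures is available; the routine name ot_gamma below is an illustrative assumption, not an existing library call.

def sinkhorn_divergence(a, b, ot_gamma):
    # S(alpha, beta) = OT(alpha, beta) - (OT(alpha, alpha) + OT(beta, beta)) / 2, cf. (4).
    return ot_gamma(a, b) - 0.5 * (ot_gamma(a, a) + ot_gamma(b, b))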

The Dual Formulation and Sinkhorn Potentials. The entropy-regularized optimal transport problem $\mathrm{OT}_\gamma$, given in (3), is convex with respect to the joint distribution $\pi$: its objective is the sum of a linear functional and the convex KL-divergence, and the feasible set $\Pi(\alpha, \beta)$ is convex. Consequently, there is no gap between the primal problem (3) and its Fenchel dual. Specifically, define

$H_2(f, g; \alpha, \beta) := \langle f, \alpha\rangle + \langle g, \beta\rangle - \gamma\big\langle \exp\big(\tfrac{1}{\gamma}(f \oplus g - c)\big) - 1,\ \alpha\otimes\beta\big\rangle$,    (5)

where we denote $(f \oplus g)(x, y) = f(x) + g(y)$. We have

$\mathrm{OT}_\gamma(\alpha, \beta) = \max_{f, g\in C(\mathcal{X})} H_2(f, g; \alpha, \beta) = \langle f_{\alpha,\beta}, \alpha\rangle + \langle g_{\alpha,\beta}, \beta\rangle$,    (6)

where $f_{\alpha,\beta}$ and $g_{\alpha,\beta}$, called the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha, \beta)$, are the maximizers of (6).


Training Adversarial Generative Models. We briefly describe how (1) captures the generative adversarial model (GAN). In training a GAN, the objective functional in (1) is itself defined through a maximization subproblem $\mathcal{F}(\alpha_\theta) = \max_{\xi\in\Xi} \mathcal{G}(\xi; \alpha_\theta)$. Here $\xi \in \Xi \subseteq \mathbb{R}^d$ is a dual adversarial variable encoding an adversarial discriminator or ground cost. For example, in the ground-cost-adversarial optimal transport formulation of GAN [Salimans et al., 2018, Genevay et al., 2018], we have $\mathcal{G}(\xi; \alpha_\theta) = S_{c_\xi}(\alpha_\theta, \beta)$. Here, with a slight abuse of notation, $S_{c_\xi}(\alpha_\theta, \beta)$ denotes the Sinkhorn divergence between the parameterized measure $\alpha_\theta$ and a given target measure $\beta$. Notice that the symmetric ground cost $c_\xi$ in $S_{c_\xi}$ is no longer fixed to a pre-specified distance such as the $\ell_1$ or $\ell_2$ norm. Instead, $c_\xi$ is encoded by a parameter $\xi$ so that $S_{c_\xi}$ can distinguish $\alpha_\theta$ and $\beta$ in an adaptive and adversarial manner. Plugging the above $\mathcal{F}(\alpha_\theta)$ into (1), we recover the generative adversarial model proposed in [Genevay et al., 2018]:

$\min_{\theta\in\Theta} \max_{\xi\in\Xi} S_{c_\xi}(\alpha_\theta, \beta)$.    (7)

4 Methodology

In this section, we derive the Sinkhorn Natural Gradient (SiNG) algorithm as a steepest descent method in the probability space endowed with the Sinkhorn divergence metric. Specifically, SiNG updates the parameter $\theta^t$ by

$\theta^{t+1} := \theta^t + \eta \cdot d^t$,    (8)

where $\eta > 0$ is the step size and the update direction $d^t$ is obtained by solving the following problem. Recall the objective $F$ in (1) and the Sinkhorn divergence $S$ in (4). Let $d^t = \lim_{\varepsilon\to 0} \Delta\theta^t_\varepsilon / \sqrt{\varepsilon}$, where

$\Delta\theta^t_\varepsilon := \operatorname*{argmin}_{\Delta\theta\in\mathbb{R}^d} F(\theta^t + \Delta\theta)$  s.t.  $\|\Delta\theta\| \le \varepsilon^{c_1}$,  $S(\alpha_{\theta^t+\Delta\theta}, \alpha_{\theta^t}) \le \varepsilon + \varepsilon^{c_2}$.    (9)

Here the exponents $c_1$ and $c_2$ can be arbitrary reals satisfying $1 < c_2 < 1.5$, $c_1 < 0.5$ and $3c_1 - 1 \ge c_2$. Proposition 4.1 gives a simple expression for $d^t$. Before proceeding to derive this expression, we note that $\Delta\theta = 0$ globally minimizes the non-negative function $S(\alpha_{\theta^t+\Delta\theta}, \alpha_{\theta^t})$, which leads to the following first- and second-order optimality criteria:

$\nabla_\theta S(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t} = 0$  and  $H(\theta^t) := \nabla^2_\theta S(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t} \succeq 0$.    (10)

This property is critical in deriving the explicit formula of the Sinkhorn natural gradient. From now on, the term $H(\theta^t)$, which is a key component of SiNG, will be referred to as the Sinkhorn information matrix (SIM).

Proposition 4.1. Assume that the minimum eigenvalue of $H(\theta^t)$ is strictly positive (but can be arbitrarily small) and that $\nabla^2_\theta F(\theta)$ and $H(\theta)$ are continuous w.r.t. $\theta$. The SiNG direction has the following explicit expression:

$d^t = -\dfrac{\sqrt{2}}{\sqrt{\langle H(\theta^t)^{-1}\nabla_\theta F(\theta^t),\ \nabla_\theta F(\theta^t)\rangle}}\; H(\theta^t)^{-1}\nabla_\theta F(\theta^t)$.    (11)

Interestingly, the SiNG direction does not involve the Hessian of $F$. This is due to a Lagrangian-based argument that we sketch here. Note that the continuity assumptions on $\nabla^2_\theta F(\theta)$ and $H(\theta)$ enable us to approximate the objective and the constraint in (9) via second-order Taylor expansions.

Proof sketch for Proposition 4.1. The second-order Taylor expansion of the Lagrangian of (9) is

$G(\Delta\theta) = F(\theta^t) + \langle\nabla_\theta F(\theta^t), \Delta\theta\rangle + \tfrac{1}{2}\langle\nabla^2_\theta F(\theta^t)\Delta\theta, \Delta\theta\rangle + \tfrac{\lambda}{2}\langle H(\theta^t)\Delta\theta, \Delta\theta\rangle - \lambda\varepsilon - \lambda\varepsilon^{c_2}$,    (12)

where $\lambda \ge 0$ is the dual variable. Since the minimum eigenvalue of $H(\theta^t)$ is strictly positive, for a sufficiently small $\varepsilon$, by taking $\lambda = O(1/\sqrt{\varepsilon})$, we have that $H(\theta^t) + \tfrac{1}{\lambda}\nabla^2_\theta F(\theta^t)$ is also positive definite. In this case, a direct computation reveals that $G$ is minimized at

$\overline{\Delta\theta}^* = -\tfrac{1}{\lambda}\Big(H(\theta^t) + \tfrac{1}{\lambda}\nabla^2_\theta F(\theta^t)\Big)^{-1}\nabla_\theta F(\theta^t)$.    (13)

Consequently, the term involving $\nabla^2_\theta F(\theta^t)$ vanishes as $\varepsilon$ approaches zero and we obtain the result. The above argument is made precise in Appendix A.1.


Remark 4.1. Note that our derivation also applies to the Fisher-Rao natural gradient and the Wasserstein natural gradient: if we replace the Sinkhorn divergence by the KL divergence (or the Wasserstein distance), the update direction $d^t \propto [H(\theta^t)]^{-1}\nabla_\theta F(\theta^t)$ still holds, where $H(\theta^t)$ is the Hessian matrix of the KL divergence (or the Wasserstein distance). This observation also applies to a general functional used as a local metric [Thomas et al., 2016].

The following proposition states that SiNG is invariant to reparameterization in its continuous-time limit ($\eta \to 0$). The proof is given in Appendix A.2.

Proposition 4.2. Let $\Phi$ be an invertible and smoothly differentiable function and denote the reparameterization $\varphi = \Phi(\theta)$. Define $\bar H(\bar\varphi) := \nabla^2_\varphi S(\alpha_{\Phi^{-1}(\varphi)}, \alpha_{\Phi^{-1}(\bar\varphi)})\big|_{\varphi=\bar\varphi}$ and $\bar F(\varphi) := F(\Phi^{-1}(\varphi))$. Use $\dot\theta$ and $\dot\varphi$ to denote the time derivatives of $\theta$ and $\varphi$ respectively. Consider SiNG in its continuous-time limit under these two parameterizations:

$\dot\theta_s = -H(\theta_s)^{-1}\nabla F(\theta_s)$  and  $\dot\varphi_s = -\bar H(\varphi_s)^{-1}\nabla\bar F(\varphi_s)$  with  $\varphi_0 = \Phi(\theta_0)$.    (14)

Then $\theta_s$ and $\varphi_s$ are related by $\varphi_s = \Phi(\theta_s)$ at all times $s \ge 0$.

The SiNG direction is a "curved" negative gradient of the loss function $F(\theta)$, and the "curvature" is exactly given by the Sinkhorn information matrix (SIM), i.e. the Hessian $H(\theta^t) = \nabla^2_\theta S(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t}$ of the Sinkhorn divergence. An important question is whether SIM is computationally tractable. In the next section, we derive its explicit expression and describe how it can be efficiently computed. This is in sharp contrast to the Wasserstein information matrix (WIM) in the WNG method proposed in Li and Montúfar [2018], which does not attain an explicit form for $d > 1$ ($d$ is the parameter dimension).

While computing the update direction $d^t$ involves the inversion of $H(\theta^t)$, it can be computed using the classical conjugate gradient algorithm, which requires only matrix-vector products. Consequently, our Sinkhorn Natural Gradient (SiNG) admits a simple and elegant implementation based on modern auto-differentiation mechanisms such as PyTorch. We elaborate on this point in Appendix E.
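To make the preceding paragraph concrete, here is a minimal PyTorch sketch of one SiNG step that combines a Hessian-vector product (for the SIM) with conjugate gradient. It only illustrates the idea and is not the paper's Appendix E implementation; the helper sim_loss (a differentiable map $\theta \mapsto S(\alpha_\theta, \alpha_{\theta^t})$) and the damping constant are assumptions.

import torch

def hvp(loss_fn, theta, v):
    # Hessian-vector product of loss_fn at theta via double backpropagation.
    g = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(g @ v, theta)[0]

def conjugate_gradient(matvec, b, iters=20, eps=1e-10):
    # Solve A x = b for a symmetric positive definite A, given only v -> A v.
    x = torch.zeros_like(b)
    r = b.clone(); p = r.clone(); rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap + eps)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < eps:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def sing_step(theta, grad_F, sim_loss, eta=0.1, damping=1e-4):
    # theta: flat parameter vector with requires_grad=True; grad_F: gradient of F at theta;
    # sim_loss: differentiable map theta -> S(alpha_theta, alpha_{theta^t}).
    matvec = lambda v: hvp(sim_loss, theta, v) + damping * v   # SIM-vector product
    d = conjugate_gradient(matvec, grad_F)                     # approximates H(theta^t)^{-1} grad_F
    scale = torch.sqrt(2.0 / (d @ grad_F).clamp_min(1e-12))    # normalization from (11)
    return theta - eta * scale * d                             # theta^{t+1} = theta^t + eta * d^t

The damping term only guards against an ill-conditioned SIM in this sketch; the scaling follows the normalization in (11).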

5 Sinkhorn Information Matrix

In this section, we describe the explicit expression of the Sinkhorn information matrix (SIM) and show that it can be computed very efficiently using simple function operations (e.g. log and exp) and integrals with respect to $\alpha_\theta$, with complexity logarithmic in the reciprocal of the target accuracy. The computability of SIM, and hence of SiNG, is the key contribution of our paper. In the case when we can only compute the integration with respect to $\alpha_\theta$ in a Monte Carlo manner, an empirical estimator of SIM (eSIM) is proposed in the next section together with a delicate stability analysis.

Since $S(\cdot,\cdot)$ is a linear combination of terms of the form $\mathrm{OT}_\gamma(\cdot,\cdot)$ (see (4)), we can focus on the term $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t}$ in $H(\theta^t)$; the other term $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_\theta)\big|_{\theta=\theta^t}$ can be handled similarly. Having these two terms, SIM is computed as $H(\theta^t) = \big[\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_{\theta^t}) + \nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_\theta)\big]\big|_{\theta=\theta^t}$.

Recall that the entropy-regularized optimal transport distance $\mathrm{OT}_\gamma$ admits an equivalent dual concave-maximization form (6). Due to the concavity of $H_2$ w.r.t. $g$ in (5), the corresponding optimal $g_f = \operatorname*{argmax}_{g\in C(\mathcal{X})} H_2(f, g; \alpha, \beta)$ can be explicitly computed for any fixed $f \in C(\mathcal{X})$: given a function $f \in C(\mathcal{X})$ and a measure $\alpha \in \mathcal{M}^+_1(\mathcal{X})$, define the Sinkhorn mapping as

$A(f, \alpha)(y) := -\gamma \log \int_{\mathcal{X}} \exp\Big(-\tfrac{1}{\gamma}c(x, y) + \tfrac{1}{\gamma}f(x)\Big)\, d\alpha(x)$.    (15)

The first-order optimality condition for $g_f$ reads $g_f = A(f, \alpha)$. Then, (6) can be simplified to the following problem with a single potential variable:

$\mathrm{OT}_\gamma(\alpha_\theta, \beta) = \max_{f\in C(\mathcal{X})} \big\{ H_1(f, \theta) := \langle f, \alpha_\theta\rangle + \langle A(f, \alpha_\theta), \beta\rangle \big\}$,    (16)

where we emphasize the dependence of $H_1$ on $\theta$ by writing it explicitly as a variable of $H_1$. Moreover, the dependence of $H_1$ on $\beta$ is dropped since $\beta$ is fixed. We also denote the optimal solution of the R.H.S. of (16) by $f_\theta$, which is one of the Sinkhorn potentials of $\mathrm{OT}_\gamma(\alpha_\theta, \beta)$.

The following proposition gives the explicit expression of $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t}$ based on the above dual representation. The proof is provided in Appendix B.1.


Proposition 5.1. Recall the definition of the dual-variable function $H_1 : C(\mathcal{X}) \times \Theta \to \mathbb{R}$ in (16) and the definition of the second-order Fréchet derivative at the end of Section 1. For a parameterized push-forward measure $\alpha_\theta = T_\theta\sharp\mu$ and a fixed measure $\beta \in \mathcal{M}^+_1(\mathcal{X})$, we have

$\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta) = -D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) + D^2_{22}H_1(f_\theta, \theta)$,    (17)

where $Df_\theta$ denotes the Fréchet derivative of the Sinkhorn potential $f_\theta$ w.r.t. the parameter $\theta$.

Remark 5.1 (SIM for a 1d Gaussian). It is in general difficult to give a closed-form expression for the SIM. However, in the simplest case when $\alpha_\theta$ is a one-dimensional Gaussian distribution with a parameterized mean, i.e. $\alpha_\theta = \mathcal{N}(\mu(\theta), \sigma^2)$, SIM can be explicitly computed as $\nabla^2_\theta S(\alpha_\theta, \beta) = 2\nabla^2_\theta \mu(\theta)$, due to the closed-form expression of the entropy-regularized optimal transport between Gaussian measures [Janati et al., 2020].

Suppose that we have the Sinkhorn potential $f_\theta$ and its Fréchet derivative $Df_\theta$. Then the terms $D^2_{ij}H_1(f, \theta)$, $i, j = 1, 2$, can all be evaluated using a constant number of simple function operations, e.g. log and exp, since we know the explicit expression of $H_1$. Consequently, it suffices to have estimators $f^\varepsilon_\theta$ and $g^\varepsilon_\theta$ of $f_\theta$ and $Df_\theta$ respectively, such that $\|f^\varepsilon_\theta - f_\theta\|_\infty \le \varepsilon$ and $\|g^\varepsilon_\theta - Df_\theta\|_{op} \le \varepsilon$ for an arbitrary target accuracy $\varepsilon$. This is because high-accuracy approximations of $f_\theta$ and $Df_\theta$ imply a high-accuracy approximation of $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta)$, due to the Lipschitz continuity of the terms $D^2_{ij}H_1(f, \theta)$, $i, j = 1, 2$. We derive these expressions and their Lipschitz continuity in Appendix B.

For the Sinkhorn potential $f_\theta$, its estimator $f^\varepsilon_\theta$ can be efficiently computed using the Sinkhorn-Knopp algorithm [Sinkhorn and Knopp, 1967]. We provide more details on this in Appendix B.2.

Proposition 5.2 (Computation of the Sinkhorn potential $f_\theta$; Theorem 7.1.4 in [Lemmens and Nussbaum, 2012] and Theorem B.10 in [Luise et al., 2019]). Assume that the ground cost function $c$ is bounded, i.e. $0 \le c(x, y) \le M_c$ for all $x, y \in \mathcal{X}$. Denote $\lambda := \frac{\exp(M_c/\gamma)-1}{\exp(M_c/\gamma)+1} < 1$ and define

$B(f, \theta) := A\big(A(f, \alpha_\theta), \beta\big)$.    (18)

Then the fixed point iteration $f^{t+1} = B(f^t, \theta)$ converges linearly: $\|f^{t+1} - f_\theta\|_\infty = O(\lambda^t)$.
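As an illustration of Proposition 5.2 in the empirical setting (made precise for discrete measures in Proposition 6.1), here is a small log-domain sketch of the Sinkhorn mapping (15) and the fixed-point iteration $f^{t+1} = B(f^t, \theta)$; the squared Euclidean ground cost and all names are illustrative assumptions.

import numpy as np
from scipy.special import logsumexp

def sinkhorn_map(f, X, Y, gamma):
    # A(f, alpha)(y_j) = -gamma * log( (1/n) * sum_i exp((f_i - c(x_i, y_j)) / gamma) ),
    # for the empirical measure alpha supported on the rows of X, with c the squared distance.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)             # n x m cost matrix
    return -gamma * logsumexp((f[:, None] - C) / gamma, axis=0, b=1.0 / len(f))

def sinkhorn_potential(X, Y, gamma, iters=200):
    # Fixed-point iteration f <- B(f, theta) = A(A(f, alpha_theta), beta), cf. (18);
    # Proposition 5.2 guarantees linear convergence with factor lambda.
    f = np.zeros(len(X))
    for _ in range(iters):
        g = sinkhorn_map(f, X, Y, gamma)      # potential on supp(beta)
        f = sinkhorn_map(g, Y, X, gamma)      # updated potential on supp(alpha_theta)
    return f, g

Working in the log domain keeps the iteration numerically stable for small $\gamma$.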

For the Fréchet derivative $Df_\theta$, we construct its estimator in the following proposition.

Proposition 5.3 (Computation of the Fréchet derivative $Df_\theta$). Let $f^\varepsilon_\theta$ be an approximation of $f_\theta$ such that $\|f^\varepsilon_\theta - f_\theta\|_\infty \le \varepsilon$. Choose a large enough $l$, for instance $l = \lceil\log_\lambda\tfrac{1}{3}\rceil/2$. Define $E(f, \theta) = B(\cdots B(f, \theta)\cdots, \theta)$, the $l$-times composition of $B$ in its first variable. Then the sequence

$g^{t+1}_\theta = D_1E(f^\varepsilon_\theta, \theta) \circ g^t_\theta + D_2E(f^\varepsilon_\theta, \theta)$    (19)

converges linearly to an $\varepsilon$-neighborhood of $Df_\theta$, i.e. $\|g^{t+1}_\theta - Df_\theta\|_{op} = O\big(\varepsilon + (\tfrac{2}{3})^t\|g^0_\theta - Df_\theta\|_{op}\big)$.

We defer the proof of the above proposition to Appendix B.3. The high-accuracy estimators $f^\varepsilon_\theta$ and $g^\varepsilon_\theta$ derived in the above propositions can both be obtained using $O(\log\tfrac{1}{\varepsilon})$ function operations and integrals. With the expression of SIM and the two propositions discussing the efficient computation of $f_\theta$ and $Df_\theta$, we obtain the following theorem.

Theorem 5.1 (Computability of SIM). For any given target accuracy $\varepsilon > 0$, there exists an estimator $H^\varepsilon(\theta)$ such that $\|H^\varepsilon(\theta) - H(\theta)\|_{op} \le \varepsilon$, and the estimator can be computed using $O(\log\tfrac{1}{\varepsilon})$ simple function operations and integrations with respect to $\alpha_\theta$.

This result shows a significantly broader applicability of SiNG than WNG, as the latter can only beused in limited situations due to the intractability of computing WIM.

6 Empirical Estimator of SIM

In the previous section, we derived an explicit expression for the Sinkhorn information matrix (SIM) and described how it can be computed efficiently. In this section, we provide an empirical estimator for SIM (eSIM) for the case where the integration w.r.t. $\alpha_\theta$ can only be computed in a Monte Carlo manner. Moreover, we prove the stability of eSIM by showing that the Fréchet derivative of the Sinkhorn potential with respect to the parameter $\theta$ is continuous with respect to the underlying measure $\mu$, which is interesting in its own right.


Recall that the parameterized measure has the structure $\alpha_\theta = T_\theta\sharp\mu$, where $\mu \in \mathcal{M}^+_1(\mathcal{Z})$ is a probability measure on the latent space $\mathcal{Z} \subseteq \mathbb{R}^{\bar q}$ and $T_\theta : \mathcal{Z} \to \mathcal{X}$ is a push-forward mapping parameterized by $\theta \in \Theta$. We use $\bar\mu$ to denote an empirical measure of $\mu$ built from $n$ Dirac measures, $\bar\mu = \tfrac{1}{n}\sum_{i=1}^n \delta_{z_i}$ with $z_i \overset{\mathrm{iid}}{\sim} \mu$, and we use $\bar\alpha_\theta$ to denote the corresponding empirical measure of $\alpha_\theta$: $\bar\alpha_\theta = T_\theta\sharp\bar\mu = \tfrac{1}{n}\sum_{i=1}^n \delta_{T_\theta(z_i)}$. Based on the above definitions, we propose the following empirical estimator of the Sinkhorn information matrix (eSIM):

$\bar H(\theta^t) = \nabla^2_\theta S(\bar\alpha_\theta, \bar\alpha_{\theta^t})\big|_{\theta=\theta^t}$.    (20)
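As an illustration only (assuming a differentiable empirical Sinkhorn divergence routine sinkhorn_div between sample batches and a flat-parameter generator; neither name comes from the paper's code), eSIM can be obtained directly with automatic differentiation:

import torch

def esim(theta, generator, z, sinkhorn_div):
    # eSIM (20): Hessian w.r.t. theta of S(alpha_bar_theta, alpha_bar_{theta^t}),
    # where alpha_bar_theta = (1/n) sum_i delta_{T_theta(z_i)}.
    target = generator(theta, z).detach()                    # samples of alpha_bar_{theta^t}, held fixed
    loss = lambda th: sinkhorn_div(generator(th, z), target)
    return torch.autograd.functional.hessian(loss, theta)    # d x d matrix

This requires sinkhorn_div to be twice differentiable through its Sinkhorn iterations; the analysis below instead exploits the structure in (27), which avoids differentiating through the fixed-point loop.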

The following theorem shows the stability of eSIM. The proof is provided in Appendix C.

Theorem 6.1. Define the bounded-Lipschitz metric on measures $d_{bl} : \mathcal{M}^+_1(\mathcal{X}) \times \mathcal{M}^+_1(\mathcal{X}) \to \mathbb{R}_+$ by

$d_{bl}(\alpha, \beta) := \sup_{\|\xi\|_{bl}\le 1} |\langle\xi, \alpha\rangle - \langle\xi, \beta\rangle|$,    (21)

where $\|\xi\|_{bl} := \max\{\|\xi\|_\infty, \|\xi\|_{\mathrm{Lip}}\}$ with $\|\xi\|_{\mathrm{Lip}} := \max_{x,y\in\mathcal{X}} \frac{|\xi(x)-\xi(y)|}{\|x-y\|}$. Assume that the ground cost function is bounded and Lipschitz continuous. Then

$\|\bar H(\theta^t) - H(\theta^t)\|_{op} = O\big(d_{bl}(\bar\mu, \mu)\big)$.    (22)

In the rest of this section, we analyze the structure of $\bar H(\theta^t)$ and describe how it can be efficiently computed. Similarly to the previous section, we focus on the term $\nabla^2_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$ with $\bar\alpha_\theta = \tfrac{1}{n}\sum_{i=1}^n \delta_{T_\theta(z_i)}$ and $\bar\beta = \tfrac{1}{n}\sum_{i=1}^n \delta_{y_i}$ for arbitrary $y_i \in \mathcal{X}$.

First, notice that the output of the Sinkhorn mapping (15) is determined solely by the values of the input $f$ on the support of $\alpha$. Using $\mathbf{f} = [f_1, \ldots, f_n] \in \mathbb{R}^n$ with $f_i = f(x_i)$ to denote the values of $f$ on $\mathrm{supp}(\bar\alpha)$, we define, for a discrete probability measure $\bar\alpha = \tfrac{1}{n}\sum_{i=1}^n \delta_{x_i}$, the discrete Sinkhorn mapping $A(\mathbf{f}, \bar\alpha) : \mathbb{R}^n \times \mathcal{M}^+_1(\mathcal{X}) \to C(\mathcal{X})$ as

$A(\mathbf{f}, \bar\alpha)(y) := -\gamma \log\Big(\tfrac{1}{n}\sum_{i=1}^n \exp\big(-\tfrac{1}{\gamma}c(x_i, y) + \tfrac{1}{\gamma}f_i\big)\Big) = A(f, \bar\alpha)(y)$,    (23)

where the last equality should be understood as an identity between functions. Since both $\bar\alpha_\theta$ and $\bar\beta$ in $\mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$ are discrete, (16) reduces to

$\mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta) = \max_{\mathbf{f}\in\mathbb{R}^n} \Big\{ H_1(\mathbf{f}, \theta) = \tfrac{1}{n}\mathbf{f}^\top \mathbf{1}_n + \tfrac{1}{n}\sum_{i=1}^n A(\mathbf{f}, \bar\alpha_\theta)(y_i) \Big\}$.    (24)

Now, let $\mathbf{f}_\theta$ be the solution of the above problem. We can compute the first-order gradient of $\mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$ with respect to $\theta$ by

$\nabla_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta) = J_{\mathbf{f}_\theta}^\top \cdot \nabla_1 H_1(\mathbf{f}_\theta, \theta) + \nabla_2 H_1(\mathbf{f}_\theta, \theta)$.    (25)

Here $J_{\mathbf{f}_\theta} = \frac{\partial\mathbf{f}_\theta}{\partial\theta} \in \mathbb{R}^{n\times d}$ denotes the Jacobian matrix of $\mathbf{f}_\theta$ with respect to $\theta$, and $\nabla_i H_1$ denotes the gradient of $H_1$ with respect to its $i$th variable, $i = 1, 2$. Importantly, the optimality condition of $\mathbf{f}_\theta$ implies $\nabla_1 H_1(\mathbf{f}_\theta, \theta) = 0_n$. Further, we compute the second-order gradient of $\mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$ with respect to $\theta$ by (we omit the argument $(\mathbf{f}_\theta, \theta)$ of $H_1$)

$\nabla^2_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta) = T_{\mathbf{f}_\theta} \times_1 \nabla_1 H_1 + J_{\mathbf{f}_\theta}^\top\cdot\nabla_{11}H_1\cdot J_{\mathbf{f}_\theta} + J_{\mathbf{f}_\theta}^\top\cdot\nabla_{12}H_1 + \nabla_{21}H_1^\top\cdot J_{\mathbf{f}_\theta} + \nabla_{22}H_1$,    (26)

where $T_{\mathbf{f}_\theta} = \frac{\partial^2\mathbf{f}_\theta}{\partial\theta^2} \in \mathbb{R}^{n\times d\times d}$ is the tensor of second-order derivatives of $\mathbf{f}_\theta$ with respect to $\theta$ and $\times_1$ denotes the tensor product along the first dimension. Using the fact that $\nabla_1 H_1(\mathbf{f}_\theta, \theta) = 0_n$, we drop the first term and simplify $\nabla^2_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$ to (again omitting the argument $(\mathbf{f}_\theta, \theta)$ of $H_1$)

$\nabla^2_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta) = J_{\mathbf{f}_\theta}^\top\cdot\nabla_{11}H_1\cdot J_{\mathbf{f}_\theta} + J_{\mathbf{f}_\theta}^\top\cdot\nabla_{12}H_1 + \nabla_{21}H_1^\top\cdot J_{\mathbf{f}_\theta} + \nabla_{22}H_1$.    (27)

As we have the explicit expression of $H_1$, we can explicitly compute $\nabla_{ij}H_1$ once we have the Sinkhorn potential $\mathbf{f}_\theta$. Further, if we can also compute $J_{\mathbf{f}_\theta}$, we are then able to compute $\nabla^2_\theta \mathrm{OT}_\gamma(\bar\alpha_\theta, \bar\beta)$. The following propositions can be viewed as discrete counterparts of Proposition 5.2 and Proposition 5.3 respectively. Both $\mathbf{f}_\theta$ and $J_{\mathbf{f}_\theta}$ can be well approximated using a number of finite-dimensional vector/matrix operations that is logarithmic in the desired accuracy. Moreover, given these two quantities, one can easily check that $\nabla_{ij}H_1$ can be evaluated within $O((n+d)^2)$ arithmetic operations. Consequently, we can compute an $\varepsilon$-accurate approximation of eSIM in time $O\big((n+d)^2\log\tfrac{1}{\varepsilon}\big)$.


Proposition 6.1 (Computation of the Sinkhorn potential $\mathbf{f}_\theta$). Assume that the ground cost function $c$ is bounded, i.e. $0 \le c(x, y) \le M_c$ for all $x, y \in \mathcal{X}$. Denote $\lambda := \frac{\exp(M_c/\gamma)-1}{\exp(M_c/\gamma)+1} < 1$ and define

$B(\mathbf{f}, \theta) := A(\mathbf{g}, \bar\beta)$  with  $\mathbf{g} = \big[A(\mathbf{f}, \bar\alpha_\theta)(y_1), \ldots, A(\mathbf{f}, \bar\alpha_\theta)(y_n)\big] \in \mathbb{R}^n$.    (28)

Then the fixed point iteration $\mathbf{f}^{t+1} = B(\mathbf{f}^t, \theta)$ converges linearly: $\|\mathbf{f}^{t+1} - \mathbf{f}_\theta\|_\infty = O(\lambda^t)$.

Proposition 6.2 (Computation of the Jacobian $J_{\mathbf{f}_\theta}$). Let $\mathbf{f}_\varepsilon$ be an approximation of $\mathbf{f}_\theta$ such that $\|\mathbf{f}_\varepsilon - \mathbf{f}_\theta\|_\infty \le \varepsilon$. Pick $l = \lceil\log_\lambda\tfrac{1}{3}\rceil/2$. Define $E(\mathbf{f}, \theta) = B(\cdots B(\mathbf{f}, \theta)\cdots, \theta)$, the $l$-times composition of $B$ in its first variable. Then the sequence of matrices

$J^{t+1} = J_1E(\mathbf{f}_\varepsilon, \theta) \cdot J^t + J_2E(\mathbf{f}_\varepsilon, \theta)$    (29)

converges linearly to an $\varepsilon$-neighborhood of $J_{\mathbf{f}_\theta}$: $\|J^{t+1} - J_{\mathbf{f}_\theta}\|_{op} = O\big(\varepsilon + (\tfrac{2}{3})^t\|J^0 - J_{\mathbf{f}_\theta}\|_{op}\big)$. Here $J_iE$ denotes the Jacobian matrix of $E$ with respect to its $i$th variable.
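For illustration (assuming a differentiable implementation E of the composed map $E(\mathbf{f}, \theta)$ as a PyTorch function; the names are not from the paper's code), the iteration (29) can be run with automatic differentiation supplying the two Jacobians:

import torch

def jacobian_of_potential(E, f_eps, theta, iters=50):
    # Iterate J <- J_1 E(f_eps, theta) @ J + J_2 E(f_eps, theta), cf. (29).
    J1, J2 = torch.autograd.functional.jacobian(E, (f_eps, theta))  # shapes (n, n) and (n, d)
    J = torch.zeros_like(J2)
    for _ in range(iters):
        J = J1 @ J + J2
    return J                                                        # approximates J_{f_theta}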

The SiNG direction $d^t$ involves the inversion of $H(\theta^t)$. This can be (approximately) computed using the classical conjugate gradient (CG) algorithm, which uses only matrix-vector products. Combining eSIM and CG, we describe a simple and elegant PyTorch-based implementation of SiNG in Appendix E.

7 Experiment

In this section, we compare SiNG with other SGD-type solvers by training generative models. We did not compare with WNG [Li and Montúfar, 2018] since WNG can only be implemented when the parameter dimension $d$ is 1. We also tried to implement KWNG [Arbel et al., 2019], which however diverges in our setting. In particular, we encounter cases where the KWNG direction has a negative inner product with the Euclidean gradient direction, which leads to its divergence. As discussed in the related work, the gap between KWNG and WNG cannot be quantified under reasonable assumptions, which explains our observation. In all of the following experiments, we pick the push-forward map $T_\theta$ to be the generator network of DC-GAN [Radford et al., 2015]. For more detailed experimental settings, please see Appendix D.

7.1 Squared $\ell_2$-norm as Ground Metric

We first consider the distribution matching problem, where our goal is to minimize the Sinkhorn divergence between the parameterized generative model $\alpha_\theta = T_\theta\sharp\mu$ and a given target distribution $\beta$,

$\min_{\theta\in\Theta} F(\theta) = S(\alpha_\theta, \beta)$.    (30)

Here, $T_\theta$ is a neural network describing the push-forward map, with its parameters summarized in $\theta$, and $\mu$ is a zero-mean isotropic Gaussian distribution. In particular, the metric on the ground set $\mathcal{X}$ is set to the vanilla squared $\ell_2$ norm, i.e. $c(x, y) = \|x - y\|^2$ for $x, y \in \mathcal{X}$. Our experiment considers a specific instance of problem (30) where we take the measure $\beta$ to be the distribution of the images in the CelebA dataset. We present the comparison of the generator loss (the objective value) against training time in the accompanying figure. The entropy regularization parameter $\gamma$ is set to 0.01 for both the objective and the constraint. We can see that SiNG is much more efficient at reducing the objective value than ADAM given the same amount of time.
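A compact sketch of this distribution-matching setup, reusing the illustrative helpers from earlier sections (generator, sinkhorn_div, sing_step); the loop structure and hyperparameters are placeholders rather than the paper's experimental configuration.

import torch

def fit(theta, generator, targets, sinkhorn_div, latent_dim, epochs=50, eta=0.1):
    # Minimize F(theta) = S(alpha_theta, beta) over theta with SiNG steps, cf. (30).
    for _ in range(epochs):
        for y in targets:                                         # a batch of samples from beta
            z = torch.randn(len(y), latent_dim)                   # latent samples from mu
            F = lambda th: sinkhorn_div(generator(th, z), y)      # objective F(theta)
            grad_F = torch.autograd.grad(F(theta), theta)[0]
            ref = generator(theta, z).detach()                    # samples from alpha_{theta^t}, frozen
            sim = lambda th: sinkhorn_div(generator(th, z), ref)  # theta -> S(alpha_theta, alpha_{theta^t})
            theta = sing_step(theta, grad_F, sim, eta=eta).detach().requires_grad_(True)
    return theta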

7.2 Squared $\ell_2$-norm with an Additional Encoder as Ground Metric

We then consider a special case of problem (7), where the metric on the ground set $\mathcal{X}$ is the squared $\ell_2$ norm composed with a fixed parameterized encoder (i.e. we fix the variable $\xi$ in the max part of (7)): $c_\xi(x, y) = \|\phi_\xi(x) - \phi_\xi(y)\|^2$. Here $\phi_\xi(\cdot) : \mathcal{X} \to \mathbb{R}^{q'}$ is a neural network encoder that outputs an embedding of the input in a high-dimensional space ($q' > q$, where we recall that $q$ is the dimension of the ground set $\mathcal{X}$). In particular, we set $\phi_\xi(\cdot)$ to be the discriminator network of DC-GAN without the last classification layer [Radford et al., 2015]. Two specific instances are considered: we take the measure $\beta$ to be the distribution of the images in either the CelebA or the Cifar10 dataset.


[Figure 1: Generator losses on CelebA (left) and Cifar10 (right). Both panels plot the generator loss (log scale) against the number of epochs for Adam, Amsgrad, RMSprop, and SiNG.]

The parameter $\xi$ of the encoder $\phi_\xi$ is obtained in the following way: we first use SiNG to train a generative model by alternately taking a SiNG step on $\theta$ and an SGD step on $\xi$. After sufficiently many iterations (when the generated images look realistic; specifically, 50 epochs), we fix the encoder $\phi_\xi$. We then set the objective functional in (1) to $\mathcal{F}(\alpha_\theta) = S_{c_\xi}(\alpha_\theta, \beta)$ (see (7)) and compare SiNG with SGD-type algorithms on the minimization of $F$ under a common random initialization. We report the comparison in Figure 1, where we observe a significant improvement from SiNG in both accuracy and efficiency. This phenomenon is due to the fact that SiNG is able to use geometric information through SIM, while the other methods do not. Moreover, the pretrained ground cost $c_\xi$ may capture some non-trivial metric structure of the images, and consequently a geometry-faithful method like SiNG can do better.

7.3 Training GAN with SiNG

Figure 2: Comparison of the visual quality of the images generated by Adam (left) and SiNG (right).

Finally, we showcase the advantage of training a GAN model using SiNG over SGD-based solvers. Specifically, we consider the GAN model (7). The entropy regularization of the Sinkhorn divergence objective is set to $\gamma = 100$, as suggested in Table 2 of [Genevay et al., 2018]. The regularization for the constraint is set to $\gamma = 1$ in SiNG. We used ADAM as the optimizer for the discriminators (with step size $10^{-3}$ and batch size 4000). The result is reported in Figure 2. We can see that the images generated using SiNG are much more vivid than the ones obtained using SGD-based optimizers. We remark that our main goal has been to show that SiNG is more efficient in reducing the objective value compared to SGD-based solvers, and hence we have used relatively simple DC-GAN type generators and discriminators (details are given in the supplementary materials). If more sophisticated ResNet-type generators and discriminators are used, the image quality can be further improved.


8 Broader Impact

We propose the Sinkhorn natural gradient (SiNG) algorithm for minimizing an objective functional over a parameterized family of generative-model-type measures. While our results do not immediately lead to broader societal impacts (as they are mostly theoretical), they can lead to new potential positive impacts. SiNG admits an explicit update rule which can be carried out efficiently and exactly in both continuous and discrete settings. By exploiting the geometric information provided by the Sinkhorn information matrix, we observe a remarkable advantage of SiNG over existing state-of-the-art SGD-type solvers. The algorithm is readily applicable to many types of existing generative adversarial models and may benefit the development of the literature.

Acknowledgment

This work is supported by NSF CPS-1837253.

References

S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

S.-I. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, C. Rao, et al. Differential geometrical theory of statistics. In Differential Geometry in Statistical Inference, pages 19–94. Institute of Mathematical Statistics, 1987.

M. Arbel, A. Gretton, W. Li, and G. Montúfar. Kernelized Wasserstein natural gradient. arXiv preprint arXiv:1910.09652, 2019.

M. Essid, D. F. Laefer, and E. G. Tabak. Adaptive optimal transport. Information and Inference: A Journal of the IMA, 8(4):789–816, 2019.

J. Feydy, T. Séjourné, F.-X. Vialard, S.-i. Amari, A. Trouve, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690, 2019.

A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

H. Janati, B. Muzellec, G. Peyré, and M. Cuturi. Entropic optimal transport between (unbalanced) Gaussian measures has a closed form. arXiv preprint arXiv:2006.02572, 2020.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

B. Lemmens and R. Nussbaum. Nonlinear Perron-Frobenius Theory, volume 189. Cambridge University Press, 2012.

W. Li and G. Montúfar. Natural gradient via optimal transport. Information Geometry, 1(2):181–214, 2018.

W. Li and G. Montúfar. Ricci curvature for parametric statistics via optimal transport. Information Geometry, pages 1–29, 2020.

W. Li and J. Zhao. Wasserstein information matrix. arXiv preprint arXiv:1910.11248, 2019.

W. Li, A. T. Lin, and G. Montúfar. Affine natural proximal learning. In International Conference on Geometric Science of Information, pages 705–714. Springer, 2019.

G. Luise, S. Salzo, M. Pontil, and C. Ciliberto. Sinkhorn barycenters with free support via Frank-Wolfe algorithm. In Advances in Neural Information Processing Systems 32, 2019.

J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.

S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.

T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.

R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

Y. Song, J. Song, and S. Ermon. Accelerating natural gradient with higher-order invariance. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, 2018.

P. Thomas, B. C. Silva, C. Dann, and E. Brunskill. Energetic natural gradient descent. In International Conference on Machine Learning, pages 2887–2895, 2016.


A Appendix Section for Methodology

A.1 Proof of Proposition 4.1

Denote the Lagrangian function by

$G_\lambda(\Delta\theta) = F(\theta^t + \Delta\theta) + \lambda\big(S(\alpha_{\theta^t+\Delta\theta}, \alpha_{\theta^t}) - \varepsilon - \varepsilon^{c_2}\big)$.    (31)

We have the following inequality, which characterizes a lower bound on the solution of (9) (recall that $1 < c_2 < 1.5$, $c_1 < 0.5$ and $3c_1 - 1 \ge c_2$):

$\min_{\Delta\theta\in\mathbb{R}^d} \big\{ F(\theta^t+\Delta\theta) : \|\Delta\theta\| \le \varepsilon^{c_1},\ S(\alpha_{\theta^t+\Delta\theta}, \alpha_{\theta^t}) \le \varepsilon + \varepsilon^{c_2} \big\} = \min_{\|\Delta\theta\|\le\varepsilon^{c_1}} \max_{\lambda\ge 0} G_\lambda(\Delta\theta) \ge \max_{\lambda\ge 0} \min_{\|\Delta\theta\|\le\varepsilon^{c_1}} G_\lambda(\Delta\theta)$.    (32)

We now focus on the R.H.S. of the above inequality. Denote the second-order Taylor expansion of the Lagrangian $G_\lambda$ by $\bar G_\lambda$:

$\bar G_\lambda(\Delta\theta) = F(\theta^t) + \langle\nabla_\theta F(\theta^t), \Delta\theta\rangle + \tfrac{1}{2}\langle\nabla^2_\theta F(\theta^t)\Delta\theta, \Delta\theta\rangle + \tfrac{\lambda}{2}\langle H(\theta^t)\Delta\theta, \Delta\theta\rangle - \lambda\varepsilon - \lambda\varepsilon^{c_2}$,

where we used the optimality condition (10) of $S(\alpha_\theta, \alpha_{\theta^t})$ so that its first-order term vanishes; $H(\theta)$ is defined in (10). The error of this approximation can be bounded as

$G_\lambda(\Delta\theta) - \bar G_\lambda(\Delta\theta) = O\big((\lambda+1)\|\Delta\theta\|^3\big)$.    (33)

Further, for any fixed $\lambda$, denote $\Delta\theta^*_\lambda = \operatorname*{argmin}_{\|\Delta\theta\|\le\varepsilon^{c_1}} G_\lambda(\Delta\theta)$. We can then derive the following lower bound on the minimization subproblem on the R.H.S. of (32):

$\max_{\lambda\ge 0} \min_{\|\Delta\theta\|\le\varepsilon^{c_1}} G_\lambda(\Delta\theta) = \max_{\lambda\ge 0} \big[\bar G_\lambda(\Delta\theta^*_\lambda) - O\big((\lambda+1)\|\Delta\theta^*_\lambda\|^3\big)\big] \ge \max_{\lambda\ge 0} \big[\bar G_\lambda(\Delta\theta^*_\lambda) - O\big((\lambda+1)\varepsilon^{3c_1}\big)\big] \ge \max_{\lambda\ge 0} \Big[\min_{\|\Delta\theta\|\le\varepsilon^{c_1}} \bar G_\lambda(\Delta\theta) - O\big((\lambda+1)\varepsilon^{3c_1}\big)\Big]$.

Note that, for sufficiently large $\lambda$, $H(\theta^t) + \tfrac{1}{\lambda}\nabla^2_\theta F(\theta^t) \succ 0$ by the positive definiteness of $H(\theta^t)$. In this case, as a convex program, $\min_{\|\Delta\theta\|\le\varepsilon^{c_1}} \bar G_\lambda(\Delta\theta)$ admits a closed-form solution: denoting $\overline{\Delta\theta}^*_\lambda = \operatorname*{argmin} \bar G_\lambda(\Delta\theta)$, we have

$\overline{\Delta\theta}^*_\lambda = -\tfrac{1}{\lambda}\Big(H(\theta^t) + \tfrac{1}{\lambda}\nabla^2_\theta F(\theta^t)\Big)^{-1}\nabla_\theta F(\theta^t)$  and  $\bar G_\lambda(\overline{\Delta\theta}^*_\lambda) = F(\theta^t) - \dfrac{\bar a}{2\lambda} - \lambda\varepsilon - \lambda\varepsilon^{c_2}$,    (34)

where we denote $\bar a := \big\langle \big[H(\theta^t) + \tfrac{1}{\lambda}\nabla^2_\theta F(\theta^t)\big]^{-1}\nabla_\theta F(\theta^t),\ \nabla_\theta F(\theta^t)\big\rangle > 0$.

For sufficiently small $\varepsilon$, taking $\lambda = \sqrt{\tfrac{a}{2\varepsilon}}$ with $a := \langle [H(\theta^t)]^{-1}\nabla_\theta F(\theta^t), \nabla_\theta F(\theta^t)\rangle > 0$ (note that $\|\overline{\Delta\theta}^*_\lambda\| = O(\sqrt{\varepsilon}) < \varepsilon^{c_1}$ and is hence feasible for $c_1 < 0.5$), the R.H.S. of (32) has the following lower bound (recall that $3c_1 - 1 \ge c_2$):

$\max_{\lambda\ge 0} \min_{\|\Delta\theta\|\le\varepsilon^{c_1}} G_\lambda(\Delta\theta) \ge F(\theta^t) - \Big(\dfrac{\bar a}{\sqrt{2a}} + \sqrt{\dfrac{a}{2}}\Big)\sqrt{\varepsilon} - O(\varepsilon^{c_2 - 0.5})$.    (35)

This result leads to the following lower bound on (9):

$\lim_{\varepsilon\to 0} \dfrac{F(\theta^t + \Delta\theta^t_\varepsilon) - F(\theta^t)}{\sqrt{\varepsilon}} \ge -\sqrt{2\langle [H(\theta^t)]^{-1}\nabla_\theta F(\theta^t),\ \nabla_\theta F(\theta^t)\rangle}$,    (36)

where $\Delta\theta^t_\varepsilon$ is the solution of (9). Finally, observe that equality is achieved by taking $\Delta\theta^t_\varepsilon = -\dfrac{\sqrt{2\varepsilon}\,(H(\theta^t))^{-1}\nabla_\theta F(\theta^t)}{\sqrt{\langle [H(\theta^t)]^{-1}\nabla_\theta F(\theta^t),\ \nabla_\theta F(\theta^t)\rangle}}$:

$\lim_{\varepsilon\to 0} \dfrac{F(\theta^t + \Delta\theta^t_\varepsilon) - F(\theta^t)}{\sqrt{\varepsilon}} = \lim_{\varepsilon\to 0} \dfrac{1}{\sqrt{\varepsilon}}\langle\nabla F(\theta^t), \Delta\theta^t_\varepsilon\rangle = -\sqrt{2\langle [H(\theta^t)]^{-1}\nabla_\theta F(\theta^t),\ \nabla_\theta F(\theta^t)\rangle}$,    (37)

and $\Delta\theta^t_\varepsilon$ is feasible for sufficiently small $\varepsilon$ (note that we have $\tfrac{1}{2}\langle H(\theta^t)\Delta\theta^t_\varepsilon, \Delta\theta^t_\varepsilon\rangle = \varepsilon$):

$S(\alpha_{\theta^t+\Delta\theta^t_\varepsilon}, \alpha_{\theta^t}) \le \tfrac{1}{2}\langle H(\theta^t)\Delta\theta^t_\varepsilon, \Delta\theta^t_\varepsilon\rangle + O(\varepsilon^{1.5}) < \varepsilon + \varepsilon^{c_2}$,    (38)

and $\|\Delta\theta^t_\varepsilon\| = O(\sqrt{\varepsilon}) < \varepsilon^{c_1}$ for $c_1 < 0.5$. This leads to our conclusion.


A.2 Proof of Proposition 4.2

Our goal is to show that the continuous-time limit of $\Phi(\theta_s)$ satisfies the same differential equation as $\varphi_s$, provided that $\Phi(\theta_0) = \varphi_0$. To do so, first compute the differential equation of $\Phi(\theta_s)$:

$\dfrac{\partial\Phi(\theta_s)}{\partial s} = \nabla_\theta\Phi(\theta_s)\,\dot\theta_s = -\nabla_\theta\Phi(\theta_s)\,H(\theta_s)^{-1}\nabla F(\theta_s)$,    (39)

where $\nabla_\theta\Phi(\theta_s)$ is the Jacobian matrix of $\Phi(\theta)$ w.r.t. $\theta$ at $\theta = \theta_s$. We then compute the differential equation of $\varphi_s$ (note that $\nabla_\varphi\Phi^{-1}(\varphi_s)$ is the Jacobian matrix of $\Phi^{-1}(\varphi)$ w.r.t. $\varphi$ at $\varphi = \varphi_s$):

$\dot\varphi_s = -\big[\nabla^2_\varphi S(\alpha_{\Phi^{-1}(\varphi)}, \alpha_{\Phi^{-1}(\varphi_s)})\big|_{\varphi=\varphi_s}\big]^{-1}\nabla_\varphi \bar F(\Phi^{-1}(\varphi))\big|_{\varphi=\varphi_s}$
$\quad = -\big[\nabla_\varphi\Phi^{-1}(\varphi_s)^\top\,\nabla^2_\theta S(\alpha_\theta, \alpha_{\theta_s})\big|_{\theta=\theta_s}\,\nabla_\varphi\Phi^{-1}(\varphi_s)\big]^{-1}\nabla_\varphi\Phi^{-1}(\varphi_s)^\top\,\nabla F(\theta)\big|_{\theta=\theta_s}$    (40)
$\quad = -\big[\nabla_\varphi\Phi^{-1}(\varphi_s)\big]^{-1}\big[\nabla^2_\theta S(\alpha_\theta, \alpha_{\theta_s})\big|_{\theta=\theta_s}\big]^{-1}\nabla F(\theta)\big|_{\theta=\theta_s} = -\nabla_\theta\Phi(\theta_s)\,H(\theta_s)^{-1}\nabla F(\theta_s)$    (41)
$\quad = \dfrac{\partial\Phi(\theta_s)}{\partial s}$.

Here we used the following lemma in (40), and we used $\Phi^{-1}(\varphi_s) = \theta_s$ together with the inverse function theorem $\nabla_\theta\Phi(\theta_s) = [\nabla_\varphi\Phi^{-1}(\varphi_s)]^{-1}$ in (41).

Lemma A.1.

$\nabla^2_\varphi S(\alpha_{\Phi^{-1}(\varphi)}, \alpha_{\Phi^{-1}(\varphi_s)})\big|_{\varphi=\varphi_s} = \nabla_\varphi\Phi^{-1}(\varphi_s)^\top\,\nabla^2_\theta S(\alpha_\theta, \alpha_{\theta_s})\big|_{\theta=\theta_s}\,\nabla_\varphi\Phi^{-1}(\varphi_s)$.    (42)

Proof. This lemma can be proved by direct computation. We compute only the terms in $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_{\theta_s})$ as an example; the terms in $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_\theta)$ can be computed similarly. Recall the expression

$\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta) = D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) + D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id}) + D^2_{21}H_1(f_\theta, \theta) \circ (\mathrm{Id}, Df_\theta) + D^2_{22}H_1(f_\theta, \theta) \circ (\mathrm{Id}, \mathrm{Id})$.    (43)

We compute

$\nabla^2_\varphi \mathrm{OT}_\gamma(\alpha_{\Phi^{-1}(\varphi)}, \beta) = D^2_{11}H_1(f_{\Phi^{-1}(\varphi)}, \Phi^{-1}(\varphi)) \circ \big(Df_{\Phi^{-1}(\varphi)} \circ J_{\Phi^{-1}}(\varphi),\ Df_{\Phi^{-1}(\varphi)} \circ J_{\Phi^{-1}}(\varphi)\big)$
$\quad + D^2_{12}H_1(f_{\Phi^{-1}(\varphi)}, \Phi^{-1}(\varphi)) \circ \big(Df_{\Phi^{-1}(\varphi)} \circ J_{\Phi^{-1}}(\varphi),\ J_{\Phi^{-1}}(\varphi)\big)$
$\quad + D^2_{21}H_1(f_{\Phi^{-1}(\varphi)}, \Phi^{-1}(\varphi)) \circ \big(J_{\Phi^{-1}}(\varphi),\ Df_{\Phi^{-1}(\varphi)} \circ J_{\Phi^{-1}}(\varphi)\big)$
$\quad + D^2_{22}H_1(f_{\Phi^{-1}(\varphi)}, \Phi^{-1}(\varphi)) \circ \big(J_{\Phi^{-1}}(\varphi),\ J_{\Phi^{-1}}(\varphi)\big)$.    (44)

Plugging $\Phi^{-1}(\varphi_s) = \theta_s$ into the above equality, we have

$\nabla^2_\varphi \mathrm{OT}_\gamma(\alpha_{\Phi^{-1}(\varphi)}, \beta)\big|_{\varphi=\varphi_s} = \nabla_\varphi\Phi^{-1}(\varphi_s)^\top\,\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta)\big|_{\theta=\theta_s}\,\nabla_\varphi\Phi^{-1}(\varphi_s)$.    (45)


B Appendix on SIM

B.1 Proof of Proposition 5.1

We now derive the explicit expression of $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \alpha_{\theta^t})\big|_{\theta=\theta^t}$ based on the dual representation (16). Recall the definition of the Fréchet derivative given in Section 1 (cf. (2)) and its chain rule $D(f\circ g)(x) = Df(g(x)) \circ Dg(x)$. We compute the first-order gradient by

$\nabla_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta) = \nabla_\theta H_1(f_\theta, \theta) = D_1H_1(f_\theta, \theta) \circ Df_\theta + D_2H_1(f_\theta, \theta) =: G_1(f_\theta, \theta) + G_2(f_\theta, \theta)$,    (46)

where $D_iH_1$ denotes the Fréchet derivative of $H_1$ with respect to its $i$th variable. Importantly, the optimality condition of (16) implies that $D_1H_1(f_\theta, \theta)[g] = 0$ for all $g \in C(\mathcal{X})$.

Further, in order to compute the second-order gradient of $\mathrm{OT}_\gamma(\alpha_\theta, \beta)$ with respect to $\theta$, we first compute the gradients of $G_i$, $i = 1, 2$:

$\nabla_\theta G_1(f_\theta, \theta) = D_1H_1(f_\theta, \theta) \circ D^2f_\theta + D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) + D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id})$,    (47)

$\nabla_\theta G_2(f_\theta, \theta) = D^2_{21}H_1(f_\theta, \theta) \circ (\mathrm{Id}, Df_\theta) + D^2_{22}H_1(f_\theta, \theta) \circ (\mathrm{Id}, \mathrm{Id})$.    (48)

Using the fact that $D_1H_1(f_\theta, \theta)[g] = 0$ for all $g \in C(\mathcal{X})$, we can drop the first term on the R.H.S. of (47). Combining the above results, we have

$\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta) = D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) + D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id}) + D^2_{21}H_1(f_\theta, \theta) \circ (\mathrm{Id}, Df_\theta) + D^2_{22}H_1(f_\theta, \theta) \circ (\mathrm{Id}, \mathrm{Id})$.    (49)

Moreover, we can further simplify the above expression by noting that, for any $g \in \mathcal{T}(\mathbb{R}^d, C(\mathcal{X}))$, i.e. any bounded linear operator from $\mathbb{R}^d$ to $C(\mathcal{X})$,

$\nabla_\theta\big(D_1H_1(f_\theta, \theta) \circ g\big) = D^2_{11}H_1(f_\theta, \theta) \circ (g, Df_\theta) + D^2_{12}H_1(f_\theta, \theta) \circ (g, \mathrm{Id}) = 0$.    (50)

Plugging $g = Df_\theta$ into the above equality, we have

$D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) = -D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id})$.    (51)

Consequently, we derive (omitting the identity operator $(\mathrm{Id}, \mathrm{Id})$ in the second term)

$\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta) = -D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta) + D^2_{22}H_1(f_\theta, \theta)$,    (52)

where we note that $D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id})$ is symmetric by (51) and

$D^2_{21}H_1(f_\theta, \theta) \circ (\mathrm{Id}, Df_\theta) = \big[D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id})\big]^\top = D^2_{12}H_1(f_\theta, \theta) \circ (Df_\theta, \mathrm{Id})$.    (53)

These two terms can be computed explicitly and involve only simple function operations such as exp and log, and integration with respect to $\alpha_\theta$ and $\beta$, as discussed in the following.

B.1.1 Explicit Expression of $\nabla^2_\theta \mathrm{OT}_\gamma(\alpha_\theta, \beta)$

Denote by $A_1 = D^2_{11}H_1(f_\theta, \theta) \circ (Df_\theta, Df_\theta)$ the first term of (52). We note that $A_1 \in \mathbb{R}^{d\times d}$ is a matrix and hence a bilinear operator. If we can compute $h_1^\top A_1 h_2$ for any two directions $h_1, h_2 \in \mathbb{R}^d$, we can compute the entries of $A_1$ by taking $h_1$ and $h_2$ to be the canonical basis vectors. We compute the quantity $h_1^\top A_1 h_2$ as follows.

For a fixed $y \in \mathcal{X}$, define $T_y : \mathcal{X} \times C(\mathcal{X}) \to \mathbb{R}$ by

$T_y(x, f) := \exp(-c(x, y)/\gamma)\exp(f(x)/\gamma)$.

Denote $g_1 = Df_\theta[h_1] \in C(\mathcal{X})$ for some direction $h_1 \in \mathbb{R}^d$ (recall that $Df_\theta \in \mathcal{T}(\mathbb{R}^d, C(\mathcal{X}))$, where $\mathcal{T}(V, W)$ is the family of bounded linear operators from $V$ to $W$). Using the chain rule for the Fréchet derivative, we compute

$\big(D_1A(f, \alpha_\theta)[g_1]\big)(y) = -\dfrac{\int_{\mathcal{X}} T_y(x, f)\, g_1(x)\, d\alpha_\theta(x)}{\int_{\mathcal{X}} T_y(x, f)\, d\alpha_\theta(x)}$.    (54)

Let $h_2 \in \mathbb{R}^d$ be another direction and denote $g_2 = Df_\theta[h_2] \in C(\mathcal{X})$. We compute

$\big(D^2_{11}A(f, \alpha_\theta)[g_1, g_2]\big)(y) = \dfrac{\int_{\mathcal{X}} T_y(x, f)\, g_1(x) g_2(x)\, d\alpha_\theta(x)}{\gamma\int_{\mathcal{X}} T_y(x, f)\, d\alpha_\theta(x)} - \dfrac{\int_{\mathcal{X}^2} T_y(x, f)\, T_y(x', f)\, g_1(x) g_2(x')\, d\alpha_\theta(x)\, d\alpha_\theta(x')}{\gamma\big[\int_{\mathcal{X}} T_y(x, f)\, d\alpha_\theta(x)\big]^2}$.    (55)

Moreover, for any two directions $h_1, h_2 \in \mathbb{R}^d$, we compute $D^2_{11}H_1(f, \theta)\big[Df_\theta[h_1], Df_\theta[h_2]\big]$ by

$D^2_{11}H_1(f, \theta)\big[Df_\theta[h_1], Df_\theta[h_2]\big] = \int_{\mathcal{X}} \Big(D^2_{11}A(f_\theta, \alpha_\theta)\big[Df_\theta[h_1], Df_\theta[h_2]\big]\Big)(y)\, d\beta(y)$,    (56)

which, after plugging in (55), yields a closed-form expression involving only simple function operations such as exp and log, and integration with respect to $\alpha_\theta$ and $\beta$.

We then compute the second term of (52). Using the change-of-variable formula, we have

$A(f, T_\theta\sharp\mu)(y) = -\gamma\log\int_{\mathcal{Z}} \exp\Big(-\tfrac{1}{\gamma}c(T_\theta(z), y) + \tfrac{1}{\gamma}f(T_\theta(z))\Big)\, d\mu(z)$.    (57)

For any $f \in C(\mathcal{X})$, the first-order Fréchet derivative of $H_1(f, \theta)$ w.r.t. its second variable is given by

$D_2H_1(f, \theta) = \int_{\mathcal{Z}} \big\langle \nabla_\theta T_\theta(z),\ \nabla f(T_\theta(z))\big\rangle\, d\mu(z) + \int_{\mathcal{X}} \dfrac{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \big\langle \nabla_\theta T_\theta(z),\ \nabla_1 c(T_\theta(z), y) - \nabla f(T_\theta(z))\big\rangle\, d\mu(z)}{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, d\mu(z)}\, d\beta(y)$.

Denote $u_z(\theta, f) = \nabla_1 c(T_\theta(z), y) - \nabla f(T_\theta(z))$. The second-order Fréchet derivative is given by

$D^2_{22}H_1(f, \theta) = \int_{\mathcal{Z}} \Big[\nabla^2_\theta T_\theta(z) \times_1 \nabla f(T_\theta(z)) + \nabla_\theta T_\theta(z)^\top\, \nabla^2 f(T_\theta(z))\, \nabla_\theta T_\theta(z)\Big]\, d\mu(z)$    (58)
$\quad + \dfrac{1}{\gamma}\int_{\mathcal{X}} \dfrac{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \nabla_\theta T_\theta(z)^\top u_z(\theta, f)\, u_z(\theta, f)^\top \nabla_\theta T_\theta(z)\, d\mu(z)}{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, d\mu(z)}\, d\beta(y)$
$\quad + \int_{\mathcal{X}} \dfrac{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \nabla^2_\theta T_\theta(z) \times_1 u_z(\theta, f)\, d\mu(z)}{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, d\mu(z)}\, d\beta(y)$
$\quad + \int_{\mathcal{X}} \dfrac{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \nabla_\theta T_\theta(z)^\top \big[\nabla_{11}c(T_\theta(z), y) - \nabla^2 f(T_\theta(z))\big]\, \nabla_\theta T_\theta(z)\, d\mu(z)}{\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, d\mu(z)}\, d\beta(y)$
$\quad + \dfrac{1}{\gamma}\int_{\mathcal{X}} \dfrac{\big[\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \nabla_\theta T_\theta(z)^\top u_z(\theta, f)\, d\mu(z)\big]\big[\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, \nabla_\theta T_\theta(z)^\top u_z(\theta, f)\, d\mu(z)\big]^\top}{\big[\int_{\mathcal{Z}} T_y(T_\theta(z), f)\, d\mu(z)\big]^2}\, d\beta(y)$.

Here $\nabla_\theta T_\theta(z) \in \mathbb{R}^{q\times d}$ and $\nabla^2_\theta T_\theta(z) \in \mathbb{R}^{q\times d\times d}$ denote the first- and second-order Jacobians of $T_\theta(z)$ w.r.t. $\theta$; $\times_1$ denotes the tensor product along the first dimension; $\nabla f \in \mathbb{R}^q$ and $\nabla^2 f \in \mathbb{R}^{q\times q}$ denote the first- and second-order gradients of $f$ w.r.t. its input; $\nabla_1 c \in \mathbb{R}^q$ and $\nabla_{11}c \in \mathbb{R}^{q\times q}$ denote the first- and second-order gradients of $c$ w.r.t. its first input. Plugging in $f = f_\theta$ gives the explicit expression of the second term of (52).

B.2 More details in Proposition 5.2

First, we recall some existing results about the Sinkhorn potential $f_\theta$.

Assumption B.1. The ground cost function $c$ is bounded, and we denote $M_c := \max_{x,y\in\mathcal{X}} c(x, y)$.

It is known that, under the above boundedness assumption on the ground cost function $c$, $f_\theta$ is a solution to the generalized DAD problem (eq. (7.4) in [Lemmens and Nussbaum, 2012]), i.e. the fixed point of the operator $B : C(\mathcal{X}) \times \Theta \to C(\mathcal{X})$ defined as

$B(f, \theta) := A\big(A(f, \alpha_\theta), \beta\big)$.    (59)

Further, the Birkhoff-Hopf theorem (Sections A.4 and A.7 in [Lemmens and Nussbaum, 2012]) states that $\exp(B/\gamma)$ is a contraction operator under the Hilbert metric with contraction factor $\lambda^2$, where $\lambda := \frac{\exp(M_c/\gamma)-1}{\exp(M_c/\gamma)+1} < 1$ (see also Theorem B.5 in [Luise et al., 2019]): for strictly positive functions $u, u' \in C(\mathcal{X})$, define the Hilbert metric as

$d_H(u, u') := \log\max_{x,y\in\mathcal{X}} \dfrac{u(x)\,u'(y)}{u'(x)\,u(y)}$.    (60)

For any measure $\alpha \in \mathcal{M}^+_1(\mathcal{X})$, we have

$d_H\big(\exp(A(f, \alpha_\theta)/\gamma),\ \exp(A(f', \alpha_\theta)/\gamma)\big) \le \lambda\, d_H\big(\exp(f/\gamma),\ \exp(f'/\gamma)\big)$.    (61)

Consequently, by applying the fixed point iteration

$f^{t+1} = B(f^t, \theta)$,    (62)

also known as the Sinkhorn-Knopp algorithm, one can compute $f_\theta$ in logarithmic time: $\|f^{t+1} - f_\theta\|_\infty = O(\lambda^t)$ (Theorem 7.1.4 in [Lemmens and Nussbaum, 2012] and Theorem B.10 in [Luise et al., 2019]).

While the above discussion shows that the output of the Sinkhorn-Knopp algorithm approximates the Sinkhorn potential $f_\theta$ well, it is useful to discuss the boundedness properties of the sequence $\{f^t\}$ produced by the Sinkhorn-Knopp algorithm. We first show that, under a bounded initialization $f^0$, the entire sequence $\{f^t\}$ is bounded.

Lemma B.1. Suppose that we initialize the Sinkhorn-Knopp algorithm with $f^0 \in C(\mathcal{X})$ such that $\|f^0\|_\infty \le M_c$. Then $\|f^t\|_\infty \le M_c$ for $t = 1, 2, 3, \ldots$

Proof. For $\|f\|_\infty \le M_c$ and any measure $\alpha \in \mathcal{M}^+_1(\mathcal{X})$, we have

$\|A(f, \alpha)\|_\infty = \gamma\Big\|\log\int_{\mathcal{X}} \exp\{-c(x, \cdot)/\gamma\}\exp\{f(x)/\gamma\}\, d\alpha(x)\Big\|_\infty \le \gamma\log\exp(M_c/\gamma) \le M_c$.

One can then check the lemma via induction.

We then show that the sequence $\{f^t\}$ has bounded first-, second-, and third-order gradients under the following assumptions on the ground cost function $c$.

Assumption B.2. The cost function $c$ is $G_c$-Lipschitz continuous with respect to one of its inputs: for all $x, x' \in \mathcal{X}$, $|c(x, y) - c(x', y)| \le G_c\|x - x'\|$.

Assumption B.3. The gradient of the cost function $c$ is $L_c$-Lipschitz continuous: for all $x, x' \in \mathcal{X}$, $\|\nabla_1 c(x, y) - \nabla_1 c(x', y)\| \le L_c\|x - x'\|$.

Assumption B.4. The Hessian matrix of the cost function $c$ is $L_{2,c}$-Lipschitz continuous: for all $x, x' \in \mathcal{X}$, $\|\nabla^2_{11}c(x, y) - \nabla^2_{11}c(x', y)\| \le L_{2,c}\|x - x'\|$.

Lemma B.2. Assume that the initialization $f^0 \in C(\mathcal{X})$ satisfies $\|f^0\|_\infty \le M_c$.
(i) Under Assumptions B.1 and B.2, there exists $G_f$ such that $\|\nabla f^t\|_{2,\infty} \le G_f$ for all $t > 0$.
(ii) Under Assumptions B.1-B.3, there exists $L_f$ such that $\|\nabla^2 f^t(x)\| \le L_f$ for all $t > 0$.
(iii) Under Assumptions B.1-B.4, there exists $L_{2,f}$ such that $\|\nabla^2 f^t(x) - \nabla^2 f^t(y)\|_{op} \le L_{2,f}\|x - y\|$ for all $t > 0$.
(iv) For $\|f\|_\infty \le M_c$, the function $B(f, \theta)(x)$ is $G_f$-Lipschitz continuous.

Proof. We denote $k(x, y) := \exp\{-c(x, y)/\gamma\}$ throughout this proof.

(i) Under Assumptions B.1 and B.2, $k$ is $(G_c/\gamma)$-Lipschitz continuous w.r.t. its first variable. For $f \in C(\mathcal{X})$ such that $\|f\|_\infty \le M_c$, we bound

$|A(f, \alpha)(x) - A(f, \alpha)(y)| = \gamma\Big|\log\int_{\mathcal{X}} [k(z, y) - k(z, x)]\exp\{f(z)/\gamma\}\, d\alpha(z)\Big| \le \gamma\exp(M_c/\gamma)\,(G_c/\gamma)\,\|x - y\|_2 = \exp(M_c/\gamma)\, G_c\, \|x - y\|_2$.

Using Lemma B.1, we know that $\{f^t\}$ is $M_c$-bounded and hence

$\|\nabla f^{t+1}\|_{2,\infty} \le G_f = \exp(2M_c/\gamma)\, G_c^2$.

(ii) Under Assumption B.1, $k(x, y) \ge \exp(-M_c/\gamma)$. We compute

$\nabla\big(A(f, \alpha)\big)(x) = \dfrac{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla_1 c(x, z)\, d\alpha(z)}{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)} =: \dfrac{g_1(x)}{g_2(x)}$.

Let $g_1 : \mathbb{R}^q \to \mathbb{R}^q$ and $g_2 : \mathbb{R}^q \to \mathbb{R}$ be the numerator and denominator of the above expression. If we have (a) $\|g_1\|_{2,\infty} \le G_1$, (b) $\|g_1(x) - g_1(y)\| \le L_1\|x - y\|$, (c) $\|g_2\|_\infty \le G_2$, (d) $|g_2(x) - g_2(y)| \le L_2\|x - y\|$, and (e) $g_2 \ge \underline{G}_2 > 0$, we can bound

$\Big\|\dfrac{g_1(x)}{g_2(x)} - \dfrac{g_1(y)}{g_2(y)}\Big\| = \Big\|\dfrac{g_1(x)g_2(y) - g_1(y)g_2(x)}{g_2(x)g_2(y)}\Big\| \le \dfrac{G_2L_1 + G_1L_2}{\underline{G}_2^2}\,\|x - y\|$,    (63)

which means that $\nabla\big(A(f, \alpha)\big)$ is $L$-Lipschitz continuous with $L = \frac{G_2L_1 + G_1L_2}{\underline{G}_2^2}$. We now prove (a)-(e).

(a) $\big\|\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla_1 c(x, z)\, d\alpha(z)\big\|_{2,\infty} \le \exp(M_c/\gamma)\cdot G_c$ (Assumption B.2).

(b) Note that for any two bounded and Lipschitz continuous functions $h_1 : \mathcal{X} \to \mathbb{R}$ and $h_2 : \mathcal{X} \to \mathbb{R}^q$, their product is also Lipschitz continuous:

$\|h_1(x)\,h_2(x) - h_1(y)\,h_2(y)\| \le \big[|h_1|_\infty\, G_{h_2} + \|h_2\|_{2,\infty}\, G_{h_1}\big]\,\|x - y\|$,    (64)

where $G_{h_i}$ denotes the Lipschitz constant of $h_i$, $i = 1, 2$. Hence for $g_1$ we have

$\|g_1(x) - g_1(y)\| \le \exp(M_c/\gamma)\cdot(L_c + G_c^2/\gamma)\cdot\|x - y\|$,

since $k(x, y) \le 1$, $\|\nabla_1 k(x, y)\| \le G_c/\gamma$, $\|\nabla_1 c(x, y)\| \le G_c$, and $\|\nabla^2_{11}c(x, y)\|_{op} \le L_c$.

(c) $\big\|\int_{\mathcal{X}} k(z, \cdot)\exp\{f(z)/\gamma\}\, d\alpha(z)\big\|_\infty \le \exp(M_c/\gamma)$.

(d) $\big|\int_{\mathcal{X}} [k(z, x) - k(z, y)]\exp\{f(z)/\gamma\}\, d\alpha(z)\big| \le \exp(M_c/\gamma)\cdot(G_c/\gamma)\cdot\|x - y\|$.

(e) $\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z) \ge \exp(-2M_c/\gamma) > 0$.

Combining the above points, we prove the existence of $L_f$.

For (iii), compute

$\nabla^2\big(A(f, \alpha)\big)(x) = \dfrac{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla_1 c(x, z)\nabla_1 c(x, z)^\top d\alpha(z)}{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)}$   (#1)
$\quad + \dfrac{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla^2_{11}c(x, z)\, d\alpha(z)}{\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)}$   (#2)
$\quad - \dfrac{\big[\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla_1 c(x, z)\, d\alpha(z)\big]\big[\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\,\nabla_1 c(x, z)\, d\alpha(z)\big]^\top}{\big[\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)\big]^2}$.   (#3)

We now analyze #1-#3 individually.

#1 Note that for any two bounded and Lipschitz continuous functions $h_1 : \mathcal{X} \to \mathbb{R}$ and $h_2 : \mathcal{X} \to \mathbb{R}^{q\times q}$, their product is also Lipschitz continuous:

$\|h_1(x)\,h_2(x) - h_1(y)\,h_2(y)\|_{op} \le \big[|h_1|_\infty\, G_{h_2} + \|h_2\|_{op,\infty}\, G_{h_1}\big]\,\|x - y\|$,    (65)

where $G_{h_i}$ denotes the Lipschitz constant of $h_i$, $i = 1, 2$. Take $h_1(x) = k(z', x)\exp\{f(z')/\gamma\}\big/\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)$. Then $h_1$ is bounded, since $k(z', x) \le 1$ and $\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z) \ge \exp(-2M_c/\gamma) > 0$, and $h_1$ is Lipschitz continuous since, additionally, $k(z', x)$ is Lipschitz continuous (see (63)). Take $h_2(x) = \nabla_1 c(x, z)\nabla_1 c(x, z)^\top$. Then $h_2$ is bounded since $\|\nabla_1 c(x, z)\| \le G_c$ (Assumption B.2), and $h_2$ is Lipschitz continuous due to Assumption B.3.

#2 Following a similar argument as for #1, we obtain the result; note that $h_2(x) = \nabla^2_{11}c(x, z)$ is Lipschitz continuous due to Assumption B.4.

#3 We follow a similar argument as for #1 by taking

$h_1(x) = \dfrac{k(z', x)\exp\{f(z')/\gamma\}\, k(z', x)\exp\{f(z')/\gamma\}}{\big[\int_{\mathcal{X}} k(z, x)\exp\{f(z)/\gamma\}\, d\alpha(z)\big]^2}$,

and $h_2(x) = \nabla_1 c(x, z)[\nabla_1 c(x, z)]^\top$.

Combining the above points, we prove the existence of $L_{2,f}$.

(iv) As a composition of $A$'s, $B(f, \theta)$ is also $G_f$-Lipschitz continuous (with $G_f$ as in (i)).

Moreover, based on the above continuity results, we can show that the first-order gradient $\nabla f^\varepsilon_\theta$ (and the second-order gradient $\nabla^2 f^\varepsilon_\theta$) also converges to $\nabla f_\theta$ (and $\nabla^2 f_\theta$) in a number of iterations depending only logarithmically on $1/\varepsilon$.

Lemma B.3. Under Assumptions B.1-B.3, the Sinkhorn-Knopp algorithm, i.e. the fixed-point iteration
$$f^{t+1} = \mathcal{B}(f^t, \theta), \tag{66}$$
computes $\nabla f_\theta$ in logarithmic time: $\|\nabla f^{t+1} - \nabla f_\theta\|_{2,\infty} = \varepsilon$ with $t = O(\log \tfrac{1}{\varepsilon})$.

Proof. For a fixed point $x \in \mathcal{X}$ and any direction $h \in \mathbb{R}^q$, we have
$$f^t(x + \eta h) - f^t(x) = \eta\, [\nabla f^t(x)]^\top h + \frac{\eta^2}{2}\, h^\top \nabla^2 f^t(x + \eta_1 h)\, h,$$
where $\eta > 0$ is some constant to be determined later and $0 \le \eta_1 \le \eta$ is obtained from the mean value theorem. Similarly, we have for some $0 \le \eta_2 \le \eta$
$$f_\theta(x + \eta h) - f_\theta(x) = \eta\, [\nabla f_\theta(x)]^\top h + \frac{\eta^2}{2}\, h^\top \nabla^2 f_\theta(x + \eta_2 h)\, h.$$
We can then compute
$$|[\nabla f^t(x) - \nabla f_\theta(x)]^\top h| \le \frac{2}{\eta}\, \|f^t - f_\theta\|_\infty + \eta\, L_f \|h\|^2.$$
Take $h = \nabla f^t(x) - \nabla f_\theta(x)$ and $\eta = \frac{2}{L_f}$. We derive from the above inequality
$$\|\nabla f^t(x) - \nabla f_\theta(x)\|^2 \le 2 L_f\, \|f^t - f_\theta\|_\infty.$$
Consequently, if we have $2 L_f \|f^t - f_\theta\|_\infty \le \varepsilon^2$, we can prove that $\|\nabla f^t - \nabla f_\theta\|_{2,\infty} \le \varepsilon$ since $x$ is arbitrary. This can be achieved in logarithmic time using the Sinkhorn-Knopp algorithm.
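To make the fixed-point iteration (66) concrete, the following is a minimal log-domain sketch of the Sinkhorn-Knopp update on empirical samples. The uniform sample weights, the squared-Euclidean ground cost, and the function name are illustrative assumptions rather than the authors' implementation (in practice the potentials are computed with geomloss, see Appendix E).

```python
import torch

def sinkhorn_potential(x, y, gamma=100.0, n_iters=200):
    """Log-domain Sinkhorn-Knopp fixed-point iteration f^{t+1} = B(f^t, theta),
    sketched on empirical samples x ~ alpha_theta and y ~ beta."""
    C = torch.cdist(x, y, p=2) ** 2                       # illustrative ground cost c(x, y)
    log_a = -torch.log(torch.tensor(float(x.shape[0])))   # uniform weights of alpha_theta
    log_b = -torch.log(torch.tensor(float(y.shape[0])))   # uniform weights of beta
    f = torch.zeros(x.shape[0])                           # bounded initialization f^0
    g = torch.zeros(y.shape[0])
    for _ in range(n_iters):                              # O(log 1/eps) iterations suffice
        # g = A(f, alpha_theta): soft-min of c(., y) - f(.) against alpha_theta
        g = -gamma * torch.logsumexp((f[:, None] - C) / gamma + log_a, dim=0)
        # f = A(g, beta): soft-min of c(x, .) - g(.) against beta
        f = -gamma * torch.logsumexp((g[None, :] - C) / gamma + log_b, dim=1)
    return f, g
```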

Lemma B.4. Under Assumptions B.1-B.4, the Sinkhorn-Knopp algorithm, i.e. the fixed-point iteration
$$f^{t+1} = \mathcal{B}(f^t, \theta), \tag{67}$$
computes $\nabla^2 f_\theta$ in logarithmic time: $\|\nabla^2 f^{t+1} - \nabla^2 f_\theta\|_{op,\infty} = \varepsilon$ with $t = O(\log \tfrac{1}{\varepsilon})$.

Proof. This follows a similar argument as Lemma B.3 by noticing that the third-order gradient of $f^t$ (and $f_\theta$) is bounded due to Assumption B.4.


B.3 Proof of Proposition 5.3

We now construct a sequence $\{g^t\}$ to approximate the Fréchet derivative of the Sinkhorn potential $Df_\theta$ such that for all $t \ge T(\varepsilon)$, with some integer function $T(\varepsilon)$ of the target accuracy $\varepsilon$, we have $\|g^t_\theta - Df_\theta\|_{op} \le \varepsilon$. In particular, we show that such an $\varepsilon$-accurate approximation can be achieved using a logarithmic number of simple function operations and integrations with respect to $\alpha_\theta$.

For a given target accuracy $\varepsilon > 0$, denote $\bar\varepsilon = \varepsilon / L_l$, where $L_l$ is a constant defined in Lemma B.5. First, use the Sinkhorn-Knopp algorithm to compute $f^{\bar\varepsilon}_\theta$, an approximation of $f_\theta$ such that $\|f^{\bar\varepsilon}_\theta - f_\theta\|_\infty \le \bar\varepsilon$. This computation can be done in $O(\log \tfrac{1}{\bar\varepsilon})$ from Proposition 5.2.

Denote $\mathcal{E}(f, \theta) = \mathcal{B}^l(f, \theta) = \mathcal{B}(\cdots \mathcal{B}(f, \theta), \cdots, \theta)$, the $l$-times composition of $\mathcal{B}$ in its first variable. Pick $l = \lceil \log_\lambda \tfrac{1}{3} \rceil / 2$. From the contraction of $\mathcal{A}$ under the Hilbert metric (61), we have
$$\|\mathcal{E}(f, \theta) - \mathcal{E}(f', \theta)\|_\infty \le \gamma\, d_H\big(\exp(\mathcal{E}(f, \theta)/\gamma), \exp(\mathcal{E}(f', \theta)/\gamma)\big) \le \gamma \lambda^{2l}\, d_H\big(\exp(f/\gamma), \exp(f'/\gamma)\big) \le 2\lambda^{2l} \|f - f'\|_\infty \le \frac{2}{3}\|f - f'\|_\infty,$$
where we use $\|f - f'\|_\infty \le d_H(\exp(f), \exp(f')) \le 2\|f - f'\|_\infty$ in the first and third inequalities. Consequently, $\mathcal{E}(f, \theta)$ is a contraction operator w.r.t. $f$ under the $\ell_\infty$ norm, which is equivalent to
$$\|D_1\mathcal{E}(f, \theta)\|_{op} \le \frac{2}{3}. \tag{68}$$

Now, given an arbitrary initialization $g^0_\theta : \Theta \to \mathcal{T}(\mathbb{R}^d, C(\mathcal{X}))$¹, construct iteratively
$$g^{t+1}_\theta = D_1\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta) \circ g^t_\theta + D_2\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta), \tag{69}$$
where $\circ$ denotes the composition of (linear) mappings. In the following, we show that
$$\|g^{t+1}_\theta - Df_\theta\|_{op} \le 3\varepsilon + \Big(\frac{2}{3}\Big)^t \|g^0_\theta - Df_\theta\|_{op}.$$

First, note that $f_\theta$ is a fixed point of $\mathcal{E}(\cdot, \theta)$:
$$f_\theta = \mathcal{E}(f_\theta, \theta).$$
Take the Fréchet derivative w.r.t. $\theta$ on both sides of the above equation. Using the chain rule, we compute
$$Df_\theta = D_1\mathcal{E}(f_\theta, \theta) \circ Df_\theta + D_2\mathcal{E}(f_\theta, \theta). \tag{70}$$
For any direction $h \in \mathbb{R}^d$, we bound the difference of the directional derivatives by
$$\|g^{t+1}_\theta[h] - Df_\theta[h]\|_\infty \le \|D_1\mathcal{E}(f_\theta, \theta)[Df_\theta[h]] - D_1\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta)[g^t_\theta[h]]\|_\infty + \|D_2\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta)[h] - D_2\mathcal{E}(f_\theta, \theta)[h]\|_\infty$$
$$\le \frac{2}{3}\|Df_\theta[h] - g^t_\theta[h]\|_\infty + L_l\big(\|f^{\bar\varepsilon}_\theta - f_\theta\|_\infty + \|\nabla f^{\bar\varepsilon}_\theta - \nabla f_\theta\|_\infty\big)\|h\|_\infty \le \frac{2}{3}\|Df_\theta - g^t_\theta\|_{op}\|h\|_\infty + \varepsilon\|h\|_\infty,$$
where in the second inequality we use the bound on $D_1\mathcal{E}$ in (68) and the $L_l$-Lipschitz continuity of $D_2\mathcal{E}$ with respect to its first argument (recall that $f^{\bar\varepsilon}_\theta$ is obtained from the Sinkhorn-Knopp algorithm and hence $\|f^{\bar\varepsilon}_\theta\|_\infty \le M_c$ from Lemma B.1 and $\|\nabla f^{\bar\varepsilon}_\theta\|_{2,\infty} \le G_f$ from (i) of Lemma B.2). The above inequality is equivalent to
$$\|g^{t+1}_\theta - Df_\theta\|_{op} - 3\varepsilon \le \frac{2}{3}\big(\|Df_\theta - g^t_\theta\|_{op} - 3\varepsilon\big) \;\Rightarrow\; \|g^{t+1}_\theta - Df_\theta\|_{op} \le 3\varepsilon + \Big(\frac{2}{3}\Big)^t \|g^0_\theta - Df_\theta\|_{op}.$$
Therefore, after $T(\varepsilon) = O(\log \tfrac{1}{\varepsilon})$ iterations, we find $g^{T(\varepsilon)}_\theta$ such that $\|g^{T(\varepsilon)}_\theta - Df_\theta\|_{op} \le 4\varepsilon$.
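Since $Df_\theta$ solves the linear fixed-point equation (70) and the update (69) contracts at rate $2/3$ by (68), its behaviour can be illustrated on a finite-dimensional stand-in. In the sketch below, A and b are synthetic placeholders for (discretizations of) $D_1\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta)$ and $D_2\mathcal{E}(f^{\bar\varepsilon}_\theta, \theta)$; they are not computed from an actual Sinkhorn problem.

```python
import torch

torch.manual_seed(0)
n, d = 50, 5                                                 # stand-ins for dim C(X) and dim Theta
A = torch.randn(n, n)
A = A / torch.linalg.matrix_norm(A, ord=2) * (2.0 / 3.0)     # enforce ||A||_op <= 2/3, cf. (68)
b = torch.randn(n, d)

g_star = torch.linalg.solve(torch.eye(n) - A, b)             # exact fixed point g = A g + b, cf. (70)
g = torch.zeros(n, d)                                        # arbitrary initialization g^0
for t in range(60):                                          # error contracts by 2/3 per step
    g = A @ g + b                                            # iteration (69)
print(float(torch.linalg.matrix_norm(g - g_star, ord=2)))    # ~ (2/3)^60 times the initial error
```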

Assumption B.5 (Boundedness of $\nabla_\theta T_\theta(x)$). There exists some $G_T > 0$ such that for any $x \in \mathcal{X}$ and $\theta \in \Theta$, $\|\nabla_\theta T_\theta(x)\|_{op} \le G_T$.

¹Recall that $\mathcal{T}(\mathbb{R}^d, C(\mathcal{X}))$ is the family of bounded linear operators from $\mathbb{R}^d$ to $C(\mathcal{X})$.


Lemma B.5 (Lipschitz continuity of $D_2\mathcal{E}$). Under Assumptions B.1-B.3 and B.5, $D_2\mathcal{E}$ is Lipschitz continuous with respect to its first variable: for $f, f' \in C(\mathcal{X})$ such that $\|f\|_\infty \le M_c$ ($\|f'\|_\infty \le M_c$) and $\|\nabla f\|_\infty \le G_f$ ($\|\nabla f'\|_\infty \le G_f$), and $\theta \in \Theta$, there exists some $L_l$ such that
$$\|D_2\mathcal{E}(f, \theta) - D_2\mathcal{E}(f', \theta)\|_{op} \le L_l\big(\|f - f'\|_\infty + \|\nabla f - \nabla f'\|_{2,\infty}\big). \tag{71}$$

Proof. Recall that Ep¨, θq “ Blp¨, θq. Using the chain rule of Fréchet derivative, we compute

D2Blpf, θq “ D1B`

Bl´1pf, θq, θ˘

˝D2Bl´1pf, θq `D2B`

Bl´1pf, θq, θ˘

. (72)

We bound the two terms on the R.H.S. individually.

Analyze the first term of (72). For a given f , use Af and Bf to denote two linear operatorsdepending on f . We have }Af ˝ Bf ´ Af 1 ˝ Bf 1}op “ Op}f ´ f 1}8 ` }∇f ´∇f 1}2,8q if bothAf and Bf are bounded, }Af ´Af 1}op “ Op}f ´ f 1}8 ` }∇f ´∇f 1}2,8q, and }Bf ´Bf 1}op “Op}f ´ f 1}8 ` }∇f ´∇f 1}2,8q:

}Af ˝Bf ´Af 1 ˝Bf 1}op ď }Af ˝Bf ´Af ˝Bf 1}op ` }Af ˝Bf 1 ´Af 1 ˝Bf 1}op

ď rmaxf}Bf }op ¨ LA `max

f}Af }op ¨ LBs

}f ´ f 1}8 ` }∇f ´∇f 1}2,8‰

, (73)

where LA and LB denote the constants of operators Af and Bf such that

}Af ´Af 1} ď LA“

}f ´ f 1}8 ` }∇f ´∇f 1}2,8‰

}Bf ´Bf 1} ď LB“

}f ´ f 1}8 ` }∇f ´∇f 1}2,8‰

.

We now takeAf “ D1B

`

Bl´1pf, θq, θ˘

and Bf “ D2Bl´1pf, θq.

}Af }op is bounded from the following lemma.

Lemma B.6. $\mathcal{B}(f, \theta)$ is 1-Lipschitz continuous with respect to its first variable.

Proof. We compute that for any measure κ and any function g P CpX q,

D1Apf, κqrgs “ş

X expt´ 1γ

`

cpx, yq ´ fpxq˘

ugpxqdκpxqş

X expt´ 1γ

`

cpx, yq ´ fpxq˘

udκpxq. (74)

Note that

}D1Apf, κqrgs}8 ď }ş

X expt´ 1γ

`

cpx, yq ´ fpxq˘

udκpxqş

X expt´ 1γ

`

cpx, yq ´ fpxq˘

udκpxq}8 ¨ }g}8 “ }g}8, (75)

and consequently we have }D1Apf, κq}op ď 1. Further, since B is the composition of A in its firstvariable, we have that }D1Bpf, θq}op ď 1.

}Bf }op is bounded from the following lemma.

Lemma B.7. Assume that $f \in C(\mathcal{X})$ satisfies $\|f\|_\infty \le M_c$ and $\|\nabla f\|_{2,\infty} \le G_f$. Under Assumptions B.2 and B.5, for all $l \ge 1$, $\|D_2\mathcal{B}^l(f, \theta)\|_{op}$ is $M_l$-bounded, with $M_l = l \cdot \exp(3M_c/\gamma) \cdot G_T \cdot (G_c + G_f)$.

Proof. In this proof, we denote Apf, θq:“Apf, αθq to make the dependence ofA on θ explicit. Usingthe chain rule of Fréchet derivative, we compute

D2Blpf, θq “ D1B`

Bl´1pf, θq, θ˘

˝D2Bl´1pf, θq `D2B`

Bl´1pf, θq, θ˘

. (76)

We will use Ml to denote the upper bound of }D2Blpf, θq}op. Consequently we have

Ml ď }D1B`

Bl´1pf, θq, θ˘

}op}D2Bl´1pf, θq}op ` }D2B`

Bl´1pf, θq, θ˘

}op

ďMl´1 ` }D2B`

Bl´1pf, θq, θ˘

}op,


where we use Lemma B.6 in the second inequality. Recall that Bpf, θq “ ApApf, θq, βq. Againusing the chain rule of the Fréchet derivative, we compute

D2Bpf, θq “ D1A`

Apf, θq, β˘

˝D2Apf, θq, (77)

and hence

}D2Bpf, θq}op ď }D1A`

Apf, θq, β˘

}op ¨ }D2Apf, θq}op ď }D2Apf, θq}op, (78)

where we use (75) in the second inequality. We now bound }D2Apf, θq}op. Denote

ωypxq:“ expp´cpx, yq{γq exppfpxq{γq.

We have expp´2Mc{γq ď ωypxq ď exppMc{γq from }f}8 ď Mc and Assumption B.1. For anydirection h P Rq (note that D2Apf, θqrhs : X Ñ R) and any y P X , we compute

`

D2Apf, θqrhs˘

pyq “

ş

X ωypTθpxqqxr∇θTθpxqsJ r´∇1cpTθpxq, yq `∇fpTθpxqqs , hydµpxqş

X ωypTθpxqqdµpxq,

where ∇θTθpxq denotes the Jacobian matrix of Tθpxq w.r.t. θ. Consequently we bound

}D2Apf, θqrhs}8 ď expp3Mc{γq}∇θTθpxq}op ¨ r}∇1c`

Tθpxq, y˘

} ` }∇f`

Tθpxq˘

}s ¨ }h}

ď expp3Mc{γq ¨GT ¨ pGc `Gf q}h},

which implies}D2Apf, θq}op ď expp3Mc{γq ¨GT ¨ pGc `Gf q. (79)

To show the Lipschitz continuity of Af , i.e. }Af ´ Af 1} ď LA}f ´ f 1}8, we first establish thefollowing continuity lemmas of D1Bp¨, θq and Bl´1p¨, θq.

Lemma B.8. For $f \in C(\mathcal{X})$ such that $\|f\|_\infty \le M_c$, $D_1\mathcal{B}(f, \theta)$ is $L$-Lipschitz continuous with respect to its first variable with $L = 2L_A$.

Proof. Use the chain rule of Fréchet derivative to compute

D1Bpf, θq “ D1A`

Apf, αθq, β˘

loooooooooomoooooooooon

Uf

˝D1Apf, αθqlooooomooooon

Vf

. (80)

We analyze the Lipschitz continuity of }D1Bpf, θq}op following the same logic as (73):

• The 1-boundedness of Uf and Vf is from Lemma B.6.

• The LA-Lipschitz continuity of Vf is from Lemma B.11.

• The LA-Lipschitz continuity of Uf is from Lemmas B.6 and B.11.

Consequently, we have that D1Bpf, θq is 2LA-Lipschitz continuous w.r.t. its first variable.

Lemma B.9. For all $l$, $\mathcal{B}^l(f, \theta)$ is 1-Lipschitz continuous with respect to its first variable.

Proof. Use the chain rule of Fréchet derivative to compute

D1Blpf, θq “ D1B`

Bl´1pf, θq, θ˘

˝D1Bl´1pf, θq. (81)

Consequently }D1Blpf, θq}op ď }D1Bpf, θq}lop. Further, we have }D1Bpf, θq}op ď 1 from LemmaB.6 which leads to the result.

We have that Af is Lipschitz continuous since (i) Af is the composition of Lipschitz continuousoperators D1Bp¨, θq and Bl´1pf ¨, θq and (ii) for }f}8 ď Mc, @l ě 0, }Blpf, θq}8 ď Mc (theargument is similar to Lemma B.1).

We prove }Bf ´ Bf 1} ď Ll“

}f ´ f 1}8 ` }∇f ´∇f 1}2,8‰

via induction. The following lemmaestablishes the base case for D2Bpf, θq (when l “ 2). Note that the boundedness of }f}8 (}f 1}8)and }∇f}8 (}∇f 1}8) remains valid after the operator B (Lemma B.1 and (i) of Lemma (B.2)).


Lemma B.10. There exists a constant $L_1$ such that for $\|f\|_\infty \le M_c$ ($\|f'\|_\infty \le M_c$) and $\|\nabla f\|_\infty \le G_f$ ($\|\nabla f'\|_\infty \le G_f$),
$$\|D_2\mathcal{B}(f, \theta) - D_2\mathcal{B}(f', \theta)\|_{op} \le L_1\big[\|f - f'\|_\infty + \|\nabla f - \nabla f'\|_{2,\infty}\big]. \tag{82}$$

Proof. In this proof, we denote Apf, θq:“Apf, αθq to make the dependence of A on θ explicit.Recall that Bpf, θq “ ApApf, θq, βq. Use the chain rule of Fréchet derivative to compute

D2Bpf, θq “ D1A`

Apf, αθq, β˘

loooooooooomoooooooooon

Uf

˝D2Apf, θqloooomoooon

Vf

. (83)

We analyze the Lipschitz continuity of }D2Bpf, θq}op following the same logic as (73):

• The 1-boundedness of Uf is from Lemma B.6.

• The expp3Mc{γq ¨GT ¨ pGc `Gf q-boundedness of Vf is from (79).

• The LA-Lipschitz continuity of Uf is from Lemmas B.6 and B.11 and the fact that for}f}8 ďMc, }Apf, θq}8 ďMc (the argument is similar to Lemma B.1).

• DenoteTypx, fq:“ expp´cpx, yq{γq exppfpxq{γq.

We compute

Vf “

ş

Z Ty`

Tθpzq, f˘

r∇θTθpzqsJ“

´∇1cpTθpzq, yq `∇f`

Tθpzq˘‰

dµpzqş

Z Ty`

Tθpzq, f˘

dµpzq, #

PfQf

Denote the numerator by Pf and the denominator by Qf . Following the similar idea as (63),we show that both }Pf }op and }Qf }8 are bounded, Qf is Lipschitz continuous w.r.t. f , Qfis positive and bounded from below, and }Pf ´Pf 1}op ď Lvr}f ´ f

1}8`}∇f ´∇f 1}2,8sfor some constant Lv .

– The boundedness of }Pf }op is from the boundedness of f , Assumptions B.5, B.2, andthe boundedness of∇f .

– The boundedness of }Qf }8 is from the boundedness of f .– Use DQf to denote the Fréchet derivative of Qf w.r.t. f . For any function g P CpX q,

DQf rgs “

ż

XTypx, fqgpxq{γdαθpxq, (84)

where we recall that αθ “ Tθ7µ. Further, we have }DQf rgs}8 ď exppMc{γq{γ}g}8,which implies the Lipschitz continuity of Qf (for }f}8 ďMc).

– We prove that for }f}8 ďMc (}f 1}8 ďMc) and }∇f}8 ď Gf (}∇f 1}8 ď Gf ),

}Pf ´ Pf 1}op ď Lvr}f ´ f1}8 ` }∇f ´∇f 1}2,8s.

For a fixed z P Z , denote

pzf :“Ty`

Tθpzq, f˘

r∇θTθpzqsJ“

´∇1cpTθpzq, yq `∇f`

Tθpzq˘‰

.

Note that Pf “ş

Z pzfdµpzq. For any direction h P Rd, we bound

}pzf rhs ´ pzf 1rhs}op

ď}D2Ty`

Tθpzq, f˘

}op}f ´ f1}8 ¨max

y|r∇θTθpzqhsJ

´∇1cpTθpzq, yq `∇f`

Tθpzq˘‰

|

` rmaxyTy`

Tθpzq, f˘

s ¨ }∇θTθpzqh}}∇f`

Tθpzq˘

´∇f 1`

Tθpzq˘

}

ď exppMc{γq{γ ¨GT ¨ pGc `Gf q ¨ }f ´ f1}8 ¨ }h} ` exppMc{γq ¨GT ¨ }h} ¨ }∇f ´∇f 1}2,8.

Consequently, we have that there exists a constant Lv such that

}pzf rhs ´ pzf 1rhs}8 ď Lvr}f ´ f

1}8 ` }∇f ´∇f 1}2,8s ¨ }h}.


The above lemma shows the base case for the induction. Now suppose that the inequality}D2Bkpf, θq ´D2Bkpf 1, θq}op ď Lk

}f ´ f 1}8 ` }∇f ´∇f 1}2,8‰

holds.For the case of k ` 1, we compute the Fréchet derivative

D2Bk`1pf, θq “ D1B`

Bkpf, θq, θ˘

˝D2Bkpf, θq `D2B`

Bkpf, θq, θ˘

,

and hence we can bound

}D2Bk`1pf, θq ´D2Bk`1pf 1, θq}op

ď }D1B`

Bkpf, θq, θ˘

˝`

D2Bkpf, θq ´D2Bkpf 1, θq˘

}op

` }

ˆ

D1B`

Bkpf, θq, θ˘

´D1B`

Bkpf 1, θq, θ˘

˙

˝D2Bkpf 1, θq}op

` }D2B`

Bkpf, θq, θ˘

´D2B`

Bkpf 1, θq, θ˘

}op

ď }D2Bkpf, θq ´D2Bkpf 1, θq}op (85)

` LA}Bkpf, θq ´ Bkpf 1, θq}8}D2Bkpf 1, θq}op` L1

}Bkpf, θq ´ Bkpf 1, θq}8 ` }∇Bkpf, θq ´∇Bkpf 1, θq}2,8‰

ď Lkr}f ´ f1}8 ` }∇f ´∇f 1}2,8s ` LA ¨Mk ¨ }f ´ f

1}8

` L1}f ´ f1}8 ` L1}∇Bkpf, θq ´∇Bkpf 1, θq}2,8

ď pLk ` L1 ` LAMkqr}f ´ f1}8 ` }∇f ´∇f 1}2,8s ` L1}∇Bkpf, θq ´∇Bkpf 1, θq}2,8. (86)

Here in the third inequality, we use the induction for the first term, Lemma B.7 for the second term.Notice that ∇Apf, θq is Lipschitz continuous w.r.t. f : Denote kpx, yq:“ expt´cpx, yq{γu. For anyfixed x P X ,

∇`

Apf, αq˘

pxq “

ş

X kpz, xq exptfpzq{γu∇1cpx, zqdαpzqş

X kpz, xq exptfpzq{γudαpzq, #

g1pfq

g2pfq

where we denote the numerator and denominator of the above expression by g1 : CpX q Ñ Rq andg2 : CpX q Ñ R. From the boundedness of g1 and g2, the Lipschitz continuity of g1 and g2 w.r.t. tof , and the fact that g2 is positive and bounded away from zero, we conclude that there exists someconstant LA,f such that for any x P X (this follows similarly as (63))

}∇`

Apf, αq˘

pxq ´∇`

Apf 1, αq˘

pxq} ď LA,f }f ´ f1}8. (87)

Recall that Bk is the compositions of operators in the form of A. Consequently, we have that

}∇Bkpf, θq ´∇Bkpf 1, θq}2,8 ď LA,f }f ´ f1}8.

Plugging this result into (86), we prove that the induction holds for k ` 1:

}D2Bk`1pf, θq´D2Bk`1pf 1, θq}op ď pLk`L1`LAMk`L1LA,f qr}f´f1}8`}∇f´∇f 1}2,8s.

Consequently, for any finite l, we have }Bf ´ Bf 1} ď Ll“

}f ´ f 1}8 ` }∇f ´ ∇f 1}2,8‰

, whereLl “ l ¨ pL1 ` LAMk ` L1LA,f q.

Lemma B.11. Under Assumption B.1, for $f \in C(\mathcal{X})$ such that $\|f\|_\infty \le M_c$, there exists a constant $L_A$ such that $D_1\mathcal{A}(f, \alpha_\theta)$ is $L_A$-Lipschitz continuous with respect to its first variable.

Proof. Let g P CpX q any function. Denote Typx, fq:“ expp´cpx, yq{γq exppfpxq{γq. For a fixedpoint y P X and any function g P CpX q, we compute that

`

D1Apf, θqrgs˘

pyq “

ş

X Typx, fqgpxqdαθpxqş

X Typx, fqdαθpxq, #

g1pfq

g2pfq

where we denote the numerator and denominator of the above expression by g1 : CpX q Ñ Rq andg2 : CpX q Ñ R. From the boundedness of g1 and g2, the Lipschitz continuity of g1 and g2 w.r.t. tof , and the fact that g2 is positive and bounded away from zero, we conclude that there exists someconstant LA such that for any x P X (this follows similarly as (63)).


Analyze the second term of (72). We bound the second term of (72) using Lemma B.10:

}D2BpBl´1pf, θq, θq ´D2BpBl´1pf 1, θq, θq}op

ď L1r}Bl´1pf, θq ´ Bl´1pf 1, θq}8 ` }∇Bl´1pf, θq ´∇Bl´1pf 1, θq}2,8s

ď L1r}f ´ f1}8 ` LA,f }f ´ f

1}8s “ L1 ¨ p1` LA,f q}f ´ f1}8,

where we use (87) in the second inequality.

Combing the analysis for the two terms of (72), we conclude the result.

B.4 Proof of Theorem 5.1

We prove that the approximation error of ∇2θOTγpαθ, βq using the estimated Sinkhorn potential f εθ

and the estimated Fréchet derivative gεθ is of the order

Op}f εθ ´ fθ}8 ` }∇f εθ ´∇fθ}2,8 ` }∇2f εθ ´∇2fθ}op,8 ` }gεθ ´Dfθ}opq.

The other term∇2θOTγpαθ, αθq is handled in a similar manner.

Recall the simplified expression of ∇2θOTγpαθ, βq in (52). Given the estimator f εθ (gεθ) of fθ (Dfθ),

we need to prove the following bounds of differences in terms of the estimation accuracy: For anyh1, h2 P Rd,

|D211H1pfθ, θq

Dfθrh1s, Dfθrh2s‰

´D211H1pf

εθ , θq

gεθrh1s, gεθrh2s

|

“ O p}h1} ¨ }h2} ¨ p}fεθ ´ fθ}8 ` }g

εθ ´Dfθ}opqq , (88)

}D222H1pfθ, θq ´D

222H1pf

εθ , θq}op

“ O`

}f εθ ´ fθ}8 ` }∇f εθ ´∇fθ}2,8 ` }∇2f εθ ´∇2fθ}op,8˘

. (89)

Note that from the definition of the operator norm the first results is equivalent to the bound in theoperator norm. Using Propositions 5.2 and 5.3 and Lemmas B.3, B.4, we know that we can computethe estimators f εθ and gεθ such that }f εθ´fθ}8 ď ε, }∇f εθ´∇fθ}2,8 ď ε, and }∇2f εθ´∇2fθ}op,8 ď ε,and }gεθ ´Dfθ}op ď ε in logarithm time Oplog 1

ε q. Together with (88) and (89) proved above, wecan compute an ε-accurate estimation of ∇2

θOTγpαθ, βq (in the operator norm) in logarithm timeOplog 1

ε q.

Bounding (88). Recall the definition of D211H1pfθ, θq

Dfθrh1s, Dfθrh2s‰

in (56). Denote

A1 “ D211Apfθ, αθq, v1 “ Dfθrh1s, v2 “ Dfθrh2s,

A2 “ D211Apf εθ , αθq, u1 “ gεθrh1s, u2 “ gεθrh2s.

Based on these definitions, we have

D211H1pfθ, θq

Dfθrh1s, Dfθrh2s‰

ż

XA1rv1, v2spyqdβpyq

D211H1pf

εθ , θq

gεθrh1s, gεθrh2s

ż

XA2ru1, u2spyqdβpyq.

Using the triangle inequality, we have

}A1rv1, v2s´A2ru1, u2s}8 (90)ď }A1rv1 ´ u1, v2s}8 ` }A1ru1, v2 ´ u2s}8 ` }pA1 ´A2qru1, u2s}8.

We bound the three terms on the R.H.S. individually.

For the first term on the R.H.S. of (90), we recall the explicit expression of A1rv1, v2spyq in (55) as

A1rv1, v2spyq “

ş

X Typx, fθqv1pxqv2pxqdαθpxq

γş

X Typx, fθqdαθpxq´

ş

X 2 Typx, fθqTypx1, fθqv1pxqv2px1qdαθpxqdαθpx

1q

γ“ş

X Typx, fθqdαθpxq‰2 .


Here we recall Typx, fq:“ expp´cpx, yq{γq exppfpxq{γq. We bound using the facts that Typx, fθqis bounded from above and bounded away from zero

|A1rv1 ´ u1, v2spyq| ď |

ş

X Typx, fθq`

v1pxq ´ u1pxq˘

v2pxqdαθpxq

γş

X Typx, fθqdαθpxq|

` |

ş

X 2 Typx, fθqTypx1, fθq`

v1pxq ´ u1pxq˘

v2px1qdαθpxqdαθpx

1q

γ“ş

X Typx, fθqdαθpxq‰2 |

“ Op}v1 ´ u1}8 ¨ }v2}8q.

Further, we have }u1 ´ v1}8 “ Op}Dfθ ´ gεθ}op ¨ }h1}q and }v1}8 “ Op}h2}q. Consequently, thefirst term on the R.H.S. of (90) is of order Op}Dfθ ´ gεθ}op ¨ }h1} ¨ }h2}q.

Following the same argument, we have the second term on the R.H.S. of (90) is of order Op}Dfθ ´gεθ}op ¨ }h1} ¨ }h2}q.

To bound the third term on the R.H.S. of (90), denote

A11ru1, u2s:“

ş

X Typx, fθqu1pxqu2pxqdαθpxq

γş

X Typx, fθqdαθpxqandA21ru1, u2s:“

ş

X Typx, fεθqu1pxqu2pxqdαθpxq

γş

X Typx, fεθqdαθpxq

,

and denote

A12ru1, u2s:“

ş

X Typx, fθqu1pxqdαθpxqş

X Typx1, fθqu2px

1qdαθpx1q

γ“ş

X Typx, fθqdαθpxq‰2 ,

and A22ru1, u2s:“

ş

X Typx, fεθqu1pxqdαθpxq

ş

X Typx1, f εθqu2px

1qdαθpx1q

γ“ş

X Typx, fεθqdαθpxq

‰2 .

We show that both |`

A11 ´A21

˘

ru1, u2s| and |`

A12 ´A22

˘

ru1, u2s| are of order Op}Dfθ ´ gεθ}op ¨}h1} ¨ }h2}q. This then implies |

`

A1 ´A2

˘

ru1, u2s| “ Op}Dfθ ´ gεθ}op ¨ }h1} ¨ }h2}q.With the argument similar to (63), we obtain that |

`

A11 ´ A21

˘

ru1, u2s| “ Op}Dfθ ´ gεθ}op ¨}u1} ¨ }u2}q using the boundedness and Lipschitz continuity of the numerator and denominator ofA11ru1, u2s w.r.t. to fθ and the fact that the denominator is positive and bounded away from zero(see the discussion following (63)). Further, since both Dfθ and gεθ are bounded linear operators,we have that u1 “ Oph1q and u2 “ Oph2q. Consequently, we prove that |

`

A11 ´A21

˘

ru1, u2s| “

Op}fθ ´ f εθ}op ¨ }h1} ¨ }h2}q.Similarly, we can prove that |

`

A12 ´A22

˘

ru1, u2s| “ Op}fθ ´ f εθ}op ¨ }h1} ¨ }h2}q.

Altogether, we have proved (88).

Bounding (89). Recall that the expression of D222H1pf, θq in (58). For a fixed y P X and a fixed

z1 P Z , denote (recall that uzpθ, fq “ ∇1c`

Tθpzq, y˘

´∇f`

Tθpzq˘

)

B1pfq “∇2θTθpz

1q ˆ1 ∇f`

Tθpz1q˘

B2pfq “∇θTθpz1qJ∇2f`

Tθpz1q˘

∇θTθpz1q

B3pfq “

ş

Z Ty`

Tθpzq, f˘

∇θTθpzqJuzpθ, fquzpθ, fqJ∇θTθpzqdµpzqş

Z Ty`

Tθpzq, f˘

dµpzq

B4pfq “

ş

Z Ty`

Tθpzq, f˘

∇2θTθpzq ˆ1 uzpθ, fqdµpzq

ş

Z Ty`

Tθpzq, f˘

dµpzq

B5pfq “

ş

Z Ty`

Tθpzq, f˘

∇θTθpzqJ∇11cpTθpzq, yq∇θTθpzqdµpzqş

Z Ty`

Tθpzq, f˘

dµpzq

B6pfq “ ´

ş

Z Ty`

Tθpzq, f˘

∇θTθpzqJ∇2f`

Tθpzq˘

∇θTθpzqdµpzqş

Z Ty`

Tθpzq, f˘

dµpzq

B7pfq “

ş

Z Ty`

Tθpzq, f˘

∇θTθpzqJuzpθ, fqdµpzq“ş

Z Ty`

Tθpzq, f˘

∇θTθpzqJuzpθ, fqdµpzq‰J

“ş

Z Ty`

Tθpzq, f˘

dµpzq‰2


Based on these definitions, we have

D222H1pf, θq “

ż

Z

2ÿ

i“1

Bipfqdµpz1q `

ż

X

7ÿ

i“3

Bipfqdβpyq.

We bound the above seven terms individually.

Assumption B.6. For a fixed $z \in \mathcal{Z}$ and $\theta \in \Theta$, use $\nabla^2_\theta T_\theta(z) \in \mathcal{T}(\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^q)$² to denote the second-order Jacobian of $T_\theta(z)$ w.r.t. $\theta$. Use $\times_1$ to denote the tensor product along the first dimension. For any two vectors $g, g' \in \mathbb{R}^d$, we assume that
$$\|\nabla^2_\theta T_\theta(z) \times_1 g - \nabla^2_\theta T_\theta(z) \times_1 g'\|_{op} = O(\|g - g'\|). \tag{91}$$

1q (Assumption B.6), we have that

}B1pfθq ´B1pfεθq}op “ Op}∇fθ ´∇f εθ}2,8q.

For the second term, using the boundedness of∇θTθpz1q, we have that

}B2pfθq ´B2pfεθq}op “ Op}∇2fθ ´∇2f εθ}op,8q.

For the third term, note that }uzpθ, fθq ´ uzpθ, fεθq} “ Op}∇fθ ´ ∇f εθ}2,8q. With the argument

similar to (63), we obtain that

}B3pfθq ´B3pfεθq}op “ Op}fθ ´ f εθ}8 ` }∇fθ ´∇f εθ}2,8q. (92)

This is from the boundedness and Lipschitz continuity of Ty`

Tθpzq, f˘

w.r.t. to f , the boundednessand Lipschitz continuity of uzpθ, fq w.r.t. ∇f , and the fact that Ty

`

Tθpzq, f˘

is positive and boundedaway from zero.

For the forth term, following the similar argument as the third term and using the boundedness of∇2θTθpzq, we have that

}B4pfθq ´B4pfεθq}op “ Op}fθ ´ f εθ}8 ` }∇fθ ´∇f εθ}2,8q. (93)

For the fifth term, following the similar argument as the third term and using the boundedness of∇θTθpzq and∇11cpTθpzq, yq, we have that

}B5pfθq ´B5pfεθq}op “ Op}fθ ´ f εθ}8q. (94)

For the sixth term, following the similar argument as the third term and using the boundedness of∇θTθpzq, we have that

}B6pfθq ´B6pfεθq}op “ Op}fθ ´ f εθ}8 ` }∇2fθ ´∇2f εθ}op,8q. (95)

For the last term, following the similar argument as the third term and using the boundedness of∇θTθpzq, we have that

}B7pfθq ´B7pfεθq}op “ Op}fθ ´ f εθ}8 ` }∇fθ ´∇f εθ}2,8q. (96)

Combing the above results, we obtain (89).

²Recall that $\mathcal{T}(U, W)$ is the family of bounded linear operators from $U$ to $W$.


C eSIM appendix

C.1 Proof of Theorem 6.1

In this section, we use $f^\mu_\theta$ to denote the Sinkhorn potential of $OT_\gamma(T_\theta\sharp\mu, \beta)$. This allows us to emphasize the continuity of its Fréchet derivative w.r.t. the underlying measure $\mu$. Similarly, we write $\mathcal{B}_\mu(f, \theta)$ and $\mathcal{E}_\mu(f, \theta)$ instead of $\mathcal{B}(f, \theta)$ and $\mathcal{E}(f, \theta)$, which are used to characterize the fixed-point property of the Sinkhorn potential.

To prove Theorem 6.1, we need the following lemmas.

Lemma C.1. The Sinkhorn potential $f^\mu_\theta$ is Lipschitz continuous with respect to $\mu$:
$$\|f^\mu_\theta - f^{\bar\mu}_\theta\|_\infty = O(d_{bl}(\mu, \bar\mu)). \tag{97}$$

Lemma C.2. The gradient of the Sinkhorn potential $f^\mu_\theta$ is Lipschitz continuous with respect to $\mu$:
$$\|\nabla f^\mu_\theta - \nabla f^{\bar\mu}_\theta\|_{2,\infty} = O(d_{bl}(\mu, \bar\mu)). \tag{98}$$

Lemma C.3. The Hessian of the Sinkhorn potential $f^\mu_\theta$ is Lipschitz continuous with respect to $\mu$:
$$\|\nabla^2 f^\mu_\theta - \nabla^2 f^{\bar\mu}_\theta\|_{op,\infty} = O(d_{bl}(\mu, \bar\mu)). \tag{99}$$

Lemma C.4. The Fréchet derivative of the Sinkhorn potential $f^\mu_\theta$ w.r.t. the parameter $\theta$, i.e. $Df^\mu_\theta$, is Lipschitz continuous with respect to $\mu$:
$$\|Df^\mu_\theta - Df^{\bar\mu}_\theta\|_{op} = O(d_{bl}(\mu, \bar\mu)). \tag{100}$$

Once we have these lemmas, we can prove Theorem 6.1 in the same way as the proof of Theorem 5.1 in Appendix B.4.

C.2 Proof of Lemma C.1

Note that from the definition of the bounded Lipschitz distance, we have
$$d_{bl}(\alpha, \bar\alpha) = \sup_{\|\xi\|_{bl} \le 1} |\langle \xi, \alpha\rangle - \langle \xi, \bar\alpha\rangle| = \sup_{\|\xi\|_{bl} \le 1} |\langle \xi \circ T_\theta, \mu\rangle - \langle \xi \circ T_\theta, \bar\mu\rangle| \le \sup_{\|\xi\|_{bl} \le 1} \|\xi \circ T_\theta\|_{bl} \cdot d_{bl}(\mu, \bar\mu) \le G_T \cdot d_{bl}(\mu, \bar\mu), \tag{101}$$
where we use $\|\xi \circ T_\theta\|_{lip} \le G_T$ from Assumption B.5.

We have Lemma C.1 by combining the above results with the following lemma.

Lemma C.5. Under Assumptions B.1 and B.2, the Sinkhorn potential is Lipschitz continuous with respect to the bounded Lipschitz metric: given measures $\alpha$, $\alpha'$ and $\beta$, we have
$$\|f_{\alpha,\beta} - f_{\alpha',\beta}\|_\infty \le G_{bl}\, d_{bl}(\alpha', \alpha) \quad \text{and} \quad \|g_{\alpha,\beta} - g_{\alpha',\beta}\|_\infty \le G_{bl}\, d_{bl}(\alpha', \alpha),$$
where $G_{bl} = 2\gamma \exp(2M_c/\gamma)\, G'_{bl} / (1 - \lambda^2)$ with $G'_{bl} = \max\{\exp(3M_c/\gamma),\, 2G_c \exp(3M_c/\gamma)/\gamma\}$ and $\lambda = \frac{\exp(M_c/\gamma) - 1}{\exp(M_c/\gamma) + 1}$.

Proof. Let pf, gq and pf 1, g1q be the Sinkhorn potentials to OTγpα, βq and OTγpα1, βq respectively.

Denote u:“ exppf{γq, v:“ exppg{γq and u1:“ exppf 1{γq, v1:“ exppg1{γq. From Lemma C.7, u isbounded in terms of the L8 norm:

}u}8 “ maxxPX

|upxq| “ maxxPX

exppf{γq ď expp2Mc{γq,

which also holds for v, u1, v1. Additionally, from Lemma C.8,∇u exists and }∇u} is bounded:

maxx}∇upxq} “ max

x

1

γ|upxq|}∇fpxq} ď 1

γ}upxq}8max

x}∇fpxq} ď Gc expp2Mc{γq

γ.

Define the mapping Aαµ:“1{pLαµq with

Lαµ “

ż

Xlp¨, yqµpyqdαpyq,

where lpx, yq:“ expp´cpx, yq{γq. From Assumption B.1, we have }l}8 ď exppMc{γq and fromAssumption B.2 we have }∇xlpx, yq} ď exppMc{γq

Gcγ . From the optimality condition of f and g,

we have v “ Aαu and u “ Aβv. Similarly, v1 “ Aα1u1 and u1 “ Aβv

1. Recall the definition of theHilbert metric in (60). Note that dHpµ, νq “ dHp1{µ, 1{νq if µpxq ą 0 and νpxq ą 0 for all x P Xand hence dHpLαµ,Lανq “ dHpAαµ,Aανq. We recall the result in (61) using the above notations.


Lemma C.6 (Birkhoff-Hopf Theorem Lemmens and Nussbaum [2012], see Lemma B.4 in Luiseet al. [2019]). Let λ “ exppMc{γq´1

exppMc{γq`1 and α P M`1 pX q. Then for every u, v P CpX q, such that

upxq ą 0, vpxq ą 0 for all x P X , we have

dHpLαu, Lαvq ď λdHpu, vq.

Note that

} logµ´ log ν}8 ď dHpµ, νq “ } logµ´ log ν}8 ` } log ν ´ logµ}8 ď 2} logµ´ log ν}8.

In the following, we derive upper bound for dHpµ, νq and use such bound to analyze the Lipschitzcontinuity of the Sinkhorn potentials f and g.Construct v:“Aαu

1. Using the triangle inequality (which holds since vpxq, v1pxq, vpxq ą 0 for allx P X ), we have

dHpv, v1q ď dHpv, vq ` dHpv, v

1q ď λdHpu, u1q ` dHpv, v

1q,

where the second inequality is due to Lemma C.6. Note that u1 “ Aβv1. Apply Lemma C.6 again to

obtaindHpu, u

1q ď λdHpv, v1q.

Together, we obtain

dHpv, v1q ď λ2dHpv, v

1q ` dHpv, v1q ` λdHpu, u

1q ď λ2dHpv, v1q ` dHpv, v

1q,

which leads todHpv, v

1q ď1

1´ λ2rdHpv, v

1qs.

To bound dHpv, v1q, observe the following:

dHpv1, vq “dHpLα1u

1, Lαu1q ď 2} logLα1u

1 ´ logLαu1}8

“2 maxxPX

|∇ logpaxqprLα1u1spxq ´ rLαu

1spxqq| “ 2 maxxPX

1

ax|rLα1u

1spxq ´ rLαu1spxq|

ď2 maxt}1{Lα1u1}8, }1{Lαu

1}8u}Lα1u1 ´ Lαu

1}8, (102)

where ax P rrLα1u1spxq, rLαu1spxqss in the second line is from the mean value theorem. Further, inthe inequality we use maxt}1{Lαu

1}8, }1{Lαu1}8u “ maxt}Aα1u

1}8, }Aαu1}8u ď expp2Mc{γq.

Consequently, all we need to bound is the last term }Lα1u1 ´ Lαu1}8.

We first note that @x P X , }lpx, ¨qu1p¨q}bl ă 8: In terms of } ¨ }8

}lpx, ¨qu1p¨q}8 ď }lpx, ¨q}8}u1}8 ď expp3Mc{γq ă 8.

In terms of } ¨ }lip, we bound

}lpx, ¨qu1p¨q}lip ď }lpx, ¨q}8}u1}lip ` }lpx, ¨q}lip}u

1}8

ď exppMc{γqGc expp2Mc{γq

γ` exppMc{γq

Gcγ

expp2Mc{γq “2Gc expp3Mc{γq

γă 8.

Together we have }lpx, yqu1pyq}bl ď maxtexpp3Mc{γq,2Gc expp3Mc{γq

γ u. From the definition of theoperator Lα, we have

}Lα1u1 ´ Lαu

1}8 “ maxx|

ż

Xlpx, yqu1pyqdα1pyq ´

ż

Xlpx, yqu1pyqdαpyq| ď }lpx, yqu1pyq}bldblpα

1, αq.

All together we derive

dHpv1, vq ď

2 expp2Mc{γq}lpx, yqu1pyq}bl

1´ λ2¨ dblpα

1, αq pλ “exppMc{γq ´ 1

exppMc{γq ` 1q.

Further, since dHpv1, vq ě } log v1 ´ log v}8 “1γ }f

1 ´ f}8, we have the result:

}f 1 ´ f}8 ď2γ expp2Mc{γq}lpx, yqu

1pyq}bl1´ λ2

¨ dblpα1, αq.

Similar argument can be made for }g1 ´ g}8.


Lemma C.7 (Boundedness of the Sinkhorn Potentials). Let $(f, g)$ be the Sinkhorn potentials of problem (6) and assume that there exists $x_o \in \mathcal{X}$ such that $f(x_o) = 0$ (otherwise shift the pair by $f(x_o)$). Then, under Assumption B.1, $\|f\|_\infty \le 2M_c$ and $\|g\|_\infty \le 2M_c$.

Next, we analyze the Lipschitz continuity of the Sinkhorn potential $f_{\alpha,\beta}(x)$ with respect to the input $x$.

Assumption B.2 implies that $\nabla_x c(x, y)$ exists and for all $x, y \in \mathcal{X}$, $\|\nabla_x c(x, y)\| \le G_c$. It further ensures the Lipschitz continuity of the Sinkhorn potential.

Lemma C.8 (Proposition 12 of Feydy et al. [2019]). Under Assumption B.2, for a fixed pair of measures $(\alpha, \beta)$, the corresponding Sinkhorn potential $f : \mathcal{X} \to \mathbb{R}$ is $G_c$-Lipschitz continuous, i.e. for $x_1, x_2 \in \mathcal{X}$,
$$|f_{\alpha,\beta}(x_1) - f_{\alpha,\beta}(x_2)| \le G_c \|x_1 - x_2\|. \tag{103}$$
Further, the gradient $\nabla f_{\alpha,\beta}$ exists at every point $x \in \mathcal{X}$, and $\|\nabla f_{\alpha,\beta}(x)\| \le G_c$ for all $x \in \mathcal{X}$.

Lemma C.9. Under Assumption B.3, for a fixed pair of measures $(\alpha, \beta)$, the gradient of the corresponding Sinkhorn potential $f : \mathcal{X} \to \mathbb{R}$ is Lipschitz continuous,
$$\|\nabla f(x_1) - \nabla f(x_2)\| \le L_f \|x_1 - x_2\|, \tag{104}$$
where $L_f := \frac{4G_c^2}{\gamma} + L_c$.

C.3 Proof of Lemma C.2

We have Lemma C.2 by combining (101) with the following lemma.

Lemma C.10 (Lemma C.2 restated). Under Assumptions B.1 and B.2, the gradient of the Sinkhorn potential is Lipschitz continuous with respect to the bounded Lipschitz metric: given measures $\alpha$, $\alpha'$ and $\beta$, we have
$$\|\nabla f_{\alpha,\beta} - \nabla f_{\alpha',\beta}\|_\infty = O\big(d_{bl}(\alpha', \alpha)\big).$$

Proof. From the optimality condition of the Sinkhorn potentials, one has that
$$\int_{\mathcal{X}} h_{\alpha,\beta}(x, y)\, d\beta(y) = 1, \quad \text{with } h_{\alpha,\beta}(x, y) := \exp\Big(\frac{1}{\gamma}\big(f_{\alpha,\beta}(x) + g_{\alpha,\beta}(y) - c(x, y)\big)\Big). \tag{105}$$
Taking the gradient w.r.t. $x$ on both sides of the above equation, the expression of $\nabla f_{\alpha,\beta}$ writes
$$\nabla f_{\alpha,\beta}(x) = \frac{\int_{\mathcal{X}} h_{\alpha,\beta}(x, y)\nabla_x c(x, y)\, d\beta(y)}{\int_{\mathcal{X}} h_{\alpha,\beta}(x, y)\, d\beta(y)} = \int_{\mathcal{X}} h_{\alpha,\beta}(x, y)\nabla_x c(x, y)\, d\beta(y). \tag{106}$$
We have that for all $x, y$, $h_{\alpha,\beta}(x, y)$ is Lipschitz continuous w.r.t. $\alpha$, which is due to the boundedness of $f_{\alpha,\beta}(x)$, $g_{\alpha,\beta}(y)$ and the ground cost $c$, and Lemma C.1. Further, since $\|\nabla_x c(x, y)\|$ is bounded from Assumption B.2, we have the Lipschitz continuity of $\nabla f_{\alpha,\beta}$ w.r.t. $\alpha$, i.e.
$$\|\nabla f_{\alpha,\beta}(x) - \nabla f_{\alpha',\beta}(x)\| = O\big(d_{bl}(\alpha', \alpha)\big).$$

C.4 Proof of Lemma C.3

We have Lemma C.3 by combining (101) with the following lemma.

Lemma C.11 (Lemma C.3 restated). Under Assumptions B.1-B.3, the Hessian of the Sinkhorn potential is Lipschitz continuous with respect to the bounded Lipschitz metric: given measures $\alpha$, $\alpha'$ and $\beta$, we have
$$\|\nabla^2 f_{\alpha,\beta} - \nabla^2 f_{\alpha',\beta}\|_{op,\infty} = O\big(d_{bl}(\alpha', \alpha)\big).$$

Proof. Taking the gradient w.r.t. $x$ on both sides of (106), the expression of $\nabla^2 f_{\alpha,\beta}$ writes
$$\nabla^2 f_{\alpha,\beta}(x) = \int_{\mathcal{X}} \frac{1}{\gamma}\, h_{\alpha,\beta}(x, y)\big(\nabla f_{\alpha,\beta}(x) - \nabla_x c(x, y)\big)[\nabla_x c(x, y)]^\top + h_{\alpha,\beta}(x, y)\, \nabla^2_{xx} c(x, y)\, d\beta(y).$$
From the boundedness of $h_{\alpha,\beta}$, $\nabla f_{\alpha,\beta}$ and $\nabla_x c$, and the Lipschitz continuity of $h_{\alpha,\beta}$ and $\nabla f_{\alpha,\beta}$ w.r.t. $\alpha$, we have that the first integrand of $\nabla^2 f_{\alpha,\beta}$ is Lipschitz continuous w.r.t. $\alpha$. Further, combining the boundedness of $\|\nabla^2_{xx} c(x, y)\|$ from Assumption B.3 and the Lipschitz continuity of $h_{\alpha,\beta}$ w.r.t. $\alpha$, we have the Lipschitz continuity of $\nabla^2 f_{\alpha,\beta}(x)$, i.e.
$$\|\nabla^2 f_{\alpha,\beta}(x) - \nabla^2 f_{\alpha',\beta}(x)\| = O\big(d_{bl}(\alpha', \alpha)\big).$$

C.5 Proof of Lemma C.4

The optimality of the Sinkhorn potential fµθ can be restated as

fµθ “ Bµpfµθ , θq, (107)

where we recall the definition of Bµ in (18)

Bµpf, θq “ A`

Apf, Tθ7µq, βµ˘

. (108)

Note that it is possible that βµ depends on µ, which is the case in OTγpαθ, αθtq as βµ “ αθt “ Tθt 7µ.

Under Assumption B.1, let λ “ eMc{γ´1eMc{γ`1

. By repeating the above fixed point iteration (107) l “rlogλ

13 s{2 times, we have that

fµθ “ Eµpfµθ , θq, (109)

where Eµpf, θq “ Blµpf, θq “ Bµ`

¨ ¨ ¨Bµpf, θq ¨ ¨ ¨ , θ˘

is the l times composition of Bµ in its firstvariable. We have from (68)

||D1Eµpf, θq}op ď2

3, (110)

where we recall for a (linear) operator C : CpX q Ñ CpX q, }C}op:“maxfPCpX q}Cf}8}f}8

.

Let h P Rd be any direction. Taking Fréchet derivative w.r.t. θ on both sides of (109), we derive

Dfµθ rhs “ D1Eµpfµθ , θq“

Dfµθ rhs‰

`D2Eµpfµθ , θqrhs. (111)

Using the triangle inequality, we bound

}Dfµθ rhs ´Dfµθ rhs}8

ď }D1Eµpfµθ , θq“

Dfµθ rhs‰

´D1Eµpfθ,µ, θq“

Df µθ rhs‰

}8

` }D2Eµpfµθ , θqrhs ´D2Eµpf µθ , θqrhs}8ď }D1Eµpfµθ , θq

Dfµθ rhs‰

´D1Eµpfµθ , θq“

Df µθ rhs‰

}8 1©` }D1Eµpfθ,µ, θq

Df µθ rhs‰

´D1Eµpfθ,µ, θq“

Df µθ rhs‰

}8 2©` }D1Eµpfθ,µ, θq

Df µθ rhs‰

´D1Eµpfθ,µ, θq“

Df µθ rhs‰

}8 3©` }D2Eµpfµθ , θqrhs ´D2Eµpf µθ , θqrhs}8. 4©

(112)

The following subsections analyze 1© to 4© individually. In summary, we have

1© ď2

3}Dfµθ rhs ´Df

µθ rhs}8, (113)

and 2©, 3©, 4© are all of order Opdblpµ, µq ¨ }h}q. Therefore we conclude

1

3}Dfµθ rhs ´Df

µθ rhs}8 “ Opdblpµ, µq ¨ }h}q ñ }Dfµθ ´Df

µθ }op “ Opdblpµ, µqq. (114)

C.5.1 Bounding 1©

From the linearity of D1Eµpfµθ , θq and (110), we bound

1© “ }D1Eµpfµθ , θq“

Dfµθ rhs ´Dfµθ rhs

}8

ď }D1Eµpfµθ , θq}op}Dfµθ rhs ´Df

µθ rhs}8 ď

2

3}Dfµθ rhs ´Df

µθ rhs}8.


C.5.2 Bounding 2©

From Lemma B.8, we know that D1Bµpf, θq is Lipschitz continuous w.r.t. its first variable:

}D1Bµpf, θq ´D1Bµpf 1, θq}op “ Op}f ´ f 1}8q. (115)

Recall that Eµpf, θq “ Blµpf, θq. Using the chain rule of the Fréchet derivative, we have

D1Eµpf, θq “ D1Blµpf, θq “ D1Bµ`

Bl´1µ pf, θq, θ

˘

˝D1Bl´1µ pf, θq. (116)

Consequently, we can bound 2© in a recursive way: for any two functions f, f 1 P CpX q

}D1Blµpf, θq ´D1Blµpf 1, θq}op“ }D1Bµ

`

Bl´1µ pf, θq, θ

˘

˝D1Bl´1µ pf, θq ´D1Bµ

`

Bl´1µ pf 1, θq, θ

˘

˝D1Bl´1µ pf 1, θq}op

ď }D1Bµ`

Bl´1µ pf, θq, θ

˘

˝`

D1Bl´1µ pf, θq ´D1Bl´1

µ pf 1, θq˘

}op

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf 1, θq, θ

˘

˙

˝D1Bl´1µ pf 1, θq}op

ď }D1Bµ`

Bl´1µ pf, θq, θ

˘

}op}D1Bl´1µ pf, θq ´D1Bl´1

µ pf 1, θq}8

`Op}Bl´1µ pf, θq ´ Bl´1

µ pf 1, θq}8 ¨ }D1Bl´1µ pf 1, θq}opq

“ Op}f ´ f 1}8q ` }D1Bl´1µ pf, θq ´D1Bl´1

µ pf 1, θq}8,

where in the first inequality we use the triangle inequality, in the second inequality, we use thedefinition of } ¨ }op and (115), and in the last equality we use (115) and the fact that Bk is Lipschitzcontinuous with respect its first argument for any finite k (see Lemma B.9). Besides, since fµθ iscontinuous with respect to µ (see Lemma C.1), we have

}D1Blpfµθ , θq ´D1Blpf µθ , θq}op “ Opdblpµ, µqq. (117)

We then show that }Df µθ rhs}8 “ Op}h}8q: Using (111), we have that

}Df µθ rhs}8 ď2

3}Df µθ rhs}8 ` }D2Eµpfµθ , θqrhs}8 ñ }Df µθ rhs}8 ď 3}D2Eµpfµθ , θq}op}rhs}8.

Lemma B.7 shows that }D2Eµpfµθ , θq}op is bounded and therefore we have

}Df µθ rhs}8 “ Op}h}8q. (118)

Combining the above results, we obtain

2© ď }D1Blpfµθ , θq ´D1Blpf µθ , θq}op}Dfµθ rhs}8 “ Opdblpµ, µq ¨ }h}8q.

C.5.3 Bounding 3©

Denote ωypxq “ expp´ cpx,yqγ q exppfpxq{γq. Assume that }f}8 ď Mc and }∇f}2,8 ď Gf . Then

we have for any y P X ,

}ωy}8 ď exppMc{γq, }∇ωy}2,8 ď exppMc{γqpGc `Gf q{γ. (119)

Therefore, }ωy}bl “ maxtexppMc{γq, exppMc{γqpGc `Gf q{γu is bounded (recall the definitionof bounded Lipschitz norm in Theorem 6.1). Besides, for any y P X , ωypxq is positive and boundedaway from zero

ωypxq ě expp´2Mc{γq. (120)

For a fixed measure κ and g P CpX q, we compute that

D1Apf , κqrgs “ş

X ωypxqgpxqdκpxqş

X ωypxqdκpxq. (121)


This expression allows us to bound for two measures κ and κ1

}`

D1Apf , κq ´D1Apf , κ1q˘

rgs}8 “ }

ş

X ωypxqgpxqdκpxqş

X ωypxqdκpxq´

ş

X ωypxqgpxqdκ1pxq

ş

X ωypxqdκ1pxq

}8

ď }

ş

X ωypxqgpxqdκpxqş

X ωypxqdκpxq´

ş

X ωypxqgpxqdκpxqş

X ωypxqdκ1pxq

}8 ` }

ş

X ωypxqgpxqdκpxqş

X ωypxqdκ1pxq

´

ş

X ωypxqgpxqdκ1pxq

ş

X ωypxqdκ1pxq

}8.

We now bound these two terms individually. For the first term, we have

}

ş

X ωypxqgpxqdκpxqş

X ωypxqdκpxq´

ş

X ωypxqgpxqdκpxqş

X ωypxqdκ1pxq

}8

ď }

ż

Xωypxqgpxqdκpxq}8}

ş

X ωypxq rdκpxq ´ dκ1pxqsş

X ωypxqdκpxqş

X ωypxqdκ1pxq

}8

ď }ωy}8 ¨ }g}8 ¨ }ωypxq}bl ¨ dblpκ, κ1q ¨ expp4Mc{γq “ Op}g}8 ¨ dblpκ, κ1qq,

where we use (119) and (120) in the last equality. For the second term, we bound

}

ş

X ωypxqgpxqdκpxqş

X ωypxqdκpxq´

ş

X ωypxqgpxqdκ1pxq

ş

X ωypxqdκpxq}8 ď }

ş

X ωypxqgpxqrdκpxq ´ dκ1pxqsş

X ωypxqdκpxq}8

ď exppMc{γq ¨ }ωypxq}bl ¨ }g}bl ¨ dblpκ, κ1q “ Op}g}bl ¨ dblpκ, κ1qq.

Combining the above inequalities, we have

}`

D1Apf , κq ´D1Apf , κ1q˘

rgs}8 “ Op}g}bl ¨ dblpκ, κ1qq. (122)

Denote α “ Tθ7µ and α “ Tθ7µ. From the chain rule of the Fréchet derivative, we compute

}`

D1Bµpf, θq ´D1Bµpf, θq˘

rgs}8

“›

ˆ

D1A`

Apf, αq, βµ˘

˝D1Apf, αq ´D1A`

Apf, αq, βµ˘

˝D1Apf, αq˙

rgs›

8

ď›

›D1A`

Apf, αq, βµ˘“`

D1Apf, αq ´D1Apf, αq˘

rgs‰›

8

`›

ˆ

D1A`

Apf, αq, βµ˘

´D1A`

Apf, αq, βµ˘

˙

D1Apf, αqrgs‰›

8

`›

ˆ

D1A`

Apf, αq, βµ˘

´D1A`

Apf, αq, βµ˘

˙

D1Apf, αqrgs‰›

8.

We now bound these three terms one by one.For the first term, use (110) to derive

›D1A`

Apf, αq, βµ˘“`

D1Apf, αq ´D1Apf, αq˘

rgs‰›

8

ď }D1Apf, αqrgs ´D1Apf, αqrgs}8 “ Op}g}bl ¨ dblpα, αqq,

where we use }D1A`

Apf, αq, βµ˘

}op ď 1 (75) and (122) in the second equality.

Combining the above result with (101) gives›

›D1A`

Apf, αq, βµ˘“`

D1Apf, αq ´D1Apf, αq˘

rgs‰›

8“ Op}g}bl ¨ dblpµ, µqq.

For the second term, use (122) to derive

ˆ

D1A`

Apf, αq, βµ˘

´D1A`

Apf, αq, βµ˘

˙

D1Apf, αqrgs‰›

8

“ Op}D1Apf, αqrgs}bl ¨ dblpβµ, βµqq.

We now bound }D1Apf, αqrgs}bl. From (75), we have that }D1Apf, αqrgs}8 ď }g}8. Besides, notethat D1Apf, αqrgs is a function mapping from X to R and recall the expression of D1Apf, αqrgs in(121). To show that D1Apf, αqrgspyq is Lipschitz continuous w.r.t. y, we use the similar argumentas (63): Under Assumption B.1 and assume that }f}8 ďMc, the numerator and denominator of (63)


are both Lipschitz continuous w.r.t. y and bounded; the denominator is positive and bounded awayfrom zero. Consequently, we can bound for any y P X

}∇yD1Apf, αqrgspyq} ď 2 expp4Mc{γq}g}8 ¨Gc, (123)

and therefore›

ˆ

D1A`

Apf, αq, βµ˘

´D1A`

Apf, αq, βµ˘

˙

D1Apf, αqrgs‰›

8“ Op}g}8 ¨ dblpβµ, βµqq.

For the third term, first note that we can use (101) and the mean value theorem to bound

}Apf, αq ´Apf, αq}8 “ OpmaxyPX

}ωy}bl ¨ dblpα, αqq “ Opdblpµ, µqq. (124)

Hence, we use Lemma B.11 to derive

ˆ

D1A`

Apf, αq, βµ˘

´D1A`

Apf, αq, βµ˘

˙

D1Apf, αqrgs‰›

8

“ Op}Apf, αq ´Apf, αq}8 ¨ }D1Apf, αqrgs}8q “ Op}g}8 ¨ dblpµ, µqq,

where we use (124) and the fact that }D1Apf, αq}op is bounded in the last equality.Combing the above three results, we have

}`

D1Bµpf, θq ´D1Bµpf, θq˘

rgs}8 “ Op}g}bl ¨ dblpµ, µqq. (125)

Recall that Eµpf, θq “ Blµpf, θq. Using the chain rule of the Fréchet derivative, we have

D1Eµpf, θq “ D1Blµpf, θq “ D1Bµ`

Bl´1µ pf, θq, θ

˘

˝D1Bl´1µ pf, θq. (126)

Denote g “ Df µθ rhs. We can bound 3© in the following way:

3© “ }D1Bµ`

Bl´1µ pf, θq, θ

˘“

D1Bl´1µ pf, θqrgs

´D1Bµ`

Bl´1µ pf, θq, θ

˘

rD1Bl´1µ pf, θqrgss}8

ď }D1Bµ`

Bl´1µ pf, θq, θ

˘“`

D1Bl´1µ pf, θq ´D1Bl´1

µ pf, θq˘

rgs‰

}8

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf, θq, θ

˘

˙

D1Bl´1µ pf, θqrgs

}8

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf, θq, θ

˘

˙

D1Bl´1µ pf, θqrgs

}8

ď }D1Bµ`

Bl´1µ pf, θq, θ

˘

}op}`

D1Bl´1µ pf, θq ´D1Bl´1

µ pf, θq˘

rgs}8 #1

`Op}Bl´1µ pf, θq ´ Bl´1

µ pf, θq}8 ¨ }D1Bl´1µ pf, θqrgs}8q #2

`Op}D1Bl´1µ pf, θqrgs}bl ¨ dblpµ, µqq, #3

where in the first inequality we use the triangle inequality, in the second inequality we use thedefinition of } ¨ }op, (115) and (125). We now analyze the R.H.S. of the above inequality one by one.For the first term, use }D1Bµ

`

Bl´1µ pf, θq, θ

˘

}op ď 1 and then use (125). We have

#1 ď }`

D1Bµpf, θq ´D1Bµpf, θq˘

rgs}8 “ Op}g}bl ¨ dblpµ, µqq.

For the second term, note that Bkµ is the composition of the terms Apf, αq and Apf, βµq. Using asimilar argument like (124), for any finite k, we have

}Bl´1µ pf, θq ´ Bl´1

µ pf, θq}8 “ Opdblpµ, µqq.

Together with the fact that }D1Bpf, θq}op ď 1, we have

#2 “ Op}g}8 ¨ dblpµ, µqq.

Finally, for the third term, note that Bµ is the composition of the terms Apf, αq and Apf, βµq. Usinga similar argument like (123) to bound

#3 “ Op}g}8 ¨ dblpµ, µqq.


Combining these three results, we have

3© “ }`

D1Blµpf, θq ´D1Blµpf, θq˘

rgs}8 “ Op}g}bl ¨ dblpµ, µqq. (127)

We now bound }Df µθ rhs}bl (g “ Df µθ rhs). From the fixed point definition of the Sinkhorn potentialin (107), we can compute the Fréchet derivative Dfµθ by

Dfµθ “ D1A`

Apfµθ , αθq, βµ˘

˝D1Apfµθ , αθq˝Dfµθ `D1A

`

Apfµθ , αθq, βµ˘

˝D2Apfµθ , θq, (128)

where we recall Apf, θq:“Apf, αθq. For any direction h P Rd and any y P X , Dfµθ rhs is a functionwith its gradient bounded by

}∇yDfµθ rhspyq} ď }∇yˆ

D1A`

Apfµθ , αθq, βµ˘

D1Apfµθ , αθq“

Dfµθ rhs‰

˙

pyq} #1

`}∇y´

D1A`

Apfµθ , αθq, βµ˘“

D2Apfµθ , θqrhs‰

¯

pyq}. #2

We now bound the R.H.S. individually:For #1, take f “ Arf, αθs, κ “ βµ and g “ D1Apfµθ , αθq

Dfµθ rhs‰

in (121). Using (123) and(118), we have

#1 “ Op}g}8q “ Op}Dfµθ rhs}8q “ Op}h}q. (129)

For #2, take f “ Arf, αθs, κ “ βµ and g “ D2Apfµθ , θqrhs in (121). Using (123) and (79), wehave

#2 “ Op}g}8q “ Op}D2Apfµθ , θqrhs}8q “ Op}h}q. (130)

Combining these two bounds, we have

}Dfµθ rhs}bl “ Op}h}q. (131)

By plugging the above result to (127), we bound

3© “ }`

D1Blµpf, θq ´D1Blµpf, θq˘

rgs}8 “ Opdblpµ, µq ¨ }h}q. (132)

C.5.4 Bounding 4©

We have from the triangle inequality

4© ď }D2Eµpfµθ , θqrhs ´D2Eµpfµθ , θqrhs}8 ` }D2Eµpfµθ , θqrhs ´D2Eµpf µθ , θqrhs}8. (133)

We analyze these two terms on the R.H.S..

For the first term of (133), use the chain rule of Fréchet derivative to compute

D2Eµpf, θqrhs “ D1Bµ`

Bl´1µ pf, θq, θ

˘“

D2Bl´1µ pf, θqrhs

`D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs. (134)

Consequently, we can bound

}`

D2Eµpf, θq ´D2Eµpf, θq˘

rhs}8

ď}D1Bµ`

Bl´1µ pf, θq, θ

˘“

D2Bl´1µ pf, θqrhs

´D1Bµ`

Bl´1µ pf, θq, θ

˘“

D2Bl´1µ pf, θqrhs

}8 #1

` }D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs ´D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs}8. #2

We analyze #1 and #2 individually.

Bounding #1. We first note that Apf, αq is Lipschitz continuous w.r.t. α (see also (124)):

}Apf, αq ´Apf, α1q}8 ď expp2Mc{γq ¨ }ωy}bl ¨ dblpα, α1q “ Opdblpα, α1qq, (135)

where in the equality we use (119). As Bkµ is the composition of A, it is Lipschitz continuous withrespect to µ for finite k. Note that the boundedness of }f}8 and }∇f}8 remains valid after the


operator B (Lemma B.1 and (i) of Lemma (B.2)). We then bound

#1 ď }D1Bµ`

Bl´1µ pf, θq, θ

˘“`

D2Bl´1µ pf, θq ´D2Bl´1

µ pf, θq˘

rhs‰

}8

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf, θq, θ

˘

˙

D2Bl´1µ pf, θqrhs

}8

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf, θq, θ

˘

˙

D2Bl´1µ pf, θqrhs

}8

ď }D1Bµ`

Bl´1µ pf, θq, θ

˘

}op}D2Bl´1µ pf, θqrhs ´D2Bl´1

µ pf, θqrhs}8

`Op}Bl´1µ pf, θq ´ Bl´1

µ pf, θq}8 ¨ }D2Bl´1µ pf, θqrhs}8q

`Opdblpµ, µq ¨ }D2Bl´1µ pf, θqrhs}8q

ď }D2Bl´1µ pf, θqrhs ´D2Bl´1

µ pf, θqrhs}8 `Opdblpµ, µq ¨ }h}q,where in the second inequality we use the definition of } ¨ }op, (115) and (125), and in the lastinequality we use the fact that }D1Bµpf, θq}op ď 1, Bkµ is Lipschitz continuous with respect to µ forfinite k (see the discussion above) and that }D2Bl´1

µ pf, θq}op is bounded (see Lemma B.7.

Bounding #2. To make the dependences of A on θ and µ explicit, we denote

Apf, θ, µq “ Apf, Tθ7µq.To bound the second term, we first establish that for any k ě 0,∇Bk`1

µ pf, θq is Lipschitz continuousw.r.t. µ, i.e.

}∇Bk`1µ pf, θq ´∇Bk`1

µ pf, θq}2,8 “ Opdblpµ, µqq, (136)

as follows: First note that∇Apf, θ, µq is Lipschitz continuous w.r.t. µ, i.e.

}∇Apf, θ, µqpyq ´∇Apf, θ, µqpyq} “ Opdblpµ, µqq. (137)

This is because for any y P X (note that Apf, θ, µqp¨q : X Ñ R is a function of y),

}∇Apf, θ, µqpyq ´∇Apf, θ, µqpyq}

“ }

ş

X ωypxq∇1cpy, xqdαθpxqş

X ωypxqdαθpxq´

ş

X ωypxq∇1cpy, xqdαθpxqş

X ωypxqdαθpxq}

ď }

ş

X ωypxq∇1cpy, xq`

dαθpxq ´ dαθpxq˘

ş

X ωypxqdαθpxq}

` }

ż

Xωypxq∇1cpy, xqdαθpxq} ¨ }

ş

X ωypxq`

dαθpxq ´ dαθpxq˘

ş

X ωypxqdαθpxqş

X ωypxqdαθpxq}

“ Opdblpµ, µqq.Here in the last equality, we use the facts that }ωyp¨q∇1cpy, ¨q}bl and }ωy}bl are bounded, andş

X ωypxqdαθpxq is strictly positive and bounded away from zero. Recall that Bµpf, θq “ApApf, θ, µq, βµq. We can then prove (136) by bounding

}∇Bk`1µ pf, θq ´∇Bk`1

µ pf, θq}

“ }∇ApApBkµpf, θq, θ, µq, βµq ´∇ApApBkµpf, θq, θ, µq, βµq}

ď }∇ApApBkµpf, θq, θ, µq, βµq ´∇ApApBkµpf, θq, θ, µq, βµq} &1

` }∇ApApBkµpf, θq, θ, µq, βµq ´∇ApApBkµpf, θq, θ, µq, βµq} &2

` }∇ApApBkµpf, θq, θ, µq, βµq ´∇ApApBkµpf, θq, θ, µq, βµq} &3

“ Opdblpµ, µqqHere we bound &1 using (137), the Lipschitz continuity of∇A w.r.t. its second variable; we bound&2 using the Lipschitz continuity of∇A w.r.t. its first variable and (124), the Lipschitz continuity ofA w.r.t. µ; we bound &3 using (124), the Lipschitz continuity of A w.r.t. µ, and the fact that Bkµ isthe composition of the terms Apf, αq and Apf, βµq.We then establish that D2Bµpf, θq is Lipschitz continuous w.r.t. µ.


Assumption C.1. $\|\nabla_z[\nabla_\theta T_\theta(z)]\|_{op}$ is bounded.

Lemma C.12. Assume that $\|f\|_\infty \le M_c$, $\|\nabla f\|_{2,\infty} \le G_f$, and $\|\nabla^2 f\|_{op,\infty} \le L_f$. Under Assumptions B.5, C.1 and B.1, we have
$$\|D_2\mathcal{B}_\mu(f, \theta) - D_2\mathcal{B}_{\bar\mu}(f, \theta)\|_{op} = O(d_{bl}(\mu, \bar\mu)). \tag{138}$$

¯

and

φypzq “ r∇θTθpzqsJ r´∇1cpTθpzq, yq `∇fpTθpzqqswhere ∇θTθpzq denotes the Jacobian matrix of Tθpzq with respect to θ.

The Fréchet derivative D2Apf, θ, µqrhs can be computed by

D2Apf, θ, µqrhs “ş

X ωy`

Tθpzq˘

xφypzq, hydµpzqş

X ωy`

Tθpzq˘

dµpzq. (139)

Recall that }f}8 ďMc, }∇f}2,8 ď Gf . Using the above expression we can bound

}`

D2Apf, θ, µq ´D2Apf, θ, µq˘

rhs}8

“›

ş

X ωy`

Tθpzq˘

xφypzq, hydµpzqş

X ωy`

Tθpzq˘

dµpzq´

ş

X ωy`

Tθpzq˘

xφypzq, hydµpxqş

X ωy`

Tθpzq˘

dµpxq

8

ď›

ş

X ωy`

Tθpzq˘

xφypzq, hydµpzqş

X ωy`

Tθpzq˘

dµpzq´

ş

X ωy`

Tθpzq˘

xφypzq, hydµpxqş

X ωy`

Tθpzq˘

dµpzq

8

`›

ş

X ωy`

Tθpzq˘

xφypzq, hydµpxqş

X ωy`

Tθpzq˘

dµpzq´

ş

X ωy`

Tθpzq˘

xφypzq, hydµpxqş

X ωy`

Tθpzq˘

dµpxq

8

“›

ş

X ωy`

Tθpzq˘

xφypzq, hy rdµpzq ´ dµpxqsş

X ωy`

Tθpzq˘

dµpzq

8

`›

ş

X ωy`

Tθpzq˘

xφypzq, hydµpxqş

X ωy`

Tθpzq˘

rdµpxq ´ dµpzqsş

X ωy`

Tθpzq˘

dµpzqş

X ωy`

Tθpzq˘

dµpxq

8

ď expp2Mc{γq ¨ }ωy`

Tθpzq˘

xφypzq, hy}bl ¨ dblpµ, µq

` expp5Mc{γq ¨ }φy}8 ¨ }h}8 ¨ }ωy}bl ¨ dblpµ, µq.

For the first term, note that }ωy`

Tθpzq˘

xφypzq, hy}bl ď }ωy}bl ¨ }φy}bl ¨ }h}8 and }ωy}bl is bounded(see (119)). We just need to bound }φy}bl. Under Assumption B.5 that }∇θTθpzq}op ď GT , weclearly have that }φy}8 is bounded. For }φy}lip, compute that

∇zφypzq “ ∇zr∇θTθpzqs ˆ1 r´∇1cpTθpzq, yq `∇fpTθpzqqs`∇θTθpzqJ

´∇211cpTθpzq, yq `∇2fpTθpzqq

∇θTθpzq.

Recall that }∇2fpxq}op is bounded. Consequently, under Assumption C.1, we can see that }∇zφypzq}is bounded. Together, }φy}bl is bounded. As a result, we have

}`

D2Apf, θ, µq ´D2Apf, θ, µq˘

rhs}8 “ Opdblpµ, µq ¨ }h}q. (140)

Based on the above result, we can further bound

}`

D2Bµpf, θq ´D2Bµpf, θq˘

rhs}8

“ }

ˆ

D1A`

Apf, θ, µq, β˘

˝D2Apf, θ, µq ´D1A`

Apf, θ, µq, β˘

˝D2Apf, θ, µq˙

rhs}8

ď }D1A`

Apf, θ, µq, β˘“`

D2Apf, θ, µq ´D2Apf, θ, µq˘

rhs‰

}8 ##1

` }

ˆ

D1A`

Apf, θ, µq, β˘

´D1A`

Apf, θ, µq, β˘

˙

D2Apf, θ, µqrhs‰

}8 ##2

` }

ˆ

D1A`

Apf, θ, µq, β˘

´D1A`

Apf, θ, µq, β˘

˙

D2Apf, θ, µqrhs‰

}8. ##3


For the first term, use }D1A`

Apf, θ, µq, β˘

}op ď 1 (75) and (140) to bound

##1 ď }D2Apf, θ, µqrhs ´D2Apf, θ, µqrhs}8 “ Opdblpµ, µq ¨ }h}q.

For the second term, recall the expression of D2Apf, θ, µqrhs in (139). Under Assumption B.1 andassume that }f}8 ď Mc, one can see that }D2Apf, θ, µqrhs}bl “ Op}h}q. Further, use (122) anddblpβ, βq “ O

`

dblpµ, µq˘

from (101) to bound

##2 “ Op}D2Apf, θ, µqrhs}bl ¨ dblpβ, βqq “ Op}h} ¨ dblpµ, µqq.

For the third term, use Lemma B.11 to bound

##3 “ Op}D2Apf, θ, µqrhs}8 ¨ }Apf, θ, µq ´ Apf, θ, µq}8q “ Opdblpµ, µq ¨ }h}q,

where we use }D2Apf, θ, µqrhs}8 “ Op}h}q and (124). Altogether, we have

}D2Bµpf, θqrhs ´D2Bµpf, θqrhs}8 “ Opdblpµ, µq ¨ }h}q. (141)

We are now ready to bound #2.

#2 ď }D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs ´D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs}8

` }D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs ´D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs}8

“ Op}Bl´1µ pf, θq ´ Bl´1

µ pf, θq}8 ` }∇Bl´1µ pf, θq ´∇Bl´1

µ pf, θq}2,8q

`Opdblpµ, µq ¨ }h}q“ Opdblpµ, µq ¨ }h}q,

where we use Lemma B.10 and (138) (124) in the first equality.

Combining #1 and #2. Combining the above results, we yield

}D2Blµpf, θqrhs´D2Blµpf, θqrhs}8 ď }D2Bl´1µ pf, θqrhs´D2Bl´1

µ pf, θqrhs}8`Opdblpµ, µq¨}h}8q,

which, via recursion, implies that (recall that D2Eµpf, θqrhs “ D2Blµpf, θqrhs)

}D2Eµpf, θqrhs ´D2Eµpf, θqrhs}8 “ Opdblpµ, µq ¨ }h}q. (142)

To bound the second term of (133), compute the expression of D2Eµpf, θqrhs via the chain rule:

D2Eµpf, θqrhs “ D1Bµ`

Bl´1µ pf, θq, θ

˘“

D2Bl´1µ pf, θqrhs

`D2Bµ`

Bl´1µ pf, θq, θ

˘

rhs. (143)

Recall that Eµpf, θq “ Blµpf, θq. We then show in an inductive manner that the second term of (133)is of order Opdblpµ, µq ¨ }h}q: For any finite k ě 1,

}D2Bkµpfµθ , θqrhs ´D2Bkµpf

µθ , θqrhs}8 “ Opdblpµ, µq ¨ }h}q. (144)

For the base case when l “ 1, we only have the second term of (143) inD2Eµpf, θqrhs. Consequently,from Lemma B.10, we have

}D2Bµ`

Bl´1µ pfµθ , θq, θ

˘

´D2Bµ`

Bl´1µ pf µθ , θq, θ

˘

}op

“ Op}Bl´1µ pfµθ , θq ´ B

l´1µ pf µθ , θq}8 ` }∇B

l´1µ pfµθ , θq ´∇B

l´1µ pf µθ , θq}2,8q “ Opdblpµ, µqq,

(145)


where we use (136) in the second equality.Now assume that for l “ k the statement (144) holds. For any two function f, f 1 P CpX q, we bound

}D2Bkµpf, θqrhs ´D2Bkµpf 1, θqrhs}8ď }D1Bµ

`

Bl´1µ pf, θq, θ

˘“

D2Bl´1µ pf, θqrhs ´D2Bl´1

µ pf 1, θqrhs‰

}8

` }

ˆ

D1Bµ`

Bl´1µ pf, θq, θ

˘

´D1Bµ`

Bl´1µ pf 1, θq, θ

˘

˙

D2Bl´1µ pf 1, θqrhs

}8

` }

ˆ

D2Bµ`

Bl´1µ pf, θq, θ

˘

´D2Bµ`

Bl´1µ pf 1, θq, θ

˘

˙

rhs}8.

ď }`

D2Bl´1µ pf, θq ´D2Bl´1

µ pf 1, θq˘

rhs}8 }D1Bµpf, θq}op ď 1

`Op}Bl´1µ pf, θq ´ Bl´1

µ pf 1, θq}8 ¨ }D2Bl´1µ pf 1, θqrhs}8q Lemma B.8

`Opdblpµ, µq ¨ }h}q. (145)

“ Opp}f ´ f 1}8 ` }∇f ´∇f 1}2,8q ¨ }h}q Lemma B.5

Opp}f ´ f 1}8q ¨ }h}qOpdblpµ, µq ¨ }h}q.

Plug in f “ fµθ and f 1 “ f µθ and use Lemmas C.1 and C.2. We prove the statement (144) holds forl “ k ` 1. Consequently, we have that

}D2Eµpfµθ , θqrhs ´D2Eµpf µθ , θqrhs}8 “ Opdblpµ, µq ¨ }h}q. (146)

In conclusion, we have4© “ Opdblpµ, µq ¨ }h}q. (147)


Table 1: Structure of the encoder

Layer (type)        Output Shape        Param #
Conv2d-1            [-1, 64, 32, 32]    4,800
LeakyReLU-2         [-1, 64, 32, 32]    0
Conv2d-3            [-1, 128, 16, 16]   204,800
BatchNorm2d-4       [-1, 128, 16, 16]   256
LeakyReLU-5         [-1, 128, 16, 16]   0
Conv2d-6            [-1, 256, 8, 8]     819,200
BatchNorm2d-7       [-1, 256, 8, 8]     512
LeakyReLU-8         [-1, 256, 8, 8]     0
Conv2d-9            [-1, 512, 4, 4]     3,276,800
BatchNorm2d-10      [-1, 512, 4, 4]     1,024
LeakyReLU-11        [-1, 512, 4, 4]     0

Table 2: Structure of the generator

Layer (type)        Output Shape        Param #
ConvTranspose2d-1   [-1, 256, 4, 4]     262,144
BatchNorm2d-2       [-1, 256, 4, 4]     512
ReLU-3              [-1, 256, 4, 4]     0
ConvTranspose2d-4   [-1, 128, 8, 8]     524,288
BatchNorm2d-5       [-1, 128, 8, 8]     256
ReLU-6              [-1, 128, 8, 8]     0
ConvTranspose2d-7   [-1, 64, 16, 16]    131,072
BatchNorm2d-8       [-1, 64, 16, 16]    128
ReLU-9              [-1, 64, 16, 16]    0
ConvTranspose2d-10  [-1, 3, 32, 32]     3,072
Tanh-11             [-1, 3, 32, 32]     0

D Experiment Details

We use the generator from DC-GAN Radford et al. [2015] and an adversarial ground cost $c_\xi$ of the form
$$c_\xi(x, y) = \|\phi_\xi(x) - \phi_\xi(y)\|_2^2, \tag{148}$$
where $\phi_\xi : \mathbb{R}^q \to \mathbb{R}^{\bar q}$ is an encoder that maps the original data point (and the generated image) to a higher-dimensional space ($\bar q > q$). We pick $\phi_\xi$ to be a CNN with a structure similar to the discriminator of DC-GAN, except that we discard the last layer, which was used for classification. Specifically, the networks used are given in Tables 1 and 2; a sketch of this cost in code follows below.
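The following is a minimal sketch of the encoder $\phi_\xi$ and the cost (148) in PyTorch. The kernel size, strides, padding, bias flags, and the LeakyReLU slope are inferred from the parameter counts and output shapes in Table 1 together with the DC-GAN convention; treat them as assumptions rather than the exact released configuration (see the repository linked below for the authors' code).

```python
import torch
import torch.nn as nn

# Encoder phi_xi: layer types and shapes follow Table 1; kernel/stride/padding are inferred.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2, bias=False),    # 4,800 params
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2, bias=False),  # 204,800 params
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=5, stride=2, padding=2, bias=False),  # 819,200 params
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Conv2d(256, 512, kernel_size=5, stride=2, padding=2, bias=False),  # 3,276,800 params
    nn.BatchNorm2d(512),
    nn.LeakyReLU(0.2),
)

def ground_cost(x, y):
    """Pairwise adversarial cost c_xi(x, y) = ||phi_xi(x) - phi_xi(y)||_2^2, cf. (148)."""
    fx = encoder(x).flatten(start_dim=1)    # (n, q_bar) encoded real/generated batch
    fy = encoder(y).flatten(start_dim=1)    # (m, q_bar)
    return torch.cdist(fx, fy, p=2) ** 2    # (n, m) cost matrix
```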

We set the step size $\beta$ of SiNG to 30 and set the maximum allowed Sinkhorn divergence in each iteration to 0.1. Note that the step size is set after the normalization in (11). For Adam, RMSprop, and AMSgrad, we set all of their initial step sizes to $1.0 \times 10^{-3}$, which is generally recommended by the GAN literature. The minibatch sizes of both the real images and the generated images are set to 3000 for each iteration. We uniformly set the $\gamma$ parameter in the objective (recall that $\mathcal{F}(\alpha_\theta) = S_{c_\xi}(\alpha_\theta, \beta)$) and in the constraint to 100.

The code is available at https://github.com/shenzebang/Sinkhorn_Natural_Gradient.


E PyTorch Implementation

In this section, we focus on the empirical version of SiNG, where we approximate the gradient of the function $\mathcal{F}$ by a minibatch stochastic gradient and approximate SIM by eSIM. In this case, all components involved in the optimization procedure can be represented by finite-dimensional vectors.

It is known that the stochastic gradient admits an easy implementation in PyTorch. However, at first sight, the computation of eSIM appears complicated, as it requires constructing two sequences $f^t$ and $g^t$ to estimate the Sinkhorn potential and its Fréchet derivative. As we discussed earlier, it is well known that we can solve a linear system with a p.s.d. matrix via the Conjugate Gradient (CG) method using only matrix-vector products. In particular, in this case we no longer need to explicitly form eSIM in memory. Consequently, to implement the empirical version of SiNG using CG and eSIM, one can resort to the auto-differentiation mechanism provided by PyTorch: first, we use an existing PyTorch package such as geomloss³ to compute the tensor f representing the Sinkhorn potential $f^\varepsilon_\theta$. Note that the sequence $f^t$ is constructed implicitly by calling geomloss. We then use the ".detach()" function in PyTorch to keep only the value of f while discarding all of its "grad_fn" entries. We then enable the "autograd" mechanism in PyTorch and run several loops of the Sinkhorn mapping $\mathcal{A}(f, \alpha_\theta)$ ($\mathcal{A}(f, \alpha_{\theta_t})$) so that the output tensor records all the dependence on the parameter $\theta$ via the implicitly constructed computational graph. We can then easily compute the matrix-vector product using Pearlmutter's algorithm (Pearlmutter, 1994).

3https://www.kernel-operations.io/geomloss/
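A minimal sketch of the two generic ingredients used here, a double-backward (Pearlmutter) matrix-vector product and a matrix-free CG solve, is given below. The names loss_fn, sinkhorn_divergence_wrt_theta, theta_flat, and damping are hypothetical placeholders; the actual eSIM matrix-vector product is obtained by differentiating the re-run Sinkhorn maps as described above.

```python
import torch

def hvp(loss_fn, params, v):
    """Pearlmutter-style Hessian-vector product of a scalar loss_fn(params) with a
    flat vector v, via double backward. In SiNG the scalar would be the Sinkhorn
    divergence rebuilt from a few A(f, alpha_theta) maps on a detached potential."""
    loss = loss_fn(params)
    grad = torch.autograd.grad(loss, params, create_graph=True)[0]
    return torch.autograd.grad(grad, params, grad_outputs=v)[0]

def conjugate_gradient(matvec, b, n_iters=20, tol=1e-8):
    """Solve M x = b for p.s.d. M using only the matrix-vector product `matvec`."""
    x = torch.zeros_like(b)
    r = b - matvec(x)
    p = r.clone()
    rs = r.dot(r)
    for _ in range(n_iters):
        Mp = matvec(p)
        alpha = rs / p.dot(Mp)
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical usage for the natural-gradient direction: solve eSIM * d = grad_F with
#   matvec = lambda v: hvp(sinkhorn_divergence_wrt_theta, theta_flat, v) + damping * v
```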


