Optimal Transport vs. Fisher-Rao distance between Copulas


IEEE SSP 2016

G. Marti, S. Andler, F. Nielsen, P. Donnat

June 28, 2016



Clustering of Time Series

We need a distance $D_{ij}$ between time series $x_i$ and $x_j$.

If we look for ‘correlation’, $D_{ij}$ is a decreasing function of $\rho_{ij}$, a measure of ‘correlation’.

Several choices are available for $\rho_{ij}$ ...
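
As a rough sketch of the choices alluded to above, the snippet below computes three common correlation measures with SciPy and maps each one to a distance. The mapping D = sqrt(2(1 - rho)) and the helper name `correlation_distance` are assumptions made for illustration; the slides do not say which decreasing function is used.

```python
import numpy as np
from scipy import stats

def correlation_distance(rho):
    """Hypothetical mapping: one decreasing function of rho among many."""
    # Clip at zero so rounding errors near rho = 1 cannot produce NaN.
    return float(np.sqrt(np.maximum(2.0 * (1.0 - rho), 0.0)))

# Toy pair of series: y is a monotone but nonlinear function of x.
x = np.linspace(0.0, 3.0, 200)
y = np.exp(x)

rho = {
    "pearson": stats.pearsonr(x, y)[0],     # linear correlation
    "spearman": stats.spearmanr(x, y)[0],   # rank correlation
    "kendall": stats.kendalltau(x, y)[0],   # concordance-based correlation
}
dist = {name: correlation_distance(r) for name, r in rho.items()}

for name in rho:
    print(f"{name:8s} rho = {rho[name]:.3f}  D = {dist[name]:.3f}")
```

On this toy pair the rank-based measures see a perfect monotone dependence while Pearson's coefficient does not, so the induced distances differ.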



Copulas

Sklar’s Theorem:

$$F(x_i, x_j) = C_{ij}(F_i(x_i), F_j(x_j))$$

$C_{ij}$, the copula, encodes the dependence structure.

Fréchet-Hoeffding bounds:

$$\max\{u_i + u_j - 1, 0\} \le C_{ij}(u_i, u_j) \le \min\{u_i, u_j\}$$

(left) lower-bound, (mid) independence, (right) upper-bound copulas
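
The bounds can be checked numerically on a grid. This is a minimal sketch; the function names are illustrative and not taken from the slides.

```python
import numpy as np

def lower_bound(u, v):
    """Frechet-Hoeffding lower bound (counter-monotonic copula)."""
    return np.maximum(u + v - 1.0, 0.0)

def upper_bound(u, v):
    """Frechet-Hoeffding upper bound (comonotonic copula)."""
    return np.minimum(u, v)

def independence(u, v):
    """Independence copula C(u, v) = u * v."""
    return u * v

grid = np.linspace(0.0, 1.0, 101)
u, v = np.meshgrid(grid, grid)

tol = 1e-12  # slack for floating-point rounding
inside = (lower_bound(u, v) <= independence(u, v) + tol) & (
    independence(u, v) <= upper_bound(u, v) + tol
)
print("independence copula lies within the bounds:", bool(inside.all()))
```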



Copulas - Gaussian Example

Gaussian copula: $C^{\mathrm{Gauss}}_R(u_i, u_j) = \Phi_R(\Phi^{-1}(u_i), \Phi^{-1}(u_j))$

The distribution is parametrized by a correlation matrix R.
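
A minimal sketch of the bivariate case, assuming R = [[1, rho], [rho, 1]]. The helper names `gaussian_copula_cdf` and `sample_gaussian_copula` are hypothetical; the CDF is evaluated numerically by SciPy, so values are approximate.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def gaussian_copula_cdf(u, v, rho):
    """C_R(u, v) = Phi_R(Phi^{-1}(u), Phi^{-1}(v)) with R = [[1, rho], [rho, 1]]."""
    z = np.array([norm.ppf(u), norm.ppf(v)])
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return float(multivariate_normal(mean=np.zeros(2), cov=cov).cdf(z))

def sample_gaussian_copula(n, rho, seed=0):
    """Draw n points whose dependence is Gaussian and whose marginals are U([0, 1])."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    return norm.cdf(z)  # probability integral transform of each coordinate

c_indep = gaussian_copula_cdf(0.3, 0.7, rho=0.0)  # close to 0.3 * 0.7
c_dep = gaussian_copula_cdf(0.3, 0.7, rho=0.9)    # closer to min(0.3, 0.7)
samples = sample_gaussian_copula(5000, rho=0.9)
print(c_indep, c_dep, samples.shape)
```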



The Target/Forget (copula-based) Dependence Coefficient

Dependence is measured as the relative distance from independence to the nearest target-dependence: comonotonicity or counter-monotonicity.

Which distances are appropriate between copulas for the task of clustering (copulas and time series)?



Definitions - Fisher-Rao geodesic distance

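Metrization of the parameter space $\{\theta \in \mathbb{R}^d \mid \int p(x; \theta)\,dx = 1\}$.

Consider the metric

$$g_{jk}(\theta) = -\int \frac{\partial^2 \log p(x; \theta)}{\partial\theta_j \, \partial\theta_k} \, p(x; \theta)\,dx,$$

the infinitesimal length

$$ds(\theta) = \sqrt{(\nabla\theta)^\top G(\theta) \nabla\theta},$$

the Fisher-Rao geodesic distance

$$FR(\theta_1, \theta_2) = \int_{\theta_1}^{\theta_2} ds(\theta).$$

f-divergences induce an infinitesimal length proportional to the Fisher-Rao infinitesimal length:

$$D_f(\theta \,\|\, \theta + d\theta) = \frac{1}{2} (\nabla\theta)^\top G(\theta) \nabla\theta.$$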

Thus, they have the same local behaviour [1].
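
To make the local relation concrete, the sketch below works in a one-parameter family chosen here as an assumption: zero-mean bivariate Gaussians whose covariance is the correlation matrix [[1, theta], [theta, 1]]. It recovers G(theta) from the curvature of the Kullback-Leibler divergence (one particular f-divergence) and integrates sqrt(G) to get the Fisher-Rao length along that family. All function names are illustrative.

```python
import numpy as np

def corr(theta):
    return np.array([[1.0, theta], [theta, 1.0]])

def kl_gaussian(theta_p, theta_q):
    """KL(N(0, C(theta_p)) || N(0, C(theta_q))) for 2x2 correlation matrices."""
    cp, cq = corr(theta_p), corr(theta_q)
    trace_term = np.trace(np.linalg.solve(cq, cp))
    logdet_term = np.log(np.linalg.det(cq) / np.linalg.det(cp))
    return 0.5 * (trace_term - 2.0 + logdet_term)

def fisher_information(theta, h=1e-4):
    # KL(theta || theta + d) ~ 0.5 * G(theta) * d^2, so a central second
    # difference of the KL divergence in its second argument recovers G(theta).
    return (kl_gaussian(theta, theta + h) + kl_gaussian(theta, theta - h)) / h**2

def fisher_rao_length(theta1, theta2, n=2001):
    """Integrate ds = sqrt(G(theta)) d(theta) with the trapezoid rule."""
    grid = np.linspace(theta1, theta2, n)
    ds = np.sqrt([fisher_information(t) for t in grid])
    return float(np.sum(0.5 * (ds[1:] + ds[:-1]) * np.diff(grid)))

theta, d = 0.5, 1e-3
print("KL(theta || theta + d):", kl_gaussian(theta, theta + d))
print("0.5 * G(theta) * d^2  :", 0.5 * fisher_information(theta) * d**2)
print("Fisher-Rao length from 0.1 to 0.9:", fisher_rao_length(0.1, 0.9))
```

The finite-difference step assumes theta stays well inside (-1, 1); close to perfect correlation the matrices become nearly singular and the estimate degrades.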



Definitions - Optimal Transport distances

Wasserstein metric

$$W_p(\mu, \nu)^p = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y)$$

Image from Optimal Transport for Image Processing, Papadakis

Other transportation distances: regularized discrete optimal transport [3], Sinkhorn distances [2], ...
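
For discrete measures, an entropy-regularized transport plan can be obtained with Sinkhorn's matrix-scaling iterations. The following is a minimal NumPy sketch in the spirit of [2], not a reproduction of that paper's code; the regularization strength `reg`, the iteration count and the toy histograms are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iter=1000):
    """Entropy-regularized OT between histograms a and b for a given cost matrix."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)   # rescale to match the column marginal b
        u = a / (kernel @ v)     # rescale to match the row marginal a
    plan = u[:, None] * kernel * v[None, :]
    return plan, float(np.sum(plan * cost))

# Two bump-shaped histograms on a regular grid of [0, 1].
x = np.linspace(0.0, 1.0, 50)
cost = (x[:, None] - x[None, :]) ** 2          # squared Euclidean ground cost
a = np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2)
b = np.exp(-0.5 * ((x - 0.7) / 0.1) ** 2)
a, b = a / a.sum(), b / b.sum()

plan, value = sinkhorn(a, b, cost)
print("regularized transport cost:", value)
```

The returned cost is slightly above the unregularized squared W2 (about 0.4 squared for this shift) because the entropic term blurs the plan; smaller `reg` tightens it at the price of slower convergence and possible underflow in `kernel`.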



Geometry of covariances



Distances between Gaussian copulas

Copulas $C_1$, $C_2$, $C_3$ encoding a correlation of 0.5, 0.99, 0.9999 respectively. Which pair of copulas is the nearest?
- For Fisher-Rao, Kullback-Leibler, Hellinger and related divergences: $D(C_1, C_2) \le D(C_2, C_3)$;
- For Wasserstein: $W_2(C_2, C_3) \le W_2(C_1, C_2)$.
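
These two orderings can be reproduced with standard closed forms for zero-mean Gaussians. The sketch below assumes each copula is represented by N(0, R) with R its correlation matrix, uses the Fisher-Rao distance on the full covariance manifold, sqrt(0.5 * sum_i log^2(lambda_i)) with lambda_i the eigenvalues of R1^{-1} R2, and the Gaussian W2 formula. Whether the slides use exactly these expressions is an assumption.

```python
import numpy as np

def corr(rho):
    return np.array([[1.0, rho], [rho, 1.0]])

def spd_sqrt(a):
    """Symmetric square root of a positive semi-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def wasserstein2(s1, s2):
    """W2 between N(0, s1) and N(0, s2)."""
    root = spd_sqrt(s1)
    cross = spd_sqrt(root @ s2 @ root)
    return float(np.sqrt(max(np.trace(s1 + s2 - 2.0 * cross), 0.0)))

def fisher_rao(s1, s2):
    """Fisher-Rao distance between N(0, s1) and N(0, s2) on the covariance manifold."""
    lam = np.linalg.eigvals(np.linalg.solve(s1, s2)).real  # eigenvalues of s1^{-1} s2
    return float(np.sqrt(0.5 * np.sum(np.log(lam) ** 2)))

c1, c2, c3 = corr(0.5), corr(0.99), corr(0.9999)

print("Fisher-Rao:  D(C1,C2) =", fisher_rao(c1, c2), " D(C2,C3) =", fisher_rao(c2, c3))
print("Wasserstein: W2(C1,C2) =", wasserstein2(c1, c2), " W2(C2,C3) =", wasserstein2(c2, c3))
```

Under these assumptions Fisher-Rao places C2 and C3 further apart than C1 and C2, while W2 does the opposite.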



Distances as a function of (ρ1, ρ2)

Distance heatmap and surface as a function of $(\rho_1, \rho_2)$, for Fisher-Rao and for Wasserstein $W_2$



Distances impact on clustering

Datasets of bivariate time series are generated from six Gaussian copulas with correlation .1, .2, .6, .7, .99, .9999

Distance heatmaps for Fisher-Rao (left), $W_2$ (right). Using Ward clustering, Fisher-Rao yields clusters of copulas with correlations {.1, .2, .6, .7}, {.99}, {.9999}; $W_2$ yields {.1, .2}, {.6, .7}, {.99, .9999}
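
A sketch of this experiment working directly on the six correlation values rather than on generated time series. It assumes the copulas are compared through closed-form distances between zero-mean Gaussians N(0, R): since a 2x2 correlation matrix has eigenvalues 1 + rho and 1 - rho, both distances reduce to scalar formulas. The slides' actual pipeline may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rhos = np.array([0.1, 0.2, 0.6, 0.7, 0.99, 0.9999])

def fisher_rao(r1, r2):
    # Assumed closed form: sqrt(0.5 * sum of squared log eigenvalue ratios).
    return np.sqrt(0.5 * (np.log((1 + r2) / (1 + r1)) ** 2
                          + np.log((1 - r2) / (1 - r1)) ** 2))

def wasserstein2(r1, r2):
    # Gaussian W2 for commuting covariances: distance between eigenvalue square roots.
    return np.sqrt((np.sqrt(1 + r1) - np.sqrt(1 + r2)) ** 2
                   + (np.sqrt(1 - r1) - np.sqrt(1 - r2)) ** 2)

def ward_clusters(dist_fn, k=3):
    """Cut a Ward dendrogram built from pairwise distances into k clusters."""
    d = np.array([[dist_fn(a, b) for b in rhos] for a in rhos])
    z = linkage(squareform(d, checks=False), method="ward")
    labels = fcluster(z, t=k, criterion="maxclust")
    return sorted(sorted(rhos[labels == lab].tolist()) for lab in np.unique(labels))

fr_clusters = ward_clusters(fisher_rao)
w2_clusters = ward_clusters(wasserstein2)
print("Fisher-Rao:", fr_clusters)
print("W2        :", w2_clusters)
```

With these formulas the three-cluster cut should match the groupings reported on the slide.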



Fisher metric and the Cramer–Rao lower bound

Cramer–Rao lower bound (CRLB)

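The variance of any unbiased estimator $\hat\theta$ of $\theta$ is bounded below by the reciprocal of the Fisher information $G(\theta)$:

$$\mathrm{var}(\hat\theta) \ge \frac{1}{G(\theta)}.$$

In the bivariate Gaussian copula case,

$$\mathrm{var}(\hat\rho) \ge \frac{(\rho - 1)^2 (\rho + 1)^2}{\rho^2 + 1}.$$



We consider the set of $2 \times 2$ correlation matrices $C = \begin{pmatrix} 1 & \theta \\ \theta & 1 \end{pmatrix}$ parameterized by $\theta$.

Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \in \mathbb{R}^2$.

$$f(x; \theta) = \frac{1}{2\pi\sqrt{1 - \theta^2}} \exp\left(-\frac{1}{2} x^\top C^{-1} x\right) = \frac{1}{2\pi\sqrt{1 - \theta^2}} \exp\left(-\frac{1}{2(1 - \theta^2)} (x_1^2 + x_2^2 - 2\theta x_1 x_2)\right)$$

$$\log f(x; \theta) = -\log\left(2\pi\sqrt{1 - \theta^2}\right) - \frac{1}{2(1 - \theta^2)} (x_1^2 + x_2^2 - 2\theta x_1 x_2)$$

$$\frac{\partial^2 \log f(x; \theta)}{\partial\theta^2} = \frac{\theta^2 + 1}{(\theta^2 - 1)^2} - \frac{x_1^2}{2(\theta + 1)^3} + \frac{x_1^2}{2(\theta - 1)^3} - \frac{x_2^2}{2(\theta + 1)^3} + \frac{x_2^2}{2(\theta - 1)^3} - \frac{x_1 x_2}{(\theta + 1)^3} - \frac{x_1 x_2}{(\theta - 1)^3}$$

(the first term is the second derivative of $-\frac{1}{2}\log(1 - \theta^2)$).

Then, we compute $\int_{-\infty}^{\infty} \frac{\partial^2 \log f(x; \theta)}{\partial\theta^2} f(x; \theta)\,dx$.

Since $E[x_1] = E[x_2] = 0$, $E[x_1 x_2] = \theta$, $E[x_1^2] = E[x_2^2] = 1$, we get

$$\int_{-\infty}^{\infty} \frac{\partial^2 \log f(x; \theta)}{\partial\theta^2} f(x; \theta)\,dx = \frac{\theta^2 + 1}{(\theta^2 - 1)^2} - \frac{1}{2(\theta + 1)^3} + \frac{1}{2(\theta - 1)^3} - \frac{1}{2(\theta + 1)^3} + \frac{1}{2(\theta - 1)^3} - \frac{\theta}{(\theta + 1)^3} - \frac{\theta}{(\theta - 1)^3} = -\frac{\theta^2 + 1}{(\theta - 1)^2 (\theta + 1)^2}$$

Thus,

$$G(\theta) = \frac{\theta^2 + 1}{(\theta - 1)^2 (\theta + 1)^2}.$$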

Recall that, locally, Fisher-Rao and the f-divergences are a quadratic form of the Fisher metric, $(\nabla\theta)^\top G(\theta) \nabla\theta$. So, the discriminative power of these distances is well calibrated with respect to statistical uncertainty. For this purpose, they induce the appropriate curvature on the parameter space.



Properties of these distances

In addition, for clustering we prefer OT since:

in a parametric setting:

- Fisher-Rao and f-divergences are defined on density manifolds, but some important copulas (such as the Fréchet-Hoeffding upper bound) do not belong to these manifolds;
- thus, even when closed-form formulas exist (such as in the Gaussian case), they are ill-defined for these copulas (for perfect dependence, the covariance matrix is not invertible).

in a non-parametric/empirical setting:

- f-divergences are defined for absolutely continuous measures, thus require a KDE pre-processing step;
- they are not aware of the support geometry, thus handle noise on the support badly.
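
The support-geometry point can be illustrated on a toy one-dimensional example, which is an assumption of this sketch rather than the copula setting of the slides. Three histograms put all their mass in a single bin; the Hellinger distance (an f-divergence-based distance) saturates as soon as supports are disjoint, whereas the 1D Wasserstein distance from SciPy grows with the displacement.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def point_mass(index, n_bins=10):
    h = np.zeros(n_bins)
    h[index] = 1.0
    return h

grid = np.arange(10, dtype=float)          # bin locations
a, b, c = point_mass(0), point_mass(1), point_mass(9)

h_ab, h_ac = hellinger(a, b), hellinger(a, c)
w_ab = wasserstein_distance(grid, grid, u_weights=a, v_weights=b)
w_ac = wasserstein_distance(grid, grid, u_weights=a, v_weights=c)

print("Hellinger  :", h_ab, h_ac)   # identical: both pairs have disjoint supports
print("Wasserstein:", w_ab, w_ac)   # grows with how far the mass has moved
```

A small shift of the support, such as the one caused by noise, is therefore treated by the Hellinger distance exactly like a large one.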



Barycenters

OT is defined for both discrete/empirical and continuous measures and is support-geometry aware:

[Figure: five copula density heatmaps, each on [0, 1] × [0, 1]]

5 copulas describing the dependence between $X \sim U([0, 1])$ and $Y \sim (X \pm \varepsilon_i)^2$, where $\varepsilon_i$ is a constant noise specific to each distribution

[Figure: heatmap on [0, 1] × [0, 1], panel title "Wasserstein barycenter copula"]

Barycenter of the 5 copulas for a divergence and OT
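
The contrast between the two kinds of barycenter can be sketched in one dimension, where the equal-weight W2 barycenter of empirical measures with the same number of points is obtained by averaging sorted samples (that is, quantile functions). The toy data below is an assumption for illustration and is unrelated to the five copulas of the figure. The pointwise average of the distributions (their mixture) is used as the divergence-side counterpart, since it minimizes the average Kullback-Leibler divergence from the inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(-2.0, 0.3, size=2000)   # samples from the first distribution
b = rng.normal(+2.0, 0.3, size=2000)   # samples from the second distribution

# 1D W2 barycenter with equal weights: average the empirical quantile functions.
w2_barycenter = 0.5 * (np.sort(a) + np.sort(b))

# Pointwise average of the two distributions: an equal-weight mixture.
mixture = np.concatenate([a, b])

print("W2 barycenter: mean %.2f, std %.2f" % (w2_barycenter.mean(), w2_barycenter.std()))
print("mixture      : mean %.2f, std %.2f" % (mixture.mean(), mixture.std()))
```

The W2 barycenter is a single bump located between the inputs with a similar spread, while the mixture keeps two separate modes.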



Future Research

Develop further geometries of copulas

- using Optimal Transport: show that dependence-clustering of time series is improved over standard correlations
- using f-divergences: efficiently detect dependence-regime switching in multivariate time series (cf. Frédéric Barbaresco's work on radar signal processing)

Numerical experiments and code:

https://www.datagrapple.com/Tech/fisher-vs-ot.html




[1] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183–195, 2010.

[2] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[3] Sira Ferradans, Nicolas Papadakis, Julien Rabin, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. Springer, 2013.

