Slides: Total Jensen divergences: Definition, Properties and k-Means++ Clustering

Total Jensen divergences: Definition, Properties and k-Means++ Clustering

Frank Nielsen (1), Richard Nock (2)

www.informationgeometry.org

(1) Sony Computer Science Laboratories, Inc. (2) UAG-CEREGMIA

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.


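Divergences: Distortion measures

F: a smooth convex function, the generator.

◮ Skew Jensen divergences:

$$J'_\alpha(p : q) = \alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q) = (F(p)F(q))_\alpha - F((pq)_\alpha),$$

where $(pq)_\gamma = \gamma p + (1-\gamma) q = q + \gamma (p - q)$ and $(F(p)F(q))_\gamma = \gamma F(p) + (1-\gamma) F(q) = F(q) + \gamma (F(p) - F(q))$.

◮ Bregman divergences:

$$B(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle,$$

$$\lim_{\alpha \to 0} J_\alpha(p : q) = B(p : q), \qquad \lim_{\alpha \to 1} J_\alpha(p : q) = B(q : p),$$

where $J_\alpha = J'_\alpha / (\alpha (1-\alpha))$ denotes the scaled skew Jensen divergence.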

◮ Statistical Bhattacharyya divergence:

$$\mathrm{Bhat}(p_1 : p_2) = -\log \int p_1(x)^{\alpha} \, p_2(x)^{1-\alpha} \, \mathrm{d}\nu(x) = J'_\alpha(\theta_1 : \theta_2)$$

for exponential families [5].
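As a rough numerical check of the definitions on this slide, the sketch below evaluates the skew Jensen and Bregman divergences for a univariate generator and compares the two limits. The generator F(x) = x log x - x, the scaling J_α = J'_α / (α(1 - α)), the Poisson example and all function names are illustrative assumptions, not taken from the slides.

```python
import math

def skew_jensen(F, p, q, alpha):
    """Unscaled skew Jensen divergence J'_alpha(p : q), univariate case."""
    return alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)

def scaled_skew_jensen(F, p, q, alpha):
    """Assumed scaling J_alpha = J'_alpha / (alpha (1 - alpha)), 0 < alpha < 1."""
    return skew_jensen(F, p, q, alpha) / (alpha * (1 - alpha))

def bregman(F, dF, p, q):
    """Bregman divergence B(p : q) = F(p) - F(q) - (p - q) F'(q)."""
    return F(p) - F(q) - (p - q) * dF(q)

def F(x):
    """Illustrative generator (an assumption): F(x) = x log x - x on x > 0."""
    return x * math.log(x) - x

def dF(x):
    return math.log(x)

def poisson_bhattacharyya(lam1, lam2, alpha, terms=200):
    """-log sum_x p1(x)^alpha p2(x)^(1 - alpha) for two Poisson laws (truncated sum)."""
    total = 0.0
    for x in range(terms):
        log_p1 = x * math.log(lam1) - lam1 - math.lgamma(x + 1)
        log_p2 = x * math.log(lam2) - lam2 - math.lgamma(x + 1)
        total += math.exp(alpha * log_p1 + (1 - alpha) * log_p2)
    return -math.log(total)

p, q = 2.0, 5.0
limit_at_0 = scaled_skew_jensen(F, p, q, 1e-6)
limit_at_1 = scaled_skew_jensen(F, p, q, 1 - 1e-6)
print(limit_at_0, bregman(F, dF, p, q))  # alpha -> 0 against B(p : q)
print(limit_at_1, bregman(F, dF, q, p))  # alpha -> 1 against B(q : p)

# Poisson family: natural parameter theta = log(lambda), log-normalizer exp(theta).
bhat = poisson_bhattacharyya(3.0, 7.0, 0.3)
jensen = skew_jensen(math.exp, math.log(3.0), math.log(7.0), 0.3)
print(bhat, jensen)
```

The Poisson family is used only because its log-normalizer exp(θ) makes both sides easy to evaluate; any exponential family with a known log-normalizer could be substituted.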

Geometrically designed divergences

Plot of the convex generator F.

[Figure: graph F : (x, F(x)) with the points (p, F(p)) and (q, F(q)) and the midpoint (p+q)/2, showing the divergences B(p : q), J(p, q) and tB(p : q).]


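Total Bregman divergences

Conformal divergence, conformal factor ρ:

$$D'(p : q) = \rho(p, q) D(p : q)$$

The conformal factor plays the role of a “regularizer” [8].

Invariance by rotation of the axes of the design space

$$tB(p : q) = \frac{B(p : q)}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}} = \rho_B(q) B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}}.$$

Total squared Euclidean divergence:

$$tE(p, q) = \frac{1}{2} \, \frac{\langle p - q, p - q \rangle}{\sqrt{1 + \langle q, q \rangle}}.$$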
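The total Bregman divergence above can be sketched in a few lines. The code below assumes vectors are plain Python lists and that the generator and its gradient are supplied as functions; the helper names are hypothetical. With F(x) = ½⟨x, x⟩ it should reproduce the total squared Euclidean divergence.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bregman(F, grad_F, p, q):
    """B(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    diff = [a - b for a, b in zip(p, q)]
    return F(p) - F(q) - dot(diff, grad_F(q))

def rho_B(grad_F, q):
    """Conformal factor 1 / sqrt(1 + <grad F(q), grad F(q)>)."""
    g = grad_F(q)
    return 1.0 / math.sqrt(1.0 + dot(g, g))

def total_bregman(F, grad_F, p, q):
    return rho_B(grad_F, q) * bregman(F, grad_F, p, q)

def total_squared_euclidean(p, q):
    """tE(p, q) = (1/2) <p - q, p - q> / sqrt(1 + <q, q>)."""
    diff = [a - b for a, b in zip(p, q)]
    return 0.5 * dot(diff, diff) / math.sqrt(1.0 + dot(q, q))

def half_squared_norm(x):
    return 0.5 * dot(x, x)

def identity_gradient(x):
    return list(x)

p, q = [1.0, 2.0], [3.0, -1.0]
tb = total_bregman(half_squared_norm, identity_gradient, p, q)
te = total_squared_euclidean(p, q)
print(tb, te)  # the two values should coincide
```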


Total Jensen divergences

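$$tB(p : q) = \rho_B(q) B(p : q), \qquad \rho_B(q) = \sqrt{\frac{1}{1 + \langle \nabla F(q), \nabla F(q) \rangle}}$$

$$tJ_\alpha(p : q) = \rho_J(p, q) J_\alpha(p : q), \qquad \rho_J(p, q) = \sqrt{\frac{1}{1 + \frac{(F(p) - F(q))^2}{\langle p - q, p - q \rangle}}}$$

Jensen-Shannon divergence, square root is a metric [2]:

$$JS(p, q) = \frac{1}{2} \sum_{i=1}^{d} p_i \log \frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^{d} q_i \log \frac{2 q_i}{p_i + q_i}$$

Lemma: The square root of the total Jensen-Shannon divergence is not a metric.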
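To make the conformal factor ρ_J concrete, here is a minimal sketch that computes it for list-valued points and applies it to the Jensen-Shannon divergence. It assumes the Shannon generator F(x) = Σ x_i log x_i (for which the unscaled Jensen divergence at α = 1/2 equals JS), strictly positive coordinates, and p ≠ q. Whether the total Jensen-Shannon divergence is built on the scaled or the unscaled Jensen divergence is not stated on the slide; the unscaled form is used here as an assumption.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def skew_jensen(F, p, q, alpha):
    """Unscaled J'_alpha(p : q) for vector arguments."""
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]
    return alpha * F(p) + (1 - alpha) * F(q) - F(mix)

def rho_J(F, p, q):
    """Conformal factor sqrt(1 / (1 + (F(p) - F(q))^2 / <p - q, p - q>)); needs p != q."""
    diff = [a - b for a, b in zip(p, q)]
    return math.sqrt(1.0 / (1.0 + (F(p) - F(q)) ** 2 / dot(diff, diff)))

def shannon_generator(x):
    """Assumed generator: F(x) = sum_i x_i log x_i (negative Shannon entropy)."""
    return sum(xi * math.log(xi) for xi in x)

def jensen_shannon(p, q):
    left = sum(a * math.log(2 * a / (a + b)) for a, b in zip(p, q))
    right = sum(b * math.log(2 * b / (a + b)) for a, b in zip(p, q))
    return 0.5 * left + 0.5 * right

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
js = jensen_shannon(p, q)
js_from_generator = skew_jensen(shannon_generator, p, q, 0.5)
total_js = rho_J(shannon_generator, p, q) * js
print(js, js_from_generator, total_js)
```

Since 0 < ρ_J ≤ 1, the total divergence never exceeds the ordinary one.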


Total Jensen divergence: Illustration

[Figure: total Jensen divergence on the graph of F for the pairs (p, q) and (p′, q′), with origin O. Labelled quantities: (pq)_α, F(p), F(q), (F(p)F(q))_α, (F(p)F(q))_β, F((pq)_α), J′_α(p : q), tJ′_α(p : q), and their primed counterparts.]


Total Jensen divergence: Illustration

α on graph plot, β on interpolated segment.

Two kinds of total Jensen divergences (but one always yields a closed form).

[Figure: two panels on the graph of F between p and q, showing F((pq)_α) and (F(p)F(q))_β for the cases β < 0, β ∈ [0, 1] and β > 1.]


Total Jensen divergences/Total Bregman divergences

Total Jensen is not a generalization of total Bregman. In the limit cases α ∈ {0, 1}, we have:

$$\lim_{\alpha \to 0} tJ_\alpha(p : q) = \rho_J(p, q) B(p : q) \neq \rho_B(q) B(p : q),$$

$$\lim_{\alpha \to 1} tJ_\alpha(p : q) = \rho_J(p, q) B(q : p) \neq \rho_B(p) B(q : p),$$

since $\rho_J(p, q) \neq \rho_B(q)$.

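Squared chord slope index in ρ_J, with Δ = p − q and Δ_F = F(p) − F(q):

$$s^2 = \frac{\Delta_F^2}{\|\Delta\|^2} = \frac{\Delta^\top \nabla F(\epsilon) \, \Delta^\top \nabla F(\epsilon)}{\Delta^\top \Delta} = \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle = \|\nabla F(\epsilon)\|^2.$$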
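The gap between the two limits can be observed numerically. The sketch below is a univariate illustration under assumptions: the generator F(x) = x log x - x is an arbitrary choice, and tJ_α is taken to be ρ_J times the scaled skew Jensen divergence J'_α / (α(1 - α)).

```python
import math

def F(x):
    """Illustrative generator (an assumption): F(x) = x log x - x on x > 0."""
    return x * math.log(x) - x

def dF(x):
    return math.log(x)

def bregman(p, q):
    return F(p) - F(q) - (p - q) * dF(q)

def rho_B(q):
    return 1.0 / math.sqrt(1.0 + dF(q) ** 2)

def rho_J(p, q):
    slope = (F(p) - F(q)) / (p - q)  # chord slope Delta_F / Delta
    return 1.0 / math.sqrt(1.0 + slope ** 2)

def total_skew_jensen(p, q, alpha):
    jensen = alpha * F(p) + (1 - alpha) * F(q) - F(alpha * p + (1 - alpha) * q)
    return rho_J(p, q) * jensen / (alpha * (1 - alpha))

p, q = 2.0, 5.0
tj_near_0 = total_skew_jensen(p, q, 1e-6)
tj_near_1 = total_skew_jensen(p, q, 1 - 1e-6)
print(tj_near_0, rho_J(p, q) * bregman(p, q), rho_B(q) * bregman(p, q))
print(tj_near_1, rho_J(p, q) * bregman(q, p), rho_B(p) * bregman(q, p))
```

In each printed row the first two numbers should nearly agree, while the third (the total Bregman divergence) differs visibly.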


Conformal factor from mean value theorem

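When p ≃ q, ρ_J(p, q) ≃ ρ_B(q), and the total Jensen divergence tends to the total Bregman divergence for any value of α.

$$\rho_J(p, q) = \frac{1}{\sqrt{1 + \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle}} = \rho_B(\epsilon),$$

for ε ∈ [p, q].

For univariate generators, the value of ε is explicit:

$$\epsilon = \nabla F^{-1}\left( \frac{\Delta_F}{\Delta} \right) = \nabla F^{*}\left( \frac{\Delta_F}{\Delta} \right),$$

where F* is the Legendre convex conjugate [5].

ε is a Stolarsky mean [7]:

$$tJ_\alpha(p : q) = \rho_B(\epsilon) J_\alpha(p : q)$$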
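For univariate generators the mean value point ε can be computed directly by inverting F′ at the chord slope. The sketch below does this for two generators chosen for illustration (F(x) = x log x - x and F(x) = -log x, with hand-written inverse derivatives) and checks that ρ_J(p, q) = ρ_B(ε). Function names are hypothetical.

```python
import math

def chord_slope(F, p, q):
    """Delta_F / Delta = (F(p) - F(q)) / (p - q)."""
    return (F(p) - F(q)) / (p - q)

def rho_J(F, p, q):
    s = chord_slope(F, p, q)
    return 1.0 / math.sqrt(1.0 + s * s)

def rho_B(dF, x):
    return 1.0 / math.sqrt(1.0 + dF(x) ** 2)

def mean_value_point(F, dF_inverse, p, q):
    """epsilon = (F')^{-1}(Delta_F / Delta), univariate generators only."""
    return dF_inverse(chord_slope(F, p, q))

# (F, F', inverse of F') triples; both generators are illustrative assumptions.
shannon = (lambda x: x * math.log(x) - x, math.log, math.exp)
burg = (lambda x: -math.log(x), lambda x: -1.0 / x, lambda s: -1.0 / s)

p, q = 2.0, 5.0
eps_shannon = mean_value_point(shannon[0], shannon[2], p, q)
eps_burg = mean_value_point(burg[0], burg[2], p, q)
print(eps_shannon, rho_J(shannon[0], p, q), rho_B(shannon[1], eps_shannon))
print(eps_burg, rho_J(burg[0], p, q), rho_B(burg[1], eps_burg))
```

For F(x) = -log x the formula reduces to ε = (p - q) / (log p - log q), the logarithmic mean of p and q, which is one member of the Stolarsky family.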


Centroids and statistical robustness

Centroids (barycenters) are minimizers of average (weighted) divergences:

$$L(x; w) = \sum_{i=1}^{n} w_i \times tJ_\alpha(p_i : x), \qquad c_\alpha = \arg\min_{x \in X} L(x; w),$$

◮ Is it unique?

◮ Is it robust to outliers [3]?

Iterative convex-concave procedure (CCCP) [5]
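The loss L(x; w) can be explored numerically before any dedicated solver is used. The sketch below is not the CCCP iteration mentioned above: it is a brute-force grid search over an interval, for univariate data, with the generator F(x) = -log x, the scaled form of J_α, and the convention tJ_α(p : p) = 0 all taken as assumptions. Since uniqueness is an open question on this slide, a grid scan is a cautious way to look at the loss landscape.

```python
import math

def burg(x):
    """Illustrative generator (an assumption): F(x) = -log x on x > 0."""
    return -math.log(x)

def total_skew_jensen(F, p, x, alpha):
    """tJ_alpha(p : x) = rho_J(p, x) J_alpha(p : x), univariate; 0 when p == x."""
    if p == x:
        return 0.0
    slope = (F(p) - F(x)) / (p - x)
    rho = 1.0 / math.sqrt(1.0 + slope * slope)
    jensen = alpha * F(p) + (1 - alpha) * F(x) - F(alpha * p + (1 - alpha) * x)
    return rho * jensen / (alpha * (1 - alpha))

def loss(F, points, weights, x, alpha):
    """L(x; w) = sum_i w_i tJ_alpha(p_i : x)."""
    return sum(w * total_skew_jensen(F, p, x, alpha) for p, w in zip(points, weights))

def grid_centroid(F, points, weights, alpha, lo, hi, steps=2000):
    """Return the grid point in [lo, hi] with the smallest loss, and that loss."""
    best_x, best_loss = lo, loss(F, points, weights, lo, alpha)
    for k in range(1, steps + 1):
        x = lo + (hi - lo) * k / steps
        value = loss(F, points, weights, x, alpha)
        if value < best_loss:
            best_x, best_loss = x, value
    return best_x, best_loss

points = [1.0, 2.0, 4.0]
weights = [1.0 / 3.0] * 3
centroid, value = grid_centroid(burg, points, weights, 0.5, min(points), max(points))
print(centroid, value)
```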


Robustness of Jensen centroids (univariate generator)

Theorem: The Jensen centroid is robust for a strictly convex and smooth generator f if $|f'(\frac{p+y}{2})|$ is bounded on the domain X for any prescribed p.

◮ Jensen-Shannon: X = R+, f(x) = x log x − x, f′(x) = log(x), f′′(x) = 1/x.

$|f'(\frac{p+y}{2})| = |\log \frac{p+y}{2}|$ is unbounded when y → +∞.

JS centroid is not robust.

◮ Jensen-Burg: X = R+, f(x) = − log x, f′(x) = −1/x, f′′(x) = 1/x².

$|f'(\frac{p+y}{2})| = |\frac{2}{p+y}|$ is always bounded for y ∈ (0, +∞).

$$z(y) = 2p^2 \left( \frac{1}{p} - \frac{2}{p + y} \right)$$

When y → ∞, we have |z(y)| → 2p < ∞.

JB centroid is robust.
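The boundedness criterion can be checked by tabulating |f′((p + y)/2)| as the outlier y grows. This is a small sketch with an arbitrary p = 1.5 and hypothetical function names; z(y) is the quantity written on the slide for the Jensen-Burg case.

```python
import math

def js_slope(p, y):
    """|f'((p + y) / 2)| for f(x) = x log x - x, where f'(x) = log x."""
    return abs(math.log((p + y) / 2))

def jb_slope(p, y):
    """|f'((p + y) / 2)| for f(x) = -log x, where f'(x) = -1 / x."""
    return abs(-2.0 / (p + y))

def z(p, y):
    """z(y) = 2 p^2 (1/p - 2/(p + y)), as given for the Jensen-Burg case."""
    return 2 * p * p * (1 / p - 2 / (p + y))

p = 1.5
for y in (1e1, 1e3, 1e6, 1e9):
    # Jensen-Shannon column keeps growing; Jensen-Burg column stays below 2 / p.
    print(y, js_slope(p, y), jb_slope(p, y), z(p, y))
```

As y increases, the Jensen-Shannon column grows like log y, the Jensen-Burg column shrinks toward 0, and z(y) approaches 2p.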


Clustering: No closed-form centroid, no cry!

k-means++ [1] picks seeds at random, with no centroid calculation.
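A seeding procedure in the spirit of k-means++ can be written for any divergence, since it only evaluates divergences between data points and already chosen seeds. The sketch below is an assumption-laden illustration: it samples each new seed with probability proportional to the divergence to the closest current seed, uses the argument order divergence(point, seed), and plugs in a univariate total Jensen divergence with the generator F(x) = -log x. None of these choices is confirmed by the slides.

```python
import math
import random

def total_jensen_burg(p, x, alpha=0.5):
    """Univariate tJ_alpha(p : x) for F(x) = -log x (illustrative choice); 0 when p == x."""
    if p == x:
        return 0.0
    F = lambda t: -math.log(t)
    slope = (F(p) - F(x)) / (p - x)
    rho = 1.0 / math.sqrt(1.0 + slope * slope)
    jensen = alpha * F(p) + (1 - alpha) * F(x) - F(alpha * p + (1 - alpha) * x)
    return rho * jensen / (alpha * (1 - alpha))

def divergence_seeding(points, k, divergence, rng=None):
    """Pick k seeds among the points; return the seeds and the resulting potential."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]
    # Cost of a point = divergence to its closest seed so far.
    costs = [divergence(x, seeds[0]) for x in points]
    while len(seeds) < k:
        if sum(costs) <= 0:
            break  # every point already coincides with a seed
        chosen = rng.choices(points, weights=costs, k=1)[0]
        seeds.append(chosen)
        costs = [min(c, divergence(x, chosen)) for x, c in zip(points, costs)]
    return seeds, sum(costs)

points = [0.5, 0.6, 0.7, 5.0, 5.5, 6.0, 20.0, 21.0]
seeds, potential = divergence_seeding(points, 3, total_jensen_burg, random.Random(0))
print(seeds, potential)
```

No centroid is ever computed: the seeds are data points, and the potential is the sum of divergences from each point to its closest seed.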


Divergence-based k-means++

Theorem: Suppose there exist some U and V such that, ∀x, y, z:

$$tJ_\alpha(x : z) \leq U \left( tJ_\alpha(x : y) + tJ_\alpha(y : z) \right), \quad \text{(triangular inequality)}$$

$$tJ_\alpha(x : z) \leq V \, tJ_\alpha(z : x), \quad \text{(symmetric inequality)}$$

Then the average potential of total Jensen seeding with k clusters satisfies

$$E[tJ_\alpha] \leq 2U^2 (1 + V)(2 + \log k) \, tJ_{\mathrm{opt},\alpha},$$

where $tJ_{\mathrm{opt},\alpha}$ is the minimal total Jensen potential achieved by a clustering in k clusters.


Divergence-based k-means++: Two assumptions H

H:

◮ First, the maximal condition number of the Hessian of F, that is, the ratio between the maximal and minimal eigenvalue (> 0) of the Hessian of F, is upper bounded by K1.

◮ Second, we assume the Lipschitz condition on F that $\Delta_F^2 / \langle \Delta, \Delta \rangle \leq K_2$, for some K2 > 0.

Lemma: Assume 0 < α < 1. Then, under assumption H, for any p, q, r ∈ S, there exists ε > 0 such that:

$$tJ_\alpha(p : r) \leq \frac{2(1 + K_2) K_1^2}{\epsilon} \left( \frac{1}{1 - \alpha} tJ_\alpha(p : q) + \frac{1}{\alpha} tJ_\alpha(q : r) \right).$$


Divergence-based k-means++

Corollary: The total skew Jensen divergence satisfies the following triangular inequality:

$$tJ_\alpha(p : r) \leq \frac{2(1 + K_2) K_1^2}{\epsilon \alpha (1 - \alpha)} \left( tJ_\alpha(p : q) + tJ_\alpha(q : r) \right).$$

$$U = \frac{2(1 + K_2) K_1^2}{\epsilon}$$

Lemma: Symmetric inequality condition holds for $V = K_1^2 (1 + K_2) / \epsilon$, for some 0 < ε < 1.
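To get a feel for the size of these constants, the sketch below evaluates U, V and the factor 2U²(1 + V)(2 + log k) multiplying the optimal potential in the seeding guarantee. The numeric values of K1, K2, ε and α are arbitrary placeholders, and the function names are hypothetical. U follows the expression displayed on the slide; the constant in the corollary carries an extra 1/(α(1 - α)) and is computed separately.

```python
import math

def triangle_constant(K1, K2, eps):
    """U = 2 (1 + K2) K1^2 / eps, as displayed on the slide."""
    return 2 * (1 + K2) * K1 ** 2 / eps

def corollary_constant(K1, K2, eps, alpha):
    """Constant in the corollary: 2 (1 + K2) K1^2 / (eps alpha (1 - alpha))."""
    return triangle_constant(K1, K2, eps) / (alpha * (1 - alpha))

def symmetry_constant(K1, K2, eps):
    """V = K1^2 (1 + K2) / eps."""
    return K1 ** 2 * (1 + K2) / eps

def seeding_factor(U, V, k):
    """Factor 2 U^2 (1 + V) (2 + log k) multiplying the optimal potential."""
    return 2 * U ** 2 * (1 + V) * (2 + math.log(k))

K1, K2, eps, alpha = 2.0, 1.0, 0.5, 0.5  # placeholder values, not from the slides
U = triangle_constant(K1, K2, eps)
V = symmetry_constant(K1, K2, eps)
print(U, V, corollary_constant(K1, K2, eps, alpha))
for k in (1, 10, 100):
    print(k, seeding_factor(U, V, k))
```

The factor grows only logarithmically in k, but quadratically in U, so the guarantee is most informative when K1 and K2 are small.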


Total Jensen divergences: Recap

Total Jensen divergence = conformal divergence with non-separable double-sided conformal factor.

◮ Invariant to axis rotation of the “design space”

◮ Equivalent to total Bregman divergences [8, 4] only when p ≃ q

◮ Square root of total Jensen-Shannon divergence is not a metric (square root of JS is a metric).

◮ Jensen centroids are not always robust (e.g., Jensen-Shannon centroid)

◮ Total Jensen k-means++ does not require centroid computations and comes with a guaranteed approximation

Interest of conformal divergences in SVM [9] (double-sided separable), in information geometry [6] (flattening).


Thank you.

@article{totalJensen-arXiv1309.7109 ,

author="Frank Nielsen and Richard Nock",

title="Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",

year="2013",

eprint="arXiv/1309.7109"

}




Bibliographic references I

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[2] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, pages 31–31, 2004.

[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986.

[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.

[5] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.

[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010.


Bibliographic references II

[7] Kenneth B. Stolarsky. Generalizations of the logarithmic mean. Mathematics Magazine, 48(2):87–92, 1975.

[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011.

[9] Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers. Neural Processing Letters, 15(1):59–67, 2002.

