Date posted: 05-Dec-2014
Category: Technology
Uploaded by: frank-nielsen
Total Jensen divergences: Definition, Properties
and k-Means++ Clustering
Frank Nielsen¹, Richard Nock²
www.informationgeometry.org
¹ Sony Computer Science Laboratories, Inc. ² UAG-CEREGMIA
September 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
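Divergences: Distortion measures
$F$ a smooth convex function, the generator.
◮ Skew Jensen divergences:
$$J'_\alpha(p : q) = \alpha F(p) + (1-\alpha) F(q) - F(\alpha p + (1-\alpha) q) = (F(p)F(q))_\alpha - F((pq)_\alpha),$$
where $(pq)_\gamma = \gamma p + (1-\gamma) q = q + \gamma (p - q)$ and $(F(p)F(q))_\gamma = \gamma F(p) + (1-\gamma) F(q) = F(q) + \gamma (F(p) - F(q))$.
◮ Bregman divergences:
$$B(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle,$$
$$\lim_{\alpha \to 0} J_\alpha(p : q) = B(p : q), \qquad \lim_{\alpha \to 1} J_\alpha(p : q) = B(q : p),$$
where $J_\alpha = \frac{1}{\alpha(1-\alpha)} J'_\alpha$ denotes the scaled skew Jensen divergence (the unscaled $J'_\alpha$ vanishes in both limits).
◮ Statistical Bhattacharyya divergence:
$$\mathrm{Bhat}(p_1 : p_2) = -\log \int p_1(x)^{\alpha}\, p_2(x)^{1-\alpha}\, \mathrm{d}\nu(x) = J'_\alpha(\theta_1 : \theta_2)$$
for exponential families [5].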
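The limit relation between skew Jensen and Bregman divergences can be checked numerically. The sketch below is only an illustration: it assumes the univariate generator F(x) = x log x − x (chosen here for convenience) and the scaling J_α = J'_α / (α(1 − α)); the function names are hypothetical.

```python
import math

def F(x):
    """Example generator F(x) = x log x - x on x > 0 (an assumption of this sketch)."""
    return x * math.log(x) - x

def grad_F(x):
    """Derivative F'(x) = log x."""
    return math.log(x)

def skew_jensen(p, q, alpha):
    """J'_alpha(p : q) = alpha F(p) + (1 - alpha) F(q) - F(alpha p + (1 - alpha) q)."""
    mix = alpha * p + (1.0 - alpha) * q
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)

def scaled_skew_jensen(p, q, alpha):
    """Scaled divergence J_alpha = J'_alpha / (alpha (1 - alpha))."""
    return skew_jensen(p, q, alpha) / (alpha * (1.0 - alpha))

def bregman(p, q):
    """B(p : q) = F(p) - F(q) - (p - q) F'(q)."""
    return F(p) - F(q) - (p - q) * grad_F(q)

p, q = 2.0, 5.0
near_zero = scaled_skew_jensen(p, q, 1e-6)       # should approach B(p : q)
near_one = scaled_skew_jensen(p, q, 1.0 - 1e-6)  # should approach B(q : p)
print(near_zero, bregman(p, q))
print(near_one, bregman(q, p))
```

Each printed pair should agree to several decimal places; the remaining gap shrinks linearly as α approaches 0 or 1.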
Geometrically designed divergences
[Figure: plot of the convex generator F, the graph (x, F(x)), with the points (p, F(p)) and (q, F(q)) and the midpoint (p+q)/2, illustrating B(p : q), J(p, q) and tB(p : q).]
Total Bregman divergences
Conformal divergence, conformal factor $\rho$:
$$D'(p : q) = \rho(p, q)\, D(p : q)$$
plays the role of "regularizer" [8].
Invariance by rotation of the axes of the design space:
$$tB(p : q) = \frac{B(p : q)}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}} = \rho_B(q)\, B(p : q), \qquad \rho_B(q) = \frac{1}{\sqrt{1 + \langle \nabla F(q), \nabla F(q) \rangle}}.$$
Total squared Euclidean divergence:
$$tE(p, q) = \frac{1}{2}\, \frac{\langle p - q, p - q \rangle}{\sqrt{1 + \langle q, q \rangle}}.$$
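As a small consistency check, the total squared Euclidean divergence should be the total Bregman divergence for the generator F(x) = ½⟨x, x⟩. The sketch below uses plain Python lists as vectors; the helper names (`dot`, `total_bregman`, and so on) are hypothetical and not taken from the slides.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def total_bregman(F, grad_F, p, q):
    """tB(p : q) = rho_B(q) B(p : q), with rho_B(q) = 1 / sqrt(1 + <grad F(q), grad F(q)>)."""
    g = grad_F(q)
    diff = [a - b for a, b in zip(p, q)]
    bregman = F(p) - F(q) - dot(diff, g)
    rho_B = 1.0 / math.sqrt(1.0 + dot(g, g))
    return rho_B * bregman

def half_sq_norm(x):
    """Generator F(x) = (1/2) <x, x>."""
    return 0.5 * dot(x, x)

def grad_half_sq_norm(x):
    """Gradient of (1/2) <x, x> is x itself."""
    return list(x)

def total_sq_euclidean(p, q):
    """tE(p, q) = (1/2) <p - q, p - q> / sqrt(1 + <q, q>)."""
    diff = [a - b for a, b in zip(p, q)]
    return 0.5 * dot(diff, diff) / math.sqrt(1.0 + dot(q, q))

p, q = [1.0, 2.0], [3.0, 0.5]
print(total_bregman(half_sq_norm, grad_half_sq_norm, p, q))
print(total_sq_euclidean(p, q))
```

Both lines should print the same value, since the Bregman divergence of ½⟨x, x⟩ is ½⟨p − q, p − q⟩ and its gradient at q is q.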
Total Jensen divergences
$$tB(p : q) = \rho_B(q)\, B(p : q), \qquad \rho_B(q) = \sqrt{\frac{1}{1 + \langle \nabla F(q), \nabla F(q) \rangle}}$$
$$tJ_\alpha(p : q) = \rho_J(p, q)\, J_\alpha(p : q), \qquad \rho_J(p, q) = \sqrt{\frac{1}{1 + \frac{(F(p) - F(q))^2}{\langle p - q, p - q \rangle}}}$$
Jensen-Shannon divergence, square root is a metric [2]:
$$JS(p, q) = \frac{1}{2} \sum_{i=1}^{d} p_i \log \frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^{d} q_i \log \frac{2 q_i}{p_i + q_i}$$
Lemma
The square root of the total Jensen-Shannon divergence is not a metric.
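The conformal factor ρ_J and the Jensen-Shannon divergence can be sketched in a few lines. The code below is an illustration under stated assumptions: it uses the unscaled J'_α (the scaled variant differs only by the constant 1/(α(1 − α))), takes the negative Shannon entropy F(x) = Σ x_i log x_i as generator, and requires p ≠ q with strictly positive coordinates. All function names are hypothetical.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def neg_entropy(x):
    """Generator F(x) = sum_i x_i log x_i (assumes all x_i > 0)."""
    return sum(xi * math.log(xi) for xi in x)

def skew_jensen(F, p, q, alpha):
    """Unscaled J'_alpha(p : q) for a multivariate generator F."""
    mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(p, q)]
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(mix)

def rho_J(F, p, q):
    """Conformal factor 1 / sqrt(1 + (F(p) - F(q))^2 / <p - q, p - q>); needs p != q."""
    diff = [a - b for a, b in zip(p, q)]
    delta_F = F(p) - F(q)
    return 1.0 / math.sqrt(1.0 + delta_F ** 2 / dot(diff, diff))

def total_skew_jensen(F, p, q, alpha):
    """Total Jensen divergence built on the unscaled J'_alpha."""
    return rho_J(F, p, q) * skew_jensen(F, p, q, alpha)

def jensen_shannon(p, q):
    """JS(p, q) as written on the slide, with natural logarithms."""
    left = sum(a * math.log(2.0 * a / (a + b)) for a, b in zip(p, q))
    right = sum(b * math.log(2.0 * b / (a + b)) for a, b in zip(p, q))
    return 0.5 * left + 0.5 * right

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
print(jensen_shannon(p, q), skew_jensen(neg_entropy, p, q, 0.5))
print(rho_J(neg_entropy, p, q), total_skew_jensen(neg_entropy, p, q, 0.5))
```

The first line prints the same number twice, because JS coincides with J'_{1/2} for the negative entropy generator. The second line shows that ρ_J lies in (0, 1], so the total divergence never exceeds the ordinary one.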
Total Jensen divergence: Illustration
[Figure: construction of the total Jensen divergence tJ'α(p : q) from J'α(p : q) on the graph of F, showing p, q, (pq)α, F(p), F(q), (F(p)F(q))α, (F(p)F(q))β and F((pq)α), together with the corresponding primed quantities p', q', (p'q')α, J'α(p' : q') and tJ'α(p' : q') and the origin O.]
Total Jensen divergence: Illustration
α on graph plot, β on interpolated segment.
Two kinds of total Jensen divergences (but one always yields closed-form).
[Figure: two panels on the graph of F between p and q, showing F((pq)α) and (F(p)F(q))β for the cases β < 0, β ∈ [0, 1] and β > 1.]
Total Jensen divergences/Total Bregman divergences
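Total Jensen is not a generalization of total Bregman. In the limit cases $\alpha \in \{0, 1\}$, we have:
$$\lim_{\alpha \to 0} tJ_\alpha(p : q) = \rho_J(p, q)\, B(p : q) \neq \rho_B(q)\, B(p : q),$$
$$\lim_{\alpha \to 1} tJ_\alpha(p : q) = \rho_J(p, q)\, B(q : p) \neq \rho_B(p)\, B(q : p),$$
since $\rho_J(p, q) \neq \rho_B(q)$.
Squared chord slope index in $\rho_J$, with $\Delta = p - q$ and $\Delta_F = F(p) - F(q)$:
$$s^2 = \frac{\Delta_F^2}{\|\Delta\|^2} = \frac{\Delta^\top \nabla F(\epsilon)\, \Delta^\top \nabla F(\epsilon)}{\Delta^\top \Delta} = \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle = \|\nabla F(\epsilon)\|^2.$$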
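The gap between the two conformal factors is easy to see numerically. The sketch below assumes the univariate generator F(x) = x log x − x (an illustrative choice) and compares the chord-based factor ρ_J(p, q) with the tangent-based factors ρ_B(q) and ρ_B(p); the names are hypothetical.

```python
import math

def F(x):
    """Example generator F(x) = x log x - x (assumption of this sketch)."""
    return x * math.log(x) - x

def grad_F(x):
    """Derivative F'(x) = log x."""
    return math.log(x)

def bregman(p, q):
    """B(p : q) = F(p) - F(q) - (p - q) F'(q)."""
    return F(p) - F(q) - (p - q) * grad_F(q)

def rho_B(q):
    """Total Bregman factor: depends on the tangent slope at q only."""
    return 1.0 / math.sqrt(1.0 + grad_F(q) ** 2)

def rho_J(p, q):
    """Total Jensen factor: depends on the chord slope between p and q."""
    slope = (F(p) - F(q)) / (p - q)
    return 1.0 / math.sqrt(1.0 + slope ** 2)

p, q = 2.0, 5.0
limit_total_jensen = rho_J(p, q) * bregman(p, q)  # stated limit of tJ_alpha as alpha -> 0
total_bregman = rho_B(q) * bregman(p, q)          # tB(p : q)
print(rho_J(p, q), rho_B(q), rho_B(p))
print(limit_total_jensen, total_bregman)
```

For these values the chord-based factor falls strictly between the two tangent-based ones, so the limiting total Jensen divergence matches neither total Bregman divergence.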
Conformal factor from mean value theorem
When $p \simeq q$, $\rho_J(p, q) \simeq \rho_B(q)$, and the total Jensen divergence tends to the total Bregman divergence for any value of $\alpha$.
$$\rho_J(p, q) = \frac{1}{\sqrt{1 + \langle \nabla F(\epsilon), \nabla F(\epsilon) \rangle}} = \rho_B(\epsilon),$$
for $\epsilon \in [p, q]$.
For univariate generators, the value of $\epsilon$ is explicit:
$$\epsilon = (\nabla F)^{-1}\left(\frac{\Delta_F}{\Delta}\right) = \nabla F^{*}\left(\frac{\Delta_F}{\Delta}\right),$$
where $F^{*}$ is the Legendre convex conjugate [5].
Stolarsky mean [7]:
$$tJ_\alpha(p : q) = \rho_B(\epsilon)\, J_\alpha(p : q)$$
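For a univariate generator, the mean value point ε can be computed directly. The following sketch again assumes F(x) = x log x − x, for which F'(x) = log x and its inverse is the exponential; with this choice ε coincides with the identric mean of p and q, one member of the Stolarsky family. Function names are hypothetical.

```python
import math

def F(x):
    """Example generator F(x) = x log x - x (assumption of this sketch)."""
    return x * math.log(x) - x

def grad_F(x):
    """Derivative F'(x) = log x."""
    return math.log(x)

def grad_F_inverse(s):
    """Inverse of F': solves log(x) = s."""
    return math.exp(s)

def mean_value_point(p, q):
    """epsilon = (F')^{-1}(Delta_F / Delta), the point where the tangent slope equals the chord slope."""
    return grad_F_inverse((F(p) - F(q)) / (p - q))

def rho_B(x):
    """Total Bregman conformal factor at x."""
    return 1.0 / math.sqrt(1.0 + grad_F(x) ** 2)

def rho_J(p, q):
    """Total Jensen conformal factor from the chord slope."""
    slope = (F(p) - F(q)) / (p - q)
    return 1.0 / math.sqrt(1.0 + slope ** 2)

p, q = 2.0, 5.0
eps = mean_value_point(p, q)
print(eps, rho_J(p, q), rho_B(eps))
```

The printed ε lies between p and q, and the two conformal factors agree up to rounding, which is the identity ρ_J(p, q) = ρ_B(ε).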
Centroids and statistical robustness
Centroids (barycenters) are minimizers of average (weighted) divergences:
$$L(x; w) = \sum_{i=1}^{n} w_i \times tJ_\alpha(p_i : x),$$
$$c_\alpha = \arg\min_{x \in \mathcal{X}} L(x; w).$$
◮ Is it unique?
◮ Is it robust to outliers [3]?
Iterative convex-concave procedure (CCCP) [5]
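To make the centroid definition concrete, the sketch below evaluates the weighted loss L(x; w) and minimizes it by brute force over a grid. This is deliberately not the CCCP iteration mentioned on the slide: grid search is swapped in as the simplest possible minimizer. It assumes a univariate Burg generator F(x) = −log x, the unscaled J'_α, and hypothetical function names.

```python
import math

def F(x):
    """Burg generator F(x) = -log x on x > 0 (assumption of this sketch)."""
    return -math.log(x)

def total_jensen(p, x, alpha=0.5):
    """Univariate total Jensen divergence tJ'_alpha(p : x), unscaled."""
    if p == x:
        return 0.0
    jensen = alpha * F(p) + (1.0 - alpha) * F(x) - F(alpha * p + (1.0 - alpha) * x)
    slope = (F(p) - F(x)) / (p - x)
    return jensen / math.sqrt(1.0 + slope * slope)

def loss(x, points, weights, alpha=0.5):
    """L(x; w) = sum_i w_i tJ_alpha(p_i : x)."""
    return sum(w * total_jensen(p, x, alpha) for p, w in zip(points, weights))

def grid_centroid(points, weights, lo, hi, steps=3000, alpha=0.5):
    """Brute-force minimizer of the loss over a uniform grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return min(grid, key=lambda x: loss(x, points, weights, alpha))

points = [1.0, 2.0, 4.0]
weights = [1.0 / 3.0] * 3
centroid = grid_centroid(points, weights, lo=min(points), hi=max(points))
print(centroid, loss(centroid, points, weights))
```

A grid search says nothing about uniqueness and only finds the best grid point, so it is useful mainly as a reference against which an iterative scheme can be compared.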
Robustness of Jensen centroids (univariate generator)
Theorem
The Jensen centroid is robust for a strictly convex and smooth generator $f$ if $|f'(\frac{p+y}{2})|$ is bounded on the domain $\mathcal{X}$ for any prescribed $p$.
◮ Jensen-Shannon: $\mathcal{X} = \mathbb{R}^+$, $f(x) = x \log x - x$, $f'(x) = \log x$, $f''(x) = 1/x$.
$|f'(\frac{p+y}{2})| = |\log \frac{p+y}{2}|$ is unbounded when $y \to +\infty$.
JS centroid is not robust.
◮ Jensen-Burg: $\mathcal{X} = \mathbb{R}^+$, $f(x) = -\log x$, $f'(x) = -1/x$, $f''(x) = 1/x^2$.
$|f'(\frac{p+y}{2})| = |\frac{2}{p+y}|$ is always bounded for $y \in (0, +\infty)$.
$$z(y) = 2p^2 \left( \frac{1}{p} - \frac{2}{p + y} \right)$$
When $y \to \infty$, we have $|z(y)| \to 2p < \infty$.
JB centroid is robust.
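The boundedness criterion can be tabulated for the two generators as an outlier y moves away from a fixed p. The sketch below simply evaluates the expressions from the slide; the function names and the chosen value of p are illustrative assumptions.

```python
import math

def js_slope(p, y):
    """|f'((p + y) / 2)| for the Jensen-Shannon generator, f'(x) = log x."""
    return abs(math.log((p + y) / 2.0))

def jb_slope(p, y):
    """|f'((p + y) / 2)| for the Jensen-Burg generator, f'(x) = -1/x."""
    return abs(-2.0 / (p + y))

def z(p, y):
    """z(y) = 2 p^2 (1/p - 2/(p + y)), as given for the Jensen-Burg case."""
    return 2.0 * p * p * (1.0 / p - 2.0 / (p + y))

p = 1.5
for y in (1e1, 1e3, 1e6, 1e9):
    # JS slope keeps growing with y; JB slope shrinks; z(y) approaches 2p.
    print(y, js_slope(p, y), jb_slope(p, y), z(p, y))
```

The Jensen-Shannon column grows without bound (logarithmically), while the Jensen-Burg column stays below 2/p and z(y) settles near 2p, matching the two conclusions above.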
Clustering: No closed-form centroid, no cry!
k-means++ [1] picks the seeds at random, so no centroid computation is needed.
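A divergence-based k-means++ seeding can be sketched as follows: the first seed is drawn uniformly, and each further seed is drawn with probability proportional to its divergence to the closest seed chosen so far. The code is a minimal illustration, not the authors' implementation. It assumes univariate data, a Burg generator, the unscaled J'_α, and the argument order `divergence(point, seed)`; all names are hypothetical.

```python
import math
import random

def F(x):
    """Burg generator F(x) = -log x on x > 0 (assumption of this sketch)."""
    return -math.log(x)

def total_jensen(p, x, alpha=0.5):
    """Univariate total Jensen divergence tJ'_alpha(p : x), unscaled."""
    if p == x:
        return 0.0
    jensen = alpha * F(p) + (1.0 - alpha) * F(x) - F(alpha * p + (1.0 - alpha) * x)
    slope = (F(p) - F(x)) / (p - x)
    return jensen / math.sqrt(1.0 + slope * slope)

def kmeanspp_seeds(points, k, divergence, rng):
    """Pick up to k seeds among the points; only divergences are needed, no centroids."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Potential of each point: divergence to its closest current seed.
        potentials = [min(divergence(x, c) for c in seeds) for x in points]
        if sum(potentials) == 0.0:
            break  # every point already coincides with a seed
        # Sample the next seed proportionally to the potentials.
        seeds.append(rng.choices(points, weights=potentials, k=1)[0])
    return seeds

points = [0.5, 0.6, 0.7, 5.0, 5.5, 6.0, 20.0, 21.0]
seeds = kmeanspp_seeds(points, 3, total_jensen, random.Random(0))
print(seeds)
```

Points that are already seeds have zero potential and therefore zero sampling weight, so the seeds are distinct whenever the data points are.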
Divergence-based k-means++
Theorem
Suppose there exist some $U$ and $V$ such that, $\forall x, y, z$:
$$tJ_\alpha(x : z) \le U \left( tJ_\alpha(x : y) + tJ_\alpha(y : z) \right), \quad \text{(triangular inequality)}$$
$$tJ_\alpha(x : z) \le V\, tJ_\alpha(z : x), \quad \text{(symmetric inequality)}$$
Then the average potential of total Jensen seeding with $k$ clusters satisfies
$$E[tJ_\alpha] \le 2U^2 (1 + V)(2 + \log k)\, tJ_{\mathrm{opt},\alpha},$$
where $tJ_{\mathrm{opt},\alpha}$ is the minimal total Jensen potential achieved by a clustering in $k$ clusters.
Divergence-based k-means++: Two assumptions H
H:
◮ First, the maximal condition number of the Hessian of $F$, that is, the ratio between the maximal and minimal eigenvalue ($> 0$) of the Hessian of $F$, is upper bounded by $K_1$.
◮ Second, we assume the Lipschitz condition on $F$ that $\Delta_F^2 / \langle \Delta, \Delta \rangle \le K_2$, for some $K_2 > 0$.
Lemma
Assume $0 < \alpha < 1$. Then, under assumption H, for any $p, q, r \in S$, there exists $\epsilon > 0$ such that:
$$tJ_\alpha(p : r) \le \frac{2(1 + K_2) K_1^2}{\epsilon} \left( \frac{1}{1 - \alpha}\, tJ_\alpha(p : q) + \frac{1}{\alpha}\, tJ_\alpha(q : r) \right).$$
Divergence-based k-means++
Corollary
The total skew Jensen divergence satisfies the following triangular inequality:
$$tJ_\alpha(p : r) \le \frac{2(1 + K_2) K_1^2}{\epsilon\, \alpha (1 - \alpha)} \left( tJ_\alpha(p : q) + tJ_\alpha(q : r) \right).$$
$$U = \frac{2(1 + K_2) K_1^2}{\epsilon}$$
Lemma
Symmetric inequality condition holds for $V = K_1^2 (1 + K_2) / \epsilon$, for some $0 < \epsilon < 1$.
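Plugging the constants U and V into the seeding bound gives the approximation factor 2U²(1 + V)(2 + log k). The helper below is a back-of-the-envelope sketch: it assumes the same ε is used in both constants and that log is the natural logarithm, neither of which the slides state explicitly. The function names are hypothetical.

```python
import math

def triangle_constant(K1, K2, eps):
    """U = 2 (1 + K2) K1^2 / eps, as given after the corollary."""
    return 2.0 * (1.0 + K2) * K1 ** 2 / eps

def symmetry_constant(K1, K2, eps):
    """V = K1^2 (1 + K2) / eps, from the symmetric inequality lemma."""
    return K1 ** 2 * (1.0 + K2) / eps

def seeding_bound_factor(U, V, k):
    """Factor 2 U^2 (1 + V) (2 + log k) multiplying the optimal potential."""
    return 2.0 * U ** 2 * (1.0 + V) * (2.0 + math.log(k))

# Illustrative values only; K1, K2 and eps depend on the generator and the data domain.
K1, K2, eps = 1.0, 0.0, 0.5
U = triangle_constant(K1, K2, eps)
V = symmetry_constant(K1, K2, eps)
for k in (1, 10, 100):
    print(k, seeding_bound_factor(U, V, k))
```

The factor grows only logarithmically in k but quadratically in U, so the conditioning of the generator (through K1) dominates the guarantee.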
Total Jensen divergences: Recap
Total Jensen divergence = conformal divergence with non-separable double-sided conformal factor.
◮ Invariant to axis rotation of the "design space"
◮ Equivalent to total Bregman divergences [8, 4] only when p ≃ q
◮ Square root of total Jensen-Shannon divergence is not a metric (whereas the square root of the Jensen-Shannon divergence is a metric [2])
◮ Jensen centroids are not always robust (e.g., Jensen-Shannon centroid)
◮ Total Jensen k-means++ does not require centroid computations and comes with a guaranteed approximation factor
Interest of conformal divergences in SVM [9] (double-sided separable), in information geometry [6] (flattening).
Thank you.
@article{totalJensen-arXiv1309.7109 ,
author="Frank Nielsen and Richard Nock",
title="Total {J}ensen divergences: {D}efinition, Properties and $k$-Means++ Clustering",
year="2013",
eprint="arXiv/1309.7109"
}
www.informationgeometry.org
Bibliographic references
[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] Bent Fuglede and Flemming Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, pages 31–31, 2004.
[3] F. R. Hampel, P. J. Rousseeuw, E. Ronchetti, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics, 1986.
[4] Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407–2419, 2012.
[5] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.
[6] Atsumi Ohara, Hiroshi Matsuzoe, and Shun-ichi Amari. A dually flat structure on the space of escort distributions. Journal of Physics: Conference Series, 201(1):012012, 2010.
[7] Kenneth B. Stolarsky. Generalizations of the logarithmic mean. Mathematics Magazine, 48(2):87–92, 1975.
[8] Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen. Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, pages 475–483, 2011.
[9] Si Wu and Shun-ichi Amari. Conformal transformation of kernel functions: a data-dependent way to improve support vector machine classifiers. Neural Processing Letters, 15(1):59–67, 2002.