LETTER Communicated by Andrzej Cichocki
Divergence-Based Vector Quantization
Thomas [email protected] HaaseUniversity of Applied Sciences Mittweida, Department of Mathematics,Natural and Computer Sciences, 09648 Mittweida, Germany
Supervised and unsupervised vector quantization methods for classifi-cation and clustering traditionally use dissimilarities, frequently takenas Euclidean distances. In this article, we investigate the applicability ofdivergences instead, focusing on online learning. We deduce the mathe-matical fundamentals for its utilization in gradient-based online vectorquantization algorithms. It bears on the generalized derivatives of thedivergences known as Frechet derivatives in functional analysis, whichreduces in finite-dimensional problems to partial derivatives in a naturalway. We demonstrate the application of this methodology for widely ap-plied supervised and unsupervised online vector quantization schemes,including self-organizing maps, neural gas, and learning vector quan-tization. Additionally, principles for hyperparameter optimization andrelevance learning for parameterized divergences in the case of super-vised vector quantization are given to achieve improved classificationaccuracy.
1 Introduction
Supervised and unsupervised vector quantization for classification andclustering is strongly associated with the concept of dissimilarity, usuallyjudged in terms of distances. The most common choice is the Euclideanmetric. Recently, however, alternative dissimilarity measures have becomeattractive for advanced data processing. Examples are functional metricslike Sobolev distances or kernel-based dissimilarity measures (Villmann &Schleif, 2009; Lee & Verleysen, 2007). These metrics take the functional struc-ture of the data into account (Lee & Verleysen, 2005; Ramsay & Silverman,2006; Rossi, Delannay, Conan-Gueza, & Verleysen, 2005; Villmann, 2007).
Information theory–based vector quantization approaches are proposedconsidering divergences for clustering (Banerjee, Merugu, Dhillon, &Ghosh, 2005; Jang, Fyfe, & Ko, 2008; Lehn-Schiøler, Hegde, Erdogmus, &Principe, 2005; Hegde, Erdogmus, Lehn-Schiøler, Rao, & Principe, 2004). Forother data processing methods like multidimensional scaling (MDS; Lai &Fyfe, 2009), stochastic neighbor embedding (Maaten & Hinton, 2008), blind
Neural Computation 23, 1343–1392 (2011) C© 2011 Massachusetts Institute of Technology
1344 T. Villmann and S. Haase
source separation (Minami & Eguchi, 2002), or nonnegative matrix factor-ization (Cichocki, Lee, Kim, & Choi, 2008), divergence-based approachesare also introduced. In prototype-based classification, first approaches us-ing information-theoretic approaches have been proposed (Erdogmus, 2002;Torkkola, 2003; Villmann, Hammer, Schleif, Hermann, & Cottrell, 2008).
Yet a systematic analysis of prototype-based clustering and classifica-tion relying on divergences has not yet been given. Further, the existingapproaches usually are carried out in batch mode for optimization but arenot available for online learning, which requires calculating the derivativesof the underlying metrics (i.e., divergences).
In this letter, we offer a systematic approach for divergence-based vec-tor quantization using divergence derivatives. For this purpose, importantbut general classes of divergences are identified, widely following and ex-tending the scheme introduced by Cichocki, Zdunek, Phan, and Amari(2009). The mathematical framework for functional derivatives of continu-ous divergences is given by the functional-analytic generalization of com-mon derivatives—the concept of Frechet derivatives (Frigyik, Srivastava, &Gupta, 2008b; Kantorowitsch & Akilow, 1978). This can be seen as a gener-alization of partial derivatives for discrete variants of the divergences. Thefunctional approach is here preferred for clarity. Yet it also offers greater flex-ibility in specific variants of functional data processing (Villmann, Haase,Simmuteit, Haase, & Schleif, 2010).
After characterizing the different classes of divergences and introducingFrechet derivatives, we apply this framework to several divergences anddivergence classes to obtain generalized derivatives, which can be used foronline learning in divergence-based methods for supervised and unsuper-vised vector quantization as well as other gradient-based approaches. Weexplicitly explore the derivatives to provide examples.
Then we consider some of the most prominent approaches for unsu-pervised as well as supervised prototype-based vector quantization in thelight of divergence-based online learning using Frechet derivatives, includ-ing self-organizing maps (SOM), neural gas (NG), and generalized learningvector quantization (GLVQ). For the GLVQ supervised approach, we alsoprovide a gradient learning scheme, hyperparameter adaptation, for opti-mizing parameters that occur in the case of parameterized divergences.
The focus of the letter is mainly on giving a unified framework for theapplication of widely ranged divergences and classes thereof in gradient-based online vector quantization and their mathematical foundation. Weformulate the problem in a functional manner following the approaches inFrigyik et al. (2008b), Csiszar (1967), and Liese and Vajda (2006). This allowsa compact description of the mathematical theory based on the concept ofFrechet derivatives. We also state that the functional approach includes alarger class of divergence functionals than the discrete (pointwise) approachas Frigyik, Srivastava, and Gupta (2008a) point out. Beside these extensions,the functional approach using Frechet derivatives obviously reduces to
Divergence-Based Vector Quantization 1345
partial derivatives for the discrete case. We therefore prefer the functionalapproach in this letter.
However, as a proof of concept, we show for several classes of param-eterized divergences their utilization in SOM learning for an artificial butillustrate example in comparison to Euclidean distance learning as stan-dard.
2 Characterization of Divergences
Generally, divergences are functionals designed for determining a similaritybetween nonnegative integrable measure functions p and ρ with a domainV and the constraints p (x) ≤ 1 and ρ (x) ≤ 1 for all x∈V. We denote suchmeasure functions as positive measures. The weight of the functional p isdefined as
W (p) =∫
Vp (x) dx. (2.1)
Positive measures p with weight W (p) = 1 are denoted as (probability)density functions.1
Divergences D (p||ρ) are defined as functionals that have to be nonneg-ative and zero iff p ≡ ρ except on a zero-measure set. Further, D (p||ρ) isrequired to be convex with respect to the first argument. Yet divergences areneither necessarily symmetric nor have to fulfill the triangle inequality asit is supposed for metrics. According to the classification given in Cichockiet al. (2009), one can distinguish at least three main classes of divergences:Bregman divergences, Csiszar’s f -divergences, and γ -divergences empha-sizing different properties. We offer some basic properties about these butdo not go into detail about them because this would be outside the scopeof the letter. (For detailed property investigations, see Cichocki & Amari,2010, and Cichocki et al., 2009.)
We generally assume that p and ρ are positive measure (densities) thatare not necessarily normalized. In case of (normalized) densities, we explic-itly refer to these as probability densities.
2.1 Bregman Divergences. Bregman divergences are defined by gener-ating convex functions � in the following way using a functional interpre-tation (Bregman, 1967; Frigyik et al., 2008b).
Let � be a strictly convex real-valued function with the domain L(the Lebesgue-integrable functions). Further, � is assumed to be twice
1Each setF = { f } of arbitrary nonnegative integrable functionals f with domain V canbe transformed into a set of positive measures simply by p = f
c with c = sup f ∈F [W ( f )].
1346 T. Villmann and S. Haase
Figure 1: Illustration of the Bregman divergence DB�(p||ρ) as a vertical distance
between p and the tangential hyperplane to the graph of � at point ρ, taking pand ρ as points in a functional space.
continuously Frechet differentiable (Kantorowitsch & Akilow, 1978). ABregman divergence is defined as DB
� : L × L −→ R+ with
DB� (p||ρ) = � (p) − � (ρ) − δ� (ρ)
δρ[p − ρ] , (2.2)
whereby δ�(ρ)δρ
is the Frechet derivative of � with respect to ρ (see section3.1).
The Bregman divergence DB� (p||ρ) can be interpreted as a measure of
convexity of the generating function �. Taking p and ρ as points in afunctional space, DB
� (p||ρ) plays the role of vertical distance between p andthe tangential hyperplane to the graph of � at point ρ, which is illustratedin Figure 1.
Bregman divergences are linear according to the generating function �:
DB�1+λ�2
(p||ρ) = DB�1
(p||ρ) + λ · DB�2
(p||ρ) .
Further, DB� (p||ρ) is invariant under affine transforms � (q ) = � (q ) +
�g [q ] + ξ for positive measures g and q with a requirement that � (q ),� (q ), and � �g [q ] are not independent but have to be related according to
�g [q ] = δ� (g)δg
· q − δ� (g)δg
· q .
Further, �g is supposed to be a linear operator independent of q (Frigyiket al., 2008a) and ξ is a scalar. In that case,
DB� (p||ρ) = DB
� (p||ρ)
Divergence-Based Vector Quantization 1347
is valid. Further, the generalized Pythagorean theorem holds for any triplep, ρ, τ of positive measures:
DB� (p||τ ) = DB
� (p||ρ) + DB� (ρ||τ ) + δ� (ρ)
δρ[p − ρ] − δ� (τ )
δτ[p − ρ] .
The sensitivity of a Bregman divergence at p is defined as
s (p, τ ) = ∂2 DB� (p||p + ατ )
∂α2 |α=0 (2.3)
= −τδ2� (p)
δp2 τ,
with τ ∈ L and the restriction that∫
τ (x) dx = 0 (Santos-Rodrıguez,Guerrero-Curieses, Alaiz-Rodrıguez, & Cid-Sueiro, 2009). Note that δ2�(p)
δp2
is the Hessian of the generating function. The sensitivity s (p, τ ) measuresthe velocity of change of the divergence at point p in the direction of τ .
A last property mentioned here is an optimality one (Banerjee et al.,2005). Given a set S of positive measures p with the (functional) meanμ = E [p ∈ S] and the additional restriction that μ is a relative interior ofS,2 then for given p ∈ S, the unique minimizer of E p
[DB
� (p||ρ)]
is ρ =μ. The inverse direction of this statement is also true: if E p
[DB
F (p||ρ)]
isminimum for ρ = μ, then DB
F (p||ρ) is a Bregman divergence. This propertypredestinates Bregman divergences for clustering problems.
Finally, we give some important examples:� Generalized Kullback-Leibler divergence for non-normalized p and
ρ (Cichocki et al., 2009):
DG K L (p||ρ) =∫
p log(
pρ
)dx−
∫(p − ρ) dx (2.4)
with the generating function
� ( f ) =∫ (
f · log f − f)
dx.
If p and ρ are normalized densities (probability densities), DG K L (p||ρ)is reduced to the usual Kullback-Leibler divergence (Kullback &Leibler, 1951; Kapur, 1994),
DK L (p||ρ) =∫
p log(
pρ
)dx, (2.5)
2If S follows a statistical distribution with existing functional expectation value ES,then the mean μ can be replaced by ES.
1348 T. Villmann and S. Haase
which is related to the Shannon-entropy (Shannon, 1948),
HS (p) = −∫
p log (p) dx, (2.6)
via
DK L (p||ρ) = VS (p, ρ) − HS (p) ,
where
VS (p, ρ) = −∫
p log (ρ) dx
is Shannon’s cross-entropy.� Itakura-Saito divergence (Itakura & Saito, 1973),
DI S (p||ρ) =∫ [
pρ
− log(
pρ
)− 1
]dx, (2.7)
based on the Burg entropy,
H B (p) = −∫
log (p) dx,
which also serves as the generating function
� ( f ) = H B ( f ) .
The Itakura-Saito divergence is also known as negative cross-Burg entropy and fulfills the scale-invariance property, that is,DI S (c · p||c · ρ) = DI S (p||ρ). So the same relative weight is given tolow- and high-energy components of p (Bertin, Fevotte, & Badeau,2009). Due to this, the Itakura-Saito divergence is frequently appliedin image processing and sound processing.
� The Euclidean distance in terms of a Bregman divergence is obtainedby the generating function
� ( f ) =∫
f 2dx.
We extend this definition and introduce the parameterized version,
�η ( f ) =∫
f ηdx,
defining the η-divergence, also known as norm-like divergence(Nielsen & Nock, 2009):
Dη (p||ρ) =∫
pη + (η − 1) · ρη − η · p · ρ(η−1)dx , (2.8)
which converges to the Euclidean distance for η → 2. To ensure theconvexity of � ( f ), the restriction to η > 1 is required.
Divergence-Based Vector Quantization 1349
If we assume that p and ρ are positive measures, then an important subsetof Bregman divergences belongs to the class of β-divergences (Eguchi &Kano, 2001), which are defined, following Cichocki et al. (2009), as
Dβ (p||ρ) =∫
p · pβ−1 − ρβ−1
β − 1dx −
∫pβ − ρβ
βdx (2.9)
=∫
pβ
(1
β − 1− 1
β
)− ρβ−1
(p
β − 1− ρ
β
)dx, (2.10)
with β �= 1 and β �= 0 with the generating function
� ( f ) = f β − β · f + β − 1β (β − 1)
.
In the limit β → 1, the divergence Dβ (p, ρ) becomes the generalizedKullback-Leibler divergence (see equation 2.4).3 The limit β → 0 givesthe Itakura-Saito divergence (see equation 2.7). Further, β-divergences areequivalent to the density power divergences Dβ introduced in Basu, Harris,Hjort, and Jones (1998) by
Dβ (p||ρ) = 1(1 + β)
Dβ (p||ρ) .
Obviously, the η divergence (see equation 2.8) is a rescaled version of theβ-divergence:
Dη (p||ρ) = β · (β − 1) · Dβ (p||ρ) .
Thus, we see that for β = 2, the β-divergence Dβ (p||ρ) becomes (half) theEuclidean distance.
2.2 Csiszar’s f -Divergences. Csiszar’s f -divergences are defined forreal-valued, convex, continuous functions f ∈ F with f (1) = 0 (withoutloss of generality) whereby
F = {g|g : [0,∞) → R, g - convex
}.
The f -divergences Df for positive measures p and ρ are given by
Df (p||ρ) =∫
ρ · f(
pρ
)dx, (2.11)
3The relations pγ −ργ
γ−→γ→0
log pρ
and pγ −1γ
−→γ→0
log p hold.
1350 T. Villmann and S. Haase
with the definitions 0 · f( 0
0
) = 0, 0 · f( a
0
) = limx→0 x · f( a
x
) = limu→∞ a ·f (u)u (Csiszar, 1967; Liese & Vajda, 2006; Taneja & Kumar, 2004). f is called
the determining function for Df (p||ρ). It corresponds to a generalized f -entropy (Cichocki et al., 2009) of the form
Hf (p) = −∫
f (p) dx (2.12)
via
Hf (p) = −Df (p||I) + c, (2.13)
with I being the constant function of value 1 and c is a divergence-depending constant (Cichocki & Amari, 2010).
The f -divergence Df can be interpreted as an average (with respect toρ) of the likelihood ratio p
ρdescribing the change rate of p with respect
to ρ weighted by the determining function f . Df (p||ρ) is jointly convexin both p and ρ. Further, f defines an equivalence class in the sense thatDf (p||ρ) = Df (p||ρ) iff f (x) = f (x) + c · (x − 1) for c ∈ R, that is, Df (p||ρ)is invariant according to a linear shift regarding the determining function f .For f -divergences, a certain kind of symmetry can be stated. Let f, f ∗ ∈ Fand f ∗ is the conjugate function of f , that is, f ∗ (x) = x · f
( 1x
)for x ∈ (0,∞).
Then the relation Df (p||ρ) = Df ∗ (ρ||p) is valid iff the conjugate differsfrom the original by a linear shift as above: f (x) = f ∗ (x) + c · (x − 1). Asymmetric divergence can be obtained for an arbitrary convex functiong ∈ F using its conjugate g∗ for the definition f = g + g∗ as a determiningfunction. Further, the conjugate is important for an upper bound of thedivergence. Let u = p
ρand p as well as ρ densities. Then the f -divergence
is bounded by
0 ≤ Df (p||ρ) ≤ limu→0+
{f (u) + f ∗ (u)
}(2.14)
if the limit exists, as it was shown in Liese and Vajda (1987). Yet this state-ment can extended to p and ρ being positive measures:
Lemma 1. Let p and ρ be positive measures. Then the bounds given in equation2.14 are still valid.
Proof. The proof is given in appendix B.
An important and characterizing property is the monotonicity with re-spect to the coarse graining of the underlying domain D of the positivemeasures p and ρ, which is similarly to the monotonicity of the Fisher
Divergence-Based Vector Quantization 1351
metric (Amari & Nagaoka, 2000). Let K ={κ(y|x) ≥ 0, x∈D, y∈Dy}, with Dy
being the range of y. κ describes a transition probability density, that is,∫κ(y|x)dy = 1 holds ∀x∈ D. Denoting the positive measures of y derived
from p(x) and ρ (x) by pκ (y) and ρκ (y), the monotonicity is expressedby Df (p||ρ) ≥ Df (pκ ||ρκ ).4 Further, an isomorphism can be stated for f -divergences in the following way. Let
h : x �−→y (2.15)
be an invertible function transforming positive measures p1 (x) and ρ1 (x) top2 (y) and ρ2 (y). Then Df (p1||ρ1) = Df (p2||ρ2) holds, and the pairs (p1, ρ1)and (p2, ρ2) are called isomorph (Liese & Vajda, 1987). Conversely, if a mea-sure D (p||ρ) = ∫
ρ (x) · G (p (x) , ρ (x)) dx for an integrable function G is in-variant according to invertible transformations h, then D is an f -divergence(Qiao & Minematsu, 2008). This isomorphism, as well as the monotonicity,employ f -divergences for application in speech, signal, and pattern recog-nition (Basseville, 1988; Qiao & Minematsu, 2008). Finally, Cichocki et al.(2009), suggested a generalization of the f -divergences Df . In that diver-gence, f is no longer convex. It is proposed to be
DGf (p||ρ) = c f
∫(p − ρ) dx+
∫ρ · f
(pρ
)dx (2.16)
with c f = f ′ (1) �= 0 and denoted as a generalized f -divergence. As a con-sequence of this relaxation of the convexity condition, in the case of p and ρ
being probability densities, the first term vanishes, such that the usual formof f -divergences is obtained. Thus, as a famous example, the Hellingerdivergence (Taneja & Kumar, 2004) is
DH (p||ρ) = 12
∫(√
p − √ρ)2 dx, (2.17)
with the generating function f (u) = 2(1 − √
u)
for u = pρ
. Acoording toCichocki et al. (2009), DH (p||ρ) is a properly defined f -divergence only forprobability densities p and ρ.
As the β divergences in the case of Bregman divergences, one can iden-tify an important subset of the f -divergences—the so-called α-divergences
4The equality holds iff the conditional densities pκ (x|y) = p(x)·κ(y|x)pκ (y) and ρκ (x|y) =
ρ(x)·κ(y|x)ρκ (y) are identical (see Amari & Nagaoka, 2000).
1352 T. Villmann and S. Haase
according to the definition given in Cichocki et al. (2009):
Dα (p||ρ) = 1α (α − 1)
∫ [pαρ1−α − α · p + (α − 1) ρ
]dx (2.18)
= 1α (α − 1)
∫ [ρ
((pρ
)α
+ (α − 1))
− α · p]
dx (2.19)
with the generating f -function
f (u) = u
(uα−1 − 1
)α2 − α
+ 1 − uα
and u = ρ
p and α > 0. In the limit α → 1 the generalized Kullback-Leiblerdivergence DG K L (see equation 2.4) is obtained. Further, Cichocki et al.(2009) state that β-divergences can be generated from α-divergences byapplying the nonlinear transforms
p → pβ+2 and ρ → ρβ+2 with α = 1β + 1
.
In addition to the general properties of the f -divergences stated here,one can derive a characteristic behavior for the α-divergences directly fromequation 2.18 depending on the choice of the parameter α (Minka, 2005). Forα � 0 the minimization of Dα (p||ρ) to estimate ρ (x) may exclude modesof the target p (x). Further, for α ≤ 0, the α-divergence is zero-forcing (i.e.,p (x) = 0 forces ρ (x) = 0), while for α ≥ 1, it is zero-avoiding (i.e., ρ (x) > 0whenever p (x) > 0). For α → ∞, ρ (x) covers p (x) completely, and the α-divergence is called inclusive in that case.
The Tsallis-divergence is a widely applied divergence related to α-divergence (see equation 2.18); however, it is defined only for probabilitydensities. It is defined as
DTα (p||ρ) = −
∫p · log
α
(ρ
p
)dx (2.20)
with the convention
logα (z) = z1−α − 1
1 − α(2.21)
such that
DTα (p||ρ) = 1
1 − α
(1 −
∫pαρ1−αdx
)(2.22)
Divergence-Based Vector Quantization 1353
and α �= 1. Obviously this is a rescaled version of the α-divergence (see equa-tion 2.18), which holds only for probability densities (Cichocki & Amari,2010):
DTα (p||ρ) = α · Dα (p||ρ) . (2.23)
The Tsallis divergence is based on the Tsallis entropy,
HTα (p) =− 1
α − 1
(∫pαdx − 1
)(2.24)
=∫
p logα
(1p
)dx, (2.25)
with logα (p) as defined in (see equation 2.21). In the limit, α → 1 for
HTα (p) becomes the Shannon entropy (see equation 2.6) and the divergence
DTα (p||ρ) converges to the Kullback-Leibler divergence (see equation 2.5).Further, the α-divergences are closely related to the generalized Renyi
divergences defined as (Amari, 1985; Cichocki et al., 2009):
DG Rα (p||ρ) = 1
α − 1log
(∫ [pαρ1−α − α · p + (α − 1) ρ
]dx + 1
)(2.26)
for positive measures ρ and p. Lemma 1 can be used to write the generalizedRenyi divergence in terms of the α-divergence:5
DG Rα (p||ρ) = 1
α − 1log (1 + α · (α − 1) · Dα (p||ρ)) . (2.27)
For probability densities, DG Rα (p||ρ) reduces to the usual Renyi divergence
(Renyi, 1961, 1970):
DRα (p||ρ) = 1
α − 1log
(∫pαρ1−αdx
). (2.28)
5A careful transformation of the parameter α is required for exact transformationsbetween both divergences. For details, see Amari (1985) and Cichocki et al. (2009). Further,this statement was given in this book without proving the bounds of the underlying f -divergence for positive measures as it is given in this letter by lemma 1.
1354 T. Villmann and S. Haase
The divergence DRα (p||ρ) is based on the Renyi entropy
H Rα (p) = − 1
α − 1log
(∫pαdx
)(2.29)
via equation 2.13. The Renyi entropy fulfills the additivity property forindependent probabilities p and q :
H Rα (p × q ) = H R
α (p) + H Rα (q ) .
Further, the entropy H Rα (p) is related to the Tsallis entropy (see equation
2.25) by
H Rα (p) = − 1
α − 1log
(1 + (1 − α) · HT
α
),
which, however, has in consequence a different subadditivity property,
HTα (p × q ) = HT
α (p) + HTα (q ) + (1 − α) · HT
α (p) · HTα (q ),
for α �= 1.
2.3 γ -Divergences. A class of robust divergences with respect tooutliers has been proposed by Fujisawa and Eguchi (2008).6 Called γ -divergences it is defined for positive measures ρ and p as
Dγ (p||ρ) = log
⎡⎣(∫pγ+1dx
) 1γ (γ+1) · (∫
ργ+1dx) 1
γ+1(∫p · ργ dx
) 1γ
⎤⎦ (2.30)
= 1γ + 1
log
[(∫pγ+1dx
) 1γ
·(∫
ργ+1dx)]
(2.31)
− log
[(∫p · ργ dx
) 1γ
].
The divergence Dγ (p||ρ) is invariant under scalar multiplication withpositive constants c1 and c2:
Dγ (p||ρ) = Dγ (c1 · p||c2 · ρ) .
6The divergence Dγ (p||ρ) is proposed to be robust for γ ∈ [0, 1] with the existence ofDγ=0 in the limit γ → 0. A detailed analysis of robustness is given in Fujisawa and Eguchi(2008).
Divergence-Based Vector Quantization 1355
The equation Dγ (p||ρ) = 0 holds only if p = c · ρ (c > 0) in the case ofpositive measures. Yet for probability densities, c = 1 is required. In con-tradiction to the f -divergences, an isomorphism here can be stated forh-transformations (see equation 2.15) which are more strictly assumed tobe affine.
As for Bregman divergences, a modified Pythagorean relation betweenpositive measures can be stated for special choices of positive measures p,ρ, τ . Let p be a distortion of ρ defined as a convex combination with apositive distortion measure δ:
pε (x) = (1 − ε) · ρ (x) + ε · δ (x)
Further, a positive measure g is denoted as δ-consistent if
νg =(∫
δ (x) g (x)α dx) 1
α
is sufficiently small for large α > 0. If two positive measures ρ and τ are δ-consistent according to a distortion measure δ, then the Pythagorean relationapproximately holds for ρ, τ and the distortion pε of ρ,
� (pε, ρ, τ ) = Dγ (pε||τ ) − Dγ (pε||ρ) − Dγ (ρ||τ ) = O (ενγ ) ,
with ν = max{νρ, ντ
}. This property implies the robustness of γ -
divergences with respect to distortions according to the resulting approxi-mation,
Dγ (pε||τ ) ≈ Dγ (pε||ρ) + Dγ (ρ||τ ) ,
and Dγ (pε||ρ) should be small because pε is assumed to be a distortion ofρ (Fujisawa & Eguchi, 2008).
In the limit γ → 0 Dγ (ρ||p) becomes the usual Kullback-Leibler diver-gence (see equation 2.5) DK L (ρ|| p) with normalized densities
ρ = ρ
W (ρ)and p = p
W (p).
For γ = 1 the γ -divergence becomes the Cauchy-Schwarz divergence
DC S (p||ρ) = 12
log(∫
ρ2dx·∫
p2dx)
− log (V (p, ρ)) (2.32)
1356 T. Villmann and S. Haase
with
V (p, ρ) =∫
p · ρ dx (2.33)
being the cross-correlation potential. The Cauchy-Schwarz divergenceDC S (p||ρ) was introduced by Principe, Xu, and Fisher (2000) considering theCauchy-Schwarz inequality for norms. It is based on the quadratic Renyi-entropy H R
2 (p) from equation 2.29 (Jenssen, 2005). Obviously, DC S (p||ρ) issymmetric. It is frequently applied for Parzen window estimation and isparticularly suitable for spectral clustering as well as for related graph cutproblems (Jenssen, Principe, Erdogmus, & Eltoft, 2006).
3 Derivatives of Divergences: A Functional Analytic Approach
In this section we provide the mathematical formalism of generalizedderivatives for functionals p and ρ, known as Frechet derivatives or func-tional derivatives. First, we briefly reconsider the theory of functionalderivatives derivatives. Then we investigate the divergence classes withinthis framework. In particular, we explain their Frechet derivatives.
3.1 Functional (Frechet) Derivatives. Suppose X and Y are Banachspaces, U ⊂ X is open, and F : X → Y. F is called Frechet differentiable atx ∈ X if there exists a bounded linear operator δF [x]
δx : X → Y, such that forh ∈ X, the limit is
limh→0
∥∥F (u + h) − F (u) − δF [u]δu [h]
∥∥Y
‖h‖X= 0 .
This general definition can be focused for functional mapping. Let L be afunctional mapping from a linear, functional Banach space B to R. Further,let B be equipped with a norm ‖·‖, and f, h ∈ B are two functionals. TheFrechet derivative δL[ f ]
δ f of L at point f is formally defined as
limε→0
1ε
(L [ f + εh] − L [ f ]) =:δL [ f ]
δ f[h] ,
with δL[ f ]δ f [h] linear in h. The existence and continuity of the limit are equiv-
alent to the existence and continuity of the derivative. (For a detailed intro-duction, see Kantorowitsch & Akilow, 1978.)
If L is linear, then L [ f + εh] − L [ f ] = εL [h] and, hence, δL[ f ]δ f [h] =
L [h]. Further, an analogon of the chain rule known from differential
Divergence-Based Vector Quantization 1357
calculus can be stated: let F : R → R be a continuously differentiable map-ping. We consider the functional
L [ f ] =∫
F ( f (x)) dx.
Then the Frechet derivative δL[ f ]δ f [h] is determined by the derivative F ′ as
can be seen from
1ε
(L [ f + εh] − L [ f ]) = 1ε
∫F ( f (x) + εh (x)) − F ( f (x)) dx
= 1ε
∫F ′ ( f (x)) · εh (x) + O
(ε2h (x)2
)dx
−→ ε→0
∫F ′ ( f (x)) · h (x) dx
and use of the linear property of the integral operator.This property motivates an important remark about divergences, which
can be seen as special integral operators:
Remark 1. Let Lg be an integral operator Lg [ f ] = ∫Fg ( f (x)) dx depend-
ing on a fixed functional g ∈ B. Then the Frechet derivative δLg [ f ]δ f =∫
F ′g ( f (x)) · h (x) dx is determined by the integral kernel F ′
g ( f (x)) =Q (g (x) , f (x)) being a function in x. Therefore, frequently the Frechetderivative δLg [ f ]
δ f is simply identified with Q (g (x) , f (x)) and written asδLg [ f ]
δ f = Q (g (x) , f (x)) but keeping in mind its original interpretation as anintegral kernel defining the integral operator. We will make use from thisabbreviation in the following considering divergences as integral operatorsD (p||ρ) = L p [ρ] and write δD(p||ρ)
δρ= Q (g, f ), also denoted here as Frechet
derivative, for simplicity.
Finally, we remark that the Frechet derivative in finite-dimensionalspaces reduces to the usual partial derivative. In particular, it is repre-sented in coordinates by the Jacobi matrix. Thus, the Frechet derivative is ageneralization of the directional derivatives.
3.2 Frechet Derivatives for the Different Divergence Classes. We arenow ready to investigate functional derivatives of divergences. In particularwe focus on Frechet derivatives.
1358 T. Villmann and S. Haase
3.2.1 Bregman Divergences. We investigate the Frechet derivative for theBregman divergences (see equation 2.2) and formally obtain
δDB� (p||ρ)δρ
= � (p)δρ
− � (ρ)δρ
−δ[
δ�(ρ)δρ
(p − ρ)]
δρ(3.1)
with
δ[
δ�(ρ)δρ
(p − ρ)]
δρ= δ2 [� (ρ)]
δρ2 (p − ρ) − δ� (ρ)δρ
.
In the case of the generalized Kullback-Leibler-divergence (see equation2.4) this reads as
δDG K L (p||ρ)δρ
= − pρ
+ 1, (3.2)
whereas for the usual Kullback-Leibler divergence, equation 2.5,
δDK L (p||ρ)δρ
= − pρ
(3.3)
is obtained.For the Itakura-Saito divergence, equation 2.7, we get
δDI S (p||ρ)δρ
= 1ρ2 (ρ − p) . (3.4)
The η-divergence, equation 2.8, leads to
δDη (p||ρ)δρ
= ρη−2 · (1 − η) · η · (p − ρ) , (3.5)
which reduces in the case of η = 2 to the derivative of the Euclidean dis-tance −2 (p − ρ), commonly used in many vector quantization algorithms,including the online variant of k-means, SOMs, NG, and so on.
Further, for the subset of β-divergences, equation 2.9, we have
δDβ (p||ρ)δρ
= −p · ρβ−2 + ρβ−1 (3.6)
= ρβ−2 (ρ − p) . (3.7)
Divergence-Based Vector Quantization 1359
3.2.2 f -Divergences. For f -divergences, equation 2.11, the Frechetderivative is
δDf (p||ρ)δρ
= f(
pρ
)+ ρ
∂ f (u)∂u
δuδρ
= f(
pρ
)+ ρ
∂ f (u)∂u
· −pρ2 , (3.8)
with u = pρ
. As a famous example, we get for the Hellinger divergence,equation 2.17,
δDH (p||ρ)δρ
= 1 −√
pρ
. (3.9)
The subset of α-divergences, equation 2.18, can be handled by
δDα (p||ρ)δρ
= − 1α
(pαρ−α − 1
). (3.10)
The related Tsallis divergence DTα , equation 2.22, leads to the derivative
δDTα (p||ρ)δρ
= −(
pρ
)α
(3.11)
depending on the parameter α. The generalized Renyi divergences, equa-tion 2.26, are treated according to
δDG Rα (p||ρ)
δρ=− pαρ−α − 1∫
[pαρ1−α − α · p + (α − 1)ρ] dx + 1
= α∫[pαρ1−α − α · p + (α − 1)ρ] dx + 1
δDα(p||ρ)δρ
, (3.12)
which is reduced to
δDRα (p||ρ)δρ
= − pαρ−α∫pαρ1−α dx
(3.13)
in the case of the usual Renyi divergences, equation 2.28.
3.2.3 γ -Divergences. For the γ -divergences, we rewrite equation 2.30 as
Dγ (p||ρ) = 1γ + 1
ln F1 − ln F2,
1360 T. Villmann and S. Haase
with F1 = (∫pγ+1dx
) 1γ · (∫
ργ+1dx)
and F2 = (∫p · ργ dx
) 1γ . Then we get
δDγ (p||ρ)δρ
= 1γ + 1
1F1
δF1
δρ− 1
F2
δF2
δρ
with
δF1
δρ=
(∫pγ+1dx
) 1γ
(∫ργ+1dx
)δρ
=(∫
pγ+1dx) 1
γ
(γ + 1) ργ
and
δF2
δρ= 1
γ
(∫p · ργ dx
) 1γ−1 p · ργ
δρ
=(∫
p · ργ dx) 1
γ−1
pργ−1 ,
such that δDγ (p||ρ)δρ
finally yields
δDγ (p||ρ)δρ
= ργ∫ργ+1dx
− pργ−1∫p · ργ dx
(3.14)
= ργ−1[
ρ∫ργ+1dx
− p∫p · ργ dx
]. (3.15)
Considering the important special case γ = 1, the Frechet derivative of theCauchy-Schwarz divergence, equation 2.32, is derived:
δDC S (p||ρ)δρ
= ρ∫ρ2dx
− pV (p, ρ)
. (3.16)
4 Divergence-Based Online Vector Quantization Using Derivatives
Supervised and unsupervised vector quantization frequently are describedin terms of dissimilarities or distances. Suppose data are given as datavectors v ∈ R
n.Here we focus on prototype-based vector quantization: data processing
(clustering or classification) is realized using prototypes w ∈ Rn as represen-
tatives, whereby the dissimilarity between data points, as well as betweendata and prototypes, is determined by dissimilarity measures ξ (not neces-sarily fulfilling triangle inequality or symmetry restrictions).
Divergence-Based Vector Quantization 1361
Frequently, such algorithms somewhat optimize a cost function E de-pending on the dissimilarity between the data points and the prototypes;usually one has E = E (ξ (vi , wk)) and i = 1, . . . , N the number of data andk = 1, . . . , C the number of prototypes. This cost function can be a variantof the usual classification error in supervised learning or modified meansquared error of the dissimilarities ξ (vi , wk).
If E = E (ξ (vi , wk)) is differentiable with respect to ξ , and ξ is differ-entiable with respect to the prototype w, then a stochastic gradient mini-mization is a widely used optimization scheme for E . This methodologyimplies the calculation of the dissimilarity derivatives ∂ξ
∂wk, which now has
to be considered in light of the above functional analytic investigationsfor divergence measures (i.e., we replace the dissimilarity measure ξ bydivergences).
Therefore, we now assume that the data vectors are discrete representa-tions of continuous positive measures p (x) with vi = p (xi ), i = 1, . . . , n asrequired for divergences. Such data may be spectra or other frequency dataoccurring in many kinds of application like remote-sensing data analysis,mass spectrometry, or signal processing. Thereby, the restriction vi ∈ [0, 1]for positive measures can be fulfilled simply by dividing all data vectors bythe maximum vector entry, taking over all vectors and vector componentsof the data set. In case of probability densities, a subsequent normalizationto stress ‖v‖1 = 1 is required.
Further, we also identify the prototypes as discrete realizations of posi-tive measures ρ (x). Then the derivative ∂ξ
∂w has to be replaced by the (abbre-viated) Frechet derivative δξ
δρin the continuous case (see remark 1), which
reduces to usual partial derivatives in the discrete case. This is formallyachieved by replacing p and ρ by their vectorial counterparts v and w inthe formulas of the divergences provided in section 3.2 and further trans-lating integrals into sums.
In the following, we give prominent examples of unsupervised and su-pervised vector quantization, which can be optimized by gradient methodsusing the framework already introduced.
4.1 Unsupervised Vector Quantization
4.1.1 Basic Vector Quantization. Unsupervised vector quantization is aclass of algorithm for distributing prototypes W = {wk}Z, wk∈ R
n such thatdata points v ∈ V ⊆ R
n are faithfully represented in terms of a dissimilaritymeasure ξ . Thereby, C = card (Z) is the cardinality of the index set Z. Moreformally, the data point v is represented by this prototype ws(v) minimizingthe dissimilarity ξ (v, wk):
v �→ s (v) = argmink∈Z
ξ (v, wk). (4.1)
1362 T. Villmann and S. Haase
The aim of the algorithm is to distribute the prototypes in such a way thatthe quantization error
EVQ = 12
∫P (v) ξ
(v, ws(v)
)dv (4.2)
is minimized. In its simplest form, basic vector quantization (VQ) leads toa (stochastic) gradient descent on EVQ with
� ws(v) = −ε · ∂ξ(v, ws(v)
)∂ws(v)
(4.3)
for prototype update of the winning prototype ws(v) according to equation4.1, also known as the online variant of LBG algorithm (C–means; Linde,Buzo, & Gray, 1980; Zador, 1982). Here, ε is a small, positive value called thelearning rate. As we see, update 4.3 takes into account the derivative of thedissimilarity measure ξ with respect to the prototype. Beside the commonchoice of ξ being the squared Euclidean distance, the choice is given to theuser with the restriction of differentiability. Hence, here we are allowed toapply divergences using derivatives in the sense of Frechet derivatives.
4.1.2 Self-Organizing Maps and Neural Gas. There are several variants ofthe basic vector quantization scheme to avoid local minima or realize aprojective mapping. For example, the latter can be obtained by introducinga topological structure in the index set Z and denoting this strucure as A,usually a regular grid. The resulting vector quantization scheme is the self-organizing map (SOM) introduced by Kohonen (1997). The respective costfunction (in the variant of Heskes, 1999) is
ESOM = 12K (σ )
∫P(v)
∑r∈A
δs(v)r
∑r′∈A
hSOMσ (r, r′)ξ ( v, wr′ ) dv (4.4)
with the so-called neighborhood function
hSOMσ (r, r′) = exp
(−
∥∥r − r′∥∥A
2σ 2
),
and∥∥r − r′∥∥
A is the distance in A according to the topological structure.K (σ ) is a normalization constant depending on the neighborhood range σ .For this SOM, the mapping rule, equation 4.1, is modified to
v �→ s (v) = argminr∈A
∑r′∈A
hSOMσ (r, r′) · ξ (v, wr′ ) , (4.5)
Divergence-Based Vector Quantization 1363
which yields in the limit σ → 0 the original mapping (see equation 4.1). Theprototype update for all prototypes then is given as (Heskes, 1999)
� wr = −εhSOMσ (r, s(v))
∂ξ (v, wr)∂wr
. (4.6)
As above, the utilization of a divergence-based update is straightforwardfor SOM as well.
If the aspect of projective mapping can be ignored while keeping theneighborhood cooperativeness aspect to avoid local minima in vector quan-tization, then the neural gas algorithm (NG) is an alternative to SOM pre-sented by Martinetz, Berkovich, and Schulten (1993). The cost function ofNG to be minimized is
ENG = 12C (σ )
∑j∈A
∫P (v) hNG
σ (v, W, j) ξ(v, w j
)dv, (4.7)
with
hNGσ (v, W, i) = exp
(−ki (v, W)
σ
), (4.8)
with the rank function
ki (v, W) =∑
j
θ(ξ (v, wi ) − ξ
(v, w j
)). (4.9)
The mapping is realized as in basic VQ (see equation 4.1), and the prototypeupdate for all prototypes is similar to that of SOM:
� wi = −εhNGσ (v, W, i)
∂ξ (v, wi )∂wi
. (4.10)
Again, the incorporation of divergences is obvious also for NG.
4.1.3 Further Vector Quantization Approaches. There exists a long list ofother vector quantization approaches, like kernelized SOMs (Hulle, 2000,2002a, 2002b), generative topographic mapping (GTM; Bishop, Svensen,& Williams, 1998), and soft topographic mapping (Graepel, Burger, &Obermayer, 1998), to name just a few. Most of them use the Euclideanmetric and the respective derivatives for adaptation. Thus, the idea ofdivergence-based processing can be transferred to these in a similar manner.
A somewhat reverse SOM has been proposed recently for embeddingdata into an embedding space S: exploration machine (XOM; Wismuller,
1364 T. Villmann and S. Haase
2009). This XOM can be seen as a projective structure preserving mappingof the input data into the embedding space and therefore shows similaritiesto MDS. In the XOM approach, the data points vk ∈ V ⊆ R
n, k = 1, . . . , Nare uniquely associated with prototypes wk ∈ S in the embedding space Sand W = {wk}N
k=1. The dissimilarity ξS in the embedding space usually ischosen to be the quadratic Euclidean metric. Further, a hypothesis aboutthe topological structure of the data vk to be embedded is formulated forthe embedding space S by defining a probability distribution PS (s) forso-called sampling vectors s ∈S. A cost function of XOM can be defined as
EXOM = 12K (σ )
∫S
PS (s)N∑
k=1
·δk∗(s)k
N∑j=1
h XOMσ (vk, v j ) · ξS ( s, w j ) ds
(4.11)
with the mapping rule
k∗(s) = argmink=1,...,N
N∑j=1
h XOMσ (vk, v j ) · ξS ( s, w j ), (4.12)
as pointed out in Bunte, Hammer, Villmann, Biehl, and Wismuller (2010).As in usual SOMs, the neighborhood cooperativeness is given in XOMs bya gaussian,
h XOMσ (vk, v j ) = exp
(−ξV
(vk, v j
)2σ 2
),
with the data dissimilarity ξV(vk, v j
)defined as Euclidean distance in the
original XOM. The update of the prototypes in the embedding space isobtained in complete analogy to SOM as
� wi = −εh XOMσ
(vi , vk∗(s)
) ∂ξS (s, wi )∂wi
. (4.13)
As one can see, we can apply divergences to both ξV and ξS . In case of thelatter, the prototype update, equation 4.13, has to be changed accordinglyusing the respective Frechet derivatives.
4.2 Learning Vector Quantization. Learning vector quantization (LVQ)is the supervised counterpart of basic VQ. Now the data v ∈ V ⊆ R
n tobe learned are equipped with class information cv. Suppose we have Kclasses; we define cv ∈ [0, 1]K . If
∑Kk=1 ci = 1, the labeling is probabilistic,
Divergence-Based Vector Quantization 1365
and possibilistic otherwise. In case of a probabilistic labeling with cv ∈{0, 1}K , the labeling is called crisp.
We now briefly explore how divergences can be used for supervisedlearning. Again we start with the widely applied basic LVQ approaches andthen outline the procedure for some more sophisticated methods withoutany claim of completeness.
4.2.1 Basic LVQ Algorithms. The basic LVQ schemes were invented byKohonen (1997). For standard LVQ, a crisp data labeling is assumed. Fur-ther, the prototypes w j with labels yj correspond to the K classes in sucha way that at least one prototype is assigned to each class. For simplicity,we take exactly one prototype for each class. The task is to distribute theprototypes in such a manner that the classification error is reduced. Therespective algorithms LVQ1 to LVQ3 are heuristically motivated.
As in the unsupervised vector quantization, the similarity between dataand prototypes for LVQs is judged by a dissimilarity measure ξ
(v, w j
).
Beside some small modifications, the basic LVQ schemes LVQ1 to LVQ3mainly consist of determination of the most proximate prototype(s) ws(v)for given v according to the mapping rule, equation 4.1, and subsequentadaptation. Depending on the agreement of cv and ys(v) the adaptation ofthe prototype(s) takes place according to
� ws(v) = α · ε · ∂ξ(v, ws(v)
)∂ws(v)
, (4.14)
and α = 1 iff cv = ys(v), and α = −1 otherwise.A popular generalization of these standard algorithms is the generalized
LVQ (GLVQ) introduced by Sato and Yamada (1996). In GLVQ the classifi-cation error is replaced by a dissimilarity-based cost function that is closelyrelated to the classification error but not identical to it.
For a given data point v, with class label cv, the two best matchingprototypes with respect to the data metric ξ , usually the quadratic Euclid-ian, are determined: ws+(v) has minimum distance ξ+ = ξ
(v, ws+(v)
)under
the constraint that the class labels are identical: ys+(v) = cv. The other bestprototype, ws−(v), has the minimum distance ξ− = ξ
(v, ws−(v)
)supposing
the class labels are different: ys−(v) �= cv. Then the classifier function μ (v) isdefined as
μ (v) = ξ+ − ξ−
ξ+ + ξ− , (4.15)
being negative in case of a correct classification. The value ξ+ − ξ− yieldsthe hypothesis margin of the classifier (Crammer, Gilad-Bachrach, Navot,
1366 T. Villmann and S. Haase
& Tishby, 2002). Then the generalized LVQ (GLVQ) is derived as gradientdescent on the cost function
EGLVQ =∑
v
μ (v) (4.16)
with respect to the prototypes. In each learning step, for a given data point,both ws+(v) and ws−(v) are adapted in parallel. Taking the derivatives ∂ EGLVQ
∂ws+ (v)
and ∂ EGLVQ
∂ws− (v), we get for the updates
�ws+(v) = ε+ · θ+ · ∂ξ(v, ws+(v)
)∂ws+(v)
(4.17)
and
�ws−(v) = −ε− · θ− · ∂ξ(v, ws−(v)
)∂ws−(v)
(4.18)
with the scaling factors
θ+ = 2 · ξ−
(ξ+ + ξ−)2 and θ− = 2 · ξ+
(ξ+ + ξ−)2 . (4.19)
The values ε+ and ε− ∈ (0, 1) are the learning rates.Obviously the distance measure ξ could be replaced for all of these LVQ
schemes by one of the introduced divergences. This offers a new possibilityfor information-theoretic learning in classification schemes, which differsfrom the previous approaches significantly. These earlier approaches stressthe information-optimum class representation, whereas here, the expectedinformation loss in terms of the applied divergence measure is optimized(Torkkola & Campbell, 2000; Torkkola, 2003; Villmann, Hammer, et al.,2008).
4.2.2 Advanced Learning Vector Quantization. Apart from the basic LVQschemes, many more sophisticated prototype-based learning schemes areproposed for classification learning. Here we will restrict ourselves to ap-proaches that can deal with probabilistic or possibilistic labeled trainingdata (uncertain decisions) that are, in addition, related to the basic unsu-pervised and supervised vector quantization algorithms mentioned in thisletter so far.
In particular, we focus on the fuzzy-labeled SOM (FLSOM) and the verysimilar fuzzy-labeled NG (FLNG) (Villmann, Schleif, Kostrzewa, Walch, &Hammer, 2008; Villmann, Hammer, Schleif, Geweniger, & Herrmann, 2006).
Divergence-Based Vector Quantization 1367
Both approaches extend the cost function of its unsupervised counterpartin the following shorthand manner,
EFLSOM/FLNG = (1 − β) ESOM/NG + βEFL,
where EFL measures the classification accuracy . The factor in β ∈ [0, 1) is afactor balancing unsupervised and supervised learning. The classificationaccuracy term EFL is defined as
EFL = 12
∫P (v)
∑r
gγ (v, wr) ψ (cv,yr) dv, (4.20)
where gγ (v, wr) is a gaussian kernel describing a neighborhood range inthe data space
gγ (v, wr) = exp(
−ξ (v, wr)2γ 2
)(4.21)
using the dissimilarity ξ (v, wr) in the data space. ψ (cv,yr) judges the dissim-ilarities between label vectors of data and prototypes. ψ (cv,yr) is originallysuggested to be the quadratic Euclidean distance.
Note that EFL depends on the dissimilarity in the data space ξ (v, wr) viagγ (v, wr). Hence, prototype adaptation in FLSOM/FLNG is influenced bythe classification accuracy
∂ EFLSOM/NG
∂wr= (1 − β)
∂ ESOM/NG
∂wr+ β
∂ EFL
∂wr, (4.22)
which yields
� wr =−ε(1 − β) · hSOM/NGσ (r, s(v))
∂ξ (v, wr)∂wr
(4.23)
+ εβ1
4γ 2 · gγ (v, wr)∂ξ (v, wr)
∂wrψ (cv,yr) .
The label adaptation is influenced only by the second part, EFL. The deriva-tive ∂ EFL
∂yryields
� yr = εlβ · gγ (v, wr)∂ψ (cv,yr)
∂yr(4.24)
with learning rate εl > 0 (Villmann, Schleif, et al., 2008; Villmann et al.,2006). This label learning leads to a weighted average yr of the fuzzy labels
1368 T. Villmann and S. Haase
cv of those data v, that are close to the associated prototypes according toξ (v, wr).
It should be noted at this point that a similar approach can easily beinstalled for XOM in an analog manner, yielding FLXOM.
Clearly, beside the possibility of choosing a divergence measure forξ (v, wr) as in the unsupervised case, there is no contradiction to do sofor the label dissimilarity ψ (cv,yr) in these FL methods. As before, the sim-ple plug-in of the respective discrete divergence variants and their Frechetderivatives modifies the algorithms such that semisupervised learning canproceed by relying on divergences for both variants.
5 SOM Simulations for Various Divergences
In this section, we demonstrate the influence of the chosen divergence andthe dependence on divergence parameters for prototype-based unsuper-vised vector quantization. For this purpose, we consider an artificial butillustrative data set. In the case of parameterized divergences, we vary theparameter settings to show their dependence on the resulting prototype dis-tribution. Further, we investigate the behavior of different divergence typesbut always comparing the results with Euclidean distance-based learningas the standard to show their differences.
These investigations for the toy problem should lead readers to thinkabout the choice of divergences for a specific application as well as optimumparameter settings. The demonstration itself is far from a realistic scenario,which also has to deal with such matters as high-dimensional problems andheterogeneous data distributions.
As an example vector quantization model, we consider the Heskes-SOMaccording to equation 4.4 using a chain lattice with 100 units r and their pro-totypes wr ∈ R
2. The example data distribution consists of 107 data pointsv = (v1, v2) ∈ [0, 1]2. which are constrained such that v1 + v2 = 1 (i.e., thedata v can be taken as probability densities in R
2). Further, generating thedata set, the first argument v1 is chosen randomly according to the datadensity P1 (v1) = 2 · v1, whereas v2 is subsequently calculated according tothe constraint.
The learning rate ε as well as the neighborhood range σ converged dur-ing the SOM learning to the final values ε f inal = 10−6 and σ f inal = 1, respec-tively. The initial values for the learning rate ε as well as the neighborhoodrange σ were appropriately chosen.
We trained SOM networks for the divergences as introduced in section 2using the Frechet derivatives deduced in section 3.1 with different param-eter values.
For the η-divergence (belonging to the Bregman divergences) the resultsare depicted in Figure 2. One can observe that the influence of the parameterη is only marginal. Yet small variations can be detected. For the specialchoice η = 2, Euclidean learning is realized.
Divergence-Based Vector Quantization 1369
Figure 2: Prototype distribution for η-divergence-based SOM for different η-values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
For the β-divergence, the influence of the parameter value β is strongerthan the parameter effect for η-divergences (see Figure 3). In particular, sig-nificant deviations can be observed for higher prototype w1-values, givinga hint of a better discrimination property for this probability range. Lowerprototype w1-values were captured by the β-divergences markedly betterthan by the Euclidean learning.
The α-divergence based learning shows the inclusive and exclusive prop-erties mentioned above. For a positive choice of the control parameter α therange of prototype w1-values captured is quite larger than the one coveredusing negative α-values. However, only small variations can be detectedwithin the two α-domains (positive and negative); that is, the divergence isrelatively robust with respect to the control parameter α (see Figure 4).
For the Tsallis divergence, the influence of the control parameter α isalready detected in the central range of prototype w1-values and significantin the upper range (see Figure 5). Especially in comparison to Euclideanlearning, this gives a hint of a quite good discrimination property for awide probability range.
In contrast to the β-divergence, the influence of the control parameterα of the Renyi divergence is primarily detected in the region with sparsedata density (see Figure 6). However, the Renyi divergence-based learningcovers a wider range of prototype w1-values than the Euclidean learning.
1370 T. Villmann and S. Haase
Figure 3: Prototype distribution for β-divergence-based SOM for different β-values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
Figure 4: Prototype distribution for α-divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
Divergence-Based Vector Quantization 1371
Figure 5: Prototype distribution for Tsallis divergence-based SOM for differentα-values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
Figure 6: Prototype distribution for Renyi divergence-based SOM for differentα-values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
1372 T. Villmann and S. Haase
Figure 7: Prototype distribution for γ -divergence-based SOM for different γ -values. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
The γ -divergence shows the most sensitive behavior of all parameterizeddivergences investigated here (see Figure 7). In particular, the choice of thecontrol parameter γ influences both ranges of probability—the low and thehigh one—with approximately the same sensitivity (see Figure 7).
Thus, it differs from the sensitivity observed for β-divergences. Thisbehavior offers the possibility of tuning the divergence precisely dependingon the specific vector quantization task. Together with stated robustnessof the γ -divergence (Fujisawa & Eguchi, 2008), this adaptive specificitycould provide a high potential for a wide range of application. This isunderscored by the applications in supervised and unsupervised vectorquantization based on the Cauchy-Schwarz divergence (γ = 1) (Jenssenet al., 2006; Mwebaze et al., 2010; Principe et al., 2000; Villmann, Haase,Schleif, & Hammer, 2010).
Figure 8 shows the results of the prototype-based unsupervised vectorquantization using various nonparameterized divergences.
These simulations should be seen, on one hand, as a proof of concept.On the other hand, one can clearly see quite different behavior for the var-ious divergences, resulting in distinguished prototype distributions. Thisleads, in consequence, to diverse vector quantization properties. Therefore,the choice of a divergence for a specific application should be made verycarefully, taking the special properties of the divergences into account.
Divergence-Based Vector Quantization 1373
Figure 8: Prototype distribution for divergence-based SOM, using various di-vergences. Horizontal axis: logarithmic value of the one-dimensional prototypeindex. Vertical axis: first component w1 of the prototypes w = (w1, w2).
6 Extensions for the Basic Adaptation Scheme:Hyperparameter and Relevance Learning
6.1 Hyperparameter Learning for α-, β-, γ -, and η-Divergences
6.1.1 Theoretical Considerations. Considering the parameterized diver-gence families of γ -, α-, β-, and η-divergences, one could further think aboutthe optimal choice of the so-called hyperparameters γ , α, β, η as suggestedin a similar manner for other parameterized LVQ algorithms (Schneider,Biehl, & Hammer, 2009). In case of supervised learning schemes for clas-sification based on differentiable cost functions, the optimization can behandled as an object of a gradient descent–based adaptation procedure.Thus, the parameter is optimized for the classification task at hand.
Suppose the classification accuracy for a certain approach is given as
E = E (ξθ , W)
depending on a parameterized divergence ξθ with parameter θ . If E and ξθ
are both differentiable with respect to θ according to
∂ E (ξθ , W)∂θ
= ∂ E∂ξθ
· ∂ξθ
∂θ,
1374 T. Villmann and S. Haase
a gradient-based optimization is derived by
�θ = −ε∂ E (ξθ , W)
∂θ= −ε
∂ E∂ξθ
· ∂ξθ
∂θ
depending on the derivative ∂ξθ
∂θfor a certain choice of the divergence ξθ .
We assume in the following that the (positive) measures p and ρ arecontinuously differentiable. Then, considering derivatives of parameterizeddivergences ∂ξθ
∂θwith respect to the parameter θ , it is allowed to interchange
integration and differentiation if the resulting integral exists (Fichtenholz,1964). Hence, we can differentiate parameterized divergences with respectto their hyperparameter in that case. For the several α-, β-, γ -, and η-divergences, characterized in section 2, we obtain after some elementarycalculations:
� η-divergence Dη (p||ρ) from equation 2.8:
∂ Dη (p||ρ)∂η
=∫
pη ln p + ρη−1 · (ρ − p + (ηρ − ρ − ηp) · ln ρ) dx
� β-divergence Dβ (p||ρ) from equation 2.9 (see appendix A):
∂ Dβ (p||ρ)∂β
= 1β − 1
∫p(
pβ−1 ln p − ρβ−1 ln ρ −(
pβ−1 − ρβ−1)
(β − 1)
)dx
−∫ (
pβ ln p − ρβ ln ρ) 1
β− 1
β2
(pβ − ρβ
)dx
� α-divergence Dα (p||ρ) from equation 2.18 (see appendix A):
∂ Dα (p||ρ)∂α
=− (2α − 1)
α2 (α − 1)2
∫ [pαρ1−α − α · p + (α − 1) ρ
]dx
+ 1α (α − 1)
∫[pαρ1−α (ln p − ln ρ) − p + ρ]dx
� Tsallis divergence DTα (p||ρ) from equation 2.22:
∂ DTα (p||ρ)∂α
= 1
(1 − α)2
(1 −
∫pαρ1−αdx
)− 1
(1 − α)
∫pαρ1−α (ln p − ln ρ) dx
Divergence-Based Vector Quantization 1375
� Generalized Renyi divergence DG Rα (p||ρ) from equation 2.26 (see
appendix A):
∂ DG Rα (p||ρ)
∂α
= − 1
(α − 1)2 log(∫ [
pαρ1−α − α · p + (α − 1) ρ]
dx + 1)
+ 1α − 1
∫pαρ1−α (ln p − ln ρ) − p + ρdx∫ [
pαρ1−α − α · p + (α − 1) ρ]
dx + 1
� Renyi divergence DRα (p||ρ) from equation 2.28:
∂ DRα (p||ρ)∂α
= − 1
(α − 1)2 log(∫
pαρ1−αdx)
+ 1α − 1
∫pαρ1−α (ln p − ln ρ) dx∫
pαρ1−αdx
� γ -divergence Dγ (p||ρ) from equation2.30 (see appendix A):
∂ Dγ (p||ρ)∂γ
=− (2γ + 1)
γ 2 (γ + 1)2 ln(∫
pγ+1dx)
+∫
pγ+1 ln pdx(γ + 1) γ
∫pγ+1dx
− 1
(γ + 1)2 ln(∫
ργ+1dx)
+∫
ργ+1 ln ρdx(γ + 1)
∫ργ+1dx
+ 1γ 2 ln
(∫p · ργ dx
)−
∫pργ ln ρdx
γ∫
p · ργ dx
6.1.2 Example: Hyperparameter Learning for γ -Divergences in GLVQ. Wenow provide a simulation example for hyperparameter learning. We ap-ply the GLVQ algorithm for classification, the cost function of which isgiven by equation 4.16. Mwebaze et al. (2010) pointed out that GLVQperforms weakly if the Kullback-Leibler divergence is used, whereasCauchy-Schwarz divergence yields good results. Therefore, we demon-strate hyperparameter learning for the γ -divergence, which includes bothKullback-Leibler- and Cauchy-Schwarz divergence by the parameter set-tings γ → 0 and γ = 1, respectively. The hyperparameter update for thisalgorithm reads as
�γ ∼ −∂ EGLVQ
∂γ,
1376 T. Villmann and S. Haase
Figure 9: Example run of γ -parameter control for the γ -divergence in the caseof GLVQ applied to the well-known Iris data set.
which leads to
�γ ∼ θ+ · ∂ Dγ
(v||ws+(v)
)∂γ
+ θ− · ∂ Dγ
(v||ws−(v)
)∂γ
,
with the scaling factors θ+ and θ− taken from equation 4.19.For this purpose, we investigate a simple classification example: the well-
known three-class IRIS data set. We rescaled the data vectors such that therequirements of positive measures are satisfied. We used two prototypes foreach class and 10-fold cross-validation. We initialized the γ -parameter asγ0 = 0.5 to be in the middle between Kullback-Leibler and Cauchy-Schwarzdivergence according to Mwebaze et al. (2010).
Without a γ -parameter update for γ = 0, a classification accuracy of78.34% is obtained with standard deviation σ = 6.17, with the best re-sult being 91.3%. For γ = 1, the average is 95.16%, σ = 1.87 with thebest-performed run yielding 97.3%. The hyperparameter-controlled simu-lations give only a slight improvement achieving, an average performanceof 95.89% but with decreased deviation σ = 0.43. The γ -parameter con-verged to γ f inal = 0.9016 with standard deviation σγ < 10−4. As expectedfrom the noncontrolled experiments, the final γ -value is in the proximityof the Cauchy-Schwarz divergence. However, it is slightly but certainly de-creased. A typical learning progress of γ is depicted in Figure 9. As for theCauchy-Schwarz divergence (γ = 1), the best performance was 97.3% forthe controlled case.
Summarizing, this small experiment shows that hyperparameter opti-mization works well and may lead to better performance and stability.
6.2 Relevance Learning for Divergences. Density functions are required to fulfill the normalization condition, whereas positive measures are more flexible. This offers the possibility of transferring the idea of relevance learning to divergence-based learning vector quantization. Relevance learning in learning vector quantization weights the input data dimensions such that the classification accuracy is improved (Hammer & Villmann, 2002).
In the framework of divergence-based gradient descent learning, we multiplicatively weight a positive measure q(x) by λ(x) with 0 ≤ λ(x) < ∞ and the regularization condition $\int\lambda(x)\,dx = 1$. Incorporating this idea into the above approaches, we have to replace in the divergences p by p·λ and ρ by ρ·λ. Doing so, we can optimize λ(x) during learning for better performance by gradient descent optimization, as is known from vectorial relevance learning. This leads, again, to Frechet derivatives of the divergences, but now with respect to the weighting function λ(x). The respective framework based on GLVQ for vectorial data is given by the generalized relevance learning vector quantization scheme (GRLVQ; Hammer & Villmann, 2002). In complete analogy, we obtain the functional relevance update,
$$\Delta\lambda \sim \theta^{+}\cdot\frac{\delta D\left(p\|\rho_{s^{+}(p)}\right)}{\delta\lambda} \quad\text{or}\quad \Delta\lambda \sim \theta^{-}\cdot\frac{\delta D\left(p\|\rho_{s^{-}(p)}\right)}{\delta\lambda},$$

with $s^{+}(p)$ and $s^{-}(p)$ playing the same role as in GLVQ. For vectorial representations v and w of p and ρ, respectively, this reduces to the ordinary partial derivatives:

$$\Delta\lambda_{i} \sim \theta^{+}\cdot\frac{\partial D\left(v\|w_{s^{+}(v)}\right)}{\partial\lambda_{i}} \quad\text{or}\quad \Delta\lambda_{i} \sim \theta^{-}\cdot\frac{\partial D\left(v\|w_{s^{-}(v)}\right)}{\partial\lambda_{i}}.$$
Applying this methodology, we obtain for the Bregman divergence,

$$\frac{\delta D^{B}_{\Phi}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\delta\Phi(\lambda\cdot p)}{\delta\lambda} - \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\lambda} - \frac{\delta\left[\frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\,\lambda(p-\rho)\right]}{\delta\lambda}, \qquad (6.1)$$

with

$$\frac{\delta\left[\frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\,\lambda(p-\rho)\right]}{\delta\lambda} = (p-\rho)\left(\frac{\delta^{2}\left[\Phi(\lambda\cdot\rho)\right]}{\delta\rho\,\delta\lambda}\,\lambda + \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\right).$$
For the generalized Kullback-Leibler divergence, this yields

$$\frac{\delta D_{GKL}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = p\cdot\log\left(\frac{p}{\rho}\right) - p + \rho. \qquad (6.2)$$
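For vectorial data, equation 6.2 becomes a per-dimension gradient with respect to λ_i. A minimal sketch (Python with NumPy; combining both prototype terms as in GRLVQ and renormalizing so that the λ_i sum to one are our discrete reading of the regularization condition):

```python
import numpy as np

def gkl_relevance_grad(v, w):
    # Discrete version of equation 6.2:
    # delta D_GKL(lambda*p||lambda*rho)/delta lambda_i = v_i*log(v_i/w_i) - v_i + w_i
    return v * np.log(v / w) - v + w

def relevance_step(lam, v, w_plus, w_minus, th_plus, th_minus, lr=1e-3):
    # One relevance update combining correct (w_plus) and incorrect (w_minus)
    # prototype contributions, scaled by the GLVQ factors th_plus, th_minus
    lam = lam - lr * (th_plus * gkl_relevance_grad(v, w_plus)
                      + th_minus * gkl_relevance_grad(v, w_minus))
    lam = np.clip(lam, 0.0, None)  # keep the weighting nonnegative
    return lam / np.sum(lam)       # discrete analog of int lambda(x) dx = 1
```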
In the case of the η-divergence (equation 2.8), we calculate

$$\frac{\delta D_{\eta}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \lambda^{\eta-1}\eta\left(p^{\eta} - \rho^{\eta-1}\left(\eta p + (1-\eta)\rho\right)\right), \qquad (6.3)$$

which reduces for the choice η = 2 (the Euclidean distance) to

$$\frac{\delta D_{\eta}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = 2\lambda(p-\rho)^{2},$$

as is known from Hammer and Villmann (2002). Further, for the β-divergence, equation 2.9, which also belongs to the Bregman divergence class, we have

$$\frac{\delta D_{\beta}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\rho\cdot(\lambda\cdot p)^{\beta} + \left(\rho\cdot(\beta-1) - p\cdot\beta\right)\cdot(\lambda\cdot\rho)^{\beta}}{\lambda\rho\,(\beta-1)}. \qquad (6.4)$$
For the class of f-divergences, equation 2.11, we consider

$$\frac{\delta D_{f}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \rho\cdot f\!\left(\frac{p}{\rho}\right) + \lambda\cdot\rho\,\frac{\partial f(u)}{\partial u}\frac{\delta u}{\delta\lambda} = \rho\cdot f\!\left(\frac{p}{\rho}\right) \qquad (6.5)$$

with $u = \frac{p}{\rho}$, using the fact that $\frac{\delta u}{\delta\lambda} = 0$. The relevance learning of the subclass of α-divergences, equation 2.18, follows as

$$\frac{\delta D_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{1}{\alpha(\alpha-1)}\left[\rho\cdot\left(\left(\frac{p}{\rho}\right)^{\alpha} + \alpha - 1\right) - p\cdot\alpha\right], \qquad (6.6)$$
whereas the respective gradient of the generalized Renyi divergence, equation 2.26, can be derived from this as

$$\frac{\delta D^{GR}_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\alpha}{\int\lambda\cdot\left(\rho\cdot\left(\frac{p}{\rho}\right)^{\alpha} - \alpha\cdot p + (\alpha-1)\cdot\rho\right)dx + 1}\cdot\frac{\delta D_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda}. \qquad (6.7)$$
The subset of Tsallis divergences is treated by

$$\frac{\delta D^{T}_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = -\frac{1}{1-\alpha}\,p^{\alpha}\rho^{1-\alpha}. \qquad (6.8)$$
The γ-divergence class finally yields

$$\frac{\delta D_{\gamma}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{p\,(\lambda\cdot p)^{\gamma}}{\gamma\int(\lambda\cdot p)^{\gamma+1}dx} + \frac{\rho\,(\lambda\cdot\rho)^{\gamma}}{\int(\lambda\cdot\rho)^{\gamma+1}dx} - \frac{p\cdot(\gamma+1)\cdot(\lambda\cdot\rho)^{\gamma}}{\gamma\int(\lambda\cdot p)\cdot(\lambda\cdot\rho)^{\gamma}dx}.$$
Again the important special case γ = 1 is considered: the relevance learning scheme for the Cauchy-Schwarz divergence, equation 2.32, is derived as

$$\frac{\delta D_{CS}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{p\cdot\lambda\cdot p}{\int(\lambda\cdot p)^{2}dx} + \frac{\rho\cdot\lambda\cdot\rho}{\int(\lambda\cdot\rho)^{2}dx} - \frac{2\cdot p\cdot\lambda\cdot\rho}{\int\lambda^{2}\cdot p\cdot\rho\,dx}. \qquad (6.9)$$
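Because such functional gradients are easy to get wrong in transcription, a finite-difference check is a cheap safeguard. The following sketch (Python with NumPy; the test values are arbitrary, assumed by us) verifies the discrete version of equation 6.9:

```python
import numpy as np

def cs_divergence(p, rho):
    # Discrete Cauchy-Schwarz divergence, equation 2.32
    return 0.5 * np.log(np.sum(p**2) * np.sum(rho**2) / np.sum(p * rho)**2)

def cs_relevance_grad(lam, p, rho):
    # Discrete version of equation 6.9
    return (p * lam * p / np.sum((lam * p)**2)
            + rho * lam * rho / np.sum((lam * rho)**2)
            - 2.0 * p * lam * rho / np.sum(lam**2 * p * rho))

rng = np.random.default_rng(0)
lam, p, rho = rng.uniform(0.1, 1.0, (3, 5))
i, h = 2, 1e-6
lam_hi = lam.copy(); lam_hi[i] += h
lam_lo = lam.copy(); lam_lo[i] -= h
numeric = (cs_divergence(lam_hi * p, lam_hi * rho)
           - cs_divergence(lam_lo * p, lam_lo * rho)) / (2.0 * h)
assert np.isclose(numeric, cs_relevance_grad(lam, p, rho)[i], atol=1e-5)
```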
7 Conclusion
Divergence-based supervised and unsupervised vector quantization has so far been carried out with only a few divergences, primarily the Kullback-Leibler divergence. Recent applications also refer to the Itakura-Saito divergence, the Cauchy divergence, and the γ-divergence. These approaches are not online adaptation schemes involving gradient learning but are based on batch mode, requiring all the data at once. However, in many cases online learning is mandatory, for several reasons: the huge amount of data, a subsequently increasing data set, or the need for very careful learning in complex problems, for example (Alex, Hasenfuss, & Hammer, 2009). In these cases, online learning is required or is at least advantageous.
In this letter we give a mathematical foundation for gradient-based vector quantization bearing on the derivatives of the applied divergences. We provide a general framework for the use of arbitrary divergences and their derivatives such that they can immediately be plugged into existing gradient-based vector quantization schemes.

For this purpose, we first characterized the main subclasses of divergences—Bregman-, α-, β-, γ-, and f-divergences—following Cichocki et al. (2009). We then used the mathematical methodology of Frechet derivatives to calculate the functional divergence derivatives.

We showed how to use this methodology with well-known examples of supervised and unsupervised vector quantization, including SOM, NG, and GLVQ. In particular, we explained that divergences can be taken as suitable dissimilarity measures for data, which leads to the use of the respective Frechet derivatives in the online learning schemes. Further, we described how a parameter adaptation can be integrated into supervised learning to achieve improved classification results in the case of the parameterized α-, β-, γ-, and η-divergences. In the last step, we considered a weighting function for generalized divergences based on a positive measure. The optimization scheme for this weight function is again obtained by Frechet derivatives, yielding a relevance learning scheme in analogy to relevance learning in the usual supervised learning vector quantization (Hammer & Villmann, 2002).
Table 1 provides an overview of representatives of the three main classes of divergence characterized in section 2 and their related Frechet derivatives. Table 2 provides the resulting derivatives for relevance learning and hyperparameter learning.

Table 1: Table of Divergences and Their Frechet Derivatives.

- Bregman divergence: $D^{B}_{\Phi}(p\|\rho) = \Phi(p) - \Phi(\rho) - \frac{\delta\Phi(\rho)}{\delta\rho}[p-\rho]$; Frechet derivative: $\frac{\delta D^{B}_{\Phi}(p\|\rho)}{\delta\rho} = \frac{\delta\Phi(p)}{\delta\rho} - \frac{\delta\Phi(\rho)}{\delta\rho} - \frac{\delta^{2}[\Phi(\rho)]}{\delta\rho^{2}}(p-\rho) + \frac{\delta\Phi(\rho)}{\delta\rho}$
- Kullback-Leibler divergence: $D_{KL}(p\|\rho) = \int p\cdot\log\left(\frac{p}{\rho}\right)dx$; Frechet derivative: $\frac{\delta D_{KL}(p\|\rho)}{\delta\rho} = -\frac{p}{\rho}$
- Generalized Kullback-Leibler divergence: $D_{GKL}(p\|\rho) = \int p\cdot\log\left(\frac{p}{\rho}\right)dx - \int(p-\rho)\,dx$; Frechet derivative: $\frac{\delta D_{GKL}(p\|\rho)}{\delta\rho} = -\frac{p}{\rho} + 1$
- Itakura-Saito divergence: $D_{IS}(p\|\rho) = \int\left[\frac{p}{\rho} - \log\left(\frac{p}{\rho}\right) - 1\right]dx$; Frechet derivative: $\frac{\delta D_{IS}(p\|\rho)}{\delta\rho} = \frac{1}{\rho^{2}}(\rho - p)$
- η-divergence: $D_{\eta}(p\|\rho) = \int p^{\eta} + (\eta-1)\cdot\rho^{\eta} - \eta\cdot p\cdot\rho^{\eta-1}\,dx$; Frechet derivative: $\frac{\delta D_{\eta}(p\|\rho)}{\delta\rho} = \rho^{\eta-2}\cdot(1-\eta)\cdot\eta\cdot(p-\rho)$
- β-divergence: $D_{\beta}(p\|\rho) = \int p\cdot\frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\,dx - \int\frac{p^{\beta}-\rho^{\beta}}{\beta}\,dx$; Frechet derivative: $\frac{\delta D_{\beta}(p\|\rho)}{\delta\rho} = \rho^{\beta-2}(\rho - p)$
- f-divergences: $D_{f}(p\|\rho) = \int\rho\cdot f\!\left(\frac{p}{\rho}\right)dx$; Frechet derivative: $\frac{\delta D_{f}(p\|\rho)}{\delta\rho} = f\!\left(\frac{p}{\rho}\right) + \rho\,\frac{\partial f(u)}{\partial u}\cdot\left(-\frac{p}{\rho^{2}}\right)$ with $u = \frac{p}{\rho}$
- Generalized f-divergences: $D^{G}_{f}(p\|\rho) = c_{f}\int p-\rho\,dx + \int\rho\cdot f\!\left(\frac{p}{\rho}\right)dx$, $c_{f} = f'(1)\neq 0$; Frechet derivative: $\frac{\delta D^{G}_{f}(p\|\rho)}{\delta\rho} = f\!\left(\frac{p}{\rho}\right) + \rho\,\frac{\partial f(u)}{\partial u}\cdot\left(-\frac{p}{\rho^{2}}\right) - c_{f}$
- Hellinger divergence: $D_{H}(p\|\rho) = \int\left(\sqrt{p}-\sqrt{\rho}\right)^{2}dx$; Frechet derivative: $\frac{\delta D_{H}(p\|\rho)}{\delta\rho} = 1 - \sqrt{\frac{p}{\rho}}$
- α-divergence: $D_{\alpha}(p\|\rho) = \frac{1}{\alpha(\alpha-1)}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx$; Frechet derivative: $\frac{\delta D_{\alpha}(p\|\rho)}{\delta\rho} = -\frac{1}{\alpha}\left(p^{\alpha}\rho^{-\alpha} - 1\right)$
- Tsallis divergence: $D^{T}_{\alpha}(p\|\rho) = \frac{1}{1-\alpha}\left(1 - \int p^{\alpha}\rho^{1-\alpha}dx\right)$; Frechet derivative: $\frac{\delta D^{T}_{\alpha}(p\|\rho)}{\delta\rho} = -\left(\frac{p}{\rho}\right)^{\alpha}$
- Renyi divergence: $D^{R}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int p^{\alpha}\rho^{1-\alpha}dx\right)$; Frechet derivative: $\frac{\delta D^{R}_{\alpha}(p\|\rho)}{\delta\rho} = -\frac{p^{\alpha}\rho^{-\alpha}}{\int p^{\alpha}\rho^{1-\alpha}dx}$
- Generalized Renyi divergence: $D^{GR}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1\right)$; Frechet derivative: $\frac{\delta D^{GR}_{\alpha}(p\|\rho)}{\delta\rho} = -\frac{p^{\alpha}\rho^{-\alpha} - 1}{\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1}$
- γ-divergences: $D_{\gamma}(p\|\rho) = \log\left[\frac{\left(\int p^{\gamma+1}dx\right)^{\frac{1}{\gamma(\gamma+1)}}\cdot\left(\int\rho^{\gamma+1}dx\right)^{\frac{1}{\gamma+1}}}{\left(\int p\cdot\rho^{\gamma}dx\right)^{\frac{1}{\gamma}}}\right]$; Frechet derivative: $\frac{\delta D_{\gamma}(p\|\rho)}{\delta\rho} = \frac{\rho^{\gamma}}{\int\rho^{\gamma+1}dx} - \frac{p\rho^{\gamma-1}}{\int p\cdot\rho^{\gamma}dx}$
- Cauchy-Schwarz divergence: $D_{CS}(p\|\rho) = \frac{1}{2}\log\left(\frac{\int\rho^{2}(x)dx\cdot\int p^{2}(x)dx}{\left(\int p\cdot\rho\,dx\right)^{2}}\right)$; Frechet derivative: $\frac{\delta D_{CS}(p\|\rho)}{\delta\rho} = \frac{\rho}{\int\rho^{2}dx} - \frac{p}{\int p\cdot\rho\,dx}$

Table 2: Table of Divergences and Their Derivatives for Relevance Learning and Hyperparameter Learning.

- Bregman divergence: relevance derivative: $\frac{\delta D^{B}_{\Phi}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\delta\Phi(\lambda\cdot p)}{\delta\lambda} - \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\lambda} - (p-\rho)\left(\frac{\delta^{2}[\Phi(\lambda\cdot\rho)]}{\delta\rho\,\delta\lambda}\lambda + \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\right)$; hyperparameter derivative: none
- KL divergence: relevance derivative: $\frac{\delta D_{KL}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = p\cdot\log\left(\frac{p}{\rho}\right)$; hyperparameter derivative: none
- Generalized KL divergence: relevance derivative: $\frac{\delta D_{GKL}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = p\cdot\log\left(\frac{p}{\rho}\right) - p + \rho$; hyperparameter derivative: none
- Itakura-Saito divergence: relevance derivative: $\frac{\delta D_{IS}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = 0$; hyperparameter derivative: none
- η-divergence: relevance derivative: $\frac{\delta D_{\eta}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \lambda^{\eta-1}\eta\left(p^{\eta} - \rho^{\eta-1}(\eta p + (1-\eta)\rho)\right)$; hyperparameter derivative: $\frac{\partial D_{\eta}(p\|\rho)}{\partial\eta} = \int p^{\eta}\ln p + \rho^{\eta-1}\cdot\left(\rho - p + (\eta\rho - \rho - \eta p)\cdot\ln\rho\right)dx$
- β-divergence: relevance derivative: $\frac{\delta D_{\beta}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\rho\cdot(\lambda\cdot p)^{\beta} + (\rho\cdot(\beta-1) - p\cdot\beta)\cdot(\lambda\cdot\rho)^{\beta}}{\lambda\rho\,(\beta-1)}$; hyperparameter derivative: $\frac{\partial D_{\beta}(p\|\rho)}{\partial\beta} = \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx - \int\left(p^{\beta}\ln p - \rho^{\beta}\ln\rho\right)\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)dx$
- f-divergences: relevance derivative: $\frac{\delta D_{f}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \rho\cdot f\!\left(\frac{p}{\rho}\right)$; hyperparameter derivative: none
- Generalized f-divergences: relevance derivative: $\frac{\delta D^{G}_{f}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = c_{f}(p-\rho) + \rho\cdot f\!\left(\frac{p}{\rho}\right)$, $c_{f} = f'(1)\neq 0$; hyperparameter derivative: none
- Hellinger divergence: relevance derivative: $\frac{\delta D_{H}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{1}{\lambda}\left(\sqrt{p\lambda} - \sqrt{\lambda\rho}\right)^{2}$; hyperparameter derivative: none
- α-divergence: relevance derivative: $\frac{\delta D_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{1}{\alpha(\alpha-1)}\left[\rho\cdot\left(\left(\frac{p}{\rho}\right)^{\alpha} + \alpha - 1\right) - p\cdot\alpha\right]$; hyperparameter derivative: $\frac{\partial D_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + \frac{1}{\alpha(\alpha-1)}\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx$
- Tsallis divergence: relevance derivative: $\frac{\delta D^{T}_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = -\frac{1}{1-\alpha}p^{\alpha}\rho^{1-\alpha}$; hyperparameter derivative: $\frac{\partial D^{T}_{\alpha}(p\|\rho)}{\partial\alpha} = \frac{1}{(1-\alpha)^{2}}\left(1 - \int p^{\alpha}\rho^{1-\alpha}dx\right) - \frac{1}{1-\alpha}\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx$
- Renyi divergence: relevance derivative: $\frac{\delta D^{R}_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{p^{\alpha}\rho^{1-\alpha}}{(\alpha-1)\int\lambda\cdot\rho\cdot\left(\frac{p}{\rho}\right)^{\alpha}dx}$; hyperparameter derivative: $\frac{\partial D^{R}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int p^{\alpha}\rho^{1-\alpha}dx\right) + \frac{1}{\alpha-1}\cdot\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx}{\int p^{\alpha}\rho^{1-\alpha}dx}$
- Generalized Renyi divergence: relevance derivative: $\frac{\delta D^{GR}_{\alpha}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{\rho\cdot\left(\left(\frac{p}{\rho}\right)^{\alpha} + \alpha - 1\right) - p\cdot\alpha}{(\alpha-1)\left(\int\lambda\cdot\left(\rho\cdot\left(\frac{p}{\rho}\right)^{\alpha} - \alpha\cdot p + (\alpha-1)\rho\right)dx + 1\right)}$; hyperparameter derivative: $\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1\right) + \frac{1}{\alpha-1}\cdot\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx}{\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1}$
- γ-divergences: relevance derivative: $\frac{\delta D_{\gamma}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{p(\lambda\cdot p)^{\gamma}}{\gamma\int(\lambda\cdot p)^{\gamma+1}dx} + \frac{\rho(\lambda\cdot\rho)^{\gamma}}{\int(\lambda\cdot\rho)^{\gamma+1}dx} - \frac{p\cdot(\gamma+1)\cdot(\lambda\cdot\rho)^{\gamma}}{\gamma\int(\lambda\cdot p)\cdot(\lambda\cdot\rho)^{\gamma}dx}$; hyperparameter derivative: $\frac{\partial D_{\gamma}(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln\left(\int p^{\gamma+1}dx\right) + \frac{\int p^{\gamma+1}\ln p\,dx}{(\gamma+1)\gamma\int p^{\gamma+1}dx} - \frac{1}{(\gamma+1)^{2}}\ln\left(\int\rho^{\gamma+1}dx\right) + \frac{\int\rho^{\gamma+1}\ln\rho\,dx}{(\gamma+1)\int\rho^{\gamma+1}dx} + \frac{1}{\gamma^{2}}\ln\left(\int p\cdot\rho^{\gamma}dx\right) - \frac{\int p\rho^{\gamma}\ln\rho\,dx}{\gamma\int p\cdot\rho^{\gamma}dx}$
- Cauchy-Schwarz divergence: relevance derivative: $\frac{\delta D_{CS}(\lambda\cdot p\|\lambda\cdot\rho)}{\delta\lambda} = \frac{p\cdot\lambda\cdot p}{\int(\lambda\cdot p)^{2}dx} + \frac{\rho\cdot\lambda\cdot\rho}{\int(\lambda\cdot\rho)^{2}dx} - \frac{2\cdot p\cdot\lambda\cdot\rho}{\int\lambda^{2}\cdot p\cdot\rho\,dx}$; hyperparameter derivative: none
As a proof of concept, the simulations for an illustrative example with the several parametric and nonparametric divergences give promising results regarding their sensitivity. The differences from Euclidean learning are evident. Moreover, the dependencies observed for the parameterized divergences give hints for possible real-world applications, which should be the next step of this work.
Appendix A: Calculation of the Derivatives of the Parameterized Divergences with Respect to the Hyperparameters
We assume for the differentiation of the divergences with respect to their hyperparameters that the (positive) measures p and ρ are continuously differentiable. Then, considering derivatives of divergences, integration and differentiation can be interchanged if the resulting integral exists (Fichtenholz, 1964).
A.1 β-Divergence. The β-divergence is, according to equation 2.9,

$$D_{\beta}(p\|\rho) = \int p\cdot\frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\,dx - \int\frac{p^{\beta}-\rho^{\beta}}{\beta}\,dx = I_{1}(\beta) - I_{2}(\beta).$$
We treat both integrals independently:
$$\frac{\partial I_{1}(\beta)}{\partial\beta} = \int\frac{\partial\left[p\cdot\frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right]}{\partial\beta}\,dx$$
$$= \int p\left(\frac{\partial\left[p^{\beta-1}-\rho^{\beta-1}\right]}{\partial\beta}\,\frac{1}{\beta-1} - \frac{p^{\beta-1}-\rho^{\beta-1}}{(\beta-1)^{2}}\right)dx$$
$$= \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx$$
$$\frac{\partial I_{2}(\beta)}{\partial\beta} = \int\frac{\partial\left[\frac{p^{\beta}-\rho^{\beta}}{\beta}\right]}{\partial\beta}\,dx$$
$$= \int\frac{\partial\left[p^{\beta}-\rho^{\beta}\right]}{\partial\beta}\,\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)dx$$
$$= \int\left(p^{\beta}\ln p - \rho^{\beta}\ln\rho\right)\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)dx.$$
Thus,

$$\frac{\partial D_{\beta}(p\|\rho)}{\partial\beta} = \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx - \int\left(p^{\beta}\ln p - \rho^{\beta}\ln\rho\right)\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)dx,$$

if the integral exists for an appropriate choice of β.
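As a sanity check of this result (and of the interchange of integration and differentiation), the closed form can be compared against a numerical derivative on discrete measures; a minimal sketch with assumed test values (Python with NumPy):

```python
import numpy as np

def beta_divergence(p, rho, beta):
    # Discrete beta-divergence, equation 2.9
    return (np.sum(p * (p**(beta - 1.0) - rho**(beta - 1.0))) / (beta - 1.0)
            - np.sum(p**beta - rho**beta) / beta)

def dbeta_dbeta(p, rho, beta):
    # Discrete version of the closed form derived above
    t1 = np.sum(p * (p**(beta - 1.0) * np.log(p)
                     - rho**(beta - 1.0) * np.log(rho)
                     - (p**(beta - 1.0) - rho**(beta - 1.0)) / (beta - 1.0))) / (beta - 1.0)
    t2 = np.sum((p**beta * np.log(p) - rho**beta * np.log(rho)) / beta
                - (p**beta - rho**beta) / beta**2)
    return t1 - t2

p = np.array([0.2, 0.3, 0.5])
rho = np.array([0.25, 0.35, 0.4])
b, h = 1.5, 1e-6
numeric = (beta_divergence(p, rho, b + h)
           - beta_divergence(p, rho, b - h)) / (2.0 * h)
assert np.isclose(numeric, dbeta_dbeta(p, rho, b), atol=1e-5)
```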
A.2 α-Divergences. We consider the α-divergence, equation 2.18:

$$D_{\alpha}(p\|\rho) = \frac{1}{\alpha(\alpha-1)}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx = \frac{1}{\alpha(\alpha-1)}\,I(\alpha).$$
We have

$$\frac{\partial D_{\alpha}(p\|\rho)}{\partial\alpha} = \frac{\partial\left[\frac{1}{\alpha(\alpha-1)}\right]}{\partial\alpha}\,I(\alpha) + \frac{1}{\alpha(\alpha-1)}\,\frac{\partial I(\alpha)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\,I(\alpha) + \frac{1}{\alpha(\alpha-1)}\,\frac{\partial I(\alpha)}{\partial\alpha}.$$
The derivative $\frac{\partial I(\alpha)}{\partial\alpha}$ yields

$$\frac{\partial I(\alpha)}{\partial\alpha} = \int\frac{\partial\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]}{\partial\alpha}\,dx = \int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx.$$
Finally, we get

$$\frac{\partial D_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + \frac{1}{\alpha(\alpha-1)}\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx.$$
A.3 Renyi Divergences. Considering the generalized Renyi divergence $D^{GR}_{\alpha}(p\|\rho)$ from equation 2.26,

$$D^{GR}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1\right) = \frac{1}{\alpha-1}\log I(\alpha),$$
we get

$$\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log I(\alpha) + \frac{1}{\alpha-1}\,\frac{1}{I(\alpha)}\,\frac{\partial I(\alpha)}{\partial\alpha}$$
with

$$\frac{\partial I(\alpha)}{\partial\alpha} = \int\frac{\partial\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]}{\partial\alpha}\,dx = \int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx.$$
Summarizing, the differentiation yields

$$\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1\right) + \frac{1}{\alpha-1}\cdot\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\,dx}{\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\rho\right]dx + 1}.$$
We now turn to the usual Renyi divergence $D^{R}_{\alpha}(p\|\rho)$ from equation 2.28:

$$D^{R}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int p^{\alpha}\rho^{1-\alpha}dx\right).$$
We analogously achieve

$$\frac{\partial D^{R}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int p^{\alpha}\rho^{1-\alpha}dx\right) + \frac{1}{\alpha-1}\cdot\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx}{\int p^{\alpha}\rho^{1-\alpha}dx}.$$
A.4 γ-Divergences. The remaining divergences are the γ-divergences, equation 2.30:

$$D_{\gamma}(p\|\rho) = \frac{1}{\gamma+1}\ln\left[\left(\int p^{\gamma+1}dx\right)^{\frac{1}{\gamma}}\cdot\left(\int\rho^{\gamma+1}dx\right)\right] - \ln\left[\left(\int p\cdot\rho^{\gamma}dx\right)^{\frac{1}{\gamma}}\right]$$
$$= \frac{1}{\gamma+1}\ln\left[\left(\int p^{\gamma+1}dx\right)^{\frac{1}{\gamma}}\right] + \frac{1}{\gamma+1}\ln\left[\int\rho^{\gamma+1}dx\right] - \ln\left[\left(\int p\cdot\rho^{\gamma}dx\right)^{\frac{1}{\gamma}}\right]$$
$$= \frac{1}{(\gamma+1)\gamma}\ln I_{1}(\gamma) + \frac{1}{\gamma+1}\ln I_{2}(\gamma) - \frac{1}{\gamma}\ln I_{3}(\gamma).$$
The derivative is obtained according to

$$\frac{\partial D_{\gamma}(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln I_{1}(\gamma) + \frac{1}{(\gamma+1)\gamma\,I_{1}(\gamma)}\,\frac{\partial I_{1}(\gamma)}{\partial\gamma} - \frac{1}{(\gamma+1)^{2}}\ln I_{2}(\gamma) + \frac{1}{(\gamma+1)I_{2}(\gamma)}\,\frac{\partial I_{2}(\gamma)}{\partial\gamma} + \frac{1}{\gamma^{2}}\ln I_{3}(\gamma) - \frac{1}{\gamma\,I_{3}(\gamma)}\,\frac{\partial I_{3}(\gamma)}{\partial\gamma}.$$
Next, we calculate the derivatives $\frac{\partial I_{1}(\gamma)}{\partial\gamma}$, $\frac{\partial I_{2}(\gamma)}{\partial\gamma}$, and $\frac{\partial I_{3}(\gamma)}{\partial\gamma}$:

$$\frac{\partial I_{1}(\gamma)}{\partial\gamma} = \int\frac{\partial\left(p^{\gamma+1}\right)}{\partial\gamma}\,dx = \int p^{\gamma+1}\ln p\,dx,$$
$$\frac{\partial I_{2}(\gamma)}{\partial\gamma} = \int\frac{\partial\left(\rho^{\gamma+1}\right)}{\partial\gamma}\,dx = \int\rho^{\gamma+1}\ln\rho\,dx,$$

$$\frac{\partial I_{3}(\gamma)}{\partial\gamma} = \int\frac{\partial\left(p\cdot\rho^{\gamma}\right)}{\partial\gamma}\,dx = \int p\rho^{\gamma}\ln\rho\,dx.$$
Collecting all intermediate results, we finally have

$$\frac{\partial D_{\gamma}(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln\left(\int p^{\gamma+1}dx\right) + \frac{\int p^{\gamma+1}\ln p\,dx}{(\gamma+1)\,\gamma\int p^{\gamma+1}dx} - \frac{1}{(\gamma+1)^{2}}\ln\left(\int\rho^{\gamma+1}dx\right) + \frac{\int\rho^{\gamma+1}\ln\rho\,dx}{(\gamma+1)\int\rho^{\gamma+1}dx} + \frac{1}{\gamma^{2}}\ln\left(\int p\cdot\rho^{\gamma}dx\right) - \frac{\int p\rho^{\gamma}\ln\rho\,dx}{\gamma\int p\cdot\rho^{\gamma}dx}.$$
Appendix B: Proof of Lemma 1
We now give the proof of lemma 1. For the proof, we need a proposition given in Liese and Vajda (1987):
Proposition 3. Let $A = [0,\infty)^{2}$ and $\mathcal{F} = \{g \mid g: [0,\infty)\to\mathbb{R},\ g\ \text{convex}\}$. Further, let $\hat{f}$ be a function $\hat{f}: A \to \mathbb{R}\cup\{\infty\}$ defined by

$$\hat{f}(u,v) = v\cdot f\left(\frac{u}{v}\right)$$

for an arbitrary $f\in\mathcal{F}$, with the definitions $0\cdot f\left(\frac{0}{0}\right) = 0$ and $0\cdot f\left(\frac{a}{0}\right) = \lim_{x\to 0} x\cdot f\left(\frac{a}{x}\right) = \lim_{u\to\infty}\frac{a\cdot f(u)}{u}$. Further, let us denote $f_{\infty} = \lim_{u\to 0^{+}}\left\{u\cdot f\left(\frac{1}{u}\right)\right\}$ and $f(0) = \lim_{u\to 0^{+}}\{f(u)\}$. Then there exists $c\in\mathbb{R}$ such that

$$\hat{f}(u,v) \geq u\cdot c + v\cdot(f(1)-c) \quad\text{for all } (u,v)\in A$$

and

$$\hat{f}(u,v) \leq u\cdot f_{\infty} + v\cdot f(0) \quad\text{for all } (u,v)\in(0,\infty)^{2}.$$
Proof. See Liese and Vajda (1987).
1388 T. Villmann and S. Haase
This proposition gives the essential ingredients to prove the lemma:
Lemma. The f-divergence $D_{f}$ for positive measures p and ρ is bounded (if the limit exists and is finite):

$$0 \leq D_{f}(p\|\rho) \leq \lim_{u\to 0^{+}}\left\{f(u) + u\cdot f\left(\frac{1}{u}\right)\right\}$$

with $u = \frac{p}{\rho}$.
Proof. Let $p^{*}$ be a nonnegative integrable function defined as

$$p^{*} = \frac{2\cdot p}{p+\rho}.$$
Further, let us define

$$\bar{f}(u) = \hat{f}(u, 2-u) \quad\text{for } u\in[0,2].$$

Then it follows directly from the above proposition that there is $c\in\mathbb{R}$ such that

$$2\cdot f(1) + 2\cdot c\cdot(p^{*}-1) \leq \bar{f}(p^{*}) \leq (2-p^{*})\cdot f(0) + p^{*}\cdot f_{\infty},$$
which leads to

$$f(1) + \frac{c}{p+\rho}\cdot(p-\rho) \leq \frac{\rho}{p+\rho}\cdot f\!\left(\frac{p}{\rho}\right) \leq \frac{\rho}{p+\rho}\cdot f(0) + \frac{p}{p+\rho}\cdot f_{\infty}.$$
With f being a determining function of an f-divergence, it holds that f(1) = 0, and thus

$$c\cdot(p-\rho) \leq \rho\cdot f\!\left(\frac{p}{\rho}\right) \leq \rho\cdot f(0) + p\cdot f_{\infty}.$$
We now get

$$c\cdot\int(p-\rho)\,dx \leq D_{f}(p\|\rho) \leq f(0)\cdot\int\rho\,dx + f_{\infty}\cdot\int p\,dx.$$
Since p and ρ are positive measures with weights $W(p)\leq 1$ and $W(\rho)\leq 1$ according to equation 2.1, this finally yields

$$0 \leq D_{f}(p\|\rho) \leq f(0) + f_{\infty},$$

which completes the proof of the lemma.
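As a concrete instance of the bound, take the Hellinger divergence with determining function $f(u) = (\sqrt{u}-1)^{2}$: here $f(0) = 1$ and $f_{\infty} = \lim_{u\to 0^{+}} u\cdot f\left(\frac{1}{u}\right) = \lim_{u\to 0^{+}}(1 - 2\sqrt{u} + u) = 1$, so the lemma gives $0 \leq D_{H}(p\|\rho) \leq 2$. The following sketch (Python with NumPy; random subnormalized measures as an assumed test setup, not from the text) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.uniform(0.0, 1.0, 10)
    rho = rng.uniform(0.0, 1.0, 10)
    p /= max(1.0, np.sum(p))      # enforce weight W(p) <= 1
    rho /= max(1.0, np.sum(rho))  # enforce weight W(rho) <= 1
    d_h = np.sum((np.sqrt(p) - np.sqrt(rho))**2)  # Hellinger divergence
    assert 0.0 <= d_h <= 2.0      # bound f(0) + f_inf = 1 + 1 = 2
```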
References
Alex, N., Hasenfuss, A., & Hammer, B. (2009). Patch clustering for massive data sets. Neurocomputing, 72(7–9), 1455–1469.
Amari, S.-I. (1985). Differential-geometrical methods in statistics. Berlin: Springer.
Amari, S.-I., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.
Basseville, M. (1988). Distance measures for signal processing and pattern recognition (Tech. Rep. 899). Paris: Institut National de Recherche en Informatique et en Automatique.
Basu, A., Harris, I., Hjort, N., & Jones, M. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.
Bertin, N., Fevotte, C., & Badeau, R. (2009). A tempering approach for Itakura-Saito non-negative matrix factorization with application to music transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1545–1548). Piscataway, NJ: IEEE Press.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.
Bunte, K., Hammer, B., Villmann, T., Biehl, M., & Wismuller, A. (2010). Exploratory observation machine (XOM) with Kullback-Leibler divergence for dimensionality reduction and visualization. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (pp. 87–92). Evere, Belgium: d-side publications.
Cichocki, A., & Amari, S.-I. (2010). Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12, 1532–1568.
Cichocki, A., Lee, H., Kim, Y.-D., & Choi, S. (2008). Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29, 1433–1440.
Cichocki, A., Zdunek, R., Phan, A., & Amari, S.-I. (2009). Nonnegative matrix and tensor factorizations. Hoboken, NJ: Wiley.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, N. (2002). Margin analysis of the LVQ algorithm. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 462–468). Cambridge, MA: MIT Press.
Csiszar, I. (1967). Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungaria, 2, 299–318.
Eguchi, S., & Kano, Y. (2001). Robustifying maximum likelihood estimation (Tech. Rep. 802). Tokyo: Tokyo Institute of Statistical Mathematics.
Erdogmus, D. (2002). Information theoretic learning: Renyi's entropy and its application to adaptive systems training. Unpublished doctoral dissertation, University of Florida.
Fichtenholz, G. (1964). Differential- und Integralrechnung (9th ed.). Berlin: Deutscher Verlag der Wissenschaften.
Frigyik, B., Srivastava, S., & Gupta, M. (2008a). Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11), 5130–5139.
Frigyik, B. A., Srivastava, S., & Gupta, M. (2008b). An introduction to functional derivatives (Tech. Rep. UWEETR-2008-0001). Seattle: Department of Electrical Engineering, University of Washington.
Fujisawa, H., & Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99, 2053–2081.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21(1–3), 173–190.
Hammer, B., & Villmann, T. (2002). Generalized relevance learning vector quantization. Neural Networks, 15(8–9), 1059–1068.
Hegde, A., Erdogmus, D., Lehn-Schiøler, T., Rao, Y., & Principe, J. (2004). Vector quantization by density matching in the minimum Kullback-Leibler-divergence sense. In Proc. of the International Joint Conference on Artificial Neural Networks (pp. 105–109). Piscataway, NJ: IEEE Press.
Heskes, T. (1999). Energy functions for self-organizing maps. In E. Oja & S. Kaski (Eds.), Kohonen maps (pp. 303–316). Amsterdam: Elsevier.
Hulle, M. M. V. (2000). Faithful representations and topographic maps. Hoboken, NJ: Wiley.
Hulle, M. M. V. (2002a). Joint entropy maximization in kernel-based topographic maps. Neural Computation, 14(8), 1887–1906.
Hulle, M. M. V. (2002b). Kernel-based topographic map formation achieved with an information theoretic approach. Neural Networks, 15, 1029–1039.
Itakura, F., & Saito, S. (1973). Analysis synthesis telephony based on the maximum likelihood method. In J. Flanagan & R. Rabiner (Eds.), Speech synthesis (pp. 289–292). Stroudsburg, PA: Dowden, Hutchinson, & Ross.
Jang, E., Fyfe, C., & Ko, H. (2008). Bregman divergences and the self organising map. In C. Fyfe, D. Kim, S.-Y. Lee, & H. Yin (Eds.), Intelligent data engineering and automated learning (pp. 452–458). New York: Springer.
Jenssen, R. (2005). An information theoretic approach to machine learning. Unpublished doctoral dissertation, University of Tromsø.
Jenssen, R., Principe, J., Erdogmus, D., & Eltoft, T. (2006). The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6), 614–629.
Kantorowitsch, I., & Akilow, G. (1978). Funktionalanalysis in normierten Raumen (2nd ed.). Berlin: Akademie-Verlag.
Kapur, J. (1994). Measures of information and their application. Hoboken, NJ: Wiley.
Kohonen, T. (1997). Self-organizing maps (2nd ext. ed.). New York: Springer.
Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lai, P., & Fyfe, C. (2009). Bregman divergences and multi-dimensional scaling. In M. Koppen, N. Kasabov, & G. Coghill (Eds.), Proceedings of the International Conference on Information Processing 2008 (pp. 935–942). New York: Springer.
Lee, J., & Verleysen, M. (2005). Generalization of the lp norm for time series and its application to self-organizing maps. In M. Cottrell (Ed.), Proc. of Workshop on Self-Organizing Maps (pp. 733–740). Paris: Sorbonne.
Lee, J., & Verleysen, M. (2007). Nonlinear dimensionality reduction. New York: Springer.
Lehn-Schiøler, T., Hegde, A., Erdogmus, D., & Principe, J. (2005). Vector quantization using information theoretic concepts. Natural Computing, 4(1), 39–51.
Liese, F., & Vajda, I. (1987). Convex statistical distances. Leipzig: Teubner-Verlag.
Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10), 4394–4412.
Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28, 84–95.
Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4), 558–569.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta divergence. Neural Computation, 14, 1859–1886.
Minka, T. (2005). Divergence measures and message passing (Tech. Rep. 173). Cambridge, UK: Microsoft Research.
Mwebaze, E., Schneider, P., Schleif, F.-M., Haase, S., Villmann, T., & Biehl, M. (2010). Divergence based learning vector quantization. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (pp. 247–252). Evere, Belgium: D-side.
Nielsen, F., & Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), 2882–2903.
Principe, J. C. III, & Xu, D. (2000). Information theoretic learning. In S. Haykin & J. Fisher (Eds.), Unsupervised adaptive filtering. Hoboken, NJ: Wiley.
Qiao, Y., & Minematsu, N. (2008). f-divergence is a generalized invariant measure between distributions. In INTERSPEECH—Proc. of the Annual Conference of the International Speech Communication Association (pp. 1349–1352). N.p.: International Speech Communication Association.
Ramsay, J., & Silverman, B. (2006). Functional data analysis (2nd ed.). New York: Springer.
Renyi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Renyi, A. (1970). Probability theory. Amsterdam: North-Holland.
Rossi, F., Delannay, N., Conan-Gueza, B., & Verleysen, M. (2005). Representation of functional data in neural networks. Neurocomputing, 64, 183–210.
Santos-Rodriguez, R., Guerrero-Curieses, A., Alaiz-Rodriguez, R., & Cid-Sueiro, J. (2009). Cost-sensitive learning based on Bregman divergences. Machine Learning, 76(2–3), 271–285.
Sato, A., & Yamada, K. (1996). Generalized learning vector quantization. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 423–429). Cambridge, MA: MIT Press.
Schneider, P., Biehl, M., & Hammer, B. (2009). Hyperparameter learning in robust soft LVQ. In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks (pp. 517–522). Evere, Belgium: D-side.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–432.
Taneja, I., & Kumar, P. (2004). Relative information of type s, Csiszar's f-divergence, and information inequalities. Information Sciences, 166, 105–125.
Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. of the International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
Villmann, T. (2007). Sobolev metrics for learning of functional data—mathematical and theoretical aspects (Machine Learning Reports 1, 1–15). Available online at http://www.uni-leipzig.de/compint/mlr/mlr 01 2007.pdf.
Villmann, T., Haase, S., Schleif, F.-M., & Hammer, B. (2010). Divergence based online learning in vector quantization. In L. Rutkowski, W. Duch, J. Kaprzyk, J. Korbicz, & R. Tadeusiewicz (Eds.), Proc. of the International Conference on Artificial Intelligence and Soft Computing. New York: Springer.
Villmann, T., Haase, S., Simmuteit, S., Haase, M., & Schleif, F.-M. (2010). Functional vector quantization based on divergence learning. Ulmer Informatik-Berichte, 2010-05, 8–11.
Villmann, T., Hammer, B., Schleif, F.-M., Geweniger, T., & Herrmann, W. (2006). Fuzzy classification by fuzzy labeled neural gas. Neural Networks, 19, 772–779.
Villmann, T., Hammer, B., Schleif, F.-M., Hermann, W., & Cottrell, M. (2008). Fuzzy classification using information theoretic learning vector quantization. Neurocomputing, 71, 3070–3076.
Villmann, T., & Schleif, F.-M. (2009). Functional vector quantization by neural maps. In J. Chanussot (Ed.), Proceedings of First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (pp. 1–4). Piscataway, NJ: IEEE Press.
Villmann, T., Schleif, F.-M., Kostrzewa, M., Walch, A., & Hammer, B. (2008). Classification of mass-spectrometric data in clinical proteomics using learning vector quantization methods. Briefings in Bioinformatics, 9(2), 129–143.
Wismuller, A. (2009). The exploration machine: A novel method for data visualization. In J. Principe & R. Miikkulainen (Eds.), Advances in self-organizing maps—Proceedings of the 7th International Workshop (pp. 344–352). New York: Springer.
Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28, 149–159.
Received February 26, 2010; accepted October 5, 2010.