LETTER Communicated by Andrzej Cichocki

Divergence-Based Vector Quantization

Thomas Villmann and Sven Haase
University of Applied Sciences Mittweida, Department of Mathematics, Natural and Computer Sciences, 09648 Mittweida, Germany

Neural Computation 23, 1343–1392 (2011), © 2011 Massachusetts Institute of Technology

Supervised and unsupervised vector quantization methods for classification and clustering traditionally use dissimilarities, frequently taken as Euclidean distances. In this article, we investigate the applicability of divergences instead, focusing on online learning. We deduce the mathematical fundamentals for their utilization in gradient-based online vector quantization algorithms. This relies on the generalized derivatives of the divergences known as Fréchet derivatives in functional analysis, which reduce in finite-dimensional problems to partial derivatives in a natural way. We demonstrate the application of this methodology for widely applied supervised and unsupervised online vector quantization schemes, including self-organizing maps, neural gas, and learning vector quantization. Additionally, principles for hyperparameter optimization and relevance learning for parameterized divergences in the case of supervised vector quantization are given to achieve improved classification accuracy.

1 Introduction

Supervised and unsupervised vector quantization for classification and clustering is strongly associated with the concept of dissimilarity, usually judged in terms of distances. The most common choice is the Euclidean metric. Recently, however, alternative dissimilarity measures have become attractive for advanced data processing. Examples are functional metrics like Sobolev distances or kernel-based dissimilarity measures (Villmann & Schleif, 2009; Lee & Verleysen, 2007). These metrics take the functional structure of the data into account (Lee & Verleysen, 2005; Ramsay & Silverman, 2006; Rossi, Delannay, Conan-Guez, & Verleysen, 2005; Villmann, 2007).

Information theory–based vector quantization approaches have been proposed considering divergences for clustering (Banerjee, Merugu, Dhillon, & Ghosh, 2005; Jang, Fyfe, & Ko, 2008; Lehn-Schiøler, Hegde, Erdogmus, & Principe, 2005; Hegde, Erdogmus, Lehn-Schiøler, Rao, & Principe, 2004). For other data processing methods like multidimensional scaling (MDS; Lai & Fyfe, 2009), stochastic neighbor embedding (Maaten & Hinton, 2008), blind source separation (Minami & Eguchi, 2002), or nonnegative matrix factorization (Cichocki, Lee, Kim, & Choi, 2008), divergence-based approaches are also introduced. In prototype-based classification, first approaches using information-theoretic concepts have been proposed (Erdogmus, 2002; Torkkola, 2003; Villmann, Hammer, Schleif, Hermann, & Cottrell, 2008).

Yet a systematic analysis of prototype-based clustering and classification relying on divergences has not yet been given. Further, the existing approaches usually are carried out in batch mode for optimization but are not available for online learning, which requires calculating the derivatives of the underlying metrics (i.e., divergences).

In this letter, we offer a systematic approach for divergence-based vector quantization using divergence derivatives. For this purpose, important but general classes of divergences are identified, widely following and extending the scheme introduced by Cichocki, Zdunek, Phan, and Amari (2009). The mathematical framework for functional derivatives of continuous divergences is given by the functional-analytic generalization of common derivatives, the concept of Fréchet derivatives (Frigyik, Srivastava, & Gupta, 2008b; Kantorowitsch & Akilow, 1978). This can be seen as a generalization of partial derivatives for discrete variants of the divergences. The functional approach is preferred here for clarity. Yet it also offers greater flexibility in specific variants of functional data processing (Villmann, Haase, Simmuteit, Haase, & Schleif, 2010).

After characterizing the different classes of divergences and introducing Fréchet derivatives, we apply this framework to several divergences and divergence classes to obtain generalized derivatives, which can be used for online learning in divergence-based methods for supervised and unsupervised vector quantization as well as other gradient-based approaches. We explicitly explore the derivatives to provide examples.

Then we consider some of the most prominent approaches for unsupervised as well as supervised prototype-based vector quantization in the light of divergence-based online learning using Fréchet derivatives, including self-organizing maps (SOM), neural gas (NG), and generalized learning vector quantization (GLVQ). For the supervised GLVQ approach, we also provide a gradient learning scheme, hyperparameter adaptation, for optimizing parameters that occur in the case of parameterized divergences.

The focus of the letter is mainly on giving a unified framework for the application of a wide range of divergences and classes thereof in gradient-based online vector quantization, together with their mathematical foundation. We formulate the problem in a functional manner following the approaches in Frigyik et al. (2008b), Csiszár (1967), and Liese and Vajda (2006). This allows a compact description of the mathematical theory based on the concept of Fréchet derivatives. We also note that the functional approach includes a larger class of divergence functionals than the discrete (pointwise) approach, as Frigyik, Srivastava, and Gupta (2008a) point out. Beside these extensions, the functional approach using Fréchet derivatives obviously reduces to partial derivatives in the discrete case. We therefore prefer the functional approach in this letter.

However, as a proof of concept, we show for several classes of parameterized divergences their utilization in SOM learning for an artificial but illustrative example, in comparison to standard Euclidean distance learning.

2 Characterization of Divergences

Generally, divergences are functionals designed for determining a similarity between nonnegative integrable measure functions p and ρ with a domain V and the constraints p(x) ≤ 1 and ρ(x) ≤ 1 for all x ∈ V. We denote such measure functions as positive measures. The weight of the functional p is defined as

W(p) = ∫_V p(x) dx.   (2.1)

Positive measures p with weight W(p) = 1 are denoted as (probability) density functions.¹

Divergences D(p||ρ) are defined as functionals that have to be nonnegative and zero iff p ≡ ρ, except on a set of measure zero. Further, D(p||ρ) is required to be convex with respect to the first argument. Yet divergences are neither necessarily symmetric nor have to fulfill the triangle inequality, as is required for metrics. According to the classification given in Cichocki et al. (2009), one can distinguish at least three main classes of divergences: Bregman divergences, Csiszár's f-divergences, and γ-divergences, emphasizing different properties. We offer some basic properties of these but do not go into detail, because this would be outside the scope of the letter. (For detailed property investigations, see Cichocki & Amari, 2010, and Cichocki et al., 2009.)

We generally assume that p and ρ are positive measures (densities) that are not necessarily normalized. In the case of (normalized) densities, we explicitly refer to these as probability densities.

2.1 Bregman Divergences. Bregman divergences are defined by generating convex functions Φ in the following way, using a functional interpretation (Bregman, 1967; Frigyik et al., 2008b).

Let Φ be a strictly convex real-valued function with domain L (the Lebesgue-integrable functions). Further, Φ is assumed to be twice continuously Fréchet differentiable (Kantorowitsch & Akilow, 1978). A Bregman divergence is defined as a mapping D^B_Φ : L × L → R⁺ with

D^B_Φ(p||ρ) = Φ(p) − Φ(ρ) − (δΦ(ρ)/δρ)[p − ρ],   (2.2)

whereby δΦ(ρ)/δρ is the Fréchet derivative of Φ with respect to ρ (see section 3.1).

Figure 1: Illustration of the Bregman divergence D^B_Φ(p||ρ) as the vertical distance between p and the tangential hyperplane to the graph of Φ at point ρ, taking p and ρ as points in a functional space.

¹Each set F = {f} of arbitrary nonnegative integrable functionals f with domain V can be transformed into a set of positive measures simply by p = f/c with c = sup_{f∈F} [W(f)].

The Bregman divergence D^B_Φ(p||ρ) can be interpreted as a measure of convexity of the generating function Φ. Taking p and ρ as points in a functional space, D^B_Φ(p||ρ) plays the role of the vertical distance between p and the tangential hyperplane to the graph of Φ at point ρ, which is illustrated in Figure 1.
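To make the definition concrete for discrete vectors (the setting used later in section 4), the following minimal Python sketch evaluates a Bregman divergence from a given generating function and its gradient; the function names are ours, not from the letter, and Φ(f) = Σ f² is chosen so that the squared Euclidean distance is recovered.

```python
import numpy as np

def bregman_divergence(p, rho, phi, grad_phi):
    """Discrete Bregman divergence D_Phi(p||rho) for vectors p, rho.

    phi      : generating convex function, maps a vector to a scalar
    grad_phi : gradient of phi, maps a vector to a vector
    """
    p, rho = np.asarray(p, float), np.asarray(rho, float)
    return phi(p) - phi(rho) - np.dot(grad_phi(rho), p - rho)

# Example: Phi(f) = sum(f^2) yields the squared Euclidean distance.
phi_sq = lambda f: np.sum(f ** 2)
grad_phi_sq = lambda f: 2.0 * f

p = np.array([0.2, 0.5, 0.3])
rho = np.array([0.3, 0.4, 0.3])
print(bregman_divergence(p, rho, phi_sq, grad_phi_sq))  # equals np.sum((p - rho)**2)
```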

Bregman divergences are linear with respect to the generating function Φ:

D^B_{Φ₁+λΦ₂}(p||ρ) = D^B_{Φ₁}(p||ρ) + λ · D^B_{Φ₂}(p||ρ).

Further, D^B_Φ(p||ρ) is invariant under affine transformations Γ(q) = Φ(q) + Λ_g[q] + ξ of the generating function, for positive measures g and q, with the requirement that Γ(q), Φ(q), and Λ_g[q] are not independent but have to be related according to

Λ_g[q] = (δΓ(g)/δg) · q − (δΦ(g)/δg) · q.

Further, Λ_g is supposed to be a linear operator independent of q (Frigyik et al., 2008a), and ξ is a scalar. In that case,

D^B_Γ(p||ρ) = D^B_Φ(p||ρ)

is valid. Further, the generalized Pythagorean theorem holds for any triple p, ρ, τ of positive measures:

D^B_Φ(p||τ) = D^B_Φ(p||ρ) + D^B_Φ(ρ||τ) + (δΦ(ρ)/δρ)[p − ρ] − (δΦ(τ)/δτ)[p − ρ].

The sensitivity of a Bregman divergence at p is defined as

s(p, τ) = ∂²D^B_Φ(p||p + ατ)/∂α² |_{α=0}   (2.3)
        = −τ (δ²Φ(p)/δp²) τ,

with τ ∈ L and the restriction that ∫ τ(x) dx = 0 (Santos-Rodríguez, Guerrero-Curieses, Alaiz-Rodríguez, & Cid-Sueiro, 2009). Note that δ²Φ(p)/δp² is the Hessian of the generating function. The sensitivity s(p, τ) measures the velocity of change of the divergence at point p in the direction of τ.

A last property mentioned here is an optimality one (Banerjee et al., 2005). Given a set S of positive measures p with the (functional) mean μ = E[p ∈ S] and the additional restriction that μ lies in the relative interior of S,² then for a given p ∈ S, the unique minimizer of E_p[D^B_Φ(p||ρ)] is ρ = μ. The inverse direction of this statement is also true: if E_p[D^B_F(p||ρ)] is minimal for ρ = μ, then D^B_F(p||ρ) is a Bregman divergence. This property predestinates Bregman divergences for clustering problems.

²If S follows a statistical distribution with existing functional expectation value E_S, then the mean μ can be replaced by E_S.

Finally, we give some important examples:

• Generalized Kullback-Leibler divergence for non-normalized p and ρ (Cichocki et al., 2009):

  D_GKL(p||ρ) = ∫ p log(p/ρ) dx − ∫ (p − ρ) dx   (2.4)

  with the generating function

  Φ(f) = ∫ (f · log f − f) dx.

  If p and ρ are normalized densities (probability densities), D_GKL(p||ρ) reduces to the usual Kullback-Leibler divergence (Kullback & Leibler, 1951; Kapur, 1994),

  D_KL(p||ρ) = ∫ p log(p/ρ) dx,   (2.5)

  which is related to the Shannon entropy (Shannon, 1948),

  H_S(p) = −∫ p log(p) dx,   (2.6)

  via

  D_KL(p||ρ) = V_S(p, ρ) − H_S(p),

  where

  V_S(p, ρ) = −∫ p log(ρ) dx

  is Shannon's cross-entropy.

• Itakura-Saito divergence (Itakura & Saito, 1973),

  D_IS(p||ρ) = ∫ [ p/ρ − log(p/ρ) − 1 ] dx,   (2.7)

  based on the Burg entropy,

  H_B(p) = −∫ log(p) dx,

  which also serves as the generating function

  Φ(f) = H_B(f).

  The Itakura-Saito divergence is also known as the negative cross-Burg entropy and fulfills the scale-invariance property, that is, D_IS(c·p||c·ρ) = D_IS(p||ρ). So the same relative weight is given to low- and high-energy components of p (Bertin, Fevotte, & Badeau, 2009). Due to this, the Itakura-Saito divergence is frequently applied in image processing and sound processing.

• The Euclidean distance in terms of a Bregman divergence is obtained by the generating function

  Φ(f) = ∫ f² dx.

  We extend this definition and introduce the parameterized version,

  Φ_η(f) = ∫ f^η dx,

  defining the η-divergence, also known as the norm-like divergence (Nielsen & Nock, 2009):

  D_η(p||ρ) = ∫ p^η + (η − 1) · ρ^η − η · p · ρ^(η−1) dx,   (2.8)

  which converges to the Euclidean distance for η → 2. To ensure the convexity of Φ(f), the restriction to η > 1 is required. (A discrete-vector sketch of these example divergences is given directly after this list.)
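As announced above, a discrete-vector sketch of these example divergences, with sums replacing the integrals (illustrative code, not part of the original letter):

```python
import numpy as np

def d_gkl(p, rho):
    """Generalized Kullback-Leibler divergence, equation 2.4 (discrete)."""
    return np.sum(p * np.log(p / rho)) - np.sum(p - rho)

def d_itakura_saito(p, rho):
    """Itakura-Saito divergence, equation 2.7 (discrete)."""
    r = p / rho
    return np.sum(r - np.log(r) - 1.0)

def d_eta(p, rho, eta=2.0):
    """eta-divergence, equation 2.8 (discrete); eta = 2 gives the squared Euclidean distance."""
    return np.sum(p ** eta + (eta - 1.0) * rho ** eta - eta * p * rho ** (eta - 1.0))

p = np.array([0.1, 0.6, 0.3])
rho = np.array([0.2, 0.5, 0.3])
print(d_gkl(p, rho), d_itakura_saito(p, rho), d_eta(p, rho, eta=2.0))
```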

If we assume that p and ρ are positive measures, then an important subset of Bregman divergences belongs to the class of β-divergences (Eguchi & Kano, 2001), which are defined, following Cichocki et al. (2009), as

D_β(p||ρ) = ∫ p · (p^(β−1) − ρ^(β−1))/(β − 1) dx − ∫ (p^β − ρ^β)/β dx   (2.9)
          = ∫ [ p^β · (1/(β − 1) − 1/β) − ρ^(β−1) · (p/(β − 1) − ρ/β) ] dx,   (2.10)

with β ≠ 1 and β ≠ 0, and with the generating function

Φ(f) = (f^β − β · f + β − 1)/(β(β − 1)).

In the limit β → 1, the divergence D_β(p||ρ) becomes the generalized Kullback-Leibler divergence (see equation 2.4).³ The limit β → 0 gives the Itakura-Saito divergence (see equation 2.7). Further, β-divergences are equivalent to the density power divergences D̂_β introduced in Basu, Harris, Hjort, and Jones (1998) via

D̂_β(p||ρ) = (1/(1 + β)) · D_β(p||ρ).

Obviously, the η-divergence (see equation 2.8) is a rescaled version of the β-divergence (for η = β):

D_η(p||ρ) = β · (β − 1) · D_β(p||ρ).

Thus, we see that for β = 2, the β-divergence D_β(p||ρ) becomes (half) the Euclidean distance.

2.2 Csiszár's f-Divergences. Csiszár's f-divergences are defined for real-valued, convex, continuous functions f ∈ F with f(1) = 0 (without loss of generality), whereby

F = {g | g : [0, ∞) → R, g convex}.

The f-divergences D_f for positive measures p and ρ are given by

D_f(p||ρ) = ∫ ρ · f(p/ρ) dx,   (2.11)

with the definitions 0 · f(0/0) = 0 and 0 · f(a/0) = lim_{x→0} x · f(a/x) = lim_{u→∞} a · f(u)/u (Csiszár, 1967; Liese & Vajda, 2006; Taneja & Kumar, 2004). f is called the determining function for D_f(p||ρ). It corresponds to a generalized f-entropy (Cichocki et al., 2009) of the form

H_f(p) = −∫ f(p) dx   (2.12)

via

H_f(p) = −D_f(p||I) + c,   (2.13)

with I being the constant function of value 1 and c a divergence-depending constant (Cichocki & Amari, 2010).

³The relations (p^γ − ρ^γ)/γ → log(p/ρ) and (p^γ − 1)/γ → log p for γ → 0 hold.

The f-divergence D_f can be interpreted as an average (with respect to ρ) of the likelihood ratio p/ρ, describing the change rate of p with respect to ρ, weighted by the determining function f. D_f(p||ρ) is jointly convex in both p and ρ. Further, f defines an equivalence class in the sense that D_f(p||ρ) = D_f̃(p||ρ) iff f̃(x) = f(x) + c · (x − 1) for c ∈ R; that is, D_f(p||ρ) is invariant under a linear shift of the determining function f. For f-divergences, a certain kind of symmetry can be stated. Let f, f* ∈ F and f* be the conjugate function of f, that is, f*(x) = x · f(1/x) for x ∈ (0, ∞). Then the relation D_f(p||ρ) = D_{f*}(ρ||p) is valid iff the conjugate differs from the original by a linear shift as above: f(x) = f*(x) + c · (x − 1). A symmetric divergence can be obtained for an arbitrary convex function g ∈ F using its conjugate g* for the definition f = g + g* as a determining function. Further, the conjugate is important for an upper bound of the divergence. Let u = p/ρ, and let p as well as ρ be densities. Then the f-divergence is bounded by

0 ≤ D_f(p||ρ) ≤ lim_{u→0+} { f(u) + f*(u) }   (2.14)

if the limit exists, as was shown in Liese and Vajda (1987). Yet this statement can be extended to p and ρ being positive measures:

Lemma 1. Let p and ρ be positive measures. Then the bounds given in equation 2.14 are still valid.

Proof. The proof is given in appendix B.

An important and characterizing property is the monotonicity with respect to coarse graining of the underlying domain D of the positive measures p and ρ, which is similar to the monotonicity of the Fisher metric (Amari & Nagaoka, 2000). Let K = {κ(y|x) ≥ 0, x ∈ D, y ∈ D_y}, with D_y being the range of y. κ describes a transition probability density, that is, ∫ κ(y|x) dy = 1 holds ∀x ∈ D. Denoting the positive measures of y derived from p(x) and ρ(x) by p_κ(y) and ρ_κ(y), the monotonicity is expressed by D_f(p||ρ) ≥ D_f(p_κ||ρ_κ).⁴ Further, an isomorphism can be stated for f-divergences in the following way. Let

h : x ↦ y   (2.15)

be an invertible function transforming positive measures p₁(x) and ρ₁(x) to p₂(y) and ρ₂(y). Then D_f(p₁||ρ₁) = D_f(p₂||ρ₂) holds, and the pairs (p₁, ρ₁) and (p₂, ρ₂) are called isomorph (Liese & Vajda, 1987). Conversely, if a measure D(p||ρ) = ∫ ρ(x) · G(p(x), ρ(x)) dx for an integrable function G is invariant under invertible transformations h, then D is an f-divergence (Qiao & Minematsu, 2008). This isomorphism, as well as the monotonicity, recommend f-divergences for applications in speech, signal, and pattern recognition (Basseville, 1988; Qiao & Minematsu, 2008). Finally, Cichocki et al. (2009) suggested a generalization of the f-divergences D_f. In this divergence, f is no longer required to be convex. It is proposed to be

D^G_f(p||ρ) = c_f ∫ (p − ρ) dx + ∫ ρ · f(p/ρ) dx   (2.16)

with c_f = f′(1) ≠ 0, and denoted as a generalized f-divergence. As a consequence of this relaxation of the convexity condition, in the case of p and ρ being probability densities, the first term vanishes, such that the usual form of f-divergences is obtained. Thus, as a famous example, the Hellinger divergence (Taneja & Kumar, 2004) is

D_H(p||ρ) = (1/2) ∫ (√p − √ρ)² dx,   (2.17)

with the determining function f(u) = 2(1 − √u) for u = p/ρ. According to Cichocki et al. (2009), D_H(p||ρ) is a properly defined f-divergence only for probability densities p and ρ.

⁴The equality holds iff the conditional densities p_κ(x|y) = p(x)·κ(y|x)/p_κ(y) and ρ_κ(x|y) = ρ(x)·κ(y|x)/ρ_κ(y) are identical (see Amari & Nagaoka, 2000).

As with the β-divergences in the case of Bregman divergences, one can identify an important subset of the f-divergences, the so-called α-divergences, according to the definition given in Cichocki et al. (2009):

D_α(p||ρ) = (1/(α(α − 1))) · ∫ [ p^α ρ^(1−α) − α · p + (α − 1) · ρ ] dx   (2.18)
          = (1/(α(α − 1))) · ∫ [ ρ · ((p/ρ)^α + (α − 1)) − α · p ] dx,   (2.19)

with the generating f-function

f(u) = u · (u^(α−1) − 1)/(α² − α) + (1 − u)/α

and u = p/ρ and α > 0. In the limit α → 1, the generalized Kullback-Leibler divergence D_GKL (see equation 2.4) is obtained. Further, Cichocki et al. (2009) state that β-divergences can be generated from α-divergences by applying the nonlinear transforms

p → p^(β+2) and ρ → ρ^(β+2) with α = 1/(β + 1).

In addition to the general properties of the f-divergences stated here, one can derive a characteristic behavior of the α-divergences directly from equation 2.18, depending on the choice of the parameter α (Minka, 2005). For α ≪ 0, the minimization of D_α(p||ρ) to estimate ρ(x) may exclude modes of the target p(x). Further, for α ≤ 0, the α-divergence is zero-forcing (i.e., p(x) = 0 forces ρ(x) = 0), while for α ≥ 1, it is zero-avoiding (i.e., ρ(x) > 0 whenever p(x) > 0). For α → ∞, ρ(x) covers p(x) completely, and the α-divergence is called inclusive in that case.
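A small numerical illustration of this behavior, under the discrete convention of replacing integrals by sums (the code and the toy vectors are ours):

```python
import numpy as np

def d_alpha(p, rho, alpha):
    """alpha-divergence, equation 2.18, in the discrete setting (alpha != 0, 1)."""
    p, rho = np.asarray(p, float), np.asarray(rho, float)
    integrand = p ** alpha * rho ** (1.0 - alpha) - alpha * p + (alpha - 1.0) * rho
    return np.sum(integrand) / (alpha * (alpha - 1.0))

# p has (almost) no mass in the first component; rho keeps mass there.
p   = np.array([1e-8, 0.5, 0.5])
rho = np.array([0.2, 0.4, 0.4])
for a in (-1.0, 0.5, 2.0):
    # For alpha <= 0 the divergence explodes (zero-forcing); for larger alpha it stays moderate.
    print(a, d_alpha(p, rho, a))
```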

The Tsallis divergence is a widely applied divergence related to the α-divergence (see equation 2.18); however, it is defined only for probability densities. It is defined as

D^T_α(p||ρ) = −∫ p · log_α(ρ/p) dx   (2.20)

with the convention

log_α(z) = (z^(1−α) − 1)/(1 − α)   (2.21)

such that

D^T_α(p||ρ) = (1/(1 − α)) · (1 − ∫ p^α ρ^(1−α) dx)   (2.22)

and α ≠ 1. Obviously, this is a rescaled version of the α-divergence (see equation 2.18), which holds only for probability densities (Cichocki & Amari, 2010):

D^T_α(p||ρ) = α · D_α(p||ρ).   (2.23)

The Tsallis divergence is based on the Tsallis entropy,

H^T_α(p) = −(1/(α − 1)) · (∫ p^α dx − 1)   (2.24)
         = ∫ p log_α(1/p) dx,   (2.25)

with log_α as defined in equation 2.21. In the limit α → 1, H^T_α(p) becomes the Shannon entropy (see equation 2.6), and the divergence D^T_α(p||ρ) converges to the Kullback-Leibler divergence (see equation 2.5).

Further, the α-divergences are closely related to the generalized Rényi divergences, defined as (Amari, 1985; Cichocki et al., 2009):

D^GR_α(p||ρ) = (1/(α − 1)) · log( ∫ [ p^α ρ^(1−α) − α · p + (α − 1) · ρ ] dx + 1 )   (2.26)

for positive measures ρ and p. Lemma 1 can be used to write the generalized Rényi divergence in terms of the α-divergence:⁵

D^GR_α(p||ρ) = (1/(α − 1)) · log( 1 + α · (α − 1) · D_α(p||ρ) ).   (2.27)

For probability densities, D^GR_α(p||ρ) reduces to the usual Rényi divergence (Rényi, 1961, 1970):

D^R_α(p||ρ) = (1/(α − 1)) · log( ∫ p^α ρ^(1−α) dx ).   (2.28)

⁵A careful transformation of the parameter α is required for exact transformations between both divergences. For details, see Amari (1985) and Cichocki et al. (2009). Further, this statement was given there without proving the bounds of the underlying f-divergence for positive measures as given in this letter by lemma 1.

The divergence D^R_α(p||ρ) is based on the Rényi entropy

H^R_α(p) = −(1/(α − 1)) · log( ∫ p^α dx )   (2.29)

via equation 2.13. The Rényi entropy fulfills the additivity property for independent probabilities p and q:

H^R_α(p × q) = H^R_α(p) + H^R_α(q).

Further, the entropy H^R_α(p) is related to the Tsallis entropy (see equation 2.25) by

H^R_α(p) = −(1/(α − 1)) · log( 1 + (1 − α) · H^T_α(p) ),

which, however, leads in consequence to a different subadditivity property,

H^T_α(p × q) = H^T_α(p) + H^T_α(q) + (1 − α) · H^T_α(p) · H^T_α(q),

for α ≠ 1.

2.3 γ-Divergences. A class of divergences that is robust with respect to outliers has been proposed by Fujisawa and Eguchi (2008).⁶ Called γ-divergences, they are defined for positive measures ρ and p as

D_γ(p||ρ) = log[ (∫ p^(γ+1) dx)^(1/(γ(γ+1))) · (∫ ρ^(γ+1) dx)^(1/(γ+1)) / (∫ p · ρ^γ dx)^(1/γ) ]   (2.30)
          = (1/(γ + 1)) · log[ (∫ p^(γ+1) dx)^(1/γ) · (∫ ρ^(γ+1) dx) ] − log[ (∫ p · ρ^γ dx)^(1/γ) ].   (2.31)

The divergence D_γ(p||ρ) is invariant under scalar multiplication with positive constants c₁ and c₂:

D_γ(p||ρ) = D_γ(c₁ · p||c₂ · ρ).

⁶The divergence D_γ(p||ρ) is proposed to be robust for γ ∈ [0, 1], with the existence of D_{γ=0} in the limit γ → 0. A detailed analysis of robustness is given in Fujisawa and Eguchi (2008).

The equation D_γ(p||ρ) = 0 holds only if p = c · ρ (c > 0) in the case of positive measures. Yet for probability densities, c = 1 is required. In contrast to the f-divergences, an isomorphism can be stated here only for h-transformations (see equation 2.15) that are more strictly assumed to be affine.

As for Bregman divergences, a modified Pythagorean relation between positive measures can be stated for special choices of positive measures p, ρ, τ. Let p be a distortion of ρ, defined as a convex combination with a positive distortion measure δ:

p_ε(x) = (1 − ε) · ρ(x) + ε · δ(x).

Further, a positive measure g is denoted as δ-consistent if

ν_g = ( ∫ δ(x) g(x)^α dx )^(1/α)

is sufficiently small for large α > 0. If two positive measures ρ and τ are δ-consistent according to a distortion measure δ, then the Pythagorean relation approximately holds for ρ, τ, and the distortion p_ε of ρ,

Δ(p_ε, ρ, τ) = D_γ(p_ε||τ) − D_γ(p_ε||ρ) − D_γ(ρ||τ) = O(ε ν^γ),

with ν = max{ν_ρ, ν_τ}. This property implies the robustness of γ-divergences with respect to distortions according to the resulting approximation,

D_γ(p_ε||τ) ≈ D_γ(p_ε||ρ) + D_γ(ρ||τ),

and D_γ(p_ε||ρ) should be small because p_ε is assumed to be a distortion of ρ (Fujisawa & Eguchi, 2008).

In the limit γ → 0, D_γ(ρ||p) becomes the usual Kullback-Leibler divergence (see equation 2.5) D_KL(ρ̃||p̃) with the normalized densities

ρ̃ = ρ/W(ρ) and p̃ = p/W(p).

For γ = 1, the γ-divergence becomes the Cauchy-Schwarz divergence

D_CS(p||ρ) = (1/2) · log( ∫ ρ² dx · ∫ p² dx ) − log(V(p, ρ))   (2.32)

with

V(p, ρ) = ∫ p · ρ dx   (2.33)

being the cross-correlation potential. The Cauchy-Schwarz divergence D_CS(p||ρ) was introduced by Principe, Xu, and Fisher (2000) considering the Cauchy-Schwarz inequality for norms. It is based on the quadratic Rényi entropy H^R_2(p) from equation 2.29 (Jenssen, 2005). Obviously, D_CS(p||ρ) is symmetric. It is frequently applied for Parzen window estimation and is particularly suitable for spectral clustering as well as for related graph cut problems (Jenssen, Principe, Erdogmus, & Eltoft, 2006).
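The γ-divergence of equation 2.30 and the Cauchy-Schwarz special case can be sketched in discrete form as follows (illustrative code; the scale-invariance check mirrors the property stated above):

```python
import numpy as np

def d_gamma(p, rho, gamma):
    """gamma-divergence, equation 2.30, with sums replacing integrals (gamma > 0)."""
    p, rho = np.asarray(p, float), np.asarray(rho, float)
    a = np.sum(p ** (gamma + 1.0))
    b = np.sum(rho ** (gamma + 1.0))
    c = np.sum(p * rho ** gamma)
    return (np.log(a) / (gamma * (gamma + 1.0))
            + np.log(b) / (gamma + 1.0)
            - np.log(c) / gamma)

def d_cauchy_schwarz(p, rho):
    """Cauchy-Schwarz divergence, equation 2.32."""
    return 0.5 * np.log(np.sum(p ** 2) * np.sum(rho ** 2)) - np.log(np.sum(p * rho))

p = np.array([0.1, 0.6, 0.3])
rho = np.array([0.25, 0.5, 0.25])
print(d_gamma(p, rho, gamma=1.0))      # equals the Cauchy-Schwarz divergence
print(d_cauchy_schwarz(p, rho))
# Scale invariance: multiplying p by a positive constant leaves D_gamma unchanged.
print(np.isclose(d_gamma(3.0 * p, rho, 1.0), d_gamma(p, rho, 1.0)))
```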

3 Derivatives of Divergences: A Functional Analytic Approach

In this section, we provide the mathematical formalism of generalized derivatives for functionals p and ρ, known as Fréchet derivatives or functional derivatives. First, we briefly reconsider the theory of functional derivatives. Then we investigate the divergence classes within this framework. In particular, we explain their Fréchet derivatives.

3.1 Functional (Fréchet) Derivatives. Suppose X and Y are Banach spaces, U ⊂ X is open, and F : X → Y. F is called Fréchet differentiable at x ∈ X if there exists a bounded linear operator δF[x]/δx : X → Y such that, for h ∈ X,

lim_{h→0} ‖F(x + h) − F(x) − (δF[x]/δx)[h]‖_Y / ‖h‖_X = 0.

This general definition can be focused on functional mappings. Let L be a functional mapping from a linear, functional Banach space B to R. Further, let B be equipped with a norm ‖·‖, and let f, h ∈ B be two functionals. The Fréchet derivative δL[f]/δf of L at point f is formally defined as

lim_{ε→0} (1/ε) · (L[f + εh] − L[f]) =: (δL[f]/δf)[h],

with (δL[f]/δf)[h] linear in h. The existence and continuity of the limit are equivalent to the existence and continuity of the derivative. (For a detailed introduction, see Kantorowitsch & Akilow, 1978.)

If L is linear, then L[f + εh] − L[f] = εL[h] and, hence, (δL[f]/δf)[h] = L[h]. Further, an analogon of the chain rule known from differential calculus can be stated: let F : R → R be a continuously differentiable mapping. We consider the functional

L[f] = ∫ F(f(x)) dx.

Then the Fréchet derivative (δL[f]/δf)[h] is determined by the derivative F′, as can be seen from

(1/ε) · (L[f + εh] − L[f]) = (1/ε) · ∫ F(f(x) + εh(x)) − F(f(x)) dx
                           = (1/ε) · ∫ F′(f(x)) · εh(x) + O(ε² h(x)²) dx
                           → ∫ F′(f(x)) · h(x) dx   for ε → 0,

using the linearity of the integral operator. This property motivates an important remark about divergences, which can be seen as special integral operators:

Remark 1. Let L_g be an integral operator L_g[f] = ∫ F_g(f(x)) dx depending on a fixed functional g ∈ B. Then the Fréchet derivative (δL_g[f]/δf)[h] = ∫ F′_g(f(x)) · h(x) dx is determined by the integral kernel F′_g(f(x)) = Q(g(x), f(x)), which is a function in x. Therefore, the Fréchet derivative δL_g[f]/δf is frequently simply identified with Q(g(x), f(x)) and written as δL_g[f]/δf = Q(g(x), f(x)), keeping in mind its original interpretation as an integral kernel defining the integral operator. We will make use of this abbreviation in the following, considering divergences as integral operators D(p||ρ) = L_p[ρ], and write δD(p||ρ)/δρ = Q(p, ρ), also denoted here as the Fréchet derivative, for simplicity.

Finally, we remark that the Fréchet derivative in finite-dimensional spaces reduces to the usual partial derivative. In particular, it is represented in coordinates by the Jacobian matrix. Thus, the Fréchet derivative is a generalization of the directional derivative.
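This reduction can be verified numerically. The sketch below (ours; it anticipates the derivative of the generalized Kullback-Leibler divergence given as equation 3.2 in the next section) compares the analytic partial derivatives of a discrete divergence with finite differences:

```python
import numpy as np

def d_gkl(p, rho):
    """Discrete generalized Kullback-Leibler divergence (equation 2.4 with sums)."""
    return np.sum(p * np.log(p / rho)) - np.sum(p - rho)

def gkl_gradient(p, rho):
    """Analytic partial derivatives w.r.t. rho, anticipating equation 3.2: -p/rho + 1."""
    return -p / rho + 1.0

p = np.array([0.2, 0.5, 0.3])
rho = np.array([0.3, 0.4, 0.3])

# Finite-difference approximation of the partial derivatives.
eps = 1e-6
num_grad = np.zeros_like(rho)
for k in range(len(rho)):
    step = np.zeros_like(rho)
    step[k] = eps
    num_grad[k] = (d_gkl(p, rho + step) - d_gkl(p, rho - step)) / (2 * eps)

print(np.allclose(num_grad, gkl_gradient(p, rho), atol=1e-6))  # True
```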

3.2 Fréchet Derivatives for the Different Divergence Classes. We are now ready to investigate functional derivatives of divergences. In particular, we focus on Fréchet derivatives.

3.2.1 Bregman Divergences. We investigate the Fréchet derivative for the Bregman divergences (see equation 2.2) and formally obtain

δD^B_Φ(p||ρ)/δρ = δΦ(p)/δρ − δΦ(ρ)/δρ − δ[ (δΦ(ρ)/δρ)(p − ρ) ]/δρ   (3.1)

with

δ[ (δΦ(ρ)/δρ)(p − ρ) ]/δρ = (δ²Φ(ρ)/δρ²)(p − ρ) − δΦ(ρ)/δρ.

In the case of the generalized Kullback-Leibler divergence (see equation 2.4), this reads as

δD_GKL(p||ρ)/δρ = −p/ρ + 1,   (3.2)

whereas for the usual Kullback-Leibler divergence, equation 2.5,

δD_KL(p||ρ)/δρ = −p/ρ   (3.3)

is obtained. For the Itakura-Saito divergence, equation 2.7, we get

δD_IS(p||ρ)/δρ = (1/ρ²) · (ρ − p).   (3.4)

The η-divergence, equation 2.8, leads to

δD_η(p||ρ)/δρ = ρ^(η−2) · (1 − η) · η · (p − ρ),   (3.5)

which reduces in the case of η = 2 to the derivative of the Euclidean distance, −2(p − ρ), commonly used in many vector quantization algorithms, including the online variant of k-means, SOMs, NG, and so on.

Further, for the subset of β-divergences, equation 2.9, we have

δD_β(p||ρ)/δρ = −p · ρ^(β−2) + ρ^(β−1)   (3.6)
              = ρ^(β−2) · (ρ − p).   (3.7)
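In the discrete setting used later for online learning, these Fréchet derivatives become ordinary vectorized gradients with respect to a prototype w. A sketch (function names are ours):

```python
import numpy as np

def grad_gkl(v, w):
    """Equation 3.2 in discrete form: derivative of D_GKL(v||w) w.r.t. w."""
    return -v / w + 1.0

def grad_itakura_saito(v, w):
    """Equation 3.4: derivative of D_IS(v||w) w.r.t. w."""
    return (w - v) / w ** 2

def grad_eta(v, w, eta=2.0):
    """Equation 3.5: derivative of D_eta(v||w) w.r.t. w; eta = 2 gives -2(v - w)."""
    return w ** (eta - 2.0) * (1.0 - eta) * eta * (v - w)

def grad_beta(v, w, beta=2.0):
    """Equation 3.7: derivative of D_beta(v||w) w.r.t. w."""
    return w ** (beta - 2.0) * (w - v)

v = np.array([0.2, 0.5, 0.3])
w = np.array([0.3, 0.4, 0.3])
print(grad_eta(v, w, eta=2.0), -2.0 * (v - w))  # identical for eta = 2
```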

3.2.2 f-Divergences. For f-divergences, equation 2.11, the Fréchet derivative is

δD_f(p||ρ)/δρ = f(p/ρ) + ρ · (∂f(u)/∂u) · (δu/δρ)
              = f(p/ρ) + ρ · (∂f(u)/∂u) · (−p/ρ²),   (3.8)

with u = p/ρ. As a famous example, we get for the Hellinger divergence, equation 2.17,

δD_H(p||ρ)/δρ = 1 − √(p/ρ).   (3.9)

The subset of α-divergences, equation 2.18, can be handled by

δD_α(p||ρ)/δρ = −(1/α) · (p^α ρ^(−α) − 1).   (3.10)

The related Tsallis divergence D^T_α, equation 2.22, leads to the derivative

δD^T_α(p||ρ)/δρ = −(p/ρ)^α   (3.11)

depending on the parameter α. The generalized Rényi divergences, equation 2.26, are treated according to

δD^GR_α(p||ρ)/δρ = −(p^α ρ^(−α) − 1) / ( ∫ [ p^α ρ^(1−α) − α · p + (α − 1)ρ ] dx + 1 )
                 = α / ( ∫ [ p^α ρ^(1−α) − α · p + (α − 1)ρ ] dx + 1 ) · δD_α(p||ρ)/δρ,   (3.12)

which reduces to

δD^R_α(p||ρ)/δρ = −p^α ρ^(−α) / ∫ p^α ρ^(1−α) dx   (3.13)

in the case of the usual Rényi divergences, equation 2.28.

3.2.3 γ-Divergences. For the γ-divergences, we rewrite equation 2.30 as

D_γ(p||ρ) = (1/(γ + 1)) · ln F₁ − ln F₂,

with F₁ = (∫ p^(γ+1) dx)^(1/γ) · (∫ ρ^(γ+1) dx) and F₂ = (∫ p · ρ^γ dx)^(1/γ). Then we get

δD_γ(p||ρ)/δρ = (1/(γ + 1)) · (1/F₁) · δF₁/δρ − (1/F₂) · δF₂/δρ

with

δF₁/δρ = (∫ p^(γ+1) dx)^(1/γ) · δ(∫ ρ^(γ+1) dx)/δρ
       = (∫ p^(γ+1) dx)^(1/γ) · (γ + 1) · ρ^γ

and

δF₂/δρ = (1/γ) · (∫ p · ρ^γ dx)^(1/γ − 1) · δ(∫ p · ρ^γ dx)/δρ
       = (∫ p · ρ^γ dx)^(1/γ − 1) · p · ρ^(γ−1),

such that δD_γ(p||ρ)/δρ finally yields

δD_γ(p||ρ)/δρ = ρ^γ / ∫ ρ^(γ+1) dx − p · ρ^(γ−1) / ∫ p · ρ^γ dx   (3.14)
              = ρ^(γ−1) · [ ρ / ∫ ρ^(γ+1) dx − p / ∫ p · ρ^γ dx ].   (3.15)

Considering the important special case γ = 1, the Fréchet derivative of the Cauchy-Schwarz divergence, equation 2.32, is derived:

δD_CS(p||ρ)/δρ = ρ / ∫ ρ² dx − p / V(p, ρ).   (3.16)
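In discrete form, equations 3.15 and 3.16 read as follows (a sketch with illustrative names; the γ = 1 case reproduces the Cauchy-Schwarz gradient):

```python
import numpy as np

def grad_gamma(v, w, gamma):
    """Equation 3.15 with sums replacing integrals: derivative of D_gamma(v||w) w.r.t. w."""
    return w ** (gamma - 1.0) * (w / np.sum(w ** (gamma + 1.0))
                                 - v / np.sum(v * w ** gamma))

def grad_cauchy_schwarz(v, w):
    """Equation 3.16: derivative of the Cauchy-Schwarz divergence w.r.t. w."""
    return w / np.sum(w ** 2) - v / np.sum(v * w)

v = np.array([0.2, 0.5, 0.3])
w = np.array([0.3, 0.4, 0.3])
print(np.allclose(grad_gamma(v, w, gamma=1.0), grad_cauchy_schwarz(v, w)))  # True
```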

4 Divergence-Based Online Vector Quantization Using Derivatives

Supervised and unsupervised vector quantization are frequently described in terms of dissimilarities or distances. Suppose data are given as data vectors v ∈ Rⁿ. Here we focus on prototype-based vector quantization: data processing (clustering or classification) is realized using prototypes w ∈ Rⁿ as representatives, whereby the dissimilarity between data points, as well as between data and prototypes, is determined by dissimilarity measures ξ (not necessarily fulfilling triangle inequality or symmetry restrictions).

Frequently, such algorithms optimize, at least approximately, a cost function E depending on the dissimilarity between the data points and the prototypes; usually one has E = E(ξ(v_i, w_k)), with i = 1, . . . , N indexing the data and k = 1, . . . , C the prototypes. This cost function can be a variant of the usual classification error in supervised learning or a modified mean squared error of the dissimilarities ξ(v_i, w_k).

If E = E(ξ(v_i, w_k)) is differentiable with respect to ξ, and ξ is differentiable with respect to the prototype w, then stochastic gradient minimization is a widely used optimization scheme for E. This methodology implies the calculation of the dissimilarity derivatives ∂ξ/∂w_k, which now have to be considered in light of the above functional-analytic investigations for divergence measures (i.e., we replace the dissimilarity measure ξ by divergences).

Therefore, we now assume that the data vectors are discrete representations of continuous positive measures p(x) with v_i = p(x_i), i = 1, . . . , n, as required for divergences. Such data may be spectra or other frequency data occurring in many kinds of application like remote-sensing data analysis, mass spectrometry, or signal processing. Thereby, the restriction v_i ∈ [0, 1] for positive measures can be fulfilled simply by dividing all data vectors by the maximum vector entry, taken over all vectors and vector components of the data set. In the case of probability densities, a subsequent normalization to ensure ‖v‖₁ = 1 is required.

Further, we also identify the prototypes as discrete realizations of positive measures ρ(x). Then the derivative ∂ξ/∂w has to be replaced by the (abbreviated) Fréchet derivative δξ/δρ in the continuous case (see remark 1), which reduces to usual partial derivatives in the discrete case. This is formally achieved by replacing p and ρ by their vectorial counterparts v and w in the formulas of the divergences provided in section 3.2 and further translating integrals into sums.
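A possible preprocessing step implementing this convention is sketched below; the function name and the toy data are ours, and the normalization follows the description above:

```python
import numpy as np

def to_positive_measures(V, as_probability=False):
    """Scale a data matrix V (rows = vectors) into [0, 1] by the global maximum entry.

    If as_probability is True, each row is additionally normalized to unit l1-norm,
    so the rows can be treated as discrete probability densities.
    """
    V = np.asarray(V, dtype=float)
    V = V / V.max()                      # global rescaling into [0, 1]
    if as_probability:
        V = V / V.sum(axis=1, keepdims=True)
    return V

spectra = np.array([[3.0, 1.0, 6.0],
                    [2.0, 2.0, 2.0]])
print(to_positive_measures(spectra))
print(to_positive_measures(spectra, as_probability=True).sum(axis=1))  # [1., 1.]
```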

In the following, we give prominent examples of unsupervised and supervised vector quantization, which can be optimized by gradient methods using the framework already introduced.

4.1 Unsupervised Vector Quantization

4.1.1 Basic Vector Quantization. Unsupervised vector quantization is a class of algorithms for distributing prototypes W = {w_k}, k ∈ Z, w_k ∈ Rⁿ, such that data points v ∈ V ⊆ Rⁿ are faithfully represented in terms of a dissimilarity measure ξ. Thereby, C = card(Z) is the cardinality of the index set Z. More formally, the data point v is represented by that prototype w_s(v) minimizing the dissimilarity ξ(v, w_k):

v ↦ s(v) = argmin_{k∈Z} ξ(v, w_k).   (4.1)

The aim of the algorithm is to distribute the prototypes in such a way that the quantization error

E_VQ = (1/2) ∫ P(v) ξ(v, w_s(v)) dv   (4.2)

is minimized. In its simplest form, basic vector quantization (VQ) leads to a (stochastic) gradient descent on E_VQ with

Δw_s(v) = −ε · ∂ξ(v, w_s(v))/∂w_s(v)   (4.3)

for the prototype update of the winning prototype w_s(v) according to equation 4.1, also known as the online variant of the LBG algorithm (C-means; Linde, Buzo, & Gray, 1980; Zador, 1982). Here, ε is a small positive value called the learning rate. As we see, update 4.3 takes into account the derivative of the dissimilarity measure ξ with respect to the prototype. Beside the common choice of ξ being the squared Euclidean distance, the choice is given to the user with the restriction of differentiability. Hence, here we are allowed to apply divergences using derivatives in the sense of Fréchet derivatives.
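A minimal online VQ loop following equations 4.1 and 4.3, here with the discrete generalized Kullback-Leibler divergence and its derivative (equation 3.2) plugged in for ξ; the toy data and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def d_gkl(v, w):
    return np.sum(v * np.log(v / w)) - np.sum(v - w)

def grad_gkl(v, w):            # equation 3.2 in discrete form
    return -v / w + 1.0

# Toy data: rows are positive measures in R^3.
V = rng.dirichlet(alpha=[2.0, 5.0, 3.0], size=500)
W = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=4)   # 4 prototypes
eps = 0.05

for t in range(5000):
    v = V[rng.integers(len(V))]
    s = np.argmin([d_gkl(v, w) for w in W])         # winner, equation 4.1
    W[s] -= eps * grad_gkl(v, W[s])                 # update, equation 4.3
    W[s] = np.clip(W[s], 1e-8, None)                # keep prototypes positive

print(W)
```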

4.1.2 Self-Organizing Maps and Neural Gas. There are several variants of the basic vector quantization scheme to avoid local minima or to realize a projective mapping. For example, the latter can be obtained by introducing a topological structure in the index set Z and denoting this structure as A, usually a regular grid. The resulting vector quantization scheme is the self-organizing map (SOM) introduced by Kohonen (1997). The respective cost function (in the variant of Heskes, 1999) is

E_SOM = (1/(2K(σ))) ∫ P(v) Σ_{r∈A} δ_r^{s(v)} Σ_{r′∈A} h^SOM_σ(r, r′) ξ(v, w_r′) dv   (4.4)

with the so-called neighborhood function

h^SOM_σ(r, r′) = exp( −‖r − r′‖_A / (2σ²) ),

where ‖r − r′‖_A is the distance in A according to the topological structure. K(σ) is a normalization constant depending on the neighborhood range σ. For this SOM, the mapping rule, equation 4.1, is modified to

v ↦ s(v) = argmin_{r∈A} Σ_{r′∈A} h^SOM_σ(r, r′) · ξ(v, w_r′),   (4.5)

which yields in the limit σ → 0 the original mapping (see equation 4.1). The prototype update for all prototypes then is given as (Heskes, 1999)

Δw_r = −ε h^SOM_σ(r, s(v)) · ∂ξ(v, w_r)/∂w_r.   (4.6)

As above, the utilization of a divergence-based update is straightforward for SOM as well.

If the aspect of projective mapping can be ignored while keeping the neighborhood cooperativeness to avoid local minima in vector quantization, then the neural gas algorithm (NG), presented by Martinetz, Berkovich, and Schulten (1993), is an alternative to SOM. The cost function of NG to be minimized is

E_NG = (1/(2C(σ))) Σ_{j∈A} ∫ P(v) h^NG_σ(v, W, j) ξ(v, w_j) dv,   (4.7)

with

h^NG_σ(v, W, i) = exp( −k_i(v, W)/σ ),   (4.8)

with the rank function

k_i(v, W) = Σ_j θ( ξ(v, w_i) − ξ(v, w_j) ),   (4.9)

θ being the Heaviside function. The mapping is realized as in basic VQ (see equation 4.1), and the prototype update for all prototypes is similar to that of SOM:

Δw_i = −ε h^NG_σ(v, W, i) · ∂ξ(v, w_i)/∂w_i.   (4.10)

Again, the incorporation of divergences is obvious also for NG.
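One NG adaptation step according to equations 4.8 to 4.10, with a divergence in place of ξ, can be sketched as follows (illustrative code; the Heaviside-based rank of equation 4.9 is obtained by a double argsort):

```python
import numpy as np

def d_eta(v, w, eta=2.0):              # equation 2.8; eta = 2 is the squared Euclidean distance
    return np.sum(v ** eta + (eta - 1) * w ** eta - eta * v * w ** (eta - 1))

def grad_eta(v, w, eta=2.0):           # equation 3.5
    return w ** (eta - 2.0) * (1.0 - eta) * eta * (v - w)

def ng_step(v, W, eps=0.05, sigma=1.0, eta=2.0):
    """One neural gas update (equation 4.10) for all prototypes in W (rows)."""
    dists = np.array([d_eta(v, w, eta) for w in W])
    ranks = np.argsort(np.argsort(dists))          # k_i(v, W), equation 4.9
    h = np.exp(-ranks / sigma)                     # equation 4.8
    for i in range(len(W)):
        W[i] -= eps * h[i] * grad_eta(v, W[i], eta)
    return W

rng = np.random.default_rng(1)
W = rng.random((5, 3))
v = np.array([0.2, 0.5, 0.3])
print(ng_step(v, W))
```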

4.1.3 Further Vector Quantization Approaches. There exists a long list of other vector quantization approaches, like kernelized SOMs (Hulle, 2000, 2002a, 2002b), generative topographic mapping (GTM; Bishop, Svensen, & Williams, 1998), and soft topographic mapping (Graepel, Burger, & Obermayer, 1998), to name just a few. Most of them use the Euclidean metric and the respective derivatives for adaptation. Thus, the idea of divergence-based processing can be transferred to these in a similar manner.

A somewhat reverse SOM, the exploration machine (XOM; Wismüller, 2009), has been proposed recently for embedding data into an embedding space S. The XOM can be seen as a projective, structure-preserving mapping of the input data into the embedding space and therefore shows similarities to MDS. In the XOM approach, the data points v_k ∈ V ⊆ Rⁿ, k = 1, . . . , N, are uniquely associated with prototypes w_k ∈ S in the embedding space S, and W = {w_k}_{k=1}^N. The dissimilarity ξ_S in the embedding space usually is chosen to be the quadratic Euclidean metric. Further, a hypothesis about the topological structure of the data v_k to be embedded is formulated for the embedding space S by defining a probability distribution P_S(s) for so-called sampling vectors s ∈ S. A cost function of XOM can be defined as

E_XOM = (1/(2K(σ))) ∫_S P_S(s) Σ_{k=1}^N δ_k^{k*(s)} Σ_{j=1}^N h^XOM_σ(v_k, v_j) · ξ_S(s, w_j) ds   (4.11)

with the mapping rule

k*(s) = argmin_{k=1,...,N} Σ_{j=1}^N h^XOM_σ(v_k, v_j) · ξ_S(s, w_j),   (4.12)

as pointed out in Bunte, Hammer, Villmann, Biehl, and Wismüller (2010). As in usual SOMs, the neighborhood cooperativeness is given in XOMs by a gaussian,

h^XOM_σ(v_k, v_j) = exp( −ξ_V(v_k, v_j) / (2σ²) ),

with the data dissimilarity ξ_V(v_k, v_j) defined as the Euclidean distance in the original XOM. The update of the prototypes in the embedding space is obtained in complete analogy to SOM as

Δw_i = −ε h^XOM_σ(v_i, v_k*(s)) · ∂ξ_S(s, w_i)/∂w_i.   (4.13)

As one can see, we can apply divergences to both ξ_V and ξ_S. In case of the latter, the prototype update, equation 4.13, has to be changed accordingly using the respective Fréchet derivatives.

4.2 Learning Vector Quantization. Learning vector quantization (LVQ) is the supervised counterpart of basic VQ. Now the data v ∈ V ⊆ Rⁿ to be learned are equipped with class information c_v. Suppose we have K classes; we define c_v ∈ [0, 1]^K. If Σ_{k=1}^K (c_v)_k = 1, the labeling is probabilistic, and possibilistic otherwise. In the case of a probabilistic labeling with c_v ∈ {0, 1}^K, the labeling is called crisp.

We now briefly explore how divergences can be used for supervised learning. Again we start with the widely applied basic LVQ approaches and then outline the procedure for some more sophisticated methods, without any claim of completeness.

4.2.1 Basic LVQ Algorithms. The basic LVQ schemes were invented by Kohonen (1997). For standard LVQ, a crisp data labeling is assumed. Further, the prototypes w_j with labels y_j correspond to the K classes in such a way that at least one prototype is assigned to each class. For simplicity, we take exactly one prototype for each class. The task is to distribute the prototypes in such a manner that the classification error is reduced. The respective algorithms LVQ1 to LVQ3 are heuristically motivated.

As in unsupervised vector quantization, the similarity between data and prototypes for LVQs is judged by a dissimilarity measure ξ(v, w_j). Beside some small modifications, the basic LVQ schemes LVQ1 to LVQ3 mainly consist of the determination of the most proximate prototype(s) w_s(v) for given v according to the mapping rule, equation 4.1, and subsequent adaptation. Depending on the agreement of c_v and y_s(v), the adaptation of the prototype(s) takes place according to

Δw_s(v) = α · ε · ∂ξ(v, w_s(v))/∂w_s(v),   (4.14)

with α = 1 iff c_v = y_s(v), and α = −1 otherwise.

A popular generalization of these standard algorithms is the generalized LVQ (GLVQ) introduced by Sato and Yamada (1996). In GLVQ, the classification error is replaced by a dissimilarity-based cost function that is closely related to the classification error but not identical to it.

LVQ (GLVQ) introduced by Sato and Yamada (1996). In GLVQ the classifi-cation error is replaced by a dissimilarity-based cost function that is closelyrelated to the classification error but not identical to it.

For a given data point v, with class label cv, the two best matchingprototypes with respect to the data metric ξ , usually the quadratic Euclid-ian, are determined: ws+(v) has minimum distance ξ+ = ξ

(v, ws+(v)

)under

the constraint that the class labels are identical: ys+(v) = cv. The other bestprototype, ws−(v), has the minimum distance ξ− = ξ

(v, ws−(v)

)supposing

the class labels are different: ys−(v) �= cv. Then the classifier function μ (v) isdefined as

μ (v) = ξ+ − ξ−

ξ+ + ξ− , (4.15)

being negative in case of a correct classification. The value ξ+ − ξ− yieldsthe hypothesis margin of the classifier (Crammer, Gilad-Bachrach, Navot,

Page 24: Divergence-Based Vector Quantization

1366 T. Villmann and S. Haase

& Tishby, 2002). Then the generalized LVQ (GLVQ) is derived as gradientdescent on the cost function

EGLVQ =∑

v

μ (v) (4.16)

with respect to the prototypes. In each learning step, for a given data point,both ws+(v) and ws−(v) are adapted in parallel. Taking the derivatives ∂ EGLVQ

∂ws+ (v)

and ∂ EGLVQ

∂ws− (v), we get for the updates

�ws+(v) = ε+ · θ+ · ∂ξ(v, ws+(v)

)∂ws+(v)

(4.17)

and

�ws−(v) = −ε− · θ− · ∂ξ(v, ws−(v)

)∂ws−(v)

(4.18)

with the scaling factors

θ+ = 2 · ξ−

(ξ+ + ξ−)2 and θ− = 2 · ξ+

(ξ+ + ξ−)2 . (4.19)

The values ε+ and ε− ∈ (0, 1) are the learning rates.Obviously the distance measure ξ could be replaced for all of these LVQ

schemes by one of the introduced divergences. This offers a new possibilityfor information-theoretic learning in classification schemes, which differsfrom the previous approaches significantly. These earlier approaches stressthe information-optimum class representation, whereas here, the expectedinformation loss in terms of the applied divergence measure is optimized(Torkkola & Campbell, 2000; Torkkola, 2003; Villmann, Hammer, et al.,2008).
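A single GLVQ learning step with a divergence plugged in for ξ is sketched below (illustrative code using the discrete generalized Kullback-Leibler divergence; the update signs follow a plain stochastic gradient descent on μ(v), so the sign bookkeeping may differ from equations 4.17 and 4.18):

```python
import numpy as np

def d_gkl(v, w):
    return np.sum(v * np.log(v / w)) - np.sum(v - w)

def grad_gkl(v, w):                    # equation 3.2 in discrete form
    return -v / w + 1.0

def glvq_step(v, c_v, W, labels, eps=0.01):
    """One GLVQ step: adapt the closest correct and the closest wrong prototype."""
    dists = np.array([d_gkl(v, w) for w in W])
    correct = np.where(labels == c_v)[0]
    wrong = np.where(labels != c_v)[0]
    sp = correct[np.argmin(dists[correct])]        # w_{s+(v)}
    sm = wrong[np.argmin(dists[wrong])]            # w_{s-(v)}
    xi_p, xi_m = dists[sp], dists[sm]
    theta_p = 2.0 * xi_m / (xi_p + xi_m) ** 2      # equation 4.19
    theta_m = 2.0 * xi_p / (xi_p + xi_m) ** 2
    # stochastic gradient descent on mu(v) = (xi+ - xi-)/(xi+ + xi-)
    W[sp] -= eps * theta_p * grad_gkl(v, W[sp])
    W[sm] += eps * theta_m * grad_gkl(v, W[sm])
    return W

rng = np.random.default_rng(2)
W = rng.dirichlet([1, 1, 1], size=4)
labels = np.array([0, 0, 1, 1])
v, c_v = np.array([0.1, 0.2, 0.7]), 1
print(glvq_step(v, c_v, W, labels))
```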

4.2.2 Advanced Learning Vector Quantization. Apart from the basic LVQ schemes, many more sophisticated prototype-based learning schemes have been proposed for classification learning. Here we restrict ourselves to approaches that can deal with probabilistically or possibilistically labeled training data (uncertain decisions) and that are, in addition, related to the basic unsupervised and supervised vector quantization algorithms mentioned in this letter so far.

In particular, we focus on the fuzzy-labeled SOM (FLSOM) and the very similar fuzzy-labeled NG (FLNG) (Villmann, Schleif, Kostrzewa, Walch, & Hammer, 2008; Villmann, Hammer, Schleif, Geweniger, & Herrmann, 2006).

Both approaches extend the cost function of their unsupervised counterpart in the following shorthand manner,

E_FLSOM/FLNG = (1 − β) E_SOM/NG + β E_FL,

where E_FL measures the classification accuracy. The factor β ∈ [0, 1) balances unsupervised and supervised learning. The classification accuracy term E_FL is defined as

E_FL = (1/2) ∫ P(v) Σ_r g_γ(v, w_r) ψ(c_v, y_r) dv,   (4.20)

where g_γ(v, w_r) is a gaussian kernel describing a neighborhood range in the data space,

g_γ(v, w_r) = exp( −ξ(v, w_r)/(2γ²) ),   (4.21)

using the dissimilarity ξ(v, w_r) in the data space. ψ(c_v, y_r) judges the dissimilarities between the label vectors of data and prototypes; it is originally suggested to be the quadratic Euclidean distance.

Note that E_FL depends on the dissimilarity ξ(v, w_r) in the data space via g_γ(v, w_r). Hence, prototype adaptation in FLSOM/FLNG is influenced by the classification accuracy:

∂E_FLSOM/FLNG/∂w_r = (1 − β) · ∂E_SOM/NG/∂w_r + β · ∂E_FL/∂w_r,   (4.22)

which yields

Δw_r = −ε(1 − β) · h^SOM/NG_σ(r, s(v)) · ∂ξ(v, w_r)/∂w_r + εβ · (1/(4γ²)) · g_γ(v, w_r) · (∂ξ(v, w_r)/∂w_r) · ψ(c_v, y_r).   (4.23)

The label adaptation is influenced only by the second part, E_FL. The derivative ∂E_FL/∂y_r yields

Δy_r = ε_l β · g_γ(v, w_r) · ∂ψ(c_v, y_r)/∂y_r   (4.24)

with learning rate ε_l > 0 (Villmann, Schleif, et al., 2008; Villmann et al., 2006). This label learning leads to a weighted average y_r of the fuzzy labels c_v of those data v that are close to the associated prototypes according to ξ(v, w_r).

It should be noted at this point that a similar approach can easily be installed for XOM in an analogous manner, yielding FLXOM.

Clearly, beside the possibility of choosing a divergence measure for ξ(v, w_r) as in the unsupervised case, there is no contradiction in doing so for the label dissimilarity ψ(c_v, y_r) in these FL methods. As before, the simple plug-in of the respective discrete divergence variants and their Fréchet derivatives modifies the algorithms such that semisupervised learning can proceed by relying on divergences for both variants.

5 SOM Simulations for Various Divergences

In this section, we demonstrate the influence of the chosen divergence and the dependence on divergence parameters for prototype-based unsupervised vector quantization. For this purpose, we consider an artificial but illustrative data set. In the case of parameterized divergences, we vary the parameter settings to show their influence on the resulting prototype distribution. Further, we investigate the behavior of different divergence types, always comparing the results with Euclidean distance–based learning as the standard to show their differences.

These investigations for the toy problem should lead readers to think about the choice of divergences for a specific application as well as optimum parameter settings. The demonstration itself is far from a realistic scenario, which also has to deal with such matters as high-dimensional problems and heterogeneous data distributions.

As an example vector quantization model, we consider the Heskes-SOM according to equation 4.4, using a chain lattice with 100 units r and their prototypes w_r ∈ R². The example data distribution consists of 10⁷ data points v = (v1, v2) ∈ [0, 1]², which are constrained such that v1 + v2 = 1 (i.e., the data v can be taken as probability densities in R²). Further, generating the data set, the first component v1 is chosen randomly according to the data density P1(v1) = 2 · v1, whereas v2 is subsequently calculated according to the constraint.

The learning rate ε as well as the neighborhood range σ converged during the SOM learning to the final values ε_final = 10⁻⁶ and σ_final = 1, respectively. The initial values for the learning rate ε as well as the neighborhood range σ were appropriately chosen.

We trained SOM networks for the divergences as introduced in section 2, using the Fréchet derivatives deduced in section 3 with different parameter values.

For the η-divergence (belonging to the Bregman divergences), the results are depicted in Figure 2. One can observe that the influence of the parameter η is only marginal. Yet small variations can be detected. For the special choice η = 2, Euclidean learning is realized.

Figure 2: Prototype distribution for η-divergence-based SOM for different η-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

For the β-divergence, the influence of the parameter value β is stronger than the parameter effect for η-divergences (see Figure 3). In particular, significant deviations can be observed for higher prototype w1-values, giving a hint of a better discrimination property for this probability range. Lower prototype w1-values were captured by the β-divergences markedly better than by the Euclidean learning.

The α-divergence-based learning shows the inclusive and exclusive properties mentioned above. For a positive choice of the control parameter α, the range of prototype w1-values captured is considerably larger than the one covered using negative α-values. However, only small variations can be detected within the two α-domains (positive and negative); that is, the divergence is relatively robust with respect to the control parameter α (see Figure 4).

For the Tsallis divergence, the influence of the control parameter α is already detected in the central range of prototype w1-values and is significant in the upper range (see Figure 5). Especially in comparison to Euclidean learning, this gives a hint of a quite good discrimination property for a wide probability range.

In contrast to the β-divergence, the influence of the control parameter α of the Rényi divergence is primarily detected in the region with sparse data density (see Figure 6). However, the Rényi divergence-based learning covers a wider range of prototype w1-values than the Euclidean learning.

Figure 3: Prototype distribution for β-divergence-based SOM for different β-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

Figure 4: Prototype distribution for α-divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

Figure 5: Prototype distribution for Tsallis divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

Figure 6: Prototype distribution for Rényi divergence-based SOM for different α-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

Figure 7: Prototype distribution for γ-divergence-based SOM for different γ-values. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

The γ-divergence shows the most sensitive behavior of all parameterized divergences investigated here (see Figure 7). In particular, the choice of the control parameter γ influences both ranges of probability, the low and the high one, with approximately the same sensitivity (see Figure 7).

Thus, it differs from the sensitivity observed for β-divergences. This behavior offers the possibility of tuning the divergence precisely depending on the specific vector quantization task. Together with the stated robustness of the γ-divergence (Fujisawa & Eguchi, 2008), this adaptive specificity could provide high potential for a wide range of applications. This is underscored by the applications in supervised and unsupervised vector quantization based on the Cauchy-Schwarz divergence (γ = 1) (Jenssen et al., 2006; Mwebaze et al., 2010; Principe et al., 2000; Villmann, Haase, Schleif, & Hammer, 2010).

Figure 8 shows the results of the prototype-based unsupervised vector quantization using various nonparameterized divergences.

These simulations should be seen, on one hand, as a proof of concept. On the other hand, one can clearly see quite different behavior for the various divergences, resulting in distinct prototype distributions. This leads, in consequence, to diverse vector quantization properties. Therefore, the choice of a divergence for a specific application should be made very carefully, taking the special properties of the divergences into account.

Figure 8: Prototype distribution for divergence-based SOM, using various divergences. Horizontal axis: logarithmic value of the one-dimensional prototype index. Vertical axis: first component w1 of the prototypes w = (w1, w2).

6 Extensions for the Basic Adaptation Scheme: Hyperparameter and Relevance Learning

6.1 Hyperparameter Learning for α-, β-, γ-, and η-Divergences

6.1.1 Theoretical Considerations. Considering the parameterized diver-gence families of γ -, α-, β-, and η-divergences, one could further think aboutthe optimal choice of the so-called hyperparameters γ , α, β, η as suggestedin a similar manner for other parameterized LVQ algorithms (Schneider,Biehl, & Hammer, 2009). In case of supervised learning schemes for clas-sification based on differentiable cost functions, the optimization can behandled as an object of a gradient descent–based adaptation procedure.Thus, the parameter is optimized for the classification task at hand.

Suppose the classification accuracy for a certain approach is given as
$$E = E(\xi_\theta, W),$$
depending on a parameterized divergence ξθ with parameter θ. If E and ξθ are both differentiable with respect to θ according to
$$\frac{\partial E(\xi_\theta, W)}{\partial\theta} = \frac{\partial E}{\partial\xi_\theta}\cdot\frac{\partial\xi_\theta}{\partial\theta},$$


a gradient-based optimization is derived by
$$\Delta\theta = -\varepsilon\,\frac{\partial E(\xi_\theta, W)}{\partial\theta} = -\varepsilon\,\frac{\partial E}{\partial\xi_\theta}\cdot\frac{\partial\xi_\theta}{\partial\theta},$$
depending on the derivative $\frac{\partial\xi_\theta}{\partial\theta}$ for a certain choice of the divergence ξθ.
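Schematically, a single online update of the hyperparameter then consists of multiplying the two factors of this chain rule and taking a small step. The following sketch is an illustration added here (not part of the original text); the argument names are placeholders for the quantities defined above.

    def hyperparameter_step(theta, eps, dE_dxi, dxi_dtheta):
        """One gradient step: Delta theta = -eps * (dE/dxi) * (dxi/dtheta)."""
        return theta - eps * dE_dxi * dxi_dtheta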

We assume in the following that the (positive) measures p and ρ are continuously differentiable. Then, considering derivatives ∂ξθ/∂θ of parameterized divergences with respect to the parameter θ, integration and differentiation may be interchanged provided the resulting integral exists (Fichtenholz, 1964). Hence, we can differentiate the parameterized divergences with respect to their hyperparameters in that case. For the several α-, β-, γ-, and η-divergences characterized in section 2, we obtain after some elementary calculations:

• η-divergence $D_\eta(p\|\rho)$ from equation 2.8:
$$\frac{\partial D_\eta(p\|\rho)}{\partial\eta} = \int \left[p^{\eta}\ln p + \rho^{\eta-1}\bigl(\rho - p + (\eta\rho - \rho - \eta p)\ln\rho\bigr)\right]dx$$

• β-divergence $D_\beta(p\|\rho)$ from equation 2.9 (see appendix A):
$$\frac{\partial D_\beta(p\|\rho)}{\partial\beta} = \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx - \int\left[\frac{p^{\beta}\ln p - \rho^{\beta}\ln\rho}{\beta} - \frac{p^{\beta}-\rho^{\beta}}{\beta^{2}}\right]dx$$

• α-divergence $D_\alpha(p\|\rho)$ from equation 2.18 (see appendix A):
$$\frac{\partial D_\alpha(p\|\rho)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha p + (\alpha-1)\rho\right]dx + \frac{1}{\alpha(\alpha-1)}\int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx$$

• Tsallis divergence $D^{T}_{\alpha}(p\|\rho)$ from equation 2.22:
$$\frac{\partial D^{T}_{\alpha}(p\|\rho)}{\partial\alpha} = \frac{1}{(1-\alpha)^{2}}\left(1 - \int p^{\alpha}\rho^{1-\alpha}\,dx\right) - \frac{1}{1-\alpha}\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx$$


• Generalized Renyi divergence $D^{GR}_{\alpha}(p\|\rho)$ from equation 2.26 (see appendix A):
$$\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha p + (\alpha-1)\rho\right]dx + 1\right) + \frac{1}{\alpha-1}\cdot\frac{\int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx}{\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha p + (\alpha-1)\rho\right]dx + 1}$$

• Renyi divergence $D^{R}_{\alpha}(p\|\rho)$ from equation 2.28:
$$\frac{\partial D^{R}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int p^{\alpha}\rho^{1-\alpha}\,dx\right) + \frac{1}{\alpha-1}\cdot\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx}{\int p^{\alpha}\rho^{1-\alpha}\,dx}$$

• γ-divergence $D_\gamma(p\|\rho)$ from equation 2.30 (see appendix A):
$$\frac{\partial D_\gamma(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln\left(\int p^{\gamma+1}dx\right) + \frac{\int p^{\gamma+1}\ln p\,dx}{(\gamma+1)\,\gamma\int p^{\gamma+1}dx} - \frac{1}{(\gamma+1)^{2}}\ln\left(\int\rho^{\gamma+1}dx\right) + \frac{\int\rho^{\gamma+1}\ln\rho\,dx}{(\gamma+1)\int\rho^{\gamma+1}dx} + \frac{1}{\gamma^{2}}\ln\left(\int p\,\rho^{\gamma}dx\right) - \frac{\int p\,\rho^{\gamma}\ln\rho\,dx}{\gamma\int p\,\rho^{\gamma}dx}$$
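For discretized data, the integrals in these expressions become finite sums, and the derivatives can be evaluated directly. The following sketch (an illustration added here, not taken from the original text) computes the discretized γ-divergence and the analytic derivative above for positive vectors p and ρ and compares the result with a finite difference:

    import numpy as np

    def d_gamma(p, r, g):
        # discretized gamma-divergence D_gamma(p||r); sums replace the integrals
        i1 = np.sum(p ** (g + 1))
        i2 = np.sum(r ** (g + 1))
        i3 = np.sum(p * r ** g)
        return np.log(i1) / ((g + 1) * g) + np.log(i2) / (g + 1) - np.log(i3) / g

    def d_gamma_dgamma(p, r, g):
        # analytic derivative of the discretized gamma-divergence with respect to gamma
        i1 = np.sum(p ** (g + 1)); di1 = np.sum(p ** (g + 1) * np.log(p))
        i2 = np.sum(r ** (g + 1)); di2 = np.sum(r ** (g + 1) * np.log(r))
        i3 = np.sum(p * r ** g);   di3 = np.sum(p * r ** g * np.log(r))
        return (-(2 * g + 1) / (g ** 2 * (g + 1) ** 2) * np.log(i1) + di1 / ((g + 1) * g * i1)
                - np.log(i2) / (g + 1) ** 2 + di2 / ((g + 1) * i2)
                + np.log(i3) / g ** 2 - di3 / (g * i3))

    p = np.array([0.2, 0.5, 0.3])
    r = np.array([0.3, 0.4, 0.3])
    g, h = 0.7, 1e-6
    fd = (d_gamma(p, r, g + h) - d_gamma(p, r, g - h)) / (2 * h)
    print(d_gamma_dgamma(p, r, g), fd)  # the two values should agree closely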

6.1.2 Example: Hyperparameter Learning for γ-Divergences in GLVQ. We now provide a simulation example for hyperparameter learning. We apply the GLVQ algorithm for classification, the cost function of which is given by equation 4.16. Mwebaze et al. (2010) pointed out that GLVQ performs weakly if the Kullback-Leibler divergence is used, whereas the Cauchy-Schwarz divergence yields good results. Therefore, we demonstrate hyperparameter learning for the γ-divergence, which includes both the Kullback-Leibler and the Cauchy-Schwarz divergence via the parameter settings γ → 0 and γ = 1, respectively. The hyperparameter update for this algorithm reads as

$$\Delta\gamma \sim -\frac{\partial E_{\mathrm{GLVQ}}}{\partial\gamma},$$


Figure 9: Example run of γ-parameter control for the γ-divergence in the case of GLVQ applied to the well-known Iris data set.

which leads to
$$\Delta\gamma \sim \theta^{+}\cdot\frac{\partial D_\gamma\!\left(v\|w_{s^{+}(v)}\right)}{\partial\gamma} + \theta^{-}\cdot\frac{\partial D_\gamma\!\left(v\|w_{s^{-}(v)}\right)}{\partial\gamma},$$

with the scaling factors θ+ and θ− taken from equation 4.19 (a schematic form of this update step is sketched below). For this purpose, we investigate a simple classification example: the well-known three-class Iris data set. We rescaled the data vectors such that the requirements of positive measures are satisfied. We used two prototypes for each class and 10-fold cross-validation. We initialized the γ-parameter as γ0 = 0.5, in the middle between the Kullback-Leibler and the Cauchy-Schwarz divergence, according to Mwebaze et al. (2010).
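The sketch below is an illustration added here, not the authors' implementation. The arguments theta_plus and theta_minus stand for the scaling factors of equation 4.19 (not reproduced in this section), and dD_plus and dD_minus for the derivatives of the γ-divergence with respect to γ, evaluated for the closest correct and the closest incorrect prototype, respectively.

    def gamma_update(gamma, eps_gamma, theta_plus, theta_minus, dD_plus, dD_minus):
        # Delta gamma ~ theta+ * dD(v||w_s+)/dgamma + theta- * dD(v||w_s-)/dgamma
        return gamma + eps_gamma * (theta_plus * dD_plus + theta_minus * dD_minus)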

Without a γ-parameter update, a classification accuracy of 78.34% is obtained for γ = 0, with standard deviation σ = 6.17 and a best result of 91.3%. For γ = 1, the average is 95.16% with σ = 1.87, the best run yielding 97.3%. The hyperparameter-controlled simulations give only a slight improvement, achieving an average performance of 95.89%, but with a decreased deviation of σ = 0.43. The γ-parameter converged to γ_final = 0.9016 with standard deviation σ_γ < 10^-4. As expected from the uncontrolled experiments, the final γ-value is in the proximity of the Cauchy-Schwarz divergence; however, it is slightly but distinctly decreased. A typical learning progress of γ is depicted in Figure 9. As for the Cauchy-Schwarz divergence (γ = 1), the best performance in the controlled case was 97.3%.

Summarizing, this small experiment shows that hyperparameter optimization works well and may lead to better performance and stability.

6.2 Relevance Learning for Divergences. Density functions are required to fulfill the normalization condition, whereas positive measures are more flexible. This offers the possibility of transferring the idea of relevance learning to divergence-based learning vector quantization. Relevance learning in learning vector quantization weights the input data dimensions such that the classification accuracy is improved (Hammer & Villmann, 2002).

In the framework of divergence-based gradient descent learning, we multiplicatively weight a positive measure q(x) by λ(x) with 0 ≤ λ(x) < ∞ and the regularization condition ∫λ(x)dx = 1. Incorporating this idea into the above approaches, we have to replace p by p·λ and ρ by ρ·λ in the divergences. Doing so, we can optimize λ(x) during learning for better performance by gradient descent optimization, as is known from vectorial relevance learning. This leads, again, to Frechet derivatives of the divergences, but now with respect to the weighting function λ(x). The respective framework based on GLVQ for vectorial data is given by the generalized relevance learning vector quantization scheme (GRLVQ; Hammer & Villmann, 2002). In complete analogy, we obtain the functional relevance update

$$\Delta\lambda \sim \theta^{+}\cdot\frac{\delta D\!\left(p\|\rho_{s^{+}(p)}\right)}{\delta\lambda} + \theta^{-}\cdot\frac{\delta D\!\left(p\|\rho_{s^{-}(p)}\right)}{\delta\lambda},$$

with s+(p) and s−(p) playing the same role as in GLVQ. For vectorial representations v and w of p and ρ, respectively, this reduces to the ordinary partial derivatives
$$\Delta\lambda_i \sim \theta^{+}\cdot\frac{\partial D\!\left(v\|w_{s^{+}(v)}\right)}{\partial\lambda_i} + \theta^{-}\cdot\frac{\partial D\!\left(v\|w_{s^{-}(v)}\right)}{\partial\lambda_i}.$$
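In the discretized setting, λ becomes a nonnegative weight vector whose entries sum to one. A minimal sketch of one relevance update step follows (an illustration added here, not from the original text); grad_plus and grad_minus denote the partial derivatives above, and the clipping plus renormalization is one simple way to re-impose the constraints, which the original text states but does not prescribe how to enforce.

    import numpy as np

    def relevance_update(lam, eps_lam, theta_plus, theta_minus, grad_plus, grad_minus):
        # gradient step on the relevance weights, followed by projection to the constraint set
        lam = lam + eps_lam * (theta_plus * grad_plus + theta_minus * grad_minus)
        lam = np.maximum(lam, 0.0)   # keep the weights nonnegative
        return lam / np.sum(lam)     # enforce the normalization sum(lam) = 1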

Applying this methodology, we obtain for the Bregman divergence
$$\frac{\delta D^{B}_{\Phi}(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{\delta\Phi(\lambda\cdot p)}{\delta\lambda} - \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\lambda} - \frac{\delta\left[\frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\,\lambda\,(p-\rho)\right]}{\delta\lambda}, \qquad (6.1)$$
with
$$\frac{\delta\left[\frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\,\lambda\,(p-\rho)\right]}{\delta\lambda} = (p-\rho)\left(\frac{\delta^{2}\left[\Phi(\lambda\cdot\rho)\right]}{\delta\rho\,\delta\lambda}\,\lambda + \frac{\delta\Phi(\lambda\cdot\rho)}{\delta\rho}\right).$$

This yields, for the generalized Kullback-Leibler divergence,
$$\frac{\delta D_{GKL}(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = p\cdot\log\left(\frac{p}{\rho}\right) - p + \rho. \qquad (6.2)$$
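In the discretized case, this gradient is straightforward to verify numerically. The following sketch (added here for illustration, not part of the original text) compares the entrywise expression of equation 6.2 with a finite difference of the discretized generalized Kullback-Leibler divergence:

    import numpy as np

    def gkl(lam, p, r):
        # discretized generalized Kullback-Leibler divergence D_GKL(lam*p || lam*r)
        return np.sum(lam * p * np.log(p / r) - (lam * p - lam * r))

    def gkl_relevance_gradient(p, r):
        # entrywise gradient with respect to lam, cf. equation 6.2
        return p * np.log(p / r) - p + r

    lam = np.array([0.2, 0.3, 0.5])
    p = np.array([0.2, 0.5, 0.3])
    r = np.array([0.3, 0.4, 0.3])
    h = 1e-6
    e0 = np.array([1.0, 0.0, 0.0])
    fd = (gkl(lam + h * e0, p, r) - gkl(lam - h * e0, p, r)) / (2 * h)
    print(gkl_relevance_gradient(p, r)[0], fd)  # both values should agree closely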


In the case of the η-divergence (equation 2.8), we calculate
$$\frac{\delta D_\eta(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \lambda^{\eta-1}\,\eta\left(p^{\eta} - \rho^{\eta-1}\bigl(\eta\,p + (1-\eta)\,\rho\bigr)\right), \qquad (6.3)$$
which reduces for the choice η = 2 (the Euclidean distance) to
$$\frac{\delta D_\eta(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = 2\lambda\,(p-\rho)^{2},$$

as it is known from Hammer and Villmann (2002). Further, for the β-divergence, equation 2.9, which also belongs to the Bregman divergence class, we have
$$\frac{\delta D_\beta(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{\rho\cdot(\lambda\cdot p)^{\beta} + \bigl(\rho\cdot(\beta-1) - p\cdot\beta\bigr)\cdot(\lambda\cdot\rho)^{\beta}}{\lambda\rho\,(\beta-1)}. \qquad (6.4)$$

For the class of f-divergences, equation 2.11, we consider
$$\frac{\delta D_f(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \rho\cdot f\!\left(\frac{p}{\rho}\right) + \lambda\cdot\rho\,\frac{\partial f(u)}{\partial u}\,\frac{\delta u}{\delta\lambda} = \rho\cdot f\!\left(\frac{p}{\rho}\right) \qquad (6.5)$$
with $u = \frac{p}{\rho}$, using the fact that $\frac{\delta u}{\delta\lambda} = 0$. The relevance learning of the subclass of α-divergences, equation 2.18, follows as
$$\frac{\delta D_\alpha(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{1}{\alpha(\alpha-1)}\left[\rho\cdot\left(\left(\frac{p}{\rho}\right)^{\!\alpha} + \alpha - 1\right) - p\cdot\alpha\right], \qquad (6.6)$$

whereas the respective gradient of the generalized Renyi divergence, equation 2.26, can be derived from this as
$$\frac{\delta D^{GR}_{\alpha}(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{\alpha}{\int\lambda\cdot\left[\rho\cdot\left(\frac{p}{\rho}\right)^{\!\alpha} - \alpha\cdot p + (\alpha-1)\cdot\rho\right]dx + 1}\cdot\frac{\delta D_\alpha(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda}. \qquad (6.7)$$

The subset of Tsallis divergences is treated by
$$\frac{\delta D^{T}_{\alpha}(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = -\frac{1}{1-\alpha}\,p^{\alpha}\rho^{1-\alpha}. \qquad (6.8)$$


The γ-divergence class finally yields
$$\frac{\delta D_\gamma(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{p\,(\lambda\cdot p)^{\gamma}}{\gamma\int(\lambda\cdot p)^{\gamma+1}dx} + \frac{\rho\,(\lambda\cdot\rho)^{\gamma}}{\int(\lambda\cdot\rho)^{\gamma+1}dx} - \frac{p\cdot(\gamma+1)\cdot(\lambda\cdot\rho)^{\gamma}}{\gamma\int(\lambda\cdot p)\cdot(\lambda\cdot\rho)^{\gamma}dx}.$$

Again the important special case γ = 1 is considered: the relevance learning scheme for the Cauchy-Schwarz divergence, equation 2.32, is derived as
$$\frac{\delta D_{CS}(\lambda\cdot p\,\|\,\lambda\cdot\rho)}{\delta\lambda} = \frac{p\cdot\lambda\cdot p}{\int(\lambda\cdot p)^{2}dx} + \frac{\rho\cdot\lambda\cdot\rho}{\int(\lambda\cdot\rho)^{2}dx} - \frac{2\cdot p\cdot\lambda\cdot\rho}{\int\lambda^{2}\cdot p\cdot\rho\,dx}. \qquad (6.9)$$
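A discretized version of this Cauchy-Schwarz relevance gradient is easy to write down; the sketch below is an illustration added here (vectors replace the densities and sums replace the integrals):

    import numpy as np

    def cs_relevance_gradient(lam, p, r):
        # entrywise gradient of D_CS(lam*p || lam*r) with respect to the weight vector lam
        lp, lr = lam * p, lam * r
        return (p * lp / np.sum(lp ** 2)
                + r * lr / np.sum(lr ** 2)
                - 2.0 * p * lr / np.sum(lam ** 2 * p * r))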

7 Conclusion

Divergence-based supervised and unsupervised vector quantization has been done so far by applying only a few divergences, primarily the Kullback-Leibler divergence. Recent applications also refer to the Itakura-Saito divergence, the Cauchy divergence, and the γ-divergence. These approaches are not online adaptation schemes involving gradient learning but are based on batch mode, requiring all the data at one time. However, in many cases online learning is mandatory, for several reasons: the huge amount of data, a subsequently increasing data set, or the need for very careful learning in complex problems, for example (Alex, Hasenfuss, & Hammer, 2009). In these cases, online learning is required or may at least be advantageous.

In this letter, we give a mathematical foundation for gradient-based vector quantization bearing on the derivatives of the applied divergences. We provide a general framework for the use of arbitrary divergences and their derivatives such that they can immediately be plugged into existing gradient-based vector quantization schemes.

For this purpose, we first characterized the main subclasses of divergences (Bregman, α-, β-, γ-, and f-divergences), following Cichocki et al. (2009). We then used the mathematical methodology of Frechet derivatives to calculate the functional divergence derivatives.

We showed how to use this methodology with well-known examples of supervised and unsupervised vector quantization, including SOM, NG, and GLVQ. In particular, we explained that the divergences can be taken as suitable dissimilarity measures for data, which leads to the use of the respective Frechet derivatives in the online learning schemes. Further, we described how a parameter adaptation could be integrated in supervised learning to achieve improved classification results in the case of the parameterized α-, β-, γ-, and η-divergences. In the last step, we considered a weighting function for generalized divergences based on a positive measure. The optimization scheme for this weighting function is again obtained via Frechet derivatives, yielding a relevance learning scheme in analogy to relevance learning in the usual supervised learning vector quantization (Hammer & Villmann, 2002).

Table 1 provides an overview of representatives of the three main classes of divergences characterized in section 2 and their related Frechet derivatives. Table 2 provides the resulting derivatives for relevance learning and hyperparameter learning.

As a proof of concept, the simulations for an illustrative example with the several parameterized and nonparameterized divergences give promising results regarding their sensitivity. The differences from Euclidean learning are obvious. Moreover, the dependencies in the case of parameterized divergences give hints for possible real-world applications, which should be the next step in this work.

Appendix A: Calculation of the Derivatives of the Parameterized Divergences with Respect to the Hyperparameters

We assume for the differentiation of the divergences with respect to their hyperparameters that the (positive) measures p and ρ are continuously differentiable. Then, considering derivatives of divergences, integration and differentiation can be interchanged provided the resulting integral exists (Fichtenholz, 1964).

A.1 β-Divergence. The β-divergence is, according to equation 2.9,
$$D_\beta(p\|\rho) = \int p\cdot\frac{p^{\beta-1} - \rho^{\beta-1}}{\beta-1}\,dx - \int\frac{p^{\beta} - \rho^{\beta}}{\beta}\,dx = I_1(\beta) - I_2(\beta).$$

We treat both integrals independently:

$$\frac{\partial I_1(\beta)}{\partial\beta} = \int\frac{\partial}{\partial\beta}\left[p\cdot\frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right]dx$$
$$= \int p\left(\frac{\partial\left[p^{\beta-1}-\rho^{\beta-1}\right]}{\partial\beta}\cdot\frac{1}{\beta-1} - \frac{p^{\beta-1}-\rho^{\beta-1}}{(\beta-1)^{2}}\right)dx$$
$$= \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx$$

Table 1: Divergences and Their Frechet Derivatives. (Columns: divergence family; formula; Frechet derivative. The table collects, for the Bregman, Kullback-Leibler, generalized Kullback-Leibler, Itakura-Saito, η-, β-, f-, generalized f-, Hellinger, α-, Tsallis, Renyi, generalized Renyi, γ-, and Cauchy-Schwarz divergences, the defining formula and the corresponding Frechet derivative with respect to ρ.)

Table 2: Divergences and Their Derivatives for Relevance Learning and Hyperparameter Learning. (Columns: divergence family; derivative for relevance learning; derivative for hyperparameter learning. For the same divergence families, the table collects the relevance learning derivatives δD(λ·p||λ·ρ)/δλ of section 6.2 and, for the parameterized η-, β-, α-, Tsallis, Renyi, generalized Renyi, and γ-divergences, the hyperparameter derivatives of section 6.1.)

$$\frac{\partial I_2(\beta)}{\partial\beta} = \int\frac{\partial}{\partial\beta}\left[\frac{p^{\beta}-\rho^{\beta}}{\beta}\right]dx$$
$$= \int\left[\frac{\partial\left[p^{\beta}-\rho^{\beta}\right]}{\partial\beta}\cdot\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)\right]dx$$
$$= \int\left[\left(p^{\beta}\ln p - \rho^{\beta}\ln\rho\right)\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)\right]dx.$$

Thus,

$$\frac{\partial D_\beta(p\|\rho)}{\partial\beta} = \frac{1}{\beta-1}\int p\left(p^{\beta-1}\ln p - \rho^{\beta-1}\ln\rho - \frac{p^{\beta-1}-\rho^{\beta-1}}{\beta-1}\right)dx - \int\left[\left(p^{\beta}\ln p - \rho^{\beta}\ln\rho\right)\frac{1}{\beta} - \frac{1}{\beta^{2}}\left(p^{\beta}-\rho^{\beta}\right)\right]dx,$$

if the integral exists for an appropriate choice of β.
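For discretized vectors, the result above can be checked numerically against a finite difference. The following sketch is an illustration added here (not part of the original appendix) and assumes positive vectors p and ρ:

    import numpy as np

    def d_beta(p, r, b):
        # discretized beta-divergence D_beta(p||r); sums replace the integrals
        return np.sum(p * (p ** (b - 1) - r ** (b - 1)) / (b - 1) - (p ** b - r ** b) / b)

    def d_beta_dbeta(p, r, b):
        # analytic derivative of the discretized beta-divergence with respect to beta
        t1 = np.sum(p * (p ** (b - 1) * np.log(p) - r ** (b - 1) * np.log(r)
                         - (p ** (b - 1) - r ** (b - 1)) / (b - 1))) / (b - 1)
        t2 = np.sum((p ** b * np.log(p) - r ** b * np.log(r)) / b - (p ** b - r ** b) / b ** 2)
        return t1 - t2

    p = np.array([0.2, 0.5, 0.3])
    r = np.array([0.25, 0.45, 0.3])
    b, h = 1.5, 1e-6
    fd = (d_beta(p, r, b + h) - d_beta(p, r, b - h)) / (2 * h)
    print(d_beta_dbeta(p, r, b), fd)  # the two numbers should agree closely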

A.2 α-Divergences. We consider the α-divergence, equation 2.18,
$$D_\alpha(p\|\rho) = \frac{1}{\alpha(\alpha-1)}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]dx = \frac{1}{\alpha(\alpha-1)}\,I(\alpha).$$
We have
$$\frac{\partial D_\alpha(p\|\rho)}{\partial\alpha} = \frac{\partial\left[\frac{1}{\alpha(\alpha-1)}\right]}{\partial\alpha}\,I(\alpha) + \frac{1}{\alpha(\alpha-1)}\,\frac{\partial I(\alpha)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\,I(\alpha) + \frac{1}{\alpha(\alpha-1)}\,\frac{\partial I(\alpha)}{\partial\alpha}.$$

The derivative $\frac{\partial I(\alpha)}{\partial\alpha}$ yields
$$\frac{\partial I(\alpha)}{\partial\alpha} = \int\frac{\partial\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]}{\partial\alpha}\,dx = \int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx.$$


Finally, we get
$$\frac{\partial D_\alpha(p\|\rho)}{\partial\alpha} = -\frac{2\alpha-1}{\alpha^{2}(\alpha-1)^{2}}\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]dx + \frac{1}{\alpha(\alpha-1)}\int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx.$$

A.3 Renyi Divergences. Considering the generalized Renyi divergence $D^{GR}_{\alpha}(p\|\rho)$ from equation 2.26,
$$D^{GR}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]dx + 1\right) = \frac{1}{\alpha-1}\log I(\alpha),$$
we get
$$\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log I(\alpha) + \frac{1}{\alpha-1}\,\frac{1}{I(\alpha)}\,\frac{\partial I(\alpha)}{\partial\alpha}$$
with
$$\frac{\partial I(\alpha)}{\partial\alpha} = \int\frac{\partial\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]}{\partial\alpha}\,dx = \int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx.$$
Summarizing, the differentiation yields
$$\frac{\partial D^{GR}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]dx + 1\right) + \frac{1}{\alpha-1}\,\frac{\int\left[p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho) - p + \rho\right]dx}{\int\left[p^{\alpha}\rho^{1-\alpha} - \alpha\cdot p + (\alpha-1)\,\rho\right]dx + 1}.$$

We now turn to the usual Renyi divergence $D^{R}_{\alpha}(p\|\rho)$ from equation 2.28,
$$D^{R}_{\alpha}(p\|\rho) = \frac{1}{\alpha-1}\log\left(\int p^{\alpha}\rho^{1-\alpha}\,dx\right).$$
We analogously obtain
$$\frac{\partial D^{R}_{\alpha}(p\|\rho)}{\partial\alpha} = -\frac{1}{(\alpha-1)^{2}}\log\left(\int p^{\alpha}\rho^{1-\alpha}\,dx\right) + \frac{1}{\alpha-1}\,\frac{\int p^{\alpha}\rho^{1-\alpha}(\ln p - \ln\rho)\,dx}{\int p^{\alpha}\rho^{1-\alpha}\,dx}.$$

A.4 γ-Divergences. The remaining divergences are the γ-divergences, equation 2.30:
$$D_\gamma(p\|\rho) = \frac{1}{\gamma+1}\ln\left[\left(\int p^{\gamma+1}dx\right)^{\!\frac{1}{\gamma}}\cdot\left(\int\rho^{\gamma+1}dx\right)\right] - \ln\left[\left(\int p\cdot\rho^{\gamma}dx\right)^{\!\frac{1}{\gamma}}\right]$$
$$= \frac{1}{\gamma+1}\ln\left[\left(\int p^{\gamma+1}dx\right)^{\!\frac{1}{\gamma}}\right] + \frac{1}{\gamma+1}\ln\left[\int\rho^{\gamma+1}dx\right] - \ln\left[\left(\int p\cdot\rho^{\gamma}dx\right)^{\!\frac{1}{\gamma}}\right]$$
$$= \frac{1}{(\gamma+1)\,\gamma}\ln I_1(\gamma) + \frac{1}{\gamma+1}\ln I_2(\gamma) - \frac{1}{\gamma}\ln I_3(\gamma).$$

The derivative is obtained according to
$$\frac{\partial D_\gamma(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln I_1(\gamma) + \frac{1}{(\gamma+1)\,\gamma\, I_1(\gamma)}\,\frac{\partial I_1(\gamma)}{\partial\gamma} - \frac{1}{(\gamma+1)^{2}}\ln I_2(\gamma) + \frac{1}{(\gamma+1)\, I_2(\gamma)}\,\frac{\partial I_2(\gamma)}{\partial\gamma} + \frac{1}{\gamma^{2}}\ln I_3(\gamma) - \frac{1}{\gamma\, I_3(\gamma)}\,\frac{\partial I_3(\gamma)}{\partial\gamma}.$$

Next, we calculate the derivatives $\frac{\partial I_1(\gamma)}{\partial\gamma}$, $\frac{\partial I_2(\gamma)}{\partial\gamma}$, and $\frac{\partial I_3(\gamma)}{\partial\gamma}$:
$$\frac{\partial I_1(\gamma)}{\partial\gamma} = \int\frac{\partial\left(p^{\gamma+1}\right)}{\partial\gamma}\,dx = \int p^{\gamma+1}\ln p\,dx,$$
$$\frac{\partial I_2(\gamma)}{\partial\gamma} = \int\frac{\partial\left(\rho^{\gamma+1}\right)}{\partial\gamma}\,dx = \int\rho^{\gamma+1}\ln\rho\,dx,$$
$$\frac{\partial I_3(\gamma)}{\partial\gamma} = \int\frac{\partial\left(p\cdot\rho^{\gamma}\right)}{\partial\gamma}\,dx = \int p\,\rho^{\gamma}\ln\rho\,dx.$$

Collecting all intermediate results, we finally have
$$\frac{\partial D_\gamma(p\|\rho)}{\partial\gamma} = -\frac{2\gamma+1}{\gamma^{2}(\gamma+1)^{2}}\ln\left(\int p^{\gamma+1}dx\right) + \frac{\int p^{\gamma+1}\ln p\,dx}{(\gamma+1)\,\gamma\int p^{\gamma+1}dx} - \frac{1}{(\gamma+1)^{2}}\ln\left(\int\rho^{\gamma+1}dx\right) + \frac{\int\rho^{\gamma+1}\ln\rho\,dx}{(\gamma+1)\int\rho^{\gamma+1}dx} + \frac{1}{\gamma^{2}}\ln\left(\int p\cdot\rho^{\gamma}dx\right) - \frac{\int p\,\rho^{\gamma}\ln\rho\,dx}{\gamma\int p\cdot\rho^{\gamma}dx}.$$

Appendix B: Proof of Lemma 1

We now give the proof of lemma 1. For the proof, we need a proposition given in Liese and Vajda (1987):

Proposition 3. Let $A = [0,\infty)^{2}$ and $\mathcal{F} = \{g \mid g\colon [0,\infty)\to\mathbb{R},\ g\ \text{convex}\}$. Further, let $\hat{f}\colon A \to \mathbb{R}\cup\{\infty\}$ be defined by
$$\hat{f}(u,v) = v\cdot f\!\left(\frac{u}{v}\right)$$
for an arbitrary $f \in \mathcal{F}$, with the definitions $0\cdot f\!\left(\frac{0}{0}\right) = 0$ and $0\cdot f\!\left(\frac{a}{0}\right) = \lim_{x\to 0} x\cdot f\!\left(\frac{a}{x}\right) = \lim_{u\to\infty} a\cdot\frac{f(u)}{u}$. Further, let us denote $f_{\infty} = \lim_{u\to 0^{+}}\left\{u\cdot f\!\left(\frac{1}{u}\right)\right\}$ and $f(0) = \lim_{u\to 0^{+}}\left\{f(u)\right\}$. Then there exists $c\in\mathbb{R}$ such that
$$\hat{f}(u,v) \ge u\cdot c + v\cdot\bigl(f(1) - c\bigr)\quad\text{for all } (u,v)\in A$$
and
$$\hat{f}(u,v) \le u\cdot f_{\infty} + v\cdot f(0)\quad\text{for all } (u,v)\in (0,\infty)^{2}.$$

Proof. See Liese and Vajda (1987).


This proposition provides the essential ingredients to prove the lemma:

Lemma. The f-divergence $D_f$ for positive measures p and ρ is bounded (if the limit exists and is finite):
$$0 \le D_f(p\|\rho) \le \lim_{u\to 0^{+}}\left\{f(u) + u\cdot f\!\left(\frac{1}{u}\right)\right\}$$
with $u = \frac{p}{\rho}$.

Proof. Let $p^{*}$ be a nonnegative integrable function defined as
$$p^{*} = \frac{2p}{p+\rho}.$$
Further, let us define
$$\tilde{f}(u) = \hat{f}(u, 2-u)\quad\text{for } u\in[0,2].$$

Then it follows directly from the above proposition that there is $c\in\mathbb{R}$ such that
$$2\cdot f(1) + 2\cdot c\cdot(p^{*}-1) \le \tilde{f}(p^{*}) \le (2-p^{*})\cdot f(0) + p^{*}\cdot f_{\infty},$$
which leads to
$$f(1) + \frac{c}{p+\rho}\cdot(p-\rho) \le \frac{\rho}{p+\rho}\cdot f\!\left(\frac{p}{\rho}\right) \le \frac{\rho}{p+\rho}\cdot f(0) + \frac{p}{p+\rho}\cdot f_{\infty}.$$

With f being a determining function of an f-divergence, it holds that f(1) = 0, and thus
$$c\cdot(p-\rho) \le \rho\cdot f\!\left(\frac{p}{\rho}\right) \le \rho\cdot f(0) + p\cdot f_{\infty}.$$
We now get
$$c\cdot\int(p-\rho)\,dx \le D_f(p\|\rho) \le f(0)\cdot\int\rho\,dx + f_{\infty}\cdot\int p\,dx.$$

Since p and ρ are positive measures with weights W(p) ≤ 1 and W(ρ) ≤ 1 according to equation 2.1, this finally yields
$$0 \le D_f(p\|\rho) \le f(0) + f_{\infty},$$
which completes the proof of the lemma.
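As a concrete illustration of this bound (an example added here, not part of the original proof), consider the Hellinger divergence with determining function f(u) = (√u − 1)². Then f(1) = 0, f(0) = 1, and f_∞ = lim_{u→0+} u·f(1/u) = lim_{u→0+} (1 − 2√u + u) = 1, so the lemma gives 0 ≤ D_H(p||ρ) ≤ 2. This agrees with the direct estimate D_H(p||ρ) = ∫(√p − √ρ)² dx ≤ ∫(p + ρ) dx ≤ 2 for positive measures with weights at most 1.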


References

Alex, N., Hasenfuss, A., & Hammer, B. (2009). Patch clustering for massive data sets. Neurocomputing, 72(7–9), 1455–1469.
Amari, S.-I. (1985). Differential-geometrical methods in statistics. Berlin: Springer.
Amari, S.-I., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.
Basseville, M. (1988). Distance measures for signal processing and pattern recognition (Tech. Rep. 899). Paris: Institut National de Recherche en Informatique et en Automatique.
Basu, A., Harris, I., Hjort, N., & Jones, M. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.
Bertin, N., Fevotte, C., & Badeau, R. (2009). A tempering approach for Itakura-Saito non-negative matrix factorization, with application to music transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1545–1548). Piscataway, NJ: IEEE Press.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.
Bunte, K., Hammer, B., Villmann, T., Biehl, M., & Wismuller, A. (2010). Exploratory observation machine (XOM) with Kullback-Leibler divergence for dimensionality reduction and visualization. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (pp. 87–92). Evere, Belgium: d-side publications.
Cichocki, A., & Amari, S.-I. (2010). Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12, 1532–1568.
Cichocki, A., Lee, H., Kim, Y.-D., & Choi, S. (2008). Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29, 1433–1440.
Cichocki, A., Zdunek, R., Phan, A., & Amari, S.-I. (2009). Nonnegative matrix and tensor factorizations. Hoboken, NJ: Wiley.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, A. (2002). Margin analysis of the LVQ algorithm. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 462–468). Cambridge, MA: MIT Press.
Csiszar, I. (1967). Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungaria, 2, 299–318.
Eguchi, S., & Kano, Y. (2001). Robustifying maximum likelihood estimation (Tech. Rep. 802). Tokyo: Tokyo Institute of Statistical Mathematics.
Erdogmus, D. (2002). Information theoretic learning: Renyi's entropy and its application to adaptive systems training. Unpublished doctoral dissertation, University of Florida.
Fichtenholz, G. (1964). Differential- und Integralrechnung (9th ed.). Berlin: Deutscher Verlag der Wissenschaften.
Frigyik, B., Srivastava, S., & Gupta, M. (2008a). Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11), 5130–5139.
Frigyik, B. A., Srivastava, S., & Gupta, M. (2008b). An introduction to functional derivatives (Tech. Rep. UWEETR-2008-0001). Seattle: Department of Electrical Engineering, University of Washington.
Fujisawa, H., & Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99, 2053–2081.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21(1–3), 173–190.
Hammer, B., & Villmann, T. (2002). Generalized relevance learning vector quantization. Neural Networks, 15(8–9), 1059–1068.
Hegde, A., Erdogmus, D., Lehn-Schiøler, T., Rao, Y., & Principe, J. (2004). Vector quantization by density matching in the minimum Kullback-Leibler-divergence sense. In Proc. of the International Joint Conference on Artificial Neural Networks (pp. 105–109). Piscataway, NJ: IEEE Press.
Heskes, T. (1999). Energy functions for self-organizing maps. In E. Oja & S. Kaski (Eds.), Kohonen maps (pp. 303–316). Amsterdam: Elsevier.
Hulle, M. M. V. (2000). Faithful representations and topographic maps. Hoboken, NJ: Wiley.
Hulle, M. M. V. (2002a). Joint entropy maximization in kernel-based topographic maps. Neural Computation, 14(8), 1887–1906.
Hulle, M. M. V. (2002b). Kernel-based topographic map formation achieved with an information theoretic approach. Neural Networks, 15, 1029–1039.
Itakura, F., & Saito, S. (1973). Analysis synthesis telephony based on the maximum likelihood method. In J. Flanagan & R. Rabiner (Eds.), Speech synthesis (pp. 289–292). Stroudsburg, PA: Dowden, Hutchinson, & Ross.
Jang, E., Fyfe, C., & Ko, H. (2008). Bregman divergences and the self organising map. In C. Fyfe, D. Kim, S.-Y. Lee, & H. Yin (Eds.), Intelligent data engineering and automated learning (pp. 452–458). New York: Springer.
Jenssen, R. (2005). An information theoretic approach to machine learning. Unpublished doctoral dissertation, University of Tromsø.
Jenssen, R., Principe, J., Erdogmus, D., & Eltoft, T. (2006). The Cauchy-Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6), 614–629.
Kantorowitsch, I., & Akilow, G. (1978). Funktionalanalysis in normierten Räumen (2nd ed.). Berlin: Akademie-Verlag.
Kapur, J. (1994). Measures of information and their application. Hoboken, NJ: Wiley.
Kohonen, T. (1997). Self-organizing maps (2nd ext. ed.). New York: Springer.
Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lai, P., & Fyfe, C. (2009). Bregman divergences and multi-dimensional scaling. In M. Koppen, N. Kasabov, & G. Coghill (Eds.), Proceedings of the International Conference on Information Processing 2008 (pp. 935–942). New York: Springer.
Lee, J., & Verleysen, M. (2005). Generalization of the lp norm for time series and its application to self-organizing maps. In M. Cottrell (Ed.), Proc. of Workshop on Self-Organizing Maps (pp. 733–740). Paris: Sorbonne.
Lee, J., & Verleysen, M. (2007). Nonlinear dimensionality reduction. New York: Springer.
Lehn-Schiøler, T., Hegde, A., Erdogmus, D., & Principe, J. (2005). Vector quantization using information theoretic concepts. Natural Computing, 4(1), 39–51.
Liese, F., & Vajda, I. (1987). Convex statistical distances. Leipzig: Teubner-Verlag.
Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10), 4394–4412.
Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, 28, 84–95.
Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Martinetz, T. M., Berkovich, S. G., & Schulten, K. J. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4), 558–569.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta divergence. Neural Computation, 14, 1859–1886.
Minka, T. (2005). Divergence measures and message passing (Tech. Rep. 173). Cambridge, UK: Microsoft Research.
Mwebaze, E., Schneider, P., Schleif, F.-M., Haase, S., Villmann, T., & Biehl, M. (2010). Divergence based learning vector quantization. In M. Verleysen (Ed.), Proc. of European Symposium on Artificial Neural Networks (pp. 247–252). Evere, Belgium: D-side.
Nielsen, F., & Nock, R. (2009). Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), 2882–2903.
Principe, J. C., III, & Xu, D. (2000). Information theoretic learning. In S. Haykin & J. Fisher (Eds.), Unsupervised adaptive filtering. Hoboken, NJ: Wiley.
Qiao, Y., & Minematsu, N. (2008). f-divergence is a generalized invariant measure between distributions. In INTERSPEECH: Proc. of the Annual Conference of the International Speech Communication Association (pp. 1349–1352). N.p.: International Speech Communication Association.
Ramsay, J., & Silverman, B. (2006). Functional data analysis (2nd ed.). New York: Springer.
Renyi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press.
Renyi, A. (1970). Probability theory. Amsterdam: North-Holland.
Rossi, F., Delannay, N., Conan-Gueza, B., & Verleysen, M. (2005). Representation of functional data in neural networks. Neurocomputing, 64, 183–210.
Santos-Rodríguez, R., Guerrero-Curieses, A., Alaiz-Rodríguez, R., & Cid-Sueiro, J. (2009). Cost-sensitive learning based on Bregman divergences. Machine Learning, 76(2–3), 271–285.
Sato, A., & Yamada, K. (1996). Generalized learning vector quantization. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 423–429). Cambridge, MA: MIT Press.
Schneider, P., Biehl, M., & Hammer, B. (2009). Hyperparameter learning in robust soft LVQ. In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks (pp. 517–522). Evere, Belgium: D-side.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–432.
Taneja, I., & Kumar, P. (2004). Relative information of type s, Csiszar's f-divergence, and information inequalities. Information Sciences, 166, 105–125.
Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.
Torkkola, K., & Campbell, W. (2000). Mutual information in learning feature transformations. In Proc. of the International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
Villmann, T. (2007). Sobolev metrics for learning of functional data: Mathematical and theoretical aspects (Machine Learning Reports 1, 1–15). Available online at http://www.uni-leipzig.de/compint/mlr/mlr 01 2007.pdf.
Villmann, T., Haase, S., Schleif, F.-M., & Hammer, B. (2010). Divergence based online learning in vector quantization. In L. Rutkowski, W. Duch, J. Kaprzyk, J. Korbicz, & R. Tadeusiewicz (Eds.), Proc. of the International Conference on Artificial Intelligence and Soft Computing. New York: Springer.
Villmann, T., Haase, S., Simmuteit, S., Haase, M., & Schleif, F.-M. (2010). Functional vector quantization based on divergence learning. Ulmer Informatik-Berichte, 2010-05, 8–11.
Villmann, T., Hammer, B., Schleif, F.-M., Geweniger, T., & Herrmann, W. (2006). Fuzzy classification by fuzzy labeled neural gas. Neural Networks, 19, 772–779.
Villmann, T., Hammer, B., Schleif, F.-M., Hermann, W., & Cottrell, M. (2008). Fuzzy classification using information theoretic learning vector quantization. Neurocomputing, 71, 3070–3076.
Villmann, T., & Schleif, F.-M. (2009). Functional vector quantization by neural maps. In J. Chanussot (Ed.), Proceedings of First Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (pp. 1–4). Piscataway, NJ: IEEE Press.
Villmann, T., Schleif, F.-M., Kostrzewa, M., Walch, A., & Hammer, B. (2008). Classification of mass-spectrometric data in clinical proteomics using learning vector quantization methods. Briefings in Bioinformatics, 9(2), 129–143.
Wismuller, A. (2009). The exploration machine: A novel method for data visualization. In J. Principe & R. Miikkulainen (Eds.), Advances in self-organizing maps: Proceedings of the 7th International Workshop (pp. 344–352). New York: Springer.
Zador, P. L. (1982). Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28, 149–159.

Received February 26, 2010; accepted October 5, 2010.

