arXiv:1406.1385v1 [cs.LG] 5 Jun 2014

Learning the Information Divergence

Onur Dikmen, Zhirong Yang, and Erkki Oja

The authors are with the Department of Information and Computer Science, Aalto University, 00076, Finland. e-mail: [email protected]; [email protected]; [email protected]

Abstract—Information divergence that measures the difference between two nonnegative matrices or tensors has found its use in a variety of machine learning problems. Examples are Nonnegative Matrix/Tensor Factorization, Stochastic Neighbor Embedding, topic models, and Bayesian network optimization. The success of such a learning task depends heavily on a suitable divergence. A large variety of divergences have been suggested and analyzed, but very few results are available for an objective choice of the optimal divergence for a given task. Here we present a framework that facilitates automatic selection of the best divergence among a given family, based on standard maximum likelihood estimation. We first propose an approximated Tweedie distribution for the β-divergence family. Selecting the best β then becomes a machine learning problem solved by maximum likelihood. Next, we reformulate α-divergence in terms of β-divergence, which enables automatic selection of α by maximum likelihood with reuse of the learning principle for β-divergence. Furthermore, we show the connections between γ- and β-divergences as well as Rényi- and α-divergences, such that our automatic selection framework is extended to non-separable divergences. Experiments on both synthetic and real-world data demonstrate that our method can quite accurately select information divergence across different learning problems and various divergence families.

Index Terms—information divergence, Tweedie distribution, maximum likelihood, nonnegative matrix factorization, stochastic neighbor embedding.

I. INTRODUCTION

Information divergences are an essential element in modern machine learning. They originated in estimation theory where a divergence maps the dissimilarity between two probability distributions to nonnegative values. Presently, information divergences have been extended for nonnegative tensors and used in many learning problems where the objective is to minimize the approximation error between the observed data and the model. Typical applications include Nonnegative Matrix Factorization (see e.g. [1], [2], [3], [4]), Stochastic Neighbor Embedding [5], [6], topic models [7], [8], and Bayesian network optimization [9].

There exist a large variety of information divergences. In Section II, we summarize the most popularly used parametric families including α-, β-, γ- and Rényi-divergences [10], [11], [12], [13], [14] and their combinations (e.g. [15]). The four parametric families in turn belong to broader ones such as the Csiszár-Morimoto f-divergences [16], [17] and Bregman divergences [18]. Data analysis techniques based on information divergences have been widely and successfully applied to various data such as text [19], electroencephalography [3], facial images [20], and audio spectrograms [21].

Compared to the rich set of available information divergences, there is little research on how to select the best one for a given application. This is an important issue because the performance of a given divergence-based estimation or modeling method in a particular task very much depends on the divergence used. Formulating a learning task in a family of divergences greatly increases the flexibility to handle different types of noise in data. For example, Euclidean distance is suitable for data with Gaussian noise; Kullback-Leibler divergence has shown success for finding topics in text documents [7]; and Itakura-Saito divergence has proven to be suitable for audio signal processing [21]. A conventional workaround is to select among a finite number of candidate divergences using a validation set. This however cannot be applied to divergences that are non-separable over tensor entries. The validation approach is also problematic for tasks where all data are needed for learning, for example, cluster analysis.

In Section III, we propose a new method of statistical learning for selecting the best divergence among the four popular parametric families in any given data modeling task. Our starting-point is the Tweedie distribution [22], which is known to have a relationship with β-divergence [23], [24]. The Maximum Tweedie Likelihood (MTL) is in principle a disciplined and straightforward method for choosing the optimal β value. However, in order for this to be feasible in practice, two shortcomings with the MTL method have to be overcome: 1) Tweedie distribution is not defined for all β; 2) calculation of Tweedie likelihood is complicated and prone to numerical problems for large β. To overcome these drawbacks, we propose here a novel distribution using an exponential over the β-divergence with a specific augmentation term. The new distribution has the following nice properties: 1) it is close to the Tweedie distribution, especially at four important special cases; 2) it exists for all β ∈ R; 3) its likelihood can be calculated by standard statistical software. We call the new density the Exponential Divergence with Augmentation (EDA). EDA is a non-normalized density, i.e., its likelihood includes a normalizing constant which is not analytically available. But, since the density is univariate the normalizing constant can be efficiently and accurately estimated by numerical integration. The method of Maximizing the Exponential Divergence with Augmentation Likelihood (MEDAL) thus gives a more robust β selection in a wider range than MTL. β estimation on EDA can also be carried out using parameter estimation methods, e.g., Score Matching (SM) [25], specifically proposed for non-normalized densities. In the experiments section, we show that SM on EDA also performs as accurately as MEDAL.

Besides β-divergence, the MEDAL method is extended to select the best divergence in other parametric families. We reformulate α-divergence in terms of β-divergence after a change of parameters so that α can be optimized using the
MEDAL method. Our method can also be applied to non-separable cases. We show the equivalence between β- and γ-divergences, and between α- and Rényi divergences by a connecting scalar, which allows us to choose the best γ- or Rényi-divergence by reusing the MEDAL method.

We tested our method with extensive experiments, whose results are presented in Section IV. We have used both synthetic data with a known distribution and real-world data including music, stock prices, and social networks. The MEDAL method is applied to different learning problems: Nonnegative Matrix Factorization (NMF) [26], [3], [1], Projective NMF [27], [28] and Symmetric Stochastic Neighbor Embedding for visualization [5], [6]. We also demonstrate that our method outperforms Score Matching on Exponential Divergence distribution (ED), a previous approach for β-divergence selection [29]. Conclusions and discussions on future work are given in Section V.

II. INFORMATION DIVERGENCES

Many learning objectives can be formulated as an approximation of the form x ≈ µ, where x > 0 is the observed data (input) and µ is the approximation given by the model. The formulation for µ totally depends on the task to be solved. Consider Nonnegative Matrix Factorization: then x > 0 is a data matrix and µ is a product of two lower-rank nonnegative matrices which typically give a sparse representation for the columns of x. Other concrete examples are given in Section IV.

The approximation error can be measured by various information divergences. Suppose µ is parameterized by Θ. The learning problem becomes an optimization procedure that minimizes the given divergence D(x||µ(Θ)) over Θ. Regularization may be applied for Θ for complexity control. For notational brevity we focus on definitions over vectorial x, µ, Θ in this section, while they can be extended to matrices or higher order tensors in a straightforward manner.

In this work we consider four parametric families of divergences, which are the widely used α-, β-, γ- and Rényi-divergences. This collection is rich because it covers most commonly used divergences. The definition of the four families and some of their special cases are given below.

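• α-divergence [10], [11] is defined as

$$D_\alpha(x\|\mu) = \frac{\sum_i \left[ x_i^{\alpha}\mu_i^{1-\alpha} - \alpha x_i + (\alpha-1)\mu_i \right]}{\alpha(\alpha-1)}. \tag{1}$$

The family contains the following special cases:

$$\begin{aligned}
D_{\alpha=2}(x\|\mu) &= D_{\mathrm{P}}(x\|\mu) = \frac{1}{2}\sum_i \frac{(x_i-\mu_i)^2}{\mu_i} \\
D_{\alpha\to 1}(x\|\mu) &= D_{\mathrm{I}}(x\|\mu) = \sum_i \left( x_i \ln\frac{x_i}{\mu_i} - x_i + \mu_i \right) \\
D_{\alpha=1/2}(x\|\mu) &= 2D_{\mathrm{H}}(x\|\mu) = 2\sum_i \left( \sqrt{x_i} - \sqrt{\mu_i} \right)^2 \\
D_{\alpha\to 0}(x\|\mu) &= D_{\mathrm{I}}(\mu\|x) = \sum_i \left( \mu_i \ln\frac{\mu_i}{x_i} - \mu_i + x_i \right) \\
D_{\alpha=-1}(x\|\mu) &= D_{\mathrm{IP}}(x\|\mu) = \frac{1}{2}\sum_i \frac{(x_i-\mu_i)^2}{x_i}
\end{aligned}$$

where D_I, D_P, D_IP, and D_H denote non-normalized Kullback-Leibler, Pearson Chi-square, inverse Pearson and Hellinger distances, respectively.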

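• β-divergence [30], [31] is defined as

$$D_\beta(x\|\mu) = \frac{\sum_i \left[ x_i^{\beta+1} + \beta\mu_i^{\beta+1} - (\beta+1)x_i\mu_i^{\beta} \right]}{\beta(\beta+1)}. \tag{2}$$

The family contains the following special cases:

$$D_{\beta=1}(x\|\mu) = D_{\mathrm{EU}}(x\|\mu) = \frac{1}{2}\sum_i (x_i-\mu_i)^2 \tag{3}$$

$$D_{\beta\to 0}(x\|\mu) = D_{\mathrm{I}}(x\|\mu) = \sum_i \left( x_i \ln\frac{x_i}{\mu_i} - x_i + \mu_i \right) \tag{4}$$

$$D_{\beta\to -1}(x\|\mu) = D_{\mathrm{IS}}(x\|\mu) = \sum_i \left( \frac{x_i}{\mu_i} - \ln\frac{x_i}{\mu_i} - 1 \right) \tag{5}$$

$$D_{\beta=-2}(x\|\mu) = \sum_i \left( \frac{x_i}{2\mu_i^2} - \frac{1}{\mu_i} + \frac{1}{2x_i} \right), \tag{6}$$

where D_EU and D_IS denote the Euclidean distance and Itakura-Saito divergence, respectively.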

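• γ-divergence [13] is defined as

$$D_\gamma(x\|\mu) = \frac{1}{\gamma(\gamma+1)}\left[ \ln\left(\sum_i x_i^{\gamma+1}\right) + \gamma\ln\left(\sum_i \mu_i^{\gamma+1}\right) - (\gamma+1)\ln\left(\sum_i x_i\mu_i^{\gamma}\right) \right]. \tag{7}$$

The normalized Kullback-Leibler (KL) divergence is a special case of γ-divergence:

$$D_{\gamma\to 0}(x\|\mu) = D_{\mathrm{KL}}(\tilde{x}\|\tilde{\mu}) = \sum_i \tilde{x}_i \ln\frac{\tilde{x}_i}{\tilde{\mu}_i}, \tag{8}$$

where $\tilde{x}_i = x_i/\sum_j x_j$ and $\tilde{\mu}_i = \mu_i/\sum_j \mu_j$.

• Rényi divergence [32] is defined as

$$D_\rho(x\|\mu) = \frac{1}{\rho-1}\ln\left(\sum_i \tilde{x}_i^{\rho}\tilde{\mu}_i^{1-\rho}\right) \tag{9}$$

for ρ > 0. The Rényi divergence also includes the normalized Kullback-Leibler divergence as its special case when ρ → 1.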
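The separable families (1) and (2) translate directly into code. The following is a minimal NumPy sketch, not the authors' implementation: the function names `beta_divergence` and `alpha_divergence` are illustrative choices, strictly positive inputs are assumed, and the limiting cases β → 0, β → −1, α → 1 and α → 0 are handled by switching to the closed forms listed above.

```python
import numpy as np


def beta_divergence(x, mu, beta):
    """Beta-divergence of Eq. (2); the limits beta -> 0 and beta -> -1 use Eqs. (4), (5)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    if np.isclose(beta, 0.0):       # non-normalized Kullback-Leibler
        return float(np.sum(x * np.log(x / mu) - x + mu))
    if np.isclose(beta, -1.0):      # Itakura-Saito
        return float(np.sum(x / mu - np.log(x / mu) - 1.0))
    num = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return float(np.sum(num) / (beta * (beta + 1)))


def alpha_divergence(x, mu, alpha):
    """Alpha-divergence of Eq. (1); the limits alpha -> 1 and alpha -> 0 are KL forms."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    if np.isclose(alpha, 1.0):      # D_I(x || mu)
        return float(np.sum(x * np.log(x / mu) - x + mu))
    if np.isclose(alpha, 0.0):      # D_I(mu || x)
        return float(np.sum(mu * np.log(mu / x) - mu + x))
    num = x ** alpha * mu ** (1 - alpha) - alpha * x + (alpha - 1) * mu
    return float(np.sum(num) / (alpha * (alpha - 1)))
```

Evaluating these functions at α = 2, 1/2, −1 or β = 1, −2 reproduces the Pearson, Hellinger, inverse Pearson, Euclidean and β = −2 expressions given in the special-case lists.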

III. DIVERGENCE SELECTION BY STATISTICAL LEARNING

The above rich collection of information divergences basically allows great flexibility to the approximation framework. However, practitioners must face a choice problem: how to select the best divergence in a family? In most existing applications the selection is done empirically by the human. A conventional automatic selection method is cross-validation [33], [34], where the training only uses part of the entries of x and the remaining ones are used for validation. This method has a number of drawbacks: 1) it is only applicable to the divergences where the entries are separable (e.g. α- or β-divergence). Leaving out some entries for γ- and Rényi divergences is infeasible due to the logarithm or normalization; 2) separation of entries is not applicable in some applications where all entries are needed in the learning, for example, cluster analysis.

Our proposal here is to use the familiar and proven technique of maximum likelihood estimation for automatic divergence selection, using a suitably chosen and very flexible probability density model for the data. In the following we discuss this statistical learning approach for automatic divergence selection in the family of β-divergences, followed by its extensions to the other divergence families.

A. Selecting β-divergence

1) Maximum Tweedie Likelihood (MTL): We start from the probability density function (pdf) of an exponential dispersion model (EDM) [22]:

$$p_{\mathrm{EDM}}(x;\theta,\phi,p) = f(x,\phi,p)\exp\left[\frac{1}{\phi}\left(x\theta - \kappa(\theta)\right)\right] \tag{10}$$

where φ > 0 is the dispersion parameter, θ is the canonical parameter, and κ(θ) is the cumulant function (when φ = 1 its derivatives w.r.t. θ give the cumulants). Such a distribution has mean µ = κ′(θ) and variance V(µ, p) = φκ′′(θ). This density is defined for x ≥ 0, thus µ > 0.

A Tweedie distribution is an EDM whose variance has a special form, $V(\mu) = \mu^p$ with p ∈ R\(0, 1). The canonical parameter and the cumulant function that satisfy this property are [22]

$$\theta = \begin{cases} \dfrac{\mu^{1-p}-1}{1-p}, & \text{if } p \neq 1 \\ \ln\mu, & \text{if } p = 1 \end{cases}, \qquad \kappa(\theta) = \begin{cases} \dfrac{\mu^{2-p}-1}{2-p}, & \text{if } p \neq 2 \\ \ln\mu, & \text{if } p = 2. \end{cases} \tag{11}$$

Note that ln µ is the limit of $(\mu^t - 1)/t$ as t → 0. Finite analytical forms of f(x, φ, p) in Tweedie distribution are generally unavailable. The function can be expanded with infinite series [35] or approximated by saddle point estimation [36].

It is known that the Tweedie distribution has a connection to β-divergence (see, e.g., [23], [24]): maximizing the likelihood of Tweedie distribution for certain p values is equivalent to minimizing the corresponding divergence with β = 1 − p. In particular, the gradients of the log-likelihood of Gamma, Poisson and Gaussian distributions over µ_i are equal to the ones of β-divergence with β = −1, 0, 1, respectively. This motivates a β-divergence selection method by Maximum Tweedie Likelihood (MTL).

However, MTL has the following two shortcomings. First, Tweedie distribution is not defined for p ∈ (0, 1). That is, if the best β = 1 − p happens to be in the range (0, 1), it cannot be found by MTL; in addition, there is little research on the Tweedie distribution with β > 1 (p < 0). Second, f(x, φ, p) in Tweedie distribution is not the probability normalizing constant (note that it depends on x), and its evaluation requires ad hoc techniques. The existing software using the infinite series expansion approach [35] (see Appendix A) is prone to numerical computation problems especially for −0.1 < β < 0. There is no existing implementation that can calculate Tweedie likelihood for β > 1.
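The gradient connection stated above can be verified directly from (10), (11) and (2). The short derivation below is a sketch for a single scalar entry; it only uses the fact that f(x, φ, p) in (10) does not depend on µ, and it shows that the two gradients agree up to the factor −1/φ, so maximizing the likelihood over µ and minimizing the divergence over µ share their stationary points.

```latex
\begin{align*}
\frac{\partial}{\partial\mu}\ln p_{\mathrm{EDM}}(x;\theta,\phi,p)
  &= \frac{1}{\phi}\left( x\,\frac{\partial\theta}{\partial\mu}
     - \frac{\partial\kappa}{\partial\mu} \right)
   = \frac{1}{\phi}\left( x\mu^{-p} - \mu^{1-p} \right)
   = \frac{1}{\phi}\,\mu^{\beta-1}(x-\mu), \qquad \beta = 1-p, \\
\frac{\partial}{\partial\mu} D_\beta(x\|\mu)
  &= \frac{\beta(\beta+1)\mu^{\beta} - \beta(\beta+1)\,x\mu^{\beta-1}}{\beta(\beta+1)}
   = \mu^{\beta-1}(\mu-x), \\
\Rightarrow\quad
\frac{\partial}{\partial\mu}\ln p_{\mathrm{EDM}}
  &= -\frac{1}{\phi}\,\frac{\partial}{\partial\mu} D_\beta(x\|\mu).
\end{align*}
```

The logarithmic branches of (11) give the same derivatives, so the identity also covers p = 1 and p = 2.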

2) Maximum Exponential Divergence with Augmentation Likelihood (MEDAL): Our answer to the above shortcomings in MTL is to design an alternative distribution with the following properties: 1) it is close to the Tweedie distribution, especially for the four crucial points when β ∈ {−2, −1, 0, 1}; 2) it should be defined for all β ∈ R; 3) its pdf can be evaluated more robustly by standard statistical software.

From (10) and (11) the pdf of the Tweedie distribution is written as

$$p_{\mathrm{Tw}}(x;\mu,\phi,\beta) = f(x,\phi,\beta)\exp\left[\frac{1}{\phi}\left(\frac{x\mu^{\beta}}{\beta} - \frac{\mu^{\beta+1}}{\beta+1}\right)\right] \tag{12}$$

w.r.t. β instead of p, using the relation β = 1 − p. This holds when β ≠ 0 and β ≠ −1. The extra terms 1/(1 − p) and 1/(2 − p) in (11) have been absorbed in f(x, φ, β). The cases β = 0 or β = −1 have to be analyzed separately.

To make an explicit connection with β-divergence defined in (2), we suggest a new distribution given in the following form:

$$\begin{aligned}
p_{\mathrm{approx}}(x;\mu,\phi,\beta) &= g(x,\phi,\beta)\exp\left\{-\frac{1}{\phi}D_\beta(x\|\mu)\right\} \\
&= g(x,\phi,\beta)\exp\left[\frac{1}{\phi}\left(-\frac{x^{\beta+1}}{\beta(\beta+1)} + \frac{x\mu^{\beta}}{\beta} - \frac{\mu^{\beta+1}}{\beta+1}\right)\right].
\end{aligned} \tag{13}$$

Now the β-divergence for scalar x appears in the exponent, and g(x, φ, β) will be used to approximate this with the Tweedie distribution. Ideally, the choice

$$g(x,\phi,\beta) = f(x,\phi,\beta) \Big/ \exp\left[\frac{1}{\phi}\left(-\frac{x^{\beta+1}}{\beta(\beta+1)}\right)\right]$$

would result in full equivalence to Tweedie distribution, as seen from (12). However, because f(x, φ, β) is unknown in the general case, such g is also unavailable.

We can, however, try to approximate g using the fact that p_approx must be a proper density whose integral is equal to one. From (13) it then follows

$$\exp\left[\frac{1}{\phi}\frac{\mu^{\beta+1}}{\beta+1}\right] = \int dx\, g(x,\phi,\beta)\exp\left[\frac{1}{\phi}\left(-\frac{x^{\beta+1}}{\beta(\beta+1)} + \frac{x\mu^{\beta}}{\beta}\right)\right] \tag{14}$$

This integral is, of course, impossible to evaluate because we do not even know the function inside. However, the integral can be approximated nicely by Laplace's method. Laplace's approximation is

$$\int_a^b dx\, f(x)e^{Mh(x)} \approx \sqrt{\frac{2\pi}{M|h''(x_0)|}}\, f(x_0)e^{Mh(x_0)}$$

where $x_0 = \arg\max_x h(x)$ and M is a large constant.

In order to approximate (14) by Laplace's method, 1/φ takes the role of M and thus the approximation is valid for small φ. We need the maximizer of the exponentiated term $h(x) = -\frac{x^{\beta+1}}{\beta(\beta+1)} + \frac{x\mu^{\beta}}{\beta}$. This term has a zero first derivative and negative second derivative, i.e., it is maximized, at x = µ. Thus, Laplace's method gives us

$$\begin{aligned}
\exp\left[\frac{1}{\phi}\frac{\mu^{\beta+1}}{\beta+1}\right]
&\approx \sqrt{\frac{2\pi\phi}{\left|-\mu^{\beta-1}\right|}}\, g(\mu,\phi,\beta)\exp\left[\frac{1}{\phi}\left(-\frac{\mu^{\beta+1}}{\beta(\beta+1)} + \frac{\mu^{\beta+1}}{\beta}\right)\right] \\
&= \sqrt{\frac{2\pi\phi}{\mu^{\beta-1}}}\, g(\mu,\phi,\beta)\exp\left[\frac{1}{\phi}\frac{\mu^{\beta+1}}{\beta+1}\right].
\end{aligned}$$

The approximation gives $g(\mu,\phi,\beta) = \frac{1}{\sqrt{2\pi\phi}}\mu^{(\beta-1)/2}$ which suggests the function

$$g(x,\phi,\beta) = \frac{1}{\sqrt{2\pi\phi}}\, x^{(\beta-1)/2} = \frac{1}{\sqrt{2\pi\phi}}\exp\left[\frac{\beta-1}{2}\ln x\right].$$

Putting this result into (13) as such does not guarantee a proper pdf however, because it is an approximation, only valid at the limit φ → 0. To make it proper, we have to add a normalizing constant into the density in (13).
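The quality of the Laplace-based choice of g can be probed numerically. The sketch below (illustrative names `beta_div` and `laplace_mass`, SciPy's `quad` as the integrator, and an integration window of ±12 standard deviations around x = µ are all assumptions of this example) integrates g(x, φ, β) exp{−D_β(x||µ)/φ} and reports how far the total mass is from one.

```python
import numpy as np
from scipy.integrate import quad


def beta_div(x, mu, beta):
    """Scalar beta-divergence, Eq. (2), with the limits beta -> 0 and beta -> -1."""
    if np.isclose(beta, 0.0):
        return x * np.log(x / mu) - x + mu
    if np.isclose(beta, -1.0):
        return x / mu - np.log(x / mu) - 1.0
    num = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return num / (beta * (beta + 1))


def laplace_mass(mu, beta, phi, width=12.0):
    """Integral of g(x) exp(-D_beta(x||mu)/phi), g(x) = x^((beta-1)/2) / sqrt(2 pi phi)."""
    def density(x):
        log_g = 0.5 * (beta - 1.0) * np.log(x) - 0.5 * np.log(2.0 * np.pi * phi)
        return np.exp(log_g - beta_div(x, mu, beta) / phi)

    # Laplace's method concentrates the mass around x = mu with
    # sigma^2 = phi / |h''(mu)| = phi * mu^(1 - beta).
    sigma = np.sqrt(phi * mu ** (1.0 - beta))
    lo = max(mu - width * sigma, 1e-12)
    hi = mu + width * sigma
    value, _ = quad(density, lo, hi, points=[mu])
    return value
```

For β = 1 and β = −2 the integrand is exactly a Gaussian and an inverse Gaussian density, so the mass is one up to the truncated tails; for other β the deviation from one shrinks with φ, which is why a normalizing constant is still required.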

The pdf of the final distribution, for a scalar argument x, thus becomes

$$p_{\mathrm{approx}}(x;\mu,\beta,\phi) = \frac{1}{Z(\mu,\beta,\phi)}\exp\left\{R(x,\beta) - \frac{1}{\phi}D_\beta(x\|\mu)\right\} \tag{15}$$

where Z(µ, β, φ) is the normalizing constant accounting for the terms which are independent of x, and R(x, β) is an augmentation term given as

$$R(x,\beta) = \frac{\beta-1}{2}\ln x. \tag{16}$$

This pdf is a proper density for all β ∈ R, which is guaranteed by the following theorem.

Theorem 1: Let $f(x) = \exp\left\{\frac{\beta-1}{2}\ln x - \frac{1}{\phi}D_\beta(x\|\mu)\right\}$. The improper integral $\int_0^\infty f(x)\,dx$ converges.

Proof: Let $q = \left|\frac{\beta-1}{2}\right| + 1 + \epsilon$ with any $\epsilon \in (0,\infty)$, and $g(x) = x^{-q}$. By these definitions, we have $q > \left|\frac{\beta-1}{2}\right|$, and then for x ≥ 1, $\left(\frac{\beta-1}{2} + q\right)\phi\ln x \le 0 \le D_\beta(x\|\mu)$, i.e. 0 ≤ f(x) ≤ g(x). By Cauchy convergence test, we know that $\int_1^\infty g(x)\,dx$ is convergent because q > 1, and so is $\int_1^\infty f(x)\,dx$. Obviously f(x) is continuous and bounded for x ∈ [0, 1]. Therefore, for x ≥ 0, $\int_0^\infty f(x)\,dx = \int_0^1 f(x)\,dx + \int_1^\infty f(x)\,dx$ also converges.

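Finally, for vectorial x, the pdf is a product of the marginal densities:

$$p_{\mathrm{EDA}}(x;\mu,\beta,\phi) = \frac{1}{Z(\mu,\beta,\phi)}\exp\left\{R(x,\beta) - \frac{1}{\phi}D_\beta(x\|\mu)\right\} \tag{17}$$

where D_β(x||µ) is defined in (2) and

$$R(x,\beta) = \frac{\beta-1}{2}\sum_i \ln x_i. \tag{18}$$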

We call (17) the Exponential Divergence with Augmentation (EDA) distribution, because it applies an exponential over an information divergence plus an augmentation term.

The log-likelihood of the EDA density can be written as

$$\begin{aligned}
\ln p(x;\mu,\beta,\phi) &= \sum_i \ln p(x_i;\mu_i,\beta,\phi) \\
&= \sum_i \left[ \frac{\beta-1}{2}\ln x_i - \frac{1}{\phi}D_\beta(x_i\|\mu_i) - \ln Z(\mu_i,\beta,\phi) \right]
\end{aligned} \tag{19}$$

due to the fact that D_β(x||µ) in Eq. (2) and the augmentation term in (18) are separable over x_i (i.e. x_i are independent given µ_i). The best β is now selected by

$$\beta^* = \arg\max_\beta \left[ \max_\phi \ln p(x;\mu,\beta,\phi) \right], \tag{20}$$

where $\mu = \arg\min_\eta D_\beta(x\|\eta)$. We call the new divergence selection method Maximum EDA Likelihood (MEDAL).
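The selection rule (20) can be sketched for the simplest setting of i.i.d. positive scalars that share a single µ. Everything in the block is an illustrative assumption rather than the authors' code: the names `log_Z`, `eda_loglik` and `medal_select` are hypothetical, the normalizer is computed with SciPy's adaptive `quad` instead of Gauss-Laguerre quadrature, and the inner maximization over φ uses a bounded one-dimensional search on ln φ instead of a grid. With one shared µ, the minimizer of the summed β-divergence is the sample mean for every β, because the gradient is proportional to the sum of (µ − x_i).

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar


def beta_div(x, mu, beta):
    """Elementwise beta-divergence, Eq. (2), with the limits beta -> 0 and beta -> -1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(beta, 0.0):
        return x * np.log(x / mu) - x + mu
    if np.isclose(beta, -1.0):
        return x / mu - np.log(x / mu) - 1.0
    num = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return num / (beta * (beta + 1))


def log_unnormalized(x, mu, beta, phi):
    """R(x, beta) - D_beta(x||mu) / phi, the exponent of Eq. (15)."""
    return 0.5 * (beta - 1.0) * np.log(x) - beta_div(x, mu, beta) / phi


def log_Z(mu, beta, phi):
    """ln Z(mu, beta, phi) by adaptive quadrature, split at x = mu."""
    f = lambda t: float(np.exp(log_unnormalized(t, mu, beta, phi)))
    left, _ = quad(f, 0.0, mu)
    right, _ = quad(f, mu, np.inf)
    return np.log(left + right)


def eda_loglik(x, mu, beta, phi):
    """EDA log-likelihood of Eq. (19) for i.i.d. scalars sharing one mu."""
    return np.sum(log_unnormalized(x, mu, beta, phi)) - x.size * log_Z(mu, beta, phi)


def medal_select(x, betas, log_phi_bounds=(-4.0, 1.0)):
    """Return (best beta, profile log-likelihoods) following Eq. (20)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()  # minimizer of sum_i D_beta(x_i || eta) for every beta
    profile = []
    for beta in betas:
        res = minimize_scalar(lambda s: -eda_loglik(x, mu, beta, np.exp(s)),
                              bounds=log_phi_bounds, method="bounded")
        profile.append(-res.fun)
    profile = np.array(profile)
    return betas[int(np.argmax(profile))], profile
```

A call such as `medal_select(samples, [-2.0, -1.0, 0.0, 1.0])` compares the four special cases; a finer β grid only changes the list passed in, at the cost of one inner φ search per candidate.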

Let us look at the four special cases of Tweedie distribution: Gaussian (N), Poisson (PO), Gamma (G) and Inverse Gaussian (IN). They correspond to β = 1, 0, −1, −2. For simplicity of notation, we may drop the subscript i and write x and µ for one entry in x and µ. Then, the log-likelihoods of the above four special cases are

$$\begin{aligned}
\ln p_{\mathcal{N}}(x;\mu,\phi) &= -\frac{1}{2}\ln(2\pi\phi) - \frac{1}{2\phi}(x-\mu)^2, \\
\ln p_{\mathcal{PO}}(x;\mu) &= x\ln\mu - \mu - \ln\Gamma(x+1) \\
&\approx x\ln\mu - \mu - \ln(2\pi x)/2 - x\ln x + x, \\
\ln p_{\mathcal{G}}(x;1/\phi,\phi\mu) &= (1/\phi - 1)\ln x - \frac{x}{\phi\mu} - (1/\phi)\ln(\phi\mu) - \ln\Gamma(1/\phi), \\
\ln p_{\mathcal{IN}}(x;\mu,1/\phi) &= -\frac{1}{2}\ln(2\pi\phi x^3) - \frac{1}{\phi}\left(\frac{1}{2}\frac{x}{\mu^2} - \frac{1}{\mu} + \frac{1}{2x}\right),
\end{aligned}$$

where in the Poisson case we employ Stirling's approximation¹. To see the similarity of these four special cases with the general expression for the EDA log-likelihood in Eq. (19), let us look at one term in the sum there. It is a fairly straightforward exercise to plug in the β-divergences from Eqs. (3,4,5,6) and the augmentation term from Eq. (18) and see that the log-likelihoods coincide. The normalizing term ln Z(µ, β, φ) for these special cases can be determined from the corresponding density.

In general, the normalizing constant Z(µ, β, φ) is intractable except for a few special cases. Numerical evaluation of Z(µ, β, φ) can be implemented by standard statistical software. Here we employ the approximation with Gauss-Laguerre quadratures (details in Appendix B).
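A plain Gauss-Laguerre rule for Z(µ, β, φ) might look as follows. This is only a sketch under stated assumptions and does not reproduce Appendix B: the function name `eda_normalizer`, the number of nodes, and the `scale` parameter (a substitution x = scale · t that lets the user match the decay rate of the integrand) are choices made for this example. The rule approximates an integral of the form ∫₀^∞ e^{−t} F(t) dt, so the integrand is multiplied by e^{t} in the log domain before the weights are applied.

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss


def beta_div(x, mu, beta):
    """Elementwise beta-divergence, Eq. (2), with the limits beta -> 0 and beta -> -1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(beta, 0.0):
        return x * np.log(x / mu) - x + mu
    if np.isclose(beta, -1.0):
        return x / mu - np.log(x / mu) - 1.0
    num = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return num / (beta * (beta + 1))


def eda_normalizer(mu, beta, phi, n_nodes=40, scale=1.0):
    """Approximate Z = int_0^inf exp{R(x, beta) - D_beta(x||mu) / phi} dx."""
    t, w = laggauss(n_nodes)          # nodes and weights for int_0^inf e^{-t} F(t) dt
    x = scale * t                     # substitution x = scale * t, dx = scale * dt
    log_h = 0.5 * (beta - 1.0) * np.log(x) - beta_div(x, mu, beta) / phi
    # F(t) = e^{t} h(scale * t), evaluated in the log domain to avoid overflow
    return scale * np.sum(w * np.exp(t + log_h))
```

For the Gamma case β = −1 the result can be compared with the closed form Z = e^{1/φ} Γ(1/φ) φ^{1/φ}, which follows from the Gamma density. The rule is accurate when e^{t} h(scale · t) is smooth and polynomial-like; integrands with a singularity at the origin (for example β = 0) or with slowly decaying tails may need a different rule or a different scale.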

Finally, let us note that in addition to the maximum likelihood estimator, Score Matching (SM) [25], [37] can be applied to estimation of β as a density parameter (see Section IV-A). In a previous effort, Lu et al. [29] proposed a similar exponential divergence (ED) distribution

$$p_{\mathrm{ED}}(x;\mu,\beta) \propto \exp\left[-D_\beta(x\|\mu)\right], \tag{21}$$

but without the augmentation. It is easy to show that ED also exists for all β by changing q = 1 + ε in the proof of Theorem 1. We will empirically illustrate the discrepancy between ED and EDA in Section IV-A, showing that the selection based on ED is however inaccurate, especially for β ≤ 0.

¹ The case β = 0 and φ ≠ 1 does not correspond to Poisson distribution, but the transformation $p_{\mathrm{EDM}}(x;\mu,\phi,1) = p_{\mathcal{PO}}(x/\phi;\mu/\phi)/\phi$ can be used to evaluate the pdf.

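B. Selecting α-divergence

We extend the MEDAL method to α-divergence selection. This is done by relating α-divergence to β-divergence with a nonlinear transformation between α and β. Let $y_i = x_i^{\alpha}/\alpha^{2\alpha}$, $m_i = \mu_i^{\alpha}/\alpha^{2\alpha}$ and β = 1/α − 1 for α ≠ 0. We have

$$\begin{aligned}
D_\beta(y_i\|m_i) &= \frac{1}{\beta(\beta+1)}\left( y_i^{\beta+1} + \beta m_i^{\beta+1} - (\beta+1)y_i m_i^{\beta} \right) \\
&= \frac{-\alpha^2}{\alpha-1}\left( \frac{x_i}{\alpha^2} + \frac{1-\alpha}{\alpha}\frac{\mu_i}{\alpha^2} - \frac{1}{\alpha}\frac{x_i^{\alpha}}{\alpha^{2\alpha}}\frac{\mu_i^{1-\alpha}}{\alpha^{2(1-\alpha)}} \right) \\
&= D_\alpha(x_i\|\mu_i)
\end{aligned}$$

This relationship allows us to evaluate the likelihood of µ and α using y_i and β:

$$\begin{aligned}
p(x_i;\mu_i,\alpha,\phi) &= p(y_i;m_i,\beta,\phi)\left|\frac{dy_i}{dx_i}\right| \\
&= p(y_i;m_i,\beta,\phi)\,\frac{x_i^{\alpha-1}}{\alpha^{2(\alpha-1/2)}} \\
&= p(y_i;m_i,\beta,\phi)\, y_i^{-\beta}\,|\beta+1|
\end{aligned}$$

In vectorial form, the best α for D_α(x||µ) is then given by α* = 1/(β* + 1) where

$$\beta^* = \arg\max_\beta \left\{ \max_\phi \left[ \ln p(y;m,\beta) - \beta\ln y_i + \ln|\beta+1| \right] \right\}, \tag{22}$$

where $m = \arg\min_\eta D_\beta(y\|\eta)$. This transformation method can handle all α except α → 0 since it corresponds to β → ∞.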
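The change of variables can be checked numerically. The sketch below restricts itself to α > 0 (an assumption of this example, so that the power α^{2α} is real and positive); the helper names are illustrative. It maps (x, µ, α) to (y, m, β) and lets one compare D_β(y||m) with D_α(x||µ) as well as the two expressions for the Jacobian factor.

```python
import numpy as np


def beta_divergence(x, mu, beta):
    """Beta-divergence of Eq. (2), with the limits beta -> 0 and beta -> -1."""
    if np.isclose(beta, 0.0):
        return float(np.sum(x * np.log(x / mu) - x + mu))
    if np.isclose(beta, -1.0):
        return float(np.sum(x / mu - np.log(x / mu) - 1.0))
    num = x ** (beta + 1) + beta * mu ** (beta + 1) - (beta + 1) * x * mu ** beta
    return float(np.sum(num) / (beta * (beta + 1)))


def alpha_divergence(x, mu, alpha):
    """Alpha-divergence of Eq. (1) for alpha > 0, with the limit alpha -> 1."""
    if np.isclose(alpha, 1.0):
        return float(np.sum(x * np.log(x / mu) - x + mu))
    num = x ** alpha * mu ** (1 - alpha) - alpha * x + (alpha - 1) * mu
    return float(np.sum(num) / (alpha * (alpha - 1)))


def alpha_to_beta(x, mu, alpha):
    """Map (x, mu, alpha) to (y, m, beta) as in Section III-B, assuming alpha > 0."""
    scale = alpha ** (2.0 * alpha)
    return x ** alpha / scale, mu ** alpha / scale, 1.0 / alpha - 1.0
```

For example, α = 1/2 maps to β = 1 with y = 2√x, so the Euclidean distance between y and m equals twice the Hellinger distance between x and µ, in agreement with the special cases of Section II.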

C. Selecting γ- and Rényi divergences

Above we presented the selection methods for two families where the divergence is separable over the tensor entries. Next we consider selection among γ- and Rényi divergence families where their members are not separable. Our strategy is to reduce γ-divergence to β-divergence with a connecting scalar. This is formally given by the following result.

Theorem 2: For x ≥ 0 and τ ∈ R,

$$\arg\min_{\mu\ge 0} D_{\gamma\to\tau}(x\|\mu) = \arg\min_{\mu\ge 0}\left[ \min_{c>0} D_{\beta\to\tau}(x\|c\mu) \right] \tag{23}$$

The proof is done by zeroing the derivative of the right-hand side with respect to c (details in Appendix C).

Theorem 2 states that with a positive scalar, the learning problem formulated by a γ-divergence is equivalent to the one by the corresponding β-divergence. The latter is separable and can be solved by the methods described in Section III-A. An example is between normalized KL-divergence (in γ-divergence) and the non-normalized KL-divergence (in β-divergence) with the optimal connecting scalar $c = \frac{\sum_i x_i}{\sum_i \mu_i}$. Example applications on selecting the best γ-divergence are given in Section IV-D.
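The KL example of the connecting scalar is easy to verify numerically. In the sketch below (function names are illustrative, and SciPy's bounded scalar minimizer stands in for the analytical step of zeroing the derivative), the non-normalized KL divergence D_I(x||cµ) is minimized over c and compared with the closed form c = Σ_i x_i / Σ_i µ_i. At that optimum the value equals (Σ_i x_i) times the normalized KL divergence, so both objectives share their minimizers over µ up to scale.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def kl_nonnormalized(x, mu):
    """D_I(x||mu), the beta -> 0 member of the beta-divergence family, Eq. (4)."""
    return float(np.sum(x * np.log(x / mu) - x + mu))


def kl_normalized(x, mu):
    """D_KL between the normalized vectors, the gamma -> 0 case of Eq. (8)."""
    xt, mt = x / x.sum(), mu / mu.sum()
    return float(np.sum(xt * np.log(xt / mt)))


def best_connecting_scalar(x, mu):
    """Numerically minimize D_I(x || c * mu) over c > 0."""
    res = minimize_scalar(lambda c: kl_nonnormalized(x, c * mu),
                          bounds=(1e-6, 1e3), method="bounded")
    return res.x
```

The bounds of the search interval are arbitrary and would need to be widened for data whose sums differ by more than a factor of a thousand.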

Similarly, we can also reduce a Rényi divergence to its corresponding α-divergence with the same proof technique (see Appendix C).

Theorem 3: For x ≥ 0 and τ > 0,

$$\arg\min_{\mu\ge 0} D_{\rho\to\tau}(x\|\mu) = \arg\min_{\mu\ge 0}\left[ \min_{c>0} D_{\alpha\to\tau}(x\|c\mu) \right]. \tag{24}$$

IV. EXPERIMENTS

In this section we demonstrate the proposed method on various data types and learning tasks. First we provide the results on synthetic data, whose density is known, to compare the behavior of MTL, MEDAL and the score matching method [29]. Second, we illustrate the advantage of the EDA density over ED. Third, we apply our method on α- and β-divergence selection in Nonnegative Matrix Factorization (NMF) on real-world data including music and stock prices. Fourth, we test MEDAL in selecting non-separable cases (e.g. γ-divergence) for Projective NMF and s-SNE visualization learning tasks across synthetic data, images, and a dolphin social network.

A. Synthetic data

1) β-divergence selection: We use here scalar data generated from the four special cases of Tweedie distributions, namely, Inverse Gaussian, Gamma, Poisson, and Gaussian distributions. We simply fit the best Tweedie, EDA or ED density to the data using either the maximum likelihood method or score matching (SM).

In Fig. 1 (first row), the results of the Maximum Tweedie Likelihood (MTL) are shown. The β value that maximizes the likelihood in Tweedie distribution is consistent with the true parameters, i.e., −2, −1, 0 and 1 respectively for the above distributions. Note that Tweedie distributions are not defined for β ∈ (0, 1), but β-divergence is defined in this region, which will lead to discontinuity in the log-likelihood over β.

The second and third rows in Fig. 1 present results of the exponential divergence density ED given in Eq. (21). The log-likelihood and negative score matching objectives [29] on the same four datasets are shown. The estimates are consistent with the ground truth for Gaussian and Poisson data. However, for Gamma and Inverse Gaussian data, both β estimates deviate from the ground truth. Thus, estimators based on ED do not give as accurate estimates as the MTL method. The ED distribution [29] has the advantage that it is also defined for β ∈ (0, 1). In the above, we have seen that β selection by using ED is accurate when β → 0 or β = 1. However, as explained in Section III-A2, in the other cases ED and Tweedie distributions are not the same because the terms containing the observed variable in ED are not exactly the same as those of the Tweedie distributions.

EDA, the augmented ED density introduced in Section III-A, not only has the advantage of continuity but also gives very accurate estimates for β < 0. The MEDAL log-likelihood curves over β based on EDA are given in Fig. 1 (fourth row). In the β selection of Eq. (20), the φ value that maximizes the likelihood with β fixed is found by a grid search. The likelihood values are the same as those of special Tweedie distributions and there are no abrupt changes


[Figure 1: a 5 × 4 grid of curves plotted against β ∈ [−2, 2]. Columns: a) Inverse Gaussian (IN), b) Gamma (G), c) Poisson (PO), d) Gaussian (N). Best β per panel, from left to right:
Row 1, log-likelihood (Tweedie): −2, −1, −0.08, 1.
Row 2, log-likelihood (ED): −1.32, −0.5, −0.08, 0.94.
Row 3, negative SM objective (ED): 2, −0.63, −0.09, 0.92.
Row 4, log-likelihood (EDA): −2, −1.02, −0.08, 0.94.
Row 5, negative SM objective (EDA): −2, −1, −0.09, 0.92.]

Fig. 1. β selection using (from top to bottom) Tweedie likelihood, ED likelihood, negative SM objective of ED, EDA likelihood, and negative SM objective of EDA. Data were generated using Tweedie distribution with β = −2, −1, 0, 1 (from left to right).

or discontinuities in the likelihood surface. We also estimated β for the EDA density using Score Matching, and curves of the negative SM objective are presented in the bottom row of Fig. 1. They also recover the ground truth accurately.
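The grid search over the dispersion φ with β held fixed can be illustrated on the one special case whose density is elementary, the Gaussian (β = 1), where φ plays the role of the variance. The sketch below is only an illustration under that assumption and is not the EDA likelihood of Eq. (20); the function names `gaussian_loglik` and `grid_search_phi`, the grid, and the synthetic data are all hypothetical.

```python
import numpy as np

def gaussian_loglik(x, mu, phi):
    """Log-likelihood of samples x under N(mu, phi); phi is the dispersion (variance)."""
    n = x.size
    return -0.5 * n * np.log(2.0 * np.pi * phi) - 0.5 * np.sum((x - mu) ** 2) / phi

def grid_search_phi(x, mu, grid):
    """Return the candidate phi on the grid with the highest log-likelihood."""
    scores = np.array([gaussian_loglik(x, mu, phi) for phi in grid])
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # synthetic data, true variance 4
mu = x.mean()

grid = np.logspace(-2, 2, 401)                  # log-spaced candidates for phi
phi_grid = grid_search_phi(x, mu, grid)
phi_closed = np.mean((x - mu) ** 2)             # closed-form ML estimate of the variance

print(phi_grid, phi_closed)
```

Because the Gaussian log-likelihood is unimodal in φ, the grid optimum lands within one grid step of the closed-form estimate; for densities without a closed form the same loop applies with the log-likelihood function swapped out.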

2) α-divergence selection: There is only one known generative model for which the maximum likelihood estimator corresponds to the minimizer of the corresponding α-divergence. It is the Poisson distribution. We thus reused the Poisson-distributed data of the previous experiments with the β-divergence. In Fig. 2a, we present the log-likelihood objective over α obtained with Tweedie distribution (MTL) and the transformation from Section III-B. The ground truth α → 1 is successfully recovered with MTL. However, there are no likelihood estimates for α ∈ (0.5, 1), corresponding to β ∈ (0, 1) for which no Tweedie distributions are defined. Moreover, to our knowledge there are no studies concerning the pdf's of Tweedie distributions with β > 1. For that reason, the likelihood values for α ∈ [0, 0.5) are left blank in the plot.
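The correspondence between the blank α ranges and the β ranges follows from the mapping β = 1/α − 1 quoted in the caption of Fig. 2. A minimal sketch of that bookkeeping is given here; the helper names `alpha_to_beta` and `tweedie_pdf_usable` are hypothetical, and the usability rule simply encodes the two statements in the text (no Tweedie distribution for β ∈ (0, 1), no studied pdf for β > 1).

```python
def alpha_to_beta(alpha):
    """Map alpha to beta through beta = 1/alpha - 1 (undefined at alpha = 0)."""
    if alpha == 0:
        raise ValueError("the transformation is undefined at alpha = 0")
    return 1.0 / alpha - 1.0

def tweedie_pdf_usable(beta):
    """True where the text reports a usable Tweedie pdf: beta <= 0 or beta == 1."""
    return beta <= 0.0 or beta == 1.0

for alpha in (-1.0, 0.25, 0.5, 0.75, 1.0, 2.0):
    beta = alpha_to_beta(alpha)
    print(f"alpha={alpha:5.2f} -> beta={beta:6.3f}, usable={tweedie_pdf_usable(beta)}")
```

Values of α in (0.5, 1) map into β ∈ (0, 1) and values in (0, 0.5) map to β > 1, which is why both ranges are blank in the Tweedie plot.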

It can be seen from Fig. 2b and 2c that the augmentation in the MEDAL method also helps in α selection. Again, both ED and EDA solve most of the discontinuity problem except α = 0. Selection using ED fails to find the ground truth


[Figure 2: four panels for the Poisson (PO) data, plotted against α ∈ [−2, 2]: (a) log-likelihood (Tweedie), best α = 1.01; (b) log-likelihood (ED), best α = 0.25; (c) log-likelihood (EDA), best α = 0.97; (d) negative SM objective (EDA), best α = 0.87.]

Fig. 2. Log-likelihood of (a) Tweedie, (b) ED, and (c) EDA distributions for α-selection. In the Tweedie plot, blanks correspond to β = 1/α − 1 values for which a Tweedie distribution pdf does not exist or cannot be evaluated, i.e., β ∈ (0, 1) ∪ (1, ∞). In (d), negative SM objective function values are plotted for EDA.

which equals 1, which is however successfully found by the MEDAL method. SM on EDA recovers the ground truth as well (Fig. 2d).

B. Divergence selection in NMF

The objective in nonnegative matrix factorization (NMF) is to find a low-rank approximation to the observed data by expressing it as a product of two nonnegative matrices, i.e., $\mathbf{V} \approx \hat{\mathbf{V}} = \mathbf{W}\mathbf{H}$ with $\mathbf{V} \in \mathbb{R}_+^{F \times N}$, $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H} \in \mathbb{R}_+^{K \times N}$. This objective is pursued through the minimization of an information divergence between the data and the approximation, i.e., $D(\mathbf{V}\|\hat{\mathbf{V}})$. The divergence can be any appropriate one for the data/application such as β, α, γ, Rényi, etc. Here, we chose the β and α divergences to illustrate the MEDAL method for realistic data.

The optimization of β-NMF was implemented using the standard multiplicative update rules [23], [38]. Similar multiplicative update rules are also available for α-NMF [23]. Alternatively, the algorithm for β-NMF can be used for α-divergence minimization as well, using the transformation explained in Section III-B.
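For orientation, a minimal sketch of multiplicative updates for β-NMF is given below, written in this paper's convention (β = 1 Euclidean, β → 0 KL, β → −1 Itakura-Saito). It uses the common heuristic form of the updates; the exact rules in [23], [38] may differ in details such as exponents, and the names `beta_divergence` and `beta_nmf`, the initialization, and the small constant `eps` are assumptions made for illustration.

```python
import numpy as np

def beta_divergence(V, V_hat, beta):
    """Sum of element-wise beta-divergences D_beta(V || V_hat)."""
    if beta == 0:        # generalized Kullback-Leibler limit
        return np.sum(V * np.log(V / V_hat) - V + V_hat)
    if beta == -1:       # Itakura-Saito limit
        return np.sum(V / V_hat - np.log(V / V_hat) - 1.0)
    return np.sum(V ** (1 + beta) + beta * V_hat ** (1 + beta)
                  - (1 + beta) * V * V_hat ** beta) / (beta * (1 + beta))

def beta_nmf(V, K, beta, n_iter=100, seed=0, eps=1e-12):
    """Heuristic multiplicative updates for V ~ W @ H under the beta-divergence."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    history = []
    for _ in range(n_iter):
        V_hat = W @ H
        H *= (W.T @ (V * V_hat ** (beta - 1))) / (W.T @ (V_hat ** beta) + eps)
        V_hat = W @ H
        W *= ((V * V_hat ** (beta - 1)) @ H.T) / ((V_hat ** beta) @ H.T + eps)
        history.append(beta_divergence(V, W @ H, beta))
    return W, H, history

# Toy usage on a strictly positive random matrix (hypothetical data).
V_demo = np.random.default_rng(1).random((20, 30)) + 0.1
W_demo, H_demo, hist_demo = beta_nmf(V_demo, K=3, beta=0, n_iter=50)
print(hist_demo[0], hist_demo[-1])
```

In a selection experiment of the kind described here, such a routine would be run once per candidate β, and the returned product `W @ H` would supply the mean values at which the likelihood is evaluated.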

1) A Short Piano Excerpt: We consider the piano data used in [21]. It is an audio sequence recorded in real conditions, consisting of four notes played all together in the first measure and in all possible pairs in the subsequent measures. A power spectrogram with analysis window of size 46 ms was computed, leading to F = 513 frequency bins and N = 676 time frames. These make up the data matrix V, for which a matrix factorization V̂ = WH with low rank K = 6 is sought.

In Fig. 3a and 3b, we show the log-likelihood values of the MEDAL method for β and α, respectively. For each parameter value β and α, the multiplicative algorithm for the respective divergence is run for 100 iterations and likelihoods are evaluated with mean values calculated from the returned matrix factorizations. For each value of β and α, the highest likelihood w.r.t. φ (see Eq. (20)) is found by a grid search.

The found maximum likelihood estimate β = −1 corresponds to Itakura-Saito divergence, which is in harmony with the empirical results presented in [21] and the common belief that IS divergence is most suitable for audio spectrograms. The optimal α value was 0.5, corresponding to Hellinger distance. We can also see that the log likelihood value associated

[Figure 3: three panels for the piano excerpt, each plotted over [−2, 2]: a) log-likelihood (EDA) against β, panel label "best β = −1"; b) log-likelihood (EDA) against α, panel label "best α = −0.2"; c) negative SM objective (EDA) against β, panel label "best β = −1".]

Fig. 3. (a, b) Log likelihood values for β and α for the spectrogram of a short piano excerpt with F = 513, N = 676, K = 6. (c) Negative SM objective for β.

with α = 0.5 is still much less than the one for β = −1. SM also finds β = −1, as can be seen from Fig. 3c.

C. Stock Prices

Next, we repeat the same experiment on a stock price dataset which contains the Dow Jones Industrial Average. There are 30 companies included in the data. They are major American companies from various sectors such as services (e.g., Walmart), consumer goods (e.g., General Motors) and healthcare (e.g., Pfizer). The data was collected from 3rd January 2000 to 27th July 2011, in total 2543 trading dates. We set K = 5 in NMF and masked 50% of the data by following [39]. The stock data curves are displayed in Fig. 4 (top).

The EDA likelihood curve with β ∈ [−2, 2] is shown in Figure 4 (bottom left). We can see that the best divergence selected by MEDAL is β = 0.4. The corresponding best φ = 0.006. These results are in harmony with the findings of Tan and Févotte [39] using the remaining 50% of the data as validation set, where they found that β ∈ [0, 0.5] (mind that our β values equal theirs minus one) performs


[Figure 4: top panel: stock prices (vertical axis, 0 to 150) against days (horizontal axis, up to about 2500); bottom left: log-likelihood (EDA) against β ∈ [−2, 2], best β = 0.4; bottom right: negative SM objective (EDA) against β ∈ [−2, 2], best β = 1.]

Fig. 4. Top: the stock data. Bottom left: the EDA log-likelihood for β ∈ [−2, 2]. Bottom right: negative SM objective function for β ∈ [−2, 2].

well for a large range of φ's. By contrast, our method has the advantage that it needs neither additional criteria nor validation data. In Figure 4 (bottom right), the negative SM objective function is plotted for β ∈ [−2, 2]. With SM, the optimal β is found to be 1.

D. Selecting γ-divergence

In this section we demonstrate that the proposed method can be applied to applications beyond NMF and to non-separable divergence families. To our knowledge, no other existing methods can handle these two cases.

1) Multinomial data: We first exemplify γ-divergence selection for synthetic data drawn from a multinomial distribution. We generated a 1000-dimensional stochastic vector p from the uniform distribution. Next we drew x ∼ Multinomial(n, p) with $n = 10^7$. The MEDAL method is applied to find the best γ-divergence for the approximation of x by p.

Fig. 6 (1st row, left) shows the MEDAL log-likelihood. The peak appears when γ = 0, which indicates that the normalized KL-divergence is the most suitable one among the γ-divergence family. Selection using score matching of EDA gives the best γ also close to zero (Fig. 6 1st row, right). The result is expected, because the maximum likelihood estimator of p in multinomial distribution is equivalent to minimizing the KL-divergence over p. Our finding also justifies the usage of KL-divergence in topic models with the multinomial distribution [40], [7].
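The equivalence between multinomial maximum likelihood and KL minimization can be checked numerically: up to a term that does not depend on the candidate probability vector q, the log-likelihood equals −n·KL(x/n ‖ q). The sketch below reproduces the data generation described above under the assumption that "from the uniform distribution" means i.i.d. uniform entries normalized to sum to one; the helper names `loglik` and `kl` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10 ** 7

p = rng.random(d)
p /= p.sum()                       # 1000-dimensional stochastic vector
x = rng.multinomial(n, p)          # x ~ Multinomial(n, p)
x_tilde = x / n                    # empirical (normalized) distribution

def loglik(q):
    """Multinomial log-likelihood of x under q, without the q-independent coefficient."""
    return np.sum(x * np.log(q))

def kl(a, b):
    """KL divergence between two probability vectors; zero entries of a contribute 0."""
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

q = rng.random(d)
q /= q.sum()                       # an arbitrary competing candidate

lhs = loglik(p) - loglik(q)                       # log-likelihood difference
rhs = -n * (kl(x_tilde, p) - kl(x_tilde, q))      # scaled KL difference
print(lhs, rhs)
```

The two printed numbers agree, so ranking candidates by likelihood is the same as ranking them by KL divergence from the normalized counts.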

2) Projective NMF: Next we apply the MEDAL method to Projective Nonnegative Matrix Factorization (PNMF) [27], [28] based on γ-divergence [13], [19]. Given a nonnegative matrix $\mathbf{V} \in \mathbb{R}_+^{F \times N}$, PNMF seeks a low-rank nonnegative matrix $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ ($K < F$) that minimizes $D_\gamma(\mathbf{V}\|\hat{\mathbf{V}})$, where $\hat{\mathbf{V}} = \mathbf{W}\mathbf{W}^T\mathbf{V}$. PNMF is able to produce a highly orthogonal W and thus finds its applications in part-based feature extraction and clustering analysis, etc. Different from conventional NMF (or linear NMF) where each factorizing matrix only appears once in the approximation, the matrix W occurs twice in V̂. Thus it is a special case of Quadratic Nonnegative Matrix Factorization (QNMF) [41].

We choose PNMF for two reasons: 1) we demonstrate the MEDAL performance on QNMF besides the linear NMF already shown in Section IV-B; 2) PNMF contains only one variable matrix in learning, without the issue of how to interleave the updates of different variable matrices.

We first tested MEDAL on a synthetic dataset. We generated a diagonal blockwise data matrix V of size 50 × 30, where two blocks are of sizes 30 × 20 and 20 × 10. The block entries are uniformly drawn from [0, 10]. We then added uniform noise from [0, 1] to all matrix entries. For each γ, we ran the multiplicative algorithm of PNMF by Yang and Oja [28], [42] to obtain W and V̂. The MEDAL method was then applied to select the best γ. The resulting approximated log-likelihood for γ ∈ [−2, 2] is shown in Fig. 6 (2nd row). We can see MEDAL and score matching of EDA give similar results, where the best γ appears at −0.76 and −0.8, respectively. Both resulting W's give perfect clustering accuracy of data rows.
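The synthetic matrix described above is easy to reproduce. The following sketch assumes the two blocks sit in the top-left and bottom-right corners (the text only says "diagonal blockwise") and, purely for illustration, forms the PNMF approximation with an idealized cluster-indicator basis rather than one learned by the multiplicative algorithm of [28], [42].

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 x 30 block-diagonal data: a 30 x 20 block and a 20 x 10 block, entries in [0, 10].
V = np.zeros((50, 30))
V[:30, :20] = rng.uniform(0.0, 10.0, size=(30, 20))
V[30:, 20:] = rng.uniform(0.0, 10.0, size=(20, 10))
V += rng.uniform(0.0, 1.0, size=V.shape)      # uniform noise on all entries

# Idealized basis (an assumption, not a learned W): one normalized indicator per row cluster.
W = np.zeros((50, 2))
W[:30, 0] = 1.0 / np.sqrt(30)
W[30:, 1] = 1.0 / np.sqrt(20)

V_hat = W @ W.T @ V                 # PNMF approximation; W appears twice
row_labels = W.argmax(axis=1)       # cluster assignment of the data rows

print(np.linalg.norm(V - V_hat) / np.linalg.norm(V))
```

With orthonormal columns, `W @ W.T` is a projection, so each row of `V_hat` is the mean of the rows in its cluster; reading the row clusters off `W` is what "clustering accuracy of data rows" refers to.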

We also tested MEDAL on the swimmer dataset [43], which is popularly used in the NMF field. Some example images from this dataset are shown in Fig. 5 (top). We vectorized each image in the dataset as a column and concatenated the columns into a 1024 × 256 data matrix V. This matrix is then fed to PNMF and MEDAL as in the case for the synthetic dataset. Here we empirically set the rank to K = 17 according to Tan and Févotte [44] and Yang et al. [45]. The matrix W was initialized by PNMF based on Euclidean distance to avoid poor local minima. The resulting approximated log-likelihood for γ ∈ [−1, 3] is shown in Figure 6 (3rd row, left). We can see a peak appearing around 1.7. Zooming in the region near the peak shows the best γ = 1.69. The score matching objective over γ values (Fig. 6 3rd row, right) shows a similar peak and the best γ very close to the one given by MEDAL. Both methods result in excellent and nearly identical basis matrix (W) of the data, where the swimmer body as well as four limbs at four angles are clearly identified (see Fig. 5 bottom row).

3) Symmetric Stochastic Neighbor Embedding: Finally, we show an application beyond NMF, where MEDAL is used to find the best γ-divergence for the visualization using Symmetric Stochastic Neighbor Embedding (s-SNE) [5], [6].

Suppose there are $n$ multivariate data samples $\{\mathbf{x}_i\}_{i=1}^{n}$ with $\mathbf{x}_i \in \mathbb{R}^D$ and their pairwise similarities are represented by an $n \times n$ symmetric nonnegative matrix $\mathbf{P}$ where $P_{ii} = 0$ and $\sum_{ij} P_{ij} = 1$. The s-SNE visualization seeks a low-dimensional embedding $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n]^T \in \mathbb{R}^{n \times d}$ such that pairwise similarities in the embedding approximate those in the original space. Generally $d = 2$ or $d = 3$ for easy visualization. Denote $q_{ij} = q(\|\mathbf{y}_i - \mathbf{y}_j\|^2)$ with a certain kernel function $q$, for example $q_{ij} = (1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}$. The pairwise similarities in the embedding are then given by $Q_{ij} = q_{ij} / \sum_{kl: k \neq l} q_{kl}$. The s-SNE target is that $\mathbf{Q}$ is as close to $\mathbf{P}$ as possible. To measure the dissimilarity between $\mathbf{P}$ and $\mathbf{Q}$, the conventional s-SNE uses the Kullback-Leibler divergence $D_{\mathrm{KL}}(\mathbf{P}\|\mathbf{Q})$. Here we generalize s-SNE to the


Fig. 5. Swimmer dataset: (top) example images; (bottom) the best PNMF basis (W) selected by using (bottom left) MEDAL and (bottom right) score matching of EDA. The visualization reshapes each column of W to an image and displays it by the Matlab function imagesc.

whole family of γ-divergences as dissimilarity measures and select the best divergence by our MEDAL method.
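The embedding similarities Q and the conventional s-SNE objective defined above can be computed in a few lines. This is a minimal sketch using the example kernel $q_{ij} = (1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}$; the function names and the random toy inputs are hypothetical, and no optimization of the embedding is performed.

```python
import numpy as np

def embedding_similarities(Y):
    """Q_ij = q_ij / sum_{k != l} q_kl with q_ij = 1 / (1 + ||y_i - y_j||^2)."""
    sq_norms = np.sum(Y ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y @ Y.T)
    sq_dist = np.maximum(sq_dist, 0.0)        # guard against round-off below zero
    q = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(q, 0.0)                  # exclude the k == l terms
    return q / q.sum()

def kl_divergence(P, Q):
    """D_KL(P || Q) over the entries where P is positive."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.default_rng(0)
n, d = 10, 2

A = rng.random((n, n))
P = A + A.T                                   # toy symmetric similarities
np.fill_diagonal(P, 0.0)
P /= P.sum()                                  # P_ii = 0 and sum_ij P_ij = 1

Y = rng.normal(size=(n, d))                   # a random 2-D embedding
Q = embedding_similarities(Y)
print(kl_divergence(P, Q))
```

Generalizing to the γ-divergence family amounts to replacing `kl_divergence` with another dissimilarity between `P` and `Q`.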

We have used a real-world dolphins dataset². It is the adjacency matrix of the undirected social network between 62 dolphins. We smoothed the matrix by PageRank random walk in order to find its macro structures. The smoothed matrix was then fed to s-SNE based on γ-divergence, with γ ∈ [−2, 2]. The EDA log-likelihood is shown in Fig. 6 (4th row, left). By the MEDAL principle the best divergence is γ = −0.6 for s-SNE and the dolphins dataset. Score matching of EDA also indicates the best γ is smaller than 0. The resulting visualizations created by s-SNE with the respective best γ-divergence are shown in Fig. 7, where the node layouts by both methods are very similar. In both visualizations we can clearly see two dolphin communities.

V. CONCLUSIONS

We have presented a new method called MEDAL to automatically select the best information divergence in a parametric family. Our selection method is built upon a statistical learning approach, where the divergence is learned as the result of standard density parameter estimation. Maximizing the likelihood of the Tweedie distribution is a straightforward way for selecting β-divergence, which however has some shortcomings. We have proposed a novel distribution, the Exponential Divergence with Augmentation (EDA), which overcomes these shortcomings and thus can give a more robust

² Available at http://www-personal.umich.edu/~mejn/netdata/

[Figure 6: a 4 × 2 grid of curves plotted against γ; left column: log-likelihood, right column: negative SM objective. Best γ per panel (left, right):
Row 1, multinomial data, γ ∈ [−2, 2]: 0.00, 0.08.
Row 2, PNMF on synthetic data, γ ∈ [−2, 2]: −0.76, −0.80.
Row 3, PNMF on swimmer data, γ ∈ [−1, 3], with zoom-in on [1.55, 1.75]: 1.69, 1.63.
Row 4, γ-NE (dolphins data), γ ∈ [−2, 2]: −0.60, −1.04.]

Fig. 6. Selecting the best γ-divergence: (1st row) for multinomial data, (2nd row) in PNMF for synthetic data, (3rd row) in PNMF for the swimmer dataset, and (4th row) in s-SNE for the dolphins dataset; (left column) using MEDAL and (right column) using score matching of EDA. The red star highlights the peak and the small subfigure in each plot shows the zoom-in around the peak. The sub-figures in the 3rd row zoom in on the area near the peaks.

selection for the parameter over a wider range. The new method has been extended to α-divergence selection by a nonlinear transformation. Furthermore, we have provided new results that connect the γ- and β-divergences, which enable us to extend the selection method to non-separable cases. The extension also holds for Rényi divergence with similar relationship to α-divergence. As a result, our method can be applied to most commonly used information divergences in learning.

We have performed extensive experiments to show the accuracy and applicability of the new method. Comparison on synthetic data has illustrated that our method is superior to Maximum Tweedie Likelihood, i.e., it finds the ground truth as accurately as MTL, while being defined on all values of β and being less prone to numerical problems (no abrupt changes in the likelihood). We also showed that a previous estimation


Fig. 7. Visualization of the dolphins social network with the best γ using (top) MEDAL and (bottom) score matching of EDA. Dolphins and their social connections are shown by circles and lines, respectively. The background illustrates the node density by the Parzen method [46].

approach by Score Matching on Exponential Divergence distribution (ED, i.e., EDA before augmentation) is not accurate, especially for β < 0. In the application to NMF, we have provided experimental results on various kinds of data including audio and stock prices. In the non-separable cases, we have demonstrated selecting γ-divergence for synthetic data, Projective NMF, and visualization by s-SNE. In those cases where the correct parameter value is known in advance for the synthetic data, or there is a wide consensus in the application community on the correct parameter value for real-world data, the MEDAL method gives expected results. These results show that the presented method has not only broad applications but also accurate selection performance. In the case of new kinds of data, for which the appropriate information divergence is not known, the MEDAL method provides a disciplined and rigorous way to compute the optimal parameter values.

In this paper we have focused on information divergence for vectorial data. There exist other divergences for higher-order tensors, for example, LogDet divergence and von Neumann divergence (see e.g. [47]) that are defined over eigenvalues of matrices. Selection among these divergences remains an open problem.

Here we mainly consider a positive data matrix and selecting the divergence parameter in (−∞, +∞). Tweedie distribution has no support for zero entries when β < 0 and thus gives zero likelihood of the whole matrix/tensor by independence. In future work, extension of EDA to accommodate nonnegative data matrices could be developed for β ≥ 0.

MEDAL is a two-phase method: the β selection is based on the optimization result of µ. Ideally, both variables should be selected by optimizing the same objective. For the maximum log-likelihood estimator, this requires that the negative log-likelihood equals the β-divergence, which is however infeasible for all β due to intractability of integrals. Non-ML estimators could be used to attack this open problem.

The EDA distribution family includes the exact Gaussian, Gamma, and Inverse Gaussian distributions, and approximated Poisson distribution. In the approximation we used the first-order Stirling expansion. One could apply higher-order expansions to improve the approximation accuracy. This could be implemented by further augmentation with higher-order terms around β → 0.

VI. ACKNOWLEDGMENT

This work was financially supported by the Academy of Finland (Finnish Center of Excellence in Computational Inference Research COIN, grant no 251170; Zhirong Yang additionally by decision number 140398).

APPENDIX A
INFINITE SERIES EXPANSION IN TWEEDIE DISTRIBUTION

In the series expansion, an EDM random variable is represented as a sum of $G$ independent Gamma random variables, $x = \sum_{g=1}^{G} y_g$, where $G$ is Poisson distributed with parameter $\lambda = \frac{\mu^{2-p}}{\phi(2-p)}$; and the shape and scale parameters of the Gamma distribution are $-a$ and $b$, with $a = \frac{2-p}{1-p}$ and $b = \phi(p-1)\mu^{p-1}$.

The pdf of the Tweedie distribution is obtained analytically at $x = 0$ as $e^{-\frac{\mu^{2-p}}{\phi(2-p)}}$. For $x > 0$ the function $f(x, \phi, p) = \frac{1}{x}\sum_{j=1}^{\infty} W_j(x, \phi, p)$, where for $1 < p < 2$

$$W_j = \frac{x^{-ja}\,(p-1)^{ja}}{\phi^{j(1-a)}\,(2-p)^{j}\, j!\,\Gamma(-ja)} \tag{25}$$

and for $p > 2$

$$W_j = \frac{1}{\pi}\,\frac{\Gamma(1+ja)\,\phi^{j(a-1)}\,(p-1)^{ja}}{\Gamma(1+j)\,(p-1)^{j}\,x^{ja}}\,(-1)^{j}\sin(-\pi j a). \tag{26}$$

This infinite summation needs approximation in practice. Dunn and Smyth [35] described an approach to select a subset of these infinite terms to accurately approximate f(x, φ, p). In their approach, Stirling's approximation of the Gamma functions is used to find the index j which gives the highest value of the function. Then, in order to find the most significant region, the indices are progressed in both directions until negligible terms are reached.
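The truncation strategy can be sketched for the case $1 < p < 2$ using the terms $W_j$ of Eq. (25) evaluated in log space. This is a simplified illustration, not the implementation of [35]: the peak index is located by a plain upward scan instead of Stirling's approximation (valid here because $\ln W_j$ is concave in $j$), and the names `log_w`, `log_series_sum` and the tolerance `tol` are hypothetical.

```python
import math

def log_w(j, x, phi, p):
    """Logarithm of W_j from Eq. (25), valid for 1 < p < 2 (so a < 0 and -j*a > 0)."""
    a = (2.0 - p) / (1.0 - p)
    return (-j * a * math.log(x) + j * a * math.log(p - 1.0)
            - j * (1.0 - a) * math.log(phi) - j * math.log(2.0 - p)
            - math.lgamma(j + 1.0) - math.lgamma(-j * a))

def log_series_sum(x, phi, p, tol=1e-16):
    """log(sum_j W_j): start at the largest term and extend in both directions
    until the terms fall below tol relative to the peak."""
    j_max = 1
    while log_w(j_max + 1, x, phi, p) > log_w(j_max, x, phi, p):
        j_max += 1                      # upward scan to the peak index
    peak = log_w(j_max, x, phi, p)
    total = 1.0                         # the peak term, scaled by exp(-peak)
    j = j_max + 1
    while True:                         # progress to the right of the peak
        term = math.exp(log_w(j, x, phi, p) - peak)
        if term < tol:
            break
        total += term
        j += 1
    j = j_max - 1
    while j >= 1:                       # progress to the left of the peak
        term = math.exp(log_w(j, x, phi, p) - peak)
        if term < tol:
            break
        total += term
        j -= 1
    return peak + math.log(total)

x, phi, p = 2.0, 0.5, 1.5
log_sum = log_series_sum(x, phi, p)
log_f = log_sum - math.log(x)           # log of (1/x) * sum_j W_j
print(log_sum, log_f)
```

Working relative to the peak term keeps the exponentials in a safe numerical range even when individual $W_j$ would overflow or underflow.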

APPENDIX B
GAUSS-LAGUERRE QUADRATURES

This method (e.g. [48]) can evaluate definite integrals of the form

$$\int_0^\infty e^{-z} f(z)\,dz \approx \sum_{i=1}^{n} f(z_i)\,w_i, \tag{27}$$

where $z_i$ is the $i$th root of the $n$-th order Laguerre polynomial $L_n(z)$, and the weights are given by

$$w_i = \frac{z_i}{(n+1)^2\,L_{n+1}^2(z_i)}. \tag{28}$$

The recursive definition of $L_n(z)$ is given by

$$L_{n+1}(z) = \frac{1}{n+1}\left[(2n+1-z)\,L_n(z) - n\,L_{n-1}(z)\right], \tag{29}$$


with $L_0(z) = 1$ and $L_1(z) = 1 - z$. In our experiments, we used the Matlab implementation by Winckel³ with n = 5000.
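As an illustration of the quadrature rule, the sketch below uses NumPy's `numpy.polynomial.laguerre.laggauss` in place of the Matlab routine mentioned above (a substitution made here for convenience, with a much smaller n). It also compares the returned weights against the closed form $z_i / ((n+1)^2 L_{n+1}^2(z_i))$ for a small order.

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss, lagval

# Nodes (roots of L_n) and weights for a 20-point rule.
n = 20
z, w = laggauss(n)

# Two integrals with known values:
#   int_0^inf e^{-z} z^2 dz    = 2
#   int_0^inf e^{-z} cos(z) dz = 1/2
approx_poly = np.sum(w * z ** 2)
approx_cos = np.sum(w * np.cos(z))
print(approx_poly, approx_cos)

# Check the weight formula for a small order, where L_{n+1} is cheap to evaluate accurately.
n_small = 5
z5, w5 = laggauss(n_small)
coef = np.zeros(n_small + 2)
coef[n_small + 1] = 1.0                       # series coefficients selecting L_{n+1}
w5_formula = z5 / ((n_small + 1) ** 2 * lagval(z5, coef) ** 2)
print(np.max(np.abs(w5 - w5_formula)))
```

The rule is exact for polynomial integrands up to degree 2n − 1, which is why the first integral is reproduced to rounding error, while the cosine integral converges quickly as n grows.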

APPENDIX C
PROOFS OF THEOREMS 2 AND 3

Lemma 4: $\arg\min_z a f(z) = \arg\min_z a \ln f(z)$ for $a \in \mathbb{R}$ and $f(z) > 0$.

The proof of the lemma is simply by the monotonicity of ln.

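Next we prove Theorem 2. For $\beta \in \mathbb{R}\setminus\{-1, 0\}$, zeroing $\frac{\partial D_\beta(\mathbf{x}\|c\boldsymbol{\mu})}{\partial c}$ gives

$$c^* = \frac{\sum_i x_i \mu_i^{\beta}}{\sum_i \mu_i^{1+\beta}}. \tag{30}$$

Putting it back to $\min_{\boldsymbol{\mu}} \min_c D_\beta(\mathbf{x}\|c\boldsymbol{\mu})$, we obtain:

$$\begin{aligned}
&\min_{\boldsymbol{\mu}} \min_c D_\beta(\mathbf{x}\|c\boldsymbol{\mu}) \\
&= \min_{\boldsymbol{\mu}} \frac{1}{\beta(1+\beta)} \left[ \sum_i x_i^{1+\beta} + \beta \sum_i \left( \frac{\sum_j x_j \mu_j^{\beta}}{\sum_j \mu_j^{1+\beta}}\, \mu_i \right)^{1+\beta} - (1+\beta) \sum_i x_i \left( \frac{\sum_j x_j \mu_j^{\beta}}{\sum_j \mu_j^{1+\beta}}\, \mu_i \right)^{\beta} \right] \\
&= \min_{\boldsymbol{\mu}} \frac{1}{\beta(1+\beta)} \left[ \sum_i x_i^{1+\beta} - \frac{\left( \sum_i x_i \mu_i^{\beta} \right)^{1+\beta}}{\left( \sum_j \mu_j^{1+\beta} \right)^{\beta}} \right].
\end{aligned}$$

Dropping the constant, and by Lemma 4, the above is equivalent to minimizing

$$\frac{1}{\beta(1+\beta)} \left[ \beta \ln \left( \sum_j \mu_j^{1+\beta} \right) - (1+\beta) \ln \left( \sum_i x_i \mu_i^{\beta} \right) \right].$$

Adding a constant $\frac{1}{\beta(1+\beta)} \ln \left( \sum_i x_i^{1+\beta} \right)$, the objective becomes minimizing γ-divergence (replacing β with γ; see Eq. (7)).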
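The steps of this proof can be checked numerically. The sketch below evaluates the β-divergence at the optimal scale $c^*$ of Eq. (30), compares it with the closed form obtained after substitution, and verifies that the resulting value is a monotone function of the γ-divergence. The γ-divergence is coded from the expression assembled in the proof (Eq. (7) itself lies outside this section), and the function names and random test vectors are hypothetical.

```python
import numpy as np

def beta_div(x, mu, beta):
    """D_beta(x || mu) for beta not in {-1, 0}."""
    return np.sum(x ** (1 + beta) + beta * mu ** (1 + beta)
                  - (1 + beta) * x * mu ** beta) / (beta * (1 + beta))

def gamma_div(x, mu, gamma):
    """gamma-divergence assembled from the log terms in the proof of Theorem 2."""
    return (np.log(np.sum(x ** (1 + gamma)))
            + gamma * np.log(np.sum(mu ** (1 + gamma)))
            - (1 + gamma) * np.log(np.sum(x * mu ** gamma))) / (gamma * (1 + gamma))

rng = np.random.default_rng(0)
x = rng.random(8) + 0.1
mu = rng.random(8) + 0.1
beta = 0.5

A = np.sum(x * mu ** beta)
B = np.sum(mu ** (1 + beta))
c_star = A / B                                   # Eq. (30)

at_optimum = beta_div(x, c_star * mu, beta)
closed = (np.sum(x ** (1 + beta)) - A ** (1 + beta) / B ** beta) / (beta * (1 + beta))

# Same quantity expressed through the gamma-divergence (increasing in gamma_div).
k = beta * (1 + beta)
via_gamma = np.sum(x ** (1 + beta)) * (1.0 - np.exp(-k * gamma_div(x, mu, beta))) / k

print(at_optimum, closed, via_gamma)
```

All three printed values coincide, and `gamma_div` is unchanged when `mu` is rescaled, which is the scale invariance that makes the γ-divergence non-separable.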

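We can apply a similar technique to prove Theorem 3. For $\alpha \in \mathbb{R}\setminus\{0, 1\}$, zeroing $\frac{\partial D_\alpha(\mathbf{x}\|c\boldsymbol{\mu})}{\partial c}$ gives

$$c^* = \left( \frac{\sum_i x_i^{\alpha} \mu_i^{1-\alpha}}{\sum_i \mu_i} \right)^{1/\alpha}. \tag{31}$$

Putting it back, we obtain

\begin{align}
& D_\alpha(\mathbf{x}\|c^*\boldsymbol{\mu}) \tag{32} \\
&= \frac{1}{\alpha(1-\alpha)} \sum_i \left\{ \alpha x_i + (1-\alpha) \left( \frac{\sum_j x_j^{\alpha} \mu_j^{1-\alpha}}{\sum_j \mu_j} \right)^{1/\alpha} \mu_i \right. \tag{33} \\
&\qquad \left. -\, x_i^{\alpha} \left[ \left( \frac{\sum_j x_j^{\alpha} \mu_j^{1-\alpha}}{\sum_j \mu_j} \right)^{1/\alpha} \mu_i \right]^{1-\alpha} \right\} \tag{34} \\
&= \frac{1}{\alpha-1} \left[ \sum_i x_i^{\alpha} \left( \frac{\mu_i}{\sum_j \mu_j} \right)^{1-\alpha} \right]^{1/\alpha} + \frac{\sum_i x_i}{1-\alpha}. \tag{35}
\end{align}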

³ Available at http://www.mathworks.se/matlabcentral/fileexchange/

Dropping the constant $\frac{\sum_i x_i}{1-\alpha}$, and by Lemma 4, minimizing the above is equivalent to minimization of (for $\alpha > 0$)

$$\frac{1}{\alpha-1} \ln \sum_i x_i^{\alpha} \left( \frac{\mu_i}{\sum_j \mu_j} \right)^{1-\alpha}. \tag{36}$$

Adding a constant $\frac{\alpha}{1-\alpha} \ln \sum_i x_i$ to the above, the objective becomes minimizing Rényi-divergence (replacing α with ρ; see Eq. (9)).
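The corresponding check for Theorem 3 is sketched below: the α-divergence at the optimal scale of Eq. (31) is compared with Eq. (35) and then rewritten through the Rényi divergence of the normalized vectors. The Rényi form is coded from the expression obtained at the end of the proof (Eq. (9) itself lies outside this section); function names and test data are hypothetical.

```python
import numpy as np

def alpha_div(x, mu, alpha):
    """D_alpha(x || mu) for alpha not in {0, 1}."""
    return np.sum(alpha * x + (1 - alpha) * mu
                  - x ** alpha * mu ** (1 - alpha)) / (alpha * (1 - alpha))

def renyi_div(x, mu, rho):
    """Renyi divergence between the normalized versions of x and mu."""
    xt, mt = x / x.sum(), mu / mu.sum()
    return np.log(np.sum(xt ** rho * mt ** (1 - rho))) / (rho - 1)

rng = np.random.default_rng(0)
x = rng.random(8) + 0.1
mu = rng.random(8) + 0.1

results = []
for alpha in (0.5, 2.0):
    S = np.sum(x ** alpha * mu ** (1 - alpha))
    c_star = (S / mu.sum()) ** (1 / alpha)                      # Eq. (31)
    at_optimum = alpha_div(x, c_star * mu, alpha)
    eq35 = (np.sum(x ** alpha * (mu / mu.sum()) ** (1 - alpha)) ** (1 / alpha)
            / (alpha - 1) + x.sum() / (1 - alpha))              # Eq. (35)
    via_renyi = x.sum() / (1 - alpha) * (
        1.0 - np.exp((alpha - 1) * renyi_div(x, mu, alpha) / alpha))
    results.append((alpha, c_star, at_optimum, eq35, via_renyi))
    print(alpha, at_optimum, eq35, via_renyi)
```

For α > 0 the last expression is increasing in the Rényi divergence, so minimizing over μ after optimal scaling selects the same μ as minimizing the Rényi divergence directly.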

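The proofs for the special cases are similar, where the main steps are given below.

• $\beta = \gamma \to 0$ (or $\alpha = \rho \to 1$): zeroing $\frac{\partial D_{\beta\to 0}(\mathbf{x}\|c\boldsymbol{\mu})}{\partial c}$ gives $c^* = \frac{\sum_i x_i}{\sum_i \mu_i}$. Putting it back, we obtain $D_{\beta\to 0}(\mathbf{x}\|c^*\boldsymbol{\mu}) = \left(\sum_i x_i\right) D_{\gamma\to 0}(\mathbf{x}\|\boldsymbol{\mu})$.

• $\beta = \gamma \to -1$: zeroing $\frac{\partial D_{\beta\to -1}(\mathbf{x}\|c\boldsymbol{\mu})}{\partial c}$ gives $c^* = \frac{1}{M}\sum_i \frac{x_i}{\mu_i}$, where $M$ is the length of $\mathbf{x}$. Putting it back, we obtain $D_{\beta\to -1}(\mathbf{x}\|c^*\boldsymbol{\mu}) = M\, D_{\gamma\to -1}(\mathbf{x}\|\boldsymbol{\mu})$.

• $\alpha = \rho \to 0$: zeroing $\frac{\partial D_{\alpha\to 0}(\mathbf{x}\|c\boldsymbol{\mu})}{\partial c}$ gives

$$c^* = \exp\left( -\frac{\sum_i \mu_i \ln \frac{\mu_i}{x_i}}{\sum_i \mu_i} \right).$$

Putting it back, we obtain

$$D_{\alpha\to 0}(\mathbf{x}\|c^*\boldsymbol{\mu}) = -\exp\left( -\sum_i \tilde{\mu}_i \ln \frac{\tilde{\mu}_i}{x_i} \right) + \sum_i x_i,$$

where $\tilde{\mu}_i = \mu_i / \sum_j \mu_j$. Dropping the constant $\sum_i x_i$, minimizing $D_{\alpha\to 0}(\mathbf{x}\|c^*\boldsymbol{\mu})$ is equivalent to minimization of $\sum_i \tilde{\mu}_i \ln \frac{\tilde{\mu}_i}{x_i}$. Adding the constant $\ln \sum_j x_j$ to the latter, the objective becomes identical to $D_{\rho\to 0}(\mathbf{x}\|\boldsymbol{\mu})$, i.e. $D_{\mathrm{KL}}(\boldsymbol{\mu}\|\mathbf{x})$.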
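Two of the limiting cases can be verified numerically in the same way. The sketch below assumes the usual limits of the families, namely that $D_{\beta\to 0}(\mathbf{x}\|\boldsymbol{\mu})$ is the generalized KL divergence of x from μ, that $D_{\alpha\to 0}(\mathbf{x}\|\boldsymbol{\mu})$ is the generalized KL divergence with the arguments swapped, and that $D_{\gamma\to 0}$ is the KL divergence between the normalized vectors; the helper names are hypothetical.

```python
import numpy as np

def gen_kl(a, b):
    """Generalized (unnormalized) KL divergence: sum a ln(a/b) - a + b."""
    return np.sum(a * np.log(a / b) - a + b)

def normalized_kl(a, b):
    """KL divergence between the normalized versions of a and b."""
    at, bt = a / a.sum(), b / b.sum()
    return np.sum(at * np.log(at / bt))

rng = np.random.default_rng(0)
x = rng.random(8) + 0.1
mu = rng.random(8) + 0.1
mu_t = mu / mu.sum()

# Case beta = gamma -> 0: c* = sum(x) / sum(mu).
c1 = x.sum() / mu.sum()
lhs1 = gen_kl(x, c1 * mu)
rhs1 = x.sum() * normalized_kl(x, mu)

# Case alpha = rho -> 0: c* = exp(-sum(mu ln(mu/x)) / sum(mu)).
c0 = np.exp(-np.sum(mu * np.log(mu / x)) / mu.sum())
lhs0 = gen_kl(c0 * mu, x)
rhs0 = -np.exp(-np.sum(mu_t * np.log(mu_t / x))) + x.sum()

print(lhs1, rhs1, lhs0, rhs0)
```

In both cases the value at the optimal scale matches the closed form stated in the corresponding bullet.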

REFERENCES

[1] R. Kompass, “A generalized divergence measure for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 3, pp. 780–791, 2006.

[2] I. S. Dhillon and S. Sra, “Generalized nonnegative matrix approximations with Bregman divergences,” in Advances in Neural Information Processing Systems, vol. 18, 2006, pp. 283–290.

[3] A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi, “Non-negative matrix factorization with α-divergence,” Pattern Recognition Letters, vol. 29, pp. 1433–1440, 2008.

[4] Z. Yang and E. Oja, “Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1878–1891, 2011.

[5] G. Hinton and S. Roweis, “Stochastic neighbor embedding,” in Advances in Neural Information Processing Systems, 2002, pp. 833–840.

[6] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[7] D. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2001.

[8] I. Sato and H. Nakagawa, “Rethinking collapsed variational bayes inference for lda,” in International Conference on Machine Learning (ICML), 2012.

[9] T. Minka, “Divergence measures and message passing,” Microsoft Research, Tech. Rep., 2005.

[10] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations,” The Annals of Mathematical Statistics, vol. 23, pp. 493–507, 1952.

[11] S. Amari, Differential-Geometrical Methods in Statistics. Springer Verlag, 1985.

[12] A. Basu, I. R. Harris, N. Hjort, and M. Jones, “Robust and efficient estimation by minimising a density power divergence,” Biometrika, vol. 85, pp. 549–559, 1998.


[13] H. Fujisawa and S. Eguchi, “Robust parameter estimation with a small bias against heavy contamination,” Journal of Multivariate Analysis, vol. 99, pp. 2053–2081, 2008.

[14] A. Cichocki and S.-i. Amari, “Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities,” Entropy, vol. 12, no. 6, pp. 1532–1568, 2010.

[15] A. Cichocki, S. Cruces, and S.-I. Amari, “Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization,” Entropy, vol. 13, pp. 134–170, 2011.

[16] I. Csiszár, “Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der ergodizitat von markoffschen ketten,” Publications of the Mathematical Institute of Hungarian Academy of Sciences Series A, vol. 8, pp. 85–108, 1963.

[17] T. Morimoto, “Markov processes and the h-theorem,” Journal of the Physical Society of Japan, vol. 18, no. 3, pp. 328–331, 1963.

[18] L. M. Bregman, “The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming,” USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200–217, 1967.

[19] Z. Yang, H. Zhang, Z. Yuan, and E. Oja, “Kullback-Leibler divergence for nonnegative matrix factorization,” in Proceedings of 21st International Conference on Artificial Neural Networks, 2011, pp. 14–17.

[20] Z. Yang and E. Oja, “Projective nonnegative matrix factorization with α-divergence,” in Proceedings of 19th International Conference on Artificial Neural Networks, 2009, pp. 20–29.

[21] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[22] B. Jørgensen, “Exponential dispersion models,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 49, no. 2, pp. 127–162, 1987.

[23] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorization. John Wiley and Sons, 2009.

[24] Y. K. Yilmaz and A. T. Cemgil, “Alpha/beta divergences and Tweedie models,” CoRR, vol. abs/1209.4280, 2012.

[25] A. Hyvärinen, “Estimation of non-normalized statistical models using score matching,” Journal of Machine Learning Research, vol. 6, pp. 695–709, 2005.

[26] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.

[27] Z. Yuan and E. Oja, “Projective nonnegative matrix factorization for image compression and feature extraction,” in Proceedings of 14th Scandinavian Conference on Image Analysis, 2005, pp. 333–342.

[28] Z. Yang and E. Oja, “Linear and nonlinear projective nonnegative matrix factorization,” IEEE Transactions on Neural Networks, vol. 21, no. 5, pp. 734–749, 2010.

[29] Z. Lu, Z. Yang, and E. Oja, “Selecting β-divergence for nonnegative matrix factorization by score matching,” in Proceedings of the 22nd International Conference on Artificial Neural Networks (ICANN 2012), 2012, pp. 419–426.

[30] S. Eguchi and Y. Kano, “Robustifying maximum likelihood estimation,” Institute of Statistical Mathematics, Tokyo, Tech. Rep., 2001.

[31] M. Minami and S. Eguchi, “Robust blind source separation by beta divergence,” Neural Computation, vol. 14, pp. 1859–1886, 2002.

[32] A. Rényi, “On measures of information and entropy,” in Proceedings of 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960, pp. 547–561.

[33] M. Mollah, S. Eguchi, and M. Minami, “Robust prewhitening for ica by minimizing beta-divergence and its application to fastica,” Neural Processing Letters, vol. 25, pp. 91–110, 2007.

[34] H. Choi, S. Choi, A. Katake, and Y. Choe, “Learning alpha-integration with partially-labeled data,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 14–19.

[35] P. K. Dunn and G. K. Smyth, “Series evaluation of Tweedie exponential dispersion model densities,” Statistics and Computing, vol. 15, no. 4, pp. 267–280, 2005.

[36] P. K. Dunn and G. K. Smyth, “Tweedie family densities: methods of evaluation,” in Proceedings of the 16th International Workshop on Statistical Modelling, 2001.

[37] A. Hyvärinen, “Some extensions of score matching,” Comput. Stat. Data Anal., vol. 51, no. 5, pp. 2499–2512, 2007.

[38] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,” Neural Computation, vol. 23, no. 9, 2011.

[39] V. Y. F. Tan and C. Févotte, “Automatic relevance determination in nonnegative matrix factorization with the β-divergence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, accepted, to appear.

[40] T. Hofmann, “Probabilistic latent semantic indexing,” in International Conference on Research and Development in Information Retrieval (SIGIR), 1999, pp. 50–57.

[41] Z. Yang and E. Oja, “Quadratic nonnegative matrix factorization,” Pattern Recognition, vol. 45, no. 4, pp. 1500–1510, 2012.

[42] Z. Yang and E. Oja, “Unified development of multiplicative algorithms for linear and quadratic nonnegative matrix factorization,” IEEE Transactions on Neural Networks, vol. 22, no. 12, pp. 1878–1891, 2011.

[43] D. Donoho and V. Stodden, “When does non-negative matrix factorization give a correct decomposition into parts?” in Advances in Neural Information Processing Systems 16, 2003, pp. 1141–1148.

[44] V. Y. F. Tan and C. Févotte, “Automatic relevance determination in nonnegative matrix factorization,” in Proceedings of 2009 Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS’09), 2009.

[45] Z. Yang, Z. Zhu, and E. Oja, “Automatic rank determination in projective nonnegative matrix factorization,” in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (LVA2010), 2010, pp. 514–521.

[46] E. Parzen, “On Estimation of a Probability Density Function and Mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.

[47] B. Kulis, M. A. Sustik, and I. S. Dhillon, “Low-rank kernel learning with Bregman matrix divergences,” Journal of Machine Learning Research, vol. 10, pp. 341–376, 2009.

[48] M. Abramowitz and I. A. Stegun, Eds., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th ed. New York: Dover, 1972, page 890.

