IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 21, NO. 10, OCTOBER 2010

Probabilistic Self-Organizing Maps for Continuous Data

Ezequiel Lopez-Rubio

Abstract—The original self-organizing feature map did not define any probability distribution on the input space. However, the advantages of introducing probabilistic methodologies into self-organizing map models were soon evident. This has led to a wide range of proposals which reflect the current emergence of probabilistic approaches to computational intelligence. The underlying estimation theories behind them derive from two main lines of thought: the expectation maximization methodology and stochastic approximation methods. Here, we present a comprehensive view of the state of the art, with a unifying perspective of the involved theoretical frameworks. In particular, we examine the most commonly used continuous probability distributions, self-organization mechanisms, and learning schemes. Special emphasis is given to the connections among them and their relative advantages depending on the characteristics of the problem at hand. Furthermore, we evaluate their performance in two typical applications of self-organizing maps: classification and visualization.

Index Terms—Classification, self-organization, unsupervised learning, visualization.

I. Introduction

THE FIELD of self-organizing artificial neural networks has experienced sustained growth [5], [20], [71], [73] since the introduction of Kohonen's self-organizing feature map (SOFM) [29], [30]. From a biological inspiration [44], the field has evolved toward a more computational approach. Following the current trends in computational intelligence, there is an increasing interest in probabilistic methods [8], [23]. In this way, a body of knowledge has formed around probabilistic self-organizing maps. However, many of these models look complex at first sight, which prevents their use by practitioners who are not familiar with them. Moreover, the different notations and starting points of their mathematical derivations obscure their intrinsic similarities. Here, our goal is to remedy these inconveniences by presenting the models in a manner adequate for the non-expert reader who may be interested in applying them. We also highlight the links among them, which might help to clear the path for subsequent developments in this field.

We focus on maps for continuous valued input data, although discrete valued data have also received attention [1], [27], [33]. Moreover, we do not consider those models which do not define a continuous probability density function (pdf) on the input space [14], [68], [69].

Manuscript received March 21, 2010; revised May 31, 2010 and July 11, 2010; accepted July 13, 2010. Date of publication August 19, 2010; date of current version October 6, 2010. This work was supported in part by the Ministry of Education and Science, Spain, under Project TIN2006-07362, and in part by the Autonomous Government of Andalusia, Spain, under Projects P06-TIC-01615 and P07-TIC-02800.

The author is with the Department of Computer Languages and Computer Science, University of Malaga, Malaga 29071, Spain (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNN.2010.2060208

The earliest efforts can be traced to the generative topographic mapping (GTM) [9]. Over the years, this model has been employed for a variety of purposes. A GTM-based map has been proposed for clustering audio and video data [22]. It has also been used to model images [65] and to compute principal surfaces [12]. But perhaps its most popular application is data visualization [2], [21]. More recently proposed models are used in remote sensing [72], visual data mining [41], and statistical image representation [65]. The degree of relationship among them depends on only a few key features, as we outline next.

First of all, every probabilistic self-organizing map approach proposes its own criteria to derive a particular training method. Nevertheless, some of these criteria have common theoretical bases. In particular, we may consider two types of proposals: those which are based on expectation maximization (EM) algorithms [16] and those which rely on stochastic approximation theory [49]. These two strategies are largely different; while EM tries to adjust the model to a prespecified set of training data (which leads to batch mode learning), stochastic approximation assumes that the data samples arrive one by one and their noise must be averaged out in the long run (which corresponds to online learning). Consequently, the self-organizing models based on each of them inherit these distinctive features.

The way of achieving self-organization is also subject to some variability. Kohonen's original mechanism did not provide a probabilistic interpretation of the neighborhood function; this interpretation was introduced by Heskes [26], and requires that the neighborhood function be normalized so that its values sum to one. There is a third possibility, which is to define a latent space where the units are arranged in a lattice [9]; in this latter case, a transformation must be defined between the latent space and the input space where the input data lie.

Last but not least, a probabilistic self-organizing map must choose a density model for each of its units. Here, a proper selection can be conceived as a balance between the flexibility of the model and its computational complexity. More flexible models tend to be more costly in computational terms, and they are more prone to the data insufficiency problem [17], [18], [70], which can endanger the learning process if the number of training samples is too low with respect to the free parameters of the model. Some proposals restrict their attention to the most elementary mixture component densities; in these cases, there is no possible choice. However, all those models which allow more complex densities can be scaled down to simpler ones, at least in principle.

TABLE I
Characteristics of the Discussed Models

Model     Components          Training    Self-Organization  Section
GTM       Gaussian-spherical  EM          Latent             III-A1
t-GTM     Student-spherical   EM          Latent             III-A2
SOMM      Gaussian-spherical  EM          Heskes             III-B1
MLTM      Gaussian-spherical  EM          Kohonen            III-B2
PbSOM     Gaussian-full       EM          Kohonen            III-B3
KBTM      Gaussian-spherical  Stochastic  Kohonen            IV-A1
SOMN      Gaussian-full       Stochastic  Heskes             IV-A2
PPCASOM   Gaussian-PPCA       Stochastic  Heskes             IV-B1
TSOM      Student-full        Stochastic  Heskes             IV-B2

In this paper, our aim is to study a selection of the most relevant probabilistic self-organizing maps for continuous data. We focus on maps with a fixed topology; proposals which learn the topology are not considered [38], [48]. A total of nine models are discussed; Table I reflects their most important characteristics, with an indication of the section of the paper where they are presented (see Section II-B for a description of the mixture component types). The most commonly used mixture component model is also the simplest, namely, the multivariate Gaussian with spherical covariance matrices (Gaussian-spherical). In a sense, this means that Gaussian-spherical is a common endpoint which can be reached from many different approaches. In contrast, the training schemes and self-organization mechanisms vary more evenly. It is worth noting that no model has been found with latent space self-organization and stochastic approximation learning, due to the difficulty of combining the online nature of the Robbins–Monro algorithm with the batch mode learning which is typical of latent space self-organization.

The structure of this paper is as follows. First, we present the common theoretical framework of the discussed models (Section II). This includes examining the self-organization mechanisms that are most commonly used (Section II-A). The particular mechanism chosen by a proposal deeply influences all of its other aspects, as we will see. After that, the different probabilistic models for the units of the map are discussed (Section II-B). As mentioned before, these models affect the computational complexity and the flexibility of the maps.

Then, we consider EM and stochastic approximation models separately in order to present them in a coherent manner. Section III is devoted to EM-based proposals and Section IV presents stochastic approximation methods. Both strategies aim to learn a certain set of map parameters $\Theta$, which varies depending on the particular model.

Section V is devoted to showing the self-organizing map formation process (Section V-A) and to illustrating some typical applications (Sections V-B and V-C). We do not intend to be exhaustive, but rather to give an idea of the wide range of problems that these models are able to solve. Finally, we discuss the merits of the models and some future lines of research (Section VI).

II. Theoretical Framework

We start by establishing a theoretical foundation which is common to all the probabilistic self-organizing models we are going to study. This foundation comprises the introduction of self-organization in probabilistic models (Section II-A) and the presentation of the most popular probabilistic mixtures which are used with these models (Section II-B).

A. Self-Organization

All the models we are considering share a fundamental feature, namely, the definition of a pdf on the input space. This function is defined as a probabilistic mixture where each unit of the map is associated with a mixture component. Let the observed (input) space dimension be $D$. Then the likelihood of the observed data $t \in \mathbb{R}^D$ is given by the mixture pdf of the map as follows:

$$p(t) = \sum_{i=1}^{H} \pi_i \, p(t \mid i) \qquad (1)$$

where $H$ is the number of mixture components (units), $\pi_i$ is the prior probability or mixing proportion of unit $i$, and $p(t \mid i)$ is the pdf associated with unit $i$.

At this point, an important distinction must be made between models with latent spaces and those that do not use them [45]. The former strategy hypothesizes the existence of a latent space with a reduced dimensionality (usually bidimensional). A mapping is defined between the latent space $\mathbb{R}^2$ and the input space $\mathbb{R}^D$ as follows:

$$t = W \phi(u) \qquad (2)$$

where $\phi$ is a set of $B$ constant basis functions, $\phi(u)$ is a $B \times 1$ column vector, and $W$ is a $D \times B$ mapping matrix to be adjusted. A discrete probability distribution is defined over a regular grid of latent points $u_i \in \mathbb{R}^2$ as follows:

$$P(u) = \frac{1}{H} \sum_{i=1}^{H} \delta(u - u_i) \qquad (3)$$

where $\delta$ is the Dirac delta. Then, the particular model at hand must define the density function $p(t \mid u, W)$ of the input space points $t$ given a latent point $u$ and a mapping matrix $W$.

In contrast, models which do not use a latent space must design a training algorithm that ensures that the mixture parameters of neighboring units $i, j$ in the map are similar, so that

$$p(t \mid i) \approx p(t \mid j). \qquad (4)$$

In other words, the topology of the map guides the learning process in order to achieve (4). For each training sample $t_n$, a winning unit $\mathrm{Winner}(n)$ is chosen, and its parameters are adjusted to the sample. The winning unit is usually defined as the one with the largest a posteriori probability of having generated the training sample as follows:

$$\mathrm{Winner}(n) = \arg\max_i \{P(i \mid t_n)\} = \arg\max_i \{\pi_i \, p(t_n \mid i)\}. \qquad (5)$$
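As a concrete illustration of (1) and (5), the following sketch evaluates the unit posteriors and selects the winning unit for a map with Gaussian-spherical components. It is a minimal Python/NumPy example; the array names (means, variances, priors) are illustrative and not taken from any of the surveyed implementations.

```python
import numpy as np

def winner(t, means, variances, priors):
    """Posteriors P(i | t) and winning unit (5) for a Gaussian-spherical mixture (1).

    t: (D,) sample; means: (H, D) mean vectors; variances: (H,) spherical
    variances; priors: (H,) mixing proportions pi_i.
    """
    D = t.shape[0]
    sq = ((means - t) ** 2).sum(axis=1)                       # squared distances to each unit
    log_pdf = -0.5 * (D * np.log(2 * np.pi * variances) + sq / variances)
    log_post = np.log(priors) + log_pdf                       # proportional to log P(i | t)
    post = np.exp(log_post - log_post.max())                  # numerical stability
    post /= post.sum()                                        # normalized posteriors
    return int(np.argmax(post)), post
```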


Its neighbors $i$ are also adjusted, but with a learning rate $\eta_i$ which decays with the topological distance $d(i, \mathrm{Winner}(n))$ to the winner as follows:

$$d(i, \mathrm{Winner}(n)) \geq d(j, \mathrm{Winner}(n)) \;\Leftrightarrow\; \eta_i \leq \eta_j. \qquad (6)$$

Many proposals use a Gaussian neighborhood function $\Lambda$ (not to be confused with the pdf $p(t \mid i)$ of the mixture components), which varies with the time step $n$ and depends on a decaying neighborhood radius $\Delta(n)$ as follows:

$$\eta_i(n) \propto \Lambda(i, \mathrm{Winner}(n)) = \exp\left(-\left(\frac{d(i, \mathrm{Winner}(n))}{\Delta(n)}\right)^2\right) \qquad (7)$$

$$\Delta(n+1) \leq \Delta(n). \qquad (8)$$

Two main lines of reasoning have been proposed to justify condition (6). The first of them follows Kohonen's original proposal [29], so that the self-organizing constraint is not related to any probability, but is introduced as a means of reinforcing the learning of neighboring units. The other line, proposed by Heskes [26], interprets the neighborhood function as a confusion probability, i.e., the probability that the winning unit is $k$ while the input sample was in fact generated by mixture component $i$, as follows:

$$\Lambda(i, k) = P(i \mid k = \mathrm{Winner}(n)). \qquad (9)$$

Please note that in this case we need to normalize the neighborhood function by dividing by a suitable constant, so that it can be interpreted as a proper probability as follows:

$$\sum_{i=1}^{H} \Lambda(i, k) = 1. \qquad (10)$$

Heskes' approach is more in accordance with a probabilistic model, since it provides a clear probabilistic interpretation of the self-organizing constraint.
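The following sketch computes the Gaussian neighborhood (7) over a rectangular lattice and, optionally, normalizes it so that it sums to one over the units, as the Heskes interpretation (10) requires. The lattice layout, distance measure, and radius value are assumptions made only for this example.

```python
import numpy as np

def neighborhood(winner_idx, grid_shape, radius, normalize=True):
    """Gaussian neighborhood Lambda(i, winner) over a rectangular lattice, cf. (7).

    With normalize=True the values sum to one over the units, which is
    required by the Heskes confusion-probability interpretation (10).
    """
    rows, cols = np.indices(grid_shape)
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1)   # lattice position of each unit
    d = np.linalg.norm(coords - coords[winner_idx], axis=1)   # topological distance d(i, winner)
    lam = np.exp(-(d / radius) ** 2)
    return lam / lam.sum() if normalize else lam

# Example: 8 x 8 map, unit 10 wins, radius that would decay over time as in (8)
lam = neighborhood(winner_idx=10, grid_shape=(8, 8), radius=3.0)
```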

In the next section, we discuss how the pdf of the mixture components $p(t \mid i)$ in (1) can be chosen, and the impact of this choice on the resulting model.

B. Mixture Components

There is a wide range of choices for the pdf of the mixture components $p(t \mid i)$. One of the most general is the multivariate Student-t [47], [53], [75], which is usually expressed as follows:

$$p(t \mid i) = \frac{\Gamma\!\left(\frac{\nu_i + D}{2}\right)}{(\det(\Sigma_i))^{\frac{1}{2}} (\pi\nu_i)^{\frac{D}{2}} \, \Gamma\!\left(\frac{\nu_i}{2}\right)} \left(1 + \frac{(t - \mu_i)^T \Sigma_i^{-1} (t - \mu_i)}{\nu_i}\right)^{\frac{-\nu_i - D}{2}} \qquad (11)$$

where $\Gamma$ is the gamma function, $\mu_i$ is the location vector, $\Sigma_i$ is the symmetric and positive definite scale matrix, and $\nu_i$ is the degrees of freedom parameter. The mean exists only if $\nu_i > 1$, and in that case it coincides with the location vector as follows:

$$\mu_i = E[t \mid i]. \qquad (12)$$

The covariance matrix of the mixture component exists only if $\nu_i > 2$, and it is given by the following:

$$C_i = E\left[(t - \mu_i)(t - \mu_i)^T \mid i\right] = \frac{\nu_i}{\nu_i - 2} \Sigma_i. \qquad (13)$$

For the sake of simplicity, we discuss here only Student-t distributions with $\nu_i > 2$, which ensures the existence of the mean vector $\mu_i$ and the covariance matrix $C_i$. From now on, we call the model (11) "Student-full" because it allows any covariance matrix.
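A minimal sketch of the Student-full log density, implemented directly from (11) with NumPy and SciPy's log-gamma function; the function name and arguments are illustrative. Per (13), the covariance of the component would then be $\nu_i \Sigma_i / (\nu_i - 2)$ when $\nu_i > 2$.

```python
import numpy as np
from scipy.special import gammaln

def student_full_logpdf(t, mu, Sigma, nu):
    """Log of the multivariate Student-t pdf (11) with location mu,
    scale matrix Sigma, and nu degrees of freedom."""
    D = mu.shape[0]
    diff = t - mu
    maha = diff @ np.linalg.solve(Sigma, diff)        # (t - mu)^T Sigma^{-1} (t - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return (gammaln((nu + D) / 2) - gammaln(nu / 2)
            - 0.5 * logdet - 0.5 * D * np.log(np.pi * nu)
            - 0.5 * (nu + D) * np.log1p(maha / nu))
```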

If the Student-full model is too complex for an application, we can restrict the covariance matrix to be diagonal as follows:

$$C_i = \begin{pmatrix} \sigma_{i1}^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_{iD}^2 \end{pmatrix} \qquad (14)$$

and the resulting model is denoted "Student-diagonal." A further restriction is to use only spherical covariance matrices as follows:

$$C_i = \sigma_i^2 I \qquad (15)$$

where $I$ is the $D \times D$ identity matrix. The corresponding model is called "Student-spherical."

A limit case of the Student-full model is the well-known multivariate Gaussian, which we call "Gaussian-full." If we let $\nu_i \to \infty$ in (11), we obtain the Gaussian pdf as follows:

$$p(t \mid i) = (2\pi)^{-D/2} (\det(C_i))^{-1/2} \exp\left(-\frac{1}{2}(t - \mu_i)^T C_i^{-1} (t - \mu_i)\right). \qquad (16)$$

The Student-t pdf (11) becomes more heavily tailed as $\nu_i$ diminishes. Hence, the Gaussian (16) has lighter tails than any Student-t. As a consequence, Student-t models are preferred when many outliers are present in the input distribution.

As before, a Gaussian-full model could have too many parameters. A possible way to reduce this complexity is probabilistic principal components analysis (Gaussian-PPCA) [40], [59] as follows:

$$C_i = \sigma_i^2 I + W_i W_i^T \qquad (17)$$

where $W_i$ is a $D \times K$ parameter matrix, with $K$ being the number of principal components, $K \in \{0, \ldots, D-1\}$. A "Gaussian-diagonal" model could be derived from Gaussian-full by imposing (14) on the unrestricted (16).
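As a small illustration of (17) and of the parameter count reported in Table II, the sketch below builds a Gaussian-PPCA covariance matrix from a hypothetical $D \times K$ matrix and a noise variance; it is not taken from any particular implementation.

```python
import numpy as np

def ppca_covariance(W_i, sigma2):
    """Gaussian-PPCA covariance (17): C_i = sigma^2 I + W_i W_i^T, with W_i of size D x K."""
    D = W_i.shape[0]
    return sigma2 * np.eye(D) + W_i @ W_i.T

def ppca_free_parameters(D, K):
    """Free parameters of one Gaussian-PPCA component (Table II):
    D for the mean, D*K for W_i, 1 for sigma^2, minus K(K-1)/2 rotational redundancies."""
    return D * K + D + 1 - K * (K - 1) // 2
```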

The relationships among the models we have just defined are shown in Fig. 1. This discussion should not lead us to the wrong conclusion that more general models are better. The degree of detail required by each particular application varies. Furthermore, simpler models can be implemented with fewer operations, which is a key factor if we are to process high-dimensional data (large $D$). The situation is summarized in Table II.


Fig. 1. Relationships among mixture component models. Inclusions are shown with solid arrows and limit cases with dashed arrows.

TABLE II
Mixture Component Models

Type                 Free Parameters               Complexity
Gaussian-spherical   D + 1                         O(D)
Gaussian-diagonal    2D                            O(D)
Gaussian-PPCA        DK + D + 1 − K(K−1)/2         O(K²D)
Gaussian-full        D²/2 + 3D/2                   O(D³)
Student-spherical    D + 2                         O(D)
Student-diagonal     2D + 1                        O(D)
Student-full         D²/2 + 3D/2 + 1               O(D³)

We have carried out a simple experiment to illustrate the tradeoffs to be considered when selecting one of the presented models (see Fig. 2). Our input distribution is bidimensional ($D = 2$), and we have drawn 1000 training samples from it. It comprises some outliers (lower right of the subfigures) and unconnected regions. We have trained four models on that input; note that Gaussian-PPCA is irrelevant in the bidimensional case because $K = 0$ is the same as Gaussian-spherical and $K = 1$ is the same as Gaussian-full. We have plotted the unit Mahalanobis distance ellipses in black, i.e., the locus of the points which satisfy the following:

$$(t - \mu_i)^T C_i^{-1} (t - \mu_i) = 1. \qquad (18)$$

As seen, models with restricted covariance matrices [Fig. 2(a), (b)] yield very rough approximations to the input distribution. In particular, they fail to model its directionality. This can only be achieved with full covariance models [Fig. 2(c), (d)]. Moreover, the outliers are adequately modeled only by the Student-t [Fig. 2(d)], as expected. These qualitative conclusions are confirmed by the quantitative results of the average negative log-likelihood (ANLL) (lower is better) as follows:

$$\mathrm{ANLL} = -\frac{1}{N} \sum_{n=1}^{N} \log(p(t_n)) \qquad (19)$$

where $N$ is the number of test samples, in our case $N = 1000$, which have been drawn from the input distribution independently of the training samples.
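Computing (19) on held-out data is straightforward; the sketch below assumes a callable that returns the log mixture density $\log p(t)$ of the trained map, which is an assumption made only for the example.

```python
import numpy as np

def anll(test_samples, log_mixture_pdf):
    """Average negative log-likelihood (19) over held-out samples.

    test_samples: iterable of test vectors; log_mixture_pdf(t) returns log p(t)
    under the trained map (assumed available from the chosen model).
    """
    return -np.mean([log_mixture_pdf(t) for t in test_samples])
```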

Until now, we have discussed the main tools available for probabilistic self-organizing map model design. In the two sections that follow, we will see how these tools have been combined in the literature to yield specific proposals.

Fig. 2. Comparison of some mixture component models. (a)–(d) Input samples are shown as small crosses. (e) Estimated log densities represented as depicted in the graphical scale.

III. Expectation Maximization

This methodology [16], [50], [57], [67] assumes the existence of some hidden data $\tau$, which along with the observed data $t$ form the complete input data $(t, \tau)$. For example, in a mixture model, it is commonly considered that there is a hidden variable $h_n$ that indicates which mixture component generated the observed data $t_n$. Then, an iterative procedure is followed, with two steps that are executed in the following sequence.

1) The E step (expectation) consists in computing the expectation of the logarithm of the likelihood of the complete data, given the current values of the parameters $\Theta(n)$, the observed data $t$, and tentative new values of the parameters $\Theta$, as follows:

$$L(\Theta, \Theta(n)) = E_{\tau}\left[\log p(t, \tau \mid \Theta) \mid t, \Theta(n)\right]. \qquad (20)$$

2) The M step (maximization) obtains the new parameters $\Theta(n+1)$ by maximizing the expected likelihood $L$ with respect to $\Theta$ as follows:

$$\Theta(n+1) = \arg\max_{\Theta} L(\Theta, \Theta(n)). \qquad (21)$$

This procedure is usually carried out in batch mode, i.e., all the available training data are considered at once. The way of applying it differs completely between those models whose self-organization takes place in a latent space and those which self-organize in the input space. Hence, we treat them separately in what follows.
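As a deliberately minimal instance of the E and M steps (20) and (21), the following sketch runs batch EM on a Gaussian-spherical mixture with equal priors; it omits the self-organization ingredients that the models of this section add on top of plain EM, and all names are illustrative.

```python
import numpy as np

def em_spherical(T, means, variances, n_iter=50):
    """Batch EM for a Gaussian-spherical mixture with equal priors.

    T: (N, D) training samples; means: (H, D); variances: (H,).
    Each iteration performs one E step (responsibilities) and one M step.
    """
    D = T.shape[1]
    for _ in range(n_iter):
        # E step: responsibilities R[n, i] proportional to p(t_n | i)
        sq = ((T[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)        # (N, H)
        log_r = -0.5 * (D * np.log(2 * np.pi * variances) + sq / variances)
        log_r -= log_r.max(axis=1, keepdims=True)                           # numerical stability
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
        # M step: re-estimate means and spherical variances
        Nk = R.sum(axis=0)                                                   # effective counts per unit
        means = (R.T @ T) / Nk[:, None]
        sq = ((T[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (R * sq).sum(axis=0) / (D * Nk)
    return means, variances
```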


A. Latent Space Models

The overall scheme includes an E step which computes the posterior probabilities or responsibilities as follows:

$$R_{in} = P(i \mid t_n, W, \beta) \qquad (22)$$

where each mixture component $i$ is associated with a point $u_i$ in the latent space. The M step maximizes the complete data likelihood $L$ corresponding to all the available training data. The update equations are similar for Gaussian-spherical [9] and Student-spherical [64] mixture components. There is a common variance $\sigma^2$ for all the mixture components, which is not learnt directly. Instead, we learn its inverse as follows:

$$\beta = \frac{1}{\sigma^2}. \qquad (23)$$

Furthermore, the a priori probabilities $\pi_i$ are not subject to learning because they are assumed to be identical. Next, we study two models which are based on this framework.

1) GTM: In the E step of the GTM [9], [10], the first task is the computation of the responsibilities (22). Then the expectation of the complete log likelihood is obtained as follows:

$$L(W, \beta) = \sum_{n=1}^{N} \sum_{i=1}^{H} R_{in} \log p(t_n \mid u_i, W, \beta) \qquad (24)$$

where $N$ is the number of available training samples. Finally, the M step maximizes (24) with respect to $W$ and $\beta$.
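For Gaussian-spherical components with common precision $\beta$ and equal priors, the responsibilities (22) reduce to a softmax over the squared distances between each sample and the images $W\phi(u_i)$ of the latent points. The sketch below computes this E step under those assumptions; it is not the complete GTM training loop, and the array layouts are illustrative.

```python
import numpy as np

def gtm_responsibilities(T, Phi, W, beta):
    """E step of the GTM: responsibilities R[n, i] = P(i | t_n, W, beta), cf. (22).

    T: (N, D) data; Phi: (H, B) basis function activations phi(u_i)^T;
    W: (D, B) mapping matrix; beta: common inverse variance (23).
    """
    Y = Phi @ W.T                                               # (H, D) images of the latent points, cf. (2)
    sq = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)     # (N, H) squared distances
    log_r = -0.5 * beta * sq                                    # equal priors: softmax over units
    log_r -= log_r.max(axis=1, keepdims=True)
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)
```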

2) Student-t GTM: The t-GTM [64] considers Student-spherical mixture components, where the degrees of freedom parameters $\nu_i$ are fixed, i.e., there are no update equations for them. The E step computes the responsibilities (22).

As before, the subsequent M step involves the maximization of (24) with respect to $W$ and $\beta$. The Student-spherical model reduces to Gaussian-spherical when $\nu_i \to \infty$, so the t-GTM equations reduce to those of the GTM in that limit.

B. Input Space Models

These models maximize the complete log likelihood (or a lower bound of it) without resorting to a latent space. Here, the different strategies have less in common with each other.

1) Self-Organizing Mixture Model (SOMM): The SOMM [66] assumes Gaussian-spherical mixture components and equal a priori probabilities $\pi_i$. Let $Q$ be a set of discrete distributions over the hidden variables $h_n$ that indicate which mixture component generated the training sample $t_n$ as follows:

$$Q = \{Q_1, \ldots, Q_H\}. \qquad (25)$$

Self-organization is facilitated if each distribution $Q_k$ is formed by Heskes' confusion probabilities (9) as follows:

$$Q_k(h_n = i) = P(i \mid k = \mathrm{Winner}(n)) = \Lambda(i, k). \qquad (26)$$

Instead of maximizing the complete log likelihood $L$, the SOMM maximizes a lower bound of the likelihood.

2) Maximum Likelihood Topographic Map (MLTM): The MLTM model [63] considers Gaussian-spherical mixture components with equal a priori probabilities $\pi_i$, as does the SOMM. The E step consists in computing the posterior probabilities or responsibilities $R_{in} = P(i \mid t_n)$. The M step works in batch mode. It updates the mean vectors $\mu_i$ and the variances $\sigma_i^2$ for each mixture component.

3) Probabilistic Self-Organizing Map (PbSOM): The PbSOM [13] uses Gaussian-full mixture components. It assumes equal a priori probabilities $\pi_i$ because if they are adapted, it is found that some mixture components dominate the learning process, so that the topological ordering of the map is compromised. This model is aimed at performing an unsupervised clustering of the input data, so that every unit $i$ has its cluster $S_i$. The clusters form a partition of the training set as follows:

$$\mathcal{P} = \{S_1, \ldots, S_H\}. \qquad (27)$$

Hence, the objective function $O$ to be maximized is different from (20), since it must take into account the current partition $\mathcal{P}$. It can be written, up to a constant, as follows:

$$O(\mathcal{P}, \Theta) = \sum_{i=1}^{H} \sum_{t_n \in S_i} \sum_{j=1}^{H} \Lambda(i, j) \log p(t_n \mid j, \Theta). \qquad (28)$$

It must be highlighted that self-organization is introduced in (28) through the neighborhood function $\Lambda$.

The corresponding update algorithm is called self-organizing classification expectation maximization (SOCEM), and it has an additional C step between the E step and the M step. The E step computes the responsibilities, which are tuned by the neighborhood function. Then the C step assigns each training sample $t_n$ to the mixture component $i$ which has the largest responsibility for $t_n$ as follows:

$$S_i = \left\{ t_n \mid i = \arg\max_j R_{jn} \right\}. \qquad (29)$$

Finally, the M step re-estimates the mean vector $\mu_i$ and the covariance matrix $C_i$ of each mixture component $i$ so that (28) is maximized.

IV. Stochastic Approximation

The stochastic approximation methods that we discuss here are based on the Robbins–Monro stochastic approximation algorithm [15], [31], [32], [49], [51]. Our goal is to find the value of some parameter $\theta$ which satisfies the following:

$$\zeta(\theta) = 0 \qquad (30)$$

where $\zeta$ is a function whose values cannot be obtained directly. What we have is a random variable $z$ which is a noisy estimate of $\zeta$ as follows:

$$E[z(\theta) \mid \theta] = \zeta(\theta). \qquad (31)$$

Under these conditions, the Robbins–Monro algorithm proceeds iteratively as follows:

$$\theta(n+1) = \theta(n) + \varepsilon(n)\, z(\theta(n)) \qquad (32)$$


where $\varepsilon(n)$ is a suitable step size. This algorithm operates in online mode, i.e., it processes one training sample at a time. It is used in probabilistic self-organizing maps in two distinct ways, which we discuss next.
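The Robbins–Monro iteration (32) is easiest to see on a toy problem. The sketch below estimates a quantile of a distribution from a stream of samples, using $z = p - \mathbb{1}\{x \le \theta\}$ as the noisy estimate of $\zeta(\theta) = p - P(X \le \theta)$; the step size schedule is one common choice and is not prescribed by the surveyed models.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.9                    # target quantile level
theta = 0.0                # initial estimate
for n in range(1, 100001):
    x = rng.standard_normal()        # one online sample from the unknown distribution
    z = p - (x <= theta)             # noisy estimate of zeta(theta) = p - P(X <= theta)
    eps = 1.0 / n                    # decaying step size epsilon(n)
    theta += eps * z                 # Robbins-Monro update (32)
# theta now approximates the 0.9 quantile of N(0, 1), roughly 1.28
```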

A. Relative Entropy Minimization

The relative entropy, also called Kullback–Leibler (KL) divergence or cross entropy, is a widely used measure of the quality of a pdf estimator, and is given as follows:

$$\mathrm{KL} = D(p \,\|\, \hat{p}) = -\int p(t) \log\frac{\hat{p}(t)}{p(t)}\, dt \qquad (33)$$

where $p(t)$ is the real input density and $\hat{p}(t)$ is the estimated input density. It is a nonnegative real number which equals zero if and only if both densities are equal. Hence, the goal is to minimize KL with respect to the model parameters. For each model parameter $\theta$ we look for a zero of the corresponding partial derivative as follows:

$$\frac{\partial \mathrm{KL}}{\partial \theta} = -\int p(t) \left(\frac{1}{\hat{p}(t)} \frac{\partial \hat{p}(t)}{\partial \theta}\right) dt = 0. \qquad (34)$$

This problem can be solved iteratively by means of stochastic approximation as follows:

$$\theta(n+1) = \theta(n) + \varepsilon(n) \frac{1}{\hat{p}(t)} \frac{\partial \hat{p}(t)}{\partial \theta}. \qquad (35)$$

Next, we discuss two models which follow this line of reasoning.

1) Kernel-Based Topographic Maps (KBTMs): The KBTM model [61] uses Gaussian-spherical components with fixed equal a priori probabilities $\pi_i$. The KBTM can be expressed in terms of relative entropy minimization (KL divergence minimization). The model can also be obtained by reconstruction error minimization, as in [60]. An alternative learning algorithm can be derived from a different interpretation of the relative entropy minimization criterion [62].

2) Self-Organizing Mixture Networks (SOMNs): The SOMN [74] can use Gaussian-full or multivariate Cauchy mixture components. Multivariate Cauchy distributions are analogous to multivariate Student-t distributions but with $\nu = 1$. However, Cauchy distributions are rarely found in the self-organizing maps literature, so we only discuss the Gaussian-full version here. The SOMN adjusts the a priori probabilities $\pi_i$, and this requires the introduction of an additional normalizing constraint in (34) as follows:

$$\frac{\partial \mathrm{KL}}{\partial \pi_i} = -\int p(t) \left(\frac{1}{\hat{p}(t)} \frac{\partial \hat{p}(t)}{\partial \pi_i}\right) dt + \lambda \frac{\partial}{\partial \pi_i}\left(\sum_{i=1}^{H} \pi_i - 1\right) = 0 \qquad (36)$$

where $\lambda$ is a Lagrange multiplier, which is usually set to 1.

B. Expected Value Approximation

If we wish to estimate the expectation $E[S]$ of a certain random variable $S$ from its samples $s$ by stochastic approximation, we may take the following:

$$\zeta(\theta) = E[S] - \theta \qquad (37)$$

$$z(\theta) = s - \theta \qquad (38)$$

which obviously satisfies condition (31). Hence, (32) reads as follows:

$$\theta(n+1) = \theta(n) + \varepsilon(n)(s_n - \theta(n)) \qquad (39)$$

and $\theta(n)$ is an approximation of $E[S]$. In particular, if we assume a mixture probability density (as the discussed models do), we can estimate the conditional expectation of a function $\varphi(t)$ given a mixture component $i$. First, we set the following:

$$S = P(i \mid t)\, \varphi(t) \qquad (40)$$

$$s_n = P(i \mid t_n)\, \varphi(t_n). \qquad (41)$$

Consequently, from (37) we get the following:

$$\zeta(\theta) = E[P(i \mid t)\, \varphi(t)] - \theta. \qquad (42)$$

Now the iterative method (39) reads as follows:

$$\theta(n+1) = \theta(n) + \varepsilon(n)\left(P(i \mid t_n)\, \varphi(t_n) - \theta(n)\right). \qquad (43)$$

Then we can approximate the conditional expectation as follows:

$$E[\varphi(t) \mid i] \approx \frac{E[P(i \mid t)\, \varphi(t)]}{E[P(i \mid t)]} \qquad (44)$$

where it is assumed that the learnt probability density $\hat{p}(t)$ is a good approximation of the true input density $p(t)$ as follows:

$$\hat{p}(t) \approx p(t). \qquad (45)$$

Also, please note that $\pi_i$ is estimated by setting $\varphi(t) = 1$ in (40).

Self-organization is achieved by considering Heskes' confusion probabilities (9) in (41) as follows:

$$P(i \mid t_n) = \Lambda(i, \mathrm{Winner}(n)).$$
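A minimal online sketch of this scheme: running averages of $\Lambda(i, \mathrm{Winner}(n))\,\varphi(t_n)$ and of $\Lambda(i, \mathrm{Winner}(n))$ follow the update (43), and their ratio approximates $E[\varphi(t) \mid i]$ as in (44). Here $\varphi(t) = t$, so the ratio tracks the mean vector of each unit; all array names are illustrative.

```python
import numpy as np

def online_update(t, lam, num, den, eps):
    """One online step of (43) for every unit, with P(i | t_n) = Lambda(i, Winner(n)).

    t: (D,) current sample; lam: (H,) normalized neighborhood values;
    num: (H, D) running estimate of E[P(i|t) t]; den: (H,) running estimate of E[P(i|t)].
    The ratio num / den approximates the conditional means E[t | i], cf. (44).
    """
    num += eps * (lam[:, None] * t[None, :] - num)    # update for phi(t) = t, cf. (43)
    den += eps * (lam - den)                          # update for phi(t) = 1 (estimates pi_i)
    return num, den, num / den[:, None]
```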

Two models are based on this framework, namely, the probabilistic principal components analysis self-organizing map (PPCASOM) and the multivariate Student-t self-organizing map (TSOM).¹ The set of functions $\varphi(t)$ to be estimated is different in each case.

¹Source code for both models available at http://www.lcc.uma.es/~ezeqlr.


1) PPCASOM: The PPCASOM model [39] defines two auxiliary variables as follows:

$$\Psi_i = E\left[(t - \mu_i)\, E[x^T \mid i] \mid i\right] \qquad (46)$$

$$\Omega_i = E\left[E[x x^T \mid i] \mid i\right] \qquad (47)$$

where $x$ is the PPCA latent variable vector [59]. These variables satisfy the following:

$$W_i = \Psi_i\, \Omega_i^{-1} \qquad (48)$$

so that we do not estimate $W_i$ directly, but instead compute it from (48).

2) Multivariate Student-t Self-Organizing Map: The TSOM model [37] considers Student-full mixture components. The estimation of the degrees of freedom parameter $\nu_i$ can be done by means of three possible methods, namely, "Peel" [47], "Shoham" [53], and "Direct" [37]. Each method computes $\nu_i$ by solving a nonlinear equation in $\nu_i$ specific to the method.

V. Experiments

Now that the nine probabilistic self-organizing models have been presented, it is time to explore how they can be used to solve computational problems. There are myriad applications of self-organizing maps; we cannot be exhaustive, so we focus on some significant ones.

A. Self-Organization Experiments

Our first set of experiments is designed to show the self-organization capabilities of probabilistic models. We have chosen the SOMN as an example, but similar results can be obtained for the other discussed proposals. We have used 8 × 8 maps, and we have trained them for 100 000 epochs. The first half of the training (50 000 epochs) was the ordering phase, and the remaining half was the convergence phase. The ordering phase had a linear decay of the neighborhood radius from $\Delta(0) = 8$ to zero, and the step size had a linear decay from $\varepsilon_{\mu}(0) = \varepsilon_C(0) = 0.4$ to zero. The convergence phase had a constant neighborhood radius and step sizes, $\Delta(0) = 0.1$, $\varepsilon_{\mu} = \varepsilon_C = 0.01$.

As seen in Fig. 3, the unfolding of the map is carried out during the ordering phase, while the fine tuning of the parameters of the units happens in the convergence phase; this scheme is the same as that of the classical (non-probabilistic) self-organizing maps. However, it must be noted that while the mean vectors of the map are not completely unfolded [Fig. 3(b)], the covariance matrices spread so that the input distribution is fully covered. This behavior cannot be found in standard SOMs, because they lack a dispersion measure in the units such as the covariance matrix.

In contrast, when the input distribution does not correspond to the topology of the map (Fig. 4), the map adjusts the distances among mean vectors (as in the classic case) and the covariance matrices (unlike classic maps). This effect is particularly evident in Fig. 4(b), where the irregularity of the input distribution and the self-organization constraint force some units to position their mean vectors in regions of the input space without samples. In this situation, the affected covariance matrices spread to model the closest available inputs.

Fig. 3. Training process of a SOMN with the uniform distribution on the unit square as input. (a) Initialization. (b) Ordering phase, step 1000. (c) End of ordering phase. (d) End of convergence phase.

Fig. 4. Result of SOMN training on the uniform distribution. (a) On a circle. (b) On an irregular shape. The input samples are shown as small crosses.

B. Classification

Here, we explore ways to classify data by means of probabilistic self-organizing maps, which have been used for this purpose from their beginnings [25] to the present [56]. It is assumed that the class labels of the training samples are available, i.e., this is not an unsupervised clustering task such as the one which inspires the PbSOM model. We may consider two different strategies [55].

1) Single Map: We assign class probabilities to each unit of the trained map. The probability of class $S_j$ given unit $i$ is given by Bayes' theorem as follows:

$$P(S_j \mid i) = \frac{P(S_j)\, P(i \mid S_j)}{P(i)} = \frac{\sum_{t \in S_j} P(i \mid t)}{\sum_k \sum_{t \in S_k} P(i \mid t)}. \qquad (49)$$

TABLE III
Parameter Selections

Model     Parameters
GTM       Basis functions: 3 × 3, epochs: 50
tGTM      Basis functions: 3 × 3, epochs: 50, ν = 3
SOMM      Neighborhood radius Δ = 1
MLTM      Epochs: 50, exponential decay from Δ(0) = 3
PbSOM     Epochs: 50, Δ(0) = 1
KBTM      Steps: 2 000 000, Δ(0) = 3
SOMN      Steps: 100 000, Δ(0) = 6, εμ(0) = εC(0) = 0.4
PPCASOM   Steps: 100 000, Δ(0) = 2, ε(0) = 0.01, K = 2
TSOM      Steps: 100 000, Δ(0) = 2, ε(0) = 0.01, Shoham

The probabilities $P(S_j \mid i)$ can be regarded as the soft class labels assigned to the units $i$. Then, when a test sample $t$ is presented to the map, we can estimate the probability that it belongs to class $S_j$ as follows:

$$P(S_j \mid t) \approx \sum_i P(i \mid t)\, P(S_j \mid i) \qquad (50)$$

where we assume that $P(S_j \mid i, t) \approx P(S_j \mid i)$ and the posterior probabilities $P(i \mid t)$ are obtained as follows:

$$P(i \mid t) = \frac{\pi_i\, p(t \mid i)}{\sum_k \pi_k\, p(t \mid k)}. \qquad (51)$$
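The single map strategy can be sketched as follows: (49) turns accumulated unit posteriors into soft class labels, and (50) combines them with the posteriors (51) of a test sample. The helper names and array layouts are assumptions made only for the example.

```python
import numpy as np

def soft_class_labels(posteriors_by_class):
    """Soft class labels P(S_j | i), cf. (49).

    posteriors_by_class: list with one (N_j, H) array per class, holding the
    posteriors P(i | t) of that class's training samples.
    """
    per_class = np.stack([P.sum(axis=0) for P in posteriors_by_class])   # (J, H) sums over samples
    return per_class / per_class.sum(axis=0, keepdims=True)              # normalize over classes

def classify(post_t, class_labels):
    """Class probabilities P(S_j | t) of a test sample, cf. (50).

    post_t: (H,) posteriors P(i | t) computed with (51);
    class_labels: (J, H) matrix of P(S_j | i) from soft_class_labels.
    """
    return class_labels @ post_t
```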

This strategy is illustrated in Fig. 5. We have chosen two University of California at Irvine benchmark classification databases (Iris and Wine) with three classes [4]. Each class is associated with a color: red, green, or blue. We have reduced the dimensionality of the original data to D = 2 by means of a global principal components analysis in order to be able to plot the input space. Then a TSOM with 6 × 6 units has been trained for 100 000 epochs. The "Shoham" method has been used, and the rest of the parameters are as in the experiments of [37]. Please note that in this classification strategy, the self-organizing maps are not aware of the class labels of the training data; the complete training set is presented to the map without separating the samples by classes. The class probabilities of the units $P(S_j \mid i)$ are computed only after the training of the map is finished. Hence, the task of the map training is to discover the clusters in the input data.

The class probabilities $P(S_j \mid i)$ have been used to assign a color to each unit $i$ of the map, which has been drawn at the position indicated by its mean vector $\mu_i$, i.e., the color of each unit is a weighted average of the colors of the classes, where the weights are the class probabilities of each unit. As seen, when the classes are separable (Iris), most units get a clear class assignment. But when the problem is harder (Wine), the class probabilities of many units are less extreme, which could lead to classification errors; this reflects the difficulty of the classification problem.

We have also carried out a quantitative comparison. The classification accuracy has been considered for this purpose, although there are some other useful performance measures available [37], [42]. The parameters for the probabilistic models are as in Table III. The original Kohonen's SOFM has also been tested as a reference, with 100 000 time steps, initial neighborhood radius Δ(0) = 6, and initial learning rate η(0) = 0.4. The results for the tenfold cross-validation experiments are shown in Table IV. As seen, SOMN and TSOM offer a good performance due to their full covariance matrices (and, for TSOM, because of its robustness against outliers). In contrast, the smaller-sized BalanceScale database is well suited for spherical covariance models such as SOMM and MLTM.

Fig. 5. Single map strategy. (a) Iris. (b) Wine.

TABLE IV
Classification Accuracy Results for the Single Map Strategy (Higher Is Better)

Model     BalanceScale      Magic             Spambase
SOFM      0.6143 (0.0601)   0.7296 (0.0079)   0.6866 (0.0310)
GTM       0.4437 (0.1231)   0.6594 (0.0108)   0.6779 (0.0247)
tGTM      0.4590 (0.0665)   0.6484 (0.0090)   0.6059 (0.0157)
SOMM      0.7471 (0.0510)   0.7303 (0.0111)   0.6132 (0.0291)
MLTM      0.7232 (0.0584)   0.6560 (0.0139)   0.6151 (0.0231)
PbSOM     0.5972 (0.0992)   0.6859 (0.0562)   0.7034 (0.0273)
KBTM      0.6396 (0.0723)   0.6484 (0.0121)   0.6058 (0.0169)
SOMN      0.6619 (0.0669)   0.7930 (0.0170)   0.8454 (0.0179)
PPCASOM   0.6551 (0.0956)   0.6841 (0.0316)   0.6262 (0.0515)
TSOM      0.5898 (0.0609)   0.7391 (0.0402)   0.8136 (0.0186)

Standard deviations are in parentheses.

2) One Map Per Class: We train a map $M_j$ with the available training samples of each class $S_j$. Then the class probabilities of a test sample $t$ are computed with the help of the pdfs $p_{M_j}$ of each map $M_j$ [11] as follows:

$$P(S_j \mid t) \approx \frac{P(S_j)\, p_{M_j}(t)}{\sum_h P(S_h)\, p_{M_h}(t)} \qquad (52)$$

that is, we are assuming that the map pdf models the class density function adequately as follows:

$$p(t \mid S_j) \approx p_{M_j}(t). \qquad (53)$$
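A short sketch of (52), assuming one callable per class that returns the mixture density of the map trained on that class; the names are illustrative.

```python
import numpy as np

def classify_one_map_per_class(t, map_pdfs, class_priors):
    """Class probabilities P(S_j | t) with one trained map per class, cf. (52).

    map_pdfs: list of callables, each returning the mixture density p_Mj(t) of
    the map trained on class S_j; class_priors: (J,) prior probabilities P(S_j).
    """
    scores = class_priors * np.array([pdf(t) for pdf in map_pdfs])
    return scores / scores.sum()
```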

The concept is illustrated in Fig. 6. A TSOM with 3 × 3 units has been trained on each of the three classes of the databases mentioned above (Iris and Wine), with the same parameter selections as in the single map experiments. We have drawn all the neurons of a map in the color of the class associated with the map. The easier classification task (Iris) leads to clear separations between maps, but the harder one (Wine) produces a certain overlap between the maps of the two inseparable classes (on the right side). As commented before, this is unavoidable and reflects the more difficult nature of the classification problem. It must be pointed out that if we had used the original data, the extra dimensions would have facilitated the separation among classes; these experiments are aimed at depicting the classification strategies graphically.


Fig. 6. One map per class strategy. (a) Iris. (b) Wine.

TABLE V
Classification Accuracy Results for the One Map Per Class Strategy (Higher Is Better)

Model     BalanceScale      Magic             Spambase
SOFM      0.6664 (0.0822)   0.7487 (0.0105)   0.7009 (0.0121)
GTM       0.4667 (0.0872)   0.6484 (0.0073)   0.5909 (0.0594)
tGTM      0.4807 (0.0672)   0.6485 (0.0132)   0.6174 (0.0310)
SOMM      0.8274 (0.0479)   0.7669 (0.0071)   0.6450 (0.0174)
MLTM      0.8626 (0.0536)   0.6464 (0.0064)   0.6093 (0.0810)
PbSOM     0.9128 (0.0485)   0.7589 (0.0317)   0.8318 (0.0278)
KBTM      0.8769 (0.0471)   0.5589 (0.0188)   0.6707 (0.0248)
SOMN      0.9158 (0.0184)   0.8227 (0.0084)   0.7400 (0.3072)
PPCASOM   0.9140 (0.0262)   0.7785 (0.0079)   0.6673 (0.0245)
TSOM      0.9174 (0.0351)   0.7996 (0.0330)   0.8204 (0.0209)

Standard deviations are in parentheses.

A quantitative comparison with the same performance measure, models and parameter selections as in Section V-B1 has been carried out. The results for the tenfold cross-validation are given in Table V. The models with full covariance matrices are well suited for this application (SOMN, TSOM, and PbSOM). In contrast, the SOMM and MLTM spherical models are competitive on the database with the least dimensionality and size (BalanceScale). This is in line with the results obtained in Sections V-A and V-B1, i.e., the dimensionality of the data and the number of training samples should guide the selection of one model over another, since we must achieve a balance between model flexibility and the data insufficiency problem. In addition to this, the available computation time is another factor to be taken into account; if the dimensionality is very large, only spherical and PPCA models are feasible because of their linear complexity with respect to D. Finally, the original SOFM is better than several probabilistic approaches (in particular, some with spherical covariance matrices), but it is clearly outperformed by the others. Hence, we have a confirmation of the usefulness of introducing probability distributions in self-organizing maps.

C. Visualization

Next, we demonstrate the ability of self-organizing maps to build faithful representations of complex datasets. This is a common application of probabilistic self-organizing maps [24], [58]. We have considered three distinct situations.

First, we have selected the handwritten twos database from MNIST [34], which has a high dimensionality D = 784. A SOMM with 8 × 8 units and a PPCASOM with 8 × 7 units and K = 2 have been trained (Fig. 7); the rest of the model parameters have been set as in the experiments in [66] and [39], respectively. The SOMM provides the mean vectors (centroids) $\mu_i$ of the units. In contrast, the PPCASOM offers not only the mean vectors $\mu_i$, but also the first and second principal eigenvectors of the covariance matrix $C_i$. We have used the L*a*b* color space [52] to represent all three vectors jointly; the mean is associated with the luminance (L*) channel, and the first and second eigenvectors are associated with the chromaticity channels a* and b*, respectively. Hence, chromaticity similarities in Fig. 7(b) correspond to similarities in the principal eigenvectors of the units. The L*a*b* space has the advantage that its Euclidean distance closely matches relative perceptual differences between colors. In our case, this means that Euclidean distances among mean vectors and eigenvectors are adequately rendered by the color codification. We employ this color coding in all the PPCASOM plots of this subsection.

Fig. 7. Visualization of the MNIST twos database. (a) SOMM. (b) PPCASOM. (c) Color key.

The central part of the PPCASOM map captures the most common shapes, while the outer units specialize in less frequent patterns (outliers). In contrast, the SOMM does not adapt to the infrequent patterns, due to its simpler probabilistic model. Hence there is a tradeoff between the computational complexity and the level of detail of the representation.

Our second experiment is about modeling a video sequence in order to distinguish foreground objects from the background. This is one of the earliest stages in many computer vision systems [28], [35], [54], [76]. The input video sequence, which is publicly available [36], depicts a fountain with flowing water (background) and some pedestrians passing in front of it (foreground objects). There are 523 frames (input samples), and each frame has 160 × 128 pixels; we have converted the frames to grayscale prior to their presentation to the self-organizing maps. Hence, the input dimensionality for the maps is D = 20 480. The parameters have been as in the MNIST experiment, with the exception that the map size has been 5 × 5 both for SOMM and PPCASOM. As seen in Fig. 8, SOMM is only able to capture the background (the overall mean of the distribution). In contrast, PPCASOM discovers the foreground objects (pedestrians) by modelling them with the help of the principal eigenvectors (corresponding to the colored regions). Hence, PPCASOM builds a representation of the background with its mean vectors and a representation of the variability due to foreground objects by means of its principal eigenvectors.

Fig. 8. Modeling of the fountain video sequence. (a) SOMM. (b) PPCASOM.

Fig. 9. Discovering patterns of solar activity. (a) SOMM. (b) PPCASOM.

The third and last experiment tries to discover activity patterns in a sequence of images of the Sun; this is a problem of paramount importance in solar research [3], [7], [19]. The images are publicly available and come from the Solar and Heliospheric Observatory (SOHO) [43]. We have used 5000 images corresponding to extreme ultraviolet imaging telescope observations at 171 Angstrom (EIT 171). Each image was originally 512 × 512 pixels in size, but they have been reduced to 128 × 128 pixels by bicubic interpolation, so that we have an input dimensionality D = 16 384. The map sizes and model parameters are as in the previous experiment. The results are shown in Fig. 9. As before, the SOMM only learns averages of the data, so that we can only distinguish two less active (darker) regions near both poles. In contrast, the units of the PPCASOM discover regions which activate simultaneously (plotted in the same color), which are represented by the principal eigenvectors, as before. The units on the left side of the map account for variations over large regions, while the units on the right side specialize in more localized activity oscillations. The entire surface of the solar disc is colored, which means that all those pixels exhibit a significant variation with respect to their mean values [this is in contrast with Fig. 8(b)].

VI. Discussion

In this section, we outline some important features of probabilistic SOMs which can be inferred from the preceding experiments. These features are to be taken into account when deciding which model to employ in a particular application. After that, we outline some future lines of research.

Perhaps the most common problem to be faced by the average practitioner when employing probabilistic SOMs is that of data insufficiency, in particular with those models with full covariance matrices (PbSOM, SOMN, and TSOM). This problem arises when the number of free parameters of the map is too large with respect to the number of available training samples. It is very easily detected by checking the value of the condition number of the covariance matrices $C_i$ as follows:

$$\mathrm{cond}(C_i) = \|C_i\|\, \|C_i^{-1}\|. \qquad (54)$$

If the value is high (say, $\mathrm{cond}(C_i) > 10^4$), then the problem is present and further computations would be unreliable. The most straightforward way to remedy this problem without losing the directional information of the covariance matrix is to use a PPCASOM model with a small number of principal components $K$. This reduces the number of free parameters of the map dramatically (see Table II) while retaining the information about the principal directions of the data. If the number of training samples is extremely small, we could obtain an additional reduction of the number of free parameters by reducing the size of the map, i.e., using a map with fewer units.
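Checking (54) is a one-liner with NumPy; the sketch below flags the units whose covariance matrices exceed a given condition number threshold, using the $10^4$ value mentioned above as a default.

```python
import numpy as np

def data_insufficiency_flags(covariances, threshold=1e4):
    """Flag units whose covariance matrix is ill-conditioned, cf. (54).

    covariances: (H, D, D) stack of covariance matrices C_i. A condition number
    above the threshold signals the data insufficiency problem.
    """
    conds = np.array([np.linalg.cond(C) for C in covariances])
    return conds > threshold, conds
```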

Another key factor in the decision about which model to use is the difference between online learning models and batch mode ones. In general, models based on EM (Section III) are more suited to batch mode learning, while stochastic approximation (Section IV) is more naturally used with online learning. Batch mode requires that all the training data (or at least a substantial portion of the training set) be available prior to the start of the learning process. This might not be the case in real-time applications where the training data are generated on the fly, which would lead us to prefer models with online learning; an example would be the adaptive filtering problem [6], [46]. In contrast, if all the training data are available in advance, we could obtain a better performance with batch learning, which considers the whole training set at a time when optimizing the map parameters.

As seen, the decision about which model is best suited for a particular task depends on various considerations. Consequently, it cannot be said that one of them is better than all the others for all purposes.

Future work in probabilistic SOMs includes providing even more flexibility to the units, i.e., achieving a better adaptation to the input distribution with less computational complexity. Further improvements could be obtained by incorporating a data insufficiency alleviation mechanism into the models with full covariance matrices. Finally, we must note that the Gaussian-diagonal and Student-diagonal mixture component models (see Section II-B) have received little attention from researchers. This leaves an open field of research, since they have a reduced complexity O(D), which makes them adequate for very high dimensional data, while at the same time they are more flexible than their well-known spherical counterparts.

VII. Conclusion

An in-depth study of the probabilistic self-organizing map models with fixed topology, which are designed to process continuous valued input data, has been presented. We have discussed in detail the most common ways to achieve self-organization and adaptation to the input distribution. Moreover, we have discussed the mixture component models that lie at the heart of these maps. Nine proposals have been selected to be discussed more deeply, and they form a representative sample of the field. Finally, experiments have been carried out to illustrate several significant problems that can be solved by these maps. At the same time, these results show the capabilities of some of the previously discussed models.

From the preceding, we can conclude that probabilistic self-organizing maps share a well-founded common theoretical framework, and that they combine advantages of self-organizing neural networks and probabilistic mixture models. This makes them suitable for a wide range of applications. Moreover, the path is clear to develop new models with more flexibility and less computational complexity.

Acknowledgment

The author would like to thank the anonymous reviewers for their valuable comments and suggestions.

References

[1] D. Alvarez and H. Hidalgo, "Document analysis and visualization with zero-inflated Poisson," Data Mining Knowledge Discovery, vol. 19, no. 1, pp. 1–23, Aug. 2009.

[2] A. Andrade, S. Nasuto, P. Kyberd, and C. Sweeney-Reed, "Generative topographic mapping applied to clustering and visualization of motor unit action potentials," BioSystems, vol. 82, no. 3, pp. 273–284, 2005.

[3] M. Aschwanden, "2-D feature recognition and 3-D reconstruction in solar EUV images," Solar Phys., vol. 228, nos. 1–2, pp. 339–358, 2005.

[4] A. Asuncion and D. Newman. (2007). UCI Machine Learning Repository [Online]. Available: http://www.ics.uci.edu/mlearn/MLRepository.html

[5] G. Barreto, "Time series prediction with the self-organizing map: A review," Studies Comput. Intell., vol. 77, pp. 135–158, 2008.

[6] G. Barreto and L. Souza, "Adaptive filtering with the self-organizing map: A performance comparison," Neural Netw., vol. 19, nos. 6–7, pp. 785–798, 2006.

[7] E. E. Benevolenskaya, "EUV coronal pattern of complexes of solar activity," Advances Space Res., vol. 39, no. 12, pp. 1860–1866, 2007.

[8] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ: Springer-Verlag, 2006.

[9] C. M. Bishop and M. Svensén, "The generative topographic mapping," Neural Comput., vol. 10, no. 1, pp. 215–234, 1998.

[10] C. Bishop, M. Svensén, and C. Williams, "Developments of the generative topographic mapping," Neurocomputing, vol. 21, nos. 1–3, pp. 203–224, 1998.

[11] S. Brahim-Belhouari and A. Bermak, "Gas identification using density models," Pattern Recog. Lett., vol. 26, no. 6, pp. 699–706, 2005.

[12] K.-Y. Chang and J. Ghosh, "A unified model for probabilistic principal surfaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 1, pp. 22–41, Jan. 2001.

[13] S.-S. Cheng, H.-C. Fu, and H.-M. Wang, "Model-based clustering by probabilistic self-organizing maps," IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 805–826, May 2009.

[14] T. Chow and S. Wu, "An online cellular probabilistic self-organizing map for static and dynamic data sets," IEEE Trans. Circuits Syst. I: Regular Papers, vol. 51, no. 4, pp. 732–747, Apr. 2004.

[15] B. Delyon, M. Lavielle, and E. Moulines, "Convergence of a stochastic approximation version of the EM algorithm," Ann. Statist., vol. 27, no. 1, pp. 94–128, 1999.

[16] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.

[17] I.-J. Ding, "Incremental MLLR speaker adaptation by fuzzy logic control," Pattern Recog., vol. 40, no. 11, pp. 3110–3119, 2007.

[18] J.-H. Eom, S.-C. Kim, and B.-T. Zhang, "AptaCDSS-E: A classifier ensemble-based clinical decision support system for cardiovascular disease level prediction," Expert Syst. Applicat., vol. 34, no. 4, pp. 2465–2479, 2008.

[19] R. Frazin and F. Kamalabadi, "Rotational tomography for 3-D reconstruction of the white-light and EUV corona in the post-SOHO era," Solar Phys., vol. 228, nos. 1–2, pp. 219–237, 2005.

[20] R. Freeman and H. Yin, "Web content management by self-organization," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1256–1268, Sep. 2005.

[21] C. Fyfe, "Topographic maps for clustering and data visualization," Studies Comput. Intell., vol. 115, pp. 111–153, 2008.

[22] C. Fyfe, W. Barbakh, W. Ooi, and H. Ko, "Topological mappings of video and audio data," Int. J. Neural Syst., vol. 18, no. 6, pp. 481–489, 2008.

[23] A. Gammerman, Ed., Computational Learning and Probabilistic Reasoning. New York: Wiley, 1996.

[24] N. Gianniotis and P. Tino, "Visualization of tree-structured data through generative topographic mapping," IEEE Trans. Neural Netw., vol. 19, no. 8, pp. 1468–1493, Aug. 2008.

[25] K. Haese, "Kalman filter implementation of self-organizing feature maps," Neural Comput., vol. 11, no. 5, pp. 1211–1233, 1999.

[26] T. Heskes, "Self-organizing maps, vector quantization, and mixture modeling," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1299–1305, Nov. 2001.

[27] C.-C. Hsu, "Generalizing self-organizing map for categorical data," IEEE Trans. Neural Netw., vol. 17, no. 2, pp. 294–304, Mar. 2006.

[28] P. Kaewtrakulpong and R. Bowden, "A real time adaptive visual surveillance system for tracking low-resolution colour targets in dynamically changing scenes," Image Vision Comput., vol. 21, no. 9, pp. 913–929, Sep. 2003.

[29] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.

[30] T. Kohonen, Self-Organizing Maps, 3rd ed. Secaucus, NJ: Springer-Verlag, 2001.

[31] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. New York: Springer-Verlag, 2003.

[32] T. Lai, "Stochastic approximation," Ann. Stat., vol. 31, no. 2, pp. 391–406, 2003.

[33] M. Lebbah, Y. Bennani, and N. Rogovschi, "A probabilistic self-organizing map for binary data topographic clustering," Int. J. Comput. Intell. Applicat., vol. 7, no. 4, pp. 363–383, 2008.

[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.

[35] L. Li, W. Huang, I. Y.-H. Gu, and Q. Tian, "Statistical modeling of complex backgrounds for foreground object detection," IEEE Trans. Image Process., vol. 13, no. 11, pp. 1459–1472, Nov. 2004.

[36] L. Li, W. Huang, I. Y.-H. Gu, and Q. Tian. (2010, Mar.). Statistical Modeling of Complex Background for Foreground Object Detection [Online]. Available: http://perception.i2r.a-star.edu.sg/bk model/bkindex.html

[37] E. López-Rubio, "Multivariate Student-t self-organizing maps," Neural Netw., vol. 22, no. 10, pp. 1432–1447, 2009.

[38] E. López-Rubio, J. Muñoz-Pérez, and J. Gómez-Ruiz, "Self-organizing dynamic graphs," Neural Process. Lett., vol. 16, no. 2, pp. 93–109, 2002.

[39] E. López-Rubio, J. M. Ortiz-de-Lazcano-Lobato, and D. López-Rodríguez, "Probabilistic PCA self-organizing maps," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1474–1489, Sep. 2009.

[40] E. López-Rubio and J. Ortiz-De-Lazcano-Lobato, "Automatic model selection by cross-validation for probabilistic PCA," Neural Process. Lett., vol. 30, no. 2, pp. 113–132, 2009.

[41] I. Marroquín, J.-J. Brault, and B. Hart, "A visual data-mining methodology for seismic-facies analysis, part 1: Testing and comparison with other unsupervised clustering methods," Geophysics, vol. 74, no. 1, pp. P1–P11, 2009.

[42] V. Moschou, D. Ververidis, and C. Kotropoulos, "Assessment of self-organizing map variants for clustering with application to redistribution of emotional speech patterns," Neurocomputing, vol. 71, nos. 1–3, pp. 147–156, 2007.

[43] NASA. (2010, Mar.). Solar and Heliospheric Observatory Data [Online]. Available: http://sohowww.nascom.nasa.gov/home.html

[44] K. Obermayer and T. J. Sejnowski, Eds., Self-Organizing Map Formation: Foundations of Neural Computation. Cambridge, MA: MIT Press, 2001.

[45] E. Oja, "Unsupervised learning in neural computation," Theor. Comput. Sci., vol. 287, no. 1, pp. 187–207, 2002.

[46] I. Olier and A. Vellido, "Advances in clustering and visualization of time series using GTM through time," Neural Netw., vol. 21, no. 7, pp. 904–913, 2008.

[47] D. Peel and G. McLachlan, "Robust mixture modeling using the t distribution," Statist. Comput., vol. 10, no. 4, pp. 339–348, 2000.

[48] A. Rauber, D. Merkl, and M. Dittenbach, "The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1331–1341, Nov. 2002.

[49] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.

[50] F. Saâdaoui, "Acceleration of the EM algorithm via extrapolation methods: Review, comparison and new methods," Comput. Statist. Data Anal., vol. 54, no. 3, pp. 750–766, 2010.

[51] M. Sato and S. Ishii, "On-line EM algorithm for the normalized Gaussian network," Neural Comput., vol. 12, no. 2, pp. 407–432, 2000.

[52] S. K. Shevell, The Science of Color, 2nd ed. Amsterdam, The Netherlands: Elsevier, 2003.

[53] S. Shoham, "Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions," Pattern Recog., vol. 35, no. 5, pp. 1127–1142, 2002.

[54] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747–757, Aug. 2000.

[55] X. Tan, S. Chen, Z.-H. Zhou, and F. Zhang, "Recognizing partially occluded, expression variant faces from single training image per person with SOM and soft k-NN ensemble," IEEE Trans. Neural Netw., vol. 16, no. 4, pp. 875–886, 2005.

[56] C. Teh and C. Lim, "An artificial neural network classifier design based-on variable kernel and non-parametric density estimation," Neural Process. Lett., vol. 27, no. 2, pp. 137–151, Apr. 2008.

[57] G.-L. Tian, K. Ng, and M. Tan, "EM-type algorithms for computing restricted MLEs in multivariate normal distributions and multivariate t-distributions," Comput. Statist. Data Anal., vol. 52, no. 10, pp. 4768–4778, 2008.

[58] P. Tino and I. Nabney, "Hierarchical GTM: Constructing localized nonlinear projection manifolds in a principled way," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 639–656, May 2002.

[59] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Comput., vol. 11, no. 2, pp. 443–482, 1999.

[60] M. M. Van Hulle, "Kernel-based topographic map formation by local density modeling," Neural Comput., vol. 14, no. 7, pp. 1561–1573, 2002.

[61] M. Van Hulle, "Joint entropy maximization in kernel-based topographic maps," Neural Comput., vol. 14, no. 8, pp. 1887–1906, 2002.

[62] M. Van Hulle, "Entropy-based kernel mixture modeling for topographic map formation," IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 850–858, Jul. 2004.

[63] M. Van Hulle, "Maximum likelihood topographic map formation," Neural Comput., vol. 17, no. 3, pp. 503–513, 2005.

[64] A. Vellido, "Missing data imputation through GTM as a mixture of t-distributions," Neural Netw., vol. 19, no. 10, pp. 1624–1635, 2006.

[65] J. Verbeek, "Learning nonlinear image manifolds by global alignment of local linear models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 8, pp. 1236–1250, Aug. 2006.

[66] J. Verbeek, N. Vlassis, and B. Kröse, "Self-organizing mixture models," Neurocomputing, vol. 63, pp. 99–123, Jan. 2005.

[67] H. Wang and Z. Hu, "On EM estimation for mixture of multivariate t-distributions," Neural Process. Lett., vol. 30, no. 3, pp. 243–256, 2009.

[68] S. Wu and T. Chow, "PRSOM: A new visualization method by hybridizing multidimensional scaling and self-organizing map," IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1362–1380, Nov. 2005.

[69] S. Wu, T. Chow, K. Ng, and K. Tsang, "Improvement of borrowing channel assignment for patterned traffic load by online cellular probabilistic self-organizing map," Neural Comput. Applicat., vol. 15, nos. 3–4, pp. 298–309, 2006.

[70] L. Xu and M.-Y. Chow, "A classification approach for power distribution systems fault cause identification," IEEE Trans. Power Syst., vol. 21, no. 1, pp. 53–60, Feb. 2006.

[71] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.

[72] H. Yin, "On the equivalence between kernel self-organising maps and self-organising mixture density networks," Neural Netw., vol. 19, nos. 6–7, pp. 780–784, Jul. 2006.

[73] H. Yin, "The self-organizing maps: Background, theories, extensions and applications," Studies Comput. Intell., vol. 115, pp. 715–762, 2008.

[74] H. Yin and N. Allinson, "Self-organizing mixture networks for probability density estimation," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 405–411, Mar. 2001.

[75] J. Zhao and Q. Jiang, "Probabilistic PCA for t distributions," Neurocomputing, vol. 69, nos. 16–18, pp. 2217–2226, 2006.

[76] Z. Zivkovic and F. van der Heijden, "Efficient adaptive density estimation per image pixel for the task of background subtraction," Pattern Recog. Lett., vol. 27, no. 7, pp. 773–780, May 2006.

Ezequiel Lopez-Rubio received the M.S. and Ph.D. degrees in computer engineering from the University of Málaga, Málaga, Spain, in 1999 and 2002, respectively.

He joined the Department of Computer Languages and Computer Science, University of Málaga, in 2000, where he is currently an Associate Professor of Computer Science and Artificial Intelligence. His current research interests include unsupervised learning, pattern recognition, and image processing.

