Bagged Clustering
Friedrich Leisch
Working Paper No. 51
August 1999
SFB 'Adaptive Information Systems and Modelling in Economics and Management Science'
Vienna University of Economics and Business Administration
Augasse 2–6, 1090 Wien, Austria
in cooperation with University of Vienna
Vienna University of Technology
http://www.wu-wien.ac.at/am
This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modelling in Economics and Management Science').
Bagged Clustering
Friedrich Leisch
Abstract: A new ensemble method for cluster analysis is introduced, which can be interpreted in two different ways: as a complexity-reducing preprocessing stage for hierarchical clustering, and as a combination procedure for several partitioning results. The basic idea is to locate and combine structurally stable cluster centers and/or prototypes. Random effects of the training set are reduced by repeatedly training on resampled sets (bootstrap samples). We discuss the algorithm from both a theoretical and an applied point of view and demonstrate it on several data sets.
Keywords: cluster analysis, bagging, bootstrap samples, k-means, learning vector quantization
I. Introduction
Clustering is an old data analysis problem and numerous
methods have been developed to solve this task. Most of
the currently popular clustering techniques fall into one of
the following two major categories:
• Partitioning Methods
• Hierarchical Methods
Both methods have in common that they try to group the data such that patterns belonging to the same group ("cluster") are as similar as possible and patterns belonging to different groups show strong differences. This definition is of course rather vague, and accordingly a lot of algorithms have been defined with respect to different notions of similarity/dissimilarity between points and/or groups of points.
In this paper we propose a novel method we call bagged clustering, which is a combination of partitioning and hierarchical methods and has, to our knowledge, not been reported before in the literature.
In recent years, ensemble methods have been successfully applied to enhance the performance of unstable or weak regression and classification algorithms in a variety of ways. The two most popular approaches are probably bagging [1] and boosting [2]. We take the main idea of bagging ("bootstrap aggregating"), the creation of new training sets by bootstrap sampling, and incorporate it into the cluster analysis framework.
The rest of this paper is organized as follows: Section II gives a short introduction to partitioning and hierarchical cluster methods and discusses their respective advantages and disadvantages. Section III introduces the bagged cluster algorithm; various aspects of the algorithm are discussed in Section IV and demonstrated on several examples in Section V. Finally, some more theoretical aspects of bagged clustering are analyzed in Section VI.

The author is with the Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik, Technische Universität Wien, Wiedner Hauptstraße 8-10/1071, A-1040 Wien, Austria. Email: [email protected]
II. Cluster Analysis Methods
A. Partitioning Methods
The standard partitioning methods are designed to find convex clusters in the data, such that each segment can be represented by a cluster center. Convex clustering (or data segmentation) is closely related to vector quantization, where each input vector is mapped onto a corresponding representative. In fact, the two can be shown to be identical, i.e., a data partition and the corresponding segment centers (with respect to a given distance measure) have a one-to-one relation where each defines the other [3], under rather general conditions [4].
Let $\mathcal{X}_N = \{x_1, \ldots, x_N\}$ denote the data set available for training and let $\mathcal{C}_K = \{c_1, \ldots, c_K\}$ be a set of $K$ cluster centers. Further let $c(x) \in \mathcal{C}_K$ denote the center closest to $x$ with respect to some distance measure $d$. Then solving a convex clustering problem amounts to

$$\sum_{n=1}^{N} d(x_n, c(x_n)) \to \min_{\mathcal{C}_K} \qquad (1)$$

i.e., finding a set of centers such that the mean distance of a data point to the closest center is minimal. Unfortunately this problem cannot be solved directly even for simple distance measures, and iterative optimization procedures have to be used.
Usually $d$ is the Euclidean distance, such that the center of each cluster is simply the mean of the cluster and Equation 1 is the sum of the within-cluster variances. If absolute distance is used, then the correct cluster centers are the respective medians. Recently several extensions to non-Euclidean distances have been proposed [5], [6].
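As a quick illustration (my code, not from the paper): with $d$ the squared Euclidean distance, Equation 1 is exactly the total within-cluster sum of squares that R's kmeans() minimizes, which the following sketch verifies on toy data.

set.seed(1)
x  <- matrix(rnorm(400), ncol = 2)                  # toy data set X_N
km <- kmeans(x, centers = 4,                        # fitted centers C_K
             algorithm = "Lloyd", iter.max = 100)   # Lloyd assigns to nearest center
d2 <- as.matrix(dist(rbind(km$centers, x)))[-(1:4), 1:4]^2  # squared distances to centers
obj <- sum(apply(d2, 1, min))                       # sum of d(x_n, c(x_n))
all.equal(obj, km$tot.withinss)                     # TRUE up to numerical error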
Popular partitioning algorithms include "classic" methods like the k-means algorithm and its online variants (which are often called hard competitive learning). More recent algorithms like the neural gas algorithm [7] or SOMs [8] also fall into this category, but add regularization terms to Equation 1 which control the structure of the set of centers $\mathcal{C}_K$ by enforcing neighborhood topologies among the centers.
B. Hierarchical Methods
Hierarchical methods do not try to find a segmentation with a fixed number of clusters, but create solutions for $K = 1, \ldots, N$ clusters. Trivially, for $K = 1$ the only possible solution is one big cluster consisting of the complete data set $\mathcal{X}_N$. Similarly, for $K = N$ we have $N$ clusters containing only one point, i.e., each point is its own cluster. In between, a hierarchy of clusters is created by repeatedly joining the two "closest" clusters until the complete data set forms one cluster (agglomerative clustering); another method is to repeatedly split clusters (divisive clustering). We only consider agglomerative methods below.
First a dissimilarity matrix $D$ containing the pairwise distances $d(x_n, x_m)$ between all data points is computed; any distance measure may be used. Then one needs a method for extending the distance to complete clusters, i.e., for measuring the distance between two sets of points $A$ and $B$. As these distances are used for joining (or linking) clusters, they are often referred to as linkage methods [9]. Popular linkage methods include:
Single linkage: the distance between the two closest points of the clusters,
$$d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$$
resulting in non-convex, chain-like cluster structures.
Ward's minimum variance: tries to find compact, spherical clusters by using the distance
$$d(A, B) = \frac{2\,|A|\,|B|}{|A| + |B|}\; \|\bar{a} - \bar{b}\|^2$$
where $|\cdot|$ denotes the size of a set, $\bar{a}$ the mean of set $A$, and $\|\cdot\|$ the Euclidean norm.
Other linkage methods like average or complete linkage are
not listed, because they are not used in the experiments
below. The result of hierarchical clustering is typically pre-
sented as a dendrogram, i.e., a tree where the root repre-
sents the one-cluster solution (complete data set) and the
leaves of the tree are the single data points. The height
of the branches correspond to the distances between the
clusters.
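In R (the software used for the experiments in Section V), both linkage methods above are available through hclust(); a minimal sketch with freely chosen toy data:

x  <- matrix(rnorm(100), ncol = 2)    # toy data
D  <- dist(x)                         # pairwise Euclidean distances
hs <- hclust(D, method = "single")    # single linkage
hw <- hclust(D, method = "ward.D2")   # Ward's minimum variance (named "ward" in 1999 R)
plot(hs)                              # dendrogram
cutree(hs, k = 3)                     # cut the tree into a 3-cluster partition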
Usually there is no "correct" combination of distance and linkage method. Clustering in general, and hierarchical clustering in particular, should be seen as exploratory data analysis, and different combinations may reveal different features of the data set. See standard textbooks on multivariate data analysis for details [10].
C. Problems of Classic Methods
Both partitioning and hierarchical cluster methods have particular strengths and weaknesses. Hierarchical methods provide solutions for $K = 1, \ldots, N$ clusters which are compatible in the sense that a solution with fewer clusters is obtained by joining some clusters; hence clusters at a finer resolution are simply subgroups of the bigger clusters. Hierarchical methods are also more flexible in the sense that they can be more easily adapted to distance measures other than the usual metric distances (Euclidean, absolute). The greatest weakness of hierarchical methods is the computational effort involved. The input typically consists of a distance matrix between all data points, which is of size $O(N^2)$. In each iteration all clusters have to be compared in order to join the closest two, resulting in long runtimes. This makes hierarchical methods infeasible for large data sets.
Partitioning methods, especially online algorithms, scale much better to large data sets. Solutions for different numbers of clusters need not be nested, such that they often cannot easily be compared, but they are more flexible at different resolution levels. However, partitioning methods are not as flexible as hierarchical methods with respect to distance measures. In addition, all partitioning methods are iterative stochastic procedures and depend very much on initialization: running the K-means algorithm twice with different starting points on the same data set may result in two different solutions. There is also the open problem of choosing the "correct" number of clusters; many different indices have been developed for this model selection task, but none has yet been globally accepted [11].
III. The Bagged Cluster Algorithm
In this section we introduce a novel clustering algorithm combining partitioning and hierarchical methods. The central idea is to stabilize partitioning methods like K-means or competitive learning by repeatedly running the cluster algorithm and combining the results. K-means is an unstable method in the sense that in many runs one will not find the global optimum of the error function but only a local optimum. Both the initialization and small changes in the training set can have a big influence on the actual local minimum to which the algorithm converges, especially when the correct number of clusters is unknown.
By repeatedly training on new data sets one gets different solutions which should, on average, be independent of training set influence and random initializations. We can obtain a collection of training sets by sampling from the empirical distribution of the original data, i.e., by bootstrapping. We then run a partitioning cluster algorithm, called the base cluster method below, on each of these training sets. However, we are then left with the typical problem that arises when one obtains several cluster results: there is no obvious way of choosing the "correct" one (when they partition the input space differently but have similar error) or of combining them.
In [12] the authors propose a voting scheme for cluster algorithms. The voting proceeds by pairwise comparison of clusters, measuring the similarity of two clusters by the number of points they share. The combined clustering provides a fuzzy partition of the data.
We propose to combine the cluster results by hierarchical clustering, i.e., the results of the base methods are combined into a new data set which is then used as input for a hierarchical method. The bagged clustering algorithm works as follows (a minimal R sketch of these steps is given after the list):

1. Construct $B$ bootstrap training samples $\mathcal{X}_N^1, \ldots, \mathcal{X}_N^B$ by drawing with replacement from the original sample $\mathcal{X}_N$.

2. Run the base cluster method (K-means, competitive learning, ...) on each set, resulting in $B \cdot K$ centers $c_{11}, c_{12}, \ldots, c_{1K}, c_{21}, \ldots, c_{BK}$, where $K$ is the number of centers used in the base method and $c_{ij}$ is the $j$-th center found using $\mathcal{X}_N^i$.

3. Combine all centers into a new data set $\mathcal{C}^B = \mathcal{C}^B(K) = \{c_{11}, \ldots, c_{BK}\}$.

4. (Optional) Prune the set $\mathcal{C}^B$ by computing the partition of $\mathcal{X}_N$ with respect to $\mathcal{C}^B$ and removing all centers whose corresponding cluster is empty (or below a predefined threshold $\gamma$), resulting in the new set
$$\mathcal{C}^B_{\mathrm{prune}}(K, \gamma) = \left\{ c \in \mathcal{C}^B(K) \;\middle|\; \#\{x : c = c(x)\} \geq \gamma \right\}$$
We also make all members of $\mathcal{C}^B_{\mathrm{prune}}(K, \gamma)$ unique, i.e., remove duplicates.

5. Run a hierarchical cluster algorithm on $\mathcal{C}^B$ (or $\mathcal{C}^B_{\mathrm{prune}}$), resulting in the usual dendrogram.

6. Let $c(x) \in \mathcal{C}^B$ denote the center closest to $x$. A partition of the original data can now be obtained by cutting the dendrogram at a certain level, resulting in a partition $\mathcal{C}^B_1, \ldots, \mathcal{C}^B_m$, $1 \leq m \leq BK$, of the set $\mathcal{C}^B$. Each point $x \in \mathcal{X}_N$ is then assigned to the cluster containing $c(x)$.
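A minimal R sketch of steps 1-6 (my illustration, not the paper's implementation; the function name bagged_clust and its return values are made up, and the optional pruning step 4 is omitted):

## Sketch of bagged clustering: x is the data matrix, B the number of
## bootstrap samples, K the number of base centers, `method` the linkage.
bagged_clust <- function(x, B = 10, K = 20, method = "single") {
  N <- nrow(x)
  ## Steps 1-3: base method on B bootstrap samples, pool the B*K centers
  centers <- do.call(rbind, lapply(1:B, function(b) {
    xb <- x[sample(N, N, replace = TRUE), , drop = FALSE]  # bootstrap sample
    kmeans(xb, centers = K)$centers                        # K-means as base method
  }))
  ## Step 5: hierarchical clustering of the pooled centers
  hc <- hclust(dist(centers), method = method)
  ## Step 6 (preparation): index of the closest center for each data point
  M  <- nrow(centers)
  d2 <- as.matrix(dist(rbind(centers, x)))[-(1:M), 1:M]
  list(centers = centers, hc = hc, nearest = apply(d2, 1, which.min))
}
## Cutting the dendrogram at m clusters then partitions the original data:
##   res <- bagged_clust(x); part <- cutree(res$hc, k = m)[res$nearest]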
Example
We take a simple 2-dimensional example from [12] to
demonstrate the algorithm. The data set consists of 3900
points in 3 clusters as shown in Figure 1. The authors call
this example "Cassini" because the shape of the two big clusters is a Cassini curve.
Fig. 1. Cassini problem: local minimum (left) and global minimum (right) of the K-means algorithm.
Standard K-means clustering cannot completely find the structure in the data because the outer clusters are not convex. Hence, even when we start K-means with the true centers (the mean values of the three groups), such that the algorithm converges immediately, we make errors at the edges (left plot in Figure 1); this error, however, is very small. The real problem is that the "true cluster partition" is only a local minimum of error function (1). The minimum error solution found in 1000 independent repetitions of the K-means algorithm splits one of the large clusters into two parts and ignores the small cluster in the middle (right plot in Figure 1). Note that we used the correct number of clusters, information which is typically not available for real-world problems.
Fig. 2. Cassini problem: 200 centers placed by bagged clustering (left) and final solution (right) obtained by combining the 200 centers using hierarchical clustering.
We now apply our bagged cluster algorithm to this data using B = 10 bootstrap training samples and K-means as base method with K = 20 centers in each run. The left plot in Figure 2 shows the resulting 200 centers. We then perform hierarchical clustering (Euclidean distance, single linkage) on these 200 points. The three-cluster partition can be seen in the right plot of Figure 2; it recovers the three clusters without error.
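With the sketch from Section III, this experiment corresponds roughly to the following call (my code; cassini denotes a hypothetical 3900 x 2 data matrix, data generation omitted):

res  <- bagged_clust(cassini, B = 10, K = 20, method = "single")
part <- cutree(res$hc, k = 3)[res$nearest]  # three-cluster partition of the data
plot(cassini, col = part)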
IV. Discussion
A. Number of Clusters
By inspection of the dendrogram from the hierarchical clustering one can in many cases infer the appropriate number of clusters for a given data set. The dendrogram corresponding to single linkage of the Cassini example from above is shown in Figure 3 and clearly indicates that 3 clusters are present in the data. The lower plot shows the relative height at which the next split after the current one occurs for 1, ..., 20 clusters (black line). The grey line shows the first differences of the black line. This value is large for splits into two well-separated subtrees; the differences are small if another split occurs shortly after the respective split. Note that we did not need to use the correct number of clusters in the bagged clustering algorithm; it can be inferred from the dendrogram.
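The computation behind the lower plot can be sketched as follows (my code; hc is the hclust object from the sketch in Section III, and the plotting details of Figure 3 are guessed). The merge heights stored in an hclust object give, for each number of clusters, the height at which the next split occurs:

hc <- res$hc                                  # hclust object from bagged clustering
h  <- rev(hc$height)[1:20] / max(hc$height)   # relative height of next split (black line)
dh <- -diff(h)                                # first differences (grey line)
plot(1:20, h, type = "b")
lines(2:20, dh, col = "grey")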
B. Preprocessing for Hierarchical Clustering
The Cassini example above could also be solved by direct hierarchical clustering with single linkage of the original data set. However, the number of data points, N = 3900, is too large: the distance matrix alone has more than 15 million entries.
Fig. 3. Cassini problem: Hierarchical clustering of 200 bagged cluster centers using single linkage.
Of course the data structure can easily be represented by fewer samples; if we take a subsample of size 100 and cluster it hierarchically, we get the same solution as with bagged clustering. The correct number of clusters can also be inferred from the dendrogram.
However, taking a subsample of the original data has the disadvantage that possibly valuable information gets lost. Hence, the complexity reduction (reduction of the data set size) should take all data into account. This can be done by vector quantization techniques, i.e., by using a partitioning cluster method such as K-means with a large K. The quantization performs a smoothing of the original data; noise and outliers are removed.
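The standard textbook approach just described can be sketched as (my code; K = 100 is an illustrative value):

vq <- kmeans(x, centers = 100, nstart = 5)          # one K-means run with large K
hc <- hclust(dist(vq$centers), method = "single")   # cluster the quantized data
## compare with simple subsampling, which ignores most of the data:
## hclust(dist(x[sample(nrow(x), 100), ]), method = "single")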
Bagged clustering differs from this standard textbook approach in that much fewer centers are used in the partitioning step, i.e., the smoothing (in a single run) is much stronger (see also the first part of Theorem 2 below). However, new variation is introduced by the use of bootstrap samples, such that the overall result is less dependent on random fluctuations in the training set and on the random seed (initial centers) of the base method.
C. Combination of Independent Cluster Results
The second view of (and initial motivation for) bagged clustering is that it offers a method for combining several outcomes of partitioning methods. Partitioning methods are usually iterative optimization techniques which can easily get stuck in local minima and depend heavily on starting conditions. Intuitively, a researcher will trust a certain outcome much more if it is reproducible, i.e., if different restarts of the algorithm produce the same result and the resulting centers are always the same (or at least close). Ideally one would use independent training sets for each repetition in order to become independent of a particular training set. With the same motivation as in bagged regression or classification [1], we replace the (in practice typically unavailable) training sets drawn independently from the data-generating distribution by bootstrap samples and run the base method on each. One then has to check whether several of the resulting centers are close to each other and group them accordingly, which is exactly what hierarchical clustering has been designed for.
D. Bagged Clustering for Very Large Data Sets
As the prices of data storage devices (hard disks, ...) have decreased dramatically during the last decade, more and more data get stored. E.g., supermarkets or telephone companies routinely log all consumer transactions, such that the corresponding data sets easily reach the Gigabyte range. For such very large data sets even the most basic calculations like the sample mean or variance are computationally very intensive, and many standard statistical approaches become as infeasible as they were 30 years ago (back then for small to moderate sample sizes).

Committee methods offer a viable way of adapting standard algorithms to scale well as the number of samples increases. In data mining situations it is no problem to get several independent training sets: one simply draws several sets $\mathcal{X}^i_{N_1}$ from the original sample, where $N_1 \ll N$ is chosen as large as possible while the sets can still be handled with reasonable computational effort. Finally, the partial results inferred from the subsamples are combined into a final solution; in our case the combination is done by hierarchical clustering, as sketched below.
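A hypothetical variation of the bagged_clust() sketch for this setting: step 1 changes from bootstrap resampling to drawing B manageable subsamples of size N1 << N, and the rest stays the same.

## B, K and the linkage method as before; N1 is the subsample size
centers <- do.call(rbind, lapply(1:B, function(b) {
  xb <- x[sample(nrow(x), N1), , drop = FALSE]  # subsample without replacement
  kmeans(xb, centers = K)$centers
}))
hc <- hclust(dist(centers), method = "single")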
V. Experiments
A. Data Sets
We have tested bagged clustering on several benchmark examples: five examples with continuous data and three examples with binary data which are related to our current research on data segmentation for tourism marketing. As base methods we used K-means and hard competitive learning. In the hierarchical step we used Euclidean distance for continuous data and absolute distance for the binary data (these combinations worked best); the agglomeration methods were single linkage and Ward's method.
The continuous examples are:

Cassini: Three clusters in 2-dimensional space. Two bigger clusters of size 1500 each with one small cluster of size 900 in between. See Figure 1 for details. This example is taken from [12].

Quadrants: Four clusters in 3-dimensional space. Three large quadrant-shaped clusters of size 500 are located around a smaller cube-shaped cluster of size 200. This example is also taken from [12].
TABLE I
Scenario 1: Symmetric distribution of 0s and 1s.

         x1-x3   x4-x6   x7-x9   x10-x12      m
Type 1    high    high     low       low   1000
Type 2     low     low    high      high   1000
Type 3     low    high    high       low   1000
Type 4    high     low     low      high   1000
Type 5     low    high     low      high   1000
Type 6    high     low    high       low   1000
2 Spirals: This example is often used as a classification benchmark and consists of 2 spiral-shaped clusters. As this is a hard problem even for supervised learners, we increased the size of the training set considerably to make the example feasible for unsupervised learners, such that both spirals contain 2000 points each. A similar example with 3 spirals has been used in [13] as a cluster benchmark.
Iris: Edgar Anderson's Iris data, 150 4-dimensional observations on 3 species of iris (setosa, versicolor, and virginica).

Segmentation: Image data drawn randomly from a database of 7 outdoor images. Each instance gives 19 statistics (centroids, densities, saturation, ...) of a 3 × 3 pixel region. The size of the complete data set is 2310 (330 per class). This example was taken from the UCI repository of machine learning databases at http://www.ics.uci.edu/~mlearn/.
The binary examples are taken from a larger collection of data scenarios from tourism marketing [14]. These scenarios model "typical" data found in tourism marketing in a simplified manner and are the result of joint efforts between researchers from statistics and management science to create a benchmark collection for this type of data. All examples have been tested with many different clustering algorithms [15], such that their characteristics are well known. All scenarios use 12-dimensional binary data with 6 clusters grouped as shown in Table I.
Scenario 1: Variables denoted as "high" are 1 with probability 0.8, "low" variables are 1 with a probability of 0.2. All clusters have size 1000.

Scenario 3: The probability for "low" is increased from 0.2 to 0.5. All clusters again have size 1000.

Scenario 5: The probability for "low" to be 1 is 0.2 as in scenario 1; however, the cluster sizes are 1000, 300, 700, 3000, 500, and 500, respectively.
B. Evaluation Method
As the true clusters are known in classi�cation problems,
we can evaluate the partition provided by a cluster algo-
rithm by comparing it with the true classes. Let ai denote
the true classes of a problem and bj denote the clusters
found by the cluster algorithm. We then associate cluster
bj with class ai if the majority of points in bj is from class
ai. All points in bj are then classi�ed as ai. Note that the
number of cluster need not necessarily be the same as the
number of classes.
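A sketch of this evaluation procedure (my code; function and variable names are made up):

## Majority voting: turn a cluster partition into a classifier and
## return the fraction of correctly classified cases.
cluster_class_rate <- function(truth, clusters) {
  tab <- table(clusters, truth)                        # clusters b_j vs. classes a_i
  majority <- colnames(tab)[apply(tab, 1, which.max)]  # majority class per cluster
  predicted <- majority[match(clusters, rownames(tab))]
  mean(predicted == truth)
}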
Using this procedure, every cluster algorithm can be turned into a classification algorithm. As the class information is not used during partitioning (unsupervised learning) but only in the final step of associating clusters with classes, the classification performance of such a "clustering classifier" will usually be (much) worse than the performance of a dedicated classification method which uses the class labels during training. However, we can use this method to compare and measure the ability of different cluster algorithms to recover structures in data.
The "winner-take-all" matching between clusters and classes described above is of course not the only possibility. Another option is to compare class and cluster centers and match them according to their distance. Obviously this only makes sense for problems and cluster methods with convex clusters, where centers and segments are dual. The center-based matching is appropriate if the cluster centers themselves are used after clustering; e.g., in marketing research one is often interested in customer profiles as described by the mean values of market segments.
C. Results
All experiments were performed using the R software package for statistical computing, which is a free implementation of the S language and can be downloaded from http://www.ci.tuwien.ac.at/R. R functions for bagged clustering and corresponding graphs will soon be available on the web; in the meantime they can be obtained from the author upon request.
Table II shows the results of our experiments for the continuous data sets. First we used hard competitive learning (HCL) and K-means (KMN) as benchmark algorithms (with the correct number of centers). Then we ran bagged clustering (BC) with these two base methods, both with single (s) and Ward's (w) linkage. We used 10 bootstrap samples and 20 base centers for BC and produced the correct number of clusters by cutting the tree at the respective level (before the cluster-class matching). All algorithms were run 100 times on each data set.
The first three columns of the table give the median (Med), mean, and standard deviation (SD) of the correctly classified cases in percent. The last column (MErr) gives the percentage of correctly classified cases of the minimum error run, i.e., the run which minimized the internal error criterion of the algorithm over the 100 repetitions (and would hence be chosen when clustering real data). For HCL and K-means this internal error is simply the sum of within-cluster variances, which measures the size of the clusters. For bagged clustering we also use the sum of cluster sizes of the hierarchical clustering (with respect to the current linkage method), where each cluster size is given by the height of the respective branch in the dendrogram. Note that the table reports the percentage of correctly classified cases, not the internal error criterion.
The Cassini problem is "almost" solvable for HCL and K-means, with only few errors at the edges. However, this solution is only a local minimum of the error function; the global optimum splits one of the large clusters into two parts and ignores the small one, getting approximately 77% correct. In some repetitions HCL and K-means do better, such that the mean is greater than the median and the standard deviation is rather large. Bagged clustering makes no error at all in more than half of all repetitions (the median is 100%), and we can also detect the correct solution using the MErr criterion. Even the mean value is above 98% for Ward's method and above 99% for single linkage. It is not surprising that single linkage performs better for this example, because the true clusters are non-convex and non-overlapping.
TABLE II
Results for continuous problems.

                 Med    Mean     SD    MErr
Cassini
HCL            76.92   80.91   8.55   76.92
KMN            76.92   79.02   8.57   76.92
BC (HCL, s)   100.00   99.82   1.67  100.00
BC (HCL, w)   100.00   98.47   5.48  100.00
BC (KMN, s)   100.00   99.45   3.23  100.00
BC (KMN, w)   100.00   98.25   5.90  100.00
Quadrants
HCL            88.23   91.27   5.20   88.17
KMN            88.23   88.56   2.02   88.17
BC (HCL, s)   100.00  100.00      -  100.00
BC (HCL, w)   100.00   98.12   4.31  100.00
BC (KMN, s)   100.00   99.99   0.01  100.00
BC (KMN, w)   100.00   97.92   4.46  100.00
2 Spirals
BC (HCL, s)   100.00   87.87  19.83  100.00
BC (KMN, s)    95.67   80.11  22.04   84.95
Iris
HCL            89.33   89.04   0.32   89.33
KMN            89.33   84.80   9.11   89.33
BC (HCL, w)    91.33   91.07   0.75   91.33
BC (KMN, w)    90.00   90.27   0.34   90.00
Segmentation
HCL            56.62   55.57   3.05   57.40
KMN            55.97   55.90   3.23   53.12
BC (HCL, w)    61.23   61.28   3.22   63.16
BC (KMN, w)    60.24   60.49   2.91   62.42
The results for the quadrants problem are similar. Again the correct solution is only a local optimum for HCL and K-means, such that the average performance of both algorithms is around 90%; the MErr solution gets 88% correct. Again bagged clustering solves the problem in more than 50% of all repetitions. Using bagged clustering with HCL and single linkage even gave the correct solution in all repetitions.
The spirals problem is of course unsolvable for HCL and K-means, as the two clusters cannot even be approximated by convex sets. For the same reason we use only single linkage for bagged clustering. Both clusters are very long and thin and rather close together; hence many support points are needed for single linkage to succeed, and we used K = 100 centers for the base methods. Using HCL as base method yields excellent results with a median of 100%. K-means is not as good, but still gives competitive results.
The two real-world data sets (iris, segmentation) both have overlapping classes, hence only Ward's linkage is used, as single linkage is not appropriate for overlapping clusters (the clusters would be joined immediately). The iris data set is rather small (150 cases), hence we use fewer centers in the base method (K = 10). The segmentation data are larger and high-dimensional; here K = 30 gave good performance. For both data sets bagged clustering increases the number of correctly classified cases.
For the binary data scenarios we are not only interested in statistics of the number of correctly classified cases, but also in how often the correct group profiles (cluster centers) were found. Each center is converted to a binary vector by thresholding at the overall mean value, which is 0.5 for scenarios 1 and 5 and 0.65 for scenario 3. A center is considered detected if all bits are equal (Hamming distance of zero between thresholded cluster center and thresholded class mean). The practical motivation for this procedure is that in marketing research, groups are often characterized as having a certain feature "above or below average".
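This detection criterion can be sketched as follows (my code; centers and class_means are matrices with one row per cluster center and class profile, respectively):

## Count the class profiles matched by at least one thresholded center;
## m0 is the overall mean (0.5 for scenarios 1 and 5, 0.65 for scenario 3).
centers_found <- function(centers, class_means, m0 = 0.5) {
  cb <- centers > m0       # thresholded cluster centers
  tb <- class_means > m0   # thresholded class means
  sum(apply(tb, 1, function(profile)
    any(apply(cb, 1, function(ctr) all(ctr == profile)))))
}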
In all scenarios the classes overlap, hence a 100% correct classification is impossible. The (optimal) Bayes rate is 82.98% for scenario 1, 48.93% for scenario 3, and 88.85% for scenario 5. Again we use only Ward's linkage. For all scenarios we use both the correct number of centers (K = 6) in the base method and a larger value of K = 20.
HCL is close to the Bayes classifier for scenario 1, hence almost no further improvement is possible. Bagged HCL is slightly better than plain HCL for K = 6 and slightly worse for K = 20. For K-means, bagging stabilizes the performance for K = 6 close to the base rate; the performance decrease for K = 20 is larger than with HCL, but bagging finds the correct group centers more reliably.
In the other two scenarios bagging always improves on the classification rate of the base methods and boosts the number of centers found (except for K = 6 in scenario 3). Scenario 3 has turned out to be very hard to learn [15]: the maximum number of correctly identified centers using 12 different cluster algorithms (HCL, K-means, neural gas, self-organizing maps, an improved fixpoint method, and variants of these) has been 4 so far. Bagged clustering based on K-means identifies all 6 clusters in some runs using K = 20; however, we currently have no way of identifying these runs (without using the true data structure).
TABLE III
Results for binary data scenarios from tourism marketing.

                      Classification rate          Centers found
                K    Med    Mean     SD    MErr    min   mean   max
Scenario 1
HCL             -  82.46   82.49   0.24   82.31      6      6     6
KMN             -  72.45   80.60   4.76   82.28      4   5.82     6
BC (HCL, w)     6  82.46   82.46   0.25   82.93      6      6     6
BC (KMN, w)     6  81.65   81.48   1.00   81.65      6      6     6
BC (HCL, w)    20  81.17   81.03   0.88   79.50      6      6     6
BC (KMN, w)    20  76.78   76.69   1.71   75.13      6      6     6
Scenario 3
HCL             -  29.08   29.34   1.42   29.08      0   1.14     3
KMN             -  31.29   31.76   2.29   27.68      0   0.76     3
BC (HCL, w)     6  31.43   31.41   1.19   29.40      0   1.00     1
BC (KMN, w)     6  31.53   31.82   1.75   30.52      0   2.60     2
BC (HCL, w)    20  34.93   35.06   1.91   37.13      0   1.31     4
BC (KMN, w)    20  35.55   35.48   2.03   37.68      0   2.07     6
Scenario 5
HCL             -  80.09   79.80   0.71   78.96      4   5.36     6
KMN             -  79.09   78.82   1.87   79.05      4   4.86     6
BC (HCL, w)     6  86.54   86.29   0.90   84.18      5   5.95     6
BC (KMN, w)     6  84.52   84.33   2.18   84.25      5   5.70     6
BC (HCL, w)    20  84.31   84.29   1.11   84.40      6      6     6
BC (KMN, w)    20  82.12   81.74   1.81   82.28      5   5.97     6

The dendrogram of scenario 5 (Figure 4) clearly shows
the structure of the data set. One big cluster (half of the training set) dominates; then there are larger clusters of size 1000 and 700, and finally some small clusters of sizes 300 and 500 (twice). Due to the symmetries in the data set, the dendrogram suggests using 2 clusters (big cluster vs. others) or 6 clusters.
Fig. 4. Scenario 5: Hierarchical clustering of 200 bagged cluster centers using Ward's method.
VI. Analysis of the Algorithm
A. Effects of Pruning
In the first phase of bagged clustering we apply a (partitioning) base cluster method to bootstrap samples of the original data. Let $\mathcal{P}$ denote the base cluster method (including all hyperparameters such as learning rates, random initializations, ...). Further let

$$\mathcal{C}^\star(K) = \mathcal{C}^\star(K, \mathcal{X}_N, \mathcal{P}) := \{ c \mid \exists\, \mathcal{C}^B(K) : c \in \mathcal{C}^B(K) \}$$

denote the (theoretical) set of all cluster centers that can be generated by the base method when applied to bootstrap samples of size $N$. Analogously define $\mathcal{C}^\star_{\mathrm{prune}}(K, \gamma)$.
Theorem 1: For bagged clustering as defined above,

$$\mathcal{C}^\star_{\mathrm{prune}}(K, \gamma) = \{ x \in \mathcal{X}_N \mid \#\{x_i \in \mathcal{X}_N : x_i = x\} \geq \gamma \}$$

Corollary 1: Bagged clustering with pruning of empty clusters ($\gamma = 1$) is asymptotically equivalent to hierarchical clustering of the original data set (where the asymptotics are with respect to $B$).

Corollary 2: If $x_i \neq x_j$ for all $i \neq j$, then $\mathcal{C}^\star_{\mathrm{prune}}(K, \gamma) = \emptyset$ for all $\gamma > 1$.

Proof: The sample $\mathcal{X}_N^{[n]} := \{x_n, \ldots, x_n\}$ containing only replicates of $x_n$ is a valid bootstrap sample (generated with probability $N^{-N}$ if all points in $\mathcal{X}_N$ are unique), and trivially any partitioning cluster algorithm $\mathcal{P}$ should output $x_n$ as the unique center in this case. Hence $x_n \in \mathcal{C}^\star(K)$ for all $n = 1, \ldots, N$, and therefore $\mathcal{X}_N \subseteq \mathcal{C}^\star(K)$ for all $K$. Pruning removes all points that are not contained at least $\gamma$ times in $\mathcal{X}_N$. □
The above theorem basically says that in the limit $B \to \infty$, bagged clustering with pruning is identical to clustering the original data set. Note that without pruning the algorithm behaves completely differently, as bootstrap samples containing only a few original data points have very low probability (and hence centers with extreme overfitting are also not very frequent). Pruning effectively "thins out" regions containing many centers while keeping outliers, and should therefore be used only very carefully. The main advantage of pruning is that it can drastically reduce the number of centers used as input for the hierarchical clustering step.
B. Cluster Distance and Background Noise
Clustering the original data set with the base method amounts to drawing a single $K$-tuple from some probability distribution $G = G(F, \mathcal{P}, K, N)$ depending on the (unknown) data-generating distribution $F$ of $x$, the cluster algorithm $\mathcal{P}$ (random initializations, ...), $K$, and $N$. By replacing the true distribution $F$ of $x$ with the empirical distribution $\hat{F}$ of $\mathcal{X}_N$ we bootstrap the base cluster algorithm and are hence provided with a sample $\mathcal{C}^B(K)$ drawn from $\hat{G} = G(\hat{F}, \mathcal{P}, K, N)$. Standard bootstrap analysis would now proceed by computing statistics like the mean, standard deviation, or confidence intervals; see [16] for a comprehensive introduction to the bootstrap. Bagged clustering instead explores $\mathcal{C}^B(K)$ using hierarchical clustering.
For the following we need generalized linkage methods using distances on continuous sets; these can easily be obtained by replacing all minima/maxima with infima/suprema, and sizes of sets with probabilities of sets (with respect to the data distribution $F$). E.g., the continuous generalization of the distance corresponding to single linkage is

$$d_s(\mathcal{A}, \mathcal{B}) = \inf_{a \in \mathcal{A},\, b \in \mathcal{B}} d(a, b)$$

and for Ward's method we get

$$d_w(\mathcal{A}, \mathcal{B}) = \frac{2\, \mathbb{P}_F(\mathcal{A})\, \mathbb{P}_F(\mathcal{B})}{\mathbb{P}_F(\mathcal{A}) + \mathbb{P}_F(\mathcal{B})}\; \|\bar{a} - \bar{b}\|^2$$
Additionally we will need the following properties for distances of sets:

(I) For all nonempty subsets $\mathcal{B}$ of some convex set $\mathcal{A}$ and all sets $\mathcal{C}$ with $\mathcal{A} \cap \mathcal{C} = \emptyset$, it follows that $d(\mathcal{A}, \mathcal{C}) > 0 \Rightarrow d(\mathcal{B}, \mathcal{C}) > 0$.

(II) For all nonempty compact sets $\mathcal{B}$ in the interior of $\mathcal{A}$ and all sets $\mathcal{C}$ with $\mathcal{A} \cap \mathcal{C} = \emptyset$, it follows that $d(\mathcal{B}, \mathcal{C}) > d(\mathcal{A}, \mathcal{C})$.

It follows directly from these definitions that $d_s$ fulfills both (I) and (II), while $d_w$ fulfills only (I).
Suppose that $F$ is absolutely continuous on the input space $\mathcal{X}$, such that the density $f$ exists. One possible definition of a "cluster" is to assume that $f$ is multimodal, with each mode corresponding to one cluster [17]. The following definition characterizes a clustering problem by the amount of background noise $\eta$ (the maximum density outside the clusters) and the minimum distance between the clusters.
Definition 1: We call the pairwise disjoint sets $\mathcal{A}_i \subseteq \mathcal{X}$, $i = 1, \ldots, M$, $(\eta, \delta)$-separated clusters with respect to distance $d$, if

1. $\forall x \in \mathcal{X}: f(x) \geq \eta \Rightarrow \exists i : x \in \mathcal{A}_i$,
2. $\min_{i \neq j} d(\mathcal{A}^\eta_i, \mathcal{A}^\eta_j) = \delta$, where $\mathcal{A}^\eta_i = \{ x \in \mathcal{A}_i \mid f(x) \geq \eta \}$.

In general, $(\eta, \delta)$-separated clusters do not correspond to the usual notion of partitions (as returned by a partitioning cluster algorithm), because they do not form a partition of the complete input space $\mathcal{X}$. However, $(\eta, \delta)$-separated clusters are of course a partition of $\{x \in \mathcal{X} : f(x) \geq \eta\}$. A sensible assumption on any cluster algorithm $\mathcal{P}$ is that it places cluster centers with high probability in regions of high data density $f$, i.e., in the modes of $f$, if sufficiently many centers are available. Suppose that $G$ is absolutely continuous such that its density $g$ exists.
Theorem 2: Let $\mathcal{A}_i \subseteq \mathcal{X}$, $i = 1, \ldots, M$, be $(\eta, \delta)$-separated clusters with respect to linkage method $d$ and density $f$ on $\mathcal{X}$ for some $\eta, \delta > 0$. Further let $\mathcal{B}_i \subseteq \mathcal{A}_i$, $i = 1, \ldots, M$, be a set of compact subsets of the clusters. If

$$g(x) > f(x) \quad \forall x \in \bigcup_{i=1}^{M} \mathcal{B}_i, \qquad g(x) \leq f(x) \quad \text{otherwise,}$$

then $\exists\, \eta_1 > \eta$ such that the $\mathcal{B}_i$, $i = 1, \ldots, M$, are $(\eta_1, \delta_1)$-separated clusters with $\delta_1 \geq 0$. If additionally

1. $d$ fulfills (I), then $\exists\, \delta_2 > 0$ such that the $\mathcal{B}_i$ are $(\eta_1, \delta_2)$-separated;
2. all $\mathcal{B}_i$ are in the interior of the respective $\mathcal{A}_i$ and $d$ fulfills (II), then $\exists\, \delta_3 > \delta$ such that the $\mathcal{B}_i$ are $(\eta_1, \delta_3)$-separated.

Proof: Using the compactness of the $\mathcal{B}_i$ we get that $g$ attains its minimum over $\bigcup \mathcal{B}_i$ at some $b_1 \in \bigcup \mathcal{B}_i$. Let $\eta_1 := g(b_1)$; then $\eta_1 = g(b_1) > f(b_1) \geq \eta$. The distance properties 1 and 2 follow directly from (I) and (II). □
Bagged clustering transforms the clustering problem in the original data space into a new problem in the space of centers produced by the base method. If the centers of the base method are concentrated in the modes of $f$, then the new clusters are smaller than the original ones and have higher density. Whether the distance between the new clusters is larger than the distance between the original ones depends on the linkage method actually used: single linkage distance will increase, but Ward's distance or average distance between clusters may even decrease.
E.g., consider Gaussian clusters of equal size $N/K$ and assume that the number of clusters $K$ is known. Then the minimum error solution of K-means places a cluster center at the mean of each cluster (this is also the maximum likelihood solution). By clustering bootstrap samples we get centers with multivariate normal distributions around the true cluster means, with $K/N$ times the original variance. Hence the clusters in the new space have only $K/N$ times the size of the original clusters, and the single linkage distance grows accordingly. Ward's distance between clusters remains unchanged, because the centers of the new clusters are the same as the centers of the original clusters. However, Ward's distance within clusters also decreases by a factor of $K/N$.
VII. Summary
We have presented a novel clustering framework which allows for the combination of hierarchical and partitioning algorithms. Partitioning cluster algorithms such as K-means are used to concentrate centers in regions of high data density and to remove background noise. These centers are then explored using hierarchical methods. The algorithm compares favorably with standard partitioning techniques on a mixture of artificial and real-world benchmark problems.
We are currently extending this research in several directions. The exact number $K$ of centers used by the base method did not seem to be a critical parameter in our simulations; however, better guidelines for choosing $K$ are needed. Another future direction involves the interpretation of the result of the hierarchical clustering, as a broad spectrum of methods for analyzing dendrograms is available in the literature; this includes splitting the tree into clusters by more refined methods than a horizontal cut, and choosing the number of clusters. Finally, we are also working on new methods for the graphical visualization of (bagged) clusters in binary data sets.
Acknowledgement
This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modeling in Economics and Management Science'). The author wants to thank Kurt Hornik and Andreas Weingessel for helpful discussions.
References
[1] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[2] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Thirteenth International Conference on Machine Learning, 1996.
[3] J. Max, "Quantizing for minimum distortion," IRE Transactions on Information Theory, vol. IT-6, pp. 7-12, Mar. 1960.
[4] K. Pötzelberger and H. Strasser, "Data compression by unsupervised classification," Report 10, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, 1997.
[5] F. Leisch, A. Weingessel, and E. Dimitriadou, "Competitive learning for binary valued data," in Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98) (L. Niklasson, M. Bodén, and T. Ziemke, eds.), vol. 2, (Skövde, Sweden), pp. 779-784, Springer, Sept. 1998.
[6] D. Weinshall, D. W. Jacobs, and Y. Gdalyahu, "Classification in non-metric spaces," in Advances in Neural Information Processing Systems (M. Kearns, S. Solla, and D. Cohn, eds.), vol. 11, MIT Press, USA, 1999.
[7] T. M. Martinetz, S. G. Berkovich, and K. J. Schulten, ""Neural-Gas" network for vector quantization and its application to time-series prediction," IEEE Transactions on Neural Networks, vol. 4, pp. 558-569, July 1993.
[8] T. Kohonen, Self-Organizing Maps. Berlin: Springer, 1995.
[9] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data. New York, USA: John Wiley & Sons, Inc., 1990.
[10] J. Hartung and B. Elpelt, Multivariate Statistik: Lehr- und Handbuch der angewandten Statistik. München, Germany: Oldenbourg Verlag, fourth ed., 1992.
[11] G. W. Milligan, "Clustering validation: Results and implications for applied analyses," in Clustering and Classification (P. Arabie, L. Hubert, and G. De Soete, eds.), pp. 341-375, River Edge, NJ, USA: World Scientific Publishers, 1996.
[12] A. Weingessel, E. Dimitriadou, and K. Hornik, "A voting scheme for cluster algorithms," in Neural Networks in Applications, Proceedings of the Fourth International Workshop NN'99 (G. Krell, B. Michaelis, D. Nauck, and R. Kruse, eds.), (Otto-von-Guericke University of Magdeburg, Germany), pp. 31-37, 1999.
[13] Y. Gdalyahu, D. Weinshall, and M. Werman, "A randomized algorithm for pairwise clustering," in Advances in Neural Information Processing Systems (M. Kearns, S. Solla, and D. Cohn, eds.), vol. 11, MIT Press, USA, 1999.
[14] S. Dolnicar, F. Leisch, and A. Weingessel, "Artificial binary data scenarios," Working Paper Series 20, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, Sept. 1998.
[15] S. Dolnicar, F. Leisch, A. Weingessel, C. Buchta, and E. Dimitriadou, "A comparison of several cluster algorithms on artificial binary data scenarios from tourism marketing," Working Paper Series 7, SFB "Adaptive Information Systems and Modeling in Economics and Management Science", http://www.wu-wien.ac.at/am, 1998.
[16] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability, New York, USA: Chapman & Hall, 1993.
[17] H. H. Bock, "Probabilistic models in cluster analysis," Computational Statistics & Data Analysis, vol. 23, pp. 5-28, 1996.