UNCORRECTED PROOF
Analytica Chimica Acta xxx (2004) xxx–xxx
Minimum spanning tree: ordering edges to identify clustering structure
Michele Forina∗, M. Concepción Cerrato Oliveros, Chiara Casolino, Monica Casale
Dipartimento di Chimica e Tecnologie Farmaceutiche ed Alimentari, Facoltà di Farmacia, Università di Genova, Via Brigata Salerno (Ponte), Genova 16147, Italy
Received 8 September 2003; received in revised form 27 February 2004; accepted 27 February 2004
Abstract
Ordering edges to identify clustering structure (OETICS), the clustering algorithm presented here, is based on the minimum spanning tree connecting the objects. The edges of the tree are ordered, beginning from the longest edge, to form groups of objects separated by long edges. The plot of ordered edges is the main result of the algorithm. In some respects OETICS is similar to OPTICS, a known clustering technique that orders the objects with reference to the local density, but its solution is unique, because it does not require the selection of parameter values such as the generating distance in OPTICS.
OETICS is applied to many simulated and real data sets, with very different numbers of objects and variables, and the results are compared with those obtained by OPTICS.
© 2004 Published by Elsevier B.V.
Keywords: Clustering; Chemometrics; Pattern recognition
1. Introduction
Clustering techniques are widely used in chemistry, with different objectives. In food chemistry the objective is frequently the detection of groups of similar foods, of similar consumers, or of similar panellists. In pharmaceutical chemistry clustering techniques are used to detect groups of similar conformers, or of molecules with similar structure. In analytical chemistry clustering techniques are used to select a number of representative samples for calibration, or to evaluate the homogeneity of a calibration set.
Many clustering techniques [1,2] are used: visual clustering (on principal components or on the axes of projection pursuit), hierarchical agglomerative and divisive methods with the related dendrograms, non-hierarchical techniques such as K-means and K-medoids, and fuzzy clustering. However, because of the huge variety of problems and data, these techniques are not completely satisfactory, at least as can be deduced from the number of papers about new clustering techniques. One reason is that the dendrogram obtained with some techniques shows apparently well separated clusters even when there are no real clusters, as the example in Fig. 1 shows. The application of statistical tests to evaluate the significance of clusters (obtained by cutting the dendrogram, generally at the long branches) can erroneously confirm the existence of false clusters [3].
∗ Corresponding author. Tel.: +39-0103532630; fax: +39-0103532684. E-mail address: [email protected] (M. Forina).
A second reason is that today clustering techniques must sometimes be applied to very large databases. In this case the usual hierarchical techniques cannot be applied easily, both because it is rather rare to find computer programs able to draw dendrograms in the case of e.g. 100 samples, and because of difficulties in the interpretation.
A third reason is that there are two types of clustering. The first type, the so-called natural clustering, is associated with the presence of agglomerates of objects where the distance between the two closest objects of two different agglomerates (between-clusters distance) is clearly larger than the within-agglomerate distances. The second type of clustering is present when there are parts of the space with a high density of points, and parts with low density, without a sharp boundary. Many techniques have been developed to study the second type of clusters, with the principal objective of application to very large data sets. Among these techniques OPTICS [4], the final step of a number of developments, seems very powerful.
0003-2670/$ – see front matter © 2004 Published by Elsevier B.V.
doi:10.1016/j.aca.2004.02.064
ACA 225314 1–11
Fig. 1. A bidimensional data set drawn from a uniform distribution, and the dendrogram obtained with the average-linkage unweighted agglomeration technique.
The results of OPTICS depend on two settable parameters, the “generating distance” ε and the number of points required for the connectivity, “MinPoints”, and, obviously, on the scaling procedure. The possibility of optimising the technique by selecting suitable values of the parameters gives power to the technique, but this power can be exploited only by skilled people. For this reason, a clustering technique similar to OPTICS but without settable parameters is presented here; it has less power, but it has the simplicity that unskilled people may prefer. This technique works on the edges of the minimum spanning tree connecting the objects [5,6], and for this reason we call it ordering edges to identify clustering structure (OETICS), a name that echoes the method from which it derives.
As in the case of many other clustering techniques, both OPTICS and OETICS give great importance to the visual representation of clustering, so figures are the principal product of both techniques.
2. Data files
Some simulated data matrices with two variables and a number of objects from 20 to 1000 have been used in the development and evaluation of clustering techniques. Data set Twentydata (20 objects) is used to illustrate the details of the OETICS algorithm. It is reported in Table 1. Data set Kriegelexample was obtained by scanning a figure in the paper describing OPTICS. It has 300 objects. Data sets 40data, 50data and 75data have a special structure useful to illustrate some details of clustering with OPTICS or OETICS. They are shown in some of the figures below.
Table 1. Data set Twentydata
Index    X    Y    Index    X    Y    Index    X    Y    Index    X    Y
    1   46  500        2  124  579        3  166  536        4  236  509
    5  174  484        6  197  481        7  214  464        8  200  455
    9  191  438       10  235  452       11  316  434       12  355  455
   13  355  434       14  356  415       15  321  371       16  374  405
   17  403  437       18  380  379       19  491  463       20  529  473
Three real data sets have been used. Coffeedata ([7], available at [8]) has 43 objects (7 Robusta and 36 Arabica coffees) described by 13 chemical and physical variables. The two categories are separable. Oliveoils [9] has been used by many authors working with multivariate classification, e.g. [10,11]. The 572 objects are described by eight fatty acids, and are divided into nine categories, Italian regions. Breastcancer [12] contains 699 objects, described by nine variables. There are two categories, benign (458 objects) and malignant (241 objects).
In our experience (and in that of many people working with these data) the categories in Oliveoils and Breastcancer are not separable.
Simulated data were used without pretreatment, whereas autoscaling was applied to all the real data sets.
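As an illustration of the autoscaling pretreatment, the following minimal sketch centres each variable on its mean and divides it by its standard deviation. The function name `autoscale`, the small data matrix, and the use of the sample (n − 1) standard deviation are assumptions made here for illustration; the text does not state which estimator was used.

```python
import statistics

def autoscale(rows):
    """Column-wise autoscaling: subtract the column mean and divide by the
    column standard deviation (sample estimator, an assumption)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

# Hypothetical data: two variables measured on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
scaled = autoscale(data)
```

After autoscaling every variable has zero mean and unit standard deviation, so that Euclidean distances are not dominated by the variable with the largest numerical range.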
3. OPTICS
OPTICS is based on the reachability distance. Two parameters are defined by the operator. The first is ε, the generating distance: the radius of the spherical space S(q) around a given object q, used to define the non-connectivity. When in S(q) there are no objects other than q, this object is a singleton. Its reachability distance is UNDEFINED.
The objects whose reachability distance is undefined constitute the NOISE. The others are in one or more CLUSTERS.
The cardinality is the number of objects other than q in S(q). The second parameter, MinPoints, is the minimum cardinality of q with reference to S(q) required for the connectivity. An object q whose cardinality is ≥ MinPoints is a core point. Its core distance, core(q), is the distance of the MinPoints-th neighbour. In Fig. 2 the main points of the OPTICS definitions are explained in the case of MinPoints = 3. A point such as p1, with distance from q less than core(q), has reachability distance = core(q). A point such as p2, with distance from q larger than core(q) but less than ε, has reachability distance equal to its distance from q. When the distance is larger than ε the point cannot be reached directly from q.
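The definitions of core distance and reachability distance given above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of `None` for UNDEFINED, and the one-dimensional example points are assumptions.

```python
import math

def core_distance(points, q, eps, min_points):
    """Distance from object q to its MinPoints-th neighbour, or None when
    fewer than min_points other objects lie within eps of q."""
    dists = sorted(math.dist(points[q], p) for k, p in enumerate(points) if k != q)
    cardinality = sum(1 for d in dists if d <= eps)
    if cardinality < min_points:
        return None  # q is not a core point
    return dists[min_points - 1]

def reachability(points, p, q, eps, min_points):
    """Reachability distance of object p from object q, or None (UNDEFINED)."""
    core = core_distance(points, q, eps, min_points)
    d = math.dist(points[p], points[q])
    if core is None or d > eps:
        return None  # p cannot be reached directly from q
    return max(core, d)

# Hypothetical objects on a line; q is object 0.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
core_q = core_distance(points, 0, eps=5, min_points=3)
```

With MinPoints = 3 the core distance of object 0 is the distance of its third neighbour; closer objects inherit this value as reachability distance, farther ones (within ε) keep their own distance.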
Results of OPTICS are shown in a “reachability plot” such as that in Fig. 3, where the clusters (corresponding to a local high density of points) are identified by the minima of the reachability distance.
Fig. 2. Core distance and reachability distances. With MinPoints = 3, core(q) is the distance from the third neighbour.
Fig. 3. Data set Kriegelexample [4] and reachability plots (with different settings: generating distance 70, (A) MinPoints 10; (B) MinPoints 2).
The OPTICS algorithm starts with the definition of the generating distance and of the minimum cardinality. The starting reachability distance of all the objects is set to UNDEFINED, i.e. to a conventional value > ε.
The algorithm reads the objects in the data file, starting with object number 1. However, this reading (point (a) of the algorithm) is only one of the ways to access the data in the data file, so that during this ordered access the algorithm can also find objects that were processed previously.
(a) If the end of the data file has been reached, go to point (f). Otherwise read object I.
(b) If object I was processed previously, update I (I = I + 1) and go back to point (a).
(c) Object I is processed and added to the ORDEREDLIST.
  (c1) If the cardinality CARD(I) of the space S(I) is < MinPoints then: ReachDist(I) = 1.2ε (i.e. UNDEFINED), update I (I = I + 1), go back to point (a).
  (c2) Otherwise [CARD(I) ≥ MinPoints] a new cluster begins.
    (c21) Evaluate core(I) (distance of the MinPoints-th neighbour from I).
    (c22) Set ReachDist(I) = core(I).
    (c23) Add to a LIST the neighbours of the processed object (distance from the processed object ≤ ε), with their ReachDist (the reachability distance r_pi of the object p from i is the maximum between the distance of p from i and core(i)).
      Processed objects can be re-processed here and removed from the ordered list if their ReachDist was UNDEFINED (they are not core points, but they can be reached from a core point).
      If a neighbour was previously in LIST, its reachability distance is the minimum between the previous and the new ReachDist.
      Sort the objects in LIST according to their ReachDist.
(d) If LIST is empty, update I (I = I + 1) and go back to point (a).
(e) Read the first object in LIST. Let J be its index in the data file. Object J is removed from LIST, processed and added to the ORDEREDLIST.
  (e1) If CARD(J) < MinPoints go back to point (d) to read the next object in LIST.
  (e2) Otherwise [CARD(J) ≥ MinPoints]:
    (e21) Evaluate core(J).
    (e22) If ReachDist(J) > core(J) then ReachDist(J) = core(J).
    (e23) Go to point (c23) to add to LIST the neighbours of J.
(f) Algorithm END.
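The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: UNDEFINED is represented by `None` instead of 1.2ε, the re-processing of previously read noise objects in step (c23) is omitted, and the function name, the returned `marks` list (objects where step (c2) opens a new cluster) and the example data are hypothetical.

```python
import math

def optics_order(points, eps, min_points):
    """Return (order, reach, marks) following steps (a)-(f) in simplified form."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    def neighbours(i):
        return [k for k in range(n) if k != i and dist(i, k) <= eps]

    def core(i, nb):
        if len(nb) < min_points:
            return None
        return sorted(dist(i, k) for k in nb)[min_points - 1]

    reach = [None] * n          # None stands for UNDEFINED
    processed = [False] * n
    order, marks = [], []

    def expand(i, c, nb, seeds):
        # Step (c23): add unprocessed neighbours, keeping the smaller ReachDist.
        for k in nb:
            if not processed[k]:
                r = max(c, dist(i, k))
                if k not in seeds or r < seeds[k]:
                    seeds[k] = r

    for i in range(n):                      # steps (a), (b)
        if processed[i]:
            continue
        processed[i] = True                 # step (c)
        order.append(i)
        nb = neighbours(i)
        c = core(i, nb)
        if c is None:                       # step (c1): noise, for now
            continue
        marks.append(i)                     # step (c2): a new cluster begins
        reach[i] = c                        # step (c22)
        seeds = {}
        expand(i, c, nb, seeds)
        while seeds:                        # steps (d), (e)
            j = min(seeds, key=seeds.get)
            reach[j] = seeds.pop(j)
            processed[j] = True
            order.append(j)
            nbj = neighbours(j)
            cj = core(j, nbj)
            if cj is not None:
                reach[j] = min(reach[j], cj)    # step (e22)
                expand(j, cj, nbj, seeds)       # step (e23)
    return order, reach, marks

# Hypothetical data: two neatly separated groups of three objects on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order, reach, marks = optics_order(points, eps=1.5, min_points=1)
```

In this small example all six reachability distances are equal, so the reachability values alone do not reveal the two groups; only the list of cluster starts does.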
In the example shown in Fig. 3A, 38 objects constitute the noise (six at the left of the reachability distance plot, 32 on the right). According to the definition of NOISE and CLUSTER, there is only one big cluster. Fig. 4 projects the abscissa of the Kriegel reachability plot onto the bidimensional space of the data.
Fig. 4. Representation in the space of the original information of the order found by OPTICS. The generating distance (70) is shown as a circle around object 5. MinPoints was 10. Objects of NOISE are shown as empty squares.
Fig. 5 was obtained in a similar way with MinPoints = 10, but with a different value of the generating distance, so that three clusters are separated. However, even when there is no separation, as in Fig. 3, the reachability distance indicates clearly that there are parts of the information space with a high density of points.
Fig. 5. Reachability plot (below) and representation in the space of the original information of the order found by OPTICS (above). The generating distance (55, compare with Fig. 4) is shown as a dashed circle around object 5. MinPoints was 10. Objects of NOISE are shown as empty squares.
Fig. 6. Data set 75data and results of OPTICS clustering as a function of the generating distance. MinPoints = 5. Distance between the two nearest objects is 500.
The possibility of building a hierarchy of clusters by using different values of the generating distance is a further interesting property of OPTICS. Fig. 6 shows a bidimensional data set with 75 objects in three clusters with different density. As the generating distance increases, OPTICS first detects the two clusters A and B. Then some objects in C are clustered. OPTICS detects a maximum of four clusters: A, B, and two clusters (plus nine singletons) in C. Then an interval of generating distances corresponds to the three clusters, A–C. With a generating distance of 1580–1600, A and B form a unique cluster, and finally with ε > 1610 all the objects are in the same cluster. The generating distance seems rather critical in the detection of clusters, and frequently a small difference greatly modifies the results of OPTICS clustering.
A further problem arises when the real clusters are neatly separated without the presence of singletons in the intermediate space. Fig. 7 shows an example where, when the last object of cluster A is processed, the next object (of cluster B) has reachability distance UNDEFINED, but this distance becomes defined when the objects in cluster B are processed. The consequence is that the reachability distance plot does not show the presence of two clusters. This drawback can be eliminated with a “cluster mark”, which indicates on the axis of the plot where a new cluster begins (point (c2) of the algorithm).
Fig. 7. Data set 50data and reachability plot. The generating distance is shown in the figure. MinPoints was 5.
4. OETICS
Here we present a technique, ordering edges to identify the clustering structure (OETICS), based on the edges of the minimum spanning tree (MST) and characterized by the fact that it has no selectable parameters. OETICS starts with the pre-treatment of data (as in all distance-based clustering techniques). It continues with the application of a suitable algorithm to compute the MST, as shown in Fig. 8.
Fig. 8. Minimum spanning tree for Kriegelexample.
A graph is a set of vertices and of edges that connect them. The degree or valence of a vertex is the number of edges that touch it.
In our case the vertices are the objects. A path or tree p through a graph is a sequence of connected objects: p = ⟨o_0, o_1, ..., o_k⟩. The length of a path is the number k of edges. A graph contains no cycles if there is no path of non-zero length through the graph, p = ⟨o_0, o_1, ..., o_k⟩, such that o_0 = o_k. A spanning tree of a graph G is a set of N − 1 edges that connect all the N objects of the graph.
If a cost c_ij is associated with each edge e_ij = (o_i, o_j), then the minimum spanning tree (MST) is the set of edges such that the sum of the costs over the N − 1 edges is minimum.
In this case the cost associated with each edge is the Euclidean distance between the two connected objects.
Two algorithms are usually used to find the minimum spanning tree connecting N objects. The Prim algorithm [5] begins from an arbitrary object, which constitutes a zero-length tree with only one object connected. Then, in each step of the algorithm, one object is connected to the tree: the previously non-connected object with the minimum distance from one of the connected objects (Table 2).
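A minimal sketch of the Prim procedure, applied to the Twentydata coordinates of Table 1, is given here for illustration. The function name `prim_mst` and the straightforward O(N³) search are choices made for clarity, not the authors' implementation; starting from object 7 is an assumption suggested by the first row of Table 2.

```python
import math

# Twentydata coordinates from Table 1, keyed by object index.
POINTS = {
    1: (46, 500), 2: (124, 579), 3: (166, 536), 4: (236, 509),
    5: (174, 484), 6: (197, 481), 7: (214, 464), 8: (200, 455),
    9: (191, 438), 10: (235, 452), 11: (316, 434), 12: (355, 455),
    13: (355, 434), 14: (356, 415), 15: (321, 371), 16: (374, 405),
    17: (403, 437), 18: (380, 379), 19: (491, 463), 20: (529, 473),
}

def prim_mst(points, start):
    """Return the MST as (connected object, new object, distance) tuples,
    in the order in which the edges are added."""
    in_tree = {start}
    edges = []
    while len(in_tree) < len(points):
        best = None
        # Find the non-connected object nearest to any connected object.
        for i in in_tree:
            for j in points:
                if j in in_tree:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        in_tree.add(best[1])
        edges.append(best)
    return edges

mst = prim_mst(POINTS, start=7)
longest = max(mst, key=lambda e: e[2])
```

The tree has N − 1 = 19 edges, and the longest one joins objects 2 and 1, the edge from which the OETICS ordering of this data set starts.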
The Kruskal algorithm [6] begins with the connection of the two nearest objects. In each step the two nearest objects not yet joined by an edge are connected, provided that they are not in the same tree (in that case a cycle would be formed). When neither of the two objects is already connected, they constitute a new path with non-zero length. So a number of separate trees, a forest, can be formed. When the two connected objects are in different trees, the two trees merge. The algorithm continues until all the objects are linked in a unique tree.
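The Kruskal procedure can be sketched with a simple union-find structure that records which tree each object belongs to. The function name `kruskal_mst` and the four-point data set are hypothetical, chosen so that a forest of two trees forms before the final merge.

```python
import math
from itertools import combinations

def kruskal_mst(points):
    """Return MST edges as (i, j, distance), in the order they are accepted."""
    parent = list(range(len(points)))

    def find(i):
        # Root of the tree containing object i (with path compression).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of objects, sorted by increasing distance.
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    edges = []
    for d, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:            # different trees: connecting forms no cycle
            parent[ri] = rj     # the two trees merge
            edges.append((i, j, d))
    return edges

# Hypothetical objects: two close pairs, far from each other.
points = [(0, 0), (1, 0), (5, 0), (6, 0)]
mst = kruskal_mst(points)
```

Here the two short edges are accepted first, forming a forest of two trees, and the third accepted edge merges them into a unique tree.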
Table 2. Steps of the Prim algorithm for data set Twentydata
Connected objects    Distance
    7     8           16.6433
    8     9           19.2354
    7     6           24.0416
    6     5           23.1948
    7    10           24.1868
    6     4           48.0104
    5     3           52.6118
    3     2           60.1082
   10    11           82.9759
   11    13           39.0000
   13    14           19.0263
   14    16           20.5913
   13    12           21.0000
   16    18           26.6833
   16    17           43.1856
   14    15           56.2228
   17    19           91.7606
   19    20           39.2938
    2     1          111.0180
Fig. 9. Minimum spanning tree for data set Twentydata. The index of the objects is shown.
Then the edges are ordered by the OETICS algorithm:
(a) Begin with the longest edge. It is the first in the ordered edges list. All the objects are “active”. If the longest edge is not terminal, i.e. both the objects connected by the edge are also connected to other objects, one of the two objects is set provisionally to “inactive”. It is Marked.
(b) The two objects connected by the edge are OETICS ordered. Their “ordering level” is increased by 1. If one of the ordered objects has an ordering level equal to its degree (the number of edges to which the object is connected), the object becomes “inactive”.
(c) If there are no ordered active objects and the object of the longest edge is Marked, it becomes active. If all the edges have been ordered, go to point (f).
(d) Find the shortest of the non-ordered edges connected to one of the ordered active objects.
(e) Go to point (b).
(f) OETICS END.
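The ordering steps above can be sketched as a priority-first walk along the tree. This is an illustrative reading of the steps, not the authors' program: the function name, the heap-based bookkeeping in place of explicit ordering levels, and the rule that the first-listed endpoint of a non-terminal longest edge is the Marked one are all assumptions. The edge list is taken from Table 2.

```python
import heapq
from collections import defaultdict

# MST edges of Twentydata as (object, object, length), from Table 2.
MST_EDGES = [
    (7, 8, 16.6433), (8, 9, 19.2354), (7, 6, 24.0416), (6, 5, 23.1948),
    (7, 10, 24.1868), (6, 4, 48.0104), (5, 3, 52.6118), (3, 2, 60.1082),
    (10, 11, 82.9759), (11, 13, 39.0), (13, 14, 19.0263), (14, 16, 20.5913),
    (13, 12, 21.0), (16, 18, 26.6833), (16, 17, 43.1856), (14, 15, 56.2228),
    (17, 19, 91.7606), (19, 20, 39.2938), (2, 1, 111.0180),
]

def oetics_order(edges):
    """Order the edges of a spanning tree, starting from the longest edge."""
    adjacent = defaultdict(list)
    for i, j, d in edges:
        adjacent[i].append((d, i, j))
        adjacent[j].append((d, j, i))

    a, b, d0 = max(edges, key=lambda e: e[2])      # step (a)
    if len(adjacent[b]) == 1:                      # put a leaf endpoint first
        a, b = b, a
    ordered = [(a, b, d0)]
    done = {frozenset((a, b))}
    terminal = len(adjacent[a]) == 1
    marked = None if terminal else a               # assumption: mark endpoint a
    candidates = []                                # edges of active ordered objects

    def activate(v):
        for d, u, w in adjacent[v]:
            if frozenset((u, w)) not in done:
                heapq.heappush(candidates, (d, u, w))

    for v in (a, b):
        if v != marked:
            activate(v)
    while len(ordered) < len(edges):
        if not candidates:                         # step (c)
            if marked is None:
                break
            activate(marked)
            marked = None
            continue
        d, u, w = heapq.heappop(candidates)        # step (d): shortest edge
        done.add(frozenset((u, w)))
        ordered.append((u, w, d))                  # step (b)
        activate(w)
    return ordered

order = oetics_order(MST_EDGES)
```

An object becomes inactive implicitly, when none of its edges remains among the candidates; for Twentydata the resulting sequence of edges is the one listed step by step in Table 3.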
Fig. 10. OETICS plot for data set Twentydata. Each edge is identified by the two connected objects.
Table 3. Steps of OETICS for Twentydata. The active objects are those remaining after ordering the edge, which decreases by 1 the degree of the two connected objects, so that the objects with degree 1 become inactive
Step  First object  Degree  Second object  Degree  Distance  Active ordered objects
   1        1          1          2           2     111.018  2
   2        2          1          3           2      60.108  3
   3        3          1          5           2      52.612  5
   4        5          1          6           3      23.195  6
   5        6          2          7           3      24.042  6, 7
   6        7          2          8           2      16.643  6, 7, 8
   7        8          1          9           1      19.235  6, 7
   8        7          1         10           2      24.187  6, 10
   9        6          1          4           1      48.010  10
  10       10          1         11           2      82.976  11
  11       11          1         13           3      39.000  13
  12       13          2         14           3      19.026  13, 14
  13       14          2         16           3      20.591  13, 14, 16
  14       13          1         12           1      21.000  14, 16
  15       16          2         18           1      26.683  14, 16
  16       16          1         17           2      43.186  14, 17
  17       14          1         15           1      56.223  17
  18       17          1         19           2      91.761  19
  19       19          1         20           1      39.294  –
Fig. 9 shows the minimum spanning tree of data set Twentydata. In this case, the longest edge connects objects 1 and 2. It is a terminal edge, so that the OETICS algorithm can evolve only from object number 2. The details of the steps of the algorithm are shown in Table 3, and the OETICS plot is shown in Fig. 10.
The plot of ordered edges in Fig. 10 can be cut at a certain level (as usual for dendrograms) to identify clusters and singletons. When the level is that shown in the figure, OETICS identifies a singleton (object 1), a cluster with only two objects (19 and 20) and two clusters with 9 and 8 objects.
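Cutting the plot at a given level amounts to discarding the edges longer than that level and collecting the groups of objects that remain connected. A minimal sketch follows; the function name `cut_clusters` and the cutting level of 70 are assumptions (the level drawn in Fig. 10 is not stated numerically), and the edge list is taken from Table 3.

```python
# Ordered edges of Twentydata as (object, object, length), from Table 3.
ORDERED_EDGES = [
    (1, 2, 111.018), (2, 3, 60.108), (3, 5, 52.612), (5, 6, 23.195),
    (6, 7, 24.042), (7, 8, 16.643), (8, 9, 19.235), (7, 10, 24.187),
    (6, 4, 48.010), (10, 11, 82.976), (11, 13, 39.000), (13, 14, 19.026),
    (14, 16, 20.591), (13, 12, 21.000), (16, 18, 26.683), (16, 17, 43.186),
    (14, 15, 56.223), (17, 19, 91.761), (19, 20, 39.294),
]

def cut_clusters(edges, level):
    """Group the objects joined by edges not longer than the cutting level."""
    objects = sorted({v for i, j, _ in edges for v in (i, j)})
    parent = {v: v for v in objects}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, j, d in edges:
        if d <= level:              # edges above the level separate groups
            parent[find(i)] = find(j)
    groups = {}
    for v in objects:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values(), key=lambda g: g[0])

clusters = cut_clusters(ORDERED_EDGES, level=70)  # hypothetical cutting level
```

With this level the three longest edges are removed, which gives the singleton (object 1), the two clusters of 9 and 8 objects, and the pair 19–20 described in the text.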
Fig. 11. Data set 40data and its OETICS plot with the marked edge 4–9.
The cutting level can be obtained automatically by means of a test that uses the critical value of the length of edges drawn from a multivariate uniform or normal distribution [13]. However, observation of the plot and interpretation of the possible clusters can avoid the risks of automatism.
The second example refers to a case where the longest edge is not a terminal edge. Fig. 11 shows the data set 40data, where four clusters (A–D) can be easily identified and where the longest edge connects objects 4 and 11. The algorithm selects one of these two objects, object 11, and orders edges 11–19, 19–18, 18–14, 18–17, 17–20, 17–16, 14–15, 14–13 and 13–12. So all the objects of cluster B are ordered. The algorithm now finds two long edges connecting object 20 to objects 26 and 38. It selects the shortest, and gradually orders the objects in cluster C. Then it returns to the edge connecting 20 and 26 and gradually orders the objects in cluster D, finishing with the edge 29–21. Now object 4, provisionally inactive, becomes active and from it the algorithm begins to order the objects in cluster A, with the edge 4–9. In the plot of ordered edges an arrow indicates that object 4 is marked, so that edge 4–9 cannot be considered as contiguous to edge 29–21. Ideally, at the left of 4–9 there is a replicate of the starting edge 4–11.
Fig. 12. Rearranged plot of ordered edges for the data set 40data.
5. Rearrangement
In spite of the mark, the plot of ordered edges in Fig. 11 is not very clear. So the edges are rearranged: the marked edge 4–9 is placed immediately at the left of the starting edge. Then edge 9–8 is placed at the left of 4–9, and so on, up to the final edge 10–7, which becomes the first in the plot, as shown in Fig. 12. Now the plot of ordered edges is easily interpreted and the three longest edges separate the four clusters.
6. Smoothing
Fig. 13A shows the plot of ordered edges in the case of data set Kriegelexample. The longest edge was not a terminal edge, but it had only one edge on one side, so that the rearrangement produced a very limited, almost negligible, effect (the longest edge is the second in the plot). The original map shows all the irregularities of the distances in the minimum spanning tree, more or less as in the case of OPTICS when the value of MinPoints is low (as in the map of Fig. 3B). It is possible to apply a moderate smoothing to the lengths of the edges, as in Fig. 13B and C. This smoothing regularises the plot and can, perhaps, help in the interpretation. However, when the plot is used to identify the clusters, it is preferable to use the original plot without smoothing. Fig. 13A shows a possible decision about the clusters, obtained by cutting the plot below a selected “reasonable” length of the edges. The result is shown in Fig. 14. It is, obviously, very subjective, as in all the cases of ill-defined clusters without neat separation.
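A five-point Savitzky–Golay smoothing of the edge lengths can be sketched with the standard quadratic/cubic weights (−3, 12, 17, 12, −3)/35. The polynomial order and the treatment of the two points at each end (left unchanged here) are assumptions, since the text states only the number of points; the edge lengths are hypothetical.

```python
# Standard 5-point quadratic/cubic Savitzky-Golay smoothing weights (sum = 35).
SG5 = (-3, 12, 17, 12, -3)

def savgol5(values):
    """Smooth a sequence with a 5-point Savitzky-Golay filter.
    The first and last two values are left unchanged (an assumption)."""
    out = list(values)
    for k in range(2, len(values) - 2):
        window = values[k - 2:k + 3]
        out[k] = sum(w * v for w, v in zip(SG5, window)) / 35
    return out

# Hypothetical ordered edge lengths with one long edge between two groups.
lengths = [20.0, 22.0, 19.0, 80.0, 21.0, 18.0, 23.0]
smoothed = savgol5(lengths)
```

The long edge is lowered and its neighbours are altered by the filter, which illustrates why the unsmoothed plot is preferable when the clusters are actually assigned.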
Fig. 13. OETICS map for data set Kriegelexample. (A) Usual map; (B) Savitzky–Golay smoothing with five points; (C) Savitzky–Golay smoothing with 11 points.
Fig. 14. Assignment of clusters for Kriegelexample on the basis of Fig. 13A.
7. Results with real data sets
7.1. Coffeedata
Fig. 15 shows that the longest edge perfectly separates the seven Robusta samples (the six edges on the left, after rearrangement). The average length of the edges related to the Arabica samples, on the right, indicates that these samples are closer to each other than the Robusta ones.
OPTICS (Fig. 16) separates the two categories, without outliers, only with a small value of MinPoints (2) and a relatively large value (4.5) of the generating distance. Both are necessary because of the small number of samples of Robusta and their rather large distance. A mark must be used to indicate the separation of the two clusters because, as in the example 50data in Fig. 7, there are no outliers with large reachability distance from both clusters. Without the mark the plot of reachability distance is hardly interpretable. In fact, due to the large generating distance, a lot of objects (33) were found in the spherical space S(q) around the first object, and added to the LIST of the neighbours (point (c23) of the algorithm). Because the objects in LIST are ordered according to their reachability distance (updated every time the algorithm executes point (c23)), the result is that the ordered objects are also almost in the order of their reachability distance.
Fig. 15. Data set Coffeedata. Result with OETICS.
Fig. 16. Data set Coffeedata. Result with OPTICS. ε, generating distance = 4.5; MinPoints = 2.
Fig. 17. Data set Oliveoils. Result with OETICS.
Table 4. Exploration of OPTICS minimum cardinality and of the generating distance in the case of data set Coffeedata
        MinPoints = 2      MinPoints = 3      MinPoints = 4      MinPoints = 5
ε       Clusters  Noise    Clusters  Noise    Clusters  Noise    Clusters  Noise
1.70        1      39          0      43          0      43          0      43
1.84        1      38          1      39          0      43          0      43
1.98        3      31          1      38          1      38          0      43
2.12        2      26          2      31          1      38          0      43
2.26        2      23          2      27          2      30          1      35
2.40        3      20          1      26          1      26          1      27
2.54        2      17          1      21          1      25          1      26
2.68        2      15          1      19          1      19          1      19
2.82        1      11          1      12          1      15          1      15
2.96        2       6          1       9          1       9          1       9
3.10        2       5          1       8          1       8          1       8
3.24        2       5          1       8          1       8          1       8
3.38        2       4          1       7          1       7          1       7
3.52        2       4          1       7          1       7          1       7
3.66        2       3          2       3          1       7          1       7
3.80        2       3          2       3          1       7          1       7
3.94        2       1          2       1          2       2          1       7
4.08        2       1          2       1          2       1          2       1
4.22        2       1          2       1          2       1          2       1
4.36        2       0          2       1          2       1          2       1
4.50        2       0          2       1          2       1          2       1
4.64        2       0          2       1          2       1          2       1
4.78        2       0          2       1          2       1          2       1
4.92        2       0          2       1          2       1          2       1
Table 4 shows the results (number of clusters and of outliers (noise)) obtained with different values of the settable parameters of OPTICS. In this case, with a small number of objects, it was possible to perform a wide study of the effect of these parameters, which is not so easy when the number of objects is large, because the computing time increases very much (being proportional to the square of the number of objects). The number of clusters and outliers is not enough to decide the optimum value of the parameters. It is usually necessary to draw the corresponding plots of the reachability distance. In this case the choice MinPoints = 2 and ε = 4.5 was due to the previous knowledge of the existence of two categories.
Fig. 18. PC plot of data set Oliveoils. The index of category is reported for the 572 objects: (1) North Apulia; (2) Calabria; (3) South Apulia; (4) Sicily; (5) Inland Sardinia; (6) Coast Sardinia; (7) East Liguria; (8) West Liguria; (9) Umbria. The position of the first object in the data set indicates the starting point of OPTICS. The longest edge is the starting point of OETICS. The black filled squares indicate the approximate position of the clusters of Fig. 17.
Fig. 19. Data set Oliveoils. Result with OPTICS. ε, generating distance = 3; MinPoints = 10.
7.2. Oliveoils
The results are reported in Figs. 17–19. The differences between the plots of OETICS and OPTICS are due to the smoothing character of the latter and to the different starting point. OETICS starts from the longest edge, connecting a sample of East Liguria to a sample of West Liguria. This is a terminal edge, so that no rearrangement is required. The minimum spanning tree continues in the part of the space rich in samples of West Liguria. OPTICS starts with sample number 1, of North Apulia, so that it begins by ordering samples of North Apulia. There are however many points of similarity. The “valleys” corresponding to a high density of points, the “density clusters”, are more or less the same. The right part of the OETICS plot corresponds to the left part of the PC plot of Fig. 18, with many samples of South Apulia and some samples of Sicily, with relatively small density, and with some zones where there is a relatively high density of samples from Calabria or North Apulia. This part of the OETICS plot corresponds to the part of the OPTICS plot to the left of the cluster of Coast Sardinia. It seems that OPTICS separates the samples of Calabria better. However, the cluster of Calabria shown in Fig. 19 contains many samples from Sicily. The two small clusters of Calabria in Fig. 17 contain only samples of Calabria, but fewer than the OPTICS cluster.
Fig. 20. Data set Breastcancer. Result with OETICS.
Fig. 21. Data set Breastcancer. Result with OPTICS. ε, generating distance = 3; MinPoints = 10.
7.3. Breastcancer
The results are reported in Figs. 20 and 21. OETICS starts from a terminal edge, so that no rearrangement was necessary. There are a lot of identical objects in Breastcancer, so that the corresponding edge lengths or reachability distances are zero. The plots of OETICS and of OPTICS are very similar; the only important difference is a consequence of the different starting point. The interval marked by arrows in Fig. 20 contains almost all the samples of the category Benign (444 out of 458) with only 17 Malignant.
8. Conclusions
The OETICS algorithm, based on ordering the edges of the minimum spanning tree connecting the objects, seems to offer an alternative to OPTICS, the original clustering technique based on the local density of objects and on two selectable parameters. OETICS is rigid (except for the possibility of smoothing the edge lengths in the final plot), so that it does not require the exploration work frequently necessary in OPTICS to find the optimum of the generating distance and of the minimum cardinality. The solution of OETICS is unique, i.e. it does not depend on the order of the objects. The results are more or less equivalent. On the other hand OETICS does not have the same flexibility as OPTICS, and it is rather sensitive to the local differences in the length of edges and to the local details that the procedure of OPTICS eliminates.
OETICS is contained in the program MST (minimum spanning tree) of V-PARVUS [8].
Acknowledgements
This study has been developed with funds from the project UE WineDB, from the University of Genova, and from CNR (National Research Council of Italy).
References
[1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, 1998.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990.
[3] M. Forina, C. Casolino, S. Lanteri, Annali Chim. (Rome) 93 (2003) 55.
[4] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, in: Proceedings of the ACM SIGMOD’99 International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60.
[5] R.C. Prim, Bell Sys. Tech. J. 36 (1957) 1389.
[6] J.B. Kruskal, Proc. Am. Math. Soc. 7 (1956) 48.
[7] H. Streuli, Aroma und Geschmacksstoffe in Lebensmitteln, Foster Verlag AG, Zurich, Switzerland, 1967.
[8] M. Forina, S. Lanteri, C. Armanino, C. Cerrato Oliveros, C. Casolino, V-PARVUS 2003, An extendable package of programs for explorative data analysis, classification and regression analysis, Dip. Chimica e Tecnologie Farmaceutiche, University of Genova, available (free, with manual and examples) at http://www.parvus.unige.it.
[9] M. Forina, E. Tiscornia, Annali Chim. (Rome) 72 (1982) 143–155.
[10] J. Zupan, M. Novic, X. Li, J. Gasteiger, Anal. Chim. Acta 292 (1994) 219–234.
[11] P.K. Hopke, D.L. Massart, Chemom. Intell. Lab. Syst. 19 (1993) 35–41.
[12] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, in: Proceedings of the National Academy of Sciences, USA, vol. 87, December 1990, pp. 9193–9196.
[13] M. Forina, S. Lanteri, I. Esteban Díez, Anal. Chim. Acta 446 (2001) 59.