Minimum spanning tree: ordering edges to identify clustering structure


UNCORRECTED PROOF

Analytica Chimica Acta xxx (2004) xxx–xxx

Michele Forina∗, M. Concepción Cerrato Oliveros, Chiara Casolino, Monica Casale

Dipartimento di Chimica e Tecnologie Farmaceutiche ed Alimentari, Facoltà di Farmacia, Università di Genova,
Via Brigata Salerno (Ponte), Genova 16147, Italy

Received 8 September 2003; received in revised form 27 February 2004; accepted 27 February 2004

Abstract

Ordering edges to identify clustering structure (OETICS), the clustering algorithm presented here, is based on the minimum spanning tree connecting the objects. The edges of the tree are ordered, beginning from the longest edge, to form groups of objects separated by long edges. The plot of ordered edges is the main result of the algorithm. In some aspects OETICS is similar to OPTICS, a known clustering technique that orders the objects with reference to the local density, but its solution is unique, because it does not require the selection of parameter values such as the generating distance in OPTICS.

OETICS is applied to many simulated and real data sets, with very different numbers of objects and variables, and the results are compared with those obtained by OPTICS.

© 2004 Published by Elsevier B.V.

Keywords: Clustering; Chemometrics; Pattern recognition

1. Introduction

Clustering techniques are widely used in chemistry, with different objectives. In food chemistry the objective is frequently the detection of groups of similar foods, of similar consumers, of similar panellists. In pharmaceutical chemistry clustering techniques are used to detect groups of similar conformers, or of molecules with similar structure. In analytical chemistry clustering techniques are used to select a number of representative samples for calibration, or to evaluate the homogeneity of a calibration set.

Many clustering techniques [1,2] are used: visual clustering (on principal components or on the axes of projection pursuit), hierarchical agglomerative and divisive methods with the related dendrograms, non-hierarchical techniques such as K-means and K-medoids, and fuzzy clustering. However, because of the huge variety of problems and data, these techniques are not completely satisfactory, at least as can be deduced from the number of papers about new clustering techniques. One reason is that the dendrogram obtained with some techniques shows apparently well separated clusters even when there are no real clusters, as the example in Fig. 1 shows. The application of statistical tests to evaluate the significance of clusters (obtained by cutting the dendrogram, generally at the long branches) can erroneously confirm the existence of false clusters [3].

∗ Corresponding author. Tel.: +39-0103532630; fax: +39-0103532684. E-mail address: [email protected] (M. Forina).

A second reason is that today clustering techniques must sometimes be applied to very large databases. In this case the usual hierarchical techniques cannot be applied easily, both because it is rather rare to find computer programs able to draw dendrograms in the case of, e.g., 100 samples, and because of difficulties in the interpretation.

A third reason is that there are two types of clustering. The first type, the so-called natural clustering, is associated with the presence of agglomerates of objects where the distance between the two closest objects of two different agglomerates (between-clusters distance) is clearly larger than the within-agglomerate distances. The second type of clustering is present when there are parts of the space with a high density of points, and parts with a low density, without a sharp boundary. Many techniques have been developed to study the second type of clusters, with the principal objective of application to very large data sets. Among these techniques OPTICS [4], the final step of a number of developments, seems very powerful.

The results of OPTICS depend on two settable parameters, the “generating distance” ε and the number of points required for the connectivity, “MinPoints”, and, obviously, on the scaling procedure. The possibility of optimising the technique by selecting suitable values of the parameters gives it power, but this power can be exploited only by skilled people. For this reason, we present here a clustering technique similar to OPTICS but without settable parameters, so with less power but with the simplicity that unskilled people may prefer. This technique works on the edges of the minimum spanning tree connecting the objects [5,6], and for this reason we call it ordering edges to identify clustering structure (OETICS), a name that echoes the method from which it derives.

Fig. 1. A bidimensional data set drawn from a uniform distribution, and the dendrogram obtained with the average-linkage unweighted agglomeration technique.

0003-2670/$ – see front matter © 2004 Published by Elsevier B.V.
doi:10.1016/j.aca.2004.02.064

ACA 225314 1–11

As in the case of many other clustering techniques, both OPTICS and OETICS give great importance to the visual representation of clustering, so figures are the principal product of both techniques.

2. Data files

Some simulated data matrices with two variables and a number of objects from 20 to 1000 have been used in the development and evaluation of clustering techniques. Data set Twentydata (20 objects) is used to illustrate the details of the OETICS algorithm. It is reported in Table 1. Data set Kriegelexample was obtained by scanning a figure in the paper describing OPTICS. It has 300 objects. Data sets 40data, 50data and 75data have a special structure useful to illustrate some details of clustering with OPTICS or OETICS. They are shown in later figures.

Table 1
Data set Twentydata

Index    X    Y    Index    X    Y    Index    X    Y    Index    X    Y
    1   46  500        2  124  579        3  166  536        4  236  509
    5  174  484        6  197  481        7  214  464        8  200  455
    9  191  438       10  235  452       11  316  434       12  355  455
   13  355  434       14  356  415       15  321  371       16  374  405
   17  403  437       18  380  379       19  491  463       20  529  473

Three real data sets have been used. Coffeedata ([7], available at [8]) has 43 objects (7 Robusta and 36 Arabica coffees) described by 13 chemical and physical variables. The two categories are separable. Oliveoils [9] has been used by many authors working with multivariate classification, e.g. [10,11]. The 572 objects are described by eight fatty acids and are divided into nine categories, corresponding to Italian regions. Breastcancer [12] contains 699 objects, described by nine variables. There are two categories, benign (458 objects) and malignant (241 objects).

In our experience (and in that of many people working with these data) the categories in Oliveoils and Breastcancer are not separable.

Simulated data were used without pretreatment, whereas autoscaling was applied to all the real data sets.
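As an illustration of the pretreatment, the following Python sketch autoscales a data matrix column by column (mean centring followed by scaling to unit standard deviation). The function name `autoscale`, the small data matrix and the use of the sample standard deviation (n − 1 denominator) are assumptions made for this example; the text does not specify them.

```python
import statistics


def autoscale(rows):
    """Centre each column on its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 denominator).
    stdevs = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


# Hypothetical 3 x 2 data matrix, not taken from the paper.
scaled = autoscale([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

After autoscaling every variable has the same weight in the Euclidean distances used by the clustering techniques, whatever its original unit.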

3. OPTICS

OPTICS is based on the reachability distance. Two parameters are defined by the operator.

ε, generating distance: the radius of the spherical space S(q) around a given object q, used to define the non-connectivity. When there are no objects other than q in S(q), this object is a singleton. Its reachability distance is UNDEFINED.

The objects whose reachability distance is undefined constitute the NOISE. The others are in one or more CLUSTERS.

The cardinality is the number of objects other than q in S(q).

MinPoints: the minimum cardinality of q with reference to S(q) required for the connectivity. An object q such that the cardinality of S(q) is ≥ MinPoints is a core point. Its core distance, core(q), is the distance of the MinPoints-th neighbour. In Fig. 2 the main points of the OPTICS definitions are explained for the case MinPoints = 3. A point such as p1, with distance from q less than core(q), has reachability distance = core(q). A point such as p2, with distance from q larger than core(q) but less than ε, has reachability distance equal to its distance from q. When the distance is larger than ε the point cannot be reached directly from q.
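The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of `None` for UNDEFINED and the six collinear points are hypothetical and do not come from the paper.

```python
import math


def core_distance(points, q, eps, min_points):
    """Distance from object q to its MinPoints-th neighbour, or None if q is not a core point."""
    dists = [math.dist(points[q], p) for i, p in enumerate(points) if i != q]
    neighbours = sorted(d for d in dists if d <= eps)  # objects other than q inside S(q)
    if len(neighbours) < min_points:
        return None
    return neighbours[min_points - 1]


def reachability(points, p, q, eps, min_points):
    """Reachability distance of object p from object q, or None when it is UNDEFINED."""
    core = core_distance(points, q, eps, min_points)
    d = math.dist(points[p], points[q])
    if core is None or d > eps:
        return None
    return max(core, d)


# Hypothetical collinear objects; q is object 0, eps = 5, MinPoints = 3.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
core_q = core_distance(pts, 0, eps=5, min_points=3)
reach_near = reachability(pts, 1, 0, eps=5, min_points=3)   # closer than core(q)
reach_mid = reachability(pts, 4, 0, eps=5, min_points=3)    # between core(q) and eps
reach_far = reachability(pts, 5, 0, eps=5, min_points=3)    # farther than eps
```

Object 1 plays the role of p1 in Fig. 2 (its reachability distance is raised to core(q)), object 4 plays the role of p2, and object 5 cannot be reached directly from q.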

Results of OPTICS are shown in a “reachability plot” such as that in Fig. 3, where the clusters (corresponding to a local high density of points) are identified by the minima of the reachability distance.

Fig. 2. Core distance and reachability distances. With MinPoints = 3, core(q) is the distance from the third neighbour.

Fig. 3. Data set Kriegelexample [4] and reachability plots with different settings: generating distance 70; (A) MinPoints 10; (B) MinPoints 2.

The OPTICS algorithm starts with the definition of the generating distance and of the minimum cardinality. The starting reachability distance of all the objects is set to UNDEFINED, i.e. to a conventional value > ε.

The algorithm reads the objects in the data file, starting with object number 1. However, this reading (point (a) of the algorithm) is only one of the ways to access the data in the data file, so that during this ordered access the algorithm can also find objects that were processed previously.

(a) If the end of the data file has been reached, go to point (f). Otherwise read object I.
(b) If object I was processed previously, update I (I = I + 1) and go back to point (a).
(c) Object I is processed and added to the ORDEREDLIST.
    (c1) If the cardinality CARD(I) of the space S(I) is < MinPoints then: ReachDist(I) = 1.2ε (i.e. UNDEFINED), update I (I = I + 1), go back to point (a).
    (c2) Otherwise [CARD(I) ≥ MinPoints] a new cluster begins.
        (c21) Evaluate core(I) (distance of the MinPoints-th neighbour from I).
        (c22) Set ReachDist(I) = core(I).
        (c23) Add to a LIST the neighbours of the processed object (distance from the processed object ≤ ε), with their ReachDist (rpi of the object p from i is the maximum between the distance of p from i and core(i)).
              Processed objects can be re-processed here and eliminated from the ordered list in the case their ReachDist was UNDEFINED (they are not core points, but they can be reached from a core point).
              If a neighbour was already in LIST, its reachability distance is the minimum between the previous and the new ReachDist.
              Sort the objects in LIST according to their ReachDist.
(d) If LIST is empty, update I (I = I + 1) and go back to point (a).
(e) Read the first object in LIST. Let J be its index in the data file. Object J is deleted from LIST, processed and added to the ORDEREDLIST.
    (e1) If CARD(J) < MinPoints go back to point (d) to read the next object in LIST.
    (e2) Otherwise [CARD(J) ≥ MinPoints]:
        (e21) Evaluate core(J).
        (e22) If ReachDist(J) > core(J) then ReachDist(J) = core(J).
        (e23) Go to point (c23) to add to LIST the neighbours of J.
(f) Algorithm END.

In the example shown in Fig. 3A, 38 objects constitute the noise (six at the left of the reachability distance plot, 32 on the right). According to the definition of NOISE and CLUSTER, there is only one big cluster. Fig. 4 projects the abscissa of the Kriegel reachability plot onto the bidimensional space of the data.

Fig. 4. Representation in the space of the original information of the order found by OPTICS. The generating distance (70) is shown as a circle around object 5. MinPoints was 10. Objects of NOISE are shown as empty squares.

Fig. 5 was obtained in a similar way with MinPoints = 10, but with a different value of the generating distance, so that three clusters are separated. However, even when there is no separation, as in Fig. 3, the reachability distance indicates clearly that there are parts of the information space with a high density of points.

Fig. 5. Reachability plot (below) and representation in the space of the original information of the order found by OPTICS (above). The generating distance (55, compare with Fig. 4) is shown as a dashed circle around object 5. MinPoints was 10. Objects of NOISE are shown as empty squares.

Fig. 6. Data set 75data and results of OPTICS clustering as a function of the generating distance. MinPoints = 5. Distance between the two nearest objects is 500.

The possibility of building a hierarchy of clusters by using different values of the generating distance is a further interesting property of OPTICS. Fig. 6 shows a bidimensional data set with 75 objects in three clusters with different density. As the generating distance increases, OPTICS first detects the two clusters A and B. Then some objects in C are clustered. OPTICS detects a maximum of four clusters: A, B, and two clusters plus nine singletons in C. Then an interval of generating distances corresponds to the three clusters, A–C. With a generating distance of 1580–1600, A and B form a single cluster, and finally with ε > 1610 all the objects are in the same cluster. The generating distance seems rather critical in the detection of clusters, and frequently a small difference greatly modifies the results of OPTICS clustering.

A further problem arises when the real clusters are neatly separated without the presence of singletons in the intermediate space. Fig. 7 shows an example where, when the last object of cluster A is processed, the next object (of cluster B) has reachability distance UNDEFINED, but this distance becomes defined when the objects in cluster B are processed. The consequence is that the reachability distance plot does not show the presence of two clusters. This drawback can be eliminated with a “cluster mark” that indicates on the axis of the plot where a new cluster begins (point (c2) of the algorithm).

Fig. 7. Data set 50data and reachability plot. The generating distance is shown in the figure. MinPoints was 5.

4. OETICS

Here we present a technique, ordering edges to identify the clustering structure (OETICS), based on the edges of the minimum spanning tree (MST) and characterized by the fact that it has no selectable parameters. OETICS starts with the pretreatment of the data (as in all distance-based clustering techniques). It continues with the application of a suitable algorithm to compute the MST, as shown in Fig. 8.

Fig. 8. Minimum spanning tree for Kriegelexample.

A graph is a set of vertices and of edges that connect them. The degree or valence of a vertex is the number of edges that touch it.

In our case the vertices are the objects. A path or tree p through a graph is a sequence of connected objects: p = 〈o0, o1, ..., ok〉. The length of a path is the number k of edges. A graph contains no cycles if there is no path of non-zero length through the graph, p = 〈o0, o1, ..., ok〉, such that o0 = ok. A spanning tree of a graph G is a set of N − 1 edges that connect all the N objects of the graph.

If a cost, cij, is associated with each edge, eij = (oi, oj), then the minimum spanning tree (MST) is the set of edges such that the sum of the costs over the N − 1 edges is minimum.

In this case the cost associated with each edge is the Euclidean distance between the two connected objects.

Two algorithms are usually used to find the minimum spanning tree connecting N objects. The Prim algorithm [5] begins from an arbitrary object, which constitutes a zero-length tree with only one object connected. Then, in each step of the algorithm, one object is connected to the tree: the previously non-connected object with the minimum distance from one of the connected objects (Table 2).
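The Prim steps can be sketched in Python on the Twentydata coordinates of Table 1. This is a simple quadratic-time illustration, not the authors' program; the names `POINTS` and `prim_mst` are hypothetical, and the choice of object 7 as starting object is an assumption inferred from the first row of Table 2.

```python
import math

# Twentydata coordinates (Table 1), keyed by object index.
POINTS = {
    1: (46, 500), 2: (124, 579), 3: (166, 536), 4: (236, 509), 5: (174, 484),
    6: (197, 481), 7: (214, 464), 8: (200, 455), 9: (191, 438), 10: (235, 452),
    11: (316, 434), 12: (355, 455), 13: (355, 434), 14: (356, 415), 15: (321, 371),
    16: (374, 405), 17: (403, 437), 18: (380, 379), 19: (491, 463), 20: (529, 473),
}


def prim_mst(points, start):
    """Return the MST edges as (connected object, new object, distance), in the order Prim adds them."""
    in_tree = {start}
    edges = []
    while len(in_tree) < len(points):
        best = None
        # Look for the non-connected object closest to any connected object.
        for i in in_tree:
            for j in points:
                if j in in_tree:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        in_tree.add(best[1])
        edges.append(best)
    return edges


mst = prim_mst(POINTS, start=7)  # assumption: start from object 7
for i, j, d in mst:
    print(f"{i:2d} {j:2d} {d:9.4f}")
```

With this starting object the printed sequence follows the order of Table 2; a different starting object gives the same tree for these data, with the edges found in a different order.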

The Kruskal algorithm [6] begins with the connection of the two nearest objects. In each step the two nearest objects not yet directly connected are joined, provided that they are not in the same tree (in which case a cycle would be formed). When neither of the two objects is already connected, they constitute a new path with non-zero length, so a number of separate trees, a forest, can be formed. When the two connected objects are in different trees, the two trees merge. The algorithm continues until all the objects are linked in a single tree.
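A comparable Python sketch of the Kruskal procedure is given below, again on the Twentydata coordinates of Table 1. The union-find bookkeeping used to test whether two objects are already in the same tree is a common implementation choice and an assumption of this example; the names `POINTS` and `kruskal_mst` are hypothetical.

```python
import math
from itertools import combinations

# Twentydata coordinates (Table 1), keyed by object index.
POINTS = {
    1: (46, 500), 2: (124, 579), 3: (166, 536), 4: (236, 509), 5: (174, 484),
    6: (197, 481), 7: (214, 464), 8: (200, 455), 9: (191, 438), 10: (235, 452),
    11: (316, 434), 12: (355, 455), 13: (355, 434), 14: (356, 415), 15: (321, 371),
    16: (374, 405), 17: (403, 437), 18: (380, 379), 19: (491, 463), 20: (529, 473),
}


def kruskal_mst(points):
    """Return the MST edges as (object, object, distance), in the order Kruskal accepts them."""
    parent = {i: i for i in points}

    def find(i):
        # Root of the tree containing object i (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of objects, from the nearest to the farthest.
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(sorted(points), 2)
    )
    edges = []
    for d, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # different trees: joining them forms no cycle
            parent[root_i] = root_j   # the two trees merge
            edges.append((i, j, d))
    return edges


mst = kruskal_mst(POINTS)
```

For these data the set of edges is the same as that of Table 2; only the order in which the edges are found differs from the Prim order.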

Table 2
Steps of the Prim algorithm, data set Twentydata

Connected objects    Distance
 7    8               16.6433
 8    9               19.2354
 7    6               24.0416
 6    5               23.1948
 7   10               24.1868
 6    4               48.0104
 5    3               52.6118
 3    2               60.1082
10   11               82.9759
11   13               39.0000
13   14               19.0263
14   16               20.5913
13   12               21.0000
16   18               26.6833
16   17               43.1856
14   15               56.2228
17   19               91.7606
19   20               39.2938
 2    1              111.0180

Fig. 9. Minimum spanning tree for data set Twentydata. The index of the objects is shown.

Then the edges are ordered by the OETICS algorithm:

(a) Begin with the longest edge. It is the first in the ordered edges list. All the objects are “active”. If the longest edge is not terminal, i.e. both the objects connected by the edge are also connected to other objects, one of the two objects is set provisionally to “inactive”. It is Marked.
(b) The two objects connected by the edge are OETICS ordered. Their “ordering level” is increased by 1. If one of the ordered objects has an ordering level equal to its degree (the number of edges to which the object is connected), the object becomes “inactive”.
(c) If there are no ordered active objects and the object of the longest edge is Marked, it becomes active. If all the edges have been ordered then go to point (f).
(d) Detect the shortest edge among the non-ordered edges connected to one of the ordered active objects.
(e) Go to point (b).
(f) OETICS END.
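The ordering steps (a) to (f) can be sketched in Python as follows, using the MST edges of Twentydata listed in Table 2. This is an interpretation of the description above rather than the authors' code: the names `MST_EDGES` and `oetics_order` are hypothetical, an object is treated as inactive simply when it has no non-ordered edge left, and, because the text does not say which object of a non-terminal longest edge is Marked, the sketch assumes the one with the lower index.

```python
from collections import defaultdict

# MST edges of Twentydata with their lengths (Table 2).
MST_EDGES = [
    (7, 8, 16.6433), (8, 9, 19.2354), (7, 6, 24.0416), (6, 5, 23.1948),
    (7, 10, 24.1868), (6, 4, 48.0104), (5, 3, 52.6118), (3, 2, 60.1082),
    (10, 11, 82.9759), (11, 13, 39.0), (13, 14, 19.0263), (14, 16, 20.5913),
    (13, 12, 21.0), (16, 18, 26.6833), (16, 17, 43.1856), (14, 15, 56.2228),
    (17, 19, 91.7606), (19, 20, 39.2938), (2, 1, 111.0180),
]


def oetics_order(edges):
    """Return (ordered edges, marked object or None) for the edges of a spanning tree."""
    adjacency = defaultdict(list)
    for a, b, d in edges:
        adjacency[a].append((d, b))
        adjacency[b].append((d, a))

    a, b, d = max(edges, key=lambda e: e[2])            # step (a): the longest edge
    terminal = len(adjacency[a]) == 1 or len(adjacency[b]) == 1
    marked = None if terminal else min(a, b)            # assumption: mark the lower index
    held = marked                                       # provisionally inactive object
    ordered = [(a, b, d)]
    used = {frozenset((a, b))}
    reached = {a, b}                                    # objects that are OETICS ordered

    while len(ordered) < len(edges):
        # Step (d): non-ordered edges touching an ordered, active object.
        frontier = [
            (length, u, v)
            for u in reached if u != held
            for length, v in adjacency[u]
            if frozenset((u, v)) not in used
        ]
        if not frontier:                                # step (c): re-activate the marked object
            held = None
            continue
        length, u, v = min(frontier)                    # the shortest candidate edge
        ordered.append((u, v, length))                  # step (b)
        used.add(frozenset((u, v)))
        reached.add(v)
    return ordered, marked


ordered_edges, marked_object = oetics_order(MST_EDGES)
```

For Twentydata the longest edge (1–2) is terminal, so no object is marked and the sequence of edges can be compared, step by step, with Table 3.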

Fig. 10. OETICS plot for data set Twentydata. Each edge is identified by the two connected objects.

Table 3
Steps of OETICS for Twentydata. The active objects listed are those remaining after the edge has been ordered; ordering an edge decreases by 1 the degree of the two connected objects, so that the objects with degree 1 become inactive

Step  First object  Degree  Second object  Degree  Distance  Active ordered objects
  1        1          1          2           2     111.018   2
  2        2          1          3           2      60.108   3
  3        3          1          5           2      52.612   5
  4        5          1          6           3      23.195   6
  5        6          2          7           3      24.042   6, 7
  6        7          2          8           2      16.643   6, 7, 8
  7        8          1          9           1      19.235   6, 7
  8        7          1         10           2      24.187   6, 10
  9        6          1          4           1      48.010   10
 10       10          1         11           2      82.976   11
 11       11          1         13           3      39.000   13
 12       13          2         14           3      19.026   13, 14
 13       14          2         16           3      20.591   13, 14, 16
 14       13          1         12           1      21.000   14, 16
 15       16          2         18           1      26.683   14, 16
 16       16          1         17           2      43.186   14, 17
 17       14          1         15           1      56.223   17
 18       17          1         19           2      91.761   19
 19       19          1         20           1      39.294   –

Fig. 9 shows the minimum spanning tree of data set Twentydata. In this case, the longest edge connects objects 1 and 2. It is a terminal edge, so that the OETICS algorithm can evolve only from object number 2. The details of the steps of the algorithm are shown in Table 3, and the OETICS plot is shown in Fig. 10.

The plot of ordered edges in Fig. 10 can be cut at a certain level (as usual for dendrograms) to identify clusters and singletons. When the level is that shown in the figure, OETICS identifies a singleton (object 1), a cluster with only two objects (19 and 20) and two clusters with 9 and 8 objects.
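Cutting the plot at a given level amounts to removing from the minimum spanning tree all the edges longer than that level and collecting the groups of objects that remain connected. The Python sketch below illustrates this on the Twentydata edges of Table 2; the names `MST_EDGES` and `cut_clusters` are hypothetical, and the level of 70 is an assumed value, chosen between the third and the fourth longest edges (82.976 and 60.108), since the numerical level of the figure is not given in the text.

```python
# MST edges of Twentydata with their lengths (Table 2).
MST_EDGES = [
    (7, 8, 16.6433), (8, 9, 19.2354), (7, 6, 24.0416), (6, 5, 23.1948),
    (7, 10, 24.1868), (6, 4, 48.0104), (5, 3, 52.6118), (3, 2, 60.1082),
    (10, 11, 82.9759), (11, 13, 39.0), (13, 14, 19.0263), (14, 16, 20.5913),
    (13, 12, 21.0), (16, 18, 26.6833), (16, 17, 43.1856), (14, 15, 56.2228),
    (17, 19, 91.7606), (19, 20, 39.2938), (2, 1, 111.0180),
]


def cut_clusters(edges, level):
    """Group the objects that stay connected when edges longer than `level` are removed."""
    objects = {o for a, b, _ in edges for o in (a, b)}
    parent = {o: o for o in objects}

    def find(o):
        while parent[o] != o:
            parent[o] = parent[parent[o]]
            o = parent[o]
        return o

    for a, b, d in edges:
        if d <= level:                      # edges below the cutting level are kept
            parent[find(a)] = find(b)

    groups = {}
    for o in objects:
        groups.setdefault(find(o), []).append(o)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


clusters = cut_clusters(MST_EDGES, level=70)  # assumed cutting level
for group in clusters:
    print(len(group), group)
```

With this level the sketch returns object 1 as a singleton, objects 2 to 10 and 11 to 18 as the two larger clusters, and objects 19 and 20 as the small cluster.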

The cutting level can be obtained automatically by means of a test that uses the critical value of the length of edges drawn from a multivariate uniform or normal distribution [13]. However, the observation of the plot and the interpretation of possible clusters can avoid the risks of automatism.

The second example refers to a case where the longest edge is not a terminal edge. Fig. 11 shows the data set 40data, where four clusters (A–D) can be easily identified and where the longest edge connects objects 4 and 11. The algorithm selects one of these two objects, object 11, and orders edges 11–19, 19–18, 18–14, 18–17, 17–20, 17–16, 14–15, 14–13 and 13–12. So all the objects of cluster B are ordered. The algorithm now finds two long edges, connecting object 20 with objects 26 and 38. It selects the shorter one, and gradually orders the objects in cluster C. Then it returns to the edge connecting 20 and 26 and gradually orders the objects in cluster D, finishing with the edge 29–21. Now object 4, provisionally inactive, becomes active, and from it the algorithm begins to order the objects in cluster A, with the edge 4–9. In the plot of ordered edges an arrow indicates that object 4 is marked, so that edge 4–9 cannot be considered as contiguous to edge 29–21. Ideally, at the left of 4–9 there is a replicate of the starting edge 4–11.

Fig. 11. Data set 40data and its OETICS plot with the marked edge 4–9.

Fig. 12. Rearranged plot of ordered edges for the data set 40data.

5. Rearrangement

In spite of the mark, the plot of ordered edges in Fig. 11 is not very clear. So the edges are rearranged: the marked edge 4–9 is placed immediately to the left of the starting edge. Then edge 9–8 is placed to the left of 4–9, and so on, up to the final edge 10–7, which becomes the first in the plot, as shown in Fig. 12. Now the plot of ordered edges is easily interpreted and the three longest edges separate the four clusters.
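In terms of the list of ordered edges, the rearrangement reverses the edges ordered after the marked object was re-activated and places them in front of the starting edge. A minimal Python sketch follows; the function name `rearrange`, its argument `n_before_mark` (the number of edges ordered before the re-activation) and the five-object chain are hypothetical and serve only as an illustration.

```python
def rearrange(ordered_edges, n_before_mark):
    """Place the edges ordered after the re-activation of the marked object, reversed, before the others."""
    head = ordered_edges[:n_before_mark]   # starting edge and edges ordered from the active side
    tail = ordered_edges[n_before_mark:]   # edges ordered from the marked object
    return tail[::-1] + head


# Hypothetical chain 1-2-3-4-5 whose longest edge (2, 3) is not terminal:
# object 2 is marked, edges 3-4 and 4-5 are ordered first, then edge 2-1.
ordered = [(2, 3), (3, 4), (4, 5), (2, 1)]
rearranged = rearrange(ordered, n_before_mark=3)
print(rearranged)
```

After the rearrangement the starting edge sits between the two groups of objects that it separates, so that contiguous edges in the plot are also contiguous in the tree.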

6. Smoothing

Fig. 13A shows the plot of ordered edges in the case of data set Kriegelexample. The longest edge was not a terminal edge, but it had only one edge on one side, so that the rearrangement produced a very limited, almost negligible, effect (the longest edge is the second in the plot). The original map shows all the irregularities of the distances in the minimum spanning tree, more or less as in the case of OPTICS when the value of MinPoints is low (as in the map of Fig. 3B). It is possible to apply a moderate smoothing to the lengths of the edges, as in Fig. 13B and C. This smoothing regularises the plot and can, perhaps, help in the interpretation. However, when the plot is used to identify the clusters, it is preferable to use the original plot without smoothing. Fig. 13A shows a possible decision about the clusters, obtained by cutting the plot below a selected “reasonable” length of the edges. The result is shown in Fig. 14. It is, obviously, very subjective, as in all cases of ill-defined clusters without neat separation.
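A five-point Savitzky–Golay smoothing of the edge lengths can be sketched in Python as below. The polynomial order is not stated in the text: the sketch assumes the standard quadratic (or cubic) five-point filter, whose weights are (−3, 12, 17, 12, −3)/35, and it leaves the two values at each end unchanged, which is a further assumption. The function name `savgol5` is hypothetical; the input values are the first seven distances of Table 3.

```python
def savgol5(values):
    """Five-point quadratic Savitzky-Golay smoothing; the two values at each end are kept as they are."""
    weights = (-3, 12, 17, 12, -3)
    smoothed = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(w * v for w, v in zip(weights, window)) / 35
    return smoothed


# First seven ordered edge lengths of Twentydata (Table 3).
edge_lengths = [111.018, 60.108, 52.612, 23.195, 24.042, 16.643, 19.235]
smooth = savgol5(edge_lengths)
```

Because the filter reproduces any quadratic trend exactly, it removes the small irregularities of the plot while keeping the broad valleys; the 11-point version of Fig. 13C smooths more strongly.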

Fig. 13. OETICS map for data set Kriegelexample. (A) Usual map; (B) Savitzky–Golay smoothing with five points; (C) Savitzky–Golay smoothing with 11 points.

Fig. 14. Assignment of clusters for Kriegelexample on the basis of Fig. 13A.

7. Results with real data sets

7.1. Coffeedata

Fig. 15 shows how the longest edge perfectly separates the seven Robusta samples (the six edges on the left, after rearrangement). The average length of the edges related to the Arabica samples, on the right, indicates that these samples are closer to one another than the Robusta ones.

Fig. 15. Data set Coffeedata. Result with OETICS.

OPTICS (Fig. 16) separates the two categories, without outliers, only with a small value of MinPoints (2) and a relatively large value (4.5) of the generating distance. Both are necessary because of the small number of samples of Robusta and their rather large distance. A mark must be used to indicate the separation of the two clusters because, as in the example 50data in Fig. 7, there are no outliers with large reachability distance from both clusters. Without the mark the plot of reachability distance is hardly interpretable. In fact, due to the large generating distance, many objects (33) were found in the spherical space S(q) around the first object, and added to the LIST of the neighbours (point (c23) of the algorithm). Because the objects in LIST are ordered according to their reachability distance (which is updated every time the algorithm executes point (c23)), the result is that the ordered objects are also almost in the order of their reachability distance.

Fig. 16. Data set Coffeedata. Result with OPTICS. ε, generating distance = 4.5; MinPoints = 2.

Fig. 17. Data set Oliveoils. Result with OETICS.

Table 4
Exploration of OPTICS minimum cardinality and of the generating distance in the case of data set Coffeedata

        MinPoints = 2     MinPoints = 3     MinPoints = 4     MinPoints = 5
ε       Clusters  Noise   Clusters  Noise   Clusters  Noise   Clusters  Noise
1.70       1       39        0       43        0       43        0       43
1.84       1       38        1       39        0       43        0       43
1.98       3       31        1       38        1       38        0       43
2.12       2       26        2       31        1       38        0       43
2.26       2       23        2       27        2       30        1       35
2.40       3       20        1       26        1       26        1       27
2.54       2       17        1       21        1       25        1       26
2.68       2       15        1       19        1       19        1       19
2.82       1       11        1       12        1       15        1       15
2.96       2        6        1        9        1        9        1        9
3.10       2        5        1        8        1        8        1        8
3.24       2        5        1        8        1        8        1        8
3.38       2        4        1        7        1        7        1        7
3.52       2        4        1        7        1        7        1        7
3.66       2        3        2        3        1        7        1        7
3.80       2        3        2        3        1        7        1        7
3.94       2        1        2        1        2        2        1        7
4.08       2        1        2        1        2        1        2        1
4.22       2        1        2        1        2        1        2        1
4.36       2        0        2        1        2        1        2        1
4.50       2        0        2        1        2        1        2        1
4.64       2        0        2        1        2        1        2        1
4.78       2        0        2        1        2        1        2        1
4.92       2        0        2        1        2        1        2        1

Fig. 18. PC plot of data set Oliveoils. The index of category is reported for the 572 objects: (1) North Apulia; (2) Calabria; (3) South Apulia; (4) Sicily; (5) Inland Sardinia; (6) Coast Sardinia; (7) East Liguria; (8) West Liguria; (9) Umbria. The position of the first object in the data set indicates the starting point of OPTICS. The longest edge is the starting point of OETICS. The black filled squares indicate the approximate position of the clusters of Fig. 17.

Fig. 19. Data set Oliveoils. Result with OPTICS. ε, generating distance = 3; MinPoints = 10.

Table 4 shows the results (number of clusters and of outliers, i.e. noise) obtained with different values of the settable parameters of OPTICS. In this case, with a small number of objects, it was possible to study the effect of these parameters extensively. This is not so easy when the number of objects is large, because the computing time increases sharply (it is proportional to the square of the number of objects). The number of clusters and outliers is not enough to decide the optimum value of the parameters. It is usually necessary to draw the corresponding plots of the reachability distance. In this case the choice MinPoints = 2 and ε = 4.5 was due to the previous knowledge of the existence of two categories.
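The kind of exploration summarized in Table 4 can be sketched in a few lines. The sketch below is not the OPTICS implementation used in this work: it substitutes a plain DBSCAN-style density clustering at a fixed ε, which is a simpler technique. The function names `dbscan_counts` and `scan`, the toy points, and the convention that MinPoints counts the object itself are assumptions made only for illustration. The sketch does show why each parameter pair costs a time proportional to the square of the number of objects, since all pairwise distances are evaluated.

```python
from math import dist


def dbscan_counts(points, eps, min_points):
    """Return (number of clusters, number of noise objects).

    DBSCAN-style stand-in for an OPTICS cut at distance eps.
    Assumption: min_points includes the object itself.
    """
    n = len(points)
    # Neighbourhood of every object: n * n distance evaluations.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_points for nb in neighbours]
    label = [None] * n  # None means "not assigned to any cluster"
    n_clusters = 0
    for i in range(n):
        if label[i] is not None or not core[i]:
            continue
        # Grow a new cluster starting from the core object i.
        label[i] = n_clusters
        stack = [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue  # border objects join a cluster but do not expand it
            for q in neighbours[p]:
                if label[q] is None:
                    label[q] = n_clusters
                    stack.append(q)
        n_clusters += 1
    n_noise = sum(1 for lab in label if lab is None)
    return n_clusters, n_noise


def scan(points, eps_values, min_points_values):
    """Tabulate (clusters, noise) for every (eps, min_points) pair."""
    return {
        (eps, m): dbscan_counts(points, eps, m)
        for eps in eps_values
        for m in min_points_values
    }


# Toy data (hypothetical): two compact groups and one isolated object.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
grid = scan(points, eps_values=(0.5, 1.0, 1.5, 8.0), min_points_values=(2, 3, 4, 5))
for (eps, m), (clusters, noise) in sorted(grid.items()):
    print(f"eps={eps:<4} MinPoints={m}  clusters={clusters}  noise={noise}")
```

As in Table 4, small values of ε leave every object as noise, and a large ε merges everything into one cluster, so the counts alone do not identify the best setting.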

7.2. Oliveoils

The results are reported in Figs. 17–19. The differences between the plots of OETICS and OPTICS are due to the smoothing character of the latter and to the different starting point. OETICS starts from the longest edge, which connects a sample of East Liguria to a sample of West Liguria. This is a terminal edge, so no rearrangement is required. The minimum spanning tree continues in the part of the space rich in samples of West Liguria. OPTICS starts with sample number 1, of North Apulia, and so begins by ordering samples of North Apulia. There are, however, many points of similarity. The “valleys” corresponding to a high density of points, i.e. to “density clusters”, are more or less the same. The right part of the OETICS plot corresponds to the left part of the PC plot of Fig. 18, with many samples of South Apulia and some samples of Sicily, with relatively small density, and with some zones where there is a relatively high density of samples from Calabria or North Apulia. This part of the OETICS plot corresponds to the part of the OPTICS plot to the left of the cluster of Coast Sardinia. It seems that OPTICS separates the samples of Calabria better. However, the cluster of Calabria shown in Fig. 19 contains many samples from Sicily. The two small clusters of Calabria in Fig. 17 contain only samples of Calabria, but fewer than the OPTICS cluster.

Fig. 20. Data set Breastcancer. Result with OETICS.

Fig. 21. Data set Breastcancer. Result with OPTICS. ε, generating distance = 3; MinPoints = 10.

7.3. Breastcancer

The results are reported in Figs. 20 and 21. OETICS starts from a terminal edge, so no rearrangement was necessary. There are many identical objects in Breastcancer, so the corresponding edge lengths or reachability distances are zero. The plots of OETICS and of OPTICS are very similar; the only important difference is a consequence of the different starting point. The interval marked by arrows in Fig. 20 contains almost all the samples of the category Benign (444 out of 458), together with only 17 Malignant samples.
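Two of the remarks above, that the longest edge can be a terminal edge and that identical objects produce edges of zero length, can be illustrated with a small sketch. It builds the minimum spanning tree with Prim's algorithm [5] on hypothetical toy data and inspects the longest edge. It is not the MST program of V-PARVUS and does not reproduce the OETICS ordering or rearrangement steps. The names `prim_mst` and `longest_edge_info` are illustrative, and "terminal edge" is assumed here to mean an edge with at least one end point of degree 1 in the tree.

```python
from math import dist


def prim_mst(points):
    """Return the MST edges as (length, i, j), using Prim's algorithm in O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n  # shortest known distance from each object to the tree
    parent = [-1] * n
    best[0] = 0.0              # start growing the tree from object 0
    edges = []
    for _ in range(n):
        # Pick the object outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Update the distances of the remaining objects to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def longest_edge_info(edges, n):
    """Return the longest edge and whether it is terminal.

    Assumption: terminal means that one end point has degree 1 in the tree.
    """
    degree = [0] * n
    for _, i, j in edges:
        degree[i] += 1
        degree[j] += 1
    length, i, j = max(edges)
    return (length, i, j), (degree[i] == 1 or degree[j] == 1)


# Toy data (hypothetical): two pairs of identical objects and one distant object.
points = [(0, 0), (0, 0), (1, 0), (1, 0), (2, 0), (9, 0)]
edges = prim_mst(points)
longest, is_terminal = longest_edge_info(edges, len(points))
zero_edges = sum(1 for length, _, _ in edges if length == 0.0)
print("longest edge:", longest, "terminal:", is_terminal)
print("zero-length edges:", zero_edges)
```

In this toy case the distant object hangs from the tree by the longest edge, which is therefore terminal, and each pair of identical objects contributes one edge of zero length.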

8. Conclusions

The OETICS algorithm, based on ordering the edges of the minimum spanning tree connecting the objects, seems to offer an alternative to OPTICS, the original clustering technique based on the local density of objects and on two selectable parameters. OETICS is rigid (except for the possibility of smoothing the edge lengths in the final plot), so it does not require the exploratory work that is frequently necessary in OPTICS to find the optimum values of the generating distance and of the minimum cardinality. The solution of OETICS is unique, i.e. it does not depend on the order of the objects. The results of the two techniques are more or less equivalent. On the other hand, OETICS does not have the same flexibility as OPTICS, and it is rather sensitive to the local differences in the length of edges and to the local details that the OPTICS procedure eliminates.

OETICS is contained in the program MST (minimum spanning tree) of V-PARVUS [8].


Acknowledgements

This study was developed with funds from the EU project WineDB, from the University of Genova, and from CNR (National Research Council of Italy).

References

[1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, 1998.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990.
[3] M. Forina, C. Casolino, S. Lanteri, Annali Chim. (Rome) 93 (2003) 55.
[4] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, “OPTICS: Ordering Points To Identify the Clustering Structure”, in: Proceedings of the ACM SIGMOD’99 International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60.
[5] R.C. Prim, Bell Sys. Tech. J. 36 (1957) 1389.
[6] J.B. Kruskal, Proc. Am. Math. Soc. 7 (1956) 48.
[7] H. Streuli, Aroma und Geschmacksstoffe in Lebensmitteln, Foster Verlag AG, Zurich, Switzerland, 1967.
[8] M. Forina, S. Lanteri, C. Armanino, C. Cerrato Oliveros, C. Casolino, V-PARVUS 2003, An extendable package of programs for explorative data analysis, classification and regression analysis, Dip. Chimica e Tecnologie Farmaceutiche, University of Genova, available (free, with manual and examples) at http://www.parvus.unige.it.
[9] M. Forina, E. Tiscornia, Annali Chim. (Rome) 72 (1982) 143–155.
[10] J. Zupan, M. Novic, X. Li, J. Gasteiger, Anal. Chim. Acta 292 (1994) 219–234.
[11] P.K. Hopke, D.L. Massart, Chemom. Intell. Lab. Syst. 19 (1993) 35–41.
[12] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, in: Proceedings of the National Academy of Sciences, USA, vol. 87, December 1990, pp. 9193–9196.
[13] M. Forina, S. Lanteri, I. Esteban Díez, Anal. Chim. Acta 446 (2001) 59.
