MACROS FOR DETERMINING THE OPTIMAL CLUSTERS IN LARGE DATA SETS
B. GERARDIN, J.L. MOLLIERE, Electricite de France

Introduction
In most cases, before using a clustering technique, the user has no prior idea of the number of clusters that will give the best differentiation of his data.
This unknown number may correspond to some real, hidden structure of the data which the user may find highly desirable to uncover.
Most often, the degree of partitioning has no prior meaning until the user interprets the different clusters that the method produces. However, the user is very interested in summarizing his data in the best possible way, that is to say, in finding a compromise between a good degree of differentiation and a not too high number of clusters.
A usual way of determining this optimal number of clusters is to look at the squared multiple correlation R2 (the sum of squares between all clusters divided by the total sum of squares) for partitions corresponding to different numbers of clusters. A plot of R2 against the number of clusters may suffice to identify the desired number of clusters.
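The R2 described here can be illustrated with a short sketch (Python rather than SAS, and one-dimensional data for brevity; since the total sum of squares is fixed, the between/total ratio is computed as 1 minus the within/total ratio):

```python
def r_squared(points, labels):
    """R2 = between-cluster SS / total SS = 1 - within-cluster SS / total SS."""
    grand = sum(points) / len(points)
    total = sum((x - grand) ** 2 for x in points)
    within = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        mean = sum(members) / len(members)
        within += sum((x - mean) ** 2 for x in members)
    return 1.0 - within / total

# two tight, well-separated clusters give an R2 close to 1
r_squared([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
```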
In SAS 82.3 a new criterion, the cubic clustering criterion (CCC), is calculated, which provides very clear and useful information (particularly from a statistical point of view).
In either case (R2 or CCC), it is necessary to compute as many clusterings as there are different numbers of clusters required for plotting the R2 or the CCC.
Here lies the advantage of hierarchical methods (illustrated by PROC CLUSTER), which, in a single execution, give as many clusterings as there are levels in the hierarchy. However, PROC CLUSTER of SAS 82.3 quickly becomes very CPU-time consuming as the number of observations increases, and must hence be restricted to data sets which are not too large (no more than a few hundred observations).
On the other hand, the very efficient new procedure FASTCLUS of SAS 82.3 may also prove expensive when used repeatedly to produce clusterings with different numbers of clusters.
It is easy to combine the respective advantages of these two PROCs (and of some others) in three MACROS, described below; they will help the user obtain the clustering he is looking for, and provide him with supplementary information about the structure of his data.
A. METHODOLOGY IN MACROS MCLAS1, MCLAS2, MCLAS.
I. Choosing the number of clusters
MACRO MCLAS1
1) The principle of mixed clustering.
On large data sets a useful methodology consists of first summarizing the observations into a large enough number of clusters (100 may be a standard value) and then applying a hierarchical clustering technique to aggregate these groups.
This procedure has the advantages of the hierarchical method for revealing an optimal number of clusters, and solves the difficulty of the too high initial number of observations by first clustering them, using a non-hierarchical method, into a smaller number of clusters. This number is a parameter of the procedure; it must be high enough so as not to impose a prior partitioning on the data.
MACRO MCLAS1 implements this mixed clustering technique, combining the use of PROC FASTCLUS for the initial partitioning into a fixed number of clusters (100 is the default value) and PROC CLUSTER for the second step, where Ward's method is used.
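The two-stage scheme can be sketched as follows. This is an illustrative Python analogue, not the SAS macros themselves: a plain k-means pass stands in for PROC FASTCLUS (with deterministic, evenly spaced seeds for reproducibility, an assumption of this sketch; FASTCLUS selects its seeds differently), and a greedy Ward-style merging stands in for PROC CLUSTER, on one-dimensional data for brevity.

```python
def kmeans(points, k, iters=20):
    """Stage 1 (in the spirit of PROC FASTCLUS): partition the raw
    observations into at most k clusters by plain k-means."""
    # deterministic seeds spread through the input (sketch assumption)
    centres = points[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centres[c]) ** 2)
            groups[nearest].append(p)
        centres = [sum(g) / len(g) if g else centres[c]
                   for c, g in enumerate(groups)]
    return [g for g in groups if g]

def sse(cluster):
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def ward_merge(clusters):
    """Stage 2 (in the spirit of PROC CLUSTER with Ward's method):
    repeatedly fuse the pair of clusters whose union least increases
    the within-cluster sum of squares; return every level."""
    levels = [list(clusters)]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: sse(clusters[ij[0]] + clusters[ij[1]])
                   - sse(clusters[ij[0]]) - sse(clusters[ij[1]]))
        clusters = ([c for n, c in enumerate(clusters) if n not in (i, j)]
                    + [clusters[i] + clusters[j]])
        levels.append(list(clusters))
    return levels
```

With 100 initial clusters, the list of levels plays the role of the hierarchy whose CCC and semi-partial R2 plots are then examined.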
PROC TREE then produces a tree diagram showing the "heights" at which clusters join in the hierarchy, which may be graphically very clear for cutting the tree at the optimal level. For finding this optimum, two plots will also be very useful: the first plots the CCC criterion, and the second the semi-partial R2 (the decrease in R2 caused by joining two clusters), against the number of clusters. The last plot in fact contains the same information as that displayed on the tree diagram.
The two approaches for determining the number of clusters are thus that of the R2 and that of the CCC (which is itself linked to the R2).
Figures 1, 2 and 3 show the diagrams produced by MACRO MCLAS1 on real data (population 1 of the examples of part B).
Figure 1. Plot of the CCC against the number of clusters, showing 6 clusters (Population 1). (Output of MACRO SAS MCLAS1, 18:21 Friday, March 23, 1984.)
Figure 2. Plot of the number of clusters against the semi-partial R2, showing 6 clusters (Population 1).
Figure 3. Tree diagram (printout of the hierarchical tree) showing 6 clusters (Population 1).
2) Measuring the stability of the results
As is well known, non-hierarchical algorithms give, for a fixed number of clusters, only local solutions to the general problem of maximizing the R2 coefficient. In some algorithms the local optimum obtained depends essentially on the choice of the initial "seeds" or "kernels" or "centres" of the different clusters, which are often chosen randomly among the observations.
The method used in PROC FASTCLUS, directly inspired by MacQueen's k-means algorithm, is much less sensitive to a random choice, since the initial seeds are in fact found by the algorithm itself during the first iteration, in such a way that they guarantee a good initialization. However, the optimum obtained still depends on the rank order of the observations in the input file.
A natural question is then to ask about the stability of the results when the initial order of the observations is changed. This question is answered in MACRO MCLAS1 (and MCLAS).
To obtain a new result with the same clustering procedure (FASTCLUS), a new file is generated in which the observations are ranked following a random rearrangement of their ranks in the initial file. This is done using the random-number function UNIFORM.
MACRO MCLAS1 may be used on the same data with two options for the random rearrangement of the initial file: repeatable (the random sequence will always be the same in different executions) or non-repeatable (the random sequence is initialized from the computer clock).
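The two options can be mimicked with a seeded generator (a Python sketch; the SAS macros achieve this with the UNIFORM function in a DATA step):

```python
import random

def rearranged(observations, seed=None):
    """Return the observations in a random new order.

    seed given  -> repeatable: the same sequence in every execution
    seed absent -> non-repeatable: the generator is seeded from the clock/OS
    """
    rng = random.Random(seed)      # clock/OS-seeded when seed is None
    shuffled = list(observations)
    rng.shuffle(shuffled)
    return shuffled

rearranged(range(10), seed=1984)   # repeatable trial
rearranged(range(10))              # non-repeatable trial
```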
The hierarchy obtained in MCLAS1 may be kept in two output SAS data sets, making it possible to completely define the clusterings at any level of the tree in terms of the initial observations and variables.
The practical importance of finding the right number of clusters requires that the method give some measure of its own robustness, which is what MACRO MCLAS1 proposes here.
II. The Optimisation phase (fixed number of clusters)
MACRO MCLAS2
MACRO MCLAS1 normally gives an optimal number of clusters. There remains the problem of getting the best possible clusters, i.e. maximizing the R2 criterion. MACRO MCLAS1 gives a very good starting solution, consisting of the partition corresponding to the optimal level of the hierarchical tree.
Optimization can be achieved either with MACRO MCLAS2, using PROC FASTCLUS on the centres of the previous clustering, or with MACRO MCLAS2B, using a new SAS procedure (written in FORTRAN 77), PROC ZTRANS, which applies the exchange method due to S. Regnier (Paris, 1964-1966). It can easily be proved that such an exchange method, which performs every transfer of an observation from one cluster to another that increases the R2 criterion, is able to improve the optimum obtained by the iterative algorithm of FASTCLUS, whereas the converse is not possible.
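The exchange principle can be sketched in a few lines. This is an illustrative Python analogue of the transfer algorithm, not PROC ZTRANS itself; minimizing the within-cluster sum of squares is equivalent to maximizing R2, since the total sum of squares is fixed.

```python
def within_ss(points, labels, k):
    """Total within-cluster sum of squares; R2 = 1 - WSS/TSS."""
    wss = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            mean = sum(members) / len(members)
            wss += sum((x - mean) ** 2 for x in members)
    return wss

def exchange(points, labels, k):
    """Perform every transfer of one observation to another cluster
    that lowers WSS (i.e. raises R2), until no such transfer exists."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            best = within_ss(points, labels, k)
            for c in range(k):
                if c != labels[i]:
                    trial = labels[:i] + [c] + labels[i + 1:]
                    if within_ss(points, trial, k) < best - 1e-12:
                        labels, best, improved = trial, within_ss(points, trial, k), True
    return labels
```

Started from any partition (for instance a FASTCLUS local optimum), the loop can only leave R2 unchanged or improve it.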
These two MACROS must be executed after MACRO MCLAS1 (or MCLAS). They use the output data sets of MACRO MCLAS1 as input data sets and only require the user to specify the number of clusters chosen at the end of the previous step (MACRO MCLAS1).
They produce an output SAS data set which includes the final clustering variable. This data set has the structure of the OUT data set of PROC FASTCLUS.
III. On the comparison of different clusterings
(fixed number of clusters).
MACRO MCLAS
The sequence we have just shown, running MACRO MCLAS1 to determine the number of clusters and then MACRO MCLAS2 to optimize the clusters at the optimal level of the hierarchical tree, could seem to solve the user's practical problem completely.
Of course a solution has been found, and probably a good enough one, at a very low cost. However, it may be very useful, and in some cases necessary, to go further.
The main result at the end of this step of our investigation is the indication, given by MACRO MCLAS1, of the optimal number of clusters.
On the other hand, the optimized clusters obtained by MACRO MCLAS2 must only be considered as one solution corresponding to a local optimum of the optimization problem. MACRO MCLAS will provide a means of getting nearer to the global optimum.
Based on theoretical results by E. Diday (1974), this approach will define the new notion of the "strong pattern" which, indirectly, will also prove useful for studying the structure of the population of observations.
1) Different trials using FASTCLUS
(fixed number of clusters).
Direct use of PROC FASTCLUS is now possible, since the number of clusters has been determined. MACRO MCLAS will run five executions of PROC FASTCLUS under the same conditions, changing only the order of the observations in the input file (which, in each trial, is randomly generated from the initial order).
Of course the five clusterings obtained will be different. This is because the solution given by PROC FASTCLUS is always a local optimum depending on the initialization. In fact only the best solution, maximizing the R2 criterion, is required, and perhaps the optimized clustering given by MACRO MCLAS2 was better than any of those obtained here.
The real interest of these repeated trials is rather that they permit us to evaluate how much the FASTCLUS procedure depends on the initialization, i.e. to measure its robustness with regard to our particular data. Good stability of the results means that the data are strongly structured. On the other hand, great instability corresponds to a population of observations that is not easy to cluster, that is, weakly structured.
A natural way of measuring the differences between different partitions into the same number of classes is to consider the "strong patterns" resulting from these partitions; the definition and practical interest of this notion will now be outlined.
2) Considering the "strong patterns"
The "strong patterns" resulting from several partitions P1, P2, ..., Pn are the clusters forming a new partition, the intersection of P1, P2, ..., Pn.
As a consequence of this definition, the observations of a same strong pattern are always together in the same cluster in each of the different partitions considered.
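In code, the intersection partition is just a relabelling of the tuples of cluster labels (a minimal Python sketch, assuming each partition is given as a vector of cluster labels over the same observations):

```python
def strong_patterns(*partitions):
    """Observations sharing the same tuple of labels across all the
    partitions P1, ..., Pn form one strong pattern: the returned vector
    labels the intersection partition."""
    keys = list(zip(*partitions))             # one label tuple per observation
    ids = {key: n for n, key in enumerate(dict.fromkeys(keys))}
    return [ids[key] for key in keys]

p1 = [0, 0, 0, 1, 1, 1]
p2 = [0, 0, 1, 1, 1, 0]
sp = strong_patterns(p1, p2)   # four strong patterns: {0,1}, {2}, {3,4}, {5}
```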
Strong patterns therefore seem well suited to reveal, inside the population, stable parts (large strong patterns) and unstable parts (small strong patterns); the case where all strong patterns are small corresponds to a population having no real structure, which therefore cannot be clustered.
In MACRO MCLAS, all the strong patterns resulting from the five partitions are calculated and their numbers of observations printed. Emphasis is put on two quantities which seem characteristic enough of the structure of the data:
a) the total number of strong patterns: the higher this number, the weaker the structure of the population.
b) the weight, in the total population, of the strong patterns of a minimum importance (i.e. those containing at least 2.5% of the total number of observations). This weight is expressed as a percentage.
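Both quantities are straightforward to compute from the strong-pattern labels (a Python sketch; the 2.5% threshold matches the definition above):

```python
from collections import Counter

def strong_pattern_summary(sp_labels, threshold=0.025):
    """Return (a) the total number of strong patterns and (b) the weight,
    as a percentage of the population, of the strong patterns holding at
    least `threshold` (2.5% by default) of the observations."""
    n = len(sp_labels)
    sizes = Counter(sp_labels).values()
    heavy = sum(s for s in sizes if s >= threshold * n)
    return len(sizes), 100.0 * heavy / n

# four strong patterns of sizes 50, 45, 3 and 2 over 100 observations:
strong_pattern_summary([0] * 50 + [1] * 45 + [2] * 3 + [3] * 2)
```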
The partition corresponding to the strong patterns is kept in an output SAS data set of MACRO MCLAS (analogous to the OUT data set of PROC FASTCLUS).
IV. Clustering the strong patterns
MACRO MCLAS
1) Some theoretical results (E. Diday, 1974).
Let $P^*$ denote a partition of the set $E$ of all the observations, corresponding to a local optimum (as given by PROC FASTCLUS).
Let us consider the strong patterns $A_1, \ldots, A_m$ obtained from $(q-1)$ partitions $P^{*1}, \ldots, P^{*q-1}$ into $k$ clusters, and let $P^{*q}$ be a new partition (whose classes are $P_1^q, \ldots, P_n^q$).
Let
$$P(j/i) = \frac{\operatorname{card}(A_i \cap P_j^q)}{\operatorname{card} A_i}$$
This quantity expresses the probability that an element is in $P_j^q$, knowing that it is in $A_i$.
One can measure the information carried by $P^{*q}$ knowing $A_1, \ldots, A_m$:
$$I(P^{*q}/P^{*1}, \ldots, P^{*q-1}) = - \sum_{i=1}^{m} \frac{\operatorname{card} A_i}{\operatorname{card} E} \sum_{j=1}^{n} P(j/i) \log P(j/i)$$
If the partition of the strong patterns $A_1, \ldots, A_m$ is thinner than the partition $P^{*q}$, the information brought forward by $P^{*q}$ is zero (since $P(j/i) = 1$ or $0$).
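Under the same label-vector representation, the information quantity $I$ is the weighted conditional entropy of the new partition given the strong patterns (a Python sketch of the definition above, on toy data):

```python
import math
from collections import Counter

def information(sp_labels, new_labels):
    """I(P^q given A_1, ..., A_m) = - sum_i (|A_i|/|E|) sum_j P(j/i) log P(j/i),
    with P(j/i) = |A_i intersect P_j^q| / |A_i|.  It is zero exactly when
    every strong pattern lies inside a single class of the new partition."""
    n = len(sp_labels)
    info = 0.0
    for a in set(sp_labels):
        members = [new_labels[t] for t in range(n) if sp_labels[t] == a]
        weight = len(members) / n
        for count in Counter(members).values():
            p = count / len(members)
            info -= weight * p * math.log(p)
    return info

# strong patterns thinner than the new partition: no new information
information([0, 1, 2, 3], [0, 0, 1, 1])
```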
The invariance of the strong patterns is assured for $n = q$ if, for all $n \geq q$, one has $I(P^{*n}/P^{*1}, \ldots, P^{*n-1}) = 0$.
We can then note that the number
$$J(k) = \sum_{q=2}^{n} I(P^{*q}/P^{*1}, \ldots, P^{*q-1})$$
gives an idea of the value of the choice of the number $k$ of classes requested; the smallest value of $J(k)$ corresponds to the best choice of $k$.
Theorem
If $I(P^{*n}/P^{*1}, \ldots, P^{*n-1}) = 0$ for every $n$ such that $q \leq n \leq N$, where $N$ is the maximum number of local optima, then the partition of the strong patterns of $P^{*1}, \ldots, P^{*q}$ is thinner than the partition corresponding to the global optimum.
2) Using the information $I$
In MACRO MCLAS the number $q$ of partitions is fixed at five. This number will rarely be high enough (only if the clusters are very disjoint) for verifying $I(P^{*q}/P^{*1}, \ldots, P^{*q-1}) = 0$.
Nevertheless it will be interesting to look at the decrease of the information brought by $P^{*q}$, from q = 2 to q = 5. These information quantities are therefore computed and printed in MACRO MCLAS, together with their sum J(k). This last number may be very useful if there is still some doubt about the number of classes. It is easy to run MACRO MCLAS for the two numbers in balance, e.g. k1 and k2, and to choose the one for which J(k) is minimum.
An important ability of MACRO MCLAS is that it can be used not only for determining the strong patterns of the first five partitions, but also for bringing the information of five new partitions forward to already existing strong patterns (resulting from n previous partitions). For this, the MACRO only needs the partition of the existing strong patterns as a variable in an input data set. The information quantity $I(P^{*q}/P^{*1}, \ldots, P^{*q-1})$ is then calculated for the values q = n+1, ..., q = n+5.
3) Hierarchical clustering
Another important result expressed by the theorem of paragraph 1) is that the optimal clustering (corresponding to the global optimum) can be obtained as a partition P in which every class of P is the union of some strong patterns. Of course the theorem refers to the strong patterns resulting from a number q of partitions such that $I(P^{*q}/P^{*1}, \ldots, P^{*q-1}) = 0$.
Even if one is far from this value of q, it is worthwhile trying to cluster the strong patterns in the best possible way into k classes. In doing so, there is a chance of getting nearer to the global optimum.
The last problem is then to get a good clustering of the strong patterns.
MACRO MCLAS carries out a hierarchical clustering using PROC CLUSTER with Ward's method. An important remark is that the hierarchical tree (with the R2 and CCC criteria) built on the strong patterns should not be used for determining the optimal number of clusters in the population of observations.
However, when MACRO MCLAS1 does not give any significant results, the hierarchical tree of the strong patterns can bring useful supplementary information for determining the optimal number of clusters.
The condition for the initial partitions not to impose a prior structure on the hierarchical clustering of the strong patterns is that the strong patterns be numerous enough. This will be the case when the population of observations cannot be easily clustered.
On the other hand, it cannot be proved that the partition obtained at the level of k classes of the hierarchy of the strong patterns necessarily improves on the best of the initial partitions into k classes. However, after optimization in MCLAS2, the final partition in most cases shows a better value of the R2 criterion than the initial partitions.
On the contrary, the exchange clustering method applied to the strong patterns, and initialized with their positions in the classes of the best of the initial partitions, necessarily guarantees an increase of the R2 criterion of that partition. This suggests using PROC ZTRANS (which carries out this transfer algorithm) instead of PROC CLUSTER for clustering the strong patterns when the number of clusters is known; this methodology is realized in MACRO MCLAS2B.
Coming back to the printed outputs of MACRO MCLAS, it seemed interesting to compare the aggregations of the strong patterns (especially those with a high number of observations) in the hierarchical tree with their clustering inside the initial partitions. Therefore cross-tables of the strong patterns with each of the initial partitions are produced by MACRO MCLAS.
V. Selecting a more stable part of the population of observations
A relative instability of the results often appears when running MACRO MCLAS1 or MACRO MCLAS several times. For example, as far as the number of clusters is concerned, some executions will give significant results (a local maximum on the plot of the CCC in agreement with the indication of the semi-partial R2), confirming the existence of one (sometimes more than one) optimal value, while other executions will be much less informative, giving indications which are too weak.
Obtaining significant results may even, in some cases, be very difficult. This is clearly due to the structure of the population of observations, which may or may not be easily clustered. Obviously a population structured into quite disjoint clusters would show great stability of the results obtained by MACRO MCLAS1 or MCLAS, whatever the initialization.
As was pointed out in paragraph III 2), strong patterns with a high number of observations, if they exist,
correspond to a strongly structured part of the whole population.
Having established, from several trials, that the unstable part (strong patterns with a small number of
observations) always consists of approximately the same set of observations, it is suggested that these
observations be removed from the population.
Running the clustering MACROS again, on the stable part of the initial population, should lead to more
stable results and thus clearly indicate an optimal number of clusters.
The unstable observations would then be classified into the clusters obtained on the stable part of the population.
The strong patterns thus also prove very useful for revealing the observations (or groups of
observations) to which the clustering procedure is the most sensitive.
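This two-step scheme (cluster the stable part, then assign each removed observation to the nearest cluster) can be sketched as follows. A minimal Python illustration with invented data, not part of the MACROS themselves:

```python
# Sketch of the suggested scheme: cluster only the stable part of the
# population, then assign each removed ("unstable") observation to the
# nearest centroid of the clusters found on the stable part.
# The data and the stable/unstable split below are invented.
import math

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def assign_to_nearest(x, centroids):
    """Index of the centroid closest to x (Euclidean distance)."""
    dists = [math.dist(x, g) for g in centroids]
    return dists.index(min(dists))

# Clusters obtained on the stable part of the population:
stable_clusters = [[[0.0, 0.0], [0.2, 0.1]],      # cluster 0
                   [[5.0, 5.0], [5.1, 4.8]]]      # cluster 1
centroids = [centroid(c) for c in stable_clusters]

# Unstable observations, classified afterwards:
unstable = [[0.3, 0.2], [4.7, 5.2]]
labels = [assign_to_nearest(x, centroids) for x in unstable]
print(labels)  # -> [0, 1]
```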
B. EXAMPLES
The same population will be clustered on two different sets of questions: the first set structures the
population strongly, whereas on the second set no prior structure is present. The observations consist of 1095
individuals questioned in a survey on the themes of the energy crisis, nuclear power stations, and the image, in
public opinion, of the national electricity producing company (Electricite de France). The variables
considered in the first example are a homogeneous set of questions about nuclear power stations and nuclear
energy, all answered on a common scale with five modalities: high agreement, mean agreement, low
agreement, no agreement at all, and no response.
In the second example, the variables are questions intended to analyze the image of the company in public
opinion. They form a much less homogeneous set of variables, essentially qualitative (no graduated scale of
response) and with various numbers of modalities.
In both cases the correspondence analysis method has been used to summarize the set of qualitative
answers into a small number of quantitative coordinates on the principal components (five in the first population,
five in the second). These new variables, called factors, are those considered for the clustering of the
individuals.
In example 1, where the population was questioned on the nuclear theme, correspondence analysis revealed a Guttman
scale of the responses, which were all highly correlated. A unique common scale graduating the intensity of
agreement with each of the proposed assertions is sufficient to differentiate the individuals. This strong
unique dimension will induce a very pronounced typology, approximately corresponding to the different degrees
of agreement on the common scale, with extra clusters for degrees of indifference.
On the other hand, the correspondence analysis of the population questioned on the company-image theme
shows at least four different dimensions. This second example, for the clustering MACROS, does not
contain a strong prior structure, as example 1 did along a unique dimension.
It will be interesting to test the methodology developed by MACROS MCLAS1 and MCLAS on these two
examples.
Example 1: Nuclear theme
Results from MACRO MCLAS1, corresponding to ten different random initializations, are compiled in the
following table:
Trial n°   Optimal number of     Optimal number of   R2 value at the   CCC value at the
           clusters by the CCC   clusters by R2      optimal level     optimal level
   1            (9)                   5                    -                 -
   2             6                   (7)                 0.6468            12.65
   3            (9)                  6 (9)                 -                 -
   4             -                    5                    -                 -
   5             6                   5 (6)               0.6210            12.73
   6             6                   5 (6)               0.6329            16.14
   7             7                    7                  0.6373            17.42
   8            (9)                   5                    -                 -
   9             6                   (6)                 0.6203            12.53
  10            (8)                   8                    -                 -
For the CCC criterion only true local maxima are indicated. A number of clusters between parentheses
indicates no local maximum, but only a step of the curve at this level.
For the R2 criterion the choice is made at the level preceding a "significant" increase of the semi-partial R2
at the next node, which means that the R2 decreases substantially more after this merging of two classes than
after the preceding ones. The numbers labelled on the curve are the numbers of clusters after each merging of
two classes; this merging causes a decrease of the R2 equal to the coordinate of the node on the horizontal
semi-partial R2 axis.
Such an interpretation can be made, for example, on figure 2. When two choices are possible, the second one is
put between parentheses in the table.
When the two criteria disagree, it is recommended to choose the value indicated by the CCC, if it
corresponds to a true local maximum, the R2 criterion being of course more subjective.
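The two criteria can be illustrated numerically (a Python sketch with invented data, not the SAS computation): R2 is the between-cluster sum of squares divided by the total sum of squares, and the semi-partial R2 of a node is the decrease of R2 caused by the corresponding merge of two clusters.

```python
# R2 = between-cluster sum of squares / total sum of squares;
# semi-partial R2 of a node = decrease of R2 caused by merging
# two clusters (this is the height of the node in the tree).

def sums(points):
    """Centroid and sum of squared deviations of a set of points."""
    g = [sum(p[d] for p in points) / len(points)
         for d in range(len(points[0]))]
    ss = sum(sum((x[d] - g[d]) ** 2 for d in range(len(g)))
             for x in points)
    return g, ss

def r2(clusters):
    """R2 of a partition given as a list of clusters of points."""
    everything = [x for c in clusters for x in c]
    _, total_ss = sums(everything)
    within_ss = sum(sums(c)[1] for c in clusters)
    return 1.0 - within_ss / total_ss   # between SS / total SS

a = [[0.0], [0.2]]
b = [[1.0], [1.2]]
c = [[5.0], [5.2]]

r2_three = r2([a, b, c])
r2_two = r2([a + b, c])               # merge the two closest clusters
semi_partial_r2 = r2_three - r2_two
assert 0.0 < semi_partial_r2 < r2_three
```

A "significant" jump of the semi-partial R2 at a node therefore signals that the merge just performed joined two clusters that should have been kept apart.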
The main results of MACRO MCLAS1, from this table, are:
- the strong indication of the number of 6 clusters (local maxima observed in four cases out of ten, and
generally in good agreement with the R2 criterion);
- the weak presumption of the existence of 7 clusters;
- the bad inference of 5 clusters that could have been made by considering only the R2 criterion.
This partition into five classes is a rather crude one, corresponding to the number of modalities of the different
questions; it is the first natural clustering of the population on the common scale: high agreement, mean
agreement, low agreement, no agreement at all, and no response;
- a question about the number of 9 clusters, because of its repeated occurrence, although this number does
not give a true local maximum. In other respects, this number has shown some pertinence when a
finer partition is desired.
Results from MACRO MCLAS are now collected in the two following tables, the first for initial partitions of 6
clusters, the second for initial partitions of 7 clusters.
Initial partitions of 6 clusters

Trial  Optimal number  Optimal number  R2 at the  CCC at the  Number of  Number of obs. in  Maximum R2 of  Infor-
n°     of clusters by  of clusters by  6-cluster  6-cluster   strong     important strong   the initial    mation
       the CCC         R2              level      level       patterns   patterns           partitions     J
  1         7               7           0.6407      13.47        31           1024             0.6612      missing
  2         7               7           0.6498      15.38        27           1055             0.6615       0.324
  3         7               7           0.6443      14.32        40           1005             0.6613       0.417
  4         7               7           0.6482      15.11        29           1038             0.6616       0.348
  5         7               7           0.6293      11.18        44           1039             0.6612       0.420
Initial partitions of 7 clusters

Trial  Optimal number  Optimal number  R2 at the  CCC at the  Number of  Number of obs. in  Maximum R2 of  Infor-
n°     of clusters by  of clusters by  7-cluster  7-cluster   strong     important strong   the initial    mation
       the CCC         R2              level      level       patterns   patterns           partitions     J
  1         8               8           0.6715      15.90        48            939              0.6867      0.453
  2         8               8           0.6708      16.21        45            901              0.6878      0.402
  3        (7)             (7)         (0.6843)    (18.60)      (29)          1047              0.6860     (0.109)
  4         9               9           0.6740      18.10        47            899              0.6875      0.381
  5         8               8           0.6740      16.91        40            968              0.6875      0.279
Several remarks can be made:
- the number k of clusters of the initial partitions very often induces an optimal level of the hierarchy of the
strong patterns at (k + 1) clusters. In the case of a structured population of observations with a small number
of strong patterns, the hierarchical tree of these strong patterns must not be used for choosing the optimal
number of clusters in the population;
- the best clustering of the strong patterns into k clusters is obtained when the optimal level of the hierarchy
obtained from the initial partitions of k clusters also corresponds to k clusters. It can also be noted in this
case (trial n° 3 for k = 7) that the number of strong patterns is small (29 is the smallest value of the five
trials);
- bad results in the clustering of the strong patterns into k clusters often correspond to a high number of
strong patterns, especially of those with a small number of observations;
- another criterion of the best clustering of the strong patterns is the lowest value of the cumulated
information J.
The confirmation of the optimal number of 6 clusters can be given by MACRO MCLAS. Comparing the values
of the function J(k) (the cumulated information brought by the successive partitions into k classes), it is found
that J(6) ≈ 0.47 < J(7) ≈ 0.72 (the information J being calculated on the intersection of 15 initial partitions).
Example 2: Company-image theme
The same methodology was applied, using first MACRO MCLAS1 to suggest values of the optimal number
of clusters, then MACRO MCLAS to settle the choice and give the final clustering.
Compared with the population of example 1, this new population, as could be expected, shows some
different results.
Figure 4 shows the plot of the CCC by MACRO MCLAS1, giving very negative values of the CCC, which means
that there are no significant clusters in the population.
[Figure 4: line-printer plot of the CCC against the number of clusters (output of MACRO MCLAS1,
run of March 30, 1984). All CCC values are negative, roughly between -2.5 and -22.5, with no
local maximum.]
Figure 4. Plot of the CCC showing no significant clusters. Population 2
Four trials by MACRO MCLAS1 have been run, with the following results:
Trial n°   Optimal number of     Optimal number of   R2 value at the       CCC value at the
           clusters by the CCC   clusters by R2      level of 6 clusters   level of 6 clusters
   1           (10)                  5 (7)                0.4402               -21.42
   2            (9)                  6 (5)                0.4462               -20.31
   3             -                    5                   0.4604               -17.53
   4            (9)                  7 (5)                0.4457               -20.38
As no clear indication emerges from this table, the number of six clusters has been chosen, and four trials by
MACRO MCLAS were executed with initial partitions of 6 clusters; the results are presented below.
Initial partitions of 6 clusters

Trial  Optimal number  Optimal number  R2 at the  CCC at the  Number of  Number of obs. in  Maximum R2 of  Infor-
n°     of clusters by  of clusters by  6-cluster  6-cluster   strong     important strong   the initial    mation
       the CCC         R2              level      level       patterns   patterns           partitions     J
  1         6               6           0.4976     -10.18        74            853              0.5159      0.749
  2         6               6           0.5055     (-8.23)     missing         965              0.5160     missing
  3         7               7           0.4952     -11.57       109            719              0.5159      0.882
  4       9 (7)           6 (9)         0.4842     -13.60        96            852              0.5159      0.864
  5         -             5 (6)         0.4550     -14.44       119            237              0.5042      1.287
Compared with population 1, we can note the high number of strong patterns obtained.
Another interesting remark (also true for population 1) is that clustering the strong patterns gives a much
better value of the CCC than clustering the observations (MACRO MCLAS1): the best value of -8.23
obtained here must be compared with the best value of -17.53 by MACRO MCLAS1.
However these values are not at all satisfactory for determining an optimal number of clusters.
Further investigations must be made, using the strong patterns, to remove observations or groups of
observations in order to increase the CCC values of the clusters obtained on the remaining population.
Only selecting, for the new population, the important strong patterns obtained from a large enough number of
partitions will definitely guarantee better values of the criteria.
Conclusion
The three MACROS MCLAS1, MCLAS2 and MCLAS give the user many elements for choosing the optimal
number of clusters and then determining the best clusters.
The first basic idea, developed in MACRO MCLAS1, is that of mixed clustering, combining the use of
non-hierarchical and hierarchical techniques.
A second basic notion, due to E. DIDAY, is that of strong patterns, used in MACRO MCLAS, which brings
valuable information for determining the number of clusters, getting nearer to the globally optimal clustering,
and understanding the structure of the population.
The user has the possibility of running trials with different random initializations, either to measure the stability
of the indications about the number of clusters, in MACRO MCLAS1, or to explore a larger set of good local
optima, in MACRO MCLAS.
When difficulties arise in getting some certainty about an optimal number of clusters, or in obtaining stability
of the clusters, it is possible, using the strong patterns, to detect the observations or groups of observations
which are sources of instability.
These MACROS are tools helping the user to analyze his data, either quickly, using only MACRO MCLAS1,
or in more detail, using MACRO MCLAS afterwards, which will give better solutions. In both cases MACRO
MCLAS2 is able to improve the clusters obtained further.
APPENDIX 1

/********************************************************************
 * MACRO %MCLAS1
 * Macro variables to pass:
 *   ENTREE   = input data set, by default the last one created
 *   VAR      = list of the numeric variables          <==== required
 *   RAND     = value determining the random drawing   <==== required
 *   MAXC     = number of clusters, 100 by default
 *   ARBRE    = data set created by PROC CLUSTER
 *   CLASSIF1 = data set created by PROC FASTCLUS
 ********************************************************************/
%MACRO MCLAS1(ENTREE=_LAST_, VAR=, RAND=, MAXC=100,
              ARBRE=ARBRE, CLASSIF1=CLASSIF1);
/* Call of the header of macro MCLAS1 */
%INCLUDE FICHIER(MCLAS1T);
TITLE EXECUTION DE LA MACRO SAS MCLAS1;
%INCLUDE FICHIER(RANDOM);
%RANDOM(ENTREE=&ENTREE, RAND=&RAND);
PROC FASTCLUS DATA=&ENTREE OUT=&CLASSIF1 MEAN=FASTMEAN SHORT
     MAXC=&MAXC MAXITER=10;
  VAR &VAR;
PROC CLUSTER DATA=FASTMEAN OUTTREE=&ARBRE SIMPLE PRINT=&MAXC;
  TITLE3 EXECUTION DE LA CLASSIFICATION HIERARCHIQUE;
  VAR &VAR;
  FREQ _FREQ_;
  RMSSTD _RMSSTD_;
DATA TREE;
  SET &ARBRE;
  IF _NCL_ < 40;
PROC PLOT DATA=TREE(RENAME=(_CCC_=CCC _NCL_=NCL));
  TITLE3 GRAPHE DU CRITERE EN FONCTION DU NOMBRE DE CLASSES;
  PLOT CCC*NCL=NCL;
PROC PLOT DATA=TREE(RENAME=(_RSQ_=R2 _NCL_=NCL));
  TITLE3 GRAPHE DU NOMBRE DE CLASSES EN FONCTION DU R2;
  PLOT NCL*R2=NCL;
PROC PLOT DATA=TREE(RENAME=(_SPRSQ_=VR2 _NCL_=NCL));
  TITLE3 GRAPHE DU NOMBRE DE CLASSES EN FONCTION DU SEMI-PARTIAL R2;
  PLOT NCL*VR2=NCL;
PROC TREE DATA=&ARBRE;
  TITLE3 IMPRESSION DE L'ARBRE HIERARCHIQUE;
  HEIGHT _SPRSQ_;
%MEND MCLAS1;
/********************************************************************
 * SAS macro perturbing the initial order of the observations of the
 * input data set by a random drawing from a uniform distribution.
 *   ENTREE : last data set created, here the table containing the
 *            factor scores of the individuals
 *   RAND   : parameter determining the type of random drawing used
 *            (repeatable or not); by default no random drawing
 *            is performed
 ********************************************************************/
%MACRO RANDOM(ENTREE=_LAST_, RAND=);
DATA TAB;
  SET &ENTREE END=FIN;
  IF FIN THEN DO;
    NOBS=PUT(_N_,8.);
    CALL SYMPUT('NN',NOBS);
  END;
RUN;
%LET N=%EVAL(&NN);
%IF %LENGTH(&RAND)^=0 %THEN %DO;
DATA RANDO;
  DO J=1 TO &N;
    K1=INT(UNIFORM(&RAND)*&N)+1;
    OUTPUT RANDO;
  END;
  KEEP K1;
DATA TAB;
  MERGE TAB RANDO;
PROC SORT DATA=TAB OUT=&ENTREE;
  BY K1;
%END;
%MEND RANDOM;
APPENDIX 2

/********************************************************************
 * MACRO MCLAS2
 * Macro variables to pass:
 *   ENTREE   = data set output by %MCLAS1
 *   ARBRE    = data set created by %MCLAS1, ARBRE by default
 *   NCLASSE  = cutting level of the tree              <==== required
 *   CLASSIF1 = data set created by %MCLAS1, CLASSIF1 by default
 *   CLASSIF2 = final data set of PROC FASTCLUS, CLASSIF2 by default
 *   CLASFF   = class variable, CLUSTER by default
 ********************************************************************/
%MACRO MCLAS2(ENTREE=CLASSIF1, ARBRE=ARBRE, VAR=, NCLASSE=,
              CLASSIF1=CLASSIF1, CLASSIF2=CLASSIF2, CLASFF=CLUSTER);
/* Call of the header of macro MCLAS2 */
%INCLUDE FICHIER(MCLAS2T);
TITLE EXECUTION DE LA MACRO MCLAS2;
PROC TREE DATA=&ARBRE LEVEL=&NCLASSE NOPRINT OUT=TREEOUT;
  TITLE3 IMPRESSION DE L'ARBRE HIERARCHIQUE;
DATA NTREE;
  TITLE3 COUPURE DE L'ARBRE AU NIVEAU DE &NCLASSE CLASSES;
  LENGTH ACLUS 4;
  SET TREEOUT;
  CCLUS=SUBSTR(_NAME_,3,3);
  ACLUS=CCLUS;
  KEEP ACLUS CLUSTER;
PROC SORT OUT=NTREE;
  BY ACLUS;
PROC SORT DATA=&CLASSIF1 OUT=ENTREE(RENAME=(&CLASFF=ACLUS));
  BY &CLASFF;
DATA SEEDO(KEEP=CLUSTER &VAR);
  MERGE ENTREE NTREE;
  BY ACLUS;
PROC SUMMARY DATA=SEEDO;
  TITLE3 RECUPERATION DES NOYAUX PAR LA PROCEDURE SUMMARY;
  CLASS CLUSTER;
  VAR &VAR;
  OUTPUT OUT=SEED MEAN=&VAR;
PROC PRINT DATA=SEED(FIRSTOBS=2
     RENAME=(_FREQ_=EFFECTIF CLUSTER=CLASSE));
  TITLE3 IMPRESSION DES CLASSES AVANT LA DERNIERE CLASSIFICATION;
  ID CLASSE;
  VAR EFFECTIF;
PROC FASTCLUS DATA=&ENTREE SEED=SEED(FIRSTOBS=2) OUT=&CLASSIF2
     MAXC=&NCLASSE MAXITER=10;
  TITLE3 EXECUTION DE LA DERNIERE CLASSIFICATION;
  VAR &VAR;
%MEND MCLAS2;
ENDNOTES
1- At any level, Ward's method joins the two classes A and B minimizing the pseudo-distance

      d(A,B) = (nA nB / (nA + nB)) * ||GA - GB||^2

   where GA, GB are the centres of the classes and nA, nB the numbers of observations of the classes.
   The height of each node (cluster) in the tree is equal to this pseudo-distance (divided by the total
   sum of squares), called the semi-partial R2.
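This pseudo-distance can be checked numerically: the Ward merge cost (nA nB / (nA + nB)) * ||GA - GB||^2 equals the increase of the within-cluster sum of squares caused by merging A and B. An illustrative Python sketch with invented data:

```python
# Numerical check of endnote 1: Ward's merge cost
#   d(A, B) = nA*nB/(nA+nB) * ||G_A - G_B||^2
# equals the increase in within-cluster sum of squares caused by
# merging A and B.

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def ss(points):
    """Sum of squared deviations from the centroid."""
    g = centroid(points)
    return sum(sum((x[d] - g[d]) ** 2 for d in range(len(g)))
               for x in points)

A = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
B = [[4.0, 4.0], [5.0, 4.0]]

gA, gB = centroid(A), centroid(B)
nA, nB = len(A), len(B)
ward = nA * nB / (nA + nB) * sum((gA[d] - gB[d]) ** 2 for d in range(2))
increase = ss(A + B) - ss(A) - ss(B)
assert abs(ward - increase) < 1e-9
```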
2- The intersection of two partitions is the set of the parts obtained by intersecting each class of
   one with all the classes of the other.
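On small examples this intersection, and the strong patterns it yields, can be computed directly (a Python sketch with invented partitions):

```python
# Intersection of two partitions: intersect each class of one with
# every class of the other.  Iterating this over several partitions
# yields the cells from which the strong patterns are read off
# (groups of observations that are always clustered together).
from itertools import product

def intersect(p1, p2):
    """Intersection of two partitions given as lists of frozensets."""
    cells = [a & b for a, b in product(p1, p2)]
    return [c for c in cells if c]          # drop empty intersections

p1 = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
p2 = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
cells = intersect(p1, p2)
# Cells {1,2}, {3}, {4}, {5,6}: observations 1 and 2 (and 5 and 6)
# are never separated, so they form strong patterns.
assert sorted(sorted(c) for c in cells) == [[1, 2], [3], [4], [5, 6]]
```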
3- A partition P is said to be finer than a partition P' of E if every class of P' is the union of
   classes of P.
REFERENCES
1- P. COLLOMB, Classification par transfert (1977). Electricite de France, Direction des Etudes et
   Recherches, Note HI/2578-02.
2- E. DIDAY, Optimization in non-hierarchical clustering (1974). Pattern Recognition, Pergamon Press,
   Vol. 6, pp. 17-33.
3- SAS User's Guide: Statistics (1982). SAS Institute, Inc., Cary, N.C.
4- SAS Technical Report A-108: Cubic Clustering Criterion (1983). SAS Institute, Inc., Cary, N.C.
5- W.F. de la Vega, M. Renaud, S. Regnier, Techniques de la classification automatique (1964-1966).
   Distributed by the Centre de Calcul de la Maison des Sciences de l'Homme, Paris.