Data Science with R

Cluster Analysis

Graham.Williams@togaware.com

22nd June 2014

Visit http://onepager.togaware.com/ for more OnePageR's.

We focus on the unsupervised method of cluster analysis in this chapter. Cluster analysis is a topic that has been much studied by statisticians for decades and widely used in data mining. The required packages for this module include:

library(rattle)        # The weather dataset and normVarNames().
library(randomForest)  # Impute missing values using na.roughfix().
library(ggplot2)       # Visualise the data through plots.
library(animation)     # Demonstrate kmeans.
library(reshape2)      # Reshape data for plotting.
library(fpc)           # Tuning clustering with kmeansruns() and clusterboot().
library(clusterCrit)   # Clustering criteria.
library(wskm)          # Weighted subspace clustering.
library(amap)          # hclusterpar().
library(cba)           # Dendrogram plot.
library(dendroextras)  # To colour clusters.
library(kohonen)       # Self organising maps.

As we work through this chapter, new R commands will be introduced. Be sure to review each command's documentation and understand what the command does. You can ask for help using the ? command, as in:

?read.csv

We can obtain documentation on a particular package using the help= option of library():

library(help=rattle)

This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., in RStudio) and to run all the commands as they appear here. Check that you get the same output, and that you understand the output. Try some variations. Explore.

Copyright © 2013-2014 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.

1 Load Weather Dataset for Modelling

We use the weather dataset from rattle (Williams, 2014) and normalise the variable names. Missing values are imputed using na.roughfix() from randomForest (Breiman et al., 2012), particularly because kmeans() does not handle missing values itself. Here we set up the dataset for modelling. Notice in particular that we identify the numeric input variables (numi is an integer vector containing the column indices of the numeric variables and numc is a character vector containing the column names). Many clustering algorithms only handle numeric variables.

# Required packages.
library(rattle)        # Load weather dataset. Normalise names normVarNames().
library(randomForest)  # Impute missing using na.roughfix().

# Identify the dataset.
dsname     <- "weather"
ds         <- get(dsname)
names(ds)  <- normVarNames(names(ds))
vars       <- names(ds)
target     <- "rain_tomorrow"
risk       <- "risk_mm"
id         <- c("date", "location")

# Ignore the IDs and the risk variable.
ignore     <- union(id, if (exists("risk")) risk)

# Ignore variables which are completely missing.
mvc        <- sapply(ds[vars], function(x) sum(is.na(x))) # Missing value count.
mvn        <- names(ds)[(which(mvc == nrow(ds)))]         # Missing var names.
ignore     <- union(ignore, mvn)

# Initialise the variables.
vars       <- setdiff(vars, ignore)

# Variable roles.
inputc     <- setdiff(vars, target)
inputi     <- sapply(inputc, function(x) which(x == names(ds)), USE.NAMES=FALSE)
numi       <- intersect(inputi, which(sapply(ds, is.numeric)))
numc       <- names(ds)[numi]
cati       <- intersect(inputi, which(sapply(ds, is.factor)))
catc       <- names(ds)[cati]

# Impute missing values, but do this wisely - understand why missing.
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])

# Number of observations.
nobs       <- nrow(ds)


2 Introducing Cluster Analysis

The aim of cluster analysis is to identify groups of observations so that within a group the observations are most similar to each other, whilst between groups the observations are most dissimilar to each other. Cluster analysis is essentially an unsupervised method.

Our human society has been "clustering" for a long time to help us understand the environment we live in. We have clustered the animal and plant kingdoms into a hierarchy of similarities. We cluster chemical structures. Day-by-day we see grocery items clustered into similar groups. We cluster student populations into similar groups of students from similar backgrounds or studying similar combinations of subjects.

The concept of similarity is often captured through the measurement of distance. Thus we often describe cluster analysis as identifying groups of observations so that the distance between the observations within a group is minimised and between the groups the distance is maximised. Thus a distance measure is fundamental to calculating clusters.

There are some caveats to performing automated cluster analysis using distance measures. We often observe, particularly with large datasets, that a number of interesting clusters will be generated, and then one or two clusters will account for the majority of the observations. It is as if these larger clusters simply lump together those observations that don't fit elsewhere.


3 Distance Calculation: Euclidean Distance

Suppose we pick the first two observations from our dataset and the first 5 numeric variables

ds[1:2, numi[1:5]]

  min_temp max_temp rainfall evaporation sunshine
1      8.0     24.3      0.0         3.4      6.3
2     14.0     26.9      3.6         4.4      9.7

x <- ds[1, numi[1:5]]
y <- ds[2, numi[1:5]]

Then x - y is simply:

x-y

  min_temp max_temp rainfall evaporation sunshine
1       -6     -2.6     -3.6          -1     -3.4

Then the square of each difference is:

sapply(x-y, '^', 2)

min_temp max_temp rainfall evaporation sunshine
   36.00     6.76    12.96        1.00    11.56

The sum of the squares of the differences:

sum(sapply(x-y, '^', 2))

[1] 68.28

Finally, the square root of the sum of the squares of the differences (also known as the Euclidean distance) is:

sqrt(sum(sapply(x-y, '^', 2)))

[1] 8.263

Of course, we don't need to calculate this manually ourselves. R provides dist() to calculate the distance (Euclidean distance by default):

dist(ds[1:2, numi[1:5]])

      1
2 8.263

We can also calculate the Manhattan distance:

sum(abs(x-y))

[1] 16.6

dist(ds[1:2, numi[1:5]], method="manhattan")

     1
2 16.6


4 Minkowski Distance

dist(ds[1:2, numi[1:5]], method="minkowski", p=1)

     1
2 16.6

dist(ds[1:2, numi[1:5]], method="minkowski", p=2)

      1
2 8.263

dist(ds[1:2, numi[1:5]], method="minkowski", p=3)

      1
2 6.844

dist(ds[1:2, numi[1:5]], method="minkowski", p=4)

      1
2 6.368
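The Minkowski distance of order p is the p-th root of the sum of the absolute differences each raised to the power p, so p=1 gives the Manhattan distance and p=2 the Euclidean distance. As a quick sketch (using the two observations' first five numeric values exactly as printed above), we can confirm the values reported by dist():

```r
# Minkowski distance of order p: (sum of |x_i - y_i|^p)^(1/p).
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1/p)

# The first five numeric values of observations 1 and 2, as printed earlier.
x <- c(8.0, 24.3, 0.0, 3.4, 6.3)
y <- c(14.0, 26.9, 3.6, 4.4, 9.7)

round(minkowski(x, y, 1), 3) # p=1 is the Manhattan distance: 16.6
round(minkowski(x, y, 2), 3) # p=2 is the Euclidean distance: 8.263
round(minkowski(x, y, 3), 3) # 6.844
```

As p grows, the largest single coordinate difference increasingly dominates the sum.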


5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4
2  8.263
3  7.812  7.434
4 41.375 38.067 37.531

library(cluster) # daisy()
daisy(ds[1:5, numi[1:5]])

Dissimilarities :
      1     2     3     4
2 8.263
3 7.812 7.434

daisy(ds[1:5, cati])

Dissimilarities :
       1      2      3      4
2 0.6538
3 0.6923 0.5385


6 K-Means Basics: Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm

The algorithm begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within the cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
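The iteration just described can be sketched in a few lines of R. This is an illustration only, not what kmeans() actually runs internally (its default is the Hartigan-Wong algorithm); simple_kmeans is a made-up helper and assumes no cluster empties out during the iterations:

```r
# Illustrative k-means: assign each point to its nearest center, recompute
# the means, repeat. Not robust (a cluster may become empty); use kmeans().
simple_kmeans <- function(x, k, iters=10)
{
  set.seed(42)
  # A random starting clustering: k observations chosen as initial means.
  centers <- x[sample(nrow(x), k), , drop=FALSE]
  cluster <- NULL
  for (i in seq_len(iters))
  {
    # Distances from every observation to every current center.
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop=FALSE]
    # Gravitate each observation to its nearest mean.
    cluster <- apply(d, 1, which.min)
    # Recalculate the means from the new assignment.
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster=cluster, centers=centers)
}

# Two well-separated pairs of points are recovered as two clusters.
xy <- matrix(c(0, 0,  0.1, 0,  10, 10,  10.1, 10), ncol=2, byrow=TRUE)
simple_kmeans(xy, 2)$cluster
```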


7 K-Means: Using kmeans()

Here is our first attempt to cluster our dataset

model <- m.km <- kmeans(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on

set.seed(42)

model <- m.km <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54

The component m.km$cluster reports to which of the 10 clusters each of the original observations belongs:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance differs for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean of each variable is 0 and a unit of difference is one standard deviation.
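The z-score transformation itself is easy to verify by hand on a tiny vector:

```r
# z-score: subtract the mean, divide by the standard deviation.
v <- c(2, 4, 6, 8)
z <- (v - mean(v)) / sd(v)

# The result has mean 0 and standard deviation 1, and agrees with scale()
# (which returns a matrix, hence as.vector() for the comparison).
all.equal(z, as.vector(scale(v)))
```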

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp       rainfall      evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   :0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.:2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median :4.20

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall        evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468


9 K-Means: Scaled Dataset

set.seed(42)

model <- m.kms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

[1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie, 2013), we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations.

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x    <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x    <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")
dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the kmeans algorithm as it identifies 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points and re-calculating the means. Eventually, the means do not change location, and the algorithm converges.

[Figure sequence: sixteen plots of X1 versus X2, alternating between the "Find cluster" and "Move centers" steps of kmeans.ani(), until the means converge.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers across the numeric weather variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers across the numeric weather variables, one coloured line per cluster.]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles on the scaled variables, with gridlines at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation, and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, with gridlines at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a two-by-two grid of radial plots, one per cluster (Cluster1 to Cluster4), each with gridlines at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.
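We can see this instability directly by clustering the same data from two different random starts and cross-tabulating the assignments. A quick sketch on the familiar iris measurements rather than the weather data:

```r
# Two k-means runs from different random starting points.
x <- scale(iris[1:4])
set.seed(1); a <- kmeans(x, 3)$cluster
set.seed(2); b <- kmeans(x, 3)$cluster

# Cluster labels are arbitrary, so perfect agreement shows as a table with
# a single non-zero cell in each row and column; spread across cells
# indicates observations that moved between the two clusterings.
table(a, b)
```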

The function clusterboot() from fpc (Hennig, 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1
boot 2
boot 3
boot 4

model

Cluster stability assessment
Cluster method: kmeans

Full clustering results are given as the result parameter of the clusterboot object, which also provides further statistics:

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
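We can sketch the definition directly, on toy random data: the total sum of squares is the sum of the squared Euclidean distances of each observation from the grand mean.

```r
# Total sum of squares: squared distances of observations from the overall
# mean of the data. Toy data for illustration.
set.seed(42)
m <- matrix(rnorm(20), ncol=2)
g <- colMeans(m)
totss <- sum(apply(m, 1, function(row) sum((row - g)^2)))

# kmeans() reports the same quantity as totss, whatever k we choose.
all.equal(totss, kmeans(m, 2)$totss)
```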


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster it is the sum of the squared distances of each observation in the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so the curve flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other:

model$betweenss

[1] 3446
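The three measures are tied together by a simple identity, which we can check on toy data: the total sum of squares decomposes exactly into the within and between components.

```r
# For any kmeans fit: totss = tot.withinss + betweenss.
set.seed(42)
m <- kmeans(matrix(rnorm(100), ncol=2), 4)
all.equal(m$totss, m$tot.withinss + m$betweenss)
```

This is why, for a fixed dataset, minimising the total within sum of squares and maximising the between sum of squares are the same goal.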

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss plotted against the number of clusters (1 to 50); as k increases the within sum of squares falls and the between sum of squares rises, summing to the constant total.]


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk   <- 1:20
for (k in nk)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
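The criterion is straightforward to compute by hand from a kmeans fit; a sketch on toy data:

```r
# Calinski-Harabasz index: (betweenss / (k-1)) / (tot.withinss / (n-k)).
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
k <- 3
m <- kmeans(x, k)
n <- nrow(x)
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```

kmeansruns() below performs this calculation (via fpc) across a range of k values and reports the best.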

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion plotted for k = 1 to 20, peaking at k = 2.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. As an indication, for a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, one criterion took some 30 minutes compared to minutes for the other.

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
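The underlying silhouette calculation is also available directly from the cluster package, if we want the value for a single clustering. A sketch on the iris measurements (the weather distance matrix works the same way):

```r
# Average silhouette width for a 3-cluster k-means of the iris data.
# silhouette() needs the cluster vector and the full distance matrix,
# which is why this criterion gets expensive on large datasets.
library(cluster)
x <- scale(iris[1:4])
set.seed(42)
cl  <- kmeans(x, 3)$cluster
sil <- silhouette(cl, dist(x))
mean(sil[, "sil_width"])
```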

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled average silhouette width criterion plotted for k = 1 to 20, peaking at k = 2.]


25 K-Means: Using clusterCrit Calinski-Harabasz

The clusterCrit (Desgraupes, 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: the scaled Calinski-Harabasz criterion plotted against k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"

 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

 [7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()

for (k in 2:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plot the remaining criteria against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test  <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
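The same assignment can be sketched without rattle: compute the squared Euclidean distance from each new observation to each cluster center and take the nearest. The function name below is illustrative, not part of any package:

```r
# Assign each row of x to the nearest of the given cluster centers.
nearest.center <- function(centers, x)
{
  apply(x, 1, function(obs)
    which.min(colSums((t(centers) - obs)^2)))  # Squared Euclidean distance.
}

centers <- matrix(c(0, 0,
                    5, 5), ncol=2, byrow=TRUE)
x       <- matrix(c(0.2, -0.1,
                    4.8,  5.3), ncol=2, byrow=TRUE)

nearest.center(centers, x)  # 1 2
```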


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)

library(wskm)

mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
 [1,] 11      9.1     25.2      0.0         4.2     11.9              30
 [2,] 38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked.]

plot(model)


[Figure: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), the clusters plotted against the first two principal components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average widths: 49 (0.20), 30 (0.17), 23 (0.02), 27 (0.10), 34 (0.15), 45 (0.14), 44 (0.11), 40 (0.23), 26 (0.11), 48 (0.09).]


31 Clara
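A minimal sketch using clara() from cluster, which applies pam() to samples of the data and keeps the best set of medoids, so it scales to larger datasets than pam() itself; the data and parameter values here are illustrative:

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(1000*4), ncol=4)  # 1,000 observations, 4 variables.

# Cluster via repeated sampling rather than the full distance matrix.
model <- clara(x, k=10, samples=50)

model$medoids           # One representative observation per cluster.
table(model$clustering) # Cluster sizes.
```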


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: the cluster dendrogram from hclusterpar(*, "ward"), with rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with the 10 clusters coloured, leaves labelled by observation number.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
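As a starting point for the exercise, the sketch below (on synthetic data, since the weather dataset has already been imputed above) converts each variable to a 0/1 missingness indicator and clusters with mona() from cluster:

```r
library(cluster)

# Synthetic data with values missing in patterns.
set.seed(42)
df <- data.frame(a=c(1, NA, 3, NA), b=c(NA, 2, 3, NA), c=c(1, 2, NA, 4))

# 1 indicates missing, 0 indicates present.
miss <- data.frame(lapply(df, function(x) as.integer(is.na(x))))

# mona() requires binary variables taking both values, so drop constants.
keep  <- sapply(miss, function(x) length(unique(x)) == 2)
model <- mona(miss[keep])

model$order  # Observations arranged by similarity of missingness pattern.
```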


36 Self Organising Maps SOM

[Figure: "Weather Data" self organising map showing the distribution of the 14 variables over the 5x4 grid.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website marked as the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

1 Load Weather Dataset for Modelling

We use the weather dataset from rattle (Williams 2014) and normalise the variable names. Missing values are imputed using na.roughfix() from randomForest (Breiman et al. 2012), particularly because kmeans() does not handle missing values itself. Here we set up the dataset for modelling. Notice in particular we identify the numeric input variables (numi is an integer vector containing the column indexes for the numeric variables, and numc is a character vector containing the column names). Many clustering algorithms only handle numeric variables.

# Required packages
library(rattle)       # Load weather dataset. Normalise names normVarNames().
library(randomForest) # Impute missing using na.roughfix().

# Identify the dataset.
dsname     <- "weather"
ds         <- get(dsname)
names(ds)  <- normVarNames(names(ds))
vars       <- names(ds)
target     <- "rain_tomorrow"
risk       <- "risk_mm"
id         <- c("date", "location")

# Ignore the IDs and the risk variable.
ignore     <- union(id, if (exists("risk")) risk)

# Ignore variables which are completely missing.
mvc        <- sapply(ds[vars], function(x) sum(is.na(x))) # Missing value count.
mvn        <- names(ds)[(which(mvc == nrow(ds)))]         # Missing var names.
ignore     <- union(ignore, mvn)

# Initialise the variables.
vars       <- setdiff(vars, ignore)

# Variable roles.
inputc     <- setdiff(vars, target)
inputi     <- sapply(inputc, function(x) which(x == names(ds)), USE.NAMES=FALSE)
numi       <- intersect(inputi, which(sapply(ds, is.numeric)))
numc       <- names(ds)[numi]
cati       <- intersect(inputi, which(sapply(ds, is.factor)))
catc       <- names(ds)[cati]

# Impute missing values, but do this wisely - understand why missing.
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])

# Number of observations.
nobs       <- nrow(ds)


2 Introducing Cluster Analysis

The aim of cluster analysis is to identify groups of observations so that within a group the observations are most similar to each other, whilst between groups the observations are most dissimilar to each other. Cluster analysis is essentially an unsupervised method.

Our human society has been "clustering" for a long time to help us understand the environment we live in. We have clustered the animal and plant kingdoms into a hierarchy of similarities. We cluster chemical structures. Day-by-day we see grocery items clustered into similar groups. We cluster student populations into similar groups of students from similar backgrounds or studying similar combinations of subjects.

The concept of similarity is often captured through the measurement of distance. Thus we often describe cluster analysis as identifying groups of observations so that the distance between the observations within a group is minimised, and between the groups the distance is maximised. Thus a distance measure is fundamental to calculating clusters.

There are some caveats to performing automated cluster analysis using distance measures. We often observe, particularly with large datasets, that a number of interesting clusters will be generated, and then one or two clusters will account for the majority of the observations. It is as if these larger clusters simply lump together those observations that don't fit elsewhere.


3 Distance Calculation: Euclidean Distance

Suppose we pick the first two observations from our dataset and the first 5 numeric variables:

ds[1:2, numi[1:5]]

  min_temp max_temp rainfall evaporation sunshine
1      8.0     24.3      0.0         3.4      6.3
2     14.0     26.9      3.6         4.4      9.7

x <- ds[1, numi[1:5]]

y <- ds[2, numi[1:5]]

Then x - y is simply:

x-y

  min_temp max_temp rainfall evaporation sunshine
1       -6     -2.6     -3.6          -1     -3.4

Then the square of each difference is:

sapply(x-y, '^', 2)

min_temp    max_temp    rainfall evaporation    sunshine
   36.00        6.76       12.96        1.00       11.56

The sum of the squares of the differences:

sum(sapply(x-y, '^', 2))

[1] 68.28

Finally, the square root of the sum of the squares of the differences (also known as the Euclidean distance) is:

sqrt(sum(sapply(x-y, '^', 2)))

[1] 8.263

Of course, we don't need to calculate this so manually ourselves. R provides dist() to calculate the distance (Euclidean distance by default):

dist(ds[1:2, numi[1:5]])

      1
2 8.263

We can also calculate the Manhattan distance:

sum(abs(x-y))

[1] 16.6

dist(ds[1:2, numi[1:5]], method="manhattan")

     1
2 16.6


4 Minkowski Distance

dist(ds[1:2, numi[1:5]], method="minkowski", p=1)

     1
2 16.6

dist(ds[1:2, numi[1:5]], method="minkowski", p=2)

      1
2 8.263

dist(ds[1:2, numi[1:5]], method="minkowski", p=3)

      1
2 6.844

dist(ds[1:2, numi[1:5]], method="minkowski", p=4)

      1
2 6.368
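The Minkowski distance of order p is (sum(|x_i - y_i|^p))^(1/p), with Manhattan and Euclidean as the special cases p = 1 and p = 2. A quick sketch computing it directly on the two observations above:

```r
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1/p)

x <- c(8.0, 24.3, 0.0, 3.4, 6.3)   # First observation, first 5 numeric variables.
y <- c(14.0, 26.9, 3.6, 4.4, 9.7)  # Second observation.

minkowski(x, y, 1)            # 16.6, the Manhattan distance.
round(minkowski(x, y, 2), 3)  # 8.263, the Euclidean distance.
```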

[Figure: the Minkowski distance between the two observations plotted for p from 1 to 20, decreasing as p increases.]


5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4
2  8.263
3  7.812  7.434
4 41.375 38.067 37.531
....

daisy(ds[1:5, numi[1:5]])

Dissimilarities :
       1      2      3      4
2  8.263
3  7.812  7.434
....

daisy(ds[1:5, cati])

Dissimilarities :
       1      2      3      4
2 0.6538
3 0.6923 0.5385
....


6 K-Means Basics: Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm

The algorithm begins by specifying the number of clusters we are interested in; this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
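The iteration described above can be sketched in a few lines of R; this is a toy illustration for two well-separated groups, not the algorithm actually used by kmeans(), and it assumes no cluster empties out during the iterations:

```r
set.seed(42)
x <- rbind(matrix(rnorm(50, mean=0), ncol=2),  # 25 points near (0, 0).
           matrix(rnorm(50, mean=3), ncol=2))  # 25 points near (3, 3).
k <- 2

centers <- x[sample(nrow(x), k), ]  # Random initial means.
repeat
{
  # Gravitate each observation to its nearest mean (squared Euclidean).
  d <- sapply(1:k, function(j)
    rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow=TRUE))^2))
  cluster <- max.col(-d)

  # Recalculate the means and stop once they no longer change.
  new.centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  if (all(abs(new.centers - centers) < 1e-8)) break
  centers <- new.centers
}

table(cluster)
```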


7 K-Means Using kmeans()

Here is our first attempt to cluster our dataset:

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on.

set.seed(42)

model <- mkm <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54
....

The component mkm$cluster reports which of the 10 clusters each of the original observations belongs to:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.
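For a single variable the z-score is just (x - mean(x)) / sd(x); a quick check that this matches what scale() computes:

```r
x <- c(2, 4, 6, 8)

z <- (x - mean(x)) / sd(x)  # Manual z-score.

mean(z)  # 0 (up to floating point).
sd(z)    # 1

all.equal(as.vector(scale(x)), z)  # TRUE
```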

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp         max_temp       rainfall       evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20
....

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall        evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121
....

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])

attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468


9 K-Means Scaled Dataset

set.seed(42)

model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886
....

model$totss

[1] 5840

model$withinss

 [1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0
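The sums of squares reported above satisfy totss = tot.withinss + betweenss (here 5840 = 2420 + 3420, up to rounding); a quick check of the identity on synthetic data:

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)  # 100 observations, 2 variables.
m <- kmeans(x, 3)

# The total sum of squares decomposes into within plus between.
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```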


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013), we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)

x <- NULL

for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))

x <- matrix(x, ncol=2)

colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378
....

The series of plots over the following pages show the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")

kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually the means do not change location and the algorithm converges.

[Figures: a sequence of plots alternating between the "Move centers" and "Find cluster" steps of the k-means animation, repeated until the means no longer move.]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers over the numeric variables.]

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y = element_blank())

p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers over the numeric variables.]

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y = element_blank())

p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles, over the range -2 to 2 standard deviations.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot


CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots


p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data, and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts
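The section body is empty in the source. As a minimal sketch of the idea (not the author's code, and using the built-in iris measurements in place of the weather dataset), kmeans() supports multiple random starts directly through its nstart argument, keeping the best of the runs:

```r
# Sketch only: multiple random starts with kmeans(), iris standing in
# for the weather dataset used elsewhere in this module.
set.seed(42)
dsi <- scale(iris[1:4])

single <- kmeans(dsi, centers=3, nstart=1)   # one random start
multi  <- kmeans(dsi, centers=3, nstart=25)  # keep the best of 25 starts

# The multi-start result is at least as good on the criterion kmeans minimises.
multi$tot.withinss <= single$tot.withinss
```

Each start is an independent random initialisation; kmeans() retains the run with the lowest total within sum of squares.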


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically the sum of the squares of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
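To make the measure concrete, here is a sketch (not from the source, using iris in place of the weather data) that recomputes the total within sum of squares by hand and checks it against what kmeans() reports:

```r
# Sketch only: tot.withinss is the sum over all observations of the squared
# distance to the observation's own cluster center.
set.seed(42)
dsi <- scale(iris[1:4])
m   <- kmeans(dsi, centers=3, nstart=10)

# Replicate each observation's own center row, subtract, square, and sum.
wss <- sum((dsi - m$centers[m$cluster, ])^2)
all.equal(wss, m$tot.withinss)
```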


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures.

[Figure: the total within sum of squares (totwithinss) falling and the between sum of squares (betweenss) rising as the number of clusters increases from 1 to 50.]
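The relationship is in fact exact: the total sum of squares decomposes into the within and between components. A sketch checking the identity (not from the source; iris stands in for the weather data):

```r
# Sketch only: totss = tot.withinss + betweenss for any kmeans() fit.
set.seed(42)
dsi <- scale(iris[1:4])
m   <- kmeans(dsi, centers=3, nstart=10)

all.equal(m$totss, m$tot.withinss + m$betweenss)
```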


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk   <- 1:20
for (k in nk)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
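The ratio just described can be computed directly from a kmeans() fit. A sketch (not from the source; iris stands in for the weather data):

```r
# Sketch only: Calinski-Harabasz = (BSS / (k-1)) / (WSS / (n-k)).
set.seed(42)
dsi <- scale(iris[1:4])
k   <- 3
n   <- nrow(dsi)
m   <- kmeans(dsi, centers=k, nstart=10)

ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```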

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to minutes using the average silhouette width criterion [check timing].

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p


25 K-Means: Using clusterCrit Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case, k = 3 is the optimum choice.

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m  <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms   <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p


27 K-Means: Plot All Criteria

[Six panels plotting the remaining criteria against k = 2 to 20, grouped by legend using the abbreviated criterion names: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means: predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
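Without rattle, the same assignment can be done by hand: each new observation goes to the cluster center at minimum Euclidean distance. A sketch (not the author's code; iris stands in for the weather data):

```r
# Sketch only: assign held-out observations to the nearest cluster center.
set.seed(42)
dsi   <- scale(iris[1:4])
train <- sample(nrow(dsi), 0.7*nrow(dsi))
test  <- setdiff(seq_len(nrow(dsi)), train)
m     <- kmeans(dsi[train, ], centers=2)

# For each test row, the index of the closest of the two centers.
nearest <- apply(dsi[test, ], 1, function(x)
  which.min(colSums((t(m$centers) - x)^2)))
head(nearest)
```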


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster model and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)


plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

[Cluster plot: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), Component 1 versus Component 2. These two components explain 56.04% of the point variability.]

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average silhouette widths:

 j  nj  ave silhouette
 1  49  0.20
 2  30  0.17
 3  23  0.02
 4  27  0.10
 5  34  0.15
 6  45  0.14
 7  44  0.11
 8  40  0.23
 9  26  0.11
10  48  0.09]


31 Clara
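The section body is empty in the source. clara() from the cluster package is the standard large-dataset companion to pam(): it repeatedly applies PAM to samples of the data and keeps the best set of medoids. A sketch (not the author's code; iris stands in for the weather data):

```r
# Sketch only: clara() runs PAM on random samples, suiting larger datasets.
library(cluster)
set.seed(42)
model <- clara(scale(iris[1:4]), k=3, samples=20)

model$medoids            # one representative observation per cluster
table(model$clustering)  # cluster sizes
```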


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)
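Besides drawing rectangles, cutree() returns the actual cluster membership at a chosen k. A sketch (not from the source; a self-contained hclust() on iris rather than the hclusterpar model above):

```r
# Sketch only: extract k=3 cluster memberships from a hierarchical clustering.
hc <- hclust(dist(scale(iris[1:4])), method="ward.D2")
cl <- cutree(hc, k=3)
table(cl)   # observations per cluster
```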


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.


36 Self Organising Maps: SOM


library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

Data Science with R OnePageR Survival Guides Cluster Analysis

2 Introducing Cluster Analysis

The aim of cluster analysis is to identify groups of observations so that within a group the observations are most similar to each other, whilst between groups the observations are most dissimilar to each other. Cluster analysis is essentially an unsupervised method.

Our human society has been "clustering" for a long time to help us understand the environment we live in. We have clustered the animal and plant kingdoms into a hierarchy of similarities. We cluster chemical structures. Day-by-day we see grocery items clustered into similar groups. We cluster student populations into similar groups of students from similar backgrounds or studying similar combinations of subjects.

The concept of similarity is often captured through the measurement of distance. Thus we often describe cluster analysis as identifying groups of observations so that the distance between the observations within a group is minimised, and between the groups the distance is maximised. Thus a distance measure is fundamental to calculating clusters.

There are some caveats to performing automated cluster analysis using distance measures. We often observe, particularly with large datasets, that a number of interesting clusters will be generated, and then one or two clusters will account for the majority of the observations. It is as if these larger clusters simply lump together those observations that don't fit elsewhere.


3 Distance Calculation: Euclidean Distance

Suppose we pick the first two observations from our dataset and the first 5 numeric variables:

ds[1:2, numi[1:5]]

  min_temp max_temp rainfall evaporation sunshine
1        8     24.3      0.0         3.4      6.3
2       14     26.9      3.6         4.4      9.7

x <- ds[1, numi[1:5]]
y <- ds[2, numi[1:5]]

Then x - y is simply:

x-y

  min_temp max_temp rainfall evaporation sunshine
1       -6     -2.6     -3.6          -1     -3.4

Then the square of each difference is

sapply(x-y, '^', 2)

   min_temp    max_temp    rainfall evaporation    sunshine
      36.00        6.76       12.96        1.00       11.56

The sum of the squares of the differences

sum(sapply(x-y, '^', 2))

[1] 68.28

Finally, the square root of the sum of the squares of the differences (also known as the Euclidean distance) is:

sqrt(sum(sapply(x-y, '^', 2)))

[1] 8.263

Of course, we don't need to calculate this so manually ourselves. R provides dist() to calculate the distance (Euclidean distance by default):

dist(ds[1:2, numi[1:5]])

      1
2 8.263

We can also calculate the Manhattan distance

sum(abs(x-y))

[1] 16.6

dist(ds[1:2, numi[1:5]], method="manhattan")

    1
2 16.6

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 3 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

4 Minkowski Distance

dist(ds[1:2, numi[1:5]], method="minkowski", p=1)

    1
2 16.6

dist(ds[1:2, numi[1:5]], method="minkowski", p=2)

      1
2 8.263

dist(ds[1:2, numi[1:5]], method="minkowski", p=3)

      1
2 6.844

dist(ds[1:2, numi[1:5]], method="minkowski", p=4)

      1
2 6.368

[Figure: the Minkowski distance between the two observations plotted for increasing p, decreasing from 16.6 toward the maximum coordinate difference of 6.]


5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4
2  8.263
3  7.812  7.434
4 41.375 38.067 37.531

daisy(ds[1:5, numi[1:5]])

Dissimilarities :
       1      2      3      4
2  8.263

daisy(ds[1:5, cati])

Dissimilarities :
       1      2      3
2 0.6538
3 0.6923 0.5385

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 5 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

6 K-Means Basics: Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm.

The algorithm begins by specifying the number of clusters we are interested in; this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
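The loop just described can be sketched directly in a few lines (not the author's code; iris stands in for the weather data, and kmeans() should be preferred in practice):

```r
# Sketch only: the k-means iteration -- assign each point to its nearest
# mean, recompute the means, repeat until the assignments stop changing.
set.seed(42)
x <- scale(iris[1:4])
k <- 3
centers <- x[sample(nrow(x), k), ]   # k random observations as initial means
cluster <- rep(0, nrow(x))

repeat {
  # Gravitate each observation to its nearest mean (Euclidean distance).
  new <- apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  if (all(new == cluster)) break     # no change: converged
  cluster <- new
  # Recalculate each mean from its current members (guard empty clusters).
  for (j in 1:k)
    if (any(cluster == j))
      centers[j, ] <- colMeans(x[cluster == j, , drop=FALSE])
}
table(cluster)
```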


7 K-Means: Using kmeans()

Here is our first attempt to cluster our dataset

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on

set.seed(42)
model <- mkm <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54

The component mkm$cluster reports which of the 10 clusters each of the original observations belongs to:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0, and a unit of difference is one standard deviation.
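The z-score transformation just described is simply (x - mean(x)) / sd(x), applied column by column. A sketch checking this against scale() (not from the source; iris stands in for the weather data):

```r
# Sketch only: scale() computes the z-score column by column.
z  <- scale(iris[1:4])
z2 <- sapply(iris[1:4], function(x) (x - mean(x)) / sd(x))

all.equal(as.vector(z), as.vector(z2))
```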

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[15]])

min_temp max_temp rainfall evaporation

Min -530 Min 76 Min 000 Min 020

1st Qu 230 1st Qu150 1st Qu 000 1st Qu 220

Median 745 Median 196 Median 000 Median 420

summary(scale(ds[numi[15]]))

min_temp max_temp rainfall evaporation

Min -20853 Min -1936 Min -0338 Min -1619

1st Qu-08241 1st Qu-0826 1st Qu-0338 1st Qu-0870

Median 00306 Median -0135 Median -0338 Median -0121

The scale() function also provides some extra information recording the actual original meansand the standard deviations

dsc lt- scale(ds[numi[15]])

attr(dsc scaledcenter)

min_temp max_temp rainfall evaporation sunshine

7266 20550 1428 4522 7915

attr(dsc scaledscale)

min_temp max_temp rainfall evaporation sunshine

6026 6691 4226 2669 3468

Compare that information with the output from mean() and sd()

sapply(ds[numi[15]] mean)

min_temp max_temp rainfall evaporation sunshine

7266 20550 1428 4522 7915

sapply(ds[numi[15]] sd)

min_temp max_temp rainfall evaporation sunshine

6026 6691 4226 2669 3468


9 K-Means Scaled Dataset

set.seed(42)

model <- m.kms <- kmeans(scale(ds[numi]), 10)

model$size

 [1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886
....

model$totss

[1] 5840

model$withinss

 [1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013), we can produce an animation that illustrates the k-means algorithm.

library(animation)

We generate some random data for two variables over 100 observations.

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")

dim(x)

[1] 100   2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages show the convergence of the k-means algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually, the means do not change location, and the algorithm converges.


(A sequence of animation frames follows, plotting X2 against X1. The frames alternate between a "Move centers" step, where the cluster centres are recalculated, and a "Find cluster" step, where each point is assigned to its nearest centre, until the algorithm converges.)

11 Visualise the Cluster: Radial Plot Using GGPlot2

(Radial plot over the numeric variable axes, min_temp through temp_3pm, with one coloured profile per cluster, 1 to 10.)

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with k=4

(Radial plot over the numeric variable axes, min_temp through temp_3pm, with one coloured profile per cluster, 1 to 4.)

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

(Radial plot of the four cluster profiles over the numeric variables, with grid rings at -2, 0 and 2.)

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean, in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

(Radial plot of the cluster 4 profile over the numeric variables, with grid rings at -2, 0 and 2.)

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

(A two-by-two grid of radial plots, Cluster1 to Cluster4, each showing that cluster's profile over the numeric variables with grid rings at -2, 0 and 2.)

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15
....

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data, and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts
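One way to guard against a poor random start is the nstart= argument of kmeans(), which builds the clustering from multiple random starting points and keeps the one with the smallest total within sum of squares. A sketch on synthetic data (not the weather dataset):

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)

# A single random start may land in a poor local minimum;
# nstart=20 keeps the best of 20 random starts.
m1  <- kmeans(x, centers=4)
m20 <- kmeans(x, centers=4, nstart=20)

m1$tot.withinss
m20$tot.withinss
```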


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment
Cluster method: kmeans
Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result : List of 6
  ..$ result : List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
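The definition can be verified directly: for each cluster, centre the member observations and sum the squares. A sketch on synthetic data (not the weather dataset):

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
m <- kmeans(x, centers=3)

# Within sum of squares per cluster: squared distances to the cluster mean.
wss <- sapply(1:3, function(i)
              sum(scale(x[m$cluster==i, , drop=FALSE], scale=FALSE)^2))

all.equal(as.vector(wss), m$withinss)  # TRUE
all.equal(sum(wss), m$tot.withinss)    # TRUE
```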

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

(Plot of the sum of squares, 0 to 6000, against the number of clusters, 0 to 50: tot.withinss decreasing and betweenss increasing as the number of clusters grows.)
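The two measures are complementary: the total sum of squares decomposes exactly into the total within sum of squares plus the between sum of squares. A check on synthetic data (not the weather dataset):

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
m <- kmeans(x, centers=4)

# The decomposition of the total sum of squares.
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```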


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk   <- 1:20
for (k in nk)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Scree plot: the scaled total within sum of squares against k = 1 to 20, flattening out as k grows.)


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k minus 1) to the within sum of squares (divided by n minus k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
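The ratio can be computed directly from the components stored by kmeans(); a sketch on synthetic data (not the weather dataset):

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
n <- nrow(x)
k <- 3
m <- kmeans(x, centers=k, nstart=10)

# Calinski-Harabasz: between-cluster variance over within-cluster variance.
ch <- (m$betweenss / (k-1)) / (m$tot.withinss / (n-k))
ch
```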

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot of the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.)


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters, 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion (check timing).

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot of the scaled average silhouette width against k = 1 to 20, peaking at k = 2.)
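The silhouette itself comes from the cluster package; a sketch of computing the average silhouette width for a single k-means clustering on synthetic data (not the weather dataset):

```r
library(cluster)  # silhouette()

set.seed(42)
x <- matrix(rnorm(200), ncol=2)
m <- kmeans(x, centers=3)

# Silhouette widths from the cluster assignment and the distance matrix.
si  <- silhouette(m$cluster, dist(x))
asw <- mean(si[, "sil_width"])
asw
```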


25 K-Means: Using clusterCrit Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot of the scaled criterion against k = 1 to 20, peaking at k = 3.)


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m  <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()
for (k in 2:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms   <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

(Plot of the first six scaled criteria against k = 2 to 20: ballh, banfe, cinde, calin, davie and detra.)


27 K-Means: Plot All Criteria

(Six panels plot the remaining scaled criteria against k = 2 to 20, in groups of six: dunn, gamma, gplus, gdi11 to gdi13; gdi21 to gdi33; gdi41 to gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; and silho, tau, trace, trace1, wemme, xiebe.)


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55 
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2 
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100 
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2 
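Conceptually, the prediction simply assigns each new observation to the cluster whose centre is nearest. A sketch of that assignment by hand, on synthetic data rather than via rattle:

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
train <- 1:80
test  <- 81:100
m <- kmeans(x[train, ], centers=2)

# Distance from each test observation to each cluster centre,
# then pick the nearest centre for each observation.
d    <- as.matrix(dist(rbind(m$centers, x[test, ])))[-(1:2), 1:2]
pred <- apply(d, 1, which.min)
pred
```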


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

(Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked.)

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

(clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), plotting the observations against the first two principal components. These two components explain 56.04% of the point variability.)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

(Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average silhouette widths: 49 (0.20), 30 (0.17), 23 (0.02), 27 (0.10), 34 (0.15), 45 (0.14), 44 (0.11), 40 (0.23), 26 (0.11), 48 (0.09).)


31 Clara
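The section heading refers to clara() from the cluster package, the large-data companion to pam(): it applies pam() to samples of the data and keeps the best set of medoids. A hedged sketch on synthetic data (not the weather dataset):

```r
library(cluster)  # clara()

set.seed(42)
x <- matrix(rnorm(2000), ncol=2)

# pam() on repeated samples of the data; samples= controls how many
# samples are drawn.
model <- clara(x, k=4, samples=20)

table(model$clustering)
model$medoids
```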


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

(Cluster dendrogram from hclusterpar with "ward" linkage, height on the vertical axis and rectangles marking the 10 clusters.)


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

(Coloured dendrogram, with the 366 observation labels along the bottom and the 10 clusters coloured.)


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.


36 Self Organising Maps SOM

(SOM plot titled "Weather Data": a grid of nodes, each showing its codebook profile over the 14 variables from min_temp to cloud_3pm.)

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website which indicate the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualise the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit: Calinski-Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

3 Distance Calculation: Euclidean Distance

Suppose we pick the first two observations from our dataset and the first 5 numeric variables:

ds[1:2, numi[1:5]]

  min_temp max_temp rainfall evaporation sunshine
1        8     24.3      0.0         3.4      6.3
2       14     26.9      3.6         4.4      9.7

x <- ds[1, numi[1:5]]
y <- ds[2, numi[1:5]]

Then x - y is simply:

x - y

  min_temp max_temp rainfall evaporation sunshine
1       -6     -2.6     -3.6          -1     -3.4

Then the square of each difference is

sapply(x-y, '^', 2)

   min_temp    max_temp    rainfall evaporation    sunshine
      36.00        6.76       12.96        1.00       11.56

The sum of the squares of the differences

sum(sapply(x-y, '^', 2))

[1] 68.28

Finally, the square root of the sum of the squares of the differences (also known as the Euclidean distance) is:

sqrt(sum(sapply(x-y, '^', 2)))

[1] 8.263

Of course, we don't need to calculate this so manually ourselves. R provides dist() to calculate the distance (Euclidean distance by default):

dist(ds[1:2, numi[1:5]])

      1
2 8.263

We can also calculate the Manhattan distance

sum(abs(x-y))

[1] 16.6

dist(ds[1:2, numi[1:5]], method="manhattan")

     1
2 16.6

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 3 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

4 Minkowski Distance

dist(ds[1:2, numi[1:5]], method="minkowski", p=1)

     1
2 16.6

dist(ds[1:2, numi[1:5]], method="minkowski", p=2)

      1
2 8.263

dist(ds[1:2, numi[1:5]], method="minkowski", p=3)

      1
2 6.844

dist(ds[1:2, numi[1:5]], method="minkowski", p=4)

      1
2 6.368

[Figure: the Minkowski distance between the two observations as the index p ranges from 1 to 20, decreasing as p increases.]


5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4
2  8.263
3  7.812  7.434
4 41.375 38.067 37.531

library(cluster)   # provides daisy()
daisy(ds[1:5, numi[1:5]])

Dissimilarities :
       1      2      3      4
2  8.263
3  7.812  7.434

daisy(ds[1:5, cati])

Dissimilarities :
       1      2      3      4
2 0.6538
3 0.6923 0.5385


6 K-Means Basics: Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm.

The algorithm begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
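The iterative search can be sketched in a few lines of R. This is an illustrative sketch only, on made-up data (the matrix x and the convergence tolerance are our own choices), not the implementation behind kmeans():

```r
# Illustrative sketch of the k-means iteration (not the kmeans() internals).
# Assumes no cluster becomes empty during the iterations.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)      # toy data: 100 observations, 2 variables
k <- 4
centers <- x[sample(nrow(x), k), ]   # start from k randomly chosen observations
repeat
{
  # Gravitate each observation to its nearest center (Euclidean distance).
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # Recalculate each center as the mean of its cluster's observations.
  new.centers <- apply(x, 2, function(v) tapply(v, cluster, mean))
  if (max(abs(new.centers - centers)) < 1e-8) break
  centers <- new.centers
}
```

kmeans() itself uses more efficient algorithms (Hartigan-Wong by default) and handles empty clusters, so this sketch is for intuition only.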


7 K-Means Using kmeans()

Here is our first attempt to cluster our dataset

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on.

set.seed(42)

model <- mkm <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54

The component mkm$cluster reports to which of the 10 clusters each of the original observations belongs:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance differs between differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.
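The arithmetic can be confirmed on a toy vector before we turn to scale() (a sketch with made-up numbers):

```r
# Manual z-score: subtract the mean, then divide by the standard deviation.
v <- c(8, 14, 5, 11)
z <- (v - mean(v)) / sd(v)
round(mean(z), 10)   # 0: the rescaled variable is centred on zero
sd(z)                # 1: a unit of difference is one standard deviation
```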

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp       rainfall       evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall       evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd()

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468


9 K-Means Scaled Dataset

set.seed(42)

model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

 [1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

 [1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013), we can produce an animation that illustrates the k-means algorithm.

library(animation)

We generate some random data for two variables over 100 observations

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")
dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the k-means algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually the means do not change location, and the algorithm converges.

[Figures: successive frames of the k-means animation, alternating the "Find cluster" and "Move centers" steps on the X1-X2 plane until the centers converge.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers across the numeric variables, one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers across the numeric variables.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles on the scaled data, with grid lines at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the centre of cluster 4 alone, with grid lines at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figures: a two-by-two grid of radial plots, one per cluster (Cluster 1 to Cluster 4), each with grid lines at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster 1"), p2+ggtitle("Cluster 2"),
             p3+ggtitle("Cluster 3"), p4+ggtitle("Cluster 4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp   max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17  1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16    -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points are more robust, being actual clusters representing some cohesion among the observations belonging to that cluster.
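The instability can be observed directly before using any dedicated tool: build two clusterings from different random seeds and cross-tabulate their assignments (a sketch assuming the scaled data used above):

```r
# Two k-means runs from different random starts will generally disagree.
# A cross-tabulation shows how (or whether) the two sets of clusters align.
set.seed(1); c1 <- kmeans(scale(ds[numi]), 10)$cluster
set.seed(2); c2 <- kmeans(scale(ds[numi]), 10)$cluster
table(c1, c2)
```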

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

* Cluster stability assessment *
Cluster method:  kmeans

Full clustering results are given as the result parameter of the clusterboot object, which also provides further statistics:

str(model)

List of 31
 $ result  :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically the sum of the squared distances between observations.
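The total sum of squares, for instance, can be reproduced directly: it is the sum of squared deviations of every observation from the per-variable means, and since scale() centres each column on zero this reduces to the sum of the squared scaled values (a sketch assuming ds and numi as defined earlier):

```r
# Total sum of squares from first principles. After scale() the column means
# are zero, so this is simply the sum of the squared values, matching
# model$totss for any k.
sum(scale(ds[numi])^2)
```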


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
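The reported values can be reproduced from first principles, summing the squared deviations of each cluster's observations from that cluster's centre (a sketch assuming the model built on the scaled data above):

```r
# Recompute model$withinss: for each cluster, sum the squared deviations of
# its observations from the cluster centre.
dss <- scale(ds[numi])
sapply(1:nrow(model$centers), function(i)
  sum(sweep(dss[model$cluster == i, , drop=FALSE],
            2, model$centers[i, ])^2))
```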


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other:

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: the total within sum of squares (tot.withinss) decreasing and the between sum of squares (betweenss) increasing as the number of clusters grows from 1 to 50.]
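The two measures are tied to the total sum of squares by a simple identity, which we can check on the model (a sketch assuming the kmeans model above):

```r
# For any kmeans model: totss = tot.withinss + betweenss.
all.equal(model$totss, model$tot.withinss + model$betweenss)
```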


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
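Since kmeans() returns both sums of squares, the criterion can be computed directly from a model; the helper ch() below is our own illustrative function, not part of fpc (a sketch assuming the scaled data above):

```r
# Calinski-Harabasz (variance ratio) criterion from a kmeans model:
# (BSS / (k - 1)) / (WSS / (n - k)).
ch <- function(model, n)
{
  k <- nrow(model$centers)
  (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
}
m <- kmeans(scale(ds[numi]), 2)
ch(m, nrow(ds))
```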

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes using the average silhouette width criterion, compared to minutes using the Calinski-Harabasz criterion.

library(fpc)
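For a single clustering the average silhouette width can also be computed directly with silhouette() from the cluster package (a sketch assuming the scaled data above):

```r
# Average silhouette width for one clustering, computed from the silhouette
# of each observation against the full distance matrix.
library(cluster)
dss <- scale(ds[numi])
m <- kmeans(dss, 2)
sil <- silhouette(m$cluster, dist(dss))
mean(sil[, "sil_width"])
```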

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width criterion for k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit: Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]   0.0 812.0 878.8 867.1 757.8 643.4 644.8 498.7 518.3 488.1 427.8
[12] 450.4 430.3 445.3 401.2 387.6 392.6 386.7 351.9 323.3

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) plotted for k = 2 to 20.]


27 K-Means Plot All Criteria

[Figures: six panels plotting the remaining scaled criteria for k = 2 to 20, grouped as: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means:

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected; once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

library(cluster)   # pam() is provided by the cluster package.
model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked with crosses.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), showing the clusters against the first two principal components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). n = 366, 10 clusters Cj; average silhouette width 0.14. Per-cluster sizes nj and average silhouette widths: cluster 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara
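This section is empty in the source. As a sketch of the technique the heading names: clara() from the cluster package (an assumption here, since cluster is not among the packages loaded at the start of the module) applies PAM to sampled subsets, making medoid-based clustering feasible for larger datasets:

```r
library(cluster)

# ruspini is a small demonstration dataset shipped with cluster.
data(ruspini)

# clara() behaves like pam() but clusters samples of the data and
# keeps the best set of medoids found.
model <- clara(ruspini, k=4, samples=50)

model$medoids             # one representative observation per cluster
table(model$clustering)   # cluster sizes
```

On genuinely large data, clara() runs in time roughly linear in the number of observations, whereas pam() is quadratic.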


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas, 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler, 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: cluster dendrogram from hclusterpar(*, "ward") with rectangles marking the 10 clusters; the y-axis shows the height.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis, 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with the 10 clusters coloured; leaf labels are the observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, on the assumption that the data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
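A hedged sketch of the first step of the exercise: recode missingness as binary and cluster with mona() from the cluster package. The data frame here is synthetic; in the exercise the module's ds would be used instead:

```r
library(cluster)

# Build a small data frame with injected missing values.
set.seed(42)
df <- data.frame(a=rnorm(20), b=rnorm(20), c=rnorm(20))
df$a[sample(20, 5)] <- NA
df$b[sample(20, 8)] <- NA
df$c[sample(20, 4)] <- NA

# 1/0 indicator of missing/present for each variable.
miss <- as.data.frame(lapply(df, function(v) as.integer(is.na(v))))

# mona() builds a monothetic hierarchical clustering of binary data,
# splitting on one variable at a time.
model <- mona(miss)
model$order   # ordering of observations, useful for a levelplot
```

The levelplot part of the exercise would then display the miss matrix with rows in model$order.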


36 Self Organising Maps SOM

[Figure: self organising map titled "Weather Data", with each node showing the profile of min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Data Mining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes, 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

4 Minkowski Distance

dist(ds[1:2, numi[1:5]], method="minkowski", p=1)

     1
2 16.6

dist(ds[1:2, numi[1:5]], method="minkowski", p=2)

      1
2 8.263

dist(ds[1:2, numi[1:5]], method="minkowski", p=3)

      1
2 6.844

dist(ds[1:2, numi[1:5]], method="minkowski", p=4)

      1
2 6.368

[Figure: the Minkowski distance between the first two observations plotted against the order p, for p = 1 to 20; the distance decreases as p grows.]


5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4
2  8.263
3  7.812  7.434
4 41.375 38.067 37.531

daisy(ds[1:5, numi[1:5]])

Dissimilarities :
      1     2     3     4
2 8.263
3 7.812 7.434

daisy(ds[1:5, cati])

Dissimilarities :
       1      2      3      4
2 0.6538
3 0.6923 0.5385


6 K-Means Basics Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm.

The algorithm begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
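The iteration just described can be sketched in a few lines of R. This is a toy implementation for illustration only; kmeans() itself uses the Hartigan-Wong algorithm by default, not this naive loop:

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
k <- 3

# Seed the k means by picking k observations at random.
centres <- x[sample(nrow(x), k), ]

for (iter in 1:100)
{
  # Gravitate each observation to its nearest mean.
  cluster <- apply(x, 1, function(obs)
    which.min(colSums((t(centres) - obs)^2)))

  # Recalculate the mean of each cluster.
  new.centres <- apply(x, 2, function(col) tapply(col, cluster, mean))

  # Converged once the means no longer move.
  if (max(abs(new.centres - centres)) < 1e-10) break
  centres <- new.centres
}

table(cluster)
```

The loop alternates exactly the two steps of the description: assign points to the nearest mean, then recompute the means.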


7 K-Means Using kmeans()

Here is our first attempt to cluster our dataset:

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on.

set.seed(42)
model <- mkm <- kmeans(ds[numi], 10)

That appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54

The component mkm$cluster reports to which of the 10 clusters each of the original observations belongs:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp       rainfall       evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall       evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468
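The z-score transformation that scale() performs is simply (x - mean)/sd, column by column, which we can confirm directly (synthetic data here in place of ds):

```r
set.seed(42)
x <- data.frame(a=rnorm(50, mean=10, sd=3), b=runif(50, 0, 100))

# Manual z-score, one column at a time.
manual  <- sapply(x, function(col) (col - mean(col)) / sd(col))
builtin <- scale(x)

# The two agree to floating point precision, and each scaled column
# now has mean 0 and standard deviation 1.
all.equal(as.vector(manual), as.vector(builtin))
round(colMeans(manual), 10)
apply(manual, 2, sd)
```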


9 K-Means Scaled Dataset

set.seed(42)
model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1  1.0786  1.6740 -0.31018     1.43079   1.0397          0.6088
2  0.5325  0.9939 -0.24074     0.56206   0.8068         -0.2149
3  0.8808 -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

[1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$totwithinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie, 2013), we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations:

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot in the sequence shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually, the means do not change location, and the algorithm converges.


[Figure: a sequence of plots over several pages, alternating the "Find cluster" and "Move centers" steps of kmeans.ani(), showing the algorithm converging on the four clusters of the random data.]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial (polar) plot of the 10 cluster centers over all numeric variables, coloured by cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers over all numeric variables, coloured by cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: cluster profiles for the 4 clusters presented with CreateRadialPlot(), showing each cluster's centre over all numeric variables on a grid from -2 to 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the single cluster 4 profile on a grid from -2 to 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Figure: a 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each on a grid from -2 to 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
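This section is empty in the source. A common approach, sketched here on synthetic data, uses the nstart= argument of kmeans() to run the algorithm from several random starting configurations and keep the solution with the lowest total within sum of squares:

```r
set.seed(42)
x <- matrix(rnorm(400), ncol=4)

# A single start can land in a poor local minimum; nstart=20 keeps
# the best of 20 randomly started runs.
single <- kmeans(x, centers=5, nstart=1)
multi  <- kmeans(x, centers=5, nstart=20)

single$tot.withinss
multi$tot.withinss   # typically no worse than any single run
```

In the module's setting, scale(ds[numi]) would take the place of x.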


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters which are regularly identified from different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig, 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result

of the clusterboot object which also provides further statistics

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
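These quantities can be computed from first principles, which helps de-mystify what kmeans() reports (synthetic data here in place of ds):

```r
set.seed(42)
x <- matrix(rnorm(100), ncol=2)
model <- kmeans(x, centers=4, nstart=10)

# Total sum of squares: squared distances to the grand mean.
totss <- sum(sweep(x, 2, colMeans(x))^2)

# Within sum of squares: squared distances of each cluster's
# observations to that cluster's own mean (its centre).
wss <- sapply(1:4, function(j)
  sum(sweep(x[model$cluster == j, , drop=FALSE], 2,
            model$centers[j, ])^2))

all.equal(totss, model$totss)
all.equal(as.vector(wss), as.vector(model$withinss))
```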


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: the total within sum of squares (totwithinss) and the between sum of squares (betweenss) plotted against the number of clusters, from 1 to 50; as one falls the other rises.]
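The trade-off visible in the plot is exact: the total sum of squares decomposes into the within and between components, totss = tot.withinss + betweenss, which we can verify on synthetic data:

```r
set.seed(42)
x <- matrix(rnorm(300), ncol=3)
model <- kmeans(x, centers=6, nstart=5)

# The decomposition holds exactly (up to floating point error).
model$totss
model$tot.withinss + model$betweenss

all.equal(model$totss, model$tot.withinss + model$betweenss)
```

Since totss is fixed for a given dataset, minimising the within sum of squares is the same as maximising the between sum of squares.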


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the (scaled) total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
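The index can be computed directly from a kmeans fit as (betweenss/(k-1)) / (tot.withinss/(n-k)); a sketch on synthetic data (in the module, scale(ds[numi]) would take the place of x):

```r
set.seed(42)
x <- scale(matrix(rnorm(500), ncol=5))
n <- nrow(x)
k <- 4
model <- kmeans(x, centers=k, nstart=10)

# Variance ratio: between-cluster spread over within-cluster spread,
# each divided by its degrees of freedom.
ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch
```

kmeansruns() below computes this same quantity for each candidate k and reports the best.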

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p



24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes using the Calinski-Harabasz criterion, compared to [check timing] minutes using the average silhouette width criterion.

library(fpc)
nk    <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
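The silhouette of a single clustering can also be computed directly, averaging the per-observation silhouette widths obtained from silhouette() in the cluster package; a sketch, again assuming ds and numi from earlier:

```r
library(cluster)

m  <- kmeans(scale(ds[numi]), 2)
si <- silhouette(m$cluster, dist(scale(ds[numi])))
mean(si[, "sil_width"])  # The average silhouette width for k=2.
```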

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p



25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes, 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, and hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78

[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case, k = 3 is the optimum choice.

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p



26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m  <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"

 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

 [7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))
dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms   <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p



27 K-Means Plot All Criteria

(A grid of panels plots the remaining criteria against k = 2 to 20: dunn, gamma, g_plus and the gdi family; ksq_detw, log_det_ratio, log_ss_ratio, mcclain_rao, pbm and point_biserial; ray_turi, ratkowsky_lance, scott_symons, sd_scat, sd_dis and s_dbw; and silhouette, tau, trace_w, trace_wib, wemmert_gancarski and xie_beni.)


28 K-Means predict()

rattle (Williams, 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al., 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, as once again only numeric variables can be clustered:

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

library(cluster)

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1,] 11      9.1     25.2      0.0         4.2     11.9              30

[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)


plot(model)


(The resulting clusplot projects the 10 clusters onto the first two principal components; these two components explain 56.04% of the point variability.)


(The silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean") reports n = 366 and 10 clusters, with an average silhouette width of 0.14. Cluster sizes and average silhouette widths: 1: 49 (0.20); 2: 30 (0.17); 3: 23 (0.02); 4: 27 (0.10); 5: 34 (0.15); 6: 45 (0.14); 7: 44 (0.11); 8: 40 (0.23); 9: 26 (0.11); 10: 48 (0.09).)


31 Clara
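clara() from cluster applies the same medoid-based approach as pam() but is designed for larger datasets, clustering repeated samples of the data rather than the full dissimilarity matrix. A sketch, assuming ds and numi as before:

```r
library(cluster)

# Cluster 50 samples of the data and keep the best medoid set found.
model <- clara(ds[numi], k=10, metric="euclidean", samples=50)
model$medoids
plot(model)
```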


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas, 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler, 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)



34 Add Colour to the Hierarchical Cluster

We use the dendroextras (Jefferis, 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations that exhibit similar patterns of behaviour, assuming the data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
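One possible approach to this exercise, as a sketch: build a binary incidence matrix of missingness and pass it to mona() from cluster. Note that mona() requires every variable to be binary with both values present, so constant columns are dropped first:

```r
library(cluster)

# 1 = value missing, 0 = value present.
bin <- data.frame(sapply(ds, function(x) as.integer(is.na(x))))

# mona() needs binary variables that actually vary.
bin <- bin[, sapply(bin, function(x) length(unique(x)) == 2)]

model <- mona(bin)
plot(model)  # Banner plot of the hierarchy.
```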


36 Self Organising Maps SOM


library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website that indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes, 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R". The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing at 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

5 General Distance

dist(ds[1:5, numi[1:5]])

       1      2      3      4

2  8.263

3  7.812  7.434

4 41.375 38.067 37.531

daisy(ds[1:5, numi[1:5]])

Dissimilarities :

       1      2      3      4

2  8.263

3  7.812  7.434

daisy(ds[1:5, cati])

Dissimilarities :

       1      2      3      4

2 0.6538

3 0.6923 0.5385


6 K-Means Basics Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm.

The algorithm begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitated, and so on, until there is no change to the means.
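The iteration can be sketched in a few lines of R. This toy implementation is our own illustration (in practice we use kmeans()), and it assumes every cluster remains non-empty across iterations:

```r
# A toy k-means: assign each point to its nearest mean, recompute means, repeat.
simple.kmeans <- function(x, k, iterations=10)
{
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop=FALSE]
  for (i in seq_len(iterations))
  {
    # Distances from every observation (rows) to every center (columns).
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Recalculate each center as the mean of its observations.
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster=cluster, centers=centers)
}
```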


7 K-Means Using kmeans()

Here is our first attempt to cluster our dataset:

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error arises because there are non-numeric variables that we are attempting to cluster on.

set.seed(42)

model <- mkm <- kmeans(ds[numi], 10)

That appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 58448 2038 075862 6000 10524 5266

2 130340 3142 009362 7677 10849 4332

3 139833 2102 714167 4892 2917 3954

The component mkm$cluster reports to which of the 10 clusters each of the original observations belongs:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance differs between differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0, and a unit of difference is one standard deviation.

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp       rainfall      evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall       evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])

attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468


9 K-Means Scaled Dataset

set.seed(42)

model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1  1.0786  1.6740 -0.31018     1.43079   1.0397          0.6088

2  0.5325  0.9939 -0.24074     0.56206   0.8068         -0.2149

3  0.8808 -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

[1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$totwithinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie, 2013), we can produce an animation that illustrates the k-means algorithm.

library(animation)

We generate some random data for two variables over 100 observations:

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)

x <- NULL

for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))

x <- matrix(x, ncol=2)

colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2

[1,] 1.394 1.606

[2,] 3.012 1.078

[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the k-means algorithm as it identifies 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")

kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points and re-calculating the means. Eventually the means do not change location, and the algorithm converges.

(A sequence of plots of X2 against X1 follows, the panels alternating between "Find cluster" and "Move centers" steps until the algorithm converges.)

11 Visualise the Cluster Radial Plot Using GGPlot2


dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4


nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot


The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot


CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots


p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centres of the original data and the starting measure of the within sum of squares.
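We can check this base case with a minimal sketch, using the scaled iris measurements as a stand-in dataset (ds[numi] is specific to this module): with a single cluster the one centre is the overall mean, so the within sum of squares equals the total sum of squares and the between sum of squares is numerically zero.

```r
# Base case sketch: one cluster only. iris stands in for ds[numi].
x <- scale(iris[1:4])
m <- kmeans(x, 1)

# One centre is the overall mean, so these two measures coincide.
isTRUE(all.equal(m$totss, m$tot.withinss))
round(m$betweenss, 8)  # numerically zero
```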


17 K-Means: Multiple Starts
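K-means depends on its random starting centres, so we typically run it several times and keep the best clustering. kmeans() supports this directly through its nstart argument, sketched here with scaled iris as a stand-in dataset:

```r
# Multiple random starts: keep the best of several runs.
x <- scale(iris[1:4])

set.seed(42)
single <- kmeans(x, 10, nstart=1)   # a single random start
set.seed(42)
multi  <- kmeans(x, 10, nstart=25)  # the best of 25 random starts

# The best-of-many model is never worse on total within sum of squares.
multi$tot.withinss <= single$tot.withinss
```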


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result : List of 6
  ..$ result : List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squares of the distances between observations.
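The within-cluster calculation can be sketched by hand and compared with the values kmeans() stores, using scaled iris as a stand-in for ds[numi]:

```r
# Recompute each cluster's within sum of squares from first principles.
x <- scale(iris[1:4])
set.seed(42)
m <- kmeans(x, 3)

# Sum of squared distances of each cluster's observations from its centre.
wss <- sapply(1:3, function(k)
  sum(scale(x[m$cluster == k, , drop=FALSE],
            center=m$centers[k, ], scale=FALSE)^2))
isTRUE(all.equal(as.numeric(wss), m$withinss))
```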


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
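The two measures are tied together by a simple identity, totss = tot.withinss + betweenss, so for a fixed dataset minimising one is the same as maximising the other. A minimal sketch, with scaled iris standing in for ds[numi]:

```r
x <- scale(iris[1:4])
set.seed(42)
m <- kmeans(x, 3)

# The decomposition of the total sum of squares.
isTRUE(all.equal(m$totss, m$tot.withinss + m$betweenss))
```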

[Figure: tot.withinss and betweenss plotted against the number of clusters (0 to 50), on a sum of squares scale from 0 to 6000, the two measures converging from opposite directions.]


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k from 1 to 20, flattening as k grows.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
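The ratio can be computed directly from the sums of squares stored in a kmeans model. A minimal sketch, with scaled iris standing in for ds[numi]:

```r
# Calinski-Harabasz by hand: (B/(k-1)) / (W/(n-k)).
x <- scale(iris[1:4])
set.seed(42)
k <- 3
m <- kmeans(x, k)
n <- nrow(x)

ch <- (m$betweenss/(k - 1)) / (m$tot.withinss/(n - k))
ch  # larger is better when comparing clusterings of the same data
```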

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]    0.0 1175.5 1009.7  888.1  821.6  747.5  697.5  651.8  613.8  581.8
[11]  557.1  534.4  516.3  500.7  483.4  469.0  453.2  440.7  425.7  416.5

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion against k from 1 to 20, peaking at k=2.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to minutes using the average silhouette width criterion.

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width against k from 1 to 20, peaking at k=2.]


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]   0.0 812.0 878.8 867.1 757.8 643.4 644.8 498.7 518.3 488.1 427.8
[12] 450.4 430.3 445.3 401.2 387.6 392.6 386.7 351.9 323.3

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit against k from 1 to 20, peaking at k=3.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k from 2 to 20.]


27 K-Means: Plot All Criteria

[Figure: six panels of the remaining criteria, scaled and plotted against k from 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means: predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
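The same assignment can be done without rattle by measuring the distance from each new observation to each centre and choosing the nearest, which is essentially what predict.kmeans() does. A sketch, with scaled iris standing in for the weather data:

```r
# Assign held-out observations to their nearest cluster centre by hand.
x <- scale(iris[1:4])
set.seed(42)
train <- sample(nrow(x), 0.7*nrow(x))
test  <- setdiff(seq_len(nrow(x)), train)
m <- kmeans(x[train, ], 2)

# Squared Euclidean distance from each test row to every centre.
nearest <- apply(x[test, ], 1, function(row)
  which.min(colSums((t(m$centers) - row)^2)))
head(nearest)
```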


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged Terminate

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
 [1,] 11      9.1     25.2      0.0         4.2     11.9              30
 [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: pairs plot of min_temp, max_temp, rainfall, evaporation and sunshine, points coloured by cluster with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), Component 1 against Component 2. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara
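A minimal sketch of the idea: clara() from the cluster package is a sampling-based relative of pam(), suited to larger datasets, and is called in much the same way (iris stands in for ds[numi]):

```r
library(cluster)  # for clara()

set.seed(42)
# clara() clusters repeated samples of the data and keeps the best medoids.
model <- clara(iris[1:4], k=10, samples=50)

model$medoids            # one representative observation per cluster
table(model$clustering)  # cluster membership over all observations
```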


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: dendrogram titled "Cluster Dendrogram", x-axis hclusterpar (*, "ward"), Height axis from 0 to 1500, with rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram of the 366 observations, cut into 10 coloured clusters, with a Height axis from 0 to 1500.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
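A sketch of one way to approach the exercise, using airquality (which has genuine missing values) as a stand-in population, and hierarchical clustering on the binary indicators as a simple substitute for mona():

```r
# 1 = missing, 0 = present; one indicator column per variable.
mb <- data.frame(ifelse(is.na(airquality), 1L, 0L))

# Observations with similar missingness patterns end up close together.
hc <- hclust(dist(mb, method="manhattan"), method="average")
groups <- cutree(hc, k=3)
table(groups)
```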


36 Self Organising Maps: SOM

[Figure: SOM plot titled "Weather Data", a 5 by 4 hexagonal grid of codebook segments over the 14 variables min_temp through cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

6 K-Means Basics: Iterative Cluster Search

The k-means algorithm is a traditional and widely used clustering algorithm.

The algorithm begins by specifying the number of clusters we are interested in: this is the k. Each of the k clusters is identified as the vector of the average (i.e., the mean) value of each of the variables for observations within a cluster. A random clustering is first constructed, the k means calculated, and then, using the distance measure, we gravitate each observation to its nearest mean. The means are then recalculated and the points re-gravitate. And so on, until there is no change to the means.
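The loop described above can be sketched directly in a few lines of R (scaled iris stands in for the weather data; kmeans() itself is of course the tool to use in practice):

```r
set.seed(42)
x <- scale(iris[1:4])
k <- 3
centers <- x[sample(nrow(x), k), ]  # random starting means

repeat
{
  # Gravitate each observation to its nearest mean (squared Euclidean).
  cl <- apply(x, 1, function(row)
    which.min(colSums((t(centers) - row)^2)))
  # Recalculate the means from the new clusters (keep an old centre if a
  # cluster happens to empty out).
  new <- t(sapply(1:k, function(i)
    if (any(cl == i)) colMeans(x[cl == i, , drop=FALSE]) else centers[i, ]))
  if (all(abs(new - centers) < 1e-12)) break  # no change: converged
  centers <- new
}
table(cl)
```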


7 K-Means: Using kmeans()

Here is our first attempt to cluster our dataset

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on

set.seed(42)
model <- mkm <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   5.8448    20.38  0.75862       6.000   10.524           52.66
2  13.0340    31.42  0.09362       7.677   10.849           43.32
3  13.9833    21.02  7.14167       4.892    2.917           39.54

The component mkm$cluster reports which of the 10 clusters each of the original observations belongs to:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

   min_temp       max_temp      rainfall      evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20

summary(scale(ds[numi[1:5]]))

   min_temp          max_temp        rainfall       evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468
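The same transform can be verified by hand: subtract each column's mean and divide by its standard deviation, then compare with scale() (iris stands in for ds[numi]):

```r
x <- iris[1:4]

# Column-wise (x - mean) / sd, the z-score by hand.
z <- sweep(sweep(x, 2, sapply(x, mean)), 2, sapply(x, sd), "/")

isTRUE(all.equal(as.matrix(z), scale(as.matrix(x)), check.attributes=FALSE))
```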


9 K-Means: Scaled Dataset

set.seed(42)
model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

[1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013) we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages show the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually the means do not change location and the algorithm converges.

[Figures: sixteen frames of the k-means animation over X1 and X2, alternating between "Move centers" and "Find cluster" steps until the algorithm converges.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers over the 16 numeric weather variables (min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm, temp_9am, temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers over the 16 numeric weather variables, one coloured line per cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: CreateRadialPlot() rendering of the 4 cluster profiles over the 16 numeric weather variables, with grid circles at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range extends to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, with grid circles at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a grid of four radial plots, Cluster 1 to Cluster 4, each showing one cluster profile over the 16 numeric weather variables with grid circles at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster 1"), p2+ggtitle("Cluster 2"),
             p3+ggtitle("Cluster 3"), p4+ggtitle("Cluster 4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 9.98e-17 1.274e-16 -2.545e-16 -1.629e-16 -5.836e-16 1.99e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1.323e-16 -2.87e-16 -4.162e-16 -1.11e-16 -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
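The source provides no code for this section. As a sketch only (not the original author's code, and using synthetic stand-in data), the nstart argument of kmeans() runs several random starts and keeps the best clustering:

```r
# Sketch: nstart=25 runs 25 random starts and retains the solution
# with the smallest total within sum of squares. Synthetic data
# stands in for the scaled weather dataset used elsewhere.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)

m1  <- kmeans(x, centers=4, nstart=1)   # a single random start
m25 <- kmeans(x, centers=4, nstart=25)  # best of 25 random starts

m1$tot.withinss
m25$tot.withinss  # typically no larger than the single-start value
```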


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters which are regularly identified with different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment
Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
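To make the decomposition concrete, the following sketch (not from the original text; iris stands in for the weather dataset) verifies that the total sum of squares splits into the within and between components reported by kmeans():

```r
# Sketch: check that totss = tot.withinss + betweenss for a kmeans
# fit. iris stands in for scale(ds[numi]).
set.seed(42)
x <- scale(as.matrix(iris[1:4]))
m <- kmeans(x, centers=3)

# Total sum of squares: squared distances of each observation from
# the per-variable overall mean.
totss <- sum(sweep(x, 2, colMeans(x))^2)

all.equal(totss, m$totss)
all.equal(m$totss, m$tot.withinss + m$betweenss)
```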


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: the total within sum of squares (totwithinss) decreases and the between sum of squares (betweenss) increases as the number of clusters grows from 0 to 50.]
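A sketch of the kind of code behind such a plot (not the original source; iris stands in for the weather dataset, and a smaller range of k is used):

```r
# Sketch: record tot.withinss and betweenss over a range of k and
# plot both measures together.
library(ggplot2)
library(reshape2)

set.seed(42)
x <- scale(as.matrix(iris[1:4]))  # stand-in for scale(ds[numi])
ks <- 1:10
ss <- do.call(rbind, lapply(ks, function(k)
{
  m <- kmeans(x, k)
  data.frame(k=k, totwithinss=m$tot.withinss, betweenss=m$betweenss)
}))
ssm <- melt(ss, id.vars="k", variable.name="Measure")
p <- ggplot(ssm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point() + geom_line()
p <- p + labs(x="Number of Clusters", y="Sum of Squares")
p
```

Note that for every k the two measures sum to the same constant total sum of squares.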


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20, flattening as k grows.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
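The criterion can also be computed by hand from the kmeans() output, as in this sketch (not from the original; iris stands in for the weather dataset):

```r
# Sketch: Calinski-Harabasz = (BSS/(k-1)) / (WSS/(n-k)).
set.seed(42)
x <- scale(as.matrix(iris[1:4]))
k <- 3
n <- nrow(x)
m <- kmeans(x, k)
ch <- (m$betweenss/(k - 1)) / (m$tot.withinss/(n - k))
ch
```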

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, the Calinski-Harabasz criterion took 30 minutes, whilst the average silhouette width criterion took considerably longer.

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled average silhouette width criterion for k = 1 to 20, peaking at k = 2.]
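The silhouette widths underlying this criterion can be computed directly with silhouette() from the cluster package, as in this sketch (not from the original; iris stands in for the weather dataset):

```r
# Sketch: average silhouette width of a kmeans clustering.
library(cluster)

set.seed(42)
x <- scale(as.matrix(iris[1:4]))
m <- kmeans(x, 3)
si <- silhouette(m$cluster, dist(x))
msw <- mean(si[, "sil_width"])
msw  # the average silhouette width, between -1 and 1
```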


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plotting the remaining scaled criteria against k = 2 to 20: (dunn, gamma, gplus, gdi11, gdi12, gdi13), (gdi21, gdi22, gdi23, gdi31, gdi32, gdi33), (gdi41, gdi42, gdi43, gdi51, gdi52, gdi53), (ksqde, logde, logss, mccla, pbm, point), (raytu, ratko, scott, sdsca, sddis, sdbw), and (silho, tau, trace, trace1, wemme, xiebe).]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() method to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

library(cluster) # provides pam()
model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster membership with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean") over the first two principal components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average silhouette widths: 1: 49 (0.20), 2: 30 (0.17), 3: 23 (0.02), 4: 27 (0.10), 5: 34 (0.15), 6: 45 (0.14), 7: 44 (0.11), 8: 40 (0.23), 9: 26 (0.11), 10: 48 (0.09).]


31 Clara
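The source provides no code for this section. clara() from the cluster package applies pam() to samples of the data, making it suitable for larger datasets; a minimal sketch (not the original author's code; iris stands in for the weather dataset):

```r
# Sketch: clara() clusters a large dataset by running pam() on
# repeated samples and keeping the best result.
library(cluster)

set.seed(42)
x <- scale(as.matrix(iris[1:4]))
model <- clara(x, k=3, samples=10)
model$medoids           # one medoid row per cluster
table(model$clustering) # cluster sizes
```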


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: "Cluster Dendrogram" from hclusterpar (ward linkage), height axis 0 to 1500, with rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with each of the 10 clusters coloured, height axis 0 to 1500, leaves labelled by observation number.]

35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
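A possible starting point for the exercise, sketched on synthetic data with injected missing values (not a full solution, and not from the original text):

```r
# Sketch: convert each variable to 1 (present) / 0 (missing) and
# cluster the missingness patterns with mona() from cluster.
library(cluster)

set.seed(42)
d <- matrix(rnorm(300), ncol=3)
d[sample(length(d), 60)] <- NA      # inject some missing values

miss <- ifelse(is.na(d), 0L, 1L)    # binary present/missing matrix
colnames(miss) <- c("v1", "v2", "v3")
mm <- mona(miss)
head(mm$order)                      # ordering of the observations
```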


36 Self Organising Maps SOM

[Figure: "Weather Data" self organising map: a 5 by 4 hexagonal grid of nodes, each showing a segment plot over the 14 variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website flagged as the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures, and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualise the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

7 K-Means Using kmeans()

Here is our first attempt to cluster our dataset:

model <- mkm <- kmeans(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is because there are non-numeric variables that we are attempting to cluster on.
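The index numi, used throughout this chapter, records which columns of ds are numeric. A sketch (not from the original text; iris stands in for the weather dataset here) of how such an index is typically built:

```r
# Sketch: numi holds the column indices of the numeric variables.
ds <- iris  # stand-in; the chapter's ds is the weather dataset
numi <- which(sapply(ds, is.numeric))
numi
```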

set.seed(42)
model <- mkm <- kmeans(ds[numi], 10)

So that appears to have succeeded in building 10 clusters. The sizes of the clusters can readily be listed:

model$size

[1] 29 47 24 55 21 33 35 50 41 31

The cluster centers (i.e., the means) can also be listed:

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1  5.8448 20.38 0.75862   6.000 10.524   52.66
2 13.0340 31.42 0.09362   7.677 10.849   43.32
3 13.9833 21.02 7.14167   4.892  2.917   39.54

The component mkm$cluster reports to which of the 10 clusters each of the original observations belongs:

head(model$cluster)

[1] 4 8 6 6 6 10

model$iter

[1] 6

model$ifault

[1] 0


8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.
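The z-score is simply (x - mean(x))/sd(x); a quick sketch (not from the original text) checking this against scale():

```r
# Sketch: the z-score transform matches what scale() computes for
# each column of a numeric dataset.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- (x - mean(x)) / sd(x)
all.equal(z, as.vector(scale(x)))
```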

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp      rainfall       evaporation
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall        evaporation
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121

The scale() function also provides some extra information, recording the actual original means and the standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine
      7.266      20.550       1.428       4.522       7.915

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine
      6.026       6.691       4.226       2.669       3.468

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine 
      7.266      20.550       1.428       4.522       7.915 

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine 
      6.026       6.691       4.226       2.669       3.468 


9 K-Means Scaled Dataset

set.seed(42)

model <- mkms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886
....

model$totss

[1] 5840

model$withinss

 [1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013), we can produce an animation that illustrates the k-means algorithm.

library(animation)

We generate some random data for two variables over 100 observations:

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)

x <- NULL

for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))

x <- matrix(x, ncol=2)

colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378
....

The series of plots over the following pages shows the convergence of the k-means algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")

kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, and then re-calculating the means. Eventually the means do not change location, and the algorithm converges.

(The following pages of the original document show the sequence of plots produced by kmeans.ani(), alternating between a "Find cluster" step and a "Move centers" step over the variables X1 and X2 until the algorithm converges.)

11 Visualise the Cluster: Radial Plot Using GGPlot2

(Radial plot: the centers of the 10 clusters traced over the numeric weather variables, from min_temp through temp_3pm, one coloured line per cluster.)

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster: Radial Plot with K=4

(Radial plot: the centers of the 4 clusters traced over the numeric weather variables, one coloured line per cluster.)

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

(Radial plot of the four cluster profiles over the numeric weather variables, with grid circles at -2, 0 and 2.)

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

(Radial plot of the cluster 4 profile alone, with grid circles at -2, 0 and 2.)

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

(A 2x2 grid of radial plots, one per cluster profile, titled Cluster1 through Cluster4, each with grid circles at -2, 0 and 2.)

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15
....

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.
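As a small sketch of why stability matters, on synthetic data rather than the weather dataset, two runs of kmeans() with different random starts need not produce the same partition:

```r
set.seed(1)
x <- matrix(rnorm(200), ncol=2)   # 100 observations, 2 variables

k1 <- kmeans(x, 5)   # first random start
k2 <- kmeans(x, 5)   # second random start

# Cross-tabulate the two labellings. Cluster numbers are arbitrary, so
# identical partitions appear as a permuted diagonal; scattered counts
# indicate genuinely different clusterings.
table(first=k1$cluster, second=k2$cluster)
```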

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters:

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method:  kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics:

str(model)

List of 31
 $ result          :List of 6
  ..$ result        :List of 11
  .. ..$ cluster    : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
....


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squares of the distances between observations.
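A sketch on synthetic data of what the total sum of squares measures: it is the sum of squared distances of every observation from the overall (column-wise) means, regardless of the clustering:

```r
set.seed(7)
x <- matrix(rnorm(60), ncol=3)   # 20 observations, 3 variables
m <- kmeans(x, 4)

# Centre each column at its mean, square, and sum everything.
manual <- sum(sweep(x, 2, colMeans(x))^2)

all.equal(m$totss, manual)  # TRUE
```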


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
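The per-cluster calculation can be sketched directly, again on synthetic data: each cluster's withinss is the sum of squared distances of its observations from that cluster's centre.

```r
set.seed(7)
x <- matrix(rnorm(60), ncol=3)
m <- kmeans(x, 4)

# For each cluster k: take its observations, subtract its centre, square, sum.
wss <- sapply(1:4, function(k)
  sum(sweep(x[m$cluster == k, , drop=FALSE], 2, m$centers[k, ])^2))

all.equal(as.vector(m$withinss), wss)       # TRUE
all.equal(sum(wss), m$tot.withinss)         # TRUE: they sum to tot.withinss
```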

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other:

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

(Plot: sum of squares against the number of clusters, 0 to 50. The total within sum of squares (tot.withinss) decreases while the between sum of squares (betweenss) increases as the number of clusters grows.)
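The two measures trade off against a fixed total: the three sums of squares reported by kmeans() are linked by the identity totss = tot.withinss + betweenss, sketched here on synthetic data:

```r
set.seed(7)
x <- matrix(rnorm(60), ncol=3)
m <- kmeans(x, 4)

# The total variation is partitioned into within-cluster and
# between-cluster components.
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```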


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Scree plot: the scaled total within sum of squares over k = 1 to 20, flattening out as k grows.)


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
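The ratio described above can be computed directly from the sums of squares that kmeans() returns; a sketch on synthetic data:

```r
set.seed(7)
x <- matrix(rnorm(300), ncol=3)   # 100 observations, 3 variables
n <- nrow(x)
k <- 4
m <- kmeans(x, k)

# Calinski-Harabasz: between-cluster variance over within-cluster variance.
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch   # higher values indicate a better clustering
```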

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot: the scaled Calinski-Harabasz criterion over k = 1 to 20, peaking at k = 2.)


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters with 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion [check timing].

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot: the scaled average silhouette width criterion over k = 1 to 20, peaking at k = 2.)
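As a sketch of what kmeansruns() is maximising when criterion="asw", the average silhouette width can be computed directly with silhouette() from the cluster package (which ships with R), here on synthetic data:

```r
library(cluster)   # provides silhouette()

set.seed(7)
x <- matrix(rnorm(300), ncol=3)   # 100 observations, 3 variables
m <- kmeans(x, 4)

# Silhouette widths compare each observation's distance to its own
# cluster against its distance to the nearest other cluster.
sil <- silhouette(m$cluster, dist(x))
asw <- mean(sil[, "sil_width"])
asw   # always between -1 and 1; larger is better
```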


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence the different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Plot: the scaled Calinski-Harabasz criterion from clusterCrit over k = 1 to 20, peaking at k = 3.)


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

(Plot of the first six criteria (ball_hall, banfeld_raftery, c_index, calinski_harabasz, davies_bouldin, det_ratio) over k = 2 to 20.)


27 K-Means Plot All Criteria

(Six further plots, each showing six of the remaining criteria over k = 2 to 20: the dunn, gamma and g_plus criteria with gdi11 to gdi13; gdi21 to gdi33; gdi41 to gdi53; the ksq, log-det, log-ss, mcclain, pbm and point-biserial criteria; ray_turi, ratkowsky, scott, sd_scat, sd_dis and s_dbw; and silhouette, tau, trace_w, trace_wib, wemmert and xie_beni.)


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means:

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, as once again only numeric variables can be clustered:

mewkm <- ewkm(ds[numi], 10)

Clustering converged Terminate

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,]  11      9.1     25.2      0.0         4.2     11.9              30
[2,]  38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

(Scatterplot matrix of the first five numeric variables, min_temp through sunshine, with points coloured by cluster and the medoids marked by crosses.)

plot(model)


(clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): the clusters plotted over the first two principal components, which explain 56.04% of the point variability.)


(Silhouette plot of the same PAM model: n = 366, 10 clusters, per-cluster average silhouette widths ranging from 0.02 to 0.23, overall average silhouette width 0.14.)


31 Clara


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

(Dendrogram titled "Cluster Dendrogram", produced from hclusterpar with Ward linkage, with a height axis and rectangles marking the 10 clusters.)


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

(Coloured dendrogram: the same hierarchical clustering with the 10 clusters coloured, and observation numbers as leaf labels.)


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
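A hedged sketch of the first part of this exercise, on a fabricated data frame with injected missingness (the data frame d and its missingness pattern are invented for illustration, not drawn from the weather dataset):

```r
library(cluster)   # provides mona() for monothetic clustering of binary data

set.seed(42)
d <- data.frame(a=rnorm(50), b=rnorm(50), c=rnorm(50))
d$a[sample(50, 10)] <- NA   # inject some missing values
d$b[sample(50, 25)] <- NA

# Convert each variable to a 0/1 indicator of missingness.
db <- as.data.frame(lapply(d, function(x) as.integer(is.na(x))))

# mona() requires binary variables that actually vary, so drop constants
# (here, column c has no missing values and so is all zeros).
db <- db[, sapply(db, function(x) length(unique(x)) == 2), drop=FALSE]

mm <- mona(db)
mm$order[1:5]   # observations reordered by the monothetic hierarchy
```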


36 Self Organising Maps SOM

(SOM plot titled "Weather Data": codebook segments for the 14 variables, min_temp through cloud_3pm, over a 5 by 4 hexagonal grid.)

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!, Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

8 Scaling Datasets

We noted earlier that a unit of distance is different for differently measured variables. For example, one year of difference in age seems like it should be a larger difference than a $1 difference in our income. A common approach is to rescale our data by subtracting the mean and dividing by the standard deviation. This is often referred to as a z-score. The result is that the mean for all variables is 0 and a unit of difference is one standard deviation.

The R function scale() can perform this transformation on our numeric data. We can see the effect in the following:

summary(ds[numi[1:5]])

    min_temp        max_temp      rainfall       evaporation   
 Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.20  
 1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 2.20  
 Median : 7.45   Median :19.6   Median : 0.00   Median : 4.20  

summary(scale(ds[numi[1:5]]))

    min_temp          max_temp         rainfall       evaporation    
 Min.   :-2.0853   Min.   :-1.936   Min.   :-0.338   Min.   :-1.619  
 1st Qu.:-0.8241   1st Qu.:-0.826   1st Qu.:-0.338   1st Qu.:-0.870  
 Median : 0.0306   Median :-0.135   Median :-0.338   Median :-0.121  

The scale() function also provides some extra information, recording the actual original means and standard deviations:

dsc <- scale(ds[numi[1:5]])
attr(dsc, "scaled:center")

   min_temp    max_temp    rainfall evaporation    sunshine 
      7.266      20.550       1.428       4.522       7.915 

attr(dsc, "scaled:scale")

   min_temp    max_temp    rainfall evaporation    sunshine 
      6.026       6.691       4.226       2.669       3.468 

Compare that information with the output from mean() and sd():

sapply(ds[numi[1:5]], mean)

   min_temp    max_temp    rainfall evaporation    sunshine 
      7.266      20.550       1.428       4.522       7.915 

sapply(ds[numi[1:5]], sd)

   min_temp    max_temp    rainfall evaporation    sunshine 
      6.026       6.691       4.226       2.669       3.468 
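As a minimal sketch on toy data (not the weather dataset), we can confirm that scale() is exactly this subtract-the-mean, divide-by-the-standard-deviation transformation:

```r
# scale() computes z-scores: subtract the column mean, divide by the
# column standard deviation. Check this on a small vector of toy data.
x  <- c(2, 4, 6, 8)
z1 <- as.vector(scale(x))
z2 <- (x - mean(x)) / sd(x)
all.equal(z1, z2)   # the two give identical z-scores
mean(z2)            # 0: the scaled data is centred
sd(z2)              # 1: one unit is one standard deviation
```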

9 K-Means Scaled Dataset

set.seed(42)
model <- mkms <- kmeans(scale(ds[numi]), 10)
model$size

 [1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

 [1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0

10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013), we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")
dim(x)

[1] 100   2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually the means do not change location, and the algorithm converges.
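The two alternating steps the animation illustrates can be sketched directly in a few lines of R. This is illustrative only (kmeans() is what we use in practice) and the variable names are our own:

```r
# A bare-bones k-means: alternate "find cluster" (assign each point to
# its nearest centre) and "move centers" (recompute each centre as the
# mean of its cluster) until the centres stop moving.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
k <- 4
centers <- x[sample(nrow(x), k), ]            # random initial centres
for (iter in 1:100)
{
  # Find cluster: Euclidean distance from every point to every centre.
  d       <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)
  # Move centers: the mean of the points now in each cluster.
  newc <- t(sapply(1:k, function(i)
    if (any(cluster == i)) colMeans(x[cluster == i, , drop=FALSE])
    else centers[i, ]))                       # keep an empty cluster's centre
  if (all(abs(newc - centers) < 1e-8)) break  # converged
  centers <- newc
}
table(cluster)
```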

[The animation produces a series of plots of X2 against X1, alternating between the "Find cluster" and "Move centers" steps until the centres no longer move (pages 11 to 26 of the original document).]
11 Visualise the Cluster Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centres over the numeric variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

12 Visualize the Cluster Radial Plot with K=4

[Radial plot of the 4 cluster centres over the numeric variables, one coloured line per cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

13 Visualise the Cluster Cluster Profiles with Radial Plot

[Radial plot of the four cluster profiles over the numeric variables, with gridlines at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a centre with higher pressures, whilst the cluster 2 centre has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.

14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the single cluster 4 profile, with gridlines at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)

15 Visualise the Cluster Grid of Radial Plots

[Grid of four radial plots, one per cluster (Cluster1 to Cluster4), each with gridlines at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))

16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)
model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centres of the original data and the starting measure of the within sum of squares.

17 K-Means Multiple Starts

18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified from different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to those clusters.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters:

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1
boot 2
boot 3
boot 4

model

Cluster stability assessment

Cluster method:  kmeans

Full clustering results are given as the result parameter of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squared distances between observations.
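For instance, the total sum of squares is just the summed squared distance of every observation from the grand mean, irrespective of any clustering. A sketch on a toy one-column matrix (not the weather dataset):

```r
# totss: sum of squared distances from the column means (the grand centre).
x     <- matrix(c(1, 2, 3, 10, 11, 12), ncol=1)
totss <- sum(sweep(x, 2, colMeans(x))^2)
totss                # 125.5
kmeans(x, 2)$totss   # also 125.5: totss does not depend on k
```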

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation in the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building aclustering
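The definition above can be checked directly against kmeans() on some random data (a sketch, with our own variable names):

```r
# Recompute withinss by hand: for each cluster, sum the squared
# distances of its members from the cluster mean.
set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, 3)
wss <- sapply(1:3, function(i)
{
  xi <- x[m$cluster == i, , drop=FALSE]
  sum(sweep(xi, 2, colMeans(xi))^2)
})
all.equal(wss, m$withinss)           # per-cluster values agree
all.equal(sum(wss), m$tot.withinss)  # and so does the total
```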

21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
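The two measures are tied to the total by a simple identity: the total sum of squares decomposes into the total within sum of squares plus the between sum of squares. A quick sketch on random data:

```r
# totss = tot.withinss + betweenss, whatever value of k we choose.
set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, 4)
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```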

[Line plot of the total within sum of squares and the between sum of squares (y axis, 0 to 6000) against the number of clusters (x axis, 0 to 50).]

22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot of the scaled total within sum of squares for k = 1 to 20.]

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
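The ratio just described can be computed directly from the components kmeans() returns; a sketch on random data (not the weather dataset, variable names our own):

```r
# Calinski-Harabasz index: (betweenss/(k-1)) / (tot.withinss/(n-k)).
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
n <- nrow(x)
k <- 3
m  <- kmeans(x, k)
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch   # higher values indicate a better clustering of this dataset
```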

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion (timing to be checked).

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
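What the "asw" criterion measures can be sketched with silhouette() from the cluster package (distributed with R): each observation's silhouette compares its average distance to its own cluster with its distance to the nearest other cluster, and the criterion averages these widths over all observations. A minimal sketch on random data:

```r
library(cluster)

# Average silhouette width for a k-means clustering of random data.
set.seed(42)
x  <- matrix(rnorm(200), ncol=2)
m  <- kmeans(x, 3)
si <- silhouette(m$cluster, dist(x))
mean(si[, "sil_width"])   # between -1 and 1; larger is better
```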

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled average silhouette width for k = 1 to 20, peaking at k = 2.]

25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]   0.0 812.0 878.8 867.1 757.8 643.4 644.8 498.7 518.3 488.1 427.8
[12] 450.4 430.3 445.3 401.2 387.6 392.6 386.7 351.9 323.3

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimal choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion for k = 1 to 20.]

26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot of the first six criteria (ball_hall, banfeld_raftery, c_index, calinski_harabasz, davies_bouldin, det_ratio) against k = 2 to 20.]

27 K-Means Plot All Criteria

[Six panels plotting the remaining criteria against k = 2 to 20, six measures per panel (shortened names: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe).]

28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means:

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
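The assignment performed here can be sketched by hand: each new observation goes to the centre it is nearest to. A minimal illustration on random data (variable names our own):

```r
# Nearest-centre assignment, as predict.kmeans() does for new data.
set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, 2)
new <- matrix(rnorm(10), ncol=2)     # five new observations
nearest <- apply(new, 1, function(p)
  which.min(colSums((t(m$centers) - p)^2)))
nearest                              # cluster membership for each new row
```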

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.

30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster and medoids marked with crosses.]

plot(model)

[clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). These two components explain 56.04% of the point variability.]

[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster average widths: 0.20, 0.17, 0.02, 0.10, 0.15, 0.14, 0.11, 0.23, 0.11, 0.09.]

31 Clara

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)

[Cluster Dendrogram from hclusterpar (*, "ward") with the 10 cluster rectangles overlaid; the Height axis runs from 0 to 1500.]

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram with the 10 clusters distinguished by colour; the leaf labels are observation numbers and the height axis runs from 0 to 1500.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations that exhibit similar patterns of behaviour, assuming the data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
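As a sketch of the suggested approach, the missing-value patterns can first be made explicit as binary indicator vectors. This hypothetical Python illustration (the rows and variables are invented) shows the encoding that would feed a clustering such as mona():

```python
from collections import Counter

# Hypothetical observations; None marks a missing value.
rows = [
    {"min_temp": 8.0, "sunshine": None, "rainfall": 0.0},
    {"min_temp": None, "sunshine": None, "rainfall": 0.2},
    {"min_temp": 14.0, "sunshine": 7.3, "rainfall": None},
    {"min_temp": None, "sunshine": None, "rainfall": 0.0},
]
variables = ["min_temp", "sunshine", "rainfall"]

# Encode each observation as a binary vector: 1 = missing, 0 = present.
patterns = [tuple(int(r[v] is None) for v in variables) for r in rows]

# Observations sharing a pattern form a natural initial grouping.
counts = Counter(patterns)
print(counts[(1, 1, 0)])  # → 2: two rows missing min_temp and sunshine
```

A hierarchical method over these binary vectors (mona() in R, as the exercise suggests) would then merge similar patterns rather than only identical ones.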


36 Self Organising Maps (SOM)

[Figure: self organising map of the weather data, with segments for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a star, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R". The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means: predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids (PAM)
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps (SOM)
  • Further Reading and Acknowledgements
  • References

9 K-Means: Scaled Dataset

set.seed(42)
model <- m.kms <- kmeans(scale(ds[numi]), 10)

model$size

[1] 34 54 15 70 24 32 30 44 43 20

model$centers

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1   1.0786   1.6740 -0.31018     1.43079   1.0397          0.6088
2   0.5325   0.9939 -0.24074     0.56206   0.8068         -0.2149
3   0.8808  -0.2307  3.77323     0.01928  -0.7599          0.4886

model$totss

[1] 5840

model$withinss

[1] 249.4 272.4 211.2 328.0 149.2 287.7 156.8 366.2 262.0 137.0

model$tot.withinss

[1] 2420

model$betweenss

[1] 3420

model$iter

[1] 8

model$ifault

[1] 0


10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013) we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations.

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)
x <- NULL
for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))
x <- matrix(x, ncol=2)
colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages shows the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")
kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then re-calculating the means. Eventually the means do not change location, and the algorithm converges.
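The two alternating steps just described can be sketched in a few lines. This is an illustrative Python toy (made-up data, plain Euclidean distance), not the animation package's own code:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Toy k-means: alternate 'find cluster' and 'move centers' steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Find cluster: map each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Move centers: recompute each center as its cluster's mean.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (-1.0, -1.1), (-0.9, -1.0)]
centers, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # → [2, 2]
```

With well-separated data like this toy example, the loop settles after a couple of iterations, which is exactly the convergence the animation shows.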

[Figures: one plot per page over the following pages, plotting X1 against X2 and alternating between the "Find cluster" and "Move centers" steps of kmeans.ani(), until the algorithm converges.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers over the numeric variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster: Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers over the numeric variables.]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles over all numeric variables, with grid circles at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, with grid circles at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a two-by-two grid of radial plots, one per cluster, titled Cluster1 to Cluster4.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1
boot 2
boot 3
boot 4

model

* Cluster stability assessment *
Cluster method: kmeans

Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
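The robustness idea can be sketched as a Jaccard comparison between an original clustering and one rebuilt on resampled data, which is the flavour of what clusterboot() reports. The memberships below are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of observation indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

original = {1: [0, 1, 2, 3], 2: [4, 5, 6]}
# A clustering recovered from a (hypothetical) bootstrap resample.
resampled = {1: [0, 1, 2], 2: [3, 4, 5, 6]}

# For each original cluster, its best Jaccard match among the new clusters;
# values near 1 over many resamples indicate a stable cluster.
stability = {c: max(jaccard(members, other) for other in resampled.values())
             for c, members in original.items()}
print(stability)  # → {1: 0.75, 2: 0.75}
```

clusterboot() averages such per-cluster similarities over many resamples to score each cluster's stability.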


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total weighted sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
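The accounting behind these measures can be sketched numerically. A small Python illustration on toy one-dimensional clusters (using the sum, as kmeans() reports, rather than the average, of squared distances), showing that totss = tot.withinss + betweenss:

```python
def ss(values, centre):
    """Sum of squared distances of values from a centre."""
    return sum((v - centre) ** 2 for v in values)

clusters = {"a": [1.0, 2.0, 3.0], "b": [10.0, 11.0, 12.0]}
allv = [v for vs in clusters.values() for v in vs]
grand = sum(allv) / len(allv)

totss = ss(allv, grand)                                   # total sum of squares
tot_withinss = sum(ss(vs, sum(vs) / len(vs))              # within, summed
                   for vs in clusters.values())
betweenss = sum(len(vs) * (sum(vs) / len(vs) - grand) ** 2
                for vs in clusters.values())              # between clusters

print(totss, tot_withinss, betweenss)  # → 125.5 4.0 121.5
```

Tight, well-separated clusters give a small within and a large between component of the fixed total.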


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss plotted against the number of clusters (up to 50), on a sum of squares scale from 0 to 6000; the within measure falls and the between measure rises as the number of clusters increases.]


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k). The sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
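The ratio just defined can be computed directly. An illustrative Python sketch on toy one-dimensional clusters (not the fpc or clusterCrit implementation):

```python
def mean(xs):
    return sum(xs) / len(xs)

clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
n = sum(len(c) for c in clusters)
k = len(clusters)
grand = mean([v for c in clusters for v in c])

betweenss = sum(len(c) * (mean(c) - grand) ** 2 for c in clusters)
withinss = sum(sum((v - mean(c)) ** 2 for v in c) for c in clusters)

# Calinski-Harabasz: between variance ratio over within variance ratio.
ch = (betweenss / (k - 1)) / (withinss / (n - k))
print(ch)  # → 121.5
```

A large value, as here, reflects compact clusters that are far apart; comparing the value across candidate values of k picks a good clustering.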

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: Calinski-Harabasz criterion (scaled) for k = 1 to 20.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to considerably longer using the average silhouette width criterion.

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
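For reference, the silhouette width of an observation is s(i) = (b - a) / max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster. A toy Python illustration (not the fpc code):

```python
clusters = [[1.0, 2.0], [10.0, 11.0]]

def silhouette(i, ci):
    """Silhouette width of observation i in cluster ci."""
    p = clusters[ci][i]
    a = sum(abs(p - q) for j, q in enumerate(clusters[ci]) if j != i) \
        / (len(clusters[ci]) - 1)
    b = min(sum(abs(p - q) for q in other) / len(other)
            for j, other in enumerate(clusters) if j != ci)
    return (b - a) / max(a, b)

widths = [silhouette(i, ci) for ci in range(len(clusters))
          for i in range(len(clusters[ci]))]
asw = sum(widths) / len(widths)  # average silhouette width
print(round(asw, 3))  # → 0.889
```

Values near 1 indicate well-separated clusters; kmeansruns() chooses the k that maximises this average.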

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: average silhouette width criterion (scaled) for k = 1 to 20.]


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski_Harabasz criterion. Do note that we obtain a different model here to that above, hence the different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the clusterCrit Calinski_Harabasz criterion (scaled) for k = 1 to 20, maximised at k = 3.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.]


27 K-Means: Plot All Criteria

[Figures: six panels plotting the remaining criteria (scaled) against k = 2 to 20, grouped as: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means: predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
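What such a predict step does can be sketched simply: assign each new observation to its nearest cluster center. An illustrative Python version with made-up centers (not rattle's implementation):

```python
# Hypothetical cluster centers, keyed by cluster label.
centers = {1: (0.0, 0.0), 2: (5.0, 5.0)}

def predict(obs):
    """Label of the center nearest to obs (squared Euclidean distance)."""
    return min(centers,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(obs, centers[c])))

print([predict(o) for o in [(0.5, -0.2), (4.0, 6.0), (2.0, 2.0)]])  # → [1, 2, 1]
```

Note that new observations should be scaled with the training data's scaling before being compared to centers built from scaled data.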


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use the ewkm() (entropy weighted k-means) function.

set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
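The weight matrix above shows each cluster concentrating its weight on a few variables. The flavour of the entropy-style weight update can be sketched as follows; the dispersion values and gamma are invented for illustration, and this is not the wskm source, just the exponential weighting idea:

```python
import math

# Within-cluster dispersion of each variable (invented values): small
# dispersion means the variable is informative for this cluster.
dispersions = {"rainfall": 0.2, "sunshine": 2.0, "min_temp": 4.0}
gamma = 1.0  # hypothetical parameter controlling weight entropy

# Entropy-style update: w_j proportional to exp(-D_j / gamma), normalised.
raw = {v: math.exp(-d / gamma) for v, d in dispersions.items()}
total = sum(raw.values())
weights = {v: w / total for v, w in raw.items()}

print(max(weights, key=weights.get))  # → rainfall
```

A small gamma concentrates weight on the single best variable, while a large gamma spreads weight evenly, which matches the mix of concentrated and uniform rows in the printed weight matrix.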


30 Partitioning Around Medoids (PAM)

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and medoids marked with crosses.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); the first two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average silhouette widths: 49 (0.20), 30 (0.17), 23 (0.02), 27 (0.10), 34 (0.15), 45 (0.14), 44 (0.11), 40 (0.23), 26 (0.11), 48 (0.09).]


31 Clara
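As a sketch (on synthetic stand-in data; the parameter choices are illustrative only), clara() from cluster, the large-dataset counterpart of pam(), clusters a sample with pam() and then assigns every observation to the nearest medoid:

```r
library(cluster)

# clara() applies pam() to samples of the data and then assigns all
# observations to the resulting medoids, so it scales to large datasets.
set.seed(42)
x <- matrix(rnorm(2000), ncol=4)          # 500 synthetic observations
model <- clara(x, k=5, samples=10)
model$medoids                             # one medoid per cluster
table(model$clustering)                   # cluster sizes
```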


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: cluster dendrogram from hclusterpar (Ward linkage), with a height axis and ten rectangles marking the clusters.]

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram with a height axis from 0 to 1500; the leaves are labelled by observation number and coloured by their cluster membership.]

35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
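One possible starting point, sketched on synthetic stand-in data (the injected missingness and all choices here are illustrative assumptions):

```r
library(cluster)

# Represent each cell by a 1/0 missing-value indicator and cluster the
# resulting patterns of missingness with mona().
set.seed(42)
m <- matrix(rnorm(500), ncol=5)
m[sample(length(m), 60)] <- NA            # inject some missing values
mb <- ifelse(is.na(m), 1L, 0L)            # 1 = missing, 0 = present
model <- mona(mb)
model$order[1:10]                         # observations grouped by pattern
```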


36 Self Organising Maps (SOM)

[Figure: SOM plot titled "Weather Data", a 5-by-4 hexagonal grid of codes fans over min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

10 Animate Cluster Building

Using kmeans.ani() from animation (Xie 2013) we can produce an animation that illustrates the kmeans algorithm.

library(animation)

We generate some random data for two variables over 100 observations.

cent <- 1.5 * c(1, 1, -1, -1, 1, -1, 1, -1)

x <- NULL

for (i in 1:8) x <- c(x, rnorm(25, mean=cent[i]))

x <- matrix(x, ncol=2)

colnames(x) <- c("X1", "X2")

dim(x)

[1] 100 2

head(x)

        X1    X2
[1,] 1.394 1.606
[2,] 3.012 1.078
[3,] 1.405 1.378

The series of plots over the following pages show the convergence of the kmeans algorithm to identify 4 clusters.

par(mar=c(3, 3, 1, 1.5), mgp=c(1.5, 0.5, 0), bg="white")

kmeans.ani(x, centers=4, pch=1:4, col=1:4)

The first plot on the next page shows a random allocation of points to one of the four clusters, together with 4 random means. The points are then mapped to their closest means, to define the four clusters we see in the second plot. The means are then recalculated for each of the clusters, as seen in the third plot. The following plots then iterate between showing the means nearest each of the points, then recalculating the means. Eventually the means do not change location, and the algorithm converges.
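The two alternating steps can be sketched directly in base R; this is a naive illustrative implementation of the iteration the animation shows, not the code behind kmeans():

```r
set.seed(42)
x <- cbind(X1=rnorm(100), X2=rnorm(100))
k <- 4
centers <- x[sample(nrow(x), k), ]                 # random initial means
repeat
{
  # Find cluster: assign each observation to its nearest center.
  d2 <- sapply(1:k, function(j) colSums((t(x) - centers[j, ])^2))
  cluster <- max.col(-d2)                          # index of smallest distance
  # Move centers: recompute each center as the mean of its cluster.
  new.centers <- centers
  for (j in 1:k)
    if (any(cluster == j))
      new.centers[j, ] <- colMeans(x[cluster == j, , drop=FALSE])
  if (identical(new.centers, centers)) break       # converged
  centers <- new.centers
}
table(cluster)
```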

[Sixteen pages of animation frames follow, plotting X1 against X2 and alternating between the "Find cluster" and "Move centers" steps until the algorithm converges.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the ten cluster centres over the sixteen variables min_temp through temp_3pm, with one coloured line per cluster (1 to 10).]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster: Radial Plot with K=4

[Figure: radial plot of the four cluster centres over the sixteen variables, with one coloured line per cluster (1 to 4).]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles over the sixteen variables, with a grid running from -2 to 2 and one coloured polygon per cluster.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, with a grid running from -2 to 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a 2-by-2 grid of radial plots, one per cluster (Cluster 1 to Cluster 4), each with a grid running from -2 to 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster 1"), p2+ggtitle("Cluster 2"),
             p3+ggtitle("Cluster 3"), p4+ggtitle("Cluster 4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
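A common approach, sketched here on synthetic stand-in data (the parameter choices are illustrative only), is the nstart= argument of kmeans(), which runs several random starts and keeps the solution with the smallest total within sum of squares:

```r
set.seed(42)
x <- matrix(rnorm(400), ncol=4)            # synthetic stand-in for ds[numi]
m1  <- kmeans(x, centers=5)                # single random start
m25 <- kmeans(x, centers=5, nstart=25)     # best of 25 random starts
c(single=m1$tot.withinss, multi=m25$tot.withinss)
```

The multi-start total within sum of squares is typically no larger than the single-start one, since the best of the 25 candidate clusterings is retained.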


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment:

Cluster method:  kmeans

Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of the observations within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
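These definitions can be checked directly from first principles. A sketch on synthetic stand-in data, confirming that the reported within sum of squares equals the summed squared distances to the cluster means, and that the total sum of squares splits into within plus between:

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
m <- kmeans(x, centers=3, nstart=10)
# Within sum of squares for each cluster, computed by centering each
# cluster at its own column means and summing the squared deviations.
wss <- sapply(1:3, function(j)
  sum(scale(x[m$cluster == j, , drop=FALSE], center=TRUE, scale=FALSE)^2))
all.equal(as.numeric(wss), m$withinss)            # TRUE
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```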


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: Sum of Squares against Number of Clusters (0 to 50), showing tot.withinss falling and betweenss rising as the number of clusters grows.]

22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
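The criterion is straightforward to compute directly from a kmeans() fit. A sketch on synthetic stand-in data, illustrating the formula itself rather than kmeansruns():

```r
set.seed(42)
x <- matrix(rnorm(300), ncol=3)
n <- nrow(x)
k <- 4
m <- kmeans(x, centers=k, nstart=10)
# Calinski-Harabasz: between-cluster variance over within-cluster variance.
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```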

library(fpc)

nk lt- 120

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18

[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion for k = 1 to 20.]

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion (check timing).

library(fpc)

nk lt- 120

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled average silhouette width criterion for k = 1 to 20.]

25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

[1] 000 8120 8788 8671 7578 6434 6448 4987 5183 4881 4278

[12] 4504 4303 4453 4012 3876 3926 3867 3519 3233

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.]

26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) for k = 2 to 20.]

27 K-Means Plot All Criteria

[Figure: six panels of scaled criteria for k = 2 to 20, covering dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]

28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
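Without rattle, the same assignment can be sketched in base R; assign.nearest() below is a hypothetical helper, not part of any package, that picks for each observation the center with the smallest Euclidean distance:

```r
# Hypothetical helper: index of the nearest center for each row of newdata.
assign.nearest <- function(centers, newdata)
{
  apply(newdata, 1, function(p)
    which.min(colSums((t(centers) - p)^2)))
}

set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, centers=3, nstart=5)
head(assign.nearest(m$centers, x))
```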


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)

library(wskm)

mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids (PAM)

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
 [1,] 11      9.1     25.2      0.0         4.2     11.9              30
 [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, points coloured by cluster, with the medoids marked by crosses.]

plot(model)


[clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")): Component 1 against Component 2. These two components explain 56.04% of the point variability.]


Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")
n = 366; 10 clusters Cj; average silhouette width: 0.14

  j  nj  ave(i in Cj) si
  1  49  0.20
  2  30  0.17
  3  23  0.02
  4  27  0.10
  5  34  0.15
  6  45  0.14
  7  44  0.11
  8  40  0.23
  9  26  0.11
 10  48  0.09


31 Clara


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster dendrogram from hclusterpar(*, "ward"), height on the vertical axis, with 10 cluster rectangles overlaid.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram: leaves labelled by observation number, coloured by the 10 clusters.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.


36 Self Organising Maps (SOM)

[SOM plot, titled "Weather Data": codebook vectors for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm on a 5x4 hexagonal grid.]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all of the criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids (PAM)
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps (SOM)
  • Further Reading and Acknowledgements
  • References
[Section 10, Animate Cluster Building: successive frames plot X1 against X2, alternating the "Find cluster" and "Move centers" steps of the k-means animation.]
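The frames above are produced with the animation package (Xie 2013). A minimal sketch that regenerates a similar sequence, with random stand-in data for the X1/X2 points:

```r
library(animation)

set.seed(42)
x <- cbind(X1=rnorm(100), X2=rnorm(100))

# Each pair of frames shows one k-means iteration: assign points to the
# nearest center ("Find cluster?"), then recompute the centers
# ("Move centers!").
kmeans.ani(x, centers=3)
```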

11 Visualise the Cluster Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centers across the 16 variables, min_temp through temp_3pm.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

[Radial plot of the 4 cluster centers across the 16 variables.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Radial plot of the 4 cluster profiles, grid lines at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the cluster 4 profile alone, grid lines at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[A 2x2 grid of radial plots, one per cluster, titled Cluster1 through Cluster4.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment
Cluster method: kmeans
Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result : List of 6
  ..$ result : List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 34 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically the sum of the squared distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total weighted sum of squares begins to flatten.

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
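As a concrete check of this definition, the within sum of squares can be recomputed by hand from the cluster assignments and centers. A minimal sketch, using the built-in iris data in place of ds[numi]:

```r
set.seed(42)
x <- scale(as.matrix(iris[1:4]))
m <- kmeans(x, 3)

# Within sum of squares per cluster: the squared distances of each
# observation from its cluster mean, summed within the cluster.
wss <- sapply(seq_len(nrow(m$centers)), function(k)
  sum(sweep(x[m$cluster == k, , drop=FALSE], 2, m$centers[k, ])^2))

all.equal(as.numeric(wss), as.numeric(m$withinss)) # TRUE
all.equal(sum(wss), m$tot.withinss)                # TRUE
```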


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Plot: totwithinss and betweenss against the number of clusters (0 to 50), with the sum of squares on the vertical axis (0 to 6000).]
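The two measures are complementary: the total sum of squares splits exactly into the total within sum of squares plus the between sum of squares, so minimising one maximises the other. A quick check of the identity, with iris standing in for the weather data:

```r
set.seed(42)
x <- scale(as.matrix(iris[1:4]))
m <- kmeans(x, 3)

# totss = tot.withinss + betweenss for any k-means fit.
all.equal(m$totss, m$tot.withinss + m$betweenss) # TRUE
```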


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot: the scaled total within sum of squares against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
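The criterion is straightforward to compute from the components that kmeans() already returns. A sketch, with iris standing in for ds[numi] and nstart=10 (an assumption introduced here to stabilise the fits):

```r
set.seed(42)
x <- scale(as.matrix(iris[1:4]))
n <- nrow(x)

# Calinski-Harabasz: (betweenss/(k-1)) / (tot.withinss/(n-k)).
ch <- function(k)
{
  m <- kmeans(x, k, nstart=10)
  (m$betweenss/(k - 1)) / (m$tot.withinss/(n - k))
}

round(sapply(2:6, ch), 1)
```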

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters with 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion [check timing].

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled average silhouette width against k = 1 to 20, peaking at k = 2.]
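For a single value of k, the average silhouette width can also be obtained directly from silhouette() in the cluster package. A sketch on iris, standing in for the weather data:

```r
library(cluster)

set.seed(42)
x <- scale(as.matrix(iris[1:4]))
m <- kmeans(x, 2)

# silhouette() returns one row per observation; the "sil_width" column
# holds each observation's silhouette value.
s <- silhouette(m$cluster, dist(x))
mean(s[, "sil_width"])
```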


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled Calinski-Harabasz criterion from clusterCrit against k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot: the first six criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]


27 K-Means Plot All Criteria

[Six panels plotting the remaining criteria against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range and then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation, and sunshine, coloured by cluster, with the medoids marked.]

plot(model)


[Figure: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), Component 1 against Component 2. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters. Average silhouette width: 0.14. Cluster sizes and average silhouette widths: 1: 49 (0.20); 2: 30 (0.17); 3: 23 (0.02); 4: 27 (0.10); 5: 34 (0.15); 6: 45 (0.14); 7: 44 (0.11); 8: 40 (0.23); 9: 26 (0.11); 10: 48 (0.09).]


31 Clara
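The source leaves this section empty. As a sketch (assumed, not from the original), clara() from the cluster package scales PAM to larger datasets by clustering sampled subsets and keeping the best fit; a synthetic matrix stands in for ds[numi]:

```r
# Assumed sketch: clara() approximates PAM on larger datasets by
# clustering sampled subsets and retaining the best set of medoids.
library(cluster)

set.seed(42)
x <- matrix(rnorm(1000 * 5), ncol=5)   # synthetic stand-in for ds[numi]
model <- clara(x, k=10, samples=50)

model$medoids            # one representative observation per cluster
table(model$clustering)  # cluster sizes
```

Like pam(), the result can be passed to plot() for a clusplot and silhouette display.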


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: "Cluster Dendrogram" from hclusterpar (Ward linkage), height on the vertical axis, with rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: dendrogram with the 10 clusters coloured; leaf labels are the observation row numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0 indicating present/missing and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
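As a starting point for this exercise (a sketch, not from the source), we can build the binary missing-value indicator matrix and hand it to mona() from the cluster package; synthetic data with injected NAs stands in for a real dataset:

```r
# Sketch for the exercise: cluster observations by their pattern of
# missing values using mona() (monothetic analysis of binary variables).
library(cluster)

set.seed(42)
x <- matrix(rnorm(200 * 4), ncol=4)
x[sample(length(x), 150)] <- NA             # inject some missing values
miss <- data.frame(ifelse(is.na(x), 1, 0))  # 1 = missing, 0 = present
model <- mona(miss)
model$order[1:10]                           # ordering of the observations
```

A levelplot of the 1/0 indicator matrix, with rows in model$order, would then display the grouped missingness patterns.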


36 Self Organising Maps SOM

[Figure: "Weather Data" self-organising map, showing codebook segments for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, and cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
[Figures: successive frames of the k-means animation ("Animate Cluster Building"), alternating "Find cluster" and "Move centers" steps plotted on variables X1 and X2.]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers over min_temp through temp_3pm, one colour per cluster.]

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers over min_temp through temp_3pm, one colour per cluster.]

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles over all 16 variables, with gridlines at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, with gridlines at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Figure: grid of four radial plots, one per cluster (Cluster1 to Cluster4), each over all 16 variables with gridlines at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 9.98e-17 1.274e-16 -2.545e-16 -1.629e-16 -5.836e-16 1.99e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1.323e-16 -2.87e-16 -4.162e-16 -1.11e-16 -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
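The source leaves this section empty. A small sketch (assumed, not from the original) of the idea: kmeans() takes an nstart argument to run the algorithm from multiple random starting points, keeping the solution with the smallest total within sum of squares; a synthetic matrix stands in for scale(ds[numi]):

```r
# Assumed sketch: compare a single random start with 20 random starts.
set.seed(42)
x <- scale(matrix(rnorm(300 * 3), ncol=3))  # stand-in for scale(ds[numi])

m1  <- kmeans(x, centers=10, nstart=1)
m20 <- kmeans(x, centers=10, nstart=20)

# With more starts the retained solution typically has a total within
# sum of squares no worse than a single start.
c(m1$tot.withinss, m20$tot.withinss)
```

Repeating the single-start call several times without the seed shows the variation that nstart is designed to smooth over.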


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result

of the clusterboot object which also provides further statistics

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
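To make the definition concrete, a small sketch (not from the source) recomputes the total within sum of squares by hand from a kmeans() fit, on synthetic data:

```r
# Sketch: the total within sum of squares is the sum, over clusters, of
# the squared distances of each observation from its cluster centre.
set.seed(42)
x <- scale(matrix(rnorm(200 * 4), ncol=4))
m <- kmeans(x, centers=3, nstart=10)

wss <- sum(sapply(1:3, function(k)
  sum(sweep(x[m$cluster == k, , drop=FALSE], 2, m$centers[k, ])^2)))

all.equal(wss, m$tot.withinss)  # TRUE
```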


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: totwithinss and betweenss (Sum of Squares) plotted against the Number of Clusters, k = 1 to 50.]
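The figure can be reproduced along these lines (a sketch, with a synthetic stand-in for scale(ds[numi])):

```r
# Sketch: compute totwithinss and betweenss for a range of k and plot
# both measures against the number of clusters.
library(ggplot2)
library(reshape2)

set.seed(42)
x <- scale(matrix(rnorm(300 * 4), ncol=4))
ks <- 1:50
ss <- data.frame(k=ks,
                 totwithinss=sapply(ks, function(k) kmeans(x, k)$tot.withinss),
                 betweenss=sapply(ks, function(k) kmeans(x, k)$betweenss))
ssm <- melt(ss, id.vars="k", variable.name="Measure")

ggplot(ssm, aes(x=k, y=value, colour=Measure)) +
  geom_line() +
  labs(x="Number of Clusters", y="Sum of Squares")
```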


22 K-Means Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
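The ratio just described can be written directly as a small helper (a sketch, not part of the original code; the function name is our own):

```r
# Sketch: Calinski-Harabasz criterion from a kmeans fit, where n is the
# number of observations and k the number of clusters.
ch_criterion <- function(m, n)
{
  k <- length(m$size)
  (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
}

set.seed(42)
x <- scale(matrix(rnorm(200 * 4), ncol=4))
m <- kmeans(x, centers=3, nstart=10)
ch_criterion(m, nrow(x))
```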

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18

[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: scaled Calinski-Harabasz criterion plotted for k = 1 to 20.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters, 10 runs, took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria [check timing].

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: scaled average silhouette width criterion plotted for k = 1 to 20.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}

crit[is.nan(crit)] <- 0

crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78

[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit plotted for k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

[1] ball_hall banfeld_raftery c_index

[4] calinski_harabasz davies_bouldin det_ratio

[7] dunn gamma g_plus

[10] gdi11 gdi12 gdi13

crit <- data.frame()

for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}

names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted for k = 2 to 20.]


27 K-Means Plot All Criteria

[Figures: the remaining criteria, scaled, plotted for k = 2 to 20 in six panels: (dunn, gamma, gplus, gdi11, gdi12, gdi13); (gdi21, gdi22, gdi23, gdi31, gdi32, gdi33); (gdi41, gdi42, gdi43, gdi51, gdi52, gdi53); (ksqde, logde, logss, mccla, pbm, point); (raytu, ratko, scott, sdsca, sddis, sdbw); (silho, tau, trace, trace1, wemme, xiebe).]


31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 50 of 56


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: Cluster Dendrogram produced by hclusterpar (Ward linkage), with height on the vertical axis and rectangles marking the 10 clusters.]

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram with the 10 clusters distinguished by colour; the leaf labels are the observation numbers.]

35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
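One possible starting point for this exercise, as a sketch only (it assumes ds contains missing values; mona() is in the cluster package):

```r
library(cluster)

# Recode each variable as a binary missingness indicator: 1=present, 0=missing.
dsb <- as.data.frame(lapply(ds, function(x) as.integer(!is.na(x))))

# mona() requires every variable to take both values, so drop constant columns.
dsb <- dsb[, sapply(dsb, function(x) length(unique(x)) == 2)]

model <- mona(dsb)
plot(model)
```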


36 Self Organising Maps SOM

[Figure: self-organising map of the weather data: a 5x4 hexagonal grid of nodes, each displaying a segment plot of the 14 variables (min_temp through cloud_3pm); titled "Weather Data".]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website which indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

[Figures: a sequence of frames from the k-means animation over variables X1 and X2, alternating between "Find cluster" (each observation is assigned to its nearest center) and "Move centers" (each center moves to the mean of its cluster).]
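The two alternating steps animated above, find cluster (assign each observation to its nearest center) and move centers (recompute each center as the mean of its observations), can be sketched in a few lines of base R. This is a simplified illustration only, not the actual kmeans() implementation (for one thing, it ignores empty clusters and convergence testing):

```r
simple.kmeans <- function(x, k, iterations=10)
{
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop=FALSE]  # random starting centers
  for (i in seq_len(iterations))
  {
    # Find cluster: squared distance from every observation to every center.
    d <- apply(centers, 1, function(center) rowSums(sweep(x, 2, center)^2))
    cluster <- apply(d, 1, which.min)
    # Move centers: each center becomes the mean of its observations.
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster=cluster, centers=centers)
}
```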


11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers across the 16 weather variables (min_temp through temp_3pm).]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers across the 16 weather variables.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster center profiles across the 16 weather variables, on a -2 to 2 scale.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, on a -2 to 2 scale.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Figure: 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each on a -2 to 2 scale.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points would be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster it is calculated as the sum of the squared distances of each observation in the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
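We can also reproduce these numbers directly from the definition: for each cluster, sum the squared distances of its (scaled) observations from the cluster center. A quick sanity check against the kmeans() output (a sketch; model is the k=10 clustering of scale(ds[numi]) built above, and dss is introduced here for the scaled data):

```r
dss <- scale(ds[numi])
wss <- sapply(seq_len(nrow(model$centers)), function(i)
              sum(sweep(dss[model$cluster == i, , drop=FALSE], 2,
                        model$centers[i, ])^2))
# wss should agree with model$withinss, and its sum with model$tot.withinss.
max(abs(wss - model$withinss))
```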


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: total within sum of squares (decreasing) and between sum of squares (increasing) plotted against the number of clusters, from 1 to 50.]
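The two measures are tied together by a simple decomposition: the total sum of squares splits exactly into the within and between components, totss = tot.withinss + betweenss. We can confirm this for any kmeans object (here the model built above):

```r
# The decomposition holds up to floating point error.
model$totss - (model$tot.withinss + model$betweenss)
```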


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the (scaled) total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
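Expressed as a formula, CH(k) = (BSS/(k-1)) / (WSS/(n-k)), where BSS and WSS are the between and total within sum of squares. A sketch computing it directly from any kmeans object (variance.ratio() is an illustrative helper here, not a library function):

```r
variance.ratio <- function(model, n)
{
  k <- nrow(model$centers)
  (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
}

variance.ratio(kmeans(scale(ds[numi]), 2), nrow(ds))
```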

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: Calinski-Harabasz criterion (scaled) for k = 1 to 20, with the maximum at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, and considerably longer using the average silhouette width criterion.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: average silhouette width criterion (scaled) for k = 1 to 20, with the maximum at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

[1] 000 8120 8788 8671 7578 6434 6448 4987 5183 4881 4278

[12] 4504 4303 4453 4012 3876 3926 3867 3519 3233

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: clusterCrit Calinski-Harabasz values (scaled) for k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figures: line plots of the remaining criteria (scaled) against k = 2 to 20, in groups of six: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21 through gdi33; gdi41 through gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
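Under the hood, assigning a new observation to a cluster is just a nearest-center calculation. A minimal sketch of the idea in base R (nearest.cluster() is a hypothetical helper, not part of rattle; model and ds are as above):

```r
# Squared Euclidean distance from each new observation to each center,
# then pick the index of the closest center.
nearest.cluster <- function(model, newdata)
{
  d <- apply(model$centers, 1, function(center)
             rowSums(sweep(as.matrix(newdata), 2, center)^2))
  apply(d, 1, which.min)
}

nearest.cluster(model, ds[test, numi])
```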


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged Terminate

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1,] 11 9.1 25.2 0.0 4.2 11.9 30

[2,] 38 16.5 28.2 4.0 4.2 8.8 39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

[Figure: pairs plot of min_temp, max_temp, rainfall, evaporation and sunshine, coloured by cluster, with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), Component 1 against Component 2. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters Cj, average silhouette width: 0.14.

 j  nj | ave(i in Cj) si
 1  49 | 0.20
 2  30 | 0.17
 3  23 | 0.02
 4  27 | 0.10
 5  34 | 0.15
 6  45 | 0.14
 7  44 | 0.11
 8  40 | 0.23
 9  26 | 0.11
10  48 | 0.09]


31 Clara
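This section is left empty in the source. clara() from the cluster package is the usual companion to pam(): it applies the medoid search to subsamples of the data, making medoid-based clustering feasible for larger datasets. A minimal sketch, using iris in place of the weather data:

```r
# clara(): PAM applied to subsamples of the data, for larger datasets.
library(cluster)

set.seed(42)
model <- clara(iris[, 1:4], k=3, samples=5)

model$medoids           # one representative observation per cluster
table(model$clustering)
```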


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
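For reference, the single-core equivalent in base R is hclust() on a dist() matrix; note that stats::hclust() names Ward's method "ward.D" where amap calls it "ward". A sketch on iris as a stand-in for the weather data:

```r
# Base R equivalent: Euclidean distances with Ward linkage.
d <- dist(iris[, 1:4], method="euclidean")
model <- hclust(d, method="ward.D")
clusters <- cutree(model, k=3)
table(clusters)
```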


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: dendrogram of the hierarchical clustering (hclusterpar, Ward linkage), with the height axis and rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram with 10 clusters; the leaf labels are the observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming the data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
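The conversion step of the exercise can be sketched directly: build a 1/0 present/missing indicator for each variable, which is then suitable input for cluster::mona(). The small data frame here is made up purely for illustration:

```r
# Convert a data frame with missing values into a binary
# present/missing indicator matrix, as input for cluster::mona().
df <- data.frame(a=c(1, NA, 3, NA),
                 b=c(NA, 2, 3, 4),
                 c=c(1, 2, NA, 4))
present <- 1 * !is.na(df)   # 1 = present, 0 = missing
present
```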


36 Self Organising Maps SOM

[Figure: self-organising map codebook plot titled "Weather Data", over the first 14 numeric variables: min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the marked links on the website, which indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
[Figures, pages 14-26: successive frames of the k-means animation over two variables X1 and X2, alternating between a "Find cluster" step (assign each point to its nearest centre) and a "Move centers" step (recompute each centre as the mean of its points).]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial (polar) plot of the 10 cluster centres across the 16 numeric weather variables, min_temp through temp_3pm, coloured by cluster.]

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the cluster centres for k = 4, across the 16 numeric weather variables.]

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles produced by CreateRadialPlot(), grid from -2 to 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, grid from -2 to 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Figure: 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each with a grid from -2 to 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 9.98e-17 1.274e-16 -2.545e-16 -1.629e-16 -5.836e-16 1.99e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1.323e-16 -2.87e-16 -4.162e-16 -1.11e-16 -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.
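The base case can also be checked numerically: with k = 1 the single "cluster" is the whole dataset, so the total sum of squares equals the total within sum of squares, and the between sum of squares is zero up to floating point error. A quick check on iris as a stand-in for the weather data:

```r
# k = 1: totss == tot.withinss and betweenss is numerically zero.
model <- kmeans(scale(iris[, 1:4]), 1)
c(totss=model$totss, withinss=model$tot.withinss, betweenss=model$betweenss)
```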


17 K-Means Multiple Starts
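This section is empty in the source. The standard mechanism is kmeans()'s nstart argument, which runs the algorithm from several random starting points and keeps the solution with the smallest total within sum of squares. A sketch on iris as a stand-in for the weather data:

```r
# Multiple random starts: kmeans() keeps the best of nstart runs.
x <- scale(iris[, 1:4])
set.seed(42)
single <- kmeans(x, 10, nstart=1)$tot.withinss
multi  <- kmeans(x, 10, nstart=25)$tot.withinss
c(single=single, multi=multi)   # multi is typically no larger
```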


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to them.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result

of the clusterboot object which also provides further statistics

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
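A crude version of the same idea, without fpc: rerun k-means from different random starts and cross-tabulate the labels; a near-diagonal table (up to a permutation of the labels) suggests stable clusters. A sketch on iris:

```r
# Compare the labels from two k-means runs with different starts.
x <- scale(iris[, 1:4])
set.seed(1);  a <- kmeans(x, 3)$cluster
set.seed(99); b <- kmeans(x, 3)$cluster
table(a, b)
```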


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squared distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster it is calculated as the sum of the squared distances of each observation in the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
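The definition can be verified by hand: compute, per cluster, the sum of squared distances to the cluster mean, and compare the total with kmeans()'s tot.withinss. On iris as a stand-in for the weather data:

```r
# Within sum of squares computed by hand, compared to kmeans' value.
set.seed(42)
x <- scale(iris[, 1:4])
model <- kmeans(x, 3, nstart=10)

wss <- sapply(1:3, function(k) {
  obs <- x[model$cluster == k, , drop=FALSE]
  sum(sweep(obs, 2, colMeans(obs))^2)   # squared distances to cluster mean
})
all.equal(sum(wss), model$tot.withinss)
```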


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures.
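The two measures are parts of a fixed total: for any k-means result, totss = tot.withinss + betweenss, so decreasing one necessarily increases the other. A quick check on iris as a stand-in for the weather data:

```r
# totss decomposes into the within plus between sums of squares.
set.seed(42)
model <- kmeans(scale(iris[, 1:4]), 10)
all.equal(model$totss, model$tot.withinss + model$betweenss)
```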

[Figure: tot.withinss and betweenss plotted against the number of clusters (up to 50), sum of squares from 0 to 6000.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: scree plot of the scaled total within sum of squares for k = 1-20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
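The definition translates directly into code from the components kmeans() already returns. The sketch below uses iris rather than the weather data, so its value will not match the kmeansruns() output that follows:

```r
# Calinski-Harabasz index from kmeans' own sums of squares:
# CH = (B / (k-1)) / (W / (n-k)).
set.seed(42)
x <- scale(iris[, 1:4])
n <- nrow(x); k <- 3
model <- kmeans(x, k, nstart=10)

ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch
```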

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.00 117.55 100.97 88.81 82.16 74.75 69.75 65.18 61.38 58.18

[11] 55.71 53.44 51.63 50.07 48.34 46.90 45.32 44.07 42.57 41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: plot of the scaled Calinski-Harabasz criterion for k = 1-20.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes using the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
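The criterion itself can be computed directly with silhouette() from the cluster package, which is what underlies this measure. A sketch on iris as a stand-in for the weather data:

```r
# Average silhouette width for a k-means solution.
library(cluster)

set.seed(42)
x <- scale(iris[, 1:4])
model <- kmeans(x, 2, nstart=10)

sil <- silhouette(model$cluster, dist(x))
asw <- mean(sil[, "sil_width"])   # the value kmeansruns() maximises
asw
```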

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: plot of the scaled average silhouette width criterion for k = 1-20.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}

crit[is.nan(crit)] <- 0

crit

[1] 0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78

[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case, k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Figure: plot of the scaled Calinski-Harabasz criterion from clusterCrit for k = 1-20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here, and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

[1] "ball_hall"         "banfeld_raftery"   "c_index"

[4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

[7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()

for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}

names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p

[Figure: plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2-20.]

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 42 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

27 K-Means Plot All Criteria

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

dunn

gamma

gplus

gdi11

gdi12

gdi13

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi21

gdi22

gdi23

gdi31

gdi32

gdi33

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi41

gdi42

gdi43

gdi51

gdi52

gdi53

minus2

0

2

4

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

ksqde

logde

logss

mccla

pbm

point

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

raytu

ratko

scott

sdsca

sddis

sdbw

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

silho

tau

trace

trace1

wemme

xiebe

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 43 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

(Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and medoids marked by crosses.)

plot(model)


(Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), Component 1 against Component 2. These two components explain 56.04% of the point variability.)


(Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). Average silhouette width: 0.14. n = 366, 10 clusters Cj.)

 j : nj | ave(i in Cj) si

 1 : 49 | 0.20
 2 : 30 | 0.17
 3 : 23 | 0.02
 4 : 27 | 0.10
 5 : 34 | 0.15
 6 : 45 | 0.14
 7 : 44 | 0.11
 8 : 40 | 0.23
 9 : 26 | 0.11
10 : 48 | 0.09


31 Clara
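The body of this section is empty in this extraction. clara() from the cluster package is the usual large-dataset variant of PAM: it applies the medoid search to sampled subsets and keeps the best set of medoids. A minimal sketch, using synthetic data as a stand-in for ds[numi]:

```r
# clara() clusters around medoids on samples of the data, making the
# medoid approach feasible for large datasets. ds.syn is a synthetic
# stand-in for ds[numi].
library(cluster)

set.seed(42)
ds.syn <- data.frame(x=c(rnorm(100, 0), rnorm(100, 5)),
                     y=c(rnorm(100, 0), rnorm(100, 5)))

model <- clara(ds.syn, k=2, metric="euclidean", samples=10)

model$medoids            # one representative observation per cluster
table(model$clustering)  # cluster sizes
```

As with pam(), the medoids are actual observations, so they can be reported directly as cluster representatives.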


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

(Figure: Cluster Dendrogram from hclusterpar (ward), Height on the vertical axis, with rect.hclust() rectangles marking the 10 clusters.)


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

(Figure: the same dendrogram with each of the 10 clusters coloured; leaf labels are the observation row numbers.)


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
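A hedged sketch of what the exercise asks for, on synthetic data: convert each variable to a 1/0 missing/present indicator and cluster the patterns with mona() from the cluster package. The levelplot step is left to the reader.

```r
# Cluster observations by their patterns of missingness. mona() does
# divisive hierarchical clustering of purely binary data. ds.syn is a
# synthetic stand-in for a dataset with missing values.
library(cluster)

set.seed(42)
ds.syn <- data.frame(a=rnorm(50), b=rnorm(50), c=rnorm(50))
ds.syn$a[sample(50, 10)] <- NA   # inject differing amounts of
ds.syn$b[sample(50, 20)] <- NA   # missingness per variable
ds.syn$c[sample(50, 5)]  <- NA

# 1 = missing, 0 = present, for every variable.
miss <- as.data.frame(lapply(ds.syn, function(x) as.integer(is.na(x))))

model <- mona(miss)
head(model$order)   # observation ordering induced by the hierarchy
```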


36 Self Organising Maps SOM

(Figure: self-organising map titled "Weather Data", with one codebook fan per variable, from min_temp through cloud_3pm.)

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

(Figures, pages 15 to 26: the cluster-building animation alternates "Move centers" and "Find cluster" steps on a scatterplot of X1 against X2.)

11 Visualise the Cluster Radial Plot Using GGPlot2

(Figure: radial plot of the 10 cluster centers over min_temp through temp_3pm, one colour per cluster.)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

(Figure: radial plot of the 4 cluster centers over min_temp through temp_3pm.)

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

(Figure: CreateRadialPlot() profiles of the 4 clusters, with grid circles at -2, 0 and 2.)

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

(Figure: CreateRadialPlot() profile of cluster 4 alone, grid circles at -2, 0 and 2.)

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

(Figure: a 2 by 2 grid of radial plots, titled Cluster1 through Cluster4, each with grid circles at -2, 0 and 2.)

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data, and the starting measure of the within sum of squares.
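That base-case behaviour can be checked directly: with k = 1 the single cluster is the whole dataset, so the total within sum of squares equals the total sum of squares and the between sum of squares is numerically zero. A self-contained sketch, with a synthetic matrix standing in for scale(ds[numi]):

```r
# With one cluster, tot.withinss == totss and betweenss is zero
# (up to floating point). x is a stand-in for scale(ds[numi]).
set.seed(42)
x <- scale(matrix(rnorm(200), ncol=4))

base <- kmeans(x, 1)

base$totss          # for scaled data this is (nrow(x)-1) * ncol(x)
base$tot.withinss   # identical to totss in the base case
base$betweenss      # zero apart from rounding error
```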


17 K-Means Multiple Starts
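The body of this section is empty in this extraction. One standard approach is the nstart= argument of kmeans(), which runs the algorithm from that many random starts and keeps the solution with the lowest total within sum of squares. A minimal sketch, with synthetic data standing in for scale(ds[numi]):

```r
# nstart= asks kmeans() to try several random initialisations and
# return the best (lowest tot.withinss) result. x is a synthetic
# stand-in for scale(ds[numi]).
set.seed(42)
x <- matrix(rnorm(600), ncol=3)

single <- kmeans(x, centers=10, nstart=1)
multi  <- kmeans(x, centers=10, nstart=25)

single$tot.withinss
multi$tot.withinss   # typically no worse than a single start
```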


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
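These quantities satisfy an exact decomposition, totss = tot.withinss + betweenss, with tot.withinss the sum of the per-cluster withinss values. A self-contained check, with a synthetic matrix standing in for scale(ds[numi]):

```r
# Verify the sum of squares decomposition reported by kmeans().
set.seed(42)
x <- scale(matrix(rnorm(400), ncol=4))   # stand-in for scale(ds[numi])

model <- kmeans(x, 10)

all.equal(model$totss, model$tot.withinss + model$betweenss)  # TRUE
all.equal(model$tot.withinss, sum(model$withinss))            # TRUE
```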


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

(Figure: tot.withinss and betweenss plotted against the number of clusters, from 1 to 50.)


22 K-Means Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.)


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
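The ratio in that definition can be computed directly from the components of a kmeans() fit; a minimal sketch, with a synthetic matrix standing in for scale(ds[numi]):

```r
# Calinski-Harabasz by its definition: between sum of squares over
# k-1, divided by within sum of squares over n-k. x is a synthetic
# stand-in for scale(ds[numi]).
set.seed(42)
x <- scale(matrix(rnorm(900), ncol=3))
n <- nrow(x)
k <- 4

model <- kmeans(x, k, nstart=10)

ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch
```

Higher values indicate tighter, better separated clusters, which is why kmeansruns() below maximises this criterion over krange.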

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled Calinski-Harabasz criterion for k = 1 to 20.)


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters, 10 runs, took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria [check timing].

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled average silhouette width criterion for k = 1 to 20.)


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.)


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

(Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, for k = 2 to 20.)


27 K-Means Plot All Criteria

(Figure: six panels plotting the remaining criteria, scaled, for k = 2 to 20: dunn, gamma, gplus, gdi11 to gdi13; gdi21 to gdi33; gdi41 to gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 43 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and medoids marked by crosses.]

plot(model)


[clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")): clusters plotted against the first two components. These two components explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), silhouette widths si ranging from -0.2 to 1.0. n = 366, 10 clusters Cj:

 j : nj | ave_{i in Cj} si
 1 : 49 | 0.20
 2 : 30 | 0.17
 3 : 23 | 0.02
 4 : 27 | 0.10
 5 : 34 | 0.15
 6 : 45 | 0.14
 7 : 44 | 0.11
 8 : 40 | 0.23
 9 : 26 | 0.11
10 : 48 | 0.09

Average silhouette width: 0.14]


31 Clara
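This section is left blank in the source. As a hedged sketch, clara() from the cluster package applies the PAM approach to samples of the data, so it scales to larger datasets than pam(); iris is used here as a stand-in for ds[numi]:

```r
# clara(): PAM run on repeated samples, keeping the best set of medoids.
library(cluster)
set.seed(42)
model <- clara(scale(iris[1:4]), k=3, samples=10)
model$medoids            # one medoid per cluster
table(model$clustering)  # cluster sizes
```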


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)

[Dendrogram titled "Cluster Dendrogram", x-axis hclusterpar (*, "ward"), y-axis Height, with rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram of the 10 clusters, height axis 0 to 1500; leaf labels are the observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to binary (1/0), indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
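A hedged sketch of one way to begin this exercise, using airquality (which has genuine missing values) as a stand-in dataset: each variable becomes a 1/0 missing indicator, constant columns are dropped (mona() requires binary variables with both levels present), and mona() then clusters the missingness patterns:

```r
# Cluster missingness patterns with mona() from the cluster package.
library(cluster)
miss <- data.frame(lapply(airquality, function(v) as.integer(is.na(v))))
miss <- miss[, sapply(miss, function(v) length(unique(v)) > 1), drop=FALSE]
mn <- mona(miss)
mn$order[1:10]  # observations reordered by the divisive banner
```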


36 Self Organising Maps SOM

[SOM codes plot titled "Weather Data", showing the profile of min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm for each unit of the 5x4 hexagonal grid.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website that mark the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

[Pages 16 to 26: frames from the k-means animation, alternating "Find cluster" and "Move centers" steps, plotted on X1 versus X2.]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centers over the 16 numeric weather variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

[Radial plot of the 4 cluster centers over the 16 numeric weather variables, one coloured line per cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Radial plot of the 4 cluster profiles over the 16 numeric weather variables, axis range -2 to 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the single cluster 4 profile over the 16 numeric weather variables, axis range -2 to 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Grid of four radial plots, one per cluster (Cluster1, Cluster2, Cluster3, Cluster4), each over the 16 numeric weather variables with axis range -2 to 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
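This section is a stub in the source. As a minimal sketch (using iris as a stand-in for the weather data), kmeans() supports multiple random starts directly through its nstart= argument, retaining the run with the lowest total within sum of squares:

```r
# Multiple random starts: nstart=25 runs kmeans() from 25 random sets of
# centers and keeps the best result (lowest tot.withinss).
set.seed(42)
x   <- scale(iris[1:4])                  # stand-in for scale(ds[numi])
m1  <- kmeans(x, centers=10, nstart=1)   # a single random start
m25 <- kmeans(x, centers=10, nstart=25)  # best of 25 random starts
c(single=m1$tot.withinss, multi=m25$tot.withinss)
```

With more starts, the clustering reported is the best of the repeated runs, so it is less likely to be a poor local optimum.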


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment:

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result  : List of 6
  ..$ result : List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squared distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
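As a minimal sketch of what withinss measures (using iris as a stand-in for ds[numi]), we can compute each cluster's within sum of squares by hand and compare it with what kmeans() reports:

```r
# Within sum of squares by hand: for each cluster, sum the squared
# distances of its observations from the cluster centre.
set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x, 3)
wss <- sapply(1:3, function(k)
  sum(sweep(x[m$cluster == k, , drop=FALSE], 2, m$centers[k, ])^2))
all.equal(unname(wss), unname(m$withinss))  # TRUE
```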


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Plot of tot.withinss and betweenss against the number of clusters (1 to 50): the total within sum of squares falls and the between sum of squares rises as the number of clusters increases.]
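The two measures are linked by a simple identity, sketched here with iris standing in for the weather data: the total sum of squares decomposes exactly into the within and between components.

```r
# totss = tot.withinss + betweenss for any k-means clustering.
set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x, 3)
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```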


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot of the scaled criterion value against k = 1, ..., 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
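The criterion itself is easy to compute from the components kmeans() returns; a sketch with iris standing in for the weather data:

```r
# Calinski-Harabasz: (BSS / (k-1)) / (WSS / (n-k)); higher is better.
set.seed(42)
x <- scale(iris[1:4])
n <- nrow(x); k <- 3
m <- kmeans(x, k)
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```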

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion against k = 1, ..., 20.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to minutes using the average silhouette width criterion (check timing).

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled average silhouette width criterion against k = 1, ..., 20.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion against k = 1, ..., 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

[1] ball_hall banfeld_raftery c_index

[4] calinski_harabasz davies_bouldin det_ratio

[7] dunn gamma g_plus

[10] gdi11 gdi12 gdi13

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5)  # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2, ..., 20.]


27 K-Means Plot All Criteria

[Six panels plotting the remaining scaled criteria against k = 2, ..., 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
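What such a prediction does under the hood can be sketched by hand (iris standing in for the weather data): each new observation is assigned to the cluster whose centre is nearest in Euclidean distance.

```r
# Nearest-centre assignment, the essence of predicting with k-means.
set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x[1:100, ], 3)
nearest <- apply(x[101:150, ], 1, function(obs)
  which.min(colSums((t(m$centers) - obs)^2)))
table(nearest)
```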

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): silhouette width si per cluster; n = 366, 10 clusters Cj; average silhouette width: 0.14.]

 j  nj  ave(i in Cj) si
 1  49  0.20
 2  30  0.17
 3  23  0.02
 4  27  0.10
 5  34  0.15
 6  45  0.14
 7  44  0.11
 8  40  0.23
 9  26  0.11
10  48  0.09


31 Clara
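The section body is missing from this extraction. As a placeholder, the sketch below shows the general shape of a clara() call from the cluster package, which scales pam() to larger datasets by clustering samples. The dataset is a stand-in (iris), since ds[numi] is only available within the module.

```r
# Sketch only: clara() from the cluster package is a sampling-based
# alternative to pam() for larger datasets.  iris stands in for ds[numi].
library(cluster)

set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))

# clara() draws several samples, runs pam() on each, and keeps the best
# clustering as judged over the full dataset.
model <- clara(x, k = 3, samples = 10)

model$clustering[1:10]   # cluster membership of the first ten observations
model$medoids            # medoids on the scaled scale
```

As with pam(), the result supports plot() for a clusplot and silhouette display.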


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
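For reference, with nbproc=1 the call corresponds to an ordinary agglomerative clustering, which can be sketched in base R with hclust(). Note that amap's link="ward" and the stats methods "ward.D"/"ward.D2" differ in detail, so heights may not match exactly; iris stands in for the weather data.

```r
# Base-R analogue (sketch): the same kind of model without the parallelism.
x <- scale(as.matrix(iris[, 1:4]))      # stand-in for na.omit(ds[numi])
d <- dist(x, method = "euclidean")      # pairwise Euclidean distances
model <- hclust(d, method = "ward.D2")  # agglomerative, Ward-style linkage

length(model$height)                    # n - 1 merges for n observations
```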


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)
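rect.hclust() only draws the rectangles; to work with the clusters themselves we cut the tree. A base-R sketch, again on a stand-in dataset:

```r
# Extract the k=10 membership that rect.hclust() visualises.
x <- scale(as.matrix(iris[, 1:4]))
hc <- hclust(dist(x), method = "ward.D2")

cl <- cutree(hc, k = 10)   # integer cluster label for each observation
table(cl)                  # cluster sizes
```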

[Dendrogram titled "Cluster Dendrogram": x-axis hclusterpar (*, "ward"), y-axis Height, with the 10 clusters boxed.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram of the 10 clusters; the leaf labels are the observation row numbers and the height axis runs from 0 to 1500.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming the missingness follows a pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
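A minimal sketch of the exercise, assuming a small synthetic data frame (the names df and miss are illustrative, not from the source): convert missingness to 1/0 indicators, then cluster the indicator matrix with mona() from the cluster package. A levelplot of miss (for example, lattice::levelplot(as.matrix(miss))) would show the missingness patterns.

```r
# Sketch: cluster observations by their pattern of missing values.
library(cluster)

set.seed(42)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
df[sample(50, 15), "a"] <- NA   # inject some missingness into each column
df[sample(50, 20), "b"] <- NA
df[sample(50, 10), "c"] <- NA

# 1 = missing, 0 = present, one indicator column per variable.
miss <- as.data.frame(lapply(df, function(x) as.integer(is.na(x))))

model <- mona(miss)             # divisive hierarchical clustering of binary data
head(model$order)               # observation ordering induced by the tree
```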


36 Self Organising Maps SOM

[SOM plot titled "Weather Data": a 5 by 4 hexagonal grid of units, each showing a codebook profile over min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, and cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")
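The essential step inside som() is mapping each observation to its best-matching unit, the codebook vector at the smallest distance. This base-R sketch illustrates the idea only and is not kohonen's implementation; the "codebook" here is just a random subset of rows standing in for trained unit vectors.

```r
# Best-matching unit assignment, sketched in base R.
set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))                # observations
codes <- x[sample(nrow(x), 20), , drop = FALSE]   # 20 stand-in codebook vectors

bmu <- apply(x, 1, function(row) {
  which.min(colSums((t(codes) - row)^2))          # squared Euclidean distance
})
head(table(bmu))                                  # observations per unit
```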


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website which indicate the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

[A sequence of animation frames from the k-means demonstration, plotting X1 against X2 over the range -4 to 4: the algorithm alternates a "Find cluster" step, assigning points to their nearest center, with a "Move centers" step, recomputing each center.]

11 Visualise the Cluster Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centers over the 16 numeric variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


12 Visualize the Cluster Radial Plot with K=4

[Radial plot of the 4 cluster centers over the 16 numeric variables, one coloured line per cluster.]

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))

p <- p + coord_polar()

p <- p + geom_point()

p <- p + geom_path()

p <- p + labs(x=NULL, y=NULL)

p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())

p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Radial plot of the 4 cluster profiles over the 16 numeric variables, with gridlines at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the single cluster 4 profile over the 16 numeric variables, with gridlines at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[A 2 by 2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each showing that cluster's profile over the 16 numeric variables with gridlines at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data, and the starting measure of the within sum of squares.
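The base-case numbers can be checked by hand: after scale(), each column has sum of squared deviations n - 1, so totss = (n - 1) * p, which for the weather data is 365 * 16 = 5840. The same identity on a stand-in dataset:

```r
# totss for scaled data is (n - 1) * number of columns.
x <- scale(as.matrix(iris[, 1:4]))
model <- kmeans(x, centers = 1)

model$totss                          # 596 = 149 * 4 for iris
all.equal(model$totss, (nrow(x) - 1) * ncol(x))
```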


17 K-Means Multiple Starts
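The section body is missing from this extraction. The idea can be sketched with the nstart= argument of kmeans(), which reruns the algorithm from several random sets of centers and keeps the solution with the lowest total within sum of squares (stand-in dataset):

```r
# Multiple random starts: keep the best of nstart runs.
x <- scale(as.matrix(iris[, 1:4]))

set.seed(42)
m1 <- kmeans(x, centers = 10, nstart = 1)    # a single random start
set.seed(42)
m20 <- kmeans(x, centers = 10, nstart = 20)  # best of twenty starts

m20$tot.withinss <= m1$tot.withinss          # the best of many is never worse here
```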


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment:

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
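The definition can be checked directly: recompute each cluster's sum of squared distances to its mean and compare with what kmeans() reports (stand-in dataset):

```r
# Recompute the within sum of squares by hand.
x <- scale(as.matrix(iris[, 1:4]))
set.seed(42)
model <- kmeans(x, centers = 3, nstart = 10)

wss <- sapply(1:3, function(k) {
  xk <- x[model$cluster == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)   # squared distances to the cluster mean
})
all.equal(as.numeric(wss), as.numeric(model$withinss))
```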


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
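The two measures are tied together by an exact decomposition, totss = tot.withinss + betweenss; the numbers above obey it: 5840 = 2394 + 3446. Checking on a stand-in dataset:

```r
# The total sum of squares decomposes exactly into within plus between.
x <- scale(as.matrix(iris[, 1:4]))
set.seed(42)
model <- kmeans(x, centers = 4, nstart = 10)

all.equal(model$totss, model$tot.withinss + model$betweenss)
```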

[Line plot of the two measures, tot.withinss and betweenss, against the Number of Clusters (0 to 50); y-axis Sum of Squares (0 to 6000).]


22 K-Means Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Scree plot of the scaled total within sum of squares (y, -1 to 3) against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data), using k-means with Euclidean distance.
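The ratio just described can be computed straight from a kmeans fit; a sketch of the formula on a stand-in dataset (not kmeansruns() internals):

```r
# Calinski-Harabasz: (betweenss / (k - 1)) / (tot.withinss / (n - k)).
x <- scale(as.matrix(iris[, 1:4]))
k <- 3
set.seed(42)
model <- kmeans(x, centers = k, nstart = 10)

n <- nrow(x)
ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch   # higher is better when comparing values of k on the same data
```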

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18

[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Plot of the scaled Calinski-Harabasz criterion (y, -2 to 2) against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criteria, compared to minutes using the average silhouette width criteria.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Plot of the scaled average silhouette width criterion (y, -3 to 1) against k = 1 to 20, peaking at k = 2.]
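The criterion itself can be computed for a single clustering with silhouette() from the cluster package; the mean silhouette width is what kmeansruns(criterion="asw") maximises (stand-in dataset):

```r
# Average silhouette width of one clustering.
library(cluster)
x <- scale(as.matrix(iris[, 1:4]))
set.seed(42)
model <- kmeans(x, centers = 2, nstart = 10)

sil <- silhouette(model$cluster, dist(x))
mean(sil[, "sil_width"])   # the average silhouette width
```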


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78

[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure))

p <- p + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p <- p + theme(legend.position="none")

p

[Plot of the scaled Calinski-Harabasz criterion (y, -2 to 1) against k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

[1] "ball_hall"         "banfeld_raftery"   "c_index"

[4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

[7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"            "gdi12"             "gdi13"

crit <- data.frame()

for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}

names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p

[Plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]


27 K-Means Plot All Criteria

[Six panels of the remaining scaled criteria plotted against k = 2 to 20: (dunn, gamma, gplus, gdi11, gdi12, gdi13), (gdi21, gdi22, gdi23, gdi31, gdi32, gdi33), (gdi41, gdi42, gdi43, gdi51, gdi52, gdi53), (ksqde, logde, logss, mccla, pbm, point), (raytu, ratko, scott, sdsca, sddis, sdbw), and (silho, tau, trace, trace1, wemme, xiebe).]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
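What predict() is doing can be sketched in base R: each new observation goes to the nearest cluster center. This is illustrative only, not rattle's predict.kmeans() code, and uses a stand-in dataset:

```r
# Nearest-center assignment for new observations.
x <- scale(as.matrix(iris[, 1:4]))
set.seed(42)
train <- sample(nrow(x), 0.7 * nrow(x))
test  <- setdiff(seq_len(nrow(x)), train)
model <- kmeans(x[train, ], centers = 2)

nearest <- apply(x[test, ], 1, function(row) {
  which.min(colSums((t(model$centers) - row)^2))   # squared distances to centers
})
head(nearest)   # predicted cluster for the first test observations
```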


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot using cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add rectangles to show the clusters:

rect.hclust(model, k=10)
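To work with the clusters themselves rather than just the picture, base R's cutree() extracts the cluster membership at a chosen k from the hierarchical model built above (a sketch, assuming the hclusterpar() result is hclust-compatible, as amap documents it to be):

```r
# Cut the dendrogram at 10 clusters and tabulate the cluster sizes.
clusters <- cutree(model, k=10)
table(clusters)
```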

[Cluster dendrogram from hclusterpar (ward linkage), with a height axis and the 10 clusters boxed.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram showing the 10 clusters; the leaf labels are the observation numbers, with the height axis running from 0 to 1500.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0 indicating present/missing and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
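A sketch of one way to begin this exercise (the 1/0 encoding of missingness and the constant-column filter are assumptions about how you might set it up; mona() is from the cluster package):

```r
library(cluster)  # Provides mona() for binary divisive clustering.

# Encode missingness: 1 if a value is missing, 0 if present.
dsb <- data.frame(lapply(ds, function(x) as.integer(is.na(x))))

# mona() needs every variable to take both values, so drop columns
# that are entirely present or entirely missing.
dsb <- dsb[, sapply(dsb, function(x) length(unique(x)) > 1), drop=FALSE]

model <- mona(dsb)
plot(model)  # Banner plot of the divisive hierarchy.
```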


36 Self Organising Maps (SOM)

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")

[SOM codes plot titled "Weather Data": a 5 x 4 hexagonal grid of nodes, each showing a profile over the 14 variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]
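Beyond the default codes plot, the som object records which map node each observation was assigned to, and plot.kohonen offers other views. A brief sketch of common follow-ups on the model above (the plot types named here are assumptions from the kohonen package's plotting interface):

```r
# How many observations map to each of the 20 nodes.
table(model$unit.classif)

# Mean distance of observations to their winning node (mapping quality).
plot(model, type="quality")

# Observation counts per node.
plot(model, type="counts")
```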


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website which indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures, and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids (PAM)
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps (SOM)
  • Further Reading and Acknowledgements
  • References

[Frames from the k-means animation: alternating "Find cluster" and "Move centers" plots of X1 against X2 as the algorithm iterates to convergence.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centers profiled over the 16 numeric weather variables (min_temp through temp_3pm), one colour per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster: Radial Plot with K=4

[Radial plot of the 4 cluster centers over the same 16 numeric weather variables, one colour per cluster.]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Radial plot of the 4 cluster profiles over the 16 numeric weather variables, on a -2 to 2 scale centred on the variable means.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range extends to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Radial plot of the cluster 4 profile alone, over the 16 numeric weather variables on the same -2 to 2 scale.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[A 2 x 2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each profiling the 16 numeric weather variables on the -2 to 2 scale.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)
model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15
....

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts
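This section is a stub in the source. The idea behind multiple starts is that kmeans() converges to a local minimum that depends on its random starting centers; the nstart= argument runs several random starts and keeps the best. A minimal sketch (ds and numi as before; the choice of 20 starts is illustrative):

```r
set.seed(42)

# One random start may settle in a poor local minimum.
m1 <- kmeans(scale(ds[numi]), centers=10)

# Twenty random starts: kmeans() keeps the run with the smallest
# total within sum of squares.
m20 <- kmeans(scale(ds[numi]), centers=10, nstart=20)

c(m1$tot.withinss, m20$tot.withinss)
```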


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1
boot 2
boot 3
boot 4
....

model

Cluster stability assessment
Cluster method: kmeans
....

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
....


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster it is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
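The definition above can be checked directly in base R. On synthetic data (none of these objects are from the weather dataset), the per-cluster sums of squared distances from the cluster means reproduce kmeans()'s withinss, and their sum is tot.withinss:

```r
set.seed(42)
x <- matrix(rnorm(200), ncol=2)     # 100 observations, 2 variables.
m <- kmeans(x, centers=3, nstart=10)

# Within sum of squares for each cluster: squared distances of its
# observations from the cluster mean, summed.
wss <- sapply(1:3, function(k) {
  xk <- x[m$cluster == k, , drop=FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
})

all.equal(as.numeric(wss), m$withinss)  # TRUE
all.equal(sum(wss), m$tot.withinss)     # TRUE
```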


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Plot of tot.withinss and betweenss against the number of clusters (1 to 50): as the within sum of squares falls, the between sum of squares rises, the two summing to the total sum of squares.]
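The plot above can be generated along these lines (a sketch assuming ds and numi; the nstart value and the range of k are illustrative):

```r
library(reshape2)
library(ggplot2)

# For each k, record the total within and the between sum of squares.
ss <- data.frame(t(sapply(1:50, function(k) {
  m <- kmeans(scale(ds[numi]), centers=k, nstart=5)
  c(k=k, totwithinss=m$tot.withinss, betweenss=m$betweenss)
})))

ssm <- melt(ss, id.vars="k", variable.name="Measure")
ggplot(ssm, aes(x=k, y=value, colour=Measure)) +
  geom_line() +
  labs(x="Number of Clusters", y="Sum of Squares")
```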


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot: the scaled total within sum of squares plotted against k = 1 to 20.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares here is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
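The ratio in that definition can be computed directly from a kmeans object, which makes it concrete (a sketch; ds and numi as before, with n the number of observations and k the number of clusters):

```r
k <- 10
m <- kmeans(scale(ds[numi]), centers=k)
n <- nrow(ds[numi])

# Calinski-Harabasz: between-cluster variance over within-cluster variance.
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```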

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, and considerably longer using the average silhouette width criterion.

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled average silhouette width against k = 1 to 20, peaking at k = 2.]


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]   0.0 812.0 878.8 867.1 757.8 643.4 644.8 498.7 518.3 488.1 427.8
[12] 450.4 430.3 445.3 401.2 387.6 392.6 386.7 351.9 323.3

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion from clusterCrit against k = 1 to 20.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]


27 K-Means: Plot All Criteria

[Six further panels plotting the remaining scaled criteria against k = 2 to 20: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
....


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of the variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and, once again, only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0
....

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
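For the first exercise, one possible approach (a sketch, not the module's own solution) is to project the data onto its first two principal components and colour points by the entropy-weighted clusters:

```r
library(wskm)

m <- ewkm(ds[numi], 10)

# Project onto the first two principal components for a 2-D view.
pc <- prcomp(scale(ds[numi]))$x[, 1:2]

plot(pc, col=m$cluster, pch=19,
     main="Entropy Weighted K-Means Clusters")
```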

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

[Figure: the dendrogram again, with each of the 10 clusters drawn in its own colour; the individual observation labels along the bottom and the height axis (0 to 1500) are omitted here.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
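A hedged sketch of the suggested approach (illustrative only: the data frame and its injected missing values are fabricated here rather than taken from the weather dataset):

```r
library(cluster)  # mona() for monothetic clustering of binary data.
library(lattice)  # levelplot() for the suggested levelplot.

# Fabricate a small dataset with missing values in each variable.
set.seed(42)
dmiss <- data.frame(a=rnorm(20), b=rnorm(20), c=rnorm(20))
for (v in names(dmiss)) dmiss[[v]][sample(20, 5)] <- NA

# Convert to binary: 1 indicates present, 0 indicates missing.
miss <- data.frame(lapply(dmiss, function(x) as.integer(!is.na(x))))

# mona() performs a monothetic hierarchical clustering of binary data,
# splitting on a single variable at each step.
mmona <- mona(miss)
mmona$order  # the ordering of observations induced by the clustering

# Levelplot of the present/missing matrix, rows ordered by the clustering.
lv <- levelplot(t(as.matrix(miss[mmona$order, ])))
```

A plot(mmona) would also draw mona's banner plot of the successive splits.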


36 Self Organising Maps (SOM)

[Figure: SOM codes plot, titled "Weather Data", on a 5 by 4 hexagonal grid, with each node showing the 14 variable segments: min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


[Figures: successive frames of the k-means animation over the two variables X1 and X2, alternating between the "Find cluster" and "Move centers" steps.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial (polar) plot of the 10 cluster centers over the variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm, temp_9am, temp_3pm, with one coloured line per cluster (1 to 10).]

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 27 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

12 Visualize the Cluster: Radial Plot with K=4

[Figure: the same radial plot over all 16 variables, rebuilt with 4 clusters.]

nclust <- 4

model <- mkms <- kmeans(scale(ds[numi]), nclust)

dscm <- melt(model$centers)

names(dscm) <- c("Cluster", "Variable", "Value")

dscm$Cluster <- factor(dscm$Cluster)

dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 28 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles produced by CreateRadialPlot(), over all 16 variables, with a grid from -2 to 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 29 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of cluster 4 alone, over all 16 variables, with a grid from -2 to 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 30 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

15 Visualise the Cluster: Grid of Radial Plots

[Figure: a grid of four radial plots, titled Cluster1 to Cluster4, one per cluster profile, each over all 16 variables with a grid from -2 to 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 31 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

16 K-Means: Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts
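This section is empty in the source. As a minimal sketch of the idea (an assumption, not the module's own code), the nstart= argument of kmeans() tries multiple random starts and keeps the best solution; the data here are simulated in place of scale(ds[numi]):

```r
# Simulated stand-in for scale(ds[numi]).
set.seed(42)
sds <- scale(matrix(rnorm(366*5), ncol=5))

# A single random start can land in a poor local minimum.
m1 <- kmeans(sds, centers=10)

# nstart=20 runs kmeans() from 20 random starts and keeps the solution
# with the lowest total within sum of squares.
m20 <- kmeans(sds, centers=10, nstart=20)

m1$tot.withinss
m20$tot.withinss  # typically no worse, often noticeably lower
```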


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

* Cluster stability assessment *

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 34 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squares of the distances between observations.
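To make the concept concrete, the sums of squares reported by kmeans() can be reproduced by hand (a sketch on simulated data, not part of the original module):

```r
set.seed(42)
x <- scale(matrix(rnorm(100*3), ncol=3))
m <- kmeans(x, 4)

# Total sum of squares: squared distances of the observations from the
# overall mean of the data.
totss <- sum(sweep(x, 2, colMeans(x))^2)

# Within sum of squares for cluster 1: squared distances of its
# observations from the cluster centre.
wss1 <- sum(sweep(x[m$cluster == 1, , drop=FALSE], 2, m$centers[1, ])^2)

c(totss, m$totss)       # the two totals agree
c(wss1, m$withinss[1])  # as do the within sums of squares
```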


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss plotted against the number of clusters (0 to 50); as the cluster count grows the within sum of squares falls and the between sum of squares rises, each approaching the total sum of squares.]
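The figure can be reproduced along the following lines (a sketch: the data are simulated here in place of scale(ds[numi]), and the plotting details are assumptions rather than the module's original code):

```r
library(ggplot2)
library(reshape2)

# Simulated stand-in for scale(ds[numi]).
set.seed(42)
sds <- scale(matrix(rnorm(366*5), ncol=5))

# Record the two measures for an increasing number of clusters.
ks <- 1:50
ss <- t(sapply(ks, function(k)
{
  m <- kmeans(sds, k)
  c(totwithinss=m$tot.withinss, betweenss=m$betweenss)
}))

dsc  <- data.frame(k=ks, ss)
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure)) +
  geom_line() +
  labs(x="Number of Clusters", y="Sum of Squares")
```

Note that for every k the two measures sum to the (constant) total sum of squares.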


22 K-Means: Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
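The definition can be computed directly from kmeans() output (an illustrative sketch on simulated data, not from the original module):

```r
# Simulated stand-in for scale(ds[numi]).
set.seed(42)
sds <- scale(matrix(rnorm(366*5), ncol=5))

n <- nrow(sds)
k <- 10
m <- kmeans(sds, k)

# Calinski-Harabasz: between SS per cluster degree of freedom over
# within SS per residual degree of freedom.
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch
```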

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion for k = 1 to 20.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters, 10 runs, took 30 minutes for the Calinski-Harabasz criterion, compared to … minutes using the average silhouette width criterion.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled average silhouette width criterion for k = 1 to 20.]


25 K-Means: Using clusterCrit Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case, k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()

for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}

names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted over k = 2 to 20.]


27 K-Means: Plot All Criteria

[Figures: the remaining criteria plotted over k = 2 to 20, six per panel: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)

library(wskm)

mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, since once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
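For the first exercise, one hedged possibility (a judgment call, not the module's own solution) is to show the entropy weights as a heatmap. The weights matrix is simulated below so the sketch stands alone; with the model above we would melt mewkm$weights directly:

```r
library(ggplot2)
library(reshape2)

# Simulated stand-in for the weights matrix in mewkm$weights; ewkm()
# weights are non-negative and sum to 1 within each cluster.
set.seed(42)
w <- matrix(runif(10*6), nrow=10,
            dimnames=list(1:10, c("min_temp", "max_temp", "rainfall",
                                  "evaporation", "sunshine",
                                  "wind_gust_speed")))
w <- w / rowSums(w)

wm <- melt(w, varnames=c("Cluster", "Variable"), value.name="Weight")

p <- ggplot(wm, aes(x=Variable, y=factor(Cluster), fill=Weight)) +
  geom_tile() +
  labs(y="Cluster")
```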


30 Partitioning Around Medoids (PAM)

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: pairs plot of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked with crosses.]

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), plotting Component 1 against Component 2. These two components explain 56.04% of the point variability.]


Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244


Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

Data Science with R OnePageR Survival Guides Cluster Analysis

[Animation frames: scatter plots of X1 against X2, alternating between the "Find cluster" and "Move centers" steps of the k-means iterations.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers across the numeric variables (min_temp through temp_3pm), one coloured line per cluster (1 to 10).]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster: Radial Plot with K=4

[Figure: radial plot of the four cluster centers across the numeric variables, one coloured line per cluster (1 to 4).]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles across the numeric variables, with gridlines at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the profile of cluster 4 alone, with gridlines at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: grid of four radial plots, one per cluster (Cluster1 to Cluster4), each with gridlines at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 9.98e-17 1.274e-16 -2.545e-16 -1.629e-16 -5.836e-16 1.99e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1.323e-16 -2.87e-16 -4.162e-16 -1.11e-16 -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
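The body of this section appears to be missing from the extraction. As a minimal sketch of the idea (not from the original text), the nstart= argument of kmeans() runs several random starts and keeps the solution with the smallest total within sum of squares; the names ds and numi follow the rest of this module:

```r
# Run k-means with a single random start and with 20 random starts.
# With nstart=20, kmeans() returns the best of the 20 runs, measured
# by the total within sum of squares.
set.seed(42)
m1  <- kmeans(scale(ds[numi]), centers=10)
m20 <- kmeans(scale(ds[numi]), centers=10, nstart=20)

# The multi-start solution is never worse on this criterion.
m1$tot.withinss
m20$tot.withinss
```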


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31

 $ result :List of 6

  ..$ result :List of 11

  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
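As a sketch of the calculation (not from the original text), the total within sum of squares can be recomputed from the cluster assignments and centers, and agrees with the value kmeans() reports:

```r
# Recompute the total within sum of squares from first principles:
# for each observation, the squared Euclidean distance to its own
# cluster's center, summed over all observations.
sds <- scale(ds[numi])
tot.wss <- sum((sds - model$centers[model$cluster, ])^2)

# Matches the value stored in the kmeans object.
all.equal(tot.wss, model$tot.withinss)
```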


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: sum of squares (0 to 6,000) against the number of clusters (0 to 50): totwithinss decreases while betweenss increases as the number of clusters grows.]
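The code that generated this plot is not included in the extracted text. A reconstruction in the style of the surrounding examples (an assumption, not the author's original code), building both measures over a range of k and plotting them with reshape2 and ggplot2:

```r
# For each k, record the total within sum of squares and the between
# sum of squares; for a given dataset the two always add up to totss.
nk <- 1:50
ss <- data.frame(t(sapply(nk, function(k)
{
  m <- kmeans(scale(ds[numi]), k)
  c(k=k, totwithinss=m$tot.withinss, betweenss=m$betweenss)
})))
ssm <- melt(ss, id.vars="k", variable.name="Measure")
p <- ggplot(ssm, aes(x=k, y=value, colour=Measure))
p <- p + geom_line() + labs(x="Number of Clusters", y="Sum of Squares")
p
```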


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k − 1) to the within sum of squares (divided by n − k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
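The definition above can be computed directly from a kmeans object. A sketch (the helper function ch() is our own illustration, not part of any package), with n observations and k clusters as in the surrounding code:

```r
# Calinski-Harabasz: ratio of between-cluster to within-cluster
# variance, each divided by its degrees of freedom.
ch <- function(m, n)
{
  k <- nrow(m$centers)
  (m$betweenss/(k-1)) / (m$tot.withinss/(n-k))
}
m <- kmeans(scale(ds[numi]), 2)
ch(m, nrow(ds))
```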

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to minutes using the average silhouette width criterion [check timing].

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width criterion for k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20, peaking at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here, and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

[1] "ball_hall"         "banfeld_raftery"   "c_index"

[4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

[7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"            "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figures: the remaining criteria plotted against k = 2 to 20, in panels of six: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)

library(wskm)

m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1,] 11      9.1     25.2      0.0         4.2     11.9              30

[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster with the medoids marked by crosses.]

plot(model)

[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), Component 1 against Component 2. These two components explain 56.04% of the point variability.]

[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average silhouette widths: 1: 49 (0.20); 2: 30 (0.17); 3: 23 (0.02); 4: 27 (0.10); 5: 34 (0.15); 6: 45 (0.14); 7: 44 (0.11); 8: 40 (0.23); 9: 26 (0.11); 10: 48 (0.09).]


31 Clara
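The body of this section appears to be missing from the extraction. As a minimal sketch (not from the original text), clara() from the cluster package is a sampling-based version of pam() suited to larger datasets: it applies pam() to samples of the data and keeps the best set of medoids found.

```r
# clara() clusters a larger dataset by repeatedly applying pam() to
# samples of the observations, keeping the best medoids.
library(cluster)
model <- clara(ds[numi], k=10, samples=50)
model$medoids
plot(model)
```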


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)

[Figure: dendrogram titled "Cluster Dendrogram" (x-axis: hclusterpar(*, "ward"); y-axis: Height), with rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram of the 366 observations, heights 0 to 1500, with the leaves coloured by their 10 clusters.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
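A starting point for the exercise (a sketch, not from the original text): mona() from the cluster package requires every variable to be binary and to actually vary, so constant columns must be dropped first.

```r
# Encode missingness as binary: 1 = missing, 0 = present.
mb <- data.frame(sapply(ds, function(x) as.integer(is.na(x))))

# mona() cannot handle variables that never (or always) vary,
# so keep only columns with both values present.
mb <- mb[, sapply(mb, function(x) length(unique(x)) > 1), drop=FALSE]

library(cluster)
mm <- mona(mb)
plot(mm)
```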


36 Self Organising Maps SOM

[Figure: self organising map titled "Weather Data", showing codebook segments for min_temp through cloud_3pm on a 5 by 4 hexagonal grid.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the marked links on the website, which indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 22: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

[Six frames from the kmeans animation, alternating the "Find cluster" and "Move centers" steps, plotted over X1 and X2.]

Data Science with R OnePageR Survival Guides Cluster Analysis

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Radial plot of the 10 cluster centres over the 16 numeric weather variables, from min_temp through temp_3pm, with one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 27 of 56

12 Visualize the Cluster: Radial Plot with K=4

[Radial plot of the 4 cluster centres over the 16 numeric weather variables, with one coloured line per cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Radial plot of the 4 cluster profiles over the 16 numeric weather variables, on a scale from -2 to 2 standard deviations.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Radial plot of the cluster 4 profile alone, on the same -2 to 2 scale.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[A 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each over the 16 numeric weather variables on the -2 to 2 scale.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)
model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16    -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data, and the starting measure of the within sum of squares.
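Since the centre of a single cluster is the grand mean, the base case also gives us the identity that the total sum of squares equals the total within sum of squares, with a between sum of squares of (numerically) zero. A minimal sketch checking this on synthetic data rather than the weather dataset:

```r
# Sanity check of the k=1 base case on synthetic data (not the
# weather dataset): totss == tot.withinss and betweenss is ~0.
set.seed(42)
x  <- matrix(rnorm(100), ncol=2)
km <- kmeans(scale(x), 1)
stopifnot(isTRUE(all.equal(km$totss, km$tot.withinss)),
          abs(km$betweenss) < 1e-6)
```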


17 K-Means Multiple Starts
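A common way to rebuild the clustering from multiple random starting points is the nstart= argument of kmeans(), which runs several initialisations and keeps the solution with the smallest total within sum of squares. A minimal sketch on synthetic data rather than the weather dataset:

```r
# Best of 25 random starts is never worse than the single start
# drawn from the same seed (synthetic data).
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
set.seed(1); km1  <- kmeans(x, centers=5, nstart=1)
set.seed(1); km25 <- kmeans(x, centers=5, nstart=25)
stopifnot(km25$tot.withinss <= km1$tot.withinss + 1e-8)
```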


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1
boot 2
boot 3
boot 4

model

Cluster stability assessment
Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
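The definition can be verified directly by recomputing each cluster's within sum of squares from the squared distances to its centre and comparing with what kmeans() reports; a minimal sketch on synthetic data:

```r
# Recompute withinss by hand and compare with kmeans()'s own values.
set.seed(42)
x  <- matrix(rnorm(200), ncol=2)
km <- kmeans(x, centers=3, nstart=10)
wss <- sapply(1:3, function(k)
  sum(sweep(x[km$cluster == k, , drop=FALSE], 2, km$centers[k, ])^2))
stopifnot(isTRUE(all.equal(wss, km$withinss)),
          isTRUE(all.equal(sum(wss), km$tot.withinss)))
```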


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
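The two measures are tied together: for kmeans() the total sum of squares decomposes exactly into the total within sum of squares plus the between sum of squares, whatever the value of k. A minimal sketch on synthetic data:

```r
# totss = tot.withinss + betweenss holds for any k.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
for (k in 1:5)
{
  km <- kmeans(x, centers=k, nstart=10)
  stopifnot(isTRUE(all.equal(km$totss, km$tot.withinss + km$betweenss)))
}
```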

[Line plot of the total within sum of squares (totwithinss) decreasing and the between sum of squares (betweenss) increasing with the number of clusters from 0 to 50, both on a 0 to 6000 scale.]

22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot of the scaled total within sum of squares against k = 1 to 20, flattening out as k grows.]

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
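The ratio can be computed directly from the quantities kmeans() returns; a minimal sketch on synthetic data:

```r
# Calinski-Harabasz ratio from a kmeans fit: (BSS/(k-1)) / (WSS/(n-k)).
set.seed(42)
x <- matrix(rnorm(300), ncol=3)
n <- nrow(x); k <- 4
km <- kmeans(x, centers=k, nstart=10)
ch <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
stopifnot(is.finite(ch), ch > 0)
```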

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to minutes using the average silhouette width criterion [check timing].

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled average silhouette width against k = 1 to 20, peaking at k = 2.]
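The quantity that kmeansruns() maximises here can also be computed directly with silhouette() from the cluster package; a minimal sketch on synthetic data:

```r
# Average silhouette width of a kmeans clustering, computed directly.
library(cluster)
set.seed(42)
x  <- matrix(rnorm(200), ncol=2)
km <- kmeans(x, centers=3, nstart=10)
sil <- silhouette(km$cluster, dist(x))
asw <- mean(sil[, "sil_width"])
stopifnot(asw > -1, asw < 1)
```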

25 K-Means Using clusterCrit Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled clusterCrit Calinski-Harabasz criterion against k, peaking at k = 3.]

26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]

27 K-Means Plot All Criteria

[Six panels plotting the remaining scaled criteria against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21 through gdi33; gdi41 through gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]

28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
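The same assignment can be sketched without rattle by measuring the distance from each new observation to each centre and taking the nearest; synthetic data assumed:

```r
# Assign new points to the nearest kmeans centre by Euclidean distance.
set.seed(42)
x  <- matrix(rnorm(200), ncol=2)
km <- kmeans(x, centers=3, nstart=10)
newx <- matrix(rnorm(20), ncol=2)
nearest <- apply(newx, 1, function(p)
  which.min(colSums((t(km$centers) - p)^2)))
stopifnot(length(nearest) == nrow(newx), all(nearest %in% 1:3))
```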


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the clustering and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids (PAM)

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked.]

plot(model)


[clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), showing the clusters over the first two components. These two components explain 56.04% of the point variability.]

[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]

31 Clara
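clara() from the cluster package applies PAM to repeated samples of the data and keeps the best set of medoids, which scales medoid-based clustering to larger datasets; a minimal sketch on synthetic data:

```r
# clara(): PAM on samples of the data, suited to larger datasets.
library(cluster)
set.seed(42)
x <- matrix(rnorm(2000), ncol=2)
model <- clara(x, k=4, samples=10, metric="euclidean")
stopifnot(nrow(model$medoids) == 4,
          length(model$clustering) == nrow(x))
```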


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
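A comparable hierarchy can be built with base R's hclust(), which hclusterpar() parallelises; a minimal sketch on synthetic data:

```r
# Ward hierarchical clustering with base R, cut into 3 clusters.
set.seed(42)
x  <- matrix(rnorm(100), ncol=2)
hc <- hclust(dist(x, method="euclidean"), method="ward.D")
cl <- cutree(hc, k=3)
stopifnot(length(cl) == nrow(x), length(unique(cl)) == 3)
```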


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster dendrogram with height axis from 0 to 1500 and the 10 clusters outlined in rectangles.]

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram of the 366 observations, height axis 0 to 1500, with the 10 clusters coloured and leaves labelled by observation number.]

35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
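The present/missing conversion the exercise asks for is straightforward; a minimal sketch on a small hypothetical data frame (mona() itself is in the cluster package and expects all variables to be binary):

```r
# Convert each variable to binary: 1 = present, 0 = missing.
df  <- data.frame(a=c(1, NA, 3), b=c(NA, 2, 3), c=c(NA, NA, 1))
bin <- as.data.frame(lapply(df, function(x) as.integer(!is.na(x))))
stopifnot(identical(bin$a, c(1L, 0L, 1L)),
          identical(bin$b, c(0L, 1L, 1L)))
```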


36 Self Organising Maps (SOM)

[SOM plot titled "Weather Data": a 5 by 4 hexagonal grid of nodes, each showing a segment plot of the 14 variables from min_temp through cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all the criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualise the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
[Figure: five frames from the k-means animation on simulated data (X1 versus X2, roughly -4 to 4), alternating the "Find cluster" and "Move centers" steps.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster centers across the numeric weather variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))
p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with K=4

[Figure: radial plot of the 4 cluster centers across the numeric weather variables.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the 4 cluster profiles over the numeric weather variables, grid lines at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, grid lines at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each over the numeric weather variables with grid lines at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the center of the original data as a whole and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
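The key tool here is the nstart= argument of kmeans(). As a minimal sketch (on synthetic stand-in data rather than the weather dataset), requesting multiple random starts and keeping the best run is a one-argument change:

```r
# k-means depends on its random starting centers, so different runs can
# converge to different local optima. nstart= runs the algorithm from
# several random starts and keeps the solution with the lowest
# tot.withinss.
set.seed(42)
x <- scale(matrix(rnorm(366*4), ncol=4))  # stand-in for scale(ds[numi])

m1  <- kmeans(x, centers=10, nstart=1)
m20 <- kmeans(x, centers=10, nstart=20)

m1$tot.withinss   # single random start
m20$tot.withinss  # best of 20 random starts
```

With real structure in the data, the multi-start total within sum of squares is typically no worse, and often noticeably better, than a single random start.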


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified across different starting points are more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method:  kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically the sum of the squared distances between observations.
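As a small check on what kmeans() reports (a sketch on synthetic data, not the weather dataset), the total sum of squares is simply the summed squared distance of every observation from the grand mean, regardless of the number of clusters:

```r
set.seed(42)
x <- matrix(rnorm(60), ncol=3)

m <- kmeans(x, centers=4)

# Total sum of squares computed directly from the definition: squared
# deviations of each observation from the column (grand) means.
totss <- sum(sweep(x, 2, colMeans(x))^2)

all.equal(m$totss, totss)
```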


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, with the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
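The within sum of squares can likewise be recomputed from first principles, as a check on the value kmeans() stores (a sketch on synthetic data):

```r
set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, centers=3)

# For each observation, the squared distance to the center of its own
# cluster; summing over all observations reproduces tot.withinss.
wss <- sum((x - m$centers[m$cluster, ])^2)

all.equal(m$tot.withinss, wss)
```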


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
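A plot like the one below can be generated along the following lines. This is a sketch: it uses synthetic stand-in data in place of scale(ds[numi]), together with the ggplot2 and reshape2 packages already loaded for this module.

```r
library(reshape2)
library(ggplot2)

set.seed(42)
sds <- scale(matrix(rnorm(366*4), ncol=4))  # stand-in for scale(ds[numi])

# For each k, record both the total within and the between sum of squares.
nk <- 1:50
ev <- sapply(nk, function(k)
{
  m <- kmeans(sds, k)
  c(totwithinss=m$tot.withinss, betweenss=m$betweenss)
})

dscm <- melt(data.frame(k=nk, t(ev)), id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_line()
p <- p + labs(x="Number of Clusters", y="Sum of Squares")
p
```

As k grows the two curves mirror each other, since for k-means they always sum to the constant total sum of squares.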

[Figure: totwithinss and betweenss plotted against the number of clusters (0 to 50), sums of squares ranging from 0 to about 6000.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
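The definition translates directly into code. A sketch on synthetic stand-in data, computing CH = (BSS/(k-1)) / (WSS/(n-k)) for a single clustering:

```r
set.seed(42)
x <- scale(matrix(rnorm(366*4), ncol=4))  # stand-in for scale(ds[numi])
k <- 3
m <- kmeans(x, k)

# Variance ratio: between-cluster spread per degree of freedom over
# within-cluster spread per degree of freedom.
n  <- nrow(x)
ch <- (m$betweenss / (k-1)) / (m$tot.withinss / (n-k))
ch
```

fpc also exposes this index directly as calinhara(), given the data and a cluster assignment.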

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1]    0.0 1175.5 1009.7  888.1  821.6  747.5  697.5  651.8  613.8  581.8
[11]  557.1  534.4  516.3  500.7  483.4  469.0  453.2  440.7  425.7  416.5

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes for the Calinski-Harabasz criterion, compared to ... minutes using the average silhouette width criterion (check timing).

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width against k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]   0.0 812.0 878.8 867.1 757.8 643.4 644.8 498.7 518.3 488.1 427.8
[12] 450.4 430.3 445.3 401.2 387.6 392.6 386.7 351.9 323.3

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case, k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion, as computed by clusterCrit, against k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
...

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plotting the remaining scaled criteria against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, since once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, rebuild the clustering, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
...

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, points coloured by cluster with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), showing the observations projected onto the first two components, which explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes nj and average silhouette widths:

 j  nj  ave si
 1  49    0.20
 2  30    0.17
 3  23    0.02
 4  27    0.10
 5  34    0.15
 6  45    0.14
 7  44    0.11
 8  40    0.23
 9  26    0.11
10  48    0.09]


31 Clara
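The tool for this section is clara() from cluster, the same package that provides pam(). A minimal sketch on synthetic stand-in data (not the weather dataset): clara() applies pam() to random subsamples and keeps the best set of medoids, which makes it practical for datasets too large for pam() itself.

```r
library(cluster)

set.seed(42)
x <- scale(matrix(rnorm(366*5), ncol=5))  # stand-in for scale(ds[numi])

# 5 subsamples of 50 observations each; the medoids from the best
# subsample (lowest average dissimilarity) are retained.
model <- clara(x, k=10, samples=5, sampsize=50)

model$medoids          # one representative observation per cluster
table(model$clustering)
```

The samples= and sampsize= arguments trade accuracy against run time; larger values approach the quality of a full pam() run.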


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot, from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: dendrogram titled "Cluster Dendrogram" from hclusterpar(..., link="ward"), heights 0 to 1500, with rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: dendrogram of the 366 observations coloured by the 10-cluster cut, heights 0 to 1500.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
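A starting point for the exercise (a sketch, with synthetic data standing in for a dataset that actually has missing values): build the binary present/missing indicators and hand them to mona() from cluster.

```r
library(cluster)

# Synthetic data with values missing at random.
set.seed(42)
x <- matrix(rnorm(200), ncol=4)
x[sample(length(x), 40)] <- NA

# One 0/1 indicator column per variable: 1 = missing, 0 = present.
miss <- data.frame(ifelse(is.na(x), 1L, 0L))

# mona() performs divisive hierarchical clustering of binary data,
# splitting on one variable at each step.
model <- mona(miss)
plot(model)  # banner plot of the divisive hierarchy
```

From here the exercise continues with a levelplot of the indicator matrix, ordered by the mona() clustering.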


36 Self Organising Maps SOM

[Figure: self organising map titled "Weather Data", showing the 14 variables (min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm) as segments on a 5x4 hexagonal grid.]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website marking the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf


Data Science with R — OnePageR Survival Guides: Cluster Analysis

[Figure: four frames from the k-means animation of Section 10, plotting X1 against X2 over the alternating "Move centers" and "Find cluster" steps.]

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the k = 10 cluster centres over the 16 numeric weather variables (min_temp through temp_3pm), one coloured line per cluster.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualise the Cluster: Radial Plot with K=4

[Figure: radial plot of the k = 4 cluster centres over the 16 numeric weather variables, one coloured line per cluster.]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles over the 16 numeric weather variables, with gridlines at -2, 0 and 2 standard deviations.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, gridlines at -2, 0 and 2 standard deviations.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a 2x2 grid of radial plots, one per cluster (Cluster1 through Cluster4), each over the 16 numeric weather variables with gridlines at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16    -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts
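The code for this section is not legible in this extract. A minimal sketch of the idea, on synthetic data rather than the weather dataset: kmeans() accepts an nstart= argument, runs that many random starts, and keeps the solution with the lowest total within sum of squares.

```r
# Sketch only: synthetic data stands in for scale(ds[numi]).
set.seed(42)
x <- scale(matrix(rnorm(300 * 4), ncol = 4))

single <- kmeans(x, centers = 5)               # a single random start
multi  <- kmeans(x, centers = 5, nstart = 25)  # keep the best of 25 starts

# The multiple-start fit reports the smallest tot.withinss found.
c(single = single$tot.withinss, multi = multi$tot.withinss)
```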


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified from different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method:  kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares, typically the sum of the squared distances between observations.
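As a sanity check on the numbers above: for a scaled dataset every column has variance 1, so each column contributes n - 1 to the total sum of squares, and kmeans() reports totss = (n - 1) * p. With 366 observations and 16 numeric variables that is 365 * 16 = 5840, the value seen in the base case. A synthetic illustration:

```r
# Each scaled column has sum of squared deviations n - 1 = 365, so the
# total over 16 columns is 365 * 16 = 5840, matching totss above.
set.seed(42)
x <- scale(matrix(rnorm(366 * 16), ncol = 16))
km <- kmeans(x, centers = 1)
km$totss  # 5840, up to floating point
```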


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, with the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
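The definition can be checked directly against kmeans(): recompute each cluster's within sum of squares as the sum of squared distances from its centre. Synthetic data is used here in place of the weather dataset.

```r
set.seed(42)
x <- matrix(rnorm(200 * 3), ncol = 3)
km <- kmeans(x, centers = 4, nstart = 10)

# For each cluster, sum the squared distances of its observations
# from the cluster centre (the cluster mean).
wss <- sapply(seq_len(nrow(km$centers)), function(i)
  sum(sweep(x[km$cluster == i, , drop = FALSE], 2, km$centers[i, ])^2))

all.equal(as.numeric(wss), as.numeric(km$withinss))
```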

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
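The two measures are complementary: kmeans() computes betweenss as totss minus tot.withinss, so as one falls the other rises. A quick check of the identity on synthetic data:

```r
set.seed(42)
x <- scale(matrix(rnorm(300 * 4), ncol = 4))
km <- kmeans(x, centers = 6, nstart = 5)

# totss decomposes into the within and between sums of squares.
c(totss = km$totss, within_plus_between = km$tot.withinss + km$betweenss)
```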

[Figure: tot.withinss and betweenss plotted against the number of clusters (0 to 50), sum of squares axis 0 to 6000; the within measure falls and the between measure rises as k grows.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
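The ratio just described can be computed directly from any kmeans() fit; a sketch on synthetic data (the weather dataset is not reproduced here):

```r
set.seed(42)
x <- scale(matrix(rnorm(300 * 4), ncol = 4))
n <- nrow(x)

# CH(k) = (betweenss / (k - 1)) / (tot.withinss / (n - k)).
ch <- function(km) {
  k <- nrow(km$centers)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

round(sapply(2:6, function(k) ch(kmeans(x, centers = k, nstart = 10))), 2)
```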

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion, and longer using the average silhouette width criterion.

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
...

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width for k = 1 to 20, peaking at k = 2.]
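The average silhouette width can also be computed directly with silhouette() from the cluster package rather than via kmeansruns(); a sketch on synthetic data:

```r
library(cluster)

set.seed(42)
x <- scale(matrix(rnorm(150 * 3), ncol = 3))
km <- kmeans(x, centers = 3, nstart = 10)

# Silhouette widths need the cluster labels and the distance matrix.
sil <- silhouette(km$cluster, dist(x))
asw <- mean(sil[, "sil_width"])  # the average silhouette width for k = 3
asw
```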


25 K-Means Using clusterCrit: Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimal choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20, peaking at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: the remaining criteria, scaled, plotted against k = 2 to 20 in panels of six: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() method to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
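Without rattle, the same assignment can be sketched by computing the squared distance from each new observation to each cluster centre and picking the nearest; nearest_centre() below is a hypothetical helper written for this illustration, not part of any package.

```r
# Assign each row of newdata to the nearest of the given cluster centres.
nearest_centre <- function(centers, newdata) {
  d <- sapply(seq_len(nrow(centers)), function(i)
    rowSums(sweep(as.matrix(newdata), 2, centers[i, ])^2))
  max.col(-d, ties.method = "first")  # column index of the smallest distance
}

set.seed(42)
x   <- matrix(rnorm(100 * 2), ncol = 2)
km  <- kmeans(x, centers = 3, nstart = 10)
new <- matrix(rnorm(10 * 2), ncol = 2)
nearest_centre(km$centers, new)
```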


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, as once again only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the clustering and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
...

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), plotting the clusters over the first two principal components, which explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average silhouette widths: 49 (0.20), 30 (0.17), 23 (0.02), 27 (0.10), 34 (0.15), 45 (0.14), 44 (0.11), 40 (0.23), 26 (0.11), 48 (0.09).]


31 Clara
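No code survives for this section in this extract. As a sketch: clara() from the cluster package scales PAM to larger datasets by running pam() on sampled subsets and keeping the best set of medoids. Synthetic data stands in for the weather dataset here.

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(1000 * 4), ncol = 4)

# clara() draws `samples` subsets, runs pam() on each, and keeps the
# medoids giving the best overall average dissimilarity.
model <- clara(x, k = 5, samples = 10)
table(model$clustering)  # cluster sizes over all 1000 observations
```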


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
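Cluster memberships can then be extracted with cutree(); a sketch using stats::hclust in place of hclusterpar(), on the assumption that the amap function returns a compatible hclust-like object (synthetic data stands in for the weather dataset).

```r
set.seed(42)
x <- matrix(rnorm(60 * 3), ncol = 3)

# stats::hclust stands in for hclusterpar() here.
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")

cluster <- cutree(hc, k = 10)  # cut the tree into 10 clusters
table(cluster)
```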


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)

[Figure: the dendrogram from hclusterpar (ward linkage), height axis 0 to 1500, with 10 rectangles marking the clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the same dendrogram with the 10 clusters coloured, height axis 0 to 1500, leaf labels showing the observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, on the assumption that data are missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
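A minimal sketch of the first steps of this exercise, on a synthetic missingness pattern since the weather data is not reproduced here: each variable is recoded to 1/0 for present/missing and the binary matrix is handed to mona() from the cluster package.

```r
library(cluster)

set.seed(42)
raw <- matrix(rnorm(100 * 6), ncol = 6)
raw[sample(length(raw), 150)] <- NA             # inject some missing values

# 1 = value present, 0 = value missing, one binary column per variable.
present <- as.data.frame(ifelse(is.na(raw), 0L, 1L))
names(present) <- paste0("v", seq_len(ncol(raw)))

model <- mona(present)  # monothetic hierarchical clustering of binary data
```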


36 Self Organising Maps SOM

[Figure: SOM codes plot of the first 14 numeric weather variables over a 5x4 hexagonal grid, titled "Weather Data".]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website that indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright © 2013-2014 Graham.Williams@togaware.com. Module: ClustersO. Page 56 of 56.

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 25: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

minus4 minus2 0 2 4

minus2

02

4

X1

X2

Fin

d cl

uste

r

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 24 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus4 minus2 0 2 4

minus2

02

4

X1

X2

Mov

e ce

nter

s

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 25 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus4 minus2 0 2 4

minus2

02

4

X1

X2

Fin

d cl

uste

r

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 26 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

11 Visualise the Cluster Radial Plot Using GGPlot2

min_temp

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pmhumidity_9am

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pm

Cluster

1

2

3

4

5

6

7

8

9

10

dscm lt- melt(model$centers)

names(dscm) lt- c(Cluster Variable Value)

dscm$Cluster lt- factor(dscm$Cluster)

dscm$Order lt- asvector(sapply(1length(numi) rep 10))

p lt- ggplot(subset(dscm Cluster in 110)

aes(x=reorder(Variable Order)

y=Value group=Cluster colour=Cluster))

p lt- p + coord_polar()

p lt- p + geom_point()

p lt- p + geom_path()

p lt- p + labs(x=NULL y=NULL)

p lt- p + theme(axisticksy=element_blank() axistexty = element_blank())

p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 27 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

12 Visualize the Cluster Radial Plot with K=4

[Radial plot of the 4 cluster centers over the 16 variables (min_temp through temp_3pm), coloured by Cluster 1-4.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster: Cluster Profiles with Radial Plot

[Radial plot of the four cluster profiles over the 16 variables, with rings at -2, 0, and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")

dsc <- data.frame(group=factor(1:4), model$centers)

CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the cluster 4 profile alone, with rings at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[A 2x2 grid of radial plots, one per cluster (Cluster1 through Cluster4), each over the 16 variables with rings at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.
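It is worth seeing why totss is exactly 5840 here: after scale() every column has mean 0 and variance 1, so the total sum of squares is (n-1) times the number of columns, and 365 * 16 = 5840. A quick check of this identity, as a sketch on synthetic data (366 rows and 16 columns, matching the shape of the weather dataset, rather than the actual ds):

```r
# Sketch: for scaled data, totss = (n-1) * ncol.
set.seed(42)
x <- scale(matrix(rnorm(366 * 16), ncol=16))
# Total sum of squares: squared deviations from the column means.
totss <- sum(sweep(x, 2, colMeans(x))^2)
stopifnot(isTRUE(all.equal(totss, (nrow(x) - 1) * ncol(x))))  # 365 * 16 = 5840
```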


17 K-Means Multiple Starts
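The content of this page did not survive extraction, so the following is a sketch (on synthetic data, not the weather ds) of the usual idea: the nstart= argument asks kmeans() to try several random starts and keep the result with the smallest total within sum of squares.

```r
# Sketch: multiple random starts with kmeans(); best of 20 starts can be
# no worse than the single start that shares its random seed.
set.seed(42)
x <- scale(matrix(rnorm(300 * 5), ncol=5))
m1 <- kmeans(x, centers=5, nstart=1)
set.seed(42)
m20 <- kmeans(x, centers=5, nstart=20)  # its first start matches m1's
stopifnot(m20$tot.withinss <= m1$tot.withinss + 1e-10)
```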


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1
boot 2
boot 3
boot 4

model

* Cluster stability assessment *
Cluster method:  kmeans

Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result : List of 6
  ..$ result : List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
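To make the definition above concrete, this sketch recomputes the within sum of squares by hand and compares with what kmeans() reports (synthetic data stands in for the weather ds):

```r
# Sketch: within sum of squares = sum, over clusters, of squared
# distances of members to their cluster mean.
set.seed(42)
x <- scale(matrix(rnorm(300 * 5), ncol=5))
m <- kmeans(x, centers=3, nstart=10, iter.max=50)
wss <- sapply(1:3, function(k)
{
  xk <- x[m$cluster == k, , drop=FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)  # squared distances to the cluster mean
})
stopifnot(isTRUE(all.equal(sum(wss), m$tot.withinss)))
```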


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
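The two measures are tied together by a simple decomposition: for fixed data, totss = tot.withinss + betweenss, so minimising one maximises the other. A quick check of the identity on synthetic data:

```r
# Sketch: the sum-of-squares decomposition reported by kmeans().
set.seed(42)
x <- scale(matrix(rnorm(200 * 4), ncol=4))
m <- kmeans(x, centers=4, nstart=5)
stopifnot(isTRUE(all.equal(m$betweenss + m$tot.withinss, m$totss)))
```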

[Plot of Sum of Squares against Number of Clusters (0 to 50), with tot.withinss decreasing and betweenss increasing as the number of clusters grows.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()

nk <- 1:20

for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot of the scaled criterion against k = 1 to 20, flattening as k increases.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
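The ratio described above can be computed directly from the components that kmeans() returns; kmeansruns() below does this per k for us. A sketch on synthetic data:

```r
# Sketch: Calinski-Harabasz = (betweenss / (k-1)) / (withinss / (n-k)).
set.seed(42)
x <- scale(matrix(rnorm(200 * 4), ncol=4))
k <- 3
m <- kmeans(x, centers=k, nstart=5)
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (nrow(x) - k))
stopifnot(is.finite(ch), ch > 0)
```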

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled average silhouette width criterion against k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()

for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot of the scaled criterion against k = 1 to 20, peaking at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)

ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()

for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]


27 K-Means Plot All Criteria

[Six further plots of the remaining scaled criteria against k = 2 to 20: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)

train <- sample(nobs, 0.7*nobs)

test <- setdiff(seq_len(nobs), train)

model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
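Under the hood this assignment is just nearest-centre matching: each new row goes to the centre with the smallest squared Euclidean distance. A sketch (the helper nearest() here is ours, not part of rattle):

```r
# Sketch: assign new observations to the nearest kmeans centre.
set.seed(42)
x <- matrix(rnorm(100 * 3), ncol=3)
m <- kmeans(x, centers=2, nstart=5)
nearest <- function(centers, newdata)
  apply(newdata, 1, function(r) which.min(colSums((t(centers) - r)^2)))
new <- matrix(rnorm(5 * 3), ncol=3)
cl <- nearest(m$centers, new)
stopifnot(length(cl) == 5, all(cl %in% 1:2))
```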


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)

library(wskm)

mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
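As a starting point for the rescaling exercise, one common choice is to map each variable onto [0, 1] (the helper range01() below is our own, not from the module):

```r
# Sketch: min-max rescaling so every variable shares the range [0, 1].
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
df <- data.frame(a=c(2, 4, 6), b=c(10, 20, 40))
r <- as.data.frame(lapply(df, range01))
stopifnot(all(sapply(r, min) == 0), all(sapply(r, max) == 1))
```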


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
 [1,] 11      9.1     25.2      0.0         4.2     11.9              30
 [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of the first five variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster with the medoids marked.]

plot(model)


[clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): Component 1 versus Component 2. These two components explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara
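The page for this section is otherwise blank, so here is a minimal sketch (on synthetic data) of clara() from the cluster package, a sampling-based variant of PAM intended for larger datasets:

```r
# Sketch: clara() clusters sub-samples with PAM and keeps the best medoids.
library(cluster)
set.seed(42)
x <- matrix(rnorm(500 * 4), ncol=4)
model <- clara(x, k=5, samples=20)
stopifnot(length(model$clustering) == nrow(x),
          length(unique(model$clustering)) == 5)
```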


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)
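Note that rect.hclust() only draws the boxes; cutree() returns the matching cluster membership vector, which is what we would use for any follow-on analysis. A sketch on synthetic data:

```r
# Sketch: cut a hierarchical clustering into k groups.
set.seed(42)
x <- matrix(rnorm(60 * 3), ncol=3)
h <- hclust(dist(x), method="ward.D")
cl <- cutree(h, k=10)
stopifnot(length(cl) == 60, length(unique(cl)) == 10)
```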

[Cluster dendrogram from hclusterpar (ward linkage), with the height axis shown and rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram of the 366 observations, cut into 10 coloured clusters with the observation labels along the x axis.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
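A first step for this exercise is the recoding itself: turn each variable into a 1/0 present/missing indicator, ready for mona() or a levelplot. A sketch:

```r
# Sketch: recode a data frame as present/missing indicators.
df <- data.frame(a=c(1, NA, 3), b=c(NA, NA, 6))
miss <- as.data.frame(lapply(df, function(x) as.integer(!is.na(x))))
stopifnot(identical(miss$a, c(1L, 0L, 1L)), identical(miss$b, c(0L, 0L, 1L)))
```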


36 Self Organising Maps SOM

[SOM codes plot titled "Weather Data", showing the 14 variables (min_temp through cloud_3pm) on a 5 x 4 hexagonal grid.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 26: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

minus4 minus2 0 2 4

minus2

02

4

X1

X2

Mov

e ce

nter

s

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 25 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus4 minus2 0 2 4

minus2

02

4

X1

X2

Fin

d cl

uste

r

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 26 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

11 Visualise the Cluster Radial Plot Using GGPlot2

min_temp

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pmhumidity_9am

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pm

Cluster

1

2

3

4

5

6

7

8

9

10

dscm lt- melt(model$centers)

names(dscm) lt- c(Cluster Variable Value)

dscm$Cluster lt- factor(dscm$Cluster)

dscm$Order lt- asvector(sapply(1length(numi) rep 10))

p lt- ggplot(subset(dscm Cluster in 110)

aes(x=reorder(Variable Order)

y=Value group=Cluster colour=Cluster))

p lt- p + coord_polar()

p lt- p + geom_point()

p lt- p + geom_path()

p lt- p + labs(x=NULL y=NULL)

p lt- p + theme(axisticksy=element_blank() axistexty = element_blank())

p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 27 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

12 Visualize the Cluster Radial Plot with K=4

min_temp

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pmhumidity_9am

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pm

Cluster

1

2

3

4

nclust lt- 4

model lt- mkms lt- kmeans(scale(ds[numi]) nclust)

dscm lt- melt(model$centers)

names(dscm) lt- c(Cluster Variable Value)

dscm$Cluster lt- factor(dscm$Cluster)

dscm$Order lt- asvector(sapply(1length(numi) rep nclust))

p lt- ggplot(dscm

aes(x=reorder(Variable Order)

y=Value group=Cluster colour=Cluster))

p lt- p + coord_polar()

p lt- p + geom_point()

p lt- p + geom_path()

p lt- p + labs(x=NULL y=NULL)

p lt- p + theme(axisticksy=element_blank() axistexty = element_blank())

p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 28 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

13 Visualise the Cluster Cluster Profiles with Radial Plot

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster

1

2

3

4

The radial plot here is carefully engineered to most effectively present the cluster profiles TheR code to generate the plot is defined as CreateRadialPlot() and was originally available fromPaul Williamsonrsquos web site (Department of Geography University of Liverpool)

source(httponepagertogawarecomCreateRadialPlotR)

dsc lt- dataframe(group=factor(14) model$centers)

CreateRadialPlot(dsc gridmin=-2 gridmax=2 plotextentx=15)

We can quickly read the profiles and gain insights into the 4 clusters Having re-scaled all ofthe data we know that the ldquo0rdquo circle is the mean for each variable and the range goes up to 2standard deviations from the mean in either direction We observe that cluster 1 has a centerwith higher pressures whilst the cluster 2 center has higher humidity and cloud cover and lowsunshine cluster 3 has high wind speeds and cluster 4 has higher temperatures evaporation andsunshine

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 29 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone, over the same 16 weather variables with grid circles at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

[Figure: a 2x2 grid of radial plots, one per cluster (Cluster1 to Cluster4), each over the 16 weather variables with grid circles at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means: Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

   min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1  9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means: Multiple Starts
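This section is left blank in the source. A minimal sketch of the standard approach, assuming base R's kmeans() with its nstart= argument, which re-runs the algorithm from multiple random starts and keeps the solution with the lowest total within sum of squares; synthetic data stands in for scale(ds[numi]):

```r
# Multiple random starts with kmeans(); synthetic stand-in data.
set.seed(42)
x <- scale(matrix(rnorm(300), ncol=3))

set.seed(42)
m1  <- kmeans(x, centers=5, nstart=1)   # a single random start

set.seed(42)
m20 <- kmeans(x, centers=5, nstart=20)  # best of 20 random starts

# With the same seed, the first of the 20 starts matches m1's start,
# so the multi-start fit can only match or improve on it.
m20$tot.withinss <= m1$tot.withinss
```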


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment:
Cluster method:  kmeans
Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result : List of 6
  ..$ result : List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
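The definition above can be checked by hand. A hedged sketch, with synthetic data standing in for the weather dataset: for each cluster we sum the squared distances of its observations from the cluster mean and compare against the withinss component returned by kmeans():

```r
# Recompute the within sum of squares manually; synthetic stand-in data.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
model <- kmeans(x, centers=3, nstart=10)

# For each cluster: sum of squared distances to that cluster's mean.
wss <- sapply(1:3, function(i)
{
  xi <- x[model$cluster == i, , drop=FALSE]
  sum(sweep(xi, 2, colMeans(xi))^2)
})

all.equal(as.numeric(wss), as.numeric(model$withinss))  # should be TRUE
```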


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
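The two measures are tied together by the decomposition totss = tot.withinss + betweenss, which we can confirm directly from the components stored by kmeans() (synthetic data standing in for the weather dataset):

```r
# The total sum of squares decomposes into within plus between.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
model <- kmeans(x, centers=4, nstart=10)

model$totss
model$tot.withinss + model$betweenss  # the same value, up to rounding

all.equal(model$totss, model$tot.withinss + model$betweenss)
```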

[Figure: plot of Sum of Squares (0 to 6000) against Number of Clusters (0 to 50), with tot.withinss decreasing and betweenss increasing as the number of clusters grows.]


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares (y from -1 to 3) against k = 1 to 20, flattening as k grows.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria, also known as the variance ratio criteria, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criteria is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
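The ratio just described can be computed directly from the components stored in a kmeans object; a sketch on synthetic data (not the weather dataset):

```r
# Calinski-Harabasz index from the kmeans fields:
# (betweenss / (k-1)) / (tot.withinss / (n-k)).
set.seed(42)
x <- matrix(rnorm(300), ncol=3)
n <- nrow(x)
k <- 4
model <- kmeans(x, centers=k, nstart=10)

ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch
```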

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion (y from -2 to 2) against k = 1 to 20, peaking at k = 2 and declining thereafter.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters with 10 runs took 30 minutes for the Calinski-Harabasz criteria, compared to ... minutes using the average silhouette width criteria [check timing].

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled average silhouette width (y from -3 to 1) against k = 1 to 20, peaking at k = 2.]


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: plot of the scaled Calinski-Harabasz criterion from clusterCrit (y from -2 to 1) against k = 1 to 20, peaking around k = 3.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: plot of the first six scaled criteria (ballh, banfe, cinde, calin, davie, detra) against k = 2 to 20.]


27 K-Means: Plot All Criteria

[Figure: six panels plotting the remaining scaled criteria against k = 2 to 20: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means: predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.
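Conceptually the prediction is just a nearest-center assignment, which can be sketched in base R. The nearest() helper below is illustrative, not rattle's implementation, and synthetic data stands in for the weather dataset:

```r
# Assign new observations to the nearest k-means center (Euclidean).
set.seed(42)
x <- matrix(rnorm(200), ncol=2)
model <- kmeans(x, centers=3, nstart=10)

# For each new row, squared distance to every center; pick the smallest.
nearest <- function(centers, newdata)
{
  apply(newdata, 1, function(row)
    which.min(colSums((t(centers) - row)^2)))
}

newx <- matrix(rnorm(10), ncol=2)
nearest(model$centers, newx)
```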

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.
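The numi index used throughout holds the positions of the numeric columns. A sketch of how such an index is typically constructed, using a small stand-in data frame (not the actual weather dataset):

```r
# Build an index of numeric columns, as numi is used in this module.
weather <- data.frame(location="Canberra",
                      min_temp=c(8, 14, 13.7),
                      max_temp=c(24.3, 26.9, 23.4),
                      rain_today=c("No", "Yes", "Yes"))

numi <- which(sapply(weather, is.numeric))
names(weather)[numi]  # the numeric variables only
```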

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), points coloured by cluster and medoids marked with crosses.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), projecting the clusters onto the first two components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), n = 366, 10 clusters with sizes 49, 30, 23, 27, 34, 45, 44, 40, 26, 48 and average widths 0.20, 0.17, 0.02, 0.10, 0.15, 0.14, 0.11, 0.23, 0.11, 0.09; overall average silhouette width 0.14.]


31 Clara
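This section is left blank in the source. clara() from the cluster package (which ships with R) is the sampling-based variant of pam() intended for larger datasets; a minimal hedged sketch on synthetic data standing in for the weather dataset:

```r
# clara(): PAM on repeated subsamples, keeping the best medoids.
library(cluster)

set.seed(42)
x <- matrix(rnorm(2000), ncol=4)  # 500 synthetic observations

model <- clara(x, k=3, samples=10)

model$medoids             # one representative observation per cluster
table(model$clustering)   # cluster membership counts
```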


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
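For reference, base R's hclust() performs the same agglomeration without the parallelism; a sketch on synthetic data (note that base R names Ward linkage "ward.D" or "ward.D2" rather than amap's "ward"):

```r
# Base-R hierarchical clustering: distance matrix plus Ward linkage.
set.seed(42)
x <- matrix(rnorm(200), ncol=2)

model <- hclust(dist(x, method="euclidean"), method="ward.D")

# Cut the tree into 10 clusters, as done with the dendrogram below.
clusters <- cutree(model, k=10)
table(clusters)
```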


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: cluster dendrogram from hclusterpar (*, "ward"), with a Height axis and rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: dendrogram with the 10 clusters coloured, height axis from 0 to 1500; the leaf labels are the observation numbers.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
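One hedged way to begin this exercise, using base R only: code each cell as 1/0 for missing/present and cluster the indicator rows. The exercise suggests mona() from cluster; plain hclust() with a Hamming-style (Manhattan on 0/1) distance is used here as a self-contained stand-in, on synthetic data with injected NAs:

```r
# Cluster observations by their missing-data patterns.
set.seed(42)
x <- matrix(rnorm(200), ncol=4)
x[sample(length(x), 40)] <- NA        # inject 40 missing values

miss <- ifelse(is.na(x), 1L, 0L)      # 1 = missing, 0 = present

# Manhattan distance on 0/1 indicators counts differing cells.
model <- hclust(dist(miss, method="manhattan"))
clusters <- cutree(model, k=4)
table(clusters)
```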


36 Self Organising Maps: SOM

[Figure: self organising map titled "Weather Data", with codebook segments for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a ... which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

11 Visualise the Cluster: Radial Plot Using GGPlot2

[Figure: radial plot of the 10 cluster profiles over the 16 weather variables (min_temp through temp_3pm), one coloured line per cluster 1-10.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 27 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

12 Visualize the Cluster Radial Plot with K=4

min_temp

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pmhumidity_9am

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pm

Cluster

1

2

3

4

nclust lt- 4

model lt- mkms lt- kmeans(scale(ds[numi]) nclust)

dscm lt- melt(model$centers)

names(dscm) lt- c(Cluster Variable Value)

dscm$Cluster lt- factor(dscm$Cluster)

dscm$Order lt- asvector(sapply(1length(numi) rep nclust))

p lt- ggplot(dscm

aes(x=reorder(Variable Order)

y=Value group=Cluster colour=Cluster))

p lt- p + coord_polar()

p lt- p + geom_point()

p lt- p + geom_path()

p lt- p + labs(x=NULL y=NULL)

p lt- p + theme(axisticksy=element_blank() axistexty = element_blank())

p

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 28 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

13 Visualise the Cluster Cluster Profiles with Radial Plot

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster

1

2

3

4

The radial plot here is carefully engineered to most effectively present the cluster profiles TheR code to generate the plot is defined as CreateRadialPlot() and was originally available fromPaul Williamsonrsquos web site (Department of Geography University of Liverpool)

source(httponepagertogawarecomCreateRadialPlotR)

dsc lt- dataframe(group=factor(14) model$centers)

CreateRadialPlot(dsc gridmin=-2 gridmax=2 plotextentx=15)

We can quickly read the profiles and gain insights into the 4 clusters Having re-scaled all ofthe data we know that the ldquo0rdquo circle is the mean for each variable and the range goes up to 2standard deviations from the mean in either direction We observe that cluster 1 has a centerwith higher pressures whilst the cluster 2 center has higher humidity and cloud cover and lowsunshine cluster 3 has high wind speeds and cluster 4 has higher temperatures evaporation andsunshine

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 29 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

14 Visualise the Cluster Single Cluster Radial Plot

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

CreateRadialPlot(subset(dsc group==4) gridmin=-2 gridmax=2 plotextentx=15)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 30 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

15 Visualise the Cluster Grid of Radial Plots

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster1

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster2

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster3

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster4

p1 lt- CreateRadialPlot(subset(dsc group==1)

gridmin=-2 gridmax=2 plotextentx=2)

p2 lt- CreateRadialPlot(subset(dsc group==2)

gridmin=-2 gridmax=2 plotextentx=2)

p3 lt- CreateRadialPlot(subset(dsc group==3)

gridmin=-2 gridmax=2 plotextentx=2)

p4 lt- CreateRadialPlot(subset(dsc group==4)

gridmin=-2 gridmax=2 plotextentx=2)

library(gridExtra)

gridarrange(p1+ggtitle(Cluster1) p2+ggtitle(Cluster2)

p3+ggtitle(Cluster3) p4+ggtitle(Cluster4))

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 31 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

16 K-Means Base Case Cluster

model lt- mkms lt- kmeans(scale(ds[numi]) 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 998e-17 1274e-16 -2545e-16 -1629e-16 -5836e-16 199e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1323e-16 -287e-16 -4162e-16 -111e-16 -4321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure ofthe within sum of squares

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 32 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

17 K-Means Multiple Starts

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 33 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters which are regularly identified across different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to them.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation in the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, the clusters individually tend to become smaller and the observations closer together within them. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
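The calculation can be verified by hand. A sketch, assuming the k=10 model and the scaled dataset from the previous section: the within sum of squares for cluster 1 is the sum of squared distances of its observations from their mean.

```r
sds <- scale(ds[numi])                        # the same scaled data
cl1 <- sds[model$cluster == 1, , drop=FALSE]  # observations in cluster 1
# Squared distance of each observation from the cluster mean, summed.
wss1 <- sum(sweep(cl1, 2, colMeans(cl1))^2)
wss1   # matches model$withinss[1]
```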

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss plotted against the number of clusters (0 to 50), on a sum of squares scale of 0 to 6000.]
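The two measures are complementary. A sketch of the identity, assuming the k=10 model from above: the total sum of squares decomposes into the within and between components.

```r
# totss = tot.withinss + betweenss, up to floating point error.
model$totss
model$tot.withinss + model$betweenss
all.equal(model$totss, model$tot.withinss + model$betweenss)  # TRUE
```

So minimising the total within sum of squares and maximising the between sum of squares are the same goal, viewed from either side.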


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
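The ratio can be computed directly from the kmeans() outputs. A sketch, assuming the k=10 model built in the evaluation sections above, with n the number of observations:

```r
n <- nrow(ds[numi])
k <- length(model$size)
# Calinski-Harabasz: between-cluster variance over within-cluster variance.
ch <- (model$betweenss / (k - 1)) / (model$tot.withinss / (n - k))
ch
```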

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to ... minutes using the average silhouette width criterion (check timing).

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled average silhouette width criterion against k = 1 to 20, peaking at k = 2.]
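The average silhouette width can also be computed directly. A sketch, assuming ds and numi as before, using silhouette() from the cluster package on a clustering together with a distance matrix:

```r
library(cluster)

sds <- scale(ds[numi])
km  <- kmeans(sds, 2)
sil <- silhouette(km$cluster, dist(sds))
mean(sil[, "sil_width"])   # the average silhouette width
```

Note that dist() builds the full pairwise distance matrix, which is the source of the extra computational cost for large datasets.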


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion from clusterCrit against k = 1 to 20, peaking at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plotting the remaining criteria against k = 2 to 20: (dunn, gamma, gplus, gdi11, gdi12, gdi13), (gdi21, gdi22, gdi23, gdi31, gdi32, gdi33), (gdi41, gdi42, gdi43, gdi51, gdi52, gdi53), (ksqde, logde, logss, mccla, pbm, point), (raytu, ratko, scott, sdsca, sddis, sdbw) and (silho, tau, trace, trace1, wemme, xiebe).]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
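The idea behind this is simple: assign each new observation to the cluster whose centre is nearest. A hand-rolled sketch of that idea (not rattle's actual implementation; it assumes model, ds, numi and test as above):

```r
nearest <- function(m, x)
{
  # For each row of x, the index of the closest cluster centre
  # by squared Euclidean distance.
  apply(x, 1, function(r) which.min(colSums((t(m$centers) - r)^2)))
}
head(nearest(model, as.matrix(ds[test, numi])))
```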


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered:

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

library(cluster)

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, points coloured by cluster and medoids marked with crosses.]

plot(model)


[Figure: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")): the clusters plotted against the first two principal components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Per-cluster sizes and average silhouette widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara
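This section is left blank in the source. A minimal sketch, assuming ds and numi as before: clara() from the cluster package is a medoid-based method designed for larger datasets, clustering repeated samples of the data and keeping the best resulting set of medoids.

```r
library(cluster)

model <- clara(ds[numi], k=10, samples=50)
model$medoids      # the representative observations
plot(model)        # clusplot and silhouette plot, as for pam()
```

Because only samples are clustered in full, clara() scales to datasets where pam()'s full dissimilarity matrix would be impractical.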


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: Cluster Dendrogram of hclusterpar(*, "ward"), height 0 to 1500, with rectangles identifying the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with each of the 10 clusters coloured, observation numbers as leaf labels, height 0 to 1500.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
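A sketch of one way to start the exercise, assuming the full ds dataset with its missing values: recode every variable as 1/0 for present/missing, drop variables that are never (or always) missing, since mona() requires both values to occur, and then cluster. (The levelplot is left for the reader.)

```r
library(cluster)

# 1 = value present, 0 = value missing, for every variable.
db <- data.frame(sapply(ds, function(x) as.integer(!is.na(x))))

# Keep only genuinely binary variables, as mona() requires.
db <- db[, sapply(db, function(x) length(unique(x)) == 2), drop=FALSE]

mm <- mona(db)
mm$order   # observations reordered for the banner plot
```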


36 Self Organising Maps SOM

[Figure: SOM codes plot titled "Weather Data", showing the profile of the 14 variables in each map unit.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

11 Visualise the Cluster Radial Plot Using GGPlot2

[Figure: radial plot of the centers of the 10 clusters across the 16 numeric variables.]

dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, 10))

p <- ggplot(subset(dscm, Cluster %in% 1:10),
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


12 Visualize the Cluster Radial Plot with K=4

[Figure: radial plot of the centers of the 4 clusters across the 16 numeric variables.]

nclust <- 4
model <- m.kms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))

p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y=element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Figure: radial plot of the four cluster profiles on a -2 to 2 grid.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of the cluster 4 profile alone on a -2 to 2 grid.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 30 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

15 Visualise the Cluster Grid of Radial Plots

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster1

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster2

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster3

humidity_3pm

pressure_9am

pressure_3pm

cloud_9am

cloud_3pm

temp_9am

temp_3pmmin_temp

humidity_9am

max_temp

rainfall

evaporation

sunshine

wind_gust_speed

wind_speed_9am

wind_speed_3pm

minus2

0

2

Cluster4

p1 lt- CreateRadialPlot(subset(dsc group==1)

gridmin=-2 gridmax=2 plotextentx=2)

p2 lt- CreateRadialPlot(subset(dsc group==2)

gridmin=-2 gridmax=2 plotextentx=2)

p3 lt- CreateRadialPlot(subset(dsc group==3)

gridmin=-2 gridmax=2 plotextentx=2)

p4 lt- CreateRadialPlot(subset(dsc group==4)

gridmin=-2 gridmax=2 plotextentx=2)

library(gridExtra)

gridarrange(p1+ggtitle(Cluster1) p2+ggtitle(Cluster2)

p3+ggtitle(Cluster3) p4+ggtitle(Cluster4))

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 31 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

16 K-Means Base Case Cluster

model lt- mkms lt- kmeans(scale(ds[numi]) 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 998e-17 1274e-16 -2545e-16 -1629e-16 -5836e-16 199e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1323e-16 -287e-16 -4162e-16 -111e-16 -4321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure ofthe within sum of squares

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 32 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

17 K-Means Multiple Starts

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 33 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clus-ters being identified We might expect that clusters that are regularly being identified withdifferent starting points might be more robust as actual clusters representing some cohesionamong the observations belonging to that cluster

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identifyrobust clusters

library(fpc)

model lt- mkmcb lt- clusterboot(scale(ds[numi])

clustermethod=kmeansCBI

runs=10

krange=10

seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result

of the clusterboot object which also provides further statistics

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster int [1366] 1 10 5 7 3 1 1 1 1 1

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 34 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering Many are stored within the datastructure returned by kmeans()

The total sum of squares

model lt- kmeans(scale(ds[numi]) 10)

model$totss

[1] 5840

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares This is typically a sum of the square of the distancesbetween observations

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clustersFor a single cluster this is calculated as the average squared distance of each observation withinthe cluster from the cluster mean Then the total within sum of squares is the sum of the withinsum of squares over all clusters

The total within sum of squares generally decreases as the number of clusters increases As weincrease the number of clusters they individually tend to become smaller and the observationscloser together within the clusters As k increases the changes in the total within sum ofsquares would be expected to reduce and so it flattens out A good value of k might be wherethe reduction in the total weighted sum of squares begins to flatten

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building aclustering

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 36 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

21 Evaluation Between Sum of Squares

The between sum or squares is a measure of how far the clusters are from each other

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Plot: sum of squares (0 to 6000) versus number of clusters (0 to 50) for the two measures, tot.withinss and betweenss; as one falls the other rises.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot: the scaled total within sum of squares for k = 1 to 20, falling steeply at first and then flattening.]
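Reading the elbow off the plot can be automated crudely; this is a heuristic sketch (not part of the module), using the within sum of squares values printed above and flagging the first drop that is small relative to the initial drop:

```r
# Heuristic elbow finder on the first ten values printed above.
wss <- c(5840, 4414, 3753, 3368, 3057, 2900, 2697, 2606, 2465, 2487)
drops <- -diff(wss)                        # successive reductions
elbow <- which(drops < 0.15 * drops[1])[1] # first "small" reduction
# Here the 5th drop is the first small one, suggesting the curve
# flattens around k = 6 for this run.
```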


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18

[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2 and declining thereafter.]
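The definition above translates directly into a few lines of R; here computed from a kmeans fit on synthetic data rather than the weather dataset:

```r
# Calinski-Harabasz index computed from the kmeans sums of squares:
# (betweenss / (k - 1)) / (tot.withinss / (n - k)).
set.seed(42)
m <- matrix(rnorm(200), ncol = 4)   # n = 50 observations
k <- 3
km <- kmeans(m, k)
n <- nrow(m)
ch <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
```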


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes using the Calinski-Harabasz criterion, compared to considerably longer using the average silhouette width criterion.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502

[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled average silhouette width for k = 1 to 20, peaking at k = 2.]
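For reference, the average silhouette width itself can be computed with silhouette() from the cluster package (assumed available); a synthetic example with two well-separated groups:

```r
# Average silhouette width of a kmeans clustering, via cluster::silhouette().
library(cluster)
set.seed(42)
m <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))  # two separated groups
km <- kmeans(m, 2, nstart = 5)
sil <- silhouette(km$cluster, dist(m))
asw <- mean(sil[, 3])   # average silhouette width; near 1 is good
```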


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78

[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: the scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20, peaking at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

[1] "ball_hall"         "banfeld_raftery"   "c_index"

[4] "calinski_harabasz" "davies_bouldin"    "det_ratio"

[7] "dunn"              "gamma"             "g_plus"

[10] "gdi11"            "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, versus k = 2 to 20.]


27 K-Means Plot All Criteria

[Grid of plots: the remaining criteria, scaled, versus k = 2 to 20. The panels cover dunn, gamma, gplus, gdi11 to gdi13; gdi21 to gdi33; gdi41 to gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme and xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
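Without rattle, the same assignment can be sketched in a few lines. This is not rattle's implementation, just the nearest-centre idea on synthetic data: each new observation goes to the centre at the smallest Euclidean distance.

```r
# Nearest-centre assignment for new observations, in the spirit of
# rattle's predict.kmeans().
set.seed(42)
m <- matrix(rnorm(60), ncol = 3)          # 20 observations
km <- kmeans(m[1:15, ], 2)                # fit on a "training" subset
new <- m[16:20, , drop = FALSE]           # treat these as unseen
pred <- apply(new, 1, function(row)
  which.min(colSums((t(km$centers) - row)^2)))
```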


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

 [1,] 11      9.1     25.2      0.0         4.2     11.9              30

 [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster membership and the medoids marked.]

plot(model)


[Clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): the clusters projected onto the first two principal components, which explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average widths: 1: 49 (0.20), 2: 30 (0.17), 3: 23 (0.02), 4: 27 (0.10), 5: 34 (0.15), 6: 45 (0.14), 7: 44 (0.11), 8: 40 (0.23), 9: 26 (0.11), 10: 48 (0.09).]
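A self-contained illustration of pam() on synthetic data; note that, unlike k-means centres, the medoids are actual observations drawn from the data:

```r
# PAM on two synthetic groups; the medoids are real rows of the data.
library(cluster)
set.seed(42)
m <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
           matrix(rnorm(30, mean = 5), ncol = 3))  # 20 observations
pm <- pam(m, 2, metric = "euclidean")
pm$id.med    # row indices of the two medoid observations
```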


31 Clara
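Clara scales PAM to larger datasets by clustering subsamples and keeping the best set of medoids. A minimal sketch, assuming the cluster package and synthetic data:

```r
# clara() applies the PAM algorithm to repeated subsamples, making
# medoid-based clustering feasible for larger datasets.
library(cluster)
set.seed(42)
m <- matrix(rnorm(2000), ncol = 4)   # 500 observations
cl <- clara(m, 3, samples = 20)
cl$medoids                           # 3 medoid observations
```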


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
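For comparison, base R computes the same style of hierarchy serially with hclust(); hclusterpar() parallelises this kind of computation. A sketch on synthetic data (ward.D2 is one of the Ward variants in stats::hclust):

```r
# Serial hierarchical clustering with base R, Ward linkage on
# Euclidean distances.
set.seed(42)
m <- matrix(rnorm(60), ncol = 3)   # 20 observations
hc <- hclust(dist(m, method = "euclidean"), method = "ward.D2")
```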


33 Plotting Hierarchical Cluster

Plot using cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster dendrogram from hclusterpar (Ward linkage) with rectangles drawn around the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Use the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram: the same hierarchy with each of the 10 clusters drawn in its own colour; the leaf labels are observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
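A possible starting point for the exercise, on a tiny synthetic data frame (the variable names are illustrative): build a binary present/missing indicator matrix and cluster it with mona() from the cluster package.

```r
# Cluster observations by their pattern of missingness: 1 = missing,
# 0 = present, then hierarchical clustering of binary data with mona().
library(cluster)
df <- data.frame(a = c(1, NA, 3, NA, 5, 6),
                 b = c(NA, 2, 3, 4, NA, 6),
                 c = c(1, 2, NA, 4, 5, NA))
miss <- as.data.frame(lapply(df, function(x) as.integer(is.na(x))))
mm <- mona(miss)    # divisive clustering of the binary indicators
```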


36 Self Organising Maps SOM

[SOM plot titled "Weather Data": a 5 by 4 hexagonal grid of units, each unit showing the weights of the 14 variables (min_temp through cloud_3pm).]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website marked with a *, which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

12 Visualize the Cluster Radial Plot with K=4

[Radial plot: the four cluster centres profiled across the 16 weather variables (min_temp through temp_3pm), one coloured line per cluster.]

nclust <- 4
model <- mkms <- kmeans(scale(ds[numi]), nclust)
dscm <- melt(model$centers)
names(dscm) <- c("Cluster", "Variable", "Value")
dscm$Cluster <- factor(dscm$Cluster)
dscm$Order <- as.vector(sapply(1:length(numi), rep, nclust))
p <- ggplot(dscm,
            aes(x=reorder(Variable, Order),
                y=Value, group=Cluster, colour=Cluster))
p <- p + coord_polar()
p <- p + geom_point()
p <- p + geom_path()
p <- p + labs(x=NULL, y=NULL)
p <- p + theme(axis.ticks.y=element_blank(), axis.text.y = element_blank())
p


13 Visualise the Cluster Cluster Profiles with Radial Plot

[Radial plot: the four cluster profiles across the 16 variables, with grid circles at -2, 0 and 2.]

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster Single Cluster Radial Plot

[Radial plot of the cluster 4 profile alone, grid circles at -2, 0 and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Grid of four radial plots, one per cluster (Cluster1 to Cluster4), each profiling the 16 variables with grid circles at -2, 0 and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.
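With k = 1 the single centre is the grand mean, so the total within sum of squares equals the total sum of squares and the between sum of squares is zero up to floating point error, which is what the output above shows. A synthetic check:

```r
# Base case: one cluster means no between-cluster variation.
set.seed(42)
m <- matrix(rnorm(80), ncol = 4)
km <- kmeans(m, 1)
km$tot.withinss   # equals km$totss
km$betweenss      # effectively zero
```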


17 K-Means Multiple Starts
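A minimal sketch of the idea on synthetic data, using base R's nstart argument: kmeans() runs several random initialisations and keeps the solution with the smallest total within sum of squares, which guards against a poor single random start.

```r
# Multiple random starts with kmeans(); the best of the runs is kept.
set.seed(42)
m <- matrix(rnorm(200), ncol = 4)   # 50 observations
km1  <- kmeans(m, 5)                # a single random start
km20 <- kmeans(m, 5, nstart = 20)   # best of 20 random starts,
                                    # usually at least as good as km1
```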


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result

of the clusterboot object, which also provides further statistics.

str(model)

List of 31

 $ result :List of 6

  ..$ result :List of 11

  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 34 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering Many are stored within the datastructure returned by kmeans()

The total sum of squares

model lt- kmeans(scale(ds[numi]) 10)

model$totss

[1] 5840

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares This is typically a sum of the square of the distancesbetween observations

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clustersFor a single cluster this is calculated as the average squared distance of each observation withinthe cluster from the cluster mean Then the total within sum of squares is the sum of the withinsum of squares over all clusters

The total within sum of squares generally decreases as the number of clusters increases As weincrease the number of clusters they individually tend to become smaller and the observationscloser together within the clusters As k increases the changes in the total within sum ofsquares would be expected to reduce and so it flattens out A good value of k might be wherethe reduction in the total weighted sum of squares begins to flatten

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building aclustering

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 36 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

21 Evaluation Between Sum of Squares

The between sum or squares is a measure of how far the clusters are from each other

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squaresHere we see the relationship between these two measures

0

2000

4000

6000

0 10 20 30 40 50Number of Clusters

Sum

of S

quar

es

Measure

totwithinss

betweenss

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 37 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

22 K-Means Selecting k Using Scree Plot

crit lt- vector()

nk lt- 120

for (k in nk)

m lt- kmeans(scale(ds[numi]) k)

crit lt- c(crit sum(m$withinss))

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc lt- dataframe(k=nk crit=scale(crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus1

0

1

2

3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 38 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria also known as the variance ratio criteria is the ratio of thebetween sum of squares (divided by k minus 1) to the within sum of squares (divided by n minus k)mdashthe sum of squares is a measure of the variance The relative values can be used to compareclusterings of a single dataset with higher values being better clusterings The criteria is said towork best for spherical clusters with compact centres (as with normally distributed data) usingk-means with Euclidean distance

library(fpc)

nk lt- 120

model lt- kmc lt- kmeansruns(scale(ds[numi]) krange=nk criterion=ch)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 000 11755 10097 8881 8216 7475 6975 6518 6138 5818

[11] 5571 5344 5163 5007 4834 4690 4532 4407 4257 4165

model$bestk

[1] 2

dsc lt- dataframe(k=nk crit=scale(kmc$crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus2

minus1

0

1

2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 39 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria which is an issue for larger datasets A dataset of 50000 observations and15 scaled variables testing from 10 to 40 clusters 10 runs took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria check tim-

ingcheck tim-inglibrary(fpc)

nk lt- 120

model lt- kma lt- kmeansruns(scale(ds[numi]) krange=nk criterion=asw)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 00000 02255 02192 02225 01894 01908 01867 01811 01662 01502

[11] 01617 01512 01502 01490 01533 01557 01453 01459 01462 01460

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p



25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criteria. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p



26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m  <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit        <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms   <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

(Plot of the first six criteria against k: ballh, banfe, cinde, calin, davie, detra.)


27 K-Means Plot All Criteria

(Six panels plotting the remaining criteria against k: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.)


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
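Conceptually, the prediction assigns each new observation to the cluster whose center is nearest in Euclidean distance. A self-contained sketch of that assignment rule on synthetic data (not the weather dataset):

```r
set.seed(42)
x   <- matrix(rnorm(60*2), ncol=2)   # "training" data
km  <- kmeans(x, 2)
new <- matrix(rnorm(10*2), ncol=2)   # new observations to assign

# Distance from each new observation to each of the two cluster
# centers, then pick the nearest center for each.
d        <- as.matrix(dist(rbind(km$centers, new)))[-(1:2), 1:2]
assigned <- apply(d, 1, which.min)
assigned
```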


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range and then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
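As a starting point for the first exercise, the weight matrix itself can be visualised as a heatmap, one row per cluster and one column per variable. This sketch substitutes a random stand-in matrix for mewkm$weights so it runs on its own:

```r
set.seed(42)
# Stand-in for mewkm$weights: 10 clusters by 6 variables.
w <- matrix(runif(10*6), nrow=10,
            dimnames=list(paste0("cluster", 1:10),
                          c("min_temp", "max_temp", "rainfall",
                            "evaporation", "sunshine", "wind_gust_speed")))

# One cell per (cluster, variable) weight; darker cells carry
# more weight within that cluster's subspace.
heatmap(w, Rowv=NA, Colv=NA, scale="none",
        main="ewkm variable weights by cluster")
```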


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:

      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,]  11      9.1     25.2      0.0         4.2     11.9              30
[2,]  38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

(Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, coloured by cluster, with the 10 medoids marked.)

plot(model)


(Cluster plot from clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")): Component 1 against Component 2. These two components explain 56.04% of the point variability.)


(Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes nj and average silhouette widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.)


31 Clara


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
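hclusterpar() is a parallelised hierarchical clustering; the single-core equivalent with base R's hclust() looks like this, sketched on random stand-in data rather than ds[numi] (note base R names Ward linkage "ward.D"):

```r
set.seed(42)
x  <- matrix(rnorm(50*4), ncol=4)          # stand-in for na.omit(ds[numi])
hc <- hclust(dist(x, method="euclidean"),  # pairwise Euclidean distances
             method="ward.D")              # Ward linkage, as link="ward" above
length(hc$height)   # n - 1 merge heights in the dendrogram
```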


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)
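The rectangles correspond to cutting the dendrogram at k=10; cutree() returns the cluster membership that cut implies. A self-contained sketch on a stand-in dendrogram (the model above is not reproduced here):

```r
set.seed(42)
hc <- hclust(dist(matrix(rnorm(50*4), ncol=4)), method="ward.D")

cl <- cutree(hc, k=10)  # one cluster label per observation
table(cl)               # cluster sizes
```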

(Cluster dendrogram titled "Cluster Dendrogram", height axis from 0 to 1500, produced from the hclusterpar model with the 10 cluster rectangles overlaid.)


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
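A hedged sketch of the suggested approach, on a hypothetical data frame with injected missing values: recode each variable as 1 = missing, 0 = present, drop constant columns (mona() requires binary variables that actually vary), and cluster:

```r
library(cluster)  # mona(): monothetic clustering of binary data

# Hypothetical data frame with scattered missing values.
df <- data.frame(a=c(1, NA, 3, NA, 5, 6),
                 b=c(NA, 2, NA, 4, 5, NA),
                 c=c(1, 2, 3, 4, NA, 6))

# 1 = missing, 0 = present.
miss <- as.data.frame(lapply(df, function(x) as.integer(is.na(x))))

# mona() needs binary variables with both values present.
miss <- miss[, sapply(miss, function(x) length(unique(x)) == 2), drop=FALSE]

mc <- mona(miss)
mc$order   # observation ordering from the monothetic hierarchy
```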


36 Self Organising Maps SOM


library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package. Also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


13 Visualise the Cluster: Cluster Profiles with Radial Plot

(Radial plot of the profiles of clusters 1-4 over the 16 variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm, temp_9am and temp_3pm, with the grid running from -2 to 2.)

The radial plot here is carefully engineered to most effectively present the cluster profiles. The R code to generate the plot is defined as CreateRadialPlot() and was originally available from Paul Williamson's web site (Department of Geography, University of Liverpool).

source("http://onepager.togaware.com/CreateRadialPlot.R")
dsc <- data.frame(group=factor(1:4), model$centers)
CreateRadialPlot(dsc, grid.min=-2, grid.max=2, plot.extent.x=1.5)

We can quickly read the profiles and gain insights into the 4 clusters. Having re-scaled all of the data, we know that the "0" circle is the mean for each variable, and the range goes up to 2 standard deviations from the mean in either direction. We observe that cluster 1 has a center with higher pressures, whilst the cluster 2 center has higher humidity and cloud cover and low sunshine, cluster 3 has high wind speeds, and cluster 4 has higher temperatures, evaporation and sunshine.


14 Visualise the Cluster: Single Cluster Radial Plot

(Radial plot of a single cluster profile, cluster 4, over the same 16 variables, with the grid running from -2 to 2.)

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster: Grid of Radial Plots

(Grid of four radial plots, titled Cluster1, Cluster2, Cluster3 and Cluster4, one per cluster profile.)

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)
model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16

  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.


17 K-Means Multiple Starts


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)
model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1
boot 2
boot 3
boot 4

model

Cluster stability assessment
Cluster method: kmeans
Full clustering results are given as parameter result
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)
model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
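This bookkeeping can be verified by hand: for each cluster, centre its observations on the cluster mean and sum the squares. A sketch on random stand-in data rather than the weather dataset:

```r
set.seed(42)
x  <- scale(matrix(rnorm(150*3), ncol=3))  # stand-in for scale(ds[numi])
km <- kmeans(x, 4)

# Within sum of squares per cluster, computed directly:
# centre each cluster's rows on their column means, square and sum.
wss <- sapply(1:4, function(j)
  sum(scale(x[km$cluster == j, , drop=FALSE], scale=FALSE)^2))

all.equal(unname(wss), km$withinss)  # agrees with kmeans' own bookkeeping
```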

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
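The two measures are linked by an exact identity: totss = tot.withinss + betweenss. Checking it with the values reported above:

```r
totss        <- 5840  # model$totss
tot.withinss <- 2394  # model$tot.withinss
betweenss    <- 3446  # model$betweenss

# The total sum of squares decomposes exactly into the within
# and between components, so minimising one maximises the other.
stopifnot(totss == tot.withinss + betweenss)
```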

(Plot of Sum of Squares, 0 to 6000, against Number of Clusters, 0 to 50, for the measures tot.withinss and betweenss: as the number of clusters grows, tot.withinss falls and betweenss rises.)


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk   <- 1:20
for (k in nk)
{
  m    <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830
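The elbow can also be read off numerically as the reduction in the total within sum of squares achieved by each extra cluster, using the values from the run above (the occasional negative value reflects the random starts of kmeans()):

```r
wss <- c(5840, 4414, 3753, 3368, 3057, 2900, 2697, 2606, 2465, 2487,
         2310, 2228, 2173, 2075, 2108, 1970, 1937, 1882, 1846, 1830)

-diff(wss)   # drop in total within sum of squares going from k to k+1
```

The large early drops, followed by small (and noisy) later ones, are what the scree plot shows visually.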

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p


Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 38 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria also known as the variance ratio criteria is the ratio of thebetween sum of squares (divided by k minus 1) to the within sum of squares (divided by n minus k)mdashthe sum of squares is a measure of the variance The relative values can be used to compareclusterings of a single dataset with higher values being better clusterings The criteria is said towork best for spherical clusters with compact centres (as with normally distributed data) usingk-means with Euclidean distance

library(fpc)

nk lt- 120

model lt- kmc lt- kmeansruns(scale(ds[numi]) krange=nk criterion=ch)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 000 11755 10097 8881 8216 7475 6975 6518 6138 5818

[11] 5571 5344 5163 5007 4834 4690 4532 4407 4257 4165

model$bestk

[1] 2

dsc lt- dataframe(k=nk crit=scale(kmc$crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus2

minus1

0

1

2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 39 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria which is an issue for larger datasets A dataset of 50000 observations and15 scaled variables testing from 10 to 40 clusters 10 runs took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria check tim-

ingcheck tim-inglibrary(fpc)

nk lt- 120

model lt- kma lt- kmeansruns(scale(ds[numi]) krange=nk criterion=asw)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 00000 02255 02192 02225 01894 01908 01867 01811 01662 01502

[11] 01617 01512 01502 01490 01533 01557 01453 01459 01462 01460

model$bestk

[1] 2

dsc lt- dataframe(k=nk crit=scale(kma$crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus3

minus2

minus1

0

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 40 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clusteringcriteria Here we illustrate its usage with the Calinski Harabasz criteria Do note that we obtaina different model here to that above hence different calculations of the criteria

library(clusterCrit)

crit lt- vector()

for (k in 120)

m lt- kmeans(scale(ds[numi]) k)

crit lt- c(crit asnumeric(intCriteria(asmatrix(ds[numi]) m$cluster

Calinski_Harabasz)))

crit[isnan(crit)] lt- 0

crit

[1] 000 8120 8788 8671 7578 6434 6448 4987 5183 4881 4278

[12] 4504 4303 4453 4012 3876 3926 3867 3519 3233

bestCriterion(crit Calinski_Harabasz)

[1] 3

In this case k = 3 is the optimum choice

dsc lt- dataframe(k=nk crit=scale(crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus2

minus1

0

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 41 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

26 K-Means Compare All Criteria

We can generate all criteria and then plot them There are over 40 criteria and the are notedon the help page for intCriteria() We generate all the criteria here and then plot the first 6below with the remainder in the following section

m lt- kmeans(scale(ds[numi]) 5)

ic lt- intCriteria(asmatrix(ds[numi]) m$cluster all)

names(ic)

[1] ball_hall banfeld_raftery c_index

[4] calinski_harabasz davies_bouldin det_ratio

[7] dunn gamma g_plus

[10] gdi11 gdi12 gdi13

crit lt- dataframe()

for (k in 220)

m lt- kmeans(scale(ds[numi]) k)

crit lt- rbind(crit asnumeric(intCriteria(asmatrix(ds[numi]) m$cluster

all)))

names(crit) lt- substr(sub(_ names(ic)) 1 5) Shorten for plots

crit lt- dataframe(sapply(crit function(x) x[isnan(x)] lt- 0 x))

dsc lt- cbind(k=220 dataframe(sapply(crit scale)))

dscm lt- melt(dsc idvars=k variablename=Measure)

dscm$value[isnan(dscm$value)] lt- 0

ms lt- ascharacter(unique(dscm$Measure))

p lt- ggplot(subset(dscm Measure in ms[16]) aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p

minus2

0

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Measure

ballh

banfe

cinde

calin

davie

detra

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 42 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

27 K-Means Plot All Criteria

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

dunn

gamma

gplus

gdi11

gdi12

gdi13

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi21

gdi22

gdi23

gdi31

gdi32

gdi33

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi41

gdi42

gdi43

gdi51

gdi52

gdi53

minus2

0

2

4

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

ksqde

logde

logss

mccla

pbm

point

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

raytu

ratko

scott

sdsca

sddis

sdbw

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

silho

tau

trace

trace1

wemme

xiebe

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 43 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the clustering, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
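The weights matrix itself can be visualised to see which variables each cluster relies on. A small sketch, assuming mewkm from above (the choice of levelplot() from lattice here is illustrative, not from the original text):

```r
library(lattice)

# Rows of mewkm$weights are clusters, columns are variables;
# brighter cells mark the variables that dominate each cluster.
levelplot(t(mewkm$weights), xlab="Variable", ylab="Cluster",
          scales=list(x=list(rot=45)))
```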


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation, and sunshine, with points coloured by cluster membership and medoids marked by crosses.]

plot(model)


[Figure: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), the clusters projected onto Component 1 and Component 2. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). n = 366, 10 clusters Cj, average silhouette width 0.14. Per cluster, j: nj | ave(i in Cj) si:
1: 49 | 0.20
2: 30 | 0.17
3: 23 | 0.02
4: 27 | 0.10
5: 34 | 0.15
6: 45 | 0.14
7: 44 | 0.11
8: 40 | 0.23
9: 26 | 0.11
10: 48 | 0.09]


31 Clara
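clara() from cluster scales PAM to larger datasets by clustering repeated subsamples with pam() and keeping the best set of medoids. The source leaves this section empty, so the following is only a minimal sketch, with illustrative argument values:

```r
library(cluster)

set.seed(42)
# Cluster 50 random subsamples and keep the best medoids found.
model <- clara(ds[numi], k=10, samples=50)
model$medoids
plot(model)
```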


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
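The fitted model behaves like a standard hclust object, so cluster memberships can be extracted by cutting the tree at a chosen number of clusters. A small sketch, assuming model from above:

```r
# Cut the dendrogram into 10 clusters and tabulate their sizes.
cluster <- cutree(model, k=10)
table(cluster)
```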


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: Cluster Dendrogram from hclusterpar (ward linkage), with height on the y-axis and rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the same dendrogram with each of the 10 clusters coloured, the leaf labels being the 366 observation numbers.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
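One possible starting point for the exercise, assuming ds is the weather dataset (the construction below is illustrative, not a full solution): build the 1/0 missingness indicator matrix and hand it to mona() from cluster.

```r
library(cluster)

# 1 where a value is present, 0 where it is missing.
dsb <- as.data.frame(lapply(ds, function(x) as.integer(!is.na(x))))

# mona() expects binary variables, so drop columns that never vary
# (variables that are always present or always missing).
dsb <- dsb[, sapply(dsb, function(x) length(unique(x)) == 2)]

model <- mona(dsb)
```

A levelplot of dsb (for example with lattice) then shows the missingness patterns directly.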


36 Self Organising Maps SOM

[Figure: "Weather Data" self-organising map; each node shows a segment plot of the 14 variables from min_temp through cloud_3pm.]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")
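Each observation is mapped to its best-matching unit on the grid, and that mapping is recorded in the fitted object. A small sketch, assuming model from above:

```r
# Which of the 5x4 = 20 units each observation maps to, and how
# many observations land on each unit.
head(model$unit.classif)
table(model$unit.classif)
```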


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures, and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References


14 Visualise the Cluster Single Cluster Radial Plot

[Figure: radial plot of a single cluster (group 4) over the scaled variables from min_temp through humidity_3pm, with grid lines at -2, 0, and 2.]

CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=1.5)


15 Visualise the Cluster Grid of Radial Plots

[Figure: grid of four radial plots, Cluster1 through Cluster4, each showing the cluster profile over the scaled variables with grid lines at -2, 0, and 2.]

p1 <- CreateRadialPlot(subset(dsc, group==1),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3),
                       grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4),
                       grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)
grid.arrange(p1+ggtitle("Cluster1"), p2+ggtitle("Cluster2"),
             p3+ggtitle("Cluster3"), p4+ggtitle("Cluster4"))


16 K-Means Base Case Cluster

model <- mkms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

  min_temp  max_temp   rainfall evaporation   sunshine wind_gust_speed
1 9.98e-17 1.274e-16 -2.545e-16  -1.629e-16 -5.836e-16        1.99e-16
  wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am
1     -1.323e-16      -2.87e-16   -4.162e-16    -1.11e-16   -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.
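With k = 1 the single centre is the mean of the scaled data, so the total sum of squares can be reproduced directly. A small check, assuming ds and numi from earlier:

```r
# scale() centres each column to mean 0 and standard deviation 1,
# so the total sum of squares is just the sum of the squared values:
# (366 - 1) * 16 = 5840, matching model$totss above.
sum(scale(ds[numi])^2)
```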


17 K-Means Multiple Starts
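This section is empty in the source, but kmeans() itself supports multiple starts: it can restart from several random sets of centres and keep the best result, via its nstart argument. A minimal sketch, with illustrative argument values:

```r
set.seed(42)
# Run 20 random starts and keep the solution with the lowest
# total within sum of squares.
model <- kmeans(scale(ds[numi]), centers=10, nstart=20)
model$tot.withinss
```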


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified with different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment.

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result:List of 6
  ..$ result:List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
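The stability of each cluster is summarised by its mean Jaccard similarity across the bootstrap runs, recorded in the result object. A small sketch, assuming model from above (the common reading of values around 0.75 or higher as stable is a rule of thumb, not from this text):

```r
# Mean Jaccard similarity per cluster across the runs; higher
# means the cluster was rediscovered more consistently.
model$bootmean
```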


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squares of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total weighted sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
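The quantity can be reproduced by hand from the cluster assignments, which makes the definition concrete. A small sketch, assuming model and the scaled data from above:

```r
dss <- scale(ds[numi])

# Sum, over clusters, of the squared distances of each observation
# from its cluster centre (the mean of the cluster's members);
# centring each cluster's rows and squaring gives exactly that.
wss <- sapply(1:10, function(k)
         sum(scale(dss[model$cluster==k, ], center=TRUE, scale=FALSE)^2))
wss
sum(wss)  # equals model$tot.withinss
```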


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss (Sum of Squares, 0 to 6000) plotted against the Number of Clusters (0 to 50), the former decreasing and the latter increasing as clusters are added.]
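A plot like this can be rebuilt by recording both measures while varying k. A sketch under the same setup as above (the loop bound of 50 is illustrative):

```r
library(ggplot2)
library(reshape2)

# One kmeans fit per k, so both measures come from the same model.
ms <- lapply(1:50, function(k) kmeans(scale(ds[numi]), k))
crit <- data.frame(k=1:50,
                   totwithinss=sapply(ms, "[[", "tot.withinss"),
                   betweenss=sapply(ms, "[[", "betweenss"))

critm <- melt(crit, id.vars="k", variable.name="Measure")
ggplot(critm, aes(x=k, y=value, colour=Measure)) +
  geom_line() + labs(x="Number of Clusters", y="Sum of Squares")
```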


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion against k = 1 to 20, with its maximum at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. A dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, took 30 minutes for the Calinski-Harabasz criterion compared to ... minutes using the average silhouette width criterion (check timing).

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width against k = 1 to 20, with its maximum at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion against k = 1 to 20, with its maximum at k = 3.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted as scaled values against k = 2 to 20.]

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 42 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

27 K-Means Plot All Criteria

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

dunn

gamma

gplus

gdi11

gdi12

gdi13

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi21

gdi22

gdi23

gdi31

gdi32

gdi33

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi41

gdi42

gdi43

gdi51

gdi52

gdi53

minus2

0

2

4

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

ksqde

logde

logss

mccla

pbm

point

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

raytu

ratko

scott

sdsca

sddis

sdbw

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

silho

tau

trace

trace1

wemme

xiebe

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 43 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

[Figure: self organising map of the weather data, titled "Weather Data", showing the 14 numeric variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm across the map units.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")
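Each observation is assigned to a winning unit on the map, available as unit.classif on the returned model. A small self-contained sketch of the same pattern (iris instead of the weather data; assumes the kohonen package is installed):

```r
library(kohonen)

set.seed(42)
# Train a 3x3 hexagonal SOM on the scaled iris measurements.
m <- som(scale(as.matrix(iris[1:4])), grid=somgrid(3, 3, "hexagonal"))

# unit.classif records, for each observation, its best matching unit.
table(m$unit.classif)
```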


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website marking the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definitions of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes, 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!, Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means: predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids: PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps (SOM)
  • Further Reading and Acknowledgements
  • References

15 Visualise the Cluster: Grid of Radial Plots

[Figure: a 2x2 grid of radial plots, one per cluster (Cluster 1 to Cluster 4), each showing the scaled cluster profile (gridlines at -2, 0 and 2) over the 16 numeric variables: min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm, temp_9am, temp_3pm.]

p1 <- CreateRadialPlot(subset(dsc, group==1), grid.min=-2, grid.max=2, plot.extent.x=2)
p2 <- CreateRadialPlot(subset(dsc, group==2), grid.min=-2, grid.max=2, plot.extent.x=2)
p3 <- CreateRadialPlot(subset(dsc, group==3), grid.min=-2, grid.max=2, plot.extent.x=2)
p4 <- CreateRadialPlot(subset(dsc, group==4), grid.min=-2, grid.max=2, plot.extent.x=2)

library(gridExtra)

grid.arrange(p1+ggtitle("Cluster 1"), p2+ggtitle("Cluster 2"),
             p3+ggtitle("Cluster 3"), p4+ggtitle("Cluster 4"))


16 K-Means: Base Case Cluster

model <- m.kms <- kmeans(scale(ds[numi]), 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 9.98e-17 1.274e-16 -2.545e-16 -1.629e-16 -5.836e-16 1.99e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1.323e-16 -2.87e-16 -4.162e-16 -1.11e-16 -4.321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$tot.withinss

[1] 5840

model$betweenss

[1] -1.819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure of the within sum of squares.
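The k = 1 identities can be checked on any dataset: with one cluster the total within sum of squares equals the total sum of squares, and the between sum of squares is numerically zero. A self-contained check using the built-in iris data:

```r
# With k=1 every observation falls in the single cluster, so the
# within sum of squares equals the total sum of squares.
m <- kmeans(scale(iris[1:4]), centers=1)
all.equal(m$totss, m$tot.withinss)   # TRUE

# For scaled data totss is (n-1) * number of variables: 149 * 4 = 596.
m$totss
```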


17 K-Means: Multiple Starts
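A single run of kmeans() starts from one random set of centres and can settle in a poor local minimum. The nstart= argument asks for multiple random starts, keeping the run with the smallest total within sum of squares. A minimal self-contained sketch on iris (not the author's original code for this section):

```r
set.seed(42)
# One random start.
m1 <- kmeans(scale(iris[1:4]), centers=5, nstart=1)

set.seed(42)
# Best of 25 random starts: never worse than the single start above,
# since the first of the 25 starts repeats it.
m25 <- kmeans(scale(iris[1:4]), centers=5, nstart=25)

m25$tot.withinss <= m1$tot.withinss   # TRUE
```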


18 K-Means: Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly being identified with different starting points might be more robust as actual clusters, representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig, 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation: Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the sum of the squared distances of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.
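The definition can be verified directly: summing, within each cluster, the squared distances of the observations from the cluster centre reproduces the withinss component that kmeans() reports. A self-contained check on iris:

```r
set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x, centers=3, nstart=10)

# For each cluster, sum the squared distances of its observations
# from the cluster centre.
wss <- sapply(seq_len(nrow(m$centers)), function(k)
  sum(sweep(x[m$cluster == k, , drop=FALSE], 2, m$centers[k, ])^2))

all.equal(as.numeric(wss), m$withinss)   # TRUE
```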


21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
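The two measures are tied together by the decomposition totss = tot.withinss + betweenss, so reducing the within sum of squares necessarily increases the between sum of squares. A self-contained check on iris:

```r
set.seed(42)
m <- kmeans(scale(iris[1:4]), centers=3, nstart=10)

# The total sum of squares decomposes into within plus between.
all.equal(m$totss, m$tot.withinss + m$betweenss)   # TRUE
```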

[Figure: the total within sum of squares (totwithinss) and the between sum of squares (betweenss) plotted against the number of clusters (0 to 50), with the sum of squares on the y axis (0 to 6000); the within measure falls and the between measure rises as the number of clusters grows.]


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k = 1 to 20, flattening as k increases.]


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
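The ratio is easy to compute from the components kmeans() returns, so the criterion itself can be sketched in a few lines (ch_index() here is our own illustrative helper, not part of fpc):

```r
# Calinski-Harabasz index: (BSS / (k-1)) / (WSS / (n-k)).
ch_index <- function(m, n)
{
  k <- nrow(m$centers)
  (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
}

set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x, centers=3, nstart=10)
ch_index(m, nrow(x))
```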

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion plotted against k = 1 to 20.]


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes for the Calinski-Harabasz criterion, compared to ... minutes using the average silhouette width criterion [check timing].

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
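For a single clustering, the average silhouette width can also be computed directly with silhouette() from the cluster package, given the cluster labels and the pairwise distances. A self-contained sketch on iris:

```r
library(cluster)

set.seed(42)
x <- scale(iris[1:4])
m <- kmeans(x, centers=2, nstart=10)

# silhouette() needs the cluster labels and a distance matrix.
si <- silhouette(m$cluster, dist(x))

# The criterion is the mean silhouette width over all observations.
mean(si[, "sil_width"])
```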

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled average silhouette width criterion plotted against k = 1 to 20.]


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes, 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski_Harabasz criterion from clusterCrit plotted against k = 1 to 20.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.]


27 K-Means: Plot All Criteria

[Figure: six panels plotting the remaining criteria, scaled, against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means: predict()

rattle (Williams, 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
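Without rattle, the same assignment can be sketched by hand: a new observation is placed in the cluster whose centre is nearest in Euclidean distance (nearest_center() below is our own illustrative helper, not a package function):

```r
# Assign each row of newdata to the nearest kmeans centre.
nearest_center <- function(model, newdata)
{
  d <- apply(model$centers, 1, function(ctr)
    rowSums(sweep(as.matrix(newdata), 2, ctr)^2))
  apply(d, 1, which.min)
}

set.seed(42)
m <- kmeans(iris[1:4], centers=3, nstart=10, algorithm="Lloyd", iter.max=100)

# At convergence each training observation already sits with its
# nearest centre, so the helper reproduces the fitted clustering.
all(nearest_center(m, iris[1:4]) == m$cluster)   # TRUE
```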


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al., 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of the first five numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine), with points coloured by cluster and the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), plotting the clusters against the first two principal components. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average silhouette widths: 1: 49 (0.20); 2: 30 (0.17); 3: 23 (0.02); 4: 27 (0.10); 5: 34 (0.15); 6: 45 (0.14); 7: 44 (0.11); 8: 40 (0.23); 9: 26 (0.11); 10: 48 (0.09).]


31 Clara
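clara() from the cluster package scales the PAM idea to larger datasets: it repeatedly applies pam() to samples of the data, keeps the best set of medoids, and assigns every observation to its nearest medoid. A minimal self-contained sketch on iris (not the author's original code for this section):

```r
library(cluster)

set.seed(42)
# Cluster the scaled iris measurements via sampled PAM.
model <- clara(scale(iris[1:4]), k=3, samples=10)

model$medoids            # one row per cluster medoid
table(model$clustering)  # cluster membership counts
```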


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas, 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler, 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Figure: the cluster dendrogram from hclusterpar (*, "ward"), with height on the y axis and rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis, 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: the dendrogram with its 10 clusters coloured, observation numbers as leaf labels.]

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 33: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

16 K-Means Base Case Cluster

model lt- mkms lt- kmeans(scale(ds[numi]) 1)

model$size

[1] 366

model$centers

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 998e-17 1274e-16 -2545e-16 -1629e-16 -5836e-16 199e-16

wind_speed_9am wind_speed_3pm humidity_9am humidity_3pm pressure_9am

1 -1323e-16 -287e-16 -4162e-16 -111e-16 -4321e-15

model$totss

[1] 5840

model$withinss

[1] 5840

model$totwithinss

[1] 5840

model$betweenss

[1] -1819e-11

model$iter

[1] 1

model$ifault

NULL

Notice that this base case provides the centers of the original data and the starting measure ofthe within sum of squares

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 32 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

17 K-Means Multiple Starts

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 33 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clus-ters being identified We might expect that clusters that are regularly being identified withdifferent starting points might be more robust as actual clusters representing some cohesionamong the observations belonging to that cluster

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- m.kmcb <- clusterboot(scale(ds[numi]),
                               clustermethod=kmeansCBI,
                               runs=10,
                               krange=10,
                               seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method: kmeans

Full clustering results are given as parameter result of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster : int [1:366] 1 10 5 7 3 1 1 1 1 1 ...


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.


20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters they individually tend to become smaller, and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: the total within sum of squares (totwithinss) decreases and the between sum of squares (betweenss) increases as the number of clusters grows from 0 to 50.]
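A sketch of how such a figure can be generated, assuming ds[numi] as in earlier sections: collect tot.withinss and betweenss over a range of k and plot both measures together.

```r
# A sketch only, assuming ds[numi] as elsewhere in this chapter.
library(ggplot2)
library(reshape2)

nk <- 1:50
ss <- sapply(nk, function(k)
{
  m <- kmeans(scale(ds[numi]), k)
  c(totwithinss=m$tot.withinss, betweenss=m$betweenss)
})

# Reshape into long form so both measures share one plot.
dscm <- melt(data.frame(k=nk, t(ss)), id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_line()
p <- p + labs(x="Number of Clusters", y="Sum of Squares")
p
```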


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares for k = 1 to 20, flattening out as k increases.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
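Written in full, with B(k) the between sum of squares and W(k) the total within sum of squares for a clustering of n observations into k clusters, the criterion is:

```latex
CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}
```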

library(fpc)

nk <- 1:20

model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion for k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes for the Calinski-Harabasz criterion compared to minutes using the average silhouette width criterion.

library(fpc)

nk <- 1:20

model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width criterion for k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion from clusterCrit for k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Six figures: the remaining criteria plotted against k = 2 to 20. The panels cover dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; and silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
m.ewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

m.ewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*m.ewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids

     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, coloured by cluster, with the medoids marked.]

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"). n = 366, 10 clusters; average silhouette width 0.14. Cluster sizes and average silhouette widths: 1: 49 (0.20); 2: 30 (0.17); 3: 23 (0.02); 4: 27 (0.10); 5: 34 (0.15); 6: 45 (0.14); 7: 44 (0.11); 8: 40 (0.23); 9: 26 (0.11); 10: 48 (0.09).]


31 Clara
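A minimal sketch only (not the author's original code for this section), assuming ds[numi] as in earlier sections: clara() from the cluster package applies the PAM approach to samples of the data, and so scales to larger datasets.

```r
# A sketch only: clara() from the cluster package clusters large datasets
# by running PAM on samples and keeping the best set of medoids.
library(cluster)

model <- clara(ds[numi], k=10, samples=50)
model$medoids

# The same clusplot and silhouette displays as for pam() are available.
plot(model)
```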


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters

rect.hclust(model, k=10)

[Figure: cluster dendrogram from hclusterpar (ward linkage), with rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Use the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: coloured dendrogram of the 10 clusters, with the observation labels along the bottom.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
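A sketch of one way to start on this exercise, assuming ds is the weather dataset loaded earlier; mona() comes from the cluster package and requires all variables to be binary.

```r
# A sketch only: convert each variable to 1/0 for missing/present, then
# cluster the missingness patterns hierarchically with mona().
library(cluster)

dsb <- data.frame(sapply(ds, function(x) as.integer(is.na(x))))

# mona() requires binary variables that actually vary, so drop the
# constant columns (variables with no missing values at all).
dsb <- dsb[, sapply(dsb, function(x) length(unique(x)) == 2)]

model <- mona(dsb)
model$order  # Observations ordered according to the hierarchy.
```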


36 Self Organising Maps SOM

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")

[Figure: self organising map of the weather data, showing the contribution of the 14 variables (min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm) across a 5 by 4 hexagonal grid.]


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 34: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

17 K-Means Multiple Starts

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 33 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clus-ters being identified We might expect that clusters that are regularly being identified withdifferent starting points might be more robust as actual clusters representing some cohesionamong the observations belonging to that cluster

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identifyrobust clusters

library(fpc)

model lt- mkmcb lt- clusterboot(scale(ds[numi])

clustermethod=kmeansCBI

runs=10

krange=10

seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment

Cluster method kmeans

Full clustering results are given as parameter result

of the clusterboot object which also provides further statistics

str(model)

List of 31

$ result List of 6

$ result List of 11

$ cluster int [1366] 1 10 5 7 3 1 1 1 1 1

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 34 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering Many are stored within the datastructure returned by kmeans()

The total sum of squares

model lt- kmeans(scale(ds[numi]) 10)

model$totss

[1] 5840

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares This is typically a sum of the square of the distancesbetween observations

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clustersFor a single cluster this is calculated as the average squared distance of each observation withinthe cluster from the cluster mean Then the total within sum of squares is the sum of the withinsum of squares over all clusters

The total within sum of squares generally decreases as the number of clusters increases As weincrease the number of clusters they individually tend to become smaller and the observationscloser together within the clusters As k increases the changes in the total within sum ofsquares would be expected to reduce and so it flattens out A good value of k might be wherethe reduction in the total weighted sum of squares begins to flatten

model$withinss

[1] 1721 2192 2376 2172 3366 2284 2542 2919 3105 1260

model$totwithinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building aclustering

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 36 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

21 Evaluation Between Sum of Squares

The between sum or squares is a measure of how far the clusters are from each other

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squaresHere we see the relationship between these two measures

0

2000

4000

6000

0 10 20 30 40 50Number of Clusters

Sum

of S

quar

es

Measure

totwithinss

betweenss

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 37 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

22 K-Means Selecting k Using Scree Plot

crit lt- vector()

nk lt- 120

for (k in nk)

m lt- kmeans(scale(ds[numi]) k)

crit lt- c(crit sum(m$withinss))

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc lt- dataframe(k=nk crit=scale(crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus1

0

1

2

3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 38 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criteria also known as the variance ratio criteria is the ratio of thebetween sum of squares (divided by k minus 1) to the within sum of squares (divided by n minus k)mdashthe sum of squares is a measure of the variance The relative values can be used to compareclusterings of a single dataset with higher values being better clusterings The criteria is said towork best for spherical clusters with compact centres (as with normally distributed data) usingk-means with Euclidean distance

library(fpc)

nk lt- 120

model lt- kmc lt- kmeansruns(scale(ds[numi]) krange=nk criterion=ch)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 000 11755 10097 8881 8216 7475 6975 6518 6138 5818

[11] 5571 5344 5163 5007 4834 4690 4532 4407 4257 4165

model$bestk

[1] 2

dsc lt- dataframe(k=nk crit=scale(kmc$crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus2

minus1

0

1

2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 39 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criteria is more computationally expensive than the Calinski-Harabasz criteria which is an issue for larger datasets A dataset of 50000 observations and15 scaled variables testing from 10 to 40 clusters 10 runs took 30 minutes for the Calinski-Harabasz criteria compared to minutes using the average silhouette width criteria check tim-

ingcheck tim-inglibrary(fpc)

nk lt- 120

model lt- kma lt- kmeansruns(scale(ds[numi]) krange=nk criterion=asw)

class(model)

[1] kmeans

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 00000 02255 02192 02225 01894 01908 01867 01811 01662 01502

[11] 01617 01512 01502 01490 01533 01557 01453 01459 01462 01460

model$bestk

[1] 2

dsc lt- dataframe(k=nk crit=scale(kma$crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus3

minus2

minus1

0

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 40 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clusteringcriteria Here we illustrate its usage with the Calinski Harabasz criteria Do note that we obtaina different model here to that above hence different calculations of the criteria

library(clusterCrit)

crit lt- vector()

for (k in 120)

m lt- kmeans(scale(ds[numi]) k)

crit lt- c(crit asnumeric(intCriteria(asmatrix(ds[numi]) m$cluster

Calinski_Harabasz)))

crit[isnan(crit)] lt- 0

crit

[1] 000 8120 8788 8671 7578 6434 6448 4987 5183 4881 4278

[12] 4504 4303 4453 4012 3876 3926 3867 3519 3233

bestCriterion(crit Calinski_Harabasz)

[1] 3

In this case k = 3 is the optimum choice

dsc lt- dataframe(k=nk crit=scale(crit))

dscm lt- melt(dsc idvars=k variablename=Measure)

p lt- ggplot(dscm aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure))

p lt- p + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p lt- p + theme(legendposition=none)

p

minus2

minus1

0

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 41 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

26 K-Means Compare All Criteria

We can generate all criteria and then plot them There are over 40 criteria and the are notedon the help page for intCriteria() We generate all the criteria here and then plot the first 6below with the remainder in the following section

m lt- kmeans(scale(ds[numi]) 5)

ic lt- intCriteria(asmatrix(ds[numi]) m$cluster all)

names(ic)

[1] ball_hall banfeld_raftery c_index

[4] calinski_harabasz davies_bouldin det_ratio

[7] dunn gamma g_plus

[10] gdi11 gdi12 gdi13

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc  <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms   <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plotting the remaining criteria (scaled) against k = 2 to 20: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides a predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
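The same assignment can be sketched without rattle: each observation goes to the centre with the smallest Euclidean distance. A minimal illustration on synthetic data (the names x, new, and nearest here are ours, not from the module):

```r
# Sketch: assign new observations to the nearest k-means centre by hand.
set.seed(42)
x     <- matrix(rnorm(60*2), ncol=2)   # training data (synthetic)
new   <- matrix(rnorm(10*2), ncol=2)   # new observations to assign
model <- kmeans(x, centers=3, nstart=10)

# Squared Euclidean distance from each new point to each centre.
nearest <- apply(new, 1, function(p)
  which.min(colSums((t(model$centers) - p)^2)))
nearest
```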


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0
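The weights are per-cluster, per-variable importances; on our reading of the package, each cluster's weights sum to 1, so a single 100 above means that cluster is characterised by one variable alone. A sketch on synthetic data (the names x and m are ours):

```r
# Sketch (assumption: ewkm weights sum to 1 within each cluster).
library(wskm)
set.seed(42)
x <- matrix(rnorm(100*4), ncol=4)
m <- ewkm(x, 3)
round(rowSums(m$weights), 2)  # expected: one value per cluster, each near 1
```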

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: pairs plot of min_temp, max_temp, rainfall, evaporation, and sunshine, coloured by cluster, with the medoids marked.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): Component 1 versus Component 2. These two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average silhouette widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
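Cluster membership can then be extracted with cutree(). A sketch using stats::hclust on synthetic data as a stand-in (our assumption is that hclusterpar() returns an hclust-compatible object that cutree() accepts the same way):

```r
# Sketch: cut a hierarchical clustering into 10 groups.
set.seed(42)
x  <- matrix(rnorm(50*3), ncol=3)
h  <- hclust(dist(x), method="ward.D2")
cl <- cutree(h, k=10)
table(cl)  # cluster sizes
```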


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: cluster dendrogram from hclusterpar (Ward linkage), height on the vertical axis, with rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Figure: dendrogram with the 10 clusters coloured; observation labels on the leaves.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
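A hedged starting sketch for the exercise, on synthetic data: mona() comes from the cluster package, and we assume each binary variable has both values present, which mona() requires.

```r
# Sketch: cluster observations by their pattern of missing values.
library(cluster)
set.seed(42)
d <- data.frame(a=rnorm(20), b=rnorm(20), c=rnorm(20))
d$a[sample(20, 8)] <- NA   # inject missing values
d$b[sample(20, 5)] <- NA
d$c[sample(20, 9)] <- NA

# 1 indicates missing, 0 present.
miss <- as.data.frame(lapply(d, function(x) as.integer(is.na(x))))
mn <- mona(miss)
plot(mn)  # banner plot of the divisive hierarchy
```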


36 Self Organising Maps SOM

[Figure: self-organising map plot of the 14 numeric weather variables (min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm), titled "Weather Data".]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website marked as the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


18 K-Means Cluster Stability

Rebuilding multiple clusterings using different random starting points will lead to different clusters being identified. We might expect that clusters that are regularly identified from different starting points are more robust, as actual clusters representing some cohesion among the observations belonging to that cluster.

The function clusterboot() from fpc (Hennig 2014) provides a convenient tool to identify robust clusters.

library(fpc)

model <- mkmcb <- clusterboot(scale(ds[numi]),
                              clustermethod=kmeansCBI,
                              runs=10,
                              krange=10,
                              seed=42)

boot 1

boot 2

boot 3

boot 4

model

Cluster stability assessment:
Cluster method: kmeans

Full clustering results are given as the result parameter
of the clusterboot object, which also provides further statistics.

str(model)

List of 31
 $ result :List of 6
  ..$ result :List of 11
  .. ..$ cluster: int [1:366] 1 10 5 7 3 1 1 1 1 1 ...
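Among those further statistics, the per-cluster stability is summarised by the mean Jaccard similarity across the bootstrap runs, in the bootmean component; values near 1 indicate stable clusters. A brief sketch on synthetic data (fewer runs than above, to keep it quick):

```r
# Sketch: mean Jaccard stability per cluster from clusterboot().
library(fpc)
set.seed(42)
x  <- scale(matrix(rnorm(200*3), ncol=3))
cb <- clusterboot(x, clustermethod=kmeansCBI, runs=5, krange=3, seed=42)
round(cb$bootmean, 2)  # one value per cluster
```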


19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the squares of the distances between observations.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster it is calculated as the sum of the squared distances of each observation in the cluster from the cluster mean. The total within sum of squares is then the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, the clusters individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.
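The calculation can be checked directly against kmeans(): the within sum of squares for a cluster is the sum of squared distances of its observations from the cluster mean. A minimal sketch on synthetic data (the names x and wss are ours):

```r
# Sketch: recompute withinss by hand and compare to kmeans().
set.seed(42)
x     <- matrix(rnorm(100*3), ncol=3)
model <- kmeans(x, centers=4, nstart=10)

wss <- sapply(1:4, function(i)
{
  xi <- x[model$cluster == i, , drop=FALSE]
  sum(sweep(xi, 2, colMeans(xi))^2)
})
all.equal(as.numeric(wss), as.numeric(model$withinss))  # TRUE
```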

model$withinss

 [1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446
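The measures are tied together by the identity totss = tot.withinss + betweenss, which we can confirm on any kmeans() fit (synthetic data here):

```r
# Sketch: the total sum of squares decomposes into within plus between.
set.seed(42)
x <- matrix(rnorm(100*3), ncol=3)
m <- kmeans(x, centers=5, nstart=10)
all.equal(m$totss, m$tot.withinss + m$betweenss)  # TRUE
```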

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: total within sum of squares (totwithinss) and between sum of squares (betweenss) plotted against the number of clusters, 1 to 50.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk   <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}

crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830
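One way to read the elbow off these numbers rather than the plot is to look at the proportional reduction achieved by each additional cluster. A sketch using the values printed above (the name drop is ours):

```r
# Sketch: proportional drop in total within-SS per extra cluster.
crit <- c(5840, 4414, 3753, 3368, 3057, 2900, 2697, 2606, 2465, 2487,
          2310, 2228, 2173, 2075, 2108, 1970, 1937, 1882, 1846, 1830)
drop <- -diff(crit) / head(crit, -1)
round(drop, 2)  # large early drops, flattening (and jittering) as k grows
```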

dsc  <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled total within sum of squares against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
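Written out, for n observations and k clusters, CH = (betweenss / (k-1)) / (tot.withinss / (n-k)). A sketch computing it directly from a kmeans() fit on synthetic data:

```r
# Sketch: Calinski-Harabasz index from the kmeans sums of squares.
set.seed(42)
x <- scale(matrix(rnorm(150*3), ncol=3))
n <- nrow(x); k <- 4
m <- kmeans(x, k, nstart=10)
ch <- (m$betweenss / (k-1)) / (m$tot.withinss / (n-k))
ch
```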

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192 174

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc  <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled Calinski-Harabasz criterion against k = 1 to 20.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes for the Calinski-Harabasz criterion, and considerably longer for the average silhouette width criterion.

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174 192

Cluster means

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2
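The silhouette width itself can be computed for any clustering with cluster::silhouette(); the average over all observations is the quantity being maximised here. A sketch on synthetic data (the names x, m, and sil are ours):

```r
# Sketch: average silhouette width for a single k.
library(cluster)
set.seed(42)
x <- scale(matrix(rnorm(120*3), ncol=3))
m <- kmeans(x, 3, nstart=10)
sil <- silhouette(m$cluster, dist(x))
mean(sil[, "sil_width"])
```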

dsc  <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")

p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scaled average silhouette width against k = 1 to 20.]


Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 36: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

19 Evaluation of Clustering Quality

Numerous measures are available for evaluating a clustering. Many are stored within the data structure returned by kmeans().

The total sum of squares:

model <- kmeans(scale(ds[numi]), 10)

model$totss

[1] 5840

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

model$betweenss

[1] 3446

The basic concept is the sum of squares. This is typically a sum of the square of the distances between observations.
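The decomposition behind these numbers can be checked directly: for any kmeans() result, the total sum of squares equals the total within sum of squares plus the between sum of squares. A minimal sketch on synthetic data (the data here is illustrative, not the weather dataset):

```r
# Verify totss = tot.withinss + betweenss for a k-means result.
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol=2)
m <- kmeans(x, centers=3, nstart=10)

# Total sum of squares: squared distances of observations from the grand mean.
totss <- sum(scale(x, center=TRUE, scale=FALSE)^2)

stopifnot(abs(totss - m$totss) < 1e-8)
stopifnot(abs(m$totss - (m$tot.withinss + m$betweenss)) < 1e-6)
```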

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 35 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

20 Evaluation Within Sum of Squares

The within sum of squares is a measure of how close the observations are within the clusters. For a single cluster this is calculated as the average squared distance of each observation within the cluster from the cluster mean. Then the total within sum of squares is the sum of the within sum of squares over all clusters.

The total within sum of squares generally decreases as the number of clusters increases. As we increase the number of clusters, they individually tend to become smaller and the observations closer together within the clusters. As k increases, the changes in the total within sum of squares would be expected to reduce, and so it flattens out. A good value of k might be where the reduction in the total within sum of squares begins to flatten.

model$withinss

[1] 172.1 219.2 237.6 217.2 336.6 228.4 254.2 291.9 310.5 126.0

model$tot.withinss

[1] 2394

The total within sum of squares is a common measure that we aim to minimise in building a clustering.


21 Evaluation Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:

[Figure: tot.withinss and betweenss plotted against the number of clusters (1 to 50), both on a sum of squares axis from 0 to 6000.]


22 K-Means Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: scree plot of the scaled criterion (sum of withinss) against k = 1 to 20.]


23 K-Means Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
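The ratio can be computed directly from the components of a kmeans() result, which makes the definition concrete. A sketch on synthetic data (ch_index is an illustrative helper, not part of fpc):

```r
# Calinski-Harabasz: (BSS / (k - 1)) / (WSS / (n - k)).
ch_index <- function(x, m)
{
  n <- nrow(x)
  k <- nrow(m$centers)
  (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
}

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol=2)
m <- kmeans(x, 3, nstart=10)
ch_index(x, m)
```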

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion plotted against k = 1 to 20, peaking at k = 2.]


24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, the Calinski-Harabasz criterion took 30 minutes; the average silhouette width criterion took considerably longer.

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

[1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled average silhouette width criterion plotted against k = 1 to 20, peaking at k = 2.]


25 K-Means Using clusterCrit Calinski Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

[1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Figure: the scaled Calinski-Harabasz criterion from clusterCrit plotted against k = 1 to 20.]


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")

names(ic)

[1] "ball_hall"         "banfeld_raftery"   "c_index"
[4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
[7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"            "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten names for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra) plotted against k = 2 to 20.]


27 K-Means Plot All Criteria

[Figure: six panels plotting the remaining criteria against k = 2 to 20: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2
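The underlying idea can be sketched in a few lines of base R: score each new observation against every cluster centre and take the nearest. The helper below is illustrative, not rattle's implementation:

```r
# Assign each row of newdata to its nearest k-means centre.
assign_cluster <- function(model, newdata)
{
  d <- apply(model$centers, 1,
             function(ctr) rowSums(sweep(as.matrix(newdata), 2, ctr)^2))
  max.col(-d)   # argmin of the squared distances per row
}

# At convergence of Lloyd's algorithm, every training observation
# already sits with its nearest centre, so the assignments agree.
set.seed(42)
x <- matrix(rnorm(100 * 3), ncol=3)
m <- kmeans(x, 2, algorithm="Lloyd", iter.max=100)
stopifnot(all(assign_cluster(m, x) == m$cluster))
```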


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1,] 11      9.1     25.2      0.0         4.2     11.9              30

[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with observations coloured by cluster and the medoids marked with crosses.]

plot(model)


[Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), Component 1 versus Component 2; these two components explain 56.04% of the point variability.]


[Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"); n = 366, average silhouette width 0.14. Per-cluster sizes and average silhouette widths: 1: 49, 0.20; 2: 30, 0.17; 3: 23, 0.02; 4: 27, 0.10; 5: 34, 0.15; 6: 45, 0.14; 7: 44, 0.11; 8: 40, 0.23; 9: 26, 0.11; 10: 48, 0.09.]


31 Clara


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
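hclusterpar() parallelises the distance and agglomeration computation over nbproc cores; the single-core equivalent in base R is hclust() over a dist() matrix. A sketch on synthetic data (ward.D2 is base R's Ward linkage, analogous to but not byte-identical with amap's "ward" option):

```r
# Single-core hierarchical clustering with Ward linkage in base R.
set.seed(42)
x <- matrix(rnorm(50 * 3), ncol=3)

hc <- hclust(dist(x, method="euclidean"), method="ward.D2")
cl <- cutree(hc, k=4)   # cut the tree into four clusters
table(cl)
```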


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

[Figure: the "Cluster Dendrogram" of hclusterpar (ward), with a height axis and the ten clusters boxed by rect.hclust().]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Figure: the coloured dendrogram, height 0 to 1500, with each of the ten clusters drawn in a different colour and observation labels along the bottom.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a level plot.
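The conversion step can be sketched as follows, using mona() from the cluster package (the synthetic data frame here stands in for a real population with missing values):

```r
library(cluster)

# Synthetic data with values missing in each variable.
set.seed(42)
df <- data.frame(a=rnorm(30), b=rnorm(30), c=rnorm(30))
for (j in seq_along(df)) df[sample(30, 8), j] <- NA

# 1 = present, 0 = missing: one binary indicator per variable.
miss <- as.data.frame(lapply(df, function(x) as.integer(!is.na(x))))

# mona() requires all variables to be binary.
mn <- mona(miss)
```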


36 Self Organising Maps SOM

[Figure: self organising map of the weather data, titled "Weather Data", over min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz, having 4 cores and 12.3GB of RAM. It completed the processing at 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

all)))

names(crit) lt- substr(sub(_ names(ic)) 1 5) Shorten for plots

crit lt- dataframe(sapply(crit function(x) x[isnan(x)] lt- 0 x))

dsc lt- cbind(k=220 dataframe(sapply(crit scale)))

dscm lt- melt(dsc idvars=k variablename=Measure)

dscm$value[isnan(dscm$value)] lt- 0

ms lt- ascharacter(unique(dscm$Measure))

p lt- ggplot(subset(dscm Measure in ms[16]) aes(x=k y=value colour=Measure))

p lt- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p lt- p + scale_x_continuous(breaks=nk labels=nk)

p

minus2

0

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

Measure

ballh

banfe

cinde

calin

davie

detra

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 42 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

27 K-Means Plot All Criteria

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

dunn

gamma

gplus

gdi11

gdi12

gdi13

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi21

gdi22

gdi23

gdi31

gdi32

gdi33

minus1

0

1

2

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

gdi41

gdi42

gdi43

gdi51

gdi52

gdi53

minus2

0

2

4

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

ksqde

logde

logss

mccla

pbm

point

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

raytu

ratko

scott

sdsca

sddis

sdbw

minus2

minus1

0

1

2

3

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20k

valu

e

silho

tau

trace

trace1

wemme

xiebe

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 43 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams 2014) provides a predictkmeans() to assign new observations to their nearestmeans

setseed(42)

train lt- sample(nobs 07nobs)

test lt- setdiff(seq_len(nobs) train)

model lt- kmeans(ds[train numi] 2)

predict(model ds[test numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 44 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means: predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids: PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps: SOM
  • Further Reading and Acknowledgements
  • References

Data Science with R OnePageR Survival Guides Cluster Analysis

21 Evaluation: Between Sum of Squares

The between sum of squares is a measure of how far the clusters are from each other.

model$betweenss

[1] 3446

A good clustering will have a small within sum of squares and a large between sum of squares. Here we see the relationship between these two measures:
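The relationship between the two measures can be checked directly: kmeans() returns totss, tot.withinss and betweenss, and the first is the sum of the other two. A minimal sketch on simulated data (not the weather dataset used in this module):

```r
# Sketch: verify totss = tot.withinss + betweenss on simulated data.
set.seed(42)
x <- scale(matrix(rnorm(200), ncol=2))
m <- kmeans(x, centers=3)
isTRUE(all.equal(m$totss, m$tot.withinss + m$betweenss))
# [1] TRUE
```

So as k grows, the between sum of squares rises by exactly as much as the within sum of squares falls, which is what the plot below shows.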

(Figure: within (totwithinss) and between (betweenss) sum of squares plotted against the number of clusters, 0 to 50.)

Copyright © 2013-2014 Graham@togaware.com  Module: ClustersO  Page 37 of 56


22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

[1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075

[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scree plot of the scaled within sum of squares against k = 1 to 20.)


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k - 1) to the within sum of squares (divided by n - k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values indicating better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.
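That definition can be computed directly from the components a kmeans() fit returns; a minimal sketch on simulated data (not the module's own code):

```r
# Sketch: Calinski-Harabasz index = (betweenss/(k-1)) / (tot.withinss/(n-k)).
set.seed(42)
x <- scale(matrix(rnorm(300), ncol=3))
k <- 3
m <- kmeans(x, centers=k)
n <- nrow(x)
ch <- (m$betweenss / (k - 1)) / (m$tot.withinss / (n - k))
ch  # higher values indicate a better clustering
```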

library(fpc)
nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled Calinski-Harabasz criterion against k = 1 to 20; the maximum is at k = 2.)


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs took 30 minutes with the Calinski-Harabasz criterion, and longer again with the average silhouette width criterion.

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")

class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled average silhouette width criterion against k = 1 to 20; the maximum is at k = 2.)


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0

crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

(Figure: scaled Calinski-Harabasz values from clusterCrit against k = 1 to 20.)


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))
dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))
p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

p

(Figure: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, plotted against k = 2 to 20.)


27 K-Means: Plot All Criteria

(Figure: panels plotting the remaining criteria, scaled, against k = 2 to 20: dunn, gamma, gplus, gdi11-gdi13; gdi21-gdi33; gdi41-gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.)


28 K-Means: predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.
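Conceptually the prediction is a nearest-centre assignment. A hedged sketch of that idea on simulated data, with a hypothetical helper nearest() that is not part of rattle:

```r
# Hypothetical helper: assign each row of newdata to the closest of the
# fitted cluster centres, by squared Euclidean distance.
nearest <- function(centers, newdata)
{
  apply(as.matrix(newdata), 1,
        function(r) which.min(colSums((t(centers) - r)^2)))
}

set.seed(42)
x <- matrix(rnorm(100), ncol=2)
m <- kmeans(x, centers=2, algorithm="Lloyd", iter.max=100)
all(nearest(m$centers, x) == m$cluster)  # should be TRUE at convergence
```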

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

4 5 6 8 11 14 15 16 17 21 28 30 32 36 44 47 50 55

2 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 2 2

57 61 62 63 67 74 75 76 77 80 84 92 94 95 97 98 99 100

2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 1 2 2


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion

Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected; once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster model and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids: PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
      ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
 [1,] 11      9.1     25.2      0.0         4.2     11.9              30
 [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

(Figure: scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, coloured by cluster, with the medoids marked.)

plot(model)


(Figure: clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), plotting the clusters over the first two principal components. These two components explain 56.04% of the point variability.)


(Figure: silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes and average silhouette widths: 1: 49 (0.20), 2: 30 (0.17), 3: 23 (0.02), 4: 27 (0.10), 5: 34 (0.15), 6: 45 (0.14), 7: 44 (0.11), 8: 40 (0.23), 9: 26 (0.11), 10: 48 (0.09).)


31 Clara
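This section is left empty in the source. As a hedged placeholder, clara() from the cluster package (which also provides pam()) is the sampling-based variant of PAM intended for larger datasets; the data below is simulated, not the weather dataset:

```r
# Sketch: clara() repeatedly applies PAM to samples of the data and keeps
# the best result, so it scales to datasets where pam() itself is too slow.
library(cluster)
set.seed(42)
x <- matrix(rnorm(2000), ncol=4)
model <- clara(x, k=3, samples=20)
model$medoids            # one medoid row per cluster
table(model$clustering)  # cluster sizes
```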


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

(Figure: cluster dendrogram from hclusterpar (ward), with rectangles marking the 10 clusters.)


34 Add Colour to the Hierarchical Cluster

We use the dendroextras (Jefferis 2014) package to add colour to the dendrogram.

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

(Figure: coloured dendrogram of the 10 clusters, with observation numbers as leaf labels.)


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
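A hedged starting point for the exercise (the tiny data frame db is illustrative only): each variable is converted to a 1/0 missingness indicator, the binary form that mona() from the cluster package expects.

```r
# Sketch: build a binary missing/present indicator matrix from data with NAs;
# each cell becomes 1 (missing) or 0 (present).
db  <- data.frame(a=c(1, NA, 3, NA), b=c(NA, 2, 3, 4), c=c(5, NA, NA, 8))
bin <- as.data.frame(lapply(db, function(x) as.integer(is.na(x))))
bin
#   a b c
# 1 0 1 0
# 2 1 0 1
# 3 0 0 1
# 4 1 0 0
```

bin can then be passed to cluster::mona() for the hierarchical clustering, and lattice::levelplot(as.matrix(bin)) gives a levelplot of the missingness pattern.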


36 Self Organising Maps: SOM

(Figure: SOM codes plot, titled Weather Data, showing the contribution of each of the 14 variables, min_temp through cloud_3pm, at each map unit.)

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 39: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

22 K-Means: Selecting k Using Scree Plot

crit <- vector()
nk <- 1:20
for (k in nk)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, sum(m$withinss))
}
crit

 [1] 5840 4414 3753 3368 3057 2900 2697 2606 2465 2487 2310 2228 2173 2075
[15] 2108 1970 1937 1882 1846 1830

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Scree plot: scaled within sum of squares against k = 1 to 20.]
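Reading the elbow off the scree plot is usually done by eye. A rough heuristic can be sketched directly from the crit vector computed above: take the first k after which the successive drops in the within sum of squares become small relative to the largest drop. The 10% threshold below is an arbitrary choice, not part of the module.

```r
# Successive drops in the within sum of squares as k increases.
drops <- -diff(crit)

# A candidate elbow: the first k whose following drop is under 10%
# of the largest drop (the threshold is an arbitrary assumption).
elbow <- which(drops/max(drops) < 0.1)[1]
elbow
```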


23 K-Means: Selecting k Using Calinski-Harabasz

The Calinski-Harabasz criterion, also known as the variance ratio criterion, is the ratio of the between sum of squares (divided by k-1) to the within sum of squares (divided by n-k); the sum of squares is a measure of the variance. The relative values can be used to compare clusterings of a single dataset, with higher values being better clusterings. The criterion is said to work best for spherical clusters with compact centres (as with normally distributed data) using k-means with Euclidean distance.

library(fpc)

nk <- 1:20
model <- kmc <- kmeansruns(scale(ds[numi]), krange=nk, criterion="ch")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 192, 174

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1]   0.00 117.55 100.97  88.81  82.16  74.75  69.75  65.18  61.38  58.18
[11]  55.71  53.44  51.63  50.07  48.34  46.90  45.32  44.07  42.57  41.65

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kmc$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: scaled Calinski-Harabasz criterion against k = 1 to 20.]
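The criterion can also be computed directly from a kmeans fit, which makes the formula above concrete. A minimal sketch, assuming the ds and numi objects used throughout this module:

```r
x <- scale(ds[numi])
n <- nrow(x)
k <- 2
m <- kmeans(x, k)

# Calinski-Harabasz: (between SS / (k-1)) / (within SS / (n-k)).
ch <- (m$betweenss/(k - 1)) / (m$tot.withinss/(n - k))
ch
```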


24 K-Means: Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters with 10 runs took 30 minutes for the Calinski-Harabasz criterion, compared to considerably longer using the average silhouette width criterion [check timing].

library(fpc)

nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: scaled average silhouette width against k = 1 to 20.]
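For a single k the average silhouette width can also be computed directly with silhouette() from the cluster package (an assumption here, since fpc computes it internally). This shows where the measure comes from: each observation's silhouette compares its average distance to its own cluster with its average distance to the nearest other cluster.

```r
library(cluster)

x <- scale(ds[numi])
m <- kmeans(x, 2)

# Silhouette of each observation, then the average width over all.
sil <- silhouette(m$cluster, dist(x))
mean(sil[, "sil_width"])
```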


25 K-Means: Using clusterCrit Calinski_Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criterion.

library(clusterCrit)

crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: scaled Calinski-Harabasz criterion from clusterCrit against k = 1 to 20.]


26 K-Means: Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) { x[is.nan(x)] <- 0; x }))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, against k = 2 to 20.]


27 K-Means: Plot All Criteria

[Six plots of the remaining criteria, scaled, against k = 2 to 20, grouped as:
dunn, gamma, gplus, gdi11, gdi12, gdi13;
gdi21, gdi22, gdi23, gdi31, gdi32, gdi33;
gdi41, gdi42, gdi43, gdi51, gdi52, gdi53;
ksqde, logde, logss, mccla, pbm, point;
raytu, ratko, scott, sdsca, sddis, sdbw;
silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
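The idea can be sketched directly: each new observation is assigned to the cluster whose centre is nearest in Euclidean distance. The function below is illustrative only, not rattle's implementation, and the names assign.cluster and newdata are assumptions of this sketch.

```r
assign.cluster <- function(model, newdata)
{
  # Squared Euclidean distance from each observation to each centre.
  d <- apply(model$centers, 1,
             function(centre) colSums((t(newdata) - centre)^2))
  # Index of the nearest centre for each observation.
  apply(d, 1, which.min)
}

assign.cluster(model, ds[test, numi])
```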


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)

mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected, and once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

   min_temp max_temp rainfall evaporation sunshine wind_gust_speed
1         0        0      100           0        0               0
2         0        0        0         100        0               0
3         0        0      100           0        0               0
4         0        0        0           0        0               0
5         6        6        6           6        6               6
6         0        0        0         100        0               0
7         0        0        0         100        0               0
8         0        0        0           0        0               0
9         6        6        6           6        6               6
10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
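As a starting point for the first exercise, the cluster structure found by ewkm() can be seen by plotting the variable weights themselves. A sketch using the reshape2 and ggplot2 packages already loaded for this module (the melt() of a matrix with varnames is standard reshape2 behaviour):

```r
# One row per cluster/variable pair, with the entropy weight as value.
w <- melt(mewkm$weights, varnames=c("Cluster", "Variable"))

p <- ggplot(w, aes(x=Variable, y=factor(Cluster), fill=value))
p <- p + geom_tile() + ylab("Cluster")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
```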


30 Partitioning Around Medoids: PAM

library(cluster)

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster and the medoids marked.]

plot(model)


[Cluster plot: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), Component 1 against Component 2. These two components explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"), silhouette widths si from -0.2 to 1.0. Average silhouette width: 0.14. n = 366, 10 clusters Cj:

  j  nj  ave si
  1  49    0.20
  2  30    0.17
  3  23    0.02
  4  27    0.10
  5  34    0.15
  6  45    0.14
  7  44    0.11
  8  40    0.23
  9  26    0.11
 10  48    0.09]


31 Clara
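clara() (Clustering LARge Applications), also from the cluster package, behaves like pam() but repeatedly clusters samples of the data and keeps the best result, so it scales to datasets where pam() itself is too slow. A minimal sketch, assuming the ds and numi objects from earlier; the choice of samples=50 is illustrative:

```r
library(cluster)

# PAM on 50 random samples, retaining the best clustering found.
model <- clara(ds[numi], k=10, samples=50)
model$medoids
plot(model)
```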


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
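On a single processor much the same model can be sketched with base R, which makes explicit the distance matrix that hclusterpar() computes for us (note that base hclust() names the Ward method "ward.D" in recent R versions):

```r
# Pairwise Euclidean distances, then Ward agglomeration.
d <- dist(na.omit(ds[numi]), method="euclidean")
model.base <- hclust(d, method="ward.D")
```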


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster Dendrogram built by hclusterpar (ward), height axis 0 to 1500, with 10 cluster rectangles overlaid.]
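The dendrogram only displays the hierarchy. To obtain actual cluster memberships we cut the tree at a chosen number of clusters with cutree(), a sketch using the model built above:

```r
# Cut the hierarchy into 10 clusters and tabulate their sizes.
clusters <- cutree(model, k=10)
table(clusters)
```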


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Dendrogram with the 10 clusters coloured, height axis 0 to 1500, and observation numbers as leaf labels.]


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming missing by pattern. We can convert each variable to a binary 1/0, indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
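A hedged sketch of one way to approach this exercise. mona() is in the cluster package and requires all variables to be binary; the present/missing encoding and the dropping of constant columns below are assumptions of this sketch, not part of the module:

```r
library(cluster)
library(lattice)

# Encode missingness: 1 if a value is present, 0 if it is missing.
miss <- data.frame(sapply(ds, function(x) as.integer(!is.na(x))))

# mona() needs binary variables that actually vary, so drop constants.
miss <- miss[, sapply(miss, function(x) length(unique(x)) > 1)]

model <- mona(miss)
plot(model)

# A levelplot of the missingness matrix itself.
levelplot(t(as.matrix(miss)))
```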


36 Self Organising Maps: SOM

[SOM plot titled "Weather Data": a 5 by 4 hexagonal grid of codebook segments over min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm.]

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")
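Beyond the codebook plot, each observation is mapped to a winning unit on the grid, and the distribution of observations over units gives a quick sense of how balanced the map is. A sketch, assuming the unit.classif and distances components of the kohonen som object:

```r
# How many observations map to each of the 5 x 4 = 20 units.
table(model$unit.classif)

# Mean distance of observations from their unit's codebook vector.
mean(model$distances)
```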


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright © 2013-2014 Graham@togaware.com  Module: ClustersO

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids: PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps: SOM
  • Further Reading and Acknowledgements
  • References

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 41: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

24 K-Means Selecting k Using Average Silhouette Width

The average silhouette width criterion is more computationally expensive than the Calinski-Harabasz criterion, which is an issue for larger datasets. For a dataset of 50,000 observations and 15 scaled variables, testing from 10 to 40 clusters over 10 runs, the average silhouette width criterion took some 30 minutes, compared to just minutes for the Calinski-Harabasz criterion.

library(fpc)
nk <- 1:20
model <- kma <- kmeansruns(scale(ds[numi]), krange=nk, criterion="asw")
class(model)

[1] "kmeans"

model

K-means clustering with 2 clusters of sizes 174, 192

Cluster means:
  min_temp max_temp rainfall evaporation sunshine wind_gust_speed
....

model$crit

 [1] 0.0000 0.2255 0.2192 0.2225 0.1894 0.1908 0.1867 0.1811 0.1662 0.1502
[11] 0.1617 0.1512 0.1502 0.1490 0.1533 0.1557 0.1453 0.1459 0.1462 0.1460

model$bestk

[1] 2

dsc <- data.frame(k=nk, crit=scale(kma$crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: scaled average silhouette width criterion against k = 1 to 20.]
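To make the criterion concrete, the average silhouette width can be computed directly. The following is a minimal base R sketch, not fpc's implementation, and asw() is our own illustrative helper: for each observation, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a)/max(a, b).

```r
# Minimal base R sketch of the average silhouette width (illustrative only).
asw <- function(x, cl)
{
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(x)), function(i)
  {
    a <- mean(d[i, cl == cl[i] & seq_len(nrow(x)) != i])    # own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),             # nearest other
                    function(k) mean(d[i, cl == k])))
    (b - a) / max(a, b)
  })
  mean(s)
}

# Two well separated groups give an average silhouette width close to 1.
x <- rbind(matrix(rnorm(20, mean=0), ncol=2),
           matrix(rnorm(20, mean=10), ncol=2))
cl <- rep(1:2, each=10)
round(asw(x, cl), 2)
```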

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 40 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

25 K-Means Using clusterCrit: Calinski-Harabasz

The clusterCrit (Desgraupes 2013) package provides a comprehensive collection of clustering criteria. Here we illustrate its usage with the Calinski-Harabasz criterion. Do note that we obtain a different model here to that above, hence different calculations of the criteria.

library(clusterCrit)
crit <- vector()
for (k in 1:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- c(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                         "Calinski_Harabasz")))
}
crit[is.nan(crit)] <- 0
crit

 [1]  0.00 81.20 87.88 86.71 75.78 64.34 64.48 49.87 51.83 48.81 42.78
[12] 45.04 43.03 44.53 40.12 38.76 39.26 38.67 35.19 32.33

bestCriterion(crit, "Calinski_Harabasz")

[1] 3

In this case k = 3 is the optimum choice.

dsc <- data.frame(k=nk, crit=scale(crit))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
p <- ggplot(dscm, aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure))
p <- p + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p <- p + theme(legend.position="none")
p

[Plot: scaled Calinski-Harabasz criterion against k = 1 to 20.]
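The Calinski-Harabasz index itself is simple to compute from components a kmeans fit already reports: CH = (BSS/(k - 1)) / (WSS/(n - k)), so larger values indicate better separated, more compact clusters. A base R sketch, where ch.index() is our own hypothetical helper rather than clusterCrit's code:

```r
# Calinski-Harabasz index computed directly from a kmeans fit:
# (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)).
ch.index <- function(km, n)
{
  k <- nrow(km$centers)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

# On two well separated groups the index favours k=2 over k=5.
x <- rbind(matrix(rnorm(40, mean=0), ncol=2),
           matrix(rnorm(40, mean=8), ncol=2))
c(k2=ch.index(kmeans(x, 2, nstart=5), nrow(x)),
  k5=ch.index(kmeans(x, 5, nstart=5), nrow(x)))
```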


26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all the criteria here and then plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"
....

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]), m$cluster,
                                             "all")))
}
names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.
crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))
dscm <- melt(dsc, id.vars="k", variable.name="Measure")
dscm$value[is.nan(dscm$value)] <- 0
ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))
p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))
p <- p + scale_x_continuous(breaks=nk, labels=nk)
p

[Plot: the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, against k = 2 to 20.]


27 K-Means Plot All Criteria

[Six plots of the remaining criteria, scaled, against k = 2 to 20, grouped as: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
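Under the hood the assignment is just a nearest-centre calculation. A base R sketch of the idea, where nearest.centre() is our own illustration rather than rattle's code:

```r
# Assign each new observation to the cluster whose centre is nearest
# in squared Euclidean distance.
nearest.centre <- function(model, newdata)
{
  apply(newdata, 1, function(obs)
    which.min(colSums((t(model$centers) - obs)^2)))
}

# At convergence k-means leaves every training point with its nearest
# centre, so on the training data the sketch reproduces m$cluster.
x <- rbind(matrix(rnorm(20, mean=0), ncol=2),
           matrix(rnorm(20, mean=5), ncol=2),
           matrix(rnorm(20, mean=10), ncol=2))
m <- kmeans(x, 3, nstart=5)
all(nearest.centre(m, x) == m$cluster)
```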


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
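For the rescaling exercise, a simple 0-1 transformation of each numeric column is enough. A base R sketch, where rescale01() is our own helper (rattle's rescaling support could be used instead), illustrated on a builtin dataset:

```r
# Map each numeric column onto [0, 1] so no variable dominates the
# entropy weights simply because of its measurement scale.
rescale01 <- function(x)
  (x - min(x, na.rm=TRUE)) / diff(range(x, na.rm=TRUE))

dss <- as.data.frame(lapply(mtcars, rescale01))
sapply(dss, range)  # every column now runs from 0 to 1
```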


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39
....

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation, and sunshine, with observations coloured by cluster and the medoids marked with crosses.]

plot(model)


[Cluster plot: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), Component 1 versus Component 2. These two components explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"):
n = 366, 10 clusters Cj, average silhouette width 0.14.

  j  nj  ave(i in Cj) si
  1  49  0.20
  2  30  0.17
  3  23  0.02
  4  27  0.10
  5  34  0.15
  6  45  0.14
  7  44  0.11
  8  40  0.23
  9  26  0.11
 10  48  0.09]


31 Clara
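This section is a stub in the source. As a sketch of where it would go: clara() from the cluster package scales partitioning around medoids to larger datasets by applying pam() to samples of the data and keeping the best resulting set of medoids. A minimal illustration on toy data (for the weather data, ds[numi] would take the place of x):

```r
library(cluster)

# Two well separated groups; clara() clusters sampled subsets with pam()
# and retains the medoids that give the best overall clustering.
x <- rbind(matrix(rnorm(100, mean=0), ncol=2),
           matrix(rnorm(100, mean=5), ncol=2))
model <- clara(x, k=2, samples=10)
model$medoids            # one representative observation per cluster
table(model$clustering)  # cluster sizes
```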


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
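With nbproc=1 this is a single-process hierarchical clustering, much like base R's hclust(); either way, cluster memberships can then be extracted by cutting the tree at a chosen number of groups. A base R sketch on toy data:

```r
# Build a hierarchical clustering and cut it into k groups with cutree().
x <- matrix(rnorm(60), ncol=2)
h <- hclust(dist(x), method="ward.D")
cl <- cutree(h, k=3)
table(cl)  # size of each of the 3 clusters
```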


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster Dendrogram of the hclusterpar (ward) model, height on the vertical axis, with 10 cluster rectangles.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram: the 366 observations cut into 10 coloured clusters, with heights from 0 to 1500 on the vertical axis.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a level plot.
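A sketch of the suggested approach on a toy data frame (the exercise proper would use the weather dataset): each cell is recoded 0/1 for present/missing, and mona() from the cluster package then builds a monothetic hierarchy on the binary matrix.

```r
library(cluster)

# Toy data with injected missing values.
df <- data.frame(a=c(1, NA, 3, NA, 5, 6),
                 b=c(NA, 2, 3, 4, NA, 6),
                 c=c(1, 2, NA, 4, 5, NA))

# Recode each cell: 1 = missing, 0 = present.
miss <- as.data.frame(lapply(df, function(x) as.integer(is.na(x))))

model <- mona(miss)
model$order  # observations arranged by the monothetic hierarchy
```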


36 Self Organising Maps SOM

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")

[SOM plot "Weather Data": codebook segments for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, and cloud_3pm on a 5x4 hexagonal grid.]


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!, Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 43: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

26 K-Means Compare All Criteria

We can generate all criteria and then plot them. There are over 40 criteria, and they are noted on the help page for intCriteria(). We generate all of the criteria here and plot the first 6 below, with the remainder in the following section.

m <- kmeans(scale(ds[numi]), 5)
ic <- intCriteria(as.matrix(ds[numi]), m$cluster, "all")
names(ic)

 [1] "ball_hall"         "banfeld_raftery"   "c_index"
 [4] "calinski_harabasz" "davies_bouldin"    "det_ratio"
 [7] "dunn"              "gamma"             "g_plus"
[10] "gdi11"             "gdi12"             "gdi13"

crit <- data.frame()
for (k in 2:20)
{
  m <- kmeans(scale(ds[numi]), k)
  crit <- rbind(crit, as.numeric(intCriteria(as.matrix(ds[numi]),
                                             m$cluster, "all")))
}

names(crit) <- substr(sub("_", "", names(ic)), 1, 5) # Shorten for plots.

crit <- data.frame(sapply(crit, function(x) {x[is.nan(x)] <- 0; x}))

dsc <- cbind(k=2:20, data.frame(sapply(crit, scale)))

dscm <- melt(dsc, id.vars="k", variable.name="Measure")

dscm$value[is.nan(dscm$value)] <- 0

ms <- as.character(unique(dscm$Measure))

p <- ggplot(subset(dscm, Measure %in% ms[1:6]), aes(x=k, y=value, colour=Measure))

p <- p + geom_point(aes(shape=Measure)) + geom_line(aes(linetype=Measure))

p <- p + scale_x_continuous(breaks=nk, labels=nk)

p

[Plot of the first six criteria (ballh, banfe, cinde, calin, davie, detra), scaled, against k = 2 to 20.]

Copyright © 2013-2014 Graham@togaware.com Module: ClustersO Page 42 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

27 K-Means Plot All Criteria

[Six panels of scaled criterion values against k = 2 to 20, covering: dunn, gamma, gplus, gdi11, gdi12, gdi13; gdi21, gdi22, gdi23, gdi31, gdi32, gdi33; gdi41, gdi42, gdi43, gdi51, gdi52, gdi53; ksqde, logde, logss, mccla, pbm, point; raytu, ratko, scott, sdsca, sddis, sdbw; silho, tau, trace, trace1, wemme, xiebe.]


28 K-Means predict()

rattle (Williams 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)

predict(model, ds[test, numi])

  4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
  2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
 57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
  2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
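The nearest-mean assignment that predict.kmeans() performs can be sketched in a few lines of base R. This is an illustrative reimplementation on synthetic data, not rattle's actual code; assign.cluster() is a hypothetical helper name.

```r
# Sketch of nearest-centre assignment, the idea behind predict.kmeans().
# Synthetic data stands in for the weather dataset.
set.seed(42)
x <- matrix(rnorm(40), ncol=2)        # 20 "training" observations
m <- kmeans(x, centers=2)

new <- matrix(rnorm(10), ncol=2)      # 5 new observations

assign.cluster <- function(newdata, centers)
{
  # Squared Euclidean distance from each new row to each centre.
  d <- apply(centers, 1, function(ctr) colSums((t(newdata) - ctr)^2))
  apply(d, 1, which.min)              # index of the closest centre
}

assign.cluster(new, m$centers)
```

Each new observation is simply labelled with the cluster whose centre is closest in the (scaled) variable space used to build the model.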


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al. 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

Warning: NAs introduced by coercion
Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected; once again, only the numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

Clustering converged. Terminate.

round(100*mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster, and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.
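As a starting point for the rescaling exercise, each variable can be mapped onto [0, 1] so that no variable dominates the entropy weights purely through its scale. rescale01() is an illustrative helper, not part of wskm, and the tiny data frame is synthetic.

```r
# Map each numeric variable onto [0, 1].
rescale01 <- function(x)
  (x - min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE) - min(x, na.rm=TRUE))

df  <- data.frame(a=c(1, 5, 9), b=c(100, 300, 500))
dfr <- as.data.frame(lapply(df, rescale01))
dfr                                   # every column now spans 0 to 1
```

The rescaled frame can then be clustered as before, e.g. ewkm() applied to lapply(ds[numi], rescale01) wrapped back into a data frame.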


30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation, and sunshine, with points coloured by cluster and medoids marked by crosses.]

plot(model)


[Clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): the data projected onto the first two components, which explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean").
Silhouette width si ranges from -0.2 to 1.0. Average silhouette width: 0.14.
n = 366, 10 clusters Cj:

  j   nj   ave(i in Cj) si
  1   49   0.20
  2   30   0.17
  3   23   0.02
  4   27   0.10
  5   34   0.15
  6   45   0.14
  7   44   0.11
  8   40   0.23
  9   26   0.11
 10   48   0.09]


31 Clara
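This section is a placeholder in the source. As a sketch of the technique the heading names, clara() from the cluster package runs a sampling-based version of PAM suited to larger datasets; synthetic data stands in here for the weather dataset.

```r
# Sketch of clara(): PAM applied to repeated samples of the data,
# keeping the best set of medoids found.
library(cluster)
set.seed(42)
x <- rbind(matrix(rnorm(100, mean=0), ncol=2),
           matrix(rnorm(100, mean=5), ncol=2))
model <- clara(x, k=2, samples=10)
model$medoids                         # one representative row per cluster
table(model$clustering)               # cluster sizes
```

Because each PAM run only sees a sample, clara() scales to datasets where pam() itself would be too slow or memory hungry.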


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster dendrogram: height against the hclusterpar(*, "ward") merge order, with rectangles drawn around the 10 clusters.]
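Beyond drawing rectangles, the cluster membership itself can be extracted with cutree(). The sketch below uses base hclust() on synthetic data; the hclusterpar() result is also of class "hclust", so the same call should apply to the model built above.

```r
# Extract cluster membership from a dendrogram with cutree().
set.seed(42)
x <- matrix(rnorm(60), ncol=2)        # 30 observations
hc <- hclust(dist(x), method="ward.D")
clusters <- cutree(hc, k=10)
table(clusters)                       # observations per cluster
```

The resulting integer vector can then be used to colour plots or profile clusters, just as with the k-means cluster assignments earlier.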


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram of the 10 clusters, heights 0 to 1500, with leaves labelled by observation number.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming data is missing by pattern. We can convert each variable to a binary (1/0), indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
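A starting point for the exercise, sketched on a small synthetic data frame: build a binary missing/present indicator matrix and hand it to mona() from the cluster package. The variable names and values here are made up for illustration.

```r
# Cluster observations by their pattern of missing values.
library(cluster)
df <- data.frame(a=c(1, NA, 3, NA),
                 b=c(NA, 2, 3, 4),
                 c=c(1, 2, NA, 4))
mb <- as.data.frame(lapply(df, function(x) as.integer(is.na(x))))
mb                                    # 1 = missing, 0 = present

model <- mona(mb)                     # monothetic analysis of binary data
model$order                           # observation ordering for plotting
```

A levelplot of the indicator matrix (e.g. lattice::levelplot), with rows ordered by model$order, then visualises the missingness patterns by cluster.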


36 Self Organising Maps SOM

[SOM plot titled "Weather Data": a 5 x 4 hexagonal grid of nodes, each showing a codebook segment for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, and cloud_3pm.]

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 45: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

28 K-Means predict()

rattle (Williams, 2014) provides predict.kmeans() to assign new observations to their nearest means.

set.seed(42)
train <- sample(nobs, 0.7*nobs)
test  <- setdiff(seq_len(nobs), train)
model <- kmeans(ds[train, numi], 2)
predict(model, ds[test, numi])

##   4   5   6   8  11  14  15  16  17  21  28  30  32  36  44  47  50  55
##   2   2   2   2   1   1   1   1   1   1   1   2   1   2   1   1   2   2
##  57  61  62  63  67  74  75  76  77  80  84  92  94  95  97  98  99 100
##   2   1   1   1   1   1   2   1   1   2   1   1   1   1   2   1   2   2
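The nearest-mean assignment can also be sketched directly, without rattle. The helper below is illustrative only, not rattle's implementation: it computes the squared Euclidean distance from each new observation to each cluster centre and picks the closest.

```r
# Sketch: assign new observations to the nearest centre by squared
# Euclidean distance. nearest.centre() is a hypothetical helper.
nearest.centre <- function(centers, newdata)
{
  newdata <- as.matrix(newdata)
  # d[i, j] is the squared distance from observation i to centre j.
  d <- apply(centers, 1, function(ctr)
    rowSums(sweep(newdata, 2, ctr)^2))
  # For each observation, the index of the closest centre.
  apply(d, 1, which.min)
}

centers <- matrix(c(0, 0,
                    5, 5), nrow=2, byrow=TRUE)
nearest.centre(centers, data.frame(x=c(0.2, 4.8), y=c(-0.1, 5.3)))
```

Applied to a kmeans model, the same idea would use model$centers as the centres.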

Copyright © 2013-2014 Graham@togaware.com. Module: ClustersO.


29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables, particularly if there are many variables. Subspace clustering and bicluster analysis are approaches to doing this.

We illustrate the concept here using wskm (Williams et al., 2012) for weighted subspace k-means. We use ewkm() (entropy weighted k-means).

set.seed(42)
library(wskm)
mewkm <- ewkm(ds, 10)

## Warning: NAs introduced by coercion
## Error: NA/NaN/Inf in foreign function call (arg 1)

The error is expected: once again, only numeric variables can be clustered.

mewkm <- ewkm(ds[numi], 10)

## Clustering converged. Terminate.

round(100*mewkm$weights)

##    min_temp max_temp rainfall evaporation sunshine wind_gust_speed
## 1         0        0      100           0        0               0
## 2         0        0        0         100        0               0
## 3         0        0      100           0        0               0
## 4         0        0        0           0        0               0
## 5         6        6        6           6        6               6
## 6         0        0        0         100        0               0
## 7         0        0        0         100        0               0
## 8         0        0        0           0        0               0
## 9         6        6        6           6        6               6
## 10        0        0      100           0        0               0

Exercise: Plot the clusters.

Exercise: Rescale the data so all variables have the same range, then rebuild the cluster model and comment on the differences.

Exercise: Discuss why ewkm might be better than k-means. Consider the number of variables as an advantage, particularly in the context of the curse of dimensionality.


30 Partitioning Around Medoids (PAM)

model <- pam(ds[numi], 10, FALSE, "euclidean")
summary(model)

## Medoids:
##       ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
##  [1,] 11      9.1     25.2      0.0         4.2     11.9              30
##  [2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)
points(model$medoids, col=1:10, pch=4)

[Scatterplot matrix of min_temp, max_temp, rainfall, evaporation and sunshine, with points coloured by cluster membership and the medoids marked with crosses.]

plot(model)


[clusplot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): Component 1 versus Component 2. These two components explain 56.04% of the point variability.]


[Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean"): n = 366, 10 clusters, average silhouette width 0.14. Cluster sizes nj and per-cluster average silhouette widths:]

 j  nj  ave si
 1  49    0.20
 2  30    0.17
 3  23    0.02
 4  27    0.10
 5  34    0.15
 6  45    0.14
 7  44    0.11
 8  40    0.23
 9  26    0.11
10  48    0.09
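The silhouette values shown in the plot can be computed directly with silhouette() from cluster. A self-contained sketch on toy data with two well-separated groups, so the widths come out high:

```r
library(cluster)

# Toy data: two clearly separated groups of 20 points each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean=0), ncol=2),
           matrix(rnorm(40, mean=5), ncol=2))

model <- pam(x, 2)
sil   <- silhouette(model)

# Overall average silhouette width, as reported under the plot.
summary(sil)$avg.width

# Per-cluster averages, as listed beside each cluster in the plot.
tapply(sil[, "sil_width"], sil[, "cluster"], mean)
```

With well-separated groups the average width is close to 1; values near 0 indicate points on cluster boundaries, and negative values suggest misassignment.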


31 Clara
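clara() from cluster scales Partitioning Around Medoids to larger datasets by clustering repeated samples of the data and keeping the best set of medoids. A hedged sketch of how it might be applied to the same numeric data, assuming ds and numi as loaded earlier in the module; samples=50 is an illustrative choice, not a recommendation:

```r
library(cluster)

# clara() runs pam() on repeated samples and keeps the best medoids,
# which scales to datasets too large for pam() itself.
model <- clara(ds[numi], 10, samples=50)

model$medoids            # medoids from the best sample
head(model$clustering)   # cluster membership of the first observations
```

For a dataset of this size pam() itself is fine; clara() becomes useful once computing the full dissimilarity matrix is impractical.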


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas, 2011):

library(amap)
model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)
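hclusterpar() returns a standard hclust object, so cluster memberships can be extracted by cutting the tree. A brief sketch, assuming the model built above; k=10 matches the plots that follow:

```r
# Cut the dendrogram to obtain 10 clusters and inspect their sizes.
clusters <- cutree(model, k=10)
table(clusters)
```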


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler, 2014):

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters:

rect.hclust(model, k=10)

[Cluster dendrogram of hclusterpar(*, "ward") with height on the vertical axis and rectangles marking the 10 clusters.]


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis, 2014) package to add colour to the dendrogram:

library(dendroextras)
plot(colour_clusters(model, k=10), xlab="")

[Coloured dendrogram: the 10 clusters are distinguished by colour, with observation numbers as leaf labels and height on the vertical axis.]


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations that exhibit similar patterns of behaviour, assuming missing by pattern. We can convert each variable to binary, 1/0 indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
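One way the exercise might be approached (a hedged sketch: the presence indicator and the constant-column filter are illustrative, and it assumes ds still retains its missing values):

```r
library(cluster)

# 1/0 indicator of presence: 1 where a value is observed, 0 where missing.
present <- as.data.frame(lapply(ds, function(v) as.integer(!is.na(v))))

# mona() needs binary variables that actually vary, so drop constants
# (variables that are never, or always, missing).
varies <- sapply(present, function(v) length(unique(v)) > 1)
model  <- mona(present[varies])

# Banner plot of the monothetic divisions.
plot(model)
```

mona() is monothetic: each split uses a single (binary) variable, so the resulting hierarchy is directly interpretable in terms of which variables are missing together.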


36 Self Organising Maps (SOM)

library(kohonen)
set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))
plot(model, main="Weather Data")

[SOM codes plot titled "Weather Data": codebook vectors for min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am and cloud_3pm over a 5x4 hexagonal grid.]
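The som object records the best-matching unit for each observation, which we can use to see how the observations distribute over the map. A sketch assuming the model just built:

```r
# Winning (best-matching) unit for the first few observations.
head(model$unit.classif)

# How many observations map to each of the 5x4 = 20 units.
table(model$unit.classif)

# The counts can also be visualised directly on the grid.
plot(model, type="counts")
```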


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes, 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.


  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation: Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics: Iterative Cluster Search
  • K-Means: Using kmeans()
  • Scaling Datasets
  • K-Means: Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster: Radial Plot Using GGPlot2
  • Visualize the Cluster: Radial Plot with K=4
  • Visualise the Cluster: Cluster Profiles with Radial Plot
  • Visualise the Cluster: Single Cluster Radial Plot
  • Visualise the Cluster: Grid of Radial Plots
  • K-Means: Base Case Cluster
  • K-Means: Multiple Starts
  • K-Means: Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation: Within Sum of Squares
  • Evaluation: Between Sum of Squares
  • K-Means: Selecting k Using Scree Plot
  • K-Means: Selecting k Using Calinski-Harabasz
  • K-Means: Selecting k Using Average Silhouette Width
  • K-Means: Using clusterCrit Calinski_Harabasz
  • K-Means: Compare All Criteria
  • K-Means: Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids (PAM)
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster: Binary Variables
  • Self Organising Maps (SOM)
  • Further Reading and Acknowledgements
  • References
Page 46: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

29 Entropy Weighted K-Means

Sometimes it is better to build clusters based on subsets of variables particularly if there aremany variables Subspace clustering and bicluster analysis are approaches to doing this

We illustrate the concept here using wskm (Williams et al 2012) for weighted subspace k-meansWe use the ewkm() (entropy weighted k-means)

setseed(42)

library(wskm)

mewkm lt- ewkm(ds 10)

Warning NAs introduced by coercion

Error NANaNInf in foreign function call (arg 1)

The error is expected and once again only numeric variables can be clustered

mewkm lt- ewkm(ds[numi] 10)

Clustering converged Terminate

round(100mewkm$weights)

min_temp max_temp rainfall evaporation sunshine wind_gust_speed

1 0 0 100 0 0 0

2 0 0 0 100 0 0

3 0 0 100 0 0 0

4 0 0 0 0 0 0

5 6 6 6 6 6 6

6 0 0 0 100 0 0

7 0 0 0 100 0 0

8 0 0 0 0 0 0

9 6 6 6 6 6 6

10 0 0 100 0 0 0

Exercise Plot the clusters

Exercise Rescale the data so all variables have the same range and then rebuild the clusterand comment on the differences

Exercise Discuss why ewkm might be better than k-means Consider the number of vari-ables as an advantage particularly in the context of the curse of dimensionality

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 45 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model lt- pam(ds[numi] 10 FALSE euclidean)

summary(model)

Medoids

ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed

[1] 11 91 252 00 42 119 30

[2] 38 165 282 40 42 88 39

plot(ds[numi[15]] col=model$clustering)

points(model$medoids col=110 pch=4)

min_temp

10 20 30

0 4 8 12

minus5

510

20

1020

30

max_temp

rainfall

010

2030

40

04

812

evaporation

minus5 5 10 20

0 10 20 30 40

0 4 8 12

04

812

sunshine

plot(model)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 46 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 47: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

30 Partitioning Around Medoids PAM

model <- pam(ds[numi], 10, FALSE, "euclidean")

summary(model)

Medoids:
     ID min_temp max_temp rainfall evaporation sunshine wind_gust_speed
[1,] 11      9.1     25.2      0.0         4.2     11.9              30
[2,] 38     16.5     28.2      4.0         4.2      8.8              39

plot(ds[numi[1:5]], col=model$clustering)

points(model$medoids, col=1:10, pch=4)

(Scatterplot matrix of min_temp, max_temp, rainfall, evaporation, and sunshine, with points coloured by cluster and the medoids marked by crosses.)

plot(model)


(Cluster plot: clusplot(pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean")), showing the observations against Component 1 and Component 2. These two components explain 56.04% of the point variability.)


Silhouette plot of pam(x = ds[numi], k = 10, diss = FALSE, metric = "euclidean").
Average silhouette width: 0.14. n = 366, 10 clusters Cj:

  j  nj | ave(i in Cj) si
  1  49 | 0.20
  2  30 | 0.17
  3  23 | 0.02
  4  27 | 0.10
  5  34 | 0.15
  6  45 | 0.14
  7  44 | 0.11
  8  40 | 0.23
  9  26 | 0.11
 10  48 | 0.09


31 Clara
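This section is otherwise empty in the source. As a hedged sketch, clara() from the cluster package (which ships with R) applies the PAM algorithm to repeated subsamples, making it practical for larger datasets than pam() itself; the ds and numi objects below are assumed to be those loaded earlier in the module.

```r
library(cluster)

# clara() clusters large datasets by running PAM on repeated subsamples
# and keeping the best set of medoids found across the samples.
model <- clara(na.omit(ds[numi]), k=10, samples=50)

model$medoids            # medoid rows, analogous to pam()
table(model$clustering)  # cluster sizes
```

As with pam(), plot(model) produces the clusplot and silhouette displays shown in the previous section.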


32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011).

library(amap)

model <- hclusterpar(na.omit(ds[numi]),
                     method="euclidean",
                     link="ward",
                     nbproc=1)


33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014).

plot(model, main="Cluster Dendrogram", xlab="", labels=FALSE, hang=0)

Add in rectangles to show the clusters.

rect.hclust(model, k=10)

(Dendrogram titled "Cluster Dendrogram", x-axis labelled hclusterpar(*, "ward"), y-axis Height, with rectangles marking the k=10 clusters.)


34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram:

library(dendroextras)

plot(colour_clusters(model, k=10), xlab="")

(Coloured dendrogram of the 366 observations cut into k=10 clusters; the y-axis (Height) runs from 0 to 1500 and the leaf labels are the observation numbers.)


35 Hierarchical Cluster: Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour (assuming the data is missing by some pattern). We can convert each variable to binary (1/0), indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
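One hedged sketch of this exercise, assuming ds is the weather dataset loaded earlier; mona() comes from the cluster package and levelplot() from lattice:

```r
library(cluster)
library(lattice)

# Binary missingness matrix: 1 = value missing, 0 = value present.
miss <- as.data.frame(lapply(ds, function(x) as.integer(is.na(x))))

# mona() needs binary variables that actually vary, so drop constant columns.
miss <- miss[, sapply(miss, function(x) length(unique(x)) > 1), drop=FALSE]

# Monothetic hierarchical clustering on the missingness patterns.
model <- mona(miss)
plot(model)

# Levelplot of the missingness matrix, observations ordered as in the banner.
levelplot(t(as.matrix(miss[model$order, ])),
          xlab="Variable", ylab="Observation")
```

Observations with similar missingness patterns end up adjacent in the banner ordering, so blocks of missing values show up as contiguous bands in the levelplot.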


36 Self Organising Maps (SOM)

(SOM plot titled "Weather Data": a grid of unit codebook profiles over min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, and cloud_3pm.)

library(kohonen)

set.seed(42)
model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")
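As a small hedged follow-on (assuming the model fitted above), the kohonen object records which map unit each observation was assigned to, which is useful for profiling the map:

```r
# Which of the 5 x 4 = 20 units each observation was mapped to.
head(model$unit.classif)

# How many observations fall in each unit of the grid.
table(model$unit.classif)
```

Units with many observations mark dense regions of the weather data; empty units suggest the grid is larger than the data requires.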


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the marked links on the website, which indicate the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R, by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar (or radial) plot code originated from an RStudio blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package, also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 48: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

minus6 minus4 minus2 0 2 4

minus4

minus2

02

46

clusplot(pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean))

Component 1

Com

pone

nt 2

These two components explain 5604 of the point variability

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 47 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 49: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

Silhouette width si

minus02 00 02 04 06 08 10

Silhouette plot of pam(x = ds[numi] k = 10 diss = FALSE metric = euclidean)

Average silhouette width 014

n = 366 10 clusters Cj

j nj | aveiisinCj si

1 49 | 020

2 30 | 017

3 23 | 002

4 27 | 010

5 34 | 015

6 45 | 014

7 44 | 011

8 40 | 023

9 26 | 011

10 48 | 009

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 48 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 50: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

31 Clara

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 49 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

[Figure: coloured dendrogram with each of the 10 clusters drawn in a different colour; leaf labels are observation numbers and the vertical axis runs from 0 to 1500.]
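To recover which observations fall into each coloured group, dendroextras also provides slice(), which, like cutree(), cuts the tree into k groups but numbers them in dendrogram (left-to-right) order, so the group numbers line up with the colours. A sketch, rebuilt here on a built-in dataset so the snippet stands alone:

```r
library(dendroextras)

# Rebuild a comparable hierarchical model on a built-in dataset.
model <- hclust(dist(USArrests), method="ward.D")

# slice() cuts into 10 groups numbered in dendrogram order, matching
# the colours drawn by colour_clusters().
groups <- slice(model, k=10)
table(groups)
```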


35 Hierarchical Cluster Binary Variables

Exercise: Clustering a large population based on the patterns of missing data within the population is a technique for grouping observations exhibiting similar patterns of behaviour, assuming the data is missing by pattern. We can convert each variable to a binary (1/0), indicating present/missing, and then use mona() for a hierarchical clustering. Demonstrate this. Include a levelplot.
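One way to start the exercise, sketched on a toy data frame rather than the weather data (mona() comes from the recommended cluster package and expects purely binary input):

```r
library(cluster)

# Toy stand-in for the weather data: a data frame with scattered missing values.
ds <- data.frame(a=c(1, NA, 3, NA, 5),
                 b=c(NA, 2, 3, 4, NA),
                 c=c(1, 2, NA, 4, 5))

# Convert each variable to a 1/0 indicator: 1 = present, 0 = missing.
miss <- as.data.frame(lapply(ds, function(x) as.integer(!is.na(x))))

# mona() builds a monothetic hierarchical clustering from the binary patterns.
model <- mona(miss)
model$order  # ordering of observations induced by the hierarchy
```

A levelplot of the miss matrix (rows ordered by model$order) then visualises the missingness patterns grouped by the clustering.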


36 Self Organising Maps SOM

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid = somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")

[Figure: SOM codes plot titled "Weather Data", showing the 14 numeric variables (min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm) across the 5x4 hexagonal grid.]
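Each observation is assigned to one of the 5x4 = 20 map units, recorded in the unit.classif component of the fitted model. A small sketch on the built-in iris measurements, standing in for the weather variables (which this snippet does not reload):

```r
library(kohonen)
set.seed(42)

# Train a small SOM; iris's four numeric columns stand in for the weather data.
data <- scale(as.matrix(iris[, 1:4]))
model <- som(data, grid = somgrid(3, 3, "hexagonal"))

# unit.classif maps each observation to one of the 3x3 = 9 units.
table(model$unit.classif)
```

The per-unit counts give a quick sense of how evenly the map has spread the observations.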


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a * which indicates the generally more developed OnePageR modules.

Other resources include:

Practical Data Science with R by Nina Zumel and John Mount, March 2014, has a good chapter on Cluster Analysis with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog Posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit (Desgraupes 2013) package. Also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45–55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 51: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

32 Hierarchical Cluster in Parallel

Use hclusterpar() from amap (Lucas 2011)

library(amap)

model lt- hclusterpar(naomit(ds[numi])

method=euclidean

link=ward

nbproc=1)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 50 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 52: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

33 Plotting Hierarchical Cluster

Plot from cba (Buchta and Hahsler 2014)

plot(model main=Cluster Dendrogram xlab= labels=FALSE hang=0)

Add in rectangles to show the clusters

recthclust(model k=10)

050

010

0015

00

Cluster Dendrogram

hclusterpar ( ward)

Hei

ght

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 51 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 53: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

34 Add Colour to the Hierarchical Cluster

Using the dendroextras (Jefferis 2014) package to add colour to the dendrogram

library(dendroextras)

plot(colour_clusters(model k=10) xlab=)0

500

1000

1500

331

290

295

287

306

294

311

297

358

321

357 65 78 121 47 152

151

324

335

326

317

334

115

116

333

318

337

338

366

325

362 57 327

348

154

320

126

130 12 56 108

122

356 44 106 83 84 13 85 20 71 59 87 125

127

131

138 25 134 94 110 93 144 24 49 30 36 58 66 103

111 67 76 51 112 2 38 32 54 102

117 40 45 68 26 140 92 69 70 336

360

361 15 363

132

136 62 63 135 61 137

352

355

332

354 86 95 141

365 17 39 88 119

133

118

139 21 72 73 74 91 128 18 28 16 90 60 89 14 19 191

192

177

190

203

181

313

185

186

161

194

169

175

350

176

193

229

235

322

219

204

206

248

205

230

211

241

213

220

208

209

210

293

236

255

269

273

246

278

266

247

267

304

207

212

148

149

187

312

268

303

182

258

242

299

195

196

197

150

178

292

300

201

274

343

184

323

265

314

298

301

291

315

302

307

153

156

173

157

351 10 164

107

163

129

162 11 124

344

346

347

329

330

167

168

123

158

345

353

316 1

359

104

113

224

339

165

198 97 22 96 50 145

146 46 80 31 23 105

260

202

234

257

237

249

281

279

283

263

251

270

252

256

215

232

259

214

231

250

216

223

233 41 142

143

183

188

172 8

171

308

309

217

218

310

221

222

159

160 55 109

364 9

170 48 7 42 79 27 43 328

296

342

262

272

189

200 6

228 5

280

282

285

227

238

166

240

277

340

174

225 34 98 114 4 53 101

349

100

147 52 3 81 99 120 29 37 77 64 75 82 33 35 275

199

261

284

253

305

319

179

155

341

180

264

254

286

226

239

288

289

271

245

276

243

244

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 52 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 53 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

36 Self Organising Maps SOM

min_tempmax_temprainfallevaporationsunshine

wind_gust_speedwind_speed_9amwind_speed_3pmhumidity_9amhumidity_3pm

pressure_9ampressure_3pmcloud_9amcloud_3pm

Weather Data

library(kohonen)

setseed(42)

model lt- som(scale(ds[numi[114]]) grid = somgrid(5 4 hexagonal))

plot(model main=Weather Data)

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 54 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

37 Further Reading and Acknowledgements

The Rattle Book published by Springer provides a comprehensiveintroduction to data mining and analytics using Rattle and RIt is available from Amazon Other documentation on a broaderselection of R topics of relevance to the data scientist is freelyavailable from httpdataminingtogawarecom including theDatamining Desktop Survival Guide

This module is one of many OnePageR modules available fromhttponepagertogawarecom In particular follow the links onthe website with a which indicates the generally more developedOnePageR modules

Other resources include

Practical Data Science with R by Nina Zumel and John Mount March 2014 has a goodchapter on Cluster Analysis with some depth of explanation of the sum of squares measuresand good examples of R code for performing cluster analysis It also covers clusterboot()and kmeansruns() from fpc

The radar or radial plot code originated from an RStudio Blog Posting

The definition of all criteria used to measure the goodness of a clustering can be found ina Vingette of the clusterCrit (Desgraupes 2013) package Also available on CRAN

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 55 of 56

Data Science with R OnePageR Survival Guides Cluster Analysis

38 References

Breiman L Cutler A Liaw A Wiener M (2012) randomForest Breiman and Cutlerrsquos ran-dom forests for classification and regression R package version 46-7 URL httpCRAN

R-projectorgpackage=randomForest

Buchta C Hahsler M (2014) cba Clustering for Business Analytics R package version 02-14URL httpCRANR-projectorgpackage=cba

Desgraupes B (2013) clusterCrit Clustering Indices R package version 123 URL http

CRANR-projectorgpackage=clusterCrit

Hennig C (2014) fpc Flexible procedures for clustering R package version 21-7 URL http

CRANR-projectorgpackage=fpc

Jefferis G (2014) dendroextras Extra functions to cut label and colour dendrogram clusters Rpackage version 01-4 URL httpCRANR-projectorgpackage=dendroextras

Lucas A (2011) amap Another Multidimensional Analysis Package R package version 08-7URL httpCRANR-projectorgpackage=amap

R Core Team (2014) R A Language and Environment for Statistical Computing R Foundationfor Statistical Computing Vienna Austria URL httpwwwR-projectorg

Williams GJ (2009) ldquoRattle A Data Mining GUI for Rrdquo The R Journal 1(2) 45ndash55 URLhttpjournalr-projectorgarchive2009-2RJournal_2009-2_Williamspdf

Williams GJ (2011) Data Mining with Rattle and R The art of excavating data for knowl-edge discovery Use R Springer New York URL httpwwwamazoncomgpproduct

1441998896ref=as_li_qf_sp_asin_tlie=UTF8amptag=togaware-20amplinkCode=as2ampcamp=

217145ampcreative=399373ampcreativeASIN=1441998896

Williams GJ (2014) rattle Graphical user interface for data mining in R R package version304 URL httprattletogawarecom

Williams GJ Huang JZ Chen X Wang Q Xiao L (2012) wskm Weighted k-means ClusteringR package version 140 URL httpCRANR-projectorgpackage=wskm

Xie Y (2013) animation A gallery of animations in statistics and utilities to create animationsR package version 22 URL httpCRANR-projectorgpackage=animation

This document sourced from ClustersORnw revision 440 was processed by KnitR version 16of 2014-05-24 and took 306 seconds to process It was generated by gjw on nyx running Ubuntu1404 LTS with Intel(R) Xeon(R) CPU W3520 267GHz having 4 cores and 123GB of RAMIt completed the processing 2014-06-22 123837

Copyright copy 2013-2014 Grahamtogawarecom Module ClustersO Page 56 of 56

  • Load Weather Dataset for Modelling
  • Introducing Cluster Analysis
  • Distance Calculation Euclidean Distance
  • Minkowski Distance
  • General Distance
  • K-Means Basics Iterative Cluster Search
  • K-Means Using kmeans()
  • Scaling Datasets
  • K-Means Scaled Dataset
  • Animate Cluster Building
  • Visualise the Cluster Radial Plot Using GGPlot2
  • Visualize the Cluster Radial Plot with K=4
  • Visualise the Cluster Cluster Profiles with Radial Plot
  • Visualise the Cluster Single Cluster Radial Plot
  • Visualise the Cluster Grid of Radial Plots
  • K-Means Base Case Cluster
  • K-Means Multiple Starts
  • K-Means Cluster Stability
  • Evaluation of Clustering Quality
  • Evaluation Within Sum of Squares
  • Evaluation Between Sum of Squares
  • K-Means Selecting k Using Scree Plot
  • K-Means Selecting k Using Calinski-Harabasz
  • K-Means Selecting k Using Average Silhouette Width
  • K-Means Using clusterCrit Calinski_Harabasz
  • K-Means Compare All Criteria
  • K-Means Plot All Criteria
  • K-Means predict()
  • Entropy Weighted K-Means
  • Partitioning Around Medoids PAM
  • Clara
  • Hierarchical Cluster in Parallel
  • Plotting Hierarchical Cluster
  • Add Colour to the Hierarchical Cluster
  • Hierarchical Cluster Binary Variables
  • Self Organising Maps SOM
  • Further Reading and Acknowledgements
  • References
Page 54: Data Science with R Cluster Analysis - Togaware6 K-Means Basics: Iterative Cluster Search The k-means algorithm is a traditional and widely used clustering algorithm. The algorithm

Data Science with R OnePageR Survival Guides Cluster Analysis

35 Hierarchical Cluster Binary Variables

Exercise Clustering a large population based on the patterns of missing data within thepopulation is a technique for grouping observations exhibiintg similar patterns of behavi-ouour assuming missing by pattern We can convert each variable to a binary 10 indi-cating presentmissing and then use mona() for a hiearchical clustering Demonstrate thisInclude a levelplot

Copyright © 2013-2014 [email protected]. Module: ClustersO. Page: 53 of 56.


36 Self Organising Maps: SOM

[SOM plot "Weather Data": codebook vectors on a 5x4 hexagonal grid for the 14 numeric variables min_temp, max_temp, rainfall, evaporation, sunshine, wind_gust_speed, wind_speed_9am, wind_speed_3pm, humidity_9am, humidity_3pm, pressure_9am, pressure_3pm, cloud_9am, cloud_3pm.]

library(kohonen)

set.seed(42)

model <- som(scale(ds[numi[1:14]]), grid=somgrid(5, 4, "hexagonal"))

plot(model, main="Weather Data")
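Once fitted, the map can also be used for clustering: each observation is assigned to a winning unit on the 5x4 grid, and that assignment serves as a cluster label. A short sketch, assuming the model object from the code above:

```r
# Each observation maps to a winning unit on the grid; unit.classif
# records that mapping and can be used as a cluster label.
head(model$unit.classif)     # winning unit for the first few observations
table(model$unit.classif)    # how many observations map to each unit
plot(model, type="counts")   # shade each unit by its observation count
```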


37 Further Reading and Acknowledgements

The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.

This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website with a *, which indicates the generally more developed OnePageR modules.

Other resources include

Practical Data Science with R, by Nina Zumel and John Mount (March 2014), has a good chapter on Cluster Analysis, with some depth of explanation of the sum of squares measures and good examples of R code for performing cluster analysis. It also covers clusterboot() and kmeansruns() from fpc.

The radar or radial plot code originated from an RStudio Blog posting.

The definition of all criteria used to measure the goodness of a clustering can be found in a vignette of the clusterCrit package (Desgraupes 2013), also available on CRAN.


38 References

Breiman L, Cutler A, Liaw A, Wiener M (2012). randomForest: Breiman and Cutler's random forests for classification and regression. R package version 4.6-7. URL http://CRAN.R-project.org/package=randomForest.

Buchta C, Hahsler M (2014). cba: Clustering for Business Analytics. R package version 0.2-14. URL http://CRAN.R-project.org/package=cba.

Desgraupes B (2013). clusterCrit: Clustering Indices. R package version 1.2.3. URL http://CRAN.R-project.org/package=clusterCrit.

Hennig C (2014). fpc: Flexible procedures for clustering. R package version 2.1-7. URL http://CRAN.R-project.org/package=fpc.

Jefferis G (2014). dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.1-4. URL http://CRAN.R-project.org/package=dendroextras.

Lucas A (2011). amap: Another Multidimensional Analysis Package. R package version 0.8-7. URL http://CRAN.R-project.org/package=amap.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R!. Springer, New York. URL http://www.amazon.com/gp/product/1441998896.

Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

Williams GJ, Huang JZ, Chen X, Wang Q, Xiao L (2012). wskm: Weighted k-means Clustering. R package version 1.4.0. URL http://CRAN.R-project.org/package=wskm.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2. URL http://CRAN.R-project.org/package=animation.

This document, sourced from ClustersO.Rnw revision 440, was processed by KnitR version 1.6 of 2014-05-24 and took 30.6 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed the processing 2014-06-22 12:38:37.
