Clustering Solutions
FINAL Exploratory Run
Full 10’ Resolution – 41,311 samples
Michael A. Lindgren
EWHALE Laboratory
Institute of Arctic Biology
University of Alaska Fairbanks
February 11, 2011
About This Run…
• This “FINAL” exploratory run is meant to inform the group's decision on which clustering level to use for the final Biome Shift Analysis.
• I modified the R code so that the very large proximity matrix created in RandomForests, which includes all of the 10’ resolution samples, can be passed to the PAM (Partitioning Around Medoids) clustering algorithm.
• For at least the preliminary decision about the optimal number of clusters, I am showing solutions with 5, 10, 15, 20, 25, and 30 clusters.
• Also included are silhouette plots for each cluster level.
• The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1. It is defined as:
S(i) = (min(b(i,:),2) - a(i)) ./ max(a(i),min(b(i,:),2))
• where a(i) is the average distance from the ith point to the other points in its cluster, and b(i,k) is the average distance from the ith point to points in another cluster k.
*From the MathWorks website (MathWorks is the developer of MATLAB).
See the document attached to this presentation, which discusses silhouette plots as a metric for deciding when an acceptable cluster solution has been achieved.
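The definition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for this run: the function name `silhouette_values`, the list-of-lists distance matrix, and the convention of assigning 0 to a point that is alone in its cluster are all choices made here for the example.

```python
def silhouette_values(dist, labels):
    """Return the silhouette value S(i) for every point.

    dist   -- square list-of-lists of pairwise distances (assumed symmetric)
    labels -- cluster label for each point, in the same order as dist
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumption: a singleton cluster (or a single-cluster
            # solution) gets a silhouette value of 0.
            values.append(0.0)
            continue
        # a(i): average distance to the other points in the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # min over k of b(i,k): smallest average distance to another cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy example: four points on a line at 0, 1, 10, and 11, in two clusters.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
s = silhouette_values(dist, [0, 0, 1, 1])
```

With well-separated clusters like these, every value comes out close to +1; moving the point at 10 into the first cluster would give it a negative value, which is the signal a silhouette plot makes visible.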
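To make the proximity-matrix step mentioned under About This Run more concrete, here is a small sketch of how a proximity matrix can be turned into a dissimilarity matrix and clustered around medoids. It rests on several assumptions that are not stated in this presentation: dissimilarity is taken as 1 minus proximity, and the `pam` function below is a simplified PAM-style build-and-swap routine written for illustration, not the R routine used for this run. All names and the toy matrix are hypothetical.

```python
def proximity_to_dissimilarity(prox):
    """Assumed conversion: dissimilarity = 1 - proximity."""
    return [[1.0 - p for p in row] for row in prox]


def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Simplified PAM-style clustering on a precomputed distance matrix."""
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid for a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True
    # Assign every point to its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Hypothetical 4-sample proximity matrix: samples 0 and 1 are alike,
# and samples 2 and 3 are alike.
prox = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
medoids, labels = pam(proximity_to_dissimilarity(prox), k=2)
```

This version recomputes the full cost for every trial swap, so it is only practical for small matrices; a matrix covering tens of thousands of samples needs a far more memory- and time-conscious implementation.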
Silhouette Plots
[Figure: Assessment of Average Silhouette Widths of Different Cluster Solutions. x-axis: Number of Clusters Returned (5, 10, 15, 20, 25, 30); y-axis: Average Silhouette Width (0 to 0.2).]
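The comparison in this figure comes down to averaging the silhouette values within each cluster solution. A minimal sketch of that calculation follows. The function names are hypothetical and the numbers are made-up toy values, not results from this run. Picking the largest average width is one common heuristic; it is offered here only as an aid to the group's decision, not as the rule.

```python
def average_silhouette_width(values):
    """Mean of the per-point silhouette values for one cluster solution."""
    return sum(values) / len(values)


def best_cluster_count(silhouettes_by_k):
    """Return (k with the largest average width, {k: average width})."""
    widths = {
        k: average_silhouette_width(vals)
        for k, vals in silhouettes_by_k.items()
    }
    best_k = max(widths, key=widths.get)
    return best_k, widths


# Made-up per-point silhouette values for three candidate solutions.
toy = {
    5: [0.20, 0.10, -0.05],
    10: [0.30, 0.15, 0.00],
    15: [0.10, 0.05, -0.10],
}
best_k, widths = best_cluster_count(toy)
```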
5 Clusters Returned
10 Clusters Returned
15 Clusters Returned
20 Clusters Returned
25 Clusters Returned
30 Clusters Returned