Faculty of Electrical Engineering and Computer Science, Institute of Distributed Systems
Intelligent Systems Group (AG Intelligente Systeme) - Data Mining Group
Data Mining I
Summer semester 2019
Lecture 11: Clustering – 2: Hierarchical clustering
Lecturer: Prof. Dr. Eirini Ntoutsi
TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar
Clustering topics covered in DM1
1. Partitioning-based clustering
k-Means, k-Medoids
2. Density-based clustering
DBSCAN
3. Grid-based clustering
4. Hierarchical clustering
DIANA, AGNES, BIRCH, ROCK, CHAMELEON
5. Clustering evaluation
Outline
Hierarchical clustering
Bisecting k-Means
An overview of clustering
Homework/tutorial
Things you should know from this lecture
Hierarchical clustering
Produces a set of nested clusters organized as a hierarchical tree
Can also be visualized as a dendrogram
A tree-like diagram that records the sequence of merges or splits and the cluster memberships
The height at which two clusters are merged in the dendrogram reflects their distance
An instance can belong to multiple (nested) clusters; the assignment, though, is still hard: at each level of the hierarchy, an instance belongs to exactly one cluster
[Figure: nested clusters over points 1-6 and the corresponding dendrogram (leaf order 1, 3, 2, 5, 4, 6); the x-axis lists the points and the y-axis shows the distance at which clusters are merged.]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
A clustering can be obtained by ‘cutting’ the dendrogram at the proper level
Cutting based on distance (e.g., keep only merges at distance ≤ 0.1)
Cutting based on the number of clusters (e.g., cut so that 2 clusters remain); a minimal SciPy sketch of both cuts follows
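A minimal sketch of both ways of cutting a dendrogram, using SciPy (the toy dataset X is hypothetical): linkage builds the merge hierarchy, fcluster cuts it either at a distance threshold or at a desired number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy dataset: 6 points in 2D
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Build the hierarchy (here: single link; 'complete', 'average', 'ward' also work)
Z = linkage(X, method='single')

# Cut by distance: keep all merges with distance <= 0.1
labels_by_distance = fcluster(Z, t=0.1, criterion='distance')

# Cut by number of clusters: stop when 2 clusters remain
labels_by_k = fcluster(Z, t=2, criterion='maxclust')

print(labels_by_distance)
print(labels_by_k)
```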
Applications of hierarchical clustering 1/3
The dendrogram of clusters may correspond to meaningful taxonomies
Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Source: http://currents.plos.org/treeoflife/article/the-tree-of-life-and-a-new-classification-of-bony-fishes/
Applications of hierarchical clustering 2/3
The dendrogram of clusters may correspond to meaningful taxonomies
Dendrogram showing hierarchical clustering of tissue gene expression data with colours denoting tissues.
Source: http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html
Applications of hierarchical clustering 3/3
The dendrogram of clusters may correspond to meaningful taxonomies
USArrests dataset: statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.
Source: https://uc-r.github.io/hc_clustering
Hierarchical vs Partitioning
[Figure: the same four points p1-p4 shown as a partitioning (flat) clustering and as nested clusters with a dendrogram.]
Hierarchical clustering algorithms typically have local objectives
Partitioning algorithms typically have global objectives, e.g., k-Means
Hierarchical clustering methods
Two main types of hierarchical clustering
Agglomerative or AGNES (Agglomerative Nesting):
Bottom-up approach
Start with the points as individual clusters
At each step, merge the closest pair of clusters
until only one cluster (or k clusters) is left
Divisive or DIANA (Divisive analysis):
Top-down approach
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Merge or split one cluster at a time
[Figure: the same dendrogram traversed bottom-up (agglomerative) vs. top-down (divisive).]
Hierarchical clustering methods
Traditional hierarchical algorithms use a similarity or distance matrix to decide which clusters to merge or split next
The employed distance/similarity function depends on the application
[Figure: points p1…p12 and their proximity matrix.]
Agglomerative clustering algorithm
Most popular hierarchical clustering technique
Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation: the computation of the proximity of two clusters
Different approaches to define it (single link, complete link, …) lead to different algorithms
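To make the loop concrete, here is a deliberately naive Python sketch of the basic algorithm (it recomputes distances on the fly instead of maintaining the proximity matrix; cluster_distance is an illustrative single-link choice and can be swapped for any of the measures discussed next):

```python
import numpy as np

def cluster_distance(A, B):
    """Illustrative choice: single-link (MIN) distance between two clusters."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agglomerative(X, k=1):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[x] for x in X]                       # each point starts as its own cluster
    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters[j]                    # merge the pair ...
        del clusters[j]                               # ... and drop the absorbed cluster
    return clusters

X = np.random.RandomState(0).rand(10, 2)              # hypothetical toy data
print([len(c) for c in agglomerative(X, k=2)])
```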
Starting situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1…p12, each its own cluster, and the corresponding proximity matrix.]
Intermediate situation I
After some merging steps, we have some clusters
[Figure: five intermediate clusters C1…C5 and the proximity matrix between them.]
Intermediate situation II
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1…C5, with the closest pair C2 and C5 about to be merged, and the current proximity matrix.]
Merging
Two major questions for merging
How do we identify the closest pair of clusters to be merged?
How do we update the proximity matrix?
[Figure: after merging C2 and C5 into C2 ∪ C5, the proximity-matrix entries for the new cluster are unknown ('?') and must be recomputed.]
Distance between clusters
Each cluster is a set of points
How do we compare two sets of points/clusters?
A variety of different methods:
Single link (or MIN)
Complete link (or MAX)
Group average
Distance between centroids
Distance between medoids
Other methods driven by an objective function
Ward's method uses the squared error
Distance between clusters: Single link distance or MIN
Single link (or MIN) distance between $C_i$ and $C_j$ is the minimum distance between any object in $C_i$ and any object in $C_j$:

$dis_{sl}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

i.e., the distance is defined by the two closest objects (the shortest edge between the clusters)
Distance between clusters: Complete link or MAX
Complete link (or MAX) distance between $C_i$ and $C_j$ is the maximum distance between any object in $C_i$ and any object in $C_j$:

$dis_{cl}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

i.e., the distance is defined by the two most dissimilar objects (the longest edge between the clusters)
Distance between clusters: Group average
Group average distance between $C_i$ and $C_j$ is the average distance between any object in $C_i$ and any object in $C_j$:

$dis_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$
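The three set-distance definitions above (MIN, MAX, group average) differ only in how they aggregate the pairwise distances; a minimal NumPy sketch (all names are illustrative):

```python
import numpy as np

def pairwise_distances(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def dis_sl(A, B):   # single link (MIN): distance of the closest pair
    return pairwise_distances(A, B).min()

def dis_cl(A, B):   # complete link (MAX): distance of the most distant pair
    return pairwise_distances(A, B).max()

def dis_avg(A, B):  # group average: mean over all |A|*|B| pairs
    return pairwise_distances(A, B).mean()

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # hypothetical clusters
B = np.array([[2.0, 0.0], [3.0, 1.0]])
print(dis_sl(A, B), dis_cl(A, B), dis_avg(A, B))
```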
Distance between clusters: Centroid distance
Centroid distance between $C_i$ and $C_j$ is the distance between the centroid $c_i$ of $C_i$ and the centroid $c_j$ of $C_j$:

$dis_{centroids}(C_i, C_j) = d(c_i, c_j)$, where the centroid of a cluster with points $p_1, \dots, p_n$ is $c = \frac{1}{n} \sum_{i=1}^{n} p_i$
Example
[Figure: a dataset of 6 2D points and its Euclidean distance matrix.]
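A sketch of how such a distance matrix can be computed (the six points are hypothetical, the same toy set as in the earlier sketch):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 6-point 2D dataset
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Condensed pairwise Euclidean distances, expanded to a full 6x6 matrix
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))
```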
Back to the pseudocode of the agglomerative clustering algorithm
Pseudocode of the algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Single link distance or MIN agglomerative clustering algorithm
Similarity of two clusters is based on the most similar (closest) pair of objects
Determined by one pair of points
$dis_{sl}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

[Figure: nested clusters over points 1-6 produced by single link, and the corresponding dendrogram (leaf order 3, 6, 2, 5, 4, 1).]
Short break (5’)
Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using single-link distance
Single link distance (MIN): strengths
Can discover clusters of arbitrary shapes
[Figure: original points and the two arbitrarily-shaped clusters found by single link.]
Single link distance (MIN): limitations
Sensitive to noise and outliers
DBSCAN can be viewed as a robust variant of single link: it excludes noisy points between clusters to avoid undesirable chaining effects
[Figure: original points and the two clusters found; noisy points between the natural clusters cause them to be chained together.]
Single link distance (MIN): limitations
Produces long, elongated clusters (chain-like clusters)
Complete link distance or MAX agglomerative clustering algorithm
Similarity of two clusters is based on the least similar (most distant) pair of objects
Determined by one pair of points
$dis_{cl}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

[Figure: nested clusters over points 1-6 produced by complete link, and the corresponding dendrogram (leaf order 3, 6, 4, 1, 2, 5).]
Complete link distance (MAX): strengths
Less susceptible to noise and outliers compared to MIN
[Figure: original points and the two clusters found by complete link.]
Complete link distance (MAX): limitations
Because it focuses on minimizing the diameter of the clusters, it creates clusters that all have a similar diameter
If some natural clusters are larger than others, it tends to break the large ones
[Figure: original points and the two clusters found; the larger natural cluster is broken up.]
Short break (5’)
Given the following 1-dimensional dataset, build a hierarchical agglomerative clustering using complete-link distance
(Group) Average-link distance agglomerative clustering algorithm
Proximity of two clusters is the average of pairwise distances between objects in the two clusters.
Determined by all pairs of points in the two clusters
$dis_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$

[Figure: nested clusters over points 1-6 produced by group average, and the corresponding dendrogram (leaf order 3, 6, 4, 2, 5, 1).]
(Group) Average-link distance: strengths and limitations
Compromise between Single and Complete Link
Strengths
Less susceptible to noise and outliers
Limitations
Biased towards spherical clusters
Centroid-link distance agglomerative clustering algorithm
The distance between two clusters is the distance of their corresponding centroids
Difference to the other measures (often considered a drawback): the possibility of inversions
Two clusters that are merged at step k might be more similar than the pair of clusters merged at step k-1 (see the numeric sketch below)
For the other methods, the merge distance increases monotonically (or at worst stays the same)
$dis_{centroids}(C_i, C_j) = d(c_i, c_j)$
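A small numeric sketch of an inversion (the three points are hypothetical): for three points at the corners of an equilateral triangle with side 1, the first merge happens at distance 1 (all pairs are tied; one is picked arbitrarily), but the centroid of the merged pair lies only $\sqrt{3}/2 \approx 0.87$ from the third point, so the second merge occurs at a smaller distance than the first.

```python
import numpy as np

# Hypothetical: three points at the corners of an equilateral triangle, side 1
a, b, c = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.5, np.sqrt(3) / 2])

d_first = np.linalg.norm(a - b)             # first merge: {a}, {b} at distance 1.0
centroid_ab = (a + b) / 2                   # centroid of the merged cluster {a, b}
d_second = np.linalg.norm(centroid_ab - c)  # second merge at sqrt(3)/2 ~ 0.866
print(d_first, d_second)                    # inversion: d_second < d_first
```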
Ward’s method
Ward’s method or Ward's minimum variance method
Clusters are represented by centroids
The proximity between two clusters is measured in terms of the increase in SSE (sum of squared errors) that results from merging the two clusters
At each step, merge the pair of clusters that leads to the minimum increase in total within-cluster variance after merging (a minimal sketch of this criterion follows)
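A minimal sketch of Ward's merge criterion, assuming Euclidean data (all names are illustrative): the cost of merging two clusters is the SSE of their union minus the SSEs of the parts, and the pair with the smallest increase is merged. Equivalently, the increase equals $\frac{|A|\,|B|}{|A|+|B|}\,\lVert c_A - c_B \rVert^2$ for clusters with centroids $c_A$, $c_B$.

```python
import numpy as np

def sse(C):
    """Sum of squared distances of the points in C to their centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

def ward_increase(A, B):
    """Increase in total SSE caused by merging clusters A and B."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # hypothetical clusters
B = np.array([[4.0, 0.0], [4.0, 1.0]])
print(ward_increase(A, B))                # 16.0 here: (2*2/4) * 4**2
```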
[Figure: nested clusters over points 1-6 produced by Ward's method.]
Ward’s method cont’
Ward’s method is similar to k-Means: it tries to minimize the sum of squared distances of points from their cluster centroids, but greedily rather than globally
Less susceptible to noise and outliers
Biased towards spherical clusters
Comparison of the different methods
[Figure: the nested clusterings of the same six points produced by single link (MIN), complete link (MAX), group average, and Ward's method, side by side.]
Hierarchical methods: complexity
O(N²) space to store the proximity matrix
N is the number of points
O(N³) time in most of the cases
There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
Complexity can be reduced to O(N² log N) time for some approaches by using appropriate data structures
Hierarchical clustering: overview
No knowledge of the number of clusters is required
Produces a hierarchy of clusters, not a flat clustering
A single clustering can be obtained from the dendrogram
No backtracking: Merging decisions are final
Once a decision is made to combine two clusters, it cannot be undone
Lack of a global objective function
Decisions are local, at each step
No objective function is directly minimized
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Breaking large clusters
Difficulty handling clusters of different sizes and convex shapes
Inefficiency, especially for large datasets
Bisecting k-Means
Hybrid method, combines k-Means and hierarchical clustering
Idea: first split the set of points into two clusters, select one of these clusters for further splitting, and so on, until k clusters remain.
Pseudocode: a sketch follows the list below
Which cluster to split?
The one with the largest SSE (the worst one)
Based on SSE and size
…
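A minimal sketch of bisecting k-Means with the largest-SSE splitting rule, assuming scikit-learn's KMeans is available (names and parameters are illustrative, not the lecture's exact pseudocode):

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(C):
    """Sum of squared distances of the points in C to their centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, k):
    clusters = [X]                                    # start: one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE and bisect it with 2-means
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        C = clusters.pop(worst)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(C)
        clusters += [C[labels == 0], C[labels == 1]]
    return clusters

X = np.random.RandomState(0).rand(100, 2)             # hypothetical data
for C in bisecting_kmeans(X, k=4):
    print(len(C), round(sse(C), 3))
```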
Bisecting k-Means
An example
An overview on clustering
Intuitively, a cluster is a set of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster analysis: find similarities between data according to the characteristics found in the data, and group similar data objects into clusters
Key points in clustering
Similarity/ distance function
Learning algorithm
An unsupervised learning task
No clues about the number of clusters, nor about the characteristics of these clusters
Important DM task: as a stand-alone tool or as a preprocessing step
A large number of algorithms
Partitioning methods
Hierarchical methods
Density-based methods
Model-based methods
….
Homework/ tutorial
Homework
Use the Elki data mining tool to experiment with clustering algorithms http://elki.dbs.ifi.lmu.de/
Or Python/Weka (more limited w.r.t. clustering)
Readings:
Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 8.
Jain A.K., Murty M.N., Flynn P.J., Data Clustering: A Review, https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
Nando de Freitas' YouTube video: https://www.youtube.com/watch?v=voN8omBe2r4
Things you should know from this lecture
Hierarchical clustering basics
Agglomerative approach
Similarity measures between clusters
Bisecting k-Means