CS480 Introduction to Machine Learning: Unsupervised Learning
Edith Law
Supervised Learning
• Finding a teacher may be difficult, expensive, or impossible.
• Unsupervised learning is about learning without a teacher.
Unsupervised Learning
In unsupervised learning, data consists only of examples and not the corresponding labels.
Our job is to make sense of or find some pattern of regularity in the data, even though no one has provided the correct labels.
For example, we might want to do
• clustering: automatically partition the data into groups.
• dimensionality reduction: project high-dimensional data into a lower-dimensional space so that it can be more easily visualized.
Overview
• Clustering (K-Means)
• Hierarchical Clustering
• PCA
A Simple Clustering Example
• A fruit merchant approaches you with a set of apples to classify according to their variety.
  – Tells you there are five varieties of apples in the basket.
  – Tells you the weight and colour of each apple in the basket.
• Can you label each apple with the correct variety?
  – What would you need to know / assume?
A Simple Clustering Example
• Data = <x1, ?>, <x2, ?>, …, <xn, ?>
• You know there are 5 varieties.
• Assume each variety generates apples according to a (variety-specific) 2D Gaussian distribution.
  – If you know $\mu_i, \sigma_i^2$ for each class, it is easy to classify the apples.
  – If you know the class of each apple, it is easy to estimate $\mu_i, \sigma_i^2$.
• What if we know neither?
Chicken and Egg Problem
In unsupervised clustering, the goal is to find clusters in the data.
We represent each cluster by its cluster center.
• If we know the cluster centers, we can assign each point to its nearest cluster.
• If we know which points belong to which clusters, then we can compute the centers.

This is a chicken-and-egg problem, which can be solved via iteration:
• guess cluster centres
• assign each point to the closest centre
• recompute the centres
• repeat until the clusters stop moving
This iterative process is the idea behind the K-means algorithm.
A Simple Algorithm: K-means clustering
• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly estimate the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions.
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment.
K-means algorithm
1. Ask user how many clusters.
2. Randomly guess k centers: {µ1, …, µk} (assume σ² is known).
3. Assign each data point to the closest center.
4. Each center finds the centroid of the points it owns.
5. Repeat until convergence.
This data could easily be modeled by Gaussians.
Image courtesy of Andrew Moore, Carnegie Mellon U.
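The procedure above is only a few lines of code. Here is a minimal NumPy sketch (the function and variable names are ours, not from the slides); distances are Euclidean, and we stop when the assignments no longer change:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups; returns (centers, assignments)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k centers by picking k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(max_iters):
        # Step 3: assign each data point to the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # clusters stopped moving
        assign = new_assign
        # Step 4: each center moves to the centroid of the points it owns.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign
```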
K-Means Clustering (Daumé's Version)
K-means algorithm in action
[Figure sequence: the algorithm starts, runs through successive iterations, and terminates.]
Images courtesy of Andrew Moore, Carnegie Mellon U. (Pelleg and Moore, 1999)
https://dl.acm.org/citation.cfm?id=312248
K-Means is an instance of the EM Algorithm
• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly estimate the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4 (expectation step): Assign instances to the most likely classes based on the current parametric distributions.
  – Step 5 (maximization step): Estimate the parametric distribution of each class based on the latest assignment.
Properties of K-means
Does it converge? Yes, but only to a local optimum (proof in Daumé).

How long does it take to converge? In practice, very quickly (usually fewer than 20 iterations). Each iteration costs O(knm), where
• k = #centers
• n = #datapoints
• m = dimensionality of the data
In theory, however, the number of iterations required can be exponential in the number of data points.

Does it converge to the right answer? It is not guaranteed to converge to the “right answer”, partly because we have no way of knowing what the right answer is.
Properties of K-means
Rapid convergence depends on initialization
• Can use random restarts (e.g., run the algorithm 10 times with different initializations) to get a better local optimum.
• Alternatively, can choose your initial centers carefully, as sketched below:
  – Place µ1 on top of a randomly chosen data point.
  – Place µ2 on top of the data point that is furthest from µ1.
  – Place µ3 on top of the data point that is furthest from both µ1 and µ2.
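A minimal sketch of this farthest-point initialization (names are ours; ties are broken arbitrarily):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centers: mu_1 at random, each later mu as far as possible."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # mu_1: a randomly chosen data point
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])  # next center: the farthest data point
    return np.stack(centers)
```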
K-means: Choosing K
A common approach is to search over many solutions (i.e., with different K) and pick the one that minimizes a certain criterion, e.g., the Bayesian Information Criterion (BIC):

$$\text{BIC}: \quad \arg\min_K \; L_K + \lambda\, m K \log N$$

where $L_K$ is a measure of the quality of the clustering (e.g., the sum of squared distances between each data point and its assigned center), $K$ is the number of centers, $m$ is the number of dimensions, and $N$ is the number of data points.

From: http://www.cs.cmu.edu/~./awm/tutorials/kmeans11.pdf
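As a sketch of how this criterion could be evaluated (our own helper names; λ is a user-chosen weight, and `kmeans` is the earlier sketch):

```python
import numpy as np

def bic_score(X, centers, assign, lam=1.0):
    """L_K + lam * m * K * log(N): clustering quality plus complexity penalty."""
    N, m = X.shape
    K = len(centers)
    L_K = np.sum((X - centers[assign]) ** 2)  # sum of squared distances
    return L_K + lam * m * K * np.log(N)

# Pick the K that minimizes the criterion:
# best_K = min(range(1, 11), key=lambda K: bic_score(X, *kmeans(X, K)))
```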
K-means: How to Choose K
Let $I_k$ be the set of indices of data points belonging to cluster $C_k$, and $n_k$ the number of data points in $C_k$.

• Within-Cluster Scatter: how tightly grouped the clusters are.
$$W(K) = \sum_{k=1}^{K} \sum_{i \in I_k} \|x_i - \bar{x}_k\|^2$$

• Between-Cluster Scatter: how spread apart the clusters are from each other.
$$B(K) = \sum_{k=1}^{K} n_k \|\bar{x}_k - \bar{x}\|^2$$

where $\bar{x}_k = \frac{1}{n_k} \sum_{i \in I_k} x_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
K-means: How to Choose K
Goal: the clustering assignment should simultaneously have a small within-cluster scatter $W(K)$ and a large between-cluster scatter $B(K)$.

The “Calinski-Harabasz Index” combines the two:
$$CH(K) = \frac{B(K)/(K-1)}{W(K)/(N-K)}$$

Choose the K (upper-bounded by $K_{\max}$) with the largest CH(K) score:
$$K^* = \arg\max_{K \in \{2, \ldots, K_{\max}\}} CH(K)$$
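A sketch of the CH score in NumPy (our names; `centers[k]` plays the role of $\bar{x}_k$, which holds once K-means has converged):

```python
import numpy as np

def ch_index(X, centers, assign):
    """Calinski-Harabasz index: (B(K)/(K-1)) / (W(K)/(N-K))."""
    N, K = len(X), len(centers)
    x_bar = X.mean(axis=0)                        # overall mean of the data
    W = np.sum((X - centers[assign]) ** 2)        # within-cluster scatter
    n_k = np.bincount(assign, minlength=K)        # cluster sizes
    B = np.sum(n_k * np.sum((centers - x_bar) ** 2, axis=1))  # between scatter
    return (B / (K - 1)) / (W / (N - K))

# K* = argmax over K in {2, ..., Kmax}:
# best_K = max(range(2, Kmax + 1), key=lambda K: ch_index(X, *kmeans(X, K)))
```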
Overview
• Clustering (K-Means)
• Hierarchical Clustering
• PCA
Hierarchical Clustering
• A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level.
• Two general approaches:
  – Bottom-up: Recursively merge a pair of clusters.
  – Top-down: Recursively split the existing clusters.
• Use a dissimilarity measure to select split/merge pairs:
  – Measure pairwise distance between points in the 2 clusters, e.g., Euclidean distance, Manhattan distance.
  – Measure distance over entire clusters using a linkage criterion, e.g., min/max/mean over pairs of points.
Hierarchical Clustering
There is a hierarchical sequence of clustering assignments, which can be represented as a dendrogram.
A B C D E F G
A (B F) C D E G
(A E) (B F) C D G
(A E) (B F) (C G) D
((A E) (C G)) (B F) D
(((A E) (C G)) (B F)) D
((((A E) (C G)) (B F)) D)
Hierarchical Clustering Forms Dendrograms
A Dendrogram is a tree where each node represents a group:
• leaf: a group with a single data point.
• root: a group containing whole dataset.
• internal node: has two child nodes representing the groups that were merged to form it.
Each internal node is drawn at a height proportional to the dissimilarity between its two children (assume that the leaf nodes are at height zero).
Linkage Functions
• Linkage: a function d(G, H) that takes two groups G, H as input and computes a dissimilarity score between them.
• The clustering process will result in different dendrograms depending on the choice of linkage function we use to measure dissimilarity between groups.
Linkage Functions (images from Manning et al., 2008)
• single linkage (i.e., nearest-neighbor linkage): the dissimilarity between G and H is the smallest dissimilarity between two points in the opposite groups.
$$d_{\text{single}}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$$
• complete linkage (i.e., furthest-neighbor linkage): the dissimilarity between G and H is the largest dissimilarity between two points in the opposite groups.
$$d_{\text{complete}}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$$
Linkage Function
• average linkage: the dissimilarity between G and H is the average dissimilarity over all pairs of points in opposite groups.
$$d_{\text{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G,\, j \in H} d_{ij}$$
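All three linkages are one-liners given the matrix of pairwise dissimilarities. A minimal NumPy sketch (our names; Euclidean $d_{ij}$ assumed), where G and H are arrays with one point per row:

```python
import numpy as np

def pairwise(G, H):
    """All dissimilarities d_ij between points i in G and points j in H."""
    return np.linalg.norm(G[:, None, :] - H[None, :, :], axis=2)

def d_single(G, H):
    return pairwise(G, H).min()    # smallest dissimilarity across groups

def d_complete(G, H):
    return pairwise(G, H).max()    # largest dissimilarity across groups

def d_average(G, H):
    return pairwise(G, H).mean()   # average over all n_G * n_H pairs
```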
Example: Complete Linkage
From: http://www.econ.upf.edu/~michael/stanford/maeb7.pdf
Step 1: look for the most similar pair (lowest dissimilarity score).
Example: Complete Linkage
Step 2: join B and F at level 0.20. This forms a node.
Example: Complete Linkage
Step 3: calculate the dissimilarity score between each data point x and the merged pair (B, F).
  – Complete linkage means dissimilarity = max of d(x, B) and d(x, F).
  – E.g., d(A, B) = 0.5, d(A, F) = 0.6250, therefore d(A, (B, F)) = 0.6250.
Example: Complete Linkage
This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (B, F) pair.
Step 4: Repeat the process. Find the smallest dissimilarity score.
Example: Complete Linkage
Step 5: join A and E at level 0.25.
Example: Complete Linkage
Step 6: calculate the dissimilarity score between each data point x and the merged pair (A, E).
  – E.g., the dissimilarity between (A, E) and (B, F) is the max of d(A, (B, F)) = 0.6250 and d(E, (B, F)) = 0.7778.
Example: Complete Linkage
This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (A, E) pair.
Example: Complete Linkage (continued)
[The remaining slides repeat the process, merging the closest pair at each step, until all points are joined in a single cluster.]
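The same agglomerative procedure is available off the shelf. A sketch using SciPy (the seven random 2-D points here are illustrative stand-ins for A–G, not the table's actual values):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 2))          # stand-in data for points A..G

# pdist returns the condensed pairwise-distance vector that linkage expects.
Z = linkage(pdist(X), method='complete')   # or 'single', 'average'

# Each row of Z records one merge: (cluster i, cluster j, height, new size).
dendrogram(Z, labels=list('ABCDEFG'))
```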
Single vs Complete Linkage
https://www-users.cs.umn.edu/~kumar001/dmbook/ch7_clustering.pdf
single linkage:
• sensitive to noise/outliers
• clusters tend to be elliptical, long and skinny

complete linkage:
• less sensitive to noise/outliers
• clusters tend to be tight, compact, globular
Hierarchical Clustering of News Articles
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html
Overview
• Clustering (K-Means)
• Hierarchical Clustering
• PCA
Dimensionality Reduction
How do we automatically detect and remove redundant dimensions? Looking for a vector u that points in the direction of maximal variance!

[Figure: a 2-D scatter plot of data with axes “Skill” and “Enjoyment”; the vector u1 points along the direction of maximum variance, and u2 is orthogonal to it.]
Principal Component Analysis: The Problem
• Find orthonormal basis vectors $U = [u^{(1)}\; u^{(2)}\; \cdots\; u^{(K)}]$, where $K \ll d$ (the input dimensionality).
• New features: $z = U^T x$, where $z_k = (u^{(k)})^T x$.
• Reconstructed data points:
$$\hat{x} = \sum_{k=1}^{K} z_k u^{(k)}$$
• Cost function: the reconstruction error
$$J = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2$$
• Want: $\min_U J$.
Principal Component Analysis: The Solution
• The solution turns out to be the first K eigenvectors of the data covariance matrix (see [B] Sec. 12.1).
• Closed form: use Singular Value Decomposition (SVD) on the covariance matrix.
• Other PCA formulations: maximizing the variance of the projected data (see [D] Sec. 15.2).
Principal Component Analysis: The Algorithm
• Normalize features (ensure every feature has zero mean) and optionally scale features.
• Compute the covariance matrix.
• Compute its eigenvectors.
• Keep the first k eigenvectors and project to get the new features z, as sketched below.
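A minimal NumPy sketch of these steps (our names; `eigh` is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Returns (Z, U): the new features and the basis vectors as columns."""
    Xc = X - X.mean(axis=0)                 # zero-mean each feature
    C = np.cov(Xc, rowvar=False)            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigendecomposition (ascending)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors
    Z = Xc @ U                              # new features: z = U^T x per row
    return Z, U
```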
PCA: Example (Smith, 2002)
Step 1: subtract the mean of each dimension from the data along that dimension:
$$x_i - \bar{x}, \quad y_i - \bar{y} \qquad \forall i = 1, \ldots, 10$$
PCA: Example (Smith, 2002)
Step 2: Calculate the covariance matrix.
• Covariance measures how two variables vary from the mean with respect to each other:
$$\mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1} \qquad \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
• The covariance matrix captures the covariance values between all dimensions:
$$C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{cov}(y, y) \end{pmatrix} = \begin{pmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{pmatrix}$$
• The off-diagonal values are positive, indicating that x increases as y increases.
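These numbers are easy to check; a quick sketch using NumPy and the mean-adjusted data from the example (np.cov divides by n − 1 by default, matching the formulas above):

```python
import numpy as np

x = np.array([0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71])
y = np.array([0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01])

C = np.cov(x, y)   # 2x2 covariance matrix of the two dimensions
print(C)           # approx [[0.6166, 0.6154], [0.6154, 0.7166]]
```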
PCA: Example (Smith, 2002)
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.
• An eigenvector v of a linear transformation A is a vector that, upon transformation, does not change direction: $Av = \lambda v$. For example:
$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
Here, the associated eigenvalue is 4.
• Note that all eigenvectors of a symmetric matrix (such as the covariance matrix) are orthogonal (perpendicular) to each other; we can therefore re-express the data using the eigenvectors as the new axes!
PCA: Example (Smith, 2002)
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.
• The unit eigenvectors:
$$\text{eigenvectors} = \begin{pmatrix} -0.7352 & -0.6779 \\ 0.6779 & -0.7352 \end{pmatrix}$$
• The corresponding eigenvalues:
$$\text{eigenvalues} = \begin{pmatrix} 0.049 \\ 1.284 \end{pmatrix}$$
PCA: Example (Smith, 2002)
Step 4: Choose components.
• The eigenvector with the highest eigenvalue is the principal component of the dataset.
• Form a matrix of k eigenvectors, ordered by eigenvalue from largest to smallest: $W = [\text{eig}_1, \text{eig}_2, \ldots, \text{eig}_k]$.
• If k < m, then you are essentially discarding some dimensions. E.g.,
$$W = \begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \quad \text{or} \quad W = \begin{pmatrix} -0.6779 \\ -0.7352 \end{pmatrix}$$
PCA: Example (Smith, 2002)
Step 5: Derive the new dataset.
• Multiply the transpose of W on the left of the mean-adjusted dataset, transposed:
$$\begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \begin{pmatrix} 0.69 & -1.31 & 0.39 & 0.09 & 1.29 & 0.49 & 0.19 & -0.81 & -0.31 & -0.71 \\ 0.49 & -1.21 & 0.99 & 0.29 & 1.09 & 0.79 & -0.31 & -0.81 & -0.31 & -1.01 \end{pmatrix}$$
PCA: Example (Smith, 2002)
To reconstruct the data:
• We can get exactly the original data back if we use all the eigenvectors,
• but we lose some information if we use only some of the eigenvectors.
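A sketch of this round trip using the `pca` helper from earlier (our names; X is the 10-point example dataset):

```python
import numpy as np

# Mean-adjusted example data from the slides, one point per row.
X = np.column_stack([
    [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71],
    [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01],
])

Z, U = pca(X, k=1)                    # keep only the first component
X_hat = Z @ U.T + X.mean(axis=0)      # map back to the original space

# With k = 2 the error is ~0; with k = 1 some information is lost.
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(err)
```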
Uses of Dimensionality Reduction
• Compression
• Visualization: pick dimension = 2 or 3.
• Pre-processing (to avoid the curse of dimensionality)
https://en.wikipedia.org/wiki/Eigenface
What you should know
• K-means clustering and its properties
• Hierarchical clustering and different linkage functions
• How to cluster a toy dataset using K-means / hierarchical clustering
• Procedures and applications of Principal Component Analysis