COMP90051 Statistical Machine Learning
Lecture 16. Manifold Learning
Semester 2, 2017. Lecturer: Andrey Kan
Copyright: University of Melbourne. Swiss roll image: Evan-Amos, Wikimedia Commons, CC0
This lecture
• Introduction to manifold learning
  – Motivation
  – Focus on data transformation
• Unfolding the manifold
  – Geodesic distances
  – Isomap algorithm
• Spectral clustering
  – Laplacian eigenmaps
  – Spectral clustering pipeline
Manifold Learning
Recovering a low-dimensional data representation non-linearly embedded within a higher-dimensional space
The limitation of k-means and GMM
• The k-means algorithm can find spherical clusters
• GMM can find elliptical clusters
• These algorithms will struggle in cases like the one shown below
[Figure from Murphy: k-means clustering vs the desired result]
Focusing on data geometry
• We are not dismissing the k-means algorithm yet, but we are going to put it aside for a moment
• One approach to the problem on the previous slide would be to introduce improvements to algorithms such as k-means
• Instead, let's focus on the geometry of the data and see whether we can transform the data to make it amenable to simpler algorithms
  – Recall the "transform the data vs modify the model" discussion in supervised learning
Non-linear data embedding
• Recall the example with 3D GPS coordinates that denote a car's location on a 2D surface
• In a similar example, consider coordinates of items on a picnic blanket, which is approximately a plane
  – In this example, the data resides on a plane embedded in 3D
• A low-dimensional surface can be quite curved in a higher-dimensional space
  – A plane of dough (2D) rolled up in a Swiss roll (3D)
Picnic blanket image: Theo Wright, Flickr, CC2. Swiss roll image: Evan-Amos, Wikimedia Commons, CC0
Key assumption: it's simpler than it looks!
• Key assumption: high-dimensional data actually resides in a lower-dimensional space that is locally Euclidean
• Informally, a manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space
Manifold example
• Informally, a manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space
• Example: arc of a circle
  – consider a tiny bit of the circumference (2D); it can be treated as a line (1D)
[Figure: points A, B, C on a circular arc; for nearby points, $AC \approx AB + BC$]
$m$-dimensional manifold
• Definition from Guillemin and Pollack, Differential Topology, 1974
• A mapping $f$ on an open set $U \subset \mathbb{R}^n$ is called smooth if it has continuous partial derivatives of all orders
• A map $f: X \to \mathbb{R}^m$ is called smooth if around each point $x \in X$ there is an open set $U \subset \mathbb{R}^n$ and a smooth map $F: U \to \mathbb{R}^m$ such that $F$ equals $f$ on $U \cap X$
• A smooth map $f: X \to Y$ of subsets of two Euclidean spaces is a diffeomorphism if it is one-to-one and onto, and if the inverse map $f^{-1}: Y \to X$ is also smooth. $X$ and $Y$ are diffeomorphic if such a map exists
• Suppose that $X$ is a subset of some ambient Euclidean space $\mathbb{R}^N$. Then $X$ is an $m$-dimensional manifold if each point $x \in X$ possesses a neighbourhood $V \subset X$ which is diffeomorphic to an open set $U \subset \mathbb{R}^m$
Manifold examples
• A few examples of manifolds are shown below
• In all cases, the idea is that (hopefully) once the manifold is "unfolded", analysis such as clustering becomes easy
• How to "unfold" a manifold?
[Figure: several example manifolds]
Geodesic Distances and Isomap
A non-linear dimensionality reduction algorithm that preserves locality information using geodesic distances
General idea: dimensionality reduction
• Find a lower-dimensional representation of the data that preserves distances between points (MDS)
• Do visualization, clustering, etc. on the lower-dimensional representation. Problems?
[Figure: points A and B on a Swiss roll, before and after unfolding]
"Global distances" vs geodesic distances
• "Global distances" cause a problem: we may not want to preserve them
• We are interested in preserving distances along the manifold (geodesic distances)
[Figure: points C and D; the geodesic distance between them follows the manifold]
Images: ClkerFreeVectorImages and Kaz @pixabay.com (CC0)
MDS and similarity matrix
• In essence, "unfolding" a manifold is achieved via dimensionality reduction, using methods such as MDS
• Recall that the input of an MDS algorithm is a similarity (aka proximity) matrix, where each element $w_{ij}$ denotes how similar data points $i$ and $j$ are
• Replacing distances with geodesic distances simply means constructing a different similarity matrix without changing the MDS algorithm
  – Compare this to the idea of modular learning in kernel methods
• As you will see shortly, there is a close connection between similarity matrices and graphs; the next slide reviews basic definitions from graph theory
Refresher on graph terminology
• A graph is a tuple $G = \{V, E\}$, where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. Each edge is a pair of vertices
  – Undirected graph: pairs are unordered
  – Directed graph: pairs are ordered
• Graphs model pairwise relations between objects
  – Similarity or distance between the data points
• In a weighted graph, each edge $e_{ij}$ has an associated weight $w_{ij}$
  – Weights capture the strength of the relation between objects
Weighted adjacency matrix
• We will consider weighted undirected graphs with non-negative weights $w_{ij} \ge 0$. Moreover, we will assume that $w_{ij} = 0$ if and only if vertices $i$ and $j$ are not connected
• The degree of a vertex $v_i \in V$ is defined as
$$\deg(i) \equiv \sum_{j=1}^{n} w_{ij}$$
• A weighted undirected graph can be represented with a weighted adjacency matrix $\boldsymbol{W}$ that contains the weights $w_{ij}$ as its elements
Similarity graph models data geometry
• Geodesic distances can be approximated using a graph in which vertices represent data points
• Let $d(i, j)$ be the Euclidean distance between points $i$ and $j$ in the original space
• Option 1: define some local radius $\varepsilon$. Connect vertices $i$ and $j$ with an edge if $d(i, j) \le \varepsilon$
• Option 2: define a nearest-neighbour threshold $k$. Connect vertices $i$ and $j$ if $i$ is among the $k$ nearest neighbours of $j$ OR $j$ is among the $k$ nearest neighbours of $i$
• Set the weight of each edge to $d(i, j)$ (a sketch of both options follows the figure below)
[Figure: points C and D connected through the similarity graph; the path approximates the geodesic distance]
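To make options 1 and 2 concrete, here is a minimal NumPy sketch of both constructions; the function name `similarity_graph` and its interface are illustrative, not part of the lecture.

import numpy as np

def similarity_graph(X, epsilon=None, k=None):
    """Weighted adjacency matrix from points X (n x d).
    Option 1 (epsilon-ball) if epsilon is given, else option 2 (k-NN, OR rule).
    Weights are Euclidean distances; 0 means "not connected"."""
    assert (epsilon is None) != (k is None), "pass exactly one of epsilon, k"
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    if epsilon is not None:
        # Option 1: connect i and j if d(i, j) <= epsilon
        mask = (D <= epsilon) & ~np.eye(n, dtype=bool)
    else:
        # Option 2: connect if j is among the k nearest neighbours of i ...
        nn = np.argsort(D, axis=1)[:, 1:k + 1]   # position 0 is the point itself
        mask = np.zeros((n, n), dtype=bool)
        mask[np.repeat(np.arange(n), k), nn.ravel()] = True
        mask |= mask.T                            # ... OR vice versa
    W = np.zeros((n, n))
    W[mask] = D[mask]
    return W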
Computing geodesic distances
• Given the similarity graph, compute the shortest paths between each pair of points
  – e.g., using the Floyd-Warshall algorithm in $O(n^3)$ time (a sketch follows below)
• Set the geodesic distance between vertices $i$ and $j$ to the length (sum of weights) of the shortest path between them
• Define a new similarity matrix based on geodesic distances
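A compact sketch of this step, assuming the adjacency convention above (weight 0 = no edge); real implementations would prefer running Dijkstra from every vertex on sparse graphs.

import numpy as np

def geodesic_distances(W):
    """All-pairs shortest-path lengths via Floyd-Warshall, O(n^3)."""
    G = np.where(W > 0, W, np.inf)   # no edge -> infinite distance
    np.fill_diagonal(G, 0.0)
    for k in range(len(G)):          # allow paths passing through vertex k
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    return G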
Isomap: summary
1. Construct the similarity graph
2. Compute shortest paths
3. Geodesic distances are the lengths of the shortest paths
4. Construct a similarity matrix using geodesic distances
5. Apply MDS (an end-to-end sketch follows below)
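A minimal end-to-end sketch of these five steps, reusing the hypothetical `similarity_graph` and `geodesic_distances` helpers above, with scikit-learn's MDS on a precomputed dissimilarity matrix; it assumes the graph is connected (no infinite geodesic distances).

from sklearn.manifold import MDS

def isomap(X, k=7, n_components=2):
    W = similarity_graph(X, k=k)      # 1. similarity graph (k-NN option)
    G = geodesic_distances(W)         # 2-4. geodesic distance matrix
    # 5. MDS on geodesic distances instead of Euclidean ones
    mds = MDS(n_components=n_components, dissimilarity='precomputed')
    return mds.fit_transform(G)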
Spectral Clustering
A spectral graph theory approach to non-linear dimensionality reduction
Data processing pipelines
• The Isomap algorithm can be considered a pipeline in the sense that it combines different processing blocks, such as graph construction and MDS
• Here MDS serves as a core sub-routine of Isomap
• Spectral clustering is similar to Isomap in that it also comprises a few standard blocks, including k-means clustering
• In contrast to Isomap, spectral clustering uses a different non-linear mapping technique called the Laplacian eigenmap
Spectral clustering algorithm
1. Construct a similarity graph and use the corresponding adjacency matrix as a new similarity matrix
  – Just as in Isomap, the graph captures local geometry and breaks long-distance relations
  – Unlike Isomap, the adjacency matrix is used "as is"; shortest paths are not computed
2. Map the data to a lower-dimensional space using Laplacian eigenmaps on the adjacency matrix
  – This uses results from spectral graph theory
3. Apply k-means clustering to the mapped points
Similarity graph for spectral clustering
• Again, we start by constructing a similarity graph. This can be done in the same way as for Isomap (but there is no need to compute shortest paths)
• Recall that option 1 was to connect points that are closer than $\varepsilon$, and option 2 was to connect points within a $k$-neighbourhood
• There is also option 3, usually considered for spectral clustering. Here all points are connected to each other (the graph is fully connected). The weights are assigned using a Gaussian kernel (aka heat kernel) with width parameter $\sigma$
$$w_{ij} = \exp\left(-\frac{1}{\sigma}\left\|\boldsymbol{x}_i - \boldsymbol{x}_j\right\|^2\right)$$
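Option 3 takes only a few lines in NumPy; the function name and default width are illustrative, and zeroing the diagonal (no self-loops) is an assumed convention.

import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Fully connected similarity graph with heat-kernel weights."""
    # Squared Euclidean distances between all pairs of rows of X (n x d)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma)      # w_ij = exp(-||x_i - x_j||^2 / sigma)
    np.fill_diagonal(W, 0.0)     # assumed convention: no self-loops
    return W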
Graph Laplacian
• Recall that $\boldsymbol{W}$ denotes the weighted adjacency matrix containing all weights $w_{ij}$
• Next, the degree matrix $\boldsymbol{D}$ is defined as a diagonal matrix with vertex degrees on the diagonal. Recall that a vertex degree is $\deg(i) = \sum_{j=1}^{n} w_{ij}$
• Finally, another special matrix associated with each graph is the unnormalised graph Laplacian, defined as $\boldsymbol{L} \equiv \boldsymbol{D} - \boldsymbol{W}$
  – For simplicity, we introduce spectral clustering using the unnormalised Laplacian. In practice, it is common to use a Laplacian normalised in a certain way, e.g., $\boldsymbol{L}_{norm} \equiv \boldsymbol{I} - \boldsymbol{D}^{-1}\boldsymbol{W}$, where $\boldsymbol{I}$ is the identity matrix (a sketch computing both follows below)
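Both Laplacians in NumPy, following the definitions above; the sketch assumes every vertex has non-zero degree.

import numpy as np

def laplacians(W):
    """Unnormalised Laplacian L = D - W and normalised L_norm = I - D^{-1} W."""
    deg = W.sum(axis=1)                           # vertex degrees deg(i)
    L = np.diag(deg) - W                          # L = D - W
    L_norm = np.eye(len(deg)) - W / deg[:, None]  # I - D^{-1} W (row-scaled W)
    return L, L_norm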
Laplacian eigenmaps
• Laplacian eigenmaps, the central sub-routine of spectral clustering, is a non-linear dimensionality reduction method
• Similar to MDS, the idea is to map the original data points $\boldsymbol{x}_i \in \mathbb{R}^D$, $i = 1, \dots, n$ to a set of low-dimensional points $\boldsymbol{z}_i \in \mathbb{R}^m$, $m < D$ that "best represent" the original data
• Laplacian eigenmaps use a similarity matrix $\boldsymbol{W}$ rather than the original data coordinates as a starting point
  – Here the similarity matrix $\boldsymbol{W}$ is the weighted adjacency matrix of the similarity graph
• Earlier, we have seen examples of how the "best represent" criterion is formalised in MDS methods
• Laplacian eigenmaps use a different criterion, namely the aim is to minimise (subject to some constraints)
$$\sum_{i,j}\left\|\boldsymbol{z}_i - \boldsymbol{z}_j\right\|^2 w_{ij}$$
Alternative representation of the mapping
• This minimisation problem is solved using results from spectral graph theory
• Instead of the mapped points $\boldsymbol{z}_i$, the output can be viewed as a set of $n$-dimensional vectors $\boldsymbol{f}_j$, $j = 1, \dots, m$. The solution eigenmap is expressed in terms of these $\boldsymbol{f}_j$
  – For example, if the mapping is onto a 1D line, $\boldsymbol{f}_1 = \boldsymbol{f}$ is just a collection of coordinates for all $n$ points
  – If the mapping is onto 2D, $\boldsymbol{f}_1$ is a collection of all the first coordinates, and $\boldsymbol{f}_2$ is a collection of all the second coordinates
• For illustrative purposes, we will consider a simple example of mapping to 1D
Problem formulation for a 1D eigenmap
• Given an $n \times n$ similarity matrix $\boldsymbol{W}$, our aim is to find a 1D mapping $\boldsymbol{f}$, such that $f_i$ is the coordinate of the $i$-th mapped point. We are looking for a mapping that minimises
$$\frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$$
• Clearly, for any $\boldsymbol{f}$ this can be minimised by multiplying $\boldsymbol{f}$ by a small constant, so we need to introduce a scaling constraint, e.g., $\left\|\boldsymbol{f}\right\|^2 = \boldsymbol{f}'\boldsymbol{f} = 1$
• Next, recall that $\boldsymbol{L} \equiv \boldsymbol{D} - \boldsymbol{W}$
Re-writing the objective in vector form
$$\frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$$
$$= \frac{1}{2}\sum_{i,j}\left(f_i^2 w_{ij} - 2 f_i f_j w_{ij} + f_j^2 w_{ij}\right)$$
$$= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \sum_{j=1}^{n} w_{ij} - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \sum_{i=1}^{n} w_{ij}\right)$$
$$= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \deg(i) - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \deg(j)\right)$$
$$= \sum_{i=1}^{n} f_i^2 \deg(i) - \sum_{i,j} f_i f_j w_{ij}$$
$$= \boldsymbol{f}'\boldsymbol{D}\boldsymbol{f} - \boldsymbol{f}'\boldsymbol{W}\boldsymbol{f}$$
$$= \boldsymbol{f}'\boldsymbol{L}\boldsymbol{f}$$
Laplace meets Lagrange
• Our problem becomes: minimise $\boldsymbol{f}'\boldsymbol{L}\boldsymbol{f}$ subject to $\boldsymbol{f}'\boldsymbol{f} = 1$. Recall the method of Lagrange multipliers (déjà vu?). Introduce a Lagrange multiplier $\lambda$, and set the derivatives of the Lagrangian to zero
$$\mathcal{L} = \boldsymbol{f}'\boldsymbol{L}\boldsymbol{f} - \lambda\left(\boldsymbol{f}'\boldsymbol{f} - 1\right)$$
$$2\boldsymbol{f}'\boldsymbol{L} - 2\lambda\boldsymbol{f}' = \boldsymbol{0}' \quad \text{(using the symmetry of } \boldsymbol{L}\text{)}$$
$$\boldsymbol{L}\boldsymbol{f} = \lambda\boldsymbol{f}$$
• The latter is precisely the definition of an eigenvector, with $\lambda$ being the corresponding eigenvalue!
• Critical points of our objective function $\boldsymbol{f}'\boldsymbol{L}\boldsymbol{f} = \frac{1}{2}\sum_{i,j}\left(f_i - f_j\right)^2 w_{ij}$ are eigenvectors of $\boldsymbol{L}$
• Note that the objective is actually minimised by the constant eigenvector $\boldsymbol{1}$ (eigenvalue 0), which is not useful. Therefore, for a 1D mapping we use the eigenvector with the second smallest eigenvalue
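These facts are easy to verify numerically. Below is a small sanity check on a made-up graph of two triangles joined by one edge: the smallest eigenvalue of $\boldsymbol{L}$ is 0 with a constant eigenvector, and the signs of the eigenvector for the second smallest eigenvalue separate the two triangles.

import numpy as np

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W          # unnormalised Laplacian L = D - W

vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
print(np.round(vals[0], 6))             # 0.0: the trivial smallest eigenvalue
print(np.round(vecs[:, 0], 3))          # constant eigenvector (up to sign/scale)
print(np.sign(vecs[:, 1]))              # second eigenvector splits the triangles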
Laplacian eigenmaps: summary
• Start with points $\boldsymbol{x}_i \in \mathbb{R}^D$. Construct a similarity graph using one of the 3 options
• Construct the weighted adjacency matrix $\boldsymbol{W}$ (do not compute shortest paths) and the corresponding Laplacian matrix $\boldsymbol{L}$
• Compute the eigenvectors of $\boldsymbol{L}$, and arrange them in increasing order of the corresponding eigenvalues $0 = \lambda_1 < \lambda_2 < \dots < \lambda_n$
• Take the eigenvectors corresponding to $\lambda_2$ to $\lambda_{m+1}$ as $\boldsymbol{f}_1, \dots, \boldsymbol{f}_m$, $m < n$, where each $\boldsymbol{f}_j$ corresponds to one of the new dimensions
• Combine all vectors into an $n \times m$ matrix, with the $\boldsymbol{f}_j$ as columns. The mapped points are the rows of the matrix (a sketch follows below)
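The whole summary fits in a few lines of NumPy; `laplacian_eigenmap` is an illustrative name, and the sketch assumes a connected graph (so only $\lambda_1 = 0$) and the unnormalised Laplacian.

import numpy as np

def laplacian_eigenmap(W, m=2):
    """Map n points to m dimensions given an n x n adjacency matrix W."""
    L = np.diag(W.sum(axis=1)) - W      # unnormalised Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue 0); take f_1..f_m
    return vecs[:, 1:m + 1]             # rows are the mapped points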
Spectral clustering: summary
1. Construct a similarity graph
2. Map the data to a lower-dimensional space using Laplacian eigenmaps on the adjacency matrix
3. Apply k-means clustering to the mapped points (an end-to-end sketch follows the figure below)
[Figure: spectral clustering result]
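An end-to-end sketch of the pipeline, reusing the hypothetical `gaussian_similarity` and `laplacian_eigenmap` helpers from earlier slides, with scikit-learn's KMeans for step 3; parameter defaults are illustrative.

from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, sigma=1.0, m=2):
    W = gaussian_similarity(X, sigma=sigma)   # 1. similarity graph (option 3)
    Z = laplacian_eigenmap(W, m=m)            # 2. Laplacian eigenmap
    # 3. k-means on the mapped points
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)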
This lecture
• Introduction to manifold learning
  – Motivation
  – Focus on data transformation
• Unfolding the manifold
  – Geodesic distances
  – Isomap algorithm
• Spectral clustering
  – Laplacian eigenmaps
  – Spectral clustering pipeline