PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding 56
Part 2. Spectral Clustering from Matrix Perspective
A brief tutorial emphasizing recent developments
(A more detailed tutorial was given at ICML'04.)
From PCA to Spectral Clustering Using Generalized Eigenvectors

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

In Kernel PCA we compute the eigenvector: $Wv = \lambda v$

Generalized eigenvector: $Wq = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$

This leads to spectral clustering!
Indicator Matrix Quadratic Clustering Framework

Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering:
$$\max_H \mathrm{Tr}(H^T W H), \quad \text{s.t. } H^T H = I,\ H \ge 0$$
K-means: $W = X^T X$; Kernel K-means: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

Spectral clustering (normalized cut):
$$\max_H \mathrm{Tr}(H^T W H), \quad \text{s.t. } H^T D H = I,\ H \ge 0$$
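This quadratic framework can be checked numerically. A minimal sketch (the helper names are mine, not the tutorial's): build the unsigned indicator matrix H for a hard partition and evaluate $\mathrm{Tr}(H^T W H)$, which for such an H equals $\sum_k s(C_k,C_k)/|C_k|$, the kernel K-means objective.

```python
# Illustrative sketch (function names are mine): the unsigned indicator
# matrix H for a hard partition, and the objective Tr(H^T W H).
import numpy as np

def indicator_matrix(labels, K):
    """H = (h_1, ..., h_K), h_k = indicator of cluster k, scaled so H^T H = I."""
    labels = np.asarray(labels)
    H = np.zeros((len(labels), K))
    for k in range(K):
        idx = np.flatnonzero(labels == k)
        H[idx, k] = 1.0 / np.sqrt(len(idx))
    return H

def trace_objective(W, labels, K):
    """Tr(H^T W H) = sum_k s(C_k, C_k) / |C_k|  (kernel K-means objective)."""
    H = indicator_matrix(labels, K)
    return float(np.trace(H.T @ W @ H))
```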
Brief Introduction to Spectral Clustering (Laplacian-matrix-based clustering)
Some historical notes
• Fiedler, 1973, 1975: graph Laplacian matrix
• Donath & Hoffman, 1973: bounds
• Hall, 1970: quadratic placement (embedding)
• Pothen, Simon & Liou, 1990: spectral graph partitioning (many related papers thereafter)
• Hagen & Kahng, 1992: Ratio-cut
• Chan, Schlag & Zien: multi-way Ratio-cut
• Chung, 1997: spectral graph theory book
• Shi & Malik, 2000: Normalized Cut
Spectral Gold-Rush of 2001: nine papers on spectral clustering
• Meila & Shi, AI-Stat 2001: random-walk interpretation of Normalized Cut
• Ding, He & Zha, KDD 2001: perturbation analysis of the Laplacian matrix on sparsely connected graphs
• Ng, Jordan & Weiss, NIPS 2001: K-means algorithm on the embedded eigenspace
• Belkin & Niyogi, NIPS 2001: spectral embedding
• Dhillon, KDD 2001: bipartite graph clustering
• Zha et al., CIKM 2001: bipartite graph clustering
• Zha et al., NIPS 2001: spectral relaxation of K-means
• Ding et al., ICDM 2001: MinMaxCut; uniqueness of the relaxation
• Gu et al., 2001: K-way relaxation of NormCut and MinMaxCut
Spectral Clustering

min cutsize, without explicit size constraints

But where to cut? We need to balance the cluster sizes.
Graph Clustering

max within-cluster similarities (weights): $\ s(A,A) = \sum_{i \in A}\sum_{j \in A} w_{ij}$

min between-cluster similarities (weights): $\ s(A,B) = \sum_{i \in A}\sum_{j \in B} w_{ij}$

Three ways to balance the clusters:
• balance weight
• balance size
• balance volume
Clustering Objective Functions

• Ratio Cut:
$$J_{\mathrm{Rcut}}(A,B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$$

• Normalized Cut:
$$J_{\mathrm{Ncut}}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B} = \frac{s(A,B)}{s(A,A)+s(A,B)} + \frac{s(A,B)}{s(B,B)+s(A,B)}$$

• Min-Max-Cut:
$$J_{\mathrm{MMC}}(A,B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$$

where $s(A,B) = \sum_{i \in A}\sum_{j \in B} w_{ij}$ and $d_A = \sum_{i \in A} d_i$.
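The three objectives follow directly from these definitions; a small illustrative implementation (function names are mine; W is a dense symmetric numpy array, A and B are index lists):

```python
# Hedged illustration: the three 2-way objectives computed directly from
# the definitions on this slide.
import numpy as np

def s(W, A, B):
    """s(A,B) = sum_{i in A} sum_{j in B} w_ij."""
    return W[np.ix_(A, B)].sum()

def ratio_cut(W, A, B):
    return s(W, A, B) / len(A) + s(W, A, B) / len(B)

def normalized_cut(W, A, B):
    d = W.sum(axis=1)                # degrees d_i = sum_j w_ij
    dA, dB = d[A].sum(), d[B].sum()  # volumes d_A, d_B
    return s(W, A, B) / dA + s(W, A, B) / dB

def minmax_cut(W, A, B):
    return s(W, A, B) / s(W, A, A) + s(W, A, B) / s(W, B, B)
```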
Normalized Cut (Shi & Malik, 2000)

Min similarity between A and B: $s(A,B) = \sum_{i \in A}\sum_{j \in B} w_{ij}$. Balance the weights:
$$J_{\mathrm{Ncut}}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}, \qquad d_A = \sum_{i \in A} d_i, \quad d = \sum_{i \in G} d_i$$

Cluster indicator:
$$q(i) = \begin{cases} \phantom{-}\sqrt{d_B / (d\, d_A)} & \text{if } i \in A \\ -\sqrt{d_A / (d\, d_B)} & \text{if } i \in B \end{cases}$$

Normalization: $q^T D q = 1$, $q^T D e = 0$.

Substituting q leads to $J_{\mathrm{Ncut}}(q) = q^T (D - W) q$, i.e.
$$\min_q \ q^T (D - W) q + \lambda (q^T D q - 1)$$

The solution is an eigenvector of $(D - W) q = \lambda D q$.
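The derivation above can be sketched in a few lines with scipy's generalized symmetric eigensolver (an illustration of mine, not the authors' code); the sign of the second generalized eigenvector $q_2$ gives the bipartition:

```python
# Sketch: solve (D - W) q = lambda D q and split by the sign of q_2.
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)   # generalized eigenproblem, ascending order
    q2 = vecs[:, 1]               # second-smallest generalized eigenvector
    return q2 >= 0                # boolean cluster assignment by sign
```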
A simple example: 2 dense clusters, with sparse connections between them.

(Figure: adjacency matrix and the eigenvector $q_2$.)
K-way Spectral Clustering (K ≥ 2)
K-way Clustering Objectives

• Ratio Cut:
$$J_{\mathrm{Rcut}}(C_1,\ldots,C_K) = \sum_{k<l}\left(\frac{s(C_k,C_l)}{|C_k|} + \frac{s(C_k,C_l)}{|C_l|}\right) = \sum_k \frac{s(C_k, G{-}C_k)}{|C_k|}$$

• Normalized Cut:
$$J_{\mathrm{Ncut}}(C_1,\ldots,C_K) = \sum_{k<l}\left(\frac{s(C_k,C_l)}{d_k} + \frac{s(C_k,C_l)}{d_l}\right) = \sum_k \frac{s(C_k, G{-}C_k)}{d_k}$$

• Min-Max-Cut:
$$J_{\mathrm{MMC}}(C_1,\ldots,C_K) = \sum_{k<l}\left(\frac{s(C_k,C_l)}{s(C_k,C_k)} + \frac{s(C_k,C_l)}{s(C_l,C_l)}\right) = \sum_k \frac{s(C_k, G{-}C_k)}{s(C_k,C_k)}$$
K-way Spectral Relaxation

Unsigned cluster indicators:
$$h_1 = (1 \cdots 1,\ 0 \cdots 0,\ \ldots,\ 0 \cdots 0)^T$$
$$h_2 = (0 \cdots 0,\ 1 \cdots 1,\ \ldots,\ 0 \cdots 0)^T$$
$$\cdots$$
$$h_k = (0 \cdots 0,\ 0 \cdots 0,\ \ldots,\ 1 \cdots 1)^T$$

Re-write:
$$J_{\mathrm{Rcut}}(h_1,\ldots,h_k) = \frac{h_1^T (D-W) h_1}{h_1^T h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T h_k}$$
$$J_{\mathrm{Ncut}}(h_1,\ldots,h_k) = \frac{h_1^T (D-W) h_1}{h_1^T D h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T D h_k}$$
$$J_{\mathrm{MMC}}(h_1,\ldots,h_k) = \frac{h_1^T (D-W) h_1}{h_1^T W h_1} + \cdots + \frac{h_k^T (D-W) h_k}{h_k^T W h_k}$$
K-way Normalized Cut Spectral Relaxation

Unsigned cluster indicators:
$$y_k = D^{1/2}(0 \cdots 0,\ 1 \cdots 1,\ 0 \cdots 0)^T / \|D^{1/2} h_k\|$$

Re-write:
$$J_{\mathrm{Ncut}}(y_1,\ldots,y_k) = y_1^T (I - \widetilde W) y_1 + \cdots + y_k^T (I - \widetilde W) y_k = \mathrm{Tr}\big(Y^T (I - \widetilde W) Y\big)$$
where $\widetilde W = D^{-1/2} W D^{-1/2}$.

Optimize: $\ \min_Y \mathrm{Tr}\big(Y^T (I - \widetilde W) Y\big)$, subject to $Y^T Y = I$.

By K. Fan's theorem, the optimal solution is given by eigenvectors: $Y = (v_1, v_2, \ldots, v_k)$, where
$$(I - \widetilde W) v_k = \lambda_k v_k, \qquad (D - W) u_k = \lambda_k D u_k, \qquad u_k = D^{-1/2} v_k$$

$$\lambda_1 + \cdots + \lambda_k \le \min J_{\mathrm{Ncut}}(y_1,\ldots,y_k) \qquad \text{(Gu et al., 2001)}$$
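One common way to use the relaxed solution Y is to embed with the K smallest eigenvectors of $I - \widetilde W$, row-normalize, and recover discrete clusters with K-means (the embedding approach of Ng et al., cited later in this tutorial). A hedged sketch of mine, with scipy's `kmeans2` standing in for any K-means implementation:

```python
# Sketch (my code, not the tutorial's): K-way spectral clustering via
# the relaxed solution Y plus K-means on the embedded rows.
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def kway_spectral(W, K, seed=0):
    d = W.sum(axis=1)
    W_tilde = W / np.sqrt(np.outer(d, d))       # W~ = D^{-1/2} W D^{-1/2}
    vals, vecs = eigh(np.eye(len(W)) - W_tilde)
    Y = vecs[:, :K]                             # K smallest eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalize
    _, labels = kmeans2(Y, K, minit='++', seed=seed)
    return labels
```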
K-way Spectral Clustering Is Difficult

• Spectral clustering is best applied to 2-way clustering:
  – positive entries for one cluster
  – negative entries for the other cluster
• For K-way (K > 2) clustering, positive and negative signs make cluster assignment difficult. Alternatives:
  – Recursive 2-way clustering
  – Low-dimensional embedding: project the data onto the eigenvector subspace, then use another clustering method such as K-means on the embedded data (Ng et al.; Zha et al.; Bach & Jordan, etc.)
  – Linearized cluster assignment using spectral ordering and cluster crossing
Scaled PCA: a Unified Framework for Clustering and Ordering

• Scaled PCA has two optimality properties:
  – distance-sensitive ordering
  – min-max-principle clustering
• SPCA on a contingency table ⇒ Correspondence Analysis
  – simultaneous ordering of rows and columns
  – simultaneous clustering of rows and columns
Scaled PCA

Similarity matrix $S = (s_{ij})$ (generated from $XX^T$), with $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = s_{i.}$.

Nonlinear re-scaling:
$$\widetilde S = D^{-1/2} S D^{-1/2}, \qquad \tilde s_{ij} = s_{ij} / (s_{i.} s_{.j})^{1/2}$$

Apply SVD on $\widetilde S$ ⇒
$$S = D^{1/2} \widetilde S D^{1/2} = \sum_k \lambda_k D^{1/2} z_k z_k^T D^{1/2} = D \Big[\sum_k \lambda_k q_k q_k^T\Big] D$$

$q_k = D^{-1/2} z_k$ is the scaled principal component.

Subtract the trivial component: $\lambda_0 = 1$, $z_0 = d^{1/2} / s_{..}^{1/2}$, $q_0 = \mathbf 1$, giving
$$S - d d^T / s_{..} = D \Big[\sum_{k \ge 1} \lambda_k q_k q_k^T\Big] D$$

(Ding et al., 2002)
Scaled PCA on a Rectangular Matrix ⇒ Correspondence Analysis

Nonlinear re-scaling:
$$\widetilde P = D_r^{-1/2} P D_c^{-1/2}, \qquad \tilde p_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}$$
with row and column sums $r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$.

Apply SVD on $\widetilde P$ and subtract the trivial component:
$$P - r c^T / p_{..} = D_r \Big[\sum_{k \ge 1} \lambda_k f_k g_k^T\Big] D_c$$

$f_k = D_r^{-1/2} u_k$ and $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (standard coordinates in CA).
Correspondence Analysis (CA)

• Mainly used in graphical display of data
• Popular in France (Benzécri, 1969)
• Long history:
  – simultaneous row and column regression (Hirschfeld, 1935)
  – reciprocal averaging (Richardson & Kuder, 1933; Horst, 1935; Fisher, 1940; Hill, 1974)
  – canonical correlations, dual scaling, etc.
• Formulation is a bit complicated ("convoluted", Jolliffe, 2002, p. 342)
• "A neglected method" (Hill, 1974)
Clustering of Bipartite Graphs (rectangular matrix)

Simultaneous clustering of the rows and columns of a contingency table (adjacency matrix B)
Examples of bipartite graphs
• Information Retrieval: word-by-document matrix
• Market basket data: transaction-by-item matrix
• DNA Gene expression profiles
• Protein vs protein-complex
Bipartite Graph Clustering

Clustering indicators for rows and columns:
$$f_i = \begin{cases} \phantom{-}1 & \text{if } r_i \in R_1 \\ -1 & \text{if } r_i \in R_2 \end{cases} \qquad g_i = \begin{cases} \phantom{-}1 & \text{if } c_i \in C_1 \\ -1 & \text{if } c_i \in C_2 \end{cases}$$

$$B = \begin{pmatrix} B_{R_1,C_1} & B_{R_1,C_2} \\ B_{R_2,C_1} & B_{R_2,C_2} \end{pmatrix}, \qquad W = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}, \qquad q = \begin{pmatrix} f \\ g \end{pmatrix}$$

Substituting, we obtain
$$J_{\mathrm{MMC}}(C_1,C_2;R_1,R_2) = \frac{s(W_{12})}{s(W_{11})} + \frac{s(W_{12})}{s(W_{22})}$$

f, g are determined by
$$\left[\begin{pmatrix} D_r & 0 \\ 0 & D_c \end{pmatrix} - \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}\right] \begin{pmatrix} f \\ g \end{pmatrix} = \lambda \begin{pmatrix} D_r & 0 \\ 0 & D_c \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix}$$
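The f, g eigenproblem can be assembled and solved explicitly; a small illustration of mine (dense inputs assumed): the sign pattern of $q = (f; g)$ clusters rows and columns simultaneously.

```python
# Sketch: build W = [[0, B], [B^T, 0]] and solve (D - W) q = lambda D q.
import numpy as np
from scipy.linalg import eigh

def bipartite_bipartition(B):
    n, m = B.shape
    W = np.block([[np.zeros((n, n)), B],
                  [B.T, np.zeros((m, m))]])
    D = np.diag(W.sum(axis=1))            # diag(D_r, D_c)
    vals, vecs = eigh(D - W, D)
    q = vecs[:, 1]                        # q = (f; g)
    return q[:n] >= 0, q[n:] >= 0         # row clusters, column clusters
```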
Spectral Clustering of Bipartite Graphs

Simultaneous clustering of rows and columns (adjacency matrix B):
min between-cluster sums of weights: $s(R_1,C_2)$, $s(R_2,C_1)$
max within-cluster sums of weights: $s(R_1,C_1)$, $s(R_2,C_2)$

$$J_{\mathrm{MMC}}(C_1,C_2;R_1,R_2) = \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\,s(B_{R_1,C_1})} + \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\,s(B_{R_2,C_2})}$$

where $s(B_{R_1,C_2}) = \sum_{r_i \in R_1} \sum_{c_j \in C_2} b_{ij}$, etc.

(Ding, AI-STAT 2003)
Internet Newsgroups
Simultaneous clustering of documents and words
Embedding in Principal Subspace

Cluster self-aggregation (proved via perturbation analysis)

(cf. Hall, 1970: "quadratic placement" (embedding) of a graph)
Spectral Embedding: Self-aggregation (Ding, 2004)

• Compute K eigenvectors of the Laplacian
• Embed objects in the K-dimensional eigenspace
Spectral embedding is not topology preserving

700 3-D data points forming 2 interlocked rings: in eigenspace, they shrink and separate.
Spectral Embedding (Ding, 2004)

Simplex Embedding Theorem. Objects self-aggregate to K centroids; the centroids are located at the K corners of a simplex.
• The simplex consists of K basis vectors plus the coordinate origin
• The simplex is rotated by an orthogonal transformation T
• T is determined by perturbation analysis
Perturbation Analysis

Assume the data has 3 dense clusters $C_1, C_2, C_3$, sparsely connected:
$$W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}$$

$$Wq = \lambda D q \ \iff\ \widehat W z = (D^{-1/2} W D^{-1/2}) z = \lambda z, \qquad q = D^{-1/2} z$$

The off-diagonal blocks are between-cluster connections; they are assumed small and treated as a perturbation.

(Ding et al., KDD'01)
Spectral Perturbation Theorem

Spectral perturbation matrix:
$$\Gamma = \begin{pmatrix} h_1 & -s_{12} & \cdots & -s_{1K} \\ -s_{21} & h_2 & \cdots & -s_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{K1} & -s_{K2} & \cdots & h_K \end{pmatrix}, \qquad h_k = \sum_{p \ne k} s_{kp}, \quad s_{pq} = s(C_p, C_q)$$

The orthogonal transform matrix $T = (t_1, \ldots, t_K)$ is determined by
$$\bar\Gamma t_k = \lambda_k t_k, \qquad \bar\Gamma = \Omega^{-1/2} \Gamma\, \Omega^{-1/2}, \qquad \Omega = \mathrm{diag}[\rho(C_1), \ldots, \rho(C_K)]$$
Connectivity Network

$$C_{ij} = \begin{cases} 1 & \text{if } i, j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$$

Scaled PCA provides: $\ C \cong \displaystyle\sum_{k=1}^K \lambda_k D q_k q_k^T D$

Green's function: $\ C \approx G \equiv \displaystyle\sum_{k=2}^K \frac{q_k q_k^T}{1 - \lambda_k}$

Projection matrix: $\ C \approx P \equiv \displaystyle\sum_{k=1}^K q_k q_k^T$

(Ding et al., 2002)
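As a numerical illustration (my own construction, not from the tutorial), the truncated expansion $\sum_k \lambda_k D q_k q_k^T D$ built from the top-K components of $\widetilde S$ reproduces the block structure of a two-cluster similarity matrix:

```python
# Sketch: scaled-PCA approximation of the connectivity network,
# C ~= sum_k lambda_k D q_k q_k^T D with q_k = D^{-1/2} z_k.
import numpy as np

def connectivity_approx(S, K):
    d = S.sum(axis=1)
    S_tilde = S / np.sqrt(np.outer(d, d))   # D^{-1/2} S D^{-1/2}
    vals, Z = np.linalg.eigh(S_tilde)       # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:K]        # keep the top-K components
    lam, Z = vals[top], Z[:, top]
    Q = Z / np.sqrt(d)[:, None]             # q_k = D^{-1/2} z_k
    D = np.diag(d)
    return sum(lam[k] * D @ np.outer(Q[:, k], Q[:, k]) @ D for k in range(K))
```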
1st-order Perturbation: Example 1

Effects of self-aggregation: between-cluster connections are suppressed; within-cluster connections are enhanced.

$\lambda_2 = 0.300$; 1st-order solution: $\lambda_2 = 0.268$.

(Figure: similarity matrix W and connectivity matrix.)
Optimality Properties of Scaled PCA

Scaled principal components have optimality properties:

Ordering:
– adjacent objects along the order are similar
– far-away objects along the order are dissimilar
– the optimal solution for the permutation indexes is given by scaled PCA

Clustering:
– maximize within-cluster similarity
– minimize between-cluster similarity
– the optimal solution for the cluster membership indicators is given by scaled PCA
Spectral Graph Ordering

(Hall, 1970), "quadratic placement of a graph": find a coordinate x to minimize
$$J(x) = \sum_{ij} (x_i - x_j)^2 w_{ij} = x^T (D - W) x$$
The solutions are eigenvectors of the Laplacian.

(Barnard, Pothen & Simon, 1993), envelope reduction of a sparse matrix: find an ordering such that the envelope is minimized:
$$\min \max_i \sum_j |i - j|\, w_{ij} \ \Rightarrow\ \min \sum_{ij} (x_i - x_j)^2 w_{ij}$$
Distance Sensitive Ordering

Given a graph, find an optimal ordering of the nodes. Let $\pi$ denote the permutation indexes: $(\pi_1, \ldots, \pi_n) = \pi(1, \ldots, n)$.

$$J^{(d)}(\pi) = \sum_{i=1}^{n-d} w_{\pi_i, \pi_{i+d}}, \qquad \min_\pi J(\pi) = \sum_{d=1}^{n-1} d^2 J^{(d)}(\pi)$$

The larger the distance d, the larger the weight (penalty) $d^2$.

(Figure: arcs between ordered nodes; e.g. $J^{(d=2)}(\pi)$ contains terms such as $w_{\pi_1, \pi_3}$.)
Distance Sensitive Ordering

$$J(\pi) = \sum_{i<j} (i - j)^2\, w_{\pi_i \pi_j} = \sum_{i<j} (\pi_i^{-1} - \pi_j^{-1})^2\, w_{ij}$$

Define shifted and rescaled inverse permutation indexes:
$$q_i = \frac{\pi_i^{-1} - (n+1)/2}{n/2}, \qquad q_i \in \Big\{-\tfrac{n-1}{n},\ -\tfrac{n-3}{n},\ \ldots,\ \tfrac{n-1}{n}\Big\}$$

Then
$$J(\pi) = \frac{n^2}{8} \sum_{ij} (q_i - q_j)^2\, w_{ij} = \frac{n^2}{4}\, q^T (D - W) q$$
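The identity can be verified numerically; a quick check of my own (helper names are mine):

```python
# Numerical check of
#   J(pi) = sum_{i<j} (pi_i^{-1} - pi_j^{-1})^2 w_ij = (n^2/4) q^T (D - W) q
# with q_i = (pi_i^{-1} - (n+1)/2) / (n/2).
import numpy as np

def J_direct(W, pi):
    n = len(pi)
    inv = np.empty(n)
    inv[pi] = np.arange(1, n + 1)            # inverse permutation (positions)
    return sum((inv[i] - inv[j]) ** 2 * W[i, j]
               for i in range(n) for j in range(i + 1, n))

def J_quadratic(W, pi):
    n = len(pi)
    inv = np.empty(n)
    inv[pi] = np.arange(1, n + 1)
    q = (inv - (n + 1) / 2) / (n / 2)        # shifted, rescaled indexes
    D = np.diag(W.sum(axis=1))
    return n ** 2 / 4 * q @ (D - W) @ q
```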
Distance Sensitive Ordering

Once $q_2$ is computed, since
$$q_2(i) < q_2(j) \ \Rightarrow\ \pi_i^{-1} < \pi_j^{-1},$$
$\pi^{-1}$ can be uniquely recovered from $q_2$.

Implementation: sorting $q_2$ induces $\pi$.
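So the whole ordering procedure reduces to one eigenproblem and an argsort; a minimal sketch of mine (dense symmetric W assumed):

```python
# Sketch: the ordering pi is induced by sorting the second generalized
# eigenvector q_2 of (D - W) q = lambda D q.
import numpy as np
from scipy.linalg import eigh

def spectral_ordering(W):
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)      # generalized symmetric eigenproblem
    return np.argsort(vecs[:, 1])    # sort by q_2 -> ordering pi
```

For a relabeled path graph, this recovers the path order (up to reversal, since eigenvector signs are arbitrary).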
Re-ordering of Genes and Tissues

$$r = \frac{J(\pi)}{J(\mathrm{random})}, \qquad r^{(d=1)} = \frac{J^{(d=1)}(\pi)}{J^{(d=1)}(\mathrm{random})}$$

Results: $r = 0.18$, $r^{(d=1)} = 3.39$.
Spectral Clustering vs. Spectral Ordering

• The continuous approximations of both integer programming problems are given by the same eigenvector.
• Different problems can have the same continuous approximate solution.
• Quality of the approximation:
  – Ordering: better quality, since the solution is relaxed from a set of evenly spaced discrete values.
  – Clustering: lower quality, since the solution is relaxed from only 2 discrete values.
Linearized Cluster Assignment

• Spectral ordering on the connectivity network
• Cluster crossing:
  – sum of similarities along the anti-diagonal
  – gives a 1-D curve with valleys and peaks
  – divide the valleys and peaks into clusters

This turns spectral clustering into a 1-D clustering problem.
Cluster Overlap and Crossing

Given a similarity matrix W and clusters A, B:

• Cluster overlap: $s(A,B) = \sum_{i \in A}\sum_{j \in B} w_{ij}$
• Cluster crossing computes a small fraction of the cluster overlap.
• Cluster crossing depends on an ordering o; it sums the weights crossing site i along the order:
$$\rho(i) = \sum_{j=1}^{m} w_{o(i-j),\, o(i+j)}$$
• This is a sum along the anti-diagonals of W.
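The crossing profile $\rho(i)$ is cheap to compute for a given ordering; an illustrative implementation of mine (the window size m is a free parameter, and out-of-range terms are simply skipped):

```python
# Sketch: the crossing profile rho(i), i.e. anti-diagonal sums of W
# re-ordered by o.
import numpy as np

def cluster_crossing(W, o, m):
    """rho(i) = sum_{j=1..m} w_{o(i-j), o(i+j)}."""
    n = len(o)
    rho = np.zeros(n)
    for i in range(n):
        for j in range(1, m + 1):
            if i - j >= 0 and i + j < n:
                rho[i] += W[o[i - j], o[i + j]]
    return rho
```

With two dense clusters ordered contiguously, the profile dips to a valley at the cluster boundary, which is exactly what the linearized assignment exploits.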
Cluster Crossing (figure)
K-way Clustering Experiments

Accuracy of clustering results:

Method   | Linearized Assignment | Recursive 2-way clustering | Embedding + K-means
Data A   | 89.0%                 | 82.8%                      | 75.1%
Data B   | 75.7%                 | 67.2%                      | 56.4%
Some Additional Advanced/Related Topics

• Random walks and normalized cut
• Semi-definite programming
• Sub-sampling in spectral clustering
• Extension to semi-supervised classification
• Green's function approach
• Out-of-sample embedding