1
A General and Scalable Approach to Mixed Membership Clustering
Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University
December 11, 2012, International Conference on Data Mining
2
Mixed Membership Clustering
3
Motivation
• Spectral clustering is nice
• But two drawbacks:
  ◦ Computationally expensive
  ◦ No mixed-membership clustering
4
Our Solution
• Convert a node-centric representation of the graph to an edge-centric one
• Adapt this representation to work with a scalable clustering method - Power Iteration Clustering
5
Mixed Membership Clustering
6
Perspective
• Since
  ◦ an edge represents a relationship between two entities, and
  ◦ an entity can belong to as many groups as its relationships…
• Why don’t we group the relationships instead of the entities?
7
Edge Clustering
8
Edge Clustering
• Assumptions:
  ◦ An edge represents a relationship between two nodes
  ◦ A node can belong to multiple clusters, but an edge can only belong to one
• Quite general – we can allow parallel edges if needed
9
Edge Clustering
• How to cluster edges?
• Need an edge-centric view of the graph G
  ◦ Traditionally: a line graph L(G)
    • Problem: potential (and likely) size blow-up! size(L(G)) = O(size(G)²)
  ◦ Our solution: a bipartite feature graph B(G)
    • Space-efficient: size(B(G)) = O(size(G))
• Transform edges into nodes!
• Side note: B(G) can also be used to represent tensors efficiently!
10
Edge Clustering
[Figure: an example graph with nodes a–e and edges ab, ac, bc, cd, ce shown three ways: the original graph G; the line graph L(G), which is costly for star-shaped structures; and BFG, the bipartite feature graph B(G), which only uses twice the space of G.]
11
Edge Clustering
• A general recipe:
  1. Transform the affinity matrix A into B(A)
  2. Run the clustering method and get an edge clustering
  3. For each node, determine mixed membership based on the membership of its incident edges
• The matrix dimensions of B(A) are very large – we can only use sparse methods on large datasets
• Perfect for PIC and implicit manifolds! ☺ (a rough sketch of step 1 with sparse matrices follows below)
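To make the recipe concrete, here is a minimal sketch (not the authors' code) of building B(A) with scipy.sparse; the helper name build_bfg and the symmetric 0/1 form assumed for A are my own choices.

import numpy as np
import scipy.sparse as sp

def build_bfg(A):
    # Bipartite feature graph B(A): rows/columns are the |V| original nodes
    # followed by the |E| edges; each edge connects to its two endpoint nodes.
    A = sp.csr_matrix(A)
    n = A.shape[0]
    rows, cols = sp.triu(A, k=1).nonzero()      # take each undirected edge once
    m = len(rows)
    data = np.ones(2 * m)
    r = np.repeat(np.arange(m), 2)              # edge index, repeated for both endpoints
    c = np.column_stack([rows, cols]).ravel()   # the two endpoint node indices
    F = sp.csr_matrix((data, (r, c)), shape=(m, n))   # |E| x |V| edge-node incidence matrix
    # B(A) = [[0, F^T], [F, 0]] is (|V|+|E|) x (|V|+|E|) but has only O(|E|) nonzeros
    return sp.bmat([[None, F.T], [F, None]], format="csr"), F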
12
Edge Clustering
• What are the dimensions of the matrix that represents B(A)?
• If A is a |V| x |V| matrix, then B(A) is a (|V|+|E|) x (|V|+|E|) matrix!
• We need a clustering method that takes full advantage of the sparsity of B(A)!
13
Power Iteration Clustering:Quick Overview
• Spectral clustering methods are nice, a natural choice for graph data
• But they are expensive (slow)
• Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
14
The Power Iteration
[Figure: the power iteration begins with a random vector and ends with a piece-wise constant vector. The overall absolute distance between points decreases; relative distance is shown here.]
15
Implication
• We know: the 2nd to kth eigenvectors of W = D⁻¹A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• Then: a linear combination of piece-wise constant vectors is also piece-wise constant!
16
Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
17
Linear Combination…
[Figure: a·(one piece-wise constant vector) + b·(another) = a vector that is still piece-wise constant.]
18
Power Iteration Clustering
[Figure: PIC results - the one-dimensional embedding vt.]
19
Power Iteration Clustering
• The algorithm: (a step-by-step listing appears in the Additional Slides)
20
• Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace)
• We just need the clusters to be separated in some space
Power Iteration Clustering
21
Mixed Membership Clustering with PIC
• Now we have◦ a sparse matrix representation, and◦ a fast clustering method that works on sparse
matrices• We’re good to go!
Not so fast!!
Iterative methods like PageRank and power iteration don’t work on bipartite graphs, B(A) is a bipartite
graph!
Solution: convert it to a unipartite (aperiodic)
graph!
22
Mixed Membership Clustering with PIC
• Define a similarity function s(i, j):
  ◦ the similarity between edges i and j is proportional to the incident nodes they have in common…
  ◦ …and inversely proportional to the number of edges this shared node is incident to
• Then we simply use a matrix S where S(i, j) = s(i, j) in place of B(A)!
23
Mixed Membership Clustering with PIC
• Now we have
  ◦ a sparse matrix representation, and
  ◦ a fast clustering method that works on sparse matrices, and
  ◦ a unipartite graph
• We’re good to go?
• Similar to line graphs, matrix S may no longer be sparse (e.g., star shapes)! Back to where we started?
24
Mixed Membership Clustering with PIC
• Observation: S can be written as a product of sparse factors (the slide shows S alongside the factors N, F, and Fᵀ - an incidence-style matrix F and a diagonal normalizer N)
25
Mixed Membership Clustering with PIC
• Simply replace one line:
26
Mixed Membership Clustering with PIC
• Simply replace one line:
We get the exact same result, but with all sparse matrix operations (see the sketch below)
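A minimal sketch of what that one-line change might look like, under my reading of the factorization above (S ≈ F N Fᵀ, with F the |E| x |V| edge-node incidence matrix and N a diagonal inverse-degree matrix); this is illustrative, not the authors' exact formulation.

import numpy as np
import scipy.sparse as sp

def edge_matvec(F, v):
    # Replace  v_next = S @ v  (dense |E| x |E|)  with  F (N (F^T v))  (all sparse).
    # Assumes every node has at least one incident edge.
    inv_deg = 1.0 / np.asarray(F.sum(axis=0)).ravel()   # N: one over each node's edge count
    return F @ (inv_deg * (F.T @ v))

# Inside the power iteration, the update  vt+1 <- S vt  becomes
#   vt1 = edge_matvec(F, vt); vt1 /= np.abs(vt1).sum()
# so the |E| x |E| matrix S is never materialized.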
27
• That’s pretty cool. But how well does it work?
28
Experiments
• Compare:
  ◦ NCut
  ◦ Node-PIC (single membership)
  ◦ MM-PIC using different cluster label schemes, ranging from one cluster label per node to many (a labeling sketch follows below):
    • Max - pick the most frequent edge cluster (single membership)
    • T@40 - pick edge clusters with at least 40% frequency
    • T@20 - pick edge clusters with at least 20% frequency
    • T@10 - pick edge clusters with at least 10% frequency
    • All - use all incident edge clusters
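For illustration, a small sketch of how these labeling schemes could be computed from an edge clustering (the function and argument names are mine, not from the paper):

from collections import Counter

def node_memberships(edge_clusters, edges, threshold=None):
    # edge_clusters: one cluster id per edge; edges: list of (u, v) pairs aligned with it.
    # threshold=None -> "Max"; 0.4 / 0.2 / 0.1 -> "T@40" / "T@20" / "T@10"; 0.0 -> "All".
    incident = {}
    for (u, v), c in zip(edges, edge_clusters):
        incident.setdefault(u, []).append(c)
        incident.setdefault(v, []).append(c)

    labels = {}
    for node, cs in incident.items():
        counts = Counter(cs)
        if threshold is None:
            labels[node] = {counts.most_common(1)[0][0]}      # most frequent edge cluster only
        else:
            labels[node] = {c for c, k in counts.items() if k / len(cs) >= threshold}
    return labels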
29
Experiments
• Data source:
  ◦ BlogCat1
    • 10,312 blogs and links
    • 39 overlapping category labels
  ◦ BlogCat2
    • 88,784 blogs and links
    • 60 overlapping category labels
• Datasets:
  ◦ Pick pairs of categories with enough overlap (at least 1%)
  ◦ BlogCat1: 86 category pair datasets
  ◦ BlogCat2: 158 category pair datasets
30
Result
• F1 scores for clustering category pairs from the BlogCat1 dataset:
• Max is better than Node!
• Generally a lower threshold is better, but not All
31
Result
• Important - MM-PIC wins where it matters:
[Figure: each point is a two-cluster dataset; the x-axis is the ratio of mixed-membership instances, the y-axis is the difference in F1 score when the method "wins", and the legend gives the number of datasets where each method "wins".]
• When MM-PIC does better, it does much better
• MM-PIC does better on datasets with more mixed-membership instances
32
MM-PIC Result
• F1 scores for clustering category pairs from the (bigger) BlogCat2 dataset:
• More differences between thresholds - the threshold matters!
• Did not use NCut because the datasets are too big...
33
Result
• Again, MM-PIC wins where it matters:
34
Questions?
[Thesis outline: ch2+3 PIC (ICML 2010; clustering and classification); ch4+5 MRW (ASONAM 2010); ch6 Implicit Manifolds, with ch6.1 IM-PIC (ECAI 2010) and ch6.2 IM-MRW (MLG 2011); ch7 MM-PIC (in submission); ch8 GK SSL (in submission); ch9 Future Work.]
35
Additional Slides
36
Power Iteration Clustering
• Spectral clustering methods are nice, a natural choice for graph data
• But they are expensive (slow)
• Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
37
Background: Spectral Clustering
Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive A from s; let W = I - D⁻¹A, where D is a diagonal matrix with D(i,i) = Σⱼ A(i,j)
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues
5. Project the data points onto the space spanned by these eigenvectors
6. Run k-means on the projected data points
(A rough sketch of this procedure in code follows below.)
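A compact sketch of the procedure above, assuming a dense, symmetric affinity matrix A and a connected graph (not optimized, and not the authors' code):

import numpy as np
from sklearn.cluster import KMeans

def ncut_cluster(A, k):
    d = A.sum(axis=1)                       # node degrees
    W = np.eye(len(A)) - A / d[:, None]     # W = I - D^-1 A
    vals, vecs = np.linalg.eig(W)           # fine for small graphs; use sparse solvers at scale
    order = np.argsort(vals.real)
    embedding = vecs[:, order[1:k]].real    # eigenvectors with the 2nd..kth smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)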
38
Background: Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
39
Background: Spectral Clustering
Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive A from s; let W = I - D⁻¹A, where D is a diagonal matrix with D(i,i) = Σⱼ A(i,j)
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues
5. Project the data points onto the space spanned by these eigenvectors
6. Run k-means on the projected data points

Finding eigenvectors and eigenvalues of a matrix is slow in general (there are more efficient approximation methods*). Can we find a similar low-dimensional embedding for clustering without eigenvectors?
Note: the eigenvectors of I - D⁻¹A corresponding to the smallest eigenvalues are the eigenvectors of D⁻¹A corresponding to the largest.
40
The Power Iteration
• The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix:
  vt+1 = c W vt

where W is a square matrix; vt is the vector at iteration t, with v0 typically a random vector; and c is a normalizing constant that keeps vt from getting too large or too small.
• Typically converges quickly; fairly efficient if W is a sparse matrix (see the sketch below)
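A minimal sketch of the update above in numpy (the normalization choice c = 1/||W v|| is mine):

import numpy as np

def power_iteration(W, t_max=1000, tol=1e-9, seed=0):
    # vt+1 = c W vt: multiply by W, renormalize, repeat until the vector stops changing
    v = np.random.default_rng(seed).random(W.shape[0])
    for _ in range(t_max):
        v_new = W @ v
        v_new /= np.linalg.norm(v_new)      # c keeps vt from growing or shrinking
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    return v                                # approaches the dominant eigenvector of W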
41
The Power Iteration
• The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix:
  vt+1 = c W vt

• What if we let W = D⁻¹A (like Normalized Cut), i.e., a row-normalized affinity matrix?
42
The Power Iteration
[Figure: the power iteration begins with a random vector and ends with a piece-wise constant vector. The overall absolute distance between points decreases; relative distance is shown here.]
43
Implication
• We know: the 2nd to kth eigenvectors of W = D⁻¹A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• Then: a linear combination of piece-wise constant vectors is also piece-wise constant!
44
Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
45
Linear Combination…
[Figure: a·(one piece-wise constant vector) + b·(another) = a vector that is still piece-wise constant.]
46
Power Iteration Clustering
[Figure: PIC results - the one-dimensional embedding vt.]
47
• Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace)
• We just need the clusters to be separated in some space
Power Iteration Clustering
48
When to Stop
The power iteration with its components:

  vt = c1λ1^t e1 + … + ckλk^t ek + ck+1λk+1^t ek+1 + … + cnλn^t en

If we normalize (dividing through by c1λ1^t):

  vt / (c1λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + … + (ck/c1)(λk/λ1)^t ek + … + (cn/c1)(λn/λ1)^t en

At the beginning, v changes fast, "accelerating" to converge locally, due to the "noise terms" (k+1 … n) with small λ. When the "noise terms" have gone to zero, v changes slowly ("constant speed") because only the larger-λ terms (2 … k) are left, where the eigenvalue ratios are close to 1. Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e1.
49
Power Iteration Clustering
• A basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck
1. Pick an initial vector v0
2. Repeat:
   • Set vt+1 ← Wvt
   • Set δt+1 ← |vt+1 - vt|
   • Increment t
   • Stop when |δt - δt-1| ≈ 0
3. Use k-means to cluster points on vt and return clusters C1, C2, …, Ck
(A rough sketch in code follows below.)
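A rough sketch of the listing above using numpy and scikit-learn's k-means; the normalization and stopping threshold here are simplifications of the published algorithm, not a faithful reimplementation.

import numpy as np
from sklearn.cluster import KMeans

def pic(W, k, max_iter=1000, eps=1e-5, seed=0):
    # W: row-normalized affinity matrix (dense or scipy.sparse); k: number of clusters
    n = W.shape[0]
    v = np.random.default_rng(seed).random(n)
    v /= np.abs(v).sum()
    delta_prev = None
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()                  # keep vt at a fixed scale
        delta = np.abs(v_new - v).sum()               # delta_{t+1} = |v_{t+1} - v_t|
        v = v_new
        if delta_prev is not None and abs(delta - delta_prev) < eps / n:
            break                                     # |delta_t - delta_{t-1}| ~ 0
        delta_prev = delta
    # cluster the one-dimensional embedding with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))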
50
Evaluating Clustering for Network Datasets
• Each dataset is an undirected, weighted, connected graph
• Every node is labeled by a human to belong to one of k classes
• Clustering methods are only given k and the input graph
• Clusters are matched to classes using the Hungarian algorithm (a sketch follows below)
• We use classification metrics such as accuracy, precision, recall, and F1 score; we also use clustering metrics such as purity and normalized mutual information (NMI)
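For concreteness, a small sketch of the cluster-to-class matching and accuracy computation, using scipy's Hungarian solver and assuming labels are integers 0..k-1 (names are mine):

import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(true_labels, pred_clusters, k):
    true_labels = np.asarray(true_labels)
    pred_clusters = np.asarray(pred_clusters)
    confusion = np.zeros((k, k), dtype=int)          # confusion[class, cluster] = count
    for t, p in zip(true_labels, pred_clusters):
        confusion[t, p] += 1
    rows, cols = linear_sum_assignment(-confusion)   # Hungarian matching, maximizing overlap
    return confusion[rows, cols].sum() / len(true_labels)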
51
PIC Runtime
[Figure: runtime comparison against Normalized Cut and against Normalized Cut with faster eigencomputation; "ran out of memory (24GB)" marks runs that did not finish.]
52
PIC Accuracy on Network Datasets
[Figure: accuracy scatter plot; points in the upper triangle mean PIC does better, points in the lower triangle mean NCut or NJW does better.]
53
Multi-Dimensional PIC
• One robustness question for vanilla PIC as data size and complexity grow: how many (noisy) clusters can you fit in one dimension without them "colliding"?
[Figure: one embedding with cluster signals cleanly separated; another where they are a little too close for comfort.]
54
Multi-Dimensional PIC
• Solution:
  ◦ Run PIC d times with different random starts and construct a d-dimensional embedding
  ◦ It is unlikely that any pair of clusters collides on all d dimensions
(A sketch of the idea follows below.)
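A sketch of the idea; a fixed number of iterations per run keeps it short, whereas the real algorithm would keep PIC's stopping rule (names are mine):

import numpy as np
from sklearn.cluster import KMeans

def pic_vector(W, n_iter=50, seed=0):
    # one PIC-style run from a random start, on a row-normalized affinity matrix W
    v = np.random.default_rng(seed).random(W.shape[0])
    v /= np.abs(v).sum()
    for _ in range(n_iter):
        v = W @ v
        v /= np.abs(v).sum()
    return v

def multi_dim_pic(W, k, d=4):
    # stack d independent runs so clusters are unlikely to collide in every dimension
    V = np.column_stack([pic_vector(W, seed=s) for s in range(d)])
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)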
55
Multi-Dimensional PIC
• Results on network classification datasets:
[Figure: RED - PIC using 1 random start vector; GREEN - PIC using 1 degree start vector; BLUE - PIC using 4 random start vectors.]
• 1-D PIC embeddings lose on accuracy at higher k’s (# of clusters) compared to NCut and NJW, but using 4 random vectors instead helps!
• Note that the # of vectors << k
56
PIC Related Work
• Related clustering methods:
• PIC is the only one using a reduced dimensionality – a critical feature for graph data!
57
Multi-Dimensional PIC
• Results on name disambiguation datasets:
• Again, using 4 random vectors seems to work!
• Again, note that the # of vectors << k
58
PIC: Versus Popular Fast Sparse Eigencomputation Methods
Eigensolvers (for symmetric matrices / for general matrices) and the improvement each brings:
• Successive power method (same for both) - basic; numerically unstable, can be slow
• Lanczos method / Arnoldi method - more stable, but require lots of memory
• Implicitly restarted Lanczos method (IRLM) / implicitly restarted Arnoldi method (IRAM) - more memory-efficient

Method / Time / Space:
• IRAM - time: (O(m³) + (O(nm) + O(e)) × O(m - k)) × (# restarts); space: O(e) + O(nm)
• PIC - time: O(e) × (# iterations); space: O(e)

Randomized sampling methods are also popular.
59
PIC: Another View
• PIC’s low-dimensional embedding, which we will call a power iteration embedding (PIE), is related to diffusion maps:
(Coifman & Lafon 2006)
62
PIC: Another View
• Result: PIE is a random projection of the data in the diffusion space W with scale parameter t
• We can use results from diffusion maps for applying PIC!
• We can also use results from random projection for applying PIC!
63
PIC Extension: Hierarchical Clustering
• Real, large-scale data may not have a “flat” clustering structure
• A hierarchical view may be more useful
• Good news: the dynamics of a PIC embedding display a hierarchically convergent behavior!
64
PIC Extension: Hierarchical Clustering
• Why?
• Recall the PIC embedding at time t:

  vt / (c1λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + (c3/c1)(λ3/λ1)^t e3 + … + (cn/c1)(λn/λ1)^t en

• The e’s are eigenvectors (structure), ordered from big to small λ
• Less significant eigenvectors / structures go away first, one by one
• More salient structures stick around
• There may not be a clear eigengap - rather a gradient of cluster saliency
65
PIC Extension: Hierarchical Clustering
• PIC already converged to 8 clusters… but let’s keep on iterating…
• “N” is still a part of the “2009” cluster… Yes (it might take a while)
• Similar behavior is also noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering)
[Figure: same dataset you’ve seen earlier.]
66
Distributed / Parallel Implementations
• Distributed / parallel implementations of learning methods are necessary to support large-scale data, given the direction of hardware development
• PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications
• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework
• We propose to use …
• Alternatives: …
• Existing graph analysis tool: …
67
Adjacency Matrix vs. Similarity Matrix
• Adjacency matrix: A
• Similarity matrix: S = A + I
• Eigenanalysis:

  Ax = λx
  Sx = (A + I)x = Ax + x = (λ + 1)x

• Same eigenvectors and same ordering of eigenvalues!
• What about the normalized versions?
(A small numerical check follows below.)
68
Adjacency Matrix vs. Similarity Matrix
• Normalized adjacency matrix: D⁻¹A
• Normalized similarity matrix: D̂⁻¹(A + I), where D̂ is the degree matrix of A + I
• Eigenanalysis:

  D⁻¹Ax = λx
  D̂⁻¹(A + I)x = D̂⁻¹Ax + D̂⁻¹x

• The eigenvectors are the same if the degree is the same for every node
• Recent work on the degree-corrected Laplacian (Chaudhuri 2012) suggests that it is advantageous to tune α for clustering graphs with a skewed degree distribution, and does further analysis