1
A General and Scalable Approach to Mixed Membership Clustering
Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University
December 11, 2012, International Conference on Data Mining
2
Mixed Membership Clustering
3
Motivation
• Spectral clustering is nice
• But two drawbacks:
  ◦ Computationally expensive
  ◦ No mixed-membership clustering
4
Our Solution
• Convert a node-centric representation of the graph to an edge-centric one
• Adapt this representation to work with a scalable clustering method - Power Iteration Clustering
5
Mixed Membership Clustering
6
Perspective
• Since
  ◦ an edge represents a relationship between two entities, and
  ◦ an entity can belong to as many groups as its relationships…
• Why don’t we group the relationships instead of the entities?
7
Edge Clustering
8
Edge Clustering
• Assumptions:
  ◦ An edge represents a relationship between two nodes
  ◦ A node can belong to multiple clusters, but an edge can only belong to one
• Quite general – we can allow parallel edges if needed
9
Edge Clustering
• How to cluster edges?
• Need an edge-centric view of the graph G
  ◦ Traditionally: a line graph L(G)
    • Problem: potential (and likely) size blow-up! size(L(G)) = O(size(G)²)
  ◦ Our solution: a bipartite feature graph B(G)
    • Space-efficient: size(B(G)) = O(size(G))
• Transform edges into nodes!
• Side note: B(G) can also be used to represent tensors efficiently!
10
Edge Clustering
[Figure: an example graph with nodes a–e and edges ab, ac, bc, cd, ce shown three ways: the original graph G; the line graph L(G), which is costly for star-shaped structures; and BFG, the bipartite feature graph B(G), which only uses twice the space of G.]
11
Edge Clustering
• A general recipe:
  1. Transform the affinity matrix A into B(A)
  2. Run the clustering method and get an edge clustering
  3. For each node, determine mixed membership based on the membership of its incident edges
• The matrix dimensions of B(A) are very large – we can only use sparse methods on large datasets
• Perfect for PIC and implicit manifolds! ☺ (a rough sketch of step 1 with sparse matrices follows below)
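To make the recipe concrete, here is a minimal sketch (not the authors' code) of building B(A) with scipy.sparse; the helper name build_bfg and the symmetric 0/1 form assumed for A are my own choices.

import numpy as np
import scipy.sparse as sp

def build_bfg(A):
    # Bipartite feature graph B(A): rows/columns are the |V| original nodes
    # followed by the |E| edges; each edge connects to its two endpoint nodes.
    A = sp.csr_matrix(A)
    n = A.shape[0]
    rows, cols = sp.triu(A, k=1).nonzero()      # take each undirected edge once
    m = len(rows)
    data = np.ones(2 * m)
    r = np.repeat(np.arange(m), 2)              # edge index, repeated for both endpoints
    c = np.column_stack([rows, cols]).ravel()   # the two endpoint node indices
    F = sp.csr_matrix((data, (r, c)), shape=(m, n))   # |E| x |V| edge-node incidence matrix
    # B(A) = [[0, F^T], [F, 0]] is (|V|+|E|) x (|V|+|E|) but has only O(|E|) nonzeros
    return sp.bmat([[None, F.T], [F, None]], format="csr"), F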
12
Edge Clustering
• What are the dimensions of the matrix that represents B(A)?
• If A is a |V| x |V| matrix, then B(A) is a (|V|+|E|) x (|V|+|E|) matrix!
• We need a clustering method that takes full advantage of the sparsity of B(A)!
13
Power Iteration Clustering:Quick Overview
• Spectral clustering methods are nice, a natural choice for graph data
• But they are expensive (slow)
• Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
14
The Power Iteration
[Figure: the power iteration begins with a random vector and ends with a piece-wise constant vector. The overall absolute distance between points decreases; relative distance is shown here.]
15
Implication
• We know: the 2nd to kth eigenvectors of W = D⁻¹A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• Then: a linear combination of piece-wise constant vectors is also piece-wise constant!
16
Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
17
Linear Combination…
[Figure: a·(one piece-wise constant vector) + b·(another) = a vector that is still piece-wise constant.]
18
Power Iteration Clustering
[Figure: PIC results - the one-dimensional embedding vt.]
19
Power Iteration Clustering
• The algorithm: (a step-by-step listing appears in the Additional Slides)
20
• Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace)
• We just need the clusters to be separated in some space
Power Iteration Clustering
21
Mixed Membership Clustering with PIC
• Now we have◦ a sparse matrix representation, and◦ a fast clustering method that works on sparse
matrices• We’re good to go!
Not so fast!!
Iterative methods like PageRank and power iteration don’t work on bipartite graphs, B(A) is a bipartite
graph!
Solution: convert it to a unipartite (aperiodic)
graph!
22
Mixed Membership Clustering with PIC
• Define a similarity function s(i, j):
  ◦ the similarity between edges i and j is proportional to the incident nodes they have in common…
  ◦ …and inversely proportional to the number of edges this shared node is incident to
• Then we simply use a matrix S where S(i, j) = s(i, j) in place of B(A)!
23
Mixed Membership Clustering with PIC
• Now we have
  ◦ a sparse matrix representation, and
  ◦ a fast clustering method that works on sparse matrices, and
  ◦ a unipartite graph
• We’re good to go?
• Similar to line graphs, matrix S may no longer be sparse (e.g., star shapes)! Back to where we started?
24
Mixed Membership Clustering with PIC
• Observation: S can be written as a product of sparse factors (the slide shows S alongside the factors N, F, and Fᵀ - an incidence-style matrix F and a diagonal normalizer N)
25
Mixed Membership Clustering with PIC
• Simply replace one line:
26
Mixed Membership Clustering with PIC
• Simply replace one line:
We get the exact same result, but with all sparse matrix operations (see the sketch below)
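A minimal sketch of what that one-line change might look like, under my reading of the factorization above (S ≈ F N Fᵀ, with F the |E| x |V| edge-node incidence matrix and N a diagonal inverse-degree matrix); this is illustrative, not the authors' exact formulation.

import numpy as np
import scipy.sparse as sp

def edge_matvec(F, v):
    # Replace  v_next = S @ v  (dense |E| x |E|)  with  F (N (F^T v))  (all sparse).
    # Assumes every node has at least one incident edge.
    inv_deg = 1.0 / np.asarray(F.sum(axis=0)).ravel()   # N: one over each node's edge count
    return F @ (inv_deg * (F.T @ v))

# Inside the power iteration, the update  vt+1 <- S vt  becomes
#   vt1 = edge_matvec(F, vt); vt1 /= np.abs(vt1).sum()
# so the |E| x |E| matrix S is never materialized.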
27
• That’s pretty cool. But how well does it work?
28
Experiments
• Compare:
  ◦ NCut
  ◦ Node-PIC (single membership)
  ◦ MM-PIC using different cluster label schemes, ranging from one cluster label per node to many (a labeling sketch follows below):
    • Max - pick the most frequent edge cluster (single membership)
    • T@40 - pick edge clusters with at least 40% frequency
    • T@20 - pick edge clusters with at least 20% frequency
    • T@10 - pick edge clusters with at least 10% frequency
    • All - use all incident edge clusters
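For illustration, a small sketch of how these labeling schemes could be computed from an edge clustering (the function and argument names are mine, not from the paper):

from collections import Counter

def node_memberships(edge_clusters, edges, threshold=None):
    # edge_clusters: one cluster id per edge; edges: list of (u, v) pairs aligned with it.
    # threshold=None -> "Max"; 0.4 / 0.2 / 0.1 -> "T@40" / "T@20" / "T@10"; 0.0 -> "All".
    incident = {}
    for (u, v), c in zip(edges, edge_clusters):
        incident.setdefault(u, []).append(c)
        incident.setdefault(v, []).append(c)

    labels = {}
    for node, cs in incident.items():
        counts = Counter(cs)
        if threshold is None:
            labels[node] = {counts.most_common(1)[0][0]}      # most frequent edge cluster only
        else:
            labels[node] = {c for c, k in counts.items() if k / len(cs) >= threshold}
    return labels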
29
Experiments
• Data source:
  ◦ BlogCat1
    • 10,312 blogs and links
    • 39 overlapping category labels
  ◦ BlogCat2
    • 88,784 blogs and links
    • 60 overlapping category labels
• Datasets:
  ◦ Pick pairs of categories with enough overlap (at least 1%)
  ◦ BlogCat1: 86 category pair datasets
  ◦ BlogCat2: 158 category pair datasets
30
Result
• F1 scores for clustering category pairs from the BlogCat1 dataset:
• Max is better than Node!
• Generally a lower threshold is better, but not All
31
Result
• Important - MM-PIC wins where it matters:
[Figure: each point is a two-cluster dataset; the x-axis is the ratio of mixed-membership instances, the y-axis is the difference in F1 score when the method "wins", and the legend gives the number of datasets where each method "wins".]
• When MM-PIC does better, it does much better
• MM-PIC does better on datasets with more mixed-membership instances
32
MM-PIC Result
• F1 scores for clustering category pairs from the (bigger) BlogCat2 dataset:
• More differences between thresholds - the threshold matters!
• Did not use NCut because the datasets are too big...
33
Result
• Again, MM-PIC wins where it matters:
34
Questions?
[Thesis outline: ch2+3 PIC (ICML 2010; clustering and classification); ch4+5 MRW (ASONAM 2010); ch6 Implicit Manifolds, with ch6.1 IM-PIC (ECAI 2010) and ch6.2 IM-MRW (MLG 2011); ch7 MM-PIC (in submission); ch8 GK SSL (in submission); ch9 Future Work.]
35
Additional Slides
36
Power Iteration Clustering
• Spectral clustering methods are nice, a natural choice for graph data
• But they are expensive (slow)
• Power iteration clustering (PIC) can provide a similar solution at a very low cost (fast)!
37
Background: Spectral Clustering
Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive A from s; let W = I - D⁻¹A, where D is a diagonal matrix with D(i,i) = Σⱼ A(i,j)
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues
5. Project the data points onto the space spanned by these eigenvectors
6. Run k-means on the projected data points
(A rough sketch of this procedure in code follows below.)
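A compact sketch of the procedure above, assuming a dense, symmetric affinity matrix A and a connected graph (not optimized, and not the authors' code):

import numpy as np
from sklearn.cluster import KMeans

def ncut_cluster(A, k):
    d = A.sum(axis=1)                       # node degrees
    W = np.eye(len(A)) - A / d[:, None]     # W = I - D^-1 A
    vals, vecs = np.linalg.eig(W)           # fine for small graphs; use sparse solvers at scale
    order = np.argsort(vals.real)
    embedding = vecs[:, order[1:k]].real    # eigenvectors with the 2nd..kth smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)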
38
Background: Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
39
Background: Spectral Clustering
Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive A from s; let W = I - D⁻¹A, where D is a diagonal matrix with D(i,i) = Σⱼ A(i,j)
3. Find the eigenvectors and corresponding eigenvalues of W
4. Pick the eigenvectors of W with the 2nd to kth smallest corresponding eigenvalues
5. Project the data points onto the space spanned by these eigenvectors
6. Run k-means on the projected data points

Finding eigenvectors and eigenvalues of a matrix is slow in general (there are more efficient approximation methods*). Can we find a similar low-dimensional embedding for clustering without eigenvectors?
Note: the eigenvectors of I - D⁻¹A corresponding to the smallest eigenvalues are the eigenvectors of D⁻¹A corresponding to the largest.
40
The Power Iteration
• The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix:
  vt+1 = c W vt

where W is a square matrix; vt is the vector at iteration t, with v0 typically a random vector; and c is a normalizing constant that keeps vt from getting too large or too small.
• Typically converges quickly; fairly efficient if W is a sparse matrix (see the sketch below)
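A minimal sketch of the update above in numpy (the normalization choice c = 1/||W v|| is mine):

import numpy as np

def power_iteration(W, t_max=1000, tol=1e-9, seed=0):
    # vt+1 = c W vt: multiply by W, renormalize, repeat until the vector stops changing
    v = np.random.default_rng(seed).random(W.shape[0])
    for _ in range(t_max):
        v_new = W @ v
        v_new /= np.linalg.norm(v_new)      # c keeps vt from growing or shrinking
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    return v                                # approaches the dominant eigenvector of W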
41
The Power Iteration
• The power iteration is a simple iterative method for finding the dominant eigenvector of a matrix:
  vt+1 = c W vt

• What if we let W = D⁻¹A (like Normalized Cut), i.e., a row-normalized affinity matrix?
42
The Power Iteration
[Figure: the power iteration begins with a random vector and ends with a piece-wise constant vector. The overall absolute distance between points decreases; relative distance is shown here.]
43
Implication
• We know: the 2nd to kth eigenvectors of W = D⁻¹A are roughly piece-wise constant with respect to the underlying clusters, each separating a cluster from the rest of the data (Meila & Shi 2001)
• Then: a linear combination of piece-wise constant vectors is also piece-wise constant!
44
Spectral Clustering
[Figure: three example datasets and their spectral embeddings; the 2nd and 3rd smallest eigenvectors, plotted as value vs. index, separate the three clusters, and together they form the "clustering space".]
45
Linear Combination…
[Figure: a·(one piece-wise constant vector) + b·(another) = a vector that is still piece-wise constant.]
46
Power Iteration Clustering
[Figure: PIC results - the one-dimensional embedding vt.]
47
• Key idea: to do clustering, we may not need all the information in a full spectral embedding (e.g., distances between clusters in a k-dimensional eigenspace)
• We just need the clusters to be separated in some space
Power Iteration Clustering
48
When to Stop
The power iteration with its components:

  vt = c1λ1^t e1 + … + ckλk^t ek + ck+1λk+1^t ek+1 + … + cnλn^t en

If we normalize (dividing through by c1λ1^t):

  vt / (c1λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + … + (ck/c1)(λk/λ1)^t ek + … + (cn/c1)(λn/λ1)^t en

At the beginning, v changes fast, "accelerating" to converge locally, due to the "noise terms" (k+1 … n) with small λ. When the "noise terms" have gone to zero, v changes slowly ("constant speed") because only the larger-λ terms (2 … k) are left, where the eigenvalue ratios are close to 1. Because they are raised to the power t, the eigenvalue ratios determine how fast v converges to e1.
49
Power Iteration Clustering
• A basic power iteration clustering (PIC) algorithm:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck
1. Pick an initial vector v0
2. Repeat:
   • Set vt+1 ← Wvt
   • Set δt+1 ← |vt+1 - vt|
   • Increment t
   • Stop when |δt - δt-1| ≈ 0
3. Use k-means to cluster points on vt and return clusters C1, C2, …, Ck
(A rough sketch in code follows below.)
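A rough sketch of the listing above using numpy and scikit-learn's k-means; the normalization and stopping threshold here are simplifications of the published algorithm, not a faithful reimplementation.

import numpy as np
from sklearn.cluster import KMeans

def pic(W, k, max_iter=1000, eps=1e-5, seed=0):
    # W: row-normalized affinity matrix (dense or scipy.sparse); k: number of clusters
    n = W.shape[0]
    v = np.random.default_rng(seed).random(n)
    v /= np.abs(v).sum()
    delta_prev = None
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()                  # keep vt at a fixed scale
        delta = np.abs(v_new - v).sum()               # delta_{t+1} = |v_{t+1} - v_t|
        v = v_new
        if delta_prev is not None and abs(delta - delta_prev) < eps / n:
            break                                     # |delta_t - delta_{t-1}| ~ 0
        delta_prev = delta
    # cluster the one-dimensional embedding with k-means
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))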
50
Evaluating Clustering for Network Datasets
• Each dataset is an undirected, weighted, connected graph
• Every node is labeled by a human to belong to one of k classes
• Clustering methods are only given k and the input graph
• Clusters are matched to classes using the Hungarian algorithm (a sketch follows below)
• We use classification metrics such as accuracy, precision, recall, and F1 score; we also use clustering metrics such as purity and normalized mutual information (NMI)
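For concreteness, a small sketch of the cluster-to-class matching and accuracy computation, using scipy's Hungarian solver and assuming labels are integers 0..k-1 (names are mine):

import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(true_labels, pred_clusters, k):
    true_labels = np.asarray(true_labels)
    pred_clusters = np.asarray(pred_clusters)
    confusion = np.zeros((k, k), dtype=int)          # confusion[class, cluster] = count
    for t, p in zip(true_labels, pred_clusters):
        confusion[t, p] += 1
    rows, cols = linear_sum_assignment(-confusion)   # Hungarian matching, maximizing overlap
    return confusion[rows, cols].sum() / len(true_labels)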
51
PIC Runtime
[Figure: runtime comparison against Normalized Cut and against Normalized Cut with faster eigencomputation; "ran out of memory (24GB)" marks runs that did not finish.]
52
PIC Accuracy on Network Datasets
[Figure: accuracy scatter plot; points in the upper triangle mean PIC does better, points in the lower triangle mean NCut or NJW does better.]
53
Multi-Dimensional PIC
• One robustness question for vanilla PIC as data size and complexity grow: how many (noisy) clusters can you fit in one dimension without them "colliding"?
[Figure: one embedding with cluster signals cleanly separated; another where they are a little too close for comfort.]
54
Multi-Dimensional PIC
• Solution:
  ◦ Run PIC d times with different random starts and construct a d-dimensional embedding
  ◦ It is unlikely that any pair of clusters collides on all d dimensions
(A sketch of the idea follows below.)
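A sketch of the idea; a fixed number of iterations per run keeps it short, whereas the real algorithm would keep PIC's stopping rule (names are mine):

import numpy as np
from sklearn.cluster import KMeans

def pic_vector(W, n_iter=50, seed=0):
    # one PIC-style run from a random start, on a row-normalized affinity matrix W
    v = np.random.default_rng(seed).random(W.shape[0])
    v /= np.abs(v).sum()
    for _ in range(n_iter):
        v = W @ v
        v /= np.abs(v).sum()
    return v

def multi_dim_pic(W, k, d=4):
    # stack d independent runs so clusters are unlikely to collide in every dimension
    V = np.column_stack([pic_vector(W, seed=s) for s in range(d)])
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)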
55
Multi-Dimensional PIC
• Results on network classification datasets:
[Figure: RED - PIC using 1 random start vector; GREEN - PIC using 1 degree start vector; BLUE - PIC using 4 random start vectors.]
• 1-D PIC embeddings lose on accuracy at higher k’s (# of clusters) compared to NCut and NJW, but using 4 random vectors instead helps!
• Note that the # of vectors << k
56
PIC Related Work
• Related clustering methods:
• PIC is the only one using a reduced dimensionality – a critical feature for graph data!
57
Multi-Dimensional PIC
• Results on name disambiguation datasets:
• Again, using 4 random vectors seems to work!
• Again, note that the # of vectors << k
58
PIC: Versus Popular Fast Sparse Eigencomputation Methods
Eigensolvers (for symmetric matrices / for general matrices) and the improvement each brings:
• Successive power method (same for both) - basic; numerically unstable, can be slow
• Lanczos method / Arnoldi method - more stable, but require lots of memory
• Implicitly restarted Lanczos method (IRLM) / implicitly restarted Arnoldi method (IRAM) - more memory-efficient

Method / Time / Space:
• IRAM - time: (O(m³) + (O(nm) + O(e)) × O(m - k)) × (# restarts); space: O(e) + O(nm)
• PIC - time: O(e) × (# iterations); space: O(e)

Randomized sampling methods are also popular.
59
PIC: Another View
• PIC’s low-dimensional embedding, which we will call a power iteration embedding (PIE), is related to diffusion maps:
(Coifman & Lafon 2006)
62
PIC: Another View
• Result: PIE is a random projection of the data in the diffusion space W with scale parameter t
• We can use results from diffusion maps for applying PIC!
• We can also use results from random projection for applying PIC!
63
PIC Extension: Hierarchical Clustering
• Real, large-scale data may not have a “flat” clustering structure
• A hierarchical view may be more useful
• Good news: the dynamics of a PIC embedding display a hierarchically convergent behavior!
64
PIC Extension: Hierarchical Clustering
• Why?
• Recall the PIC embedding at time t:

  vt / (c1λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + (c3/c1)(λ3/λ1)^t e3 + … + (cn/c1)(λn/λ1)^t en

• The e’s are eigenvectors (structure), ordered from big to small λ
• Less significant eigenvectors / structures go away first, one by one
• More salient structures stick around
• There may not be a clear eigengap - rather a gradient of cluster saliency
65
PIC Extension: Hierarchical Clustering
• PIC already converged to 8 clusters… but let’s keep on iterating…
• “N” is still a part of the “2009” cluster… Yes (it might take a while)
• Similar behavior is also noted in matrix-matrix power methods (diffusion maps, mean-shift, multi-resolution spectral clustering)
[Figure: same dataset you’ve seen earlier.]
66
Distributed / Parallel Implementations
• Distributed / parallel implementations of learning methods are necessary to support large-scale data, given the direction of hardware development
• PIC, MRW, and their path folding variants have at their core sparse matrix-vector multiplications
• Sparse matrix-vector multiplication lends itself well to a distributed / parallel computing framework
• We propose to use …
• Alternatives: …
• Existing graph analysis tool: …
67
Adjacency Matrix vs. Similarity Matrix
• Adjacency matrix: A
• Similarity matrix: S = A + I
• Eigenanalysis:

  Ax = λx
  Sx = (A + I)x = Ax + x = (λ + 1)x

• Same eigenvectors and same ordering of eigenvalues!
• What about the normalized versions?
(A small numerical check follows below.)
68
Adjacency Matrix vs. Similarity Matrix
• Normalized adjacency matrix: D⁻¹A
• Normalized similarity matrix: D̂⁻¹(A + I), where D̂ is the degree matrix of A + I
• Eigenanalysis:

  D⁻¹Ax = λx
  D̂⁻¹(A + I)x = D̂⁻¹Ax + D̂⁻¹x

• The eigenvectors are the same if the degree is the same for every node
• Recent work on the degree-corrected Laplacian (Chaudhuri 2012) suggests that it is advantageous to tune α for clustering graphs with a skewed degree distribution, and does further analysis