Date post: 28-Sep-2020
CMU SCS Talk 3: Graph Mining Tools – Tensors, communities, parallelism Christos Faloutsos CMU

CMU SCS

Talk 3: Graph Mining Tools – Tensors, communities, parallelism

Christos Faloutsos CMU


(C) 2011, C. Faloutsos 2

Overall Outline

•  Introduction – Motivation
•  Talk#1: Patterns in graphs; generators
•  Talk#2: Tools (Ranking, proximity)
•  Talk#3: Tools (Tensors, scalability)
•  Conclusions

KAIST-2011

Outline

•  Task 4: time-evolving graphs – tensors
•  Task 5: community detection
•  Task 6: virus propagation
•  Task 7: scalability, parallelism and Hadoop
•  Conclusions

Thanks to

•  Tamara Kolda (Sandia) for the foils on tensor definitions, and on TOPHITS

Detailed outline

•  Motivation
•  Definitions: PARAFAC and Tucker
•  Case study: web mining

Examples of Matrices: Authors and terms

[Figure: matrix with one row per author (John, Peter, Mary, Nick, ...) and one column per term (data, mining, classif., tree, ...)]


Motivation: Why tensors?

•  Q: what is a tensor?

•  A: N-D generalization of matrix:

[Figure: the author x term matrix (authors John, Peter, Mary, Nick, ...; terms data, mining, classif., tree, ...) stacked along a third axis, one slice per conference: KDD’07, KDD’08, KDD’09]


Tensors are useful for 3 or more modes

Terminology: ‘mode’ (or ‘aspect’):

[Figure: a 3-mode tensor with its axes labeled Mode (== aspect) #1, Mode #2, Mode #3; the term axis reads data, mining, classif., tree, ...]
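To make the "mode" terminology concrete, here is a minimal sketch that stores a 3-mode author x keyword x conference tensor sparsely. The toy counts and the helper names `mode_sizes` and `collapse` are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# Hypothetical toy data: (author, keyword, conference) -> count.
# Only nonzero cells are stored, since tensors of this kind are usually sparse.
tensor = {
    ("John", "data", "KDD'07"): 2,
    ("John", "mining", "KDD'08"): 1,
    ("Peter", "tree", "KDD'09"): 3,
    ("Mary", "classif.", "KDD'09"): 1,
}

def mode_sizes(t):
    """Return the number of distinct indices along each mode."""
    n_modes = len(next(iter(t)))
    seen = [set() for _ in range(n_modes)]
    for key in t:
        for mode, index in enumerate(key):
            seen[mode].add(index)
    return [len(s) for s in seen]

def collapse(t, mode):
    """Sum out one mode; dropping mode 2 gives the author x keyword matrix."""
    out = defaultdict(int)
    for key, value in t.items():
        out[key[:mode] + key[mode + 1:]] += value
    return dict(out)

sizes = mode_sizes(tensor)          # one entry per mode
matrix = collapse(tensor, mode=2)   # author x keyword counts
```

Collapsing the third mode recovers the plain authors-and-terms matrix from the earlier slide, which is one way to see the tensor as its N-D generalization.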

Notice

•  3rd mode does not need to be time
•  we can have more than 3 modes

[Figure: tensor with modes IP source, IP destination, and Dest. port (port values such as 80 and 125)]

•  Example with more than 3 modes: fMRI, with modes x, y, z, time, person-id, task-id

http://denlab.temple.edu/bidms/cgi-bin/browse.cgi

From DENLAB, Temple U. (Prof. V. Megalooikonomou +)

Motivating Applications

•  Why are tensors useful?
   –  web mining (TOPHITS)
   –  environmental sensors
   –  intrusion detection (src, dst, time, dest-port)
   –  social networks (src, dst, time, type-of-contact)
   –  face recognition
   –  etc.



Tensor basics

•  Multi-mode extensions of SVD – recall that:

Reminder: SVD

–  Best rank-k approximation in L2

$$A_{m \times n} = U \, \Sigma \, V^T$$

$$A \approx \sigma_1 \, u_1 \circ v_1 + \sigma_2 \, u_2 \circ v_2$$
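The rank-1 terms above can be sketched with plain power iteration. This is only an illustration of the first term, sigma_1 u_1 ∘ v_1, and assumes a nonzero matrix; the function names are hypothetical, and a library SVD routine would be used in practice.

```python
import math

def top_singular_triple(A, iters=200):
    """Power iteration on A^T A; returns (sigma_1, u_1, v_1) for a nonzero matrix A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]   # u = A v
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]   # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

def rank_one(sigma, u, v):
    """Outer product sigma * (u o v): the first term of the sum above."""
    return [[sigma * ui * vj for vj in v] for ui in u]

A = [[3.0, 0.0],
     [0.0, 1.0]]
sigma, u, v = top_singular_triple(A)
approx = rank_one(sigma, u, v)   # best rank-1 approximation of A
```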

Goal: extension to >=3 modes

[Figure: an I x J x K tensor approximated (~) by factor matrices A (I x R) and B (J x R) around an R x R x R core, drawn equivalently as a sum (+…+) of rank-1 terms]

Main points:

•  2 major types of tensor decompositions: PARAFAC and Tucker
•  both can be solved with “alternating least squares” (ALS)

Specially Structured Tensors

•  Tucker Tensor: an I x J x K tensor written as an R x S x T “core” combined with factor matrices U (I x R), V (J x S), ...
•  Kruskal Tensor: an I x J x K tensor written with factor matrices U (I x R), V (J x R), ... around an R x R x R core, drawn equivalently as a sum of R rank-1 terms built from (u1, v1, w1) +…+ (uR, vR, wR)

[Our Notation: given as formulas on the slide for each of the two forms]
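For reference, the two structured forms can be written entry by entry. This is a sketch assuming the standard definitions (as in the Kolda and Bader survey listed under the resources of this talk); the third factor matrix W and the weights \lambda_r are part of that assumption, since the slide residue names only U and V.

```latex
% Tucker tensor: R x S x T core G, factors U (I x R), V (J x S), W (K x T)
x_{ijk} \;\approx\; \sum_{r=1}^{R} \sum_{s=1}^{S} \sum_{t=1}^{T}
    g_{rst} \, u_{ir} \, v_{js} \, w_{kt}

% Kruskal tensor (PARAFAC): sum of R rank-1 terms, all factors have R columns
x_{ijk} \;\approx\; \sum_{r=1}^{R} \lambda_r \, u_{ir} \, v_{jr} \, w_{kr}
```

The Kruskal form is the special case of the Tucker form in which R = S = T and the core is zero except for g_{rrr} = \lambda_r.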

Tucker Decomposition - intuition

[Figure: an I x J x K tensor ~ an R x S x T core G combined with factor matrices A (I x R), B (J x S), and C]

•  author x keyword x conference
•  A: author x author-group
•  B: keyword x keyword-group
•  C: conf. x conf-group
•  G: how groups relate to each other
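The roles of A, B, C, and G can be sketched by rebuilding single tensor entries from them. The toy memberships and core values below are made up for illustration, and `tucker_entry` is a hypothetical helper name.

```python
def tucker_entry(G, A, B, C, i, j, k):
    """x_ijk ~ sum over (r, s, t) of G[r][s][t] * A[i][r] * B[j][s] * C[k][t]."""
    return sum(
        G[r][s][t] * A[i][r] * B[j][s] * C[k][t]
        for r in range(len(G))
        for s in range(len(G[0]))
        for t in range(len(G[0][0]))
    )

# Hypothetical toy example: 2 author-groups, 2 keyword-groups, 1 conference-group.
A = [[1, 0], [1, 0], [0, 1]]   # 3 authors x 2 author-groups (hard membership)
B = [[1, 0], [0, 1]]           # 2 keywords x 2 keyword-groups
C = [[1], [1]]                 # 2 conferences x 1 conference-group
G = [[[5], [0]],               # author-group 0 uses keyword-group 0 heavily
     [[0], [2]]]               # author-group 1 uses keyword-group 1

x_000 = tucker_entry(G, A, B, C, 0, 0, 0)   # author 0, keyword 0: groups (0, 0)
x_210 = tucker_entry(G, A, B, C, 2, 1, 0)   # author 2, keyword 1: groups (1, 1)
x_010 = tucker_entry(G, A, B, C, 0, 1, 0)   # author 0, keyword 1: groups (0, 1)
```

Here G is not diagonal in general: any author-group may interact with any keyword-group, and the core entry says how strongly.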

Intuition behind core tensor

•  2-d case: co-clustering
•  [Dhillon et al. Information-Theoretic Co-clustering, KDD’03]

[Figure: an m x n matrix (eg, terms x documents) approximated by the product of an m x k, a k x l, and an l x n matrix]

[Figure: the three factors are term x term-group, term group x doc. group, and doc x doc group; term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]

Tensor tools - summary

•  Two main tools
   –  PARAFAC
   –  Tucker
•  Both find row-, column-, tube-groups
   –  but in PARAFAC the three groups are identical (row-group r pairs only with column-group r and tube-group r)
•  (To solve: Alternating Least Squares)
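As a minimal sketch of the alternating idea, the rank-1 case of PARAFAC can be fitted by updating one factor at a time while the other two are held fixed. This covers only R = 1 (a full rank-R ALS solves a small R x R linear system per update), and the names `rank1_als` and `_solve_mode` are illustrative assumptions.

```python
def _solve_mode(x, mode, size, others):
    """Least-squares update of one factor while the other two stay fixed."""
    u, v = others
    num = [0.0] * size
    for idx, value in x.items():
        rest = [idx[m] for m in range(3) if m != mode]
        num[idx[mode]] += value * u[rest[0]] * v[rest[1]]
    denom = sum(p * p for p in u) * sum(q * q for q in v)
    return [n / denom for n in num]

def rank1_als(x, shape, iters=50):
    """Fit x_ijk ~ a_i * b_j * c_k for a sparse tensor {(i, j, k): value}."""
    a, b, c = ([1.0] * n for n in shape)
    for _ in range(iters):
        a = _solve_mode(x, 0, shape[0], (b, c))
        b = _solve_mode(x, 1, shape[1], (a, c))
        c = _solve_mode(x, 2, shape[2], (a, b))
    return a, b, c

# Hypothetical test tensor that is exactly rank 1.
a0, b0, c0 = [1.0, 2.0], [1.0, 3.0], [2.0, 1.0]
x = {(i, j, k): a0[i] * b0[j] * c0[k]
     for i in range(2) for j in range(2) for k in range(2)}

a, b, c = rank1_als(x, shape=(2, 2, 2))
max_error = max(abs(a[i] * b[j] * c[k] - value) for (i, j, k), value in x.items())
```

The individual factors are only determined up to scaling, so the check is on the reconstructed entries rather than on a, b, and c themselves.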


Web graph mining

•  How to order the importance of web pages?
   –  Kleinberg’s algorithm HITS
   –  PageRank
   –  Tensor extension on HITS (TOPHITS)


Kleinberg’s Hubs and Authorities (the HITS method)

Sparse adjacency matrix and its SVD:

[Figure: the from x to adjacency matrix written as a sum of rank-1 terms; each term pairs the hub scores with the authority scores for one topic (1st topic, 2nd topic, ...)]

Kleinberg, JACM, 1999
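The hub and authority scores for the first topic are the leading left and right singular vectors of the adjacency matrix, so they can be sketched with a simple power iteration. The toy edge list is made up, and this is an illustration rather than the code used for the slides.

```python
import math

def hits(edges, n, iters=100):
    """Power iteration for hub/authority scores of a directed graph with n nodes."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = [0.0] * n
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = [0.0] * n
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors to unit length to keep the values bounded.
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vec[:] = [x / norm for x in vec]
    return hub, auth

# Hypothetical toy graph: pages 0 and 1 both link to pages 2 and 3; page 4 links to 3.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]
hub, auth = hits(edges, n=5)
```

Page 3 ends up as the strongest authority, and pages 0 and 1 as the strongest hubs, because they point at both authorities.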

HITS Authorities on Sample Data

We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.


Three-Dimensional View of the Web

Observe that this tensor is very sparse!

Kolda, Bader, Kenny, ICDM05


Topical HITS (TOPHITS)

Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.

[Figure: the from x to x term tensor written as a sum of rank-1 terms; each term combines the hub scores, authority scores, and term scores for one topic (1st topic, 2nd topic, ...)]

TOPHITS Terms & Authorities on Sample Data

TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.

Tensor PARAFAC

w_k = # unique links using term k
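A small sketch of how the count w_k could be computed from crawl output. The triples below are made-up data, and how w_k then weights the tensor entries is not recoverable from this transcript, so the sketch stops at the counts.

```python
from collections import defaultdict

# Hypothetical crawl output: (from_host, to_host, anchor_term) triples.
links = [
    ("a.org", "b.org", "solver"),
    ("a.org", "b.org", "solver"),        # duplicate, counted once
    ("a.org", "c.org", "solver"),
    ("b.org", "c.org", "optimization"),
]

# The nonzero cells of the from x to x term tensor are the distinct triples.
cells = set(links)

# w_k = number of unique (from, to) links whose anchor text uses term k.
pairs_per_term = defaultdict(set)
for src, dst, term in cells:
    pairs_per_term[term].add((src, dst))
w = {term: len(pairs) for term, pairs in pairs_per_term.items()}
```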

Conclusions

•  Real data are often in high dimensions with multiple aspects (modes)
•  Tensors provide elegant theory and algorithms
   –  PARAFAC and Tucker: discover groups

References

•  T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, pages 242-249, November 2005.
•  Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams. Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006.


Resources

•  See tutorial on tensors, KDD’07 (w/ Tamara Kolda and Jimeng Sun):

www.cs.cmu.edu/~christos/TALKS/KDD-07-tutorial

Tensor tools - resources

•  Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/TensorToolbox
•  T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009. csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf
•  T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)

Copyright: Faloutsos, Tong (2009), ICDE’09


Detailed outline

•  Motivation
•  Hard clustering – k pieces
•  Hard co-clustering – (k,l) pieces
•  Hard clustering – optimal # pieces
•  Observations

Problem

•  Given a graph, and k
•  Break it into k (disjoint) communities

Example: k = 2

Solution #1: METIS

•  Arguably, the best algorithm
•  Open source, at
   –  http://www.cs.umn.edu/~metis
•  and *many* related papers, at same url
•  Main idea:
   –  coarsen the graph;
   –  partition;
   –  un-coarsen

Solution #1: METIS

•  G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, 1998.
•  <and many extensions>

Solution #2 (problem: hard clustering, k pieces)

Spectral partitioning:
•  Consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian
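A minimal sketch of spectral bisection follows. To stay dependency-free it uses the unnormalized Laplacian L = D - A instead of the normalized one named on the slide, and it finds the eigenvector by shifted power iteration; it assumes a connected, undirected graph, and the function name is hypothetical.

```python
import math

def fiedler_vector(edges, n, iters=500):
    """Approximate the eigenvector of the 2nd smallest eigenvalue of L = D - A."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    shift = 2.0 * max(degree)            # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]     # any start not parallel to the all-ones vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]      # project out the all-ones eigenvector (eigenvalue 0)
        # y = (shift * I - L) x, so the smallest remaining eigenvalue of L dominates.
        y = [(shift - degree[i]) * x[i] for i in range(n)]
        for u, v in edges:
            y[u] += x[v]
            y[v] += x[u]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
vec = fiedler_vector(edges, n=6)
side = [value >= 0 for value in vec]     # the sign of each entry gives the 2-way split
```

Splitting on the sign of the entries separates the two triangles, cutting only the bridging edge.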

Solutions #3, …

Many more ideas:
•  Clustering on the A^2 (square of adjacency matrix) [Zhou, Woodruff, PODS’04]
•  Minimum cut / maximum flow [Flake+, KDD’00]
•  …

Detailed outline

•  Motivation
•  Hard clustering – k pieces
•  Hard co-clustering – (k,l) pieces
•  Hard clustering – optimal # pieces
•  Soft clustering – matrix decompositions
•  Observations

Problem definition

•  Given a bi-partite graph, and k, l
•  Divide it into k row groups and l column groups
•  (Also applicable to uni-partite graph)

Co-clustering

•  Given data matrix and the number of row and column groups k and l
•  Simultaneously
   –  Cluster rows into k disjoint groups
   –  Cluster columns into l disjoint groups

Co-clustering

•  Let X and Y be discrete random variables
   –  X and Y take values in {1, 2, …, m} and {1, 2, …, n}
   –  p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data
   –  Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
•  Key Obstacles in Clustering Contingency Tables
   –  High Dimensionality, Sparsity, Noise
   –  Need for robust and scalable algorithms

Reference:
1.  Dhillon et al. Information-Theoretic Co-clustering, KDD’03
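Information-theoretic co-clustering scores a choice of row and column groups by how much mutual information between X and Y is lost once the rows and columns are merged into their groups. The sketch below evaluates that loss for two fixed groupings of a made-up table; it does not implement the search itself, and the function names are illustrative.

```python
import math

def mutual_information(p):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(p)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def coclustered_joint(p, row_group, col_group, k, l):
    """Aggregate p(X, Y) into the k x l joint distribution of the cluster variables."""
    q = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            q[row_group[i]][col_group[j]] += pxy
    return q

# Hypothetical 4 x 4 co-occurrence table with two obvious blocks.
counts = [[5, 5, 0, 0],
          [5, 5, 0, 0],
          [0, 0, 5, 5],
          [0, 0, 5, 5]]
total = sum(map(sum, counts))
p = [[c / total for c in row] for row in counts]

good = coclustered_joint(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = coclustered_joint(p, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)

loss_good = mutual_information(p) - mutual_information(good)   # grouping follows the blocks
loss_bad = mutual_information(p) - mutual_information(bad)     # grouping mixes the blocks
```

The grouping that follows the block structure loses no information, while the one that mixes the blocks loses all of it.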

[Figure: an m x n matrix (eg, terms x documents) approximated by the product of an m x k, a k x l, and an l x n matrix: term x term-group, term group x doc. group, and doc x doc group; term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]

Co-clustering

Observations
•  uses KL divergence, instead of L2
•  the middle matrix is not diagonal
   –  we saw that earlier in the Tucker tensor decomposition
•  s/w at: www.cs.utexas.edu/users/dml/Software/cocluster.html


Problem with Information Theoretic Co-clustering

•  Number of row and column groups must be specified

Desiderata:
•  Simultaneously discover row and column groups
•  Fully Automatic: No “magic numbers”
•  Scalable to large graphs

Cross-association

Desiderata:
•  Simultaneously discover row and column groups
•  Fully Automatic: No “magic numbers”
•  Scalable to large matrices

Reference:
1.  Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04

What makes a cross-association “good”?

[Figure: two alternative arrangements of the same matrix into row groups and column groups, shown side by side (“versus”)]

Why is this better?

simpler; easier to describe; easier to compress!

What makes a cross-association “good”?

Problem definition: given an encoding scheme
•  decide on the # of row and col. groups k and l
•  and reorder rows and columns,
•  to achieve best compression

Main Idea

Good compression goes with better clustering.

$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(x_i)}_{\text{Code Cost}} + \underbrace{\text{Cost of describing cross-associations}}_{\text{Description Cost}}$$

Minimize the total cost (# bits) for lossless compression
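The code-cost term can be sketched for a binary matrix by treating each (row group, column group) block as one region i, with x_i the fraction of ones in it. The description-cost term is left out here because its exact encoding is not given in this transcript; the toy matrix and function names are illustrative.

```python
import math

def binary_entropy(p):
    """H(p) in bits; defined as 0 at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_group, col_group, k, l):
    """Sum over blocks of size_i * H(x_i), with x_i the density of ones in block i."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            ones[row_group[i]][col_group[j]] += cell
            size[row_group[i]][col_group[j]] += 1
    return sum(
        size[a][b] * binary_entropy(ones[a][b] / size[a][b])
        for a in range(k) for b in range(l) if size[a][b]
    )

matrix = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]

cost_one_block = code_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)    # k = l = 1
cost_two_by_two = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)   # k = l = 2
```

With a single block the matrix is half ones and costs 16 bits; with the 2 x 2 grouping every block is uniform and the code cost drops to zero. The description cost grows with k and l, which is what keeps the total from being minimized by making every row and column its own group.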

Algorithm

[Figure: the search adds one column group or one row group at a time: k=1, l=2; k=2, l=2; k=2, l=3; k=3, l=3; k=3, l=4; k=4, l=4; k=4, l=5; ending at k = 5 row groups, l = 5 col groups]

Experiments

“CLASSIC”
•  3,893 documents
•  4,303 words
•  176,347 “dots”

Combination of 3 sources:
•  MEDLINE (medical)
•  CISI (info. retrieval)
•  CRANFIELD (aerodynamics)

[Figure: Documents x Words matrix]

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

[Figure: cross-associated Documents x Words matrix, with document groups labeled MEDLINE (medical), CISI (Information Retrieval), and CRANFIELD (aerodynamics)]

Word groups highlighted in the figure:
•  MEDLINE (medical): insipidus, alveolar, aortic, death, prognosis, intravenous; blood, disease, clinical, cell, tissue, patient
•  CISI (Information Retrieval): providing, studying, records, development, students, rules; abstract, notation, works, construct, bibliographies
•  CRANFIELD (aerodynamics): shape, nasa, leading, assumed, thin
•  A further word group: paint, examination, fall, raise, leave, based

Algorithm

Code for cross-associations (matlab):
www.cs.cmu.edu/~deepay/mywww/software/CrossAssociations-01-27-2005.tgz

Variations and extensions:
•  ‘Autopart’ [Chakrabarti, PKDD’04]
•  www.cs.cmu.edu/~deepay

Algorithm

•  Hadoop implementation [ICDM’08]

Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008: 512-521



Observation #1

•  Skewed degree distributions – there are nodes with huge degree (>O(10^4), in facebook/linkedIn popularity contests!)

Observation #2

•  Maybe there are no good cuts: “jellyfish” shape [Tauro+,’01], [Siganos+,’06], strange behavior of cuts [Chakrabarti+,’04], [Leskovec+,’08]


Jellyfish model [Tauro+]

A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001

Jellyfish: A Conceptual Model for the AS Internet Topology G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp 339-350, Sept. 2006.


Strange behavior of min cuts

•  ‘negative dimensionality’ (!)

NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy

Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.

“Min-cut” plot

•  Do min-cuts recursively.

[Figure: log (mincut-size / #edges) plotted against log (# edges). A grid of N nodes has mincut size = sqrt(N); each recursive split contributes a new min-cut point, and the points fall on a line with slope = -0.5]

For a d-dimensional grid, the slope is -1/d

“Min-cut” plot

•  [Plots: log (mincut-size / #edges) vs. log (# edges)]
•  For a d-dimensional grid, the slope is -1/d
•  For a random graph, the slope is 0
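The -1/d slope can be checked numerically. The following is a minimal Python sketch, assuming an idealized d-dimensional grid with n nodes per side that is bisected by a flat cut; the function names and the least-squares fit are illustrative choices, not taken from the slides.

```python
import math

def grid_point(n, d):
    """Return (log #edges, log(cut / #edges)) for a d-dim grid with n nodes per side."""
    edges = d * n ** (d - 1) * (n - 1)   # number of edges in the full grid
    cut = n ** (d - 1)                   # edges crossed by a flat bisecting cut (assumed minimal)
    return math.log(edges), math.log(cut / edges)

def fitted_slope(points):
    """Ordinary least-squares slope through a list of (x, y) points."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

def mincut_plot_slope(d, sides=(16, 32, 64, 128, 256, 512, 1024)):
    """Slope of the min-cut plot over grids of increasing size."""
    return fitted_slope([grid_point(n, d) for n in sides])

if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, round(mincut_plot_slope(d), 3))
```

For d = 2 this reproduces the sqrt(N) example: the cut has about sqrt(N) edges out of about 2N, so the ratio falls as (#edges)^(-1/2).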

“Min-cut” plot

•  What does it look like for a real-world graph?
•  [Plot: log (mincut-size / #edges) vs. log (# edges)]

Experiments

•  Datasets:

–  Google Web Graph: 916,428 nodes and 5,105,039 edges

–  Lucent Router Graph: Undirected graph of network routers from www.isi.edu/scan/mercator/maps.html; 112,969 nodes and 181,639 edges

–  User Website Clickstream Graph: 222,704 nodes and 952,580 edges

NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy

Experiments

•  Used the METIS algorithm [Karypis, Kumar, 1995]
•  Google Web graph
•  [Plot: log (mincut-size / #edges) vs. log (# edges); slope ~ -0.4]
•  Values along the y-axis are averaged
•  We observe a “lip” for large # edges
•  Slope of -0.4 corresponds to a 2.5-dimensional grid!

Google graph

[Plots: log (mincut-size / #edges) vs. log (# edges), including a panel with all min-cuts averaged]

Experiments

•  Same results for other graphs too…
•  [Plots: Lucent Router graph; Clickstream graph]

Conclusions – Practitioner’s guide

•  Hard clustering – k pieces: METIS
•  Hard co-clustering – (k,l) pieces: Co-clustering
•  Hard clustering – optimal # pieces: Cross-associations
•  Observations: ‘jellyfish’: maybe there are no good cuts

Outline

•  Task 4: time-evolving graphs – tensors
•  Task 5: community detection
•  Task 6: virus propagation
•  Task 7: scalability, parallelism and hadoop
•  Conclusions

Detailed outline

•  Problem definition
•  Analysis
•  Experiments

Immunization and epidemic thresholds

•  Q1: which nodes to immunize?
•  Q2: will a virus vanish, or will it create an epidemic?


Q1: Immunization:

•  Given
   –  a network,
   –  k vaccines, and
   –  the virus details
•  Which nodes to immunize?
•  A: immunize the ones that maximally raise the ‘epidemic threshold’ [Tong+, ICDM’10]
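To make the objective concrete, here is a brute-force greedy sketch in Python. It is not the algorithm of [Tong+, ICDM’10]; it only illustrates the goal, assuming (as the later slides state) that the threshold is 1/λ1, so raising the threshold means lowering the largest eigenvalue. The helper names and the toy graph are hypothetical.

```python
import math

def largest_eigenvalue(adj, removed=frozenset(), iters=300):
    """Largest adjacency eigenvalue of the graph left after deleting `removed`.

    Power iteration on A + I (the shift avoids oscillation on bipartite
    graphs); the returned value is the Rayleigh quotient with A itself.
    """
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return 0.0
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        y = {v: x[v] + sum(x[u] for u in adj[v] if u not in removed) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    # x has unit norm, so x^T A x is the Rayleigh quotient.
    return sum(x[v] * sum(x[u] for u in adj[v] if u not in removed) for v in nodes)

def greedy_immunize(adj, k):
    """Pick k nodes one at a time, each time removing the node whose
    deletion leaves the smallest largest eigenvalue (brute force)."""
    removed = set()
    for _ in range(k):
        candidates = [v for v in adj if v not in removed]
        best = min(candidates, key=lambda v: largest_eigenvalue(adj, removed | {v}))
        removed.add(best)
    return removed

def star_plus_triangle(d=9):
    """Toy undirected graph: a star with d leaves (center 0) plus a separate triangle."""
    adj = {0: list(range(1, d + 1))}
    for leaf in range(1, d + 1):
        adj[leaf] = [0]
    a, b, c = d + 1, d + 2, d + 3
    adj[a], adj[b], adj[c] = [b, c], [a, c], [a, b]
    return adj

if __name__ == "__main__":
    g = star_plus_triangle()
    print(greedy_immunize(g, 1), largest_eigenvalue(g))
```

Each greedy step recomputes an eigenvalue per candidate node, which is far too slow for large graphs; the sketch is meant only for small examples.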


Q2: will a virus take over?

•  Flu-like virus (no immunity, ‘SIS’)
•  Mumps (life-time immunity, ‘SIR’)
•  Pertussis (finite-length immunity, ‘SIRS’)
•  β: attack prob.; δ: heal prob.
•  A: depends on connectivity (avg. degree? max degree? variance? something else?)

The model: SIS

•  ‘Flu’-like: Susceptible-Infected-Susceptible
•  Virus ‘strength’ s = β/δ
•  [Figure: Healthy/Infected state diagram for node N with neighbors N1, N2, N3; infection with prob. β, healing with prob. δ]

Epidemic threshold τ

of a graph: the value of τ such that, if strength s = β/δ < τ, an epidemic cannot happen.

Thus,
•  given a graph
•  compute its epidemic threshold

Detailed outline

•  Problem definition
•  Analysis
•  Experiments

Epidemic threshold τ

What should τ depend on?
•  avg. degree? and/or highest degree?
•  and/or variance of degree?
•  and/or third moment of degree?
•  and/or diameter?

Epidemic threshold

•  [Theorem] We have no epidemic, if

   β/δ < τ = 1/λ1,A

   –  β: attack prob.
   –  δ: recovery prob.
   –  τ: epidemic threshold
   –  λ1,A: largest eigenvalue of adj. matrix A
•  Proof: [Wang+03] (proof: for SIS = flu only)
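A minimal Python sketch of how the theorem could be applied: estimate λ1,A by power iteration and compare β/δ with 1/λ1,A. It assumes an undirected graph with at least one edge, stored as adjacency lists; the function names are hypothetical and pure-Python power iteration is just one simple way to get the eigenvalue.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Estimate the largest eigenvalue of an undirected graph's adjacency matrix.

    adj[i] is the list of neighbors of node i. Iterating with A + I
    (instead of A) avoids oscillation on bipartite graphs.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x (x has unit norm).
    return sum(x[i] * sum(x[j] for j in adj[i]) for i in range(n))

def epidemic_threshold(adj):
    """tau = 1 / lambda_1; assumes the graph has at least one edge."""
    return 1.0 / largest_eigenvalue(adj)

def no_epidemic(adj, beta, delta):
    """True if the strength beta/delta is below the threshold."""
    return beta / delta < epidemic_threshold(adj)

def clique(n):
    """Complete graph on n nodes, as adjacency lists."""
    return [[j for j in range(n) if j != i] for i in range(n)]

if __name__ == "__main__":
    g = clique(5)  # every node has degree 4
    print(epidemic_threshold(g), no_epidemic(g, beta=0.1, delta=0.5))
```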

Beginning of proof

Healthy @ t+1:
-  (healthy or healed)
-  and not attacked @ t

Let: p(i, t) = Prob. node i is sick @ t

1 - p(i, t+1) = (1 - p(i, t) + p(i, t) * δ) * Πj (1 - β aji * p(j, t))

Below threshold, if the non-linear dynamical system above is ‘stable’ (eigenvalue of Jacobian < 1)
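The recurrence can be iterated directly to see both regimes. This is a sketch under stated assumptions: an undirected graph given as adjacency lists (so aji = aij), every node starting at infection probability 0.5, and a star graph as the test case; none of these choices come from the slides.

```python
def sis_step(p, adj, beta, delta):
    """One step of 1 - p(i,t+1) = (1 - p(i,t) + p(i,t)*delta) * prod_j (1 - beta*a_ji*p(j,t))."""
    new_p = []
    for i in range(len(p)):
        not_attacked = 1.0
        for j in adj[i]:                      # only neighbors have a_ji = 1
            not_attacked *= 1.0 - beta * p[j]
        healthy_or_healed = 1.0 - p[i] + p[i] * delta
        new_p.append(1.0 - healthy_or_healed * not_attacked)
    return new_p

def expected_infected(adj, beta, delta, p0=0.5, steps=200):
    """Iterate the system and return the expected number of infected nodes."""
    p = [p0] * len(adj)
    for _ in range(steps):
        p = sis_step(p, adj, beta, delta)
    return sum(p)

def star(d):
    """Star graph: node 0 is the center, nodes 1..d are leaves."""
    return [list(range(1, d + 1))] + [[0] for _ in range(d)]

if __name__ == "__main__":
    g = star(16)                               # lambda_1 = 4, so tau = 0.25
    print(expected_infected(g, 0.1, 0.5))      # strength 0.2 < tau
    print(expected_infected(g, 0.3, 0.5))      # strength 0.6 > tau
```

Linearizing around p = 0 gives p(t+1) ≈ ((1 - δ) I + β A) p(t), whose largest eigenvalue is 1 - δ + β λ1,A; requiring it to be below 1 is the same condition as β/δ < 1/λ1,A.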

Epidemic threshold for various networks

Formula includes older results as special cases:
•  Homogeneous networks [Kephart+White]
   –  λ1,A = <k>; τ = 1/<k> (<k>: avg. degree)
•  Star networks (d = degree of center)
   –  λ1,A = sqrt(d); τ = 1/sqrt(d)
•  Infinite power-law networks
   –  λ1,A = ∞; τ = 0; [Barabasi]
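The star-network entry can be verified by hand. A short derivation, where the trial eigenvector x is notation introduced here rather than on the slides:

```latex
\[
x_{\text{center}} = \sqrt{d}, \qquad x_{\text{leaf}} = 1
\]
\[
(Ax)_{\text{center}} = \sum_{\text{leaves}} x_{\text{leaf}} = d = \sqrt{d}\cdot x_{\text{center}},
\qquad
(Ax)_{\text{leaf}} = x_{\text{center}} = \sqrt{d}\cdot x_{\text{leaf}}
\]
\[
\Rightarrow\quad A x = \sqrt{d}\,x, \qquad
\lambda_{1,A} = \sqrt{d}, \qquad
\tau = \frac{1}{\lambda_{1,A}} = \frac{1}{\sqrt{d}}
\]
```

Since x has only positive entries and the star is connected, the Perron-Frobenius theorem implies that sqrt(d) is the largest eigenvalue, not just some eigenvalue.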


Epidemic threshold

•  [Theorem 2] Below the epidemic threshold, the epidemic dies out exponentially

Recent generalization

•  [Prakash+, arxiv ’10]: similar threshold, for almost all virus propagation models (VPM)
   –  SIS -> flu
   –  SIR -> mumps
   –  SIRS -> whooping cough (temporary immunity)
   –  SIIR (-> HIV)
   –  …

A2: will a virus take over?

•  For all typical virus propagation models (flu, mumps, pertussis, HIV, etc.)
•  The only connectivity measure that matters is 1/λ1, where λ1 is the first eigenvalue of the adj. matrix
•  Proof for all VPM: [Prakash+, ’10, arxiv]

Detailed outline

•  Epidemic threshold
   –  Problem definition
   –  Analysis
   –  Experiments


Experiments (Oregon)

β/δ > τ (above threshold)

β/δ = τ (at the threshold)

β/δ < τ (below threshold)

SIS simulation - # infected nodes vs time

•  [Plot, log-lin: #inf. (log scale) vs. time (linear scale); curves: above, at, below threshold; annotation: exponential decay]
•  [Plot, log-log: #inf. (log scale) vs. time (log scale); curves: above, at, below threshold; annotation: power-law decay (!)]

How about other VPMs?

A2: will a virus take over? (SIRS case)

•  [Plot: fraction of infected vs. time ticks]
•  Below: exp. extinction
•  Above: take-over
•  Graph: Portland, OR; 31M links, 1.5M nodes

Conclusions

λ1,A: the largest eigenvalue of the adjacency matrix determines the survival of (almost) any virus
•  measure of connectivity (~ # paths)
•  Can answer ‘what-if’ scenarios
   –  May guide immunization policies
•  Can help us avoid expensive simulations

References

•  D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008
•  Ganesh, A., Massoulie, L., and Towsley, D., 2005. The effect of network topology on the spread of epidemics. In INFOCOM.


References (cont’d)

•  Hethcote, H. W. 2000. The mathematics of infectious diseases. SIAM Review 42, 599–653.

•  Hethcote, H. W. AND Yorke, J. A. 1984. Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics.


References (cont’d)

•  Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy

Outline

•  Task 4: time-evolving graphs – tensors
•  Task 5: community detection
•  Task 6: virus propagation
•  Task 7: scalability, parallelism and hadoop
•  Conclusions

Scalability

•  How about if graph/tensor does not fit in core?
   –  [‘MET’: Kolda, Sun, ICDM’08, best paper award]
•  How about handling huge graphs?

Scalability

•  Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
•  Yahoo: 5Pb of data [Fayyad, KDD’07]
•  Problem: machine failures, on a daily basis
•  How to parallelize data mining tasks, then?
•  A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/

2’ intro to hadoop

•  master-slave architecture; n-way replication (default n=3)
•  ‘group by’ of SQL (in parallel, fault-tolerant way)
•  e.g., find histogram of word frequency
   –  compute local histograms (map)
   –  then merge into global histogram (reduce)

select course-id, count(*) from ENROLLMENT group by course-id
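The two-phase histogram idea can be sketched in a few lines of single-process Python. This is only an illustration of the map and reduce steps, not Hadoop code: there is no distribution, shuffling by key, or fault tolerance, and the function names and the example splits are made up.

```python
from collections import Counter
from functools import reduce

def map_phase(split):
    """Map: build a local histogram for one input split."""
    return Counter(split)

def reduce_phase(local_histograms):
    """Reduce: merge the local histograms into one global histogram."""
    return reduce(lambda total, local: total + local, local_histograms, Counter())

def group_by_count(splits):
    """Equivalent of: select key, count(*) ... group by key, over all splits."""
    return dict(reduce_phase(map_phase(s) for s in splits))

if __name__ == "__main__":
    # Hypothetical ENROLLMENT rows, reduced to their course-id column
    # and stored in three splits.
    splits = [["db", "os", "db"], ["ml", "db"], ["os"]]
    print(group_by_count(splits))
```

Because merging histograms is associative and commutative, the local histograms can be computed on different machines and merged in any order, which is what makes this pattern easy to parallelize.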

[Figure: map/reduce execution. The user program forks the master, the mappers and the reducers; the master assigns map and reduce tasks; mappers read the input splits (Split 0, Split 1, Split 2; input data on HDFS) and write locally; reducers do a remote read, sort, and write Output File 0 and Output File 1]

By default: 3-way replication; late/dead machines: ignored, transparently (!)

D.I.S.C.

•  ‘Data Intensive Scientific Computing’ [R. Bryant, CMU]
   –  ‘big data’
   –  www.cs.cmu.edu/~bryant/pubdir/cmu-cs-07-128.pdf

Analysis of a large graph

~200Gb (Yahoo crawl) - Degree Distribution:
•  in 12 minutes with 50 machines
•  Many (link spams?) at out-degree 1200

Outline – Algorithms & results

                  Centralized    Hadoop/PEGASUS
Degree Distr.     old            old
Pagerank          old            old
Diameter/ANF      old            DONE
Conn. Comp        old            DONE

Triangles: DONE
Visualization: STARTED

HADI for diameter estimation

•  Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
•  Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
•  Our HADI: linear on E (~10B)
   –  Near-linear scalability wrt # machines
   –  Several optimizations -> 5x faster
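For contrast with HADI, here is a sketch of one naive exact approach: a breadth-first search from every node. It is not HADI and does not scale; it only shows what is being computed. The 90% convention for the effective diameter is an assumption made here (the slides do not define it), the graph is assumed to have at least one edge, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def hop_histogram(adj):
    """hist[h] = number of ordered reachable pairs (u, v), u != v, at distance h."""
    hist = {}
    for s in adj:
        for h in bfs_distances(adj, s).values():
            if h > 0:
                hist[h] = hist.get(h, 0) + 1
    return hist

def diameter_and_effective_diameter(adj, percent=90):
    """Return (diameter, smallest h covering `percent`% of reachable pairs)."""
    hist = hop_histogram(adj)
    total = sum(hist.values())
    diameter = max(hist)
    covered = 0
    for h in sorted(hist):
        covered += hist[h]
        if covered * 100 >= percent * total:   # integer arithmetic, no rounding issues
            return diameter, h

if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(diameter_and_effective_diameter(path))
```

One BFS per node costs O(N * E) time overall, which is already out of reach at N ~ 1B; that gap is what motivates an approximate, parallel method.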


YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

•  Largest publicly available graph ever studied.
•  [Plot: count vs. radius]
•  19+? [Barabasi+] (~1999, ~1M nodes)
•  14 (dir.), ~7 (undir.)
•  7 degrees of separation (!)
•  Diameter: shrunk

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

•  Q: Shape?
•  [Plot: count vs. radius; ~7 (undir.)]

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

•  effective diameter: surprisingly small.
•  Multi-modality (?!)

Radius Plot of GCC of YahooWeb.

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

•  effective diameter: surprisingly small.
•  Multi-modality: probably mixture of cores.
•  [Figure: conjecture, with cores labeled EN, DE, BR; ~7]

Running time - Kronecker and Erdos-Renyi graphs with billions of edges.


Generalized Iterated Matrix Vector Multiplication (GIMV)

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up).

Generalized Iterated Matrix Vector Multiplication (GIMV)

•  PageRank
•  proximity (RWR)
•  Diameter
•  Connected components
•  (eigenvectors,
•  Belief Prop.
•  … )

=> Matrix-vector multiplication (iterated)
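A small Python sketch of the idea, using connected components as the example. The decomposition into three pluggable operations (called combine2, combine_all and assign here) is an assumption about how the generalization is structured, since the slide only says "matrix-vector multiplication (iterated)"; the code is a single-machine illustration, not PEGASUS.

```python
def gim_v(edges, vector, combine2, combine_all, assign):
    """One generalized matrix-vector multiplication step.

    edges: iterable of (i, j, m_ij) for the non-zero matrix entries.
    combine2 plays the role of multiplication, combine_all of summation,
    and assign decides how the old and new vector entries are merged.
    """
    partial = {i: [] for i in range(len(vector))}
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, vector[j]))
    return [assign(vector[i], combine_all(i, partial[i])) for i in range(len(vector))]

def connected_components(n, undirected_edges):
    """Label every node with the smallest node id in its component."""
    edges = [(i, j, 1) for i, j in undirected_edges]
    edges += [(j, i, 1) for i, j in undirected_edges]
    labels = list(range(n))
    while True:
        new_labels = gim_v(
            edges, labels,
            combine2=lambda m_ij, v_j: m_ij * v_j,           # pass the neighbor's label along
            combine_all=lambda i, xs: min(xs) if xs else i,   # MIN instead of SUM
            assign=min,                                       # keep the smaller label
        )
        if new_labels == labels:   # converged: no label changed
            return labels
        labels = new_labels

if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Swapping MIN back to SUM in combine_all gives ordinary matrix-vector multiplication, the building block of PageRank-style power iteration, which is why one primitive can serve several of the tasks listed above.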

Example: GIM-V At Work

•  Connected Components – 4 observations:
•  [Plot: count vs. size of connected components]
   1) 10K x larger than next
   2) ~0.7B singleton nodes
   3) SLOPE!
   4) Spikes!
      –  300-size cmpt X 500. Why?
      –  1100-size cmpt X 65. Why?
      –  suspicious financial-advice sites (not existing now)

GIM-V At Work

•  Connected Components over Time
•  LinkedIn: 7.5M nodes and 58M edges
•  Stable tail slope after the gelling point

Conclusions

•  Hadoop: promising architecture for Tera/Peta scale graph mining

Resources:
•  http://hadoop.apache.org/core/
•  http://hadoop.apache.org/pig/ (higher-level language for data processing)

References

•  Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04
•  Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD 2008: 1099-1110

Overall Conclusions

•  Real graphs exhibit surprising patterns (power laws, shrinking diameter, super-linearity on edge weights, triangles, etc.)
•  SVD: a powerful tool (HITS, PageRank)
•  Several other tools: tensors, METIS, …
   –  But: good communities might not exist…
•  Immunization: first eigenvalue
•  Scalability: hadoop/parallelism

Page 155: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,

CMU SCS

(C) 2011, C. Faloutsos 155

Our goal:

Open source system for mining huge graphs:

PEGASUS project (PEta GrAph mining System)

•  www.cs.cmu.edu/~pegasus
•  code and papers

Project info

www.cs.cmu.edu/~pegasus

•  Akoglu, Leman
•  Chau, Polo
•  Kang, U
•  Koutra, Danae
•  McGlohon, Mary
•  Prakash, Aditya
•  Tong, Hanghang

Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

Extra material
•  E-bay fraud detection
•  Outlier detection

Detailed outline
•  Fraud detection in e-bay
•  Anomaly detection

E-bay Fraud detection

w/ Polo Chau & Shashank Pandit, CMU

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos (WWW'07), pp. 201-210

E-bay Fraud detection

•  lines: positive feedbacks
•  would you buy from him/her?
•  or him/her?

E-bay Fraud detection - NetProbe

Belief Propagation gives:

Popular press

And less desirable attention:
•  E-mail from ‘Belgium police’ (‘copy of your code?’)

OddBall: Spotting Anomalies in Weighted Graphs

Leman Akoglu, Mary McGlohon, Christos Faloutsos

Carnegie Mellon University School of Computer Science

PAKDD 2010, Hyderabad, India

Main idea
For each node,
•  extract ‘ego-net’ (=1-step-away neighbors)
•  extract features (#edges, total weight, etc.)
•  compare with the rest of the population
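
The first step above can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is undirected and stored as a dict of dicts (`node -> {neighbor: weight}`), the egonet is taken to be the ego, its 1-step neighbors, and every edge among them, and the helper name `extract_egonet` is hypothetical.

```python
def extract_egonet(adjacency, ego):
    """Return (nodes, edges) of the 1-step egonet around `ego`.

    `adjacency` maps node -> {neighbor: weight} for an undirected graph.
    `edges` maps a sorted node pair (u, v) to the weight of that edge.
    """
    # The ego plus every node exactly one step away from it.
    nodes = {ego} | set(adjacency.get(ego, {}))

    edges = {}
    for u in nodes:
        for v, weight in adjacency.get(u, {}).items():
            # Keep only edges whose both endpoints lie inside the egonet.
            if v in nodes and u != v:
                # Store each undirected edge once, keyed by a sorted pair.
                edges[tuple(sorted((u, v)))] = weight
    return nodes, edges


# Toy graph: a-b, a-c, b-c form a triangle; d hangs off c.
graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 3.0},
    "c": {"a": 2.0, "b": 3.0, "d": 1.0},
    "d": {"c": 1.0},
}
nodes_a, edges_a = extract_egonet(graph, "a")
```

Here the egonet of "a" contains the triangle a, b, c but not the edge c-d, because "d" is two steps away from the ego.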

What is an egonet?

[Figure: a graph with one node marked ‘ego’ and its surrounding subgraph marked ‘egonet’]

Selected Features
•  N_i: number of neighbors (degree) of ego i
•  E_i: number of edges in egonet i
•  W_i: total weight of egonet i
•  λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
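
The four features can be computed per egonet roughly as below. This is a sketch, not the OddBall code: it assumes the egonet is given as a node set plus a dict of undirected weighted edges with non-negative weights, and the name `egonet_features` is hypothetical. The principal eigenvalue is estimated by power iteration on the squared adjacency matrix, which avoids the oscillation that plain power iteration shows on bipartite egonets such as stars.

```python
from math import sqrt


def egonet_features(nodes, edges, iters=200):
    """Return N, E, W and the principal eigenvalue for one egonet.

    `nodes` is the set of egonet nodes (ego included); `edges` maps
    (u, v) -> non-negative weight, with each undirected edge listed once.
    """
    n_neighbors = len(nodes) - 1        # N_i: everyone except the ego
    n_edges = len(edges)                # E_i
    total_weight = sum(edges.values())  # W_i

    # Dense symmetric weighted adjacency matrix of the egonet.
    index = {node: k for k, node in enumerate(sorted(nodes))}
    size = len(index)
    matrix = [[0.0] * size for _ in range(size)]
    for (u, v), weight in edges.items():
        matrix[index[u]][index[v]] = weight
        matrix[index[v]][index[u]] = weight

    def multiply(vec):
        return [sum(row[k] * vec[k] for k in range(size)) for row in matrix]

    # Power iteration on A^2: its dominant eigenvalue is lambda^2.
    vec = [1.0 / sqrt(size)] * size if size else []
    eigenvalue = 0.0
    for _ in range(iters):
        nxt = multiply(multiply(vec))
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no edges: the eigenvalue stays 0
            break
        vec = [x / norm for x in nxt]
        eigenvalue = sqrt(norm)

    return {"N": n_neighbors, "E": n_edges, "W": total_weight,
            "lambda_w": eigenvalue}


# A unit-weight triangle (clique-like) and a unit-weight 3-leaf star.
triangle = egonet_features({"a", "b", "c"},
                           {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0})
star = egonet_features({"hub", "x", "y", "z"},
                       {("hub", "x"): 1.0, ("hub", "y"): 1.0, ("hub", "z"): 1.0})
```

Comparing each node's feature vector against the rest of the population is then a separate step, for example by plotting E_i against N_i on log-log axes and looking for points far from the bulk.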

Near-Clique/Star

END
