Date post: 25-Dec-2015
Page 1: Graph-based Analytics Wei Wang Department of Computer Science Scalable Analytics Institute UCLA weiwang@cs.ucla.edu.

Graph-based Analytics

Wei Wang

Department of Computer Science

Scalable Analytics Institute

UCLA

weiwang@cs.ucla.edu

Page 2

Graphs/Networks

FFSM (ICDM03), SPIN (KDD04), GDIndex (ICDE07)
MotifMining (PSB04, RECOMB04, ProteinScience06, SSDBM07, BIBM08)
COM (CIKM09), GAIA (SIGMOD10), LTS (ICDE11)
CGC (KDD13)

Graphs are everywhere

• Frequent subgraphs
• Discriminative subgraphs
• Graph classification
• Graph clustering

Page 3

Graph Clustering

• Graph clustering
  - Decompose a network into sub-networks based on some topological properties
  - Usually we look for dense sub-networks

Page 4

Detect protein functional modules in a PPI network

from Nataša Pržulj – Introduction to Bioinformatics. 2011.

Page 5

Community Detection in Social Network

Collaboration network between scientists

from Santo Fortunato, "Community detection in graphs"

Page 6

Multi-view Graph clustering

• Graphs collected from multiple sources/domains

• Multi-view graph clustering
  - Refine clustering
  - Resolve ambiguity

Page 7

Motivation

• Multi-view
  - Exact one-to-one
  - Complete mapping
  - The same size

• More common cases
  - Many-to-many
  - Tolerate partial mapping
  - Different sizes
  - Mappings are associated with weights (confidence)

Page 8

Motivation

• Objective: design an algorithm that is flexible and robust
  - Suitable for common cases: many-to-many, weighted, partial mappings for multi-domain graph clustering
  - Flexibility and robustness: noisy graphs have little influence on others

Page 9

Problem Formulation

Affinity matrices: $A^{(1)}$, $A^{(2)}$, $A^{(3)}$

$S^{(i,j)}_{a,b}$ denotes the weight between the a-th instance in $D_j$ and the b-th instance in $D_i$.

Goal: partition each $A^{(\pi)}$ into $k_\pi$ clusters while considering the co-regularized constraints implicitly encoded in the cross-domain relationships in $S$.

Page 10

Co-regularized multi-domain graph clustering (CGC)

• Single-domain clustering
  - Symmetric non-negative matrix factorization (NMF)
  - Minimizing:

$$L^{(\pi)} = \| A^{(\pi)} - H^{(\pi)} (H^{(\pi)})^T \|_F^2 \quad \text{s.t.} \quad H^{(\pi)} \ge 0$$

Here, $H^{(\pi)} = [h^{(\pi)}_{1*}, h^{(\pi)}_{2*}, \ldots, h^{(\pi)}_{n_\pi *}]^T \in \mathbb{R}^{n_\pi \times k_\pi}$, where each $h^{(\pi)}_{a*}$ represents the cluster assignment of the a-th instance in domain $D_\pi$.
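The single-domain objective can be illustrated with a short NumPy sketch. It minimizes the squared Frobenius error between A and H Hᵀ under H ≥ 0 using projected gradient descent with a backtracking step size. That optimizer is swapped in here for simplicity and is not claimed to be the update rule used by CGC. The function names, step sizes, and the toy affinity matrix are all assumptions made for illustration.

```python
import numpy as np

def snmf_loss(A, H):
    """Squared Frobenius error ||A - H H^T||_F^2."""
    return np.linalg.norm(A - H @ H.T, "fro") ** 2

def symmetric_nmf(A, k, n_iter=300, step=1e-2, seed=0):
    """Sketch: fit H >= 0 so that H @ H.T approximates a symmetric A."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))          # random non-negative start
    loss = snmf_loss(A, H)
    for _ in range(n_iter):
        # Gradient of ||A - H H^T||_F^2 for symmetric A.
        grad = 4.0 * (H @ (H.T @ H) - A @ H)
        while step > 1e-12:
            H_new = np.maximum(H - step * grad, 0.0)   # project onto H >= 0
            new_loss = snmf_loss(A, H_new)
            if new_loss < loss:                        # accept only improving steps
                H, loss = H_new, new_loss
                step *= 2.0
                break
            step *= 0.5
        else:
            break                                      # no improving step found
    return H

# Toy affinity matrix with two dense blocks of three nodes each.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
H = symmetric_nmf(A, k=2)
labels = H.argmax(axis=1)   # read a hard assignment off each row of H
```

Each row of H plays the role of the soft cluster assignment of one instance; taking the arg-max per row is one simple way to turn it into a hard label.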

Page 11

Co-regularized multi-domain graph clustering (CGC)

• Cross-domain co-regularization
  - Residual sum of squares (RSS) loss (when the number of clusters is the same for different domains)
  - Clustering disagreement (CD) loss (when the number of clusters is the same or different)

Page 12

Co-regularized multi-domain graph clustering (CGC)

• Residual sum of squares (RSS) loss
  - Directly compare the $H^{(\pi)}$ inferred in different domains.
  - To penalize the inconsistency of cross-domain cluster partitions for the l-th cluster in $D_i$, the loss for the b-th instance is

$$J^{(i,j)}_{b,l} = \left( E^{(i,j)}(x^{(j)}_b, l) - h^{(j)}_{b,l} \right)^2$$

where

$$E^{(i,j)}(x^{(j)}_b, l) = \frac{1}{|N^{(i,j)}(x^{(j)}_b)|} \sum_{a \in N^{(i,j)}(x^{(j)}_b)} S^{(i,j)}_{b,a} \, h^{(i)}_{a,l}$$

$N^{(i,j)}(x^{(j)}_b)$ denotes the set of indices of instances in $D_i$ that are mapped to $x^{(j)}_b$, and $|N^{(i,j)}(x^{(j)}_b)|$ is its cardinality. The RSS loss is

$$J^{(i,j)}_{RSS} = \sum_{l=1}^{k} \sum_{b=1}^{n_j} J^{(i,j)}_{b,l} = \| S^{(i,j)} H^{(i)} - H^{(j)} \|_F^2$$
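A small numeric sketch may help make the RSS loss concrete. It assumes that the matrix form of the loss uses a mapping matrix whose rows have already been divided by the neighbor count |N|, and that instances with no cross-domain mapping contribute nothing to the loss. Neither point is spelled out on the slide, so both are assumptions of this sketch, as are the function name and the toy numbers.

```python
import numpy as np

def rss_loss(S, H_i, H_j):
    """RSS co-regularization loss between domains D_i and D_j.

    S   : (n_j, n_i) weights, S[b, a] links instance b in D_j to instance a in D_i
    H_i : (n_i, k) soft cluster assignments in D_i
    H_j : (n_j, k) soft cluster assignments in D_j
    """
    counts = np.count_nonzero(S, axis=1)     # |N(x_b)| for each instance b in D_j
    mapped = counts > 0                      # assumption: unmapped rows add no loss
    # Expected assignment of each mapped instance, projected from D_i.
    E = (S[mapped] @ H_i) / counts[mapped][:, None]
    return float(np.sum((E - H_j[mapped]) ** 2))

H_i = np.array([[0.8, 0.2],
                [0.6, 0.4],
                [0.1, 0.9]])
S = np.array([[1.0, 0.5, 0.0],    # instance 0 in D_j maps to instances 0 and 1 in D_i
              [0.0, 0.0, 1.0]])   # instance 1 in D_j maps to instance 2 in D_i
H_j = np.array([[0.7, 0.3],
                [0.1, 0.9]])

loss = rss_loss(S, H_i, H_j)      # approximately 0.0325
```

Here the first instance of D_j has the projected assignment (0.55, 0.20) against its own (0.7, 0.3), giving 0.15² + 0.10² = 0.0325, while the second instance agrees exactly with its projection.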

Page 13

Matrix sizes: $H^{(1)}$ is 12×2, $H^{(2)}$ is 19×2, $H^{(3)}$ is 7×2; the projections $S^{(1,2)}H^{(1)}$ and $S^{(3,2)}H^{(3)}$ are both 19×2.

H(1)
       C1    C2
  A    0.8   0.2
  B    0.7   0.3
  …    …     …
  C    0.1   0.9

H(2)
       C1    C2
  1    0.8   0.2
  2    0.7   0.3
  …    …     …
  3    0.1   0.9
  4
  5

H(3)
       C1    C2
  a    0.8   0.2
  …    …     …

S(1,2)
       A     B     …     C
  1    0.6   0     …     0
  2    0.9   0.8   …     0
  …    …     …     …     …
  3    0     0.1   …     0
  4    0     0     …     0.6
  5    0     0     …     0

S(3,2)
       1     2     …     3     4     5
  a    0     0     …     0     0     0.4
  …    …     …     …     …     …     …

Page 14

Co-regularized multi-domain graph clustering (CGC)

• Clustering disagreement (CD)
  - Indirectly measure the clustering inconsistency of cross-domain cluster partitions.
  - Intuition: Ⓐ and Ⓑ are mapped to ②, and Ⓒ is mapped to ④. If the similarity between the cluster assignments for ② and ④ is small, then the similarity of cluster assignments between Ⓐ and Ⓒ and the similarity between Ⓑ and Ⓒ should also be small.

The CD loss is

$$J^{(i,j)}_{CD} = \| S^{(i,j)} H^{(i)} (S^{(i,j)} H^{(i)})^T - H^{(j)} (H^{(j)})^T \|_F^2$$

[Figure: cross-domain mapping example with weighted edges]
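The CD loss compares pairwise assignment similarities rather than the assignments themselves, which is why it still applies when the two domains have different numbers of clusters. The following NumPy sketch shows this on toy data. The function name and the numbers are illustrative assumptions, and the mapping matrix is used as given, without any normalization.

```python
import numpy as np

def cd_loss(S, H_i, H_j):
    """Clustering disagreement loss between domains D_i and D_j.

    S   : (n_j, n_i) cross-domain weights
    H_i : (n_i, k_i) soft assignments in D_i
    H_j : (n_j, k_j) soft assignments in D_j (k_j may differ from k_i)
    """
    P = S @ H_i                      # D_i assignments projected into D_j, (n_j, k_i)
    sim_projected = P @ P.T          # pairwise similarity implied by D_i
    sim_own = H_j @ H_j.T            # pairwise similarity within D_j
    return float(np.linalg.norm(sim_projected - sim_own, "fro") ** 2)

# D_i has 3 instances and 2 clusters; D_j has 2 instances and 3 clusters.
H_i = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

H_j_agree = np.array([[1.0, 0.0, 0.0],      # the two instances sit in different clusters
                      [0.0, 0.0, 1.0]])
H_j_disagree = np.array([[1.0, 0.0, 0.0],   # the two instances share a cluster
                         [1.0, 0.0, 0.0]])

loss_agree = cd_loss(S, H_i, H_j_agree)         # 0.0
loss_disagree = cd_loss(S, H_i, H_j_disagree)   # 2.0
```

In the second case the projected similarity matrix is the identity while the within-domain similarity is all ones, so the two off-diagonal entries each contribute 1 to the loss.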

Page 15

Co-regularized multi-domain graph clustering (CGC)

• Objective function (joint matrix optimization):

$$\min_{H^{(\pi)} \ge 0 \, (1 \le \pi \le d)} \; O = \sum_{i=1}^{d} L^{(i)} + \sum_{(i,j) \in I} \lambda^{(i,j)} J^{(i,j)}$$

Can be solved with an alternating scheme: optimize the objective with respect to one variable while fixing others.
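The alternating scheme can be sketched for the simplest case of two domains coupled by an RSS penalty. Each pass updates one factor matrix while the other is held fixed. The block update used here is a projected gradient step with step halving, chosen only because it is easy to verify; it is not claimed to be the update rule derived for CGC. The function names, the weight `lam`, and the toy data are assumptions.

```python
import numpy as np

def objective(A1, A2, S, H1, H2, lam):
    """Two single-domain losses plus lam times the RSS penalty ||S H1 - H2||_F^2."""
    l1 = np.linalg.norm(A1 - H1 @ H1.T, "fro") ** 2
    l2 = np.linalg.norm(A2 - H2 @ H2.T, "fro") ** 2
    j = np.linalg.norm(S @ H1 - H2, "fro") ** 2
    return l1 + l2 + lam * j

def block_step(f, H, grad, step):
    """Projected gradient step on one block; halve the step until f decreases."""
    base = f(H)
    while step > 1e-12:
        H_new = np.maximum(H - step * grad, 0.0)
        if f(H_new) < base:
            return H_new
        step *= 0.5
    return H                                   # no improving step: keep the block

def cgc_rss_sketch(A1, A2, S, k, lam=1.0, n_iter=100, step=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    H1 = rng.random((A1.shape[0], k))
    H2 = rng.random((A2.shape[0], k))
    history = [objective(A1, A2, S, H1, H2, lam)]
    for _ in range(n_iter):
        # Update H1 with H2 fixed.
        R = S @ H1 - H2
        g1 = 4.0 * (H1 @ (H1.T @ H1) - A1 @ H1) + 2.0 * lam * (S.T @ R)
        H1 = block_step(lambda H: objective(A1, A2, S, H, H2, lam), H1, g1, step)
        # Update H2 with the new H1 fixed.
        R = S @ H1 - H2
        g2 = 4.0 * (H2 @ (H2.T @ H2) - A2 @ H2) - 2.0 * lam * R
        H2 = block_step(lambda H: objective(A1, A2, S, H1, H, lam), H2, g2, step)
        history.append(objective(A1, A2, S, H1, H2, lam))
    return H1, H2, history

# Toy data: two block-structured graphs and a row-normalized block mapping.
A1 = np.kron(np.eye(2), np.ones((3, 3)))              # 6 nodes, two dense blocks
A2 = np.kron(np.eye(2), np.ones((2, 2)))              # 4 nodes, two dense blocks
S = np.kron(np.eye(2), np.full((2, 3), 1.0 / 3.0))    # shape (4, 6)
H1, H2, history = cgc_rss_sketch(A1, A2, S, k=2)
```

Because a block is only replaced when the objective strictly decreases, `history` is non-increasing by construction, which makes the sketch easy to sanity-check.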

Page 16

Experimental Study

• Data sets:
  - UCI (Iris, Wine, Ionosphere, WDBC)
    Construct two cross-domain relationships: Iris-Wine and Ionosphere-WDBC (positive/negative instances are only mapped to positive/negative instances in the other domain)
  - Newsgroups data (from 20 Newsgroups)
    comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware
    rec.motorcycles, rec.sport.baseball, rec.sport.hockey
  - Protein-protein interaction (PPI) networks (from BioGrid), gene co-expression networks (from Gene Expression Omnibus), genetic interaction network (from TEAM)

Page 17

Experimental Study

• Effectiveness (UCI data set)

Page 18

Experimental Study

• Robustness Evaluation (UCI)

Page 19

Experimental Study

• Performance Evaluation

Page 20

Experimental Study

• Protein Module Detection by Integrating Multi-Domain Heterogeneous Data
  - 5412 genes
  - 490032 genetic markers across 4890 (1952 disease and 2938 healthy) samples
  - We use 1 million top-ranked genetic marker pairs to construct the network and the test statistics as the weights on the edges

Page 21

Experimental Study

Protein Module Detection:
• Evaluation: standard Gene Set Enrichment Analysis (GSEA)
  - We identify the most significantly enriched Gene Ontology categories
  - Significance (p-value) is determined by Fisher's exact test
  - Raw p-values are further calibrated to correct for the multiple testing problem

Page 22

GSEA

• The hypergeometric distribution is used to model the probability of observing at least k genes from a cluster of size n by chance in a category containing f genes from a total genome size of g genes.

• For example, if the majority of genes in a cluster come from one category, then it is unlikely that this happens by chance, and the category's p-value would be close to 0.
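The hypergeometric tail probability described above can be computed directly with the standard library. This is a minimal sketch using the slide's symbols k, n, f and g; the function name is an assumption, and the sketch does not include the multiple-testing calibration mentioned earlier.

```python
from math import comb

def enrichment_p_value(k, n, f, g):
    """P(X >= k), where X counts category genes in a random cluster.

    k : observed number of cluster genes that fall in the category
    n : cluster size
    f : number of genes in the category
    g : total number of genes in the genome
    """
    upper = min(n, f)                 # X cannot exceed the cluster or category size
    total = comb(g, n)                # number of ways to draw a cluster of size n
    tail = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return tail / total

# Small case checkable by hand: genome of 10 genes, category of 4,
# cluster of 3, at least 2 of which are in the category.
p_small = enrichment_p_value(k=2, n=3, f=4, g=10)      # (36 + 4) / 120 = 1/3

# A cluster drawn entirely from one small category is very unlikely by chance.
p_extreme = enrichment_p_value(k=5, n=5, f=5, g=100)   # 1 / C(100, 5)
```

The second call illustrates the slide's point: when every gene in the cluster comes from the same category, the p-value is close to 0.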

Page 23

Experimental Study

• Protein Module Detection:

Comparison of CGC and single-domain graph clustering (k = 100)

Page 24

Experimental Study

• Protein Module Detection:

Page 25

Summary

• In this project, we developed a flexible co-regularized method, CGC, to tackle many-to-many, weighted, partial mappings for multi-domain graph clustering.
  - CGC utilizes cross-domain relationships as a co-regularizing penalty to guide the search for a consensus clustering structure.
  - CGC is robust even when the cross-domain relationships based on prior knowledge are noisy.

• SIGKDD'13

Page 26

Comments and Questions

weiwang@cs.ucla.edu