
Learning Mixed Membership Models using Spectral Methods

Anima Anandkumar

U.C. Irvine

Data vs. Information

Big data: Missing observations, gross corruptions, outliers.

Learning useful information is like finding a needle in a haystack!

Learning with big data: statistically and computationally challenging!

Optimization for Learning

Most learning problems can be cast as optimization.

Unsupervised Learning

Clustering: k-means, hierarchical, . . .

Maximum Likelihood Estimator: probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function


Convex vs. Non-convex Optimization

Progress so far is only the tip of the iceberg: the real world is mostly non-convex!

Images taken from https://www.facebook.com/nonconvex

Convex vs. Nonconvex Optimization

Convex: unique optimum (global = local). Nonconvex: multiple local optima.

In high dimensions there are possibly exponentially many local optima.

How to deal with non-convexity? Spectral methods provide an answer.

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Matrices and Tensors as Data Structures

Modern data: multi-modal, multi-relational data.

Matrices: pairwise relations. Tensors: higher order relations.

Multi-modal data figure from Lise Getoor slides.

Tensor Representation of Higher Order Moments

Multivariate Moments: Many possibilities, . . .

For a random vector x, can compute empirical moments

E[x ⊗ x], E[x ⊗ x ⊗ x], . . .

Invert moments to learn about model parameters.


Matrix: Pairwise Moments

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor.

E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].

For matrices: E[x ⊗ x] = E[xx^⊤].

Tensor: Higher-Order Moments

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor.

E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
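To make the moment computation concrete, here is a minimal NumPy sketch (my illustration, not code from the talk); `X` is an assumed n × d matrix holding one sample per row:

```python
import numpy as np

def empirical_moments(X):
    """Empirical moments E[x (x) x] and E[x (x) x (x) x] of the rows of X."""
    n = X.shape[0]
    M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x] = E[x x^T]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n  # E[x ⊗ x ⊗ x]
    return M2, M3

# e.g. 1000 samples of a 5-dimensional random vector:
M2, M3 = empirical_moments(np.random.randn(1000, 5))
```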

Spectral Decomposition of Tensors

Matrix: M2 = ∑_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + · · ·

Tensor: M3 = ∑_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + · · ·

We have developed efficient methods to solve tensor decomposition.
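For concreteness, a small sketch of assembling such a decomposition into a tensor (assumptions mine: component matrices `U`, `V`, `W` hold the u_i, v_i, w_i as columns, with weights `lam`):

```python
import numpy as np

def rank_k_tensor(lam, U, V, W):
    """M3 = sum_i lam[i] * U[:, i] (x) V[:, i] (x) W[:, i], as a d x d x d array."""
    return np.einsum('i,ai,bi,ci->abc', lam, U, V, W)

# Rank-2 example with random orthonormal components:
d, k = 5, 2
lam = np.array([2.0, 1.0])
U, V, W = (np.linalg.qr(np.random.randn(d, k))[0] for _ in range(3))
M3 = rank_k_tensor(lam, U, V, W)
```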

Orthogonal Tensor Power Method

Symmetric orthogonal tensor T ∈ R^{d×d×d}:

T = ∑_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖ = Mv / ‖Mv‖.

Algorithm (tensor power method): v ↦ T(I, v, v) / ‖T(I, v, v)‖.

• The v_i's are the only robust fixed points.
• All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
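A minimal sketch of the tensor power update, assuming `T` is a symmetric, orthogonally decomposable NumPy array (names and structure are my own, not the talk's reference implementation):

```python
import numpy as np

def tensor_power_method(T, n_iters=100, seed=0):
    """One run of the power update v <- T(I, v, v) / ||T(I, v, v)||
    on a symmetric, orthogonally decomposable tensor T."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        Tv = np.einsum('abc,b,c->a', T, v, v)   # contraction T(I, v, v)
        v = Tv / np.linalg.norm(Tv)
    lam = np.einsum('abc,a,b,c->', T, v, v, v)  # eigenvalue lambda = T(v, v, v)
    return lam, v

# Recover further components by deflation:
# lam, v = tensor_power_method(T); T = T - lam * np.einsum('a,b,c->abc', v, v, v)
```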

Guaranteed Tensor Decomposition

Non-orthogonal decomposition: T1 = ∑_{i∈[k]} λ_i a_i ⊗ a_i ⊗ a_i.

Whitening matrix W: computed from tensor slices.

Multilinear transform T2 = T1(W, W, W) orthogonalizes the tensor.

Whitening requires the factor matrix A1 = [a_1, . . . , a_k] to have full column rank.

(Figure: W maps the components a_1, a_2, a_3 of tensor T1 to orthogonal components v_1, v_2, v_3 of tensor T2.)
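A hedged sketch of the whitening step; here `W` is computed from the second-order moment matrix `M2` (a common variant; the talk computes it from tensor slices), assuming the top-k eigenvalues of `M2` are positive:

```python
import numpy as np

def whiten(T1, M2, k):
    """Compute W with W^T M2 W = I_k and return T2 = T1(W, W, W),
    which is (approximately) an orthogonally decomposable k x k x k tensor."""
    eigvals, eigvecs = np.linalg.eigh(M2)
    top = np.argsort(eigvals)[::-1][:k]          # top-k eigenpairs
    U, s = eigvecs[:, top], eigvals[top]
    W = U / np.sqrt(s)                           # W = U diag(s)^{-1/2}
    T2 = np.einsum('abc,ai,bj,ck->ijk', T1, W, W, W)
    return T2, W
```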

Summary on Tensor Methods


Tensor decomposition: iterative algorithm with tensor contractions.

Tensor contraction: T(W1, W2, W3), where the Wi are matrices or vectors.
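As an illustration of such a contraction (my sketch; names assumed), `contract` implements T(W1, W2, W3) for matrix or vector arguments:

```python
import numpy as np

def contract(T, W1, W2, W3):
    """T(W1, W2, W3)[i, j, k] = sum_{a,b,c} T[a, b, c] * W1[a, i] * W2[b, j] * W3[c, k]."""
    # Treat vectors as single-column matrices so contractions like T(I, v, v) also work.
    W1, W2, W3 = [W.reshape(len(W), -1) for W in (W1, W2, W3)]
    return np.einsum('abc,ai,bj,ck->ijk', T, W1, W2, W3)

# The power-method update T(I, v, v) is contract(T, np.eye(d), v, v)[:, 0, 0].
```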

Strengths of tensor methods

Guaranteed convergence to globally optimal solution!

Fast and accurate, orders of magnitude faster than previous methods.

Embarrassingly parallel and suited for distributed systems, e.g. Spark.

Exploit optimized BLAS operations on CPUs and GPUs.

Open source software available: on CPU, GPU and Spark.

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Mining Graph Data

Various Statistics

Clustering/Grouping Nodes

Link Prediction: Predict Missing Edges

Accomplish all tasks via statistical community models

Network Communities in Various Domains

Social Networks

Social ties: e.g. friendships, co-authorships

Biological Networks

Functional relationships: e.g. gene regulation, neural activity.

Recommendation Systems

Recommendations: e.g. Yelp reviews.

Community Detection: Infer hidden communities from observed network.

Mixed Membership Community Models

Each node x has a community membership vector πx, drawn from a Dirichlet distribution with parameter α0; edges are generated according to the memberships of their endpoints.

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Connection to LDA Topic Models

What statistics yield guaranteed learning of communities?

(Figure: topic model analogy — a document x emits words a, b, c, just as a node x connects to neighbors a, b, c.)

Subgraph Counts as Graph Moments

Utilize presence of star counts for learning

2-Star Count Matrix

M2(a, b) = (1/|X|) · (# of common neighbors of a and b in X)
         = (1/|X|) ∑_{x∈X} G(x, a) G(x, b).

M2 = (1/|X|) ∑_{x∈X} [G_{x,A}^⊤ ⊗ G_{x,B}^⊤]

(Figure: a 2-star — a node x ∈ X with neighbors a ∈ A and b ∈ B.)

n: network size. M2 ∈ R^{|A|×|B|} = O(n × n).
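A minimal sketch (mine, not the talk's implementation) of the 2-star count matrix from a 0/1 adjacency matrix `G`, with the node partitions `X`, `A`, `B` given as index arrays:

```python
import numpy as np

def two_star_matrix(G, X, A, B):
    """M2(a, b) = (1/|X|) * sum_{x in X} G[x, a] * G[x, b]."""
    GXA = G[np.ix_(X, A)]            # |X| x |A| slice of the adjacency matrix
    GXB = G[np.ix_(X, B)]            # |X| x |B| slice
    return GXA.T @ GXB / len(X)      # average of the outer products G[x,A]^T (x) G[x,B]
```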

3-star Counts

Third-order tensor: a three-dimensional array.

3-Star Count Tensor

M3(a, b, c) = (1/|X|) · (# of common neighbors of a, b and c in X)
            = (1/|X|) ∑_{x∈X} G(x, a) G(x, b) G(x, c).

M3 = (1/|X|) ∑_{x∈X} [G_{x,A}^⊤ ⊗ G_{x,B}^⊤ ⊗ G_{x,C}^⊤]

(Figure: a 3-star — a node x ∈ X with neighbors a ∈ A, b ∈ B, c ∈ C.)

n: network size. M3 ∈ R^{|A|×|B|×|C|} = O(n × n × n).
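The 3-star count tensor can be formed the same way; again a sketch under the same assumptions:

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """M3(a, b, c) = (1/|X|) * sum_{x in X} G[x, a] * G[x, b] * G[x, c]."""
    GXA, GXB, GXC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GXA, GXB, GXC) / len(X)
```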

Multi-view Representation

Conditional independence of the three views.

πx: community membership vector of node x.

(Figure: 3-stars — a node x ∈ X with neighborhood views G_{x,A}^⊤, G_{x,B}^⊤, G_{x,C}^⊤; in the graphical model, πx generates the three views through F_A, F_B, F_C.)

Linear multi-view model: E[G_{x,A}^⊤ | πx] = F_A πx.

E[M3 | Π_{A,B,C}] = ∑_{i∈[k]} λ_i [(F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i]

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Main Results

k communities, n nodes, uniform community sizes.

α0: sparsity level of community memberships (Dirichlet parameter).

p, q: intra-/inter-community edge density.

Scaling Requirements

n = Ω(k²(α0 + 1)³),   (p − q)/√p = Ω((α0 + 1)^{1.5} k / √n).

For the stochastic block model (α0 = 0), the results are tight.

Tight guarantees for sparse graphs (scaling of p, q).

Tight guarantees on community size: communities of size at least √n are required.

Efficient scaling w.r.t. the sparsity level α0 of the memberships.

"A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.
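As a rough numeric illustration (my numbers, up to the unspecified constants in the Ω(·) bounds): with k = 10 and α0 = 1, the first condition asks for n on the order of k²(α0 + 1)³ = 800 nodes. A sketch for checking both conditions with assumed constants:

```python
def scaling_ok(n, k, p, q, alpha0, c1=1.0, c2=1.0):
    """Check the two scaling requirements, with c1, c2 standing in for the
    unspecified constants of the Omega(.) bounds."""
    enough_nodes = n >= c1 * k**2 * (alpha0 + 1)**3
    enough_separation = (p - q) / p**0.5 >= c2 * (alpha0 + 1)**1.5 * k / n**0.5
    return enough_nodes and enough_separation

print(scaling_ok(n=20_000, k=10, p=0.1, q=0.01, alpha0=1))  # True for these values
```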

Main Results (Contd.)

α0: sparsity level of community memberships (Dirichlet parameter).

Π: community membership matrix; Π(i): ith community.

S: estimated supports; S(i, j): support for node j in community i.

Norm Guarantees

(1/n) max_i ‖Π̂_i − Π_i‖₁ = O( (α0 + 1)^{3/2} √p / ((p − q) √n) )

Support Recovery

There exists ξ such that, for all nodes j ∈ [n] and all communities i ∈ [k], w.h.p.:

Π(i, j) ≥ ξ ⇒ S(i, j) = 1   and   Π(i, j) ≤ ξ/2 ⇒ S(i, j) = 0.

Zero-error support recovery of significant memberships of all nodes.
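One simple instantiation of this rule (my sketch; the threshold 3ξ/4 is an assumption that realizes the stated guarantee when the entrywise estimation error is below ξ/4):

```python
import numpy as np

def recover_support(Pi_hat, xi):
    """S(i, j) = 1 iff the estimated membership Pi_hat[i, j] clears 3*xi/4.
    With estimation error below xi/4, true memberships >= xi are kept and
    those <= xi/2 are zeroed, matching the guarantee."""
    return (Pi_hat >= 0.75 * xi).astype(int)
```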

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Experimental Results

Facebook: n ∼ 20,000.  Yelp: n ∼ 40,000.  DBLP: n ∼ 1 million.

Tensor vs. Variational

(Figure: bar chart of running time in seconds, log scale from 10⁰ to 10⁵, on the FB, YP, DBLP-sub, and DBLP datasets; the tensor method runs orders of magnitude faster than the variational method.)


Experimental Results on Yelp

Lowest-error business categories & largest-weight businesses:

Rank  Category        Business
1     Latin American  Salvadoreno Restaurant
2     Gluten Free     P.F. Chang's China Bistro
3     Hobby Shops     Make Meaning
4     Mass Media      KJZZ
5     Yoga            Sutra Midtown

Top-5 bridging businesses:

Business              Categories
Four Peaks Brewing    Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco       Restaurants, Pizza, Phoenix
FEZ                   Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast  Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co      Restaurants, Bars, Nightlife, Pubs, Tempe

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Spectral Methods for Learning Community Models
   Community Detection in Graphs
   Spectral Methods for Community Detection
   Theoretical Guarantees
   Experimental Results

4 Conclusion

Learning Communities using Spectral Methods

Guaranteed to recover correct model

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes, etc.

Other Models

Expand to other hypergraph community models.

Our recent work considers social tagging networks: tripartite 3-hypergraph models.