Learning Mixed Membership Models using Spectral Methods
Anima Anandkumar
U.C. Irvine
Data vs. Information
Big data: Missing observations, gross corruptions, outliers.
Learning useful information is like finding a needle in a haystack!
Learning with big data: statistically and computationally challenging!
Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, …
Maximum likelihood estimation: probabilistic latent variable models
Supervised Learning
Optimizing a neural network with respect to a loss function
[Figure: a neural network mapping inputs through neurons to an output.]
Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg… The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
Convex vs. Nonconvex Optimization
Convex: unique optimum, global = local. Nonconvex: multiple local optima.
In high dimensions, possibly exponentially many local optima.
How to deal with non-convexity?
Spectral methods provide an answer.
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community Models
    Community Detection in Graphs
    Spectral Methods for Community Detection
    Theoretical Guarantees
    Experimental Results
4 Conclusion
Matrices and Tensors as Data Structures
Modern data: multi-modal, multi-relational data.
Matrices: pairwise relations. Tensors: higher order relations.
Multi-modal data figure from Lise Getoor's slides.
Tensor Representation of Higher Order Moments
Multivariate Moments: many possibilities …
For a random vector x, we can compute the empirical moments
E[x ⊗ x], E[x ⊗ x ⊗ x], …
Invert moments to learn about model parameters.
Matrix: Pairwise Moments
E[x ⊗ x] ∈ R^{d×d} is a second-order tensor.
E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].
For matrices: E[x ⊗ x] = E[xx⊤].
Tensor: Higher-Order Moments
E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor.
E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
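For concreteness, a minimal NumPy sketch of forming these empirical moments from samples (the data and variable names are illustrative, not from the talk):

import numpy as np

# Illustrative data: n samples of a random vector x in R^d.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))

# Empirical second-order moment E[x ⊗ x]: a d x d matrix.
M2 = np.einsum('ni,nj->ij', X, X) / n

# Empirical third-order moment E[x ⊗ x ⊗ x]: a d x d x d tensor.
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n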
Spectral Decomposition of Tensors
M2 = ∑_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ⋯
M3 = ∑_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ⋯
We have developed efficient methods to solve tensor decomposition.
Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}:
T = ∑_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.
Recall the matrix power method: v ↦ M(I, v)/‖M(I, v)‖ = Mv/‖Mv‖.
Algorithm (tensor power method): v ↦ T(I, v, v)/‖T(I, v, v)‖.
• The v_i's are the only robust fixed points.
• All other eigenvectors are saddle points.
For an orthogonal tensor, no spurious local optima!
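A minimal NumPy sketch of this iteration, with deflation to extract successive eigenpairs; the function names and the random-restart heuristic are illustrative, not the exact algorithm from the talk:

import numpy as np

def tensor_power_method(T, n_iters=100, tol=1e-10, seed=0):
    """Iterate v -> T(I, v, v) / ||T(I, v, v)|| from a random start.
    T: symmetric, orthogonally decomposable d x d x d tensor.
    Returns an estimated eigenpair (lambda, v)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        # Contraction T(I, v, v)_i = sum_{j,k} T[i, j, k] v[j] v[k].
        Tv = np.einsum('ijk,j,k->i', T, v, v)
        v_new = Tv / np.linalg.norm(Tv)
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue T(v, v, v)
    return lam, v

def decompose(T, k, n_restarts=10):
    """Recover k eigenpairs by power iteration with deflation."""
    pairs = []
    for _ in range(k):
        # Several random restarts; keep the pair with the largest eigenvalue.
        lam, v = max((tensor_power_method(T, seed=s) for s in range(n_restarts)),
                     key=lambda p: abs(p[0]))
        pairs.append((lam, v))
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate found component
    return pairs

Since the v_i are the only robust fixed points, random restarts land in their basins of attraction, and deflation removes each recovered component before the next run.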
Guaranteed Tensor Decomposition
Non-orthogonal decomposition: T1 = ∑_{i∈[k]} λ_i a_i ⊗ a_i ⊗ a_i.
Whitening matrix W: computed from tensor slices.
Multilinear transform: T2 = T1(W, W, W) to orthogonalize.
Whitening requires the factor matrix A = [a_1, …, a_k] to be full column rank.
[Figure: whitening, W maps the non-orthogonal components a_1, a_2, a_3 of tensor T1 to orthonormal components v_1, v_2, v_3 of tensor T2.]
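A sketch of the whitening step, assuming a pairwise moment M2 = ∑_{i∈[k]} λ_i a_i ⊗ a_i with the same components is available (the helper name and this simplified setting are illustrative):

import numpy as np

def whiten(M2, T1, k):
    """Orthogonalize T1 via the pairwise moment M2.
    Returns the whitened k x k x k tensor T2 = T1(W, W, W) and W."""
    # Top-k eigendecomposition of the symmetric PSD moment M2.
    eigvals, eigvecs = np.linalg.eigh(M2)
    idx = np.argsort(eigvals)[::-1][:k]
    U, D = eigvecs[:, idx], eigvals[idx]
    W = U / np.sqrt(D)  # d x k whitening matrix: W.T @ M2 @ W = I_k
    # Multilinear transform T1(W, W, W), i.e. contract each mode with W.
    T2 = np.einsum('abc,ai,bj,ck->ijk', T1, W, W, W)
    return T2, W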
Summary on Tensor Methods
Tensor decomposition: iterative algorithm with tensor contractions.
Tensor contraction: T (W1,W2,W3), where Wi are matrices/vectors.
Strengths of tensor method
Guaranteed convergence to globally optimal solution!
Fast and accurate, orders of magnitude faster than previous methods.
Embarrassingly parallel and suited for distributed systems, e.g. Spark.
Exploit optimized BLAS operations on CPUs and GPUs.
Open source software available: on CPU, GPU and Spark.
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community Models
    Community Detection in Graphs
    Spectral Methods for Community Detection
    Theoretical Guarantees
    Experimental Results
4 Conclusion
Mining Graph Data
Various Statistics
Clustering/Grouping Nodes
Link Prediction: Predict Missing Edges
Accomplish all of these tasks via statistical community models
Network Communities in Various Domains
Social Networks
Social ties: e.g. friendships, co-authorships
Biological Networks
Functional relationships: e.g. gene regulation, neural activity.
Recommendation Systems
Recommendations: e.g. Yelp reviews.
Community Detection: Infer hidden communities from observed network.
Mixed Membership Community Models
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community ModelsCommunity Detection in GraphsSpectral Methods for Community DetectionTheoretical GuaranteesExperimental Results
4 Conclusion
Connection to LDA Topic Models
What statistics yield guaranteed learning of communities?
Topic Model Analogy
[Figure: topic model analogy, a document x generating Word 1 (a), Word 2 (b), Word 3 (c).]
Subgraph Counts as Graph Moments
Utilize star subgraph counts for learning
2-Star Count Matrix
M2(a, b) = (1/|X|) · #{common neighbors of a, b in X} = (1/|X|) ∑_{x∈X} G(x, a) G(x, b).
M2 = (1/|X|) ∑_{x∈X} [G⊤_{x,A} ⊗ G⊤_{x,B}]
[Figure: a 2-star, center x ∈ X with leaves a ∈ A and b ∈ B.]
n: network size. M2 ∈ R^{|A|×|B|}, of size O(n × n).
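A sketch of this computation from the adjacency matrix (NumPy; the function name and the partition X, A, B are illustrative):

import numpy as np

def two_star_moment(G, X, A, B):
    """2-star count matrix: M2(a, b) = (1/|X|) sum_{x in X} G(x, a) G(x, b).
    G: n x n adjacency matrix; X, A, B: disjoint node-index arrays."""
    GxA = G[np.ix_(X, A)]  # |X| x |A| block of edges from X to A
    GxB = G[np.ix_(X, B)]  # |X| x |B| block of edges from X to B
    # Average over star centers x of the outer products G⊤_{x,A} ⊗ G⊤_{x,B}.
    return GxA.T @ GxB / len(X)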
3-star Counts
Third-order tensor: a three-dimensional array.
3-Star Count Tensor
M3(a, b, c) = (1/|X|) · #{common neighbors of a, b, c in X} = (1/|X|) ∑_{x∈X} G(x, a) G(x, b) G(x, c).
M3 = (1/|X|) ∑_{x∈X} [G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C}]
[Figure: a 3-star, center x ∈ X with leaves a ∈ A, b ∈ B, c ∈ C.]
n: network size. M3 ∈ R^{|A|×|B|×|C|}, of size O(n × n × n).
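The analogous sketch for the third-order moment, with the same illustrative conventions:

import numpy as np

def three_star_moment(G, X, A, B, C):
    """3-star count tensor: M3(a, b, c) = (1/|X|) sum_{x in X} G(x,a) G(x,b) G(x,c)."""
    GxA, GxB, GxC = (G[np.ix_(X, S)] for S in (A, B, C))
    # Average of the rank-1 tensors G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C} over centers x.
    return np.einsum('xa,xb,xc->abc', GxA, GxB, GxC) / len(X)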
Multi-view Representation
Conditional independence of the three views.
π_x: community membership vector of node x.
[Figure: 3-stars, center x ∈ X with leaf sets A, B, C; graphical model in which π_x generates the three views G⊤_{x,A}, G⊤_{x,B}, G⊤_{x,C} through F_A, F_B, F_C.]
Linear multi-view model: E[G⊤_{x,A} | Π] = F_A π_x.
E[M3 | Π_{A,B,C}] = ∑_{i∈[k]} λ_i [(F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i]
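Putting the pieces together, a heavily simplified end-to-end sketch: it assumes a symmetrized setting in which the three views share one factor matrix F (so the 2-star and 3-star moments have the same components), ignores the moment corrections needed for the Dirichlet model, and reuses the illustrative helpers defined above.

import numpy as np

def estimate_communities(G, X, A, k):
    """Recover the k columns of F (up to permutation and scaling) from graph moments."""
    M2 = two_star_moment(G, X, A, A)       # shared pairwise moment
    M3 = three_star_moment(G, X, A, A, A)  # third-order moment
    T2, W = whiten(M2, M3, k)              # orthogonalize: T2 = M3(W, W, W)
    pairs = decompose(T2, k)               # eigenpairs of the orthogonal tensor
    # Un-whiten: each component is recovered as F_i = lambda_i * pinv(W.T) @ v_i.
    return np.column_stack([lam * np.linalg.pinv(W.T) @ v for lam, v in pairs])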
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community Models
    Community Detection in Graphs
    Spectral Methods for Community Detection
    Theoretical Guarantees
    Experimental Results
4 Conclusion
Main Results
k communities, n nodes, uniform community sizes.
α0: sparsity level of community memberships (Dirichlet parameter).
p, q: intra-/inter-community edge density.
Scaling Requirements
n = Ω(k^2 (α0 + 1)^3),   (p − q)/√p = Ω((α0 + 1)^{1.5} k/√n).
For the stochastic block model (α0 = 0), tight results:
Tight guarantees for sparse graphs (scaling of p, q)
Tight guarantees on community size: requires communities of size at least √n
Efficient scaling w.r.t. the sparsity level of memberships α0
"A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. COLT 2013.
Main Results (Contd.)
α0: sparsity level of community memberships (Dirichlet parameter).
Π: community membership matrix; Π(i): ith community.
S: estimated supports; S(i, j): support for node j in community i.
Norm Guarantees
(1/n) max_i ‖Π̂(i) − Π(i)‖_1 = O((α0 + 1)^{3/2} √p / ((p − q) √n))
Support Recovery
∃ ξ s.t. for all nodes j ∈ [n] and all communities i ∈ [k], w.h.p.:
Π(i, j) ≥ ξ ⇒ S(i, j) = 1, and Π(i, j) ≤ ξ/2 ⇒ S(i, j) = 0.
Zero-error support recovery of significant memberships of all nodes.
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community Models
    Community Detection in Graphs
    Spectral Methods for Community Detection
    Theoretical Guarantees
    Experimental Results
4 Conclusion
Experimental Results
Facebook: n ∼ 20,000
Yelp: n ∼ 40,000
DBLP: n ∼ 1 million
Tensor vs. Variational
[Figure: running time in seconds (log scale, 10^0 to 10^5) on the FB, YP, DBLP-sub, and DBLP datasets, comparing the variational and tensor methods; the tensor method runs orders of magnitude faster.]
Experimental Results on Yelp
Lowest error business categories & largest weight businesses
Rank Category Business
1 Latin American Salvadoreno Restaurant
2 Gluten Free P.F. Chang’s China Bistro
3 Hobby Shops Make Meaning
4 Mass Media KJZZ
5 Yoga Sutra Midtown
Top-5 bridging businesses
Business Categories
Four Peaks Brewing Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco Restaurants, Pizza, Phoenix
FEZ Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co Restaurants, Bars, Nightlife, Pubs, Tempe
Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Spectral Methods for Learning Community Models
    Community Detection in Graphs
    Spectral Methods for Community Detection
    Theoretical Guarantees
    Experimental Results
4 Conclusion
Learning Communities using Spectral Methods
Guaranteed to recover the correct model
Efficient sample and computational complexities
Better performance compared to EM, variational Bayes, etc.
Other Models
Expand to other hypergraph community models.
Our recent work considers social tagging networks: tripartite 3-hypergraph models.