Spectral Methods for Learning
Latent Variable Models: Unsupervised and Supervised Settings
Anima Anandkumar
U.C. Irvine
Learning with Big Data
Data vs. Information
Messy Data
Missing observations, gross corruptions, outliers.
High dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data: computationally challenging!
Principled approaches for finding low dimensional structures?
How to model information structures?
Latent variable models
Incorporate hidden or latent variables.
Information structures: Relationships between latent variables and observed data.
Basic Approach: mixtures/clusters
Hidden variable is categorical.
Advanced: Probabilistic models
Hidden variables have more general distributions.
Can model mixed membership/hierarchical groups.
[Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1, . . . , x5.]
Latent Variable Models (LVMs)
Document modeling
Observed: words.
Hidden: topics.
Social Network Modeling
Observed: social interactions.
Hidden: communities, relationships.
Recommendation Systems
Observed: recommendations (e.g., reviews).
Hidden: user and business attributes.
Unsupervised Learning: Learn LVM without labeled examples.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., computer vision and NLP.
Sparse Coding/Dictionary Learning
Sparse representations, low dimensional hidden structures.
A few dictionary elements make complicated shapes.
Associative Latent Variable Models
Supervised Learning
Given labeled examples {(xi, yi)}, learn a classifier y = f(x).
Associative/conditional models: p(y|x).
Example: Logistic regression: E[y|x] = σ(〈u, x〉).
Mixture of Logistic Regressions
E[y|x, h] = g(〈Uh, x〉 + 〈b, h〉)
Multi-layer/Deep Network
E[y|x] = σd(Ad σd−1(Ad−1 σd−2(· · ·A2 σ1(A1x))))
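The three conditional-mean models on this slide can be sketched in a few lines of NumPy. This is only an illustration: the function names are hypothetical, the link g in the mixture model is assumed to be the sigmoid, and h is assumed to be a one-hot (categorical) vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mean(u, x):
    """Logistic regression: E[y|x] = sigma(<u, x>)."""
    return sigmoid(u @ x)

def mixture_mean(U, b, x, h):
    """Mixture of logistic regressions: E[y|x, h] = g(<U h, x> + <b, h>).

    Assumption: g is the sigmoid and h is a one-hot vector, so U @ h
    selects the weight vector of the active mixture component.
    """
    return sigmoid((U @ h) @ x + b @ h)

def deep_mean(weights, x, activations):
    """Multi-layer network: E[y|x] = sigma_d(A_d sigma_{d-1}(... sigma_1(A_1 x)))."""
    out = x
    for A, act in zip(weights, activations):
        out = act(A @ out)   # apply layer A_l, then its activation sigma_l
    return out
```

For example, `deep_mean([A1, A2], x, [sigmoid, sigmoid])` evaluates a two-layer network with weight matrices `A1` and `A2`.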
Challenges in Learning LVMs
Computational Challenges
Maximum likelihood is NP-hard in most scenarios.
Practice: Local search approaches such as back-propagation, EM, and Variational Bayes have no consistency guarantees.
Sample Complexity
Sample complexity is exponential (w.r.t. hidden variable dimension) for many learning methods.
Guaranteed and efficient learning through spectral methods
Outline
1 Introduction
2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors
3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results
4 Moment Tensors in Supervised Setting
5 Conclusion
Classical Spectral Methods: Matrix PCA and CCA
Unsupervised Setting: PCA
For centered samples {xi}, find a projection P with Rank(P) = k s.t.
$\min_P \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2.$
Result: Eigen-decomposition of S = Cov(X).
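A minimal NumPy sketch of the PCA result: the rank-k projection is built from the top-k eigenvectors of the sample covariance. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k projection P minimizing the mean of ||x_i - P x_i||^2."""
    Xc = X - X.mean(axis=0)                 # center the samples
    S = Xc.T @ Xc / Xc.shape[0]             # sample covariance S = Cov(X)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    top = eigvecs[:, -k:]                   # eigenvectors of the k largest eigenvalues
    return top @ top.T                      # orthogonal projection onto their span

# Synthetic data with two dominant directions (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 0.5, 0.2, 0.1])
P = pca_projection(X, 2)
```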
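Supervised Setting: CCA
For centered samples {xi, yi}, find
$\max_{a,b} \frac{a^\top \mathbb{E}[x y^\top]\, b}{\sqrt{a^\top \mathbb{E}[x x^\top]\, a \;\, b^\top \mathbb{E}[y y^\top]\, b}}.$
Result: Generalized eigen decomposition.
[Figure: x and y are projected to 〈a, x〉 and 〈b, y〉.]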
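One common way to solve the CCA problem, sketched here under the assumption that both sample covariances are invertible, is to whiten x and y and take the top singular pair of the whitened cross-covariance; this is equivalent to the generalized eigen decomposition. The helper names are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_top(X, Y):
    """Top canonical directions (a, b) and their correlation rho."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)   # whitened cross-covariance
    a = Wx @ U[:, 0]                          # map back to the original coordinates
    b = Wy @ Vt[0]
    return a, b, s[0]
```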
Shortcomings of Matrix Methods
Learning through Spectral Clustering
Dimension reduction through PCA (on data matrix)
Clustering on projected vectors (e.g. k-means).
Basic method works only for single memberships.
Failure to cluster under small separation.
Efficient Learning Without Separation Constraints?
Beyond SVD: Spectral Methods on Tensors
How to learn the mixture models without separation constraints?
◮ PCA uses covariance matrix of data. Are higher order moments helpful?
Unified framework?
◮ Moment-based estimation of probabilistic latent variable models?
SVD gives spectral decomposition of matrices.
◮ What are the analogues for tensors?
Moment Matrices and Tensors
Multivariate Moments in Unsupervised Setting
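$M_1 := \mathbb{E}[x], \quad M_2 := \mathbb{E}[x \otimes x], \quad M_3 := \mathbb{E}[x \otimes x \otimes x].$
Matrix
$\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor.
$\mathbb{E}[x \otimes x]_{i_1, i_2} = \mathbb{E}[x_{i_1} x_{i_2}].$
For matrices: $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top].$
Tensor
$\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor.
$\mathbb{E}[x \otimes x \otimes x]_{i_1, i_2, i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}].$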
Multivariate Moments in Supervised Setting
$M_1 := \mathbb{E}[x], \mathbb{E}[y], \quad M_2 := \mathbb{E}[x \otimes y], \quad M_3 := \mathbb{E}[x \otimes x \otimes y].$
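The moments above can be estimated from samples with `np.einsum`. The sketch below assumes the rows of `X` are samples and, for the supervised case, that the label `y` is a scalar, so that E[x ⊗ y] is a vector and E[x ⊗ x ⊗ y] is a matrix. The function names are illustrative.

```python
import numpy as np

def unsupervised_moments(X):
    """Empirical M1 = E[x], M2 = E[x ⊗ x], M3 = E[x ⊗ x ⊗ x]."""
    n = X.shape[0]
    M1 = X.mean(axis=0)
    M2 = np.einsum('ni,nj->ij', X, X) / n
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n
    return M1, M2, M3

def supervised_moments(X, y):
    """Empirical E[x ⊗ y] and E[x ⊗ x ⊗ y] for a scalar label y (an assumption)."""
    n = X.shape[0]
    M2 = np.einsum('ni,n->i', X, y) / n
    M3 = np.einsum('ni,nj,n->ij', X, X, y) / n
    return M2, M3
```

Note that the dense third-order moment needs O(d^3) memory, so this direct form is only practical for small d.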
Spectral Decomposition of Tensors
$M_2 = \sum_i \lambda_i\, u_i \otimes v_i$
[Figure: matrix M2 = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + ....]
$M_3 = \sum_i \lambda_i\, u_i \otimes v_i \otimes w_i$
[Figure: tensor M3 = λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + ....]
$u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1, i_2, i_3)$-th entry is $u_{i_1} v_{i_2} w_{i_3}$.
How to solve this non-convex problem?
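The rank-1 structure and the sum-of-rank-1 form can be written directly with `np.einsum`. This is a small sketch with hypothetical function names; components are stored as matrix columns.

```python
import numpy as np

def rank_one(u, v, w):
    """Rank-1 tensor u ⊗ v ⊗ w with entries u[i1] * v[i2] * w[i3]."""
    return np.einsum('i,j,k->ijk', u, v, w)

def cp_tensor(lams, U, V, W):
    """sum_r lams[r] * U[:, r] ⊗ V[:, r] ⊗ W[:, r]."""
    return np.einsum('r,ir,jr,kr->ijk', lams, U, V, W)
```

Building a tensor from its components is easy; the non-convex problem is the reverse direction, recovering the components from the tensor.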
Decomposition of Orthogonal Tensors
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i.$
Suppose $A = [a_1, \dots, a_k]$ has orthonormal columns.
$M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1.$
ai are eigenvectors of tensor M3.
Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
Two Problems
How to find eigenvectors of a tensor?
A is not orthogonal in general.
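The eigenvector relation M3(I, a1, a1) = w1 a1 can be checked numerically on a small synthetic example. The sketch assumes unit-norm, mutually orthogonal components (obtained here from a QR factorization) and arbitrary illustrative weights.

```python
import numpy as np

def contract(T, v):
    """T(I, v, v): contract the last two modes of T with v."""
    return np.einsum('ijk,j,k->i', T, v, v)

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.normal(size=(4, 3)))     # columns a_i are orthonormal
w = np.array([0.5, 0.3, 0.2])                    # illustrative weights
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)    # sum_i w_i a_i ⊗ a_i ⊗ a_i

# a_1 is an eigenvector of M3 with eigenvalue w_1.
residual = contract(M3, A[:, 0]) - w[0] * A[:, 0]
```

Here `residual` is zero up to floating-point error.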
Orthogonal Tensor Power Method
Symmetric orthogonal tensor $T \in \mathbb{R}^{d \times d \times d}$:
$T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i.$
Recall matrix power method: $v \mapsto \frac{M(I, v)}{\|M(I, v)\|}.$
Algorithm: tensor power method: $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}.$
How do we avoid spurious solutions (not part of decomposition)?
• {vi}’s are the only robust fixed points.
• All other eigenvectors are saddle points.
For an orthogonal tensor, no spurious local optima!
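A minimal sketch of the tensor power update v ↦ T(I, v, v) / ‖T(I, v, v)‖ follows. The random restarts and the deflation step (subtracting each recovered component before searching for the next) are common additions assumed here, not stated on the slide; the sketch also assumes all λi are positive. Parameter names and defaults are illustrative.

```python
import numpy as np

def tensor_power_method(T, k, n_restarts=10, n_iters=100, seed=0):
    """Recover (lambda_i, v_i) from T = sum_i lambda_i v_i ⊗ v_i ⊗ v_i."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    d = T.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)    # T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        # Deflation: remove the recovered rank-1 component.
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(lams), np.column_stack(vecs)
```

The components are returned in the order they are found, which need not be sorted by eigenvalue.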
Whitening: Conversion to Orthogonal Tensor
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i, \quad M_2 = \sum_i w_i\, a_i \otimes a_i.$
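Find whitening matrix W s.t. $W^\top M_2 W = I$; then $V = W^\top A\, \mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix.
When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation.
[Figure: W maps the non-orthogonal components a1, a2, a3 to orthogonal vectors v1, v2, v3.]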
Use pairwise moments M2 to find W .
SVD of M2 is needed.
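A sketch of the whitening step: since M2 is symmetric positive semidefinite, its SVD coincides with its eigendecomposition, so `np.linalg.eigh` is used here. The function name is hypothetical, and the sketch assumes M2 has rank k with positive weights.

```python
import numpy as np

def whitening_matrix(M2, k):
    """Return W (d x k) with W^T M2 W = I_k, built from the top-k eigenpairs of M2."""
    eigvals, eigvecs = np.linalg.eigh(M2)     # ascending eigenvalues
    U, s = eigvecs[:, -k:], eigvals[-k:]
    return U / np.sqrt(s)                     # scale column j by 1 / sqrt(s_j)
```

With M2 = Σ wi ai ⊗ ai, the whitened and rescaled components √wi · W⊤ai are orthonormal, which is what makes the transformed third-order tensor orthogonally decomposable.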
Putting it together
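Non-orthogonal tensor $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
Whitening matrix W.
Multilinear transform: $T = M_3(W, W, W)$.
[Figure: W maps tensor M3 with components a1, a2, a3 to tensor T with orthogonal components v1, v2, v3.]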
Tensor Decomposition: Guaranteed Non-Convex Optimization!
For what latent variable models can we obtain M2 and M3 forms?
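The whole pipeline (whiten, decompose the orthogonal tensor, un-whiten) can be sketched end to end on exact moments. The restarts, deflation, and un-whitening formulas below are standard choices assumed for illustration; they follow from the fact that T = M3(W, W, W) has eigenvalues λi = wi^(-1/2) when W⊤M2W = I. The function name and defaults are hypothetical.

```python
import numpy as np

def decompose_moments(M2, M3, k, n_restarts=10, n_iters=100, seed=0):
    """Recover (w_i, a_i) from M2 = sum_i w_i a_i ⊗ a_i and M3 = sum_i w_i a_i^{⊗3}."""
    rng = np.random.default_rng(seed)
    # 1. Whitening matrix W (d x k) with W^T M2 W = I_k.
    eigvals, eigvecs = np.linalg.eigh(M2)
    W = eigvecs[:, -k:] / np.sqrt(eigvals[-k:])
    # 2. Multilinear transform T = M3(W, W, W), a k x k x k orthogonal tensor.
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
    # 3. Tensor power iterations with deflation, then un-whitening.
    unwhiten = np.linalg.pinv(W.T)            # maps whitened vectors back to R^d
    weights, components = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=k)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
        weights.append(best_lam ** -2)                     # w_i = lambda_i^{-2}
        components.append(best_lam * (unwhiten @ best_v))  # a_i = lambda_i (W^T)^+ v_i
    return np.array(weights), np.column_stack(components)
```

With empirical rather than exact moments the recovered parameters are only approximate, and the output order of the components is arbitrary.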
Types of Latent Variable Models
What is the form of hidden variables h?
Basic Approach: mixtures/clusters
Hidden variable h is categorical.
Advanced: Probabilistic models
Hidden variable h has more general distributions.
Can model mixed memberships, e.g. Dirichlet distribution.
[Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1, . . . , x5.]
Topic Modeling
Geometric Picture for Topic Models
[Figure: a document with its topic proportions vector h; in the single-topic case, words x1, x2, x3, . . . are generated from h through the topic-word matrix A.]
Linear model: E[xi|h] = Ah.
Moments for Single Topic Models
E[xi|h] = Ah.
w := E[h].
Learn topic-word matrix A, vector w.
[Figure: hidden topic h generating words x1, . . . , x5, each through A.]
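Pairwise Co-occurrence Matrix M2
$M_2 := \mathbb{E}[x_1 \otimes x_2] = \mathbb{E}[\mathbb{E}[x_1 \otimes x_2 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i$
Triples Tensor M3
$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3] = \mathbb{E}[\mathbb{E}[x_1 \otimes x_2 \otimes x_3 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i \otimes a_i$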
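The moment identities for the single topic model can be checked by simulation. The sketch below assumes words are encoded as one-hot vectors (stored as integer indices), that the columns of A are word distributions, and that three words are drawn independently given the topic. All names and the toy parameters in the usage note are illustrative.

```python
import numpy as np

def sample_single_topic_docs(A, w, n_docs, rng):
    """Draw one topic per document, then three words i.i.d. from that topic's column of A."""
    d, k = A.shape
    topics = rng.choice(k, size=n_docs, p=w)
    words = np.empty((n_docs, 3), dtype=int)
    for j in range(k):
        idx = np.flatnonzero(topics == j)
        words[idx] = rng.choice(d, size=(idx.size, 3), p=A[:, j])
    return words

def empirical_word_moments(words, d):
    """Empirical E[x1 ⊗ x2] and E[x1 ⊗ x2 ⊗ x3] for one-hot encoded words."""
    n = words.shape[0]
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    np.add.at(M2, (words[:, 0], words[:, 1]), 1.0)
    np.add.at(M3, (words[:, 0], words[:, 1], words[:, 2]), 1.0)
    return M2 / n, M3 / n

def population_moments(A, w):
    """M2 = sum_i w_i a_i ⊗ a_i and M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i."""
    M2 = np.einsum('r,ir,jr->ij', w, A, A)
    M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)
    return M2, M3
```

For a toy model with a 4-word vocabulary and 2 topics, the empirical moments from a few hundred thousand simulated documents should agree with `population_moments(A, w)` up to sampling error.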
Moments under LDA
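$M_2 := \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\, \mathbb{E}[x_1] \otimes \mathbb{E}[x_1]$
$M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2}\, \mathbb{E}[x_1 \otimes x_2 \otimes \mathbb{E}[x_1]] -$ more stuff...
Then
$M_2 = \sum_i w_i\, a_i \otimes a_i$
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i.$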
Three words per document suffice for learning LDA.
Similar forms for HMM, ICA, sparse coding etc.
“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D.
Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.
Network Community Models
[Figure: network nodes annotated with mixed community membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), and (0.1, 0.8, 0.1); successive builds also show the values 0.9 and 0.1.]
Subgraph Counts as Graph Moments
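3-Star Count Tensor
$M_3(a, b, c) = \frac{1}{|X|}\, \#\,\text{of common neighbors in } X = \frac{1}{|X|} \sum_{x \in X} G(x, a)\, G(x, b)\, G(x, c).$
$M_3 = \frac{1}{|X|} \sum_{x \in X} \left[ G_{x,A}^\top \otimes G_{x,B}^\top \otimes G_{x,C}^\top \right]$
[Figure: 3-star with center x in X and leaves a in A, b in B, c in C.]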
“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A.
Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.
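The 3-star count tensor can be computed from an adjacency matrix with one `np.einsum` call. This sketch assumes a dense 0/1 adjacency matrix `G` and that the node partition into X, A, B, C is given as index lists; the function name is hypothetical.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """M3[a, b, c] = (1/|X|) * sum over x in X of G[x, a] * G[x, b] * G[x, c]."""
    GA = G[np.ix_(X, A)]    # edges from X to A, shape |X| x |A|
    GB = G[np.ix_(X, B)]
    GC = G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
```

Each entry is the fraction of nodes in X that are common neighbors of the triple (a, b, c). For large sparse graphs a dense tensor of this size would not be materialized; this form is only meant to mirror the formula.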
Computational Complexity (k ≪ n)
n = # of nodes
N = # of iterations
k = # of communities.
c = # of cores.
        Whiten            STGD          Unwhiten
Space   O(nk)             O(k^2)        O(nk)
Time    O(nsk/c + k^3)    O(Nk^3/c)     O(nsk/c)
Whiten: matrix/vector products and SVD.
STGD: Stochastic Tensor Gradient Descent
Unwhiten: matrix/vector products
Our approach: O(nsk/c + k^3)
Embarrassingly Parallel and fast!
Tensor Decomposition on GPUs
[Figure: running time (secs, log scale from 10^-1 to 10^4) versus number of communities k (log scale from 10^2 to 10^3) for MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]
Summary of Results
Facebook: users connected by friendship, n ∼ 20k.
Yelp: businesses and users connected by reviews, n ∼ 40k.
DBLP (sub): authors connected by coauthorship, n ∼ 1 million (∼ 100k).
Error (E) and Recovery ratio (R)
Dataset             k     Method        Running Time   E        R
Facebook (k=360)    500   ours          468            0.0175   100%
Facebook (k=360)    500   variational   86,808         0.0308   100%
Yelp (k=159)        100   ours          287            0.046    86%
Yelp (k=159)        100   variational   N.A.
DBLP sub (k=250)    500   ours          10,157         0.139    89%
DBLP sub (k=250)    500   variational   558,723        16.38    99%
DBLP (k=6000)       100   ours          5407           0.105    95%
Thanks to Prem Gopalan and David Mimno for providing variational code.
Experimental Results on Yelp
Lowest error business categories & largest weight businesses
Rank   Category         Business                   Stars   Review Counts
1      Latin American   Salvadoreno Restaurant     4.0     36
2      Gluten Free      P.F. Chang’s China Bistro  3.5     55
3      Hobby Shops      Make Meaning               4.5     14
4      Mass Media       KJZZ 91.5FM                4.0     13
5      Yoga             Sutra Midtown              4.5     31
Bridgeness: Distance from vector [1/k, . . . , 1/k]⊤
Top-5 bridging nodes (businesses)
Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
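The distance used in the bridgeness definition can be computed per node as below. The slide does not say which norm is used or how the distance is turned into a score, so the Euclidean norm here is an assumption, as is the function name.

```python
import numpy as np

def distance_from_uniform(memberships):
    """Euclidean distance of each row (a community membership vector) from [1/k, ..., 1/k]."""
    k = memberships.shape[1]
    return np.linalg.norm(memberships - 1.0 / k, axis=1)
```

A node whose membership is spread evenly over all k communities has distance 0, while a node belonging to a single community has the largest distance.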
Moment Tensors for Associative Models
Multivariate Moments: Many possibilities...
E[x⊗ y],E[x⊗ x⊗ y],E[ψ(x) ⊗ y] . . . .
Feature Transformations of the Input: x ↦ ψ(x)
How to exploit them?
Are moments E[ψ(x) ⊗ y] useful?
If ψ(x) is a matrix/tensor, we have matrix/tensor moments.
Can carry out spectral decomposition of the moments.
Score Function Features
Higher order score function: $S_m(x) := (-1)^m \frac{\nabla^{(m)} p(x)}{p(x)}$
∗ Can be a matrix or a tensor instead of a vector.
∗ Derivative w.r.t parameter or input
Form the cross-moments: E [y · Sm(x)].
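Extension of Stein’s lemma: $\mathbb{E}[y \cdot S_m(x)] = \mathbb{E}\left[\nabla^{(m)} G(x)\right]$ when $\mathbb{E}[y|x] := G(x)$
Spectral decomposition: $\mathbb{E}\left[\nabla^{(m)} G(x)\right] = \sum_{j \in [k]} u_j^{\otimes m}$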
Can be applied for learning of associative latent variable models.
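The first-order case (m = 1) of the identity can be checked by Monte Carlo. The sketch assumes a standard Gaussian input, for which S1(x) = −∇p(x)/p(x) = x, and a logistic model G(x) = σ(〈u, x〉), so that ∇G(x) = σ′(〈u, x〉) u. The parameter values are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, n = 3, 200_000
u = np.array([0.8, -0.5, 0.3])              # illustrative weight vector

x = rng.normal(size=(n, d))                 # x ~ N(0, I), so S_1(x) = x
g = sigmoid(x @ u)                          # G(x) = E[y|x]
y = (rng.random(n) < g).astype(float)       # binary labels with mean G(x)

lhs = (y[:, None] * x).mean(axis=0)                  # estimate of E[y * S_1(x)]
rhs = ((g * (1.0 - g))[:, None] * u).mean(axis=0)    # estimate of E[grad G(x)]
```

Both estimates are proportional to u, which is how the cross-moment exposes the model parameter; higher orders m yield matrices and tensors in place of this vector.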
Learning Deep Neural Networks
Realizable Setting E[y|x] = σd(Ad σd−1(Ad−1 σd−2(· · ·A2 σ1(A1x))))
$M_3 = \mathbb{E}[y \cdot S_3(x)] = \sum_{i \in [r]} \lambda_i \cdot u_i^{\otimes 3}$
where $u_i = e_i^\top A_1$ are rows of $A_1$.
Guaranteed learning of weights (layer-by-layer) via tensor decomposition.
Similar guarantees for learning mixture of classifiers
Automated Extraction of Discriminative Features
Conclusion: Guaranteed Non-Convex Optimization
Tensor Decomposition
Efficient sample and computational complexities
Better performance compared to EM, Variational Bayes etc.
In practice
Scalable and embarrassingly parallel: handle large datasets.
Efficient performance: perplexity or ground truth validation.
Related Topics
Overcomplete Tensor Decomposition: Neural networks, sparse coding and ICA models tend to be overcomplete (more neurons than input dimensions).
Provable Non-Convex Iterative Methods: Robust PCA, Dictionary learning etc.
My Research Group and Resources
Furong Huang Majid Janzamin Hanie Sedghi
Niranjan UN Forough Arabshahi
ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/MLSS.html