
Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings

Anima Anandkumar

U.C. Irvine

Learning with Big Data

Data vs. Information

Messy Data

Missing observations, gross corruptions, outliers.

High dimensional regime: as data grows, so does the number of variables!

Useful information: low-dimensional structures.

Learning with big data: ill-posed problem.







Learning is finding a needle in a haystack



Learning with big data: computationally challenging!

Principled approaches for finding low dimensional structures?

How to model information structures?

Latent variable models

Incorporate hidden or latent variables.

Information structures: Relationships between latent variables and observed data.





Basic Approach: mixtures/clusters

Hidden variable is categorical.







Advanced: Probabilistic models

Hidden variables have more general distributions.

Can model mixed membership/hierarchical groups.

[Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1, x2, x3, x4, x5.]

Latent Variable Models (LVMs)

Document modeling

Observed: words.

Hidden: topics.

Social Network Modeling

Observed: social interactions.

Hidden: communities, relationships.

Recommendation Systems

Observed: recommendations (e.g., reviews).

Hidden: User and business attributes

Unsupervised Learning: Learn LVM without labeled examples.

LVM for Feature Engineering

Learn good features/representations for classification tasks, e.g., computer vision and NLP.

Sparse Coding/Dictionary Learning

Sparse representations, low dimensional hidden structures.

A few dictionary elements make complicated shapes.

Associative Latent Variable Models

Supervised Learning

Given labeled examples {(xi, yi)}, learn a classifier y = f(x).




Associative/conditional models: p(y|x).

Example: Logistic regression: E[y|x] = σ(〈u, x〉).






Mixture of Logistic Regressions

E[y|x, h] = g(〈Uh, x〉 + 〈b, h〉)



Multi-layer/Deep Network

$E[y \mid x] = \sigma_d(A_d\, \sigma_{d-1}(A_{d-1}\, \sigma_{d-2}(\cdots A_2\, \sigma_1(A_1 x))))$

Challenges in Learning LVMs

Computational Challenges

Maximum likelihood is NP-hard in most scenarios.

Practice: Local search approaches such as back-propagation, EM, and Variational Bayes have no consistency guarantees.

Sample Complexity

Sample complexity is exponential (w.r.t. hidden variable dimension) for many learning methods.

Guaranteed and efficient learning through spectral methods

Outline

1 Introduction

2 Spectral Methods
    Classical Matrix Methods
    Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
    Topic Models
    Network Community Models
    Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion


Classical Spectral Methods: Matrix PCA and CCA

Unsupervised Setting: PCA

For centered samples $\{x_i\}$, find a projection $P$ with $\mathrm{Rank}(P) = k$ s.t.

$$\min_P \frac{1}{n} \sum_{i \in [n]} \|x_i - P x_i\|^2.$$

Result: Eigen-decomposition of S = Cov(X).
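The PCA result above can be illustrated with a minimal NumPy sketch. The function name `pca_projection` and the synthetic data are assumptions made for illustration; the sketch builds the rank-k projection from the top-k eigenvectors of the sample covariance and checks that the residual equals the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k projection P minimizing mean ||x_i - P x_i||^2 for centered rows of X."""
    n = X.shape[0]
    S = X.T @ X / n                       # sample covariance of centered data
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    U = eigvecs[:, -k:]                   # top-k eigenvectors
    return U @ U.T, eigvals

# Synthetic data (an assumption for illustration): 5 coordinates with decaying scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)                    # center the samples

P, eigvals = pca_projection(X, k=2)
# P is symmetric, so the rows of X @ P are the projected samples P x_i.
residual = np.mean(np.sum((X - X @ P) ** 2, axis=1))
print(residual, eigvals[:3].sum())        # the two numbers should agree
```

The residual matches the sum of the three smallest eigenvalues because the objective equals the trace of (I − P)S.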

Supervised Setting: CCA

For centered samples $\{x_i, y_i\}$, find

$$\max_{a,b} \frac{a^\top E[x y^\top]\, b}{\sqrt{a^\top E[x x^\top]\, a \cdot b^\top E[y y^\top]\, b}}.$$

Result: Generalized eigen decomposition.
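One common way to solve this generalized eigenproblem is to whiten both covariances and take an SVD of the whitened cross-covariance. The sketch below assumes that route. The helper names `inv_sqrt` and `top_canonical_pair` and the synthetic data are hypothetical, not from the slides.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def top_canonical_pair(X, Y):
    """Directions (a, b) maximizing the CCA objective, plus the top correlation."""
    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)   # whitened cross-covariance
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]

# Synthetic data (assumption): x and y share one latent signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 1))
X = np.hstack([z, rng.normal(size=(1000, 2))]) + 0.1 * rng.normal(size=(1000, 3))
Y = np.hstack([rng.normal(size=(1000, 1)), z]) + 0.1 * rng.normal(size=(1000, 2))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)

a, b, rho = top_canonical_pair(X, Y)
# Evaluate the CCA objective directly at (a, b); it should equal rho.
corr = (a @ X.T @ Y @ b) / np.sqrt((a @ X.T @ X @ a) * (b @ Y.T @ Y @ b))
```

The top singular value `rho` equals the objective at the returned pair, which is the sample correlation between 〈a, x〉 and 〈b, y〉.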

[Figure: x and y are projected onto 〈a, x〉 and 〈b, y〉.]

Shortcomings of Matrix Methods

Learning through Spectral Clustering

Dimension reduction through PCA (on data matrix)

Clustering on projected vectors (e.g. k-means).





Basic method works only for single memberships.

Failure to cluster under small separation.






Efficient Learning Without Separation Constraints?


Beyond SVD: Spectral Methods on Tensors

How to learn the mixture models without separation constraints?

◮ PCA uses covariance matrix of data. Are higher order moments helpful?

Unified framework?

◮ Moment-based estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.

◮ What are the analogues for tensors?

Moment Matrices and Tensors

Multivariate Moments in Unsupervised Setting

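$M_1 := E[x]$, $M_2 := E[x \otimes x]$, $M_3 := E[x \otimes x \otimes x]$.

Matrix

$E[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor.

$E[x \otimes x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]$.

For matrices: $E[x \otimes x] = E[x x^\top]$.

Tensor

$E[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor.

$E[x \otimes x \otimes x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}]$.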


Multivariate Moments in Supervised Setting

$M_1 := E[x],\ E[y]$, $M_2 := E[x \otimes y]$, $M_3 := E[x \otimes x \otimes y]$.
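The moment definitions above translate directly into sample averages of outer products. The following is a minimal sketch using `numpy.einsum`. The random data, the scalar label `y`, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))          # rows are samples x_i (placeholder data)
y = rng.normal(size=n)               # scalar labels, assumed for the supervised moments

# Unsupervised moments
M1 = X.mean(axis=0)                                  # E[x], shape (d,)
M2 = np.einsum('ni,nj->ij', X, X) / n                # E[x ⊗ x], shape (d, d)
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n         # E[x ⊗ x ⊗ x], shape (d, d, d)

# Supervised cross-moments (y is scalar here, so each loses one mode)
M2_xy = np.einsum('ni,n->i', X, y) / n               # E[x ⊗ y]
M3_xxy = np.einsum('ni,nj,n->ij', X, X, y) / n       # E[x ⊗ x ⊗ y]
```

Entry (i1, i2, i3) of `M3` is the sample average of the product of coordinates i1, i2, i3, matching the entrywise definition on the slide.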

Spectral Decomposition of Tensors

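$$M_2 = \sum_i \lambda_i\, u_i \otimes v_i$$

[Figure: matrix $M_2$ drawn as the sum of rank-1 terms $\lambda_1 u_1 \otimes v_1 + \lambda_2 u_2 \otimes v_2 + \cdots$]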


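$$M_3 = \sum_i \lambda_i\, u_i \otimes v_i \otimes w_i$$

[Figure: tensor $M_3$ drawn as the sum of rank-1 terms $\lambda_1 u_1 \otimes v_1 \otimes w_1 + \lambda_2 u_2 \otimes v_2 \otimes w_2 + \cdots$]

$u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1, i_2, i_3)$th entry is $u_{i_1} v_{i_2} w_{i_3}$.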

How to solve this non-convex problem?

Decomposition of Orthogonal Tensors

$$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i.$$

Suppose A has orthogonal columns.

$$M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1.$$

$a_i$ are eigenvectors of tensor $M_3$.

Analogous to matrix eigenvectors: $Mv = M(I, v) = \lambda v$.
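The eigenvector identity M3(I, a1, a1) = w1 a1 can be checked numerically. This sketch assumes the columns of A are orthonormal (unit norm as well as orthogonal), which is what makes the right-hand side exactly w1 a1. The helper `contract` and the chosen weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
A, _ = np.linalg.qr(rng.normal(size=(d, k)))   # columns a_i are orthonormal
w = np.array([0.5, 0.3, 0.2])                  # illustrative weights

# M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

def contract(T, u, v):
    """Multilinear map T(I, u, v): contract the last two modes of T with u and v."""
    return np.einsum('abc,b,c->a', T, u, v)

a1 = A[:, 0]
lhs = contract(M3, a1, a1)                     # should equal w_1 * a_1
print(np.allclose(lhs, w[0] * a1))
```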

Two Problems

How to find eigenvectors of a tensor?

A is not orthogonal in general.

Orthogonal Tensor Power Method

Symmetric orthogonal tensor $T \in \mathbb{R}^{d \times d \times d}$:

$$T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i.$$

Recall matrix power method: $v \mapsto \dfrac{M(I, v)}{\|M(I, v)\|}$.

Algorithm: tensor power method: $v \mapsto \dfrac{T(I, v, v)}{\|T(I, v, v)\|}$.

How do we avoid spurious solutions (not part of decomposition)?

• $\{v_i\}$'s are the only robust fixed points.

• All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
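The tensor power update above can be sketched in NumPy as follows. The random restarts, the deflation step, the iteration counts, and the function name `tensor_power_method` are assumptions added to make a complete routine. They are one common way to extract all k components, not necessarily the exact procedure behind the slides.

```python
import numpy as np

def tensor_power_method(T, k, n_iter=100, n_restarts=10, seed=0):
    """Extract k (eigenvalue, eigenvector) pairs from a symmetric orthogonal tensor."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    d = T.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                u = np.einsum('abc,b,c->a', T, v, v)    # T(I, v, v)
                v = u / np.linalg.norm(u)               # normalize
            lam = np.einsum('abc,a,b,c->', T, v, v, v)  # eigenvalue T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        # Deflation: subtract the recovered rank-1 term before the next round.
        T -= best_lam * np.einsum('a,b,c->abc', best_v, best_v, best_v)
    return np.array(lams), np.array(vecs).T

# Synthetic orthogonal tensor with known components (illustrative values).
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.normal(size=(5, 3)))
lam_true = np.array([3.0, 2.0, 1.0])
T = np.einsum('i,ai,bi,ci->abc', lam_true, V, V, V)

lams, vecs = tensor_power_method(T, k=3)
T_hat = np.einsum('i,ai,bi,ci->abc', lams, vecs, vecs, vecs)
```

The components may come back in any order, so the natural check is that the recovered rank-1 terms sum back to `T`.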

Whitening: Conversion to Orthogonal Tensor

$$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i, \qquad M_2 = \sum_i w_i\, a_i \otimes a_i.$$

Find whitening matrix W s.t. W⊤A = V is an orthogonal matrix.

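When $A \in \mathbb{R}^{d \times k}$ has full column rank, it is an invertible transformation.

[Figure: $W$ maps the non-orthogonal components $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]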

Use pairwise moments M2 to find W .

SVD of M2 is needed.
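A whitening matrix can be computed from the top-k eigenpairs of M2 (for a symmetric positive semidefinite matrix this coincides with its SVD). In the sketch below, the name `whitening_matrix` and the synthetic A and w are assumptions. Note that with weights w_i present, it is the scaled vectors sqrt(w_i) W⊤a_i that come out orthonormal.

```python
import numpy as np

def whitening_matrix(M2, k):
    """W (d x k) with W^T M2 W = I_k, built from the top-k eigenpairs of M2."""
    vals, vecs = np.linalg.eigh(M2)           # ascending eigenvalues
    U, s = vecs[:, -k:], vals[-k:]
    return U / np.sqrt(s)                     # scale column j by s_j^{-1/2}

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.normal(size=(d, k))                   # full column rank, not orthogonal
w = np.array([0.5, 0.3, 0.2])                 # illustrative weights
M2 = (A * w) @ A.T                            # sum_i w_i a_i a_i^T

W = whitening_matrix(M2, k)
V = (W.T @ A) * np.sqrt(w)                    # columns sqrt(w_i) W^T a_i
print(np.allclose(W.T @ M2 @ W, np.eye(k)), np.allclose(V.T @ V, np.eye(k)))
```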

Putting it together

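Non-orthogonal tensor $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.

Whitening matrix $W$, found from $M_2$.

Multilinear transform: $T = M_3(W, W, W)$

[Figure: $W$ maps the components $a_1, a_2, a_3$ of tensor $M_3$ to the orthogonal components $v_1, v_2, v_3$ of tensor $T$.]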


Tensor Decomposition: Guaranteed Non-Convex Optimization!
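The three steps (whiten with M2, decompose T = M3(W, W, W) by power iterations, then undo the whitening) can be sketched end to end. This is a minimal sketch on exact population moments. The function names, restart and iteration counts, and the un-whitening formulas w_i = 1/λ_i² and a_i = λ_i (W⊤)⁺ v_i are assumptions that follow from W⊤ M2 W = I, not details stated on the slides.

```python
import numpy as np

def power_deflate(T, k, n_iter=100, n_restarts=10, seed=0):
    """Tensor power iterations with deflation on a symmetric orthogonal tensor."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    lams, vecs = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=T.shape[0])
            for _ in range(n_iter):
                v = np.einsum('abc,b,c->a', T, v, v)    # T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('abc,a,b,c->', T, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam)
        vecs.append(best_v)
        T -= best_lam * np.einsum('a,b,c->abc', best_v, best_v, best_v)
    return np.array(lams), np.array(vecs).T

def decompose(M2, M3, k):
    """Recover weights w_i and components a_i from M2 and M3."""
    vals, vecs = np.linalg.eigh(M2)
    W = vecs[:, -k:] / np.sqrt(vals[-k:])               # whitening: W^T M2 W = I
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)     # T = M3(W, W, W)
    lams, V = power_deflate(T, k)                       # lam_i = w_i^{-1/2}
    w_hat = 1.0 / lams ** 2
    A_hat = (np.linalg.pinv(W.T) @ V) * lams            # un-whiten the eigenvectors
    return w_hat, A_hat

# Synthetic ground truth (illustrative values).
rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.normal(size=(d, k))
w = np.array([0.5, 0.3, 0.2])
M2 = np.einsum('i,ai,bi->ab', w, A, A)
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

w_hat, A_hat = decompose(M2, M3, k)
order = np.argsort(-w_hat)                              # components come back unordered
```

With empirical moments in place of exact ones, recovery is only approximate.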


For what latent variable models can we obtain M2 and M3 forms?


Types of Latent Variable Models

What is the form of hidden variables h?


Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions.

Can model mixed memberships, e.g. Dirichlet distribution.



Topic Modeling

Geometric Picture for Topic Models

Topic proportions vector (h)

[Figure: a document and its topic proportions vector h.]

Single topic (h)

Word generation ($x_1, x_2, \ldots$)

[Figure: a single topic h generates the words $x_1, x_2, x_3$, each through $A$.]

Linear model: $E[x_i \mid h] = Ah$.

Moments for Single Topic Models

$E[x_i \mid h] = Ah$.

$w := E[h]$.

Learn topic-word matrix $A$, vector $w$.

[Figure: hidden topic $h$ with observed words $x_1, \ldots, x_5$, each generated through $A$.]


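Pairwise Co-occurrence Matrix $M_2$

$$M_2 := E[x_1 \otimes x_2] = E[E[x_1 \otimes x_2 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i$$

Triples Tensor $M_3$

$$M_3 := E[x_1 \otimes x_2 \otimes x_3] = E[E[x_1 \otimes x_2 \otimes x_3 \mid h]] = \sum_{i=1}^{k} w_i\, a_i \otimes a_i \otimes a_i$$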
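The step from the conditional expectation to the weighted sum can be spelled out. The derivation below assumes that h is a one-hot vector, h ∈ {e_1, ..., e_k} with Pr[h = e_i] = w_i, and that words are conditionally independent given h. These are the standard single-topic assumptions but are only implicit in the slide.

```latex
\begin{align*}
\mathbb{E}[x_1 \otimes x_2 \mid h = e_i]
  &= \mathbb{E}[x_1 \mid h = e_i] \otimes \mathbb{E}[x_2 \mid h = e_i]
   = A e_i \otimes A e_i
   = a_i \otimes a_i, \\
M_2 = \mathbb{E}\big[\mathbb{E}[x_1 \otimes x_2 \mid h]\big]
  &= \sum_{i=1}^{k} \Pr[h = e_i]\; a_i \otimes a_i
   = \sum_{i=1}^{k} w_i\, a_i \otimes a_i .
\end{align*}
```

The same argument with three conditionally independent words gives the form of the triples tensor.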

Moments under LDA

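$$M_2 := E[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\, E[x_1] \otimes E[x_1]$$

$$M_3 := E[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0 + 2}\, E[x_1 \otimes x_2 \otimes E[x_1]] - \text{more stuff...}$$

Then

$$M_2 = \sum_i w_i\, a_i \otimes a_i, \qquad M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i.$$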

Three words per document suffice for learning LDA.

Similar forms for HMM, ICA, sparse coding etc.

“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.


Network Community Models


[Figure: network nodes labeled with community membership vectors (0.4, 0.3, 0.3), (0.7, 0.2, 0.1) and (0.1, 0.8, 0.1); successive builds annotate edges with the values 0.9 and 0.1.]

Subgraph Counts as Graph Moments

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.





3-Star Count Tensor

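$$M_3(a, b, c) = \frac{1}{|X|}\, \#\text{ of common neighbors in } X = \frac{1}{|X|} \sum_{x \in X} G(x, a)\, G(x, b)\, G(x, c).$$

$$M_3 = \frac{1}{|X|} \sum_{x \in X} \left[ G_{x,A}^\top \otimes G_{x,B}^\top \otimes G_{x,C}^\top \right]$$

[Figure: a 3-star with center $x \in X$ and leaves $a \in A$, $b \in B$, $c \in C$.]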
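The 3-star count tensor can be computed from an adjacency matrix with a single contraction over the nodes in X. The random graph and the partition of the nodes into X, A, B, C below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
G = np.triu(rng.integers(0, 2, size=(n, n)), 1)
G = G + G.T                                   # symmetric 0/1 adjacency, no self-loops

# Hypothetical partition of the nodes into X, A, B, C (three nodes each).
X, A, B, C = np.arange(0, 3), np.arange(3, 6), np.arange(6, 9), np.arange(9, 12)

# Adjacency sub-blocks: rows indexed by X, columns by A, B, C respectively.
G_XA, G_XB, G_XC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]

# M3(a, b, c) = (1/|X|) * sum_{x in X} G(x, a) G(x, b) G(x, c)
M3 = np.einsum('xa,xb,xc->abc', G_XA, G_XB, G_XC) / len(X)
```

Each entry of `M3` is the fraction of nodes in X that are common neighbors of the corresponding triple (a, b, c).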




Computational Complexity (k ≪ n)

n = # of nodes

N = # of iterations

k = # of communities.

c = # of cores.

         Whiten            STGD          Unwhiten
Space    O(nk)             O(k^2)        O(nk)
Time     O(nsk/c + k^3)    O(Nk^3/c)     O(nsk/c)

Whiten: matrix/vector products and SVD.

STGD: Stochastic Tensor Gradient Descent

Unwhiten: matrix/vector products

Our approach: O(nsk/c + k^3)

Embarrassingly Parallel and fast!

Tensor Decomposition on GPUs

[Figure: running time (secs) versus number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU) and Eigen Sparse (CPU).]

Summary of Results

Facebook: users linked by friendship, n ∼ 20k

Yelp: users and businesses linked by reviews, n ∼ 40k

DBLP (sub): authors linked by coauthorship, n ∼ 1 million (∼ 100k)

Error (E) and Recovery ratio (R)

Dataset            k     Method        Running Time   E        R
Facebook(k=360)    500   ours          468            0.0175   100%
Facebook(k=360)    500   variational   86,808         0.0308   100%
Yelp(k=159)        100   ours          287            0.046    86%
Yelp(k=159)        100   variational   N.A.
DBLP sub(k=250)    500   ours          10,157         0.139    89%
DBLP sub(k=250)    500   variational   558,723        16.38    99%
DBLP(k=6000)       100   ours          5407           0.105    95%

Thanks to Prem Gopalan and David Mimno for providing variational code.

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank   Category         Business                     Stars   Review Counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang’s China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31


Bridgeness: Distance from vector [1/k, . . . , 1/k]⊤

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe


Moment Tensors for Associative Models

Multivariate Moments: Many possibilities...

$E[x \otimes y]$, $E[x \otimes x \otimes y]$, $E[\psi(x) \otimes y]$, ...

Feature Transformations of the Input: $x \mapsto \psi(x)$

How to exploit them?

Are moments E[ψ(x) ⊗ y] useful?

If ψ(x) is a matrix/tensor, we have matrix/tensor moments.

Can carry out spectral decomposition of the moments.

Score Function Features

Higher order score function: $S_m(x) := (-1)^m \dfrac{\nabla^{(m)} p(x)}{p(x)}$

∗ Can be a matrix or a tensor instead of a vector.

∗ Derivative w.r.t parameter or input

Form the cross-moments: E [y · Sm(x)].

Extension of Stein’s lemma: $E[y \cdot S_m(x)] = E[\nabla^{(m)} G(x)]$ when $E[y \mid x] := G(x)$

Spectral decomposition: $E[\nabla^{(m)} G(x)] = \sum_{j \in [k]} u_j^{\otimes m}$

Can be applied for learning of associative latent variable models.
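The extended Stein identity can be sanity-checked in one dimension. The sketch below assumes a standard normal input, for which the third-order score function is S3(x) = x³ − 3x, and an illustrative noiseless label y = G(x) = sin(x). Expectations are computed with Gauss-Hermite quadrature via `numpy.polynomial.hermite_e.hermegauss`. These choices are assumptions for the example only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for the weight exp(-x^2 / 2); rescale so weights sum to 1,
# which turns weighted sums into expectations under N(0, 1).
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2 * np.pi)

def expect(f):
    """E[f(x)] for x ~ N(0, 1), by Gauss-Hermite quadrature."""
    return np.sum(weights * f(nodes))

S3 = lambda x: x**3 - 3*x          # (-1)^3 p'''(x) / p(x) for the standard normal density
G = np.sin                         # illustrative regression function, y = G(x)
d3G = lambda x: -np.cos(x)         # third derivative of sin

lhs = expect(lambda x: G(x) * S3(x))   # E[y * S_3(x)]
rhs = expect(d3G)                      # E[G'''(x)]
print(lhs, rhs)                        # both should be close to -exp(-1/2)
```

Both sides equal −e^{−1/2}, since E[cos x] = e^{−1/2} for a standard normal x.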

Learning Deep Neural Networks

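Realizable Setting: $E[y \mid x] = \sigma_d(A_d\, \sigma_{d-1}(A_{d-1}\, \sigma_{d-2}(\cdots A_2\, \sigma_1(A_1 x))))$

$$M_3 = E[y \cdot S_3(x)] = \sum_{i \in [r]} \lambda_i \cdot u_i^{\otimes 3}$$

where $u_i = e_i^\top A_1$ are rows of $A_1$.

Guaranteed learning of weights (layer-by-layer) via tensor decomposition.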

Similar guarantees for learning mixture of classifiers

Automated Extraction of Discriminative Features


Conclusion: Guaranteed Non-Convex Optimization

Tensor Decomposition

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes etc.

In practice

Scalable and embarrassingly parallel: handle large datasets.

Efficient performance: perplexity or ground truth validation.

Related Topics

Overcomplete Tensor Decomposition: Neural networks, sparse coding and ICA models tend to be overcomplete (more neurons than input dimensions).

Provable Non-Convex Iterative Methods: Robust PCA, Dictionary learning etc.

My Research Group and Resources

Furong Huang Majid Janzamin Hanie Sedghi

Niranjan UN Forough Arabshahi

ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/MLSS.html