
Maximum Margin Matrix Factorization

Nati Srebro, Toyota Technological Institute at Chicago

Joint work with

Alex d'Aspremont, Princeton
Jason Rennie, MIT
Ben Marlin, Toronto
Tommi Jaakkola, MIT
Adi Schraibman, Hebrew U
Noga Alon, Tel-Aviv
Michael Fink, Hebrew U
Yonatan Amit, Hebrew U
Shimon Ullman, Weizmann

Collaborative Prediction

Based on a partially observed matrix ⇒ predict the unobserved entries: "Will user i like movie j?"

[Figure: a users × movies rating matrix with observed ratings (1–5) and unobserved entries marked "?"]

Linear Factor Model

A specific user's ratings are modeled as a weighted combination of movie attribute factors:

  u1 × v1 + u2 × v2 + u3 × v3 ≈ (2 4 5 1 4 2)   (the user's ratings across movies)

Each vi is a vector scoring all movies on one attribute (e.g., comic value, dramatic value, violence), and the weights ui (e.g., +1, +2, −1) are the preferences/characteristics of that specific user. In matrix form, u × V' gives the user's rating row; stacking all users gives U × V' ≈ Y (users × movies).

Matrix Factorization

  U × V' ≈ Y,   with X = UV' of rank k

Unconstrained: Low-Rank Approximation
• Additive Gaussian noise: minimize |Y − UV'|Fro
• General additive noise
• General conditional models
  – Multiplicative noise, Exponential-PCA [Collins+01], Multinomial (pLSA [Hofmann01]), etc.
• General loss functions
  – Hinge loss, loss functions appropriate for ratings, etc. [Gordon03]

With unconstrained U, V, fully observed Y and squared loss ⇒ use SVD (see the sketch below); the more general variants are non-convex, with no explicit solution.
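A minimal numerical sketch of the fully observed, squared-loss case (not from the talk; the toy matrix Y and the choice k = 2 are illustrative): the truncated SVD gives the best rank-k approximation under Frobenius loss.

```python
import numpy as np

# Toy, fully observed ratings matrix (values 1-5); illustrative only.
rng = np.random.default_rng(0)
Y = rng.integers(1, 6, size=(8, 6)).astype(float)
k = 2  # number of factors (assumed for the sketch)

# Best rank-k approximation under Frobenius (squared) loss via truncated SVD.
Uf, s, Vt = np.linalg.svd(Y, full_matrices=False)
U = Uf[:, :k] * s[:k]   # n x k user factors (singular values folded into U)
V = Vt[:k, :].T         # m x k movie factors
X = U @ V.T             # rank-k approximation of Y

print("rank:", np.linalg.matrix_rank(X), " Frobenius error:", np.linalg.norm(Y - X))
```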

Matrix Factorization

Other constraints on the factorization:
• Non-negativity [LeeSeung99]
• Stochasticity (convexity) [LeeSeung97] [Hofmann01]
• Sparsity
  – Clustering as an extreme case (when the rows of U are sparse)

The overall number of factors is still constrained, and these are non-convex optimization problems.

Outline

• Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices
  – Unbounded number of factors
  – Convex!
• Learning MMMF: Semidefinite Programming
• Generalization error bounds
  – Learning with low-rank (finite factor) matrices
  – MMMF (low max-norm / trace-norm)
• Representational power of low-rank, max-norm and trace-norm matrices

Collaborative Prediction with Matrix Factorization

Fit a factorizable (low-rank) matrix X = UV' to the observed entries:

  minimize Σ loss(Xij; Yij)   over observed (i,j)

then use X to predict the unobserved entries (see the sketch below).

[Figure: a partially observed rating matrix (observation) and its ±1 version (prediction), approximated by U × V']

[Sarwar+00] [Azar+01] [Hofmann04] [Marlin+04]
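A minimal sketch of fitting X = UV' to the observed entries only (a generic masked squared-loss fit by gradient descent, not the talk's method; the sizes, rank k, learning rate and toy data are all placeholders). With missing entries the problem is non-convex and has no closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 30, 20, 3
Y = rng.integers(1, 6, size=(n, m)).astype(float)   # toy ratings
mask = rng.random((n, m)) < 0.3                      # True where Y_ij is observed

U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((m, k))
lr = 0.01
for _ in range(2000):
    R = mask * (U @ V.T - Y)        # residuals on observed entries only
    gU, gV = R @ V, R.T @ U         # gradients of 1/2 * sum of squared residuals
    U -= lr * gU
    V -= lr * gV

X = U @ V.T                         # use X to predict the unobserved entries
print("observed-entry RMSE:", np.sqrt((mask * (X - Y) ** 2).sum() / mask.sum()))
```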

Collaborative Prediction with Matrix Factorization

[Figure: the ±1 label matrix Y written as U × V', with an example 3-dimensional factor vector for each user (rows of U) and for each movie (columns v1 … v12 of V')]

When U is fixed, each row is a linear classification problem:
• rows of U are feature vectors
• columns of V are linear classifiers

Fitting U and V: learning features that work well across all of the classification problems.

Geometric Interpretation:
Co-embedding Points and Separating Hyperplanes

[Figure: rows of U (users) embedded as points and columns of V (movies) as separating hyperplanes; each observed ±1 entry of Y specifies on which side of movie j's hyperplane user i's point should fall]

Max-Margin Matrix Factorization:
Bound norms of U, V instead of their dimensionality

[Figure: rows of U (users) and columns of V (movies) co-embedded, both with low norm]

For observed Yij ∈ ±1:   Yij Xij ≥ Margin,   where Xij = ‹Ui, Vj›

Bound norms uniformly:   (maxi |Ui|²)(maxj |Vj|²) ≤ 1

Max-Margin Matrix Factorization:
Bound norms of U, V instead of their dimensionality

When U is fixed, each column of V is an SVM.

For observed Yij ∈ ±1:   Yij Xij ≥ Margin,   Xij = ‹Ui, Vj›

Bound norms uniformly:   (maxi |Ui|²)(maxj |Vj|²) ≤ R²
  max-norm:   |X|max = min over X=UV' of (maxi |Ui|)(maxj |Vj|)

Bound norms on average:   (∑i |Ui|²)(∑j |Vj|²) ≤ R²
  trace-norm:   |X|tr = min over X=UV' of |U|Fro |V|Fro

Max-Margin Matrix Factorization:
Bound norms of U, V instead of their dimensionality

Bound norms uniformly:   (maxi |Ui|²)(maxj |Vj|²) ≤ R²
  max-norm:   |X|max = min over X=UV' of (maxi |Ui|)(maxj |Vj|)

Bound norms on average:   (∑i |Ui|²)(∑j |Vj|²) ≤ R²
  trace-norm:   |X|tr = min over X=UV' of |U|Fro |V|Fro

Unlike rank(X) ≤ k, these are convex constraints!
For example, α·U1V1' + (1−α)·U2V2' is realized by the concatenated factors U = [√α U1, √(1−α) U2] and V = [√α V1, √(1−α) V2].

Max-Margin Matrix Factorization:
Convex Combination of Factors

Bound norms on average:   (∑i |Ui|²)(∑j |Vj|²) ≤ 1,   |X|tr = ∑ (singular values of X)

Equivalently, X = ∑ si ui vi' with ∑ si ≤ 1, where each ui vi' is an outer product of norm-1 vectors (a rank-1, norm-1 matrix): X is a convex combination of such factors. (A small numerical check of |X|tr follows below.)
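A minimal numerical check (not from the talk) that the trace norm, the sum of singular values, is attained as |U|Fro·|V|Fro by the balanced SVD factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))

Uf, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()                 # |X|_tr = sum of singular values

U = Uf * np.sqrt(s)                  # n x r, columns scaled by sqrt(singular values)
V = Vt.T * np.sqrt(s)                # m x r
assert np.allclose(U @ V.T, X)       # this (U, V) factorizes X

print("sum of singular values :", trace_norm)
print("|U|_Fro * |V|_Fro      :", np.linalg.norm(U) * np.linalg.norm(V))
```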

Finding Max-Margin Matrix Factorizations

  minimize (∑i |Ui|²)·(∑j |Vj|²)   s.t.   Yij Xij ≥ 1,   X = UV'

  |X|tr = ∑ (singular values of X)
        = min over A, B of ½( tr(A) + tr(B) )   s.t.   [A X; X' B] p.s.d.
  (indeed, for A = UU', B = VV' the block matrix [A X; X' B] = [U; V]·[U' V'] is p.s.d.)

So MMMF is a semidefinite program:

  minimize ½( tr(A) + tr(B) )   s.t.   Yij Xij ≥ 1 for observed (i,j),   [A X; X' B] p.s.d.

[Fazel Hindi Boyd 2001]
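A minimal sketch of this SDP in cvxpy (assuming cvxpy with an SDP-capable solver such as SCS is available; the toy matrix Y is a placeholder). A single PSD variable Z plays the role of the block matrix [A X; X' B]:

```python
import numpy as np
import cvxpy as cp

# Toy partially observed +/-1 matrix; 0 marks an unobserved entry (placeholder data).
Y = np.array([[ 1, -1,  0,  1],
              [ 0,  1, -1, -1],
              [-1,  0,  1,  1]])
n, m = Y.shape
obs = [(int(i), int(j)) for i, j in np.argwhere(Y != 0)]

# Z = [A X; X' B] must be p.s.d.; then 1/2 * tr(Z) = 1/2 * (tr(A) + tr(B)).
Z = cp.Variable((n + m, n + m), PSD=True)
X = Z[:n, n:]

constraints = [Y[i, j] * X[i, j] >= 1 for i, j in obs]   # hard-margin constraints
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(Z)), constraints)
prob.solve(solver=cp.SCS)

print("1/2 (tr A + tr B) =", 0.5 * np.trace(Z.value))
print("predicted signs:\n", np.sign(X.value))
```

The soft-margin version in the following slides adds slack variables ξij with weight c to the same program.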

Finding Max-Margin Matrix Factorizations

Primal (SDP):
  minimize ½( tr(A) + tr(B) ) + c ∑ ξij   s.t.   Yij Xij ≥ 1 − ξij,   [A X; X' B] p.s.d.
  (that is, minimize |X|tr + c·err(X); dropping the slack terms gives the hard-margin version)

Dual: one variable Qij for each observed (i,j):
  maximize ∑ Qij   s.t.   0 ≤ Qij ≤ c,   ||Q ⊗ Y||2 ≤ 1
where ⊗ is the sparse elementwise product (zero for unobserved entries).

Sparse observations (constraints) ⇒ dense primal, sparse dual.

Recovering the primal from the dual:   X* = U D V'  ⇐  Q* = U S V'
(D can be recovered by solving a small LP.)

[Figure: the ±1 matrix Y and the dual matrix Q, which is nonzero only on the observed entries]

Fast Optimization of the Dual

  minimize |X|tr + c ∑ (1 − Yij Xij)+        [ the sum is err(X) ]
    dual:  maximize ∑ Qij   s.t.   0 ≤ Qij ≤ c,   ||Q ⊗ Y||2 ≤ 1

Equivalently, maximize 1 − b·err(X̃) over |X̃|tr ≤ 1, i.e. (with X = X̃/v):
  maximize v − b ∑ (v − Yij X̃ij)+   s.t.   |X̃|tr ≤ 1
    dual:  minimize ||Q ⊗ Y||2   s.t.   0 ≤ Qij ≤ b,   ∑ Qij ≤ 1

This is a saddle problem, min max u'(Q⊗Y)u: use Nesterov's smoothing for saddle problems; the subgradient (and the gradient of the smoothed problem) is given by an SVD.

[Figure: the attainable "front" of extreme (err(X)/#obs, |X|tr) values, a.k.a. the regularization path, with |X|tr from 0 to 500 on the horizontal axis and err(X)/#obs from 0 to 1 on the vertical axis]

Finding Max-Margin Matrix Factorizations

Runtime on a 3.02 GHz Xeon, by matrix size and fraction of observed entries (the largest problems have over 1.5 million observations):

  size         1% observed    10% observed    50% observed
  100×100      2 sec          3 sec           10 sec
  178×178      2 sec          18 sec          35 sec
  316×316      19 sec         2:34 min        2:41 min
  562×562      3:27 min       3:37 min        19:11 min
  1000×1000    34:35 min      41:15 min       1:35:28 hours
  1778×1778    5:44:07        6:40:06         19:09:49
  3162×3162    57:23:09       67:35:34        62:12:21

Prediction Performance

Normalized Mean Absolute Error of MMMF versus the top methods from [Marlin04]'s comprehensive comparison: URP (Aspect/LDA-like) and "Attitude" (softmax-based).

• EachMovie: 74,424 × 1,648, 2.6M observations (ratings 1–6)
• MovieLens: 6,040 × 3,952, 1M observations (ratings 1–5)

[Figure: bar chart of Normalized Mean Absolute Error (roughly 0.40–0.46) for URP, "Attitude" and MMMF on both datasets; slide annotation: "5 million observations"]

MMMF Optimization: Other Options

  minimize |X|tr + loss(X)

• Gradient descent on a smoothed primal: |X|tr = ∑i |λi(X)|, with |·| smoothed;
  ∂|X|tr/∂X = U (∂|S|smooth/∂S) V'
  [Figure: a smoothed version of the absolute value as a function of the singular value, on [−1, 1]]

• Gradient descent on the unconstrained factored objective |U|Fro + |V|Fro + loss(UV')
  – Non-convex
  – If the dimensionality of U, V is large enough, a global optimum is still guaranteed [Burer Monteiro 2004] (see the sketch below)
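A minimal sketch of the factored approach (not the talk's implementation; the sizes, constant c, learning rate and toy data are placeholders). It uses the standard squared-Frobenius form of the bound, ½(|U|²Fro + |V|²Fro) ≥ |X|tr, together with a hinge loss on the observed entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, c, lr = 30, 20, 10, 1.0, 0.01
Y = np.sign(rng.standard_normal((n, m)))     # toy +/-1 labels
mask = rng.random((n, m)) < 0.3              # True where Y_ij is observed

U = 0.1 * rng.standard_normal((n, k))
V = 0.1 * rng.standard_normal((m, k))
for _ in range(5000):
    X = U @ V.T
    violated = mask & (Y * X < 1)                 # entries with active hinge loss
    G = -c * np.where(violated, Y, 0.0)           # subgradient of hinge term w.r.t. X
    # subgradient step on 1/2*(|U|_Fro^2 + |V|_Fro^2) + c * hinge(UV')
    U, V = U - lr * (U + G @ V), V - lr * (V + G.T @ U)

print("training sign accuracy:", (np.sign(U @ V.T)[mask] == Y[mask]).mean())
```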

Outline

• Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices
  – Unbounded number of factors
  – Convex!
• Learning MMMF: Semidefinite Programming
• Generalization error bounds
  – Learning with low-rank (finite factor) matrices
  – MMMF (low max-norm / trace-norm)
• Representational power of low-rank, max-norm and trace-norm matrices

Generalization Error Bounds

[Figure: the hypothesis matrix X, the unknown label matrix Y, and the training set S of observed entries]

Y is the unknown, assumption-free source; S is a random training set of observed entries; X ∈ 𝒳 is the hypothesis.

  ∀Y   PrS ( ∀X∈𝒳   D(X;Y) < DS(X;Y) + ε )  >  1−δ

  D(X;Y)  = |{ ij | Xij Yij ≤ 0 }| / nm          generalization error
  DS(X;Y) = |{ ij∈S | Xij Yij ≤ 0 }| / |S|       empirical error

Prior work assumed a low-rank structure (eigengap): asymptotic behavior [Azar+01]; sample complexity and query strategy [Drineas+02].

Generalization Error Bounds:
Low Trace-Norm, Max-Norm or Rank

  ∀Y   PrS ( ∀X∈𝒳   D(X;Y) < DS(X;Y) + ε )  >  1−δ

  D(X;Y)  = |{ ij | Xij Yij ≤ 0 }| / nm          generalization error
  DS(X;Y) = |{ ij∈S | Xij Yij < 1 }| / |S|       empirical (margin) error

For 𝒳 = { X ∈ Rn×m | |X|²max ≤ R² }   (mc²(A) = min |X|²max s.t. Aij Xij ≥ 1):
  ε = O( √( R²(n+m)/|S| ) + √( log(1/δ)/|S| ) )

For 𝒳 = { X ∈ Rn×m | rank(X) ≤ k }   (dc(A) = min rank(X) s.t. Aij Xij > 0):
  ε = O( √( ( k(n+m) log(en/k) + log(1/δ) ) / |S| ) )

For 𝒳 = { X ∈ Rn×m | |X|²tr/nm ≤ R² }   (ac²(A) = min |X|²tr/nm s.t. Aij Xij ≥ 1):
  ε = O( √( R²(n+m) log n / |S| ) + √( log(1/δ)/|S| ) )

Matrices Realizable by Low-Rank, Low-Trace-Norm and Low-Max-Norm

• Rank and max-norm are uniform measures, while trace-norm is an "on-average" measure: a matrix might be realizable with low trace-norm yet require high (order n^(2/3)) rank and max-norm.

• Realizable with low max-norm ⇒ realizable with low rank (up to an extra log factor):
    dc(A) ≤ 10 mc²(A) log(3nm)

• Some matrices realizable with low rank require arbitrarily high max-norm and trace-norm [Sherstov 07]

Maximum Margin Matrix Factorization as a Convex Combination of Classifiers

  { UV' | (∑i |Ui|²)(∑j |Vj|²) ≤ 1 } = conv( { uv' | u ∈ Rn, v ∈ Rm, |u|=|v|=1 } )
  (rank-one, unit-norm matrices)

  conv( { uv' | u ∈ {±1}n, v ∈ {±1}m } ) ⊂ { UV' | (maxi |Ui|²)(maxj |Vj|²) ≤ 1 } ⊂ 2·conv( { uv' | u ∈ {±1}n, v ∈ {±1}m } )
  (rank-one sign matrices)

Transfer Learning in Multi-Task and Multi-Class Settings

  minimize ½|X|tr + loss(X predicting Y)
  minimize ½|W|tr + loss(WΦ predicting Y)

[Figure: Y (tasks × instances) ≈ W (tasks × features) × Φ (features × instances)]

[Argyriou et al 2007] [Abernethy et al] [Amit et al 2007]

Transfer Learning in Multi-Task and Multi-Class Settings

  minimize ½|X|tr + loss(X predicting Y)
  minimize ½|W|tr + loss(WΦ predicting Y)   (see the sketch below)

[Figure: Y ≈ U × V × Φ, i.e. the low trace-norm W factors as W = UV, and VΦ is the learned feature space shared across tasks]

[Argyriou et al 2007] [Abernethy et al] [Amit et al 2007]
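A minimal sketch of this multi-task objective (assuming cvxpy is available; Φ, Y and the loss weight lam are toy placeholders, and a hinge loss stands in for "loss(WΦ predicting Y)"):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_tasks, n_feat, n_inst = 5, 12, 40
Phi = rng.standard_normal((n_feat, n_inst))            # features x instances (toy)
Y = np.sign(rng.standard_normal((n_tasks, n_inst)))    # +/-1 label per task and instance
lam = 1.0                                              # loss weight (placeholder)

W = cp.Variable((n_tasks, n_feat))
hinge = cp.sum(cp.pos(1 - cp.multiply(Y, W @ Phi)))    # hinge loss over all tasks
prob = cp.Problem(cp.Minimize(0.5 * cp.normNuc(W) + lam * hinge))
prob.solve(solver=cp.SCS)

# The trace-norm penalty encourages a low-rank W, i.e. a shared feature space.
print("rank of learned W:", np.linalg.matrix_rank(W.value, tol=1e-3))
```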

Mammal Recognition Experiment

• 72 classes of mammals
• Training: 1000 images
• Testing: 1000 images
• Compare:
  – Frobenius-norm SVM
  – Trace-norm SVM

Mammal Recognition Experiment

[Figure: accuracy change (roughly −20 to +40) from using the trace norm, as a function of the number of training instances (5–40); curves for the trace-norm and Frobenius-norm models]

Mammal Recognition Experiment

[Figure: comparison of the Frobenius-norm and trace-norm models]

Max-Margin Matrix Factorization
Nati Srebro, TTI-Chicago

  – Infinite number of factors (norm replaces dimensionality)
  – Convex optimization problem (SDP)

• Low-norm factorization is the true goal, not a surrogate for rank
  – Infinite factor model, connection to SVM
  – Generalization error bounds
  – Theoretically, not a good approximation to rank
  – Appropriate model in practice, with good empirical results

• Not only collaborative filtering!
  – Multi-task learning
  – Multi-class learning
  – Semi-supervised learning, following e.g. [Rie Zhang 05,07]
  – Bottleneck-type methods [Globerson Tishby 04]
  – …

