Maximum Margin
Matrix Factorization
Nati Srebro, Toyota Technological Institute at Chicago
Joint work with
Alex d’Aspremont (Princeton)
Jason Rennie (MIT)
Ben Marlin (Toronto)
Tommi Jaakkola (MIT)
Adi Shraibman (Hebrew U)
Noga Alon (Tel-Aviv)
Michael Fink (Hebrew U)
Yonatan Amit (Hebrew U)
Shimon Ullman (Weizmann)
Collaborative Prediction
Based on a partially observed matrix, predict the unobserved entries.
[figure: a users × movies rating matrix (rows: users, columns: movies) with many entries missing, marked “?”]
“Will user i like movie j?”
Linear Factor Model
ratings of a specific user ≈ u1 × v1 + u2 × v2 + u3 × v3
• v1, v2, v3: movie attribute vectors (e.g. comic value, dramatic value, violence), one entry per movie
• u1, u2, u3: preferences of the specific user for each attribute (e.g. +1, +2, −1)
• the combination gives the user’s predicted ratings across all movies (e.g. 2 4 5 1 4 2)
Linear Factor Model
Y ≈ U × V′
[figure: the users × movies rating matrix Y approximated by the product of U (users × factors: characteristics of each user) and V′ (factors × movies)]
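As a concrete (hypothetical) illustration of the factor model, the sketch below builds a tiny preference matrix U and movie-attribute matrix V in numpy and reads predicted ratings off their product; all numbers are made up.

```python
import numpy as np

# Hypothetical movie attributes: columns are (comic value, dramatic value, violence).
V = np.array([[0.9, 0.1, 0.2],    # one row of attribute values per movie
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.9]])
# One row of attribute preferences per user.
U = np.array([[+1.0, +2.0, -1.0],
              [+0.5, -1.0, +2.0]])

ratings = U @ V.T                 # entry (i, j): predicted rating of user i for movie j
print(ratings)
```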
Matrix Factorization
Y ≈ U × V′ = X,  rank(X) = k
Unconstrained: low-rank approximation
• Additive Gaussian noise: minimize |Y−UV′|Fro
• General additive noise
• General conditional models
  – Multiplicative noise, Exponential-PCA [Collins+01], Multinomial (pLSA [Hofmann01]), etc.
• General loss functions
  – Hinge loss, loss functions appropriate for ratings, etc. [Gordon03]
Unconstrained U,V with fully observed Y ⇒ use the SVD; otherwise non-convex, with no explicit solution.
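For the fully observed squared-error case the optimum comes straight from the truncated SVD (Eckart-Young); a minimal numpy sketch, with illustrative dimensions:

```python
import numpy as np

def low_rank_approx(Y, k):
    """Best rank-k approximation of Y in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

Y = np.random.default_rng(0).standard_normal((10, 8))
X = low_rank_approx(Y, k=3)
print(np.linalg.matrix_rank(X), np.linalg.norm(Y - X, "fro"))
```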
Matrix Factorization
• Non-Negativity [LeeSeung99]
• Stochasticity (convexity) [LeeSeung97] [Hofmann01]
• Sparsity
  – Clustering as an extreme case (when the rows of U are sparse)
Overall number of factors still constrained; all are non-convex optimization problems.
Outline
• Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices
  – Unbounded number of factors
  – Convex!
• Learning MMMF: semidefinite programming
• Generalization error bounds
  – Learning with low-rank (finite factor) matrices
  – MMMF (low max-norm / trace-norm)
• Representational power of low rank, max-norm and trace-norm matrices
Collaborative Prediction with Matrix Factorization
Fit a factorizable (low-rank) matrix X = UV′ to the observed entries:
  minimize Σ loss(Xij; Yij) over the observed entries
Use the matrix X to predict the unobserved entries.
[figure: the partially observed ±1 rating matrix (observation) is fit by X = UV′, which fills in the missing entries (prediction)]
[Sarwar+00] [Azar+01] [Hofmann04] [Marlin+04]
Collaborative Prediction with Matrix Factorization
[figure: the ±1 label matrix Y factored as U × V′; U holds one real-valued feature vector per user (rows), V′ one weight vector per movie (columns v1 … v12)]
When U is fixed, each column is a linear classification problem:
• rows of U are feature vectors
• columns of V are linear classifiers
Fitting U and V: learning features that work well across all the classification problems.
Geometric Interpretation: Co-embedding Points and Separating Hyperplanes
[figure: rows of U embed the users as points; columns of V define separating hyperplanes, one per movie; the ±1 labels in Y record on which side of movie j’s hyperplane user i falls]
Max-Margin Matrix Factorization: Bound norms of U,V instead of their dimensionality
[figure: users embedded as low-norm rows of U, movies as low-norm columns of V; with U fixed, each column of V is an SVM]
For observed Yij ∈ ±1:  Yij Xij ≥ Margin,  with Xij = ‹Ui,Vj›
bound norms uniformly:  (maxi |Ui|²)(maxj |Vj|²) ≤ R²
  max-norm:  |X|max = min over X=UV′ of (maxi |Ui|)(maxj |Vj|)
bound norms on average:  (∑i |Ui|²)(∑j |Vj|²) ≤ R²
  trace-norm:  |X|tr = min over X=UV′ of |U|Fro |V|Fro
Max-Margin Matrix Factorization: Bound norms of U,V instead of their dimensionality
bound norms uniformly:  (maxi |Ui|²)(maxj |Vj|²) ≤ R²  →  max-norm |X|max
bound norms on average:  (∑i |Ui|²)(∑j |Vj|²) ≤ R²  →  trace-norm |X|tr
Convexity: for X1 = U1V1′ and X2 = U2V2′,
  αU1V1′ + (1−α)U2V2′ = [√α U1, √(1−α) U2] [√α V1, √(1−α) V2]′,
and the concatenated factorization keeps both norm bounds.
Unlike rank(X) ≤ k, these are convex constraints!
Max-Margin Matrix Factorization: Convex Combination of Factors
bound norms on average:  (∑i |Ui|²)(∑j |Vj|²) ≤ 1
|X|tr = ∑ (singular values of X)
X = ∑ si ui vi′ with ∑ si ≤ 1: each ui vi′ is an outer product of norm-1 vectors, i.e. a rank-1, norm-1 matrix (unit-norm rows, unit-norm columns), so the trace-norm unit ball is the convex hull of such rank-1 factors.
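A quick numpy check of this variational picture (my own sketch, with a random matrix): the trace norm is the sum of singular values, and the factorization read off the SVD attains |U|Fro·|V|Fro = |X|tr.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))

A, s, Bt = np.linalg.svd(X, full_matrices=False)   # X = A diag(s) B'
trace_norm = s.sum()                               # |X|_tr = sum of singular values

# The SVD-derived factorization X = U V' attains |U|_Fro * |V|_Fro = |X|_tr.
U = A * np.sqrt(s)       # scale the columns of A by sqrt(s_i)
V = Bt.T * np.sqrt(s)    # scale the columns of B by sqrt(s_i)
assert np.allclose(U @ V.T, X)
print(np.linalg.norm(U, "fro") * np.linalg.norm(V, "fro"), trace_norm)
```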
Finding Max-Margin Matrix Factorizations
minimize (∑i |Ui|²)·(∑j |Vj|²)  s.t.  Yij Xij ≥ 1,  X = UV′
|X|tr = ∑ (singular values of X) = minA,B ½( tr(A) + tr(B) )  s.t.  [A X; X′ B] p.s.d.   [Fazel Hindi Boyd 2001]
Equivalent SDP:
  minimize ½( tr(A) + tr(B) )  s.t.  [A X; X′ B] p.s.d.,  Yij Xij ≥ 1
Dual, with a variable Qij for each observed (i,j):
  maximize ∑ Qij  s.t.  ||Q ⊗ Y||2 ≤ 1,  0 ≤ Qij
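A sketch of this SDP in cvxpy (my own illustration, not code from the talk; the data and dimensions are made up, and cp.normNuc(X) would give the trace norm directly):

```python
import cvxpy as cp
import numpy as np

n, m = 6, 5
rng = np.random.default_rng(0)
Y = np.sign(rng.standard_normal((n, m)))                  # full +/-1 label matrix
obs = [(i, j) for i in range(n) for j in range(m) if rng.random() < 0.5]

X = cp.Variable((n, m))
A = cp.Variable((n, n), symmetric=True)
B = cp.Variable((m, m), symmetric=True)

constraints = [cp.bmat([[A, X], [X.T, B]]) >> 0]          # [A X; X' B] p.s.d.
constraints += [Y[i, j] * X[i, j] >= 1 for (i, j) in obs] # margin on observed entries

prob = cp.Problem(cp.Minimize(0.5 * (cp.trace(A) + cp.trace(B))), constraints)
prob.solve()
print(prob.value, np.linalg.norm(X.value, "nuc"))         # both equal |X*|_tr
```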
Finding Max-Margin Matrix Factorizations
Q ⊗ Y is a sparse elementwise product (zero for unobserved entries): sparse observations (constraints) make the primal dense but the dual sparse.
[figure: the sparsely observed ±1 matrix Y and the correspondingly sparse dual matrix Q]
Finding Max-Margin Matrix Factorizations (soft margin)
minimize ½( tr(A) + tr(B) ) + c ∑ ξij  s.t.  [A X; X′ B] p.s.d.,  Yij Xij ≥ 1 − ξij
  (i.e. |X|tr + c·err(X))
Dual, with a variable Qij for each observed (i,j):
  maximize ∑ Qij  s.t.  ||Q ⊗ Y||2 ≤ 1,  0 ≤ Qij ≤ c
X* = U D V′ ⇐ Q* = U S V′: the primal optimum shares the dual optimum’s singular vectors, and D can be recovered by solving a small LP.
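The same soft-margin problem can be posed directly as trace norm plus hinge loss; a cvxpy sketch (again my own illustration with made-up data; cp.normNuc is the trace/nuclear norm):

```python
import cvxpy as cp
import numpy as np

n, m, c = 8, 6, 1.0
rng = np.random.default_rng(0)
Y = np.sign(rng.standard_normal((n, m)))                 # +/-1 labels
obs = [(i, j) for i in range(n) for j in range(m) if rng.random() < 0.4]

X = cp.Variable((n, m))
hinge = sum(cp.pos(1 - Y[i, j] * X[i, j]) for (i, j) in obs)   # sum of slacks
prob = cp.Problem(cp.Minimize(cp.normNuc(X) + c * hinge))      # |X|_tr + c*err(X)
prob.solve()
```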
Fast Optimization of the Dual
Primal: minimize |X|tr + c ∑ (1 − Yij Xij)+   (trace norm plus c·err(X))
Rescaled primal: maximize v − b ∑ (v − Yij X̃ij)+  s.t.  |X̃|tr ≤ 1,  where X̃ = X/v
Dual: maximize ∑ Qij  s.t.  ||Q ⊗ Y||2 ≤ 1,  0 ≤ Qij ≤ c
Rescaled dual: minimize ||Q ⊗ Y||2  s.t.  ∑ Qij ≤ 1,  0 ≤ Qij ≤ b
[figure: the attainable (err(X)/#obs, |X|tr) region; the “front” of extreme (err(X), |X|tr) values is the regularization path]
||Q ⊗ Y||2 is a saddle problem, min max u′(Q⊗Y)u: use Nesterov’s smoothing for saddle problems.
The subgradient (and the gradient of the smoothed problem) is given by the SVD.
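To make the last point concrete, here is a sketch (mine, not from the talk) of one primal subgradient step, using the fact that if X = U S V′ is a thin SVD then U V′ is a subgradient of |X|tr:

```python
import numpy as np

def mmmf_subgradient_step(X, Y, observed, c, lr):
    """One subgradient step on |X|_tr + c * sum of hinge losses (sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    g = U @ Vt                              # subgradient of the trace norm at X
    violated = observed & (Y * X < 1)       # observed entries inside the margin
    g = g - c * Y * violated                # subgradient of the hinge term
    return X - lr * g
```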
Finding Max-Margin Matrix Factorizations
Running times (3.02 GHz Xeon):

  size        1% observed   10% observed   50% observed
  100×100     2 sec         3 sec          10 sec
  178×178     2 sec         18 sec         35 sec
  316×316     19 sec        2:34 min       2:41 min
  562×562     3:27 min      3:37 min       19:11 min
  1000×1000   34:35 min     41:15 min      1:35:28 hours
  1778×1778   5:44:07       6:40:06        19:09:49
  3162×3162   57:23:09      67:35:34       62:12:21

Largest problems: over 1.5 million observations.
Prediction Performance
[figure: normalized mean absolute error (range ≈ 0.40 to 0.46) of MMMF vs. URP (aspect/LDA-like) and “Attitude” (softmax-based), the top methods from the comprehensive comparison of [Marlin04], on EachMovie (74,424 × 1,648; 2.6M observations, ratings 1–6) and MovieLens (6,040 × 3,952; 1M observations, ratings 1–5)]
5 million observations
MMMF Optimization: Other Options
minimize |X|tr + loss(X)
• Gradient descent on the smoothed primal: |X|tr = ∑i |λi(X)|smooth, with gradient
  ∂|X|tr/∂X = U (∂|S|smooth/∂S) V′
• Gradient descent on the unconstrained factored objective: ½(|U|²Fro + |V|²Fro) + loss(UV′)
  – Non-convex
  – If the dimensionality of U,V is large enough, a global optimum is still guaranteed [Burer Monteiro 2004]
[figure: the smoothed absolute value applied to each singular value (plot of norm contribution vs. singular value on [−1, 1])]
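A sketch of the second option under simple assumptions (squared loss standing in for a general loss; the step size, dimensions, and regularization weight are made up):

```python
import numpy as np

def factored_mmmf_gd(Y, mask, k=30, lam=0.1, lr=0.01, iters=500, seed=0):
    """Gradient descent on  lam/2 * (|U|_Fro^2 + |V|_Fro^2) + 1/2 * squared
    error over the observed entries of Y (mask is a boolean matrix)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(iters):
        R = mask * (U @ V.T - Y)        # residual on observed entries only
        U, V = U - lr * (R @ V + lam * U), V - lr * (R.T @ U + lam * V)
    return U, V
```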
Outline
• Maximum Margin Matrix Factorization: low max-norm and low trace-norm matrices
  – Unbounded number of factors
  – Convex!
• Learning MMMF: semidefinite programming
• Generalization error bounds
  – Learning with low-rank (finite factor) matrices
  – MMMF (low max-norm / trace-norm)
• Representational power of low rank, max-norm and trace-norm matrices
Generalization Error Bounds
[figure: a random set S of entries of Y is observed; X is fit to S and evaluated on the rest of Y]
∀Y  PrS( ∀X∈𝒳  D(X;Y) < DS(X;Y) + ε ) > 1 − δ
  S: random training set;  Y: unknown, assumption-free;  X: hypothesis
  D(X;Y) = |{ ij | Xij Yij ≤ 0 }| / nm   (generalization error)
  DS(X;Y) = |{ ij∈S | Xij Yij ≤ 0 }| / |S|   (empirical error)
Prior work assumed a low-rank structure (eigengap): asymptotic behavior [Azar+01]; sample complexity and query strategy [Drineas+02]
Generalization Error Bounds: Low Trace-Norm, Max-Norm or Rank
∀Y  PrS( ∀X∈𝒳  D(X;Y) < DS(X;Y) + ε ) > 1 − δ
  D(X;Y) = |{ ij | Xij Yij ≤ 0 }| / nm   (generalization error)
  DS(X;Y) = |{ ij∈S | Xij Yij < 1 }| / |S|   (empirical margin error)

𝒳 = { X∈Rn×m | |X|²max ≤ R² }:
  ε(𝒳) = √( (12 R² (n+m) + log(1/δ)) / |S| );   mc²(A) = min |X|²max s.t. Aij Xij ≥ 1
𝒳 = { X∈Rn×m | rank(X) ≤ k }:
  ε(𝒳) = √( (8 k (n+m) log(en/k) + log(2/δ)) / |S| );   dc(A) = min rank(X) s.t. Aij Xij > 0
𝒳 = { X∈Rn×m | |X|²tr/nm ≤ R² }:
  ε(𝒳) = K √( (R² (n+m) log n + log(1/δ)) / |S| );   ac²(A) = min |X|²tr/nm s.t. Aij Xij ≥ 1
Matrices Realizable by Low-Rank, Low-Trace-Norm and Low-Max-Norm
• Rank and max-norm are uniform measures, while the trace-norm is an “on-average” measure: a matrix may be realizable with low trace-norm yet require high (O(n^(2/3))) rank and max-norm
• Realizable with low max-norm ⇒ realizable with low rank (up to an extra log factor):
  dc(A) ≤ 10 mc²(A) log(3nm)
• Some matrices realizable with low rank require arbitrarily high max-norm and trace-norm [Sherstov 07]
Maximum Margin Matrix Factorization as a Convex Combination of Classifiers
{ UV′ | (∑i |Ui|²)(∑j |Vj|²) ≤ 1 } = convex-hull( { uv′ | u∈Rn, v∈Rm, |u|=|v|=1 } )
  (rank-one, unit-norm matrices)
conv( { uv′ | u∈{±1}n, v∈{±1}m } ) ⊂ { UV′ | (maxi |Ui|²)(maxj |Vj|²) ≤ 1 } ⊂ 2·conv( { uv′ | u∈{±1}n, v∈{±1}m } )
  (rank-one sign matrices)
Transfer Learning in Multi-Task and Multi-Class Settings
minimize ½|X|tr + loss(X predicting Y)
minimize ½|W|tr + loss(WΦ predicting Y)
[figure: Y (tasks × instances) ≈ W (tasks × features) × Φ (features × instances)]
Writing W = UV′ gives a learned feature space VΦ shared across the tasks.
[Argyriou et al 2007] [Abernethy et al] [Amit et al 2007]
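A minimal cvxpy sketch of the multi-task form (my own illustration; Phi, the targets, and the squared loss are stand-ins for whatever task losses one actually uses):

```python
import cvxpy as cp
import numpy as np

T, d, N = 5, 20, 100                       # tasks, features, instances (made up)
rng = np.random.default_rng(0)
Phi = rng.standard_normal((d, N))          # shared feature matrix
Ytasks = rng.standard_normal((T, N))       # one row of targets per task

W = cp.Variable((T, d))                    # tasks x features weight matrix
obj = 0.5 * cp.normNuc(W) + cp.sum_squares(W @ Phi - Ytasks)
cp.Problem(cp.Minimize(obj)).solve()       # the trace norm couples the tasks
```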
Mammal Recognition Experiment
• 72 classes of mammals
• Training: 1000 images
• Testing: 1000 images
• Compare:
  – Frobenius norm SVM
  – Trace norm SVM
Mammal Recognition Experiment
[figure: accuracy change (%) from using the trace norm instead of the Frobenius norm, plotted against the number of training instances (5 to 40); y-axis from −20 to 40]
Max-Margin Matrix Factorization (Nati Srebro, TTI-Chicago)
• Infinite number of factors (norm replaces dimensionality)
• Convex optimization problem (SDP)
• Low-norm factorization is the true goal, not a surrogate for rank
  – Infinite factor model, connection to SVM
  – Generalization error bounds
  – Theoretically, not a good approximation to rank
  – Appropriate model in practice: good empirical results
• Not only collaborative filtering!
  – Multi-task learning
  – Multi-class learning
  – Semi-supervised learning, following e.g. [Ando Zhang 05,07]
  – Bottleneck-type methods [Globerson Tishby 04]
  – …