Guaranteed Matrix Completion via Non-convex Factorization
Zhi-Quan (Tom) Luo
The Chinese University of Hong Kong, Shenzhen, China and
The University of Minnesota, Minneapolis, MN, USA
joint work with Ruoyu Sun
July 5, 2015
Outline
1. Introduction and Result
2. Technical Details
   - Formulation, Algorithm and Initial Point
   - Proof Ideas
   - Proof Sketch of Step 1
Non-convex Matrix Completion Zhi-Quan Luo 2 / 36
Reference
Ruoyu Sun and Zhi-Quan Luo, "Guaranteed Matrix Completion via Non-convex Factorization," arXiv preprint arXiv:1411.8003 (2014); FOCS 2015, forthcoming.
Netflix Challenge
Netflix prize (2006): 500,000 customers, 17,000 movies, 100 million ratings (1.2% of all possible ratings).
Challenge: predict the missing ratings ($1,000,000 prize).
One approach: exploit low-rank structure.
Low-rank matrix completion: complete a low-rank matrix from a few observed entries [Candes-Recht-09].
Idea of Matrix Completion (MC)
Low-rank: M ∈ R^{n×n} with rank r ≪ n.
Partial observation: Ω ⊂ [n] × [n] is the set of sampled positions.
If M = E_{11}, there is little hope to recover M:

E_{11} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}

⟹ assumptions on M and Ω are needed.
Matrix Completion
Result [Candes-Recht'09]: recover M ∈ R^{n×n} from O(nr log n) entries via convex optimization, if
- M is "generic" (incoherent);
- the set of observations Ω is "generic" (random).

A rank-1 example, M = xy^T:

M = \begin{pmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & x_2 y_3 & \cdots & x_2 y_n \\ \vdots & \vdots & \vdots & & \vdots \\ x_n y_1 & x_n y_2 & x_n y_3 & \cdots & x_n y_n \end{pmatrix}

O(nr log n) samples vs. the ambient dimension n².
- O(nr) is necessary: there are nr free variables.
- The log n factor is due to the coupon-collector effect: n log n samples are needed to cover every row/column.
Why is Matrix Completion Interesting?
1) A natural extension of sparsity to the matrix domain.
- Related low-rank pursuit approaches: phase retrieval, robust PCA, tensor completion.
2) Practice: lots of applications.
- Recommendation systems, computer vision, network anomaly detection, etc.
3) Theory: a fundamental mathematical problem.
- Related to statistics, learning, optimization, linear algebra, information theory, etc.
Matrix Factorization vs. Nuclear Norm

Two popular methods: nuclear norm minimization and matrix factorization.

Method 1: nuclear norm minimization [Candes-Recht-09]

min_{Z ∈ R^{n×n}} (1/2) ∥P_Ω(M − Z)∥_F² + λ∥Z∥_*,   (1)

where

P_Ω(X)_{ij} = X_{ij} if (i, j) ∈ Ω, and P_Ω(X)_{ij} = 0 if (i, j) ∉ Ω.

n² variables; nonsmooth but convex.

Method 2: matrix factorization (MF) based formulation [Koren09]:

P0: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} (1/2) ∥P_Ω(M − XY^T)∥_F² + λ(∥X∥_F² + ∥Y∥_F²).   (2)

nr variables; smooth but nonconvex.
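For concreteness, both objectives fit in a few lines of numpy. This is a sketch under my own naming (`P_Omega`, `nuclear_norm_obj`, `mf_obj` are not from the talk); `mask` is a boolean matrix marking Ω:

```python
import numpy as np

def P_Omega(X, mask):
    """Sampling operator: keep X[i, j] where (i, j) is observed, zero elsewhere."""
    return X * mask

def nuclear_norm_obj(Z, M, mask, lam):
    """Method 1: 0.5 * ||P_Omega(M - Z)||_F^2 + lam * ||Z||_*  (convex, n^2 variables)."""
    resid = P_Omega(M - Z, mask)
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, ord="nuc")

def mf_obj(X, Y, M, mask, lam):
    """Method 2: 0.5 * ||P_Omega(M - X Y^T)||_F^2 + lam * (||X||_F^2 + ||Y||_F^2)."""
    resid = P_Omega(M - X @ Y.T, mask)
    return 0.5 * np.sum(resid ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))
```

Note the variable counts in code mirror the slide: `Z` has n² entries while `X`, `Y` together have 2nr.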
How can this be possible?
The matrix factorization (MF) formulation [Koren09]:

P0: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} (1/2) ∥P_Ω(M − XY^T)∥_F²   (3)

Even if we can solve (P0), how can we be sure that M = X*(Y*)^T, i.e., that

∥P_Ω(M − XY^T)∥_F² = 0 ⟹ ∥M − XY^T∥_F² = 0?

Observation: let p = |Ω|/n². If Ω and M, X, Y are independent, then

E[∥P_Ω(M − XY^T)∥_F²] = p ∥M − XY^T∥_F².

Difficulty: the iterates (X, Y) cannot be independent of Ω!
Resampling = use fresh samples at each iteration and discard them afterwards.
- Not practical: wastes samples; the accuracy is pre-determined.
- No exact recovery: exact recovery would require infinitely many samples!
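The identity E[∥P_Ω(S)∥_F²] = p∥S∥_F² for a fixed S is easy to check by simulation; a small Monte Carlo sketch (sizes, seed, and trial count are my arbitrary choices):

```python
import numpy as np

# Monte Carlo check of E||P_Omega(S)||_F^2 = p * ||S||_F^2 when S is fixed
# before Omega is drawn.
rng = np.random.default_rng(0)
n, p, trials = 50, 0.3, 2000
S = rng.standard_normal((n, n))          # fixed matrix, independent of the samples
acc = 0.0
for _ in range(trials):
    mask = rng.random((n, n)) < p        # i.i.d. Bernoulli(p) observation set Omega
    acc += np.sum((S * mask) ** 2)       # ||P_Omega(S)||_F^2 for this draw
ratio = (acc / trials) / np.sum(S ** 2)
print(ratio)                             # close to p = 0.3
```

If S were instead allowed to depend on the drawn Ω, this ratio could be driven far from p, which is exactly the difficulty described above.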
Nuclear Norm Formulation
The nuclear-norm formulation (an SDP) is convex, hence globally convergent algorithms exist:
- standard SDP solvers (interior-point methods);
- proximal gradient method and variants [Toh-Yun10], [Ma-Goldfarb-Chen11];
- linear convergence under certain conditions [Agarwal-Negahban-Wainwright12], [Hou-Zhou-So-Luo13].

Pros: convex; guaranteed recovery [Candes-Recht-09].
Cons: slow for big data (requires an SVD per iteration); large memory requirement.
Matrix Factorization Formulation
Algorithms for the non-convex MF model (converge to stationary points):
- Alt-Min [Koren-09], [Wen-Yin-Zhang-12];
- SGD (stochastic gradient descent) [Koren-09], [Funk-06];
- other Alt-Min methods: multi-block [Yu-Hsieh-Si-Dhillon-12], block majorization [Hastie-Mazumder-Lee-Zadeh-14].

Pros: fast in practice; little storage; flexible (can incorporate data aspects).
Cons: limited performance analysis (more later).

Our goal: bridge the gap between theory and practice.
Formulation
Start from a constrained version (extra requirements on the factors X, Y):

P1': min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F²
     s.t. (X, Y) ∈ K1 := {(X, Y) : ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T},
          (X, Y) ∈ K2 := {(X, Y) : ∥X^{(i)}∥ ≤ β_1, ∥Y^{(i)}∥ ≤ β_1, ∀i}.   (4)

K1: boundedness. K2: incoherence.
Formulation
Consider a penalized version of the above problem:

P1: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F̃(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F² + G(X, Y),   (5)

where

G(X, Y) := ρ G1(3∥X∥_F² / (2β_T²)) + ρ G1(3∥Y∥_F² / (2β_T²)) + ρ Σ_{i=1}^n G1(3∥X^{(i)}∥² / (2β_1²)) + ρ Σ_{j=1}^n G1(3∥Y^{(j)}∥² / (2β_1²)),

G1(z) = max(z − 1, 0)², and ρ is a large enough constant.
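The penalty G is cheap to evaluate. A minimal numpy sketch (function and argument names are mine), where the rows of X and Y play the role of X^{(i)}, Y^{(j)}:

```python
import numpy as np

def G1(z):
    """G1(z) = max(z - 1, 0)^2: zero on (-inf, 1], quadratic growth beyond."""
    return max(z - 1.0, 0.0) ** 2

def penalty_G(X, Y, beta_T, beta_1, rho):
    """G(X, Y): vanishes when the Frobenius and row norms are safely inside
    the K1 / K2 bounds, and grows quadratically once a bound is exceeded."""
    g = G1(3 * np.sum(X ** 2) / (2 * beta_T ** 2))
    g += G1(3 * np.sum(Y ** 2) / (2 * beta_T ** 2))
    g += sum(G1(3 * np.sum(X[i] ** 2) / (2 * beta_1 ** 2)) for i in range(X.shape[0]))
    g += sum(G1(3 * np.sum(Y[j] ** 2) / (2 * beta_1 ** 2)) for j in range(Y.shape[0]))
    return rho * g
```

On any (X, Y) whose norms stay well inside K1 ∩ K2 the penalty is exactly zero, so it does not perturb feasible iterates; it only discourages escape from the constraint sets.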
Our Contributions
Theorem [Sun-L.'14]
Suppose M ∈ R^{n×n} has rank r,
- is incoherent (with parameter μ), and
- has condition number κ = σ_1/σ_r.
For an i.i.d. random observation set Ω ⊆ [n] × [n] with size

|Ω| ≥ C n log n · poly(r, κ, μ),   (6)

and with a specific initialization, many standard algorithms for (P1) converge to a global optimum AND recover M w.h.p.

Remark: the initialization and formulation will be specified later. Standard algorithms include GD, SGD, and Alt-Min.
Proof Idea (1)
Why is global convergence possible for a non-convex problem?
Basin of attraction + good initial point.

(I) Problem property: a basin of attraction (hard):
- a convex neighborhood in the space of (X, Y);
- nonconvex in the space of M;
- every stationary point (X*, Y*) of (P1) in the basin satisfies X*(Y*)^T = M.
Proof Idea (2)
(II) Algorithm properties:
(II.a) convergence to stationary points (easy);
(II.b) all iterates stay in the basin (moderate):
  (II.b1) the initial point is in the basin (extension of [KMO'09]);
  (II.b2) all subsequent iterates stay in the basin.

Proof of main result: (II) shows that Algorithms 1-4 converge to a stationary point in the basin, which by (I) equals M (the global optimum).
Applicable Algorithms
Define x_t = (X_t, Y_t) and Δ_t := x_{t+1} − x_t.
Our result applies to any algorithm with two properties (besides the initialization):
(II.a) it converges to a stationary point;
(II.b) it satisfies one of three mild conditions (which keep the iterates in the "basin"):
1) F̃(x_t + λΔ_t) ≤ 2F̃(x_0), ∀λ ∈ [0, 1], ∀t;
2) 1 = argmin_{λ ∈ R} ψ(x_t, Δ_t; λ), where ψ is a "convex upper bound", ∀t;
3) F̃(x_t) ≤ 2F̃(x_0) and d(x_t, x_0) ≤ (5/6)δ, ∀t,
where δ = O(σ_r).
Applicable Algorithms
Three typical classes of applicable algorithms:
- GD with constant step size and SGD satisfy 1);
- block coordinate descent, or more generally BSUM, satisfies 2);
- any descent algorithm with a "bounded" update (d(x_t, x_0) ≤ (5/6)δ) at each iteration satisfies 3).
Related Works in Matrix Completion
- Result for the MF model on the Grassmann manifold [Keshavan-Montanari-Oh'09].
- Results for Alt-Min and variants [Keshavan11], [Jain-Netrapalli-Sanghavi12], [Hardt13].

Table: Comparison with Recent Studies on Alt-Min

               | Studies on Alt-Min        | Our work
Applicability  | one algorithm (Alt-Min)   | many algorithms
Form           | require resampling        | standard form
Technique      | analysis of power method  | random graph + perturbation
Related Works for Other Problems
Phase retrieval (PR):
- nuclear norm formulation (convex) [Candes-Eldar-Strohmer-Voroninski-11];
- non-convex resampling-based Alt-Min [Netrapalli-Jain-Sanghavi-13];
- non-convex gradient descent (no resampling) [Candes-Li-Soltanolkotabi-14].

Non-convex sparse regression: [Zhang-Zhang-12], [Loh-Wainwright-13], [Fan-Xue-Zou-14].
EM algorithm: [Balakrishnan-Wainwright-14].
Power method for computing eigenvectors:
- sparse PCA [Wang-Lu-Liu-14], [Yuan-Zhang-13], [Deshpande-Montanari-14];
- tensor decomposition [Anandkumar-Ge-Hsu-Kakade-12], [Anandkumar-Ge-Janzamin-14].
Remarks on Non-convex Guarantee
Remark 1: geometry vs. algorithm-specific analysis.
- Most works analyze only one algorithm, especially the power method.
- Our work is about geometry: the "basin of attraction".
- Few works address geometry: [Keshavan-Montanari-Oh'09], [Balakrishnan-Wainwright-14] (EM), [Candes-Li-Soltanolkotabi-14] (PR).
- With so many algorithms (and variants) for MC, and more coming out, it is better to give a unified analysis; [KMO'09] does not cover SGD.

Remark 2: good initialization is often necessary.
- Exception: sparse regression (perhaps due to the convex loss?).
- Good initialization has not yet been found for [Balakrishnan-Wainwright-14] (EM) or [Deshpande-Montanari-14] (PCA).
Formulation
Start from a constrained version (extra requirements on the factors X, Y):

P1': min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F²   (8a)
     s.t. ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T,   (8b)
          ∥X^{(i)}∥ ≤ β_1, ∥Y^{(i)}∥ ≤ β_1, ∀i.   (8c)

We consider a penalized version of the above problem:

P1: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F̃(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F² + G(X, Y),   (9)

where

G(X, Y) := ρ G1(3∥X∥_F² / (2β_T²)) + ρ G1(3∥Y∥_F² / (2β_T²)) + ρ Σ_{i=1}^n G1(3∥X^{(i)}∥² / (2β_1²)) + ρ Σ_{j=1}^n G1(3∥Y^{(j)}∥² / (2β_1²)),

in which G1(z) = max(z − 1, 0)² and ρ is a large enough constant.
Algorithms
Consider four typical algorithms, all using the same initial point (described on the next slide).
- Algorithm 1: GD (gradient descent).
- Algorithm 2: two-block Alt-Min.
- Algorithm 3: row BSUM (block successive upper-bound minimization); a different choice of blocks compared to Algorithm 2.
- Algorithm 4: SGD.

Remark: we cover three classes of first-order methods: GD, alternating methods (BCD-type), and SGD (incremental gradient methods).
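As a concrete instance, Algorithm 1 (GD with a constant step size) applied to the unpenalized loss F can be sketched as follows. This is a toy sketch, not the paper's exact procedure; the step size and iteration count are arbitrary choices of mine:

```python
import numpy as np

def grad_F(X, Y, M, mask):
    """Gradients of F(X, Y) = 0.5 * ||P_Omega(M - X Y^T)||_F^2 (penalty G omitted)."""
    R = (X @ Y.T - M) * mask             # P_Omega(X Y^T - M)
    return R @ Y, R.T @ X

def objective(X, Y, M, mask):
    return 0.5 * np.sum((((X @ Y.T) - M) * mask) ** 2)

def gradient_descent(X, Y, M, mask, step=5e-3, iters=300):
    """Algorithm 1: plain GD with a constant step size on both factors."""
    for _ in range(iters):
        gX, gY = grad_F(X, Y, M, mask)
        X, Y = X - step * gX, Y - step * gY
    return X, Y
```

Alt-Min and SGD differ only in the update rule (exact minimization over one factor, or a stochastic gradient on single observed entries); the objective and gradients are the same.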
Choice of Initial Point
Let p = |Ω|/n². The initialization consists of two steps.

Step 1: SVD.
Compute (X̄_0, D_0, Ȳ_0) = SVD_r((1/p) P_Ω(M)).
Define X̃_0 = X̄_0 D_0^{1/2}, Ỹ_0 = Ȳ_0 D_0^{1/2}.

Step 2: Scaling (to force incoherence).
Define new matrices X_0, Y_0 so that ∥X_0^{(i)}∥² ≤ 2β_1²/3 and ∥Y_0^{(j)}∥² ≤ 2β_1²/3:

X_0^{(i)} = (X̃_0^{(i)} / ∥X̃_0^{(i)}∥) · min{∥X̃_0^{(i)}∥, √(2/3) β_1}, ∀i,
Y_0^{(j)} = (Ỹ_0^{(j)} / ∥Ỹ_0^{(j)}∥) · min{∥Ỹ_0^{(j)}∥, √(2/3) β_1}, ∀j.   (10)

Claim of good initialization: (X_0, Y_0) ∈ K1 ∩ K2 ∩ K(δ), where

K(δ) := {(X, Y) : ∥M − XY^T∥ ≤ δ = O(σ_r)}.
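The two-step initialization translates directly into numpy. A sketch under my own naming, where a truncated `np.linalg.svd` plays the role of SVD_r and `clip_rows` implements the scaling (10):

```python
import numpy as np

def initialize(M_obs, mask, r, beta_1):
    """Step 1: rank-r SVD of (1/p) P_Omega(M); Step 2: rescale any row whose
    norm exceeds sqrt(2/3) * beta_1, enforcing the incoherence bound."""
    p = mask.mean()
    U, s, Vt = np.linalg.svd(M_obs * mask / p, full_matrices=False)
    X0 = U[:, :r] * np.sqrt(s[:r])       # X0 = Xbar_0 D_0^{1/2}
    Y0 = Vt[:r].T * np.sqrt(s[:r])       # Y0 = Ybar_0 D_0^{1/2}
    cap = np.sqrt(2.0 / 3.0) * beta_1
    def clip_rows(Z):
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        return Z * np.minimum(norms, cap) / np.maximum(norms, 1e-12)
    return clip_rows(X0), clip_rows(Y0)
```

Note the clipping only shrinks rows that are too long; rows already satisfying the bound are left untouched.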
(I) Local Strong Convexity?
Local strong convexity: ⟨∇f(x) − ∇f(x*), x − x*⟩ ≥ c∥x − x*∥², for all x close to x*.
(I) Local Convexity-Like Property
Lemma 1
Under the conditions of Theorem 1 (the main result), w.h.p. the following holds: for any (X, Y) ∈ K1 ∩ K2 ∩ K(δ), there exist U, V ∈ R^{n×r} such that UV^T = M and

⟨∇_X F̃(X, Y), X − U⟩ + ⟨∇_Y F̃(X, Y), Y − V⟩   (11a)
≥ (p / (9β_T)) (∥X − U∥_F² + ∥Y − V∥_F²) ≥ (p/9) ∥M − XY^T∥_F².   (11b)

Remark: (11) can be approximately viewed as local (strong) convexity.
Main Difficulties and Techniques
Difficulty 1: the iterates depend on Ω.
- It is hard to estimate P_Ω(Z) if Z depends on Ω.
- Resampling avoids the difficulty, artificially.
- Solution: a random graph lemma in [K-M-O-09] (due to [Feige-Ofek-03]).

Difficulty 2: distance to a factor space.
- Recall that (11) bounds sup_{UV^T = M} ⟨(U, V) − (X, Y), ∇F̃⟩.
- U, V are coupled: we must estimate dist((X, Y), S), where S = {(U, V) : UV^T = M}.
- In [K-M-O-09], d(U, X) and d(V, Y) are estimated independently (on the Grassmann manifold).
- Related to perturbation analysis [Wedin-70], but much more difficult.
A Decomposition Result
If M is close to XY^T, then there exists a factorization M = UV^T such that U, V are close to X, Y, respectively.

Proposition 1
Suppose
(1) ∥M − XY^T∥_F = d ≤ Σ_min/10;
(2) ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T.
Then there exist U, V ∈ R^{n×r} such that
(a) UV^T = M;
(b) ∥U − X∥_F ≤ (2β_T/Σ_min) d and ∥V − Y∥_F ≤ (4β_T/Σ_min) d.

Related to perturbation analysis [Wedin70]: if M and Z are close, then their singular vector spaces are also close.
Proposition 1 is not enough! Lemma 1 requires a more advanced version of Proposition 1.
Proof of Lemma 1: Outline
Since F̃ = F + G, we only need to find a factorization M = UV^T such that:

Step 1:
φ_F := ⟨∇_X F, X − U⟩ + ⟨∇_Y F, Y − V⟩ ≥ (p/9) d²,   (12)
where d := ∥M − XY^T∥_F ≤ O(Σ_min).

Step 2:
φ_G := ⟨∇_X G, X − U⟩ + ⟨∇_Y G, Y − V⟩ ≥ 0.   (13)

In the rest, we only prove Step 1. (We omit Step 2, which is much more complicated.)
Difficulty of Step 1
Need to prove φ_F ≥ (p/9) d², where

φ_F = ⟨∇_X F, X − U⟩ + ⟨∇_Y F, Y − V⟩
    = ⟨P_Ω(XY^T − M) Y, X − U⟩ + ⟨P_Ω(XY^T − M)^T X, Y − V⟩
    = ⟨P_Ω(XY^T − M), (X − U)Y^T + X(Y − V)^T⟩.   (14)

Question: how can we lower-bound ∥P_Ω(XY^T − M)∥_F in terms of d = ∥M − XY^T∥_F?

Guess: since E[∥P_Ω(S)∥_F²] = p∥S∥_F², perhaps w.h.p. ∥P_Ω(S)∥_F² ≥ pd²/2 for all S with ∥S∥_F = d?

Game: pick Ω first; an adversary then picks S depending on Ω.
If (1, 1) ∉ Ω, the adversary picks S = dE_{11}; then ∥P_Ω(S)∥_F² = 0 < pd²/2.

Remark:
- In our problem, X_k, Y_k depend on Ω.
- Resampling avoids this technical difficulty, but only artificially.
Solution of Step 1
Lemma 2 [C-R-09]: ∥P_Ω(A)∥_F² ≥ (p/2) ∥A∥_F² for all

A ∈ T := {UW_2^T + W_1 V^T : W_1, W_2 ∈ R^{n×r}}.   (15)

Intuition: T is the "tangent space" at the fixed incoherent matrix M, and is thus independent of Ω.

Define A = U(Y − V)^T + (X − U)V^T ∈ T and B = (X − U)(Y − V)^T; then XY^T − M = A + B, and

φ_F = ⟨P_Ω(XY^T − M), (X − U)Y^T + X(Y − V)^T⟩
    = ⟨P_Ω(A + B), A + 2B⟩
    = ∥P_Ω(A)∥_F² + 2∥P_Ω(B)∥_F² + 3⟨P_Ω(A), P_Ω(B)⟩.   (16)

Guess: ≈ pd² + 2pd⁴ + 3pd³?
We prove: ≈ pd² + 2(p/10²)d² + 3(p/10)d² ⟸ a "worse than expected" bound!
Solution of Step 1 (cont’d)
Step 1.1: ∥P_Ω(A)∥_F ≥ (√(2p)/3) d, implied by Lemma 2 and ∥A∥_F ≥ (2/3)d (to be proved).
Note that ∥A∥_F = ∥(XY^T − M) − B∥_F ≥ d − ∥B∥_F, so we need to prove ∥B∥_F ≤ (1/3)d.
We will prove the stronger result ∥B∥_F ≤ O(d²).

Step 1.2: ∥P_Ω(B)∥_F² = ∥P_Ω((U − X)(V − Y)^T)∥_F² ≤ (p/100) d².
The expectation of the LHS is p∥B∥_F² ≈ pd⁴, so we can afford to lose a factor of d².
Technical result for Step 1.2: A Random Graph Lemma
Step 1.2 requires a random graph lemma from [Feige-Ofek-03], [KMO'09], related to the second-largest eigenvalue of the adjacency matrix of a random graph.

Random Graph Lemma
There exist constants C_0, C_1, C_2 such that if |Ω| ≥ C_0 n log n, then w.h.p.

∑_{(i,j)∈Ω} x_i y_j ≤ C_1 p ∥x∥_1 ∥y∥_1 + C_2 √(np) ∥x∥ ∥y∥, ∀ x, y ∈ R^n.   (17)

This does not require Ω to be independent of (x, y)!
Examples
Recall the lemma:

∑_{(i,j)∈Ω} x_i y_j ≤ C_1 p ∥x∥_1 ∥y∥_1 + C_2 √(np) ∥x∥ ∥y∥, ∀ x, y ∈ R^n.

E.g. 1: x = y = (d², ..., d²)/√n; note ∥x∥ = ∥y∥ = d².
LHS = pd⁴; RHS = O(pd⁴ + √(np) d⁴/n) ≈ O(pd⁴). The first term is of the expected order.

E.g. 2: (i_0, j_0) ∈ Ω, x = d² e_{i_0}, y = d² e_{j_0}.
LHS = d⁴; RHS = O(pd² + √(np) d⁴) ≈ O(pd² + d⁴).
Compared with pd⁴ (the expectation), the bound loses either d² (acceptable) or a factor of p (≈ 1/n, unacceptable).

We need "incoherence" to control the second term ⟹ we must not lose p, but we can afford to lose d².
Concluding Remarks
Main contribution: a recovery guarantee for non-convex factorization-based matrix completion.

Key idea: a "locally strongly convex basin":
- applies to many first-order algorithms;
- no resampling!

Math tools:
- perturbation analysis;
- a random graph lemma (stronger than concentration bounds).

Future directions:
- noisy matrix completion;
- removing the penalty on row norms;
- robust PCA, tensor completion, etc.
Thank You!