Guaranteed Matrix Completion via Non-convex Factorization
Zhi-Quan (Tom) Luo
The Chinese University of Hong Kong, Shenzhen, China and
The University of Minnesota, Minneapolis, MN, USA
joint work with Ruoyu Sun
July 5, 2015
Outline
1. Introduction and Result
2. Technical Details
   - Formulation, Algorithm and Initial Point
   - Proof Ideas
   - Proof Sketch of Step 1
Non-convex Matrix Completion Zhi-Quan Luo 2 / 36
Reference
Ruoyu Sun and Zhi-Quan Luo, "Guaranteed Matrix Completion via Non-convex Factorization," arXiv preprint arXiv:1411.8003 (2014); FOCS 2015, forthcoming.
Netflix Challenge
Netflix prize (2006): 500,000 customers, 17,000 movies, 100 million ratings (1.2% of all possible ratings).
Challenge: predict the missing ratings ($1,000,000 prize).
One approach: exploit low-rank structure.
Low-rank matrix completion: complete a low-rank matrix from a few observed entries [Candes-Recht-09].
Idea of Matrix Completion (MC)
Low-rank: M ∈ R^{n×n} with rank r ≪ n.
Partial observation: Ω ⊂ [n] × [n] is the set of sampled positions.
If M = E_{11}, there is little hope to recover M:

E_{11} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}

⟹ assumptions on M and Ω are needed.
Matrix Completion
Result [Candes-Recht'09]: recover M ∈ R^{n×n} from O(nr log n) entries via convex optimization, if
- M is "generic" (incoherent);
- the set of observations Ω is "generic" (random).

A rank-1 example, M = xy^T:

M = \begin{pmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & x_2 y_3 & \cdots & x_2 y_n \\ \vdots & \vdots & \vdots & & \vdots \\ x_n y_1 & x_n y_2 & x_n y_3 & \cdots & x_n y_n \end{pmatrix}

O(nr log n) samples vs. the ambient dimension n².
- O(nr) is necessary: there are nr free variables.
- The log n factor is due to the coupon-collector effect: n log n samples are needed to cover every row/column.
Why is Matrix Completion Interesting?
1) A natural extension of sparsity to the matrix domain.
- Related low-rank pursuit approaches: phase retrieval, robust PCA, tensor completion.
2) Practice: lots of applications.
- Recommendation systems, computer vision, network anomaly detection, etc.
3) Theory: a fundamental mathematical problem.
- Related to statistics, learning, optimization, linear algebra, information theory, etc.
Matrix Factorization vs. Nuclear Norm

Two popular methods: nuclear norm minimization and matrix factorization.

Method 1: nuclear norm minimization [Candes-Recht-09]

min_{Z ∈ R^{n×n}} (1/2) ∥P_Ω(M − Z)∥_F² + λ∥Z∥_*,   (1)

where

P_Ω(X)_{ij} = X_{ij} if (i, j) ∈ Ω, and P_Ω(X)_{ij} = 0 if (i, j) ∉ Ω.

n² variables; nonsmooth but convex.

Method 2: matrix factorization (MF) based formulation [Koren09]:

P0: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} (1/2) ∥P_Ω(M − XY^T)∥_F² + λ(∥X∥_F² + ∥Y∥_F²).   (2)

nr variables; smooth but nonconvex.
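For concreteness, both objectives fit in a few lines of numpy. This is a sketch under my own naming (`P_Omega`, `nuclear_norm_obj`, `mf_obj` are not from the talk); `mask` is a boolean matrix marking Ω:

```python
import numpy as np

def P_Omega(X, mask):
    """Sampling operator: keep X[i, j] where (i, j) is observed, zero elsewhere."""
    return X * mask

def nuclear_norm_obj(Z, M, mask, lam):
    """Method 1: 0.5 * ||P_Omega(M - Z)||_F^2 + lam * ||Z||_*  (convex, n^2 variables)."""
    resid = P_Omega(M - Z, mask)
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, ord="nuc")

def mf_obj(X, Y, M, mask, lam):
    """Method 2: 0.5 * ||P_Omega(M - X Y^T)||_F^2 + lam * (||X||_F^2 + ||Y||_F^2)."""
    resid = P_Omega(M - X @ Y.T, mask)
    return 0.5 * np.sum(resid ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))
```

Note the variable counts in code mirror the slide: `Z` has n² entries while `X`, `Y` together have 2nr.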
How can this be possible?
The matrix factorization (MF) formulation [Koren09]:

P0: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} (1/2) ∥P_Ω(M − XY^T)∥_F²   (3)

Even if we can solve (P0), how can we be sure that M = X*(Y*)^T, i.e., that

∥P_Ω(M − XY^T)∥_F² = 0 ⟹ ∥M − XY^T∥_F² = 0?

Observation: let p = |Ω|/n². If Ω and M, X, Y are independent, then

E[∥P_Ω(M − XY^T)∥_F²] = p ∥M − XY^T∥_F².

Difficulty: the iterates (X, Y) cannot be independent of Ω!
Resampling = use fresh samples at each iteration and discard them afterwards.
- Not practical: wastes samples; the accuracy is pre-determined.
- No exact recovery: exact recovery would require infinitely many samples!
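The identity E[∥P_Ω(S)∥_F²] = p∥S∥_F² for a fixed S is easy to check by simulation; a small Monte Carlo sketch (sizes, seed, and trial count are my arbitrary choices):

```python
import numpy as np

# Monte Carlo check of E||P_Omega(S)||_F^2 = p * ||S||_F^2 when S is fixed
# before Omega is drawn.
rng = np.random.default_rng(0)
n, p, trials = 50, 0.3, 2000
S = rng.standard_normal((n, n))          # fixed matrix, independent of the samples
acc = 0.0
for _ in range(trials):
    mask = rng.random((n, n)) < p        # i.i.d. Bernoulli(p) observation set Omega
    acc += np.sum((S * mask) ** 2)       # ||P_Omega(S)||_F^2 for this draw
ratio = (acc / trials) / np.sum(S ** 2)
print(ratio)                             # close to p = 0.3
```

If S were instead allowed to depend on the drawn Ω, this ratio could be driven far from p, which is exactly the difficulty described above.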
Nuclear Norm Formulation
The nuclear-norm formulation (an SDP) is convex, hence globally convergent algorithms exist:
- standard SDP solvers (interior-point methods);
- proximal gradient method and variants [Toh-Yun10], [Ma-Goldfarb-Chen11];
- linear convergence under certain conditions [Agarwal-Negahban-Wainwright12], [Hou-Zhou-So-Luo13].

Pros: convex; guaranteed recovery [Candes-Recht-09].
Cons: slow for big data (requires an SVD per iteration); large memory requirement.
Matrix Factorization Formulation
Algorithms for the non-convex MF model (converge to stationary points):
- Alt-Min [Koren-09], [Wen-Yin-Zhang-12];
- SGD (stochastic gradient descent) [Koren-09], [Funk-06];
- other Alt-Min methods: multi-block [Yu-Hsieh-Si-Dhillon-12], block majorization [Hastie-Mazumder-Lee-Zadeh-14].

Pros: fast in practice; little storage; flexible (can incorporate data aspects).
Cons: limited performance analysis (more later).

Our goal: bridge the gap between theory and practice.
Formulation
Start from a constrained version (extra requirements on the factors X, Y):

P1': min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F²
     s.t. (X, Y) ∈ K1 := {(X, Y) : ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T},
          (X, Y) ∈ K2 := {(X, Y) : ∥X^{(i)}∥ ≤ β_1, ∥Y^{(i)}∥ ≤ β_1, ∀i}.   (4)

K1: boundedness. K2: incoherence.
Formulation
Consider a penalized version of the above problem:

P1: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F̃(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F² + G(X, Y),   (5)

where

G(X, Y) := ρ G1(3∥X∥_F² / (2β_T²)) + ρ G1(3∥Y∥_F² / (2β_T²)) + ρ Σ_{i=1}^n G1(3∥X^{(i)}∥² / (2β_1²)) + ρ Σ_{j=1}^n G1(3∥Y^{(j)}∥² / (2β_1²)),

G1(z) = max(z − 1, 0)², and ρ is a large enough constant.
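The penalty G is cheap to evaluate. A minimal numpy sketch (function and argument names are mine), where the rows of X and Y play the role of X^{(i)}, Y^{(j)}:

```python
import numpy as np

def G1(z):
    """G1(z) = max(z - 1, 0)^2: zero on (-inf, 1], quadratic growth beyond."""
    return max(z - 1.0, 0.0) ** 2

def penalty_G(X, Y, beta_T, beta_1, rho):
    """G(X, Y): vanishes when the Frobenius and row norms are safely inside
    the K1 / K2 bounds, and grows quadratically once a bound is exceeded."""
    g = G1(3 * np.sum(X ** 2) / (2 * beta_T ** 2))
    g += G1(3 * np.sum(Y ** 2) / (2 * beta_T ** 2))
    g += sum(G1(3 * np.sum(X[i] ** 2) / (2 * beta_1 ** 2)) for i in range(X.shape[0]))
    g += sum(G1(3 * np.sum(Y[j] ** 2) / (2 * beta_1 ** 2)) for j in range(Y.shape[0]))
    return rho * g
```

On any (X, Y) whose norms stay well inside K1 ∩ K2 the penalty is exactly zero, so it does not perturb feasible iterates; it only discourages escape from the constraint sets.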
Our Contributions
Theorem [Sun-L.'14]
Suppose M ∈ R^{n×n} has rank r,
- is incoherent (with parameter μ), and
- has condition number κ = σ_1/σ_r.
For an i.i.d. random observation set Ω ⊆ [n] × [n] with size

|Ω| ≥ C n log n · poly(r, κ, μ),   (6)

and with a specific initialization, many standard algorithms for (P1) converge to a global optimum AND recover M w.h.p.

Remark: the initialization and formulation will be specified later. Standard algorithms include GD, SGD, and Alt-Min.
Proof Idea (1)
Why is global convergence possible for a non-convex problem?
Basin of attraction + good initial point.

(I) Problem property: a basin of attraction (hard):
- a convex neighborhood in the space of (X, Y);
- nonconvex in the space of M;
- every stationary point (X*, Y*) of (P1) in the basin satisfies X*(Y*)^T = M.
Proof Idea (2)
(II) Algorithm properties:
(II.a) convergence to stationary points (easy);
(II.b) all iterates stay in the basin (moderate):
  (II.b1) the initial point is in the basin (extension of [KMO'09]);
  (II.b2) all subsequent iterates stay in the basin.

Proof of main result: (II) shows that Algorithms 1-4 converge to a stationary point in the basin, which by (I) equals M (the global optimum).
Applicable Algorithms
Define x_t = (X_t, Y_t) and Δ_t := x_{t+1} − x_t.
Our result applies to any algorithm with two properties (besides the initialization):
(II.a) it converges to a stationary point;
(II.b) it satisfies one of three mild conditions (which keep the iterates in the "basin"):
1) F̃(x_t + λΔ_t) ≤ 2F̃(x_0), ∀λ ∈ [0, 1], ∀t;
2) 1 = argmin_{λ ∈ R} ψ(x_t, Δ_t; λ), where ψ is a "convex upper bound", ∀t;
3) F̃(x_t) ≤ 2F̃(x_0) and d(x_t, x_0) ≤ (5/6)δ, ∀t,
where δ = O(σ_r).
Applicable Algorithms
Three typical classes of applicable algorithms:
- GD with constant step size and SGD satisfy 1);
- block coordinate descent, or more generally BSUM, satisfies 2);
- any descent algorithm with a "bounded" update (d(x_t, x_0) ≤ (5/6)δ) at each iteration satisfies 3).
Related Works in Matrix Completion
- Result for the MF model on the Grassmann manifold [Keshavan-Montanari-Oh'09].
- Results for Alt-Min and variants [Keshavan11], [Jain-Netrapalli-Sanghavi12], [Hardt13].

Table: Comparison with Recent Studies on Alt-Min

               | Studies on Alt-Min        | Our work
Applicability  | one algorithm (Alt-Min)   | many algorithms
Form           | require resampling        | standard form
Technique      | analysis of power method  | random graph + perturbation
Related Works for Other Problems
Phase retrieval (PR):
- nuclear norm formulation (convex) [Candes-Eldar-Strohmer-Voroninski-11];
- non-convex resampling-based Alt-Min [Netrapalli-Jain-Sanghavi-13];
- non-convex gradient descent (no resampling) [Candes-Li-Soltanolkotabi-14].

Non-convex sparse regression: [Zhang-Zhang-12], [Loh-Wainwright-13], [Fan-Xue-Zou-14].
EM algorithm: [Balakrishnan-Wainwright-14].
Power method for computing eigenvectors:
- sparse PCA [Wang-Lu-Liu-14], [Yuan-Zhang-13], [Deshpande-Montanari-14];
- tensor decomposition [Anandkumar-Ge-Hsu-Kakade-12], [Anandkumar-Ge-Janzamin-14].
Remarks on Non-convex Guarantee
Remark 1: geometry vs. algorithm-specific analysis.
- Most works analyze only one algorithm, especially the power method.
- Our work is about geometry: the "basin of attraction".
- Few works address geometry: [Keshavan-Montanari-Oh'09], [Balakrishnan-Wainwright-14] (EM), [Candes-Li-Soltanolkotabi-14] (PR).
- With so many algorithms (and variants) for MC, and more coming out, it is better to give a unified analysis; [KMO'09] does not cover SGD.

Remark 2: good initialization is often necessary.
- Exception: sparse regression (perhaps due to the convex loss?).
- Good initialization has not yet been found for [Balakrishnan-Wainwright-14] (EM) or [Deshpande-Montanari-14] (PCA).
Formulation
Start from a constrained version (extra requirements on the factors X, Y):

P1': min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F²   (8a)
     s.t. ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T,   (8b)
          ∥X^{(i)}∥ ≤ β_1, ∥Y^{(i)}∥ ≤ β_1, ∀i.   (8c)

We consider a penalized version of the above problem:

P1: min_{X ∈ R^{n×r}, Y ∈ R^{n×r}} F̃(X, Y) := (1/2) ∥P_Ω(M − XY^T)∥_F² + G(X, Y),   (9)

where

G(X, Y) := ρ G1(3∥X∥_F² / (2β_T²)) + ρ G1(3∥Y∥_F² / (2β_T²)) + ρ Σ_{i=1}^n G1(3∥X^{(i)}∥² / (2β_1²)) + ρ Σ_{j=1}^n G1(3∥Y^{(j)}∥² / (2β_1²)),

in which G1(z) = max(z − 1, 0)² and ρ is a large enough constant.
Algorithms
Consider four typical algorithms, all using the same initial point (described on the next slide).
- Algorithm 1: GD (gradient descent).
- Algorithm 2: two-block Alt-Min.
- Algorithm 3: row BSUM (block successive upper-bound minimization); a different choice of blocks compared to Algorithm 2.
- Algorithm 4: SGD.

Remark: we cover three classes of first-order methods: GD, alternating methods (BCD-type), and SGD (incremental gradient methods).
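As a concrete instance, Algorithm 1 (GD with a constant step size) applied to the unpenalized loss F can be sketched as follows. This is a toy sketch, not the paper's exact procedure; the step size and iteration count are arbitrary choices of mine:

```python
import numpy as np

def grad_F(X, Y, M, mask):
    """Gradients of F(X, Y) = 0.5 * ||P_Omega(M - X Y^T)||_F^2 (penalty G omitted)."""
    R = (X @ Y.T - M) * mask             # P_Omega(X Y^T - M)
    return R @ Y, R.T @ X

def objective(X, Y, M, mask):
    return 0.5 * np.sum((((X @ Y.T) - M) * mask) ** 2)

def gradient_descent(X, Y, M, mask, step=5e-3, iters=300):
    """Algorithm 1: plain GD with a constant step size on both factors."""
    for _ in range(iters):
        gX, gY = grad_F(X, Y, M, mask)
        X, Y = X - step * gX, Y - step * gY
    return X, Y
```

Alt-Min and SGD differ only in the update rule (exact minimization over one factor, or a stochastic gradient on single observed entries); the objective and gradients are the same.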
Choice of Initial Point
Let p = |Ω|/n². The initialization consists of two steps.

Step 1: SVD.
Compute (X̄_0, D_0, Ȳ_0) = SVD_r((1/p) P_Ω(M)).
Define X̃_0 = X̄_0 D_0^{1/2}, Ỹ_0 = Ȳ_0 D_0^{1/2}.

Step 2: Scaling (to force incoherence).
Define new matrices X_0, Y_0 so that ∥X_0^{(i)}∥² ≤ 2β_1²/3 and ∥Y_0^{(j)}∥² ≤ 2β_1²/3:

X_0^{(i)} = (X̃_0^{(i)} / ∥X̃_0^{(i)}∥) · min{∥X̃_0^{(i)}∥, √(2/3) β_1}, ∀i,
Y_0^{(j)} = (Ỹ_0^{(j)} / ∥Ỹ_0^{(j)}∥) · min{∥Ỹ_0^{(j)}∥, √(2/3) β_1}, ∀j.   (10)

Claim of good initialization: (X_0, Y_0) ∈ K1 ∩ K2 ∩ K(δ), where

K(δ) := {(X, Y) : ∥M − XY^T∥ ≤ δ = O(σ_r)}.
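The two-step initialization translates directly into numpy. A sketch under my own naming, where a truncated `np.linalg.svd` plays the role of SVD_r and `clip_rows` implements the scaling (10):

```python
import numpy as np

def initialize(M_obs, mask, r, beta_1):
    """Step 1: rank-r SVD of (1/p) P_Omega(M); Step 2: rescale any row whose
    norm exceeds sqrt(2/3) * beta_1, enforcing the incoherence bound."""
    p = mask.mean()
    U, s, Vt = np.linalg.svd(M_obs * mask / p, full_matrices=False)
    X0 = U[:, :r] * np.sqrt(s[:r])       # X0 = Xbar_0 D_0^{1/2}
    Y0 = Vt[:r].T * np.sqrt(s[:r])       # Y0 = Ybar_0 D_0^{1/2}
    cap = np.sqrt(2.0 / 3.0) * beta_1
    def clip_rows(Z):
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        return Z * np.minimum(norms, cap) / np.maximum(norms, 1e-12)
    return clip_rows(X0), clip_rows(Y0)
```

Note the clipping only shrinks rows that are too long; rows already satisfying the bound are left untouched.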
(I) Local Strong Convexity?
Local strong convexity: ⟨∇f(x) − ∇f(x*), x − x*⟩ ≥ c∥x − x*∥², for all x close to x*.
(I) Local Convexity-Like Property
Lemma 1
Under the conditions of Theorem 1 (the main result), w.h.p. the following holds: for any (X, Y) ∈ K1 ∩ K2 ∩ K(δ), there exist U, V ∈ R^{n×r} such that UV^T = M and

⟨∇_X F̃(X, Y), X − U⟩ + ⟨∇_Y F̃(X, Y), Y − V⟩   (11a)
≥ (p / (9β_T)) (∥X − U∥_F² + ∥Y − V∥_F²) ≥ (p/9) ∥M − XY^T∥_F².   (11b)

Remark: (11) can be approximately viewed as local (strong) convexity.
Main Difficulties and Techniques
Difficulty 1: the iterates depend on Ω.
- It is hard to estimate P_Ω(Z) if Z depends on Ω.
- Resampling avoids the difficulty, artificially.
- Solution: a random graph lemma in [K-M-O-09] (due to [Feige-Ofek-03]).

Difficulty 2: distance to a factor space.
- Recall that (11) bounds sup_{UV^T = M} ⟨(U, V) − (X, Y), ∇F̃⟩.
- U, V are coupled: we must estimate dist((X, Y), S), where S = {(U, V) : UV^T = M}.
- In [K-M-O-09], d(U, X) and d(V, Y) are estimated independently (on the Grassmann manifold).
- Related to perturbation analysis [Wedin-70], but much more difficult.
A Decomposition Result
If M is close to XY^T, then there exists a factorization M = UV^T such that U, V are close to X, Y, respectively.

Proposition 1
Suppose
(1) ∥M − XY^T∥_F = d ≤ Σ_min/10;
(2) ∥X∥_F ≤ β_T, ∥Y∥_F ≤ β_T.
Then there exist U, V ∈ R^{n×r} such that
(a) UV^T = M;
(b) ∥U − X∥_F ≤ (2β_T/Σ_min) d and ∥V − Y∥_F ≤ (4β_T/Σ_min) d.

Related to perturbation analysis [Wedin70]: if M and Z are close, then their singular vector spaces are also close.
Proposition 1 is not enough! Lemma 1 requires a more advanced version of Proposition 1.
Proof of Lemma 1: Outline
Since F̃ = F + G, we only need to find a factorization M = UV^T such that:

Step 1:
φ_F := ⟨∇_X F, X − U⟩ + ⟨∇_Y F, Y − V⟩ ≥ (p/9) d²,   (12)
where d := ∥M − XY^T∥_F ≤ O(Σ_min).

Step 2:
φ_G := ⟨∇_X G, X − U⟩ + ⟨∇_Y G, Y − V⟩ ≥ 0.   (13)

In the rest, we only prove Step 1. (We omit Step 2, which is much more complicated.)
Difficulty of Step 1
Need to prove φ_F ≥ (p/9) d², where

φ_F = ⟨∇_X F, X − U⟩ + ⟨∇_Y F, Y − V⟩
    = ⟨P_Ω(XY^T − M) Y, X − U⟩ + ⟨P_Ω(XY^T − M)^T X, Y − V⟩
    = ⟨P_Ω(XY^T − M), (X − U)Y^T + X(Y − V)^T⟩.   (14)

Question: how can we lower-bound ∥P_Ω(XY^T − M)∥_F in terms of d = ∥M − XY^T∥_F?

Guess: since E[∥P_Ω(S)∥_F²] = p∥S∥_F², perhaps w.h.p. ∥P_Ω(S)∥_F² ≥ pd²/2 for all S with ∥S∥_F = d?

Game: pick Ω first; an adversary then picks S depending on Ω.
If (1, 1) ∉ Ω, the adversary picks S = dE_{11}; then ∥P_Ω(S)∥_F² = 0 < pd²/2.

Remark:
- In our problem, X_k, Y_k depend on Ω.
- Resampling avoids this technical difficulty, but only artificially.
Solution of Step 1
Lemma 2 [C-R-09]: ∥P_Ω(A)∥_F² ≥ (p/2) ∥A∥_F² for all

A ∈ T := {UW_2^T + W_1 V^T : W_1, W_2 ∈ R^{n×r}}.   (15)

Intuition: T is the "tangent space" at the fixed incoherent matrix M, and is thus independent of Ω.

Define A = U(Y − V)^T + (X − U)V^T ∈ T and B = (X − U)(Y − V)^T; then XY^T − M = A + B, and

φ_F = ⟨P_Ω(XY^T − M), (X − U)Y^T + X(Y − V)^T⟩
    = ⟨P_Ω(A + B), A + 2B⟩
    = ∥P_Ω(A)∥_F² + 2∥P_Ω(B)∥_F² + 3⟨P_Ω(A), P_Ω(B)⟩.   (16)

Guess: ≈ pd² + 2pd⁴ + 3pd³?
We prove: ≈ pd² + 2(p/10²)d² + 3(p/10)d² ⟸ a "worse than expected" bound!
Solution of Step 1 (cont’d)
Step 1.1: ∥P_Ω(A)∥_F ≥ (√(2p)/3) d, implied by Lemma 2 and ∥A∥_F ≥ (2/3)d (to be proved).
Note that ∥A∥_F = ∥(XY^T − M) − B∥_F ≥ d − ∥B∥_F, so we need to prove ∥B∥_F ≤ (1/3)d.
We will prove the stronger result ∥B∥_F ≤ O(d²).

Step 1.2: ∥P_Ω(B)∥_F² = ∥P_Ω((U − X)(V − Y)^T)∥_F² ≤ (p/100) d².
The expectation of the LHS is p∥B∥_F² ≈ pd⁴, so we can afford to lose a factor of d².
Technical result for Step 1.2: A Random Graph Lemma
Step 1.2 requires a random graph lemma from [Feige-Ofek-03], [KMO'09], related to the second-largest eigenvalue of the adjacency matrix of a random graph.

Random Graph Lemma
There exist constants C_0, C_1, C_2 such that if |Ω| ≥ C_0 n log n, then w.h.p.

∑_{(i,j)∈Ω} x_i y_j ≤ C_1 p ∥x∥_1 ∥y∥_1 + C_2 √(np) ∥x∥ ∥y∥, ∀ x, y ∈ R^n.   (17)

This does not require Ω to be independent of (x, y)!
Examples
Recall the lemma:

∑_{(i,j)∈Ω} x_i y_j ≤ C_1 p ∥x∥_1 ∥y∥_1 + C_2 √(np) ∥x∥ ∥y∥, ∀ x, y ∈ R^n.

E.g. 1: x = y = (d², ..., d²)/√n; note ∥x∥ = ∥y∥ = d².
LHS = pd⁴; RHS = O(pd⁴ + √(np) d⁴/n) ≈ O(pd⁴). The first term is of the expected order.

E.g. 2: (i_0, j_0) ∈ Ω, x = d² e_{i_0}, y = d² e_{j_0}.
LHS = d⁴; RHS = O(pd² + √(np) d⁴) ≈ O(pd² + d⁴).
Compared with pd⁴ (the expectation), the bound loses either d² (acceptable) or a factor of p (≈ 1/n, unacceptable).

We need "incoherence" to control the second term ⟹ we must not lose p, but we can afford to lose d².
Concluding Remarks
Main contribution: a recovery guarantee for non-convex factorization-based matrix completion.

Key idea: a "locally strongly convex basin":
- applies to many first-order algorithms;
- no resampling!

Math tools:
- perturbation analysis;
- a random graph lemma (stronger than concentration bounds).

Future directions:
- noisy matrix completion;
- removing the penalty on row norms;
- robust PCA, tensor completion, etc.
Thank You!