
Accelerating Iterative Hard Thresholding for Low-Rank Matrix Completion via Adaptive Restart

Trung Vu and Raviv Raich

School of EECS, Oregon State University, Corvallis, OR 97331-5501, USA

{vutru,raich}@oregonstate.edu

May 16, 2019


Outline

1 Problem Formulation

2 Background

3 Main Results

4 Conclusions and Future Work


The Netflix Prize Problem

Rows index users and columns index movies; only some ratings are observed:

    M = [ 4 ? ? ]
        [ ? ? 4 ]
        [ ? 2 ? ]
        [ 4 ? 4 ]

A partially known rating matrix M ∈ R^{m×n} with rank(M) ≤ r.


Low-Rank Matrix Completion Problem

Given the observed entries and the rank bound r = 1, the missing entries are fully determined:

    M = [ 4 ? ? ]                  X* = [ 4 2 4 ]
        [ ? ? 4 ]   given r = 1         [ 4 2 4 ]
        [ ? 2 ? ]  ------------->       [ 4 2 4 ]
        [ 4 ? 4 ]                       [ 4 2 4 ]

with rank-1 SVD

    X* = [1/2, 1/2, 1/2, 1/2]^T · 12 · [2/3, 1/3, 2/3]

Low-rank matrix completion problem:

    find X_ij, (i, j) ∈ S^c
    subject to rank(X) ≤ r and X_ij = M_ij for (i, j) ∈ S,        (r < n ≤ m)

where S is the set of observed indices and S^c is its complement.
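As an illustrative check (not part of the slides), the rank-1 factorization above can be verified numerically with a few lines of NumPy; the variable names are mine:

```python
import numpy as np

# Rank-1 factorization from the example: X* = u * sigma * v^T.
u = np.array([1/2, 1/2, 1/2, 1/2])      # unit-norm left singular vector
v = np.array([2/3, 1/3, 2/3])           # unit-norm right singular vector
sigma = 12.0                            # singular value of the completed matrix

X_star = sigma * np.outer(u, v)         # reconstructs the 4x3 completed matrix
print(X_star)                           # rows [4. 2. 4.] repeated four times
print(np.linalg.matrix_rank(X_star))    # 1
```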


Notations

Sampling operator X_S:

    [X_S]_ij = { X_ij  if (i, j) ∈ S
               { 0     if (i, j) ∈ S^c

    [ 4 2 4 ]        [ 4 0 0 ]
    [ 4 2 4 ]   S    [ 0 0 4 ]
    [ 4 2 4 ]  --->  [ 0 2 0 ]
    [ 4 2 4 ]        [ 4 0 4 ]

Row selection matrix S_(S) ∈ R^{s×mn} corresponding to S: it picks the s observed entries out of the vectorized matrix. For the example above (s = 5),

    S_(S) = [ 1 0 0 0 0 0 0 0 0 0 0 0 ]
            [ 0 0 0 0 0 1 0 0 0 0 0 0 ]
            [ 0 0 0 0 0 0 0 1 0 0 0 0 ]
            [ 0 0 0 0 0 0 0 0 0 1 0 0 ]
            [ 0 0 0 0 0 0 0 0 0 0 0 1 ]

    S_(S) · [4 2 4 4 2 4 4 2 4 4 2 4]^T = [4 4 2 4 4]^T
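A minimal NumPy sketch of the two notations above, using the running 4x3 example; names such as `mask` and `S_mat` are mine, not from the slides:

```python
import numpy as np

# Running example: 4x3 matrix with s = 5 observed entries.
X = np.array([[4., 2., 4.],
              [4., 2., 4.],
              [4., 2., 4.],
              [4., 2., 4.]])
S = [(0, 0), (1, 2), (2, 1), (3, 0), (3, 2)]         # observed indices (0-based)

# Sampling operator X_S: keep observed entries, zero out the rest.
mask = np.zeros(X.shape, dtype=bool)
for i, j in S:
    mask[i, j] = True
X_S = np.where(mask, X, 0.0)                          # [[4 0 0], [0 0 4], [0 2 0], [4 0 4]]

# Row selection matrix S_(S): picks the observed entries out of the vectorized matrix.
m, n = X.shape
S_mat = np.zeros((len(S), m * n))
for row, (i, j) in enumerate(S):
    S_mat[row, i * n + j] = 1.0                       # row-major vectorization
print(S_mat @ X.reshape(-1))                          # [4. 4. 2. 4. 4.]
```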


The rank-r projection of an arbitrary matrix X ∈ R^{m×n} is obtained by hard-thresholding the singular values of X:

    P_r(X) = Σ_{i=1}^{r} σ_i(X) u_i(X) v_i(X)^T

The SVD of the matrix M can be partitioned based on the signal subspace and its orthogonal subspace:

    M = [ U1  U2 ] [ Σ1  0 ] [ V1^T ]
                   [ 0   0 ] [ V2^T ]        with Σ1 ∈ R^{r×r}
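A short NumPy sketch (not from the slides) of the rank-r projection P_r as a truncated SVD:

```python
import numpy as np

def rank_r_projection(X, r):
    """P_r(X): hard-threshold the singular values of X, keeping only the r largest."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```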


Several Formulations of Low-Rank Matrix Completion

    find X_ij, (i, j) ∈ S^c   s.t.   rank(X) ≤ r and X_S = M_S

Convex relaxation (rigorous guarantees, but slow convergence):

    min ‖X‖_*                            s.t. X_S = M_S
    min λ‖X‖_* + (1/2)‖X_S − M_S‖_F²
    min τ‖X‖_* + (1/2)‖X‖_F²             s.t. X_S = M_S

Non-convex (fast convergence, but hard to analyze):

    min rank(X)                          s.t. X_S = M_S
    min ‖X_S − M_S‖_F²                   s.t. rank(X) ≤ r        (*)
    min ‖[XY^T]_S − M_S‖_F²              over X ∈ R^{m×r}, Y ∈ R^{n×r}

where ‖X‖_* = Σ_{i=1}^{n} σ_i(X) is the nuclear norm.



Iterative Hard Thresholding for Matrix Completion

    min_{X ∈ R^{m×n}} (1/2) ‖X_S − M_S‖_F²   s.t.   rank(X) ≤ r        (*)

Iterative hard thresholding (IHT) is a variant of non-convex projected gradient descent:

    X^(k+1) = P_r( X^(k) − α_k [X^(k) − M]_S )

Unlike matrix sensing, the matrix RIP does not hold for the matrix completion problem:

    0 · ‖X‖_F² ≤ ‖[X]_S‖_F² ≤ 1 · ‖X‖_F²

→ Global convergence is non-trivial! [Jain, Meka, and Dhillon 2010]


Local Convergence of IHT

Algorithm 1 IHTSVD
    1: for k = 0, 1, 2, ... do
    2:     X^(k+1) = P_r(Y^(k))
    3:     Y^(k+1) = P_{M,S}(X^(k+1))

    * P_{M,S}(X) = X_{S^c} + M_S

→ IHT with unit step size α_k = 1.

    [ 4 0 0 ]        [ 2 0 2 ]            [ 4 0 2 ]
    [ 0 0 4 ]  P_r   [ 2 0 2 ]  P_{M,S}   [ 2 0 4 ]  P_r
    [ 0 2 0 ]  --->  [ 0 0 0 ]  ------->  [ 0 2 0 ]  --->  ...
    [ 4 0 4 ]        [ 4 0 4 ]            [ 4 0 4 ]

Source: [Chunikhina, Raich, and Nguyen 2014]

[ibid.] If σ = σ_min( S_(S^c) (V2 ⊗ U2) ) > 0, then IHTSVD converges to M locally at a linear rate 1 − σ².
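A minimal NumPy sketch of Algorithm 1 as listed above; the initialization Y^(0) = M_S and the function and argument names are my assumptions:

```python
import numpy as np

def ihtsvd(M_obs, mask, r, num_iters=500):
    """IHTSVD: alternate the rank-r projection P_r with the data-consistency
    step P_{M,S}(X) = X_{S^c} + M_S (IHT with unit step size)."""
    Y = np.where(mask, M_obs, 0.0)                    # assumed Y^(0): observed entries, zeros elsewhere
    X = Y
    for _ in range(num_iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # X^(k+1) = P_r(Y^(k))
        Y = np.where(mask, M_obs, X)                  # Y^(k+1) = P_{M,S}(X^(k+1))
    return X
```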


Linearization of the Rank-r Projection

    P_r(M + ∆) = M + ∆ − U2 U2^T ∆ V2 V2^T + O(‖∆‖_F²)

Local convergence analysis assumes Y^(k) is a perturbed version of M:

    M + E^(k+1) = Y^(k+1) = P_{M,S}( P_r(Y^(k)) ) = P_{M,S}( P_r(M + E^(k)) )

The recursion on the error matrix, E^(k+1) = [ P_r(M + E^(k)) − M ]_{S^c}, can be approximated (to first order) by

    e^(k+1) = A e^(k),   where e^(k) = S_(S^c) vec(E^(k))
    and A = I − S_(S^c) (V2 ⊗ U2)(V2 ⊗ U2)^T S_(S^c)^T.

The recursion is stable if λ_max(A) = 1 − ( σ_min( S_(S^c)(V2 ⊗ U2) ) )² < 1.
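As a numerical companion to the analysis above, one can compute σ = σ_min(S_(S^c)(V2 ⊗ U2)) and the predicted local rate 1 − σ² directly. A hedged sketch (names are mine; column-major vectorization is assumed so that V2 ⊗ U2 matches vec(U2 Z V2^T)):

```python
import numpy as np

def predicted_local_rate(M, mask, r):
    """Compute sigma = sigma_min(S_(S^c)(V2 kron U2)) and the predicted
    local linear rate 1 - sigma^2 of IHTSVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=True)
    U2 = U[:, r:]                                # orthogonal complement of the column space
    V2 = Vt[r:, :].T                             # orthogonal complement of the row space
    B = np.kron(V2, U2)                          # (V2 kron U2); rows follow column-major vec()
    unobserved = ~mask.flatten(order="F")        # rows of B indexed by S^c
    sigma = np.linalg.svd(B[unobserved, :], compute_uv=False).min()
    return sigma, 1.0 - sigma**2
```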


[Plot legend: Our work: 1 − σ; Previous work: 1 − σ².]

Figure 1: The distance to the solution (in log scale) as a function of the iteration number for various algorithms. m = 50, n = 40, r = 3, and s = 1000. All algorithms share the same computational complexity per iteration, O(mnr), except SVT, O(mn²), [Cai, Candes, and Shen 2010] and AM, O(sm²r² + m³r³), [Jain, Netrapalli, and Sanghavi 2013].



Our Contribution

1 Analyze the local convergence of accelerated IHTSVD for solving the rank-constrained least squares problem (*).

2 Propose a practical way to select the momentum step size that enables us to recover the optimal rate of convergence near the solution.


Nesterov’s Accelerated Gradient

Nesterov’s Accelerated Gradient (NAG) is a simple modification to gradient descent that provably accelerates the convergence:

    x^(k+1) = y^(k) − α_k ∇f(y^(k))
    y^(k+1) = x^(k+1) + β_k ( x^(k+1) − x^(k) )

If f is a µ-strongly convex, L-smooth function, NAG can improve the linear convergence rate from 1 − µ/L to 1 − √(µ/L) by setting

    α_k = 1/L,   β_k = (1 − √(µ/L)) / (1 + √(µ/L)).    [Nesterov 2004]

Iteration complexity: O(√κ), compared to O(κ) for gradient descent, where κ = L/µ is the condition number.
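A generic NAG sketch for a smooth, strongly convex objective; the quadratic example and its constants are my choices for illustration, not from the slides:

```python
import numpy as np

def nag(grad_f, x0, alpha, beta, num_iters=1000):
    """Nesterov's accelerated gradient with constant step sizes alpha and beta."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for _ in range(num_iters):
        x_next = y - alpha * grad_f(y)          # gradient step from the extrapolated point
        y = x_next + beta * (x_next - x)        # momentum extrapolation
        x = x_next
    return x

# Example: f(x) = 0.5 x^T A x - b^T x with mu = 1, L = 10,
# alpha = 1/L and beta = (1 - sqrt(mu/L)) / (1 + sqrt(mu/L)).
A = np.diag([1.0, 10.0]); b = np.array([1.0, 1.0])
mu, L = 1.0, 10.0
beta = (1 - np.sqrt(mu / L)) / (1 + np.sqrt(mu / L))
x_min = nag(lambda y: A @ y - b, np.zeros(2), 1.0 / L, beta)   # converges to A^{-1} b
```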


The Proposed NAG-IHT

Algorithm 2 NAG-IHT
    1: for k = 0, 1, 2, ... do
    2:     X^(k+1) = P_r(Y^(k))
    3:     Y^(k+1) = P_{M,S}( X^(k+1) + β_k (X^(k+1) − X^(k)) )

    Method                              # Ops./Iter.   Local conv. rate   # Iters. for ε-accuracy
    IHTSVD                              O(mnr)         1 − σ²             (1/σ²) log(1/ε)
    NAG-IHT with β_k = (1−σ)/(1+σ)      O(mnr)         1 − σ              (1/σ) log(1/ε)

    * σ = σ_min( S_(S^c) (V2 ⊗ U2) )
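A NumPy sketch of Algorithm 2, under the same assumed initialization as the IHTSVD sketch (X^(0) = Y^(0) = M_S); in practice β would be set to (1 − σ)/(1 + σ) if σ were known:

```python
import numpy as np

def nag_iht(M_obs, mask, r, beta, num_iters=500):
    """NAG-IHT: IHTSVD with a momentum term folded into the data-consistency step."""
    Y = np.where(mask, M_obs, 0.0)                    # assumed Y^(0)
    X_prev = Y.copy()                                 # assumed X^(0)
    for _ in range(num_iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # X^(k+1) = P_r(Y^(k))
        Z = X + beta * (X - X_prev)                   # momentum extrapolation
        Y = np.where(mask, M_obs, Z)                  # Y^(k+1) = P_{M,S}(Z)
        X_prev = X
    return X
```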

A Practical Method for Step Size Selection

Practical issue: fast convergence requires prior knowledge of global parameters related to the objective function (β_k = (1 − σ)/(1 + σ)).

Solution: adaptive restart [O’Donoghue and Candes 2015]

    Use an incremental momentum β_k = (t − 1)/(t + 2), starting at t = 1.
    When f(x^(k+1)) > f(x^(k)), reset t = 1.

[Plot from O’Donoghue and Candes 2015: convergence of gradient descent and of NAG with no restart, with fixed restarts every 100, 400, 700, and 1000 iterations, with momentum set using q = µ/L, and with the adaptive function and gradient restart schemes.]


The Proposed Adaptive Restart Scheme for NAG-IHT

Algorithm 3 ARNAG-IHT
    1: t = 1
    2: f_0 = ‖X^(0)_S − M_S‖_F²
    3: for k = 0, 1, 2, ... do
    4:     X^(k+1) = P_r(Y^(k))
    5:     Y^(k+1) = P_{M,S}( X^(k+1) + ((t−1)/(t+2)) (X^(k+1) − X^(k)) )
    6:     f_{k+1} = ‖X^(k+1)_S − M_S‖_F²
    7:     if f_{k+1} > f_k then t = 1 else t = t + 1        ▷ function scheme


Numerical Evaluation

Figure 2: The distance to the solution (in log scale) as a function of the iteration number for IHT algorithms (solid) and their corresponding theoretical bounds up to a constant (dashed). m = 50, n = 40, r = 3, and s = 1000. *NAG-IHT using the optimal step size is not applicable in practice.



Conclusions and Future Work

Conclusions

Propose Nesterov’s Accelerated Gradient for iterative hard thresholding for matrix completion.

Analyze NAG-IHT with the optimal step size and prove that the iteration complexity improves from O(1/σ²) to O(1/σ) after acceleration.

Propose adaptive restart for sub-optimal step size selection that recovers the optimal rate of convergence in practice.

Future work

Extend the local convergence analysis to real-world cases in which the underlying matrix is noisy and/or not close to being low rank.

Convergence under a simple initialization suggests potential analysis of the global convergence of our algorithm.


References I

Cai, J.-F., E. Candes, and Z. Shen (2010). “A Singular Value Thresholding Algorithm for Matrix Completion”. In: SIAM Journal on Optimization 20.4, pp. 1956–1982.

Chunikhina, E., R. Raich, and T. Nguyen (2014). “Performance analysis for matrix completion via iterative hard-thresholded SVD”. In: 2014 IEEE Workshop on Statistical Signal Processing (SSP), pp. 392–395.

Jain, P., R. Meka, and I. Dhillon (2010). “Guaranteed Rank Minimization via Singular Value Projection”. In: Advances in Neural Information Processing Systems (NIPS), pp. 937–945.

Jain, P., P. Netrapalli, and S. Sanghavi (2013). “Low-rank Matrix Completion Using Alternating Minimization”. In: Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pp. 665–674.

Nesterov, Y. (2004). Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers.

O’Donoghue, B. and E. Candes (2015). “Adaptive Restart for Accelerated Gradient Schemes”. In: Foundations of Computational Mathematics 15.3, pp. 715–732.
