A hybrid quasi-Newton projected-gradient method with application to Lasso and basis-pursuit denoise
Ewout van den Berg
Human Language Technologies Group
IBM T.J. Watson Research Center
Work done at the Department of Statistics
Stanford University
October 10, 2014
This work was partially supported by National Science Foundation Grant DMS 0906812 (American Recovery and Reinvestment Act).
Background

Basis pursuit denoise

    minimize_x  ‖x‖₁  subject to  ‖Ax − b‖₂ ≤ σ

spgl1 reduces this to a series of Lasso problems [van den Berg, Friedlander, 2008]

    minimize_x  (1/2)‖Ax − b‖₂²  subject to  ‖x‖₁ ≤ τ

Root finding with  τ₊ = τ + (‖r‖₂² − σ‖r‖₂) / ‖Aᵀr‖∞
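The root-finding update above can be sketched in a few lines. This is a minimal illustration of the formula only, not the spgl1 interface; `update_tau` is a hypothetical helper name.

```python
import numpy as np

def update_tau(tau, A, x, b, sigma):
    # One root-finding step on the Pareto curve: given the Lasso solution x
    # at the current tau, move tau toward the value where ||Ax - b||_2 = sigma.
    r = b - A @ x                          # residual of the current Lasso solve
    rnorm = np.linalg.norm(r)
    dual = np.linalg.norm(A.T @ r, np.inf)
    # tau+ = tau + (||r||_2^2 - sigma*||r||_2) / ||A^T r||_inf
    return tau + (rnorm**2 - sigma * rnorm) / dual
```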
Background

Lasso problem

    minimize_x  (1/2)‖Ax − b‖₂²  subject to  ‖x‖₁ ≤ τ

General form

    minimize_x  f(x)  subject to  x ∈ C

Solved using the spectral projected-gradient (spg) method:

    d  = −β∇f(x),              x₊ = P(x + αd)
or
    d  = P(x − β∇f(x)) − x,    x₊ = x + αd

With
- β: Barzilai-Borwein scaling parameter [Barzilai, Borwein, 1988]
- α: step length from a non-monotone line search [Birgin et al., 2000]
- P: orthogonal projection onto C

    P(x) := argmin_v ‖x − v‖₂  subject to  v ∈ C
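One spg iteration for the Lasso formulation can be sketched as below. The names are illustrative; the projection onto the ℓ₁ ball uses the standard sort-based soft-thresholding, and a fixed α stands in for the non-monotone line search of the actual method.

```python
import numpy as np

def project_l1(x, tau):
    # Euclidean projection onto {v : ||v||_1 <= tau} via soft-thresholding;
    # the threshold is found by sorting, O(n log n) overall.
    if np.abs(x).sum() <= tau:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u > (css - tau) / np.arange(1, len(u) + 1))[0][-1]
    theta = (css[k] - tau) / (k + 1)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def spg_step(x, grad, beta, alpha, tau):
    # d = P(x - beta*grad) - x, then x+ = x + alpha*d
    # (alpha comes from a non-monotone line search in the real method)
    d = project_l1(x - beta * grad, tau) - x
    return x + alpha * d
```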
Motivation

Observation
- (Sometimes) difficult to get a highly accurate solution
- Iterates remain on the same face of C (same sign pattern)
- Very little progress

Typical solution
- Detect stagnation on a fixed face
- Solve the problem constrained to the given face
- Check optimality for the global problem
- Resume if not optimal

Difficulties
- When to initiate this procedure?
- Solving the subproblem on an incorrect face is wasteful
- Waiting too long defeats the purpose
Outline

1 Propose a new hybrid method for polyhedral C
  (practical only for simple C: ℓ₁ ball, bound constraints, simplex)
2 Convergence of the method
3 Application to Lasso and basis pursuit
Hybrid method

Basic idea
- Take regular spg steps by default
- After each iteration, check whether F(x₊) = F(x) (≠ C)
- Initialize or update the l-bfgs model
- Use the quasi-Newton search direction in the next iteration

Some issues
- The quasi-Newton direction cannot simply be projected onto C
- A naive implementation ignores the problem structure

Solution
- Form an l-bfgs model restricted to the current face
- Capture only the relevant information
Reduced L-BFGS model

Local function
- We only want to model f(x) over the current d-face F
- Find an orthonormal basis B ∈ R^{n×d} for lin(F − F)
- Define f_c : R^d → R for some fixed x₀ ∈ F:

      f_c(c) = f(x₀ + Bc)

- Choosing c = Bᵀ(x − x₀) gives f_c(c) = f(x) for x ∈ F

Model updates
- Standard l-bfgs uses s = x₊ − x and y = ∇f(x₊) − ∇f(x)
- We use s = c₊ − c and y = ∇f_c(c₊) − ∇f_c(c):

      s = Bᵀ(x₊ − x),    y = Bᵀ(∇f(x₊) − ∇f(x))

- Never need to choose x₀
Quasi-Newton search direction

Computing the search direction
- Want to compute the search direction at the current x
- Denote by H⁻¹ the inverse approximate Hessian (in R^{d×d})
- In the reduced space we compute the search direction

      d_c = −H⁻¹∇f_c(c) = −H⁻¹Bᵀ∇f(x)

- Project back to the ambient space using B d_c:

      d = −BH⁻¹Bᵀ∇f(x)

Properties
- Search direction along the face: (x + αd) ∈ F for 0 ≤ α ≤ α_max
- Guaranteed descent direction
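Combining the reduced-space pairs s = Bᵀ(x₊ − x), y = Bᵀ(∇f(x₊) − ∇f(x)) with the standard l-bfgs two-loop recursion gives the reduced direction −H⁻¹Bᵀ∇f(x). A minimal sketch, assuming `S` and `Y` hold the stored reduced-space pairs:

```python
import numpy as np

def lbfgs_direction(grad_c, S, Y):
    # Two-loop recursion: returns -H^{-1} grad_c, where H is the l-bfgs
    # Hessian approximation built from reduced-space pairs (s_k, y_k).
    q = grad_c.copy()
    alphas = []
    for s, y in reversed(list(zip(S, Y))):
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    if S:  # initial Hessian scaling gamma = s'y / y'y from the latest pair
        s, y = S[-1], Y[-1]
        q *= s.dot(y) / y.dot(y)
    for (s, y), a in zip(zip(S, Y), reversed(alphas)):
        rho = 1.0 / y.dot(s)
        b = rho * y.dot(q)
        q += (a - b) * s
    return -q
```

With an empty history this reduces to the steepest-descent direction, which matches the fallback behavior of the hybrid scheme.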
Self-projection cone

Remaining issues
- The quasi-Newton step must be restricted to the face (α ≤ α_max)
- Fall back to an spg step if the line search fails (reset Hessian, history)
- A mechanism is missing to avoid a local minimum on relint(F)

Self-projection cone
- Update and use the l-bfgs model only if −∇f(x₊) ∈ S(F(x))
- Where S(F(x)) is the self-projection cone of F(x):

      S(F(x)) := {d ∈ R^n | ∃α > 0 : F[P(x + αd)] = F(x)}
               = N(x) + lin(F(x) − F(x))
Convergence

Theorem
Let f(x) be a twice continuously differentiable convex function that is
bounded below and for which there exist constants 0 < µ₁ ≤ µ₂ < ∞
such that for all x, v ∈ R^n

    µ₁‖v‖₂² ≤ vᵀ∇²f(x)v ≤ µ₂‖v‖₂².

Then for any starting point x₀ ∈ C, the sequence {x_k} generated by the
hybrid algorithm converges to the minimizer of f(x) over C.

Proof sketch:
- Finitely many quasi-Newton steps: either done, or spg converges
- Infinitely many quasi-Newton steps:
  - Successful quasi-Newton (l-bfgs) step (Liu and Nocedal):

        f(x₊) − f(x*) ≤ (1 − c)(f(x) − f(x*))

  - Finite number of quasi-Newton steps on incorrect faces
Application to general problems

Challenges for general problems
- The projection in SPG is difficult for general C
- The facial structure is often unknown
- Finding an orthonormal basis for a face may be expensive
- This is true even for the weighted ℓ₁ ball

Well suited for simple problems
- Cross polytope (ℓ₁-norm ball)
- Box-constrained problems
- Simplex
Application to Lasso

Additional conditions
- Typically A ∈ R^{m×n} with m < n
- The Hessian is not full rank for d-faces with d > m
- Use quasi-Newton steps only when d ≤ m

Orthogonal projection
- Reduces to soft-thresholding, O(n log n) complexity

Orthonormal basis
- Normalize signs and permute indices: F = conv{e₁, ..., e_{d+1}}
- Compute the QR factorization of [e₂ − e₁, ..., e_{d+1} − e₁]:

      Q_{i,j} = −√(1/(j² + j))   if i ≤ j,
                 √(j/(j + 1))    if i = j + 1,
                 0               otherwise.

- Implicit B and Bᵀ, can be applied in O(n) time
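The closed form of Q makes the implicit application of B and Bᵀ a matter of prefix and suffix sums. A sketch under the slide's formula (function names are illustrative; `make_Q` builds the dense matrix only for checking):

```python
import numpy as np

def make_Q(d):
    # Dense Q in R^{(d+1) x d} from the closed form, for verification only.
    Q = np.zeros((d + 1, d))
    for j in range(1, d + 1):
        Q[:j, j - 1] = -np.sqrt(1.0 / (j * j + j))
        Q[j, j - 1] = np.sqrt(j / (j + 1.0))
    return Q

def apply_Bt(g):
    # (B^T g)_j = -(sum_{i<=j} g_i)/sqrt(j^2+j) + sqrt(j/(j+1)) * g_{j+1},
    # computed for all j with one prefix sum: O(n) total.
    d = len(g) - 1
    j = np.arange(1, d + 1)
    prefix = np.cumsum(g)[:d]                # sum_{i<=j} g_i
    return -prefix / np.sqrt(j * j + j) + np.sqrt(j / (j + 1.0)) * g[1:]

def apply_B(c):
    # (B c)_i = sqrt((i-1)/i) * c_{i-1} - sum_{j>=i} c_j / sqrt(j^2+j),
    # computed for all i with one suffix sum: O(n) total.
    d = len(c)
    j = np.arange(1, d + 1)
    w = c / np.sqrt(j * j + j)
    suffix = np.cumsum(w[::-1])[::-1]        # sum_{j>=i} c_j / sqrt(j^2+j)
    x = np.empty(d + 1)
    x[0] = -suffix[0]
    x[1:] = np.sqrt(j / (j + 1.0)) * c - np.concatenate([suffix[1:], [0.0]])
    return x
```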
Application to Lasso

Self-projection cone
- Let d = −∇f(x) and define

      I₁ = {i ∈ [n] | (xᵢ > 0 and dᵢ < 0) or (xᵢ < 0 and dᵢ > 0)},
      I₂ = {i ∈ [n] | (xᵢ > 0 and dᵢ ≥ 0) or (xᵢ < 0 and dᵢ ≤ 0)},
      I₃ = (I₁ ∪ I₂)ᶜ.

- Set s_j := Σ_{i∈I_j} |dᵢ| and assume that x ∉ relint(C); then d ∈ S(F(x)) iff

      s₁ = s₂ + s₃ and s₃ = 0,   or
      s₁ < s₂ + s₃ and max_{i∈I₃} |dᵢ| ≤ (s₂ − s₁) / |I₁ ∪ I₂|

Line search
- Can compute the maximum step length α_max to stay on the face
- The objective is quadratic, so the minimum along the search direction can be found exactly
- Can compute an interval [α_w^min, α_w^max] satisfying the Wolfe conditions
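The index sets and sums above translate directly into a membership test. A sketch transcribing the slide's condition (the helper name is hypothetical; the caller passes d = −∇f(x) and is responsible for x ∉ relint(C)):

```python
import numpy as np

def in_self_projection_cone(x, d, tol=1e-12):
    # Test whether d lies in the self-projection cone S(F(x)) of the
    # l1-ball face containing x, per the index-set condition on the slide.
    I1 = ((x > 0) & (d < 0)) | ((x < 0) & (d > 0))
    I2 = ((x > 0) & (d >= 0)) | ((x < 0) & (d <= 0))
    I3 = ~(I1 | I2)                          # here: the zero entries of x
    s1, s2, s3 = (np.abs(d[I]).sum() for I in (I1, I2, I3))
    if s3 <= tol and abs(s1 - (s2 + s3)) <= tol:
        return True
    if s1 < s2 + s3:
        m = np.abs(d[I3]).max() if I3.any() else 0.0
        return m <= (s2 - s1) / max(I1.sum() + I2.sum(), 1)
    return False
```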
Numerical experiments

- 10 Sparco problems, each with three τ values [van den Berg et al., 2009]
- Random problems: A, A + c, b = Ax, random b
- Heaviside matrix, random b
Numerical experiments

Heaviside matrix, random b

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Random 300×800 A, random b

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Random 300×800 A, random b

[Figure: cumulative number of quasi-Newton steps vs. iteration]
Numerical experiments

300×800 random + offset A, b = Ax₀, 50-sparse x₀

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Sparco blurspike, τ → σ ≈ 0.1‖b‖₂

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Sparco p3poly, τ → σ ≈ 10⁻³‖b‖₂

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

[Figure: runtime of the original method vs. runtime of the hybrid method (log-log scatter)]
Numerical experiments

Lasso
- Sometimes the procedure is never used; the overhead is small
- Does well on problems that take longer to solve

Basis pursuit denoise
- SPGL1 has an enthusiastic (aggressive) τ-update strategy
- Subproblems are often terminated before quasi-Newton steps are taken
- The update strategy can lead to run-away behavior
- In those cases, accurate solves with the hybrid method can help
Conclusions

- The hybrid method shows encouraging results
- Apply to box-constrained problems

References
- J. Barzilai and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8 (1988), pp. 141–148
- E.G. Birgin, J.M. Martínez, and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal on Optimization, 10 (2000), pp. 1196–1211
- E. van den Berg and M.P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 31 (2008), pp. 890–912
- E. van den Berg, M.P. Friedlander, G. Hennenfent, F. Herrmann, R. Saab, and O. Yılmaz, Algorithm 890: Sparco: A testing framework for sparse reconstruction, ACM Transactions on Mathematical Software, 35 (2009), pp. 1–16