A hybrid quasi-Newton projected-gradient method with application to Lasso and basis-pursuit denoise
Ewout van den Berg
Human Language Technologies Group
IBM T.J. Watson Research Center
Work done at the Department of Statistics
Stanford University
October 10, 2014
This work was partially supported by National Science Foundation Grant DMS 0906812 (American Recovery and Reinvestment Act).
Background

Basis pursuit denoise

    minimize_x  ‖x‖₁  subject to  ‖Ax − b‖₂ ≤ σ

spgl1 reduces this to a series of Lasso problems [van den Berg, Friedlander, 2008]

    minimize_x  (1/2)‖Ax − b‖₂²  subject to  ‖x‖₁ ≤ τ

Root finding with  τ₊ = τ + (‖r‖₂² − σ‖r‖₂) / ‖Aᵀr‖∞
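The root-finding update above can be sketched in a few lines. This is a minimal illustration of the formula only, not the spgl1 interface; `update_tau` is a hypothetical helper name.

```python
import numpy as np

def update_tau(tau, A, x, b, sigma):
    # One root-finding step on the Pareto curve: given the Lasso solution x
    # at the current tau, move tau toward the value where ||Ax - b||_2 = sigma.
    r = b - A @ x                          # residual of the current Lasso solve
    rnorm = np.linalg.norm(r)
    dual = np.linalg.norm(A.T @ r, np.inf)
    # tau+ = tau + (||r||_2^2 - sigma*||r||_2) / ||A^T r||_inf
    return tau + (rnorm**2 - sigma * rnorm) / dual
```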
Background

Lasso problem

    minimize_x  (1/2)‖Ax − b‖₂²  subject to  ‖x‖₁ ≤ τ

General form

    minimize_x  f(x)  subject to  x ∈ C

Solved using the spectral projected-gradient (spg) method:

    d  = −β∇f(x),              x₊ = P(x + αd)
or
    d  = P(x − β∇f(x)) − x,    x₊ = x + αd

With
- β: Barzilai-Borwein scaling parameter [Barzilai, Borwein, 1988]
- α: step length from a non-monotone line search [Birgin et al., 2000]
- P: orthogonal projection onto C

    P(x) := argmin_v ‖x − v‖₂  subject to  v ∈ C
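One spg iteration for the Lasso formulation can be sketched as below. The names are illustrative; the projection onto the ℓ₁ ball uses the standard sort-based soft-thresholding, and a fixed α stands in for the non-monotone line search of the actual method.

```python
import numpy as np

def project_l1(x, tau):
    # Euclidean projection onto {v : ||v||_1 <= tau} via soft-thresholding;
    # the threshold is found by sorting, O(n log n) overall.
    if np.abs(x).sum() <= tau:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u > (css - tau) / np.arange(1, len(u) + 1))[0][-1]
    theta = (css[k] - tau) / (k + 1)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def spg_step(x, grad, beta, alpha, tau):
    # d = P(x - beta*grad) - x, then x+ = x + alpha*d
    # (alpha comes from a non-monotone line search in the real method)
    d = project_l1(x - beta * grad, tau) - x
    return x + alpha * d
```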
Motivation

Observation
- (Sometimes) difficult to get a highly accurate solution
- Iterates remain on the same face of C (same sign pattern)
- Very little progress

Typical solution
- Detect stagnation on a fixed face
- Solve the problem constrained to the given face
- Check optimality for the global problem
- Resume if not optimal

Difficulties
- When to initiate this procedure?
- Solving the subproblem on an incorrect face is wasteful
- Waiting too long defeats the purpose
Outline

1 Propose a new hybrid method for polyhedral C
  (practical only for simple C: ℓ₁ ball, bound constraints, simplex)
2 Convergence of the method
3 Application to Lasso and basis pursuit
Hybrid method

Basic idea
- Take regular spg steps by default
- After each iteration, check whether F(x₊) = F(x) (≠ C)
- Initialize or update the l-bfgs model
- Use the quasi-Newton search direction in the next iteration

Some issues
- The quasi-Newton direction cannot simply be projected onto C
- A naive implementation ignores the problem structure

Solution
- Form an l-bfgs model restricted to the current face
- Capture only the relevant information
Reduced L-BFGS model

Local function
- We only want to model f(x) over the current d-face F
- Find an orthonormal basis B ∈ R^{n×d} for lin(F − F)
- Define f_c : R^d → R for some fixed x₀ ∈ F:

      f_c(c) = f(x₀ + Bc)

- Choosing c = Bᵀ(x − x₀) gives f_c(c) = f(x) for x ∈ F

Model updates
- Standard l-bfgs uses s = x₊ − x and y = ∇f(x₊) − ∇f(x)
- We use s = c₊ − c and y = ∇f_c(c₊) − ∇f_c(c):

      s = Bᵀ(x₊ − x),    y = Bᵀ(∇f(x₊) − ∇f(x))

- Never need to choose x₀
Quasi-Newton search direction

Computing the search direction
- Want to compute the search direction at the current x
- Denote by H⁻¹ the inverse approximate Hessian (in R^{d×d})
- In the reduced space we compute the search direction

      d_c = −H⁻¹∇f_c(c) = −H⁻¹Bᵀ∇f(x)

- Project back to the ambient space using B d_c:

      d = −BH⁻¹Bᵀ∇f(x)

Properties
- Search direction along the face: (x + αd) ∈ F for 0 ≤ α ≤ α_max
- Guaranteed descent direction
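Combining the reduced-space pairs s = Bᵀ(x₊ − x), y = Bᵀ(∇f(x₊) − ∇f(x)) with the standard l-bfgs two-loop recursion gives the reduced direction −H⁻¹Bᵀ∇f(x). A minimal sketch, assuming `S` and `Y` hold the stored reduced-space pairs:

```python
import numpy as np

def lbfgs_direction(grad_c, S, Y):
    # Two-loop recursion: returns -H^{-1} grad_c, where H is the l-bfgs
    # Hessian approximation built from reduced-space pairs (s_k, y_k).
    q = grad_c.copy()
    alphas = []
    for s, y in reversed(list(zip(S, Y))):
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    if S:  # initial Hessian scaling gamma = s'y / y'y from the latest pair
        s, y = S[-1], Y[-1]
        q *= s.dot(y) / y.dot(y)
    for (s, y), a in zip(zip(S, Y), reversed(alphas)):
        rho = 1.0 / y.dot(s)
        b = rho * y.dot(q)
        q += (a - b) * s
    return -q
```

With an empty history this reduces to the steepest-descent direction, which matches the fallback behavior of the hybrid scheme.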
Self-projection cone

Remaining issues
- The quasi-Newton step must be restricted to the face (α ≤ α_max)
- Fall back to an spg step if the line search fails (reset Hessian, history)
- A mechanism is missing to avoid a local minimum on relint(F)

Self-projection cone
- Update and use the l-bfgs model only if −∇f(x₊) ∈ S(F(x))
- Where S(F(x)) is the self-projection cone of F(x):

      S(F(x)) := {d ∈ R^n | ∃α > 0 : F[P(x + αd)] = F(x)}
               = N(x) + lin(F(x) − F(x))
Convergence

Theorem
Let f(x) be a twice continuously differentiable convex function that is
bounded below and for which there exist constants 0 < µ₁ ≤ µ₂ < ∞
such that for all x, v ∈ R^n

    µ₁‖v‖₂² ≤ vᵀ∇²f(x)v ≤ µ₂‖v‖₂².

Then for any starting point x₀ ∈ C, the sequence {x_k} generated by the
hybrid algorithm converges to the minimizer of f(x) over C.

Proof sketch:
- Finitely many quasi-Newton steps: either done, or spg converges
- Infinitely many quasi-Newton steps:
  - Successful quasi-Newton (l-bfgs) step (Liu and Nocedal):

        f(x₊) − f(x*) ≤ (1 − c)(f(x) − f(x*))

  - Finite number of quasi-Newton steps on incorrect faces
Application to general problems

Challenges for general problems
- The projection in SPG is difficult for general C
- The facial structure is often unknown
- Finding an orthonormal basis for a face may be expensive
- This is true even for the weighted ℓ₁ ball

Well suited for simple problems
- Cross polytope (ℓ₁-norm ball)
- Box-constrained problems
- Simplex
Application to Lasso

Additional conditions
- Typically A ∈ R^{m×n} with m < n
- The Hessian is not full rank for d-faces with d > m
- Use quasi-Newton steps only when d ≤ m

Orthogonal projection
- Reduces to soft-thresholding, O(n log n) complexity

Orthonormal basis
- Normalize signs and permute indices: F = conv{e₁, ..., e_{d+1}}
- Compute the QR factorization of [e₂ − e₁, ..., e_{d+1} − e₁]:

      Q_{i,j} = −√(1/(j² + j))   if i ≤ j,
                 √(j/(j + 1))    if i = j + 1,
                 0               otherwise.

- Implicit B and Bᵀ, can be applied in O(n) time
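The closed form of Q makes the implicit application of B and Bᵀ a matter of prefix and suffix sums. A sketch under the slide's formula (function names are illustrative; `make_Q` builds the dense matrix only for checking):

```python
import numpy as np

def make_Q(d):
    # Dense Q in R^{(d+1) x d} from the closed form, for verification only.
    Q = np.zeros((d + 1, d))
    for j in range(1, d + 1):
        Q[:j, j - 1] = -np.sqrt(1.0 / (j * j + j))
        Q[j, j - 1] = np.sqrt(j / (j + 1.0))
    return Q

def apply_Bt(g):
    # (B^T g)_j = -(sum_{i<=j} g_i)/sqrt(j^2+j) + sqrt(j/(j+1)) * g_{j+1},
    # computed for all j with one prefix sum: O(n) total.
    d = len(g) - 1
    j = np.arange(1, d + 1)
    prefix = np.cumsum(g)[:d]                # sum_{i<=j} g_i
    return -prefix / np.sqrt(j * j + j) + np.sqrt(j / (j + 1.0)) * g[1:]

def apply_B(c):
    # (B c)_i = sqrt((i-1)/i) * c_{i-1} - sum_{j>=i} c_j / sqrt(j^2+j),
    # computed for all i with one suffix sum: O(n) total.
    d = len(c)
    j = np.arange(1, d + 1)
    w = c / np.sqrt(j * j + j)
    suffix = np.cumsum(w[::-1])[::-1]        # sum_{j>=i} c_j / sqrt(j^2+j)
    x = np.empty(d + 1)
    x[0] = -suffix[0]
    x[1:] = np.sqrt(j / (j + 1.0)) * c - np.concatenate([suffix[1:], [0.0]])
    return x
```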
Application to Lasso

Self-projection cone
- Let d = −∇f(x) and define

      I₁ = {i ∈ [n] | (xᵢ > 0 and dᵢ < 0) or (xᵢ < 0 and dᵢ > 0)},
      I₂ = {i ∈ [n] | (xᵢ > 0 and dᵢ ≥ 0) or (xᵢ < 0 and dᵢ ≤ 0)},
      I₃ = (I₁ ∪ I₂)ᶜ.

- Set s_j := Σ_{i∈I_j} |dᵢ| and assume that x ∉ relint(C); then d ∈ S(F(x)) iff

      s₁ = s₂ + s₃ and s₃ = 0,   or
      s₁ < s₂ + s₃ and max_{i∈I₃} |dᵢ| ≤ (s₂ − s₁) / |I₁ ∪ I₂|

Line search
- Can compute the maximum step length α_max to stay on the face
- The objective is quadratic, so the minimum along the search direction can be found exactly
- Can compute an interval [α_w^min, α_w^max] satisfying the Wolfe conditions
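The index sets and sums above translate directly into a membership test. A sketch transcribing the slide's condition (the helper name is hypothetical; the caller passes d = −∇f(x) and is responsible for x ∉ relint(C)):

```python
import numpy as np

def in_self_projection_cone(x, d, tol=1e-12):
    # Test whether d lies in the self-projection cone S(F(x)) of the
    # l1-ball face containing x, per the index-set condition on the slide.
    I1 = ((x > 0) & (d < 0)) | ((x < 0) & (d > 0))
    I2 = ((x > 0) & (d >= 0)) | ((x < 0) & (d <= 0))
    I3 = ~(I1 | I2)                          # here: the zero entries of x
    s1, s2, s3 = (np.abs(d[I]).sum() for I in (I1, I2, I3))
    if s3 <= tol and abs(s1 - (s2 + s3)) <= tol:
        return True
    if s1 < s2 + s3:
        m = np.abs(d[I3]).max() if I3.any() else 0.0
        return m <= (s2 - s1) / max(I1.sum() + I2.sum(), 1)
    return False
```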
Numerical experiments

- 10 Sparco problems, each with three τ values [van den Berg et al., 2009]
- Random problems: A, A + c, b = Ax, random b
- Heaviside matrix, random b
Numerical experiments

Heaviside matrix, random b

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Random 300×800 A, random b

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Random 300×800 A, random b

[Figure: cumulative number of quasi-Newton steps vs. iteration]
Numerical experiments

300×800 random + offset A, b = Ax₀, 50-sparse x₀

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Sparco blurspike, τ → σ ≈ 0.1‖b‖₂

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

Sparco p3poly, τ → σ ≈ 10⁻³‖b‖₂

[Figure: relative duality gap vs. runtime (seconds)]
Numerical experiments

[Figure: runtime of the original method vs. runtime of the hybrid method (log-log scatter)]
Numerical experiments

Lasso
- Sometimes the procedure is never used; the overhead is small
- Does well on problems that take longer to solve

Basis pursuit denoise
- SPGL1 has an enthusiastic (aggressive) τ-update strategy
- Subproblems are often terminated before quasi-Newton steps are taken
- The update strategy can lead to run-away behavior
- In those cases, accurate solves with the hybrid method can help
Conclusions

- The hybrid method shows encouraging results
- Apply to box-constrained problems

References
- J. Barzilai and J.M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8 (1988), pp. 141–148
- E.G. Birgin, J.M. Martínez, and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM Journal on Optimization, 10 (2000), pp. 1196–1211
- E. van den Berg and M.P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 31 (2008), pp. 890–912
- E. van den Berg, M.P. Friedlander, G. Hennenfent, F. Herrmann, R. Saab, and O. Yılmaz, Algorithm 890: Sparco: A testing framework for sparse reconstruction, ACM Transactions on Mathematical Software, 35 (2009), pp. 1–16