QUASI-NEWTON METHODS
PROBLEMS WITH NEWTON’S METHOD
• Newton’s method is expensive:
  • Compute mixed partials: $O(N^2)$
  • Invert Hessian: $O(N^3)$
• Gradient descent is cheap:
  • Compute partials: $O(N)$
• Other problems:
  • When the quadratic approximation is bad, Newton is a waste
  • Hessian becomes poorly conditioned
  • Nonconvex problems: indefinite Hessian = ascent step

Rosenbrock banana function: $f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$
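The conditioning problem is easy to see numerically on the Rosenbrock function above. A minimal NumPy sketch (the function is from the slide; the analytic Hessian and the evaluation point are standard facts filled in here):

```python
import numpy as np

def rosenbrock(x, y):
    """Rosenbrock banana function from the slide."""
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_hessian(x, y):
    """Analytic Hessian of the Rosenbrock function."""
    return np.array([
        [2 - 400 * (y - x**2) + 800 * x**2, -400 * x],
        [-400 * x,                          200.0],
    ])

# Even at the minimizer (1, 1) the Hessian is badly conditioned:
H = rosenbrock_hessian(1.0, 1.0)
print(np.linalg.cond(H))   # ~2500
```

The large condition number reflects the narrow curved valley that makes this function a standard stress test for Newton-type methods.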
CONVERGENCE RATES

Attain superlinear performance without all the fuss:

Linear: $\|e_{k+1}\| = C\,\|e_k\|$  (gradient methods)
Quadratic: $\|e_{k+1}\| = C\,\|e_k\|^2$  (Newton)
Superlinear: $\|e_{k+1}\| = C\,\|e_k\|^p$, $p > 1$  (as good as quadratic?)
QUASI-NEWTON METHODS

Newton approximation:

$f(x) \approx (x - x_k)^T g_k + \tfrac{1}{2}(x - x_k)^T H (x - x_k)$, solve $H_k d_k = -g_k$

Quasi-Newton approximation:

$f(x) \approx (x - x_k)^T g_k + \tfrac{1}{2}(x - x_k)^T B_k (x - x_k)$, solve $B_k d_k = -g_k$ with $B_k$ an approximate Hessian

What conditions should $B$ satisfy?
WHEN DOES THIS WORK?

$B_k d_k = -g_k$  (gradient $g_k$, search direction $d_k$)

The search direction must make an acute angle with the negative gradient:

$\langle -g, -B^{-1}g\rangle = \langle g, B^{-1}g\rangle = g^T B^{-1} g > 0$

This holds for every $g \neq 0$ exactly when $B$ is positive definite.
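A small NumPy check of this condition (a sketch; the random positive definite $B$ is an illustrative choice): for PD $B$, the direction solving $Bd = -g$ always points into the descent half-space.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

A = rng.standard_normal((N, N))
B = A @ A.T + N * np.eye(N)        # random positive definite B

g = rng.standard_normal(N)         # some gradient
d = np.linalg.solve(B, -g)         # quasi-Newton direction: B d = -g

print(g @ -d)     # = g^T B^{-1} g > 0 for PD B
print(g @ d < 0)  # descent direction: True
```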
SECANT CONDITION

$\nabla f(x) - \nabla f(y) \approx H(x - y)$

$\Delta x = x_{k+1} - x_k$
$\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$

Secant condition: $\Delta g \approx B_{k+1}\Delta x$
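For a quadratic $f(x) = \tfrac{1}{2}x^T H x$ the secant condition holds exactly, $\Delta g = H\Delta x$, which makes a quick sanity check (a sketch with an assumed random $H$):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

A = rng.standard_normal((N, N))
H = A @ A.T                        # Hessian of f(x) = 0.5 x^T H x
grad = lambda x: H @ x             # gradient of the quadratic

xk  = rng.standard_normal(N)
xk1 = rng.standard_normal(N)

dx = xk1 - xk
dg = grad(xk1) - grad(xk)
print(np.allclose(dg, H @ dx))     # True: secant condition is exact here
```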
BFGS
BASIC QUASI-NEWTON METHOD

Standard Newton stuff:
  Start with $x_k$, $B_k$
  Solve $B_k d = -\nabla f(x_k)$
  $x_{k+1} = x_k + \tau d$

Find new approximate Hessian: solve

$\Delta g \approx B_{k+1}\Delta x$, where $\Delta x = x_{k+1} - x_k$, $\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$

Many solutions
BFGS UPDATE

Broyden-Fletcher-Goldfarb-Shanno: solve

$\text{minimize}_{B_{k+1}} \quad \|B_{k+1} - B_k\|^2$
$\text{subject to} \quad B_{k+1} = (B_{k+1})^T, \quad \Delta g = B_{k+1}\Delta x$

which gives

$B_{k+1} = B_k + \frac{\Delta g\,\Delta g^T}{\Delta g^T\Delta x} - \frac{(B_k\Delta x)(B_k\Delta x)^T}{\Delta x^T B_k\Delta x}$

The numerators are outer products; the denominators are scalars.
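As code, the update is just two outer products and two scalars (a sketch; `B`, `dx`, `dg` stand for $B_k$, $\Delta x$, $\Delta g$):

```python
import numpy as np

def bfgs_update(B, dx, dg):
    """One BFGS update of the Hessian approximation: O(N^2) work."""
    Bdx = B @ dx
    return (B
            + np.outer(dg, dg) / (dg @ dx)       # + dg dg^T / (dg^T dx)
            - np.outer(Bdx, Bdx) / (dx @ Bdx))   # - (B dx)(B dx)^T / (dx^T B dx)
```

If the line search guarantees the curvature condition $\Delta g^T\Delta x > 0$, the updated matrix stays symmetric positive definite, which is exactly the condition from the earlier slide.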
BFGS METHOD

Standard Newton stuff:
  Start with $x_k$, $B_k$
  Solve $B_k d = -\nabla f(x_k)$
  $x_{k+1} = x_k + \tau d$

Find new approximate Hessian:
  $\Delta x = x_{k+1} - x_k$
  $\Delta g = \nabla f(x_{k+1}) - \nabla f(x_k)$
  $B_{k+1} = B_k + \frac{\Delta g\,\Delta g^T}{\Delta g^T\Delta x} - \frac{(B_k\Delta x)(B_k\Delta x)^T}{\Delta x^T B_k\Delta x}$

What’s the rate-limiting step?
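Assembled into a runnable sketch (the Armijo backtracking search and the stopping rule are stand-ins for a proper Wolfe line search, not from the slide):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, iters=200):
    x = np.asarray(x0, dtype=float).copy()
    B = np.eye(len(x))                   # initial Hessian approximation
    g = grad(x)
    for _ in range(iters):
        d = np.linalg.solve(B, -g)       # rate-limiting step: O(N^3) solve
        fx, tau = f(x), 1.0
        while f(x + tau * d) > fx + 1e-4 * tau * (g @ d):
            tau *= 0.5                   # backtracking (Armijo) search
        x_new = x + tau * d
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        if dg @ dx > 1e-12:              # curvature condition keeps B PD
            Bdx = B @ dx
            B += np.outer(dg, dg) / (dg @ dx) - np.outer(Bdx, Bdx) / (dx @ Bdx)
        x, g = x_new, g_new
        if np.linalg.norm(g) < 1e-8:
            break
    return x

# Example: the Rosenbrock function from earlier
f = lambda v: (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2
g = lambda v: np.array([-2 * (1 - v[0]) - 400 * v[0] * (v[1] - v[0]**2),
                        200 * (v[1] - v[0]**2)])
print(bfgs_minimize(f, g, [-1.0, 1.0]))  # should approach (1, 1)
```

The `solve` line answers the slide’s question: it costs $O(N^3)$ per iteration, and the rank-1 strategy on the next slides removes it.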
RANK-1 STRATEGY

$d = -(B_k)^{-1}\nabla f(x_k)$

$B_{k+1} = B_k + \frac{\Delta g\,\Delta g^T}{\Delta g^T\Delta x} - \frac{(B_k\Delta x)(B_k\Delta x)^T}{\Delta x^T B_k\Delta x}$

Both corrections are rank-1 matrices:

$B_{k+1} = B_k + \alpha\,\Delta g\Delta g^T - \beta\,(B_k\Delta x)(B_k\Delta x)^T$

Woodbury identity:

$(A + UV)^{-1} = A^{-1} - A^{-1}U(I + VA^{-1}U)^{-1}VA^{-1}$
RANK-1 STRATEGY

$B_{k+1} = B_k + \alpha\,\Delta g\Delta g^T - \beta\,(B_k\Delta x)(B_k\Delta x)^T$

Stack the two rank-1 corrections into a rank-2 product $UV$:

$U = \begin{pmatrix} \alpha\Delta g & -\beta\,B_k\Delta x \end{pmatrix}$ (two columns), $\quad V = \begin{pmatrix} \Delta g^T \\ (B_k\Delta x)^T \end{pmatrix}$ (two rows)

so $B_{k+1} = B_k + UV$, and the Woodbury identity applies:

$(A + UV)^{-1} = A^{-1} - A^{-1}U(I + VA^{-1}U)^{-1}VA^{-1}$
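A quick numeric sanity check of the identity in this rank-2 form (the random $A$, $U$, $V$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, r = 6, 2                      # rank-2 correction, as in BFGS

A = rng.standard_normal((N, N)) + N * np.eye(N)
U = rng.standard_normal((N, r))
V = rng.standard_normal((r, N))

Ainv = np.linalg.inv(A)
# (A + UV)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}
w = Ainv - Ainv @ U @ np.linalg.solve(np.eye(r) + V @ Ainv @ U, V @ Ainv)
print(np.allclose(w, np.linalg.inv(A + U @ V)))   # True
```

The matrix inverted on the right-hand side is only $r \times r$ (here $2 \times 2$), which is why the update can avoid the $O(N^3)$ solve.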
RANK-1 STRATEGY

$d = -(B_k)^{-1}\nabla f(x_k)$

Rank-1 matrices in the BFGS update:

$B_{k+1} = B_k + \frac{\Delta g\,\Delta g^T}{\Delta g^T\Delta x} - \frac{(B_k\Delta x)(B_k\Delta x)^T}{\Delta x^T B_k\Delta x}$

Applying the Woodbury identity to these rank-1 corrections turns the BFGS update into an update of the inverse:

$(B_{k+1})^{-1} = \left(I - \frac{\Delta x\,\Delta g^T}{\Delta g^T\Delta x}\right)(B_k)^{-1}\left(I - \frac{\Delta g\,\Delta x^T}{\Delta g^T\Delta x}\right) + \frac{\Delta x\,\Delta x^T}{\Delta g^T\Delta x}$

Complexity? Only outer products and matrix-vector products: $O(N^2)$, and the $O(N^3)$ solve is gone.
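The inverse update as code, expanded so that only matrix-vector products and outer products appear (a sketch; `Hinv` plays the role of $(B_k)^{-1}$):

```python
import numpy as np

def inv_bfgs_update(Hinv, dx, dg):
    """BFGS update applied directly to the inverse approximation: O(N^2)."""
    rho = 1.0 / (dg @ dx)
    Hdg = Hinv @ dg                       # O(N^2) matrix-vector product
    # Expansion of (I - rho dx dg^T) Hinv (I - rho dg dx^T) + rho dx dx^T:
    return (Hinv
            - rho * np.outer(dx, Hdg)
            - rho * np.outer(Hdg, dx)
            + (rho**2 * (dg @ Hdg) + rho) * np.outer(dx, dx))
```

With this, each iteration is a matrix-vector product $d = -(B_k)^{-1}\nabla f(x_k)$ plus an $O(N^2)$ update: no solve at all.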
SECANT METHOD

[Figure: secant iterates $x_{k-1}$, $x_k$, $x_{k+1}$ homing in on the root of $h$]

Solve: $h(x) = \nabla f(x) = 0$, where $\nabla h = \nabla^2 f$

$h(x_k) - h(x_{k-1}) \approx \nabla h(x_k)(x_k - x_{k-1})$

This is the 1D analog of the secant condition $\Delta g \approx B_k\,\Delta x$.
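For reference, the classical 1D secant iteration that this slide generalizes (a sketch; the test problem $f(x) = x^4 - 3x$ is an illustrative example):

```python
def secant_root(h, x0, x1, iters=50, tol=1e-12):
    """Secant method for h(x) = 0: the finite-difference slope
    (h(x1) - h(x0)) / (x1 - x0) is the 1D analog of B_k."""
    for _ in range(iters):
        slope = (h(x1) - h(x0)) / (x1 - x0)
        x0, x1 = x1, x1 - h(x1) / slope    # Newton step, approximate slope
        if abs(x1 - x0) < tol:
            break
    return x1

# Minimize f(x) = x^4 - 3x by solving f'(x) = 4x^3 - 3 = 0
print(secant_root(lambda x: 4 * x**3 - 3, 0.5, 1.0))   # ~0.9086
```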
CONVERGENCE

Exact line search + strong convexity = $n$-step quadratic convergence:

$\|x_{k+n} - x^\star\| \le c\,\|x_k - x^\star\|^2$

Backtracking search + strong convexity + smooth Hessian = superlinear:

$\frac{\|x_{k+1} - x^\star\|}{\|x_k - x^\star\|} \to 0$

See O’Leary, “Scientific Computing”, 2009
EXAMPLE

$\text{minimize}_x \quad c^T x - \sum_{i=1}^{m} \log(b_i - a_i^T x)$

with $n = 100$ variables and $m = 500$ terms.
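A sketch of this experiment with SciPy’s built-in BFGS (the random data, the choice $b_i > 0$ making $x = 0$ strictly feasible, and the `inf` guard that keeps the line search inside the log barrier’s domain are all assumptions here):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 100, 500                        # sizes from the slide

A = rng.standard_normal((m, n))        # rows are a_i^T
b = rng.uniform(1.0, 2.0, size=m)      # b_i > 0: x = 0 strictly feasible
c = rng.standard_normal(n)

def f(x):
    s = b - A @ x
    if np.any(s <= 0):
        return np.inf                  # outside the barrier's domain
    return c @ x - np.log(s).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

res = minimize(f, np.zeros(n), jac=grad, method="BFGS")
print(res.nit, res.fun)                # iterations taken, optimal value
```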
SQUARE-ROOT BFGS

Updating the inverse $(B_k)^{-1} \to (B_{k+1})^{-1}$ is $O(N^2)$ but fails when conditioning is bad.

More stable: maintain a factorization $B_k = (L_k)^T L_k$ and update the factor directly, $B_{k+1} = (L_{k+1})^T L_{k+1}$, also in $O(N^2)$.

See O’Leary, “Scientific Computing”, 2009
L-BFGS

• Pick some memory parameter: $m \approx 25$
• Assume $B_{k-m} = I$
• Store only the $m$ most recent pairs $\{\Delta x, \Delta g\}$
• Evaluate $(B_{k+1})^{-1}\nabla f(x_k)$ using the recursive definition (sketch below):

$(B_{k+1})^{-1} = \left(I - \frac{\Delta x\,\Delta g^T}{\Delta g^T\Delta x}\right)(B_k)^{-1}\left(I - \frac{\Delta g\,\Delta x^T}{\Delta g^T\Delta x}\right) + \frac{\Delta x\,\Delta x^T}{\Delta g^T\Delta x}$
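In practice this recursion is evaluated with the standard two-loop recursion, never forming an $N \times N$ matrix (a sketch; `pairs` holds the $m$ most recent $(\Delta x, \Delta g)$ pairs, oldest first, with $B_{k-m} = I$ as on the slide):

```python
import numpy as np

def lbfgs_apply_inverse(g, pairs):
    """Compute (B_k)^{-1} g from stored (dx, dg) pairs: O(m N) work."""
    q = np.array(g, dtype=float)
    alphas = []
    for dx, dg in reversed(pairs):           # newest to oldest
        rho = 1.0 / (dg @ dx)
        a = rho * (dx @ q)
        q -= a * dg
        alphas.append(a)
    r = q                                    # B_{k-m} = I per the slide
    for (dx, dg), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / (dg @ dx)
        beta = rho * (dg @ r)
        r += (a - beta) * dx
    return r                                 # search direction is -r
```

Memory and work are both $O(mN)$, versus $O(N^2)$ for full BFGS.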
REALLY LOW MEMORY: CG

Quadratic problems: minimize $x^T H x + b^T x$

Conjugate descent directions: $\langle d_i, H d_j\rangle = 0$ for $i \neq j$

$d_{k+1} = -\nabla f(x_{k+1}) + \beta d_k$

A reduced Gram-Schmidt process
REALLY LOW MEMORY: CG

• Initialize: $d_0 = -\nabla f(x_0)$
• Repeat:
  • Line search: $x_{k+1} = x_k + \tau d_k$
  • $d_{k+1} = -\nabla f(x_{k+1}) + \beta d_k$

Polak-Ribière rule (or others):

$\beta = \frac{(\nabla f(x_{k+1}) - \nabla f(x_k))^T\,\nabla f(x_{k+1})}{\nabla f(x_k)^T\,\nabla f(x_k)}$

See O’Leary, “Scientific Computing”, 2009
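A runnable sketch of the whole loop (the backtracking search, the restart when $d$ stops being a descent direction, and the $\beta \ge 0$ safeguard, often called PR+, are common additions, not from the slide):

```python
import numpy as np

def cg_minimize(f, grad, x0, iters=500, tol=1e-8):
    """Nonlinear CG with the Polak-Ribiere rule: O(N) memory."""
    x = np.asarray(x0, dtype=float).copy()
    g = grad(x)
    d = -g                                   # d_0 = -grad f(x_0)
    for _ in range(iters):
        if g @ d >= 0:
            d = -g                           # restart on non-descent d
        fx, tau = f(x), 1.0
        while f(x + tau * d) > fx + 1e-4 * tau * (g @ d):
            tau *= 0.5                       # backtracking line search
        x = x + tau * d
        g_new = grad(x)
        beta = max(((g_new - g) @ g_new) / (g @ g), 0.0)   # PR+
        d = -g_new + beta * d
        g = g_new
        if np.linalg.norm(g) < tol:
            break
    return x
```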
WHEN TO USE QUASI-NEWTON?

Problem should be:
• Smooth
• Unconstrained

Expensive gradients: BFGS (superlinear, so fewest gradient evaluations)
Cheap gradients: Conjugate Gradients (only $O(N)$ overhead per iteration)
MAX-ENT LOGISTIC REGRESSION

Two categories…

$P(Y_i = 1) = \frac{e^{w_1^T x}}{1 + e^{w_1^T x}}, \qquad P(Y_i = 0) = \frac{1}{1 + e^{w_1^T x}}$

Many categories…

$P(Y_i = 0) = \frac{1}{Z} = \frac{1}{1 + \sum_l e^{w_l^T x}}$

$P(Y_i = k) = \frac{1}{Z}\, e^{w_k^T x} = \frac{e^{w_k^T x}}{1 + \sum_l e^{w_l^T x}}$
MAXIMUM LIKELIHOOD

$\text{maximize}_w \quad P(w) = \prod_i P(Y_i \mid w) = \prod_i \frac{e^{w_{k_i}^T x_i}}{\sum_l e^{w_l^T x_i}}$

Negative log-likelihood:

$\text{minimize}_w \quad -\log P(w) = \sum_i -w_{k_i}^T x_i + \sum_i \log\Big(\sum_l e^{w_l^T x_i}\Big)$

Expensive for many labels: BFGS
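A sketch of the negative log-likelihood and its gradient, using a plain softmax over $K$ classes instead of the slide’s implicit class 0 (that simplification, the synthetic data, and SciPy’s limited-memory BFGS are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

def nll_and_grad(w_flat, X, y, K):
    """Softmax negative log-likelihood: sum_i [-w_{k_i}^T x_i + log Z_i]."""
    N, D = X.shape
    W = w_flat.reshape(K, D)
    S = X @ W.T                               # scores, shape (N, K)
    S -= S.max(axis=1, keepdims=True)         # numerical stability
    P = np.exp(S)
    P /= P.sum(axis=1, keepdims=True)         # softmax probabilities
    nll = -np.log(P[np.arange(N), y]).sum()
    G = P
    G[np.arange(N), y] -= 1.0                 # dNLL/dscores = P - onehot(y)
    return nll, (G.T @ X).ravel()

rng = np.random.default_rng(0)                # tiny synthetic data (assumed)
N, D, K = 200, 5, 3
X = rng.standard_normal((N, D))
y = rng.integers(0, K, size=N)

res = minimize(nll_and_grad, np.zeros(K * D), args=(X, y, K),
               jac=True, method="L-BFGS-B")
print(res.fun)                                # optimized NLL
```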
NEURAL NETS: LOGISTIC REGRESSION ON STEROIDS

$y = \sigma(X_3\,\sigma(X_2\,\sigma(X_1 D)))$, with $\sigma$ the logistic sigmoid applied elementwise
Backpropagation is expensive: L-BFGS
SUMMARY: COMPLEXITY

Method   Work      Memory    Convergence
Newton   O(N^3)    O(N^2)    Fastest (quadratic)
BFGS     O(N^2)    O(N^2)    Superlinear
L-BFGS   O(NM)     O(NM)     Linear/superlinear
CG       O(N)      O(N)      Linear/superlinear

All of these get used!