ON THE USE OF STOCHASTIC HESSIAN INFORMATION IN OPTIMIZATION METHODS FOR MACHINE LEARNING*

RICHARD H. BYRD†, GILLIAN M. CHIN‡, WILL NEVEITT§, AND JORGE NOCEDAL¶

Abstract. This paper describes how to incorporate sampled curvature information in a Newton-CG method and in a limited memory quasi-Newton method for statistical learning. The motivation for this work stems from supervised machine learning applications involving a very large number of training points. We follow a batch approach, also known in the stochastic optimization literature as a sample average approximation approach. Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration. A crucial feature of our technique is that Hessian-vector multiplications are carried out with a significantly smaller sample size than is used for the function and gradient. The efficiency of the proposed methods is illustrated using a machine learning application involving speech recognition.

Key words. unconstrained optimization, stochastic optimization, machine learning

AMS subject classifications. 65K05, 90C06, 90C30, 90C55

DOI. 10.1137/10079923X

1. Introduction. The inexact Newton-CG method and the limited memory BFGS method are useful techniques for solving large-scale deterministic optimization problems in which the function and the gradient can be computed with good accuracy but the Hessian matrix cannot be formed or factored at reasonable cost [12]. In this paper, we consider optimization problems whose objective function is given by an expectation or a loss function and can therefore be considered stochastic. We ask whether one can design Newton-CG and limited memory BFGS methods that exploit the properties of such objective functions. In this paper, we propose to employ sampled (or stochastic) curvature information, using a small batch size, to accelerate these optimization methods.

The motivation for this work stems from supervised machine learning, where the goal is to estimate a statistical model on training data that makes accurate predictions on previously unseen data. Given a random set of training points $(x_i, y_i)$ drawn from some probability distribution, and given a loss function $l(w; x_i, y_i)$ parameterized by a vector $w \in \mathbb{R}^n$, we seek to find the vector $w$ that minimizes the expected loss. The optimization problem can thus be stated as

$$\min_{w \in \mathbb{R}^n} \; J(w) = \frac{1}{m} \sum_{i=1}^{m} l(w; x_i, y_i), \qquad (1.1)$$

*Received by the editors June 17, 2010; accepted for publication (in revised form) June 27, 2011; published electronically September 22, 2011.
http://www.siam.org/journals/siopt/21-3/79923.html

†Department of Computer Science, University of Colorado, Boulder, CO 80309 (Richard.Byrd@colorado.edu). This author was supported by National Science Foundation grant CMMI 0728190 and Department of Energy grant DE-SC0001774.

‡Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208 ([email protected]). This author was supported by an NSERC fellowship and a grant from Google Inc.

§Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043 ([email protected]).

¶Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208-3118 ([email protected]). This author was supported by National Science Foundation grant DMS-0810213 and by Department of Energy grant DE-FG02-87ER25047-A004.

SIAM J. Optim., Vol. 21, No. 3, pp. 977–995. © 2011 Society for Industrial and Applied Mathematics.


where $x_i \in \mathbb{R}^{NF}$ represent the feature vectors of the training points and $y_i \in \mathbb{R}^{NC}$ their corresponding labels (or classes). The right-hand side in (1.1) represents an empirical expectation taken over the random set of training points. We are mainly interested in the case when the loss function $l(\cdot; x_i, y_i)$ is smooth and convex, as is the case in log-linear models for entropy maximization. We are motivated by applications, such as speech recognition, where the number of training points $m$ is exceedingly large (in the millions or even billions) and the number of variables $w$ is large (in the tens of thousands or millions). The evaluation of the objective function $J(w)$ is therefore very costly in these applications, and determining an exact solution by standard optimization methods is very time consuming.

In order to reduce the computational cost of the optimization, and given that the training set is often highly redundant, it is common to consider only a random sample of the training points, i.e., to include only a subset of the summation terms in (1.1) in the optimization process. If we define $D = \{1, 2, \ldots, m\}$ and let $X \subseteq D$ be a random sample consisting of $|X|$ training instances $(x_i, y_i)_{i \in X}$, we can define a stochastic approximation of the objective $J(w)$ as

$$J_X(w) = \frac{1}{|X|} \sum_{i \in X} l(w; x_i, y_i). \qquad (1.2)$$

If the sample $X$ is large enough (a so-called batch or sample average approximation (SAA) approach [14]), we can apply a conventional gradient-based method to minimize $J_X$. If, on the other hand, the sample $X$ is very small (the online or SA approach), we can apply one step of a stochastic gradient method [13], [1], choose another small sample $X$, and repeat the process.

The two methods proposed in this paper are designed for batch (or minibatch) applications, not for online settings where the optimization is performed using highly inaccurate function and gradient information. For the applications we have in mind, one can use either the complete training set to define the loss function, as in (1.1), or only a subset of it (a minibatch), as in (1.2). Our approach is based on the fact that less accuracy is required in the Hessian than in the gradient, and on the observation that, although the Hessian of $J_X$ is typically very large and dense, one can compute products of a sampled Hessian times a vector at modest cost. Specifically, suppose that we have determined (by some consideration) that an appropriate sample for computing the function and gradient is $X$. Then, we select a random subset of $X$, which we denote as $S$, and compute Hessian-vector products by including only those terms corresponding to $S$.
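As a concrete illustration of this sampling scheme, the following minimal Python sketch (our own, under assumed callback names; it is not code from the paper) averages per-example gradients over a sample $X$ and per-example Hessian-vector products over a much smaller subsample $S$.

```python
import numpy as np

def sampled_gradient(grad_i, w, X_idx):
    """Average of per-example gradients grad_i(w, i) over the sample X_idx."""
    return sum(grad_i(w, i) for i in X_idx) / len(X_idx)

def sampled_hess_vec(hess_vec_i, w, v, S_idx):
    """Average of per-example Hessian-vector products over the subsample S_idx."""
    return sum(hess_vec_i(w, v, i) for i in S_idx) / len(S_idx)

# Example: |S| is a small fraction of |X| (5% here; the sizes are assumptions).
rng = np.random.default_rng(0)
m = 10_000
X_idx = np.arange(m)                                    # semistochastic case: X = D
S_idx = rng.choice(X_idx, size=m // 20, replace=False)  # Hessian subsample S
```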

In the Newton-CG method, we incorporate sampled (or stochastic) curvature information through a matrix-free CG iteration applied to the Newton equations. We implement this idea by using a subsample that is much smaller than that used for the evaluation of the objective function $J_X$ and its gradient $\nabla J_X$ in the computation of the Hessian-vector products required by the CG iteration. By coordinating the size of the subsample and the number of CG iterations, the computational cost of this Newton-like iteration is comparable to the cost of a steepest descent step, but the resulting iteration is much more rapidly convergent. We refer to this algorithm as a subsampled Hessian Newton method.

In the limited memory BFGS (L-BFGS) method, we incorporate stochastic Hessian information through the so-called "initial matrix" employed in limited memory BFGS updating. In the standard L-BFGS method [6], this initial matrix is chosen at every iteration as a multiple of the identity matrix. In the proposed algorithm, the initial


matrix is defined implicitly via a conjugate gradient solve of a linear system whose coefficient matrix is given by the stochastic Hessian. We call this technique the stochastically initialized L-BFGS method, and similar to the approach described above, it is crucial that the stochastic curvature information provided to the algorithm uses a much smaller sample than that used for the evaluation of the objective function and its gradient.

The main goal of this paper is to propose that the use of approximate second derivative information of this kind can be useful in some machine learning applications. Additional gains in efficiency can be obtained in both methods by implementing them in a dynamic setting, where the sample $X$ is initially small and is increased automatically as needed, so as to achieve any desired level of accuracy. This dynamic framework is, however, not explored here, as it is the subject of a future study.

An important motivation for the approach proposed in this paper stems from the availability of large-scale distributed computing environments that permit the parallel evaluation of the objective function (1.2) and obviate the need for working with very small batch sizes. In such a setting, $J_X$, $\nabla J_X$, and Hessian-vector products can all be evaluated in parallel by assigning subsets of the summation in (1.2) to different computing nodes. It is advantageous that the subsets are not too small, so that the latency in communication does not dominate the total computing time. Thus, the overall sample $X$ need not be very small, and the use of deterministic optimization techniques is justified.
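Because (1.2) is a plain sum of per-example terms, each computing node can evaluate a partial sum over its own index chunk. The sketch below is only an illustration of that decomposition (the toy loss, chunk layout, and worker count are our assumptions, not the authors' infrastructure).

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def partial_loss(args):
    """Sum of per-example losses over one chunk of indices (toy placeholder loss)."""
    w, idx = args
    # A real application would plug in l(w; x_i, y_i) here.
    return sum(0.5 * (w @ w - i) ** 2 for i in idx)

def parallel_objective(w, indices, n_workers=4):
    """Evaluate (1.2) by summing partial losses computed in separate processes."""
    chunks = np.array_split(indices, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        partials = list(ex.map(partial_loss, [(w, c) for c in chunks]))
    return sum(partials) / len(indices)

if __name__ == "__main__":
    print(parallel_objective(np.ones(3), np.arange(1000)))
```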

A Newton-like method for machine learning has been studied in [5] in the context of classification problems involving two classes and a sparse set of features. Other Newton-like methods and quasi-Newton methods for machine learning are discussed in [1], [4]. All of these approaches are significantly different from the techniques proposed here.

N.B. Shortly before providing the final version of this manuscript, we became aware that Martens [9] independently and concurrently proposed a subsampled Hessian Newton method and applied it to the training of neural networks.

The paper is organized into five sections. In section 2 we present the subsampled Hessian Newton-CG method, and in section 3, the L-BFGS variant. Numerical results on a speech recognition problem are reported in section 4, and in section 5 we provide some concluding remarks and open questions.

Notation. We follow the convention used in machine learning and denote the variables of the optimization problem (i.e., the parameters to be estimated) by $w$. Throughout the paper $\|\cdot\|$ denotes the Euclidean vector norm.

2. The subsampled Hessian Newton method. Let us begin by reviewing the Newton-CG method for unconstrained optimization (also known as the truncated or inexact Newton method; see, e.g., [17], [11]). The problem under consideration is to minimize a function $J: \mathbb{R}^n \to \mathbb{R}$.

At an iterate $w_k$, we apply the CG method to compute an approximate solution $p_k$ of the linear system

$$\nabla^2 J(w_k)\, p = -\nabla J(w_k). \qquad (2.1)$$

The CG iteration is terminated when the residual $r_k = \nabla^2 J(w_k) p_k + \nabla J(w_k)$ is sufficiently small, or when a prescribed number of CG iterations have been performed. The new iterate is then given by $w_{k+1} = w_k + \alpha_k p_k$, where $\alpha_k$ is a steplength that ensures sufficient decrease in the objective function.

The conjugate gradient method does not require explicit knowledge of the Hessian matrix, but only requires products of this matrix with vectors. Therefore, we can implement the Newton-CG method in a matrix-free setting, provided we have the ability to


compute these Hessian-vector products without forming the Hessian. This method is quite flexible: by controlling the number of CG iterations, it can resemble the steepest descent method at one extreme, or the classical (exact) Newton method at the other extreme. However, if an effective preconditioner is not available, the Newton-CG method can be expensive because each Hessian-vector product is at least as costly as one gradient evaluation. In other words, the savings in outer iterations achieved by the Newton-CG method normally do not compensate for the higher cost of the iteration.

We propose that for expectation minimization problems of the form (1.2), an effective way of implementing the Newton-CG method is to reduce the cost of the iteration by employing a smaller sample $S$ and including only those terms corresponding to $S$ in the computation of Hessian-vector products. In section 4.1 we give an illustrative example in which the cost of the Hessian-vector products decreases linearly as the sample size decreases.

The method described in this section is motivated by the following three considerations: (a) the stochastic nature of the objective (1.2) suggests that a natural way of incorporating curvature information into a Newton-like method inexpensively is to sample the Hessian; (b) Newton-like methods are much more tolerant to errors in the Hessian than in the computation of the function and its gradient, and therefore one can use small samples for the representation of curvature information; (c) instead of constructing Hessian approximations, one can incorporate curvature information through Hessian-vector products.

The Hessian subsample $S$ can be chosen small enough so that the total cost of the CG iteration is not much larger than the cost of one gradient evaluation. On the other hand, $S$ should be large enough so that the curvature information obtained through these Hessian-vector products is useful. One of the challenges in this approach is to achieve the right balance between these two goals.

The proposed algorithm is stated below. We recall that, given any sample $X_k \subset D = \{1, 2, \ldots, m\}$, the stochastic approximation $J_{X_k}$ is defined by (1.2).

ALGORITHM S-NEWTON: SUBSAMPLED HESSIAN NEWTON-CG METHOD.
Choose an initial iterate $w_0$, constants $\eta, \sigma \in (0, 1)$, a CG iteration limit maxcg, and initial samples $X_0$ and $S_0 \neq \emptyset$ such that $|S_0| < |X_0|$. For $k = 0, 1, \ldots$, until a convergence test is satisfied:

1. Evaluate $J_{X_k}(w_k)$ and $\nabla J_{X_k}(w_k)$.
2. Apply the matrix-free CG method to compute an approximate solution $p_k$ of the linear system

$$\nabla^2 J_{S_k}(w_k)\, p = -\nabla J_{X_k}(w_k). \qquad (2.2)$$

The CG iteration is terminated either when maxcg iterations have been performed or when the residual $r_k = \nabla^2 J_{S_k}(w_k) p_k + \nabla J_{X_k}(w_k)$ satisfies

$$\|r_k\| \leq \sigma \|\nabla J_{X_k}(w_k)\|. \qquad (2.3)$$

3. Update the variables:

$$w_{k+1} = w_k + \alpha_k p_k, \qquad (2.4)$$

where the steplength $\alpha_k$ is the largest element in the set $\{1, 1/2, 1/4, \ldots\}$ such that


$$J_{X_k}(w_{k+1}) \leq J_{X_k}(w_k) + \eta \alpha_k \nabla J_{X_k}(w_k)^T p_k. \qquad (2.5)$$

4. Choose new samples $X_{k+1}$, $S_{k+1}$ such that $|S_{k+1}| < |X_{k+1}|$.
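For concreteness, a minimal Python sketch of Algorithm S-Newton is given below. It is our own rendering under an assumed callback interface (fun, grad, hess_vec_on_S, and resample are placeholders, not the authors' code), with a hand-rolled CG loop so that the Hessian-vector products can be restricted to the subsample $S_k$, and a backtracking line search enforcing (2.5).

```python
import numpy as np

def cg(hess_vec, b, max_iter, tol):
    """Conjugate gradient for H p = b, with H available only through hess_vec."""
    p = np.zeros_like(b)
    r = b.copy()                      # residual b - H p, with p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:        # residual test as in (2.3)
            break
        Hd = hess_vec(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

def s_newton(w, fun, grad, hess_vec_on_S, resample, max_iter=100,
             maxcg=10, sigma=0.1, eta=1e-4):
    """Subsampled Hessian Newton-CG (Algorithm S-Newton), semistochastic sketch."""
    for _ in range(max_iter):
        S = resample()                             # fresh Hessian subsample S_k
        g = grad(w)
        if np.linalg.norm(g) < 1e-6:               # convergence test (assumed tolerance)
            break
        hv = lambda v: hess_vec_on_S(w, v, S)      # subsampled Hessian-vector product
        p = cg(hv, -g, max_iter=maxcg, tol=sigma * np.linalg.norm(g))
        # Backtracking over {1, 1/2, 1/4, ...} until (2.5) holds.
        alpha, f0 = 1.0, fun(w)
        while fun(w + alpha * p) > f0 + eta * alpha * (g @ p):
            alpha *= 0.5
        w = w + alpha * p
    return w
```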

The curvature information obtained in this manner can be expected to be useful in some statistical learning applications because, as mentioned above, Newton-like methods are very tolerant to the choice of Hessian and can make good use of limited curvature information. Specifically, if $B$ is any symmetric and positive definite matrix and if we apply any number of CG steps to the system $B d = -\nabla J(w_k)$, the resulting Newton-CG step is a descent direction for $J(w_k)$; see section 2.1. In the machine learning applications that motivated this study, the (logistic) loss function $l$ in (1.2) is convex, and hence $\nabla^2 J_S$ will be positive semidefinite for any nonempty choice of $S$.

We have not specified in Algorithm S-Newton whether the sizes of the samples $X_k$, $S_k$ change at every iteration or are kept fixed. The algorithm has been stated in sufficient generality to allow many strategies, including the "semistochastic" case when $S_k \subset X_k = D = \{1, 2, \ldots, m\}$ for all $k$. Algorithm S-Newton is also capable of incorporating dynamic techniques in which the sample sizes $|S_k| < |X_k|$ are initially small (to benefit from the initial efficiency of stochastic gradient-type methods) and increase as needed to achieve the desired objective value. Regardless of the strategy chosen, and in order to avoid bias, the subsample $S_k$ should be recomputed at every (outer) iteration of the subsampled Hessian Newton method. Thus, even if the size of the sample $S_k$ remains constant throughout the iteration, the sample itself should change and would typically be chosen as a subset of $X_k$. For the sake of simplicity, in this paper we analyze and test only the semistochastic case $S_k \subset X_k = D$ and defer the study of dynamic sampling strategies to a future study.

Let us quantify the cost of the search direction computation in the subsampled Hessian Newton method. Let gcost denote the cost of computing the gradient $\nabla J_{X_k}$, and maxcg the maximum number of CG iterations permitted. Suppose that the cost of one Hessian-vector product is factor × gcost. Then, assuming that the maximum limit of CG iterations is always reached, the cost of the step computation in Algorithm S-Newton (excluding the function and gradient evaluation) is given by

maxcg × factor × gcost.

In the deterministic Newton-CG method, which corresponds in our case to the choice $S_k = X_k = D$, we have that factor is at least 1, and maxcg can range from 5 to several dozen. Thus, the cost of one iteration of the classical Newton-CG method can easily be 10 times higher than the cost of a gradient computation, and this causes the method to be less competitive than limited memory quasi-Newton and nonlinear CG methods on many problems.

Now, by decreasing the sample size $S$, we can reduce the ratio factor significantly. For example, if $|S_k|$ is one-tenth of $|X_k|$, then factor will be about one-tenth. In general, we can coordinate the sample size $|S_k|$ with the maximum allowable number of CG iterations so that

factor × maxcg ≈ 1,

and thus the total cost of the step computation is comparable to the cost of one gradient evaluation. The freedom in the selection of the subsample $S_k$ thus provides the subsampled Hessian Newton-CG method with much flexibility.
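A back-of-the-envelope check of this balance (illustrative numbers only, not taken from the experiments in section 4):

```python
# Hypothetical sizes: the Hessian subsample is one-tenth of the gradient sample.
X_size, S_size, maxcg = 100_000, 10_000, 10

factor = S_size / X_size        # cost of one subsampled HVP, in gradient units
step_cost = maxcg * factor      # CG cost of one S-Newton step, in gradient units
print(factor, step_cost)        # 0.1 1.0  -> comparable to one gradient evaluation
```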


We should also point out that for the solution of the subsampled Hessian Newton equation (2.2), scale invariance is preserved with respect to changes in the scale of $w$.

2.1. Convergence properties. Based on some well-known properties of the CG method, it is easy to show that the semistochastic version of the Newton method (with $X_k = D$) is globally convergent on problems of the form (1.1), provided the subsampled Hessians are uniformly positive definite; i.e., there is a constant $\gamma_1 > 0$ such that for all $k$ and all $v \in \mathbb{R}^n$

$$v^T \nabla^2 J_{S_k}(w_k)\, v \geq \gamma_1 \|v\|^2. \qquad (2.6)$$

For objective functions of the form (1.1) the loss term $l$ is often convex (although not strongly so), but $J$ can be made uniformly convex by adding a regularization term of the form $\delta \|w\|^2$ to the right-hand side of (1.1). Furthermore, uniform convexity of $J$ implies that the sequence of iterates $w_k$ is bounded, and by continuity of $\nabla^2 J$ there is a constant $\gamma_2$ such that, for all $k$ and all $S_k$,

$$\|\nabla^2 J_{S_k}(w_k)\| \leq \gamma_2. \qquad (2.7)$$

For the following discussion, we denote $J \triangleq J_D$.

THEOREM 2.1. Let $J_D: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable and uniformly convex, and suppose that condition (2.6) holds. Then, the sequence of iterates $\{w_k\}$ generated by Algorithm S-Newton, with $X_k = D$, satisfies

$$\lim_{k \to \infty} \nabla J_D(w_k) = 0. \qquad (2.8)$$

Proof. First we show that the search directions $p_k$ obtained by applying any number of CG steps to the system (2.2) are directions of strong descent for $J_D(w_k)$.

It is a well-known fact [7] that the iterates generated by the CG method applied to the system (2.2) minimize the quadratic function

$$\tfrac{1}{2}\, p^T \nabla^2 J_{S_k}(w_k)\, p + p^T \nabla J_D(w_k)$$

over a Krylov subspace that includes the vector $\nabla J_D(w_k)$. Let us define $Q$ to be an orthonormal basis for this space. Then, the search direction $p_k$ can be expressed as $p_k = Q v$, for some vector $v$ satisfying

$$[Q^T \nabla^2 J_{S_k}(w_k) Q]\, v = -Q^T \nabla J_D(w_k). \qquad (2.9)$$

Since $\nabla J_D(w_k)$ is in the range of $Q$ (it is in the Krylov space mentioned above), we have that $\|Q^T \nabla J_D(w_k)\| = \|\nabla J_D(w_k)\|$, and hence by the orthogonality of $Q$,

$$\nabla J_D(w_k)^T p_k = -\nabla J_D(w_k)^T Q [Q^T \nabla^2 J_{S_k}(w_k) Q]^{-1} Q^T \nabla J_D(w_k) \qquad (2.10)$$
$$\leq -\frac{\|Q^T \nabla J_D(w_k)\|^2}{\|Q^T \nabla^2 J_{S_k}(w_k) Q\|} \qquad (2.11)$$
$$\leq -\frac{\|\nabla J_D(w_k)\|^2}{\|\nabla^2 J_{S_k}(w_k)\|} \qquad (2.12)$$
$$\leq -\|\nabla J_D(w_k)\|^2 / \gamma_2, \qquad (2.13)$$


where $\gamma_2$ is defined in (2.7). In addition, we have from (2.9) and (2.6) that

$$\|p_k\| = \|Q [Q^T \nabla^2 J_{S_k}(w_k) Q]^{-1} Q^T \nabla J_D(w_k)\| \leq \|\nabla J_D(w_k)\| / \gamma_1.$$

It follows that

$$\frac{-p_k^T \nabla J_D(w_k)}{\|\nabla J_D(w_k)\|\, \|p_k\|} \geq \frac{\gamma_1}{\gamma_2},$$

which proves that $p_k$ is a strong direction of descent for $J_D$ at $w_k$. We also have from (2.9) and (2.7) that

$$\|\nabla J_D(w_k)\| = \|Q^T \nabla J_D(w_k)\| \leq \|\nabla^2 J_{S_k}(w_k)\|\, \|v\| \leq \gamma_2 \|v\|,$$

so that

$$\|p_k\| = \|v\| \geq \|\nabla J_D(w_k)\| / \gamma_2.$$

We can now apply Zoutendijk's classical analysis to prove global convergence. Specifically, we have shown that all the conditions in Theorem 11.7 of [3, p. 379] are satisfied, and it follows that the limit (2.8) holds. ▯

3. Stochastically initialized L-BFGS method. The L-BFGS method [12] maintains very simple approximations of the Hessian of $J(w)$. Curvature information from gradients at a few recent iterations is used to construct a Hessian approximation in a way that does not store a fully dense $n \times n$ matrix but is parsimonious in terms of computational time and memory space.

In the standard BFGS method, an approximation $H_k$ of the inverse Hessian $\nabla^2 J(w_k)^{-1}$ is updated at every iteration so that the secant equation is satisfied at each step, i.e.,

$$H_{k+1} y_k = s_k, \quad \text{where } y_k = \nabla J(w_{k+1}) - \nabla J(w_k), \quad s_k = w_{k+1} - w_k. \qquad (3.1)$$

The BFGS update formula (see, e.g., Fletcher [2]) is given by

$$H_{k+1} = (I - \rho_k y_k s_k^T)^T H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \quad \text{where } \rho_k = 1 / (y_k^T s_k). \qquad (3.2)$$

The L-BFGS method does not construct the Hessian approximation explicitly, but defines it indirectly based on stored secant information from the $t$ most recent iterations, where $t$ is a small integer. If we define $V_k = (I - \rho_k y_k s_k^T)$, apply the BFGS formula $t$ times, and expand the resulting equation to encompass all computations from an initial approximation $H_k^0$, we obtain

$$H_k = (V_{k-1}^T V_{k-2}^T \cdots V_{k-t}^T)\, H_k^0\, (V_{k-t} V_{k-t+1} \cdots V_{k-1})$$
$$\quad + \rho_{k-t} (V_{k-1}^T \cdots V_{k-t+1}^T)\, s_{k-t} s_{k-t}^T\, (V_{k-t+1} \cdots V_{k-1})$$
$$\quad + \rho_{k-t+1} (V_{k-1}^T \cdots V_{k-t+2}^T)\, s_{k-t+1} s_{k-t+1}^T\, (V_{k-t+2} \cdots V_{k-1})$$
$$\quad + \cdots$$
$$\quad + \rho_{k-1} s_{k-1} s_{k-1}^T. \qquad (3.3)$$


The search direction of the L-BFGS method is defined as

$$p_k = -H_k \nabla J(w_k). \qquad (3.4)$$

Rather than forming the matrices $H_k$, one can store the correction pairs $\{s_i, y_i\}$ that define them, and compute the product $H_k \nabla J(w_k)$ via the relation (3.3). This matrix-vector multiplication can be carried out very efficiently by the two-loop recursion described in [12, p. 112] at a cost of about $4tn$ multiplications plus the cost of one multiplication by $H_k^0$.

In the standard L-BFGS method, the so-called initial matrix $H_k^0$ in (3.3) is defined afresh at every iteration, with a common choice being

$$H_k^0 = \gamma_k I, \quad \text{where } \gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}. \qquad (3.5)$$

Such a choice of $H_k^0$ contains little curvature information about the problem. Therefore we ask whether it is possible to define $H_k^0$ in a way that exploits the structure of a loss function of the form (1.1).

We propose to define $H_k^0$ implicitly through the use of a conjugate gradient iteration. To see how this can be done, note that when performing the computation of $H_k \nabla J(w_k)$ through (3.3) we must compute, at some point, a product of the form

$$r \leftarrow H_k^0 q \quad \text{for some vector } q \in \mathbb{R}^n.$$

Ideally, $H_k^0$ would be defined as an accurate approximation of the inverse of the Hessian, i.e., $H_k^0 \approx \nabla^2 J(w_k)^{-1}$. Therefore, in the ideal scenario we would compute $r$ as the solution of the linear system

$$\nabla^2 J(w_k)\, r = q. \qquad (3.6)$$

Instead of solving this system exactly, we could perform only a few iterations of the matrix-free conjugate gradient method and define the vector $r$ as the resulting approximate solution of (3.6). This approach therefore eliminates the need for specifying the initial matrix and implicitly defines $H_k^0$ through an approximate matrix-free conjugate gradient solve of the system (3.6).

As in the Newton method of the previous section, we employ a smaller sample to define the Hessian in (3.6) than the sample used for the function and gradient computation, so that the cost of Hessian-vector products within the CG method is affordable. Thus, given a sample $S_k$ such that $|S_k| < |X_k|$, we define the vector $r$ as an approximate solution to the system

$$\nabla^2 J_{S_k}(w_k)\, r = q \qquad (3.7)$$

computed by the conjugate gradient method. The computation of the product $H_k \nabla J_{X_k}(w_k)$ in this semistochastic L-BFGS approach can be stated as follows (cf. [12, p. 112]).


PROCEDURE I: TWO-LOOP RECURSION WITH IMPLICIT STOCHASTIC INITIAL MATRIX.
$q \leftarrow \nabla J_{X_k}(w_k)$
for $i = k-1, k-2, \ldots, k-t$
    $\alpha_i \leftarrow \rho_i s_i^T q$;
    $q \leftarrow q - \alpha_i y_i$;
end (for)
$r \leftarrow$ approximate solution to (3.7) obtained by the matrix-free CG method;
for $i = k-t, k-t+1, \ldots, k-1$
    $\beta \leftarrow \rho_i y_i^T r$;
    $r \leftarrow r + s_i (\alpha_i - \beta)$;
end (for)
STOP: result $r = H_k \nabla J_{X_k}(w_k)$
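A Python sketch of Procedure I is shown below. It is our rendering, not the authors' code: the interface of the subsampled Hessian-vector product (hess_vec_on_S) is assumed, the correction pairs are passed as lists ordered from oldest to newest, and the CG helper mirrors the one sketched for Algorithm S-Newton.

```python
import numpy as np

def cg_solve(hess_vec, q, max_iter):
    """A few matrix-free CG iterations for (3.7): nabla^2 J_S(w) r = q."""
    r = np.zeros_like(q)
    res = q.copy()                      # residual q - H r, with r = 0
    d = res.copy()
    rs = res @ res
    for _ in range(max_iter):
        Hd = hess_vec(d)
        a = rs / (d @ Hd)
        r += a * d
        res -= a * Hd
        rs_new = res @ res
        if np.sqrt(rs_new) <= 1e-12:    # a relative test as in (2.3) could be used
            break
        d = res + (rs_new / rs) * d
        rs = rs_new
    return r

def two_loop_with_cg_init(g, s_list, y_list, hess_vec_on_S, maxcg=5):
    """Procedure I: compute H_k g, with the initial matrix defined implicitly by a
    CG solve of the subsampled system (3.7). s_list/y_list are ordered oldest first."""
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []
    for s, y, rho_i in zip(reversed(s_list), reversed(y_list), reversed(rho)):
        a = rho_i * (s @ q)
        alphas.append(a)
        q -= a * y
    r = cg_solve(hess_vec_on_S, q, max_iter=maxcg)   # r ~ [nabla^2 J_S(w)]^{-1} q
    for s, y, rho_i, a in zip(s_list, y_list, rho, reversed(alphas)):
        beta = rho_i * (y @ r)
        r += s * (a - beta)
    return r                                          # r = H_k g
```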

We terminate the CG method either when the residual condition (2.3) is met or when a CG iteration limit is reached. The precise description of the stochastic L-BFGS method is as follows.

ALGORITHM SLM: STOCHASTICALLY INITIALIZED L-BFGS METHOD.
Choose an initial iterate $w_0$, a CG iteration limit maxcg, initial samples $X_0$ and $S_0 \neq \emptyset$ such that $|S_0| < |X_0|$, and constants $0 < c_1 < c_2 < 1$. Set $k \leftarrow 0$.

1. Evaluate $J_{X_0}(w_0)$ and $\nabla J_{X_0}(w_0)$. Set $p_0 = -\nabla J_{X_0}(w_0)$.
2. While the convergence test is not satisfied:
    2.1 Store $w_k$ and $\nabla J_{X_k}(w_k)$.
    2.2 Line search: compute a steplength $\alpha_k$ that satisfies the Wolfe conditions
        1. $J_{X_k}(w_k + \alpha_k p_k) \leq J_{X_k}(w_k) + c_1 \alpha_k \nabla J_{X_k}(w_k)^T p_k$
        2. $\nabla J_{X_k}(w_k + \alpha_k p_k)^T p_k \geq c_2 \nabla J_{X_k}(w_k)^T p_k$.
    2.3 Compute the new iterate: $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
    2.4 Update $s_k \leftarrow w_{k+1} - w_k$ and $y_k \leftarrow \nabla J_{X_k}(w_{k+1}) - \nabla J_{X_k}(w_k)$.
    2.5 Set $k \leftarrow k + 1$.
    2.6 Resample $X_k$, $S_k$ such that $|S_k| < |X_k|$.
    2.7 Evaluate $J_{X_k}(w_k)$ and $\nabla J_{X_k}(w_k)$.
    2.8 Compute the direction vector $p_k$ using Procedure I.
3. End (While)
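The outer loop of Algorithm SLM can then be sketched as follows (again an illustration under our own assumptions: the Wolfe line search is delegated to SciPy's scipy.optimize.line_search, and direction_fn stands for Procedure I, with the resampling of $S_k$ assumed to happen inside it).

```python
import numpy as np
from scipy.optimize import line_search  # Wolfe line search (conditions in step 2.2)

def slm(w, fun, grad, direction_fn, t=5, max_iter=100, tol=1e-6):
    """Stochastically initialized L-BFGS (Algorithm SLM), semistochastic sketch.

    direction_fn(g, s_list, y_list) should return H_k g computed by Procedure I;
    it is assumed to draw a fresh Hessian subsample S_k on every call."""
    s_list, y_list = [], []
    g = grad(w)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(fun, grad, w, p, gfk=g)[0]
        if alpha is None:              # line search failure: fall back to a tiny step
            alpha = 1e-4
        w_new = w + alpha * p
        g_new = grad(w_new)
        s_list.append(w_new - w)
        y_list.append(g_new - g)
        if len(s_list) > t:            # keep only the t most recent correction pairs
            s_list.pop(0)
            y_list.pop(0)
        w, g = w_new, g_new
        p = -direction_fn(g, s_list, y_list)
    return w
```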

A variety of strategies can be employed to choose the samples $X_k$, $S_k$ at every iteration. For simplicity, our testing is done for the semistochastic case $S_k \subset X_k = D$. As before, to avoid bias, the subsample $S_k$ is recomputed at every iteration of the stochastic L-BFGS algorithm; see section 4.

We now show that Algorithm SLM is globally convergent on convex problems.

THEOREM 3.1. Under the assumptions on $J$ given in Theorem 2.1, the sequence of iterates $\{w_k\}$ generated by Algorithm SLM satisfies

$$\lim_{k \to \infty} \nabla J_D(w_k) = 0. \qquad (3.8)$$

Proof. It has been shown in [6] that the L-BFGS algorithm is globally convergent on convex problems provided the symmetric positive definite initial matrices $H_k^0$ have eigenvalues that are uniformly bounded above and away from zero. In Algorithm SLM,


these matrices are not constructed explicitly; instead we compute the vector $r \leftarrow H_k^0 q$ via the approximate CG solution of (3.7). Thus, we need to show that at every iteration, the vector $r$ computed in this manner can be expressed as the product of a matrix (with the desired properties) times $q$.

Since the vector $r$ is the result of, say, $j$ CG steps applied to the linear system (3.7), it follows from the same argument that leads to (2.9) that $r = Q v$, where

$$[Q^T \nabla^2 J_{S_k}(w_k) Q]\, v = Q^T q$$

and the columns of $Q$ are an orthonormal basis for the Krylov subspace generated by the CG iteration. Thus,

$$r = (Q [Q^T \nabla^2 J_{S_k}(w_k) Q]^{-1} Q^T)\, q.$$

The matrix multiplying $q$ is singular. However, $q$ is in the Krylov subspace (and hence in the range of $Q$), and therefore we also have that

$$r = (Q [Q^T \nabla^2 J_{S_k}(w_k) Q]^{-1} Q^T + \bar{Q} \bar{Q}^T)\, q \stackrel{\text{def}}{=} H_k^0 q, \qquad (3.9)$$

where $\bar{Q}$ is an $n \times (n - j)$ matrix whose columns are an orthonormal basis for the null space of $Q^T$, so that $\bar{Q}^T q = 0$. The matrix in parentheses in (3.9) can therefore be regarded as the initial matrix, $H_k^0$, for the limited memory update (3.3). The eigenvalues of $H_k^0$ are as follows: $n - j$ are equal to 1 and correspond to the eigenvalues of $\bar{Q} \bar{Q}^T$, and the rest are given by the $j$ nonzero eigenvalues of $Q [Q^T \nabla^2 J_{S_k}(w_k) Q]^{-1} Q^T$. By the orthogonality of $Q$, these $j$ eigenvalues are in the interval $[1/\gamma_2, 1/\gamma_1]$, where $\gamma_1$ and $\gamma_2$ are defined in (2.6) and (2.7). Therefore, the eigenvalues of the matrix $H_k^0$ defined in (3.9) are all in the interval $[\min\{1, 1/\gamma_2\}, \max\{1, 1/\gamma_1\}]$.

It then follows from Theorem 7.1 of Liu and Nocedal [6] that the limit (3.8) is satisfied. ▯

3.1. Numerical behavior in the deterministic setting. Before testing Algorithm SLM on statistical learning problems of the form (1.1), let us verify that the incorporation of curvature information, as described above, is beneficial in the ideal case when the objective function is deterministic and $S_k = X_k = D$, i.e., when the matrix-vector products employed in the CG solve of (3.7) use the exact Hessian. This will serve as an indication that the approach is not unsound.

Test 1: Quadratic function. We consider the problem of minimizing the convex quadratic

$$f(w) = \sum_{j=1}^{100} (100 - j + 1)\, w_j^2,$$

whose Hessian has a condition number of 100. In Table 1 we report the number of iterations and function evaluations required by Algorithm SLM as a function of the maximum allowed number of CG iterations (maxcg). In the bottom row, we report the performance of the standard L-BFGS method.

Note from Table 1 that there is a consistent decrease in the number of iterations of Algorithm SLM as maxcg increases, showing the beneficial effect of incorporating curvature information through the initial matrix. The SLM method cannot, however, be


considered successful in this deterministic setting (where $S_k = X_k$) because, as shown in the last column of Table 1, the total number of CG iterations makes the method too expensive, even with the reduction observed in outer iterations.

Test 2: Quadratic plus exponential. To increase the complexity of the problem, we introduce an exponential term to the previous quadratic function, resulting in the convex function

$$f(w) = \sum_{j=1}^{100} \left( (100 - j + 1)\, w_j^2 + e^{w_j} \right).$$
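For reference, the two test functions, their gradients, and their (diagonal) Hessian-vector products can be written down directly; the Python sketch below simply encodes the formulas above.

```python
import numpy as np

c = np.arange(100, 0, -1, dtype=float)       # coefficients 100, 99, ..., 1

def f_quad(w):                               # Test 1
    return np.sum(c * w**2)

def grad_quad(w):
    return 2.0 * c * w

def hessp_quad(w, v):                        # Hessian is diag(2c)
    return 2.0 * c * v

def f_quad_exp(w):                           # Test 2
    return np.sum(c * w**2 + np.exp(w))

def grad_quad_exp(w):
    return 2.0 * c * w + np.exp(w)

def hessp_quad_exp(w, v):                    # Hessian is diag(2c + exp(w))
    return (2.0 * c + np.exp(w)) * v
```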

As in the previous example, we observe in Table 2 a steady reduction in the number of SLM iterations as the maximum allowable number of CG iterations (maxcg) increases; however, the overall cost, as indicated by the last column, is too high to justify the savings in outer iterations.

These experiments therefore suggest that the incorporation of curvature information in the initial matrix $H_k^0$ has a beneficial effect in terms of total outer iteration counts. In the next section, we show that by decreasing the cost of the CG iteration through stochastic Hessian subsampling we can make the SLM approach competitive in terms of computing time.

TABLE 1
Results on a quadratic function.

maxcg    SLM iter    SLM functions    Total CG iter
1        95          96               94
5        13          14               60
10       8           9                70
15       6           7                73
20       5           6                74
L-BFGS   74          79

TABLE 2
Results on a quadratic + exponential function.

maxcg    SLM iter    SLM functions    Total CG iter
1        83          84               82
5        12          13               55
10       8           9                70
15       6           7                72
20       6           7                91
L-BFGS   66          70

4. Numerical tests. To assess the effectiveness of the subsampled Hessian methods proposed in this paper, we analyze and document their performance on a challenging machine learning problem involving multiclass classification of speech frames. Our benchmark is the standard L-BFGS algorithm, which is widely used in the machine learning community for tasks of this type [8], [10], [15]. As only the semistochastic form of the new methods is analyzed in this paper, we assume that $X_k = D$, and thus the function and gradient evaluations will use 100% of the information available for every iteration.


The stochastic Hessian information will be based on a smaller sample $S_k$, typically 5% or 10% of the sample used for the function and gradient. For convenience, we define the Hessian-vector sampling percentage $p$ as

$$p = \frac{|S|}{|D|} \times 100. \qquad (4.1)$$

To determine the Hessian samples $S_k$, we first read in the training data and randomly shuffle it. Specifically, each training pair $(x_i, y_i)$ is assigned a random index $i \in \{1, \ldots, |D|\}$, excluding indices that have already been assigned. The array of indices obtained in this manner is divided into blocks in a sequential manner. Thus, the first iteration of the optimization algorithm will use the first $p\%$ of indices to define $S_1$. At the second iteration, the next $p\%$ of the training points will be assigned to computing $S_2$, and so on. If we reach the end of the array, we wrap the current data block around to the beginning of the data set. Clearly, one could use a higher degree of randomization, but this will not have a noticeable effect on our algorithms.
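A minimal sketch of this shuffle-once, sequential-blocks-with-wraparound scheme (our illustration; the class name and interface are assumptions) is:

```python
import numpy as np

class BlockSampler:
    """Shuffle the training indices once, then hand out sequential p% blocks,
    wrapping around to the beginning when the end of the array is reached."""
    def __init__(self, m, p, seed=0):
        self.order = np.random.default_rng(seed).permutation(m)
        self.block = max(1, int(m * p / 100.0))
        self.pos = 0

    def next_sample(self):
        idx = np.take(self.order, range(self.pos, self.pos + self.block),
                      mode="wrap")             # wrap around the end of the array
        self.pos = (self.pos + self.block) % len(self.order)
        return idx

# Example: 5% Hessian subsamples S_1, S_2, ... for m = 168,776 training points.
sampler = BlockSampler(m=168_776, p=5)
S1 = sampler.next_sample()
```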

4.1. A speech recognition problem. The objective of the speech recognition problem employed in our tests is to construct a representative multinomial logistic regression model that maximizes the probability of correct classification among the data points included within the training set. Each training point consists of a real-valued vector representing features for a 10 ms frame of speech and a label representing the phonetic state assigned to that frame based on a human transcription of the source utterance. The training set was provided by Google. Our objective is to maximize the conditional probability of the correct phonetic state given the observed real-valued vector. The variables and parameters of the problem are described as follows.

Parameters:
$m$: number of training points: 168,776.
$C$: the set of all class labels: $\{1, 2, \ldots, 128, 129\}$.
$NF$: number of feature measurements per data point: 79.
$y_h$: the class label associated with data point $h$; $y_h \in C$.
$x_h$: feature vector for data point $h$.
$x_h(j)$: $j$th entry of the feature vector.

Decision variable:
$w$: a parameter vector of dimension $|C| \times NF = 10,191$;
$w_i$: parameter subvector for class label $i$;
$w_i(j)$: $j$th entry of the parameter vector $w_i$.

Given pertinent feature information for a set of data points, the goal is to develop a model that will suggest the correct class label for each point, on average, with reasonably high probability. For each individual class label, the parameter vector $w$ will have a value associated with each respective feature measurement. Thus, each class label will have a parameter vector of size $NF$, and as there are 129 different class labels and 79 pertinent feature measurements, the total number of decision variables for this speech problem is $129 \times 79 = 10,191$.

As we hope to build a statistical model that will suggest the correct class for each data point in the training set, an intuitive method for parameterizing the statistical model, based on observed data, is maximum likelihood estimation. By performing maximum likelihood estimation, we can not only determine the parameters of the statistical model that most likely represents the data but also derive several attractive asymptotic properties for these values, such as consistency and efficiency. Therefore, in choosing the


method of maximum likelihood estimation, we define the objective function $J$ of the minimization problem (1.1) as the normalized sum of the negative log likelihood of each data point being placed in the correct class, over all data points used in the training set:

$$J(w) = -\frac{1}{m} \sum_{h=1}^{m} \log \frac{\exp(w_{y_h}^T x_h)}{\sum_{i \in C} \exp(w_i^T x_h)} = \frac{1}{m} \left[ \sum_{h=1}^{m} \log \sum_{i \in C} \exp(w_i^T x_h) - \sum_{h=1}^{m} w_{y_h}^T x_h \right]. \qquad (4.2)$$

The gradient can be written as follows:

$$\frac{\partial J}{\partial w_i(j)} = \frac{1}{m} \left[ \sum_{h=1}^{m} P(i, h)\, x_h(j) - \sum_{h: y_h = i} x_h(j) \right], \qquad (4.3)$$

where

$$P(i, h) = \frac{\exp(w_i^T x_h)}{\sum_{j \in C} \exp(w_j^T x_h)}.$$

Note that $P(i, h)$ is the designated probability of class label $i \in C$ being the correct class label for data point $h$, given the parameter variable $w$.
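As an illustration of (4.2) and (4.3), the following vectorized NumPy sketch (our code, not the authors'; the parameters are stored as a $|C| \times NF$ matrix W, and the log-sum-exp is shifted for numerical stability) computes the objective and its gradient.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, y):
    """Objective (4.2) and gradient (4.3) for multinomial logistic regression.

    W: (n_classes, n_features) parameters, X: (m, n_features) features,
    y: (m,) integer class labels."""
    m = X.shape[0]
    scores = X @ W.T                      # scores[h, i] = w_i^T x_h
    shift = scores.max(axis=1)
    logZ = np.log(np.sum(np.exp(scores - shift[:, None]), axis=1)) + shift
    J = (logZ - scores[np.arange(m), y]).mean()
    P = softmax_rows(scores)              # P[h, i] = P(i, h)
    Y = np.zeros_like(P)
    Y[np.arange(m), y] = 1.0              # one-hot labels
    G = (P - Y).T @ X / m                 # (n_classes, n_features) gradient
    return J, G
```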

The subsampling methods rely on the fact that Hessian-vector products can be computed efficiently. One can show that, given a vector $v \in \mathbb{R}^n$, if we divide it into segments indexed by class labels, so that $v_i \in \mathbb{R}^{NF}$ for $i \in C$, we have

$$[\nabla^2 J(w)\, v]_{(i,j)} = \frac{1}{m} \sum_{h=1}^{m} P(i, h)\, x_h(j) \left( v_i^T x_h - \frac{\sum_{j \in C} v_j^T x_h \exp(w_j^T x_h)}{\sum_{j \in C} \exp(w_j^T x_h)} \right). \qquad (4.4)$$

We have used the pair $(i, j)$ to index the vector $\nabla^2 J(w) v$ (which is of dimension $|C| \times NF$), where $i \in C$ refers to the class and $j$ denotes the feature (i.e., $j \in \{1, 2, \ldots, 79\}$). Thus, each entry of the Hessian-vector product is composed of a summation over the set of data points, and this presents an opportunity to incorporate second order information in a controlled manner, such that the advantages of curvature information are balanced against the processing cost. Formula (4.4) represents a summation across all terms within the data set $D$; to sample the Hessian, we select a subsample $S_k$ as described above and include only the corresponding terms in the summation.
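The corresponding subsampled Hessian-vector product (4.4), with the summation restricted to a subsample $S$, can be sketched as follows (our code; V is the vector $v$ reshaped into per-class rows, and the class probabilities are recomputed on the subsample).

```python
import numpy as np

def subsampled_hess_vec(W, V, X, S):
    """Product of the subsampled Hessian (4.4) with a direction V.

    W, V: (n_classes, n_features) arrays (V is v, arranged by class),
    X: (m, n_features) features, S: array of subsample indices."""
    Xs = X[S]                                  # keep only the terms indexed by S
    scores = Xs @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)          # P[h, i] = P(i, h) on the subsample
    A = Xs @ V.T                               # A[h, i] = v_i^T x_h
    mean_A = (P * A).sum(axis=1, keepdims=True)   # sum_i P(i, h) v_i^T x_h
    return (P * (A - mean_A)).T @ Xs / len(S)  # (n_classes, n_features) result
```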

4.2. Testing the subsampled Hessian Newton method. In Figure 1, we compare the behavior of three methods:

(i) The standard L-BFGS method [6] with memory $t = 20$.
(ii) The classical Newton-CG method (CN) [12, p. 169] with maxcg = 10 and full Hessian information, i.e., with $p = 100\%$, where $p$ is defined in (4.1).
(iii) The subsampled Hessian Newton method (SN) (Algorithm S-Newton) with maxcg = 10 and $p = 5\%$.

As function, gradient, and Hessian-vector product evaluations are (by far) the most costly computations in the three algorithms tested, the computational effort in expressions (4.2), (4.3), and (4.4) will be used in lieu of CPU time. We refer to this computational effort as the "number of accessed points." To further define this term, we can


observe that the objective function is a sum of terms, where each term is associated with a single training point. As a result, when computing the objective function, we can separate the overall computation into smaller blocks, where each block is associated with a single training point. All blocks therefore have essentially identical wall times and, when aggregated, represent the objective function. Similar blocks appear in the gradient and Hessian-vector products, and each of these has approximately the same wall time as the blocks that collectively represent the function evaluation. We refer to the total number of computed blocks as the "number of accessed data points." In all figures in this paper, the horizontal axis plots this number. (Plots based on CPU time exhibit similar behavior to those presented.) The vertical axis plots the probability of correct classification, which is defined as $\exp(-J(w_k))$, with $J$ given in (4.2).

We observe from the behavior presented in Figure 1 that the subsampled Hessian Newton method is the most efficient of all the methods for a correct classification probability greater than 0.08. Notably, for a probability level of 13% correct classification, the subsampled Newton method is approximately three times as fast as the classical Newton-CG method and twice as fast as the L-BFGS method. Note that the number of accessed points is of order $10^6$ and the number of training points is about 168,776.

In Figure 2 we analyze the behavior of the subsampled Hessian Newton method as the CG iteration limit maxcg varies, while fixing the sampling percentage at $p = 5\%$. We report results for the settings maxcg = 2, 5, 10, 50. From these plots, we observe that the subsampled Hessian Newton method is not effective for maxcg = 2, but is quite efficient for maxcg = 10 and maxcg = 50. This behavior is similar for higher sampling percentages, such as $p = 10\%$. The CG limit was reached at most iterations. Note that little is lost in terms of efficiency by truncating the conjugate gradient limit maxcg from 50 to 10. Given this observation, we fix the conjugate gradient limit to the value maxcg = 10 and report in Figure 3 the performance of the subsampled Hessian Newton method for varying Hessian-vector product sampling percentages $p$, specifically

FIG. 1. Comparison of L-BFGS vs. subsampled Hessian Newton (SN) vs. classical Newton (CN).


for $p = 1\%$, 10%, 50%, 100%. We observe from Figure 3 that the subsampled Hessian Newton algorithm outperforms the benchmark L-BFGS algorithm for sampling percentages of $p = 1\%$ and 10%. Exhaustive testing indicates that, for this problem, the new algorithm is efficient for the range $p \in [1\%, 25\%]$.

FIG. 2. Behavior of the subsampled Hessian Newton method (SN), with $p = 5\%$, for varying maxcg values.

FIG. 3. Behavior of the subsampled Hessian Newton method (SN), with maxcg = 10, for varying values of the Hessian subsampling percentage $p$.


4.3. Testing the stochastically initialized L-BFGS method. In Figure 4, we compare the behavior of the standard L-BFGS method with memory $t = 5$ and the stochastically initialized L-BFGS method (Algorithm SLM) with $t = 5$, maxcg = 5, and two choices of $p$, namely $p = 5\%$ and 100%. From this graph, we observe that subsampling the Hessian-vector products at the level $p = 5\%$ yields a clear acceleration over the traditional L-BFGS method, especially as the iterates approach the limiting classification probability. More specifically, for an average classification probability of 13%, the number of accessed data points required (or, similarly, the CPU time) is approximately half of that required by L-BFGS. On the other hand, with full sampling of the Hessian-vector product, that is, $p = 100\%$, Algorithm SLM is significantly less efficient than the two other methods.

In order to analyze the effect of maxcg on Algorithm SLM, we fix $p = 5\%$ and report in Figure 5 results for the settings maxcg = 2, 5, 10, and 50, in comparison to the standard L-BFGS algorithm with memory size $t = 5$. We observe from this figure that Algorithm SLM performs well for relatively small values of maxcg. However, unlike S-Newton, a large number of CG iterations is observed to be detrimental to the performance of the algorithm. Therefore, with a relatively small number of CG iterations, below a threshold observed to be maxcg = 10, the stochastically initialized L-BFGS method outperforms the standard L-BFGS algorithm.

To evaluate the behavior of Algorithm SLM for varying sampling percentages $p$, we report in Figure 6 the results for $p = 1\%$, 50%, 100%, with maxcg = 10 and memory setting $t = 5$. This figure clearly shows the gains in efficiency achieved by Algorithm SLM as the sampling percentage $p$ decreases. For high sampling percentages, which lead to a larger set $S$, the information in the sample does not provide enough improvement over the benchmark L-BFGS algorithm to compensate for the additional processing time required. In comparison, for small sampling percentages such as $p = 1\%$, the balance between the value of the information and the

FIG. 4. Comparison of L-BFGS vs. SLM with $p = 5\%$, 100%, and maxcg = 5.


computing time favors Algorithm SLM, which achieves substantial gains over standard L-BFGS.

4.4. Comparison of the two methods. In Figure 7 we compare the performance of the subsampled Hessian Newton method and the stochastically initialized L-BFGS

FIG. 5. Behavior of Algorithm SLM, with $p = 5\%$, for varying maxcg values.

FIG. 6. Behavior of Algorithm SLM, with maxcg = 10, for $p = 1\%$, 50%, and 100%.


method, where the subsample percentage $p = 5\%$ is used. Both methods contain several parameters that affect their performance, and in Figure 7 we chose two settings for each method, namely maxcg = 5 and 10. We see that the two approaches appear to be similar in terms of performance for this problem. In general, we feel that both algorithms can be effective for machine learning applications and that the best choice among them may be problem dependent.

5. Conclusions. We have proposed in this paper that Hessian subsampling via a matrix-free conjugate gradient iteration is an effective way of accelerating optimization methods for machine learning. Our method avoids sampling the second derivatives directly, since this can lead to very noisy estimators; see [16]. We described two methods that can benefit from this approach: one is a variant of Newton-CG and the other of L-BFGS. There are a variety of ways of implementing these methods, depending on the choice of the function/gradient sample $X_k$ and the Hessian subsample $S_k$ at every iteration. In this paper, we have focused on the case when $X_k$ is large; i.e., we have followed a batch (or SAA) approach. A key feature of our methods is that the Hessian subsample $S_k$ is much smaller than $X_k$.

The subsampled Hessian Newton-CG method overcomes one of the main drawbacks of inexact (or truncated) Newton methods, namely, the high cost of computing a search direction, by significantly lowering the cost of the CG iteration. This is possible for stochastic objective functions of the form (1.2) because the computational cost of a Hessian-vector product decreases linearly with the sample size, and because small sample sizes provide useful curvature information. The stochastic Newton-CG method might be further accelerated by preconditioning the CG iteration, but we have not explored that topic in this paper. The limited memory BFGS method benefits from the fact that the stochastic Hessian information complements the curvature information provided by quasi-Newton updating.

FIG. 7. Comparing S-Newton vs. SLM (with memory $t = 5$), for maxcg = 5, 10.


This work was motivated, in part, by the availability of distributed computing environments that allow the parallel computation of very expensive loss functions involving many thousands of training points, as well as by the possibility of computing Hessian-vector products in parallel and at modest cost. Given the latency due to communication in such a setting, it is convenient to work with batch sizes that are not very small, and this in turn justifies the use of deterministic optimization techniques. The potential of the subsampled Hessian methods was illustrated in a speech recognition problem with data generated at Google.

Acknowledgments. The authors acknowledge several insightful conversations with Tito Homem-de-Mello, Alexander Shapiro, and Yoram Singer. They thank Figen Oztoprak for valuable advice on how to make the function evaluations and Hessian-vector products more efficient.

REFERENCES

[1] L. Bottou, Online learning and stochastic approximations, in Online Learning and Neural Networks, Vol. 17, Cambridge University Press, Cambridge, 1998, pp. 9–42.
[2] R. Fletcher, Practical Methods of Optimization, 2nd ed., J. Wiley and Sons, Chichester, England, 1987.
[3] I. Griva, S. G. Nash, and A. Sofer, Linear and Nonlinear Optimization, 2nd ed., SIAM, Philadelphia, 2008.
[4] J. Yu, S. V. N. Vishwanathan, S. Günter, and N. N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, J. Mach. Learn. Res., 11 (2010), pp. 1145–1200.
[5] C. J. Lin, R. C. Weng, and S. S. Keerthi, Trust region Newton method for logistic regression, J. Mach. Learn. Res., 9 (2008), pp. 627–650.
[6] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program., 45 (1989), pp. 503–528.
[7] D. G. Luenberger, Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, Reading, MA, 1984.
[8] R. Malouf, A comparison of algorithms for maximum entropy parameter estimation, in Proceedings of the Sixth Conference on Natural Language Learning, Taipei, Taiwan, 2002.
[9] J. Martens, Deep learning via Hessian-free optimization, in Proceedings of the 27th International Conference on Machine Learning (ICML), Vol. 95, 2010.
[10] T. P. Minka, A Comparison of Numerical Optimizers for Logistic Regression, Technical report, Microsoft Research, 2003.
[11] S. G. Nash, Newton-type minimization via the Lanczos method, SIAM J. Numer. Anal., 21 (1984), pp. 770–788.
[12] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., Springer Ser. Oper. Res., Springer, New York, 2006.
[13] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.
[14] A. Ruszczynski and A. Shapiro, Stochastic Programming, Handbook in Operations Research and Management Science, Elsevier Science, New York, 2003.
[15] F. Sha and F. Pereira, Shallow parsing with conditional random fields, in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1, 2003, pp. 131–141.
[16] A. Shapiro and T. Homem-de-Mello, A simulation-based approach to two-stage stochastic programming with recourse, Math. Program., 81 (1998), pp. 301–325.
[17] T. Steihaug, The conjugate gradient method and trust regions in large scale optimization, SIAM J. Numer. Anal., 20 (1983), pp. 626–637.
