Overview and Recent Advances in Derivative Free Optimization
Katya Scheinberg
Joint work with A. Berahas, J. Blanchet, L. Cao, C. Cartis, A. R. Conn, M. Menickelly, C. Paquette, L. Vicente
School of Operations Research and Information Engineering
IPAM Workshop: From Passive to Active: Generative and Reinforcement Learning with Physics, Sept 23-27, 2019
Local and Global Optimization
From Roos, Terlaky and De Klerk, "Nonlinear Optimisation", 2002.
Optimization and gradient descent
Black Box Optimization Problems
$$\min_{x\in\mathbb{R}^n} f(x)$$
[Diagram: x → BLACK BOX → f(x)]
f nonlinear function; derivatives of f not available
Noisy functions, stochastic or deterministic:
$$\min_{x\in\mathbb{R}^n} f(x) = \phi(x) + \varepsilon(x) \quad\text{(additive noise)}, \qquad \min_{x\in\mathbb{R}^n} f(x) = \phi(x)\,(1 + \varepsilon(x)) \quad\text{(multiplicative noise)}$$
Motivation
Machine Learning
Source(s): https://blog.statsbot.co/, https://campus.datacamp.com/
Deep Learning
Source(s): https://medium.com/
Reinforcement Learning
Source(s): http://people.csail.mit.edu/
Optimizing properties obtained from expensive simulations or experiments:
Critical temperatures from molecular dynamics simulations (Dignon et al., ACS Cent. Sci., Article ASAP)
Reaction rate estimation from kinetic Monte Carlo simulations
Activation barriers from quantum mechanical nudged elastic band calculations (ZACROS, http://zacros.org/tutorials; Andersen et al., Front. Chem., 2019)
Yield estimation from experimental organic synthesis reactor systems (Holmes et al., React. Chem. Eng., 2016, 1, 36)
• Many examples exist in the domain of molecular and materials science where calculating a property requires expensive computations or experiments
• In many of these cases, derivatives are not available
Derivative-free methods: direct and random search
Iterative algorithms that converge to a local optimum.
In each iteration:
1. Evaluate a set of sample points around the current iterate;
2. Choose the sample point with the best function value;
3. Make this point the next iterate.
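As a rough illustration (a minimal sketch, not any specific method from the talk), a random-search loop might look as follows in Python; the step-size adaptation rule is an assumption of the sketch.

```python
import numpy as np

def random_search(f, x0, sigma=1.0, max_evals=1000, seed=0):
    """Minimal random-search sketch: sample a point near the current
    iterate and keep it if it improves the objective. The step-size
    update rule below is illustrative, not from the talk."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_evals):
        y = x + sigma * rng.standard_normal(x.size)
        fy = f(y)
        if fy < fx:              # move to the improving sample point
            x, fx = y, fy
            sigma *= 1.1         # expand after success (assumed rule)
        else:
            sigma *= 0.95        # contract after failure (assumed rule)
    return x, fx
```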
Derivative-free methods: model-based
Iterative algorithms that converge to a local optimum.
In each iteration:
1. Evaluate a set of sample points around the current iterate;
2. Interpolate the sample points with a linear or quadratic model;
3. Use this model to find the next iterate.
Model-Based Trust Region Method (pioneered by M.J.D. Powell)
[Figure: (a) starting point; (b) initial sampling]
[Figure sequence over the following slides: successive iterations of the model-based trust-region method on a 2D example — sampling, building a model, minimizing it within the trust region, and updating the iterate and radius.]
The method shrinks and expands the trust-region radius, exploits curvature, and is efficient in terms of samples.
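A minimal sketch of one such trust-region loop, assuming a linear model built from central differences and a standard ratio test; this is an illustration, not Powell's actual algorithm. The parameters `eta` and the expand/shrink factors are assumptions.

```python
import numpy as np

def tr_dfo(f, x0, delta0=1.0, max_iter=100, eta=0.1):
    """Sketch of a model-based trust-region iteration: build a linear
    model from samples around x, step to its minimizer on the trust
    region, and adjust the radius from actual vs. predicted decrease."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    fx, delta = f(x), delta0
    for _ in range(max_iter):
        # model gradient via central differences at radius delta
        g = np.array([(f(x + delta * e) - f(x - delta * e)) / (2 * delta)
                      for e in np.eye(n)])
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-8:
            break
        s = -delta * g / gnorm          # linear-model minimizer on the boundary
        pred = -g @ s                   # predicted decrease (= delta * ||g||)
        fy = f(x + s)
        rho = (fx - fy) / pred          # actual / predicted decrease
        if rho >= eta:
            x, fx = x + s, fy
            delta *= 2.0                # successful step: expand
        else:
            delta *= 0.5                # unsuccessful step: shrink
    return x, fx
```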
Direct Search
11307 function evaluations
Random Search
3705 function evaluations
Trust Region Method
69 function evaluations
Active learning, generative models and derivative free optimization
What does model-based derivative-free optimization do?
Using some "labeled" data (x, f(x)), build a model m(x). What do we want from that model m(x)? Quality? Simplicity?
Optimize m(x), or a "related function", to obtain a new, potentially interesting data point. What do we optimize?
Modify the model (how?), and repeat.
What do we need for convergence?
Assumptions on models for convergence
For trust region, first-order convergence:
$$\|\nabla f(x_k) - \nabla m_k(x_k)\| \le O(\Delta_k)$$
For trust region, second-order convergence:
$$\|\nabla^2 f(x_k) - \nabla^2 m_k(x_k)\| \le O(\Delta_k), \qquad \|\nabla f(x_k) - \nabla m_k(x_k)\| \le O(\Delta_k^2)$$
For line search, first-order convergence:
$$\|\nabla f(x_k) - \nabla m_k(x_k)\| \le O(\alpha_k \|\nabla m_k(x_k)\|)$$
Intuition
In other words, the model should match the Taylor expansion of the true function to an accuracy commensurate with the step size.
When the models are random (e.g., built from randomly sampled points), the same conditions need only hold with probability $1-\delta$ at each iteration; for example, for first-order trust-region convergence,
$$\|\nabla f(x_k) - \nabla m_k(x_k)\| \le O(\Delta_k) \quad \text{w.p. } 1-\delta,$$
and analogously for the second-order and line-search conditions.
Building models via linear interpolation
$$m(y) = f(x) + g(x)^T(y-x), \qquad m(y) = f(y)\ \ \forall y \in Y.$$
Let $Y = \{x+\sigma y_1, \dots, x+\sigma y_n\}$, $\sigma > 0$, and define
$$F_Y = \begin{pmatrix} f(x+\sigma y_1) - f(x) \\ \vdots \\ f(x+\sigma y_n) - f(x) \end{pmatrix} \in \mathbb{R}^n, \qquad M_Y = \begin{pmatrix} y_1^T \\ \vdots \\ y_n^T \end{pmatrix} \in \mathbb{R}^{n\times n}.$$
The model $m(y)$ is constructed to satisfy the interpolation conditions:
$$\sigma M_Y g = F_Y$$
Theorem [Conn, Scheinberg & Vicente, 2008]
Let $Y = \{x, x+\sigma y_1, \dots, x+\sigma y_n\}$ be a set of interpolation points such that $\max_i \|y_i\| \le 1$ and $M_Y$ is nonsingular. Suppose that the function $f$ has $L$-Lipschitz continuous gradients. Then
$$\|\nabla m(x) - \nabla f(x)\| \le \frac{\|M_Y^{-1}\|_2 \sqrt{n}\,\sigma L}{2}.$$
Cost: $O(n^3)$ in general (reduces to $O(n^2)$ if $M_Y$ is orthonormal and to $O(n)$ if $M_Y = I$)
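A direct transcription of this construction in Python (a sketch; choosing good directions `dirs` is left to the caller):

```python
import numpy as np

def linear_model_gradient(f, x, dirs, sigma):
    """Solve the interpolation conditions sigma * M_Y g = F_Y for the
    model gradient g. `dirs` is an n x n array whose rows are the
    directions y_i (i.e., M_Y); it must be nonsingular.
    Cost: O(n^3) for a general M_Y."""
    x = np.asarray(x, dtype=float)
    M = np.asarray(dirs, dtype=float)                   # M_Y, rows y_i^T
    fx = f(x)
    F = np.array([f(x + sigma * y) - fx for y in M])    # F_Y
    return np.linalg.solve(sigma * M, F)                # g

# With M_Y = I the solve reduces to forward finite differences:
# g_i = (f(x + sigma * e_i) - f(x)) / sigma.
```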
Quadratic Interpolation Models
$$m(y) = f(x) + g(x)^T(y-x) + \tfrac{1}{2}(y-x)^T H(x)(y-x), \qquad m(y) = f(y)\ \ \forall y \in Y.$$
Let $Y = \{x+\sigma y_1, \dots, x+\sigma y_N\}$, $\sigma > 0$, and define
$$F_Y = \begin{pmatrix} f(x+\sigma y_1) - f(x) \\ \vdots \\ f(x+\sigma y_N) - f(x) \end{pmatrix} \in \mathbb{R}^N, \qquad M_Y = \begin{pmatrix} y_1^T & \mathrm{vec}(y_1 y_1^T)^T \\ \vdots & \vdots \\ y_N^T & \mathrm{vec}(y_N y_N^T)^T \end{pmatrix} \in \mathbb{R}^{N\times N}.$$
The model $m(y)$ is constructed to satisfy the interpolation conditions:
$$\sigma M_Y (g, \mathrm{vec}(H)) = F_Y$$
Theorem [Conn, Scheinberg & Vicente, 2008]
Let $Y = \{x, x+\sigma y_1, \dots, x+\sigma y_{n+n(n+1)/2}\}$ be a set of interpolation points such that $\max_i \|y_i\| \le 1$ and $M_Y$ is nonsingular. Suppose that the function $f$ has $L$-Lipschitz continuous Hessians. Then
$$\|\nabla m(x) - \nabla f(x)\| \le O(\|M_Y^{-1}\|_2\, n\, \sigma^2 L), \qquad \|\nabla^2 m(x) - \nabla^2 f(x)\| \le O(\|M_Y^{-1}\|_2\, n\, \sigma L).$$
Cost: $O(n^6)$
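A sketch of the quadratic model build, with the $\sigma$-scalings written out explicitly and $H$ parametrized by its $n(n+1)/2$ upper-triangular entries (so the system matrix is square of size $N = n + n(n+1)/2$, playing the role of $M_Y$); the symmetric parametrization is an implementation choice of the sketch.

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_model(f, x, dirs, sigma):
    """Build (g, H) from the interpolation conditions
       sigma * y^T g + (sigma^2/2) * y^T H y = f(x + sigma*y) - f(x)
    for each direction y in `dirs`. Requires N = n + n(n+1)/2
    directions giving a nonsingular system; solving costs O(n^6)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    fx = f(x)
    pairs = list(combinations_with_replacement(range(n), 2))  # (i, j), i <= j
    M, rhs = [], []
    for y in dirs:
        y = np.asarray(y, dtype=float)
        # y^T H y = sum_{i<=j} c_ij H_ij, c_ij = 2 y_i y_j (i<j), y_i^2 (i=j)
        quad = [(1.0 if i == j else 2.0) * y[i] * y[j] for i, j in pairs]
        M.append(np.concatenate([sigma * y, 0.5 * sigma**2 * np.array(quad)]))
        rhs.append(f(x + sigma * y) - fx)
    sol = np.linalg.solve(np.array(M), np.array(rhs))
    g = sol[:n]
    H = np.zeros((n, n))
    for (i, j), h in zip(pairs, sol[n:]):
        H[i, j] = H[j, i] = h
    return g, H
```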
Interpolation model quality
Model deterioration
Some conclusions so far
Interpolation models allow old points to be reused and hence are very economical in terms of samples.
Linear algebra is expensive and, more importantly, can be ill-conditioned.
The linear algebra cost and conditioning can be improved by using pre-designed sample sets, but this is more expensive in terms of samples (e.g., finite differences need n samples per gradient estimate).
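For concreteness, a minimal sketch of the finite-difference baseline just mentioned (the step size `h` is an assumed default):

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward finite-difference gradient: n extra evaluations per
    estimate, i.e., the pre-designed sample set Y = {x + h*e_i}."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    return np.array([(f(x + h * e) - fx) / h for e in np.eye(x.size)])
```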
What alternatives are there?
Gaussian Smoothing
$$F(x) = \mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\, f(x+\sigma\varepsilon) = \int_{\mathbb{R}^n} f(x+\sigma\varepsilon)\, \pi(\varepsilon\,|\,0,I)\, d\varepsilon$$
$\pi(y\,|\,x,\Sigma)$ is the pdf of $\mathcal{N}(x,\Sigma)$ evaluated at $y$.
$F(x)$ is a Gaussian-smoothed approximation to $f(x)$, and
$$\nabla F(x) = \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\, f(x+\sigma\varepsilon)\,\varepsilon$$
Idea: Approximate $\nabla f(x)$ by a sample average approximation of $\nabla F(x)$:
$$g(x) = \frac{1}{N\sigma}\sum_{i=1}^{N} f(x+\sigma\varepsilon_i)\,\varepsilon_i$$
Issue: Variance → ∞ as σ → 0
Fix: subtract $f(x)$ inside the expectation; since $\mathbb{E}[\varepsilon] = 0$ this leaves $\nabla F(x)$ unchanged but controls the variance:
$$\nabla F(x) = \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)} \big[(f(x+\sigma\varepsilon) - f(x))\,\varepsilon\big]$$
and the sample average approximation becomes
$$g(x) = \frac{1}{N\sigma}\sum_{i=1}^{N} \big(f(x+\sigma\varepsilon_i) - f(x)\big)\,\varepsilon_i$$
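A direct sketch of this estimator (the default `N = 4n` mirrors the choice used in the experiments later in the talk):

```python
import numpy as np

def gsg_gradient(f, x, sigma=0.01, N=None, rng=None):
    """Gaussian smoothed gradient (GSG) estimate:
       g(x) = (1/(N*sigma)) * sum_i (f(x + sigma*eps_i) - f(x)) * eps_i,
    a sample average approximation of grad F(x)."""
    x = np.asarray(x, dtype=float)
    N = N or 4 * x.size
    rng = rng or np.random.default_rng()
    fx = f(x)
    g = np.zeros_like(x)
    for _ in range(N):
        eps = rng.standard_normal(x.size)
        g += (f(x + sigma * eps) - fx) * eps
    return g / (N * sigma)
```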
Gaussian Smoothing
N = 1, theoretical analysis of convergence rates for convex problems
used in reinforcement learning, no theory, N is large
uses interpolation on top of sample average approximation
uniform distribution on a ball for online learning
uniform distribution on a ball for model-free LQR
Analysis of Variance for Gaussian Smoothing
$$\|g(x) - \nabla f(x)\| \le \underbrace{\|g(x) - \nabla F(x)\|}_{\text{sample average error}} + \underbrace{\|\nabla F(x) - \nabla f(x)\|}_{\text{smoothing error}} \le r + \sqrt{n}\,\sigma L$$
Theorem [Berahas, Cao, S., 2019]
Suppose that the function $f(x)$ has $L$-Lipschitz continuous gradients, and let $g(x)$ denote the GSG approximation to $\nabla f(x)$. If
$$N \ge \frac{1}{\delta r^2}\left(3n\|\nabla f(x)\|^2 + \frac{n(n^2+6n+8)\,L^2\sigma^2}{4}\right),$$
then $\|g(x) - \nabla f(x)\| \le r + \sqrt{n}\,\sigma L$ with probability at least $1-\delta$.
Essentially N ∼ 3n
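The bound is easy to evaluate numerically; a direct transcription (the example values in the comment are illustrative assumptions):

```python
import math

def gsg_sample_size(n, grad_norm, L, sigma, r, delta):
    """Smallest N satisfying the sample-size bound in the theorem above."""
    bound = (3 * n * grad_norm**2
             + n * (n**2 + 6*n + 8) * L**2 * sigma**2 / 4) / (delta * r**2)
    return math.ceil(bound)

# e.g., n=20, ||grad f||=1, L=2, sigma=0.01, r=0.5, delta=0.1: the first
# term dominates, so N scales like 3n * ||grad f||^2 / (delta * r^2).
```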
Gradient Approximation Accuracy
Numerical experiment setup and results:
$$f(x) = \sum_{i=1}^{n/2} \big(M \sin(x_{2i-1}) + \cos(x_{2i})\big) + \frac{L-M}{2n}\, x^T \mathbf{1}_{n\times n}\, x,$$
which has $\|\nabla f(0)\| = \sqrt{n/2}\, M$. We use $n = 20$, $M = 1$, $L = 2$, $\sigma = 0.01$, and $N = 4n$ for the smoothing methods.
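A sketch of this test function, with a finite-difference sanity check of the stated gradient norm (the check itself is an addition, not from the slides):

```python
import numpy as np

def f_test(x, M=1.0, L=2.0):
    """Test function from the experiment above."""
    n = x.size
    odd, even = x[0::2], x[1::2]            # x_{2i-1}, x_{2i} (1-based)
    quad = (L - M) / (2 * n) * np.sum(x)**2  # x^T 1_{nxn} x = (sum_i x_i)^2
    return np.sum(M * np.sin(odd) + np.cos(even)) + quad

# sanity check: ||grad f(0)|| should equal sqrt(n/2) * M
n, h = 20, 1e-6
g0 = np.array([(f_test(h * e) - f_test(-h * e)) / (2 * h) for e in np.eye(n)])
print(np.linalg.norm(g0), np.sqrt(n / 2) * 1.0)   # both ~ 3.1623
```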
Gradient Approximation Accuracy
[Figure: gradient approximation accuracy results for the setup above.]
Algorithm Performance
Moré & Wild problem set (53 smooth problems)
[Figure: performance and data profiles for the best variant of each method — FFD (LBFGS), CFD (LBFGS), GSG (SD, n), BSG (LBFGS, 4n), DFOTR. Top row: performance profiles at τ = 10⁻¹, 10⁻³, 10⁻⁵; bottom row: data profiles (fraction of problems solved vs. number of function evaluations) at the same accuracies.]
Algorithm Performance
FD = forward finite difference; LIOD = linear interpolation of orthogonal directions; LS = (backtracking) line search; GSG = Gaussian smoothed gradient
[Figure: reward vs. iterations for FD, LIOD, LIOD (LS), and GSG on the Swimmer, HalfCheetah, and Reacher tasks.]
Reinforcement learning tasks: Swimmer (left), HalfCheetah (center), Reacher (right).
Conclusions
Model-based derivative-free methods are efficient and theoretically sound.
Select the type of model according to the application, but make sure the theory applies.
Use randomization only when necessary, as it can slow down convergence.
Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2008.
Albert Berahas, Liyuan Cao, Krzysztof Choromanski, and Katya Scheinberg. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. arXiv preprint arXiv:1905.01332, 2019.
Jeffrey Larson, Matt Menickelly, and Stefan M. Wild. Derivative-free optimization methods. arXiv preprint arXiv:1904.11585, 2019.
Thank you!