Derivative free optimization via repeated classification

    Tatsunori B. Hashimoto Steve Yadlowsky John C. Duchi

Department of Statistics, Stanford University, Stanford, CA, 94305
{thashim, syadlows, jduchi}@stanford.edu

    Abstract

We develop an algorithm for minimizing a function using n batched function value measurements at each of T rounds by using classifiers to identify a function's sublevel set. We show that sufficiently accurate classifiers can achieve linear convergence rates, and show that the convergence rate is tied to the difficulty of active learning sublevel sets. Further, we show that the bootstrap is a computationally efficient approximation to the necessary classification scheme.

The end result is a computationally efficient derivative-free algorithm requiring no tuning that consistently outperforms other approaches on simulations, standard benchmarks, real-world DNA binding optimization, and airfoil design problems whenever batched function queries are natural.

    1 Introduction

Consider the following abstract problem: given access to a function f : X → R, where X is some space, find x ∈ X minimizing f(x). We study an instantiation of this problem that trades sequential access to f for large batches of parallel queries—one can query f for its value over n points at each of T rounds. In this setting, we propose a general algorithm that effectively optimizes f whenever there is a family of classifiers h : X → [0, 1] that can predict sublevel sets of f with high enough accuracy.

Our main motivation comes from settings in which n is large—on the order of hundreds to thousands—while possibly small relative to the size of X. These types of problems occur in biological assays [21], physical simulations [27], and reinforcement learning problems [33], where parallel computation or high-throughput measurement systems allow efficient collection of large batches of data. More concretely, consider the optimization of protein binding affinity to DNA sequence targets from biosensor data [11, 21, 38]. In this case, assays measure binding of n ≥ 1000 sequences and are inherently parallel due to the fixed costs of setting up an experiment, while the time to measure a collection of sequences makes multiple sequential tests prohibitively time-consuming (so T must be small). In such problems, it is typically difficult to compute the gradients of f (if they even exist); consequently, we focus on derivative-free optimization (DFO, also known as zero-order optimization) techniques.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

    1.1 Problem statement and approach

The batched derivative free optimization problem consists of a sequence of rounds t = 1, 2, . . . , T in which we propose a distribution p(t), draw a sample of n candidates Xi ∼ p(t) (i.i.d.), and observe Yi = f(Xi). The goal is to find at least one example Xi for which the gap

min_i f(Xi) − inf_{x∈X} f(x)

is small.

Our basic idea is conceptually simple: In each round, fit a classifier h predicting whether Yi ≶ α(t) for some threshold α(t). Then, upweight points x that h predicts as f(x) < α(t) and downweight the other points x for the proposal distribution p(t) for the next round.

This algorithm is inspired by classical cutting-plane algorithms [30, Sec. 3.2], which remove a constant fraction of the remaining feasible space at each iteration, and is extended into the stochastic setting based on multiplicative weights algorithms [25, 3]. We present the overall algorithm as Algorithm 1.

Algorithm 1 Cutting-planes using classifiers

Require: Objective f, Action space X, hypothesis class H.

1: Set p(0)(x) = 1/|X|
2: Draw X(0) ∼ p(0)
3: Observe Y(0) = f(X(0))
4: for t ∈ {1 . . . T} do
5:   Set α(t) = median({Y(t−1)_i}, i = 1, . . . , n)
6:   Set h(t) ∈ H as the loss minimizer of L over (X(0), Y(0) > α(t)), . . . , (X(t−1), Y(t−1) > α(t))
7:   Set p(t)(x) ∝ p(t−1)(x)(1 − ηh(t)(x))
8:   Draw X(t) ∼ p(t)
9:   Observe Y(t) = f(X(t))
10: end for
11: Set i∗ = arg min_i Y(T)_i
12: return X(T)_i∗
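
To make the loop concrete, the following is a minimal Python sketch of Algorithm 1 over a finite action space. The helper name, the random forest classifier, the batch size n, and the step size η are illustrative assumptions rather than part of the algorithm's statement; f is assumed to map an array of candidate feature vectors to an array of function values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify_and_cut(f, X_space, n=100, T=10, eta=0.5, seed=0):
    """Sketch of Algorithm 1 on a finite action space X_space (rows are feature vectors)."""
    rng = np.random.default_rng(seed)
    p = np.full(len(X_space), 1.0 / len(X_space))      # p^(0): uniform over X
    idx = rng.choice(len(X_space), size=n, p=p)        # draw X^(0)
    X_hist, Y_hist = X_space[idx], f(X_space[idx])     # observe Y^(0)
    for t in range(1, T + 1):
        alpha = np.median(Y_hist[-n:])                 # alpha^(t): median of the previous batch
        h = RandomForestClassifier(n_estimators=100, random_state=seed)
        h.fit(X_hist, (Y_hist > alpha).astype(int))    # classify "above the level set"
        p = p * (1.0 - eta * h.predict(X_space))       # multiplicative weights update
        p = p / p.sum()
        idx = rng.choice(len(X_space), size=n, p=p)    # draw X^(t)
        X_hist = np.vstack([X_hist, X_space[idx]])
        Y_hist = np.concatenate([Y_hist, f(X_space[idx])])
    best = np.argmin(Y_hist[-n:])                      # best point of the final batch
    return X_hist[-n:][best]
```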

    1.2 Related work

When, as is typical in optimization, one has substantial sequential access to f, meaning that T can be large, there are a number of major approaches to optimization. Bayesian optimization [34, 7] and kernel-based bandits [9] construct an explicit surrogate function to minimize; often, one assumes it is possible to perfectly model the function f. Local search algorithms [12, 26] emulate gradient descent via finite-difference and local function evaluations. Our work differs conceptually in two ways: first, we think of T as being small, while n is large, and second, we represent a function f by approximating its sublevel sets. Existing batched derivative-free optimizers encounter computational difficulties for batch sizes beyond dozens of points [16]. Our sublevel set approach scales to large batches of queries by simply sampling from the current sublevel set approximation.

While other researchers have considered level set estimation in the context of Bayesian optimization [17, 7] and evolutionary algorithms [29], these use the level set to augment a traditional optimization algorithm. We show good sublevel set predictions alone are sufficient to achieve linear convergence. Moreover, given the extraordinary empirical success of modern classification algorithms, e.g. deep networks for image classification [22], it is natural to develop algorithms for derivative-free optimization based on fitting a sequence of classifiers. Yu et al. [40] also propose classification-based optimization, but their approach assumes a classifier constrained to never misclassify near the optimum, making the problem trivial.

    1.3 Contributions

We present Algorithm 1 and characterize its convergence rate with appropriate classifiers and show how it relates to measures of difficulty in active learning. We extend this basic approach, which may be computationally challenging, to an approach based on bootstrap resampling that is empirically quite effective and—in certain nice-enough scenarios—has provable guarantees of convergence.

We provide empirical results on a number of different tasks: random (simulated) problems, airfoil (device) design based on physical simulators, and finding strongly-binding proteins based on DNA assays. We show that a black-box approach with random forests is highly effective within a few rounds T of sequential classification; this approach provides advantages in the large batch setting.

The approach to optimization via classification has a number of practical benefits, many of which we verify experimentally. It is possible to incorporate prior knowledge in DFO through domain-specific classifiers, and in more generic optimization problems one can use black-box classifiers such as random forests. Any sufficiently accurate classifier guarantees optimization performance and can leverage the large-batch data collection biological and physical problems essentially necessitate. Finally, one does not even need to evaluate f: it is possible to apply this framework with pairwise comparison or ordinal measurements of f.

    2 Cutting planes via classification

Our starting point is a collection of “basic” results that apply to classification-based schemes and associated convergence results. Throughout this section, we assume we fit classifiers using pairs (x, z), where z is a 0/1 label of negative (low f(x)) or positive (high f(x)) class. We begin by demonstrating that two quantities govern the convergence of the optimizer: (1) the frequency with which the classifier misclassifies (and thus downweights) the optimum x∗ relative to the multiplicative weight η, and (2) the fraction of the feasible space each iteration removes.

If the classifier h(t)(x) exactly recovers the sublevel set (h(t)(x) = 0 iff f(x) < α(t)), α(t) is at most the population median of f(X(t)), and X is finite, the basic cutting plane bound immediately implies that

log[ P_{x∼p(T)}( f(x) = min_{x∗∈X} f(x∗) ) ] ≥ min( T log(2/(2 − η)) − log(|X|), 0 ).

It is not obvious that such a guarantee continues to hold for inaccurate h(t): it may accidentally misclassify the optimum x∗, and the thresholds α(t) may not rapidly decrease the function value. To address these issues, we provide a careful analysis in the coming sections: first, we show the convergence guarantees implied by Algorithm 1 as a function of classification errors (Theorem 1), after which we propose a classification strategy directly controlling errors (Sec. 2.2), and finally we give a computationally tractable approximation (Sec. 3).

    2.1 Cutting plane style bound

We begin with our basic convergence result. Letting p(t) and h(t) be a sequence of distributions and classifiers on X, the convergence rate depends on two quantities: the coverage (number of items cut)

∑_{x∈X} h(t)(x) p(t−1)(x),

and the number of times a hypothesis downweights item x (because f(x) is too large), which we denote MT(x) := ∑_{t=1}^{T} h(t)(x). We have the following

Theorem 1. Let γ > 0 and assume that for all t,

∑_{x∈X} h(t)(x) p(t−1)(x) ≥ γ,

where p(t)(x) ∝ p(t−1)(x)(1 − ηh(t)(x)) as in Alg. 1. Let η ∈ [0, 1/2] and p(0) be uniform. Then for all x ∈ X,

log p(T)(x) ≥ (γη/(η + 2)) T − η(η + 1) MT(x) − log(2|X|).

The theorem follows from a modification of standard multiplicative weight algorithm guarantees [3]; see supplemental section A.1 for a full proof.

We say that our algorithm converges linearly if log p(t)(x) ≳ t. In the context of Theorem 1, the choice of η maximizing −(η² + η)MT(x∗) + (η/(η + 2))γT yields such convergence, as picking η sufficiently small that

T − ((η + 1)(η + 2)/γ) MT(x∗) = Ω(T)

guarantees linear convergence if 2MT(x∗) < Tγ.

A simpler form of the above bound for a fixed η shows the linear convergence behavior.

Corollary 1. Let x ∈ X, where qT(x) := MT(x)/(γT) ≤ 1/4. Under the conditions of Theorem 1,

log(p(T)(x)) ≥ min( 1/5, 1/3 − 4qT(x)/3 ) γT/2 − log(2|X|)

and

1/4 − log(2|X|)/(2γT) ≤ qT(x).

The condition qT(x) ≥ 1/4 − log(2|X|)/(2γT) arises because if MT(x) is small, then eventually we must have p(T)(x) ≥ 1 − γ, and any classifier h which fulfils the condition ∑_{x∈X} h(t)(x)p(t−1)(x) ≥ γ in Thm. 1 must downweight x. At this point, we can identify the optimum exactly with O(1/(1 − γ)) additional draws.

The corollary shows that if MT(x∗) = 0 and γ is bounded below by a constant (such as 1/2), we recover a linear cutting-plane-like convergence rate [cf. 30], which makes constant progress in volume reduction in each iteration.

2.2 Consistent selective strategy for strong control of error

The basic guarantee of Theorem 1 requires relatively few mistakes on x∗, or at least on a point x with f(x) ≈ f(x∗), to achieve good performance in optimization. It is thus important to develop careful classification strategies that are conservative: they do not prematurely cut out values x whose performance is uncertain. With this in mind, we now show how consistent selective classification strategies [15] (related to active learning techniques, and which abstain on “uncertain” examples similar to the Knows-What-It-Knows framework [23, 2]) allow us to achieve linear convergence when the classification problems are realizable using a low-complexity hypothesis class.

The central idea is to only classify an example if all zero-error hypotheses agree on the label, and otherwise abstain. Since any hypothesis achieving zero population error must have zero training set errors, we will only label points in a way consistent with the true labels. El-Yaniv and Wiener [15] define the following consistent selective strategy (CSS).

Definition 1 (Consistent selective strategy). For a hypothesis class H and training sample Sm, the version space VS_{H,Sm} ⊂ H is the set of all hypotheses which perfectly classify Sm. The consistent selective strategy is the classifier

h(x) = 1 if ∀g ∈ VS_{H,Sm}, g(x) = 1;  0 if ∀g ∈ VS_{H,Sm}, g(x) = 0;  no decision otherwise.

Applied to our optimizer, this strategy enables safely downweighting examples whenever they are classified as being outside the sublevel set. Optimization performance guarantees then come from demonstrating that at each iteration the selective strategy does not abstain on too many examples.

The rate of abstention for a selective classifier is related to the difficulty of disagreement based active learning, controlled by the disagreement coefficient [18].

Definition 2. The disagreement ball of a hypothesis class H for distribution P is

B_{H,P}(h, r) := {h′ ∈ H | P(h(X) ≠ h′(X)) ≤ r}.

The disagreement region of a subset G ⊂ H is

Dis(G) := {x ∈ X | ∃h1, h2 ∈ G s.t. h1(x) ≠ h2(x)}.

The disagreement coefficient ∆h of the hypothesis class H for the distribution P is

∆h := sup_{r>0} P(X ∈ Dis(B_{H,P}(h, r))) / r.

The disagreement coefficient directly bounds the abstention rate as a function of generalization error.

Theorem 2. Let h be the CSS classifier in Definition 1, and let h∗ ∈ H be a classifier achieving zero risk. If P(g(X) ≠ h∗(X)) < ε for all g ∈ VS_{H,Sm}, then CSS achieves coverage

P(h(X) = no decision) ≤ ∆h∗ ε.

This follows from the definition of the disagreement coefficient and the size of the version space (Supp. section A.1 contains a full proof).

The dependence of our results on the disagreement coefficient implies a reduction from zeroth order optimization to disagreement based active learning [15] and selective classification [39] over sublevel sets.

Implementing the CSS classifier may be somewhat challenging: given a particular point x, one must verify that all hypotheses consistent with the data classify it identically. In many cases, this requires training a classifier on the current training sample S(t) at iteration t, coupled with x labeled positively, and then retraining the classifier with x labeled negatively [39]. This cost can be prohibitive. (Of course, implementing the multiplicative weights-update algorithm over x ∈ X is in general difficult as well, but in a number of application scenarios we know enough about H to be able to approximate sampling from p(t) in Alg. 1.)
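
The retraining check can be sketched roughly as follows; the helper name is hypothetical, and a linear SVM with a very large regularization constant is assumed here as a proxy for "a consistent hypothesis, if one exists", which is a heuristic stand-in rather than an exact version-space computation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def css_label(X_train, z_train, x):
    """Approximate CSS decision at a single point x via the retraining heuristic:
    try both labels for x and keep only those consistent with a zero-training-error fit.
    Returns 1, 0, or None ("no decision")."""
    consistent = []
    for label in (1, 0):
        X_aug = np.vstack([X_train, x[None, :]])
        z_aug = np.append(z_train, label)
        clf = LinearSVC(C=1e6, max_iter=20000).fit(X_aug, z_aug)   # near hard-margin fit
        if np.all(clf.predict(X_aug) == z_aug):                    # zero empirical error => label consistent
            consistent.append(label)
    return consistent[0] if len(consistent) == 1 else None         # agreement => label; otherwise abstain
```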

A natural strategy is to use the CSS classifier as part of Algorithm 1, setting all no decision outputs to the zero class, only removing points confidently above the level set α(t). That is, in round t of the algorithm, given samples S = (X(t), Z(t)), we define

h(t)(x) = 1 if ∀g ∈ VS_{H,S}, g(x) = 1;  0 if ∀g ∈ VS_{H,S}, g(x) = 0;  0 otherwise.

There is some tension between classifying examples correctly and cutting out bad x ∈ X, which the next theorem shows we can address by choosing large enough sample sizes n.

Theorem 3. Let H be a hypothesis class containing indicator functions for the sublevel sets of f, with VC-dimension V and disagreement coefficient ∆h. There exists a numerical constant C < ∞ such that for all δ ∈ [0, 1], ε ∈ [0, 1], and γ ∈ (∆h ε, 1/2), and

n ≥ max{ C ε^{−1}[V log(ε^{−1}) + log(δ^{−1}) + log(2T)],  (1/(2(γ − 0.5)²))(log(δ^{−1}) + log(2T)) },

with probability at least 1 − δ,

log(p(T)(x∗)) ≥ min{ ((γ − ∆h ε)η/(η + 2)) T − log(2|X|),  log(1 − γ) }

after T rounds of Algorithm 1.

The proof follows from combining the selective classification bound with standard VC dimension arguments to obtain the sample size requirement (Supp. A.1 contains a full proof).

Thus if ∆h is small, such as log(|X|), then choosing ε = ∆h^{−1} achieves exponential improvements over random sampling. In the worst case, ∆h = O(|X|), but small ∆h are known for many problems; for example, for linear classification with continuous X over densities bounded away from zero, ∆h = poly(log(Vol(X))), which would result in linear convergence rates (Theorem 7.16, [18]).

Using recent bounds for the disagreement coefficient for linear separators [5], we can show that for linear optimization over a convex domain, the CSS based optimization algorithm above achieves linear convergence with O(d^{3/2} log(d^{1/2}) − d^{1/2} log(3Tδ)) samples with probability at least 1 − δ (for lack of space, we present this as Theorem A.2 in the supplement).

When the classification problem is non-realizable, but the Bayes-optimal hypothesis does not misclassify x∗, an analogous result holds through the agnostic selective classification framework of Wiener and El-Yaniv [39]. The full result is in supplemental Theorem A.7.

3 Computationally efficient approximations

While selective classification provides sufficient control of error for linear convergence, it is generally computationally intractable. However, a bootstrap resampling algorithm [14] approximates selective classification well enough to provide finite sample guarantees in parametric settings. Our analysis provides intuition for the empirical observation that selective classification via the bootstrap works well in many real-world problems [1].

Formally, consider a parametric family {Pθ}_{θ∈Θ} of conditional distributions Z | X ∈ [0, 1] with compact parameter space Θ. Given n samples X1, . . . , Xn, we observe Zi | Xi ∼ Pθ∗ with θ∗ ∈ int Θ.

Let ℓθ(x, z) = − log(Pθ(z|x)) be the negative log likelihood of z, which majorizes the 0-1 loss of the linear hypothesis class, ℓθ(x, z) ≥ 1{(2z − 1)x⊤θ < 0}. The bootstrap approximation to selective classification resamples the data B times, fits a perturbed estimate θ◦ub on each resample (with the decision-function variance scaled by σ; step 3b), and classifies by consensus:

h◦u(x) = 1 if ∀b ∈ [B], x⊤θ◦ub > 0;  0 if ∀b ∈ [B], x⊤θ◦ub ≤ 0;  no decision otherwise.
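
A rough sketch of this consensus rule for a linear classifier is below. The helper name, the logistic-loss refits, the number of resamples B, and the optional decision-score perturbation of scale sigma are illustrative assumptions standing in for the exact procedure and constants given in the paper's appendix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_consensus(X, z, X_query, B=50, sigma=0.0, seed=0):
    """Consensus of B bootstrap-refit linear classifiers at the rows of X_query.
    Returns 1 / 0 where all resampled fits agree on the sign of the decision
    function, and None (abstain) where they disagree."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample of the data
        if len(np.unique(z[idx])) < 2:                   # skip degenerate resamples
            continue
        fit = LogisticRegression(max_iter=1000).fit(X[idx], z[idx])
        s = fit.decision_function(X_query)
        scores.append(s + sigma * rng.standard_normal(s.shape))  # optional perturbation
    scores = np.stack(scores)                            # (num_fits, num_query)
    out = np.full(len(X_query), None, dtype=object)
    out[np.all(scores > 0, axis=0)] = 1                  # every fit says "above the level set"
    out[np.all(scores <= 0, axis=0)] = 0                 # every fit says "below"
    return out
```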

For linear classifiers with strongly convex losses, this algorithm obtains selective classification guarantees under appropriate regularity conditions, as presented in the following theorem.

Theorem 4. Assume ℓθ is twice differentiable and fulfils ‖∇ℓθ(X, Z)‖ ≤ R and ‖∇²ℓθ(X, Z)‖_op ≤ S almost surely. Additionally, assume Ln(θ, 1) is γ-strongly convex and that ∇²Ln(θ, 1) is M-Lipschitz with probability one.

For h◦ defined above and x ∈ X,

P(x⊤θ∗ ≤ 0 and h◦u(x) = 1) < δ.

Further, the abstention rate is bounded by

∫_{x∈R^d} 1{h◦u(x) = ∅} p(x) dx ≤ ε ∆h

with probability 1 − δ whenever

B ≥ 15 log(3/δ),
σ = O(d^{1/2} + log(1/δ)^{1/2} + n^{−1/2}),
ε = O(σ² n^{−1} log(B/δ)),
and n ≥ 2 log(2d/δ) S/γ².

Due to length, the proof and full statement with constants appear in the appendix as Theorem A.4, with a sketch provided here: we first show that a given quadratic version space and a multivariate Gaussian sample θquad obtains the selective classification guarantees (Lemmas A.3, A.4, A.5). We then show that θ◦ ≈ θquad to order n^{−1}, which is sufficient to recover Theorem A.4.

Figure 1. Bootstrap consensus provides more conservative classification boundaries, which prevents repeatedly misclassifying the minimum, compared to direct loss minimization (panel b, triangle). (a) Classification confidences formed by bootstrapping approximate selective classification. (b) Bootstrapping results in more consistent identification of minima.

The d∆h abstention rate in this bound is d times the original selective classification result. This additional factor of d appearing in σ² arises from the difference between finding an optimum within a ball and randomly sampling it: random vectors concentrate within O(1/d) of the origin, while the maximum possible value is 1. This gap forces us to scale the variance in the decision function by σ (step 3b). We present selective classification approximation bounds analogous to Theorem 3 for linear optimization in the Appendix as Theorem A.5.

To illustrate our results through simulations, consider optimizing a two-dimensional linear function in the unit box. Figure 1a shows the set of downweighted points (colored points) for various algorithms on classifying a single superlevel set based on eight observations (black points). Observe how the direct loss-minimizing linear classifier downweights many points (colored 'x'), in contrast to exact CSS, which only downweights points guaranteed to be in the superlevel set. Errors of this type combined with Alg. 1 result in optimizers which fail to find the true minimum depending on initialization (Figure 1b). The bootstrapped linear classifier behaves similarly to CSS, but is looser due to the non-asymptotic setting. Random forests, another type of bootstrapped classifier, are surprisingly good at approximating CSS, despite not making use of the linearity of the decision boundary.

    4 Partial order based optimization

One benefit of optimizing via classification is that the algorithm only requires a total ordering amongst the elements. Specifically, step 6 of Algorithm 1 only requires threshold comparisons against a percentile selected in step 5. This enables optimization under pairwise comparison feedback. At each round, instead of observing f(X(t)), we observe pairwise comparisons g(X(t)_i, X(t)_j) = 1{f(X(t)_i) < f(X(t)_j)} for a small set of randomly chosen pairs, and aggregate them into an estimated ordering (equation 1) that stands in for the threshold comparisons. Around 10 comparisons per action seems to work well in practice, and more sophisticated preference aggregation algorithms may reduce the number of comparisons even further.
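
As an illustration, labels for step 6 can be estimated from comparisons alone roughly as follows. The helper name is hypothetical, and the win-fraction ranking used here is an illustrative stand-in for the ordering estimator of equation 1; c is the number of comparisons per action.

```python
import numpy as np

def labels_from_pairwise(g, X_batch, c=10, seed=0):
    """Estimate 'above the median' labels for a batch using only pairwise feedback.
    g(x_i, x_j) returns 1 if f(x_i) < f(x_j) and 0 otherwise; each point is compared
    against c other points chosen at random, and points are ranked by win fraction."""
    rng = np.random.default_rng(seed)
    n = len(X_batch)
    wins = np.zeros(n)
    for i in range(n):
        opponents = rng.choice([j for j in range(n) if j != i], size=c, replace=False)
        wins[i] = np.mean([g(X_batch[i], X_batch[j]) for j in opponents])
    # low win fraction => high f value => label 1 ("above the threshold"), matching Y_i > alpha
    return (wins < np.median(wins)).astype(int)
```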

    5 Experimental evidence

We evaluate Algorithm 1 as a DFO algorithm across a few real-world experimental design benchmarks, common synthetic toy optimization problems, and benchmarks that allow only pairwise function value comparisons. The small-batch (n = 1-10) nature of hyperparameter optimization problems is outside the scope of our work, even though they are common DFO problems.

For constructing the classifier in Algorithm 1, we apply ensembled decision trees with a consensus decision defined as 75% of trees agreeing on the label (referred to as classify-rf). This particular classifier works in a black-box setting, and is highly effective across all problem domains with no tuning. We also empirically investigate the importance of well-specified hypotheses and consensus ensembling and show improved results for ensembles of linear classifiers and problem specific classifiers, which we call classify-tuned.
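
A minimal sketch of this 75% consensus rule on top of scikit-learn's random forest is shown below; the class name, forest size, and tree settings are illustrative assumptions, since only off-the-shelf scikit-learn classifiers and the 75% agreement threshold are specified.

```python
from sklearn.ensemble import RandomForestClassifier

class ConsensusForest:
    """Random forest whose positive prediction requires at least 75% of trees to agree,
    so only points confidently above the level set are downweighted in Algorithm 1."""
    def __init__(self, n_estimators=100, consensus=0.75, random_state=0):
        self.forest = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
        self.consensus = consensus

    def fit(self, X, z):
        self.forest.fit(X, z)
        return self

    def predict(self, X):
        # predict_proba averages per-tree class probabilities; with fully grown trees
        # this is essentially the fraction of trees voting "above the threshold"
        vote_frac = self.forest.predict_proba(X)[:, 1]
        return (vote_frac >= self.consensus).astype(int)
```

In the Algorithm 1 sketch earlier, a ConsensusForest can play the role of h(t) to obtain a classify-rf style variant.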

In order to demonstrate that no special tuning is necessary, the same constants are used in the optimizer for all experiments, and the classifiers use off-the-shelf implementations from scikit-learn with no tuning.

For sampling points according to the weighted distribution in Algorithm 1, we enumerate for discrete action spaces X, and for continuous X we perturb samples from the previous rounds using a Gaussian and use importance sampling to approximate the target distribution. Although exact sampling for the continuous case would be time-consuming, the Gaussian perturbation heuristic is fast, and seems to work well enough for the functions tested here.
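
For the continuous case, the perturbation-plus-importance-sampling heuristic might look roughly like the sketch below; the helper name, the proposal scale, the equal-weight Gaussian mixture proposal, and the final resampling step are all assumptions, as these details are not spelled out above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def propose_continuous(X_prev, target_unnorm, n, scale=0.1, seed=0):
    """One round of the Gaussian-perturbation heuristic for continuous X:
    jitter the previous batch to get candidates, then importance-resample them
    toward an unnormalized target density, e.g. target_unnorm(x) giving the
    accumulated multiplicative weight prod_s (1 - eta * h^(s)(x))."""
    rng = np.random.default_rng(seed)
    d = X_prev.shape[1]
    centers = X_prev[rng.integers(0, len(X_prev), size=n)]
    candidates = centers + scale * rng.standard_normal((n, d))
    # proposal density: equal-weight Gaussian mixture centered at the previous batch
    q = np.mean([multivariate_normal.pdf(candidates, mean=m, cov=scale**2 * np.eye(d))
                 for m in X_prev], axis=0)
    w = target_unnorm(candidates) / np.clip(q, 1e-300, None)   # importance weights
    w = np.clip(w, 0.0, None)
    w /= w.sum()
    return candidates[rng.choice(n, size=n, replace=True, p=w)]  # weighted resample ~ p^(t)
```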

As a baseline, we compare to the following algorithms:

    • Random sampling (random)

• Randomly sampling double the batch size (random-2x), which is a strong baseline recently shown to outperform many derivative-free optimizers [24].

• The evolutionary strategy (CMA-ES) for continuous problems, due to its high performance in black box optimization competitions as well as inherent applicability to the large batch setting [26].

• The Bayesian optimization algorithm provided by GPyOpt [4] (GP) for both continuous and discrete problems, using expected improvement as the acquisition function. We use the 'random' evaluator, which implements an epsilon-greedy batching strategy, since the large batch sizes (100-1000) make the use of more sophisticated evaluators completely intractable. The default RBF kernel was used in all experiments presented here. The 3/2- and 5/2-Matern kernels and string kernels were tried where appropriate, but did not provide any performance improvements.

In terms of runtime, all computations for classify-rf take less than 1 second per iteration, compared to 0.1s for CMA-ES and 1.5 minutes for GPyOpt. All experiments were replicated fifteen times to measure variability with respect to initialization.

All new benchmark functions and reference implementations are made available at http://bit.ly/2FgiIxA.

    5.1 Designing optimal DNA sequences

The publicly available protein binding microarray (PBM) dataset consisting of 201 separate assays [6] allows us to accurately benchmark the optimization of protein binding over DNA sequences. In each assay, the binding affinity between a particular DNA-binding protein (transcription factor) and all 8-base DNA sequences is measured using a microarray.

Figure 2. Performance on two types of real-world batched zeroth-order optimization tasks: (a) binding to the CRX protein, (b) binding to the VSX1 protein, (c) high-lift airfoil design. classify-rf consistently outperforms baselines and even randomly sampling twice the batch size. The line shows the median function value over runs; the shaded area shows quartiles.

This dataset defines 201 separate discrete optimization problems. For each protein, the objective function is the negative binding affinity (as measured by fluorescence), the batch size is 100 (corresponding roughly to the size of a typical 96-well plate), across ten rounds. Each possible action corresponds to measuring the binding affinity of a particular 8-base DNA sequence exactly. The actions are featurized by considering the binary encoding of whether a base exists in a position, resulting in a 32-dimensional space. This emulates the task of finding the DNA binding sequence of a protein using purely low-throughput methods.
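
A one-hot featurization consistent with this description (8 positions × 4 bases = 32 binary features) might look like the following sketch; the function name and base ordering are arbitrary illustrative choices.

```python
import numpy as np

BASES = "ACGT"

def featurize_dna(seq):
    """One-hot encode an 8-base DNA sequence: feature (position, base) is 1 iff that base
    occurs at that position, giving the 32-dimensional binary representation above."""
    assert len(seq) == 8 and set(seq) <= set(BASES)
    x = np.zeros((len(seq), len(BASES)))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()

# e.g. featurize_dna("ACGTACGT") has exactly eight nonzero entries out of 32
```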

Figures 2a and 2b show the optimization traces of two randomly sampled examples, where the lines indicate the median achieved function value over 15 random initializations, and the shading indicates quartiles. classify-rf shows consistent improvements over all discrete action space baselines. For evaluation, we further sample 20 problems and find that the median binding affinity found across replicates is strictly better on 16 out of 20, and tied with the Gaussian process on 2.

In this case, the high performance of random forests is relatively unsurprising, as random forests are known to be high-performance classifiers for DNA sequence recognition tasks [10, 21].

    5.2 Designing high-lift airfoils

Airfoil design and other simulator-based objectives are well-suited to the batched, classification-based optimization framework, as 30-40 simulations can be run in parallel on modern multicore computers. In the airfoil design case, the simulator is a 2-D aerodynamics simulator for airfoils [13].

The objective function is the negative of lift divided by drag (with a zero whenever the simulator throws an error) and the action space is the set of all common airfoils (NACA-series 4 airfoils). The airfoils are featurized by taking the coordinates around the perimeter of the airfoil as defined in the Selig airfoil format. This results in a highly-correlated two-hundred-dimensional feature space. The batch size is 30 (corresponding to the number of cores in our machine) and T = 10 rounds of evaluations are performed.

We find in Figure 2c that the classify-rf algorithm converges to the optimal airfoil in only five rounds, and does so consistently, unlike the baselines. The Gaussian process beat the twice-random baseline, since the radial basis kernel is well-suited for this task (as lift is relatively smooth over ℓ2 distance between airfoils), but did not perform as well as the classify-rf algorithm.

5.3 Gains from designed classifiers and ensembles

Matching the classifier and objective function generally results in large improvements in optimization performance. We test two continuous optimization problems in [−1, 1]^300: optimizing a random linear function, and optimizing a random sum of quadratic and linear functions. For this high dimensional task, we use a batch size of 1000. In both cases we compare continuous baselines with classify-rf and classify-tuned, which uses a linear classifier.

We find that the use of the correct hypothesis class gives dramatic improvements over baseline in the linear case (Figure 3a) and continues to give substantial improvements even when a large quadratic term is added, making the hypothesis class misspecified (Figure 3b). The classify-rf does not do as well as this custom classifier, but continues to do as well as the best baseline algorithm (CMA-ES).

We also find that using an ensembled classifier is important for optimization. Figure 3c shows an example run on the DNA binding task comparing the consensus of an ensemble of logistic regression classifiers against a single logistic regression classifier. Although both algorithms perform well in early iterations, the single logistic regression algorithm gets 'stuck' earlier and finds a suboptimal local minimum, due to an accumulation of errors. Ensembling consistently reduces such behavior.

Figure 3. Testing the importance of ensembling and a well-specified hypothesis class on synthetic data, where the hypothesis for classify-tuned exactly matches the level set (panel a) or matches level sets with some error (panel b). Ensembling also consistently improves performance and reduces dependence on initialization (panel c). (a) Random linear function. (b) Linear+quadratic function. (c) Ensembling classifiers improves optimization performance.

    5.4 Low-dimensional synthetic benchmarks

We additionally evaluate on two common synthetic benchmarks (Figure 4a, 4b). Although these tasks are not the focus of the work, we show that the classify-rf is surprisingly good as a general black box optimizer when the batch sizes are large.

We consider a batch size of 500 and ten steps due to the moderate dimensionality and multi-modality relative to the number of steps. We find qualitatively similar results to before, with classify-rf outperforming other algorithms and CMA-ES as the best baseline.

Figure 4. classify-rf outperforms baselines on synthetic benchmark functions with large batches: (a) Shekel (4d), (b) Hartmann (6d).

    5.5 Optimizing with pairwise comparisons

Finally, we demonstrate that we can optimize a function using only pairwise comparisons. In Figure 5 we show the optimization performance when using the ordering estimator from equation 1.

For small numbers of comparisons per element (c = 5) we find substantial loss of performance, but once we observe at least 10 pairwise comparisons per proposed action, we are able to reliably optimize as well as the full function value case. This suggests that classification based optimization can handle pairwise feedback with little loss in efficiency.

Figure 5. Optimization with pairwise comparisons between each action and a small set of c randomly selected actions. Between 10-20 pairwise comparisons per action gives sufficient information to fully optimize the function.

    6 Discussion

Our work demonstrates that the classification-based approach to derivative-free optimization is effective and principled, but leaves open several theoretical and practical questions. In terms of theory, it is not clear whether a modified algorithm can make use of empirical risk minimizers instead of perfect selective classifiers. In practice, we have left open the question of tractably sampling from p(t), as well as how to appropriately handle smaller-batch settings where d > n.

    References

[1] N. Abe and M. Hiroshi. Query learning strategies using boosting and bagging. In Proceedings of the Fifteenth International Conference on Machine Learning, volume 1, 1998.

[2] J. Abernethy, K. Amin, M. Draief, and M. Kearns. Large-scale bandit problems and KWIK learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[3] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[4] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.

[5] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Proceedings of the Twenty Sixth Annual Conference on Computational Learning Theory, pages 288–316, 2013.

[6] L. A. Barrera, A. Vedenko, J. V. Kurland, J. M. Rogers, S. S. Gisselbrecht, E. J. Rossin, J. Woodard, L. Mariani, K. H. Kock, S. Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, 2016.

[7] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems 29, pages 1507–1515, 2016.

[8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[9] S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex optimization. In Proceedings of the Twenty Ninth Annual Conference on Computational Learning Theory, pages 583–589, 2016.

[10] X. Chen and H. Ishwaran. Random forests for genomic data analysis. Genomics, 99(6):323–329, 2012.

[11] A. Chevalier, D.-A. Silva, G. J. Rocklin, D. R. Hicks, R. Vergara, P. Murapa, S. M. Bernard, L. Zhang, K.-H. Lam, G. Yao, et al. Massively parallel de novo protein design for targeted therapeutics. Nature, 2017.

[12] A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Optimization, volume 8 of MPS-SIAM Series on Optimization. SIAM, 2009.

[13] M. Drela. XFOIL: An analysis and design system for low Reynolds number airfoils. In Low Reynolds Number Aerodynamics, pages 1–12. Springer, 1989.

[14] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

[15] R. El-Yaniv and Y. Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.

[16] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 648–657, 2016.

[17] A. Gotovos, N. Casati, G. Hitz, and A. Krause. Active learning for level set estimation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1344–1350, 2013.

[18] S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[19] C. Harwood and A. Wipat. Microbial Synthetic Biology, volume 40. Elsevier, 2013.

[20] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[21] C. G. Knight, M. Platt, W. Rowe, D. C. Wedge, F. Khan, P. J. Day, A. McShea, J. Knowles, and D. B. Kell. Array-based evolution of DNA aptamers allows modelling of an explicit sequence-fitness landscape. Nucleic Acids Research, 37(1):e6–e6, 2008.

[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[23] L. Li, M. Littman, and T. Walsh. Knows what it knows: A framework for self-aware learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[24] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: bandit-based configuration evaluation for hyperparameter optimization. In Proceedings of the Fourth International Conference on Learning Representations, 2016.

[25] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 147–156, 1991.

[26] I. Loshchilov. CMA-ES with restarts for solving CEC 2013 benchmark problems. In Evolutionary Computation (CEC), 2013, pages 369–376, 2013.

[27] A. L. Marsden, M. Wang, J. E. Dennis, and P. Moin. Optimal aeroacoustic shape design using the surrogate management framework. Optimization and Engineering, 5(2):235–262, 2004.

[28] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. arXiv:1312.3580 [math.PR], 2013.

[29] R. S. Michalski. Learnable evolution model: Evolutionary processes guided by machine learning. Machine Learning, 38(1):9–40, 2000.

[30] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.

[31] A. S. Phelps, D. M. Naeger, J. L. Courtier, J. W. Lambert, P. A. Marcovici, J. E. Villanueva-Meyer, and J. D. MacKenzie. Pairwise comparison versus Likert scale for biomedical image assessment. American Journal of Roentgenology, 204(1):8–14, 2015.

[32] A. Rakhlin and K. Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv:1510.03925 [math.PR], 2015.

[33] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.

[34] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[35] V. Spokoiny. Parametric estimation. Finite sample theory. The Annals of Statistics, 40(6):2877–2909, 2012.

[36] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

[37] J. Van Hemmen and T. Ando. An inequality for trace ideals. Communications in Mathematical Physics, 76(2):143–148, 1980.

[38] J. Wang, Q. Gong, N. Maheshwari, M. Eisenstein, M. L. Arcila, K. S. Kosik, and H. T. Soh. Particle display: A quantitative screening method for generating high-affinity aptamers. Angewandte Chemie International Edition, 53(19):4796–4801, 2014.

[39] Y. Wiener and R. El-Yaniv. Agnostic selective classification. In Advances in Neural Information Processing Systems 24, pages 1665–1673, 2011.

[40] Y. Yu, H. Qian, and Y.-Q. Hu. Derivative-free optimization via classification. In Proceedings of the Thirty Second National Conference on Artificial Intelligence, pages 2286–2292, 2016.

