NUS School of Computing Summer School
Gaussian Process Methods in Machine Learning
Jonathan [email protected]
Lecture 3: Advanced Bayesian Optimization Methods
August 2018
License Information

These slides are an edited version of those for EE-620 Advanced Topics in Machine Learning at EPFL (taught by Prof. Volkan Cevher, LIONS group), with the following license information:
• This work is released under a Creative Commons License with the following terms:
  - Attribution: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
  - Non-Commercial: The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes, unless they get the licensor's permission.
  - Share Alike: The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor's work.
• Full text of the license:
  http://creativecommons.org/licenses/by-nc-sa/1.0/
  http://creativecommons.org/licenses/by-nc-sa/1.0/legalcode
Outline of Lectures
• Lecture 0: Bayesian Modeling and Regression
• Lecture 1: Gaussian Processes, Kernels, and Regression
• Lecture 2: Optimization with Gaussian Processes
• Lecture 3: Advanced Bayesian Optimization Methods
Outline: This Lecture
• This lecture:
  1. Practical twists on Bayesian optimization
  2. Level-set estimation
  3. One-step lookahead algorithms
  4. Truncated variance reduction
Recap 1: Black-Box Function Optimization
Black-box function optimization:

$$x^\star \in \arg\max_{x \in D \subseteq \mathbb{R}^d} f(x)$$

• Setting:
  - Unknown "reward" function $f$
  - Expensive evaluations of $f$
  - Noisy evaluations
  - Choose $x_t$ based on the history $\{(x_{t'}, y_{t'})\}_{t'}$
Recap 2: Bayesian Optimization (BO) Template
A general BO template [Shahriari et al., 2016]

1: for t = 1, 2, ..., T do
2:   choose the new point $x_{t+1}$ by optimizing an acquisition function $\alpha(\cdot)$:
       $x_{t+1} \in \arg\max_{x \in D} \alpha(x; D_t)$
3:   query the objective function to obtain $y_{t+1}$
4:   augment the data: $D_{t+1} = \{D_t, (x_{t+1}, y_{t+1})\}$
5:   update the GP model
6: end for
7: make the final recommendation $\hat{x}$
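To make the template concrete, here is a minimal Python sketch of the loop on a finite grid, using a GP-UCB-style rule as a stand-in for the acquisition function; the kernel, parameter values, and helper names are illustrative assumptions, not the setup used in the experiments later in this lecture.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    # Squared-exponential kernel between the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, grid, noise_var):
    # Standard GP regression posterior (zero prior mean) on the grid.
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(grid, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)  # k(x, x) = 1 for this kernel
    return mu, np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(f, grid, T=30, beta=4.0, noise_var=0.01, seed=0):
    rng = np.random.default_rng(seed)
    X = grid[[0]]                                   # arbitrary first query
    y = np.array([f(X[0]) + np.sqrt(noise_var) * rng.standard_normal()])
    for _ in range(T):
        mu, sigma = gp_posterior(X, y, grid, noise_var)
        x_next = grid[[np.argmax(mu + np.sqrt(beta) * sigma)]]   # acquisition step (UCB)
        y_next = f(x_next[0]) + np.sqrt(noise_var) * rng.standard_normal()
        X, y = np.vstack([X, x_next]), np.append(y, y_next)      # augment data, refit GP
    mu, _ = gp_posterior(X, y, grid, noise_var)
    return grid[np.argmax(mu)]                      # final recommendation

# Example: maximize a 1D toy function on [0, 1].
grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
x_hat = bayes_opt(lambda x: float(np.sin(6 * x[0]) * x[0]), grid)
```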
Twists

Practical variations along the same theme:

• Pointwise costs: Choosing point $x$ incurs a cost $c(x)$ [Snoek et al., 2012]
  - Examples: Advertising costs, sensor power consumption
• Heteroscedastic noise: Choosing point $x$ incurs noise of variance $\sigma^2(x)$ [Goldberg et al., 1997]
  - Example: Different sensing quality
• Multi-fidelity: Alternative evaluations $f_1, \ldots, f_K$ related to $f$ [Swersky et al., 2013]
  - Example: Varying data set sizes in automated machine learning
Another Twist: Level-Set Estimation
Level-set estimation: Estimate the super- and sub-level sets [Gotovos et al., 2013]

$$H(f) := \{x : f(x) > h\}, \qquad L(f) := \{x : f(x) < h\} \quad \text{for some threshold } h$$

  - Example: Find all hotspots in environmental monitoring

[Figure: a function with its super-level set, sub-level set, and threshold $h$ marked]
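On a finite grid of predicted function values, these two sets are simply index masks; a one-line sketch (function and variable names hypothetical):

```python
import numpy as np

def level_sets(f_vals, h):
    # Indices of the super-level set H and sub-level set L for threshold h.
    return np.flatnonzero(f_vals > h), np.flatnonzero(f_vals < h)
```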
Accommodating the BO Twists: Lookahead Algorithms

• Mostly heuristic BO approaches
  - Entropy search (ES) [Hennig et al., 2012]:
    $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[H(x^\star \mid \{x_i, y_i\}_{i=1}^t)\big]$, where $H$ is the entropy function
  - Minimum regret search (MRS) [Metzen, 2016]:
    $x_t \approx \arg\min_{x \in D} \mathbb{E}_{y_t}\big[\mathbb{E}_{x^\star}\big[\text{regret} \mid \{x_i, y_i\}_{i=1}^t\big]\big]$
  - Multi-step lookahead: approximation of the ideal lookahead loss function [Osborne et al., 2009, Gonzalez et al., 2016]

Advantages: Versatility with point-wise costs, non-uniform noise, and multi-fidelity scenarios; can improve on baseline algorithms even without these twists.

Disadvantages: Expensive to compute; no theoretical guarantees; no level-set estimation counterpart.
Note: Lookahead algorithms tend to be more versatile with respect to interesting twists on the optimization problem.

• Example:
  - Minimizing entropy ⟺ maximizing the reduction in entropy
  - Extension: Maximize the reduction in entropy per unit cost
More on Entropy Search

• Entropy search and its variants are particularly popular:
  $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[H(x^\star \mid \{x_i, y_i\}_{i=1}^t)\big]$, where $H$ is the entropy function
  - Interpretation: Choose the point that makes us least uncertain (i.e., minimizes entropy) about the optimizer $x^\star$

• Difficulty: Cannot compute $\mathbb{E}_{y_t}\big[H(x^\star \mid \{x_i, y_i\}_{i=1}^t)\big]$ exactly
  - Need to approximate, typically using Monte Carlo methods
  - Particularly difficult in higher dimensions, e.g., $x \in \mathbb{R}^d$ for $d > 10$

• Alternative: Max-value entropy search:
  $x_t \approx \arg\min_{x_t \in D} \mathbb{E}_{y_t}\big[H(f^\star \mid \{x_i, y_i\}_{i=1}^t)\big]$
  - Intuition: Low uncertainty in $f^\star = f(x^\star)$ should mean we have found $x^\star$
  - Approximating the entropy is now easier, since $f^\star$ is one-dimensional (see the sketch below)
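To illustrate the Monte Carlo flavor, the following is a minimal sketch of estimating the entropy of $f^\star$ on a finite grid by drawing functions from the GP posterior and histogramming their maxima; the function name and plug-in estimator are assumptions for illustration, and practical max-value entropy search implementations use more refined approximations (e.g., Gumbel-based sampling of $f^\star$).

```python
import numpy as np

def fmax_entropy_mc(mu, cov, n_samples=500, n_bins=30, seed=0):
    # Monte Carlo estimate of the entropy of f* = max_x f(x) on a finite grid,
    # given the GP posterior mean `mu` and (PSD) covariance `cov` over the grid.
    rng = np.random.default_rng(seed)
    f = rng.multivariate_normal(mu, cov, size=n_samples)  # posterior function draws
    fmax = f.max(axis=1)                                  # one f* sample per draw
    probs, edges = np.histogram(fmax, bins=n_bins, density=True)
    widths = np.diff(edges)
    nz = probs > 0
    # Plug-in estimate of the differential entropy H(f*).
    return -np.sum(probs[nz] * np.log(probs[nz]) * widths[nz])
```

An acquisition rule would then compare this entropy before and after a hypothetical observation at each candidate point, preferring the point with the largest expected reduction.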
Experimental Example
• Performance plots from [Metzen, 2016] for a robot control task:

[Figure: performance curves for the robot control task]
Accommodating the Twists: Level-Set Estimation
• Limited literature
  - Confidence-bound based LSE algorithm [Gotovos et al., 2013]:
    $x_t = \arg\max_{x \in M_{t-1}} \min\{u_t(x) - h,\; h - \ell_t(x)\}$
    where $u_t/\ell_t$ are the upper/lower confidence bounds, $M_t$ is the set of unclassified points, and $h$ is the level-set threshold
    - Analogous to, but distinct from, the GP-UCB algorithm for BO
    - Intuition: Resolve the uncertainty of points whose confidence interval crosses $h$
  - Straddle heuristic [Bryan et al., 2006] (sketched in code below):
    $x_t = \arg\max_{x \in D}\; 1.96\,\sigma_{t-1}(x) - |\mu_{t-1}(x) - h|$

Advantages: Versatility in the sense of handling level-set estimation

Disadvantages: No theory (Straddle); lacking in other versatility (costs, non-uniform noise, multi-fidelity)
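Both rules are easy to state in code; below is a minimal sketch on a finite grid, assuming posterior mean/standard-deviation arrays `mu` and `sigma` from a fitted GP (the names, and the choice `beta=9.0`, are illustrative assumptions):

```python
import numpy as np

def straddle(mu, sigma, h):
    # Straddle score [Bryan et al., 2006]: large near the threshold h,
    # larger still where the posterior is uncertain.
    return 1.96 * sigma - np.abs(mu - h)

def lse_rule(mu, sigma, h, unclassified, beta=9.0):
    # Confidence-bound LSE rule [Gotovos et al., 2013]: among unclassified points,
    # pick the one whose confidence interval sticks out furthest across h.
    u = mu + np.sqrt(beta) * sigma       # upper confidence bound
    l = mu - np.sqrt(beta) * sigma       # lower confidence bound
    ambiguity = np.minimum(u - h, h - l) # how far the interval straddles h
    return unclassified[np.argmax(ambiguity[unclassified])]

# x_next = grid[np.argmax(straddle(mu, sigma, h))]
```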
Accommodating the Twists with Guarantees: TruVaR
Truncated Variance Reduction (TruVaR) algorithm [Bogunovic et al., 2016]:
  - Unifies BO and LSE
  - Versatile enough to handle all of the above twists
  - Comes with theoretical guarantees
TruVaR Intuition (for optimization):
• Use confidence bounds to keep track of the potential maximizers
• Choose points that shrink their uncertainty
Modified Template for Choosing $x_t$ Based on $\{(x_{t'}, y_{t'})\}_{t'}$
TruVaR: Intuition

[Figure: GP posterior with confidence bounds; legend: confidence target, selected point, potential maximizers, maximum lower bound]
[Sequence of figures: successive TruVaR iterations; the confidence bounds tighten and the set of potential maximizers shrinks]
TruVaR: Acquisition Function

• Acquisition function based on variance reduction per unit cost:

$$x_t \in \arg\max_{x \in D} \frac{\sum_{\overline{x} \in M_{t-1}} \max\{\beta_{(i)}\, \sigma_{t-1}^2(\overline{x}),\, \eta_{(i)}^2\} \;-\; \sum_{\overline{x} \in M_{t-1}} \max\{\beta_{(i)}\, \sigma_{t-1|x}^2(\overline{x}),\, \eta_{(i)}^2\}}{c(x)}$$

where $\sigma_{t-1|x}^2$ is the posterior variance given all points up to time $t-1$ together with $x$, $\beta_{(i)}$ is an exploration parameter, and $\eta_{(i)}$ is the truncation level in epoch $i$.

The set of potential maximizers $M_t$:

• BO: $M_t = \big\{x \in M_{t-1} : u_t(x) \ge \max_{\overline{x} \in M_{t-1}} \ell_t(\overline{x})\big\}$, where $u_t/\ell_t$ are the upper/lower confidence bounds
• LSE: $M_t = \big\{x \in M_{t-1} : u_t(x) \ge h \text{ and } \ell_t(x) \le h\big\}$
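The following is a minimal sketch of evaluating this acquisition on a finite grid; it is an illustrative reconstruction under stated assumptions (posterior covariance matrix available, rank-1 conditioning), not the authors' implementation.

```python
import numpy as np

def truvar_scores(C, M, candidates, noise_var, beta, eta, cost):
    """Truncated-variance-reduction-per-cost scores on a finite grid (a sketch).

    C          : current posterior covariance matrix over the grid (n x n)
    M          : indices of the current potential maximizers M_{t-1}
    candidates : indices of points that may be queried next
    cost       : per-point query costs, array of length n
    """
    var = np.diag(C)
    before = np.sum(np.maximum(beta * var[M], eta**2))  # truncated uncertainty now
    scores = np.empty(len(candidates))
    for i, j in enumerate(candidates):
        # Rank-1 GP update: posterior variance over M after a noisy observation at j.
        var_after = var[M] - C[M, j]**2 / (C[j, j] + noise_var)
        after = np.sum(np.maximum(beta * var_after, eta**2))
        scores[i] = (before - after) / cost[j]
    return scores

# x_next = candidates[np.argmax(truvar_scores(C, M, candidates, 0.01, 4.0, 0.1, cost))]
```

The rank-1 update exploits the fact that a GP's posterior variance depends only on the query locations, not on the observed values, so the effect of querying $x$ can be scored before $y$ is observed.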
Numerical Evidence
• Real and synthetic data
• Acronyms:
  - LSE: Level-set estimation algorithm [Gotovos et al., 2013]
  - STR: Straddle heuristic [Bryan et al., 2006]
  - VAR: Maximum variance rule [Gotovos et al., 2013]
  - EI: Expected improvement [Mockus et al., 1978]
  - GP-UCB: Gaussian process upper confidence bound [Srinivas et al., 2012]
Numerical Evidence 1: Level-Set Estimation (I)
• Lake Zurich chlorophyll concentration measured via an autonomous vehicle:

[Figure: chlorophyll concentration over a 1200 m by 20 m lake cross-section (length vs. depth), with level-set threshold h = 1.5]

• Evaluate performance with the F1 score:

$$F_1 = \frac{\#\text{true positives}}{\#\text{true positives} + \frac{1}{2}\big(\#\text{false positives} + \#\text{false negatives}\big)} \in [0, 1]$$

where "positive" means above the level-set threshold $h$.
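As a quick sanity check of the formula (a toy computation, not taken from the experiments):

```python
def f1_score(tp, fp, fn):
    # F1 as defined above; "positive" = classified as above the threshold h.
    return tp / (tp + 0.5 * (fp + fn))

assert abs(f1_score(tp=80, fp=10, fn=10) - 80 / 90) < 1e-12
```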
Numerical Evidence 1: Level-Set Estimation (II)
• Classification performance (unit cost):

[Figure: F1 score vs. time (0 to 120) for TruVaR, LSE, STR, and VAR]
Numerical Evidence 1: Level-Set Estimation (III)
• Cost function: (i) penalizes distance traveled; (ii) penalizes deeper measurements

[Figure: measurement locations over the lake cross-section (length vs. depth); left: LSE algorithm [Gotovos et al., 2013]; right: TruVaR]
Numerical Evidence 1: Level-Set Estimation (IV)
• Classification performance (non-unit cost):

[Figure: F1 score vs. cost (0 to 2 × 10⁴) for TruVaR and LSE]
Numerical Evidence 2: Level-Set Estimation (I)
• Twist: Choice of the noise level
  - Noise levels $\{10^{-6}, 10^{-3}, 0.05\}$
  - Corresponding costs $\{15, 10, 2\}$
• Synthetic simulation
  - Function drawn from a GP with a squared-exponential kernel
  - True kernel used in the algorithms
Numerical Evidence 2: Level-Set Estimation (II)
• Synthetic function drawn from the GP:

[Figure: the synthetic test function drawn from the GP]
Numerical Evidence 2: Level-Set Estimation (III)
• Oracle-level classification performance:

[Figure: F1 score vs. cost (×10⁴) for TruVaR and for LSE run at the high, medium, and small noise levels]
Numerical Evidence 2: Level-Set Estimation (IV)

• Cost incurred for each noise level:

[Figure: cost breakdown across the three noise levels]

• TruVaR gradually switches between the different levels:

high noise / low cost ⟹ medium noise / medium cost ⟹ low noise / high cost
Numerical Evidence 3: Bayesian Optimization
• Hyperparameter tuning: SVM on the "grid" dataset [Snoek et al., 2012]
  - Tuning 3 hyperparameters of the SVM algorithm
  - GP kernel estimated online using maximum likelihood
• Generalization error:

[Figure: validation error vs. time (0 to 100) for TruVaR, EI, and GP-UCB]
TruVaR – Batch Setting
• In the batch setting, we choose $B > 1$ points at each time, evaluate them in parallel, and observe the $B$ observations [Azimi et al., 2010]
  - Example 1: Equipment allows running scientific experiments in parallel
  - Example 2: $f$ is a computation, and we have multiple computing cores
• With $B = 1$, we can interpret TruVaR as greedily minimizing
  $$\sum_{x \in M_{t-1}} \max\{\beta_{(i)}\, \sigma_{t-1|\overline{x}}^2(x),\, \eta_{(i)}^2\}$$
  with respect to the chosen point $\overline{x}$
• A simple batch extension: In each round, run $B$ steps of the greedy algorithm minimizing this function (sketched below)
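A minimal sketch of this greedy batch rule, reusing the `truvar_scores` helper sketched earlier (again an assumption-laden illustration rather than the paper's code):

```python
import numpy as np

def greedy_batch(C, M, candidates, noise_var, beta, eta, cost, B):
    # Pick B points one at a time, each maximizing truncated variance
    # reduction per unit cost given the points already placed in the batch.
    batch = []
    for _ in range(B):
        scores = truvar_scores(C, M, candidates, noise_var, beta, eta, cost)
        j = candidates[int(np.argmax(scores))]
        batch.append(j)
        # "Fantasize" the observation at j: the GP posterior covariance depends
        # only on where we query, not on the observed value, so we can condition
        # on x_j before y_j is available (rank-1 downdate of the covariance).
        C = C - np.outer(C[:, j], C[j, :]) / (C[j, j] + noise_var)
    return batch
```

Once a point is fantasized, its remaining variance-reduction score collapses, so the greedy steps naturally spread the batch across the candidate set.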
Epilogue: Theoretical Performance
Definition: Numerical ε-accuracy
  - (BO) The reported point after $T$ rounds satisfies $f(\hat{x}_T) \ge f(x^\star) - \epsilon$
  - (LSE) The classification after $T$ rounds is correct for all points whose function value is at least $\epsilon/2$-far from the threshold $h$
Epilogue: Theoretical Performance (I)
• Generalize the canonical notion of rounds $T$ to the cost $C$ required to shrink the variance:

$$C^*(\xi, M) = \min_S \big\{c(S) : \max_{x \in M} \sigma_S(x) \le \xi\big\},$$

where $\sigma_S^2$ is the posterior variance given the points in $S$.

Theorem. For a finite domain $D$, under a submodularity assumption, if TruVaR is run until the cumulative cost reaches

$$C_\epsilon = \sum_{i \,:\, 4\eta_{(i-1)} > \epsilon} C^*\Big(\frac{\eta_{(i)}}{\beta_{(i)}^{1/2}},\, \overline{M}_{(i-1)}\Big) \log\frac{|\overline{M}_{(i-1)}|\, \beta_{(i)}}{\eta_{(i)}^2},$$

for suitable $\beta_{(i)}$, then with probability at least $1 - \delta$ we have ε-accuracy.

In the cumulative cost, the outer bounds $\overline{M}_{(i)}$ on $M_t$ are defined as

$$\overline{M}_{(i)} := \{x : f(x) \ge f(x^\star) - 4\eta_{(i)}\} \quad \text{(BO)}$$
$$\overline{M}_{(i)} := \{x : |f(x) - h| \le 2\eta_{(i)}\} \quad \text{(LSE)}$$
Epilogue: Theoretical Performance (II)
Corollary. There exist $\beta_{(i)}$ such that we have ε-accuracy with probability at least $1 - \delta$ once

$$T \ge O^*\Big(\frac{\sigma^2 \gamma_T}{\epsilon^2} + \frac{C_1 \gamma_T}{\sigma^2}\Big),$$

where $C_1 = \frac{1}{\log(1+\sigma^{-2})}$, and

$$\gamma_T = \max_{S : |S| = T} I(f; y_S)$$

is the maximum amount of information the observations $y_S = (y_1, \ldots, y_T)$ can reveal about $f$ upon querying the points $S = (x_1, \ldots, x_T)$.

• New: Improved dependence on the noise level in BO
  - For small $\sigma$ and $\epsilon \ll \sigma$, the existing bound (GP-UCB) reads $T \ge O^*\big(\frac{C_1 \gamma_T}{\epsilon^2}\big)$
Epilogue: Theoretical Performance (III)
• Multi-noise setup:
  - Noise levels $\sigma(1)^2, \ldots, \sigma(K)^2$
  - Sampling costs $c(1), \ldots, c(K)$

Corollary. For each $k = 1, \ldots, K$, let $T^*(k)$ denote the smallest value of $T$ such that

$$T \ge \Omega^*\Big(\frac{\sigma(k)^2 \gamma_T}{\epsilon^2} + \frac{C_1(k)\, \gamma_T}{\sigma(k)^2}\Big),$$

where $C_1(k) = \frac{1}{\log(1+\sigma(k)^{-2})}$. There exist choices of $\beta_{(i)}$ such that we have ε-accuracy with probability at least $1 - \delta$ once the cumulative cost reaches

$$\min_k \, c(k)\, T^*(k)$$

• As good as sticking to any fixed noise/cost pair chosen a posteriori!
Epilogue: Theoretical Performance (IV)
• Recall: the minimum cost required to shrink the variance:

$$C^*(\xi, M) = \min_S \big\{c(S) : \max_{x \in M} \sigma_S(x) \le \xi\big\},$$

where $\sigma_S^2$ is the posterior variance given the points in $S$.

• In a single epoch, TruVaR greedily maximizes a submodular set function:

$$g(S) = -\sum_{x \in M_{t-1}} \max\{\beta_{(i)}\, \sigma_{t-1|S}^2(x),\, \eta_{(i)}^2\}$$

• By submodularity, the incurred cost is within a logarithmic factor of the optimum:

$$C_{(i)} \le C^*\Big(\frac{\eta_{(i)}}{\beta_{(i)}^{1/2}},\, \overline{M}_{(i-1)}\Big) \log\frac{|\overline{M}_{(i-1)}|\, \beta_{(i)}}{\eta_{(i)}^2}$$

• Sum over the epochs $i$ to obtain the theorem
Further Reading
• Tutorials/classes:
  - Taking the Human Out of the Loop: A Review of Bayesian Optimization (Shahriari et al., 2016)
  - A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning (Brochu et al., 2010)
  - Lectures on Gaussian processes and Bayesian optimization by Nando de Freitas (available on YouTube)
• Other:
  - Various papers referenced at the end of each set of slides (this one and the previous ones)
  - Popular GP book: Gaussian Processes for Machine Learning (Rasmussen, 2006)
  - My papers: http://www.comp.nus.edu.sg/~scarlett/
Useful Programming Packages
• Useful libraries:
  - Python packages (some with other methods beyond GPs):
    - GPy and GPyOpt
    - Spearmint
    - BayesianOptimization
    - PyBo
    - HyperOpt
    - MOE
  - Packages for other languages:
    - GPML for MATLAB
    - GPFit and rBayesianOptimization for R
References I
[1] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Conf. Neur. Inf. Proc. Sys. (NIPS), 2016.

[2] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. http://arxiv.org/abs/1012.2599, 2010.

[3] Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, and Larry Wasserman. Active learning for identifying function threshold boundaries. In Conf. Neur. Inf. Proc. Sys. (NIPS), pages 163–170, 2006.

[4] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. Adv. Neur. Inf. Proc. Sys. (NIPS), 10:493–499, 1997.

[5] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation. In Int. Joint Conf. Art. Intel. (IJCAI), 2013.
References II
[6] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. J. Mach. Learn. Research, 13(1):1809–1837, 2012.

[7] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. In Int. Conf. Mach. Learn. (ICML), 2016.

[8] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. Vol. 2, 1978.

[9] Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.

[10] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE, 104(1):148–175, 2016.

[11] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Adv. Neur. Inf. Proc. Sys. (NIPS), 2012.
References III
[12] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory, 58(5):3250–3265, May 2012.

[13] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Adv. Neur. Inf. Proc. Sys. (NIPS), pages 2004–2012, 2013.