4/1/2020 Benchmarking the ATM Algorithm on the BBOB 2009 Noiseless Function Testbed 1 of 34
Benchmarking The ATM Algorithm
on the BBOB 2009 Noiseless Function Testbed
Benjamin Bodner
Brown University, Providence, RI, USA
BBOB Workshop
GECCO 2019
Prague
Content
01 Introduction: Motivation, Intuition
02 Main Components: Parameters & main equations, Parameter adaptation, Resource allocation
03 Results: BBOB Noiseless, BBOB Large-scale, Internal runtime, Recent progress
04 Summary: Conclusions, Goals moving forward
Motivation
Deep Learning / Physical Sciences
• Growing need for optimization methods for very high-dimensional settings
• Problems commonly have 10^5-10^8 optimizable variables [Devlin et al. 2019]
Image from: https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063
Image from GOMC: https://gomc-wsu.github.io/Manual/index.html
Motivation
Deep Learning
• Gradient-based optimization methods can create many difficulties [Shalev-Shwartz et al. 2017]: vanishing gradients, getting stuck in local minima, noise
• Current ways of mitigating these issues do not always work: hyperparameter tuning [Sutskever 2013], architecture design [He et al. 2015], regularization [Srivastava et al. 2014]
Image from: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Motivation
Physical Sciences: Interacting Particles / Protein Folding
• Functions are non-convex
• Notoriously have large numbers of local minima [Nichita 2002]
• Simulated annealing and quasi-Newton methods can be slow
• Do not always converge to the global minimum [Hao et al. 2015]
Image by Thomas Splettstoesser: https://www.behance.net/gallery/10952399/Protein-Folding-Funnel
Image from GOMC: https://gomc-wsu.github.io/Manual/index.html
Image from: https://en.wikibooks.org/wiki/Structural_Biochemistry/Proteins/Protein_Folding_Problem
Motivation
• Existing algorithms have been highly successful in these settings [BIPOP-CMA-ES, Hansen 2009]
• These characteristics are intentionally designed into the BBOB function testbeds
• However, covariance matrices and Hessians limit their scalability: key components and operations are usually of order D^2
Images from: Finck, Hansen, Ros, Auger 2015
Proposal
Eliminate the use of D^2 objects and operations.
Adaptive Two Mode (ATM) Algorithm: a black-box optimization algorithm which only maintains objects and executes operations of order D.
The Adaptive Two Mode Algorithm
Uses a combination of two kinds of search distributions / "modes":
• Directional distribution (exploitation)
• Isotropic distribution (exploration)
The two modes complement each other; ATM uses a set of rules to control the amplitudes of, and the interactions between, the modes.
ATM Algorithm
1. Start from an isotropic distribution (using an evolutionary strategy).
2. If a sample leads to improvement: suggest samples in that direction.
3. If the new samples also lead to improvement: sample in the same direction at exponentially increasing amplitude.
4. Once no more "good" samples are found: start over with the isotropic search.
Repeat.
[Figure: the four steps, with markers for the best sample, the best sample from the last step, and regular samples]
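The loop above can be sketched in a few lines of Python. This is a minimal illustration of the two-mode idea only: the plain Gaussian isotropic mode, the fixed growth factor, and all names here are assumptions, not the author's ATM implementation (which also adapts its parameters online).

```python
import numpy as np

def atm_sketch(f, x0, sigma0=1.0, growth=2.0, max_evals=2000, seed=0):
    """Illustrative two-mode search: an isotropic Gaussian exploration
    mode, switching to a directional mode whose step amplitude grows
    exponentially while improvement continues. A sketch of the idea
    only, not the author's exact ATM rules."""
    rng = np.random.default_rng(seed)
    x_best = np.asarray(x0, dtype=float)
    f_best = f(x_best)
    evals = 1
    while evals < max_evals:
        # Isotropic mode: sample around the current best point
        cand = x_best + sigma0 * rng.standard_normal(x_best.size)
        f_cand = f(cand)
        evals += 1
        if f_cand < f_best:
            # Directional mode: keep stepping along the improving
            # direction with exponentially increasing amplitude
            direction = cand - x_best
            x_best, f_best = cand, f_cand
            amp = 1.0
            while evals < max_evals:
                step = x_best + amp * direction
                f_step = f(step)
                evals += 1
                if f_step < f_best:
                    x_best, f_best = step, f_step
                    amp *= growth  # exponential amplitude growth
                else:
                    break  # no more "good" samples: back to isotropic mode
    return x_best, f_best
```

On a simple convex function the directional mode does most of the work, which is exactly the exploitation/exploration split the slide describes.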
Parameters of the Algorithm
There are (currently) 8 parameters, which play several roles in the ATM algorithm:
• Controlling the growth factors of the modes
• Controlling the amplitudes of the modes:

if ||X_best^t - X_best^(t-1)||^2 > (ΔX_min)^2: d += 1, r = 0
else: r += 1, d = 0

R = R_max * exp( G_r * [ sin( mod(π r / (2 T_r), π/2) ) - 1 ] )
D = R_max * exp( G_d * d - D_d * r )
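The amplitude rules can be sketched as a small helper. The grouping inside sin(...) is my reading of the slide, and the default parameter values are placeholders, not the author's tuned settings.

```python
import math

def amplitudes(r, d, R_max=1.0, G_r=1.0, G_d=0.5, D_d=0.1, T_r=10):
    """Illustrative amplitude rules: r counts consecutive failed
    iterations, d counts consecutive improving ones."""
    # Isotropic amplitude R oscillates with period set by T_r,
    # never exceeding R_max (the exponent is at most 0)
    R = R_max * math.exp(G_r * (math.sin((math.pi * r / (2 * T_r)) % (math.pi / 2)) - 1.0))
    # Directional amplitude D grows with consecutive improvements d
    # and shrinks with consecutive failures r
    D = R_max * math.exp(G_d * d - D_d * r)
    return R, D
```

With r = d = 0 this gives D = R_max and a damped isotropic amplitude, matching the intuition that the directional mode takes over once improvements are found.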
Parameters of the Algorithm
• Controlling the search distribution along different axes:

S = β S + (1 - β) * mean( (O - O_Gbest) / (X - X_Gbest)^2 )
A = α / (S + α_2)

• The objective used for online parameter tuning:

OP = ( mean(ΔOP_best) + min(ΔOP_best) ) / 2
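The per-axis scaling update above might be sketched as follows. This is my reading of the slide's (extraction-damaged) equations, and beta, alpha, alpha2 are placeholder values; the small eps guard is my addition to avoid division by zero.

```python
import numpy as np

def update_axis_scaling(S, O, X, O_gbest, X_gbest,
                        beta=0.9, alpha=1.0, alpha2=1e-8, eps=1e-12):
    """Illustrative update of the per-axis sensitivity S and the
    resulting per-axis amplitude A. O is a vector of objective values
    for the current samples X (one row per sample)."""
    # Per-axis sensitivity: objective change per squared displacement
    # from the global best, averaged over the current batch
    sens = np.mean((np.asarray(O) - O_gbest)[:, None] /
                   ((np.asarray(X) - np.asarray(X_gbest)) ** 2 + eps), axis=0)
    S = beta * np.asarray(S) + (1.0 - beta) * sens  # moving average
    A = alpha / (S + alpha2)  # sensitive axes get smaller amplitudes
    return S, A
```

The inverse relation A = α / (S + α₂) means axes along which the objective changes steeply are searched with smaller amplitudes.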
Online Parameter Tuning
ΔOP_best = the best change in the true objective function found by the parameter set
Different functions, and changing characteristics at different stages of the search, create the need for online parameter tuning. How to do this?
• 4 intertwined parameter sets
• Parameter sets are optimized by another Two-Mode algorithm
• The objective function is designed to reflect the "success" at the task of minimizing the true objective function
Problem with Online Tuning
New parameter sets, combined with a changing local search space, mean there is a good chance of unsuitable sets.
Proposal: fewer resources to "bad" parameter sets, more resources to better ones. The resources allocated to a parameter set should reflect the performance of that parameter set.

Parallel Optimization with Resource Allocation
Given a fixed number of samples N_tot, distributed among m parameter sets, change the allocation of samples to reflect their performance:
N_{t+1} = N_t - K M^{-1} ΔOP_best^t - K_0 M^{-1} (N_t - N_0)
where N_t is the resource allocation vector at iteration t.

Parallel Optimization with Resource Allocation - Choice of Matrices

    | (m-1)K    -K    ...    -K   |
K = |   -K    (m-1)K  ...    -K   |
    |   ...     ...   ...    ...  |
    |   -K      -K    ...  (m-1)K |

K_0 = k_0 * I,  M = μ * I

• Conserves the total number of samples
• Merit-based allocation system
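The allocation update above can be sketched directly. The coupling matrix has columns summing to zero, which is what conserves the total sample count; the values of K, k0, and μ below are placeholders, not the author's tuned constants.

```python
import numpy as np

def update_allocation(N, delta_op_best, N0, K=0.5, k0=0.1, mu=1.0):
    """Illustrative allocation update
    N_{t+1} = N_t - K M^{-1} ΔOP_best - K0 M^{-1} (N_t - N0).
    ΔOP_best follows the minimization convention: more negative
    values mean larger improvement, so those sets gain samples."""
    m = len(N)
    N = np.asarray(N, dtype=float)
    # Coupling matrix: (m-1)*K on the diagonal, -K off-diagonal
    Kmat = -K * np.ones((m, m)) + m * K * np.eye(m)
    Minv = np.eye(m) / mu
    dOP = np.asarray(delta_op_best, dtype=float)
    # Merit term redistributes samples; restoring term pulls toward N0
    return N - Kmat @ Minv @ dOP - k0 * Minv @ (N - np.asarray(N0, float))
```

Because each column of Kmat sums to zero, the merit term only moves samples between sets; and with N0 chosen to have the same total, the restoring term conserves the total as well.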
Information Flow Throughout ATM Components
ATM optimization process: resource allocation -> parameter sets 1-4 -> suggestions for samples -> evaluate samples -> values of the objective function -> back to resource allocation. Repeat.
Results on BBOB Testbed - Overview
Succeeds at solving:
• 23/24 functions in 2D
• 8/24 functions in 40D
• One of the best optimizers for the separable functions subset (f1-f5), given a large budget
• Underperforms on non-separable functions, especially if ill-conditioned and/or noisy
[Figure panels: Rotated Ellipse (f10), Sharp Ridge (f13), Sum of Different Powers (f14)]
Results on BBOB Testbed - Successes
• Very effective at optimizing separable functions
• Capable at optimizing functions with "large" convex regions around the global minimum ("large" = comparable to R_max)
Results on BBOB Testbed - Underperformance
• Poor performance on rotated and ill-conditioned functions
• Poor performance on rotated and noisy/multimodal functions
Results from BBOB Large-scale
Budget = 3000*D
Ability to Scale to Large Search Spaces
The internal runtime of the ATM algorithm scales linearly with the number of variables in the search space.
Results from the timing experiment:
• Internal runtime = total runtime - evaluation time
• 128 function evaluations, averaged over 3 runs
• f1 (sphere function)
• Measured: the number of variables needed to pass 1.0 sec of internal runtime
• Run on a Google Colab GPU
• Compared against: CMA (pip install cma), BFGS (scipy.optimize), L-BFGS-B (scipy.optimize), Nelder-Mead (scipy.optimize)
[Figure: internal runtime (seconds) as a function of the number of variables in the search space, for ATM and the baselines above]
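The timing protocol can be reproduced along these lines. The ATM implementation itself is not shown; this sketch times any optimizer with a scipy-style interface on the sphere function, subtracting time spent inside the objective from the total, as the slide describes. Function and parameter names are illustrative.

```python
import time
import numpy as np
from scipy.optimize import minimize

def internal_runtime(optimizer, n_vars, n_evals=128, n_runs=3):
    """Internal runtime = total runtime - time spent evaluating f,
    measured on the sphere function (f1) and averaged over runs."""
    eval_time = [0.0]
    def sphere(x):
        t0 = time.perf_counter()
        val = float(np.dot(x, x))
        eval_time[0] += time.perf_counter() - t0
        return val
    totals = []
    for _ in range(n_runs):
        eval_time[0] = 0.0
        x0 = np.full(n_vars, 3.0)
        t0 = time.perf_counter()
        optimizer(sphere, x0, n_evals)
        totals.append(time.perf_counter() - t0 - eval_time[0])
    return sum(totals) / n_runs

# Example baseline: Nelder-Mead's internal overhead at D = 100
def nelder_mead(f, x0, budget):
    minimize(f, x0, method="Nelder-Mead", options={"maxfev": budget})

rt = internal_runtime(nelder_mead, 100)
```

Sweeping n_vars and recording where the internal runtime crosses 1.0 sec reproduces the quantity plotted in the figure.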
Recent Progress
• Introduced a primary axis, updated by a moving-average rule
• The performance of the ATM on rotated functions is improved
[Figure: convergence plots (log10 Δf vs. iterations) for the Rotated Ellipse (f10) and the Sharp Ridge (f13), comparing the original ATM with the new version]
Goals Moving Forward
01 Improve performance on rotated and ill-conditioned functions (without using DxD objects)
02 Increase performance in noisy environments - use averaging and a moving mean
03 Add a second population with weak restart conditions - for multimodal functions
04 Make the ATM more user-friendly and customizable
For more information see: https://github.com/BjBodner/ATM-optimization-algorithm
Conclusions
The ATM Algorithm:
• Scales linearly with the size of the search space: no DxD objects
• Very efficient at optimizing separable functions
• Underperforms on rotated functions, e.g., ill-conditioned and/or noisy functions
• A good candidate for optimizing very high-dimensional problems
• More research is needed
Acknowledgements
Dr. Brenda Rubenstein, Brown University, Providence, RI, USA, for her guidance in developing this algorithm. Her contributions and encouragement were essential in advancing this project and getting it to its current form.
Dr. Eran Treister, Ben-Gurion University, Beer-Sheva, Israel, for his ongoing collaboration. Working with him is significantly helping to improve the performance of the algorithm.
References
• Nikolaus Hansen. Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed. GECCO '09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation, Late Breaking Papers, pp. 2389-2396.
• Bin Qian, Angel R. Ortiz, David Baker. Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. PNAS, October 26, 2004, vol. 101, no. 43, 1534.
• Dan Vladimir Nichita, Susana Gomez, Eduardo Luna. Multiphase equilibria calculation by direct minimization of Gibbs free energy with a global optimization method. Computers and Chemical Engineering 26 (2002), pp. 1703-1724.
• Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah. Failures of Gradient-Based Deep Learning. ICML '17: Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3067-3075, 2017.
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 [cs.CV], 10 Dec 2015.
• Sutskever, I., Martens, J., Dahl, G., Hinton, G. On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Volume 28, ICML '13, pp. III-1139 - III-1147 (JMLR.org, 2013).
Thank you!
Questions?
Email: benjamin_bodner@brown.edu
For more information see: https://github.com/BjBodner/ATM-optimization-algorithm