4/1/2020 Benchmarking the ATM Algorithm on the BBOB 2009 Noiseless Function Testbed 1 of 34
Benchmarking The ATM Algorithm
on the BBOB 2009 Noiseless Function Testbed
Benjamin Bodner
Brown University, Providence, RI, USA
BBOB Workshop
GECCO 2019
Prague
Content
01 Introduction: Motivation, Intuition
02 Main Components: Parameters & main equations, Parameter adaptation, Resource allocation
03 Results: BBOB Noiseless, BBOB Large-scale, Internal runtime, Recent progress
04 Summary: Conclusions, Goals moving forward
Motivation
Deep Learning / Physical Sciences
• Growing need for optimization methods for very high-dimensional settings
• Problems commonly have 10^5-10^8 optimizable variables [Devlin et al. 2019]
Image from: https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063
Image from GOMC: https://gomc-wsu.github.io/Manual/index.html
Motivation
Deep Learning
• Gradient-based optimization methods can create many difficulties [Shalev-Shwartz et al. 2017]: vanishing gradients, getting stuck in local minima, noise
• Current ways of mitigating these issues do not always work: hyperparameter tuning [Sutskever 2013], architecture design [He et al. 2015], regularization [Srivastava et al. 2014]
Image from: https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Motivation
Physical Sciences: Interacting Particles / Protein Folding
• Functions are non-convex
• Notoriously have large numbers of local minima [Nichita 2002]
• Simulated annealing and quasi-Newton methods can be slow
• Do not always converge to the global minimum [Hao et al. 2015]
Image by Thomas Splettstoesser: https://www.behance.net/gallery/10952399/Protein-Folding-Funnel
Image from GOMC: https://gomc-wsu.github.io/Manual/index.html
Image from: https://en.wikibooks.org/wiki/Structural_Biochemistry/Proteins/Protein_Folding_Problem
Motivation
• Existing algorithms have been highly successful in these settings [BIPOP-CMA-ES, Hansen 2009]
• These characteristics are intentionally designed into the BBOB function testbeds
• However, covariance matrices and Hessians limit their scalability: key components and operations are usually of order D^2
Images from: Finck, Hansen, Ros, Auger 2015
Proposal
Eliminate the use of D^2 objects and operations.
Adaptive Two Mode (ATM) Algorithm: a black-box optimization algorithm which only maintains objects and executes operations of order D.
The Adaptive Two Mode Algorithm
Uses a combination of two kinds of search distributions / "modes":
• Directional distribution (exploitation)
• Isotropic distribution (exploration)
The two modes complement each other; ATM uses a set of rules to control the amplitudes of, and the interactions between, the modes.
ATM Algorithm
1. Start from an isotropic distribution (using an evolutionary strategy).
2. If a sample leads to improvement: suggest samples in that direction.
3. If the new samples also lead to improvement: sample in the same direction at exponentially increasing amplitude.
4. Once no more "good" samples are found: start over with the isotropic search.
Repeat.
[Figure: the four steps, with markers for the best sample, the best sample from the last step, and regular samples]
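The loop above can be sketched in a few lines of Python. This is a minimal illustration of the two-mode idea only: the plain Gaussian isotropic mode, the fixed growth factor, and all names here are assumptions, not the author's ATM implementation (which also adapts its parameters online).

```python
import numpy as np

def atm_sketch(f, x0, sigma0=1.0, growth=2.0, max_evals=2000, seed=0):
    """Illustrative two-mode search: an isotropic Gaussian exploration
    mode, switching to a directional mode whose step amplitude grows
    exponentially while improvement continues. A sketch of the idea
    only, not the author's exact ATM rules."""
    rng = np.random.default_rng(seed)
    x_best = np.asarray(x0, dtype=float)
    f_best = f(x_best)
    evals = 1
    while evals < max_evals:
        # Isotropic mode: sample around the current best point
        cand = x_best + sigma0 * rng.standard_normal(x_best.size)
        f_cand = f(cand)
        evals += 1
        if f_cand < f_best:
            # Directional mode: keep stepping along the improving
            # direction with exponentially increasing amplitude
            direction = cand - x_best
            x_best, f_best = cand, f_cand
            amp = 1.0
            while evals < max_evals:
                step = x_best + amp * direction
                f_step = f(step)
                evals += 1
                if f_step < f_best:
                    x_best, f_best = step, f_step
                    amp *= growth  # exponential amplitude growth
                else:
                    break  # no more "good" samples: back to isotropic mode
    return x_best, f_best
```

On a simple convex function the directional mode does most of the work, which is exactly the exploitation/exploration split the slide describes.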
Parameters of the Algorithm
There are (currently) 8 parameters, which play several roles in the ATM algorithm:
• Controlling the growth factors of the modes
• Controlling the amplitudes of the modes:

if ||X_best^t - X_best^(t-1)||^2 > (ΔX_min)^2: d += 1, r = 0
else: r += 1, d = 0

R = R_max * exp( G_r * [ sin( mod(π r / (2 T_r), π/2) ) - 1 ] )
D = R_max * exp( G_d * d - D_d * r )
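The amplitude rules can be sketched as a small helper. The grouping inside sin(...) is my reading of the slide, and the default parameter values are placeholders, not the author's tuned settings.

```python
import math

def amplitudes(r, d, R_max=1.0, G_r=1.0, G_d=0.5, D_d=0.1, T_r=10):
    """Illustrative amplitude rules: r counts consecutive failed
    iterations, d counts consecutive improving ones."""
    # Isotropic amplitude R oscillates with period set by T_r,
    # never exceeding R_max (the exponent is at most 0)
    R = R_max * math.exp(G_r * (math.sin((math.pi * r / (2 * T_r)) % (math.pi / 2)) - 1.0))
    # Directional amplitude D grows with consecutive improvements d
    # and shrinks with consecutive failures r
    D = R_max * math.exp(G_d * d - D_d * r)
    return R, D
```

With r = d = 0 this gives D = R_max and a damped isotropic amplitude, matching the intuition that the directional mode takes over once improvements are found.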
Parameters of the Algorithm
• Controlling the search distribution along different axes:

S = β S + (1 - β) * mean( (O - O_Gbest) / (X - X_Gbest)^2 )
A = α / (S + α_2)

• The objective used for online parameter tuning:

OP = ( mean(ΔOP_best) + min(ΔOP_best) ) / 2
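The per-axis scaling update above might be sketched as follows. This is my reading of the slide's (extraction-damaged) equations, and beta, alpha, alpha2 are placeholder values; the small eps guard is my addition to avoid division by zero.

```python
import numpy as np

def update_axis_scaling(S, O, X, O_gbest, X_gbest,
                        beta=0.9, alpha=1.0, alpha2=1e-8, eps=1e-12):
    """Illustrative update of the per-axis sensitivity S and the
    resulting per-axis amplitude A. O is a vector of objective values
    for the current samples X (one row per sample)."""
    # Per-axis sensitivity: objective change per squared displacement
    # from the global best, averaged over the current batch
    sens = np.mean((np.asarray(O) - O_gbest)[:, None] /
                   ((np.asarray(X) - np.asarray(X_gbest)) ** 2 + eps), axis=0)
    S = beta * np.asarray(S) + (1.0 - beta) * sens  # moving average
    A = alpha / (S + alpha2)  # sensitive axes get smaller amplitudes
    return S, A
```

The inverse relation A = α / (S + α₂) means axes along which the objective changes steeply are searched with smaller amplitudes.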
Online Parameter Tuning
ΔOP_best = the best change in the true objective function found by the parameter set
Different functions, and changing characteristics at different stages of the search, create the need for online parameter tuning. How to do this?
• 4 intertwined parameter sets
• Parameter sets are optimized by another Two-Mode algorithm
• The objective function is designed to reflect the "success" at the task of minimizing the true objective function
Problem with Online Tuning
New parameter sets, combined with a changing local search space, mean there is a good chance of unsuitable sets.
Proposal: fewer resources to "bad" parameter sets, more resources to better ones. The resources allocated to a parameter set should reflect the performance of that parameter set.

Parallel Optimization with Resource Allocation
Given a fixed number of samples N_tot, distributed among m parameter sets, change the allocation of samples to reflect their performance:
N_{t+1} = N_t - K M^{-1} ΔOP_best^t - K_0 M^{-1} (N_t - N_0)
where N_t is the resource allocation vector at iteration t.

Parallel Optimization with Resource Allocation - Choice of Matrices

    | (m-1)K    -K    ...    -K   |
K = |   -K    (m-1)K  ...    -K   |
    |   ...     ...   ...    ...  |
    |   -K      -K    ...  (m-1)K |

K_0 = k_0 * I,  M = μ * I

• Conserves the total number of samples
• Merit-based allocation system
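The allocation update above can be sketched directly. The coupling matrix has columns summing to zero, which is what conserves the total sample count; the values of K, k0, and μ below are placeholders, not the author's tuned constants.

```python
import numpy as np

def update_allocation(N, delta_op_best, N0, K=0.5, k0=0.1, mu=1.0):
    """Illustrative allocation update
    N_{t+1} = N_t - K M^{-1} ΔOP_best - K0 M^{-1} (N_t - N0).
    ΔOP_best follows the minimization convention: more negative
    values mean larger improvement, so those sets gain samples."""
    m = len(N)
    N = np.asarray(N, dtype=float)
    # Coupling matrix: (m-1)*K on the diagonal, -K off-diagonal
    Kmat = -K * np.ones((m, m)) + m * K * np.eye(m)
    Minv = np.eye(m) / mu
    dOP = np.asarray(delta_op_best, dtype=float)
    # Merit term redistributes samples; restoring term pulls toward N0
    return N - Kmat @ Minv @ dOP - k0 * Minv @ (N - np.asarray(N0, float))
```

Because each column of Kmat sums to zero, the merit term only moves samples between sets; and with N0 chosen to have the same total, the restoring term conserves the total as well.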
Information Flow Throughout ATM Components
ATM optimization process: resource allocation -> parameter sets 1-4 -> suggestions for samples -> evaluate samples -> values of the objective function -> back to resource allocation. Repeat.
Results on BBOB Testbed - Overview
Succeeds at solving:
• 23/24 functions in 2D
• 8/24 functions in 40D
• One of the best optimizers for the separable functions subset (f1-f5), given a large budget
• Underperforms on non-separable functions, especially if ill-conditioned and/or noisy
[Figure panels: Rotated Ellipse (f10), Sharp Ridge (f13), Sum of Different Powers (f14)]
Results on BBOB Testbed - Successes
• Very effective at optimizing separable functions
• Capable at optimizing functions with "large" convex regions around the global minimum ("large" = comparable to R_max)
Results on BBOB Testbed - Underperformance
• Poor performance on rotated and ill-conditioned functions
• Poor performance on rotated and noisy/multimodal functions
Results from BBOB Large-scale
Budget = 3000*D
Ability to Scale to Large Search Spaces
The internal runtime of the ATM algorithm scales linearly with the number of variables in the search space.
Results from the timing experiment:
• Internal runtime = total runtime - evaluation time
• 128 function evaluations, averaged over 3 runs
• f1 (sphere function)
• Measured: the number of variables needed to pass 1.0 sec of internal runtime
• Run on a Google Colab GPU
• Compared against: CMA (pip install cma), BFGS (scipy.optimize), L-BFGS-B (scipy.optimize), Nelder-Mead (scipy.optimize)
[Figure: internal runtime (seconds) as a function of the number of variables in the search space, for ATM and the baselines above]
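The timing protocol can be reproduced along these lines. The ATM implementation itself is not shown; this sketch times any optimizer with a scipy-style interface on the sphere function, subtracting time spent inside the objective from the total, as the slide describes. Function and parameter names are illustrative.

```python
import time
import numpy as np
from scipy.optimize import minimize

def internal_runtime(optimizer, n_vars, n_evals=128, n_runs=3):
    """Internal runtime = total runtime - time spent evaluating f,
    measured on the sphere function (f1) and averaged over runs."""
    eval_time = [0.0]
    def sphere(x):
        t0 = time.perf_counter()
        val = float(np.dot(x, x))
        eval_time[0] += time.perf_counter() - t0
        return val
    totals = []
    for _ in range(n_runs):
        eval_time[0] = 0.0
        x0 = np.full(n_vars, 3.0)
        t0 = time.perf_counter()
        optimizer(sphere, x0, n_evals)
        totals.append(time.perf_counter() - t0 - eval_time[0])
    return sum(totals) / n_runs

# Example baseline: Nelder-Mead's internal overhead at D = 100
def nelder_mead(f, x0, budget):
    minimize(f, x0, method="Nelder-Mead", options={"maxfev": budget})

rt = internal_runtime(nelder_mead, 100)
```

Sweeping n_vars and recording where the internal runtime crosses 1.0 sec reproduces the quantity plotted in the figure.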
Recent Progress
• Introduced a primary axis, updated by a moving-average rule
• The performance of the ATM on rotated functions is improved
[Figure: convergence plots (log10 Δf vs. iterations) for the Rotated Ellipse (f10) and the Sharp Ridge (f13), comparing the original ATM with the new version]
Goals Moving Forward
01 Improve performance on rotated and ill-conditioned functions (without using DxD objects)
02 Increase performance in noisy environments - use averaging and a moving mean
03 Add a second population with weak restart conditions - for multimodal functions
04 Make the ATM more user-friendly and customizable
For more information see: https://github.com/BjBodner/ATM-optimization-algorithm
Conclusions
The ATM Algorithm:
• Scales linearly with the size of the search space: no DxD objects
• Very efficient at optimizing separable functions
• Underperforms on rotated functions, e.g., ill-conditioned and/or noisy functions
• A good candidate for optimizing very high-dimensional problems
• More research is needed
Acknowledgements
Dr. Brenda Rubenstein, Brown University, Providence, RI, USA, for her guidance in developing this algorithm. Her contributions and encouragement were essential in advancing this project and getting it to its current form.
Dr. Eran Treister, Ben-Gurion University, Beer-Sheva, Israel, for his ongoing collaboration. Working with him is significantly helping to improve the performance of the algorithm.
References
• Nikolaus Hansen. Benchmarking a BI-Population CMA-ES on the BBOB-2009 Function Testbed. GECCO '09: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation, Late Breaking Papers, pp. 2389-2396.
• Bin Qian, Angel R. Ortiz, David Baker. Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. PNAS, October 26, 2004, vol. 101, no. 43, 1534.
• Dan Vladimir Nichita, Susana Gomez, Eduardo Luna. Multiphase equilibria calculation by direct minimization of Gibbs free energy with a global optimization method. Computers and Chemical Engineering 26 (2002), pp. 1703-1724.
• Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah. Failures of Gradient-Based Deep Learning. ICML '17: Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3067-3075, 2017.
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385v1 [cs.CV], 10 Dec 2015.
• Sutskever, I., Martens, J., Dahl, G., Hinton, G. On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, Volume 28, ICML '13, pp. III-1139 - III-1147 (JMLR.org, 2013).
Thank you!
Questions?
Email: benjamin_bodner@brown.edu
For more information see: https://github.com/BjBodner/ATM-optimization-algorithm