Machine learning and black-box expensive optimization

Sébastien Verel
Laboratoire d'Informatique, Signal et Image de la Côte d'Opale (LISIC), Université du Littoral Côte d'Opale, Calais, France
http://www-lisic.univ-littoral.fr/~verel/
June 18th, 2018

Outline: Introduction - Learning for optimization - Optimization for learning
AI: Machine Learning, Optimization, perception, etc.

Learning:
Minimize an error function.
M_θ: model to learn on data.
Search θ* = arg min_θ Error(M_θ, data).
According to the model dimension, variables, error function, etc., there is a huge number of optimization algorithms.

Optimization:
Learn to design an algorithm that finds good solutions.
A_θ: search algorithm for problems (X, f).
Learn A_θ such that A_θ(X, f) = arg min_{x∈X} f(x).
According to the class of algorithms, search spaces, functions, etc., there is a huge number of learning algorithms.

Artificial: from paper to computer!
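The "learning as optimization" view above can be sketched on a toy case: fitting a one-parameter model M_θ by gradient descent on an error function. All names and data here are illustrative, not from the talk.

```python
# Minimal sketch of "learning = optimization": fit M_theta(x) = theta * x
# by minimizing a squared-error function with plain gradient descent.

def error(theta, data):
    # Error(M_theta, data): mean squared error over (x, y) pairs
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.05, steps=200):
    # Search theta* = argmin_theta Error(M_theta, data)
    theta = 0.0
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad
    return theta

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x
theta_star = fit(data)
print(round(theta_star, 2))
```

Any gradient-based trainer in machine learning is an instance of this loop, with θ high-dimensional and the gradient computed by backpropagation.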
Black-box (expensive) optimization

x → [black box] → f(x)

No information on the definition of the objective function f.

Objective function:
• can be irregular, non-continuous, non-differentiable, etc.
• given by a computation or an (expensive) simulation

A few examples from the team:
• Mobility simulation (Florian Leprêtre)
• Plant biology, plant growth (Amaury Dubois)
• Logistics simulation (Brahim Aboutaib)
• Cellular automata
• Nuclear power plant (Valentin Drouet)
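A black-box optimizer may only query f, never inspect it. As a minimal sketch (illustrative code, not the team's), a (1+1) evolutionary algorithm optimizing a stand-in objective:

```python
import random

# Sketch of black-box optimization: the search loop only calls f(x),
# with no access to its definition.

def onemax(bits):
    # Stand-in for an expensive black-box simulation
    return sum(bits)

def one_plus_one_ea(f, n, evals, seed=0):
    # (1+1) evolutionary algorithm: flip each bit with probability 1/n,
    # keep the mutant if it is at least as good.
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)
    for _ in range(evals):
        y = [b ^ 1 if rng.random() < 1 / n else b for b in x]
        fy = f(y)
        if fy >= fx:
            x, fx = y, fy
    return x, fx

best, value = one_plus_one_ea(onemax, n=20, evals=2000)
print(value)
```

When each call to f is an expensive simulation, the evaluation budget `evals` is the scarce resource, which motivates the parallel computing and surrogate models discussed next.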
Real-world black-box expensive optimization
PhD theses of Mathieu Muniglia (2014-2017) and Valentin Drouet (2017-2020), CEA, Paris

x → [multi-physics simulator] → f(x)
(73, ..., 8) → [multi-physics simulator] → Δz_P

Expensive optimization: parallel computing and a surrogate model.
Adaptive distributed optimization algorithms
Christopher Jankee, Bilel Derbel, Cyril Fonlupt

Portfolio of algorithms: control of the algorithm during optimization.

How to select an algorithm? Design reinforcement learning methods for distributed computing (ε-greedy, adaptive pursuit, UCB, ...).

How to compute a reward? Use an aggregation function of the local rewards (mean, max, etc.) for a global selection.

Methodology: use designed benchmark functions with designed properties, and experimental analysis.
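The selection question can be sketched with ε-greedy, one of the strategies listed above. The rewards here are synthetic stand-ins for local fitness gains; all names are illustrative.

```python
import random

# Sketch of epsilon-greedy algorithm selection in a portfolio:
# pick an algorithm, observe a reward, update its running average.

def epsilon_greedy(rewards, steps, eps=0.1, seed=1):
    # rewards: one stochastic reward function per algorithm in the portfolio
    rng = random.Random(seed)
    k = len(rewards)
    counts, means = [0] * k, [0.0] * k
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                        # explore
        else:
            a = max(range(k), key=lambda i: means[i])   # exploit
        r = rewards[a](rng)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]          # incremental mean
    return counts, means

# Two hypothetical algorithms; the second yields higher expected reward.
portfolio = [lambda rng: rng.gauss(0.2, 0.05),
             lambda rng: rng.gauss(0.5, 0.05)]
counts, means = epsilon_greedy(portfolio, steps=1000)
print(counts.index(max(counts)))   # index of the most-selected algorithm
```

In the distributed setting, the reward fed to this controller is the chosen aggregation (mean, max, etc.) of the local rewards reported by the nodes.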
Features to learn: multi-objective fitness landscapes
K. Tanaka, H. Aguirre (Univ. Shinshu), A. Liefooghe, B. Derbel (Univ. Lille), 2010-2018

Fitness landscape (X, f, N): search space, objective function, neighborhood relation.
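The triplet (X, f, N) can be made concrete on bit strings with the Hamming-1 neighborhood. An illustrative sketch (enumerating local optima is only feasible for tiny n):

```python
from itertools import product

# A fitness landscape (X, f, N) made concrete:
# X = {0,1}^n, f = a small pseudo-boolean function, N = Hamming-1 neighbors.

def neighbors(x):
    # N(x): all bit strings at Hamming distance 1 from x
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def local_optima(f, n):
    # A point is a local optimum if no neighbor is strictly better
    return [x for x in product((0, 1), repeat=n)
            if all(f(y) <= f(x) for y in neighbors(x))]

f = lambda x: sum(x)        # OneMax: a single optimum at (1, ..., 1)
opts = local_optima(f, 3)
print(opts)
```

Landscape features (number of local optima, objective correlations, etc.) summarize such structure and, as below, can be computed from samples rather than full enumeration.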
[Figure: Objective 1 vs. Objective 2 scatter plots for three cases: conflicting objectives, independent objectives, correlated objectives]
[Figure: heatmap of Kendall's tau correlations between landscape features (hv, hvd, nhv, #sup, #inf, #inc, #lnd, #lsupp, length, rho, f_cor, k_n; averaged and rank-1 variants over rws/aws walks)]
Performance prediction of GSEMO (cross-validation):

feature set        MAE       MSE       R2        rank
all                0.007781  0.000118  0.951609  1
enumeration        0.008411  0.000142  0.943046  2
sampling all       0.009113  0.000161  0.932975  3
sampling rws       0.009284  0.000167  0.930728  4
sampling aws       0.010241  0.000195  0.917563  5
{r, m, n, k/n}     0.010609  0.000215  0.911350  6
{r, m, n}          0.026974  0.001123  0.518505  7
{m, n}             0.032150  0.001545  0.340715  8
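The table's error columns follow the standard definitions; a sketch on toy numbers (not the study's data):

```python
# Standard regression-quality metrics used in the table above.

def mae(y, yhat):
    # mean absolute error
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mse(y, yhat):
    # mean squared error
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    # coefficient of determination: 1 - residual / total sum of squares
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y    = [0.10, 0.20, 0.30, 0.40]   # toy true performances
yhat = [0.12, 0.19, 0.33, 0.38]   # toy predictions
print(round(mae(y, yhat), 3), round(r2(y, yhat), 3))
```

Lower MAE/MSE and higher R2 are better, which is what ranks the feature sets in the table.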
Learning/tuning parameters according to features
Surrogate models for pseudo-boolean functions

Goal: replace/learn the (expensive) objective function with a (cheap) meta-model during the optimization process.

Continuous optimization: neural networks, Gaussian processes (kriging); EGO: sample the next solution with maximum expected improvement.

GP: random variables that have a joint Gaussian distribution.
mean: m(y(x)) = μ
covariance: cov(y(x), y(x')) = exp(−θ d(x, x')^p)

From: Rasmussen & Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
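A minimal sketch of kriging prediction with the covariance above, assuming μ = 0 (simple kriging), two observed points, and closed-form inversion of the 2×2 covariance matrix. All values are illustrative.

```python
import math

# Simple kriging with cov(y(x), y(x')) = exp(-theta * d(x, x')**p),
# assuming a zero prior mean (mu = 0).

def k(x, xp, theta=1.0, p=2):
    return math.exp(-theta * abs(x - xp) ** p)

xs, ys = [0.0, 1.0], [0.0, 1.0]   # two observed (x, f(x)) points
xq = 0.5                           # query point

# 2x2 covariance matrix K of the observations and its closed-form inverse
a, b = k(xs[0], xs[0]), k(xs[0], xs[1])
det = a * a - b * b
Kinv = [[a / det, -b / det], [-b / det, a / det]]

# Posterior mean at xq: k_q^T K^{-1} y
kq = [k(xq, xs[0]), k(xq, xs[1])]
alpha = [sum(Kinv[i][j] * ys[j] for j in range(2)) for i in range(2)]
mean = sum(kq[i] * alpha[i] for i in range(2))
print(round(mean, 3))
```

EGO wraps this predictor (mean plus posterior variance) in an expected-improvement criterion to choose the next expensive evaluation.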
Proposition
Walsh function basis: ∀x ∈ {0,1}^n, φ_k(x) = (−1)^{Σ_{j=0}^{n−1} k_j x_j}

f(x) = Σ_{k=0}^{2^n−1} w_k φ_k(x), with w_k = (1/2^n) Σ_{x∈{0,1}^n} f(x) φ_k(x)

Surrogate model: f̂(x) = Σ_{k : o(φ_k) ≤ d} w_k φ_k(x)
Estimate the coefficients with LARS.
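The averaging formula for w_k can be checked directly on a tiny function (illustrative sketch; in practice the coefficients are estimated from a sample with LARS rather than by full enumeration):

```python
from itertools import product

# Walsh decomposition of a small pseudo-boolean function: compute every
# coefficient w_k by the averaging formula, then check that the
# expansion reconstructs f exactly.

def phi(k, x):
    # Walsh basis function: (-1) ** sum_j k_j * x_j, with k, x bit tuples
    return -1 if sum(kj * xj for kj, xj in zip(k, x)) % 2 else 1

def walsh_coeffs(f, n):
    # w_k = (1 / 2^n) * sum_x f(x) * phi_k(x)
    X = list(product((0, 1), repeat=n))
    return {k: sum(f(x) * phi(k, x) for x in X) / 2 ** n for k in X}

n = 3
f = lambda x: sum(x) + x[0] * x[1]   # small pseudo-boolean function
w = walsh_coeffs(f, n)

# Reconstruction: f(x) = sum_k w_k * phi_k(x)
x = (1, 0, 1)
approx = sum(wk * phi(k, x) for k, wk in w.items())
print(approx, f(x))
```

The order o(φ_k) is the number of nonzero bits of k; the surrogate keeps only the terms with o(φ_k) ≤ d, which bounds the number of coefficients to estimate.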
[Figure: mean absolute error of the fitness model vs. sample size (0-500), comparing the kriging and Walsh surrogate methods]
Energy surface of deep learning problems

To learn a deep NN: high-dimensional space; minimize the error with variants of stochastic gradient descent.

Why does it work? Depth improves expressiveness but complicates optimization.
What is the shape of the energy surface?

A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. "The loss surfaces of multilayer networks." In Artificial Intelligence and Statistics, pp. 192-204, 2015.
P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. "Entropy-SGD: Biasing gradient descent into wide valleys." ICLR, 2017.
S. Arora, N. Cohen, and E. Hazan. "On the optimization of deep networks: Implicit acceleration by overparameterization." arXiv preprint arXiv:1802.06509, 2018.

Perspective
Study the geometry of the fitness landscape...
Any ideas and collaborations are welcome!