Gradient-based Hyperparameter Optimization
Paolo Frasconi, Università degli Studi di Firenze, Italy (http://ai.dinfo.unifi.it/paolo/)
Joint work with Luca Franceschi (IIT and UCL), Michele Donini (IIT), Massimiliano Pontil (IIT and UCL)
EMMCVPR 2017 — Venezia, November 1st, 2017
Hyperparameter Optimization
Most machine learning algorithms depend on the values of some variables that must be decided before learning starts.

At least three kinds of hyperparameters:
• Regularization (e.g. amount of L2 or L1 penalty, dropout, multitask, etc.)
• Hypothesis space (e.g. variables in the kernel, layers in neural nets, etc.)
• Optimization (e.g. learning rate, momentum, etc.)
HO: tune hyperparameters automatically
Some approaches to HO
• Grid search (trivial): only practical for 1–2 hyperparameters
• Random search: better than grid search (J. Bergstra and Bengio 2012) — 32 hyperparameters
• Bayesian approaches (J. Bergstra, Yamins, et al. 2013) (Hyperopt) — 238 hyperparameters
• Spearmint (Snoek et al. 2012) — 288 hyperparameters
• Sequential model-based optimization (SMBO, SMAC) (Hutter et al. 2011)
• Tree-structured Parzen Estimator (J. S. Bergstra et al. 2011; Thornton et al. 2011)
Gradient-based HO
• Early works were limited to few hyperparameters (Bengio 2000; Larsen et al. 1996)
• More recent works can handle one thousand hyperparameters (Maclaurin et al. 2015; Pedregosa 2016)
Simple example: Ridge regression
Prediction function: $g:\mathbb{R}^d\to\mathbb{R}$, $g(x;w)=w^\intercal x$

Learning problem:

$$J(w,\lambda) \doteq \sum_{(x,y)\in T}\big[y-g(x;w)\big]^2+\lambda\|w\|^2$$

Closed-form solution:

$$w(\lambda)=\operatorname*{argmin}_{w} J(w,\lambda)=(X^\intercal X+\lambda I)^{-1}X^\intercal Y$$
Response function:

$$f(\lambda) \doteq \sum_{(x,y)\in V}\big[y-g(x;w(\lambda))\big]^2 = \sum_{(x,y)\in V}\big[y-\big((X^\intercal X+\lambda I)^{-1}X^\intercal Y\big)^\intercal x\big]^2$$
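To make this concrete, here is a minimal NumPy sketch (with made-up synthetic data) that evaluates f(λ) through the closed-form solution and checks its derivative by finite differences, which is the quantity a gradient-based HO method would descend:

```python
# Minimal sketch of the ridge response function f(lambda);
# the data are synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 5)), rng.normal(size=50)     # training set T
Xv, Yv = rng.normal(size=(20, 5)), rng.normal(size=20)   # validation set V

def w_star(lam):
    # w(lambda) = (X^T X + lambda I)^{-1} X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def f(lam):
    # response function: validation squared error at w(lambda)
    r = Yv - Xv @ w_star(lam)
    return r @ r

# df/dlambda by central finite differences; a gradient-based HO step
# would move lambda along -df/dlambda
lam, eps = 0.1, 1e-5
print(f(lam), (f(lam + eps) - f(lam - eps)) / (2 * eps))
```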
Response function

[Figure: plot of the response function f(λ).]
HO as a bilevel program
Two sets of variables:
• outer variables: hyperparameters λ
• inner variables: parameters w

Optimize the outer problem subject to optimality of the inner problem:

$$\min_{\lambda}\; f(\lambda,w)\quad\text{s.t.}\quad w\in\operatorname*{argmin}_{w} J(\lambda,w)$$

In HO, the outer problem is the validation loss and the inner problem is the training objective.
HO as a bilevel program
Moore, G., Bergeron, C., & Bennett, K. P. (2011). Model selection forprimal SVM. Machine Learning, 85(1–2), 175–208.
Learning dynamics
• We assume that the objectives are differentiable
• However, in general there is no closed-form solution
• In fact, the objectives may be non-convex
• Thus we introduce learning dynamics encompassing those in stochastic gradient descent algorithms such as Nesterov, Adam, RMSProp, etc.
Example: SGD with momentum on neural networks
Dynamical system:

$$v_t=\mu v_{t-1}-\eta\,\nabla J_t(w_{t-1}),\qquad w_t=w_{t-1}+v_t$$

• $w_t$ are the weights, $v_t$ the velocities
• µ and η are optimization hyperparameters
• $J_t$ is the lower objective for the t-th minibatch
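As an illustration (not the talk's code), these dynamics can be packaged as a state map over s = (w, v); the quadratic inner objective below is an assumption for the demo:

```python
# Sketch: one momentum step as a state map s_t = Phi_t(s_{t-1}, lambda),
# with state s = (w, v) and hyperparameters lambda = (mu, eta).
import numpy as np

def Phi(s, lam, grad_J):
    w, v = s
    mu, eta = lam
    v = mu * v - eta * grad_J(w)   # v_t = mu v_{t-1} - eta grad J_t(w_{t-1})
    return (w + v, v)              # w_t = w_{t-1} + v_t

# toy inner objective J(w) = 0.5 ||w||^2, so grad J(w) = w
s = (np.ones(3), np.zeros(3))
for t in range(100):
    s = Phi(s, lam=(0.9, 0.1), grad_J=lambda w: w)
print(s[0])   # close to the minimizer w = 0
```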
Learning dynamics in general
$$s_t=\Phi_t(s_{t-1},\lambda),\qquad t=1,\dots,T$$

• The state $s_t$ contains the parameters and accessory variables (e.g. velocities)
• $\Phi_t:\mathbb{R}^d\times\mathbb{R}^m\to\mathbb{R}^d$ is a smooth mapping representing the operation performed by the t-th step of the optimization algorithm (on minibatch t)
• The iterates $s_1,\dots,s_T$ depend on the hyperparameters λ both explicitly and implicitly

[Figure: the chain of update steps over minibatches 1, 2, …, T.]
Hyperparameter Optimization
[Figure: the update chain over minibatches 1, 2, …, T, ending with evaluation on the validation set.]

Change the bilevel program to use the parameters at the last iterate $s_T$ rather than w:

$$\min_{\lambda} f(\lambda)$$

where $f:\mathbb{R}^m\to\mathbb{R}$ is the response function, redefined as

$$f(\lambda)=E(s_T(\lambda))$$

Hypergradient:

$$\nabla f(\lambda)=\nabla E(s_T)\,\frac{ds_T}{d\lambda}$$
Hyperparameter Optimization
[Figure: the same chain seen as a recurrent network unrolled over minibatches 1, 2, …, T, with the validation set at the end.]

Similar to a recurrent neural network, but:
• minibatches are like the inputs to the RNN
• the state of the RNN is like the parameters of the model
• the hyperparameters are like the weights of the RNN
• the validation error is like the training loss of the RNN

Indeed, (Maclaurin et al. 2015) proposed to use backpropagation (without mentioning BPTT or RNNs).
Algorithmic Differentiation
• Most (complex) functions of interest in ML can be computed by composing elementary operations whose derivatives are readily available
• Algorithmic differentiation is more effective than alternative ways of computing derivatives such as:
  – numerical differentiation (subject to round-off errors)
  – symbolic differentiation (subject to expressions of exploding size)
• Backpropagation (Werbos 1982) is perhaps the most widely known AD technique in machine learning
Algorithmic Differentiation
Describe a function y = f(x) via a computation graph (essentially a circuit) where each node contains a value $v_i$.

There are two main approaches to AD:

• Forward mode: for each node i and a fixed input $x_j$, define
$$\dot v_i \doteq \frac{\partial v_i}{\partial x_j}$$
$\dot v_i$ can be computed from the $\dot v_k$'s (k parents of i)

• Reverse mode: for each node i and a fixed output $y_j$, define
$$\bar v_i \doteq \frac{\partial y_j}{\partial v_i}$$
$\bar v_i$ can be computed from the $\bar v_k$'s (k children of i), but all the $v_i$ must be stored in memory!
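A hedged, minimal illustration of the two modes on the toy function y = x₁x₂ + sin(x₁), hand-coding the forward tangents (v̇) and the reverse adjoints (v̄):

```python
# Forward vs. reverse mode AD on y = x1*x2 + sin(x1), coded by hand.
import math

# forward mode: carry (value, derivative w.r.t. x1) through the graph
x1, x1d = 2.0, 1.0                       # seed: dx1/dx1 = 1
x2, x2d = 3.0, 0.0
a, ad = x1 * x2, x1d * x2 + x1 * x2d     # product rule
b, bd = math.sin(x1), math.cos(x1) * x1d
y, yd = a + b, ad + bd                   # yd = dy/dx1

# reverse mode: one backward sweep gives dy/dx1 and dy/dx2 together,
# but the forward values (x1, x2, ...) must be kept in memory
ybar = 1.0
abar = bbar = ybar                       # y = a + b
x1bar = abar * x2 + bbar * math.cos(x1)  # through a = x1*x2 and b = sin(x1)
x2bar = abar * x1
print(yd, x1bar, x2bar)                  # yd == x1bar == x2 + cos(x1)
```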
Algorithmic Differentiation for RNNs
• Not surprisingly, both reverse-mode and forward-mode AD were popular for training RNNs in the late 1980s
• Backpropagation through time (see e.g. Werbos 1988, Pearlmutter 1989) is reverse-mode AD
• Real-time recurrent learning (see e.g. Mozer 1989, Williams & Zipser 1989) is forward-mode AD
Reverse mode
The HO problem can be reformulated as a constrained optimization problem:

$$\min_{\lambda,s_1,\dots,s_T} E(s_T)\quad\text{s.t.}\quad s_t=\Phi_t(s_{t-1},\lambda),\; t\in\{1,\dots,T\}$$

Use the classical Lagrangian formalism employed to derive backprop (LeCun 1988):

$$\mathcal{L}(s,\lambda,\alpha)=E(s_T)+\sum_{t=1}^{T}\alpha_t\big(\Phi_t(s_{t-1},\lambda)-s_t\big),\qquad \alpha_t\in\mathbb{R}^d$$

Constraints on hyperparameters can be specified naturally.
Reverse mode
Partial derivatives of the Lagrangian:

$$\frac{\partial\mathcal{L}}{\partial\alpha_t}=\Phi_t(s_{t-1},\lambda)-s_t,\qquad t\in\{1,\dots,T\}$$

$$\frac{\partial\mathcal{L}}{\partial s_t}=\alpha_{t+1}\frac{\partial\Phi_{t+1}(s_t,\lambda)}{\partial s_t}-\alpha_t,\qquad t\in\{1,\dots,T-1\}$$

The second equation yields a useful recursion: letting

$$A_{t+1}\doteq\frac{\partial\Phi_{t+1}(s_t,\lambda)}{\partial s_t}\quad(\text{a } d\times d \text{ matrix}),\qquad\text{then}\qquad \alpha_t=\alpha_{t+1}A_{t+1}$$
Reverse mode
The base step for the recursion is derived from

$$\frac{\partial\mathcal{L}}{\partial s_T}=\nabla E(s_T)-\alpha_T$$

Finally, the whole hypergradient is

$$\frac{\partial\mathcal{L}}{\partial\lambda}=\sum_{t=1}^{T}\alpha_t\underbrace{\frac{\partial\Phi_t(s_{t-1},\lambda)}{\partial\lambda}}_{B_t}$$

where $B_t$ is a $d\times m$ matrix.
Reverse mode
Reverse-HG(λ, s0)

    Inputs: current hyperparameters λ, initial state s0
    Outputs: hypergradient at λ
    for t = 1 to T
        st = Φt(st−1, λ)            // d vector; all states must be stored
    αT = ∇E(sT)
    g = 0
    for t = T downto 1
        At = ∂Φt(st−1, λ)/∂st−1     // d × d matrix
        Bt = ∂Φt(st−1, λ)/∂λ        // d × m matrix
        g = g + αtBt                // m vector
        αt−1 = αtAt                 // d vector (the recursion αt = αt+1At+1)
    return g
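The following NumPy sketch instantiates Reverse-HG in the simplest possible case: plain gradient descent wt = wt−1 − η∇J(wt−1), a single hyperparameter λ = η, a quadratic training objective, and a quadratic validation error. The data (H, c, y) are made up for illustration:

```python
# Reverse-HG on a toy problem: dynamics w_t = w_{t-1} - eta * grad J(w_{t-1}),
# J(w) = 0.5 w^T H w - c^T w (training), E(w) = 0.5 ||w - y||^2 (validation).
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 4, 50, 0.05
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)                     # SPD toy Hessian
c, y = rng.normal(size=d), rng.normal(size=d)
grad_J = lambda w: H @ w - c

# forward pass: the whole trajectory is stored (the memory cost of reverse mode)
ws = [np.zeros(d)]
for t in range(T):
    ws.append(ws[-1] - eta * grad_J(ws[-1]))

# backward pass: g += alpha_t B_t, then alpha_{t-1} = alpha_t A_t,
# with A_t = I - eta H (d x d) and B_t = -grad J(w_{t-1}) (d x 1 since m = 1)
alpha = ws[-1] - y                          # alpha_T = grad E(s_T)
g = 0.0
for t in range(T, 0, -1):
    g += alpha @ (-grad_J(ws[t - 1]))       # alpha_t B_t
    alpha = alpha @ (np.eye(d) - eta * H)   # alpha_{t-1} = alpha_t A_t

print("hypergradient df/deta =", g)
```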
Forward mode
Use the chain rule:

$$\nabla f(\lambda)=\nabla E(s_T)\,\frac{ds_T}{d\lambda}$$

Plug in the learning dynamics:

$$\frac{ds_t}{d\lambda}=\frac{\partial\Phi_t(s_{t-1},\lambda)}{\partial s_{t-1}}\,\frac{ds_{t-1}}{d\lambda}+\frac{\partial\Phi_t(s_{t-1},\lambda)}{\partial\lambda}$$
Forward mode recursion
The same recursion with the matrices named:

$$\underbrace{\frac{ds_t}{d\lambda}}_{Z_t\;(d\times m)}=\underbrace{\frac{\partial\Phi_t(s_{t-1},\lambda)}{\partial s_{t-1}}}_{A_t\;(d\times d)}\underbrace{\frac{ds_{t-1}}{d\lambda}}_{Z_{t-1}\;(d\times m)}+\underbrace{\frac{\partial\Phi_t(s_{t-1},\lambda)}{\partial\lambda}}_{B_t\;(d\times m)}$$
Forward mode recursion unrolled
$$\begin{aligned}\nabla f(\lambda)&=\nabla E(s_T)\,Z_T\\ &=\nabla E(s_T)\,(A_T Z_{T-1}+B_T)\\ &=\nabla E(s_T)\,(A_T A_{T-1} Z_{T-2}+A_T B_{T-1}+B_T)\\ &\;\;\vdots\\ &=\nabla E(s_T)\left(\sum_{t=1}^{T}(A_{t+1}\cdots A_T)\,B_t\right)\end{aligned}$$
Forward mode
Forward-HG(λ, s0)

    Inputs: current hyperparameters λ, initial state s0
    Outputs: hypergradient at λ
    Z0 = 0
    for t = 1 to T
        st = Φt(st−1, λ)            // d vector
        At = ∂Φt(st−1, λ)/∂st−1     // d × d matrix
        Bt = ∂Φt(st−1, λ)/∂λ        // d × m matrix
        Zt = AtZt−1 + Bt            // d × m matrix
        // memory for st (and At, Bt, Zt) can be reused in this case!
    return ∇E(sT)ZT
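For comparison, the same toy problem as in the Reverse-HG sketch, now in forward mode: Zt is propagated alongside the weights and no trajectory is stored. With the same synthetic data it returns the same hypergradient:

```python
# Forward-HG on the same toy problem: Z_t = A_t Z_{t-1} + B_t, no stored trace.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 4, 50, 0.05
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)
c, y = rng.normal(size=d), rng.normal(size=d)
grad_J = lambda w: H @ w - c

w, Z = np.zeros(d), np.zeros(d)      # Z_0 = 0 (d x 1, since m = 1)
for t in range(T):
    A = np.eye(d) - eta * H          # A_t = dPhi_t/ds_{t-1}, at the old w
    B = -grad_J(w)                   # B_t = dPhi_t/deta, at the old w
    Z = A @ Z + B                    # update Z_t before overwriting w
    w = w - eta * grad_J(w)          # s_t: memory for w is reused

print("hypergradient df/deta =", (w - y) @ Z)   # grad E(s_T) Z_T
```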
Computation graph
[Figure: computation graph of the hypergradient, with product (×) and sum (+) nodes chaining the At and Bt blocks across time steps.]
Real-time HO
For t ∈ {1, …, T} define

$$f_t(\lambda)=E(s_t(\lambda))$$

(the previous response function is $f_T$). Partial hypergradients are available in forward mode:

$$\nabla f_t(\lambda)=\frac{dE(s_t)}{d\lambda}=\nabla E(s_t)\,Z_t$$

• Significantly, we can update the hyperparameters several times within a single optimization epoch, without having to wait until time T
• Similar to RTRL; applicable to data streams (or large datasets)
Real-time HO
RTHO(λ, s0)

    Inputs: initial hyperparameters λ, initial state s0
    Outputs: final parameters sT
    Z0 = 0
    for t = 1 to T
        st = Φt(st−1, λ)            // d vector
        At = ∂Φt(st−1, λ)/∂st−1     // d × d matrix
        Bt = ∂Φt(st−1, λ)/∂λ        // d × m matrix
        Zt = AtZt−1 + Bt            // d × m matrix; memory for At, Bt, Zt can be reused!
        if t ≡ 0 (mod ∆)
            λ = λ − η∇E(st)Zt
    return sT
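A sketch of RTHO on the same toy problem as above; the hyper-learning rate hlr, the update interval ∆ = 10, and the clipping that keeps η in a sane range are illustrative choices, not prescribed by the algorithm:

```python
# RTHO sketch: eta is adapted online every Delta steps while training runs.
import numpy as np

rng = np.random.default_rng(0)
d, T, Delta, hlr = 4, 500, 10, 1e-3          # hlr: hyper-learning rate (assumed)
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)
c, y = rng.normal(size=d), rng.normal(size=d)
grad_J = lambda w: H @ w - c

w, Z, eta = np.zeros(d), np.zeros(d), 0.01
for t in range(1, T + 1):
    A, B = np.eye(d) - eta * H, -grad_J(w)
    Z = A @ Z + B                            # partial hypergradient state
    w = w - eta * grad_J(w)
    if t % Delta == 0:
        # descend the partial hypergradient grad E(s_t) Z_t;
        # clipping keeps the toy dynamics stable (illustrative)
        eta = float(np.clip(eta - hlr * ((w - y) @ Z), 0.0, 0.2))

print("final eta =", eta)
```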
Analysis
• The two approaches have different time/space tradeoffs
• Reverse mode needs to store the whole history of parameter updates — (Maclaurin et al. 2015) proposed to "invert" the update dynamics and recompute the trace rather than storing it in memory
• Forward mode does not scale well with the number of hyperparameters
Results from algorithmic differentiation (AD)
• Let $F:\mathbb{R}^n\to\mathbb{R}^p$ be any differentiable function
• Let c(n, p) and s(n, p) be the time and space needed to evaluate F
• Let $J_F$ be the $p\times n$ Jacobian matrix of F

General results (Baydin et al. 2015; Griewank and Walther 2008):

(i) For any $r\in\mathbb{R}^n$, the product $J_F\,r$ can be evaluated in time O(c(n, p)) and space O(s(n, p)) using forward-mode AD — hence the whole $J_F$ can be computed in time O(n·c(n, p)) and space O(s(n, p))

(ii) For any $q\in\mathbb{R}^p$, the product $J_F^\intercal q$ can be evaluated in both time and space O(c(n, p)) using reverse-mode AD — hence $J_F$ can be computed in time O(p·c(n, p)) and space O(c(n, p))
Analysis of hypergradient computation (1)
Cost of evaluating the update map Φt: time g(d, m)¹, space h(d, m)²

Then the response function f(λ): ℝ^m → ℝ can be evaluated in time O(T·g(d, m)) and space O(h(d, m))

Notes:
1. assuming the time required to compute the validation error does not affect the bound (realistic, since the number of validation examples is typically lower than the number of training iterations)
2. since the variables st may be overwritten at each iteration
Analysis of Forward-HG
• Apply fact (i) from AD: Forward-HG takes time O(T·m·g(d, m)) and space O(h(d, m))
• The result can also be obtained by noting that the product AtZt−1 requires m Jacobian-vector products, each costing O(g(d, m)), while computing the Jacobian Bt takes time O(m·g(d, m))
Analysis of Reverse-HG
• Apply fact (ii) from AD: Reverse-HG takes both time and space O(T·g(d, m))
• The result can also be obtained by noting that αt+1At+1 and αtBt are transposed-Jacobian-vector products, each taking time O(g(d, m)) in reverse mode
• Note that in this case the variables st cannot be overwritten, which explains the much higher space requirement
Example
• Neural network with k weights trained by SGD or Adam
• Hyperparameters: just the learning rate and momentum terms
• In this case, d = O(k) and m = O(1)
• Moreover, g(d, m) and h(d, m) are both O(k)
• Hence, Reverse-HG takes time and space O(Tk), while Forward-HG takes time O(Tk) and space O(k)
• In this case there is a dramatic difference in terms of memory requirements
Empirically
[Figure: running time in seconds (0–140) vs. number of hyperparameters (5–20) for Forward-HG and Reverse-HG (left); memory usage in MB (0–5000) vs. number of weights (200,000–600,000) (right).]
Data hyper-cleaning: Setting
• Noisy labels, but we can only afford to check a subset of them
• Train on the noisy data D, with the cleaned data C as validation
• One hyperparameter for each training example:

$$J(\lambda,w)=\frac{1}{n}\sum_{i=1}^{n}\lambda^{(i)}\,\ell\big(y^{(i)},g(x^{(i)};w)\big)$$

HO problem:

$$\min_{\lambda}\sum_{(x,y)\in C}\ell\big(y,g(x;w)\big)\quad\text{s.t.}\quad w=\operatorname*{argmin}_{w} J(\lambda,w),\qquad \lambda^{(i)}\in[0,1],\qquad \|\lambda\|_1\le R$$
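A small illustrative sketch of the two ingredients (not the experiment's code): the per-example weighted loss, and one cheap way to keep λ feasible after a hypergradient step. The rescaling below is a heuristic stand-in, not the exact projection onto the constraint set:

```python
# Per-example weights lambda as hyperparameters, with box and L1 constraints.
import numpy as np

def weighted_loss(lam, per_example_losses):
    # J(lambda, w) = (1/n) sum_i lambda_i * l(y_i, g(x_i; w))
    return np.mean(lam * per_example_losses)

def make_feasible(lam, R):
    # clip to [0, 1], then rescale into the L1 ball of radius R
    # (a cheap heuristic, not the exact Euclidean projection)
    lam = np.clip(lam, 0.0, 1.0)
    s = lam.sum()
    return lam if s <= R else lam * (R / s)

rng = np.random.default_rng(0)
lam = make_feasible(2.0 * rng.uniform(size=8), R=3.0)
print(lam, weighted_loss(lam, rng.uniform(size=8)))
```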
Data hyper-cleaning: experimental setup
• MNIST digits: 5000 validation (cleaned) examples, 5000 training examples (50% corruption rate), 10000 test examples
• g(x) = softmax(wx), ℓ the cross-entropy loss
• Reverse-HG to compute hypergradients, Adam to optimize the hyperparameters
Data hyper-cleaning: performance measures
• Oracle: test accuracy after fitting w on the validation set plus the cleaned portion of the training set
• Baseline: test accuracy after fitting w on the validation and (noisy) training sets
• DH-R: test accuracy of the hyper-cleaner for a given L1 radius R (fit w on the validation set plus the training examples with λ(i) > 0)
Data hyper-cleaning: Results
[Figure: accuracy and sparsity of λ over 500 hyper-iterations: validation and test accuracy (roughly 80–92%) and the number of discarded examples (up to ~3500), split into true positives (TP) and false positives (FP).]
Multi-task learning: setup
• Goal: tune the hyperparameters λ = (C, ρ) of a multi-task regularizer (Evgeniou et al. 2005)

$$\Omega(w,\lambda)=\sum_{j=1}^{K}\sum_{k=1}^{K}C_{j,k}\,\|w_j-w_k\|^2+\rho\sum_{k=1}^{K}\|w_k\|^2$$

where $w_k$ are the parameters for task k and K is the number of tasks
• C is a symmetric non-negative matrix and ρ > 0
• Training objective:

$$J(w,\lambda)=\sum_{(x,y)\in T}\ell\big(g(x,w),y\big)+\Omega(w,\lambda)$$

• As before, the classifier g is a (linear) softmax regressor and ℓ the cross-entropy loss (a sketch of Ω follows below)
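A direct NumPy transcription of Ω (illustrative; W stacks the per-task weight vectors wk as rows):

```python
# Multi-task regularizer Omega(w, lambda) with lambda = (C, rho).
import numpy as np

def omega(W, C, rho):
    # sum_{j,k} C[j,k] ||w_j - w_k||^2 + rho * sum_k ||w_k||^2
    K = W.shape[0]
    pair = sum(C[j, k] * np.sum((W[j] - W[k]) ** 2)
               for j in range(K) for k in range(K))
    return pair + rho * np.sum(W ** 2)

W = np.random.default_rng(0).normal(size=(3, 5))   # K = 3 tasks, 5 features
print(omega(W, C=0.1 * np.ones((3, 3)), rho=0.5))
```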
Multi-task learning: setup
• Datasets: CIFAR-10 and CIFAR-100
• Features: from an Inception-V3 model trained on ImageNet (Szegedy et al. 2015)
• Few-shot learning setup:
  – CIFAR-10: 50 training examples (5 per class), 50 validation examples
  – CIFAR-100: 300 training examples (3 per class), 300 validation examples
Multi-task learning: setup
Outer objective:

$$\min_{\lambda}\sum_{(x,y)\in V}\ell\big(g(x,w_T),y\big)\quad\text{s.t.}\quad \rho\ge 0,\qquad C_{j,k}\ge 0,\qquad C=C^\intercal$$

where $w_T$ are the parameters at the T-th gradient descent iteration on the inner objective.

We used Reverse-HG to compute the hypergradients and Adam for the hyper-optimization.
Multi-task learning: variants
• STL: single-task learning, i.e. C = 0, applying HO to ρ
• NMTL: naive MTL scenario where all $C_{j,k}=a$, applying HO to a and ρ
• HMTL: Reverse-HG for tuning both C and ρ
• HMTL-S: additional constraint $\sum_{j,k}C_{j,k}\le R$ to prevent spurious task interactions due to the few-shot learning setting
Multi-task learning: results
                                     CIFAR-10       CIFAR-100
STL                                  67.47 ± 2.78   18.99 ± 1.12
NMTL                                 69.41 ± 1.90   19.19 ± 0.75
HMTL                                 70.85 ± 1.87   21.15 ± 0.36
HMTL-S                               71.62 ± 1.34   22.09 ± 0.29
(Dinuzzo et al. 2011)                69.96 ± 1.85   —
(Jawanpuria et al. 2015) (p = 2)     70.30 ± 1.05   —
(Jawanpuria et al. 2015) (p = 4/3)   70.96 ± 1.04   —
Phone classification: Dataset
• TIMIT phonetic recognition dataset (Garofolo et al. 1993)
• 5040 sentences, 1.5 million 25-ms speech acoustic frames
• 73% train, 23% validation, 4% test
• 123-dimensional feature vector per frame (40 Mel cepstral coefficients + energy, with their deltas and delta-deltas)
• Window of 11 frames around the target (1353-dimensional input vectors)
• 183 classes (HMM monophone states)
Phone classification: Multi-task setting
• Rationale for MTL: domain-specific information from related tasks is used as an inductive bias for the primary task
• Primary task: phone recognition
• Secondary task: phonetic context embedding vectors (300-dimensional) of triphones, proposed in (Badino 2016)
Phone classification: Network
The network is simple but not tiny (about 16 million weights)
Phone classification: Optimization problem
• Hyperparameters: learning rate η, momentum term µ, importance ρ of the secondary task

Outer objective:

$$\min_{\rho,\eta,\mu} E(w_T,w_{p,T})\quad\text{s.t.}\quad \rho,\eta\ge 0,\qquad 0\le\mu\le 1$$

where the inner objective is

$$J(w,w_p,w_s)=J_p(w,w_p)+\rho\,J_s(w,w_s)$$
Phone classification
• There are more than 10⁷ parameters: reverse mode is not feasible (because of memory)
• Forward mode, on the other hand, is very time-consuming
• RTHO is effective and fast
Phone classification: Results
Frame-level phone-state classification accuracy on the standard TIMIT test set, and execution time in minutes on one Titan X GPU. For random search, we set a time budget of 300 minutes.

                                          Accuracy %   Time (min)
No aux. task, η, µ as in (Badino 2016)    59.81        12
Random search                             60.36        300
RTHO                                      61.97        164
RTHO with null teacher (all HP = 0)       61.38        289
Phone classification: Results (poster)

The closing slide reproduces the poster "Forward and Reverse Gradient-based Hyperparameter Optimization" (Luca Franceschi, Michele Donini, Paolo Frasconi, Massimiliano Pontil; IIT, UCL, Università degli Studi di Firenze), which summarizes the material above. Content unique to the poster:

• [Figure: RTHO run on TIMIT: training and validation accuracy curves, validation error, and the trajectories of η, µ and ρ over roughly 250 hyper-iterations.]
• Experiment: CNN (not in the paper). Real-time hyperparameter optimization of a small convolutional neural network trained on MNIST, tuning the learning rate η and the L2 regularization ρ of the fully-connected layer weights; RTHO decreases test classification error by around 25% over the baseline (99.34% validation, 99.40% test accuracy).
• Future directions: validate RTHO empirically and study its convergence properties; improve the reliability (adaptiveness) of gradient-based HO methods.
• Code at: https://github.com/lucfra/RFHO
Perspectives (1)
• Need better theory to explain RTHO (e.g. convergence rate)
• The stochastic or real-time HO approach can also be applied in the reverse-mode case by truncating hypergradient propagation (similar to truncated BPTT) — encouraging results in (Grazzi 2017)
• We also lack a statistical theory for HO: can many hyperparameters overfit the validation set? Can we establish bounds?
Perspectives (2)
Many recent works on meta-learning or learning-to-optimize can be formulated within a framework that is compatible with HO.

For example, meta-learning can be seen as a bilevel program

$$\min_{\zeta} E(\zeta,\theta)\quad\text{s.t.}\quad \theta\in\operatorname*{argmin}_{\theta} J(\zeta,\theta)$$

where
• E is the test error in the meta-training episodes
• J is the training error in the meta-training episodes
• ζ are (hyper)parameters that index a class of hypothesis spaces
• θ are parameters used to fit the meta-training episodes
Perspectives (2)
By tuning ζ we select a particular hypothesis space that is hopefully well suited to novel (meta-test) learning episodes.

Preliminary results on MiniImagenet are comparable to or better than those reported in (Ravi & Larochelle 2016) for 1-shot learning.
Thank You!
Code available: https://github.com/lucfra/RFHO
References
Badino, Leonardo (2016). “Phonetic Context Embeddings for DNN-HMM Phone Recognition”. In:Proceedings of Interspeech, pp. 405–409.
Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind (2015).“Automatic differentiation in machine learning: a survey”. In: arXiv preprint arXiv:1502.05767. url:https://arxiv.org/abs/1502.05767.
Bengio, Yoshua (2000). “Gradient-based optimization of hyperparameters”. In: Neural computation 12.8,pp. 1889–1900. url:http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015187.
Bergstra, James S., Rémi Bardenet, Yoshua Bengio, and Balázs Kégl (2011). “Algorithms forhyper-parameter optimization”. In: Advances in Neural Information Processing Systems, pp. 2546–2554. url:http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.
Bergstra, James and Yoshua Bengio (2012). “Random search for hyper-parameter optimization”. In:Journal of Machine Learning Research 13.Feb, pp. 281–305. url:http://www.jmlr.org/papers/v13/bergstra12a.html.
Bergstra, James, Daniel Yamins, and David D. Cox (2013). “Making a Science of Model Search:Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.”. In: ICML (1) 28,pp. 115–123. url: http://www.jmlr.org/proceedings/papers/v28/bergstra13.pdf.
Dinuzzo, Francesco, Cheng S Ong, Gianluigi Pillonetto, and Peter V Gehler (2011). “Learning outputkernels with block coordinate descent”. In: ICML, pp. 49–56.
Evgeniou, Theodoros, Charles A Micchelli, and Massimiliano Pontil (2005). “Learning multiple tasks withkernel methods”. In: J. Mach. Learn. Res. 6, pp. 615–637.
Garofolo, John S., Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, and David S. Pallett (1993).“DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1”. In: NASASTI/Recon technical report 93.
Griewank, Andreas and Andrea Walther (Jan. 2008). Evaluating Derivatives: Principles and Techniques ofAlgorithmic Differentiation, Second Edition. en. Second. Society for Industrial and Applied Mathematics.isbn: 978-0-89871-659-7 978-0-89871-776-1. url:http://epubs.siam.org/doi/book/10.1137/1.9780898717761.
Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown (2011). “Sequential model-based optimizationfor general algorithm configuration”. In: International Conference on Learning and Intelligent Optimization.Springer, pp. 507–523. url: http://link.springer.com/10.1007%2F978-3-642-25566-3_40.
Jawanpuria, Pratik, Maksim Lapin, Matthias Hein, and Bernt Schiele (2015). “Efficient Output KernelLearning for Multiple Tasks”. In: Advances in Neural Information Processing Systems, pp. 1189–1197.
Larsen, Jan, Lars Kai Hansen, Claus Svarer, and M. Ohlsson (1996). “Design and regularization of neuralnetworks: the optimal use of a validation set”. In: Neural Networks for Signal Processing [1996] VI.Proceedings of the 1996 IEEE Signal Processing Society Workshop. IEEE, pp. 62–71.
LeCun, Yann (1988). “A Theoretical Framework for Back-Propagation”. In: Proc. of the 1988 Connectionistmodels summer school. Ed. by Geoffrey Hinton and Terrence Sejnowski. Morgan Kaufmann, pp. 21–28.
Maclaurin, Dougal, David Duvenaud, and Ryan P. Adams (2015). “Gradient-based hyperparameteroptimization through reversible learning”. In: Proceedings of the 32nd International Conference on MachineLearning. url: http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.
Pedregosa, Fabian (2016). “Hyperparameter optimization with approximate gradient”. In: arXiv preprintarXiv:1602.02355. url: http://www.jmlr.org/proceedings/papers/v48/pedregosa16.pdf.
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams (2012). “Practical bayesian optimization of machinelearning algorithms”. In: Advances in neural information processing systems, pp. 2951–2959.
Thornton, Chris, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown (2011). “Auto-WEKA:Combined selection and hyperparameter optimization of classification algorithms”. In: pp. 847–855.
Werbos, Paul J. (1982). “Applications of advances in nonlinear sensitivity analysis”. In: System modeling andoptimization. Springer, pp. 762–770.