
Gradient-based Hyperparameter Optimization

Paolo Frasconi, Università degli Studi di Firenze, Italy
http://ai.dinfo.unifi.it/paolo/

Joint work with Luca Franceschi (IIT and UCL), Michele Donini (IIT), Massimiliano Pontil (IIT and UCL)

EMMCVPR 2017 — Venezia, November 1st, 2017

Hyperparameter Optimization

Most machine learning algorithms depend on the values of some variables that must be decided before learning starts

At least three kinds of hyperparameters:

Regularization (e.g. amount of L2 or L1 penalty, dropout, multitask, etc.)

Hypothesis space (e.g. variables in the kernel, layers in neural nets, etc.)

Optimization (e.g. learning rate, momentum, etc.)

HO: tune hyperparameters automatically


Some approaches to HO

Grid search (trivial): only practical for 1–2 hyperparameters

Random search: better than grid search (J. Bergstra and Bengio 2012) — 32 hyperparameters

Bayesian approaches (J. Bergstra, Yamins, et al. 2013) (Hyperopt) — 238 hyperparameters

Spearmint (Snoek et al. 2012) — 288 hyperparameters

Sequential Model-Based (SMBO, SMAC) (Hutter et al. 2011)

Tree-structured Parzen Estimator (J. S. Bergstra et al. 2011; Thornton et al. 2011)


Gradient-based HO

Early works were limited to a few hyperparameters (Bengio 2000; Larsen et al. 1996)

More recent works can handle one thousand hyperparameters (Maclaurin et al. 2015; Pedregosa 2016)


Simple example: Ridge regression

Prediction function: g : ℝ^d → ℝ, g(x; w) = w⊺x

Learning problem:

    J(w, λ) := Σ_{(x,y) ∈ T} [ y − g(x; w) ]² + λ ∥w∥²

Closed-form solution:

    w(λ) = argmin_w J(w, λ) = (X⊺X + λI)⁻¹ X⊺Y

Response function:

    f(λ) := Σ_{(x,y) ∈ V} [ y − g(x; w(λ)) ]²
          = Σ_{(x,y) ∈ V} [ y − ((X⊺X + λI)⁻¹ X⊺Y)⊺ x ]²
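As a quick illustration of the response function, a minimal NumPy sketch (our example, not from the slides; the data is random and purely illustrative):

```python
import numpy as np

def ridge_weights(X, Y, lam):
    """Closed-form ridge solution w(lam) = (X'X + lam*I)^{-1} X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def response(lam, X_tr, Y_tr, X_val, Y_val):
    """Response function f(lam): validation error of the ridge solution."""
    w = ridge_weights(X_tr, Y_tr, lam)
    return np.sum((Y_val - X_val @ w) ** 2)

# Illustrative usage with random data
rng = np.random.default_rng(0)
X_tr, Y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, Y_val = rng.normal(size=(20, 5)), rng.normal(size=20)
print([round(response(lam, X_tr, Y_tr, X_val, Y_val), 3) for lam in (0.1, 1.0, 10.0)])
```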

Response function


HO as a bilevel program

Two sets of variables:

outer variables: hyperparameters λ

inner variables: parameters w

Optimize the outer problem subject to an optimum of the inner problem:

    min_λ  f(λ, w)
    s.t.   w ∈ argmin_w J(λ, w)

In HO, the outer problem is the validation loss and the inner problem is the training objective


HO as a bilevel program

Moore, G., Bergeron, C., & Bennett, K. P. (2011). Model selection for primal SVM. Machine Learning, 85(1–2), 175–208.


Learning dynamics

We assume that the objectives are differentiable

However, in general there is no closed-form solution

In fact, the objectives may be non-convex

Thus we introduce learning dynamics encompassing those of stochastic gradient descent algorithms such as Nesterov momentum, Adam, RMSProp, etc.


Example: SGD with momentum on neural networks

Dynamical system:

    v_t = µ v_{t−1} − η ∇J_t(w_{t−1})
    w_t = w_{t−1} + v_t

w_t are the weights, v_t the velocities

µ and η are optimization hyperparameters

J_t is the lower (training) objective for the t-th minibatch

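As a concrete instance, a minimal Python sketch of one step of this dynamics (our illustration; the function name and the toy objective are made up). The pair (w, v) plays the role of the state s_t introduced on the next slide:

```python
import numpy as np

def sgd_momentum_step(w, v, grad_J, mu, eta):
    """One step of the dynamics: v_t = mu*v_{t-1} - eta*grad_J(w_{t-1}),
    w_t = w_{t-1} + v_t.  The pair (w, v) is the state of the dynamics."""
    v = mu * v - eta * grad_J(w)
    w = w + v
    return w, v

# Illustrative usage on a toy quadratic objective J(w) = 0.5 * ||w||^2
w, v = np.ones(3), np.zeros(3)
for _ in range(5):
    w, v = sgd_momentum_step(w, v, lambda w: w, mu=0.9, eta=0.1)
print(w, v)
```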

Learning dynamics in general

    s_t = Φ_t(s_{t−1}, λ),   t = 1, …, T

The state s_t contains parameters and accessory variables (e.g. velocities)

Φ_t : ℝ^d × ℝ^m → ℝ^d is a smooth mapping representing the operation performed by the t-th step of the optimization algorithm (on minibatch t)

The iterates s_1, …, s_T depend on the hyperparameters λ both explicitly and implicitly

[Diagram: chain of updates across Minibatch 1, Minibatch 2, …, Minibatch T]


Hyperparameter Optimization

[Diagram: chain of updates across Minibatch 1, …, Minibatch T, followed by evaluation on the validation set]

Change the bilevel program to use the parameters at the last iterate s_T rather than w:

    min_λ f(λ)

where f : ℝ^m → ℝ is the response function, redefined as

    f(λ) = E(s_T(λ))

Hypergradient:

    ∇f(λ) = ∇E(s_T) · ds_T/dλ


Hyperparameter Optimization

[Diagram: chain of updates across Minibatch 1, …, Minibatch T, followed by evaluation on the validation set]

Similar to a recurrent neural network but:

Minibatches are like inputs to the RNN

The state of the RNN is like the parameters of the model

Hyperparameters are like the weights of the RNN

The validation error is like the training loss of the RNN

Indeed, (Maclaurin et al. 2015) proposed to use backpropagation (without mentioning BPTT or RNNs)


Algorithmic Differentiation

Most (complex) functions of interest in ML can be computed by composing elementary operations whose derivatives are readily available

Algorithmic differentiation is more effective than alternative ways of computing derivatives, such as:

Numerical differentiation (subject to round-off errors)

Symbolic differentiation (subject to expressions of exploding size)

Backpropagation (Werbos 1982) is perhaps the most widely known AD technique in machine learning


Algorithmic Differentiation

Describe a function y = f(x) via a computation graph (essentially a circuit) where each node contains a value v_i

There are two main approaches to AD:

Forward mode: for each node i and a fixed input x_j, define

    v̇_i := ∂v_i / ∂x_j

v̇_i can be computed from the v̇_k's (k parents of i)

Reverse mode: for each node i in the graph and a fixed output y_j, define

    v̄_i := ∂y_j / ∂v_i

v̄_i can be computed from the v̄_k's (k children of i), but all the v_i must be stored in memory!


Algorithmic Differentiation for RNNs

Not surprisingly, both reverse-mode and forward-mode AD were popular for training RNNs in the late 1980's

Backpropagation through time (see e.g. Werbos 1988, Pearlmutter 1989) is reverse-mode AD

Real-time recurrent learning (see e.g. Mozer 1989, Williams & Zipser 1989) is forward-mode AD


Reverse mode

The HO problem can be reformulated as a constrained optimization problem:

    min_{λ, s_1, …, s_T}  E(s_T)
    s.t.  s_t = Φ_t(s_{t−1}, λ),   t ∈ {1, …, T}

Classical Lagrangian formalism used to derive backprop (LeCun 1988):

    L(s, λ, α) = E(s_T) + Σ_{t=1}^{T} α_t ( Φ_t(s_{t−1}, λ) − s_t ),   α_t ∈ ℝ^d

Constraints on hyperparameters can be specified naturally


Reverse mode

Partial derivatives of the Lagrangian:

    ∂L/∂α_t = Φ_t(s_{t−1}, λ) − s_t,   t ∈ {1, …, T}

    ∂L/∂s_t = α_{t+1} ∂Φ_{t+1}(s_t, λ)/∂s_t − α_t,   t ∈ {1, …, T−1}

The second equation yields the useful recursion:

Let A_{t+1} := ∂Φ_{t+1}(s_t, λ)/∂s_t, a (d × d) matrix; then

    α_t = α_{t+1} A_{t+1}


Reverse mode

The base step for the recursion is derived from

    ∂L/∂s_T = ∇E(s_T) − α_T

Finally, the whole hypergradient is

    ∂L/∂λ = Σ_{t=1}^{T} α_t ∂Φ_t(s_{t−1}, λ)/∂λ = Σ_{t=1}^{T} α_t B_t

where B_t := ∂Φ_t(s_{t−1}, λ)/∂λ is a (d × m) matrix


Reverse mode

Reverse-HG(λ, s_0)

 1  Inputs: current hyperparameters λ, initial state s_0
 2  Outputs: hypergradient at λ
 3  for t = 1 to T
 4      s_t = Φ_t(s_{t−1}, λ)                // d-vector; all iterates must be stored
 5  α_T = ∇E(s_T)
 6  g = 0
 7  for t = T downto 1
 8      B_t = ∂Φ_t(s_{t−1}, λ)/∂λ            // d × m matrix
 9      g = g + α_t B_t                      // m-vector
10      A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1}      // d × d matrix
11      α_{t−1} = α_t A_t                    // d-vector
12  return g

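To make the recursion concrete, here is a minimal NumPy sketch of Reverse-HG (our illustration, not the authors' RFHO code) for plain gradient descent on a ridge training objective with a single hyperparameter λ, where A_t and B_t have closed forms:

```python
import numpy as np

def reverse_hg_ridge(lam, eta, T, X_tr, y_tr, X_val, y_val, w0):
    """Reverse-HG for gradient descent on J(w) = ||Xw - y||^2 + lam*||w||^2,
    with validation error E(w) = ||X_val w - y_val||^2.  Returns (w_T, dE/dlam)."""
    d = X_tr.shape[1]
    # Forward pass: the whole trajectory must be stored (the memory cost of reverse mode).
    ws = [w0]
    for t in range(T):
        w = ws[-1]
        grad_J = 2 * X_tr.T @ (X_tr @ w - y_tr) + 2 * lam * w
        ws.append(w - eta * grad_J)

    # For this dynamics A_t = dPhi/dw is constant and B_t = dPhi/dlam = -2*eta*w_{t-1}.
    A = np.eye(d) - 2 * eta * (X_tr.T @ X_tr + lam * np.eye(d))

    # Backward pass: alpha_T = grad E(w_T), then alpha_{t-1} = alpha_t A_t.
    alpha = 2 * X_val.T @ (X_val @ ws[-1] - y_val)
    g = 0.0
    for t in range(T, 0, -1):
        B_t = -2 * eta * ws[t - 1]        # dPhi_t/dlam
        g += alpha @ B_t                  # accumulate alpha_t B_t
        alpha = alpha @ A                 # alpha_{t-1} = alpha_t A_t
    return ws[-1], g

# Illustrative usage: compare with a central finite-difference estimate.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(40, 5)), rng.normal(size=40)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)

def f(lam):  # response function: validation error after T steps
    wT, _ = reverse_hg_ridge(lam, 1e-3, 200, X_tr, y_tr, X_val, y_val, np.zeros(5))
    return np.sum((X_val @ wT - y_val) ** 2)

print(reverse_hg_ridge(1.0, 1e-3, 200, X_tr, y_tr, X_val, y_val, np.zeros(5))[1])
print((f(1.0 + 1e-4) - f(1.0 - 1e-4)) / 2e-4)   # should be close to the value above
```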

Forward mode

Use chain rule:

∇f(λ) = ∇E(sT )dsTdλ

Plug in the learning dynamics:

dstdλ

=∂Φt(st−1, λ)

∂st−1

dst−1

dλ+

∂Φt(st−1, λ)

∂λ


Forward mode recursion

Use chain rule:

∇f(λ) = ∇E(sT )dsTdλ

Plug in the learning dynamics:

dstdλ︸︷︷︸

Zt(d×m)

=∂Φt(st−1, λ)

∂st−1︸ ︷︷ ︸At(d×d)

dst−1

dλ︸ ︷︷ ︸Zt−1(d×m)

+∂Φt(st−1, λ)

∂λ︸ ︷︷ ︸Bt(d×m)


Forward mode recursion unrolled

∇f(λ) = ∇E(sT )ZT

= ∇E(sT )(ATZT−1 +BT )

= ∇E(sT )(ATAT−1ZT−2 + ATBT−1 +BT )...

= ∇E(sT )

(T∑t=1

(At+1 · · ·AT )Bt

)


Forward mode

Forward-HG(λ, s_0)

 1  Inputs: current hyperparameters λ, initial state s_0
 2  Outputs: hypergradient at λ
 3  Z_0 = 0
 4  for t = 1 to T
 5      s_t = Φ_t(s_{t−1}, λ)                // d-vector
 6      A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1}      // d × d matrix
 7      B_t = ∂Φ_t(s_{t−1}, λ)/∂λ            // d × m matrix
 8      Z_t = A_t Z_{t−1} + B_t              // d × m matrix
 9      // memory for s_t can be reused in this case!
10  return ∇E(s_T) Z_T

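A matching NumPy sketch of Forward-HG on the same toy ridge setting (again our illustration, not the authors' code); note that only the current w and Z are kept in memory:

```python
import numpy as np

def forward_hg_ridge(lam, eta, T, X_tr, y_tr, X_val, y_val, w0):
    """Forward-HG on gradient descent for a ridge objective with one hyperparameter.
    Z_t = dw_t/dlam is propagated alongside w_t, so memory stays O(d), not O(Td)."""
    d = X_tr.shape[1]
    w, Z = w0.copy(), np.zeros(d)                                    # Z_0 = 0
    A = np.eye(d) - 2 * eta * (X_tr.T @ X_tr + lam * np.eye(d))      # dPhi/dw (constant here)
    for t in range(T):
        B = -2 * eta * w                                             # dPhi/dlam at w_{t-1}
        w = w - eta * (2 * X_tr.T @ (X_tr @ w - y_tr) + 2 * lam * w) # w_t
        Z = A @ Z + B                                                # Z_t = A_t Z_{t-1} + B_t
    grad_E = 2 * X_val.T @ (X_val @ w - y_val)                       # gradient of validation error
    return w, grad_E @ Z                                             # hypergradient dE(w_T)/dlam

# Illustrative usage (the same toy data as the Reverse-HG sketch would give the same value)
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(40, 5)), rng.normal(size=40)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)
print(forward_hg_ridge(1.0, 1e-3, 200, X_tr, y_tr, X_val, y_val, np.zeros(5))[1])
```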

Computation graph

[Figure: computation graph of the unrolled learning dynamics]


Real-time HO

For t ∈ {1, …, T} define

    f_t(λ) = E(s_t(λ))

(the previous response function is f_T)

Partial hypergradients are available in forward mode:

    ∇f_t(λ) = dE(s_t)/dλ = ∇E(s_t) Z_t

Significantly, we can update the hyperparameters several times within a single optimization epoch, without having to wait until time T

Similar to RTRL; applicable to data streams (or large datasets)


Real-time HO

RTHO(λ, s_0)

 1  Inputs: initial hyperparameters λ, initial state s_0
 2  Outputs: final parameters s_T
 3  Z_0 = 0
 4  for t = 1 to T
 5      s_t = Φ_t(s_{t−1}, λ)                // d-vector
 6      A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1}      // d × d matrix
 7      B_t = ∂Φ_t(s_{t−1}, λ)/∂λ            // d × m matrix
 8      Z_t = A_t Z_{t−1} + B_t              // d × m matrix
 9      // memory for A_t, B_t, Z_t can be reused!
10      if t ≡ 0 (mod Δ)
11          λ = λ − η ∇E(s_t) Z_t
12  return s_T

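A minimal NumPy sketch of RTHO in the same toy ridge setting (our illustration; the projection of λ onto [0, ∞) and the hyper-learning-rate name are our additions, not from the slides):

```python
import numpy as np

def rtho_ridge(lam, eta, hyper_lr, T, delta, X_tr, y_tr, X_val, y_val, w0):
    """RTHO sketch: run the Forward-HG recursion for Z_t and update lam every
    `delta` steps with the partial hypergradient grad_E(w_t) @ Z_t."""
    d = X_tr.shape[1]
    w, Z = w0.copy(), np.zeros(d)                                     # Z_0 = 0
    for t in range(1, T + 1):
        A = np.eye(d) - 2 * eta * (X_tr.T @ X_tr + lam * np.eye(d))   # dPhi/dw (depends on lam)
        B = -2 * eta * w                                              # dPhi/dlam at w_{t-1}
        w = w - eta * (2 * X_tr.T @ (X_tr @ w - y_tr) + 2 * lam * w)  # s_t
        Z = A @ Z + B                                                 # Z_t
        if t % delta == 0:
            grad_E = 2 * X_val.T @ (X_val @ w - y_val)
            lam = max(0.0, lam - hyper_lr * (grad_E @ Z))             # hyper-update, kept >= 0
    return w, lam
```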

Analysis

The two approaches have different time/space tradeoffs

Reverse mode needs to store the whole history of parameter updates — (Maclaurin et al. 2015) proposed to "invert" the update dynamics and recompute the trace rather than storing it in memory

Forward mode does not scale well with the number of hyperparameters


Results from algorithmic differentiation (AD)

Let F : ℝ^n → ℝ^p be any differentiable function

Let c(n, p) and s(n, p) be the time and space needed to evaluate F

Also let J_F be the (p × n) Jacobian matrix of F

General results (Baydin et al. 2015; Griewank and Walther 2008):

(i) For any r ∈ ℝ^n, the product J_F r can be evaluated in time O(c(n, p)) and space O(s(n, p)) using forward-mode AD — hence the whole J_F can be computed in time O(n c(n, p)) and space O(s(n, p))

(ii) For any vector q ∈ ℝ^p, the product J_F⊺ q can be evaluated in both time and space O(c(n, p)) using reverse-mode AD — hence J_F can be computed in time O(p c(n, p)) and space O(c(n, p))

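A small NumPy illustration of facts (i) and (ii) for a toy function F(x) = sin(Wx): both the Jacobian-vector product and the transposed-Jacobian-vector product can be formed without materializing J_F (our example, not from the slides):

```python
import numpy as np

# Toy function F(x) = sin(W x) : R^n -> R^p with Jacobian J_F = diag(cos(Wx)) W.
rng = np.random.default_rng(0)
n, p = 4, 3
W = rng.normal(size=(p, n))
x = rng.normal(size=n)

def jvp(r):
    """Forward mode: J_F r via one extra 'tangent' pass, no Jacobian stored."""
    return np.cos(W @ x) * (W @ r)

def vjp(q):
    """Reverse mode: J_F^T q via one backward pass, no Jacobian stored."""
    return W.T @ (np.cos(W @ x) * q)

# Sanity check against the explicitly formed Jacobian.
J = np.cos(W @ x)[:, None] * W
r, q = rng.normal(size=n), rng.normal(size=p)
print(np.allclose(jvp(r), J @ r), np.allclose(vjp(q), J.T @ q))
```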

Analysis of hypergradient computation (1)

Cost to evaluate the update map Φ_t: time g(d, m) [1], space h(d, m) [2]

Then the response function f(λ) : ℝ^m → ℝ can be evaluated in time O(T g(d, m)) and space O(h(d, m))

Notes:
[1] Assuming the time required to compute the validation error does not affect the bound (realistic, since the number of validation examples is typically lower than the number of training iterations).
[2] Since the variables s_t may be overwritten at each iteration.


Analysis of Forward-HG

Apply Fact (i) from AD: Forward-HG takes time O(T m g(d, m)) and space O(h(d, m))

The result can also be obtained by noting that the product A_t Z_{t−1} requires m Jacobian-vector products, each costing O(g(d, m)), while computing the Jacobian B_t takes time O(m g(d, m))


Analysis of Reverse-HG

Apply Fact (ii) from AD: Reverse-HG takes both time and space O(T g(d, m))

The result can also be obtained by noting that α_{t+1} A_{t+1} and α_t B_t are transposed-Jacobian-vector products, each taking time O(g(d, m)) in reverse mode

Note that in this case the variables s_t cannot be overwritten, explaining the much higher space requirement


Example

Neural network with k weights trained by SGD or Adam

Hyperparameters are just the learning rate and momentum terms

In this case, d = O(k) and m = O(1)

Moreover, g(d, m) and h(d, m) are both O(k)

Hence, Reverse-HG takes time and space O(Tk), while Forward-HG takes time O(Tk) and space O(k)

In this case there is a dramatic difference in terms of memory requirements


Empirically

[Figure, left: running time (s) vs. number of hyperparameters (5–20) for Forward-HG and Reverse-HG. Right: memory usage (MB) vs. number of weights (up to ~600,000)]


Data hyper-cleaning: Setting

Noisy labels, but we can only afford to check a subset of them

Train on the noisy data D, using the cleaned data C as validation

One hyperparameter for each training example:

    J(λ, w) = (1/n) Σ_{i=1}^{n} λ^(i) ℓ( y^(i), g(x^(i); w) )

HO problem:

    min_λ  Σ_{(x,y) ∈ C} ℓ( y, g(x; w) )
    s.t.   w ∈ argmin_w J(λ, w)
           λ^(i) ∈ [0, 1]
           ∥λ∥₁ ≤ R

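For concreteness, a short NumPy sketch of the weighted inner objective J(λ, w) used by the hyper-cleaner (our illustration; the array shapes are assumptions):

```python
import numpy as np

def weighted_softmax_loss(W, X, Y, lam):
    """Inner objective of the hyper-cleaner: mean of per-example cross-entropy
    losses, each scaled by its own hyperparameter lam[i] in [0, 1].
    Shapes (assumed): W (d, K), X (n, d), Y (n,) integer labels, lam (n,)."""
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(Y)), Y]                       # per-example cross-entropy
    return np.mean(lam * ce)
```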

Data hyper-cleaning: experimental setup

MNIST digits: 5000 validation (cleaned) examples, 5000 training examples (50% corruption rate), 10000 test examples

g(x) = softmax(wx), ℓ the cross-entropy loss

Reverse-HG to compute hypergradients, Adam to optimize the hyperparameters


Data hyper-cleaning: performance measures

Oracle: test accuracy after fitting w on the validation set plus the cleaned portion of the training set

Baseline: test accuracy after fitting w on the validation and (noisy) training sets

DH-R: test accuracy of the hyper-cleaner for a given L1 radius R (fit w on the validation set plus the training examples having λ^(i) > 0)



Data hyper-cleaning: Results

[Figure: accuracy and sparsity of λ vs. hyper-iterations (0–500). Left axis: number of discarded examples (TP and FP curves); right axis: validation and test accuracy (80–92%)]


Multi-task learning: setup

Goal is to tune the hyperparameters λ = (C, ρ) of a multi-task regularizer (Evgeniou et al. 2005):

    Ω(w, λ) = Σ_{j=1}^{K} Σ_{k=1}^{K} c_{j,k} ∥w_j − w_k∥² + ρ Σ_{k=1}^{K} ∥w_k∥²

where w_k are the parameters for task k and K is the number of tasks

C is a symmetric non-negative matrix and ρ > 0

Training objective:

    J(w, λ) = Σ_{(x,y) ∈ T} ℓ( g(x, w), y ) + Ω(w, λ)

As before, the classifier g is a (linear) softmax regressor and ℓ the cross-entropy loss

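A short NumPy sketch of the regularizer Ω(w, λ) (our illustration; the stacked-weight layout is an assumption):

```python
import numpy as np

def mtl_regularizer(W, C, rho):
    """Omega(w, lambda): pairwise coupling of task weight vectors plus an L2 term.
    Shapes (assumed): W (K, d) stacked task weights, C (K, K) symmetric
    non-negative interaction matrix, rho > 0."""
    diffs = W[:, None, :] - W[None, :, :]                # (K, K, d) pairwise w_j - w_k
    pair_term = np.sum(C * np.sum(diffs ** 2, axis=2))   # sum_{j,k} c_{j,k} ||w_j - w_k||^2
    return pair_term + rho * np.sum(W ** 2)              # + rho * sum_k ||w_k||^2
```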

Multi-task learning: setup

Datasets: CIFAR-10 and CIFAR-100

Features: Inception-V3 model trained on ImageNet (Szegedy et al. 2015)

Few-shot learning setup:

CIFAR-10: 50 training examples (5 per class), 50 validation examples

CIFAR-100: 300 training examples (3 per class), 300 validation examples


Multi-task learning: setup

Outer objective:

    min_λ  Σ_{(x,y) ∈ V} ℓ( g(x, w_T), y )
    s.t.   ρ ≥ 0
           c_{j,k} ≥ 0
           C = C⊺

where w_T are the parameters at the T-th gradient descent iteration of the inner objective

We used Reverse-HG to compute hypergradients and Adam for hyper-optimization


Multi-task learning: variants

STL: single-task learning, i.e. C = 0, applying HO to ρ

NMTL: naive MTL scenario where all c_{j,k} = a, applying HO to a and ρ

HMTL: Reverse-HG for tuning both C and ρ

HMTL-S: additional constraint Σ_{j,k} c_{j,k} ≤ R to prevent spurious task interactions due to the few-shot learning setting


Multi-task learning: results

                                     CIFAR-10       CIFAR-100
STL                                  67.47 ± 2.78   18.99 ± 1.12
NMTL                                 69.41 ± 1.90   19.19 ± 0.75
HMTL                                 70.85 ± 1.87   21.15 ± 0.36
HMTL-S                               71.62 ± 1.34   22.09 ± 0.29
(Dinuzzo et al. 2011)                69.96 ± 1.85   –
(Jawanpuria et al. 2015) (p = 2)     70.30 ± 1.05   –
(Jawanpuria et al. 2015) (p = 4/3)   70.96 ± 1.04   –


Phone classification: Dataset

TIMIT phonetic recognition dataset (Garofolo et al. 1993)

5040 sentences, 1.5 million 25ms speech acoustic frames

73% train, 23% validation, 4% test

123-dimensional feature vector per frame (40 Mel cepstral coefficients + energy, with their delta and delta-delta)

Window of 11 frames around the target (1353-dimensional input vectors)

183 classes (HMM monophone states)


Phone classification: Multi-task setting

Rationale for MTL: domain-specific information from related tasks is used as an inductive bias for the primary task

Primary task: phone recognition

Secondary task: phonetic context embedding vectors (300-dimensional) of triphones, proposed in (Badino 2016)


Phone classification: Network

The network is simple but not tiny (about 16 million weights)


Phone classification: Optimization problem

Hyperparameters: learning rate η, momentum term µ, importance of the secondary task ρ

Outer objective:

    min_{ρ,η,µ}  E(w_T, w_{p,T})
    s.t.  ρ, η ≥ 0
          0 ≤ µ ≤ 1

where the inner objective is

    J(w, w_p, w_s) = J_p(w, w_p) + ρ J_s(w, w_s)


Phone classification

There are more than 10⁷ parameters, so reverse mode is not possible (because of memory)

Forward mode, on the other hand, is very time-consuming

RTHO is effective and fast


Phone classification: Results

Frame-level phone-state classification accuracy on the standard TIMIT test set and execution time in minutes on one Titan X GPU

For random search, we set a time budget of 300 minutes

                                          Accuracy %   Time (min)
No aux. task, η, µ as in (Badino 2016)    59.81        12
Random search                             60.36        300
RTHO                                      61.97        164
RTHO with null teacher (all HP = 0)       61.38        289


Phone classification: Results

Forward and Reverse Gradient-based Hyperparameter Optimization
Luca Franceschi (1,2), Michele Donini (1), Paolo Frasconi (3), Massimiliano Pontil (1,2)
(1) Istituto Italiano di Tecnologia, IT; (2) University College London, UK; (3) Università degli Studi di Firenze, IT

Objectives & Contributions

In the context of gradient-based hyperparameter optimization, we study two procedures, Reverse-HG and Forward-HG, for computing the gradient of a validation error E with respect to real-valued hyperparameters of any differentiable iterative learning algorithm. We also present a novel real-time HO algorithm, based on forward computation of hyper-gradients, which is able to find good values of critical hyperparameters at a reasonable cost. We conduct a series of experiments in different settings to empirically validate the proposed algorithms.

Aims

• Increasing model training automation and reducing hardware requirements;
• Achieving better generalization performance;
• Allowing a "freer" model design.

Difficulties

• Computational complexity (the model must be optimized several times);
• Reliability (HO methods usually have several data-dependent hyperparameters themselves);
• Complexity of the search space (continuous, integer and conditional hyperparameters).

Current Approaches

• Manual/Grid search, Random search
• Model-based/Bayesian Optimization
• Gradient-based Optimization: Bengio, Domke, Maclaurin, Pedregosa.

Problem Setting

Example (stochastic gradient descent with momentum)
• Weights + velocity, i.e. the state: (w, v) = s ∈ ℝ^d
• Training error: E^train(w, ρ)
• Hyperparameters: λ = (η, µ, ρ)
• Training algorithm:

    s_t = (w_t, v_t) = ( w_{t−1} − η(µ v_{t−1} + ∇E^train_t(w_{t−1}, ρ)),  µ v_{t−1} + ∇E^train_t(w_{t−1}, ρ) ) = Φ_t(s_{t−1}, λ)

GOAL: optimize λ according to a certain error function E^val evaluated at the last iterate s_T. The problem is

    min_{λ ∈ Λ} f(λ)

where the set Λ describes constraints on λ, and the response function f : ℝ^m → ℝ_+ is defined at λ ∈ ℝ^m as

    f(λ) = E^val(s_T(λ)).

Iteratively minimize f ⇒ compute ∇f(λ)

Constrained optimization problem for hyperparameter optimization:

    min_{λ, s_1, …, s_T}  E(s_T)
    subject to  s_t = Φ_t(s_{t−1}, λ),   t ∈ {1, …, T}.

The Lagrangian is

    L(s, λ, α) = E(s_T) + Σ_{t=1}^{T} α_t ( Φ_t(s_{t−1}, λ) − s_t )

where α_t ∈ ℝ^d are row vectors of Lagrange multipliers. Defining the matrices

    A_t = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1} ∈ ℝ^{d×d},   B_t = ∂Φ_t(s_{t−1}, λ)/∂λ ∈ ℝ^{d×m},

from the optimality condition ∇_s L = 0 we obtain

    α_t = ∇E^val(s_T)       if t = T,
    α_t = α_{t+1} A_{t+1}    if 0 ≤ t ≤ T − 1.

The hypergradient ∇f(λ) can be computed incrementally using the α_t.

[Figure, left: running time (s) vs. number of hyperparameters for Forward-HG and Reverse-HG. Right: memory usage (MB) vs. number of weights]

Direct computation of ∇E^val(s_T(λ)) using the chain rule:

    ∇E^val(s_T(λ)) = ∇E^val(s_T) · ds_T/dλ;
    ds_t/dλ = ∂Φ_t(s_{t−1}, λ)/∂s_{t−1} · ds_{t−1}/dλ + ∂Φ_t(s_{t−1}, λ)/∂λ,   t ∈ {1, …, T}.

Define Z_0 = 0 and Z_t = ds_t/dλ ∈ ℝ^{d×m}.

Recursive equation for the total derivative of s:

    Z_t = A_t Z_{t−1} + B_t,   t ∈ {1, …, T}.

Hypergradient:

    ∇f(λ) = ∂L/∂λ = ∇E(s_T) Σ_{t=1}^{T} ( Π_{s=t+1}^{T} A_s ) B_t = ∇E(s_T) Z_T

Forward and Reverse-HG

Algorithm 1: Reverse-HG (linked to BPTT)
    for t = 1 to T do
        s_t ← Φ_t(s_{t−1}, λ)
    end for
    α_T ← ∇E(s_T)
    g ← 0
    for t = T − 1 downto 1 do
        g ← g + α_{t+1} B_{t+1}
        α_t ← α_{t+1} A_{t+1}
    end for
    return g

[Figure: computation graph of the unrolled dynamics]

Algorithm 2: Forward-HG (linked to RTRL)
    Z_0 ← 0
    for t = 1 to T do
        Z_t ← A_t Z_{t−1} + B_t
        s_t ← Φ_t(s_{t−1}, λ)
    end for
    return ∇E(s_T) Z_T

Experiment: Data Hyper-cleaning

• Dataset: subset of MNIST with random noise on the labels
• Task: classification / noisy-example detection
• Model: logistic regression with weighted error:

    E^train = Σ_i λ_i E_i

• Hyperparameters: weights of the single examples, λ
• Constraints: ∥λ∥₁ ≤ R

[Figure: accuracy and sparsity of λ vs. hyper-iterations (0–500). Left axis: number of discarded examples (TP and FP curves); right axis: validation and test accuracy (80–92%)]

Experiment: Learning Task Interactions

• Dataset: small subsets of CIFAR-10 (and CIFAR-100)
• Task: classification in an MTL setting / interaction learning
• Model: logistic regression with the MTL regularizer

    Σ_{j,k} A_{jk} ∥w_j − w_k∥²₂ + ρ Σ_j ∥w_j∥²₂

• Hyperparameters: A, ρ, η, µ (gradient descent with momentum)
• Constraints: ∥A∥₁ ≤ R (For-HG-S)

                             Accuracy %
STL                          67.47 ± 2.78
Dinuzzo et al. (2011)        69.96 ± 1.85
Jawanpuria et al. (2015)     70.96 ± 1.04
Rev-HG-S                     71.62 ± 1.34

Some Future Directions
• Validate RTHO empirically and study its convergence properties
• Improve reliability (adaptiveness) of gradient-based HO methods

Real-Time HO

Algorithm 3: RTHO. Executes parameter and hyperparameter optimization in real time (no need for hyper-iterations)
    Z_0 ← 0
    for t = 1, 2, … do
        s_t ← Φ_t(s_{t−1}, λ)
        Z_t ← A_t Z_{t−1} + B_t
        if t ≡ 0 (mod Δ) then
            λ ← λ − η ∇E(s_t) Z_t
        end if
    end for
    return s

Experiment: Phone Classification

“Large scale” experiment on the TIMIT dataset with RTHO, using a 5-layer, two-output FFNN (~15×10⁶ params), with 2 optimization and 1 regularization hyperparameters.

[Figure: training and validation accuracies, validation error, and the trajectories of η, µ and ρ over RTHO iterations]

Experiment: CNN (not in the paper)

Real-time hyperparameter optimization on a small convolutional neural network trained on MNIST. Hyperparameters are the learning rate η and the L2 regularization of the fully connected layer weights ρ. RTHO decreases test classification error by around 25% over the baseline.

[Figure: validation (99.34%) and test (99.40%) accuracies and the trajectories of η, µ and ρ during RTHO on the MNIST CNN]

Code at: https://github.com/lucfra/RFHO


Perspectives (1)

Need better theory to explain RTHO (e.g. convergence rate)

The stochastic or real-time HO approach can also be applied in the case of reverse mode by truncating hypergradient propagation (similar to truncated BPTT) — encouraging results in (Grazzi 2017)

We also lack a statistical theory for HO: Can many hyperparameters overfit the validation set? Can we establish bounds?


Perspectives (2)

Many recent works on meta-learning or learning-to-optimize can be formulated within a framework that is compatible with HO

For example, meta-learning can be seen as a bilevel program

    min_ζ  E(ζ, θ)
    s.t.   θ ∈ argmin_θ J(ζ, θ)

where

E is the test error in the meta-training episodes

J is the training error in the meta-training episodes

ζ are (hyper)parameters that index a class of hypothesis spaces

θ are parameters used to fit the meta-training episodes

Perspectives (2)

By tuning ζ we select a particular hypothesis space that is hopefully well suited for novel (meta-test) learning episodes

Preliminary results on MiniImagenet are comparable or better than those reported in (Ravi & Larochelle 2016) for 1-shot learning


Thank You!

Code available: https://github.com/lucfra/RFHO


References I

Badino, Leonardo (2016). “Phonetic Context Embeddings for DNN-HMM Phone Recognition”. In: Proceedings of Interspeech, pp. 405–409.

Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind (2015). “Automatic differentiation in machine learning: a survey”. In: arXiv preprint arXiv:1502.05767. URL: https://arxiv.org/abs/1502.05767.

Bengio, Yoshua (2000). “Gradient-based optimization of hyperparameters”. In: Neural Computation 12.8, pp. 1889–1900. URL: http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015187.

Bergstra, James S., Rémi Bardenet, Yoshua Bengio, and Balázs Kégl (2011). “Algorithms for hyper-parameter optimization”. In: Advances in Neural Information Processing Systems, pp. 2546–2554. URL: http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.

Bergstra, James and Yoshua Bengio (2012). “Random search for hyper-parameter optimization”. In: Journal of Machine Learning Research 13.Feb, pp. 281–305. URL: http://www.jmlr.org/papers/v13/bergstra12a.html.

Bergstra, James, Daniel Yamins, and David D. Cox (2013). “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures”. In: ICML (1) 28, pp. 115–123. URL: http://www.jmlr.org/proceedings/papers/v28/bergstra13.pdf.


References II

Dinuzzo, Francesco, Cheng S. Ong, Gianluigi Pillonetto, and Peter V. Gehler (2011). “Learning output kernels with block coordinate descent”. In: ICML, pp. 49–56.

Evgeniou, Theodoros, Charles A. Micchelli, and Massimiliano Pontil (2005). “Learning multiple tasks with kernel methods”. In: J. Mach. Learn. Res. 6, pp. 615–637.

Garofolo, John S., Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, and David S. Pallett (1993). “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1”. In: NASA STI/Recon Technical Report 93.

Griewank, Andreas and Andrea Walther (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Second edition. Society for Industrial and Applied Mathematics. ISBN: 978-0-89871-659-7, 978-0-89871-776-1. URL: http://epubs.siam.org/doi/book/10.1137/1.9780898717761.

Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown (2011). “Sequential model-based optimization for general algorithm configuration”. In: International Conference on Learning and Intelligent Optimization. Springer, pp. 507–523. URL: http://link.springer.com/10.1007%2F978-3-642-25566-3_40.

Jawanpuria, Pratik, Maksim Lapin, Matthias Hein, and Bernt Schiele (2015). “Efficient Output Kernel Learning for Multiple Tasks”. In: Advances in Neural Information Processing Systems, pp. 1189–1197.

References III

Larsen, Jan, Lars Kai Hansen, Claus Svarer, and M. Ohlsson (1996). “Design and regularization of neural networks: the optimal use of a validation set”. In: Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop. IEEE, pp. 62–71.

LeCun, Yann (1988). “A Theoretical Framework for Back-Propagation”. In: Proc. of the 1988 Connectionist Models Summer School. Ed. by Geoffrey Hinton and Terrence Sejnowski. Morgan Kaufmann, pp. 21–28.

Maclaurin, Dougal, David Duvenaud, and Ryan P. Adams (2015). “Gradient-based hyperparameter optimization through reversible learning”. In: Proceedings of the 32nd International Conference on Machine Learning. URL: http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Pedregosa, Fabian (2016). “Hyperparameter optimization with approximate gradient”. In: arXiv preprint arXiv:1602.02355. URL: http://www.jmlr.org/proceedings/papers/v48/pedregosa16.pdf.

Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams (2012). “Practical Bayesian optimization of machine learning algorithms”. In: Advances in Neural Information Processing Systems, pp. 2951–2959.

Thornton, Chris, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown (2011). “Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms”. In: pp. 847–855.

Werbos, Paul J. (1982). “Applications of advances in nonlinear sensitivity analysis”. In: System Modeling and Optimization. Springer, pp. 762–770.

