Training neural ODEs for density estimation

Page 1: Training neural ODEs for density estimation

Training neural ODEs for density estimation

Chris Finlay

in collaboration with Jörn-Henrik Jacobsen, Levon Nurbekyan and Adam Oberman

Paper: “How to train your Neural ODE”

IPAM HJB II, April 23, 2020

Page 2: Training neural ODEs for density estimation

Table of Contents

- Background
  - Density estimation with normalizing flows
  - Neural ODEs
- FFJORD: Neural ODEs + Normalizing flows
- Regularized neural ODEs
- Results


Page 3: Training neural ODEs for density estimation

1. Background Density estimation with Normalizing flows

Density estimation

[Figure: training data (dog images); p(dog) = ?]

Density estimation is a fundamental problem in ML

- Given data $\{X_1, \dots, X_n\}$ drawn from an unknown distribution $p(x)$, estimate $p$

Typically done through maximum log-likelihood

- select a family of models $p_\theta$
- solve $\max_\theta \sum_i \log p_\theta(X_i)$ (sketched below)

Two issues arise

1. How to parameterize $p_\theta$?

2. How to sample from the learned distribution $p_\theta$? I.e., generate new data?
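To make the maximum-likelihood recipe concrete, here is a minimal sketch with a deliberately simple, hypothetical model family (a one-dimensional Gaussian $p_\theta = \mathcal{N}(\mu, \sigma^2)$); the data and optimizer settings are illustrative only:

```python
import math
import torch

# Stand-in training data X_1, ..., X_n drawn from an "unknown" distribution
X = torch.randn(1000) * 2.0 + 3.0

# Model family p_theta = N(mu, sigma^2), with theta = (mu, log_sigma)
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(500):
    sigma = log_sigma.exp()
    log_p = -0.5 * ((X - mu) / sigma) ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    loss = -log_p.sum()                     # maximize sum_i log p_theta(X_i)
    opt.zero_grad(); loss.backward(); opt.step()
```

Normalizing flows, introduced next, are one way to keep this recipe tractable when $p_\theta$ must model images while still answering issues 1 and 2.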

Page 6: Training neural ODEs for density estimation

1. Background Density estimation with Normalizing flows

Density estimation and generation with normalizing flows

Today, the gospel of Normalizing Flows [TT13, RM15]

Density estimation

- Suppose we have an invertible map $f_\theta : \mathcal{X} \to \mathcal{Z}$

- Change of variables applied to log-densities:

$$\log p_\theta(x) = \log p_{\mathcal{N}}(f_\theta(x)) + \log \det\left[\frac{d f_\theta(x)}{dx}\right] \qquad (1)$$

- here $\mathcal{N}$ is the standard normal distribution

Generation

- sample $z \sim \mathcal{N}$
- compute $x = f_\theta^{-1}(z)$
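As a hedged sanity check of Eq. (1), here is a toy invertible map: an elementwise affine transform whose Jacobian is diagonal, so the log-determinant is just the sum of the log-scales. The parameters `s` and `b` are hypothetical stand-ins, not a real flow layer:

```python
import math
import torch

# Toy invertible map f_theta(x) = exp(s) * x + b (elementwise),
# so df/dx is diagonal and log det[df/dx] = sum(s).
s = torch.randn(2, requires_grad=True)    # hypothetical log-scales
b = torch.randn(2, requires_grad=True)    # hypothetical shifts

def f(x):                                  # x -> z
    return torch.exp(s) * x + b

def f_inv(z):                              # z -> x, used for generation
    return (z - b) * torch.exp(-s)

def log_prob(x):                           # Eq. (1)
    z = f(x)
    log_pz = -0.5 * (z ** 2).sum(-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return log_pz + s.sum()                # + log det of the diagonal Jacobian

x_new = f_inv(torch.randn(5, 2))           # generation: z ~ N(0, I), then invert
```

Real flows compose many such layers out of neural networks, which is exactly where the two difficulties on the next slide come from.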


Page 7: Training neural ODEs for density estimation

1. Background Density estimation with Normalizing flows

That's nice, but...

Unfortunately this approach has two difficult problems:

1. Constructing invertible deep neural networks is hard (but see e.g. [CBD+19])

2. Need to compute Jacobian determinants. Also hard.

Both of these tasks can be made easier by putting structural constraints on the model. However, these constraints tend to degrade model performance.

Is there an easier way?


Page 9: Training neural ODEs for density estimation

1. Background Neural ODEs

Neural ODEs

[Figure: source [CRB+18], Figure 1]

Neural ODEs [HR17, CRB+18] are generalizations of Residual Networks, where the layer index t is a continuous variable called "time".

- Compare one layer of a Residual Network

$$x^{t+1} = x^t + v_\theta(x^t, t)$$

- with an Euler step discretization of the ODE $\dot{x} = v_\theta(x, t)$, with step size $\tau$

$$x^{t+1} = x^t + \tau\, v_\theta(x^t, t)$$
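The two updates differ only by the step size $\tau$. A hedged sketch (the velocity field below is a generic stand-in, not the architecture used in the paper):

```python
import torch
import torch.nn as nn

# Hypothetical velocity field v_theta(x, t): a small MLP that also takes t as an input.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))

def v(x, t):
    t_col = torch.full_like(x[:, :1], t)    # append time as an extra feature
    return net(torch.cat([x, t_col], dim=1))

def resnet_layer(x, t):
    return x + v(x, t)                      # x_{t+1} = x_t + v_theta(x_t, t)

def euler_step(x, t, tau):
    return x + tau * v(x, t)                # x_{t+1} = x_t + tau * v_theta(x_t, t)
```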


Page 10: Training neural ODEs for density estimation

1. Background Neural ODEs

Neural ODEs

[Figure: source [CRB+18], Figure 1]

- Rather than fixing the number of layers (time steps) in the ResNet beforehand

- Solve the ODE

$$\dot{x} = v_\theta(x, t), \qquad x(0) = x_0 \qquad (2)$$

with an adaptive ODE solver, up to some end time $T$. Here $x_0$ is the input to the "network"

- The function so defined is the solution map:

$$f_\theta(x_0) := x(T) = x_0 + \int_0^T v_\theta(x(s), s)\, ds$$

Benefits: memory savings; a tradeoff between accuracy and speed
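A hedged sketch of the solution map with a fixed-step RK4 integrator; in practice an adaptive solver chooses its own step sizes, and fixed steps are used here only to keep the code short:

```python
import torch

def solution_map(v, x0, T=1.0, n_steps=10):
    """Approximate f_theta(x0) = x(T) by integrating dx/dt = v(x, t) with RK4."""
    h = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        k1 = v(x, t)
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = v(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x
```

With a velocity field like the stand-in above, `solution_map(v, torch.randn(8, 2))` plays the role of the forward pass of the "network".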

Page 11: Training neural ODEs for density estimation

1. Background Neural ODEs

Aside: Differentiating losses of neural ODEs

Suppose we have a loss function $L(x(T))$ which depends only on the final state $x(T)$. How do we differentiate with respect to $\theta$?

- Naive approach: backpropagate through the computation graph of the ODE solver

- Optimal control: use sensitivity analysis [PMB+62]. The gradient $\frac{dL}{d\theta}$ is given by

$$\frac{dL(x(T))}{d\theta} = \mu(0)$$

where $\mu$, $\lambda$ and $x$ solve the following ODEs, integrated backwards in time from $t = T$ to $t = 0$:

$$\dot{\mu} = -\lambda^T \tfrac{d}{d\theta} v_\theta(x(t), t), \qquad \mu(T) = 0$$

$$\dot{\lambda} = -\lambda^T \nabla_x v_\theta(x(t), t), \qquad \lambda(T) = \tfrac{dL(x(T))}{dx(T)}$$

$$\dot{x} = v_\theta(x(t), t), \qquad x(T) = x_T$$
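A hedged, discretized sketch of this backward sweep, using Euler steps and treating the parameters as a single flat tensor `theta` (so the signature `v(x, t, theta)` is hypothetical; a real implementation handles each parameter tensor separately):

```python
import torch

def adjoint_gradient(v, theta, xT, dLdxT, T=1.0, n_steps=50):
    """Euler discretization of the backward adjoint ODEs; returns mu(0) ~ dL/dtheta."""
    h = T / n_steps
    x, lam, mu, t = xT.clone(), dLdxT.clone(), torch.zeros_like(theta), T
    for _ in range(n_steps):
        x_req = x.detach().requires_grad_(True)
        th_req = theta.detach().requires_grad_(True)
        vt = v(x_req, t, th_req)
        # vector-Jacobian products: lambda^T dv/dx and lambda^T dv/dtheta
        lam_dvdx, lam_dvdth = torch.autograd.grad(vt, (x_req, th_req), grad_outputs=lam)
        x = x - h * vt.detach()        # x is recomputed backwards along dx/dt = v
        lam = lam + h * lam_dvdx       # d(lambda)/dt = -lambda^T grad_x v
        mu = mu + h * lam_dvdth        # d(mu)/dt = -lambda^T dv/dtheta
        t -= h
    return mu
```

The memory saving mentioned earlier comes from exactly this: the forward solver's intermediate states need not be stored, since $x(t)$ is recomputed during the backward sweep.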


Page 12: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

Neural ODEs and Normalizing Flows

We can overcome the two difficulties of normalizing flows by exploiting properties of dynamical systems

1. If $v_\theta$ is Lipschitz, then the solution map is invertible! (Picard-Lindelöf)

$$f_\theta^{-1}(x(T)) = x(T) + \int_T^0 v_\theta(x(s), s)\, ds$$

- i.e., just need to solve the backwards dynamics

$$\dot{x} = -v_\theta(x, t), \qquad x(0) = x_T \qquad (3)$$


Page 13: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

Neural ODEs and Normalizing Flows

We can overcome the two difficulties of normalizing flows by exploiting properties of dynamical systems

2. Log-determinants (of particles solving the ODE) have a beautiful time derivative [Vil03, p. 114]

$$\frac{d}{ds} \log \det\left[\frac{dx(s)}{dx_0}\right] = \operatorname{div}(v)(x(s), s)$$

where $\operatorname{div}(\cdot)$ is the divergence operator, $\operatorname{div}(f) = \sum_i \frac{\partial f_i}{\partial x_i}$

- i.e.

$$\log \det\left[\frac{d f_\theta(x)}{dx}\right] = \int_0^T \operatorname{div}(v)(x(s), s)\, ds$$
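A hedged sketch of this identity: integrate the state and the accumulated divergence together. The divergence here is computed exactly, one Jacobian row per autograd call, which is only affordable in low dimensions; the stochastic estimator used in practice appears two slides later. As written this evaluates the log-determinant but detaches gradients, so it is a demo rather than a training-ready routine:

```python
import torch

def exact_divergence(v, x, t):
    """Tr(dv/dx), one row of the Jacobian per autograd call (low-dimensional demo)."""
    x = x.detach().requires_grad_(True)
    vx = v(x, t)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        div = div + torch.autograd.grad(vx[:, i].sum(), x, retain_graph=True)[0][:, i]
    return vx.detach(), div.detach()

def augmented_euler(v, x0, T=1.0, n_steps=50):
    """Jointly integrate x(s) and log det[dx(s)/dx0] with Euler steps."""
    h = T / n_steps
    x, logdet, t = x0, torch.zeros(x0.shape[0]), 0.0
    for _ in range(n_steps):
        vx, div = exact_divergence(v, x, t)
        x = x + h * vx                   # dx/ds = v(x, s)
        logdet = logdet + h * div        # d/ds log det = div(v)(x, s)
        t += h
    return x, logdet
```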


Page 14: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

FFJORD: density estimation

Putting these two observations together, we arrive at the FFJORD algorithm (Free-form Jacobian Of Reversible Dynamics) [GCB+19]

Density estimation: to learn the distribution $p_\theta$, solve the following log-likelihood optimization problem

$$\max_\theta \sum_i \left[ \log p_{\mathcal{N}}(z_i(T)) + \int_0^T \operatorname{div}(v_\theta)(z_i(s), s)\, ds \right]$$

where $z_i(s)$ satisfies the ODE

$$\dot{z}_i(s) = v_\theta(z_i(s), s), \qquad z_i(0) = x_i$$
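Assuming the hypothetical `augmented_euler` helper sketched above (which would need to keep its autograd graph for actual training), the objective becomes a one-line loss:

```python
import math
import torch

def ffjord_nll(v, x_batch, T=1.0):
    """Negative log-likelihood of a batch under the FFJORD objective (sketch)."""
    zT, logdet = augmented_euler(v, x_batch, T=T)   # z_i(T) and accumulated divergence
    d = x_batch.shape[1]
    log_pz = -0.5 * (zT ** 2).sum(dim=1) - 0.5 * d * math.log(2 * math.pi)
    return -(log_pz + logdet).mean()                 # minimize with any optimizer
```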


Page 15: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

That's nice, but...

Isn't it really hard to compute the divergence term

$$\operatorname{div}(v_\theta)(x, s) = \sum_j \frac{\partial v_\theta^j(x, s)}{\partial x_j}$$

in high dimensions?


Page 16: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

Trace estimates

- Divergence is the trace of the Jacobian $\nabla_x v_\theta(x, t)$

- So we can use trace estimates to approximate it:

$$\operatorname{div}(v_\theta)(x, t) = \operatorname{Tr}(\nabla_x v_\theta(x, t)) = \mathbb{E}_\eta\left[\eta^T \nabla_x v_\theta(x, t)\, \eta\right], \qquad \eta \sim \mathcal{N}(0, 1)$$

- this can be computed quickly with reverse-mode automatic differentiation
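A hedged sketch of this (Hutchinson-style) trace estimator: one reverse-mode vector-Jacobian product per noise sample, regardless of the dimension:

```python
import torch

def hutchinson_divergence(v, x, t, n_samples=1):
    """Unbiased estimate of div(v)(x, t) = Tr(dv/dx) as E[eta^T (dv/dx) eta]."""
    x = x.detach().requires_grad_(True)
    vx = v(x, t)
    est = torch.zeros(x.shape[0])
    for _ in range(n_samples):
        eta = torch.randn_like(x)                        # eta ~ N(0, I)
        vjp = torch.autograd.grad(vx, x, grad_outputs=eta,
                                  retain_graph=True, create_graph=True)[0]
        est = est + (vjp * eta).sum(dim=1) / n_samples   # eta^T (dv/dx) eta
    return vx, est
```

`create_graph=True` keeps the estimate differentiable, so it can sit inside the training objective.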


Page 17: Training neural ODEs for density estimation

2. FFJORD: Neural ODEs + Normalizing flows

FFJORD: generation

FFJORD [GCB+19] generation (density sampling) is simple

Generation: sample $z \sim \mathcal{N}$, and solve

$$\dot{x}(s) = -v_\theta(x(s), s), \qquad x(0) = z$$

I.e., $x = z + \int_T^0 v_\theta(x(s), s)\, ds$
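A hedged sketch of generation, again with plain Euler steps for brevity: draw $z$ from the standard normal and run the learned dynamics backwards from $t = T$ to $0$:

```python
import torch

def generate(v, n_samples, dim, T=1.0, n_steps=50):
    """Sample z ~ N(0, I), then integrate the reversed dynamics back to data space."""
    x = torch.randn(n_samples, dim)
    h = T / n_steps
    t = T
    with torch.no_grad():
        for _ in range(n_steps):
            x = x - h * v(x, t)          # step along -v, i.e. backwards in time
            t -= h
    return x
```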


Page 18: Training neural ODEs for density estimation

3. Regularized neural ODEs

Need for regularity

[Figure: a generic solution transporting the source (data) distribution to the target (normal) distribution]

FFJORD is promising, but there are no constraints placed on the paths the particles take

- As long as the source (data) distribution is mapped to the target (normal) distribution, the log-likelihood is maximized

- i.e., solutions are not unique

- If the particle paths are "wobbly", the adaptive ODE solver has to take many tiny steps, with many function evaluations. This is time consuming


Page 19: Training neural ODEs for density estimation

3. Regularized neural ODEs

Solver is slowed by the number of function evaluations

[Figure: number of function evaluations vs Jacobian norm]

- the number of function evaluations (time steps) taken by the solver is related to the total derivative

$$\frac{d v(x(t), t)}{dt} = \left[\nabla v(x(t), t)\right] v(x(t), t) + \frac{\partial}{\partial t} v(x(t), t)$$

- In other words, if we can control the size of $v$ and $\nabla v$, we can reduce the number of function evaluations (time steps) taken by the ODE solver


Page 20: Training neural ODEs for density estimation

3. Regularized neural ODEs

Proposal: regularize objective to decrease # of FEs

[Figure: the optimal transport (OT) map carrying the source density p(x) to the target p(z(x, T))]

Add two terms to the objective to encourage regularity of trajectories:

- $\int_0^T \|v_\theta(x(s), s)\|^2\, ds$, the kinetic energy. This is closely related to the Optimal Transport cost between distributions (Benamou-Brenier)

- $\int_0^T \|\nabla_x v_\theta(x(s), s)\|_F^2\, ds$, a Frobenius norm penalty on the Jacobian

- Frobenius norms can again be computed with a trace estimate:

$$\|A\|_F^2 = \operatorname{Tr}(A^T A) = \mathbb{E}_\eta\left[\eta^T A^T A\, \eta\right] = \mathbb{E}\left[\|A\eta\|^2\right]$$

i.e., we can recycle computations from the divergence estimate (see the sketch below)
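A hedged sketch of the recycling: one noise sample $\eta$ and one vector-Jacobian product yield the divergence estimate, a Frobenius-norm estimate, and the kinetic energy at once. The code uses $\|\eta^T \nabla_x v\|^2$ rather than $\|(\nabla_x v)\,\eta\|^2$; both have expectation $\|\nabla_x v\|_F^2$ since $\operatorname{Tr}(A^T A) = \operatorname{Tr}(A A^T)$, and the former reuses exactly the VJP already computed for the divergence:

```python
import torch

def divergence_and_regularizers(v, x, t):
    """One VJP gives div(v), a ||dv/dx||_F^2 estimate, and the kinetic energy ||v||^2."""
    x = x.detach().requires_grad_(True)
    vx = v(x, t)
    eta = torch.randn_like(x)
    vjp = torch.autograd.grad(vx, x, grad_outputs=eta, create_graph=True)[0]  # eta^T dv/dx
    div_est = (vjp * eta).sum(dim=1)       # Hutchinson divergence estimate
    frob_est = (vjp ** 2).sum(dim=1)       # estimates ||dv/dx||_F^2
    kinetic = (vx ** 2).sum(dim=1)         # ||v(x, t)||^2
    return vx, div_est, frob_est, kinetic
```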


Page 22: Training neural ODEs for density estimation

4. Results

Test log-likelihood vs wall clock time

[Figure: test bits/dim vs wall-clock training time (hours) for vanilla FFJORD and FFJORD RNODE (ours), on MNIST (left) and CIFAR10 (right)]


Page 23: Training neural ODEs for density estimation

4. Results

Ablation study: regularity measures on MNIST

[Figure: ablation on MNIST over 100 epochs — mean Jacobian Frobenius norm vs epoch, mean kinetic energy vs epoch, and function evaluations vs epoch — comparing no regularization, kinetic penalty only, Jacobian penalty only, and RNODE (both penalties)]


Page 24: Training neural ODEs for density estimation

4. Results

Some numbers


Page 25: Training neural ODEs for density estimation

4. Results

Some pictures: MNIST & CIFAR10


Page 26: Training neural ODEs for density estimation

4. Results

Some bigger pictures: ImageNet64

[Figure: real ImageNet64 images (left); generated ImageNet64 images from FFJORD RNODE (right)]


Page 27: Training neural ODEs for density estimation

4. Results

Comments

- The Jacobian norm penalty can be viewed as a continuous-time analogue of layer-wise Lipschitz regularization in ResNets. This helps ensure particle paths do not cross, and helps keep networks numerically invertible

- Without regularization, it is very difficult to train neural ODEs with fixed step-size ODE solvers; adaptive ODE solvers are needed

- However, with regularization, neural ODEs can be trained with fixed step-size ODE solvers (e.g. a four-stage Runge-Kutta scheme)

- On large images (> 64 pixel width) vanilla FFJORD will not train: the adaptive solver's time step underflows


Page 28: Training neural ODEs for density estimation

5. References

References I

[CBD+19] TQ Chen, J Behrmann, D Duvenaud and JH Jacobsen, Residual Flows for Invertible Generative Modeling, NeurIPS 2019: 9913–9923.

[CRB+18] TQ Chen, Y Rubanova, J Bettencourt and D Duvenaud, Neural Ordinary Differential Equations, NeurIPS 2018: 6572–6583.

[GCB+19] W Grathwohl, RT Chen, J Bettencourt, I Sutskever and D Duvenaud, FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models, ICLR 2019.

[HR17] E Haber and L Ruthotto, Stable architectures for deep neural networks, Inverse Problems, 34 (2017), no. 1, 014004.

[PMB+62] L Pontryagin, EF Mishchenko, VG Boltyanskii and RV Gamkrelidze, The mathematical theory of optimal processes, 1962.

[RM15] DJ Rezende and S Mohamed, Variational Inference with Normalizing Flows, ICML 2015: 1530–1538.


Page 29: Training neural ODEs for density estimation

5. References

References II

[TT13] EG Tabak and CV Turner, A family of nonparametric density estimation algorithms, Communications on Pure and Applied Mathematics, 66 (2013), no. 2, 145–164.

[Vil03] C Villani, Topics in Optimal Transportation, Graduate Studies in Mathematics, AMS 2003, ISBN 9780821833124.


