CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics: Training Neural Networks 2
Page 1:

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics: – Training Neural Networks 2

Page 2:

• PS3/HW3
– Will be out today
– Due in two weeks

• Submit project proposal by Thursday (and notify me)

(C) Dhruv Batra and Zsolt Kira

Page 3:

Previously: Activation Functions

Sigmoid

tanh

ReLU

Leaky ReLU

Maxout

ELU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 4:

Previously: Activation Functions

Sigmoid

tanh

ReLU (good default choice)

Leaky ReLU

Maxout

ELU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 5:

Previously: Weight Initialization

Initialization too small: activations go to zero, gradients also zero, no learning

Initialization too big: activations saturate (for tanh), gradients zero, no learning

Initialization just right: nice distribution of activations at all layers, learning proceeds nicely

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 6:

Previously: Data Preprocessing

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 7:

Previously: Babysitting Learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 8:

Previously: Hyperparameter Search

Grid Layout vs. Random Layout (figure: each layout plotted over an important parameter and an unimportant parameter)

Coarse to fine search

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 9:

Today

- Fancier optimization
- Regularization
- Transfer Learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 10:

Optimization

(figure: loss landscape over weights W_1 and W_2)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 11:

Optimization: Problems with SGD (1)

What if loss changes quickly in one direction and slowly in another? What does gradient descent do?

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 12:

Optimization: Problems with SGD (1)

What if loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction.

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 13:

Optimization: Problems with SGD (2)

What if the loss function has a local minimum or saddle point?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 14:

Optimization: Problems with SGD (2)

What if the loss function has a local minimum or saddle point?

Zero gradient, gradient descent gets stuck

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 15:

Optimization: Problems with SGD (2)

What if the loss function has a local minimum or saddle point?

Saddle points are much more common in high dimensions.

Dauphin et al, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS 2014

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 16:

Optimization: Problems with SGD (3)

Our gradients come from minibatches so they can be noisy!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 17:

SGD + Momentum

SGD: x_{t+1} = x_t − α ∇f(x_t)

SGD+Momentum: v_{t+1} = ρ v_t + ∇f(x_t); x_{t+1} = x_t − α v_{t+1}

- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho = 0.9 or 0.99

Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
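Below is a minimal runnable sketch of this update on a toy, poorly conditioned quadratic; the loss, rho, and learning_rate values are illustrative assumptions, not from the slides.

```python
import numpy as np

def compute_gradient(x):
    # toy poorly conditioned quadratic loss: f(x) = 0.5 * (x1^2 + 10 * x2^2)
    return np.array([1.0, 10.0]) * x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01   # rho is the "friction" hyperparameter

for _ in range(100):
    dx = compute_gradient(x)
    vx = rho * vx + dx            # velocity: running mean of gradients
    x -= learning_rate * vx       # step along the velocity, not the raw gradient
```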

Page 18:

SGD + Momentum

You may see SGD+Momentum formulated in different ways, but they are equivalent given the same sequence of x. Two common forms: v_{t+1} = ρ v_t + ∇f(x_t), x_{t+1} = x_t − α v_{t+1}; and v_{t+1} = ρ v_t − α ∇f(x_t), x_{t+1} = x_t + v_{t+1}.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013

Page 19:

SGD + Momentum

(figures: SGD vs. SGD+Momentum trajectories on local minima, saddle points, poor conditioning, and gradient noise)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 20:

SGD+Momentum

Momentum update: (figure: gradient, velocity, and the actual step)

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013

Combine gradient at current point with velocity to get step used to update weights

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 21:

Momentum update: (figure: gradient at the current point, velocity, and the actual step)

Nesterov Momentum: (figure: velocity, gradient at the look-ahead point, and the actual step)

Combine gradient at current point with velocity to get step used to update weights

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013

Page 22:

Nesterov Momentum

(figure: velocity, gradient at the look-ahead point, and the actual step)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 23:

Nesterov Momentum

Annoying; usually we want the update in terms of the current iterate x_t and the gradient there, ∇f(x_t).

(figure: velocity, gradient at the look-ahead point, and the actual step)

“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 24:

Nesterov Momentum

Annoying; usually we want the update in terms of the current iterate and its gradient.

(figure: velocity, gradient at the look-ahead point, and the actual step)

Change of variables x̃_t = x_t + ρ v_t and rearrange:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 25:

Nesterov Momentum

Annoying; usually we want the update in terms of the current iterate and its gradient. Change of variables x̃_t = x_t + ρ v_t and rearrange:

v_{t+1} = ρ v_t − α ∇f(x̃_t)
x̃_{t+1} = x̃_t − ρ v_t + (1 + ρ) v_{t+1}

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
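In code, the rearranged update looks roughly like the sketch below, reusing the toy quadratic from the momentum example above; all constants are illustrative assumptions.

```python
import numpy as np

def compute_gradient(x):
    return np.array([1.0, 10.0]) * x  # same toy quadratic as above

x = np.array([1.0, 1.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01

for _ in range(100):
    old_v = v
    v = rho * v - learning_rate * compute_gradient(x)
    # after the change of variables, the step mixes old and new velocity
    x += -rho * old_v + (1 + rho) * v
```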

Page 26:

Nesterov Momentum

(figure: SGD vs. SGD+Momentum vs. Nesterov trajectories)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 27:

AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

“Per-parameter learning rates” or “adaptive learning rates”

Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
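A sketch of the AdaGrad update on the toy quadratic; the learning rate is an assumption, and the 1e-7 term avoids division by zero.

```python
import numpy as np

def compute_gradient(x):
    return np.array([1.0, 10.0]) * x  # toy quadratic

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 0.1

for _ in range(100):
    dx = compute_gradient(x)
    grad_squared += dx * dx   # historical sum of squares, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # per-parameter LR
```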

Page 28:

AdaGrad

Q: What happens with AdaGrad?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 29:

AdaGrad

Q: What happens with AdaGrad? Progress along “steep” directions is damped; progress along “flat” directions is accelerated

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 30:

AdaGrad

Q2: What happens to the step size over a long time?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 31:

AdaGrad

Q2: What happens to the step size over a long time? Decays to zero

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 32:

RMSProp

(slide: the AdaGrad update next to the RMSProp update; RMSProp replaces the running sum of squared gradients with a decaying average, as sketched below)

Tieleman and Hinton, 2012

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
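The RMSProp variant as a sketch; decay_rate is typically around 0.9, and the other constants are illustrative assumptions.

```python
import numpy as np

def compute_gradient(x):
    return np.array([1.0, 10.0]) * x  # toy quadratic

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.01, 0.9

for _ in range(100):
    dx = compute_gradient(x)
    # leaky running average instead of AdaGrad's ever-growing sum
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```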

Page 33:

RMSProp

(figure: SGD vs. SGD+Momentum vs. RMSProp trajectories)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 34:

Adam (almost)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 35:

Adam (almost)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Sort of like RMSProp with momentum

Q: What happens at first timestep?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 36:

Adam (full form)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Bias correction

Bias correction for the fact that first and second moment estimates start at zero

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
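A sketch of the full Adam update with bias correction on the toy quadratic; beta1, beta2, and learning_rate follow the defaults quoted on the next slide, the rest is illustrative.

```python
import numpy as np

def compute_gradient(x):
    return np.array([1.0, 10.0]) * x  # toy quadratic

x = np.array([1.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
beta1, beta2, learning_rate = 0.9, 0.999, 1e-3

for t in range(1, 101):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx          # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # AdaGrad/RMSProp
    first_unbias = first_moment / (1 - beta1 ** t)    # bias correction: both
    second_unbias = second_moment / (1 - beta2 ** t)  # moments start at zero
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```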

Page 37:

Adam (full form)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Momentum

AdaGrad / RMSProp

Bias correction

Bias correction for the fact that first and second moment estimates start at zero

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 38:

Adam

(figure: SGD vs. SGD+Momentum vs. RMSProp vs. Adam trajectories)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 39:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 40:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

=> Learning rate decay over time!

step decay: e.g. decay learning rate by half every few epochs.

exponential decay: α = α₀ e^(−k t)

1/t decay: α = α₀ / (1 + k t)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
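The three schedules as one-liners; this is a sketch in which alpha0, k, and the halving interval are assumed values, not from the slides.

```python
import numpy as np

alpha0, k = 1e-2, 0.1          # assumed initial rate and decay constant

def step_decay(epoch):         # halve every 10 epochs (assumed interval)
    return alpha0 * (0.5 ** (epoch // 10))

def exponential_decay(t):      # alpha = alpha0 * exp(-k t)
    return alpha0 * np.exp(-k * t)

def one_over_t_decay(t):       # alpha = alpha0 / (1 + k t)
    return alpha0 / (1 + k * t)
```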

Page 41:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

(figure: loss vs. epoch, with a sharp drop at each learning rate decay step)

Learning rate decay!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 42:

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

(figure: loss vs. epoch, with a sharp drop at each learning rate decay step)

Learning rate decay!

More critical with SGD+Momentum, less common with Adam

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 43:

First-Order Optimization

(figure: loss as a function of w1)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 44:

First-Order Optimization

(figure: loss vs. w1, with a linear approximation at the current point)

(1) Use gradient to form linear approximation
(2) Step to minimize the approximation

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 45:

Second-Order Optimization

(figure: loss vs. w1, with a quadratic approximation at the current point)

(1) Use gradient and Hessian to form quadratic approximation
(2) Step to the minimum of the approximation

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 46:

Second-Order Optimization

Second-order Taylor expansion:
J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

Solving for the critical point we obtain the Newton parameter update:
θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

Q: What is nice about this update?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
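One Newton step on the toy quadratic from earlier, as a sketch; it solves the linear system H s = g rather than forming H⁻¹ explicitly.

```python
import numpy as np

H = np.diag([1.0, 10.0])       # Hessian of the toy quadratic loss
x = np.array([1.0, 1.0])
g = H @ x                      # gradient of the toy loss at x

x = x - np.linalg.solve(H, g)  # Newton update: x <- x - H^{-1} g
# for a quadratic, this lands on the minimum in a single step (x is now ~0)
```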

Page 47:

Second-Order Optimization

Second-order Taylor expansion: J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

Solving for the critical point we obtain the Newton parameter update: θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

Q: What is nice about this update?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 48:

Second-Order Optimization

Second-order Taylor expansion: J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

Solving for the critical point we obtain the Newton parameter update: θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

Q: What is nice about this update?

No hyperparameters! No learning rate! (Though you might use one in practice)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 49:

Second-Order Optimization

Second-order Taylor expansion: J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

Solving for the critical point we obtain the Newton parameter update: θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

Q2: Why is this bad for deep learning?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 50:

Second-Order Optimization

Second-order Taylor expansion: J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇_θ J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)

Solving for the critical point we obtain the Newton parameter update: θ* = θ₀ − H⁻¹ ∇_θ J(θ₀)

Q2: Why is this bad for deep learning?

Hessian has O(N²) elements; inverting takes O(N³); N = (tens or hundreds of) millions

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 51:

Second-Order Optimization

- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n³)), approximate the inverse Hessian with rank-1 updates over time (O(n²) each).

- L-BFGS (Limited memory BFGS): Does not form/store the full inverse Hessian.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 52:

L-BFGS

- Usually works very well in full batch, deterministic mode, i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely

- Does not transfer very well to the mini-batch setting; gives bad results. Adapting second-order methods to the large-scale, stochastic setting is an active area of research.

Le et al, “On optimization methods for deep learning”, ICML 2011
Ba et al, “Distributed second-order optimization using Kronecker-factored approximations”, ICLR 2017

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 53:

In practice:

- Adam is a good default choice in many cases
- SGD+Momentum with learning rate decay often outperforms Adam by a bit, but requires more tuning
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 54:

Beyond Training Error

Better optimization algorithms help reduce training loss

But we really care about error on new data - how to reduce the gap?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 55:

Model Ensembles

1. Train multiple independent models
2. At test time average their results

(Take average of predicted probability distributions, then choose argmax)

Enjoy 2% extra performance

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 56:

Model Ensembles: Tips and Tricks

Instead of training independent models, use multiple snapshots of a single model during training!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 57:

Model Ensembles: Tips and Tricks

Instead of training independent models, use multiple snapshots of a single model during training!

Cyclic learning rate schedules can make this work even better!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Page 58:

Model Ensembles: Tips and Tricks

Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging)

Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
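A sketch of Polyak-style averaging using an exponential moving average; the 0.999 decay, the dict-of-arrays layout, and the stand-in optimizer step are assumptions, not the paper’s exact scheme.

```python
import numpy as np

decay = 0.999                                    # assumed EMA decay
params = {"W": np.random.randn(10, 10)}          # stands in for model weights
ema_params = {k: v.copy() for k, v in params.items()}

for step in range(1000):                         # pretend training loop
    params["W"] -= 0.01 * np.random.randn(10, 10)   # stand-in optimizer step
    for k in params:   # blend current params into the moving average
        ema_params[k] = decay * ema_params[k] + (1 - decay) * params[k]

# at test time, evaluate the model with ema_params instead of params
```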

Page 59:

Early Stopping

(figures: loss vs. iteration; train and val accuracy vs. iteration, with “stop training here” marked where val accuracy peaks)

Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
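A sketch of the “keep the best snapshot” variant; the model, training step, and validation metric below are toy stubs standing in for real ones.

```python
import copy
import random

def train_one_epoch(model):          # stub: stands in for real training
    model["w"] += random.uniform(-1, 1)

def evaluate_val(model):             # stub: stands in for val accuracy
    return -abs(model["w"])          # pretend accuracy peaks at w = 0

model = {"w": 5.0}
best_val, best_snapshot = float("-inf"), None

for epoch in range(20):
    train_one_epoch(model)
    val = evaluate_val(model)
    if val > best_val:               # keep the snapshot that did best on val
        best_val = val
        best_snapshot = copy.deepcopy(model)

model = best_snapshot                # use the best snapshot at test time
```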

Page 60:

How to improve single-model performance?

Regularization

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 61:

Regularization: Add term to loss


In common use:
L2 regularization: R(W) = Σ_k Σ_l W_{k,l}² (weight decay)
L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 62:

Regularization: Dropout

In each forward pass, randomly set some neurons to zero. Probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 63:

Regularization: Dropout

Example forward pass with a 3-layer network using dropout

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
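The slide’s code isn’t reproduced in this transcript; here is a minimal sketch of that forward pass, assuming a 3-layer ReLU network, drop probability p = 0.5, and random demo weights.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step_forward(X, W1, b1, W2, b2, W3, b3):
    H1 = np.maximum(0, X.dot(W1) + b1)   # first hidden layer (ReLU)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, H1.dot(W2) + b2)  # second hidden layer (ReLU)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    return H2.dot(W3) + b3               # output scores

# tiny demo with random weights: 4 inputs, two hidden layers of 8, 3 outputs
rng = np.random.randn
out = train_step_forward(rng(2, 4), rng(4, 8), rng(8), rng(8, 8), rng(8),
                         rng(8, 3), rng(3))
```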

Page 64:

Regularization: Dropout

How can this possibly be a good idea?

Forces the network to have a redundant representation; prevents co-adaptation of features

(figure: features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look” feed a cat score; dropout zeroes out a random subset of them)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 65:

Regularization: Dropout

How can this possibly be a good idea?

Another interpretation:

Dropout is training a large ensemble of models (that share parameters).

Each binary mask is one model.

An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! Only ~10^82 atoms in the universe...

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 66:

Dropout: Test time

Dropout makes our output random! y = f_W(x, z), where x is the input (image), y the output (label), and z a random mask.

Want to “average out” the randomness at test time: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

But this integral seems hard …

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 67:

Dropout: Test time

Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2: a = w1 x + w2 y.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 68:

Dropout: Test time

Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 69:

Dropout: Test time

Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y

During training (dropping each input with probability ½) we have:
E[a] = ¼(w1 x + w2 y) + ¼(w1 x + 0·y) + ¼(0·x + w2 y) + ¼(0·x + 0·y) = ½(w1 x + w2 y)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 70:

Dropout: Test time

Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.

At test time we have: E[a] = w1 x + w2 y
During training we have: E[a] = ½(w1 x + w2 y)

At test time, multiply by the dropout probability: a = p (w1 x + w2 y)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Note: Here, dropout probability means the probability of keeping an activation – sometimes people define this as the opposite…

Page 71:

Dropout: Test time

At test time all neurons are always active => we must scale the activations so that for each neuron: output at test time = expected output at training time

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 72:

Dropout Summary

drop in forward pass

scale at test time

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 73:

Alternative: “Inverted dropout”

test time is unchanged!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
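A sketch of inverted dropout for one hidden layer: scale by 1/p during training so the test-time code needs no change. Names and p are illustrative.

```python
import numpy as np

p = 0.5  # keep probability

def forward_train(X, W1, b1):
    H1 = np.maximum(0, X.dot(W1) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # mask already scaled by 1/p
    return H1 * U1                            # expected value matches test time

def forward_test(X, W1, b1):
    return np.maximum(0, X.dot(W1) + b1)      # test time is unchanged!
```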

Page 74:

Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 75:

Regularization: A common pattern

Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Example: Batch Normalization

Training: Normalize using stats from random minibatches

Testing: Use fixed stats to normalize

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 76:

Regularization: Data Augmentation

(figure: load image and label “cat” → CNN → compute loss)

This image by Nikita is licensed under CC-BY 2.0

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 77:

Regularization: Data Augmentation

(figure: load image and label “cat” → transform image → CNN → compute loss)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 78:

Data Augmentation: Horizontal Flips

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 79:

Data Augmentation: Random crops and scales

Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
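A sketch of that ResNet-style training crop; resize_short_side is a hypothetical helper passed in by the caller, and the image is assumed to be an H x W x 3 numpy array.

```python
import numpy as np

def random_crop_and_scale(img, resize_short_side):
    # resize_short_side(img, L): hypothetical helper that scales img so its
    # shorter side equals L, preserving aspect ratio
    L = np.random.randint(256, 481)        # 1. pick random L in [256, 480]
    img = resize_short_side(img, L)        # 2. resize, short side = L
    h, w = img.shape[:2]
    y = np.random.randint(0, h - 224 + 1)  # 3. sample a random 224 x 224 patch
    x = np.random.randint(0, w - 224 + 1)
    return img[y:y + 224, x:x + 224]
```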

Page 80:

Data Augmentation: Random crops and scales

Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops. ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 81:

Data Augmentation: Color Jitter

Simple: Randomize contrast and brightness

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 82:

Data Augmentation: Color Jitter

Simple: Randomize contrast and brightness

More Complex:
1. Apply PCA to all [R, G, B] pixels in training set
2. Sample a “color offset” along principal component directions
3. Add offset to all pixels of a training image

(As seen in [Krizhevsky et al. 2012], ResNet, etc)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
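A sketch of the PCA-based jitter; the sigma value is an assumption, and evecs/evals are presumed to come from a PCA fitted over all training-set RGB pixels.

```python
import numpy as np

def pca_color_jitter(img, evecs, evals, sigma=0.1):
    # img: H x W x 3 float array; evecs: 3x3 (columns are principal directions);
    # evals: 3 eigenvalues from PCA over all [R, G, B] training pixels
    alphas = np.random.normal(0.0, sigma, size=3)  # random magnitudes
    offset = evecs @ (alphas * evals)              # offset along PC directions
    return img + offset                            # same offset for every pixel
```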

Page 83:

Data Augmentation: Get creative for your problem!

Random mix/combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 84:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 85:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 86:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 87:

Regularization: A common pattern

Training: Add random noise
Testing: Marginalize over the noise

Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 88:

Transfer Learning

“You need a lot of data if you want to train/use CNNs”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 89:

Transfer Learning

“You need a lot of data if you want to train/use CNNs”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 90:

Transfer Learning with CNNs

1. Train on Imagenet

(network: Image → [Conv-64 ×2, MaxPool] → [Conv-128 ×2, MaxPool] → [Conv-256 ×2, MaxPool] → [Conv-512 ×2, MaxPool] → [Conv-512 ×2, MaxPool] → FC-4096 → FC-4096 → FC-1000)

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 91:

Transfer Learning with CNNs

1. Train on Imagenet: the full network above, ending in FC-1000

2. Small Dataset (C classes): same network, but the final FC-1000 is replaced by FC-C

Freeze these: all layers below the final FC

Reinitialize this and train: the new FC-C layer

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Page 92:

Transfer Learning with CNNs

1. Train on Imagenet: the full network above, ending in FC-1000

2. Small Dataset (C classes): replace FC-1000 with FC-C; freeze everything below, reinitialize FC-C and train it

3. Bigger dataset: train the last few layers (FC-C and the layers just below it) and fine-tune the rest

With bigger dataset, train more layers

Lower learning rate when finetuning; 1/10 of original LR is a good starting point

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Page 93:

(network: Image → conv blocks → FC layers; lower layers are more generic, higher layers more specific)

|                     | very similar dataset | very different dataset |
|---------------------|----------------------|------------------------|
| very little data    | ?                    | ?                      |
| quite a lot of data | ?                    | ?                      |

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 94:

(network as above: more generic lower layers, more specific higher layers)

|                     | very similar dataset               | very different dataset |
|---------------------|------------------------------------|------------------------|
| very little data    | Use Linear Classifier on top layer | ?                      |
| quite a lot of data | Finetune a few layers              | ?                      |

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 95:

(network as above: more generic lower layers, more specific higher layers)

|                     | very similar dataset               | very different dataset                                          |
|---------------------|------------------------------------|-----------------------------------------------------------------|
| very little data    | Use Linear Classifier on top layer | You’re in trouble… Try linear classifier from different stages  |
| quite a lot of data | Finetune a few layers              | Finetune a larger number of layers                              |

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 96:

Transfer learning with CNNs is pervasive… (it’s the norm, not the exception)

(figures: Object Detection (Fast R-CNN) and Image Captioning: CNN + RNN)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 97:

Transfer learning with CNNs is pervasive… (it’s the norm, not the exception)

(figures: Object Detection (Fast R-CNN) and Image Captioning: CNN + RNN)

Both start from a CNN pretrained on ImageNet

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 98:

Transfer learning with CNNs is pervasive… (it’s the norm, not the exception)

(figures: Object Detection (Fast R-CNN) and Image Captioning: CNN + RNN)

CNN pretrained on ImageNet; word vectors pretrained with word2vec

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Page 99:

The most effective method: Gather more data!

“Deep Learning Scaling is Predictable, Empirically”
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou

Revisiting the Unreasonable Effectiveness of Data https://ai.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html

Page 100:

Takeaway for your projects and beyond:

Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data, train a big ConvNet there
2. Transfer learn to your dataset

Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own:
Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
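A minimal PyTorch sketch of steps 1-2 using the torchvision model zoo; NUM_CLASSES, the choice of ResNet-18, and the optimizer settings are illustrative assumptions (the pretrained=True flag is the older torchvision API).

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_CLASSES = 10  # hypothetical: number of classes in your small dataset

model = models.resnet18(pretrained=True)   # 1. network pretrained on ImageNet

for param in model.parameters():           # freeze the pretrained layers
    param.requires_grad = False

# 2. reinitialize the last layer for C classes and train only that layer
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```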

Page 101:

Life is never so simple

There are several areas being researched:
- Batch size
- Regularization and generalization
- Overparameterization and why SGD is so good

Why is this still not understood?
- Our understanding comes from built-in intuition that is repeated but not always tested
- Difficult to apply theory being developed

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 102:

DNNs: Over‐parametrization yields great generalization even without any explicit regularization

Zhang et al, Theory of deep learning III, September 2017

The interesting part: great generalization!

Note the zero training error in the over-parametrized part
• DNN forms patterns, thus generalizing well
• It also memorizes noisy examples, but in a harmless way
• All these need more understanding

Slide Credit: S. Sathiya Keerthi

Page 103:

Ben Recht Talk slides, ICLR 2017

More examples of great generalization without any regularization

n = number of examples
p = number of parameters
d = number of inputs
k = number of layers

Note that networks with larger p/n have better generalization

Slide Credit: S. Sathiya Keerthi

Page 104:

History of powerful DNN solvers on ImageNet (15 million examples) [Ref]

Slide Credit: S. Sathiya Keerthi

Page 105:

Batch Size

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

Page 106:

This experiment was done on a modified AlexNet (CNN) on the CIFAR-10 dataset.

Large batch yields inferior generalization.

Keskar et al: This is due to flatness properties of solutions

Slide Credit: S. Sathiya Keerthi

Page 107:

Generalization (negatively) correlates well with sharpness, thus explaining the superiority of small batch over large batch

This experiment was done on a modified AlexNet (CNN) on the CIFAR-10 dataset (figure legend: continuous vs. broken red lines)

Slide Credit: S. Sathiya Keerthi

Page 108:

Why does flatness mean better generalization?

Flatness implies that the test loss will be close to the training loss

(figure: a flat minimum is good; a sharp minimum is bad)

Slide Credit: S. Sathiya Keerthi

Page 109:

One solution: Distributed SGD

(figure: parameter servers store the model; multiple workers perform the gradient computations)

Slide Credit: Dimitris Papailiopoulos

