Deep Learning for Vision
Xiaoming Liu
Slides adapted from Adam Coates
What do we want ML to do?
• Given an image, predict complex high-level patterns:
Object recognition    Detection    Segmentation
“Cat”
[Martin et al., 2001]
How is ML done?
• Machine learning often uses a common pipeline with hand-designed feature extraction.
• The final ML algorithm learns to make decisions starting from the higher-level representation.
• Sometimes layers of increasingly high-level abstractions.
– Constructed using prior knowledge about the problem domain.
Feature Extraction → Machine Learning Algorithm → “Cat”?
(Prior Knowledge, Experience)
“Deep Learning”
• Deep Learning:
– Train multiple layers of features/abstractions from data.
– Try to discover a representation that makes decisions easy.
Low-level Features → Mid-level Features → High-level Features → Classifier → “Cat”?
(more abstract representation →)
Deep Learning: train layers of features so that the classifier works well.
“Deep Learning”
• Why do we want “deep learning”?
– Some decisions require many stages of processing.
• It is easy to invent cases where a “deep” model is compact but a shallow model is very large / inefficient.
– We already, intuitively, hand-engineer “layers” of representation.
• Let’s replace this with something automated!
– Algorithms scale well with data and computing power.
• In practice, one of the most consistently successful ways to get good results in ML.
• Can try to take advantage of unlabeled data to learn representations before the task.
Have we been here before?
➢ Yes.
– Basic ideas are common to past ML and neural networks research.
• Supervised learning is straightforward.
• Standard ML development strategies are still relevant.
• Some knowledge carries over from problem domains.
➢ No.
– Faster computers; more data.
– Better optimizers; better initialization schemes.
• “Unsupervised pre-training” trick [Hinton et al., 2006; Bengio et al., 2006]
– Lots of empirical evidence about what works.
• Made useful by the ability to “mix and match” components. [See, e.g., Jarrett et al., ICCV 2009]
Real impact
• DL systems are top performers in many tasks over many domains:
Image recognition [e.g., Krizhevsky et al., 2012]
Speech recognition [e.g., Heigold et al., 2013]
NLP [e.g., Socher et al., ICML 2011; Collobert & Weston, ICML 2008]
[Honglak Lee]
Outline
• ML refresher / crash course
– Logistic regression
– Optimization
– Features
• Supervised deep learning
– Neural network models
– Back-propagation
– Training procedures
• Supervised DL for images
– Neural network architectures for images
– Application to Image-Net
• References / resources
MACHINE LEARNING REFRESHER
Crash Course
Supervised Learning
• Given labeled training examples: {(x(i), y(i)); i = 1, …, m}.
• For instance: x(i) = vector of pixel intensities; y(i) = object class ID.
• Goal: find f(x) to predict y from x on training data.
– Hopefully the learned predictor also works on “test” data.
[255 98 93 87 …] → f(x) → y = 1 (“Cat”)
Logistic Regression
• Simple binary classification algorithm.
– Start with a function of the form:
  f(x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))
– Interpretation: f(x) is the probability that y = 1.
• The sigmoid “nonlinearity” squashes the linear function to [0, 1].
– Find the choice of θ that minimizes the objective:
  J(θ) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]
Optimization
• How do we tune θ to minimize J(θ)?
• One algorithm: gradient descent.
– Compute the gradient ∇θJ(θ).
– Follow the gradient “downhill”: θ := θ − α ∇θJ(θ).
• Stochastic Gradient Descent (SGD): take a step using the gradient from only a small batch of examples.
– Scales to larger datasets. [Bottou & LeCun, 2005]
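The recipe above can be sketched in a few lines of NumPy. This is an illustrative implementation only, not code from the slides; the function name and hyperparameter defaults are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, step_size=0.1, epochs=100, batch_size=32, seed=0):
    """Minimize the logistic-regression objective with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ theta)                 # predicted P(y = 1 | x)
            grad = X[idx].T @ (p - y[idx]) / len(idx)   # gradient of the objective
            theta -= step_size * grad                   # follow gradient downhill
    return theta
```

Each mini-batch gradient is a noisy but cheap estimate of the full gradient, which is what lets SGD scale to large datasets.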
Is this enough?
• The loss is convex ⇒ we always find the minimum.
• Works for simple problems:
– Classify digits as 0 or 1 using pixel intensity.
– Certain pixels are highly informative, e.g., the center pixel.
• Fails for even slightly harder problems.
– Is this a coffee mug?
Why is vision so hard?
“Coffee Mug” → Pixel Intensity
Pixel intensity is a very poor representation.
Why is vision so hard?
[Figure: the “Coffee Mug” image is reduced to a point in pixel-intensity space (e.g., Pixel Intensity [72 160]) and plotted on pixel 1 vs. pixel 2 axes, along with coffee-mug (+) and not-coffee-mug (−) examples.]
Why is vision so hard?
[Figure: in pixel space (pixel 1 vs. pixel 2), the coffee-mug (+) and not-coffee-mug (−) examples are interleaved and not separable, so a learning algorithm operating on raw pixels struggles to answer “Is this a Coffee Mug?”]
Features
[Figure: re-plotting the same examples by feature responses (“cylinder?” vs. “handle?”) pulls the coffee-mug (+) and not-coffee-mug (−) examples apart, so the learning algorithm can easily answer “Is this a Coffee Mug?”]
Features
• Features are usually hard-wired transformations built into the system.
– Formally, a function φ(x) that maps raw input to a “higher-level” representation.
– Completely static, so just substitute φ(x) for x and do logistic regression as before.
Where do we get good features?
Features
• A huge investment has been devoted to building application-specific feature representations.
– Find higher-level patterns so that the final decision is easy to learn with an ML algorithm.
Object Bank [Li et al., 2010]    Super-pixels [Gould et al., 2008; Ren & Malik, 2003]
SIFT [Lowe, 1999]    Spin Images [Johnson & Hebert, 1999]
SUPERVISED DEEP LEARNING
Extension to neural networks
Basic idea
• We saw how to do supervised learning when the “features” φ(x) are fixed.
– Let’s extend to the case where the features are given by tunable functions with their own parameters:
  f(x) = σ(θᵀh), with features h = σ(Wx)
– The inputs to the outer layer are “features”, one for each row of W; the outer part of the function is the same as logistic regression.
Basic idea
• To do supervised learning for two-class classification, minimize the same objective as for logistic regression:
  J(θ, W) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]
• Same as logistic regression, but now f(x) has multiple stages (“layers”, “modules”):
  x → h = σ(Wx) (intermediate representation, the “features”) → f = σ(θᵀh) (prediction for y)
Neural network
• This model is a sigmoid “neural network”: each unit is a “neuron”, and the flow of computation from input to output is “forward prop”.
Neural network
• Can stack up several layers: the network must learn multiple stages of internal “representation”.
Back-propagation
• Minimize J(θ, W) as before.
• To minimize, we need the gradients ∂J/∂θ and ∂J/∂W.
– Then use the gradient descent algorithm as before.
• The formula for ∂J/∂θ can be found by hand (same as for logistic regression); but what about W?
The Chain Rule
• Suppose we have a module that computes z = g(x; W).
• If we know ∂J/∂z, the chain rule gives the gradient with respect to the input:
  ∂J/∂x = (∂z/∂x)ᵀ (∂J/∂z), where ∂z/∂x is a Jacobian matrix.
• Similarly for W:
  ∂J/∂W = (∂z/∂W)ᵀ (∂J/∂z)
➢ Given the gradient with respect to a module’s output, we can build a new “module” that finds the gradient with respect to its inputs.
The Chain Rule
• It is easy to build a toolkit of known rules to compute gradients given ∂J/∂z.
– Automated differentiation! E.g., Theano [Bergstra et al., 2010]
[Table: Function | Gradient w.r.t. input | Gradient w.r.t. parameters]
Back-propagation
• Can re-apply the chain rule to get gradients for all intermediate values and parameters: a “backward” module for each forward stage.
Example
• Given the gradient of the loss at the network output, compute the gradient with respect to an internal weight matrix, using several items from our table.
Training Procedure
• Collect labeled training data.
– For SGD: randomly shuffle after each epoch!
• For a batch of examples:
– Compute the gradient w.r.t. all parameters in the network.
– Make a small update to the parameters.
– Repeat until convergence.
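As a concrete (hypothetical) instance of this procedure, here is a minimal NumPy training loop for a one-hidden-layer sigmoid network: forward prop, back-propagation via the chain rule, shuffling each epoch, and small per-example SGD updates. All names and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer_net(X, y, n_hidden=8, step_size=0.5, epochs=500, seed=0):
    """One hidden sigmoid layer + logistic output, trained by backprop + SGD."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))  # hidden-layer weights
    theta = rng.uniform(-0.5, 0.5, n_hidden)            # output weights
    for _ in range(epochs):
        for i in rng.permutation(len(X)):               # shuffle each epoch
            x = X[i]
            # Forward prop
            h = sigmoid(W @ x)                    # hidden "features"
            f = sigmoid(theta @ h)                # prediction P(y = 1 | x)
            # Backward prop (chain rule, module by module)
            df = f - y[i]                         # dJ/d(output pre-activation)
            dtheta = df * h                       # dJ/dtheta
            dh = df * theta                       # dJ/dh
            dW = np.outer(dh * h * (1 - h), x)    # dJ/dW via sigmoid derivative
            # Small parameter updates
            theta -= step_size * dtheta
            W -= step_size * dW
    return W, theta
```

Each backward line reuses the gradient computed for the stage after it, which is exactly the chain-rule structure described above.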
Training Procedure
• Historically, this has not worked so easily.
– Non-convex: local minima; convergence criteria.
– Optimization becomes difficult with many stages.
• “Vanishing gradient problem”
– Hard to diagnose and debug malfunctions.
• Many things turn out to matter:
– Choice of nonlinearities.
– Initialization of parameters.
– Optimizer parameters: step size, schedule.
Nonlinearities
• The choice of functions inside the network matters.
– The sigmoid function turns out to be difficult to train.
– Some other choices often used:
  tanh(z)    ReLU(z) = max{0, z}    abs(z)
– “Rectified Linear Unit” (ReLU) → increasingly popular. [Nair & Hinton, 2010]
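The alternatives and their derivatives are easy to state in code. Comparing gradients at large inputs shows why ReLU avoids the saturation that makes sigmoid-like units hard to train (an illustrative sketch, not code from the slides):

```python
import numpy as np

# Common nonlinearities used in place of the sigmoid.
def tanh(z):
    return np.tanh(z)

def relu(z):                       # "Rectified Linear Unit"
    return np.maximum(0.0, z)

def absval(z):
    return np.abs(z)

# Their derivatives: tanh saturates (gradient -> 0 for large |z|),
# while ReLU keeps gradient exactly 1 for all positive inputs.
def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)
```

For example, tanh_grad(5.0) is already below 10⁻³, while relu_grad stays 1 for any positive input; this is the vanishing-gradient issue in miniature.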
Initialization
• Usually small random values.
– Try to choose them so that the typical input to a neuron avoids saturating / non-differentiable areas.
– Has to be random, otherwise every neuron will learn the same thing!
– Occasionally inspect units for saturation / blowup.
– Larger values may give faster convergence, but worse models!
• Initialization schemes for particular units:
– tanh units: Unif[−r, r]; sigmoid: Unif[−4r, 4r]. See [Glorot et al., AISTATS 2010].
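The slide leaves r unspecified; in Glorot & Bengio (AISTATS 2010) it is r = √(6 / (fan_in + fan_out)). A sketch under that assumption (the function name is ours):

```python
import numpy as np

def init_weights(fan_in, fan_out, unit="tanh", seed=0):
    """Uniform initialization following Glorot & Bengio (2010):
    r = sqrt(6 / (fan_in + fan_out)); sigmoid units use the wider Unif[-4r, 4r]."""
    rng = np.random.default_rng(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))
```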
Optimization: Step sizes
• Choose the SGD step size carefully.
– A factor of up to ~2 can make a difference.
• Strategies:
– Brute force: try many; pick the one with the best result.
– Racing: pick the size with the best error on validation data after T steps.
• Not always accurate if T is too small.
• Step size schedules:
– Fixed: same step size throughout.
– Step
– Multi-step
– Inverse
Bengio, 2012: “Practical Recommendations for Gradient-Based Training of Deep Architectures”
Hinton, 2010: “A Practical Guide to Training Restricted Boltzmann Machines”
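The four schedules named above can be parameterized in several ways; the sketch below uses one common choice, with illustrative γ, milestone, and power values (none of these defaults come from the slides):

```python
def step_size(t, base=0.01, policy="fixed", gamma=0.1, step=1000,
              milestones=(1000, 3000), power=0.5):
    """Illustrative SGD step-size schedules as a function of iteration t."""
    if policy == "fixed":        # same step size throughout
        return base
    if policy == "step":         # drop by factor gamma every `step` iterations
        return base * gamma ** (t // step)
    if policy == "multistep":    # drop by factor gamma at chosen milestones
        return base * gamma ** sum(t >= m for m in milestones)
    if policy == "inverse":      # smooth polynomial decay
        return base * (1.0 + gamma * t) ** (-power)
    raise ValueError(policy)
```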
Optimization: Momentum
• Keep a “smoothed” estimate of the gradient from several steps of SGD and update with that instead.
• A little bit like second-order information:
– High-curvature directions cancel out.
– Low-curvature directions “add up” and accelerate.
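A minimal sketch of the classical momentum update (the 0.9 coefficient is a typical but arbitrary choice, not from the slides), which can be checked on a small quadratic whose curvature differs by 10× between directions:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, step_size=0.01, momentum=0.9):
    """Classical momentum: the velocity is a decaying sum of past gradients.
    Directions where gradients keep agreeing accelerate; directions where
    they oscillate (high curvature) largely cancel out."""
    velocity = momentum * velocity - step_size * grad
    return theta + velocity, velocity
```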
Other factors
• A “weight decay” penalty can help.
– Add a small penalty for the squared weight magnitude.
• For modest datasets, L-BFGS or other second-order methods are easier than SGD.
– See, e.g., Martens & Sutskever, ICML 2011.
– Can crudely extend to the mini-batch case if batches are large. [Le et al., ICML 2011]
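Weight decay amounts to one extra term in the gradient. A sketch (the penalty coefficient here is illustrative):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, step_size=0.1, weight_decay=1e-4):
    """SGD update for J(theta) + 0.5 * weight_decay * ||theta||^2:
    the squared-magnitude penalty adds weight_decay * theta to the gradient,
    gently shrinking the weights toward zero on every step."""
    return theta - step_size * (grad + weight_decay * theta)
```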
SUPERVISED DL FOR VISION
Application
Working with images
• Major factors:
– Choose the functional form of the network to roughly match the computations we need to represent.
• E.g., “selective” features and “invariant” features.
– Try to exploit knowledge of images to accelerate training or improve performance.
• Generally try to avoid wiring detailed visual knowledge into the system; prefer to learn it.
Local connectivity
• Neural network view of a single neuron: a fully connected neuron has an extremely large number of connections.
→ More parameters to train.
→ Higher computational expense.
→ Turns out not to be helpful in practice.
Local connectivity
• Reduce the number of parameters with local connections.
– The weight vector is a spatially localized “filter”.
Local connectivity
• Sometimes think of neurons as viewing small adjacent windows.
– Specify connectivity by the size (“receptive field” size) and spacing (“step” or “stride”) of the windows.
• Typical RF size = 3 to 20 pixels.
• Typical step size = 1 pixel up to the RF size.
Local connectivity
• The spatial organization of the filters means the output features can also be organized like an image.
– The X, Y dimensions correspond to the X, Y position of the neuron’s window.
– “Channels” are different features extracted from the same spatial location. (Also called “feature maps”, or “maps”.)
[1-dimensional example: a 1D input produces outputs indexed by X spatial location and by “channel” (“map”) index.]
Local connectivity
➢ We can treat the output of a layer like an image and re-use the same tricks.
[1-dimensional example: the layer’s output is again indexed by X spatial location and by “channel” (“map”) index.]
Weight-Tying
• Even with local connections, we may still have too many weights.
– Trick: constrain some weights to be equal if we know that some parts of the input should learn the same kinds of features.
– Images tend to be “stationary”: different patches tend to have similar low-level structure.
➢ Constrain the weights used at different spatial positions to be equal.
Weight-Tying
➢ Before, we could have neurons with different weights at different locations; we can reduce the number of parameters by making them equal.
[1-dimensional example: the same filter weights are applied at every X spatial location to produce each “channel” (“map”).]
• Sometimes called a “convolutional” network: each unique filter is spatially convolved with the input to produce the responses for each map. [LeCun et al., 1989; LeCun et al., 2004]
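Weight tying is exactly what “convolutional” refers to: one shared filter slid across every window of the input. A 1-D sketch (single filter, single output map, no nonlinearity; names are illustrative):

```python
import numpy as np

def conv1d_valid(x, w, stride=1):
    """Apply one shared filter w at every "valid" window of 1-D input x.
    The same weights are reused at every spatial position (weight tying),
    producing one output value per window position."""
    rf = len(w)                                    # receptive-field size
    positions = range(0, len(x) - rf + 1, stride)  # window start positions
    return np.array([x[p:p + rf] @ w for p in positions])
```

A full convolutional layer would simply run several such filters (one per output map) followed by a nonlinearity.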
Pooling
• Functional layers designed to represent invariant features.
• Usually locally connected, with specific nonlinearities.
– Combined with convolution, this corresponds to hard-wired translation invariance.
• Usually fix the weights to a local box or Gaussian filter.
– Easy to represent max-, average-, or 2-norm pooling.
[Scherer et al., ICANN 2010] [Boureau et al., ICML 2010]
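The three pooling variants named above differ only in how each local window is reduced to one number. A 1-D sketch for illustration:

```python
import numpy as np

def pool1d(x, size=2, stride=2, kind="max"):
    """Reduce each local window of x to a single value:
    max-, average-, or 2-norm pooling over windows of `size`, spaced by `stride`."""
    out = []
    for p in range(0, len(x) - size + 1, stride):
        win = x[p:p + size]
        if kind == "max":
            out.append(win.max())
        elif kind == "average":
            out.append(win.mean())
        elif kind == "2-norm":
            out.append(np.sqrt((win ** 2).sum()))
        else:
            raise ValueError(kind)
    return np.array(out)
```

Because the pooled value barely changes when a feature shifts within its window, pooling trades spatial precision for a small amount of translation invariance.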
Batch Normalization
• Reduce internal covariate shift during training.
• Normalize spatially over all samples in each batch (samples 1 … m):
  μ_B ← (1 / (W·H·m)) Σ_{w,h,i} x_{w,h,i}
  σ²_B ← (1 / (W·H·m)) Σ_{w,h,i} (x_{w,h,i} − μ_B)²
  x̂_{w,h,i} ← (x_{w,h,i} − μ_B) / √(σ²_B + ε)
  y_{w,h,i} ← γ x̂_{w,h,i} + β ≡ BN_{γ,β}(x_{w,h,i})
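The equations above translate directly to NumPy. This sketch handles a single feature map with activations stored as (batch, height, width) and scalar γ, β (a layout chosen for illustration):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature map, x of shape (m, H, W):
    compute the mean and variance over all W*H*m values in the batch,
    normalize, then scale by gamma and shift by beta."""
    mu = x.mean()                               # mu_B
    var = x.var()                               # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalized activations
    return gamma * x_hat + beta                 # BN_{gamma,beta}(x)
```

At training time γ and β are learned parameters; at test time the batch statistics are replaced by running averages collected during training.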
Application: Image-Net
• System from Krizhevsky et al., NIPS 2012:
– Convolutional neural network.
– Max-pooling.
– Rectified linear units (ReLU).
– Local response normalization.
– Local connectivity.
Application: Image-Net
• Top result in LSVRC 2012: ~85% top-5 accuracy.
What’s an Agaric!?
More applications
• Segmentation: predict the classes of pixels / super-pixels. [Ciresan et al., NIPS 2012; Farabet et al., ICML 2012]
• Detection: combine classifiers with a sliding-window architecture.
– Economical when used with convolutional nets. [Pierre Sermanet (2010): http://www.youtube.com/watch?v=f9CuzqI1SkE]
• Robotic grasping. [Lenz et al., RSS 2013]
Go
AlphaGo beat a top human player (Lee Sedol) in March 2016, winning 4 games to 1 (win, win, win, loss, win).
Games are available at https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
How does AlphaGo work?
(For all the details, see their paper: www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf)
• Combines two techniques:
– Monte Carlo Tree Search (MCTS)
• Historical advantages of MCTS over minimax-search approaches:
– Does not require an evaluation function.
– Typically works better with large branching factors.
• Improved Go programs by ~10 kyu around 2006.
– Deep learning
• “Value networks” to evaluate board positions, and
• “Policy networks” to select moves.
AlphaGo hardware
• Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics.
• AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes the policy and value networks in parallel on GPUs.
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
• (They also implemented a distributed version.)
Speech Recognition
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, “Deep Speech: Scaling up end-to-end speech recognition”, arXiv:1412.5567v2, 2014.
• Output: characters (and space), plus a null symbol (~).
• Input: acoustic features (spectrograms).
• No phonemes or lexicon (no OOV problem).
Resources
Tutorials
Stanford Deep Learning tutorial: http://ufldl.stanford.edu/wiki
Deep Learning tutorials list: http://deeplearning.net/tutorials
IPAM DL/UFL Summer School: http://www.ipam.ucla.edu/programs/gss2012/
ICML 2012 Representation Learning Tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
References
http://www.stanford.edu/~acoates/bmvc2013refs.pdf
Overviews:
Yoshua Bengio, “Practical Recommendations for Gradient-Based Training of Deep Architectures”
Yoshua Bengio & Yann LeCun, “Scaling Learning Algorithms towards AI”
Yoshua Bengio, Aaron Courville & Pascal Vincent, “Representation Learning: A Review and New Perspectives”
Software:
Theano GPU library: http://deeplearning.net/software/theano
SPAMS toolkit: http://spams-devel.gforge.inria.fr/