Deep Learning for Vision
Xiaoming Liu
Slides adapted from Adam Coates
What do we want ML to do?
• Given an image, predict complex high-level patterns:
Object recognition    Detection    Segmentation
“Cat”
[Martin et al., 2001]
How is ML done?
• Machine learning often uses a common pipeline with hand-designed feature extraction.
• The final ML algorithm learns to make decisions starting from the higher-level representation.
• Sometimes layers of increasingly high-level abstractions.
– Constructed using prior knowledge about the problem domain.
Feature Extraction → Machine Learning Algorithm → “Cat”?
(Prior Knowledge, Experience)
“Deep Learning”
• Deep Learning:
– Train multiple layers of features/abstractions from data.
– Try to discover a representation that makes decisions easy.
Low-level Features → Mid-level Features → High-level Features → Classifier → “Cat”?
(more abstract representation →)
Deep Learning: train layers of features so that the classifier works well.
“Deep Learning”
• Why do we want “deep learning”?
– Some decisions require many stages of processing.
• It is easy to invent cases where a “deep” model is compact but a shallow model is very large / inefficient.
– We already, intuitively, hand-engineer “layers” of representation.
• Let’s replace this with something automated!
– Algorithms scale well with data and computing power.
• In practice, one of the most consistently successful ways to get good results in ML.
• Can try to take advantage of unlabeled data to learn representations before the task.
Have we been here before?
➢ Yes.
– Basic ideas are common to past ML and neural networks research.
• Supervised learning is straightforward.
• Standard ML development strategies are still relevant.
• Some knowledge carries over from problem domains.
➢ No.
– Faster computers; more data.
– Better optimizers; better initialization schemes.
• “Unsupervised pre-training” trick [Hinton et al., 2006; Bengio et al., 2006]
– Lots of empirical evidence about what works.
• Made useful by the ability to “mix and match” components. [See, e.g., Jarrett et al., ICCV 2009]
Real impact
• DL systems are top performers in many tasks over many domains:
Image recognition [e.g., Krizhevsky et al., 2012]
Speech recognition [e.g., Heigold et al., 2013]
NLP [e.g., Socher et al., ICML 2011; Collobert & Weston, ICML 2008]
[Honglak Lee]
Outline
• ML refresher / crash course
– Logistic regression
– Optimization
– Features
• Supervised deep learning
– Neural network models
– Back-propagation
– Training procedures
• Supervised DL for images
– Neural network architectures for images
– Application to Image-Net
• References / resources
MACHINE LEARNING REFRESHER
Crash Course
Supervised Learning
• Given labeled training examples: {(x(i), y(i)); i = 1, …, m}.
• For instance: x(i) = vector of pixel intensities; y(i) = object class ID.
• Goal: find f(x) to predict y from x on training data.
– Hopefully the learned predictor also works on “test” data.
[255 98 93 87 …] → f(x) → y = 1 (“Cat”)
Logistic Regression
• Simple binary classification algorithm.
– Start with a function of the form:
  f(x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))
– Interpretation: f(x) is the probability that y = 1.
• The sigmoid “nonlinearity” squashes the linear function to [0, 1].
– Find the choice of θ that minimizes the objective:
  J(θ) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]
Optimization
• How do we tune θ to minimize J(θ)?
• One algorithm: gradient descent.
– Compute the gradient ∇θJ(θ).
– Follow the gradient “downhill”: θ := θ − α ∇θJ(θ).
• Stochastic Gradient Descent (SGD): take a step using the gradient from only a small batch of examples.
– Scales to larger datasets. [Bottou & LeCun, 2005]
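The recipe above can be sketched in a few lines of NumPy. This is an illustrative implementation only, not code from the slides; the function name and hyperparameter defaults are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, step_size=0.1, epochs=100, batch_size=32, seed=0):
    """Minimize the logistic-regression objective with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ theta)                 # predicted P(y = 1 | x)
            grad = X[idx].T @ (p - y[idx]) / len(idx)   # gradient of the objective
            theta -= step_size * grad                   # follow gradient downhill
    return theta
```

Each mini-batch gradient is a noisy but cheap estimate of the full gradient, which is what lets SGD scale to large datasets.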
Is this enough?
• The loss is convex ⇒ we always find the minimum.
• Works for simple problems:
– Classify digits as 0 or 1 using pixel intensity.
– Certain pixels are highly informative, e.g., the center pixel.
• Fails for even slightly harder problems.
– Is this a coffee mug?
Why is vision so hard?
“Coffee Mug” → Pixel Intensity
Pixel intensity is a very poor representation.
Why is vision so hard?
[Figure: the “Coffee Mug” image is reduced to a point in pixel-intensity space (e.g., Pixel Intensity [72 160]) and plotted on pixel 1 vs. pixel 2 axes, along with coffee-mug (+) and not-coffee-mug (−) examples.]
Why is vision so hard?
[Figure: in pixel space (pixel 1 vs. pixel 2), the coffee-mug (+) and not-coffee-mug (−) examples are interleaved and not separable, so a learning algorithm operating on raw pixels struggles to answer “Is this a Coffee Mug?”]
Features
[Figure: re-plotting the same examples by feature responses (“cylinder?” vs. “handle?”) pulls the coffee-mug (+) and not-coffee-mug (−) examples apart, so the learning algorithm can easily answer “Is this a Coffee Mug?”]
Features
• Features are usually hard-wired transformations built into the system.
– Formally, a function φ(x) that maps raw input to a “higher-level” representation.
– Completely static, so just substitute φ(x) for x and do logistic regression as before.
Where do we get good features?
Features
• A huge investment has been devoted to building application-specific feature representations.
– Find higher-level patterns so that the final decision is easy to learn with an ML algorithm.
Object Bank [Li et al., 2010]    Super-pixels [Gould et al., 2008; Ren & Malik, 2003]
SIFT [Lowe, 1999]    Spin Images [Johnson & Hebert, 1999]
SUPERVISED DEEP LEARNING
Extension to neural networks
Basic idea
• We saw how to do supervised learning when the “features” φ(x) are fixed.
– Let’s extend to the case where the features are given by tunable functions with their own parameters:
  f(x) = σ(θᵀh), with features h = σ(Wx)
– The inputs to the outer layer are “features”, one for each row of W; the outer part of the function is the same as logistic regression.
Basic idea
• To do supervised learning for two-class classification, minimize the same objective as for logistic regression:
  J(θ, W) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]
• Same as logistic regression, but now f(x) has multiple stages (“layers”, “modules”):
  x → h = σ(Wx) (intermediate representation, the “features”) → f = σ(θᵀh) (prediction for y)
Neural network
• This model is a sigmoid “neural network”: each unit is a “neuron”, and the flow of computation from input to output is “forward prop”.
Neural network
• Can stack up several layers: the network must learn multiple stages of internal “representation”.
Back-propagation
• Minimize J(θ, W) as before.
• To minimize, we need the gradients ∂J/∂θ and ∂J/∂W.
– Then use the gradient descent algorithm as before.
• The formula for ∂J/∂θ can be found by hand (same as for logistic regression); but what about W?
The Chain Rule
• Suppose we have a module that computes z = g(x; W).
• If we know ∂J/∂z, the chain rule gives the gradient with respect to the input:
  ∂J/∂x = (∂z/∂x)ᵀ (∂J/∂z), where ∂z/∂x is a Jacobian matrix.
• Similarly for W:
  ∂J/∂W = (∂z/∂W)ᵀ (∂J/∂z)
➢ Given the gradient with respect to a module’s output, we can build a new “module” that finds the gradient with respect to its inputs.
The Chain Rule
• It is easy to build a toolkit of known rules to compute gradients given ∂J/∂z.
– Automated differentiation! E.g., Theano [Bergstra et al., 2010]
[Table: Function | Gradient w.r.t. input | Gradient w.r.t. parameters]
Back-propagation
• Can re-apply the chain rule to get gradients for all intermediate values and parameters: a “backward” module for each forward stage.
Example
• Given the gradient of the loss at the network output, compute the gradient with respect to an internal weight matrix, using several items from our table.
Training Procedure
• Collect labeled training data.
– For SGD: randomly shuffle after each epoch!
• For a batch of examples:
– Compute the gradient w.r.t. all parameters in the network.
– Make a small update to the parameters.
– Repeat until convergence.
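As a concrete (hypothetical) instance of this procedure, here is a minimal NumPy training loop for a one-hidden-layer sigmoid network: forward prop, back-propagation via the chain rule, shuffling each epoch, and small per-example SGD updates. All names and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer_net(X, y, n_hidden=8, step_size=0.5, epochs=500, seed=0):
    """One hidden sigmoid layer + logistic output, trained by backprop + SGD."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))  # hidden-layer weights
    theta = rng.uniform(-0.5, 0.5, n_hidden)            # output weights
    for _ in range(epochs):
        for i in rng.permutation(len(X)):               # shuffle each epoch
            x = X[i]
            # Forward prop
            h = sigmoid(W @ x)                    # hidden "features"
            f = sigmoid(theta @ h)                # prediction P(y = 1 | x)
            # Backward prop (chain rule, module by module)
            df = f - y[i]                         # dJ/d(output pre-activation)
            dtheta = df * h                       # dJ/dtheta
            dh = df * theta                       # dJ/dh
            dW = np.outer(dh * h * (1 - h), x)    # dJ/dW via sigmoid derivative
            # Small parameter updates
            theta -= step_size * dtheta
            W -= step_size * dW
    return W, theta
```

Each backward line reuses the gradient computed for the stage after it, which is exactly the chain-rule structure described above.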
Training Procedure
• Historically, this has not worked so easily.
– Non-convex: local minima; convergence criteria.
– Optimization becomes difficult with many stages.
• “Vanishing gradient problem”
– Hard to diagnose and debug malfunctions.
• Many things turn out to matter:
– Choice of nonlinearities.
– Initialization of parameters.
– Optimizer parameters: step size, schedule.
Nonlinearities
• The choice of functions inside the network matters.
– The sigmoid function turns out to be difficult to train.
– Some other choices often used:
  tanh(z)    ReLU(z) = max{0, z}    abs(z)
– “Rectified Linear Unit” (ReLU) → increasingly popular. [Nair & Hinton, 2010]
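The alternatives and their derivatives are easy to state in code. Comparing gradients at large inputs shows why ReLU avoids the saturation that makes sigmoid-like units hard to train (an illustrative sketch, not code from the slides):

```python
import numpy as np

# Common nonlinearities used in place of the sigmoid.
def tanh(z):
    return np.tanh(z)

def relu(z):                       # "Rectified Linear Unit"
    return np.maximum(0.0, z)

def absval(z):
    return np.abs(z)

# Their derivatives: tanh saturates (gradient -> 0 for large |z|),
# while ReLU keeps gradient exactly 1 for all positive inputs.
def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    return (z > 0).astype(float)
```

For example, tanh_grad(5.0) is already below 10⁻³, while relu_grad stays 1 for any positive input; this is the vanishing-gradient issue in miniature.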
Initialization
• Usually small random values.
– Try to choose them so that the typical input to a neuron avoids saturating / non-differentiable areas.
– Has to be random, otherwise every neuron will learn the same thing!
– Occasionally inspect units for saturation / blowup.
– Larger values may give faster convergence, but worse models!
• Initialization schemes for particular units:
– tanh units: Unif[−r, r]; sigmoid: Unif[−4r, 4r]. See [Glorot et al., AISTATS 2010].
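The slide leaves r unspecified; in Glorot & Bengio (AISTATS 2010) it is r = √(6 / (fan_in + fan_out)). A sketch under that assumption (the function name is ours):

```python
import numpy as np

def init_weights(fan_in, fan_out, unit="tanh", seed=0):
    """Uniform initialization following Glorot & Bengio (2010):
    r = sqrt(6 / (fan_in + fan_out)); sigmoid units use the wider Unif[-4r, 4r]."""
    rng = np.random.default_rng(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))
```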
Optimization: Step sizes
• Choose the SGD step size carefully.
– A factor of up to ~2 can make a difference.
• Strategies:
– Brute force: try many; pick the one with the best result.
– Racing: pick the size with the best error on validation data after T steps.
• Not always accurate if T is too small.
• Step size schedules:
– Fixed: same step size throughout.
– Step
– Multi-step
– Inverse
Bengio, 2012: “Practical Recommendations for Gradient-Based Training of Deep Architectures”
Hinton, 2010: “A Practical Guide to Training Restricted Boltzmann Machines”
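The four schedules named above can be parameterized in several ways; the sketch below uses one common choice, with illustrative γ, milestone, and power values (none of these defaults come from the slides):

```python
def step_size(t, base=0.01, policy="fixed", gamma=0.1, step=1000,
              milestones=(1000, 3000), power=0.5):
    """Illustrative SGD step-size schedules as a function of iteration t."""
    if policy == "fixed":        # same step size throughout
        return base
    if policy == "step":         # drop by factor gamma every `step` iterations
        return base * gamma ** (t // step)
    if policy == "multistep":    # drop by factor gamma at chosen milestones
        return base * gamma ** sum(t >= m for m in milestones)
    if policy == "inverse":      # smooth polynomial decay
        return base * (1.0 + gamma * t) ** (-power)
    raise ValueError(policy)
```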
Optimization: Momentum
• Keep a “smoothed” estimate of the gradient from several steps of SGD and update with that instead.
• A little bit like second-order information:
– High-curvature directions cancel out.
– Low-curvature directions “add up” and accelerate.
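A minimal sketch of the classical momentum update (the 0.9 coefficient is a typical but arbitrary choice, not from the slides), which can be checked on a small quadratic whose curvature differs by 10× between directions:

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, step_size=0.01, momentum=0.9):
    """Classical momentum: the velocity is a decaying sum of past gradients.
    Directions where gradients keep agreeing accelerate; directions where
    they oscillate (high curvature) largely cancel out."""
    velocity = momentum * velocity - step_size * grad
    return theta + velocity, velocity
```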
Other factors
• A “weight decay” penalty can help.
– Add a small penalty for the squared weight magnitude.
• For modest datasets, L-BFGS or other second-order methods are easier than SGD.
– See, e.g., Martens & Sutskever, ICML 2011.
– Can crudely extend to the mini-batch case if batches are large. [Le et al., ICML 2011]
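Weight decay amounts to one extra term in the gradient. A sketch (the penalty coefficient here is illustrative):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, step_size=0.1, weight_decay=1e-4):
    """SGD update for J(theta) + 0.5 * weight_decay * ||theta||^2:
    the squared-magnitude penalty adds weight_decay * theta to the gradient,
    gently shrinking the weights toward zero on every step."""
    return theta - step_size * (grad + weight_decay * theta)
```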
SUPERVISED DL FOR VISION
Application
Working with images
• Major factors:
– Choose the functional form of the network to roughly match the computations we need to represent.
• E.g., “selective” features and “invariant” features.
– Try to exploit knowledge of images to accelerate training or improve performance.
• Generally try to avoid wiring detailed visual knowledge into the system; prefer to learn it.
Local connectivity
• Neural network view of a single neuron: a fully connected neuron has an extremely large number of connections.
→ More parameters to train.
→ Higher computational expense.
→ Turns out not to be helpful in practice.
Local connectivity
• Reduce the number of parameters with local connections.
– The weight vector is a spatially localized “filter”.
Local connectivity
• Sometimes think of neurons as viewing small adjacent windows.
– Specify connectivity by the size (“receptive field” size) and spacing (“step” or “stride”) of the windows.
• Typical RF size = 3 to 20 pixels.
• Typical step size = 1 pixel up to the RF size.
Local connectivity
• The spatial organization of the filters means the output features can also be organized like an image.
– The X, Y dimensions correspond to the X, Y position of the neuron’s window.
– “Channels” are different features extracted from the same spatial location. (Also called “feature maps”, or “maps”.)
[1-dimensional example: a 1D input produces outputs indexed by X spatial location and by “channel” (“map”) index.]
Local connectivity
➢ We can treat the output of a layer like an image and re-use the same tricks.
[1-dimensional example: the layer’s output is again indexed by X spatial location and by “channel” (“map”) index.]
Weight-Tying
• Even with local connections, we may still have too many weights.
– Trick: constrain some weights to be equal if we know that some parts of the input should learn the same kinds of features.
– Images tend to be “stationary”: different patches tend to have similar low-level structure.
➢ Constrain the weights used at different spatial positions to be equal.
Weight-Tying
➢ Before, we could have neurons with different weights at different locations; we can reduce the number of parameters by making them equal.
[1-dimensional example: the same filter weights are applied at every X spatial location to produce each “channel” (“map”).]
• Sometimes called a “convolutional” network: each unique filter is spatially convolved with the input to produce the responses for each map. [LeCun et al., 1989; LeCun et al., 2004]
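Weight tying is exactly what “convolutional” refers to: one shared filter slid across every window of the input. A 1-D sketch (single filter, single output map, no nonlinearity; names are illustrative):

```python
import numpy as np

def conv1d_valid(x, w, stride=1):
    """Apply one shared filter w at every "valid" window of 1-D input x.
    The same weights are reused at every spatial position (weight tying),
    producing one output value per window position."""
    rf = len(w)                                    # receptive-field size
    positions = range(0, len(x) - rf + 1, stride)  # window start positions
    return np.array([x[p:p + rf] @ w for p in positions])
```

A full convolutional layer would simply run several such filters (one per output map) followed by a nonlinearity.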
Pooling
• Functional layers designed to represent invariant features.
• Usually locally connected, with specific nonlinearities.
– Combined with convolution, this corresponds to hard-wired translation invariance.
• Usually fix the weights to a local box or Gaussian filter.
– Easy to represent max-, average-, or 2-norm pooling.
[Scherer et al., ICANN 2010] [Boureau et al., ICML 2010]
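The three pooling variants named above differ only in how each local window is reduced to one number. A 1-D sketch for illustration:

```python
import numpy as np

def pool1d(x, size=2, stride=2, kind="max"):
    """Reduce each local window of x to a single value:
    max-, average-, or 2-norm pooling over windows of `size`, spaced by `stride`."""
    out = []
    for p in range(0, len(x) - size + 1, stride):
        win = x[p:p + size]
        if kind == "max":
            out.append(win.max())
        elif kind == "average":
            out.append(win.mean())
        elif kind == "2-norm":
            out.append(np.sqrt((win ** 2).sum()))
        else:
            raise ValueError(kind)
    return np.array(out)
```

Because the pooled value barely changes when a feature shifts within its window, pooling trades spatial precision for a small amount of translation invariance.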
Batch Normalization
• Reduce internal covariate shift during training.
• Normalize spatially over all samples in each batch (samples 1 … m):
  μ_B ← (1 / (W·H·m)) Σ_{w,h,i} x_{w,h,i}
  σ²_B ← (1 / (W·H·m)) Σ_{w,h,i} (x_{w,h,i} − μ_B)²
  x̂_{w,h,i} ← (x_{w,h,i} − μ_B) / √(σ²_B + ε)
  y_{w,h,i} ← γ x̂_{w,h,i} + β ≡ BN_{γ,β}(x_{w,h,i})
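The equations above translate directly to NumPy. This sketch handles a single feature map with activations stored as (batch, height, width) and scalar γ, β (a layout chosen for illustration):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature map, x of shape (m, H, W):
    compute the mean and variance over all W*H*m values in the batch,
    normalize, then scale by gamma and shift by beta."""
    mu = x.mean()                               # mu_B
    var = x.var()                               # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalized activations
    return gamma * x_hat + beta                 # BN_{gamma,beta}(x)
```

At training time γ and β are learned parameters; at test time the batch statistics are replaced by running averages collected during training.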
Application: Image-Net
• System from Krizhevsky et al., NIPS 2012:
– Convolutional neural network.
– Max-pooling.
– Rectified linear units (ReLU).
– Local response normalization.
– Local connectivity.
Application: Image-Net
• Top result in LSVRC 2012: ~85% top-5 accuracy.
What’s an Agaric!?
More applications
• Segmentation: predict the classes of pixels / super-pixels. [Ciresan et al., NIPS 2012; Farabet et al., ICML 2012]
• Detection: combine classifiers with a sliding-window architecture.
– Economical when used with convolutional nets. [Pierre Sermanet (2010): http://www.youtube.com/watch?v=f9CuzqI1SkE]
• Robotic grasping. [Lenz et al., RSS 2013]
Go
AlphaGo beat a top human player (Lee Sedol) in March 2016, winning 4 games to 1 (win, win, win, loss, win).
Games are available at https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
How does AlphaGo work?
(For all the details, see their paper: www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf)
• Combines two techniques:
– Monte Carlo Tree Search (MCTS)
• Historical advantages of MCTS over minimax-search approaches:
– Does not require an evaluation function.
– Typically works better with large branching factors.
• Improved Go programs by ~10 kyu around 2006.
– Deep learning
• “Value networks” to evaluate board positions, and
• “Policy networks” to select moves.
AlphaGo hardware
• Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics.
• AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes the policy and value networks in parallel on GPUs.
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
• (They also implemented a distributed version.)
Speech Recognition
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, “Deep Speech: Scaling up end-to-end speech recognition”, arXiv:1412.5567v2, 2014.
• Output: characters (and space), plus a null symbol (~).
• Input: acoustic features (spectrograms).
• No phonemes or lexicon (no OOV problem).
Resources
Tutorials
Stanford Deep Learning tutorial: http://ufldl.stanford.edu/wiki
Deep Learning tutorials list: http://deeplearning.net/tutorials
IPAM DL/UFL Summer School: http://www.ipam.ucla.edu/programs/gss2012/
ICML 2012 Representation Learning Tutorial: http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html
References
http://www.stanford.edu/~acoates/bmvc2013refs.pdf
Overviews:
Yoshua Bengio, “Practical Recommendations for Gradient-Based Training of Deep Architectures”
Yoshua Bengio & Yann LeCun, “Scaling Learning Algorithms towards AI”
Yoshua Bengio, Aaron Courville & Pascal Vincent, “Representation Learning: A Review and New Perspectives”
Software:
Theano GPU library: http://deeplearning.net/software/theano
SPAMS toolkit: http://spams-devel.gforge.inria.fr/