Deep Convolutional Networks & Computer Vision
J. Sullivan, H. Azizpour, A. S. Razavian, A. Maki and S. Carlsson
Computer Vision Group,
KTH.
March 10, 2015
What has Deep Learning done for Computer Vision?
Deep Learning has resulted in
1. much better automatic
- visual image classification and
- object detection,
2. much more powerful generic image representations.
What have ConvNets done for Computer Vision?
ConvNets have resulted in
1. much better automatic
- visual image classification and
- object detection,
2. much more powerful generic image representations.
Image Classification Task: ILSVRC
[Example: for each test image the system outputs five guesses, e.g. scale, T-shirt, steel drum, drumstick, mud turtle; the prediction counts as correct if the ground-truth class (steel drum) appears among the five.]
$$\text{Error} = \frac{1}{100{,}000} \sum_{i=1}^{100{,}000} \mathbb{1}(\text{incorrect on image } i)$$
Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
ConvNets → much better image classification
Performance of the winning entry in the ILSVRC classification competitions (2010-14); deep ConvNets were introduced in 2012.

Year                       2010   2011   2012   2013   2014
Classification error (%)   28.2   25.8   16.4   11.7    6.7
Pascal VOC: Object Detection
PASCAL VOC 2005-2012
Tasks: classification (person, motorcycle), detection, segmentation, action recognition (riding bicycle).
20 object classes, 22,591 images.
Everingham, Van Gool, Williams, Winn and Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.
ConvNets → much better object detection
[Chart: progress of object detection accuracy on the Pascal VOC 2007 challenge, 2007-2015, per class (plant, person, chair, cat, car, aeroplane) and averaged over all classes; accuracy climbs sharply once deep learning methods arrive.]
Progress of object detection for the Pascal VOC 2007 challenge.
ConvNets → much better image representation
Task                         Best state-of-the-art   ConvNet off-the-shelf + Linear SVM
Object Classification        71.1                    77.2
Scene Classification         64.0                    69.0
Bird Subcategorization       56.8                    61.8
Flowers Recognition          80.7                    86.8
Human Attribute Detection    69.9                    73.0
Object Attribute Detection   89.5                    91.4
Paris Buildings Retrieval    74.9                    79.5
Oxford Buildings Retrieval   81.7                    68.0
Sculptures Retrieval         45.4                    42.3
Scene Image Retrieval        81.9                    84.3
Object Instance Retrieval    89.3                    91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.
Reason for jump in performance:
Learn feature hierarchies from the data
Modern Visual Recognition Systems
1. Training Phase
- Gather labelled training data.
- Extract a feature representation for each training example.
- Construct a decision boundary.
2. Test Phase
- Extract feature representation from the test example.
- Compare to the learnt decision boundary.
It’s just supervised learning.
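To make the recipe concrete, here is a minimal sketch of the two phases in Python, with a linear SVM as the decision boundary; `extract_features` is a hypothetical stand-in for whichever representation (handcrafted or ConvNet) is used, and the random data merely keeps the example self-contained.

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_features(image):
    # Hypothetical placeholder: any fixed-length representation works here.
    return image.reshape(-1).astype(np.float32)

# Training phase: gather labelled data, extract features, fit a boundary.
train_images = [np.random.rand(32, 32, 3) for _ in range(100)]
train_labels = np.random.randint(0, 2, size=100)      # e.g. bike vs. face
X_train = np.stack([extract_features(im) for im in train_images])
classifier = LinearSVC().fit(X_train, train_labels)

# Test phase: extract the same features, compare to the learnt boundary.
test_image = np.random.rand(32, 32, 3)
prediction = classifier.predict(extract_features(test_image)[None, :])
```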
Is it a bike or a face?
Construct a decision boundary
[Figure: feature points from the two classes in a 2d feature space, separated by a learnt decision boundary.]
The two extremes of feature extraction
[Figure: ideal features (classes cleanly separable) vs. far from ideal (classes entangled).]
Supervised Deep Learning allows you to learn ideal features.
Learning Representations/Features
Traditional Pattern Recognition: fixed/handcrafted feature extraction
  Feature Extractor → Trainable Classifier
Modern Pattern Recognition: unsupervised mid-level features
  Feature Extractor → Mid-level Features → Trainable Classifier
Deep Learning: train hierarchical representations
  Low-level Features → Mid-level Features → High-level Features → Trainable Classifier
Source: Talk Computer Perception with Deep Learning by Yann LeCun
Key Properties of Deep Learning
Provides a mechanism to:
• Learn a highly non-linear function.
• Learn it from data.
• Build feature hierarchies
- Distributed representations
- Compositionality
• Perform end-to-end learning.
How? Convolutional Networks
Convolutional Networks
• Are deployed in many practical applications: image recognition, speech recognition, Google's and Baidu's photo taggers.
• Have won several competitions: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, handwriting...
• Are applicable to array data where nearby values are correlated: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images...
Source: Talk Computer Perception with Deep Learning by Yann LeCun
Convolutional Network (Y. LeCun)
The convolutional net model (multistage Hubel-Wiesel system): multiple convolutions ("simple cells") followed by pooling/subsampling ("complex cells"), producing retinotopic feature maps stage after stage.
• Training is supervised, with stochastic gradient descent.
• LeCun et al. '89, '98
Source: Talk Computer Perception with Deep Learning by Yann LeCun
ConvNets: History
• Fukushima 1980: designed a network with the same basic structure but did not train it by backpropagation.
• LeCun from late 80s: figured out backpropagation for ConvNets, popularized and deployed ConvNets for OCR applications etc.
• Poggio from 1999: same basic structure, but learning is restricted to the top layer (k-means at the second stage).
• LeCun from 2006: unsupervised feature learning.
• DiCarlo from 2008: large scale experiments, normalization layer.
• LeCun from 2009: harsher non-linearities, normalization layer, unsupervised and supervised learning.
• Mallat from 2011: provides a theory behind the architecture.
• Hinton 2012: use bigger nets, GPUs, more data, purely supervised.
[Timeline: Convolutional Neural Net 1988 → 1998 → 2012.]
Q: Did we make any progress since then?
A: The main reasons for the breakthrough are data and GPUs, but we have also made networks deeper and more non-linear.
Reasons for breakthrough now:
• Data and GPUs,
• Networks have been made deeper.
Modern Convolutional Network
[AlexNet 2012 architecture: input image 224×224×3 → convolutional layers producing feature maps of size 55×55×48, 27×27×128, 13×13×192, 13×13×192 and 13×13×128 → fully connected layers (dense 4096, dense 4096) → output (dense 1000).]
AlexNet 2012
Convolutional Networks for RGB Images: The Basic Operations
Convolution Operation
• Input:
  - a set of 2d feature maps $x_{1:m} = \{x_1, \dots, x_m\}$
  - each $x_i$ has size $W \times W$
• Convolutional parameters:
  - a set of 2d convolutional kernels $k_{1:m} = \{k_1, \dots, k_m\}$
  - each $k_i$ has size $(2w+1) \times (2w+1)$, and
  - a bias term $b$
[Diagram: input image 224×224×3 → convolution response maps 224×224×48.]
Convolution Operation
Convolutional operator:
• Define $\mathrm{conv}(\cdot, \cdot, \cdot)$, the convolution of $x_{1:m}$ with $k_{1:m}$, as:
$$\mathrm{conv}(x_{1:m}, k_{1:m}, b) = \sum_{i=1}^{m} (x_i * k_i) + b$$
where the 2d convolution $x_i * k_i$ returns a 2d map with $(x, y)$th entry:
$$(x_i * k_i)_{x,y} = \sum_{x'=-w}^{w} \sum_{y'=-w}^{w} k_{i,\,x'+w+1,\,y'+w+1}\; x_{i,\,x+x',\,y+y'}$$
[Diagram: input image 224×224×3 → convolution response maps 224×224×48.]
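A minimal NumPy sketch of the $\mathrm{conv}(\cdot,\cdot,\cdot)$ operator just defined: a sum over input maps of 2d correlations plus a scalar bias. Zero-padding at the borders is an assumption of this sketch, chosen so the output keeps the $W \times W$ size of the input maps (as the 224×224 → 224×224 diagram suggests).

```python
import numpy as np

def conv(x, k, b):
    """x: (m, W, W) feature maps, k: (m, 2w+1, 2w+1) kernels, b: scalar."""
    m, W, _ = x.shape
    w = k.shape[1] // 2
    xp = np.pad(x, ((0, 0), (w, w), (w, w)))          # zero-pad the borders
    out = np.full((W, W), float(b))
    for i in range(m):                                # sum over the m maps
        for xx in range(W):
            for yy in range(W):
                # (x_i * k_i)_{x,y} = sum_{x',y'} k_i[x'+w, y'+w] * x_i[x+x', y+y']
                out[xx, yy] += np.sum(k[i] * xp[i, xx:xx + 2*w + 1,
                                                   yy:yy + 2*w + 1])
    return out

x = np.random.rand(3, 8, 8)                           # three 8x8 input maps
k = np.random.rand(3, 3, 3)                           # three 3x3 kernels (w=1)
y = conv(x, k, 0.0)                                   # one 8x8 response map
```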
Remember 2D convolution
[Figure: an image $f$ in the spatial domain with a 3×3 filter of coefficients $w(-1,-1), \dots, w(1,1)$ centred on pixel $(x, y)$, covering the image values $f(x-1,y-1), \dots, f(x+1,y+1)$.]
$$g(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x+s,\, y+t)$$
Next: non-linear activation and then max-pool
Create a new 2d feature map by applying two more operators:
$$\tilde{x} = \mathrm{pool}(\sigma(\mathrm{conv}(x_{1:m}, k_{1:m}, b)))$$
where
- $\sigma(\cdot)$ is a non-linear function, typically $\sigma(x) = \max(0, x)$,
- $\mathrm{pool}(\cdot)$ represents a local max-pooling operator.
[Diagram: input image 224×224×3 → activation response maps 224×224×48 → max-pooled response maps 55×55×48.]
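A minimal sketch of the two extra operators: $\sigma$ is the ReLU $\max(0, x)$ named on the slide; the non-overlapping 2×2 pooling window is an assumption for illustration (AlexNet actually pools with overlapping 3×3 windows of stride 2).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, s=2):
    """Non-overlapping s x s max-pooling of a 2d map (sides divisible by s)."""
    H, W = x.shape
    return x.reshape(H // s, s, W // s, s).max(axis=(1, 3))

# x_tilde = pool(sigma(conv(x_1:m, k_1:m, b))), reusing conv from above:
# x_tilde = max_pool(relu(conv(x, k, b)))
```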
From one convolutional layer to the next
• At convolutional layer $l$ we have a set of 2d feature maps $x^{(l)}_{1:m_l} = \{x^{(l)}_1, \dots, x^{(l)}_{m_l}\}$.
• We have multiple sets of convolutional kernels $k^{(l+1)}_{j,1:m_l}$, $j = 1, \dots, m_{l+1}$.
• For each kernel set $k^{(l+1)}_{j,1:m_l}$ we create a new 2d feature map:
$$x^{(l+1)}_j = \mathrm{pool}\!\left(\sigma\!\left(\mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)\right)\right)$$
Convolve → Activation → Max-pool
[Diagram: layer 1 output 55×55×48 → activation response maps 55×55×128 → max-pooled response maps 27×27×128.]
For $j = 1, \dots, m_{l+1}$:
- Convolve the current response maps with $k^{(l+1)}_{j,1:m_l}$:
$$\hat{x}^{(l+1)}_j = \mathrm{conv}\!\left(x^{(l)}_{1:m_l},\, k^{(l+1)}_{j,1:m_l},\, b^{(l+1)}_j\right)$$
- Non-linear activation:
$$z^{(l+1)}_j = \sigma\!\left(\hat{x}^{(l+1)}_j\right)$$
- Max-pool:
$$x^{(l+1)}_j = \mathrm{pool}\!\left(z^{(l+1)}_j\right)$$
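Putting the three steps together, a minimal sketch of one full layer $l \to l+1$, reusing `conv` and `max_pool` from the sketches above: for each of the $m_{l+1}$ kernel sets, convolve all $m_l$ input maps, apply the non-linearity, then max-pool.

```python
import numpy as np

def conv_layer(x, kernels, biases):
    """x: (m_l, W, W); kernels: (m_{l+1}, m_l, 2w+1, 2w+1); biases: (m_{l+1},)."""
    out = []
    for j in range(kernels.shape[0]):                 # j = 1, ..., m_{l+1}
        x_hat = conv(x, kernels[j], biases[j])        # convolve
        z = np.maximum(x_hat, 0.0)                    # non-linear activation
        out.append(max_pool(z))                       # max-pool
    return np.stack(out)                              # (m_{l+1}, W/2, W/2)
```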
AlexNet 2012
[AlexNet 2012 architecture diagram, as above.]
1st fully connected layer
For $j = 1, \dots, m_{l_c+1}$:
$$x^{(l_c+1)}_j = \max\!\left(\sum_{i=1}^{m_{l_c}} w^{(l_c+1)}_{j,i} \cdot x^{(l_c)}_i + b^{(l_c+1)}_j,\; 0\right)$$
[AlexNet 2012 architecture diagram, as above.]
2nd fully connected layer
For $j = 1, \dots, m_{l_c+2}$:
$$x^{(l_c+2)}_j = \max\!\left(w^{(l_c+2)}_j \cdot x^{(l_c+1)} + b^{(l_c+2)}_j,\; 0\right)$$
[AlexNet 2012 architecture diagram, as above.]
Output layer: soft-max operator
• For $j = 1, \dots, C$:
$$o'_j = w^{(l_c+3)}_j \cdot x^{(l_c+2)} + b^{(l_c+3)}_j, \qquad o_r = \frac{\exp(o'_r)}{\sum_{j=1}^{C} \exp(o'_j)}$$
[AlexNet 2012 architecture diagram, as above.]
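A minimal sketch of the fully connected layers and the soft-max output following the equations above. The layer sizes here are scaled-down stand-ins (AlexNet uses 13·13·128 → 4096 → 4096 → 1000), and flattening the last convolutional maps into one vector is the usual realization of the slide's per-map weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc_relu(W, b, x):
    # x_j = max(w_j . x + b_j, 0) for every output unit j
    return np.maximum(W @ x + b, 0.0)

def softmax(o_prime):
    # o_r = exp(o'_r) / sum_j exp(o'_j); subtracting the max is a standard
    # numerical-stability trick, not part of the slide's formula
    e = np.exp(o_prime - o_prime.max())
    return e / e.sum()

x = rng.random(512)                                       # flattened conv output
h1 = fc_relu(0.01 * rng.standard_normal((256, 512)), np.zeros(256), x)
h2 = fc_relu(0.01 * rng.standard_normal((256, 256)), np.zeros(256), h1)
o = softmax(0.01 * rng.standard_normal((10, 256)) @ h2)   # o sums to 1
```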
Parameters of the model
• Filter parameters:
  - Convolutional layers: for $l = 1, \dots, l_c$: kernels $k^{(l)}_{j,1:m_{l-1}}$ with $1 \le j \le m_l$; each $k^{(l)}_{j,i}$ has size $w_l \times w_l$.
  - Fully connected layers:
    * First fully connected layer: $w^{(l_c+1)}_{j,i}$ with $1 \le j \le m_{l_c+1}$ and $1 \le i \le m_{l_c}$; each $w^{(l_c+1)}_{j,i}$ and $x^{(l_c)}_i$ have equal size.
    * Subsequent fully connected layers: for $l = l_c+2, \dots, l_c+L$: $w^{(l)}_j$ with $1 \le j \le m_l$; each $w^{(l)}_j$ has size $m_{l-1}$.
Training Convolutional Networks
• Set-up: supervised learning.
  - For an RGB image $x$, set the ConvNet's first set of 2d feature maps: $x^{(0)}_{1:3} = \{x_{\text{red channel}}, x_{\text{green channel}}, x_{\text{blue channel}}\}$.
  - Have a set $\mathcal{D}$ of labelled training images, i.e. many pairs $(x, y)$.
  - To learn the network's parameters, we must link the value of $\mathcal{W} = \{\mathcal{W}_{\text{convolutional}}, \mathcal{W}_{\text{fully connected}}\}$ to the network's prediction performance on $\mathcal{D}$.
Training ConvNets: Measuring Performance
- Remember, a ConvNet represents a function
$$f_{\text{ConvNet}} : [0,1]^{W \times W \times 3} \times \mathbb{R}^p \to [0,1]^M$$
so for input $x$ the function $f_{\text{ConvNet}}$ predicts its label: $f_{\text{ConvNet}}(x; \mathcal{W}) = \hat{y}$.
- Use a loss function to measure the error in $f_{\text{ConvNet}}(x; \mathcal{W})$'s predicted label for input $x$ in $\mathcal{D}$.
- The loss function typically has the property that $L(y, \hat{y})$ increases as $\|y - \hat{y}\|$ increases.
- The cross-entropy loss is frequently used:
$$L(y, f_{\text{ConvNet}}(x; \mathcal{W})) = -\sum_{j=1}^{M} y_j \log(\hat{y}_j)$$
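A minimal sketch of the cross-entropy loss for a one-hot label $y$ and a soft-max output $\hat{y}$; the epsilon clamp is an added numerical-safety assumption, not part of the slide's formula.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y, y_hat) = -sum_j y_j log(y_hat_j)
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

y = np.zeros(1000); y[42] = 1.0                       # one-hot ground truth
y_hat = np.full(1000, 1.0 / 1000)                     # uniform prediction
print(cross_entropy(y, y_hat))                        # log(1000) ~ 6.91
```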
Training ConvNets: The Optimization Problem
• Define the performance of a network with parameters $\mathcal{W}$ on $\mathcal{D}$ as
$$E(\mathcal{D}, \mathcal{W}) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$
• The learning problem is to find the $\mathcal{W}$ that minimizes this error:
$$\min_{\mathcal{W}} E(\mathcal{D}, \mathcal{W})$$
• How do we do the optimization?
Training ConvNets: How to optimize
Our optimization problem:
$$\min_{\mathcal{W}} E(\mathcal{D}, \mathcal{W}) = \min_{\mathcal{W}} \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} L(y, f_{\text{ConvNet}}(x; \mathcal{W}))$$
• Initialize the network's parameters randomly to get $\mathcal{W}^{(0)}$.
• Update $\mathcal{W}$ using mini-batch stochastic gradient descent (SGD):
  - At iteration $t$, randomly choose a small subset $\mathcal{D}^{(t)}$ of $\mathcal{D}$.
  - Perform the update with learning rate $\alpha^{(t)}$:
$$\mathcal{W}^{(t+1)} = \mathcal{W}^{(t)} - \alpha^{(t)} \left.\nabla_{\mathcal{W}} E(\mathcal{D}^{(t)}, \mathcal{W})\right|_{\mathcal{W}^{(t)}}$$
• This procedure allows us to find a local minimum of $E(\mathcal{D}, \mathcal{W})$. Is this good enough?
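A minimal sketch of the mini-batch SGD loop on a toy least-squares problem; in a real ConvNet `loss_gradient` would be computed by backpropagation, and the learning rate $\alpha^{(t)}$ would typically be decayed over time (held constant here for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_gradient(w, batch):
    # Gradient of the mean squared error (w . x - y)^2 over the batch
    g = np.zeros_like(w)
    for x, y in batch:
        g += 2.0 * (w @ x - y) * x
    return g / len(batch)

def sgd(w, data, lr=0.01, batch_size=32, iters=2000):
    for t in range(iters):
        idx = rng.choice(len(data), batch_size)       # random subset D^(t)
        grad = loss_gradient(w, [data[i] for i in idx])
        w = w - lr * grad                             # W^(t+1) = W^(t) - a*grad
    return w

data = [(x, x.sum()) for x in rng.random((1000, 5))]  # true weights: all ones
w = sgd(rng.standard_normal(5), data)                 # w approaches all ones
```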
Intuition about the hardness of training a deep ConvNet with backpropagation
Next slides: Deep Learning for Vision: Tricks of the Trade, M. Ranzato, BAVM, Oct. '13
ConvNets: till 2012
[Cartoon: loss as a function of a parameter, with many distinct local minima.]
Common wisdom: training does not work because we "get stuck in local minima".
ConvNets: today
[Cartoon: loss as a function of a parameter, with similar minima and long plateaus.]
Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries: the input/output is invariant to permutations of the hidden units, ties between parameters must be broken, and units saturate.
Like walking on a ridge between valleys.
ConvNets: today
[Cartoon: loss as a function of a parameter.]
Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries.
Optimization is not the real problem when:
– the dataset is large,
– units do not saturate too much,
– there is a normalization layer.
ConvNets: today
Today's belief is that the challenge is about:
– generalization: how many training samples does it take to fit 1B parameters? How many parameters/samples to model spaces with 1M dimensions?
– scalability.
Regularization during training is very important
Avoid overfitting by:
• Train with large labelled datasets.
• Augment training sets with random jitterings of the input.
• Train for a long time with small learning rates.
• Dropout (an idea from Geoff Hinton): only classify with, and update, a random subset of the network at each training iteration (see the sketch below).
• Be vigilant! When training, constantly monitor performance with a validation set.
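A minimal sketch of dropout at training time. The "inverted" scaling by $1/(1 - p_{\text{drop}})$ is an assumption of this sketch; the original formulation instead rescales the weights at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    if not training:
        return activations                            # test: full network
    mask = rng.random(activations.shape) >= p_drop    # random subset of units
    return activations * mask / (1.0 - p_drop)        # keep expected scale

h = rng.random(8)
print(dropout(h))                                     # about half the units zeroed
print(dropout(h, training=False))                     # unchanged at test time
```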
Source for many labelled images:
ImageNet
ImageNet: Large scale visual recognition challenge (2010-13)
• Classification (+ Localization) Challenge
- 1000 object classes
- 1,431,167 images
• Detection Challenge
- 200 object classes
- 456,191 images
Source: http://image-net.org/challenges/LSVRC/2013
Variety of object classes in ILSVRC
[Example classes: backpack, flute, strawberry, traffic light, bathing cap, matchstick, racket, sea lion; the variety spans both the DET and CLS-LOC tasks.]
Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
http://image-net.org/challenges/LSVRC/2012/analysis
A revolution in computer vision
ConvNet of Krizhevsky, Sutskever, Hinton 2012
ImageNet Classification with Deep Convolutional Neural Networks (NIPS '12)
[AlexNet 2012 architecture diagram, as above.]
Image Classification: Dramatic ILSVRC Results since 2012
• ImageNet Large Scale Visual Recognition Challenge
• 1000 categories, 1.3 million labeled training samples (×10 with data augmentation)
[Chart: winning top-5 classification error (%) by year: 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7.]
ConvNet of Krizhevsky, Sutskever, Hinton 2012
• Method: large convolutional net
- 60M parameters
- Trained with backprop on GPU
- Trained "with all the tricks Yann came up with in the last 20 years, plus dropout" (Hinton, NIPS 2012)
- Rectification, contrast normalization,...
- Softmax output function
• Error Rate on ImageNet: 15% (correct class not in top 5)
• Previous state of the art: 25% error
• Deployed in Google+ Photo Tagging in May 2013
First layer filters learnt
[Figure: the first-layer filters learnt by the network.]
Object Recognition [Krizhevsky, Sutskever, Hinton 2012] (Y. LeCun)
• Method: large convolutional net with 650K neurons, 832M synapses, 60M parameters.
• Trained with backprop on GPU.
• Trained "with all the tricks Yann came up with in the last 20 years, plus dropout" (Hinton, NIPS 2012).
• Rectification, contrast normalization,...
• Error rate: 15% (whenever the correct class isn't in the top 5). Previous state of the art: 25% error.
A REVOLUTION IN COMPUTER VISION
• Acquired by Google in Jan 2013.
• Deployed in Google+ Photo Tagging in May 2013.
AlexNet: Object Recognition Results
[Example recognition results from Krizhevsky, Sutskever, Hinton 2012, shown over several slides.]
Leader Board from ImageNet LSVRC-2014
Name                    Institution               ConvNet   Error (%)
GoogLeNet               Google                    Yes       6.656
VGG                     Oxford University         Yes       7.337
MSRA Visual Computing   Microsoft Research Asia   Yes       8.060
Andrew Howard           consultant                Yes       8.111
DeeperVision            company                   Yes       9.058
For more details check out
http://www.image-net.org/challenges/LSVRC/2014/results.php
ConvNet features are generic
CNN Features: an Astounding Baseline for Recognition
• CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian, H. Azizpour, J. Sullivan and S. Carlsson, CVPR workshop on Deep Learning 2014.
• The paper's experimental evaluation shows: replace the handcrafted feature pipeline for many tasks
[Figure: a typical handcrafted pipeline (image → part annotations → learn normalized pose → extract features (RGB, gradient, LBP) → strong DPM / SVM) versus the simple CNN representation → SVM pipeline.]
with ConvNet features from one large ConvNet trained on ImageNet, and IMPROVE RESULTS.
What we mean by a ConvNet Feature
[AlexNet 2012 architecture diagram, as above.]
ConvNets → much better image representation
[Same chart as before: best state-of-the-art vs. ConvNet off-the-shelf + linear SVM across the eleven tasks.]
Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.
How to optimize ConvNet representations for transfer learning
When do I need transfer learning?
• Currently ImageNet is one of the few large labelled training sets for computer vision.
• What do I do if:
  - I have a visual recognition task that differs from image classification, and
  - I have limited labelled training data,
  but I still want to use a deep ConvNet representation?
Factors that influence a ConvNet’s representation
[Diagram: first, training of the source ConvNet from scratch (backprop with source images & labels takes a random ConvNet to a source ConvNet; factors: network architecture? source task? early stopping?); then, exploiting the source ConvNet for the target task (target image → source ConvNet → target ConvNet representation → SVM → target label; factors: layer? dim. reduction? spatial pooling? fine-tuning?).]
Can order visual recognition tasks relative to ImageNet
Task's distance from ImageNet increases →

Image Classification   Attribute Detect.   Fine-grained Recog.   Compositional      Instance Retrieval
PASCAL VOC Object      H3D human attrib.   Cat&Dog breeds        VOC Human Act.     Holiday scenes
MIT 67 Indoor Scenes   Object attrib.      Bird subordinate      Stanford 40 Act.   Paris buildings
SUN 397 Scene          SUN scene attrib.   102 Flowers           Visual Phrases     Sculptures
Best practices for a ConvNet rep. for transfer learning
Factor           Target task: ImageNet · · · Fine-grained recognition · · · Instance retrieval
Early stopping   Don't do it
Fine-tuning      Yes; more improvement with more labelled data
Network depth    As deep as possible¹
Network width    Wider · · · Moderately wide
Dim. reduction   Original dim · · · Reduced dim
Rep. layer       Later layers · · · Earlier layers

¹ In general the network should be as deep as possible, but in the final experiments a couple of the instance retrieval tasks defied this advice!
Gains to be made by optimizing these parameters for a task
[Chart: performance on VOC, MIT Scenes, SUN, scene attributes, object attributes, human attributes, pet breeds, bird subordinate, flowers, VOC action, Stanford action, visual phrases, Holidays, UKB, Oxford, Paris and Sculpture benchmarks for the best non-ConvNet method, the standard deep representation, and the optimally transferred deep representation.]
From the first fully-connected layer one can regress to local spatial information:
• Facial landmarks via linear regression
• RGB via linear regression
• Segmentation via linear regression
[Example figures for each task.]