Convolutional Neural Networks II
CS194: Image Manipulation, Comp. Vision, and Comp. Photo
Alexei Efros, UC Berkeley, Spring 2020
Lecture 7 - 27 Jan 2016, Fei-Fei Li & Andrej Karpathy & Justin Johnson
Case Study: LeNet-5 [LeCun et al., 1998]
Conv filters were 5x5, applied at stride 1
Subsampling (pooling) layers were 2x2, applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]
ImageNet Challenge (1000 object classes), Fei-Fei et al. (slide: Andrew Ng)
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=> Q: what is the output volume size? Hint: (227-11)/4+1 = 55
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=> Output volume [55x55x96]
Q: What is the total number of parameters in this layer?
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=> Output volume [55x55x96]
Parameters: (11*11*3)*96 = 34,848 ≈ 35K
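The arithmetic above can be checked with two small helpers (function names are ours, not from the slides):

```python
# Minimal helpers to verify the CONV1 numbers on the slide.

def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial output size of a conv/pool layer: (W - F + 2P)/S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_params(filter_size, in_depth, num_filters, bias=True):
    """Weights per filter times number of filters (plus optional biases)."""
    weights = filter_size * filter_size * in_depth * num_filters
    return weights + (num_filters if bias else 0)

# CONV1: 96 11x11 filters, stride 4, pad 0, on a 227x227x3 input
print(conv_output_size(227, 11, 4))        # 55  -> output volume 55x55x96
print(conv_params(11, 3, 96, bias=False))  # 34848, i.e. ~35K weights
```

The same `conv_output_size` formula covers the pooling layers below (a 3x3 pool at stride 2 on 55x55 gives 27x27).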
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96
Q: what is the number of parameters in this layer?
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96
Parameters: 0! (pooling layers have no learnable parameters)
Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
...
Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
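The volume sizes listed above can be replayed mechanically. A small sketch (the layer tuples and helper names are ours) that prints each output volume in turn:

```python
# Replay the (simplified) AlexNet shape arithmetic layer by layer.
# Each spec is (name, filter_size, stride, pad, out_depth);
# out_depth=None means "keep current depth" (pooling).

def out_size(w, f, s, p):
    return (w - f + 2 * p) // s + 1

layers = [
    ("CONV1", 11, 4, 0, 96),
    ("POOL1", 3, 2, 0, None),
    ("CONV2", 5, 1, 2, 256),
    ("POOL2", 3, 2, 0, None),
    ("CONV3", 3, 1, 1, 384),
    ("CONV4", 3, 1, 1, 384),
    ("CONV5", 3, 1, 1, 256),
    ("POOL3", 3, 2, 0, None),
]

w, d = 227, 3
for name, f, s, p, depth in layers:
    w = out_size(w, f, s, p)
    d = depth if depth is not None else d
    print(f"{name}: {w}x{w}x{d}")
# Final spatial size is 6x6x256, which is what feeds FC6 (4096 neurons).
```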
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD momentum 0.9
- learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7-CNN ensemble: 18.2% -> 15.4%
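The "reduce the learning rate by 10 when validation accuracy plateaus" rule above was done by hand; it can be sketched as a tiny scheduler (the patience window and function name are our assumptions, not from the paper):

```python
# Hedged sketch of plateau-based LR decay: divide the learning rate by 10
# when validation accuracy has not improved for `patience` epochs.

def step_lr_on_plateau(lr, val_acc_history, patience=3, factor=0.1, eps=1e-4):
    """Return the (possibly reduced) learning rate given recent val accuracies."""
    if len(val_acc_history) <= patience:
        return lr
    recent = val_acc_history[-patience:]
    best_before = max(val_acc_history[:-patience])
    if max(recent) <= best_before + eps:   # no improvement for `patience` epochs
        return lr * factor
    return lr

lr = 1e-2
history = [0.40, 0.52, 0.58, 0.58, 0.58, 0.58]  # accuracy has plateaued
lr = step_lr_on_plateau(lr, history)
print(lr)  # reduced to roughly 1e-3
```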
(slide from Kaiming He’s recent presentation)
Case Study: ResNet [He et al., 2015]
224x224x3
after the initial conv and pooling layers, the spatial dimension is only 56x56!
Case Study: ResNet [He et al., 2015]
“You need a lot of data if you want to train/use CNNs”
Transfer Learning
“You need a lot of data if you want to train/use CNNs”
Deep Features & their Embeddings
The Unreasonable Effectiveness of Deep Features
Classes separate in the deep representations and transfer to many tasks. [DeCAF] [Zeiler-Fergus]
Can be used as a generic feature (“CNN code” = 4096-D vector before classifier)
query image → nearest neighbors in the “code” space
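Retrieval in the "code" space reduces to nearest-neighbor search over 4096-D vectors. A toy stdlib-only sketch in 2-D (the codes and function names are made up for illustration):

```python
# Nearest-neighbor retrieval in a feature ("CNN code") space.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, database, k=3):
    """Return indices of the k database codes closest to the query."""
    ranked = sorted(range(len(database)), key=lambda i: l2(query, database[i]))
    return ranked[:k]

codes = [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0], [0.5, 0.5]]
print(nearest_neighbors([1.0, 0.1], codes, k=2))  # [1, 2]
```

In practice the codes would be the 4096-D FC7 activations, and large-scale search would use an approximate-nearest-neighbor index rather than a linear scan.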
ImageNet + Deep Learning
Beagle
- Image Retrieval
- Detection (RCNN)
- Segmentation (FCN)
- Depth Estimation
- …
ImageNet + Deep Learning
Beagle
Pose?
Boundaries? Geometry? Parts?
Materials?
Transfer Learning with CNNs
1. Train on ImageNet
Transfer Learning with CNNs
1. Train on ImageNet
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier
i.e. swap the Softmax layer at the end
Transfer Learning with CNNs
1. Train on ImageNet
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier
i.e. swap the Softmax layer at the end
3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers
retrain a bigger portion of the network, or even all of it.
Transfer Learning with CNNs
1. Train on ImageNet
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier
i.e. swap the Softmax layer at the end
3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, train the full network or only some of the higher layers
retrain a bigger portion of the network, or even all of it.
tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on intermediate layers
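The tip above amounts to a per-layer learning-rate table. A sketch (the layer names and exact groupings are illustrative, not a prescribed recipe):

```python
# Per-layer learning rates for finetuning: full rate on the freshly
# initialized classifier, smaller steps on pretrained layers.

def finetune_lr(layer, base_lr=1e-2):
    """Scale the learning rate by how close the layer is to the output."""
    if layer == "classifier":        # newly initialized: full learning rate
        return base_lr
    if layer in ("fc6", "fc7"):      # top pretrained layers: ~1/10th
        return base_lr * 0.1
    return base_lr * 0.01            # intermediate conv layers: ~1/100th

for name in ["conv3", "fc7", "classifier"]:
    print(name, finetune_lr(name))
```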
Learning an Embedding
[Figure: Image 1 and Image 2 pass through two CNNs with shared weights W into a shared embedding representation.]
[Figure: Image 1 and Image 2 pass through two CNNs with shared weights W; matching is done in the shared representation.]
Learning an Embedding
Siamese Network w/ Contrastive Loss
Siamese Architecture [Chopra 2005, Hadsell 2006]
LEARNING VISUAL SIMILARITY FOR PRODUCT DESIGN WITH CONVOLUTIONAL NEURAL NETWORKS
SEAN BELL AND KAVITA BALA, CORNELL UNIVERSITY
THE PROBLEM
Name: ”Great Bowl O’Fire Sculptural Fire Bowl”
(1) “What is this?” (2) “Where is it used?”
Category: Fire pit
Sold by: John T. Unger, LLC
THE PROBLEM
(1) “What is this?” (2) “Where is it used?”
Challenge: determine whether these are the same product (different resolution, viewpoint, color, lighting, occlusions)
Name: ”Great Bowl O’Fire Sculptural Fire Bowl”
Category: Fire pit
Sold by: John T. Unger, LLC
TWO KINDS OF IMAGES: Iconic / In context
(From a product website) (Cropped from a scene photo)
PROJECTING INTO A JOINT EMBEDDING
Embedding
Iconic In context
SEARCH USING THE EMBEDDING
Embedding
“What is it?”
Name: Hemel Ring
Category: Hanging light
Sold by: Holly Hunt
SEARCH USING THE EMBEDDING
Embedding
“Where is it used?”
CONTRASTIVE LOSS: POSITIVE EXAMPLE
[Figure: an in-context image and an iconic image of the same product pass through two CNNs with shared parameters θ, producing embeddings xq and xp; loss Lp pulls the pair together in the embedding.]
CONTRASTIVE LOSS: NEGATIVE EXAMPLE
[Figure: an in-context image and an iconic image of a different product produce embeddings xq and xn; loss Ln pushes the pair at least a margin m apart.]
CONTRASTIVE LOSS: ALL TOGETHER [Chopra 2005, Hadsell 2006]
Minimize L(θ) with stochastic gradient descent and momentum; a margin m keeps non-matching pairs separated.
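The positive and negative cases combine into the standard contrastive loss. A stdlib-only sketch on toy embedding vectors (the helper names and numbers are illustrative):

```python
# Contrastive loss [Chopra 2005, Hadsell 2006] on embedding vectors.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(x1, x2, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs past the margin."""
    d = dist(x1, x2)
    if same:
        return d ** 2                     # positive pair: L_p = d^2
    return max(0.0, margin - d) ** 2      # negative pair: L_n = max(0, m - d)^2

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same=True))               # 25.0
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same=False, margin=6.0))  # 1.0
```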
TRAINING PIPELINE
[Figure: image pairs sampled from an image database feed the CNN; stochastic gradient descent updates the CNN parameters θ that define the embedding.]
RESULTS: “WHAT IS IT?”
[Figures: in-context query images with their iconic top-4 retrievals.]
COMPARISON: TRAINED ONLY ON CATEGORIES
[Figure: in-context queries with iconic top-4 retrievals.]
COMPARISON: TRAINED ONLY ON IMAGENET
[Figure: in-context queries with iconic top-4 retrievals.]
RESULTS: FAILURE CASE
[Figure: in-context query and iconic top-4 retrievals for the “Maskros Pendant Lamp”.]
RESULTS: “WHERE IS IT USED?”
[Figure: scenes retrieved for a product query, e.g. "LEM Piston Stool | Design Within Reach”.]
SEARCHING ACROSS CATEGORIES
Color distribution cross-entropy loss with colorfulness enhancing term.
[Zhang, Isola, Efros, ECCV 2016]
Designing loss functions
Input | Ground truth
Image colorization
Cross entropy loss, with colorfulness term
“semantic feature loss” (VGG feature covariance matching objective)
[Johnson et al. 2016]
Super-resolution [Zhang et al. 2016]
Designing loss functions
Universal loss?
Generated vs. Real (classifier)
[Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio 2014]
Generative Adversarial Networks (GANs)
Real photos
Generated images
Generator
[Goodfellow et al., 2014]
G tries to synthesize fake images that fool D
D tries to identify the fakes
Generator Discriminator
real or fake?
[Goodfellow et al., 2014]
fake (0.9)
real (0.1)
[Goodfellow et al., 2014]
G tries to synthesize fake images that fool D:
real or fake?
[Goodfellow et al., 2014]
G tries to synthesize fake images that fool the best D:
real or fake?
[Goodfellow et al., 2014]
Loss Function
G’s perspective: D is a loss function.
Rather than being hand-designed, it is learned.
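The game above corresponds to the original GAN objective. A stdlib-only sketch of the per-sample losses, given the discriminator's probability outputs (the non-saturating generator loss shown here is the practical variant from the original paper; function names are ours):

```python
# Per-sample GAN losses, with D(x) a probability in (0, 1).
import math

def d_loss(d_real, d_fake):
    """D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: G maximizes log D(G(z))."""
    return -math.log(d_fake)

# A confused D (scores everything 0.5) pays a higher loss than a sharp one:
print(d_loss(0.9, 0.1))   # ~0.21: D is winning
print(d_loss(0.5, 0.5))   # ~1.39: D can't tell real from fake
print(g_loss(0.1))        # ~2.30: G is being caught
```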
[Isola et al., 2017][Goodfellow et al., 2014]
+ L1
real or fake?
[Goodfellow et al., 2014]
real! (“Aquarius”)
[Goodfellow et al., 2014]
real or fake pair?
[Goodfellow et al., 2014][Isola et al., 2017]
real or fake pair?
[Goodfellow et al., 2014][Isola et al., 2017]
fake pair
[Goodfellow et al., 2014][Isola et al., 2017]
real pair
[Goodfellow et al., 2014][Isola et al., 2017]
real or fake pair?
[Goodfellow et al., 2014][Isola et al., 2017]
BW → Color
Data from [Russakovsky et al. 2015]
BW → Color
Data from [Russakovsky et al. 2015]
Labels → Facades
Input | Output
Data from [Tylecek, 2013]
Labels → Facades
Input | Output | Input | Output
Data from [Tylecek, 2013]
Day → Night
Input | Output | Input | Output | Input | Output
Data from [Laffont et al., 2014]
Thermal → RGB
Edges → Images
Input | Output | Input | Output | Input | Output
Edges from [Xie & Tu, 2015]
Sketches → Images
Input | Output | Input | Output | Input | Output
Trained on Edges → Images
Data from [Eitz, Hays, Alexa, 2012]
#edges2cats [Christopher Hesse]
Ivy Tasi @ivymyt
Vitaly Vidmirov @vvid
@gods_tail
@ka92
Twitter-driven research: #pix2pix
Bertrand Gondouin @bgondouin
Brannon Dorsey @brannondorsey
Mario Klingemann @quasimondo
© Memo Akten, “Learning to See: Gloomy Sunday”
“Do as I Do”
OpenPose
pix2pix
Everybody Dance Now
Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros, UC Berkeley
Source Subject Target Subject
Results
https://www.youtube.com/watch?v=PCBTZh41Ris&feature=youtu.be
CEO: our own Dr. Tinghui Zhou
Paired training examples | Unpaired training examples
CycleGAN, or “there and back aGAN”
[Zhu*, Park*, Isola, Efros. ICCV 2017]
Cycle-Consistency Loss
Forward cycle: x → G(x) → F(G(x)), penalized by ‖F(G(x)) − x‖₁
Backward cycle: y → F(y) → G(F(y)), penalized by ‖G(F(y)) − y‖₁
Total: ‖F(G(x)) − x‖₁ + ‖G(F(y)) − y‖₁
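With toy 1-D "translators" standing in for the two generators, the cycle-consistency loss is easy to write down (G, F, and the numbers are purely illustrative):

```python
# Cycle-consistency on toy 1-D maps: after a round trip
# x -> G(x) -> F(G(x)), we should land back at x (and likewise for y).

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cycle_loss(G, F, xs, ys):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, summed over the batch."""
    forward = l1([F(G(x)) for x in xs], xs)
    backward = l1([G(F(y)) for y in ys], ys)
    return forward + backward

G = lambda x: 2 * x + 1      # toy "horse -> zebra" map
F = lambda y: (y - 1) / 2    # its exact inverse: zero cycle loss
print(cycle_loss(G, F, [0.0, 1.0, 2.0], [1.0, 3.0]))  # 0.0
```

An imperfect F (one that is not G's inverse) would make this loss positive, which is exactly the training signal CycleGAN adds on top of the two adversarial losses.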
Video
Collection Style Transfer
Van Gogh
Cezanne
Monet
Ukiyo-e
Photograph © Alexei Efros
CG to Real (Grand Theft Auto)
Real to CG
Shallower depth of field
Failure case
A Neural Algorithm of Artistic Style
Gatys, Ecker, Bethge (arXiv 2015)
Van Gogh (1889)
Picasso (1910)
Munch (1893)
Turner (1805)
Kandinsky (1913)
Early Vision Texture Models
Heeger & Bergen (1995) Portilla & Simoncelli (2000)
Linear filter bank
Heeger & Bergen, SIGGRAPH ’95
Start with a noise image as output. Main loop:
• Match pixel histogram of output image to input
• Decompose input and output images using multi-scale filter bank (Steerable Pyramid)
• Match subband histograms of input and output pyramids
• Reconstruct input and output images (collapse the pyramids)
Heeger, Bergen, Pyramid-based texture analysis/synthesis, SIGGRAPH 1995
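The "match histogram" steps above can be sketched in their simplest rank-order form (the function name is ours, and real implementations match binned histograms across pyramid subbands rather than exact ranks on raw pixels):

```python
# Rank-order histogram matching: give the output image the input
# texture's value distribution while keeping its spatial arrangement.

def match_histogram(output, target):
    """Replace each output value by the target value of the same rank."""
    tgt = sorted(target)
    order = sorted(range(len(output)), key=lambda i: output[i])
    matched = [0.0] * len(output)
    for rank, i in enumerate(order):
        matched[i] = tgt[rank]
    return matched

noise = [0.9, 0.1, 0.5, 0.3]     # "output" image, initialized to noise
texture = [10, 20, 20, 40]       # desired value distribution
print(match_histogram(noise, texture))  # [40, 10, 20, 20]
```

The result has exactly the texture's histogram, but the ordering (which pixel is brightest) still comes from the noise image; iterating this across pyramid subbands is the Heeger-Bergen loop.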
Multi-scale filter decomposition
Filter bank
Input image
Filter response histograms
Simoncelli & Portilla ’98+
Match joint histograms of pairs of filter responses at adjacent spatial locations, orientations, and scales.
Optimize using repeated projections onto statistical constraint surfaces
Texture Synthesis: Image Space vs. Model Space
Images with equal model response
Portilla & Simoncelli (2000)
Convolutional Neural Network Texture Model
Convolutional Neural Network
Gatys et al. (NIPS 2015)
CNN - Multiscale Filter Bank
[Figure: VGG-style multiscale filter bank; layers conv1_1, pool1, pool2, pool3, pool4 with 64, 64, 128, 256, 512 features respectively.]
CNN - Texture Features
[Figure: feature maps at each scale, with 64, 64, 128, 256, 512 features per layer.]
Gram Matrices
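A Gram matrix here is the channel-by-channel inner product of feature maps, which keeps which features co-occur but discards where they occur. A stdlib-only sketch on a tiny feature tensor (the data is made up):

```python
# Gram-matrix texture statistic (Gatys et al.): correlations between
# feature maps, with spatial position summed out.

def gram_matrix(features):
    """features[c][p] = channel c's response at spatial position p."""
    C = len(features)
    P = len(features[0])
    return [[sum(features[i][p] * features[j][p] for p in range(P))
             for j in range(C)] for i in range(C)]

# Two 4-pixel feature maps that never fire at the same position:
F = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
print(gram_matrix(F))  # [[2.0, 0.0], [0.0, 2.0]]
```

The zero off-diagonal says the two maps are uncorrelated; matching Gram matrices at several layers is what drives both texture synthesis and the style term in style transfer.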
Texture Synthesis
Test Julesz’ Conjecture
CNN - Texture Synthesis
Gatys et al. (NIPS 2015)
Artistic Style Transfer
Relative Weighting of Content and Style: 1e-4, 1e-3, 1e-2, 1e-1
Different Reconstruction Layers
[Figures: Original vs. reconstructions from Conv2_2 and Conv4_2.]
General Style Transfer