Convolutional Neural Networks and Supervised Learning
Eilif Solberg
August 30, 2018
Outline
Convolutional Architectures: Convolutional neural networks
Training: Loss, Optimization, Regularization, Hyperparameter search
Architecture search: NAS1, NAS2
Bibliography
Convolutional Architectures
Template matching
Figure: Illustration from http://pixuate.com/technology/template-matching/
1. Try to match the template at each location by sliding a window over the image
2. Threshold for detection
For 2D objects, kind of possible but difficult
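A minimal NumPy sketch of the sliding-window idea (the function name, the normalization and the threshold are illustrative choices, not from the slides):

import numpy as np

def template_match(image, template, threshold=0.5):
    # Slide the template over every location and score the match with
    # normalized cross-correlation; keep locations above the threshold.
    ih, iw = image.shape
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    detections = []
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            if (p * t).mean() > threshold:
                detections.append((y, x))
    return detections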
Convolution
Which filter produces the activation map on the right?
Convolutional layer
→ Glorified template matching
• Many templates (aka output filters)
• We learn the templates; the weights are the templates
• Intermediate detection results are only a means to an end
  – treat them as features, which we in turn match new templates against
• Starting from the second layer we have "nonlinear filters"
Hyperparameters of convolutional layer
1. Kernel height and width - template size
2. Stride - skips between template matches
3. Dilation rate - "holes" in the template where we don't care; larger field-of-view without more weights
4. Number of output filters - number of templates
5. Padding - expand the image, typically with zeros
(all five appear as layer arguments in the sketch below)
Figure: Image from http://neuralnetworksanddeeplearning.com/
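Using PyTorch as one concrete framework, all five hyperparameters show up directly in the layer constructor (the specific values here are illustrative):

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,       # e.g. an RGB input image
    out_channels=64,     # number of output filters = number of templates
    kernel_size=(3, 3),  # template height and width
    stride=2,            # skips between template matches
    dilation=1,          # > 1 puts "holes" in the template for a larger field-of-view
    padding=1,           # expand the image border, here with zeros
)
x = torch.randn(1, 3, 32, 32)  # [batch, depth, height, width]
print(conv(x).shape)           # torch.Size([1, 64, 16, 16])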
Detector / activation function
• Non-saturating activation functions such as ReLU and leaky ReLU are dominating
Figure: Sigmoid function
Figure: Tanh function
Figure: ReLU function
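The functions in these figures, sketched in NumPy (the leaky-ReLU slope of 0.01 is a common but arbitrary choice):

import numpy as np

def sigmoid(x):   # saturates for large |x|, so gradients vanish there
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):      # zero-centered, but also saturating
    return np.tanh(x)

def relu(x):      # non-saturating for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):  # small slope keeps a gradient for x < 0
    return np.where(x > 0, x, alpha * x)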
Basic CNN architecture for image classi�cation
Image → [Conv → ReLU] x N → Fully Connected → Softmax
• Increase filter depth when using stride
Improve with:
• Batch normalization
• Skip connections à la ResNet or DenseNet
• No fully connected layers; average-pool the predictions instead
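A minimal PyTorch sketch of this pattern, with illustrative layer sizes; note the filter depth doubling at each stride-2 convolution and the average pooling in place of large fully connected layers:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),    # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 4x4
    nn.AdaptiveAvgPool2d(1),  # average-pool each feature map to a single value
    nn.Flatten(),
    nn.Linear(128, 10),       # class logits; softmax is applied inside the loss
)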
Training
How do we fit the model?
How do we find the parameters θ for our network?
Supervised learning
• Training data comes as (X, Y) pairs, where Y is the target
• Want to learn f(x) ∼ p(y|x), the conditional distribution of Y given X
• Define a differentiable surrogate loss function, e.g. for a single sample
l(f(X), Y) = (f(X) − Y)²   (regression)   (1)
l(f(X), Y) = −∑_c Y_c log(f(X)_c)   (classification)   (2)
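The two losses from Eqs. (1) and (2) as NumPy one-liners (variable names are illustrative; f(X) is assumed to output probabilities in the classification case):

import numpy as np

def squared_error(pred, target):           # Eq. (1), regression
    return (pred - target) ** 2

def cross_entropy(probs, target_one_hot):  # Eq. (2), classification
    return -np.sum(target_one_hot * np.log(probs))

# Example: three classes, true class is the second one
print(cross_entropy(np.array([0.2, 0.7, 0.1]),
                    np.array([0.0, 1.0, 0.0])))  # -log(0.7) ≈ 0.357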
Gradient
• The direction in which the function increases the most
Figure: Gradient of the function f(x, y) = x/e^(x² + y²). [By Vivekj78, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons]
Backpropagation
• Efficient bookkeeping scheme for applying the chain rule of differentiation
• Biologically implausible?
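A toy illustration of the bookkeeping for the regression loss l = (w·x − y)² with a single weight (not a full backprop implementation, just the chain rule written out):

# Forward pass: compute and store intermediate values.
x, y, w = 2.0, 1.0, 0.8
f = w * x              # prediction
l = (f - y) ** 2       # loss

# Backward pass: multiply local derivatives, outermost first.
dl_df = 2 * (f - y)    # dl/df = 1.2
df_dw = x              # df/dw = 2.0
dl_dw = dl_df * df_dw  # chain rule: dl/dw = 1.2 * 2.0 = 2.4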
(Stochastic) gradient descent
Taking steps in the opposite direction of the gradient
Figure: [By Vivekj78, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons]
• Full gradient too expensive / not necessary

∑_{i=1}^N ∇_θ l(f(X_i), Y_i) ≈ ∑_{i=1}^n ∇_θ l(f(X_P(i)), Y_P(i))   (3)

for a random permutation P.
Many different extensions to standard SGD: SGD with momentum, RMSprop, Adam.
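As one example of such an extension, a sketch of the SGD-with-momentum update (the learning rate and momentum values are typical defaults, not prescriptions):

import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    # Accumulate an exponentially decaying sum of past gradients,
    # then step along that accumulated direction.
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity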
Network, loss, optimization
• Weight penalty added to the loss term, usually a squared L2 norm applied uniformly to all parameters

l(θ) + λ‖θ‖₂²   (4)

• Dropout
• Batch normalization
  – Intersection of optimization and generalization
  – Your best friend and your worst enemy
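A sketch of Eq. (4) in NumPy, assuming the parameters are given as a list of arrays (the value of λ is illustrative):

import numpy as np

def penalized_loss(data_loss, params, lam=1e-4):
    # Eq. (4): add lambda * ||theta||_2^2, uniformly over all parameters.
    l2 = sum(np.sum(p ** 2) for p in params)
    return data_loss + lam * l2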
More on batch normalization
For a tensor [batch_size × height × width × depth], normalize the "template matching scores" for each template d by
μ_d ← (1 / (N·H·W)) ∑_{i=1}^N ∑_{h=1}^H ∑_{w=1}^W x_{i,h,w,d}   (5)

σ²_d ← (1 / (N·H·W)) ∑_{i=1}^N ∑_{h=1}^H ∑_{w=1}^W (x_{i,h,w,d} − μ_d)²   (6)

x̂_{i,h,w,d} ← (x_{i,h,w,d} − μ_d) / √(σ²_d + ε)   (7)

y_{i,h,w,d} ← γ·x̂_{i,h,w,d} + β   (8)
where N, H and W denote batch size, height and width.
• "Template/feature more present than usual or not"
• During inference we use stored values for μ_d and σ_d.
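Eqs. (5)-(8) translate almost line for line into NumPy; a minimal training-time forward pass (inference would substitute the stored running statistics for μ and σ²):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape [N, H, W, D]; one mean and variance per template d.
    mu = x.mean(axis=(0, 1, 2))                 # Eq. (5)
    var = ((x - mu) ** 2).mean(axis=(0, 1, 2))  # Eq. (6)
    x_hat = (x - mu) / np.sqrt(var + eps)       # Eq. (7)
    return gamma * x_hat + beta                 # Eq. (8)

x = np.random.randn(8, 16, 16, 32)  # batch_size x height x width x depth
y = batch_norm_forward(x, gamma=np.ones(32), beta=np.zeros(32))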
Data augmentation
Idea: apply a random transformation to X that does not alter Y.
• Normally you would like the result X′ to be plausible, i.e. it could have been a sample from the distribution of interest
• Which transformations you may use is application-dependent.

Image data
• Horizontal mirroring (an issue for objects that are not left/right symmetric)
• Random crop
• Scale
• Aspect ratio
• Lighting etc.

Text data
• Synonym insertion
• Back-translation: translate and translate back with e.g. Google Translate!
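A sketch of two of the image transformations in NumPy (the pad-and-crop margin of 4 pixels is a common choice for small images, not something the slides prescribe):

import numpy as np

def augment(image, rng, pad=4):
    # Random horizontal mirroring, then a random crop from a padded image.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    h, w = image.shape[:2]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    y = rng.integers(0, 2 * pad + 1)
    x = rng.integers(0, 2 * pad + 1)
    return padded[y:y + h, x:x + w]

img = np.random.rand(32, 32, 3)
out = augment(img, np.random.default_rng(0))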
Hyperparameters to search
• Learning rate (and learning rate schedule)
• Regularization params: L2, (dropout)
Search strategies
• Random search rather than grid search
• Log scale when appropriate
• Be careful if the best values lie on the border of the search range
• May refine the search around promising values
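A sketch combining these points, with a hypothetical evaluate(lr, l2) standing in for a full training run (the ranges are illustrative):

import numpy as np

rng = np.random.default_rng(0)
trials = []
for _ in range(20):
    # Sample on a log scale: uniform in the exponent, not in the value.
    lr = 10 ** rng.uniform(-5, -1)
    l2 = 10 ** rng.uniform(-6, -2)
    acc = evaluate(lr, l2)  # hypothetical: train and return validation accuracy
    trials.append((acc, lr, l2))
best_acc, best_lr, best_l2 = max(trials)
# If best_lr sits at the edge of [1e-5, 1e-1], widen the range and refine.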
Architecture search
Architecture search
1. Define the search space.
2. Decide upon the optimization algorithm: random search, reinforcement learning, genetic algorithms
Neural architecture search
Figure: An overview of Neural Architecture Search. Figure and caption from [?].
NAS1 - search space
Fixed structure:
• Architecture is a series of layers of the form
conv2D(FH, FW, N) → batch-norm → ReLU
Degrees of freedom:
• Parameters of the conv layer: filter height, filter width and number of output filters
• Input layers to each conv layer
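A hypothetical sketch of sampling one point from such a search space (the value ranges and the encoding are illustrative, not taken from the paper):

import random

def sample_architecture(num_layers):
    # For each layer, pick filter height FH, filter width FW, number of
    # output filters N, and the set of earlier layers feeding into it
    # (index -1 denotes the input image).
    layers = []
    for i in range(num_layers):
        layers.append({
            "FH": random.choice([1, 3, 5, 7]),
            "FW": random.choice([1, 3, 5, 7]),
            "N": random.choice([24, 36, 48, 64]),
            "inputs": [j for j in range(i) if random.random() < 0.5] or [i - 1],
        })
    return layers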
NAS1 - discovered architecture
Figure: FH is filter height, FW is filter width and N is the number of filters. If one layer has many input layers then all input layers are concatenated in the depth dimension. Figure from [?].
NAS2 - search space
Fixed structure:
Figure: Architecture for CIFAR-10 and ImageNet. Figure from [?].
Degrees of freedom:
• Some freedom in the normal cell and the reduction cell, as we shall soon see
NAS2 - discovered convolutional cells
[Figure: diagrams of the Normal Cell (left) and Reduction Cell (right). Each cell combines the hidden states h_i and h_{i-1} through pairs of operations (separable convolutions 3x3, 5x5, 7x7; average pooling 3x3; max pooling 3x3; identity) joined by add nodes, with the results concatenated to form h_{i+1}.]
Figure: NASNet-A identified with CIFAR-10. Figure and caption from [?].
NAS2 - performance vs. computational cost
[Figure: scatter plot of accuracy (precision @ 1) versus the number of multiply-add operations (millions) for Inception-v1/v2/v3/v4, VGG-16, MobileNet, ShuffleNet, ResNet-152, ResNeXt-101, Inception-ResNet-v2, Xception, PolyNet, DPN-131, SENet and the NASNet-A variants (4 @ 1056, 5 @ 1538, 6 @ 4032, 7 @ 1920).]
Figure: Performance on ILSVRC12 as a function of the number of floating-point multiply-add operations needed to process an image. Figure from [?].
NAS2 - performance vs. number of parameters
[Figure: scatter plot of accuracy (precision @ 1) versus the number of parameters (millions) for the same set of models as above.]
Figure: Performance on ILSVRC12 as a function of the number of parameters. Figure from [?].
Bibliography
Bibliography I