
Convolutional Neural Networks and Supervised Learning

Eilif Solberg

August 30, 2018


Outline

Convolutional Architectures
- Convolutional neural networks

Training
- Loss
- Optimization
- Regularization
- Hyperparameter search

Architecture search
- NAS1
- NAS2

Bibliography


Convolutional Architectures


Template matching

Figure: Illustration from http://pixuate.com/technology/template-matching/

1. Try to match the template at each location with a "sliding window" over the image

2. Threshold for detection

For 2D objects this is somewhat possible, but difficult


Convolution

Which filter produces the activation map on the right?


Convolutional layer

=> Glorified template matching

- Many templates (aka output filters)
- We learn the templates; the weights are the templates
- Intermediate detection results are only a means to an end
  - treat them as features, which we again match new templates against
- Starting from the second layer we have "nonlinear filters"
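A minimal NumPy sketch of this view, assuming a single-channel image and a stack of templates; the function name and shapes are illustrative, not taken from the slides:

import numpy as np

def conv_layer(image, templates):
    """Glorified template matching: score every template at every location.
    image: [H, W]; templates: [K, th, tw] -> activations: [K, H', W']."""
    k, th, tw = templates.shape
    h_out = image.shape[0] - th + 1
    w_out = image.shape[1] - tw + 1
    out = np.zeros((k, h_out, w_out))
    for f in range(k):                      # one activation map per template
        for i in range(h_out):
            for j in range(w_out):
                patch = image[i:i + th, j:j + tw]
                out[f, i, j] = np.sum(patch * templates[f])
    return out

Later layers treat these activation maps as features and match new templates against them; in a real network the templates (weights) are learned rather than hand-crafted.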


Hyperparameters of convolutional layer

1. Kernel height and width - template size
2. Stride - skips between template matches
3. Dilation rate - "holes" in the template where we don't care; a larger field-of-view without more weights
4. Number of output filters - number of templates
5. Padding - expand the image, typically with zeros

(A code sketch using these hyperparameters follows the figure below.)

Figure: Image from http://neuralnetworksanddeeplearning.com/
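A hedged PyTorch sketch showing where each hyperparameter above enters a convolutional layer; the concrete numbers are only illustrative:

import torch
import torch.nn as nn

conv = nn.Conv2d(
    in_channels=3,       # RGB input
    out_channels=64,     # 4. number of output filters (templates)
    kernel_size=(3, 3),  # 1. kernel height and width (template size)
    stride=2,            # 2. skips between template matches
    dilation=1,          # 3. "holes" in the template
    padding=1,           # 5. expand the image, typically with zeros
)
x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
print(conv(x).shape)           # torch.Size([1, 64, 16, 16])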


Detector / activation function

- Non-saturating activation functions such as ReLU and leaky ReLU dominate

Figure: Sigmoid function
Figure: Tanh function
Figure: ReLU function
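Minimal NumPy definitions of the activation functions pictured above, to make the saturating/non-saturating distinction concrete:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates for large |x|

def tanh(x):
    return np.tanh(x)                     # saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)             # non-saturating for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope instead of zero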


Basic CNN architecture for image classification

Image -> [Conv -> ReLU] x N -> Fully Connected -> Softmax

- Increase filter depth when using stride

Improve with:

- Batch normalization
- Skip connections à la ResNet or DenseNet
- No fully connected layer; average-pool the predictions instead
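A hedged PyTorch sketch of the basic recipe above for 32x32 RGB inputs (e.g. CIFAR-10); the layer count and widths are illustrative choices, not prescribed by the slides:

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),    # filter depth grows with stride
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 10),  # fully connected head; softmax is usually folded into the loss
)

Batch normalization after each convolution and skip connections (ResNet/DenseNet style) are the standard improvements listed above; replacing the linear head with average pooling over the last feature map gives the "no fully connected" variant.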


Training


How do we fit the model?

How do we find the parameters θ for our network?


Supervised learning

- Training data comes as (X, Y) pairs, where Y is the target
- Want to learn f(x) ∼ p(y|x), the conditional distribution of Y given X
- Define a differentiable surrogate loss function, e.g. for a single sample

l(f(X), Y) = (f(X) − Y)²                  (regression)      (1)
l(f(X), Y) = −∑_c Y_c log(f(X)_c)         (classification)  (2)
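NumPy sketches of the two surrogate losses in (1) and (2) for a single sample; the variable names are illustrative:

import numpy as np

def squared_error(f_x, y):
    """Regression loss (1): (f(X) - Y)^2."""
    return (f_x - y) ** 2

def cross_entropy(f_x, y):
    """Classification loss (2): -sum_c Y_c log f(X)_c,
    where f_x holds predicted class probabilities and y is one-hot."""
    return -np.sum(y * np.log(f_x))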


Gradient

- The direction in which the function increases the most

Figure: Gradient of the function f(x, y) = x/e^(x² + y²). [By Vivekj78 [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons]


Backpropagation

- Efficient bookkeeping scheme for applying the chain rule of differentiation
- Biologically implausible?
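A small sketch of that bookkeeping on a tiny network y_hat = w2 * relu(w1 * x): cache the forward-pass values, then apply the chain rule from the loss backwards; all names are illustrative:

def forward_backward(x, y, w1, w2):
    h = max(0.0, w1 * x)                   # forward: hidden activation (cached)
    y_hat = w2 * h                         # forward: prediction
    loss = (y_hat - y) ** 2

    dloss_dyhat = 2 * (y_hat - y)          # backward: outermost factor first
    dloss_dw2 = dloss_dyhat * h            # reuse the cached h
    dloss_dh = dloss_dyhat * w2
    dloss_dw1 = dloss_dh * (x if w1 * x > 0 else 0.0)  # through the ReLU
    return loss, dloss_dw1, dloss_dw2

Deep-learning frameworks automate exactly this caching and reuse for arbitrary computation graphs.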


(Stochastic) gradient descent

Taking steps in the opposite direction of the gradient

Figure: [By Vivekj78 [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons]

- Full gradient too expensive / not necessary

∑_{i=1}^{N} ∇_θ l(f(X_i), Y_i) ≈ ∑_{i=1}^{n} ∇_θ l(f(X_{P(i)}), Y_{P(i)})    (3)

for a random permutation P.

Many different extensions to standard SGD:
- SGD with momentum, RMSprop, ADAM
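A minimal NumPy sketch of the mini-batch update in (3); grad_fn is an assumed user-supplied function returning the gradient of the loss on a batch:

import numpy as np

def sgd(theta, X, Y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)          # the random permutation P
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            g = grad_fn(theta, X[idx], Y[idx])   # gradient estimate from the mini-batch
            theta = theta - lr * g               # step opposite the gradient
    return theta

Momentum, RMSprop and ADAM change how the step is computed from g but keep this overall loop.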


Network, loss, optimization

- Weight penalty added to the loss term, usually a squared L2 norm applied uniformly to all parameters

l(θ) + λ‖θ‖₂²    (4)

- Dropout
- Batch normalization
  - Intersection of optimization and generalization
  - Your best friend and your worst enemy
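A small sketch of the weight penalty in (4), assuming parameters is an iterable of weight arrays or tensors; in practice the same effect is often obtained through the optimizer's weight_decay argument:

def penalized_loss(data_loss, parameters, lam=1e-4):
    l2 = sum((p ** 2).sum() for p in parameters)  # squared L2 norm of all weights
    return data_loss + lam * l2                   # l(theta) + lambda * ||theta||_2^2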


More on batch normalization

For a tensor [batch_size × height × width × depth], normalize the "template matching scores" for each template d by

μ_d ← (1 / (N·H·W)) ∑_{i=1}^{N} ∑_{h=1}^{H} ∑_{w=1}^{W} x_{i,h,w,d}              (5)

σ²_d ← (1 / (N·H·W)) ∑_{i=1}^{N} ∑_{h=1}^{H} ∑_{w=1}^{W} (x_{i,h,w,d} − μ_d)²    (6)

x̂_{i,h,w,d} ← (x_{i,h,w,d} − μ_d) / √(σ²_d + ε)                                  (7)

y_{i,h,w,d} ← γ x̂_{i,h,w,d} + β                                                  (8)

where N, H and W denote batch size, height and width.

- "Template/feature more present than usual or not"
- During inference we use stored values for μ_d and σ_d.
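A minimal NumPy sketch of training-time batch normalization following (5)-(8), for a tensor of shape [N, H, W, D] with gamma and beta of shape [D]:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=(0, 1, 2))              # (5): per-template mean
    var = x.var(axis=(0, 1, 2))              # (6): per-template variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # (7): normalize the matching scores
    return gamma * x_hat + beta              # (8): learned scale and shift

At inference time mu and var are replaced by running averages stored during training.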


Data augmentation

Idea: apply a random transformation to X that does not alter Y.

- Normally you would like the result X′ to be plausible, i.e. it could have been a sample from the distribution of interest
- Which transformations you may use is application-dependent.

Image data (see the code sketch after the lists below)

- Horizontal mirroring (an issue for objects that are not left/right symmetric)
- Random crop
- Scale
- Aspect ratio
- Lighting etc.

Text data

- Synonym insertion
- Back-translation: translate and translate back with e.g. Google Translate!
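A hedged NumPy sketch of two of the image augmentations listed above (horizontal mirroring and random crop); the crop sizes are illustrative:

import numpy as np

def augment(image, crop_h, crop_w):
    """image: [H, W, C] array; returns a randomly mirrored and cropped view."""
    if np.random.rand() < 0.5:                  # horizontal mirroring
        image = image[:, ::-1]
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop_h + 1)  # random crop position
    left = np.random.randint(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]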


Hyperparameters to search

- Learning rate (and learning rate schedule)
- Regularization parameters: L2 penalty, (dropout)


Search strategies

- Random search rather than grid search
- Use a log scale when appropriate
- Be careful when the best values lie on the border of the search range
- May refine the search
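A hedged sketch of random search on a log scale for the two hyperparameters above; train_and_evaluate is an assumed user-supplied function returning a validation score:

import numpy as np

def random_search(train_and_evaluate, n_trials=20):
    best = None
    for _ in range(n_trials):
        lr = 10 ** np.random.uniform(-5, -1)   # log-uniform learning rate
        l2 = 10 ** np.random.uniform(-6, -2)   # log-uniform L2 strength
        score = train_and_evaluate(lr, l2)
        if best is None or score > best[0]:
            best = (score, lr, l2)
    return best                                # (best score, lr, l2)

If the best values land on the border of the sampled range, widen the range and refine the search around the best region.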


Architecture search


Architecture search

1. Define the search space.
2. Decide upon the optimization algorithm: random search, reinforcement learning, genetic algorithms
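A hedged sketch of the simplest option, random search over a small discrete search space; build_and_train is an assumed user-supplied function that trains a candidate and returns its validation score:

import random

SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "filters": [32, 64, 128],
    "kernel_size": [3, 5, 7],
}

def sample_architecture():
    return {name: random.choice(options) for name, options in SEARCH_SPACE.items()}

def architecture_search(build_and_train, n_trials=10):
    candidates = [sample_architecture() for _ in range(n_trials)]
    scores = [build_and_train(arch) for arch in candidates]
    return candidates[scores.index(max(scores))]   # best-scoring architecture

Reinforcement-learning approaches replace the random sampler with a learned controller that proposes candidates.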


Neural architecture search

Figure: An overview of Neural Architecture Search. Figure and caption from [?].


NAS1 - search space

Fixed structure:

- Architecture is a series of layers of the form

conv2D(FH, FW, N) −→ batch-norm −→ ReLU

Degrees of freedom:

- Parameters of the conv layer: filter height, filter width and number of output filters
- Input layers to each conv layer


NAS1 - discovered architecture

Figure: FH is filter height, FW is filter width and N is the number of filters. If one layer has many input layers then all input layers are concatenated in the depth dimension. Figure from [?].


NAS2 - search space

Fixed structure:

Figure: Architecture for CIFAR-10 and ImageNet. Figure from [?].

Degrees of freedom:

- Some freedom in the normal cell and the reduction cell, as we shall see shortly


NAS2 - discovered convolutional cells

[Figure: diagrams of the Normal Cell and the Reduction Cell, built from hidden states h_{i-1}, h_i, h_{i+1} and operations such as identity, 3x3/5x5/7x7 separable convolutions, 3x3 average pooling and 3x3 max pooling, combined through add and concat nodes.]

Figure: NASNet-A identified with CIFAR-10. Figure and caption from [?].


NAS2 - Performance (computational cost)

[Figure: plot of top-1 accuracy (roughly 65-85%) versus the number of multiply-add operations (millions) for NASNet-A variants, SENet, DPN-131, PolyNet, Inception-ResNet-v2, Inception-v1/v2/v3/v4, Xception, ResNeXt-101, ResNet-152, VGG-16, MobileNet and ShuffleNet.]

Figure: Performance on ILSVRC12 as a function of the number of floating-point multiply-add operations needed to process an image. Figure from [?].


NAS2 - Performance (#parameters)

[Figure: plot of top-1 accuracy (roughly 65-85%) versus the number of parameters (millions) for the same set of models.]

Figure: Performance on ILSVRC12 as a function of the number of parameters. Figure from [?].


Bibliography


Bibliography I