
Hands-on course in deep neural networks for vision

Instructors: Michal Irani, Ronen Basri

Teaching Assistants: Ita Lifshitz, Ethan Fetaya, Amir Rosenfeld

Course website: http://www.wisdom.weizmann.ac.il/~vision/courses/2016_1/DNN/index.html

Schedule (fall semester)

4/11/2015   Introduction
18/11/2015  Tutorial on Caffe, exercise 1
9/12/2015   Q&A on exercise 1
16/12/2015  Submission of exercise 1 (no class)
6/1/2016    Student presentations of recent work
27/1/2016   Project selection

Nerve cells in a microscope

1836 First microscopic image of a nerve cell (Valentin)
1838 First visualization of axons (Remak)
1862 First description of the neuromuscular junction (Kühne)
1873 Introduction of the silver-chromate staining technique (Golgi)
1888 Birth of the neuron doctrine: the nervous system is made up of independent cells (Cajal)
1891 The term “neuron” is coined (Waldeyer)
1897 Concept of the synapse (Sherrington)
1906 Nobel Prize: Cajal and Golgi
1932 Nobel Prize: Sherrington

Nerve cell

Synapse

Action potential

• The human brain contains ~86 billion nerve cells
• The DNA cannot encode a different function for each cell
• Therefore,
  • Each cell must perform a simple computation
  • The overall computation is achieved by ensembles of neurons (“connectionism”)
  • These computations can be changed dynamically by learning (“plasticity”)

A Logical Calculus of the Ideas Immanent in Nervous Activity

Warren S. McCulloch and Walter Pitts, Bulletin of Mathematical Biophysics, Volume 5, 1943

“We shall make the following physical assumptions for our calculus.
1. The activity of the neuron is an "all-or-none" process.
2. A certain fixed number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position on the neuron.
3. The only significant delay within the nervous system is synaptic delay.
4. The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time.
5. The structure of the net does not change with time.”

The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain

Frank Rosenblatt, Psychological Review, Vol. 65, No. 6, 1958

A simplified view of neural computation

• A nerve cell accumulates electric potential at its dendrites
• This accumulation is due to the flux of neurotransmitters from neighboring (input) neurons
• The strength of this potential depends on the efficacy of the synapse and can be positive (“excitatory”) or negative (“inhibitory”)
• Once sufficient electric potential is accumulated (exceeding a certain threshold), one or more action potentials (spikes) are produced. They then travel through the nerve axon and affect nearby (output) neurons

The perceptron

• Input: $x = (x_1, \dots, x_d)$
• Weights: $w = (w_1, \dots, w_d)$
• Output: $\hat{y} = H\left(\sum_{i=1}^{d} w_i x_i - \theta\right)$, with $H$ the Heaviside step function and $\theta$ a threshold
• This is a linear classifier
• Implemented with one layer of weights + threshold (Heaviside step activation)

[Figure: a single-layer perceptron; inputs $x_1, x_2, \dots, x_d$ feed the output unit through weights $w_1, w_2, \dots, w_d$]
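As a concrete illustration, here is a minimal sketch of this classifier in Python/NumPy; the weight values, threshold, and input below are made up for the example, not taken from the course.

import numpy as np

def heaviside(z):
    # Heaviside step activation: 1 if z > 0, else 0
    return float(z > 0)

def perceptron(x, w, theta):
    # Linear classifier: one layer of weights + a threshold
    return heaviside(np.dot(w, x) - theta)

x = np.array([0.5, -1.2, 3.0])   # input x_1..x_d (hypothetical)
w = np.array([0.4,  0.1, 0.8])   # weights w_1..w_d (hypothetical)
theta = 1.0                      # threshold (hypothetical)
print(perceptron(x, w, theta))   # prints 1.0: the weighted sum exceeds the threshold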

Multi-layer Perceptron

• Linear classifiers are very limited, e.g., XOR configurations cannot be classified


Solution: multilayer, feed-forward perceptron

[Figure: a feed-forward network with input layer $x_1, \dots, x_d$, hidden layers $h_1, \dots, h_m$ and $h'_1, \dots, h'_{m'}$, and output layer $\hat{y}_1, \dots, \hat{y}_k$: Input, Hidden 1, Hidden 2, Output]

• Each unit applies a non-linearity to a weighted sum of its inputs; this non-linearity is called the “activation function”

Activation functions

• Heaviside (threshold): $H(x) = 1$ if $x > 0$, else $0$
• Sigmoid (or logistic): $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• Rectified linear unit (ReLU): $\max(x, 0)$
• Max pooling: $\max(x_1, \dots, x_n)$ over a group of inputs
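A short sketch of these activation functions in Python/NumPy (the test values and the pooling group size are illustrative assumptions):

import numpy as np

def heaviside(z):              # threshold activation
    return (z > 0).astype(float)

def sigmoid(z):                # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                   # rectified linear unit
    return np.maximum(z, 0.0)

def max_pool(z, size=2):       # max over non-overlapping groups of `size` inputs
    return z.reshape(-1, size).max(axis=1)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(heaviside(z), sigmoid(z), relu(z), max_pool(z))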

Multi-layer Perceptron

• A perceptron with one hidden layer can approximate any smooth function to an arbitrary accuracy

Supervised learning

• Objective: given labeled training examples $(x_i, y_i)$, learn a map from input to output, $\hat{y} = f(x)$
• Types of problems:
  • Classification – each input is mapped to one of a discrete set of labels, e.g., {person, bike, car}
  • Regression – each input is mapped to a number, e.g., a viewing angle
• How? Given a network architecture, find a set of weights that minimizes a loss function on the training data, e.g., the −log likelihood of the class given the input

• Generalization vs. overfit

• We want to minimize loss on the test data, but the test data is not available in training

• Use a validation set to make sure you don’t overfit

Generalization vs. overfit

[Figure: fitting data points in the (x, y) plane, illustrating generalization vs. overfitting]

• We want to minimize loss on the test data, but the test data is not available in training

• Use a validation set to make sure you don’t overfit

Generalization vs. overfit

[Figure: loss as a function of training iteration, showing the loss on the training set and the loss on the validation set]

Training DNN: back propagation

• A network is a function $\hat{y} = f(x; w)$ with weights $w$
• Objective: modify $w$ to improve the prediction of $f$ on the training (and validation) data
• The quality of $f$ is measured via the loss function $L(w)$
• Back propagation: minimize $L(w)$ by gradient descent
• The gradient is computed by a backward pass, applying the chain rule

Calculating the gradient

• Loss: $L(w)$, a function of the network outputs $\hat{y}_k$ and the targets $y_k$
• Activation: logistic, $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ (note: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$)
• For a top-layer weight $w'_{jk}$, with $\hat{y}_k = \sigma(z_k)$ and $z_k = \sum_j w'_{jk} x'_j$, the chain rule gives
  $\dfrac{\partial L}{\partial w'_{jk}} = \dfrac{\partial L}{\partial \hat{y}_k}\, \sigma'(z_k)\, x'_j$

[Figure: a two-layer network with inputs $x_1, \dots, x_d$, first-layer weights $w_{11}, \dots, w_{dd'}$, hidden units $x'_1, \dots, x'_{d'}$, second-layer weights $w'_{11}, \dots, w'_{d'k}$, and outputs $\hat{y}_1, \dots, \hat{y}_k$]
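To make the chain-rule expression concrete, here is a hedged numerical check: the analytic derivative of the loss with respect to one top-layer weight is compared against a finite-difference estimate. A squared-error loss and random activations are chosen only for illustration; the slides do not fix these.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_prime = rng.normal(size=4)        # hidden activations x'_1..x'_4 (made up)
w_prime = rng.normal(size=4)        # top-layer weights into output unit k (made up)
y = 1.0                             # target for output unit k

def loss(w):
    y_hat = sigmoid(np.dot(w, x_prime))   # y_hat_k = sigma(sum_j w'_jk x'_j)
    return 0.5 * (y_hat - y) ** 2         # illustrative squared-error loss

# Analytic gradient via the chain rule: dL/dw'_jk = (y_hat - y) * sigma'(z_k) * x'_j
z = np.dot(w_prime, x_prime)
y_hat = sigmoid(z)
grad_analytic = (y_hat - y) * y_hat * (1 - y_hat) * x_prime

# Finite-difference check of one component
eps = 1e-6
j = 0
w_plus, w_minus = w_prime.copy(), w_prime.copy()
w_plus[j] += eps
w_minus[j] -= eps
grad_numeric = (loss(w_plus) - loss(w_minus)) / (2 * eps)
print(grad_analytic[j], grad_numeric)   # the two values should agree closely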

Calculating the gradient

• The derivatives are computed recursively, starting with $\dfrac{\partial L}{\partial \hat{y}_k}$ at the output layer and propagating backward through the hidden layers via the chain rule

[Figure: the same two-layer network; the backward pass flows from the outputs $\hat{y}_1, \dots, \hat{y}_k$ through the weights $w'$ and $w$ back to the inputs $x_1, \dots, x_d$]

Training algorithm

• Initialize with random weights
• Forward pass: given a training input vector $x$, apply the network to $x$ and store all intermediate results
• Backward pass: starting from the top, recursively use the chain rule to calculate the derivatives $\partial L / \partial h$ for all nodes, and use those derivatives to calculate $\partial L / \partial w$ for all edges
• Repeat for all training vectors; the gradient collects, for every edge, the sum of these $\partial L / \partial w$ terms over the training vectors
• Weight update: $w \leftarrow w - \eta \, \nabla L(w)$, with learning rate $\eta$ (a sketch of the full procedure follows below)
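A compact sketch of this training algorithm on the XOR problem mentioned earlier: one hidden layer, logistic activations, a squared-error loss, and full-batch gradient descent. The layer sizes, learning rate, and iteration count are illustrative choices, not the course's.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training set (inputs and labels)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8))      # input -> hidden weights (random initialization)
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))      # hidden -> output weights
b2 = np.zeros(1)
eta = 0.5                         # learning rate (illustrative)

for it in range(10000):
    # Forward pass: apply the network and store intermediate results
    H = sigmoid(X @ W1 + b1)                  # hidden activations
    Y_hat = sigmoid(H @ W2 + b2)              # network output
    # Backward pass: chain rule, starting from the top
    dY = (Y_hat - Y) * Y_hat * (1 - Y_hat)    # derivative at the output units
    dW2 = H.T @ dY                            # gradients summed over training vectors
    db2 = dY.sum(axis=0)
    dH = (dY @ W2.T) * H * (1 - H)            # propagate to the hidden layer
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)
    # Weight update: w <- w - eta * gradient
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

print(np.round(Y_hat.ravel(), 2))   # typically approaches [0, 1, 1, 0] (up to local minima)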

Stochastic gradient descent

• Gradient descent requires computing the gradient of the (full) loss over all of the training data at every step; with a large training set this is expensive
• Approach: compute the gradient over a small sample (“mini-batch”), usually by re-shuffling the training set; going once through the entire training data is called an epoch
• If the learning rate decreases appropriately, and under mild assumptions, this converges almost surely to a local minimum
• Momentum: pass the gradient from one iteration to the next (with decay), i.e., $v \leftarrow \mu v - \eta \nabla L(w)$, $w \leftarrow w + v$ (a sketch follows below)
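A hedged sketch of mini-batch SGD with momentum. Here `grad_loss(w, batch)` is a placeholder for the gradient of the loss over a mini-batch, assumed to be provided by back-propagation code such as the sketch above; the default hyperparameters are illustrative.

import numpy as np

def sgd_momentum(w, data, grad_loss, eta=0.01, mu=0.9, batch_size=32, epochs=10):
    # w: weight vector; data: sequence of training examples
    # grad_loss(w, batch): gradient of the loss on the mini-batch (assumed given)
    v = np.zeros_like(w)                          # momentum (velocity) term
    n = len(data)
    rng = np.random.default_rng(0)
    for epoch in range(epochs):                   # one epoch = one pass over the data
        order = rng.permutation(n)                # re-shuffle the training set
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            g = grad_loss(w, batch)               # gradient over the mini-batch
            v = mu * v - eta * g                  # pass the gradient on, with decay mu
            w = w + v                             # momentum update
    return w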

Regularization: dropout

• At training time we randomly eliminate half of the nodes in the network
• At test time we use the full network, but each weight is halved
• This spreads the representation of the data over multiple nodes
• Informally, it is equivalent to training many different networks and using their average response
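A minimal sketch of the dropout scheme described above, for a single hidden layer: at training time a random half of the hidden nodes is zeroed, and at test time the full layer is used with its outputs halved (equivalent to halving the outgoing weights). Layer sizes and inputs are made up.

import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, train=True, drop_prob=0.5):
    h = np.maximum(x @ W, 0.0)                      # ReLU hidden activations
    if train:
        mask = rng.random(h.shape) >= drop_prob     # randomly keep ~half of the nodes
        return h * mask
    else:
        return h * (1.0 - drop_prob)                # full network, halved contribution

x = rng.normal(size=(1, 8))          # one input vector (hypothetical size)
W = rng.normal(size=(8, 16))         # input -> hidden weights (hypothetical size)
print(hidden_layer(x, W, train=True))
print(hidden_layer(x, W, train=False))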

Invariance

• Invariance to translation, rotation and scale, as well as to some transformations of intensity, is often desired
• Can be achieved by perturbing the training set (data augmentation) – useful for small transformations (but expensive)
• Translation invariance is further achieved with max pooling and/or convolution
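A small sketch of data augmentation by perturbing the training images with translations and horizontal reflections; the toy image array and shift sizes are made up.

import numpy as np

def augment(image, rng, max_shift=2):
    # image: H x W array; returns a randomly translated and possibly mirrored copy
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(np.roll(image, dy, axis=0), dx, axis=1)   # small translation (with wrap-around)
    if rng.random() < 0.5:
        out = out[:, ::-1]                                  # horizontal reflection
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8))                 # a toy 8x8 "image"
augmented = [augment(img, rng) for _ in range(4)]
print(len(augmented), augmented[0].shape)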

Invariance: convolutional neural nets

• Translation invariance can be achieved by conv-nets
• A conv-net learns a linear filter at the first level, and non-linear ones higher up
• SIFT is an example of a useful non-linear filter, but a learned filter may be more suitable
• Inspired by the local receptive fields in the V1 visual cortex area
• Conv-nets were applied successfully to digit recognition in the 90’s (LeCun et al. 1998), but at the time did not scale well to other kinds of images

[Figure: a 1-D convolutional layer mapping inputs $x_1, \dots, x_d$ to outputs $x'_1, \dots, x'_{d'}$; weight sharing: edges of the same color carry the same weight]
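A hedged sketch of the weight sharing in the figure: a 1-D convolutional layer in which the same small filter is applied at every position of the input. The filter values and input are made up.

import numpy as np

def conv1d(x, w):
    # Every output unit x'_i uses the SAME weights w (weight sharing),
    # applied to a local window of the input ("valid" correlation).
    d, k = len(x), len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(d - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # inputs x_1..x_d
w = np.array([0.25, 0.5, 0.25])                # one shared filter (hypothetical values)
print(conv1d(x, w))                            # outputs x'_1..x'_{d'}: [2. 3. 4. 5.]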

Alex-net
Krizhevsky, Sutskever, Hinton 2012

• Trained and tested on ImageNet: 1.2M training images, 50K validation, 150K test. 1000 categories
• Loss: softmax – the top layer has $k$ nodes $z_1, \dots, z_k$ (here $k = 1000$ categories). The softmax function renormalizes them by $p_i = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$
• The network maximizes the multinomial logistic regression objective, that is, the average of $\log p_y$ over the training images of class $y$ (a sketch follows below)
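A short sketch of the softmax and of the multinomial logistic regression objective described above, with made-up scores and labels and a small k for readability.

import numpy as np

def softmax(z):
    # Renormalize the top-layer scores z_1..z_k into probabilities p_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - z.max())           # subtract the max for numerical stability
    return e / e.sum()

# Toy example: k = 3 categories, two "training images" with their correct classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])    # hypothetical top-layer outputs
labels = np.array([0, 2])                # correct class of each image

p = np.array([softmax(s) for s in scores])
objective = np.mean(np.log(p[np.arange(len(labels)), labels]))
print(p)
print(objective)   # the network is trained to maximize this average log-probability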

Alex-net

• Activation: ReLU
• Data augmentation:
  • Translation
  • Horizontal reflection
  • Gray-level shifts (along the principal components of the RGB pixel values)
• Dropout
• Momentum

[Figure: the tanh and ReLU activation functions compared]

Alex-net

Alex-net: results

• ILSVRC-2010
• ILSVRC-2012 (table below)
• More recent networks have reduced the error to ~7%

Team name                 | Error (5 guesses) | Description
SuperVision               | 0.15315           | Using extra training data from ImageNet 2011
SuperVision               | 0.16422           | Using only supplied training data
ISI                       | 0.26172           | Weighted sum of scores: SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV
OXFORD_VGG                | 0.26979           | Mixed selection from High-Level SVM and Baseline Scores
XRCE/INRIA                | 0.27058           |
University of Amsterdam   | 0.29576           | Baseline: SVM trained on Fisher Vectors over Dense SIFT and Color Statistics

Alex-net: results

(Krizhevsky et al. 2012)

Applications

• Image classification
• Face recognition
• Object detection
• Object tracking
• Low-level vision
  • Optical flow, stereo
  • Super-resolution, de-blurring
  • Edge detection
  • Image segmentation
• Attention and saliency
• Image and video captioning
• …

Face recognition

• Google’s conv-net is trained on 260M face images
• It achieved 99.63% accuracy on LFW (a face comparison database), and some of its mistakes turned out to be labeling mistakes
• Available in Google Photos

(Schroff et al. 2015)

Object detection

• Latest methods achieve average precision of about 60% on PASCAL VOC 2007 and 44% on Imagenet ILSVRC 2014

(He et al. 2015)

Unsupervised learning

• Find structure in data
• Types of problems:
  • Clustering
  • Density estimation
  • Dimensionality reduction

Auto-encoders

• Produce a compact representation of the training data
• Analogous to PCA
• Note that the identity transformation may be a valid (but undesired) solution
• Can be initialized by training a Restricted Boltzmann Machine

[Figure: an auto-encoder with input layer $x_1, \dots, x_d$, a hidden layer $h_1, \dots, h_m$, and an output layer $\hat{x}_1, \dots, \hat{x}_d$ trained so that “Output = Input”]
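A minimal sketch of a linear auto-encoder (the PCA-like case) trained by gradient descent to reproduce its input through a narrow hidden layer. The data, layer sizes, learning rate, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # toy training data, d = 5 (made up)
m = 2                                         # bottleneck size, m < d
W_enc = rng.normal(scale=0.1, size=(5, m))    # encoder weights
W_dec = rng.normal(scale=0.1, size=(m, 5))    # decoder weights
eta = 0.1                                     # learning rate (illustrative)

for it in range(5000):
    H = X @ W_enc                             # compact hidden representation
    X_hat = H @ W_dec                         # reconstruction; the target is the input itself
    err = X_hat - X
    loss = 0.5 * (err ** 2).sum() / len(X)
    # Gradients of the mean squared reconstruction error
    dW_dec = H.T @ err / len(X)
    dW_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= eta * dW_dec
    W_enc -= eta * dW_enc

print(loss)   # reconstruction error: decreases, but cannot reach 0 since m < d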

Recurrent networks: Hopfield net
John Hopfield, 1982

• Fully connected network
• Time dynamics: starting from an input, apply the network repeatedly until convergence
• Update: $x_i \leftarrow \operatorname{sign}\left(\sum_j w_{ij} x_j\right)$, which minimizes an Ising-model energy $E = -\tfrac{1}{2} \sum_{i,j} w_{ij} x_i x_j$
• Weights are set to store preferred states
• “Associative memory” (content addressable): denoising and completion

[Figure: a fully connected network over units $x_1, x_2, x_3, x_4, \dots, x_d$]
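A brief sketch of a Hopfield net used as an associative memory: weights store a few binary patterns with a Hebbian rule, and the update is applied repeatedly to denoise a corrupted input. The patterns, network size, and noise level are made up.

import numpy as np

rng = np.random.default_rng(0)
d = 20
patterns = np.sign(rng.normal(size=(3, d)))        # preferred states to store (+/-1 vectors)

# Hebbian weights: W = sum_p p p^T / d, with zero diagonal
W = sum(np.outer(p, p) for p in patterns) / d
np.fill_diagonal(W, 0.0)

def update(x, steps=10):
    # Repeatedly apply x_i <- sign(sum_j w_ij x_j)
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1.0                            # break ties arbitrarily
    return x

# Corrupt a stored pattern by flipping a few entries, then denoise it
x = patterns[0].copy()
flip = rng.choice(d, size=3, replace=False)
x[flip] *= -1
recovered = update(x)
print(np.array_equal(recovered, patterns[0]))      # often True: the stored state is recovered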

Recurrent networks

• Used, e.g., for image annotation, i.e., producing a descriptive sentence for an input image

[Figure: a recurrent network that takes the image as input and predicts the next word at each step]

Recurrent networks (unrolled)

• Used, e.g., for image annotation, i.e., producing a descriptive sentence for an input image

[Figure: the same network unrolled in time; the image is fed in and the first word, second word, … of the sentence are produced in sequence]

(Kiros et al. 2014)

Neural network development packages

• Caffe: http://caffe.berkeleyvision.org/

• Matconvnet: http://www.vlfeat.org/matconvnet/

• Torch: https://github.com/torch/torch7/wiki/Cheatsheet