Learning, Neural Networks and Graphical Models (RCP209)
Neural Networks and Deep Learning
Nicolas
[email protected]
http://cedric.cnam.fr/vertigo/Cours/ml2/
Département Informatique, Conservatoire National des Arts et Métiers (Cnam)
8 Weeks on Deep Learning and Structured Prediction
• Weeks 1-5: Deep Learning
• Weeks 6-8: Structured Prediction and Applications
Outline
1 Context
2 Neural Networks
3 Training Deep Neural Networks
Context
Big Data
• Superabundance of data: images, videos, audio, text, usage traces, etc.
  BBC: 2.4M videos · Facebook: 350B images (1B each day) · 100M monitoring cameras
• Obvious need to access, search, or classify these data: Recognition
• Huge number of applications: mobile visual search, robotics, autonomous driving, augmented reality, medical imaging, etc.
• Leading track in major ML/CV conferences during the last decade
Recognition and classification
• Classification: assign a given input to one of a set of pre-defined classes
• Recognition: much more general than classification, e.g.
  Object localization in images
  Ranking for document indexing
  Sequence prediction for text, speech, audio, etc.
• Many tasks can be cast as classification problems ⇒ importance of classification
Focus on Visual Recognition: Perceiving Visual World
• Visual Recognition: archetype of low-level signal understanding
• Supposed to be a master's-level class problem in the early 80's
• Certainly the topic most impacted by deep learning
• Scene categorization
• Object localization
• Context & attribute recognition
• Rough 3D layout, depth ordering
• Rich description of the scene, e.g. sentences
Recognition of low-level signals
Challenge: filling the semantic gap
What we perceive vs. what a computer sees
• Illumination variations
• Viewpoint variations
• Deformable objects
• Intra-class variance
• etc.
⇒ How to design "good" intermediate representations?
Deep Learning (DL) & Recognition of low-level signals
• DL: breakthrough for the recognition of low-level signal data
• Before DL: handcrafted intermediate representations for each task
  ⊖ Needs expertise (PhD level) in each field
  ⊖ Weak level of semantics in the representation
Deep Learning (DL) & Recognition of low-level signals
• DL: breakthrough for the recognition of low-level signal data
• Since DL: automatically learning intermediate representations
  ⊕ Outstanding experimental performances ≫ handcrafted features
  ⊕ Able to learn high-level intermediate representations
  ⊕ Common learning methodology ⇒ field independent, no expertise required
Deep Learning (DL) & Representation Learning
• DL: breakthrough for representation learning
  Automatically learning intermediate levels of representation
• Example: Natural Language Processing (NLP)
Outline
1 Context
2 Neural Networks
3 Training Deep Neural Networks
The Formal Neuron: 1943 [MP43]
• Basis of Neural Networks
• Input: vector x ∈ R^m, i.e. x = {x_i}, i ∈ {1, 2, ..., m}
• Neuron output y ∈ R: scalar
The Formal Neuron: 1943 [MP43]
• Mapping from x to y:
  1. Linear (affine) mapping: s = w⊺x + b
  2. Non-linear activation function f: y = f(s)
The Formal Neuron: Linear Mapping
• Linear (affine) mapping: s = w⊺x + b = Σ_{i=1}^{m} w_i x_i + b
  w: normal vector to a hyperplane in R^m ⇒ linear boundary
  b: bias, shifts the hyperplane position
  In 2D the hyperplane is a line; in 3D it is a plane.
The Formal Neuron: Activation Function
• y = f(w⊺x + b), f activation function
  Popular choices for f: step, sigmoid, tanh
• Step (Heaviside) function: H(z) = 1 if z ≥ 0, 0 otherwise
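As an illustration, a minimal NumPy sketch of this formal neuron with a step activation; the input and weight values are arbitrary, not taken from the course:

```python
import numpy as np

def heaviside(z):
    """Step (Heaviside) activation: 1 if z >= 0, 0 otherwise."""
    return (z >= 0).astype(float)

def formal_neuron(x, w, b, f=heaviside):
    """y = f(w^T x + b) for a single neuron."""
    s = np.dot(w, x) + b          # linear (affine) mapping
    return f(s)                   # non-linear activation

# Illustrative values only
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.2, -0.5])
b = 0.1
print(formal_neuron(x, w, b))     # 0.0 or 1.0
```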
Step function: Connection to Biological Neurons
• Formal neuron with step activation H: y = H(w⊺x + b)
  y = 1 (activated) ⇔ w⊺x ≥ −b
  y = 0 (not activated) ⇔ w⊺x < −b
• Biological neurons: output activated ⇔ input weighted by synaptic weights ≥ threshold
Sigmoid Activation Function
• Neuron output y = f(w⊺x + b), f activation function
• Sigmoid: σ(z) = (1 + e^{−az})^{−1}
• Larger a: closer to the step function (step: a → ∞)
• Sigmoid: linear and saturating regimes
The Formal neuron: Application to Binary Classification
• Binary classification: label input x as belonging to class 1 or 0
• Neuron output with sigmoid: y = 1 / (1 + e^{−a(w⊺x+b)})
• Sigmoid: probabilistic interpretation ⇒ y ∼ P(1/x)
  Input x classified as 1 if P(1/x) > 0.5 ⇔ w⊺x + b > 0
  Input x classified as 0 if P(1/x) < 0.5 ⇔ w⊺x + b < 0
  ⇒ sign(w⊺x + b): linear decision boundary in input space!
The Formal Neuron: Toy Example for Binary Classification
• 2d example: m = 2, x = {x_1, x_2} ∈ [−5; 5] × [−5; 5]
• Linear mapping: w = [1; 1] and b = −2
• Result of the linear mapping: s = w⊺x + b
• Sigmoid activation function: y = (1 + e^{−a(w⊺x+b)})^{−1}, shown for a = 10, a = 1 and a = 0.1 (the larger a, the sharper the transition around the boundary)
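A possible NumPy sketch of this toy example, evaluating the neuron on a grid of the input square for the three slopes a shown above (plotting is left out):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -2.0            # toy parameters from the slide
x1, x2 = np.meshgrid(np.linspace(-5, 5, 101),
                     np.linspace(-5, 5, 101))
s = w[0] * x1 + w[1] * x2 + b                # linear mapping on the grid

for a in (10.0, 1.0, 0.1):                   # slope of the sigmoid
    y = 1.0 / (1.0 + np.exp(-a * s))         # sigmoid activation
    # y approaches 0/1 far from the line x1 + x2 = 2; sharper transition as a grows
    print(a, y.min().round(3), y.max().round(3))
```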
From Formal Neuron to Neural Networks
• Formal Neuron:
  1. A single scalar output
  2. Linear decision boundary for binary classification
• Single scalar output: limited for several tasks
  Ex: multi-class classification, e.g. MNIST or CIFAR
Perceptron and Multi-Class Classification
• Formal Neuron: limited to binary classification
• Multi-Class Classification: use several output neurons instead of a single one! ⇒ Perceptron
• Input x in R^m
• Output neuron y_1 is a formal neuron:
  Linear (affine) mapping: s_1 = w_1⊺x + b_1
  Non-linear activation function f: y_1 = f(s_1)
• Linear mapping parameters:
  w_1 = {w_11, ..., w_m1} ∈ R^m
  b_1 ∈ R
Perceptron and Multi-Class Classification
• Input x in R^m
• Output neuron y_k is a formal neuron:
  Linear (affine) mapping: s_k = w_k⊺x + b_k
  Non-linear activation function f: y_k = f(s_k)
• Linear mapping parameters:
  w_k = {w_1k, ..., w_mk} ∈ R^m
  b_k ∈ R
Perceptron and Multi-Class Classification
• Input x in R^m (1 × m), output y: concatenation of K formal neurons
• Linear (affine) mapping ∼ matrix multiplication: s = xW + b
  W: matrix of size m × K, whose columns are the w_k
  b: bias vector of size 1 × K
• Element-wise non-linear activation: y = f(s)
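A minimal sketch of this affine layer in NumPy; the sizes m and K and the random initialization are illustrative assumptions:

```python
import numpy as np

m, K = 4, 3                                  # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(m, K))                  # columns are the w_k
b = np.zeros((1, K))                         # bias vector, shape 1 x K

x = rng.normal(size=(1, m))                  # one input, shape 1 x m
s = x @ W + b                                # affine mapping, shape 1 x K
```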
Perceptron and Multi-Class Classification
• Soft-max activation: y_k = f(s_k) = e^{s_k} / Σ_{k′=1}^{K} e^{s_{k′}}
• Probabilistic interpretation for multi-class classification:
  Each output neuron ⇔ one class
  y_k ∼ P(k/x, w)
⇒ Logistic Regression (LR) model!
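A soft-max sketch matching the formula above; the max subtraction is a standard numerical-stability trick, not something the slide requires:

```python
import numpy as np

def softmax(s):
    """Row-wise soft-max: y_k = exp(s_k) / sum_k' exp(s_k')."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))   # stability trick
    return e / e.sum(axis=-1, keepdims=True)

y = softmax(np.array([[2.0, 1.0, -1.0]]))
print(y, y.sum())                                   # probabilities summing to 1
```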
2d Toy Example for Multi-Class Classification
• x = {x_1, x_2} ∈ [−5; 5] × [−5; 5], y: 3 outputs (classes)
• Linear mapping for each class: s_k = w_k⊺x + b_k, with
  w_1 = [1; 1], b_1 = −2   w_2 = [0; −1], b_2 = 1   w_3 = [1; −0.5], b_3 = 10
• Soft-max output: P(k/x, W)
• Class prediction: k* = arg max_k P(k/x, W)
Beyond Linear Classification
X-OR Problem
• Logistic Regression (LR): NN with 1 input layer & 1 output layer
• LR: limited to linear decision boundaries
• X-OR: (NOT x_1 AND x_2) OR (x_1 AND NOT x_2)
X-OR: non-linear decision function
Beyond Linear Classification
• LR: limited to linear boundaries
• Solution: add a layer!
• Input x in R^m, e.g. m = 4
• Output y in R^K (K = number of classes), e.g. K = 2
• Hidden layer h in R^L
Multi-Layer Perceptron
• Hidden layer h: projection of x into a new space R^L
• Neural net with ≥ 1 hidden layer: Multi-Layer Perceptron (MLP)
• h: intermediate representation of x for the classification y: h = f(xW + b)
• Mapping from x to y: non-linear boundary! ⇒ activation f crucial!
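A sketch of the corresponding forward pass with one sigmoid hidden layer and a soft-max output; layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

m, L, K = 4, 8, 2                                   # input, hidden, output sizes
rng = np.random.default_rng(0)
Wh, bh = rng.normal(size=(m, L)), np.zeros((1, L))  # hidden-layer parameters
Wy, by = rng.normal(size=(L, K)), np.zeros((1, K))  # output-layer parameters

x = rng.normal(size=(1, m))
h = sigmoid(x @ Wh + bh)                            # hidden representation of x
y = softmax(h @ Wy + by)                            # class probabilities
```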
Deep Neural Networks
• Adding more hidden layers: Deep Neural Networks (DNN) ⇒ basis of Deep Learning
• Each layer h_l projects layer h_{l−1} into a new space
• Gradually learning intermediate representations useful for the task
Conclusion
• Deep Neural Networks: applicable to classification problems with non-linear decision boundaries
• So far: visualizing predictions from fixed model parameters
• Reverse problem, finding the parameters from data: Supervised Learning
Outline
1 Context
2 Neural Networks
3 Training Deep Neural Networks
Training Multi-Layer Perceptron (MLP)
• Input x, output y
• A parametrized model x ⇒ y: f_w(x_i) = y_i
• Supervised context:
  Training set A = {(x_i, y*_i)}, i ∈ {1, 2, ..., N}
  A loss function ℓ(y_i, y*_i) for each annotated pair (x_i, y*_i)
• Assumptions: parameters w ∈ R^d continuous, loss L differentiable
• Gradient ∇_w = ∂L/∂w: steepest direction to decrease the loss L
MLP Training
• Gradient descent algorithm:
  Initialize parameters w
  Update: w^(t+1) = w^(t) − η ∂L/∂w
  Until convergence, e.g. ||∇_w||² ≈ 0
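Schematically, the algorithm can be written as below; loss_and_grad is a placeholder for the model-specific loss and gradient computation, and the quadratic example is only illustrative:

```python
import numpy as np

def gradient_descent(loss_and_grad, w0, eta=0.1, tol=1e-6, max_iter=10_000):
    """Generic gradient descent: w <- w - eta * dL/dw until the gradient vanishes."""
    w = w0.astype(float).copy()
    for _ in range(max_iter):
        _, grad = loss_and_grad(w)          # loss value and gradient dL/dw
        w -= eta * grad                     # update rule
        if np.linalg.norm(grad) < tol:      # stop when ||grad|| ~ 0
            break
    return w

# Toy convex example: L(w) = ||w||^2, whose minimum is w = 0
w_star = gradient_descent(lambda w: (w @ w, 2 * w), np.array([3.0, -2.0]))
```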
Gradient Descent
Update rule: w^(t+1) = w^(t) − η ∂L/∂w, with η the learning rate
• Convergence ensured? ⇒ provided a "well-chosen" learning rate η
Gradient Descent
Update rule: w^(t+1) = w^(t) − η ∂L/∂w
• Global minimum? ⇒ convex a) vs. non-convex b) loss L(w)
a) Convex function   b) Non-convex function
Supervised Learning: Multi-Class Classification
• Logistic Regression for multi-class classification
• s_i = x_iW + b
• Soft-Max (SM): y_k ∼ P(k/x_i, W, b) = e^{s_k} / Σ_{k′=1}^{K} e^{s_{k′}}
• Supervised loss function: L(W, b) = (1/N) Σ_{i=1}^{N} ℓ(y_i, y*_i)
  1. y ∈ {1; 2; ...; K}
  2. y_i = arg max_k P(k/x_i, W, b)
  3. ℓ_{0/1}(y_i, y*_i) = 1 if y_i ≠ y*_i, 0 otherwise: the 0/1 loss
Logistic Regression Training Formulation
• Input x_i, ground-truth output supervision y*_i
• One-hot encoding for y*_i:
  y*_{c,i} = 1 if c is the ground-truth class for x_i, 0 otherwise
Logistic Regression Training Formulation
• Loss function: multi-class Cross-Entropy (CE) ℓ_CE
• ℓ_CE: Kullback-Leibler divergence between y*_i and y_i
  ℓ_CE(y_i, y*_i) = KL(y*_i, y_i) = −Σ_{c=1}^{K} y*_{c,i} log(y_{c,i}) = −log(y_{c*,i})
• ⚠ KL is asymmetric: KL(y_i, y*_i) ≠ KL(y*_i, y_i) ⚠
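A sketch of this cross-entropy loss with one-hot targets; the small eps guarding the log is an implementation detail, not part of the formula:

```python
import numpy as np

def cross_entropy(y, y_star, eps=1e-12):
    """l_CE = -sum_c y*_c log(y_c) = -log(y_{c*}) for one-hot y*."""
    return -np.sum(y_star * np.log(y + eps), axis=-1)

y      = np.array([[0.7, 0.2, 0.1]])   # predicted probabilities
y_star = np.array([[1.0, 0.0, 0.0]])   # one-hot ground truth
print(cross_entropy(y, y_star))        # equals -log(0.7)
```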
Logistic Regression Training
• L_CE(W, b) = (1/N) Σ_{i=1}^{N} ℓ_CE(y_i, y*_i) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i})
• ℓ_CE: smooth convex upper bound of ℓ_{0/1} ⇒ gradient descent optimization
• Gradient descent: W^(t+1) = W^(t) − η ∂L_CE/∂W (and b^(t+1) = b^(t) − η ∂L_CE/∂b)
• MAIN CHALLENGE: computing ∂L_CE/∂W = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W?
⇒ Key property: chain rule ∂x/∂z = (∂x/∂y)(∂y/∂z)
⇒ Backpropagation of the gradient error!
Chain Rule
∂ℓ/∂x = (∂ℓ/∂y)(∂y/∂x)
• Logistic regression: ∂ℓ_CE/∂W = (∂ℓ_CE/∂y_i)(∂y_i/∂s_i)(∂s_i/∂W)
Logistic Regression Training: Backpropagation
∂ℓ_CE/∂W = (∂ℓ_CE/∂y_i)(∂y_i/∂s_i)(∂s_i/∂W), with ℓ_CE(y_i, y*_i) = −log(y_{c*,i}) ⇒ update for 1 example:
• ∂ℓ_CE/∂y_i = −1/y_{c*,i} = (−1/y_i) ⊙ δ_{c,c*}
• ∂ℓ_CE/∂s_i = y_i − y*_i = δ^y_i
• ∂ℓ_CE/∂W = x_i⊺ δ^y_i
Logistic Regression Training: Backpropagation
• Whole dataset: data matrix X (N × m), label matrices Y, Y* (N × K)
• L_CE(W, b) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i}), ∂L_CE/∂W = (∂L_CE/∂Y)(∂Y/∂S)(∂S/∂W)
• ∂L_CE/∂S = Y − Y* = ∆^y
• ∂L_CE/∂W = X⊺ ∆^y
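With these formulas, the whole-dataset gradients take a few NumPy lines; the explicit 1/N averaging is an assumption about where the normalization is applied:

```python
import numpy as np

def logreg_gradients(X, Y, Y_star):
    """Soft-max regression gradients: Delta_y = Y - Y*, dW = X^T Delta_y.
    X: N x m data matrix, Y: N x K predicted probabilities, Y_star: N x K one-hot labels."""
    N = X.shape[0]
    delta_y = (Y - Y_star) / N          # N x K, averaged over the dataset
    dW = X.T @ delta_y                  # m x K
    db = delta_y.sum(axis=0)            # K
    return dW, db
```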
Perceptron Training: Backpropagation
• Perceptron vs Logistic Regression: adding a hidden layer (sigmoid)
• Goal: train parameters W^y and W^h (+ biases) with Backpropagation
  ⇒ computing ∂L_CE/∂W^y = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W^y and ∂L_CE/∂W^h = (1/N) Σ_{i=1}^{N} ∂ℓ_CE/∂W^h
• Last hidden layer ∼ Logistic Regression
• First hidden layer: ∂ℓ_CE/∂W^h = x_i⊺ ∂ℓ_CE/∂u_i ⇒ computing ∂ℓ_CE/∂u_i = δ^h_i
Perceptron Training: Backpropagation
• Computing ∂ℓ_CE/∂u_i = δ^h_i ⇒ use the chain rule: ∂ℓ_CE/∂u_i = (∂ℓ_CE/∂v_i)(∂v_i/∂h_i)(∂h_i/∂u_i)
• ... leading to: ∂ℓ_CE/∂u_i = δ^h_i = δ^y_i W^{y⊺} ⊙ σ′(h_i) = δ^y_i W^{y⊺} ⊙ (h_i ⊙ (1 − h_i))
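A sketch of this backward step for a single example, using the row-vector convention of the previous slides and assuming δ^y_i = y_i − y*_i has already been computed:

```python
import numpy as np

def backprop_hidden(x, h, delta_y, Wy):
    """Backprop through a sigmoid hidden layer.
    Shapes: x is 1 x m, h is 1 x L, delta_y is 1 x K, Wy is L x K."""
    delta_h = (delta_y @ Wy.T) * h * (1.0 - h)   # delta_h = delta_y Wy^T (.) sigma'(h)
    dWy = h.T @ delta_y                          # output-layer gradient, L x K
    dWh = x.T @ delta_h                          # hidden-layer gradient, m x L
    return dWh, dWy
```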
Deep Neural Network Training: Backpropagation
• Multi-Layer Perceptron (MLP): adding more hidden layers
• Backpropagation update ∼ Perceptron: assuming ∂L/∂U_{l+1} = ∆_{l+1} known
  ∂L/∂W_{l+1} = H_l⊺ ∆_{l+1}
  Computing ∂L/∂U_l = ∆_l (= ∆_{l+1} W_{l+1}⊺ ⊙ H_l ⊙ (1 − H_l) for a sigmoid)
  ∂L/∂W_l = H_{l−1}⊺ ∆_l
Neural Network Training: Optimization Issues
• Classification loss over the training set (vectorized w, bias b ignored):
  L_CE(w) = (1/N) Σ_{i=1}^{N} ℓ_CE(y_i, y*_i) = −(1/N) Σ_{i=1}^{N} log(y_{c*,i})
• Gradient descent optimization:
  w^(t+1) = w^(t) − η (∂L_CE/∂w)(w^(t)) = w^(t) − η ∇^(t)_w
• Gradient ∇^(t)_w = (1/N) Σ_{i=1}^{N} (∂ℓ_CE(y_i, y*_i)/∂w)(w^(t)) scales linearly with:
  the dimension of w
  the training set size
⇒ Too slow even for moderate dimensionality & dataset size!
Stochastic Gradient Descent
• Solution: approximate ∇^(t)_w = (1/N) Σ_{i=1}^{N} (∂ℓ_CE(y_i, y*_i)/∂w)(w^(t)) with a subset of examples
⇒ Stochastic Gradient Descent (SGD)
• Use a single example (online): ∇^(t)_w ≈ (∂ℓ_CE(y_i, y*_i)/∂w)(w^(t))
• Mini-batch: use B < N examples: ∇^(t)_w ≈ (1/B) Σ_{i=1}^{B} (∂ℓ_CE(y_i, y*_i)/∂w)(w^(t))
Figures: update trajectories for full gradient vs. SGD (online) vs. SGD (mini-batch)
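A mini-batch SGD epoch might look like the sketch below; the shuffling, the batch size B and the grad_fn callback are assumptions of this illustration:

```python
import numpy as np

def sgd_epoch(X, Y_star, w, grad_fn, eta=0.01, B=32):
    """One pass over the training set with mini-batch updates."""
    idx = np.random.permutation(X.shape[0])          # shuffle the examples
    for start in range(0, len(idx), B):
        batch = idx[start:start + B]
        grad = grad_fn(w, X[batch], Y_star[batch])   # gradient on B examples only
        w = w - eta * grad                           # one (noisy) update per mini-batch
    return w
```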
Stochastic Gradient Descent
• SGD: an approximation of the true gradient ∇_w!
  A noisy gradient can lead to a bad direction and increase the loss
  BUT: many more parameter updates per epoch: ×N (online), ×N/B (mini-batch)
⇒ Faster convergence; at the core of Deep Learning for large-scale datasets
Figures: update trajectories for full gradient vs. SGD (online) vs. SGD (mini-batch)
Optimization: Learning Rate Decay
• Gradient descent optimization: w^(t+1) = w^(t) − η ∇^(t)_w
• How to set η? ⇒ open question
• Learning Rate Decay: decrease η as training progresses
  Inverse (time-based) decay: η_t = η_0 / (1 + r·t), r the decay rate
  Exponential decay: η_t = η_0 · e^{−λt}
  Step decay: η_t = η_0 · r^{⌊t/t_u⌋}, ...
Figures: Exponential Decay (η_0 = 0.1, λ = 0.1) and Step Decay (η_0 = 0.1, r = 0.5, t_u = 10)
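The three schedules written as simple functions; the symbols follow the slide, with t_u the step period and the default values purely illustrative:

```python
import numpy as np

def inverse_decay(t, eta0=0.1, r=0.1):
    return eta0 / (1.0 + r * t)              # inverse (time-based) decay

def exponential_decay(t, eta0=0.1, lam=0.1):
    return eta0 * np.exp(-lam * t)           # exponential decay

def step_decay(t, eta0=0.1, r=0.5, t_u=10):
    return eta0 * r ** (t // t_u)            # drop by factor r every t_u epochs
```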
Generalization and Overfitting
• Learning: minimizing the classification loss L_CE over the training set
  Training set: a sample representing the joint distribution of data and labels
  Ultimate goal: train a prediction function with low prediction error on the true (unknown) data distribution
Figures: two classifiers with L_train = 4 vs. L_train = 9 and L_test = 15 vs. L_test = 13
⇒ Optimization ≠ Machine Learning!
⇒ Generalization / Overfitting!
Regularization
• Regularization: improving generalization, i.e. test (≠ train) performance
• Structural regularization: add a prior R(w) to the training objective:
  L(w) = L_CE(w) + αR(w)
• L2 regularization (weight decay): R(w) = ||w||²
  Commonly used in neural networks
  Theoretical justifications, generalization bounds (SVM)
• Other possible R(w): L1 regularization, dropout, etc.
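In practice, L2 weight decay simply adds a term to the loss and to the gradient, as in this sketch (the α value is an arbitrary example):

```python
import numpy as np

def regularized_loss_and_grad(w, data_loss, data_grad, alpha=1e-4):
    """L(w) = L_CE(w) + alpha * ||w||^2, and the corresponding gradient."""
    loss = data_loss + alpha * np.sum(w ** 2)
    grad = data_grad + 2.0 * alpha * w
    return loss, grad
```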
Regularization and hyper-parameters
• Neural networks: hyper-parameters to tune:
  Training parameters: learning rate, weight decay, learning rate decay, # epochs, etc.
  Architectural parameters: number of layers, number of neurons per layer, non-linearity type, etc.
• Hyper-parameter tuning ⇒ improve generalization: estimate performance on a validation set
Neural networks: Conclusion
• Training issues at several levels: optimization, generalization, cross-validation
• Limits of fully-connected layers, and Convolutional Neural Nets? ⇒ next course!
References I
[MP43] Warren S. McCulloch and Walter Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics 5 (1943), no. 4, 115–133.