Introduction to Neural Networks
E. Scornet
January 2020
E. Scornet Deep Learning January 2020 1 / 118
1 Introduction
2 History of Neural Networks: Perceptron; Multilayer Perceptron - Backpropagation algorithm
3 Hyperparameters: Activation functions; Output units; Loss functions; Weight initialization
4 Regularization: Penalization; Dropout; Batch normalization; Early stopping
5 All in all
E. Scornet Deep Learning January 2020 2 / 118
Supervised Learning
Supervised Learning Framework
Input measurement X ∈ X, output measurement Y ∈ Y.
(X, Y) ∼ P with P unknown.
Training data: Dn = {(X_1, Y_1), . . . , (X_n, Y_n)} (i.i.d. ∼ P)
Often X ∈ R^d and Y ∈ {−1, 1} (classification), or X ∈ R^d and Y ∈ R (regression).
A predictor is a function in F = {f : X → Y measurable}.
Goal
Construct a good predictor f from the training data.
We need to specify the meaning of “good”. Classification and regression are almost the same problem!
E. Scornet Deep Learning January 2020 4 / 118
Loss and Probabilistic Framework
Loss function for a generic predictor
Loss function: ℓ(Y, f(X)) measures the goodness of the prediction of Y by f(X). Examples:
- Prediction loss: ℓ(Y, f(X)) = 1_{Y ≠ f(X)}
- Quadratic loss: ℓ(Y, f(X)) = |Y − f(X)|²
Risk function
Risk measured as the average loss for a new couple:
R(f) = E_{(X,Y)∼P}[ℓ(Y, f(X))]
Examples:
- Prediction loss: E[ℓ(Y, f(X))] = P{Y ≠ f(X)}
- Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|²]
Beware: as f depends on Dn, R(f) is a random variable!
E. Scornet Deep Learning January 2020 5 / 118
Supervised Learning
Experience, Task and Performance measure
Training data: D = {(X_1, Y_1), . . . , (X_n, Y_n)} (i.i.d. ∼ P)
Predictor: f : X → Y measurable
Cost/Loss function: ℓ(Y, f(X)) measures how well f(X) “predicts” Y
Risk:
R(f) = E[ℓ(Y, f(X))] = E_X[E_{Y|X}[ℓ(Y, f(X))]]
Often ℓ(Y, f(X)) = |f(X) − Y|² or ℓ(Y, f(X)) = 1_{Y ≠ f(X)}
Goal
Learn a rule to construct a predictor f̂ ∈ F from the training data Dn such that the risk R(f̂) is small on average or with high probability with respect to Dn.
E. Scornet Deep Learning January 2020 6 / 118
Best Solution
The best solution f* (which is independent of Dn) is
f* = argmin_{f∈F} R(f) = argmin_{f∈F} E[ℓ(Y, f(X))].
Bayes Predictor (explicit solution)
- In binary classification with the 0−1 loss:
f*(X) = +1 if P{Y = +1|X} ≥ P{Y = −1|X} (equivalently, P{Y = +1|X} ≥ 1/2), and −1 otherwise.
- In regression with the quadratic loss:
f*(X) = E[Y|X].
Issue: the solution requires to know E[Y|X] for all values of X!
E. Scornet Deep Learning January 2020 7 / 118
Examples
Spam detection (Text classification)
Data: email collection
Input: email
Output: Spam or No Spam
E. Scornet Deep Learning January 2020 8 / 118
Examples
Face Detection
Data: Annotated database of images
Input: Sub-window in the image
Output: Presence or absence of a face
E. Scornet Deep Learning January 2020 9 / 118
Examples
Number Recognition
Data: Annotated database of images (each image is represented by a vector of 28 × 28 = 784 pixel intensities)
Input: Image
Output: Corresponding number
E. Scornet Deep Learning January 2020 10 / 118
Machine Learning
A definition by Tom Mitchell (http://www.cs.cmu.edu/~tom/)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
E. Scornet Deep Learning January 2020 11 / 118
Unsupervised Learning
Experience, Task and Performance measure
Training data: D = {X_1, . . . , X_n} (i.i.d. ∼ P)
Task: ???
Performance measure: ???
No obvious task definition!
Tasks for this lecture
Clustering (or unsupervised classification): construct a grouping of the data in homogeneous classes.
Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much.
E. Scornet Deep Learning January 2020 12 / 118
Marketing
Data: Base of customer data containing their properties and past buying records
Goal: Use the customers similarities to find groups.
Two directions:
- Clustering: propose an explicit grouping of the customers.
- Visualization: propose a representation of the customers so that the groups are visible.
E. Scornet Deep Learning January 2020 13 / 118
Machine Learning
E. Scornet Deep Learning January 2020 14 / 118
Outline
1 Introduction
2 History of Neural NetworksPerceptronMultilayer Perceptron - Backpropagation algorithm
3 HyperparametersActivation functionsOutput unitsLoss functionsWeight initialization
4 RegularizationPenalizationDropoutBatch normalizationEarly stopping
5 All in all
E. Scornet Deep Learning January 2020 15 / 118
Outline
1 Introduction
2 History of Neural NetworksPerceptronMultilayer Perceptron - Backpropagation algorithm
3 HyperparametersActivation functionsOutput unitsLoss functionsWeight initialization
4 RegularizationPenalizationDropoutBatch normalizationEarly stopping
5 All in all
E. Scornet Deep Learning January 2020 16 / 118
Outline
1 Introduction
2 History of Neural NetworksPerceptronMultilayer Perceptron - Backpropagation algorithm
3 HyperparametersActivation functionsOutput unitsLoss functionsWeight initialization
4 RegularizationPenalizationDropoutBatch normalizationEarly stopping
5 All in all
E. Scornet Deep Learning January 2020 17 / 118
What is a neuron?
This is a real neuron!
Figure: Real Neuron - diagram
Figure: Real Neuron
The idea of neural networks began, unsurprisingly, as a model of how neurons in the brain function, termed “connectionism”, which used connected circuits to simulate intelligent behaviour.
E. Scornet Deep Learning January 2020 18 / 118
McCulloch and Pitts neuron - 1943
In 1943, the neuron was portrayed with a simple electrical circuit by neurophysiologist Warren McCulloch and mathematician Walter Pitts.
A McCulloch-Pitts neuron takes binary inputs, computes a weighted sum and returns 0 if the result is below a threshold and 1 otherwise.
Figure: [“A logical calculus of the ideas immanent in nervous activity”, McCulloch and Pitts 1943]
Donald Hebb took the idea further by proposing that neural pathways strengthen overeach successive use, especially between neurons that tend to fire at the same time.[The organization of behavior: a neuropsychological theory, Hebb 1949]
E. Scornet Deep Learning January 2020 19 / 118
Perceptron - 1958
In the late 50s, Frank Rosenblatt, a psychologist at Cornell, worked on decision systems present in the eye of a fly, which determine its flee response.
In 1958, he proposed the idea of a Perceptron, calling it the Mark I Perceptron. It was a system with a simple input-output relationship, modelled on a McCulloch-Pitts neuron.
[“Perceptron simulation experiments”, Rosenblatt 1960]
E. Scornet Deep Learning January 2020 20 / 118
Perceptron diagram
Figure: True representation of perceptron
The connections between the input and the first hidden layer cannot be optimized!
E. Scornet Deep Learning January 2020 21 / 118
Perceptron Machine
First implementation: Mark I Perceptron (1958).
The machine was connected to a camera (20x20 photocells, 400-pixel image).
Patchboard: allowed experimentation with different combinations of input features.
Potentiometers: implement the adaptive weights.
E. Scornet Deep Learning January 2020 22 / 118
Parameters of the perceptron
Activation function: σ(z) = 1_{z>0}
Parameters:
Weights w = (w_1, . . . , w_d) ∈ R^d
Bias: b
The output is given by f_{(w,b)}(x) = σ(⟨w, x⟩ + b).
How do we estimate (w, b)?
E. Scornet Deep Learning January 2020 23 / 118
The Perceptron Algorithm
To ease notations, we put w = (w_1, . . . , w_d, b) and x_i = (x_i, 1).
Perceptron Algorithm - first (iterative) learning algorithm
Start with w = 0. Repeat over all samples:
- if y_i⟨w, x_i⟩ < 0, modify w into w + y_i x_i,
- otherwise do not modify w.
Exercise
What is the rationale behind this procedure? Is this procedure related to a gradient descent method?
Gradient descent:
1 Start with w = w_0.
2 Update w ← w − η∇L(w), where L is the loss to be minimized.
3 Stop when w does not vary too much.
E. Scornet Deep Learning January 2020 24 / 118
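A minimal NumPy sketch of the perceptron algorithm just described (function and variable names are illustrative; as an implementation choice, the update is also applied when y_i⟨w, x_i⟩ = 0, so that learning can start from w = 0):

```python
import numpy as np

def perceptron_train(X, y, n_epochs=100):
    """Perceptron algorithm on augmented inputs (bias absorbed as a last coordinate).

    X: array of shape (n, d); y: labels in {-1, +1}.
    """
    n, d = X.shape
    X_aug = np.hstack([X, np.ones((n, 1))])   # x_i <- (x_i, 1)
    w = np.zeros(d + 1)                       # w <- (w_1, ..., w_d, b), start at 0
    for _ in range(n_epochs):
        updated = False
        for xi, yi in zip(X_aug, y):
            if yi * np.dot(w, xi) <= 0:       # sample misclassified (or on the boundary)
                w += yi * xi                  # perceptron update
                updated = True
        if not updated:                       # a full pass without mistakes: converged
            break
    return w[:-1], w[-1]                      # weights and bias

# Usage on a linearly separable toy problem (AND-like labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
print(w, b, np.sign(X @ w + b))
```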
Solution
The perceptron algorithm can be seen as a stochastic gradient descent.
Perceptron Algorithm
Repeat over all samples:
- if y_i⟨w, x_i⟩ < 0, modify w into w + y_i x_i,
- otherwise do not modify w.
A sample is misclassified if y_i⟨w, x_i⟩ < 0. Thus we want to minimize the loss
L(w) = − ∑_{i∈M_w} y_i⟨w, x_i⟩,
where M_w is the set of indices misclassified by the hyperplane w.
Stochastic Gradient Descent:
1 Select randomly i ∈ M_w
2 Update w ← w − η∇L_i(w) = w + η y_i x_i
The perceptron algorithm corresponds to η = 1.
E. Scornet Deep Learning January 2020 25 / 118
Exercise
Figure: From http://image.diku.dk/kstensbo/notes/perceptron.pdf
Let R = max_i ‖x_i‖ and let w* be the optimal hyperplane with margin
γ = min_i y_i⟨w*, x_i⟩,  with ‖w*‖ = 1.
Theorem (Block 1962; Novikoff 1963)
Assume that the training set Dn = {(x_1, y_1), . . . , (x_n, y_n)} is linearly separable (γ > 0). Start with w_0 = 0. Then the number of updates k of the perceptron algorithm is bounded by
k ≤ (1 + R²)/γ².
Exercise: Prove it!
E. Scornet Deep Learning January 2020 26 / 118
Solution
According to the Cauchy-Schwarz inequality,
⟨w*, w_{k+1}⟩ ≤ ‖w*‖₂ ‖w_{k+1}‖₂ = ‖w_{k+1}‖₂,
since ‖w*‖₂ = 1, with equality if and only if w_{k+1} ∝ w*. By construction, if the update uses the misclassified sample (x_i, y_i),
⟨w*, w_{k+1}⟩ = ⟨w*, w_k⟩ + y_i⟨w*, x_i⟩ ≥ ⟨w*, w_k⟩ + γ ≥ · · · ≥ ⟨w*, w_0⟩ + kγ ≥ kγ,
since w_0 = 0. We also have
‖w_{k+1}‖² = ‖w_k‖² + ‖x_i‖² + 2⟨w_k, y_i x_i⟩ ≤ ‖w_k‖² + R² + 1,
because the sample is misclassified (so ⟨w_k, y_i x_i⟩ ≤ 0) and ‖x_i‖² ≤ R² + 1 for the augmented inputs; by induction, ‖w_{k+1}‖² ≤ k(1 + R²).
Finally,
kγ ≤ ‖w_{k+1}‖ ≤ √(k(1 + R²)),  hence  k ≤ (1 + R²)/γ².
E. Scornet Deep Learning January 2020 27 / 118
Perceptron - Summary and drawbacks
Perceptron algorithm
We have a data set Dn = {(x_i, y_i), i = 1, . . . , n}.
We use the perceptron algorithm to learn the weight vector w and the bias b.
We predict using f_{(w,b)}(x) = 1_{⟨w,x⟩+b>0}.
Limitations
- The decision frontier is linear! Too simple a model.
- The perceptron algorithm does not converge if the data are not linearly separable: in this case, the algorithm must not be used.
- In practice, we do not know whether the data are linearly separable... so the perceptron should never be used!
E. Scornet Deep Learning January 2020 28 / 118
Perceptron will make machine intelligent... or not!
The Perceptron project led by Rosenblatt was funded by the US Office of Naval Research.
The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted.
Press conference, 7 July 1958, New York Times.
For an extensive study of the perceptron, see [Principles of neurodynamics. Perceptrons and the theory of brain mechanisms, Rosenblatt 1961].
E. Scornet Deep Learning January 2020 29 / 118
AI winter
In 1969, Minsky and Papert showed that it was difficult for the perceptron to
- detect parity (number of activated pixels),
- detect connectedness (are the pixels connected?),
- represent simple non-linear functions like XOR.
There is no reason to suppose that any of [the virtue of perceptrons] carry over to themany-layered version. Nevertheless, we consider it to be an important research problemto elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps somepowerful convergence theorem will be discovered, or some profound reason for the failureto produce an interesting "learning theorem" for the multilayered machine will be found.
[“Perceptrons.”, Minsky and Papert 1969]
This book is the starting point of the period known as the “AI winter”, a significant decline in funding of neural network research.
Controversy between Rosenblatt, Minsky, Papert:[“A sociological study of the official history of the perceptrons controversy”, Olazaran 1996]
E. Scornet Deep Learning January 2020 30 / 118
AI Winter: the XOR function
Exercise. Consider two binary variables x_1 ∈ {0, 1} and x_2 ∈ {0, 1}. The logical function AND applied to x_1, x_2 is defined as
AND: {0, 1}² → {0, 1},  (x_1, x_2) ↦ 1 if x_1 = x_2 = 1, and 0 otherwise.
1) Find a perceptron (i.e., weights and bias) that implements the AND function.
The logical function XOR applied to x_1, x_2 is defined as
XOR: {0, 1}² → {0, 1},  (x_1, x_2) ↦ 0 if x_1 = x_2 = 0 or x_1 = x_2 = 1, and 1 otherwise.
2) Prove that no perceptron can implement the XOR function.
3) Find a neural network with one hidden layer that implements the XOR function.
E. Scornet Deep Learning January 2020 31 / 118
Solution
1 The following perceptron with w_1 = w_2 = 1 and b = −1.5 implements the AND function:
[Diagram: input layer (x_1, x_2, bias b) connected to a single output unit computing σ(w_1 x_1 + w_2 x_2 + b).]
2 The perceptron algorithm builds a hyperplane and predicts according to the side of the hyperplane the observation falls into. Unfortunately, the XOR function cannot be represented with a hyperplane in the original space {0, 1}².
E. Scornet Deep Learning January 2020 32 / 118
Solution
3 The following network implements the XOR function with
w^{(1)}_{1,1} = −1, w^{(1)}_{1,2} = −1, b^{(1)}_1 = 0.5,
w^{(1)}_{2,1} = 1, w^{(1)}_{2,2} = 1, b^{(1)}_2 = −1.5,
w^{(2)}_{1,1} = −1, w^{(2)}_{1,2} = −1, b^{(2)}_1 = 0.5.
E. Scornet Deep Learning January 2020 33 / 118
Solution
[Diagram: the corresponding two-layer network. Input layer: x_1, x_2; hidden layer L1: two units with biases b^{(1)}_1, b^{(1)}_2 connected to the inputs by the weights w^{(1)}; output layer: one unit with bias b^{(2)}_1 connected to the hidden units by the weights w^{(2)}.]
E. Scornet Deep Learning January 2020 34 / 118
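As a sanity check of the solution above, the following NumPy sketch hard-codes the weights and biases given on the previous slides and verifies the XOR truth table (the array layout is an assumption of this sketch):

```python
import numpy as np

step = lambda z: (z > 0).astype(float)        # activation sigma(z) = 1_{z > 0}

W1 = np.array([[-1., -1.],                    # w^(1)_{1,1}, w^(1)_{1,2}
               [ 1.,  1.]])                   # w^(1)_{2,1}, w^(1)_{2,2}
b1 = np.array([0.5, -1.5])                    # b^(1)_1, b^(1)_2
W2 = np.array([[-1., -1.]])                   # w^(2)_{1,1}, w^(2)_{1,2}
b2 = np.array([0.5])                          # b^(2)_1

def forward(x):
    h = step(W1 @ x + b1)                     # hidden layer
    return step(W2 @ h + b2)[0]               # output layer

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", forward(np.array([x1, x2], dtype=float)))
# prints 0 for (0,0) and (1,1), and 1 for (0,1) and (1,0): the XOR function.
```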
Moving forward - ADALINE, MADALINE
In 1959 at Stanford, Bernard Widrow and Marcian Hoff developed AdaLinE (ADAptive LINear Elements) and MAdaLinE (Multiple AdaLinE), the latter being the first network successfully applied to a real-world problem. [Adaptive switching circuits, Bernard Widrow and Hoff 1960]
Loss: square difference between a weighted sum of the inputs and the output.
Optimization procedure: trivial gradient descent.
E. Scornet Deep Learning January 2020 35 / 118
MADALINE
Many Adalines: network with one hidden layer composed of many Adaline units.[“Madaline Rule II: a training algorithm for neural networks”, Winter and Widrow 1988]
Applications:
- Speech and pattern recognition [“Real-Time Adaptive Speech-Recognition System”, Talbert et al. 1963]
- Weather forecasting [“Application of the adaline system to weather forecasting”, Hu 1964]
- Adaptive filtering and adaptive signal processing [“Adaptive signal processing”, Bernard and Samuel 1985]
E. Scornet Deep Learning January 2020 36 / 118
Neural network with one hidden layer
Generic notations:
W^{(ℓ)}_{i,j}: weight between neuron j of layer ℓ−1 and neuron i of layer ℓ.
b^{(ℓ)}_j: bias of neuron j of layer ℓ.
a^{(ℓ)}_j: output of neuron j of layer ℓ.
z^{(ℓ)}_j: input of neuron j of layer ℓ, such that a^{(ℓ)}_j = σ(z^{(ℓ)}_j).
E. Scornet Deep Learning January 2020 38 / 118
How to find weights and bias?
Perceptron algorithm does not work anymore!
E. Scornet Deep Learning January 2020 39 / 118
Gradient Descent Algorithm
The prediction of the network is given by f_θ(x).
Empirical risk minimization:
argmin_θ (1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) ≡ argmin_θ (1/n) ∑_{i=1}^n ℓ_i(θ)
(Stochastic) gradient descent rule:
While |θ_t − θ_{t−1}| ≥ ε do
- Sample I_t ⊂ {1, . . . , n}
- θ_{t+1} = θ_t − η (1/|I_t|) ∑_{i∈I_t} ∇_θ ℓ_i(θ_t)
How to compute ∇_θ ℓ_i efficiently?
E. Scornet Deep Learning January 2020 40 / 118
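A minimal sketch of this mini-batch stochastic gradient descent rule, assuming a user-supplied routine grad_loss(theta, X_batch, y_batch) that returns the averaged gradient on the batch (computing that gradient efficiently is the topic of the next slides):

```python
import numpy as np

def sgd(theta0, X, y, grad_loss, lr=0.1, batch_size=32, n_epochs=10, eps=1e-6):
    """Mini-batch SGD: theta_{t+1} = theta_t - lr * mean_{i in I_t} grad loss_i(theta_t)."""
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]         # sample I_t
            g = grad_loss(theta, X[idx], y[idx])         # averaged gradient on the batch
            theta_new = theta - lr * g
            if np.max(np.abs(theta_new - theta)) < eps:  # |theta_t - theta_{t-1}| small: stop
                return theta_new
            theta = theta_new
    return theta

# Usage: linear least squares, loss_i(theta) = (y_i - x_i^T theta)^2 / 2
grad_ls = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
X = np.random.randn(200, 3)
y = X @ np.array([1., -2., 0.5]) + 0.01 * np.random.randn(200)
print(sgd(np.zeros(3), X, y, grad_ls, lr=0.1, n_epochs=50))
```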
Backprop Algorithm
A Clever Gradient Descent Implementation
Popularized by Rumelhart, McClelland, and Hinton in 1986.
Can be traced back to Werbos in 1974.
Nothing but the use of the chain rule for derivation, with a touch of dynamic programming.
Key ingredient to make neural networks work!
Still at the core of deep learning algorithms.
E. Scornet Deep Learning January 2020 41 / 118
Backpropagation equations
Neural network with L layers, with vector output, with quadratic cost
C = (1/2)‖y − a^{(L)}‖².
By definition,
δ^{(ℓ)}_j = ∂C/∂z^{(ℓ)}_j.
The four fundamental equations of backpropagation are
δ^{(L)} = ∇_a C ⊙ σ′(z^{(L)}),
δ^{(ℓ)} = ((w^{(ℓ+1)})^T δ^{(ℓ+1)}) ⊙ σ′(z^{(ℓ)}),
∂C/∂b^{(ℓ)}_j = δ^{(ℓ)}_j,
∂C/∂w^{(ℓ)}_{j,k} = a^{(ℓ−1)}_k δ^{(ℓ)}_j.
E. Scornet Deep Learning January 2020 42 / 118
Proof
We start with the first equality
δ^{(L)} = ∇_a C ⊙ σ′(z^{(L)}).
Applying the chain rule gives
δ^{(L)}_j = ∑_k (∂C/∂a^{(L)}_k)(∂a^{(L)}_k/∂z^{(L)}_j),
where z^{(L)}_j is the input of neuron j of layer L. Since the activation a^{(L)}_k depends on the input z^{(L)}_j only when k = j, we have
δ^{(L)}_j = (∂C/∂a^{(L)}_j)(∂a^{(L)}_j/∂z^{(L)}_j).
Besides, since a^{(L)}_j = σ(z^{(L)}_j), we have
δ^{(L)}_j = (∂C/∂a^{(L)}_j) σ′(z^{(L)}_j),
which is the componentwise version of the first equality.
E. Scornet Deep Learning January 2020 43 / 118
Proof
Now, we want to prove
δ^{(ℓ)} = ((w^{(ℓ+1)})^T δ^{(ℓ+1)}) ⊙ σ′(z^{(ℓ)}).
Again, using the chain rule,
δ^{(ℓ)}_j = ∂C/∂z^{(ℓ)}_j = ∑_k (∂C/∂z^{(ℓ+1)}_k)(∂z^{(ℓ+1)}_k/∂z^{(ℓ)}_j) = ∑_k δ^{(ℓ+1)}_k (∂z^{(ℓ+1)}_k/∂z^{(ℓ)}_j).
Recalling that z^{(ℓ+1)}_k = ∑_j w^{(ℓ+1)}_{kj} σ(z^{(ℓ)}_j) + b^{(ℓ+1)}_k, we get
δ^{(ℓ)}_j = ∑_k δ^{(ℓ+1)}_k w^{(ℓ+1)}_{kj} σ′(z^{(ℓ)}_j),
which concludes the proof.
E. Scornet Deep Learning January 2020 44 / 118
Proof
Now, we want to prove
∂C/∂b^{(ℓ)}_j = δ^{(ℓ)}_j.
Using the chain rule,
∂C/∂b^{(ℓ)}_j = ∑_k (∂C/∂z^{(ℓ)}_k)(∂z^{(ℓ)}_k/∂b^{(ℓ)}_j).
However, only z^{(ℓ)}_j depends on b^{(ℓ)}_j, and z^{(ℓ)}_j = ∑_k w^{(ℓ)}_{jk} σ(z^{(ℓ−1)}_k) + b^{(ℓ)}_j. Therefore,
∂C/∂b^{(ℓ)}_j = δ^{(ℓ)}_j.
E. Scornet Deep Learning January 2020 45 / 118
Proof
Finally, we want to prove
∂C/∂w^{(ℓ)}_{j,k} = a^{(ℓ−1)}_k δ^{(ℓ)}_j.
By the chain rule,
∂C/∂w^{(ℓ)}_{j,k} = ∑_m (∂C/∂z^{(ℓ)}_m)(∂z^{(ℓ)}_m/∂w^{(ℓ)}_{j,k}).
Since z^{(ℓ)}_m = ∑_k w^{(ℓ)}_{mk} σ(z^{(ℓ−1)}_k) + b^{(ℓ)}_m, we have
∂z^{(ℓ)}_m/∂w^{(ℓ)}_{j,k} = σ(z^{(ℓ−1)}_k) 1_{m=j}.
Consequently,
∂C/∂w^{(ℓ)}_{j,k} = δ^{(ℓ)}_j σ(z^{(ℓ−1)}_k) = a^{(ℓ−1)}_k δ^{(ℓ)}_j.
E. Scornet Deep Learning January 2020 46 / 118
Backpropagation Algorithm
Let
δ^{(ℓ)}_j = ∂C/∂z^{(ℓ)}_j,
where z^{(ℓ)}_j is the input of neuron j of layer ℓ.
Backpropagation Algorithm
Initialize the weights and biases of the network randomly.
For each training sample x_i:
1 Feedforward: pass x_i through the network and store, for each neuron, the value of the activation function and of its derivative.
2 Output error: compute the neural network error for x_i.
3 Backpropagation: compute recursively the vectors δ^{(ℓ)}, from ℓ = L down to ℓ = 1. Update the weights and biases using the backpropagation equations.
E. Scornet Deep Learning January 2020 47 / 118
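The following NumPy sketch implements one step of this algorithm for a network with one hidden layer and sigmoid activations, using the four backpropagation equations; it is an illustrative implementation, not the one used in the lecture:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One SGD step for a 2-layer net with quadratic cost C = 0.5 * ||y - a2||^2."""
    # Feedforward: store inputs z and activations a of each layer
    z1 = W1 @ x + b1;  a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    # Output error: delta^(L) = grad_a C (.) sigma'(z^(L)), with grad_a C = a2 - y
    delta2 = (a2 - y) * dsigma(z2)
    # Backpropagation: delta^(l) = (W^(l+1)^T delta^(l+1)) (.) sigma'(z^(l))
    delta1 = (W2.T @ delta2) * dsigma(z1)
    # Gradients: dC/db^(l)_j = delta^(l)_j and dC/dW^(l)_{jk} = a^(l-1)_k delta^(l)_j
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return W1, b1, W2, b2

# Usage: one update on a random example, with 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), np.array([0.0, 1.0])
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
```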
Neural Network terminology
Epoch: one forward pass and one backward pass of all training examples.
(Mini) Batch size: number of training examples in one forward/backward pass. The higher the batch size, the more memory space you will need.
Number of iterations: number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and the backward pass as two different passes).
For example, for 1000 training examples, if you set the batch size to 500, it will take 2 iterations to complete 1 epoch.
E. Scornet Deep Learning January 2020 48 / 118
Sigmoid
[Figure: plot of the sigmoid function on [−5, 5], with values in (0, 1).]
Sigmoid function
x ↦ exp(x)/(1 + exp(x))
Problems:
1 Sigmoid is not a zero-centered function → need for rescaling data.
2 Saturated function: gradient killer.
3 Plus: exp is a bit computationally expensive.
E. Scornet Deep Learning January 2020 51 / 118
Tanh
[Figure: plot of the hyperbolic tangent on [−5, 5], with values in (−1, 1).]
Hyperbolic tangent function
x ↦ (exp(x) − exp(−x))/(exp(x) + exp(−x))
Problems:
1 Tanh is a zero-centered function → no need for rescaling data.
2 Saturated function: gradient killer.
3 Plus: exp is a bit computationally expensive.
Note: tanh(x) = 2σ(2x) − 1.
E. Scornet Deep Learning January 2020 52 / 118
Rectified Linear Unit
[Figure: plot of the ReLU activation x ↦ max(0, x) on [−5, 5].]
ReLU / positive part
x ↦ max(0, x)
Advantages:
1 Not a saturated function.
2 Computationally efficient.
3 Empirically, convergence is faster than with sigmoid/tanh.
4 Plus: biologically plausible.
E. Scornet Deep Learning January 2020 53 / 118
A little bit more on ReLU
Introduced by [“Imagenet classification with deep convolutional neural networks”, Krizhevsky et al. 2012] in AlexNet
Problems:
- Not zero-centered.
- The gradient is null if x < 0.
- If weights are not properly initialized, the ReLU output can be zero.
A small positive initial bias is often used for ReLU units so that they fire up: useful or not? Mystery...
Related to biology [“Deep sparse rectifier neural networks”, Glorot, Bordes, et al. 2011]:
- Most of the time, neurons are inactive.
- When they activate, their activation is proportional to their input.
E. Scornet Deep Learning January 2020 54 / 118
Leaky ReLU/ Parametric ReLU / Absolute value rectification
[Figure: plot of x ↦ max(αx, x) on [−5, 5].]
x ↦ max(αx, x)
Leaky ReLU: α = 0.1 [“Rectifier nonlinearities improve neural network acoustic models”, Maas et al. 2013]
Absolute Value Rectification: α = −1 [“What is the best multi-stage architecture for object recognition?”, Jarrett et al. 2009]
Parametric ReLU: α optimized during backpropagation; the activation function is learned. [“Empirical evaluation of rectified activations in convolutional network”, Xu et al. 2015]
E. Scornet Deep Learning January 2020 55 / 118
ELU
[Figure: plot of the ELU activation on [−5, 5].]
Exponential Linear Unit
x ↦ x if x ≥ 0, and α(exp(x) − 1) otherwise.
Negative saturation regime, closer to zero-mean output.
α is set to 1.0.
Robustness to noise. [“Fast and accurate deep network learning by exponential linear units (elus)”, Clevert et al. 2015]
E. Scornet Deep Learning January 2020 56 / 118
Maxout
[Figure: plot of a maxout unit on [−5, 5].]
x ↦ max(w_1 x + b_1, w_2 x + b_2)
Linear regime: not saturating, not dying.
Number of parameters multiplied by 2 (resp. by k if the max is taken over k affine pieces).
Learns piecewise linear functions (up to k pieces). [“Maxout networks”, Goodfellow, Warde-Farley, et al. 2013] [“Deep maxout neural networks for speech recognition”, Cai et al. 2013]
Resists catastrophic forgetting. [“An empirical investigation of catastrophic forgetting in gradient-based neural networks”, Goodfellow, Mirza, et al. 2013]
E. Scornet Deep Learning January 2020 57 / 118
Conclusion on activation functions
Use ReLU.
Test Leaky ReLU, maxout, ELU.
Try out tanh, but do not expect too much.
Do not use sigmoid.
E. Scornet Deep Learning January 2020 58 / 118
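For reference, the activation functions discussed in this section can be written in a few lines of NumPy (a sketch; the default α values follow the slides):

```python
import numpy as np

def sigmoid(x):                    # saturates on both sides, not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                       # zero-centered, still saturates; tanh(x) = 2*sigmoid(2x) - 1
    return np.tanh(x)

def relu(x):                       # max(0, x): cheap, non-saturating for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):      # small slope alpha for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):             # smooth negative saturation at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, w1, b1, w2, b2):     # max of two learned affine functions
    return np.maximum(w1 * x + b1, w2 * x + b2)

x = np.linspace(-5, 5, 5)
print(relu(x), leaky_relu(x), elu(x))
```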
Output units
Linear output unit:
ŷ = W^T h + b
→ Linear regression based on the new variables h.
Sigmoid output unit, used to predict {0, 1} outputs:
P(Y = 1|h) = σ(W^T h + b),  where σ(t) = e^t/(1 + e^t).
→ Logistic regression based on the new variables h.
Softmax output unit, used to predict {1, . . . , K}:
softmax(z)_i = e^{z_i}/∑_{k=1}^K e^{z_k},
where each z_i is the activation of one neuron of the previous layer, given by z_i = W_i^T h + b_i.
→ Multinomial logistic regression based on the new variables h.
E. Scornet Deep Learning January 2020 60 / 118
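A small sketch of the softmax output unit; subtracting max_k z_k before exponentiating leaves the result unchanged but avoids overflow (this stabilization trick is an implementation detail, not stated on the slide):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_k exp(z_k), computed in a numerically stable way."""
    z = z - np.max(z)                 # invariant: softmax(z + c) = softmax(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 1000.0])      # a naive exp(1000) would overflow
print(softmax(z))                     # ~ [0, 0, 1], probabilities summing to 1
```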
Multinomial logistic regression
Generalization of logistic regression to multiclass outputs: for all 1 ≤ k ≤ K,
log(P[Y_i = k]/Z) = β_k X_i.
Hence, for all 1 ≤ k ≤ K,
P[Y_i = k] = Z e^{β_k X_i},
where
Z = 1/∑_{k=1}^K e^{β_k X_i}.
Thus,
P[Y_i = k] = e^{β_k X_i}/∑_{ℓ=1}^K e^{β_ℓ X_i}.
E. Scornet Deep Learning January 2020 61 / 118
Biology bonus
Softmax, used with cross-entropy:
− log P(Y = k|z) = − log softmax(z)_k = −z_k + log(∑_j exp(z_j)) ≈ −z_k + max_j z_j ≈ 0 when z_k is the largest activation.
There is no contribution to the cost when softmax(z)_y is maximal.
Lateral inhibition: believed to exist between nearby neurons in the cortex. When the difference between the max and the others is large, the winner takes all: one neuron is set to 1 and the others go to zero.
More complex models: conditional Gaussian mixture, for multimodal y [“On supervised learning from sequential data with applications for speech recognition”, Schuster 1999; “Generating sequences with recurrent neural networks”, Graves 2013].
E. Scornet Deep Learning January 2020 62 / 118
Cost functions
Mean Square Error (MSE)
(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n (Y_i − f_θ(X_i))²
Mean Absolute Error
(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n |Y_i − f_θ(X_i)|
0−1 Error
(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n 1_{Y_i ≠ f_θ(X_i)}
E. Scornet Deep Learning January 2020 64 / 118
Cost functions
Cross entropy (or negative log-likelihood):
(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = −(1/n) ∑_{i=1}^n ∑_{k=1}^K 1_{Y_i = k} log([f_θ(X_i)]_k)
Cross-entropy:
Very popular!
Should help to prevent the saturation phenomenon, compared to MSE:
− log P(Y = y_i|X) = − log σ((2y_i − 1)(W^T h + b)), with σ(t) = e^t/(1 + e^t).
Harmful saturation occurs when (2y_i − 1)(W^T h + b) ≪ −1 (the unit is confidently wrong). In that case, − log P(Y = y_i|X) is approximately linear in W and b, which keeps the gradient well-behaved and the gradient descent easy to implement.
Mean Square Error should not be used with softmax output units [“Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition”, Bridle 1990]
E. Scornet Deep Learning January 2020 65 / 118
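A sketch of the cross-entropy cost above, computed from predicted probabilities [f_θ(X_i)]_k for integer labels y_i; the clipping constant is an implementation detail added here to avoid log(0):

```python
import numpy as np

def cross_entropy(probs, y):
    """probs: array (n, K) of predicted probabilities; y: integer labels of shape (n,)."""
    n = probs.shape[0]
    p = np.clip(probs[np.arange(n), y], 1e-12, 1.0)   # [f_theta(X_i)]_{y_i}, clipped for log
    return -np.mean(np.log(p))                        # -(1/n) sum_i log [f_theta(X_i)]_{y_i}

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(cross_entropy(probs, y))        # small loss: both examples are well classified
```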
Weight initialization
First idea: Set all weights and bias to the same value.
E. Scornet Deep Learning January 2020 67 / 118
Small or big weights?
1 First idea: small random numbers, typically 0.01 × N(0, 1), i.e. N(0, 10⁻⁴).
- works for small networks;
- for big networks (∼ 10 layers), the outputs become a Dirac at 0: there is no activation at all.
2 Second idea: “big random numbers”.
→ Saturating phenomenon.
In any case, no need to tune the biases: they can be initially set to zero.
E. Scornet Deep Learning January 2020 68 / 118
Other initialization
Idea: the variance of the input should be the same as the variance of the output.
1 Xavier initialization
Initialize biases to zero and weights randomly using
U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})],
where n_j is the size of layer j [“Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio 2010].
→ Sadly, it does not work for ReLU (non-activated neurons).
2 He et al. initialization
Initialize biases to zero and weights randomly using
N(0, 2/n_j)  (i.e., standard deviation √(2/n_j)),
where n_j is the size of layer j [“Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”, He et al. 2015].
Bonus: [“All you need is a good init”, Mishkin and Matas 2015]
E. Scornet Deep Learning January 2020 69 / 118
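A sketch of the two initialization schemes above for a weight matrix between a layer of size n_in and a layer of size n_out; biases are set to zero, as recommended:

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    """Glorot/Xavier: U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in)), np.zeros(n_out)

def he_init(n_in, n_out, rng=np.random.default_rng()):
    """He et al.: Gaussian with variance 2 / n_in, suited to ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

W, b = he_init(256, 128)
print(W.std())    # close to sqrt(2/256) ~ 0.088
```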
Exercise
1 Consider a neural network with two hidden layers (containing n_1 and n_2 neurons respectively) and equipped with linear hidden units. Find a simple sufficient condition on the weights so that the variance of the hidden units stays constant across layers.
2 Considering the same network, find a simple sufficient condition on the weights so that the gradient stays constant across layers when applying the backpropagation procedure.
3 Based on the previous questions, propose a simple way to initialize weights.
E. Scornet Deep Learning January 2020 70 / 118
Solution
1 Consider the neuron i of the second hidden layer. The output of this neuron is
a^{(2)}_i = ∑_{j=1}^{n_1} W^{(2)}_{ij} z^{(1)}_j + b^{(2)}_i
= ∑_{j=1}^{n_1} W^{(2)}_{ij} (∑_{k=1}^{n_0} W^{(1)}_{jk} x_k + b^{(1)}_j) + b^{(2)}_i
= ∑_{j=1}^{n_1} W^{(2)}_{ij} (∑_{k=1}^{n_0+1} W^{(1)}_{jk} x_k) + b^{(2)}_i,
where we set x_{n_0+1} = 1 and W^{(1)}_{j,n_0+1} = b^{(1)}_j to ease notations. At fixed x, since the weights and biases are centred and independent, we have
V[a^{(2)}_i] = V[∑_{j=1}^{n_1} ∑_{k=1}^{n_0+1} W^{(2)}_{ij} W^{(1)}_{jk} x_k + b^{(2)}_i] = ∑_{j=1}^{n_1} ∑_{k=1}^{n_0+1} x_k² V[W^{(2)}_{ij} W^{(1)}_{jk}]
E. Scornet Deep Learning January 2020 71 / 118
Solution
= ∑_{j=1}^{n_1} ∑_{k=1}^{n_0+1} x_k² σ_2² σ_1²,
where σ_1² (resp. σ_2²) is the shared variance of all the weights between layers 0 and 1 (resp. between layers 1 and 2). Since we want V[a^{(2)}_i] = V[a^{(1)}_i], it is sufficient that
n_1 σ_2² = 1.
E. Scornet Deep Learning January 2020 72 / 118
Solution
2 Consider a neural network where the cost function is given, for one training example, by
C = (y − (1/n_{d+1}) ∑_{j=1}^{n_{d+1}} W_j z^{(d)}_j)²,
where the W_j denote the output weights; this implies
∂C/∂z^{(d)}_j = −(2W_j/n_{d+1}) (y − (1/n_{d+1}) ∑_i W_i z^{(d)}_i).
Hence, since the weights are centred and independent,
E[∂C/∂z^{(d)}_j] = E[(2W_j²/n_{d+1}²) z^{(d)}_j] = (2σ_{d+1}²/n_{d+1}²) E[z^{(d)}_j].
Besides, according to the chain rule, we have
∂C/∂z^{(d−1)}_k = ∑_{j=1}^{n_{d+1}} (∂C/∂z^{(d)}_j)(∂z^{(d)}_j/∂z^{(d−1)}_k) = ∑_{j=1}^{n_{d+1}} (∂C/∂z^{(d)}_j) W^{(d)}_{jk}.
E. Scornet Deep Learning January 2020 73 / 118
Solution
More precisely,
E[(∂C/∂z^{(d)}_j) W^{(d)}_{jk}]
= E[−(2W_j/n_{d+1}) y W^{(d)}_{jk} + (2W_j/n_{d+1}²) W^{(d)}_{jk} ∑_i W_i (∑_ℓ W^{(d)}_{iℓ} z^{(d−1)}_ℓ + b^{(d)}_i)]
= E[(2W_j²/n_{d+1}²) W^{(d)}_{jk} (∑_ℓ W^{(d)}_{jℓ} z^{(d−1)}_ℓ + b^{(d)}_j)]
= E[(2W_j²/n_{d+1}²) (W^{(d)}_{jk})² z^{(d−1)}_k]
= (2σ_{d+1}² σ_d²/n_{d+1}²) E[z^{(d−1)}_k].
Using the chain rule, we get
E[∂C/∂z^{(d−1)}_k] = E[∂C/∂z^{(d)}_j]
⇔ n_{d+1} (2σ_{d+1}² σ_d²/n_{d+1}²) E[z^{(d−1)}_k] = (2σ_{d+1}²/n_{d+1}²) E[z^{(d)}_j]
E. Scornet Deep Learning January 2020 74 / 118
Solution
⇔ n_{d+1} σ_d² = E[z^{(d)}_j] / E[z^{(d−1)}_k].
This gives the sufficient condition n_{d+1} σ_d² ∼ 1, assuming that the activations of two consecutive hidden layers are similar.
E. Scornet Deep Learning January 2020 75 / 118
Regularizing to avoid overfitting
Avoid overfitting by imposing some constraints over the parameter space.
Reducing variance and increasing bias.
E. Scornet Deep Learning January 2020 77 / 118
Overfitting
There are many different ways to avoid overfitting:
Penalization (L1 or L2): replace the cost function L by L̃(θ, X, y) = L(θ, X, y) + pen(θ).
Soft weight sharing (cf. CNN lecture): reduce the parameter space artificially by imposing explicit constraints.
Dropout: randomly kill some neurons during optimization and predict with the full network.
Batch normalization: renormalize a layer inside a batch, so that the network does not overfit on this particular batch.
Early stopping: stop the gradient descent procedure when the error on the validation set increases.
E. Scornet Deep Learning January 2020 78 / 118
Constrain the optimization problem:
min_θ L(θ, X, y)  subject to  pen(θ) ≤ const.
Using the Lagrangian formulation, this problem is equivalent to
min_θ L(θ, X, y) + λ pen(θ),
where
- L(θ, X, y) is the loss function (data-driven term),
- pen is a function that increases when θ becomes more complex (penalty term),
- λ ≥ 0 is a constant standing for the strength of the penalty term.
For neural networks, pen only penalizes the weights and not the biases, the latter being easier to estimate than the weights.
E. Scornet Deep Learning January 2020 80 / 118
Example of penalization
min_θ L(θ, X, y) + pen(θ), with:
Ridge: pen(θ) = λ‖θ‖²₂ [“Ridge regression: Biased estimation for nonorthogonal problems”, Hoerl and Kennard 1970]. See also [“Lecture notes on ridge regression”, Wieringen 2015].
Lasso: pen(θ) = λ‖θ‖₁ [“Regression shrinkage and selection via the lasso”, Tibshirani 1996].
Elastic Net: pen(θ) = λ‖θ‖²₂ + µ‖θ‖₁ [“Regularization and variable selection via the elastic net”, Zou and Hastie 2005].
E. Scornet Deep Learning January 2020 81 / 118
Simple case: linear regression
Linear regression
The linear regression estimate β̂ is given by
β̂ ∈ argmin_{β ∈ R^d} ∑_{i=1}^n (Y_i − ∑_{j=1}^d β_j x_i^{(j)})²,
which can be written as
β̂ ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖²₂,
where X ∈ M_{n,d}(R).
Solution:
β̂ = (X′X)^{−1} X′Y.
E. Scornet Deep Learning January 2020 82 / 118
Penalized regression
Penalized linear regression
The penalized estimate β̂_{λ,q} is given by
β̂_{λ,q} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖²₂ + λ‖β‖_q^q.
q = 2: ridge regression; q = 1: LASSO.
E. Scornet Deep Learning January 2020 83 / 118
Ridge regression, q = 2
Ridge linear regression
The ridge estimate β̂_{λ,2} is given by
β̂_{λ,2} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖²₂ + λ‖β‖²₂.
Solution:
β̂_{λ,2} = (X′X + λI)^{−1} X′Y.
This estimate has a bias equal to −λ(X′X + λI)^{−1}β and a variance σ²(X′X + λI)^{−1}X′X(X′X + λI)^{−1}. Note that
V[β̂_{λ,2}] ≤ V[β̂].
In the case of orthonormal design (X′X = I), we have
β̂_{λ,2} = β̂/(1 + λ), i.e., coordinate-wise (β̂_{λ,2})_j = X′_j Y/(1 + λ).
E. Scornet Deep Learning January 2020 84 / 118
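The closed-form ridge solution above, as a short NumPy sketch on synthetic data (lam plays the role of λ):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator: (X'X + lam I)^{-1} X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=100)
print(ridge(X, Y, lam=0.0))    # ordinary least squares
print(ridge(X, Y, lam=10.0))   # coefficients shrunk towards 0
```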
Ridge regression q = 2
In general, the ridge penalization considers
pen(β) = (1/2)‖β‖²₂ = (1/2) ∑_{j=1}^d β_j²
in the optimization problem
min_β L(β, X, y) + pen(β).
It penalizes the “size” of β.
This is the most widely used penalization:
- it is nice and easy,
- it allows to “deal” with correlated features,
- it actually helps training! With a ridge penalization, the optimization problem is easier.
E. Scornet Deep Learning January 2020 85 / 118
Sparsity
There is another desirable property on β.
If β_j = 0, then feature j has no impact on the prediction: ŷ = sign(x^⊤ β + b).
If we have many features (d is large), it would be nice if β contained zeros, and many of them:
- it means that only a few features are statistically relevant,
- it means that only a few features are useful to predict the label,
- it leads to a simpler model, with a “reduced” dimension.
How do we enforce sparsity in β?
E. Scornet Deep Learning January 2020 86 / 118
Sparsity
Tempting to solve
β̂_{λ,0} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖²₂ + λ‖β‖₀,
where
‖β‖₀ = #{j ∈ {1, . . . , d} : β_j ≠ 0}.
To solve this, one must explore all possible supports of β. Too long! (NP-hard)
Instead, find a convex proxy of ‖·‖₀: the ℓ₁-norm ‖β‖₁ = ∑_{j=1}^d |β_j|.
E. Scornet Deep Learning January 2020 87 / 118
LASSO
LASSO
The LASSO (Least Absolute Shrinkage and Selection Operator) estimate β̂_{λ,1} is given by
β̂_{λ,1} ∈ argmin_{β ∈ R^d} ‖Y − Xβ‖²₂ + λ‖β‖₁.
Solution: no closed form in the general case.
If the X_j are orthonormal, then
(β̂_{λ,1})_j = X′_j Y (1 − λ/(2|X′_j Y|))₊,
where (x)₊ = max(0, x).
Thus, in the very specific case of orthogonal design, we can easily show that the L1 penalization yields a sparse vector if the parameter λ is properly tuned.
E. Scornet Deep Learning January 2020 88 / 118
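In the orthonormal-design case above, the LASSO coordinates are obtained by soft-thresholding the least-squares coordinates X′_j Y, as in the formula on the previous slide; a small sketch:

```python
import numpy as np

def soft_threshold(beta_ls, lam):
    """LASSO with orthonormal design: shrink each OLS coordinate by lam/2 and zero out the small ones."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)

beta_ls = np.array([3.0, -0.4, 0.1, -2.0])    # OLS coordinates X'_j Y
print(soft_threshold(beta_ls, lam=1.0))       # [ 2.5, -0. ,  0. , -1.5]: exact zeros appear
```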
Sparsity - a picture
Why does ℓ₂ (ridge) not induce sparsity?
E. Scornet Deep Learning January 2020 89 / 118
Dropout
Dropout refers to dropping out units (hidden and visible) in a neural network, i.e., temporarily removing them from the network, along with all their incoming and outgoing connections.
Each unit is independently retained with probability
- p = 0.5 for hidden units,
- p ∈ [0.5, 1] for input units, usually p = 0.8.
[“Improving neural networks by preventing co-adaptation of feature detectors”, Hinton et al. 2012]
E. Scornet Deep Learning January 2020 91 / 118
Dropout
E. Scornet Deep Learning January 2020 92 / 118
Dropout algorithm
Training step. While not converged:
1 Inside one epoch, for each mini-batch of size m,
1 Sample m different masks. A mask consists of one Bernoulli variable per node of the network (inner and input nodes, but not output nodes). These Bernoulli variables are i.i.d. Usually,
- the probability of keeping a hidden node is 0.5,
- the probability of keeping an input node is 0.8.
2 For each of the m observations in the mini-batch,
- do a forward pass on the masked network,
- compute backpropagation in the masked network,
- compute the gradient for each weight.
3 Update the parameters according to the usual formula.
Prediction step.
Use all neurons in the network, with the weights given by the previous optimization procedure multiplied by the probability p of being retained (0.5 for inner nodes, 0.8 for input nodes).
E. Scornet Deep Learning January 2020 93 / 118
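A sketch of the training-time masking and test-time rescaling described above, for one layer; rescaling the activations by p at test time is equivalent to multiplying the outgoing weights by p, as stated on the slide:

```python
import numpy as np

def dropout_forward(a, p_keep, train=True, rng=np.random.default_rng()):
    """Dropout on an activation vector a.

    Training: multiply by an i.i.d. Bernoulli(p_keep) mask.
    Test: use all units and multiply the activations by p_keep,
    so that the expected input of the next layer is unchanged.
    """
    if train:
        mask = rng.binomial(1, p_keep, size=a.shape)
        return a * mask
    return a * p_keep

h = np.ones(10)                                     # some hidden activations
print(dropout_forward(h, p_keep=0.5))               # about half of the units are zeroed
print(dropout_forward(h, p_keep=0.5, train=False))  # all units, scaled by 0.5
```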
Another way of seeing dropout - Ensemble method
Averaging many different neural networks.
Different can mean either:
- randomizing the data set on which we train each network (via subsampling).
Problem: not enough data to obtain good performance...
- building different network architectures and training each large network separately on the whole training set.
Problem: computationally prohibitive at training time and at test time!
Miscellaneous: [“Fast dropout training”, Wang and Manning 2013] [“Dropout: A simple way to prevent neural networks from overfitting”, Srivastava et al. 2014]
E. Scornet Deep Learning January 2020 94 / 118
Exercise: linear units
1 Consider a neural network with linear activation functions. Prove that dropout can be seen as a model averaging method.
2 Given one training example, consider the error of the ensemble of neural networks and that of one random neural network sampled with dropout:
E_ens = (1/2)(y − a_ens)² = (1/2)(y − ∑_{i=1}^d p_i w_i x_i)²
E_single = (1/2)(y − a_single)² = (1/2)(y − ∑_{i=1}^d δ_i w_i x_i)²,
where δ_i ∈ {0, 1} represents the presence (δ_i = 1) or the absence of a connection between the input x_i and the output.
Prove that
E[∇_w E_single] = ∇_w (E_ens + (1/2) ∑_{i=1}^d w_i² x_i² V(δ_i)).
Comment.
E. Scornet Deep Learning January 2020 95 / 118
Solution
1 Simple calculations...
2 The gradient of the ensemble is given by
∂E_ens/∂w_i = −(y − a_ens) ∂a_ens/∂w_i = −(y − a_ens) p_i x_i,
and that of a single (random) network satisfies
∂E_single/∂w_i = −(y − a_single) ∂a_single/∂w_i = −(y − a_single) δ_i x_i
= −y δ_i x_i + w_i δ_i² x_i² + ∑_{j≠i} w_j δ_i δ_j x_i x_j.
Taking the expectation of the last expression (with E[δ_i] = E[δ_i²] = p_i and the δ's independent) gives
E[∂E_single/∂w_i] = −y p_i x_i + w_i p_i x_i² + ∑_{j≠i} w_j p_i p_j x_i x_j
= −(y − a_ens) p_i x_i + w_i x_i² p_i (1 − p_i).
Therefore, we have
E[∂E_single/∂w_i] = ∂E_ens/∂w_i + w_i V[δ_i x_i].
E. Scornet Deep Learning January 2020 96 / 118
Solution
In terms of error, we have
E[E_single] = E_ens + (1/2) ∑_{i=1}^d w_i² x_i² V[δ_i].
The regularization term is maximized for p = 0.5.
E. Scornet Deep Learning January 2020 97 / 118
Batch normalization
The network converges faster if its inputs are scaled (mean, variance) and decorrelated. [“Efficient backprop”, LeCun et al. 1998]
It is hard to decorrelate variables: it requires computing the covariance matrix. [“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]
Ideas:
- Improving gradient flows
- Allowing higher learning rates
- Reducing strong dependence on initialization
- Related to regularization (maybe slightly reduces the need for Dropout)
E. Scornet Deep Learning January 2020 99 / 118
Algorithm
See [“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]
1 For every unit in the first layer, over the current mini-batch of size m:
1 µ_B = (1/m) ∑_{i=1}^m x_i
2 σ²_B = (1/m) ∑_{i=1}^m (x_i − µ_B)²
3 x̂_i = (x_i − µ_B)/√(σ²_B + ε)
4 y_i = γ x̂_i + β ≡ BN_{γ,β}(x_i)
2 y_i is fed to the next layer and the procedure iterates.
3 Backpropagation is performed on the network parameters, including (γ^{(k)}, β^{(k)}). This returns a network.
4 For inference, compute the average over many training batches B:
E_B[x] = E_B[µ_B] and V_B[x] = (m/(m − 1)) E_B[σ²_B].
5 For inference, replace every function x ↦ BN_{γ,β}(x) in the network by
x ↦ (γ/√(V_B[x] + ε)) x + (β − γ E_B[x]/√(V_B[x] + ε)).
E. Scornet Deep Learning January 2020 100 / 118
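A sketch of the batch-normalization transform above for a mini-batch of activations (per-unit statistics; ε, γ and β as on the slide; the averaging of statistics over many batches is left out for brevity):

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """X: mini-batch of shape (m, n_units). Returns BN_{gamma,beta}(X) plus the batch statistics."""
    mu = X.mean(axis=0)                      # mu_B, one mean per unit
    var = X.var(axis=0)                      # sigma^2_B
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * X_hat + beta, mu, var     # y_i = gamma * x_hat_i + beta

def batch_norm_infer(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time transform, using the (averaged) training statistics."""
    return gamma / np.sqrt(var + eps) * x + (beta - gamma * mean / np.sqrt(var + eps))

X = np.random.randn(64, 8) * 3.0 + 5.0
out, mu, var = batch_norm_train(X, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 mean, ~1 std per unit
```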
Early stopping
Idea:
- Store the parameter values that lead to the lowest error on the validation set.
- Return these values rather than the latest ones.
E. Scornet Deep Learning January 2020 102 / 118
Early stopping algorithm
Parameters:
- patience p of the algorithm: number of times to observe a worsening validation-set error before giving up;
- the number of steps n between evaluations.
1 Start with initial random values θ_0.
2 Let θ* = θ_0, Err* = ∞, j = 0, i = 0.
3 While j < p:
1 Update θ by running the training algorithm for n steps.
2 i = i + n
3 Compute the error Err(θ) on the validation set.
4 If Err(θ) < Err*: set θ* = θ, Err* = Err(θ), j = 0; else j = j + 1.
4 Return θ* and the overall number of steps i* = i − np.
E. Scornet Deep Learning January 2020 103 / 118
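A sketch of the early-stopping loop above; train_n_steps and validation_error are placeholders for the user's training routine and validation metric:

```python
import copy

def early_stopping(theta0, train_n_steps, validation_error, patience=5, n=100):
    """Return the parameters with the best validation error and the corresponding step count."""
    theta, best_theta = theta0, copy.deepcopy(theta0)
    best_err, i, j = float("inf"), 0, 0
    while j < patience:
        theta = train_n_steps(theta, n)      # update theta with n training steps
        i += n
        err = validation_error(theta)
        if err < best_err:                   # improvement: store parameters, reset patience
            best_theta, best_err, j = copy.deepcopy(theta), err, 0
        else:                                # no improvement: consume one unit of patience
            j += 1
    return best_theta, i - n * patience      # i* = i - n*p

# Dummy usage: each "training step" adds 0.5 to theta; the best validation error is at theta = 3
best, n_steps = early_stopping(0.0, lambda th, n: th + 0.5 * n, lambda th: (th - 3.0) ** 2,
                               patience=3, n=1)
print(best, n_steps)   # 3.0, 6
```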
How to leverage on early stopping?
First idea: use early stopping to determine the best number of iterations i*, then train on the whole data set for i* iterations.
Let X^(train), y^(train) be the training set.
Split X^(train), y^(train) into X^(subtrain), y^(subtrain) and X^(valid), y^(valid).
Run the early stopping algorithm starting from a random θ, using X^(subtrain), y^(subtrain) as training data and X^(valid), y^(valid) as validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train), y^(train) for i* steps.
E. Scornet Deep Learning January 2020 104 / 118
How to leverage on early stopping?
Second idea: use early stopping to determine the best parameters and the training error at the best number of iterations. Starting from θ*, train on the whole data set until the error matches the previous early-stopping error.
Let X^(train), y^(train) be the training set.
Split X^(train), y^(train) into X^(subtrain), y^(subtrain) and X^(valid), y^(valid).
Run the early stopping algorithm starting from a random θ, using X^(subtrain), y^(subtrain) as training data and X^(valid), y^(valid) as validation data. This returns the optimal parameters θ*.
Set ε = L(θ*, X^(subtrain), y^(subtrain)).
While L(θ, X^(valid), y^(valid)) > ε, train θ (initialized at θ*) on X^(train), y^(train) for n steps.
E. Scornet Deep Learning January 2020 105 / 118
To go further
Early stopping is a very old ideaI [“Three topics in ill-posed problems”, Wahba 1987]
I [“A formal comparison of methods proposed for the numerical solution of first kind integral equations”, Anderssen and Prenter 1981]
I [“Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping”, Caruana et al. 2001]
But also an active area of researchI [“Adaboost is consistent”, Bartlett and Traskin 2007]
I [“Boosting algorithms as gradient descent”, Mason et al. 2000]
I [“On early stopping in gradient descent learning”, Yao et al. 2007]
I [“Boosting with early stopping: Convergence and consistency”, Zhang, Yu, et al. 2005]
I [“Early stopping for kernel boosting algorithms: A general analysis with localized complexities”, Wei et al. 2017]
E. Scornet Deep Learning January 2020 106 / 118
More on reducing overfitting: averaging
Soft-weight sharing: [“Simplifying neural networks by soft weight-sharing”, Nowlan and Hinton 1992]
Model averaging:
Average over: random initialization, random selection of mini-batches, hyperparameters, or outcomes of nondeterministic neural networks.
Boosting neural networks by incrementally adding neural networks to the ensemble [“Training methods for adaptive boosting of neural networks”, Schwenk and Bengio 1998]
Boosting has also been applied by interpreting an individual neural network as an ensemble, incrementally adding hidden units to the network [“Convex neural networks”, Bengio et al. 2006]
E. Scornet Deep Learning January 2020 107 / 118
Pipeline for neural networks
Step 1: Preprocess the data (subtract the mean, divide by the standard deviation). More complex if the data are images.
Step 2: Choose the architecture (number of layers, number of nodes per layer).
Step 3:
1 First, run the network and check that the loss is reasonable (compare with a dumb classifier: uniform prediction for classification, mean prediction for regression).
2 Add some regularization and check that the error on the training set increases.
3 On a small portion of the data, make sure you can overfit when turning down the regularization.
4 Find the best learning rate:
1 The error does not change much → learning rate too small.
2 The error explodes, NaN → learning rate too high.
3 Find a rough range, e.g., [10⁻⁵, 10⁻³].
Playing with neural networks:
http://playground.tensorflow.org/
E. Scornet Deep Learning January 2020 109 / 118
To go further
The kernel perceptron algorithm was already introduced by [“Theoretical foundations of the potential function method in pattern recognition learning”, Aizerman 1964].
General idea (works for all methods using only dot products): replace the dot product by a more complex kernel function.
A linear separation in the more complex space corresponds to a non-linear separation in the original space.
Margin bounds for the Perceptron algorithm in the general non-separable case were proven by [“Large margin classification using the perceptron algorithm”, Freund and Schapire 1999] and then by [“Perceptron mistake bounds”, Mohri and Rostamizadeh 2013], who extended existing results and gave new L1 bounds.
E. Scornet Deep Learning January 2020 110 / 118
Mark A Aizerman. “Theoretical foundations of the potential functionmethod in pattern recognition learning”. In: Automation and remotecontrol 25 (1964), pp. 821–837.
RS Anderssen and PM Prenter. “A formal comparison of methods proposedfor the numerical solution of first kind integral equations”. In: TheANZIAM Journal 22.4 (1981), pp. 488–500.
Yoshua Bengio et al. “Convex neural networks”. In: Advances in neuralinformation processing systems. 2006, pp. 123–130.
Hans-Dieter Block. “The perceptron: A model for brain functioning. i”. In:Reviews of Modern Physics 34.1 (1962), p. 123.
John S Bridle. “Probabilistic interpretation of feedforward classificationnetwork outputs, with relationships to statistical pattern recognition”. In:Neurocomputing. Springer, 1990, pp. 227–236.
Widrow Bernard and D Stearns Samuel. “Adaptive signal processing”. In:Englewood Cliffs, NJ, Prentice-Hall, Inc 1 (1985), p. 491.
Peter L Bartlett and Mikhail Traskin. “Adaboost is consistent”. In: Journalof Machine Learning Research 8.Oct (2007), pp. 2347–2368.
E. Scornet Deep Learning January 2020 111 / 118
Rich Caruana, Steve Lawrence, and C Lee Giles. “Overfitting in neural nets:Backpropagation, conjugate gradient, and early stopping”. In: Advances inneural information processing systems. 2001, pp. 402–408.
Meng Cai, Yongzhe Shi, and Jia Liu. “Deep maxout neural networks forspeech recognition”. In: Automatic Speech Recognition and Understanding(ASRU), 2013 IEEE Workshop on. IEEE. 2013, pp. 291–296.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. “Fast andaccurate deep network learning by exponential linear units (elus)”. In: arXivpreprint arXiv:1511.07289 (2015).
Yoav Freund and Robert E Schapire. “Large margin classification using theperceptron algorithm”. In: Machine learning 37.3 (1999), pp. 277–296.
Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of trainingdeep feedforward neural networks”. In: Proceedings of the thirteenthinternational conference on artificial intelligence and statistics. 2010,pp. 249–256.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifierneural networks”. In: Proceedings of the Fourteenth InternationalConference on Artificial Intelligence and Statistics. 2011, pp. 315–323.
E. Scornet Deep Learning January 2020 112 / 118
Ian J Goodfellow, Mehdi Mirza, et al. “An empirical investigation ofcatastrophic forgetting in gradient-based neural networks”. In: arXivpreprint arXiv:1312.6211 (2013).
Ian J Goodfellow, David Warde-Farley, et al. “Maxout networks”. In: arXivpreprint arXiv:1302.4389 (2013).
Alex Graves. “Generating sequences with recurrent neural networks”. In:arXiv preprint arXiv:1308.0850 (2013).
Kaiming He et al. “Delving deep into rectifiers: Surpassing human-levelperformance on imagenet classification”. In: Proceedings of the IEEEinternational conference on computer vision. 2015, pp. 1026–1034.
DO Hebb. The organization of behavior: a neuropsychological theory.Wiley, 1949.
Geoffrey E Hinton et al. “Improving neural networks by preventingco-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580(2012).
Arthur E Hoerl and Robert W Kennard. “Ridge regression: Biasedestimation for nonorthogonal problems”. In: Technometrics 12.1 (1970),pp. 55–67.
E. Scornet Deep Learning January 2020 113 / 118
Michael Jen-Chao Hu. “Application of the adaline system to weatherforecasting”. PhD thesis. Department of Electrical Engineering, StanfordUniversity, 1964.
Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift”. In: arXiv preprintarXiv:1502.03167 (2015).
Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. “What is the bestmulti-stage architecture for object recognition?” In: Computer Vision, 2009IEEE 12th International Conference on. IEEE. 2009, pp. 2146–2153.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenetclassification with deep convolutional neural networks”. In: Advances inneural information processing systems. 2012, pp. 1097–1105.
Yann LeCun et al. “Efficient backprop”. In: Neural networks: Tricks of thetrade. Springer, 1998, pp. 9–50.
Llew Mason et al. “Boosting algorithms as gradient descent”. In: Advancesin neural information processing systems. 2000, pp. 512–518.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. “Rectifiernonlinearities improve neural network acoustic models”. In: Proc. icml.Vol. 30. 1. 2013, p. 3.
E. Scornet Deep Learning January 2020 114 / 118
Dmytro Mishkin and Jiri Matas. “All you need is a good init”. In: arXivpreprint arXiv:1511.06422 (2015).
Warren S McCulloch and Walter Pitts. “A logical calculus of the ideasimmanent in nervous activity”. In: The bulletin of mathematical biophysics5.4 (1943), pp. 115–133.
Marvin Minsky and Seymour Papert. “Perceptrons.” In: (1969).
Mehryar Mohri and Afshin Rostamizadeh. “Perceptron mistake bounds”.In: arXiv preprint arXiv:1305.0208 (2013).
Steven J Nowlan and Geoffrey E Hinton. “Simplifying neural networks bysoft weight-sharing”. In: Neural computation 4.4 (1992), pp. 473–493.
Albert B Novikoff. On convergence proofs for perceptrons. Tech. rep.STANFORD RESEARCH INST MENLO PARK CA, 1963.
Mikel Olazaran. “A sociological study of the official history of theperceptrons controversy”. In: Social Studies of Science 26.3 (1996),pp. 611–659.
Frank Rosenblatt. “Perceptron simulation experiments”. In: Proceedings ofthe IRE 48.3 (1960), pp. 301–309.
E. Scornet Deep Learning January 2020 115 / 118
Frank Rosenblatt. Principles of neurodynamics. perceptrons and the theoryof brain mechanisms. Tech. rep. CORNELL AERONAUTICAL LAB INCBUFFALO NY, 1961.
Holger Schwenk and Yoshua Bengio. “Training methods for adaptiveboosting of neural networks”. In: Advances in neural information processingsystems. 1998, pp. 647–653.
Michael Schuster. “On supervised learning from sequential data withapplications for speech recognition”. In: Daktaro disertacija, Nara Instituteof Science and Technology 45 (1999).
Nitish Srivastava et al. “Dropout: A simple way to prevent neural networksfrom overfitting”. In: The Journal of Machine Learning Research 15.1(2014), pp. 1929–1958.
LR Talbert, GF Groner, and JS Koford. “Real-Time AdaptiveSpeech-Recognition System”. In: The Journal of the Acoustical Society ofAmerica 35.5 (1963), pp. 807–807.
Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In:Journal of the Royal Statistical Society. Series B (Methodological) (1996),pp. 267–288.
E. Scornet Deep Learning January 2020 116 / 118
Grace Wahba. “Three topics in ill-posed problems”. In: Inverse andill-posed problems. Elsevier, 1987, pp. 37–51.
Bernard Widrow and Marcian E Hoff. Adaptive switching circuits.Tech. rep. Stanford Univ Ca Stanford Electronics Labs, 1960.
Wessel N van Wieringen. “Lecture notes on ridge regression”. In: arXivpreprint arXiv:1509.09169 (2015).
Sida Wang and Christopher Manning. “Fast dropout training”. In:international conference on machine learning. 2013, pp. 118–126.
Capt Rodney Winter and B Widrow. “Madaline Rule II: a training algorithmfor neural networks”. In: Second Annual International Conference on NeuralNetworks. 1988, pp. 1–401.
Yuting Wei, Fanny Yang, and Martin J Wainwright. “Early stopping forkernel boosting algorithms: A general analysis with localized complexities”.In: Advances in Neural Information Processing Systems. 2017,pp. 6067–6077.
Bing Xu et al. “Empirical evaluation of rectified activations in convolutionalnetwork”. In: arXiv preprint arXiv:1505.00853 (2015).
E. Scornet Deep Learning January 2020 117 / 118
Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. “On early stopping ingradient descent learning”. In: Constructive Approximation 26.2 (2007),pp. 289–315.
Tong Zhang, Bin Yu, et al. “Boosting with early stopping: Convergence andconsistency”. In: The Annals of Statistics 33.4 (2005), pp. 1538–1579.
Hui Zou and Trevor Hastie. “Regularization and variable selection via theelastic net”. In: Journal of the Royal Statistical Society: Series B(Statistical Methodology) 67.2 (2005), pp. 301–320.
E. Scornet Deep Learning January 2020 118 / 118