Deep-Learning : the basics

A. Allauzen

Université Paris-Sud / LIMSI-CNRS

9 May 2017

Outline

1 Neural Nets : Basics
  Terminology
  Training by back-propagation

2 Tools

3 Drop-out

4 Vanishing gradient

Neural Nets : Basics / Terminology

A choice of terminology

Logistic regression (binary classification)

The input x is extended with a constant 1 (the intercept term) and weighted by w1 :

y1 = f(w1^T x), with f(a = w1^T x) = 1 / (1 + e^(-a))

[Figure : the logistic curve, y1 plotted against w1^T x over [-10, 10], rising from 0 to 1.]

A single artificial neuron

pre-activation : a1 = w1^T x

y1 = f(w1^T x), f is the activation function of the neuron; the constant input 1 plays the role of a bias term, weighted by w1.
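To make the notation concrete, here is a small NumPy sketch of this single sigmoid neuron (not from the slides; the toy values of x and w1 are made up):

import numpy as np

def sigmoid(a):
    # logistic activation f(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

# toy input with the constant intercept appended as the last component
x = np.array([0.5, -1.2, 1.0])
w1 = np.array([0.8, 0.3, -0.1])    # the last weight plays the role of the bias

a1 = w1 @ x                        # pre-activation a1 = w1^T x
y1 = sigmoid(a1)                   # output of the neuron, in (0, 1)
print(y1)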


A choice of terminology - 2

From binary classification to K classes (Maxent)

y1 = f(w1^T x), ..., yK = f(wK^T x)

f(ak = wk^T x) = e^(ak) / Σ_{k'=1..K} e^(ak') = e^(ak) / Z(x)

A simple neural network

[Figure : the input x connected to K output units y1, ..., yK through the weight vectors w1, ..., wK.]

x : input layer

y : output layer

each yk has its parameters wk

f is the softmax function
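A small NumPy sketch of such a softmax output layer (illustrative only; W and x are random toy values, and subtracting the max is a standard numerical-stability trick not shown on the slide):

import numpy as np

def softmax(a):
    # f(a_k) = exp(a_k) / Z(x), with Z(x) = sum over k' of exp(a_k')
    e = np.exp(a - a.max())        # shifting by the max does not change the result
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 3, 4                        # 3 classes, 4 input features
W = rng.normal(size=(K, d))        # row k of W is wk^T
x = rng.normal(size=d)

y = softmax(W @ x)                 # yk = P(c = k | x)
print(y, y.sum())                  # a probability vector summing to 1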


Two layers fully connected

x → y = f(Wx), where W is the weight matrix and W_{k,:} = wk^T is its k-th row.

f is usually a non-linear, component-wise function, e.g. the softmax function :

yk = P(c = k | x) = e^(wk^T x) / Σ_{k'} e^(wk'^T x) = e^(W_{k,:} x) / Σ_{k'} e^(W_{k',:} x)

Other common choices : tanh, sigmoid, relu, ...


Bias or not bias

Implicit bias : the input x is extended with a constant component 1 and y = f(Wx); the column of W associated with this constant plays the role of the bias.

Explicit bias : y = f(Wx + b), with a separate bias vector b.


With a neural network : add a hidden layer

x : raw input representation

h = f(W (1)x)

y = f(W (2)h)

the internal and tailored representation

Intuitions

Learn an internal representation of the raw input

Apply a non-linear transformation

The input representation x is transformed/compressed into a new representation h

Adding more layers yields a more and more abstract representation
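An illustrative NumPy sketch of this forward computation (the sizes, the tanh hidden activation and the softmax output are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(1)
d, H, K = 4, 5, 3                      # input dim, hidden size, number of classes
W1 = rng.normal(size=(H, d))           # W(1): hidden-layer parameters
W2 = rng.normal(size=(K, H))           # W(2): output-layer parameters

x = rng.normal(size=d)                 # raw input representation
h = np.tanh(W1 @ x)                    # internal, learned representation h = f(W(1) x)
a = W2 @ h
y = np.exp(a - a.max()); y /= y.sum()  # softmax output y = f(W(2) h)
print(h, y)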


How do we learn the parameters ?

For a supervised single layer neural net

Just like a maxent model :

Calculate the gradient of the objective function and use it to iteratively update the parameters.

Conjugate gradient, L-BFGS, ...

In practice : Stochastic gradient descent (SGD)

With one hidden layer

The internal (“hidden”) units make the function non-convex ... just like other models with hidden variables :

hidden CRFs (Quattoni et al.2007), ...

But we can use the same ideas and techniques

Just without guarantees ⇒ backpropagation (Rumelhart et al.1986)

Neural Nets : Basics / Training by back-propagation

Ex. 1 : A single layer network for classification

[Figure : a single layer x → y = f(Wx), with yk = P(c = k | x).]

θ = the set of parameters, in this case : θ = (W)

The log-loss (conditional log-likelihood)

Assume the dataset D = (x(i), c(i)), i = 1, ..., N, with c(i) ∈ {1, 2, ..., C}

L(θ) = Σ_{i=1}^{N} l(θ, x(i), c(i)) = Σ_{i=1}^{N} ( − Σ_{c=1}^{C} I{c = c(i)} log P(c | x(i)) )   (1)

l(θ, x(i), c(i)) = − Σ_{k=1}^{C} I{k = c(i)} log(yk)   (2)
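For a single example, the loss (2) is just the negative log-probability assigned to the gold class; a tiny sketch with made-up values:

import numpy as np

y = np.array([0.7, 0.2, 0.1])    # softmax output for one example (toy values)
c = 0                            # gold class index c(i)
loss = -np.log(y[c])             # l(theta, x, c) = -log(y_c)
print(loss)                      # small when y[c] is close to 1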


Ex. 1 : optimization method

Stochastic Gradient Descent (Bottou2010)

For (t = 1 ; until convergence ; t++) :

Pick randomly a sample (x(i), c(i))

Compute the gradient of the loss function w.r.t the parameters (∇θ)

Update the parameters : θ = θ − ηt∇θ

Questions

convergence : what does it mean ?

what do you mean by ηt ?

convergence if Σ_t ηt = ∞ and Σ_t ηt² < ∞, e.g. ηt ∝ 1/t

and lots of variants like Adagrad (Duchi et al. 2011), down-scheduling of the learning rate, ... see (LeCun et al. 2012)
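A minimal sketch of this SGD loop in Python (illustrative; grad_fn, the fixed epoch budget and the 1/t decay are assumptions, not a prescribed implementation):

import numpy as np

def sgd(grad_fn, theta, data, epochs=10, eta0=0.1):
    # grad_fn(theta, x, c) returns the gradient of the per-example loss
    rng = np.random.default_rng(0)
    t = 1
    for _ in range(epochs):                 # "until convergence", here a fixed budget
        for i in rng.permutation(len(data)):
            x, c = data[i]                  # pick randomly a sample (x(i), c(i))
            theta = theta - (eta0 / t) * grad_fn(theta, x, c)   # eta_t ∝ 1/t
            t += 1
    return theta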


Ex. 1 : compute the gradient - 1

[Figure : the weight wkj of W connects input component j to output unit k.]

Inference chain : x(i) → (a = Wx(i)) → (y = f(a)) → l(θ, x(i), c(i))

The gradient for wkj

∇wkj = ∂l(θ, x(i), c(i)) / ∂wkj = ∂l / ∂y × ∂y / ∂a × ∂a / ∂wkj = −(I{k = c(i)} − yk) xj = δk xj


Ex. 1 : compute the gradient - 2

[Figure : the error signal δ flows back from the output y to the weights W.]

Generalization

∇W = δ x^T, with δk = −(I{k = c(i)} − yk)

with δ the gradient at the pre-activation level.
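A NumPy sketch of this gradient for one example (illustrative; the one-hot vector implements the indicator I{k = c}):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, d = 3, 4
W = rng.normal(size=(K, d))
x = rng.normal(size=d)
c = 1                              # gold class c(i) for this example

y = softmax(W @ x)
delta = -(np.eye(K)[c] - y)        # delta_k = -(I{k = c} - y_k)
grad_W = np.outer(delta, x)        # gradient of W: delta x^T, same shape as W
print(grad_W.shape)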


Ex. 1 : Summary

x → y = f(W(1) x)   (forward : x ; backward : δ)

Inference : a forward step

matrix multiplication with the input x

application of the activation function

One training step : forward and backward steps

Pick randomly a sample (x(i), c(i))

Compute δ

Update the parameters : θ = θ − ηt δ x^T


Notations for a multi-layer neural network (feed-forward)

One layer, indexed by l

[Figure : one layer l, mapping its input x(l) through W(l) to its output y(l).]

x(l) : input of the layer l

y(l) = f (l)(W (l) x(l))

stacking layers : y(l) = x(l+1)

x(1) = a data example

[Figure : stacked layers : x(1) → W(1) → y(1) = x(2) → W(2) → y(2) → ... → x(L) → W(L) → y(L) : output.]


Ex. 2 : with one hidden layer

[Figure : x(1) → W(1) → y(1) = x(2) → W(2) → y(2) : output.]

θ = (W (1),W (2))

To learn, we need the gradients for :

the output layer : ∇W (2)

the hidden layer : ∇W (1)


Back-propagation of the loss gradient : for the output layer

As in the Ex. 1 :

y(1) = x(2), y(2) = f(W(2) x(2))   (forward : x(2) ; backward : δ(2))

∇W(2) = δ(2) x(2)^T, with δ(2)_k = −(I{k = c(i)} − yk)

i.e. the same formulas as in Ex. 1, with y → y(2), W → W(2), x → x(2) = y(1)


Back-propagation of the loss gradient : for the hidden layer - 1

The goal : compute δ(1)

Inference (forward) chain from a(1) to the output :

y(1) = f(1)(a(1)) → (a(2) = W(2) y(1)) → (y(2) = f(2)(a(2))) → l(θ, x(i), c(i))

Backward / Back-propagation :

δ(1)_j = ∇a(1)_j = ∂l(θ, x(i), c(i)) / ∂a(1)_j = ∂l / ∂y(2) × ∂y(2) / ∂a(2) × ∂a(2) / ∂y(1)_j × ∂y(1)_j / ∂a(1)_j


Back-propagation of the loss gradient : for the hidden layer - 2

[Figure : unit j of the hidden layer y(1) feeds every unit of y(2) through the column W(2)_{:,j}.]

Backward / Back-propagation :

δ(1)_j = ∇a(1)_j = ∂l(θ, x(i), c(i)) / ∂a(1)_j = ∂l / ∂y(2) × ∂y(2) / ∂a(2) × ∂a(2) / ∂y(1)_j × ∂y(1)_j / ∂a(1)_j = f(1)'(a(1)_j) × (W(2)_{:,j}^T δ(2))


Back-propagation of the loss gradient : for the hidden layer - 3

[Figure : the back-propagated signal ∇y(1) ← W(2)^T δ(2) flows from the output layer back through W(2).]

∇y(1) = W(2)^T δ(2), then

δ(1) = ∇a(1) = f(1)'(a(1)) ◦ (W(2)^T δ(2))


Back-propagation of the loss gradient : for the hidden layer - 4

[Figure : δ(2) is propagated back from y(2) to y(1), giving δ(1).]

As for the output layer, the gradient is :

∇W(1) = δ(1) x(1)^T, with δ(1)_j = ∇a(1)_j, i.e. δ(1) = f(1)'(a(1)) ◦ (W(2)^T δ(2))

The term (W(2)^T δ(2)) comes from the upper layer.


Back-propagation : generalization

For a hidden layer l :

The gradient at the pre-activation level :

δ(l) = f(l)'(a(l)) ◦ (W(l+1)^T δ(l+1))

The update is as follows :

W(l) = W(l) − ηt δ(l) x(l)^T

The layer should keep (see the sketch after this list) :

W (l) : the parameters

f (l) : its activation function

x(l) : its input

a(l) : its pre-activation associated to the input

δ(l) : for the update and the back-propagation to the layer l − 1
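A minimal sketch of such a layer object (the class name, the tanh activation, and doing the SGD update inside backward() are choices made for this example, not the slides' reference code):

import numpy as np

class DenseLayer:
    # keeps W, its activation, its input x, its pre-activation a, and delta

    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))

    def forward(self, x):
        self.x = x                     # x(l)
        self.a = self.W @ x            # a(l) = W(l) x(l)
        return np.tanh(self.a)         # y(l) = f(l)(a(l)), here f = tanh

    def backward(self, grad_y, eta):
        delta = (1.0 - np.tanh(self.a) ** 2) * grad_y    # delta(l) = f'(a(l)) ∘ (signal from layer l+1)
        grad_x = self.W.T @ delta                        # W(l)^T delta(l), sent to layer l-1
        self.W -= eta * np.outer(delta, self.x)          # W(l) = W(l) - eta delta(l) x(l)^T
        return grad_x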


Back-propagation : one training step

Pick a training example : x(1) = x(i)

Forward pass

For l = 1 to (L− 1)

Compute y(l) = f (l)(W (l)x(l))

x(l+1) = y(l)

y(L) = f (L)(W (L)x(L))

Backward pass

Init : δ(L) = ∇a(L)

For l = L to 2 // all hidden units

δ(l−1) = f(l−1)'(a(l−1)) ◦ (W(l)^T δ(l))

W(l) = W(l) − ηt δ(l) x(l)^T

W(1) = W(1) − ηt δ(1) x(1)^T
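For concreteness, a self-contained NumPy sketch of one such training step for a network with one sigmoid hidden layer and a softmax output trained with the log-loss (all sizes and values are made up):

import numpy as np

rng = np.random.default_rng(3)
d, H, K, eta = 4, 5, 3, 0.1
W1 = rng.normal(0, 1 / np.sqrt(d), (H, d))
W2 = rng.normal(0, 1 / np.sqrt(H), (K, H))
x, c = rng.normal(size=d), 2                 # one training example (x(i), c(i))

# forward pass
a1 = W1 @ x
y1 = 1 / (1 + np.exp(-a1))                   # sigmoid hidden layer
a2 = W2 @ y1
y2 = np.exp(a2 - a2.max()); y2 /= y2.sum()   # softmax output

# backward pass
delta2 = y2 - np.eye(K)[c]                   # delta(L), gradient of the log-loss at the pre-activation
delta1 = (y1 * (1 - y1)) * (W2.T @ delta2)   # back-propagated to the hidden layer
W2 -= eta * np.outer(delta2, y1)
W1 -= eta * np.outer(delta1, x)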


Initialization recipes

A difficult question with several empirical answers.

One standard trick

W ∼ N(0, 1/√nin)

where nin is the number of inputs of the layer

A more recent one

W ∼ U[ −√6/√(nin + nout) , +√6/√(nin + nout) ]

where nin and nout are the numbers of inputs and outputs of the layer
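A hedged NumPy sketch of the two recipes (interpreting 1/√nin as a standard deviation; the second recipe is the one usually called Glorot/Xavier initialization; the function names are invented for the example):

import numpy as np

def init_gaussian(n_in, n_out, rng):
    # W ~ N(0, 1/sqrt(n_in))
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))

def init_uniform(n_in, n_out, rng):
    # W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
print(init_gaussian(100, 30, rng).std(), init_uniform(100, 30, rng).std())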

Tools

Some useful libraries

Theano

Written in Python by the LISA lab (Y. Bengio and I. Goodfellow)

TensorFlow

The Google library, with a Python API

Keras

A high-level API, in Python, running on top of either TensorFlow or Theano.

Torch

The Facebook library, with Lua and Python APIs

CPU/GPU

Automatic differentiation based on a computation graph


Computation graph

A convenient way to represent complex mathematical expressions :

each node is an operation or a variable

an operation has some inputs / outputs made of variables

Example 1 : A single layer network

[Graph : x(1) and W(1) feed a × node, followed by f(1), producing y(1).]

Setting x(1) and W (1)

Forward pass → y(1)

y(1) = f (1)(W (1)x(1))


Training computation graph

[Graph : x(1) and W(1) → × → f(1) → l, with the label c(i) also feeding the loss node l(x(1), c(i), θ).]

A variable node encodes the label

To compute the output for a given input

→ forward pass

To compute the gradient of the loss wrt the parameters (W (1))

→ backward pass


A function node

Forward pass

[Graph : inputs x and y enter the node f, which outputs z.]

This node implements : z = f(x, y)


A function node - 2

Backward pass

[Graph : the node f receives ∂l/∂z from above and returns ∂l/∂x and ∂l/∂y to its inputs.]

A function node knows :

the “local gradients” computation : ∂z/∂x, ∂z/∂y

how to return the gradient to the inputs : (∂l/∂z × ∂z/∂x), (∂l/∂z × ∂z/∂y)


Summary of a function node

f : x, y, z                                     # store the values
z = f(x, y)                                     # forward
∂z/∂x → ∂f/∂x, ∂z/∂y → ∂f/∂y                    # local gradients
∂l/∂x = ∂l/∂z × ∂z/∂x, ∂l/∂y = ∂l/∂z × ∂z/∂y    # backward
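As an illustration of this contract, a minimal Python sketch of one function node, here an element-wise multiply (the class layout is an assumption, not any particular library's API):

import numpy as np

class MultiplyNode:
    # z = f(x, y) = x * y

    def forward(self, x, y):
        self.x, self.y = x, y            # store the values
        return self.x * self.y           # forward

    def backward(self, dl_dz):
        # local gradients: dz/dx = y, dz/dy = x
        dl_dx = dl_dz * self.y           # dl/dx = dl/dz * dz/dx
        dl_dy = dl_dz * self.x           # dl/dy = dl/dz * dz/dy
        return dl_dx, dl_dy              # returned to the input nodes

node = MultiplyNode()
z = node.forward(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
print(node.backward(np.ones(2)))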


Example of a single layer network

[Graph : x(1), W(1) → × → f(1) → l, with the label c(i) feeding the loss node l(x(1), c(i), θ).]

Forward

For each function node in topological order

forward propagation

Which means :

1 a(1) = W (1)x(1)

2 y(1) = f (1)(a(1))

3 l(y(1), c(i))


Backward

For each function node in reversed topological order

backward propagation

Which means :

1 ∇y(1)

2 ∇a(1)

3 ∇W (1)


Example of a two-layer network

[Graph : x(1), W(1) → × → f(1), then W(2) → × → f(2) → l, with the label c(i) feeding the loss node l(x(2), c(i), θ).]

The algorithms remain the same,

even for more complex architectures

Generalization by coding the function node


Example in Theano - 1

import theano
import theano.tensor as T

# Define the input
x = T.fvector('x')

# The parameters of the hidden layer
H = 100                                # hidden layer size
n_in = im.shape[0]                     # dimension of inputs (im: the input example, defined elsewhere)
n_out = H
# uniform() and shared0s() are helper functions from the course code,
# presumably wrapping theano.shared with uniform / zero initialization
Wi = uniform(shape=[n_out, n_in], name="Wi")
bi = shared0s([n_out], name="bi")

# parameters for the output layer
n_in = H
n_out = NLABELS                        # NLABELS: number of output classes, defined elsewhere
Wo = uniform(shape=[n_out, n_in], name="Wo")
bo = shared0s([n_out], name="bo")


Example in Theano - 2

# define the hidden layer
h = T.nnet.relu(T.dot(Wi, x) + bi)

# output layer and related variables:
p_y_given_x = T.nnet.softmax(T.dot(Wo, h) + bo)
y_pred = T.argmax(p_y_given_x)

# Compute the cost function (log-loss of the gold class)
ygold = T.iscalar('gold_target')
cost = -T.log(p_y_given_x[0][ygold])

# 1/ Store all the learnt parameters:
params = [Wi, bi, Wo, bo]
# 2/ Get the gradient w.r.t. each of them
gradients = T.grad(cost, params)
# 3/ Collect the SGD updates (learning_rate is defined elsewhere)
upds = [(p, p - (learning_rate * g))
        for p, g in zip(params, gradients)]


Example in TensorFlow - 1

import tensorflow as tf

# x isn't a specific value. It's a placeholder,
# a value that we'll input to run a computation.
x = tf.placeholder(tf.float32, [None, 784])

# Define the parameters as variables
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# the prediction variable
y = tf.nn.softmax(tf.matmul(x, W) + b)

# the gold standard (a placeholder)
y_ = tf.placeholder(tf.float32, [None, 10])


Example in TensorFlow - 2

# the loss function
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

# SGD
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

# Init. of all the variables
# This defines the operation but does not run it yet.
init = tf.initialize_all_variables()

# open a session
sess = tf.Session()
sess.run(init)

# Training (mnist is the dataset object of the TensorFlow MNIST tutorial)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

Drop-out

Regularization : l2, gaussian prior, or weight decay

The basic way :

L(θ) = Σ_{i=1}^{N} l(θ, x(i), c(i)) + (λ/2) ||θ||²

The second term is the regularization term.

Each parameter has a gaussian prior : N (0, 1/λ).

λ is a hyperparameter.

The update has the form :

θ = (1 − ηtλ) θ − ηt ∇θ
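One SGD step with this weight-decay term, as a small NumPy sketch (grad_l stands for the gradient of the data loss and is a made-up toy value):

import numpy as np

eta_t, lam = 0.1, 0.01
theta = np.array([0.5, -1.0, 2.0])
grad_l = np.array([0.2, -0.3, 0.1])                  # toy gradient of the data loss

theta = (1 - eta_t * lam) * theta - eta_t * grad_l   # shrink the weights, then take a gradient step
print(theta)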


Dropout : a new regularization scheme (Srivastava and Salakhutdinov 2014)

[Figure 1 of the paper : Dropout Neural Net Model. Left : a standard neural net with 2 hidden layers. Right : an example of a thinned net produced by applying dropout to the network on the left; crossed units have been dropped.]


For each training example : randomly turn off the neurons of hidden units (with p = 0.5)

At test time, use each neuron scaled down by p

Dropout serves to separate effects from strongly correlated features and prevents co-adaptation between units.

It can be seen as averaging different models that share parameters.

It acts as a powerful regularization scheme.


Dropout - implementation

The layer should keep :

W (l) : the parameters

f (l) : its activation function

x(l) : its input

a(l) : its pre-activation associated to the input

δ(l) : for the update and the back-propagation to the layer l − 1

m(l) : the dropout mask, to be applied on x(l)

Forward pass

For l = 1 to (L− 1)

Compute y(l) = f (l)(W (l)x(l))

x(l+1) = y(l) = y(l) ◦ m(l)

y(L) = f (L)(W (L)x(L))
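An illustrative NumPy sketch of this masked forward step for one hidden layer (a fresh Bernoulli mask is drawn per training example; the sizes, the tanh activation, and the test-time scaling by the keep probability are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(4)
p = 0.5                                   # dropout probability

x_l = rng.normal(size=5)                  # x(l), input of the layer
W_l = rng.normal(size=(5, 5))             # W(l)

y_l = np.tanh(W_l @ x_l)                  # y(l) = f(l)(W(l) x(l))
m_l = (rng.random(y_l.shape) > p)         # dropout mask m(l), drawn for this example
x_next = y_l * m_l                        # x(l+1) = y(l) ∘ m(l)

x_next_test = y_l * (1 - p)               # at test time : no mask, scale by the keep probability (0.5 here)
print(x_next, x_next_test)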

Vanishing gradient

Experimental observations (MNIST task) - 1

The MNIST database

Comparison of different depths for a feed-forward architecture

[Figure : stacked layers x(1) → W(1) → y(1) = x(2) → ... → W(L) → y(L) : output.]

Hidden layers have a sigmoid activation function.

The output layer is a softmax.


Experimental observations (MNIST task) - 2

Varying the depth

Without hidden layer : ≈ 88% accuracy

1 hidden layer (30) : ≈ 96.5% accuracy

2 hidden layers (30) : ≈ 96.9% accuracy

3 hidden layers (30) : ≈ 96.5% accuracy

4 hidden layers (30) : ≈ 96.5% accuracy

(From http://neuralnetworksanddeeplearning.com/chap5.html)


Intuitive explanation

Let us consider the simplest deep neural network, with just a single neuron in each layer.

wi, bi are resp. the weight and bias of neuron i and C some cost function.

Compute the gradient of C w.r.t the bias b1

∂C/∂b1 = ∂C/∂y4 × ∂y4/∂a4 × ∂a4/∂y3 × ∂y3/∂a3 × ∂a3/∂y2 × ∂y2/∂a2 × ∂a2/∂y1 × ∂y1/∂a1 × ∂a1/∂b1   (3)

       = ∂C/∂y4 × σ'(a4) × w4 × σ'(a3) × w3 × σ'(a2) × w2 × σ'(a1)   (4)


Intuitive explanation - 2

The derivative of the activation function : σ′

[Figure : plot of σ'(x) over [−10, 10]; its maximum is 0.25, reached at x = 0.]

σ'(x) = σ(x)(1 − σ(x))

But the weights are initialized around 0, so each factor σ'(a) × w in (4) is typically well below 1, and the product shrinks quickly as it is propagated back.

The different layers in our deep network are learning at vastly different speeds :

when later layers in the network are learning well,

early layers often get stuck during training, learning almost nothing at all.
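To see the effect numerically, a tiny sketch multiplying factors of the form σ'(a) × w for a few layers, with small random weights and pre-activations (toy values only):

import numpy as np

def sigmoid_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)                 # at most 0.25, reached at a = 0

rng = np.random.default_rng(5)
w = rng.normal(0, 1, size=4)             # weights of the single-neuron chain
a = rng.normal(0, 1, size=4)             # pre-activations (toy values)

factor = np.prod(sigmoid_prime(a) * np.abs(w))
print(factor)                            # typically far below 1: the gradient reaching b1 is tiny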


Solutions

Change the activation function (Rectified Linear Unit or ReLU)

[Figure : the ReLU function, max(0, x), over [−10, 10].]

Avoid the vanishing gradient

Some units can “die”

See (Glorot et al.2011) for more details

Do pre-training when it is possible

See (Hinton et al.2006; Bengio et al.2007) :

when you cannot really escape from the initial (random) point, find a good starting point.

More details

See (Hochreiter et al.2001; Glorot and Bengio2010; LeCun et al.2012)

References
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In B. Schölkopf, J.C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press.

Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT'2010, pages 177–186. Physica-Verlag HD.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 249–256, May.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Geoffrey J. Gordon and David B. Dunson, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 315–323. Journal of Machine Learning Research - Workshop and Conference Proceedings.

Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kremer and Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Yann LeCun, Léon Bottou, Geneviève Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 9–48. Springer Berlin Heidelberg.

Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. 2007. Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1848–1852, October.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research, 15:2949–2980.

