Artificial Neural Networks II
STAT 27725/CMSC 25400: Machine Learning

Shubhendu Trivedi

University of Chicago

November 2015

Things we will look at today

• Regularization in Neural Networks
• Drop Out
• Sequence to Sequence Learning using Recurrent Neural Networks
• Generative Neural Methods


A Short Primer on Regularization: Empirical Risk

Assume that the data are sampled from an unknown distribution p(x, y)

Next we choose the loss function L, and a parametric model family f(x; w)

Ideally, our goal is to minimize the expected loss, called the risk

R(w) = \mathbb{E}_{(x_0, y_0) \sim p(x,y)}[L(f(x_0; w), y_0)]

The true distribution is unknown. So, we instead work with a proxy that is measurable: the empirical loss on the training set

L(w, X, y) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; w), y_i)
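As a concrete illustration of the empirical risk, here is a minimal sketch (the squared-error loss and the linear model f(x; w) = Xw are illustrative assumptions, not from the slides):

```python
import numpy as np

def empirical_risk(w, X, y):
    """Empirical loss (1/N) * sum_i L(f(x_i; w), y_i)
    with squared-error L and a linear model f(x; w) = X @ w."""
    preds = X @ w
    return np.mean((preds - y) ** 2)

# Toy data: N = 100 points in D = 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(empirical_risk(w_true, X, y))       # close to the noise variance
print(empirical_risk(np.zeros(3), X, y))  # much larger
```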


Model Complexity and Overfitting

Consider data drawn from a 3rd order model:


How to avoid overfitting?

If a model overfits (is too sensitive to the data), it will be unstable and will not generalize well.

Intuitively, the complexity of the model can be measured by the number of "degrees of freedom" (independent parameters) (previous example?)

Idea: directly penalize by the number of parameters (this is the Akaike Information Criterion): minimize

\sum_{i=1}^{N} L(f(x_i; w), y_i) + \#\text{params}


Description Length

Intuition: we should penalize not the number of parameters, but the number of bits needed to encode the parameters

With a finite set of parameter values, these are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters.

Then we can have regularized risk minimization:

\sum_{i=1}^{N} L(f(x_i; w), y_i) + \Omega(w)

We can measure "size" in different ways: L1, L2 norms (a small sketch follows below)

Regularization is basically a way to implement Occam’s Razor
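A minimal sketch of the regularized objective above with the two norms just mentioned (squared-error loss and a linear model are illustrative assumptions):

```python
import numpy as np

def regularized_risk(w, X, y, lam=0.1, penalty="l2"):
    """sum_i L(f(x_i; w), y_i) + Omega(w), with Omega a scaled L1 or L2 norm."""
    data_term = np.sum((X @ w - y) ** 2)       # squared-error loss, summed over examples
    if penalty == "l2":
        omega = lam * np.sum(w ** 2)           # lambda * ||w||_2^2
    else:
        omega = lam * np.sum(np.abs(w))        # lambda * ||w||_1
    return data_term + omega
```

Larger values of lam restrict the weights more strongly, i.e. they shrink the effective number of degrees of freedom in the sense described above.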


Regularization in Neural Networks

We have in fact already looked at one method (for vision tasks)

How is this a form of regularization?


Regularization in Neural Networks

Weight decay: penalize ‖W^l‖_2 or ‖W^l‖_1 in every layer

Why is it called weight decay? (see the sketch below)

Parameter sharing (CNNs, RNNs)

Dataset augmentation: ImageNet 2012, discussed last time, was won with significant dataset augmentation
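On the "why is it called weight decay" question: with an L2 penalty (lam/2)‖W‖², each gradient step first shrinks, i.e. decays, the weights by a factor (1 - lr * lam) and then applies the usual data gradient. A minimal sketch (the function and variable names are mine, for illustration):

```python
import numpy as np

def sgd_step_with_weight_decay(W, grad_data, lr=0.01, lam=1e-4):
    """One SGD step on  Loss(W) + (lam / 2) * ||W||_2^2.

    W <- W - lr * (grad_data + lam * W)
      =  (1 - lr * lam) * W - lr * grad_data,
    i.e. the weights are decayed toward zero before the data update.
    """
    return (1.0 - lr * lam) * W - lr * grad_data
```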


Regularization in Neural Networks

Early stopping:


Dropout

A more exotic regularization technique. Introduced in 2012 and one of the factors in the recent neural net successes

Every sample is processed by a decimated neural network

But they all do the same job, and share weights

Dropout: A simple way to prevent neural networks from overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, JMLR 2014


Dropout: Feedforward Operation

Without dropout:

z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f(z_i^{(l+1)})

With dropout:

r_j^{(l)} \sim \mathrm{Bernoulli}(p)

\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} * \mathbf{y}^{(l)}

z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}

y_i^{(l+1)} = f(z_i^{(l+1)})
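A minimal NumPy sketch of this feedforward operation (the layer sizes, the choice f = ReLU, and the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y_prev, W, b, p=0.5, train=True):
    """One dropout layer: retain each unit of the previous layer's output
    with probability p, then apply the affine map and f = ReLU."""
    if train:
        r = rng.binomial(1, p, size=y_prev.shape)   # r_j ~ Bernoulli(p)
        y_tilde = r * y_prev                        # y~ = r * y
    else:
        y_tilde = p * y_prev                        # test time: scale by p instead
    z = W @ y_tilde + b                             # z = W y~ + b
    return np.maximum(0.0, z)                       # y = f(z)

# Toy usage: a 4-unit layer feeding a 3-unit layer.
y_prev = rng.normal(size=4)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
print(dropout_forward(y_prev, W, b, p=0.5, train=True))
print(dropout_forward(y_prev, W, b, p=0.5, train=False))
```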


Dropout: At Test time

Use a single neural net with weights scaled down

By doing this scaling, 2^n networks with shared weights can be combined into a single neural network to be used at test time (see the sketch below)

Extreme form of bagging
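A minimal sketch of the test-time rule (scale the outgoing weights of a layer whose units were retained with probability p during training; the names are illustrative). In practice many implementations use "inverted dropout", dividing by p at training time instead, so the test-time network needs no change:

```python
import numpy as np

def test_time_weights(W_train, p=0.5):
    """If a unit was retained with probability p during training,
    its outgoing weights are multiplied by p at test time, so the
    expected input to the next layer matches training."""
    return p * W_train
```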


Dropout: Performance

These architectures have 2 to 4 hidden layers with 1024 to 2048 hidden units


Dropout: Effect on Sparsity


Dropout for Linear Regression

Objective: \|y - Xw\|_2^2

When the input is dropped out such that any input dimension is retained with probability p, the input can be expressed as R * X, where R \in \{0, 1\}^{N \times D} is a random matrix with R_{ij} \sim \mathrm{Bernoulli}(p)

Marginalizing the noise, the objective becomes:

\min_w \; \mathbb{E}_{R \sim \mathrm{Bernoulli}(p)} \|y - (R * X) w\|_2^2

This is the same as:

\min_w \; \|y - pXw\|_2^2 + p(1 - p)\|\Gamma w\|_2^2, \quad \text{where } \Gamma = (\mathrm{diag}(X^\top X))^{1/2}

Thus, dropout with linear regression is equivalent, in expectation, to ridge regression with a particular form of Γ
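A small numerical sanity check of the claim above (the data here are toy random values; this is a sketch, not from the slides): averaging the dropped-out objective over many samples of R should approach the closed-form ridge expression.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 50, 4, 0.8
X = rng.normal(size=(N, D))
w = rng.normal(size=D)
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Monte Carlo estimate of E_R ||y - (R * X) w||^2 with R_ij ~ Bernoulli(p).
n_samples = 20_000
total = 0.0
for _ in range(n_samples):
    R = rng.binomial(1, p, size=(N, D))
    total += np.sum((y - (R * X) @ w) ** 2)
mc_objective = total / n_samples

# Closed form: ||y - p X w||^2 + p(1 - p) ||Gamma w||^2, Gamma = diag(X^T X)^{1/2}.
gamma_w = np.sqrt(np.diag(X.T @ X)) * w
closed_form = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum(gamma_w ** 2)

print(mc_objective, closed_form)   # the two numbers should be close
```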


Why does this make sense?

Bagging is always good if models are diverse enough

Motivation 1: Ten conspiracies each involving five people is probably a better way to wreak havoc than a conspiracy involving 50 people. If conditions don't change (stationary) and there is plenty of time for rehearsal, a big conspiracy can work well, but otherwise it will "overfit"

Motivation 2: Comes from a theory for the superiority of sexual reproduction in evolution (Livnat, Papadimitriou, PNAS, 2010). It seems plausible that asexual reproduction should be a better way to optimize for individual fitness (in sexual reproduction, if a good combination of genes is found, it is split up again). The criterion for natural selection may not be individual fitness but mixability. Thus the role of sexual reproduction is not just to allow useful new genes to propagate, but also to ensure that complex co-adaptations between genes are broken up


Sequence Learning with Neural Networks


Problems with MLPs for Sequence Tasks

The "API" is too limited. They only accept an input of a fixed dimensionality and map it to an output that is again of a fixed dimensionality

This is great when working (for example) with images, where the output is an encoding of the category

This is bad if we are interested in Machine Translation or Speech Recognition

Traditional neural networks treat every example independently. Imagine the task is to classify events at every fixed point in a movie. A plain vanilla neural network would not be able to use its knowledge about the previous events to help in classifying the current one.

Recurrent Neural Networks address this issue by having loops.


Some Sequence Tasks

Figure credit: Andrej Karpathy


Recurrent Neural Networks

The loops in them allow information to persist

For some input x_i, we pass it through a hidden state A and then output a value h_i. The loop allows information to be passed from one time step to another

An RNN can be thought of as multiple copies of the same network, each of which passes a message to its successor


Recurrent Neural Networks

More generally, an RNN can be thought of as arranging hidden state vectors h_t^l in a 2-D grid, with t = 1, \dots, T being time and l = 1, \dots, L being the depth

h_t^0 = x_t, and h_t^L is used to predict the output vector y_t. All intermediate vectors h_t^l are computed as a function of h_{t-1}^l and h_t^{l-1}

The RNN is a recurrence of the form:

h_t^l = \tanh\, W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}

Illustration credit: Chris Olah
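A minimal NumPy sketch of the recurrence above for a single layer l (the hidden size, the handling of the input layer, and the names are illustrative assumptions; the bias is omitted as in the recurrence above):

```python
import numpy as np

def rnn_step(h_below, h_prev, W):
    """Vanilla RNN update h_t^l = tanh( W^l [h_t^{l-1}; h_{t-1}^l] ).

    h_below : h_t^{l-1}, activity from the layer below at time t
    h_prev  : h_{t-1}^l, this layer's state at the previous time step
    W       : W^l, shape (hidden, len(h_below) + len(h_prev)); shared over time
    """
    return np.tanh(W @ np.concatenate([h_below, h_prev]))

# Unroll one layer over a short input sequence (h_t^0 = x_t).
rng = np.random.default_rng(0)
hidden, d_in, T = 5, 3, 4
W = rng.normal(scale=0.5, size=(hidden, d_in + hidden))
h = np.zeros(hidden)
xs = rng.normal(size=(T, d_in))
for x_t in xs:
    h = rnn_step(x_t, h, W)
print(h)
```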


Recurrent Neural Networks

The chain-like structure enables sequence modeling

W varies between layers but is shared through time

Basically, the inputs from the layer below and from before in time are transformed by a non-linearity after an additive interaction (weak coupling)

The plain vanilla RNN described is in fact Turing complete with the right size and weight matrix

"If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs"


Recurrent Neural Networks

Training RNNs might seem daunting.

In fact, we can simply adopt the backpropagation algorithm after unrolling the RNN

If we have to look at sequences of size s, we unroll each loop into s steps, and treat it as a normal neural network to train using backpropagation

This is called backpropagation through time

But weights are shared across different time steps; how is this constraint enforced?

Train the network as if there were no constraints, obtain the weight updates at the different time steps, and average them (see the sketch below)
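A minimal sketch of backpropagation through time for the single-layer recurrence used earlier, with a squared-error loss on the final hidden state (the loss, data, and names are illustrative assumptions). The point is that every unrolled time step contributes a gradient for the same shared W, and those contributions are accumulated (summed here; averaging them instead only rescales the step):

```python
import numpy as np

def bptt_grad(W, xs, target):
    """Backprop through time for h_t = tanh(W [x_t; h_{t-1}]),
    with loss 0.5 * ||h_T - target||^2 on the final hidden state."""
    hidden = W.shape[0]
    h = np.zeros(hidden)
    cache = []                                  # (input to the step, resulting h)
    for x_t in xs:                              # unroll forward for T steps
        a = np.concatenate([x_t, h])
        h = np.tanh(W @ a)
        cache.append((a, h))

    loss = 0.5 * np.sum((h - target) ** 2)
    dW = np.zeros_like(W)
    dh = h - target                             # dLoss/dh_T
    for a, h_t in reversed(cache):              # walk back through time
        dz = dh * (1.0 - h_t ** 2)              # through tanh
        dW += np.outer(dz, a)                   # this time step's contribution to shared W
        dh = W[:, xs.shape[1]:].T @ dz          # pass the gradient to h_{t-1}
    return loss, dW

rng = np.random.default_rng(0)
T, d_in, hidden = 6, 3, 4
W = rng.normal(scale=0.3, size=(hidden, d_in + hidden))
xs = rng.normal(size=(T, d_in))
target = rng.normal(size=hidden)
loss, dW = bptt_grad(W, xs, target)
print(loss, np.linalg.norm(dW))
```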


Problems

Recurrent neural networks have trouble learning long-term dependencies (Hochreiter and Schmidhuber, 1991 and Bengio et al., 1994)

Consider a language model in which the task is to predict the next word based on the previous ones

Sometimes the context can be clear immediately: "The clouds are in the sky"

Sometimes the dependency is more long term: "We are basically from Transylvania, although I grew up in Spain, but I can still speak fluent Romanian."

In principle, RNNs should be able to learn long-term dependencies with the right parameter choices, but learning those parameters is hard.

The Long Short Term Memory was proposed to solve this problem (Hochreiter and Schmidhuber, 1997)


Long Short Term Memory Networks

Vanilla RNN: error propagation is blocked by a non-linearity

Illustration credit: Chris Olah


Long Short Term Memory

One of the main points about the LSTM is the cell state C_t, which runs across time with only minor linear interactions, so information can travel along it unchanged

The LSTM regulates the cell state by various gates, which give it the ability to remove or add information to the cell state.

Each gate is composed of a sigmoid non-linearity followed by a pointwise multiplication

There are three types of gates in the LSTM (e.g. the forget gate helps the LSTM learn to forget)


Long Short Term Memory

The precise form of the LSTM update is:

\begin{pmatrix} i \\ f \\ o \\ \tilde{C}_t \end{pmatrix} =
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}

c_t^l = f \odot c_{t-1}^l + i \odot \tilde{C}_t, \qquad h_t^l = o \odot \tanh(c_t^l)
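A minimal NumPy sketch of the update above for one layer and one time step (the bias terms are omitted as in the update above; sizes and names are illustrative assumptions):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W):
    """One LSTM step: [i; f; o; C~] from W^l [h_t^{l-1}; h_{t-1}^l],
    then c_t = f * c_{t-1} + i * C~  and  h_t = o * tanh(c_t)."""
    n = h_prev.shape[0]
    pre = W @ np.concatenate([h_below, h_prev])   # shape (4n,)
    i = sigm(pre[0:n])                # input gate
    f = sigm(pre[n:2*n])              # forget gate
    o = sigm(pre[2*n:3*n])            # output gate
    c_tilde = np.tanh(pre[3*n:4*n])   # candidate cell update
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

# Toy usage: hidden size 4, input (layer below) size 3.
rng = np.random.default_rng(0)
n, d = 4, 3
W = rng.normal(scale=0.4, size=(4 * n, d + n))
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, W)
print(h)
```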


Some Applications: Caption Generation

Caption Generation (Karpathy and Li, 2014)


RNN Shakespeare

Using a character-level language model trained on all of Shakespeare.

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine.

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.


Image Generation

(Also uses an attention mechanism - not discussed) DRAW: A Recurrent Neural Network For Image Generation (Gregor et al., 2015)


Applications

Acoustic Modeling

Natural Language Processing, e.g. parsing

Machine Translation (e.g. Google Translate uses RNNs)

Voice Transcription

Video and Image understanding

The list goes on


Generative Neural Models


Recap: Multilayered Neural Networks

Let layer k compute an output vector h^k using the output h^{k-1} of the previous layer.

Note that the input is x = h^0.

h^k = tanh(b^k + W^k h^{k-1})

The top-layer output h^l is used for making a prediction. If the target is given by y, we define a loss L(h^l, y) (convex in b^l + W^l h^{l-1}).

We might have the output layer compute the following non-linearity:

h^l_i = \frac{e^{b^l_i + W^l_i h^{l-1}}}{\sum_j e^{b^l_j + W^l_j h^{l-1}}}

This is called the softmax and can be used as an estimator of P(Y = i | x).
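A minimal NumPy sketch of this forward pass, with tanh hidden layers and a softmax output; the function name and the list-of-matrices parameterization are just conventions for this sketch.

import numpy as np

def forward(x, weights, biases):
    """h^0 = x, h^k = tanh(b^k + W^k h^{k-1}) for hidden layers, softmax on top."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(b + W @ h)
    a = biases[-1] + weights[-1] @ h     # top-layer pre-activation b^l + W^l h^{l-1}
    a = a - a.max()                      # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()                   # h^l_i = e^{a_i} / sum_j e^{a_j}, estimates P(Y = i | x)

# Example: one hidden layer of 5 units, 3 output classes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
biases = [np.zeros(5), np.zeros(3)]
probs = forward(rng.standard_normal(4), weights, biases)   # sums to 1 over the 3 classes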


One loss to consider: L(h^l, y) = −log P(Y = y | x), the negative log-likelihood of the correct class.
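Continuing the sketch above, this loss is just the negative log of the probability the softmax assigns to the correct class; the numbers below are made up for illustration.

import numpy as np

probs = np.array([0.1, 0.7, 0.2])   # softmax output h^l for one input x (toy values)
y = 1                               # correct class label
loss = -np.log(probs[y])            # L(h^l, y) = -log P(Y = y | x), here about 0.357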


The Difficulty of Training Deep Networks

Until 2006, deep architectures were not used extensively in Machine Learning.

Poor training and generalization errors using the standard random initialization (with the exception of convolutional neural networks).

Difficult to propagate gradients to lower layers; too many connections in a deep architecture.

Purely discriminative: no generative model for the raw input features x (connections go upwards).


Initial Breakthrough: Layer-wise Training

Unsupervised pre-training is possible in certain Deep Generative Models (Hinton, 2006).

Idea: greedily train one layer at a time using a simple model (Restricted Boltzmann Machine), as sketched below.

Use the learned parameters to initialize a feedforward neural network, and fine-tune for classification.
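In outline, the greedy procedure looks like the following sketch. Here train_rbm is a hypothetical helper that fits a single RBM to its input and returns its parameters plus a function mapping inputs to hidden activations; the slides do not specify such an interface, so treat it as an assumption.

def greedy_pretrain(x_data, hidden_sizes, train_rbm):
    """Greedy layer-wise pre-training: fit one RBM per layer, bottom-up."""
    weights, biases = [], []
    data = x_data                                  # representation fed to the current layer
    for n_hidden in hidden_sizes:
        W, b, encode = train_rbm(data, n_hidden)   # hypothetical single-RBM trainer
        weights.append(W)
        biases.append(b)
        data = encode(data)                        # propagate the data up one layer
    return weights, biases                         # used to initialize a feedforward net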


Sigmoid Belief Networks, 1992

The generative model is decomposed as:

P(x, h^1, \ldots, h^l) = P(h^l) \left( \prod_{k=1}^{l-1} P(h^k \mid h^{k+1}) \right) P(x \mid h^1)

Marginalizing out the hidden layers yields P(x); this is intractable in practice except for tiny models.
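Since every conditional in a sigmoid belief network is a product of sigmoid Bernoullis, ancestral sampling is straightforward. A small sketch: the parameterization P(h^k_i = 1 | h^{k+1}) = sigm(b^k_i + W^k_i h^{k+1}) is the standard sigmoid-belief-net choice, while the top-level Bernoulli means p_top and the toy sizes are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(0)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_sbn(p_top, W_list, b_list):
    """Ancestral sampling: h^l ~ P(h^l), then each lower layer from P(h^k | h^{k+1}), ending at x."""
    h = (rng.random(p_top.shape) < p_top).astype(float)          # sample the top layer h^l
    for W, b in zip(W_list, b_list):                             # W maps layer k+1 down to layer k
        h = (rng.random(b.shape) < sigm(b + W @ h)).astype(float)
    return h                                                     # the final sample is the visible x

# Toy network with layer sizes 2 -> 3 -> 4 (top to visible)
x = sample_sbn(np.full(2, 0.5),
               [rng.standard_normal((3, 2)), rng.standard_normal((4, 3))],
               [np.zeros(3), np.zeros(4)])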

R. Neal, Connectionist learning of belief networks, 1992

Dayan, P., Hinton, G. E., Neal, R., and Zemel, R. S. The Helmholtz Machine, 1995

L. Saul, T. Jaakkola, and M. Jordan, Mean field theory for sigmoid belief networks, 1996


Deep Belief Networks, 2006

Similar to Sigmoid Belief Networks, except for the top two layers:

P(x, h^1, \ldots, h^l) = P(h^{l-1}, h^l) \left( \prod_{k=1}^{l-2} P(h^k \mid h^{k+1}) \right) P(x \mid h^1)

The joint distribution of the top two layers is a Restricted Boltzmann Machine.


Energy Based Models

Before looking at RBMs, let's look at the basics of energy-based models.

Such models assign a scalar energy to each configuration of the variables of interest. Learning then corresponds to modifying the energy function so that its shape has desirable properties.

P(x) = \frac{e^{-\mathrm{Energy}(x)}}{Z}, \quad \text{where } Z = \sum_x e^{-\mathrm{Energy}(x)}

We only care about the marginal (since only x is observed)
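To make the normalizer concrete: for a small number of binary variables, Z can be computed by brute-force enumeration. This is only to illustrate the definition; the toy energy function below is made up.

import itertools
import numpy as np

def prob_table(energy, n):
    """Exact P(x) = exp(-Energy(x)) / Z over all binary x of length n (only feasible for tiny n)."""
    xs = [np.array(bits) for bits in itertools.product([0, 1], repeat=n)]
    unnorm = np.array([np.exp(-energy(x)) for x in xs])
    Z = unnorm.sum()                                 # partition function: sum_x exp(-Energy(x))
    return {tuple(x): p / Z for x, p in zip(xs, unnorm)}

# Toy energy that favours configurations where neighbouring bits are both on
table = prob_table(lambda x: -float(x[0] * x[1] + x[1] * x[2]), n=3)
assert abs(sum(table.values()) - 1.0) < 1e-12        # probabilities sum to 1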


Energy Based Models

With hidden variables: P(x, h) = \frac{e^{-\mathrm{Energy}(x, h)}}{Z}

We only care about the marginal (since only x is observed):

P(x) = \sum_h \frac{e^{-\mathrm{Energy}(x, h)}}{Z}

We can introduce the notion of free energy:

P(x) = \frac{e^{-\mathrm{FreeEnergy}(x)}}{Z}, \quad \text{with } Z = \sum_x e^{-\mathrm{FreeEnergy}(x)}

where \mathrm{FreeEnergy}(x) = -\log \sum_h e^{-\mathrm{Energy}(x, h)}

The data log-likelihood gradient has an interesting form (details skipped).
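For the particular case of binary hidden units with Energy(x, h) = -b·x - c·h - h·Wx (this specific form is not written out on these slides, so treat it as an assumption), the sum over h factorizes and the free energy has a closed form. A sketch:

import numpy as np

def free_energy(x, W, b, c):
    """FreeEnergy(x) = -log sum_h exp(-Energy(x, h)) for binary h.

    Assuming Energy(x, h) = -b.x - c.h - h.Wx, each h_j contributes a factor
    (1 + exp(c_j + (Wx)_j)), so the log-sum reduces to a sum of softplus terms.
    b are visible biases, c hidden biases, W the weight matrix.
    """
    return -b @ x - np.sum(np.log1p(np.exp(c + W @ x)))

# Toy check against brute-force enumeration over h in {0,1}^2
rng = np.random.default_rng(0)
W, b, c = rng.standard_normal((2, 3)), rng.standard_normal(3), rng.standard_normal(2)
x = np.array([1.0, 0.0, 1.0])
hs = [np.array(h, dtype=float) for h in [(0, 0), (0, 1), (1, 0), (1, 1)]]
brute = -np.log(sum(np.exp(b @ x + c @ h + h @ W @ x) for h in hs))
assert np.isclose(free_energy(x, W, b, c), brute)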


Restricted Boltzmann Machines

x_1 → h_1 ∼ P(h | x_1) → x_2 ∼ P(x | h_1) → h_2 ∼ P(h | x_2) → …
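The chain above is alternating (block) Gibbs sampling: in an RBM all hidden units are conditionally independent given x and vice versa. A sketch for binary units, assuming the standard RBM conditionals P(h_j = 1 | x) = sigm(c_j + (Wx)_j) and P(x_i = 1 | h) = sigm(b_i + (W^T h)_i); the toy sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(x1, W, b, c, steps=3):
    """Run x_1 -> h_1 -> x_2 -> h_2 -> ... for `steps` rounds; returns the final (x, h)."""
    x = x1
    for _ in range(steps):
        h = (rng.random(c.shape) < sigm(c + W @ x)).astype(float)    # h_k ~ P(h | x_k)
        x = (rng.random(b.shape) < sigm(b + W.T @ h)).astype(float)  # x_{k+1} ~ P(x | h_k)
    return x, h

# Toy RBM with 4 visible and 2 hidden units
W = rng.standard_normal((2, 4))
x_new, h_new = gibbs_chain(np.array([1.0, 0.0, 1.0, 0.0]), W, np.zeros(4), np.zeros(2))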


Back to Deep Belief Networks

Everything is completely unsupervised up to this point. We can treat the learned weights as an initialization, treat the network as a feedforward network, and fine-tune using backpropagation.


Deep Belief Networks

G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006

G. E. Hinton, S. Osindero, Y. W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 2006


Deep Belief Networks: Object Parts

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations


Effect of Unsupervised Pre-training


Why does Unsupervised Pre-training work?

Regularization: feature representations that are good for P(x) are also good for P(y|x).

Optimization: unsupervised pre-training leads to better regions of the parameter space than random initialization.


Autoencoders

Main idea

Sparse Autoencoders

Denoising Autoencoders

Pretraining using Autoencoders
