Artificial Neural Networks II
STAT 27725/CMSC 25400
Shubhendu Trivedi
University of Chicago
November 2015
Things we will look at today
• Regularization in Neural Networks
• Dropout
• Sequence to Sequence Learning using Recurrent Neural Networks
• Generative Neural Methods
A Short Primer on Regularization: Empirical Risk
Assume that the data are sampled from an unknown distribution p(x, y)
Next we choose the loss function L and a parametric model family f(x; w)
Ideally, our goal is to minimize the expected loss, called the risk:
R(w) = E_{(x_0, y_0) ∼ p(x, y)}[L(f(x_0; w), y_0)]
The true distribution is unknown, so we instead work with a measurable proxy: the empirical loss on the training set
L(w, X, y) = (1/N) \sum_{i=1}^{N} L(f(x_i; w), y_i)
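As a small worked illustration of the empirical-risk proxy (not from the slides; the linear model, squared loss, and synthetic data below are assumptions made purely for the example):

import numpy as np

def empirical_risk(w, X, y, f, loss):
    # Average of the per-example losses: the measurable proxy for the true risk
    return np.mean([loss(f(x, w), t) for x, t in zip(X, y)])

# Hypothetical setup: linear model with squared loss on synthetic data
f = lambda x, w: x @ w
squared_loss = lambda pred, target: (pred - target) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(empirical_risk(w_true, X, y, f, squared_loss))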
Model Complexity and Overfitting
Consider data drawn from a 3rd order model:
How to avoid overfitting?
If a model overfits (is too sensitive to the data), it will be unstable and will not generalize well.
Intuitively, the complexity of the model can be measured by the number of "degrees of freedom" (independent parameters) (previous example?)
Idea: directly penalize the number of parameters (the Akaike Information Criterion): minimize
\sum_{i=1}^{N} L(f(x_i; w), y_i) + #params
Description Length
Intuition: we should not penalize the number of parameters, but the number of bits needed to encode the parameters
With a finite set of parameter values, these are equivalent. With an infinite set, we can limit the effective number of degrees of freedom by restricting the values of the parameters.
Then we can have regularized risk minimization:
\sum_{i=1}^{N} L(f(x_i; w), y_i) + Ω(w)
We can measure "size" in different ways: L1, L2 norms
Regularization is basically a way to implement Occam's Razor
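A minimal sketch of the regularized objective for a linear model with squared loss, with Ω(w) taken as an L1 or L2 norm weighted by a hypothetical coefficient lam (the slides do not fix a model or a weighting, so these are assumptions for illustration):

import numpy as np

def regularized_risk(w, X, y, lam=0.1, penalty="l2"):
    # Empirical risk (squared loss, linear model) plus a norm penalty Omega(w)
    data_term = np.mean((X @ w - y) ** 2)
    omega = np.sum(w ** 2) if penalty == "l2" else np.sum(np.abs(w))
    return data_term + lam * omega

print(regularized_risk(np.array([0.5, -1.0]), np.eye(2), np.array([1.0, 0.0]), lam=0.1, penalty="l1"))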
Regularization in Neural Networks
We have in fact already looked at one method (for vision tasks)
How is this a form of regularization?
Regularization in Neural Networks
Weight decay: penalize ‖W^l‖_2 or ‖W^l‖_1 in every layer
Why is it called weight decay?
Parameter sharing (CNNs, RNNs)
Dataset augmentation: the ImageNet 2012 competition discussed last time was won with the help of significant dataset augmentation
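Why the name "weight decay": with an L2 penalty (λ/2)‖w‖^2 added to the loss, a gradient step first shrinks the weights by a factor slightly below one and then applies the data gradient. A minimal sketch (the learning rate and λ values are arbitrary examples):

import numpy as np

def sgd_step_weight_decay(w, grad_loss, lr=0.01, lam=1e-4):
    # Objective: L(w) + (lam / 2) * ||w||^2, gradient: grad_loss + lam * w
    # Update: w <- (1 - lr * lam) * w - lr * grad_loss
    # The factor (1 - lr * lam) < 1 shrinks ("decays") the weights at every step.
    return (1.0 - lr * lam) * w - lr * grad_loss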
Regularization in Neural Networks
Early stopping:
Dropout
A more exotic regularization technique, introduced in 2012 and one of the factors in the recent neural net successes
Every sample is processed by a decimated neural network
But they all do the same job, and share weights
Dropout: A simple way to prevent neural networks from overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, JMLR 2014
Dropout: Feedforward Operation
Without dropout: z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, and y_i^{(l+1)} = f(z_i^{(l+1)})
With dropout:
r_j^{(l)} ∼ Bernoulli(p)
\tilde{y}^{(l)} = r^{(l)} ∗ y^{(l)}
z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f(z_i^{(l+1)})
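A sketch of the training-time feedforward operation above for a single layer (numpy; taking f to be tanh is an assumption made for the example):

import numpy as np

def dropout_layer_forward(y_prev, W, b, p=0.5, rng=np.random.default_rng(0)):
    r = (rng.random(y_prev.shape) < p).astype(y_prev.dtype)   # r_j ~ Bernoulli(p)
    y_tilde = r * y_prev                                       # thinned activations
    z = W @ y_tilde + b                                        # pre-activation z^{(l+1)}
    return np.tanh(z)                                          # y^{(l+1)} = f(z^{(l+1)})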
Dropout: At Test Time
Use a single neural net with the weights scaled down
By doing this scaling, 2^n networks with shared weights can be combined into a single neural network to be used at test time
An extreme form of bagging
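A matching sketch of the test-time behaviour: no units are dropped, and the outgoing weights are scaled by p so that the expected input to each unit matches what it saw during training (same hypothetical tanh layer as above):

import numpy as np

def dropout_layer_test(y_prev, W, b, p=0.5):
    # No sampling at test time: use the full layer with weights scaled by p
    return np.tanh((p * W) @ y_prev + b)

A common alternative, often called inverted dropout, divides the mask by p at training time instead, so that no scaling is needed at test time.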
Dropout: Performance
These architectures have 2 to 4 hidden layers with 1024 to 2048 hidden units
Dropout: Performance
Dropout: A simple way to prevent neural networks from overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, JMLR 2014
Dropout: Effect on Sparsity
Dropout: A simple way to prevent neural networks from overfitting, N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, JMLR 2014
Dropout for Linear Regression
Objective: ‖y − Xw‖_2^2
When the input is dropped out such that each input dimension is retained with probability p, the input can be expressed as R ∗ X, where R ∈ {0, 1}^{N×D} is a random matrix with R_{ij} ∼ Bernoulli(p) and ∗ denotes elementwise multiplication
Marginalizing out the noise, the objective becomes:
min_w E_{R ∼ Bernoulli(p)} ‖y − (R ∗ X)w‖_2^2
This is the same as:
min_w ‖y − pXw‖_2^2 + p(1 − p)‖Γw‖_2^2, where Γ = (diag(X^T X))^{1/2}
Thus, dropout applied to linear regression is equivalent, in expectation, to ridge regression with a particular form of Γ
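A quick numerical sanity check of this equivalence (a sketch with synthetic data; nothing below comes from the paper beyond the two formulas on the slide):

import numpy as np

rng = np.random.default_rng(0)
N, D, p = 200, 5, 0.8
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Monte Carlo estimate of E_R || y - (R * X) w ||^2 with R_ij ~ Bernoulli(p)
mc = np.mean([np.sum((y - ((rng.random((N, D)) < p) * X) @ w) ** 2)
              for _ in range(10000)])

# Closed form: || y - p X w ||^2 + p (1 - p) || Gamma w ||^2, Gamma = diag(X^T X)^{1/2}
closed = (np.sum((y - p * X @ w) ** 2)
          + p * (1 - p) * np.sum(np.diag(X.T @ X) * w ** 2))

print(mc, closed)   # the two values should agree up to Monte Carlo error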
Why does this make sense?
Bagging is always good if the models are diverse enough
Motivation 1: ten conspiracies each involving five people are probably a better way to wreak havoc than one conspiracy involving 50 people. If conditions don't change (stationarity) and there is plenty of time for rehearsal, a big conspiracy can work well, but otherwise it will "overfit"
Motivation 2: comes from a theory for the superiority of sexual reproduction in evolution (Livnat, Papadimitriou, PNAS, 2010). It seems plausible that asexual reproduction should be a better way to optimize individual fitness (in sexual reproduction, even if a good combination of genes is found, it is split up again). The criterion for natural selection may not be individual fitness but mixability. Thus the role of sexual reproduction is not just to allow useful new genes to propagate, but also to ensure that complex co-adaptations between genes are broken up
Sequence Learning with Neural Networks
Problems with MLPs for Sequence Tasks
The "API" is too limited: they only accept an input of a fixed dimensionality and map it to an output that is again of a fixed dimensionality
This is great when working (for example) with images, where the output is an encoding of the category
This is bad if we are interested in Machine Translation or Speech Recognition
Traditional neural networks treat every example independently. Imagine the task is to classify the events at every point in a movie: a plain vanilla neural network would not be able to use its knowledge about previous events to help in classifying the current one.
Recurrent Neural Networks address this issue by having loops.
Some Sequence Tasks
Figure credit: Andrej Karpathy
Recurrent Neural Networks
The loops in them allow information to persist
For some input x_i, we pass it through a hidden state A and then output a value h_i. The loop allows information to be passed from one time step to another
An RNN can be thought of as multiple copies of the same network, each of which passes a message to its successor
Recurrent Neural Networks
More generally, an RNN can be thought of as arranging hidden state vectors h_t^l in a 2-D grid, with t = 1, ..., T being time and l = 1, ..., L being the depth
h_t^0 = x_t, and h_t^L is used to predict the output vector y_t. All intermediate vectors h_t^l are computed as a function of h_{t-1}^l and h_t^{l-1}
The RNN is a recurrence of the form:
h_t^l = \tanh W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}
Illustration credit: Chris Olah
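A minimal sketch of this recurrence for a single cell of the grid (numpy; biases are omitted and the dimensions are arbitrary examples):

import numpy as np

def rnn_step(W, h_below, h_prev):
    # h_t^l = tanh( W^l [ h_t^{l-1} ; h_{t-1}^l ] ), W of shape (H, dim(h_below) + H)
    return np.tanh(W @ np.concatenate([h_below, h_prev]))

# Example usage with arbitrary sizes
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 3 + 4))
h = rnn_step(W, h_below=np.ones(3), h_prev=np.zeros(4))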
Recurrent Neural Networks
The chain-like structure enables sequence modeling
W varies between layers but is shared through time
Basically, the inputs from the layer below and from the previous time step are transformed by a non-linearity after an additive interaction (weak coupling)
The plain vanilla RNN described here is in fact Turing complete, given the right size and weight matrix
"If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs"
Recurrent Neural Networks
Training RNNs might seem daunting.
In fact, we can simply adopt the backpropagation algorithm after unrolling the RNN
If we have to look at sequences of length s, we unroll each loop into s steps and treat the result as a normal neural network to train using backpropagation
This is called backpropagation through time
But weights are shared across different time steps. How is this constraint enforced?
Train the network as if there were no constraints, obtain the weights at the different time steps, and average them
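A sketch of backpropagation through time for a one-layer vanilla RNN with manually derived gradients (the tanh units, zero initial state, and per-step squared loss are assumptions made to keep the example small). Note how the contributions from every time step are summed into the same shared weight matrices:

import numpy as np

def bptt(W_xh, W_hh, xs, targets):
    # Vanilla RNN: h_{t+1} = tanh(W_xh x_t + W_hh h_t), h_0 = 0
    # Toy loss: 0.5 * sum_t || h_{t+1} - targets[t] ||^2
    T, H = len(xs), W_hh.shape[0]
    hs = [np.zeros(H)]
    for t in range(T):                               # forward: unroll the loop over time
        hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[t]))
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    dh_next = np.zeros(H)
    for t in reversed(range(T)):                     # backward through the unrolled graph
        dh = (hs[t + 1] - targets[t]) + dh_next      # loss gradient + gradient from the future
        dz = dh * (1.0 - hs[t + 1] ** 2)             # back through tanh
        dW_xh += np.outer(dz, xs[t])                 # shared weights: contributions from
        dW_hh += np.outer(dz, hs[t])                 # every time step are accumulated
        dh_next = W_hh.T @ dz                        # send the gradient one step back in time
    return dW_xh, dW_hh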
Problems
Recurrent Neural Networks have trouble learning long-term dependencies (Hochreiter and Schmidhuber, 1991, and Bengio et al., 1994)
Consider a language model in which the task is to predict the next word based on the previous ones
Sometimes the context makes the prediction clear immediately: "The clouds are in the sky"
Sometimes the dependency is more long-term: "We are basically from Transylvania, although I grew up in Spain, but I can still speak fluent Romanian."
In principle, RNNs should be able to learn long-term dependencies with the right parameter choices, but learning those parameters is hard.
Long Short-Term Memory (LSTM) was proposed to solve this problem (Hochreiter and Schmidhuber, 1997)
Long Short Term Memory Networks
Vanilla RNN: error propagation is blocked by a non-linearity
Illustration credit: Chris Olah
Long Short Term Memory
One of the main ideas in the LSTM is the cell state C_t, which runs across time and along which information can travel unchanged, with only minor linear interactions
The LSTM regulates the cell state through various gates, which give it the ability to remove or add information to the cell state.
Each gate is composed of a sigmoid non-linearity followed by a pointwise multiplication
There are three types of gates in the LSTM (e.g. the forget gate helps the LSTM learn to forget)
Long Short Term Memory
The precise form of the LSTM update is:
\begin{pmatrix} i \\ f \\ o \\ \tilde{C}_t \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}
c_t^l = f ⊙ c_{t-1}^l + i ⊙ \tilde{C}_t, and h_t^l = o ⊙ \tanh(c_t^l)
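A sketch of this update for one LSTM cell (numpy; the four gate blocks are lumped into a single matrix W as in the equation above, and biases are omitted):

import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, h_below, h_prev, c_prev):
    # W stacks the four gate blocks, shape (4H, dim(h_below) + H)
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_below, h_prev])
    i = sigm(z[0:H])                    # input gate
    f = sigm(z[H:2 * H])                # forget gate
    o = sigm(z[2 * H:3 * H])            # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])   # candidate cell state C~_t
    c = f * c_prev + i * c_tilde        # cell state: near-linear path through time
    h = o * np.tanh(c)                  # hidden state h_t^l
    return h, c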
Some Applications: Caption Generation
Caption Generation (Karpathy and Li, 2014)
RNN Shakespeare
Using a character-level language model trained on all of Shakespeare.

VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine.

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.
Image Generation
(Also uses an attention mechanism, not discussed.) DRAW: A Recurrent Neural Network For Image Generation (Gregor et al., 2015)
Applications
Acoustic modeling
Natural Language Processing, e.g. parsing
Machine Translation (e.g. Google Translate uses RNNs)
Voice transcription
Video and image understanding
The list goes on
Generative Neural Models
Recap: Multilayered Neural Networks
Let layer k compute an output vector h^k using the output h^{k-1} of the previous layer.
Note that the input is x = h^0
h^k = \tanh(b^k + W^k h^{k-1})
The top layer output h^l is used for making a prediction. If the target is given by y, then we define a loss L(h^l, y) (convex in b^l + W^l h^{l-1})
We might have the output layer return the following non-linearity:
h_i^l = \frac{e^{b_i^l + W_i^l h^{l-1}}}{\sum_j e^{b_j^l + W_j^l h^{l-1}}}
This is called the softmax and can be used as an estimator of P(Y = i | x)
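A sketch of the softmax output layer (numpy; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slide specifies):

import numpy as np

def softmax_layer(h_prev, W, b):
    # h_i^l = exp(b_i^l + W_i^l h^{l-1}) / sum_j exp(b_j^l + W_j^l h^{l-1})
    z = b + W @ h_prev
    z = z - np.max(z)       # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / np.sum(e)    # entries are positive and sum to 1: an estimate of P(Y = i | x)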
Recap: Multilayered Neural Networks
One loss to be considered: L(h^l, y) = −\log P(Y = y | x)
The Difficulty of Training Deep Networks
Until 2006, deep architectures were not used extensively in Machine Learning
Poor training and generalization errors using the standard random initialization (with the exception of convolutional neural networks)
It is difficult to propagate gradients to the lower layers: too many connections in a deep architecture
Purely discriminative: no generative model for the raw input features x (connections go upwards)
Initial Breakthrough: Layer-wise Training
Unsupervised pre-training is possible in certain deep generative models (Hinton, 2006)
Idea: greedily train one layer at a time using a simple model (a Restricted Boltzmann Machine)
Use the parameters learned to initialize a feedforward neural network, and fine-tune for classification
Sigmoid Belief Networks, 1992
The generative model is decomposed as:
P(x, h^1, ..., h^l) = P(h^l) \left( \prod_{k=1}^{l-1} P(h^k | h^{k+1}) \right) P(x | h^1)
Marginalization yields P(x). Intractable in practice except for tiny models
R. Neal, Connectionist learning of belief networks, 1992
P. Dayan, G. E. Hinton, R. Neal, and R. S. Zemel, The Helmholtz Machine, 1995
L. Saul, T. Jaakkola, and M. Jordan, Mean field theory for sigmoid belief networks, 1996
Deep Belief Networks, 2006
Similar to Sigmoid Belief Networks, except for the top two layers:
P(x, h^1, ..., h^l) = P(h^{l-1}, h^l) \left( \prod_{k=1}^{l-2} P(h^k | h^{k+1}) \right) P(x | h^1)
The joint distribution of the top two layers is a Restricted Boltzmann Machine
Energy Based Models
Before looking at RBMs, let's look at the basics of energy-based models
Such models assign a scalar energy to each configuration of the variables of interest. Learning then corresponds to modifying the energy function so that its shape has desirable properties
P(x) = e^{-Energy(x)} / Z, where Z = \sum_x e^{-Energy(x)}
Energy Based Models
With hidden variables: P(x, h) = e^{-Energy(x, h)} / Z
We only care about the marginal (since only x is observed):
P(x) = \sum_h e^{-Energy(x, h)} / Z
We can introduce the notion of free energy:
P(x) = e^{-FreeEnergy(x)} / Z, with Z = \sum_x e^{-FreeEnergy(x)}
where FreeEnergy(x) = −\log \sum_h e^{-Energy(x, h)}
The data log-likelihood gradient has an interesting form (details skipped)
Restricted Boltzmann Machines
x_1 → h_1 ∼ P(h | x_1) → x_2 ∼ P(x | h_1) → h_2 ∼ P(h | x_2) → ...
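A minimal sketch of this alternating Gibbs chain for a binary (Bernoulli-Bernoulli) RBM. The energy function Energy(x, h) = −b·x − c·h − h·W x used below is the standard choice but is an assumption here, since the slide leaves the details to the figure; with it, both conditionals factorize into sigmoids:

import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_chain(x1, W, b, c, steps=3, rng=np.random.default_rng(0)):
    # Block Gibbs sampling x1 -> h1 -> x2 -> h2 -> ... with W of shape (n_hid, n_vis)
    x = x1
    for _ in range(steps):
        p_h = sigm(c + W @ x)                              # P(h_j = 1 | x)
        h = (rng.random(p_h.shape) < p_h).astype(float)    # h ~ P(h | x)
        p_x = sigm(b + W.T @ h)                            # P(x_i = 1 | h)
        x = (rng.random(p_x.shape) < p_x).astype(float)    # x ~ P(x | h)
    return x, h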
Back to Deep Belief Networks
Everything is completely unsupervised so far. We can treat the weights learned as an initialization, treat the network as a feedforward network, and fine-tune using backpropagation
Deep Belief Networks
G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006
G. E. Hinton, S. Osindero, Y. W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 2006
Deep Belief Networks: Object Parts
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng
Effect of Unsupervised Pre-training
Why does Unsupervised Pre-training work?
Regularization: feature representations that are good for P(x) are good for P(y|x)
Optimization: unsupervised pre-training leads to better regions of the parameter space, i.e. better than random initialization
Autoencoders
Main idea
Sparse Autoencoders
Denoising Autoencoders
Pretraining using Autoencoders