Learning Lateral Connections between Hidden Units
Geoffrey Hinton, University of Toronto
in collaboration with
Kejie Bao, University of Toronto
Overview of the talk
• Causal Model: Learns to represent images using multiple, simultaneous, hidden, binary causes.
  – Introduce the variational approximation trick
• Boltzmann Machines: Learning to model the probabilities of binary vectors.
  – Introduce the brief Monte Carlo trick
• Hybrid model: Use a Boltzmann machine to model the prior distribution over configurations of binary causes.
  – Uses both tricks
• Causal hierarchies of MRFs: Generalize the hybrid model to many hidden layers.
  – The causal connections act as insulators that keep the local partition functions separate.
Bayes Nets: Hierarchies of causes
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.
[Diagram: hidden causes at the top generating visible effects at the bottom]
A simple set of images
[Figure: two of the training images; the probabilities of turning on the binary hidden units; reconstructions of the images]
The generative model
To generate a datavector:
• first generate a code from the prior distribution
• then generate an ideal datavector from the code
• then add Gaussian noise.
$$p(d) = \sum_{c\,\in\,\text{codes}} p(c)\, p(d \mid c)$$

$$p(d \mid c) \propto \exp\!\Big(-\frac{1}{2\sigma^2} \sum_i \big(d_i - \hat{d}_i(c)\big)^2\Big)$$

$$\hat{d}_i(c) = b_i + \sum_j s_j^c\, w_{ji}$$

where:
$\hat{d}_i(c)$ — the value that code c predicts for the i'th component of the data vector
$w_{ji}$ — the weight from hidden unit j to pixel i
$s_j^c$ — the binary state of hidden unit j in code vector c
$b_i$ — the bias
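The three generative steps can be sketched in a few lines of NumPy. This is an illustrative toy, not the talk's actual setup: a factorial prior with fixed on-probabilities stands in for the prior (which the hybrid model later replaces with a Boltzmann machine), and all sizes and values here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, n_pixels = 4, 6
W = rng.normal(0.0, 0.5, size=(n_units, n_pixels))  # w_ji: generative weight, unit j -> pixel i
b = np.zeros(n_pixels)                              # pixel biases b_i
prior_on = np.full(n_units, 0.3)                    # assumed per-unit prior on-probabilities
sigma = 0.1                                         # std of the Gaussian pixel noise

s = (rng.random(n_units) < prior_on).astype(float)  # 1) binary code from the prior
d_hat = b + s @ W                                   # 2) ideal datavector: d̂_i = b_i + sum_j s_j w_ji
d = d_hat + rng.normal(0.0, sigma, size=n_pixels)   # 3) add Gaussian noise
```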
Learning the model
• For each image in the training set we ought to consider all possible codes. This is exponentially expensive.
$$\log p(d) = \log \sum_{c\,\in\,\text{codes}} p(c)\, p(d \mid c)$$

$$\frac{\partial \log p(d)}{\partial w_{ji}} = \sum_c \underbrace{p(c \mid d)}_{\text{posterior probability of code } c}\;\frac{1}{\sigma^2}\; s_j^c \underbrace{\big(d_i - \hat{d}_i(c)\big)}_{\text{prediction error of code } c}$$

where $p(c)$ is the prior probability of code $c$.

[Diagram: hidden unit j connected to pixel i of data d by generative weight $w_{ji}$; $\hat{d}_i$ is the top-down prediction]
How to beat the exponential explosion of possible codes
• Instead of considering each code separately, we could use an approximation to the true posterior distribution. This makes it tractable to consider all the codes at once.
• Instead of computing a separate prediction error for each binary code, we compute the expected squared error given the approximate posterior distribution over codes.
  – Then we just change the weights to minimize this expected squared error.
A factorial approximation
• For a given datavector, assume that each code unit has a probability of being on, but that the code units are conditionally independent of each other.
$$Q(c \mid d) = \prod_j q_j^{\,s_j^c}\,(1 - q_j)^{\,(1 - s_j^c)}$$

$q_j^{\,s_j^c}$: use this term if code unit j is on in code vector c
$(1 - q_j)^{\,(1 - s_j^c)}$: otherwise use this term
$\prod_j$: product over all code units
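A quick check of this factorial form (with toy q values of my own choosing): the probabilities it assigns to all $2^n$ binary codes sum to one.

```python
import numpy as np
from itertools import product

q = np.array([0.9, 0.2, 0.5])  # assumed on-probabilities q_j for three code units

def Q(s, q):
    """Q(c|d) = prod_j q_j^{s_j} * (1 - q_j)^{1 - s_j} for a binary code s."""
    s = np.asarray(s, dtype=float)
    return float(np.prod(q ** s * (1.0 - q) ** (1.0 - s)))

total = sum(Q(s, q) for s in product([0, 1], repeat=len(q)))
print(total)  # sums to 1.0 (up to float rounding) over all 2^3 codes
```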
The expected squared prediction error
$$\big\langle (d_i - \hat{d}_i)^2 \big\rangle = \Big(d_i - b_i - \sum_j q_j w_{ji}\Big)^2 + \sum_j q_j (1 - q_j)\, w_{ji}^2$$

first term: squared error of the expected prediction
second term: additional squared error caused by the variance in the prediction
The variance term prevents it from cheating by using the precise real-valued q values to make precise predictions.
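The identity above is easy to verify numerically: under the factorial distribution the expected squared error decomposes into (bias)² + variance. A brute-force enumeration over all codes (toy sizes, invented values) matches the two-term formula exactly.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_units, n_pixels = 3, 4
W = rng.normal(size=(n_units, n_pixels))   # generative weights w_ji
b = rng.normal(size=n_pixels)              # pixel biases
q = np.array([0.7, 0.3, 0.5])              # approximate posterior on-probabilities
d = rng.normal(size=n_pixels)              # an arbitrary datavector

# analytic: (d_i - b_i - sum_j q_j w_ji)^2 + sum_j q_j (1 - q_j) w_ji^2
analytic = (d - b - q @ W) ** 2 + (q * (1.0 - q)) @ (W ** 2)

# brute force: enumerate all 2^3 codes weighted by the factorial distribution
brute = np.zeros(n_pixels)
for s in product([0, 1], repeat=n_units):
    s = np.array(s, dtype=float)
    p = np.prod(q ** s * (1.0 - q) ** (1.0 - s))
    brute += p * (d - (b + s @ W)) ** 2

print(np.max(np.abs(analytic - brute)))  # agreement up to float rounding
```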
Approximate inference
• We use an approximation to the posterior distribution over hidden configurations.
  – Assume the posterior factorizes into a product of distributions $(q_j,\; 1 - q_j)$, one for each hidden cause.
• If we use the approximation for learning, there is no guarantee that learning will increase the probability that the model would generate the observed data.
• But maybe we can find a different and sensible objective function that is guaranteed to improve at each update.
A trade-off between how well the model fits the data and the tractability of inference
This makes it feasible to fit models that are so complicated that we cannot figure out how the model would generate the data, even if we know the parameters of the model.
$$G(d) = \underbrace{\log p(d \mid \theta)}_{\text{how well the model fits the data}} \;-\; \underbrace{KL\big(\,Q(h \mid d)\,\|\,P(h \mid d, \theta)\,\big)}_{\text{the inaccuracy of inference}}$$

where $\theta$ are the parameters, $d$ the data, $Q$ the approximating posterior distribution, $P$ the true posterior distribution, and $G$ the new objective function.
Where does the approximate posterior come from?
• We have a tractable cost function expressed in terms of the approximating probabilities, q.
• So we can use the gradient of the cost function w.r.t. the q values to train a “recognition network” to produce good q values.
• Assume that the prior over codes also factors, so it can be represented by generative biases.

[Diagram: a recognition network maps the data to the q values]
Two types of density model
Stochastic generative model using directed acyclic graph (e.g. Bayes Net):
• Generation from model is easy
• Inference can be hard
• Learning is easy after inference

$$p(d) = \sum_c p(c)\, p(d \mid c)$$

Energy-based models that associate an energy with each data vector:
• Generation from model is hard
• Inference can be easy
• Is learning hard?

$$p(d) = \frac{e^{-E(d)}}{\sum_r e^{-E(r)}}$$
A simple energy-based model
• Connect a set of binary stochastic units together using symmetric connections. Define the energy of a binary configuration, alpha, to be
• The energy of a binary vector determines its probability via the Boltzmann distribution.
$$E_\alpha = -\sum_{i<j} s_i^\alpha\, s_j^\alpha\, w_{ij}$$
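For a handful of units the Boltzmann distribution can be computed exactly by enumerating all binary configurations (the weights below are invented toy values):

```python
import numpy as np
from itertools import product

# symmetric lateral weights for 3 units; diagonal is zero
W = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  0.3],
              [-0.5, 0.3,  0.0]])

def energy(s, W):
    """E_alpha = -sum_{i<j} s_i s_j w_ij (the 0.5 corrects for double counting)."""
    s = np.asarray(s, dtype=float)
    return -0.5 * float(s @ W @ s)

states = list(product([0, 1], repeat=3))
unnorm = np.array([np.exp(-energy(s, W)) for s in states])
probs = unnorm / unnorm.sum()   # Boltzmann distribution over the 8 configurations
```

Configurations with lower energy get higher probability: here the state (1,1,0) (energy −1.0, using the strong positive weight) beats (1,0,1) (energy +0.5).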
Maximum likelihood learning is hard in energy-based models
• To get high probability for d we need low energy for d and high energy for its main rivals, r
We need to find the serious rivals to d and raise their energy. This seems hard.
$$p(d) = \frac{e^{-E(d)}}{\sum_r e^{-E(r)}}$$
It is easy to lower the energy of d
Markov chain monte carlo
• It is easy to set up a Markov chain so that it finds the rivals to the data with just the right probability
$$\frac{\partial \log p(d)}{\partial \theta} = -\frac{\partial E(d)}{\partial \theta} + \sum_r \underbrace{p(r)}_{\text{sample rivals with this probability?}} \frac{\partial E(r)}{\partial \theta}$$
A picture of the learning rule for a fully visible Boltzmann machine
Start with a training vector. Then pick units at random and update their states stochastically using the rule:

$$p(s_j = 1) = \frac{1}{1 + e^{-x_j}}, \qquad x_j = b_j + \sum_k s_k\, w_{kj}$$

[Diagram: the chain of states at t = 0, 1, 2, …, ∞, showing $\langle s_i s_j \rangle^0$ at t = 0 and $\langle s_i s_j \rangle^\infty$ at t = ∞; the state at t = ∞ is a fantasy]

The maximum likelihood learning rule is then

$$\Delta w_{ij} \propto \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^\infty$$
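The stochastic update rule is just a logistic unit driven by the states of the others. A minimal Gibbs-sampling sketch (toy network, invented weights); the learning rule would then compare $\langle s_i s_j \rangle$ with the data clamped against the same statistic measured in fantasies like the one produced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(s, W, b, rng):
    """One sweep: update each unit with p(s_j=1) = sigmoid(b_j + sum_k s_k w_kj)."""
    s = s.copy()
    for j in rng.permutation(len(s)):
        s[j] = float(rng.random() < sigmoid(b[j] + s @ W[:, j]))
    return s

n = 4
W = rng.normal(0.0, 0.2, size=(n, n))
W = (W + W.T) / 2.0                      # symmetric connections
np.fill_diagonal(W, 0.0)
b = np.zeros(n)

s = (rng.random(n) < 0.5).astype(float)  # arbitrary starting state
for _ in range(100):                     # run toward equilibrium: s becomes a fantasy
    s = gibbs_step(s, W, b, rng)
```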
A surprising shortcut
• Instead of taking the negative samples from the equilibrium distribution, use slight corruptions of the datavectors. Only run the Markov chain for a few steps.
  – Much less variance because a datavector and its confabulation form a matched pair.
  – Seems to be very biased, but maybe it is optimizing a different objective function.
• If the model is perfect and there is an infinite amount of data, the confabulations will be equilibrium samples. So the shortcut will not cause learning to mess up a perfect model.
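A sketch of the shortcut for a tiny fully visible Boltzmann machine (the data, sizes, and learning rate are all invented): each weight update compares a datavector with its one-step confabulation instead of an equilibrium sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(s, W, b, rng):
    s = s.copy()
    for j in rng.permutation(len(s)):
        s[j] = float(rng.random() < sigmoid(b[j] + s @ W[:, j]))
    return s

# toy data: units 0 and 1 always agree, unit 2 is independent
data = np.array([[1., 1., 0.], [1., 1., 1.], [0., 0., 0.], [0., 0., 1.]])
n = data.shape[1]
W = np.zeros((n, n))
b = np.zeros(n)
lr = 0.05
for _ in range(200):
    for d in data:
        conf = gibbs_step(d, W, b, rng)           # brief chain: a single step
        g = np.outer(d, d) - np.outer(conf, conf)  # <s_i s_j>^0 - <s_i s_j>^1
        g = (g + g.T) / 2.0                        # keep the update symmetric
        np.fill_diagonal(g, 0.0)
        W += lr * g
        b += lr * (d - conf)
# we expect w_01 to grow positive, capturing the correlation in the data
```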
Intuitive motivation
• It is silly to run the Markov chain all the way to equilibrium if we can get the information required for learning in just a few steps.
  – The way in which the model systematically distorts the data distribution in the first few steps tells us a lot about how the model is wrong.
  – But the model could have strong modes far from any data. These modes will not be sampled by brief Monte Carlo. Is this a problem in practice? Apparently not.
Mean field Boltzmann machines
• Instead of using binary units with stochastic updates, approximate the Markov chain by using deterministic units with real-valued states, q, that represent a distribution over binary states.
• We can then run a deterministic approximation to the brief Markov chain:
$$Q(c) = \prod_j q_j^{\,s_j^c}\,(1 - q_j)^{\,(1 - s_j^c)}$$

$$q_j^{\,t+1} = \frac{1}{1 + e^{-x_j^t}}, \qquad x_j^t = \sum_k w_{kj}\, q_k^t$$
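The deterministic chain can be written as a damped fixed-point iteration (small random weights, no biases, all values invented here); the q values settle to a mean-field fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 5
W = rng.normal(0.0, 0.3, size=(n, n))
W = (W + W.T) / 2.0                 # symmetric lateral weights
np.fill_diagonal(W, 0.0)

q = np.full(n, 0.5)                 # start at the uninformative state
for _ in range(200):
    x = q @ W                       # x_j = sum_k w_kj q_k
    q = 0.5 * q + 0.5 * sigmoid(x)  # damped deterministic update

residual = np.max(np.abs(q - sigmoid(q @ W)))  # ~0 at a mean-field fixed point
```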
The hybrid model
• We can use the same factored distribution over code units in a causal model and in a mean field Boltzmann machine that learns to model the prior distribution over codes.
• The stochastic generative model is:
  – First sample a binary vector from the prior distribution that is specified by the lateral connections between code units
  – Then use this code vector to produce an ideal data vector
  – Then add Gaussian noise.
A hybrid model
$$C_0 = \sum_i \frac{1}{2\sigma_i^2} \Big[ \big(d_i - \langle \hat{d}_i \rangle\big)^2 + \mathrm{Var}(\hat{d}_i) \Big]$$

$$C_1 = -\sum_j q_j b_j \;-\; \sum_{j<k} q_j q_k\, w_{jk} \;+\; \sum_j \big[ q_j \log q_j + (1 - q_j)\log(1 - q_j) \big] \;+\; \log Z$$

[Diagram: a recognition model maps the data d to the q values of the code units; code units j and k are joined by lateral weight $w_{jk}$, and generative weight $w_{ji}$ produces the prediction $\hat{d}_i$]
The partition function is independent of the causal model
expected energy
minus entropy
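One useful sanity check on $C_1$: if the lateral weights are all zero the Boltzmann prior is factorial, mean-field inference is exact, and $C_1$ becomes a KL divergence from Q to the prior, so it is zero at the optimal q and positive elsewhere. (The bias values below are invented.)

```python
import numpy as np

b = np.array([0.5, -1.0, 2.0])      # generative biases; lateral weights w_jk = 0
Z = np.prod(1.0 + np.exp(b))        # partition function factorizes when w = 0

def C1(q):
    """Expected energy minus entropy plus log Z (with w_jk = 0)."""
    neg_entropy = np.sum(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))
    return float(-q @ b + neg_entropy + np.log(Z))

q_opt = 1.0 / (1.0 + np.exp(-b))    # exact posterior marginals for a factorial prior
print(C1(q_opt))                    # ~0: the KL divergence vanishes at the optimum
print(C1(np.full(3, 0.5)))          # positive away from the optimum
```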
The learning procedure
• Do a forward pass through the recognition model to compute q+ values for the code units.
• Use the q+ values to compute top-down predictions of the data and use the expected prediction errors to compute:
  – derivatives for the generative weights
  – likelihood derivatives for the q+ values
• Run the code units for a few steps ignoring the data to get the q− values. Use these q− values to compute:
  – the derivatives for the lateral weights
  – the derivatives for the q+ values that come from the prior
• Combine the likelihood and prior derivatives of the q+ values and backpropagate through the recognition net.
Simulation by Kejie Bao
Generative weights of hidden units
Adding more hidden layers
[Diagram: a multilayer model; the data d is mapped up through recognition models, $x_j$ is the top-down input to unit j from the layer above, and generative weight $w_{ji}$ produces the prediction $\hat{d}_i$]
The cost function for a multilayer model
$$C_1 = -\sum_j q_j (b_j + x_j) \;-\; \sum_{j<k} q_j q_k\, w_{jk} \;+\; \sum_j \big[ q_j \log q_j + (1 - q_j)\log(1 - q_j) \big] \;+\; \log Z(\{x_j\})$$

$$C_0 = \sum_i \frac{1}{2\sigma_i^2} \Big[ \big(d_i - \langle \hat{d}_i \rangle\big)^2 + \mathrm{Var}(\hat{d}_i) \Big]$$

$C_2$ is like $C_1$ but without the top-down inputs.

$Z(\{x_j\})$ is a conditional partition function that depends on the current top-down inputs to each unit.
The learning procedure for multiple hidden layers
• The top down inputs control the conditional partition function of a layer, but all the required derivatives can still be found using the differences between the q+ and the q- statistics.
• The learning procedure is just the same except that the top down inputs to a layer from the layer above must be frozen in place while each layer separately runs its brief Markov chain.
Advantages of a causal hierarchy of Markov Random Fields
• Allows clean-up at each stage of generation in a multilayer generative model. This makes it easy to maintain constraints.
• The lateral connections implement a prior that squeezes the redundancy out of each hidden layer by making most possible configurations very unlikely. This creates a bottleneck of the appropriate size.
• The causal connections between layers separate the partition functions so that the whole net does not have to settle. Each layer can settle separately.
  – This solves Terry’s problem.
THE END
Energy-Based Models with deterministic hidden units
• Use multiple layers of deterministic hidden units with non-linear activation functions.
• Hidden activities contribute additively to the global energy, E.
$$p(d) = \frac{e^{-E(d)}}{\sum_c e^{-E(c)}}$$

[Diagram: the data feeds forward into deterministic hidden units j and k, whose activities contribute energies $E_j$ and $E_k$ to the global energy]
Contrastive divergence
Aim is to minimize the amount by which a step toward equilibrium improves the data distribution.
$$CD = KL\big(P^0\,\|\,Q^\infty\big) - KL\big(Q^1\,\|\,Q^\infty\big)$$

Minimize contrastive divergence:
• minimize the divergence between the data distribution $P^0$ and the model’s distribution $Q^\infty$
• maximize the divergence between the confabulations $Q^1$ (the distribution after one step of the Markov chain) and the model’s distribution $Q^\infty$
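CD is never negative, because one step of a Markov chain can only move a distribution closer (in KL) to the chain's stationary distribution. A small check with an explicit transition matrix (everything here is an invented toy; the chain stands in for the Boltzmann machine's Gibbs sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
T = rng.random((k, k))
T /= T.sum(axis=1, keepdims=True)        # row-stochastic transition matrix

# stationary distribution (the model's distribution, Q^infinity)
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = np.abs(pi) / np.abs(pi).sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p0 = rng.random(k); p0 /= p0.sum()       # "data" distribution P^0
p1 = p0 @ T                              # Q^1: one step of the chain
cd = kl(p0, pi) - kl(p1, pi)             # contrastive divergence: >= 0
```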
Contrastive divergence
$$\frac{\partial}{\partial \theta} KL\big(Q^0\,\|\,Q^\infty\big) = \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{0} - \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{\infty}$$

$$\frac{\partial}{\partial \theta} KL\big(Q^1\,\|\,Q^\infty\big) = \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{1} - \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{\infty} + \frac{\partial Q^1}{\partial \theta}\,\frac{\partial\, KL(Q^1\,\|\,Q^\infty)}{\partial Q^1}$$

Subtracting, the intractable $\langle \partial E / \partial \theta \rangle^{\infty}$ terms cancel; ignoring the last term (changing the parameters changes the distribution of confabulations):

$$-\frac{\partial\, CD}{\partial \theta} \approx \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{1} - \Big\langle \frac{\partial E}{\partial \theta} \Big\rangle^{0}$$
Contrastive divergence makes the awkward terms cancel