How to learn a generative model of images
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto
How the brain works
• Each neuron receives inputs from thousands of other neurons.
– A few neurons also get inputs from the sensory receptors.
– A few neurons send outputs to muscles.
– Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
– The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
– Recognizing objects, understanding language, making plans, controlling the body.
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights on thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn't have enough capacity to store the answer.
• So there must be an intelligent designer.
– What does she look like? Where did she come from?
The intelligent designer
• The intelligent designer is a learning algorithm.
– The algorithm adjusts the weights to give the neural network a better model of the data it encounters.
• A learning algorithm is the differential equation of knowledge.
• Evolution produced the learning algorithm.
– Trial and error in the space of learning algorithms is a much better strategy than trial and error in the space of synapse strengths.
• To understand the learning algorithm, we first need to understand the type of network it produces.
– Shape recognition is a good task to consider: we are much better at it than computers, and it uses a lot of neurons.
Hopfield nets
• Model each pixel in an image using a binary neuron that has states of 1 or 0.
• Connect the neurons together with symmetric connections.
• Update the neurons one at a time based on the total input they receive.
• Stored patterns correspond to the energy minima of the network.
[Figure: a small Hopfield net of binary units (states 0 or 1) joined by symmetric weights such as 3.7 and -4.2]
$$E(\mathbf{s}) = -\sum_{i \in \text{units}} s_i b_i \;-\; \sum_{i<j} s_i s_j w_{ij}$$
where $E(\mathbf{s})$ is the energy of binary configuration $\mathbf{s}$, $b_i$ is the bias of unit $i$, $s_i$ is the binary state of unit $i$ in configuration $\mathbf{s}$, $w_{ij}$ is the weight between units $i$ and $j$, and the second sum indexes every non-identical pair of $i$ and $j$ once.
To store a pattern we change the weights to lower the energy of that pattern.
Why a Hopfield net doesn't work
• The ways in which shapes vary are much too complicated to be captured by pair-wise interactions between pixels.
– To capture all the allowable variations of a shape, we need extra "hidden" variables that learn to represent the features that the shape is composed of.
Some examples of real handwritten digits
From Hopfield Nets to Boltzmann Machines
• Boltzmann machines are stochastic Hopfield nets with hidden variables.
• They have a simple learning algorithm that adapts all of the interactions so that the equilibrium distribution over the visible variables matches the distribution of the observed data.
– The pair-wise interactions with the hidden variables can model higher-order correlations between the visible variables.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron’s bias, b, and the input it receives from other neurons.
$$p(s_i = 1) = \frac{1}{1 + \exp\!\big(-b_i - \sum_j s_j w_{ij}\big)}$$
[Figure: $p(s_i = 1)$ as a logistic function of the total input $b_i + \sum_j s_j w_{ij}$, rising from 0 to 1 and passing through 0.5 at zero input]
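In code, this update rule is a single logistic sample. A minimal sketch, assuming NumPy (the function name is illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit(b_i, s, w_i):
    """Sample the binary state of unit i: it turns on with probability
    given by the logistic of its bias plus the weighted input."""
    total_input = b_i + np.dot(s, w_i)          # b_i + sum_j s_j w_ij
    p_on = 1.0 / (1.0 + np.exp(-total_input))   # the logistic function
    return int(rng.random() < p_on)
```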
How a Boltzmann Machine models data
• The aim of learning is to discover weights that cause the equilibrium distribution of the whole network to match the data distribution on the visible variables.
• Everything is defined in terms of energies of joint configurations of the visible and hidden units.
[Figure: a network with a layer of hidden units connected to a layer of visible units]
The Energy of a joint configuration
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{units}} s_i^{\mathbf{v},\mathbf{h}} b_i \;-\; \sum_{i<j} s_i^{\mathbf{v},\mathbf{h}} s_j^{\mathbf{v},\mathbf{h}} w_{ij}$$
where $E(\mathbf{v}, \mathbf{h})$ is the energy with configuration $\mathbf{v}$ on the visible units and $\mathbf{h}$ on the hidden units, $b_i$ is the bias of unit $i$, $s_i^{\mathbf{v},\mathbf{h}}$ is the binary state of unit $i$ in joint configuration $\mathbf{v},\mathbf{h}$, $w_{ij}$ is the weight between units $i$ and $j$, and the second sum indexes every non-identical pair of $i$ and $j$ once.
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
\qquad\qquad
p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}$$
The denominator, summed over all joint configurations $\mathbf{u},\mathbf{g}$, is the partition function.
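For a network small enough to enumerate, these definitions can be checked directly. A brute-force sketch (NumPy assumed; all names are illustrative):

```python
import itertools
import numpy as np

def boltzmann_probs(b, W):
    """Exact p(s) for a tiny Boltzmann machine by enumerating every
    binary configuration. W must be symmetric with a zero diagonal."""
    n = len(b)
    configs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    # E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij
    energies = -configs @ b - 0.5 * np.einsum('ci,ij,cj->c', configs, W, configs)
    unnormalized = np.exp(-energies)
    Z = unnormalized.sum()                      # the partition function
    return configs, unnormalized / Z
```

This is exponential in the number of units, which is exactly why the learning algorithms below avoid computing the partition function directly.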
A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations.
$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{free}}$$
The left-hand side is the derivative of the log probability of one training vector. The first term is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; the second is the same expectation when nothing is clamped.
The batch learning algorithm
• Positive phase
– Clamp a datavector on the visible units.
– Let the hidden units reach thermal equilibrium at a temperature of 1 (may use annealing to speed this up).
– Sample $\langle s_i s_j \rangle$ for all pairs of units.
– Repeat for all datavectors in the training set.
• Negative phase
– Do not clamp any of the units.
– Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
– Sample $\langle s_i s_j \rangle$ for all pairs of units.
– Repeat many times to get good estimates.
• Weight updates
– Update each weight by an amount proportional to the difference of $\langle s_i s_j \rangle$ in the two phases.
Three reasons why learning is impractical in Boltzmann Machines
• If there are many hidden layers, it can take a long time to reach thermal equilibrium when a data-vector is clamped on the visible units.
• It takes even longer to reach thermal equilibrium in the "negative" phase when the visible units are unclamped.
– The unconstrained energy surface needs to be highly multimodal to model the data.
• The learning signal is the difference of two sampled correlations, which is very noisy.
Restricted Boltzmann Machines
• We restrict the connectivity to make inference and learning easier.
– Only one layer of hidden units.
– No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states, so it only takes one step to reach thermal equilibrium when the visible units are clamped.
– So we can quickly get the exact value of $\langle s_i s_j \rangle_{\mathbf{v}}$.
[Figure: a bipartite network with a layer of hidden units $j$ above a layer of visible units $i$]
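Because of this conditional independence, the clamped statistics need no iteration at all. A sketch, assuming NumPy and a row vector of visible states (the names are mine):

```python
import numpy as np

def hidden_given_visible(v, W, b_h, rng=np.random.default_rng(0)):
    """One parallel step gives an exact sample from p(h|v) in an RBM:
    each hidden unit j turns on independently with logistic probability."""
    p_h = 1.0 / (1.0 + np.exp(-(b_h + v @ W)))    # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    return h, p_h   # <s_i s_j>_v is then v[:, None] * p_h[None, :]
```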
A picture of the Boltzmann machine learning algorithm for an RBM
[Figure: starting from a training vector at $t = 0$, alternating Gibbs updates of the hidden units $j$ and visible units $i$ at $t = 1, 2, \ldots, \infty$]
$$\Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{fantasy}}$$
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
The configuration at $t = \infty$ is a "fantasy" sampled from the model's equilibrium distribution.
Contrastive divergence learning: A quick way to learn an RBM
[Figure: a training vector at $t = 0$ and its reconstruction at $t = 1$, each with hidden units $j$ driven by visible units $i$]
$$\Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{recon}}$$
Start with a training vector on the visible units.
Update all the hidden units in parallel
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.
This is not following the gradient of the log likelihood. But it works well.
It is trying to make the free energy gradient be zero at the data distribution.
[Figure: a free-energy surface with the data point and its nearby reconstruction]
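Putting the two phases together gives a compact training step. A minimal CD-1 sketch under the same NumPy assumptions (using probabilities for the final hidden statistics is a common variance-reducing choice; all names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_v, b_h, lr=0.1, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update on a batch of data rows."""
    # t = 0: drive the hidden units from the data
    p_h_data = sigmoid(b_h + v_data @ W)
    h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # t = 1: reconstruct the visibles, then re-infer the hiddens
    p_v_recon = sigmoid(b_v + h_data @ W.T)
    v_recon = (rng.random(p_v_recon.shape) < p_v_recon).astype(float)
    p_h_recon = sigmoid(b_h + v_recon @ W)
    # delta_w ~ <s_i s_j>_data - <s_i s_j>_recon
    n = v_data.shape[0]
    W += lr * (v_data.T @ p_h_data - v_recon.T @ p_h_recon) / n
    b_v += lr * (v_data - v_recon).mean(axis=0)
    b_h += lr * (p_h_data - p_h_recon).mean(axis=0)
```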
How to learn a set of features that are good for reconstructing images of the digit 2
[Figure: 50 binary feature neurons connected to a 16 x 16 pixel image; one panel shows the data pass, the other the reconstruction pass]
Increment the weights between an active pixel and an active feature when the network is driven by the data (reality).
Decrement the weights between an active pixel and an active feature when it is driven by the reconstruction (which has lower energy than reality).
Bush joke
The weights of the 50 feature detectors
We start with small random weights to break symmetry
The final 50 x 256 weights
Each neuron grabs a different feature.
[Figure columns: Data | Reconstruction from activated binary features]
How well can we reconstruct the digit images from the binary feature activations?
New test images from the digit class that the model was trained on
Images from an unfamiliar digit class (the network tries to see every image as a 2)
Bush joke 2
Training a deep network
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we get a better model of the set of training images.
– i.e. we assign lower free energy to the real data and higher free energy to all other possible images.
– The proof uses the fact that the variational free energy of a non-equilibrium distribution is always higher than the variational free energy of the equilibrium distribution.
– The proof depends on a neat equivalence.
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.
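To make the layer-by-layer recipe concrete, here is a minimal greedy-stacking sketch. It reuses the hypothetical `cd1_update` from the earlier CD-1 sketch; everything else (names, sizes, learning schedule) is illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_stack(data, layer_sizes, epochs=10, lr=0.1, rng=np.random.default_rng(0)):
    """Greedy layer-wise training: train an RBM, then treat its hidden
    activations as 'pixels' for the next RBM up."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        n_visible = x.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # small random weights
        b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            cd1_update(x, W, b_v, b_h, lr, rng)   # the CD-1 step sketched earlier
        weights.append((W, b_v, b_h))
        x = sigmoid(b_h + x @ W)  # this layer's features become the next layer's data
    return weights
```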
A causal network that is equivalent to an RBM
[Figure: an RBM with visible units v, hidden units h, and weights W is equivalent to an infinite directed net whose layers v, h1, h2, h3, ... are connected by the tied weights W and W^T alternating all the way up]
Learning a deep causal network
• First learn with all the weights tied.
[Figure: the infinite directed net with every weight matrix tied to W1, which is equivalent to learning a single RBM between v and h1]
• Then freeze the bottom layer and relearn all the other layers.
[Figure: W1 at the bottom is frozen; all the higher layers are tied to W2, which is equivalent to learning an RBM between h1 and h2]
• Then freeze the bottom two layers and relearn all the other layers.
[Figure: W1 and W2 are frozen; all the higher layers are tied to W3, which is equivalent to learning an RBM between h2 and h3]
The generative model after learning 3 layers
• To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
2. Perform a top-down pass to get states for all the other layers.
So the lower-level bottom-up connections are not part of the generative model.
[Figure: a three-layer deep belief net: data, h1, h2, h3 with generative weights W1, W2, W3; the top two layers h2 and h3 form an undirected RBM]
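A sketch of this two-step generative procedure, assuming the weight stack from the earlier sketch with the top pair of layers treated as an RBM (helper names and step counts are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(weights, gibbs_steps=200, rng=np.random.default_rng(0)):
    """Sample from the model: alternating Gibbs sampling in the
    top-level RBM, then one top-down pass through the directed layers."""
    W_top, b_v_top, b_h_top = weights[-1]        # the undirected top-level RBM
    h = (rng.random(W_top.shape[1]) < 0.5).astype(float)
    for _ in range(gibbs_steps):                 # approach equilibrium at the top
        v = (rng.random(W_top.shape[0]) < sigmoid(b_v_top + W_top @ h)).astype(float)
        h = (rng.random(W_top.shape[1]) < sigmoid(b_h_top + v @ W_top)).astype(float)
    state = v
    for W, b_v, _ in reversed(weights[:-1]):     # generative (top-down) weights only
        p = sigmoid(b_v + W @ state)
        state = (rng.random(p.shape) < p).astype(float)
    return state                                 # a fantasy on the visible units
```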
Why the hidden configurations should be treated as data when learning the next layer of weights
• After learning the first layer of weights:
$$\log p(\mathbf{v}) \;\ge\; \sum_{\mathbf{h}} p(\mathbf{h}|\mathbf{v})\big[\log p(\mathbf{h}) + \log p(\mathbf{v}|\mathbf{h})\big] \;+\; \mathrm{entropy}\big(p(\mathbf{h}|\mathbf{v})\big)$$
(the first term plays the role of a negative energy and the last of an entropy; the gap in the bound is the KL divergence between the distribution used over hidden configurations and the true posterior).
• If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:
$$\log p(\mathbf{v}) \;=\; \sum_{\mathbf{h}} p(\mathbf{h}|\mathbf{v}) \log p(\mathbf{h}) \;+\; \mathrm{constant}$$
• Maximizing the RHS is equivalent to maximizing the log probability of "data" that occurs with probability $p(\mathbf{h}|\mathbf{v})$.
A neural model of digit recognition
[Figure: a 28 x 28 pixel image feeding two hidden layers of 500 neurons each; the second hidden layer and 10 label neurons both connect to 2000 top-level neurons]
The model learns to generate combinations of labels and images.
To perform recognition we do an up-pass from the image followed by a few iterations of the top-level associative memory.
The top two layers form an associative memory whose energy landscape models the low dimensional manifolds of the digits.
The energy valleys have names
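A sketch of that recognition procedure; the label handling and helper names here are my own guess at the wiring, not code from the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognize(image, up_weights, W_top, b_top, b_pen, n_iters=10):
    """Up-pass to the penultimate layer, then let the top-level
    associative memory settle over the joint (features, labels) layer."""
    x = image
    for W, b in up_weights:                  # recognition (bottom-up) pass
        x = sigmoid(b + x @ W)
    labels = np.full(10, 0.1)                # start with uncertain labels
    for _ in range(n_iters):                 # iterate the associative memory
        pen = np.concatenate([x, labels])    # features stay clamped from the up-pass
        top = sigmoid(b_top + pen @ W_top)
        pen = sigmoid(b_pen + W_top @ top)
        labels = pen[-10:]                   # read back the label units
    return int(labels.argmax())
```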
Fine-tuning with the up-down algorithm: A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM.
– This eliminates explaining away at the top level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in the vicinity of the data.
– It also reduces mode averaging: if the recognition weights prefer one mode, they will stick with that mode even if the generative weights like some other mode just as much.
SHOW THE MOVIE
Examples of correctly recognized handwritten digits that the neural network had never seen before
It's very good.
How well does it discriminate on MNIST test set with no extra information about geometric distortions?
• Generative model based on RBMs: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): 1.6%
• Backprop with 500 --> 300 hiddens: 1.6%
• K-Nearest Neighbor: ~3.3%
• It's better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
Learning perceptual physics
• Suppose we have a video sequence of some balls bouncing in a box.
• A physicist would model the data using Newton's laws. To do this, you need to decide:
– How many objects are there?
– What are the coordinates of their centers at each time step?
– How elastic are they?
• Does a baby do the same as a physicist?
– Maybe we can just learn a model of how the world behaves from the raw video.
– It doesn't learn the abstractions that the physicist has, but it does know what it likes.
• And what it likes is videos that obey Newtonian physics.
The conditional RBM model
• Given the data, the previous hidden state, and the previous visible frames, the hidden units at time t are conditionally independent.
– So it is easy to sample from their conditional equilibrium distribution.
• Learning can be done by using contrastive divergence.
– Reconstruct the data at time t from the inferred states of the hidden units.
– The temporal connections between hiddens can be learned as if they were additional biases.
[Figure: directed connections from the visible frames at t-2 and t-1 and from the hidden units at t-1 feed the units at time t]
$$\Delta w_{ij} \propto s_i \big(\langle s_j \rangle_{\text{data}} - \langle s_j \rangle_{\text{recon}}\big)$$
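A sketch of the key idea, that the past contributes data-dependent biases (NumPy assumed; the matrices A and B for the directed temporal connections are my naming, not the talk's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_hidden_probs(v_t, history, W, B, b_h):
    """Hidden probabilities at time t in a conditional RBM: the
    concatenated previous frames act as extra, dynamic biases."""
    dynamic_b_h = b_h + history @ B        # directed links from the past
    return sigmoid(dynamic_b_h + v_t @ W)  # hiddens independent given v_t, history

def crbm_visible_probs(h_t, history, W, A, b_v):
    """Visible probabilities at time t, also conditioned on the past;
    used to reconstruct the frame for the CD update."""
    dynamic_b_v = b_v + history @ A
    return sigmoid(dynamic_b_v + W @ h_t)
```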
Show Ilya’s movies
THE END
For more on this type of learning see:
www.cs.toronto.edu/~hinton/science.pdf
For the proof that adding extra layers makes the model better see the paper on my web page:
“A fast learning algorithm for deep belief nets”
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.
[Figure: the same architecture: a 28 x 28 pixel image, two layers of 500 units, and 10 label units joined to 2000 top-level units]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous original visual input, with reconstructions after the associative memory settles on "six" or on "five"]
The features learned in the first hidden layer
Seeing what it is thinking
• The top-level associative memory has activities over thousands of neurons.
– It is hard to tell what the network is thinking by looking at the patterns of activation.
• To see what it is thinking, convert the top-level representation into an image by using the top-down connections.
– A mental state is the state of a hypothetical world in which the internal representation is correct.
The extra activation of cortex caused by a speech task. What were they thinking?
[Figure: a brain state]
What goes on in its mind if we show it an image composed of random pixels and ask it to fantasize from there?
[Figure: successive pairs of "mind" states (what the top-down connections render as an image) and "brain" states (the internal activity) as the network fantasizes, shown on the 28 x 28 image / 500 / 500 / 2000 architecture with 10 label neurons, together with data, reconstruction, and feature panels]