
Deep Boltzmann Machines

Ruslan Salakhutdinov and Geoffrey E. Hinton

Amish Goel

University of Illinois Urbana-Champaign

[email protected]

December 2, 2016


Overview

1 Introduction
  Representation of the model

2 Learning in Boltzmann Machines
  Variational Lower Bound - Mean Field Approximation
  Stochastic Approximation Procedure - Persistent Markov Chains

3 Additional Tricks for DBM
  Greedy Pretraining of the Model
  Discriminative Finetuning

4 Simulation results


Introduction

A Boltzmann Machine is a pairwise Markov random field. Consider a set of binary random variables, some of which are latent, i.e. hidden (h), and the others visible (v).

The probability distribution for binary random variables is given by

P_\theta(v, h) = \frac{1}{Z_\theta} e^{-E_\theta(v, h)}, \qquad \theta = \{L, J, W\}

E_\theta(v, h) = -\frac{1}{2} v^T L v - \frac{1}{2} h^T J h - v^T W h,

Figure: Model for Boltzmann Machines
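
As a small illustration (not part of the slides), the energy and the unnormalized probability of a fully connected binary Boltzmann machine can be evaluated directly; the helper names energy and unnormalized_prob are assumptions of this sketch.

import numpy as np

def energy(v, h, L, J, W):
    # E_theta(v, h) = -1/2 v^T L v - 1/2 h^T J h - v^T W h
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def unnormalized_prob(v, h, L, J, W):
    # exp(-E_theta(v, h)); dividing by the partition function Z_theta
    # (intractable in general) would give P_theta(v, h).
    return np.exp(-energy(v, h, L, J, W))

# Tiny example with 3 visible and 2 hidden binary units.
rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=3).astype(float)
h = rng.integers(0, 2, size=2).astype(float)
L = np.zeros((3, 3))                   # visible-visible couplings (symmetric, zero diagonal)
J = np.zeros((2, 2))                   # hidden-hidden couplings
W = 0.1 * rng.standard_normal((3, 2))  # visible-hidden couplings
print(energy(v, h, L, J, W), unnormalized_prob(v, h, L, J, W))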


Representation

While the Boltzmann Machine is a powerful model of the data, it is computationally expensive to learn, so one considers several restricted approximations to Boltzmann machines.

Figure: Boltzmann Machines vs RBM

A Deep Boltzmann Machine places the hidden units in several layers, a layer being a set of units with no direct connections among them.

Figure: Model for Deep Boltzmann Machines


Learning in Boltzmann Machines

The model can be trained using maximum likelihood. The log-likelihood and its gradient take the following form:

\ln(L_\theta(v)) = \ln(p_\theta(v)) = \ln\Big(\sum_h p_\theta(v, h)\Big) = \ln \sum_h \exp(-E_\theta(v, h)) - \ln \sum_{v,h} \exp(-E_\theta(v, h));

\frac{\partial \ln(L_\theta(v))}{\partial \theta} = -\underbrace{\sum_h p(h|v) \frac{\partial E_\theta(v, h)}{\partial \theta}}_{\text{data-dependent expectation}} + \underbrace{\sum_{v,h} p(v, h) \frac{\partial E_\theta(v, h)}{\partial \theta}}_{\text{model-dependent expectation}} \qquad (1)


Learning in Boltzmann Machines

Substituting E_θ(v, h) into the gradient obtained in the previous equation and using gradient ascent, one obtains the updates for the respective parameters:

\Delta W = \alpha\big(E_{P_{data}}[v h^T] - E_{P_{model}}[v h^T]\big),
\Delta L = \alpha\big(E_{P_{data}}[v v^T] - E_{P_{model}}[v v^T]\big),
\Delta J = \alpha\big(E_{P_{data}}[h h^T] - E_{P_{model}}[h h^T]\big),
\Delta b = \alpha\big(E_{P_{data}}[v] - E_{P_{model}}[v]\big),
\Delta c = \alpha\big(E_{P_{data}}[h] - E_{P_{model}}[h]\big) \qquad (2)

The maximum likelihood parameter updates above are very costly, as one would need to sum over an exponential number of terms to compute both expectations. Approximations are needed.
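
To make equation (2) concrete, here is a minimal sketch (mine, not from the slides) that forms the updates from a batch of data-clamped samples and a batch of model samples; how those samples are obtained is exactly the approximation question addressed next.

import numpy as np

def bm_updates(v_data, h_data, v_model, h_model, alpha=0.01):
    # Eq. (2): difference of data- and model-dependent expectations.
    # v_data, h_data   : arrays of shape (N, D) and (N, M), hidden states sampled with v clamped to data
    # v_model, h_model : arrays of the same column dimensions, sampled from the model
    dW = alpha * (v_data.T @ h_data / len(v_data) - v_model.T @ h_model / len(v_model))
    dL = alpha * (v_data.T @ v_data / len(v_data) - v_model.T @ v_model / len(v_model))
    dJ = alpha * (h_data.T @ h_data / len(h_data) - h_model.T @ h_model / len(h_model))
    db = alpha * (v_data.mean(axis=0) - v_model.mean(axis=0))
    dc = alpha * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return dW, dL, dJ, db, dc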


Approximate Maximum Likelihood Learning in Boltzmann Machines

One approximation is to use a variational lower bound on the log-likelihood:

\ln(p_\theta(v)) = \ln\Big(\sum_h p_\theta(v, h)\Big) = \ln\Big(\sum_h q_\mu(h|v) \frac{p_\theta(v, h)}{q_\mu(h|v)}\Big) \ge \sum_h q_\mu(h|v) \ln p_\theta(v, h) + H_e(q_\mu) = \mathcal{L}(q_\mu, \theta) \qquad (3)

where q_µ(h|v) is an approximate posterior (variational) distribution and H_e(·) is the entropy function with natural logarithm.

The goal is to find the tightest lower bound on the log-likelihood by optimizing over the distributions q_µ and the parameters θ.
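
A standard one-line identity, not stated on the slide but worth recording, makes precise why tightening the bound over q_µ is the right objective: the slack in (3) is exactly a KL divergence,

\ln p_\theta(v) = \mathcal{L}(q_\mu, \theta) + \mathrm{KL}\big(q_\mu(h|v) \,\|\, p_\theta(h|v)\big), \qquad \mathrm{KL}(\cdot\,\|\,\cdot) \ge 0,

so for fixed θ, maximizing \mathcal{L} over µ is equivalent to minimizing the KL divergence from the variational posterior to the true posterior, and the bound is tight exactly when q_µ(h|v) = p_θ(h|v).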


Variational Learning for Boltzmann Machines

For Boltzmann Machines, the lower bound can be rewritten as (ignoring the bias terms):

\mathcal{L}(q_\mu, \theta) = \sum_h q_\mu(h|v)\big(-E_\theta(v, h)\big) - \ln(Z_\theta) + H_e(q_\mu) \qquad (4)

Using the mean-field approximation, q_\mu(h|v) = \prod_{j=1}^{M} q(h_j|v), and one assumes that q(h_j = 1) = \mu_j (M is the number of hidden units).

\mathcal{L}(q_\mu, \theta) = \sum_h \prod_{i=1}^{M} q_\mu(h_i|v)\Big(\frac{1}{2} v^T L v + \frac{1}{2} h^T J h + v^T W h\Big) - \ln(Z_\theta) + H_e(q_\mu)

= \frac{1}{2} v^T L v + \frac{1}{2} \mu^T J \mu + v^T W \mu - \ln(Z_\theta) + \sum_{j=1}^{M} H_e(\mu_j) \qquad (5)
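
As a sketch (mine, with assumed function names), everything in (5) except the ln Z_θ term can be evaluated in closed form from v, the mean-field parameters µ, and the weights:

import numpy as np

def bernoulli_entropy(mu, eps=1e-12):
    # H_e(mu_j) = -mu ln(mu) - (1 - mu) ln(1 - mu), elementwise, in nats.
    mu = np.clip(mu, eps, 1.0 - eps)
    return -(mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu))

def mean_field_bound_minus_logZ(v, mu, L, J, W):
    # Eq. (5) without the -ln Z_theta term (an intractable constant in theta).
    return (0.5 * v @ L @ v + 0.5 * mu @ J @ mu + v @ W @ mu
            + bernoulli_entropy(mu).sum())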


Variational EM Learning for Boltzmann Machines

Maximize the lower bound by alternating maximization over the variational parameters µ and the model parameters θ - the typical EM learning idea.

E-step: \sup_\mu \mathcal{L}(q_\mu, \theta) = \sup_\mu \Big[\frac{1}{2} v^T L v + \frac{1}{2} \mu^T J \mu + v^T W \mu - \ln(Z_\theta) + \sum_{j=1}^{M} H_e(\mu_j)\Big]

Maximizing over each µ_j in turn (coordinate ascent), one gets the update

\mu_j \leftarrow \sigma\Big(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj} \mu_m\Big),

where σ(.) denotes the sigmoid function.

After running these updates to convergence, the mean-field parameters µ reach a fixed point.
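
A minimal sketch of this fixed-point iteration (my own, with assumed names sigmoid and mean_field; it updates all µ_j in parallel for simplicity, whereas the slide's update is coordinate-wise):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=25, mu0=None):
    # E-step updates: mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m).
    M = W.shape[1]
    mu = np.full(M, 0.5) if mu0 is None else mu0.copy()
    for _ in range(n_iters):
        # J is symmetric with zero diagonal, so the m != j restriction is automatic.
        mu = sigmoid(v @ W + mu @ J)
    return mu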


Stochastic Approximations or Persistent Markov Chains

M-step: \sup_\theta \mathcal{L}(q_\mu, \theta) = \sup_\theta \Big[\frac{1}{2} v^T L v + \frac{1}{2} \mu^T J \mu + v^T W \mu - \ln(Z_\theta) + \sum_{j=1}^{M} H_e(\mu_j)\Big]

MCMC sampling with persistent Markov chains is used to approximate the gradient of the log-partition function ln(Z_θ).

The parameter updates for one training example can be written as,

\Delta W = \alpha_t \Big(v \mu^T - \frac{1}{M} \sum_{m=1}^{M} \tilde{v}_m \tilde{h}_m^T\Big),
\Delta L = \alpha_t \Big(v v^T - \frac{1}{M} \sum_{m=1}^{M} \tilde{v}_m \tilde{v}_m^T\Big),
\Delta J = \alpha_t \Big(\mu \mu^T - \frac{1}{M} \sum_{m=1}^{M} \tilde{h}_m \tilde{h}_m^T\Big), \qquad (6)

where (\tilde{v}_m, \tilde{h}_m), m = 1, ..., M, are the current states of the M persistent Markov chains, i.e. approximate samples from the model distribution.


Overall Algorithm for Training Boltzmann Machines

Data: a training set S_N of N binary data vectors v, and M, the number of persistent Markov chains.

Initialize the parameter vector θ^0 and the M chain states {ṽ_{0,1}, h̃_{0,1}}, ..., {ṽ_{0,M}, h̃_{0,M}}.
for t = 0 to T (number of iterations) do
    for each n ∈ S_N do
        Randomly initialize µ^n and run the mean-field updates to convergence:
            µ_j ← σ( Σ_i W_ij v_i + Σ_{m≠j} J_mj µ_m )
    end
    for m = 1 to M (number of persistent Markov chains) do
        Sample (ṽ_{t+1,m}, h̃_{t+1,m}) given (ṽ_{t,m}, h̃_{t,m}) by running the Gibbs sampler.
    end
    Update θ using equation (6) (adjusted for batch data) and decrease the learning rate α_t.
end
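
Putting the steps together, a compact sketch of one training iteration (my own composition, not the authors' code; it reuses the hypothetical sigmoid, mean_field and bm_updates helpers sketched earlier):

import numpy as np

def gibbs_step(v, h, L, J, W, b, c, rng):
    # One sweep of Gibbs sampling in a general binary Boltzmann machine.
    # Parallel block updates; these are exact Gibbs only when L and J have no
    # within-block couplings (as in a DBM layer), otherwise an approximation.
    h = (rng.random(h.shape) < sigmoid(v @ W + h @ J + c)).astype(float)
    v = (rng.random(v.shape) < sigmoid(W @ h + L @ v + b)).astype(float)
    return v, h

def training_iteration(batch, chains, params, alpha, rng, mf_iters=25):
    # One SAP/PCD iteration: mean-field E-step, persistent Gibbs, gradient update.
    # batch  : (N, D) array of binary data vectors
    # chains : list of (v, h) states of the persistent Markov chains
    L, J, W, b, c = params
    # E-step: mean-field posteriors give the data-dependent expectations.
    mus = np.stack([mean_field(v, W, J, n_iters=mf_iters) for v in batch])
    # Advance each persistent chain by one Gibbs sweep (model-dependent term).
    chains = [gibbs_step(v, h, L, J, W, b, c, rng) for (v, h) in chains]
    v_model = np.stack([v for v, _ in chains])
    h_model = np.stack([h for _, h in chains])
    # M-step: apply the updates of equations (2)/(6) and decrease alpha outside.
    dW, dL, dJ, db, dc = bm_updates(batch, mus, v_model, h_model, alpha)
    return (L + dL, J + dJ, W + dW, b + db, c + dc), chains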


Learning for Deep Boltzmann Machines

For Deep Boltzmann Machines, L = 0 and J has many zero blocks, since the hidden-unit interactions are layered; this simplifies some of the computations.

The Gibbs sampling procedure is also simplified, since all units in one layer can be sampled in parallel.
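
For concreteness, a sketch (mine, not from the slides; it reuses the sigmoid helper assumed earlier and omits biases) of one Gibbs sweep in a DBM with two hidden layers, where within-layer independence lets each layer be sampled in a single vectorized step:

import numpy as np

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    # One alternating Gibbs sweep in a DBM with two hidden layers.
    # v--h1 coupled by W1 (D x M1), h1--h2 coupled by W2 (M1 x M2); no within-layer
    # connections, so each conditional factorizes over the units of a layer.
    # Middle layer: h1 depends on both of its neighbours, v and h2.
    h1 = (rng.random(h1.shape) < sigmoid(v @ W1 + h2 @ W2.T)).astype(float)
    # Given h1, the layers v and h2 are conditionally independent.
    v = (rng.random(v.shape) < sigmoid(h1 @ W1.T)).astype(float)
    h2 = (rng.random(h2.shape) < sigmoid(h1 @ W2)).astype(float)
    return v, h1, h2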

However, learning was observed to be slow, and greedy pretraining can result in faster convergence of the parameters.


Pretraining in Deep Boltzmann Machines

Each RBM is trained separately, layer by layer, with some weight scaling when the RBMs are composed into the DBM.

Figure: Greedy Layerwise Pretraining for DBM
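
A rough sketch of the greedy layer-wise idea (my own, with CD-1 standing in for the RBM training and the sigmoid helper assumed earlier; the exact weight-scaling rules used by the authors are omitted here, so treat the details as assumptions):

import numpy as np

def cd1_rbm_update(v, W, rng, alpha=0.05):
    # One CD-1 update for a binary RBM with weight matrix W (biases omitted).
    h_prob = sigmoid(v @ W)
    h = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = (rng.random(v.shape) < sigmoid(h @ W.T)).astype(float)
    h_recon_prob = sigmoid(v_recon @ W)
    return W + alpha * (np.outer(v, h_prob) - np.outer(v_recon, h_recon_prob))

def greedy_pretrain(data, layer_sizes, rng, epochs=5):
    # Train a stack of RBMs bottom-up; each RBM models the previous layer's activations.
    weights, layer_input = [], data
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_in, n_out))
        for _ in range(epochs):
            for v in layer_input:
                W = cd1_rbm_update(v, W, rng)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W)  # propagate activations upward
    return weights  # used (after the paper's weight scaling) to initialize the DBM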


Discriminative Finetuning in Deep Boltzmann Machines

Further, an additional step of finetuning is also considered to improve the performance.

For example, for a DBM with two hidden layers, an approximate posterior over the hidden units is used as an augmented input to a neural network whose weights are initialized from the DBM parameters.
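
A schematic sketch of that construction (my reading of the slide, not the authors' exact recipe; the function name finetune_input and the reuse of the earlier sigmoid helper are assumptions): run mean field in the DBM, append the top-layer posterior to the raw input, and feed the result to a discriminatively trained network initialized from the DBM weights.

import numpy as np

def finetune_input(v, W1, W2, mf_iters=10):
    # Augment a visible vector with the mean-field posterior of the top hidden layer.
    mu1, mu2 = np.full(W1.shape[1], 0.5), np.full(W2.shape[1], 0.5)
    for _ in range(mf_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return np.concatenate([v, mu2])  # fed to an MLP initialized from (W1, W2)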

Figure: Finetuning the parameters of DBM


Some Experimental Results and Observations

A DBM was trained to model the handwritten digits of the MNIST dataset.

Figure: An example of a DBM used for MNIST data generation, trained on 60,000 examples. (a) DBM model used for training; (b) examples of handwritten digits.

Some interesting observations: without greedy pretraining, the models did not produce good results.

Using discriminative finetuning, the DBM gave 99.5% accuracy, the best result on the MNIST recognition task at that time.


Thank You
