Dropout as a Bayesian Approximation: Insights and Applications
Yarin Gal and Zoubin Ghahramani
Discussion by: Chunyuan Li
Jan. 15, 2016
Main idea

- In the framework of variational inference, the authors show that the standard algorithm of SGD training with dropout essentially optimizes a stochastic lower bound of a Gaussian process whose kernel takes the form of a neural network.
Outline

Preliminaries
- Dropout
- Gaussian Processes
- Variational Inference

A Gaussian Process Approximation
- A Gaussian Process Approximation
- Evaluating the Log Evidence Lower Bound
Dropout

Procedure
- Training stage: a unit is present with probability p
- Testing stage: the unit is always present and its weights are multiplied by p

Intuition
- Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing
- A single neural network approximates averaging the outputs of these thinned networks at test time
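The two stages above can be sketched in a few lines of NumPy (a minimal illustration; the array shapes and keep probability are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8  # a unit is present (kept) with probability p

def dropout_train(h, p, rng):
    # Training stage: each unit is independently present with probability p
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    # Testing stage: every unit is present; activations are scaled by p
    return h * p

h = rng.standard_normal((4, 10))
# Averaging many thinned networks matches the test-time scaling in expectation
mean_train = np.mean([dropout_train(h, p, rng) for _ in range(20000)], axis=0)
```

This illustrates why the test-time rescaling by p approximates the average over thinned networks.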
Dropout for one-hidden-layer Neural Networks

- Dropout on local units*

  y = \big( g((x \odot b_1) W_1) \odot b_2 \big) W_2    (1)

  - Input x and output y
  - g: activation function; weights W_\ell \in \mathbb{R}^{K_{\ell-1} \times K_\ell}
  - b_\ell are binary dropout variables
- Equivalent to multiplying the global weight matrices by the binary vectors, dropping out entire rows:

  y = g\big( x (\mathrm{diag}(b_1) W_1) \big) (\mathrm{diag}(b_2) W_2)    (2)

- Application to regression:

  L = \frac{1}{2N} \sum_{n=1}^N \| y_n - \hat{y}_n \|_2^2 + \lambda (\| W_1 \|^2 + \| W_2 \|^2)    (3)

* The bias term is ignored for simplicity.
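The equivalence between dropping local units, Eq. (1), and zeroing rows of the global weight matrices, Eq. (2), can be checked numerically (a sketch with arbitrary small dimensions and tanh as the activation g):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, K, D = 5, 3, 7, 2
x = rng.standard_normal((N, Q))
W1 = rng.standard_normal((Q, K))
W2 = rng.standard_normal((K, D))
b1 = (rng.random(Q) < 0.5).astype(float)  # binary dropout variables
b2 = (rng.random(K) < 0.5).astype(float)
g = np.tanh

# Eq. (1): dropout on local units via elementwise masking
y_local = (g((x * b1) @ W1) * b2) @ W2
# Eq. (2): dropout via zeroed rows of the global weight matrices
y_global = g(x @ (np.diag(b1) @ W1)) @ (np.diag(b2) @ W2)
```

The two forward passes produce identical outputs, which is the row-dropping reinterpretation the derivation relies on.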
Gaussian Processes

- f is the GP function:

  \underbrace{p(f \mid X, Y)}_{\text{Posterior}} \propto \underbrace{p(f)}_{\text{Prior}} \; \underbrace{p(Y \mid X, f)}_{\text{Likelihood}}

- Applications
  - Regression:

    F \mid X \sim \mathcal{N}(0, K(X, X)), \quad Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)
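A minimal GP regression sketch, using an RBF kernel as an illustrative choice of K(·,·) (the paper's own kernel, built from a neural network, is introduced later) and toy data:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel; an illustrative choice of K(.,.)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.linspace(-3, 3, 20)
Y = np.sin(X) + 0.1 * rng.standard_normal(20)  # toy data
tau = 100.0                                    # noise precision in Y|F

# Posterior predictive mean at x*: K(x*,X) (K(X,X) + tau^{-1} I)^{-1} Y
Xs = np.array([0.0])
mean = rbf(Xs, X) @ np.linalg.solve(rbf(X, X) + np.eye(20) / tau, Y)
```

The predictive mean combines the prior covariance with the noisy observations, which is the closed form the regression model above admits.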
Variational Inference

- The predictive distribution:

  p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w) \underbrace{p(w \mid X, Y)}_{\approx\, q(w)} \, dw    (4)

  - Objective: \arg\min_q \mathrm{KL}\big( q(w) \,\|\, p(w \mid X, Y) \big)
  - Variational prediction: q(y^* \mid x^*) = \int p(y^* \mid x^*, w) q(w) \, dw
- Log evidence lower bound:

  \mathcal{L}_{\mathrm{VI}} = \int q(w) \log p(Y \mid X, w) \, dw - \mathrm{KL}\big( q(w) \,\|\, p(w) \big)    (5)

  - Objective: \arg\max_q \mathcal{L}_{\mathrm{VI}}
  - A variational distribution q(w) should explain the data well while staying close to the prior
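The variational prediction integral is typically evaluated by Monte Carlo: draw w_t ~ q(w) and average the model's predictions. A toy sketch with a linear model y* = w·x* and a made-up Gaussian q(w):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model y* = w * x* with a (made-up) Gaussian variational posterior q(w)
q_mean, q_std = 1.5, 0.3
x_star = 2.0

# Monte Carlo form of q(y*|x*) = \int p(y*|x*, w) q(w) dw:
# draw w_t ~ q(w) and average the predictions
w_samples = rng.normal(q_mean, q_std, size=5000)
y_star_mean = np.mean(w_samples * x_star)
```

For dropout networks, this averaging over posterior samples is exactly what running multiple stochastic forward passes at test time computes.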
A single-layer neural network example

- Setup
  - Q: input dimension, K: number of hidden units, D: output dimension
- Goal: learn W_1 \in \mathbb{R}^{Q \times K} and W_2 \in \mathbb{R}^{K \times D} to map X \in \mathbb{R}^{N \times Q} to Y \in \mathbb{R}^{N \times D}
- Idea: introduce W_1 and W_2 into the GP approximation
Introduce W_1

- Define the kernel of the GP:

  K(x, x') = \int p(w) \, g(w^\top x) \, g(w^\top x') \, dw    (6)

- Resort to Monte Carlo integration, with the generative process:

  w_k \sim p(w), \quad W_1 = [w_k]_{k=1}^K, \quad \hat{K}(x, x') = \frac{1}{K} \sum_{k=1}^K g(w_k^\top x) \, g(w_k^\top x')

  F \mid X, W_1 \sim \mathcal{N}(0, \hat{K}), \quad Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)    (7)

  where K is the number of hidden units
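The Monte Carlo kernel estimate can be written directly (a sketch assuming p(w) = N(0, I) and g = tanh; two independent estimates should agree closely for large K):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, K = 3, 2000  # K hidden units; larger K tightens the MC estimate
x  = rng.standard_normal(Q)
xp = rng.standard_normal(Q)
g = np.tanh

def k_hat(x, xp, K, rng):
    # K-sample Monte Carlo estimate of the kernel integral in Eq. (6)
    W1 = rng.standard_normal((len(x), K))  # columns w_k ~ N(0, I)
    return np.mean(g(x @ W1) * g(xp @ W1))

k1 = k_hat(x, xp, K, rng)
k2 = k_hat(x, xp, K, rng)  # an independent estimate of the same kernel value
```

The key move is that the number of MC samples is identified with the number of hidden units, so the random features are exactly a random hidden layer.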
Introduce W_2

- Analytically integrating with respect to F, the predictive distribution

  p(Y \mid X) = \int p(Y \mid F) \, p(F \mid X, W_1) \, p(W_1) \, dW_1 \, dF    (8)

  can be rewritten as

  p(Y \mid X) = \int \mathcal{N}(Y; 0, \Phi \Phi^\top + \tau^{-1} I_N) \, p(W_1) \, dW_1    (9)

  where \Phi = [\phi_n]_{n=1}^N, \; \phi_n = \sqrt{1/K} \, g(W_1^\top x_n)

- For W_2 = [w_d]_{d=1}^D with w_d \sim \mathcal{N}(0, I_K):

  \mathcal{N}(y_d; 0, \Phi \Phi^\top + \tau^{-1} I_N) = \int \mathcal{N}(y_d; \Phi w_d, \tau^{-1} I_N) \, \mathcal{N}(w_d; 0, I_K) \, dw_d

  p(Y \mid X) = \int p(Y \mid X, W_1, W_2) \, p(W_1) \, p(W_2) \, dW_1 \, dW_2    (10)
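The Gaussian identity used above, that y_d = Φ w_d + noise with w_d ~ N(0, I_K) has marginal covariance ΦΦ^⊤ + τ^{-1} I_N, can be verified by simulation (small made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, tau, T = 4, 6, 4.0, 200_000
Phi = rng.standard_normal((N, K)) / np.sqrt(K)

# Sample y = Phi w + eps with w ~ N(0, I_K), eps ~ N(0, tau^{-1} I_N)
w = rng.standard_normal((T, K))
ys = w @ Phi.T + rng.standard_normal((T, N)) / np.sqrt(tau)

emp_cov = ys.T @ ys / T                  # empirical covariance of y
target = Phi @ Phi.T + np.eye(N) / tau   # the marginal covariance in Eq. (9)
```

This is the weight-space view of the GP: integrating out w_d recovers the function-space covariance, so introducing W_2 changes nothing about the model.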
Variational Inference in the Approximate Model

  q(W_1, W_2) = q(W_1) \, q(W_2)    (11)

- To mimic dropout, q(W_1) is factorised over the input dimensions, each factor a two-component Gaussian mixture:

  q(W_1) = \prod_{q=1}^Q q(w_q), \quad q(w_q) = p_1 \mathcal{N}(m_q, \sigma^2 I_K) + (1 - p_1) \mathcal{N}(0, \sigma^2 I_K)    (12)

  with p_1 \in [0, 1], \; m_q \in \mathbb{R}^K

- The same for q(W_2):

  q(W_2) = \prod_{k=1}^K q(w_k), \quad q(w_k) = p_2 \mathcal{N}(m_k, \sigma^2 I_D) + (1 - p_2) \mathcal{N}(0, \sigma^2 I_D)    (13)

  with p_2 \in [0, 1], \; m_k \in \mathbb{R}^D

- Optimise over the variational parameters, especially M_1 = [m_q]_{q=1}^Q and M_2 = [m_k]_{k=1}^K
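Sampling from one factor q(w_q) of Eq. (12) makes the connection to dropout visible: with small σ, a row is either near m_q (kept) or near 0 (dropped). A sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(6)
K, sigma, p1 = 5, 0.01, 0.8
m_q = rng.standard_normal(K)

def sample_wq(m, p, sigma, rng):
    # Two-component mixture: mean m with prob p, mean 0 with prob 1 - p
    centre = m if rng.random() < p else np.zeros_like(m)
    return centre + sigma * rng.standard_normal(m.shape)

samples = np.array([sample_wq(m_q, p1, sigma, rng) for _ in range(50_000)])
# The mean of q(w_q) is p1 * m_q, i.e. the dropout-scaled row
```

The mixture mean p1·m_q is exactly the test-time weight scaling from the dropout procedure.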
Evaluating the Log Evidence Lower Bound for Regression

- Log evidence lower bound:

  \mathcal{L}_{\mathrm{GP\text{-}VI}} = \underbrace{\int q(W_1, W_2) \log p(Y \mid X, W_1, W_2) \, dW_1 \, dW_2}_{\mathcal{L}_1} - \underbrace{\mathrm{KL}\big( q(W_1, W_2) \,\|\, p(W_1, W_2) \big)}_{\mathcal{L}_2}    (14)

- Approximation of \mathcal{L}_2^\dagger
  - For large enough K we can approximate the KL divergence term as

    \mathrm{KL}\big( q(W_1) \,\|\, p(W_1) \big) \approx QK(\sigma^2 - \log \sigma^2 - 1) + \frac{p_1}{2} \sum_{q=1}^Q m_q^\top m_q    (15)

  - Similarly for \mathrm{KL}\big( q(W_2) \,\|\, p(W_2) \big)

† Following Proposition 1 on the KL of a mixture of Gaussians in the Appendix.
\mathcal{L}_1: Monte Carlo integration

- Parameterization:

  \mathcal{L}_1 = \int q(b_1, \epsilon_1, b_2, \epsilon_2) \log p\big( Y \mid X, W_1(b_1, \epsilon_1), W_2(b_2, \epsilon_2) \big) \, db_1 \, db_2 \, d\epsilon_1 \, d\epsilon_2    (16)

  W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1
  W_2 = \mathrm{diag}(b_2)(M_2 + \sigma \epsilon_2) + (1 - \mathrm{diag}(b_2)) \sigma \epsilon_2    (17)

  where \epsilon_1 \sim \mathcal{N}(0, I_{Q \times K}), \; b_{1,q} \sim \mathrm{Bernoulli}(p_1),
        \epsilon_2 \sim \mathcal{N}(0, I_{K \times D}), \; b_{2,k} \sim \mathrm{Bernoulli}(p_2)    (18)

- Take a single sample:

  \mathcal{L}_{\mathrm{GP\text{-}MC}} = \underbrace{\log p(Y \mid X, \hat{W}_1, \hat{W}_2)}_{\mathcal{L}_{1\text{-}\mathrm{MC}}} - \mathrm{KL}\big( q(W_1, W_2) \,\|\, p(W_1, W_2) \big)    (19)

  Optimising \mathcal{L}_{\mathrm{GP\text{-}MC}} converges to the same limit as \mathcal{L}_{\mathrm{GP\text{-}VI}}.
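Eq. (17) is a reparameterisation: sampling W_1 reduces to sampling b_1 and ε_1. A sketch with made-up sizes (note the expression simplifies to diag(b_1) M_1 + σ ε_1 because the same ε_1 appears in both terms):

```python
import numpy as np

rng = np.random.default_rng(7)
Q, K, sigma, p1 = 3, 4, 0.05, 0.7
M1 = rng.standard_normal((Q, K))

def sample_W1(M1, p1, sigma, rng):
    Q, K = M1.shape
    eps1 = rng.standard_normal((Q, K))       # eps1 ~ N(0, I_{QxK})
    b1 = (rng.random(Q) < p1).astype(float)  # b1q ~ Bernoulli(p1)
    # Eq. (17): rows with b = 1 are centred at the rows of M1,
    # rows with b = 0 are centred at zero
    return np.diag(b1) @ (M1 + sigma * eps1) + np.diag(1 - b1) @ (sigma * eps1)

W1s = np.array([sample_W1(M1, p1, sigma, rng) for _ in range(40_000)])
```

Because the randomness is pushed into b and ε, the log-likelihood becomes differentiable in M_1, which is what lets SGD optimise the single-sample bound.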
\mathcal{L}_1: Monte Carlo integration

- Regression:

  \mathcal{L}_{1\text{-}\mathrm{MC}} = \log p(Y \mid X, \hat{W}_1, \hat{W}_2)
    = \sum_{d=1}^D \log \mathcal{N}(y_d; \Phi w_d, \tau^{-1} I_N)
    = -\frac{ND}{2} \log(2\pi) + \frac{ND}{2} \log \tau - \sum_{d=1}^D \frac{\tau}{2} \| y_d - \Phi w_d \|_2^2    (20)

- Sum over the rows instead of the columns of Y:

  \sum_{d=1}^D \frac{\tau}{2} \| y_d - \Phi w_d \|_2^2 = \sum_{n=1}^N \frac{\tau}{2} \| y_n - \hat{y}_n \|_2^2    (21)

  where \hat{y}_n = \phi_n W_2 = \sqrt{1/K} \, g(x_n W_1) W_2
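Eq. (21) just reorders a sum over the squared entries of the residual matrix Y − ΦW_2; a quick numerical check with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(8)
N, K, D = 6, 4, 3
Y = rng.standard_normal((N, D))
Phi = rng.standard_normal((N, K))
W2 = rng.standard_normal((K, D))

R = Y - Phi @ W2  # residual matrix; y_d are its columns, y_n its rows
over_columns = sum(np.sum(R[:, d] ** 2) for d in range(D))  # Eq. (20) form
over_rows = sum(np.sum(R[n, :] ** 2) for n in range(N))     # Eq. (21) form
```

Summing per data point rather than per output dimension is what turns the bound into the familiar per-example squared loss.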
Recover SGD Training with Dropout

- Take the approximations of \mathcal{L}_1 and \mathcal{L}_2 for the bound, and ignore terms constant in the parameters (involving \tau and \sigma):

  \mathcal{L}_{\mathrm{GP\text{-}MC}} = -\frac{\tau}{2} \sum_{n=1}^N \| y_n - \hat{y}_n \|_2^2 - \frac{p_1}{2} \| M_1 \|_2^2 - \frac{p_2}{2} \| M_2 \|_2^2    (22)

- Letting \sigma tend to 0:

  W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1 \;\Rightarrow\; W_1 \approx \mathrm{diag}(b_1) M_1, \quad W_2 \approx \mathrm{diag}(b_2) M_2    (23)

  \hat{y}_n = \sqrt{1/K} \, g(x_n W_1) W_2 = \sqrt{1/K} \, g\big( x_n (\mathrm{diag}(b_1) M_1) \big) (\mathrm{diag}(b_2) M_2)    (24)
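Putting Eqs. (22) and (24) together: up to the scaling 1/(τN) and the identification λ_i = p_i/(2Nτ), a single stochastic evaluation of the bound is exactly the dropout objective of Eq. (3). A sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(9)
N, Q, K, D = 8, 3, 5, 2
tau, p1, p2 = 1.0, 0.8, 0.8
X = rng.standard_normal((N, Q))
Y = rng.standard_normal((N, D))
M1 = rng.standard_normal((Q, K))
M2 = rng.standard_normal((K, D))
g = np.tanh

b1 = (rng.random(Q) < p1).astype(float)
b2 = (rng.random(K) < p2).astype(float)
# Eq. (24): the sigma -> 0 forward pass is a dropout forward pass
Y_hat = np.sqrt(1.0 / K) * g(X @ np.diag(b1) @ M1) @ (np.diag(b2) @ M2)

# Eq. (22): single-sample stochastic lower bound (constant terms dropped)
L_gp_mc = (-tau / 2 * np.sum((Y - Y_hat) ** 2)
           - p1 / 2 * np.sum(M1 ** 2)
           - p2 / 2 * np.sum(M2 ** 2))

# Eq. (3) with lambda_i = p_i / (2 N tau): the dropout training objective
L_dropout = (np.sum((Y - Y_hat) ** 2) / (2 * N)
             + p1 / (2 * N * tau) * np.sum(M1 ** 2)
             + p2 / (2 * N * tau) * np.sum(M2 ** 2))
```

Maximising the stochastic bound is therefore the same optimisation as minimising the standard dropout loss with weight decay, which is the paper's central claim.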
More Applications

- Convolutional Neural Networks:
  Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, arXiv:1506.02158, 2015
- Recurrent Neural Networks:
  A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, arXiv:1512.05287, 2015
- Reinforcement Learning:
  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, arXiv:1506.02142, 2015