Dropout as a Bayesian Approximation: Insights and Applications
Yarin Gal and Zoubin Ghahramani
Discussion by: Chunyuan Li
Jan. 15, 2016
Main idea

- In the framework of variational inference, the authors show that the standard algorithm of SGD training with dropout essentially optimizes a stochastic lower bound of a Gaussian process whose kernel takes the form of a neural network.
Outline

Preliminaries
- Dropout
- Gaussian Processes
- Variational Inference

A Gaussian Process Approximation
- A Gaussian Process Approximation
- Evaluating the Log Evidence Lower Bound
Dropout

Procedure
- Training stage: a unit is present with probability p
- Testing stage: the unit is always present and its weights are multiplied by p

Intuition
- Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing
- A single neural network approximates averaging the outputs of these thinned networks at test time
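The two stages above can be sketched in a few lines of NumPy (a minimal illustration; the array shapes and keep probability are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8  # a unit is present (kept) with probability p

def dropout_train(h, p, rng):
    # Training stage: each unit is independently present with probability p
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    # Testing stage: every unit is present; activations are scaled by p
    return h * p

h = rng.standard_normal((4, 10))
# Averaging many thinned networks matches the test-time scaling in expectation
mean_train = np.mean([dropout_train(h, p, rng) for _ in range(20000)], axis=0)
```

This illustrates why the test-time rescaling by p approximates the average over thinned networks.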
Dropout for one-hidden-layer Neural Networks

- Dropout on local units*

  y = \big( g((x \odot b_1) W_1) \odot b_2 \big) W_2    (1)

  - Input x and output y
  - g: activation function; weights W_\ell \in \mathbb{R}^{K_{\ell-1} \times K_\ell}
  - b_\ell are binary dropout variables
- Equivalent to multiplying the global weight matrices by the binary vectors, dropping out entire rows:

  y = g\big( x (\mathrm{diag}(b_1) W_1) \big) (\mathrm{diag}(b_2) W_2)    (2)

- Application to regression:

  L = \frac{1}{2N} \sum_{n=1}^N \| y_n - \hat{y}_n \|_2^2 + \lambda (\| W_1 \|^2 + \| W_2 \|^2)    (3)

* The bias term is ignored for simplicity.
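The equivalence between dropping local units, Eq. (1), and zeroing rows of the global weight matrices, Eq. (2), can be checked numerically (a sketch with arbitrary small dimensions and tanh as the activation g):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, K, D = 5, 3, 7, 2
x = rng.standard_normal((N, Q))
W1 = rng.standard_normal((Q, K))
W2 = rng.standard_normal((K, D))
b1 = (rng.random(Q) < 0.5).astype(float)  # binary dropout variables
b2 = (rng.random(K) < 0.5).astype(float)
g = np.tanh

# Eq. (1): dropout on local units via elementwise masking
y_local = (g((x * b1) @ W1) * b2) @ W2
# Eq. (2): dropout via zeroed rows of the global weight matrices
y_global = g(x @ (np.diag(b1) @ W1)) @ (np.diag(b2) @ W2)
```

The two forward passes produce identical outputs, which is the row-dropping reinterpretation the derivation relies on.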
Gaussian Processes

- f is the GP function:

  \underbrace{p(f \mid X, Y)}_{\text{Posterior}} \propto \underbrace{p(f)}_{\text{Prior}} \; \underbrace{p(Y \mid X, f)}_{\text{Likelihood}}

- Applications
  - Regression:

    F \mid X \sim \mathcal{N}(0, K(X, X)), \quad Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)
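A minimal GP regression sketch, using an RBF kernel as an illustrative choice of K(·,·) (the paper's own kernel, built from a neural network, is introduced later) and toy data:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, ell=1.0):
    # Squared-exponential kernel; an illustrative choice of K(.,.)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

X = np.linspace(-3, 3, 20)
Y = np.sin(X) + 0.1 * rng.standard_normal(20)  # toy data
tau = 100.0                                    # noise precision in Y|F

# Posterior predictive mean at x*: K(x*,X) (K(X,X) + tau^{-1} I)^{-1} Y
Xs = np.array([0.0])
mean = rbf(Xs, X) @ np.linalg.solve(rbf(X, X) + np.eye(20) / tau, Y)
```

The predictive mean combines the prior covariance with the noisy observations, which is the closed form the regression model above admits.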
Variational Inference

- The predictive distribution:

  p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w) \underbrace{p(w \mid X, Y)}_{\approx\, q(w)} \, dw    (4)

  - Objective: \arg\min_q \mathrm{KL}\big( q(w) \,\|\, p(w \mid X, Y) \big)
  - Variational prediction: q(y^* \mid x^*) = \int p(y^* \mid x^*, w) q(w) \, dw
- Log evidence lower bound:

  \mathcal{L}_{\mathrm{VI}} = \int q(w) \log p(Y \mid X, w) \, dw - \mathrm{KL}\big( q(w) \,\|\, p(w) \big)    (5)

  - Objective: \arg\max_q \mathcal{L}_{\mathrm{VI}}
  - A variational distribution q(w) should explain the data well while staying close to the prior
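The variational prediction integral is typically evaluated by Monte Carlo: draw w_t ~ q(w) and average the model's predictions. A toy sketch with a linear model y* = w·x* and a made-up Gaussian q(w):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model y* = w * x* with a (made-up) Gaussian variational posterior q(w)
q_mean, q_std = 1.5, 0.3
x_star = 2.0

# Monte Carlo form of q(y*|x*) = \int p(y*|x*, w) q(w) dw:
# draw w_t ~ q(w) and average the predictions
w_samples = rng.normal(q_mean, q_std, size=5000)
y_star_mean = np.mean(w_samples * x_star)
```

For dropout networks, this averaging over posterior samples is exactly what running multiple stochastic forward passes at test time computes.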
A single-layer neural network example

- Setup
  - Q: input dimension, K: number of hidden units, D: output dimension
- Goal: learn W_1 \in \mathbb{R}^{Q \times K} and W_2 \in \mathbb{R}^{K \times D} to map X \in \mathbb{R}^{N \times Q} to Y \in \mathbb{R}^{N \times D}
- Idea: introduce W_1 and W_2 into the GP approximation
Introduce W_1

- Define the kernel of the GP:

  K(x, x') = \int p(w) \, g(w^\top x) \, g(w^\top x') \, dw    (6)

- Resort to Monte Carlo integration, with the generative process:

  w_k \sim p(w), \quad W_1 = [w_k]_{k=1}^K, \quad \hat{K}(x, x') = \frac{1}{K} \sum_{k=1}^K g(w_k^\top x) \, g(w_k^\top x')

  F \mid X, W_1 \sim \mathcal{N}(0, \hat{K}), \quad Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)    (7)

  where K is the number of hidden units
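The Monte Carlo kernel estimate can be written directly (a sketch assuming p(w) = N(0, I) and g = tanh; two independent estimates should agree closely for large K):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, K = 3, 2000  # K hidden units; larger K tightens the MC estimate
x  = rng.standard_normal(Q)
xp = rng.standard_normal(Q)
g = np.tanh

def k_hat(x, xp, K, rng):
    # K-sample Monte Carlo estimate of the kernel integral in Eq. (6)
    W1 = rng.standard_normal((len(x), K))  # columns w_k ~ N(0, I)
    return np.mean(g(x @ W1) * g(xp @ W1))

k1 = k_hat(x, xp, K, rng)
k2 = k_hat(x, xp, K, rng)  # an independent estimate of the same kernel value
```

The key move is that the number of MC samples is identified with the number of hidden units, so the random features are exactly a random hidden layer.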
Introduce W_2

- Analytically integrating with respect to F, the predictive distribution

  p(Y \mid X) = \int p(Y \mid F) \, p(F \mid X, W_1) \, p(W_1) \, dW_1 \, dF    (8)

  can be rewritten as

  p(Y \mid X) = \int \mathcal{N}(Y; 0, \Phi \Phi^\top + \tau^{-1} I_N) \, p(W_1) \, dW_1    (9)

  where \Phi = [\phi_n]_{n=1}^N, \; \phi_n = \sqrt{1/K} \, g(W_1^\top x_n)

- For W_2 = [w_d]_{d=1}^D with w_d \sim \mathcal{N}(0, I_K):

  \mathcal{N}(y_d; 0, \Phi \Phi^\top + \tau^{-1} I_N) = \int \mathcal{N}(y_d; \Phi w_d, \tau^{-1} I_N) \, \mathcal{N}(w_d; 0, I_K) \, dw_d

  p(Y \mid X) = \int p(Y \mid X, W_1, W_2) \, p(W_1) \, p(W_2) \, dW_1 \, dW_2    (10)
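The Gaussian identity used above, that y_d = Φ w_d + noise with w_d ~ N(0, I_K) has marginal covariance ΦΦ^⊤ + τ^{-1} I_N, can be verified by simulation (small made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, tau, T = 4, 6, 4.0, 200_000
Phi = rng.standard_normal((N, K)) / np.sqrt(K)

# Sample y = Phi w + eps with w ~ N(0, I_K), eps ~ N(0, tau^{-1} I_N)
w = rng.standard_normal((T, K))
ys = w @ Phi.T + rng.standard_normal((T, N)) / np.sqrt(tau)

emp_cov = ys.T @ ys / T                  # empirical covariance of y
target = Phi @ Phi.T + np.eye(N) / tau   # the marginal covariance in Eq. (9)
```

This is the weight-space view of the GP: integrating out w_d recovers the function-space covariance, so introducing W_2 changes nothing about the model.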
Variational Inference in the Approximate Model

  q(W_1, W_2) = q(W_1) \, q(W_2)    (11)

- To mimic dropout, q(W_1) is factorised over the input dimensions, each factor a two-component Gaussian mixture:

  q(W_1) = \prod_{q=1}^Q q(w_q), \quad q(w_q) = p_1 \mathcal{N}(m_q, \sigma^2 I_K) + (1 - p_1) \mathcal{N}(0, \sigma^2 I_K)    (12)

  with p_1 \in [0, 1], \; m_q \in \mathbb{R}^K

- The same for q(W_2):

  q(W_2) = \prod_{k=1}^K q(w_k), \quad q(w_k) = p_2 \mathcal{N}(m_k, \sigma^2 I_D) + (1 - p_2) \mathcal{N}(0, \sigma^2 I_D)    (13)

  with p_2 \in [0, 1], \; m_k \in \mathbb{R}^D

- Optimise over the variational parameters, especially M_1 = [m_q]_{q=1}^Q and M_2 = [m_k]_{k=1}^K
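Sampling from one factor q(w_q) of Eq. (12) makes the connection to dropout visible: with small σ, a row is either near m_q (kept) or near 0 (dropped). A sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(6)
K, sigma, p1 = 5, 0.01, 0.8
m_q = rng.standard_normal(K)

def sample_wq(m, p, sigma, rng):
    # Two-component mixture: mean m with prob p, mean 0 with prob 1 - p
    centre = m if rng.random() < p else np.zeros_like(m)
    return centre + sigma * rng.standard_normal(m.shape)

samples = np.array([sample_wq(m_q, p1, sigma, rng) for _ in range(50_000)])
# The mean of q(w_q) is p1 * m_q, i.e. the dropout-scaled row
```

The mixture mean p1·m_q is exactly the test-time weight scaling from the dropout procedure.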
Evaluating the Log Evidence Lower Bound for Regression

- Log evidence lower bound:

  \mathcal{L}_{\mathrm{GP\text{-}VI}} = \underbrace{\int q(W_1, W_2) \log p(Y \mid X, W_1, W_2) \, dW_1 \, dW_2}_{\mathcal{L}_1} - \underbrace{\mathrm{KL}\big( q(W_1, W_2) \,\|\, p(W_1, W_2) \big)}_{\mathcal{L}_2}    (14)

- Approximation of \mathcal{L}_2^\dagger
  - For large enough K we can approximate the KL divergence term as

    \mathrm{KL}\big( q(W_1) \,\|\, p(W_1) \big) \approx QK(\sigma^2 - \log \sigma^2 - 1) + \frac{p_1}{2} \sum_{q=1}^Q m_q^\top m_q    (15)

  - Similarly for \mathrm{KL}\big( q(W_2) \,\|\, p(W_2) \big)

† Following Proposition 1 on the KL of a mixture of Gaussians in the Appendix.
\mathcal{L}_1: Monte Carlo integration

- Parameterization:

  \mathcal{L}_1 = \int q(b_1, \epsilon_1, b_2, \epsilon_2) \log p\big( Y \mid X, W_1(b_1, \epsilon_1), W_2(b_2, \epsilon_2) \big) \, db_1 \, db_2 \, d\epsilon_1 \, d\epsilon_2    (16)

  W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1
  W_2 = \mathrm{diag}(b_2)(M_2 + \sigma \epsilon_2) + (1 - \mathrm{diag}(b_2)) \sigma \epsilon_2    (17)

  where \epsilon_1 \sim \mathcal{N}(0, I_{Q \times K}), \; b_{1,q} \sim \mathrm{Bernoulli}(p_1),
        \epsilon_2 \sim \mathcal{N}(0, I_{K \times D}), \; b_{2,k} \sim \mathrm{Bernoulli}(p_2)    (18)

- Take a single sample:

  \mathcal{L}_{\mathrm{GP\text{-}MC}} = \underbrace{\log p(Y \mid X, \hat{W}_1, \hat{W}_2)}_{\mathcal{L}_{1\text{-}\mathrm{MC}}} - \mathrm{KL}\big( q(W_1, W_2) \,\|\, p(W_1, W_2) \big)    (19)

  Optimising \mathcal{L}_{\mathrm{GP\text{-}MC}} converges to the same limit as \mathcal{L}_{\mathrm{GP\text{-}VI}}.
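Eq. (17) is a reparameterisation: sampling W_1 reduces to sampling b_1 and ε_1. A sketch with made-up sizes (note the expression simplifies to diag(b_1) M_1 + σ ε_1 because the same ε_1 appears in both terms):

```python
import numpy as np

rng = np.random.default_rng(7)
Q, K, sigma, p1 = 3, 4, 0.05, 0.7
M1 = rng.standard_normal((Q, K))

def sample_W1(M1, p1, sigma, rng):
    Q, K = M1.shape
    eps1 = rng.standard_normal((Q, K))       # eps1 ~ N(0, I_{QxK})
    b1 = (rng.random(Q) < p1).astype(float)  # b1q ~ Bernoulli(p1)
    # Eq. (17): rows with b = 1 are centred at the rows of M1,
    # rows with b = 0 are centred at zero
    return np.diag(b1) @ (M1 + sigma * eps1) + np.diag(1 - b1) @ (sigma * eps1)

W1s = np.array([sample_W1(M1, p1, sigma, rng) for _ in range(40_000)])
```

Because the randomness is pushed into b and ε, the log-likelihood becomes differentiable in M_1, which is what lets SGD optimise the single-sample bound.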
\mathcal{L}_1: Monte Carlo integration

- Regression:

  \mathcal{L}_{1\text{-}\mathrm{MC}} = \log p(Y \mid X, \hat{W}_1, \hat{W}_2)
    = \sum_{d=1}^D \log \mathcal{N}(y_d; \Phi w_d, \tau^{-1} I_N)
    = -\frac{ND}{2} \log(2\pi) + \frac{ND}{2} \log \tau - \sum_{d=1}^D \frac{\tau}{2} \| y_d - \Phi w_d \|_2^2    (20)

- Sum over the rows instead of the columns of Y:

  \sum_{d=1}^D \frac{\tau}{2} \| y_d - \Phi w_d \|_2^2 = \sum_{n=1}^N \frac{\tau}{2} \| y_n - \hat{y}_n \|_2^2    (21)

  where \hat{y}_n = \phi_n W_2 = \sqrt{1/K} \, g(x_n W_1) W_2
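Eq. (21) just reorders a sum over the squared entries of the residual matrix Y − ΦW_2; a quick numerical check with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(8)
N, K, D = 6, 4, 3
Y = rng.standard_normal((N, D))
Phi = rng.standard_normal((N, K))
W2 = rng.standard_normal((K, D))

R = Y - Phi @ W2  # residual matrix; y_d are its columns, y_n its rows
over_columns = sum(np.sum(R[:, d] ** 2) for d in range(D))  # Eq. (20) form
over_rows = sum(np.sum(R[n, :] ** 2) for n in range(N))     # Eq. (21) form
```

Summing per data point rather than per output dimension is what turns the bound into the familiar per-example squared loss.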
Recover SGD Training with Dropout

- Take the approximations of \mathcal{L}_1 and \mathcal{L}_2 for the bound, and ignore terms constant in the parameters (involving \tau and \sigma):

  \mathcal{L}_{\mathrm{GP\text{-}MC}} = -\frac{\tau}{2} \sum_{n=1}^N \| y_n - \hat{y}_n \|_2^2 - \frac{p_1}{2} \| M_1 \|_2^2 - \frac{p_2}{2} \| M_2 \|_2^2    (22)

- Letting \sigma tend to 0:

  W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1 \;\Rightarrow\; W_1 \approx \mathrm{diag}(b_1) M_1, \quad W_2 \approx \mathrm{diag}(b_2) M_2    (23)

  \hat{y}_n = \sqrt{1/K} \, g(x_n W_1) W_2 = \sqrt{1/K} \, g\big( x_n (\mathrm{diag}(b_1) M_1) \big) (\mathrm{diag}(b_2) M_2)    (24)
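Putting Eqs. (22) and (24) together: up to the scaling 1/(τN) and the identification λ_i = p_i/(2Nτ), a single stochastic evaluation of the bound is exactly the dropout objective of Eq. (3). A sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(9)
N, Q, K, D = 8, 3, 5, 2
tau, p1, p2 = 1.0, 0.8, 0.8
X = rng.standard_normal((N, Q))
Y = rng.standard_normal((N, D))
M1 = rng.standard_normal((Q, K))
M2 = rng.standard_normal((K, D))
g = np.tanh

b1 = (rng.random(Q) < p1).astype(float)
b2 = (rng.random(K) < p2).astype(float)
# Eq. (24): the sigma -> 0 forward pass is a dropout forward pass
Y_hat = np.sqrt(1.0 / K) * g(X @ np.diag(b1) @ M1) @ (np.diag(b2) @ M2)

# Eq. (22): single-sample stochastic lower bound (constant terms dropped)
L_gp_mc = (-tau / 2 * np.sum((Y - Y_hat) ** 2)
           - p1 / 2 * np.sum(M1 ** 2)
           - p2 / 2 * np.sum(M2 ** 2))

# Eq. (3) with lambda_i = p_i / (2 N tau): the dropout training objective
L_dropout = (np.sum((Y - Y_hat) ** 2) / (2 * N)
             + p1 / (2 * N * tau) * np.sum(M1 ** 2)
             + p2 / (2 * N * tau) * np.sum(M2 ** 2))
```

Maximising the stochastic bound is therefore the same optimisation as minimising the standard dropout loss with weight decay, which is the paper's central claim.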
More Applications

- Convolutional Neural Networks:
  Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, arXiv:1506.02158, 2015
- Recurrent Neural Networks:
  A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, arXiv:1512.05287, 2015
- Reinforcement Learning:
  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, arXiv:1506.02142, 2015