RANDOM TOPICS
stochastic gradient descent & Monte Carlo
MASSIVE MODEL FITTING

least squares:
minimize \tfrac{1}{2}\|Ax - b\|^2 = \sum_i \tfrac{1}{2}(a_i x - b_i)^2

SVM:
minimize \tfrac{1}{2}\|w\|^2 + h(LDw) = \tfrac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)

Big! (over 100K terms):
minimize f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)

low-rank factorization:
minimize \tfrac{1}{2}\|D - XY\|^2 = \sum_{ij} \tfrac{1}{2}(d_{ij} - x_i y_j)^2
THE BIG IDEA

minimize f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)

True gradient:
\nabla f = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)

Idea: choose a subset \Omega \subset [1, N] of the data (usually just one sample) and approximate
\nabla f \approx g_\Omega = \frac{1}{|\Omega|}\sum_{i \in \Omega} \nabla f_i(x)
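The approximation \nabla f \approx g_\Omega can be checked numerically. A minimal Python sketch (the 1-D least-squares data and all sizes here are illustrative assumptions, not from the slides): averaging \nabla f_i over a random subset \Omega tracks the full gradient.

```python
import random

random.seed(0)
n = 10_000
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

def grad_i(x, i):
    # gradient of the i-th term f_i(x) = (a_i x - b_i)^2 / 2
    return a[i] * (a[i] * x - b[i])

def full_grad(x):
    # true gradient: average over all n terms
    return sum(grad_i(x, i) for i in range(n)) / n

def subset_grad(x, m):
    # g_Omega: average over a random subset Omega of size m
    omega = random.sample(range(n), m)
    return sum(grad_i(x, i) for i in omega) / m

x = 2.0
g_true = full_grad(x)
g_est = subset_grad(x, 1000)
print(g_true, g_est)   # the subset estimate tracks the full gradient
```

Shrinking |\Omega| down to a single sample makes each step cheap but noisy; managing that noise is the point of the stepsize rules that follow.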
INFINITE SAMPLE VS FINITE SAMPLE

infinite sample:
minimize f(x) = E_s[f_s(x)] = \int_s f_s(x)\,p(s)\,ds

finite sample:
minimize f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)

We can solve the finite sample problem to high accuracy...
...but true accuracy is limited by the sample size.
SGD

Loop: select data, compute gradient, update.

exact gradient over M samples: g_k = \frac{1}{M}\sum_{i=1}^{M} \nabla f(x, d_i)
stochastic gradient: g_k \approx \nabla f(x, d_8), \quad g_k \approx \nabla f(x, d_{12}), \ \ldots
update: x_{k+1} = x_k - \tau_k g_k

Far from the minimizer, even a big gradient error lets the solution improve; near the minimizer, even a small error makes the solution get worse. The error must decrease as we approach the solution.

Classical solution: shrink the stepsize, \lim_{k\to\infty} \tau_k = 0, giving slow O(1/\sqrt{k}) convergence.
Variance reduction: correct the error in the gradient approximations.
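A minimal, self-contained SGD sketch in Python (the 1-D least-squares toy data and all constants are illustrative assumptions): single-sample gradients with a decreasing stepsize of the form \tau_k = a/(b + k).

```python
import random

random.seed(1)
n = 5000
coef = [random.uniform(-2, 2) for _ in range(n)]
target = [c * 3.0 + 0.1 * random.gauss(0, 1) for c in coef]   # solution near 3

x = 0.0
for k in range(20_000):
    i = random.randrange(n)                      # select data: one random sample
    g = coef[i] * (coef[i] * x - target[i])      # g_k = gradient of f_i at x
    tau = 50.0 / (500.0 + k)                     # decreasing stepsize a/(b + k)
    x = x - tau * g                              # update x_{k+1} = x_k - tau_k g_k

print(x)   # close to the least-squares solution, near 3
```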
EXAMPLE: WITHOUT DECREASING STEPSIZE

minimize \tfrac{1}{2}\|Ax - b\|^2

With an inexact gradient and a fixed stepsize, the objective stalls at a noise floor. Why does this happen? What's happening?

DECREASING STEP SIZE

x_{k+1} = x_k - \tau_k \nabla f_k(x), \qquad \tau_k = \frac{a}{b + k}

big stepsize early, small stepsize late; used for strongly convex problems.
Shrinking the stepsize is (almost) equivalent to choosing a larger sample. Why?
AVERAGING

x_{k+1} = x_k - \tau_k \nabla f_k(x), \qquad \tau_k = \frac{a}{\sqrt{k} + b}

\bar{x}_{k+1} = \frac{1}{k+1}\sum_i x_i \qquad \text{("ergodic" averaging)}

used for weakly convex problems. The average lags behind the iterates: why is this bad? does this limit the convergence rate?

AVERAGING

The ergodic average \bar{x}_{k+1} = \frac{1}{k+1}\sum_i x_i can be computed without storage:
\bar{x}_{k+1} = \frac{k}{k+1}\,\bar{x}_k + \frac{1}{k+1}\,x_{k+1}

short memory version, with \eta \ge 1:
\bar{x}_{k+1} = \frac{k}{k+\eta}\,\bar{x}_k + \frac{\eta}{k+\eta}\,x_{k+1}
This trades variance for bias.
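The running-average recursions can be checked on a toy sequence; a short Python sketch (the iterate sequence is an illustrative assumption):

```python
def running_average(xs, eta=1.0):
    # xbar_{k+1} = k/(k+eta) * xbar_k + eta/(k+eta) * x_{k+1}, no storage needed
    xbar = xs[0]
    for k, x in enumerate(xs[1:], start=1):
        xbar = (k / (k + eta)) * xbar + (eta / (k + eta)) * x
    return xbar

xs = [float(i) for i in range(1, 101)]     # pretend SGD iterates 1..100
print(running_average(xs))                 # eta = 1: equals the plain mean, 50.5
print(running_average(xs, eta=5.0))        # short memory: weights recent iterates more
```

With \eta = 1 the recursion reproduces the plain ergodic average; larger \eta forgets early iterates faster, trading variance for bias.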
KNOWN RATES: WEAKLY CONVEX

Theorem. Suppose f is convex, \|\nabla f(x)\| \le G, and the diameter of dom(f) is less than D. If we use the stepsize
\tau_k = \frac{c}{\sqrt{k}},
then
E[f(x_k) - f^\star] \le \left(\frac{D^2}{c} + cG^2\right)\frac{2 + \log(k)}{\sqrt{k}}.

See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11.
KNOWN RATES: STRONGLY CONVEX

Theorem (Shamir and Zhang, ICML '13). Suppose f is strongly convex with parameter m, and that \|\nabla f(x)\| \le G. If you use stepsize \tau_k = 1/(mk) and the limited-memory averaging
\bar{x}_{k+1} = \frac{k}{k+\eta}\,\bar{x}_k + \frac{\eta}{k+\eta}\,x_{k+1}
with \eta \ge 1, then
E[f(\bar{x}_k) - f^\star] \le 58\,(1 + \eta/k)\left(\eta(\eta+1) + \frac{(\eta + 0.5)^3(1 + \log k)}{k}\right)\frac{G^2}{mk}.
EXAMPLE: SVM

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

minimize \tfrac{1}{2}\|w\|^2 + C\,h(Aw), \qquad equivalently \qquad minimize \sum_i \left(\frac{\lambda}{2}\|w\|^2 + h(a_i w)\right)

The hinge term has subgradient
\nabla h(x) = \begin{cases} -1, & x < 1 \\ 0, & \text{otherwise} \end{cases}
(note: this is a "subgradient" descent method)

PEGASOS, with stepsize \tau_k = \frac{1}{\lambda k}:

While "not converged":
    If a_k^T w_k < 1:\quad w_{k+1} = w_k - \tau_k(\lambda w_k - a_k)
    If a_k^T w_k \ge 1:\quad w_{k+1} = w_k - \tau_k \lambda w_k

(The averaged iterate \bar{w}_{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} w_i appears in the analysis but is not used in practice.)
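A Pegasos-style sketch in Python, following the update rule above (the toy data, \lambda, and iteration count are illustrative assumptions; rows are pre-multiplied by their labels, a_i = y_i x_i):

```python
import random

random.seed(2)
n, lam = 400, 0.01
data = []
while len(data) < n:
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    s = x[0] + x[1]
    if abs(s) < 0.2:
        continue                            # enforce a margin: toy data is separable
    y = 1.0 if s > 0 else -1.0              # label from sign(x1 + x2)
    data.append([y * xi for xi in x])       # a_i = y_i * x_i

w = [0.0, 0.0]
for k in range(1, 20_001):
    a = random.choice(data)                 # select one data row
    tau = 1.0 / (lam * k)                   # Pegasos stepsize tau_k = 1/(lam * k)
    if sum(ai * wi for ai, wi in zip(a, w)) < 1:   # a_k^T w_k < 1: hinge active
        w = [wi - tau * (lam * wi - ai) for wi, ai in zip(w, a)]
    else:                                          # a_k^T w_k >= 1
        w = [wi - tau * lam * wi for wi in w]

# training accuracy: a_i^T w > 0 means sample i is classified correctly
acc = sum(1 for a in data if sum(ai * wi for ai, wi in zip(a, w)) > 0) / n
print(w, acc)
```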
PEGASOS

[Plots: objective vs. iteration for the gradient method, stochastic methods, and the finite sample method; classification error (CV).]
You hope to converge before SGD gets slow!
ALLREDUCE

...but if you don't: use SGD as a warm start for an iterative method.
Agarwal et al., "Effective Terascale Linear Learning," '12
SGD (recap): select data, compute gradient, update.

g_k \approx \nabla f(x, d_8), \qquad x_{k+1} = x_k - \tau_k g_k

The error must decrease as we approach the solution.

SGD + VARIANCE REDUCTION: select data, compute gradient, correct its error, update.

g_k \approx \nabla f(x, d_8) - \mathrm{error}_8, \qquad x_{k+1} = x_k - \tau_k g_k

Variance reduction solution: make the gradient more accurate, and preserve fast convergence.
VR APPROACHES

SAG: Le Roux, Schmidt, Bach, 2013
SVRG: Johnson, Zhang, 2013
SAGA: Defazio, Bach, Lacoste-Julien, 2014
CentralVR: a VR approach targeting distributed ML, "Efficient Distributed SGD with Variance Reduction," ICDM 2016 (shameless self-promotion)
many more...

The original approaches require full gradient computations; the gradient-tableau scheme below avoids full gradient computations.
SVRG

First epoch: store each sample's gradient in a gradient tableau,
\nabla f_1(x_m^1),\ \nabla f_2(x_m^2),\ \nabla f_3(x_m^3),\ \ldots,\ \nabla f_{n-1}(x_m^{n-1}),\ \nabla f_n(x_m^n)
and keep the average gradient over the last epoch,
g_m = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_m^i)

Later, when sample 3 is drawn: the new gradient is \nabla f_3(x_{m+1}^3), the stored error is (\nabla f_3(x_m^3) - g_m), and the corrected gradient is
\nabla f_3(x_{m+1}^3) - \left(\nabla f_3(x_m^3) - g_m\right)
The tableau entry for sample 3 is then overwritten with the new gradient.
SVRG GETS BACK DETERMINISTIC RATES

Theorem (Johnson and Zhang, 2013). Suppose the objective terms are m-strongly convex with L-Lipschitz gradients. If the learning rate is small enough, then
E[f(w_k) - f^\star] \le c^k\,[f(w_0) - f^\star]
for some c < 1.
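A minimal SVRG sketch in Python, in the epoch/snapshot form of Johnson and Zhang (the toy 1-D problem and all constants are illustrative assumptions): one full gradient per epoch, and inner steps that use the corrected gradient \nabla f_i(x) - \nabla f_i(\tilde{x}) + g.

```python
import random

random.seed(3)
n = 200
a = [random.uniform(0.5, 2.0) for _ in range(n)]
b = [ai * 4.0 for ai in a]          # each term is minimized exactly at x = 4

def grad_i(x, i):
    # gradient of f_i(x) = (a_i x - b_i)^2 / 2
    return a[i] * (a[i] * x - b[i])

def full_grad(x):
    return sum(grad_i(x, i) for i in range(n)) / n

x, tau = 0.0, 0.02                  # constant stepsize: no shrinking needed
for epoch in range(30):
    x_snap = x
    g_snap = full_grad(x_snap)      # one full gradient per epoch
    for _ in range(n):
        i = random.randrange(n)
        g = grad_i(x, i) - grad_i(x_snap, i) + g_snap   # variance-reduced gradient
        x = x - tau * g

print(x)   # converges linearly to x = 4 despite the constant stepsize
```

The correction term vanishes in expectation, so each step is still an unbiased gradient estimate, but its variance shrinks as x and the snapshot both approach the solution.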
MONTE CARLO METHODS

Methods that involve randomly sampling a distribution.
(Named after the Monte Carlo casino, in Monaco.)
BAYESIAN LEARNING

Goal: estimate parameters in a model, data = M(parameters); measure the data, estimate the parameters.

Ingredients:
prior: probability distribution of the unknown parameters
likelihood: probability of the data given the (unknown) parameters

(Thomas Bayes)
BAYES RULE

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

P(parameters|data) = \frac{P(data|parameters)\,P(parameters)}{P(data)}
(likelihood times prior, over the evidence)

P(parameters|data) \propto P(data|parameters)\,P(parameters)

We really care about the left-hand side: the "posterior distribution".
EXAMPLE: LOGISTIC REGRESSION

P(parameters|data) \propto P(data|parameters)\,P(parameters)

likelihood:
P(y_i = 1 \mid x_i) = \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}

P(y, X \mid w) = \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}
EXAMPLE: LOGISTIC REGRESSION

P(parameters|data) \propto P(data|parameters)\,P(parameters)

Example prior: \log P(w) = -|w|

P(w \mid y, X) \propto \left(\prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}\right) P(w)

\log P(w \mid y, X) = \log P(w) - \sum_i \log\left(1 + \exp(-y_i \cdot x_i^T w)\right) + \text{const}
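The last line translates directly into code; a small Python sketch (the toy X, y, and the 1-norm reading of |w| are illustrative assumptions):

```python
import math

def log_posterior(w, X, y):
    # log prior: log P(w) = -|w|, with |w| read as the 1-norm
    lp = -sum(abs(wj) for wj in w)
    # log likelihood: sum_i -log(1 + exp(-y_i * x_i^T w))
    for xi, yi in zip(X, y):
        margin = yi * sum(xij * wj for xij, wj in zip(xi, w))
        lp -= math.log(1.0 + math.exp(-margin))
    return lp

X = [[1.0, 0.5], [-1.0, 0.2], [0.3, -1.0]]
y = [1, -1, -1]
print(log_posterior([0.0, 0.0], X, y))   # = -3*log(2) at w = 0
```

Note the normalization constant never appears: this is exactly the quantity an MCMC sampler needs.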
BAYESIANS OFFER NO SOLUTIONS

The posterior P(w|D, l) lives on the whole solution space; it is not a single optimization solution. Monte Carlo methods randomly sample this solution space.

[Portraits: Bayes, Lagrange]
WHY MONTE CARLO?

You need to know more than just a max/minimizer: not only the optimization solution, but "error bars".

E(w) = \int w\,P(w)\,dw

\mathrm{Var}(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w)\,dw - \mu\mu^T
EXAMPLE: POKER BOTS

You can't solve your problem by other (faster) methods: maximize \log P(w), but we don't know the derivative.

Poker bots: model parameters describe player behavior.
2.5M different community cards; 10 players = 59K raise/fold/call permutations per round.
WHY MONTE CARLO?You’re incredibly lazymaximize logP (w)
I could differentiate this,I just don’t wanna
argmax logP (w) ⇡ maxk
{P (wk)}
MARKOV CHAINS

What is an MC? A random process in which each state depends only on the previous state.

An MC has a steady state if it is:
Irreducible: can visit any state starting at any state
Aperiodic: does not get trapped in deterministic cycles
METROPOLIS HASTINGS

Ingredients:
q(y|x): proposal distribution
p(x): posterior distribution

MH Algorithm:
Start with x_0.
For k = 1, 2, 3, \ldots
    Choose candidate y from q(y|x_k)
    Compute acceptance probability
    \alpha = \min\left\{1,\ \frac{p(y)\,q(x_k|y)}{p(x_k)\,q(y|x_k)}\right\}
    Set x_{k+1} = y with probability \alpha; otherwise x_{k+1} = x_k
CONVERGENCE

Theorem. Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and the distribution is equal to p.

Irreducible: the support of q must contain the support of p.
Aperiodic: there must be states with positive rejection probability.
METROPOLIS ALGORITHM

Ingredients: proposal distribution q(y|x), posterior distribution p(x).

Assume the proposal is symmetric: q(x_k|y) = q(y|x_k). Then the Metropolis-Hastings acceptance probability
\alpha = \min\left\{1,\ \frac{p(y)\,q(x_k|y)}{p(x_k)\,q(y|x_k)}\right\}
simplifies to the Metropolis rule
\alpha = \min\left\{1,\ \frac{p(y)}{p(x_k)}\right\}
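A minimal Metropolis sketch in Python (symmetric Gaussian random-walk proposal, so \alpha = \min\{1, p(y)/p(x_k)\}; the unnormalized target p(x) \propto e^{-x^2/2} and all constants are illustrative assumptions):

```python
import math, random

random.seed(4)

def p(x):
    return math.exp(-0.5 * x * x)     # unnormalized posterior: no Z needed

x = 3.0                               # start away from the mode
samples = []
for k in range(50_000):
    y = x + random.gauss(0, 1.0)              # symmetric proposal q(y|x)
    alpha = min(1.0, p(y) / p(x))             # Metropolis acceptance probability
    if random.random() < alpha:
        x = y                                  # accept
    samples.append(x)                          # otherwise x_{k+1} = x_k

burned = samples[5000:]                        # discard burn-in
mean = sum(burned) / len(burned)
var = sum(s * s for s in burned) / len(burned) - mean ** 2
print(mean, var)                               # approximately 0 and 1
```

Note that p is unnormalized throughout: the partition function is never computed.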
EXAMPLE: GMM

[Histogram of Metropolis iterates for a Gaussian mixture target. Andrieu, de Freitas, Doucet, Jordan '03]
PROPERTIES OF MH

Pros:
• We don't need a normalization constant for the posterior: instead of
P(parameters|data) = P(data|parameters)\,P(parameters) / P(data)
we only need
P(parameters|data) \propto P(data|parameters)\,P(parameters)
• We can run many chains in parallel (cluster/GPU)
• We don't need any derivatives
PROPERTIES OF MH

Cons:
"Mixing time" depends on the proposal distribution
• too wide = constant rejections = slow mixing
• too narrow = short movements = slow mixing
Samples are only meaningful at the stationary distribution
• "burn in" samples must be discarded
• many samples are needed because of correlations

[Figure: a narrow proposal at x_k on a wide posterior. This is bad...]
SIMULATED ANNEALING

Goal: maximize p(x).

Easy choice: run MCMC, then take \max_k p(x_k).
Better choice: sample the distribution p^{1/T_k}(x_k), where T_k is a "temperature".

Why is this better? Where does the name come from?
COOLING SCHEDULE

[Figure: samples of p^{1/T} as the temperature drops from hot to warm to cool to cold. Andrieu, de Freitas, Doucet, Jordan '03]
CONVERGENCE

Theorem (Granville, "Simulated annealing: A proof of convergence"). Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time starting at every temperature. For an annealing schedule with temperature
T_k = \frac{C}{\log(k + T_0)},
simulated annealing converges to a global optimum with probability 1.

SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
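A simulated-annealing sketch in Python, with Metropolis steps on p^{1/T_k} and the logarithmic schedule T_k = C/\log(k + T_0) (the bimodal target and every constant here are illustrative assumptions):

```python
import math, random

random.seed(5)

def p(x):
    # toy multimodal target: a local bump at x = -2, global maximum near x = +2
    return 0.3 * math.exp(-2 * (x + 2) ** 2) + math.exp(-2 * (x - 2) ** 2)

C, T0 = 1.0, 2.0
x = -2.0                                   # start in the wrong basin
best_x, best_p = x, p(x)
for k in range(1, 30_001):
    T = C / math.log(k + T0)               # cooling schedule T_k = C / log(k + T0)
    y = x + random.gauss(0, 2.0)           # symmetric random-walk proposal
    py, px = p(y), p(x)
    # Metropolis on p^(1/T): accept with probability min(1, (p(y)/p(x))^(1/T))
    if py >= px or random.random() < (py / px) ** (1.0 / T):
        x = y
    if p(x) > best_p:
        best_x, best_p = x, p(x)

print(best_x)                              # near the global maximizer x = 2
```

Early on (high T) the chain crosses between basins freely; as T falls it behaves more like greedy hill climbing and settles into the deepest basin it has found.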
WHEN TO USE SIMULATED ANNEALING

Flow chart: "Should I use simulated annealing?" -> no.

There is no practical way to choose the temperature schedule:
too fast = stuck in a local minimum (risky)
too slow = no different from MCMC
An act of desperation!
GIBBS SAMPLER

Want to sample P(x_1, x_2, x_3).

On stage k, pick some coordinate j and propose
q(y|x^k) = \begin{cases} p(y_j \mid x^k_{j^c}), & \text{if } y_{j^c} = x^k_{j^c} \\ 0, & \text{otherwise} \end{cases}

Using P(B) = \frac{P(A \text{ and } B)}{P(A|B)} and p(y) = p(y_j \text{ and } y_{j^c}), the MH acceptance ratio is always 1:

\alpha = \frac{p(y)\,q(x^k|y)}{p(x^k)\,q(y|x^k)} = \frac{p(y)\,p(x^k_j \mid x^k_{j^c})}{p(x^k)\,p(y_j \mid y_{j^c})} = \frac{p(y_{j^c})}{p(x^k_{j^c})} = 1
GIBBS SAMPLER

Want to sample P(x_1, x_2, x_3, \ldots): resample one coordinate at a time, conditioned on the rest.

Iterates:
x^2 \sim P(x_1 \mid x^1_2, x^1_3, x^1_4, \ldots, x^1_n)
x^3 \sim P(x_2 \mid x^2_1, x^2_3, x^2_4, \ldots, x^2_n)
x^4 \sim P(x_3 \mid x^3_1, x^3_2, x^3_4, \ldots, x^3_n)
\cdots
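A Gibbs sketch in Python on a toy two-variable target (a bivariate normal with correlation \rho, an illustrative assumption, chosen because its conditionals are known exactly: x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)):

```python
import random

random.seed(8)
rho = 0.8
sd = (1.0 - rho ** 2) ** 0.5          # conditional standard deviation
x1, x2 = 5.0, 5.0                     # start far from the mode
samples = []
for _ in range(30_000):
    x1 = random.gauss(rho * x2, sd)   # resample x1 ~ P(x1 | x2)
    x2 = random.gauss(rho * x1, sd)   # resample x2 ~ P(x2 | x1)
    samples.append((x1, x2))

burned = samples[1000:]               # discard burn-in
m1 = sum(s[0] for s in burned) / len(burned)
corr = sum(s[0] * s[1] for s in burned) / len(burned)
print(m1, corr)                       # approximately 0 and rho
```

Each step is a draw from an exact conditional, so every proposal is accepted; the cost is that strong correlation between coordinates slows mixing.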
APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): binary (0/1) random variables, visible units v and hidden units h, coupled by weights W:

E(v, h) = -a^T v - b^T h - v^T W h

P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad Z \text{ is the "partition function"}
APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM):
E(v, h) = -a^T v - b^T h - v^T W h, \qquad P(v, h) = \frac{1}{Z}e^{-E(v, h)}

To sample one visible unit, remove the normalization, aggregate the terms not involving v_i into a constant C, and cancel constants; the result is a sigmoid function:

P(v_i = 1 \mid h) = \frac{P(v_i = 1 \mid h)}{P(v_i = 0 \mid h) + P(v_i = 1 \mid h)}
= \frac{\exp(a_i + \sum_j w_{ij}h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij}h_j + C)}
= \frac{\exp(a_i + \sum_j w_{ij}h_j)}{1 + \exp(a_i + \sum_j w_{ij}h_j)}
= \sigma\Big(a_i + \sum_j w_{ij}h_j\Big)
BLOCK GIBBS FOR RBM

stage 1: freeze hidden, randomly sample all visible units:
P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij}h_j\Big)

stage 2: freeze visible, randomly sample all hidden units:
P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i w_{ij}v_i\Big)
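A block-Gibbs sketch in Python for a tiny RBM (the sizes, biases, and random weights are all illustrative assumptions): alternately resample all visible units given h, then all hidden units given v, using the sigmoid conditionals.

```python
import math, random

random.seed(6)

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

nv, nh = 4, 3
a = [0.1, -0.2, 0.0, 0.3]                            # visible biases (assumed)
b = [-0.1, 0.2, 0.0]                                 # hidden biases (assumed)
w = [[random.gauss(0, 0.5) for _ in range(nh)] for _ in range(nv)]

v = [random.randint(0, 1) for _ in range(nv)]
h = [random.randint(0, 1) for _ in range(nh)]
for _ in range(1000):
    # stage 1: freeze hidden, resample all visible units from P(v_i = 1 | h)
    v = [1 if random.random() < sigma(a[i] + sum(w[i][j] * h[j] for j in range(nh))) else 0
         for i in range(nv)]
    # stage 2: freeze visible, resample all hidden units from P(h_j = 1 | v)
    h = [1 if random.random() < sigma(b[j] + sum(w[i][j] * v[i] for i in range(nv))) else 0
         for j in range(nh)]

print(v, h)   # one approximate sample from P(v, h)
```

Because the RBM is bipartite, all units in a layer are conditionally independent given the other layer, so each stage samples a whole block at once.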
DEEP BELIEF NETS

DBN = layered RBM: a visible layer with hidden layers stacked above it.
Each layer depends only on the layer beneath it (feed forward).
The probability for each hidden node is a sigmoid function.
EXAMPLE: MNIST

Train a 3-layer DBN with 200 hidden units on 60K MNIST digits (Gan, Henao, Carlson, Carin '15).
Pre-training: layer-by-layer training.
Training: train all weights simultaneously; training is done using a Gibbs sampler.
A Gibbs sampler is then used to explore the final solution.

[Figures: training data, learned features, and observations sampled from the deep belief network using the Gibbs sampler.]
WHAT'S WRONG WITH GIBBS?

Susceptible to strong correlations between variables. (This is also a problem for MH: a bad proposal distribution.)
SLICE SAMPLER

Problems with MH: hard to choose the proposal distribution; does not exploit analytical information about the problem; long mixing time.

Slice sampler: lift p(x) to a density on (x, u) that is uniform under the graph of p:
q(x, u) = \begin{cases} 1, & \text{if } 0 \le u \le p(x) \\ 0, & \text{otherwise} \end{cases}

[Figure: the region under the curve p(x) in the (x, u) plane.]
“LIFTED INTEGRAL”

p(x) = \int q(x, u)\,du

\int f(x)\,p(x)\,dx = \int\!\!\int f(x)\,q(x, u)\,du\,dx

So sample this lifted distribution q instead of p.
SLICE SAMPLER

\int f(x)\,p(x)\,dx = \int\!\!\int f(x)\,q(x, u)\,du\,dx \qquad \text{do Gibbs on this:}

Freeze x: choose u uniformly from [0, p(x)]
Freeze u: choose x uniformly from \{x \mid p(x) \ge u\}
(we need an analytical formula for this super-level set)
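A 1-D slice-sampler sketch in Python for the unnormalized target p(x) = e^{-x^2/2} (an illustrative assumption), where the super-level set has the closed form \{x \mid p(x) \ge u\} = [-\sqrt{-2\log u},\ \sqrt{-2\log u}]:

```python
import math, random

random.seed(7)

def p(x):
    return math.exp(-0.5 * x * x)        # unnormalized target

x = 0.0
samples = []
for _ in range(20_000):
    u = p(x) * (1.0 - random.random())   # freeze x: u uniform on (0, p(x)]
    r = math.sqrt(-2.0 * math.log(u))    # slice {x : p(x) >= u} = [-r, r]
    x = random.uniform(-r, r)            # freeze u: x uniform on the slice
    samples.append(x)

mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
print(mean, var)                         # approximately 0 and 1
```

The draw for u uses (0, p(x)] rather than [0, p(x)] only to guard against log(0); the sampler itself is exactly the two-step Gibbs scheme above.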
MODEL FITTING

When the posterior is a product of terms,
p(x) = \prod_{i=1}^{n} p_i(x), \qquad \log p(x) = \sum_{i=1}^{n} \log p_i(x),
lift with one u_i per term:
q(x, u_1, u_2, \ldots, u_n) = \begin{cases} 1, & \text{if } u_i \le p_i(x)\ \forall i \\ 0, & \text{otherwise} \end{cases}

The support in u is a box, so
\int q(x, u_1, u_2, \ldots, u_n)\,du = \int_{u_1=0}^{p_1(x)}\!\int_{u_2=0}^{p_2(x)}\!\cdots\int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^{n} p_i(x)
and
\int f(x)\,p(x)\,dx = \int_x\int_u f(x)\,q(x, u)\,dx\,du
MODEL FITTING

\int f(x)\,p(x)\,dx = \int_x\int_u f(x)\,q(x, u)\,dx\,du

Freeze x: for all i, choose u_i uniformly from [0, p_i(x)]
Freeze u: choose x uniformly from \bigcap_i \{x \mid p_i(x) \ge u_i\}
SLICE SAMPLER PROPERTIES

pros
• does not require a proposal distribution
• mixes super fast
• simple implementation

cons
• needs analytical formulas for super-level sets
• hard in multiple dimensions

fixes / generalizations
• step-out methods
• random hyper-rectangles
STEP-OUT METHODS

[Figure: p(x) with slice level u.]
• pick a small interval
• double the interval width until the slice is contained
• sample uniformly from the slice using rejection sampling: throw away selections outside of the slice
HIGH DIMS: COORDINATE METHOD

\int f(x)\,p(x)\,dx = \int\!\!\int f(x)\,q(x, u)\,du\,dx \qquad \text{do coordinate Gibbs on this:}

Freeze x: choose u uniformly from [0, p(x)]
Freeze u: for i = 1, \ldots, n, choose x_i uniformly from \{x_i \mid p(x_1, \ldots, x_i, \ldots, x_n) \ge u\}

The stepping-out method can be used here. Radford Neal, "Slice Sampling," '03
HIGH DIMS: HYPER-RECTANGLE

Radford Neal, "Slice Sampling," '03
• choose a "large" rectangle
• randomly/uniformly place the rectangle around the current location
• draw a random proposal point from the rectangle
• if the proposal lies outside the slice, shrink the box so the proposal is on the boundary

Performs much better for poorly-conditioned objectives.
COMPARISON

method             | rate       | when to use
-------------------|------------|------------------------------------------------------------
Gradient/Splitting | e^{-ck}    | you value reliability and precision (moderate speed, high accuracy)
SGD                | 1/k        | you value speed over accuracy (high speed, moderate accuracy)
MCMC               | 1/\sqrt{k} | you value simplicity (no gradient) or need statistical inference (slow and inaccurate)
DO MATLAB EXERCISE: MCMC