Sparse Approximations to Bayesian Gaussian Processes
Matthias Seeger
University of Edinburgh
Collaborators
Neil Lawrence (Sheffield)
Chris Williams (Edinburgh)
Ralf Herbrich (MSR Cambridge)
Overview of the Talk
Gaussian processes and approximations
Understanding sparse schemes as likelihood approximations
Two schemes and their relationships
Fast greedy selection for the projected latent variables scheme (GP regression)
Why Sparse Approximations?
GPs lead to very powerful Bayesian methods for function fitting, classification, etc. Yet: (almost) nobody uses them!
Reason: horrible O(n³) scaling.
If sparse approximations work, there is a host of applications, e.g. as building blocks in Bayesian networks, etc.
Gaussian Process Models
Target y is separated from all other variables by the latent u, so inference is a finite-dimensional problem.
[Figure: graphical model with inputs x_i, latent outputs u_i, and targets y_i (i = 1, 2, 3); the u_i share a dense Gaussian prior with kernel K.]
Parameterisation
Data D = {(x_i, y_i) | i = 1,…,n}.
Latent outputs u = (u_1,…,u_n).
Approximate the posterior process P(u(·) | D) by a GP Q(u(·) | D): the conditional GP (prior) combined with an n-dim. Gaussian over u.
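In standard GP notation this reads (a sketch; the zero prior mean is an assumption made here for simplicity):

$$u \sim \mathcal{N}(0, K), \qquad P(y \mid u) = \prod_{i=1}^{n} P(y_i \mid u_i), \qquad Q(u(\cdot) \mid D) = \int P(u(\cdot) \mid u)\, Q(u \mid D)\, du$$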
GP Approximations
Most (non-MCMC) GP approximations use this representation.
Exact computation of Q(u | D) is intractable in general and needs to be approximated.
Attractive for sparse approximations: sequential fitting of Q(u | D) to P(u | D).
Assumed Density Filtering
Update (ADF step): include one likelihood term at a time and project back onto the Gaussian family by moment matching:
$$\hat{P}(u) \propto P(y_i \mid u_i)\, Q(u \mid D), \qquad Q^{\text{new}}(u \mid D) = \mathop{\mathrm{argmin}}_{Q' \text{ Gaussian}} \mathrm{D}\big[\hat{P}(u)\, \big\|\, Q'(u)\big]$$
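For Gaussian Q, the moment matching reduces to cheap updates of the marginal moments. A sketch in common ADF notation (Z_i, α_i, ν_i are standard names, introduced here for illustration):

$$Z_i = \int P(y_i \mid u_i)\, \mathcal{N}(u_i \mid \mu_i, \varsigma_i)\, du_i, \qquad \alpha_i = \frac{\partial \log Z_i}{\partial \mu_i}, \qquad \nu_i = -\frac{\partial^2 \log Z_i}{\partial \mu_i^2}$$

$$\mu_i^{\text{new}} = \mu_i + \varsigma_i\, \alpha_i, \qquad \varsigma_i^{\text{new}} = \varsigma_i\, (1 - \nu_i\, \varsigma_i)$$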
Towards Sparsity
ADF = Bayesian online learning [Opper]. Multiple updates: cavity method [Opper, Winther], EP [Minka].
Generalizations: EP [Minka], ADATAP [Csato, Opper, Winther: COW].
Sequential updates are suitable for sparse online or greedy methods.
Likelihood Approximations
Active set: I ⊂ {1,…,n}, |I| = d ≪ n.
Several sparse schemes can be understood as likelihood approximations: the likelihood is replaced by an approximation that depends on u_I only.
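In symbols (a sketch; t_i denotes a generic site approximation, a name used here for illustration):

$$P(y \mid u) = \prod_{i=1}^{n} P(y_i \mid u_i) \;\approx\; \prod_{i=1}^{n} t_i(u_I), \qquad Q(u \mid D) \propto P(u) \prod_{i=1}^{n} t_i(u_I)$$

Because the approximate likelihood touches u only through u_I ∈ R^d, the Gaussian Q(u | D) is determined by at most O(d²) free parameters.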
Likelihood Approximations (II)
[Figure: graphical model with n = 4 and active set I = {2,3}; each approximate likelihood term links its target y_i to the active latents u_I only.]
Likelihood Approximations (III)
For such sparse schemes:
At most O(d²) parameters.
Prediction in O(d²); O(d) for the mean only.
Approximations to the marginal likelihood (variational lower bound, ADATAP [COW]), PAC bounds [Seeger], etc., become cheap as well!
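A minimal sketch of why prediction costs O(d) for the mean and O(d²) for the variance, once the active-set posterior has been reduced to a weight vector alpha (length d) and a matrix M (d × d); the variable names are illustrative, not from the slides:

```python
import numpy as np

def predict(x_star, X_active, kernel, alpha, M):
    """Predictive mean/variance of a sparse GP at x_star.

    alpha : (d,)   weights such that mean(x) = k_I(x)^T alpha
    M     : (d, d) matrix such that var(x) = k(x, x) - k_I(x)^T M k_I(x)
    Both are assumed precomputed from the active-set posterior.
    """
    # d kernel evaluations against the active set
    k_I = np.array([kernel(x, x_star) for x in X_active])
    mean = k_I @ alpha                             # O(d)
    var = kernel(x_star, x_star) - k_I @ M @ k_I   # O(d^2)
    return mean, var
```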
Two Schemes
IVM [Lawrence, Seeger, Herbrich: LSH]: ADF with fast greedy forward selection.
Sparse greedy GPR [Smola, Bartlett: SB]: greedy but expensive. Can be sped up: projected latent variables [Seeger, Lawrence, Williams]. More general: sparse batch ADATAP [COW].
Not covered here: sparse online GP [Csato, Opper].
Informative Vector Machine
ADF, stopped after d inclusions [could do deletions, exchanges].
Fast greedy forward selection using criteria known from active learning.
Faster than the SVM on hard MNIST binary tasks, yet probabilistic (error bars, etc.).
Only d of the site parameters are non-zero.
Why So Simple?
Locality property of ADF: the new marginal Q_new(u_i) is obtained in O(1) from Q(u_i).
Locality property and Gaussianity yield simple closed-form relations between the moments of Q_new(u_i) and those of Q(u_i).
Fast evaluation of differential criteria.
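For instance, since an inclusion changes the joint covariance by a rank-one term, the differential entropy score reduces to marginal quantities. A sketch using the ADF notation above (this exact form is reconstructed here, not taken from the slide):

$$\Delta_i = H[Q] - H[Q^{\text{new}}] = -\tfrac{1}{2} \log\big(1 - \nu_i\, \varsigma_i\big)$$

which costs O(1) per candidate point.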
KL-Optimal Projections
Csato/Opper observed: restricting the representation to the active set and minimizing the relative entropy to the unrestricted approximation amounts to replacing each latent u_i by its conditional prior mean E[u_i | u_I].
KL-Optimal Projections (II)
For Gaussian likelihood: N(y_i | u_i, σ²) is replaced by N(y_i | E[u_i | u_I], σ²).
Can be used online or batch.
A bit unfortunate: we use relative entropy both ways around!
Projected Latent Variables
Full GPR samples u_I ∼ P(u_I), u_R ∼ P(u_R | u_I), y ∼ N(y | u, σ² I).
Instead: y ∼ N(y | E[u | u_I], σ² I). The latent variables u_R are replaced by projections in the likelihood [SB] (without this interpretation).
Note: sparse batch ADATAP [COW] is more general (non-Gaussian likelihoods).
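In symbols, with K_{·I} the n×d cross-kernel matrix and K_I the d×d active-set kernel matrix:

$$\mathbb{E}[u \mid u_I] = K_{\cdot I}\, K_I^{-1}\, u_I, \qquad y \sim \mathcal{N}\big(K_{\cdot I} K_I^{-1} u_I,\; \sigma^2 I\big)$$

Integrating out $u_I \sim \mathcal{N}(0, K_I)$ gives $y \sim \mathcal{N}(0,\; \sigma^2 I + K_{\cdot I} K_I^{-1} K_{I \cdot})$, i.e. the kernel matrix is replaced by a Nyström-type low-rank approximation.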
Fast Greedy Selections
With this likelihood approximation, typical forward selection criteria (MAP [SB]; differential entropy, information gain [LSH]) are too expensive.
Problem: upon inclusion, the latent u_i becomes coupled with all targets y.
Cheap criterion: ignore most couplings for score evaluation (not for the inclusion itself!).
Yet Another Approximation
To score x_i, we approximate Q_new(u | D) after inclusion of i by a cheaper update that ignores most of the couplings between u_i and the targets.
Example: information gain, Δ_i = D[Q_new(u | D) ‖ Q(u | D)].
Fast Greedy Selections (II)
Leads to O(1) criteria: the cost of searching over all remaining points is dominated by the cost of an inclusion (see the sketch below).
Can easily be generalized to allow for couplings between u_i and some targets, if desired.
Can be done for sparse batch ADATAP as well.
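A schematic of the resulting greedy loop (illustrative Python; cheap_score and include stand for the O(1) scoring and the exact inclusion update, and are names introduced here):

```python
def greedy_selection(candidates, d, cheap_score, include):
    """Greedy forward selection with cheap scoring.

    cheap_score(i) : O(1) approximate criterion (e.g. information gain),
                     ignoring most couplings between u_i and the targets.
    include(i)     : exact inclusion update, which does couple u_i
                     with all targets (the expensive part).
    """
    active, remaining = [], set(candidates)
    for _ in range(d):
        # Scoring all remaining points is O(n) per iteration and is
        # dominated by the cost of the inclusion itself.
        best = max(remaining, key=cheap_score)
        include(best)            # exact update, no approximation here
        active.append(best)
        remaining.remove(best)
    return active
```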
Marginal Likelihood
The marginal likelihood of the projected latent variables scheme is
$$\log P(y) = \log \mathcal{N}\big(y \,\big|\, 0,\; \sigma^2 I + K_{\cdot I}\, K_I^{-1}\, K_{I \cdot}\big)$$
It can be optimized efficiently w.r.t. σ² and the kernel parameters: O(n d (d+p)) per gradient, p the number of parameters.
Keep I fixed during line searches; reselect for search directions.
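A sketch of an O(n d²) evaluation of this quantity via the matrix inversion lemma and Cholesky factors (variable names are illustrative; the jitter term is a numerical-stability choice, not from the slides):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def plv_log_marglik(K_II, K_In, y, sigma2, jitter=1e-8):
    """log N(y | 0, sigma2*I + K_nI K_II^{-1} K_In) in O(n d^2).

    K_II : (d, d) kernel matrix on the active set
    K_In : (d, n) cross-kernel matrix, active set vs. all inputs
    y    : (n,)   targets
    """
    n, d = y.shape[0], K_II.shape[0]
    L = cholesky(K_II + jitter * np.eye(d), lower=True)
    V = solve_triangular(L, K_In, lower=True)        # (d, n): K_tilde = V^T V
    B = sigma2 * np.eye(d) + V @ V.T                 # (d, d)
    LB = cholesky(B, lower=True)
    beta = solve_triangular(LB, V @ y, lower=True)   # (d,)
    # quadratic form y^T C^{-1} y via the matrix inversion lemma
    quad = (y @ y - beta @ beta) / sigma2
    # log det C = (n - d) log sigma2 + log det B
    logdet = (n - d) * np.log(sigma2) + 2.0 * np.sum(np.log(np.diag(LB)))
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```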
Conclusions
Most sparse approximations can be understood as likelihood approximations.
Several schemes available, all O(n d²), yet the constants do matter here!
Fast information-theoretic criteria are effective for classification. Extension to active learning is straightforward.
Conclusions (II)
Missing: experimental comparison, especially to test the effectiveness of marginal likelihood optimization.
Extensions:
C classes: easy in O(n d² C²), maybe in O(n d² C).
Integrate with Bayesian networks [Friedman, Nachman].