Page 1: Firefly exact MCMC for Big Data

EXACT MCMC ON BIGDATA: THE TIP OF AN ICEBERG

University of Helsinki

Gianvito Siciliano

(2014 - Probabilistic Models for Big Data Seminar)

Page 2: Firefly exact MCMC for Big Data

AGENDA

1. MCMC intro:

• Bayesian Inference

• Sampling methods (Gibbs, MH)

2. MCMC and Big Data

• Issues

• Approximate solutions (SGLD, SGFS, MH Test)

3. Firefly Monte Carlo

4. Conclusions

Page 3: Firefly exact MCMC for Big Data

BAYESIAN MODELING

• Bayes' rule allows us to express the posterior over parameters in terms of the prior and likelihood terms:

P(θ | X) ∝ P(θ) ∏_{i=1}^{N} P(x_i | θ)

• To obtain quantities of interest from the posterior we usually need to evaluate an integral of the form E[f(θ) | X] = ∫ f(θ) P(θ | X) dθ

• The problem is that these integrals are usually impossible to evaluate analytically
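To make the proportionality concrete, here is a minimal numerical sketch; the model (Gaussian mean with known variance, N(0, 1) prior) and data are illustrative assumptions, not from the slides. It also shows why grids don't scale: the normalization works in 1-D but becomes intractable as the dimension of θ grows.

```python
import numpy as np

# Unnormalized posterior on a 1-D grid (assumed toy model, not from the slides).
x = np.random.normal(loc=2.0, scale=1.0, size=100)     # observed data
theta_grid = np.linspace(-5.0, 5.0, 1001)

log_prior = -0.5 * theta_grid**2                        # N(0, 1) prior
log_lik = np.array([np.sum(-0.5 * (x - t)**2) for t in theta_grid])

# P(theta | X) ∝ P(theta) * prod_i P(x_i | theta), handled in log space
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())                # avoid overflow
post /= post.sum() * (theta_grid[1] - theta_grid[0])    # normalize on the grid
```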

Page 4: Firefly exact MCMC for Big Data

MCMC

• Monte Carlo: simulation to draw quantities of interest from the distribution

• Markov Chain: stochastic process in which future states are independent of past states given the present state.

• Hence, MCMC is a class of methods with which we can simulate draws that are slightly dependent and are approximately from the posterior distribution.

Page 5: Firefly exact MCMC for Big Data

HOW TO SAMPLE?

In Bayesian statistics, there are generally two algorithms that you can use to draw pseudo-random samples from a distribution: the Gibbs sampler and the Metropolis-Hastings (MH) algorithm.

Gibbs Sampler

Used to sample from a joint distribution when we know the full conditional distribution for each parameter:

JD = p(θ1, . . . , θk)

The full conditional distribution is the distribution of a parameter conditional on the known information and all the other parameters:

FCD = p(θj | θ−j, X)

Metropolis-Hastings Algorithm

Used when…

• the posterior doesn't look like any distribution we know (no conjugacy)

• the posterior consists of more than 2 parameters (grid approximations intractable)

• some (or all) of the full conditionals do not look like any distribution we know (no Gibbs sampling for those whose full conditionals we don't know)

Page 6: Firefly exact MCMC for Big Data

Gibbs Sampler

1. Pick a vector of starting values θ^(0).

2. Start with any θj (order does not matter). Draw a value θ1^(1) from the full conditional p(θ1 | θ2^(0), θ3^(0), y).

3. Draw a value θ2^(1) (again, order does not matter) from the full conditional p(θ2 | θ1^(1), θ3^(0), y). Note that we must use the updated value θ1^(1).

4. Repeat for all parameters until we get M draws, with each draw being a vector θ^(t).

5. Optional burn-in and/or thinning.
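A minimal runnable sketch of these steps; the target (a bivariate normal with correlation ρ, whose full conditionals are known Gaussians) is an illustrative assumption, not from the slides.

```python
import numpy as np

rho, M = 0.8, 5000
theta = np.zeros(2)             # step 1: starting values theta^(0)
draws = np.empty((M, 2))

for t in range(M):
    # step 2: draw theta_1 from its full conditional p(theta_1 | theta_2, y)
    theta[0] = np.random.normal(rho * theta[1], np.sqrt(1 - rho**2))
    # step 3: draw theta_2 from p(theta_2 | theta_1, y), using the UPDATED theta_1
    theta[1] = np.random.normal(rho * theta[0], np.sqrt(1 - rho**2))
    draws[t] = theta            # step 4: collect M draws

samples = draws[500:]           # step 5: optional burn-in
```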

Page 7: Firefly exact MCMC for Big Data

MH Algorithm

1. Choose a starting value θ^(0).

2. At iteration t, draw a candidate θ* from a jumping distribution Jt(θ* | θ^(t−1)).

3. Compute the acceptance ratio:

r = [ p(θ* | y) / Jt(θ* | θ^(t−1)) ] / [ p(θ^(t−1) | y) / Jt(θ^(t−1) | θ*) ]

4. Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).

5. Repeat steps 2-4 M times to get M draws from p(θ | y), with optional burn-in and/or thinning.

Page 8: Firefly exact MCMC for Big Data

MH Algorithm (symmetric case)

If the jumping distribution is symmetric, Jt(θ* | θ^(t−1)) = Jt(θ^(t−1) | θ*), the proposal terms cancel and the ratio simplifies:

1. Choose a starting value θ^(0).

2. At iteration t, draw a candidate θ* from a jumping distribution Jt(θ* | θ^(t−1)).

3. Compute the acceptance ratio:

r = p(θ* | y) / p(θ^(t−1) | y)

4. Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).

5. Repeat steps 2-4 M times to get M draws from p(θ | y), with optional burn-in and/or thinning.
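A minimal random-walk Metropolis sketch of the symmetric case above; the target (an unnormalized standard normal log-posterior) and the step size are illustrative assumptions.

```python
import numpy as np

def metropolis(log_post, theta0, M=5000, step=0.5):
    """Random-walk Metropolis: symmetric Gaussian jumps, so the J_t terms
    cancel and r = p(theta* | y) / p(theta^(t-1) | y)."""
    theta = theta0
    lp = log_post(theta)
    draws = []
    for t in range(M):
        prop = theta + step * np.random.randn()     # candidate theta*
        lp_prop = log_post(prop)
        # accept with probability min(1, r), computed in log space
        if np.log(np.random.rand()) < lp_prop - lp:
            theta, lp = prop, lp_prop               # accept
        draws.append(theta)                         # else keep theta^(t-1)
    return np.array(draws)

# Example target: unnormalized standard normal log-posterior (assumed).
samples = metropolis(lambda th: -0.5 * th**2, theta0=0.0)
```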

Page 9: Firefly exact MCMC for Big Data

MCMC and BIG DATA

• The canonical MCMC algorithm proposes samples from a distribution Q and accepts or rejects the proposals with a rule that needs to examine the likelihoods of all data items

• Since all the data are processed at each iteration, the run-time may be excessive!

Propose: θ' ∼ Q(θ' | θ)

Accept with probability:

α = min( 1, [ Q(θ | θ') P(θ') ∏_{i=1}^{N} P(x_i | θ') ] / [ Q(θ' | θ) P(θ) ∏_{i=1}^{N} P(x_i | θ) ] )

If accept = True: θ ← θ'
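To see where the cost lands, here is a sketch of the likelihood part of that acceptance rule; the model and data are illustrative assumptions (Gaussian mean estimation), and the prior and proposal terms are omitted to isolate the O(N) cost.

```python
import numpy as np

# Both the numerator and the denominator above contain the full product over
# N likelihood terms, so every MH step costs complete passes over the data.
x = np.random.normal(2.0, 1.0, size=1_000_000)        # a "big" data set

def full_data_log_lik(theta):
    return np.sum(-0.5 * (x - theta) ** 2)            # touches every x_i

theta, theta_prop = 1.9, 2.1
log_ratio = full_data_log_lik(theta_prop) - full_data_log_lik(theta)
accept = np.log(np.random.rand()) < log_ratio         # two full passes per step
```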

Page 10: Firefly exact MCMC for Big Data

MCMC APPROXIMATE SOLUTIONS FOR BIG DATA

IDEA

• Assume that you have T units of computation to achieve the lowest possible error.

• Your MCMC procedure has a knob to control the bias/variance tradeoff.

So, during the sampling phase…

Turn left => SLOW: small bias, high variance
Turn right => FAST: strong bias, low variance

Page 11: Firefly exact MCMC for Big Data

SGLD & SGFS: knob = stepsize

Stochastic Gradient Langevin Dynamics [Welling & Teh, ICML 2011]

Langevin dynamics based on stochastic gradients.

• The idea is to extend the stochastic gradient descent optimization algorithm to include Gaussian noise via Langevin dynamics.

• One advantage of SGLD is that the entire data set never needs to be held in memory.

• Disadvantages:

• it has to read from external data at each iteration
• gradients are computationally expensive
• it needs a proper preconditioning matrix to decide the step size of the transition operator

Stochastic Gradient Fisher Scoring [Ahn et al., ICML 2012]

Built on SGLD, it tries to beat its predecessor by offering a three-phase procedure:

1. Burn-in: large stepsize.
2. Reached distribution: still a large stepsize; samples from the asymptotic Gaussian approximation of the posterior.
3. Further annealing: smaller stepsize to generate increasingly accurate samples from the true posterior.

• With this approach the algorithm tries to reduce the bias in the burn-in phase and then starts sampling to reduce variance.
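A sketch of a single SGLD transition following the Welling & Teh (2011) update rule (mini-batch gradient rescaled by N/n, plus Gaussian noise with variance equal to the stepsize); the gradient functions, data layout, and stepsize schedule are assumed placeholders.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, X, N, batch_size, eps):
    """One SGLD update: theta + (eps/2) * stochastic gradient + N(0, eps) noise.
    The stepsize eps is annealed toward zero over iterations (not shown)."""
    idx = np.random.choice(N, batch_size, replace=False)    # mini-batch
    grad = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(theta, X[i]) for i in idx)             # rescaled gradient
    noise = np.random.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad + noise                 # Langevin proposal
```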

Page 12: Firefly exact MCMC for Big Data

MH TEST: knob = confidence

CUTTING THE MH ALGORITHM BUDGET [Korattikara et al., ICML 2014]

…by conducting sequential hypothesis tests to decide whether to accept or reject a given sample, making the majority of these decisions based on a small fraction of the data.

• Works directly on the accept/reject step of the MH algorithm
• Accepts a proposal with a given confidence
• Applicable to problems where it is impossible to compute gradients
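A simplified sketch of the sequential-test idea (not the paper's exact procedure): grow a random subsample until a t-test on the mean per-point log-likelihood ratio is confident, at level ε, about which side of the threshold μ0 it falls on; μ0 is assumed to collect the prior, proposal, and uniform-draw terms of the MH rule, and log_lik_ratio_per_point is an assumed callable.

```python
import numpy as np
from scipy import stats

def approx_mh_accept(log_lik_ratio_per_point, N, mu0, eps=0.05, batch=100):
    """Accept iff the mean per-point log-lik ratio exceeds mu0, decided
    sequentially from a growing subsample with confidence 1 - eps."""
    seen = []
    perm = np.random.permutation(N)
    for start in range(0, N, batch):
        seen.extend(log_lik_ratio_per_point(perm[start:start + batch]))
        mean, sd, n = np.mean(seen), np.std(seen, ddof=1), len(seen)
        # standard error with finite-population correction
        se = sd / np.sqrt(n) * np.sqrt(1 - (n - 1) / (N - 1))
        t = (mean - mu0) / se
        if stats.t.sf(np.abs(t), n - 1) < eps:   # confident enough to decide
            return mean > mu0
    return np.mean(seen) > mu0                    # fell back to the full data
```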

Page 13: Firefly exact MCMC for Big Data

FIREFLY EXACT SOLUTION

ISSUE 1: the prohibitive cost of evaluating every likelihood term at every iteration (for big data sets)

ISSUE 2: the previous procedures construct an approximate transition operator (using subsets of data)

GOAL: obtain an exact procedure that leaves the true full-data posterior distribution invariant!

HOW: by querying only the likelihoods of a potentially small subset of the data at each iteration, yet simulating from the exact posterior

IDEA: introduce a collection of Bernoulli variables that turn on (and off) the data points whose likelihoods are calculated

Page 14: Firefly exact MCMC for Big Data

FLYMC: HOW IT WORKS

Assuming we have:

1. the target distribution: p(θ | x_1, …, x_N) ∝ p(θ) ∏_{n=1}^{N} L_n(θ)

2. the likelihood function: L_n(θ) = p(x_n | θ)

Computing all N likelihoods at every iteration is a bottleneck!

3. Assume that each product term L_n(θ) can be bounded by a cheaper lower bound: 0 ≤ B_n(θ) ≤ L_n(θ)

4. Introduce a binary variable z_n ∈ {0, 1} for each data point

5. Each z_n has the following (conditional) Bernoulli distribution:

p(z_n | x_n, θ) = [ (L_n(θ) − B_n(θ)) / L_n(θ) ]^{z_n} · [ B_n(θ) / L_n(θ) ]^{1 − z_n}

6. And augment the posterior with these N variables:

p(θ, z | x) ∝ p(θ) ∏_{n=1}^{N} p(x_n | θ) p(z_n | x_n, θ)
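The Bernoulli distribution in step 5, as a small sketch: log_L and log_B are assumed arrays of per-point log-likelihoods and log-bounds. Note that resampling every z_n touches every likelihood; the paper resamples only a subset of the z_n per iteration to preserve the speedup.

```python
import numpy as np

# P(z_n = 1 | x_n, theta) = (L_n - B_n) / L_n, so the tighter the bound,
# the rarer the bright points.
def resample_z(log_L, log_B, rng=np.random):
    p_bright = 1.0 - np.exp(log_B - log_L)   # (L - B) / L, computed in logs
    return rng.rand(len(p_bright)) < p_bright
```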

Page 15: Firefly exact MCMC for Big Data

FLYMC: HOW IT WORKS (cont.)

Same construction as the previous slide.

Why Exact? Because the marginal distribution of θ under the augmented model is still the correct posterior given in equation 1: marginalizing out the z_n recovers it.

Page 16: Firefly exact MCMC for Big Data

FLYMC: HOW IT WORKS (cont.)

Same construction as the previous slide.

Why Firefly? Because from this joint distribution we evaluate only those likelihood terms for which z_n = 1 (the "light" points); dark points contribute only their cheap bounds B_n(θ).

Page 17: Firefly exact MCMC for Big Data

FLYMC: THE REDUCED SPACE

• We simulate the Markov chain on the augmented space, including the z_n:

z_n = 0 => dark point (no likelihood computed)
z_n = 1 => light point (likelihood computed)

• If the bounds B_n are tight, the Markov chain will tend to occupy z_n = 0, leaving most points dark.

Page 18: Firefly exact MCMC for Big Data

ALGORITHM IMPL.
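The algorithm figure from this slide did not survive extraction; below is a hedged sketch of the FlyMC loop as we read it from the construction above, not the paper's reference implementation. The function names (log_L, log_B, log_prior_prod_B, where the last returns log p(θ) + Σ_n log B_n(θ) in closed form, cheap for exponential-family bounds), the random-walk proposal, and the subset-resampling fraction are all assumptions.

```python
import numpy as np

def flymc(theta0, z0, log_L, log_B, log_prior_prod_B, N, iters,
          step=0.1, resample_frac=0.1):
    """Sketch of the FlyMC loop. The joint density over (theta, z) is
    p(theta) * prod_n B_n(theta) * prod_{bright} (L_n - B_n) / B_n,
    so only bright points need individual likelihood evaluations."""
    theta, z = theta0, z0.copy()
    for _ in range(iters):
        bright = np.flatnonzero(z)

        def log_joint(th):
            base = log_prior_prod_B(th)                      # cheap, collective
            lL, lB = log_L(th, bright), log_B(th, bright)    # bright points only
            # log(L_n - B_n) - log(B_n), computed stably in log space
            return base + np.sum(np.log1p(-np.exp(lB - lL)) + lL - lB)

        # (a) random-walk MH update of theta given z
        prop = theta + step * np.random.randn(*np.shape(theta))
        if np.log(np.random.rand()) < log_joint(prop) - log_joint(theta):
            theta = prop

        # (b) resample a small random subset of the z_n given theta,
        # so no single iteration touches all N likelihoods
        idx = np.random.choice(N, max(1, int(resample_frac * N)), replace=False)
        p_bright = 1.0 - np.exp(log_B(theta, idx) - log_L(theta, idx))
        z[idx] = np.random.rand(len(idx)) < p_bright

        yield theta, z.sum()
```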

Page 19: Firefly exact MCMC for Big Data

FLYMC: LOWER BOUND

The lower bound B_n(θ) of each data point's likelihood L_n(θ) should satisfy 2 properties:

• Tightness, which determines the number of bright data points (M is the average number of bright points)

• The product of the bounds must be easy to compute (using scaled exponential-family lower bounds, whose product collapses analytically)

With this setting, we achieve a speedup of roughly N/M over the O(ND) evaluation time of regular MCMC.
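One concrete bound satisfying both properties for logistic regression is the Jaakkola-Jordan bound on log σ(a), which is quadratic (Gaussian-shaped) in the linear score, so the product over n stays in the exponential family. The per-point parameter ξ_n controls tightness (the bound is exact at a = ±ξ_n); reading it as the ε tuned on the next slide is our assumption.

```python
import numpy as np

# Jaakkola-Jordan lower bound on log sigma(a), tight at a = +/- xi.
def log_jj_bound(a, xi):
    lam = np.tanh(xi / 2.0) / (4.0 * xi)                 # lambda(xi)
    return -np.log1p(np.exp(-xi)) + (a - xi) / 2.0 - lam * (a**2 - xi**2)
```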

Page 20: Firefly exact MCMC for Big Data

MAP-OPTIMISATION

…in order to find an approximate maximum a posteriori (MAP) value of θ and to construct the B_n to be tight there.

The proposed algorithm versions (used in the experiments) are:

• Untuned FlyMC, with the choice of ε = 1.5 for all data points.

• MAP-tuned FlyMC, which performs a gradient descent optimization to find an ε value for each data point. (This yields bounds that are tighter around the MAP value of θ.)

• Regular full-posterior MCMC (for comparison)

Page 21: Firefly exact MCMC for Big Data

EXPERIMENTS

Expectation:

• slower in mixing
• faster in iterating

Results:

• FlyMC offers a speedup of at least one order of magnitude compared with regular MCMC

Page 22: Firefly exact MCMC for Big Data

CONCLUSIONS

FlyMC is an exact procedure that has the true full-data posterior as its target

The introduction of the binary latent variables is a simple and efficient idea

The lower bound is a requirement, and it can be difficult to obtain for many problems

Page 23: Firefly exact MCMC for Big Data

Acknowledgements

Reviewers: Dr. Antti Honkela, Dr. Arto Klami

