
Introduction to Markov chain Monte Carlo (MCMC)

and its role in modern Bayesian analysis

Phil Gregory

University of British Columbia

March 2010


Outline

1. Bayesian primer
2. Spectral line problem / Challenge of nonlinear models
3. Introduction to Markov chain Monte Carlo (MCMC): parallel tempering, hybrid MCMC
4. Mathematica MCMC demonstration
5. Conclusions



What is Bayesian Probability Theory (BPT)?

BPT = a theory of extended logic

Deductive logic is based on axiomatic knowledge. In science we never know that any theory of nature is true, because our reasoning is based on incomplete information: our conclusions are at best probabilities.

Any extension of logic to deal with situations of incomplete information (the realm of inductive logic) requires a theory of probability.


A new perception of probability has arisen, in recognition that the mathematical rules of probability are not merely rules for manipulating random variables. They are now recognized as valid principles of logic for conducting inference about any hypothesis of interest.

This view of "Probability Theory as Logic" was championed in the late 20th century by E. T. Jaynes ("Probability Theory: The Logic of Science", Cambridge University Press, 2003).

It is also commonly referred to as Bayesian Probability Theory, in recognition of the work of the 18th century English clergyman and mathematician Thomas Bayes.


Logic is concerned with the truth of propositions. A proposition asserts that something is true.


We will need to consider compound propositions, like A,B, which asserts that propositions A and B are true, and A,B|C, which asserts that propositions A and B are true given that proposition C is true.

Rules for manipulating probabilities:

Sum rule: $p(A|C) + p(\bar{A}|C) = 1$

Product rule: $p(A,B|C) = p(A|C)\,p(B|A,C) = p(B|C)\,p(A|B,C)$

Bayes' theorem: $p(A|B,C) = \dfrac{p(A|C)\,p(B|A,C)}{p(B|C)}$
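To make the rules concrete, here is a minimal numerical sketch (in Python, since the talk's own Mathematica demo is not reproduced in this transcript) applying the sum and product rules and Bayes' theorem to a two-hypothesis problem; the numbers are invented for illustration.

```python
# Bayes' theorem for a discrete, two-hypothesis problem:
# p(H|D,C) = p(H|C) p(D|H,C) / p(D|C)

prior = {"H": 0.3, "notH": 0.7}        # sum rule: p(H|C) + p(notH|C) = 1
likelihood = {"H": 0.8, "notH": 0.1}   # p(D|H,C) for each hypothesis

# Normalizing constant p(D|C), from the product and sum rules
p_D = sum(prior[h] * likelihood[h] for h in prior)

posterior = {h: prior[h] * likelihood[h] / p_D for h in prior}
print(posterior)   # {'H': 0.774..., 'notH': 0.225...}
```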


How to proceed in a Bayesian analysis: write down Bayes' theorem, identify the terms, and solve.

$p(H_i|D,I) = \dfrac{p(H_i|I)\;p(D|H_i,I)}{p(D|I)}$

Here $p(H_i|D,I)$ is the posterior probability that $H_i$ is true given the new data $D$ and prior information $I$; $p(H_i|I)$ is the prior probability; $p(D|H_i,I)$ is the likelihood; and $p(D|I)$ is the normalizing constant.

The likelihood $p(D|H_i,I)$, also written as $\mathcal{L}(H_i)$, stands for the probability that we would have gotten the data $D$ that we did if $H_i$ is true. Every item to the right of the vertical bar | is assumed to be true.


As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.

Two basic problems:

1. Model selection (discrete hypothesis space): "Which one of 2 or more models (hypotheses) is most probable given our current state of knowledge?" e.g.
- Hypothesis or model M0 asserts that the star has no planets.
- Hypothesis M1 asserts that the star has 1 planet.
- Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis): "Assuming the truth of M1, solve for the probability density distribution for each of the model parameters, based on our current state of knowledge." e.g.
- Hypothesis H asserts that the orbital period is between P and P+dP.


Significance of this development

Probabilities are commonly quantified by a real number between 0 and 1.

[Figure: a probability scale running from 0 (false) to 1 (true); the interior of the interval is the realm of science and inductive logic.]

The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers. Bayesian probability theory spans the whole range. Deductive logic is just a special case of Bayesian probability theory in the idealized limit of complete information.


Calculation of a simple likelihood $p(D|M,X,I)$

Let $d_i$ represent the $i$-th measured data value. We model $d_i$ by

$d_i = f_i(X) + e_i$

where $f_i(X)$ is the model prediction for the $i$-th data value for the current choice of parameters $X$, and $e_i$ represents the error component in the measurement.

Since $M, X$ is assumed to be true, if it were not for the error $e_i$, $d_i$ would equal the model prediction $f_i(X)$.

Now suppose prior information $I$ indicates that $e_i$ has a Gaussian probability distribution. Then

$p(D_i|M,X,I) = \dfrac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left(-\dfrac{e_i^2}{2\sigma_i^2}\right) = \dfrac{1}{\sigma_i\sqrt{2\pi}}\exp\!\left(-\dfrac{(d_i - f_i(X))^2}{2\sigma_i^2}\right)$


$p(D_i|M,X,I)$ is proportional to the height of the Gaussian error curve.

[Figure: a Gaussian error curve (probability density versus signal strength) centred on the predicted value $f_i(X)$; the measured $d_i$ lies a distance $e_i$ away.]

The probability of getting a data value $d_i$ a distance $e_i$ away from the predicted value $f_i(X)$ is proportional to the height of the Gaussian error curve at that location.


Calculation of a simple likelihood $p(D|M,X,I)$ (continued)

For independent data, the likelihood for the entire data set $D = (D_1, D_2, \ldots, D_N)$ is the product of $N$ Gaussians:

$p(D|M,X,I) = (2\pi)^{-N/2}\left[\prod_{i=1}^{N}\sigma_i^{-1}\right]\exp\!\left[-\,0.5\sum_{i=1}^{N}\frac{(d_i - f_i(X))^2}{\sigma_i^2}\right]$

The sum in the exponent is the familiar $\chi^2$ statistic used in least-squares, so maximizing the likelihood corresponds to minimizing $\chi^2$.

Recall: Bayesian posterior $\propto$ prior $\times$ likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
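Since the slide reduces the Gaussian likelihood to $\chi^2$, a short sketch may help. This is a minimal Python illustration (not the talk's Mathematica code); the straight-line model standing in for $f_i(X)$ and its numbers are invented for the example.

```python
import numpy as np

def chi2(d, f, sigma):
    """The chi-squared statistic for data d, model predictions f, errors sigma."""
    return np.sum(((d - f) / sigma) ** 2)

def log_likelihood(d, f, sigma):
    """Log of the product of N independent Gaussians; maximizing this
    is the same as minimizing chi2."""
    return (-0.5 * len(d) * np.log(2 * np.pi)
            - np.sum(np.log(sigma))
            - 0.5 * chi2(d, f, sigma))

# Illustrative data from a hypothetical linear model f_i(X) = a + b*x_i
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
sigma = np.full_like(x, 0.5)
d = 2.0 + 0.3 * x + rng.normal(0, sigma)

print(chi2(d, 2.0 + 0.3 * x, sigma), log_likelihood(d, 2.0 + 0.3 * x, sigma))
```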


Simple example of when not to use a uniform prior

In the exoplanet problem, the prior range for the unknown orbital period $P$ is very large, from ~1 day to 1000 yr (the upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the $P$ parameter. This would imply that we believed it ~$10^4$ times more probable that the true period was in the upper decade ($10^4$ to $10^5$ d) of the prior range than in the lowest decade, from 1 to 10 d:

$\dfrac{\int_{10^4}^{10^5} p(P|M,I)\,dP}{\int_{1}^{10} p(P|M,I)\,dP} = 10^4$

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.


Jeffreys prior (scale invariant)

$p(P|M,I)\,dP = \dfrac{dP}{P\,\ln(P_{\max}/P_{\min})}$  or equivalently  $p(\ln P|M,I)\,d\ln P = \dfrac{d\ln P}{\ln(P_{\max}/P_{\min})}$

Equal probability per decade:

$\int_{1}^{10} p(P|M,I)\,dP = \int_{10^4}^{10^5} p(P|M,I)\,dP$

Actually, there are good reasons for searching in orbital frequency $f = 1/P$ instead of $P$. The form of the prior (a modified Jeffreys prior in frequency) is unchanged:

$p(\ln f|M,I)\,d\ln f = \dfrac{d\ln f}{\ln(f_{\max}/f_{\min})}$
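A quick numerical check of the equal-probability-per-decade property; a minimal Python sketch (illustrative, not from the talk), with the prior range assumed to be 1 to 10^5 days:

```python
import numpy as np
from scipy.integrate import quad

P_min, P_max = 1.0, 1e5          # assumed prior range in days, for illustration

def jeffreys(P):
    """Jeffreys prior density p(P|M,I) = 1 / (P ln(P_max/P_min))."""
    return 1.0 / (P * np.log(P_max / P_min))

# Probability in the lowest decade (1-10 d) and the highest (1e4-1e5 d)
lo, _ = quad(jeffreys, 1.0, 10.0)
hi, _ = quad(jeffreys, 1e4, 1e5)
print(lo, hi)   # both 0.2: each of the 5 decades carries equal probability
```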


Integration, not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations (i.e. Monte Carlo simulations).

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer


Simple Spectral Line Problem

Background (prior) information: two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon, which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.

Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell Telescope on Mauna Kea, and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

[Figure: the measured spectrum.]

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.


Simple Spectral Line Problem

The predicted line shape has the form shown in the figure [the equation itself did not survive this transcript: a line profile $f_i(\nu_0,\sigma_L)$ of amplitude $T$], where the signal strength is measured in temperature units of mK and $T$ is the amplitude of the line. The frequency $\nu_i$ is in units of the spectrometer channel number, and the line center frequency is $\nu_0$.

[Figure: line profile for a given $\nu_0$, $\sigma_L$.]

In this version of the problem $T$, $\nu_0$, $\sigma_L$ are all unknowns, with prior limits:

T = 0.0 - 100.0
ν0 = 1 - 44
σL = 0.5 - 4.0

Extra noise term $e_{0i}$

We will represent the measured data by the equation

$d_i = f_i + e_i + e_{0i}$

where
- $d_i$ = $i$-th measured data value
- $f_i$ = model prediction
- $e_i$ = component of $d_i$ which arises from measurement errors
- $e_{0i}$ = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction $f_i$

In the absence of detailed knowledge of the sampling distribution for $e_{0i}$, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e. maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for $e_{0i}$ with variance $s^2$. Thus the combination $e_i + e_{0i}$ has a Gaussian distribution with variance $= \sigma_i^2 + s^2$.

In a Bayesian analysis we marginalize the unknown $s$ (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and the known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for $s$ = 0 - 0.5 × data range.


Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and $s$?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters, the line temperature $T$, $\nu_0$, $\sigma_L$, and one nuisance parameter $s$. M0 has no unknown parameters and one nuisance parameter $s$.

Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable $T$, we derived the likelihood

$p(D|M_1,T,I) = (2\pi)^{-N/2}\,\sigma^{-N}\exp\!\left[-\sum_{i=1}^{N}\frac{(d_i - T f_i)^2}{2\sigma^2}\right]$

Our new likelihood for the more complicated model, with unknown variables $T, \nu_0, \sigma_L, s$, is

$p(D|M_1,T,\nu_0,\sigma_L,s,I) = (2\pi)^{-N/2}\,(\sigma^2+s^2)^{-N/2}\exp\!\left[-\sum_{i=1}^{N}\frac{(d_i - T f_i(\nu_0,\sigma_L))^2}{2(\sigma^2+s^2)}\right]$
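As a concrete illustration, here is a minimal Python sketch of this likelihood (the talk's own code is in Mathematica). The Gaussian channel profile used for $f_i(\nu_0,\sigma_L)$ is an assumption for illustration, since the exact profile equation did not survive the transcript.

```python
import numpy as np

def profile(nu, nu0, sigma_L):
    # Assumed unit-amplitude Gaussian line profile f_i(nu0, sigma_L)
    return np.exp(-((nu - nu0) ** 2) / (2.0 * sigma_L ** 2))

def log_like_M1(d, nu, T, nu0, sigma_L, s, sigma=1.0):
    """log p(D|M1, T, nu0, sigma_L, s, I) with extra-noise variance s^2."""
    var = sigma ** 2 + s ** 2
    resid = d - T * profile(nu, nu0, sigma_L)
    N = len(d)
    return -0.5 * N * np.log(2 * np.pi * var) - np.sum(resid ** 2) / (2 * var)

# 64-channel toy spectrum with a weak injected line (illustrative numbers)
rng = np.random.default_rng(0)
nu = np.arange(1, 65)
d = 1.5 * profile(nu, 20.0, 2.0) + rng.normal(0.0, 1.0, nu.size)
print(log_like_M1(d, nu, T=1.5, nu0=20.0, sigma_L=2.0, s=0.0))
```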

Simple nonlinear model with a single parameter α


The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.

[Figure: posterior densities for the four data sets, with the true value of α marked.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration, not minimization

In least-squares analysis we minimize some statistic, like $\chi^2$. In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, here $T$, we need to integrate the joint posterior over all the other parameters:

$p(T|D,M_1,I) = \int d\nu_0 \int d\sigma_L \int ds\; p(T,\nu_0,\sigma_L,s|D,M_1,I)$

The left-hand side is the marginal PDF for $T$; the integrand is the joint posterior PDF for the parameters, built from the data, model, and prior.

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e. Monte Carlo simulations. Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
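Before turning to MCMC, the marginalization integral can be done by brute-force quadrature when the parameter space is small. A minimal Python sketch over an assumed grid (illustrative only; it reuses log_like_M1 and the toy spectrum d, nu from the earlier sketch):

```python
import numpy as np

# Assumes log_like_M1, d, and nu from the previous sketch are in scope.
T_grid = np.linspace(0.0, 5.0, 40)
nu0_grid = np.linspace(1.0, 44.0, 44)
sL_grid = np.linspace(0.5, 4.0, 8)
s_grid = np.linspace(0.0, 2.0, 6)

# Joint log-posterior on the grid (uniform priors, so posterior ~ likelihood)
logp = np.array([[[[log_like_M1(d, nu, T, nu0, sL, s)
                    for s in s_grid]
                   for sL in sL_grid]
                  for nu0 in nu0_grid]
                 for T in T_grid])
post = np.exp(logp - logp.max())          # subtract max to avoid underflow

# Marginal PDF for T: integrate (here, sum) over nu0, sigma_L, and s
marg_T = post.sum(axis=(1, 2, 3))
marg_T /= marg_T.sum() * (T_grid[1] - T_grid[0])   # normalize to unit area
print(T_grid[np.argmax(marg_T)])                    # location of the marginal peak
```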

Numerical tools


[Schematic: numerical tools for Bayesian model fitting, organized by problem type.]

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares; no integration is required, and the solution is very fast using linear algebra (chapter 10).

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. The options are:
- Brute-force integration.
- Asymptotic approximations: peak-finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) combined with Laplace approximations (chapter 11).
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature (chapter 11).
- High dimensions: MCMC (chapter 12).

For some parameters, analytic integration is sometimes possible.


Chapters:

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces


Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability, or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.

Starting point: the Metropolis-Hastings MCMC algorithm

Let P(X|D,M,I) = the target posterior probability distribution, where X represents the set of model parameters.

1. Choose X₀, an initial location in the parameter space, and set t = 0.
2. Repeat:
- Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
- Sample a Uniform(0,1) random variable U.
- If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The second factor, q(X_t|Y)/q(Y|X_t), equals 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e. a normal distribution N(X_t, σ).
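A minimal Python implementation of the algorithm as just described, with a Gaussian (symmetric) proposal so the q-ratio drops out; the target used here is an arbitrary 1-D example, not the talk's spectral-line posterior:

```python
import numpy as np

def metropolis_hastings(log_post, x0, sigma, n_steps, rng=None):
    """Metropolis MCMC with a symmetric Gaussian proposal N(x_t, sigma)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    lp = log_post(x)
    for t in range(n_steps):
        y = x + rng.normal(0.0, sigma, x.size)     # propose Y ~ q(Y|X_t)
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:     # U <= p(Y|D,I)/p(X_t|D,I)
            x, lp = y, lp_y                        # accept: X_{t+1} = Y
        samples[t] = x                             # else keep X_{t+1} = X_t
    return samples

# Example: sample a 1-D standard normal posterior
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), [3.0], 1.0, 20000)
burned = chain[2000:]                              # discard burn-in
print(burned.mean(), burned.std())                 # ~0, ~1
```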

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]

MCMC parameter samples for a Kepler model with 2 planets

[Figure: post-burn-in MCMC samples of the orbital period parameters P1 and P2; convergence assessed with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.

Parallel tempering MCMC


The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

$p(X|D,M,\beta,I) = p(X|M,I)\;p(D|X,M,I)^{\beta}, \quad 0 < \beta \le 1$

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
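A compact Python sketch of the tempering idea, echoing the Metropolis step of the earlier sketch: each chain samples prior × likelihood^β, and adjacent chains occasionally propose a state swap. Illustrative only; the talk's actual implementation (in Mathematica) is more elaborate.

```python
import numpy as np

def tempered_mcmc(log_prior, log_like, x0, sigma, betas, n_steps, rng=None):
    """Parallel tempering: one Metropolis chain per beta, with adjacent swaps."""
    rng = rng or np.random.default_rng()
    n_chains = len(betas)
    x = np.tile(np.asarray(x0, float), (n_chains, 1))
    lp = np.array([log_prior(xi) + b * log_like(xi) for xi, b in zip(x, betas)])
    keep = []
    for t in range(n_steps):
        for c, b in enumerate(betas):              # Metropolis step in each chain
            y = x[c] + rng.normal(0.0, sigma, x[c].size)
            lp_y = log_prior(y) + b * log_like(y)
            if np.log(rng.uniform()) <= lp_y - lp[c]:
                x[c], lp[c] = y, lp_y
        c = rng.integers(n_chains - 1)             # propose swap of adjacent chains
        dbeta = betas[c + 1] - betas[c]
        dll = log_like(x[c]) - log_like(x[c + 1])
        if np.log(rng.uniform()) <= dbeta * dll:   # tempered swap acceptance
            x[[c, c + 1]] = x[[c + 1, c]]
            lp[c] = log_prior(x[c]) + betas[c] * log_like(x[c])
            lp[c + 1] = log_prior(x[c + 1]) + betas[c + 1] * log_like(x[c + 1])
        keep.append(x[-1].copy())                  # samples from the beta = 1 chain
    return np.array(keep)

# Example: tempered_mcmc(lambda x: 0.0, lambda x: -0.5*np.sum(x**2),
#                        [3.0], 1.0, [0.09, 0.35, 1.0], 5000)
```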

MCMC Technical Difficulties


1. Deciding on the burn-in period.

2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's; this can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (the latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in $\chi^2$. By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

MCMC details


Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Schematic: inputs are the data D, model M, and prior information I, together with n = number of iterations, the start parameters {X_α}_init, the start proposal σ's {σ_α}_init, and the tempering levels {β}. A hybrid parallel tempering MCMC nonlinear model fitting program samples the target posterior p({X_α}|D,M,I).

Adaptive two-stage control system: (1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; (2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics, the {X_α} iterations, summary statistics, the best fit model and residuals, the {X_α} marginals, the {X_α} 68.3% credible regions, and p(D|M,I), the marginal likelihood for model comparison.]

Adaptive Hybrid MCMC: output at each iteration

[Schematic: 8 parallel tempering Metropolis chains, with β = 1/T set to 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, and 0.09. At each iteration every chain outputs its parameters, logprior + β × loglike, and logprior + loglike; parallel tempering swap operations connect the chains.

The MCMC adaptive control system monitors the chains for parameters with peak probability. A two-stage proposal-σ control system first anneals the Gaussian proposal σ's, then refines and updates them, driven by the error signal = (actual joint acceptance rate − 0.25); this effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, a gene crossover operation is performed to breed a parameter set with larger (logprior + loglike).

Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, the best set is updated and burn-in is reset. The control system also handles correlated parameters.]
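The control-system idea of steering proposal σ's with an acceptance-rate error signal can be sketched compactly. Here is a minimal, hypothetical Python illustration of tuning toward the 25% joint acceptance rate mentioned in the diagram (a crude stand-in, not the talk's actual two-stage controller):

```python
import numpy as np

def tune_sigma(log_post, x0, sigma0, n_blocks=40, block=250, target=0.25, rng=None):
    """Scale the proposal sigma after each block using the error signal
    (actual acceptance rate - target)."""
    rng = rng or np.random.default_rng()
    x, sigma = float(x0), float(sigma0)
    lp = log_post(x)
    for _ in range(n_blocks):
        accepted = 0
        for _ in range(block):
            y = x + rng.normal(0.0, sigma)
            lp_y = log_post(y)
            if np.log(rng.uniform()) <= lp_y - lp:
                x, lp, accepted = y, lp_y, accepted + 1
        error = accepted / block - target          # error signal
        sigma *= np.exp(error)                     # widen sigma if accepting too often
        # (these tuning blocks effectively define the burn-in interval)
    return sigma

print(tune_sigma(lambda x: -0.5 * x**2, 0.0, 10.0))   # settles near an efficient sigma
```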


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo


Calculation of p(D|M0,I)


Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

$p(D|M_0,s,I) = (2\pi)^{-N/2}\,(\sigma^2+s^2)^{-N/2}\exp\!\left[-\sum_{i=1}^{N}\frac{(d_i-0)^2}{2(\sigma^2+s^2)}\right]$

Model selection results: Bayes factor = 4.5 × 10⁴.

Methanol emission in the Sgr A environment


[Table: fitted parameters, with columns including v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), and noise parameters s (K) for the 96 and 242 GHz data; the table values themselves did not survive this transcript. ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

Conclusions


1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.


The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let $\theta_j^i$ represent the $i$-th iteration of the $j$-th of $m$ independent simulations. Extract the last $h$ post-burn-in iterations for each simulation.

Mean within-chain variance:

$W = \dfrac{1}{m(h-1)}\sum_{j=1}^{m}\sum_{i=1}^{h}\left(\theta_j^i - \bar{\theta}_j\right)^2$

Between-chain variance:

$B = \dfrac{h}{m-1}\sum_{j=1}^{m}\left(\bar{\theta}_j - \bar{\theta}\right)^2$

Estimated variance:

$\hat{V}(\theta) = \left(1 - \dfrac{1}{h}\right)W + \dfrac{1}{h}B$

Gelman-Rubin statistic:

$\sqrt{\hat{V}(\theta)/W}$

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457-511.
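A direct Python transcription of these formulas; a minimal sketch, with the chains stored as an (m, h) array:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    chains: array of shape (m, h) holding the last h post-burn-in
    iterations of each of m independent simulations."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))  # within-chain
    B = h * np.var(chain_means, ddof=1)                               # between-chain
    V_hat = (1.0 - 1.0 / h) * W + B / h                               # estimated variance
    return np.sqrt(V_hat / W)

# Example: well-mixed chains give a value close to 1.0
rng = np.random.default_rng(42)
print(gelman_rubin(rng.normal(0.0, 1.0, size=(4, 5000))))   # ~1.0, e.g. < 1.05
```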

Page 2: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 241

Outline

1 Bayesian primer

2 Spectral line problemChallenge of nonlinear models

3 Introduction to Markov chain Monte Carlo (MCMC)

Parallel temperingHybrid MCMC

4 Mathematica MCMC demonstration

5 Conclusions

1

2

3

4

5

6

7

8

Methanol Occam

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 341

outline

What is Bayesian Probability Theory

(BPT)

BPT = a theory of extended logic

Deductive logic is based on Axiomatic knowledge

In science we never know any theory of nature is true because

our reasoning is based on incomplete information

Our conclusions are at best probabilities

Any extension of logic to deal with situations of incompleteinformation (realm of inductive logic) requires a theory of

probability

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 441

outline

A new perception of probability has arisen in recognition that

the mathematical rules of probability are not merely rules for

manipulating random variables

They are now recognized as valid principles of logic for

conducting inference about any hypothesis of interest

This view of ``Probability Theory as Logic was championed

in the late 20th century by E T JaynesldquoProbability Theory The Logic of Sciencerdquo

Cambridge University Press 2003

It is also commonly referred to as Bayesian Probability Theory

in recognition of the work of the 18th century English

clergyman and Mathematician Thomas Bayes

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 541

outline

Logic is concerned with the truth of propositions

A proposition asserts that something is true

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 641

outline

We will need to consider compound propositions like

AB which asserts that propositions A and B are true

AB|C asserts that propositions A and B are true

given that proposition C is true

Rules for manipulating probabilities

Sum rule p A C + p A macrmacr

C = 1

Product rule p A B C = p A C p B A C

= p B C

p A B C

Bayes theorem

p A B C =

p A C p B A C

p B C

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 3: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 341

outline

What is Bayesian Probability Theory

(BPT)

BPT = a theory of extended logic

Deductive logic is based on Axiomatic knowledge

In science we never know any theory of nature is true because

our reasoning is based on incomplete information

Our conclusions are at best probabilities

Any extension of logic to deal with situations of incompleteinformation (realm of inductive logic) requires a theory of

probability

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 441

outline

A new perception of probability has arisen in recognition that

the mathematical rules of probability are not merely rules for

manipulating random variables

They are now recognized as valid principles of logic for

conducting inference about any hypothesis of interest

This view of ``Probability Theory as Logic was championed

in the late 20th century by E T JaynesldquoProbability Theory The Logic of Sciencerdquo

Cambridge University Press 2003

It is also commonly referred to as Bayesian Probability Theory

in recognition of the work of the 18th century English

clergyman and Mathematician Thomas Bayes

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 541

outline

Logic is concerned with the truth of propositions

A proposition asserts that something is true

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 641

outline

We will need to consider compound propositions like

AB which asserts that propositions A and B are true

AB|C asserts that propositions A and B are true

given that proposition C is true

Rules for manipulating probabilities

Sum rule p A C + p A macrmacr

C = 1

Product rule p A B C = p A C p B A C

= p B C

p A B C

Bayes theorem

p A B C =

p A C p B A C

p B C

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.

Two basic problems

1. Model selection (discrete hypothesis space)

"Which one of 2 or more models (hypotheses) is most probable, given our current state of knowledge?"

e.g.
- Hypothesis or model M0 asserts that the star has no planets.
- Hypothesis M1 asserts that the star has 1 planet.
- Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis)

"Assuming the truth of M1, solve for the probability density distribution for each of the model parameters, based on our current state of knowledge."

e.g.
- Hypothesis H asserts that the orbital period is between P and P+dP.

Significance of this development

Probabilities are commonly quantified by a real number between 0 and 1.

[Figure: a number line from 0 (false) to 1 (true); the interval in between is the realm of science and inductive logic.]

The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers.

Bayesian probability theory spans the whole range.

Deductive logic is just a special case of Bayesian probability theory in the idealized limit of complete information.

Calculation of a simple likelihood p(D|M,X,I)

Let d_i represent the i-th measured data value. We model d_i by

d_i = f_i(X) + e_i

where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement.

Since M,X is assumed to be true, if it were not for the error e_i, d_i would equal the model prediction f_i(X).

Now suppose prior information I indicates that e_i has a Gaussian probability distribution. Then

p(D_i|M,X,I) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{e_i^2}{2\sigma_i^2}\right) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{(d_i - f_i(X))^2}{2\sigma_i^2}\right)

[Figure: a Gaussian error curve (probability density vs. signal strength) centered on the predicted value f_i(X); the measured value d_i lies a distance e_i away, and p(D_i|M,X,I) is proportional to the line height at d_i.]

Probability of getting a data value d_i a distance e_i away from the predicted value f_i is proportional to the height of the Gaussian error curve at that location.

Calculation of a simple likelihood p(D|M,X,I)

For independent data, the likelihood for the entire data set D = (D_1, D_2, ..., D_N) is the product of N Gaussians:

p(D|M,X,I) = (2\pi)^{-N/2} \left[\prod_{i=1}^{N} \sigma_i^{-1}\right] \exp\left[-\frac{1}{2} \sum_{i=1}^{N} \frac{(d_i - f_i(X))^2}{\sigma_i^2}\right]

The sum in the exponent is the familiar \chi^2 statistic used in least-squares, so maximizing the likelihood corresponds to minimizing \chi^2.

Recall: Bayesian posterior \propto prior \times likelihood.

Thus, only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
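As a concrete illustration (not part of the original talk), here is a minimal sketch of this likelihood in code; the straight-line model `f` and the data values are hypothetical placeholders.

```python
import numpy as np

def log_likelihood(X, d, sigma, f):
    """Gaussian log-likelihood for independent data.

    X     : model parameters
    d     : measured data values d_i
    sigma : known measurement errors sigma_i
    f     : callable returning the model predictions f_i(X)
    """
    r = d - f(X)                      # residuals d_i - f_i(X)
    chi2 = np.sum((r / sigma) ** 2)   # the familiar chi-squared statistic
    norm = -0.5 * len(d) * np.log(2 * np.pi) - np.sum(np.log(sigma))
    return norm - 0.5 * chi2

# Hypothetical example: a straight-line model d_i = X[0] + X[1]*i
f = lambda X: X[0] + X[1] * np.arange(5)
d = np.array([0.1, 1.2, 1.9, 3.1, 3.9])
sigma = np.ones(5)
print(log_likelihood([0.0, 1.0], d, sigma, f))
```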

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown orbital period P is very large, from ~1 day to 1000 yr (the upper limit is set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed it was ~10^4 times more probable that the true period was in the upper decade (10^4 to 10^5 d) of the prior range than in the lowest decade, from 1 to 10 d:

\frac{\int_{10^4}^{10^5} p(P|M,I)\, dP}{\int_{1}^{10} p(P|M,I)\, dP} = 10^4

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.

Jeffreys prior (scale invariant)

p(P|M,I)\, dP = \frac{dP}{P \ln(P_{max}/P_{min})}, \quad \text{or equivalently} \quad p(\ln P|M,I)\, d\ln P = \frac{d\ln P}{\ln(P_{max}/P_{min})}

Equal probability per decade:

\int_{1}^{10} p(P|M,I)\, dP = \int_{10^4}^{10^5} p(P|M,I)\, dP

Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

p(\ln f|M,I)\, d\ln f = \frac{d\ln f}{\ln(f_{max}/f_{min})}
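A quick numerical check of the equal-probability-per-decade property (again an illustrative sketch, not from the talk): sampling ln P uniformly is equivalent to placing a Jeffreys prior on P.

```python
import numpy as np

rng = np.random.default_rng(1)
P_min, P_max = 1.0, 1e5           # illustrative prior range in days

# Jeffreys prior: uniform in ln P  <=>  p(P) = 1 / (P ln(P_max/P_min))
P = np.exp(rng.uniform(np.log(P_min), np.log(P_max), size=100_000))

# The fraction of samples per decade is approximately equal
for lo, hi in [(1, 10), (1e4, 1e5)]:
    frac = np.mean((P >= lo) & (P < hi))
    print(f"decade [{lo:g}, {hi:g}): fraction = {frac:.3f}")
# Both fractions come out near 1/5, since the range spans 5 decades.
```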

Integration, not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization.

However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations (i.e., Monte Carlo simulations).

Shortly we discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer

Simple Spectral Line Problem

Background (prior) information: two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon, which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.

Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

[Figure: the measured spectrum (not reproduced here).]

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.

Simple Spectral Line Problem

The predicted line shape has the form shown on the slide [equation not reproduced: the line profile f_i for a given ν_0, s_L], where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν_0.

In this version of the problem, T, ν_0, s_L are all unknowns, with prior limits:

T = 0.0 - 100.0
ν_0 = 1 - 44
s_L = 0.5 - 4.0
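For concreteness in the sketches that follow, we take the line profile to be Gaussian; the slide's actual equation is an image and is not reproduced in the text, so the exact form below is an assumption consistent with the stated parameters T, ν_0 and s_L.

```python
import numpy as np

def line_profile(nu, nu0, sL):
    """Unit-amplitude line profile vs. channel number nu.
    Assumed Gaussian form; the slide's equation is not reproduced in the text."""
    return np.exp(-((nu - nu0) ** 2) / (2.0 * sL ** 2))

nu = np.arange(1, 65)                              # 64 spectrometer channels
model = 3.0 * line_profile(nu, nu0=37.0, sL=2.0)   # hypothetical T = 3 mK line
```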

Extra noise term e_0i

We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally noncommittal about the information we don't have).

We therefore adopt a Gaussian distribution for e_0i with a variance s^2. Thus the combination of e_i + e_0i has a Gaussian distribution with variance = σ_i^2 + s^2.

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and the known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 - 0.5 times the data range.

Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0, no line exists"
M1 ≡ "Model 1, line exists"

M1 has 3 unknown parameters (the line temperature T, ν_0, s_L) and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.

Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M_1,T,I) = (2\pi)^{-N/2} \sigma^{-N} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2}\right]

Our new likelihood for the more complicated model with unknown variables T, ν_0, s_L, s is

p(D|M_1,T,\nu_0,s_L,s,I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0,s_L))^2}{2(\sigma^2 + s^2)}\right]
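A minimal sketch of this likelihood in code (illustrative; the talk's actual implementation is in Mathematica). The Gaussian profile f_i is the assumption noted earlier, and σ = 1 mK is the known channel noise.

```python
import numpy as np

def log_like_M1(params, d, nu, sigma=1.0):
    """log p(D | M1, T, nu0, sL, s, I) for the spectral line model."""
    T, nu0, sL, s = params
    var = sigma ** 2 + s ** 2                          # combined variance
    f = np.exp(-((nu - nu0) ** 2) / (2.0 * sL ** 2))   # assumed Gaussian profile
    r = d - T * f                                      # residuals
    N = len(d)
    return -0.5 * N * np.log(2 * np.pi * var) - np.sum(r ** 2) / (2.0 * var)
```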

Simple nonlinear model with a single parameter α

[Figure: the Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80; the true value is marked. The N = 5 case has the broadest distribution and exhibits 4 maxima.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration, not minimization

In least-squares analysis we minimize some statistic like \chi^2. In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, here T, we need to integrate the joint posterior over all the other parameters:

p(T|D,M_1,I) = \int d\nu_0 \int ds_L \int ds\; p(T,\nu_0,s_L,s|D,M_1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations (i.e., Monte Carlo simulations).
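Once MCMC samples of (T, ν_0, s_L, s) are in hand, marginalization reduces to simply ignoring the other columns: a normalized histogram of the T column approximates p(T|D,M_1,I). A sketch with hypothetical placeholder samples:

```python
import numpy as np

# samples: array of shape (n_iterations, 4) with columns (T, nu0, sL, s),
# e.g. produced by the Metropolis-Hastings sampler sketched later.
samples = np.random.default_rng(0).normal([3.0, 37.0, 2.0, 0.5],
                                          [0.5, 0.8, 0.3, 0.1],
                                          size=(10_000, 4))  # placeholder

T = samples[:, 0]                          # keep T, ignore the other columns
pdf, edges = np.histogram(T, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])   # normalized marginal PDF for T
```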

Numerical tools for Bayesian model fitting

[Flowchart: Data D, Model M, and Prior I feed the posterior p(X|D,M,I).]

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra (chapter 10).

Nonlinear models, plus linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. The numerical options are:
- Brute-force integration.
- Asymptotic approximations: peak-finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus Laplace approximations (chapter 11).
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC (chapter 12).

Chapters

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.

Starting point: Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], set X_{t+1} = Y; otherwise set X_{t+1} = X_t. (The second factor = 1 for a symmetric proposal distribution, like a Gaussian.)
   - Increment t.

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
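A compact implementation of the algorithm above (an illustrative sketch, not the talk's code). With a symmetric Gaussian proposal the q-ratio equals 1, so only the posterior ratio enters the acceptance test; everything is done with log probabilities for numerical stability.

```python
import numpy as np

def metropolis_hastings(log_post, x0, prop_sigma, n_steps, seed=0):
    """Metropolis MCMC with a symmetric Gaussian proposal N(X_t, prop_sigma)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain, n_accept = [x.copy()], 0
    for _ in range(n_steps):
        y = x + rng.normal(0.0, prop_sigma, size=x.size)  # proposal Y
        lp_y = log_post(y)
        # accept if U <= p(Y|D,I)/p(X_t|D,I); q-ratio = 1 (symmetric proposal)
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            n_accept += 1
        chain.append(x.copy())
    return np.array(chain), n_accept / n_steps

# Example: sample a 2-D standard normal "posterior"
chain, acc = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                                 x0=[0.0, 0.0], prop_sigma=1.0, n_steps=20_000)
print("acceptance rate:", acc)
```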

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

[Figure: in this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. Three runs with different proposal σ's give acceptance rates of 95%, 63%, and 4%, shown together with the corresponding autocorrelation functions.]
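In the same spirit as the proposal-σ control system described later (error signal = actual acceptance rate − 0.25), one crude but serviceable tuning loop adjusts σ multiplicatively between short pilot runs. A sketch, reusing `metropolis_hastings` from the code above; the specific update rule is my assumption, not the talk's algorithm.

```python
import numpy as np

def tune_sigma(log_post, x0, sigma0=1.0, target=0.25, rounds=10, steps=500):
    """Adjust a global proposal sigma until the acceptance rate nears target."""
    sigma = sigma0
    for _ in range(rounds):
        _, acc = metropolis_hastings(log_post, x0, sigma, steps)
        # error signal = (acc - target): grow sigma when accepting too often
        # (steps too small), shrink it when accepting too rarely
        sigma *= np.exp(acc - target)
    return sigma
```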

MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the orbital periods P1 and P2, together with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,\beta,I) = p(X|M,I)\, p(D|X,M,I)^{\beta}, \quad 0 < \beta \le 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.

MCMC technical difficulties

1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.

Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in \chi^2. By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

MCMC details

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of the Gaussian proposal distribution σ's.

[Schematic: hybrid parallel tempering MCMC nonlinear model fitting program.
Inputs: data D, model M, prior information I; target posterior p({X_α}|D,M,I); n = number of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.
Adaptive two-stage control system: 1) automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; 2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.
Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model and residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.]

Adaptive hybrid MCMC: output at each iteration

[Schematic: 8 parallel tempering Metropolis chains with β = 1/T = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. At each iteration, every chain outputs its parameters, log(prior) + β log(likelihood), and log(prior) + log(likelihood). Parallel tempering swap operations connect adjacent chains, and correlated parameters (Corr Par) are handled by the control system.
A monitor watches for the parameter set with peak probability. Two-stage proposal σ control system: anneal the Gaussian proposal σ's, then refine and update them; the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log likelihood) parameter set.
Peak parameter set: if (log prior + log likelihood) exceeds the previous best by a threshold, update it and reset the burn-in.]


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

Calculation of p(D|M_0, I)

Model M_0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M_0,s,I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)}\right]

Model selection results: Bayes factor = 4.5 × 10^4.
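Because M_0 has only the single nuisance parameter s, its marginal likelihood p(D|M_0,I) = ∫ p(s|M_0,I) p(D|M_0,s,I) ds is a 1-D integral that simple quadrature handles. A sketch (illustrative, not the talk's calculation), assuming a uniform prior on s over (0, s_max):

```python
import numpy as np

def log_like_M0(s, d, sigma=1.0):
    """log p(D | M0, s, I): pure-noise model, f_i = 0."""
    var = sigma ** 2 + s ** 2
    N = len(d)
    return -0.5 * N * np.log(2 * np.pi * var) - np.sum(d ** 2) / (2.0 * var)

def log_marginal_M0(d, s_max, n=2000, sigma=1.0):
    """log p(D|M0,I) with a uniform prior p(s) = 1/s_max on (0, s_max)."""
    s = np.linspace(0.0, s_max, n)
    log_L = np.array([log_like_M0(si, d, sigma) for si in s])
    L = np.exp(log_L - log_L.max())          # rescale for numerical stability
    ds = s[1] - s[0]
    integral = ds * (L.sum() - 0.5 * (L[0] + L[-1]))   # trapezoid rule
    return np.log(integral / s_max) + log_L.max()
```

The Bayes factor is then exp(log p(D|M_1,I) − log p(D|M_0,I)), with the M_1 marginal likelihood requiring the 4-D integral over (T, ν_0, s_L, s).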

Methanol emission in the Sgr A environment

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands, plus an unidentified line in the 96 GHz band.

[Table of fitted parameters (not reproduced): v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). Here ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection, we need to determine the proportionality constant in order to evaluate the marginal likelihood p(D|M_i,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google "Phil Gregory".

The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:

W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2

Between-chain variance:

B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2

Estimated variance:

\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta)/W}

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457-511.
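These formulas translate directly into code. A sketch for chains stored as an (m, h) array of post burn-in samples of a single parameter:

```python
import numpy as np

def gelman_rubin(theta):
    """Gelman-Rubin statistic for one parameter.

    theta : array of shape (m, h), i.e. m independent chains,
            each with h post burn-in iterations.
    """
    m, h = theta.shape
    chain_means = theta.mean(axis=1)                 # theta-bar_j
    # W = mean within-chain variance
    W = np.sum((theta - chain_means[:, None]) ** 2) / (m * (h - 1))
    # B = h/(m-1) * sum_j (theta-bar_j - theta-bar)^2
    B = h * np.var(chain_means, ddof=1)
    V_hat = (1.0 - 1.0 / h) * W + B / h              # estimated variance
    return np.sqrt(V_hat / W)

# Converged chains should give a value close to 1.0 (e.g. < 1.05).
```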

Page 4: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 441

outline

A new perception of probability has arisen in recognition that

the mathematical rules of probability are not merely rules for

manipulating random variables

They are now recognized as valid principles of logic for

conducting inference about any hypothesis of interest

This view of ``Probability Theory as Logic was championed

in the late 20th century by E T JaynesldquoProbability Theory The Logic of Sciencerdquo

Cambridge University Press 2003

It is also commonly referred to as Bayesian Probability Theory

in recognition of the work of the 18th century English

clergyman and Mathematician Thomas Bayes

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 541

outline

Logic is concerned with the truth of propositions

A proposition asserts that something is true

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 641

outline

We will need to consider compound propositions like

AB which asserts that propositions A and B are true

AB|C asserts that propositions A and B are true

given that proposition C is true

Rules for manipulating probabilities

Sum rule p A C + p A macrmacr

C = 1

Product rule p A B C = p A C p B A C

= p B C

p A B C

Bayes theorem

p A B C =

p A C p B A C

p B C

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 5: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 541

outline

Logic is concerned with the truth of propositions

A proposition asserts that something is true

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 641

outline

We will need to consider compound propositions like

AB which asserts that propositions A and B are true

AB|C asserts that propositions A and B are true

given that proposition C is true

Rules for manipulating probabilities

Sum rule p A C + p A macrmacr

C = 1

Product rule p A B C = p A C p B A C

= p B C

p A B C

Bayes theorem

p A B C =

p A C p B A C

p B C

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.


Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

[Figure: the measured 64-channel spectrum.]

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.


Simple Spectral Line Problem

The predicted line shape has the form T f_i(ν_0, s_L), where

f_i(\nu_0, s_L) = \exp\left[-\frac{(\nu_i - \nu_0)^2}{2 s_L^2}\right]

The signal strength is measured in temperature units of mK, and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν_0.

[Figure: line profile for a given ν_0, s_L.]

In this version of the problem T, ν_0, s_L are all unknowns, with prior limits:

T = 0.0 – 100.0
ν_0 = 1 – 44
s_L = 0.5 – 4.0
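The following sketch simulates such a spectrum (purely illustrative: the "true" parameter values below are my own choices within the stated prior ranges, and the Gaussian profile is as written above):

import numpy as np

def line_profile(nu, nu0, sL):
    # Unit-amplitude Gaussian line profile evaluated at channels nu
    return np.exp(-(nu - nu0) ** 2 / (2.0 * sL ** 2))

rng = np.random.default_rng(0)
nu = np.arange(1, 65)                        # 64 spectrometer channels
T_true, nu0_true, sL_true = 3.0, 37.0, 2.0   # illustrative values only
sigma = 1.0                                  # known channel noise, mK

spectrum = (T_true * line_profile(nu, nu0_true, sL_true)
            + rng.normal(0.0, sigma, nu.size))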

Extra noise term e_{0i}

We will represent the measured data by the equation

d_i = f_i + e_i + e_{0i}

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_{0i} = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_{0i}, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally noncommittal about the information we don't have).

We therefore adopt a Gaussian distribution for e_{0i} with a variance s². Thus the combination e_i + e_{0i} has a Gaussian distribution with variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 – 0.5 times the data range.


Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) what do we conclude about the relative probabilities of the two competing theories?
2) what is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters — the line temperature T, ν_0, s_L — and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.


Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D | M_1, T, I) = (2\pi)^{-N/2} \sigma^{-N} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2}\right]

Our new likelihood for the more complicated model, with unknown variables T, ν_0, s_L, s, is

p(D | M_1, T, \nu_0, s_L, s, I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0, s_L))^2}{2(\sigma^2 + s^2)}\right]
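A direct transcription of this likelihood into Python — a sketch under stated assumptions: flat priors over the quoted ranges, an illustrative upper bound on s, and my own function names (the talk's actual code is in Mathematica):

import numpy as np

def log_like_M1(theta, d, nu, sigma):
    # theta = (T, nu0, sL, s); Gaussian likelihood with variance sigma^2 + s^2
    T, nu0, sL, s = theta
    f = T * np.exp(-(nu - nu0) ** 2 / (2.0 * sL ** 2))
    var = sigma ** 2 + s ** 2
    return (-0.5 * d.size * np.log(2 * np.pi * var)
            - np.sum((d - f) ** 2) / (2.0 * var))

def log_prior(theta):
    # Unnormalized flat priors over the ranges quoted earlier;
    # s_max = 5.0 is an illustrative stand-in for 0.5 x data range
    T, nu0, sL, s = theta
    ok = (0.0 <= T <= 100.0 and 1.0 <= nu0 <= 44.0
          and 0.5 <= sL <= 4.0 and 0.0 <= s <= 5.0)
    return 0.0 if ok else -np.inf

def log_post(theta, d, nu, sigma):
    lp = log_prior(theta)
    return lp + log_like_M1(theta, d, nu, sigma) if np.isfinite(lp) else -np.inf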

Simple nonlinear model with a single parameter α

The Bayesian posterior density for a nonlinear model with a single parameter, α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.

[Figure: posterior densities for the four data sets, with the true value of α marked.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization

In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, e.g. T in the spectral line problem, we need to integrate the joint posterior over all the other parameters:

p(T | D, M_1, I) = \int d\nu_0 \int ds_L \int ds\; p(T, \nu_0, s_L, s | D, M_1, I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e., Monte Carlo simulations.
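Once MCMC samples are available (the method is introduced below), this marginalization step is almost trivial: the marginal PDF for one parameter is just a density-normalized histogram of that parameter's column of the post burn-in samples. A sketch, assuming samples is an (n_draws, 4) array of (T, ν0, sL, s) draws:

import numpy as np

def marginal_pdf(samples, column=0, bins=50):
    # Histogram estimate of a 1-D marginal posterior; integrating out the
    # other parameters amounts to simply ignoring their columns
    pdf, edges = np.histogram(samples[:, column], bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, pdf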

Numerical tools for Bayesian model fitting

Inputs: Data (D), Model (M), Prior information (I); the posterior is p(parameters | D, M, I).

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration required; the solution is very fast using linear algebra. (chapter 10)

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. (chapter 11) The main options are:

- Brute force integration.
- Peak finding algorithms — (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm — combined with asymptotic (Laplace) approximations.
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (chapter 12)

For some parameters, analytic integration is sometimes possible.

Chapters

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1} | X_t). The transition kernel is assumed to be time independent.

Starting point: Metropolis-Hastings MCMC algorithm

p(X | D, M, I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y | X_t) that is easy to evaluate; q(Y | X_t) can have almost any form.
   - Sample a Uniform(0, 1) random variable U.
   - If U ≤ [p(Y | D, I) / p(X_t | D, I)] × [q(X_t | Y) / q(Y | X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The second factor in the acceptance ratio equals 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
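A compact Python version of the algorithm with a symmetric Gaussian proposal, so the q-ratio above equals 1 (my own sketch, not the talk's Mathematica implementation; it works in log form to avoid underflow):

import numpy as np

def metropolis(log_post, x0, proposal_sigma, n_steps, rng=None):
    # Metropolis MCMC with symmetric proposal Y ~ N(X_t, proposal_sigma)
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain, n_accept = [x.copy()], 0
    for _ in range(n_steps):
        y = x + proposal_sigma * rng.standard_normal(x.size)
        lp_y = log_post(y)
        # Accept with probability min(1, p(Y|D,I)/p(X_t|D,I)):
        # U <= ratio becomes log U <= lp_y - lp
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            n_accept += 1
        chain.append(x.copy())
    return np.array(chain), n_accept / n_steps

Applied to the spectral line problem, log_post would be the sum of the log prior and log likelihood sketched earlier, and proposal_sigma a vector of per-parameter σ's.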

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three toy MCMC runs over the two-Gaussian posterior, with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]

MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X | D, M, \beta, I) = p(X | M, I)\, p(D | X, M, I)^{\beta}, \quad 0 < \beta \le 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low β simulations radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.

MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:

- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

MCMC details

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Inputs: Data (D), Model (M), Prior information (I); n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.

Core: hybrid parallel tempering MCMC nonlinear model fitting program, with target posterior p({X_α} | D, M, I).

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D | M, I) marginal likelihood for model comparison.

Adaptive Hybrid MCMC: output at each iteration

8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). Each chain outputs, at each iteration, its parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange states between adjacent chains.

MCMC adaptive control system:

- Two-stage proposal σ control system: anneal the Gaussian proposal σ's, then refine and update them, using the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
- Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, then update the peak parameter set and reset burn-in.
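The talk does not spell out the control algorithm, but a toy version of the first-stage idea — driving each proposal σ toward the 25% target acceptance rate with the error signal above — might look like this (the function name and gain are my own choices):

def tune_proposal_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    # Proportional controller: acceptance too high -> proposal too timid,
    # so grow sigma; acceptance too low -> proposal too bold, so shrink it
    error = accept_rate - target
    return sigma * (1.0 + gain * error)

In practice such an update would be applied between batches of iterations during burn-in, one σ per model parameter.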


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

Calculation of p(D | M_0, I)

Model M_0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D | M_0, s, I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)}\right]

Model selection results: Bayes factor = 4.5 × 10^4 in favor of M_1.
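Because M_0 has only the single nuisance parameter s, its marginal likelihood is a one-dimensional integral that can be checked by direct quadrature. A sketch assuming a flat prior on s over (0, s_max) (the log-sum trick guards against underflow; names are my own):

import numpy as np

def log_like_M0(s, d, sigma):
    var = sigma ** 2 + s ** 2
    return (-0.5 * d.size * np.log(2 * np.pi * var)
            - np.sum(d ** 2) / (2.0 * var))

def marginal_like_M0(d, sigma, s_max, n=2000):
    # p(D|M0,I) = (1/s_max) * Integral_0^{s_max} p(D|M0,s,I) ds
    s = np.linspace(1e-6, s_max, n)
    logL = np.array([log_like_M0(si, d, sigma) for si in s])
    m = logL.max()                      # factor out the peak for stability
    w = np.exp(logL - m)
    integral = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(s))  # trapezoid rule
    return np.exp(m) * integral / s_max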

Methanol emission in the Sgr A environment

[Table: fitted parameters — v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D | M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.


The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Philpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:

W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2

Between-chain variance:

B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2

Estimated variance:

\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta) / W}

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science 7, pp. 457-511.
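A direct implementation of these formulas (a sketch; chains is an (m, h) array holding the post burn-in draws of a single parameter from m independent chains):

import numpy as np

def gelman_rubin(chains):
    m, h = chains.shape
    means = chains.mean(axis=1)
    # Mean within-chain variance W
    W = np.sum((chains - means[:, None]) ** 2) / (m * (h - 1))
    # Between-chain variance B = h/(m-1) * sum_j (mean_j - grand_mean)^2
    B = h * np.var(means, ddof=1)
    # Estimated variance and the Gelman-Rubin statistic
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)

Values close to 1.0 (e.g., below 1.05) for every parameter indicate convergence, as stated above.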

Page 6: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 641

outline

We will need to consider compound propositions like

AB which asserts that propositions A and B are true

AB|C asserts that propositions A and B are true

given that proposition C is true

Rules for manipulating probabilities

Sum rule p A C + p A macrmacr

C = 1

Product rule p A B C = p A C p B A C

= p B C

p A B C

Bayes theorem

p A B C =

p A C p B A C

p B C

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 7: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 741

outline

How to proceed in a Bayesian analysis

Write down Bayesrsquo theorem identify the terms and solve

The likelihood p(D| Hi

I) also written as (Hi

) stands for

the probability that we would have gotten the data D that we

did if Hi is true

Every item to the right of the

vertical bar | is assumed to be true

p H i D I = p H i I acirc p D H i I p D I

Posterior probability

that Hi is true given

the new data D and

prior information I

Prior probability Likelihood

Normalizing constant

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces


Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.


Metropolis-Hastings MCMC algorithm


P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location (starting point) in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a Normal distribution N(X_t, σ).
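A minimal Python sketch of these steps (not the talk's Mathematica code); log_post is a user-supplied log-posterior, and the symmetric Gaussian proposal makes the q ratio equal to 1:

```python
import numpy as np

def metropolis_hastings(log_post, X0, sigma_prop, n_steps, rng=None):
    # Random-walk Metropolis sampler with a symmetric Gaussian proposal
    # N(X_t, sigma_prop), so the q(X_t|Y)/q(Y|X_t) factor equals 1.
    rng = rng or np.random.default_rng()
    X = np.asarray(X0, dtype=float)
    lp = log_post(X)
    chain, n_accept = [X.copy()], 0
    for _ in range(n_steps):
        Y = X + sigma_prop * rng.standard_normal(X.size)   # propose Y ~ q(Y|X_t)
        lp_Y = log_post(Y)
        # accept with probability min(1, p(Y|D,I)/p(X_t|D,I))
        if np.log(rng.uniform()) <= lp_Y - lp:
            X, lp = Y, lp_Y
            n_accept += 1
        chain.append(X.copy())
    return np.array(chain), n_accept / n_steps
```

The sampler returns the acceptance rate because, as the next slide shows, that is the natural diagnostic for tuning sigma_prop.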

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three runs with acceptance rates of 95%, 63%, and 4%, together with the autocorrelation of their samples.]


MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the two orbital periods P1 and P2, monitored with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC


The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D, M, \beta, I) = p(X|M, I)\, p(D|X, M, I)^{\beta}, \qquad 0 < \beta \leq 1

A typical set of β values = {0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0}. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself. Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
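A hedged sketch of the swap step in Python (illustrative, not the talk's control system); because every chain shares the same prior, the prior factors cancel in the swap ratio, leaving only the tempered likelihoods:

```python
import numpy as np

def propose_swap(X, log_like, beta, rng):
    # X[c] is the current state of chain c; beta[c] its inverse temperature.
    # Chain c targets p(X|M,I) * p(D|X,M,I)**beta[c].
    c = rng.integers(len(beta) - 1)          # random adjacent pair (c, c+1)
    # log Metropolis ratio for exchanging the two states; priors cancel
    dlog = (beta[c] - beta[c + 1]) * (log_like(X[c + 1]) - log_like(X[c]))
    if np.log(rng.uniform()) <= dlog:
        X[c], X[c + 1] = X[c + 1], X[c]      # accept: swap parameter states
    return X
```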


MCMC Technical Difficulties


1. Deciding on the burn-in period.

2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's; this can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

Parallel tempering, simulated annealing, genetic algorithm, differential evolution: each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Inputs: Data (D), Model (M), Prior information (I); n = no. of iterations; {X}_init = start parameters; {σ}_init = start proposal σ's; {β} = tempering levels.

Core: hybrid parallel tempering MCMC nonlinear model fitting program, with target posterior p({X}|D,M,I).

Adaptive two stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X} iterations; summary statistics; best fit model & residuals; {X} marginals; {X} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.

Adaptive Hybrid MCMC

Eight parallel tempering Metropolis chains are run, with β = 1/T taking the values 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. Output at each iteration, for every chain: parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange parameter states between chains.

Two stage proposal σ control system: the error signal = (actual joint acceptance rate − 0.25). Stage 1 anneals the Gaussian proposal σ's; stage 2 refines & updates them. This effectively defines the burn-in interval.

Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.

(The schematic also shows a block labeled "Corr Par" for correlated parameters.)
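The error-signal idea lends itself to a compact sketch. This shows only the feedback step, assuming a simple multiplicative update rule (the talk's full two-stage control system is more elaborate):

```python
import numpy as np

def adjust_proposal_sigmas(sigmas, accept_rate, target=0.25, gain=0.5):
    # One control-system step: error signal = (actual joint acceptance
    # rate - target).  Shrink the proposal sigmas when acceptance is too
    # low (steps too large), grow them when it is too high.
    error = accept_rate - target
    return np.asarray(sigmas) * np.exp(gain * error)
```

Called periodically during burn-in, a step like this drives the joint acceptance rate toward 25%, after which the σ's are frozen and sampling proper begins.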


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo


Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M_0, s, I) = (2\pi)^{-N/2}\, (\sigma^2 + s^2)^{-N/2}\, \exp\left[ -\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)} \right]

Model selection result: Bayes factor = 4.5×10⁴.
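Because M0 has only the nuisance parameter s, its marginal likelihood is a one-dimensional integral that can be done by simple quadrature. A hedged sketch, assuming a uniform prior on s over (0, s_max):

```python
import numpy as np

def log_like_M0(d, s, sigma=1.0):
    # p(D|M0,s,I): pure-noise model, so the residuals are the data themselves
    var = sigma ** 2 + s ** 2
    return -0.5 * d.size * np.log(2.0 * np.pi * var) - np.sum(d ** 2) / (2.0 * var)

def marginal_likelihood_M0(d, s_max, n=1000, sigma=1.0):
    # p(D|M0,I) = integral of p(s|I) p(D|M0,s,I) ds, with p(s|I) = 1/s_max.
    # (For long spectra, work in logs to avoid underflow.)
    s = np.linspace(0.0, s_max, n)
    like = np.exp([log_like_M0(d, si, sigma) for si in s])
    return np.trapz(like, s) / s_max
```

The Bayes factor is then p(D|M1,I)/p(D|M0,I), where the M1 marginal likelihood requires the four-dimensional integral discussed earlier.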

Methanol emission in the Sgr A environment

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + an unidentified line in the 96 GHz band.

[Table of fitted parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), and the extra-noise terms ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

Conclusions


1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
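One standard way to use all the tempering chains, as mentioned above, is thermodynamic integration: ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the mean log-likelihood of the chain run at inverse temperature β. A minimal sketch, with the β ladder and chain averages assumed to come from the parallel tempering runs:

```python
import numpy as np

def log_marginal_likelihood(betas, mean_loglike):
    # Thermodynamic integration over the tempering ladder:
    # ln p(D|M,I) = integral_0^1 <ln L>_beta dbeta, approximated by the
    # trapezoid rule.  betas must be sorted ascending and should extend
    # close to beta = 0 for the approximation to be accurate.
    return np.trapz(mean_loglike, betas)
```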

For a copy of this talk please Google Phil Gregory


The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright


Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:
W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left( \theta_j^i - \bar{\theta}_j \right)^2

Between-chain variance:
B = \frac{h}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\theta} \right)^2

Estimated variance:
\hat{V}(\theta) = \left( 1 - \frac{1}{h} \right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta)/W}

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
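A direct transcription of these formulas into Python, as a hedged sketch; chains holds one parameter's post burn-in samples from m independent simulations:

```python
import numpy as np

def gelman_rubin(chains):
    # chains: array of shape (m, h) -- the last h post burn-in iterations
    # of a single parameter from each of m independent simulations.
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))
    B = h * np.sum((chain_means - chain_means.mean()) ** 2) / (m - 1)
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)   # should be close to 1.0 (e.g. < 1.05)
```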

Page 8: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 841

As a theory of extended logic BPT can be used to find optimal

answers to well posed scientific questions for a given state of

knowledge in contrast to a numerical recipe approach

outline

Two basic problems

1 Model selection (discrete hypothesis space)

ldquoWhich one of 2 or more models (hypotheses) is most probable

given our current state of knowledgerdquo

eg

bull Hypothesis or model M0 asserts that the star has no planets

bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets

2 Parameter estimation (continuous hypothesis)

ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our

current state of knowledgerdquo

egbull Hypothesis H asserts that the orbital period is between P and P+dP

S f foutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 9: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 941

Significance of this developmentoutline

Probabilities are commonly quantified by a real number between 0 and 1

0 1Realm of science

and inductive logic

truefalse

The end-points corresponding to absolutely false and absolutely true

are simply the extreme limits of this infinity of real numbers

Bayesian probability theory spans the whole range

Deductive logic is just a special case of Bayesian probability

theory in the idealized limit of complete information

Occam

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant in order to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m; we need two to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
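For reference, a sketch of that parallel-tempering route to the marginal likelihood (thermodynamic integration; the quadrature rule and names are assumptions). It uses the identity ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the post burn-in mean log-likelihood of the chain run at tempering level β, so the β ladder must extend close to β = 0 for the estimate to be trustworthy:

    import numpy as np

    def log_marginal_likelihood(betas, loglike_samples):
        # Thermodynamic-integration estimate of ln p(D|M,I).
        # betas           : tempering levels, ideally spanning (0, 1]
        # loglike_samples : one array of post burn-in ln-likelihood samples per beta
        means = np.array([np.mean(ll) for ll in loglike_samples])
        order = np.argsort(betas)
        b, m = np.asarray(betas, dtype=float)[order], means[order]
        # Trapezoid rule for Integral_0^1 <ln L>_beta d(beta)
        return np.sum((m[1:] + m[:-1]) * np.diff(b)) / 2.0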

For a copy of this talk, please Google "Phil Gregory".


The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright


Gelman-Rubin Statistic


Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:
W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2

Between-chain variance:
B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2

Estimated variance:
\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta) / W}

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science, 7, pp. 457-511.
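These formulas translate directly into code; a short Python sketch (the (m, h) array layout is an assumption):

    import numpy as np

    def gelman_rubin(chains):
        # chains: (m, h) array holding the last h post burn-in iterations
        # of each of m independent simulations, for one parameter.
        m, h = chains.shape
        chain_means = chains.mean(axis=1)
        W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
        B = h * chain_means.var(ddof=1)             # between-chain variance
        V_hat = (1.0 - 1.0 / h) * W + B / h         # estimated variance
        return np.sqrt(V_hat / W)                   # want < ~1.05 for convergence

Applied per parameter across the m independent chains, convergence is declared only when the statistic is below ~1.05 for every parameter.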

Page 10: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1041

Let d i represent the i th measured data value We model d i by

outline

Calculation of a simple Likelihood

Model prediction for i th data value

for current choice of parameters

p D M X I

where ei represents the error component in the measurement

d i = f i X + ei

X

Since is assumed to be true if it were not for the

error ei d i would equal the model prediction f i

p Di M X I =

1

s i 2 p Exp-

ei 2

2s i 2

=

1

s i 2 p Exp -

d i - f i X 2

2 s i 2

Now suppose prior information I indicates that ei has a Gaussian

probability distribution Then

M X

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 11: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1141

outline

pH Di raquo M X I Lproportional

to line height

ei

measured d i

Gaussian error curve

f iH X L predicted value

0 2 4 6 8

0

01

02

03

04

05

Signal strength

P r o b a b i l i t y

d e n s i t y

Probability of getting a data value d i a distance ei away from the

predicted value f i is proportional to the height of the Gaussian error curve at that location

D M X IC l l ti f i l Lik lih doutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 12: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1241

D M X I Calculation of a simple Likelihood

p J D M X I N=

H 2p

L- N

ecirc 2

permili= 1 N

s

i

- 1

gt ExpB-

05 sbquoi= 1 N J d i - f i H X LN 2

s i 2 F

The familiar c2

statistic used

in least-squares

For independent data the likelihood for the entire data

set D=(D1D2 hellipDN ) is the product of N Gaussians

Maximizing the likelihood corresponds to minimizing c2

Recall Bayesian posterior micro prior acirc likelihood

Thus only for a uniform prior will a least-squares analysis

yield the same solution as the Bayesian posterior

Simple example of when not to use a uniform prioroutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
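A sufficient condition for the chain to converge to the desired target — standard background, not spelled out on the slide — is that the kernel satisfy detailed balance with respect to the posterior π(X) = p(X|D,M,I):

$$
\pi(X)\, p(Y \mid X) = \pi(Y)\, p(X \mid Y)
$$

The Metropolis-Hastings acceptance rule below is constructed precisely so that this condition holds.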

Starting point: the Metropolis-Hastings MCMC algorithm


P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.

2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate. q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If
     $$U \le \frac{p(Y \mid D, I)}{p(X_t \mid D, I)} \times \frac{q(X_t \mid Y)}{q(Y \mid X_t)}$$
     then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The second factor, q(X_t|Y)/q(Y|X_t), equals 1 for a symmetric proposal distribution, like a Gaussian.

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
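A minimal Python sketch of exactly this random-walk algorithm, exploiting the symmetric-Gaussian simplification (the q-ratio is 1). The function name and the toy target are illustrative assumptions:

```python
import numpy as np

def metropolis_hastings(log_post, x0, sigma, n_steps, rng=None):
    """Random-walk Metropolis: symmetric Gaussian proposal N(x_t, sigma),
    so the q-ratio in the acceptance test is 1.  log_post(x) returns the
    log of the (unnormalized) target posterior p(x|D,I)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)   # propose Y ~ N(x_t, sigma)
        lp_y = log_post(y)
        # accept with probability min(1, p(Y)/p(X_t)); compare in logs
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
        chain[t] = x
    return chain

# usage sketch: a 2-D Gaussian target
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                            x0=[0.0, 0.0], sigma=0.5, n_steps=5000)
```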

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ


In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

(Figure: three chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.)

Tuning the proposal σ's can be a very difficult challenge for many parameters.
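The autocorrelation curves on this slide are straightforward to reproduce for any chain; a minimal sketch, assuming a 1-D array of samples:

```python
import numpy as np

def autocorr(chain, max_lag=100):
    """Normalized autocorrelation of a 1-D MCMC chain; slow decay
    signals an inefficiently tuned proposal sigma."""
    x = chain - chain.mean()
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])
```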

MCMC parameter samples for a Kepler model with 2 planets


(Figure: post-burn-in MCMC samples of the two orbital periods, P1 and P2, with the Gelman-Rubin statistic.)

P. C. Gregory, 'A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487', MNRAS, 374, 1321, 2007.

Parallel tempering MCMC


The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

$$
p(X \mid D, M, \beta, I) = p(X \mid M, I)\; p(D \mid X, M, I)^{\beta}, \qquad 0 < \beta \le 1
$$

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
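A hedged sketch of the scheme just described, layering a β ladder and pairwise swaps on top of random-walk Metropolis updates. The swap acceptance ratio is the standard one; the function names are illustrative, and none of the control-system machinery described later is reproduced:

```python
import numpy as np

def tempered_mcmc(log_like, log_prior, x0, sigma, betas, n_steps,
                  swap_every=10, rng=None):
    """One Metropolis chain per beta, targeting prior * likelihood**beta,
    with occasional state swaps between adjacent chains."""
    rng = rng or np.random.default_rng()
    m = len(betas)
    xs = [np.array(x0, dtype=float) for _ in range(m)]
    lls = [log_like(x) for x in xs]
    cold_chain = []
    for t in range(n_steps):
        for i, b in enumerate(betas):
            y = xs[i] + sigma * rng.standard_normal(xs[i].size)
            ll_y = log_like(y)
            d = log_prior(y) - log_prior(xs[i]) + b * (ll_y - lls[i])
            if np.log(rng.uniform()) <= d:
                xs[i], lls[i] = y, ll_y
        if t % swap_every == 0 and m > 1:
            i = rng.integers(m - 1)          # propose swapping chains i, i+1
            # standard swap ratio: priors cancel, likelihoods re-weighted
            log_r = (betas[i] - betas[i + 1]) * (lls[i + 1] - lls[i])
            if np.log(rng.uniform()) <= log_r:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
                lls[i], lls[i + 1] = lls[i + 1], lls[i]
        cold_chain.append(xs[-1].copy())     # beta = 1 chain (last in ladder)
    return np.array(cold_chain)

betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]   # ladder from the slide
```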

MCMC Technical Difficulties


1. Deciding on the burn-in period.

2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
   Ans: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient.
   Ans: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- Unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.


Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

(Diagram: hybrid parallel tempering MCMC nonlinear model fitting program.)

Inputs: data D, model M, prior information I; n = no. of iterations; {X_a}_init = start parameters; {σ_a}_init = start proposal σ's; {β} = tempering levels.

Target posterior: p({X_a}|D,M,I).

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_a} iterations; summary statistics; best fit model & residuals; {X_a} marginals; {X_a} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.

Adaptive Hybrid MCMC


8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). Output at each iteration, for every chain: parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange states between adjacent chains.

MCMC adaptive control system:
- Two-stage proposal σ control system: error signal = (actual joint acceptance rate − 0.25). It first anneals, then refines and updates, the Gaussian proposal σ's; this effectively defines the burn-in interval.
- Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
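The error-signal idea can be made concrete with a small sketch. The multiplicative update rule and the gain below are illustrative assumptions, not Gregory's actual two-stage annealing scheme:

```python
import numpy as np

def tune_sigma(sigma, accept_count, window, target=0.25, gain=0.5):
    """One control step for a proposal width, driven by the slide's
    error signal: (actual joint acceptance rate - 0.25)."""
    error = accept_count / window - target
    # too many acceptances -> steps too timid -> widen the proposal;
    # too few acceptances -> steps too bold -> shrink it
    return sigma * np.exp(gain * error)

# usage sketch: retune every 100 iterations from the acceptance counter
sigma = tune_sigma(sigma=0.5, accept_count=60, window=100)   # widens sigma
```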


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

$$
p(D \mid M_0, s, I) = (2\pi)^{-N/2}\,\left(\sigma^2 + s^2\right)^{-N/2}\,\exp\!\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2\left(\sigma^2 + s^2\right)}\right]
$$

Model selection result: Bayes factor = p(D|M1,I)/p(D|M0,I) = 4.5 × 10⁴.
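Since M0 has only the nuisance parameter s, its marginal likelihood reduces to a 1-D integral over the s prior that simple quadrature handles. A sketch, assuming (consistent with the stated prior range) a uniform prior on s; function names are illustrative:

```python
import numpy as np

def log_like_M0(s, d, sigma=1.0):
    # p(D|M0,s,I) with zero signal: residuals are just the data values
    var = sigma ** 2 + s ** 2
    return -0.5 * len(d) * np.log(2.0 * np.pi * var) - np.sum(d ** 2) / (2.0 * var)

def marginal_likelihood_M0(d, s_max, n=1000, sigma=1.0):
    """p(D|M0,I) = (1/s_max) * integral_0^{s_max} p(D|M0,s,I) ds,
    assuming a uniform prior density 1/s_max on s."""
    s = np.linspace(1e-6, s_max, n)
    like = np.exp([log_like_M0(si, d, sigma) for si in s])
    return np.trapz(like, s) / s_max
```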

Methanol emission in the Sgr A environment


(Table: fitted model parameters — v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).)

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands, plus an unidentified line in the 96 GHz band.

Conclusions


1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
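One way to make that parallel-tempering route concrete is the standard thermodynamic integration identity, consistent with the β-ladder distributions defined earlier though not written out on the slides:

$$
\ln p(D \mid M, I) = \int_0^1 \big\langle \ln p(D \mid X, M, I) \big\rangle_\beta \, d\beta
$$

where ⟨·⟩_β is the average of the log-likelihood over the post-burn-in samples of the chain tempered at β; in practice the integral is approximated by interpolating these averages across the discrete β ladder.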

For a copy of this talk please Google Phil Gregory


The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post-burn-in iterations for each simulation.

Mean within-chain variance:
$$W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2$$

Between-chain variance:
$$B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\theta}\right)^2$$

Estimated variance:
$$\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h}\, B$$

Gelman-Rubin statistic:
$$\sqrt{\frac{\hat{V}(\theta)}{W}}$$

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science, 7, pp. 457-511.
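These formulas translate directly into a few lines of numpy; the function name is illustrative:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    chains: array of shape (m, h) -- m independent simulations,
    last h post-burn-in iterations of each."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))
    B = h * np.sum((chain_means - grand_mean) ** 2) / (m - 1)
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)

# usage sketch: flag convergence when every parameter is below ~1.05
rng = np.random.default_rng(1)
print(gelman_rubin(rng.normal(size=(4, 2000))))   # ~1.0 for well-mixed chains
```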

Page 13: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1341

Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown

orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)

Suppose we assume a uniform prior probability density for the P

parameter This would imply that we believed that it was ~ 104

timesmore probable that the true period was in the upper decade

(104 to 105 d) of the prior range than in the lowest decade from

1 to 10 d

104

105

p P M I P

1

10 p P M I P

= 104

Usually expressing great uncertainty in some quantity corresponds

more closely to a statement of scale invariance or equal probability per

decade The Jeffreys prior has this scale invariant property

outlin

Jeffreys prior (scale invariant)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 14: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1441

Jeffreys prior (scale invariant)

p

H P M I

L dP =

P yen ln H P max ecirc P minL p Hln P M I L d ln P =

ln

ln H P max ecirc P minLor equivalently

1

10

p P M I P = 10

4

105

p P M I P

Equal probability per decade

Actually there are good reasons for searching in orbital frequency

f = 1P instead of P The form of the prior is unchanged

p ln f M I d ln f = ln

ln f max f min

Modified Jeffre s fre

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 15: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1541

Integration not minimization

A full Bayesian analysis requires integrating over the model

parameter space Integration is more difficult than minimization

However the Bayesian solution provides the most accurate

information about the parameter errors and correlations without

the need for any additional calculations ie Monte Carlo

simulations

Shortly discuss an efficient method for

Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)

End of Bayesian primer

outline

Si l S t l Li P bl

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

Simple nonlinear model with a single parameter α

The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima. [Figure: the four posterior densities, with the true value of α marked.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization

In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter such as T, we need to integrate the joint posterior over all the other parameters:

p(T | D, M_1, I) = \int d\nu_0 \int ds_L \int ds \; p(T, \nu_0, s_L, s | D, M_1, I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e. Monte Carlo simulations.
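Once MCMC samples of the joint posterior are available (see below), this integral is approximated by simply histogramming one column of the samples. A sketch, where `samples` is a hypothetical (n_draws × 4) array of (T, ν_0, s_L, s) draws:

```python
import matplotlib.pyplot as plt

T_draws = samples[:, 0]                    # T column of the joint posterior samples
plt.hist(T_draws, bins=50, density=True)   # estimate of the marginal PDF p(T|D,M1,I)
plt.xlabel("T (mK)")
plt.ylabel("probability density")
plt.show()
```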

Numerical tools

Given the data D, model M, and prior information I:

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra. (chapter 10)

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters analytic integration is sometimes possible. Otherwise:

- Brute force integration and asymptotic approx's: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) and Laplace approx's (chapter 11)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC (chapter 12)

Chapters

1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference
4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic?
7 Frequentist hypothesis testing
8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1} | X_t). The transition kernel is assumed to be time independent.

Starting point: Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters)

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
- Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
- Sample a Uniform(0,1) random variable U.
- If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian.

I use a Gaussian proposal distribution, i.e. a normal distribution N(X_t, σ).
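The steps above translate almost line for line into code. A self-contained sketch with a symmetric Gaussian proposal (so the q ratio equals 1), working with log posteriors for numerical stability; the target shown is an illustrative stand-in, not the spectral line posterior itself:

```python
import numpy as np

def metropolis_hastings(log_post, x0, sigma, n_steps, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal N(X_t, sigma)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        y = x + rng.normal(0.0, sigma, x.size)   # proposal Y ~ q(Y|X_t)
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:   # U <= p(Y|D,I)/p(X_t|D,I)
            x, lp = y, lp_y
            accepted += 1
        chain[t] = x
    return chain, accepted / n_steps

# Illustrative 2-D Gaussian target standing in for a posterior
log_target = lambda x: -0.5 * np.sum(x ** 2)
chain, rate = metropolis_hastings(log_target, x0=[3.0, -2.0], sigma=0.8, n_steps=20000)
```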

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three runs with different proposal σ's, with acceptance rate = 95%, 63%, and 4%, together with their autocorrelation functions.]

This tuning can be a very difficult challenge for many parameters.

MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the two orbital periods P1 and P2, with Gelman-Rubin statistics.]

P. C. Gregory, 'A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487', MNRAS 374, 1321, 2007

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X | D, M, \beta, I) = p(X | M, I) \, p(D | X, M, I)^{\beta}, \qquad 0 < \beta \le 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
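A minimal sketch of the tempering ladder and the swap step (illustrative only; the talk's actual implementation is the Mathematica program described later). Each chain targets log prior + β × log likelihood, and a randomly chosen adjacent pair occasionally proposes to exchange states:

```python
import numpy as np

betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]  # ladder from the slide

def tempered_logpost(log_prior, log_like, beta):
    """log of p(X|M,I) * p(D|X,M,I)^beta for one tempering level."""
    return lambda x: log_prior(x) + beta * log_like(x)

def swap_step(states, loglikes, betas, rng):
    """Propose swapping the states of one random adjacent pair of chains."""
    j = rng.integers(len(betas) - 1)
    # Standard exchange acceptance ratio between levels j and j+1
    log_r = (betas[j + 1] - betas[j]) * (loglikes[j] - loglikes[j + 1])
    if np.log(rng.uniform()) <= log_r:
        states[j], states[j + 1] = states[j + 1], states[j]
        loglikes[j], loglikes[j + 1] = loglikes[j + 1], loglikes[j]
```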

MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters.
Ans: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient.
Ans: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:

- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

Parallel tempering
Simulated annealing
Genetic algorithm
Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.


Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels. Target posterior: p({X_α} | D, M, I).

Adaptive two stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs:
- Control system diagnostics
- {X_α} iterations
- Summary statistics
- Best fit model & residuals
- {X_α} marginals
- {X_α} 68.3% credible regions
- p(D|M,I) marginal likelihood for model comparison

Adaptive Hybrid MCMC: output at each iteration

Eight parallel tempering Metropolis chains run at β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). At each iteration every chain outputs its parameters, log prior + β × log likelihood, and log prior + log likelihood. Parallel tempering swap operations exchange states between adjacent chains.

Two stage proposal σ control system: anneal the Gaussian proposal σ's, then refine and update them, driven by the error signal = (actual joint acceptance rate − 0.25), as in the sketch below. This effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log likelihood) parameter set.

Monitor for parameters with peak probability: if (log prior + log likelihood) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
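The error signal suggests a simple feedback rule for the proposal σ's. A plausible sketch of a single adaptation step (the two-stage control system in the actual program is considerably more elaborate):

```python
import numpy as np

def adapt_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    """Nudge the proposal sigma toward the target joint acceptance rate.

    A positive error (accepting too often) widens the proposal;
    a negative error narrows it.
    """
    error = accept_rate - target   # the slide's error signal
    return sigma * np.exp(gain * error)
```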


Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo

Calculation of p(D|M0, I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D | M_0, s, I) = (2\pi)^{-N/2} \, (\sigma^2 + s^2)^{-N/2} \, \exp\!\left[ -\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)} \right]

Model selection results: Bayes factor = 4.5 × 10^4
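Because M0 has only the nuisance parameter s, its marginal likelihood is a one-dimensional integral that ordinary quadrature handles. A sketch, assuming a uniform prior on s over (0, s_max) as on the earlier slide, and reusing the `log_likelihood` function from above with T = 0:

```python
from scipy.integrate import quad
import numpy as np

def marginal_likelihood_M0(d, nu, s_max, sigma=1.0):
    """p(D|M0,I) = integral of p(s|M0,I) p(D|M0,s,I) ds, uniform prior on s."""
    def integrand(s):
        # With T = 0 the line parameters nu0, sL are irrelevant
        return np.exp(log_likelihood(d, nu, T=0.0, nu0=0.0, sL=1.0, s=s, sigma=sigma))
    val, _ = quad(integrand, 0.0, s_max)
    return val / s_max   # uniform prior density is 1/s_max

# Bayes factor = p(D|M1,I) / p(D|M0,I); the 4-parameter integral for M1
# is the kind of computation the MCMC machinery above is built for.
```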

Methanol emission in the Sgr A environment

[Table: optically thin fit results. Columns: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), two (N/Z)_A columns (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K); the numerical entries did not survive extraction.]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in 96 GHz band

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m; we need two to know if either is valid. One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory.


The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post-burn-in iterations for each simulation, and let \bar{\theta}_j be the mean of chain j and \bar{\theta} the mean of the chain means.

Mean within-chain variance: W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left( \theta_j^i - \bar{\theta}_j \right)^2

Between-chain variance: B = \frac{h}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\theta} \right)^2

Estimated variance: \hat{V}(\theta) = \left( 1 - \frac{1}{h} \right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{ \hat{V}(\theta) / W }

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science 7, pp. 457-511.
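These formulas amount to a few lines of array code. A sketch taking `chains` as an (m × h) array holding the last h post-burn-in draws of one parameter θ from m independent simulations:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter; chains has shape (m, h)."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))   # within-chain
    B = h * np.sum((chain_means - chain_means.mean()) ** 2) / (m - 1)  # between-chain
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)   # should be close to 1.0 (e.g. < 1.05)
```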

Page 16: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1641

Simple Spectral Line Problem

Background (prior) informationTwo competing grand unification theories have been proposed each

championed by a Nobel prize winner in physics We want to compute

the relative probability of the truth of each theory based on our prior

information and some new data

Theory 1 is unique in that it predicts the existence of a new short-lived

baryon which is expected to form a short-lived atom and give rise to a

spectral line at an accurately calculable radio wavelength

Unfortunately it is not feasible to detect the line in the laboratory The

only possibility of obtaining a sufficient column density of the short-

lived atom is in interstellar space

outline

Data

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 17: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1741

To test this prediction a new spectrometer was mounted on the James

Clerk Maxwell telescope on Mauna Kea and the spectrum shown below

was obtained The spectrometer has 64 frequency channels

Data

All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent

outline

Simple Spectral Line Problem

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 18: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1841

Simple Spectral Line Problem

The predicted line shape has the form

where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the

spectrometer channel number and the line center frequency is ν 0

Line profile

for a given

ν 0 s L

In this version of the problemT ν 0 s L are all unknowns with

prior limits

T = 00 - 1000

ν 0 = 1 ndash 44

s L = 05 ndash 40

Extra noise term e0i

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Numerical tools for Bayesian model fitting (inputs: Data D, Model M, prior information I):

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration required; the solution is very fast using linear algebra. (chapter 10)

Nonlinear models, plus linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Otherwise:
- Brute-force integration, with peak-finding algorithms: (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm. (chapter 11)
- Asymptotic approximations: Laplace approximations. (chapter 11)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (chapter 12)

Chapters

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.

MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1} | X_t). The transition kernel is assumed to be time independent.

Starting point: the Metropolis-Hastings MCMC algorithm

p(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X₀, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|Xₜ) that is easy to evaluate; q(Y|Xₜ) can have almost any form.
   - Sample a Uniform(0, 1) random variable U.
   - If $U \le \frac{p(Y \mid D, I)}{p(X_t \mid D, I)} \times \frac{q(X_t \mid Y)}{q(Y \mid X_t)}$, then set X_{t+1} = Y; otherwise set X_{t+1} = Xₜ.
   - Increment t.

The second factor, q(Xₜ|Y)/q(Y|Xₜ), equals 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(Xₜ, σ).
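A minimal Python sketch of the algorithm above with a symmetric Gaussian proposal, so the q ratio is 1; log_post is an assumed user-supplied function (e.g., a log prior plus the spectral-line log-likelihood sketched earlier). Illustrative only, not the talk's implementation:

```python
import numpy as np

def metropolis(log_post, x0, sigma_prop, n_steps, rng=None):
    """Minimal Metropolis sampler with a symmetric Gaussian proposal N(X_t, sigma).

    log_post   : function returning the log of the target posterior p(X|D,M,I)
    x0         : starting point X_0 (1-D array)
    sigma_prop : proposal sigma for each parameter
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        y = x + sigma_prop * rng.standard_normal(x.size)  # propose Y ~ N(X_t, sigma)
        lp_y = log_post(y)
        # symmetric proposal: the q ratio is 1, so accept with probability p(Y)/p(X_t)
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            accepted += 1
        chain[t] = x
    return chain, accepted / n_steps
```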

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's, which can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. (Figure: three simulated chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.)

MCMC parameter samples for a Kepler model with 2 planets

(Figure: post burn-in MCMC samples of the orbital periods P1 and P2, with the Gelman-Rubin statistic.)

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

$p(X \mid D, M, \beta, I) = p(X \mid M, I)\, p(D \mid X, M, I)^{\beta}, \qquad 0 < \beta \le 1$

Typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself. Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
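A sketch of the swap step in Python, assuming each chain carries its current state and log-likelihood; the untempered prior factors cancel in the swap ratio, so only the log-likelihoods enter. Illustrative, not the talk's code:

```python
import numpy as np

def pt_swap_step(states, log_likes, betas, rng):
    """Propose one swap between a randomly chosen adjacent pair of chains.

    states    : list of parameter vectors, one per tempering level
    log_likes : log p(D|X,M,I) evaluated at each chain's current state
    betas     : tempering levels, e.g. [0.09, 0.15, 0.22, ..., 1.0]
    """
    i = int(rng.integers(len(betas) - 1))   # adjacent pair (i, i+1)
    # Swap acceptance ratio for the tempered distributions defined above
    log_r = (betas[i] - betas[i + 1]) * (log_likes[i + 1] - log_likes[i])
    if np.log(rng.uniform()) <= log_r:
        states[i], states[i + 1] = states[i + 1], states[i]
        log_likes[i], log_likes[i + 1] = log_likes[i + 1], log_likes[i]
    return states, log_likes
```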

MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic (see appendix).
5. Deciding on a good choice of tempering levels (β values).

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.


Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels. Target posterior: p({X_α} | D, M, I).

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model and residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.

Adaptive Hybrid MCMC: output at each iteration

Eight parallel tempering Metropolis chains run simultaneously, at β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, and 0.09 (β = 1/T). At each iteration, every chain outputs its parameters, (log prior + β × log likelihood), and (log prior + log likelihood). Parallel tempering swap operations exchange states between adjacent chains.

Two-stage proposal-σ control system: the error signal = (actual joint acceptance rate − 0.25). Stage 1 anneals the Gaussian proposal σ's; stage 2 refines and updates them, including handling of correlated parameters. This effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log likelihood) parameter set.

Peak parameter set: monitor for parameters with peak probability; if (log prior + log likelihood) exceeds the previous best by a threshold, then update and reset the burn-in.
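The error signal above suggests a simple proportional rule for retuning the proposal σ's between blocks of iterations during burn-in. The sketch below is a toy stand-in for the two-stage annealing controller, with an assumed gain parameter:

```python
import numpy as np

def tune_sigma(sigma_prop, accept_rate, target=0.25, gain=0.5):
    """Toy proportional controller for the proposal width during burn-in.

    error signal = (actual acceptance rate - 0.25), as in the control system:
    too many acceptances means the steps are too small, so widen sigma;
    too few means the steps are too large, so shrink it.
    """
    error = accept_rate - target
    return sigma_prop * np.exp(gain * error)
```

In use, one would run a short MCMC block, measure the acceptance rate, retune, and repeat until the rate settles near the 25% target, at which point the σ's are frozen and burn-in ends.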

Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo

Calculation of p(D|M₀, I)

Model M₀ assumes the spectrum is consistent with noise and has no free parameters, so we can write

$p(D \mid M_0, s, I) = (2\pi)^{-N/2}\,(\sigma^2 + s^2)^{-N/2}\,\exp\left[-\sum_{i=1}^{N}\frac{(d_i - 0)^2}{2(\sigma^2 + s^2)}\right]$

Model selection result: Bayes factor = 4.5 × 10⁴.
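Since s is the only parameter of M₀, p(D|M₀,I) reduces to a one-dimensional integral of the expression above over the uniform prior for s (upper bound 0.5 times the data range, per the earlier slide). A Python quadrature sketch, with s_max and the grid size as assumed inputs:

```python
import numpy as np

def log_pD_M0(d, sigma, s_max):
    """ln p(D|M0,I): marginalize p(D|M0,s,I) over a uniform prior s ~ U(0, s_max)."""
    s = np.linspace(1e-6, s_max, 2000)        # quadrature grid for s
    var = sigma ** 2 + s[:, None] ** 2        # sigma^2 + s^2, broadcast over channels
    log_like = -0.5 * np.sum(np.log(2.0 * np.pi * var) + d ** 2 / var, axis=1)
    m = log_like.max()                        # log-sum-exp stabilization
    w = np.exp(log_like - m)
    integral = 0.5 * np.sum((w[1:] + w[:-1]) * np.diff(s))   # trapezoid rule
    return m + np.log(integral / s_max)       # 1/s_max is the uniform prior density
```

The Bayes factor then follows as exp(ln p(D|M₁,I) − ln p(D|M₀,I)), with the M₁ marginal likelihood obtained numerically (e.g., from the parallel tempering chains).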

Methanol emission in the Sgr A environment

Optically thin fit to 3 bands, plus an unidentified line in the 96 GHz band.

(Table of fitted parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), (N/Z)_E (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).)

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation. For model selection, however, we need to determine the proportionality constant, in order to evaluate the marginal likelihood p(D|M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
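The parallel-tempering route mentioned here is thermodynamic integration, ln p(D|M,I) = ∫₀¹ ⟨ln p(D|X,M,I)⟩_β dβ. A sketch under the assumption that the per-chain mean log-likelihoods are already available:

```python
import numpy as np

def log_marginal_likelihood(betas, mean_loglikes):
    """Thermodynamic integration estimate of ln p(D|M,I) from tempered chains.

    betas         : tempering levels, ideally extending down toward beta = 0
    mean_loglikes : <ln p(D|X,M,I)>_beta, the log-likelihood averaged over
                    each chain's post burn-in samples
    """
    order = np.argsort(betas)
    b = np.asarray(betas, dtype=float)[order]
    m = np.asarray(mean_loglikes, dtype=float)[order]
    return 0.5 * np.sum((m[1:] + m[:-1]) * np.diff(b))   # trapezoid rule over beta
```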

For a copy of this talk please Google Phil Gregory

The rewards of data analysis

"The universe is full of magical things, patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation
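These formulas translate directly into a few lines of Python; the sketch below treats one parameter, with the m chains stacked row-wise (an illustration, not the talk's code):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.

    chains : array of shape (m, h), i.e. m independent simulations,
             each reduced to its last h post burn-in iterations.
    """
    m, h = chains.shape
    chain_means = chains.mean(axis=1)              # theta-bar_j for each chain
    W = chains.var(axis=1, ddof=1).mean()          # mean within-chain variance
    B = h * chain_means.var(ddof=1)                # between-chain variance
    V_hat = (1.0 - 1.0 / h) * W + B / h            # estimated variance
    return np.sqrt(V_hat / W)
```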

Page 19: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 1941

Extra noise term e 0i

We will represent the measured data by the equation

d i = f i + ei + e0 i

d i = ith measured data valuef i = model prediction

ei = component of d i which arises from measurement errors

e0 i = any additional unknown measurement errors plus any real signal

in the data that cannot be explained by the model prediction f i

In the absence of detailed knowledge of the sampling distribution for e0 i

other than that it has a finite variance the Maximum Entropy principle tells us

that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)

We therefore adopt a Gaussian distribution for e0 i with a variance s2

Thus the combination of ei + e

0 i has a Gaussian distribution with

variance = si 2

+ s2

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)

which has the desirable effect of treating as noise anything in the data that can t be

explained by the model and known measurement errors leading to most conservative

estimates of the model parameters Prior range for s = 0 - 05 times data range

outline

Questions of interest

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 20: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2041

Questions of interest

Based on our current state of information which includes just the

above prior information and the measured spectrum

1) what do we conclude about the relative probabilities of the two

competing theories

and 2) what is the posterior PDF for the model parameters and s

Hypothesis space of interest for model selection part

M0 equiv ldquoModel 0 no line existsrdquo

M1 equiv ldquoModel 1 line existsrdquo

M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s

M0 has no unknown parameters and one nuisance parameter s

Likelihood for the spectral line modeloutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 21: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2141

Likelihood for the spectral line model

In the earlier spectral line problem which had only

one unknown variable T we derived the likelihood

Our new likelihood for the more complicated model withunknown variables T u0 sL s

H D M 1 T I L = H2 p L- N

2 σ minusN

ExpC- sbquoi = 1N

Hd i - T f i

L2 s G

p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2

+ s2 N-N

2 ExpC- sbquoi = 1

N Hd i - T f i Hu 0 s LLL2 Is 2

+ s2 MG

outline

Simple nonlinear model with a single parameter α

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Adaptive hybrid MCMC: output at each iteration

[Diagram: eight parallel tempering Metropolis chains, with tempering
levels β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T).
Each chain outputs, at each iteration: parameters, logprior + β·loglike,
and logprior + loglike. Parallel tempering swap operations exchange
states between adjacent chains.]

Two-stage proposal σ control system: anneal the Gaussian proposal σ's,
then refine & update them, driven by the
error signal = (actual joint acceptance rate - 0.25).
This stage effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, perform a gene crossover
operation to breed a larger (logprior + loglike) parameter set.

Peak parameter set: monitor for parameters with peak probability; if
(logprior + loglike) exceeds the previous best by a threshold, then
update the peak parameter set and reset burn-in.
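To make the diagram concrete, here is a bare-bones sketch of the core
loop: eight tempered Metropolis chains, occasional swap proposals
between adjacent β levels, and a crude σ adjustment driven by the
(acceptance rate - 0.25) error signal. Everything here (the toy
logprior/loglike, the adjustment rule, the schedule constants) is an
illustrative Python assumption, not the talk's Mathematica control
system:

```python
import numpy as np

rng = np.random.default_rng(1)
betas = np.array([1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09])

def logprior(x):                 # assumed: flat prior inside a box
    return 0.0 if np.all(np.abs(x) < 10.0) else -np.inf

def loglike(x):                  # assumed: toy bimodal likelihood
    return np.logaddexp(-0.5*np.sum((x - 2)**2), -0.5*np.sum((x + 2)**2))

ndim, nchain, nsteps = 2, len(betas), 5000
X = rng.normal(size=(nchain, ndim))       # one state per tempering level
lp = np.array([logprior(x) for x in X])   # log priors
ll = np.array([loglike(x) for x in X])    # log likelihoods
sigma = np.ones(nchain)                   # Gaussian proposal σ per chain
nacc = np.zeros(nchain)

for t in range(1, nsteps + 1):
    for c in range(nchain):               # Metropolis update of each chain
        y = X[c] + sigma[c] * rng.standard_normal(ndim)
        lpy, lly = logprior(y), loglike(y)
        if np.log(rng.random()) <= (lpy + betas[c]*lly) - (lp[c] + betas[c]*ll[c]):
            X[c], lp[c], ll[c] = y, lpy, lly
            nacc[c] += 1
    if t % 10 == 0:                       # propose a swap of adjacent chains
        c = rng.integers(nchain - 1)
        if np.log(rng.random()) <= (betas[c] - betas[c+1]) * (ll[c+1] - ll[c]):
            X[[c, c+1]] = X[[c+1, c]]
            lp[[c, c+1]] = lp[[c+1, c]]
            ll[[c, c+1]] = ll[[c+1, c]]
    if t % 100 == 0:                      # crude σ control: push acceptance
        rate = nacc / 100.0               # rate toward the 0.25 target
        sigma *= np.exp(rate - 0.25)      # error signal = rate - 0.25
        nacc[:] = 0.0

# final parameter inference uses only the β = 1 chain (index 0)
```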


Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo


Calculation of p(D|M0, I)


Model selection results

Model M0 assumes the spectrum is consistent with noise and has no free
parameters, so we can write

$$p(D \mid M_0, s, I) = (2\pi)^{-N/2}\,\bigl(s^2 + \sigma^2\bigr)^{-N/2}
\exp\!\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2\,(s^2 + \sigma^2)}\right]$$

Bayes factor = 4.5 × 10⁴
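Since M0 has no free parameters, its marginal likelihood is a single
likelihood evaluation. A sketch of the formula above in Python (d is the
data array, sigma the known measurement noise, s the extra noise
parameter; the names are mine, for illustration only):

```python
import numpy as np

def log_ml_M0(d, sigma, s):
    """log p(D|M0,s,I): pure-noise model with zero mean and total
    variance s^2 + sigma^2 per channel, as in the equation above."""
    var = s**2 + sigma**2
    N = len(d)
    return -0.5*N*np.log(2*np.pi) - 0.5*N*np.log(var) \
           - np.sum((d - 0.0)**2) / (2*var)
```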

Methanol emission in the Sgr A environment


[Table of fitted parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K),
(N/Z)_A (cm⁻²), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹),
T_UL (K), ds96, ds242, s (K). Here ν_UL (MHz) is the rest frequency of
the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC),
S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands
+ unidentified line in 96 GHz band

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of
computing the integrals required to compute the posterior probability
density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple
spectral line problem with only 4 parameters, MCMC techniques are really
most competitive for models with a much larger number of parameters,
m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter
space in proportion to the posterior probability distribution. This is
fine for parameter estimation.

For model selection we need to determine the proportionality constant,
to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a
much more difficult problem, still in search of two good solutions for
large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering
chains spanning a wide range of β values; however, this becomes
computationally very intensive for m > 17.
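The standard way to use the full ladder of tempered chains is
thermodynamic integration, implicit in the slide and spelled out here:

$$\ln p(D \mid M_i, I) \;=\; \int_0^1 \bigl\langle \ln p(D \mid X, M_i, I) \bigr\rangle_{\beta}\, d\beta$$

where ⟨ · ⟩_β denotes an average of the log likelihood over the post
burn-in samples of the chain running at tempering level β; the integral
is then evaluated numerically over the discrete ladder of β values.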

For a copy of this talk, please Google Phil Gregory.


The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits
to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the ith
iteration of the jth of m independent simulations. Extract the last h
post burn-in iterations for each simulation.

$$\text{Mean within-chain variance: } W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \bigl(\theta_j^i - \bar{\theta}_j\bigr)^2$$

$$\text{Between-chain variance: } B = \frac{h}{m-1} \sum_{j=1}^{m} \bigl(\bar{\theta}_j - \bar{\bar{\theta}}\bigr)^2$$

$$\text{Estimated variance: } \hat{V}(\theta) = \Bigl(1 - \frac{1}{h}\Bigr) W + \frac{1}{h} B$$

$$\text{Gelman-Rubin statistic} = \sqrt{\hat{V}(\theta) / W}$$

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all
parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative
simulations using multiple sequences (with discussion)', Statistical
Science, 7, pp. 457-511.
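A compact implementation of the statistic exactly as defined above (a
sketch; chains is an (m, h) array holding the post burn-in samples of
one parameter from m independent simulations):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    chains: array of shape (m, h) -- m independent simulations,
    h post burn-in iterations each."""
    m, h = chains.shape
    means = chains.mean(axis=1)                        # per-chain means
    W = np.sum((chains - means[:, None])**2) / (m * (h - 1))
    B = h * np.sum((means - means.mean())**2) / (m - 1)
    V_hat = (1 - 1/h) * W + B / h
    return np.sqrt(V_hat / W)

# example: two well-mixed chains should give a value close to 1.0
rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(size=(2, 10000))))
```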

Page 22: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2241

p g p

The Bayesian posterior density for a nonlinear model with single parameter

α for 4 simulated data sets of different size ranging from N = 5 to N = 80

The N = 5 case has the broadest distribution and exhibits 4 maxima

True value

Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the

sample size becomes largerSimulated annealing

Integration not minimizationoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 23: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2341

g

In Least-squares analysis we minimize some statistic like c2

In a Bayesian analysis we need to integrate

Parameter estimation to find the marginal posterior probability

density function (PDF) for the orbital period P we need to integrate

the joint posterior over all the other parameters

p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I

Marginal PDF

for T Joint posterior probability

density function (PDF) for

the parameters

Shortly discuss an efficient method for Integrating over a large parameter space

called Markov chain Monte Carlo (MCMC)

Integration is more difficult than minimization However the Bayesian

solution provides the most accurate information about the parameter errors and correlations without the need for any additional

calculations ie Monte Carlo simulations

Data Model Prior outline

Numerical tools

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 24: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2441

D M I

Linear models (uniform priors)

Posterior has a single peak

(multi-dimensional Gaussian)

Posterior

Parameters given

by the normal equations

of linear least-squares

No integration required

solution very fast

using linear algebra

Posterior may have multiple peaks

Brute force Asymptotic Moderate High

integration approxrsquos dimensions dimensions

peak finding quadrature MCMC

algorithms

(1) Levenberg- randomized

Marquardt quadrature

(2) Simulatedannealing adaptive

(3) Genetic quadrature

algorithm

Laplace

approxrsquos

Nonlinear models

+ linear models (non-uniform priors)

For some

parameters

analytic

integration

sometimespossible

for Bayesian

model fitting

(chapter 10) (chapter 11) (chapter 12)

Chaptersoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 25: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2541

1 Role of probability theory in science

2 Probability theory as extended logic

3 The how-to of Bayesian inference4 Assigning probabilities

5 Frequentist statistical inference

6 What is a statistic

7 Frequentist hypothesis testing8 Maximum entropy probabilities

9 Bayesian inference (Gaussian errors)

10 Linear model fitting (Gaussian errors)

11 Nonlinear model fitting

12 Markov chain Monte Carlo

13 Bayesian spectral analysis

14 Bayesian inference (Poisson sampling)

p

Resources and solutions

This title has free

Mathematica based supportsoftware available

Introduces statistical inference in the

larger context of scientific methods and

includes 55 worked examples and manyproblem sets

outline

MCMC for integration in large parameter spaces

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2641

g g

Markov chain Monte Carlo (MCMC) algorithms provide a powerful

means for efficiently computing integrals in many dimensions to within

a constant factor This factor is not required for parameter estimation

After an initial burn-in period (which is discarded) the MCMC

produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior

PDF

It is very efficient because unlike straight Mont Carlo integration it

doesnrsquot waste time exploring regions where the joint posterior is very

small

The MCMC employs a Markov chain random walk whereby the new

sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or

kernel p(Xt+1 |Xt) The transition kernel is assumed to be time

independent

conditions return

outline

Starting point Metropolis-Hastings MCMC algorithm

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains


Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space such that the density of samples is proportional to the joint posterior PDF.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X(t+1), depends on the previous sample X(t) according to an entity called the transition probability or kernel, p(X(t+1)|X(t)). The transition kernel is assumed to be time independent.

Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X0, an initial location (starting point) in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|Xt) that is easy to evaluate; q(Y|Xt) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U <= [p(Y|D,I) / p(Xt|D,I)] x [q(Xt|Y) / q(Y|Xt)], then set X(t+1) = Y; otherwise set X(t+1) = Xt. (The second factor = 1 for a symmetric proposal distribution like a Gaussian.)
   - Increment t.

I use a Gaussian proposal distribution, i.e., a normal distribution N(Xt, σ).
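The talk's code is written in Mathematica; purely as an illustration of the recipe above, here is a minimal Python sketch for a user-supplied log-posterior (`log_post`, a hypothetical stand-in for p(X|D,I)). With the symmetric Gaussian proposal the q-ratio equals 1, so the test reduces to U <= p(Y|D,I)/p(Xt|D,I), evaluated in log space for numerical safety.

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_iter, seed=0):
    """Metropolis sampler with a symmetric Gaussian proposal N(x_t, sigma)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal(x.size)   # propose Y ~ q(Y|X_t)
        lp_y = log_post(y)
        # Symmetric proposal: accept with probability min(1, p(Y|D,I)/p(X_t|D,I))
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y
            n_accept += 1
        chain[t] = x                                   # X_{t+1} = Y or X_t
    return chain, n_accept / n_iter

# Toy usage: sample a 2-D Gaussian posterior starting far from the mode
if __name__ == "__main__":
    log_post = lambda x: -0.5 * np.sum(x**2)
    chain, acc = metropolis(log_post, x0=[5.0, -5.0], sigma=1.0, n_iter=20000)
    print(f"acceptance rate = {acc:.2f}")
```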

Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three toy runs with small, intermediate, and large proposal σ, giving acceptance rates of 95%, 63%, and 4%, together with the corresponding autocorrelation functions.]
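To make the tuning problem concrete, here is a self-contained toy experiment in the same spirit (my own hypothetical setup and numbers, not the slide's): a Metropolis chain on an equal-weight mixture of two 2-D Gaussians, showing how the acceptance rate collapses as the proposal σ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = np.array([-2.0, -2.0]), np.array([3.0, 3.0])

def log_post(x):
    # Equal-weight mixture of two isotropic 2-D Gaussians (toy posterior)
    a = -0.5 * np.sum((x - mu1) ** 2)
    b = -0.5 * np.sum((x - mu2) ** 2)
    m = max(a, b)
    return m + np.log(0.5 * np.exp(a - m) + 0.5 * np.exp(b - m))

def acceptance_rate(sigma, n_iter=20000):
    x, lp, acc = mu1.copy(), log_post(mu1), 0
    for _ in range(n_iter):
        y = x + sigma * rng.standard_normal(2)
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp, acc = y, lp_y, acc + 1
    return acc / n_iter

for sigma in (0.1, 1.0, 10.0):
    print(f"proposal sigma = {sigma:5.1f} -> acceptance rate ~ {acceptance_rate(sigma):.2f}")
```

A tiny σ accepts nearly everything but diffuses slowly; a huge σ proposes mostly into low-probability regions and stalls, which is the trade-off the figure illustrates.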

MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the two orbital periods, P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

\[ p(X \mid D, M, \beta, I) = p(X \mid M, I)\, p(D \mid X, M, I)^{\beta}, \qquad 0 < \beta \le 1 \]

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
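A minimal Python sketch of the tempering idea (not the speaker's Mathematica code): one Metropolis update per chain on the tempered posterior above, followed by a random adjacent-pair swap accepted with probability min(1, exp[(β_i − β_j)(lnL_j − lnL_i)]); the prior factors cancel in the swap ratio. The bimodal toy likelihood is my own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
betas = np.array([0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0])  # the slide's ladder

# Hypothetical toy problem: flat prior, sharply bimodal 1-D likelihood
log_like = lambda x: np.logaddexp(-0.5 * (x - 4.0)**2 / 0.05,
                                  -0.5 * (x + 4.0)**2 / 0.05)
log_prior = lambda x: 0.0

def pt_mcmc(n_iter=50000, sigma=0.5, swap_every=10):
    x = rng.uniform(-5, 5, size=betas.size)        # one state per tempering level
    ll = np.array([log_like(xi) for xi in x])
    samples = []                                   # keep only the beta = 1 chain
    for t in range(n_iter):
        for i, b in enumerate(betas):              # Metropolis step within each chain
            y = x[i] + sigma * rng.standard_normal()
            ll_y = log_like(y)
            if np.log(rng.uniform()) <= (log_prior(y) + b * ll_y
                                         - (log_prior(x[i]) + b * ll[i])):
                x[i], ll[i] = y, ll_y
        if t % swap_every == 0:                    # propose one adjacent swap
            i = rng.integers(betas.size - 1)
            if np.log(rng.uniform()) <= (betas[i] - betas[i + 1]) * (ll[i + 1] - ll[i]):
                x[i], x[i + 1] = x[i + 1], x[i]
                ll[i], ll[i + 1] = ll[i + 1], ll[i]
        samples.append(x[-1])                      # record the beta = 1 state
    return np.array(samples)

chain = pt_mcmc()   # visits both modes; a single beta = 1 chain typically gets stuck
```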

MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Making a good choice for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set (see the sketch after this list) or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
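For item 3, one common remedy (sketched here under the assumption that samples from a pilot run are available; this is a standard trick, not necessarily the talk's exact transform) is to diagonalize the empirical covariance and take Metropolis steps in the rotated, "whitened" coordinates, where the parameters are uncorrelated:

```python
import numpy as np

def whitening_transform(pilot_samples):
    """Build maps between correlated parameters x and whitened coordinates z."""
    cov = np.cov(pilot_samples, rowvar=False)   # empirical covariance from a pilot run
    evals, evecs = np.linalg.eigh(cov)
    L = evecs * np.sqrt(evals)                  # x = mean + L @ z, with z uncorrelated
    mean = pilot_samples.mean(axis=0)
    to_z = lambda x: np.linalg.solve(L, x - mean)
    to_x = lambda z: mean + L @ z
    return to_z, to_x

# Proposals are then made as z' = z + sigma * N(0, I) and mapped back with to_x,
# so a single isotropic proposal sigma serves all parameters at once.
```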

My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.

Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.

Inputs: Data, Model, Prior information.

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: D, M, I; start values: n = no. of iterations, {X_α}init = start parameters, {σ_α}init = start proposal σ's, {β} = tempering levels. Target posterior: p({X_α}|D,M,I).

Adaptive Two Stage Control System:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains (a toy sketch follows below).

Outputs:
- Control system diagnostics
- {X_α} iterations
- Summary statistics
- Best fit model & residuals
- {X_α} marginals
- {X_α} 68.3% credible regions
- p(D|M,I) marginal likelihood for model comparison
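The talk does not spell out the crossover operator itself, so here is only a toy Python sketch of the general idea: splice coordinates from two parameter sets and keep the child only if it improves logprior + loglike.

```python
import numpy as np

rng = np.random.default_rng(3)

def crossover(parent_a, parent_b, log_prob):
    """One-point gene crossover: child takes leading genes from a, the rest from b."""
    k = rng.integers(1, parent_a.size)              # random crossover point
    child = np.concatenate([parent_a[:k], parent_b[k:]])
    # Breed: keep the child only if it beats both parents in logprior + loglike
    if log_prob(child) > max(log_prob(parent_a), log_prob(parent_b)):
        return child
    return None
```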

Adaptive Hybrid MCMC: output at each iteration

Eight parallel tempering Metropolis chains run side by side, with β = 1/T set to 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13 and 0.09. At each iteration every chain outputs its parameters, logprior + β × loglike, and logprior + loglike; parallel tempering swap operations connect adjacent chains.

MCMC adaptive control system:
- Two stage proposal σ control system: anneal the Gaussian proposal σ's, then refine & update them, driven by the error signal = (actual joint acceptance rate − 0.25) (sketched after this list). This effectively defines the burn-in interval.
- Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
- Corr Par (correlated parameter handling).
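The error-signal idea can be sketched as a simple feedback loop. The gain and update interval below are hypothetical (the talk does not give the actual control law): the σ's are widened when acceptances come too easily and shrunk when they are too rare, until the joint acceptance rate settles near 25%.

```python
import numpy as np

def adapt_sigmas(sigmas, accept_count, n_trials, target=0.25, gain=0.5):
    """One control-system update: drive the joint acceptance rate toward 25%."""
    error = accept_count / n_trials - target        # error signal from the slide
    # Multiplicative update: too many acceptances -> widen sigmas (smaller rate),
    # too few acceptances -> shrink sigmas (larger rate)
    return np.asarray(sigmas) * np.exp(gain * error)

# Called every few hundred burn-in iterations; once the error signal stays small
# the sigmas are frozen, which effectively defines the end of burn-in.
```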

- Go to Mathematica support material
- Go to Mathematica version of MCMC
- Quasi-Monte Carlo

Calculation of p(D|M0, I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

\[ p(D \mid M_0, \sigma, I) = (2\pi)^{-N/2} \left( s^2 + \sigma^2 \right)^{-N/2} \exp\!\left[ - \sum_{i=1}^{N} \frac{(d_i - 0)^2}{2 \left( s^2 + \sigma^2 \right)} \right] \]

Model selection results: Bayes factor = 4.5 × 10⁴.
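Since M0 has no free parameters, its marginal likelihood is just the expression above evaluated at the data. A log-space Python sketch, with d the data values and s, σ the two noise terms in the slide's expression (all hypothetical example inputs):

```python
import numpy as np

def log_marginal_M0(d, s, sigma):
    """ln p(D|M0, sigma, I) for the no-signal model, evaluated in log space."""
    d = np.asarray(d, dtype=float)
    var = s**2 + sigma**2                     # combined noise variance
    # ln[(2*pi*var)^(-N/2)] minus the quadratic term; the model predicts 0 everywhere
    return -0.5 * d.size * np.log(2 * np.pi * var) - np.sum((d - 0.0)**2) / (2 * var)
```

Working in log space avoids the underflow that the raw product of N small Gaussian factors would cause.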

Methanol emission in the Sgr A environment

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

[Table of fit results with columns v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K); the numerical entries are not recoverable from this transcript. ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
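A standard recipe for using the full tempering ladder this way (a sketch of the general technique, not necessarily the speaker's exact estimator) is thermodynamic integration: ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the mean log-likelihood over the post burn-in samples of the β chain.

```python
import numpy as np

def log_marginal_from_pt(betas, mean_loglikes):
    """Thermodynamic integration: ln p(D|M,I) = integral of <ln L>_beta over [0, 1]."""
    order = np.argsort(betas)
    b = np.asarray(betas)[order]
    u = np.asarray(mean_loglikes)[order]
    # Trapezoid rule over the tempering ladder; in practice the ladder should be
    # extended toward beta = 0, where the average is taken over the prior alone
    return np.trapz(u, b)
```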

For a copy of this talk please Google Phil Gregory

The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:
\[ W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left( \theta_j^i - \bar{\theta}_j \right)^2 \]

Between-chain variance:
\[ B = \frac{h}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\bar{\theta}} \right)^2 \]

Estimated variance:
\[ \hat{V}(\theta) = \left( 1 - \frac{1}{h} \right) W + \frac{1}{h} B \]

\[ \text{Gelman-Rubin statistic} = \sqrt{ \hat{V}(\theta) / W } \]

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
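A direct Python transcription of these formulas for a single parameter; the `chains` array is a hypothetical input of shape (m, h) holding the post burn-in iterations of the m independent simulations.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter; chains has shape (m, h)."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)                                   # theta-bar_j
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))    # within-chain
    B = h * np.sum((chain_means - chain_means.mean()) ** 2) / (m - 1)   # between-chain
    V_hat = (1 - 1 / h) * W + B / h                                     # estimated variance
    return np.sqrt(V_hat / W)       # should be close to 1.0 (e.g. < 1.05) at convergence
```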

Page 27: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2741

P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)

1 Choose X0 an initial location in the parameter space Set t = 0

2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form

-Sample a Uniform

H0 1

Lrandom variable U

-If U poundp H Y raquo D ILp HXt raquo D IL

acircq HXt raquo YLq H Y raquoXtL

then set Xt+1 = Y

otherwise set Xt+1 = Xt

- Increment t gtThis factor =1

for a symmetric proposal

distribution like a Gaussian

I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)

return

Toy MCMC simulations the efficiency depends on tuning proposal

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 28: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2841

In this example the

posterior probability

distribution consists of two2 dimensional Gaussians

indicated by the contours

Acceptance rate = 95 Acceptance rate = 63

Acceptance rate = 4

Autocorrelation

distributionsrsquos Can be a very difficult challenge for many parameters

return

outline

MCMC parameter samples for

K l d l ith 2 l t

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 29: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 2941

P1

P2

a Kepler model with 2 planets

MNRAS 374 1321 2007

P C Gregory

Title A Bayesian Kepler

Periodogram Detects a

Second Planet in HD 208487

Post burn-inGelman Ruben stat

Parallel tempering MCMCoutlin

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 30: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3041

The simple Metropolis-Hastings MCMC algorithm can run into

difficulties if the probability distribution is multi-modal with widely

separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow

One solution is to run multiple Metropolis-Hastings simulations in

parallel employing probability distributions of the kind

Typical set of β values = 00901502203504806107810

β = 1 corresponds to our desired target distribution The others

correspond to progressively flatter probability distributions

p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L

At intervals a pair of adjacent simulations are chosen at random and

a proposal made to swap their parameter states The swap allows for

an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise

whereas at higher β a configuration is given the chance to refine itself

Final results are based on samples from the β = 1 simulation

Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems

outline

MCMC Technical Difficulties

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

9v Ikm sminus1M FWHM Ikm s

minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm

minus2MTK HKL ν

UL H MHzL FWHM UL Ikm s

minus1M TUL HKL ds96 ds242 s HKL=

νUL H MHzL is the rest frequency of the unidentied

line after removal of the Doppler veocity v Hkm sminus1L

M Stanković ER Seaquist (UofT) S

Leurini (ESO) PGregory (UBC)

S Muehle(JIVE) KMMenten (MPIfR)

g

Optically thin fit to 3 bands

+ unidentified line in 96 GHz band

return

Conclusionsoutline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3941

1 For Bayesian parameter estimation MCMC provides a powerful

means of computing the integrals required to compute posterior

probability density function (PDF) for each model parameter

2 Even though we demonstrated the performance of an MCMC for a

simple spectral line problem with only 4 parameters MCMC

techniques are really most competitive for models with a much larger number of parameters m ge 15

3 Markov chain Monte Carlo analysis produces samples in model

parameter space in proportion to the posterior probability distribution

This is fine for parameter estimation

For model selection we need to determine the proportionality constant

to evaluate the marginal likelihood p(D|Mi I) for each model This is a

much more difficult problem still in search of two good solutions for large m We need two to know if either is valid

One solution is to use the MCMC results from all the parallel

tempering chains spanning a wide range of β values however this

becomes computationally very intensive for m gt 17

For a copy of this talk please Google Phil Gregory

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4041

The rewards of data analysis

lsquoThe universe is full of magical thingspatiently waiting for our wits to grow

sharperrsquo

Eden Philpotts (1862-1960)

Author and playwright

outline

Let q represent one of the model parameters

Gelman-Rubin Statistic

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 4141

Mean withinchain variance W =1

m Hh- 1L

sbquo j=1

m

sbquoi=1

h

Iq j

i- q jecircecirc

M2

Betweenchain variance B =h

m- 1 sbquo j=1

m Hq jecircecirc - q ecircecircL2

Estimated variance V` Hq L = ikjj1-

1

hyzz W+

1

h B

Gelman- Rubin statistic =

$V` Hq LW

The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative

simulations using multiple sequences Hwith discussionL

Statistical Science 7 pp 457 minus 511

Let q represent one of the model parameters

Let q ji

represent the ith

iteration of the jth

of m independent simulation

Extract the last h post burn - in iterations for each simulation

Page 31: Florida Mar 2010

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3141

1 Deciding on the burn-in period

2 Choosing a good choice for the characteristic width

of each proposal distribution one for each model

parameterFor Gaussian proposal distributions this means picking

a set of proposal σrsquos This can be very time consuming

for a large number of different parameters

3 Handling highly correlated parameters

Ans transform parameter set or differential MCMC

4 Deciding how many iterations are sufficient

Ans use Gelman-Rubin Statistic

5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic

My involvement since 2002 ongoing

development of a general Bayesian Nonlinear

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3241

development of a general Bayesian Nonlinear

model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates

-Parallel tempering

-Simulated annealing-Genetic algorithm

-Differential evolution

-Unique control system automates the MCMC

Code is implemented in Mathematica

Current extra-solar planet applications

-precision radial velocity data ndash (4 new planets published to date)

-pulsar planets from timing residuals of NGC 6440C

-NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines

Mathematica 7 (latest version) provides an easy route to parallel computing

I run on an 8 core PC and achieve a speed-up of 7 times

outline

Bli d h i h h b id MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3341

Blind searches with hybrid MCMC

Parallel tempering

Simulated annealing

Genetic algorithmDifferential evolution

Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four

in a hybrid MCMC we greatly increase the probability of

realizing this goal

Data Model Prior information

MCMC details outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3441

Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system

that automates the selection of Gaussian proposal distribution σrsquos

Hybridparallel tempering

MCMCNonlinear modelfitting program

D M I

Target Posterior pH8XaltraquoDMIL

Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal

distribution ss using an annealing operation

2L Monitors MCMC for emergence of significantly improved

parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains

n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels

- Control systemdiagnostics

- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals

- 8Xalt 683 credible regions

- pHDraquoMIL marginal likelihoodfor model comparison

1

outlin

Output at each iterationAdaptive Hybrid MCMC

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3541

8 parallel tempering Metropolis chainsOutput at each iteration

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike

parameters logprior + b acirc loglike logprior + loglike

Monitor for

parameterswith peak

probabilityAnneal Gaussian

proposal srsquos

Refine amp update

Gaussian

proposal srsquos

2 stage proposal s control system

error signal =

(actual joint acceptance rate ndash 025)

Effectively defines burn-in interval

Genetic algorithm

Every 10th iteration perform gene

crossover operation to breed larger (logprior + loglike) parameter set

Peak parameter setIf (logprior + loglike) gt

previous best by a

threshold then update

and reset burn-in

β = 1 T

Parallel tempering

swap operations

MCMC adaptive control system

= 10

= 072

= 052

= 039

= 029

= 020= 013

= 009

β

β

β

β

β

ββ

β

Corr Par

outline

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3641

Go to Mathematica support material

Go to Mathematica version of MCMC

Quasi-Monte Carlo

outline

Calculation of p(D|M 0 I)

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3741

Model M 0 assumes the spectrum is consistent with noise and has no

free parameters so we can write

Model selection results

p H D M 0 s I L = H2 p L- N 2 Js2+ s

2 N-N

2 ExpC- sbquoi = 1

N Hd i - 0 L2 Is 2 + s2 M

G

Bayes factor =45x104

Methanol emission inthe Sgr A environment

out ne

842019 Florida Mar 2010

httpslidepdfcomreaderfullflorida-mar-2010 3841

Optically thin fit to 3 bands + unidentified line in 96 GHz band

[Results table: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K); numerical entries not recoverable from this copy.]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Conclusions


1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mᵢ,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. We need two to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory.
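For reference, this parallel-tempering route is usually written as thermodynamic integration, ln p(D|M,I) = ∫₀¹ ⟨ln L⟩β dβ. A minimal sketch, assuming the post burn-in average of loglike has already been computed within each chain (in practice the β ladder must extend much closer to β = 0 than the eight levels shown earlier):

    import numpy as np

    def log_marginal_TI(betas, mean_loglike):
        # Thermodynamic integration: ln p(D|M,I) is the integral over beta in [0,1]
        # of <ln L>_beta, evaluated here with the trapezoid rule on the ladder.
        b = np.asarray(betas, dtype=float)
        y = np.asarray(mean_loglike, dtype=float)
        order = np.argsort(b)
        return np.trapz(y[order], b[order])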


The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960)
Author and playwright

Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θⱼⁱ represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:
W = (1 / [m (h − 1)]) Σⱼ₌₁ᵐ Σᵢ₌₁ʰ (θⱼⁱ − θ̄ⱼ)²

Between-chain variance:
B = (h / (m − 1)) Σⱼ₌₁ᵐ (θ̄ⱼ − θ̿)²

Estimated variance:
V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = √( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref.: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
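A direct transcription of these formulas, assuming the m × h array of post burn-in samples for one parameter has already been assembled:

    import numpy as np

    def gelman_rubin(chains):
        # chains: shape (m, h) - m independent simulations, last h post burn-in
        # iterations of one parameter each.
        chains = np.asarray(chains, dtype=float)
        m, h = chains.shape
        means = chains.mean(axis=1)                          # per-chain means, theta_bar_j
        grand = means.mean()                                 # overall mean
        W = np.sum((chains - means[:, None])**2) / (m * (h - 1))
        B = h * np.sum((means - grand)**2) / (m - 1)
        V_hat = (1.0 - 1.0 / h) * W + B / h
        return np.sqrt(V_hat / W)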

