Date posted: 07-Apr-2018 · Category: Documents · Uploaded by: valerio-marra
Florida, Mar 2010 — https://slidepdf.com/reader/full/florida-mar-2010 (41 slides)
Introduction to Markov chain Monte Carlo (MCMC)
and its role in modern Bayesian analysis
Phil Gregory
University of British Columbia
March 2010
Outline
1. Bayesian primer
2. Spectral line problem: the challenge of nonlinear models
3. Introduction to Markov chain Monte Carlo (MCMC): parallel tempering, hybrid MCMC
4. Mathematica MCMC demonstration
5. Conclusions
What is Bayesian Probability Theory (BPT)?
BPT = a theory of extended logic.
Deductive logic is based on axiomatic knowledge. In science we never know that any theory of nature is true, because our reasoning is based on incomplete information. Our conclusions are at best probabilities. Any extension of logic to deal with situations of incomplete information (the realm of inductive logic) requires a theory of probability.
A new perception of probability has arisen with the recognition that the mathematical rules of probability are not merely rules for manipulating random variables. They are now recognized as valid principles of logic for conducting inference about any hypothesis of interest.
This view of "Probability Theory as Logic" was championed in the late 20th century by E. T. Jaynes ("Probability Theory: The Logic of Science", Cambridge University Press, 2003).
It is also commonly referred to as Bayesian Probability Theory, in recognition of the work of the 18th century English clergyman and mathematician Thomas Bayes.
Logic is concerned with the truth of propositions
A proposition asserts that something is true
We will need to consider compound propositions:
A,B asserts that propositions A and B are both true
A,B|C asserts that propositions A and B are true, given that proposition C is true
Rules for manipulating probabilities
Sum rule:      p(A|C) + p(Ā|C) = 1
Product rule:  p(A,B|C) = p(A|C) p(B|A,C) = p(B|C) p(A|B,C)
Bayes' theorem:  p(A|B,C) = p(A|C) p(B|A,C) / p(B|C)
How to proceed in a Bayesian analysis?
Write down Bayes' theorem, identify the terms, and solve.

    p(Hi|D,I) = p(Hi|I) p(D|Hi,I) / p(D|I)

The left-hand side is the posterior probability that Hi is true, given the new data D and prior information I; on the right are the prior probability p(Hi|I), the likelihood p(D|Hi,I), and the normalizing constant p(D|I).
The likelihood p(D|Hi,I), also written as L(Hi), stands for the probability that we would have gotten the data D that we did if Hi is true.
Every item to the right of the vertical bar | is assumed to be true.
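As a concrete illustration of this discrete update (a sketch not in the original slides, and in Python rather than the talk's Mathematica; the prior and likelihood numbers are made up):

```python
# Bayes' theorem for a discrete hypothesis space:
#   p(Hi|D,I) = p(Hi|I) * p(D|Hi,I) / p(D|I),
# where p(D|I) = sum_i p(Hi|I) * p(D|Hi,I).

def posterior(priors, likelihoods):
    """Return normalized posterior probabilities."""
    joint = [p * L for p, L in zip(priors, likelihoods)]
    norm = sum(joint)              # p(D|I), the normalizing constant
    return [j / norm for j in joint]

# Hypothetical numbers: two hypotheses with equal priors;
# the data are 4x more probable under H1 than under H0.
post = posterior([0.5, 0.5], [0.2, 0.8])
print(post)  # [0.2, 0.8]
```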
As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.
Two basic problems:
1. Model selection (discrete hypothesis space): "Which one of 2 or more models (hypotheses) is most probable, given our current state of knowledge?"
e.g.
• Hypothesis or model M0 asserts that the star has no planets
• Hypothesis M1 asserts that the star has 1 planet
• Hypothesis Mi asserts that the star has i planets
2. Parameter estimation (continuous hypothesis space): "Assuming the truth of M1, solve for the probability density distribution for each of the model parameters, based on our current state of knowledge."
e.g.
• Hypothesis H asserts that the orbital period is between P and P+dP
Significance of this development
Probabilities are commonly quantified by a real number between 0 and 1.
    0 (false) ——— realm of science and inductive logic ——— 1 (true)
The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers. Bayesian probability theory spans the whole range. Deductive logic is just a special case of Bayesian probability theory in the idealized limit of complete information.
Calculation of a simple likelihood p(D|M,X,I)
Let d_i represent the i-th measured data value. We model d_i by
    d_i = f_i(X) + e_i
where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement.
Since M,X is assumed to be true, if it were not for the error e_i, d_i would equal the model prediction f_i(X).
Now suppose prior information I indicates that e_i has a Gaussian probability distribution. Then
    p(D_i|M,X,I) = (1 / (σ_i √(2π))) exp(−e_i² / (2σ_i²))
                 = (1 / (σ_i √(2π))) exp(−(d_i − f_i(X))² / (2σ_i²))
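A short Python sketch of this single-datum Gaussian likelihood (illustrative only; the data value, prediction, and σ_i are made up):

```python
import math

def gaussian_likelihood(d_i, f_i, sigma_i):
    """p(D_i|M,X,I) for a Gaussian error e_i = d_i - f_i(X)."""
    e_i = d_i - f_i
    return math.exp(-e_i**2 / (2 * sigma_i**2)) / (sigma_i * math.sqrt(2 * math.pi))

# The likelihood peaks when the datum equals the model prediction,
# where it takes the value 1/(sigma_i * sqrt(2*pi)).
print(gaussian_likelihood(1.0, 1.0, 0.5))
```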
p(D_i|M,X,I) is proportional to the line height.
[Figure: a Gaussian error curve centered on the predicted value f_i(X), with the measured d_i a distance e_i away; x-axis: signal strength, y-axis: probability density.]
The probability of getting a data value d_i a distance e_i away from the predicted value f_i(X) is proportional to the height of the Gaussian error curve at that location.
Calculation of a simple likelihood p(D|M,X,I), continued
For independent data, the likelihood for the entire data set D = (D1, D2, …, DN) is the product of N Gaussians:
    p(D|M,X,I) = (2π)^(−N/2) [∏_{i=1..N} σ_i^(−1)] exp(−0.5 Σ_{i=1..N} (d_i − f_i(X))² / σ_i²)
The sum in the exponent is the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².
Recall: Bayesian posterior ∝ prior × likelihood.
Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
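The likelihood-χ² relation can be checked numerically; a Python sketch with made-up data and σ's:

```python
import math

def log_likelihood(d, f, sigma):
    """ln p(D|M,X,I) for independent Gaussian errors."""
    n = len(d)
    chi2 = sum((di - fi)**2 / s**2 for di, fi, s in zip(d, f, sigma))
    const = -0.5 * n * math.log(2 * math.pi) - sum(math.log(s) for s in sigma)
    return const - 0.5 * chi2   # maximizing this = minimizing chi2

d = [1.2, 0.8, 1.1]
sigma = [0.1, 0.1, 0.1]
# A model prediction closer to the data has lower chi2, hence higher likelihood.
print(log_likelihood(d, [1.0] * 3, sigma), log_likelihood(d, [2.0] * 3, sigma))
```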
Simple example of when not to use a uniform prior
In the exoplanet problem, the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (the upper limit set by perturbations from neighboring stars).
Suppose we assume a uniform prior probability density for the parameter P. This would imply that we believed it ~10⁴ times more probable that the true period was in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:
    ∫[10⁴,10⁵] p(P|M,I) dP = 10⁴ × ∫[1,10] p(P|M,I) dP
Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)
    p(P|M,I) dP = dP / (P ln(P_max/P_min))
or equivalently
    p(ln P|M,I) d ln P = d ln P / ln(P_max/P_min)
This gives equal probability per decade:
    ∫[10⁴,10⁵] p(P|M,I) dP = ∫[1,10] p(P|M,I) dP
Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:
    p(ln f|M,I) d ln f = d ln f / ln(f_max/f_min)
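A quick numerical check of the equal-probability-per-decade property (Python sketch; the range 1 to 10⁵ d is taken from the example above):

```python
import math

def jeffreys_prob(a, b, p_min=1.0, p_max=1e5):
    """Probability that P lies in [a, b] under a Jeffreys prior on [p_min, p_max]."""
    return math.log(b / a) / math.log(p_max / p_min)

# Each decade carries the same probability (1/5 of the range here),
# whereas a uniform prior would put ~10^4 times more mass in the top decade.
print(jeffreys_prob(1, 10), jeffreys_prob(1e4, 1e5))  # both ≈ 0.2
```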
Integration, not minimization
A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations (i.e. Monte Carlo simulations).
Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
End of Bayesian primer
Simple Spectral Line Problem
Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory, based on our prior information and some new data.
Theory 1 is unique in that it predicts the existence of a new short-lived baryon, which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.
Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data
To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea, and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.
All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem
The predicted line shape has the form T f_i(ν₀, σ_L), a Gaussian line profile
    f_i(ν₀, σ_L) = exp(−(ν_i − ν₀)² / (2 σ_L²))
where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν₀.
In this version of the problem T, ν₀, σ_L are all unknowns, with prior limits:
    T = 0.0 – 100.0
    ν₀ = 1 – 44
    σ_L = 0.5 – 4.0
Extra noise term e₀ᵢ
We will represent the measured data by the equation
    d_i = f_i + e_i + e₀ᵢ
d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e₀ᵢ = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i
In the absence of detailed knowledge of the sampling distribution for e₀ᵢ, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e. maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for e₀ᵢ with a variance s². Thus the combination e_i + e₀ᵢ has a Gaussian distribution with variance = σ_i² + s².
In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s: 0 – 0.5 × data range.
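A Python sketch of marginalizing the extra-noise parameter s on a grid (illustrative: the data, model, σ's, and prior range are all made up, and a simple trapezoid rule stands in for the integral):

```python
import math

def likelihood(d, f, sigma, s):
    """Gaussian likelihood with combined per-point variance sigma_i^2 + s^2."""
    out = 1.0
    for di, fi, sig in zip(d, f, sigma):
        var = sig**2 + s**2
        out *= math.exp(-(di - fi)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return out

def marginal_likelihood(d, f, sigma, s_max, n=1000):
    """Integrate likelihood * uniform prior over s in [0, s_max] (trapezoid rule)."""
    ds = s_max / n
    vals = [likelihood(d, f, sigma, i * ds) / s_max for i in range(n + 1)]
    return ds * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])

d = [0.1, -0.2, 0.3, 1.5]      # last point is poorly fit by the model f = 0
f = [0.0, 0.0, 0.0, 0.0]
sigma = [0.2] * 4
# Allowing s > 0 lets the analysis absorb the unexplained scatter.
print(marginal_likelihood(d, f, sigma, s_max=1.0))
```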
Questions of interest
Based on our current state of information, which includes just the above prior information and the measured spectrum:
1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?
Hypothesis space of interest for the model selection part:
M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"
M1 has 3 unknown parameters — the line temperature T, ν₀, σ_L — and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model
In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood
    p(D|M1,T,I) = (2π)^(−N/2) σ^(−N) exp(−Σ_{i=1..N} (d_i − T f_i)² / (2σ²))
Our new likelihood for the more complicated model, with unknown variables T, ν₀, σ_L, s, is
    p(D|M1,T,ν₀,σ_L,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp(−Σ_{i=1..N} (d_i − T f_i(ν₀,σ_L))² / (2(σ² + s²)))
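Putting the pieces together, a Python sketch of this likelihood (the Gaussian profile form and all numbers are illustrative assumptions; in the talk σ = 1 mK and there are 64 channels):

```python
import math

def line_profile(nu, nu0, sigma_L):
    """Unit-amplitude Gaussian line shape at channel nu (assumed form)."""
    return math.exp(-(nu - nu0)**2 / (2 * sigma_L**2))

def log_likelihood(d, T, nu0, sigma_L, s, sigma=1.0):
    """ln p(D|M1, T, nu0, sigma_L, s, I) with combined variance sigma^2 + s^2."""
    n = len(d)
    var = sigma**2 + s**2
    chi2 = sum((d[i] - T * line_profile(i + 1, nu0, sigma_L))**2 for i in range(n)) / var
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * chi2

# Noise-free simulated 64-channel spectrum: a 3 mK line centered on channel 20.
d = [3.0 * line_profile(i + 1, 20.0, 2.0) for i in range(64)]
# The likelihood is higher at the true parameters than at a wrong line center.
print(log_likelihood(d, 3.0, 20.0, 2.0, 0.0), log_likelihood(d, 3.0, 30.0, 2.0, 0.0))
```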
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.
[Figure: posterior densities narrowing around the true value of α as N increases.]
Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration, not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.
Parameter estimation: to find the marginal posterior probability density function (PDF) for one parameter, we integrate the joint posterior over all the other parameters. For example, the marginal PDF for T:
    p(T|D,M1,I) = ∫dν₀ ∫dσ_L ∫ds p(T,ν₀,σ_L,s|D,M1,I)
where p(T,ν₀,σ_L,s|D,M1,I) is the joint posterior PDF for the parameters, built from the data, model, and prior.
Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations (i.e. Monte Carlo simulations). Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
Numerical tools
Numerical tools for Bayesian model fitting, p(D|M,I):
• Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra.
• Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Otherwise:
  – Brute force integration and peak finding algorithms: (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithms (chapter 10)
  – Asymptotic approximations: Laplace approximations (chapter 11)
  – Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
  – High dimensions: MCMC (chapter 12)
Chapters:
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)
Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.
After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior. It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.
The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: the Metropolis-Hastings MCMC algorithm
P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).
1. Choose X₀, an initial location in the parameter space. Set t = 0.
2. Repeat:
   – Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   – Sample a Uniform(0,1) random variable U.
   – If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   – Increment t.
The second factor equals 1 for a symmetric proposal distribution, like a Gaussian. I use a Gaussian proposal distribution, i.e. a normal distribution N(X_t, σ).
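The steps above translate directly into a compact Python sketch, using a symmetric Gaussian proposal so the q-ratio drops out. The target density and tuning numbers are illustrative:

```python
import math
import random

def metropolis(log_target, x0, sigma_prop, n_steps, seed=1):
    """Metropolis sampler: symmetric Gaussian proposal, so the Hastings ratio is 1."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, sigma_prop)      # propose Y ~ N(X_t, sigma)
        logp_y = log_target(y)
        # Accept with probability min(1, p(Y)/p(X_t)); else X_{t+1} = X_t.
        if logp_y >= logp or rng.random() < math.exp(logp_y - logp):
            x, logp = y, logp_y
        samples.append(x)
    return samples

# Target: a standard normal posterior (log density up to a constant).
chain = metropolis(lambda x: -0.5 * x * x, x0=5.0, sigma_prop=1.0, n_steps=20000)
post_burn = chain[2000:]                        # discard burn-in
mean = sum(post_burn) / len(post_burn)
print(mean)                                     # should be near 0
```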
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ. This can be a very difficult challenge for many parameters.
In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.
[Figure: three toy chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]
MCMC parameter samples for a Kepler model with 2 planets
[Figure: post burn-in samples of the orbital periods P1 and P2; convergence assessed with the Gelman-Rubin statistic.]
P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007
Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.
One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind
    p(X|D,M,β,I) ∝ p(X|M,I) p(D|X,M,I)^β,    0 < β ≤ 1
Typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0
β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.
At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.
Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
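A minimal Python sketch of parallel tempering built on the Metropolis step: each chain targets likelihood^β (flat prior assumed for simplicity), and adjacent chains occasionally propose a state swap. The bimodal target, β ladder, and tuning are all illustrative:

```python
import math
import random

def log_like(x):
    """Hypothetical bimodal log-likelihood: Gaussian peaks at -4 and +4."""
    a, b = -0.5 * (x - 4.0)**2, -0.5 * (x + 4.0)**2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # log-sum-exp, safe far out

def parallel_tempering(betas, n_steps, sigma_prop=1.0, seed=2):
    rng = random.Random(seed)
    xs = [0.0] * len(betas)
    target = []                                    # samples from the beta = 1 chain
    for step in range(n_steps):
        for k, b in enumerate(betas):              # one Metropolis update per chain
            y = xs[k] + rng.gauss(0.0, sigma_prop)
            dl = b * (log_like(y) - log_like(xs[k]))
            if dl >= 0 or rng.random() < math.exp(dl):
                xs[k] = y
        if step % 10 == 0:                         # propose a swap of adjacent chains
            k = rng.randrange(len(betas) - 1)
            d = (betas[k] - betas[k + 1]) * (log_like(xs[k + 1]) - log_like(xs[k]))
            if d >= 0 or rng.random() < math.exp(d):
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
        target.append(xs[-1])
    return target

chain = parallel_tempering([0.1, 0.3, 1.0], 20000)
# The flattened (low-beta) chains cross between the modes easily, and swaps
# pass those configurations up the ladder to the beta = 1 chain.
print(min(chain), max(chain))
```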
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
– Parallel tempering
– Simulated annealing
– Genetic algorithm
– Differential evolution
– A unique control system that automates the MCMC
The code is implemented in Mathematica.
Current extra-solar planet applications:
– precision radial velocity data (4 new planets published to date)
– pulsar planets from timing residuals of NGC 6440C
– NASA stellar interferometry mission astrometry testing
Also: submillimeter radio spectroscopy of galactic center methanol lines.
Mathematica 7 (the latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC
– Parallel tempering
– Simulated annealing
– Genetic algorithm
– Differential evolution
Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
MCMC details
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.
[Diagram: hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; target posterior p({X}|D,M,I); n = number of iterations; {X}_init = start parameters; {σ}_init = start proposal σ's; {β} = tempering levels.
Adaptive two-stage control system: (1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; (2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.
Outputs: control system diagnostics; {X} iterations; summary statistics; best fit model and residuals; {X} marginals; {X} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.]
Adaptive hybrid MCMC: output at each iteration
Eight parallel tempering Metropolis chains, β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T), each outputting at every iteration: parameters, logprior + β × loglike, logprior + loglike. Parallel tempering swap operations exchange states between chains.
MCMC adaptive control system:
– Two-stage proposal-σ control system: error signal = (actual joint acceptance rate − 0.25). First anneal the Gaussian proposal σ's, then refine and update them. This effectively defines the burn-in interval.
– Monitor for parameters with peak probability.
– Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
– Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, update it and reset the burn-in.
– Handling of correlated parameters.
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0,I)
Model selection results
Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write
    p(D|M0,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp(−Σ_{i=1..N} (d_i − 0)² / (2(σ² + s²)))
Bayes factor = 4.5 × 10⁴ in favor of M1, the line model.
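The null model's marginal likelihood can be sketched in Python by integrating out the nuisance parameter s (the data values, prior range, and grid are made up; the talk's Bayes factor of 4.5 × 10⁴ comes from the real 64-channel spectrum):

```python
import math

def like_M0(d, s, sigma=1.0):
    """p(D|M0,s,I): pure-noise likelihood with combined variance sigma^2 + s^2."""
    var = sigma**2 + s**2
    n = len(d)
    return (2 * math.pi * var)**(-n / 2) * math.exp(-sum(x * x for x in d) / (2 * var))

def marginal_M0(d, s_max=2.0, n_grid=400):
    """p(D|M0,I): marginalize s over a uniform prior [0, s_max] (trapezoid rule)."""
    ds = s_max / n_grid
    vals = [like_M0(d, i * ds) / s_max for i in range(n_grid + 1)]
    return ds * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])

d = [0.3, -1.1, 0.4, 0.9, -0.2]   # hypothetical 5-channel, noise-like spectrum
print(marginal_M0(d))
```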
Methanol emission in the Sgr A environment
[Table: optically thin model fit results — columns include v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]
M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)
Optically thin fit to 3 bands + unidentified line in the 96 GHz band
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.
2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.
3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.
For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).
One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
For a copy of this talk, please Google: Phil Gregory
The rewards of data analysis
"The universe is full of magical things patiently waiting for our wits to grow sharper."
Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin statistic
Let θ represent one of the model parameters, and let θ_j^(i) represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation, and let θ̄_j be the mean of chain j and θ̄ the grand mean over all chains.
Mean within-chain variance:
    W = (1 / (m(h−1))) Σ_{j=1..m} Σ_{i=1..h} (θ_j^(i) − θ̄_j)²
Between-chain variance:
    B = (h / (m−1)) Σ_{j=1..m} (θ̄_j − θ̄)²
Estimated variance:
    V̂(θ) = (1 − 1/h) W + (1/h) B
Gelman-Rubin statistic:
    R̂ = √(V̂(θ) / W)
The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.
Ref: Gelman, A. and D. B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
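The formulas above translate directly into a short Python sketch (the two chains are hypothetical hand-made samples):

```python
import math

def gelman_rubin(chains):
    """R-hat for one parameter, from m chains of h post burn-in samples each."""
    m, h = len(chains), len(chains[0])
    means = [sum(c) / h for c in chains]
    grand = sum(means) / m
    W = sum(sum((x - mu)**2 for x in c) for c, mu in zip(chains, means)) / (m * (h - 1))
    B = h * sum((mu - grand)**2 for mu in means) / (m - 1)
    V = (1 - 1 / h) * W + B / h            # estimated variance V-hat
    return math.sqrt(V / W)

# Two well-mixed chains drawn around the same mean give R-hat near 1;
# chains stuck in different regions would give R-hat well above 1.
c1 = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, 0.2, -0.1]
c2 = [-0.05, 0.25, -0.3, 0.1, 0.05, -0.2, 0.15, 0.0]
print(gelman_rubin([c1, c2]))
```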
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 241
Outline
1 Bayesian primer
2 Spectral line problemChallenge of nonlinear models
3 Introduction to Markov chain Monte Carlo (MCMC)
Parallel temperingHybrid MCMC
4 Mathematica MCMC demonstration
5 Conclusions
1
2
3
4
5
6
7
8
Methanol Occam
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 341
outline
What is Bayesian Probability Theory
(BPT)
BPT = a theory of extended logic
Deductive logic is based on Axiomatic knowledge
In science we never know any theory of nature is true because
our reasoning is based on incomplete information
Our conclusions are at best probabilities
Any extension of logic to deal with situations of incompleteinformation (realm of inductive logic) requires a theory of
probability
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 441
outline
A new perception of probability has arisen in recognition that
the mathematical rules of probability are not merely rules for
manipulating random variables
They are now recognized as valid principles of logic for
conducting inference about any hypothesis of interest
This view of ``Probability Theory as Logic was championed
in the late 20th century by E T JaynesldquoProbability Theory The Logic of Sciencerdquo
Cambridge University Press 2003
It is also commonly referred to as Bayesian Probability Theory
in recognition of the work of the 18th century English
clergyman and Mathematician Thomas Bayes
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 541
outline
Logic is concerned with the truth of propositions
A proposition asserts that something is true
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 641
outline
We will need to consider compound propositions like
AB which asserts that propositions A and B are true
AB|C asserts that propositions A and B are true
given that proposition C is true
Rules for manipulating probabilities
Sum rule p A C + p A macrmacr
C = 1
Product rule p A B C = p A C p B A C
= p B C
p A B C
Bayes theorem
p A B C =
p A C p B A C
p B C
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 741
outline
How to proceed in a Bayesian analysis
Write down Bayesrsquo theorem identify the terms and solve
The likelihood p(D| Hi
I) also written as (Hi
) stands for
the probability that we would have gotten the data D that we
did if Hi is true
Every item to the right of the
vertical bar | is assumed to be true
p H i D I = p H i I acirc p D H i I p D I
Posterior probability
that Hi is true given
the new data D and
prior information I
Prior probability Likelihood
Normalizing constant
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 841
As a theory of extended logic BPT can be used to find optimal
answers to well posed scientific questions for a given state of
knowledge in contrast to a numerical recipe approach
outline
Two basic problems
1 Model selection (discrete hypothesis space)
ldquoWhich one of 2 or more models (hypotheses) is most probable
given our current state of knowledgerdquo
eg
bull Hypothesis or model M0 asserts that the star has no planets
bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets
2 Parameter estimation (continuous hypothesis)
ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our
current state of knowledgerdquo
egbull Hypothesis H asserts that the orbital period is between P and P+dP
S f foutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 941
Significance of this developmentoutline
Probabilities are commonly quantified by a real number between 0 and 1
0 1Realm of science
and inductive logic
truefalse
The end-points corresponding to absolutely false and absolutely true
are simply the extreme limits of this infinity of real numbers
Bayesian probability theory spans the whole range
Deductive logic is just a special case of Bayesian probability
theory in the idealized limit of complete information
Occam
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1041
Let d i represent the i th measured data value We model d i by
outline
Calculation of a simple Likelihood
Model prediction for i th data value
for current choice of parameters
p D M X I
where ei represents the error component in the measurement
d i = f i X + ei
X
Since is assumed to be true if it were not for the
error ei d i would equal the model prediction f i
p Di M X I =
1
s i 2 p Exp-
ei 2
2s i 2
=
1
s i 2 p Exp -
d i - f i X 2
2 s i 2
Now suppose prior information I indicates that ei has a Gaussian
probability distribution Then
M X
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
Calculation of a simple likelihood, p(D|M,X,I), continued

For independent data the likelihood for the entire data set D = (D_1, D_2, ..., D_N) is the product of N Gaussians:

    p(D|M,X,I) = (2\pi)^{-N/2} \left[\prod_{i=1}^{N} \sigma_i^{-1}\right] \exp\left[-\frac{1}{2} \sum_{i=1}^{N} \frac{(d_i - f_i(X))^2}{\sigma_i^2}\right]

The sum in the exponent is the familiar \chi^2 statistic used in least-squares, so maximizing the likelihood corresponds to minimizing \chi^2.

Recall: Bayesian posterior \propto prior \times likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
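A minimal numerical sketch of the likelihood-chi-squared connection (illustrative Python; the data values are made up):

```python
import numpy as np

def log_likelihood(d, f, sigma):
    """ln p(D|M,X,I) for N independent data with Gaussian errors."""
    chi2 = np.sum((d - f)**2 / sigma**2)            # the familiar chi-squared
    norm = -0.5 * len(d) * np.log(2 * np.pi) - np.sum(np.log(sigma))
    return norm - 0.5 * chi2

d = np.array([1.2, 0.9, 1.4])
sigma = np.ones(3)
f_good = np.full(3, 1.1)   # closer to the data -> smaller chi2, larger likelihood
f_bad = np.full(3, 2.0)    # farther away -> larger chi2, smaller likelihood
```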
Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed it was ~10^4 times more probable that the true period was in the upper decade (10^4 to 10^5 d) of the prior range than in the lowest decade, from 1 to 10 d:

    \frac{\int_{10^4}^{10^5} p(P|M,I)\,dP}{\int_{1}^{10} p(P|M,I)\,dP} = 10^4

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)

    p(P|M,I)\,dP = \frac{dP}{P \ln(P_{\max}/P_{\min})}

or equivalently

    p(\ln P|M,I)\,d\ln P = \frac{d\ln P}{\ln(P_{\max}/P_{\min})}

Equal probability per decade:

    \int_{1}^{10} p(P|M,I)\,dP = \int_{10^4}^{10^5} p(P|M,I)\,dP

Actually there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

    p(\ln f|M,I)\,d\ln f = \frac{d\ln f}{\ln(f_{\max}/f_{\min})}

Modified Jeffreys frequency prior
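A quick numerical check of the equal-probability-per-decade property of the Jeffreys prior (illustrative Python sketch; the function name is my own):

```python
import numpy as np

def jeffreys_prob(lo, hi, p_min=1.0, p_max=1e5):
    """Mass the Jeffreys prior p(P|M,I) = 1/(P ln(Pmax/Pmin)) assigns to [lo, hi]."""
    return np.log(hi / lo) / np.log(p_max / p_min)

low_decade = jeffreys_prob(1.0, 10.0)     # lowest decade, 1-10 d
high_decade = jeffreys_prob(1e4, 1e5)     # highest decade, 1e4-1e5 d
# Each of the five decades in [1, 1e5] receives mass 1/5 = 0.2
```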
Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e. Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

[Figure: the measured 64-channel spectrum.]

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form

    f_i(\nu_0, \sigma_L) = \exp\left[-\frac{(\nu_i - \nu_0)^2}{2\sigma_L^2}\right]

(the line profile for a given \nu_0, \sigma_L), where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency \nu_i is in units of the spectrometer channel number and the line center frequency is \nu_0.

In this version of the problem T, \nu_0, \sigma_L are all unknowns, with prior limits:

    T = 0.0 - 100.0
    \nu_0 = 1 - 44
    \sigma_L = 0.5 - 4.0
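Assuming a Gaussian line profile f_i = exp(−(ν_i − ν_0)²/(2σ_L²)) — my reading of the slide's profile graphic — the line model can be sketched in Python (illustrative; parameter values below are arbitrary):

```python
import numpy as np

def line_model(nu, T, nu0, sigma_L):
    """Predicted signal T*f_i with an assumed Gaussian line profile."""
    return T * np.exp(-(nu - nu0)**2 / (2 * sigma_L**2))

channels = np.arange(1, 65)                     # the 64 spectrometer channels
model = line_model(channels, T=2.0, nu0=20.0, sigma_L=1.5)
```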
Extra noise term e_0i

We will represent the measured data by the equation

    d_i = f_i + e_i + e_0i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e. maximally noncommittal about the information we don't have). We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination e_i + e_0i has a Gaussian distribution with variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 to 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters — the line temperature T, \nu_0, \sigma_L — and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

    p(D|M_1,T,I) = (2\pi)^{-N/2}\,\sigma^{-N} \exp\left[-\frac{\sum_{i=1}^{N} (d_i - T f_i)^2}{2\sigma^2}\right]

Our new likelihood for the more complicated model with unknown variables T, \nu_0, \sigma_L, s is

    p(D|M_1,T,\nu_0,\sigma_L,s,I) = (2\pi)^{-N/2}\,(\sigma^2+s^2)^{-N/2} \exp\left[-\frac{\sum_{i=1}^{N} \left(d_i - T f_i(\nu_0,\sigma_L)\right)^2}{2(\sigma^2+s^2)}\right]
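A Python sketch of this log-likelihood, assuming a Gaussian line profile for f_i (an assumption on my part; the slide defines f_i graphically):

```python
import numpy as np

def loglike_line(d, nu, T, nu0, sigma_L, s, sigma=1.0):
    """ln p(D|M1, T, nu0, sigma_L, s, I) with total variance sigma**2 + s**2."""
    f = np.exp(-(nu - nu0)**2 / (2 * sigma_L**2))   # assumed Gaussian profile
    var = sigma**2 + s**2
    return (-0.5 * len(d) * np.log(2 * np.pi * var)
            - np.sum((d - T * f)**2) / (2 * var))

nu = np.arange(1, 65)
d = 2.0 * np.exp(-(nu - 20.0)**2 / (2 * 1.5**2))    # noise-free synthetic line
```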
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima. [Figure: the four posterior densities, with the true value marked.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization
In least-squares analysis we minimize some statistic like \chi^2. In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter such as T, we need to integrate the joint posterior over all the other parameters:

    p(T|D,M_1,I) = \int d\nu_0 \int d\sigma_L \int ds\; p(T,\nu_0,\sigma_L,s|D,M_1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e. Monte Carlo simulations.
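Once MCMC samples of the joint posterior are available, marginalization reduces to ignoring columns: a histogram of the T samples estimates p(T|D,M_1,I). An illustrative sketch with synthetic (not real) samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for post burn-in MCMC samples of (T, nu0) from the joint posterior:
samples = rng.normal(loc=[2.0, 20.0], scale=[0.3, 1.0], size=(20000, 2))

# Marginalizing over nu0 = simply ignoring that column of the samples
T_samples = samples[:, 0]
hist, edges = np.histogram(T_samples, bins=50, density=True)  # marginal PDF for T
T_mean = T_samples.mean()
```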
Numerical tools
Numerical tools for Bayesian model fitting, p(D|M,I):

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra.

Nonlinear models, plus linear models with non-uniform priors: the posterior may have multiple peaks. Options:
- Brute force integration
- Asymptotic approximations: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithms) plus Laplace approximations
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC

For some parameters analytic integration is sometimes possible. (These cases are treated in chapters 10, 11, and 12 of the book.)
Chapters
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded) the MCMC produces an equilibrium distribution of samples in parameter space such that the density of samples is proportional to the joint posterior. It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point Metropolis-Hastings MCMC algorithm
P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U \le \frac{p(Y|D,I)}{p(X_t|D,I)} \times \frac{q(X_t|Y)}{q(Y|X_t)}, then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian.

I use a Gaussian proposal distribution, i.e. a normal distribution N(X_t, σ).
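A minimal Python implementation of this algorithm for a one-parameter problem, using the symmetric Gaussian proposal so the q ratio is 1 (illustrative sketch, not the talk's Mathematica code):

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_steps, seed=0):
    """Metropolis MCMC with a symmetric Gaussian proposal N(X_t, sigma)."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal()        # Y ~ q(Y|X_t)
        u = rng.uniform()                            # U ~ Uniform(0, 1)
        if np.log(u) <= log_post(y) - log_post(x):   # accept with prob p(Y)/p(X_t)
            x = y                                    # X_{t+1} = Y
        chain[t] = x                                 # otherwise X_{t+1} = X_t
    return chain

# Target: a standard normal posterior; discard an initial burn-in period
chain = metropolis(lambda x: -0.5 * x**2, x0=5.0, sigma=1.0, n_steps=20000)[2000:]
```

The retained samples are then distributed in proportion to the target posterior, so their histogram approximates it.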
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three chains overlaid on the contours, with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]
MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the two orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

    p(X|D,M,\beta,I) = p(X|M,I)\,p(D|X,M,I)^{\beta}, \quad 0 < \beta \le 1

A typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
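The swap step can be sketched using the standard parallel-tempering acceptance probability min(1, exp[(β_i − β_j)(ln L_j − ln L_i)]), which the slide does not spell out; function and variable names below are my own:

```python
import numpy as np

def pt_swap(log_like, states, betas, rng):
    """Propose one swap between a randomly chosen pair of adjacent chains."""
    i = rng.integers(len(betas) - 1)                 # adjacent pair (i, i+1)
    ll_lo, ll_hi = log_like(states[i]), log_like(states[i + 1])
    log_alpha = (betas[i] - betas[i + 1]) * (ll_hi - ll_lo)
    if np.log(rng.uniform()) <= log_alpha:           # accept the exchange
        states[i], states[i + 1] = states[i + 1], states[i]
    return states

rng = np.random.default_rng(1)
betas = [1.0, 0.5]                  # beta = 1 target chain plus one flatter chain
states = [3.0, 0.0]                 # the flatter chain holds the better state
states = pt_swap(lambda x: -0.5 * x**2, states, betas, rng)
```

Here the flatter chain's state has the higher likelihood, so the swap is always accepted and the good state migrates to the β = 1 chain.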
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extrasolar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in \chi^2. By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.
MCMC details
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; target posterior p({X_α}|D,M,I); n = number of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics, {X_α} iterations, summary statistics, best fit model and residuals, {X_α} marginals, {X_α} 68.3% credible regions, and p(D|M,I), the marginal likelihood for model comparison.
Adaptive hybrid MCMC: output at each iteration

8 parallel tempering Metropolis chains, with β = 1/T taking the values 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. Each chain outputs at each iteration: the parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange states between chains.

MCMC adaptive control system:
- Two-stage proposal-σ control system: anneal the Gaussian proposal σ's, then refine and update them. The error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Monitor for parameter sets with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
Go to Mathematica support material
Go to Mathematica version of MCMC
Calculation of p(D|M_0, I)
Model M_0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

    p(D|M_0,s,I) = (2\pi)^{-N/2}\,(\sigma^2+s^2)^{-N/2} \exp\left[-\frac{\sum_{i=1}^{N} (d_i - 0)^2}{2(\sigma^2+s^2)}\right]

Model selection results: Bayes factor = 4.5 × 10^4.

Methanol emission in the Sgr A environment
[Table: optically thin fit to 3 bands plus an unidentified line in the 96 GHz band. Columns: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). Here ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:

    W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^{\,i} - \bar{\theta}_j\right)^2

Between-chain variance:

    B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\bar{\theta}}\right)^2

Estimated variance:

    \hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic:

    R = \sqrt{\hat{V}(\theta)/W}

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457-511.
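The statistic can be computed directly from the stored chains (illustrative Python sketch with synthetic chains; the second example deliberately plants one stuck chain):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter; chains has shape (m, h)."""
    m, h = chains.shape
    means = chains.mean(axis=1)                               # per-chain means
    W = np.sum((chains - means[:, None])**2) / (m * (h - 1))  # within-chain var
    B = h * np.sum((means - means.mean())**2) / (m - 1)       # between-chain var
    V_hat = (1 - 1 / h) * W + B / h
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(0)
converged = rng.normal(0.0, 1.0, size=(4, 5000))   # 4 chains sampling one target
stuck = converged + np.array([0.0, 0.0, 0.0, 5.0])[:, None]  # one chain offset
```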
We will need to consider compound propositions like A,B, which asserts that propositions A and B are true, and A,B|C, which asserts that propositions A and B are true given that proposition C is true.

Rules for manipulating probabilities:

Sum rule:

    p(A|C) + p(\bar{A}|C) = 1

Product rule:

    p(A,B|C) = p(A|C)\,p(B|A,C) = p(B|C)\,p(A|B,C)

Bayes' theorem:

    p(A|B,C) = \frac{p(A|C)\,p(B|A,C)}{p(B|C)}
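These rules can be checked numerically on any joint distribution; here is a sketch with made-up numbers for two binary propositions:

```python
# Joint distribution for binary propositions A and B given C (made-up numbers)
p_joint = {(True, True): 0.3, (True, False): 0.2,
           (False, True): 0.1, (False, False): 0.4}

p_A = sum(v for (a, b), v in p_joint.items() if a)         # p(A|C)
p_notA = sum(v for (a, b), v in p_joint.items() if not a)  # p(~A|C)
p_B = sum(v for (a, b), v in p_joint.items() if b)         # p(B|C)
p_B_given_A = p_joint[(True, True)] / p_A                  # p(B|A,C)
p_A_given_B = p_A * p_B_given_A / p_B                      # Bayes' theorem
```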
How to proceed in a Bayesian analysis

Write down Bayes' theorem, identify the terms, and solve:

    p(H_i|D,I) = \frac{p(H_i|I)\,p(D|H_i,I)}{p(D|I)}

Here p(H_i|D,I) is the posterior probability that H_i is true given the new data D and prior information I; p(H_i|I) is the prior probability; p(D|H_i,I) is the likelihood; and p(D|I) is the normalizing constant.

The likelihood p(D|H_i,I), also written as L(H_i), stands for the probability that we would have gotten the data D that we did, if H_i is true.

Every item to the right of the vertical bar | is assumed to be true.
As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.

Two basic problems:

1. Model selection (discrete hypothesis space): "Which one of 2 or more models (hypotheses) is most probable given our current state of knowledge?" e.g.
- Hypothesis or model M0 asserts that the star has no planets.
- Hypothesis M1 asserts that the star has 1 planet.
- Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis space): "Assuming the truth of M1, solve for the probability density distribution for each of the model parameters based on our current state of knowledge." e.g.
- Hypothesis H asserts that the orbital period is between P and P+dP.
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 941
Significance of this developmentoutline
Probabilities are commonly quantified by a real number between 0 and 1
0 1Realm of science
and inductive logic
truefalse
The end-points corresponding to absolutely false and absolutely true
are simply the extreme limits of this infinity of real numbers
Bayesian probability theory spans the whole range
Deductive logic is just a special case of Bayesian probability
theory in the idealized limit of complete information
Occam
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1041
Let d i represent the i th measured data value We model d i by
outline
Calculation of a simple Likelihood
Model prediction for i th data value
for current choice of parameters
p D M X I
where ei represents the error component in the measurement
d i = f i X + ei
X
Since is assumed to be true if it were not for the
error ei d i would equal the model prediction f i
p Di M X I =
1
s i 2 p Exp-
ei 2
2s i 2
=
1
s i 2 p Exp -
d i - f i X 2
2 s i 2
Now suppose prior information I indicates that ei has a Gaussian
probability distribution Then
M X
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
D M X I Calculation of a simple Likelihood
p J D M X I N=
H 2p
L- N
ecirc 2
permili= 1 N
s
i
- 1
gt ExpB-
05 sbquoi= 1 N J d i - f i H X LN 2
s i 2 F
The familiar c2
statistic used
in least-squares
For independent data the likelihood for the entire data
set D=(D1D2 hellipDN ) is the product of N Gaussians
Maximizing the likelihood corresponds to minimizing c2
Recall Bayesian posterior micro prior acirc likelihood
Thus only for a uniform prior will a least-squares analysis
yield the same solution as the Bayesian posterior
Simple example of when not to use a uniform prioroutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1341
Simple example of when not to use a uniform prior
In the exoplanet problem the prior range for the unknown
orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)
Suppose we assume a uniform prior probability density for the P
parameter This would imply that we believed that it was ~ 104
timesmore probable that the true period was in the upper decade
(104 to 105 d) of the prior range than in the lowest decade from
1 to 10 d
104
105
p P M I P
1
10 p P M I P
= 104
Usually expressing great uncertainty in some quantity corresponds
more closely to a statement of scale invariance or equal probability per
decade The Jeffreys prior has this scale invariant property
outlin
Jeffreys prior (scale invariant)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1441
Jeffreys prior (scale invariant)
p
H P M I
L dP =
P yen ln H P max ecirc P minL p Hln P M I L d ln P =
ln
ln H P max ecirc P minLor equivalently
1
10
p P M I P = 10
4
105
p P M I P
Equal probability per decade
Actually there are good reasons for searching in orbital frequency
f = 1P instead of P The form of the prior is unchanged
p ln f M I d ln f = ln
ln f max f min
Modified Jeffre s fre
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1541
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D \mid M_1, T, I) = (2\pi)^{-N/2} \, \sigma^{-N} \exp\!\left[ -\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2} \right]

Our new likelihood for the more complicated model, with unknown variables T, ν₀, s_L, s, is

p(D \mid M_1, T, \nu_0, s_L, s, I) = (2\pi)^{-N/2} \, (\sigma^2 + s^2)^{-N/2} \exp\!\left[ -\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0, s_L))^2}{2(\sigma^2 + s^2)} \right]
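The likelihood above is straightforward to code. A minimal Python sketch, assuming (as an illustration, since the exact line shape is not reproduced here) a Gaussian line profile of unit height, with channels numbered 1..N and σ = 1 mK as in the problem setup:

```python
import math

def line_profile(nu, nu0, sL):
    # Assumed Gaussian line shape of unit height, centred on channel nu0
    # with width sL (standing in for the document's f_i(nu0, sL)).
    return math.exp(-(nu - nu0) ** 2 / (2.0 * sL ** 2))

def log_likelihood(d, T, nu0, sL, s, sigma=1.0):
    # log p(D | M1, T, nu0, sL, s, I) for N independent channels,
    # each with total variance sigma^2 + s^2.
    n = len(d)
    var = sigma ** 2 + s ** 2
    chi2 = sum((d[i] - T * line_profile(i + 1, nu0, sL)) ** 2 for i in range(n))
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * chi2 / var
```

Working with the log of the likelihood avoids numerical underflow when N is large; setting T = 0 recovers the no-line model M0.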
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima (the true value is marked in the figure).

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization

In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.
Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, here T, we need to integrate the joint posterior over all the other parameters:

p(T \mid D, M_1, I) = \int d\nu_0 \int ds_L \int ds \; p(T, \nu_0, s_L, s \mid D, M_1, I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior PDF for all the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.
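As a toy illustration of what marginalization does, here is a sketch using a hypothetical two-parameter joint posterior (a single 2-D Gaussian standing in for p(T, ν₀ | D, M1, I)); the nuisance parameter ν₀ is integrated out by summation on a grid:

```python
import math

# Hypothetical joint posterior peaked at (T, nu0) = (2.0, 20.0)
T_grid = [0.1 * k for k in range(0, 51)]      # T in [0, 5]
nu_grid = [1.0 * k for k in range(1, 45)]     # nu0 in [1, 44]

def joint(T, nu0):
    return math.exp(-0.5 * ((T - 2.0) / 0.3) ** 2
                    - 0.5 * ((nu0 - 20.0) / 1.5) ** 2)

# Marginal PDF for T: sum the joint posterior over the nu0 grid,
# then normalise so the marginal sums to 1 over the T grid.
marg = [sum(joint(T, nu) for nu in nu_grid) for T in T_grid]
norm = sum(marg)
marg = [m / norm for m in marg]

best_T = T_grid[max(range(len(marg)), key=marg.__getitem__)]
print(best_T)  # peak of the marginal, at the assumed true value 2.0
```

The same idea, with the sum replaced by an MCMC average, is what makes large parameter spaces tractable.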
Numerical tools for Bayesian model fitting

Linear models (uniform priors) — chapter 10:
The posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required, and the solution is very fast using linear algebra.

Nonlinear models, plus linear models with non-uniform priors — chapters 11 and 12:
The posterior may have multiple peaks. For some parameters analytic integration is sometimes possible. Otherwise the options include:
- Brute-force integration
- Asymptotic approximations: peak-finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) combined with Laplace approximations
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions

This title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: the Metropolis-Hastings MCMC algorithm
P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X₀, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0, 1) random variable U.
   - If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The second factor, q(X_t|Y)/q(Y|X_t), equals 1 for a symmetric proposal distribution like a Gaussian.

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
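A minimal, self-contained Python sketch of this algorithm (a toy stand-in, not the talk's Mathematica implementation), using the symmetric Gaussian proposal so the q-ratio drops out:

```python
import math, random

def metropolis_hastings(log_post, x0, sigma_prop, n_steps, seed=1):
    # Metropolis-Hastings with a symmetric Gaussian proposal N(x_t, sigma_prop),
    # so the Hastings factor q(X_t|Y)/q(Y|X_t) = 1 and only the posterior
    # ratio enters the acceptance test.
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_steps):
        y = rng.gauss(x, sigma_prop)            # propose Y ~ q(Y|X_t)
        lpy = log_post(y)
        if math.log(rng.random()) <= lpy - lp:  # accept with prob min(1, p(Y)/p(X_t))
            x, lp = y, lpy
        samples.append(x)                       # X_{t+1} = Y or X_t
    return samples

# Example: sample a standard normal target, then discard burn-in
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=3.0,
                            sigma_prop=1.0, n_steps=20000)
burned = chain[2000:]
```

The acceptance test is done in log space to avoid underflow of the posterior ratio.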
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. The panels show chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions. Tuning the proposal σ's can be a very difficult challenge for many parameters.
MCMC parameter samples for a Kepler model with 2 planets (periods P1 and P2; post burn-in; Gelman-Rubin statistic shown).

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X \mid D, M, \beta, I) = p(X \mid M, I) \, p(D \mid X, M, I)^{\beta}, \quad 0 < \beta \le 1

A typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
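The scheme can be sketched in a few lines of Python (a minimal toy, not the talk's implementation): one Metropolis chain per β with tempered target log p(X|M,I) + β log p(D|X,M,I), plus occasional swap proposals between adjacent chains. The bimodal example likelihood is hypothetical:

```python
import math, random

def tempered_mcmc(log_prior, log_like, betas, x0, sigma_prop, n_steps, seed=2):
    # One Metropolis chain per beta; betas sorted ascending, betas[-1] == 1.0.
    rng = random.Random(seed)
    xs = [x0] * len(betas)
    lpost = lambda x, b: log_prior(x) + b * log_like(x)
    samples = []
    for t in range(n_steps):
        for j, b in enumerate(betas):            # ordinary M-H step in each chain
            y = rng.gauss(xs[j], sigma_prop)
            if math.log(rng.random()) <= lpost(y, b) - lpost(xs[j], b):
                xs[j] = y
        if t % 10 == 0:                          # propose swapping an adjacent pair
            j = rng.randrange(len(betas) - 1)
            dlog = (betas[j] - betas[j + 1]) * (log_like(xs[j + 1]) - log_like(xs[j]))
            if math.log(rng.random()) <= dlog:   # tempered swap acceptance
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
        samples.append(xs[-1])                   # record the beta = 1 chain
    return samples

# Hypothetical bimodal likelihood: two widely separated narrow peaks
log_like = lambda x: math.log(math.exp(-0.5 * (x - 4.0) ** 2)
                              + math.exp(-0.5 * (x + 4.0) ** 2))
chain = tempered_mcmc(lambda x: 0.0, log_like, [0.1, 0.3, 1.0],
                      x0=4.0, sigma_prop=1.0, n_steps=20000)
```

A single β = 1 chain started at +4 would almost never cross to the mode at −4; the swaps with the much flatter β = 0.1 chain carry states across the barrier.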
MCMC Technical Difficulties
1. Deciding on the burn-in period.

2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
   Ans: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient.
   Ans: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (the latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of the Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; the target posterior p({X_α}|D,M,I); n = number of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model and residuals; {X_α} marginals; {X_α} 68.3% credible regions; the marginal likelihood p(D|M,I) for model comparison.

Adaptive Hybrid MCMC
Eight parallel tempering Metropolis chains run at β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). The output at each iteration, for each chain, is: the parameters; logprior + β × loglike; and logprior + loglike. Parallel tempering swap operations exchange states between adjacent chains.

Two-stage proposal-σ control system: anneal the Gaussian proposal σ's using the error signal = (actual joint acceptance rate − 0.25); this effectively defines the burn-in interval. Then refine and update the Gaussian proposal σ's (including correlated parameters).

Monitor for parameters with peak probability. Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set. Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, then update it and reset the burn-in.
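The error-signal idea can be sketched in a few lines (a toy stand-in for the actual control system; the exponential update rule is an assumption for illustration):

```python
import math

def adapt_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    # Drive the error signal (accept_rate - target) toward zero:
    # acceptance too high -> proposals too timid  -> widen sigma;
    # acceptance too low  -> proposals too bold   -> shrink sigma.
    return sigma * math.exp(gain * (accept_rate - target))
```

Repeated over blocks of iterations, this pushes each proposal σ toward the width that yields the target joint acceptance rate.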
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M₀, I)

Model M₀ assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D \mid M_0, s, I) = (2\pi)^{-N/2} \, (\sigma^2 + s^2)^{-N/2} \exp\!\left[ -\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)} \right]

Model selection result: Bayes factor = 4.5 × 10⁴.
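To make the model-selection step concrete, here is a crude Python sketch that grid-marginalizes both models' likelihoods over uniform priors for a hypothetical 44-channel spectrum with an injected line (the 4.5 × 10⁴ above comes from the talk's actual data, not from this toy):

```python
import math

N, sigma = 44, 1.0

def profile(i, nu0, sL):
    # Assumed Gaussian line shape (unit height) for channel i
    return math.exp(-(i - nu0) ** 2 / (2.0 * sL ** 2))

# Hypothetical spectrum: line of amplitude 3 mK at channel 20, width 2,
# plus a small deterministic stand-in for noise
d = [3.0 * profile(i, 20.0, 2.0) + 0.3 * math.sin(7.0 * i)
     for i in range(1, N + 1)]

def like(T, nu0, sL, s):
    var = sigma ** 2 + s ** 2
    chi2 = sum((d[i - 1] - T * profile(i, nu0, sL)) ** 2
               for i in range(1, N + 1))
    return (2.0 * math.pi * var) ** (-N / 2) * math.exp(-0.5 * chi2 / var)

# Crude grid marginalisation: averaging the likelihood over each grid
# approximates the integral against a uniform prior
Ts  = [0.5 * k for k in range(13)]            # T   in [0, 6]
nus = [2.0 * k for k in range(1, 22)]         # nu0 in [2, 42]
sLs = [0.5 + 0.5 * k for k in range(8)]       # sL  in [0.5, 4]
ss  = [0.25 * k for k in range(7)]            # s   in [0, 1.5]

pD_M1 = (sum(like(T, nu, sL, s)
             for T in Ts for nu in nus for sL in sLs for s in ss)
         / (len(Ts) * len(nus) * len(sLs) * len(ss)))
pD_M0 = sum(like(0.0, 1.0, 1.0, s) for s in ss) / len(ss)  # T = 0: noise only
bayes_factor = pD_M1 / pD_M0
```

Because the injected line is strong, the marginal likelihood for M1 exceeds that of M0 despite M1's Occam penalty for its extra parameters.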
Methanol emission in the Sgr A environment

[Table of fit parameters per line: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K); ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mᵢ, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θⱼⁱ represent the ith iteration of the jth of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:

W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left( \theta_j^i - \bar{\theta}_j \right)^2

Between-chain variance:

B = \frac{h}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\bar{\theta}} \right)^2

Estimated variance:

\hat{V}(\theta) = \left( 1 - \frac{1}{h} \right) W + \frac{1}{h} B

Gelman-Rubin statistic:

R = \sqrt{ \hat{V}(\theta) / W }

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D. B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457-511.
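The statistic above translates directly into code; a minimal Python sketch:

```python
import math

def gelman_rubin(chains):
    # chains: m lists, each holding h post burn-in samples of one parameter
    m, h = len(chains), len(chains[0])
    means = [sum(c) / h for c in chains]
    grand = sum(means) / m
    W = (sum(sum((x - mu) ** 2 for x in c) for c, mu in zip(chains, means))
         / (m * (h - 1)))                                   # within-chain variance
    B = h * sum((mu - grand) ** 2 for mu in means) / (m - 1)  # between-chain
    V = (1.0 - 1.0 / h) * W + B / h                         # estimated variance
    return math.sqrt(V / W)
```

Values near 1.0 (e.g., < 1.05) for every parameter indicate the independent chains have converged to the same distribution; chains stuck in different regions give values well above 1.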
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) ∝ p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

Typical set of β values = {0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0}

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
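The swap move can be sketched as follows; the β ladder is the slide's, but the states and log-likelihoods are dummy values for illustration. For chains targeting p(X|M,I)·p(D|X,M,I)^β, a swap between adjacent rungs i and i+1 is accepted with probability min(1, exp[(β_i − β_{i+1})(logL_{i+1} − logL_i)]), which leaves each rung's distribution invariant:

```python
import numpy as np

def propose_swap(states, log_likes, betas, rng):
    """One random swap proposal between adjacent tempered chains.

    Accept a swap of rungs i and i+1 with probability
    min(1, exp[(beta_i - beta_{i+1}) * (logL_{i+1} - logL_i)]).
    """
    i = rng.integers(len(betas) - 1)             # pick a random adjacent pair
    log_r = (betas[i] - betas[i + 1]) * (log_likes[i + 1] - log_likes[i])
    if np.log(rng.uniform()) < log_r:
        states[i], states[i + 1] = states[i + 1], states[i]
        log_likes[i], log_likes[i + 1] = log_likes[i + 1], log_likes[i]
        return True
    return False

# Ladder from the slide; states and log-likelihoods are dummies for the demo.
betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]
states = list(range(8))
log_likes = [-40.0, -35.0, -30.0, -26.0, -23.0, -21.0, -20.2, -20.0]
rng = np.random.default_rng(0)
n_swaps = sum(propose_swap(states, log_likes, betas, rng) for _ in range(1000))
```

In a real run each chain also performs its own Metropolis updates between swap proposals.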
MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters.
   Ans: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient.
   Ans: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- Unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Diagram: hybrid parallel tempering MCMC nonlinear model fitting program.
Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}init = start parameters; {σ_α}init = start proposal σ's; {β} = tempering levels.
Target posterior: p({X_α}|D,M,I).
Adaptive two-stage control system: (1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; (2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.
Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.]

Adaptive Hybrid MCMC
[Diagram: MCMC adaptive control system. Eight parallel tempering Metropolis chains, β = 1/T, with ladder β = {1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09}, coupled by parallel tempering swap operations. Each chain outputs at each iteration its parameters, logprior + β·loglike, and logprior + loglike.

Two-stage proposal σ control system: first anneal the Gaussian proposal σ's, then refine and update them, driven by the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.

Peak parameter set: monitor for parameters with peak probability; if (logprior + loglike) exceeds the previous best by a threshold, update it and reset the burn-in.]
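The two-stage σ control loop can be caricatured as a simple proportional controller driven by the error signal above. The gain, the multiplicative update form, and the schedule of measured rates are my illustrative assumptions, not the talk's actual control system:

```python
def tune_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    """One proportional-control update of a proposal sigma.

    error = measured acceptance rate - target (0.25 in the talk).
    A positive error means proposals are too timid, so sigma is inflated;
    a negative error shrinks it.  Gain and form are illustrative only.
    """
    error = accept_rate - target
    return sigma * (1.0 + gain * error)

# Fake sequence of measured acceptance rates converging toward the target.
sigma = 1.0
for rate in (0.9, 0.6, 0.4, 0.3, 0.25):
    sigma = tune_sigma(sigma, rate)
```

Once the measured rate sits at the target, the update leaves σ unchanged, which is one way such a loop can signal the end of burn-in.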
[Demo: Mathematica support material; Mathematica version of MCMC; quasi-Monte Carlo.]
Calculation of p(D|M0, I): model selection results

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2(σ² + s²)) ]

Bayes factor = 4.5×10⁴

Methanol emission in the Sgr A environment
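A hedged numerical sketch of this calculation: with simulated noise-only data in place of the real spectrum, p(D|M0,I) follows by marginalizing the nuisance parameter s over a uniform prior, as the earlier slides describe; the grid size and s_max here are illustrative choices:

```python
import numpy as np

def log_like_M0(d, sigma, s):
    """log p(D|M0,s,I): under the no-line model the data are pure noise
    with total variance sigma^2 + s^2 (known noise plus extra-noise s)."""
    var = sigma**2 + s**2
    return -0.5 * d.size * np.log(2 * np.pi * var) - np.sum(d**2) / (2 * var)

def log_marginal_M0(d, sigma, s_max, n_grid=2001):
    """log p(D|M0,I): marginalize s over a uniform prior on [0, s_max]
    by trapezoidal quadrature, working in log space for stability."""
    s_grid = np.linspace(0.0, s_max, n_grid)
    log_l = np.array([log_like_M0(d, sigma, s) for s in s_grid])
    m = log_l.max()
    w = np.exp(log_l - m)                            # rescaled integrand
    ds = s_grid[1] - s_grid[0]
    integral = np.sum(0.5 * (w[1:] + w[:-1])) * ds   # trapezoid rule
    return m + np.log(integral / s_max)              # prior density = 1/s_max

# Simulated 64-channel noise-only spectrum with sigma = 1 mK (illustrative).
rng = np.random.default_rng(0)
d = rng.standard_normal(64)
log_pD_M0 = log_marginal_M0(d, sigma=1.0, s_max=2.0)
```

Dividing a corresponding marginal likelihood for M1 by this quantity gives the Bayes factor.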
[Table: fitted line parameters — velocities v (km s⁻¹), FWHMs (km s⁻¹), temperatures (K), and column densities (cm⁻²); ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google "Phil Gregory".
The rewards of data analysis

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:  W = [1/(m(h−1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance:  B = [h/(m−1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance:  V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D. B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)," Statistical Science 7, pp. 457−511.
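The statistic defined above can be computed directly from the post burn-in chains; a minimal sketch follows (the two test cases are synthetic chains, not output from the talk's program):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.

    chains: (m, h) array -- m independent simulations, h post burn-in
    iterations each.  Returns sqrt(V_hat / W) with
      W     = mean within-chain variance,
      B     = h * variance of the chain means,
      V_hat = (1 - 1/h) * W + B / h.
    Values close to 1.0 (e.g. < 1.05) for every parameter indicate convergence.
    """
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = h * chain_means.var(ddof=1)         # between-chain variance
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return float(np.sqrt(V_hat / W))

# Converged example: 4 synthetic chains drawn from the same N(0,1) target.
rng = np.random.default_rng(0)
R_converged = gelman_rubin(rng.standard_normal((4, 5000)))
# Unconverged example: chains stuck around different means.
R_stuck = gelman_rubin(rng.standard_normal((4, 5000)) + np.arange(4)[:, None])
```

When chains have not mixed, the between-chain variance B inflates V̂ and drives the statistic well above 1.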
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 541
outline
Logic is concerned with the truth of propositions
A proposition asserts that something is true
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 641
outline
We will need to consider compound propositions like
AB which asserts that propositions A and B are true
AB|C asserts that propositions A and B are true
given that proposition C is true
Rules for manipulating probabilities
Sum rule p A C + p A macrmacr
C = 1
Product rule p A B C = p A C p B A C
= p B C
p A B C
Bayes theorem
p A B C =
p A C p B A C
p B C
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 741
outline
How to proceed in a Bayesian analysis
Write down Bayesrsquo theorem identify the terms and solve
The likelihood p(D| Hi
I) also written as (Hi
) stands for
the probability that we would have gotten the data D that we
did if Hi is true
Every item to the right of the
vertical bar | is assumed to be true
p H i D I = p H i I acirc p D H i I p D I
Posterior probability
that Hi is true given
the new data D and
prior information I
Prior probability Likelihood
Normalizing constant
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 841
As a theory of extended logic BPT can be used to find optimal
answers to well posed scientific questions for a given state of
knowledge in contrast to a numerical recipe approach
outline
Two basic problems
1 Model selection (discrete hypothesis space)
ldquoWhich one of 2 or more models (hypotheses) is most probable
given our current state of knowledgerdquo
eg
bull Hypothesis or model M0 asserts that the star has no planets
bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets
2 Parameter estimation (continuous hypothesis)
ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our
current state of knowledgerdquo
egbull Hypothesis H asserts that the orbital period is between P and P+dP
S f foutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 941
Significance of this developmentoutline
Probabilities are commonly quantified by a real number between 0 and 1
0 1Realm of science
and inductive logic
truefalse
The end-points corresponding to absolutely false and absolutely true
are simply the extreme limits of this infinity of real numbers
Bayesian probability theory spans the whole range
Deductive logic is just a special case of Bayesian probability
theory in the idealized limit of complete information
Occam
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1041
Let d i represent the i th measured data value We model d i by
outline
Calculation of a simple Likelihood
Model prediction for i th data value
for current choice of parameters
p D M X I
where ei represents the error component in the measurement
d i = f i X + ei
X
Since is assumed to be true if it were not for the
error ei d i would equal the model prediction f i
p Di M X I =
1
s i 2 p Exp-
ei 2
2s i 2
=
1
s i 2 p Exp -
d i - f i X 2
2 s i 2
Now suppose prior information I indicates that ei has a Gaussian
probability distribution Then
M X
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
D M X I Calculation of a simple Likelihood
p J D M X I N=
H 2p
L- N
ecirc 2
permili= 1 N
s
i
- 1
gt ExpB-
05 sbquoi= 1 N J d i - f i H X LN 2
s i 2 F
The familiar c2
statistic used
in least-squares
For independent data the likelihood for the entire data
set D=(D1D2 hellipDN ) is the product of N Gaussians
Maximizing the likelihood corresponds to minimizing c2
Recall Bayesian posterior micro prior acirc likelihood
Thus only for a uniform prior will a least-squares analysis
yield the same solution as the Bayesian posterior
Simple example of when not to use a uniform prioroutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1341
Simple example of when not to use a uniform prior
In the exoplanet problem the prior range for the unknown
orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)
Suppose we assume a uniform prior probability density for the P
parameter This would imply that we believed that it was ~ 104
timesmore probable that the true period was in the upper decade
(104 to 105 d) of the prior range than in the lowest decade from
1 to 10 d
104
105
p P M I P
1
10 p P M I P
= 104
Usually expressing great uncertainty in some quantity corresponds
more closely to a statement of scale invariance or equal probability per
decade The Jeffreys prior has this scale invariant property
outlin
Jeffreys prior (scale invariant)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1441
Jeffreys prior (scale invariant)
p
H P M I
L dP =
P yen ln H P max ecirc P minL p Hln P M I L d ln P =
ln
ln H P max ecirc P minLor equivalently
1
10
p P M I P = 10
4
105
p P M I P
Equal probability per decade
Actually there are good reasons for searching in orbital frequency
f = 1P instead of P The form of the prior is unchanged
p ln f M I d ln f = ln
ln f max f min
Modified Jeffre s fre
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1541
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
We will need to consider compound propositions like A,B, which asserts that propositions A and B are both true, and A,B|C, which asserts that propositions A and B are true given that proposition C is true.

Rules for manipulating probabilities:

    Sum rule:       p(A|C) + p(Ā|C) = 1

    Product rule:   p(A,B|C) = p(A|C) p(B|A,C) = p(B|C) p(A|B,C)

    Bayes' theorem: p(A|B,C) = p(A|C) p(B|A,C) / p(B|C)
How to proceed in a Bayesian analysis
Write down Bayes' theorem, identify the terms, and solve:

    p(Hi|D,I) = p(Hi|I) × p(D|Hi,I) / p(D|I)

where
    p(Hi|D,I) = posterior probability that Hi is true, given the new data D and prior information I
    p(Hi|I)   = prior probability
    p(D|Hi,I) = likelihood, also written as L(Hi); it stands for the probability that we would have gotten the data D that we did, if Hi is true
    p(D|I)    = normalizing constant

Every item to the right of the vertical bar | is assumed to be true.
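A minimal numeric sketch of this procedure (the two hypotheses and their numbers are invented for illustration, not taken from the talk):

```python
# Hypothetical two-hypothesis problem: apply
# p(Hi|D,I) = p(Hi|I) * p(D|Hi,I) / p(D|I)
priors = {"H1": 0.5, "H2": 0.5}        # p(Hi|I): prior probabilities
likelihoods = {"H1": 0.8, "H2": 0.2}   # p(D|Hi,I): likelihood of the observed data

# Normalizing constant p(D|I): sum of prior * likelihood over the hypothesis space
p_D = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior probability of each hypothesis
posteriors = {h: priors[h] * likelihoods[h] / p_D for h in priors}
```

With equal priors the posterior ratio equals the likelihood ratio, here 4:1 in favor of H1.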
As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.
Two basic problems

1. Model selection (discrete hypothesis space):
"Which one of 2 or more models (hypotheses) is most probable, given our current state of knowledge?"
e.g.
• Hypothesis (or model) M0 asserts that the star has no planets.
• Hypothesis M1 asserts that the star has 1 planet.
• Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis space):
"Assuming the truth of M1, solve for the probability density distribution for each of the model parameters, based on our current state of knowledge."
e.g.
• Hypothesis H asserts that the orbital period is between P and P+dP.
Significance of this development

Probabilities are commonly quantified by a real number between 0 and 1.

    0 (false)  ←  realm of science and inductive logic  →  1 (true)

The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers. Bayesian probability theory spans the whole range. Deductive logic is just a special case of Bayesian probability theory in the idealized limit of complete information.
Calculation of a simple likelihood p(D|M,X,I)

Let d_i represent the i-th measured data value. We model d_i by

    d_i = f_i(X) + e_i

where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement. Since M,X is assumed to be true, if it were not for the error e_i, d_i would equal the model prediction f_i(X).

Now suppose prior information I indicates that e_i has a Gaussian probability distribution. Then

    p(D_i|M,X,I) = (1 / (σ_i √(2π))) exp(−e_i² / (2σ_i²))
                 = (1 / (σ_i √(2π))) exp(−(d_i − f_i(X))² / (2σ_i²))
[Figure: Gaussian error curve centered on the predicted value f_i(X), with the measured value d_i lying a distance e_i away; horizontal axis is signal strength, vertical axis is probability density. p(D_i|M,X,I) is proportional to the line height at d_i.]

Probability of getting a data value d_i a distance e_i away from the predicted value f_i is proportional to the height of the Gaussian error curve at that location.
Calculation of a simple likelihood p(D|M,X,I), continued

For independent data, the likelihood for the entire data set D = (D1, D2, …, DN) is the product of N Gaussians:

    p(D|M,X,I) = (2π)^(−N/2) [∏_{i=1}^{N} σ_i^(−1)] exp[−(1/2) Σ_{i=1}^{N} (d_i − f_i(X))² / σ_i²]

The sum in the exponent is the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².

Recall: Bayesian posterior ∝ prior × likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
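In code, the product of Gaussians is best handled as a sum of logarithms to avoid numerical underflow for large N. A sketch (the function and variable names are ours, not from the talk):

```python
import math

def log_likelihood(d, f, sigma):
    """ln p(D|M,X,I) = -(N/2) ln(2*pi) - sum_i ln(sigma_i) - chi2/2,
    where chi2 = sum_i (d_i - f_i(X))^2 / sigma_i^2."""
    chi2 = sum((di - fi) ** 2 / s ** 2 for di, fi, s in zip(d, f, sigma))
    return (-0.5 * len(d) * math.log(2 * math.pi)
            - sum(math.log(s) for s in sigma)
            - 0.5 * chi2)
```

For fixed σ_i, maximizing this over X is exactly minimizing χ², as stated above.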
Simple example of when not to use a uniform prior
In the exoplanet problem, the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars).
Suppose we assume a uniform prior probability density for the P parameter. This would imply a belief that it is ~10⁴ times more probable that the true period is in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:

    ∫[10⁴, 10⁵] p(P|M,I) dP  /  ∫[1, 10] p(P|M,I) dP  =  10⁴
Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)
    p(P|M,I) dP = dP / (P ln(P_max / P_min))

or equivalently

    p(ln P|M,I) d ln P = d ln P / ln(P_max / P_min)

This gives equal probability per decade:

    ∫[1, 10] p(P|M,I) dP = ∫[10⁴, 10⁵] p(P|M,I) dP

Actually there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

    p(ln f|M,I) d ln f = d ln f / ln(f_max / f_min)
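The equal-probability-per-decade property can be checked numerically. A sketch, assuming the ~1 day to 1000 yr (≈ 365250 d) range quoted above:

```python
import math

P_MIN, P_MAX = 1.0, 365250.0   # ~1 day to 1000 yr, as in the exoplanet example

def jeffreys_pdf(P):
    """Jeffreys prior p(P|M,I) = 1 / (P * ln(P_max / P_min))."""
    return 1.0 / (P * math.log(P_MAX / P_MIN))

def prob(lo, hi, n=100_000):
    """Trapezoid-rule integral of the prior over [lo, hi]."""
    h = (hi - lo) / n
    ys = [jeffreys_pdf(lo + i * h) for i in range(n + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

p_low_decade = prob(1.0, 10.0)    # decade from 1 to 10 d
p_high_decade = prob(1e4, 1e5)    # decade from 10^4 to 10^5 d
```

Both decades carry the same probability, ln(10)/ln(P_max/P_min) ≈ 0.18.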
Integration not minimization
A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e. Monte Carlo simulations.
Shortly we discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
End of Bayesian primer
Simple Spectral Line Problem
Background (prior) information:
Two competing grand unification theories have been proposed, each
championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.
All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form of a line profile for a given ν0 and σ_L, where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν0.

In this version of the problem T, ν0, σ_L are all unknowns, with prior limits:

    T = 0.0 to 100.0
    ν0 = 1 to 44
    σ_L = 0.5 to 4.0
Extra noise term e0_i

We will represent the measured data by the equation

    d_i = f_i + e_i + e0_i

where
    d_i  = i-th measured data value
    f_i  = model prediction
    e_i  = component of d_i which arises from measurement errors
    e0_i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e0_i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e. maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for e0_i with a variance s². Thus the combination of e_i + e0_i has a Gaussian distribution with variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s: 0 to 0.5 times the data range.
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters, the line temperature T, ν0, σ_L, and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

    p(D|M1,T,I) = (2π)^(−N/2) σ^(−N) exp[−Σ_{i=1}^{N} (d_i − T f_i)² / (2σ²)]

Our new likelihood for the more complicated model with unknown variables T, ν0, σ_L, s is

    p(D|M1,T,ν0,σ_L,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp[−Σ_{i=1}^{N} (d_i − T f_i(ν0,σ_L))² / (2(σ² + s²))]
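A sketch of this likelihood as a log-density (the unit profile values f_i(ν0, σ_L) are assumed precomputed; the function and argument names are ours):

```python
import math

def spectral_loglike(d, f, T, sigma=1.0, s=0.0):
    """ln p(D|M1,T,nu0,sL,s,I) for N channels with common measurement
    noise sigma (1 mK in this problem) plus the extra-noise parameter s:
    the total variance per channel is sigma^2 + s^2."""
    var = sigma ** 2 + s ** 2
    chi2 = sum((di - T * fi) ** 2 for di, fi in zip(d, f)) / var
    return -0.5 * len(d) * math.log(2 * math.pi * var) - 0.5 * chi2
```

Note how a nonzero s raises the likelihood of data the model cannot fit: the extra variance absorbs the unexplained residuals, which is exactly why marginalizing s gives conservative parameter estimates.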
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter such as T, we need to integrate the joint posterior over all the other parameters:

    p(T|D,M1,I) = ∫dν0 ∫dσ_L ∫ds  p(T,ν0,σ_L,s|D,M1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior PDF for the parameters.

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e. Monte Carlo simulations.
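With MCMC output in hand, the marginalization above reduces to ignoring the other columns and histogramming the one of interest. A sketch, with synthetic draws standing in for real post burn-in samples:

```python
import random

random.seed(1)

# Stand-in for post burn-in MCMC samples of (T, nu0, sigma_L, s);
# in a real run these would come from the sampler, not random draws.
samples = [(random.gauss(1.5, 0.2),        # T
            random.uniform(1.0, 44.0),     # nu0
            random.uniform(0.5, 4.0),      # sigma_L
            random.uniform(0.0, 0.5))      # s
           for _ in range(20_000)]

# Marginal PDF for T: drop the other columns, histogram, normalize.
t_vals = [row[0] for row in samples]
lo, hi, nbins = 0.5, 2.5, 40
width = (hi - lo) / nbins
counts = [0] * nbins
for t in t_vals:
    if lo <= t < hi:
        counts[int((t - lo) / width)] += 1
marginal_pdf = [c / (len(t_vals) * width) for c in counts]  # density estimate
```

Because the sample density is proportional to the joint posterior, simply discarding the ν0, σ_L, s columns performs the triple integral numerically.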
Numerical tools
Given data D, model M, and prior information I:

Linear models (uniform priors):
- Posterior has a single peak (multi-dimensional Gaussian).
- Parameters given by the normal equations of linear least-squares.
- No integration required; the solution is very fast using linear algebra. (Chapter 10)

Nonlinear models, and linear models with non-uniform priors:
- Posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Otherwise:
- Brute force integration and peak finding: (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm. (Chapter 11)
- Asymptotic approximations: Laplace approximations.
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (Chapter 12)
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.
It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.
The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters)

1. Choose X0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|Xt) that is easy to evaluate; q(Y|Xt) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(Xt|D,I)] × [q(Xt|Y) / q(Y|Xt)], then set X_{t+1} = Y; otherwise set X_{t+1} = Xt.
   - Increment t.

The factor q(Xt|Y)/q(Y|Xt) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e. normal distribution N(Xt, σ).
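A minimal sketch of this algorithm for a one-parameter problem, using the symmetric Gaussian proposal so the q-ratio is 1; the standard-normal target here is an invented stand-in for a real posterior:

```python
import math
import random

random.seed(42)

def metropolis(log_post, x0, sigma, n_steps):
    """Metropolis algorithm with symmetric Gaussian proposal N(x_t, sigma):
    q(Xt|Y)/q(Y|Xt) = 1, so accept Y when U <= p(Y|D,I)/p(Xt|D,I),
    i.e. ln U <= log_post(Y) - log_post(Xt)."""
    x = x0
    chain = []
    for _ in range(n_steps):
        y = random.gauss(x, sigma)        # proposal Y ~ q(Y|Xt)
        if math.log(random.random()) <= log_post(y) - log_post(x):
            x = y                         # accept: X_{t+1} = Y
        chain.append(x)                   # on rejection, X_{t+1} = Xt
    return chain

# Standard-normal stand-in for the target posterior P(X|D,M,I)
chain = metropolis(lambda x: -0.5 * x * x, x0=3.0, sigma=1.0, n_steps=50_000)
burned = chain[5_000:]                    # discard burn-in
```

Note that only ratios of the posterior appear, which is why the normalizing constant p(D|I) is not needed for parameter estimation.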
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ, which can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: sample chains and autocorrelation functions for three proposal widths, with acceptance rates of 95%, 63%, and 4%.]
MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in MCMC samples of the orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

    p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

A typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
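A sketch of the swap step alone (the within-chain Metropolis updates are omitted); the β ladder is the one quoted above, while the flat prior and Gaussian likelihood are invented placeholders:

```python
import math
import random

random.seed(7)

BETAS = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]  # tempering ladder

def log_like(x):
    return -0.5 * x * x      # placeholder ln p(D|X,M,I); flat prior assumed

def propose_swap(states):
    """Pick a random adjacent pair of tempering levels and propose swapping
    their parameter states. With a common prior that cancels, the log
    acceptance ratio is (beta_i - beta_j) * (log_like(x_j) - log_like(x_i))."""
    i = random.randrange(len(states) - 1)
    j = i + 1
    delta = (BETAS[i] - BETAS[j]) * (log_like(states[j]) - log_like(states[i]))
    if math.log(random.random()) <= delta:
        states[i], states[j] = states[j], states[i]   # exchange the two states
    return states

states = [random.gauss(0.0, 3.0) for _ in BETAS]  # one state per beta level
initial = sorted(states)
for _ in range(1000):
    states = propose_swap(states)
```

The swap move only permutes states across the ladder; it never creates or destroys them, which is what lets bold low-β excursions feed the β = 1 chain.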
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good choice for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing
Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}init = start parameters; {σ_α}init = start proposal σ's; {β} = tempering levels.

Core: hybrid parallel tempering MCMC nonlinear model fitting program, with target posterior p({X_α}|D,M,I).

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model and residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.
Adaptive hybrid MCMC: 8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). At each iteration every chain outputs its parameters, logprior + β × loglike, and logprior + loglike; parallel tempering swap operations exchange states between chains.

Two-stage proposal-σ control system: first anneal the Gaussian proposal σ's, then refine and update them, driven by the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.

Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset burn-in.
Mathematica MCMC demonstration

Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

    p(D|M0,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp[−Σ_{i=1}^{N} (d_i − 0)² / (2(σ² + s²))]

Model selection results: Bayes factor = 4.5 × 10⁴.

Methanol emission in the Sgr A environment
[Table: fitted parameters for the methanol lines: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), and s (K), where ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.
Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation. For model selection, we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
For a copy of this talk please Google Phil Gregory
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:

    W = (1 / (m(h − 1))) Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance:

    B = (h / (m − 1)) Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance:

    V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic:

    √(V̂(θ) / W)

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
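These formulas transcribe directly into code. A sketch (the `chains` argument is a list of m equal-length lists of post burn-in draws for one parameter; names are ours):

```python
def gelman_rubin(chains):
    """Gelman-Rubin statistic sqrt(V_hat / W) for one parameter,
    from m chains of h post burn-in iterations each."""
    m = len(chains)
    h = len(chains[0])
    means = [sum(c) / h for c in chains]          # per-chain means
    grand = sum(means) / m                        # grand mean over all chains
    # Mean within-chain variance W
    W = sum(sum((x - mu) ** 2 for x in c)
            for c, mu in zip(chains, means)) / (m * (h - 1))
    # Between-chain variance B
    B = h * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    # Estimated variance V_hat = (1 - 1/h) W + B/h
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return (V_hat / W) ** 0.5
```

Chains with identical means give B = 0 and a statistic slightly below 1; chains stuck in widely separated regions inflate B and push the statistic far above the ~1.05 convergence threshold quoted above.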
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 741
outline
How to proceed in a Bayesian analysis
Write down Bayesrsquo theorem identify the terms and solve
The likelihood p(D| Hi
I) also written as (Hi
) stands for
the probability that we would have gotten the data D that we
did if Hi is true
Every item to the right of the
vertical bar | is assumed to be true
p H i D I = p H i I acirc p D H i I p D I
Posterior probability
that Hi is true given
the new data D and
prior information I
Prior probability Likelihood
Normalizing constant
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 841
As a theory of extended logic BPT can be used to find optimal
answers to well posed scientific questions for a given state of
knowledge in contrast to a numerical recipe approach
outline
Two basic problems
1 Model selection (discrete hypothesis space)
ldquoWhich one of 2 or more models (hypotheses) is most probable
given our current state of knowledgerdquo
eg
bull Hypothesis or model M0 asserts that the star has no planets
bull Hypothesis M1 asserts that the star has 1 planetbull Hypothesis Mi asserts that the star has i planets
2 Parameter estimation (continuous hypothesis)
ldquoAssuming the truth of M1 solve for the probability densitydistribution for each of the model parameters based on our
current state of knowledgerdquo
egbull Hypothesis H asserts that the orbital period is between P and P+dP
S f foutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 941
Significance of this developmentoutline
Probabilities are commonly quantified by a real number between 0 and 1
0 1Realm of science
and inductive logic
truefalse
The end-points corresponding to absolutely false and absolutely true
are simply the extreme limits of this infinity of real numbers
Bayesian probability theory spans the whole range
Deductive logic is just a special case of Bayesian probability
theory in the idealized limit of complete information
Occam
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1041
Let d i represent the i th measured data value We model d i by
outline
Calculation of a simple Likelihood
Model prediction for i th data value
for current choice of parameters
p D M X I
where ei represents the error component in the measurement
d i = f i X + ei
X
Since is assumed to be true if it were not for the
error ei d i would equal the model prediction f i
p Di M X I =
1
s i 2 p Exp-
ei 2
2s i 2
=
1
s i 2 p Exp -
d i - f i X 2
2 s i 2
Now suppose prior information I indicates that ei has a Gaussian
probability distribution Then
M X
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
D M X I Calculation of a simple Likelihood
p J D M X I N=
H 2p
L- N
ecirc 2
permili= 1 N
s
i
- 1
gt ExpB-
05 sbquoi= 1 N J d i - f i H X LN 2
s i 2 F
The familiar c2
statistic used
in least-squares
For independent data the likelihood for the entire data
set D=(D1D2 hellipDN ) is the product of N Gaussians
Maximizing the likelihood corresponds to minimizing c2
Recall Bayesian posterior micro prior acirc likelihood
Thus only for a uniform prior will a least-squares analysis
yield the same solution as the Bayesian posterior
Simple example of when not to use a uniform prioroutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1341
Simple example of when not to use a uniform prior
In the exoplanet problem the prior range for the unknown
orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)
Suppose we assume a uniform prior probability density for the P
parameter This would imply that we believed that it was ~ 104
timesmore probable that the true period was in the upper decade
(104 to 105 d) of the prior range than in the lowest decade from
1 to 10 d
104
105
p P M I P
1
10 p P M I P
= 104
Usually expressing great uncertainty in some quantity corresponds
more closely to a statement of scale invariance or equal probability per
decade The Jeffreys prior has this scale invariant property
outlin
Jeffreys prior (scale invariant)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1441
Jeffreys prior (scale invariant)
p
H P M I
L dP =
P yen ln H P max ecirc P minL p Hln P M I L d ln P =
ln
ln H P max ecirc P minLor equivalently
1
10
p P M I P = 10
4
105
p P M I P
Equal probability per decade
Actually there are good reasons for searching in orbital frequency
f = 1P instead of P The form of the prior is unchanged
p ln f M I d ln f = ln
ln f max f min
Modified Jeffre s fre
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1541
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form f_i(ν0, s_L), a line profile for a given ν0, s_L, where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν0.

In this version of the problem, T, ν0, s_L are all unknowns, with prior limits:

T = 0.0 - 100.0
ν0 = 1 - 44
s_L = 0.5 - 4.0
Extra noise term e0i

We will represent the measured data by the equation

d_i = f_i + e_i + e0_i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e0_i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e0_i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally non-committal about the information we don't have).

We therefore adopt a Gaussian distribution for e0_i with a variance s². Thus the combination e_i + e0_i has a Gaussian distribution with variance = σ_i² + s².

In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and the known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 - 0.5 times the data range.
Questions of interest
Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0, no line exists"
M1 ≡ "Model 1, line exists"

M1 has 3 unknown parameters (the line temperature T, ν0, s_L) and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model
In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M1,T,I) = (2π)^(-N/2) σ^(-N) exp[ -Σ_{i=1..N} (d_i - T f_i)² / (2σ²) ]

Our new likelihood for the more complicated model, with unknown variables T, ν0, s_L, s, is

p(D|M1,T,ν0,s_L,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ -Σ_{i=1..N} (d_i - T f_i(ν0,s_L))² / (2(σ² + s²)) ]
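As a minimal sketch of this likelihood (in Python rather than the talk's Mathematica), it can be coded directly. The Gaussian shape assumed for the line profile f_i(ν0, s_L), and the noiseless fake data, are illustrative assumptions only:

```python
import numpy as np

def line_profile(nu, nu0, sL):
    """Unit-amplitude line profile (Gaussian shape assumed here),
    centred on channel nu0 with width sL, in channel units."""
    return np.exp(-(nu - nu0) ** 2 / (2.0 * sL ** 2))

def log_likelihood(d, nu, T, nu0, sL, s, sigma=1.0):
    """log p(D | M1, T, nu0, sL, s, I): Gaussian noise of known sigma
    combined with the extra Gaussian noise term of variance s**2."""
    var = sigma ** 2 + s ** 2
    resid = d - T * line_profile(nu, nu0, sL)
    N = len(d)
    return -0.5 * N * np.log(2.0 * np.pi * var) - 0.5 * np.sum(resid ** 2) / var

# Illustrative noiseless data for the 64-channel spectrometer.
nu = np.arange(1, 65)
d = 3.0 * line_profile(nu, 20.0, 2.0)
```

With noiseless data the log-likelihood is largest at the generating parameters, as one would expect.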
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima. (The true value of α is marked in the figure.)

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for one parameter of interest, e.g. the line temperature T, we need to integrate the joint posterior over all the other parameters:

p(T|D,M1,I) = ∫dν0 ∫ds_L ∫ds p(T,ν0,s_L,s|D,M1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e., Monte Carlo simulations.
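For a low-dimensional toy case, this marginalization can be sketched by brute force on a grid (the approach that MCMC replaces in high dimensions). The two-parameter joint posterior below is hypothetical, for illustration only:

```python
import numpy as np

# Brute-force marginalization on a grid: feasible only because this
# toy joint posterior has just two parameters.
T = np.linspace(0.0, 10.0, 201)      # parameter of interest
s = np.linspace(0.0, 5.0, 101)       # nuisance parameter
TT, SS = np.meshgrid(T, s, indexing="ij")

# Hypothetical unnormalized joint posterior p(T, s | D, M1, I): a Gaussian blob.
joint = np.exp(-0.5 * ((TT - 4.0) / 1.0) ** 2 - 0.5 * ((SS - 1.5) / 0.5) ** 2)

# Marginal PDF for T: integrate (here, sum) the joint posterior over s.
ds, dT = s[1] - s[0], T[1] - T[0]
marg_T = joint.sum(axis=1) * ds
marg_T /= marg_T.sum() * dT          # normalize so it integrates to 1
print(T[np.argmax(marg_T)])          # peak of the marginal, at T = 4.0
```

The cost of such a grid grows exponentially with the number of parameters, which is exactly why MCMC is needed.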
Numerical tools
Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required, and the solution is very fast using linear algebra.

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Numerical approaches for Bayesian model fitting (book chapters 10, 11, 12):
- Brute force integration and peak-finding algorithms: (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm
- Asymptotic approx's: Laplace approx's
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: the Metropolis-Hastings MCMC algorithm
P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters)

1. Choose X0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

(The second factor, q(X_t|Y)/q(Y|X_t), equals 1 for a symmetric proposal distribution like a Gaussian.)

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
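A minimal Python sketch of the algorithm above, using the symmetric Gaussian proposal so the q-ratio is 1 and the test reduces to the posterior ratio (worked in logs for numerical stability). The toy standard-normal target is illustrative:

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_steps, rng):
    """Metropolis algorithm with symmetric Gaussian proposal N(x_t, sigma)."""
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        y = x + rng.normal(0.0, sigma, size=x.size)   # proposal Y ~ q(Y|X_t)
        lp_y = log_post(y)
        # Accept if U <= p(Y|D,I) / p(X_t|D,I)
        if np.log(rng.uniform()) <= lp_y - lp:
            x, lp = y, lp_y                           # X_{t+1} = Y
        samples[t] = x                                # else X_{t+1} = X_t
    return samples

# Toy target: a standard normal posterior in one parameter, started off-peak.
rng = np.random.default_rng(1)
chain = metropolis(lambda x: -0.5 * np.sum(x ** 2), [5.0], 1.0, 20000, rng)
post = chain[5000:]                  # discard burn-in
print(post.mean(), post.std())       # close to 0 and 1
```

After discarding burn-in, the sample density approximates the target, so the sample mean and standard deviation recover the target's 0 and 1.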
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

(Figure panels: acceptance rate = 95%; acceptance rate = 63%; acceptance rate = 4%; autocorrelation.)
MCMC parameter samples for a Kepler model with 2 planets
(Figure: post burn-in samples of the two orbital periods P1 and P2; convergence checked with the Gelman-Rubin statistic.)

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007
Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
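The tempered distribution and the swap step can be sketched as follows. This is a Python illustration, not the talk's Mathematica implementation; the function names and the two-level demo are hypothetical, and the swap uses the standard parallel-tempering acceptance ratio:

```python
import numpy as np

def tempered_logpost(log_prior, log_like, beta):
    """log of p(X|D,M,beta,I) ∝ p(X|M,I) * p(D|X,M,I)**beta."""
    return lambda x: log_prior(x) + beta * log_like(x)

def swap_step(states, loglikes, betas, rng):
    """Propose swapping the parameter states of one random adjacent pair.
    Standard acceptance: min(1, exp[(beta_i - beta_j)(logL_j - logL_i)])."""
    i = rng.integers(len(betas) - 1)
    j = i + 1
    log_r = (betas[i] - betas[j]) * (loglikes[j] - loglikes[i])
    if np.log(rng.uniform()) <= log_r:
        states[i], states[j] = states[j], states[i]
        loglikes[i], loglikes[j] = loglikes[j], loglikes[i]
    return states, loglikes

# Two-level demo: the hot chain (beta = 0.1) holds the high-likelihood state,
# so the swap is accepted (log r = +90) and passes it to the beta = 1 chain.
states, loglikes, betas = [1, 2], [0.0, -100.0], [0.1, 1.0]
states, loglikes = swap_step(states, loglikes, betas, np.random.default_rng(0))
print(states)   # -> [2, 1]
```

This is the mechanism by which configurations found in the flat, easily-explored low-β chains migrate up the ladder to the β = 1 target chain.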
MCMC Technical Difficulties
1. Deciding on the burn-in period.

2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
Ans: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient.
Ans: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithm
Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels. Target posterior: p({X_α}|D,M,I).

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.
Adaptive Hybrid MCMC: output at each iteration

8 parallel tempering Metropolis chains, one per tempering level (β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09, where β = 1/T). At each iteration every chain outputs: parameters, log(prior) + β log(like), and log(prior) + log(like). Parallel tempering swap operations exchange parameter states between adjacent chains.

Two-stage proposal-σ control system:
1) Anneal the Gaussian proposal σ's; error signal = (actual joint acceptance rate - 0.25). This effectively defines the burn-in interval.
2) Refine & update the Gaussian proposal σ's (correlated parameters).

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log like) parameter set.

Monitor for the parameter set with peak probability: if (log prior + log like) exceeds the previous best by a threshold, then update and reset burn-in.
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0,I)
Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ -Σ_{i=1..N} (d_i - 0)² / (2(σ² + s²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴
Methanol emission in the Sgr A environment
(Table of fitted line parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). Here ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).)

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. (We need two to know if either is valid.)

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis:

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:
W = [1 / (m(h-1))] Σ_{j=1..m} Σ_{i=1..h} (θ_j^i - θ̄_j)²

Between-chain variance:
B = [h / (m-1)] Σ_{j=1..m} (θ̄_j - θ̄)²

(θ̄_j is the mean of chain j; θ̄ is the mean of the chain means.)

Estimated variance:
V̂(θ) = (1 - 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
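These formulas transcribe directly into code. A Python sketch (the simulated chains below are illustrative; well-mixed chains drawn from a common distribution should give a statistic near 1):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    chains: array of shape (m, h) -- m independent chains,
    h post burn-in iterations each."""
    chains = np.asarray(chains, dtype=float)
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    # Mean within-chain variance W and between-chain variance B.
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))
    B = h * np.sum((chain_means - grand_mean) ** 2) / (m - 1)
    # Estimated variance V and the convergence statistic sqrt(V/W).
    V = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V / W)

# Four well-mixed chains from the same distribution: statistic ~ 1.0 (< 1.05).
rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(4, 5000))
print(gelman_rubin(good))
```

Shifting the chains apart (simulating non-convergence) inflates the between-chain variance B and pushes the statistic well above 1.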
As a theory of extended logic, BPT can be used to find optimal answers to well-posed scientific questions for a given state of knowledge, in contrast to a numerical recipe approach.
Two basic problems

1. Model selection (discrete hypothesis space)
"Which one of 2 or more models (hypotheses) is most probable, given our current state of knowledge?"
e.g.,
- Hypothesis or model M0 asserts that the star has no planets.
- Hypothesis M1 asserts that the star has 1 planet.
- Hypothesis Mi asserts that the star has i planets.

2. Parameter estimation (continuous hypothesis space)
"Assuming the truth of M1, solve for the probability density distribution for each of the model parameters, based on our current state of knowledge."
e.g.,
- Hypothesis H asserts that the orbital period is between P and P+dP.
Significance of this development
Probabilities are commonly quantified by a real number between 0 and 1.

(Number line from 0 = false to 1 = true; everything in between is the realm of science and inductive logic.)

The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers. Bayesian probability theory spans the whole range.

Deductive logic is just a special case of Bayesian probability theory, in the idealized limit of complete information.
Calculation of a simple likelihood p(D|M,X,I)

Let d_i represent the i-th measured data value. We model d_i by

d_i = f_i(X) + e_i

where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement.

Since M,X is assumed to be true, if it were not for the error e_i, d_i would equal the model prediction f_i(X).

Now suppose prior information I indicates that e_i has a Gaussian probability distribution. Then

p(D_i|M,X,I) = [1/(σ_i √(2π))] exp[-e_i² / (2σ_i²)] = [1/(σ_i √(2π))] exp[-(d_i - f_i(X))² / (2σ_i²)]
(Figure: a Gaussian error curve centred on the predicted value f_i(X), with the measured value d_i lying a distance e_i away; p(D_i|M,X,I) is proportional to the height of the curve at d_i. Axes: signal strength vs. probability density.)

The probability of getting a data value d_i a distance e_i away from the predicted value f_i is proportional to the height of the Gaussian error curve at that location.
Calculation of a simple likelihood p(D|M,X,I), continued

For independent data, the likelihood for the entire data set D = (D1, D2, ..., DN) is the product of N Gaussians:

p(D|M,X,I) = (2π)^(-N/2) [Π_{i=1..N} σ_i^(-1)] exp[-0.5 Σ_{i=1..N} (d_i - f_i(X))² / σ_i²]

The sum in the exponent is the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².

Recall: Bayesian posterior ∝ prior × likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
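The correspondence between maximizing the likelihood and minimizing χ² can be checked numerically. The one-parameter straight-line model and data below are hypothetical, for illustration only:

```python
import numpy as np

# With Gaussian errors, log(likelihood) = -chi^2/2 + const, so the
# chi^2-minimizing parameter equals the maximum-likelihood parameter.
# Hypothetical model d_i = a * x_i with one parameter a.
x = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([2.1, 3.9, 6.2, 7.8])
sigma = np.array([0.2, 0.2, 0.3, 0.3])

a_grid = np.linspace(0.0, 4.0, 4001)
resid = d[None, :] - a_grid[:, None] * x[None, :]
chi2 = np.sum((resid / sigma) ** 2, axis=1)
loglike = -0.5 * chi2 - np.sum(np.log(sigma * np.sqrt(2.0 * np.pi)))

print(a_grid[np.argmin(chi2)] == a_grid[np.argmax(loglike)])   # True
```

The two extrema coincide exactly, because the log-likelihood differs from -χ²/2 only by an a-independent constant.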
Simple example of when not to use a uniform prior
In the exoplanet problem, the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (the upper limit set by perturbations from neighboring stars).

Suppose we assumed a uniform prior probability density for the P parameter. This would imply that we believed it was ~10⁴ times more probable that the true period was in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:

∫_{10⁴}^{10⁵} p(P|M,I) dP = 10⁴ × ∫_{1}^{10} p(P|M,I) dP

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)
p(P|M,I) dP = dP / [P ln(P_max/P_min)],  or equivalently  p(ln P|M,I) d ln P = d ln P / ln(P_max/P_min)

Equal probability per decade:

∫_{1}^{10} p(P|M,I) dP = ∫_{10⁴}^{10⁵} p(P|M,I) dP

Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

p(ln f|M,I) d ln f = d ln f / ln(f_max/f_min)
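The "equal probability per decade" property is easy to verify by sampling: drawing ln P uniformly between ln(P_min) and ln(P_max) realizes the Jeffreys prior. A Python sketch (sample size and seed are arbitrary):

```python
import numpy as np

# Jeffreys (log-uniform) prior over P in [1, 1e5] days: 5 decades,
# so each decade should receive 1/5 = 0.2 of the probability.
rng = np.random.default_rng(42)
P_min, P_max = 1.0, 1.0e5
P = np.exp(rng.uniform(np.log(P_min), np.log(P_max), size=200_000))

# Fraction of samples in the first decade (1-10 d) vs the last (1e4-1e5 d):
f_low = np.mean((P >= 1.0) & (P < 10.0))
f_high = np.mean((P >= 1.0e4) & (P < 1.0e5))
print(f_low, f_high)   # both ~ 0.2
```

Contrast with a uniform prior over the same range, which would put essentially all of its probability (~0.9) in the last decade.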
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,  0 < β ≤ 1

Typical set of β values = {0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0}

β = 1 corresponds to our desired target distribution. The others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
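The tempered distributions and the swap proposal can be sketched as follows (a minimal illustration; the swap acceptance ratio follows from the product of the two tempered posteriors, and the per-chain update machinery is omitted):

```python
import numpy as np

def tempered_logpost(log_prior, log_like, beta):
    """log p(X|D,M,beta,I) = log prior + beta * log likelihood."""
    return lambda x: log_prior(x) + beta * log_like(x)

def propose_swap(states, logls, betas, rng):
    """Propose swapping the parameter states of two adjacent chains i, i+1.
    Accept with probability min(1, exp((beta_i - beta_{i+1}) *
    (logL_{i+1} - logL_i))); the prior factors cancel."""
    i = rng.integers(len(betas) - 1)
    log_r = (betas[i] - betas[i + 1]) * (logls[i + 1] - logls[i])
    if np.log(rng.uniform()) <= log_r:
        states[i], states[i + 1] = states[i + 1], states[i]
        logls[i], logls[i + 1] = logls[i + 1], logls[i]
        return True
    return False

rng = np.random.default_rng(0)
states = [np.array([0.0]), np.array([5.0])]
logls = [-50.0, -1.0]   # the flatter (beta = 0.5) chain found a better region
# accepted with certainty here, since log_r = 0.5 * 49 >> 0
swapped = propose_swap(states, logls, betas=[1.0, 0.5], rng=rng)
```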
MCMC Technical Difficulties
1. Deciding on the burn-in period.

2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
Ans: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient.
Ans: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- Unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Figure: hybrid parallel tempering MCMC nonlinear model fitting program.
Inputs: data D, model M, prior information I; n = no. of iterations; {Xa}_init = start parameters; {σa}_init = start proposal σ's; {β} = tempering levels.
Target posterior: p({Xa}|D,M,I).
Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.
Outputs: control system diagnostics; {Xa} iterations; summary statistics; best fit model & residuals; {Xa} marginals; {Xa} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.]
Adaptive Hybrid MCMC
[Figure: adaptive hybrid MCMC schematic. 8 parallel tempering Metropolis chains run with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T); at each iteration every chain outputs its parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange parameter states between adjacent chains.
Two-stage proposal-σ control system: anneal the Gaussian proposal σ's, then refine & update them, driven by the error signal = (actual joint acceptance rate − 0.25); this effectively defines the burn-in interval.
Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
Peak parameter set: monitor for parameters with peak probability; if (logprior + loglike) exceeds the previous best by a threshold, update the best set and reset the burn-in.]
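The error-signal idea can be illustrated with a simple proportional controller on a scalar proposal σ (a hypothetical sketch; the talk's actual two-stage control system is more elaborate, and the gain, batch size, and 1-D Gaussian target here are arbitrary):

```python
import numpy as np

def tune_sigma(log_post, x0, sigma0, n_batches=60, batch=200,
               target=0.25, gain=1.0, rng=None):
    """Adapt a scalar proposal sigma so the acceptance rate approaches the
    target, using error = (actual acceptance rate - target) per batch."""
    rng = np.random.default_rng(rng)
    x, lp, sigma = float(x0), log_post(x0), float(sigma0)
    for _ in range(n_batches):
        accepted = 0
        for _ in range(batch):
            y = x + sigma * rng.standard_normal()
            lp_y = log_post(y)
            if np.log(rng.uniform()) <= lp_y - lp:
                x, lp = y, lp_y
                accepted += 1
        error = accepted / batch - target       # the error signal
        sigma *= np.exp(gain * error)           # grow sigma if accepting too often
    return sigma

# Starting far too small, sigma is driven up until ~25% of proposals are accepted
sigma = tune_sigma(lambda x: -0.5 * x * x, x0=0.0, sigma0=0.01, rng=3)
```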
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ − Σ_{i=1}^N (d_i − 0)² / (2(σ² + s²)) ]

Model selection result: Bayes factor = 4.5 × 10⁴ in favor of the spectral line model M1.
Methanol emission in the Sgr A environment

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

[Table: fitted parameters — v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things, patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance: W = [1 / (m(h−1))] Σ_{j=1}^m Σ_{i=1}^h (θ_j^i − θ̄_j)²

Between-chain variance: B = [h / (m−1)] Σ_{j=1}^m (θ̄_j − θ̄)²

Estimated variance: V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D. B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
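The statistic can be computed directly from the formulas above (a sketch; `chains` is a hypothetical (m, h) array of post burn-in samples for a single parameter):

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m, h) - m independent chains, h post burn-in
    iterations each, for one parameter. Returns sqrt(V_hat / W)."""
    chains = np.asarray(chains, dtype=float)
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = np.sum((chains - chain_means[:, None]) ** 2) / (m * (h - 1))  # within
    B = h * np.sum((chain_means - chain_means.mean()) ** 2) / (m - 1)  # between
    V_hat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(V_hat / W)

# 4 well-mixed chains sampling the same distribution give R close to 1
rng = np.random.default_rng(0)
R = gelman_rubin(rng.standard_normal((4, 5000)))
```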
Significance of this development

Probabilities are commonly quantified by a real number between 0 and 1.

[Figure: probability scale from 0 (false) to 1 (true); the realm of science and inductive logic occupies the interior of the interval.]
The end-points corresponding to absolutely false and absolutely true
are simply the extreme limits of this infinity of real numbers
Bayesian probability theory spans the whole range
Deductive logic is just a special case of Bayesian probability
theory in the idealized limit of complete information
Calculation of a simple likelihood p(D|M,X,I)

Let d_i represent the i-th measured data value. We model d_i by

d_i = f_i(X) + e_i

where f_i(X) is the model prediction for the i-th data value for the current choice of parameters X, and e_i represents the error component in the measurement.

Since M,X is assumed to be true, if it were not for the error e_i, d_i would equal the model prediction f_i(X).

Now suppose prior information I indicates that e_i has a Gaussian probability distribution. Then

p(D_i|M,X,I) = [1 / (σ_i √(2π))] exp( −e_i² / (2σ_i²) ) = [1 / (σ_i √(2π))] exp( −(d_i − f_i(X))² / (2σ_i²) )
[Figure: Gaussian error curve centered on the predicted value f_i(X), with the measured d_i a distance e_i away; p(D_i|M,X,I) is proportional to the line height. Axes: signal strength vs probability density.]

The probability of getting a data value d_i a distance e_i away from the predicted value f_i is proportional to the height of the Gaussian error curve at that location.
Calculation of a simple likelihood p(D|M,X,I)

For independent data, the likelihood for the entire data set D = (D1, D2, ..., DN) is the product of N Gaussians:

p(D|M,X,I) = (2π)^(-N/2) [ Π_{i=1}^N σ_i^(-1) ] exp[ −0.5 Σ_{i=1}^N (d_i − f_i(X))² / σ_i² ]

The sum in the exponent is the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².

Recall: Bayesian posterior ∝ prior × likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
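The likelihood-χ² correspondence can be checked numerically (a sketch with an arbitrary toy model and data, not the talk's spectral line problem):

```python
import numpy as np

def log_likelihood(d, f, sig):
    """Log of the product of N Gaussians:
    -N/2 log(2 pi) - sum(log sig) - chi2/2."""
    chi2 = np.sum((d - f) ** 2 / sig ** 2)
    return -0.5 * len(d) * np.log(2 * np.pi) - np.sum(np.log(sig)) - 0.5 * chi2

# Toy straight-line model d = a*x with a grid of candidate slopes a
x = np.linspace(0, 1, 20)
d = 2.0 * x + 0.1 * np.sin(7 * x)        # "measured" data (slope near 2)
sig = np.full_like(x, 0.1)
slopes = np.linspace(1.0, 3.0, 201)
ll = [log_likelihood(d, a * x, sig) for a in slopes]
chi2 = [np.sum((d - a * x) ** 2 / sig ** 2) for a in slopes]
best = slopes[np.argmax(ll)]             # same slope maximizes ll, minimizes chi2
```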
Simple example of when not to use a uniform prior

In the exoplanet problem, the prior range for the unknown orbital period P is very large, from ~1 day to 1000 yr (the upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed it was ~10⁴ times more probable that the true period was in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:

∫_{10⁴}^{10⁵} p(P|M,I) dP = 10⁴ × ∫_{1}^{10} p(P|M,I) dP

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)

p(P|M,I) dP = dP / [P ln(P_max / P_min)], or equivalently p(ln P|M,I) d ln P = d ln P / ln(P_max / P_min)

Equal probability per decade:

∫_{1}^{10} p(P|M,I) dP = ∫_{10⁴}^{10⁵} p(P|M,I) dP

Actually there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

p(ln f|M,I) d ln f = d ln f / ln(f_max / f_min)
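The equal-probability-per-decade property is easy to verify by sampling (a sketch; the period range matches the exoplanet example above):

```python
import numpy as np

rng = np.random.default_rng(0)
P_min, P_max = 1.0, 1.0e5    # days; 5 decades in total

# Sampling ln P uniformly realizes the Jeffreys prior
# p(P|M,I) = 1 / (P ln(P_max / P_min))
P = np.exp(rng.uniform(np.log(P_min), np.log(P_max), size=200_000))

frac_low = np.mean((P >= 1.0) & (P < 10.0))       # lowest decade
frac_high = np.mean((P >= 1.0e4) & (P < 1.0e5))   # highest decade
# each of the 5 decades carries ~1/5 of the probability
```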
Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon, which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form [equation not reproduced: the line profile f_i(ν0, σL) for a given ν0 and line width σL], where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν0.

In this version of the problem T, ν0, σL are all unknowns, with prior limits:
T = 0.0 - 100.0
ν0 = 1 - 44
σL = 0.5 - 4.0
Extra noise term e_0i

We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination of e_i + e_0i has a Gaussian distribution with variance = σ_i² + s².

In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 - 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters — the line temperature T, ν0, σL — and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M1,T,I) = (2π)^(-N/2) σ^(-N) exp[ − Σ_{i=1}^N (d_i − T f_i)² / (2σ²) ]

Our new likelihood for the more complicated model with unknown variables T, ν0, σL, s is

p(D|M1,T,ν0,σL,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ − Σ_{i=1}^N (d_i − T f_i(ν0,σL))² / (2(σ² + s²)) ]
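A sketch of this likelihood in code (the Gaussian line profile, channel grid, and simulated data are illustrative assumptions, not the talk's actual spectrum):

```python
import numpy as np

def line_profile(nu, nu0, sigL):
    """Assumed Gaussian line shape in spectrometer channel units."""
    return np.exp(-0.5 * ((nu - nu0) / sigL) ** 2)

def log_like(d, nu, T, nu0, sigL, s, sigma=1.0):
    """log p(D|M1, T, nu0, sigL, s, I) with combined variance sigma^2 + s^2."""
    var = sigma ** 2 + s ** 2
    resid = d - T * line_profile(nu, nu0, sigL)
    N = len(d)
    return -0.5 * N * np.log(2 * np.pi * var) - 0.5 * np.sum(resid ** 2) / var

# Simulated 64-channel spectrum with a weak line at channel 20 (sigma = 1 mK)
rng = np.random.default_rng(4)
nu = np.arange(1, 65)
d = 3.0 * line_profile(nu, 20.0, 2.0) + rng.standard_normal(64)
```

Evaluating `log_like` on a grid (or in an MCMC) then compares parameter choices; the true line center scores higher than a wrong one.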
Simple nonlinear model with a single parameter α

[Figure: the Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima; the true value is marked.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, we need to integrate the joint posterior over all the other parameters. For example, the marginal PDF for T is

p(T|D,M1,I) = ∫∫∫ dν0 dσL ds p(T,ν0,σL,s|D,M1,I)

where p(T,ν0,σL,s|D,M1,I) is the joint posterior probability density function (PDF) for the parameters.

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
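Given MCMC samples, the marginal integral above reduces to histogramming the T column of the sample array (a sketch with synthetic, independent "samples" standing in for real MCMC output):

```python
import numpy as np

# Pretend these are post burn-in MCMC samples of (T, nu0, sigL, s)
rng = np.random.default_rng(2)
samples = np.column_stack([
    rng.normal(3.0, 0.5, 50_000),   # T
    rng.normal(20.0, 1.0, 50_000),  # nu0
    rng.normal(2.0, 0.3, 50_000),   # sigL
    rng.normal(0.5, 0.1, 50_000),   # s
])

# Marginal PDF for T: histogram the T samples, simply ignoring the other columns
pdf, edges = np.histogram(samples[:, 0], bins=100, density=True)
T_mode = 0.5 * (edges[np.argmax(pdf)] + edges[np.argmax(pdf) + 1])
```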
Numerical tools for Bayesian model fitting

Data D, model M, prior information I.

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares; no integration is required, and the solution is very fast using linear algebra.

Nonlinear models, + linear models (non-uniform priors): the posterior may have multiple peaks. For some parameters analytic integration is sometimes possible. Otherwise:
- Brute force integration
- Peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) combined with asymptotic approximations (Laplace approx's)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC
(see chapters 10, 11, and 12)
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods, and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior. It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated Xt+1, depends on the previous sample Xt according to an entity called the transition probability or kernel, p(Xt+1|Xt). The transition kernel is assumed to be time independent.
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1041
Let d i represent the i th measured data value We model d i by
outline
Calculation of a simple Likelihood
Model prediction for i th data value
for current choice of parameters
p D M X I
where ei represents the error component in the measurement
d i = f i X + ei
X
Since is assumed to be true if it were not for the
error ei d i would equal the model prediction f i
p Di M X I =
1
s i 2 p Exp-
ei 2
2s i 2
=
1
s i 2 p Exp -
d i - f i X 2
2 s i 2
Now suppose prior information I indicates that ei has a Gaussian
probability distribution Then
M X
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
Calculation of a simple Likelihood p(D|M,X,I)

For independent data, the likelihood for the entire data set D = (D_1, D_2, …, D_N) is the product of N Gaussians:

    p(D|M,X,I) = (2π)^(−N/2) [ ∏_{i=1}^{N} σ_i^(−1) ] exp[ −0.5 Σ_{i=1}^{N} (d_i − f_i(X))² / σ_i² ]

The sum in the exponent is the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².

Recall: Bayesian posterior ∝ prior × likelihood.

Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
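To make the likelihood/χ² correspondence concrete, a small sketch with toy data (the straight-line model and the numbers are illustrative, not from the talk):

```python
import numpy as np

def log_likelihood(d, f, sigma):
    """Gaussian log-likelihood for independent data:
    ln p(D|M,X,I) = -(N/2) ln(2 pi) - sum(ln sigma_i) - chi2/2."""
    d, f, sigma = map(np.asarray, (d, f, sigma))
    chi2 = np.sum(((d - f) / sigma) ** 2)
    return (-0.5 * len(d) * np.log(2 * np.pi)
            - np.sum(np.log(sigma)) - 0.5 * chi2)

# Toy data: straight-line model, two candidate parameter choices
x = np.linspace(0.0, 1.0, 10)
d = 2.0 * x + 0.3
sigma = np.full_like(x, 0.1)
good = log_likelihood(d, 2.0 * x + 0.3, sigma)   # chi2 = 0
bad = log_likelihood(d, 1.5 * x + 0.3, sigma)    # larger chi2
# The parameter choice that minimizes chi2 maximizes the likelihood
print(good, bad)
```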
Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed it was ~10⁴ times more probable that the true period was in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:

    ∫_{10⁴}^{10⁵} p(P|M,I) dP  /  ∫_{1}^{10} p(P|M,I) dP  =  10⁴

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)

    p(P|M,I) dP = dP / [ P ln(P_max/P_min) ]

or equivalently

    p(ln P|M,I) d ln P = d ln P / ln(P_max/P_min)

Equal probability per decade:

    ∫_{1}^{10} p(P|M,I) dP  =  ∫_{10⁴}^{10⁵} p(P|M,I) dP

Actually there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

    p(ln f|M,I) d ln f = d ln f / ln(f_max/f_min)
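A quick numerical check of the equal-probability-per-decade property (the period limits below are illustrative stand-ins for the ~1 day to 1000 yr range):

```python
import numpy as np

P_min, P_max = 1.0, 3.65e5   # ~1 day to ~1000 yr, in days (illustrative)
norm = np.log(P_max / P_min)

def jeffreys_pdf(P):
    """Jeffreys prior density p(P|M,I) = 1 / (P ln(Pmax/Pmin))."""
    return 1.0 / (P * norm)

def prior_mass(a, b, n=100001):
    """Trapezoid-rule integral of the prior from a to b, done in ln P."""
    lnP = np.linspace(np.log(a), np.log(b), n)
    P = np.exp(lnP)
    # dP = P dlnP, so the integrand in lnP is P * pdf(P) = 1/norm (constant)
    y = P * jeffreys_pdf(P)
    return 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(lnP))

low = prior_mass(1.0, 10.0)       # lowest decade, 1-10 d
high = prior_mass(1.0e4, 1.0e5)   # an upper decade
print(low, high)                  # equal probability per decade
```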
Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e. Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form shown below, where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν₀.

[Figure: line profile for a given ν₀, s_L.]

In this version of the problem T, ν₀, s_L are all unknowns, with prior limits:

    T = 0.0 - 100.0
    ν₀ = 1 - 44
    s_L = 0.5 - 4.0
Extra noise term e_0i

We will represent the measured data by the equation

    d_i = f_i + e_i + e_0i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e. maximally noncommittal about the information we don't have).

We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination of e_i + e_0i has a Gaussian distribution with variance = σ_i² + s².

In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and the known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 - 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) what do we conclude about the relative probabilities of the two competing theories?
2) what is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0, no line exists"
M1 ≡ "Model 1, line exists"

M1 has 3 unknown parameters (the line temperature T, ν₀, s_L) and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

    p(D|M1,T,I) = (2π)^(−N/2) σ^(−N) exp[ −Σ_{i=1}^{N} (d_i − T f_i)² / (2σ²) ]

Our new likelihood for the more complicated model with unknown variables T, ν₀, s_L, s is

    p(D|M1,T,ν₀,s_L,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − T f_i(ν₀,s_L))² / (2(σ² + s²)) ]
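A sketch of this likelihood in Python. The slide shows the line profile only graphically, so a unit-amplitude Gaussian in channel number is assumed here for illustration; the simulated spectrum is likewise illustrative:

```python
import numpy as np

def line_profile(nu, nu0, sL):
    """Unit-amplitude line profile f_i(nu0, sL); a Gaussian in channel
    number is assumed here (the slide gives the profile graphically)."""
    return np.exp(-((nu - nu0) ** 2) / (2.0 * sL ** 2))

def log_like(d, nu, T, nu0, sL, s, sigma=1.0):
    """ln p(D|M1, T, nu0, sL, s, I) with the extra-noise variance s**2
    added in quadrature to the measurement variance sigma**2."""
    var = sigma ** 2 + s ** 2
    resid = d - T * line_profile(nu, nu0, sL)
    N = len(d)
    return -0.5 * N * np.log(2 * np.pi * var) - np.sum(resid ** 2) / (2 * var)

# Simulated 64-channel spectrum with sigma = 1 mK noise and a line at
# channel 20 (T = 5 mK, sL = 1.5 channels -- hypothetical values)
rng = np.random.default_rng(42)
nu = np.arange(1, 65)
d = 5.0 * line_profile(nu, 20.0, 1.5) + rng.normal(0.0, 1.0, size=nu.size)
print(log_like(d, nu, 5.0, 20.0, 1.5, 0.0))   # near the true parameters
print(log_like(d, nu, 0.0, 20.0, 1.5, 0.0))   # the no-line case (T = 0)
```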
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80 (the true value is marked in the figure). The N = 5 case has the broadest distribution and exhibits 4 maxima.

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, here the line strength T, we need to integrate the joint posterior over all the other parameters:

    p(T|D,M1,I) = ∫dν₀ ∫ds_L ∫ds p(T,ν₀,s_L,s|D,M1,I)

The left-hand side is the marginal PDF for T; under the integral is the joint posterior probability density function (PDF) for the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e. Monte Carlo simulations.
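In low dimensions the marginalization integral can be done by brute force on a grid, which is what motivates MCMC for larger spaces. A toy two-parameter example (the correlated-Gaussian "posterior" and all numbers are hypothetical, chosen only to illustrate integrating out a nuisance parameter):

```python
import numpy as np

def trapezoid(y, x, axis=-1):
    """Trapezoid-rule integral of sampled values y over grid x along axis."""
    y = np.moveaxis(y, axis, -1)
    return 0.5 * np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1)

# Toy joint posterior on a grid: a correlated 2-D Gaussian in (T, s)
T = np.linspace(0.0, 10.0, 201)
s = np.linspace(0.0, 5.0, 101)
TT, SS = np.meshgrid(T, s, indexing="ij")
xT = TT - 4.0             # centered at T = 4 (hypothetical)
xs = (SS - 1.5) / 0.5     # centered at s = 1.5 (hypothetical)
joint = np.exp(-0.5 * (xT**2 + xs**2 - 0.8 * xT * xs))
joint /= trapezoid(trapezoid(joint, s, axis=1), T)   # normalize on the grid

# Marginal PDF for T: integrate the joint posterior over s
marginal_T = trapezoid(joint, s, axis=1)
print(T[np.argmax(marginal_T)])   # peak of the marginal, near T = 4
```

The cost of such a grid grows exponentially with the number of parameters, which is exactly why the talk turns to MCMC next.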
Numerical tools
Linear models (uniform priors):
- Posterior has a single peak (multi-dimensional Gaussian).
- Parameters given by the normal equations of linear least-squares.
- No integration required; solution very fast using linear algebra. (chapter 10)

Nonlinear models, plus linear models with non-uniform priors:
- Posterior may have multiple peaks.
- Brute force integration; peak finding algorithms: (1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm. (chapter 11)
- Asymptotic approx's: Laplace approx's.
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (chapter 12)

For some parameters, analytic integration is sometimes possible.
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions

This title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded) the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: the Metropolis-Hastings MCMC algorithm
P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [ p(Y|D,I) / p(X_t|D,I) ] × [ q(X_t|Y) / q(Y|X_t) ], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
   - Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e. a Normal distribution N(X_t, σ).
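The algorithm above, sketched in Python for the symmetric Gaussian-proposal case (a toy 1-D standard-normal target; the talk's own implementation is in Mathematica):

```python
import numpy as np

def metropolis_hastings(log_post, x0, sigma, n_steps, rng):
    """Metropolis algorithm with a symmetric Gaussian proposal N(x_t, sigma),
    so the q(Xt|Y)/q(Y|Xt) factor is 1 and drops out of the accept ratio."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    logp = log_post(x)
    samples = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        y = x + rng.normal(0.0, sigma, size=x.size)   # proposal Y ~ q(Y|Xt)
        logp_y = log_post(y)
        # Accept if U <= p(Y|D,I)/p(Xt|D,I), i.e. ln U <= difference of logs
        if np.log(rng.uniform()) <= logp_y - logp:
            x, logp = y, logp_y
            accepted += 1
        samples[t] = x
    return samples, accepted / n_steps

# Target: standard normal (log density up to the unneeded constant factor)
rng = np.random.default_rng(0)
samples, rate = metropolis_hastings(lambda x: -0.5 * np.sum(x * x),
                                    x0=[3.0], sigma=1.0, n_steps=20000, rng=rng)
post = samples[2000:, 0]          # discard the burn-in period
print(post.mean(), post.std(), rate)
```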
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's
In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]

Tuning the proposal σ's can be a very difficult challenge for many parameters.
MCMC parameter samples for a Kepler model with 2 planets
[Figure: post burn-in MCMC samples of the two orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.
Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

    p(X|D,M,β,I) ∝ p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations are chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
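A toy sketch of the tempering ladder and swap rule. In the swap acceptance ratio the prior factors cancel, leaving only the likelihoods and the β gap. The bimodal target, proposal width, and peak locations are illustrative, not from the talk:

```python
import numpy as np

def tempered_logpost(x, beta, log_prior, log_like):
    """log of p(X|D,M,beta,I) ∝ p(X|M,I) p(D|X,M,I)**beta."""
    return log_prior(x) + beta * log_like(x)

def pt_step(states, betas, log_prior, log_like, sigma, rng):
    """One sweep: a Metropolis update in every chain, then one proposed
    swap between a randomly chosen adjacent pair of chains."""
    for k, beta in enumerate(betas):
        y = states[k] + rng.normal(0.0, sigma)
        dlog = (tempered_logpost(y, beta, log_prior, log_like)
                - tempered_logpost(states[k], beta, log_prior, log_like))
        if np.log(rng.uniform()) <= dlog:
            states[k] = y
    k = rng.integers(len(betas) - 1)          # adjacent pair (k, k+1)
    # Swap ratio: priors cancel, only likelihoods and the beta gap remain
    dlog = (betas[k + 1] - betas[k]) * (log_like(states[k]) - log_like(states[k + 1]))
    if np.log(rng.uniform()) <= dlog:
        states[k], states[k + 1] = states[k + 1], states[k]
    return states

# Bimodal likelihood: two well-separated narrow peaks at +/-5
log_like = lambda x: np.logaddexp(-0.5 * (x - 5.0) ** 2, -0.5 * (x + 5.0) ** 2)
log_prior = lambda x: -0.5 * (x / 20.0) ** 2          # broad prior
betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]
rng = np.random.default_rng(3)
states = np.zeros(len(betas))
trace = [pt_step(states, betas, log_prior, log_like, 2.5, rng)[-1]
         for _ in range(20000)]
trace = np.array(trace[2000:])   # beta = 1 chain, post burn-in
print((trace > 0).mean())        # the chain visits both modes
```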
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Ans: transform the parameter set, or differential MCMC.
4. Deciding how many iterations are sufficient. Ans: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.
MCMC details
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Diagram: hybrid parallel tempering MCMC nonlinear model fitting program.]
- Inputs: data D, model M, prior information I; n = no. of iterations; {X_a}_init = start parameters; {σ_a}_init = start proposal σ's; {β} = tempering levels.
- Target posterior: p({X_a}|D,M,I).
- Adaptive two-stage control system: (1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; (2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.
- Outputs: control system diagnostics; {X_a} iterations; summary statistics; best fit model & residuals; {X_a} marginals; {X_a} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.
Adaptive hybrid MCMC: 8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T).

Output at each iteration, for every chain: parameters, logprior + β × loglike, logprior + loglike.

- Parallel tempering swap operations exchange parameter states between chains.
- Two-stage proposal σ control system: anneal the Gaussian proposal σ's, then refine & update them; error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Monitor for parameters with peak probability. Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, then update and reset burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
Calculation of p(D|M0,I)
Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

    p(D|M0,s,I) = (2π)^(−N/2) (σ² + s²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2(σ² + s²)) ]

Model selection results: Bayes factor = 4.5×10⁴.
Methanol emission in the Sgr A environment
[Table: fitted parameters for the optically thin fit to 3 bands plus an unidentified line in the 96 GHz band. Columns include v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, and s (K).]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E. R. Seaquist (U of T), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR).
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|M_i,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. We need two to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1141
outline
pH Di raquo M X I Lproportional
to line height
ei
measured d i
Gaussian error curve
f iH X L predicted value
0 2 4 6 8
0
01
02
03
04
05
Signal strength
P r o b a b i l i t y
d e n s i t y
Probability of getting a data value d i a distance ei away from the
predicted value f i is proportional to the height of the Gaussian error curve at that location
D M X IC l l ti f i l Lik lih doutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1241
D M X I Calculation of a simple Likelihood
p J D M X I N=
H 2p
L- N
ecirc 2
permili= 1 N
s
i
- 1
gt ExpB-
05 sbquoi= 1 N J d i - f i H X LN 2
s i 2 F
The familiar c2
statistic used
in least-squares
For independent data the likelihood for the entire data
set D=(D1D2 hellipDN ) is the product of N Gaussians
Maximizing the likelihood corresponds to minimizing c2
Recall Bayesian posterior micro prior acirc likelihood
Thus only for a uniform prior will a least-squares analysis
yield the same solution as the Bayesian posterior
Simple example of when not to use a uniform prioroutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1341
Simple example of when not to use a uniform prior
In the exoplanet problem the prior range for the unknown
orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)
Suppose we assume a uniform prior probability density for the P
parameter This would imply that we believed that it was ~ 104
timesmore probable that the true period was in the upper decade
(104 to 105 d) of the prior range than in the lowest decade from
1 to 10 d
104
105
p P M I P
1
10 p P M I P
= 104
Usually expressing great uncertainty in some quantity corresponds
more closely to a statement of scale invariance or equal probability per
decade The Jeffreys prior has this scale invariant property
outlin
Jeffreys prior (scale invariant)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1441
Jeffreys prior (scale invariant)
p
H P M I
L dP =
P yen ln H P max ecirc P minL p Hln P M I L d ln P =
ln
ln H P max ecirc P minLor equivalently
1
10
p P M I P = 10
4
105
p P M I P
Equal probability per decade
Actually there are good reasons for searching in orbital frequency
f = 1P instead of P The form of the prior is unchanged
p ln f M I d ln f = ln
ln f max f min
Modified Jeffre s fre
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1541
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point: Metropolis-Hastings MCMC algorithm

p(X|D,M,I) = target posterior probability distribution
(X represents the set of model parameters)

1 Choose X_0, an initial location in the parameter space. Set t = 0.
2 Repeat:
- Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
- Sample a Uniform(0,1) random variable U.
- If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The second factor, q(X_t|Y)/q(Y|X_t), = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
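The steps above can be sketched in a few lines. The following is an illustrative Python version (the talk's own implementation is in Mathematica), applied to a toy one-dimensional target proportional to N(3, 1); the normalization constant is deliberately left out, since Metropolis-Hastings only needs density ratios:

```python
import math
import random

def metropolis_hastings(log_post, x0, sigma_prop, n_steps, seed=1):
    """Metropolis MCMC with a symmetric Gaussian proposal N(x_t, sigma_prop).

    Because the proposal is symmetric, q(x_t|y)/q(y|x_t) = 1 and the
    acceptance test reduces to U <= p(y|D,I) / p(x_t|D,I).
    """
    rng = random.Random(seed)
    x, logp = x0, log_post(x0)
    chain = []
    for _ in range(n_steps):
        y = rng.gauss(x, sigma_prop)                 # propose from N(x_t, sigma)
        logp_y = log_post(y)
        if math.log(rng.random()) <= logp_y - logp:  # accept with prob min(1, ratio)
            x, logp = y, logp_y
        chain.append(x)
    return chain

# Unnormalized log posterior of a toy N(3, 1) target.
chain = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, x0=0.0,
                            sigma_prop=1.0, n_steps=20000)
samples = chain[2000:]                               # discard burn-in
mean = sum(samples) / len(samples)
```

The post burn-in sample mean lands close to the target mean of 3, illustrating that the equilibrium sample density tracks the posterior.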
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. [Figure panels: chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelations.]
MCMC parameter samples for a Kepler model with 2 planets

P. C. Gregory, 'A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487', MNRAS 374, 1321, 2007.

[Figure: post burn-in MCMC samples of the two periods P1 and P2, with the Gelman-Rubin statistic used to check convergence.]
Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

Typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
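A minimal sketch of the scheme above in Python (not the talk's Mathematica code): one Metropolis chain per β, with a random adjacent-pair swap proposal every 10 iterations. The bimodal toy likelihood and the flat prior are assumptions made purely for illustration.

```python
import math
import random

def run_parallel_tempering(loglike, logprior, betas, x0, sigma, n_steps, seed=2):
    """Tempered targets p(X|M,I) * p(D|X,M,I)^beta, 0 < beta <= 1.
    Returns samples from the beta = 1 chain (last in the ladder)."""
    rng = random.Random(seed)
    xs = [x0] * len(betas)
    target_samples = []
    for t in range(n_steps):
        for k, beta in enumerate(betas):          # within-chain Metropolis updates
            y = rng.gauss(xs[k], sigma)
            dlog = (logprior(y) + beta * loglike(y)) \
                 - (logprior(xs[k]) + beta * loglike(xs[k]))
            if math.log(rng.random()) <= dlog:
                xs[k] = y
        if t % 10 == 0:                           # swap proposal, adjacent betas
            i = rng.randrange(len(betas) - 1)
            # Acceptance exp[(beta_i - beta_j)(logL(x_j) - logL(x_i))]; priors cancel.
            dlog = (betas[i] - betas[i + 1]) * (loglike(xs[i + 1]) - loglike(xs[i]))
            if math.log(rng.random()) <= dlog:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        target_samples.append(xs[-1])
    return target_samples

# Toy bimodal likelihood: two Gaussian peaks at -3 and +3 (illustrative only).
ll = lambda x: math.log(math.exp(-0.5 * (x - 3.0) ** 2)
                        + math.exp(-0.5 * (x + 3.0) ** 2) + 1e-300)
lp = lambda x: 0.0                                # flat prior for the sketch
betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]
samples = run_parallel_tempering(ll, lp, betas, x0=3.0, sigma=1.0, n_steps=20000)
frac_left = sum(1 for x in samples if x < 0) / len(samples)
```

Started in the right-hand mode, the β = 1 chain nevertheless spends roughly half its time in each mode, because configurations found by the flatter low-β chains percolate up through the swaps.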
MCMC Technical Difficulties

1 Deciding on the burn-in period.
2 Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3 Handling highly correlated parameters. Ans: transform the parameter set, or use differential MCMC.
4 Deciding how many iterations are sufficient. Ans: use the Gelman-Rubin statistic.
5 Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
-Parallel tempering
-Simulated annealing
-Genetic algorithm
-Differential evolution
-A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
-precision radial velocity data (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

Parallel tempering
Simulated annealing
Genetic algorithm
Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Diagram: Data (D), Model (M), and prior information (I) feed a hybrid parallel tempering MCMC nonlinear model fitting program with target posterior p({X_α}|D,M,I). Inputs: n = no. of iterations, {X_α}_init = start parameters, {σ_α}_init = start proposal σ's, {β} = tempering levels. Adaptive two-stage control system: 1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; 2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains. Outputs: control system diagnostics, {X_α} iterations, summary statistics, best fit model & residuals, {X_α} marginals, {X_α} 68.3% credible regions, and the marginal likelihood p(D|M,I) for model comparison.]
Adaptive Hybrid MCMC: output at each iteration

[Diagram: 8 parallel tempering Metropolis chains with β = 1/T = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. At each iteration every chain outputs its parameters, log(prior) + β × log(like), and log(prior) + log(like); parallel tempering swap operations connect adjacent chains. A two-stage proposal-σ control system first anneals and then refines and updates the Gaussian proposal σ's, driven by the error signal = (actual joint acceptance rate − 0.25); this effectively defines the burn-in interval. The chains are monitored for parameters with peak probability. Genetic algorithm: every 10th iteration a gene crossover operation is performed to breed a larger (log(prior) + log(like)) parameter set. Peak parameter set: if (log(prior) + log(like)) exceeds the previous best by a threshold, the best set is updated and burn-in is reset.]
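The error-signal idea in the control system above can be illustrated with a deliberately crude Python stand-in (the talk's actual two-stage annealing control system in Mathematica is considerably more elaborate): after each batch of iterations, the proposal σ is scaled up or down according to how far the measured acceptance rate sits from the 25% target.

```python
import math
import random

def tune_proposal_sigma(log_post, x0, sigma0, target_rate=0.25,
                        n_batches=60, batch_size=100, seed=3):
    """Adapt a Gaussian proposal sigma toward a target acceptance rate.
    Multiplicative update driven by error signal = (rate - target_rate)."""
    rng = random.Random(seed)
    x, logp, sigma = x0, log_post(x0), sigma0
    for _ in range(n_batches):
        accepted = 0
        for _ in range(batch_size):
            y = rng.gauss(x, sigma)
            logp_y = log_post(y)
            if math.log(rng.random()) <= logp_y - logp:
                x, logp = y, logp_y
                accepted += 1
        rate = accepted / batch_size
        sigma *= math.exp(rate - target_rate)   # grow if too accepting, shrink if not
    return sigma

# Toy N(0, 1) target; a wildly oversized starting sigma gets pulled back.
sigma = tune_proposal_sigma(lambda x: -0.5 * x * x, x0=0.0, sigma0=50.0)
```

Starting from σ = 50 on a unit-width target, the adapted σ settles at the modest value where the chain accepts roughly a quarter of its proposals.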
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0, I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ -Σ_(i=1..N) (d_i - 0)² / (2(σ² + s²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴.
Methanol emission in the Sgr A environment

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.

[Table of fitted quantities: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]
Conclusions

1 For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2 Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3 Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance: W = [1/(m(h-1))] Σ_(j=1..m) Σ_(i=1..h) (θ_j^i - θ̄_j)²

Between-chain variance: B = [h/(m-1)] Σ_(j=1..m) (θ̄_j - θ̄)²

Estimated variance: V̂(θ) = (1 - 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science 7, pp. 457-511.
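The formulas above translate directly into code. An illustrative Python sketch, applied to synthetic chains (real chains would come from the MCMC itself):

```python
import math
import random

def gelman_rubin(chains):
    """Gelman-Rubin statistic sqrt(V_hat / W) for one parameter,
    given m chains of h post burn-in iterations each."""
    m, h = len(chains), len(chains[0])
    means = [sum(c) / h for c in chains]
    grand = sum(means) / m
    # Mean within-chain variance W and between-chain variance B.
    W = sum(sum((x - mu) ** 2 for x in c)
            for c, mu in zip(chains, means)) / (m * (h - 1))
    B = h * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    V_hat = (1 - 1 / h) * W + B / h
    return math.sqrt(V_hat / W)

rng = random.Random(4)
# Four well-mixed chains sampling the same N(0, 1) target: statistic near 1.
good = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(4)]
# Four chains stuck at different levels: statistic well above the 1.05 threshold.
bad = [[rng.gauss(mu, 1) for _ in range(2000)] for mu in (0, 2, 4, 6)]
```

Chains that have all converged to the same distribution give a value just above 1, while chains trapped in different regions are flagged immediately.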
Calculation of a simple likelihood p(D|M,X,I)

For independent data, the likelihood for the entire data set D = (D1, D2, ..., DN) is the product of N Gaussians:

p(D|M,X,I) = (2π)^(-N/2) [Π_(i=1..N) σ_i^(-1)] exp[ -0.5 Σ_(i=1..N) (d_i - f_i(X))² / σ_i² ]

The exponent contains the familiar χ² statistic used in least-squares, so maximizing the likelihood corresponds to minimizing χ².

Recall: Bayesian posterior ∝ prior × likelihood. Thus only for a uniform prior will a least-squares analysis yield the same solution as the Bayesian posterior.
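The identity between the Gaussian likelihood and χ² is easy to verify numerically. An illustrative Python sketch (the data values below are made up for the check):

```python
import math

def log_likelihood(d, f, sigma):
    """ln p(D|M,X,I) for independent Gaussian errors: a product of N
    Gaussians whose exponent is -chi^2 / 2."""
    N = len(d)
    chi2 = sum((di - fi) ** 2 / si ** 2 for di, fi, si in zip(d, f, sigma))
    log_norm = -0.5 * N * math.log(2 * math.pi) - sum(math.log(si) for si in sigma)
    return log_norm - 0.5 * chi2

# Made-up data, model predictions, and per-channel errors.
d = [1.2, 0.8, 1.1]
f = [1.0, 1.0, 1.0]
s = [0.1, 0.1, 0.1]
# Here chi^2 = (0.2^2 + 0.2^2 + 0.1^2) / 0.01 = 9, so the log-likelihood
# sits exactly chi^2/2 = 4.5 below its maximum (reached when f = d).
```

Any change that lowers χ² raises the log-likelihood by exactly half the decrease, which is why least-squares and maximum likelihood coincide for Gaussian errors.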
Simple example of when not to use a uniform prior

In the exoplanet problem the prior range for the unknown orbital period P is very large: from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars).

Suppose we assume a uniform prior probability density for the P parameter. This would imply that we believed it was ~10⁴ times more probable that the true period was in the upper decade (10⁴ to 10⁵ d) of the prior range than in the lowest decade, from 1 to 10 d:

∫_(10⁴)^(10⁵) p(P|M,I) dP / ∫_1^10 p(P|M,I) dP = 10⁴

Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance, or equal probability per decade. The Jeffreys prior has this scale-invariant property.
Jeffreys prior (scale invariant)

p(P|M,I) dP = dP / [P ln(P_max/P_min)]

or equivalently

p(ln P|M,I) d ln P = d ln P / ln(P_max/P_min)

∫_1^10 p(P|M,I) dP = ∫_(10⁴)^(10⁵) p(P|M,I) dP   (equal probability per decade)

Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

p(ln f|M,I) d ln f = d ln f / ln(f_max/f_min)
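The equal-probability-per-decade property can be checked directly, since the Jeffreys prior integrates analytically. An illustrative Python sketch using the talk's exoplanet-style range P = 1 to 10⁵ d:

```python
import math

def jeffreys_pdf(P, P_min, P_max):
    """p(P|M,I) = 1 / (P * ln(P_max/P_min)), the scale-invariant Jeffreys prior."""
    return 1.0 / (P * math.log(P_max / P_min))

def prob_in_range(a, b, P_min, P_max):
    """Analytic integral of the Jeffreys prior from a to b."""
    return math.log(b / a) / math.log(P_max / P_min)

p_low = prob_in_range(1.0, 10.0, 1.0, 1e5)    # lowest decade, 1-10 d
p_high = prob_in_range(1e4, 1e5, 1.0, 1e5)    # highest decade, 10^4-10^5 d
total = prob_in_range(1.0, 1e5, 1.0, 1e5)     # whole range
```

With 5 decades in the range, each decade carries probability exactly 1/5, in contrast to the uniform prior's 10⁴-to-1 imbalance.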
Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form shown [figure: the line profile for a given ν₀, s_L], where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν₀.

In this version of the problem, T, ν₀, s_L are all unknowns, with prior limits:
T = 0.0 - 100.0
ν₀ = 1 - 44
s_L = 0.5 - 4.0
Extra noise term e_0i

We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

d_i = ith measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination e_i + e_0i has a Gaussian distribution with variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s = 0 to 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters: the line temperature T, ν₀, s_L, and one nuisance parameter s. M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M1,T,I) = (2π)^(-N/2) σ^(-N) exp[ -Σ_(i=1..N) (d_i - T f_i)² / (2σ²) ]

Our new likelihood, for the more complicated model with unknown variables T, ν₀, s_L, s, is

p(D|M1,T,ν₀,s_L,s,I) = (2π)^(-N/2) (σ² + s²)^(-N/2) exp[ -Σ_(i=1..N) (d_i - T f_i(ν₀,s_L))² / (2(σ² + s²)) ]
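The new likelihood can be sketched in Python. The Gaussian line profile below stands in for the talk's predicted shape f_i(ν₀, s_L), and the noiseless 64-channel toy spectrum is an assumption made so the check is exact; both are illustrative, not the talk's actual data or code.

```python
import math

def line_profile(nu, nu0, sL):
    """Assumed Gaussian line shape of unit amplitude, center nu0, width sL."""
    return math.exp(-0.5 * ((nu - nu0) / sL) ** 2)

def log_like_M1(d, sigma, T, nu0, sL, s):
    """ln p(D|M1,T,nu0,sL,s,I) with combined variance sigma^2 + s^2."""
    N = len(d)
    var = sigma ** 2 + s ** 2
    chi2 = sum((d[i] - T * line_profile(i + 1, nu0, sL)) ** 2
               for i in range(N)) / var
    return -0.5 * N * math.log(2 * math.pi * var) - 0.5 * chi2

# 64-channel toy spectrum: pure model signal, so chi^2 = 0 at the true parameters.
T_true, nu0_true, sL_true = 3.0, 37.0, 2.0
d = [T_true * line_profile(i + 1, nu0_true, sL_true) for i in range(64)]
best = log_like_M1(d, sigma=1.0, T=T_true, nu0=nu0_true, sL=sL_true, s=0.0)
other = log_like_M1(d, sigma=1.0, T=0.0, nu0=nu0_true, sL=sL_true, s=0.0)
```

At the true parameters the exponent vanishes and only the normalization term remains; any other parameter choice (e.g., T = 0, the no-line model) scores lower.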
Simple nonlinear model with a single parameter α

[Figure: the Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80, with the true value marked. The N = 5 case has the broadest distribution and exhibits 4 maxima.]

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization

In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter such as T, we need to integrate the joint posterior over all the other parameters:

p(T|D,M1,I) = ∫dν₀ ∫ds_L ∫ds p(T,ν₀,s_L,s|D,M1,I)

(The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function for the parameters.)

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.
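Because MCMC samples are drawn in proportion to the joint posterior, the marginalization integral above reduces to simply histogramming the samples of T and ignoring the other coordinates. A minimal Python sketch, using synthetic (T, ν₀) stand-in samples rather than a real chain:

```python
import math
import random

def marginal_histogram(samples, index, bins, lo, hi):
    """Approximate the marginal PDF of parameter `index` by binning MCMC
    samples and ignoring all other coordinates: marginalization comes free."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        k = math.floor((x[index] - lo) / width)
        if 0 <= k < bins:
            counts[k] += 1
    norm = sum(counts) * width
    return [c / norm for c in counts]        # density estimate, integrates to ~1

rng = random.Random(5)
# Stand-in joint samples (T, nu0); in practice these come from the MCMC chain.
samples = [(rng.gauss(3.0, 0.5), rng.gauss(37.0, 2.0)) for _ in range(20000)]
pdf_T = marginal_histogram(samples, index=0, bins=40, lo=0.0, hi=6.0)
```

The resulting density integrates to one and peaks at the marginal mode of T, with no extra computation beyond the binning.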
Numerical tools for Bayesian model fitting

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares; no integration is required and the solution is very fast using linear algebra.

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters analytic integration is sometimes possible; otherwise:
- Brute force integration
- Asymptotic approximations (chapter 10): peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus Laplace approximations
- Moderate dimensions (chapter 11): quadrature, randomized quadrature, adaptive quadrature
- High dimensions (chapter 12): MCMC
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1341
Simple example of when not to use a uniform prior
In the exoplanet problem the prior range for the unknown
orbital period P is very large from ~1 day to 1000 yr (upper limit set by perturbations from neighboring stars)
Suppose we assume a uniform prior probability density for the P
parameter This would imply that we believed that it was ~ 104
timesmore probable that the true period was in the upper decade
(104 to 105 d) of the prior range than in the lowest decade from
1 to 10 d
104
105
p P M I P
1
10 p P M I P
= 104
Usually expressing great uncertainty in some quantity corresponds
more closely to a statement of scale invariance or equal probability per
decade The Jeffreys prior has this scale invariant property
outlin
Jeffreys prior (scale invariant)

p(P|M,I)\, dP = \frac{dP}{P \ln(P_{max}/P_{min})}, \qquad p(\ln P|M,I)\, d\ln P = \frac{d\ln P}{\ln(P_{max}/P_{min})}

or equivalently

\int_{1}^{10} p(P|M,I)\, dP = \int_{10^4}^{10^5} p(P|M,I)\, dP

Equal probability per decade.

Actually, there are good reasons for searching in orbital frequency f = 1/P instead of P. The form of the prior is unchanged:

p(\ln f|M,I)\, d\ln f = \frac{d\ln f}{\ln(f_{max}/f_{min})}

Modified Jeffreys frequency prior
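A small numerical check of the two slides above: under a uniform prior the top decade swamps the bottom one, while the Jeffreys prior assigns equal probability per decade (function names and the 1 d to 10^5 d example range are mine, following the slide's example):

```python
import math

P_min, P_max = 1.0, 1.0e5                # the slide's example range, in days

def jeffreys_prob(lo, hi):
    """Probability of P in [lo, hi] under p(P|M,I) = 1 / (P ln(Pmax/Pmin))."""
    return math.log(hi / lo) / math.log(P_max / P_min)

def uniform_prob(lo, hi):
    """Probability of P in [lo, hi] under a flat prior on [Pmin, Pmax]."""
    return (hi - lo) / (P_max - P_min)

low = uniform_prob(1.0, 10.0)            # lowest decade, 1-10 d
high = uniform_prob(1.0e4, 1.0e5)        # upper decade, 10^4-10^5 d
print(high / low)                        # ~10^4: the uniform prior's bias
print(jeffreys_prob(1.0, 10.0), jeffreys_prob(1.0e4, 1.0e5))  # equal per decade
```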
Integration not minimization

A full Bayesian analysis requires integrating over the model parameter space. Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived baryon, which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea, and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form shown in the slide [equation displayed as an image in the original], where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν_0. The line profile is fixed for a given ν_0, s_L.

In this version of the problem, T, ν_0, and s_L are all unknowns, with prior limits:

T = 0.0 - 100.0
ν_0 = 1 - 44
s_L = 0.5 - 4.0
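The line-shape equation is an image in the original; a common choice consistent with the description (unit-height profile, center ν_0, width s_L in channel units) is a Gaussian, which this sketch assumes. The function names and the specific parameter values are illustrative only:

```python
import math

def line_profile(nu, nu0, sL):
    """Unit-height Gaussian line shape in channel units (assumed form;
    the slide's equation is an image in the original)."""
    return math.exp(-((nu - nu0) ** 2) / (2.0 * sL ** 2))

def model(nu, T, nu0, sL):
    """Predicted signal in mK: line amplitude T times the profile."""
    return T * line_profile(nu, nu0, sL)

channels = range(1, 65)                  # the spectrometer's 64 channels
spectrum = [model(nu, T=3.0, nu0=37.0, sL=2.0) for nu in channels]
peak = max(spectrum)
print(peak)                              # equals T at the line center channel
```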
Extra noise term e_0i

We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

d_i = ith measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally noncommittal about the information we don't have). We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination of e_i + e_0i has a Gaussian distribution with variance = s_i² + s².

In a Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s: 0 to 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters: the line temperature T, ν_0, s_L, and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M_1, T, I) = (2\pi)^{-N/2}\, \sigma^{-N} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2}\right]

Our new likelihood for the more complicated model with unknown variables T, ν_0, s_L, s is

p(D|M_1, T, \nu_0, s_L, s, I) = (2\pi)^{-N/2}\, (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0, s_L))^2}{2(\sigma^2 + s^2)}\right]
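In code, the new likelihood is the familiar Gaussian log-likelihood with the per-channel variance inflated from σ² to σ² + s². A minimal sketch (the function name and toy numbers are mine):

```python
import math

def log_likelihood(d, f, sigma, s):
    """log p(D|M1, T, nu0, sL, s, I): independent Gaussian errors with
    total variance sigma^2 + s^2 per channel; f holds the model
    predictions T * f_i(nu0, sL). Setting s = 0 recovers the earlier
    fixed-sigma likelihood."""
    var = sigma ** 2 + s ** 2
    chisq = sum((di - fi) ** 2 for di, fi in zip(d, f)) / var
    return -0.5 * len(d) * math.log(2.0 * math.pi * var) - 0.5 * chisq

d = [1.2, -0.3, 0.8]                     # toy data (mK)
f = [1.0, 0.0, 1.0]                      # toy model predictions
ll = log_likelihood(d, f, sigma=1.0, s=0.0)
print(ll)
```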
Simple nonlinear model with a single parameter α

The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima.

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.
Integration not minimization

In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest, e.g. the line strength T, we need to integrate the joint posterior over all the other parameters:

p(T|D, M_1, I) = \int d\nu_0 \int ds_L \int ds\; p(T, \nu_0, s_L, s | D, M_1, I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e., Monte Carlo simulations.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).
Numerical tools for Bayesian model fitting

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian), and the parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra. (Chapter 10)

Nonlinear models, plus linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Options include:
- Brute force integration.
- Asymptotic approximations: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) combined with Laplace approximations. (Chapter 11)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (Chapter 12)
Resources and solutions

Chapters of the accompanying book (P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press):

1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

This title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces

Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior. It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: Metropolis-Hastings MCMC algorithm

p(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location in the parameter space. Set t = 0.
2. Repeat:
- Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
- Sample a Uniform(0, 1) random variable U.
- If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
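The steps above translate directly into code. A minimal one-dimensional sketch with a symmetric Gaussian proposal, so the Hastings factor is 1 and the test is done in log space (function name, target, and tuning values are mine):

```python
import math
import random

def metropolis(log_post, x0, sigma, n, seed=42):
    """Metropolis sampler with a symmetric Gaussian proposal N(x_t, sigma).
    With a symmetric proposal q(Xt|Y)/q(Y|Xt) = 1, so the acceptance
    test reduces to U <= p(Y|D,I)/p(Xt|D,I)."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n):
        y = rng.gauss(x, sigma)          # draw Y from the proposal q(Y|Xt)
        lp_y = log_post(y)
        # accept with probability min(1, p(Y)/p(Xt))
        if lp_y - lp >= 0 or math.log(rng.random()) <= lp_y - lp:
            x, lp = y, lp_y
        samples.append(x)                # on rejection, Xt is repeated
    return samples

# Target: unit Gaussian (log posterior up to a constant); start far out at x = 5.
chain = metropolis(lambda x: -0.5 * x * x, x0=5.0, sigma=1.0, n=20000)
burned = chain[2000:]                    # discard burn-in
mean = sum(burned) / len(burned)
print(mean)
```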
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. [Figure: chains run with three different proposal widths, giving acceptance rates of 95%, 63%, and 4%, together with their autocorrelation functions.]
MCMC parameter samples for a Kepler model with 2 planets

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

[Figure: post-burn-in samples of the two periods P1 and P2, with the Gelman-Rubin statistic.]
Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D, M, \beta, I) = p(X|M, I)\, p(D|X, M, I)^{\beta}, \qquad 0 < \beta \le 1

A typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random, and a proposal is made to swap their parameter states. The swap allows an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
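A toy sketch of the tempering ladder and swap step just described, using the slide's β values and an invented bimodal likelihood (all function names and tuning values are mine; real applications would be multi-parameter):

```python
import math
import random

rng = random.Random(7)
betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]  # tempering ladder

def log_like(x):
    # Toy bimodal "likelihood": two well-separated Gaussian peaks.
    return math.log(math.exp(-0.5 * (x + 10.0) ** 2) +
                    math.exp(-0.5 * (x - 10.0) ** 2) + 1e-300)

def log_prior(x):
    return 0.0 if -50.0 < x < 50.0 else -math.inf

xs = [0.0] * len(betas)                  # one state per tempered chain

def mh_step(i):
    """Metropolis update of chain i, targeting p(x|M,I) p(D|x,M,I)^beta_i."""
    y = xs[i] + rng.gauss(0.0, 2.0)
    dlp = (log_prior(y) + betas[i] * log_like(y)) - \
          (log_prior(xs[i]) + betas[i] * log_like(xs[i]))
    if dlp >= 0 or math.log(rng.random()) <= dlp:
        xs[i] = y

def swap_step():
    """Propose swapping the states of a random adjacent pair of chains."""
    i = rng.randrange(len(betas) - 1)
    dlp = (betas[i] - betas[i + 1]) * (log_like(xs[i + 1]) - log_like(xs[i]))
    if dlp >= 0 or math.log(rng.random()) <= dlp:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]

target_samples = []
for t in range(20000):
    for i in range(len(betas)):
        mh_step(i)
    swap_step()
    target_samples.append(xs[-1])        # final results use the beta = 1 chain

# Fraction of beta = 1 samples sitting on one of the two modes at x = -10, +10.
frac = sum(1 for x in target_samples
           if abs(abs(x) - 10.0) < 3.0) / len(target_samples)
print(frac)
```

The flatter low-β chains can cross the barrier between the peaks, and the swaps let that mode information propagate down to the β = 1 chain.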
MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Diagram: hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels. Target posterior: p({X_α}|D,M,I).]

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

[Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.]
Adaptive hybrid MCMC: output at each iteration

[Diagram: 8 parallel tempering Metropolis chains with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T), linked by parallel tempering swap operations. At each iteration, every chain outputs: parameters, log prior + β × log likelihood, and log prior + log likelihood.]

Two-stage proposal-σ control system: error signal = (actual joint acceptance rate − 0.25). Stage 1 anneals the Gaussian proposal σ's; stage 2 refines and updates them. This effectively defines the burn-in interval.

Monitor for parameters with peak probability. Peak parameter set: if (log prior + log likelihood) exceeds the previous best by a threshold, then update and reset burn-in.

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log likelihood) parameter set.
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M_0, I)

Model M_0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M_0, s, I) = (2\pi)^{-N/2}\, (\sigma^2 + s^2)^{-N/2} \exp\left[-\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)}\right]

Model selection results: Bayes factor = 4.5 × 10^4
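The structure of this calculation, marginalizing the nuisance parameter s out of p(D|M_0, s, I) with a flat prior, can be sketched numerically. The toy data and the flat-prior choice below are mine; the talk's quoted Bayes factor of 4.5 × 10^4 comes from the actual spectrum, not from this sketch:

```python
import math

def log_like_M0(d, sigma, s):
    """log p(D|M0, s, I): the no-line model predicts zero signal."""
    var = sigma ** 2 + s ** 2
    return (-0.5 * len(d) * math.log(2.0 * math.pi * var)
            - 0.5 * sum(di ** 2 for di in d) / var)

def marginal_likelihood_M0(d, sigma, s_max, n=1000):
    """p(D|M0, I) = integral of p(s|I) p(D|M0, s, I) ds, with a flat
    prior p(s|I) = 1/s_max on [0, s_max], by the midpoint rule."""
    ds = s_max / n
    return sum(math.exp(log_like_M0(d, sigma, (k + 0.5) * ds)) / s_max * ds
               for k in range(n))

d = [0.3, -1.1, 0.4, 0.9, -0.6]          # toy 5-channel "spectrum" (mK)
ml0 = marginal_likelihood_M0(d, sigma=1.0, s_max=1.0)
print(ml0)
```

Doing the same marginalization (over s and the line parameters) for M_1 and taking the ratio p(D|M_1,I)/p(D|M_0,I) gives the Bayes factor.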
Methanol emission in the Sgr A environment

Optically thin fit to 3 bands, plus an unidentified line in the 96 GHz band. ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

[Table residue: fitted quantities include v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), and T_UL (K).]

M. Stanković, E. R. Seaquist (U of T), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)
Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google "Phil Gregory".
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1441
Jeffreys prior (scale invariant)
p
H P M I
L dP =
P yen ln H P max ecirc P minL p Hln P M I L d ln P =
ln
ln H P max ecirc P minLor equivalently
1
10
p P M I P = 10
4
105
p P M I P
Equal probability per decade
Actually there are good reasons for searching in orbital frequency
f = 1P instead of P The form of the prior is unchanged
p ln f M I d ln f = ln
ln f max f min
Modified Jeffre s fre
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1541
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space Integration is more difficult than minimization
However the Bayesian solution provides the most accurate
information about the parameter errors and correlations without
the need for any additional calculations ie Monte Carlo
simulations
Shortly discuss an efficient method for
Integrating over a large parameter spacecalled Markov chain Monte Carlo (MCMC)
End of Bayesian primer
outline
Si l S t l Li P bl
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- Unique control system that automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines.
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithm
Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal.
Inputs: data, model, prior information.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system
that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I. Target posterior: p({Xα}|D,M,I).

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal
distribution σ's using an annealing operation.
2) Monitors the MCMC for emergence of a significantly improved
parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Start values: n = no. of iterations, {Xα}init = start parameters, {σα}init = start proposal σ's, {β} = tempering levels.

Outputs: control system diagnostics, {Xα} iterations, summary statistics, best fit model & residuals, {Xα} marginals,
{Xα} 68.3% credible regions, and p(D|M,I), the marginal likelihood for model comparison.
Adaptive Hybrid MCMC — output at each iteration
8 parallel tempering Metropolis chains. Output at each iteration: parameters, logprior + β × loglike, and logprior + loglike, for each of the tempering levels
β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T), with parallel tempering swap operations between chains.

Two-stage proposal σ control system: error signal =
(actual joint acceptance rate – 0.25).
Anneal the Gaussian proposal σ's, then refine & update them; this effectively defines the burn-in interval.

Monitor for parameters with peak probability. Genetic algorithm:
every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
Peak parameter set: if (logprior + loglike) exceeds the previous best by a
threshold, then update and reset the burn-in.

MCMC adaptive control system.
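The acceptance-rate feedback named in the error signal above can be sketched as a toy proportional controller (Python; the standard normal target, the window length, and the exponential update rule are illustrative, not Gregory's actual control system):

```python
import math
import random

random.seed(0)

def log_post(x):
    return -0.5 * x * x             # toy target: standard normal

sigma = 5.0                          # deliberately poor starting proposal width
x = 0.0
accepted = 0
window = 200                         # iterations between control updates

for t in range(1, 10001):
    prop = x + random.gauss(0.0, sigma)
    if math.log(random.random()) < log_post(prop) - log_post(x):
        x = prop
        accepted += 1
    if t % window == 0:
        # error signal = (actual acceptance rate - 0.25), as on the slide
        error = accepted / window - 0.25
        sigma *= math.exp(error)     # widen sigma if accepting too often
        accepted = 0
```

The feedback drives the acceptance rate toward the 0.25 target, widening σ when too many proposals are accepted and narrowing it when too few are.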
Go to Mathematica support material
Go to Mathematica version of MCMC
Calculation of p(D|M0, I)
Model M0 assumes the spectrum is consistent with noise and has no
free parameters, so we can write:
p(D|M0, s, I) = (2π)^(−N/2) (s² + σ²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2 (s² + σ²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴
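A numerical sketch of this calculation (Python; the eight data values are illustrative, not the talk's 64-channel spectrum), marginalizing the nuisance parameter s over its uniform prior by simple midpoint quadrature:

```python
import math

# Illustrative data (mK); in the talk these would be the 64 channel measurements
d = [0.3, -1.2, 0.8, 0.5, -0.4, 1.1, -0.9, 0.2]
sigma = 1.0                          # known channel noise, mK
N = len(d)

def like_M0(s):
    """p(D|M0, s, I): Gaussian noise with total variance sigma^2 + s^2."""
    v = s * s + sigma * sigma
    log_l = (-0.5 * N * math.log(2.0 * math.pi * v)
             - sum(di * di for di in d) / (2.0 * v))
    return math.exp(log_l)

# Marginalize s over its uniform prior [0, s_max], s_max = 0.5 * data range
s_max = 0.5 * (max(d) - min(d))
M = 1000
ds = s_max / M
p_D_M0 = sum(like_M0((i + 0.5) * ds) / s_max * ds for i in range(M))
```

Dividing the analogous marginal likelihood for M1 by `p_D_M0` gives the Bayes factor quoted above.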
Methanol emission in the Sgr A environment
Fitted parameters (table): v (km s⁻¹), FWHM (km s⁻¹), T_J (K), column densities (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).

ν_UL (MHz) is the rest frequency of the unidentified
line after removal of the Doppler velocity v (km s⁻¹).
M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful
means of computing the integrals required to compute the posterior
probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters, MCMC
techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.
3. Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution.
This is fine for parameter estimation.

For model selection we need to determine the proportionality constant,
to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a
much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values; however, this
becomes computationally very intensive for m > 17.
For a copy of this talk please Google Phil Gregory
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960),
author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:   W = [1 / (m (h − 1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance:   B = [h / (m − 1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance:   V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.
Ref: Gelman, A. and D.B. Rubin (1992), 'Inference from iterative simulations using multiple sequences (with discussion)', Statistical Science 7, pp. 457-511.
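These formulas transcribe directly into code (plain Python; the two short example chains are illustrative, real chains would hold thousands of post burn-in draws):

```python
def gelman_rubin(chains):
    """chains: m lists, each holding the last h post burn-in draws
    of one parameter from one independent simulation."""
    m = len(chains)
    h = len(chains[0])
    means = [sum(c) / h for c in chains]          # per-chain means
    grand = sum(means) / m                        # overall mean
    # Mean within-chain variance W
    W = sum(sum((x - mu) ** 2 for x in c)
            for c, mu in zip(chains, means)) / (m * (h - 1))
    # Between-chain variance B
    B = h * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    # Estimated variance V and the Gelman-Rubin statistic
    V = (1.0 - 1.0 / h) * W + B / h
    return (V / W) ** 0.5

# Two well-mixed example chains give a statistic near 1
r = gelman_rubin([[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]])
```

In practice one computes this for every model parameter and continues sampling until all values are below roughly 1.05.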
Integration not minimization
A full Bayesian analysis requires integrating over the model
parameter space. Integration is more difficult than minimization.
However, the Bayesian solution provides the most accurate
information about the parameter errors and correlations, without
the need for any additional calculations, i.e. Monte Carlo
simulations.

Shortly we will discuss an efficient method for
integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

End of Bayesian primer
Simple Spectral Line Problem

Background (prior) information: Two competing grand unification theories have been proposed, each
championed by a Nobel prize winner in physics. We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data.

Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength.

Unfortunately, it is not feasible to detect the line in the laboratory. The
only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space.
Data

To test this prediction, a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained. The spectrometer has 64 frequency channels.

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem
The predicted line shape has the form of a line profile for a given ν₀ and s_L,
where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the
spectrometer channel number, and the line center frequency is ν₀.

In this version of the problem T, ν₀, and s_L are all unknowns, with
prior limits:
T = 0.0 – 100.0
ν₀ = 1 – 44
s_L = 0.5 – 4.0
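The profile f_i itself is not spelled out in the text above; this sketch assumes a Gaussian line shape in channel number, centered at ν₀ with width s_L (an assumption consistent with the parameters listed):

```python
import math

def line_profile(nu, nu0, s_L):
    """Unit-amplitude line shape f_i; the model signal is T * f_i.
    Gaussian form is an assumption, not quoted from the talk."""
    return math.exp(-0.5 * ((nu - nu0) / s_L) ** 2)

# Model prediction over the 64 spectrometer channels
# for illustrative parameter values T = 3 mK, nu0 = 20, s_L = 2
T, nu0, s_L = 3.0, 20.0, 2.0
model = [T * line_profile(i, nu0, s_L) for i in range(1, 65)]
```

The prediction peaks at T in channel ν₀ and falls toward zero far from the line center.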
Extra noise term e_0i
We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f_i

In the absence of detailed knowledge of the sampling distribution for e_0i,
other than that it has a finite variance, the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (i.e. maximally
noncommittal about the information we don't have).
We therefore adopt a Gaussian distribution for e_0i with a variance s².
Thus the combination of e_i + e_0i has a Gaussian distribution with
variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem),
which has the desirable effect of treating as noise anything in the data that can't be
explained by the model and known measurement errors, leading to the most conservative
estimates of the model parameters. Prior range for s = 0 – 0.5 times the data range.
Questions of interest

Based on our current state of information, which includes just the
above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two
competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:
M0 ≡ 'Model 0: no line exists'
M1 ≡ 'Model 1: line exists'

M1 has 3 unknown parameters, the line temperature T, ν₀, s_L, and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model
In the earlier spectral line problem, which had only
one unknown variable T, we derived the likelihood

p(D|M1, T, I) = (2π)^(−N/2) σ^(−N) exp[ −Σ_{i=1}^{N} (d_i − T f_i)² / (2σ²) ]

Our new likelihood for the more complicated model, with
unknown variables T, ν₀, s_L, s:

p(D|M1, T, ν₀, s_L, s, I) = (2π)^(−N/2) (s² + σ²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − T f_i(ν₀, s_L))² / (2 (s² + σ²)) ]
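The new likelihood codes up directly (Python sketch; a Gaussian form for f_i is assumed for illustration, and the synthetic spectrum below is not the talk's data):

```python
import math

def log_like(d, nu, T, nu0, s_L, s, sigma=1.0):
    """log p(D|M1, T, nu0, s_L, s, I): Gaussian errors with total
    variance sigma^2 + s^2. Gaussian line profile f_i is an assumption."""
    v = sigma * sigma + s * s
    N = len(d)
    f = [math.exp(-0.5 * ((x - nu0) / s_L) ** 2) for x in nu]
    return (-0.5 * N * math.log(2.0 * math.pi * v)
            - sum((di - T * fi) ** 2 for di, fi in zip(d, f)) / (2.0 * v))

# Noiseless synthetic spectrum generated with T = 2, nu0 = 5, s_L = 1;
# the true parameters should maximize the log-likelihood
nu = list(range(1, 11))
d = [2.0 * math.exp(-0.5 * (x - 5.0) ** 2) for x in nu]
best = log_like(d, nu, 2.0, 5.0, 1.0, 0.0)
```

This is the function the MCMC repeatedly evaluates (together with the prior) at each proposed parameter set.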
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter
α, for 4 simulated data sets of different size ranging from N = 5 to N = 80.
The N = 5 case has the broadest distribution and exhibits 4 maxima. The true value is indicated in the figure.

Asymptotic theory says that the maximum likelihood estimator becomes
more unbiased, more normally distributed, and of smaller variance as the
sample size becomes larger.
Integration not minimization
In least-squares analysis we minimize some statistic like χ².
In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability
density function (PDF) for a parameter, we need to integrate
the joint posterior over all the other parameters. For example, the marginal PDF for T:

p(T|D, M1, I) = ∫dν₀ ∫ds_L ∫ds  p(T, ν₀, s_L, s|D, M1, I)

where p(T, ν₀, s_L, s|D, M1, I) is the joint posterior probability
density function (PDF) for the parameters.

Shortly we will discuss an efficient method for integrating over a large parameter space,
called Markov chain Monte Carlo (MCMC).

Integration is more difficult than minimization. However, the Bayesian
solution provides the most accurate information about the parameter errors and correlations, without the need for any additional
calculations, i.e. Monte Carlo simulations.
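With MCMC draws in hand, this marginalization needs no explicit quadrature: the marginal for T is simply a normalized histogram of the T coordinates of the samples. A sketch with a few hypothetical draws (real chains would hold many thousands):

```python
import math
from collections import Counter

# Hypothetical MCMC draws of (T, nu0, s_L, s) -- illustrative values only
samples = [(1.9, 5.1, 1.0, 0.2), (2.1, 4.9, 1.1, 0.3),
           (2.0, 5.0, 0.9, 0.1), (2.2, 5.2, 1.0, 0.2)]

width = 0.2                          # histogram bin width for T
# Bin only the T coordinate; the other parameters are ignored, which
# is exactly the integration over nu0, s_L and s
counts = Counter(math.floor(t / width) for t, _, _, _ in samples)
marginal = {k * width: c / (len(samples) * width) for k, c in counts.items()}
```

Because the sample density is proportional to the joint posterior, dropping the other coordinates performs the triple integral automatically.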
Numerical tools
Numerical tools for Bayesian model fitting. Inputs: data D, model M, prior I.

Linear models (uniform priors) (chapter 10): the posterior has a single peak
(a multi-dimensional Gaussian). The parameters are given
by the normal equations of linear least-squares. No integration required;
the solution is very fast using linear algebra.

Nonlinear models, plus linear models with non-uniform priors: the posterior may have multiple peaks, and for some parameters analytic integration is sometimes possible. Options:
- Brute force integration.
- Asymptotic approximations (chapter 11): peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus Laplace approximations.
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC (chapter 12).
Chapters
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)
Resources and solutions

This title has free Mathematica-based support software available. It introduces statistical inference in the
larger context of scientific methods, and
includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions, to within
a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC
produces an equilibrium distribution of samples in parameter space,
such that the density of samples is proportional to the joint posterior.
It is very efficient because, unlike straight Monte Carlo integration, it
doesn't waste time exploring regions where the joint posterior is very
small.

The MCMC employs a Markov chain random walk, whereby the new
sample in parameter space, designated X_{t+1}, depends on the previous
sample X_t according to an entity called the transition probability or
kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time
independent.
Starting point: Metropolis-Hastings MCMC algorithm
p(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X₀, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y;
     otherwise set X_{t+1} = X_t.
   - Increment t.
(The second factor = 1 for a symmetric proposal distribution like a Gaussian.)

I use a Gaussian proposal distribution, i.e. a normal distribution N(X_t, σ).
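A minimal sketch of the algorithm above (Python; the standard normal target and the proposal σ are illustrative choices):

```python
import math
import random

random.seed(42)

def log_post(x):
    return -0.5 * x * x              # unnormalized target: standard normal

def metropolis_hastings(n, sigma=1.0, x0=0.0):
    """Gaussian proposal N(x_t, sigma) is symmetric, so the q-ratio is 1
    and the acceptance test reduces to U <= p(Y)/p(X_t)."""
    x = x0
    out = []
    for _ in range(n):
        y = x + random.gauss(0.0, sigma)
        # Compare log U with the log posterior ratio (numerically safer)
        if math.log(random.random()) < log_post(y) - log_post(x):
            x = y                     # accept: move to Y
        out.append(x)                 # reject: stay at X_t (still a sample)
    return out

samples = metropolis_hastings(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note that a rejected proposal still contributes the current state to the chain; that repetition is what makes the sample density track the posterior.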
Toy MCMC simulations: the efficiency depends on tuning the proposal
distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability
distribution consists of two 2-dimensional Gaussians,
indicated by the contours. The panels show acceptance rates of 95%, 63%, and 4%, together with the autocorrelation of each chain.
MCMC parameter samples for a Kepler model with 2 planets (periods P1 and P2), post burn-in; Gelman-Rubin statistic.

P.C. Gregory, 'A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487', MNRAS 374, 1321, 2007.
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1641
Simple Spectral Line Problem
Background (prior) informationTwo competing grand unification theories have been proposed each
championed by a Nobel prize winner in physics We want to compute
the relative probability of the truth of each theory based on our prior
information and some new data
Theory 1 is unique in that it predicts the existence of a new short-lived
baryon which is expected to form a short-lived atom and give rise to a
spectral line at an accurately calculable radio wavelength
Unfortunately it is not feasible to detect the line in the laboratory The
only possibility of obtaining a sufficient column density of the short-
lived atom is in interstellar space
outline
Data
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1741
To test this prediction a new spectrometer was mounted on the James
Clerk Maxwell telescope on Mauna Kea and the spectrum shown below
was obtained The spectrometer has 64 frequency channels
Data
All channels have Gaussian noise characterized by σ = 1 mK The noisein separate channels is independent
outline
Simple Spectral Line Problem
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools for Bayesian model fitting

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra. (Chapter 10)

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks, and for some parameters analytic integration is sometimes possible. Otherwise (Chapter 11):
- Brute-force integration: peak-finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus quadrature.
- Asymptotic approximations: Laplace approximations.
- Moderate dimensions: randomized or adaptive quadrature.
- High dimensions: MCMC. (Chapter 12)
Contents:
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1} | X_t). The transition kernel is assumed to be time independent.
Metropolis-Hastings MCMC algorithm

P(X | D, M, I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X_0, an initial location (starting point) in the parameter space. Set t = 0.
2. Repeat:
- Obtain a new sample Y from a proposal distribution $q(Y \mid X_t)$ that is easy to evaluate; $q(Y \mid X_t)$ can have almost any form.
- Sample a Uniform(0, 1) random variable U.
- If $U \le \dfrac{p(Y \mid D, I)}{p(X_t \mid D, I)} \times \dfrac{q(X_t \mid Y)}{q(Y \mid X_t)}$, then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The second factor, $q(X_t \mid Y) / q(Y \mid X_t)$, equals 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
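The steps above can be sketched in a few lines of Python (this is not the talk's Mathematica code; the standard-normal target, function names, and tuning numbers are illustrative assumptions). With a symmetric Gaussian proposal the q ratio drops out, so the acceptance test compares only posterior values:

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_steps, seed=None):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    log_post : function returning log p(X | D, M, I) up to a constant
    x0       : starting point X_0 in parameter space
    sigma    : Gaussian proposal standard deviation(s)
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)  # symmetric proposal: q ratio = 1
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:       # log of the Metropolis test
            x, lp = y, lp_y
            accepted += 1
        chain[t] = x                                 # on rejection, X_{t+1} = X_t
    return chain, accepted / n_steps

# Example: sample a 1-D standard normal target
chain, rate = metropolis(lambda x: -0.5 * float(x @ x),
                         x0=[0.0], sigma=2.4, n_steps=20_000, seed=1)
```

Note that a rejected proposal still advances the chain: the current state is recorded again, which is what makes the sample density proportional to the posterior.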
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Figure: three runs with different proposal σ's, giving acceptance rates of 95%, 63%, and 4%, together with the autocorrelation of each chain.]
MCMC parameter samples for a Kepler model with 2 planets.

[Figure: post burn-in samples of the orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS, 374, 1321, 2007.
Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

$p(X \mid D, M, \beta, I) = p(X \mid M, I) \, p(D \mid X, M, I)^{\beta}, \qquad 0 < \beta \le 1$

A typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
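The tempered updates and the adjacent-pair swap can be sketched as a toy Python simulation (a flat prior, a deliberately bimodal likelihood, and unit proposal steps are all illustrative assumptions; the β ladder is the one quoted above). The swap is accepted with probability min(1, exp[(β_k − β_{k+1})(log L_{k+1} − log L_k)]):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_like(x):
    # Toy bimodal likelihood: two narrow peaks at x = -3 and x = +3
    return np.logaddexp(-0.5 * ((x - 3.0) / 0.5) ** 2,
                        -0.5 * ((x + 3.0) / 0.5) ** 2)

betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]  # ladder from the talk
x = np.zeros(len(betas))        # current state of each chain (flat prior assumed)
ll = log_like(x)                # log-likelihood of each chain's state
target_chain = []               # we keep only the beta = 1 simulation

for step in range(30_000):
    # Within-chain Metropolis update; tempered target is L(x)^beta
    for k, beta in enumerate(betas):
        y = x[k] + rng.standard_normal()
        ll_y = log_like(y)
        if np.log(rng.uniform()) <= beta * (ll_y - ll[k]):
            x[k], ll[k] = y, ll_y
    # Propose swapping the states of a randomly chosen adjacent pair
    k = int(rng.integers(len(betas) - 1))
    log_r = (betas[k] - betas[k + 1]) * (ll[k + 1] - ll[k])
    if np.log(rng.uniform()) <= log_r:
        x[k], x[k + 1] = x[k + 1], x[k]
        ll[k], ll[k + 1] = ll[k + 1], ll[k]
    target_chain.append(x[-1])

target_chain = np.asarray(target_chain)
```

A lone β = 1 chain would almost never cross the likelihood barrier between the two peaks; the flat low-β chains wander across freely, and swaps carry those configurations up the ladder, so the β = 1 chain visits both modes.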
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC
Parallel tempering, simulated annealing, genetic algorithm, differential evolution: each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

[Diagram: hybrid parallel tempering MCMC nonlinear model fitting program.
Inputs: data D, model M, prior information I; target posterior p({X_α} | D, M, I); n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.
Adaptive two-stage control system: (1) automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation; (2) monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC; includes a gene crossover algorithm to breed higher probability chains.
Output at each iteration: control system diagnostics; {X_α} iterations; summary statistics; best-fit model and residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D | M, I) marginal likelihood for model comparison.]
Adaptive hybrid MCMC: 8 parallel tempering Metropolis chains (β = 1/T), with tempering levels β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09.

Output at each iteration, for each chain: parameters, logprior + β × loglike, logprior + loglike. Parallel tempering swap operations act between the chains.

MCMC adaptive control system:
- Two-stage proposal-σ control system: anneal the Gaussian proposal σ's, then refine and update them. The error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Monitor for parameters with peak probability. Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, then update the set and reset the burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
Go to Mathematica support material
Go to Mathematica version of MCMC
Calculation of p(D | M_0, I)

Model M_0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

$p(D \mid M_0, s, I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[ -\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)} \right]$

Model selection results: Bayes factor = 4.5 × 10^4.
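The null-model term of the Bayes factor can be sketched numerically in Python: evaluate log p(D | M_0, s, I) from the expression above, then marginalize s over a uniform prior by averaging on a grid. The simulated data, grid size, and the s_max value are illustrative assumptions (the talk's actual prior range for s is 0 to 0.5 times the data range):

```python
import numpy as np

def log_pD_M0_s(d, s, sigma=1.0):
    """log p(D | M0, s, I): noise-only model, so the residuals are d_i - 0."""
    var = sigma ** 2 + s ** 2
    return -0.5 * d.size * np.log(2.0 * np.pi * var) - 0.5 * np.sum(d ** 2) / var

def log_pD_M0(d, s_max, sigma=1.0, n_grid=2001):
    """Marginalize s over a uniform prior [0, s_max].

    p(D|M0,I) = (1/s_max) * integral of p(D|M0,s,I) ds, approximated here
    by the average of the integrand over an evenly spaced grid of s values.
    """
    s_grid = np.linspace(0.0, s_max, n_grid)
    logp = np.array([log_pD_M0_s(d, s, sigma) for s in s_grid])
    m = logp.max()                                   # subtract max for stability
    return m + np.log(np.mean(np.exp(logp - m)))

# Illustrative data: 64 channels of pure sigma = 1 mK noise
rng = np.random.default_rng(7)
d = rng.standard_normal(64)
logZ0 = log_pD_M0(d, s_max=2.0)
```

The same grid-average trick applied to p(D | M_1, ...) over its four parameters is exactly what becomes infeasible in high dimensions, which is the motivation for MCMC-based alternatives.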
Methanol emission in the Sgr A environment

[Table: fitted parameters, including velocity v (km s⁻¹), FWHM (km s⁻¹), T_J (K), column densities (N/Z)_A (cm⁻²), kinetic temperature T_K (K), and, for the unidentified line, ν_UL (MHz), FWHM_UL (km s⁻¹), and T_UL (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR).

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D | M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:

$W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left( \theta_j^i - \bar{\theta}_j \right)^2$

Between-chain variance:

$B = \frac{h}{m-1} \sum_{j=1}^{m} \left( \bar{\theta}_j - \bar{\bar{\theta}} \right)^2$

Estimated variance:

$\hat{V}(\theta) = \left( 1 - \frac{1}{h} \right) W + \frac{1}{h} B$

Gelman-Rubin statistic:

$\sqrt{\hat{V}(\theta) / W}$

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D. B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457-511.
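These formulas translate directly into a few lines of Python (a sketch; the simulated chains and the function name are illustrative, standing in for real MCMC output):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.

    chains : array of shape (m, h) -- the last h post burn-in iterations
             of each of m independent simulations.
    """
    chains = np.asarray(chains, dtype=float)
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = h * chain_means.var(ddof=1)            # between-chain variance
    V_hat = (1.0 - 1.0 / h) * W + B / h        # estimated variance
    return float(np.sqrt(V_hat / W))

rng = np.random.default_rng(42)
good = rng.standard_normal((4, 5000))          # 4 chains sampling the same target
bad = good.copy()
bad[0] += 3.0                                  # one chain stuck in a shifted mode
```

For the converged set the statistic is very close to 1.0; for the set with one shifted chain it rises well above the 1.05 rule of thumb, flagging non-convergence.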
To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea, and the spectrum shown below was obtained. The spectrometer has 64 frequency channels.

[Figure: the measured spectrum, signal strength (mK) vs. channel number.]

All channels have Gaussian noise characterized by σ = 1 mK. The noise in separate channels is independent.
Simple Spectral Line Problem

The predicted line shape has the form of a Gaussian line profile for a given ν_0 and s_L, where the signal strength is measured in temperature units of mK and T is the amplitude of the line. The frequency ν_i is in units of the spectrometer channel number, and the line center frequency is ν_0.

In this version of the problem, T, ν_0, and s_L are all unknowns, with prior limits:
T = 0.0 - 100.0
ν_0 = 1 - 44
s_L = 0.5 - 4.0
Extra noise term e_0i
We will represent the measured data by the equation

d_i = f_i + e_i + e_0i

where:
d_i = i-th measured data value
f_i = model prediction
e_i = component of d_i which arises from measurement errors
e_0i = any additional unknown measurement errors, plus any real signal in the data that cannot be explained by the model prediction f_i.

In the absence of detailed knowledge of the sampling distribution for e_0i, other than that it has a finite variance, the Maximum Entropy principle tells us that a Gaussian distribution is the most conservative choice (i.e., maximally non-committal about the information we don't have). We therefore adopt a Gaussian distribution for e_0i with a variance s². Thus the combination of e_i + e_0i has a Gaussian distribution with variance = σ_i² + s².

In Bayesian analysis we marginalize the unknown s (integrate it out of the problem), which has the desirable effect of treating as noise anything in the data that can't be explained by the model and the known measurement errors, leading to the most conservative estimates of the model parameters. Prior range for s: 0 - 0.5 times the data range.
Questions of interest
Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:
M_0 ≡ "Model 0: no line exists"
M_1 ≡ "Model 1: line exists"

M_1 has 3 unknown parameters (the line temperature T, ν_0, s_L) and one nuisance parameter s.
M_0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model
In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

$p(D \mid M_1, T, I) = (2\pi)^{-N/2} \sigma^{-N} \exp\left[ -\sum_{i=1}^{N} \frac{(d_i - T f_i)^2}{2\sigma^2} \right]$

Our new likelihood, for the more complicated model with unknown variables T, ν_0, s_L, and s, is

$p(D \mid M_1, T, \nu_0, s_L, s, I) = (2\pi)^{-N/2} (\sigma^2 + s^2)^{-N/2} \exp\left[ -\sum_{i=1}^{N} \frac{(d_i - T f_i(\nu_0, s_L))^2}{2(\sigma^2 + s^2)} \right]$
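The second likelihood can be written down directly as code (a Python sketch; the Gaussian form assumed for the line profile f_i and the demonstration values are illustrative):

```python
import numpy as np

def log_like(d, nu, T, nu0, sL, s, sigma=1.0):
    """log p(D | M1, T, nu0, sL, s, I) for the spectral line model.

    d     : measured spectrum (mK), one value per channel
    nu    : channel numbers
    T     : line amplitude (mK);  nu0 : line centre (channel units)
    sL    : line width;  s : extra-noise parameter;  sigma : known noise (1 mK)
    """
    f = np.exp(-((nu - nu0) ** 2) / (2.0 * sL ** 2))   # assumed Gaussian profile f_i
    var = sigma ** 2 + s ** 2                           # combined variance
    resid = d - T * f
    return (-0.5 * d.size * np.log(2.0 * np.pi * var)
            - 0.5 * np.sum(resid ** 2) / var)

# Noise-free check: data generated exactly by the model
nu = np.arange(1, 65, dtype=float)
d = 3.0 * np.exp(-((nu - 20.0) ** 2) / (2.0 * 2.0 ** 2))
```

At the generating parameters the residuals vanish, so with s = 0 and σ = 1 the log-likelihood reduces to −(N/2) ln(2π); any other amplitude scores strictly lower.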
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1841
Simple Spectral Line Problem
The predicted line shape has the form
where the signal strength is measured in temperature units of mK and T is the amplitude of the line The frequency ν i is in units of the
spectrometer channel number and the line center frequency is ν 0
Line profile
for a given
ν 0 s L
In this version of the problemT ν 0 s L are all unknowns with
prior limits
T = 00 - 1000
ν 0 = 1 ndash 44
s L = 05 ndash 40
Extra noise term e0i
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters, the line temperature T, ν0, s_L, and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model
In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M1,T,I) = (2π)^(−N/2) σ^(−N) exp[ −Σ_{i=1}^{N} (d_i − T f_i)² / (2σ²) ]

Our new likelihood for the more complicated model, with unknown variables T, ν0, s_L, s, is

p(D|M1,T,ν0,s_L,s,I) = (2π)^(−N/2) (s_i² + s²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − T f_i(ν0, s_L))² / (2(s_i² + s²)) ]
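The talk's code is in Mathematica; as a minimal illustrative sketch (not the author's implementation), the likelihood above can be written in Python. The Gaussian line shape for f_i(ν0, s_L) is an assumption here, since the profile's exact form is defined earlier in the talk:

```python
import numpy as np

def log_likelihood(d, freq, T, nu0, sL, s, sigma_i):
    """Log of the spectral-line likelihood: Gaussian errors with combined
    variance sigma_i**2 + s**2 (known noise plus extra noise term s)."""
    f = np.exp(-((freq - nu0) ** 2) / (2.0 * sL ** 2))  # assumed Gaussian line profile
    var = sigma_i ** 2 + s ** 2
    resid = d - T * f
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + resid ** 2 / var)
```

Working in log space avoids underflow when N is large; the s = 0 limit recovers the earlier single-variance likelihood.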
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size, ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima (true value indicated).

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration, not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for a parameter of interest (here T), we need to integrate the joint posterior over all the other parameters:

p(T|D,M1,I) = ∫dν0 ∫ds_L ∫ds p(T,ν0,s_L,s|D,M1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters.

Shortly we discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC). Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations, without the need for any additional calculations, i.e., Monte Carlo simulations.
Numerical tools
Linear models (uniform priors), chapter 10:
- The posterior has a single peak (a multi-dimensional Gaussian).
- The parameters are given by the normal equations of linear least-squares.
- No integration is required; the solution is very fast using linear algebra.

Nonlinear models, and linear models with non-uniform priors (chapters 11 and 12):
- The posterior may have multiple peaks; for some parameters analytic integration is sometimes possible.
- Brute force integration.
- Asymptotic approx's: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus Laplace approx's.
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC for Bayesian model fitting.

Chapters
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior. It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.

Starting point: Metropolis-Hastings MCMC algorithm
p(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters)

1. Choose X0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|Xt) that is easy to evaluate; q(Y|Xt) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(Xt|D,I)] × [q(Xt|Y) / q(Y|Xt)], then set X_{t+1} = Y; otherwise set X_{t+1} = Xt.
   - Increment t.

The factor q(Xt|Y)/q(Y|Xt) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(Xt, σ).
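The talk's implementation is in Mathematica; as a minimal sketch of the steps above (a one-parameter target with a symmetric Gaussian proposal, so the Hastings factor is 1), the algorithm can be written in Python:

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_steps, seed=42):
    """Metropolis algorithm with a symmetric Gaussian proposal N(x_t, sigma).
    For a symmetric proposal the Hastings factor q(Xt|Y)/q(Y|Xt) = 1."""
    rng = np.random.default_rng(seed)
    x, logp = x0, log_post(x0)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = rng.normal(x, sigma)                  # propose Y from q(Y|Xt)
        logp_y = log_post(y)
        # accept if U <= p(Y|D,I)/p(Xt|D,I), with U ~ Uniform(0,1)
        if rng.uniform() <= np.exp(min(0.0, logp_y - logp)):
            x, logp = y, logp_y
        chain[t] = x
    return chain

# Sample a Gaussian target centered at 5 with unit variance
chain = metropolis(lambda x: -0.5 * (x - 5.0) ** 2, x0=0.0, sigma=1.0, n_steps=20000)
samples = chain[2000:]   # discard burn-in
```

Working with log posteriors avoids numerical underflow; the density of the post burn-in samples is proportional to the target, so parameter estimates come directly from sample statistics.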
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ, which can be a very difficult challenge for many parameters.

[Figure: the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by contours. Three runs are shown, with acceptance rates of 95%, 63%, and 4%, together with the autocorrelation of each chain.]
MCMC parameter samples for a Kepler model with 2 planets

[Figure: post burn-in samples of the orbital periods P1 and P2, with the Gelman-Rubin statistic.]

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,   0 < β ≤ 1

A typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0. β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
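The slide does not give the swap acceptance rule explicitly; the standard parallel-tempering form (an assumption here, not the author's code) accepts a state swap between adjacent chains with probability min(1, exp[(β_hi − β_lo)(logL_lo − logL_hi)]), which can be sketched as:

```python
import math, random

def swap_accept(beta_lo, beta_hi, loglike_lo, loglike_hi, rng=None):
    """Metropolis test for swapping the states of two adjacent tempered chains.
    Uses the standard parallel-tempering acceptance probability
    min(1, exp[(beta_hi - beta_lo) * (loglike_lo - loglike_hi)])."""
    rng = rng or random.Random(1)  # fixed seed for a reproducible sketch
    log_ratio = (beta_hi - beta_lo) * (loglike_lo - loglike_hi)
    return math.log(rng.random()) < min(0.0, log_ratio)
```

A swap that hands the colder (higher-β) chain a higher-likelihood state is always accepted, which is how good configurations found by the flatter chains migrate up the ladder.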
MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

MCMC details
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Adaptive hybrid MCMC: output at each iteration

[Schematic: 8 parallel tempering Metropolis chains with β = 1/T taking the values 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. At each iteration every chain outputs its parameters, logprior + β × loglike, and logprior + loglike; parallel tempering swap operations connect adjacent chains. A two-stage proposal-σ control system first anneals the Gaussian proposal σ's, then refines and updates them, driven by the error signal = (actual joint acceptance rate − 0.25); this effectively defines the burn-in interval. The chains are monitored for parameters with peak probability. Genetic algorithm: every 10th iteration, a gene crossover operation is performed to breed a larger (logprior + loglike) parameter set. Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, the best set is updated and the burn-in is reset.]
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(−N/2) (s_i² + s²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2(s_i² + s²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴.

Methanol emission in the Sgr A environment
[Table: fitted parameters for each band, including v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, and s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).]

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band
Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation. For model selection, however, we need to determine the proportionality constant in order to evaluate the marginal likelihood p(D|Mi,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid). One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post burn-in iterations from each simulation, and let θ̄_j denote the mean of chain j and θ̄ the mean of the chain means.

Mean within-chain variance: W = [1 / (m(h − 1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance: B = [h / (m − 1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance: V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457−511.
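The within-chain, between-chain, and pooled variance formulas above translate directly into a short function; this is an illustrative Python sketch, not the talk's Mathematica code:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    `chains` has shape (m, h): m independent chains, h post burn-in samples each."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    grand_mean = chain_means.mean()
    W = chains.var(axis=1, ddof=1).mean()                       # mean within-chain variance
    B = h / (m - 1) * np.sum((chain_means - grand_mean) ** 2)   # between-chain variance
    V_hat = (1.0 - 1.0 / h) * W + B / h                         # estimated variance
    return np.sqrt(V_hat / W)
```

Well-mixed chains drawn from the same distribution give a value near 1.0; chains stuck in different regions inflate B and push the statistic well above the < 1.05 convergence guideline.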
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 1941
Extra noise term e 0i
We will represent the measured data by the equation
d i = f i + ei + e0 i
d i = ith measured data valuef i = model prediction
ei = component of d i which arises from measurement errors
e0 i = any additional unknown measurement errors plus any real signal
in the data that cannot be explained by the model prediction f i
In the absence of detailed knowledge of the sampling distribution for e0 i
other than that it has a finite variance the Maximum Entropy principle tells us
that a Gaussian distribution is the most conservative choice (ie maximallynon committal about the information we dont have)
We therefore adopt a Gaussian distribution for e0 i with a variance s2
Thus the combination of ei + e
0 i has a Gaussian distribution with
variance = si 2
+ s2
In Bayesian analysis we marginalize the unknown s (integrate it out of the problem)
which has the desirable effect of treating as noise anything in the data that can t be
explained by the model and known measurement errors leading to most conservative
estimates of the model parameters Prior range for s = 0 - 05 times data range
outline
Questions of interest
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest
Based on our current state of information which includes just the
above prior information and the measured spectrum
1) what do we conclude about the relative probabilities of the two
competing theories
and 2) what is the posterior PDF for the model parameters and s
Hypothesis space of interest for model selection part
M0 equiv ldquoModel 0 no line existsrdquo
M1 equiv ldquoModel 1 line existsrdquo
M1 has 3 unknown parameters the line temperature T ν 0 s Land one nuisance parameter s
M0 has no unknown parameters and one nuisance parameter s
Likelihood for the spectral line modeloutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2041
Questions of interest

Based on our current state of information, which includes just the above prior information and the measured spectrum:

1) What do we conclude about the relative probabilities of the two competing theories?
2) What is the posterior PDF for the model parameters and s?

Hypothesis space of interest for the model selection part:

M0 ≡ "Model 0: no line exists"
M1 ≡ "Model 1: line exists"

M1 has 3 unknown parameters (the line temperature T, ν0, and sL) and one nuisance parameter s.
M0 has no unknown parameters and one nuisance parameter s.
Likelihood for the spectral line model

In the earlier spectral line problem, which had only one unknown variable T, we derived the likelihood

p(D|M1,T,I) = (2π)^(-N/2) σ^(-N) Exp[ -Σ_{i=1}^{N} (d_i - T f_i)² / (2σ²) ]

Our new likelihood, for the more complicated model with unknown variables T, ν0, sL, s, is

p(D|M1,T,ν0,sL,s,I) = (2π)^(-N/2) (s² + σ²)^(-N/2) Exp[ -Σ_{i=1}^{N} (d_i - T f_i(ν0,sL))² / (2(s² + σ²)) ]
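The likelihood above translates directly into code. A minimal sketch in Python rather than the talk's Mathematica; the function name, and the assumption that the line shape f_i(ν0, sL) is supplied as a precomputed array, are mine:

```python
import numpy as np

def log_likelihood(d, f, T, s, sigma):
    """log p(D|M1,T,nu0,sL,s,I), with the line shape f_i(nu0, sL)
    assumed precomputed as the array f.

    d     : measured spectrum (length N)
    f     : model line shape per channel (length N)
    T     : line temperature (amplitude) parameter
    s     : extra-noise nuisance parameter
    sigma : known measurement error per channel
    """
    var = s**2 + sigma**2                 # combined variance (s^2 + sigma^2)
    r = d - T * f                         # residuals d_i - T f_i
    return -0.5 * len(d) * np.log(2 * np.pi * var) - np.sum(r**2) / (2 * var)
```

Working in log space avoids underflow when N is large, which matters once this likelihood is fed to an MCMC sampler.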
Simple nonlinear model with a single parameter α
The Bayesian posterior density for a nonlinear model with a single parameter α, for 4 simulated data sets of different size ranging from N = 5 to N = 80. The N = 5 case has the broadest distribution and exhibits 4 maxima. (Figure: the four posterior densities, with the true value of α marked.)

Asymptotic theory says that the maximum likelihood estimator becomes more unbiased, more normally distributed, and of smaller variance as the sample size becomes larger.

Integration not minimization
In least-squares analysis we minimize some statistic like χ². In a Bayesian analysis we need to integrate.

Parameter estimation: to find the marginal posterior probability density function (PDF) for the orbital period P, we need to integrate the joint posterior over all the other parameters. For the spectral line model:

p(T|D,M1,I) = ∫ dν0 ∫ dsL ∫ ds  p(T,ν0,sL,s|D,M1,I)

The left-hand side is the marginal PDF for T; the integrand is the joint posterior probability density function (PDF) for the parameters, conditioned on the data, model, and prior.

Shortly we will discuss an efficient method for integrating over a large parameter space, called Markov chain Monte Carlo (MCMC).

Integration is more difficult than minimization. However, the Bayesian solution provides the most accurate information about the parameter errors and correlations without the need for any additional calculations, i.e., Monte Carlo simulations.
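Once MCMC samples of the joint posterior are in hand, the triple integral above comes for free: the marginal for any one parameter is obtained by simply ignoring the other columns of the sample array. A sketch with synthetic joint samples (the numbers are illustrative only, not from the talk's data):

```python
import numpy as np

# Synthetic joint posterior samples with columns (T, nu0, sL, s).
# With real MCMC output, marginalizing T over the other three parameters
# is just a matter of ignoring their columns.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[1.0, 0.0, 2.0, 0.5], scale=0.1, size=(5000, 4))

T = samples[:, 0]                                    # marginal samples for T
pdf, edges = np.histogram(T, bins=40, density=True)  # normalized marginal PDF
T_mean, T_std = T.mean(), T.std()                    # summary statistics
```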
Numerical tools
(Schematic: numerical tools for Bayesian model fitting, given data D, model M, prior information I.)

Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration required; the solution is very fast using linear algebra. (Chapter 10)

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks, and brute force integration is required. For some parameters, analytic integration is sometimes possible. Otherwise:
- Asymptotic approximations: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) plus Laplace approximations. (Chapter 11)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature.
- High dimensions: MCMC. (Chapter 12)
Chapters

1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference
4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic?
7 Frequentist hypothesis testing
8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: the Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters)

1. Choose X0, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y|Xt) that is easy to evaluate; q(Y|Xt) can have almost any form.
   - Sample a Uniform(0,1) random variable U.
   - If U ≤ [p(Y|D,I) / p(Xt|D,I)] × [q(Xt|Y) / q(Y|Xt)], set Xt+1 = Y; otherwise set Xt+1 = Xt.
   - Increment t.

The factor q(Xt|Y)/q(Y|Xt) = 1 for a symmetric proposal distribution like a Gaussian. I use a Gaussian proposal distribution, i.e., a normal distribution N(Xt, σ).
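The algorithm above, with the symmetric Gaussian proposal (so the q ratio drops out), can be sketched in a few lines of Python; the function name and the toy target are mine:

```python
import numpy as np

def metropolis(log_post, x0, sigma, n_steps, rng):
    """Random-walk Metropolis: symmetric Gaussian proposal N(X_t, sigma),
    so the q(Xt|Y)/q(Y|Xt) factor in the acceptance ratio equals 1."""
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal(x.size)  # propose Y ~ q(Y|Xt)
        lp_y = log_post(y)
        if np.log(rng.uniform()) <= lp_y - lp:       # U <= p(Y|D,I)/p(Xt|D,I)
            x, lp = y, lp_y                          # accept: X_{t+1} = Y
        chain[t] = x                                 # otherwise X_{t+1} = X_t
    return chain

# Toy run on a 1-D standard normal "posterior"
rng = np.random.default_rng(0)
chain = metropolis(lambda x: -0.5 * np.sum(x**2), x0=[3.0],
                   sigma=1.0, n_steps=20000, rng=rng)
```

Working with log posteriors, and comparing log U against the log acceptance ratio, keeps the arithmetic stable for the sharply peaked posteriors that arise with large N.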
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution's σ's, which can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. (Figure: three runs with different proposal σ's, giving acceptance rates of 95%, 63%, and 4%, together with the autocorrelation of the resulting chains.)
MCMC parameter samples for a Kepler model with 2 planets (post burn-in; Gelman-Rubin statistic). (Figure: parameter samples for the periods P1 and P2.)

P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,  0 < β ≤ 1

Typical set of β values = {0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0}

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
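The swap step described above can be sketched as follows: accepting with probability min(1, exp((β_i − β_j)(log L_j − log L_i))) leaves each tempered target invariant. A hypothetical helper, not the talk's Mathematica implementation:

```python
import numpy as np

def pt_swap(states, log_likes, betas, rng):
    """Propose one swap between a random adjacent pair of tempered chains.

    states    : list of parameter vectors, one per chain (assumed layout)
    log_likes : list of log p(D|X,M,I) values for those states
    betas     : tempering levels, ordered to match states

    The acceptance ratio for exchanging chains i and i+1 reduces to
    exp((beta_i - beta_j) * (logL_j - logL_i)), which leaves each chain's
    target p(X|M,I) p(D|X,M,I)^beta invariant.
    """
    i = rng.integers(len(betas) - 1)          # pick an adjacent pair (i, i+1)
    j = i + 1
    log_r = (betas[i] - betas[j]) * (log_likes[j] - log_likes[i])
    if np.log(rng.uniform()) <= log_r:        # Metropolis accept/reject
        states[i], states[j] = states[j], states[i]
        log_likes[i], log_likes[j] = log_likes[j], log_likes[i]
    return states, log_likes
```

Only the likelihoods enter the swap ratio: the prior factor p(X|M,I) is the same in both chains' targets and cancels.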
MCMC Technical Difficulties
1. Deciding on the burn-in period.

2. Choosing a good value for the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's. This can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters.
   Answer: transform the parameter set, or use differential MCMC.

4. Deciding how many iterations are sufficient.
   Answer: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extra-solar planet applications:
- precision radial velocity data (4 new planets published to date)
- pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8 core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

(Inputs: data, model, prior information.)

MCMC details
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; target posterior p({X_α}|D,M,I); n = no. of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels.

Adaptive two-stage control system:
1) Automates selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.

Adaptive hybrid MCMC: output at each iteration
8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). Output at each iteration, for each chain: parameters, logprior + β × loglike, logprior + loglike. Parallel tempering swap operations connect adjacent chains, and highly correlated parameters are monitored.

MCMC adaptive control system:
- Monitor for parameters with peak probability; anneal the Gaussian proposal σ's, then refine & update the Gaussian proposal σ's.
- Two-stage proposal σ control system: error signal = (actual joint acceptance rate − 0.25). Effectively defines the burn-in interval.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
- Peak parameter set: if (logprior + loglike) exceeds the previous best by a threshold, then update and reset burn-in.
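The control loop above specifies only the error signal; a minimal sketch of one proportional-control step on a proposal width (the multiplicative update and the gain are assumptions):

```python
import numpy as np

def tune_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    """One proportional-control step on a Gaussian proposal width.

    error signal = (actual joint acceptance rate - 0.25), as on the slide;
    too many acceptances -> steps too timid -> widen sigma, and vice versa.
    """
    return sigma * np.exp(gain * (accept_rate - target))
```

A multiplicative update keeps σ positive and lets repeated steps move it over orders of magnitude if needed.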
Mathematica MCMC demonstration:
- Go to Mathematica support material
- Go to Mathematica version of MCMC
- Quasi-Monte Carlo
Calculation of p(D|M0,I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M0,s,I) = (2π)^(-N/2) (s² + σ²)^(-N/2) Exp[ -Σ_{i=1}^{N} (d_i - 0)² / (2(s² + σ²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴
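To go from p(D|M0,s,I) above to p(D|M0,I), the nuisance parameter s must be marginalized: p(D|M0,I) = ∫ p(s|I) p(D|M0,s,I) ds. A numerical sketch, assuming a uniform prior for s on [0, s_max] (the prior choice and the function names are assumptions, not the talk's):

```python
import numpy as np

def log_like_M0(d, s, sigma):
    """log p(D|M0,s,I): pure-noise model, predicted spectrum = 0."""
    var = s**2 + sigma**2
    return -0.5 * len(d) * np.log(2 * np.pi * var) - np.sum(d**2) / (2 * var)

def marginal_like_M0(d, sigma, s_max=5.0, n=2000):
    """p(D|M0,I) = integral of p(s|I) p(D|M0,s,I) ds, assuming a uniform
    prior p(s|I) = 1/s_max on [0, s_max]; trapezoid rule."""
    s = np.linspace(0.0, s_max, n)
    vals = np.exp(np.array([log_like_M0(d, si, sigma) for si in s])) / s_max
    return 0.5 * np.sum((vals[1:] + vals[:-1]) * np.diff(s))
```

The Bayes factor then follows as the ratio of the two marginal likelihoods, p(D|M1,I)/p(D|M0,I); a one-dimensional quadrature like this is only feasible because M0 has a single nuisance parameter.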
Methanol emission in the Sgr A environment

(Table: fitted line parameters — v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K). ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).)

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + the unidentified line in the 96 GHz band.

Conclusions
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862–1960), author and playwright
Gelman-Rubin statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation.

Mean within-chain variance:  W = [1 / (m(h−1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance:  B = [h / (m−1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance:  V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = sqrt( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457–511.
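The recipe above is straightforward to compute. A sketch for a single parameter, with the m chains stored as an m × h array (the function name is mine):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.

    chains : array of shape (m, h) holding the last h post burn-in
             iterations of m independent simulations.
    """
    m, h = chains.shape
    means = chains.mean(axis=1)                               # theta-bar_j
    W = np.sum((chains - means[:, None])**2) / (m * (h - 1))  # within-chain
    B = h * np.sum((means - means.mean())**2) / (m - 1)       # between-chain
    V_hat = (1.0 - 1.0 / h) * W + B / h                       # estimated variance
    return np.sqrt(V_hat / W)
```

When the chains have not mixed, the between-chain term B inflates V̂ relative to W and the statistic rises well above 1; in practice it is computed for every parameter and the run is continued until all values fall below the chosen threshold.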
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2141
Likelihood for the spectral line model
In the earlier spectral line problem which had only
one unknown variable T we derived the likelihood
Our new likelihood for the more complicated model withunknown variables T u0 sL s
H D M 1 T I L = H2 p L- N
2 σ minusN
ExpC- sbquoi = 1N
Hd i - T f i
L2 s G
p H D M 1 T u0 sL s I L = H2 p L- N 2 Js2
+ s2 N-N
2 ExpC- sbquoi = 1
N Hd i - T f i Hu 0 s LLL2 Is 2
+ s2 MG
outline
Simple nonlinear model with a single parameter α
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2241
p g p
The Bayesian posterior density for a nonlinear model with single parameter
α for 4 simulated data sets of different size ranging from N = 5 to N = 80
The N = 5 case has the broadest distribution and exhibits 4 maxima
True value
Asymptotic theory says that the maximum likelihood estimator becomesmore unbiased more normally distributed and of smaller variance as the
sample size becomes largerSimulated annealing
Integration not minimizationoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
Linear models (uniform priors): the posterior has a single peak (a multi-dimensional Gaussian). The parameters are given by the normal equations of linear least-squares. No integration is required; the solution is very fast using linear algebra. (Chapter 10)

Nonlinear models, and linear models with non-uniform priors: the posterior may have multiple peaks. For some parameters, analytic integration is sometimes possible. Otherwise:
- Brute force integration
- Asymptotic approximations: peak finding algorithms ((1) Levenberg-Marquardt, (2) simulated annealing, (3) genetic algorithm) followed by Laplace approximations (Chapter 11)
- Moderate dimensions: quadrature, randomized quadrature, adaptive quadrature
- High dimensions: MCMC (Chapter 12)
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)

Resources and solutions: this title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions, to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_t+1, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_t+1 | X_t). The transition kernel is assumed to be time independent.
Metropolis-Hastings MCMC algorithm
P(X | D, M, I) = target posterior probability distribution, where X represents the set of model parameters.

1. Choose X₀, an initial location in the parameter space. Set t = 0.
2. Repeat:
   - Obtain a new sample Y from a proposal distribution q(Y | X_t) that is easy to evaluate; q(Y | X_t) can have almost any form.
   - Sample a Uniform(0, 1) random variable U.
   - If U ≤ [p(Y | D, I) / p(X_t | D, I)] × [q(X_t | Y) / q(Y | X_t)], then set X_t+1 = Y; otherwise set X_t+1 = X_t.
   - Increment t.

The factor q(X_t | Y) / q(Y | X_t) = 1 for a symmetric proposal distribution like a Gaussian.

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
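A minimal runnable sketch of the algorithm above, for the symmetric-Gaussian-proposal case where the q-ratio equals 1; the 1-D Gaussian target and all tuning numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(x):
    # Illustrative target: unnormalized log of a standard 1-D Gaussian posterior
    return -0.5 * x**2

def metropolis(log_post, x0, sigma, n_steps):
    """Metropolis sampler with symmetric proposal N(X_t, sigma), so the factor
    q(X_t|Y)/q(Y|X_t) = 1 and the test reduces to U <= p(Y|D,I)/p(X_t|D,I)."""
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        y = x + sigma * rng.standard_normal()        # propose Y ~ N(X_t, sigma)
        if np.log(rng.uniform()) <= log_post(y) - log_post(x):
            x = y                                    # accept: X_{t+1} = Y
        chain[t] = x                                 # else keep X_{t+1} = X_t
    return chain

chain = metropolis(log_post, x0=5.0, sigma=1.0, n_steps=20_000)
post = chain[2_000:]   # discard the burn-in period
```

Working with log-probabilities (and comparing log U to the log ratio) avoids numerical underflow, which matters once real likelihoods with many data points are involved.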
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution σ's
In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours. The panels show chains with acceptance rates of 95%, 63%, and 4%, together with their autocorrelations. Tuning the proposal σ's can be a very difficult challenge for many parameters.
MCMC parameter samples for a Kepler model with 2 planets

Post burn-in samples of the two periods P1 and P2, with the Gelman-Rubin statistic. P. C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

Parallel tempering MCMC
The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal, with widely separated peaks. It can fail to fully explore all peaks which contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X | D, M, β, I) = p(X | M, I) p(D | X, M, I)^β,   0 < β ≤ 1

A typical set of β values: 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows for an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
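The tempering scheme above can be sketched in a few lines of Python. Everything here (the bimodal toy likelihood, the broad prior, step size, and chain length) is an illustrative assumption, not the talk's Mathematica implementation; only the β ladder is taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
betas = [0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0]  # ladder from the slide

def log_like(x):
    # Toy bimodal likelihood: two well-separated 1-D Gaussians (assumption)
    return np.logaddexp(-0.5 * (x + 5.0)**2, -0.5 * (x - 5.0)**2)

def log_prior(x):
    return -0.5 * (x / 20.0)**2        # broad Gaussian prior (assumption)

def pt_sweep(xs, sigma=1.0):
    """One Metropolis update per chain (chain k targets prior * like^beta_k),
    then one proposed swap between a randomly chosen adjacent pair."""
    for k, b in enumerate(betas):
        y = xs[k] + sigma * rng.standard_normal()
        dlog = (log_prior(y) + b * log_like(y)) \
             - (log_prior(xs[k]) + b * log_like(xs[k]))
        if np.log(rng.uniform()) <= dlog:
            xs[k] = y
    j = rng.integers(len(betas) - 1)   # adjacent pair (j, j+1)
    dswap = (betas[j + 1] - betas[j]) * (log_like(xs[j]) - log_like(xs[j + 1]))
    if np.log(rng.uniform()) <= dswap:
        xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

xs = [0.0] * len(betas)
samples = []
for t in range(30_000):
    xs = pt_sweep(xs)
    if t >= 5_000:
        samples.append(xs[-1])         # final results: the beta = 1 chain only
samples = np.array(samples)
```

The swap acceptance follows from detailed balance on the pair of chains: the prior terms cancel, leaving only the likelihood difference weighted by Δβ. A single Metropolis chain at β = 1 would typically get stuck in one of the two modes; the ladder lets states found in the flat low-β chains migrate up to β = 1.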
MCMC Technical Difficulties
1. Deciding on the burn-in period.
2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Submillimeter radio spectroscopy of galactic center methanol lines.
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Hybrid parallel tempering MCMC nonlinear model fitting program. Inputs: data D, model M, prior information I; n = number of iterations; {X_α}_init = start parameters; {σ_α}_init = start proposal σ's; {β} = tempering levels. Target posterior: p({X_α} | D, M, I).

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's, using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Output at each iteration: control system diagnostics, {X_α} iterations, summary statistics, best fit model and residuals, {X_α} marginals, {X_α} 68.3% credible regions, and the marginal likelihood p(D | M, I) for model comparison.

Adaptive Hybrid MCMC
8 parallel tempering Metropolis chains, with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). Output at each iteration, for every chain: parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange states between adjacent chains.

MCMC adaptive control system:
- Two-stage proposal σ control system: anneal the Gaussian proposal σ's, then refine and update them, driven by the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Monitor for parameters with peak probability: if (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
- Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.
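The acceptance-rate error signal suggests a simple proportional controller for the proposal widths. This is only a sketch of the idea; the multiplicative form and the gain are my assumptions, not the talk's two-stage control system:

```python
import numpy as np

def tune_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    """Multiplicatively adjust a proposal width from the error signal
    (actual acceptance rate - target). Acceptance too high means the
    steps are too small, so grow sigma; too low, shrink sigma."""
    return sigma * np.exp(gain * (accept_rate - target))

# Usage: run in rounds — measure the acceptance rate over a block of
# iterations, update sigma, and repeat until the rate settles near target.
```

The exponential form keeps σ positive and gives symmetric up/down corrections on a log scale; iterations used while σ is still adapting belong to the burn-in and should be discarded.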
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D | M₀, I)
Model M₀ assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D | M₀, s, I) = (2π)^(−N/2) (s² + σ²)^(−N/2) exp[ − Σ_{i=1}^{N} (d_i − 0)² / (2 (s² + σ²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴
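Because M₀ has no parameters to integrate over, the expression above can be evaluated directly. A sketch in log form; the simulated data and noise values are assumptions:

```python
import numpy as np

def log_p_D_M0(d, sigma, s):
    """log p(D | M0, s, I) for the noise-only model: N independent
    zero-mean Gaussian samples with total variance sigma^2 + s^2
    (measurement noise sigma plus an extra noise term s)."""
    N = len(d)
    var = sigma**2 + s**2
    return -0.5 * N * np.log(2 * np.pi * var) - np.sum(d**2) / (2 * var)

rng = np.random.default_rng(0)
d = rng.normal(0.0, 1.0, size=50)      # simulated pure-noise "spectrum"
lp = log_p_D_M0(d, sigma=1.0, s=0.0)
```

The Bayes factor then compares this number against the marginal likelihood of the spectral-line model, which does require integration over its parameters.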
Methanol emission in the Sgr A environment
Optically thin fit to 3 bands, plus an unidentified line in the 96 GHz band. The fitted parameter set is {v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K)}, where ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, in order to evaluate the marginal likelihood p(D | M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m; we need two to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
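The parallel-tempering route to the marginal likelihood mentioned above is thermodynamic integration: ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the average log-likelihood over the β chain's samples. The sketch below checks the identity on a conjugate toy model where ⟨ln L⟩_β is available in closed form, so no MCMC is needed; the model and all numbers are my assumptions:

```python
import numpy as np

# Toy model (assumption): prior mu ~ N(0, 1), one datum d ~ N(mu, 1).
# At inverse temperature beta the chain samples prior * likelihood^beta,
# which is N(beta*d/(1+beta), 1/(1+beta)); the expected log-likelihood
# under that distribution then has the closed form used below.
d = 1.3
betas = np.linspace(0.0, 1.0, 201)
mean_loglike = -0.5 * np.log(2 * np.pi) \
               - 0.5 * ((d / (1 + betas))**2 + 1 / (1 + betas))

# Trapezoid rule over the beta ladder. In a real run, mean_loglike[k]
# would be the sample average of loglike within the beta_k chain.
h = betas[1] - betas[0]
ln_Z = h * (0.5 * mean_loglike[0] + mean_loglike[1:-1].sum()
            + 0.5 * mean_loglike[-1])

ln_Z_exact = -0.5 * np.log(4 * np.pi) - d**2 / 4  # ln N(d; 0, 2), true marginal
```

The cost issue in the conclusion is visible here: every β level needs its own converged chain, and the number of levels required for an accurate quadrature grows with the dimensionality m.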
For a copy of this talk, please Google "Phil Gregory".
The rewards of data analysis

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic
Let θ represent one of the model parameters. Let θ_j^i represent the ith iteration of the jth of m independent simulations, and extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:
W = [1 / (m (h − 1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance:
B = [h / (m − 1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance:
V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = √( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D. B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
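The formulas above transcribe directly into code; the simulated "chains" here are assumptions used only to exercise the statistic:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter.
    `chains` has shape (m, h): m independent chains, h post burn-in draws each."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)
    # Mean within-chain variance W
    W = np.sum((chains - chain_means[:, None])**2) / (m * (h - 1))
    # Between-chain variance B = h/(m-1) * sum of squared mean deviations
    B = h * np.var(chain_means, ddof=1)
    V_hat = (1 - 1 / h) * W + B / h
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(3)
good = rng.normal(0.0, 1.0, size=(4, 5000))        # four well-mixed chains
bad = good + np.array([[0.0], [2.0], [4.0], [6.0]])  # chains stuck at offsets
```

For the well-mixed chains the statistic is close to 1.0; for chains stuck at different levels the between-chain variance B inflates V̂(θ) and the statistic rises well above the 1.05 convergence threshold.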
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2341
g
In Least-squares analysis we minimize some statistic like c2
In a Bayesian analysis we need to integrate
Parameter estimation to find the marginal posterior probability
density function (PDF) for the orbital period P we need to integrate
the joint posterior over all the other parameters
p T D M 1 I = sbquo u0 sbquo s L sbquo s p T u0 s L s D M 1 I
Marginal PDF
for T Joint posterior probability
density function (PDF) for
the parameters
Shortly discuss an efficient method for Integrating over a large parameter space
called Markov chain Monte Carlo (MCMC)
Integration is more difficult than minimization However the Bayesian
solution provides the most accurate information about the parameter errors and correlations without the need for any additional
calculations ie Monte Carlo simulations
Data Model Prior outline
Numerical tools
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2441
D M I
Linear models (uniform priors)
Posterior has a single peak
(multi-dimensional Gaussian)
Posterior
Parameters given
by the normal equations
of linear least-squares
No integration required
solution very fast
using linear algebra
Posterior may have multiple peaks
Brute force Asymptotic Moderate High
integration approxrsquos dimensions dimensions
peak finding quadrature MCMC
algorithms
(1) Levenberg- randomized
Marquardt quadrature
(2) Simulatedannealing adaptive
(3) Genetic quadrature
algorithm
Laplace
approxrsquos
Nonlinear models
+ linear models (non-uniform priors)
For some
parameters
analytic
integration
sometimespossible
for Bayesian
model fitting
(chapter 10) (chapter 11) (chapter 12)
Chaptersoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2541
1 Role of probability theory in science
2 Probability theory as extended logic
3 The how-to of Bayesian inference4 Assigning probabilities
5 Frequentist statistical inference
6 What is a statistic
7 Frequentist hypothesis testing8 Maximum entropy probabilities
9 Bayesian inference (Gaussian errors)
10 Linear model fitting (Gaussian errors)
11 Nonlinear model fitting
12 Markov chain Monte Carlo
13 Bayesian spectral analysis
14 Bayesian inference (Poisson sampling)
p
Resources and solutions
This title has free
Mathematica based supportsoftware available
Introduces statistical inference in the
larger context of scientific methods and
includes 55 worked examples and manyproblem sets
outline
MCMC for integration in large parameter spaces
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extra-solar planet applications:

- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Inputs: data D, model M, prior information I; n = no. of iterations; {X_α}init = start parameters; {σ_α}init = start proposal σ's; {β} = tempering levels.

Target posterior: p({X_α}|D,M,I), sampled by the hybrid parallel tempering MCMC nonlinear model fitting program.

Adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher probability chains.

Outputs: control system diagnostics; {X_α} iterations; summary statistics; best fit model & residuals; {X_α} marginals; {X_α} 68.3% credible regions; p(D|M,I) marginal likelihood for model comparison.
Adaptive Hybrid MCMC

Eight parallel tempering Metropolis chains, with β = 1/T taking the values 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09. Each chain outputs at each iteration: its parameters, log prior + β × log like, and log prior + log like. Parallel tempering swap operations exchange states between adjacent chains.

Two-stage proposal σ control system:
- Anneal the Gaussian proposal σ's, driven by the error signal = (actual joint acceptance rate − 0.25). This effectively defines the burn-in interval.
- Refine & update the Gaussian proposal σ's (including handling of correlated parameters).

Genetic algorithm: every 10th iteration, perform a gene crossover operation to breed a larger (log prior + log like) parameter set.

Peak parameter set: monitor for parameters with peak probability; if (log prior + log like) exceeds the previous best by a threshold, update the best set and reset the burn-in.
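The control system's error signal, (actual acceptance rate − 0.25), can be illustrated with a minimal single-chain sketch. This is not the talk's two-stage control system: the 2-D Gaussian target, window length, and multiplicative update rule here are invented for the example, and adaptation is frozen after burn-in so the post burn-in chain is a valid Metropolis sampler.

```python
import numpy as np

def logpost(x):
    return -0.5 * x @ x              # toy 2-D standard-normal posterior

rng = np.random.default_rng(0)
x = np.zeros(2)
sigma = 10.0                         # deliberately poor starting proposal width
acc_window = 0                       # acceptances in the current window
post = 0                             # acceptances after burn-in

for t in range(1, 20001):
    prop = x + rng.normal(0.0, sigma, size=2)
    if np.log(rng.random()) < logpost(prop) - logpost(x):
        x = prop
        acc_window += 1
        if t > 10000:
            post += 1
    if t % 100 == 0 and t <= 10000:       # adapt only during the burn-in phase
        error = acc_window / 100 - 0.25   # error signal: acceptance rate - 25%
        sigma *= np.exp(error)            # shrink sigma if accepting too rarely
        acc_window = 0

post_rate = post / 10000
print(f"tuned sigma = {sigma:.2f}, post burn-in acceptance rate = {post_rate:.2f}")
```

Starting from a far-too-wide proposal, the feedback loop pulls σ toward a value whose acceptance rate sits near the 25% target.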
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M₀, I)

Model M₀ assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D|M₀,s,I) = (2π)^(−N/2) (s² + σ²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2(s² + σ²)) ]

Model selection results: Bayes factor = 4.5 × 10⁴.
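The noise-only likelihood above is straightforward to evaluate numerically. A small sketch, assuming a common extra-noise parameter s added in quadrature to known per-channel measurement errors σ_i (the code allows σ to vary per channel, a slight generalization of the formula); the data are simulated here, not the talk's spectrum.

```python
import numpy as np

def log_p_D_M0(d, sigma, s):
    # log p(D|M0,s,I): each channel d_i is pure noise (model mean 0)
    # with total variance s^2 + sigma_i^2
    var = s ** 2 + sigma ** 2
    return np.sum(-0.5 * np.log(2 * np.pi * var) - d ** 2 / (2 * var))

rng = np.random.default_rng(0)
sigma = np.full(50, 1.0)            # known measurement errors, one per channel
d = rng.normal(0.0, 1.0, size=50)   # simulated spectrum that really is pure noise
print(log_p_D_M0(d, sigma, 0.0), log_p_D_M0(d, sigma, 2.0))
```

For data consistent with the stated measurement errors, inflating the variance with a large s lowers the likelihood, as expected.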
Methanol emission in the Sgr A environment

[Table of fit parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in the 96 GHz band.
Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to compute the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC for a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, to evaluate the marginal likelihood p(D|M_i, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m. We need two to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
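One standard way to use the full β ladder is thermodynamic integration: ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the mean log-likelihood under the tempered distribution p(X|M,I) p(D|X,M,I)^β. A sketch on a one-parameter conjugate Gaussian toy model, invented here precisely because its marginal likelihood is known in closed form for comparison; grid, chain lengths, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, sig, tau = 10, 1.0, 2.0              # data size, noise sd, prior sd on mu
d = rng.normal(1.0, sig, size=N)        # simulated data with true mean 1

def loglike(mu):
    return (-0.5 * N * np.log(2 * np.pi * sig ** 2)
            - np.sum((d - mu) ** 2) / (2 * sig ** 2))

def logprior(mu):
    return -0.5 * np.log(2 * np.pi * tau ** 2) - mu ** 2 / (2 * tau ** 2)

def mean_loglike(beta, n=4000, burn=1000):
    # <ln L>_beta from a Metropolis run targeting p(mu|I) p(D|mu,I)^beta
    mu, ll = 0.0, loglike(0.0)
    lp = logprior(mu) + beta * ll
    out = []
    for t in range(n):
        prop = mu + rng.normal(0.0, 1.0)
        ll_prop = loglike(prop)
        lp_prop = logprior(prop) + beta * ll_prop
        if np.log(rng.random()) < lp_prop - lp:
            mu, ll, lp = prop, ll_prop, lp_prop
        if t >= burn:
            out.append(ll)
    return np.mean(out)

betas = np.concatenate(([0.0], np.geomspace(0.02, 1.0, 15)))   # dense near 0
f = np.array([mean_loglike(b) for b in betas])
lnZ_ti = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(betas))       # trapezoid rule

# exact marginal likelihood for this conjugate Gaussian toy model
A = N / sig ** 2 + 1 / tau ** 2
b = np.sum(d) / sig ** 2
lnZ_exact = (-0.5 * N * np.log(2 * np.pi * sig ** 2) - 0.5 * np.log(A * tau ** 2)
             + b ** 2 / (2 * A) - np.sum(d ** 2) / (2 * sig ** 2))
print(lnZ_ti, lnZ_exact)
```

The integrand changes fastest near β = 0, which is why a geometric grid is used there; with a crude uniform grid the discretization error grows quickly.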
For a copy of this talk please Google Phil Gregory
The rewards of data analysis

'The universe is full of magical things patiently waiting for our wits to grow sharper.'

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters. Let θ_j^i represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations for each simulation. Let θ̄_j denote the mean of chain j, and θ̄ the mean of the m chain means.

Mean within-chain variance: W = [1/(m(h − 1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^i − θ̄_j)²

Between-chain variance: B = [h/(m − 1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance: V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic: R = √( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g., < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D.B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
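The W, B, and V̂ formulas translate directly to code. A sketch; the two synthetic chain sets are invented to show the statistic sitting near 1.0 for well-mixed chains and well above it when one chain is off exploring a separate mode.

```python
import numpy as np

def gelman_rubin(chains):
    # chains: shape (m, h), m independent simulations,
    # h post burn-in iterations of one parameter theta
    m, h = chains.shape
    means = chains.mean(axis=1)                # per-chain means theta_bar_j
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = h * means.var(ddof=1)                  # between-chain variance
    V = (1 - 1 / h) * W + B / h                # estimated variance V-hat
    return np.sqrt(V / W)

rng = np.random.default_rng(0)
mixed = rng.normal(0.0, 1.0, size=(4, 5000))   # 4 well-mixed chains
# same draws, but the 4th chain offset into its own mode
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

The statistic is computed separately for each model parameter, and all of them should fall below the chosen threshold before the run is declared converged.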
Resources and solutions

This title has free Mathematica-based support software available. It introduces statistical inference in the larger context of scientific methods and includes 55 worked examples and many problem sets.

Contents:
1. Role of probability theory in science
2. Probability theory as extended logic
3. The how-to of Bayesian inference
4. Assigning probabilities
5. Frequentist statistical inference
6. What is a statistic?
7. Frequentist hypothesis testing
8. Maximum entropy probabilities
9. Bayesian inference (Gaussian errors)
10. Linear model fitting (Gaussian errors)
11. Nonlinear model fitting
12. Markov chain Monte Carlo
13. Bayesian spectral analysis
14. Bayesian inference (Poisson sampling)
MCMC for integration in large parameter spaces
Markov chain Monte Carlo (MCMC) algorithms provide a powerful means for efficiently computing integrals in many dimensions to within a constant factor. This factor is not required for parameter estimation.

After an initial burn-in period (which is discarded), the MCMC produces an equilibrium distribution of samples in parameter space, such that the density of samples is proportional to the joint posterior.

It is very efficient because, unlike straight Monte Carlo integration, it doesn't waste time exploring regions where the joint posterior is very small.

The MCMC employs a Markov chain random walk, whereby the new sample in parameter space, designated X_{t+1}, depends on the previous sample X_t according to an entity called the transition probability or kernel, p(X_{t+1}|X_t). The transition kernel is assumed to be time independent.
Starting point: Metropolis-Hastings MCMC algorithm

P(X|D,M,I) = target posterior probability distribution (X represents the set of model parameters).

1. Choose X₀, an initial location in the parameter space. Set t = 0.

2. Repeat:
- Obtain a new sample Y from a proposal distribution q(Y|X_t) that is easy to evaluate; q(Y|X_t) can have almost any form.
- Sample a Uniform(0, 1) random variable U.
- If U ≤ [p(Y|D,I) / p(X_t|D,I)] × [q(X_t|Y) / q(Y|X_t)], then set X_{t+1} = Y; otherwise set X_{t+1} = X_t.
- Increment t.

The factor q(X_t|Y)/q(Y|X_t) = 1 for a symmetric proposal distribution like a Gaussian.

I use a Gaussian proposal distribution, i.e., a normal distribution N(X_t, σ).
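A minimal sketch of the algorithm for a one-parameter problem, with a symmetric Gaussian proposal so the q-ratio factor drops out. The N(2, 0.5²) target, proposal σ, and chain length are toy choices for illustration, not the talk's spectral line model.

```python
import numpy as np

def logpost(x):
    # toy target: N(2, 0.5^2), known only up to a normalization constant
    return -0.5 * (x - 2.0) ** 2 / 0.25

rng = np.random.default_rng(0)
x, sigma = 0.0, 0.5                  # initial location X0 and proposal width
samples = []
for t in range(50000):
    y = x + rng.normal(0.0, sigma)   # proposal q(Y|Xt) = N(Xt, sigma)
    u = rng.random()                 # Uniform(0,1) random variable U
    # symmetric proposal, so only the posterior ratio appears
    if np.log(u) <= logpost(y) - logpost(x):
        x = y                        # accept: X_{t+1} = Y
    samples.append(x)                # a rejection keeps X_{t+1} = X_t

post = np.array(samples[5000:])      # discard burn-in
print(post.mean(), post.std())
```

The post burn-in sample mean and standard deviation recover the target's 2.0 and 0.5, illustrating that the equilibrium sample density is proportional to the posterior.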
Toy MCMC simulations: the efficiency depends on tuning the proposal distribution σ's. This can be a very difficult challenge for many parameters.

In this example the posterior probability distribution consists of two 2-dimensional Gaussians, indicated by the contours.

[Panels show chains with acceptance rate = 95%, 63%, and 4%, together with their autocorrelation functions.]
MCMC parameter samples for a Kepler model with 2 planets

P.C. Gregory, "A Bayesian Kepler Periodogram Detects a Second Planet in HD 208487", MNRAS 374, 1321, 2007.

[Figure: post burn-in parallel tempering MCMC samples of the two orbital periods P1 and P2, with the Gelman-Rubin statistic.]
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2641
g g
Markov chain Monte Carlo (MCMC) algorithms provide a powerful
means for efficiently computing integrals in many dimensions to within
a constant factor This factor is not required for parameter estimation
After an initial burn-in period (which is discarded) the MCMC
produces an equilibrium distribution of samples in parameter spacesuch that the density of samples is proportional to the joint posterior
It is very efficient because unlike straight Mont Carlo integration it
doesnrsquot waste time exploring regions where the joint posterior is very
small
The MCMC employs a Markov chain random walk whereby the new
sample in parameter space designated Xt+1 depends on previoussample Xt according to an entity called the transition probability or
kernel p(Xt+1 |Xt) The transition kernel is assumed to be time
independent
conditions return
outline
Starting point Metropolis-Hastings MCMC algorithm
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2741
P(X|DMI) = target posterior probability distribution(X represents the set of model parameters)
1 Choose X0 an initial location in the parameter space Set t = 0
2 Repeat -Obtain a new sample Y from a proposal distribution q H Y raquo XtLthat is easy to evaluate q H Y raquo XtLcan have almost any form
-Sample a Uniform
H0 1
Lrandom variable U
-If U poundp H Y raquo D ILp HXt raquo D IL
acircq HXt raquo YLq H Y raquoXtL
then set Xt+1 = Y
otherwise set Xt+1 = Xt
- Increment t gtThis factor =1
for a symmetric proposal
distribution like a Gaussian
I use a Gaussian proposal distribution ie Normal distribution N(Xt σ)
return
Toy MCMC simulations the efficiency depends on tuning proposal
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMC

The simple Metropolis-Hastings MCMC algorithm can run into difficulties if the probability distribution is multi-modal with widely separated peaks. It can fail to fully explore all peaks that contain significant probability, especially if some of the peaks are very narrow.

One solution is to run multiple Metropolis-Hastings simulations in parallel, employing probability distributions of the kind

p(X|D,M,β,I) = p(X|M,I) p(D|X,M,I)^β,  0 < β ≤ 1.

A typical set of β values = 0.09, 0.15, 0.22, 0.35, 0.48, 0.61, 0.78, 1.0.

β = 1 corresponds to our desired target distribution; the others correspond to progressively flatter probability distributions.

At intervals, a pair of adjacent simulations is chosen at random and a proposal is made to swap their parameter states. The swap allows an exchange of information across the ladder of simulations. In the low-β simulations, radically different configurations can arise, whereas at higher β a configuration is given the chance to refine itself.

Final results are based on samples from the β = 1 simulation. Samples from the other simulations provide one way to evaluate the Bayes factor in model selection problems.
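The swap proposal between adjacent tempered chains can be sketched as follows (an illustrative fragment, not the author's code; the function name is mine). The Metropolis ratio for exchanging states between chains at inverse temperatures β_i and β_j reduces to exp[(β_i − β_j)(logL_j − logL_i)]:

```python
import numpy as np

def propose_swap(states, loglikes, betas, rng):
    """Attempt one state swap between a random pair of adjacent tempered chains.

    Acceptance probability: min(1, exp[(beta_i - beta_j) * (logL_j - logL_i)]),
    the Metropolis ratio for the joint tempered target.
    """
    i = int(rng.integers(len(betas) - 1))
    j = i + 1
    log_r = (betas[i] - betas[j]) * (loglikes[j] - loglikes[i])
    if np.log(rng.uniform()) <= log_r:
        states[i], states[j] = states[j], states[i]          # exchange parameter states
        loglikes[i], loglikes[j] = loglikes[j], loglikes[i]  # keep bookkeeping consistent
        return True
    return False
```

A swap is always accepted when the flatter (lower-β) chain currently holds the higher-likelihood state, which is exactly how good configurations found during broad exploration migrate up the ladder toward β = 1.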
MCMC Technical Difficulties

1. Deciding on the burn-in period.
2. Making a good choice of the characteristic width of each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.
3. Handling highly correlated parameters. Ans: transform the parameter set, or use differential evolution MCMC.
4. Deciding how many iterations are sufficient. Ans: use the Gelman-Rubin statistic.
5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

The code is implemented in Mathematica.

Current extrasolar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (the latest version) provides an easy route to parallel computing. I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC, we greatly increase the probability of realizing this goal.

Data + Model + Prior information → MCMC
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting. The program incorporates a control system that automates the selection of Gaussian proposal distribution σ's.

Inputs: data D, model M, prior information I; target posterior p({Xα}|D,M,I); n = number of iterations; {Xα}_init = start parameters; {σα}_init = start proposal σ's; {β} = tempering levels.

Hybrid parallel tempering MCMC nonlinear model fitting program, with an adaptive two-stage control system:
1) Automates the selection of an efficient set of Gaussian proposal distribution σ's using an annealing operation.
2) Monitors the MCMC for the emergence of a significantly improved parameter set and resets the MCMC. Includes a gene crossover algorithm to breed higher-probability chains.

Outputs: control system diagnostics; {Xα} iterations; summary statistics; best-fit model and residuals; {Xα} marginals; {Xα} 68.3% credible regions; p(D|M,I), the marginal likelihood for model comparison.
Adaptive hybrid MCMC: output at each iteration

Eight parallel tempering Metropolis chains run with β = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09 (β = 1/T). Each chain outputs at each iteration: parameters, logprior + β × loglike, and logprior + loglike. Parallel tempering swap operations exchange parameter states between adjacent chains.

Two-stage proposal σ control system:
- Anneal the Gaussian proposal σ's, then refine and update them.
- Error signal = (actual joint acceptance rate − 0.25).
- Effectively defines the burn-in interval.

Genetic algorithm:
- Every 10th iteration, perform a gene crossover operation to breed a larger (logprior + loglike) parameter set.

Peak parameter set:
- Monitor for parameters with peak probability. If (logprior + loglike) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
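The error-signal idea above can be illustrated as a simple feedback loop (a sketch only; the talk's actual two-stage control system with annealing is more elaborate, and the gain constant here is an assumption of mine):

```python
def adapt_sigma(sigma, accept_rate, target=0.25, gain=0.5):
    """One feedback step: scale the proposal sigma by the acceptance-rate error.

    If the chain accepts too often, steps are too small, so grow sigma;
    if it accepts too rarely, steps are too large, so shrink sigma.
    """
    error = accept_rate - target          # error signal = actual rate - 0.25
    return sigma * (1.0 + gain * error)   # multiplicative update keeps sigma positive

# Acceptance rates measured over successive batches of iterations
sigma = 1.0
for rate in (0.9, 0.6, 0.4, 0.3, 0.25):
    sigma = adapt_sigma(sigma, rate)
```

Here the repeated over-acceptance drives sigma up until the rate settles at the 25% target, at which point the error signal vanishes and sigma stops changing.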
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
Calculation of p(D|M0, I)

Model M0 assumes the spectrum is consistent with noise and has no free parameters, so we can write

p(D \mid M_0, \sigma, I) = (2\pi)^{-N/2} \, (\sigma^2 + s^2)^{-N/2} \, \exp\!\left[ -\sum_{i=1}^{N} \frac{(d_i - 0)^2}{2(\sigma^2 + s^2)} \right]

Model selection results: Bayes factor = 4.5 × 10^4.
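Since the null model has no free parameters, its likelihood needs no MCMC at all; it is a single Gaussian evaluation. A direct computation in log form (the data values and noise levels below are made up for illustration):

```python
import numpy as np

def log_like_M0(d, sigma, s):
    """log p(D|M0, sigma, I): the data d are pure noise with variance sigma^2 + s^2.

    sigma is the known measurement noise and s an extra noise term (fixed here),
    so there is nothing to marginalize over.
    """
    var = sigma**2 + s**2
    n = d.size
    # log of (2*pi*var)^(-n/2) * exp(-sum d_i^2 / (2*var))
    return -0.5 * n * np.log(2.0 * np.pi * var) - np.sum(d**2) / (2.0 * var)

d = np.array([0.3, -1.2, 0.7, 0.1])   # toy "spectrum" residuals (illustrative)
logp0 = log_like_M0(d, sigma=1.0, s=0.5)
```

The Bayes factor quoted on the slide is the ratio of the spectral-line model's marginal likelihood to this null-model value.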
Methanol emission in the Sgr A environment

Optically thin fit to 3 bands + an unidentified line in the 96 GHz band.

[Table of fit parameters: v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).]

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E. R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K. M. Menten (MPIfR)
Conclusions

1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, i.e., evaluate the marginal likelihood p(D|Mi, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m; we need two in order to know if either is valid.

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.

For a copy of this talk, please Google Phil Gregory.
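The parallel-tempering route to the marginal likelihood mentioned above is usually written as a thermodynamic integration (a standard result in the MCMC literature, stated here in the notation of the tempered distributions defined earlier):

```latex
\ln p(D \mid M, I) \;=\; \int_0^1 d\beta \;\bigl\langle \ln p(D \mid X, M, I) \bigr\rangle_{\beta},
```

where the angle brackets denote the average of the log-likelihood over the chain run at inverse temperature β. In practice the integral is approximated by a quadrature over the finite ladder of β values, which is why every tempered chain, not just β = 1, contributes to the model-selection result.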
The rewards of data analysis

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post-burn-in iterations from each simulation.

Mean within-chain variance:

W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \bigl(\theta_j^i - \bar{\theta}_j\bigr)^2

Between-chain variance:

B = \frac{h}{m-1} \sum_{j=1}^{m} \bigl(\bar{\theta}_j - \bar{\theta}\bigr)^2

Estimated variance:

\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta) / W}

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D. B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
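The formulas above translate directly into code (a minimal NumPy version for a single parameter; `chains` is an (m, h) array of post-burn-in samples, and the toy inputs below are mine):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for one parameter from m chains of h samples each."""
    m, h = chains.shape
    chain_means = chains.mean(axis=1)                   # theta-bar_j for each chain
    grand_mean = chain_means.mean()                     # theta-bar over all chains
    W = np.sum((chains - chain_means[:, None])**2) / (m * (h - 1))  # within-chain variance
    B = h * np.sum((chain_means - grand_mean)**2) / (m - 1)         # between-chain variance
    V_hat = (1.0 - 1.0 / h) * W + B / h                 # estimated variance of theta
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(0)
# Four chains sampling the same target: statistic should be near 1.0
converged = gelman_rubin(rng.standard_normal((4, 2000)))
# Four chains stuck in widely separated modes: statistic far above 1.0
separated = gelman_rubin(rng.standard_normal((4, 2000)) + 5.0 * np.arange(4)[:, None])
```

When the chains disagree about where the posterior mass lies, B dominates V̂(θ) and the statistic grows well past the 1.05 convergence threshold.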
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2841
In this example the
posterior probability
distribution consists of two2 dimensional Gaussians
indicated by the contours
Acceptance rate = 95 Acceptance rate = 63
Acceptance rate = 4
Autocorrelation
distributionsrsquos Can be a very difficult challenge for many parameters
return
outline
MCMC parameter samples for
K l d l ith 2 l t
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 2941
P1
P2
a Kepler model with 2 planets
MNRAS 374 1321 2007
P C Gregory
Title A Bayesian Kepler
Periodogram Detects a
Second Planet in HD 208487
Post burn-inGelman Ruben stat
Parallel tempering MCMCoutlin
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3041
The simple Metropolis-Hastings MCMC algorithm can run into
difficulties if the probability distribution is multi-modal with widely
separated peaks It can fail to fully explore all peaks which containsignificant probability especially if some of the peaks are very narrow
One solution is to run multiple Metropolis-Hastings simulations in
parallel employing probability distributions of the kind
Typical set of β values = 00901502203504806107810
β = 1 corresponds to our desired target distribution The others
correspond to progressively flatter probability distributions
p X D M b I = p X M I p D X M I b 0 lt β b 1H raquo L H raquo L H raquo L H L
At intervals a pair of adjacent simulations are chosen at random and
a proposal made to swap their parameter states The swap allows for
an exchange of information across the ladder of simulationsIn the low β simulations radically different configurations can arise
whereas at higher β a configuration is given the chance to refine itself
Final results are based on samples from the β = 1 simulation
Samples from the other simulations provide one way to evaluatethe Bayes Factor in model selection problems
outline
MCMC Technical Difficulties
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3141
1 Deciding on the burn-in period
2 Choosing a good choice for the characteristic width
of each proposal distribution one for each model
parameterFor Gaussian proposal distributions this means picking
a set of proposal σrsquos This can be very time consuming
for a large number of different parameters
3 Handling highly correlated parameters
Ans transform parameter set or differential MCMC
4 Deciding how many iterations are sufficient
Ans use Gelman-Rubin Statistic
5 Deciding on a good choice of tempering levels (β values)Gelman ndashRubin statistic
My involvement since 2002 ongoing
development of a general Bayesian Nonlinear
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chains. Output at each iteration, for each chain: parameters, log prior + β × log likelihood, log prior + log likelihood. The tempering levels are β = 1/T = 1.0, 0.72, 0.52, 0.39, 0.29, 0.20, 0.13, 0.09, coupled by parallel tempering swap operations.

MCMC adaptive control system (two-stage proposal-σ control):
- Anneal the Gaussian proposal σ's; error signal = (actual joint acceptance rate − 0.25). This stage effectively defines the burn-in interval.
- Refine & update the Gaussian proposal σ's (including correlated parameters).

Genetic algorithm: every 10th iteration, perform a gene-crossover operation to breed a larger (log prior + log likelihood) parameter set.

Peak parameter set: monitor for parameters with peak probability; if (log prior + log likelihood) exceeds the previous best by a threshold, update the peak parameter set and reset the burn-in.
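The error-signal feedback above can be sketched in a few lines. This is a toy illustration, not the talk's actual control system; the multiplicative update rule and the `gain` constant are my assumptions:

```python
import numpy as np

def adapt_sigmas(sigmas, accept_rate, target=0.25, gain=0.5):
    """One control-system step: rescale the Gaussian proposal sigmas
    from the observed joint acceptance rate.

    error signal = (actual joint acceptance rate - 0.25), as on the slide;
    the exponential update and the gain constant are illustrative choices.
    A too-low acceptance rate (steps too large) shrinks the sigmas;
    a too-high rate (steps too timid) grows them.
    """
    error = accept_rate - target
    return np.asarray(sigmas) * np.exp(gain * error)
```

In practice the annealing stage repeats this kind of correction until the joint acceptance rate settles near 0.25, which also marks the end of burn-in.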
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo

Calculation of p(D|M₀,I)
Model M₀ assumes the spectrum is consistent with noise and has no free parameters, so we can write

Model selection results:

p(D|M₀,s,I) = (2π)^(−N/2) (s² + σ²)^(−N/2) exp[ −Σ_{i=1}^{N} (d_i − 0)² / (2(s² + σ²)) ]

Bayes factor = 4.5 × 10⁴

Methanol emission in the Sgr A environment
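The expression for p(D|M₀,s,I) can be evaluated directly in log form. A minimal sketch (the toy data below are illustrative, not the methanol measurements):

```python
import numpy as np

def log_p_D_M0(d, s, sigma):
    """ln p(D|M0, s, I) for the no-free-parameter noise model:
    N independent Gaussian data with mean 0 and variance s^2 + sigma^2."""
    d = np.asarray(d, dtype=float)
    var = s ** 2 + sigma ** 2
    return -0.5 * len(d) * np.log(2 * np.pi * var) - np.sum(d ** 2) / (2 * var)

# A Bayes factor then compares a line model M1 against M0 via their
# marginal likelihoods: B_10 = exp(ln p(D|M1,I) - ln p(D|M0,I));
# the slide reports B ~ 4.5e4 in favor of the line model.
```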
Fitted line parameters (table): v (km s⁻¹), FWHM (km s⁻¹), T_J (K), (N/Z)_A (cm⁻²), T_K (K), ν_UL (MHz), FWHM_UL (km s⁻¹), T_UL (K), ds96, ds242, s (K).

ν_UL (MHz) is the rest frequency of the unidentified line after removal of the Doppler velocity v (km s⁻¹).

M. Stanković, E.R. Seaquist (UofT), S. Leurini (ESO), P. Gregory (UBC), S. Muehle (JIVE), K.M. Menten (MPIfR)

Optically thin fit to 3 bands + unidentified line in 96 GHz band
Conclusions
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant, in order to evaluate the marginal likelihood p(D|M_i,I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains, spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
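One concrete form of that parallel-tempering solution is thermodynamic integration: ln p(D|M,I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the mean log-likelihood of the chain tempered at β. A sketch, assuming the β ladder extends down toward 0; the trapezoid rule here is my choice, not necessarily the talk's:

```python
import numpy as np

def thermo_integration_lnZ(betas, loglike_samples):
    """Estimate ln p(D|M,I) from parallel tempering output.

    betas           : increasing tempering levels, ideally from ~0 up to 1.0
    loglike_samples : one array of post burn-in ln-likelihood draws per beta
    """
    mean_ll = np.array([np.mean(ll) for ll in loglike_samples])
    return np.trapz(mean_ll, np.asarray(betas))  # integral of <ln L>_beta d(beta)
```

The cost grows with the number of tempering levels needed for an accurate integral, which is one reason this becomes expensive for large m.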
For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis:

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862-1960), author and playwright
Gelman-Rubin statistic

Let θ represent one of the model parameters. Let θ_j^(i) represent the i-th iteration of the j-th of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance: W = [1 / (m(h − 1))] Σ_{j=1}^{m} Σ_{i=1}^{h} (θ_j^(i) − θ̄_j)²

Between-chain variance: B = [h / (m − 1)] Σ_{j=1}^{m} (θ̄_j − θ̄)²

Estimated variance: V̂(θ) = (1 − 1/h) W + (1/h) B

Gelman-Rubin statistic = √( V̂(θ) / W )

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and Rubin, D.B. (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science 7, pp. 457-511.
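The definitions above translate almost line for line into code. A sketch, assuming the post burn-in samples of one parameter are arranged as an (m, h) array:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: (m, h) array; m independent chains, h post burn-in draws each.
    Returns sqrt(Vhat / W), following the definitions above."""
    m, h = chains.shape
    means = chains.mean(axis=1)                                 # theta-bar_j
    W = np.sum((chains - means[:, None]) ** 2) / (m * (h - 1))  # within-chain
    B = h * np.sum((means - means.mean()) ** 2) / (m - 1)       # between-chain
    Vhat = (1.0 - 1.0 / h) * W + B / h
    return np.sqrt(Vhat / W)
```

Values near 1.0 (e.g. below 1.05) for every parameter indicate the chains have converged on the same distribution; chains stuck in different regions inflate B and drive the statistic well above 1.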
MCMC practical difficulties:

1. Deciding on the burn-in period.

2. Choosing a good characteristic width for each proposal distribution, one for each model parameter. For Gaussian proposal distributions this means picking a set of proposal σ's, which can be very time consuming for a large number of different parameters.

3. Handling highly correlated parameters. Answer: transform the parameter set, or use differential evolution MCMC.

4. Deciding how many iterations are sufficient. Answer: use the Gelman-Rubin statistic.

5. Deciding on a good choice of tempering levels (β values).
My involvement: since 2002, ongoing development of a general Bayesian nonlinear model-fitting program.

My latest hybrid Markov chain Monte Carlo (MCMC) nonlinear model-fitting algorithm incorporates:
- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution
- A unique control system that automates the MCMC

Code is implemented in Mathematica.

Current extrasolar planet applications:
- Precision radial velocity data (4 new planets published to date)
- Pulsar planets from timing residuals of NGC 6440C
- NASA stellar interferometry mission astrometry testing

Also: submillimeter radio spectroscopy of galactic center methanol lines.

Mathematica 7 (latest version) provides an easy route to parallel computing; I run on an 8-core PC and achieve a speed-up of 7 times.
Blind searches with hybrid MCMC

- Parallel tempering
- Simulated annealing
- Genetic algorithm
- Differential evolution

Each of these methods was designed to facilitate the detection of a global minimum in χ². By combining all four in a hybrid MCMC we greatly increase the probability of realizing this goal.

Inputs: Data, Model, Prior information. MCMC details follow.
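A toy version of the combined sampler shows why tempering helps a blind search: a cold Metropolis chain tends to get trapped in one mode, while hot chains (small β) roam freely and hand good states down through swaps. The target density, step size, and β values below are illustrative assumptions, not the talk's settings:

```python
import numpy as np

def loglike(x):
    # toy likelihood with two well-separated modes at x = -4 and x = +4
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

def pt_metropolis(betas, n_iter, step=1.5, seed=0):
    """Parallel tempering Metropolis: one walker per beta, each targeting
    exp(beta * loglike); adjacent chains attempt a swap every iteration."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=len(betas))
    cold = np.empty(n_iter)
    for it in range(n_iter):
        for k, b in enumerate(betas):            # within-chain Metropolis step
            prop = x[k] + step * rng.normal()
            if np.log(rng.random()) < b * (loglike(prop) - loglike(x[k])):
                x[k] = prop
        k = rng.integers(len(betas) - 1)         # propose one adjacent swap
        dlog = (betas[k] - betas[k + 1]) * (loglike(x[k + 1]) - loglike(x[k]))
        if np.log(rng.random()) < dlog:
            x[k], x[k + 1] = x[k + 1], x[k]
        cold[it] = x[-1]                         # record the beta = 1 chain
    return cold

samples = pt_metropolis(np.array([0.05, 0.2, 0.5, 1.0]), 5000)
```

With only the cold chain (betas = [1.0]) the sampler typically stays in whichever mode it finds first; the hot chains are what let a blind search cross between widely separated modes.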
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3241
development of a general Bayesian Nonlinear
model fitting program
My latest hybrid Markov chain Monte Carlo (MCMC)nonlinear model fitting algorithm incorporates
-Parallel tempering
-Simulated annealing-Genetic algorithm
-Differential evolution
-Unique control system automates the MCMC
Code is implemented in Mathematica
Current extra-solar planet applications
-precision radial velocity data ndash (4 new planets published to date)
-pulsar planets from timing residuals of NGC 6440C
-NASA stellar interferometry mission astrometry testing
Submillimeter radio spectroscopy of galactic center methanol lines
Mathematica 7 (latest version) provides an easy route to parallel computing
I run on an 8 core PC and achieve a speed-up of 7 times
outline
Bli d h i h h b id MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3341
Blind searches with hybrid MCMC
Parallel tempering
Simulated annealing
Genetic algorithmDifferential evolution
Each of these methods was designed to facilitate thedetection of a global minimum in c2 By combining all four
in a hybrid MCMC we greatly increase the probability of
realizing this goal
Data Model Prior information
MCMC details outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3441
Schematic of a Bayesian Markov chain Monte Carlo program for nonlinear model fitting The program incorporates a control system
that automates the selection of Gaussian proposal distribution σrsquos
Hybridparallel tempering
MCMCNonlinear modelfitting program
D M I
Target Posterior pH8XaltraquoDMIL
Adaptive Two Stage Control System __________________________________________________________ _ 1L Automates selection of an efficient set of Gaussian proposal
distribution ss using an annealing operation
2L Monitors MCMC for emergence of significantly improved
parameter set and resets MCMC Includes a gene crossover algorithm to breed higher probability chains
n = no of iterations8Xaltinit = start parameters8saltinit= start proposal ss8 blt = Temperinglevels
- Control systemdiagnostics
- 8Xalt iterations- Summarystatistics- Best fit model amp residuals- 8Xalt marginals
- 8Xalt 683 credible regions
- pHDraquoMIL marginal likelihoodfor model comparison
1
outlin
Output at each iterationAdaptive Hybrid MCMC
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1 For Bayesian parameter estimation MCMC provides a powerful
means of computing the integrals required to compute posterior
probability density function (PDF) for each model parameter
2 Even though we demonstrated the performance of an MCMC for a
simple spectral line problem with only 4 parameters MCMC
techniques are really most competitive for models with a much larger number of parameters m ge 15
3 Markov chain Monte Carlo analysis produces samples in model
parameter space in proportion to the posterior probability distribution
This is fine for parameter estimation
For model selection we need to determine the proportionality constant
to evaluate the marginal likelihood p(D|Mi I) for each model This is a
much more difficult problem still in search of two good solutions for large m We need two to know if either is valid
One solution is to use the MCMC results from all the parallel
tempering chains spanning a wide range of β values however this
becomes computationally very intensive for m gt 17
For a copy of this talk please Google Phil Gregory
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4041
The rewards of data analysis
lsquoThe universe is full of magical thingspatiently waiting for our wits to grow
sharperrsquo
Eden Philpotts (1862-1960)
Author and playwright
outline
Let q represent one of the model parameters
Gelman-Rubin Statistic
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 4141
Mean withinchain variance W =1
m Hh- 1L
sbquo j=1
m
sbquoi=1
h
Iq j
i- q jecircecirc
M2
Betweenchain variance B =h
m- 1 sbquo j=1
m Hq jecircecirc - q ecircecircL2
Estimated variance V` Hq L = ikjj1-
1
hyzz W+
1
h B
Gelman- Rubin statistic =
$V` Hq LW
The Gelman -Rubin statistic should be close to 10 Heg lt 105Lfor all paramaters for convergenceRef Gelman Aand DBRubin H1992L Inference from iterative
simulations using multiple sequences Hwith discussionL
Statistical Science 7 pp 457 minus 511
Let q represent one of the model parameters
Let q ji
represent the ith
iteration of the jth
of m independent simulation
Extract the last h post burn - in iterations for each simulation
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3541
8 parallel tempering Metropolis chainsOutput at each iteration
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglikeparameters logprior + b acirc loglike logprior + loglike
parameters logprior + b acirc loglike logprior + loglike
Monitor for
parameterswith peak
probabilityAnneal Gaussian
proposal srsquos
Refine amp update
Gaussian
proposal srsquos
2 stage proposal s control system
error signal =
(actual joint acceptance rate ndash 025)
Effectively defines burn-in interval
Genetic algorithm
Every 10th iteration perform gene
crossover operation to breed larger (logprior + loglike) parameter set
Peak parameter setIf (logprior + loglike) gt
previous best by a
threshold then update
and reset burn-in
β = 1 T
Parallel tempering
swap operations
MCMC adaptive control system
= 10
= 072
= 052
= 039
= 029
= 020= 013
= 009
β
β
β
β
β
ββ
β
Corr Par
outline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3641
Go to Mathematica support material
Go to Mathematica version of MCMC
Quasi-Monte Carlo
outline
Calculation of p(D|M 0 I)
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3741
Model M 0 assumes the spectrum is consistent with noise and has no
free parameters so we can write
Model selection results
p H D M 0 s I L = H2 p L- N 2 Js2+ s
2 N-N
2 ExpC- sbquoi = 1
N Hd i - 0 L2 Is 2 + s2 M
G
Bayes factor =45x104
Methanol emission inthe Sgr A environment
out ne
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3841
9v Ikm sminus1M FWHM Ikm s
minus1M TJ HKL H N ecircZL A Icm minus2M H N ecirc ZL A Icm
minus2MTK HKL ν
UL H MHzL FWHM UL Ikm s
minus1M TUL HKL ds96 ds242 s HKL=
νUL H MHzL is the rest frequency of the unidentied
line after removal of the Doppler veocity v Hkm sminus1L
M Stanković ER Seaquist (UofT) S
Leurini (ESO) PGregory (UBC)
S Muehle(JIVE) KMMenten (MPIfR)
g
Optically thin fit to 3 bands
+ unidentified line in 96 GHz band
return
Conclusionsoutline
842019 Florida Mar 2010
httpslidepdfcomreaderfullflorida-mar-2010 3941
1. For Bayesian parameter estimation, MCMC provides a powerful means of computing the integrals required to obtain the posterior probability density function (PDF) for each model parameter.

2. Even though we demonstrated the performance of an MCMC on a simple spectral line problem with only 4 parameters, MCMC techniques are really most competitive for models with a much larger number of parameters, m ≥ 15.

3. Markov chain Monte Carlo analysis produces samples in model parameter space in proportion to the posterior probability distribution. This is fine for parameter estimation.

For model selection we need to determine the proportionality constant in order to evaluate the marginal likelihood p(D|Mi, I) for each model. This is a much more difficult problem, still in search of two good solutions for large m (we need two to know if either is valid).

One solution is to use the MCMC results from all the parallel tempering chains spanning a wide range of β values; however, this becomes computationally very intensive for m > 17.
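The parallel-tempering route to the marginal likelihood is usually implemented by thermodynamic integration: ln p(D|M, I) = ∫₀¹ ⟨ln L⟩_β dβ, where ⟨ln L⟩_β is the average log-likelihood over the post burn-in samples of the chain at tempering value β. A hedged sketch under that assumption (function and variable names are mine, not from the talk):

```python
import numpy as np

def log_marginal_likelihood(betas, mean_loglikes):
    """Thermodynamic-integration estimate of ln p(D|M,I) from parallel
    tempering output.

    betas          : tempering values of the chains (ideally spanning 0..1)
    mean_loglikes  : average log-likelihood over each chain's post burn-in
                     samples, one entry per beta
    """
    order = np.argsort(betas)
    b = np.asarray(betas, dtype=float)[order]
    m = np.asarray(mean_loglikes, dtype=float)[order]
    # trapezoidal rule over the (possibly uneven) beta ladder
    return float(np.sum(0.5 * (m[1:] + m[:-1]) * np.diff(b)))
```

This is why chains at small β (high temperature) are needed at all, even though only the β = 1 chain is used for parameter estimation.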
For a copy of this talk, please Google Phil Gregory.
The rewards of data analysis

"The universe is full of magical things patiently waiting for our wits to grow sharper."

Eden Phillpotts (1862–1960), author and playwright
Gelman-Rubin Statistic

Let θ represent one of the model parameters, and let θ_j^i represent the ith iteration of the jth of m independent simulations. Extract the last h post burn-in iterations from each simulation.

Mean within-chain variance:

W = \frac{1}{m(h-1)} \sum_{j=1}^{m} \sum_{i=1}^{h} \left(\theta_j^i - \bar{\theta}_j\right)^2

Between-chain variance:

B = \frac{h}{m-1} \sum_{j=1}^{m} \left(\bar{\theta}_j - \bar{\bar{\theta}}\right)^2

Estimated variance:

\hat{V}(\theta) = \left(1 - \frac{1}{h}\right) W + \frac{1}{h} B

Gelman-Rubin statistic = \sqrt{\hat{V}(\theta)/W}

The Gelman-Rubin statistic should be close to 1.0 (e.g. < 1.05) for all parameters for convergence.

Ref: Gelman, A. and D.B. Rubin (1992), "Inference from iterative simulations using multiple sequences (with discussion)", Statistical Science, 7, pp. 457–511.
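The quantities above translate directly into code. A minimal sketch (names are illustrative), taking an (m, h) array of post burn-in samples for one parameter:

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic sqrt(V_hat / W) for a single parameter.

    chains : array of shape (m, h); m independent simulations, each
             holding its last h post burn-in iterations.
    """
    chains = np.asarray(chains, dtype=float)
    m, h = chains.shape
    means = chains.mean(axis=1)                               # theta-bar_j
    W = np.sum((chains - means[:, None])**2) / (m * (h - 1))  # within-chain
    B = h * np.var(means, ddof=1)                             # between-chain
    V_hat = (1.0 - 1.0 / h) * W + B / h                       # estimated variance
    return float(np.sqrt(V_hat / W))
```

Well-mixed chains drawn from the same distribution give a value near 1.0; chains stuck in different regions give values well above the < 1.05 convergence guideline.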