Hamiltonian Monte Carlo
Sherman Ip and Jack Jewson
19th November 2015
Introduction/Motivation
Markov Chain Monte Carlo (MCMC) methods have become a huge part of statistical research, and
there is an increasing need for these methods to be able to sample from vastly complex and
multidimensional distributions. One of the most widely used MCMC algorithms is the Metropolis-
Hastings algorithm designed by Metropolis et al. [1953] and Hastings [1970]. This algorithm proposes
a new data point conditional on the previous data point, creating a Markov chain, and accepts or rejects
this new point with a probability which ensures the invariant distribution of the Markov chain is the
correct target. Using a Metropolis-Hastings (MH) algorithm to successfully and efficiently sample
from a complex, multidimensional distribution relies on the ability to produce proposals that
both explore the whole target distribution 'quickly' and are accepted with high probability. The speed
of exploration and the acceptance probability determine how long the algorithm will need to be run
for it to produce a satisfactory sample, and also how close the sample produced is to an independent
sample.
This is by no means an easy task. An example which will be revisited throughout this paper is
Random Walk Metropolis-Hastings (RWMH), which proposes a new point by means of a Gaussian
random walk with a preset variance. If this variance is too low, the random walk will explore the space
very slowly, requiring a large number of iterations to explore the target and creating a sample where data
points are strongly dependent on data points many iterations earlier. Alternatively, if the variance is too
high, then the jumps made by the random walk will be much larger but, by the nature of the MH
acceptance probability, may often be rejected; this will result in many values being the same and a lot
of computational time wasted.
In order to attempt to solve this problem, Hamiltonian Monte Carlo (HMC), presented by Neal [2011],
is suggested. The HMC algorithm uses Hamiltonian dynamics to propose new data points and the
same MH acceptance probability to ensure the correct target is produced. Neal [2011] showed that using
Hamiltonian dynamics, wide-ranging proposals that explore the target quickly can be produced and
accepted with probability very close to one. The aim of this project was to implement HMC and
investigate whether or not it solves this problem compared to RWMH.
Theory
Neal [2011] gave a detailed review of HMC; in this section a brief review will be presented. In
physics, Hamiltonian dynamics are a way of describing the motion of a body of mass in d-dimensional
space. Neal [2011] gave the intuitive description of a puck moving around on an unlevel d-dimensional
surface. Suppose the body of mass has d-dimensional momentum p and d-dimensional position q such
that
p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_d \end{pmatrix}, \qquad q = \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix}, \quad (1)
then Hamiltonian dynamics formulates the equations of motion of this body using the following differential equations

\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} \quad (2)

\frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i} \quad (3)
for i = 1, 2, ···, d, where H is the Hamiltonian, or total energy, which depends on both the momentum
and position and t is time.
The goal here is to use Hamiltonian dynamics to produce proposals that can be used in a MH algorithm
in order to efficiently sample from a distribution with d dimensions and probability density function
f(.). This is done by introducing p as an auxiliary random variable and defining its distribution by
means of the kinetic energy
K(p) = \frac{1}{2} p^T M^{-1} p \quad (4)
where M is the mass matrix, usually defined as a multiple of the identity matrix. This corresponds
to the negative logarithm of the p.d.f. of a multivariate Gaussian with mean 0 and covariance matrix
M, plus some arbitrary constant.
The potential energy is defined, in a similar way, to be the negative logarithm of the target. This has
been done to ensure that the canonical distributions1 of our potential and kinetic energies are the desired
target and a multivariate Gaussian.
U(q) = − log(f(q)) . (5)
Putting this together, the Hamiltonian, the total energy in the system, is
H(p,q) = K(p) + U(q) . (6)
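For concreteness, these energies can be written as code. The following Python sketch (the function names are our own, and a standard Gaussian target is assumed, so that U(q) = qᵀq/2 up to an additive constant) evaluates K, U and H:

```python
import numpy as np

def kinetic_energy(p, M_inv):
    # K(p) = 0.5 p^T M^{-1} p: the negative log-density of N(0, M),
    # up to an additive constant (equation (4))
    return 0.5 * p @ M_inv @ p

def potential_energy(q):
    # U(q) = -log f(q); here f is assumed to be a standard Gaussian
    # target, with additive constants dropped (equation (5))
    return 0.5 * q @ q

def hamiltonian(q, p, M_inv):
    # H(p, q) = K(p) + U(q), the total energy (equation (6))
    return potential_energy(q) + kinetic_energy(p, M_inv)
```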
It can be shown that Hamiltonian dynamics are reversible, keep the Hamiltonian constant and preserve
volume in (p,q). These properties are important as they ensure a Markov chain, using Hamiltonian
dynamics and the usual MH acceptance probability, will admit the correct target as its invariant distri-
bution. This motivates the use of Hamiltonian dynamics as a proposal distribution.
Hamiltonian dynamics evolve as a continuous process and therefore require a discretisation in order for
them to be implemented on a computer. The discretisation will therefore only provide proposals that
approximately obey Hamiltonian dynamics. It is important that the discretisation is also reversible and
volume preserving in order to maintain the validity of using approximate Hamiltonian Dynamics as a
proposal distribution. After trying various methods, Neal [2011] decides upon the leapfrog discretisation
method as a way of providing a good approximation to the Hamiltonian dynamics that satisfies the
necessary properties. The leapfrog method depends on 2 externally inputted parameters, L and ε: L
determines how many leapfrog steps the discretisation is run for and ε determines how far each leapfrog
step jumps. εL can therefore be seen as the length of the leapfrog trajectory. A smaller ε will give a
better approximation to the Hamiltonian dynamics and a larger L will displace the particle further at
each iteration.
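A minimal sketch of the leapfrog discretisation (in Python, with function names of our own choosing; grad_U is assumed to return the gradient of the potential energy) might look like:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L, M_inv):
    # L leapfrog steps of size eps; grad_U is the gradient of the
    # potential energy U. A sketch, not the paper's implementation.
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_U(q)        # initial half step for momentum
    for step in range(L):
        q += eps * (M_inv @ p)        # full step for position
        if step < L - 1:
            p -= eps * grad_U(q)      # full step for momentum
    p -= 0.5 * eps * grad_U(q)        # final half step for momentum
    return q, -p                      # negate p so the map is reversible
```

The final negation of the momentum makes the map an involution, which is what delivers the reversibility required above.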
Now that the validity and practical implications of using Hamiltonian dynamics as a MH proposal
distribution have been established, Neal [2011] proposes an HMC algorithm.
1The canonical distribution P(E) of energy E is proportional to e^{-E/(k_B T)}, where T is the temperature and k_B is Boltzmann's constant
The HMC algorithm is as follows:
Initialise position q0
for i = 1,...,N
1) Draw pi ∼ Nd(0,M)
2) Using the leapfrog starting at qi and pi, with preset parameters ε and L, propose q* and p* from the
Hamiltonian dynamics.
With probability
\alpha_{HMC} = \min\left(1, \frac{\exp[-H(p^*, q^*)]}{\exp[-H(p_i, q_i)]}\right) \quad (7)
set qi+1 = q∗, else set qi+1 = qi
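Putting the pieces together, one iteration of the algorithm above might be sketched as follows (Python, with M = I and function names of our own choosing; U is the potential energy and grad_U its gradient):

```python
import numpy as np

def hmc_step(q, U, grad_U, eps, L, rng):
    # One HMC update with M = I (a sketch of the algorithm above).
    p = rng.standard_normal(q.shape)            # 1) draw p ~ N(0, I)
    H_cur = U(q) + 0.5 * p @ p                  # current total energy
    # 2) leapfrog trajectory of L steps of size eps
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * grad_U(q_new)
    for step in range(L):
        q_new += eps * p_new
        if step < L - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)
    H_prop = U(q_new) + 0.5 * p_new @ p_new
    # 3) accept with probability min(1, exp(H_cur - H_prop)), eq. (7)
    if rng.uniform() < np.exp(H_cur - H_prop):
        return q_new
    return q
```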
The HMC algorithm targets the canonical distribution of the Hamiltonian which, here, is the augmented
target. Due to the properties of the exp function our augmented target can be written as the product
of the canonical distributions of the Kinetic energy and the Potential energy. The fact that the target
can be written in such a way indicates that q and p are independent and therefore the marginal of q
will be our target f(.).
\alpha_{HMC} corresponds to the usual MH acceptance probability: \exp[-H(p^*, q^*)]/\exp[-H(p_i, q_i)] corresponds to the ratio
of our augmented target distribution, and p^* is set as the negative of the momentum at q^* in order to
ensure that the proposals are symmetric and therefore cancel. If it were possible to propose a position
exactly according to Hamiltonian dynamics then αHMC would always equal 1, as the Hamiltonian is
constant. However, as a discretisation is used, the value of the Hamiltonian will change; but providing
ε is small enough, it should not change too much, leaving a high acceptance rate.
The right choice of ε and L will allow HMC to produce wide ranging proposals that will be accepted
with probability close to one.
Univariate Gaussian
In order to provide a simple comparison between RWMH and HMC, they were both implemented to
sample from a univariate standard Gaussian distribution.
In order to compare HMC fairly with RWMH, L random-walk steps, matching the L leapfrog steps
done in HMC, were run before each MH accept/reject step.
Both methods were run for 10000 iterations to target the standard univariate normal. The RWMH
parameters were set with proposal variance of 1 and L = 25 and the HMC parameters were selected
to be ε = 0.3, M = 1 and L = 25.
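Under this setup, a single RWMH iteration might be sketched as follows (Python, function names of our own choosing; following the description above, L Gaussian random-walk increments are accumulated before a single MH accept/reject):

```python
import numpy as np

def rwmh_step(q, log_f, sigma, L, rng):
    # L Gaussian random-walk increments, then one MH accept/reject,
    # mirroring the L leapfrog steps per HMC proposal (a sketch).
    q_prop = q + sigma * rng.standard_normal((L,) + q.shape).sum(axis=0)
    # the accumulated proposal is symmetric, so the MH ratio is f(q')/f(q)
    if np.log(rng.uniform()) < log_f(q_prop) - log_f(q):
        return q_prop, True
    return q, False
```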
During the experiments, the autocorrelation and trace plots were observed. In order to provide further
insight into the performance of HMC compared with RWMH the Gelman and Rubin [1992] diagnostic
was also implemented to investigate how fast the algorithm chains burnt in. This was done by running
k independent chains of length 2n with different initial values. Let xji be the ith sample from the jth
chain, \bar{x}_j = \frac{1}{n}\sum_{i=n+1}^{2n} x_{ji} and \bar{x} = \frac{1}{kn}\sum_{j=1}^{k}\sum_{i=n+1}^{2n} x_{ji}. Then the ratio

\sqrt{s_b^2 / s_w^2} \quad (8)

where

s_b^2 = \frac{n \sum_{j=1}^{k} (\bar{x}_j - \bar{x})^2}{k - 1} \quad (9)

and

s_w^2 = \frac{\sum_{j=1}^{k} \sum_{i=n+1}^{2n} (x_{ji} - \bar{x}_j)^2}{nk - k} \quad (10)
can be used to compare the sample means of all k chains, in a manner similar to ANOVA. The ratio
should be about 1 if all k chains have the same mean, so the chains could be considered burnt in when
the ratio is close to 1. An F test could be conducted, but it was found to be too strict for the purposes
of this project. Instead, the diagnostic was repeated 10 times to obtain a sample of ratios. When
targeting the univariate standard Normal distribution, initial values of {−20, −19, ···, 19, 20} were
used for both RWMH and HMC.
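The diagnostic in (8)-(10) is straightforward to compute; a sketch (Python, with a function name of our own choosing, taking a k × 2n array of chains) might be:

```python
import numpy as np

def gelman_rubin_ratio(chains):
    # chains: a (k, 2n) array of k chains of length 2n; the first half
    # of each chain is discarded, following (8)-(10).
    k, two_n = chains.shape
    n = two_n // 2
    x = chains[:, n:]
    xbar_j = x.mean(axis=1)                                   # chain means
    xbar = x.mean()                                           # grand mean
    s2_b = n * ((xbar_j - xbar) ** 2).sum() / (k - 1)         # between, (9)
    s2_w = ((x - xbar_j[:, None]) ** 2).sum() / (n * k - k)   # within, (10)
    return np.sqrt(s2_b / s2_w)                               # ratio, (8)
```

Well-mixed chains give a ratio near 1, while chains stuck around different values give a ratio well above 1.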
The trace plots in Figure 1 show that both RWMH and HMC explore the support of the target;
however, HMC manages to do so a lot more quickly. The autocorrelation plots support this, with
RWMH taking 10 iterations to produce an almost independent sample whereas HMC takes only 4.
The faster mixing and lower autocorrelation of the sample are a result of αHMC being very close to
one whereas αRWMH = 0.244. This demonstrates HMC's ability to produce wide-ranging proposals
that are still accepted with high probability, and thus to explore the state space faster and produce a
sample that is closer to being an independent sample.
Figure 1: Traceplot and autocorrelation plot of RWMH and HMC targeting a standard Gaussian distribution with initial value 5.
An alternative way to demonstrate the impact that using HMC has on the mixing of the sample,
compared with RWMH, is to examine the speed at which the chains burn in from a variety of starting
values. Figure 2 plots the estimate of the mean of the target and shows that the HMC algorithm burnt
in much faster than RWMH. The Gelman-Rubin diagnostic, as shown in Figure 3, concurred that HMC
burnt in much faster, as the ratio of standard deviations was closer to 1 than for RWMH. The shorter
burn-in period, even from dispersed starting values, illustrates the fact that HMC can produce
wider-ranging proposals, allowing the sample to move to areas of high density in the target more
quickly than the RWMH can and avoiding wasted computing time.
Figure 2: Traceplots of 41 MCMC chains targeting a standard Gaussian distribution with initial values ranging from -20 to 20. (a) RWMH, (b) HMC.
Figure 3: Ratios of between- and within-chain standard deviations targeting the standard Gaussian distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation. (a) RWMH, (b) HMC.
Bivariate Gaussian
In order to test the effectiveness of HMC further, an experiment was run targeting a bivariate Gaussian
distribution with mean 0 and correlation between the 2 variables of 0.98. This is an example where
RWMH is known to perform badly, so HMC will be tested to see if it can improve upon it.
Both the HMC and RWMH algorithms were run targeting the bivariate Gaussian. In this case M was
set to 2 times the identity matrix and all other parameters were kept the same as when targeting the
univariate case. Figures 4 and 5 show trace plots, scatter plots and autocorrelation plots of the samples
produced by both algorithms.
Figure 4: Traceplot and autocorrelation plot of RWMH, with initial value (5,5), targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98.
Figure 5: Traceplot and autocorrelation plot of HMC, with initial value (5,5), targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98.
The RWMH trace plots show very slow mixing, with many horizontal straight lines indicating that the
sample retains the same data point for a considerable length of time. This suggests that many of the
proposed values were rejected, which is confirmed by the mean acceptance probability being
αRWMH = 0.0148. Owing to this, very high autocorrelation is observed between points many iterations
apart. This suggests that the RWMH algorithm would need to be run for a very long time to gain a
reasonable approximation of the target. The RWMH algorithm proposes individual variables indepen-
dently from a random walk; as the correlation in this target is so high, only proposals with similar X1
and X2 values will be accepted with high probability. It is quite unlikely to produce 2 similar variables
independently, especially in the tails of the target, and this is why the average acceptance probability
is so low.
On the other hand, the trace plots from the HMC sample show much faster mixing, with the sample
exploring the support of the target well. This produces low autocorrelation in the sample, taking only
4 iterations to produce an almost independent sample. αHMC = 0.734, which is actually quite low for
an HMC algorithm and suggests the value of ε should be decreased. This further demonstrates HMC's
ability to produce wide-ranging proposals that explore the whole target and are accepted with high
probability even when the target is slightly more complicated.
Again the Gelman-Rubin diagnostic was examined in order to illustrate the power of the HMC propos-
als. For the bivariate Gaussian distribution, the initial values were set to be 20 points equally spaced
on a circle of radius 5. It was observed that HMC burnt in almost immediately, whereas RWMH
struggled to find the distribution, as shown in Figure 6. The ratios of between- and within-chain
standard deviations, as shown in Figures 7 and 8, also suggested that RWMH had failed to burn in
after 50 steps, as the ratio was not near 1 even allowing for the error bars.
Figure 6: Paths of 20 MCMC chains after 50 steps, with initial values on a circle of radius 5, targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98. (a) RWMH, (b) HMC.
Figure 7: Ratios of between- and within-chain standard deviations for RWMH targeting the bivariate Normal distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation.
Figure 8: Ratios of between- and within-chain standard deviations for HMC targeting the bivariate Normal distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation.
Multivariate Gaussian
Another advantage of HMC over RWMH discussed in Neal [2011] is the way in which the two
algorithms perform as the dimension of the target increases. HMC's ability to produce wide-ranging
multidimensional proposals that are accepted with probability close to one allows it to deal with high
correlation between 2 variates, as shown in the previous section, and also to deal with multidimensional
targets, as will be demonstrated here.
The issue that RWMH has when it is required to sample from a multidimensional target is that it
produces a proposal for each dimension independently by way of a random walk, and these proposals
are all accepted or rejected together. It therefore only takes one dimension of the proposed point to be
in an area of low probability with respect to the target for the whole proposal to be rejected. As the
dimension of the proposal increases, it becomes more likely that one of the dimensions will have low
probability, and therefore the acceptance rate will drop and the RWMH will become inefficient.
To test whether this happens in practice, RWMH with proposal variance 1 and L = 25, and HMC
with ε = 0.3, L = 25 and M = 1, were both run for 10000 iterations targeting an independent standard
trivariate Normal. Plots comparing the performance of the two algorithms are shown in Figure 9.
Figure 9: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm.
Here what was expected was observed. The trace plots for the RWMH showed that the chain explored
the support of the target reasonably well, but a little slowly; the histograms showed that the sample
produced is a passable approximation of the target; and it took about 20 iterations to produce an
almost independent point. The reason for this is that αRWMH = 0.151, so there are fewer than 2000
distinct data points, and this somewhat unsatisfactory behaviour was therefore observed. Even in just
3 dimensions, RWMH found it difficult to propose 3 coordinates that would be accepted together with
high probability. αHMC = 0.986, which explains why faster mixing, a better approximation of the
target and only 4 iterations to an almost independent point were observed in the HMC sample. The
Hamiltonian dynamics allowed HMC to propose a multidimensional point via the leapfrog which was
accepted with probability close to 1. Therefore more data points covering the support of the target
were produced and a better approximation was obtained. There was nothing complicated or difficult
about sampling from 3 independent standard Normal distributions, but RWMH still performed poorly
and would take a very large number of samples to produce a good approximation of the target.
In order to observe how this extends to even larger dimensions, RWMH and HMC were run for 5000
iterations targeting standard multivariate Normal random variables of increasing dimensionality.
The same parameters as before were used, except that the RWMH proposal variance was set equal to
0.15 in order to attempt to improve its acceptance rate. Figure 10 below demonstrates how the mean
acceptance rates of the HMC (top line) and the RWMH (bottom line) changed as the dimension of the
target increased.
Figure 10: Plots presenting how the mean acceptance rate of the HMC (top line) and the RWMH (bottom line) change as the dimension of the target increases.
This demonstrated that the mean acceptance rates of both the HMC and RWMH algorithms decreased
as dimension increased. The mean acceptance rate of the HMC algorithm seemed to decrease linearly
with dimension. The reason that the mean acceptance rates for HMC were not all one is that the
discretisation, and the value of ε, caused a difference between the Hamiltonian at the current and
proposed states. The discretisation produced a difference in each dimension, and the total difference
was just the sum of these differences, so the acceptance probability decreased linearly as dimension
increased. It was less obvious at what rate the RWMH mean acceptance rate fell as the dimension
increased, but it was much faster than for HMC. This was further evidence of the difficulty RWMH
algorithms have in producing d independent proposals that are accepted together with high probability,
and demonstrated that HMC scales in a much more desirable way than RWMH.
Univariate Normal Mixtures
A common problem where standard MCMC methods often fail is sampling from multimodal
distributions, for example the mixture of two Normal distributions. The Markov chain structure
means that it is easy for the chain to become stuck in one of the modes and therefore not explore the
whole space. This becomes a problem when the structure of the target is not known: what appears
to be a well-mixing sample may only be exploring half the target. Techniques such as simulated
annealing and simulated tempering have been used to try to solve this problem, but here HMC was
tested. Neal [2011] remarks "HMC is no less (or more) vulnerable to problems with isolated modes
than other MCMC methods that make local changes to the state"; this statement was explored by
comparing HMC and RWMH on two different univariate Normal mixture examples.
Example 1
The first univariate Normal mixture example that was investigated is \pi_1(x) = \frac{1}{2}N(x; 0, 1^2) + \frac{1}{2}N(x; 5, 1^2).
Figure 11 below shows its density.
Figure 11: Density of \pi_1(x) = \frac{1}{2}N(x; 0, 1^2) + \frac{1}{2}N(x; 5, 1^2).
\pi_1(x) has 'significant' positive density between the modes; the smallest density between the 2 modes
occurs at x = 2.5, where the density is 0.018 compared with 0.20 at the modes x = 0 and x = 5.
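These density values are easy to verify directly; a short sketch (Python, with a function name of our own choosing) evaluating π1:

```python
import numpy as np

def pi1(x):
    # pi_1(x) = 0.5 N(x; 0, 1^2) + 0.5 N(x; 5, 1^2)
    phi = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * phi(x) + 0.5 * phi(x - 5.0)
```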
A RWMH algorithm with proposal variance 1 was run for 10000 iterations, counting 25 steps as an
iteration, and a HMC algorithm with L = 25, ε = 0.3 and M = I was run for 10000 iterations targeting
\pi_1(x). Figure 12 below shows trace plots, histograms and autocorrelation plots for both the RWMH
and HMC runs.
Figure 12: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm targeting \pi_1(x).
From the trace plots it was observed that both the RWMH and the HMC chain appeared to explore the
support of \pi_1 quite quickly. The RWMH trace plots showed excellent mixing, with very frequent jumps
between the two modes of the data. This was because the distance between the two modes was 5, only
5 times the RWMH proposal standard deviation, making it relatively easy for the RWMH to jump
between modes. The ability of the RWMH chain to do this resulted in the expected time to an
independent sample being relatively short, at 10 iterations. The HMC trace plots seemed to generally
show good mixing, but it was observed that jumps between modes were less frequent. This was because,
unlike the RWMH, HMC does not make jumps between points: it 'flows' between them. Thinking of
the area of low density between the modes as a hill (or high energy barrier), the particle in HMC has
to be given enough energy to get over that hill, whereas RWMH can simply jump over the area of low
density and so avoids this problem. This was demonstrated further by the autocorrelation plots, which
show that HMC took over 40 iterations to produce an independent sample. However, this area of low
density was not so low that HMC could never get over it (an example of this will be presented later),
so the HMC algorithm still explored the whole target. Finally, the histogram plots show how well the
samples were approximating the target. It was observed that despite slightly slower mixing, the HMC
algorithm produced the sample closer to the target; this was likely because of the acceptance
probabilities. αHMC = 0.994, meaning that the HMC sample contained almost 10000 distinct points,
whereas αRWMH = 0.396, so the RWMH sample had far fewer distinct points and is therefore a slightly
worse approximation of the target. The RWMH algorithm would need to be run for longer to produce
a closer approximation of the target.
Example 2
The second univariate Normal mixture that was examined is \pi_2(x) = \frac{1}{2}N(x; 0, 0.3^2) + \frac{1}{2}N(x; 3, 0.3^2).
Figure 13 below shows its density. This example had been chosen deliberately such that the modes
were quite close together but such that there was almost zero density in between.
Figure 13: Density of \pi_2(x) = \frac{1}{2}N(x; 0, 0.3^2) + \frac{1}{2}N(x; 3, 0.3^2).
The smallest density between the 2 modes of this distribution occurred at x = 1.5, where the density
is 0.000005 compared with 0.66 at the modes x = 0 and x = 3. Once again the RWMH and HMC
algorithms targeted \pi_2(x) with the same input parameters as before.
Figure 14: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm targeting \pi_2(x).
As in Example 1, the trace plots of the RWMH sample appeared to explore the target \pi_2 quite
well, though more slowly than in Example 1. The chain managed to jump between modes reasonably
frequently, but less often than in Example 1. This was because there was more density in the modes
and less between them, so the RWMH needed a bigger jump, which happens less often, to move between
modes. This was verified by the autocorrelation plot for the RWMH, where it took considerably longer
to achieve an almost independent point. On the other hand, the trace plot of the HMC sample did
not suggest that the chain had explored the whole target. Despite one random fluctuation, the chain
became stuck in the mode at x = 3 and never explored the other mode at x = 0. This was because
the density in between the modes was so low that it constituted such a large energy barrier
that there was never enough momentum to get over it. Different values for the parameters ε, L and
M were tried, but none of them produced mixing between the two modes. This was also backed up
by the autocorrelation plots, as it took well over 40 iterations to produce an independent point. In
this example the target was known and therefore this was easy to notice, but had the target not been
known, and because HMC mixed so well with a high acceptance rate αHMC = 0.914, the histogram and
trace plot would have made it appear as though the target were just a N(3, 0.3^2). The histogram of the
RWMH sample indicated a bimodal target, but the algorithm still did not produce a
very good approximation of it. This was likely because αRWMH = 0.141 was so low that
the sample produced contained fewer than 1500 distinct points. Running the sampler for longer should
produce a better approximation, though this may not be computationally feasible.
In conclusion, just by looking at these two simple examples it appeared Neal [2011] was right that
HMC is still susceptible to failures when sampling from multimodal distributions. In fact, provided
the proposal variance of the RWMH allows it to jump between modes, the RWMH appeared more
proficient at mixing between modes than the HMC algorithm, simply because the RWMH can jump
over regions of low density where the HMC cannot. However, the low acceptance probability produced
by the RWMH when sampling from multimodal distributions may mean that the algorithm needs
to be run for a lot longer in order to produce a suitable approximation of the target. If the modes are
such that there is significant positive density between them, then the high acceptance probability
produced by the HMC algorithm creates a better approximation of the target.
Mixture of Bivariate Normals
In order to test how HMC compared with RWMH, one final experiment was run implementing
both on a bivariate Gaussian mixture distribution; the exact mixture is given below in (11). It was
demonstrated previously that HMC copes with larger dimensions better but that RWMH has a stronger
ability to jump between the multiple modes of a target.
\begin{cases}
N\!\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}\right) & \text{with probability } 1/4 \\[1ex]
N\!\left(\begin{pmatrix} -1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}\right) & \text{with probability } 1/4 \\[1ex]
N\!\left(\begin{pmatrix} -1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}\right) & \text{with probability } 1/4 \\[1ex]
N\!\left(\begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}\right) & \text{with probability } 1/4
\end{cases} \quad (11)
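The mixture density in (11) can be evaluated directly; a sketch (Python, with a function name of our own choosing) might be:

```python
import numpy as np

def mixture_density(x):
    # equal-weight mixture of four bivariate Normals, each with
    # covariance 0.1 * I, centred at (+-1, +-1), as in (11)
    centres = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
    var = 0.1
    dens = 0.0
    for mu in centres:
        d2 = np.sum((x - np.array(mu, dtype=float)) ** 2)
        dens += 0.25 * np.exp(-d2 / (2.0 * var)) / (2.0 * np.pi * var)
    return dens
```

With variance 0.1, the density at the origin is tiny relative to the component centres, which is why a chain can sit in one component for many iterations.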
The RWMH and HMC algorithms were run for 200 iterations with 5 initial values on a circle of radius
2, to give a brief idea of how their chains would behave. The parameters for the RWMH sampler
were L = 25 and σ = 0.02, and the parameters for HMC were selected to be ε = 0.3, L = 25 and
M equal to two times the identity matrix. It was observed from Figure 15 that HMC did not jump
between mixture components as often as RWMH, suggesting RWMH better explores the state space.
However, for RWMH the rejection rate was high and it did not mix well within a component compared
to HMC. This is wholly unsurprising and simply reiterates the results found when mixtures and
bivariate targets were examined separately.
Figure 15: Traceplots of 5 RWMH and HMC chains targeting a mixture of 4 bivariate Gaussian distributions with initial values on a circle of radius 2. (a) RWMH, (b) HMC.
Further work and ideas
This paper has explained and implemented the HMC algorithm proposed in Neal [2011]. It has looked
at relatively straightforward examples of target distributions and compared the results of implementing
HMC with the results of implementing the RWMH.
Unfortunately, the length of time given to complete this project was not long enough to try out any
more complicated or innovative uses of HMC. However, this was given some thought and some
potential further work is as follows. Girolami and Calderhead [2011] suggested using a position-specific
mass matrix, and it was hoped that something as simple as using the Hessian of the negative log of
the target might lead to better exploration of the space in these simple Gaussian examples. Neal [2011]
suggested that it is possible to combine HMC with some other MCMC update. It was therefore felt that
some kind of hybrid algorithm, for example doing HMC updates but with a RWMH update every ith
iteration, may help to solve the problem caused by using HMC to sample from a multimodal target.
Bibliography
Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences.
Statistical Science, pages 457–472, 1992.
Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.
Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward
Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics,
21(6):1087–1092, 1953.
Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2,
2011.