
18.330 Lecture Notes:

Monte Carlo Integration

Homer Reid

March 20, 2014

Contents

1 Monte-Carlo integration
  1.1 Monte-Carlo integration
  1.2 Comparison to nested quadrature rules
  1.3 Applications of Monte-Carlo integration

2 A computational example

3 How it works: deriving the convergence rate of Monte-Carlo integration
  3.1 Random variables
  3.2 Mean, variance, standard deviation
  3.3 Sums and averages of random variables
  3.4 Functions of random variables
  3.5 Convergence rate of Monte-Carlo integration
  3.6 Importance sampling
  3.7 Generating random numbers according to a specified probability distribution

A Volume of the D-dimensional ball


1 Monte-Carlo integration

1.1 Monte-Carlo integration

Consider a scalar-valued function f(x) of a D-dimensional variable, and suppose we want to estimate the integral of f over some subregion R ⊂ ℝ^D. In Monte-Carlo integration we do this using the following extremely simple rule:

$$\int_R f(\mathbf{x})\, d\mathbf{x} \approx \frac{V}{N} \sum_{n=1}^{N} f(\mathbf{x}_n) \qquad (1)$$

where V is the volume of R, and where the x_n are a set of N randomly chosen points distributed uniformly throughout our region R.

It seems too good to be true that such an incredibly simple-minded procedure could possibly yield anything resembling decent numerical accuracy. But it does! If I is the exact value of the integral on the LHS of (1) and I_N is the N-sample Monte-Carlo approximation on the RHS, then we have the asymptotic convergence rate

$$|I - I_N| \propto \frac{1}{\sqrt{N}} \qquad (2)$$

This result is slightly tricky to prove, so we postpone the proof to Section 3. The most important thing about equation (2) is that it is independent of the dimension D. The error in Monte-Carlo integration decays with the square root of the number of function samples regardless of the dimension. This is the critical property that makes the method useful; it stands in marked contrast to the case of more pedestrian approaches to multidimensional integration, as we will now see.

1.2 Comparison to nested quadrature rules

Of course, if you know anything about numerical quadrature, you might be thinking that equation (2) is an appallingly slow convergence rate. Even the simplest, most brain-dead numerical quadrature algorithm, the rectangular rule, converges like 1/N, much faster than 1/√N, and better quadrature algorithms converge much more quickly. So why would we ever want to use something that achieves a lousy convergence rate like (2)?

The answer has to do with a phenomenon sometimes known as the curse of dimensionality. Consider rectangular-rule quadrature as an example. For a 1D integral over an interval [a, b] subdivided into N subintervals, we have to evaluate the function N_eval = N times and the error decays like E ∼ 1/N, as noted above. Now suppose we have a 2D integral of the form

$$\int_a^b dx_1 \int_c^d dx_2 \, f(x_1, x_2).$$


Suppose we evaluate the inner (x_2) integral using an N-point rectangular rule to obtain a function F(x_1), then integrate this function over x_1, again using an N-point rectangular rule, to compute the full integral. (Such a procedure is called nested quadrature.) The overall error again decays like E ∼ 1/N. But we have to evaluate the function N_eval = N² times, so now the convergence with respect to the number of function evaluations is only E ∼ 1/√(N_eval), much slower than the 1D case. More generally, if we evaluate a D-dimensional integral using nested rectangular-rule quadrature, the error decays like

$$\text{error in nested $D$-dimensional rectangular-rule quadrature} \sim \frac{1}{N_{\rm eval}^{1/D}}.$$

We see that already for D = 2 the simple Monte-Carlo formula (1) achieves asymptotic convergence equivalent to that of the rectangular rule, while for D > 2 Monte-Carlo is (asymptotically) better.

Of course, the rectangular rule is only the most naive numerical quadrature scheme. What if we use something more sophisticated like Simpson's rule? Well, now the error decreases like E ∼ 1/N⁴, where N is the number of function samples per dimension, but the total number of function samples grows like¹ N_eval ∼ (2N)^D, so we have

$$\text{error in nested $D$-dimensional Simpson's-rule quadrature} \sim \frac{1}{N_{\rm eval}^{4/D}},$$

which is equivalent to Monte-Carlo already for D = 8 and worse for dimensions D > 8.

The basic point is that repeated nesting of 1D quadrature schemes is a terrible way to evaluate high-dimensional integrals, because the number of function samples needed to achieve a given tolerance grows exponentially with the dimension (this is the "curse of dimensionality"). In some special cases (such as integration over special low-dimensional regions like triangles, spheres, or hypercubes) there are generalized quadrature schemes that do better, but for high-dimensional integrals in general Monte-Carlo integration is the only available option.
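To get a feel for these scalings, here is a small sketch (in Python, for self-containment; the scaling argument itself is language-independent) comparing the rough number of function samples each method needs to reach a target error of about 10⁻³. Constant factors are ignored, so the numbers indicate scaling only.

```python
import math

eps = 1e-3  # target error (order of magnitude only; constants ignored)

for D in (1, 2, 4, 10):
    # nested rectangular rule: E ~ Neval^(-1/D)  =>  Neval ~ eps^(-D)
    n_nested = (1 / eps) ** D
    # Monte-Carlo: E ~ Neval^(-1/2)  =>  Neval ~ eps^(-2), independent of D
    n_mc = (1 / eps) ** 2
    print(f"D = {D:2d}: nested rule needs ~{n_nested:.0e} samples, MC needs ~{n_mc:.0e}")
```

Already at D = 10 the nested rule would need on the order of 10³⁰ samples, while Monte-Carlo still needs on the order of 10⁶.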

1.3 Applications of Monte-Carlo integration

Computing the volume of complex high-dimensional shapes

What is the volume of intersection of a 12-dimensional sphere with a 5-dimensional cylinder? What is the electrostatic potential at the origin due to a constant charge density contained in a solid cubical region? Given two triangles T₁, T₂ in ℝ³, what is the volume V(R) of the 6-dimensional set of points {r₁, r₂} such that r₁ ∈ T₁, r₂ ∈ T₂, and |r₁ − r₂| = R?

¹ Recall that Simpson's rule requires 2 function evaluations per subinterval; this is the origin of the 2 in this formula.


Questions like these arise in computational geometry and partial differential equations and may generally be expressed as high-dimensional integrals. In some cases it may be possible to write out explicit limits of integration delimiting the region in question, in which case the integrals may be evaluated analytically; but in general such a calculation may not be possible, and even when possible it will generally be unwieldy.

On the other hand, it is almost always easy to write a characteristic function χ(x) which takes the value 1 for points inside the region in question and 0 otherwise; then the volume of the region is given simply by

$$V = \int_R \chi(\mathbf{x})\, d\mathbf{x}$$

where R is any simple region (for example, a hypercube) encompassing the region in question. Integrals of this type are easily evaluated using Monte-Carlo integration; see below for an example.

Path integrals in quantum mechanics and quantum field theory


2 A computational example

As an immediate example of Monte-Carlo integration, let's compute the volume of B_D, the D-dimensional unit ball. This is the set of all points in D-dimensional space that lie within unit distance of the origin:

$$B_D = \big\{\mathbf{x} \in \mathbb{R}^D : |\mathbf{x}| < 1\big\}$$

The characteristic function of B_D is

$$\chi(\mathbf{x}) = \begin{cases} 1, & \mathbf{x} \in B_D \\ 0, & \text{otherwise} \end{cases}$$

and, in a high-level language like julia, this function may be implemented in a single line:

    using LinearAlgebra   # for norm() (needed in Julia 0.7 and later)

    function chiBall(x)
        norm(x) < 1 ? 1.0 : 0.0
    end

Note that this implementation works for arbitrary dimensions (the dimension of the x argument is inferred from its length).

Given the characteristic function χ, the volume of B_D may be computed according to

$$V_D = \int_R \chi(\mathbf{x})\, d\mathbf{x}$$

where R is any region of ℝ^D containing the unit ball; for example, R could be all of ℝ^D, or could alternatively be the D-dimensional hypercube defined by {x : −1 ≤ x_i ≤ 1, i = 1, ..., D}.

Here’s a julia program that evaluates the Monte-Carlo integration formula(1) over a hypercubic region.

#

# MCIntegrate: integrate func over the hypercube with

# bounds { Lower[1..Dim], Upper[1..Dim]} using a total

# of N function samples

#

function MCIntegrate(func, Lower, Upper, N)

Lower=Lower[:]; # convert to column vectors

Upper=Upper[:];

Dim=length(Lower);

Delta = Upper-Lower;

Jacobian = prod(Delta); # volume of the hypercube

Sum=0.0;

for n=1:N

18.330 Lecture Notes 6

rv = rand(Dim); # random vector w values \in [0:1]

x = Lower + rv.*Delta; # random point in hypercube

Sum += func(x);

end

Jacobian * Sum / N;

end

To test this program on a simple example, we'll compute the volume of the three-dimensional ball, which is 4π/3 ≈ 4.189.

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.2352

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.0584

    julia> MCIntegrate( chiBall, [-1 -1 -1], [1 1 1], 10000)
    4.1448

Each time we call this routine, we obtain a sample of a random variable whose mean value is the integral we are trying to compute and whose standard deviation about that mean decreases like 1/√N. (These concepts are explained more fully in the following section.) To give you some graphical intuition for how the process works, the following plot shows the results of 100 calls to MCIntegrate, as above, for the two values N = 100 and N = 10000. The dashed line is the true value of the integral. As you can see, in both cases the process is approximating the true value of the integral, and increasing the number of function samples by 100× reduces the fluctuations (the error in our approximate evaluation of the integral) by 10×.

Figure 1: Results of 100 calls to MCIntegrate to compute the volume of the 3-dimensional ball using N = 100 and N = 10000 function samples. (Plot axes: value of integral versus number of MC integration runs; the dashed line marks the exact value.)
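The behavior in Figure 1 (100× more samples, 10× smaller fluctuations) is easy to reproduce numerically. The following sketch (written in Python so it is self-contained here; the function name mc_ball_volume is ours, not from the notes) repeats the N = 100 and N = 10000 experiments many times and compares the spreads of the two sets of estimates.

```python
import random
import statistics

def mc_ball_volume(N, rng):
    """Monte-Carlo estimate of the 3D unit-ball volume, sampling
    uniformly in the cube [-1,1]^3 (whose volume is 8)."""
    hits = sum(1 for _ in range(N)
               if sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(3)) < 1.0)
    return 8.0 * hits / N

rng = random.Random(0)
runs = 200
sd_small = statistics.pstdev([mc_ball_volume(100, rng) for _ in range(runs)])
sd_large = statistics.pstdev([mc_ball_volume(10000, rng) for _ in range(runs)])

# 100x more samples should shrink the fluctuations by about sqrt(100) = 10x
print(f"std at N=100:   {sd_small:.3f}")
print(f"std at N=10000: {sd_large:.3f}")
print(f"ratio: {sd_small / sd_large:.1f}   (expected roughly 10)")
```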


3 How it works: deriving the convergence rate of Monte-Carlo integration

To understand the convergence rate of Monte-Carlo integration, we first need to make a brief foray into the field of random variables.

3.1 Random variables

A good way to think about a random variable x is as a black box with a button on it. Each time we push the button, the black box spits out a number.²

Figure 2: Cartoon depiction of a random variable x as a black box with a button on it. Each time we hit the button, we get out a sample of x.

If we push the button N times and plot the values of the samples emitted, we might get something like this:

² Think of the little machine at the bank or the driver's-license office on which you push a button and get out a number telling you your position in the line of people waiting to see a clerk. One distinction is that in that case the numbers that emerge are integers emitted in ascending order, whereas with a random variable the numbers that emerge are typically real-valued and (hopefully!) not organized in any particular sequence.

Figure 3: Values of 300 samples of a random variable x, which in this case are uniformly distributed throughout [0, 1]. (Plot axes: value of nth sample versus sample index n.)

Suppose we segment the real line x ∈ (−∞, ∞) into buckets of width ∆ and ask, after N presses of the button in Figure 2, how many samples of x fall into the bucket between 7 and 7 + ∆. If we do this for larger and larger values of N, we will find that the fraction of the total number of samples falling into any one bucket tends to a constant times the width of the bucket:³

$$\lim_{N\to\infty} \frac{\#\text{ samples of } x \text{ falling in the interval } [7, 7+\Delta]}{N} = P(7)\,\Delta$$

More generally, we may ask for the fraction of samples falling into any interval [x, x + ∆], and the answer as N → ∞ would tend to P(x)∆, where P(x) is a number that depends on x. P(x) is called the probability density function or the probability distribution of the random variable. To be a suitable probability density function, P(x) must satisfy the conditions

$$P(x) \ge 0 \;\;\forall x \qquad \text{and} \qquad \int_{-\infty}^{\infty} P(x)\, dx = 1.$$

³ Strictly speaking this equation is only true in the limit ∆ → 0, but that would be too many limits to be considering all at once; for now just think of ∆ as a small width.


For the case pictured in Figure 3, we have

$$P(x) = \begin{cases} 1, & x \in [0, 1] \\ 0, & \text{otherwise} \end{cases}$$

which is known as a uniform distribution; we say that the random variable x is uniformly distributed in the interval [0, 1].

System-supplied random-number generators in computers, like the rand functions in matlab or julia and the drand48 function in the standard C library, typically produce random numbers uniformly distributed in the interval [0, 1]. Later we will discuss how to obtain random numbers distributed with other densities.

3.2 Mean, variance, standard deviation

The black dashed line in Figure 3 is the average value of all the samples of the random variable emitted from the black box. This is known as the mean value of the random variable. For a given probability distribution P(x), the mean may be computed according to

$$\text{mean} = \overline{x} = \int_{-\infty}^{\infty} x\, P(x)\, dx \equiv \big\langle x \big\rangle$$

(where the last expression defines some useful shorthand for integrating over probability distributions). For the probability distribution in Figure 3, we have

$$\overline{x} = \int_0^1 x\, dx = \frac{1}{2}$$

in accordance with our intuition.

The quantity that is key for understanding the convergence of Monte-Carlo integration is the variance σ_x², defined as

$$\text{variance} = \sigma_x^2 = \int_{-\infty}^{\infty} \big(x - \overline{x}\big)^2 P(x)\, dx = \big\langle (x - \overline{x})^2 \big\rangle$$

This quantity measures how much samples of x deviate from their mean value. The bigger the value of σ_x², the more the random variable is "spread out" or "fluctuating" about its mean.

Note that the specific quantity σ_x² is actually characterizing something like the square of the deviations about the mean value. In particular, if the random variable x has units, like say meters, then σ_x² has units of meters² and hence cannot be used directly to measure the spread of the quantity we are trying to characterize. Instead, the number that you want to have in mind to characterize the spread of values of a random variable is the square root of the variance, which is called the standard deviation:

$$\text{standard deviation} = \sigma_x = \sqrt{\sigma_x^2}.$$

For the uniformly distributed variable of Figure 3, we have

$$\sigma_x^2 = \int_0^1 \Big(x - \frac{1}{2}\Big)^2 dx = \frac{1}{12}$$

so the standard deviation is σ_x = √(1/12) ≈ 0.29. You should think of this as the half-width of the interval around the mean within which most of the fluctuations of the variable are contained.

3.3 Sums and averages of random variables

It is easy to obtain new random variables from old. For example, given a random variable x distributed according to some probability distribution P(x), we could define a new random variable y by summing two samples of x:

$$y = x + x.$$

As in Figure 2, the random variable y may be thought of as a machine with a button on it, which we can press however many times we like to generate samples of y. In this case, we can think of this machine as containing within it two copies of the machine of Figure 2. Hitting the button on the y machine amounts to hitting the buttons on the two x machines and summing their results.

Figure 4: A random variable y defined as the sum of two random variables x. Hitting the button on the y machine is like hitting the button on the x machine twice (or, equivalently, hitting the buttons on two identical x machines) and summing the results.


The very important fact about random variables defined as sums of random variables is this:

When we add a random variable to itself N times, its mean value increases by a factor of N, but its standard deviation increases by only a factor of √N.

Another way to state this is to consider a random variable defined as the average of N samples of another random variable (this just means we add the variable to itself N times and divide the result by N):

When we average N samples of a random variable, its mean value does not change, but its standard deviation decreases by a factor of √N.   (3)

This is easy to prove by going through some calculus manipulations similar to those we did in the previous section, but intuitively all you need to know can be grasped from the following plot, which is identical to Figure 3 except that here we are plotting samples of a random variable y₁₀ defined as the average of 10 samples of the random variable x:⁴

$$y_{10} \equiv \frac{1}{10}\big(x + x + x + x + x + x + x + x + x + x\big)$$

Applying the key result from above, we expect that the mean value and standard deviation of this variable will be

$$\overline{y_{10}} = \overline{x}, \qquad \sigma_{y_{10}} = \frac{1}{\sqrt{10}}\,\sigma_x.$$

⁴ To understand this variable, think of the cartoon of Figure 4, but with 10 copies of the x machine instead of just 2, and with a factor of 1/10 multiplying the result on its way out of the box.
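The calculus behind statement (3) is short enough to record here. For N independent samples x₁, ..., x_N of x, define y_N = (1/N)∑ xₙ; linearity of the mean and additivity of the variance for independent variables give:

```latex
\overline{y_N} = \frac{1}{N}\sum_{n=1}^{N} \overline{x} = \overline{x},
\qquad
\sigma_{y_N}^2 = \frac{1}{N^2}\sum_{n=1}^{N} \sigma_x^2 = \frac{\sigma_x^2}{N}
\;\;\Longrightarrow\;\;
\sigma_{y_N} = \frac{\sigma_x}{\sqrt{N}}.
```

The key assumption is independence of the samples: variances of independent random variables add, while the prefactor 1/N comes out of the variance squared.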

Figure 5: Values of 300 samples of a random variable $y_{10} = \frac{1}{10}\sum_{n=1}^{10} x$ defined by averaging 10 copies of a random variable x, where x is uniformly distributed in the interval [0, 1] as in Figure 3. Note that y₁₀ has the same mean as the original x, but the amplitude of its fluctuations about that mean (its standard deviation) is √10 ≈ 3 times smaller than that of x (compare Figure 3).

By comparing Figures 3 and 5, it's easy to see that by averaging 10 samples of x we have obtained a new random variable whose mean is the same as that of x, but whose fluctuations about that mean are reduced by a factor of √10 ≈ 3.

3.4 Functions of random variables

Similarly, we could define a new random variable z as the result of operating on the random variable x with some function f(x):

$$z = f(x).$$

This yields a random variable whose cartoon depiction looks something like this:


Figure 6: A random variable z defined as the operation of a function f(x) on a random variable x. Hitting the button on the z machine is like hitting the button on the x machine and feeding the result into the function f(x).

It’s easy to compute the mean and variance of z:

z =

∫ ∞−∞

f(x)P (x) dx (4a)

σ2z =

∫ ∞−∞

[f(x)− z

]2P (x) dx (4b)

These are quantities that depend on P (x) and f(x), but not on anything else.
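As a concrete check of (4a) and (4b), take x uniform on [0, 1] and f(x) = x². Then z̄ = ∫₀¹ x² dx = 1/3 and σ_z² = ∫₀¹ (x² − 1/3)² dx = 4/45 ≈ 0.089. A Python sketch comparing these to sampled values (standard library only; the choice of f is ours, for illustration):

```python
import random
import statistics

random.seed(0)
# z = f(x) with f(x) = x^2 and x uniform on [0, 1]
zs = [random.random() ** 2 for _ in range(200_000)]

print(f"mean of z:     {statistics.mean(zs):.3f}   (exact 1/3  ~= 0.333)")
print(f"variance of z: {statistics.pvariance(zs):.3f}   (exact 4/45 ~= 0.089)")
```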

3.5 Convergence rate of Monte-Carlo integration

We can now assemble the insights of the previous sections to understand the convergence rate of Monte-Carlo integration. Consider a scalar function of D variables, f(x). We will consider the evaluation of

$$I \equiv \frac{1}{V} \int_R f(\mathbf{x})\, d\mathbf{x} \qquad (5)$$

where R is some subregion of ℝ^D and V is the volume of R. I is just the average value of f over the region R. (If you want to compute the integral of f, not its average value, then just multiply I by V.)

Let x be a D-dimensional vector of random variables distributed uniformly throughout the region R. This means that the probability distribution function P(x) is constant inside R and zero everywhere else:

$$P(\mathbf{x}) = \begin{cases} \frac{1}{V}, & \mathbf{x} \in R \\ 0, & \text{otherwise.} \end{cases}$$

Given this fact, we can rewrite the integral we are trying to evaluate, equation (5), in the form

$$I = \int_{\mathbb{R}^D} f(\mathbf{x})\, P(\mathbf{x})\, d\mathbf{x} \qquad (6)$$


where now the integral extends over all of ℝ^D.

But now compare equation (6) to equation (4a). We see that the quantity we are trying to compute is the mean value of a random variable

$$\mathcal{I} \equiv f(\mathbf{x}),$$

where x is distributed according to P(x). The mean value of this random variable, by (4a), is

$$\overline{\mathcal{I}} = I.$$

The variance of this random variable is given by (4b):

$$\sigma_{\mathcal{I}}^2 \equiv \int_{\mathbb{R}^D} \big[f(\mathbf{x}) - I\big]^2 P(\mathbf{x})\, d\mathbf{x}$$

Of course, we don't know how to compute σ_𝓘², but the point is that it exists and is just some number that depends on the function f and the region R (which is what defines P(x)).

Finally, consider defining a new random variable I_N by averaging N samples of 𝓘:

$$I_N \equiv \frac{1}{N}\sum_{n=1}^{N} \mathcal{I}_n = \frac{1}{N}\sum_{n=1}^{N} f(\mathbf{x}_n)$$

where the 𝓘_n = f(x_n) are independent samples of 𝓘. Note that this is just the prescription we gave in equation (1) for Monte-Carlo integration, although we are here interpreting it as the definition of a random variable.

Invoking the general principle of equation (3), we expect that the mean value and standard deviation of I_N will be

$$\overline{I_N} = I, \qquad \sigma_{I_N} = \frac{1}{\sqrt{N}}\,\sigma_{\mathcal{I}}$$

where, again, σ_𝓘 is some number that depends on f and R but not on N. The mean value of I_N is the quantity we are trying to compute, and its standard deviation decreases like 1/√N.

Thus, when we use Monte-Carlo integration with N function samples to estimate an integral, we are evaluating a single sample of a random variable whose mean value is the integral we are trying to compute and whose standard deviation decreases like 1/√N. This explains Figure 1.

3.6 Importance sampling

In some cases we may be trying to integrate a function g(x) that may be decomposed into a product of factors g(x) = f(x)P(x), where P(x) satisfies the conditions of a probability density, i.e. P(x) ≥ 0 and ∫P(x) dx = 1. In this case, referring back to equation (4a), we interpret ∫g(x) dx as the mean value of a random variable 𝓘 = f(x), where x is a random variable distributed according to a nonuniform probability distribution P(x):

if g(x) = f(x)P(x) with P(x) ≥ 0 and ∫P(x) dx = 1, then

$$\int_{\mathbb{R}^D} g(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N}\sum_{n=1}^{N} f(\mathbf{x}_n) \qquad (7)$$

where the x_n are samples of a random variable with probability distribution P(x). This technique is called importance sampling. For functions g(x) that may be decomposed in this way it is much better to use (7) than the default Monte-Carlo rule with uniformly distributed evaluation points x, because the importance-sampled version will more effectively sample the regions of ℝ^D that contribute most to the integral.

Of course, since computer random-number generators typically produce samples of uniformly-distributed random variables, the question arises of how to generate samples of random variables distributed with non-uniform densities. We take up this question in the next section.


3.7 Generating random numbers according to a specified probability distribution

Next suppose we want to compute a sequence of random numbers {y_n} that are distributed according to some non-uniform probability distribution P(y). The general idea will be to compute a sequence of uniformly distributed random numbers {x_n} and then define y_n to be f(x_n), where f(x) is some function. Let's determine the relationship between f(x) and P(y).

Suppose we compute some large number of samples N. The number of x points falling within an interval [x, x + ∆x] is approximately N∆x. All of these points are mapped by our procedure into the interval [y, y + ∆y] = [f(x), f(x + ∆x)]. This latter interval has width ∆y = ∆x|f′(x)| (the absolute value arises because f(x + ∆x) may be less than f(x), but we still want to define the width of the interval to be a positive number). Thus, if we are trying to define the probability density P(y) such that the number of sample points falling in an interval [y, y + ∆y] is NP(y)∆y, we should say

$$N P(y)\,\Delta y = N\,\Delta x$$

or, using y = f(x) and ∆y = |f′(x)|∆x,

$$|f'(x)| = \frac{1}{P(f(x))}$$

This is a differential equation for the function f(x). For example, suppose we want to generate points y with distribution P(y) = e^(−y). The differential equation reads

$$|f'| = \frac{1}{e^{-f}} = e^{f}$$

with solution f(x) = −log x. What this means is this: if {x_n} is uniformly distributed in [0, 1] and we define y_n = −log(x_n), then y_n is distributed in [0, ∞) with probability density P(y) = e^(−y).
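A quick numerical check of this recipe (Python sketch, standard library only): if the y_n = −log x_n really are exponentially distributed, their mean should be ∫₀^∞ y e^(−y) dy = 1 and the fraction below 1 should be ∫₀¹ e^(−y) dy = 1 − e^(−1) ≈ 0.632.

```python
import math
import random

random.seed(0)
N = 200_000

# transform uniform samples x_n in [0,1] to y_n = -log(x_n)
ys = [-math.log(random.random()) for _ in range(N)]

mean = sum(ys) / N
frac_below_1 = sum(1 for y in ys if y <= 1.0) / N

print(f"mean of y:        {mean:.3f}   (exact 1)")
print(f"fraction y <= 1:  {frac_below_1:.3f}   (exact 1 - 1/e ~= 0.632)")
```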


A Volume of the D-dimensional ball

The D-dimensional ball B_D is the set of points in ℝ^D that lie within unit distance of the origin. Let V_D be the D-dimensional volume⁵ of B_D. From elementary geometry we know

V₁ = 2 (length of the line segment [−1, 1])

V₂ = π (area of the unit circle, πr² with r = 1)

V₃ = 4π/3 (volume of the unit sphere, (4/3)πr³ with r = 1)

but how do we extend this table to higher D? Earlier in these notes we discussed how to do this using Monte-Carlo integration. Here we'll discuss how to do the calculation analytically.⁶

One way is to write

$$V_D = \int_{|\mathbf{x}| < 1} d^D\mathbf{x} \qquad (8)$$

and evaluate the integral in polar coordinates. To get a sense of how to do this, let the cartesian components of a point x ∈ ℝ^D be {x₁, ..., x_D} and recall that two-dimensional polar coordinates are defined by

$$x_1^{\rm 2D} = r\sin\theta_1 \qquad (9a)$$
$$x_2^{\rm 2D} = r\cos\theta_1 \qquad (9b)$$

while in three dimensions we have⁷

$$x_1^{\rm 3D} = r\sin\theta_1\sin\theta_2 \qquad (10a)$$
$$x_2^{\rm 3D} = r\sin\theta_1\cos\theta_2 \qquad (10b)$$
$$x_3^{\rm 3D} = r\cos\theta_1 \qquad (10c)$$

Comparing (9) and (10), we see that the transition is effected by introducing one new angle (θ₂) and bifurcating the coordinate x₁^(2D) into two new coordinates x₁^(3D) and x₂^(3D) defined by

$$x_1^{\rm 3D} = x_1^{\rm 2D}\sin\theta_2, \qquad x_2^{\rm 3D} = x_1^{\rm 2D}\cos\theta_2.$$

This procedure may be repeated inductively: we just keep splitting up whichever coordinate is "all sines" into two new coordinates of which one is "all sines" and

⁵ In colloquial terms, the D-dimensional volume V_D of a D-dimensional set is, for the cases D = 1, 2, 3, just what we think of as the length, the area, and the volume of a 1-, 2-, or 3-dimensional shape. The generalization to higher D is conceptually straightforward, if somewhat difficult to visualize. If we are measuring distances in, say, meters, then V_D has units of (meters)^D.

⁶ We emphasize that this is the rare example of a high-dimensional region whose volume can be computed analytically; in most cases, something like Monte-Carlo integration is the only way to proceed.

⁷ Actually the variable we call x₁^(3D) here is what we usually call y, while what we call x₂^(3D) is what we usually call x: we have performed this swap to improve the logical presentation of the formulas.


the other is "all sines except for one cosine." For example, polar coordinates for the 7-dimensional sphere are

$$\begin{aligned}
x_1^{\rm 7D} &= r\sin\theta_1\sin\theta_2\sin\theta_3\sin\theta_4\sin\theta_5\sin\theta_6\\
x_2^{\rm 7D} &= r\sin\theta_1\sin\theta_2\sin\theta_3\sin\theta_4\sin\theta_5\cos\theta_6\\
x_3^{\rm 7D} &= r\sin\theta_1\sin\theta_2\sin\theta_3\sin\theta_4\cos\theta_5\\
x_4^{\rm 7D} &= r\sin\theta_1\sin\theta_2\sin\theta_3\cos\theta_4\\
x_5^{\rm 7D} &= r\sin\theta_1\sin\theta_2\cos\theta_3\\
x_6^{\rm 7D} &= r\sin\theta_1\cos\theta_2\\
x_7^{\rm 7D} &= r\cos\theta_1
\end{aligned}$$

The Jacobian of the transition from Cartesian to polar coordinates is

$$dx_1 \cdots dx_D = r^{D-1}\,\sin^{D-2}\theta_1\,\sin^{D-3}\theta_2\cdots\sin\theta_{D-2}\;dr\,d\theta_1\cdots d\theta_{D-1}$$

and the integral (8) splits up into a product of D factors:

$$V_D = \underbrace{\int_0^1 r^{D-1}\,dr}_{\frac{1}{D}}\;\underbrace{\int_0^\pi \sin^{D-2}\theta_1\,d\theta_1}_{\sqrt{\pi}\,\frac{\Gamma[(D-1)/2]}{\Gamma[D/2]}}\;\cdots\;\underbrace{\int_0^\pi \sin\theta_{D-2}\,d\theta_{D-2}}_{2}\;\underbrace{\int_0^{2\pi} d\theta_{D-1}}_{2\pi}$$

Integrals over powers of sin factors may be evaluated using the Γ function. Working out the general case, we obtain the closed-form analytical expression

$$V_D = \frac{\pi^{D/2}}{\Gamma\big(\frac{D}{2} + 1\big)}.$$

The Γ function here may be evaluated to yield more explicit formulas which differ depending on the parity of D:

$$V_{2D} = \frac{\pi^D}{D!}, \qquad V_{2D+1} = \frac{2\,(2\pi)^D}{(2D+1)!!}$$

