Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus Notes, Lecture 1
Last modified September 12, 2004

1 Overture

1.1. Introduction: The term stochastic means “random”. Because it usually occurs together with “process” (stochastic process), it makes people think of something that changes in a random way over time. The term calculus refers to ways to calculate things or find things that can be calculated (e.g. derivatives in the differential calculus). Stochastic calculus is the study of stochastic processes through a collection of powerful ways to calculate things. Whenever we have a question about the behavior of a stochastic process, we will try to find an expected value or probability that we can calculate that answers our question.

1.2. Organization: We start in the discrete setting in which there is a finite or countable (definitions below) set of possible outcomes. The tools are summations and matrix multiplication. The main concepts can be displayed clearly and concretely in this setting. We then move to continuous processes in continuous time where things are calculated using integrals, either ordinary integrals in R^n or abstract integrals in probability space. It is impossible (and beside the point if it were possible) to treat these matters with full mathematical rigor in these notes. The reader should get enough to distinguish mathematical right from wrong in cases that occur in practical applications.

1.3. Backward and forward equations: Backward equations and forward equations are perhaps the most useful tools for getting information about stochastic processes. Roughly speaking, there is some number, f, that we want to know. For example f could be the expected value of a portfolio after following a proposed trading strategy. Rather than compute f directly, we define an array of related expected values, f(x, t). The tower property implies relationships, backward equations or forward equations, among these values that allow us to compute some of them in terms of others. Proceeding from the few known values (initial conditions and boundary conditions), we eventually find the f we first wanted. For discrete time and space, the equations are matrix equations or recurrence relations. For continuous time and space, they are partial differential equations of diffusion type.

1.4. Diffusions and Ito calculus: The Ito calculus is a tool for studying continuous stochastic processes in continuous time. If X(t) is a differentiable function of time, then ∆X = X(t + ∆t) − X(t) is of the order of¹ ∆t. Therefore ∆f(X(t)) = f(X(t + ∆t)) − f(X(t)) ≈ f′∆X to this accuracy. For an Ito process, ∆X is of the order of √∆t, so ∆f ≈ f′∆X + (1/2)f′′∆X^2 has an error smaller than ∆t. In the special case where X(t) is Brownian motion, it is often permissible (and the basis of the Ito calculus) to replace ∆X^2 by its mean value, ∆t.

¹This means that there is a C so that |X(t + ∆t) − X(t)| ≤ C |∆t| for small ∆t.

2 Discrete probability

Here are some basic definitions and ideas of probability. These might seem dry without examples. Be patient. Examples are coming in later sections. Although the topic is elementary, the notation is taken from more advanced probability so some of it might be unfamiliar. The terminology is not always helpful for simple problems but it is just the thing for describing stochastic processes and decision problems under incomplete information.

2.1. Probability space: Do an “experiment” or “trial”, get an “outcome”, ω. The set of all possible outcomes is Ω, which is the probability space. The Ω is discrete if it is finite or countable (able to be listed in a single infinite numbered list). The outcome ω is often called a random variable. I avoid that term because I (and most other people) want to call functions X(ω) random variables, see below.

2.2. Probability: The probability of a specific outcome is P(ω). We always assume that P(ω) ≥ 0 for any ω ∈ Ω and that ∑_{ω∈Ω} P(ω) = 1. The interpretation of probability is a matter for philosophers, but we might say that P(ω) is the probability of outcome ω happening, or the fraction of times event ω would happen in a large number of independent trials. The philosophical problem is that it may be impossible actually to perform a large number of independent trials. People also sometimes say that probabilities represent our often subjective (lack of) knowledge of future events. Probability 1 means something that is certain to happen while probability 0 is for something that cannot happen. “Probability zero ⇒ impossible” is only strictly true for discrete probability.

2.3. Event: An event is a set of outcomes, a subset of Ω. The probability of an event is the sum of the probabilities of the outcomes that make up the event

P(A) = ∑_{ω∈A} P(ω) .   (1)

Usually, we specify an event in some way other than listing all the outcomes in it (see below). We do not distinguish between the outcome ω and the event that that outcome occurred, A = {ω}. That is, we write P(ω) for P({ω}) or vice versa. This is called “abuse of notation”: we use notation in a way that is not absolutely correct but whose meaning is clear. It’s the mathematical version of saying “I could care less” to mean the opposite.

2.4. Countable and uncountable (technical detail): A probability space (or any set) that is not countable is called “uncountable”. This distinction was formalized by the late nineteenth century mathematician Georg Cantor, who showed that the set of (real) numbers in the interval [0, 1] is not countable. Under the uniform probability density, P(ω) = 0 for any ω ∈ [0, 1]. It is hard to imagine that the probability formula (1) is useful in this case, since every term in the sum is zero. The difference between continuous and discrete probability is the difference between integrals and sums.

2.5. Example: Toss a coin 4 times. Each toss yields either H (heads) or T (tails). There are 16 possible outcomes, TTTT, TTTH, TTHT, TTHH, THTT, . . ., HHHH. The number of outcomes is #(Ω) = |Ω| = 16. We suppose that each outcome is equally likely, so P(ω) = 1/16 for each ω ∈ Ω. If A is the event that the first two tosses are H, then

A = {HHHH, HHHT, HHTH, HHTT} .

There are 4 elements (outcomes) in A, each having probability 1/16. Therefore

P(first two H) = P(A) = ∑_{ω∈A} P(ω) = ∑_{ω∈A} 1/16 = 4/16 = 1/4 .
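For a sample space this small, such calculations can be checked by direct enumeration. Here is a minimal Python sketch of formula (1) applied to the event A (the names omega, p, and A are just for illustration):

from itertools import product

# All 16 outcomes of 4 coin tosses, each outcome a string like "HHTT".
omega = ["".join(toss) for toss in product("HT", repeat=4)]
p = {w: 1.0 / len(omega) for w in omega}   # uniform probability P(w) = 1/16

# Event A: the first two tosses are H.
A = [w for w in omega if w.startswith("HH")]

# P(A) = sum of P(w) over outcomes in A, formula (1).
print(sum(p[w] for w in A))                # 0.25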

2.6. Set operations: Events are sets, so set operations apply to events. If A and B are events, the event “A and B” is the set of outcomes in both A and B. This is the set intersection A ∩ B, because the outcomes that make both A and B happen are those that are in both events. The union A ∪ B is the set of outcomes in A or in B (or in both). The complement of A, A^c, is the event “not A”, the set of outcomes not in A. The empty event is the empty set, the set with no elements, ∅. The probability of ∅ should be zero because the sum that defines it has no terms: P(∅) = 0. The complement of ∅ is Ω. Events A and B are disjoint if A ∩ B = ∅. Event A is contained in event B, A ⊆ B, if every outcome in A is also in B. For example, if the event A is as above and B is the event that the first toss is H, then A ⊆ B.

2.7. Basic facts: Each of these facts is a consequence of the representation P(A) = ∑_{ω∈A} P(ω). First, P(A) ≤ P(B) if A ⊆ B. Also, P(A) + P(B) = P(A ∪ B) if P(A ∩ B) = 0, but not otherwise. If P(ω) ≠ 0 for all ω ∈ Ω, then P(A ∩ B) = 0 only when A and B are disjoint. Clearly, P(A) + P(A^c) = P(Ω) = 1.

2.8. Conditional probability: The probability of outcome A given that B has occurred is the conditional probability (read “the probability of A given B”)

P(A | B) = P(A ∩ B) / P(B) .   (2)

This is the fraction of B outcomes that are also A outcomes. The formula is called Bayes’ rule. It is often used to calculate P(A ∩ B) once we know P(B) and P(A | B). The formula for that is P(A ∩ B) = P(A | B)P(B).


2.9. Independence: Events A and B are independent if P(A | B) = P(A). That is, knowing whether or not B occurred does not change the probability of A. In view of Bayes’ rule, this is expressed as

P(A ∩ B) = P(A) · P(B) .   (3)

For example, suppose A is the event that two of the four tosses are H, and B is the event that the first toss is H. Then A has 6 elements (outcomes), B has 8, and, as you can check by listing them, A ∩ B has 3 elements. Since each element has probability 1/16, this gives P(A ∩ B) = 3/16 while P(A) = 6/16 and P(B) = 8/16 = 1/2. We might say “duh” for the last calculation since we started the example with the hypothesis that H and T were equally likely. Anyway, this shows that (3) is indeed satisfied in this case. This example is supposed to show that while some pairs of events, such as the first and second tosses, are “obviously” independent, others are independent as the result of a calculation. Note that if C is the event that 3 of the 4 tosses are H (instead of 2 for A), then P(C) = 4/16 = 1/4 and P(B ∩ C) = 3/16, because

B ∩ C = {HHHT, HHTH, HTHH}

has three elements. Bayes’ rule (2) gives P(B | C) = (3/16)/(4/16) = 3/4. Knowing that there are 3 heads in all raises the probability that the first toss is H from 1/2 to 3/4.
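The same enumeration verifies the independence calculation; a short Python check of (3) for these particular A and B:

from itertools import product

omega = ["".join(t) for t in product("HT", repeat=4)]
p = 1.0 / len(omega)

A = [w for w in omega if w.count("H") == 2]   # two of the four tosses are H
B = [w for w in omega if w[0] == "H"]         # first toss is H
AB = [w for w in A if w[0] == "H"]            # A and B

print(len(A) * p, len(B) * p, len(AB) * p)    # 0.375 0.5 0.1875
print(abs(len(AB) * p - len(A) * p * len(B) * p) < 1e-12)   # True: (3) holds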

2.10. Working with conditional probability: Let us fix the event B, and discuss the conditional probability P̂(ω) = P(ω | B), which also is a probability (assuming P(B) > 0). There are two slightly different ways to discuss P̂. One way is to take B to be the probability space and define

P̂(ω) = P(ω)/P(B)

for all ω ∈ B. Since B is the probability space for P̂, we do not have to define P̂ for ω ∉ B. This P̂ is a probability because P̂(ω) ≥ 0 for all ω ∈ B and ∑_{ω∈B} P̂(ω) = 1. The other way is to keep Ω as the probability space and set the conditional probabilities to zero for ω ∉ B. If we know the event B happened, then the probability of an outcome not in B is zero.

P(ω | B) = { P(ω)/P(B)  for ω ∈ B,
             0           for ω ∉ B.      (4)

Either way, we restrict to outcomes in B and “renormalize” the probabilities by dividing by P(B) so that they again sum to one. Note that (4) is just the general conditional probability formula (2) applied to the event A = {ω}.

We can condition a second time by conditioning P̂ on another event, C. It seems natural that P̂(ω | C), which is the conditional probability of ω given that C occurred given that B occurred, should be the probability of ω given that both B and C occurred. Bayes’ rule verifies this intuition:

P̂(ω | C) = P̂(ω)/P̂(C)
          = P(ω | B) / P(C | B)
          = (P(ω)/P(B)) / (P(C ∩ B)/P(B))
          = P(ω)/P(B ∩ C)
          = P(ω | B ∩ C) .

The conclusion is that conditioning on B and then on C is the same as conditioning on B ∩ C (B and C) all at once. This tower property underlies the many recurrence relations that allow us to get answers in practical situations.

2.11. Algebra of sets and incomplete information: A set of events, F, is an algebra if

i: A ∈ F implies that A^c ∈ F.

ii: A ∈ F and B ∈ F implies that A ∪ B ∈ F and A ∩ B ∈ F.

iii: Ω ∈ F and ∅ ∈ F.

We interpret F as representing a state of partial information. We know whether any of the events in F occurred but we do not have enough information to determine whether an event not in F occurred. The above axioms are natural in light of this interpretation. If we know whether A happened, we surely know whether “not A” happened. If we know whether A happened and whether B happened, then we can tell whether “A and B” happened. We definitely know whether ∅ happened (it did not) and whether Ω happened (it did). Events in F are called measurable or determined in F.

2.12. Example 1 of an F: Suppose we learn the outcomes of the first two tosses. One event measurable in F is (with some abuse of notation)

{HH} = {HHHH, HHHT, HHTH, HHTT} .

An example of an event not determined by this F is the event of no more than one H:

A = {TTTT, TTTH, TTHT, THTT, HTTT} .

Knowing just the first two tosses does not tell you with certainty whether the total number of heads is less than two.


2.13. Example 2 of an F: Suppose we know only the results of the tosses but not the order. This might happen if we toss 4 identical coins at the same time. In this case, we know only the number of H coins. Some measurable sets are (with an abuse of notation)

{4} = {HHHH}
{3} = {HHHT, HHTH, HTHH, THHH}
...
{0} = {TTTT}

The event {2} has 6 outcomes (list them), so its probability is 6 · 1/16 = 3/8. There are other events measurable in this algebra, such as “less than 3 H”, but, in some sense, the events listed generate the algebra.

2.14. σ-algebra: An algebra of sets is a σ-algebra (pronounced “sigma algebra”) if it is closed under countable intersections, which means the following. Suppose An ∈ F is a countable family of events measurable in F, and A = ∩_n An is the set of outcomes in all of the An; then A ∈ F, too. The reader can check that an algebra closed under countable intersections is also closed under countable unions, and conversely. An algebra is automatically a σ-algebra if Ω is finite. If Ω is infinite, an algebra might or might not be a σ-algebra.² In a σ-algebra, it is possible to take limits of infinite sequences of events, just as it is possible to take limits of sequences of real numbers. We will never (again) refer to an algebra of events that is not a σ-algebra.

2.15. Terminology: What we call “outcome” is usually called “random variable”. I did not use this terminology because it can be confusing, in that we often think of “variables” as real (or complex) numbers. A “real valued function” of the random variable ω is a real number X for each ω, written X(ω). The most common abuse of notation in probability is to write X instead of X(ω). We will do this most of the time, but not just yet. We often think of X as a random number whose value is determined by the outcome (random variable) ω. A common convention is to use upper case letters for random numbers and lower case letters for specific values of that variable. For example, the “cumulative distribution function” (CDF), F(x), is the probability that X ≤ x, that is: F(x) = ∑_{X(ω)≤x} P(ω).

2.16. Informal event terminology: We often describe events in words. For example, we might write P(X ≤ x) where, strictly, we might be supposed to say Ax = {ω | X(ω) ≤ x}, then P(X ≤ x) = P(Ax). For example, if there are two functions, X1 and X2, we might try to calculate the probability that they are equal, P(X1 = X2). Strictly speaking, this is the probability of the set of ω so that X1(ω) = X2(ω).

²Let Ω be the set of integers and A ∈ F if A is finite or A^c is finite. This F is an algebra (check), but not a σ-algebra. For example, if An leaves out only the first n odd integers, then A is the set of even integers, and neither A nor A^c is finite.

2.17. Measurable: A function (of a random variable) X(ω) is measurable with respect to the algebra F if the value of X is completely determined by the information in F. To give a mathematical definition, for any number, x, we can consider the event that X = x, which is Bx = {ω : X(ω) = x}. In discrete probability, Bx will be the empty set for almost all x values and will not be empty only for those values of x actually taken by X(ω) for one of the outcomes ω. The function X(ω) is “measurable with respect to F” if the sets Bx are all measurable. People often write X ∈ F (an abuse of notation) to indicate that X is measurable with respect to F. In Example 2 above, the function X = number of H minus number of T is measurable, while the function X = number of T before the first H is not (find an x with Bx ∉ F to show this).

2.18. Generating an algebra of sets: Suppose there are events A1, . . ., Ak that you know. The algebra, F, generated by these sets is the algebra that expresses the information about the outcome you gain by knowing these events. One definition of F is that an event A is in F if A can be expressed in terms of the known events Aj using the set operations intersection, union, and complement a number of times. For example, we could define an event A by saying “ω is in A1 and (A2 or A3) but not in A4 or A5”, which would be written A = (A1 ∩ (A2 ∪ A3)) ∩ (A4 ∪ A5)^c. This is the same as saying that F is the smallest algebra of sets that contains the known events Aj. Obviously (think about this!) any algebra that contains the Aj contains any event described by set operations on the Aj; that is the definition of algebra of sets. Also the sets defined by set operations on the Aj form an algebra of sets. For example, if A1 is the event that the first toss is H and A2 is the event that both the first two are H, then A1 and A2 generate the algebra of events determined by knowing the results of the first two tosses. This is Example 1 above. To generate a σ-algebra, we may have to allow infinitely many set operations, but a precise discussion of this would be “off message”.

2.19. Generating by a function: A function X(ω) defines an algebra of sets generated by the sets Bx. This is the smallest algebra, F, so that X is measurable with respect to F. Example 2 above has this form. We can think of F as being the algebra of sets defined by statements about the values of X(ω). For example, one A ∈ F would be the set of ω with X either between 1 and 3 or greater than 4.

We write FX for the algebra of sets generated by X and ask what it means that another function of ω, Y(ω), is measurable with respect to FX. The information interpretation of FX says that Y ∈ FX if knowing the value of X(ω) determines the value of Y(ω). This means that if ω1 and ω2 have the same X value (X(ω1) = X(ω2)) then they also have the same Y value. Said another way, if Bx is not empty, then there is some number, u(x), so that Y(ω) = u(x) for every ω ∈ Bx. This means that Y(ω) = u(X(ω)) for all ω ∈ Ω. Altogether, saying Y ∈ FX is a fancy way of saying that Y is a function of X. Of course, u(x) only needs to be defined for those values of x actually taken by the random variable X.

For example, if X is the number of H in 4 tosses, and Y is the number of H minus the number of T, then, for any 4 tosses, ω, Y(ω) = 2X(ω) − 4. That is, u(x) = 2x − 4.

2.20. Equivalence relation: A σ-algebra, F, determines an equivalence relation. Outcomes ω1 and ω2 are equivalent, written ω1 ∼ ω2, if the information in F does not distinguish ω1 from ω2. More formally, ω1 ∼ ω2 if ω1 ∈ A ⇒ ω2 ∈ A for every A ∈ F. For example, in Example 2 above, THTT ∼ TTTH. Because F is an algebra, ω1 ∼ ω2 also implies that ω1 ∉ A ⇒ ω2 ∉ A (think this through). Note that it is possible that Aω = Aω′ while ω ≠ ω′. This happens when ω ∼ ω′.

The equivalence class of outcome ω is the set of outcomes equivalent to ω in F, indistinguishable from ω using the information available in F. If Aω is the equivalence class of ω, then Aω ∈ F. (Proof: for any ω′ not equivalent to ω in F, there is at least one Bω′ ∈ F with ω ∈ Bω′ but ω′ ∉ Bω′. Since there are (at most) countably many ω′, and F is a σ-algebra, Aω = ∩_{ω′} Bω′ ∈ F. This Aω contains every ω1 that is equivalent to ω (why?) and only those.) In Example 2, the equivalence class of THTT is the event {HTTT, THTT, TTHT, TTTH}.

2.21. Partition: A partition of Ω is a collection of events, P = {B1, B2, . . .}, so that every outcome ω ∈ Ω is in exactly one of the events Bk. The σ-algebra generated by P, which we call FP, consists of events that are unions of events in P (Why are complements and intersections not needed?). For any partition P, the equivalence classes of FP are the events in P (think this through). Conversely, if P is the partition of Ω into equivalence classes for F, then P generates F. In Example 2 above, the sets Bk = {k} form the partition corresponding to F. More generally, the sets Bx = {ω | X(ω) = x} that are not empty are the partition corresponding to FX. In discrete probability, partitions are a convenient way to understand conditional expectation (below). The information in FP is the knowledge of which of the Bj happened. The remaining uncertainty is which of the ω ∈ Bj happened.

2.22. Expected value: A random variable (actually, a function of a random variable) X(ω) has expected value

E[X] = ∑_{ω∈Ω} X(ω)P(ω) .

(Note that we do not write ω on the left. We think of X as simply a random number and ω as a story telling how X was generated.) This is the “average” value in the sense that if you could perform the “experiment” of sampling X many times then average the resulting numbers, you would get roughly E[X]. This is because P(ω) is the fraction of the time you would get ω and X(ω) is the number you get for ω. If X1(ω) and X2(ω) are two random variables, then E[X1 + X2] = E[X1] + E[X2]. Also, E[cX] = cE[X] if c is a constant (not random).

2.23. Best approximation property: If we wanted to approximate a random variable, X, (function X(ω) with ω not written) by a single non random number, x, what value would we pick? That would depend on the sense of “best”. One such sense is least squares, choosing x to minimize the expected value of (X − x)^2. A calculation, which uses the above properties of expected value, gives

E[(X − x)^2] = E[X^2 − 2Xx + x^2]
             = E[X^2] − 2xE[X] + x^2 .

Minimizing this over x gives the optimal value

x_opt = E[X] .   (5)

2.24. Classical conditional expectation: There are two senses of the term conditional expectation. We start with the original classical sense then turn to the related but different modern sense often used in stochastic processes. Conditional expectation is defined from conditional probability in the obvious way

E[X | B] = ∑_{ω∈B} X(ω)P(ω | B) .   (6)

For example, we can calculate

E[# of H in 4 tosses | at least one H] .

Write B for the event “at least one H”. Since only ω = TTTT does not have at least one H, |B| = 15 and P(ω | B) = 1/15 for any ω ∈ B. Let X(ω) be the number of H in ω. Unconditionally, E[X] = 2, which means

(1/16) ∑_{ω∈Ω} X(ω) = 2 .

Note that X(ω) = 0 for all ω ∉ B (only TTTT), so

∑_{ω∈Ω} X(ω)P(ω) = ∑_{ω∈B} X(ω)P(ω) ,

and therefore

(1/16) ∑_{ω∈B} X(ω) = 2

(15/16) · (1/15) ∑_{ω∈B} X(ω) = 2

(1/15) ∑_{ω∈B} X(ω) = 2 · 16/15

E[X | B] = 32/15 = 2 + .133 . . . .

Knowing that there was at least one H increases the expected number of H by .133 . . ..
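A short Python check of this conditional expectation, using the same enumeration of the 16 outcomes as before:

from itertools import product

omega = ["".join(t) for t in product("HT", repeat=4)]
X = {w: w.count("H") for w in omega}          # number of heads
B = [w for w in omega if X[w] >= 1]           # at least one H

# Classical conditional expectation (6): E[X | B] = sum over B of X(w) P(w|B).
E_X_given_B = sum(X[w] * (1.0 / len(B)) for w in B)
print(E_X_given_B, 32 / 15)                   # both 2.1333...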

2.25. Law of total probability: Suppose P = {B1, B2, . . .} is a partition of Ω. The law of total probability is the formula

E[X] = ∑_k E[X | Bk] P(Bk) .   (7)

This is easy to understand: exactly one of the events Bk happens. The expected value of X is the sum over each of the events Bk of the expected value of X given that Bk happened, multiplied by the probability that Bk did happen. The derivation is a simple combination of the definitions of conditional expectation (6) and conditional probability (4):

E[X] = ∑_{ω∈Ω} X(ω)P(ω)
     = ∑_k ( ∑_{ω∈Bk} X(ω)P(ω) )
     = ∑_k ( ∑_{ω∈Bk} X(ω) P(ω)/P(Bk) ) P(Bk)
     = ∑_k E[X | Bk] P(Bk) .

This fact underlies the recurrence relations that are among the primary tools of stochastic calculus. It will be reformulated below as the tower property when we discuss the modern view of conditional probability.

2.26. Modern conditional expectation: The modern conditional expectation starts with an algebra, F, rather than just the set B. It defines a (function of a) random variable, Y(ω) = E[X | F], that is measurable with respect to F even though X is not. This function represents the best prediction (in the least squares sense) of X given the information in F. If X ∈ F, then the value of X(ω) is determined by the information in F, so Y = X.

In the classical case, the information is the occurrence or non occurrence of a single event, B. That is, the algebra, FB, consists only of the sets B, B^c, ∅, and Ω. For this FB, the modern definition gives a function Y(ω) so that

Y(ω) = { E[X | B]    if ω ∈ B,
         E[X | B^c]  if ω ∉ B.

Make sure you understand the fact that this two valued function Y is measurable with respect to FB.

Only slightly more complicated is the case where F is generated by a partition, P = {B1, B2, . . .}, of Ω. The conditional expectation Y(ω) = E[X | F] is defined to be

Y(ω) = E[X | Bj] if ω ∈ Bj ,   (8)

where E[X | Bj] is classical conditional expectation (6). A single set B defines a partition: B1 = B, B2 = B^c, so this agrees with the earlier definition in that case. The information in F is only which of the Bj occurred. The modern conditional expectation replaces X with its expected value over the set that occurred. This is the expected value of X given the information in F.

2.27. Example of modern conditional expectation: Take Ω to be sequences of 4 coin tosses. Take F to be the algebra of Example 2 determined by the number of H tosses. Take X(ω) to be the number of H tosses before the first T (e.g. X(HHTH) = 2, X(TTTT) = 0, X(HHHH) = 4, etc.). With the usual abuse of notation, we calculate (below): Y({0}) = 0, Y({1}) = 1/4, Y({2}) = 2/3, Y({3}) = 3/2, Y({4}) = 4. Note, for example, that because HHTT and HTHT are equivalent in F (in the equivalence class {2}), Y(HHTT) = Y(HTHT) = 2/3 even though X(HHTT) ≠ X(HTHT). The common value of Y on an equivalence class is the average value of X over the outcomes in that class.

{0}: TTTT
     X values: 0
     expected value = 0

{1}: HTTT THTT TTHT TTTH
     X values: 1 0 0 0
     expected value = (1 + 0 + 0 + 0)/4 = 1/4

{2}: HHTT HTHT HTTH THHT THTH TTHH
     X values: 2 1 1 0 0 0
     expected value = (2 + 1 + 1 + 0 + 0 + 0)/6 = 2/3

{3}: HHHT HHTH HTHH THHH
     X values: 3 2 1 0
     expected value = (3 + 2 + 1 + 0)/4 = 3/2

{4}: HHHH
     X values: 4
     expected value = 4
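The same computation is a few lines of Python: group the outcomes by the value that generates F (here the number of H) and average X over each equivalence class. This is only a sketch of the partition definition (8):

from itertools import product

omega = ["".join(t) for t in product("HT", repeat=4)]

def X(w):                      # number of H before the first T
    return len(w) if "T" not in w else w.index("T")

def n_heads(w):                # the quantity that generates F (Example 2)
    return w.count("H")

# Y = E[X | F]: average X over each equivalence class {n_heads = k}.
Y = {}
for k in range(5):
    cls = [w for w in omega if n_heads(w) == k]
    Y[k] = sum(X(w) for w in cls) / len(cls)

print(Y)   # {0: 0.0, 1: 0.25, 2: 0.666..., 3: 1.5, 4: 4.0}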

2.28. Best approximation property: Suppose we have a random variable, X(ω), that is not measurable with respect to the σ-algebra F. That is, the information in F does not completely determine the value of X. The conditional expectation, Y(ω) = E[X | F], among all functions measurable with respect to F, is the closest to X in the least squares sense. That is, if Z ∈ F, then

E[(Z − X)^2] ≥ E[(Y − X)^2] .

In fact, this best approximation property will be the definition of conditional expectation in situations where the partition definition is not directly applicable. The best approximation property for modern conditional expectation is a consequence of the best approximation for classical conditional expectation. The least squares error is the sum of the least squares errors over each Bk in the partition defined by F. We minimize the least squares error in Bk by choosing Y(Bk) to be the average of X over Bk (weighted by the probabilities P(ω) for ω ∈ Bk). By choosing the best approximation in each Bk, we get the best approximation overall.


This can be expressed in the terminology of linear algebra. The set of functions (random variables) X is a vector space (Hilbert space) with inner product

〈X, Y〉 = ∑_{ω∈Ω} X(ω)Y(ω)P(ω) = E[XY] ,

so ‖X − Y‖^2 = E[(X − Y)^2]. The set of functions measurable with respect to F is a subspace, which we call SF. The conditional expectation, Y, is the orthogonal projection of X onto SF, which is the element of SF that is closest to X in the norm just given.

2.29. Tower property: Suppose G is a σ-algebra that has less information than F. That is, every event in G is also in F, but events in F need not be in G. This is expressed simply (without abuse of notation) as G ⊆ F. Consider the (modern) conditional expectations Y = E[X | F] and Z = E[X | G]. The tower property is the fact that Z = E[Y | G]. That is, conditioning in one step gives the same result as conditioning in two steps. As we said before, the tower property underlies the backward equations that are among the most useful tools of stochastic calculus.

The tower property is an application of the law of total probability to conditional expectation. Suppose P and Q are the partitions of Ω corresponding to F and G respectively. The partition P is a refinement of Q, which means that each Ck ∈ Q itself is partitioned into events B_{k,1}, B_{k,2}, . . ., where the B_{k,j} are elements of P. Then (see “Working with conditional probability”) for ω ∈ Ck, we want to show that Z(ω) = E[Y | Ck]:

Z(ω) = E[X | Ck]
     = ∑_j E[X | B_{k,j}] P(B_{k,j} | Ck)
     = ∑_j Y(B_{k,j}) P(B_{k,j} | Ck)
     = E[Y | Ck] .

The linear algebra projection interpretation makes the tower property seem obvious. Any function measurable with respect to G is also measurable with respect to F, which means that the subspace SG is contained in SF. If you project X onto SF then project the projection onto SG, you get the same thing as projecting X directly onto SG (always orthogonal projections).

2.30. Modern conditional probability: Probabilities can be defined as expected values of characteristic functions (see below). Therefore, the modern definition of conditional expectation gives a modern definition of conditional probability. For any event, A, the indicator function, 1_A(ω), (also written χ_A(ω), for “characteristic function”, terminology less used by probabilists because characteristic function means something else to them) is defined by 1_A(ω) = 1 if ω ∈ A, and 1_A(ω) = 0 if ω ∉ A. The obvious formula P(A) = E[1_A] is the representation of the probability as an expected value. The modern conditional probability then is P(A | F) = E[1_A | F]. Unraveling the definitions, this is a function, Y_A(ω), that takes the value P(A | Bk) whenever ω ∈ Bk. A related statement, given for practice with notation, is

P(A | F)(ω) = ∑_{Bk∈PF} P(A | Bk) 1_{Bk}(ω) .

3 Markov Chains, I

3.1. Introduction: Discrete time Markov³ chains are a simple abstract class of discrete random processes. Many practical models are Markov chains. Here we discuss Markov chains having a finite state space (see below).

Many of the general concepts above come into play here. The probability space Ω is the space of paths. The natural states of partial information are described by the algebras Ft, which represent the information obtained by observing the chain up to time t. The tower property applied to the Ft leads to backward and forward equations. This section is mostly definitions. The good stuff is in the next section.

3.2. Time: The time variable, t, will be an integer representing the number of time units from a starting time. The actual time to go from t to t + 1 could be a nanosecond (for modeling computer communication networks) or a month (for modeling bond rating changes), or whatever. To be specific, we usually start with t = 0 and consider only non negative times.

3.3. State space: At time t the system will be in one of a finite list of states. This set of states is the state space, S. To be a Markov chain, the state should be a complete description of the actual state of the system at time t. This means that it should contain any information about the system at time t that helps predict the state at future times t + 1, t + 2, . . .. This is illustrated with the hidden Markov model below. The state at time t will be called X(t) or Xt. Eventually, there may be an ω also, so that the state is a function of t and ω: X(t, ω) or Xt(ω). The states may be called s1, . . ., sm, or simply 1, 2, . . ., m, depending on the context.

3.4. Path space: The sequence of states X0, X1, . . ., XT is a path. The set of paths is path space. It is possible and often convenient to use the set of paths as the probability space, Ω. When we do this, the path X = (X0, X1, . . . , XT) = (X(0), X(1), . . . , X(T)) plays the role that was played by the outcome ω in the general theory above. We will soon have a formula for P(X), the probability of path X, in terms of transition probabilities.

³The Russian mathematician A. A. Markov was active in the last decades of the 19th century. He is known for his path breaking work on the distribution of prime numbers as well as on probability.

In principle, it should be possible to calculate the probability of any event (such as X(2) ≠ s, or X(t) = s1 for some t ≤ T) by listing all the paths (outcomes) in that event and summing their probabilities. This is rarely the easiest way. For one thing, the path space, while finite, tends to be enormous. For example, if there are m = |S| = 7 states and T = 50 times, then the number of paths is |Ω| = m^T = 7^50, which is about 1.8 × 10^42. This number is beyond computers.

3.5. Algebras Ft and Gt: The information learned by observing a Markov chain up to and including time t is Ft. Paths X1 and X2 are equivalent in Ft if X1(s) = X2(s) for 0 ≤ s ≤ t. Said only slightly differently, the equivalence class of path X is the set of paths X′ with X′(s) = X(s) for 0 ≤ s ≤ t. The Ft form an increasing family of algebras: Ft ⊆ Ft+1. (Event A is in Ft if we can tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t. In this case, we also can tell whether A occurred by knowing X(s) for 0 ≤ s ≤ t + 1, which is what it means for A to be in Ft+1.)

The algebra Gt is generated by X(t) only. It encodes the information learned by observing X at time t only, not at earlier times. Clearly Gt ⊆ Ft, but Gt is not contained in Gt+1, because X(t + 1) does not determine X(t).

3.6. Nonanticipating (adapted) functions: The underlying outcome, which was called ω, is now called X. A function of the outcome, or function of a random variable, will now be called F(X) instead of X(ω). Over and over in stochastic processes, we deal with functions that depend on both X and t. Such a function will be called F(X, t). The simplest such function is F(X, t) = X(t). More complicated functions are: (i) F(X, t) = 1 if X(s) = 1 for some s ≤ t, F(X, t) = 0 otherwise, and (ii) F(X, t) = min{s > t : X(s) = 1}, or F(X, t) = T if X(s) ≠ 1 for t < s ≤ T.

A function F(X, t) is nonanticipating (also called adapted, though the notions are slightly different in more sophisticated situations) if, for each t, the function of X given by F(X, t) is measurable with respect to Ft. This is the same as saying that F(X, t) is determined by the values X(s) for s ≤ t. The function (i) above has this property but (ii) does not.

Nonanticipating functions are important for several reasons. In time, we will see that the Ito integral makes sense only for nonanticipating functions. Moreover, functions F(X, t) are a model of decision making under uncertainty. That F is nonanticipating means that the decision at time t is made based on information available at time t and does not depend on future information.

3.7. Markov property: Informally, the Markov property is that X(t) is all the information about the past that is helpful in predicting the future. In classical terms, for example,

P(X(t + 1) = k | X(t) = j) = P(X(t + 1) = k | X(t) = j, X(t − 1) = l, etc.) .


In modern notation, this may be stated

P(X(t + 1) = k | Ft) = P(X(t + 1) = k | Gt) .   (9)

Recall that both sides are functions of the outcome, X. The function on the right side, to be measurable with respect to Gt, must be a function of X(t) only (see “Generating by a function” in the previous section). The left side also is a function, but in general could depend on all the values X(s) for s ≤ t. The equality (9) states that this function depends on X(t) only.

This may be interpreted as the absence of hidden variables, variables that influence the evolution of the Markov chain but are not observable or included in the state description. If there were hidden variables, observing the chain for a long period might help identify them and therefore change our prediction of the future state. The Markov property (9) states, on the contrary, that observing X(s) for s < t does not change our predictions.

3.8. Transition probabilities: The conditional probabilities (9) are transition probabilities:

Pjk = P(X(t + 1) = k | X(t) = j) = P(j → k in one step) .

The Markov chain is stationary if the transition probabilities Pjk are independent of t. Each transition probability Pjk is between 0 and 1, with values 0 and 1 allowed, though 0 is more common than 1. Also, with j fixed, the Pjk must sum to 1 (summing over k) because k = 1, 2, . . ., m is a complete list of the possible states at time t + 1.

3.9. Path probabilities: The Markov property leads to a formula for the probabilities of individual path outcomes P(X) as products of transition probabilities. We do this here for a stationary Markov chain to keep the notation simple. First, suppose that the probabilities of the initial states are known, and call them

f0(j) = P(X(0) = j) .

Bayes’ rule (2) implies that

P(X(1) = k and X(0) = j) = P(X(1) = k | X(0) = j) · P(X(0) = j) = f0(j)Pjk .

Using this argument again, and using (9), we find (changing the order of the factors on the last line)

P(X(2) = l and X(1) = k and X(0) = j)
  = P(X(2) = l | X(1) = k and X(0) = j) · P(X(1) = k and X(0) = j)
  = P(X(2) = l | X(1) = k) · P(X(1) = k and X(0) = j)
  = f0(j)PjkPkl .

This can be extended to paths of any length.


One way to express the general formula uses a notational habit common in probability, using upper case letters to represent a random value of a variable and lower case for generic values of the same quantity (see “Terminology”, Section 2, but note that the meaning of X has changed). We write x = (x(0), x(1), · · · , x(T)) for a generic path, and seek P(x) = P(X = x) = P(X(0) = x(0), X(1) = x(1), · · ·). The argument above shows that this is given by

P(x) = f0(x(0)) Px(0),x(1) · · · Px(T−1),x(T) = f0(x(0)) ∏_{t=0}^{T−1} Px(t),x(t+1) .   (10)
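A small Python sketch of formula (10), with f0 and P stored as a numpy vector and matrix (the function name path_probability is just for illustration):

import numpy as np

def path_probability(path, f0, P):
    """Probability of a path (x(0), ..., x(T)) via formula (10):
    f0(x(0)) times the product of one-step transition probabilities."""
    prob = f0[path[0]]
    for a, b in zip(path[:-1], path[1:]):
        prob *= P[a, b]
    return prob

# Toy check with the two-state chain of Example 3 below (states 0 = U, 1 = D).
P = np.array([[0.8, 0.2], [0.2, 0.8]])
f0 = np.array([1.0, 0.0])                      # start in U
print(path_probability([0, 0, 0, 0], f0, P))   # 0.512
print(path_probability([0, 0, 1, 0], f0, P))   # 0.032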

3.10. Transition matrix: The transition probabilities form an m × m matrix, P (an unfortunate conflict of notation), called the transition matrix. The (j, k) entry of P is the transition probability Pjk = P(j → k). The sum of the entries of the transition matrix P in row j is ∑_k Pjk = 1. A matrix with these properties: no negative entries, all row sums equal to 1, is a stochastic matrix. Any stochastic matrix can be the transition matrix for a Markov chain.

Methods from linear algebra often help in the analysis of Markov chains. As we will see in the next lecture, the time s transition probability

P^s_jk = P(X_{t+s} = k | X_t = j)

is the (j, k) entry of P^s, the s-th power of the transition matrix (explanation below). Also, as discussed later, steady state probabilities form an eigenvector of P corresponding to eigenvalue λ = 1.

3.11. Example 3, coin flips: The state space has m = 2 states, called U (up) and D (down). Writing H and T would conflict with T being the length of the chain. The coin starts in the U position, which means that f0(U) = 1 and f0(D) = 0. At every time step, the coin turns over with 20% probability, so the transition probabilities are PUU = .8, PUD = .2, PDU = .2, PDD = .8. The transition matrix is (taking U for 1 and D for 2)

P = ( .8  .2
      .2  .8 ) .

For example, we can calculate

P^2 = P · P = ( .68  .32        and    P^4 = P^2 · P^2 = ( .5648  .4352
                .32  .68 )                                 .4352  .5648 ) .

This implies that P(X(4) = D) = P(X(0) = U → X(4) = D) = P^4_UD = .4352.

The eigenvalues of P are λ1 = 1 and λ2 = .6, the former required by theory. Numerical experimentation should convince the reader that

‖ P^s − ( .5  .5      ‖ = const · λ2^s .
          .5  .5 )
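These numbers are easy to reproduce with numpy; a short sketch of that experiment (the limiting matrix with all entries .5 is used to check the λ2^s decay):

import numpy as np

P = np.array([[0.8, 0.2],
              [0.2, 0.8]])           # rows: from U, from D

P2 = P @ P
P4 = P2 @ P2
print(P2)                            # [[0.68 0.32] [0.32 0.68]]
print(P4[0, 1])                      # P(X(4)=D | X(0)=U) = 0.4352

print(sorted(np.linalg.eigvals(P)))  # eigenvalues 0.6 and 1.0

# The distance to the limiting matrix decays like lambda_2^s (constant ratio).
limit = np.full((2, 2), 0.5)
for s in (1, 2, 4, 8):
    print(s, np.linalg.norm(np.linalg.matrix_power(P, s) - limit) / 0.6**s)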


Take T = 3 and let A be the event UUzU, where the state X(2) = z is unknown. There are two outcomes (paths) in A:

A = {UUUU, UUDU} ,

so P(A) = P(UUUU) + P(UUDU). The individual path probabilities are calculated using (10):

U →(.8) U →(.8) U →(.8) U,  so P(UUUU) = 1 × .8 × .8 × .8 = .512 .
U →(.8) U →(.2) D →(.2) U,  so P(UUDU) = 1 × .8 × .2 × .2 = .032 .

Thus, P(A) = .512 + .032 = .544.

3.12. Example 4: There are two coins, F (fast) and S (slow). Either coin will be either U or D at any given time. Only one coin is present at any given time but sometimes the coin is replaced (F for S or vice versa) without changing its U–D status. The F coin has the same U–D transition probabilities as Example 3. The S coin has U–D transition probabilities

( .9   .1
  .05  .95 ) .

The probability of coin replacement at any given time is 30%. The replacement (if it happens) is done after the (possible) coin flip without changing the U–D status of the coin after that flip. The Markov chain has 4 states, which we arbitrarily number 1: UF, 2: DF, 3: US, 4: DS. States 1 and 3 are U states while states 1 and 2 are F states, etc. The transition matrix is 4 × 4. We can calculate, for example, the (non) transition probability for UF → UF. We first have a U → U (non) transition then an F → F (non) transition. The probability is then P(U → U | F) · P(F → F) = .8 · .7 = .56. The other entries can be found in a similar way. The transitions are

UF → UF   UF → DF   UF → US   UF → DS
DF → UF   DF → DF   DF → US   DF → DS
US → UF   US → DF   US → US   US → DS
DS → UF   DS → DF   DS → US   DS → DS .

The resulting transition matrix is

P = ( .8 · .7    .2 · .7    .8 · .3    .2 · .3
      .2 · .7    .8 · .7    .2 · .3    .8 · .3
      .9 · .3    .1 · .3    .9 · .7    .1 · .7
      .05 · .3   .95 · .3   .05 · .7   .95 · .7 ) .

If we start with U but equally likely F or S, and want to know the probability of being D after 4 time periods, the answer is

.5 · ( P^4_{12} + P^4_{14} + P^4_{32} + P^4_{34} )

because states 1 = UF and 3 = US are the (equally likely) possible initial U states, and 2 = DF and 4 = DS are the two D states. We also could calculate P(UUzU) by adding up the probabilities of the 32 (list them) paths that make up this event.
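A sketch of this calculation in Python, assembling the 4 × 4 matrix from the two 2 × 2 coin matrices and the 30% replacement probabilities (the array names PF, PS, R are just for illustration):

import numpy as np

# U-D transition matrices for the fast and slow coins, and the replacement step.
PF = np.array([[0.8, 0.2], [0.2, 0.8]])
PS = np.array([[0.9, 0.1], [0.05, 0.95]])
R  = np.array([[0.7, 0.3], [0.3, 0.7]])    # F->F/S and S->F/S probabilities

# States ordered 1: UF, 2: DF, 3: US, 4: DS, encoded as (U-D index, F-S index).
states = [(0, 0), (1, 0), (0, 1), (1, 1)]
P = np.zeros((4, 4))
for i, (ud_i, fs_i) in enumerate(states):
    for j, (ud_j, fs_j) in enumerate(states):
        coin = PF if fs_i == 0 else PS
        P[i, j] = coin[ud_i, ud_j] * R[fs_i, fs_j]

P4 = np.linalg.matrix_power(P, 4)
print(0.5 * (P4[0, 1] + P4[0, 3] + P4[2, 1] + P4[2, 3]))  # P(D at t = 4)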

3.13. Example 5, incomplete state information: In the model of Example 4 we might be able to observe the U–D status but not the F–S status. Let X(t) be the state of the Example 4 model above at time t. Suppose Y(t) = U if X(t) = UF or X(t) = US, and Y(t) = D if X(t) = DF or X(t) = DS. Then the sequence Y(t) is a stochastic process but it is not a Markov chain. We can better predict U ↔ D transitions if we know whether the coin is F or S, or even if we have a basis for guessing its F–S status.

For example, suppose that the four states (UF, DF, US, DS) at time t = 0 are equally likely, that we know Y(1) = U and we want to guess whether Y(2) will again be U. If Y(0) is D then we are more likely to have the F coin, so a Y(1) = U → Y(2) = D transition is more likely. That is, with Y(1) fixed, Y(0) = D makes it less likely to have Y(2) = U. This is a violation of the Markov property brought about by incomplete state information. Models of this kind are called hidden Markov models. Statistical estimation of the unobserved variable is a topic for another day.

Thanks to Laura K and Craig for pointing out mistakes and confusions in earlier drafts.


Stochastic Calculus Notes, Lecture 2
Last modified September 16, 2004

1 Forward and Backward Equations for Markov chains

1.1. Introduction: Forward and backward equations are useful ways to get answers to quantitative questions about Markov chains. The probabilities u(k, t) = P(X(t) = k) satisfy forward equations that allow us to compute all the numbers u(k, t + 1) once all the numbers u(j, t) are known. This moves us forward from time t to time t + 1. The expected values f(k, t) = E[V(X(T)) | X(t) = k] (for t < T) satisfy a backward equation that allows us to calculate the numbers f(k, t) once all the f(j, t + 1) are known. A duality relation allows us to infer the forward equation from the backward equation, or conversely. The transition matrix is the generator of both equations, though in different ways. There are many related problems that have solutions involving forward and backward equations. Two treated here are hitting probabilities and random compound interest.

1.2. Forward equation, functional version: Let u(k, t) = P(X(t) = k). The law of total probability gives

u(k, t + 1) = P(X(t + 1) = k)
            = ∑_j P(X(t + 1) = k | X(t) = j) · P(X(t) = j) .

Therefore

u(k, t + 1) = ∑_j Pjk u(j, t) .   (1)

This is the forward equation for probabilities. It is also called the Kolmogorov forward equation or the Chapman Kolmogorov equation. Once u(j, t) is known for all j ∈ S, (1) gives u(k, t + 1) for any k. Thus, we can go forward in time from t = 0 to t = 1, etc. and calculate all the numbers u(k, t).

Note that if we just wanted one number, say u(17, 49), still we would have to calculate many related quantities, all the u(j, t) for t < 49. If the state space is too large, this direct forward equation approach may be impractical.

1.3. Row and column vectors: If A is an n × m matrix, and B is an m × p matrix, then AB is n × p. The matrices are compatible for multiplication because the second dimension of A, the number of columns, matches the first dimension of B, the number of rows. A matrix with just one column is a column vector.¹ Just one row makes it a row vector. Matrix-vector multiplication is a special case of matrix-matrix multiplication. We often denote genuine matrices (more than one row and column) with capital letters and vectors, row or column, with lower case. In particular, if u is an n dimensional row vector, a 1 × n matrix, and A is an n × n matrix, then uA is another n dimensional row vector. We do not write Au for this because that would be incompatible. Matrix multiplication is always associative. For example, if u is a row vector and A and B are square matrices, then (uA)B = u(AB). We can compute the row vector uA then multiply by B, or we can compute the n × n matrix AB then multiply by u.

If u is a row vector, we usually denote the k-th entry by uk instead of u1k. Similarly, the k-th entry of column vector f is fk instead of fk1. If both u and f have n components, then uf = ∑_{k=1}^{n} uk fk is a 1 × 1 matrix, i.e. a number. Thus, treating row and column vectors as special kinds of matrices makes the product of a row with a column vector natural, but not, for example, the product of two column vectors.

¹The physicists’ more sophisticated idea that a vector is a physical quantity with certain transformation properties is “inoperative” here.

1.4. Forward equation, matrix version: The probabilities u(k, t) form the components of a row vector, u(t), with components uk(t) = u(k, t) (an abuse of notation). The forward equation (1) may be expressed (check this)

u(t + 1) = u(t)P .   (2)

Because matrix multiplication is associative, we have

u(t) = u(t − 1)P = u(t − 2)P^2 = · · · = u(0)P^t .   (3)

Tricks of matrix multiplication give information about the evolution of probabilities. For example, we can write a formula for u(t) in terms of the eigenvectors and eigenvalues of P. Also, we can save effort in computing u(t) for large t by repeated squaring:

P → P^2 → (P^2)^2 = P^4 → · · · → P^(2^k)

using just k matrix multiplications. For example, this computes P^1024 using just ten matrix multiplies, instead of a thousand.
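A minimal Python sketch of the forward evolution (3) with repeated squaring (the function name propagate is just for illustration):

import numpy as np

def propagate(u0, P, t):
    """Push the row vector of probabilities forward t steps: u(t) = u(0) P^t,
    computing P^t by repeated squaring (O(log t) matrix multiplies)."""
    result = np.eye(len(u0))
    power = P.copy()
    while t > 0:
        if t % 2 == 1:
            result = result @ power
        power = power @ power
        t //= 2
    return u0 @ result

P = np.array([[0.8, 0.2], [0.2, 0.8]])
u0 = np.array([1.0, 0.0])
print(propagate(u0, P, 4))     # [0.5648 0.4352], matching Example 3 above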

1.5. Backward equation, functional version: Suppose we run the Markov chain until time T then get a “reward”, V(X(T)). For t ≤ T, define the conditional expectations

f(k, t) = E[V(X(T)) | X(t) = k] .   (4)

This expression is used so often it often is abbreviated

f(k, t) = E_{k,t}[V(X(T))] .

These satisfy a backward equation that follows from the law of total probability:

f(k, t) = E[V(X(T)) | X(t) = k]
        = ∑_{j∈S} E[V(X(T)) | X(t) = k and X(t + 1) = j] · P(X(t + 1) = j | X(t) = k) ,

so

f(k, t) = ∑_{j∈S} f(j, t + 1) Pkj .   (5)

The Markov property is used to infer that

E[V(X(T)) | X(t) = k and X(t + 1) = j] = E_{j,t+1}[V(X(T))] .

The dynamics (5) must be supplemented with the final condition

f(k, T) = V(k) .   (6)

Using these, we may compute all the numbers f(k, T − 1), then all the numbers f(k, T − 2), etc.

1.6. Backward equation using modern conditional expectation: As usual, Ft denotes the σ-algebra generated by X(0), . . ., X(t). Define F(t) = E[V(X(T)) | Ft]. The left side is a random variable that is measurable in Ft, which means that F(t) is a function of (X(0), . . . , X(t)). The Markov property implies that F(t) actually is measurable with respect to Gt, the σ-algebra generated by X(t) alone. This means that F(t) is a function of X(t) alone, which is to say that there is a function f(k, t) so that F(t) = f(X(t), t), and

f(X(t), t) = E[V(X(T)) | Ft] = E[V(X(T)) | Gt] .

Since Gt is generated by the partition of events {X(t) = k}, this is the same as definition (4). Moreover, because Ft ⊆ Ft+1 and F(t + 1) = E[V(X(T)) | Ft+1], the tower property gives

E[V(X(T)) | Ft] = E[F(t + 1) | Ft] ,

so that, again using the Markov property,

F(t) = E[F(t + 1) | Gt] .   (7)

Note that this is a version of the tower property. On the event X(t) = k, the right side above takes the value

∑_{j∈S} f(j, t + 1) · P(X(t + 1) = j | X(t) = k) .

Thus, (7) is the same as the backward equation (5). In the continuous time versions to come, (7) will be very handy.

1.7. Backward equation, matrix version: We organize the numbers f(k, t) into a column vector f(t) = (f(1, t), f(2, t), · · ·)^t. It is barely an abuse to write f(t) both for a function of k and a vector. After all, any computer programmer knows that a vector really is a function of the index. The backward equation (5) then is equivalent to (check this)

f(t) = P f(t + 1) .   (8)

Again the associativity of matrix multiplication lets us write, for example,

f(t) = P^{T−t} V ,

writing V for the vector of values of V.
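A sketch of the backward recursion (8) in Python, using the Example 3 chain with the reward V = 1 on state D and 0 on state U (this particular V is chosen only for illustration):

import numpy as np

def expected_reward(P, V, T, t=0):
    """Backward equation (8): f(T) = V, f(t) = P f(t+1), so f(t) = P^(T-t) V."""
    f = V.astype(float).copy()
    for _ in range(T - t):
        f = P @ f
    return f

P = np.array([[0.8, 0.2], [0.2, 0.8]])
V = np.array([0.0, 1.0])
f0 = expected_reward(P, V, T=4)
print(f0)        # f(k, 0) = P(X(4)=D | X(0)=k) = [0.4352 0.5648]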

1.8. Invariant expectation value: We combine the conditional expectations (4) with the probabilities u(k, t), using the law of total probability, to get, for any t,

E[V(X(T))] = ∑_{k∈S} P(X(t) = k) · E[V(X(T)) | X(t) = k]
           = ∑_{k∈S} u(k, t) f(k, t)
           = u(t) f(t) .

The last line is a natural example of an inner product between a row vector and a column vector. Note that the product E[V(X(T))] = u(t)f(t) does not depend on t even though u(t) and f(t) are different for different t. For this invariance to be possible, the forward evolution equation for u and the backward equation for f must be related.

1.9. Relationship between the forward and backward equations: It often is possible to derive the backward equation from the forward equation and conversely using the invariance of u(t)f(t). For example, suppose we know that f(t) = Pf(t + 1). Then u(t + 1)f(t + 1) = u(t)f(t) may be rewritten u(t + 1)f(t + 1) = u(t)Pf(t + 1), which may be rearranged as (using rules of matrix multiplication)

( u(t + 1) − u(t)P ) f(t + 1) = 0 .

If this is true for enough linearly independent vectors f(t + 1), then the vector u(t + 1) − u(t)P must be zero, which is the matrix version of the forward equation (2). A theoretically minded reader can verify that enough f vectors are produced if the transition matrix is nonsingular and we choose a linearly independent family of “reward” vectors, V. In the same way, the backward evolution of f is a consequence of invariance and the forward evolution of u.

We now have two ways to evaluate E[V(X(T))]: (i) start with given u(0), compute u(T) = u(0)P^T, and evaluate u(T)V, or (ii) start with given V = f(T), compute f(0) = P^T V, then evaluate u(0)f(0). The former might be preferable, for example, if we had a number of different reward functions to evaluate. We could compute u(T) once then evaluate u(T)V for all our V vectors.


1.10. Duality: In its simplest form, duality is the relationship between a matrix and its transpose. The set of column vectors with n components is a vector space of dimension n. The set of n component row vectors is the dual space, which has the same dimension but may be considered to be a different space. We can combine an element of a vector space with an element of its dual to get a number: row vector u multiplied by column vector f yields the number uf. Any linear transformation on the vector space of column vectors is represented by an n × n matrix, P. This matrix also defines a linear transformation, the dual transformation, on the dual space of row vectors, given by u → uP. This is the sense in which the forward and backward equations are dual to each other.

Some people prefer not to use row vectors and instead think of organizing the probabilities u(k, t) into a column vector that is the transpose of what we called u(t). For them, the forward equation would be written u(t + 1) = P^t u(t) (note the notational problem: the t in P^t means “transpose” while the t in u(t) and f(t) refers to time). The invariance relation for them would be u^t(t + 1)f(t + 1) = u^t(t)f(t). The transpose of a matrix is often called its dual.

1.11. Hitting probabilities, backwards: The hitting probability for state 1 up to time T is

P(X(t) = 1 for some t ∈ [0, T]) .   (9)

Here and below we write [a, b] for all the integers between a and b, including a and/or b if they are integers. Hitting probabilities can be computed using forward or backward equations, often by modifying P and adding boundary conditions. For one backward equation approach, define

f(k, t) = P(X(t′) = 1 for some t′ ∈ [t, T] | X(t) = k) .   (10)

Clearly,

f(1, t) = 1 for all t,   (11)

and

f(k, T) = 0 for k ≠ 1.   (12)

Moreover, if k ≠ 1, the law of total probabilities yields a backward relation

f(k, t) = ∑_{j∈S} Pkj f(j, t + 1) .   (13)

The difference between this and the plain backward equation (5) is that the relation (13) holds only for interior states k ≠ 1, while the boundary condition (11) supplies the values of f(1, t). The sum on the right of (13) includes the term corresponding to state j = 1.

1.12. Hitting probabilities, forward: We also can compute the hitting proba-bilities (9) using a forward equation approach. Define the survival probabilities

u(k, t) = P (X(t) = k and X(t′) = 1 for t′ ∈ [0, t]) . (14)


These satisfy the obvious boundary condition

u(1, t) = 0 , (15)

and initial condition u(k, 0) = 1 for k ≠ 1. (16)

The forward equation is (as the reader should check)

u(k, t + 1) = ∑_{j∈S} u(j, t) Pjk . (17)

We may include or exclude the term with j = 1 on the right because u(1, t) = 0. Of course, (17) applies only at interior states k ≠ 1. The overall probability of survival up to time T is ∑_{k∈S} u(k, T), and the hitting probability is the complementary 1 − ∑_{k∈S} u(k, T).

The matrix vector formulation of this involves the row vector

u(t) = (u(2, t), u(3, t), . . .)

and the matrix P̄ formed from P by removing the first row and column. The evolution equation (17) and boundary condition (15) are both expressed by the matrix equation

u(t + 1) = u(t) P̄ .

Note that P̄ is not a stochastic matrix because some of the row sums are less than one:

∑_{j≠1} P̄kj < 1 if Pk1 > 0 .
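A sketch of the forward route with the sub-stochastic matrix (same hypothetical chain as above; state 0 is the state to be hit). Starting from state 1, the answer agrees with the backward computation above:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
hit, T = 0, 20
keep = [k for k in range(P.shape[0]) if k != hit]
P_bar = P[np.ix_(keep, keep)]      # survival transition matrix: row sums < 1

u = np.array([1.0, 0.0])           # start in state 1, the first surviving state
for _ in range(T):
    u = u @ P_bar                  # u(t + 1) = u(t) P_bar
print(1.0 - u.sum())               # hitting probability; matches f[1] above
```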

1.13. Absorbing boundaries: Absorbing boundaries are another way to think about hitting and survival probabilities. The absorbing boundary Markov chain is the same as the original chain (same transition probabilities) as long as the state is not one of the boundary states. In the absorbing chain, the state never again changes after it visits an absorbing boundary point. If P̄ is the transition matrix of the absorbing chain and P is the original transition matrix, this means that P̄jk = Pjk if j is not a boundary state, while P̄jk = 0 if j is a boundary state and k ≠ j. The probabilities u(k, t) for the absorbing chain are the same as the survival probabilities (14) for the original chain.

1.14. Running cost: Suppose we have a running cost function, W(x), and we want to calculate

f = E[ ∑_{t=0}^{T} W(X(t)) ] . (18)

Sums like this are called path dependent because their value depends on thewhole path, not just the final value X(T ). We can calculate (18) with the


forward equation using

f = ∑_{t=0}^{T} E[W(X(t))] = ∑_{t=0}^{T} u(t)W . (19)

Here W is the column vector with components Wk = W (k). We compute theprobabilities that are the components of the u(t) using the standard forwardequation (2) and sum the products (19).

One backward equation approach uses the quantities

f(k, t) = E_{k,t}[ ∑_{t′=t}^{T} W(X(t′)) ] . (20)

These satisfy (check this):

f(t) = Pf(t + 1) + W . (21)

Starting with f(T) = W, we work backwards with (21) until we reach the desired f(0).
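A short sketch of the recursion (21), again with a hypothetical P and running cost W:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
W = np.array([0.0, 1.0, 2.0])   # running cost W(k) for each state
T = 10

f = W.copy()                    # final condition: f(T) = W
for _ in range(T):
    f = P @ f + W               # f(t) = P f(t+1) + W
# f[k] is now E_{k,0}[ sum_{t=0}^{T} W(X(t)) ]
print(f)
```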

1.15. Multiplicative functionals: For some reason, a function of a function is often called a functional. The path, X(t), is a function of t, so a function, F(X), that depends on the whole path is often called a functional. Some applications call for finding the expected value of a multiplicative functional:

f = E[ ∏_{t=0}^{T} V(X(t)) ] . (22)

For example, X(t) could represent the state of a financial market and V(k) = 1 + r(k) the interest rate for state k. Then (22) would be the expected total interest. We also can write V(k) = e^{W(k)}, so that

∏ V(X(t)) = exp[ ∑ W(X(t)) ] = e^Z ,

with Z = ∑ W(X(t)). This does not solve the problem of evaluating (22) because E[e^Z] ≠ e^{E[Z]}.

The backward equation approach uses the intermediate quantities

f(k, t) = E_{k,t}[ ∏_{t′=t}^{T} V(X(t′)) ] .

The t′ = t term in the product has V(X(t)) = V(k). The final condition is f(k, T) = V(k). The backward evolution equation is derived more or less as


before:

f(k, t) = E_{k,t}[ V(k) ∏_{t′>t} V(X(t′)) ]
        = V(k) E_{k,t}[ ∏_{t′=t+1}^{T} V(X(t′)) ]
        = V(k) E_{k,t}[ f(X(t + 1), t + 1) ]   (the tower property)
f(k, t) = V(k) ( Pf(t + 1) )(k) . (23)

In the last line on the right, f(t + 1) is the column vector with components f(k, t + 1) and Pf(t + 1) is the matrix vector product. We write ( Pf(t + 1) )(k) for the kth component of the column vector Pf(t + 1). We could express the whole thing in matrix terms using diag(V), the diagonal matrix with V(k) in the (k, k) position:

f(t) = diag(V )Pf(t + 1) .

A version of (23) for Brownian motion is called the Feynman-Kac formula.
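A sketch of the backward recursion (23) in matrix form, with hypothetical per-state factors V(k) (think of discount or growth factors):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
V = np.array([1.00, 1.02, 1.05])   # per-state multiplicative factors
T = 10

f = V.copy()                        # final condition: f(k, T) = V(k)
for _ in range(T):
    f = V * (P @ f)                 # f(t) = diag(V) P f(t+1)
# f[k] is now E_{k,0}[ prod_{t=0}^{T} V(X(t)) ]
print(f)
```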

1.16. Branching processes: One forward equation approach to (22) leads to a different interpretation of the answer. Let B(k, t) be the event X(t) = k and I(k, t) the indicator function of B(k, t). That is, I(k, t, X) = 1 if X ∈ B(k, t) (i.e. X(t) = k), and I(k, t, X) = 0 otherwise. Writing I(k, t) rather than I(k, t, X) is in keeping with the probabilists' habit of leaving out the arguments of functions when the argument is the underlying random outcome. We have u(k, t) = E[I(k, t)]. The forward equation for the quantities

g(k, t) = E[ I(k, t) ∏_{t′=0}^{t} V(X(t′)) ] (24)

is (see homework):

g(k, t) = V(k) ( g(t − 1)P )(k) . (25)

This is also the forward equation for a branching process with branching factors V(k). At time t, the branching process has N(k, t) particles, or walkers, at state k. The numbers N(k, t) are random. A time step of the branching process has two parts. First, each particle takes one step of the Markov chain. A particle at state j goes to state k with probability Pjk. All steps for all particles are independent. Then, each particle at state k does a branching or birth/death step in which the particle is replaced by a random number of particles with expected number V(k). For example, if V(k) = 1/2, we could delete the particle (death) with probability one half. If V(k) = 2.8, we could keep the existing particle, add one new one, and then add a third with probability .8. All particles are treated independently. If there are m particles in state k before the birth/death step, the expected number after the birth/death step is V(k)m. The expected number of particles, g(k, t) = E[N(k, t)], satisfies (25).


When V(k) = 1 for all k there need be no birth or death. There will be just one particle, the path X(t). The number of particles at state k at time t, N(k, t), will be zero if X(t) ≠ k or one if X(t) = k. In fact, N(k, t) = I(k, t)(X). The expected values will be g(k, t) = E[N(k, t)] = E[I(k, t)] = u(k, t).

The branching process representation of (22) is possible when V(k) ≥ 0 for all k. Monte Carlo methods based on branching processes are more accurate than direct Monte Carlo in many cases.

2 Lattices, trees, and random walk

2.1. Introduction: Random walk on a lattice is an important example where the abstract theory of Markov chains is used. It is the simplest model of something randomly moving through space, with none of the subtlety of Brownian motion, though random walk on a lattice is a useful approximation to Brownian motion, and vice versa. The forward and backward equations take a specific simple form for lattice random walk, and it is often possible to calculate or approximate the solutions by hand. Boundary conditions will be applied at the boundaries of lattices, hence the name.

We pursue forward and backward equations for several reasons. First, they often are the best way to calculate expectations and hitting probabilities. Second, many theoretical qualitative properties of specific Markov chains are understood using backward or forward equations. Third, they help explain and motivate the partial differential equations that arise as backward and forward equations for diffusion processes.

2.2. Simple random walk: The state space for simple random walk is the integers, positive and negative. At each time, the walker has three choices: (A) move up one, (B) do not move, (C) move down one. The probabilities are P(A) = P(k → k + 1) = a, P(B) = P(X(t + 1) = X(t)) = b, and P(X(t + 1) = X(t) − 1) = c. Naturally, we need a, b, and c to be non-negative and a + b + c = 1. The transition matrix2 has b on the diagonal (Pkk = b for all k), a on the superdiagonal (Pk,k+1 = a for all k), and c on the subdiagonal. All other matrix elements Pjk are zero.

This Markov chain is homogeneous or translation invariant: the probabilities of moving up or down are independent of X(t). A translation by k is a shift of everything by k (I do not know why this is called "translation"). Translation invariance means, for example, that the probability of going from m to l in s steps is the same as the probability of going from m + k to l + k in s steps: P(X(t + s) = l | X(t) = m) = P(X(t + s) = l + k | X(t) = m + k). It is common to simplify general discussions by choosing k so that X(0) = 0. Mathematicians often say "without loss of generality" or "w.l.o.g." when doing so.

2This "matrix" is infinite when the state space is infinite. Matrix multiplication is still defined. For example, the k component of uP is given by (uP)k = ∑_j uj Pjk. This possibly infinite sum has only three nonzero terms when P is tridiagonal.


Often, particularly when discussing multidimensional random walk, we use x, y, etc. instead of j, k, etc. to denote lattice points (states of the Markov chain). Probabilists often use lower case Latin letters for general possible values of a random variable, while using the capital letter for the random variable itself. Thus, we might write Pxy = P(X(t + 1) = y | X(t) = x). As an exercise in definition unwrapping, review Lecture 1 and check that this is the same as P_{X(t),x} = P(X(t + 1) = x | Ft).

2.3. Gaussian approximation, drift, and volatility: We can write X(t + 1) = X(t) + Y(t), where P(Y(t) = 1) = a, P(Y(t) = 0) = b, and P(Y(t) = −1) = c. The random variables Y(t) are independent of each other because of the Markov property and homogeneity. Assuming (without loss of generality) that X(0) = 0, we have

X(t) = ∑_{s=0}^{t−1} Y(s) , (26)

which expresses X(t) as a sum of iid (independent and identically distributed) random variables. The central limit theorem then tells us that for large t, X(t) is approximately Gaussian with mean µt and variance σ^2 t, where µ = E[Y(t)] = a − c and σ^2 = var[Y(t)] = a + c − (a − c)^2. These are called drift and volatility3 respectively. The mean and variance of X(t) grow linearly in time with rates µ and σ^2 respectively. Figure 1 shows some probability distributions for simple random walk.
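A quick Monte Carlo sketch, using the parameters of Figure 1, that checks the drift and volatility formulas µ = a − c and σ^2 = a + c − (a − c)^2:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = 0.2, 0.2, 0.6           # up, stay, down probabilities
T, n_paths = 60, 100_000

steps = rng.choice([1, 0, -1], size=(n_paths, T), p=[a, b, c])
X_T = steps.sum(axis=1)

mu = a - c                        # drift per step
sigma2 = a + c - (a - c) ** 2     # variance per step
print(X_T.mean(), mu * T)         # both near -24 for these parameters
print(X_T.var(), sigma2 * T)      # both near 38.4
```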

2.4. Trees: Simple random walk can be thought of as a sequence of decisions. At each time you decide: up (A), stay (B), or down (C). A more general sequence of decisions is a decision tree. In a general decision tree, making choice A at time 0 then B at time one would have a different result than choosing first B then A. After t decisions, there could be 3^t different decision paths and results.

The simple random walk decision tree is recombining, which means that many different decision paths lead to the same X(t). For example, starting (w.l.o.g.) with X(0) = 0, the paths ABB, CAA, BBA, etc. all lead to X(3) = 1. A recombining tree is much smaller than a general decision tree. For simple random walk, after t steps there are 2t + 1 possible states, instead of up to 3^t. For t = 10, this is 21 instead of about 60 thousand.

2.5. Urn models: Urn models illustrate several features of more generalrandom walks. Unlike simple random walk, urn models are mean reverting andhave steady state probabilities that determine their large time behavior. We willcome back to them when we discuss scaling in future lectures.

The simple urn contains n balls that are identical except for their color. There are k red balls and n − k green ones. At each stage, someone chooses one of the balls at random, with each ball equally likely to be chosen. He or she replaces the chosen ball with a fresh ball that is red with probability p and green

3People use the term volatility in two distinct ways. In the Black-Scholes theory, volatility means something else.


[Figure 1: two histogram panels with parameters a = 0.20, b = 0.20, c = 0.60; horizontal axis k, vertical axis probability.]

Figure 1: The probability distributions after T = 8 (top) and T = 60 (bottom) steps for simple random walk. The smooth curve and circles represent the central limit theorem Gaussian approximation. The plots have different probability and k scales. Values not shown have very small probability.


with probability 1 − p. All choices are independent. The number of red balls decreases by one if he or she removes a red ball and returns a green one. This happens with probability (k/n) · (1 − p). Similarly, the k → k + 1 probability is ((n − k)/n) · p. In formal terms, the state space is the integers from 0 to n and the transition probabilities are

Pk,k−1 = k(1 − p)/n ,   Pkk = ((2p − 1)k + (1 − p)n)/n ,   Pk,k+1 = (n − k)p/n ,

Pjk = 0 otherwise.

If these formulas are right, then Pk,k−1 + Pkk + Pk,k+1 = 1.

2.6. Urn model steady state: For the simple urn model, the probabilities u(k, t) = P(X(t) = k) converge to steady state probabilities, v(k), as t → ∞. This is illustrated in Figure (2). The steady state probabilities are

v(k) = (n choose k) p^k (1 − p)^{n−k} .

The steady state probabilities have the property that if u(k, t) = v(k) for all k, then u(k, t + 1) = v(k) also for all k. This is statistical steady state because the probabilities have reached steady state values even though the states themselves keep changing, as in Figure (3). In matrix vector notation, we can form the row vector, v, with entries v(k). Then v is a statistical steady state if vP = v. It is no coincidence that v(k) is the probability of getting k red balls in n independent trials with probability p for each trial. The steady state expected number of red balls is

E_v[X] = np ,

where the notation E_v[ ] refers to expectation in the probability distribution v.
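A small sketch that builds the urn transition matrix from the formulas above and checks both that the rows sum to one and that the binomial distribution v is a statistical steady state (the values of n and p are arbitrary choices):

```python
import numpy as np
from math import comb

n, p = 30, 0.5
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    P[k, k] = ((2*p - 1)*k + (1 - p)*n) / n
    if k > 0:
        P[k, k - 1] = k*(1 - p) / n
    if k < n:
        P[k, k + 1] = (n - k)*p / n
print(np.allclose(P.sum(axis=1), 1.0))     # each row is a probability distribution

# binomial steady state v(k) = (n choose k) p^k (1-p)^(n-k)
v = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
print(np.allclose(v @ P, v))               # vP = v: statistical steady state
print((v * np.arange(n + 1)).sum())        # E_v[X] = n p = 15
```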

2.7. Urn model mean reversion: If we let m(t) be the expected value of X(t), then a calculation using the transition probabilities gives the relation

m(t + 1) = m(t) + (1/n)(np − m(t)) . (27)

This relation shows not only that m(t) = np is a steady state value (m(t) = np implies m(t + 1) = np), but also that m(t) → np as t → ∞ (if r(t) = m(t) − np, then r(t + 1) = αr(t) with |α| = |1 − 1/n| < 1).

Another way of expressing mean reversion will be useful in discussing stochastic differential equations later. Because the urn model is a Markov chain,

E [X(t + 1) | Ft] = E [X(t + 1) | X(t)]

Again using the transition probabilities, we get

E[X(t + 1) | Ft] = X(t) + (1/n)(np − X(t)) . (28)


[Figure 2: panel title n = 30, T = 6; horizontal axis k, vertical axis probability.]

Figure 2: The probability distributions for the simple urn model plotted every T time steps. The first curve is blue, low, and flat. The last one is red and most peaked in the center. The computation starts with each state being equally likely. Over time, states near the edges become less likely.


[Figure 3: panel title p = 0.5, n = 100; horizontal axis t, vertical axis X.]

Figure 3: A Monte-Carlo sampling of 11 paths from the simple urn model. At time t = 0 (the left edge), the paths are evenly spaced within the state space.


If X(t) > np, we have

E[∆X(t) | Ft] = E[X(t + 1) − X(t) | Ft] = (1/n)(np − X(t)) ,

which is negative. If X(t) < np, it is positive.

2.8. Boundaries: The terms boundary, interior, region, etc. as used in thegeneral discussion of Markov chain hitting probabilities come from applicationsin lattice Markov chains such as simple random walk. For example, the regionx > β has boundary x = β. The quantities

u(x, t) = P (X(t) = x and X(s) > β for 0 ≤ s ≤ t)

satisfy the forward equation (just (1) in this special case)

u(x, t + 1) = au(x − 1, t) + bu(x, t) + cu(x + 1, t)

for x > β together with the absorbing boundary condition u(β, t) = 0. We couldcreate a finite state space Markov chain by considering a region β < x < γ withsimple random walk in the interior together with absorbing boundaries at x = βand x = γ. Absorbing boundary conditions are also called Dirichlet boundaryconditions.

Another way to create a finite state space Markov chain is to put reflectingboundaries at x = β and x = γ. This chain has the same transition probabilitiesas ordinary random walk in the interior (β < x < γ). However, transitions fromβ to β − 1 are disallowed and replaced by transitions from β to β + 1. Thismeans changing the transition probabilities starting from x = β to

P (β → β−1) = Pβ,β−1 = 0 , P (β → β) = Pββ = b , P (β → β+1) = Pβ,β+1 = a+c .

The transition rules at x = γ are similarly changed to block γ → γ + 1 transi-tions. There is some freedom in defining the reflection rules at the boundaries.We could, for example, make P (β → β) = b + c and P (β → β + 1) = a, whichchanges the blocked transition to standing still rather than moving right. Wereturn to this point in discussing oblique reflection in multidimensional randomwalks and diffusions.
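A sketch of how one might assemble the finite state space transition matrix with absorbing or reflecting boundaries; the reflecting convention used here is the first one above, in which the blocked step is redirected across the boundary state:

```python
import numpy as np

def walk_matrix(a, b, c, n_states, boundary="absorbing"):
    """Transition matrix for simple random walk on states 0, ..., n_states - 1."""
    P = np.zeros((n_states, n_states))
    for x in range(1, n_states - 1):
        P[x, x + 1], P[x, x], P[x, x - 1] = a, b, c
    if boundary == "absorbing":
        P[0, 0] = P[-1, -1] = 1.0
    else:                                # reflecting
        P[0, 0], P[0, 1] = b, a + c      # blocked down-step becomes an up-step
        P[-1, -1], P[-1, -2] = b, a + c  # blocked up-step becomes a down-step
    return P

P = walk_matrix(0.2, 0.2, 0.6, 7, boundary="reflecting")
print(np.allclose(P.sum(axis=1), 1.0))   # rows still sum to one
```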

2.9. Multidimensional lattice: The unit square lattice in d dimensions is theset of d−tuples of integers (the set of integers is called Z):

x = (x1, . . . , xd) with xj ∈ Z for 1 ≤ j ≤ d .

The scaled square lattice, with lattice spacing h > 0, is the set of points hx = (hx1, . . . , hxd), where x are integer lattice points. In the present discussion, the scaling is irrelevant, so we use the unit lattice. We say that lattice points x and y are neighbors if

|xj − yj| ≤ 1 for all coordinates j = 1, . . . , d .


Stochastic Calculus Notes, Lecture 3
Last modified September 30, 2004

1 Martingales and stopping times

1.1. Introduction: Martingales and stopping times are important technical tools used in the study of stochastic processes such as Markov chains and diffusions. A martingale is a stochastic process that is always unpredictable in the sense that E[F_{t+t′} | Ft] = Ft (see below) if t′ > 0. A stopping time is a random "time", τ(ω), so that we know at time t whether to stop, i.e. the event τ ≤ t is measurable in Ft. These tools work well together because stopping a martingale at a stopping time still gives increments with mean zero: if t ≤ τ ≤ t′, then E[Fτ | Ft] = Ft. A central fact about the Ito calculus is that Ito integrals with respect to Brownian motion are martingales.

1.2. Stochastic process: Here is a more abstract definition of a discrete time stochastic process. We have a probability space, Ω. The information available at time t is represented by the algebra of events Ft. We assume that for each t, Ft ⊂ Ft+1; since we are supposed to gain information going from t to t + 1, every known event in Ft is also known at time t + 1. A stochastic process is a family of random variables, Xt(ω), with Xt ∈ Ft (Xt measurable with respect to Ft). Sometimes it happens that the random variables Xt contain all the information in the Ft in the sense that Ft is generated by X1, . . ., Xt. This is the minimal algebra in which the Xt form a stochastic process. In other cases Ft contains more information. Economists use these possibilities when they distinguish between the "weak efficient market hypothesis" (the Ft are minimal) and the "strong hypothesis" (Ft contains all the public information in the world, literally). In the case of minimal Ft, it may be possible to identify the outcome, ω, with the path X = X1, . . . , XT. This is less common when the Ft are not minimal, because the extra information may have to do with processes other than Xt. For the definition of stochastic process, the probabilities are not important, just the algebras of sets and "random variables" Xt. An expanding family of σ−algebras Ft ⊆ Ft+1 is a filtration.

1.3. Notation: The value of a stochastic process at time t may be writtenXt or X(t). The subscript notation reminds us that the Xt are a family offunctions of the random outcome (random variable) ω. In practical contexts,particularly in discussing multidimensional processes (X(t) ∈ Rn), we preferX(t) so that Xk(t) can represent the kth component of X(t). When the processis a martingale, we often call it Ft. This will allow us to let X(t) be a Markovchain and Ft(X) a martingale function of X.

1.4. Example 1, Markov chains: In this example, the Ft are minimal andΩ is the path space of sequences of length T from the state space, S. The


new information revealed at time t is the state of the chain at time t. The variables Xt may be called "coordinate functions" because Xt is coordinate t (or entry t) in the sequence X. In principle, we could express this with the notation Xt(X), but that would drive people crazy. Although we distinguish between Markov chains (discrete time) and Markov processes (continuous time), the term "stochastic process" can refer to either continuous or discrete time.

1.5. Example 2, dyadic sets: This is a set of definitions for discussing averages over a range of length scales. The "time" variable, t, represents the amount of averaging that has been done. The new information revealed at time t is finer scale information about a function (an audio signal or digital image). The state space is the positive integers from 1 to 2^T. We start with a function X(ω) and ask that Xt(ω) be constant on dyadic blocks of length 2^{T−t}. The dyadic blocks at level t are

Bt,k = { 1 + (k − 1)2^{T−t}, 2 + (k − 1)2^{T−t}, . . . , k · 2^{T−t} } . (1)

The reader should check that moving from level t to level t+1 splits each blockinto right and left halves:

Bt,k = Bt+1,2k−1 ∪Bt+1,2k . (2)

The algebras Ft are generated by the block partitions

Pt = { Bt,k with k = 1, . . . , 2^t } .

Because Ft ⊂ Ft+1, the Pt+1 is a refinement of Pt. The union (2) shows how.We will return to this example after discussing martingales.

1.6. Martingales: A real valued stochastic process, Ft, is a martingale1 if

E[Ft+1 | Ft] = Ft .

If we take the overall expectation of both sides we see that the expected value does not depend on t: E[Ft+1] = E[Ft]. The martingale property says more: whatever information you might have at time t, the conditional expectation of future values given that information is the present value. There is a gambling interpretation: Ft is the amount of money you have at time t. No matter what has happened, your expected winnings between t and t + 1, the "martingale difference" Yt+1 = Ft+1 − Ft, have zero expected value. You can also think of martingale differences as a generalization of independent mean zero random variables. If the random variables Yt were actually independent with mean zero, then the sums Ft = Y1 + · · · + Yt would form a martingale (using the Ft generated by Y1, . . ., Yt). The reader should check this.

1For finite Ω this is the whole story. For countable Ω we also assume that the sums defining E[Xt] converge absolutely. This means that E[|Xt|] < ∞. That implies that the conditional expectations E[Xt+1 | Ft] are well defined.


1.7. Examples: The simplest way to get a martingale is to start with a random variable, F(ω), and define Ft = E[F | Ft]. If we apply this to a Markov chain with the minimal filtration Ft, and F is a final time reward F = V(X(T)), then Ft = f(X(t), t) as in the previous lecture. If we apply this to Ω = {1, 2, . . . , 2^T}, with uniform probability P(k) = 2^{−T} for k ∈ Ω, and the dyadic filtration, we get the dyadic martingale with Ft(j) constant on the dyadic blocks (1) and equal to the average of F over the block containing j.

1.8. A lemma on conditional expectation: In working with martingales weoften make use of a basic lemma about conditional expectation. Suppose U(ω)and Y (ω) are real valued random variables and that U ∈ F . Then

E[UY | F ] = UE[Y | F ] . (3)

We see this using classical conditional expectation over the sets in the partition defining F. Let B be one of these sets. Let yB = E[Y | ω ∈ B] be the value of E[Y | F] for ω ∈ B. We know that U(ω) is constant in B because U ∈ F. Call this value uB. Then E[UY | B] = uB E[Y | B] = uB yB. But this is the value of U E[Y | F] for ω ∈ B. Since each ω is in some B, this proves (3) for all ω.

1.9. Doob's principle: This lemma lets us make new martingales from old ones. Let Ft be a martingale and Yt = Ft − Ft−1 the martingale differences (called innovations by statisticians and returns in finance). We use the convention that F−1 = 0 so that F0 = Y0. The martingale condition is that E[Yt+1 | Ft] = 0. Clearly Ft = Y0 + Y1 + · · · + Yt.

Suppose that at time t we are allowed to place a bet of any size2 on the asyet unknown martingale difference, Yt+1. Let Ut ∈ Ft be the size of the bet.The return from betting on Yt will be Ut−1Yt, and the total accumulated returnup to time t is

Gt = U0Y1 + U1Y2 + · · ·+ Ut−1Yt . (4)

Because of the lemma (3), the betting returns have E[UtYt+1 | Ft] = 0, soE[Gt+1 | Ft] = Gt and Gt also is a martingale.

The fact that Gt in (4) is a martingale sometimes is called Doob’s principleor Doob’s theorem after the probabilist who formulated it. A special case belowfor stopping times is Doob’s stopping time theorem or the optional stoppingtheorem. They all say that strategizing on a martingale never produces anythingbut a martingale. Nonanticipating strategies on martingales do not give positiveexpected returns.
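A simulation sketch of Doob's principle: bet on the increments of a symmetric random walk with a nonanticipating rule (the particular betting rule below is arbitrary) and check that the average accumulated return stays near zero:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_paths = 50, 200_000

Y = rng.choice([-1, 1], size=(n_paths, T + 1))    # martingale differences
F = Y.cumsum(axis=1)                              # the martingale F_t

# A nonanticipating strategy: bet 2 if the walk is below 0, else 1.
# U_t may depend only on F_0, ..., F_t, i.e. on information up to time t.
U = np.where(F[:, :-1] < 0, 2.0, 1.0)
G = (U * Y[:, 1:]).sum(axis=1)                    # G_T = sum of U_{t-1} Y_t

print(G.mean())   # near 0: strategizing on a martingale gives a martingale
```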

1.10. Weak and strong efficient market hypotheses: It is possible that therandom variables Ft form a martingale with respect to their minimal filtration,Ft, but not with respect to an enriched filtration Gt ⊃ Ft. The simplest examplewould be the algebras Gt = Ft+1, which already know the value of Ft+1 at timet. Note that the Ft also are a stochastic process with respect to the Gt. The

2We may have to require that the bet have finite expected value.


weak efficient market hypothesis is that e−µtSt is a martingale (St being thestock price and µ its expected growth rate) with respect to its minimal filtra-tion. Technical analysis means using trading strategies that are nonanticipatingwith respect to the minimal filtration. Therefore, the weak efficient market hy-pothesis says that technical trading does not produce better returns than buyand hold. Any extra information you might get by examining the price historyof S up to time t is already known by enough people that it is already reflectedin the price St.

The strong efficient market hypothesis states that e−µtSt is a martingalewith respect to the filtration, Gt, representing all the public information in theworld. This includes the previous price history of S and much more (prices ofrelated stocks, corporate reports, market trends, etc.).

1.11. Investing with Doob: Economists sometimes use Doob's principle and the efficient market hypotheses to make a point about active trading in the stock market. Suppose that Ft, the price of a stock at time t, is a martingale3. Suppose that at time t we use all the information in Ft and choose an amount, Ut, to invest at time t. The fact that the resulting accumulated return, Gt, has zero expected value is said to show that active investing is no better than a "buy and hold" strategy that just produces the value Ft. The well known book A Random Walk Down Wall Street is mostly an exposition of this point of view. This argument breaks down when applied to non martingale processes, such as stock prices over longer times. Active trading strategies such as (4) may reduce the risk more than enough to compensate risk averse investors for small amounts of lost expected value. Merton's optimal dynamic investment analysis is a simple example of an active trading strategy that is better for some people than passive buy and hold.

1.12. Stopping times: We have Ω and the expanding family Ft. A stoppingtime is a function τ(ω) that is one of the times 1, . . ., T , so that the eventτ ≤ t is in Ft. Stopping times might be thought of as possible strategies.Whatever your criterion for stopping is, you have enough information at time tto know whether you should stop at time t. Many stopping times are expressedas the first time something happens, such as the first time Xt > a. We cannotask to stop, for example, at the last t with Xt > a because we might not knowat time t whether Xt′ > a for some t′ > t.

1.13. Doob's stopping time theorem for one stopping time: Because stopping times are nonanticipating strategies, they also cannot make money from a martingale. One version of this statement is that E[Xτ] = E[X1]. The proof of this makes use of the events Bt = {τ = t}. The stopping time hypothesis is that Bt ∈ Ft. Since τ has some value 1 ≤ τ ≤ T, the Bt form a partition of Ω. Also, if ω ∈ Bt, τ(ω) = t, so Xτ = Xt. Therefore,

E[X1] = E[XT]
      = ∑_{t=1}^{T} E[XT | Bt] P(Bt)
      = ∑_{t=1}^{T} E[Xτ | Bt] P(τ = t)
      = E[Xτ] .

3This is a reasonable approximation for much short term trading.

In this derivation we made use of the classical statement of the martingale property: if B ∈ Ft then E[XT | B] = E[Xt | B]. For our B = Bt we have Xt = Xτ.

This simple idea, using the martingale property applied to the partition Bt, is crucial for much of the theory of martingales. The idea itself was first used by Kolmogorov in the context of random walk or Brownian motion. Doob realized that Kolmogorov's idea was even simpler and more beautiful when applied to martingales.

1.14. Stopping time paradox: The technical hypotheses above, finite state space, bounded stopping times, may be too strong, but they cannot be completely ignored, as this famous example shows. Let Xt be a symmetric random walk starting at zero. This forms a martingale, so E[Xτ] = 0 for any stopping time, τ. On the other hand, suppose we take τ = min(t | Xt = 1). Then Xτ = 1 always, so E[Xτ] = 1. The catch is that there is no T with τ(ω) ≤ T for all ω. Even though τ < ∞ "almost surely" (more to come on that expression), E[τ] = ∞ (explanation later). Even that would be OK if the possible values of Xt were bounded. Suppose you choose T and set τ′ = min(τ, T). That is, you wait until Xt = 1 or t = T, whichever comes first, to stop. For large T, it is very likely that you stopped because Xt = 1. Still, those paths that never reached 1 probably drifted just far enough in the negative direction so that their contribution to the overall expected value cancels the 1 to yield E[Xτ′] = 0.
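A quick simulation sketch of the truncated stopping time τ′ = min(τ, T): most paths stop at 1, yet the sample mean of Xτ′ stays near zero because the unstopped paths drift far negative:

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_paths = 400, 10_000

steps = rng.choice([-1, 1], size=(n_paths, T))
X = steps.cumsum(axis=1)                     # symmetric random walk paths

hit = (X == 1)                               # stop the first time X = 1 ...
stop = np.where(hit.any(axis=1), hit.argmax(axis=1), T - 1)   # ... or at time T
X_stopped = X[np.arange(n_paths), stop]

print((X_stopped == 1).mean())   # most paths do stop at 1
print(X_stopped.mean())          # yet the average is near 0, not 1
```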

1.15. More stopping times theorem: Suppose we have an increasing familyof stopping times, 1 ≤ τ1 ≤ τ2 · · ·. In a natural way the random variablesY1 = Xτ1 , Y2 = Xτ2 , etc. also form a martingale. This is a final elaborate wayof saying that strategizing on a martingale is a no win game.


Stochastic Calculus Notes, Lecture 4
Last modified October 4, 2004

1 Continuous probability

1.1. Introduction: Recall that a set Ω is discrete if it is finite or countable.We will call a set continuous if it is not discrete. Many of the probabilityspaces used in stochastic calculus are continuous in this sense (examples below).Kolmogorov1 suggested a general framework for continuous probability basedon abstract integration with respect to abstract probability measures. Thetheory makes it possible to discuss general constructions such as conditionalexpectation in a way that applies to a remarkably diverse set of examples.

The difference between continuous and discrete probability is the difference between integration and summation. Continuous probability cannot be based on the formula

P(A) = ∑_{ω∈A} P(ω) . (1)

Indeed, the typical situation in continuous probability is that any event consist-ing of a single outcome has probability zero: P (ω) = 0 for all ω ∈ Ω.

As we explain below, the classical formalism of probability densities also doesnot apply in many of the situations we are interested in. Abstract probabilitymeasures give a framework for working with probability in path space, as wellas more traditional discrete probability and probabilities given by densities onRn.

These notes outline Kolmogorov's formalism of probability measures for continuous probability. We leave out a great number of details and mathematical proofs. Attention to all these details would be impossible within our time constraints. In some cases we indicate where a precise definition or a complete proof is missing, but sometimes we just leave it out. If it seems like something is missing, it probably is.

1.2. Examples of continuous probability spaces: By definition, a probability space is a set, Ω, of possible outcomes, together with a σ−algebra, F, of measurable events. This section discusses only the sets Ω. The corresponding algebras are discussed below.

R, the real numbers. If x0 is a real number and u(x) is a probability density, then the probability of the event Br(x0) = {x0 − r ≤ X ≤ x0 + r} is

P([x0 − r, x0 + r]) = ∫_{x0−r}^{x0+r} u(x) dx → 0 as r → 0.

1The Russian mathematician Kolmogorov was active in the middle of the 20th century. Among his many lasting contributions to mathematics are the modern axioms of probability and some of its most important theorems. His theories of turbulent fluid flow anticipated modern fractals by several decades.


Thus the probability of any individual outcome is zero. An event withpositive probability (P (A) > 0) is made up entirely of outcomes x0 ∈ A,with P (x0) = 0. Because of countable additivity (see below), this is onlypossible when Ω is uncountable.

Rn, sequences of n numbers (possibly viewed as a row or column vector depend-ing on the context): X = (X1 . . . , Xn). Here too if there is a probabilitydensity then the probability of any given outcome is zero.

SN. Let S be the discrete state space of a Markov chain. The space ST is the set of sequences of length T of elements of S. An element of ST may be written x = (x(0), x(1), · · · , x(T − 1)), with each of the x(t) in S. It is common to write xt for x(t). An element of SN is an infinite sequence of elements of S. The "exponent" N stands for "natural numbers". We misuse this notation because ours start with t = 0 while the actual natural numbers start with t = 1. We use SN when we ask questions about an entire infinite trajectory. For example, the hitting probability is P(X(t) ≠ 1 for all t ≥ 0). Cantor proved that SN is not countable whenever the state space has more than one element. Generally, the probability of any particular infinite sequence is zero. For example, suppose the transition matrix has P11 = .6 and u0(1) = 1. Let x be the infinite sequence that never leaves state 1: x = (1, 1, 1, · · ·). Then P(x) = u0(1) · .6 · .6 · · ·. Multiplying together an infinite number of .6 factors should give the answer P(x) = 0. More generally, if the transition matrix has Pjk ≤ r < 1 for all (j, k), then P(x) = 0 for any single infinite path.

C([0, T ] → R), the path space for Brownian motion. The C stands for "continuous". The [0, T ] is the time interval 0 ≤ t ≤ T; the square brackets tell us to include the endpoints (0 and T in this case). Round parentheses (0, T) would mean to leave out 0 and T. The final R is the "target" space, the real numbers in this case. An element of Ω is a continuous function from the interval [0, T ] to R. This function could be called X(t) or Xt (for 0 ≤ t ≤ T). In this space we can ask questions such as P(∫_0^T X(t) dt > 4).

1.3. Probability measures: Let F be a σ−algebra of subsets of Ω. A probability measure is a way to assign a probability to each event A ∈ F. In discrete probability, this is done using (1). In Rn a probability density leads to a probability measure by integration

P(A) = ∫_A u(x) dx . (2)

There are still other ways to specify probabilities of events in path space. Allof these probability measures satisfy the same basic axioms.

Suppose that for each A ∈ F we have a number P (A). The numbers P (A)are a probability measure if

i. If A ∈ F and B ∈ F are disjoint events, then P (A ∪B) = P (A) + P (B).


ii. P (A) ≥ 0 for any event A ∈ F .

iii. P (Ω) = 1.

iv. If An ∈ F is a sequence of events, each disjoint from all the others, and ∪_{n=1}^∞ An = A, then ∑_{n=1}^∞ P(An) = P(A).

The last property is called countable additivity. It is possible to consider probability measures that are not countably additive, but this is not very useful.

1.4. Example 1, discrete probability: If Ω is discrete, we may take F to bethe set of all events (i.e. all subsets of Ω). If we know the probabilities of eachindividual outcome, then the formula (1) defines a probability measure. Theaxioms (i), (ii), and (iii) are clear. The last, countable additivity, can be verifiedgiven a solid undergraduate analysis course.

1.5. Borel sets: It is rare that one can define P (A) for all A ⊆ Ω. Usually,there are non measurable events whose probability one does not try to define(see below). This is not related to partial information, but is an intrinsic aspectof continuous probability. Events that are not measurable are quite artificial,but they are impossible to get rid of. In most applications in stochastic calculus,it is convenient to take the largest σ−algebra to be the Borel sets2

In a previous lecture we discussed how to generate a σ−algebra from a collection of sets. The Borel algebra is the σ−algebra that is generated by all balls. The open ball with center x0 and radius r > 0 in n dimensional space is Br(x0) = {x | |x − x0| < r}. A "ball" in one dimension is an interval. In two dimensions it is a disk. Note that the ball is solid, as opposed to the hollow sphere, Sr(x0) = {x | |x − x0| = r}. The condition |x − x0| ≤ r, instead of |x − x0| < r, defines a closed ball. The σ−algebra generated by open balls is the same as the σ−algebra generated by closed balls (check this if you wish).

1.6. Borel sets in path space: The definition of Borel sets works the same way in the path space of Brownian motion, C([0, T ], R). Let x0(t) and x(t) be two continuous functions of t. The distance between them in the "sup norm" is

‖x − x0‖ = sup_{0≤t≤T} |x(t) − x0(t)| .

We often use double bars to represent the distance between functions and singlebar absolute value signs to represent the distance between numbers or vectorsin Rn. As before, the open ball of radius r about a path x0 is the set of allpaths with ‖x− x0‖ < r.

1.7. The σ−algebra for Markov chain path space: There is a convenientlimit process that defines a useful σ−algebra on SN , the infinite time horizonpath space for a Markov chain. We have the algebras Ft generated by the first

2The larger σ−algebra of Lebesgue sets seems to more of a nuisance than a help, particularlyin discussing convergence of probability measures in path space.


t + 1 states x(0), x(1), . . ., x(t). We take F to be the σ−algebra generated by all these. Note that the event A = {X(t) ≠ 1 for all t ≥ 0} is not in any of the Ft. However, the event AT = {X(t) ≠ 1 for 0 ≤ t ≤ T} is in FT. Therefore A = ∩_{T≥0} AT must be in any σ−algebra that contains all the Ft. Also note that the union of all the Ft is an algebra of sets, though it is not a σ−algebra.

1.8. Generating a probability measure: Let M be a collection of eventsthat generates the σ−algebra F . Let A be the algebra of sets that are finiteintersections, unions, and complements of events in M. Clearly the σ−algebragenerated by M is the same as the one generated by A. The process of goingfrom the algebra A to the σ−algebra F is one of completion, adding all limitsof countable intersections or unions of events in A.

In order to specify P(A) for all A ∈ F, it suffices to give P(A) for all events A ∈ A. That is, if there is a countably additive probability measure P(A) for all A ∈ F, then it is completely determined by the numbers P(A) for those A ∈ A. Hopefully it is plausible that if the events in A generate those in F, then the probabilities of events in M determine the probabilities of events in F (proof omitted).

For example, in Rn if we specify P(A) for events described by finitely many balls, then we have determined P(A) for any Borel set. It might be that the numbers P(A) for A ∈ A are inconsistent with the axioms of probability (which is easy to check) or cannot be extended in a way that is countably additive to all of F (this does not happen in our examples), but otherwise the measure is determined.

1.9. Non measurable sets (technical aside): A construction demonstrates that non measurable sets are unavoidable. Let Ω be the unit circle. The simplest probability measure on Ω would seem to be uniform measure (divided by 2π so that P(Ω) = 1). This measure is rotation invariant: if A is a measurable event having probability P(A), then the event A + θ = {x + θ | x ∈ A} is measurable and has P(A + θ) = P(A). It is possible to construct a set B and a (countable) sequence of rotations, θn, so that the events B + θk and B + θn are disjoint if k ≠ n and ∪_n (B + θn) = Ω. This set cannot be measurable. If it were, and µ = P(B), then there would be two choices: µ = 0 or µ > 0. In the former case we would have P(Ω) = ∑_n P(B + θn) = ∑_n 0 = 0, which is not what we want. In the latter case, again using countable additivity, we would get P(Ω) = ∞.

The construction of the set B starts with a description of the θn. Write n in base ten, flip over the decimal point to get a number between 0 and 1, then multiply by 2π. For example, for n = 130, we get θn = θ130 = 2π · .031. Now use the θn to create an equivalence relation and partition of Ω by setting x ∼ y if x = y + θn (mod 2π) for some n. The reader should check that this is an equivalence relation (x ∼ y → y ∼ x, and x ∼ y and y ∼ z → x ∼ z). Now, let B be a set that has exactly one representative from each of the equivalence classes in the partition. Any x ∈ Ω is in one of the equivalence classes, which means that there is a y ∈ B (the representative of the x equivalence class) and an n so that y + θn = x. That means that any x ∈ Ω has x ∈ B + θn for some n, which is to say that ∪_n (B + θn) = Ω. To see that B + θk is disjoint from


B + θn when k ≠ n, suppose that x ∈ B + θk and x ∈ B + θn. Then x = y + θk and x = z + θn for y ∈ B and z ∈ B. But (and this is the punch line) this would mean y ∼ z, which is impossible because B has only one representative from each equivalence class. The possibility of selecting a single element from each partition element without having to say how it is to be done is the axiom of choice.

1.10. Probability densities in Rn: Suppose u(x) is a probability density in Rn. If A is an event made from finitely many balls (or rectangles) by set operations, we can define P(A) by integrating, as in (2). This leads to a probability measure on Borel sets corresponding to the density u. Deriving the probability measure from a probability density does not seem to work in path space because there is nothing like the Riemann integral to use in (2).3 Therefore, we describe path space probability measures directly rather than through probability densities.

1.11. Measurable functions: Let Ω be a probability space with a σ−algebra F. Let f(ω) be a function defined on Ω. In discrete probability, f was measurable with respect to F if the sets Ba = {ω | f(ω) = a} all were measurable. In continuous probability, this definition is replaced by the condition that the sets Aab = {ω | a ≤ f(ω) ≤ b} are measurable. Because F is closed under countable unions, and because the event {a < f} is the (countable) union of the events {a + 1/n ≤ f}, this is the same as requiring all the sets Aab = {ω | a < f(ω) < b} to be measurable. If Ω is discrete (finite or countable), then the two definitions of measurable function agree.

In continuous probability, the notion of measurability of a function withrespect to a σ−algebra plays two roles. The first, which is purely technical,is that f is sufficiently “regular” (meaning not crazy) that abstract integrals(defined below) make sense for it. The second, particularly for smaller algebrasG ⊂ F , again involves incomplete information. A function that is measurablewith respect to G not only needs to be regular, but also must depend on fewervariables (possibly in some abstract sense).

1.12. Integration with respect to a measure: The definition of integration with respect to a general probability measure is easier than the definition of the Riemann integral. The integral is written

E[f] = ∫_{ω∈Ω} f(ω) dP(ω) .

We will see that in Rn with a density u, this agrees with the classical definition

E[f] = ∫_{Rn} f(x) u(x) dx ,

3The Feynman integral in path space has some properties of true integrals but lacks others. The probabilist Mark Kac (pronounced "cats") discovered that Feynman's ideas applied to the heat equation rather than the Schrodinger equation can be interpreted as integration with respect to Wiener measure. This is now called the Feynman-Kac formula.


if we write dP (x) = u(x)dx. Note that the abstract variable ω is replaced bythe concrete variable, x, in this more concrete situation. The general definitionis forced on us once we make the natural requirements

i. If A ∈ F is any event, then E[1A] = P(A). The integral of the indicator function of an event is the probability of that event.

ii. If f1 and f2 have f1(ω) ≤ f2(ω) for all ω ∈ Ω, then E[f1] ≤ E[f2]. “Integra-tion is monotone”.

iii. For any reasonable functions f1 and f2 (e.g. bounded), we have E[af1 +bf2] = aE[f1] + bE[f2]. (Linearity of integration).

iv. If fn(ω) is an increasing family of positive functions converging pointwise to f (fn(ω) ≥ 0 and fn+1(ω) ≥ fn(ω) for all n, and fn(ω) → f(ω) as n → ∞ for all ω), then E[fn] → E[f] as n → ∞. (This form of countable additivity for abstract probability integrals is called the monotone convergence theorem.)

A function is a simple function if there are finitely many events Ak, and weights wk, so that f = ∑_k wk 1_{Ak}. Properties (i) and (iii) imply that the expectation of a simple function is

E[f] = ∑_k wk P(Ak) .

We can approximate general functions by simple functions to determine theirexpectations.

Suppose f is a nonnegative bounded function: 0 ≤ f(ω) ≤ M for all ω ∈ Ω. Choose a small number ε = 2^{−n} and define the4 "ring sets" Ak = {(k − 1)ε ≤ f < kε}. The Ak depend on ε but we do not indicate that. Although the events Ak might be complicated, fractal, or whatever, each of them is measurable. A simple function that approximates f is fn(ω) = ∑_k (k − 1)ε 1_{Ak}. This fn takes the value (k − 1)ε on the sets Ak. The sum defining fn is finite because f is bounded, though the number of terms is M/ε. Also, fn(ω) ≤ f(ω) for each ω ∈ Ω (though by at most ε). Property (ii) implies that

E[f] ≥ E[fn] = ∑_k (k − 1)ε P(Ak) .

In the same way, we can consider the upper function gn = ∑_k kε 1_{Ak} and have

E[f] ≤ E[gn] = ∑_k kε P(Ak) .

The reader can check that fn ≤ fn+1 ≤ f ≤ gn+1 ≤ gn and that gn − fn ≤ ε.Therefore, the numbers E[fn] form an increasing sequence while the E[gn] are a

4Take f = f(x, y) = x2 + y2 in the plane to see why we call them ring sets.


decreasing sequence converging to the same number, which is the only possiblevalue of E[f ] consistent with (i), (ii), and (iii).
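A tiny numerical sketch of the lower and upper simple function approximations, for the uniform measure on [0, 1] and the (arbitrarily chosen) bounded function f(ω) = ω^2, whose integral is 1/3:

```python
import numpy as np

omega = np.linspace(0.0, 1.0, 1_000_001)   # fine grid standing in for ([0,1], uniform)
f = omega**2                                # bounded nonnegative function, M = 1

for n in (2, 4, 8):
    eps = 2.0**-n
    k = np.floor(f / eps)                   # this omega lies in the ring set A_{k+1}
    lower = np.mean(k * eps)                # E[f_n] = sum over k of (k-1) eps P(A_k)
    upper = np.mean((k + 1) * eps)          # E[g_n] = sum over k of  k    eps P(A_k)
    print(n, lower, upper)                  # both approach E[f] = 1/3
```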

It is sometimes said that the difference between classical (Riemann) integration and abstract integration (here) is that the Riemann integral cuts the x axis into little pieces, while the abstract integral cuts the y axis (which is what the simple function approximations amount to).

If the function f is positive but not bounded, it might happen that E[f] = ∞. The "cut off" functions, fM(ω) = min(f(ω), M), might have E[fM] → ∞ as M → ∞. If so, we say E[f] = ∞. Otherwise, property (iv) implies that E[f] = lim_{M→∞} E[fM]. If f is both positive and negative (for different ω), we integrate the positive part, f+(ω) = max(f(ω), 0), and the negative part, f−(ω) = min(f(ω), 0), separately and subtract the results. We do not attempt a definition if E[f+] = ∞ and E[f−] = −∞. We omit the long process of showing that these definitions lead to an integral that actually has the properties (i) - (iv).

1.13. Markov chain probability measures on SN: Let A = ∪_{t≥0} Ft as before. The probability of any A ∈ A is given by the probability of that event in Ft if A ∈ Ft. Therefore P(A) is given by a formula like (1) for any A ∈ A. A theorem of Kolmogorov states that the completion of this measure to all of F makes sense and is countably additive.

1.14. Conditional expectation: We have a random variable X(ω) that is measurable with respect to the σ−algebra, F. We have a σ−algebra that is a sub-algebra: G ⊂ F. We want to define the conditional expectation Y = E[X | G]. In discrete probability this is done using the partition defined by G. Here the partition is less useful because it probably is uncountable, and because each partition element, B(ω) = ∩A (the intersection being over all A ∈ G with ω ∈ A), may have P(B(ω)) = 0 (examples below). This means that we cannot apply Bayes' rule directly.

The definition is that Y (ω) is the random variable measurable with respectto G that best approximates X in the least squares sense

E[(Y − X)^2] = min_{Z∈G} E[(Z − X)^2] .

This is one of the definitions we gave before, the one that works for continuousand discrete probability. In the theory, it is possible to show that there is aminimizer and that it is unique.

1.15. Generating a σ−algebra: When the probability space, Ω, is finite, we can understand an algebra of sets by using the partition of Ω that generates the algebra. This is not possible for continuous probability spaces. Another way to specify an algebra for finite Ω was to give a function X(ω), or a collection of functions Xk(ω), that are supposed to be measurable with respect to F. We noted that any function measurable with respect to the algebra generated by functions Xk is actually a function of the Xk. That is, if F ∈ F (abuse of


notation), then there is some function u(x1, . . . , xn) so that

F (ω) = u(X1(ω), . . . , Xn(ω)) . (3)

The intuition was that F contains the information you get by knowing thevalues of the functions Xk. Any function measurable with respect to this alge-bra is determined by knowing the values of these functions, which is preciselywhat (3) says. This approach using functions is often convenient in continuousprobability.

If Ω is a continuous probability space, we may again specify functions Xk that we want to be measurable. Again, these functions generate an algebra, a σ−algebra, F. If F is measurable with respect to this algebra then there is a (Borel measurable) function u(x1, . . .) so that F(ω) = u(X1, . . .), as before. In fact, it is possible to define F in this way. Saying that A ∈ F is the same as saying that 1A is measurable with respect to F. If u(x1, . . .) is a Borel measurable function that takes values only 0 or 1, then the function F defined by (3) defines a function that also takes only 0 or 1. The event A = {ω | F(ω) = 1} has (obviously) F = 1A. The σ−algebra generated by the Xk is the set of events that may be defined in this way. A complete proof of this would take a few pages.

1.16. Example in two dimensions: Suppose Ω is the unit square in twodimensions: (x, y) ∈ Ω if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The “x coordinate function”is X(x, y) = x. The information in this is the value of the x coordinate, but notthe y coordinate. An event measurable with respect to this F will be any eventdetermined by the x coordinate alone. I call such sets “bar code” sets. You cansee why by drawing some.

1.17. Marginal density and total probability: The abstract situation is thatwe have a probability space, Ω with generic outcome ω ∈ Ω. We have somefunctions (X1(ω), . . . , Xn(ω)) = X(ω). With Ω in the background, we can askfor the joint PDF of (X1, . . . , Xn), written u(x1, . . . , xn). A formal definition ofu would be that if A ⊆ Rn, then

P(X(ω) ∈ A) = ∫_{x∈A} u(x) dx . (4)

Suppose we neglect the last variable, Xn, and consider the reduced vector X̄(ω) = (X1, . . . , Xn−1) with probability density ū(x1, . . . , xn−1). This ū is the "marginal density" and is given by integrating u over the forgotten variable:

ū(x1, . . . , xn−1) = ∫_{−∞}^{∞} u(x1, . . . , xn) dxn . (5)

This is a continuous probability analogue of the law of total probability: in-tegrate (or sum) over a complete set of possibilities, all values of xn in thiscase.


We can prove (5) from (4) by considering a set B ⊆ Rn−1 and the corresponding set A ⊆ Rn given by A = B × R (i.e. A is the set of all pairs (x̄, xn) with x̄ = (x1, . . . , xn−1) ∈ B). The definition of A from B is designed so that P(X ∈ A) = P(X̄ ∈ B). With this notation,

P(X̄ ∈ B) = P(X ∈ A)
          = ∫_A u(x) dx
          = ∫_{x̄∈B} ∫_{xn=−∞}^{∞} u(x̄, xn) dxn dx̄
P(X̄ ∈ B) = ∫_B ū(x̄) dx̄ .

This is exactly what it means for ū to be the PDF for X̄.

1.18. Classical conditional expectation: Again in the abstract setting ω ∈ Ω, suppose we have random variables (X1(ω), . . . , Xn(ω)). Now consider a function f(x1, . . . , xn), its expected value E[f(X)], and the conditional expectations

v(xn) = E[f(X) | Xn = xn] .

The Bayes' rule definition of v(xn) has some trouble because both the denominator, P(Xn = xn), and the numerator,

E[f(X) · 1_{Xn=xn}] ,

are zero. The classical solution to this problem is to replace the exact condition Xn = xn with an approximate condition having positive (though small) probability: xn ≤ Xn ≤ xn + ε. We use the approximation

∫_{xn}^{xn+ε} g(x̄, ξn) dξn ≈ ε g(x̄, xn) .

The error is roughly proportional to ε^2 and much smaller than either of the terms above. With this approximation the numerator in Bayes' rule is

E[f(X) · 1_{xn≤Xn≤xn+ε}] = ∫_{x̄∈Rn−1} ∫_{ξn=xn}^{xn+ε} f(x̄, ξn) u(x̄, ξn) dξn dx̄
                         ≈ ε ∫_{x̄} f(x̄, xn) u(x̄, xn) dx̄ .

Similarly, the denominator is

P(xn ≤ Xn ≤ xn + ε) ≈ ε ∫_{x̄} u(x̄, xn) dx̄ .


If we take the Bayes' rule quotient and let ε → 0, we get the classical formula

E[f(X) | Xn = xn] = ( ∫_{x̄} f(x̄, xn) u(x̄, xn) dx̄ ) / ( ∫_{x̄} u(x̄, xn) dx̄ ) . (6)

By taking f to be the characteristic function of an event (all possible events) we get a formula for the probability density of X̄ given that Xn = xn, namely

u(x̄ | Xn = xn) = u(x̄, xn) / ∫_{x̄} u(x̄, xn) dx̄ . (7)

This is the classical formula for conditional probability density. The integral in the denominator ensures that, for each xn, u(· | Xn = xn) is a probability density as a function of x̄, that is,

∫ u(x̄ | Xn = xn) dx̄ = 1 ,

for any value of xn. It is very useful to notice that, as functions of x̄, u(x̄, xn) and u(x̄ | Xn = xn) are almost the same: they differ only by a constant normalization. For example, this is why conditioning Gaussians gives Gaussians.
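A numerical sketch of (6) for a made-up example, a standard bivariate normal density with correlation ρ, where the known answer is E[X1 | X2 = x2] = ρ x2:

```python
import numpy as np

rho, x2 = 0.7, 1.5                      # correlation; we condition on X2 = 1.5
x1 = np.linspace(-8.0, 8.0, 4001)       # grid over the kept variable

# joint density u(x1, x2) of a standard bivariate normal with correlation rho
u = np.exp(-(x1**2 - 2*rho*x1*x2 + x2**2) / (2*(1 - rho**2)))
u /= 2*np.pi*np.sqrt(1 - rho**2)

f = x1                                  # f(x) = x1, so (6) gives E[X1 | X2 = x2]
print((f * u).sum() / u.sum())          # ratio of Riemann sums; about rho*x2 = 1.05
```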

1.19. Modern conditional expectation: The classical conditional expectation (6) and conditional probability (7) formulas are the same as what comes from the "modern" definition from paragraph 1.6. Suppose X = (X1, . . . , Xn) has density u(x), F is the σ−algebra of Borel sets, and G is the σ−algebra generated by Xn (which might be written Xn(X), thinking of X as ω in the abstract notation). For any f(x), we have f̄(xn) = E[f | G]. Since G is generated by Xn, the function f̄ being measurable with respect to G is the same as its being a function of xn. The modern definition of f̄(xn) is that it minimizes

∫_{Rn} ( f(x) − f̄(xn) )^2 u(x) dx , (8)

over all functions that depend only on xn (measurable in G).

To see the formula (6) emerge, again write x = (x̄, xn), so that f(x) = f(x̄, xn), and u(x) = u(x̄, xn). The integral (8) is then

xn=−∞

∫x∈Rn−1

(f(x, xn)− f(xn)

)2

u(x, xn)dxdxn .

In the inner integral:

R(xn) = ∫_{x̄∈Rn−1} ( f(x̄, xn) − f̄(xn) )^2 u(x̄, xn) dx̄ ,

f̄(xn) is just a constant. We find the value of f̄(xn) that minimizes R(xn) by minimizing the quantity

∫_{x̄∈Rn−1} ( f(x̄, xn) − g )^2 u(x̄, xn) dx̄ = ∫ f(x̄, xn)^2 u(x̄, xn) dx̄ − 2g ∫ f(x̄, xn) u(x̄, xn) dx̄ + g^2 ∫ u(x̄, xn) dx̄ .


The optimal g is given by the classical formula (6).

1.20. Modern conditional probability: We already saw that the modern approach to conditional probability for G ⊂ F is through conditional expectation. In its most general form, for every (or almost every) ω ∈ Ω, there should be a probability measure Pω on Ω so that the mapping ω → Pω is measurable with respect to G. The measurability condition probably means that for every event A ∈ F the function pA(ω) = Pω(A) is a G measurable function of ω. In terms of these measures, the conditional expectation f̄ = E[f | G] would be f̄(ω) = Eω[f]. Here Eω means the expected value using the probability measure Pω. There are many such subscripted expectations coming.

A subtle point here is that the conditional probability measures are defined on the original probability space, Ω. This forces the measures to "live" on tiny (generally measure zero) subsets of Ω. For example, if Ω = Rn and G is generated by xn, then the conditional expectation value f̄(xn) is an average of f (using density u) only over the hyperplane Xn = xn. Thus, the conditional probability measures PX depend only on xn, leading us to write Pxn. Since f̄(xn) = ∫ f(x) dPxn(x), and f̄(xn) depends only on values of f(x̄, xn) with the last coordinate fixed, the measure dPxn is some kind of δ measure on that hyperplane. This point of view is useful in many advanced problems, but we will not need it in this course (I sincerely hope).

1.21. Semimodern conditional probability: Here is an intermediate "semimodern" version of conditional probability density. We have Ω = R^n, and Ω̃ = R^{n−1} with elements x̃ = (x_1, . . . , x_{n−1}). For each x_n, there will be a (conditional) probability density function u_{x_n}. Saying that u depends only on x_n is the same as saying that the function x → u_{x_n} is measurable with respect to G. The conditional expectation formula (6) may be written

E[f | G](x_n) = ∫_{R^{n−1}} f(x̃, x_n) u_{x_n}(x̃) dx̃ .

In other words, the classical u(x̃ | X_n = x_n) of (7) is the same as the semimodern u_{x_n}(x̃).
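The remark that conditioning Gaussians gives Gaussians is easy to see numerically. Here is a minimal sketch (Python with numpy; the parameters and the conditioning window are arbitrary illustrative choices, not part of the formulas above). It keeps the samples of a correlated pair (X_1, X_2) with X_2 near a fixed x_2 and compares the conditional mean and variance with the standard bivariate normal formulas µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2) and σ_1²(1 − ρ²).

import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 0.0, 0.0, 1.0, 2.0, 0.6   # arbitrary bivariate normal parameters
x2, eps, n = 1.0, 0.02, 2_000_000                 # condition on X2 in [x2, x2 + eps]

Z = rng.standard_normal((n, 2))
X1 = mu1 + s1 * Z[:, 0]
X2 = mu2 + s2 * (rho * Z[:, 0] + np.sqrt(1 - rho**2) * Z[:, 1])

sel = X1[(X2 >= x2) & (X2 <= x2 + eps)]           # crude numerical version of conditioning
print("empirical conditional mean, var:", sel.mean(), sel.var())
print("bivariate normal formulas:      ", mu1 + rho * s1 / s2 * (x2 - mu2), s1**2 * (1 - rho**2))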

2 Gaussian Random Variables

The central limit theorem (CLT) makes Gaussian random variables important. A generalization of the CLT is Donsker's "invariance principle" that gives Brownian motion as a limit of random walk. In many ways Brownian motion is a multivariate Gaussian random variable. We review multivariate normal random variables and the corresponding linear algebra as a prelude to Brownian motion.

2.1. Gaussian random variables, scalar: The one dimensional "standard normal", or Gaussian, random variable is a scalar with probability density

u(x) = (1/√(2π)) e^{−x²/2} .

The normalization factor 1/√(2π) makes ∫_{−∞}^{∞} u(x) dx = 1 (a famous fact). The mean value is E[X] = 0 (the integrand x e^{−x²/2} is antisymmetric about x = 0). The variance is (using integration by parts)

E[X²] = (1/√(2π)) ∫_{−∞}^{∞} x² e^{−x²/2} dx
      = (1/√(2π)) ∫_{−∞}^{∞} x ( x e^{−x²/2} ) dx
      = −(1/√(2π)) ∫_{−∞}^{∞} x ( (d/dx) e^{−x²/2} ) dx
      = −(1/√(2π)) [ x e^{−x²/2} ]_{−∞}^{∞} + (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx
      = 0 + 1 .

Similar calculations give E[X⁴] = 3, E[X⁶] = 15, and so on. I will often write Z for a standard normal random variable. A one dimensional Gaussian random variable with mean E[X] = µ and variance var(X) = E[(X − µ)²] = σ² has density

u(x) = (1/√(2πσ²)) e^{−(x−µ)²/2σ²} .

It is often more convenient to think of Z as the random variable (like ω) and write X = µ + σZ. We write X ∼ N(µ, σ²) to express the fact that X is normal (Gaussian) with mean µ and variance σ². The standard normal random variable is Z ∼ N(0, 1).

2.2. Multivariate normal random variables: The n×n matrix H is positive definite if x*Hx > 0 for any n component column vector x ≠ 0. It is symmetric if H* = H. A symmetric matrix is positive definite if and only if all its eigenvalues are positive. Since the inverse of a symmetric matrix is symmetric, the inverse of a symmetric positive definite (SPD) matrix is also SPD. An n component random variable is a mean zero multivariate normal if it has a probability density of the form

u(x) = (1/z) e^{−x*Hx/2} ,

for some SPD matrix, H. We can get mean µ = (µ_1, . . . , µ_n)* either by taking X + µ where X has mean zero, or by using the density with x*Hx replaced by (x − µ)*H(x − µ).

If X ∈ R^n is multivariate normal and if A is an m × n matrix with rank m, then Y ∈ R^m given by Y = AX is also multivariate normal. Both the cases m = n (same number of X and Y variables) and m < n occur.

2.3. Diagonalizing H: Suppose the eigenvalues and eigenvectors of H are Hv_j = λ_j v_j. We can express x ∈ R^n as a linear combination of the v_j either in vector form, x = Σ_{j=1}^{n} y_j v_j, or in matrix form, x = V y, where V is the n × n matrix whose columns are the v_j and y = (y_1, . . . , y_n)*. Since the eigenvectors of a symmetric matrix are orthogonal to each other, we may normalize them so that v_j* v_k = δ_{jk}, which is the same as saying that V is an orthogonal matrix, V*V = I. In the y variables, the "quadratic form" x*Hx is diagonal, as we can see using the vector or the matrix notation. With vectors, the trick is to use the two expressions x = Σ_{j=1}^{n} y_j v_j and x = Σ_{k=1}^{n} y_k v_k, which are the same since j and k are just summation variables. Then we can write

x*Hx = ( Σ_{j=1}^{n} y_j v_j )* H ( Σ_{k=1}^{n} y_k v_k )
     = Σ_{jk} ( v_j* H v_k ) y_j y_k
     = Σ_{jk} λ_k v_j* v_k y_j y_k
     = Σ_k λ_k y_k² .   (9)

The matrix version of the eigenvector/eigenvalue relations is V*HV = Λ (Λ being the diagonal matrix of eigenvalues). With this we have x*Hx = (V y)*H(V y) = y*(V*HV)y = y*Λy. A diagonal matrix in the quadratic form is equivalent to having a sum involving only squares λ_k y_k². All the λ_k will be positive if H is positive definite. For future reference, also remember that det(H) = Π_{k=1}^{n} λ_k.

2.4. Calculations using the multivariate normal density: We use the y variables as new integration variables. The point is that if the quadratic form is diagonal the multiple integral becomes a product of one dimensional Gaussian integrals that we can do. For example,

∫_{R²} e^{−(λ_1 y_1² + λ_2 y_2²)/2} dy_1 dy_2 = ∫_{y_1=−∞}^{∞} ∫_{y_2=−∞}^{∞} e^{−(λ_1 y_1² + λ_2 y_2²)/2} dy_1 dy_2
   = ∫_{y_1=−∞}^{∞} e^{−λ_1 y_1²/2} dy_1 · ∫_{y_2=−∞}^{∞} e^{−λ_2 y_2²/2} dy_2
   = √(2π/λ_1) · √(2π/λ_2) .

Ordinarily we would need a Jacobian determinant representing |dx/dy|, but here the determinant is det(V) = 1, for an orthogonal matrix. With this we can find the normalization constant, z, by

1 = ∫ u(x) dx
  = (1/z) ∫ e^{−x*Hx/2} dx
  = (1/z) ∫ e^{−y*Λy/2} dy
  = (1/z) ∫ exp( −(1/2) Σ_{k=1}^{n} λ_k y_k² ) dy
  = (1/z) ∫ ( Π_{k=1}^{n} e^{−λ_k y_k²/2} ) dy
  = (1/z) Π_{k=1}^{n} ( ∫_{y_k=−∞}^{∞} e^{−λ_k y_k²/2} dy_k )
  = (1/z) Π_{k=1}^{n} √(2π/λ_k)
1 = (1/z) · (2π)^{n/2} / √(det(H)) .

This gives a formula for z, and the final formula for the multivariate normal density

u(x) = ( √(det H) / (2π)^{n/2} ) e^{−x*Hx/2} .   (10)
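As a sanity check on (10), the following minimal sketch (Python with numpy; the matrix H is an arbitrary SPD example, not taken from the notes) integrates the density numerically on a grid in two dimensions and confirms that it integrates to about 1.

import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])                               # a small SPD example
z = (2 * np.pi) ** (2 / 2) / np.sqrt(np.linalg.det(H))   # (2*pi)^{n/2} / sqrt(det H), n = 2

# grid quadrature of u(x) = (1/z) exp(-x^T H x / 2) over a box large enough to hold the mass
s = np.linspace(-8, 8, 801)
dx = s[1] - s[0]
X, Y = np.meshgrid(s, s, indexing="ij")
quad = H[0, 0] * X**2 + 2 * H[0, 1] * X * Y + H[1, 1] * Y**2
u = np.exp(-0.5 * quad) / z
print("integral of u over the grid:", u.sum() * dx * dx)  # should be close to 1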

2.5. The covariance, by direct integration: We can calculate the covariance matrix of the X_j. The jk element of E[XX*] is E[X_j X_k] = cov(X_j, X_k). The covariance matrix consisting of all these elements is C = E[XX*]. Note the conflict of notation with the constant C above. A direct way to evaluate C is to use the density (10):

C = ∫_{R^n} x x* u(x) dx = ( √(det H) / (2π)^{n/2} ) ∫_{R^n} x x* e^{−x*Hx/2} dx .

Note that the integrand is an n × n matrix. Although each particular xx* has rank one, the average of all of them will be a nonsingular positive definite matrix, as we will see. To work the integral, we use the x = V y change of variables above. This gives

C = ( √(det H) / (2π)^{n/2} ) ∫_{R^n} (V y)(V y)* e^{−y*Λy/2} dy .

We use (V y)(V y)* = V (yy*)V* and take the constant matrices V outside the integral. This gives C as the product of three matrices, first V, then an integral involving yy*, then V*. So, to calculate C, we can calculate all the matrix elements

B_{jk} = ( √(det H) / (2π)^{n/2} ) ∫_{R^n} y_j y_k e^{−y*Λy/2} dy .

Clearly, if j ≠ k, B_{jk} = 0, because the integrand is an odd (antisymmetric) function, say, of y_j. The diagonal elements B_{kk} may be found using the fact that the integrand is a product:

B_{kk} = ( √(det H) / (2π)^{n/2} ) Π_{j≠k} ( ∫_{y_j} e^{−λ_j y_j²/2} dy_j ) · ∫_{y_k} y_k² e^{−λ_k y_k²/2} dy_k .

As before, the λ_j factors (for j ≠ k) integrate to √(2π/λ_j). The λ_k factor integrates to √(2π)/λ_k^{3/2}. The λ_k factor differs from the others only by a factor 1/λ_k. Most of these factors combine to cancel the normalization. All that is left is

B_{kk} = 1/λ_k .

This shows that B = Λ^{−1}, so

C = V Λ^{−1} V* .

Finally, since H = V ΛV*, we see that

C = H^{−1} .   (11)

The covariance matrix is the inverse of the matrix defining the multivariate normal.

2.6. Linear functions of multivariate normals: A fundamental fact about multivariate normals is that a linear transformation of a multivariate normal is also multivariate normal, provided that the transformation is onto. Let A be an m × n matrix with m ≤ n. This A defines a linear transformation y = Ax. The transformation is "onto" if, for every y ∈ R^m, there is at least one x ∈ R^n with Ax = y. If n = m, the transformation is onto if and only if A is invertible (det(A) ≠ 0), and the only x is A^{−1}y. If m < n, A is onto if its m rows are linearly independent. In this case, the set of solutions is a "hyperplane" of dimension n − m. Either way, the fact is that if X is an n dimensional multivariate normal and Y = AX, then Y is an m dimensional multivariate normal. Given this, we can completely determine the probability density of Y by calculating its mean and covariance matrix. Writing µ_X and µ_Y for the means of X and Y respectively, we have

µ_Y = E[Y] = E[AX] = A E[X] = A µ_X .

Similarly, if E[Y] = 0, we have

C_Y = E[Y Y*] = E[(AX)(AX)*] = E[A X X* A*] = A E[XX*] A* = A C_X A* .

The reader should verify that if C_X is n × n, then this formula gives a C_Y that is m × m. The reader should also be able to derive the formula for C_Y in terms of C_X without assuming that µ_Y = 0. We will soon give the proof that linear functions of Gaussians are Gaussian.

2.7. Uncorrelation and independence: The inverse of a symmetric matrix is another symmetric matrix. Therefore, C_X is diagonal if and only if H is diagonal. If H is diagonal, the probability density function given by (10) is a product of densities for the components. We have already used that fact and will use it more below. For now, just note that C_X is diagonal if and only if the components of X are uncorrelated. Then C_X being diagonal implies that H is diagonal and the components of X are independent. The fact that uncorrelated components of a multivariate normal are actually independent firstly is a property only of Gaussians, and secondly has curious consequences. For example, suppose Z_1 and Z_2 are independent standard normals and X_1 = Z_1 + Z_2 and X_2 = Z_1 − Z_2; then X_1 and X_2, being uncorrelated, are independent of each other. This may seem surprising in view of the fact that increasing Z_1 by 1/2 increases both X_1 and X_2 by the same 1/2. If Z_1 and Z_2 were independent uniform random variables (PDF u(z) = 1 if 0 ≤ z ≤ 1, u(z) = 0 otherwise), then X_1 and X_2 would again be uncorrelated, but this time not independent (for example, the only way to get X_1 = 2 is to have both Z_1 = 1 and Z_2 = 1, which implies that X_2 = 0).

2.8. Application, generating correlated normals: There are simple techniques for generating (more or less) independent standard normal random variables, the Box–Muller method being the most famous. Suppose we have a positive definite symmetric matrix, C_X, and we want to generate a multivariate normal with this covariance. One way to do this is to use the Choleski factorization C_X = LL*, where L is an n × n lower triangular matrix. Now define Z = (Z_1, . . . , Z_n)* where the Z_k are independent standard normals. This Z has covariance C_Z = I. Now define X = LZ. This X has covariance C_X = L I L* = LL*, as desired. Actually, we do not necessarily need the Choleski factorization; L does not have to be lower triangular. Another possibility is to use the "symmetric square root" of C_X. Let C_X = V ΣV*, where Σ is the diagonal matrix with the eigenvalues of C_X (Σ = Λ^{−1} where Λ is given above), and V is the orthogonal matrix of eigenvectors. We can take A = V √Σ V*, where √Σ is the diagonal matrix whose entries are the square roots of the eigenvalues. Usually the Choleski factorization is easier to get than the symmetric square root.
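Here is a minimal sketch of the recipe just described (Python with numpy; the covariance matrix below is an arbitrary SPD example). It draws X = LZ and checks that the empirical covariance of the samples is close to C_X.

import numpy as np

rng = np.random.default_rng(1)
CX = np.array([[2.0, 0.8, 0.3],
               [0.8, 1.5, 0.5],
               [0.3, 0.5, 1.0]])       # target covariance (SPD, chosen only for illustration)

L = np.linalg.cholesky(CX)             # CX = L L^*
Z = rng.standard_normal((3, 500_000))  # independent standard normals, one column per sample
X = L @ Z                              # each column is a draw from N(0, CX)

print("empirical covariance:\n", np.cov(X))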

2.9. Central Limit Theorem: Let X be an n dimensional random variable with probability density u(x). Let X^{(1)}, X^{(2)}, . . ., be a sequence of independent samples of X, that is, independent random variables with the same density u. Statisticians call this iid (independent, identically distributed). If we need to talk about the individual components of X^{(k)}, we write X^{(k)}_j for component j of X^{(k)}. For example, suppose we have a population of people. If we choose a person "at random" and record his or her height (X_1) and weight (X_2), we get a two dimensional random variable. If we measure 100 people, we get 100 samples, X^{(1)}, . . ., X^{(100)}, each consisting of a height and weight pair. The weight of person 27 is X^{(27)}_2. Let µ = E[X] be the mean and C = E[(X − µ)(X − µ)*] the covariance matrix. The Central Limit Theorem (CLT) states that for a large number of samples N, the random variable

R^{(N)} = (1/√N) Σ_{k=1}^{N} ( X^{(k)} − µ )

has a probability distribution close to the multivariate normal with mean zero and covariance C. One interesting consequence is that if X_1 and X_2 are uncorrelated then an average of many independent samples will have R^{(N)}_1 and R^{(N)}_2 nearly independent.

2.10. What the CLT says about Gaussians: The Central Limit Theorem tells us that if we average a large number of independent samples from the same distribution, the distribution of the average depends only on the mean and covariance of the starting distribution. It may be surprising that many of the properties that we deduced from the formula (10) may be found with almost no algebra simply knowing that the multivariate normal is the limit of averages. For example, we showed (or didn't show) that if X is multivariate normal and Y = AX where the rows of A are linearly independent, then Y is multivariate normal. This is a consequence of the averaging property. If X is (approximately) the average of iid random variables U_k, then Y is the average of the random variables V_k = AU_k. Applying the CLT to the averaging of the V_k shows that Y is also multivariate normal.

Now suppose U is a univariate random variable with iid samples U_k, and E[U_k] = 0, E[U_k²] = σ², and E[U_k⁴] = a_4 < ∞. Define X_n = (1/√n) Σ_{k=1}^{n} U_k. A calculation shows that E[X_n⁴] = 3σ⁴ + (a_4 − 3σ⁴)/n. For large n, the fourth moment of the average depends only on the second moment of the underlying distribution. A multivariate and slightly more general version of this calculation gives "Wick's theorem", an expression for the expected value of a product of components of a multivariate normal in terms of covariances.
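The fourth moment statement is easy to test by simulation. The sketch below (Python with numpy; the sample sizes are arbitrary) uses centered uniform U_k, for which σ² = 1/12 and a_4 = 1/80, and compares the empirical E[X_n⁴] with 3σ⁴ + (a_4 − 3σ⁴)/n.

import numpy as np

rng = np.random.default_rng(2)
sigma2, a4 = 1.0 / 12.0, 1.0 / 80.0          # E[U^2] and E[U^4] for U ~ Uniform(-1/2, 1/2)

for n in (2, 10, 100):
    U = rng.uniform(-0.5, 0.5, size=(200_000, n))
    Xn = U.sum(axis=1) / np.sqrt(n)          # X_n = (1/sqrt(n)) * (U_1 + ... + U_n)
    print(n, "empirical E[X_n^4] =", round((Xn**4).mean(), 6),
          "  formula =", round(3 * sigma2**2 + (a4 - 3 * sigma2**2) / n, 6))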


Stochastic Calculus Notes, Lecture 5
Last modified October 21, 2004

1 Brownian Motion

1.1. Introduction: Brownian motion is the simplest of the stochastic processes called diffusion processes. It is helpful to see many of the properties of general diffusions appear explicitly in Brownian motion. In fact, the Ito calculus makes it possible to describe any other diffusion process in terms of Brownian motion. Furthermore, Brownian motion arises as a limit of many discrete stochastic processes in much the same way that Gaussian random variables appear as a limit of other random variables through the central limit theorem. Finally, the solutions to many other mathematical problems, particularly various common partial differential equations, may be expressed in terms of Brownian motion. For all these reasons, Brownian motion is a central object to study.

1.2. History: Early in the 19th century, the Scottish botanist Robert Brown looked at pollen grains in water under a microscope. To his amazement, they were moving randomly. He had no explanation for supposedly inert pollen grains, and later inorganic dust, seeming to swim as though alive. In 1905, Einstein proposed the explanation that the observed "Brownian" motion was caused by individual water molecules hitting the pollen or dust particles. This allowed him to estimate, for the first time, the weight of a water molecule, and it contributed to the Nobel prize he was awarded in 1921 (officially for the photoelectric effect, relativity and quantum mechanics being too controversial at the time). This is the modern view, that the observed random motion of pollen grains is the result of a huge number of independent and random collisions with tiny water molecules.

1.3. Basics: The mathematical description of Brownian motion involves a random but continuous function of time, X(t). The standard Brownian motion starts at x = 0 at time t = 0: X(0) = 0. The displacement, or increment, between time t_1 > 0 and time t_2 > t_1, Y = X(t_2) − X(t_1), is the sum of a large number of i.i.d. mean zero random variables (each modeling the result of one water molecule collision). It is natural to suppose that the number of such collisions is proportional to the time increment. This implies, through the central limit theorem, that Y should be a Gaussian random variable with variance proportional to t_2 − t_1. The standard Brownian motion has X normalized so that the variance is equal to t_2 − t_1. The random "shocks" (a term used in finance for any change, no matter how small) in disjoint time intervals should be independent. If t_3 > t_2 and Y_2 = X(t_3) − X(t_2), Y_1 = X(t_2) − X(t_1), then Y_2 and Y_1 should be independent, with variances t_3 − t_2 and t_2 − t_1 respectively. This makes the increments Y_2 and Y_1 a two dimensional multivariate normal.

1.4. Wiener measure: The probability space for standard Brownian motion is C_0([0, T], R). As we said before, this consists of continuous functions, X(t), defined for t in the range 0 ≤ t ≤ T. The notation C_0 means¹ that X(0) = 0. The σ−algebra representing full information is the Borel algebra. The infinite dimensional Gaussian probability measure on C_0([0, T], R) that represents Brownian motion is called Wiener measure².

This measure is uniquely specified by requiring that for any times 0 = t_0 < t_1 < · · · < t_n ≤ T, the increments Y_k = X(t_{k+1}) − X(t_k) are independent Gaussian random variables with var(Y_k) = t_{k+1} − t_k. The proof (which we omit) has two parts. First, it is shown that there indeed is such a measure. Second, it is shown that there is only one such. All the information we need is contained in the joint distribution of the increments. The fact that increments from disjoint time intervals are independent is the independent increments property. It also is possible to consider Brownian motion on an infinite time horizon with probability space C_0([0,∞), R).

1.5. Technical aside: There is a different description of the Borel σ−algebra on C_0([0, T], R). Rather than using balls in the sup norm, one can use sets more closely related to the definition of Wiener measure through the joint distribution of increments. Choose times 0 = t_0 < t_1 < · · · < t_n, and for each t_k a Borel set, I_k ⊆ R (thought of as "intervals" though they may not be). Let A be the event {X(t_k) ∈ I_k for all k}. The set of such events forms an algebra (check this), though not a σ−algebra. The probabilities P(A) are determined by the joint distributions of the increments. The Borel algebra on C_0([0, T], R) is generated by this algebra (proof omitted), so Wiener measure (if it exists) is determined by these probabilities.

1.6. Transition probabilities: The transition probability density for Brownian motion is the probability density for X(t + s) given that X(t) = y. We denote this by G(y, x, s), the "G" standing for Green's function. It is much like the Markov chain transition probabilities P^t_{y,x} except that (i) G is a probability density as a function of x, not a probability, and (ii) time is continuous, not discrete. In our case, the increment X(t + s) − X(t) is Gaussian with variance s. If we learn that X(t) = y, then y becomes the expected value of X(t + s). Therefore,

G(y, x, s) = (1/√(2πs)) e^{−(x−y)²/2s} .   (1)

1.7. Functionals: An element of Ω = C_0([0, T], R) is called X. We denote by F(X) a real valued function of X. In this context, such a function is often called a functional, to keep from confusing it with X(t), which is a random function of t. This functional is just what we called a "function of a random variable" (the path X playing the role of the abstract random outcome ω). The simplest example of a functional is just a function of X(T): F(X) = V(X(T)). More complicated functionals are integrals: F(X) = ∫_0^T V(X(t)) dt, extrema: F(X) = max_{t≤T} X(t), or stopping times such as F(X) = min{ t such that ∫_0^t X(s) ds ≥ 1 }. Stochastic calculus provides tools for computing the expected values of many such functionals, often through solutions of partial differential equations. Computing expected values of functionals is our main way to understand the behavior of Brownian motion (or any other stochastic process).

¹ In other contexts, people use C_0 to indicate functions with "compact support" (whatever that means) or functions that tend to zero as t → ∞, but not here.
² The American mathematician and MIT professor Norbert Wiener was equally brilliant and inarticulate.

1.8. Markov property: The independent increments property makes Brownian motion a Markov process. Let F_t be the σ−algebra generated by the path up to time t. This may be characterized as the σ−algebra generated by all the random variables X(s) for s ≤ t, which is the smallest σ−algebra in which all the functions X(s) are measurable. It also may be characterized as the σ−algebra generated by events of the form A above ("Technical aside") with t_n ≤ t (proof omitted). We also have the σ−algebra G_t generated by the present only. That is, G_t is generated by the single random variable X(t); it is the smallest σ−algebra in which X(t) is measurable. Finally, we let H_t denote the σ−algebra that depends only on future values X(s) for s ≥ t. The Markov property states that if F(X) is any functional measurable with respect to H_t (i.e. depending only on the future of t), then E[F | F_t] = E[F | G_t].

Here is a quick sketch of the proof. If F(X) is a function of finitely many values, X(t_k), with t_k ≥ t, then E[F | F_t] = E[F | G_t] follows from the independent increments property. It is possible (though tedious) to show that any F measurable with respect to H_t may be approximated by a functional depending on finitely many future times. This extends E[F | F_t] = E[F | G_t] to all F measurable in H_t.

1.9. Path probabilities: For Brownian motion, as for discrete Markov chains, the individual outcomes are paths, X. For Markov chains one can compute the probability of an individual path by multiplying the transition probabilities. The situation is different for Brownian motion, where each individual path has probability zero. We will make much use of the following partial substitute. Again choose times t_0 = 0 < t_1 < · · · < t_n ≤ T, let ~t = (t_1, . . . , t_n) be the vector of these times, and let ~X = (X(t_1), . . . , X(t_n)) be the vector of the corresponding observations of X. We write U^{(n)}(~x, ~t) for the joint probability density for the n observations, which is found by multiplying together the transition probability densities (1) (and using properties of exponentials):

U^{(n)}(~x, ~t) = Π_{k=0}^{n−1} G(x_k, x_{k+1}, t_{k+1} − t_k)
              = (1/(2π)^{n/2}) Π_{k=0}^{n−1} (1/√(t_{k+1} − t_k)) · exp( −(1/2) Σ_{k=0}^{n−1} (x_{k+1} − x_k)² / (t_{k+1} − t_k) ) .   (2)

The formula (2) is a concrete summary of the defining properties of the probability measure for Brownian motion, Wiener measure: the independent increments property, the Gaussian distribution of the increments, the variance being proportional to the time differences, and the increments having mean zero. It also makes clear that each finite collection of observations forms a multivariate normal. For any of the events A as in "Technical aside", we have

P(A) = ∫_{x_1∈I_1} · · · ∫_{x_n∈I_n} U^{(n)}(x_1, . . . , x_n, ~t) dx_1 · · · dx_n .
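Formula (2) translates directly into a recipe for sampling the observation vector (X(t_1), . . . , X(t_n)): add independent Gaussian increments with variances t_{k+1} − t_k. A minimal sketch (Python with numpy; T, n and the number of paths are arbitrary choices) is below; as a check it prints the empirical variance of X(T), which should be close to T.

import numpy as np

rng = np.random.default_rng(3)
T, n, n_paths = 1.0, 50, 100_000
t = np.linspace(0.0, T, n + 1)                       # t_0 = 0 < t_1 < ... < t_n = T
dt = np.diff(t)

# increments X(t_{k+1}) - X(t_k) are independent N(0, t_{k+1} - t_k)
dX = rng.standard_normal((n_paths, n)) * np.sqrt(dt)
X = np.concatenate([np.zeros((n_paths, 1)), dX.cumsum(axis=1)], axis=1)

print("empirical var(X(T)):", X[:, -1].var(), "  (should be close to T =", T, ")")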

1.10. Consistency: You cannot give just any old probability densities to replace the joint densities (2). They must satisfy a simple consistency condition. Having given the joint density for n observations, you also have given the joint density for a subset of these observations. For example, the joint density for X(t_1) and X(t_3) must be the marginal of the joint density of X(t_1), X(t_2), and X(t_3):

U^{(2)}(x_1, x_3, t_1, t_3) = ∫_{x_2=−∞}^{∞} U^{(3)}(x_1, x_2, x_3, t_1, t_2, t_3) dx_2 .

It is possible to verify these consistency conditions by direct calculation with the Gaussian integrals. A more abstract way is to understand the consistency conditions as adding random increments. The U^{(2)} density says that we get X(t_3) from X(t_1) by adding an increment that is Gaussian with mean zero and variance t_3 − t_1. The U^{(3)} density says that we get X(t_3) from X(t_2) by adding a Gaussian with mean zero and variance t_3 − t_2. In turn, we get X(t_2) from X(t_1) by adding an increment having mean zero and variance t_2 − t_1. Since the smaller time increments are Gaussian and independent of each other, their sum is also Gaussian, with mean zero and variance (t_3 − t_2) + (t_2 − t_1), which is the same as the variance in going from X(t_1) to X(t_3) directly.

1.11. Rough paths: The above picture shows 5 Brownian motion paths. They are random and differ in gross features (some go up, others go down), but the fine scale structure of the paths is the same. They are not smooth, or even differentiable, functions of t. If X(t) is a differentiable function of t, then for small ∆t its increments are roughly proportional to ∆t:

∆X = X(t + ∆t) − X(t) ≈ (dX/dt) ∆t .

For Brownian motion, the expected value of the square of ∆X (the variance of ∆X) is proportional to ∆t. This suggests that typical values of ∆X will be on the order of √∆t. In fact, an easy calculation gives

E[|∆X|] = √(2∆t/π) .

This would be impossible if successive increments of Brownian motion were all in the same direction (see "Total variation" below). Instead, Brownian motion paths are constantly changing direction. They go nowhere (or not very far) fast.

1.12. Total variation: One quantitative sense of path roughness is the fact that Brownian motion paths have infinite total variation. The total variation of a function X(t) measures the total distance it moves, counting both ups and downs. For a differentiable function, this would be

TV(X) = ∫_0^T |dX/dt| dt .   (3)

If X(t) has simple jump discontinuities, we add the sizes of the jumps to (3). For general functions, the total variation is

TV(X) = sup Σ_{k=0}^{n−1} |X(t_{k+1}) − X(t_k)| ,   (4)

where the supremum is over all positive n and all sequences t_0 = 0 < t_1 < · · · < t_n ≤ T.

Suppose X(t) has finitely many local maxima or minima, such as t_0 = local max, t_1 = local min, etc. Then taking these t values in (4) gives the exact total variation (further subdivision does not increase the sum). This is one way to relate the general definition (4) to the definition (3) for differentiable functions. This does not help for Brownian motion paths, which have infinitely many local maxima and minima.

1.13. Almost surely: Let A ∈ F be a measurable event. We say A happens almost surely if P(A) = 1. This allows us to establish properties of random objects by doing calculations (stochastic calculus). For example, we will show that Brownian motion paths have infinite total variation almost surely by showing that for any (small) ε > 0 and any (large) N,

P(TV(X) < N) < ε .   (5)

Let B ⊂ C_0([0, T], R) be the set of paths with finite total variation. This is a countable union

B = ∪_{N>0} {TV(X) < N} = ∪_{N>0} B_N .

Since P(B_N) < ε for any ε > 0, we must have P(B_N) = 0. Countable additivity then implies that P(B) = 0, which means that P(TV = ∞) = 1.

There is a distinction between outcomes that do not exist and events that never happen because they have probability zero. For example, if Z is a one dimensional Gaussian random variable, the outcome Z = 0 does exist, but the event {Z = 0} is impossible (never will be observed). This is what we mean when we say "a Gaussian random variable never is zero", or "every Brownian motion path has infinite total variation".

1.14. The TV of BM: The heart of the matter is the actual calculation behind the inequality (5). We choose an n > 0 and define (not for the last time) ∆t = T/n and t_k = k∆t. Let Y be the random variable

Y = Σ_{k=0}^{n−1} |X(t_{k+1}) − X(t_k)| .

Remember that Y is one of the candidates we must use in the supremum (4) that defines the total variation. If Y is large, then the total variation is at least as large. Because E[|∆X|] = √(2/π) √∆t, we have E[Y] = √(2/π) √T √n. A calculation using the independent increments property shows that

var(Y) = (1 − 2/π) T

for any n. Tchebychev's inequality³ implies that

P( Y < ( √(2/π) √n − k √(1 − 2/π) ) √T ) ≤ 1/k² .

If we take very large n and medium large k, this inequality says that it is very unlikely for Y (or the total variation of X) to be much less than const·√n. Our inequality (5) follows from this with a suitable choice of n and k.

³ If E[Y] = µ and var(Y) = σ², then P(|Y − µ| > kσ) < 1/k². The proof and more examples are in any good basic probability book.
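The growth of the sums Y with n can be watched on a simulated path. The sketch below (Python with numpy; the grid sizes are arbitrary) evaluates Y on finer and finer partitions of one path sampled on a fine grid and compares it with √(2/π)·√(Tn); the sums grow like √n instead of converging.

import numpy as np

rng = np.random.default_rng(4)
T, n_fine = 1.0, 2**16
dX = rng.standard_normal(n_fine) * np.sqrt(T / n_fine)
X = np.concatenate([[0.0], dX.cumsum()])          # one Brownian path on the finest grid

for n in (2**8, 2**10, 2**12, 2**14, 2**16):
    step = n_fine // n
    Y = np.abs(np.diff(X[::step])).sum()          # sum of |increments| on a coarser partition
    print(n, "Y =", round(Y, 2), "   sqrt(2*T*n/pi) =", round(np.sqrt(2 * T * n / np.pi), 2))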

1.15. Structure of BM paths: For any function X(t), we can define the total variation on the interval [t_1, t_2] in an obvious way. The odometer of a car records the distance travelled regardless of the direction. For X(t), the total variation on the interval [0, t] plays a similar role. Clearly, X is monotone on the interval [t_1, t_2] if and only if TV(X, t_1, t_2) = |X(t_2) − X(t_1)|. Otherwise, X has at least one local min or max within [t_1, t_2]. Now, Brownian motion paths have infinite total variation on any interval (the proof above implies this). Therefore, a Brownian motion path has a local max or min within any interval. This means that (like the rational numbers, for example) the set of local maxima and minima is dense: There is a local max or min arbitrarily close to any given number.

1.16. Dynamic trading: The infinite total variation of Brownian motion has a consequence for dynamic trading strategies. Some of the simplest dynamic trading strategies, Black-Scholes hedging, and Merton half stock/half cash trading, call for trades that are proportional to the change in the stock price. If the stock price is a diffusion process and there are transaction costs proportional to the size of the trade, then the total transaction costs will either be infinite (in the idealized continuous trading limit) or very large (if we trade as often as possible). It turns out that dynamic trading strategies that take trading costs into account can approach the idealized zero cost strategies when trading costs are small. Next term you will learn how this is done.

1.17. Quadratic variation: A more useful measure of roughness of Brownian motion paths and other diffusion processes is quadratic variation. Using previous notations: ∆t = T/n, t_k = k∆t, the definition is⁴ (where n → ∞ as ∆t → 0 with T = n∆t fixed)

Q(X) = lim_{∆t→0} Q_n(X) = lim_{∆t→0} Σ_{k=0}^{n−1} ( X(t_{k+1}) − X(t_k) )² .   (6)

If X is a differentiable function of t, then its quadratic variation is zero (Q_n is the sum of n terms each of order 1/n²). For Brownian motion, Q(X) = T (almost surely). Clearly E[Q_n] = T for any n (independent increments, Gaussian increments with variance ∆t). The independent increments property also lets us evaluate var(Q_n) = 2T²/n (the sum of n independent terms, each with variance E[∆X⁴] − (E[∆X²])² = 2∆t² = 2T²/n²). Thus, Q_n must be increasingly close to T as n gets larger⁵.
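The same kind of experiment as for total variation shows Q_n settling down to T. A minimal sketch (Python with numpy; the grid sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
T, n_fine = 1.0, 2**16
dX = rng.standard_normal(n_fine) * np.sqrt(T / n_fine)
X = np.concatenate([[0.0], dX.cumsum()])          # one Brownian path on the finest grid

for n in (2**6, 2**10, 2**14):
    step = n_fine // n
    Qn = (np.diff(X[::step]) ** 2).sum()          # sum of squared increments on a coarser grid
    print(n, "Q_n =", round(Qn, 4), "  (compare with T =", T, ")")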

1.18. Trading volatility: The quadratic variation of a stock price (or a similar quantity) is called its "realized volatility". The fact that it is possible to buy and sell realized volatility says that the (geometric) Brownian motion model of stock price movement is not completely realistic. That model predicts that realized volatility is a constant, which is nothing to bet on.

1.19. Brownian bridge construction:

1.20. Continuous time stochastic process: The general abstract definition of a continuous time stochastic process is just a probability space, Ω, and, for each t > 0, a σ−algebra F_t. These algebras should form a filtration (corresponding to increase of information): F_{t_1} ⊆ F_{t_2} if t_1 ≤ t_2. There should also be a family of random variables Y_t(ω), with Y_t measurable in F_t (i.e. having a value known at time t). This explains why probabilists often write X_t instead of X(t) for Brownian motion and other diffusion processes. For each t, we think of X_t as a function of ω with t simply being a parameter. Our choice of probability space Ω = C_0([0, T], R) implies that for each ω, X_t(ω) is a continuous function of t. (Actually, for simple Brownian motion, the path X plays the role of the abstract outcome ω, though we never write X_t(X).) Other stochastic processes, such as the Poisson jump process, do not have continuous sample paths.

⁴ It is possible, though not customary, to define TV(X) using evenly spaced points. In the limit ∆t → 0, we would get the same answer for continuous paths or paths with TV(X) < ∞. You don't have to use uniformly spaced times in the definition of Q(X), but I think you get a different answer if you let the times depend on X as they might in the definition of total variation.
⁵ This does not quite prove that (almost surely) Q_n → T as n → ∞. We will come back to this point in later lectures.

1.21. Continuous time martingales: A stochastic process F_t (with Ω and the F_t) is a martingale if E[F_s | F_t] = F_t for s > t. Brownian motion forms the first example of a continuous time martingale. Another famous martingale related to Brownian motion is F_t = X_t² − t (the reader should check this). As in discrete time, any random variable, Y, defines a continuous time martingale through conditional expectations: Y_t = E[Y | F_t]. The Ito calculus is based on the idea that a stochastic integral with respect to X should produce a martingale.
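A quick numerical check of the claim for F_t = X_t² − t (a sketch in Python with numpy, not a proof; the times and the value x are arbitrary): fix X_{t_1} = x, simulate many continuations to t_2, and compare the average of X_{t_2}² − t_2 with x² − t_1.

import numpy as np

rng = np.random.default_rng(6)
t1, t2, x = 0.5, 2.0, 0.7                      # condition on X_{t1} = x
n_paths = 1_000_000

X2 = x + np.sqrt(t2 - t1) * rng.standard_normal(n_paths)   # X_{t2} given X_{t1} = x
print("E[X_{t2}^2 - t2 | X_{t1} = x] ~", (X2**2).mean() - t2)
print("X_{t1}^2 - t1                 =", x**2 - t1)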

2 Brownian motion and the heat equation

2.1. Introduction: Forward and backward equations are tools for calculating probabilities and expected values related to Brownian motion, as they are for Markov chains and stochastic processes more generally. The probability density of X(t) satisfies a forward equation. The conditional expectations E[V | F_t] satisfy backward equations for a variety of functionals V. For Brownian motion, the forward and backward equations are partial differential equations, either the heat equation or a close relative. We will see that the theory of partial differential equations of diffusion type (the heat equation being a prime example) and the theory of diffusion processes (Brownian motion being a prime example) each draw from the other.

2.2. Forward equation for the probability density: If X(t) is a standard Brownian motion with X(0) = 0, then X(t) ∼ N(0, t), so its probability density is (see (1))

u(x, t) = G(0, x, t) = (1/√(2πt)) e^{−x²/2t} .

Directly calculating partial derivatives, we can verify that

∂_t G = (1/2) ∂_x² G .   (7)

We also could consider a Brownian motion with a more general initial density X(0) ∼ u_0(x). Then X(t) is the sum of the independent random variables X(0) and an N(0, t). Therefore, the probability density for X(t) is

u(x, t) = ∫_{y=−∞}^{∞} G(y, x, t) u_0(y) dy = ∫_{y=−∞}^{∞} G(0, x − y, t) u_0(y) dy .   (8)

Again, direct calculation (differentiating (8); x and t derivatives land on G) shows that u satisfies

∂_t u = (1/2) ∂_x² u .   (9)

This is the heat equation, also called diffusion equation. The equation is used in two ways. First, we can compute probabilities by solving the partial differential equation. Second, we can use known probability densities as solutions of the partial differential equation.
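The verification of (9) can also be done numerically by finite differences applied to u(x, t) = G(0, x, t). A minimal sketch (Python with numpy; the grid and step sizes are arbitrary):

import numpy as np

def G(x, t):
    # transition density of standard Brownian motion started at 0
    return np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)

x = np.linspace(-3, 3, 7)
t, hx, ht = 1.0, 1e-4, 1e-4

ut  = (G(x, t + ht) - G(x, t - ht)) / (2 * ht)                 # centered difference in t
uxx = (G(x + hx, t) - 2 * G(x, t) + G(x - hx, t)) / hx**2      # centered difference in x
print("max |u_t - 0.5*u_xx| on the grid:", np.abs(ut - 0.5 * uxx).max())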


2.3. Heat equation via Taylor series: The above is not so much a derivation of the heat equation as a verification. We are told that u(x, t) (the probability density of X_t) satisfies the heat equation and we verify that fact. Here is a method for deriving a forward equation without knowing it in advance. We assume that u(x, t) is smooth enough as a function of x and t that we may expand it to second order in Taylor series, do the expansion, then take the conditional expectation of the terms. Variations of this idea lead to the backward equations and to major parts of the Ito calculus.

Let us fix two times separated by a small ∆t: t′ = t + ∆t. The rules of conditional probability allow us to compute the density of X = X(t′) in terms of the density of Y = X(t) and the transition probability density (1):

u(x, t + ∆t) = ∫_{y=−∞}^{∞} G(y, x, ∆t) u(y, t) dy .   (10)

The main idea is that for small ∆t, X(t + ∆t) will be close to X(t). This is expressed in G being small unless y is close to x, which is evident in (1). In the integral, x is a constant and y is the variable of integration. If we would approximate u(y, t) by u(x, t), the value of the integral just would be u(x, t). This would give the true but not very useful approximation u(x, t + ∆t) ≈ u(x, t) for small ∆t. Adding the next Taylor series term (writing u_x for ∂_x u): u(y, t) ≈ u(x, t) + u_x(x, t)(y − x), the integral does not change the result because ∫ G(y, x, ∆t)(y − x) dy = 0. Adding the next term,

u(y, t) ≈ u(x, t) + u_x(x, t)(y − x) + (1/2) u_xx(x, t)(y − x)² ,

gives (because E[(Y − X)²] = ∆t)

u(x, t + ∆t) ≈ u(x, t) + (1/2) u_xx(x, t) ∆t .

To derive a partial differential equation, we expand the left side as u(x, t + ∆t) = u(x, t) + u_t(x, t)∆t + O(∆t²). On the right, we use

∫ G(y, x, ∆t) |y − x|³ dy = O(∆t^{3/2}) .

Altogether, this gives

u(x, t) + u_t(x, t)∆t = u(x, t) + (1/2) u_xx(x, t)∆t + O(∆t^{3/2}) .

If we cancel the common u(x, t), then cancel the common factor ∆t and let ∆t → 0, we get the desired heat equation (9).

2.4. The initial value problem: The heat equation (9) is the Brownian motion analogue of the forward equation for Markov chains. If we know the time 0 density u(x, 0) = u_0(x) and the evolution equation (9), the values of u(x, t) are completely and uniquely determined (ignoring mathematical technicalities that would be unlikely to trouble a practical person). The task of finding u(x, t) for t > 0 from u_0(x) and (9) is called the "initial value problem", with u_0(x) being the "initial value" (or "values"??). This initial value problem is "well posed", which means that the solution, u(x, t), exists and depends continuously on the initial data, u_0. If you want a proof that the solution exists, just use the integral formula for the solution (8). Given u_0, the integral (8) exists, satisfies the heat equation, and is a continuous function of u_0. The proof that u is unique is more technical, partly because it rests on more technical assumptions.

2.5. Ill posed problems: In some situations, the problem of finding a function u from a partial differential equation and other data may be "ill posed", useless for practical purposes. A problem is ill posed if it is not well posed. This means either that the solution does not exist, or that it does not depend continuously on the data, or that it is not unique. For example, if I try to find u(x, t) for positive t knowing only u_0(x) for x > 0, I must fail. A mathematician would say that the solution, while it exists, is not unique, there being many different ways to give u_0(x) for x < 0, each leading to a different u. A more subtle situation arises, for example, if we give u(x, T) for all x and wish to determine u(x, t) for 0 ≤ t < T. For example, if u(x, T) = 1_{[0,1]}(x), there is no solution (trust me). Even if there is a solution, for example given by (8), it does not depend continuously on the values of u(x, T) for T > t (trust me).

The heat equation (9) relates values of u at one time to values at another time. However, it is "well posed" only for determining u at future times from u at earlier times. This "forward equation" is well posed only for moving forward in time.

2.6. Conditional expectations: We saw already for Markov chains that certain conditional expected values can be calculated by working backwards in time with the backward equation. The Brownian motion version of this uses the conditional expectation

f(x, t) = E[V(X_T) | X_t = x] .   (11)

One "modern" formulation of this defines F_t = E[V(X_T) | F_t]. The Markov property implies that F_t is measurable in G_t, which makes it a function of X_t. We write this as F_t = f(X_t, t). Of course, these definitions mean the same thing and yield the same f. The definition is also sometimes written as f(x, t) = E_{x,t}[V(X_T)]. In general if we have a parametrized family of probability measures, P_α, we write the expected value with respect to P_α as E_α[·]. Here, the probability measure P_{x,t} is the Wiener measure describing Brownian motion paths that start from x at time t, which is defined by the densities of increments for times larger than t as before.

2.7. Backward equation by direct verification: Given that X_t = x, the conditional density for X_T is the same transition density (1). Writing the expectation (11) as an integral, we get

f(x, t) = ∫_{−∞}^{∞} G(x, y, T − t) V(y) dy .   (12)

We can verify by explicit differentiation (x and t derivatives act on G) that

∂_t f + (1/2) ∂_x² f = 0 .   (13)

Note that the sign of ∂_t here is not what it was in (9), which is because we are calculating ∂_t G(T − t) rather than ∂_t G(t). This (13) is the backward equation.

2.8. Backward equation by Taylor series: As with the forward equation (9), we can find the backward equation by Taylor series expansions. We start by choosing a small ∆t and expressing f(x, t) in terms of⁶ f(·, t + ∆t). As before, define F_t = E[V(X_T) | F_t] = f(X_t, t). Since F_t ⊂ F_{t+∆t}, the tower property implies that F_t = E[F_{t+∆t} | F_t], so

f(x, t) = E_{x,t}[ f(X_{t+∆t}, t + ∆t) ] = ∫_{y=−∞}^{∞} f(y, t + ∆t) G(x, y, ∆t) dy .   (14)

As before, we expand f(y, t + ∆t) about (x, t), dropping terms that contribute less than O(∆t):

f(y, t + ∆t) = f(x, t) + f_x(x, t)(y − x) + (1/2) f_xx(x, t)(y − x)² + f_t(x, t)∆t + O(|y − x|³) + O(∆t²) .

Substituting this into (14) and integrating each term leads to

f(x, t) = f(x, t) + 0 + (1/2) f_xx(x, t)∆t + f_t(x, t)∆t + O(∆t^{3/2}) + O(∆t²) .

A bit of algebra and ∆t → 0 then gives (13).

For future reference, we pause to note the differences between this derivation of (13) and the related derivation of (9). Here, we integrated G with respect to its second argument, while earlier we integrated with respect to the first argument. This does not matter for the special case of Brownian motion and the heat equation because G(x, y, t) = G(y, x, t). When we apply this reasoning to other diffusion processes, G(x, y, t) will be a probability density as a function of y for every x, but it need not be a probability density as a function of x for given y. This is an analogue of the fact in Markov chains that the transition matrix P acts from the left on column vectors f (summing P_{jk} over k) but from the right on row vectors u (summing P_{jk} over j). For each j, Σ_k P_{jk} = 1, but the column sums Σ_j P_{jk} may not equal one. Of course, the sign of the ∂_t term is different in the two cases because we did the t Taylor series on the right side of (14) but on the left side of (10).

⁶ The notation f(·, t + ∆t) is to avoid writing f(x, t + ∆t), which might imply that the value f(x, t) depends only on f at time t + ∆t for the same x value. Instead, it depends on all the values f(y, t + ∆t).

2.9. The final value problem: The final values f(x, T) = V(x), together with the backward evolution equation (13), allow us to determine the values f(·, t) for t < T. The definition (11) makes this obvious. This means that the final value problem for the backward heat equation is a well posed problem.

On the other hand, the initial value problem for the backward heat equation is not a well posed problem. If we have a f(x, 0) and we want a V(x) that leads to it, we are probably out of luck.

2.10. Duality: As for Markov chains, we can express the expected value of V(X_T) in terms of the probability density at any earlier time t ≤ T:

E[V(X_T)] = ∫ u(x, t) f(x, t) dx .

This again implies that the right side is independent of t, which in turn allows us to derive the forward equation (9) from the backward equation (13) or conversely. For example, differentiating and using (13) gives

0 = (d/dt) ∫ u(x, t) f(x, t) dx
  = ∫ u_t(x, t) f(x, t) dx + ∫ u(x, t) f_t(x, t) dx
  = ∫ u_t(x, t) f(x, t) dx − ∫ u(x, t) (1/2) f_xx(x, t) dx .

To derive an equation involving only u derivatives, we want to integrate the last integral by parts to move the x derivatives from f to u. In this formal derivation, we will assume that the probability density u(x, t) decays to zero fast enough as |x| → ∞ that we can neglect possible boundary terms at x = ±∞. This gives

∫ ( u_t(x, t) − (1/2) u_xx(x, t) ) f(x, t) dx = 0 .

If this relation holds for a sufficiently rich family of functions f, we can only conclude that u_t − (1/2) u_xx is identically zero, which is the forward equation (9).

2.11. The smoothing property, regularity: Solutions of the forward or backward heat equation become smooth functions of x and t even if the initial data (for the forward equation) or final data (for the backward equation) are not smooth. For u, this is clear from the integral formula (8). If we differentiate with respect to x, this derivative passes under the integral and onto the G factor. This applies also to x or t derivatives of any order, since the corresponding derivatives of G are still smooth integrable functions of x. The same can be said for f using (12); as long as t < T, any derivatives of f with respect to x and/or t are bounded. A function that has all partial derivatives of any order bounded is called "smooth". (Warning, this term is not used consistently. Some people say "smooth" to mean, for example, merely having derivatives up to second order bounded.) Solutions of more general forward and backward equations often, but not always, have the smoothing property.

2.12. Rate of smoothing: Suppose the payout (and final value) function, V(x), is a discontinuous function such as V(x) = 1_{x<0}(x) (a "digital" option in finance). The solution to the backward equation can be expressed in terms of the cumulative normal (with Z ∼ N(0, 1))

N(x) = P(Z < x) = (1/√(2π)) ∫_{z=−∞}^{x} e^{−z²/2} dz .

Then we have

f(x, t) = ∫_{y=−∞}^{0} G(x, y, T − t) dy = (1/√(2π(T − t))) ∫_{y=−∞}^{0} e^{−(x−y)²/2(T−t)} dy ,

so that

f(x, t) = N( −x/√(T − t) ) .   (15)

From this it is clear that f is differentiable when t < T, but the first x derivative is as large as 1/√(T − t), the second as large as 1/(T − t), etc. All derivatives blow up as t → T, with higher derivatives blowing up faster. This can make numerical solution of the backward equation difficult and inaccurate when the final data V(x) is not smooth.

The formula (15) can be derived without integration. One way is to note that f(x, t) = P(X_T < 0 | X_t = x) and X_T ∼ x + √(T − t) Z (Gaussian increments), so that X_T < 0 is the same as Z < −x/√(T − t). Even without the normal probability, a physicist would tell you that ∆X ∼ √∆t, so the hitting probability starting from x at time t has to be some function of x/√(T − t).
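The closed form (15) is easy to compare with a direct Monte Carlo estimate of E[V(X_T) | X_t = x] for the digital payout. A minimal sketch (Python with numpy; the values of T, t and x are arbitrary illustrative choices):

import math
import numpy as np

rng = np.random.default_rng(7)
T, t, x = 1.0, 0.75, 0.3
n_paths = 2_000_000

XT = x + math.sqrt(T - t) * rng.standard_normal(n_paths)    # X_T given X_t = x
mc = (XT < 0.0).mean()                                       # Monte Carlo estimate of f(x, t)

N = lambda y: 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))     # cumulative normal
print("Monte Carlo:", mc, "   N(-x/sqrt(T-t)):", N(-x / math.sqrt(T - t)))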

2.13. Diffusion: If you put a drop of ink into a glass of still water, you will see the ink slowly diffuse through the water. This is modelled as a vast number of tiny ink particles each performing an independent Brownian motion in the water. Let u(x, t) represent the density of particles about x at time t (say, particles per cubic millimeter). This u satisfies the heat equation but not the requirement that ∫ u(x, t) dx = 1. If ink has been diffusing through water for some time, there will be dark regions with a high density of particles (large u) and lighter regions with smaller u. In the absence of boundaries (sides of the glass and the top of the water), the ink distribution would be Gaussian.

2.14. Heat: Heat also can diffuse through a medium, as happens when we put a thick metal pan over a flame and wait for the other side to heat up. We can think of u(x, t) as representing the temperature in a metal at location x at time t. This helps us interpret solutions of the heat equation (9) when u is not necessarily positive. In particular, it helps us imagine the cancellation that can occur when regions of positive and negative u are close to each other. Heat flows from the high temperature regions to low or negative temperature regions in a way that makes the temperature distribution more uniform. A physical argument that heat (temperature) flowing through a metal should satisfy the heat equation was given by the French mathematical physicist, friend of Napoleon, and founder of Ecole Polytechnique, Joseph Fourier.

2.15. Hitting times: A stopping time, τ, is any time that depends on the Brownian motion path X so that the event {τ ≤ t} is measurable with respect to F_t. This is the same as saying that for each t there is some process that has as input the values X_s for 0 ≤ s ≤ t and as output a decision τ ≤ t or τ > t. One kind of stopping time is a hitting time:

τ_a = min { t | X_t = a } .

More generally (particularly for Brownian motion in more than one dimension), if A is a closed set, we may consider τ_A = min{ t | X_t ∈ A }. It is useful to define a Brownian motion that stops at time τ: X̃_t = X_t if t ≤ τ, X̃_t = X_τ if t ≥ τ.

2.16. Probabilities for stopped Brownian motion: Suppose X_t is Brownian motion starting at X_0 = 1 and X̃ is the Brownian motion stopped at time τ_0, the first time X_t = 0. The probability measure, P_t, for X̃_t may be written as the sum of two terms, P_t = P^s_t + P^{ac}_t. (Since X̃_t is a single number, the probability space is Ω = R, and the σ−algebra is the Borel algebra.) The "singular" part, P^s_t, corresponds to the paths that have been stopped. If p(t) is the probability that τ ≤ t, then P^s_t = p(t)δ(x), which means that for any Borel set, A ⊆ R, P^s_t(A) = p(t) if 0 ∈ A and P^s_t(A) = 0 if 0 ∉ A. This δ is called the "delta function" or "delta mass"; it puts weight one on the point zero and no weight anywhere else. Probabilists sometimes write δ_{x_0} for the measure that puts weight one on the point x_0. Physicists write δ_{x_0}(x) = δ(x − x_0). The "absolutely continuous" part, P^{ac}_t, is given by a density, u(x, t). This means that P^{ac}_t(A) = ∫_A u(x, t) dx. Because ∫_R u(x, t) dx = 1 − p(t) < 1, u, while being a density, is not a probability density.

This decomposition of a measure (P) as a sum of a singular part and absolutely continuous part is a special case of the Radon Nikodym theorem. We will see the same idea in other contexts later.

2.17. Forward equation for u: The density for the absolutely continuous part, u(x, t), is the density for paths that have not touched X = a. In the diffusion interpretation, think of a tiny ink particle diffusing as before but being absorbed if it ever touches a. It is natural to expect that when x ≠ a, the density satisfies the heat equation (9). u "knows about" the absorption because of the boundary condition u(a, t) = 0. This says that the density of particles approaches zero near the absorbing boundary. By the end of the course, we will have several ways to prove this. For now, think of a diffusing particle, a Brownian motion path, as being hyperactive; it moves so fast that it has already visited a neighborhood of its current location. In particular, if X_t is close to a, then very likely X_s = a for some s < t. Only a small minority of the particles at x near a, with small density u(x, t) → 0 as x → a, have not touched a.

2.18. Probability flux: Suppose a Brownian motion starts at a random point X_0 > 0 with probability density u_0(x) and we take the absorbing boundary at a = 0. Clearly, u(x, t) = 0 for x < 0 because a particle cannot cross from positive to negative without crossing zero, the Brownian motion paths being continuous. The probability of not being absorbed before time t is given by

1 − p(t) = ∫_{x>0} u(x, t) dx .   (16)

The rate of absorption of particles, the rate of decrease of probability, may be calculated by using the heat equation and the boundary condition. Differentiating (16) with respect to t, using the heat equation for the right side, and then integrating gives

−p′(t) = ∫_{x>0} ∂_t u(x, t) dx = ∫_{x>0} (1/2) ∂_x² u(x, t) dx = −(1/2) ∂_x u(0, t) ,

so that

p′(t) = (1/2) ∂_x u(0, t) .   (17)

Note that both sides of (17) are positive. The left side because P(τ ≤ t) is an increasing function of t, the right side because u(0, t) = 0 and u(x, t) > 0 for x > 0. The identity (17) leads us to interpret the left side as the probability "flux" (or "density flux" if we are thinking of diffusing particles). The rate at which probability flows (or particles flow) across a fixed point (x = 0) is proportional to the derivative (the gradient) at that point. In the heat flow interpretation this says that the rate of heat flow across a point is proportional to the temperature gradient. This natural idea is called Fick's law (or possibly "Fourier's law").

2.19. Images and Reflections: We want a function u(x, t) that satisfies the heat equation when x > 0, the boundary condition u(0, t) = 0, and goes to δ_{x_0} as t ↓ 0. The "method of images" is a trick for doing this. We think of δ_{x_0} as a unit "charge" (in the electrical, not financial sense) at x_0 and g(x − x_0, t) = (1/√(2πt)) e^{−(x−x_0)²/2t} as the response to this charge, if there is no absorbing boundary. For example, think of putting a unit drop of ink at x_0 and watching it spread along the x axis in a "bell shaped" (i.e. Gaussian) density distribution. Now think of adding a negative "image charge" at −x_0 so that u_0(x) = δ_{x_0} − δ_{−x_0} and correspondingly

u(x, t) = (1/√(2πt)) ( e^{−(x−x_0)²/2t} − e^{−(x+x_0)²/2t} ) .   (18)

This function satisfies the heat equation everywhere, and in particular for x > 0. It also satisfies the boundary condition u(0, t) = 0. Also, it has the same initial data as g, as long as x > 0. Therefore, as long as x > 0, the u given by (18) represents the density of unabsorbed particles in a Brownian motion with absorption at x = 0. You might want to consider the image charge contribution in (18), (1/√(2πt)) e^{−(x+x_0)²/2t}, as "red ink" (the ink that represents negative quantities) that also diffuses along the x axis. To get the total density, we subtract the red ink density from the black ink density. For x = 0, the red and black densities are the same because the distances to the sources at ±x_0 are the same. When x > 0 the black density is higher, so we get a positive u. We can think of the image point, −x_0, as the reflection of the original source point through the barrier x = 0.

2.20. The reflection principle: The explicit formula (18) allows us to evaluate p(t), the probability of touching x = 0 by time t starting at X_0 = x_0. This is

p(t) = 1 − ∫_{x>0} u(x, t) dx = 1 − ∫_{x>0} (1/√(2πt)) ( e^{−(x−x_0)²/2t} − e^{−(x+x_0)²/2t} ) dx .

Because ∫_{−∞}^{∞} (1/√(2πt)) e^{−(x−x_0)²/2t} dx = 1, we may write

p(t) = ∫_{−∞}^{0} (1/√(2πt)) e^{−(x−x_0)²/2t} dx + ∫_{0}^{∞} (1/√(2πt)) e^{−(x+x_0)²/2t} dx .

Of course, the two terms on the right are the same! Therefore

p(t) = 2 ∫_{−∞}^{0} (1/√(2πt)) e^{−(x−x_0)²/2t} dx .

This formula is a particular case of the Kolmogorov reflection principle. It says that the probability that X_s < 0 for some s ≤ t (the left side) is exactly twice the probability that X_t < 0 (the integral on the right). Clearly some of the particles that cross to the negative side at times s < t will cross back, while others will not. This formula says that exactly half the particles that touch zero for some s ≤ t have X_t > 0. Kolmogorov gave a proof of this based on the Markov property and the symmetry of Brownian motion. Since X_τ = 0 and the increments of X for s > τ are independent of the increments for s < τ, and since the increments are symmetric Gaussian random variables, they have the same chance to be positive X_t > 0 as negative X_t < 0.
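The reflection principle can be checked on simulated paths. The sketch below (Python with numpy; all parameters arbitrary) monitors the running minimum of discretized paths started at x_0; the discrete time grid slightly underestimates the touching probability, so the match with 2P(X_t < 0) is approximate.

import numpy as np

rng = np.random.default_rng(8)
x0, t, n_steps, n_paths = 1.0, 1.0, 1000, 100_000
dt = t / n_steps

X = np.full(n_paths, x0)
running_min = X.copy()
for _ in range(n_steps):
    X = X + rng.standard_normal(n_paths) * np.sqrt(dt)
    np.minimum(running_min, X, out=running_min)

print("P(touch 0 by time t) ~", (running_min <= 0.0).mean())
print("2 * P(X_t < 0)       ~", 2.0 * (X < 0.0).mean())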


Stochastic Calculus Notes, Lecture 5
Last modified October 26, 2004

1 Integrals involving Brownian motion

1.1. Introduction: There are two kinds of integrals involving Brownian motion, time integrals and Ito integrals. The time integral, which is discussed here, is just the ordinary Riemann integral of a continuous but random function of t with respect to t. Such integrals define stochastic processes that satisfy interesting backward equations. On the one hand, this allows us to compute the expected value of the integral by solving a partial differential equation. On the other hand, we may find the solution of the partial differential equation by computing the expected value by Monte Carlo, for example. The Feynman Kac formula is one of the examples in this section.

1.2. The integral of Brownian motion: Consider the random variable, where X(t) continues to be standard Brownian motion,

Y = ∫_0^T X(t) dt .   (1)

We expect Y to be Gaussian because the integral is a linear functional of the (Gaussian) Brownian motion path X. Because X(t) is a continuous function of t, this is a standard Riemann integral. The Riemann sum approximations converge. As usual, for n > 0 we define ∆t = T/n and t_k = k∆t. The Riemann sum approximation is

Y_n = ∆t Σ_{k=0}^{n−1} X(t_k) ,   (2)

and Y_n → Y as n → ∞ because X(t) is a continuous function of t. The n summands in (2), X(t_k), form an n dimensional multivariate normal, so each of the Y_n is normal. It would be surprising if Y, as the limit of Gaussians, were not Gaussian.

1.3. The variance of Y : We will start the hard way, computing the variancefrom (2) and letting ∆t → 0. The trick is to use two summation variablesYn = ∆t

∑n−1k=0 X(tk) and Yn = ∆t

∑n−1j=0 X(tj). It is immediate from (2) that

E[Yn] = 0 and var(Yn) = E[Y 2n ]:

E[Y 2n ] = E[Yn · Yn]

= E

(∆tn−1∑k=0

X(tk)

∆tn−1∑j=0

X(tj)

= ∆t2

∑jk

E[X(tk)X(tj)] .

1

Page 74: Lecture Notes on Stochastic Calculus (NYU)

If we now let ∆t → 0, the left side converges to E[Y^2] and the right side converges to a double integral:

E[Y^2] = ∫_{s=0}^T ∫_{t=0}^T E[X(t)X(s)] ds dt . (3)

We can find the needed E[X(t)X(s)] for s > t by writing X(s) = X(t) + ∆X with ∆X independent of X(t), so

E[X(t)X(s)] = E[X(t)(X(t) + ∆X)] = E[X(t)X(t)] = t .

A variation of this argument gives E[X(t)X(s)] = s if s < t. Altogether,

E[X(t)X(s)] = min(t, s) ,

which is a famous formula. This now gives

E[Y^2] = ∫_{s=0}^T ∫_{t=0}^T E[X(t)X(s)] ds dt = ∫_{s=0}^T ∫_{t=0}^T min(s, t) ds dt = (1/3) T^3 .

There is a simpler and equally rigorous way to get this. Write Y = ∫_{s=0}^T X(s) ds and Y = ∫_{t=0}^T X(t) dt so that again

E[Y^2] = E[ ∫_{s=0}^T X(s) ds · ∫_{t=0}^T X(t) dt ]
       = E[ ∫_{s=0}^T ∫_{t=0}^T X(s)X(t) dt ds ]            (4)
       = ∫_{s=0}^T ∫_{t=0}^T E[X(s)X(t)] dt ds .            (5)

Going from (4) to (5) involves changing the order of integration¹. After all, E[·] just represents integration over a probability space. The right side of (4) has the abstract form

∫_{ω∈Ω} ( ∫_{s∈[0,T]} ∫_{t∈[0,T]} F(ω, s, t) dt ds ) dP(ω) .

1The possibility of changing order of abstract integrals was established by the twentiethcentury mathematician Fubini. He proved it to be correct if the double (triple in our case)integral converges absolutely (a requirement even for ordinary Riemann integrals) and thefunction F is jointly measurable in all its arguments. Our integrand is nonnegative, so theresult will be infinite if the integral does not converge absolutely. We omit a discussion ofproduct measures and joint measurability.


Here F = X(s)X(t), ω is the random outcome (the whole path X[0, T] here), and P represents Wiener measure. If we interchange the ordinary Riemann dsdt integral with the abstract dP integral, we get

∫_{s∈[0,T]} ∫_{t∈[0,T]} ( ∫_{ω∈Ω} F(ω, s, t) dP(ω) ) ds dt ,

which is the abstract form of (5).
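The answer E[Y^2] = T^3/3 is also easy to test by simulation. The sketch below is an added illustration (not from the notes; the grid size and path count are arbitrary): it forms the Riemann sums (2) for many simulated paths and compares the sample variance with T^3/3.

```python
import numpy as np

# Sample variance of Y_n = dt * sum_k X(t_k) for Brownian motion X,
# compared with the exact answer var(Y) = T^3 / 3.
rng = np.random.default_rng(1)
T, n, n_paths = 2.0, 500, 20000
dt = T / n

dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
X = np.cumsum(dX, axis=1)                # X(t_1), ..., X(t_n); X(t_0) = 0
Y = dt * X[:, :-1].sum(axis=1)           # dt * sum_{k=0}^{n-1} X(t_k)

print("sample var(Y) ~", Y.var())
print("T^3 / 3       =", T**3 / 3)
```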

1.4. Measurability of Brownian motion integrals: Suppose t1 < t2. Consider the integrals U = ∫_0^{t1} X(t) dt and V = ∫_{t1}^{t2} (X(t) − X(t1)) dt. We expect U to be measurable in F_{t1} because all the X values defining U are measurable in F_{t1}. Similarly, all the differences defining V are independent of anything in F_{t1}. Therefore, we expect V to be independent of U. We omit the straightforward proofs of these facts, which depend on elementary properties of abstract integration.

1.5. The X(t)^3 martingale: Many martingales are constructed from integrals involving Brownian motion. A simple one is

F(t) = X(t)^3 − 3 ∫_0^t X(s) ds .

To check the martingale property, choose t2 > t1 and, for t > t1, write X(t) = X(t1) + ∆X(t). Then

E[ ∫_0^{t2} X(t) dt | F_{t1} ] = E[ ( ∫_0^{t1} X(t) dt + ∫_{t1}^{t2} X(t) dt ) | F_{t1} ]
 = E[ ∫_0^{t1} X(t) dt | F_{t1} ] + E[ ∫_{t1}^{t2} (X(t1) + ∆X(t)) dt | F_{t1} ]
 = ∫_0^{t1} X(t) dt + (t2 − t1) X(t1) .

In the last line we use the facts that X(t) ∈ F_{t1} when t < t1, that X(t1) ∈ F_{t1}, and that E[∆X(t) | F_{t1}] = 0 when t > t1, which is part of the independent increments property. For the X(t)^3 part, we have

E[ (X(t1) + ∆X(t2))^3 | F_{t1} ]
 = E[ X(t1)^3 + 3X(t1)^2 ∆X(t2) + 3X(t1) ∆X(t2)^2 + ∆X(t2)^3 | F_{t1} ]
 = X(t1)^3 + 3X(t1)^2 · 0 + 3X(t1) E[∆X(t2)^2 | F_{t1}] + 0
 = X(t1)^3 + 3(t2 − t1) X(t1) .

In the last line we used the independent increments property to get E[∆X(t2) | F_{t1}] = 0 and E[∆X(t2)^3 | F_{t1}] = 0, and the formula for the variance of the increment to get E[∆X(t2)^2 | F_{t1}] = t2 − t1. Combining the two calculations, the terms 3(t2 − t1)X(t1) cancel:

E[F(t2) | F_{t1}] = X(t1)^3 + 3(t2 − t1)X(t1) − 3( ∫_0^{t1} X(t) dt + (t2 − t1)X(t1) ) = F(t1) ,

which is the martingale property.
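A martingale with F(0) = 0 has E[F(t)] = 0 for every t, which gives a cheap numerical check. The sketch below is an added illustration (path and step counts are arbitrary choices); it simulates F(T) = X(T)^3 − 3 ∫_0^T X(s) ds and reports the sample mean, which should be near zero.

```python
import numpy as np

# Check E[F(T)] = 0 for F(t) = X(t)^3 - 3 * int_0^t X(s) ds.
rng = np.random.default_rng(2)
T, n, n_paths = 1.0, 500, 20000
dt = T / n

dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
X = np.cumsum(dX, axis=1)
integral = dt * X.sum(axis=1)            # Riemann approximation of int_0^T X ds
F = X[:, -1]**3 - 3.0 * integral

print("sample mean of F(T):", F.mean())                  # should be close to 0
print("standard error     :", F.std() / np.sqrt(n_paths))
```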


1.6. Backward equations for expected values of integrals: Many integrals involving Brownian motion arise in applications and may be “solved” using backward equations. One example is F = ∫_0^T V(X(t)) dt, which represents the total accumulated V(X) over a Brownian motion path. If V(x) is a continuous function of x, the integral is a standard Riemann integral, because V(X(t)) is a continuous function of t. We can calculate E[F], using the more general function

f(x, t) = E_{x,t}[ ∫_t^T V(X(s)) ds ] . (6)

As before, we can describe the function f(x, t) in terms of the random variable

F(t) = E[ ∫_t^T V(X(s)) ds | F_t ] .

F(t) is measurable in F_t and, because it depends on the path only through future values (X(s) with s > t), the Markov property makes F(t) measurable in G_t. Since G_t is generated by X(t) alone, this means that F(t) is a function of X(t), which we write as F(t) = f(X(t), t). Of course, this definition is a big restatement of definition (6). Once we know f(x, t), we can plug in t = 0 to get E[F] = F(0) = f(x0, 0) if X(0) = x0 is known. Otherwise, E[F] = E[f(X(0), 0)].

The backward equation for f is

∂_t f + (1/2) ∂_x^2 f + V(x) = 0 , (7)

with final condition f(x, T) = 0. The derivation is similar to the one we used before for the backward equation for E_{x,t}[V(X(T))]. We use Taylor series and the tower property to calculate how f changes over a small time increment, ∆t. We start with

∫_t^T V(X(s)) ds = ∫_t^{t+∆t} V(X(s)) ds + ∫_{t+∆t}^T V(X(s)) ds ,

take the x, t expectation, and use (6) to get

f(x, t) = E_{x,t}[ ∫_t^{t+∆t} V(X(s)) ds ] + E_{x,t}[ ∫_{t+∆t}^T V(X(s)) ds ] . (8)

The first integral on the right has the value V(x)∆t + o(∆t). We write o(∆t) for a quantity that is smaller than ∆t in the sense that o(∆t)/∆t → 0 as ∆t → 0 (we will shortly divide by ∆t, take the limit ∆t → 0, and neglect all o(∆t) terms). The second term has

E[ ∫_{t+∆t}^T V(X(s)) ds | F_{t+∆t} ] = F(t + ∆t) = f(X(t + ∆t), t + ∆t) .


Writing X(t + ∆t) = X(t) + ∆X, we use the tower property with F_t ⊂ F_{t+∆t} to get

E[ ∫_{t+∆t}^T V(X(s)) ds | F_t ] = E[ f(X(t) + ∆X, t + ∆t) | F_t ] .

As before, we use Taylor expansion and the conditional expectation to get first

f(x + ∆X, t + ∆t) = f(x, t) + ∆t ∂_t f(x, t) + ∆X ∂_x f(x, t) + (1/2) ∆X^2 ∂_x^2 f(x, t) + o(∆t) ,

then

E_{x,t}[ f(x + ∆X, t + ∆t) ] = f(x, t) + ∆t ∂_t f(x, t) + (1/2) ∆t ∂_x^2 f(x, t) + o(∆t) .

Putting all this back into (8) gives

f(x, t) = ∆t V(x) + f(x, t) + ∆t ∂_t f(x, t) + (1/2) ∆t ∂_x^2 f(x, t) + o(∆t) .

Now just cancel f(x, t) from both sides, divide by ∆t, and let ∆t → 0 to get the promised equation (7).

1.7. Application of PDE: Most commonly, we cannot evaluate either theexpected value (6) or the solution of the partial differential equation (PDE)(7). How does the PDE represent progress toward evaluating f? One way isby suggesting a completely different computational procedure. If we work onlyfrom the definition (6), we would use Monte Carlo for numerical evaluation.Monte Carlo is notoriously slow and inaccurate. There are several techniquesfor finding the solution of a PDE that avoid Monte Carlo, including finite dif-ference methods, finite element methods, spectral methods, and trees. Whensuch deterministic methods are practical, they generally are more reliable, moreaccurate, and faster. In financial applications, we are often able to find PDEsfor quantities that have no simple Monte Carlo probabilistic definition. Manysuch examples are related to optimization problems: maximizing an expectedreturn or minimizing uncertainty with dynamic trading strategies in a randomlyevolving market. The Black Scholes evaluation of the value of an American styleoption is a well known example.
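As a concrete illustration of the deterministic alternative, here is a minimal explicit finite-difference sketch for the backward equation (7) on a truncated interval. This is an added example, not from the notes; the domain truncation, the crude boundary handling, and the choice V(x) = x^2 are assumptions made only for the test. It marches f backward from the final condition f(x, T) = 0.

```python
import numpy as np

# Explicit finite differences for  f_t + (1/2) f_xx + V(x) = 0,  f(x, T) = 0,
# marching backward in time on a truncated interval [-L, L].
def solve_backward(V, L=5.0, T=1.0, nx=401, nt=4000):
    x = np.linspace(-L, L, nx)
    dx = x[1] - x[0]
    dt = T / nt                            # needs dt <= dx^2 for stability
    f = np.zeros(nx)                       # final condition f(x, T) = 0
    for _ in range(nt):
        fxx = np.zeros(nx)
        fxx[1:-1] = (f[2:] - 2 * f[1:-1] + f[:-2]) / dx**2
        f = f + dt * (0.5 * fxx + V(x))    # step from t to t - dt
    return x, f

x, f0 = solve_backward(lambda x: x**2)
# For V(x) = x^2, f(x, 0) = x^2 * T + T^2 / 2 exactly, so f(0, 0) = 0.5 for T = 1.
exact = 0.0**2 * 1.0 + 1.0**2 / 2
print("f(0,0) numerical:", f0[np.argmin(np.abs(x))], " exact:", exact)
```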

1.8. The Feynman Kac formula: Consider

F = E[ exp( ∫_0^T V(X(t)) dt ) ] . (9)

As before, we evaluate F using the related and more refined quantity

f(x, t) = E_{x,t}[ e^{∫_t^T V(X(s)) ds} ] , (10)


which satisfies the backward equation

∂_t f + (1/2) ∂_x^2 f + V(x) f = 0 . (11)

When someone refers to the Feynman Kac formula, they usually are referringto the fact that (10) is a formula for the solution of the PDE (11). In our work,the situation mostly will be reversed. We use the PDE (11) to get informationabout the quantity defined by (10) or even just about the process X(t).

We can verify that (10) satisfies (11) more or less as in the preceding paragraph. We note that

exp( ∫_t^{t+∆t} V(X(s)) ds + ∫_{t+∆t}^T V(X(s)) ds )
 = exp( ∫_t^{t+∆t} V(X(s)) ds ) · exp( ∫_{t+∆t}^T V(X(s)) ds )
 = (1 + ∆t V(X(t)) + o(∆t)) · exp( ∫_{t+∆t}^T V(X(s)) ds ) .

The conditional expectation of the right side with respect to F_{t+∆t} is

(1 + ∆t V(X(t)) + o(∆t)) · f(X(t + ∆t), t + ∆t) .

When we now take expectation with respect to Ft, which amounts to averagingover ∆X, using Taylor expansion of f about f(x, t) as before, we get (11).
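For the special case V(x) = x, the exponent ∫_0^T W(t) dt is Gaussian with mean zero and variance T^3/3 (computed earlier in this lecture), so F = exp(T^3/6) exactly. The sketch below is an added illustration with arbitrary discretization choices; it checks this by Monte Carlo, which is the probabilistic side of the Feynman Kac correspondence.

```python
import numpy as np

# Monte Carlo for F = E[ exp( int_0^T V(W(t)) dt ) ] with V(x) = x.
# Since int_0^T W dt ~ N(0, T^3/3), the exact value is exp(T^3/6).
rng = np.random.default_rng(3)
T, n, n_paths = 1.0, 500, 20000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W = np.cumsum(dW, axis=1)
F_samples = np.exp(dt * W.sum(axis=1))

print("Monte Carlo  :", F_samples.mean())
print("exp(T^3 / 6) =", np.exp(T**3 / 6))
```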

1.9. The Feynman integral: A precursor to the Feynman Kac formula is the Feynman integral² solution to the Schrodinger equation. The Feynman integral is not an integral in the sense of measure theory. (Neither is the Ito integral, for that matter.) The colorful probabilist Marc Kac (pronounced “Katz”) discovered that an actual integral over Wiener measure (10) gives the solution of (11). Feynman’s reasoning will help us derive the Girsanov formula, so we pause to sketch it.

The finite difference approximation

∫_0^T V(X(t)) dt ≈ ∆t ∑_{k=0}^{n-1} V(X(tk)) , (12)

(always ∆t = T/n, tk = k∆t) leads to an approximation to F of the form

Fn = E[ exp( ∆t ∑_{k=0}^{n-1} V(X(tk)) ) ] . (13)

2The American physicist Richard Feynman was born and raised in Far Rockaway (a neighborhood of Queens, New York). He is the author of several wonderful popular books, including Surely You’re Joking, Mr. Feynman and The Feynman Lectures on Physics.


The functional Fn depends only on finitely many values Xk = X(tk), so we may evaluate (13) using the known joint density function for ~X = (X1, . . . , Xn). The density is (see “Path probabilities” from Lecture 5):

U^{(n)}(~x) = (1/(2π∆t)^{n/2}) exp( − ∑_{k=0}^{n-1} (x_{k+1} − x_k)^2 / 2∆t ) .

It is suggestive to rewrite this as

U^{(n)}(~x) = (1/(2π∆t)^{n/2}) exp[ − (∆t/2) ∑_{k=0}^{n-1} ( (x_{k+1} − x_k)/∆t )^2 ] . (14)

Using this to evaluate Fn gives

Fn = (1/(2π∆t)^{n/2}) ∫_{R^n} exp[ ∆t ∑_{k=0}^{n-1} V(x_k) − (∆t/2) ∑_{k=0}^{n-1} ( (x_{k+1} − x_k)/∆t )^2 ] d~x . (15)

It is easy to show that Fn → F as n → ∞ as long as V (x) is, say, continuousand bounded (see below).

Feynman proposed a view of F = lim_{n→∞} Fn in (15) that is not mathematically rigorous but explains “what’s going on”. If x_k ≈ x(t_k), then we should have

∆t ∑_{k=0}^{n-1} V(x_k) → ∫_{t=0}^T V(x(t)) dt .

Also,

(x_{k+1} − x_k)/∆t ≈ dx/dt = ẋ(t_k) ,

so we should also have

(∆t/2) ∑_{k=0}^{n-1} ( (x_{k+1} − x_k)/∆t )^2 → (1/2) ∫_0^T ẋ(t)^2 dt .

As n → ∞, the integral over R^n should converge to the integral over all “paths” x(t). We denote this by P without worrying about exactly which paths are allowed (continuous, differentiable, ...?). The integration element d~x has the possible formal limit

d~x = ∏_{k=0}^{n-1} dx_k = ∏_{k=0}^{n-1} dx(t_k) → ∏_{t=0}^T dx(t) .

Altogether, this gives the formal expression for the limit of (15):

F = const ∫_P exp( ∫_0^T V(x(t)) dt − (1/2) ∫_0^T ẋ(t)^2 dt ) ∏_{t=0}^T dx(t) . (16)


1.10. Feynman and Wiener integration: Mathematicians were quick to complain about (16). For one thing, the constant const = lim_{n→∞} (2π∆t)^{-n/2} is infinite. More seriously, there is no abstract integral measure corresponding to ∫_P ∏_{t=0}^T dx(t) (it is possible to prove this). Kac proposed to write (16) as

F = ∫_P exp( ∫_0^T V(x(t)) dt ) [ const · exp( − (1/2) ∫_0^T ẋ(t)^2 dt ) ∏_{t=0}^T dx(t) ] ,

and then interpret the latter part as Wiener measure (dP):

const · exp( − (1/2) ∫_0^T ẋ(t)^2 dt ) ∏_{t=0}^T dx(t) = dP(X) . (17)

In fact, we have already implicitly argued informally (and it can be formalized) that

U^{(n)}(~x) ∏_{k=0}^{n-1} dx_k → dP(X) as n → ∞ .

These intuitive but mathematically nonsensical formulas are a great help in understanding Brownian motion. For one thing, (17) makes clear that Wiener measure is Gaussian. Its density has the form const · exp(−Q(x)), where Q(x) is a positive quadratic function of x. Here Q(x) = (1/2) ∫ ẋ(t)^2 dt (and the constant is, alas, infinite). Moreover, in many cases it is possible to approximate integrals of the form ∫ exp(φ(~x)) d~x by e^{φ*}, where φ* = max_{~x} φ(~x), if φ is sharply peaked around its maximum. This is particularly common in “rare event” or “large deviation” problems. In our case, this would lead us to solve the calculus of variations problem

max_x ( ∫_0^T V(x(t)) dt − (1/2) ∫_0^T ẋ(t)^2 dt ) .

1.11. Application of Feynman Kac: The problem of evaluating

f = E[ exp( ∫_0^T V(X(t)) dt ) ]

arises in many situations. In finance, f could represent the present value of a payment in the future subject to unknown fluctuating interest rates. The PDE (11) provides a possible way to evaluate f = f(0, 0), either analytically or numerically.

2 Mathematical formalism

2.1. Introduction: We examine the solution formulas for the backward andforward equation from two points of view. The first is an analogy with linear


algebra, with function spaces playing the role of vector space and operatorsplaying the role of matrices. The second is a more physical picture, interpretingG(x, y, t) as the Green’s function describing the forward diffusion of a point massof probability or the backward diffusion of a localized unit of payout.

2.2. Solution operator: As time moves forward, the probability density for X(t) changes, or evolves. As time moves backward, the value function f(x, t) also evolves³. The backward evolution is given by (for s > 0; this is a consequence of the tower property)

f(x, t − s) = ∫ G(x, y, s) f(y, t) dy . (18)

We write this abstractly as f(t − s) = G(s)f(t). This formula is analogous to the Markov chain formula f(t − s) = P^s f(t). In the Markov chain case, s and t are integers and f(t) represents a vector in R^n whose components are f_k(t). Here, f(t) is a function of x whose values are f(x, t). We can think of P^s as an n × n matrix or as the linear operator that transforms the vector f to the vector g = P^s f. Similarly, G(s) is a linear operator, transforming a function f into g, with

g(x) = ∫_{−∞}^{∞} G(x, y, s) f(y) dy .

The operation is linear, which means that G(af^{(1)} + bf^{(2)}) = aGf^{(1)} + bGf^{(2)}. The family of operators G(s) for s > 0 produces the solution to the backward equation, so we call G(s) the solution operator for time s.

2.3. Duhamel’s principle: The inhomogeneous backward equation

∂_t f + (1/2) ∂_x^2 f + V(x, t) = 0 , (19)

with homogeneous⁴ final condition f(x, T) = 0, may be solved by

f(x, t) = E_{x,t}[ ∫_t^T V(X(t′), t′) dt′ ] .

Exchanging the order of integration, we may write

f(x, t) = ∫_{t′=t}^T g(x, t, t′) dt′ , (20)

where
g(x, t, t′) = E_{x,t}[ V(X(t′), t′) ] .

3Unlike biological evolution, this evolution process makes the solution less complicated, not more.

4We often say “homogeneous” to mean zero and “inhomogeneous” to mean not zero. Thatmay be because if V (x, t) is zero then it is constant, i.e. the same everywhere, which is theusual meaning of homogeneous.


This g is the expected value (at (x, t)) of a payout (V(·, t′) at time t′ > t). As such, g is the solution of a homogeneous final value problem with inhomogeneous final values:

∂_t g + (1/2) ∂_x^2 g = 0 for t < t′ ,
g(x, t′, t′) = V(x, t′) .        (21)

Duhamel’s principle, which we just demonstrated, is as follows. To solve the inhomogeneous final value problem (19), we solve a homogeneous final value problem (21) for each t′ between t and T, then we add up the results (20).

2.4. Infinitesimal generator: There are matrices of many different types that play various roles in theory and computation. And so it is with operators. In addition to the solution operator, there is the infinitesimal generator (or simply generator). For Brownian motion in one dimension, the generator is

L = (1/2) ∂_x^2 . (22)

The backward equation may be written

∂_t f + Lf = 0 . (23)

For other diffusion processes, the generator is the operator L that puts the backward equation for the process in the form (23).

Just as a matrix has a transpose, an operator has an adjoint, written L*. The forward equation takes the form

∂_t u = L* u .

The operator (22) for Brownian motion is self adjoint, which means that L* = L; this is why the same operator (1/2) ∂_x^2 appears in both the forward and backward equations. We will return to these points later.

2.5. Composing (multiplying) operators: If A and B are matrices, then thereare two ways to form the matrix AB. One way is to multiply the matrices. Theother is to compose the linear transformations: f → Bf → ABf . In this way,AB is the composite linear transformation formed by first applying B thenapplying A. We also can compose operators, even if we sometimes lack a goodexplicit representation for the composite AB. As with matrices, composition ofoperators is associative: A(Bf) = (AB)f .

2.6. Composing solution operators: The solution operator G(s1) moves the value function backward in time by the amount s1, which is written f(t − s1) = G(s1)f(t). The operator G(s2) moves it back an additional s2, i.e. f(t − (s1 + s2)) = G(s2)f(t − s1) = G(s2)G(s1)f(t). The result is to move f back by s1 + s2 in total, which is the same as applying G(s1 + s2). This shows that for every (allowed) f, G(s2)G(s1)f = G(s2 + s1)f, which means that

G(s2)G(s1) = G(s2 + s1) . (24)


This is called the semigroup property. It is a basic property of the solution operator for any problem. The matrix analogue for Markov chains is P^{s2+s1} = P^{s2}P^{s1}, which is a basic fact about powers of matrices having nothing to do with Markov chains. The property (24) would be called the group property if we were to allow negative s2 or s1, which we do not. Negative s is allowed in the matrix version if P is nonsingular. There is no particular physical reason for the transition matrix of a Markov chain to be nonsingular.

2.7. Operator kernels: If the matrix A has elements A_{jk}, we can compute g = Af by doing the sum g_j = ∑_k A_{jk} f_k. Similarly, the operator A may or may not have a kernel⁵, which is a function A(x, y) so that g = Af is represented by

g(x) = ∫ A(x, y) f(y) dy .

If operators A and B both have kernels, then the composite operator has the kernel

(AB)(x, y) = ∫ A(x, z) B(z, y) dz . (25)

To derive this formula, set g = Bf and h = Ag. Then h(x) = ∫ A(x, z) g(z) dz and g(z) = ∫ B(z, y) f(y) dy, which implies that

h(x) = ∫ ( ∫ A(x, z) B(z, y) dz ) f(y) dy .

This shows that (25) is the kernel of AB. The formula is analogous to the formula for matrix multiplication.

2.8. The semigroup property: When we defined the solution operators G(s) in (18), we did so by specifying the kernels

G(x, y, s) = (1/√(2πs)) e^{-(x-y)^2/2s} .

According to (25), the semigroup property should be an integral identity involving G. The identity is

G(x, y, s2 + s1) = ∫ G(x, z, s2) G(z, y, s1) dz .

More concretely,

(1/√(2π(s2+s1))) e^{-(x-y)^2/2(s2+s1)} = (1/√(2πs2)) (1/√(2πs1)) ∫ e^{-(x-z)^2/2s2} e^{-(z-y)^2/2s1} dz .

5The term kernel also describes vectors f with Af = 0, it is unfortunate that the sameword is used for these different objects.


The reader is encouraged to verify this by direct integration. It also can beverified by recognizing it as the statement that adding independent mean zeroGaussian random variables with variance s2 and s1 respectively gives a Gaussianwith variance s2 + s1.
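The identity can also be checked by a few lines of numerical quadrature. The sketch below is an added illustration; the grid and the particular values of x, y, s1, s2 are arbitrary assumptions for the test.

```python
import numpy as np

# Numerical check of  G(x, y, s1 + s2) = int G(x, z, s2) G(z, y, s1) dz
# for the heat kernel G(x, y, s) = exp(-(x - y)^2 / (2 s)) / sqrt(2 pi s).
def G(x, y, s):
    return np.exp(-(x - y)**2 / (2 * s)) / np.sqrt(2 * np.pi * s)

x, y, s1, s2 = 0.7, -0.3, 0.5, 1.2
z = np.linspace(-15, 15, 20001)          # wide enough to capture both Gaussians
dz = z[1] - z[0]

lhs = G(x, y, s1 + s2)
rhs = np.sum(G(x, z, s2) * G(z, y, s1)) * dz   # simple Riemann sum for the convolution
print("G(x, y, s1+s2)    =", lhs)
print("convolution value =", rhs)
```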

2.9. Fundamental solution: The operators G(t) form a fundamental solution⁶ for the problem ∂_t f + Lf = 0 if

∂_t G = LG , for t > 0 , (26)
G(0) = I . (27)

The property (26) really means that ∂_t (G(t)f) = L(G(t)f) for any f. If G(t) has a kernel G(x, y, t), this in turn means (as the reader should check) that

∂_t G(x, y, t) = L_x G(x, y, t) , (28)

where L_x means that the derivatives in L are with respect to the x variable in G. In our case, with G being the heat kernel, this is

∂_t (1/√(2πt)) e^{-(x-y)^2/2t} = (1/2) ∂_x^2 (1/√(2πt)) e^{-(x-y)^2/2t} ,

which we have checked and rechecked. Without matrices, we still have the identity operator: If = f for all f. The property (27) really means that G(t)f → f as t → 0. It is easy to verify this for our heat kernel provided that f is continuous.

2.10. Duhamel with the fundamental solution operator: The g appearing in (20) may be expressed as g(t, t′) = G(t′ − t)V(t′), where V(t′) is the function with values V(x, t′). This puts (20) in the form

f(t) = ∫_t^T G(t′ − t) V(t′) dt′ . (29)

We illustrate the properties of the fundamental solution operator by verifying (29) directly. We want to show that (29) implies that ∂_t f + Lf + V(t) = 0 and that f(T) = 0. The latter is clear. For the former, we compute ∂_t f(t) by differentiating the right side of (29):

∂_t ∫_t^T G(t′ − t) V(t′) dt′ = −G(t − t) V(t) − ∫_t^T G′(t′ − t) V(t′) dt′ .

We write G′(t) for ∂_t G(t); the fundamental solution property (26) gives G′(t′ − t) = LG(t′ − t). Continuing, the right side is

−V(t) − ∫_t^T LG(t′ − t) V(t′) dt′ = −V(t) − L ∫_t^T G(t′ − t) V(t′) dt′ .

6We have adjusted this definition from its original form in books on ordinary differentialequations to accomodate the backward evolution of the backward equation. This amounts toreversing the sign of L.


The integral on the right is f(t). Altogether, we have ∂_t f(t) = −V(t) − Lf(t), which is exactly the inhomogeneous backward equation (19).

2.11. Green’s function: Consider the solution formula for the homogeneous final value problem ∂_t f + Lf = 0, f(x, T) = V(x):

f(x, t) = ∫ G(x, y, T − t) V(y) dy . (30)

Consider a special “jackpot” payout V(y) = δ(y − x0). If you like, you can think of V(y) = 1/(2ε) when |y − x0| < ε and then let ε → 0. We then get f(x, t) = G(x, x0, T − t). The function G(x, x0, T − t), which satisfies ∂_t G + L_x G = 0 as a function of (x, t) with final value δ(x − x0) at t = T, is called the Green’s function⁷. The Green’s function represents the result of a point mass payout. A general payout can be expressed as a sum (integral) of point mass payouts at x0 with weight V(x0):

V(y) = ∫ V(x0) δ(y − x0) dx0 .

Since the backward equation is linear, the general value function will be theweighted sum (integral) of the point mass value functions, which is the formula(30).

2.12. More generally: Brownian motion is special in that G(x, y, t) is a function of x − y. This is because Brownian motion is translation invariant: a Brownian motion starting from any point looks like a Brownian motion starting from any other point. Brownian motion is also special in that the forward and backward equations are nearly the same, having the same spatial operator L = (1/2) ∂_x^2.

More general diffusion processes lose both these properties. The solution operator depends in a more complicated way on x and y. The backward equation is ∂_t f + Lf = 0 but the forward equation is ∂_t u = L*u. The Green’s function G(x, y, t) is the fundamental solution for the backward equation in the x, t variables with y as a parameter. It also is the fundamental solution to the forward equation in the y, t variables with x as a parameter. This material will be in a future lecture.

7This is in honor of a 19th century Englishman named Green.


Stochastic Calculus Notes, Lecture 7
Last modified December 3, 2004

1 The Ito integral with respect to Brownian motion

1.1. Introduction: Stochastic calculus is about systems driven by noise. The Ito calculus is about systems driven by white noise, which is the derivative of Brownian motion. To find the response of the system, we integrate the forcing, which leads to the Ito integral of a function against the derivative of Brownian motion.

The Ito integral, like the Riemann integral, has a definition as a certain limit. The fundamental theorem of calculus allows us to evaluate Riemann integrals without returning to the original definition. Ito’s lemma plays that role for Ito integration. Ito’s lemma has an extra term, not present in the fundamental theorem, that is due to the non smoothness of Brownian motion paths. We will explain the formal rule dW^2 = dt and its meaning.

In this section, standard one dimensional Brownian motion is W(t) (W(0) = 0, E[∆W^2] = ∆t). The change in Brownian motion in time dt is formally called dW(t). The independent increments property implies that dW(t) is independent of dW(t′) when t ≠ t′. Therefore, the dW(t) are a model of driving noise impulses acting on a system that are independent from one time to another. We want a rule to add up the cumulative effects of these impulses. In the first instance, this is the integral

Y(T) = ∫_0^T F(t) dW(t) . (1)

Our plan is to lay out the principal ideas first, then address the mathematical foundations for them later. There will be many points in the beginning paragraphs where we appeal to intuition rather than to mathematical analysis in making a point. To justify this approach, I (mis)quote a snippet of a poem I memorized in grade school: “So you have built castles in the sky. That is where they should be. Now put the foundations under them.” (Author unknown by me).

1.2. The Ito integral: Let F_t be the filtration generated by Brownian motion up to time t, and let F(t) ∈ F_t be an adapted stochastic process. Corresponding to the Riemann sum approximation to the Riemann integral, we define the following approximations to the Ito integral:

Y_∆t(t) = ∑_{tk < t} F(tk) ∆Wk , (2)


with the usual notations tk = k∆t and ∆Wk = W(t_{k+1}) − W(tk). If the limit exists, the Ito integral is

Y(t) = lim_{∆t→0} Y_∆t(t) . (3)

There is some flexibility in this definition, though far less than with the Riemann integral. It is absolutely essential that we use the forward difference rather than, say, the backward difference ((wrong) ∆Wk = W(tk) − W(t_{k-1})), so that

E[ F(tk) ∆Wk | F_{tk} ] = 0 . (4)

Each of the terms in the sum (2) is measurable in F_t, therefore Y_∆t(t) is also. If we evaluate at the discrete times tn, Y_∆t is a martingale:

E[ Y_∆t(t_{n+1}) | F_{tn} ] = Y_∆t(tn) .

In the limit ∆t → 0 this should make Y(t) also a martingale measurable in F_t.

1.3. Famous example: The simplest interesting integral with a random integrand F(t) is

Y(T) = ∫_0^T W(t) dW(t) .

If W(t) were differentiable with derivative Ẇ(t), we could calculate the limit of (2) using dW(t) = Ẇ(t) dt as

(wrong)  ∫_0^T W(t) Ẇ(t) dt = (1/2) ∫_0^T ∂_t( W(t)^2 ) dt = (1/2) W(T)^2 .  (wrong) (5)

But this is not what we get from the definition (2) with actual rough path Brownian motion. Instead we write

W(tk) = (1/2)( W(t_{k+1}) + W(tk) ) − (1/2)( W(t_{k+1}) − W(tk) ) ,

and get

Y_∆t(tn) = ∑_{k<n} W(tk) ( W(t_{k+1}) − W(tk) )
 = ∑_{k<n} (1/2)( W(t_{k+1}) + W(tk) )( W(t_{k+1}) − W(tk) ) − ∑_{k<n} (1/2)( W(t_{k+1}) − W(tk) )( W(t_{k+1}) − W(tk) )
 = ∑_{k<n} (1/2)( W(t_{k+1})^2 − W(tk)^2 ) − ∑_{k<n} (1/2)( W(t_{k+1}) − W(tk) )^2 .

The first sum on the bottom right telescopes; since W(0) = 0 it equals

(1/2) W(tn)^2 .


The second term is a sum of n independent random variables, each with expected value ∆t/2 and variance ∆t^2/2. As a result, the sum is a random variable with mean n∆t/2 = tn/2 and variance n∆t^2/2 = tn∆t/2. This implies that

(1/2) ∑_{tk<T} ( W(t_{k+1}) − W(tk) )^2 → T/2 as ∆t → 0 . (6)

Together, these results give the correct Ito answer

∫_0^T W(t) dW(t) = (1/2)( W(T)^2 − T ) . (7)

The difference between the right answer (7) and the wrong answer (5) is the T/2 coming from (6). This is a quantitative consequence of the roughness of Brownian motion paths. If W(t) were a differentiable function of t, that term would have the approximate value

(∆t/2) ∫_0^T ( dW/dt )^2 dt → 0 as ∆t → 0 .

1.4. Backward differencing, etc.: If we use the backward difference ∆Wk = W(tk) − W(t_{k-1}), then the martingale property (4) does not hold. For example, if F(t) = W(t) as above, then the right side changes from zero to (W(tn) − W(t_{n-1}))W(tn) (all quantities measurable in F_{tn}), which has expected value¹ ∆t. In fact, if we use the backward difference and follow the argument used to get (7), we get instead (1/2)( W(T)^2 + T ). In addition to the Ito integral there is the Stratonovich integral, which uses the central difference ∆Wk = (1/2)( W(t_{k+1}) − W(t_{k-1}) ). The Stratonovich definition makes the stochastic integral act more like a Riemann integral. In particular, the reader can check that the Stratonovich integral of W dW is (1/2) W(T)^2.
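The three differencing conventions are easy to compare numerically. The sketch below is an added illustration with arbitrary parameters; it uses one common discretization of each convention (forward, midpoint, backward) on the same paths and compares the sums with (1/2)(W(T)^2 − T), (1/2)W(T)^2, and (1/2)(W(T)^2 + T).

```python
import numpy as np

# Forward (Ito), midpoint (Stratonovich), and backward sums for int_0^T W dW.
rng = np.random.default_rng(4)
T, n, n_paths = 1.0, 2000, 5000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)

ito   = np.sum(W[:, :-1] * dW, axis=1)                     # forward difference
strat = np.sum(0.5 * (W[:, :-1] + W[:, 1:]) * dW, axis=1)  # midpoint rule
back  = np.sum(W[:, 1:] * dW, axis=1)                      # backward difference

WT = W[:, -1]
for name, s, target in [("Ito         ", ito,   0.5 * (WT**2 - T)),
                        ("Stratonovich", strat, 0.5 * WT**2),
                        ("backward    ", back,  0.5 * (WT**2 + T))]:
    print(name, "mean |sum - limit| =", np.abs(s - target).mean())
```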

1.5. Martingales: The Ito integral is a martingale. It was defined for that purpose. Often one can compute an Ito integral by starting with the ordinary calculus guess (such as (1/2) W(T)^2) and asking what needs to change to make the answer a martingale. In this case, the balancing term −T/2 does the trick.

1.6. The Ito differential: Ito’s lemma is a formula for the Ito differential, which, in turn, is defined using the Ito integral. Let F(t) be a stochastic process. We say dF = a(t)dW(t) + b(t)dt (the Ito differential) if

F(T) − F(0) = ∫_0^T a(t) dW(t) + ∫_0^T b(t) dt . (8)

The first integral on the right is an Ito integral and the second is a Riemannintegral. Both a(t) and b(t) may be stochastic processes (random functions of

1E[(W (tn) − W (tn−1))W (tn−1)] = 0, so E[(W (tn) − W (tn−1))W (tn)] = E[(W (tn) −W (tn−1))(W (tn)−W (tn−1))] = ∆t


time). For example, the Ito differential of W(t)^2 is

d( W(t)^2 ) = 2W(t) dW(t) + dt ,

which we verify by checking that

W(T)^2 = 2 ∫_0^T W(t) dW(t) + ∫_0^T dt .

This is a restatement of (7).

1.7. Ito’s lemma: The simplest version of Ito’s lemma involves a function f(w, t). The “lemma” is the formula (which must have been stated as a lemma in one of his papers):

df(W(t), t) = ∂_w f(W(t), t) dW(t) + (1/2) ∂_w^2 f(W(t), t) dt + ∂_t f(W(t), t) dt . (9)

According to the definition of the Ito differential, this means that

f(W(T), T) − f(W(0), 0)                                          (10)
 = ∫_0^T ∂_w f(W(t), t) dW(t) + ∫_0^T ( (1/2) ∂_w^2 f(W(t), t) + ∂_t f(W(t), t) ) dt .   (11)

1.8. Using Ito’s lemma to evaluate an Ito integral: Like the fundamental theorem of calculus, Ito’s lemma can be used to evaluate integrals. For example, consider

Y(T) = ∫_0^T W(t)^2 dW(t) .

A naive guess might be (1/3) W(T)^3, which would be the answer for a differentiable function. To check this, we calculate (using (9), ∂_w (w^3/3) = w^2, and (1/2) ∂_w^2 (w^3/3) = w)

d( (1/3) W(t)^3 ) = W(t)^2 dW(t) + W(t) dt .

This implies that

(1/3) W(T)^3 = ∫_0^T d( (1/3) W(t)^3 ) = ∫_0^T W(t)^2 dW(t) + ∫_0^T W(t) dt ,

which in turn gives

∫_0^T W(t)^2 dW(t) = (1/3) W(T)^3 − ∫_0^T W(t) dt .

This seems to be the end. There is no way to “integrate” Z(T) = ∫_0^T W(t) dt to get a function of W(T) alone. This is to say that Z(T) is not measurable in G_T, the algebra generated by W(T) alone. In fact, Z(T) depends equally on all


W (t) values for 0 ≤ t ≤ T . A more technical version of this remark is comingafter the discussion of the Brownian bridge.
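The identity can be checked pathwise by simulation. This short sketch is an added illustration (step and path counts are arbitrary); it compares the forward-difference sum for ∫_0^T W^2 dW with (1/3)W(T)^3 − ∫_0^T W dt on each path.

```python
import numpy as np

# Pathwise check of  int_0^T W^2 dW = (1/3) W(T)^3 - int_0^T W dt.
rng = np.random.default_rng(5)
T, n, n_paths = 1.0, 4000, 2000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)

ito_sum = np.sum(W[:, :-1]**2 * dW, axis=1)              # forward-difference Ito sum
rhs = W[:, -1]**3 / 3 - dt * np.sum(W[:, :-1], axis=1)   # (1/3) W(T)^3 - int_0^T W dt

print("mean |difference| :", np.abs(ito_sum - rhs).mean())  # shrinks as dt -> 0
```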

1.9. To tell a martingale: Suppose F(t) is an adapted stochastic process with dF(t) = a(t)dW(t) + b(t)dt. Then F is a martingale if and only if b(t) = 0. We call a(t)dW(t) the martingale part and b(t)dt the drift term. If b(t) is at all continuous, then it can be identified through (because E[ ∫ a(s) dW(s) | F_t ] = 0)

E[ F(t + ∆t) − F(t) | F_t ] = E[ ∫_t^{t+∆t} b(s) ds | F_t ] = b(t)∆t + o(∆t) . (12)

We give one and a half of the two parts of the proof of this theorem. If b = 0 for all t (and all, or almost all, ω ∈ Ω), then F(T) is an Ito integral and hence a martingale. If b(t) is a continuous function of t, then we may find a t*, ε > 0, and δ > 0 so that, say, b(t) > δ > 0 when |t − t*| < ε. Then E[F(t* + ε) − F(t* − ε)] > 2δε > 0, so F is not a martingale².

1.10. Deriving a backward equation: Ito’s lemma gives a quick derivation of backward equations. For example, take

f(W(t), t) = E[ V(W(T)) | F_t ] .

The tower property tells us that F(t) = f(W(t), t) is a martingale. But Ito’s lemma, together with the previous paragraph, implies that f(W(t), t) is a martingale if and only if ∂_t f + (1/2) ∂_w^2 f = 0, which is the backward equation for this case. In fact, the proof of Ito’s lemma (below) is much like the proof of this backward equation.

1.11. A backward equation with drift: The derivation of the backward equation for

f(w, t) = E_{w,t}[ ∫_t^T V(W(s), s) ds ]

uses the above, plus (12). Again using

F(t) = E[ ∫_t^T V(W(s), s) ds | F_t ] ,

with F(t) = f(W(t), t), we calculate

E[ F(t + ∆t) − F(t) | F_t ] = −E[ ∫_t^{t+∆t} V(W(s), s) ds | F_t ] = −V(W(t), t)∆t + o(∆t) .

2This is a somewhat incorrect version of the proof because ε, δ, and t∗ probably are random.There is a real proof something like this.


This says that dF(t) = a(t)dW(t) + b(t)dt where

b(t) = −V(W(t), t) .

But also, b(t) = ∂_t f + (1/2) ∂_w^2 f. Equating these gives the backward equation from Lecture 6:

∂_t f + (1/2) ∂_w^2 f + V(w, t) = 0 .

1.12. Proof of Ito’s lemma: We want to show that

f(W(T), T) − f(W(0), 0) = ∫_0^T f_w(W(t), t) dW(t) + ∫_0^T f_t(W(t), t) dt + (1/2) ∫_0^T f_{ww}(W(t), t) dt . (13)

Define ∆t = T/n, tk = k∆t, Wk = W(tk), ∆Wk = W(t_{k+1}) − W(tk), and fk = f(Wk, tk), and write

f_n − f_0 = ∑_{k=0}^{n-1} ( f_{k+1} − f_k ) . (14)

Taylor series expansion of the terms on the right of (14) will produce termsthat converge to the three integrals on the right of (13) plus error terms thatconverge to zero. In our pre-Ito derivations of backward equations, we used therelation E[(∆W )2] = ∆t. Here we argue that with many independent ∆Wk, wemay replace (∆Wk)2 with ∆t (its mean value).

The Taylor series expansion is

f_{k+1} − f_k = ∂_w f_k ∆Wk + (1/2) ∂_w^2 f_k (∆Wk)^2 + ∂_t f_k ∆t + R_k , (15)

where ∂_w f_k means ∂_w f(W(tk), tk), etc. The remainder has the bound³

|R_k| ≤ C( ∆t^2 + ∆t |∆Wk| + |∆Wk|^3 ) .

Finally, we separate the mean value of ∆Wk^2 from the deviation from the mean:

(1/2) ∂_w^2 f_k ∆Wk^2 = (1/2) ∂_w^2 f_k ∆t + (1/2) ∂_w^2 f_k (∆Wk^2 − ∆t) .

The individual summands on the right side all have order of magnitude ∆t. However, the mean zero terms (the second sum) add up to much less than the first sum, as we will see. With this, (14) takes the form

f_n − f_0 = ∑_{k=0}^{n-1} ∂_w f_k ∆Wk + ∑_{k=0}^{n-1} ∂_t f_k ∆t + (1/2) ∑_{k=0}^{n-1} ∂_w^2 f_k ∆t
          + (1/2) ∑_{k=0}^{n-1} ∂_w^2 f_k ( ∆Wk^2 − ∆t ) + ∑_{k=0}^{n-1} R_k . (16)

3We assume that f(w, t) is thrice differentiable with bounded third derivatives. The error in a finite Taylor approximation is bounded by the sizes of the largest terms not used. Here, that is ∆t^2 (for the omitted term ∂_t^2 f), ∆t(∆W)^2 (for ∂_t∂_w f), and ∆W^3 (for ∂_w^3 f).


The first three sums on the right converge respectively to the correspondingintegrals on the right side of (13). A technical digression will show that the lasttwo converge to zero as n →∞ in a suitable way.

1.13. Like Borel Cantelli: As much as the formulas, the proofs in stochastic calculus rely on calculating expected values of things. Here, S_m is a sequence of random numbers and we want to show that S_m → 0 as m → ∞ (almost surely). We use two observations. First, if s_m is a sequence of numbers with ∑_{m=1}^∞ |s_m| < ∞, then s_m → 0 as m → ∞. Second, if B ≥ 0 is a random variable with E[B] < ∞, then B < ∞ almost surely (if the event B = ∞ had positive probability, then E[B] would be infinite). We take B = ∑_{m=1}^∞ |S_m|. If ∑_{m=1}^∞ E[|S_m|] < ∞, then E[B] < ∞, so B < ∞ almost surely, which means ∑_{m=1}^∞ |S_m| < ∞ and S_m → 0 as m → ∞. What this shows is:

( ∑_{m=1}^∞ E[|S_m|] < ∞ )  ⟹  ( S_m → 0 as m → ∞ (a.s.) ) . (17)

This observation is a variant of the Borel Cantelli lemma, which often is used in such arguments.

1.14. One of the error terms: To apply the Borel Cantelli lemma we must find bounds for the error terms, bounds whose sum is finite. We start with the last error term in (16). Choose n = 2^m and define S_m = ∑_{k=0}^{n-1} R_k, with

|R_k| ≤ C( ∆t^2 + ∆t |∆Wk| + |∆Wk|^3 ) .

Since E[|∆Wk|] ≤ C√∆t and E[|∆Wk|^3] ≤ C∆t^{3/2} (you do the math – the integrals), this gives (with n∆t = T)

E[|S_m|] ≤ Cn( ∆t^2 + ∆t^{3/2} ) ≤ CT √∆t .

Expressed in terms of m, we have ∆t = T/2^m and √∆t = √T 2^{-m/2} = √T (√2)^{-m}. Therefore E[|S_m|] ≤ C(T) (√2)^{-m}. Now, if z is any number greater than one, then ∑_{m=1}^∞ z^{-m} = 1/(z − 1) < ∞. This implies that ∑_{m=1}^∞ E[|S_m|] < ∞ and (using Borel Cantelli) that S_m → 0 as m → ∞ (almost surely).

This argument would not have worked this way had we taken n = m instead of n = 2^m. The error bounds of order 1/√n would not have had a finite sum. If both error terms in the bottom line of (16) go to zero as m → ∞ with n = 2^m, this will prove Ito’s lemma. We will return to this point when we discuss the difference between almost sure convergence, which we are using here, and convergence in probability, which we are not.

1.15. The other sum: The other error sum in (16) is small not because of thesmallness of its terms, but because of cancellation. The positive and negative


terms roughly balance, leaving a sum smaller than the sizes of the terms would suggest. This cancellation is of the same sort appearing in the central limit theorem, where U_n = ∑_{k=0}^{n-1} X_k is of order √n rather than n when the X_k are i.i.d. with mean zero and finite variance. In fact, using a trick we used before, we show that U_n^2 is of order n rather than n^2:

E[U_n^2] = ∑_{j,k} E[X_j X_k] = n E[X_k^2] = cn .

Our sum is

U_n = ∑_k (1/2) ∂_w^2 f(W_k, t_k) ( ∆Wk^2 − ∆t ) .

The above argument applies, though the terms are not independent. Suppose j ≠ k and, say, k > j. The cross term involving ∆Wj and ∆Wk still vanishes because

E[ ∆Wk^2 − ∆t | F_{tk} ] = 0 ,

and the rest is in F_{tk}. Also (as we have used before)

E[ ( ∆Wk^2 − ∆t )^2 | F_{tk} ] = 2∆t^2 .

Therefore

E[U_n^2] = (1/4) ∑_{k=0}^{n-1} E[ ( ∂_w^2 f(W_k, t_k) )^2 ( ∆Wk^2 − ∆t )^2 ] ≤ C ∑_{k=0}^{n-1} ∆t^2 ≤ C(T) ∆t .

As before, we take n = 2^m and sum to find that U_{2^m}^2 → 0 as m → ∞, which of course implies that U_{2^m} → 0 as m → ∞ (almost surely).

1.16. Convergence of Ito sums: Choose ∆t and define tk = k∆t and Wk = W(tk). To approximate the Ito integral

Y(T) = ∫_0^T F(t) dW(t) ,

we have the Ito sums

Y_m(T) = ∑_{tk<T} F(tk) ( W_{k+1} − W_k ) , (18)

where ∆t = 2^{-m}. In proving convergence of Riemann sums to the Riemann integral, we assume that the integrand is continuous. Here, we will prove that lim_{m→∞} Y_m(T) exists under the hypothesis

E[ ( F(t + ∆t) − F(t) )^2 ] ≤ C∆t . (19)

This is natural in that it represents the smoothness of Brownian motion paths. We will discuss what can be done for integrands rougher than (19).

The trick is to compare Y_m with Y_{m+1}, which is to compare the ∆t approximation to the ∆t/2 approximation. For that purpose, define t_{k+1/2} = (k + 1/2)∆t,


Wk+1/2 = W (tk+1/2), etc. The tk term in the Ym sum corresponds to the timeinterval (tk, tk+1). The Ym+1 sum divides this interval into two subintervals oflength ∆t/2. Therefore, for each term in the Ym sum there are two correspond-ing terms in the Ym+1 sum (assuming T is a multiple of ∆t), and:

Y_{m+1}(T) − Y_m(T) = ∑_{tk<T} [ F(tk)( W_{k+1/2} − W_k ) + F(t_{k+1/2})( W_{k+1} − W_{k+1/2} ) − F(tk)( W_{k+1} − W_k ) ]
 = ∑_{tk<T} ( W_{k+1} − W_{k+1/2} )( F(t_{k+1/2}) − F(tk) )
 = ∑_{tk<T} R_k ,

where
R_k = ( W_{k+1} − W_{k+1/2} )( F(t_{k+1/2}) − F(tk) ) .

We compute E[( Y_{m+1}(T) − Y_m(T) )^2] = ∑_{j,k} E[R_j R_k]. As before,⁴ E[R_j R_k] = 0 unless j = k. Also, the independent increments property and (19) imply that⁵

E[R_k^2] = E[ ( W_{k+1} − W_{k+1/2} )^2 ] · E[ ( F(t_{k+1/2}) − F(tk) )^2 ] ≤ (∆t/2) · C (∆t/2) = C∆t^2 .

This gives

E[ ( Y_{m+1}(T) − Y_m(T) )^2 ] ≤ C 2^{-m} . (20)

The convergence of the Ito sums follows from (20) using our Borel Cantelli type lemma. Let S_m = Y_{m+1} − Y_m. From (20), we have⁶ E[|S_m|] ≤ C 2^{-m/2}. Thus

lim_{m→∞} Y_m(T) = Y_1(T) + ∑_{m≥1} ( Y_{m+1}(T) − Y_m(T) )

exists and is finite. This shows that the limit defining the Ito integral exists, at least in the case of an integrand that satisfies (19), which includes most of the cases we use.

1.17. Ito isometry formula: This is the formula

E[ ( ∫_{T1}^{T2} a(t) dW(t) )^2 ] = ∫_{T1}^{T2} E[a(t)^2] dt . (21)

4If j > k, then E[ W_{j+1} − W_{j+1/2} | F_{t_{j+1/2}} ] = 0, so E[ R_j R_k | F_{t_{j+1/2}} ] = 0, and E[R_j R_k] = 0.

5Mathematicians often use the same letter C to represent different constants in the same formula. For example, C1∆t + C2^2 ∆t ≤ C∆t really means: let C = C1 + C2^2; if u ≤ C1∆t and v ≤ C2√∆t, then u + v^2 ≤ C∆t. Instead, we don’t bother to distinguish between the various constants.
6The Cauchy Schwartz inequality gives E[|S_m|] = E[|S_m| · 1] ≤ ( E[S_m^2] E[1^2] )^{1/2} = E[S_m^2]^{1/2}.


The derivation uses what we have just done. We approximate the Ito integral by the sum

∑_{T1 ≤ tk < T2} a(tk) ∆Wk .

Because the different ∆Wk are independent, and because of the independent increments property, the expected square of this is

∑_{T1 ≤ tk < T2} E[a(tk)^2] ∆t .

The formula (21) follows from this. An application of this is to understand the roughness of Y(T) = ∫_0^T a(t) dW(t). If E[a(t)^2] ≤ C for all t ≤ T, then E[( Y(T2) − Y(T1) )^2] ≤ C(T2 − T1). This is the same roughness as Brownian motion itself.
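For a(t) = W(t) the isometry formula gives E[( ∫_0^T W dW )^2] = ∫_0^T t dt = T^2/2, which agrees with the explicit answer (1/2)(W(T)^2 − T). The sketch below is an added illustration that checks this numerically; the parameters are arbitrary.

```python
import numpy as np

# Ito isometry check:  E[(int_0^T W dW)^2] = int_0^T E[W(t)^2] dt = T^2 / 2.
rng = np.random.default_rng(6)
T, n, n_paths = 1.0, 500, 20000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)
ito = np.sum(W[:, :-1] * dW, axis=1)        # forward-difference Ito sums

print("E[(int W dW)^2] ~", (ito**2).mean())
print("T^2 / 2         =", T**2 / 2)
```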

1.18. White noise: White noise is a generalized function⁷, ξ(t), which is thought of as homogeneous and Gaussian with ξ(t1) independent of ξ(t2) for t1 ≠ t2. More precisely, if t0 < t1 < · · · < tn and Yk = ∫_{tk}^{t_{k+1}} ξ(t) dt, then the Yk are independent and normal with zero mean and var(Yk) = t_{k+1} − tk. You can convince yourself that ξ(t) is not a true function by showing that it would have to have ∫_a^b ξ(t)^2 dt = ∞ for any a < b. Brownian motion can be thought of as the motion of a particle pushed by white noise, i.e. W(t) = ∫_0^t ξ(s) ds. The Yk defined above then are the increments of Brownian motion and have the appropriate statistical properties (independent, mean zero, normal, variance t_{k+1} − tk).

These properties may be summarized by saying that ξ(t) has mean zero and

cov(ξ(t), ξ(t′)) = E[ξ(t)ξ(t′)] = δ(t − t′) . (22)

For example, if f(t) and g(t) are deterministic functions and Y_f = ∫ f(t) ξ(t) dt and Y_g = ∫ g(t) ξ(t) dt, then (22) implies that

E[Y_f Y_g] = ∫_t ∫_{t′} f(t) g(t′) E[ξ(t)ξ(t′)] dt dt′
           = ∫_t ∫_{t′} f(t) g(t′) δ(t − t′) dt dt′
           = ∫_t f(t) g(t) dt .

It is tempting to take dW(t) = ξ(t)dt in the Ito integral and use (22) to derive the Ito isometry formula. However, this must be done carefully because the existence of the Ito integral, and the isometry formula, depend on the causality structure that makes dW(t) independent of a(t).

7A generalized function is not an actual function, but has properties defined as though it were an actual function through integration. The δ function, for example, is defined by the formula ∫ f(t)δ(t)dt = f(0). No actual function can do this. Generalized functions also are called distributions.


2 Stochastic Differential Equations

2.1. Introduction: The theory of stochastic differential equations (SDE) is aframework for expressing dynamical models that include both random and nonrandom forces. The theory is based on the Ito integral. Like the Ito integral,approximations based on finite differences that do not respect the martingalestructure of the equation can converge to different answers. Solutions to ItoSDEs are Markov processes in that the future depends on the past only throughthe present. For this reason, the solutions can be studied using backward andforward equations, which turn out to be linear parabolic partial differential equa-tions of diffusion type.

2.2. A Stochastic Differential Equation: An Ito stochastic differential equation takes the form

dX(t) = a(X(t), t) dt + σ(X(t), t) dW(t) . (23)

A solution is an adapted process that satisfies (23) in the sense that

X(T) − X(0) = ∫_0^T a(X(t), t) dt + ∫_0^T σ(X(t), t) dW(t) , (24)

where the first integral on the right is a Riemann integral and the second is an Ito integral. We often specify initial conditions X(0) ∼ u0(x), where u0(x) is the given probability density for X(0). Specifying X(0) = x0 is the same as saying u0(x) = δ(x − x0). As in the general Ito differential, a(X(t), t)dt is the drift term, and σ(X(t), t)dW(t) is the martingale term. We often call σ(x, t) the volatility. However, this is a different use of the letter σ from Black Scholes, where the martingale term is σx dW(t) for a constant σ (also called volatility).

2.3. Geometric Brownian motion: The SDE

dX(t) = µX(t)dt + σX(t)dW (t) , (25)

with initial data X(0) = 1, defines geometric Brownian motion. In the generalformulation above, (25) has drift coefficient a(x, t) = µx, and volatility σ(x, t) =σx (with the conflict of terminology noted above). If W (t) were a differentiablefunction of t, the solution would be

(wrong)  X(t) = e^{µt + σW(t)} . (26)

To check this, define the function x(w, t) = e^{µt + σw} with ∂_w x = x_w = σx, x_t = µx, and x_{ww} = σ^2 x, so that the Ito differential of the trial function (26) is

dX(W(t), t) = µX dt + σX dW(t) + (σ^2/2) X dt .


We can remove the unwanted final term by multiplying by e^{-σ^2 t/2}, which suggests that the formula

X(t) = e^{µt − σ^2 t/2 + σW(t)} (27)

satisfies (25). A quick Ito differentiation verifies that it does.
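A strong solution formula like (27) is handy for testing numerical SDE schemes. The sketch below is an added illustration (µ, σ, and the step size are arbitrary); it integrates (25) with the simple Euler Maruyama scheme on one path and compares the result with the exact formula (27) built from the same Brownian increments.

```python
import numpy as np

# Euler-Maruyama for dX = mu X dt + sigma X dW versus the exact solution
# X(T) = exp((mu - sigma^2/2) T + sigma W(T)), using the same increments.
rng = np.random.default_rng(7)
mu, sigma, T, n = 0.1, 0.4, 1.0, 10000
dt = T / n

dW = rng.normal(0.0, np.sqrt(dt), size=n)
W_T = dW.sum()

X_euler = np.prod(1.0 + mu * dt + sigma * dW)   # X_{k+1} = X_k (1 + mu dt + sigma dW_k)
X_exact = np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)

print("Euler-Maruyama X(T):", X_euler)
print("exact formula  X(T):", X_exact)
```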

2.4. Properties of geometric Brownian motion: Let us focus on the simple case µ = 0, σ = 1, so that

dX(t) = X(t) dW(t) . (28)

The solution, with initial data X(0) = 1, is the simple geometric Brownian motion

X(t) = exp( W(t) − t/2 ) . (29)

We discuss (29) in relation to the martingale property (X(t) is a martingale because the drift term in (23) is zero in (28)). For t′ > t, a simple calculation based on

X(t′) = exp( W(t) − t/2 + ( W(t′) − W(t) ) − (t′ − t)/2 )

and integrals of Gaussians shows that E[X(t′) | F_t] = X(t).

However, W(t) has the order of magnitude √t. For large t, this means that the exponent in (29) is roughly equal to −t/2, which suggests that

X(t) = exp( W(t) − t/2 ) ≈ e^{-t/2} → 0 as t → ∞ (a.s.).

Therefore, the expected value E[X(t)] = 1 is not, for large t, produced by typical paths, but by very exceptional ones. To be quantitative, there is an exponentially small probability that X(t) is as large as its expected value:

P( X(t) ≥ 1 ) ≤ e^{-t/8} for large t.

2.5. Dominated convergence theorem: The dominated convergence theoremis about expected values of limits of random variables. Suppose X(t, ω) is afamily of random variables and that limt→∞X(t, ω) = Y (ω) for almost everyω. The random variable U(ω) dominates the X(t, ω) if |X(t, ω)| ≤ U(ω) foralmost every ω and for every t > 0. We often write this simply as X(t) → Yas t → ∞ a.s., and |X(t)| ≤ U a.s. The theorem states that if E[U ] < ∞ thenE[X(t)] → E[Y ] as t → ∞. It is fairly easy to prove the theorem from thedefinition of abstract integration. The simplicity of the theorem is one of theways abstract integration is simpler than Riemann integration.

The reason for mentioning this theorem here is that geometric Brownianmotion (29) is an example showing what can go wrong without a dominatingfunction. Although X(t) → 0 as t → ∞ a.s., the expected value of X(t) doesnot go to zero, as it would do if the conditions of the dominated convergencetheorem were met. The reader is invited to study the maximal function, whichis the random variable M = maxt>0(W (t)− t/2), in enough detail to show thatE[eM ] = ∞.


2.6. Strong and weak solutions: A strong solution is an adapted function X(W, t), where the Brownian motion path W again plays the role of the abstract random variable, ω. As in the discrete case, X(t) (i.e. X(W, t)) being measurable in F_t means that X(t) is a function of the values of W(s) for 0 ≤ s ≤ t. The two examples we have, geometric Brownian motion (27) and the Ornstein Uhlenbeck process⁸

X(t) = σ ∫_0^t e^{-γ(t-s)} dW(s) , (30)

both have this property. Note that (27) depends only on W(t), while (30) depends on the whole path up to time t.

A weak solution is a stochastic process, X(t), defined perhaps on a differentprobability space and filtration (Ω, Ft) that has the statistical properties calledfor by (23). These are (using ∆X = X(t + ∆t)−X(t)) roughly9

E[∆X | Ft] = a(X(t), t)∆t + o(∆t) , (31)

and

E[∆X^2 | F_t] = σ^2(X(t), t)∆t + o(∆t) . (32)

We will see that a strong solution satisfies (31) and (32), so a strong solutionis a weak solution. It makes no sense to ask whether a weak solution is astrong solution since we have no information on how, or even whether, the weaksolution depends on W .

The formulas (31) and (32) are helpful in deriving SDE descriptions of physical or financial systems. We calculate the left sides to identify the a(x, t) and σ(x, t) in (23). Brownian motion paths and Ito integration are merely a tool for constructing the desired process X(t). We saw in the example of geometric Brownian motion that expressing the solution in terms of W(t) can be very convenient for understanding its properties. For example, it is not particularly easy to show that X(t) → 0 as t → ∞ from (31) and (32) with a(x, t) = µx and¹⁰ σ(x, t) = σx.

2.7. Strong is weak: We just verify that the strong solution to (23) thatsatisfies (24) also satisfies the weak form requirements (31) and (32). This is animportant motivation for using the Ito definition of dW rather than, say, theStratonovich definition.

A slightly more general fact is simpler to explain. Define R and I by

R = ∫_t^{t+∆t} a(s) ds ,   I = ∫_t^{t+∆t} σ(s) dW(s) ,

8This process satisfies the SDE dX = −γX dt + σ dW, with X(0) = 0.
9The little o notation f(t) = g(t) + o(t) informally means that the difference between f and g is a mathematician’s order of magnitude smaller than t for small t. Formally, it means that (f(t) − g(t))/t → 0 as t → 0.

10This conflict of notation is common in discussing geometric Brownian motion. On the leftis the coefficient of dW (t). On the right is the financial volatility coefficient.


where a(t) and σ(t) are continuous adapted stochastic processes. We want to see that

E[ R + I | F_t ] = a(t)∆t + o(∆t) , (33)

and

E[ (R + I)^2 | F_t ] = σ^2(t)∆t + o(∆t) . (34)

We may leave I out of (33) because E[I] = 0 always. We may leave R out of (34)because |I| >> |R|. (If a is bounded then R = O(∆t) so E[R2 | Ft] = O(∆t2).The Ito isometry formula suggests that E[I2 | Ft] = O(∆t). Cauchy Schwartzthen gives E[RI | Ft] = O(∆t3/2). Altogether, E[(R + I)2 | Ft] = E[I2 |Ft] + O(∆t3/2).)

To verify (33) without I, we assume that a(t) is a continuous function of t in the sense that for s > t,

E[ a(s) − a(t) | F_t ] → 0 as s → t .

This implies that

(1/∆t) ∫_t^{t+∆t} E[ a(s) − a(t) | F_t ] ds → 0 as ∆t → 0,

so that

E[ R | F_t ] = ∫_t^{t+∆t} E[ a(s) | F_t ] ds
             = ∫_t^{t+∆t} E[ a(t) | F_t ] ds + ∫_t^{t+∆t} E[ a(s) − a(t) | F_t ] ds
             = ∆t a(t) + o(∆t) .

This verifies (33). The Ito isometry formula gives

E[ I^2 | F_t ] = ∫_t^{t+∆t} E[ σ(s)^2 | F_t ] ds ,

so (34) follows in the same way.

2.8. Markov diffusions: Roughly speaking,¹¹ a diffusion process is a continuous stochastic process that satisfies (31) and (32). If the process is Markov, the a of (31) and the σ² of (32) must be functions of X(t) and t. If a(x, t) and σ(x, t) are Lipschitz (|a(x, t) − a(y, t)| ≤ C|x − y|, etc.) functions of x and t, then it is possible to express X(t) as a strong solution of an Ito SDE (23).

This is the way equations (23) are often derived in practice. We start offwanting to model a process with an SDE. It could be a random walk on a latticewith the lattice size converging to zero or some other process that we hope will

11More detailed treatments are in the books by Steele, Chung and Williams, Karatzas and Shreve, and Oksendal.


have a limit as a diffusion. The main step in proving the limit exists is tightness, which we hint at in a lecture to follow. We identify a and σ by calculations. Then we use the representation theorem to say that the process may be represented as the strong solution to (23).

2.9. Backward equation: The simplest backward equation is the PDE sat-isfied by f(x, t) = Ex,t[V (X(T ))]. We derive it using the weak form conditions(31) and(32) and the tower property. As with Brownian motion, the towerproperty gives

f(x, t) = Ex,t[V (X(T ))] = Ex,t[F (t + ∆t)] ,

where F (s) = E[V (X(T )) | Fs]. The Markov property implies that F (s) is afunction of X(s) alone, so F (s) = f(X(s), s). This gives

f(x, t) = Ex,t [f(X(t + ∆t), t + ∆t)] . (35)

If we assume that f is a smooth function of x and t, we may expand in Taylor series, keeping only terms that contribute O(∆t) or more.¹² We use ∆X = X(t + ∆t) − x and write f for f(x, t), f_t for f_t(x, t), etc.

f(X(t + ∆t), t + ∆t) = f + f_t ∆t + f_x ∆X + (1/2) f_{xx} ∆X^2 + smaller terms.

Therefore (31) and (32) give:

f(x, t) = E_{x,t}[ f(X(t + ∆t), t + ∆t) ]
        = f(x, t) + f_t ∆t + f_x E_{x,t}[∆X] + (1/2) f_{xx} E_{x,t}[∆X^2] + o(∆t)
        = f(x, t) + f_t ∆t + f_x a(x, t)∆t + (1/2) f_{xx} σ^2(x, t)∆t + o(∆t) .

We now just cancel the f(x, t) from both sides, let ∆t → 0, and drop the o(∆t) terms to get the backward equation

∂_t f(x, t) + a(x, t) ∂_x f(x, t) + (σ^2(x, t)/2) ∂_x^2 f(x, t) = 0 . (36)

2.10. Forward equation: The forward equation follows from the backward equation by duality. Let u(x, t) be the probability density for X(t). Since f(x, t) = E_{x,t}[V(X(T))], we may write

E[V(X(T))] = ∫_{−∞}^{∞} u(x, t) f(x, t) dx ,

which is independent of t. Differentiating with respect to t and using the backward equation (36) for f_t, we get

0 = ∫ u(x, t) f_t(x, t) dx + ∫ u_t(x, t) f(x, t) dx
  = −∫ u(x, t) a(x, t) ∂_x f(x, t) dx − (1/2) ∫ u(x, t) σ^2(x, t) ∂_x^2 f(x, t) dx + ∫ u_t(x, t) f(x, t) dx .

12The homework has more on the terms left out.


We integrate by parts to put the x derivatives on u. We may ignore boundary terms if u decays fast enough as |x| → ∞ and if f does not grow too fast. The result is

∫ ( ∂_x( a(x, t) u(x, t) ) − (1/2) ∂_x^2( σ^2(x, t) u(x, t) ) + ∂_t u(x, t) ) f(x, t) dx = 0 .

Since this should be true for every function f(x, t), the integrand must vanish identically, which implies that

∂_t u(x, t) = −∂_x( a(x, t) u(x, t) ) + (1/2) ∂_x^2( σ^2(x, t) u(x, t) ) . (37)

This is the forward equation for the Markov process that satisfies (31) and (32).
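A minimal numerical illustration of (37): the sketch below is an added example in which the Ornstein Uhlenbeck coefficients a(x) = −γx and σ(x) = σ, the Gaussian initial density, and the grid are all assumptions chosen for the test. It marches the forward equation with explicit finite differences and compares the result with the exact Gaussian density of the Ornstein Uhlenbeck process.

```python
import numpy as np

# Explicit finite differences for the forward equation (37) with
# a(x) = -gamma*x and constant sigma (Ornstein-Uhlenbeck), starting from a
# Gaussian density, compared with the exact Gaussian solution.
gamma, sigma, T = 1.0, 0.7, 1.0
L, nx, nt = 6.0, 601, 20000
x = np.linspace(-L, L, nx)
dx = x[1] - x[0]
dt = T / nt                                # small enough for stability here

v0 = 0.2                                   # initial variance
u = np.exp(-x**2 / (2 * v0)) / np.sqrt(2 * np.pi * v0)

for _ in range(nt):
    au = -gamma * x * u                    # a(x) u
    rhs = np.zeros_like(u)
    rhs[1:-1] = -(au[2:] - au[:-2]) / (2 * dx) \
                + 0.5 * sigma**2 * (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    u = u + dt * rhs                       # u_t = -(a u)_x + (1/2)(sigma^2 u)_xx

# OU variance at time T starting from N(0, v0): exp(-2*gamma*T)*v0 + sigma^2*(1-exp(-2*gamma*T))/(2*gamma)
vT = np.exp(-2 * gamma * T) * v0 + sigma**2 * (1 - np.exp(-2 * gamma * T)) / (2 * gamma)
exact = np.exp(-x**2 / (2 * vT)) / np.sqrt(2 * np.pi * vT)
print("max |u - exact| :", np.abs(u - exact).max())
print("total mass      :", u.sum() * dx)   # should stay close to 1
```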

2.11. Transition probabilities: The transition probability density is the probability density for X(s) given that X(t) = y, with s > t. We write it as G(y, x, t, s), the probability density to go from y to x as time goes from t to s. If the drift and diffusion coefficients do not depend on t, then G is a function of s − t. Because G is a probability density in the x and s variables, it satisfies the forward equation

∂_s G(y, x, t, s) = −∂_x( a(x, s) G(y, x, t, s) ) + (1/2) ∂_x^2( σ^2(x, s) G(y, x, t, s) ) . (38)

In this equation, t and y are merely parameters, but s may not be smaller than t. The initial condition that represents the requirement that X(t) = y is

G(y, x, t, t) = δ(x − y) . (39)

The transition density is the Green’s function for the forward equation, which means that the general solution may be written in terms of G as

u(x, s) = ∫_{−∞}^{∞} u(y, t) G(y, x, t, s) dy . (40)

This formula is a continuous time version of the law of total probability: the probability density to be at x at time s is the sum (integral) of the probability density to be at x at time s conditional on being at y at time t (which is G(y, x, t, s)) multiplied by the probability density to be at y at time t (which is u(y, t)).

2.12. Green’s function for the backward equation: We can also express the solution of the backward equation in terms of the transition probabilities G. For s > t,

f(y, t) = E_{y,t}[ f(X(s), s) ] ,

which is an expression of the tower property. The expected value on the right may be evaluated using the transition probability density for X(s). The result is

f(y, t) = ∫_{−∞}^{∞} G(y, x, t, s) f(x, s) dx . (41)


For this to hold, G must satisfy the backward equation as a function of y and t (which were parameters in (38)). To show this, we apply the backward equation “operator” (see below for terminology) ∂_t + a(y, t) ∂_y + (1/2) σ^2(y, t) ∂_y^2 to both sides.

The left side gives zero because f satisfies the backward equation. Therefore we find that

0 = ∫ ( ∂_t + a(y, t) ∂_y + (1/2) σ^2(y, t) ∂_y^2 ) G(y, x, t, s) f(x, s) dx

for any f(x, s). Therefore, we conclude that

∂_t G(y, x, t, s) + a(y, t) ∂_y G(y, x, t, s) + (1/2) σ^2(y, t) ∂_y^2 G(y, x, t, s) = 0 . (42)

Here x and s are parameters. The final condition for (42) is the same as (39).The equality s = t represents the initial time for s and the final time for tbecause G is defined for all t ≤ s.

2.13. The generator: The generator of an Ito process is the operator containing the spatial part of the backward equation¹³

L(t) = a(x, t) ∂_x + (1/2) σ^2(x, t) ∂_x^2 .

The backward equation is ∂_t f(x, t) + L(t)f(x, t) = 0. We write just L when a and σ do not depend on t. For a general continuous time Markov process, the generator is defined by the requirement that

(d/dt) E[g(X(t), t)] = E[ (L(t)g)(X(t), t) + g_t(X(t), t) ] , (43)

for a sufficiently rich (dense) family of functions g. This applies not only to Ito processes (diffusions), but also to jump diffusions, continuous time birth/death processes, continuous time Markov chains, etc. Part of the requirement is that the limit defining the derivative on the left side should exist. Proving (43) for an Ito process is more or less what we did when we derived the backward equation. On the other hand, if we know (43) we can derive the backward equation by requiring that (d/dt) E[f(X(t), t)] = 0.

2.14. Adjoint: The adjoint of L is another operator that we call L*. It is defined in terms of the inner product

⟨u, f⟩ = ∫_{−∞}^{∞} u(x) f(x) dx .

We leave out the t variable for the moment. If u and f are complex, we takethe complex conjugate of u above. The adjoint is defined by the requirementthat for general u and f ,

〈u, Lf〉 = 〈L∗u, f〉 .

13Some people include the time derivative in the definition of the generator. Watch for this.

17

Page 103: Lecture Notes on Stochastic Calculus (NYU)

In practice, this boils down to the same integration by parts we used to derivethe forward equation from the backward equation:

〈u, Lf〉 =∫ ∞

−∞u(x)

(a(x)∂xf(x) + 1

2σ2(x)∂2xf(x)

)dx

=∫ ∞

−∞

(−∂x(a(x)u(x)) + 1

2∂2x(σ2(x)u(x))

)f(x)dx .

Putting the t dependence back in, we find the “action” of L∗ on u to be

(L(t)∗u)(x, t) = −∂x(a(x, t)u(x, t)) + 12∂2

x(σ2(x, t)u(x, t)) .

The forward equation (37) then may be written

∂tu = L(t)∗u .

All we have done here is define notation (L∗) and show how our previous deriva-tion of the forward equation is expressed in terms of it.

2.15. Adjoints and the Green’s function: Let us summarize and record whatwe have said about the transition probability density G(y, x, t, s). It is definedfor s ≥ t and has G(x, y, t, t) = δ(x − y). It moves probabilities forward byintegrating over y (38) and moves expected values backward by integrating overx ??). As a function of x and s it satisfies the forward equation

∂sG(y, x, t, s) = (L∗x(t)G)(y, x, t, s) .

We write L∗x to indicate that the derivatives in L∗ are with respect to the xvariable:

(L∗x(t)G)(y, x, t, s) = −∂x(a(x, t)G(y, x, t, s)) + 12∂2

x(σ2(x, t)G(y, x, t, s)) .

As a function of y and t it satisfies the backward equation

∂tG(y, x, t, s) + (Ly(t)G)(y, x, t, s) = 0 .

3 Properties of the solutions

3.1. Introduction: The next few paragraphs describe some properties ofsolutions of backward and forward equations. For Brownian motion, f and uhave every property because the forward and backward equations are essentiallythe same. Here f has some and u has others.

3.2. Backward equation maximum principle: The backward equation has amaximum principle

maxx

f(x, t) ≤ maxy

f(y, s) for s > t. (44)

18

Page 104: Lecture Notes on Stochastic Calculus (NYU)

This follows immediately from the representation

f(x, t) = Ex,t[f(X(s), s)] .

The expected value of f(X(s), s) cannot be larger than its maximum value.Since this holds for every x, it holds in particular for the maximizer.

There is a more complicated proof of the maximum principle that uses thebackward equation. I give a slightly naive explination to avoid taking too longwith it. Let m(t) = maxx f(x, t). We are trying to show that m(t) neverincreases. If, on the contrary, m(t) does increase as t decreases, there must bea t∗ with dm

dt (t∗) = α < 0. Choose x∗ so that f(x∗, t∗) = maxx f(x, t∗). Thenfx(x∗, t∗) = 0 and fxx(x∗, t∗) ≤ 0. The backward equation then implies thatft(x∗, t∗) ≥ 0 (because σ2 ≥ 0), which contradicts ft(x∗, t∗) ≤ α < 0.

The PDE proof of the maximum principle shows that the coefficients a andσ2 have to be outside the derivatives in the backward equation. Our argumentthat Lf ≤ 0 at a maximum where fx = 0 and fxx ≤ 0 would be wrong if we had,say, ∂x(a(x)f(x, t)) rather than a(x)∂xf(x, t). We could get a non zero valuebecause of variation in a(x) even when f was constant. The forward equationdoes not have a maximum principle for this reason. Both the Ornstein Uhlen-beck and geometric Brownian motion problems have cases where maxx u(x, t)increases in forward time or backward time.

3.3. Conservation of probability: The probability density has∫∞−∞ u(x, t)dx =

1. We can see thatd

dt

∫ ∞

−∞u(x, t)dx = 0

also from the forward equation (37). We simply differentiate under the integral,substitute from the equation, and integrate the resulting x derivatives. For thisit is crucial that the coefficients a and σ2 be inside the derivatives. Almost anyexample with a(x, t) or σ(x, t) not independent of x will show that

d

dt

∫ ∞

−∞f(x, t)dx 6= 0 .

3.4. Martingale property: If there is no drift, a(x, t) = 0, then X(t) is amartingale. In particular, E[X(t)] is independent of t. This too follows fromthe forward equation (37). There will be no boundary contributions in theintegrations by parts.

d

dtE[X(t)] =

d

dt

∫ ∞

−∞xu(x, t)dx

=∫ ∞

−∞xut(x, t)

=∫ ∞

−∞x1

2∂2x(σ2(x, t)u(x, t))dx

19

Page 105: Lecture Notes on Stochastic Calculus (NYU)

= −∫ ∞

−∞

12∂x(σ2(x, t)u(x, t))dx

= 0 .

This would not be true for the backward equation form 12σ2(x, t)∂2

xf(x, t) or evenfor the mixed form we get from the Stratonovich calculus 1

2∂x(σ2(x, t)∂xf(x, t)).The mixed Stratonovich form conserves probability but not expected value.

3.5. Drift and advection: If there is no drift then the SDE (23) becomes theordinary differential equation (ODE)

dx

dt= a(x, t) . (45)

If x(t) is a solution, then clearly the expected payout should satisfy f(x(t), t) =f(x(s), s), if nothing is random then the expected value is the value. It is easyto check using the backward equation that f(x(t), t) is independent of t if x(t)satisfies (45) and σ = 0:

d

dtf(x(t), t) = ft(x(t), t) +

dx

dtfx(x(t), t) = ft(x(t), t) + a(x(t), t)fx(x(t), t) = 0 .

Advection is the process of being carried by the wind. If there is no diffusion,then the values of f are simply advected by the drift. The term drift impliesthat this advection process is slow and gentle. If σ is small but not zero, thenf may be essentially advected with a little spreading or smearing induced bydiffusion. Computing drift dominated solutions can be more challenging thancomputing diffusion dominated ones.

The probability density does not have u(x(t), t) a constant (try it in theforward equation). There is a conservation of probability correction to this thatyou can find if you are interested.

20

Page 106: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus Notes, Lecture 8Last modified December 14, 2004

1 Path space measures and change of measure

1.1. Introduction: We turn to a closer study of the probability measureson path space that represent solutions of stochastic differential equations. Wedo not have exact formulas for the probability densities, but there are approxi-mate formulas that generalize the ones we used to derive the Feynman integral(not the Feynman Kac formula). In particular, these allow us to compare themeasures for different SDEs so that we may use solutions of one to represent ex-pected values of another. This is the Cameron Martin Girsanov formula. Thesechanges of measure have many applications, including importance sampling inMonte Carlo and change of measure in finance.

1.2. Importance sampling: Importance sampling is a technique that canmake Monte Carlo computations more accurate. In the simplest version, wehave a random variable, X, with probability density u(x). We want to estimateA = Eu[φ(X)]. Here and below, we write EP [·] to represent expecation usingthe P measure. To estimate A, we generate N (a large number) independentsamples from the population u. That is, we generate random variables Xk fork = 1, . . . , N that are independent and have probability density u. Then weestimate A using

A ≈ Au =1N

N∑k=1

φ(Xk) . (1)

The estimate is unbiased because the bias, A − Eu[Au], is zero. The error isdetermined by the variance var(Au) = 1

N varu(φ(X)).Let v(x) be another probability density so that v(x) 6= 0 for all x with

u(x) 6= 0. Then clearly

A =∫

φ(x)u(x)dx =∫

φ(x)u(x)v(x)

v(x)dx .

We express this as

A = Eu[φ(X)] = Ev[φ(X)L(X)] , where L(x) =u(x)v(x)

. (2)

The ratio L(x) is called the score function in Monte Carlo, the likelihood ratioin statistics, and the Radon Nikodym derivative by mathematicians. We get adifferent unbiased estimate of A by generating N independent samples of v andtaking

A ≈ Av =1N

N∑k=1

φ(Xk)L(Xk) . (3)

1

Page 107: Lecture Notes on Stochastic Calculus (NYU)

The accuracy of (3) is determined by

varv(φ(X)L(X)) = Ev[(φ(X)L(X)−A)2] =∫

(φ(x)L(x)−A)2v(x)dx .

The goal is to improve the Monte Carlo accuracy by getting var(Av) <<

var(Au).

1.3. A rare event example: Importance sampling is particularly helpfulin estimating probabilities of rare events. As a simple example, consider theproblem of estimating P (X > a) (corresponding to φ(x) = 1x>a) when X ∼N (0, 1) is a standard normal random variable and a is large. The naive MonteCarlo method would be to generate N sample standard normals, Xk, and take

Xk ∼ N (0, 1), k = 1, · · · , N ,

A = P (X > a) ≈ Au =1N

# Xk > a =1N

∑Xk>a

1 . (4)

For large a, the hits, Xk > a, would be a small fraction of the samples, with therest being wasted.

One importance sampling strategy uses v corresponding to N (a, 1). Itseems natural to try to increase the number of hits by moving the mean from0 to a. Since most hits are close to a, it would be a mistake to move themean farther than a. Using the probability densities u(x) = 1√

2πe−x2/2 and

v(x) = 1√2π

e−(x−a)2/2, we find L(x) = u(x)/v(x) = ea2/2e−ax. The importancesampling estimate is

Xk ∼ N (a, 1), k = 1, · · · , N ,

A ≈ Av =1N

ea2/2∑

Xk>a

e−aXk .

Some calculations show that the variance of Av is smaller than the varianceof of the naive estimator Au by a factor of roughly e−a2/2. A simple way togenerate N (a, 1) random variables is to start with mean zero standard normalsYk ∼ N (0, 1) and add a: Xk = Yk + a. In this form, ea2/2e−aXk = e−a2/2e−aYk ,and Xk > a, is the same as Yk > 0, so the variance reduced estimator becomes

Yk ∼ N (0, 1), k = 1, · · · , N ,

A ≈ Av = e−a2/2 1N

∑Yk>0

e−aYk . (5)

The naive Monte Carlo method (4) produces a small A by getting a smallnumber of hits in many samples. The importance sampling method (5) getsroughly 50% hits but discounts each hit by a factor of at least e−a2/2 to get thesame expected value as the naive estimator.

2

Page 108: Lecture Notes on Stochastic Calculus (NYU)

1.4. Radon Nikodym derivative: Suppose Ω is a measure space withσ−algebra F and probability measures P and Q. We say that L(ω) is theRadon Nikodym derivative of P with respect to Q if dP (ω) = L(ω)dQ(ω), or,more formally, ∫

Ω

V (ω)dP (ω) =∫

Ω

V (ω)L(ω)dQ(ω) ,

which is to sayEP [V ] = EQ[V L] , (6)

for any V , say, with EP [|V |] < ∞. People often write L = dPdQ , and call it

the Radon Nikodym derivative of P with respect to Q. If we know L, then theright side of (6) offers a different and possibly better way to estimate EP [V ].Our goal will be a formula for L when P and Q are measures corresponding todifferent SDEs.

1.5. Absolute continuity: One obstacle to finding L is that it may not exist.If A is an event with P (A) > 0 but Q(A) = 0, L cannot exist because theformula (6) would become

P (A) =∫

A

dP (ω) =∫

Ω

1A(ω)dP (ω) =∫

Ω

1A(ω)L(ω)dQ(ω) .

Looking back at our definition of the abstract integral, we see that if the eventA = f(ω) 6= 0 has Q(A) = 0, then all the approximations to

∫f(ω)dQ(ω) are

zero, so∫

f(ω)dQ(ω) = 0.We say that measure P is absolutely continuous with respect to Q if P (A) =

0 ⇒ Q(A) = 0 for every1 A ∈ F . We just showed that L cannot exist unlessP is absolutely continuous with respect to Q. On the other hand, the RadonNikodym theorem states that an L satisfying (6) does exist if P is absolutelycontinuous with respect to Q.

In practical examples, if P is not absolutely continuous with respect to Q,then P and Q are completely singular with respect to each other. This meansthat there is an event, A ∈ F with P (A) = 1 and Q(A) = 0.

1.6. Discrete probability: In discrete probability, with a finite or countablestate space, P is absolutely continuous with respect to Q if and only if P (ω) > 0whenever Q(x) > 0. In that case, L(ω) = P (ω)/Q(ω). If P and Q representMarkov chains on a discrete state space, then P is not absolutely continuouswith respect to Q if the transition matrix for P (also called P ) allows transitionsthat are not allowed in Q.

1.7. Finite dimensional spaces: If Ω = Rn and the probability measures aregiven by densities, then P may fail to be absolutely continuous with respect to

1This assumes that measures P and Q are defined on the same σ−algebra. It is usefulfor this reason always to use the algebra of Borel sets. It is common to imagine completinga measure by adding to F all subsets of events with P (A) = 0. It may seem better to havemore measurable events, it makes the change of measure discussions more complicated.

3

Page 109: Lecture Notes on Stochastic Calculus (NYU)

Q if the densities are different from zero in different places. An example withn = 1 is P corresponding to a negative exponential random variable u(x) = ex

for x ≤ 0 and u(x) = 0 for x > 0, while Q corresponds to a positive exponentialv(x) = e−x for x ≥ 0 and v(x) = 0 for x < 0.

Another way to get singular probability measures is to have measures using δfunctions concentrated on lower dimensional sets. An example with Ω = R2 hasQ saying that X1 and X2 are independent standard normals while P says thatX1 = X2. The probability “density” for P is u(x1, x2) = 1√

2πe−x2

1/2δ(x2 − x1).The event A = X1 = X2 has Q probability zero but P probability one.

1.8. Testing for singularity: It sometimes helps to think of complete singu-larity of measures in the following way. Suppose we learn the outcome, ω andwe try to determine which probability measure produced it. If there is a setA with P (A) = 1 and Q(A) = 0, then we report P if ω ∈ A and Q if ω /∈ A.We will be correct 100% of the time. Conversely, if there is a way to determinewhether P of Q was used to generate ω, then let A be the set of outcomes thatyou say came from P and you have P (A) = 1 because you always are correctin saying P if ω came from P . Also Q(A) = 0 because you never say Q whenω ∈ A.

Common tests involve statistics, i.e. functions of ω. If there is a (measurable)statistic F (ω) with F (ω) = a almost surely with respect to P and F (ω) = b 6= aalmost surely with respect to Q, then we take A = ω ∈ Ω | F (ω) = a and seethat P and Q are completely singular with respect to each other.

1.9. Coin tossing: In common situations where this works, the function F (ω)is a limit that exists almost surely (but with different values) for both P and Q.If limn→∞ Fn(ω) = a almost surely with respect to P and limn→∞ Fn(ω) = balmost surely with respect to Q, then P and Q are completely singular.

Suppose we make an infinite sequence of coin tosses with the tosses beingindependent and having the same probability of heads. We describe this bytaking ω to be infinite sequences ω = (Y1, Y2, . . .), where the kth toss Yk equalsone or zero, and the Yk are independent. Let the measure P represent tossingwith Yk = 1 with probability p, and Q represent tossing with Yk = 1 withprobability q 6= p. Let Fn(ω) = 1

n

∑nk=1 Yk. The (Kolmogorov strong) law of

large numbers states that Fn → p as n → ∞ almost surely in P and Fn →q as n → ∞ almost surely in Q. This shows that P and Q are completelysingular with respect to each other. Note that this is not an example of discreteprobability in our sense because the state space consists of infinite sequences.The set of infinite sequences is not countable (a theorem of Cantor).

1.10. The Cameron Martin formula: The Cameron Martin formula relatesthe measure, P , for Brownian motion with drift to the Wiener measure, W , forstandard Brownian motion without drift. Wiener measure describes the process

dX(t) = dB(t) . (7)

4

Page 110: Lecture Notes on Stochastic Calculus (NYU)

The P measure describes solutions of the SDE

dX(t) = a(X(t), t)dt + dB(t) . (8)

For definiteness, suppose X(0) = x0 is specified in both cases.

1.11. Approximate joint probability measures: We find the formula forL(X) = dP (X)/dW (X) by taking a finite ∆t approximation, directly comput-ing L∆t, and observing the limit of L as ∆t → 0. We use our standard notationstk = k∆t, Xk ≈ X(tk), ∆Bk = B(tk+1)− B(tk), and ~X = (X1, . . . , Xn) ∈ Rn.The approximate solution of (8) is

Xk+1 = Xk + ∆ta(Xk, tk) + ∆Bk . (9)

This is exact in the case a = 0. We write V (~x) for the joint density of ~X forW and U(~x) for teh joint density under (9). We calculate L∆t(~x) = U(~x)/V (~x)and observe the limit as ∆t → 0.

To carry this out, we again note that the joint density is the product of thetransition probability densities. For (7), if we know xk, then Xk+1 is normalwith mean xk and variance ∆t. This gives

G(xk, xk+1,∆t) =1√

2π∆te−(xk+1−xk)2/2∆t ,

and

V (~x) =(2π ∆t

)−n/2 exp

(1

2∆t

n−1∑k=0

(xk+1 − kk)2)

. (10)

For (9), the approximation to (8), Xk+1 is normal with mean xk + ∆ta(xk, tk)and variance ∆t. This makes its transition density

G(xk, xk+1,∆t) =1√

2π∆te−(xk+1−xk−∆ta(xk,tk))2/2∆t ,

so that

U(~x) =(2π ∆t

)−n/2 exp

(1

2∆t

n−1∑k=0

(xk+1 − kk −∆ta(xk, tk))2)

. (11)

To calculate the ratio, we expand (using some obvious notation)(∆Xk −∆tak

)2 = ∆x2k − 2∆t∆xk + ∆t2a2

k .

Dividing U by V removes the 2π factors and the ∆x2k in the exponents. What

remains is

L∆t(~x) = U(~x)/V (~x)

= exp

(n−1∑k=0

(a(xk), tk)(xk+1 − xk)− ∆t

2

n−1∑k=0

a(xk), tk)2)

.

5

Page 111: Lecture Notes on Stochastic Calculus (NYU)

The first term in the exponent converges to the Ito integral

n−1∑k=0

(a(xk), tk)(xk+1 − xk) →∫ T

0

a(X(t), t)dX(t) as ∆t → 0,

if tn = max tk < T. The second term converges to the Riemann integral

∆tn−1∑k=0

a(xk), tk)2 →∫ T

0

a2(X(t), t)dt as ∆t → 0.

Altogether, this suggests that if we fix T and let ∆t → 0, then

dP

dW= L(X) = exp

(∫ T

0

a(X(t), t)dX(t)− 12

∫ T

0

a2(X(t), t)dt

). (12)

This is the Cameron Martin formula.

2 Multidimensional diffusions

2.1. Introduction: Some of the most interesting examples, curious phenom-ena, and challenging problems come from diffusion processes with more than onestate variable. The n state variables are arranged into an n dimensional statevector X(t) = (X1(t), . . . , Xn(t))t. We will have a Markov process if the statevector contains all the information about the past that is helpful in predictingthe future. At least in the beginning, the theory of multidimensional diffusionsis a vector and matrix version of the one dimensional theory.

2.2. Strong solutions: The drift now is a drift for each component of X,a(x, t) = (a1(x, t), . . . , an(x, t))t. Each component of a may depend on all com-ponents of X. The σ now is an n × m matrix, where m is the number ofindependent sources of noise. We let B(t) be a column vector of m independentstandard Brownian motion paths, B(t) = (B1(t), . . . , Bm(t))t. The stochasticdifferential equation is

dX(t) = a(X(t), t)dt + σ(X(t), t)dB(t) . (13)

A strong solution is a function X(t, B) that is nonanticipating and satisfies

X(t) = X(0) +∫ t

0

a(X(s), s)ds +∫ t

0

σ(X(s), s)dB(s) .

The middle term on the right is a vector of Riemann integrals whose kth com-ponent is the standard Riemann integral∫ t

0

ak(X(s), s)ds .

6

Page 112: Lecture Notes on Stochastic Calculus (NYU)

The last term on the right is a collection of standard Ito integrals. The kth

component ism∑

j=1

∫ t

0

σkj(X(s), s)dBj(s) ,

with each summand on the right being a scalar Ito integral as defined in previouslectures.

2.3. Weak form: The weak form of a multidimensional diffusion problem asksfor a probability measure, P , on the probability space Ω = C([0, T ], Rn) withfiltration Ft generated by X(s) for s ≤ t so that X(t) is a Markov processwith

E[∆X

∣∣ Ft

]= a(X(t), t)∆t + o(∆t) , (14)

andE[∆X∆Xt

∣∣ Ft

]= µ(X(t), t)∆t + o(∆t) . (15)

Here ∆X = X(t+∆t)−X(t), we assume ∆t > 0, and ∆Xt = (∆X1, . . . ,∆Xn) isthe transpose of the column vector ∆X. The matrix formula (15) is a convenientway to express the short time variances and covariances2

E[∆Xj∆Xk

∣∣ Ft

]= µjk(X(t), t)∆t + o(∆t) . (16)

As for one dimensional diffusions, it is easy to verify that a strong solution of(13) satisfies (14) and (15) with µ = σσt.

2.4. Backward equation: As for one dimensional diffusions, the weak formconditions (14) and (15) give a simple derivation of the backward equation for

f(x, t) = Ex,t [V (X(T ))] .

We start with the tower property in the familiar form

f(x, t) = Ex,t [f(x + ∆X, t + ∆t)] , (17)

and expand f(x+∆X, t+∆t) about (x, t) to second order in ∆X and first orderin ∆t:

f(x + ∆X, t + ∆t) = f + ∂xkf ·∆Xk + 1

2∂xj∂xk

·∆Xj∆Xk + ∂tf ·∆t + R .

Here follow the Einstein summation convention by leaving out the sums over jand k on the right. We also omit arguments of f and its derivatives when thearguments are (x, t). For example, ∂xk

f ·∆Xk really means

n∑k=1

∂xkf(x, t) ·∆Xk .

2The reader should check that the true covariancesE[(∆Xj − E[∆Xj ])(∆Xk − E[∆Xk])

∣∣ Ft

]also satisfy (16) when E

[∆Xj

∣∣ Ft

]= O(∆t).

7

Page 113: Lecture Notes on Stochastic Calculus (NYU)

As in one dimension, the error term R satisfies

|R| ≤ C ·(|∆X|∆t + |∆X|3 + ∆t2

),

so that, as before,E [|R|] ≤ C ·∆t3/2 .

Putting these back into (17) and using (14) and (15) gives (with the sameshorthand)

f = f + ak(x, t)∂xkf∆t + 1

2µjk(x, t)∂xj∂xk

f∆t + ∂tf∆t + o(∆t) .

Again we cancel the f from both sides, divide by ∆t and take ∆t → 0 to get

∂tf + ak(x, t)∂xkf + 1

2µjk(x, t)∂xj ∂xkf = 0 , (18)

which is the backward equation.It sometimes is convenient to rewrite (18) in matrix vector form. For any

function, f , we may consider its gradient to be the row vector 5xf = Dxf =(∂x1f, . . . , ∂xn

f). The middle term on the left of (18) is the product of therow vector Df and the column vector x. We also have the Hessian matrix ofsecond partials (D2f)jk = ∂xj

∂xkf . Any symmertic matrix has a trace tr(M) =∑

k Mkk. The summation convention makes this just tr(M) = Mkk. If A andB are symmetric matrices, then (as the reader should check) tr(AB) = AjkBjk

(with summation convention). With all this, the backward equation may bewritten

∂tf + Dxf · a(x, t) + 12 tr(µ(x, t)D2

xf) = 0 . (19)

2.5. Generating correlated Gaussians: Suppose we observe the solution of(13) and want to reconstruct the matrix σ. A simpler version of this problemis to observe

Y = AZ , (20)

and reconstruct A. Here Z = (Z1, . . . , Zm) ∈ Rm, with Zk ∼ N (0, 1) i.i.d.,is an m dimensional standard normal. If m < n or rank(A) < n then Y is adegenerate Gaussian whose probability “density” (measure) is concentrated onthe subspace of Rn consisting of vectors of the form y = Az for some z ∈ Rm.The problem is to find A knowing the distribution of Y .

2.6. SVD and PCA: The singular value decomposition (SVD) of A is afactorization

A = UΣV t , (21)

where U is an n×n orthogonal matrix (U tU = In×n, the n×n identity matrix),V is an m×m orthogonal matrix (V tV = Im×m), and Σ is an n×m “diagonal”matrix (Σjk = 0 if j 6= k) with nonnegative singular values on the diagonal:Σkk = σk ≥ 0. We assume the singular values are arranged in decreasing orderσ1 ≥ σ2 ≥ · · ·. The singular values also are called principal components and

8

Page 114: Lecture Notes on Stochastic Calculus (NYU)

the SVD is called principal component analysis (PCA). The columns of U andV (not V t) are left and right singular vectors respectively, which also are calledprincipal components or principal component vectors. The calculation

C = AAt = (UΣV t)(V ΣtU t) = UΣΣtU t

shows that the diagonal n × n matrix Λ = ΣΣt contains the eigenvalues ofC = AAt, which are real and nonnegative because C is symmetric and positivesemidefinite. Therefore, left singular vectors, the columns of C, are the eigen-vectors of the symmetric matrix C. The singular values are the nonnegativesquare roots of the eigenvalues of C: σk =

√λk. Thus, the singular values and

left singular vectors are determined by C. In a similar way, the right singularvectors are the eigenvectors of the m ×m positive semidefinite matrix AtA. Ifn > m, then the σk are defined only for k ≥ m (there is no Σm+1,m+1 in then×m matrix Σ). Since the rank of C is at most m in this case, we have λk = 0for k > m. Even when n = m, A may be rank deficient. The rank of A being lis the same as σk = 0 for k > l. When m > n, the rank of A is at most n.

2.7. The SVD and nonuniqueness of A: Because Y = AZ is Gaussianwith mean zero, its distribution is determined by its covariance C = E[Y Y t] =E[AZZtAt] = AE[ZZt]At = AAt. This means that the distribution of Adetermines U and Σ but not V . We can see this directly by plugging (21) into(20) to get

Y = UΣ(V tZ) = UΣZ ′ , where Z ′ = V tZ .

Since Z ′ is a mean zero Gaussian with covariance V tV = I, Z ′ has the samedistribution as Z, which means that Y ′ = UΣZ has the same distribution as Y .Furthermore, if A has rank l < m, then we will have σk = 0 for k > l and weneed not bother with the Z ′k for k > l. That is, for generating Y , we never needto take m > n or m > rank(A).

For a simpler point of view, suppose we are given C and want to generateY ∼ N (0, C) in the form Y = AZ with Z ∼ N (0, I). The condition is thatC = AAt. This is a sort of square root of C. One solution is A = UΣ as above.Another solution is the Choleski decomposition of C: C = LLt for a lowertriangular matrix L. This is most often done in practice because the Choleskidecomposition is easier to compute than the SVD. Any A that works has thesame U and Σ in its SVD.

2.8. Choosing σ(x, t): This non uniqueness of A carries over to non unique-ness of σ(x, t) in the SDE (13). A diffusion process X(t) defines µ(x, t) through(15), but any σ(x, t) with σσt = µ leads to the same distribution of X trajec-tories. In particular, if we have one σ(x, t), we may choose any adapted matrixvalued function V (t) with V V t ≡ Im×m, and use σ′ = σV . To say this anotherway, if we solve dZ ′ = V (t)dZ(t) with Z ′(0) = 0, then Z ′(t) also is a Brownianmotion. (The Levi uniqueness theorem states that any continuous path processthat is weakly Brownian motion in the sense that a ≡ 0 and µ ≡ I in (14) and(15) actually is Brownian motion in the sense that the measure on Ω is Wiener

9

Page 115: Lecture Notes on Stochastic Calculus (NYU)

measure.) Therefore, using dZ ′ = V (t)dZ gives the same measure on the spaceof paths X(t).

The conclusion is that it is possible for SDEs wtih different σ(x, t) to repre-sent the same X distribution. This happens when σσt = σ′σ′ t. If we have µ, wemay represent the process X(t) as the strong solution of an SDE (13). For this,we must choose with some arbtirariness a σ(x, t) with σ(x, t)σ(x, t)t = µ(x, t).The number of noise sources, m, is the number of non zero eigenvalues of µ. Wenever need to take m > n, but m < n may be called for if µ has rank less thann.

2.9. Correlated Brownian motions: Sometimes we wish to use the SDE model(13) where the Bk(t) are correlated. We can accomplish this with a change in σ.Let us see how to do this in the simpler case of generating correlated standardnormals. In that case, we want Z = (Z1, . . . , Zm)t ∈ Rm to be a multivariatemean zero normal with var(Zk) = 1 and given correlation coefficients

ρjk =cov(Zj , Zk)√var(Zj)var(Zk)

= cov(Zj , Zk) .

This is the same as generating Z with covariance matrix C with ones on thediagonal and Cjk = ρjk when j 6= k. We know how to do this: choose A withAAt = C and take Z = AZ ′. This also works in the SDE. We solve

dX(t) = a(X(t), t)dt + σ(X(t), t)AdB(t) ,

with the Bk being independent standard Brownian motions. We get the effectof correlated Brownian motions by using independent ones and replacing σ(x, t)by σ(x, t)A.

2.10. Normal copulas (a digression): Suppose we have a probability den-sity u(y) for a scalar random variable Y . We often want to generate familiesY1, . . . , Ym so that each Yk has the density u(y) but different Yk are correlated.A favorite heuristic for doing this3 is the normal copula. Let U(y) = P (Y < y)be the cumulative distribution function (CDF) for Y . Then the Yk will havedensity u(y) if and only if U(Yk) − Tk and the Tk are uniformly distributed inthe interval [0, 1] (check this). In turn, the Tk are uniformly distributed in [0, 1]if Tk = N(Zk) where the Zk are standard normals and N(z) is the standardnormal CDF. Now, rather than generating independent Zk, we may use corre-lated ones as above. This in turn leads to correlated Tk and correlated Yk. I donot know how to determine the Z correlations in order to get a specified set ofY correlations.

2.11. Degenerate diffusions: Many practical applications have fewer sourcesof noise than state variables. In the strong form (13) this is expressed as m < nor m = n and det(σ) = 0. In the weak form µ is always n × n but it may be

3I hope this goes out of fashion in favor of more thoughtful methods that postulate somemechanism for the correlations.

10

Page 116: Lecture Notes on Stochastic Calculus (NYU)

rank deficient. In either case we call the stochastic process a degenerate diffu-sion. Nondegenerate diffusions have qualitative behavior like that of Brownianmotion: every component has infinite total variation and finite quadratic varia-tion, transition densities are smooth functions of x and t (for t > 0) and satisfyforward and backward equations (in different variables) in the usual sense, etc.Degenerate diffusions may lack some or all of these properties. The qualitativebehavior of degenerate diffusions is subtle and problem dependent. There aresome examples in the homework. Computational methods that work well fornondegenerate diffusions may fail for degenerate ones.

2.12. A degenerate diffusion for Asian options: An Asian option gives apayout that depends on some kind of time average of the price of the under-lying security. The simplest form would have th eunderlier being a geometricBrownian motion in the risk neutral measure

dS(t) = rS(t)dt + σS(t)dB(t) , (22)

and a payout that depends on∫ T

0S(t)dt. This leads us to evaluate

E [V (Y (T ))] ,

where

Y (T ) =∫ T

0

S(t)dt .

To get a backward equation for this, we need to identify a state space sothat the state is a Markov process. We use the two dimensional vector

X(t) =(

S(t)Y (t)

),

where S(t) satisfies (22) and dY (t) = S(t)dt. Then X(t) satisfies (13) with

a =(

rSS

),

and m = 1 < n = 2 and (with the usual double meaning of σ)

σ =(

Sσ0

).

For the backward equation we have

µ = σσt =(

S2σ2 00 0

),

so the backward equation is

∂tf + rs∂sf + s∂yf +s2σ2

2∂2

sf = 0 . (23)

11

Page 117: Lecture Notes on Stochastic Calculus (NYU)

Note that this is a partial differential equation in two “space variables”,x = (s, y)t. Of course, we are interested in the answer at t = 0 only for y = 0.Still, we have include other y values in the computation. If we were to try thestandard finite difference approximate solution of (23) we might use a centraldifference approximation ∂yf(s, y, t) ≈ 1

2∆y (f(s, y + ∆y, t) − f(s, y − ∆y, t)).If σ > 0 it is fine to use a central difference approximation for ∂sf , and thisis what most people do. However, a central difference approximation for ∂yfleads to an unstable computation that does not produce anything like the rightanswer. The inherent instability of centeral differencing is masked in s by thestrongly stabilizing second derivative term, but there is nothing to stabalize theunstable y differencing in this degenerate diffusion problem.

2.13. Integration with dX: We seek the anologue of the Ito integral andIto’s lemma for a more general diffusion. If we have a function f(x, t), we seeka formula df = adt + bdX. This would mean that

f(X(T ), T ) = f(X(0), 0) +∫ T

0

a(t)dt +∫ T

0

b(t)dX(t) . (24)

The first integral on the right would be a Riemann integral that would be definedfor any continuous function a(t). The second would be like the Ito integral withBrownian motion, whose definition depends on b(t) being an adapted process.The definition of the dX Ito integral should be so that Ito’s lemma becomestrue.

For small ∆t we seek to approximate ∆f = f(X(t+∆t), t+∆t)−f(X(t), t).If this follows the usual pattern (partial justification below), we should expandto second order in ∆X and first order in ∆t. This gives (wth summation con-vention)

∆f ≈ (∂xj f)∆Xj + 12 (∂xj

∂xkf)∆Xj∆Xk + ∂tf∆t . (25)

As with the Ito lemma for Brownian motion, the key idea is to replace theproducts ∆Xj∆Xk by their expected values (conditional on Ft). If this is true,(15) suggests the general Ito lemma

df = (∂xjf)dXk +

(12 (∂xj

∂xkf)µjk + ∂tf

)dt , (26)

where all quantities are evaluated at (X(t), t).

2.14. Ito’s rule: One often finds this expressed in a slightly different way. Asimpler way to represent the small time variance condition (15) is

E [dXjdXk] = µjk(X(t), t)dt .

(Though it probably should be E[dXjdXk

∣∣ Ft

].) Then (26) becomes

df = (∂xj f)dXk + 12 (∂xj ∂xk

f)E[dXjdXk] + ∂tfdt .

This has the advantage of displaying the main idea, which is that the fluctuationsin dXj are important but only the mean values of dX2 are important, not the

12

Page 118: Lecture Notes on Stochastic Calculus (NYU)

fluctuations. Ito’s rule (never enumciated by Ito as far as I know) is the formula

dXjdXk = µjkdt . (27)

Although this leads to the correct formula (26), it is not structly true, since thestandard defiation of the left side is as large as its mean.

In the derivation of (26) sketched below, the total change in f is representedas the sum of many small increments. As with the law of large numbers, thesum of many random numbers can be much closer to its mean (in relative terms)than the random summands.

2.15. Ito integral: The definition of the dX Ito integral follows the definitionof the Ito integral with respect to Brownian motion. Here is a quick sketchwith many details missing. Suppose X(t) is a multidimensional diffusion pro-cess, Ft is the σ−algebra generated by the X(s) for 0 ≤ s ≤ t, and b(t) is apossibly random function that is adapted to Ft. There are n components of b(t)corresponding to the n components of X(t). The Ito integral is (tk = k∆t asusual): ∫ T

0

b(t)dX(t) = lim∆t→0

∑tk<T

b(tk) (X(tk+1)−X(tk)) . (28)

This definition makes sense because the limit exists (almost surely) for a richenough family of integrands b(t). Let Y∆t =

∑tk<T b(tk) (X(tk+1)−X(tk)) and

write (for appropriately chosen T )

Y∆t/2 − Y∆t =∑

tk<T

Rk ,

whereRk =

(b(tk+1/2)− b(tk)

)(X(tk+1)−X(tk+1/2)

).

The boundE[(

Y∆t/2 − Y∆t

)2] = O(∆tp) , (29)

implies that the limit (28) exists almost surely if ∆tl = 2−l.As in the Brownian motion case, we assume that b(t) has the (lack of)

smoothness of Brownian motion: E[(b(t + ∆t) − b(t))2] = O(∆t). In the mar-tingale case (drift = a ≡ 0 in (14)), E[RjRk] = 0 if j 6= k. In evaluating E[R2

k],we get from (15) that

E[∣∣X(tk+1)−X(tk+1/2)

∣∣2 ∣∣ Ftk+1/2

]= O(∆t) .

Since b(tt+1/2) is known in Ftk+1/2 , we may use the tower property and ourassumption on b to get

E[R2k] ≤ E

[∣∣X(tk+1)−X(tk+1/2)∣∣2 ∣∣b(tk+1/2)− b(t)

∣∣2] = O(∆t2) .

13

Page 119: Lecture Notes on Stochastic Calculus (NYU)

This gives (29) with p = 1 (as for Brownian motion) for that case. For thegeneral case, my best effort is too complicated for these notes and gives (29)with p = 1/2.

2.16. Ito’s lemma: We give a half sketch of the proof of Ito’s lemma fordiffusions. We want to use k to represent the time index (as in tk = k∆t) sowe replace the index notation above with vector notation: ∂xf∆X instead of∂xk

∆Xk, ∂2x(∆Xk,∆Xk) instead of (∂xj

∂xkf)∆Xj∆Xk, and tr(∂2

xfµ) insteadof (∂xj

∂xkf)µjk. Then ∆Xk will be the vector X(tk+1) −X(tk) and ∂2

xfk then× n matrix of second partial derivatives of f evaluated at (X(tk), tk), etc.

Now it is easy to see who f(X(T ), T ) − f(X(0), 0) =∑

tk<T ∆Fk is givenby the Riemann and Ito integrals of the right side of (26). We have

∆fk = ∂tfk∆t + ∂xfk∆Xk + 12∂2

xfk(∆Xk,∆Xk)

+ O(∆t2) + O (∆t |∆Xk|) + O(∣∣∆X3

k

∣∣) .

As ∆t → 0, the contribution from the second row terms vanishes (the thirdterm takes some work, see below). The sum of the ∂tfk∆t converges to theRiemann integral

∫ T

0∂tf(X(t), t)dt. The sum of the ∂xfk∆Xk converges to the

Ito integral∫ T

0∂xf(X(t), t)dX(t). The remaining term may be written as

∂2xfk(∆Xk,∆Xk) = E

[∂2

xfk(∆Xk,∆Xk)∣∣ Ftk

]+ Uk .

It can be shown that

E[|Uk|2

∣∣ Ftk

]≤ CE

[|∆Xk|4

∣∣ Ftk

]≤ C∆t2 ,

as it is for Brownian motion. This shows (with E[UjUk] = 0) that

E

∣∣∣∣∣∑tk<T

Uk

∣∣∣∣∣2 =

∑tk<T

E[|Uk|2

]≤ CT∆t ,

so∑

tk<T Uk → 0 as ∆t → 0 almost surely (with ∆t = 2−l). Finally, the smalltime variance formula (15) gives

E[∂2

xfk(∆Xk,∆Xk)∣∣ Ftk

]= tr

(∂2

xfkµk

)+ o(∆t) ,

so ∑tk<T

E[∂2

xfk(∆Xk,∆Xk)∣∣ Ftk

]→∫ T

0

tr(∂2

xf(X(t), t)µ(X(t), t))dt ,

(the Riemann integral) as ∆t → 0. This shows how the terms in the Ito lemma(26) are accounted for.

2.17. Theory left out: We did not show that there is a process satisfying (14)and (15) (existence) or that these conditions characterize the process (unique-ness). Even showing that a process satisfying (14) and (15) with zero drift and

14

Page 120: Lecture Notes on Stochastic Calculus (NYU)

µ = I is Brownian motion is a real theorem: the Levi uniqueness theorem.The construction of the stochastic process X(t) (existence) also gives boundson higher moments, such as E

[|∆X|4

]≤ C · ∆t2, that we used above. The

higher moment estimates are true for Brownian motion because the incrementsare Gaussian.

2.18. Approximating diffusions: The formula strong form formulation of thediffusion problem (13) suggests a way to generate approximate diffusion paths.If Xk is the approximation to X(tk) we can use

Xk+1 = Xk + a(Xk, tk)∆t + σ(Xk, tk)√

∆tZk , (30)

where the Zk are i.i.d. N (0, Im×m). This has the properties corresponding to(14) and (15) that

E[Xk+1 −Xk

∣∣ X1, · · · , Xk

]= a(Xk, tk)∆t

andcov(Xk+1 −Xk) = µ∆t .

This is the forward Euler method. There are methods that are better in someways, but in a surprising large number of problems, methods better than thisare not known. This is a distinct contrast to numerical solution of ordinarydifferential equations (without noise), for which forward Euler almost never isthe method of choice. There is much research do to to help the SDE solutionmethodology catch up to the ODE solution methodology.

2.19. Drift change of measure:The anologue of the Cameron Martin formula for general diffusions is the

Girsanov formula. We derive it by writing the joint densities for the discretetime processes (30) with and without the drift term a. As usual, this is aproduct of transition probabilities, the conditional probability densities for Xk+1

conditional on knowing Xj for j ≤ k. Actually, because (30) is a Markovprocess, the conditional densityh for Xk+1 depends on Xk only. We write itG(xk, xk+1, tk,∆t). Conditional on Xk, Xk+1 is a multivariate normal withcovariance matrix µ(Xk, tk)∆t. If a ≡ 0, the mean is Xk. Otherwise, the meanis Xk + a(Xk, tk)∆t. We write µk and ak for µ(Xk, tk) and a(Xk, tk).

Without drift, the Gaussian transition density is

G(xk, xk+1, tk,∆t) =1

(2π)n/2√

det(µk)exp

(−(xk+1 − xk)tµ−1

k (xk+1 − xk)2∆t

)(31)

With nonzero drift, the prefactor

zk =1

(2π)n/2√

det(µk)

15

Page 121: Lecture Notes on Stochastic Calculus (NYU)

is the same and the exponential factor accomodates the new mean:

G(xk, xk+1, tk,∆t) = zk exp(−(xk+1 − xk − ak∆t)tµ−1

k (xk+1 − xk − ak∆t)2∆t

).

(32)Let U(x1, . . . , xN ) be the joint density without drift and U(x1, . . . , xN ) withdrift. We want to evaluate L(~x) = V (~x)/U(~x) Both U and V are products ofthe appropriate transitions densities G. In the division, the prefactors zk cancel,as they are the same for U and V because the µk are the same.

The main calculation is the subtraction of the exponents:

(∆xk−ak∆t)tµ−1k (∆xk−ak∆t)−∆xt

kµ−1∆xk = −2∆tatkµ−1

k ∆xk+∆t2atkµ−1

k ak .

This gives:

L(~x) = exp

(N−1∑k=0

atkµ−1

k ∆xk +∆t

2

N−1∑k=0

atkµ−1

k ak

).

This is the exact likelihood ratio for the discrete time processes without drift.If we take the limit ∆t → 0 for the continuous time problem, the two terms inthe exponent converge respectively to the Ito integral∫ T

0

a(X(t), t)tµ(X(t), t)−1dX(t) ,

and the Riemann integral∫ T

0

12a(X(t), t)tµ(X(t), t)−1a(X(t), t)dt .

The result is the Girsanov formula

dP

dQ= L(X)

= exp

(∫ T

0

a(X(t), t)tµ(X(t), t)−1dX(t)−∫ T

0

12a(X(t), t)tµ(X(t), t)−1a(X(t), t)dt

).(33)

16

Page 122: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 1.

Given Summer 2004, due September 9, the first day of class. The course web page hashints for reviewing or filling in missing background.Last revised, May 26.

Objective: Review of Basic Probability.

1. We have a container with 300 red balls and 600 blue balls. We mix the balls well andchoose one at random, with each ball being equally likely to be chosen. After eachchoice, we return the chosen ball to the container and mix again.

a. What is the probability that the first n balls chosen are all blue?

b. Let N be the number of blue balls chosen before the first red one. What is theP (N = n)? What are the mean and variance of N . Explain your answers usingthe formulae

∞∑n=0

xn =1

1 − xfor |x| < 1

∞∑n=0

nxn = xd

dx

1

1 − xfor |x| < 1

etc.

c. What is the probability that N = 0 given that N ≤ 2?

d. What is the probability that N is an even number? Count 0 as an even number.

2. A tourist decides between two plays, called “Good” (G) and “Bad” (B). The probabilityof the tourist choosing Good is P (G) = 10%. A tourist choosing Good likes it (L)with 70% probability (P (L | G) = .7) while a tourist choosing Bad dislikes it with 80%probability (P (D | B) = .8).

a. Draw a probability decision tree diagram to illustrate the choices.

b. Calculate P (L), the probability that the tourist liked the play he or she saw.

c. If the tourist liked the play he or she chose, what is the probability that he or shechose Good?

3. A “triangular” random variable, X, has probability density function (PDF) f(x) givenby

f(x) =

2(1 − x) if 0 ≤ x ≤ 1,

0 otherwise.

a. Calculate the mean and variance of X.

1

Page 123: Lecture Notes on Stochastic Calculus (NYU)

b. Suppose X1 and X2 are independent samples (copies) of X and Y = X1 +X2. Thatis to say that X1 and X2 are independent random variables and each has the samedensity f . Find the PDF for Y .

c. Calculate the mean and variance of Y without using the formula for its PDF.

d. Find P (Y > 1).

e. Suppose X1, X2, . . ., X100 are independent samples of X. Estimate Pr(X1 + · · · +X100 > 34) using the central limit theorem. You will need access to standardnormal probabilities either through a table or a calculator or computer program.

4. Suppose X and Y have a joint PDF

f(x, y) =1

4 − x2 − y2 if x2 + y2 ≤ 4,

0 otherwise.

a. Calculate P (X2 + Y 2 ≤ 1).

b. Calculate the marginal PDF for X alone.

c. What is the covariance between X and Y ?

d. Find an event depending on X alone whose probability depends on Y . Use this toshow that X is not independent of Y .

e. Write the joint PDF for U = X2 and V = Y 2.

f. Calculate the covariance between X2 and Y 2. It may be easier to do this withoutusing part e. Use this to show, again, that X and Y are not independent.

2

Page 124: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 2.

Given September 9, due September 23.Last revised, September 10.

Objective: Conditioning and Markov chains.

1. Suppose that F and G are two algebras of sets and that F adds information to G in thesense that any G measurable event is also F measurable: G ⊂ F . Suppose that theprobability space Ω is discrete (finite or countable) and that X(ω) is a variable definedon Ω (that is, a function of the random variable ω). The conditional expectations (inthe modern sense) of X with respect to F and G are Y = E[X | F ] and Z = E[X | G].In each case below, state whether the statement is true or false and explain your answerwith a proof or a counterexample.

a. Z ∈ F .

b. Y ∈ G.

c. Z = E[Y | G].

d. Y = E[Z | F ].

2. Let Ω be a discrete probability space and F a σ−algebra. Let X(ω) be a (function ofa) random variable with E[X2] < ∞. Let Y = E[X | F ]. The variance of X isvar(X) = E[(X −X)2], where X = E[X].

a. Show directly from the (modern) definition of conditional expectation that

E[X2] = E[(X − Y )2] + E[Y 2] . (1)

Note that this equation also could be written

E[X2] = E[(X − E[X | F ])2

]+ E

[(E[X | F ])2

].

.

b. Use this to show that var(X) = var(X − Y ) + var(Y ).

c. If we interpret conditional expectation as an orthogonal projection in a vector space,what theorem about orthogonality does part a represent?

d. We have n independent coin tosses with each equally likely to be H or T. Take Xto be the indicator function of the event that the first toss is H. Take F to bethe algebra generated by the number of H tosses in all. Calculate each of thethree quantities in (1) from scratch and check that the equation holds. Both ofthe terms on the right are easiest to do using the law of total probability, whichis pretty obvious in this case.

1

Page 125: Lecture Notes on Stochastic Calculus (NYU)

3. (Bayesian identification of a Markov chain) We have a state space of size m and twom ×m stochastic matrices, Q, and R. First we pick one of the matrices, choosing Qwith probability f and R with probability 1 − f . Then we use the chosen matrix torun a Markov chain X, starting with X(0) = 1 up to time T .

a. Describe the probability space Ω appropriate for this situation.

b. Let F be the algebra generated by the chain itself, without knowing whether Q orR was chosen. Find a formula for P (Q | F) (which would be P (Q | X = x) inclassical notation). Though this formula might be ugly, it is easy to probram.

4. Suppose we have a 3 state Markov chain with transition matrix

P =

.6 .2 .2.3 .5 .2.1 .2 .7

and suppose that X(0) = 1. For any t > 0, the algebras Ft and Gt are as in the notes,and Ht is the algebra generated by X(s) for t ≤ s ≤ T .

a. Show that the probability distribution of the first t steps conditioned on Gt+1 is thesame as that conditioned on Ht+1. This is a kind of backwards Markov property:a forward Markov chain is a backward Markov chain also.

b. Calculate P (X(3) = 2 | G4). This consists of 3 numbers.

2

Page 126: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 3.

Given September 16, due September 23. Last revised, September 16.Objective: Markov chains, II and lattices.

1. We have a three state Markov chain wih transition matrix

P =

12

14

14

12

12

013

13

13

.

(Some of the transition probabilities are P (1 → 1) = 12, P (3 → 1) = 1

3, and P (1 → 2) =

14. Let τ = min(t | Xt = 3).) Even though this τ is not bounded (it could be arbitrarily

large), we will see that P (τ > t) ≤ Cat for some a < 1 so that the probability of largeτ is very small. This is enough to prevent the stopping time paradox (take my wordfor it). Suppose that at time t = 1, all states are equally likely.

a. Consider the quantities ut(j) = P (Xt = j and τ > t). Find a matrix evolutionequation for a two component vector made from the ut(j) and a submatrix, P , ofP .

b. Solve this equation using the the eigenvectors and eigenvalues of P to find a formulafor mt = P (τ = t).

c. Use the answer of part b to find E[τ ]. It might be helpful to use the formula

∞∑t=1

tP (τ = t) =∞∑

t=1

P (τ ≥ t) .

Verify the formula if you want to use it.

d. Consider the quantities ft(j) = P (τ ≥ t | X1 = j). Find a matrix recurrence forthem.

e. Use the method of question 1 to solve this and find the ft.

2. Let P be the transition matrix for an n state Markov chain. Let v(k) be a function ofthe state k ∈ S. For this problem, suppose that paths in the Markov chain start attime t = 0 rather than t = 1, so that X = (X0, X1, . . .). For any complex number, z,with |z| < 1, consider the sum

E

[ ∞∑t=0

ztv(Xt) | F0

]. (1)

1

Page 127: Lecture Notes on Stochastic Calculus (NYU)

Of course, this is a function of X0 = k, which we call f(k). Find a linear matrixequation for the quantities f that involves P , z, and v. Hint: the sum

E

[ ∞∑t=1

ztv(Xt) | F1

].

may be related to (1) if we take out the common factor, z.

3. (Change of measure) Let P be the probability measure corresponding to ordinary randomwalk with step probabilities a = b = c = 1/3. Let Q be the measure for the randomwalk where the up, stay, and down probabilities from site x depend on x as

(c(x), b(x), a(x)) =1

3e−β(x)

(e−α(x), 1, eα(x)

).

We may choose α(x) arbitrarily and then choose β(x) so that a(x) + b(x) + c(x) = 1.Taking α 6= 0 introduces drift into the random walk. The state space for walks oflength T is the set of paths x = x(0), · · · , x(T ) through the lattice. Assume thatthe probability distribution for x(0) is the same for P and Q. Find a formula forQ(x)/P (x). The answer is a discrete version of Girsanov’s formula.

4. In the urn process, suppose balls are either stale or fresh. Assume that the processstarts with all n stale balls and that all replacement balls are fresh. Let τ be the firsttime all balls are fresh. Let X(t) be the number of stale balls at time t. Show thatX(t) is a Markov chain and write the transition probabilities. Use the backward orforward equation aproach to calculate the quantities a(x) = Ex(τ). This is a two termrecurrence relation (a(x + 1) = something · a(x)) that is easy to solve. Show that attime τ , the colors of the balls are iid red with probability p. Use this ti explain thebinomial formula for the invariant distribution of colors.

2

Page 128: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 4.

Given October 1, due SOctober 14. Last revised, October 1.Objective: Gaussian random variables.

1. Suppose X ∼ N (µ, σ2). Find a formula for E[eX ]. Hint: write the integral and completethe square in the exponent. Use the answer without repeating the calculation to getE[eaX ] for any constant a.

2. In finance, people often use N(x) for the CDF (cumulative distribution function) for thestandard normal. That is, if Z ∼ N (0, 1) then N(x) = P (Z ≤ x). Suppose S = eX forX ∼ N (µ, σ2). Find a formula for E[max(S, K)] in terms of the N function. (Hint:as in problem 1.) This calculation is part of the Black–Scholes theory of the value of avanilla European style call option. K is the known strike price and S is the unknownstock price.

3. Suppose X = (X1, X2, X3) is a 3 dimensional Gaussian random variable with mean zeroand covariance

E[XX∗] =

2 1 01 2 10 1 2

.

Set Y = X1 + X2 −X3 and Z = 2X1 −X2.

a. Write a formula for the probability density of Y .

b. Write a formula for the joint probability density for (Y, Z).

c. Find a linear combination W = aY + bZ that is independent of X1.

4. Take X0 = 0 and define Xk+1 = Xk + Zk, for k = 0, . . . , n − 1. Here the Zk are iid.standard normals, so that

Xk =k−1∑j=0

Zj . (1)

Let X ∈ Rn be the vector X = (X1, . . . , Xn).

a. Write the joint probability density for X and show that X is a multivariate normal.Identify the n× n tridiagonal matrix H that arises.

b. Use the formula (1) to calculate the variance of Xk and the covariance E[Xj, Xk].

c. Use the answers to part b to write a formula for the elements of H−1.

d. Verify by matrix multiplication that your answer to part c is correct.

1

Page 129: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 5.

Given October 1, due October 21. Last revised, October 7.Objective: Brownian Motion.

1. Suppose h(x) has h′(x) > 0 for all x so that there is at most one x for each y so thaty = h(x). Consider the process Yt = h(Xt), where Xt is standard Brownian motion.Suppose the function h(x) is smooth. The answers to the questions below depend atleast on second derivatives of h.

a. With the notation ∆Yt = Yt+∆t − Yt, for a positive ∆t, calculate a(y) and b(y) sothat E[∆Yt | Ft] = a(Yt)∆t + O(∆t2) and E[∆Y 2

t | Ft] = b(Yt)∆t + O(∆t2).

b. With the notation f(Yt, t) = E[V (YT ) | Ft], find the backward equation satisfiedby f . (Assume T > t.)

c. Writing u(y, t) for the probability density of Yt, use the duality argument to find theforward equation satisfied by u.

d. Write the forward and backward equations for the special case Yt = ecXt . Note (forthose who know) the similarity of the backward equation to the Black Scholespartial differential equation.

2. Use a calculation similar to the one we used in class to show that YT = X4T − 6

∫ T0 X2

t dtis a martingale. Here Xt is Brownian motion.

3. Show that Yt = cos(kXt)ek2t/2 is a martingale.

a. Verify this directly by first calculating (as in problem 1) that

E[Yt+∆t | Ft] = Yt + O(∆t2) .

Then explain why this implies that Yt is a martingale exactly (Hint: To show thatE[Yt′ | Ft] = Yt, divide the time interval (t, t′) into n small pieces and let n →∞.

b. Verify that Yt is a martingale using the fact that a certain function satisfies thebackward equation. Note that, for any function V (x), Zt = E[V (XT ) | Ft] is amartingale (the tower property). Functions like this Z satisfy backward equations.

c. Find a simple intuition that allows a supposed martingale to grow exponentially intime.

4. Let Ax0,t be the event that a standard Brownian motion starting at x0 has Xt′ > 0 for all t′

between 0 and t. Here are two ways to verify the large time asymptotic approximationP (Ax0,t) ≈ 1√

2π2x0√

t.

1

Page 130: Lecture Notes on Stochastic Calculus (NYU)

a. Use the formula from “Images and reflections” to get

P (Ax0,t) =∫ ∞

0u(x, t)dx

≈ 1√2πt

∫ ∞

0e−x2/2t

(exx0/t − e−xx0/t

)dx .

The change of variables y = x/√

t should make it clear how to approximate thelast integral for large t.

b. Use the same formula to get

−d

dtP (Ax0,t) =

1√2π

2x0

t3/2e−x2

0/2t . (1)

Once we know that P (Ax0,t) → 0 as t → ∞, we can estimate its value by inte-grating (1) from t to ∞ using the approximation econst/t ≈ 1 for large t. Note:There are other hitting problems for which P (At) does not go to zero as t →∞.This method would not work for them.

2

Page 131: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 6.

Given October 21, due October 28. Last revised, October 21.Objective: Forward and Backward equations for Brownian motion.

The terms forward and backward equation refer to the equations the probability densityof Xt and Ex,t[V (XT )] respectively. The integrals below are easily done if you use identitiessuch as

1√2πσ2

∫ ∞−∞

x2ne−x2/2σ2

dx = σ2n · (2n− 1)(2n− 3) · · · 3 .

You should not have to do any actual integration for these problems.

1. Solve the forward equation with initial data

u0(x) =x2

√2π

e−x2/2 .

a. Assume the solution has the form

u(x, t) =(A(t)x2 + B(t)

)g(x, t) , g(x, t) = G(0, x, t+1) =

1√2π(t + 1)

e−x2/2(t+1) .

(1)Then find and solve the ordinary differential equations for A(t) and B(t) thatmake (1) a solution of the forward equation.

b. Compute the integrals

u(x, t) =∫ ∞−∞

u0(y)G(y, x, t)dy .

This should give the same answer as part a.

c. Sketch the probability density at time t = 0, for small time, and for large time.Rescale the large time plot so that it doesn’t look flat.

d. Why does the structure seen in u(x, t) for small time (the double hump) disappearfor large t?

e. Show in a rough way that a similar phenomenon happens for any initial data of theform u0(x) = p(x)g(x, 0), where p(x) is an even nonnegative polynomial. When tis large, u(x, t) looks like a simple Gaussian, no matter what p was.

2. Solve the backward equation with final data V (x) = x4.

a. Write the solution in the form

f(x, t) = x4 + a(t)x2 + b(t) . (2)

Then find and solve the differential equations that a(t) and b(t) must satisfy sothat (2) is the solution of the backward equation.

1

Page 132: Lecture Notes on Stochastic Calculus (NYU)

b. Compute the integrals

f(x, t) =∫ ∞−∞

G(x, y, T − t)V (y)dy .

This should be the same as your answer to part a.

c. Give a simple explanation for the form of the formula for f(0, t) = b(t) in terms ofmoments of a Gaussian random variable.

3. Check that ∫ ∞−∞

u(x, t)f(x, t)dx

is independent of t.

2

Page 133: Lecture Notes on Stochastic Calculus (NYU)

Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 7.

Given November 4, due November 11. Last revised, November 9.Objective: Pure and applied mathematics.

The first problems are strictly theoretical. They illustrate how clever some rigorous proofsare. The inequality (3) serves the following function: We want to understand somethingabout the entire path Fk for 0 ≤ k ≤ n. We can get bounds on Fk for particular valuesof k by calculating expectations (e.g. E[F 2

k ]). Then (3) uses this to say something aboutthe whole path. As an application, we will have an easy proof of the convergence of theapproximations to the Ito integral for all t ≤ T once we can prove it at the single time T .

1. Let F_k be a discrete time nonnegative martingale. Let M_n = max_{0≤k≤n} F_k be its maximal function. This problem is the proof that

P(M_n > f) ≤ (1/f) E[F_n 1_{M_n ≥ f}] .   (1)

The proof also shows that if F_k is any martingale and M_n = max_{0≤k≤n} |F_k| is its maximal function, then

P(M_n > f) ≤ (1/f) E[|F_n| 1_{M_n ≥ f}] .   (2)

These inequalities are relatives of Markov’s inequality (also called Chebychev’s inequality, though that term is usually applied to an interesting special case), which says that if X is a nonnegative random variable then P(X > a) ≤ (1/a) E[X 1_{X>a}], or, if X is any random variable, that P(|X| > a) ≤ (1/a) E[|X| 1_{|X|>a}].

a. Let A be the event A = {M_n ≥ f}. Write A as a disjoint union of disjoint events B_k ∈ F_k so that F_k(ω) ≥ f when ω ∈ B_k. Hint: if M_n ≥ f, there is a first k with F_k ≥ f.

b. Since F_k(ω) ≥ f for ω ∈ B_k, show that P(B_k) ≤ (1/f) E[1_{B_k} F_k] (the main step in the Markov/Chebychev inequality).

c. Use the martingale property and the tower property to show that 1_{B_k} F_k = E[1_{B_k} F_n | F_k], so E[1_{B_k} F_k] = E[1_{B_k} F_n]. Do this for discrete probability if that helps you.

d. Add these to get (1).

e. We say G_k is a submartingale if G_k ≤ E[G_n | F_k] (warning: submartingales go up, not down). Show that if F_k is any martingale, then |F_k| is a submartingale. Show that (1) applies to nonnegative submartingales, so (2) applies to general martingales, positive or not.


2. Let M be any nonnegative random variable. Define µ(f) = P(M ≥ f), which is related to the CDF of M. Use the definition of the abstract integral to show that E[M] = ∫_0^∞ µ(f) df and E[M^2] = 2 ∫_0^∞ f µ(f) df. These formulas work even if the common value is infinite. If G is another nonnegative random variable, show that E[GM] = ∫_0^∞ E[G 1_{M ≥ f}] df. Of course, one way to do this is to formulate a single general formula that each of these is a special case of.

3. Use the formulas of part 2 together with Doob’s inequality (2) to show that

E[M_n^2] ≤ 2 E[M_n |F_n|] ,

so

E[M_n^2] ≤ 4 E[F_n^2] .   (3)

(It will help to use the Cauchy Schwarz inequality E[XY] ≤ (E[X^2] E[Y^2])^{1/2}.)
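(Not part of the assignment.) The inequalities can also be checked by simulation. The sketch below uses the simple symmetric random walk F_k = X_1 + ··· + X_k with X_i = ±1, which is a martingale, takes M_n to be the maximal function of |F_k|, and tests (2) and the L^2 bound (3); the path length, number of paths, and thresholds f are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

n_steps, n_paths = 200, 200_000
steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
F = np.cumsum(steps, axis=1)       # the martingale F_k along each path
M = np.abs(F).max(axis=1)          # maximal function M_n = max_k |F_k|
Fn = F[:, -1]                      # final value F_n

# Doob's inequality (2): P(M_n > f) <= E[|F_n| 1_{M_n >= f}] / f
for f in [10.0, 20.0, 30.0]:
    lhs = np.mean(M > f)
    rhs = np.mean(np.abs(Fn) * (M >= f)) / f
    print(f, lhs, "<=", rhs)

# L^2 inequality (3): E[M_n^2] <= 4 E[F_n^2]
print(np.mean(M**2), "<=", 4 * np.mean(Fn**2))
```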

Now some more concrete examples. We can think of martingales as absolutely non mean reverting. The inequality (3) expresses that fact in one way: the maximum of a martingale is comparable to its value at the final time, on average. The Ornstein Uhlenbeck process is the simplest continuous time mean reverting process, a continuous time analogue of the simple urn model.

4. An Ornstein Uhlenbeck process is an adapted process X(t) that satisfies the Ito differential equation

dX(t) = −γ X(t) dt + σ dW(t) .   (4)

We cannot use Ito’s lemma to calculate dX(t) because X(t) is not a function of W(t) and t alone.

a. Examine the definition of the Ito integral and verify that if g(t) is a differentiable function of t, and dX(t) = a(t) dt + b(t) dW(t), with a random but bounded b(t), then d(g(t)X(t)) = g′(t) X(t) dt + g(t) dX(t). It may be helpful to use the Ito isometry formula (paragraph 1.17 of lecture 7).

b. Bring the drift term to the left side of (4), multiply by e^{γt} and integrate (using part a) to get

X(T) = e^{−γT} X(0) + σ ∫_0^T e^{−γ(T−t)} dW(t) .

c. Conclude that X(T) is Gaussian for any T (if X(0) is) and that the probability density for X(T) has a limit as T → ∞. Find the limit by computing the mean and variance.

d. Contrast the large time behavior of the Ornstein Uhlenbeck process with that of Brownian motion.
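(Not part of the assignment.) To see part c numerically before proving it, one can sample X(T) from the formula in part b by approximating the Ito integral with a fine Riemann sum; the empirical mean decays and the empirical variance settles down to a limit you can compare with your calculation. All parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_X_T(T, gamma, sigma, x0, n_paths, n_steps=2000):
    """Sample X(T) = exp(-gamma*T) X(0) + sigma * int_0^T exp(-gamma*(T-t)) dW(t)
    by approximating the Ito integral with a left endpoint Riemann sum."""
    dt = T / n_steps
    t = np.arange(n_steps) * dt                                  # left endpoints
    dW = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))   # Brownian increments
    ito = (np.exp(-gamma * (T - t)) * dW).sum(axis=1)
    return np.exp(-gamma * T) * x0 + sigma * ito

gamma, sigma, x0 = 1.0, 0.7, 3.0
for T in [0.5, 2.0, 5.0, 10.0]:
    X = sample_X_T(T, gamma, sigma, x0, n_paths=100_000)
    print(T, X.mean(), X.var())   # mean decays like exp(-gamma*T); variance approaches a limit
```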

5. Show that e^{ikW(t) + k^2 t/2} is a martingale using Ito differentiation.


Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 8.

Given November 11, due November 18. Last revised, November 11.
Objective: Diffusions and diffusion equations.

1. An Ornstein Uhlenbeck process is a stochastic process that satisfies the stochastic differential equation

dX(t) = −γ X(t) dt + σ dW(t) .   (1)

a. Write the backward equation for f(x, t) = E_{x,t}[V(X(T))].

b. Show that the backward equation has (Gaussian) solutions of the form f(x, t) = A(t) exp(−s(t)(x − ξ(t))^2/2). Find the differential equations for A, ξ, and s that make this work.

c. Show that f(x, t) does not represent a probability distribution, possibly by showing that ∫_{−∞}^∞ f(x, t) dx is not a constant.

d. What is the large time behavior of A(t) and s(t)? What does this say about the nature of an Ornstein Uhlenbeck reward that is paid long in the future as a function of starting position?

2. The forward equation:

a. Write the forward equation for u(x, t) which is the probability density for X(t).

b. Show that the forward equation has Gaussian solutions of the form

u(x, t) = (1/√(2π σ(t)^2)) e^{−(x − µ(t))^2/2σ^2(t)} .

Find the appropriate differential equations for µ and σ.

c. Use the explicit solution formula for (1) from assignment 7 to calculate µ(t) = E[X(t)] and σ^2(t) = var[X(t)]. These should satisfy the equations you wrote for part b.

d. Use the approximation from (1): ∆X ≈ −γX∆t + σ∆W (and the independent increments property) to express ∆µ and ∆(σ^2) in terms of µ and σ and get yet another derivation of the answer in part b. Use the definitions of µ and σ from part c.

e. Differentiate ∫_{−∞}^∞ x u(x, t) dx with respect to t using the forward equation to find a formula for dµ/dt. Find the formula for dσ/dt in a similar way from the forward equation.


f. Give an abstract argument that X(t) should be a Gaussian random variable for each t (something is a linear function of something), so that knowing µ(t) and σ(t) determines u(x, t).

g. Find the solutions corresponding to σ(0) = 0 and µ(0) = y and use them to get a formula for the transition probability density (Green’s function) G(y, x, t). This is the probability density for X(t) given that X(0) = y.

h. The transition density for Brownian motion is G_B(y, x, t) = (1/√(2πt)) exp(−(x − y)^2/2t). Derive the transition density for the Ornstein Uhlenbeck process from this using the Cameron Martin Girsanov formula (warning: I have not been able to do this yet, but it must be easy since there is a simple formula for the answer. Check the bboard.).

i. Find the large time behavior of µ(t) and σ(t). What does this say about the distribution of X(t) for large t as a function of the starting point?
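(Not part of the assignment.) A crude Euler Maruyama simulation of (1) gives sample means and variances that you can compare with the µ(t) and σ(t) you find in parts b and c. The time step and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

gamma, sigma = 0.5, 1.0
y = 2.0                               # starting point X(0) = y
dt, n_steps, n_paths = 0.01, 1000, 50_000

X = np.full(n_paths, y)
for k in range(1, n_steps + 1):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    X = X - gamma * X * dt + sigma * dW     # Euler Maruyama step for dX = -gamma*X dt + sigma dW
    if k % 200 == 0:
        print(k * dt, X.mean(), X.var())    # compare with mu(t) and sigma(t)^2 from parts b and c
```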

3. Duality:

a. Show that the Green’s function from part 2 satisfies the backward equation as a function of y and t.

b. Suppose the initial density is u(x, 0) = δ(x − y) and that the reward is V(x) = δ(x − z). Use your expressions for the corresponding forward solution u(x, t) and backward solution f(x, t) to show by explicit integration that ∫_{−∞}^∞ u(x, t) f(x, t) dx is independent of t.
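(Not part of the assignment.) For intuition about part b, here is the same kind of computation done numerically with the Brownian transition density G_B from part 2.h instead of the Ornstein Uhlenbeck one: with u(x, t) = G_B(y, x, t) and f(x, t) = G_B(x, z, T − t), the overlap integral should come out the same for every t. The values of y, z, T are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

def GB(y, x, t):
    """Brownian transition density G_B(y, x, t) = exp(-(x - y)^2/(2t)) / sqrt(2 pi t)."""
    return np.exp(-(x - y)**2 / (2*t)) / np.sqrt(2*np.pi*t)

y, z, T = -1.0, 2.0, 3.0
for t in [0.5, 1.0, 2.0, 2.5]:
    # u(x, t) = G_B(y, x, t) and f(x, t) = G_B(x, z, T - t); the overlap should not depend on t
    val, _ = quad(lambda x: GB(y, x, t) * GB(x, z, T - t), -np.inf, np.inf)
    print(t, val)   # all printed values should agree (and equal G_B(y, z, T))
```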


Stochastic Calculus, Fall 2004 (http://www.math.nyu.edu/faculty/goodman/teaching/StochCalc2004/)

Assignment 9.

Given December 9, due December 23. Last revised, December 9.

Instructions: Please answer these questions without discussing them with others or looking up the answers in books.

1. Let S be a finite state space for a Markov chain. Let ξ(t) ∈ S be the state of the chain at time t. The chain is nondegenerate if there is an n with P^n_{jk} ≠ 0 for all j ∈ S and k ∈ S. Here the P_{jk} are the j → k transition probabilities and P^n_{jk} is the (j, k) entry of P^n, which is the n step j → k transition probability. For any nondegenerate Markov chain with a finite state space, the Perron Frobenius theorem gives the following information. There is a row vector, π, with ∑_{k∈S} π(k) = 1 and π(k) > 0 for all k ∈ S (a probability vector) so that ‖P^t − 1π‖ ≤ C e^{−αt}. Here 1 is the column vector of all ones and α > 0. In the problems below, assume that the transition matrix P represents a nondegenerate Markov chain.

(a) Show that if P(ξ(t) = k) = π(k) for all k ∈ S, then P(ξ(t + 1) = k) = π(k) for all k ∈ S. In this sense, π represents the steady state or invariant probability distribution.

(b) Show that P has one eigenvalue equal to one, which is simple, and that every other eigenvalue has |λ| < 1.

(c) Let u(k, t) = P(ξ(t) = k). Show that u(k, t) → π(k) as t → ∞. No matter what probability distribution the Markov chain starts with, the probability distribution converges to the unique steady state distribution.

(d) Suppose we have a function f(k) defined for k ∈ S and that E_π[f(ξ)] = 0. Let f be the column vector with entries f(k) and f̃ the row vector with entries f̃(k) = f(k)π(k). Show that

cov_π(f(ξ(0)), f(ξ(t))) = E_π[f(ξ(0)) f(ξ(t))] = f̃ P^t f .

(e) Show that if A is a square matrix with ‖A‖ < 1, then

∑_{t=0}^∞ A^t = (I − A)^{−1} .

This is a generalization of the geometric sequence formula ∑_{t=0}^∞ a^t = 1/(1 − a) if |a| < 1, and the proof/derivation can be almost the same, once the series is shown to converge.

(f) Show that if E_π[f(ξ)] = 0, then ∑_{t=0}^∞ P^t f = g with g − Pg = f and E_π[g(ξ)] = 0. If the series converges, the argument above should apply.


(g) Show that

C = ∑_{t=0}^∞ cov_π[f(ξ(0)), f(ξ(t))] = f̃ g ,

where g is as above.

(h) Let X(T) = ∑_{t=0}^T f(ξ(t)). Show that var(X(T)) ≈ DT for large T, where

D = var_π[f(ξ)] + 2 ∑_{t=1}^∞ cov_π[f(ξ(0)), f(ξ(t))] .

This is a version of the Einstein Kubo formula. To be precise, (1/T) var(X(T)) → D as T → ∞. Even more precisely, |var(X(T)) − DT| is bounded as T → ∞. Prove whichever of these you prefer.

(i) Suppose P represents a Markov chain with invariant probability distribution π and we want to know µ = E_π[f(ξ)]. Show that µ_T = (1/T) ∑_{t=0}^T f(ξ(t)) converges to µ as T → ∞ in the sense that E[(µ_T − µ)^2] → 0 as T → ∞. Show that this convergence does not depend on u(k, 0), the initial probability distribution. It is not terribly hard (though not required in this assignment) to show that µ_T → µ as T → ∞ almost surely. This is the basis of Markov chain Monte Carlo, which uses Markov chains to sample probability distributions, π, that cannot be sampled in any simpler way.

(j) Consider the Markov chain with state space −L ≤ k ≤ L having 2L + 1 states. The one step transition probabilities are 1/3 for any k → k − 1, k → k, or k → k + 1 transitions that do not take the state out of the state space. Transitions that would go out of S are rejected, so that, for example, P(L → L) = 2/3. Take f(k) = k and calculate π and D. Hint: the general solution to the equation (g − Pg)(k) = k is a cubic polynomial in k.
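(Not a substitute for the calculation asked for in part j, but a way to check it.) For a small L one can build P, get π as the left eigenvector for eigenvalue 1, solve g − Pg = f with E_π[g] = 0, and form D = 2 f̃ g − var_π(f), which is the sum in part h rearranged using parts d and g. The value of L below is arbitrary.

```python
import numpy as np

L = 3
states = np.arange(-L, L + 1)                 # state space {-L, ..., L}, 2L+1 states
n = len(states)

# one step transitions: probability 1/3 each to k-1, k, k+1; rejected moves stay put
P = np.zeros((n, n))
for i in range(n):
    for move in (-1, 0, 1):
        j = i + move
        if 0 <= j < n:
            P[i, j] += 1.0 / 3.0
        else:
            P[i, i] += 1.0 / 3.0              # rejected transition adds to the diagonal

# invariant distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

f = states.astype(float)                      # f(k) = k
f = f - pi @ f                                # center so that E_pi[f] = 0

# solve g - P g = f, then shift so that E_pi[g] = 0
g = np.linalg.lstsq(np.eye(n) - P, f, rcond=None)[0]
g = g - pi @ g

var_f = pi @ (f**2)
C = (pi * f) @ g                              # sum_{t>=0} cov_pi(f(xi(0)), f(xi(t))) = f~ g
D = 2 * C - var_f
print("pi =", np.round(pi, 4))
print("D  =", D)
```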

2. A Brownian bridge is a Brownian motion, X(t), with X(0) = X(T) = 0. Find an SDE satisfied by the Brownian bridge. Hint: calculate E_{x,t}[∆X | X(T) = 0], which is something about a multivariate normal.

3. Suppose stock prices S_1(t), . . . , S_n(t) satisfy the SDEs dS_k(t) = µ_k S_k dt + σ_k S_k dW_k(t), where the W_k(t) are correlated standard Brownian motions with correlation coefficients ρ_{jk} = corr(W_j(t), W_k(t)).

(a) Write a formula for S_1(t), . . . , S_n(t) in terms of independent Brownian motions B_1(t), . . . , B_n(t). You may use the Cholesky decomposition LL^t = ρ.

(b) Write a formula for u(s, t), the joint density function of S(t) ∈ R^n. This is the n dimensional correlated lognormal density.

(c) Write the partial differential equation one could solve to determine E[max(S_1(T), S_2(T))] with S_1(0) = s_1 and S_2(0) = s_2 and ρ_{12} ≠ 0.
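(Not part of the assignment.) A sketch of the construction in part a for two assets: generate correlated Brownian increments with the Cholesky factor of ρ and exponentiate to get the lognormal prices. The parameter values, and the Monte Carlo estimate of E[max(S_1(T), S_2(T))] at the end (which the PDE in part c would also give), are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([0.05, 0.02])
sigma = np.array([0.2, 0.3])
rho = np.array([[1.0, 0.6],
                [0.6, 1.0]])
L = np.linalg.cholesky(rho)              # L @ L.T = rho

S0 = np.array([100.0, 50.0])
T, n_steps, n_paths = 1.0, 252, 10_000
dt = T / n_steps

logS = np.tile(np.log(S0), (n_paths, 1))
for _ in range(n_steps):
    dB = np.sqrt(dt) * rng.standard_normal((n_paths, 2))   # independent Brownian increments
    dW = dB @ L.T                                           # correlated increments, corr(dW_j, dW_k) = rho_jk
    logS += (mu - 0.5 * sigma**2) * dt + sigma * dW         # exact step for log of geometric Brownian motion
S = np.exp(logS)

print("E[max(S1(T), S2(T))] ~", np.max(S, axis=1).mean())
```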

4. Suppose dS(t) = a(S(t), t) dt + σ(S(t), t) S(t) dB(t). Write a formula for ∫_0^T S(t) dS(t) that involves only Riemann integrals and evaluations.
