
Advanced probability theory

Jiří Černý

June 1, 2016


Preface

These are lecture notes for the lecture 'Advanced Probability Theory' given at the University of Vienna in SS 2014 and 2016. This is a preliminary version which will be updated regularly during the term. If you have questions, corrections or suggestions for improvements in the text, please let me know.


Contents

1 Introduction

2 Probability spaces, random variables, expectation
   2.1 Kolmogorov axioms
   2.2 Random variables
   2.3 Expectation of real-valued random variables

3 Independence
   3.1 Definitions
   3.2 Dynkin's lemma
   3.3 Elementary facts about the independence
   3.4 Borel-Cantelli lemma
   3.5 Kolmogorov 0–1 law

4 Laws of large numbers
   4.1 Kolmogorov three series theorem
   4.2 Weak law of large numbers
   4.3 Strong law of large numbers
   4.4 Law of large numbers for triangular arrays

5 Large deviations
   5.1 Sub-additive limit theorem
   5.2 Cramér's theorem

6 Weak convergence of probability measures
   6.1 Weak convergence on R
   6.2 Weak convergence on metric spaces
   6.3 Tightness on R
   6.4 Prokhorov's theorem*

7 Central limit theorem
   7.1 Characteristic functions
   7.2 Central limit theorem
   7.3 Some generalisations of the CLT*

8 Conditional expectation
   8.1 Regular conditional probabilities*

9 Martingales
   9.1 Definition and examples
   9.2 Martingale convergence, a.s. case
   9.3 Doob's inequality and L^p convergence
   9.4 L^2-martingales
   9.5 Azuma-Hoeffding inequality
   9.6 Convergence in L^1
   9.7 Optional stopping theorem
   9.8 Martingale central limit theorem*

10 Constructions of processes
   10.1 Semi-direct product
   10.2 Ionescu-Tulcea theorem
   10.3 Complement: Kolmogorov extension theorem

11 Markov chains
   11.1 Definition and first properties
   11.2 Invariant measures of Markov chains
   11.3 Convergence of Markov chains

12 Brownian motion and Donsker's theorem
   12.1 Space C([0, 1])
   12.2 Brownian motion
   12.3 Donsker's theorem
   12.4 Some applications of Donsker's theorem


1 Introduction

The goal of this lecture is to present the most important concepts of probability theory in the context of infinite sequences X_1, X_2, ... of random variables, or, otherwise said, in the context of stochastic processes in discrete time.

We will mostly be interested in the asymptotic behaviour of these sequences. The following examples cover some questions that will be answered in the lecture and introduce heuristically some concepts that we will develop in order to solve them.

Example 1.1 (Series with random coefficients). It is well known that
\[
X^{(1)}_n = \sum_{i=1}^{n} \frac{(-1)^i}{i} \xrightarrow{n\to\infty} -\log 2,
\qquad\text{but}\qquad
X^{(2)}_n = \sum_{i=1}^{n} \frac{1}{i} \xrightarrow{n\to\infty} \infty
\quad\text{(no absolute convergence).}
\]
One can then ask what happens if the signs are chosen randomly, that is, for independent random variables Z_1, Z_2, ... with P[Z_i = +1] = P[Z_i = −1] = 1/2 one considers the sum
\[
X_n = \sum_{i=1}^{n} \frac{Z_i}{i}.
\]
Does this random(!) series converge or not? If yes, is the limit random or deterministic?
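To build intuition one can simulate the random series. The following sketch (not from the notes; NumPy is assumed) draws a few independent realisations of the partial sums X_n. In each run the partial sums settle down, but to a limit that differs from run to run, which is consistent with a random limit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
for run in range(3):
    z = rng.choice([-1.0, 1.0], size=n)            # fair random signs Z_1, ..., Z_n
    x = np.cumsum(z / np.arange(1, n + 1))          # partial sums X_k = sum_{i<=k} Z_i / i
    print(f"run {run}: X_1000 = {x[999]:+.4f},  X_100000 = {x[-1]:+.4f}")
```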

Example 1.2 (Sums of independent random variables). In the lecture 'Probability and Statistics' you studied the following problem. Let Z_i be as in Example 1.1, that is, the Z_i's are outcomes of independent throws of a fair coin, and set
\[
S_n = \sum_{i=1}^{n} Z_i, \qquad\text{and}\qquad X_n = \frac{1}{n} S_n.
\]
By the weak law of large numbers, denoting by EZ_i (= 0) the expectation of Z_i, we know that
\[
P\bigl(|X_n - EX_n| \ge \varepsilon\bigr) \xrightarrow{n\to\infty} 0 \qquad\text{for every } \varepsilon > 0.
\]
Observe however that the last display says only that the probability that |X_n| is far from zero decays with n. It says nothing about the convergence of X_n for a single realisation of coin throws.

To address these (and many other) questions we will develop the formalism of probability theory, which is based on measure theory and the Kolmogorov axioms. In this formalism, we will show an improved version of the weak LLN, the so-called strong LLN:
\[
P\Bigl[\lim_{n\to\infty} X_n = 0\Bigr] = 1, \qquad\text{or equivalently}\qquad \lim_{n\to\infty} X_n = 0, \quad P\text{-a.s.}
\]


Example 1.3 (Random walk and Brownian motion). Continuing with Example 1.2, we can view S_n as a function S : N → R. By linear interpolation we can extend it to a function S : R_+ → R (see Figure 1.1). This is a random continuous function, i.e. a random element of the space C(R_+; R). As such a random object cannot be described by the elementary means of the 'Probability and Statistics' lecture, one of our goals is to develop a sound mathematical theory allowing for this.

[Figure 1.1: Random walk and its scaling. Observe that in the second picture the x-axis is 100 times longer, but the y-axis only 10 times. The second picture "looks almost like" a Brownian motion.]

We also want to discuss the convergence of such random objects. More exactly, recall that the central limit theorem says that

\[
\frac{1}{\sqrt{n}}\, S_n \xrightarrow[n\to\infty]{d} \mathcal{N}(0,1),
\]
where N(0, 1) stands for the standard normal distribution. The arrow notation in the previous display stands for convergence in distribution, which can formally be defined here e.g. by
\[
P\Bigl[\frac{1}{\sqrt{n}}\, S_n \le a\Bigr] \xrightarrow{n\to\infty} \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx, \qquad\text{for all } a \in \mathbb{R}.
\]

In view of (1.3), it seems not unreasonable to scale the function S by n^{-1} in the time direction and by n^{-1/2} in the space direction, that is, to consider
\[
S^{(n)}(t) = n^{-1/2} S_{nt},
\]
and ask 'Does this sequence of random elements of C(R_+, R) converge? What is the limit object?'
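A quick way to visualise the rescaling S^{(n)}(t) = n^{-1/2} S_{nt} is to plot the same walk on the two scales of Figure 1.1. The sketch below (not part of the notes; matplotlib and NumPy assumed) does exactly this.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
z = rng.choice([-1.0, 1.0], size=2000)
s = np.concatenate([[0.0], np.cumsum(z)])              # S_0, S_1, ..., S_2000

for n in (20, 2000):                                   # the two scales of Figure 1.1
    t = np.arange(n + 1) / n                           # time rescaled by 1/n
    plt.plot(t, s[: n + 1] / np.sqrt(n), label=f"n = {n}")   # space rescaled by n^(-1/2)
plt.xlabel("t"); plt.ylabel("S^(n)(t)"); plt.legend(); plt.show()
```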


We will see that the answer to the first question is 'YES', but to this end we need to introduce the right notion of convergence. Even more interesting is the limit object, the Brownian motion.

Apart from being very interesting objects in their own right, random walk and Brownian motion are prototypes of two important classes of processes, namely Markov chains/processes and martingales, that we are going to study.

We close this section with a few examples linking probability theory to other domains of mathematics. Some of them will be treated in the lecture in more detail.

Example 1.4 (Random walk and discrete Dirichlet problem, link to PDE's). Consider a simple random walk on Z² started at x ∈ Z², that is, a sequence of random variables X_0, X_1, ... determined by X_0 = x and by the requirement that its increments Z_i = X_i − X_{i−1}, i ≥ 1, are i.i.d. random variables satisfying
\[
P[Z_i = \pm e_1] = P[Z_i = \pm e_2] = 1/4.
\]
Here e_1, e_2 stand for the canonical basis vectors of Z². See Figure 1.2 for a typical realisation.

[Figure 1.2: Realisation of a random walk on Z², started at x, exiting the domain O at the point Y.]

Let g : R² → R be a continuous function and O a large domain in R². Let Y be the random position of the exit point of the random walk from the domain O, i.e. Y = X_T with T = inf{k : X_k ∉ O}; see the figure again. Define a function u : Z² → R by
\[
u(x) = E_x[g(Y)], \qquad x \in \mathbb{Z}^2,
\]
where E_x stands for the expectation for the random walk started at x. We will later show that u solves a discrete Dirichlet problem
\[
\Delta_d u(x) = 0, \quad x \in \mathbb{Z}^2 \cap O, \qquad\qquad u(x) = g(x), \quad x \in \mathbb{Z}^2 \setminus O,
\]


where ∆_d is the discrete¹ Laplace operator
\[
\Delta_d u(x) = \tfrac{1}{4}\bigl(u(x+e_1) + u(x-e_1) + u(x+e_2) + u(x-e_2)\bigr) - u(x).
\]
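The representation u(x) = E_x[g(Y)] immediately suggests a Monte Carlo scheme: run many walks from x and average g over their exit points. The sketch below (not from the notes) does this for an illustrative choice of the domain O and of the boundary function g; both are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
steps = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])     # increments ±e1, ±e2, probability 1/4 each

def inside_O(x):
    """Illustrative domain O: the open box (-10, 10)^2."""
    return max(abs(int(x[0])), abs(int(x[1]))) < 10

def g(x):
    """Illustrative boundary function g."""
    return float(x[0] > 0)

def u(x, runs=20_000):
    """Monte Carlo estimate of u(x) = E_x[g(X_T)], T the exit time from O."""
    total = 0.0
    for _ in range(runs):
        y = np.array(x)
        while inside_O(y):                               # run the walk until it leaves O
            y = y + steps[rng.integers(4)]
        total += g(y)
    return total / runs

print(u((0, 0)), u((5, 0)))                              # u is harmonic in O and equals g outside
```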

Example 1.5 (Lower bound on Ramsey numbers, a tiny link to graph theory). The Ramsey number R(k) is the smallest number n such that any colouring of the edges of the complete graph K_n by two colours (red and blue, say) must contain at least one monochromatic (that is, completely blue or red) copy of K_k as a subgraph.

These numbers are rather famous in graph theory, not least because they are very hard to compute. Actually,² the only known values are R(1) = 1, R(2) = 2, R(3) = 6, R(4) = 18. For larger Ramsey numbers only bounds are known, e.g. R(5) ∈ [43, 49] or R(10) ∈ [798, 23556]. It is thus essential to get good estimates on these numbers. We are going to use an easy probabilistic argument to find a lower bound on R(k).

Lemma (taken from [AS08], Proposition 1.1.1). Assume that $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$. Then R(k) > n. In particular, R(k) ≥ ⌊2^{k/2}⌋ for all k ≥ 3.

Proof. Consider a random two-colouring of the edges of K_n obtained by colouring each edge independently either red or blue, where each colour is equally likely. For any fixed set R ⊂ {1, ..., n} of k vertices, let A_R be the event that the induced subgraph of K_n on R is monochromatic. Clearly,
\[
P[A_R] = 2 \cdot 2^{-\binom{k}{2}}.
\]

Since there are $\binom{n}{k}$ possible choices for R, the probability that at least one of the events A_R occurs is at most $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$. Thus, with positive probability, no event A_R occurs and hence there must be a two-colouring of K_n without a monochromatic K_k; that is, R(k) > n. The second claim follows by checking that the assumption holds for n = ⌊2^{k/2}⌋ (exercise).

The best present estimates on R(k) are
\[
(1+o(1))\, \frac{\sqrt{2}}{e}\, k\, 2^{k/2} \le R(k) \le k^{-c \log k / \log\log k}\, 4^k,
\]
that is, the lower bound of the lemma is much better than one would expect given the simplicity of the proof.

Observe also that the proof is non-constructive. It only says that a colouring with the required properties exists (since it has positive probability), and can be found by tossing a coin for a sufficiently long time. Such probabilistic arguments can be used in many situations in graph and number theory; see the very nice book [AS08].

Example 1.6 (few other examples). TBD

1 If you prefer the usual Laplace operator, you can replace the random walk by a Brownian motion started at x ∈ R².

2 Source: Wikipedia.


2 Probability spaces, random variables, expectation

The goal of this chapter is to quickly review the basic concepts and definitions which should be known from the previous lectures "Probability and Statistics" and "Measure Theory".

2.1 Kolmogorov axioms

We start by describing how random experiments are mathematically formalised. An experiment is modelled by a probability space (Ω, A, P) where

• Ω is a non-empty set containing all possible results of the experiment.

• A is a σ-algebra on Ω, that is, A is a collection of subsets of Ω satisfying the following three properties

(σ1) ∅ ∈ A,

(σ2) (A ∈ A) =⇒ (Ac ∈ A),

(σ3) for every sequence (A_i)_{i≥1} with A_i ∈ A, we have ⋃_{i≥1} A_i ∈ A.

• P is a probability measure on (Ω, A). That is, P is a mapping from A to [0, 1] satisfying

(m1) P(Ω) = 1

(m2) P is σ-additive, that is, for every sequence (A_i)_{i≥1} of pairwise disjoint elements of A (i.e. A_i ∩ A_j = ∅ if i ≠ j) we have

\[
P\Bigl[\bigcup_{i\ge1} A_i\Bigr] = \sum_{i\ge1} P[A_i].
\]

In the language of measure theory, P is a normed measure on a measurable space (Ω, A). Any element ω ∈ Ω is called a (possible) outcome of the experiment. The sets in A are called events.

The σ-algebra A should be interpreted as the collection of all subsets of Ω that we want to measure (or sometimes that we can measure), that is, to which we want (or can) associate their probability. For A ∈ A, the quantity P[A] then gives the probability that the random outcome of the experiment falls into A.


This formalism includes the two important cases of probability distributions that were introduced in the elementary lecture—the 'continuous' and 'discrete' distributions—as can be seen from the next two examples.

Example 2.1. Let Ω = R, A = B(R) be the Borel σ-algebra on R (i.e. the smallest σ-algebra containing all open subsets of R), and let
\[
P(A) = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{-\frac{(x-m)^2}{2\sigma^2}\Bigr\}\, dx, \qquad A \in \mathcal{B}(\mathbb{R}).
\]
This probability space describes an experiment whose outcome has the normal (or Gaussian) distribution with mean m ∈ R and variance σ² > 0, which we abbreviate by N(m, σ²).

Example 2.2. Let Ω = N = {0, 1, ...}, A = P(N) be the power set of N, and let, for λ > 0,
\[
P(A) = \sum_{k\in A} e^{-\lambda}\, \frac{\lambda^k}{k!}.
\]
This gives the Poisson distribution with parameter λ, Pois(λ).

Random vectors are also easily covered by the formalism.

Example 2.3 (Product spaces). Let (Ω_1, A_1, P_1), (Ω_2, A_2, P_2) be two probability spaces. We can obtain another probability space by setting Ω = Ω_1 × Ω_2, A to be the product σ-algebra A_1 ⊗ A_2 (the smallest σ-algebra containing all rectangles A_1 × A_2, A_1 ∈ A_1, A_2 ∈ A_2), and P to be the product measure P_1 ⊗ P_2, which is determined by its values on rectangles,
\[
P(A_1 \times A_2) = P_1(A_1)\, P_2(A_2) \qquad\text{for all } A_1 \in \mathcal{A}_1,\ A_2 \in \mathcal{A}_2.
\]

The product of n probability spaces (Ω_1, A_1, P_1), ..., (Ω_n, A_n, P_n) can be defined analogously.

For example, taking (Ω_i, A_i, P_i), 1 ≤ i ≤ n, to be the probability space of Example 2.1 with m = 0, σ = 1, we obtain by taking their product the n-dimensional standard normal distribution
(2.4)
\[
\Omega = \mathbb{R}^n, \quad \mathcal{A} = \mathcal{B}(\mathbb{R}^n) = \mathcal{B}(\mathbb{R})^{\otimes n}, \quad\text{and}\quad
P(A) = \int_A \frac{1}{(2\pi)^{n/2}} \exp\Bigl\{-\frac{x_1^2 + \dots + x_n^2}{2}\Bigr\}\, dx_1 \cdots dx_n, \quad A \in \mathcal{A}.
\]

Example 2.5 (Infinite product spaces). The previous example can be extended further. Let I be an infinite (possibly uncountable) index set and for every ι ∈ I let (Ω_ι, A_ι, P_ι) be a probability space. We set Ω = ×_{ι∈I} Ω_ι to be the usual Cartesian product. A little bit more care is needed while defining the σ-algebra A: it is the smallest σ-algebra on Ω containing all 'finite-dimensional cylinders', that is
\[
\mathcal{A} = \sigma\Bigl(\bigl\{(\times_{\iota\in J} A_\iota) \times (\times_{\iota\in I\setminus J} \Omega_\iota) : J \subset I \text{ finite},\ A_\iota \in \mathcal{A}_\iota\ \forall \iota \in J\bigr\}\Bigr).
\]


A is called the cylinder σ-algebra. For finite-dimensional cylinders we then set
\[
P\bigl[(\times_{\iota\in J} A_\iota) \times (\times_{\iota\in I\setminus J} \Omega_\iota)\bigr] = \prod_{\iota\in J} P_\iota(A_\iota),
\]
which determines uniquely, as we will see, a probability measure on A, the product measure.

2.2 Random variables

Another useful known concept is that of a random variable. Random variables serve to model various properties of the random experiment, and are to some extent more important than the probability space itself.¹

Definition 2.6. Let (Ω, A, P) be a probability space and (S, S) a measurable space.² A function X : Ω → S is called an S-valued random variable if it is a measurable function from (Ω, A) to (S, S), i.e. it holds that

(2.7)  X⁻¹(B) ∈ A for every B ∈ S.

If S = R, we usually tacitly assume that S is the corresponding Borel σ-algebra B(R). In this case X is called a real-valued random variable, or simply a random variable.

Remark 2.8. Checking the condition (2.7) of Definition 2.6 seems to be a tedious task, since the condition needs to be verified for all B in the σ-algebra S, which is typically rather large. Fortunately this is not the case:

A subset E of the σ-algebra S is called a generator of S if S = σ(E), that is, S is the smallest σ-algebra containing E. We now show that if E is a generator of S, then (2.7) is equivalent to

(2.9)  X⁻¹(B) ∈ A for all B ∈ E.

Proof of (2.9) ⟺ (2.7). The collection of sets {B ⊂ S : X⁻¹(B) ∈ A} is a σ-algebra. Indeed, X⁻¹(S) = Ω, X⁻¹(B^c) = (X⁻¹(B))^c, and X⁻¹(⋃_{i≥1} B_i) = ⋃_{i≥1} X⁻¹(B_i). This σ-algebra contains all elements of E and therefore it contains also σ(E) = S.

When S is the Borel σ-algebra on R, the following generators work:

• The collection O of all open sets of R.

• The collection C of all closed sets of R.

• The collection of all open intervals (a, b), −∞ < a < b <∞.

1 Reading the probability theory literature, you will quickly notice that it often assumes the existence of some abstract probability space (Ω, A, P) where one can define all random variables that one wants to deal with, without really constructing it explicitly.

2 That is, S is a non-empty set and S is a σ-algebra of subsets of S.


• The collection of all closed intervals [a, b], −∞ < a ≤ b <∞.

• The collection H of all half-infinite intervals (−∞, a], a ∈ R.

Hence, to check that X : Ω → R is a random variable it is e.g. sufficient to show that

X⁻¹((−∞, a]) =: {X ≤ a} ∈ A for all a ∈ R.

Definition 2.10. Let X : (Ω, A) → (S, S) be a random variable. The probability measure µ_X on (S, S) defined by

µ_X(B) := P(X⁻¹(B)) = P(X ∈ B), for B ∈ S,

is called the distribution of X. One sometimes writes X#P (push-forward of the measure P by the function X) or P ∘ X⁻¹ to denote this distribution.

Exercise 2.11. Check that (S,S, µX) is again a probability space.

Exercise 2.12. Let (Ω, A, P) be as in (2.4) and X : Rⁿ → R, (x_1, ..., x_n) ↦ x_1. Then X is a random variable (check!) and
\[
\mu_X(B) = P(X \in B)
= \int_{B\times\mathbb{R}^{n-1}} \frac{1}{(2\pi)^{n/2}} \exp\Bigl\{-\frac{x_1^2 + \dots + x_n^2}{2}\Bigr\}\, dx_1 \cdots dx_n
= \int_B \frac{1}{(2\pi)^{1/2}} \exp\Bigl\{-\frac{x_1^2}{2}\Bigr\}\, dx_1.
\]
That is, µ_X is the standard normal distribution.

We recall an important tool from elementary probability theory, which allows us to deal with all (in particular both discrete and continuous) real-valued random variables in a unified manner.

Definition 2.13. Let X be a real-valued random variable. The map R ∋ a ↦ F_X(a) = P(X ≤ a) is called the distribution function of X.

Claim 2.14. Let F be the distribution function of a random variable. Then

(i) F (·) is non-decreasing,

(ii) lim_{x→∞} F(x) = 1, lim_{x→−∞} F(x) = 0,

(iii) F(x) is right-continuous.

Proof. (i) follows directly from the definition of F. For (iii), let x ∈ R and (x_n)_{n≥0} be a sequence such that x_n ↓ x. Then ⋂_{n≥0} (−∞, x_n] = (−∞, x] and the sequence of intervals (−∞, x_n] is decreasing. Hence, by continuity from above (upper regularity) of the measure µ_X, F(x) = µ_X((−∞, x]) = lim_{n→∞} µ_X((−∞, x_n]) = lim_{n→∞} F(x_n). (ii) is proved similarly to (iii).


The distribution functions completely characterise the set of all probability measures on (R, B(R)), as can be seen from the next theorem.

Theorem 2.15 (Lebesgue-Stieltjes). Let F : R → [0, 1] satisfy the conditions (i)–(iii) of Claim 2.14. Then there exists a unique probability measure µ on (R, B(R)) such that

F(x) = µ((−∞, x]) for all x ∈ R.

Proof. Existence: Consider the probability space Ω = (0, 1), A = B((0, 1)), P = Lebesgue measure on (0, 1), and define a map X : (0, 1) → R by

(2.16)  X(ω) = sup{y ∈ R : F(y) < ω}, ω ∈ (0, 1).

X should be thought of as an "inverse" to F. Formally, we will show that

(2.17)  {ω : X(ω) ≤ x} = {ω : ω ≤ F(x)}.

The existence of the measure µ follows from (2.17). Indeed, by (2.17), X is a random variable and its distribution function is F.

We now show (2.17). Assume first that ω ∈ (0, 1) with ω ≤ F(x). Then x ∉ {y : F(y) < ω} and thus x ≥ X(ω), hence {ω : X(ω) ≤ x} ⊃ {ω : ω ≤ F(x)}. On the other hand, for ω ∈ (0, 1) with ω > F(x), the right-continuity of F yields the existence of ε > 0 such that F(x + ε) < ω, that is, X(ω) ≥ x + ε > x. It follows that {ω : X(ω) ≤ x} ⊂ {ω : ω ≤ F(x)}, completing the proof of (2.17).

Uniqueness: From the definition of the distribution function it follows that µ((a, b]) = F(b) − F(a) for every a < b. Therefore µ((a, b]) is uniquely determined by F. Further, µ((a, b)) = lim_{n→∞} µ((a, b − 1/n]) is uniquely determined, and thus also µ(O) for every open set O ⊂ R (O is a countable union of disjoint open intervals and thus µ(O) is determined by the σ-additivity). From this the uniqueness follows.

We will later see (or you have seen?) a general argument based on Dynkin's lemma which implies the uniqueness in the last proof.

Remark 2.18. The existence part of the proof is useful for simulating random variables with a given distribution µ on a computer: if your programming language/library provides you with a uniform random variable (which it usually does), (2.16) gives the recipe for transforming it into a µ-distributed random variable.
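A minimal sketch of this recipe (not from the notes): for a distribution whose F can be inverted in closed form, such as the exponential distribution used here purely as an illustration, the generalised inverse of (2.16) reduces to F⁻¹, and transforming uniforms gives samples with the desired law.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 2.0                                  # target: Exp(lambda) with F(y) = 1 - exp(-lam*y)
u = rng.uniform(size=100_000)              # the uniform random variables the library provides
x = -np.log(1.0 - u) / lam                 # X = F^{-1}(U), a special case of (2.16)
print(x.mean(), 1 / lam)                   # empirical mean vs. the true mean 1/lambda
```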

Exercise 2.19. (a) Let X_1, ..., X_n be random variables on (Ω, F, P) and f : Rⁿ → R a measurable function. Show that f(X_1, ..., X_n) is a random variable.

(b) Let X_1, X_2, ... be a sequence of [−∞, +∞]-valued random variables on (Ω, A, P). Show that sup_n X_n, inf_n X_n, lim inf_n X_n, lim sup_n X_n are random variables. (A [−∞, ∞]-valued function X is a random variable iff e.g. X⁻¹([a, ∞]) ∈ A for every a ∈ R ∪ {−∞}.)


2.3 Expectation of real-valued random variables

As you know from the elementary lecture, the expectation of a random variable corresponds to a "mean" of all its possible values. In the two special cases of discrete and continuous real-valued random variables it is defined by

(2.20)
\[
E[X] = \sum_{x\in R_X} x\, P[X = x] \quad (R_X = \text{set of all possible values of } X),
\qquad
E[X] = \int_{\mathbb{R}} x\, f_X(x)\, dx \quad (f_X = \text{density of } X),
\]

whenever these expressions make sense. In the language of measure-theoretic probability theory, there is no need for two separate definitions:

Definition 2.21. Let X : Ω → R be a random variable. Assume that X ≥ 0 or X ∈ L¹(Ω, A, P) (i.e. ∫_Ω |X| dP < ∞). The expectation of X is given by
\[
E(X) = \int_\Omega X(\omega)\, P(d\omega).
\]

When we want to point out over which probability measure we take the expectation, we write E_P(X).

The expectation inherits many properties of the integral, which were shown in the Measure Theory course.

Linearity. Let X, Y be random variables on Ω and a, b ∈ R, then

E[aX + bY] = aE[X] + bE[Y],

whenever the right-hand side is well defined.

Monotonicity. Let X, Y be two random variables with well-defined expectation. Then

X ≥ Y (i.e. X(ω) ≥ Y (ω) for all ω ∈ Ω) =⇒ E[X] ≥ E[Y ].

Hölder inequality. For p, q ∈ [1, ∞] with 1/p + 1/q = 1,

E[|XY|] ≤ ‖X‖_p ‖Y‖_q,

where the L^p(Ω, A, P)-norm of X is defined by ‖X‖_p = (E[|X|^p])^{1/p} for p ∈ [1, ∞), and by ‖X‖_∞ = inf{M : P[|X| > M] = 0} =: ess sup_P |X| for p = ∞.

The Cauchy-Schwarz inequality is the special case of the Hölder inequality for p = q = 2,

E[|XY|] ≤ ‖X‖₂ ‖Y‖₂.


Minkowski inequality. This is the triangle inequality for the ‖ · ‖p-norm,

‖X + Y ‖p ≤ ‖X‖p + ‖Y ‖p, p ∈ [1,∞].

Fatou's lemma. Let (X_n)_{n≥0} be a sequence of [0, ∞]-valued random variables. Then
\[
E\bigl[\liminf_{n\to\infty} X_n\bigr] \le \liminf_{n\to\infty} E[X_n].
\]

Monotone convergence theorem (MCT, Beppo Levi theorem). For random variables X_n ≥ 0 with X_n ↗ X P-a.s. (i.e. P[ω : X_n(ω) ≤ X_{n+1}(ω) ∀n, and lim X_n(ω) = X(ω)] = 1),

E[X_n] ↗ E[X], as n → ∞.

Dominated convergence theorem (DCT, Lebesgue theorem). Let X, Y and (X_n)_{n≥0} be random variables such that

• X_n converges pointwise to X P-a.s., that is, P[lim_{n→∞} X_n(ω) = X(ω)] = 1,

• Xn are dominated by Y , that is |Xn| ≤ Y P -a.s. for all n ≥ 0,

• Y is P -integrable, that is Y ∈ L1(Ω,A, P ).

Then

lim_{n→∞} E[X_n] = E[X].

Other properties of expectation require P to be normalised, that is P [Ω] = 1:

Jensen's inequality. Let X ∈ L¹(Ω, A, P) and ϕ : R → R a convex function (i.e. ϕ(λx + (1−λ)y) ≤ λϕ(x) + (1−λ)ϕ(y) for every x, y ∈ R and λ ∈ [0, 1]) such that E[ϕ(X)] < ∞. Then

ϕ(E[X]) ≤ E[ϕ(X)].

Proof. When ϕ is convex, there exists a ∈ R such that a(x − E(X)) + ϕ(E(X)) ≤ ϕ(x) for all x ∈ R. Replacing x by X in this inequality and taking expectations, Jensen's inequality follows.

(Generalised) Chebyshev's inequality. Let ϕ : S → [0, ∞] be a measurable function, A ∈ S, and X an S-valued random variable. Then
\[
\inf\{\varphi(y) : y \in A\}\, P[X \in A] \le \int_\Omega \varphi(X(\omega))\, 1\{X \in A\}\, P(d\omega) =: E[\varphi(X); X \in A] \le E[\varphi(X)].
\]
Proof. Observe that inf{ϕ(y) : y ∈ A} 1{x ∈ A} ≤ ϕ(x) 1{x ∈ A} ≤ ϕ(x) (using the convention 0 · ∞ = 0 usual in measure theory). Replacing x by X, taking the expectation and applying the monotonicity, the claim follows.


Markov inequality. Taking ϕ(x) = |x|^p, p ≥ 0, in Chebyshev's inequality, we obtain for every X ∈ L^p and a ≥ 0
\[
P[|X| \ge a] \le \frac{1}{a^p}\, E[|X|^p].
\]

Chebyshev's inequality. By taking X ∈ L² and ϕ(x) = (x − EX)² we obtain the usual Chebyshev's inequality
\[
P[|X - EX| \ge a] \le \frac{1}{a^2}\operatorname{Var} X := \frac{1}{a^2}\, E[(X - EX)^2].
\]
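A small numerical sanity check (a sketch, NumPy assumed; the exponential distribution is an arbitrary illustrative choice) comparing the exceedance probability P[|X − EX| ≥ a] with the Chebyshev bound Var X / a²:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=1_000_000)      # X ~ Exp(1): EX = 1, Var X = 1
for a in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - 1.0) >= a)        # estimate of P[|X - EX| >= a]
    print(a, empirical, 1.0 / a**2)                  # always below the bound Var X / a^2
```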

Finally, recall a transformation (substitution) theorem.

Lemma 2.22. Let (Ω, F, P) be a probability space, (S, S) a measurable space, X : Ω → S an S-valued random variable, and g : S → R a measurable function. Then g ∘ X =: g(X) is an R-valued random variable, and denoting by µ_X = X#P = P ∘ X⁻¹ the distribution of X on (S, S) (cf. Definition 2.10 and Exercise 2.19), we have
(2.23)
\[
\int_S |g(s)|\, \mu_X(ds) < \infty \iff \int_\Omega |(g\circ X)(\omega)|\, P(d\omega) < \infty,
\]
and if both these equivalent conditions hold, then
(2.24)
\[
\mu_X(g) = E_{X\#P}[g] = \int_S g(s)\, \mu_X(ds) \overset{!}{=} E_P[g(X)] = \int_\Omega (g\circ X)(\omega)\, P(d\omega).
\]
(The only real statement is the equality marked with '!'; the remaining ones introduce various notation for the same objects.)

Proof. This is a small exercise in measure theory that is good to recall: we prove the theorem in four steps, starting with simple functions g and going to more general ones.

(a) g = 1_B is the indicator of a set B ∈ S: In this case the conditions in (2.23) hold true since g ≤ 1. Moreover, µ_X(g) = (P ∘ X⁻¹)(B) = E_P[1_B ∘ X], that is, (2.24) holds.

(b) g is a linear combination of indicators, g = ∑_{i=1}^m c_i 1_{B_i} for some c_i ≥ 0 and B_i ∈ S: In this case (2.23) and (2.24) follow directly using (a) and the linearity of the expectation.

(c) g ≥ 0 an arbitrary measurable function: Set
\[
g_n = \sum_{k=0}^{n2^n - 1} \frac{k}{2^n}\, 1\Bigl\{\frac{k}{2^n} \le g < \frac{k+1}{2^n}\Bigr\} + n\, 1\{g \ge n\}.
\]
Then g_n ↗ g, g_n ∘ X ↗ g ∘ X, and the functions g_n are as in (b). By the monotone convergence theorem, E_{X#P}[g_n] ↗ E_{X#P}[g] and E_P[g_n(X)] ↗ E_P[g(X)], and thus (2.23), (2.24) hold true for non-negative g.

(d) g an arbitrary measurable function: Set g₊ = g ∨ 0, g₋ = (−g) ∨ 0. Then g = g₊ − g₋, and g₊, g₋ are as in (c). Further, E_{X#P}[|g|] < ∞ is equivalent to E_{X#P}[g₊] < ∞ and E_{X#P}[g₋] < ∞. Similar claims hold for the functions g ∘ X, g₊ ∘ X, and g₋ ∘ X with respect to the expectation E_P. (2.23) then follows, and (2.24) is a consequence of the linearity, that is, of E_P[g(X)] = E_P[g₊(X)] − E_P[g₋(X)].


To finish this chapter we recall several elementary definitions.

Definition 2.25. Let X ∈ L1(Ω,A, P ). The variance of X is defined by

(2.26)  Var(X) = E[(X − EX)²] ∈ [0, ∞].

Lemma 2.27. Let X ∈ L1(Ω,A, P ). Then

(i) Var(X) = E[X² − 2X·EX + (EX)²] = E(X²) − (EX)².

(ii) VarX <∞ iff X ∈ L2(Ω,A, P ).

(iii) VarX = 0 iff X = EX P -a.s.

Proof. Obvious.

Definition 2.28. Let X, Y be two random variables in L¹(Ω, A, P) such that XY is also in L¹(Ω, A, P). The covariance of X and Y is given by

Cov(X, Y ) := E[(X − EX)(Y − EY )] = E(XY )− E(X)E(Y ).

Definition 2.29. If X ∈ L^p(Ω, F, P), then E[X^p] is called the p-th moment of X.

Definition 2.30. Let X = (X_1, X_2, ..., X_n) be a random vector (i.e. an Rⁿ-valued random variable). We define EX component-wise, EX = (EX_1, ..., EX_n) ∈ Rⁿ, and its covariance matrix Σ(X) = (σ_ij(X))_{i,j=1,...,n} by σ_ij(X) = Cov(X_i, X_j).

Exercise 2.31. (a) When X, Y ∈ L2(Ω,A, P ), the covariance exists.

(b) Cov(X,X) = Var(X)

(c) For every random vector X, its covariance matrix Σ(X) is symmetric and positive semi-definite, that is, σ_ij(X) = σ_ji(X) and ξᵀΣ(X)ξ ≥ 0 for every column vector ξ ∈ Rⁿ.
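A short check of part (c) on simulated data (a sketch, NumPy assumed; the mixing matrix is an arbitrary illustrative choice): the empirical covariance matrix is symmetric and its eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(size=(100_000, 3))                                        # independent standard normals
x = z @ np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])    # correlated coordinates of X
sigma = np.cov(x, rowvar=False)                                          # empirical covariance matrix
print(np.allclose(sigma, sigma.T))                                       # symmetry
print(np.linalg.eigvalsh(sigma))                                         # all eigenvalues >= 0 (PSD)
```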


3 Independence

3.1 Definitions

We recall the elementary definition.

Definition 3.1. Let (Ω, A, P) be a probability space. Events A, B ∈ A are called independent when

(3.2)  P[A ∩ B] = P[A] · P[B].

When P[B] > 0, this can be written using conditional probabilities as
\[
P[A \mid B] := \frac{P[A\cap B]}{P[B]} \overset{(3.2)}{=} P[A],
\]
i.e. the events A, B are independent if the information about the occurrence of B has no influence on the probability of A.

We now extend this definition in several directions:

Definition 3.3. (a) Two σ-algebras G_1, G_2 ⊂ A are called independent if P[A_1 ∩ A_2] = P[A_1]P[A_2] for every A_1 ∈ G_1, A_2 ∈ G_2.

(b) Two random variables X_1, X_2 on (Ω, A, P) are called independent if the σ-algebras σ(X_1), σ(X_2) generated by these random variables are independent in the sense of (a). (Recall, for an (S, S)-valued random variable X, σ(X) is the smallest σ-algebra on Ω which makes X measurable, σ(X) = {X⁻¹(A) : A ∈ S}.)

Definition 3.4. (a) The σ-algebras G_1, ..., G_n ⊂ A are independent if P[A_1 ∩ ··· ∩ A_n] = P[A_1] ··· P[A_n] for every A_1 ∈ G_1, ..., A_n ∈ G_n.

(b) Events A_1, ..., A_n are independent if the σ-algebras {∅, A_i, A_i^c, Ω}, i = 1, ..., n, are.

(c) Similarly, random variables X_1, ..., X_n on Ω are independent if the σ-algebras σ(X_i), i = 1, ..., n, are.

Remark 3.5. As one can take A_i = Ω in Definition 3.4(a), we see also that every subset of {G_1, ..., G_n} is independent. One also checks easily that Definition 3.4(b) is equivalent to the definition of independence which you know from the elementary lecture: the events A_1, ..., A_n ∈ A are independent if
\[
P\Bigl[\bigcap_{i\in J} A_i\Bigr] = \prod_{i\in J} P[A_i] \qquad\text{for every } J \subset \{1, \dots, n\}.
\]
Checking the condition of Definition 3.4(a) for every A_i ∈ G_i is again a tedious task, cf. (2.8). At this level, we may even wonder if independent σ-algebras exist at all. We therefore make a small excursion into measure theory and solve this and similar issues once and for all.


3.2 Dynkin’s lemma

Definition 3.6. A family D of subsets of Ω is called a Dynkin system (or λ-system) when

(λ1) Ω ∈ D

(λ2) A ∈ D =⇒ Ac ∈ D

(λ3) for every sequence (A_i)_{i≥0} of pairwise disjoint elements of D, we have ⋃_{i≥0} A_i ∈ D.

Observe that condition (λ3) is different from the condition (σ3) of the definition of a σ-algebra.

Definition 3.7. A family C of subsets of Ω is called a π-system if it is closed under intersections, i.e. A, B ∈ C implies that also A ∩ B ∈ C.

The following lemma is a useful tool in measure theory.

Lemma 3.8 (Dynkin). Let D be a Dynkin system and C a π-system on Ω with C ⊂ D. Then, D contains the σ-algebra generated by C,

D ⊃ σ(C).

Before proving the lemma, let us see some of its consequences.

Lemma 3.9. Let P , Q be two probabilities on (Ω,A). Then the family

D = {A ∈ A : P(A) = Q(A)}

is a Dynkin system.

Proof. (λ1), (λ2) of Definition 3.6 are trivial to check. (λ3) follows from the σ-additivity of P and Q.

As a corollary of Lemmas 3.8 and 3.9 we obtain:

Lemma 3.10. Let P and Q be as above, and let C be a π-system such that P (C) = Q(C)for every C ∈ C. Then P (B) = Q(B) for every B ∈ σ(C).

Remark 3.11. It is essential that C is a π-system, not an arbitrary generator. To see this, consider Ω = {1, 2, 3, 4}, E = {{1, 2}, {2, 3}} and µ = ½(δ_1 + δ_3), ν = ½(δ_2 + δ_4). Then, it is easy to see that σ(E) = P(Ω) and that µ and ν agree on E ∪ {Ω}. On the other hand, µ ≠ ν on P(Ω).

Lemma 3.10 gives a powerful tool for checking equality of measures, of course only if we know generators which are also π-systems:

Exercise 3.12. (a) Go back to Remark 2.8 and check which of the generators of B(R) are also π-systems.

(b) Simplify the proof of the Lebesgue-Stieltjes theorem (2.15) using the previous lemma.


Proof of Dynkin's lemma (*). We want to show D ⊃ σ(C). To this end we define the Dynkin system generated by C,
\[
\mathcal{D}(\mathcal{C}) = \bigcap_{\substack{\mathcal{D}'\supset \mathcal{C} \\ \mathcal{D}' \text{ is a Dynkin system}}} \mathcal{D}'.
\]

We will show that

(3.13) D(C) = σ(C),

which implies the lemma. The inclusion D(C) ⊂ σ(C) is obvious, as every σ-algebra is a Dynkin system. The inclusion σ(C) ⊂ D(C) will follow if we show

(3.14) D(C) is a σ-algebra.

To see this, we first claim

(3.15) D(C) is closed under intersections.

and use this to prove (3.14).

As a consequence of (3.15), D(C) is also closed under unions (just use A ∪ B = (A^c ∩ B^c)^c, (3.15) and part (λ2) of Definition 3.6). To see that D(C) is closed under countable unions, we fix a sequence (A_n)_{n≥1}, A_n ∈ D(C), and write their union as

(3.16)
\[
\bigcup_{n\ge1} A_n = \bigcup_{n\ge1} (B_n \setminus B_{n-1}),
\]
where the sets B_i, i ≥ 0, are given by
\[
B_0 = \emptyset \qquad\text{and}\qquad B_n = \bigcup_{i=1}^{n} A_i.
\]

As D(C) is closed under unions, B_i ∈ D(C) for every i ≥ 0. Writing B_n \ B_{n−1} as (B_n^c ∪ B_{n−1})^c, it follows also that B_n \ B_{n−1} ∈ D(C) for every n ≥ 1. Finally, using (3.16) and part (λ3) of Definition 3.6, we see that ⋃_{n≥1} A_n ∈ D(C), as claimed. Since Ω ∈ D(C) and D(C) is closed under complements, by definition, we have proved (3.14).

It remains to show (3.15):

Step 1. We first claim

(3.17) A ∈ D(C), B ∈ C =⇒ A ∩B ∈ D(C).

Indeed, to see this define for B ∈ C a family

D_B = {A ⊂ Ω : A ∩ B ∈ D(C)}.

We claim that D_B is a Dynkin system. Indeed,


(1) Ω ∈ DB is obvious.

(2) A ∈ D_B implies that A^c ∩ B = B \ (A ∩ B) ∈ D(C), since B ∈ C, A ∩ B ∈ D(C), and B \ (A ∩ B) can be written as (B^c ∪ (A ∩ B))^c, where B^c ∪ (A ∩ B) is a disjoint union of elements of D(C). It follows that A^c ∈ D_B as well.

(3) Let A_i ∈ D_B, i ≥ 1, be pairwise disjoint. Then (⋃_{i≥1} A_i) ∩ B = ⋃_{i≥1} (A_i ∩ B) ∈ D(C) by disjointness of the A_i ∩ B. Hence ⋃_{i≥1} A_i ∈ D_B.

Since C ⊂ D_B, and D_B is a Dynkin system, we see immediately that D_B ⊃ D(C), which implies (3.17).

Step 2. For A ∈ D(C) we define, similarly as in Step 1,

D_A = {B ⊂ Ω : A ∩ B ∈ D(C)}.

By essentially the same reasoning as in Step 1, it can then be shown that D_A is a Dynkin system. Due to Step 1, D_A ⊃ C, and thus D_A ⊃ D(C). Hence for every A, B ∈ D(C) also A ∩ B ∈ D(C), that is, (3.15) holds. This completes the proof of Dynkin's lemma.

3.3 Elementary facts about the independence

As a consequence of Dynkin's lemma we get:

Theorem 3.18. Let C_1, ..., C_n ⊂ A be π-systems with Ω ∈ C_i for all i ≤ n. Assume that (C_i)_{1≤i≤n} are independent, i.e.

(3.19)  P[C_1 ∩ ··· ∩ C_n] = P[C_1] ··· P[C_n] for all C_i ∈ C_i.

Then, the σ-algebras σ(C1), . . . , σ(Cn) are independent.

Proof. The proof resembles the previous one. We first fix C_2 ∈ C_2, ..., C_n ∈ C_n and consider the family D_1 given by

D_1 = {A ∈ A : P[A ∩ C_2 ∩ ··· ∩ C_n] = P[A] P[C_2] ··· P[C_n]}.

Then, D_1 ⊃ C_1 due to (3.19), and D_1 is a Dynkin system. Indeed, as (λ1), (λ2) are obvious, one only needs to show (λ3). Take (A_i)_{i≥1} pairwise disjoint and A = ⋃_i A_i; then
\[
P[A \cap C_2 \cap \dots \cap C_n] = \sum_{i\ge1} P[A_i \cap C_2 \cap \dots \cap C_n]
\overset{A_i\in\mathcal{D}_1}{=} \sum_{i\ge1} P[A_i]\, P[C_2]\cdots P[C_n]
= P[A]\, P[C_2]\cdots P[C_n],
\]
which implies that A ∈ D_1. Dynkin's lemma now implies that D_1 ⊃ σ(C_1).


To continue, define

D_2 = {A ∈ A : P[D ∩ A ∩ C_3 ∩ ··· ∩ C_n] = P[D] P[A] P[C_3] ··· P[C_n] for all D ∈ σ(C_1)}.

Using the same reasoning as above one shows that D_2 is a Dynkin system and D_2 ⊃ C_2. Thus D_2 ⊃ σ(C_2). The claim of the theorem then follows by repeating the same step n times.

Corollary 3.20. (a) Let A_{i,j} ⊂ A, 1 ≤ i ≤ n, 1 ≤ j ≤ m(i), be independent σ-algebras. Then, the σ-algebras G_i = σ(⋃_{j=1}^{m(i)} A_{ij}), i ≤ n, are also independent.

(b) Let {X_{ij} : 1 ≤ i ≤ n, 1 ≤ j ≤ m(i)} be independent random variables and let f_i : R^{m(i)} → R be measurable functions. Then the Y_i = f_i(X_{i1}, ..., X_{im(i)}), i ≤ n, are independent.

Proof. It is easy to see that (a) implies (b). Indeed, it is sufficient to observe that σ(Y_i) ⊂ G_i := σ(X_{i1}, ..., X_{im(i)}) and that the G_i's are independent by (a).

To prove (a), let C_i be the family of subsets of Ω of the form ⋂_{j=1}^{m(i)} A_{ij} with A_{ij} ∈ A_{ij}. Then the C_i are independent π-systems (Exercise!) and the claim follows by Theorem 3.18.

Independent random variables are naturally related to product measures.

Proposition 3.21. Let X, Y be two independent real-valued random variables on a common probability space (Ω, A, P) with distributions µ_X, µ_Y. Then the image of the measure P under the map ω ↦ (X(ω), Y(ω)), call it ϕ, equals µ_X ⊗ µ_Y, and thus for h : (R², B(R²)) → (R, B(R)) measurable we have

E[|h(X, Y)|] < ∞ iff h is µ_X ⊗ µ_Y-integrable,

and then
\[
E[h(X, Y)] = \int_{\mathbb{R}^2} h(x, y)\, \mu_X(dx)\, \mu_Y(dy).
\]
In particular, E[XY] = E[X]E[Y] when E[|X|] < ∞, E[|Y|] < ∞, and Var[X + Y] = Var[X] + Var[Y].

Proof. We only need to show that µ_X ⊗ µ_Y = ϕ#P. The remaining claims of the proposition then follow using Lemma 2.22 with (X, Y), R², h playing the roles of X, S and g.

To show µ_X ⊗ µ_Y = ϕ#P, let C = {A_1 × A_2 : A_1, A_2 ∈ B(R)}. C is a π-system, and for A = A_1 × A_2 ∈ C we have (µ_X ⊗ µ_Y)(A) = µ_X(A_1) µ_Y(A_2) and (ϕ#P)(A) = P[X ∈ A_1, Y ∈ A_2], which by independence equals P[X ∈ A_1] P[Y ∈ A_2] = µ_X(A_1) µ_Y(A_2). Since σ(C) = B(R²), the claim follows from Lemma 3.10.

Exercise 3.22. Modify the statement of Proposition 3.21 for

(a) X and Y taking values in some measurable spaces (S1,S1), (S2,S2) respectively.

(b) more than two random variables.


Exercise 3.23. (a) Let (X_i)_{i∈I} be a family of real-valued random variables. For every finite set J = {i_1, ..., i_{|J|}} ⊂ I, and for every x ∈ R^{|J|}, define the cumulative distribution function
\[
F_J(x) = P[X_{i_1} \le x_1, \dots, X_{i_{|J|}} \le x_{|J|}].
\]
Then (X_i)_{i∈I} are independent iff for every such J and x
\[
F_J(x) = F_{i_1}(x_1) \cdots F_{i_{|J|}}(x_{|J|}).
\]

(b) When the random variables (X_1, ..., X_n) possess a joint density f : Rⁿ → [0, ∞], then they are independent iff f factorises as f(x_1, ..., x_n) = f_{X_1}(x_1) ··· f_{X_n}(x_n).

(c) Show that the identity E[XY] = E[X]E[Y] does not imply the independence of X and Y. (Random variables that satisfy E[XY] = E[X]E[Y], and thus also Var[X + Y] = Var[X] + Var[Y], are called uncorrelated.)

Exercise 3.24. Show the converse to Proposition 3.21: Let X, Y be two random variables on (Ω, A, P) such that the image of P under ω ↦ (X(ω), Y(ω)) is a product measure. Then X, Y are independent.

We finish this section by extending Definition 3.4 to an arbitrary number of σ-algebras.

Definition 3.25. Let I be an arbitrary index set. A family (A_i)_{i∈I} of σ-algebras on Ω is called independent if (A_i)_{i∈J} is independent in the sense of Definition 3.4 for every finite J ⊂ I.

With this definition, Theorem 3.18 and Corollary 3.20 easily generalise (the condition (3.19) must be replaced by P[⋂_{i∈J} C_i] = ∏_{i∈J} P[C_i] for all C_i ∈ C_i and J ⊂ I finite).

3.4 Borel-Cantelli lemma

In this and the following section we develop two techniques for proving that some 'asymptotic' events occur with probability one.

We start with some notation. Let (A_i)_{i≥1} be a sequence of events on (Ω, A). We define
\[
\limsup_{n\to\infty} A_n = \bigcap_{n\ge1} \Bigl(\bigcup_{m\ge n} A_m\Bigr) = \{\omega\in\Omega : \omega \text{ is contained in infinitely many of the } A_i\text{'s}\},
\]
\[
\liminf_{n\to\infty} A_n = \bigcup_{n\ge1} \Bigl(\bigcap_{m\ge n} A_m\Bigr) = \{\omega\in\Omega : \text{only finitely many of the } A_i\text{'s do not contain } \omega\}.
\]
Observe that 1_{lim sup_n A_n} = lim sup_n 1_{A_n}, and similarly for lim inf.

The next two lemmas are indispensable tools in probability theory.


Lemma 3.26 (First Borel-Cantelli lemma). Let (A_i)_{i≥1} be a sequence of events on a probability space (Ω, A, P). Then
\[
\sum_{i\ge1} P[A_i] < \infty \quad\text{implies}\quad P[\limsup_i A_i] = 0.
\]

Proof. By the monotone convergence theorem,
\[
E\Bigl[\sum_{i=1}^{\infty} 1_{A_i}\Bigr] = \sum_{i=1}^{\infty} P[A_i] < \infty.
\]
Therefore ∑_{i=1}^∞ 1_{A_i} < ∞ P-a.s. and thus P[lim sup_i A_i] = 0.

The next lemma is a partial converse to Lemma 3.26.

Lemma 3.27 (Second Borel-Cantelli lemma). Let (A_i)_{i≥1} be a sequence of independent events on (Ω, A, P). Then ∑_{i≥1} P[A_i] = ∞ implies that P[lim sup_i A_i] = 1.

Remark 3.28. To see why the independence assumption is necessary, consider e.g. Ω = (0, 1), A = B((0, 1)), P the Lebesgue measure, and set A_n = (0, n⁻¹).

Proof of Lemma 3.27. We show that P[(lim sup A_i)^c] = P[lim inf A_i^c] = 0. Since 1 − x ≤ e^{−x}, the independence implies
\[
P\Bigl[\bigcap_{k=m}^{M} A_k^c\Bigr] = \prod_{k=m}^{M} P[A_k^c] = \prod_{k=m}^{M} \bigl(1 - P[A_k]\bigr) \le \exp\Bigl\{-\sum_{k=m}^{M} P[A_k]\Bigr\} \xrightarrow{M\to\infty} 0
\]
due to the assumption. Hence P[⋂_{k≥m} A_k^c] = 0 for all m ≥ 1 and thus P[lim inf A_i^c] = 0 by σ-additivity, as required.

Example 3.29. Let X_n be independent N(0, σ²)-distributed random variables, σ > 0. The second Borel-Cantelli lemma implies (exercise!) that
\[
\limsup_{n\to\infty} X_n = \infty, \qquad P\text{-a.s.}
\]
We now prove a more precise statement,
(3.30)
\[
\limsup_{n\to\infty} \frac{X_n}{\sigma\sqrt{2\log n}} = 1, \qquad P\text{-a.s.}
\]
We first need estimates on the tail of the normal distribution:

We first need estimates on the tail of the normal distribution:

Claim 3.31. Let X be N (0, σ2) distributed. Then for every x > 0

1√2π

(x+

1

x

)−1

e−x2/2 ≤ P [X ≥ xσ] =

∫ ∞x

1√2π

e−y2/2 dy ≤ 1

x√

2πe−x

2/2.


Proof. For the left-hand side observe that $\frac{d}{dy}\bigl(\frac{1}{y} e^{-y^2/2}\bigr) = -\bigl(1 + \frac{1}{y^2}\bigr) e^{-y^2/2}$ and thus
\[
\frac{1}{x} e^{-x^2/2} = \int_x^{\infty} \Bigl(1 + \frac{1}{y^2}\Bigr) e^{-y^2/2}\, dy \le \Bigl(1 + \frac{1}{x^2}\Bigr) \int_x^{\infty} e^{-y^2/2}\, dy.
\]
The right-hand side follows from
\[
\int_x^{\infty} e^{-y^2/2}\, dy \le \frac{1}{x} \int_x^{\infty} y\, e^{-y^2/2}\, dy = \frac{1}{x} e^{-x^2/2}.
\]
This completes the proof of the claim.
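Numerically the two bounds of Claim 3.31 are quite tight already for moderate x; the following sketch (not from the notes; SciPy's norm.sf is used for the exact Gaussian tail) compares them.

```python
import numpy as np
from scipy.stats import norm

for x in (1.0, 2.0, 4.0):
    phi = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density at x
    lower = phi / (x + 1 / x)                        # lower bound of Claim 3.31
    upper = phi / x                                  # upper bound of Claim 3.31
    print(x, lower, norm.sf(x), upper)               # P[X >= x*sigma] = P[Z >= x] = norm.sf(x)
```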

Proof of (3.30). Upper bound. We will apply the first Borel-Cantelli lemma. Let ε > 0 and define A_n = {X_n ≥ (1+ε)σ√(2 log n)}, n ≥ 1. Then, using Claim 3.31,
\[
P[A_n] \le \frac{1}{(1+\varepsilon)\sqrt{2\pi}\sqrt{2\log n}} \cdot e^{-(1+\varepsilon)^2 \log n}
= \frac{1}{(1+\varepsilon)\sqrt{2\pi}\sqrt{2\log n}} \cdot \frac{1}{n^{(1+\varepsilon)^2}},
\]
and thus ∑_{n≥1} P[A_n] < ∞. It follows that P[lim sup_{n→∞} A_n] = 0 and thus, P-a.s., for all large n, X_n ≤ (1+ε)σ√(2 log n), which is equivalent to
\[
\limsup_{n\to\infty} \frac{X_n}{\sigma\sqrt{2\log n}} \le 1 + \varepsilon, \qquad P\text{-a.s.}
\]
As ε is arbitrary, this yields the upper bound.

Lower bound. Let ε ∈ (0, 1) and set B_n = {X_n ≥ (1−ε)σ√(2 log n)}. Then, using Claim 3.31 again,
\[
P[B_n] \ge \frac{1}{\sqrt{2\pi}} \Bigl((1-\varepsilon)\sqrt{2\log n} + \frac{1}{(1-\varepsilon)\sqrt{2\log n}}\Bigr)^{-1} e^{-(1-\varepsilon)^2 \log n}
\ge \frac{1}{n^a} \quad\text{for } n \ge n_0(a, \varepsilon) \text{ and } (1-\varepsilon)^2 < a < 1.
\]
Hence, ∑_{n≥1} P[B_n] = ∞. Since the B_n are independent, the second Borel-Cantelli lemma implies that P[lim sup B_n] = 1. This yields lim sup_{n→∞} X_n/(σ√(2 log n)) ≥ 1 − ε, P-a.s. Observing that ε is arbitrary, the lower bound follows.

Remark 3.32. Knowing (3.30), one may wonder how sup_{k≤n} X_k fluctuates around the value σ√(2 log n). The (elementary) extreme value theory gives the following 'convergence in distribution' result:
\[
P\Bigl[\sqrt{2\log n}\Bigl(\sup_{k\le n} X_k - \sigma\sqrt{2\log n} + \frac{\sigma(\log\log n + \log 4\pi)}{2\sqrt{2\log n}}\Bigr) \le u\Bigr] \xrightarrow{n\to\infty} e^{-e^{-u}}.
\]
The function on the right-hand side is the distribution function of the so-called Gumbel distribution.

Exercise 3.33. (*) Try to prove the claim of the last remark. (It is actually not so difficult.)


3.5 Kolmogorov 0–1 law

Let (X_i)_{i≥1} be a sequence of random variables on (Ω, A, P). For n ≥ 1 we define the σ-algebra T_n describing the 'future after n' of the sequence (X_i)_{i≥1} by

T_n = σ(X_n, X_{n+1}, ...) = the smallest σ-algebra containing σ(X_i) for all i ≥ n.

Obviously, for p > n,

(3.34)  T_p ⊂ T_n and σ(X_n, ..., X_p) ⊂ T_n.

We further define the so-called tail σ-algebra, describing the 'far future' of the sequence (X_i)_{i≥1}, by
\[
\mathcal{T} = \bigcap_{n\ge1} \mathcal{T}_n.
\]

Even if this is not intuitively obvious at first sight, this σ-algebra contains many interesting events. For example, the convergence set of the series ∑_{i≥1} X_i is in T. Indeed, let
\[
\Omega_1 = \Bigl\{\omega\in\Omega : \limsup_{p\to\infty} \sum_{i=1}^{p} X_i(\omega) = \liminf_{p\to\infty} \sum_{i=1}^{p} X_i(\omega)\Bigr\}
\]
be the convergence set of this series. Then, for every n ≥ 1, also
\[
\Omega_1 = \Bigl\{\omega\in\Omega : \limsup_{p\to\infty} \sum_{i=n}^{p} X_i(\omega) = \liminf_{p\to\infty} \sum_{i=n}^{p} X_i(\omega)\Bigr\}.
\]
Hence Ω_1 ∈ T_n for every n ≥ 1 and thus Ω_1 ∈ T.

The definition of the convergence set Ω_1 allows for lim_{n→∞} ∑_{i=1}^n X_i(ω) = ±∞; however, a similar reasoning implies also that {ω : lim_{n→∞} ∑_{i=1}^n X_i(ω) = ∞} ∈ T, or
\[
\Omega_2 = \Bigl\{\omega\in\Omega : -\infty < \liminf_{n\to\infty} \sum_{i=1}^{n} X_i(\omega) = \limsup_{n\to\infty} \sum_{i=1}^{n} X_i(\omega) < +\infty\Bigr\} \in \mathcal{T}.
\]

In the case of independent random variables, the tail σ-algebra is trivial:

Theorem 3.35 (Kolmogorov's 0–1 law). Let (X_i)_{i≥1} be independent random variables on a probability space (Ω, A, P). Then

P[A] ∈ {0, 1} for every A ∈ T.

Corollary 3.36. A sum of independent random variables ∑_{i≥1} X_i is either P-a.s. convergent, or P-a.s. not converging. Formally, P[Ω_1] ∈ {0, 1} and P[Ω_2] ∈ {0, 1} for Ω_1, Ω_2 ⊂ Ω as above.


Proof of Theorem 3.35. Step 1. For fixed n ≥ 1,

(3.37)  P[A ∩ B] = P[A]P[B] for every A ∈ σ(X_1, ..., X_{n−1}) =: A_{n−1} and B ∈ T_n,

that is, A_{n−1} and T_n are independent.

Indeed, (3.37) holds for every A ∈ A_{n−1} and B ∈ σ(X_n, ..., X_p), p > n (by the independence assumption and (3.20)). Moreover, A_{n−1} and ⋃_{p≥n} σ(X_n, ..., X_p) are π-systems and σ(⋃_{p≥n} σ(X_n, ..., X_p)) = T_n. The claim (3.37) then follows from Theorem 3.18.

Step 2.

(3.38)  P[A ∩ B] = P[A]P[B] for every A ∈ σ(X_1, X_2, ...) and B ∈ T,

that is, T and σ(X_1, X_2, ...) are independent.

Indeed, due to (3.37) and T ⊂ T_n, P[A ∩ B] = P[A]P[B] holds for every

A ∈ ⋃_{n≥1} A_n and B ∈ T.

The set systems ⋃_{n≥1} A_n and T are π-systems, and (3.38) follows from Theorem 3.18 and the fact that σ(X_1, X_2, ...) is the smallest σ-algebra containing ⋃_{n≥1} A_n.

Step 3. Let A ∈ T. As T ⊂ σ(X_1, X_2, ...), we see that A ∈ σ(X_1, X_2, ...) as well. Therefore,

(3.39)  P[A] = P[A ∩ A] = P[A]P[A] = P[A]²  (using Step 2).

This yields P[A] ∈ {0, 1}.

Example 3.40 (Percolation). Percolation is one of the most beautiful and yet most challenging models in probability theory. It was introduced by Broadbent and Hammersley (1957) as a model of a disordered porous medium through which a fluid or gas can flow. Since then thousands of papers and many books have been devoted to it.

It can be defined as follows. Let E^d be the edge set of Z^d,

(3.41)  E^d = {{x, y} : x, y ∈ Z^d with |x − y| = 1}.

For a fixed p ∈ [0, 1], let X_e, e ∈ E^d, be a family of i.i.d. random variables on some (Ω, A, P_p) satisfying

(3.42)  P_p[X_e = 1] = 1 − P_p[X_e = 0] = p.

For a given realisation of the X_e's, we declare an edge e open if X_e = 1 and closed otherwise. A natural question to ask about this model is: "Is there, for a P_p-typical configuration, an infinite self-avoiding path using only open edges?" We first need to ask whether this question makes sense, that is, to deal with the measurability. Let C_x be the connected open component containing the vertex x, x ∈ Z^d. Define the events

(3.43)  J_x = {|C_x| = ∞},  I = {ω ∈ Ω : there exists an infinite open component in ω} = ⋃_{x∈Z^d} J_x.


It is easy to see that I and the J_x are events in A. Indeed, let B(n) = [−n, n]^d ∩ Z^d be the box in Z^d and let A_n = {0 is connected by an open path¹ to the set B(n)^c = Z^d \ B(n)}. A_n is an event, since it depends on the state of finitely many edges of E_{B(n)} only. Moreover, since an infinite cluster cannot exist in a finite box, J_0 = ⋂_{n≥0} A_n, and hence J_0 ∈ A. Analogously, the J_x are events for all x ∈ Z^d. Since I = ⋃_{x∈Z^d} J_x, I is an event too.
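For intuition about the event I, one can simulate bond percolation on a finite box and look at the open cluster of the origin. The sketch below (not from the notes; it computes |C_0 ∩ B(n)| for an illustrative box size by breadth-first search) shows how dramatically the cluster size changes as p increases past the critical value 1/2 of Z².

```python
import numpy as np
from collections import deque

def cluster_of_origin(n=60, p=0.55, seed=8):
    """Size of the open cluster of 0 restricted to the box B(n) = [-n, n]^2 (bond percolation)."""
    rng = np.random.default_rng(seed)
    # edge states: horiz[i, j] joins (x, y) and (x, y+1); vert[i, j] joins (x, y) and (x+1, y)
    horiz = rng.random((2 * n + 1, 2 * n)) < p
    vert = rng.random((2 * n, 2 * n + 1)) < p
    seen = {(0, 0)}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        i, j = x + n, y + n                      # array indices of the site (x, y)
        nbrs = []
        if j < 2 * n and horiz[i, j]:
            nbrs.append((x, y + 1))
        if j > 0 and horiz[i, j - 1]:
            nbrs.append((x, y - 1))
        if i < 2 * n and vert[i, j]:
            nbrs.append((x + 1, y))
        if i > 0 and vert[i - 1, j]:
            nbrs.append((x - 1, y))
        for v in nbrs:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

for p in (0.3, 0.5, 0.7):
    print(p, cluster_of_origin(p=p))
```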

Finally, we can come to the application of the 0–1 law.

Proposition 3.44. The probability P_p[I] equals 0 or 1. It has value 0 exactly when θ(p) := P_p[|C_0| = ∞] = 0.

Proof. We first show that

(3.45)  I is σ(X_e : e ∈ E^d \ E_{B(n)})-measurable for all n ≥ 1.

Indeed, let I_n be the event 'the restriction of (X_e)_{e∈E^d} to E^d \ E_{B(n)} contains an infinite cluster'. Of course, I_n ∈ σ(X_e : e ∈ E^d \ E_{B(n)}), by the same proof as for (3.43). The claim (3.45) then directly follows from the following equality:

(3.46)  I = I_n.

To check this, observe first that I_n ⊂ I. Conversely, if ω ∈ I, let C be an infinite open component of ω. Consider the connected components of C \ B(n) induced by ω. If there are at least two such components, any of them must contain at least one vertex neighbouring B(n) (since it must be connected by an open path through B(n) to the remaining components). As B(n) has only finitely many neighbouring vertices, C \ B(n) has finitely many connected components, and thus one of them must be infinite. Hence ω ∈ I_n. This shows I ⊂ I_n, and consequently (3.46) and (3.45).

From (3.45) it follows that I is tail-measurable, that is,

(3.47)  I ∈ T := ⋂_{E ⊂ E^d, E finite} σ(X_e : e ∈ E^d \ E).

From the Kolmogorov 0–1 law we then deduce that P_p(I) = 0 or 1.

1 A path is a finite sequence of neighbouring vertices.


4 Laws of large numbers

4.1 Kolmogorov three series theorem

We consider a sequence (X_i)_{i≥1} of independent random variables on some probability space (Ω, A, P) and are interested in the convergence of the series ∑ X_i. From the previous chapter, recall that the 'convergence set' of this series,
\[
\Omega_2 = \Bigl\{\omega\in\Omega : -\infty < \liminf_{n\to\infty} \sum_{k=1}^{n} X_k(\omega) = \limsup_{n\to\infty} \sum_{k=1}^{n} X_k(\omega) < +\infty\Bigr\},
\]
belongs to T. Therefore, by the Kolmogorov 0–1 law (Theorem 3.35),

P(Ω_2) = 0 or 1.

We now search for a criterion allowing us to decide which of the two possibilities holds. We define the partial sums

(4.1)  S_0 = 0,  S_i = ∑_{j=1}^{i} X_j,  i ≥ 1.

The following lemma will be useful later.

Lemma 4.2 (Kolmogorov's inequality). Let (X_i)_{i≥1} be a sequence of independent random variables satisfying E[X_i] = 0 and E[X_i²] < ∞ for all i ≥ 1. Then
\[
P\Bigl[\max_{1\le k\le n} |S_k| \ge u\Bigr] \le u^{-2}\operatorname{Var}(S_n).
\]

Remark 4.3. Kolmogorov's inequality is a 'maximal inequality'; that means that the maximum max_{k≤n} |S_k| is controlled by the variance of the last term only. Compare it with Chebyshev's inequality, which gives, under the same hypothesis, a control on the tail behaviour of one partial sum only: P[|S_n| ≥ u] ≤ u⁻² Var(S_n).

Proof. We break down the event on the left-hand side according to the time¹ at which |S_k| exceeds u for the first time. Let

A_k = {|S_k| ≥ u, |S_j| < u for all j < k}.

1 We use the language of stochastic processes, and call the index n of S_n time.


Since the A_k's are disjoint²,
\[
E S_n^2 \ge \sum_{k=1}^{n} E\bigl[S_n^2; A_k\bigr]
= \sum_{k=1}^{n} E\bigl[S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2; A_k\bigr]
\ge \sum_{k=1}^{n} E[S_k^2; A_k] + \sum_{k=1}^{n} E\bigl[2S_k 1_{A_k}(S_n - S_k)\bigr].
\]
By the independence assumption, S_k 1_{A_k} and (S_n − S_k) are independent. (Indeed, S_k 1_{A_k} is σ(X_1, ..., X_k)-measurable, and S_n − S_k is σ(X_{k+1}, ..., X_n)-measurable.) Moreover, E[S_n − S_k] = 0. Therefore,
\[
E\bigl[2S_k 1_{A_k}(S_n - S_k)\bigr] = E[2S_k 1_{A_k}]\, E[S_n - S_k] = 0.
\]
As the A_k are disjoint and |S_k| ≥ u on A_k, we thus have
\[
E S_n^2 \ge \sum_{k=1}^{n} E[S_k^2; A_k] \ge \sum_{k=1}^{n} u^2 P[A_k] = u^2 P\Bigl[\max_{1\le k\le n} |S_k| \ge u\Bigr].
\]

This completes the proof.

Exercise 4.4. Assume that (X_i)_{i≥0} are i.i.d., E[X_i] = 0, Var X_i = c < ∞. Prove that S_n/n^p → 0 P-a.s. for every p > 1/2.

As an application of Kolmogorov's inequality (Lemma 4.2) we obtain:

Theorem 4.5. Let (X_i)_{i≥0} be independent with EX_i = 0 and ∑_{i=1}^∞ Var X_i < ∞. Then with probability one ∑_{i=1}^∞ X_i converges, i.e. P(Ω_2) = 1.

Remark 4.6. Since the variance of the X_i is finite, we can view the random variables X_i as elements of the Hilbert space
\[
L^2(\Omega, \mathcal{A}, P) = \Bigl\{X : \Omega\to\mathbb{R} : \|X\|_2^2 := \int_\Omega X^2\, dP < +\infty\Bigr\},
\]
with scalar product ⟨X, Y⟩ = ∫_Ω XY dP = E[XY]. Due to independence, for i ≠ j, ⟨X_i, X_j⟩ = E[X_i X_j] = E[X_i]E[X_j] = 0, that is, the random variables X_i are orthogonal in L²(Ω, A, P). It follows that ∑ ‖X_k‖₂² = ∑ Var X_k < ∞. Using the Pythagoras theorem, for m > n, ‖S_m − S_n‖₂² ≤ ∑_{i=n+1}^m ‖X_i‖₂² → 0 as m, n → ∞, that is, S_n is a Cauchy sequence in L²(Ω, A, P) and thus converges also in L²(Ω, A, P).

Example 4.7. In Example 1.1 we asked whether the series ∑_{k≥1} Z_k/k converges for Z_k i.i.d. with P[Z_k = ±1] = 1/2. We can now give the affirmative answer: this series converges P-a.s., since
\[
\sum_{k\ge1} \operatorname{Var}\Bigl(\frac{Z_k}{k}\Bigr) = \sum_{k\ge1} \frac{1}{k^2} < \infty.
\]

2 For an event B and a random variable Y we use E[Y; B] to denote E[Y 1_B] = ∫_B Y dP.


Proof of Theorem 4.5. Set W_M = sup_{m,n≥M} |S_m − S_n|, and observe that W_M ↘ W_∞ as M → ∞. We will show that

(4.8)  P[W_∞ = 0] = 1.

(4.8) implies that S_k is P-a.s. a Cauchy sequence, which implies directly the claim of the theorem.

To show (4.8), fix ε > 0 and M ≥ 1. Then

sup_{m≥M} |S_m − S_M| ≤ ε implies sup_{m,n≥M} |S_m − S_n| ≤ 2ε.

Therefore,
\[
\begin{aligned}
P[W_M > 2\varepsilon] &\le P\Bigl[\sup_{m\ge M} |S_m - S_M| > \varepsilon\Bigr] \\
&= \lim_{N\to\infty} P\Bigl[\sup_{M\le m\le N} |S_m - S_M| > \varepsilon\Bigr] &&\text{(monotonicity and regularity of } P\text{)} \\
&\le \lim_{N\to\infty} \varepsilon^{-2} \sum_{k=M+1}^{N} \operatorname{Var} X_k &&\text{(Kolmogorov's inequality)} \\
&= \varepsilon^{-2} \sum_{k>M} \operatorname{Var} X_k.
\end{aligned}
\]
Hence, since W_M ↘ W_∞,
\[
P[W_\infty \ge 2\varepsilon] \le P[W_M \ge 2\varepsilon] \le \varepsilon^{-2} \sum_{k\ge M} \operatorname{Var} X_k \xrightarrow{M\to\infty} 0.
\]
From this (4.8) follows and the proof is complete.

The final word on the convergence of series with independent increments is the following theorem.

Theorem 4.9 (Kolmogorov's three series theorem). Assume that (X_i)_{i≥1} are independent random variables. Let, for A > 0, Y_i = X_i 1{|X_i| ≤ A}. Then ∑_{i≥1} X_i converges P-a.s. iff for some A > 0 the following three conditions hold:

(i) ∑_{n=1}^∞ P[|X_n| > A] < ∞,

(ii) ∑_{n=1}^∞ E[Y_n] converges,

(iii) ∑_{n=1}^∞ Var Y_n < ∞.

Exercise 4.10. Use this theorem to show that for X_k = Z_k/k^α with Z_k's i.i.d., P[Z_k = ±1] = 1/2, α > 0, the series ∑_{k>0} X_k converges iff α > 1/2. Observe that the series converges absolutely iff α > 1.


Proof. Sufficiency of (i)–(iii): Let Y'_k = Y_k − EY_k. Then EY'_k = 0 and Var Y'_k = Var Y_k. By (iii) and Theorem 4.5, ∑_{k≥1} Y'_k converges P-a.s. Due to (ii),

(4.11)  ∑_{k≥1} Y_k = ∑_{k≥1} Y'_k + ∑_{k≥1} EY_k converges P-a.s.

By (i) and the Borel-Cantelli lemma, P[lim sup_{k→∞} {|X_k| > A}] = 0, that is, |X_k| > A only for finitely many k's, P-a.s. If this occurs, the sum ∑_{k≥1} X_k converges iff the sum ∑_{k≥1} Y_k converges. This implies the claim.

Necessity of (i)–(iii). Assume that ∑_{i≥1} X_i converges P-a.s. Then condition (i) must be satisfied for every A > 0, because otherwise, by the second Borel-Cantelli lemma, there is some A > 0 such that {|X_n| > A} occurs infinitely often, P-a.s., which is incompatible with the convergence of ∑_{i≥1} X_i. Thus (i) holds with A = 1. It follows that also ∑_{i≥1} Y_i := ∑_{i≥1} X_i 1{|X_i| ≤ 1} converges.

Suppose that we have verified condition (iii); then by Theorem 4.5 the series ∑_{i≥1} (Y_i − EY_i) converges, which together with the convergence of ∑_{i≥1} Y_i implies (ii). Therefore, we only need to verify condition (iii), assuming that ∑_{i≥1} Y_i converges P-a.s., where the Y_i are independent and bounded by 1.

We claim that we can assume without loss of generality that EY_n = 0. Indeed, let (Y'_n)_{n≥1} be an independent copy of (Y_n)_{n≥1}. Then Z_n := Y_n − Y'_n is a sequence of independent random variables bounded by 2, EZ_n = 0, Var Z_n = 2 Var Y_n, and ∑_{i≥1} Z_i = ∑_{i≥1} Y_i − ∑_{i≥1} Y'_i converges P-a.s. Therefore, we only need to verify condition (iii) for (Z_n)_{n≥1}, which have mean zero and are bounded by 2.

To this end write S_n = ∑_{i=1}^n Z_i, σ_n² = Var Z_n. Fix L > 0 and let τ_L = min{n ≥ 0 : |S_n| ≥ L}. By the convergence of S_n we see that lim_{L→∞} P[τ_L = ∞] = 1, and |S_{τ_L}| ≤ L + 2 if τ_L < ∞. Moreover,
\[
E[S_{n\wedge\tau_L}^2] = \sum_{j=1}^{n} E[Z_j^2\, 1\{j\le\tau_L\}] + 2 \sum_{1\le i<j\le n} E[Z_i Z_j\, 1\{j\le\tau_L\}].
\]
Note now that the event {j ≤ τ_L} = {τ_L ≤ j − 1}^c is determined by Z_1, ..., Z_{j−1} and hence independent of Z_j. Therefore,
\[
(L+2)^2 \ge E[S_{n\wedge\tau_L}^2] = \sum_{j=1}^{n} \sigma_j^2\, P[j\le\tau_L] + 2 \sum_{1\le i<j\le n} E[Z_i\, 1\{j\le\tau_L\}]\, E[Z_j]
\ge P[\tau_L = \infty] \sum_{j=1}^{n} \sigma_j^2.
\]
Choosing L large enough so that P[τ_L = ∞] > 0, and letting n → ∞, we obtain ∑_{i≥1} σ_i² < ∞, which implies the necessity of (iii).

Remark 4.12. The last part of the proof is a martingale-type argument. We are going to see more of them later.


4.2 Weak law of large numbers

In the following two sections we consider a sequence (X_i)_{i≥0} and study the conditions under which the normalised sum (1/n)(X_1 + ··· + X_n) := (1/n)S_n converges. Some theorems here should be known, partially with stronger assumptions, from the elementary lecture. We recall the following definitions:

Definition 4.13. We say that a sequence (Y_n)_{n≥1} of random variables converges in probability to a random variable Y (notation Y_n →^P Y) when
\[
\lim_{n\to\infty} P(|Y_n - Y| > \varepsilon) = 0 \qquad\text{for every } \varepsilon > 0.
\]

Exercise 4.14. Prove that convergence in probability is weaker than P-a.s. convergence and than convergence in L^p, p ∈ [1, ∞], that is,

Y_n → Y P-a.s. implies that Y_n →^P Y,
Y_n → Y in L^p(Ω, A, P) implies that Y_n →^P Y.

One distinguishes the 'weak' and the 'strong' law of large numbers. The weak one claims the convergence of (1/n)S_n in probability, the strong one the P-a.s. convergence. The terminology should be understandable in view of the last exercise.

Theorem 4.15 (Weak law of large numbers in L²(Ω)). Let X_i be identically distributed, EX_i² < ∞, and assume that

Cov(X_i, X_j) := E[(X_i − EX_i)(X_j − EX_j)] ≤ c_{|i−j|}

for some sequence c_n ↓ 0. Then n⁻¹S_n converges in L²(Ω, A, P), and thus in probability, to E[X_i].

Proof. By an easy computation,∥∥∥Snn− EX1

∥∥∥2

2= E

[(Snn− EX1

)2]= E

[ 1

n2

( n∑i=1

(Xi − EXi))2]

=1

n2

n∑i,j=1

Cov(Xi, Xj) ≤1

n2

n∑i,j=1

c|i−j|

≤ 2n

n2

n∑i=0

cin→∞−−−→ 0,

where we used the fact that every ci is contained at most 2n times in the double sumon the second line, and the assumption cn ↓ 0. This completes the proof.
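As a quick numerical sketch of Theorem 4.15 (illustration only, the moving-average construction is my choice): take the 2-dependent sequence X_i = (ξ_i + ξ_{i+1} + ξ_{i+2})/3 built from i.i.d. exponential ξ's with mean 1. The covariances vanish for lags larger than 2, so c_n ↓ 0 holds, and the Monte Carlo estimate of E[(S_n/n − EX_1)^2] should shrink roughly like 1/n.

import numpy as np

# Monte Carlo estimate of the L^2 error E[(S_n/n - E X_1)^2] for a
# 2-dependent moving average of i.i.d. exponential(1) variables (mean 1).
rng = np.random.default_rng(1)

def l2_error(n, reps=2000):
    xi = rng.exponential(1.0, size=(reps, n + 2))
    x = (xi[:, :n] + xi[:, 1:n + 1] + xi[:, 2:n + 2]) / 3.0   # X_1, ..., X_n
    means = x.mean(axis=1)                                    # S_n / n
    return np.mean((means - 1.0) ** 2)

for n in [10, 100, 1000, 10000]:
    print(n, l2_error(n))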


4.3 Strong law of large numbers

We now turn our attention to strong laws of large numbers. We prove one first under a rather strong assumption, which makes the proof very short.

Theorem 4.16 (Strong LLN with the fourth moment). Let (X_i)_{i≥1} be i.i.d. random variables satisfying EX_i^4 < ∞. Then

S_n/n → E[X_1]  P-a.s.

Proof. With the help of the simple transformation X_i → X_i − EX_i, we may assume without loss of generality that EX_i = 0. Using Chebyshev's inequality,

(4.17)  P[|S_n| ≥ nε] = P[S_n^4 ≥ n^4 ε^4] ≤ E[S_n^4] / (n^4 ε^4).

Further,

E[S_n^4] = E[(∑_{i=1}^n X_i)^4] = E[∑_{i,j,k,l=1}^n X_i X_j X_k X_l]
= ∑'_{i,j,k,l} E[X_i X_j X_k X_l] + \binom{4}{2} ∑'_{i,j,k} E[X_i X_j X_k^2] + \binom{4}{1} ∑'_{i,j} E[X_i X_j^3] + \binom{4}{2} ∑'_{i,j} E[X_i^2 X_j^2] + ∑_i E[X_i^4],

where ∑' stands for a summation over pairwise distinct indices. Using the independence assumption and E[X_i] = 0, one observes that the first three terms on the right-hand side vanish. Moreover, E[X_i^4] < ∞ implies E[X_i^2] < ∞, and thus

E[S_n^4] ≤ 6n^2 E[X_i^2]^2 + n E[X_i^4] ≤ C n^2.

Inserting this into (4.17) and applying the Borel-Cantelli lemma completes the proof.

Assuming the fourth moment is not necessary. We now prove an optimal result.

Theorem 4.18 (Strong LLN, Etemadi 1981). Let (X_i)_{i≥1} be identically distributed and pairwise independent with E[|X_i|] < ∞. Then

(4.19)  S_n/n → E[X_1]  P-a.s.

Proof. Step 1: Let X_i^+ = X_i ∨ 0 and X_i^- = (−X_i) ∨ 0. The sequences (X_i^+)_{i≥1}, respectively (X_i^-)_{i≥1}, are again identically distributed, pairwise independent, and E|X_i^±| < ∞. If we show (4.19) for X_i^± in place of X_i, then (4.19) follows by linearity. We can thus assume without loss of generality that

X_k ≥ 0 for all k ≥ 1.

Step 2, truncation (unnecessary when E(X_i^2) < ∞). Define truncated random variables Y_k by Y_k = X_k 1{X_k ≤ k}. Set A = lim inf_{k→∞} {X_k = Y_k}. Then P[A] = 1 − P[lim sup{X_k ≠ Y_k}]. In addition,

∑_k P[X_k ≠ Y_k] ≤ ∑_k P[X_k ≥ k] = ∑_k E[1{X_1 ≥ k}] = E[∑_{k≥1} 1{X_1 ≥ k}] ≤ E[X_1] < ∞.

Thus, using the Borel-Cantelli lemma, we see that P[A] = 1. Moreover, on A,

(1/n)(X_1 + ... + X_n) → EX_1 as n → ∞  ⟺  (1/n)(Y_1 + ... + Y_n) → EX_1 as n → ∞.

Hence, (4.19) will follow if we show

(4.20)  T_n/n → EX_1 as n → ∞, where T_0 = 0, T_n = ∑_{k=1}^n Y_k.

Step 3: If for all α > 1

(4.21)  T_{[α^n]}/[α^n] → EX_1  P-a.s. as n → ∞,

then (4.20) follows.

Indeed, let α_M = 1 + 1/M and define

Ω' = ⋂_{M≥1} Ω_M  with  Ω_M = {ω ∈ Ω : T_{[α_M^n]}(ω)/[α_M^n] → EX_1 as n → ∞}.

As P[Ω_M] = 1 for all M by assumption, we have P[Ω'] = 1. For fixed M, let k_n = [α_M^n]. Then for any fixed ω ∈ Ω' and k_n ≤ m < k_{n+1}, using the non-negativity of the Y_k's,

T_{k_n}/k_{n+1} ≤ T_m/m ≤ T_{k_{n+1}}/k_n.

Trivially,

lim_{n→∞} k_{n+1}/k_n = α_M.

Combining the last two displays with the assumption, for ω ∈ Ω' and M ≥ 1,

(1/α_M) E[X_1] ≤ lim inf_{n→∞} T_n(ω)/n ≤ lim sup_{n→∞} T_n(ω)/n ≤ α_M EX_1,

and the claim of Step 3 follows by taking M → ∞.

Step 4, proof of (4.21). Fix α > 1 and define k_n = [α^n]. Then, for ε > 0,

(4.22)  ∑_{n=1}^∞ P[|T_{k_n} − E[T_{k_n}]| > ε k_n]
  ≤ (1/ε^2) ∑_{n=1}^∞ Var T_{k_n} / k_n^2            (Chebyshev)
  = (1/ε^2) ∑_{n=1}^∞ (1/k_n^2) ∑_{m=1}^{k_n} Var Y_m  (pairwise independence)
  = (1/ε^2) ∑_{m=1}^∞ Var Y_m ∑_{n: k_n ≥ m} 1/k_n^2.  (Fubini)

For every n ≥ 1, k_n ≥ α^n/2 (check this). Hence

(4.23)  ∑_{n: k_n ≥ m} 1/k_n^2 ≤ 4 ∑_{n: α^n ≥ m} α^{−2n} = 4α^{−2n_0(m)}/(1 − α^{−2}) ≤ 4/((1 − α^{−2}) m^2),

where n_0(m) is the smallest integer n with α^n ≥ m. Hence, by (4.22) and (4.23),

(4.24)  ∑_{n=1}^∞ P[|T_{k_n} − E[T_{k_n}]| ≥ ε k_n] ≤ (4/(ε^2(1 − α^{−2}))) ∑_{m=1}^∞ EY_m^2 / m^2.

Remark 4.25. When EX_1^2 < ∞, then

∑_{m≥1} EY_m^2/m^2 ≤ ∑_{m≥1} EX_1^2/m^2 < ∞.

Hence, by the Borel-Cantelli lemma and (4.24),

lim sup |T_{k_n}/k_n − ET_{k_n}/k_n| ≤ ε  P-a.s.

Actually, when EX_1^2 < ∞, it is not necessary to introduce the truncated random variables Y_i. Using the same steps as in (4.21)-(4.24) with X_i instead of Y_i, we find that

lim sup |S_{k_n}/k_n − ES_{k_n}/k_n| = lim sup |S_{k_n}/k_n − EX_1| ≤ ε  P-a.s.,

and thus (4.21) holds for the X_i's. ♦

We continue with the proof of (4.21). Inspecting (4.24), it remains to show the following two claims:

(4.26)  ∑_{m≥1} EY_m^2/m^2 < ∞,

(4.27)  n^{−1} ET_n → EX_1 as n → ∞.

Indeed, if these two hold, we obtain using the Borel-Cantelli lemma

P[lim sup_{n→∞} |T_{k_n}/k_n − ET_{k_n}/k_n| ≤ ε] = 1,

which together with (4.27) yields

P[lim sup_{n→∞} |T_{k_n}/k_n − EX_1| ≤ 2ε] = 1,

and thus also

P[⋂_{M≥1} {lim sup_{n→∞} |T_{k_n}/k_n − EX_1| ≤ 1/M}] = 1,

and (4.21) follows.

Step 5, proof of (4.27). Using the definition of the Y_i's we get 0 ≤ EX_k − EY_k = E[X_1 1{X_1 > k}], which by the dominated convergence theorem converges to 0 as k → ∞. Hence EY_k → EX_1 as k → ∞. Finally, as n^{−1}ET_n = n^{−1}(EY_1 + ... + EY_n), this implies that n^{−1}ET_n → EX_1 as n → ∞, as required.

Step 6, proof of (4.26). Using integration by parts and Fubini's theorem we obtain

EY_m^2 = E[2 ∫_0^∞ y 1{Y_m ≥ y} dy] = 2 ∫_0^∞ y P[Y_m ≥ y] dy.

Since Y_m ≤ m and X_m ≥ Y_m, this can be bounded from above by

EY_m^2 ≤ 2 ∫_0^m y P[Y_m ≥ y] dy ≤ 2 ∫_0^m y P[X_m ≥ y] dy.

Therefore, using Fubini's and the monotone convergence theorem,

(4.28)  ∑_{m≥1} EY_m^2/m^2 ≤ 2 ∑_{m≥1} (1/m^2) ∫_0^m y P[X_1 ≥ y] dy = 2 ∫_0^∞ ∑_{m≥1} (y/m^2) P[X_1 ≥ y] 1{y ≤ m} dy.

For y ≥ 2 we have the upper bound

(4.29)  ∑_{m≥y} 1/m^2 ≤ ∑_{m≥y} ∫_{m−1}^m dx/x^2 = ∫_{y−1}^∞ dx/x^2 = 1/(y − 1),

and further ∑_{m≥1} m^{−2} = 1 + ∑_{m≥2} m^{−2} ≤ 1 + 1 = 2. Inserting these into (4.28), and using that 2y ≤ 4 on [0, 2] and y/(y − 1) ≤ 2 ≤ 4 for y ≥ 2,

∑_{m≥1} EY_m^2/m^2 ≤ 2 ∫_0^2 2y P[X_1 ≥ y] dy + 2 ∫_2^∞ (y/(y − 1)) P[X_1 ≥ y] dy
  ≤ 8 ∫_0^∞ P[X_1 ≥ y] dy = 8 E[∫_0^{X_1} dy] = 8 EX_1 < ∞,

and (4.26) follows. This completes the proof of Theorem 4.18.

Remark 4.30. The original proof of the strong law of large numbers (due to Kolmogorov) is less general and only works for i.i.d. random variables. It is based on the three-series theorem. The connection between random series and the law of large numbers is provided by Kronecker's lemma, whose proof can be found e.g. in Durrett, page 65.

Lemma 4.31 (Kronecker). Let (a_n)_{n≥1} and (x_n)_{n≥1} be two sequences of real numbers with a_n ↑ ∞. If ∑_{n≥1} x_n/a_n converges, then (1/a_n) ∑_{k=1}^n x_k → 0 as n → ∞.

Remark 4.32 (necessity of the integrability condition for the LLN). Consider a Cauchy random variable X, that is, a random variable with density f_X(x) = 1/(π(1 + x^2)). The characteristic function of X is given by

E[e^{iλX}] = ∫_{−∞}^∞ e^{iλx} f_X(x) dx = e^{−|λ|}.

Hence, if the X_i are i.i.d. Cauchy, then the characteristic function of their normalised sum n^{−1}S_n = n^{−1}(X_1 + ... + X_n) is

E[e^{iλS_n/n}] = (E[e^{iλX_1/n}])^n = (e^{−|λ/n|})^n = e^{−|λ|}.

Since the characteristic function uniquely determines the probability distribution (we will show this soon), n^{−1}S_n is Cauchy distributed for every n ≥ 1. Hence n^{−1}S_n does not converge to 0, even though this might seem natural given that the density of the Cauchy distribution is even, so that one is tempted to define EX = 0. This is however not correct: the expectation of the Cauchy distribution is not defined, because EX^+ = EX^- = ∞.

Exercise 4.33. If you find the previous remark not convincing enough, you can try to prove the following statement, which shows that the integrability condition in the law of large numbers is not only sufficient but also necessary: if (X_i)_{i≥1} are i.i.d. with E|X_i| = ∞, then

P[|X_n| ≥ n for infinitely many n] = 1.

As a consequence, P[lim n^{−1}S_n exists and is finite] = 0. (Hint: show that E|X_1| ≤ ∑_{n=0}^∞ P[|X_1| > n] and use the second Borel-Cantelli lemma.)

Example 4.34 (Renewal process). We now consider a problem where the strong law of large numbers can be applied. Let (X_i)_{i≥1} be i.i.d. with 0 < X_i < ∞ and set T_n = X_1 + ... + X_n. Think of T_n as the time of the n-th occurrence of some event (such as the times of repairs of a machine). Let

N_t = sup{n : T_n ≤ t}

be the number of events occurring before time t.

Theorem 4.35. If EX_1 = µ < ∞, then

lim_{t→∞} N_t/t = 1/µ,  P-a.s.

Proof. By the strong law of large numbers, n^{−1}T_n → µ as n → ∞, P-a.s. By the definition of N_t, T(N_t) ≤ t < T(N_t + 1), hence

T(N_t)/N_t ≤ t/N_t ≤ (T(N_t + 1)/(N_t + 1)) · ((N_t + 1)/N_t).

To take the limit, note that T_n < ∞ for all n, so that N_t → ∞ as t → ∞, P-a.s. Combining this with the strong law of large numbers, we thus obtain

T(N_t(ω))(ω)/N_t(ω) → µ  and  (N_t(ω) + 1)/N_t(ω) → 1  as t → ∞, for P-a.e. ω.

This completes the proof.
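A small simulation sketch of Theorem 4.35 (the exponential inter-arrival times and the mean µ = 2 are illustrative choices, not part of the notes): the observed ratio N_t/t should approach 1/µ = 0.5 for large t.

import numpy as np

# Renewal process with i.i.d. inter-arrival times X_i ~ exponential(mean mu).
rng = np.random.default_rng(3)
mu = 2.0

def count_renewals(t, rng):
    # N_t = sup{n : T_n <= t}; generate arrivals in blocks until we pass t
    total, n = 0.0, 0
    while True:
        block = rng.exponential(mu, size=1000)
        cum = total + np.cumsum(block)
        if cum[-1] > t:
            return n + int(np.searchsorted(cum, t, side='right'))
        total = cum[-1]
        n += 1000

for t in [10.0, 100.0, 1000.0, 100000.0]:
    print(f"t={t:>9.0f}  N_t/t = {count_renewals(t, rng)/t:.4f}  (1/mu = {1/mu})")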

The last theorem is just a beginning of the so-called 'renewal theory', which proves many finer results about the behaviour of the sequence of renewal times T_n. One of its important results is Blackwell's renewal theorem:

Theorem 4.36. Let the distribution of the X_i's be non-arithmetic (i.e. not supported on {0, δ, 2δ, 3δ, ...} for any δ > 0). Then, for any h > 0, the expected number U(t, t + h) of renewal times in the interval [t, t + h],

U(t, t + h) = ∑_{n≥0} P[T_n ∈ [t, t + h]],

converges to h/µ as t tends to infinity.

4.4 Law of large numbers for triangular arrays

In many situations one is confronted with sums of random variables whose distribution depends on the length of the considered sum. Formally, one is given a triangular array X_{nk}, 1 ≤ k ≤ n, of random variables and is interested in the behaviour of S_n = X_{n1} + ... + X_{nn}.

Theorem 4.37 (Weak LLN for triangular arrays). For each n let (X_{nk})_{1≤k≤n} be independent. Assume that for some sequence b_n ↑ ∞ we have

(i) ∑_{k=1}^n P[|X_{nk}| > b_n] → 0 as n → ∞,

(ii) b_n^{−2} ∑_{k=1}^n E X̄_{nk}^2 → 0 as n → ∞, where X̄_{nk} = X_{nk} 1{|X_{nk}| ≤ b_n}.

Then, with S_n = X_{n1} + ... + X_{nn} and a_n = ∑_{k=1}^n E X̄_{nk},

(S_n − a_n)/b_n → 0 in probability as n → ∞.

Remark 4.38. In typical applications, the rows of the array are defined on different probability spaces. In such a situation it makes no sense to discuss a strong law of large numbers. However, if all random variables are defined on the same probability space, a strong law of large numbers can be obtained using the ideas of the proof of Theorem 4.18.

Proof of Theorem 4.37. Let S̄_n = X̄_{n1} + ... + X̄_{nn}. Clearly,

P[|(S_n − a_n)/b_n| ≥ ε] ≤ P[S_n ≠ S̄_n] + P[|(S̄_n − a_n)/b_n| ≥ ε].

For the first term we note that

P[S_n ≠ S̄_n] ≤ P[⋃_{k=1}^n {X_{nk} ≠ X̄_{nk}}] ≤ ∑_{k=1}^n P[|X_{nk}| > b_n] → 0 as n → ∞,

by assumption (i). For the second term, using the Chebyshev inequality together with a_n = E S̄_n and Var X ≤ EX^2,

P[|(S̄_n − a_n)/b_n| ≥ ε] ≤ ε^{−2} b_n^{−2} Var[S̄_n] = (b_n ε)^{−2} ∑_{k=1}^n Var X̄_{nk} ≤ (b_n ε)^{−2} ∑_{k=1}^n E X̄_{nk}^2 → 0 as n → ∞,

due to assumption (ii). This completes the proof.

Example 4.39 (Coupon collector's problem). Let (X_{ni})_{i≥1} be i.i.d. random variables uniformly distributed on the set {1, ..., n}. We are interested in the first time at which all possible values 1, ..., n have appeared in the sequence (X_{ni})_{i≥1}, that is, in

T_n = inf{m ≥ 1 : for every ℓ ∈ {1, ..., n} there is i ≤ m such that X_{ni} = ℓ}
    = inf{m ≥ 1 : |{X_{n1}, ..., X_{nm}}| = n}.

To this end we define τ_{n0} = 0 and

τ_{nk} = inf{m ≥ 1 : |{X_{n1}, ..., X_{nm}}| = k}

to be the first time when k of the n symbols have been observed; obviously τ_{nn} = T_n. We further set Z_{nk} = τ_{nk} − τ_{n,k−1}, so Z_{nk} is the time we have to wait for the k-th symbol after seeing the (k−1)-th one. Elementary considerations yield that Z_{nk} has geometric distribution with parameter p_{nk} = 1 − (k−1)/n and is independent of Z_{nj}, j < k. Hence E[Z_{nk}] = p_{nk}^{−1} and Var Z_{nk} ≤ p_{nk}^{−2}. It is then straightforward to check the assumptions of Theorem 4.37 with b_n = n log n (Exercise!) and to obtain that

(4.40)  T_n/(n log n) → 1 in probability as n → ∞.
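A simulation sketch of (4.40) (illustration only): sample uniform labels until all n have been seen and compare T_n with n log n.

import numpy as np

# Coupon collector: T_n = first time all n labels have appeared.
rng = np.random.default_rng(4)

def collect(n, rng):
    seen = np.zeros(n, dtype=bool)
    remaining, t = n, 0
    while remaining > 0:
        draws = rng.integers(0, n, size=4 * n)   # draw in blocks for speed
        for d in draws:
            t += 1
            if not seen[d]:
                seen[d] = True
                remaining -= 1
                if remaining == 0:
                    break
    return t

for n in [100, 1000, 10000]:
    ratios = [collect(n, rng) / (n * np.log(n)) for _ in range(5)]
    print(n, [round(r, 3) for r in ratios])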


5 Large deviations

Let X_1, X_2, ... be an i.i.d. sequence of random variables and S_n = X_1 + ... + X_n. From the previous chapter we know that n^{−1}S_n converges to EX_1, whenever the expectation exists.

In this chapter we investigate the rate at which the probability of the 'unusual event' {n^{−1}S_n > u}, for u > EX_1, decays to zero. We will see later that when E[e^{tX}] < ∞ for some t > 0, then this probability decays exponentially, and we will identify the exact exponential rate,

(5.1)  I(u) = − lim_{n→∞} (1/n) log P[S_n ≥ nu].

Remark 5.2. Formula (5.1) can be interpreted as

P[S_n ≥ nu] = exp{−n(I(u) + a_n)}

for some sequence a_n converging to 0 as n → ∞.

5.1 Sub-additive limit theorem

We first develop a rather useful technique allowing us to show the existence of the limit in (5.1), and in many other situations.

Let π_n = P[S_n ≥ nu]. Using the independence of the X_i's, we see that

π_{m+n} = P[S_{m+n} ≥ (m + n)u] ≥ P[S_n ≥ nu, S_{m+n} − S_n ≥ mu] = P[S_n ≥ nu] P[S_m ≥ mu] = π_n π_m.

Defining γ_n = − log π_n, it follows that γ_n ≥ 0 and

(5.3)  γ_{m+n} ≤ γ_m + γ_n for all m, n ≥ 0.

Sequences satisfying (5.3) are called subadditive. Their important property is that they converge (after normalisation):

Lemma 5.4 (Fekete's subadditive lemma). Let γ_n ≥ 0 satisfy (5.3). Then

lim_{n→∞} γ_n/n = inf_{n≥1} γ_n/n.

Proof. Since lim inf γ_n/n ≥ inf γ_n/n, we only need to show that for any m

(5.5)  lim sup γ_n/n ≤ γ_m/m.

Writing n = km + ℓ with 0 ≤ ℓ < m and k ∈ N, and repeatedly using the subadditivity assumption, yields γ_n ≤ kγ_m + γ_ℓ. Dividing by n gives

γ_n/n ≤ (km/(km + ℓ)) (γ_m/m) + γ_ℓ/n.

Claim (5.5) then follows by taking n → ∞, observing that γ_ℓ ≤ max_{0≤j<m} γ_j stays bounded since ℓ < m.

Coming back to our original problem, that is the behaviour of P[S_n ≥ nu], Lemma 5.4 and (5.3) easily imply that the limit in (5.1) exists.

Coming back to our original problem, that is the behaviour of P [Sn ≥ nu], Lemma 5.4and (5.3) imply easily that the limit in (5.1) exists.

5.2 Cramer’s theorem

We are now going to identify the limit in (5.1) (and also give another proof of itsexistence).

Theorem 5.6 (Cramer, Chernov). Let (Xi)i≥1 be an i.i.d. sequence of random variablesand Sn = X1 + · · ·+Xn. Assume that the Laplace transform of Xi is finite, that is

(5.7) ϕ(t) := E[etXi ] <∞ for all t ∈ R,

and that u > EX1. Then

(5.8) limn→∞

1

nlogP [Sn ≥ nu] = −I(u),

where I is given by the Legendre transform of logϕ,

I(a) = supt∈R

(ta− logϕ(t)).

Remark 5.9. Assumption (5.7) is not essential, it only simplifies the proof. It is evennot necessary that EX1 exist. However, in this case the information provided by thetheorem might be not very interesting.

Proof. Without loss of generality we may assume that u = 0 and EX_1 < 0. To see this, consider the random variables Y_i = X_i − u. Then P[S_n ≥ nu] = P[Y_1 + ... + Y_n ≥ 0], and ϕ_Y(t) := E[e^{tY_i}] = e^{−ut}ϕ(t), which implies that

I(u) = sup_t (ut − log ϕ(t)) = sup_t (− log(e^{−ut}ϕ(t))) = sup_t (− log ϕ_Y(t)) =: I_Y(0).

We will also assume that the X_i are non-degenerate (i.e. not a.s. constant), otherwise the claim is trivial. Set ρ = inf_t ϕ(t). Since I(0) = − log ρ, we must show that

(5.10)  lim_{n→∞} (1/n) log P[S_n ≥ 0] = log ρ.

From the definition of ϕ(t) and assumption (5.7), one can easily show that

ϕ ∈ C^∞(R),  ϕ'(t) = E[X_1 e^{tX_1}],  ϕ''(t) = E[X_1^2 e^{tX_1}].

It follows that ϕ is strictly convex and ϕ'(0) = E[X_1] < 0.

We now consider three cases:

(a) The case P[X_1 < 0] = 1. Then ϕ is strictly decreasing and lim_{t→∞} ϕ(t) = 0 = ρ. As P[S_n ≥ 0] = 0, the claim (5.10) follows.

(b) The case P[X_1 ≤ 0] = 1 and P[X_1 = 0] > 0. In this case ϕ is decreasing and lim_{t→∞} ϕ(t) = ρ = P[X_1 = 0]. On the other hand, P[S_n ≥ 0] = P[S_n = 0] = ρ^n by independence. That is, (5.10) holds in this case.

(c) The most interesting case is P[X_1 < 0] > 0 and P[X_1 > 0] > 0. Then lim_{t→∞} ϕ(t) = ∞ and there exists a unique τ > 0 where ϕ is minimised, that is, ϕ(τ) = ρ and ϕ'(τ) = 0.

For the upper bound in (5.10) we use Chebyshev's inequality. As τ > 0 (we could use any t ≥ 0 in place of τ here, but it can be shown that τ is optimal),

P[S_n ≥ 0] = P[e^{τS_n} ≥ 1] ≤ E e^{τS_n} = (ϕ(τ))^n = ρ^n,

and thus

lim sup_{n→∞} (1/n) log P[S_n ≥ 0] ≤ log ρ.

The lower bound in (5.10) is more delicate. In its proof we use a 'change of measure' trick which makes the event {S_n ≥ 0} 'typical'.

We create a new i.i.d. sequence by 'tilting' (the Cramér transform) of the X_i's. Let µ be the distribution of X_i on R. For τ̃ > τ and ρ̃ = ϕ(τ̃) = E[e^{τ̃X_1}], we define a new probability distribution µ̃ by

µ̃(A) = (1/ρ̃) ∫_A e^{τ̃x} µ(dx).

(Check that µ̃ is really a probability distribution!) Let now X̃_i be i.i.d. random variables with distribution µ̃.

We claim that

(5.11)  EX̃_i > 0 and Var X̃_i ∈ (0, ∞).

Indeed, by the definition of X̃_i,

ϕ̃(t) := E e^{tX̃_i} = ∫ e^{tx} µ̃(dx) = (1/ρ̃) ∫ e^{tx} e^{τ̃x} µ(dx) = ϕ(t + τ̃)/ϕ(τ̃) < ∞.

Then, as above, EX̃_1 = ϕ̃'(0) = ϕ'(τ̃)/ϕ(τ̃) > 0 since τ̃ > τ, and Var X̃_i ≤ EX̃_i^2 = ϕ̃''(0) < ∞.

We now rewrite the probability of the event in (5.10) with the help of the new random variables:

(5.12)  P[S_n ≥ 0] = ∫_{x_1+...+x_n≥0} µ(dx_1)...µ(dx_n)
  = ∫_{x_1+...+x_n≥0} (ρ̃ e^{−τ̃x_1} µ̃(dx_1)) ... (ρ̃ e^{−τ̃x_n} µ̃(dx_n))
  = ρ̃^n E[e^{−τ̃ S̃_n} 1{S̃_n ≥ 0}],

where S̃_n = X̃_1 + ... + X̃_n. To estimate the above expectation, for δ ∈ (0, 1),

E[e^{−τ̃ S̃_n} 1{S̃_n ≥ 0}] ≥ E[e^{−τ̃ S̃_n} 1{|S̃_n − ES̃_n| ≤ δ ES̃_n}] ≥ e^{−τ̃(1+δ)ES̃_n} P[|S̃_n − ES̃_n| ≤ δ ES̃_n].

Obviously, ES̃_n = n EX̃_1 and, by the weak law of large numbers, P[|S̃_n − ES̃_n| ≤ δ ES̃_n] → 1 as n → ∞. Hence, taking 'lim inf (1/n) log' in the last display, and then letting δ → 0, yields

(5.13)  lim inf_{n→∞} (1/n) log E[e^{−τ̃ S̃_n} 1{S̃_n ≥ 0}] ≥ −τ̃ EX̃_1.

Finally, combining (5.12) with (5.13), lim inf n^{−1} log P[S_n ≥ 0] ≥ log ρ̃ − τ̃ EX̃_1. Taking τ̃ ↓ τ, we see that ρ̃ → ρ and EX̃_1 = ϕ'(τ̃)/ϕ(τ̃) → 0, as ϕ'(τ) = 0. This yields the required lower bound in (5.10) and completes the proof.

6 Weak convergence of probability measures

In this chapter we discuss the weak convergence of probability measures on arbitrary metric spaces.

6.1 Weak convergence on R

We recall the definition from the elementary lecture:

Definition 6.1. A sequence (µ_n)_{n∈N} of probability distributions on (R, B(R)) converges weakly to the probability distribution µ when

F_{µ_n}(y) → F_µ(y) as n → ∞ for all points of continuity of F_µ.

Here F_{µ_n}(y) = µ_n((−∞, y]) and F_µ(y) = µ((−∞, y]) denote the distribution functions of µ_n and µ, respectively. We then write µ_n →^w µ.

A sequence (X_n)_{n≥1} of real-valued random variables converges in distribution (or in law) to a random variable X when the distributions µ_n of X_n converge weakly to the distribution µ of X. We write X_n →^d X.

Remark 6.2. The random variables X_n in the previous definition do not need to be defined on the same probability space.

Remark 6.3. To see why we do not require the convergence for all y, let µ_n be the centred Gaussian distribution with variance 1/n. Obviously, µ_n →^w δ_0, but F_{µ_n}(0) = 1/2 ≠ 1 = F_{δ_0}(0).

Example 6.4 (De Moivre-Laplace theorem). Consider a sequence (X_i)_{i≥1} of Bernoulli random variables with P[X_i = ±1] = 1/2. Let as usual S_n = X_1 + ... + X_n. An easy combinatorial argument yields the formula

P[S_{2n} = 2k] = \binom{2n}{n+k} 2^{−2n}.

Analysing this using Stirling's formula x! = x^x e^{−x} √(2πx)(1 + o(1)), we find (Exercise!), for k(n) = ⌊x√(2n)/2⌋, x ∈ R,

lim_{n→∞} √(n/2) P[S_{2n}/√(2n) = 2k(n)/√(2n)] = (1/√(2π)) exp{−x^2/2}.

Using the Riemann approximation of an integral with some care, one deduces that

P[S_{2n}/√(2n) ≤ y] → (1/√(2π)) ∫_{−∞}^y e^{−x^2/2} dx as n → ∞,

that is, S_{2n}/√(2n) converges in distribution to a standard normal random variable.

Exercise 6.5 (Exponential and geometric distribution). Let X_p be a geometric random variable with parameter p, that is, P[X_p = k] = (1 − p)^{k−1} p, k ≥ 1. Prove that

pX_p →^d X as p → 0,

where X is an exponential random variable with parameter 1, P[X ≥ y] = e^{−y}.

Exercise 6.6 (Poisson approximation). Let X_n be a binomial random variable with parameters (n, p_n). Assume that np_n → λ ∈ (0, ∞) as n → ∞. Prove that X_n →^d X, where X has Poisson distribution with parameter λ.

Example 6.7. Let µ_n(dy) = ½δ_0(dy) + ½δ_n(dy). Then

F_{µ_n}(y) = ½ 1{y ≥ 0} + ½ 1{y ≥ n} → ½ 1{y ≥ 0} as n → ∞.

Observe that the limit is not a distribution function of any probability distribution, so the µ_n do not converge weakly. As is easy to see, the problem lies in the fact that some mass of µ_n 'disappears to infinity'.

We now study properties of the weak convergence.

Proposition 6.8. The following statements are equivalent:

(i) µ_n →^w µ.

(ii) There are random variables Y and Y_n, n ≥ 1, on a probability space (Ω, A, P) such that µ_n is the distribution of Y_n, µ is the distribution of Y, and

Y_n → Y as n → ∞, P-a.s.

(iii) For every bounded continuous function f : R → R,

∫ f dµ_n → ∫ f dµ as n → ∞.

Proof. (i) ⟹ (ii). We take Ω = (0, 1), A = B((0, 1)), P the Lebesgue measure on (0, 1), and set, for ω ∈ Ω,

Y_n(ω) = sup{y ∈ R : F_{µ_n}(y) < ω},
Y(ω) = sup{y ∈ R : F_µ(y) < ω}.

Then, as in the proof of Theorem 2.15, µ_n is the distribution of Y_n and µ of Y. The functions ω ↦ Y(ω) and ω ↦ Y_n(ω) are increasing. Let further

Ȳ(ω) = inf{y ∈ R : F_µ(y) > ω},
Ω_0 = {ω ∈ Ω : Y(ω) = Ȳ(ω)}.

Observe that for every ω < ω' one has Y(ω) ≤ Ȳ(ω) ≤ Y(ω') ≤ Ȳ(ω'), and thus ω ↦ Y(ω) has a jump at every ω ∈ Ω \ Ω_0. Since Y is increasing, Ω \ Ω_0 is at most countable, and thus P[Ω_0] = 1. It is therefore sufficient to show that

(6.9)  lim_{n→∞} Y_n(ω) = Y(ω) for every ω ∈ Ω_0.

To this end, fix ω ∈ Ω_0. For every continuity point y of F_µ satisfying y < Y(ω), and for n large enough, F_{µ_n}(y) < ω and thus y ≤ Y_n(ω). Hence, y ≤ lim inf Y_n(ω). We can let y ↑ Y(ω), as F_µ has at most countably many jumps, and obtain Y(ω) ≤ lim inf Y_n(ω). On the other hand, if y is a point of continuity of F_µ with y > Y(ω) = Ȳ(ω), then F_µ(y) > ω. Hence, for n large enough, F_{µ_n}(y) > ω and thus Y_n(ω) ≤ y. This implies lim sup Y_n(ω) ≤ y, and letting y ↓ Y(ω) gives lim sup Y_n(ω) ≤ Y(ω); (6.9) follows.

(ii) ⟹ (iii). For a bounded continuous function f,

∫ f dµ_n = E[f(Y_n)] → E[f(Y)] = ∫ f dµ,

by the dominated convergence theorem.

(iii) ⟹ (i). Let y ∈ R be a point of continuity of F_µ. Define g_ε : R → [0, 1] by

g_ε(x) = 1 for x ≤ y,  g_ε(x) = 0 for x ≥ y + ε,  g_ε linear (and continuous) on [y, y + ε].

Obviously g_ε is bounded and continuous. Moreover, using (iii),

F_µ(y + ε) = µ((−∞, y + ε]) ≥ ∫ g_ε(x) µ(dx) = lim_n ∫ g_ε(x) µ_n(dx) ≥ lim sup_n µ_n((−∞, y]) = lim sup_n F_{µ_n}(y),

and thus, letting ε → 0,

lim sup F_{µ_n}(y) ≤ F_µ(y).

Similarly, with h_ε : R → [0, 1] given by

h_ε(x) = 1 for x ≤ y − ε,  h_ε(x) = 0 for x ≥ y,  h_ε linear (and continuous) on [y − ε, y],

using the fact that y is a continuity point of F_µ, we obtain

F_µ(y) ≤ lim inf F_{µ_n}(y),

and the proof is complete.

6.2 Weak convergence on metric spaces

Definition 6.1 and Proposition 6.8 deal with the convergence of distributions on the real line only, but the equivalent statement (iii) of the proposition allows us to define the weak convergence of distributions on an arbitrary metric space (in fact, a topological space would be sufficient here).

Throughout the remaining part of this chapter we let (S, d) stand for a complete metric space and S for the associated Borel σ-algebra, S = B(S). We use C_b(S) to denote the space of continuous bounded functions on S.

Definition 6.10. Let µ and µ_n, n ≥ 1, be probability distributions on (S, S). We say that µ_n converge weakly to µ if

lim_{n→∞} ∫_S f dµ_n = ∫_S f dµ for every f ∈ C_b(S).

Let X and X_n, n ≥ 1, be S-valued random variables with respective distributions µ and µ_n. We say that X_n converge in distribution to X when µ_n →^w µ.

Remark 6.11. There are other natural definitions of convergence of probability distributions on (S, S). One could, for example, require that

lim_{n→∞} µ_n(A) = µ(A) for all A ∈ S,

or, even more restrictively, require that this convergence is uniform,

‖µ_n − µ‖_TV := sup_{A∈S} |µ_n(A) − µ(A)| → 0 as n → ∞.

These modes of convergence are however too strong in practical situations (especially when S is not finite or countable). To see this, consider a sequence x_n of points in S such that lim x_n = x but x_n ≠ x for all n ≥ 1. Then the measures µ_n = δ_{x_n} do not converge to δ_x in either of the previous two senses, but they converge weakly, as can be checked easily.

A similar problem occurs when considering the De Moivre-Laplace theorem, Example 6.4, since the distribution of n^{−1/2}S_n is supported on n^{−1/2}Z, but this set has probability zero under the standard normal distribution.

The following theorem provides many useful conditions that are equivalent to weak convergence.

Theorem 6.12 (Portmanteau). The following are equivalent:

(i) µ_n →^w µ.

(ii) lim_{n→∞} ∫_S f dµ_n = ∫_S f dµ for every f ∈ C_b(S) which is uniformly continuous.

(iii) lim sup_{n→∞} µ_n(F) ≤ µ(F) for all closed sets F ⊂ S.

(iv) lim inf_{n→∞} µ_n(G) ≥ µ(G) for all open sets G ⊂ S.

(v) lim_{n→∞} µ_n(A) = µ(A) for all µ-continuity sets A ∈ S, that is, for all A with µ(∂A) = 0, where ∂A denotes the boundary of A.

Proof. (i) ⟹ (ii). Obvious.

(ii) ⟹ (iii). Let F be a closed set and ε > 0. Define h(s) = ϕ(d(s, F)), where ϕ(x) := max(1 − ε^{−1}x, 0). Then h is uniformly continuous and bounded, h ≡ 1 on F and h ≡ 0 on U_ε(F)^c, where U_ε(F) = {s ∈ S : d(s, F) < ε} is the ε-neighbourhood of F. Hence, 1_F ≤ h ≤ 1_{U_ε(F)}. By (ii),

lim sup_{n→∞} µ_n(F) ≤ lim sup_{n→∞} ∫ h dµ_n = ∫ h dµ ≤ µ(U_ε(F)).

As F is closed, F = ⋂_{k≥1} U_{1/k}(F), and thus, using the continuity of µ from above, µ(U_{1/k}(F)) → µ(F) as k → ∞. Inserting this in the last display, claim (iii) follows.

(iii) ⟺ (iv) follows by taking complements.

(iii) and (iv) ⟹ (v). For a µ-continuity set A with interior A° and closure Ā, µ(A°) = µ(A) = µ(Ā), and thus

µ(A) = µ(A°) ≤ lim inf µ_n(A°) ≤ lim inf µ_n(A) ≤ lim sup µ_n(A) ≤ lim sup µ_n(Ā) ≤ µ(Ā) = µ(A),

using (iv) for the first and (iii) for the last inequality.

(v) ⟹ (i). Fix f ∈ C_b(S) and decompose the range of f so that

[inf_{s∈S} f(s), sup_{s∈S} f(s)] ⊂ ⋃_{j=1}^N [c_j, c_{j+1})

with µ(f = c_j) = 0 and 0 < c_{j+1} − c_j ≤ ε for all 1 ≤ j ≤ N. This is possible since f is bounded and the distribution function of f, t ↦ µ(f ≤ t), has at most countably many discontinuities. The sets A_j = {c_j ≤ f < c_{j+1}} are thus µ-continuity sets (their boundaries are contained in {f = c_j} ∪ {f = c_{j+1}}). Defining g = ∑_{j=1}^N c_j 1_{A_j}, it follows from (v) that ∫ g dµ_n → ∫ g dµ as n → ∞. Moreover, by construction, sup_{s∈S} |g(s) − f(s)| ≤ ε, and thus

lim sup_{n→∞} |∫ f dµ_n − ∫ f dµ| ≤ 2ε.

Since ε and f are arbitrary, this implies µ_n →^w µ.

Sometimes it is useful to prove weak convergence by checking µ_n(A) → µ(A) for a particular class of sets A. (Beware, however, that checking this for all A in some π-system generating S does not imply µ_n →^w µ, cf. Lemma 3.10. We will see examples later.)

Proposition 6.13. Let S_0 ⊂ S be a set system that is closed under finite intersections and such that every open set G ⊂ S can be written as a countable union of sets in S_0. Then lim_{n→∞} µ_n(A) = µ(A) for all A ∈ S_0 implies µ_n →^w µ.


[[TODO: check this proposition, it feels strange]]

Proof. For finite unions, by the inclusion-exclusion principle,

µ_n(⋃_{i=1}^N A_i) = ∑_{k=1}^N (−1)^{k+1} ∑_{1≤i_1<...<i_k≤N} µ_n(⋂_{j=1}^k A_{i_j}).

Hence, when A_i ∈ S_0, as S_0 is closed under intersections,

µ_n(⋃_{i=1}^N A_i) → µ(⋃_{i=1}^N A_i) as n → ∞.

If G is open, then G = ⋃_{i≥1} A_i for some A_i ∈ S_0. Thus, for ε > 0 and N = N(ε) large,

µ(G) − ε ≤ µ(⋃_{i=1}^N A_i) = lim_{n→∞} µ_n(⋃_{i=1}^N A_i) ≤ lim inf_{n→∞} µ_n(G).

As ε is arbitrary, the weak convergence µ_n →^w µ follows from (iv) of Theorem 6.12.

Remark 6.14. When S is separable, we can always construct an S_0 as above; it can even be taken countable. Indeed, let (s_n)_{n≥1} be a sequence which is dense in S, and iteratively construct partitions Z_m = {A_{m,k} : k ∈ N} of S such that diam(A_{m,k}) ≤ 1/m for all k and Z_{m+1} refines Z_m. (To do this, start with the sets A_{m,k} ∩ U_{1/(2(m+1))}(s_n) and make them pairwise disjoint.) Finally, take S_0 = ⋃_{m≥1} Z_m, which is closed under intersections by the construction. Moreover, S_0 is countable and every open G can be written as a countable union of elements of S_0 (Exercise!).

Exercise 6.15. Use the previous remark to show the following statement: let (S, d) be a separable metric space. Then every probability measure µ on (S, S) can be approximated by discrete probability measures, that is, there are µ_n = ∑_{i≥1} c_i^n δ_{x_i^n} with c_i^n ∈ [0, 1], x_i^n ∈ S, such that µ_n →^w µ.

Exercise 6.16. Let (X_n)_{n≥1} be an i.i.d. µ-distributed sequence of S-valued random variables. Consider the empirical distribution of the first n variables,

µ_n(ω) = (1/n) ∑_{k=1}^n δ_{X_k(ω)}, ω ∈ Ω.

(Observe that µ_n is a random probability measure.) Show that

µ_n(ω) →^w µ for P-a.e. ω.
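A numerical sketch of Exercise 6.16 on S = R (illustration only; the normal target law is my choice): on the real line, weak convergence of the empirical measure can be checked through the empirical distribution function at fixed continuity points, which the code below compares with Φ computed via the error function.

import math
import numpy as np

# Empirical measure mu_n of an i.i.d. N(0,1) sample: check F_{mu_n}(y) -> Phi(y)
# at a few continuity points y.
rng = np.random.default_rng(6)

def Phi(y):
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

sample = rng.standard_normal(10**6)
for n in [100, 10000, 10**6]:
    errs = [abs(np.mean(sample[:n] <= y) - Phi(y)) for y in (-1.0, 0.0, 1.5)]
    print(f"n={n:>8}  max error over test points = {max(errs):.4f}")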

6.3 Tightness on R

We characterise sequentially compact sets under weak convergence. The first result concerns probability measures on R only.

Theorem 6.17 (Helly's selection theorem). Let (F_n)_{n≥1} be a sequence of distribution functions of probability measures on R. Then there is a subsequence F_{n(k)} and a right-continuous increasing function F : R → [0, 1] such that

F(y) = lim_{k→∞} F_{n(k)}(y) for every point of continuity y of F.

Remark 6.18. F might not be a distribution function in general, cf. Example 6.7. More precisely, in general it only satisfies 0 ≤ lim_{x→−∞} F(x) ≤ lim_{x→∞} F(x) ≤ 1. Such an F corresponds naturally to a sub-probability measure, i.e. a measure µ on R with µ(R) ≤ 1.

For (sub-)probability measures there is another notion of convergence:

Definition 6.19. A sequence of sub-probability measures ν_n converges vaguely to a sub-probability measure ν if

lim_{n→∞} ∫ f dν_n = ∫ f dν for every continuous function f with compact support.

It is not hard to see, by the same techniques as in Proposition 6.8, that Helly's theorem implies that the space of sub-probability measures on R is vaguely sequentially compact.

Proof. Let Q = {q_1, q_2, ...} be an enumeration of the rationals. Since F_n(q_1) ∈ [0, 1] for all n, there is a subsequence m_1(i), i ≥ 1, such that F_{m_1(i)}(q_1) → G(q_1) as i → ∞. Similarly, F_{m_1(i)}(q_2) ∈ [0, 1] for all i ≥ 1, so there is m_2(i) = m_1(n_2(i)) such that F_{m_2(i)}(q_2) → G(q_2) as i → ∞. Inductively, we obtain m_k(i), k ≥ 1, i ≥ 1, such that m_{k+1}(i) = m_k(n_{k+1}(i)) with n_{k+1} increasing (i.e. m_{k+1} is a subsequence of m_k), and for all k ≥ 1,

lim_{i→∞} F_{m_k(i)}(q_ℓ) = G(q_ℓ), 1 ≤ ℓ ≤ k.

Using the 'diagonal sequence' F_{n(k)} := F_{m_k(k)} we obtain

lim_{k→∞} F_{n(k)}(q_ℓ) = G(q_ℓ) for all ℓ ≥ 1,

that is, F_{n(k)}(t) converges for all t ∈ Q. The function G : Q → [0, 1] is non-decreasing, hence F : R → [0, 1] defined by F(x) = inf{G(q) : x < q ∈ Q} is non-decreasing and right-continuous.

We show that F_{n(k)} and F satisfy the claim of the theorem. Let y be a point of continuity of F, ε > 0, and fix r_1, r_2, s ∈ Q so that r_1 < r_2 < y < s and

F(y) − ε < F(r_1) ≤ F(r_2) ≤ F(y) ≤ F(s) < F(y) + ε.

Then F_{n(k)}(r_2) → G(r_2) ≥ F(r_1) and F_{n(k)}(s) → G(s) ≤ F(s). Hence, for every k large enough,

F(y) − ε ≤ F_{n(k)}(r_2) ≤ F_{n(k)}(y) ≤ F(y) + ε,

and thus F_{n(k)}(y) → F(y) as required.


Our goal is to obtain conditions ensuring compactness for the weak, and not merely the vague, convergence. To this end we must deal with the issue of mass escaping to infinity, cf. Remark 6.18.

Definition 6.20. A family µ_i, i ∈ I, of probability measures on (R, B) is called tight when for every ε > 0 there is M = M(ε) < ∞ such that

sup_{i∈I} µ_i([−M, M]^c) ≤ ε.

Theorem 6.21. Let (F_n)_{n≥1} be a sequence of distribution functions and µ_n, n ≥ 1, the associated probability measures. If (µ_n)_{n≥1} is tight, then every subsequential limit F of the F_n is a distribution function.

Proof. Let F be increasing, right-continuous, with F(y) = lim_{k→∞} F_{n(k)}(y) for every continuity point y of F. Fix ε > 0 and M < ∞ so that µ_n([−M, M]^c) ≤ ε for all n ≥ 1. Let y_1 > M and y_2 < −M be continuity points of F. Then

F(y_2) = lim_k F_{n(k)}(y_2) ≤ lim sup µ_{n(k)}((−∞, −M)) ≤ ε,
F(y_1) = lim_k F_{n(k)}(y_1) ≥ lim inf µ_{n(k)}((−∞, M]) ≥ 1 − ε.

Hence lim_{y→∞} F(y) = 1 and lim_{y→−∞} F(y) = 0, that is, F is a distribution function.

Corollary 6.22. Every tight sequence of probability measures on (R, B(R)) has a weakly convergent subsequence.

6.4 Prokhorov's theorem*

Surprisingly enough, almost all claims of the previous section generalise to arbitrary metric spaces. [[TODO: check]]

Let (S, d) be a metric space with the Borel σ-field S, and let M_1(S) be the family of all probability measures on (S, S). We study the properties of M_1(S) topologised by the weak convergence.

Remark 6.23. The topology corresponding to the weak convergence has the neighbourhood basis

U_{ε,h_1,...,h_n}(µ) = {ν ∈ M_1(S) : |∫ h_i dν − ∫ h_i dµ| < ε, i = 1, ..., n}

with n ∈ N, ε > 0 and h_i ∈ C_b(S).

Definition 6.24. A family M ⊂ M_1(S) is called relatively (weakly sequentially) compact if every sequence of elements of M contains a weakly converging subsequence.

The next definition generalises Definition 6.20.

Definition 6.25. A family M ⊂ M_1(S) is called tight when for every ε > 0 there exists a compact set K = K(ε) ⊂ S such that

µ(K) ≥ 1 − ε for all µ ∈ M.

As on R, tightness and relative compactness are closely related:

Theorem 6.26 (Prokhorov). Let M ⊂ M_1(S).

(a) If M is tight, then it is relatively compact.

(b) If S is complete and separable and M is relatively compact, then M is tight.

Remark 6.27 (Utility of the characterisation of relative compactness). Let µ_n be a sequence of probability measures on (S, S) and assume that for every set A in some π-system C generating S we have µ_n(A) → µ(A). As we know, this is not sufficient to deduce the weak convergence of µ_n. However, assume in addition that (µ_n) is relatively compact. Then it contains a subsequence µ_{n(k)} converging weakly to some ν ∈ M_1(S). Moreover, ν(A) must equal µ(A) for every A ∈ C, which implies µ = ν by Lemma 3.10. Therefore, all weak subsequential limits of µ_n agree, which, by standard arguments, implies that µ_n converges weakly to µ.

Proof of Theorem 6.26. (b): Part (b) of the theorem is less useful, but its proof is simpler, so we start with it. Let G_n be an arbitrary sequence of open sets such that G_n ↑ S. We claim:

(6.28) for every ε > 0 there is n such that µ(G_n) > 1 − ε for all µ ∈ M.

Indeed, if this is not the case, then for every n there is µ_n ∈ M with µ_n(G_n) ≤ 1 − ε. Due to relative compactness there is a subsequence µ_{n(k)} converging weakly to some µ ∈ M_1(S). But then, as G_n ↑ S,

µ(S) = lim_{n→∞} µ(G_n),

and, using the portmanteau theorem and the fact that G_n ⊂ G_{n(k)} for all k with n(k) ≥ n,

µ(G_n) ≤ lim inf_{k→∞} µ_{n(k)}(G_n) ≤ lim inf_{k→∞} µ_{n(k)}(G_{n(k)}) ≤ 1 − ε,

so µ(S) < 1, which is a contradiction.

As S is separable, for every n ∈ N there is a sequence (A_{nk})_{k≥1} of open balls of radius 1/n covering S. Setting G_m = ⋃_{i≤m} A_{ni}, we see from (6.28) that there is m_n such that

µ(⋃_{i≤m_n} A_{ni}) > 1 − ε2^{−n} for all µ ∈ M.

The set

A := ⋂_{n≥1} ⋃_{i≤m_n} A_{ni}

is totally bounded (that is, coverable by finitely many balls of radius δ for every δ > 0) by construction. Therefore K = Ā ⊃ A is compact (here the completeness of S is used). Moreover, by construction,

µ(K^c) ≤ µ(A^c) ≤ ∑_{n≥1} µ((⋃_{i≤m_n} A_{ni})^c) ≤ ∑_{n≥1} ε2^{−n} = ε

for all µ ∈ M, that is, M is tight.

(a): To prove (a) we need some preparation.

Proposition 6.29. When S is compact, then M_1(S) is sequentially compact. (It is even compact, but we will not need this.)

Proof. If S is compact, then C(S) = C_b(S) is separable in the ‖·‖_∞-topology. We can thus take a dense set {h_n : n ∈ N} in C(S) and set

I = ×_{n∈N} [−‖h_n‖_∞, ‖h_n‖_∞],

which is compact in the product topology by Tychonov's theorem, and since it is metrisable (take ρ(u, v) = ∑_k |u_k − v_k|/(2^k‖h_k‖_∞)), it is also sequentially compact.

Define T : M_1(S) → I by

T(µ) = (∫ h_n dµ)_{n∈N}.

This is a homeomorphism of M_1(S) onto T(M_1(S)). Indeed, T is trivially surjective onto its image, and its injectivity follows from the fact that T(µ) = T(ν) implies ∫ h_n dµ = ∫ h_n dν for all n, and since the h_n are dense, also ∫ h dµ = ∫ h dν for all h ∈ C(S), that is, µ = ν. T is continuous, as µ_n →^w µ implies ∫ h_k dµ_n → ∫ h_k dµ for all k and thus T(µ_n) → T(µ). Finally, T^{−1} is continuous, as T(µ_n) → T(µ) implies ∫ h_k dµ_n → ∫ h_k dµ for every k, and thus, as above, ∫ h dµ_n → ∫ h dµ for every h ∈ C(S), that is, µ_n →^w µ.

Since T is a bijection onto its image, and since M_1(S) can be identified with the family L_+(S) of all non-negative normed linear forms on C(S) by Riesz' representation theorem (see measure theory), we can identify T(M_1(S)) with L_+(S). L_+(S) is closed under pointwise convergence of the corresponding forms, and thus T(M_1(S)) is a closed subset of the sequentially compact set I, hence itself sequentially compact.

Finally, since T(M_1(S)) is sequentially compact and T^{−1} is continuous, M_1(S) is sequentially compact as well.

We now return to part (a) of Theorem 6.26. Let M be tight. Then for every n ∈ N there is a compact K_n ⊂ S so that µ(K_n) ≥ 1 − n^{−1} for all µ ∈ M. Set S_0 = ⋃_{n≥1} K_n. Then µ(S_0) = 1 for every µ ∈ M. Since the K_n are compact, S_0 is a separable metric space (as a subspace of S). By Urysohn's embedding theorem, S_0 is homeomorphic to a measurable subset of a compact metric space, i.e. it can be viewed as a subset S_0 ⊂ Ŝ, where Ŝ is a compact metric space.

We now consider M ⊂ M_1(S_0) as a subset of M_1(Ŝ), which is sequentially compact by Proposition 6.29. Every sequence in M thus has an M_1(Ŝ)-weakly convergent subsequence µ_{n(k)} with a limit µ̂ ∈ M_1(Ŝ).

To finish the proof, we need to show that µ̂(S_0) = 1, since then there is µ ∈ M_1(S) agreeing with µ̂ on S_0 such that µ(S \ S_0) = 0, and thus µ_{n(k)} →^w µ in M_1(S). Indeed, by the portmanteau theorem,

(6.30)  µ̂(S_0) ≥ µ̂(K_N) ≥ lim sup_{k→∞} µ_{n(k)}(K_N) ≥ 1 − N^{−1},

and the claim follows by letting N → ∞.

The topology of weak convergence is even metrisable:

Theorem 6.31 (Prokhorov's metric). Let S be a separable metric space, and for µ, ν ∈ M_1(S) define

ρ(µ, ν) = inf{ε > 0 : ν(A) ≤ ε + µ(U_ε(A)) and µ(A) ≤ ε + ν(U_ε(A)) for all A ∈ S}.

Then ρ is a metric on M_1(S) which is compatible with the weak convergence, and (M_1(S), ρ) is a separable metric space. Moreover, if S is complete, then (M_1(S), ρ) is complete as well.

Proof. See e.g. Ethier-Kurtz (1986), Sections 3.1-3.3. [[TODO: possibly write the proof]]

7 Central limit theorem

This chapter mostly recalls, for the sake of completeness, the results known from the elementary probability lecture.

7.1 Characteristic functions

Definition 7.1. Let µ be a probability measure on R. The characteristic function of µ is the function µ̂ : R → C given by

µ̂(t) = ∫_R e^{itx} µ(dx), t ∈ R.

The characteristic function ϕ_X of a random variable X is the characteristic function of its distribution µ_X, that is,

ϕ_X(t) = E[e^{itX}] = µ̂_X(t).

Remark 7.2. In the case when µ possesses a density f, the characteristic function is nothing else than the Fourier transform of f, µ̂(t) = ∫ f(x) e^{itx} dx.

Lemma 7.3 (elementary properties). Let µ be a probability measure on R. Then

(i) µ̂(0) = 1,

(ii) |µ̂(t)| ≤ 1 for all t ∈ R,

(iii) µ̂(−t) is the complex conjugate of µ̂(t) for all t ∈ R,

(iv) µ̂ is uniformly continuous,

(v) when X, Y are independent random variables, then ϕ_{X+Y}(t) = ϕ_X(t) ϕ_Y(t).

Proof. Left as an exercise!

Lemma 7.4 (Characteristic function and moments). Assume that µ has finite k-th absolute moment, that is, ∫ |x|^k dµ < ∞, k ≥ 1. Then µ̂ is k-times differentiable and

(d^l/dt^l) µ̂(t) = ∫ (ix)^l e^{itx} µ(dx), 0 ≤ l ≤ k.

In particular, if X is µ-distributed, then for l ≤ k,

(d^l/dt^l) µ̂(0) = i^l E(X^l).

Proof. Exercise!

Exercise 7.5. Show that the characteristic function of the normal distribution with mean m and variance σ^2 is given by

ϕ_{m,σ^2}(t) = exp{imt − t^2σ^2/2}.

As for the Fourier transform, there is an inversion formula.

Proposition 7.6. For every a < b,

(7.7)  lim_{T→∞} (2π)^{−1} ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) µ̂(t) dt = µ((a, b)) + ½ µ({a, b}).

Proof. The left-hand side equals

(2π)^{−1} ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} µ(dx) dt.

It is not hard to see that the absolute value of the integrand is bounded by b − a. Since µ is a probability measure and [−T, T] a bounded interval, we can thus apply Fubini's theorem to rewrite this as

(2π)^{−1} ∫ {∫_{−T}^T (sin(t(x − a))/t) dt − ∫_{−T}^T (sin(t(x − b))/t) dt} µ(dx).

Define J(α, T) = ∫_{−T}^T (sin(αt)/t) dt. Then the above equals

(2π)^{−1} ∫ (J(x − a, T) − J(x − b, T)) µ(dx).

A little bit of analysis shows that

lim_{T→∞} J(α, T) = π sign α,

where sign α is 1 if α > 0, 0 if α = 0, and −1 if α < 0. Hence,

J(x − a, T) − J(x − b, T) → 2π if a < x < b, π if x ∈ {a, b}, 0 otherwise, as T → ∞.

Moreover, sup_{α,T} |J(α, T)| < ∞, and the result follows by the dominated convergence theorem.

As a corollary we immediately see that the characteristic function determines the probability measure uniquely.

Theorem 7.8 (Uniqueness). Let µ, ν be two probability measures on (R, B(R)). Then µ̂ = ν̂ implies µ = ν.

Exercise 7.9. Use Theorem 7.8 to show that the sum X + Y of two independent Poisson random variables X, Y with respective parameters λ_X, λ_Y has Poisson distribution with parameter λ_X + λ_Y.

The following lemma will be useful for proving tightness.

Lemma 7.10. For every u > 0,

µ({x : |x| ≥ 2/u}) ≤ (1/u) ∫_{−u}^u (1 − µ̂(t)) dt.

Remark 7.11. This is a first statement showing that the tail behaviour of µ is determined by the behaviour of µ̂ near 0.

Proof. For x ≠ 0,

∫_{−u}^u (1 − e^{itx}) dt = 2u − (e^{iux} − e^{−iux})/(ix) = 2u(1 − sin(ux)/(ux)) ≥ 0.

Therefore,

(1/u) ∫_{−u}^u (1 − µ̂(t)) dt = ∫_R (1/u) ∫_{−u}^u (1 − e^{itx}) dt µ(dx)   (Fubini)
  = 2 ∫_R (1 − sin(ux)/(ux)) µ(dx)
  ≥ 2 ∫_{|x|≥2/u} (1 − 1/|ux|) µ(dx) ≥ µ({x : |x| ≥ 2/u}),

where in the first inequality we dropped the non-negative integrand outside {|x| ≥ 2/u}, and in the last step used that 1 − 1/|ux| ≥ 1/2 for |x| ≥ 2/u. This completes the proof.

Theorem 7.12 (Lévy's continuity theorem). Let µ_n, n ≥ 1, be a sequence of probability measures on R.

(i) If µ_n →^w µ, then µ̂_n(t) → µ̂(t) for all t ∈ R.

(ii) If µ̂_n(t) → ϕ(t) for every t ∈ R and the limit ϕ is continuous at 0, then there is a probability measure µ such that µ̂ = ϕ and µ_n →^w µ.

Remark 7.13. The condition 'ϕ is continuous at 0' is essential. To see this, let µ_n be the centred normal distribution with variance n^2. Using Exercise 7.5, we obtain µ̂_n(t) = exp{−t^2n^2/2} → 1_{{0}}(t) as n → ∞. It is evident that the µ_n do not converge weakly to any distribution, but this does not contradict the theorem, as the limit of their characteristic functions is not continuous. The discontinuity at 0 is related to mass escaping to infinity, cf. Example 6.7.

Page 59: Advanced probability theory - univie.ac.at

Proof. (i) eitx is continuous in x, so (i) follows from Proposition 6.8.(ii) We first claim that the sequence µn is tight. Indeed, let ε > 0 and choose u > 0

such thatε

2≥ 1

u

∫ u

−u(1− ϕ(t)) dt

DCT= lim

n→∞

1

u

∫ u

−u(1− µn(t))dt.

Therefore, for n ≥ n0, using Lemma 7.10,

ε >1

u

∫ u

−u(1− µn(t))dt ≥ µn

([− 2

n,

2

n

]c),

proving the tightness.As µn is tight, it has a convergent subsequence µn(k) such that µn(k)

w−→ ν. Due to thepart (i), ν = ϕ, hence the limit does not depend on the taken subsequence, and we candefine µ := ν. Then, using standard arguments, it follows µn

w−→ µ.1

7.2 Central limit theorem

We have now all tools to extend De Moivre-Laplace theorem (Example 6.4 to a broadfamily of probability distributions.

Theorem 7.14 (Central limit theorem). Let (Xi)i≥1 be an i.i.d. sequence of randomvariables with E(X2

i ) <∞. Set m = EXi, σ2 = VarXi > 0 and

Zn :=Sn −mnτ√n

.

Then, as n→∞, Zn converge in distribution to a standard normal random variable.

Proof. (We repeat the arguments from the introductory lecture.) Let Xi = Xi −m andSn = X1 + · · ·+ Xn. Then, using Lemma 7.3(v),

ϕZn(t) = E[

exp it

σnSn

]= ϕX1

( t

σ√n

)n.

Due to Lemma 7.4, ϕX1is twice differentiable and

ϕ′X1

(0) = iEXi = 0

ϕ′′X1

(0) = −E[X2i ] = −σ2.

Using the Taylor expansion

ϕX1(u) = ϕX1

(0)− u2

2(σ2 + ε(u))

1Otherwise, one would find y a continuously print of F (·) = µ((−∞, ·)) and a subsequence n(k) so thatfor all k ≥ 1 |Fn(k)(y)− F (y)| ≥ ε, contradicting the existence of subsequence n(k`) that convergesto µ.

55

Page 60: Advanced probability theory - univie.ac.at

with ε(u)→ 0 as u→ 0, we obtain

ϕZn(t) =(

1− t2

2σ2n

(σ2 + ε

( t

σ√n

)))n= exp

n log

(1− t2

2σ2n

(σ2 + ε(

t

σ√n

)))

n→∞−−−→ exp− t2

2

.

As e−t2/2 is continuous, the claim of the theorem follows from 7.12 and Exercise 7.5.

7.3 Some generalisations of the CLT*

We close this chapter by presenting, partly without proofs, several extensions of the central limit theorem, Theorem 7.14.

In many situations of practical interest, one is confronted with random variables which are independent but not identically distributed, and whose law may depend on n. This is the same situation as in Theorem 4.37, where we proved a law of large numbers for such random variables. We now show the corresponding CLT.

Theorem 7.15 (The Lindeberg-Feller theorem). For each n, let X_{n,k}, 1 ≤ k ≤ n, be independent random variables with EX_{n,k} = 0. Assume

(i) ∑_{k=1}^n EX_{n,k}^2 → σ^2 > 0 as n → ∞,

(ii) for all ε > 0, lim_{n→∞} ∑_{k=1}^n E[|X_{n,k}|^2; |X_{n,k}| ≥ ε] = 0.

Then S_n = X_{n,1} + ... + X_{n,n} converges as n → ∞ in distribution to a centred normal variable with variance σ^2.

Remark 7.16. This theorem actually implies Theorem 7.14. To see this, let X_1, X_2, ... be i.i.d. with EX_i = µ and Var X_i = σ^2 < ∞. Define X_{n,m} = (X_m − µ)/√n. Then E(X_{n,m}^2) = σ^2/n, so (i) holds trivially. For (ii) we need to check that

∑_{m=1}^n E[X_{n,m}^2; |X_{n,m}| ≥ ε] = n E[(X_1 − µ)^2/n; |X_1 − µ| ≥ ε√n] = E[(X_1 − µ)^2; |X_1 − µ| ≥ ε√n] → 0 as n → ∞,

for any ε > 0. This follows from the DCT and the fact that E(X_1 − µ)^2 < ∞.

Proof of Theorem 7.15. The proof ultimately reduces to arguments similar to those in the i.i.d. case, but we have to impose a truncation first. For this we note that (ii) implies that

there exist ε_n ↓ 0 such that K_n := (1/ε_n^2) ∑_{m=1}^n E[X_{n,m}^2; |X_{n,m}| ≥ ε_n] → 0 as n → ∞.

We now set X̄_{n,m} = X_{n,m} 1{|X_{n,m}| ≤ ε_n} and define S̄_n = ∑_{m=1}^n X̄_{n,m}. Since, using the Chebyshev inequality,

P(|S_n − S̄_n| ≥ ε_n) ≤ P[∃m : |X_{n,m}| > ε_n] ≤ ∑_{m=1}^n P[|X_{n,m}| > ε_n] ≤ (1/ε_n^2) ∑_{m=1}^n E[X_{n,m}^2; |X_{n,m}| > ε_n],

and ε_n → 0, Lemma 12.12 implies that the random variables S_n and S̄_n have the same distributional limit, provided one of the limits exists. So we may focus on proving the theorem for S̄_n instead of S_n. [[TODO: move the lemma where it belongs]]

For later use observe that, since ES_n = 0, the triangle and Chebyshev inequalities imply

|ES̄_n| = |E[S̄_n − S_n]| ≤ ∑_{m=1}^n E|X_{n,m} − X̄_{n,m}| ≤ (1/ε_n) ∑_{m=1}^n E[X_{n,m}^2; |X_{n,m}| ≥ ε_n],

and so, using also (i),

(7.17)  ES̄_n → 0 and Var(S̄_n) → σ^2 as n → ∞

(for the variance, note that ∑_m EX̄_{n,m}^2 → σ^2 by (i) and (ii), while ∑_m (EX̄_{n,m})^2 ≤ (∑_m |EX̄_{n,m}|)^2 → 0 by the display above).

We now introduce ϕ_{n,m}(t) = E[e^{it(X̄_{n,m} − EX̄_{n,m})}]. Note that

E[e^{it(S̄_n − ES̄_n)}] = ∏_{m=1}^n ϕ_{n,m}(t),

so, in view of the first part of (7.17), it suffices to show that this converges to e^{−σ^2 t^2/2}. A natural instinct is to take the logarithm and convert the product into a sum, but this is hindered by the fact that the ϕ_{n,m} are complex-valued, and the complex logarithm is not particularly pleasant to work with. We thus replace ϕ_{n,m} by its second-order expansion, which is real-valued, and estimate the error of this approximation.

For the error estimate we need the following simple lemma.

Lemma 7.18. Let z_1, ..., z_n, w_1, ..., w_n ∈ {z ∈ C : |z| ≤ 1}. Then

|∏_{i=1}^n z_i − ∏_{i=1}^n w_i| ≤ ∑_{i=1}^n |z_i − w_i|.

Proof. For n = 1 this holds trivially, and for n > 1 it follows from the induction step

|∏_{i=1}^n z_i − ∏_{i=1}^n w_i| ≤ |z_1| |∏_{i=2}^n z_i − ∏_{i=2}^n w_i| + |z_1 − w_1| |∏_{i=2}^n w_i|,

by bounding both |z_1| and the last product by one, using the assumption.

The error bound will then be provided by:

Lemma 7.19.

∑_{m=1}^n |ϕ_{n,m}(t) − (1 − (t^2/2) Var X̄_{n,m})| → 0 as n → ∞.

Proof. To simplify the notation, write X̃_{n,m} = X̄_{n,m} − EX̄_{n,m}. Note that EX̃_{n,m} = 0 and Var X̄_{n,m} = EX̃_{n,m}^2. By Taylor's theorem (with the remainder in integral form) we have

ϕ_{n,m}(t) = 1 + i·0 − t^2 ∫_0^1 u E[X̃_{n,m}^2 e^{it(1−u)X̃_{n,m}}] du.

Hence

ϕ_{n,m}(t) − (1 − (t^2/2) Var X̄_{n,m}) = t^2 ∫_0^1 u E[X̃_{n,m}^2 (1 − e^{it(1−u)X̃_{n,m}})] du.

We now use the fact that the X̄_{n,m} are truncated (!), so that |X̃_{n,m}| ≤ 2ε_n. This yields, for n large enough that |t|ε_n ≤ π/2, the uniform bound

|1 − e^{it(1−u)X̃_{n,m}}| = 2|sin(t(1−u)X̃_{n,m}/2)| ≤ max_{0≤x≤|t|ε_n} 2 sin x,

which does not depend on m. Using this we obtain

∑_{m=1}^n |ϕ_{n,m}(t) − (1 − (t^2/2) Var X̄_{n,m})| ≤ (t^2/2) max_{0≤x≤|t|ε_n} 2 sin x · ∑_{m=1}^n EX̃_{n,m}^2.

Since EX̃_{n,m}^2 ≤ EX̄_{n,m}^2 ≤ EX_{n,m}^2, condition (i) ensures that the sum is bounded in n. The maximum tends to zero, and so the result follows.

To conclude the proof of the theorem, notice that Var X̄_{n,m} ≤ EX̄_{n,m}^2 ≤ ε_n^2. Hence, by the last two lemmas,

|∏_{m=1}^n ϕ_{n,m}(t) − ∏_{m=1}^n (1 − (t^2/2) Var X̄_{n,m})| → 0 as n → ∞.

Moreover, for n large enough so that ε_n ≤ 1 and δ_n := ½ t^2 ε_n < 1, we can write

∏_{m=1}^n (1 − (t^2/2) Var X̄_{n,m}) = exp{∑_{m=1}^n log(1 − (t^2/2) Var X̄_{n,m})}.

Using the bound |log(1 + z) − z| ≤ |z|^2/(1 − |z|), valid for |z| < 1, then yields

|∑_{m=1}^n log(1 − (t^2/2) Var X̄_{n,m}) + (t^2/2) ∑_{m=1}^n Var X̄_{n,m}| ≤ (δ_n/(1 − δ_n)) (t^2/2) ∑_{m=1}^n Var X̄_{n,m}.

The sum of the variances is Var S̄_n, which tends to σ^2 by (7.17). Putting everything together and using δ_n → 0 then implies that

E[e^{it(S̄_n − ES̄_n)}] → e^{−σ^2 t^2/2} as n → ∞, for every t ∈ R.

Since ES̄_n → 0, this completes the proof.

Theorem 7.14 does not give any information about the speed of convergence of the law of S_n/(σ√n) to the standard normal distribution. Under some additional assumptions, this rate can be estimated. (For the proof see [Dur10], Theorem 3.4.9.)

Theorem 7.20 (Berry-Esseen). Let (X_i)_{i≥1} be i.i.d. with EX_i = 0, EX_i^2 = σ^2, and E|X_i|^3 = ρ < ∞. If F_n is the distribution function of S_n/(σ√n) and N the distribution function of the standard normal distribution, then

|F_n(x) − N(x)| ≤ 3ρ/(σ^3 √n), n ≥ 1, x ∈ R.
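The Berry-Esseen bound is easy to test numerically in a case where F_n can be computed exactly. The sketch below (illustration only) takes fair coin steps X_i = ±1, so σ = ρ = 1, evaluates the exact distribution function of S_n/√n from binomial probabilities, and compares sup_x |F_n(x) − N(x)| with 3ρ/(σ^3 √n) = 3/√n.

import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(n):
    # S_n = 2k - n with k ~ Binomial(n, 1/2); F_n is a step function with
    # jumps at x_k = (2k - n)/sqrt(n), so the sup is attained at jump points.
    pmf = [math.comb(n, k) * 0.5**n for k in range(n + 1)]
    sup, cdf = 0.0, 0.0
    for k in range(n + 1):
        x = (2 * k - n) / math.sqrt(n)
        sup = max(sup, abs(cdf - Phi(x)))        # just below the jump
        cdf += pmf[k]
        sup = max(sup, abs(cdf - Phi(x)))        # at the jump
    return sup

for n in [10, 100, 1000]:
    print(f"n={n:>5}: sup|F_n - N| = {sup_distance(n):.4f},"
          f"  Berry-Esseen bound 3/sqrt(n) = {3/math.sqrt(n):.4f}")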

The central limit theorem allows us to compute (for large n) the probability that S_n ∈ (a√n, b√n), a < b. It does not, however, give any control over the probability that S_n lies in intervals much smaller than √n. This control is provided by the last theorem of this chapter.

Theorem 7.21 (Local CLT). Assume that (X_i)_{i≥1} are i.i.d. with EX_i = 0, EX_i^2 = σ^2 ∈ (0, ∞), and a common characteristic function ϕ(t) satisfying |ϕ(t)| < 1 for all t ≠ 0. Then, if x_n/(σ√n) → x and a < b,

√n P[S_n ∈ (x_n + a, x_n + b)] → (b − a) (1/(σ√(2π))) e^{−x^2/2} as n → ∞.

The condition |ϕ(t)| < 1 for all t ≠ 0 ensures that the distribution of the X_i is non-lattice, that is, there are no c > 0 and d such that P[X ∈ cZ + d] = 1. Requiring the non-lattice property is necessary for the local central limit theorem to hold in this form; think about what happens when b − a < c. For the proof of Theorem 7.21, see [Dur10], Theorem 3.5.3.

8 Conditional expectation

We introduce here the concept of conditional expectation. As its definition is slightly abstract, we start with a few examples.

Example 8.1. Consider two independent random variables X, Y on a probability space (Ω, A, P) having Poisson distribution with respective parameters λ_X and λ_Y, and set S = X + Y.

From elementary probability theory recall that the conditional probability of an event A given an event B with P[B] > 0 is given by

P[A|B] = P[A ∩ B]/P[B].

Using this formula, it is easy to show that for every 0 ≤ k ≤ n,

P[X = k | S = n] = \binom{n}{k} p^k (1 − p)^{n−k},

where p = λ_X/(λ_X + λ_Y). Hence, given S, X has the binomial distribution with parameters (S, p), and thus

E[X | S = n] = ∑_{k∈N} k P[X = k | S = n] = np.

The random variable Z = pS can thus be viewed as the 'expectation of X given S'. This is written as

Z = E[X|S] or Z = E[X|σ(S)].

Observe that Z is σ(S)-measurable. Moreover, for any event C ∈ σ(S) we may use Z to compute E[1_C X] without knowing the joint distribution of (X, Y). To see this, let C = {S ∈ A} ∈ σ(S) for some A ⊂ N. Then

E[1_C X] = E[1_{S∈A} X] = ∑_{n∈A} E[1_{S=n} X] = ∑_{n∈A} E[X | S = n] P[S = n] = ∑_{n∈A} pn P[S = n] = E[1_C Z].
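A quick Monte Carlo sketch of Example 8.1 (the parameter values λ_X = 2, λ_Y = 3 are illustrative): grouping samples by the observed value of S, the average of X within each group should be close to pn with p = λ_X/(λ_X + λ_Y).

import numpy as np

# Check E[X | S = n] = p*n for independent Poissons, p = lam_x/(lam_x+lam_y).
rng = np.random.default_rng(8)
lam_x, lam_y = 2.0, 3.0
p = lam_x / (lam_x + lam_y)

x = rng.poisson(lam_x, size=10**6)
y = rng.poisson(lam_y, size=10**6)
s = x + y

for n in [2, 5, 8]:
    mask = (s == n)
    print(f"S={n}: empirical E[X|S={n}] = {x[mask].mean():.3f},  p*n = {p*n:.3f}")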

Example 8.2. Being just a little more abstract, let (Ω, A, P) be a probability space, B = (B_1, ..., B_n) a measurable partition of Ω, and X an integrable random variable.

For every B_i with P[B_i] > 0, the mapping A ↦ P[A|B_i], A ∈ A, defines a probability measure on Ω. Let E[X|B_i] be the expectation of X with respect to this measure (check that it is finite!).

We may use the numbers E[X|B_i] to define a random variable (!)

E[X|B](ω) := ∑_{i≤n: P[B_i]>0} E[X|B_i] 1_{B_i}(ω).

The value of this random variable is determined as soon as we know which element of the partition B is realised; otherwise said, E[X|B] is σ(B)-measurable.

A calculation very similar to the one in the previous example then shows that for any σ(B)-measurable event C,

E[X 1_C] = E[E[X|B] 1_C].

Example 8.3. Assume that the random variables X and Y have a joint density f(x, y) > 0, x, y ∈ R, and E|X| < ∞. It seems natural to define the conditional density of X given Y by

f(x|y) = f(x, y) / ∫ f(u, y) du, x, y ∈ R,

and the conditional expectation of X given Y as the random variable Z = E[X|Y] = ϕ(Y) with ϕ(y) = ∫ x f(x|y) dx.

Obviously, Z is again σ(Y)-measurable, and for any C = {Y ∈ A} ∈ σ(Y),

E[X 1_C] = ∫∫ x 1_A(y) f(x, y) dy dx = ∫ ϕ(y) 1_A(y) (∫ f(u, y) du) dy = E[ϕ(Y) 1_A(Y)] = E[Z 1_C],

where the inner integral ∫ f(u, y) du is the density of Y.

In view of these three examples, it now seems completely natural to introduce the following definition.

Definition 8.4. Let X be an integrable random variable on a probability space (Ω, A, P) and let G ⊂ A be a sub-σ-algebra of A. The conditional expectation of X given G, denoted E[X|G], is any random variable Y such that

(i) Y is G-measurable,

(ii) for every A ∈ G, E[X 1_A] = E[Y 1_A].

Since E[X|G] is defined through an integral equality, any random variable Z with Z = Y P-a.s. is also a conditional expectation of X given G. This is, however, the only 'non-uniqueness' issue:

Theorem 8.5. The conditional expectation E[X|G] exists and is unique up to P-null equivalence. Moreover, if X ≥ 0, then E[X|G] ≥ 0 P-a.s.

Proof. Existence. We assume first that X ≥ 0. From the lecture 'Measure Theory' recall the Radon-Nikodym theorem:

Theorem 8.6 (Radon-Nikodym). Let µ, ν be σ-finite measures on (Ω, A). If ν is absolutely continuous with respect to µ (that is, for every A ∈ A, µ(A) = 0 implies ν(A) = 0; notation ν ≪ µ), then there is an A-measurable non-negative function f such that

ν(A) = ∫_A f dµ for all A ∈ A.

The function f is called the Radon-Nikodym density of ν with respect to µ and is denoted by f = dν/dµ.

To prove the existence of the conditional expectation, define a new measure Q on (Ω, A) by

Q(A) = ∫_A X dP = E[X 1_A], that is, dQ/dP = X.

Let Q̄ and P̄ be the restrictions of Q and P to the σ-algebra G. Since X is integrable, Q is a finite measure. Moreover, for G ∈ G with P̄[G] = 0 we have P[G] = P̄[G] = 0 and thus also 0 = Q[G] = Q̄[G]. It follows that Q̄ ≪ P̄, and by the Radon-Nikodym theorem applied to Q̄, P̄ and (Ω, G), there is a non-negative G-measurable random variable Z such that for all G ∈ G,

E[X 1_G] = Q̄[G] = ∫_G Z dP̄ = ∫_G Z dP = E[Z 1_G].

Hence Z satisfies the conditions of Definition 8.4 and is thus a conditional expectation of X given G. In addition, this proves the last claim of the theorem.

For a general random variable X, we write X = X^+ − X^- with X^+ = max(X, 0), X^- = max(−X, 0). Using the above construction, we obtain random variables Z^+ and Z^- that are conditional expectations of X^+, respectively X^-, given G, and set Z = Z^+ − Z^-. Using the linearity of the expectation, it is easily checked that Z is a conditional expectation of X given G.

Uniqueness. Let Z_1 and Z_2 both satisfy the conditions of Definition 8.4. Set D = Z_1 − Z_2. Then D is G-measurable and for every G ∈ G, E[D 1_G] = E[Z_1 1_G] − E[Z_2 1_G] = E[X 1_G] − E[X 1_G] = 0. Hence, E[|D|] = E[D 1_{D>0}] − E[D 1_{D<0}] = 0 (both events being in G), so D = 0 P-a.s. and the claim follows.

Remark 8.7. The usual expectation E[X] of a random variable X can be interpreted as the 'best guess' of X without having any information about the outcome of the random experiment. In a similar vein, the conditional expectation of X given G can be interpreted as the best guess of X given the 'information contained in the σ-algebra G'. This is best understood from Example 8.2, where the information contained in G is simply the information about which element of the partition B is realised.

Example 8.8. (i) For G = A, we see that E[X|A] = X. That is, if we know 'everything', then the best guess of X is X itself.

(ii) If G = {∅, Ω}, then E[X|G] = E[X]. The best guess of X without any additional information is E[X].

(iii) If X and G are independent (i.e. σ(X) and G are independent), then for G ∈ G, E[X 1_G] = E[X] P[G], and thus again E[X|G] = E[X].

Lemma 8.9 (Simple properties of conditional expectation). Assume E|X|, E|Y| < ∞.

(i) Conditional expectation is linear:

E[aX + bY | G] = a E[X|G] + b E[Y|G].

(ii) If X ≤ Y, then E[X|G] ≤ E[Y|G], P-a.s.

(iii) If X_n ≥ 0 and X_n ↑ X, then

E[X_n|G] ↑ E[X|G], P-a.s.

Proof. The linearity is obvious from the definition. The monotonicity follows directly from the last claim of Theorem 8.5, using the linearity. Finally, for (iii), let Y_n = X − X_n. It then suffices to show that E[Y_n|G] ↓ 0. Since Y_n is decreasing, (ii) implies that Z_n = E[Y_n|G] is decreasing P-a.s., and thus a limit Z_∞ exists, P-a.s. again. If G ∈ G, then E[Z_n 1_G] = E[Y_n 1_G]. Since 0 ≤ Y_n ≤ X, we see using the dominated convergence theorem that E[Z_∞ 1_G] = 0. The same argument as in the uniqueness part of the proof of Theorem 8.5 then implies that Z_∞ = 0 P-a.s., completing the proof.

Lemma 8.10 (Jensen's inequality). Let ϕ : R → R be a convex function and X a random variable such that E|X| < ∞ and E|ϕ(X)| < ∞. Then

ϕ(E[X|G]) ≤ E[ϕ(X)|G], P-a.s.

Proof. For ϕ(x) = ax + b, the claim holds (with equality) by linearity. A general convex function ϕ can be written as ϕ(x) = sup_{n≥0} ϕ_n(x) with ϕ_n(x) = a_n x + b_n being 'tangents of ϕ at rational points'. Hence, P-a.s., for every n,

E[ϕ(X)|G] ≥ E[ϕ_n(X)|G] = ϕ_n(E[X|G]).

Taking the supremum over n, we obtain

E[ϕ(X)|G] ≥ sup_n ϕ_n(E[X|G]) = ϕ(E[X|G]),

and the proof is complete.

Corollary 8.11. Let X ∈ L^p(Ω, A, P), p ∈ [1, ∞]. Then E[X|G] ∈ L^p(Ω, G, P) and

‖E[X|G]‖_p ≤ ‖X‖_p,

that is, the conditional expectation is a contraction on L^p.

Proof. For p ∈ [1,∞) this follows easily from Jensen’s inequality. For p = ∞, observethat −‖X‖∞ ≤ X ≤ ‖X‖∞ and use the monotonicity of the conditional expectation.

Proposition 8.12. If X is G measurable and E|X|, E|Y | <∞, then

E[XY |G] = XE[Y |G], P -a.s.

Proof. The right hand side of the claim is G-measurable, so we only need to check (ii) ofDefinition 8.4. Assume first that Y ≥ 0 and X = 1B for some B ∈ G. Then for G ∈ G

E[E[XY |G]1G] =

∫A∩G

E[Y |G]dP =

∫A∩G

Y dP = E[XY 1G]

and thus (ii) of Definition 8.4 holds for such X and Y . We then continue as usual. ForX simple (i.e. for finite linear combinations of indicator functions), the claim extendsby linearity, for general X ≥ 0 by monotone convergence (Lemma 8.9(iii)). Finally, forgeneral X and Y we write X = X+−X− and Y = Y +−Y − and apply the linearity.

Exercise 8.13. Use the proposition to verify Example 8.8(i).

Proposition 8.14 (the smaller σ-algebra wins). Let G1 ⊂ G2 ⊂ A be σ-algebras. Then,P -a.s.,

(i) E[E(X|G1)|G2] = E(X|G1).

(ii) E[E(X|G2)|G1] = E(X|G1).

Proof. (i) Since G1 ⊂ G2, the random variable E(X|G1) is G2-measurable and the claimfollows from Proposition 8.12.

(ii) Let G ∈ G1 ⊂ G2. Then

E[E[X|G2]1G] = E[X1G] = E[E[X|G1]1G].

Hence, E[X|G1] satisfies the conditions of Definition 8.4 for Y = E[X|G2].

Exercise 8.15. Use the proposition to verify Example 8.8(ii).

The next theorem provides another interpretation of the conditional expectation inthe case of square integrable random variables.

Theorem 8.16. Assume that X ∈ L2(Ω,A, P ), G ⊂ A. Then E[X|G] is the orthogonalprojection of X from L2(Ω,A, P ) on L2(Ω,G, P ).

Proof. Start by observing that if Z ∈ L2(Ω,G, P ), then by Proposition 8.12 ZE[X|G] =E[ZX|G] (the right-hand side is finite by the Cauchy-Schwarz inequality). Hence

(8.17) E[ZE[X|G]] = E[E[ZX|G]] = E[ZX],

using Proposition 8.14.Let now Y ∈ L2(Ω,G, P ) and set Z = Y − E[X|G]. Then,

E[(X−Y )2] = E[(X−E(X|G)−Z)2] = E[(X−E(X|G))2]+E[Z2]+2E[Z(X−E(X|G))].

The last term vanishes due to (8.17) and thus E[(X − Y )2] ≥ E[(X − E(X|G))2] withequality when Z = 0 that is Y = E(X|G). This completes the proof.

64

Page 69: Advanced probability theory - univie.ac.at

8.1 Regular conditional probabilities*

Consider a probability space (Ω,A, P ) together with a σ-algebra G ⊂ A and a randomvariable X taking values in some measurable space (S,S). For every B ∈ S, the indicator1X ∈ B is a bounded random variable, so that the conditional expectation

(8.18) E[1X ∈ B|G] =: P [X ∈ B|G]

exits, and is a [0, 1]-valued, G-measurable function on Ω.It is natural to view P [X ∈ B|G] as a ‘map’

Ω× S → [0, 1]

(ω,B) 7→ P [X ∈ B|G].

Beware that this map is not uniquely defined since the conditional expectations in (8.18)are determined only P -a.s. Also, even if the notation suggests it, it is not a priory clearthat for fixed ω ∈ Ω, the map B 7→ P [X ∈ B|G](ω) is a probability measure on S. Sothe question is:

Is there a version of P [X ∈ · |G](ω)which is a probability measure for every ω ∈ Ω?

To see that this is not a trivial issue, let us explain where the problem is. If A1, A2, . . .are disjoint elements of S, it is easy to see from Lemma 8.9(i,iii) that

P [X ∈ ∪n≥1An|G] =∑n≥1

P [X ∈ An|G], P -a.s. (!)

However, the null set where this relation fails to hold may depend on the sequenceA1, A2, . . . . If we require P [X ∈ · |G](ω) to be probability measure, then this relationshould hold for all disjoint collections A1, A2, . . . . But the space S typically containsuncountably many countable disjoint collections, so the exceptional sets may pile up.

To deal with this problem we introduce:

Definition 8.19. Let (S1,S1), (S2,S2) be two measurable spaces. A map κ : S1×S2 →[0, 1] is called stochastic kernel if

(i) for every A ∈ S2, the map S1 3 x 7→ κ(x,A) is S1-measurable.

(ii) fore every x ∈ S1, the map S2 3 A 7→ κ(x,A) is a probability measure on (S2,S2).

We write∫f(y)κ(x, dy) for the integral with respect to the probability measure κ(x, ·).

This definition allows us to formalize our requirements on the ‘regularity’ of the con-ditional distribution P [X ∈ · |G].

Definition 8.20. A regular conditional distribution of X given G is a stochastic kernelκ from (Ω,G) to (S,S) so that κ( · , B) is for every B ∈ S a version of the conditionalexpectation E[1X ∈ B|G], that is for every B ∈ S

P [X ∈ B|G] = E[1X ∈ B|G](ω) = κ(ω,B) for P -a.e. ω.

65

Page 70: Advanced probability theory - univie.ac.at

We show that regular conditional distributions exist on ‘nice’ measurable spaces.

Definition 8.21. A Borel space is a measurable space (S,S) for which there is a A ∈B(R) and a bijection ϕ : S → A such that both ϕ and ϕ−1 are measurable.

Most of the measurable spaces one encounters are Borel, in particular every Polishspace endowed with the corresponding Borel σ-algebra is.

Theorem 8.22. Let (S,S) be a Borel space, (Ω,A, P ) a probability space, X : Ω → Sa random variable, and G ⊂ A a σ-algebra. Then there exists a regular conditionaldistribution of X given G.

Proof. See [Dur10], page 198. [[TODO: Type this?]]

66

Page 71: Advanced probability theory - univie.ac.at

9 Martingales

Up to now we were (mostly) studying properties of independent random variables andrelated convergence results. Dealing with dependent random variables is much harder.There are many ways ‘how to make the random variables dependent’ and therefore thereis no general theory. Martingales are particular sequences of dependent random variableswhere a general theory exists.

9.1 Definition and Examples

Definition 9.1. Let (Ω,A, P ) be a probability space. A non-decreasing sequence(Fn)n≥0 of sub-σ-algebras, that is F0 ⊂ F1 ⊂ · · · ⊂ A, is called filtration.

Definition 9.2. A sequence (Xn)n≥0 of random variables is said adapted to Fn if therandom variable Xn is Fn-measurable for every n ≥ 0.

Definition 9.3. A Fn-adapted sequence (Xn)n≥0 of integrable random variables is calledmartingale when for every n ≥ 0

E[Xn+1|Fn] = Xn P -a.s.

It is called submartingale when for every n ≥ 0

E[Xn+1|Fn] ≥ Xn P -a.s.

It is called supermartingale when for every n ≥ 0

E[Xn+1|Fn] ≤ Xn P -a.s.

Example 9.4. Let (Xi)i≥1 be an i.i.d. sequence with E[Xi] = 0. Set Sn = X1 + · · ·+Xn,F0 = ∅,Ω, and Fn = σ(X1, . . . , Xn) for n ≥ 1. Then obviously Sn is Fn-adapted andfor every n ≥ 0

(9.5) E[Sn+1|Fn] = E[ Sn︸︷︷︸∈Fn

+ Xn+1︸ ︷︷ ︸indep. of Fn

|Fn] = Sn + E[Xn+1] = Sn.

Hence, Sn is a martingale.

Example 9.6. Let Xi and Sn be as in the previous example and assume in additionthat EX2

i = σ2 <∞. Set

Mn = S2n − nσ2, n ≥ 0.

67

Page 72: Advanced probability theory - univie.ac.at

Mn is Fn-adapted and

E[Mn+1 −Mn|Fn] = E[S2n+1 − S2

n − σ2|Fn]

= E[(Sn +Xn+1)2 − S2n − σ2|Fn]

= E[2 Sn︸︷︷︸∈Fn

Xn+1 + X2n+1︸ ︷︷ ︸

indep. of Fn

−σ2|Fn]

= 2SnE[Xn+1] + E[X2n+1]− σ2 = 0.

Hence, E[Mn+1|Fn] = E[Mn|Fn] = Mn, that is Mn is a martingale. The same compu-tation implies that S2

n is a submartingale.

Example 9.7 (Asymmetric random walk on Z). Let Xi be i.i.d. with P [Xi = +1] =1− P [Xi = −1] = p and p 6= 1

2. Define Sn and Fn as in the previous examples and set

Mn =(1− p

p

)Sn, n ≥ 0.

Mn is again Fn-adapted and integrable (since |Sn| ≤ n) and

E[Mn+1|Fn] = E[Mn

(1− pp

)Xn+1∣∣∣Fn]

= MnE[(1− p

p

)X1]

= Mn

p(1− p

p

)+ (1− p)

( p

1− p

)= Mn.

Observe that for p > 12, limn→∞ =∞, P -a.s. Since 1−p

p< 1, this implies limn→∞Mn = 0,

P -a.s. On the other hand E[Mn] = 1 for all n. This is an important example of amartingale which converges P -a.s. but not in L1(Ω).

Example 9.8 (Radon-Nikodym derivatives). Let (Ω,A) be a measurable space, Fn afiltration. Further, let µ be a measure and ν a probability measures on Ω. Define µnand νn to be the restrictions of µ and ν to Fn and assume that µn νn for every n ≥ 0.By this assumption we can define

Mn(ω) :=dµndνn

(ω), n ≥ 0.

Mn is Fn adapted since, by the Radon-Nikodym theorem applied to µn, νn and (Ω,Fn),the Radon-Nikodym derivative dµn/dνn is Fn-measurable. Mn is obviously integrableon (Ω,A, ν), and for n ≥ 0, A ∈ Fn

Eν [Mn+11A] =

∫A

Mn+1 dν =

∫A

Mn+1 dνn+1 =

∫A

dµn+1

dνn+1

dνn+1 = µn+1(A)

= µn(A) =

∫A

dµndνn

dνn =

∫A

Mn dνn =

∫A

Mn dν = Eν [Mn1A].

By definition of the conditional expectation, it follows that Eν [Mn+1|Fn] = Mn, that isMn is a martingale on the filtered probability space (Ω,A, (Fn), ν).

68

Page 73: Advanced probability theory - univie.ac.at

Example 9.9 (Progressive conditioning). Let X ∈ L1 and let Fn be any filtration. SetMn = E[X|Fn]. Then, by Proposition 8.14,

E[Mn+1|Fn] = E[E(X|Fn+1)|Fn] = E[X|Fn] = Mn.

Hence Mn is a Fn-martingale.This is actually the same martingale as in the previous example, it suffices to set

µ(dω) = X(ω)ν(dω).

Example 9.10 (Galton-Watson branching process). In mid 19th century, there wasan interest to develop a theory of family trees—particularly, in connection with royalfamilies. Two statisticians (by back then standards) Galton and Watson devised amodel that nowadays bears their name. In this model, there are generations containinga certain number of currently-alive individuals. The dynamics is such that, at eachunit of time, each individual produces a certain number of off-spring, which is sampledindependently from a common law with probabilities p(n) : n ≥ 0. (In particular, ifthere is no off-spring, the lineage of that individual dies out.)

We will define the problem as follows. Consider a family of i.i.d. integer-valued randomvariables ξn,k : n, k ≥ 0 with law determined by P [ξn,k = m] = p(m), m ≥ 0. Define,inductively, random variables Snn≥0 as follows: S0 := 1 and

Sn+1 =

0, if Sn = 0,

ξn+1,1 + · · ·+ ξn+1,Sn , if Sn > 0.

It is easy to verify that the dynamics does what we described verbally above. Theadditional assumption S0 = 1 means that there is one individual at time zero.

Consider now the filtration Fn := σ(ξm,k : 0 ≤ m ≤ n, k ≥ 0) and let us compute

E[Sn+1|Fn] =∞∑k=0

E[Sn+11Sn = k|Fn]

=∞∑k=0

1Sn = kE[ξn+1,1 + · · ·+ ξn+1,k|Fn]

=∞∑k=0

1Sn = kkE(ξ1,1) = SnE(ξ1,1).

Thus, denoting µ := E(ξ1,1), we see that Mn := µ−nSn is a martingale. The value µ = 1is obviously special because Sn is then itself a martingale, which plays a role later.

Example 9.11 (Polya’s Urn). Consider an urn that has, initially, r red balls and ggreen balls in it. We now proceed as follows: Sample a ball from the urn and replaceit back along with another ball of the same colour. Repeating this step, the question iswhat is status of the urn in the long run. Let Rn denote the number of red balls in the

69

Page 74: Advanced probability theory - univie.ac.at

urn at time n and let Gn denote the corresponding number of green balls. Obviously,Rn +Gn = r + g + n. Now consider the random variable

Mn :=Rn

Rn +Gn

=Rn

r + g + n,

which is the fraction of red balls in the urn at time n.The dynamics described above can be encoded using a sequence U1, U2, . . . of i.i.d.

uniform random variables on [0, 1] as follows:

Rn+1 := (Rn + 1)1Un+1 ≤Mn+Rn1Un+1 > Mn.

We claim that Mn is a martingale for the filtration Fn := σ(U1, . . . , Un). Indeed, sinceRn is Fn-measurable, we have

E[Mn+1|Fn] =Rn + 1

r + g + n+ 1· Rn

r + g + n+

Rn

r + g + n+ 1· Gn

r + g + n

=Rn

r + g + n· Rn + 1 +Gn

r + g + n+ 1=

Rn

r + g + n= Mn.

Notice that, unlike the earlier examples, this martingale is non-negative and bounded.

We now give few elementary properties of martingales.

Proposition 9.12. When Xn is a supermartingale, then for n ≥ m ≥ 0

(9.13) E[Xn|Fm] ≤ Xm, P -a.s.

When Xn is a submartingale, then the reversed equality holds, and when Xn is a mar-tingale, then

(9.14) E[Xn|Fm] = Xm, P -a.s.

Proof. The first claim follows from the definition by induction. The second claim canbe proved from the first one by observing that if Xn is submartingale, then −Xn is asupermartingale. Moreover, any martingale is both sub- and supermartingale, yieldingthe last claim.

Proposition 9.15. If Xn is a martingale and ϕ is a convex function with E[ϕ(Xn)] <∞for all n ≥ 0, then ϕ(Xn) is a submartingale. In particular, if ϕ(x) = |x|p for p ≥ 1 andXn is in Lp, then |Xn|p is a submartingale.

Proof. By Jensen’s inequality and the definition

E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) = ϕ(Xn),

completing the proof.

70

Page 75: Advanced probability theory - univie.ac.at

Proposition 9.16. If Xn is a submartingale and ϕ is a non-decreasing convex functionwith E[ϕ(Xn)] <∞ for all n ≥ 0, then ϕ(Xn) is again a submartingale.

Proof. As above, by Jensen’s inequality and the definition,

E[ϕ(Xn+1)|Fn] ≥ ϕ(E[Xn+1|Fn]) ≥ ϕ(Xn).

The monotonicity of ϕ is needed in the last inequality.

Corollary 9.17. If Xn is supermartingale and a ∈ R, then Xn∧a is a supermartingale.Similarly, for submartingale Xn, Xn ∨ a is submartingale.

Exercise 9.18. Find a supermartingale Xn so that X2n is a submartingale.

Definition 9.19. A sequence (Hn)n≥1 of random variables is called predictable if Hn isFn−1 measurable for all n ≥ 1.

Proposition 9.20 (Doob’s decomposition). Xn is a Fn-submartingale iff Xn can bewritten as Xn = Mn + An where Mn is a martingale and 0 = A0 ≤ A1 ≤ A2 ≤ . . . is apredictable sequence. This decomposition is unique up to P -null equivalence.

Proof. To show uniqueness, observe that for n ≥ 0 we must have

E[Xn+1|Fn]−Xn = E[Mn+1|Fn]−Mn︸ ︷︷ ︸=0

+E[An+1|Fn]︸ ︷︷ ︸=An+1

−An.

Hence A0 = 0 and An+1−An = E[Xn+1−Xn|Fn] ≥ 0 for all n, which uniquely determinesthe sequence An and thus also Mn = Xn − An. The same computation yields also theexistence of An with required properties.

Finally, if Xn = Mn +An with Mn, An as in the statement, then it is trivial to checkthat Xn is a submartingale.

Definition 9.21. Let Mn be adapted and Hn predictable. We define a discrete stochasticintegral of Hn with respect to Mn by

(H ·M)n =n∑i=1

Hi(Mi −Mi−1).

Remark 9.22. The previous definition is a discrete version of the stochastic integral∫ t0HsdMs which will be (for suitable processes Ht and Mt, t ∈ [0,∞)) defined in the

lecture ‘Stochastic analysis’.

Remark 9.23 (Interpretation as a gambling system). Consider a coin-tossing gamewhere Xi denotes the result of the i-th toss, Xi = 1 means head, Xi = −1 means tail.

Let Hi be the amount that a player bets on i-th toss being head. This value shouldnaturally be determined from the information known at time n− 1, hence it is naturalto require that Hn is predictable.

71

Page 76: Advanced probability theory - univie.ac.at

The game has the following rules. If i-th toss heads, then the player wins double ofhis bet, otherwise it looses it. The total amount the player has at time n is

H1X1 +H2X2 + · · ·+HnXn =n∑i=1

Hi(Mi −Mi−1) = (H ·M)n,

where Mn = X1 + · · ·+Xn for n ≥ 1 and M0 = 0.

Remark 9.24 (historical). The famous gambling system or strategy called ‘martingale’is defined by setting H1 = 1 and for n ≥ 2 Hn = 2Hn−1 if Xn−1 = −1, and Hn = 1if Xn−1 = +1. This strategy seems to provide sure profit in this faire game since−1− 2− · · · − 2k−1 + 2k = 1, but it is only an illusion as can be seen from the followingtheorem.

Theorem 9.25. (a) If (Xn)n≥0 is a (sub-/super-)martingale and Hn ≥ 0 a predictablesequence which is bounded for every n, then also (H ·X)n is (sub-/super-)martingale.

(b) If (Xn)n≥0 is a martingale and Hn a predictable sequence which is bounded forevery n, then also (H ·X)n is martingale.

Proof. Since H is bounded, (H · X)n ∈ L1 for every n. For (a), assume that Xn is asubmartingale (the remaining claim follows as by replacing X by −X). Then,

E[(H ·X)n+1 − (H ·X)n|Fn] = E[Hn+1︸ ︷︷ ︸∈Fn

(Xn+1 −Xn)|Fn]

= Hn+1︸ ︷︷ ︸≥0

E[Xn+1 −Xn|Fn]︸ ︷︷ ︸≤0

≤ 0,

and thus H ·X is a supermartingale. The proof of (b) is analogous.

We now introduce an important concept:

Definition 9.26. A N ∪ ∞-valued random variable T is called stopping time (withrespect to filtration Fn) when

(9.27) T = n ∈ Fn for all n <∞.

Think about T is the time when the player stops playing. Of course T = n must bemeasurable with respect to the information he has at time n.

Example 9.28. (a) Every deterministic time T (ω) = k is a stopping time as T = nis either ∅ or Ω.

(b) Let A ∈ B(R) and Xn an adapted process. Define

HA = infk ≥ 0 : Xk ∈ A

to be the time of the first visit to A. Then HA is a stopping time. (Exercise!).(c) On the other hand, let

LA = supk : Xk ∈ A

be the time of the last visit to A. Show that LA is not a stopping time in general.

72

Page 77: Advanced probability theory - univie.ac.at

Corollary 9.29 (first stopping theorem). Let Xn be an Fn-(sub-/super-) martingale andlet T be a Fn-stopping time. Then XT∧n is again a Fn-(sub-/super-) martingale.

Proof. We assume without loss of generality that Xn is a submartingale. Set Hn =1T ≥ n. As T is a stopping time and T ≥ n = T > n− 1 ∈ Fn−1, the sequenceHn is predictable. Moreover,

(H ·X)n =n∑i=1

1T ≥ i(Xi −Xi−1) = XT∧n −X0.

Therefore, XT∧n = X0 + (H ·X)n is a submartingale by Theorem 9.25.

9.2 Martingales convergence, a.s. case

In the next three sections we study principle convergence theorems for martingales. Westart with an important inequality which will be useful to control the fluctuations of(sub-/super-) martingales.

Let a < b be two real numbers and define stopping times Ti, i ≥ 0, by setting T0 = 0and then recursively for k ≥ 1,

T2k−1 = infi ≥ T2k−2 : Xi ≤ a,T2k = infi ≥ T2k−1 : Xi ≥ b.

If Ti = ∞ for some i, that is the set in the infimum is empty, we define Tj = ∞ for allj > i. Let

Un = supk : T2k ≤ n

be the number of upcrossings of the interval (a, b) completed by time n.

Theorem 9.30 (Upcrossing inequality). If (Xm)m≥0 is a submartingale, then

(b− a)EUn ≤ E[(Xn − a)+]− E[(X0 − a)+].

Proof. Let Yn = a+ (Xn − a)+. By Proposition 9.16, Yn is also a submartingale. WhenX up-crosses (a, b), then so does Y and vice versa, therefore UX

n = UYn . We thus can

consider Y instead of X.We define the following sequence of random variables

Hm =

1, if T2k−1 < m ≤ T2k for some k ∈ N,

0, otherwise.

The value Hm = 1 indicates that the time interval (m− 1,m) is ‘a part of a upcrossing’.Observe that

T2k−1 < m ≤ T2k = T2k−1 ≤ m− 1 ∩ T2k ≤ m− 1c ∈ Fm−1,

73

Page 78: Advanced probability theory - univie.ac.at

since Ti’s are stopping times. Hence Hm is predictable.The process H can be interpreted as an investing strategy: If Yn denotes the price of

a stock, we buy one stock when its price is at a, hold it up to the first time when itsprice is above b and then sell it, gaining at least b− a.

From this interpretation it is easy to see that

(9.31) (b− a)UYn ≤ (H · Y )n,

since every upcrossing generates profit at least (b − a) and a possible incomplete up-crossing at time n makes a non-negative contribution to the right-hand side (this is truefor Y but not for X). Write

(9.32) Yn − Y0 = (H · Y )n + ((1−H) · Y )n.

By Theorem 9.25, ((1 − H) · Y ) is a submartingale, and therefore E((1 − H) · Y )n ≥E((1−H) · Y )0 = 0. Combining (9.31), (9.32) then yields

EYn − EY0 ≥ (b− a)EUYn = (b− a)UX

n ,

which, after inserting the definition of Y , completes the proof.

Using the upcrossing inequality we easily get the first martingale convergence theorem

Theorem 9.33 (A.s. martingale convergence theorem). Let (Xn)n≥0 be a submartingalewith supnEX

+n <∞. Then, as n→∞, Xn converges a.s. to a limit X with E|X| <∞.

Proof. Take a < b ∈ Q. Due to Theorem 9.30, using the assumption,

EUa,bn ≤

E[(Xn − a)+]

b− a≤ E[X+

n ] + |a|b− a

= const(a, b).

Defining Ua,b = limn→∞ Ua,bn to be the total number of upcrossings of (a, b) by X, we

conclude using the monotone convergence theorem that EUa,b ≤ const(a, b) < ∞, andthus Ua,b is finite P -a.s. This conclusion holds for all pairs of rational numbers a, b, andthus

P[ ⋃a,b∈Q

lim inf Xn < a < b < lim supXn

]= 0.

It follows that lim supXn = lim inf Xn, P -a.s., that is the limit limXn exists P -a.s.Fatou’s lemma guarantees that EX+ ≤ lim inf EX+

n <∞, so X <∞ P -a.s. Further,using submartingale property, EX−n = EX+

n −EXn ≤ EX+n −EX0. Hence, by Fatou’s

lemma again,EX− ≤ lim inf EX−n ≤ supEX+

n − EX0 <∞.Hence E|X| = EX+ + EX− <∞, completing the proof.

As a special case we obtain

Corollary 9.34. Let Xn ≥ 0 be a supermartingale. Then, as n → ∞, Xn convergesa.s. to a limit X and EX ≤ EX0.

74

Page 79: Advanced probability theory - univie.ac.at

Proof. Yn = −Xn is a submartingale with EY +n = 0. So the convergence follows from

the previous theorem. The inequality is a consequence of Fatou’s lemma and the super-martingale property EXn ≤ EX0.

Remark 9.35. Example 9.7 shows that the assumptions of Corollary 9.34 are not suf-ficient to guarantee the L1 convergence of Xn to X.

Example 9.36 (Example 9.8 continued). Let µ, ν be two probability measures on(Ω,A), and assume that Fn is a filtration with limFn := σ

(∪n≥0 Fn

)= A. Let µn, νn

be the restrictions of µ and ν to Fn. Assuming that νn µn, we know that Mn = dµndνn

is a non-negative martingale on (Ω,A, (Fn), ν).By Corollary 9.34, Mn →M ν-a.s. Moreover, it is possible to show that (see [Dur10,

p.242]) that

µ(A) =

∫A

M dν + µ(A ∩ M =∞).

Assuming in addition that µ ν, this simplifies to

µ(A) =

∫A

M dν

that is M = dµdν

. [[TODO: clean]]

Example 9.37 (Branching process continued). In Example 9.10 we proved that if Sn isa Galton-Watson branching process with Eξ1,1 = µ then Mn = Sn/µ

n is a martingale.Since Mn is non-negative, by Corollary 9.34, Mn → M P -a.s. as n → ∞. We now

show that M = 0 P -a.s. when µ ≤ 1.When µ < 1, by martingale property ESn = µn → 0. Since Sn is integer valued,

P [Sn > 0] = P [Sn ≥ 1] ≤ E[Sn]→ 0.In the case µ = 1, assuming in addition P [ξn,i = 1] < 1 to exclude the degeneracy, Sn

is itself a martingale with values in N. Since Sn → M , it means that Sn = M for all nlarge. But this is only possible if M = 0, the non-degeneracy assumption excludes theremaining values.

The case µ > 1 is slightly more complicated, one can e.g. show that in this caseP [Sn > 0 for all n] > 0.

Example 9.38 (Polya’s urn continued). In Example 9.11, we proved that the proportionof red balls in Polya’s urn is a martingale with values in [0, 1]. Theorem 9.33 impliesthat this proportion converges a.s. to a, in general random, limit.

9.3 Doob’s inequality and Lp convergence

We now discuss the convergence of martingales in Lp spaces. The following inequalityextends Kolmogorov’s inequality (Lemma 4.2).

75

Page 80: Advanced probability theory - univie.ac.at

Theorem 9.39 (Doob’s maximal inequality). Let (Xn)n≥0 be a submartingale. Thenfor any λ ≥ 0

λP [ max0≤i≤n

Xi ≥ λ] ≤ E[Xn; max0≤i≤n

Xi ≥ λ] ≤ E[X+n ].

Proof. Set T = infm ≥ 0 : Xm ≥ λ and A = max0≤i≤nXi ≥ λ = T ≤ n. Then,

E[XT∧n] = E[Xn1T > n] +n∑k=0

E[Xk1T = k]

≤ E[Xn1T > n] +n∑k=0

E[Xn1T = k] = E[Xn].

Therefore,E[XT∧n1A] + E[XT∧n1Ac ] ≤ E[Xn1A] + E[Xn1Ac ].

Since XT∧n ≥ λ on A, and T ∧ n = n on Ac, the last inequality yields

λP [A] + E[Xn1Ac ] ≤ E[Xn1A] + E[Xn1Ac ].

and thus λP [A] ≤ E[Xn1A] ≤ E[X+n ] as claimed.

Remark 9.40. To see why Kolmogorov’s inequality follows from Doob’s one considerthe submartingale Xn = S2

n where Sn = ξ1 + · · ·+ ξn is an i.i.d. sum.

Theorem 9.41 (Doob’s Lp-inequality). Let (Xn)n≥0 be a submartingale, p ∈ (1,∞) andset Xn = max0≤i≤nX

+n . Then

‖Xn‖p ≤( p

p− 1

)‖Xn‖p.

In particular Xn ∈ Lp implies Xn ∈ Lp.

Proof. Without loss of generality, due to Proposition 9.16, we may assume that Xn = X+n

and Xn ∈ Lp. Fix M > 0. Then

E[(Xn ∧M)p] =

∫ ∞0

pλp−1P [Xn ∧M > λ]dλ.

Observe that

Xn ∧M > λ =

Xn > λ, on λ < M,

∅, on λ ≥M.

Using Theorem 9.39, Fubini’s theorem, and Holder’s inequality then yields

E[(Xn ∧M)p] ≤∫ ∞

0

pλp−2E[Xn1Xn ∧M > λ] dλ

=p

p− 1E[Xn

∫ Xn∧M

0

(p− 1)λp−1 dλ]

=p

p− 1E[Xn(Xn ∧M)p−1]

≤ p

p− 1‖Xn‖pE[(Xn ∧M)p](p−1)/p.

76

Page 81: Advanced probability theory - univie.ac.at

Either ‖Xn ∧M‖p = 0 or we divide by ‖Xn ∧M‖p−1p to obtain

‖Xn ∧M‖p ≤( p

p− 1

)‖Xn‖p.

The theorem then follows by the monotone convergence theorem, sending M →∞.

Corollary 9.42. When (Xn)n≥0 is a martingale, X?n = max0≤i≤n |Xi|, then

‖X?n‖p ≤

( p

p− 1

)‖Xn‖p.

Theorem 9.43 (Lp-convergence theorem). Let (Xn)n≥0 be martingale such that, for

some p ∈ (1,∞), supnE|Xn|p <∞. Then Xnn→∞−−−→ X P -a.s. and in Lp.

Proof. Since, by Jensen’s inequality, (EX+n )p ≤ (E|Xn|)p ≤ E[|Xn|p], Theorem 9.33

implies that Xn → X P -a.s. Letting n → ∞ in Corollary 9.42, using the monotoneconvergence theorem and the assumption, yields ‖ supn |Xn|‖p <∞. Since |Xn−X|p ≤(2 sup |Xn|)p ∈ L1, we obtain thatXn → X in Lp by the dominated convergence theorem.

9.4 L2-martingales

As usual, the most important special case of the above theorem is p = 2. The followingtwo simple lemmas are useful when dealing with L2-martingales

Lemma 9.44 (orthogonality of martingale increments). Let (Xn)n≥0 be a martingalewith EX2

n < ∞ for all n ≥ 0. Then, for every m ≤ n and Y a Fm-measurable randomvariable with EY 2 <∞,

E[(Xn −Xm)Y ] = 0.

Proof. By Cauchy-Schwarz inequality E|(Xn −Xm)Y | <∞. We can thus write

E[(Xn −Xm)Y ] = E[E[(Xn −Xm)Y ]|Fm] = E[Y E[Xn −Xm|Fm]︸ ︷︷ ︸=0

] = 0.

This completes the proof.

Lemma 9.45 (conditional variance formula). Let (Xn)n≥0 be as in the last lemma,m ≤ n. Then

E[(Xn − EXn)2|Fm] = E[(Xn −Xm)2|Fm] = E[X2n|Fm]−X2

m.

Proof. Is left for exercise!

77

Page 82: Advanced probability theory - univie.ac.at

Example 9.46 (branching process continued). Recall Example 9.37. Assume nowin addition that the offspring distribution has a finite second moment, E[(ξn,i)

2] =∑∞l=0 pll

2 =: σ2 <∞. Let Mn = Sn/µn be the martingale, as before. By Lemma 9.45,

E[M2n|Fn−1] = M2

n−1 + E[(Mn −Mn−1)2|Fn−1]

= M2n−1 + E

[(Snµn− Sn−1

µn−1

)2

|Fn−1

]= µ−2nE[(Sn − µSn−1)2|Fn−1] +M2

n−1.

Further,

E[(Sn − µSn−1)2|Fn−1] =∞∑l=0

E[(Sn − µSn−1)21Sn−1 = l|Fn−1]

=∞∑l=0

1Sn−1 = llσ2 = Sn−1σ2 = µn−1Mn−1σ

2,

and thus E[M2n|Fn−1] = M2

n−1 + σ2µ−n−1. Taking expectation we obtain EM2n =

EM2n−1 + σ2µ−n−1 and iterating

EM2n = 1 +

n+1∑k=2

σ2µ−k.

It follows that supEM2n <∞ and thus Mn →M in L2.

9.5 Azuma-Hoeffding inequality

Various inequalities from the previous section get considerably better if the martingalesin consideration have bounded increments, that is in the L∞-case.

Theorem 9.47 (Azuma-Hoeffding inequality). Let Xn be a supermartingale with M0 =0 such that for some deterministic sequence ck of non-negative numbers

|Xk −Xk−1| ≤ ck, k = 1, . . . , n.

Then

P [Xn > λ] ≤ exp− 1

2

λ2∑nk=1 c

2k

.

Proof. The proof is based on the following inequality

(9.48) ety ≤ cosh(tc) +y

csinh(tc), |y| ≤ c, t ∈ R.

This follows from the fact that the right-hand side can be written as

cosh(tc) +y

csinh(tc) = etc

c+ y

2c+ e−tc

c− y2c

.

78

Page 83: Advanced probability theory - univie.ac.at

Under condition |y| ≤ c, both c+y2c

and c−y2c

are non-negative and add up to one. So bythe convexity of the exponential function, the right-hand side is at most

exptcc+ y

2c− tcc− y

2c

= exp(ty),

so we get (9.48).Consider now the random variable etMn for t ≥ 0. Since Mn is bounded, this is

integrable and soE[etMn ] = E

[etMn−1E[et(Mn−Mn−1)|Fn−1]

].

Moreover, (9.48) insures that

E[et(Mn−Mn−1)|Fn−1] ≤ cosh(tcn) +1

cnsinh(tcn)E[Mn −Mn−1|Fn−1],

and this is equal to cosh(tcn) for martingales and is ≤ cosh(tcn) for supermartingales,since t ≥ 0. By induction we obtain

E[etMn ] ≤n∏k=1

cosh(tck) = exp n∑k=1

log cosh(tck), t ≥ 0.

Using the second-order Taylor expansion of the function x 7→ log cosh(x) with the re-minder in Lagrange form and observing that the third derivative of this function isnegative for x < 0 and positive for x > 0 yields log cosh(x) ≤ 1

2x2. Hence,

E[etMn ] ≤ expt2

2

n∑k=1

c2k

, t ≥ 0.

By the exponential Chebyshev inequality

P [Mn ≥ λ] ≤ etλE[etMn ], t ≥ 0.

Combining the two bounds, and minimising over t shows that the optimal value ist = λ(

∑nk=1 c

2k)−1 which yields the desired inequality.

9.6 Convergence in L1

We now study the convergence in L1. Recall that in this case Doob’s inequality (The-orem 9.41) is of no use. Further, as we have seen in Example 9.7 and Remark 9.35,the fact supE|Mn| < ∞ does not imply Mn → X in L1 for a martingale Mn, that isTheorem 9.43 does not hold for p = 1.

On the other hand Theorem 9.33 clearly applies in this case so Mn → M a.s. Wealready know one way how to deduce L1 convergence from the a.s. convergence, namelythe dominated convergence theorem, whose application requires the existence of L1-dominating function.

We are going to develop another condition allowing to deduce L1-convergence fromthe a.s. one. We will see that this condition is not only sufficient, but also necessary. Itdeals with general families of random variables, not only with martingales.

79

Page 84: Advanced probability theory - univie.ac.at

Definition 9.49. A collection (Xi)i∈I is said to be uniformly integrable (UI) if

limM→∞

supi∈I

E[|Xi|1|Xi| > M

]= 0.

Example 9.50. (a) When |Xi| < Y for all i ∈ I and some Y ∈ L1, that is there isL1-dominating function, then (Xi)i∈I is UI. (Exercise!)

(b) Let ϕ ≥ 0 be a function such that limϕ(x)/x = ∞. Examples are ϕ(x) = xp,p > 1, or ϕ(x) = x log+(x). If supi∈I Eϕ(|Xi|) <∞, then (Xi)i∈I is UI.

Indeed, let A = supiEϕ(|Xi|). Choose ε > 0 and M <∞ such that infu≥Mϕ(u)u≥ A

ε.

Then for all i ∈ I,

E[|Xi|1|Xi| > M

]≤ ε

AE[ϕ(|Xi|)|Xi|

|Xi|1|Xi| > M]≤

ε

AE[ϕ(|Xi|)] ≤ ε,

which implies the condition of Definition 9.49. ♦

Lemma 9.51. Let X ∈ L1(Ω,A, P ). Then the family

E[X|G] : G ⊂ A is a σ-algebra

is UI.

Proof. We start with a technical claim:

Claim 9.52. If X ∈ L1 then for every ε > 0 exists δ > 0 such that

P (A) < δ =⇒ E[|X|;A] ≤ ε.

Proof. Assume, by contradiction, that there is a sequence of events An with P (An) ≤ 1n

and E[|X|;An] ≥ ε. It follows that |X|1An → 0 in probability and thus a.s. alonga subsequence kn. For such sub-sequence, the dominated convergence theorem impliesE[|X|;Akn ]

n→∞−−−→ 0, leading to contradiction.

Fix now ε and δ as in the claim and choose M < ∞ such that E|X|/M ≤ δ. ForG ⊂ A, by Jensen’s inequality

E[|E[X|G]|; |E[X|G]| ≥M

]≤ E

[E[|X|∣∣G];E[|X|∣∣G] ≥M︸ ︷︷ ︸

∈G

]= E

[|X|;E

[|X|∣∣G] ≥M

],

(9.53)

where the equality follows from the definition of the conditional expectation. In addition,by Chebyshev’s inequality

P[E[|X|∣∣G] ≥M

]≤M−1E

[E[|X|∣∣G]] = M−1E[|X|] ≤ δ,

and thus, by Claim 9.52, the right-hand side of (9.53) is bounded by ε, proving the UIproperty.

80

Page 85: Advanced probability theory - univie.ac.at

The following theorem explains the usefullness of the UI property for dealing withL1-convergence.

Theorem 9.54. If Xnn→∞−−−→ X in probability, then the following are equivalent

(i) Xn : n ≥ 1 is UI,

(ii) Xnn→∞−−−→ X in L1,

(iii) E|Xn|n→∞−−−→ E|X| <∞.

Proof. (i) =⇒ (ii). For M > 0,

E[|Xn −X|] ≤ E[|Xn −X|; |Xn| ≤M, |X| ≤M ]

+ 3E[|Xn|; |Xn| > M ] + 3E[|X|; |X| > M ].(9.55)

[[TODO: This should be better explained]]

For ε ∈ (0, 1), (i) implies the existence of M0 such that

supnE[|Xn|; |Xn| ≥M ] ≤ ε

2for all M ≥M0.

By Fatou’s lemma

E[|X|] ≤ lim inf E[|Xn|] ≤ε

2+M0 ≤M0 + 1.

Hence, we may choose M so that, uniformly in n, the last two terms in (9.55) are smallerthan ε

2. Hence,

lim supE|Xn −X| ≤ lim supE[|Xn −X|; |Xn| ≤M, |X| ≤M ] + ε = ε

by the dominated convergence theorem. As ε is arbitrary, (ii) follows.(ii) =⇒ (iii). By Jensen’s inequality∣∣E|Xn| − E|X|

∣∣ ≤ E[|Xn| − |X|] ≤ E[|Xn −X|]→ 0

by (ii), which implies (iii).(iii) =⇒ (i). Fix ε > 0. Let ψM : R+ → R+ be a continuous function such that

ψM(x) =

x if x ≤M − 1,

linear on x ∈ [M − 1,M ],

0 if x ≥M.

By the dominated convergence theorem, for M large enough, E|X|−EψM(|X|) ≤ ε2. An-

other application of the dominated convergence theorem implies that E[ψM(|Xn|)]n→∞−−−→

E[ψM(|X|)], so by (iii), for all n larger than some n0

E[|Xn|; |Xn| ≥M ] ≤ E[|Xn|]− EψM(|Xn|) ≤ E[|X|]− E[ψM(|X|)] +ε

2< ε.

By increasing M , we can make the last inequality to be valid for all n’s, that is (Xn)n≥1

is UI.

81

Page 86: Advanced probability theory - univie.ac.at

Remark 9.56. In the proof we used the following extended version of the dominatedconvergence theorem: Assume that Xn → X in probability and that |Xn| ≤ Y ∈ L1 forall n. Then Xn → X in L1. Prove this as exercise!

As corrolary of Theorem 9.54 we obtain L1-convergence theorem for submartingales.

Theorem 9.57. For a submartingale (Xn)n≥0 the following are equivalent

(i) (Xn)n≥0 is UI,

(ii) Xn → X in L1 and P -a.s.

(iii) Xn → X in L1.

Proof. (i) =⇒ (ii). The UI property implies supE|Xn| < ∞, so by Theorem 9.33,Xn → X, P -a.s. Theorem 9.54 then implies that Xn → X in L1.

(ii) =⇒ (iii) is obvious.(iii) =⇒ (i) Since Xn → X in L1, we have also Xn → X in probability. The claim

then follows by another application of Theorem 9.54.

Theorem 9.58. For a martingale (Xn)n≥0 the following are equivalent

(i) (Xn)n≥0 is UI,

(ii) Xn → X∞ in L1 and P -a.s.

(iii) Xn → X∞ in L1.

(iv) There is a random variable X such that Xn = E[X|Fn]

Proof. (i)⇔(ii)⇔(iii) follows from Theorem 9.57.(iii) =⇒ (iv). Let n < m. Then, for every A ∈ Fn

E[Xn1A] = E[Xm1A](iii)−−−→m→∞

E[X∞1A],

that is Xn = E[X∞|Fn] for all n ≥ 0.(iv) =⇒ (i) is a direct consequence of Example 9.51.

Exercise 9.59. Let F∞ = σ(∪n Fn

)and X ∈ L1(Ω,A, P ). Show that

E[X|Fn]n→∞−−−→ E[X|F∞] a.s. and in L1.

82

Page 87: Advanced probability theory - univie.ac.at

9.7 Optional stopping theorem

We explore the behaviour of the martingales at stopping times. Recall from Corol-lary 9.29, that if Xn is a martingale and T a stopping time such that P [T ≤ c] = 1 forsome c <∞, then

E[XT ] = E[X0].

For unbounded stopping times, such equality fails in general (Consider the simple ran-dom walk (Sn)n≥0 and the stopping time T = infk : Sk = −1), but it holds underadditional assumptions.

Theorem 9.60. If (Xn)n≥0 is a UI submartingale, then for any stopping time T ≤ ∞,

EX0 ≤ EXT ≤ EX∞,

where X∞ = limn→∞Xn.

To show the theorem we need a simple lemma.

Lemma 9.61. If (Xn)n≥0 is a UI submartingale and T ≤ ∞ a stopping time, then thesequence (Xn∧T )n≥0 is UI.

Proof. By Proposition 9.16, (X+n )n≥0 is a submartingale which is also UI. By Corol-

lary 9.29, EX+T∧n ≤ EX+

n . Since (X+n )n≥0 is UI, we have

supnEX+

T∧n ≤ supnEX+

n <∞,

so by martingale convergence theorem (Theorem 9.33), XT∧n → XT , P -a.s. and E|XT | <∞. Then,

E[|XT∧n|; |XT∧n| ≥M ] ≤ E[|XT |; |XT | ≥M ] + E[|Xn|; |Xn| ≥M ],

and the claim follows from the UI property of (Xn)n≥0 and the fact that E|XT | <∞.

Proof of Theorem 9.60. By Corollary 9.29, EX0 ≤ ET∧n ≤ EXn. Letting n → ∞, ob-serving that Xn → X∞ in L1, and XT∧n → XT in L1 by Lemma 9.61 and Theorem 9.57,yields the desired result.

Corollary 9.62. If S ≤ T are two stopping times and X is a UI submartingale, thenEXS ≤ EXT .

Proof. Use the inequality EYS ≤ EY∞ for Yn = XT∧n.

83

Page 88: Advanced probability theory - univie.ac.at

Applications of Optional stopping theorem Let Sn = X1 + · · ·+Xn for Xi-i.i.d. withP [Xi = 1] = 1−P [Xi = −1] = p be a random walk with drift. In Examples 9.4–9.7, wehave seen several martingales related to this process.

For x ∈ Z define Tx = infk ≥ 0 : Sk = x to be the hitting time of x. We can usethe optional stopping theorem to study the exit time of Sn from an interval:

Claim 9.63. Let a < 0 < b ∈ Z. Then

P [Ta < Tb] =

b

b−a , if p = 12,

ϕ(b)−ϕ(0)ϕ(b)−ϕ(a)

, if p 6= 12,

where ϕ(x) =(

1−pp

)x, cf. Example 9.7.

Proof. We consider only the case p > 1/2. The proof for p = 1/2 is completely analogous,using the martingale from Example 9.4. Set T = Ta∧Tb, and consider the process Mn =ϕ(Sn∧T ), which is a martingale by Example 9.7 and Corollary 9.29. This martingale isbounded and so UI. It thus converges a.s. and in L1. Moreover, it can be seen easilythat the limit cannot be in the set ϕ(x) : a < x < b and thus M∞ ∈ ϕ(a), ϕ(b) andT <∞ a.s. Optional stopping theorem then yields

ϕ0 = E[ϕ(MT )]

= ϕ(a)P [ST = a] + ϕ(b)P [ST = b]

= ϕ(a)P [Ta < Tb] + ϕ(b)(1− P [Ta < Tb]).

Solving for P [Ta < Tb] yields the claim.

Exercise 9.64. Use the above method to show that

(a) If p > 1/2 and a < 0, then P [minn Sn ≤ a] = ϕ(−a).

(b) If p > 1/2 and b > 0, then ETb = b2p−1

.

(c) Let ψ(θ) = logE[expθX1]. Then M θn = expθSn−nψ(θ) is a martingale and the

generating function of T1 is given by

EsT1 =1−

√1− 4p(1− p)s2

2(1− p)s, s ∈ [0, 1].

9.8 Martingale central limit theorem*

TBD

84

Page 89: Advanced probability theory - univie.ac.at

10 Constructions of processes

[[TODO: This and following chapters should be proofreaded]]

Up to now, we did not pay a lot of attention how to explicitly construct stochas-tic processes we are dealing with. In this chapter we present two general techniquesthat guarantee the existence of the stochastic process (Xi)i≥1 as a sequence of randomvariables on some probability space (Ω,F ,P).

10.1 Semi-direct product

[[TODO: This section should be expanded, see the handwritten notes]]

We have seen in Section 8.1, how stochastic kernels naturally arise when consideringregular versions of conditional probabilities. We now follow the opposite direction, anduse stochastic kernels to construct measures on ‘larger’ spaces.

Consider a probability space (Ω1,A1, P1). Let κ be a stochastic kernel (as in Defini-tion 8.19) from (Ω1,A1) to some other measurable space (Ω2,A2). Let Ω = Ω1 × Ω2,A = A1 ⊗A2. We define Xi : Ω→ Ωi, i = 1, 2, to be the canonical projection.

The semi-direct product P = P1 × κ is a probability measure on (Ω,A) uniquelydetermined by

(10.1) P (A1 × A2) =

∫A1

κ(ω1, A2)P (dω1), A1 ∈ A1, A2 ∈ A2.

Obviously then, for every Y ∈ L1(Ω, P ), we have

EP [Y ] =

∫Ω1

∫Ω2

Y ((ω1, ω2))κ(ω1, dω2)P (dω1).

The following lemma shows the relation to the construction of Section 8.1.

Lemma 10.2. Let Y ∈ L1(Ω, P ). Then

EP [Y |σ(X1)](ω) =

∫Ω2

Y (ω1, ω′)κ(ω1, dω

′),

for ω = (ω1, ω2). In particular, κ is the regular conditional distribution of X2 givenσ(X1).

Proof. The second claim follows easily by taking Y = 1ω1 ∈ A) with A ∈ A1.

85

Page 90: Advanced probability theory - univie.ac.at

10.2 Ionescu-Tulcea Theorem

The basic idea of our first construction of a stochastic process is to specify for for everyn = 1 a stochastic kernel κn which describes the conditional distribution of Xn givenX0, . . . , Xn−1, and to specify a starting distribution of random variable X0. We willsee that these input data are sufficient to construct a stochastic process (Xn)n≥0 whosedistribution is determined uniquely.

We will work in a slightly more general setting, where random variables Xi do not needto take values in the same spaces. We thus consider a sequence (Si,Si)i≥0 of measurablespaces and define

Ω0 = S0, Ω1 = S0 ×X1, . . . ,Ωn = S0 × · · · × Sn,F0 = S0, F1 = S0 ⊗ S1, . . . ,Fn = S0 ⊗ · · · ⊗ Sn.

The input data for the construction are

• a probability measure P0 on (S0,S0) viewed as the starting distribution,

• a sequence of stochastic kernels κn from (Ωn−1,Fn−1) to (Sn,Sn), giving the tran-sition probabilities

Using these ingredients and semi-direct construction (10.1), we can define probabilitiesQn on (Ωn,Fn) by

Q0 = P0

Q1 = P0 × κ1,

. . .

Qn+1 = Qn × κn+1.

The required stochastic process will be constructed with help of a a probability mea-sure on countable-infinite product space

(10.3) Ω =∏i≥0

Si = ω = (x0, x1, . . . ) : xi ∈ Si∀i ≥ 0

endowed with the canonical coordinates

(10.4) Xi(ω) = xi ∈ Si for ω = (x0, x1, . . . ) ∈ Ω,

and the product σ-algebra F

F = σ(Xn : n ≥ 0)

= σ(A1 × . . . Ak × Sk+1 × · · · : k ≥ 0, Ai ∈ Si, 0 ≤ i ≤ k= σ(A× Sk+1 × · · · : k ≥ 0, A ∈ Fk.

We also define the canonical projections

πn : Ω→ Ωn, πn(ω) = (x0, . . . , xn) ∈ Ωn for ω = (x1, x2, . . . ) ∈ Ω.(10.5)

86

Page 91: Advanced probability theory - univie.ac.at

Theorem 10.6 (Ionescu-Tulcea). There is a unique probability measure Q on (Ω,F) sothat for every n ≥ 0

(10.7) (πn)#Q = Qn

In particular, the conditional distribution of Xn given σ(X1, . . . , Xn−1) is given by κn,and thus for every bounded measurable function f on (Ωn,Fn)

EQ[f(X0, . . . , Xn)]

=

∫S0

P0(dx0)

∫κ1(X0, dx1) . . .

∫κn(x0, . . . , xn−1, dxn)f(x0, x1, . . . , xn).

(10.8)

Proof. The second claim of the theorem is a direct consequence of the first one, usingthe semidirect product construction (10.1).

To show the uniqueness, observe that (10.8) uniquely determines the measure Q onthe collection

B = A× Sk+1 × · · · : k ≥ 0, A ∈ Fk.

Since B is a π-system and σ(B) = F , the uniqueness of Q follows by Dynkin’s Lemma.It remains to show the existence of Q. In accordance with (10.7), we define Q on B

by

(10.9) Q(A× Sk+1 × . . . ) = Qk(A), for every k ≥ 0, A ∈ Fk.

We first show Q is well-defined on B by (10.9). To this end we need to check that for0 ≤ l ≤ n and A ∈ Fl, B ∈ Fn with

A× Sl+1 × · · · = B × Sn+1 × . . . ,

we have Q(A) = Ql(A) = Qn(B) = Q(B). This is trivial in the case n = l, since thennecessarily A = B. When n > l, then

Ql+1(A× Sl+1) = Ql × κl+1 =

∫A

Ql(dx0, . . . , dxl)κl+1(x0, . . . , xl;Sl+1) = Ql(A).

By induction then Qn(A× Sl+1 × · · · × Sn) = Ql(A). As B = A× Sl+1 × · · · × Sn, theclaim follows for n > l as well.

By definition, the collection B is an algebra (i.e. Ω ∈ B, B ∈ B =⇒ Bc ∈ B,Bi ∈ B, i = 1, . . . , n =⇒ ∪ni=1Bi ∈ B), and Q(Ω) = 1. It is also easy to see thatthe function Q is additive on B, (i.e. Q(B1 ∪ B2) = Q(B1) + Q(B2) for B1, B2 ∈ Bwith , B1 ∩ B2 = ∅). To see this observe that for every pair B1, B2 ∈ B we canfind k ≥ 0 and A1, A2 ∈ Fk such that Bi = Ai × Sk+1 × . . . , i = 1, 2, and thereforeQ(B1 ∪B2) = Qk(A1 ∪ A2) = Qk(A1) +Qk(A2) = Q(B1) +Q(B2) by additivity of Qk.

The existence of a probability measure Q on (Ω,F) extending the additive set functionQ on B will then follow from the Caratheordory extension theorem, once we show thatQ is σ-additive on B, that is:

87

Page 92: Advanced probability theory - univie.ac.at

Claim 10.10. When Bi ∈ B, i ∈ N, Bi are pairwise disjoint, and B =⋃i≥1Bi ∈ B(!),

then Q(B) =∑

i≥1Q(Bi).

Proof. The proof starts by two reductions. First, setting Bn = B \ (∪ni=1Bi), n ≥ 0, wehave Bn ↓ ∅, and, by the additivity, Q(B) = Q(Bn) +

∑ni=1Q(Bi). Hence the claim will

follow once we show that

(10.11) For every decreasing sequence Bn ∈ B with Bn ↓ ∅, limn→∞Q(Bn) = 0.

Second, for any Bn as in (10.11) we may construct another sequence Bk = Ak×Sk+1×. . . , Ak ∈ Fk with Bk ↓ ∅ such that (Bn) is a subsequence of (Bk). (10.11) thus followsfrom

(10.12)For every Bk = Ak × Sk+1 × . . . , Ak ∈ Fk, k ≥ 1 with Bk ↓ ∅,limk→∞Q(Bk) = 0.

Assume now, by contradiction, that (10.12) does not hold for a sequence (Bk)k, thatis

infk≥1

Q(Bk) > ε > 0.(10.13)

By (10.12), Ak+1 ⊂ Ak × Sk+1, k ≥ 0 and

Q(Bk) = Qk(Ak)

=

∫S0

P0(dx0)

∫S1

κ1(x0, dx1) . . .

∫Sk

κ(x0, . . . , xk−1, dxk)1Ak(x0, . . . , xk)︸ ︷︷ ︸:=f0,k(x0)

and similarly

Q(Bk+1) = Qk+1(Ak+1) =

∫S0

P0(dx0)f0,k+1(xo).

Since (Bk) is a decreasing, we have 1Ak+1(x0, . . . , xk+1) ≤ 1Ak(x0, . . . , xk)1Sk+1

(xk+1) andthus

f0,k(x0) ≥ f0,k+1(x0) for every x0 ∈ S0, k ≥ 1.

From (10.13) and the monotone convergence theorem we see that there a x0 ∈ S0 suchthat(10.14)

infk≥1

f0,k(x0) = inf

∫S1

K1(x0, dx1) . . .

∫Sk

Kk(x0, x1, . . . , xk−1dxk)1Ak(X0, x1, . . . , xk) > 0.

Define now

f1,k(x0, x1) =

∫S2

κ2(x0, x1, dx2) . . .

∫Sk

κk(x0, . . . , xk−1, dxk)1Ak(x0, . . . , xk).

88

Page 93: Advanced probability theory - univie.ac.at

Using similar steps as in the last paragraph, assumption (10.14) implies that there isx1 ∈ S1 such that

infk≥2

f1,k(x0, x1) > 0.(10.15)

By induction we may then construct a sequence (xk)k≥1, xk ∈ Sk, such that for everyl ≥ 0

infk>l

∫Sl+1

κl+1(x0, . . . xl, dxl+1) . . .

∫Sk

κk(x0, . . . xl, xl+1, . . . , xk−1, dxk)

× 1Ak(x0, . . . , xl, xl+1, . . . , xk) > 0

In particular, for k = l + 1,

0 <

∫Sl+1

κl+1(x0, . . . xl, dxl+1) 1Al+1(x0, . . . , xl, xl+1)︸ ︷︷ ︸≤1Al (x0,...,xl)

≤ 1Al(x0, . . . , xl)

and thus (x0, . . . , xl) ∈ Al for every l ≥ 0. Taking into account also (10.12), it meansthat for every ω := (x0, x1, . . . ) ∈ Bl for every l ≥ 1, and thus ω ∈

⋂l≥1Bl. This is in

contradiction with Bl ↓ ∅.

This completes the proof of Theorem 10.6.

As the first consequence of Ionescu-Tulcea theorem we may prove the existence ofcountable independent sequences.

Corollary 10.16 (Product measure on Ω =∏Si). For i ≥ 0, let νi be a probability

measure on (Si,Si). Then there exists a unique probability measure Q on (Ω,F) suchthat

(πn)#Q = ν0 ⊗ · · · ⊗ νndenoted by Q =

⊗i≥0 νi, a product measure.

Proof. It is sufficient to choose stochastic kernels κi by

κi(x0, . . . , xi−1dxi) = νi(dxi), i ≥ 1,

and P0 = ν0. ThenQn = ν0 ⊗ · · · ⊗ νn

and the claim follows directly from Theorem 10.6.

89

Page 94: Advanced probability theory - univie.ac.at

10.3 Complement: Kolmogorov extention theorem

Ionescu-Tulcea Theorem allows the construction of probability measures on countableproduct spaces from a sequence of stock kernels. We now give another constructionof stochastic processes which works on arbitrary products, however with additional as-sumption on ’components’ (Si,Si).

We consider an arbitrary index set I and a collection of measurable spaces (Si,Si)i∈I .We assume

(10.17) (Si,Si) is a Borel space for every i ∈ I, cf. Definition 8.21.

Similarly as before we define product spaces

ΩJ =∏i∈J

Si, FJ = ⊗i∈JFi, for J ⊂ I,(10.18)

and write Ω := ΩI , F := FI . For I ⊃ J ⊃ K we let πJ,K : ΩJ → ΩK to be the canonicalprojection, and set πJ := πI,J Finally, let F (I), resp. G(I), be the set of all finite,resp. countable, subsets of I.

Starting data for our construction will be a collection of finite-dimensional distribution,that is a family of probability measures QJ on (ΩJ ,FJ), J ∈ F (I). We want to find ameasure Q on (Ω,F) such that QJ ’s are its finite dimensional marginals, that is

(10.19) (πJ)#Q = QJ , for all J ∈ F (I).

Of course, there need to be a consistency requirement on QJ ’s, After all, given K ⊂ J ,QK is already determined by QJ :

(10.20) (πJ,K)#QJ = QK .

It turns out, that this is everything we need:

Theorem 10.21. Let I, (Si,Si)i∈I , (QJ)J∈F (I) satisfy (10.17), (10.20). Then thereexists a unique probability measure Q on (Ω,F) such that (10.19) holds.

Remark 10.22. It can be shown that the assumption (10.17) is necessary for the validityof the theorem.

Proof. We first assume that I is countable. Without loss of generality we may thenassume that I = N, and denote Ωn = Ω1,...,n, Qn = Q1,...,n, Fn = F1,...,n, πn =π1,...,n. A finite product of Borel spaces is again a Borel space. By Theorem 8.22, thereis a regular conditional distribution κn of Qn given Fn−1, that is

κn((x1, . . . , xn−1);A) = Q(A|σ(πn−1)).

From condition (10.20) it follows that Qn = Qn−1×κn and thus Qn = Q1×κ2×· · ·×κn.Ionescu-Tulcea theorem then asserts the existence of the required measure Q on (Ω,F).

90

Page 95: Advanced probability theory - univie.ac.at

Assume now that I is uncountable. Recall that the product σ-algebra F can bewritten as

F =⋃

J∈G(I)

π−1J (FJ).

(To see that the union on the right-hand side is indeed a σ-algebra, recall that a countableunion of countable sets is again countable.) By the first step of the proof, we canconstruct for every J ∈ G(I) a measure QJ on (ΩJ ,FJ) such that (πJ,K)#QJ = QK forevery K ⊂ F (I). For A ∈ π−1

J (FJ) with J ∈ G(I) we may set

Q(A) = QJ(πJ(A)).

This is a well defined function, since when A ∈ π−1J1

(FJ1) ∩ π−1J2

(FJ2), for J1, J2 ∈ G(I),then (10.20) implies that QJ1(πJ1(A)) = QJ2(πJ2(A)) (Exercise).

It remains to be shown that Q is a probability measure on F . Obviously 0 ≤ Q ≤ 1and Q(Ω) = 1. Given sequence An ∈ F of disjoint sets, let Jn ∈ G(I) be such thatAn ∈ π−1

Jn(FJn). Then J =

⋃n Jn ∈ G(I), and An ∈ π−1

J (FJ). Hence, by σ-additivity ofQn,

Q(∪n≥1An) = QJ(∪n≥1πJAn) =∑n≥1

QJ(πJAn) =∑n≥1

Q(An).

This completes the proof.

91

Page 96: Advanced probability theory - univie.ac.at

11 Markov chains

The second important family of dependent random variables that we will study in thislecture are the Markov chains.

11.1 Definition and first properties

Definition 11.1. Let (Ω,F , (F)n≥0, P ) be a probability space with a filtration and(S,S) a measurable space. A sequence (Xn)n≥0 of S-valued random variables is calledMarkov chain with respect to (Fn) if (a) (Xn) is Fn-adapted, and (b) for all B ∈ S andn ≥ 0

P [Xn+1 ∈ B|Fn] = P [Xn+1 ∈ B|σ(Xn)].

As application of the results of the previous chapter we show that Markov chain exist:

Proposition 11.2 (Existence of canonical Markov chain). Let (Ω,F) = (SN,S⊗N),Xi : Ω → S the canonical coordinates, Fn = σ(X0, . . . , Xn). Consider sequence ofstochastic kernels (κi)i≥1 from (S,S) to (S,S) and a probability measure µ on (S,S).Then there is a unique probability measure Pµ on (Ω,F) under which (Xn)n≥0 form aMarkov chain w.r.t. Fn such that X0 is µ-distributed and

Pµ[Xn+1 ∈ B|Fn](ω) = κn+1(Xn(ω), B), Pµ-a.s.

In particular, for every bounded measurable f : Sn+1 → R, n ≥ 0,(11.3)

EPµ [f(X0, . . . , Xn)] =

∫S

µ(dx0)

∫S

κ1(x0, dx1) · · ·∫S

κn(xn−1, Dxn)f(x0, . . . , xn).

Proof. The claim of the proposition follows directly from the Ionescu-Tulcea theoremby taking κn(x0, . . . , xn−1; dxn) of this theorem being independent of x0, . . . , xn−2 andbeing equal κn(xn−1, dxn).

When κn = κ for some given probability kernel κ on (S,S), then the Markov chain iscalled time-homogeneous. From now on we consider only this case. For µ = δx we writePx instead of Pδx .

Example 11.4. We have already seen many examples of Markov chains: Random walks,Galton-Watson process, renewal process, Polya’s urn, . . . . [[TODO: extend]]

Exercise 11.5. The Markov chains from the previous example are mostly non-canonicalones. How you can construct their canonical versions?

92

Page 97: Advanced probability theory - univie.ac.at

We now consider a time-homogeneous canonical Markov chain given by transitionkernel κ and initial distribution µ, as constructed in Proposition 11.2. For n ≥ 0, weintroduce a shift operator θn : Ω→ Ω by

θ((ω0, ω1, . . . )) = (ωn, ωn+1, . . . ),

i.e. θn “erases the past before time n”.The definition of Markov chain requires that conditionally on Xn, the “near future ”,

that is Xn+1, is independent of the past. We now extend it to the “whole future”:

Proposition 11.6 (Markov property). (a) The map (x,B) ∈ S × F 7→ Px(B) is astochastic kernel from (S,S) to (Ω,F).

(b) For every n ≥ 0 and a bounded random variable Y

EPµ [Y θn|Fn](ω) = EPXn(ω) [Y ], Pµ-a.s.

(c) In particular, for C ∈ σ(Xn, Xn+1, . . . ),

EPµ [1C |Fn] = EPµ [1C |σ(Xn)], Pµ-a.s.

Proof. (a) B 7→ Px(B) is a probability measure for every x ∈ S by construction. Themeasurability of x 7→ Px(B) follows from the formula (11.3) and Dynkin’s argument.

(b) Let A ∈ Fn, and let f : Sk+1 → R be a bounded measurable function. Then, by(11.3) and an easy computation,

EPµ [1Af(Xn, . . . , Xn+k)] = EPµ[1AE

PXn [f(X0, . . . , Xk)]].

In particular for every B ∈ B = ∪k≥0σ(X0, . . . , Xk)

EPµ [1A1B θn] = EPµ[1APXn [B]

].

B is a π-system generating F , so the last display actually holds for all B ∈ F . Hence,by definition of the conditional expectation,

EPµ [Y θn|Fn](ω) = PXn(ω)[B], Pµ-a.s.

The claim (b) then follows by the usual approximation procedure.(c) It is sufficient to write C ∈ σ(Xn, . . . ) as C = θn(B) for B ∈ F and apply the

claim (b).

As an easy corollary we get

Proposition 11.7 (Chapman-Kolmogorov equation). For every n,m ∈ N, x ∈ S andA ∈ S,

Px[Xm+n ∈ A] =

∫S

Px(Xn ∈ dy)Py(Xm ∈ A).

93

Page 98: Advanced probability theory - univie.ac.at

Proof. By Proposition 11.6(c),

Px(Xm+n ∈ A) = EPx [Px[Xm+n ∈ A|Fn]] = Ex[PXn(Xm ∈ A)]

=

∫S

Px(Xn ∈ Dy)Py(Xm ∈ A),

as claimed.

Proposition 11.6 deals with the ’future’ of Markov chains after deterministic times.We now extend this proposition to certain random times. Recall that a N ∪∞-valuedrandom variable T is called Fn-stopping time when T ≤ n ∈ Fn for every n ≥ 0.Given a stopping time T we define σ-algebra FT by

FT = A ∈ F : A ∩ T = n ∈ Fn for every n ≥ 0.

FT should be viewed as the σ-algebra describing the past relative to T .

Proposition 11.8 (strong Markov property). For every bounded random variable Yand a stopping time T

Eµ[Y θT |FT ] = EXT [Y ], Pµ-a.s. on T <∞.

In this proposition Y θT should be understood as(Y θT (ω)

)(ω) on T < ∞, and

zero otherwise. Similarly, EXT [Y ] is EXT (ω)(ω)[Y ] on T <∞.

Proof. We verify the two defining properties of the conditional expectation. On T =n, EXT [Y ] = EXn [Y ] which is Fn measurable. Therefore EXT [Y ] is FT -measurable.Moreover, for A ∈ FT ,

Eµ[Y θT1A∩T<∞

]=∑n≥0

Eµ[Y θT1A∩T=n

]=∑n≥0

[Y θn 1A∩T=n︸ ︷︷ ︸

∈Fn

]By the Markov property this equals

=∑n≥0

Eµ[EXn [Y ]1A∩T=n

]=∑n≥0

Eµ[EXT [Y ]1A∩T=n

]= Eµ

[EXT [Y ]1A∩T<∞

].

This completes the proof.

11.2 Invariant measures of Markov chains

We want to understand here the asymptotic behaviour of Markov chains. To simplifythe matter, we assume that the state space S is at most countable and S = P(S).

To make the situation even easier, we assume that the Markov chain (Xn) is irre-ducible, that is for every x, y ∈ S there is n ≥ 1 such that Px[Xn = y] > 0. Fromthe lecture ‘Stochastic processes’ you know that this is not very restrictive assumption:If X is not irreducible, it is possible to restrict it to certain subsets of S where it isirreducible.

94

Page 99: Advanced probability theory - univie.ac.at

Definition 11.9. A state x ∈ S is called recurrent if

Px[Xn = x for infinitely many x] = 1.

It is called transient if

Px[Xn = x for infinitely many x] = 0.

We first show that there is no other possibility. To this end, let Hx and Hx be thefirst entrance time, resp. hitting time of x,

Hx = infn ≥ 1 : Xn = x,Hx = infn ≥ 1 : Xn = x.

Theorem 11.10. The following dichotomy holds:

(i) If Px[Hx <∞] = 1, then x is recurrent and∑

n≥0 Px[Xn = x] =∞.

(ii) If Px[Hx <∞] < 1, then x is transient and∑

n≥0 Px[Xn = x] <∞.

Proof. Let Hkx , k ≥ 0 be the times of successive visits to x defined by H0

x = 0 and

(11.11) Hk+1x =

θHk

x Hx +Hk

x , on Hkx <∞,

∞, otherwise.

The key observation of the proof is the fact that the increments Hnx − Hn−1

x areessentially i.i.d. More precisely, define

Sn =

Hnx −Hn−1

x , on Hn−1x <∞,

0, otherwise.

Then, conditionally on Hn−1x <∞, Sn is independent of FHn−1

x, and

(11.12) P [Sn = k|Hn−1x <∞] = Px[Hx = k], k ∈ N ∪ ∞.

Indeed, this follows from the strong Markov property, observing that Sn = Hx θHn−1x

on Hn−1x <∞.

Let Nx be the total number of returns to x, Nx :=∑

n≥1 1Xn=x and set fx = Px[Hx <∞]. By (11.12),

Px[Nx ≥ k + 1] = Px[Hk+1x <∞] = Px[H

kx <∞, Sk+1 <∞]

= Px[Sk+1 <∞|Hkx <∞]Px[H

kx <∞] = Px[Hx <∞]Px[Nx ≥ k]

= fxPx[Nx ≥ k].

By easy induction argument then follows that for fx = 1 we have Px[Nx = ∞] = 1and thus x is recurrent. On the other hand, for fx < 1, the number of returns Nx isgeometrically distributed, P [Nx = k] = fkx (1− fx), and thus finite a.s.

Finally, the last parts of both claim follows from E[Nx] =∑

n≥1 Px[Xn = x].

95

Page 100: Advanced probability theory - univie.ac.at

For irreducible Markov chains the recurrence and transience are global properties:

Lemma 11.13. When (Xn) is irreducible and there is x ∈ S which is recurrent (ortransient), then all y ∈ S are recurrent (or transient).

Proof. It is sufficient to show that ‘x is recurrent’ implies ‘y is recurrent’ for all x, y ∈ S.By irreducibility, there is k, l > 0 such that Px[Xk = y] > 0 and Py[Xl = x] > 0. Further,by Chapman-Kolmogorov equation,

Py[Xk+l+n = y] ≥ Py[Xl = x]Px[Xn = x]Px[Xk = y].

Hence,∞∑

n=k+l

Py[Xn = y] ≥ Py[Xl = x]Px[Xk = y]∑n≥0

Px[Xn = x].

The sum on the right-hand side is infinite by Theorem 11.10(a), and thus the left-handside is infinite as well. Another application of Theorem 11.10 then yields the claim.

[[TODO: Examples]]

In order to understand the asymptotic behaviour of Markov chain the following objectplays the key role.

Definition 11.14. A measure π is called invariant for the Markov chain (Xn) withtransition kernel κ if π × κ = π. For countable S this is equivalent to

(11.15) π(y) =∑x∈S

π(x)Px[X1 = y].

When π is a probability measure, we call it invariant distribution.

We are interested in existence and uniqueness of invariant measures and distributions.The following proposition construct invariant measures in the recurrent case.

Proposition 11.16. Let nx(y) = Ex

[∑Hxn=1 1Xn=y

]be the mean number of visit to y

before returning to x. If x is recurrent, then

(i) nx(x) = 1,

(ii) nx(·) is an invariant measure for (Xn).

(iii) If X is irreducible, then nx(y) ∈ (0,∞), y ∈ S.

96

Page 101: Advanced probability theory - univie.ac.at

Proof. (i) is obvious from the definition. To show (ii) we write

nx(y) = Ex

[ Hx∑n=1

1Xn=y

]=∞∑n=1

Ex[1Xn = y, n ≤ Hx

]=∞∑n=1

∑z∈S

Px[Xn = y,Xn−1 = zn− 1 < Hx

]=∑z∈S

Pz[X1 = y]∞∑n=1

Px[Xn−1 = zn− 1 < Hx

]=∑z∈S

Pz[X1 = y]Ex

[ Hx−1∑n=0

1Xn−1 = z],

where on the last line we made a trivial change of variable. Since (Xn) is recurrent, Hx

is Px-a.s. finite and thus Px[X0 = XHx= x]. This implies that under Px

Hx−1∑n=0

1Xn−1 = zHx∑n=1

1Xn−1 = z.

Inserting this in the previous computation we see that

nx(y) =∑z∈S

Pz[X1 = y]Ex

[ Hx∑n=1

1Xn−1 = z]

=∑z∈S

Pz[X1 = y]nx(z).

This shows that nx(·) is invariant.To show (iii) observe first that using 11.15 inductively implies that every invariant

measure satisfies

(11.17) π(y) =∑x∈S

π(x)Px[Xk = y], k ≥ 1.

By irreducibility there is k, l > 0 such that Px[Xk = y] > 0 and Py[Xl = x] > 0. Hence,using (ii) and (11.17), nx(y) =

∑z Pz[Xk = y]nx(z) ≥ nx(x)Px(Xk = y) > 0. On the

other hand, 1 = nx(x) =∑

z nx(z)Pz(Xl = x) ≥ nx(y)Py(Xl = x) which easily yieldsnx(y) <∞.

To show the uniqueness of invariant measures we will need:

Lemma 11.18. Assume that π is an invariant measure such that π(x) = 1. Thenπ(y) ≥ nx(y) for all y ∈ S.

97

Page 102: Advanced probability theory - univie.ac.at

Proof. To simplify the notation we define pxy = Px[X1 = y]. Then, using repeatedly theinvariance of π,

π(z) =∑y1∈S

π(y1)py1z =∑y1 6=x

π(y1)py1z + pxz

=∑y1 6=x

∑y2 6=x

π(y2)py2y1py1z +∑y1 6=x

pxy1py1z + pxz

=∑

y1,...,yk 6=x

π(yk)pykyk−1. . . py1z +

∑y1,...,yk−1 6=x

pxyk−1. . . py1z + · · ·+ pxz

Using the fact that the first sum is always non-negative, this is bounded from below by

≥ Px[Xk = z, Hx ≥ k] + · · ·+ Px[X1 = z, Hx ≥ 1]

= Ex

[ k∑l=1

1Xl = z, Hx ≥ l].

Letting k tend to ∞ we obtain π(z) ≥ nx(z), completing the proof.

Corollary 11.19. Assume that (Xn) is irreducible and recurrent and π is its invariantmeasure with π(x) = 1. Then π(y) = nx(y) for all y ∈ S.

Proof. Since π and nx are invariant, also the measure λ = π−nx is invariant. Moreoverλ ≥ 0 and λ(x) = 0. However, 0 = λ(x) =

∑z∈S λ(z)Pz[Xl = x] ≥ λ(y)Py[Xl = x].

Using the irreducibility, we can fix l such that the last probability is positive, whichimplies λ(y) = 0 as well. Hence, nx = π as required.

We now turn our attention to existence and uniqueness of invariant distributions.

Definition 11.20. Let x be a recurrent state for a Markov chain (Xn). It is calledpositively recurrent when Ex[Hx] <∞. Otherwise it is called null-recurrent .

Theorem 11.21. Let (Xn) be irreducible. Then the following are equivalent:

(i) There is x ∈ S which is positively recurrent.

(ii) All x ∈ S are positively recurrent.

(iii) (Xn) has an invariant distribution π.

Moreover, π of (iii) is unique and given by π(y) = nx(y)

ExHx.

Proof. (ii) =⇒ (i) is trivial.(i) =⇒ (iii): Since x is positively recurrent and thus recurrent, nx is an invariant

measure. Moreover,∑

y∈S nx(y) =∑

y∈S Ex[∑Hx

n=1 1Xn = y] = Ex[Hx]. Therefore,nx(·)ExHx

is an invariant distibution.

98

Page 103: Advanced probability theory - univie.ac.at

(iii) =⇒ (ii): Let x ∈ S be arbitrary. Since π is a probability, there is y ∈ Swith π(y) > 0. By irreducibility, for some n > 0, Py[Xn = x] > 0 and thus π(x) ≥π(y)Py[Xn = x] > 0. Set λ(y) = π(y)/π(x). Then λ is invariant, λ(x) = 1 and thusλ ≥ nx by Lemma 11.18. Hence,

ExHx =∑y∈S

nx(y) ≤∑y∈S

λ(y) =∑y∈S

π(y)

π(x)=

1

π(x)<∞.

This implies that x is positively recurrent.It remains to show the unicity of π. Since x is recurrent, the inequality in the last

display is an equality by Corollary 11.19 and thus

π(x) =1

ExHx

.

This completes the proof.

We close this section by few examples and exercises illustrating the situation for null-recurrent and transient Markov chains.

Example 11.22. Let (Xn) be a simple random walk on Z. (Xn) is null-recurrent. Themeasure π(x) ≡ 1 is invariant and every other invariant measure must be its multiple,by Corollary 11.19. Hence, there is no invariant distribution. The situation is analogousfor every irreducible null-recurrent chain.

Example 11.23. Let (Xn) be a random walk with drift Px[X1 = x+ 1] = 1− Px[X1 =x − 1] = p > 1

2. Show that π(x) = A + B( p

1−p)x is invariant for every A > 0 and

B > 0 and (Xn) is transient. In particular transient Markov chain may posses invariantmeasures, and they do not need to form a one-parameter family as in the recurrent case.

Exercise 11.24. Let S = N and consider a Markov chain on S given by Px[X1 =x + 1] = 1 − 10−x, Px[X1 = 0] = 10−x, x > 1, and P0[X = 1] = 1. Show that (Xn) istransient and has no non-trivial invariant measure.

11.3 Convergence of Markov chains

In this section we assume that (Xn) is an irreducible positively recurrent Markov chainon an at most countable state space (S,S). By Theorem 11.21 it possesses a uniqueinvariant distribution π. We now investigate the asymptotic behaviour of (Xn).

Lemma 11.25. For every x, y ∈ S

limn→∞

1

n

n∑k=1

1Xk = y = π(y), Px-a.s.

99

Page 104: Advanced probability theory - univie.ac.at

Proof. Writing Hky for the time of k-th visit of y defined as in (11.11),

1

n

n∑k=1

1Xk = y =1

nmaxk ≥ 0 : Hk

y ≤ n.

Therefore, the claim of the lemma is equivalent to

(11.26) limk→∞

1

kHky =

1

π(y), Px-a.s.

Using the same arguments as in the proof of Theorem 11.10 together with the recurrenceof (Xn), we see that the random variables Sl = H l

y−H l−1y are independent and, moreover,

(Sl)l≥2 are i.i.d. with Px[Sl = k] = Py[Hy = k] for l ≥ 2. Therefore E[Sl] = Ey[Hy] =1

π(y), by Theorem 11.21. Since Hk

y =∑k

l=1 Sl, claim (11.26) follows by the strong law oflarge numbers.

Lemma 11.25 and dominated convergence imply

limn→∞

Ex

[ 1

n

n∑k=1

1Xk = y]

= π(y).

We now strengthen this Cesaro-type convergence.

Definition 11.27. For x ∈ S let T (x) = n ≥ 1 : Px[Xn = x] > 0 be the set of timeswhen return to x is possible. The period of x is the greatest common divisor gcd T (x).

Exercise 11.28. If X is irreducible and x, y ∈ S, then gcd T (x) = gcd T (y).

Definition 11.29. A Markov chain (Xn) is called aperiodic when gcd T (x) = 1 for allx ∈ S.

Exercise 11.30. If (Xn) is irreducible and aperiodic then for every x, y ∈ S there isn ≥ 0 such that Px[Xm = y] > 0 for all m ≥ n.

[[TODO: Provide a proof]]

Theorem 11.31. Let X be irreducible and aperiodic with invariant distribution π. Thenfor every x, y ∈ S

limn→∞

Px[Xn = y] = π(y).

Proof. We use a coupling argument. On some probability space (Ω,A, P ) define twoindependent Markov chain (Xn)n≥0 and (Yn)n≥0 with respective distributions Px andPπ. Since π is invariant, P [Yn = y] = π(y). Fix now z ∈ S and set T = infn ≥ 0, Xn =yn = z.

We first claim that P [T < ∞] = 1. To see this, observe that Wn = (Xn, Yn) is aMarkov chain on S × S. Using Exercise 11.30, it is not difficult to see that Wn is again

100

Page 105: Advanced probability theory - univie.ac.at

irreducible. Moreover π(x, y) = π(x)π(y) is invariant distribution for (Wn). ThereforeW is positively recurrent, which implies the claim.

We now construct a new process (Zn) by

Zn =

Xn, if n ≤ T,

Yn, if n > T.

Since XT = YT , it is not hard to see that (Zn) is Px-distributed, similarly as (Xn).Hence,

Px[Xn = y] = P [Zn = y] = P [Xn = y, n ≤ T ] + P [Yn = y, n > T ],

and thus ∣∣Px[Xn = y]− π(y)∣∣ ≤ 2P [n ≤ T ]

n→∞−−−→ 0.

This completes the proof.

[[TODO: add more detail here:mixing times, examples]]

101

Page 106: Advanced probability theory - univie.ac.at

12 Brownian motion and Donsker’stheorem

In this chapter we introduce one of the most important stochastic processes in continuoustime, the Brownian motion. We construct it as a suitable limit of rescaled trajectoriesof the simple random walk.

12.1 Space C([0, 1])

In order to construct the Brownian motion, we shell understand the weak convergence ofprobability measures on the space C = C([0, 1],R) endowed with the sup-norm ‖w‖ =supt∈[0,1] |w(t)| and the corresponding metric. It is well known that C is separable.

Lemma 12.1. The Borel σ-algebra B(C) is generated by the system Z of cylinder sets

Z = w ∈ C : w(ti) ∈ Ai, i = 1, . . . , n where n ∈ N, 0 ≤ t1 ≤ · · · ≤ tn, Ai ∈ B(R).

More over Z is a π-system.

Proof. To see that B(C) ⊃ σ(Z) it is sufficient to observe that the map w ∈ C 7→ w(t) ∈R is continuous for every t ∈ [0, 1] and thus the sets of the form w ∈ C : w(t) ∈ Awith A ∈ B(R), and thus all elements of Z, are contained in B(C).

To show that B(C) ⊂ σ(Z), recall first that as C is separable, every open set is acountable union of open balls. Moreover, every open ball Uδ(w) ⊂ C with w ∈ C, δ > 0can be written as ∪n∈NUδ− 1

n(w), so every open set is a countable union of closed balls.

By continuity,

Uδ(w) = w′ ∈ C : ‖w − w′‖ ≤ δ

=⋂n∈N

w′ ∈ C : |w(i/n)− w′(i/n)| ≤ δ, i = 0, . . . , n ∈ σ(Z).

Hence every open set of C is contained in σ(Z) and thus B(C) ⊂ σ(Z).The last claim of the lemma is obvious.

We now consider the measurable space (C,B(C)). Due to Lemma 12.1 and Dynkin’slemma, every probability measure µ is determined by its finite-dimensional marginals(πt1,...,tn)#µ, where πt1,...,tn are the natural projections from C to Rn given by πt1,...,tn(w) =(wt1 , . . . , wtn).

If (µn) is a sequence of probability measures on (C,B(C)) converging weakly to µ, thenobviously also the corresponding finite-dimensional marginals converge, (πt1,...,tn)#µn

w−→(πt1,...,tn)#µ for all 0 ≤ t1 ≤ · · · ≤ tn ≤ 1. The converse is false.

102

Page 107: Advanced probability theory - univie.ac.at

Example 12.2. Consider µ = δw, µn = δwn , where w ≡ 0 and wn is piecewise linearwith wn(0) = 0, wn(1/n) = 1, wn(2/n) = 0, wn(1) = 0. Since wn → w pointwise, itfollows that (πt1,...,tn)#µn

w−→ (πt1,...,tn)#µ for every ti’s. On the other hand ‖w−wn‖ = 1

and thus µn 6w−→ µ.

Recalling the theory of weak convergence, what the above example is missing is thetightness. Actually, as the direct consequence of Prokhorov’s theorem (Theorem 6.26)and Remark 6.27 we obtain:

Theorem 12.3. Let µn and µ be probability measures on (C,B(C)). Then the followingare equivalent:

(i) µnd−→ µ.

(ii) the sequence (µn) is tight and all finite-dimensional marginals of µn converge weaklyto those of µ.

We thus need to develop a tightness criterion on C. Define for w ∈ C and δ > 0 themodulus of continuity of w,

ω(w, δ) := sup|w(s)− w(t)| : s, t ∈ [0, 1], |t− s| ≤ δ.

Recall that for every w ∈ C, limδ→0 ω(w, δ) = 0, and that w 7→ ω(w, δ) is continuous.The following well known theorem characterises relatively compact sets of C

Theorem 12.4 (Arzela-Ascoli). A set A ⊂ C is relatively compact iff

supw∈A|w(0)| <∞ and lim

δ→0supw∈A

ω(w, δ) = 0.

We can now characterise the tightness in C:

Theorem 12.5. A set M⊂M1(C) is tight iff the following two conditions holds:

(a) The set (π0)#µ : µ ∈M of 0-marginals is tight in R, that is

limK→∞

supµ∈M

µ(w : |w(0)| ≥ K) = 0,

(b) As δ → 0, the modulus of continuity converges to 0 in probability, uniformly overµ ∈M, that is

limδ→0

supµ∈M

µ(w : ω(w, δ) ≥ η) = 0 for all η > 0.

Proof. Assume first that the two conditions of the theorem holds. Since (π0)#µ : µ ∈M is tight, for every ε > 0 there is c(ε) <∞ such that

infµ∈M

(π0)#µ[−c(ε), c(ε)] ≥ 1− ε.

103

Page 108: Advanced probability theory - univie.ac.at

Moreover, using the second condition, we can find a sequence δk with

infµ∈M

µ(w : ω(w, δk) ≤

1

k

)≥ 1− ε2−k.

Set K(ε) = w : |w0| ≤ c(ε)∩⋂k≥1w : ω(w, δk) ≤ 1

k. Then, by the last two estimates,

for every µ ∈M,

µ(Kcε) ≤ ε+

∞∑k=1

ε2−k ≤ 2ε.

Since w 7→ ω(w, δ) is continuous, the set Kε is closed, and by construction it satisfiesthe conditions of Arzela-Ascoli theorem. Hence Kε is compact, and thus M is tight.

Assume now that M is tight. Hence, for every ε > 0 there is a compact K ⊂ Csuch that supµ∈M(Kc) ≤ ε. Let b = supw∈K |w0| < ∞, by Arzela-Ascoli theorem.Therefore, infµ∈M(π0)#µ([−b, b]) ≥ µ(K) ≥ 1− ε, that is ((π0)#µ)µ∈M is tight and thefirst condition hold. Further, since K is compact, by Arzela-Ascoli theorem again,supw∈K ω(w, δ) ≤ η for all δ small enough. Hence, for all δ small supµ∈M µ(w :ω(w, δ) ≥ η) ≤ µ(Kc) ≤ ε, and the second condition follows easily.

Exercise 12.6. Assume that M is a sequence (µn)n≥0. To verify the tightness it issufficient to check

limK→∞

lim supn→∞

µn(w : |w(0)| ≥ K) = 0

andlimδ→0

lim supn→∞

µn(w : ω(w, δ) ≥ η) = 0 for all η > 0.

12.2 Brownian motion

Definition 12.7. Brownian motion is a R-valued stochastic process (Bt)t≥0 on someprobability space (Ω,A, P ) such that

(i) B0 = 0, P -a.s.

(ii) For every n ∈ N and 0 = t0 < t1 < · · · < tn, the increments Bt1 − Bt0 , . . . ,Btn −Btn−1 are independent random variables.

(iii) For all t ≥ 0 and s > 0, Bt+s −Bt is N (0, s) distributed

(iv) For P -a.e. ω, the trajectory of the process t 7→ Bt(ω) is a continuous function.

There are many ways how to construct a Brownian motion, in particular differentprobability spaces can be used. On the other hand, as we will see now, the conditions(i)–(iv) determine the distribution of Brownian motion uniquely. One way to see this isto construct the so-called canonical Brownian motion.

To this end, let C = C([0,∞),R) endowed with canonical coordinates Xt : C → R,w ∈ C 7→ Xt(w) = wt, and the canonical σ-algebra F = σ(Xt, t ≥ 0). Consider

104

Page 109: Advanced probability theory - univie.ac.at

now an arbitrary Brownian motion constructed on a probability space (Ω,A, P ). ByDefinition 12.7(iv), we can find a P -negligible set N such that the trajectories of B arecontinuous on Ω \N . Define now the a map B by(

Ω \N,A ∩ (Ω \N)) (−→ C,F)

ω 7→ (t 7→ Bt(ω)).

Exercise 12.8. Check the measurability of this map.

Let W be the image of P , restricted to Ω \N , under this map, W = B#P .

Theorem 12.9 (uniqueness of Brownian motion). (a) The measure W on (C,F) isuniquely determined, and is called the Wiener measure.

(b) The process (Xt(w))t≥0 on (C,F ,W ) is also a Brownian motion, called canonicalBrownian motion.

Sketch of the proof. (a) Fix 0 = t0 < t1 < · · · < tn and a bounded measurable functionh : Rn+1 → R. Then by definition of W and condition Definition 12.7(i)–(iii),

EW [h(Xt0 , . . . , Xtn)] = EP [h(Bt0 , . . . , Btn)]

= EP [h(Bt0 , Bt1 −Bt0 , . . . , Btn −Btn−1 − · · ·+Bt1 −Bt0)]

=

∫Rnh(0, y1, y1 + y2, . . . , y1 + · · ·+ yn)

n∏i=1

1

(2π(ti − ti−1))1/2e−y

2i /2(ti−ti−1)dyi.

Taking D ∈ B(Rn+1) and h = 1D, this determines W (Xt0 , . . . , Xtn ∈ D). The class ofsets of this form is a π-system generating F , so W is uniquely determined. This provesclaim (a). The claim (b) is then obvious.

12.3 Donsker’s theorem

To show that a Brownian motion exist, we constuct it as a weak limit of random walktrajectories. For sake of simplicity we restrict the time to the interval [0, 1] first, wecomment on extension to t ∈ R+ later.

Let C = C([0, 1],R) endowed with canonical coordinates Xt, t ∈ [0, 1] and the canon-ical σ-algebra F , constructed as previously.

Let (ξi)i≥1 be an i.i.d. sequence on a probability space (Ω,A, P ) such that Eξi = 0,Eξ2

i = 1. Set S0 = 0 and Sn = ξ1 + · · · + ξn, n ≥ 1. For t ∈ [0,∞) \ N define St bypolygonal interpolation

St = Sbtc + (t− btc)ξbtc+1,

and consider its rescaling

(12.10) Bnt =

1

nStn2 , t ∈ [0, 1], n ∈ N.

Obviously, for every ω ∈ Ω, t 7→ Bnt (ω) is a random element of C. Let µn = (Bn

· )#P bethe distribution of Bn

· on (C,F).

105

Page 110: Advanced probability theory - univie.ac.at

Theorem 12.11 (Donsker). Under the above assumptions, the sequence of measures µnconverges weakly on (C, ‖.‖∞) to a measure µ which is the Wiener measure restricted to[0, 1].

Eventually, we will apply Theorems 12.3 and 12.5 to show this theorem. To check theconvergence of finite-dimensional marginals we need the following general lemma.

Lemma 12.12. Let (Un)n∈N be a sequence of random variables on (Ω,A, P ) with values

in a normed separable vector space (S, ‖ · ‖) such that Und−→ [n→∞]U .

(a) If (Vn)n∈N is another sequence of S-valued random variables with VnP−→ [n→∞]0,

then Un + Vnd−→ [n→∞]U .

(b) If (Cn)n∈N is a sequence of R-valued random variables with CnP−→ [n→∞]c where

c is a constant, then CnVnd−→ [n→∞]cU .

Proof. (a) Since S is separable, B(S×S) = B(S)⊗B(S), so Un+Vn is a random variable.For h bounded and uniformly continuous

|Eh(Un + Vn)− Eh(U)| ≤ |Eh(Un + Vn)− Eh(Un)|+ |Eh(Un)− Eh(U)|.

The second term tends to 0 since Und−→ U . For the first term, observe that

|Eh(Un + Vn)− Eh(Un)| ≤ supx,z:‖x−z‖≤δ

|h(x)− h(z)|+ 2‖h‖∞P [‖Vn‖ ≥ δ].

Here, the first term can be made arbitrarily small by choosing δ, since h is uniformlycontinuous, and the second term converges to 0 for every δ, since Vn converge to 0 inprobability. Portmanteau theorem (Theorem 6.12) then implies Un + Vn → U .

The proof of (b) is analogous and is left as an exercise.

We now can verify the convergence of finite-dimensional marginals, as required byTheorem 12.3.

Proposition 12.13. The finite-dimensional marginals of µn converge to the correspond-ing marginals of µ.

Remark 12.14. Observe that the statement does not rely on the existence of theWiener measure but only of its finite-dimensional marginals, which is obvious fromDefinition 12.7.

Proof of Proposition 12.13. Consider first the one-dimensional marginals. Fix t ∈ [0, 1].Then (Xt)#µ is the normal distribution N (0, t), and (Xt)#µn is the distribution of

Bnt =

1

n

(Sbtn2c + (tn2 − btnwc) + ξbtn2c+1

)=

1

n

Sbtn2c

btn2c+tn2 − btn2c

nξbtn2c+1.

106

Page 111: Advanced probability theory - univie.ac.at

By the central limit theorem, the second fraction in the first term converges in distribu-tion to a standard normal random variable. Moreover, for every δ > 0,

P[∣∣∣tn2 − btn2c

nξbtn2c+1

∣∣∣ > ε]≤ P [|ξ1| ≥ εn]

n→∞−−−→ 0,

that is the second term converges to 0 in probability. Applying Lemma 12.12 severaltimes then proves the convergence of one-dimensional marginal.

For the general case fix 0 = t0 < · · · < tn ≤ 1 and argue as previously for the vector(Bn

ti−Bn

( ti−1))1≤i≤n. Using the formula

Bnti−Bn

ti−1=

1

n

bn2tic∑j=bn2ti−1c+1

ξj+1

n

(tin

2−btin2c)ξbtin2c+1.−(ti−1n2−bti−1n

2c)ξbti−1n2c+1

.

The components of the vector (∑bn2tic

j=bn2ti−1c+1 ξj)1≤i≤n are independent and converge, bythe central limit theorem again, to normal variables with corresponding variances. Thesecond term in the last display is a perturbation and can be treated as previously.

To check the tightness of the sequence µn we need another lemma.

Lemma 12.15 (Ottaviani). Let U1, . . . , UN be centred independent random variableswith

∑Ni=1 VarUi ≤ c2. Then for Zk = U1 + · · ·+ Uk, α ≥ c,

P [maxk≤N|Zk| ≥ 2α] ≤ 1

1− c2

α2

P [|ZN | > α].

Proof. Set T = infk ≤ N : |Zk| ≥ 2α. On T ≤ N, |ZT | > 2α. As T = k andZN − Zk are independent,

P [|ZN | > α] ≥ P [T ≤ N, |ZN − ZT | ≤ α] =N∑k=1

P [T = k]P [|ZN − Zk| ≤ α].

By Chebyshev inequality, P [|ZN − Zk| > α] ≤ α−2 Var(ZN − Zk) ≤ c2/α2, so

P [|ZN | > α] ≥(1− c2

α2

)P [T ≤ N ] =

(1− c2

α2

)P [max

k≤N|Zk| > 2α].

This completes the proof.

Proposition 12.16. The sequence (µn) of Theorem 12.11 is tight.

Proof. As Bn0 = 0, (a) of Theorem 12.5 holds automatically. To check (b) of the same

theorem, we need to show that

(12.17) limδ→0

lim supn→∞

P [ sups≤t≤s+δ

|Bnt −Bn

s | ≥ η] = 0 for all η > 0.

107

Page 112: Advanced probability theory - univie.ac.at

For s ∈ [kδ, (k+1)δ) and s ≤ t ≤ s+δ, either t ∈ [kδ, (k+1)δ) or t ∈ [(k+1)δ, (k+2)δ).In the first case

|Bnt −Bn

s | ≤ |Bnt −Bn

kδ|+ |Bns −Bn

kδ|.In the second case

|Bnt −Bn

s | ≤ |Bnt −Bn

(k+1)δ|+ |Bns −Bn

kδ|+ |Bn(k+1)δ −Bn

kδ|.

Putting the two together, we see that

sups≤t≤s+δ

|Bnt −Bn

s | ≤ 3 supk≤1/δ

supt∈[kδ,(k+1)δ]

|Bnt −Bn

kδ|,

and thus

P [ sups≤t≤s+δ

|Bnt −Bn

s | ≥ η] ≤ P [ supk≤1/δ

supt∈[kδ,(k+1)δ]

|Bnt −Bn

kδ| ≥ η/3]

≤1/δ∑k=0

P [ sup0≤t≤δ

|Bnkδ+t −Bn

kδ| ≥ η/3].

Set j = bkδn2c and m = b2δn2c, and tak n such that n2 ≥ 2δ−1. Since Bn· is a polygonal

interpolation of S·, we have

sup0≤t≤δ

|Bnkδ+t −Bn

kδ| ≤ maxl≤m

1

n|Sj+l − Sj| ∨max

l≤m

1

n|Sj+l − Sj+1|.

Hence,

P[

sup0≤t≤δ

|Bnkδ+t −Bn

kδ| ≥η

3

]≤ 2P

[maxl≤m

1

n|Sl| ≥

η

3

].

Using Lemma 12.15 with Ui = 1n√

2δξi, Zk = 1

n√

2δSk, α = η

6√

2δand c = 1, this can be

bounded from above by

1

1− 72δη2

P[ |Sm|√

m≥ η

6√

]n→∞−−−→ c(η)

∫ ∞η

6√2δ

1√2π

e−y2/2dy.

It follows that, for δ small, using the Gaussian tail estimate∫∞u

e−y2/2dy ≤ u−1e−u

2/2 ≤e−u

2/2 for u ≥ 1,

lim supn→∞

P [ sups≤t≤s+δ

|Bnt −Bn

s | ≥ η] ≤ 1

δc(η)

∫ ∞η

6√2δ

1√2π

e−y2/2dy

≤ 1

δe−η

2/144δ δ→0−−→ 0.

This proves (12.17) and completes the proof of the proposition.

Proof of Theorem 12.11. The theorem follows from Propositions 12.13 and 12.16, usingTheorems 12.3 and 12.5.

[[TODO: extension to [0,∞).]]

108

Page 113: Advanced probability theory - univie.ac.at

12.4 Some applications of Donsker’s theorem

We will show how Donsker’s theorem can be used to claim some properties of Brownianmotion. As above, µn denotes the distribution of the Bn on C[0, 1] and µ is the Wienermeasure restricted to this space.

We start with a general lemma on weak convergence.

Lemma 12.18. Let ν, (νn)n≥1 be measures on some metric space (S, d) (endowed withthe Borel σ-field) and F a continuous map from S to another metric space (S, d). When

νnw−→ ν on (S, d), then F#νn

d−→ F#ν.

Proof. Let f ∈ Cb(S). Then, f F ∈ Cb(S) and thus∫S

fd(F#νn) =

∫S

f Fdνnn→∞−−−→

∫f Fdν =

∫fd(F#ν),

proving the weak convergence.

We now use Donsker’s theorem to determine the distribution of the supremum ofBrownian motion M = supBt : t ∈ [0, 1].

Proposition 12.19. P [M ≤ z] = P [|B1| ≤ z] = 2Φ(z)− 1, where Φ(z) is the distribu-tion function of standard normal distribution.

Proof. Let F : (C[0, 1], ‖ · ‖∞)→ (R, | · |) be the continuous map F (w) := supw(t), t ∈[0, 1]. By Lemma 12.18, and Portmanteau’s theorem, as ϕ is continuous,

P [M ≤ z] = (F#ν)((−∞, z])= lim

n→∞(F#νn)((−∞, z])

= limn→∞

P[

supk≤n2

Skn≤ z].

(12.20)

We estimate the last probability in the case when Sn is the simple random walk:

Claim 12.21 (Reflection for the SRW). Let Mn = max0≤n Sn. Then

P [Mn ≥ r, Sn = v] =

P [Sn = v], v ≥ r ≥ 0,

P [Sn = 2r − v], v < r, r ≥ 0.

In particular, P [Mn ≥ r] = 2P [Sn > r] + P [Sn = r].

Proof. The case v ≥ r ≥ 0 is obvious. For v < r we consider the map ϕ that “reflectsany SRW trajectory after its first visit to r”, see the Figure 12.1.

It is easy to see that ϕ is bijection of ω ∈ Ω,maxi≤n Si ≥ r, Sn = v and ω ∈ Ω :Sn = 2r − v. As every path of length n has the same probability 2−n we see thatP [Mn ≥ r, Sn = v] = P [Sn = 2r − v].

The last claim follows by summing over all possible terminal values.

109

Page 114: Advanced probability theory - univie.ac.at

[[TODO: Figure]]

Figure 12.1: Illustration of reflection principle

Using Claim 12.21, (12.20) equals

limn→∞

P[ 1

nMn2 ≤ z

]= lim

n→∞

(1− P [Sn2 = nz]− 2P [Sn2 > nz]

)= 2Φ(z)− 1,

by the central limit theorem. This completes the proof.

Remark 12.22. Inverting the reasoning of (12.20), it can be shown that for arbitraryincrements Xi’s as in Donsker’s theorem, limn→∞ P [ 1

nmax1≤n2 Si ≤ z] = 2ϕ(z) − 1,

transforming the result for the simple random walk to an asymptotic result holding foran “arbitrary” random walk.

For further applications of Donsker’s theorem we need to extend Lemma 12.18 to somenon-continuous functions.

Lemma 12.23. Let F : (S, d)→ (S, d) be measurable, and ν, (νn) probability measuresas in Lemma 12.18. Then

(i) The set of discontinuity points of F is measurable, that is

DF := x ∈ S : F is not continuous at x ∈ B(S).

(ii) When νnw−→ ν and ν(DF ) = 0 (that is F is ν-a.s. continuous), then F#νn

d−→ F#ν.

Proof. (i) Observe that DF =⋃n≥1

⋂k≥1An,k, where

An,k = x ∈ S : ∃y, z ∈ S with d(x, y) ≤ k−1, d(x, z) ≤ k−1d(F (y), F (z)) ≥ n−1

=⋃

y,z∈S,d(F (y),F (z))≥ε

U1/k(y) ∩ U1/k(z),

where Uδ(x) denotes the open δ-neighbourhood of x. Since any union of open sets isopen, An,k is open and thus measurable. Therefore DF ∈ B(S), as claimed.

(ii) Let A ⊂ S closed. Observe that

F−1(A) ⊂ F−1(A) ∪DF

By Portmanteau’s theorem, since νnw−→ ν,

lim supn→∞

(F#ν)(A) ≤ lim supn→∞

νn(F−1(A)) ≤ ν(F−1(A))

≤ ν(F−1(A)) + ν(DF ) = ν(F−1(A)).

The claim then follows using Portmanteau’s theorem once more.

110

Page 115: Advanced probability theory - univie.ac.at

Proposition 12.24. Let L = supt ∈ [0, 1] : B(t) = 0 ∨ 0 be the time of the last visitto 0 before time 1 by a BM. Then

P [L ≤ z] =2

πarcsin

√z.

Proof. Let F : C[0, 1] → R, be given by F (w) = supt ≤ 1, w(t) = 0 ∨ 0. F is notcontinuous, but

Claim 12.25. F is µ-a.s. continuous on (C[0, 1], ‖·‖∞) with µ being the Wiener measure.

Proof. Observe that if F is discontinuous at w, then for some ε > 0 the function w musthave the same sign on intervals (F (w)− ε, F (w)) and (F (w), 1]. Define(12.26)

A± = w ∈ C[0, 1] : x ≷ 0 on both (F (w), 1] and (F (w)− ε, F (w)) for aε > 0.

Then DF ⊂ A+ ∪ A−. By choosing r ∈ Q “shortly before F (w)” and looking on theFigure 12.2 we see that

[[TODO: Figure]]

Figure 12.2: XXX

A− ⊂⋃

r∈[0,1]∩Q

maxr≤s≤1

w(s)− w(r) = −w(r).

For r fixed, under µ, w(s) − w(r) is a Brownian motion independent of w(r) (this isMarkov property for BM, see later). Lemma 12.19 implies that maxr≤s≤1w(s) − w(r)has an absolutely continuous distribution, that is µ[maxr≤s≤1w(s)−w(r) = −w(r)] = 0.Hence µ(A−) = 0. Similar argumentation yields also µ(A+) = 0 and thus µ(DF ) = 0.

We now determine the distribution of L under µn, in the simple random walk caseagain. We proceed by three claims:

Claim 12.27. P (S1 6= 0, . . . , S2n 6= 0) = P (S2n = 0).

Proof. It is not hard to see that

P [S1 6= 0, . . . , S2n 6= 0] = 2P [S1 > 0, . . . , S2n > 0]

= 2∑r≥1

P [S1 > 0, . . . , S2n−1 > 0, S2n = 2r].

By a variant of reflection principle, the number of paths from (1, 1) to (2n, 2r) inter-secting the x-axis is the same as the number of paths from (+1,−1) to (2n, 2r), seeFigure 12.3

111

Page 116: Advanced probability theory - univie.ac.at

[[TODO: Figure]]

Figure 12.3: XXX

We thus see

P [S1 6= 0, . . . , S2n 6= 0]

= 2∑r≥1

1

2· 2−(2n−1)#(paths from (1, 1) to (2n, 2r) not touching the x-axis)

=∑r≥1

2−(2n−1)#(paths from (1, 1) to (2n, 2r))−#(paths from (1,−1) to (2n, 2r))

=∑r≥1

P (S2n−1 = 2r − 1)− P (S2n−1 = 2r + 1)

= P (S2n−1 = 1) = P (S2n = 0),

completing the proof.

Claim 12.28. Let Ln = maxm ≤ n : Sm = 0 ∨ 0. Then P (L2m = 2k) = u2ku2n−2k

with u2k = P [S2k = 0].

Proof. Use Markov property and Claim 12.27.

Claim 12.29. For every z ∈ [0, 1],

P[L2n

2n≤ z]

n→∞−−−→∫ z

0

π−1(x(1− x))−1/2dx.

Proof. Easy combinatorics and Stirling formula imply that u2k = 2−2k(

2k,k∼

)1√πk

as k →∞. Hence when k

n→ x, then nP (L2n = 2k) → Π−1(x(1 − x))−1/2. Thus, for 0 < a <

b < 1,

P [a ≤ L2n

2n≤ b] =

bnbc∑k=dnae

P [L2n = 2k] ∼∫ b

a

Π−1(x(1− x))−1/2 dx,

by dominated convergence theorem, and the claim follows.

Proposition12.24 follows from the last three claims using Donsker’s theorem andLemma 12.23. We only need to observe that

intz0π−1(x(1− x))−1/2 dx =

∫ √z0

2

π(1− y2)−1/2 dy =

2

piarcsin(

√z).

Corollary 12.30. As in Remark 12.22, we have for an “arbitrary RW”

limn→∞

P[Ln ≤ zn] =2

πarcsin(

√z) z ∈ [0, 1].

112

Page 117: Advanced probability theory - univie.ac.at

Bibliography

[AS08] Noga Alon and Joel H. Spencer, The probabilistic method, third ed., Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley &Sons Inc., Hoboken, NJ, 2008, With an appendix on the life and work of PaulErdos. MR 2437651

[Dur10] Rick Durrett, Probability: theory and examples, fourth ed., Cambridge Series inStatistical and Probabilistic Mathematics, Cambridge University Press, Cam-bridge, 2010, downloadable at http://www.math.duke.edu/~rtd/PTE/pte.

html. MR 2722836

113

Page 118: Advanced probability theory - univie.ac.at

Index

λ-system, 15π-system, 15σ-additive, 5σ-algebra, 5

moment, 13

adapted, 67aperiodic, 100

Borel σ-algebra, 6

canonical Brownian motion, 104, 105Characteristic function, 52Chebyshev’s inequality, 11conditional expectation, 61convergence in probability, 29convergence set, 22covariance, 13cylinder σ-algebra, 7

discrete stochastic integral, 71distribution, 8distribution function, 8Dynkin system, 15Dynkin’s lemma, 15

events, 5expectation, 10

filtration, 67finite-dimensional cylinder, 6finite-dimensional marginals, 102first entrance time, 95

generator, 7

hitting time, 95

independent, 14invariant distribution, 96invariant measure, 96irreducible, 94

Jensen’s inequality, 11

Kolmogorov’s inequality, 25

Markov chain, 92martingale, 67modulus of continuity, 103

normal distribution, 6null-recurrent, 98

outcome, 5

period, 100Poisson distribution, 6positively recurrent, 98probability measure, 5probability space, 5product measure, 7Product space, 6

random variable, 7recurrent, 95regular conditional distribution, 65

semi-direct product, 85stochastic kernel, 65stopped σ-algebra, 94stopping time, 72subadditive, 37submartingale, 67supermartingale, 67

114

Page 119: Advanced probability theory - univie.ac.at

tail σ-algebra, 22tight, 48transient, 95

uncorrelated, 19uniformly integrable, 80upcrossings, 73

vague convergence, 47variance, 13

Wiener measure, 105

115


Recommended