Probability Theory

Matthias Löwe

Academic Year 2001/2002

Contents

1 Introduction
2 Basics, Random Variables
3 Expectation, Moments, and Jensen's inequality
4 Convergence of random variables
5 Independence
6 Products and Sums of Independent Random Variables
7 Infinite product probability spaces
8 Zero-One Laws
9 Laws of Large Numbers
10 The Central Limit Theorem
11 Conditional Expectation
12 Martingales
13 Brownian motion
13.1 Construction of Brownian Motion
14 Appendix

1 Introduction

Chance, luck, and fortune have been at the centre of the human mind ever since people started to think. At the latest when people began to play cards or roll dice for money, there has also been a desire for a mathematical description and understanding of chance. Experience tells us that even in a situation that is governed purely by chance there seems to be some regularity. For example, in a long sequence of fair coin tosses we will typically see "about half of the time" heads and "about half of the time" tails. This was formulated already by Jakob Bernoulli (published 1713) and is called a law of large numbers. Not much later the French mathematician de Moivre analyzed how much the typical number of heads in a series of n fair coin tosses fluctuates around n/2. He thereby discovered the first form of what nowadays has become known as the Central Limit Theorem.

However, a mathematically "clean" description of these results could not be given by either of the two authors. The problem with such a description is that, of course, the average number of heads in a sequence of fair coin tosses will only typically converge to 1/2. One could, e.g., imagine that we are rather unlucky and toss only heads. So the principal question was: "what is the probability of an event?" The standard idea in the early days was to define it as the limit of the relative frequency of occurrences of the event in a long row of typical experiments. But here we are back again at the original problem of what a typical sequence is. It is natural that, even if we can define the concept of typical sequences, they are very difficult to work with. This is why Hilbert, in his famous talk at the International Congress of Mathematicians 1900 in Paris, mentioned the axiomatic foundation of probability theory as the sixth of his 23 open problems in mathematics.

This problem was solved by A. N. Kolmogorov in 1933 by ingeniously making use of the newly developed field of measure theory: a probability is understood as a measure on the space of all outcomes of the random experiment. This measure is chosen in such a way that it has total mass one.

From this starting point the whole framework of probability theory was developed: from the laws of large numbers over the Central Limit Theorem to the very new field of mathematical finance (and many, many others).

In this course we will meet the most important highlights of probability theory and then turn towards stochastic processes, which are basic for mathematical finance.

2 Basics, Random Variables

As mentioned in the introduction, Kolmogorov's concept understands a probability on a space of outcomes as a measure with total mass one on this space. So in the framework of probability theory we will always consider a triple (Ω, F, P) and call it a probability space:

Definition 2.1 A probability space is a triple (Ω, F , P), where

• Ω is a set, ω ∈ Ω is called a state of the world.


• F is a σ-algebra over Ω, A ∈ F is called an event.

• P is a measure on F with P(Ω) = 1, P(A) is called the probability of event A.

A probability space can be considered as an experiment we perform. Of course, in an experiment in physics one is not always interested in measuring everything one could in principle measure. Such a measurement in probability theory is called a random variable.

Definition 2.2 A random variable X is a mapping

X : Ω → Rd

that is measurable. Here we endow Rd with its Borel σ-algebra B d.

The important fact about random variables is that the underlying probability space (Ω, F, P) does not really matter. For example, consider the two experiments

Ω1 = {0, 1}, F1 = P(Ω1), and P1({0}) = 1/2

and

Ω2 = {1, 2, . . . , 6}, F2 = P(Ω2), and P2({i}) = 1/6, i ∈ Ω2.

Consider the random variables

X1 : Ω1 → R, ω ↦ ω

and

X2 : Ω2 → R, ω ↦ 0 if ω is even, 1 if ω is odd.

Then

P1(X1 = 0) = P2(X2 = 0) = 1/2

and therefore X1 and X2 have the same behavior even though they are defined on completely different spaces. What we learn from this example is that what really matters for a random variable is the "distribution" P ∘ X⁻¹:

Definition 2.3 The distribution PX of a random variable X : Ω → R^d is the following probability measure on (R^d, B^d):

PX(A) := P(X ∈ A) = P(X⁻¹(A)), A ∈ B^d.

So the distribution of a random variable is its image measure in Rd.

Example 2.4 Important distributions of random variables X (which we have already met in introductory courses in probability and statistics) are


• The Binomial distribution with parameters n and p, i.e. a random variable X is Binomially distributed with parameters n and p ( B(n, p)-distributed), if

P(X = k) = (n choose k) p^k (1 − p)^(n−k), 0 ≤ k ≤ n.

The binomial distribution is the distribution of the number of 1’s in n independent coin tosses with success probability p.

• The Normal distribution with parameters µ and σ2, i.e. a random variable X is normally distributed with parameters µ and σ2 ( N (µ, σ2)-distributed), if

P(X ≤ a) = (1/√(2πσ²)) ∫_{−∞}^{a} exp(−(1/2)((x − µ)/σ)²) dx.

Here a ∈ R.

• The Dirac distribution with atom in a ∈ R, i.e. a random variable X is Dirac distributed with parameter a, if

P(X = b) = δ_a(b) = 1 if b = a, and 0 otherwise.

Here b ∈ R.

• The Poisson distribution with parameter λ > 0, i.e. a random variable X is Poisson distributed with parameter λ (P(λ)-distributed), if

P(X = k) = λ^k e^{−λ} / k!,  k ∈ N_0 = N ∪ {0}.

• The Multivariate Normal distribution with parameters µ ∈ R^d and Σ, i.e. a random variable X is normally distributed in dimension d with parameters µ and Σ (N(µ, Σ)-distributed), if µ ∈ R^d, Σ is a symmetric, positive definite d × d matrix and for A = (−∞, a1] × . . . × (−∞, ad]

P(X ∈ A) = (1/√((2π)^d det Σ)) ∫_{−∞}^{a1} · · · ∫_{−∞}^{ad} exp(−(1/2) ⟨Σ⁻¹(x − µ), (x − µ)⟩) dx1 . . . dxd.

3 Expectation, Moments, and Jensen’s inequality

We are now going to consider important characteristics of random variables.

Definition 3.1 The expectation of a random variable is defined as

E(X) := E_P(X) := ∫ X dP = ∫ X(ω) dP(ω)

if this integral is well defined.


Notice that for A ∈ F one has E(1_A) = ∫ 1_A dP = P(A). Quite often one may want to integrate f(X) for a function f : R^d → R^m. How does this work?

Proposition 3.2 Let X : Ω → R^d be a random variable and f : R^d → R be a measurable function. Then

∫ f ∘ X dP = E[f ∘ X] = ∫ f dPX.  (3.1)

Proof. If f = 1_A, A ∈ B^d, we have

E[f ∘ X] = E(1_A ∘ X) = P(X ∈ A) = PX(A) = ∫ 1_A dPX.

Hence (3.1) holds true for functions f = ∑_{i=1}^{n} αi 1_{Ai}, αi ∈ R, Ai ∈ B^d. The standard approximation techniques yield (3.1) for general integrable f. In particular Proposition 3.2 yields that

E[X] = ∫ x dPX(x).

Exercise 3.3 If X : Ω → N_0, then

EX = ∑_{n=0}^{∞} n P(X = n) = ∑_{n=1}^{∞} P(X ≥ n).
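The two ways of computing the expectation in Exercise 3.3 can be checked mechanically. The following small Python sketch (not part of the original notes; the uniform distribution on {0, . . . , 9} is just an arbitrary illustrative choice) evaluates both sums exactly.

# Check E[X] = sum_{n>=0} n*P(X=n) = sum_{n>=1} P(X>=n)
# for X uniform on {0, 1, ..., 9} (chosen only as an example).
from fractions import Fraction

pmf = {n: Fraction(1, 10) for n in range(10)}   # P(X = n)

mean_direct = sum(n * p for n, p in pmf.items())
mean_tail = sum(sum(p for m, p in pmf.items() if m >= n) for n in range(1, 11))

print(mean_direct, mean_tail)   # both print 9/2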

Exercise 3.4 If X : Ω → R, then

∑_{n=1}^{∞} P(|X| ≥ n) ≤ E(|X|) ≤ ∑_{n=0}^{∞} P(|X| ≥ n).

Thus X has an expectation if and only if ∑_{n=0}^{∞} P(|X| ≥ n) converges.

Further characteristics of random variables are the p-th moments.

Definition 3.5 1. For p ≥ 1, the p-th moment of a random variable is defined as E(X^p).

2. The centered p-th moment of a random variable is defined as E[(X − EX)^p].

3. The variance of a random variable X is its centered second moment, hence

V(X) := E[(X − EX)²].

Its standard deviation σ is defined as

σ := √V(X).


Proposition 3.6 V(X) < ∞ if and only if X ∈ L²(P). In this case

V(X) = E(X²) − (E(X))²  (3.2)

as well as

V(X) ≤ E(X²)  (3.3)

and

(EX)² ≤ E(X²).  (3.4)

Proof. If V(X) < ∞, then X − EX ∈ L²(P). But L²(P) is a vector space and the constant EX ∈ L²(P), so X = (X − EX) + EX ∈ L²(P).

On the other hand, if X ∈ L²(P) then X ∈ L¹(P). Hence EX exists and is a constant. Therefore also X − EX ∈ L²(P). By linearity of expectation then:

V(X) = E[(X − EX)²] = EX² − 2(EX)² + (EX)² = EX² − (EX)².

This immediately implies (3.3) and (3.4).
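The identity (3.2) is also easy to observe numerically. The Python sketch below is only an illustration, not part of the notes; the exponential distribution and the sample size are arbitrary choices.

# Monte Carlo check of V(X) = E(X^2) - (E X)^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # E X = 2, V(X) = 4

lhs = np.mean((x - x.mean()) ** 2)      # centered second moment
rhs = np.mean(x ** 2) - x.mean() ** 2   # E(X^2) - (E X)^2

print(lhs, rhs)   # both are close to 4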

Exercise 3.7 Show that X and X − EX have the same variance.

It will turn out in the next step that (3.4) is a special case of a much more general principle. To this end recall the concept of a convex function. Recall that a function ϕ : R → R is convex if for all α ∈ (0, 1) we have

ϕ(αx + (1 − α)y) ≤ αϕ(x) + (1 − α)ϕ(y)

for all x, y ∈ R. In a first course in analysis one learns that convexity is implied by ϕ'' ≥ 0. On the other hand, convex functions need not be differentiable. The following exercise shows that they are close to being differentiable.

Exercise 3.8 Let I be an interval and ϕ : I → R be a convex function. Show that the right derivative ϕ₊(x) exists for all x ∈ I° (the interior of I) and the left derivative ϕ₋(x) exists for all x ∈ I°. Hence ϕ is continuous on I°. Show moreover that ϕ₊ is monotonely increasing on I° and that

ϕ(y) ≥ ϕ(x) + ϕ₊(x)(y − x)

for x ∈ I°, y ∈ I.

Applying Exercise 3.8 yields the generalization of (3.4) mentioned above.


Theorem 3.9 (Jensen’s inequality) Let X : Ω → R be a random variable on (Ω, F , P)and assume X is P-integrable and takes values in an open interval I ⊂ R. Then EX ∈ I and for every convex

ϕ : I → R

ϕ X is a random variable. If this random variable ϕ X is P-integrable it holds:

ϕ (E (X )) ≤ E(ϕ X ).

Proof. Assume I = (α, β). Thus X(ω) < β for all ω ∈ Ω. But then E(X) ≤ β, and in fact E(X) < β. Indeed, E(X) = β implies that the strictly positive random variable β − X equals 0 on a set of P-measure one, i.e. P-almost surely. This is a contradiction. Analogously EX > α. According to Exercise 3.8, ϕ is continuous on I = I°, hence Borel-measurable. Now we know

ϕ(y) ≥ ϕ(x) + ϕ₊(x)(y − x)  (3.5)

for all x, y ∈ I with equality for x = y. Hence

ϕ(y) = sup_{x∈I} { ϕ(x) + ϕ₊(x)(y − x) }  (3.6)

for all y ∈ I. Putting y = X(ω) in (3.5) yields

ϕ ∘ X ≥ ϕ(x) + ϕ₊(x)(X − x)

and by integration

E(ϕ ∘ X) ≥ ϕ(x) + ϕ₊(x)(E(X) − x)

for all x ∈ I. Together with (3.6) this gives

E(ϕ ∘ X) ≥ sup_{x∈I} { ϕ(x) + ϕ₊(x)(EX − x) } = ϕ(E(X)).

This is the assertion of Jensen's inequality.

Corollary 3.10 Let X ∈ L^p(P) for some p ≥ 1. Then

|E(X)|^p ≤ E(|X|^p).

Exercise 3.11 Let I be an open interval and ϕ : I → R be convex. For x1, . . . , xn ∈ I and λ1, . . . , λn ∈ R₊ with ∑_{i=1}^{n} λi = 1, show that

ϕ(∑_{i=1}^{n} λi xi) ≤ ∑_{i=1}^{n} λi ϕ(xi).


4 Convergence of random variables

Already in the course on measure theory we met three different types of convergence:

Definition 4.1 Let (Xn) be a sequence of random variables.

1. Xn is stochastically convergent (or convergent in probability) to a random variable X, if for each ε > 0

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

2. Xn converges almost surely to a random variable X, if for each ε > 0

P(lim sup_{n→∞} {|Xn − X| ≥ ε}) = 0.

3. Xn converges to a random variable X in L^p or in p-norm, if

lim_{n→∞} E(|Xn − X|^p) = 0.

Already in measure theory we proved:

Theorem 4.2 1. If X n converges to X almost surely or in L p, then it also converges stochastically. None of the converses is true.

2. Almost sure convergence does not imply convergence in L p and vice versa.
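A standard illustration of the gap left by Theorem 4.2 is a sequence of indicator functions of shrinking blocks that sweep across Ω = [0, 1) with Lebesgue measure: the probabilities of the blocks tend to zero, so the sequence converges stochastically (and in L^p) to 0, yet every ω lies in infinitely many blocks, so there is no almost sure convergence. The following Python sketch (not from the notes; the chosen ω and the horizon are arbitrary) makes this concrete.

# X_n is the indicator of a block of length 1/k that sweeps across [0,1)
# as n runs through the k-th "round". Then P(|X_n| >= eps) = 1/k -> 0,
# but every omega is hit infinitely often, so X_n(omega) keeps oscillating.

def block(n):
    """Return the interval [a, b) on which X_n equals 1 (n = 1, 2, ...)."""
    k, start = 1, 1
    while start + k <= n:          # find the round k containing index n
        start += k
        k += 1
    j = n - start                  # position inside round k: j = 0, ..., k-1
    return j / k, (j + 1) / k

omega = 0.3
hits = [n for n in range(1, 200) if block(n)[0] <= omega < block(n)[1]]
lengths = [block(n)[1] - block(n)[0] for n in range(1, 200)]

print(hits[:6])          # omega keeps being hit, so X_n(omega) does not converge
print(lengths[::40])     # P(X_n = 1) = block length -> 0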

Definition 4.3 Let (Ω, F) be a topological space endowed with its Borel σ-algebra F. This means F is generated by the topology on Ω. Moreover, for each n ∈ N let µn and µ be probability measures on (Ω, F). We say that µn converges weakly to µ, if for each bounded, continuous, real-valued function f : Ω → R (we will write Cb(Ω) for the space of all such functions) it holds that

µn(f) := ∫ f dµn → ∫ f dµ =: µ(f) as n → ∞.  (4.1)

Theorem 4.4 Let (Xn)n∈N be a sequence of real-valued random variables on a space (Ω, F, P). Assume (Xn) converges to a random variable X stochastically. Then PXn (the sequence of distributions) converges weakly to PX, i.e.

lim_{n→∞} ∫ f dPXn = ∫ f dPX,

or equivalently

lim_{n→∞} E(f ∘ Xn) = E(f ∘ X)

for all f ∈ Cb(R).

If X is constant P-a.s., the converse also holds true.


Proof. First assume f ∈ Cb(R) is uniformly continuous. Then for ε > 0 there is a δ > 0 such that for any x, x' ∈ R

|x − x'| < δ ⇒ |f(x) − f(x')| < ε.

Define An := {|Xn − X| ≥ δ}, n ∈ N. Then

|∫ f dPXn − ∫ f dPX| = |E(f ∘ Xn) − E(f ∘ X)|
≤ E(|f ∘ Xn − f ∘ X|)
= E(|f ∘ Xn − f ∘ X| 1_{An}) + E(|f ∘ Xn − f ∘ X| 1_{An^c})
≤ 2‖f‖ P(An) + ε P(An^c)
≤ 2‖f‖ P(An) + ε.

Here we used the notation ‖f‖ := sup{|f(x)| : x ∈ R} and that

|f ∘ Xn − f ∘ X| ≤ |f ∘ Xn| + |f ∘ X| ≤ 2‖f‖.

But since Xn → X stochastically, P(An) → 0 as n → ∞, so for n large enough P(An) ≤ ε/(2‖f‖). Thus for such n

|∫ f dPXn − ∫ f dPX| ≤ 2ε.

Now let f ∈ Cb(R) be arbitrary and denote In := [−n, n]. Since In ↑ R as n → ∞ we have PX(In) ↑ 1 as n → ∞. Thus for all ε > 0 there is n0(ε) =: n0 such that

1 − PX(In0) = PX(R \ In0) < ε.

We choose the continuous function un0 in the following way:

un0(x) = 1 for x ∈ In0,
un0(x) = 0 for |x| ≥ n0 + 1,
un0(x) = −x + n0 + 1 for n0 < x < n0 + 1,
un0(x) = x + n0 + 1 for −n0 − 1 < x < −n0.

Eventually put f̄ := un0 · f. Since f̄ ≡ 0 outside the compact set [−n0 − 1, n0 + 1], the function f̄ is uniformly continuous (and so is un0) and hence

lim_{n→∞} ∫ f̄ dPXn = ∫ f̄ dPX

as well as

lim_{n→∞} ∫ un0 dPXn = ∫ un0 dPX;

thus also

lim_{n→∞} ∫ (1 − un0) dPXn = ∫ (1 − un0) dPX.


By the triangle inequality

|∫ f dPXn − ∫ f dPX| ≤ ∫ |f − f̄| dPXn + |∫ f̄ dPXn − ∫ f̄ dPX| + ∫ |f − f̄| dPX.  (4.2)

For large n ≥ n1(ε), |∫ f̄ dPXn − ∫ f̄ dPX| ≤ ε. Furthermore, from 0 ≤ 1 − un0 ≤ 1_{R\In0} we obtain

∫ (1 − un0) dPX ≤ PX(R \ In0) < ε,

so that for all n ≥ n2(ε) also

∫ (1 − un0) dPXn < ε.

This yields

∫ |f − f̄| dPX = ∫ |f|(1 − un0) dPX ≤ ‖f‖ ε

on the one hand and

∫ |f − f̄| dPXn ≤ ‖f‖ ε

for all n ≥ n2(ε) on the other. Hence we obtain from (4.2) for large n:

|∫ f dPXn − ∫ f dPX| ≤ 2‖f‖ ε + ε.

This proves weak convergence of the distributions.

For the converse let η ∈ R and assume X ≡ η (X is identically equal to η ∈ R, P-almost surely). This means PX = δη, where δη is the Dirac measure concentrated in η. For the open interval I = (η − ε, η + ε), ε > 0, we may find f ∈ Cb(R) with f ≤ 1_I and f(η) = 1. Then

∫ f dPXn ≤ PXn(I) = P(Xn ∈ I) ≤ 1.

Since we assumed weak convergence of PXn to PX we know that

∫ f dPXn → ∫ f dPX = f(η) = 1

as n → ∞. Since ∫ f dPXn ≤ P(Xn ∈ I) ≤ 1, this implies

P(Xn ∈ I) → 1

as n → ∞. But

{Xn ∈ I} = {|Xn − η| < ε}

and thus

P(|Xn − X| ≥ ε) = P(|Xn − η| ≥ ε) = 1 − P(|Xn − η| < ε) → 0

for all ε > 0. This means Xn converges stochastically to X.


Definition 4.5 Let X n, X be random variables on a probability space (Ω, F , P). If PX n

converges weakly to PX we also say that X n converges to X in distribution.

Remark 4.6 If the random variables in Theorem 4.4 are R^d-valued and the f : R^d → R belong to Cb(R^d), the statement of the theorem stays valid.

Example 4.7 For each sequence σn > 0 with lim σn = 0 we have

lim_{n→∞} N(0, σn²) = δ0.

Here δ0 denotes the Dirac measure concentrated in 0 ∈ R. Indeed, substituting x = σy we obtain

(1/√(2πσ²)) ∫_{−∞}^{∞} f(x) e^{−x²/(2σ²)} dx = ∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} f(σy) dy.

Now for all y ∈ R

|(1/√(2π)) e^{−y²/2} f(σy)| ≤ ‖f‖ e^{−y²/2},

which is integrable. Thus by dominated convergence

lim_{n→∞} ∫_{−∞}^{∞} (1/√(2πσn²)) e^{−x²/(2σn²)} f(x) dx = lim_{n→∞} ∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} f(σn y) dy
= ∫_{−∞}^{∞} lim_{n→∞} (1/√(2π)) e^{−y²/2} f(σn y) dy = f(0) = ∫ f dδ0.

Exercise 4.8 Show that the converse direction in Theorem 4.4 is not true, if we drop the assumption that X ≡ η P-a.s.

Exercise 4.9 Assume the following holds for a sequence (Xn) of random variables on a probability space (Ω, F, P):

P(|Xn| > ε) < ε

for all n large enough (larger than n(ε)), for each given ε > 0. Is this equivalent to stochastic convergence of Xn to 0?

Exercise 4.10 For a sequence of Poisson distributions (π_{αn}) with parameters αn > 0 show that

lim_{n→∞} π_{αn} = δ0 (weakly)

if lim_{n→∞} αn = 0. Is there a probability measure µ on B¹ with

lim_{n→∞} π_{αn} = µ (weakly)

if lim αn = ∞?


5 Independence

The concept of independence of events is one of the most essential in probability theory. It is met already in the introductory courses. Its background is the following:

Assume we are given a probability space (Ω, F, P). For two events A, B ∈ F with P(B) > 0 one may ask how the probability of the event A changes if we already know that B has happened: we then only need to consider the probability of A ∩ B. To obtain a probability again we normalize by P(B) and get the conditional probability of A given B:

P(A | B) = P(A ∩ B) / P(B).

Now we would call A and B independent if the knowledge that B has happened does not change the probability that A will happen or not, i.e. if P(A | B) and P(A) are the same:

P(A | B) = P(A).

In other words this means A and B are independent if

P(A ∩ B) = P(A) P(B).

More generally we define

Definition 5.1 A family (Ai)i∈I of events in F is called independent if for each choice of distinct indices i1, . . . , in ∈ I

P(Ai1 ∩ · · · ∩ Ain) = P(Ai1) · · · P(Ain).  (5.1)

Exercise 5.2 Give an example of a sequence of events that are pairwise independent (i.e. each pair of events from this sequence is independent) but not independent (i.e. the events are not all independent together).

We generalize Definition 5.1 to set systems in the following way:

Definition 5.3 For each i ∈ I let E i ⊂ F be a collection of events. (E i)i∈I are called independent if (5.1) holds for each i1, . . . , in ∈ I , each n ∈ N and each Aiν ∈ E iν , ν =1, . . . , n.

Exercise 5.4 Show the following:

1. A family (Ei)i∈I is independent if and only if every finite sub-family is independent.

2. Independence of (Ei)i∈I is maintained if we reduce the families (Ei). More precisely, let (Ei) be independent and E'i ⊆ Ei; then also the families (E'i) are independent.

3. If for all n ∈ N the families (Ei^n)i∈I are independent and for all n ∈ N and i ∈ I, Ei^n ⊆ Ei^{n+1}, then (⋃_n Ei^n)i∈I are independent.


Exercise 5.5 If (Ei)i∈I are independent, then so are the Dynkin systems (D(Ei))i∈I. Here D(A) is the Dynkin system generated by A; it coincides with the intersection of all Dynkin systems containing A. (See the Appendix for definitions.)

Corollary 5.6 Let (Ei)i∈I be an independent family of ∩-stable sets Ei ⊆ F. Then also the family (σ(Ei))i∈I of the σ-algebras generated by the Ei is independent.

Theorem 5.7 Let (Ei)i∈I be an independent family of ∩-stable sets Ei ⊆ F and

I = ⋃_{j∈J} Ij

with Ii ∩ Ij = ∅ for i ≠ j. Let Aj := σ(⋃_{i∈Ij} Ei). Then also (Aj)j∈J is independent.

Proof. For j ∈ J let Ẽj be the system of all sets of the form

Ei1 ∩ . . . ∩ Ein,

where ∅ ≠ {i1, . . . , in} ⊆ Ij and the Eiν ∈ Eiν, ν = 1, . . . , n, are arbitrary elements of the respective collections. Then Ẽj is ∩-stable. As an immediate consequence of the independence of the (Ei)i∈I also the (Ẽj)j∈J are independent. Eventually Aj = σ(Ẽj). Thus the assertion follows from Corollary 5.6.

In the next step we want to show that events that depend on all but finitely many σ-algebras of a countable family of independent σ-algebras can only have probability zero or one. To this end we need the following definition.

Definition 5.8 Let (An)n be a sequence of σ-algebras contained in F and

Tn := σ(⋃_{m=n}^{∞} Am)

the σ-algebra generated by An, An+1, . . .. Then

T∞ := ⋂_{n=1}^{∞} Tn

is called the σ-algebra of the tail events.

Exercise 5.9 Why is T ∞ a σ-algebra?

This is now the result announced above:

Theorem 5.10 (Kolmogorov’s Zero-One-Law) Let (An) be an independent sequence of σ-algebras An ⊆ F . Then for every tail event A ∈ T ∞ it holds

P (A) = 0 or P(A) = 1.


Proof. Let A ∈ T∞ and let D be the system of all sets D ∈ F that are independent of A. We want to show that A ∈ D.

In the exercise below we show that D is a Dynkin system. By Theorem 5.7 the σ-algebra Tn+1 is independent of the σ-algebra

Ân := σ(A1 ∪ . . . ∪ An).

Since A ∈ Tn+1 we know that Ân ⊆ D for every n ∈ N. Thus

Â := ⋃_{n=1}^{∞} Ân ⊆ D.

Obviously (Ân) is increasing. For E, F ∈ Â there thus exists n with E, F ∈ Ân and hence E ∩ F ∈ Ân and thus E ∩ F ∈ Â. This means that Â is ∩-stable. Since Â ⊆ D, i.e. (Â, {A}) is independent, from Exercise 5.5 we conclude that (D(Â), {A}) is independent, so that σ(Â) = D(Â) ⊆ D. Moreover An ⊆ Â for all n. Hence

T1 = σ(⋃ An) ⊆ σ(Â).

Therefore

A ∈ T∞ ⊆ T1 ⊆ σ(Â) ⊆ D.

Therefore A is independent of itself, i.e. it holds that

P(A) = P(A ∩ A) = P(A) · P(A) = (P(A))².

Hence P(A) ∈ {0, 1}, as asserted.

Exercise 5.11 Show that D from the proof of Theorem 5.10 is a Dynkin system.

As an immediate consequence of the Kolmogorov Zero-One-Law (Theorem 5.10) we obtain

Theorem 5.12 (Borel's Zero-One-Law) For each independent sequence (An)n of events An ∈ F we have

P(An for infinitely many n) = 0 or P(An for infinitely many n) = 1,

i.e.

P(lim sup An) ∈ {0, 1}.

Proof. Let 𝒜n := σ(An), i.e. 𝒜n = {∅, Ω, An, An^c}. It follows that (𝒜n)n is independent. For Qn := ⋃_{m=n}^{∞} Am we have Qn ∈ Tn. Since (Tn)n is decreasing we even have Qm ∈ Tn for all m ≥ n, n ∈ N. Since (Qn)n is decreasing we obtain

lim sup_{n→∞} An = ⋂_{k=1}^{∞} Qk = ⋂_{k=j}^{∞} Qk ∈ Tj

for all j ∈ N. Hence lim sup An ∈ T∞, and the assertion follows from Kolmogorov's zero-one law.


Exercise 5.13 In every F the pairs (A, B), with any A, B ∈ F such that P(A) = 0 or P(A) = 1 or P(B) = 0 or P(B) = 1, are pairs of independent sets. If these are the only pairs of independent sets we call F independence-free. Show that the following space is independence-free:

Ω = N, F = P(Ω), and P({k}) = 2^{−k!}

for each k ≥ 2, with P({1}) = 1 − ∑_{k=2}^{∞} P({k}). (Hint: by passing to complements if necessary, you may assume that 1 ∉ A and 1 ∉ B.)

A special case of the above abstract setting is the concept of independent random variables. This will be introduced next. Again we work over a probability space (Ω, F, P).

Definition 5.14 A family of random variables (X i)i is called independent, if the σ-algebras (σ (X i))i generated by them are independent.

For finite families there is another criterion (which is important, since by definition of independence we only need to check the independence of finite families).

Theorem 5.15 Let X1, . . . , Xn be a sequence of n random variables with values in measurable spaces (Ωi, Ai) with ∩-stable generators Ei of Ai. Then X1, . . . , Xn are independent if and only if

P(X1 ∈ E1, . . . , Xn ∈ En) = ∏_{i=1}^{n} P(Xi ∈ Ei)

for all Ei ∈ Ei, i = 1, . . . , n.

Proof. Put

Gi := {Xi⁻¹(Ei) : Ei ∈ Ei}.

Then Gi generates σ(Xi). Gi is ∩-stable and Ω ∈ Gi. According to Corollary 5.6 we need to show the independence of (Gi)i=1,...,n, which is equivalent to

P(G1 ∩ · · · ∩ Gn) = P(G1) · · · P(Gn)

for all choices of Gi ∈ Gi. Sufficiency is evident, since we may choose Gi = Ω for appropriate i.
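The factorization criterion of Theorem 5.15 can be observed empirically. In the Python sketch below (an illustration only; two fair dice and the sample size are arbitrary choices) the empirical joint frequencies of two independently generated dice essentially coincide with the product of the empirical marginals.

# Joint frequencies of two independent dice factorize into the marginals.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x1 = rng.integers(1, 7, size=n)     # first die
x2 = rng.integers(1, 7, size=n)     # second die, drawn independently

joint = np.zeros((6, 6))
for a in range(1, 7):
    for b in range(1, 7):
        joint[a - 1, b - 1] = np.mean((x1 == a) & (x2 == b))

marg1 = joint.sum(axis=1)
marg2 = joint.sum(axis=0)
print(np.max(np.abs(joint - np.outer(marg1, marg2))))   # small (sampling error only)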

Exercise 5.16 Random variables X 1, . . . , X n+1 are independent with values in (Ωi, F i) if and only if X 1, . . . , X n are independent and X n+1 is independent of σ (X 1, . . . , X n).

The following theorem states that a measurable deformation of an independent family of random variables stays independent:

Theorem 5.17 Let (Xi)i∈I be a family of independent random variables Xi with values in (Ωi, Ai) and let

fi : (Ωi, Ai) → (Ω'i, A'i)

be measurable. Then (fi(Xi))i∈I is independent.


Proof. Let i1, . . . , in ∈ I. Then

P(fi1(Xi1) ∈ A'i1, . . . , fin(Xin) ∈ A'in) = P(Xi1 ∈ fi1⁻¹(A'i1), . . . , Xin ∈ fin⁻¹(A'in))
= ∏_{ν=1}^{n} P(Xiν ∈ fiν⁻¹(A'iν)) = ∏_{ν=1}^{n} P(fiν(Xiν) ∈ A'iν)

by the independence of (Xi)i∈I. Here the A'iν ∈ A'iν were arbitrary.

Already Theorem 5.15 gives rise to the idea that independence of random variables may be somehow related to product measures. This is made more precise in the following theorem. To this end let X1, . . . , Xn be random variables such that

Xi : (Ω, F) → (Ωi, Ai).

Define

Y := X1 ⊗ · · · ⊗ Xn : Ω → Ω1 × · · · × Ωn.

Then the distribution of Y, which we denote by PY, can also be written as PX1⊗···⊗Xn. Note that PY is a probability measure on ⊗_{i=1}^{n} Ai.

Theorem 5.18 The random variables X1, . . . , Xn are independent if and only if their joint distribution is the product measure of the individual distributions, i.e. if

PX1⊗···⊗Xn = PX1 ⊗ · · · ⊗ PXn.

Proof. Let Ai ∈ Ai, i = 1, . . . , n. Then with Y = X1 ⊗ · · · ⊗ Xn:

PY(∏_{i=1}^{n} Ai) = P(Y ∈ ∏_{i=1}^{n} Ai) = P(X1 ∈ A1, . . . , Xn ∈ An)

as well as

PXi(Ai) = P(Xi ∈ Ai), i = 1, . . . , n.

Now PY is the product measure of the PXi if and only if

PY(A1 × . . . × An) = PX1(A1) . . . PXn(An).

But this is identical with

P(X1 ∈ A1, . . . , Xn ∈ An) = ∏_{i=1}^{n} P(Xi ∈ Ai).

According to Theorem 5.15 this is equivalent to the independence of the Xi.


6 Products and Sums of Independent Random Variables

In this section we will study independent random variables in greater detail.

Theorem 6.1 Let X1, . . . , Xn be independent, real-valued random variables. Then

E(∏_{i=1}^{n} Xi) = ∏_{i=1}^{n} E(Xi)  (6.1)

if EXi is well defined (and finite) for all i. (6.1) shows that then also E(∏_{i=1}^{n} Xi) is well defined.

Proof. We know that Q := ⊗_{i=1}^{n} PXi is the joint distribution of X1, . . . , Xn. By Proposition 3.2 and Fubini's theorem

E(∏_{i=1}^{n} |Xi|) = ∫ |x1 . . . xn| dQ(x1, . . . , xn)
= ∫ . . . ∫ |x1| . . . |xn| dPX1(x1) . . . dPXn(xn)
= ∫ |x1| dPX1(x1) . . . ∫ |xn| dPXn(xn).

This shows that integrability of the Xi implies integrability of ∏_{i=1}^{n} Xi. In this case the equalities are also true without absolute values. This proves the result.

Exercise 6.2 For any two integrable random variables X, Y, Theorem 6.1 tells us that independence of X and Y implies

E(X · Y) = E(X) E(Y).

Show that the converse is not true.

Definition 6.3 For any two random variables X, Y that are integrable and have an integrable product we define the covariance of X and Y to be

cov(X, Y) = E[(X − EX)(Y − EY)] = E(XY) − EX · EY.

X and Y are uncorrelated if cov(X, Y) = 0.
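Both expressions for the covariance in Definition 6.3 can be evaluated side by side. The following sketch is only an illustration (the particular joint distribution of X and Y is an arbitrary choice, not from the notes).

# cov(X, Y) computed from the centered product and from E(XY) - EX*EY.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200_000)
y = 0.5 * x + rng.normal(size=200_000)    # correlated with x by construction

cov1 = np.mean((x - x.mean()) * (y - y.mean()))
cov2 = np.mean(x * y) - x.mean() * y.mean()
print(cov1, cov2)                          # both close to 0.5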

Remark 6.4 If X, Y are independent, then cov(X, Y) = 0.

Theorem 6.5 Let X1, . . . , Xn be square integrable random variables. Then

V(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} V(Xi) + ∑_{i≠j} cov(Xi, Xj).  (6.2)

In particular, if X1, . . . , Xn are uncorrelated,

V(∑_{i=1}^{n} Xi) = ∑_{i=1}^{n} V(Xi).  (6.3)


Proof. We have

V(∑_{i=1}^{n} Xi) = E[(∑_{i=1}^{n} (Xi − EXi))²]
= E[∑_{i=1}^{n} (Xi − EXi)² + ∑_{i≠j} (Xi − EXi)(Xj − EXj)]
= ∑_{i=1}^{n} V(Xi) + ∑_{i≠j} cov(Xi, Xj).

This proves (6.2). For (6.3) just note that for uncorrelated random variables X, Y one has cov(X, Y) = 0.
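Formula (6.3) is easy to observe numerically. In the sketch below (an illustration only; the three distributions are arbitrary choices) the variance of the sum of independent samples agrees with the sum of the variances up to sampling error.

# V(X1 + X2 + X3) versus V(X1) + V(X2) + V(X3) for independent summands.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x1 = rng.exponential(1.0, n)        # variance 1
x2 = rng.uniform(0.0, 1.0, n)       # variance 1/12
x3 = rng.normal(0.0, 2.0, n)        # variance 4

print(np.var(x1 + x2 + x3))
print(np.var(x1) + np.var(x2) + np.var(x3))   # both close to 5.08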

Eventually we turn to determining the distribution of the sum of independent random variables.

Theorem 6.6 Let X1, . . . , Xn be independent R^d-valued random variables. Then the distribution of the sum Sn := X1 + . . . + Xn is given by the convolution product of the distributions of the Xi, i.e.

PSn = PX1 ∗ PX2 ∗ · · · ∗ PXn.

Proof. Again let Y := X1 ⊗ . . . ⊗ Xn : Ω → R^{dn}, and let An denote vector addition,

An : R^d × . . . × R^d → R^d.

Then Sn = An ∘ Y, hence a random variable. Now PSn is the image measure of P under An ∘ Y, which we denote by (An ∘ Y)(P). Thus

PSn = (An ∘ Y)(P) = An(PY).

Now PY = ⊗ PXi. So by the definition of the convolution product

PX1 ∗ . . . ∗ PXn = An(PY) = PSn.

More explicitly, in the case d = 1, let g(x1, . . . , xn) = 1 for x1 + · · · + xn ≤ s and g(x1, . . . , xn) = 0 otherwise. Then an application of Fubini's theorem yields

P(Sn ≤ s) = E(g(X1, . . . , Xn)) = ∫ g(x1, x2, . . . , xn) dPX1(x1) dP(X2,...,Xn)(x2, . . . , xn)
= ∫ P(X1 ≤ s − x2 − · · · − xn) dP(X2,...,Xn)(x2, . . . , xn).

In the case that X1 has a density fX1 with respect to Lebesgue measure, and the order of differentiation with respect to s and integration can be exchanged, it follows that Sn has a density fSn and

fSn(s) = ∫ fX1(s − x2 − · · · − xn) dP(X2,...,Xn)(x2, . . . , xn).

The same formula holds in the case that X1, . . . , Xn have a discrete distribution (that is, almost surely assume values in a fixed countable subset of R) if densities are taken with respect to the counting measure.


Example 6.7 1. As we learned in Introduction to Statistics, the convolution of a Binomial distribution with parameters n and p, B(n, p), and a Binomial distribution B(m, p) is a B(n + m, p) distribution:

B(n, p) ∗ B(m, p) = B(n + m, p).

2. As we learned in Introduction to Statistics, the convolution of a P(λ)-distribution (a Poisson distribution with parameter λ) with a P(µ)-distribution is a P(µ + λ)-distribution:

P(λ) ∗ P(µ) = P(λ + µ).

3. As has been communicated in Introduction to Probability:

N(µ, σ²) ∗ N(ν, τ²) = N(µ + ν, σ² + τ²).

7 Infinite product probability spaces

Many theorems in probability theory start with: "Let X1, X2, . . . , Xn, . . . be a sequence of i.i.d. random variables". But how do we know that such sequences really exist? This will be shown in this section. In the last section we established the framework for some of the most important theorems of probability, such as the Weak Law of Large Numbers or the Central Limit Theorem. Those are the theorems that assume: "Let X1, X2, . . . , Xn be i.i.d. random variables". Others, such as the Strong Law of Large Numbers, ask for the behavior of a sequence of independent and identically distributed (i.i.d.) random variables; they usually start like "Let X1, X2, . . . be a sequence of i.i.d. random variables". The natural first question to ask is: does such a sequence exist at all?

In the same way as the existence of a finite sequence of i.i.d. random variables is related to finite product measures, the answer to the above question is related to infinite product measures. So we assume that we are given a sequence of measure spaces (Ωn, An, µn) of which we moreover assume that

µn(Ωn) = 1 for all n.

We construct (Ω, A) as follows. We want each ω ∈ Ω to be a sequence (ωn)n where ωn ∈ Ωn. So we put

Ω := ∏_{n=1}^{∞} Ωn.

Moreover we have the idea that a probability measure on Ω should be determined by what happens on the first n coordinates, n ∈ N. So for A1 ∈ A1, A2 ∈ A2, . . . , An ∈ An, n ∈ N, we want

A := A1 × . . . × An × Ωn+1 × Ωn+2 × . . .  (7.1)

to be in A. Guided by independence we want to define a measure µ on (Ω, A) that assigns to A defined in (7.1) the mass

µ(A) = µ1(A1) . . . µn(An).
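The requirement (7.1) already determines how one would compute the mass of any cylinder set from finitely many marginals. The following Python sketch (not part of the notes; the fair-coin marginals are an arbitrary example) mirrors the formula µ(A) = µ1(A1) · · · µn(An).

# Mass of a cylinder set A_1 x ... x A_n x Omega_{n+1} x ... from the marginals.
def cylinder_mass(marginals, constrained_sets):
    """marginals[i] is a dict {outcome: probability} for coordinate i;
    constrained_sets[i] is the set A_i; coordinates beyond the list are free."""
    mass = 1.0
    for mu_i, A_i in zip(marginals, constrained_sets):
        mass *= sum(p for outcome, p in mu_i.items() if outcome in A_i)
    return mass

coin = {0: 0.5, 1: 0.5}
# mass of {heads on toss 1} x {tails on toss 2} x Omega x Omega x ...
print(cylinder_mass([coin, coin], [{1}, {0}]))   # 0.25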


We will solve this problem in greater generality. Let I be an index set and (Ωi, Ai, µi)i∈I be measure spaces with µi(Ωi) = 1. For ∅ ≠ K ⊆ I define

ΩK := ∏_{i∈K} Ωi,  (7.2)

in particular Ω := ΩI. Let p^K_J for J ⊆ K denote the canonical projection from ΩK to ΩJ. For J = {i} we will also write p^K_i instead of p^K_{{i}} and pi in place of p^I_i. Obviously

p^L_J = p^K_J ∘ p^L_K  (J ⊆ K ⊆ L)  (7.3)

and

pJ := p^I_J = p^K_J ∘ pK  (J ⊆ K).  (7.4)

Moreover denote by

H(I) := {J ⊆ I : J ≠ ∅, |J| finite}.

For J ∈ H(I) the σ-algebras and measures

AJ := ⊗_{i∈J} Ai and µJ := ⊗_{i∈J} µi

are defined by Fubini's theorem in measure theory.

In analogy to the finite-dimensional case we define

Definition 7.1 The product σ-algebra ⊗_{i∈I} Ai of the σ-algebras (Ai)i∈I is defined as the smallest σ-algebra A on Ω such that all projections pi : Ω → Ωi are (A, Ai)-measurable. Hence

⊗_{i∈I} Ai := σ(pi, i ∈ I).  (7.5)

Exercise 7.2 Show that ⊗_{i∈I} Ai = σ(pJ, J ∈ H(I)).  (7.6)

According to the above we are now looking for a measure µ on (Ω, A) that assigns mass µ1(A1) . . . µn(An) to each A as defined in (7.1). In other words,

µ(p_J⁻¹(∏_{i∈J} Ai)) = µJ(∏_{i∈J} Ai).

The question whether such a measure exists is answered by

Theorem 7.3 On A := ⊗_{i∈I} Ai there is a unique measure µ with

pJ(µ) := µ ∘ p_J⁻¹ = µJ  (7.7)

for all J ∈ H(I). It holds that µ(Ω) = 1.


Proof. We may assume |I| = ∞, since otherwise the result is known from Fubini's theorem. We start with some preparatory considerations.

In Exercise 7.4 below it will be shown that p^K_J is (AK, AJ)-measurable for J ⊆ K and that p^K_J(µK) = µJ (J ⊆ K, J, K ∈ H(I)).

Hence, if we introduce the σ-algebra of the J-cylinder sets

ZJ := p_J⁻¹(AJ)  (J ∈ H(I)),  (7.8)

the measurability of p^K_J implies (p^K_J)⁻¹(AJ) ⊂ AK and thus

ZJ ⊆ ZK  (J ⊆ K, J, K ∈ H(I)).  (7.9)

Eventually we introduce the system of all cylinder sets

Z := ⋃_{J∈H(I)} ZJ.

Note that due to (7.9), for Z1, Z2 ∈ Z we have Z1, Z2 ∈ ZJ for suitably chosen J ∈ H(I). Hence Z is an algebra (but generally not a σ-algebra). From (7.5) and (7.6) it follows that A = σ(Z).

Now we come to the main part of the proof. This will be divided into four parts.

1. Assume Z ∈ Z, Z = p_J⁻¹(A), J ∈ H(I), A ∈ AJ. According to (7.7), Z must get mass µ(Z) = µJ(A). We have to show that this is well defined. So let

Z = p_J⁻¹(A) = p_K⁻¹(B)

for J, K ∈ H(I), A ∈ AJ, B ∈ AK. If J ⊆ K we obtain

p_J⁻¹(A) = p_K⁻¹((p^K_J)⁻¹(A))

and thus p_K⁻¹(B) = p_K⁻¹(B') with B' := (p^K_J)⁻¹(A). Since pK(Ω) = ΩK we obtain

B = B' = (p^K_J)⁻¹(A).

Thus by the introductory considerations

µK(B) = µJ(A).

For arbitrary J, K define L := J ∪ K. Since J, K ⊆ L, (7.9) implies the existence of C ∈ AL with p_L⁻¹(C) = p_J⁻¹(A) = p_K⁻¹(B). Therefore, from what we have just seen,

µL(C) = µJ(A) and µL(C) = µK(B).

Hence

µJ(A) = µK(B).

Thus the function

µ0(p_J⁻¹(A)) = µJ(A)  (J ∈ H(I), A ∈ AJ)  (7.10)

is well-defined on Z.


2. Now we show that µ0 as defined in (7.10) is a volume on Z. Trivially it holds that µ0 ≥ 0 and µ0(∅) = 0. Moreover, as shown above, for Y, Z ∈ Z with Y ∩ Z = ∅ there are J ∈ H(I) and A, B ∈ AJ such that Y = p_J⁻¹(A), Z = p_J⁻¹(B). Now Y ∩ Z = ∅ implies A ∩ B = ∅, and due to

Y ∪ Z = p_J⁻¹(A ∪ B)

we obtain

µ0(Y ∪ Z) = µJ(A ∪ B) = µJ(A) + µJ(B) = µ0(Y) + µ0(Z),

hence the finite additivity of µ0.

It remains to show that µ0 is also σ-additive. Then the general principles from measure theory yield that µ0 can be uniquely extended to a measure µ on σ(Z) = A. µ also is a probability measure, because Ω = p_J⁻¹(ΩJ) for all J ∈ H(I) and therefore

µ(Ω) = µ0(Ω) = µJ(ΩJ) = 1.

To prove the σ-additivity of µ0 we first show:

3. Let Z ∈ Z and J ∈ H(I). Then for all ωJ ∈ ΩJ the set

Z_{ωJ} := {ω ∈ Ω : (ωJ, p_{I\J}(ω)) ∈ Z}

is a cylinder set. This set consists of all ω ∈ Ω with the following property: if we replace the coordinates ωi with i ∈ J by the corresponding coordinates of ωJ, we obtain a point in Z. Moreover

µ0(Z) = ∫ µ0(Z_{ωJ}) dµJ(ωJ).  (7.11)

This is shown by the following consideration. For Z ∈ Z there are K ∈ H(I) and A ∈ AK such that Z = p_K⁻¹(A), which means that µ0(Z) = µK(A). Since I is infinite we may assume J ⊂ K and J ≠ K. For the ωJ-section of A in ΩK, which we call A_{ωJ}, i.e. for the set of all ω' ∈ Ω_{K\J} with (ωJ, ω') ∈ A, it holds that

Z_{ωJ} = p_{K\J}⁻¹(A_{ωJ}).

By Fubini's theorem A_{ωJ} ∈ A_{K\J}, and hence Z_{ωJ} = p_{K\J}⁻¹(A_{ωJ}) is a (K\J)-cylinder set.

Since µK = µJ ⊗ µ_{K\J}, Fubini's theorem implies

µ0(Z) = µK(A) = ∫ µ_{K\J}(A_{ωJ}) dµJ(ωJ).  (7.12)

But this is (7.11), since µ0(Z_{ωJ}) = µ_{K\J}(A_{ωJ}) (because Z_{ωJ} = p_{K\J}⁻¹(A_{ωJ})).


4. Eventually we show that µ0 is ∅-continuous and thus σ-additive. To this end let (Zn) be a decreasing sequence of cylinder sets in Z with α := inf_n µ0(Zn) > 0. We will show that

⋂_{n=1}^{∞} Zn ≠ ∅.  (7.13)

Now each Zn is of the form Zn = p_{Jn}⁻¹(An), Jn ∈ H(I), An ∈ AJn. Due to (7.9) we may assume J1 ⊆ J2 ⊆ J3 ⊆ . . .. We apply the result proved in 3. to J = J1 and Z = Zn. As ωJ1 ↦ µ0((Zn)_{ωJ1}) is AJ1-measurable,

Qn := {ωJ1 ∈ ΩJ1 : µ0((Zn)_{ωJ1}) ≥ α/2} ∈ AJ1.

Since all µJ's have mass one we obtain from (7.11):

α ≤ µ0(Zn) ≤ µJ1(Qn) + α/2,

hence µJ1(Qn) ≥ α/2 > 0 for all n ∈ N. Together with (Zn) also (Qn) is decreasing. µJ1, being a finite measure, is ∅-continuous, which implies ⋂_{n=1}^{∞} Qn ≠ ∅. Hence there is ωJ1 ∈ ⋂_{n=1}^{∞} Qn with

µ0((Zn)_{ωJ1}) ≥ α/2 > 0 for all n.  (7.14)

Successive application of 3. implies via induction that for each k ∈ N there is ωJk ∈ ΩJk with

µ0((Zn)_{ωJk}) ≥ 2^{−k} α > 0 for all n and p^{Jk+1}_{Jk}(ωJk+1) = ωJk.

Due to this second property there is ω0 ∈ Ω with pJk(ω0) = ωJk for all k. Because of (7.14) we have (Zn)_{ωJn} ≠ ∅, so that there is ωn ∈ Ω with (ωJn, p_{I\Jn}(ωn)) ∈ Zn. But then also

(ωJn, p_{I\Jn}(ω0)) = ω0 ∈ Zn.

Thus ω0 ∈ ⋂_{n=1}^{∞} Zn, which proves (7.13). Therefore µ0 is σ-additive and hence has an extension to A by Caratheodory's theorem (Theorem 14.7). It is clear that µ0 has mass one (i.e. µ0(Ω) = 1), since for J ∈ H(I) we have Ω = p_J⁻¹(ΩJ) and hence

µ0(Ω) = µJ(ΩJ) = 1.

In particular µ0 is σ-finite, the extension µ is unique, and it is a probability measure, that is, µ(Ω) = µ0(Ω) = 1.

This proves the theorem.

We conclude the chapter with an exercise that was left open during this proof.

Exercise 7.4 With the notation of this section, in particular of Theorem 7.3, show that p^K_J is (AK, AJ)-measurable (J ⊆ K, J, K ∈ H(I)) and that

p^K_J(µK) = µJ.


8 Zero-One Laws

Already in Section 5 we encountered the prototype of a zero-one law: for a sequence of independent events (An)n we have Borel's Zero-One-Law (Theorem 5.12):

P(lim sup An) ∈ {0, 1}.

In a first step we will now ask when the probability in question is zero and when it is one. This leads to the following frequently used lemma:

Lemma 8.1 (Borel-Cantelli Lemma) Let (An) be a sequence of events on a probability space (Ω, F, P). Then

∑_{n=1}^{∞} P(An) < ∞ ⇒ P(lim sup An) = 0.  (8.1)

If the events (An) are pairwise independent, then also

∑_{n=1}^{∞} P(An) = ∞ ⇒ P(lim sup An) = 1.  (8.2)

Remark 8.2 The Borel-Cantelli Lemma is most often used in the form of (8.1). Note that this part does not require any knowledge about the dependence structure of the An.

Proof of Lemma 8.1. (8.1) is easy. Define

A := lim sup An = ⋂_{n=1}^{∞} ⋃_{i=n}^{∞} Ai.

This implies

A ⊆ ⋃_{i=n}^{∞} Ai for all n ∈ N

and thus

P(A) ≤ P(⋃_{i=n}^{∞} Ai) ≤ ∑_{i=n}^{∞} P(Ai).  (8.3)

Since ∑_{i=1}^{∞} P(Ai) converges, ∑_{i=n}^{∞} P(Ai) converges to zero as n → ∞. This implies P(A) = 0, hence (8.1).

For (8.2) again put A := lim sup An and furthermore

In := 1_{An},  Sn := ∑_{j=1}^{n} Ij

and eventually

S := ∑_{j=1}^{∞} Ij.

Since the An are assumed to be pairwise independent, they are pairwise uncorrelated as well. Hence

V(Sn) = ∑_{j=1}^{n} V(Ij) = ∑_{j=1}^{n} (E(Ij²) − E(Ij)²) = E(Sn) − ∑_{j=1}^{n} E(Ij)² ≤ ESn,

where the last equality follows since Ij² = Ij. Now by assumption ∑_{n=1}^{∞} E(In) = +∞. Since Sn ↑ S this is equivalent to

lim_{n→∞} E(Sn) = E(S) = ∞.  (8.4)

On the other hand, ω ∈ A if and only if ω ∈ An for infinitely many n, which is the case if and only if S(ω) = +∞. The assertion thus is

P(S = +∞) = 1.

This can be seen as follows. By Chebyshev's inequality

P(|Sn − E(Sn)| ≤ η) ≥ 1 − V(Sn)/η²

for all η > 0. Because of (8.4) we may assume that ESn > 0 and choose η = (1/2) ESn. Hence

P(Sn ≥ (1/2) E(Sn)) ≥ P(|Sn − ESn| ≤ (1/2) ESn) ≥ 1 − 4 V(Sn)/E(Sn)².

But V(Sn) ≤ E(Sn) and E(Sn) → ∞. Thus

lim V(Sn)/E(Sn)² = 0.

Therefore for all ε > 0 and all n large enough

P(Sn ≥ (1/2) ESn) ≥ 1 − ε.

But now S ≥ Sn and hence also

P(S ≥ (1/2) ESn) ≥ 1 − ε

for all ε > 0. But this implies P(S = +∞) = 1, which is what we wanted to show.

Example 8.3 Let (Xn) be a sequence of real-valued random variables which satisfies

∑_{n=1}^{∞} P(|Xn| > ε) < ∞  (8.5)

for all ε > 0. Then Xn → 0 P-a.s. Indeed, the Borel-Cantelli Lemma says that (8.5) implies

P(|Xn| > ε infinitely often in n) = 0.

But this is exactly the definition of almost sure convergence of Xn to 0.


Exercise 8.4 Is (8.5) equivalent with P-almost sure convergence of X n to 0?

Here is how Theorem 5.10 translates to random variables.

Theorem 8.5 (Kolmogorov's 0-1 Law) Let (Xn)n be a sequence of independent random variables with values in arbitrary measurable spaces. Then for every tail event A, i.e. for each A with

A ∈ ⋂_{n=1}^{∞} σ(Xm, m ≥ n),

it holds that P(A) ∈ {0, 1}.

Exercise 8.6 Derive Theorem 8.5 from Theorem 5.10.

Corollary 8.7 Let (Xn)n∈N be a sequence of independent, real-valued random variables. Define

T∞ := ⋂_{m=1}^{∞} σ(Xi, i ≥ m)

to be the tail σ-algebra. If T is a real-valued random variable that is measurable with respect to T∞, then T is P-almost surely constant, i.e. there is an α ∈ R such that

P(T = α) = 1.

Such random variables T : Ω → R that are T∞-measurable are called tail functions.

Proof. For each γ ∈ R we have that {T ≤ γ} ∈ T∞. This implies P(T ≤ γ) ∈ {0, 1}. On the other hand, P(T ≤ ·) being a distribution function, we have

lim_{γ→−∞} P(T ≤ γ) = 0 and lim_{γ→+∞} P(T ≤ γ) = 1.

Let C := {γ ∈ R : P(T ≤ γ) = 1} and α := inf(C) = inf{γ ∈ R : P(T ≤ γ) = 1}. Then for an appropriately chosen decreasing sequence (γn) in C we have γn ↓ α, and since {T ≤ γn} ↓ {T ≤ α} we have α ∈ C. Hence α = min C. Since P(T ≤ γ) = 0 for every γ < α, this implies

P(T < α) = 0,

which implies

P(T = α) = 1.

Exercise 8.8 A coin is tossed infinitely often. Show that every finite sequence

(ω1, . . . , ωk), ωi ∈ {H, T}, k ∈ N,

occurs infinitely often with probability one.


Exercise 8.9 Try to prove (8.2) in the Borel-Cantelli Lemma for independent events (An) as follows:

1. For each sequence (αn) of real numbers with 0 ≤ αn ≤ 1 we have

∏_{i=1}^{n} (1 − αi) ≤ exp(−∑_{i=1}^{n} αi).  (8.6)

This implies

∑_{n=1}^{∞} αn = ∞ ⇒ lim_{n→∞} ∏_{i=1}^{n} (1 − αi) = 0.

2. For A := lim sup An we have

1 − P(A) = lim_{n→∞} P(⋂_{m=n}^{∞} Am^c) = lim_{n→∞} lim_{N→∞} ∏_{m=n}^{N} (1 − P(Am)).

3. As ∑ P(An) diverges, we have because of 1.

lim_{N→∞} ∏_{m=n}^{N} (1 − P(Am)) = 0

and hence P(A) = 1 because of 2. Fill in the missing details!

9 Laws of Large Numbers

The central goal of probability theory is to describe the asymptotic behavior of a sequence of random variables. In its easiest form this has already been done for i.i.d. sequences in Introduction to Probability and Statistics. In the first theorem of this section this is slightly generalized.

Theorem 9.1 (Khintchine) Let (Xn)n∈N be a sequence of square integrable, real-valued random variables that are pairwise uncorrelated. Assume

lim_{n→∞} (1/n²) ∑_{i=1}^{n} V(Xi) = 0.

Then the weak law of large numbers holds, i.e.

lim_{n→∞} P(|(1/n) ∑_{i=1}^{n} Xi − (1/n) E(∑_{i=1}^{n} Xi)| > ε) = 0 for all ε > 0.


Proof. By Chebyshev's inequality, for each ε > 0:

P(|(1/n) ∑_{i=1}^{n} (Xi − EXi)| > ε) ≤ (1/ε²) V((1/n) ∑_{i=1}^{n} (Xi − EXi))
= (1/ε²)(1/n²) V(∑_{i=1}^{n} (Xi − EXi))
= (1/ε²)(1/n²) ∑_{i=1}^{n} V(Xi − EXi)
= (1/ε²)(1/n²) ∑_{i=1}^{n} V(Xi).

Here we used that the random variables are pairwise uncorrelated. By assumption the latter expression converges to zero.
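The Chebyshev estimate used in the proof can be compared with an empirical deviation probability. The sketch below is only an illustration (uniform random variables, n = 1000 and ε = 0.05 are arbitrary choices); note that the bound is valid but typically far from tight.

# Empirical deviation probability versus the Chebyshev bound Var(X)/(n*eps^2).
import numpy as np

rng = np.random.default_rng(6)
n, eps, reps = 1000, 0.05, 20_000
x = rng.uniform(0.0, 1.0, size=(reps, n))          # i.i.d., mean 1/2, variance 1/12

deviation = np.abs(x.mean(axis=1) - 0.5) > eps
print(deviation.mean())                            # empirical P(|mean - 1/2| > eps)
print((1 / 12) / (n * eps**2))                     # Chebyshev bound from the proof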

Remark 9.2 As we will learn in the next theorem, for an independent sequence square integrability is not even required.

Theorem 9.1 raises the question whether we can replace the stochastic convergence there by almost sure convergence. This will be shown in the following theorem. Such a theorem is called a strong law of large numbers. Its first form was proved by Kolmogorov. We will present a proof due to Etemadi from 1981.

Theorem 9.3 (Strong Law of Large Numbers – Etemadi 1981) For each sequence (Xn)n of real-valued, pairwise independent, identically distributed (integrable) random variables the Strong Law of Large Numbers holds, i.e.

P(lim sup_{n→∞} |(1/n) ∑_{i=1}^{n} Xi − EX1| > ε) = 0 for each ε > 0.

Before we prove Theorem 9.3 let us make a couple of remarks. These should reveal the structure of the proof a bit:

1. Denote Sn = ∑_{i=1}^{n} Xi. Then Theorem 9.3 asserts that (1/n) Sn → η := EX1, P-almost surely.

2. Together with Xn also Xn⁺ and Xn⁻ (where Xn⁺ = max(Xn, 0) and Xn⁻ = (−Xn)⁺) satisfy the assumptions of Theorem 9.3. Since Xn = Xn⁺ − Xn⁻ it therefore suffices to prove Theorem 9.3 for positive random variables. We therefore assume Xn ≥ 0 for the rest of the proof.

3. All proofs of the Strong Law of Large Numbers use the following trick: we truncate the random variables Xn by cutting off values that are too large. We therefore introduce

Yn := Xn 1_{|Xn|<n} = Xn 1_{Xn<n}.


Of course, if µ is the distribution of Xn and µn is the distribution of Yn, then µn ≠ µ. Indeed µn = fn(µ), where

fn(x) := x if 0 ≤ x < n, and 0 otherwise.

The idea behind truncation is that we gain square integrability of the sequence. Indeed:

E(Yn²) = E(fn² ∘ Xn) = ∫ fn²(x) dµ(x) = ∫_0^n x² dµ(x) < ∞.

4. Of course, after having gained information about the Yn we need to translate these results back to the Xn. To this end we will apply the Borel-Cantelli Lemma and show that

∑_{n=1}^{∞} P(Xn ≠ Yn) < ∞.

This implies that Xn ≠ Yn only for finitely many n, with probability one. In particular, if we can show that (1/n) ∑_{i=1}^{n} Yi → η P-a.s., we also can show that (1/n) ∑_{i=1}^{n} Xi → η P-a.s.

5. For the purposes of the proof we remark the following: let α > 1 and for n ∈ N let

kn := [α^n]

denote the largest integer ≤ α^n. This means kn ∈ N and

kn ≤ α^n < kn + 1.

Since

lim_{n→∞} (α^n − 1)/α^n = 1,

there is a number cα, 0 < cα < 1, such that

kn > α^n − 1 ≥ cα α^n for all n ∈ N.  (9.1)

We now turn to the proof of Theorem 9.3.

Proof. Step 1: Without loss of generality Xn ≥ 0. Define Yn = 1_{Xn<n} Xn. Then the Yn are pairwise independent and square integrable. Define

S̃n := ∑_{i=1}^{n} (Yi − EYi).

Let ε > 0 and α > 1. Using Chebyshev's inequality and the pairwise independence of the random variables (Yn) we obtain

P(|(1/n) S̃n| > ε) ≤ (1/ε²) V((1/n) S̃n) = (1/ε²)(1/n²) V(S̃n) = (1/n²)(1/ε²) ∑_{i=1}^{n} V(Yi).


Observe that V(Yi) = E(Yi²) − (E(Yi))² ≤ E(Yi²). Thus

P(|(1/n) S̃n| > ε) ≤ (1/n²)(1/ε²) ∑_{i=1}^{n} E(Yi²).

For kn = [α^n] this gives

P(|(1/kn) S̃kn| > ε) ≤ (1/(ε² kn²)) ∑_{i=1}^{kn} E(Yi²)

for all n ∈ N. Thus

∑_{n=1}^{∞} P(|(1/kn) S̃kn| > ε) ≤ (1/ε²) ∑_{n=1}^{∞} ∑_{i=1}^{kn} (1/kn²) E(Yi²).

By rearranging the order of summation we obtain

∑_{n=1}^{∞} P(|(1/kn) S̃kn| > ε) ≤ (1/ε²) ∑_{j=1}^{∞} tj E(Yj²),

where

tj := ∑_{n=nj}^{∞} 1/kn²

and nj is the smallest n with kn ≥ j. From (9.1) we obtain

tj ≤ (1/cα²) ∑_{n=nj}^{∞} 1/α^{2n} = (1/cα²) α^{−2nj} · 1/(1 − 1/α²) = dα α^{−2nj},

where dα = cα^{−2} (1 − α^{−2})^{−1} > 0. Since α^{nj} ≥ knj ≥ j, this implies

tj ≤ dα j^{−2}.

Using the above and E(Yj²) = ∫_0^j x² dµ(x), we get

∑_{n=1}^{∞} P(|(1/kn) S̃kn| > ε) ≤ (dα/ε²) ∑_{j=1}^{∞} (1/j²) ∑_{k=1}^{j} ∫_{k−1}^{k} x² dµ(x).

Again rearranging the order of summation yields

∑_{j=1}^{∞} (1/j²) ∑_{k=1}^{j} ∫_{k−1}^{k} x² dµ(x) = ∑_{k=1}^{∞} (∑_{j=k}^{∞} 1/j²) ∫_{k−1}^{k} x² dµ(x).

Since

∑_{j=k}^{∞} 1/j² < 1/k² + 1/(k(k+1)) + 1/((k+1)(k+2)) + . . .
= 1/k² + (1/k − 1/(k+1)) + (1/(k+1) − 1/(k+2)) + . . . = 1/k² + 1/k ≤ 2/k,


this yields

∑_{n=1}^{∞} P(|(1/kn) S̃kn| > ε) ≤ (2dα/ε²) ∑_{k=1}^{∞} ∫_{k−1}^{k} (x/k) · x dµ(x) ≤ (2dα/ε²) ∑_{k=1}^{∞} ∫_{k−1}^{k} x dµ(x) = (2dα/ε²) E(X1) < ∞.

Thus by the Borel-Cantelli Lemma

P(|(1/kn) S̃kn| > ε infinitely often in n) = 0.

Since ε > 0 was arbitrary, this is equivalent to the almost sure convergence

lim_{n→∞} (1/kn) S̃kn = 0 P-a.s.  (9.2)

Step 2: Next let us see that indeed (1/kn) ∑_{i=1}^{kn} Yi can only converge to E(X1). By definition of Yn we have that

E(Yn) = ∫ x dµn(x) = ∫_0^n x dµ(x).

Thus by monotone convergence

E(X1) = lim_{n→∞} E(Yn).

By Exercise 9.6 below this implies

E(X1) = lim_{n→∞} (1/n)(EY1 + . . . + EYn).  (9.3)

By definition of the sums S̃n we have

(1/kn) S̃kn = (1/kn) ∑_{i=1}^{kn} Yi − (1/kn) ∑_{i=1}^{kn} E(Yi),

so (9.2) and (9.3) together imply

lim_{n→∞} (1/kn) ∑_{i=1}^{kn} Yi = lim_{n→∞} (1/kn) S̃kn + lim_{n→∞} (1/kn) ∑_{i=1}^{kn} EYi = EX1 P-a.s.,

which is what we wanted to show in this step.

Step 3: Now we are aiming at removing the truncation from the Xn. Consider the sum

∑_{n=1}^{∞} P(Xn ≠ Yn) = ∑_{n=1}^{∞} P(Xn ≥ n) = ∑_{n=1}^{∞} P(X1 ≥ n).

According to Exercise 3.4 this is at most E(X1), so it is finite. Therefore, by the Borel-Cantelli Lemma,

P(Xn ≠ Yn infinitely often) = 0.


Hence there is an n0 (random) such that with probability one Xn = Yn for all n ≥ n0. But the finitely many differences drop out when averaging, hence for Sn = ∑_{i=1}^{n} Xi we also obtain

lim_{n→∞} (1/kn) Skn = EX1 P-a.s.

Step 4: Eventually we show that the theorem holds not only for the subsequences (kn) chosen as above, but also for the whole sequence.

For fixed α > 1, of course, the sequence (kn)n is fixed and diverges to +∞. Hence for every m ∈ N there exists n ∈ N such that

kn < m ≤ kn+1.

Since we assumed the Xi to be non-negative this implies

Skn ≤ Sm ≤ Skn+1.

Hence

(Skn/kn) · (kn/m) ≤ Sm/m ≤ (Skn+1/kn+1) · (kn+1/m).

The definition of kn yields

kn ≤ α^n < kn + 1 ≤ m ≤ kn+1 ≤ α^{n+1}.

This gives

kn+1/m < α^{n+1}/α^n = α

as well as

kn/m > (α^n − 1)/α^{n+1}.

Now, given α, for all n ≥ n1 = n1(α) we have α^n − 1 ≥ α^{n−1}. Hence, if m ≥ kn1 and thus n ≥ n1, we obtain

kn/m > (α^n − 1)/α^{n+1} > α^{n−1}/α^{n+1} = 1/α².

Now for each α we have a set Ωα with P(Ωα) = 1 such that

lim_{n→∞} (1/kn) Skn(ω) = EX1 for all ω ∈ Ωα.

Without loss of generality we may assume that the Xi are not identically equal to zero P-a.s., otherwise the assertion of the Strong Law of Large Numbers is trivially true. Therefore we may assume without loss of generality that EX1 > 0. Since α > 1 we then have

(1/α) EX1 < (1/kn) Skn(ω) < α EX1

for all ω ∈ Ωα and all n large enough. For such m and ω this means

(α^{−3} − 1) EX1 < (1/m) Sm(ω) − EX1 < (α² − 1) EX1.


Define

Ω1 := ⋂_{n=1}^{∞} Ω_{1+1/n}.

Then P(Ω1) = 1 and

lim_{m→∞} (1/m) Sm(ω) = EX1

for all ω ∈ Ω1. This proves the theorem.
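Along a single simulated sample path the statement of Theorem 9.3 is plainly visible: the running averages Sn/n settle down at EX1. The sketch below is only an illustration (the exponential distribution and the sample size are arbitrary choices).

# Running averages S_n / n along one sample path approach E X_1 = 3.
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=3.0, size=1_000_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in [10, 1000, 100_000, 1_000_000]:
    print(n, running_mean[n - 1])      # approaches 3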

Remark 9.4 Theorem 9.3 in particular implies that for i.i.d. sequences of random variables (Xn) with a finite first moment the Strong Law of Large Numbers holds true. Since stochastic convergence is implied by almost sure convergence, the conclusion of Theorem 9.1 – the Weak Law of Large Numbers – holds true for such sequences as well. Therefore the finiteness of the second moment is not necessary for the conclusion of Theorem 9.1 to hold for i.i.d. sequences.

Remark 9.5 One might, of course, ask whether a finite first moment is necessary for Theorem 9.3 to hold. Indeed one can prove that if a sequence of i.i.d. random variables is such that (1/n) ∑_{i=1}^{n} Xi converges almost surely to some random variable Y (necessarily a tail function as in Corollary 8.7!), then EX1 exists and Y = EX1 almost surely. This will not be shown in the context of this course.

Exercise 9.6 Let (am)m be real numbers such that lim_{m→∞} am = a. Show that this implies convergence of their Cesàro means:

lim_{n→∞} (1/n)(a1 + a2 + . . . + an) = a.

Exercise 9.7 Prove the Strong Law of Large Numbers for a sequence of i.i.d. random variables (Xn)n with a finite fourth moment, i.e. for random variables with E(X1⁴) < ∞. Do not use the statement of Theorem 9.3 explicitly.

Remark 9.8 A very natural question to ask in the context of Theorem 9.3 is: how fast does (1/n) Sn converge to EX1? I.e., given a sequence of i.i.d. random variables (Xn)n, what is

P(|(1/n) ∑ Xi − EX1| ≥ ε)?

If X1 has a finite moment generating function, i.e. if

M(t) := log E e^{tX1} < ∞ for all t,

the answer is: exponentially fast. Indeed, Cramér's theorem (which cannot be proven in the context of this course) asserts the following: let I : R → R be given by

I(x) = sup_t [xt − M(t)].

Then for every closed set A ⊂ R

lim sup_{n→∞} (1/n) log P((1/n) ∑_{i=1}^{n} Xi ∈ A) ≤ − inf_{x∈A} I(x)


and for every open set O ⊂ R

lim inf_{n→∞} (1/n) log P((1/n) ∑_{i=1}^{n} Xi ∈ O) ≥ − inf_{x∈O} I(x).

This is called a principle of large deviations for the random variables (1/n) ∑_{i=1}^{n} Xi. In particular, one can show that the function I is convex and non-negative, with

I(x) = 0 ⇔ x = EX1.

We therefore obtain

∀δ > 0 ∃N ∀n > N : P(|(1/n) ∑ Xi − EX1| ≥ ε) ≤ e^{−n min(I(EX1+ε), I(EX1−ε)) + nδ},

where I is the I-function introduced above evaluated for the random variables Xi − EX1. The speed of convergence is thus exponentially fast.

Example 9.9 For a Bernoulli B(1, p) random variable X

eM (t) = E(etX ) = pet + (1 − p); I (x) = −x log p

x

− (1 − x)log

1 − p

1 − x

.

Exercise 9.10 Determine the functions M and I for a normally distributed random vari-able.

Exercise 9.11 Argue that if the moment generating function of a random variable X is

finite, all its moments are finite. In particular both Laws of Large Numbers are applicable to a sequence X 1, X 2, . . . of iid random variables distributed like X .

At the end of this section we will turn to two applications of the Law of Large Numberswhich are interesting in their own right:

The first of these two applications is in number theory. Let (Ω, F , P) be given by Ω =[0, 1), F = B 1 Ω and P = λ1

Ω (Lebesgue measure). For every number ω ∈ Ω we mayconsider its g-adic representation

ω =∞

n=1

ξ ng−n (9.4)

Here g ≥ 2 is a natural number and ξ n ∈ 0, . . . , g − 1. This representation is unique,if we ask that not all but finitely many of the digits ξ n are equal to g − 1. For eachε ∈ 0, . . . , g − 1 let S ε,gn (ω) be the number of all i ∈ 1, . . . , n with ξ i(ω) = ε in itsg-adic representation (9.4). We will call a number ω ∈ [0, 1) g -normal1, if

limn→∞

1

nS ε,gn (ω) =

1

g

1The usual meaning of g-normality is that each string of digits ε1ε2 . . . εk occurs with frequency g−k.

33

Page 36: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 36/69

for all ε = 0, . . . , g−1. Hence ω is g-normal, if in the long run all of its digits occur with thesame frequency. We will call ω absolutely normal, if ω is g-normal for all g ∈ N, g ≥ 2.Now for a number ω ∈ [0, 1) randomly chosen according to Lebesgue measure the ξ i(ω) arei.i.d. random variables; they have as their distribution the uniform distribution on the set0, . . . , g − 1. This has to be shown in Exercise 9.13 below and is a consequence of theuniformity of Lebesgue measure. Hence the random variable

X n(ω) =

1 if ξ n(ω) = ε0 otherwise

are i.i.d. random variables for each g and ε. Moreover S ε,gn (ω) =n

i=1 X ε,gi (ω). Accordingto the Strong Law of Large Numbers (Theorem 9.3)

1

nS ε,gn (ω) → E (X ε,g1 ) =

1

g λ1-a.s.

for all ε

∈ 0, . . . , g

−1

and all g

≥ 2. This means λ1-almost every number ω is g-normal,

i.e. there is a set N g with λ1 (N g) = 0, such that ω is g-normal for all ω ∈ N cg . Now

N :=

∞g=2

N g

is a set of Lebesgue measure zero as well. This readily implies

Theorem 9.12 (E. Borel) λ1-almost every ω ∈ [0, 1) is absolutely normal.

It is rather surprising that hardly any normal numbers (in the usual meaning, see

Footnote page 33), are known. Champernowne (1933) showed that

ω = 0, 1234567891011121314. . .

is 10-normal. Whether√

2, log2, e or π are normal of any kind has not been shown yet.There are no absolutely normal numbers known at all.

Exercise 9.13 Show that for every g ≥ 2, the random variables ξ n (ω) introduced above are i.i.d. random variables that are uniformly distributed on 0, . . . , g − 1.

The second application is to derive a classical result from analysis – which in principlehas nothing to do with probability theory – is related to the Strong Law of Large Numbers.

As may be well known the approximation theorem by Stone and Weierstraß assertsthat every continuous function on [a, b] (more generally on every compact set) can beapproximated uniformly by polynomials. Obviously it suffices to prove this for [a, b] = [0, 1].So let f ∈ C ([0, 1]) be a continuous function on [0, 1]. Define the n’th Bernstein polynomialfor f as

Bnf (x) =n

k=0

nk

f

k

n

xk (1 − x)n−k .

34

Page 37: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 37/69

Theorem 9.14 For each f ∈ C ([0, 1]) the polynomials Bnf converge to f uniformly in [0, 1].

Proof. Since f is continuous and [0, 1] is compact f is uniformly continuous on [0, 1], i.e.for each ε > 0 there exists δ > 0 such that

|x − y| < δ ⇒ |f (x) − f (y)| < ε.

Now consider a sequence of i.i.d Bernoullis with parameter p, (X n)n. Call

S ∗n := 1

nS n :=

1

n

ni=1

X i.

Then by Chebyshev’s inequality

P (|S ∗n − p| ≥ δ ) ≤ 1

δ 2V (S ∗n) =

1

n2δ 2V(S n) =

p (1 − p)

nδ 2 ≤ 1

4nδ 2. (9.5)

This yields

|Bnf ( p) − f ( p)| = |E (f S ∗n) − f ( p)| =

f (x)dPS ∗n(x) − f ( p)

|S ∗n− p|<δ

|f (S ∗n(x)) − f ( p)| dPS ∗n(x) +

+

|S ∗n− p|≥δ

|f (S ∗n(x)) − f ( p)| dPS ∗n(x)

≤ ε + 2

f

P (

|S ∗n

− p

| ≥ δ )

≤ ε +

2 f 4nδ

2

.Here f is the sup-norm of f . Hence

sup p∈[0,1]

|Bnf ( p) − f ( p)| ≤ ε + 2 f

4nδ 2

This can be made smaller than 2ε by choosing n large enough.Notice that Weak Law of Large Numbers by itself would yield, instead of inequality (9.5),an inequality of the kind

∀ρ > 0

∃N such that

∀n

≥ N : P(

|S ∗n

− p

| ≥ δ )

≤ ρ.

In this approach it is not clear that N can be chosen independently of p, so that we onlywould get pointwise convergence.

10 The Central Limit Theorem

In the previous section we met one of the central theorems of probability theory – theLaw of Large Numbers: If EX 1 exists, for sequence of i.i.d. random variables (X n), theiraverage 1

n ni=1 X i converges to EX 1 (a.s.). The following theorem, called the Central Limit

35

Page 38: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 38/69

Theorem analyzes the fine structure in the Law of Large Numbers. Its name is due to Polya,the proof of the following theorem goes back to Charles Stein.

First of all notice that in a certain sense, in order to analyze the fine structure of n

i=1 X ithe scaling of the Weak Law of Large Numbers 1

n is already an overscaling. On this scale

we just cannot see the shape of the distribution anymore, since by scaling ni=1 X i by a

factor 1n we have reduced its variance to a scale 1

n which converges to zero. What we seein the Law of Large Numbers is a bell shaped curve with a tiny, tiny width. Here is whatwe get, if we scale the variance to one:

Theorem 10.1 (Central Limit Theorem - CLT) Let X 1, . . . , X n be a sequence of ran-dom variables that are independent and have identical distribution (the same for all n) with EX 21 < ∞. Then

limn→∞

P

ni=1(X i − EX 1)√

nVX 1≤ a

=

1√ 2π

a−∞

e−x2/2dx. (10.1)

Before proving the Central Limit Theorem let us remark that it holds under weakerassumptions as well

Remark 10.2 Indeed, the Central Limit Theorem also holds under the following weaker assumption. Assume given for n = 1, 2, . . . an independent family of random variables X ni,i = 1, . . . , n. For j = 1, . . . n let

ηnj = EX nj

and

sn := n

i=1

VX ni.

The sequence ((X ni)ni=1)n is said to satisfy the Lindeberg condition, if

Ln(ε) → 0 as n → ∞ for all ε > 0. Here

Ln(ε) = 1

s2n

n j=1

E

(X nj − ηnj)2 ; |X nj − ηnj| ≥ εsn

.

Intuitively speaking the Lindeberg condition asks that none of the variables dominates the whole sum.The generalized form of the CLT stated above now asserts that if the sequence (X n)

satisfies the Lindeberg condition, it also satisfies the CLT, i.e.

limn→∞

P

ni=1(X ni − ηni) n

i=1 VX ni≤ a

=

1√ 2π

a−∞

e−x2/2dx.

The proof of this more general theorem basically mimicks the proof we will give below for Theorem 10.1. We spare ourselves the additional technical work.

36

Page 39: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 39/69

We will present a proof of the CLT that goes back to Charles Stein. It is based on acouple of facts:

Fact 1: It suffices to prove the CLT for i.i.d. random variables with EX 1 = 0. Otherwiseone just substracts EX 1 from each of the X i.

Fact 2: Define

S n :=n

i=1

X i and σ2 := V(X 1)

Theorem 10.1 asserts the convergence in distribution of the (normalized) S n to a Gaussianrandom variable. What we need to show is thus

E

f

S n√

nσ2

→ 1√

∞−∞

f (x)e−x2/2dx = E [f (Y )] (10.2)

as n → ∞ for all f : R → R that are uniformly continuous and bounded. Here Y is astandard normal random variable, i.e. it is N (0, 1) distributed.

We prepare the proof of the CLT in two lemmata.

Lemma 10.3 Let f : R → R be bounded and uniformly continuous. Define

N (f ) := 1√

∞−∞

f (y)e−y2/2dy

and

g(x) := ex2/2

x−∞

(f (y) − N (f )) e−y2/2dy.

Then g fulfills g(x)

−xg(x) = f (x)

− N (f ). (10.3)

Proof. Differentiating g gives

g(x) = xex2/2

x−∞

(f (y) − N (f )) e−y2/2dy + ex2/2 (f (x) − N (f )) e−x2/2

= xg(x) + f (x) − N (f ).

The importance of the above lemma becomes obvious, if we substitute a random variableX into (10.3) and take expectations:

E [g(X ) − Xg(X )] = E [f (X ) − N (f )] .If X ∼ N (0, 1) is standard normal, the right hand side is zero and so is the left hand

side. The idea is thus that instead of showing that

E [f (U n) − N (f )]

converges to zero, we may show the same for

E [g(U n) − U ng(U n)] .

The next step discusses the function g introduced above.

37

Page 40: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 40/69

Lemma 10.4 Let f : R → R be bounded and uniformly continuous and g be the solution of

g(x) − xg(x) = f (x) − N (f ). (10.4)

Then g(x), xg(x) and g(x) are bounded and continuous

Proof. Obviously g is even differentiable, hence continuous. But then also xg(x) is con-tinuous. Eventuallyg(x) = xg(x) + f (x) − N (f )

is continuous as the sum of continuous functions. For the boundedness part first note thatany continuous function on a compact set is bounded, hence we only need to check thatthe functions g, xg, and g are bounded for x → ±∞.

To this end first note that

g(x) = ex2/2

x−∞

(f (y) − N (f )) e−y2/2dy

= −ex2

/2 ∞x

(f (y) − N (f )) e−y2

/2dy

(this is true since the whole integral must equal zero).For x ≤ 0 we have

g(x) ≤ supy≤0

|f (y) − N (f )| ex2/2

x−∞

e−y2/2dy

while for x ≥ 0 we have

g(x)

≤ supy≥0 |

f (y)

− N (f )

|ex2/2

x

e−y2/2dy.

Now for x ≤ 0

ex2/2

x−∞

e−y2/2dy ≤ ex2/2

x−∞

−y

|x| e−y2/2dy = 1

|x|and similarly for x ≥ 0

ex2/2

∞x

e−y2/2dy ≤ ex2/2

∞x

y

xe−y2/2dy ≤ 1

|x| . (10.5)

Thus we see that for x ≥ 1

|g(x)| ≤ |xg(x)| ≤ supy≥0

|f (y) − N (f )|

as well as for x ≤ −1|g(x)| ≤ |xg(x)| ≤ sup

y≤0|f (y) − N (f )| .

Hence g and xg are bounded. But then also g is bounded, since

g(x) = xg(x) + f (x) − N (f ).

38

Page 41: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 41/69

Page 42: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 42/69

Page 43: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 43/69

Page 44: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 44/69

Exercise 10.5 Let X 1, . . . , X n be i.i.d. random variables and S n =n

i=1 X i. Let

g : R → R

be a continuously differentiable function. Show that for all j

10 [g(S n − (1 − s)X j) − g(S n − X j)]X jds = g(S n) − g(S n − X j) − X jg(S n − X j).

We conclude the section with the informal discussion of two extensions on the CentralLimit Theorem. The first is of practical importance, the second is of more theoreticalinterest.

When one tries to apply the Central Limit Theorem, e.g. for a sequence of i.i.d. randomvariables, it is of course not only important to know that

X n := ni=1(X i − EX 1)

√ nVX 1

converges to a random variable Z ∼ N (0, 1). One also needs to know, how close thedistributions of X n and Z are. This is stated in the following theorem due to Berry andEsseen:

Theorem 10.6 (Berry-Esseen) Let (X i)i∈N be a sequence of i.i.d. random variables with E(|X 1|3) < ∞.Then for a N (0, 1)-distributed random variable Z it holds:

supa∈R

P

ni=1 X i − EX 1√

nVX 1≤ a

− P (Z ≤ a)

≤ C √

n

E(|X 1 − EX 1|3)

(VX 1)3/2 .

The numerical value of C is below 6 and larger than 0.4. This is rather easy to prove.

The second extension of the Central Limit Theorem starts with the following obser-vation: Let X 1, X 2, . . . be a sequence of i.i.d. random variables with finite variance andexpectation zero. Then the law of large numbers says that 1

n

ni=1 X i converges to EX 1 = 0

in probability and almost surely. But it tells nothing about the size of the fluctuations.This is considered in greater detail by the Central Limit Theorem. The latter describes theasymptotic probabilities that

P

ni=1(X i − EX 1)

nV (X 1)≥ a

.

Since these probabilities are positive for all a ∈ R according to the Central Limit Theorem,it can be shown that the fluctuations of

ni=1 X i are larger than

√ n, more precisely, for

each positive a ∈ R it holds with probability one that lim supPn

i=1 X i√ n

≥ a.

The question for the precise size of the fluctuations, i.e. for the right scaling ( an) suchthat

lim sup

ni=1 X i√ ann

is almost surely finite, is answered by the law of the iterated logarithm:

42

Page 45: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 45/69

Theorem 10.7 (Law of the Iterated Logarithm by Hartmann and Winter) Let (X i)i∈N be a sequence of i.i.d. random variables with σ2 := VX 1 < ∞ (σ > 0). Then for S n :=

ni=1 X i it holds

lim sup S n√ 2n log log n

= +σ P-a.s.

and

lim inf S n√ 2n log log n

= −σ P-a.s.

Due to the restricted time we will not be able to prove the Law of the Iterated Logarithmin the context of this course. Despite its theoretical interest its practical relevance is ratherlimited. To understand why, notice that the correction to the

√ nVX 1 from the Central

Limit Theorem to the Law of the Iterated Logarithm are of order√

log log n. Even for afantastically large number of observation, 10100 (which is more than one observation per

atom in the universe) √ log log n is really small,e.g. log log 10100 =

log(100 log 10) =

log 100 + log log 10 √

6.13 2.47.

11 Conditional Expectation

To understand the concept of conditional expectation, we will start with a little example.

Example 11.1 Let Ω be a finite population and let the random variable X (ω) denote the income of person ω. So, if we are only interested in income, X contains the full information

of our experiment. Now assume we are a sociologist and want to measure the influence of a person’s religion on his income. So we are not interested in the full information given by X , but only in how X behaves on each of the sets,

catholic, protestant, islamic, jewish, atheist,

etc. This leads to the concept of conditional expectation.

The basic idea of conditional expectation will be that given a random variable

X : (Ω,

F )

→ R

and a sub-σ-algebra A of F to introduce a new random variable called E [X | A] =: X 0such that X 0 is A-measurable and

C

X 0dP =

C

XdP

for all C ∈ A. So X 0 contains all information necessary when we only consider events inA. First we need to see that such a X 0 can be found in a unique way.

43

Page 46: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 46/69

Theorem 11.2 Let (Ω, F , P) be a probability space and X an integrable random variable.Let C ⊆ F be a sub-σ-algebra. Then (up to P-a.s. equality) there is a unique random variable X 0, which is C-measurable and satisfies

C

X 0dP = C

XdP for all C ∈ C. (11.1)

If X ≥ 0, then X 0 ≥ 0 P-a.s.

Proof. First we treat the case X ≥ 0. Denote P 0 := P |C and Q = X P |C. Both, P 0 andQ are measures on C, P 0 even is a probability measure. By definition

Q (C ) =

C

XdP.

Hence Q (C ) = 0 for all C with P (C ) = 0 = P0 (C ). Hence Q P 0. By the theorem of Radon-Nikodym there is a C-measurable function X 0 ≥ 0 on Ω such that Q = X 0P 0. Thus

C

X 0dP 0 = C

XdP for all C ∈ C.

Hence C

X 0dP =

C

XdP for all C ∈ C.

Hence X 0 satisfies (11.1). For X 0 that is C-measurable and satisfies (11.1) the set C = X 0 < X 0

is in C and

C

X 0dP = C

X 0dP, whence P(C ) = 0. In the same way P(

X 0 > X 0

=0. Therefore X 0 is P-a.s. equal to X 0.

The proof for arbitrary, integrable X is left to the reader.

Exercise 11.3 Prove Theorem 11.2 for arbitrary, integrable X : Ω → R.

Definition 11.4 Under the conditions of Theorem 11.2 the random variable X 0 (which is P-a.s. unique) is called the conditional expectation of X given C. It is denoted by

X 0 =: E [X | C] =: EC [X ] .

If C is generated by a sequence of random variable (Y i)i∈I such that C = σ (Y i, i ∈ I ) we write

E X

| (Y i)i

∈I = E [X

| C] .

If I = 1, . . . , n we also write E [X | Y 1, . . . , Y n].

Note that, in order to check whether Y (Y C-measurable) is a conditional expectationof X given the sub-σ-algebra C we need to check

C

Y dP =

C

XdP

for all C ∈ C. This determines E [X | C] only P-a.s. on sets C ∈ C. We therefore also speakabout different versions of conditional expectation.

44

Page 47: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 47/69

Example 11.5 1. If C = ∅, Ω, then the constant random variable EX is a version of E [X | C]. Indeed if C = ∅, then any variable does the job. If C = Ω

C

XdP = EX =

EXdP.

2. If C is generated by the family (Bi)i∈I of mutually disjoint sets (i.e. Bi ∩ B j = ∅if i = j), where I is countable and Bi ∈ A (the original space being (Ω, A, P)) and P(Bi) > 0 then

E [X | C] =i∈I

1

P(Bi)1Bi

Bi

XdP P-a.s.

be checked in the following exercise.

Exercise 11.6 Show that the assertion of Example 11.5.2. is true.

Exercise 11.7 Show that the fol lowing assertions for the conditional expectation E [X | C

]of random variables

X, Y : (Ω, A) → R, B 1

(C ⊂ A) are true:

1. E [E [X | C]] = EX

2. If X is C-measurable then E [X | C] = X P-a.s.

3. If X = Y P-a.s., then E [X | C] = E [Y | C] P-a.s.

4. If X

≡ α, then E [X

| C] = α P-a.s.

5. E [αX + βY | C] = αE [X | C] + β E [Y | C] P-a.s. Here α, β ∈ R.

6. X ≤ Y P-a.s. implies E [X | C] ≤ E [Y | C] P-a.s.

The following theorems have proofs that are almost identical with the proofs of thetheorems for expectations:

Theorem 11.8 (monotone convergence) Let (X n) be an increasing sequence of positive random variables with X = sup X n, X integrable , then

supn

E [X n | C] = limn→∞ E [X n | C] = E

limn→∞ X n | C = E [X | C] .

Theorem 11.9 (dominated convergence) Let (X n) be a sequence of random variables converging pointwise to an (integrable) random variable X , such that there is an integrable random variable Y with Y ≥ |X n|, then

limn→∞

E [X n | C] = E [X | C] .

Also Jensen’s inequality has a generalization to conditional expectations:

45

Page 48: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 48/69

Theorem 11.10 (Jensen’s inequality) Let X be an integrable random variable taking values in an open interval I ⊂ R and let

q : I → R

be a convex function. Then for each C ⊂ A it holds

E [X | C] : Ω → I

and q (E [X | C]) ≤ E [q X | C] .

An immediate consequence of Theorem 11.10 is the following (for p ≥ 1):

|E [X | C]| p ≤ E [|X | p | C]

which impliesE (

|E [X

| C]

| p)

≤ E (

|X

| p) .

Denoting by

N p (f ) =

|f | p dP

1/p

this meansN p (E [X | C]) ≤ N p (X ) , X ∈ L p (P) .

This holds for 1 ≤ p < ∞. N p(f ) is called the L p-norm of f . The case p = ∞, which meansthat if X is bounded P-a.s. by some M ≥ 0, then so is E [X | C], follows from Exercise11.7.

We slightly reformulate the definition of conditional expectation to discuss its furtherproperties.

Lemma 11.11 Let X be a positive integrable function. Let X 0 : (Ω, A) → (R, B 1) a posi-tive C-measurable integrable random variable that is a version of E [X | C] ( X integrable),then

ZX 0dP =

ZXdP (11.2)

for all C-measurable, positive random variables Z .

Proof. From (11.1) we obtain (11.2) for step functions. The general result follows from

monotone convergence.We are now prepared to show a number of properties of conditional expectations which

we will call smoothing properties

Theorem 11.12 (Smoothing properties of conditional expectations) Let (Ω, F , P)be probability space and X ∈ L p (P) and Y ∈ Lq (P), 1 ≤ p ≤ ∞, 1

p + 1

q = 1.

1. If C ⊆ F and X is C-measurable then

E [XY | C] = X E [Y | C] (11.3)

46

Page 49: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 49/69

2. If C1, C2 ⊆ F with C1 ⊆ C2 then

E [E [X | C2] | C1] = E [E [X | C1] | C2] = E [X | C1] .

Proof.

1. First assume that X, Y ≥ 0. Let X be C-measurable and C ∈ C. Then C

XY dP =

1C XY dP =

1C X E [Y | C] dP =

C

X E [Y | C] dP.

Indeed, this follows immediately from lemma 11.11 since 1C X is C-measurable. Onthe other hand, we also have XY ∈ L1(P) and

C

XY dP =

C

E [XY | C] dP.

Since X E [Y

| C] is

C-measurable we obtain

E [XY | C] = X E [Y | C] P-a.s.

In the case X ∈ L p (P) , Y ∈ Lq (P) we observe that then XY ∈ L1 (P) and concludeas above.

2. Observe that, of course, E [X | C1] is C1-measurable and, since C1 ⊆ C2, also C2-measurable. Property 2 in Exercise 11.7 than implies

E [E [X | C1] | C2] = E [X | C1] , P-a.s.

Moreover for all C

∈ C1

C

E [X | C1] dP = C

XdP.

Hence for all C ∈ C1 C

E [X | C1] dP =

C

E [X | C2] dP.

But this meansE [E [X | C2] | C1] = E [X | C1] P-a.s.

The previous theorem leads to yet another characterization of the conditional expecta-tion. To this end take X ∈ L2 (P) and denote X 0 := E [X | C] for a C ⊆ F . Let Z ∈ L2(P)be C-measurable. Then X 0 ∈ L2 (P) and by (11.3)

E [Z · (X − X 0) | C] = Z E [X − X 0 | C] = Z · (E [X | C] − X 0) = Z · (X 0 − X 0) = 0.

Theorem 11.13 For all X ∈ L2 (P) and each C ⊆ F the conditional expectation E [X | C]is (up to a.s. equality) the unique C-measurable random variable X 0 ∈ L2 (P) with

E

(X − X 0)2

= min E

(X − Y )2

; Y ∈ L2 (P) , Y C-measurable

47

Page 50: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 50/69

Proof. Let Y ∈ L2 (P) be C-measurable. Put X 0 := E [X | C]. Then

E((X −Y )2) = E((X −X 0+X 0−Y )2) = E((X −X 0)2)+E((X 0−Y )2)+2E((X −X 0)(X 0−Y ))

But E((X − X 0)(X 0 − Y )) = 0, since X 0 − Y is C-measurable.This gives

E

(X − Y )2− E

(X − X 0)2

= E

(X 0 − Y )2

. (11.4)

Due to positivity of squares we hence obtain

E

(X − X 0)2 ≤ E

(X − Y )2

.

If, on the other handE

(X − X 0)2

= E

(X − Y )2

thenE

(X 0 − Y )2

= 0

which implies Y = X 0 = E [X | C] P-a.s.The last theorem states that E [X | C] for X ∈ L2 (P) is the ”best approximation” of X

in the C-measurable function space ”in the sense of a least squares approximation”. It isthe projection of X onto the space of square integrable, C-measurable functions.

Exercise 11.14 Prove that for X ∈ L2(P), µ = E(X ) is the number that minimizes E((X − µ)2).

With the help of conditional expectation we can also give a new definition of conditional

probability

Definition 11.15 Let (Ω, F , P) be a probability space and C ⊂ F be a sub-σ-algebra. For A ∈ F

P [A | C] := E [1A | C]

is called the conditional probability of A given C.

Example 11.16 In the situation of Example 11.5.2 the conditional expectation of A ∈ F is given by

P (A

| C) =

i∈I

P (A

| Bi) 1Bi

:= i∈I

P (A ∩ Bi) 1Bi

P (Bi)

.

In a last step we will only introduce (but not prove) conditional expectations on eventswith zero probability. Of course, in general this will just give nonsense but in the case of a conditional expectation E [X | Y = y ] where X, Y are random variables such that (X, Y )has a Lebesgue density we can give this expression a meaning.

48

Page 51: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 51/69

Theorem 11.17 Let X, Y be real valued random variables such that (X, Y ) has a density f : R

2 → R+0 with respect to two dimensional Lebesgue measure λ2. Assume that X is

integrable and that

f 0 (y) :=

f (x, y) dx > 0 for all y ∈ R.

Then the function E

(X |Y ) will be denoted by

y → E (X | Y = y)

and one has

E (X | Y = y) = 1

f 0 (y)

xf (x, y) dx for PY -a.e. y ∈ R.

In particular

E (X | Y ) = 1

f 0 (Y ) xf (x, Y ) dx P-a.s.

We will also need the following relationship between conditional expectation and indepen-dence, which is a generalization of Example 11.5, case 1.

Lemma 11.18 Let X be an integrable real valued random variable and C ⊂ F a sub-σ-algebra such that X is independent of C, that is σ(X ) and C are independent, then

E(X | C) = E(X ), P-a.s.

Proof. Suppose X ≥ 0. Then an increasing sequence of step functions X n can be con-structed by X n = [2nX ]/2n. Then X n converges monotonically to X . Notice that X n is a

linear combination of indicator functions 1A with A ∈ σ(X ). And C 1AdP = P(C ∩ A) =

P(C )P(A) = C

E(1A)dP. Thus E(1A | C) = E(1A), and by linearity E(X n | C) = E(X n)and by the monotone convergence theorem E(X | C) = E(X ). The general case follows bylinearity, X = X + − X −.

Exercise 11.19 Let X and Y be as in Theorem 11.17, such that X and Y are independent.Then X is independent of σ(Y ), and by Lemma 11.18 we have E(X | Y ) = E(X | σ(Y )) =E(X ). Apply Theorem 11.17 to give an alternative derivation of this fact.

49

Page 52: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 52/69

12 Martingales

In this section we are going to define a notion, that will turn out to be of central interestin all of so called stochastic analysis and mathematical finance. A key role in its definitionwill be taken by conditional expectation. In this section we will just give the definition anda couple of examples. There is a rich theory of martingales. Parts of this theory we will

meet in a class on Stochastic Calculus.

Definition 12.1 Let (Ω, F , P) be a probability space and I be an ordered set (linearly or-dered), i.e. for s, t ∈ I either s ≤ t or t ≤ s, with s ≤ t and t ≤ s implies s = t and s ≤ t,t ≤ u implies s ≤ u. For t ∈ I let F t ⊂ F be a σ-algebra. (F t)t∈I is called a filtration, if s ≤ t implies F s ⊂ F t. A sequence of random variables (X t)t∈I is called (F t)t∈I - adapted if X t is F t-measurable for all t ∈ I .

Exercise 12.2 Construct a filtration on a probability space with |I | ≥ 3.

Example 12.3 Let (X t)t∈I be a family of random variables, and I a linearly ordered set,then F t = σ X s, s ≤ t

is a filtration and (X t) is adapted with respect to (F t). (F t)t∈I is called the canonical filtration with respect to (X t)t∈I .

Definition 12.4 Let (Ω, F , P) be a probability space and I a linearly ordered set. let (F t)t∈I be a filtration and (X t)t∈I be an (F t)-adapted sequence of random variables. (X t) is called an (F t)-supermartingale, if

E

[X t | F s] ≤ X s P

-a.s. (12.1)

for all s ≤ t. (12.1) is equivalent with C

X tdP ≤ C

X sdP, for all C ∈ F s. (12.2)

(X t) is called a (F t)-submartingale, if (−X t) is a (F t)-supermartingale. Eventually (X t)is called a martingale, if it is both a submartingale and a supermartingale. This means that

E [X t | F s] = X s P-a.s.

for s ≤ t or, equivalently, C

X tdP =

C

X sdP, C ∈ F s.

Exercise 12.5 Show that the conditions (12.1) and (12.2) are equivalent.

Remark 12.6 1. If (F t) is the canonical filtration with respect to (X t)t∈I , then often (X t) simply called a supermartingale, submartingale, or a martingale.

50

Page 53: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 53/69

2. (12.1) and (12.2) are evidently correct for s = t (with equality). Hence these proper-ties only need to be checked for s < t.

3. Putting C = Ω in (12.2) we obtain for a supermartingale (X t)t

s ≤ t ⇒ E (X s) ≥ E (X t) .

Hence for supermartingales (E (X s))s is a decreasing sequence, while for a submartin-gale (E (X s)) is an increasing sequence.

4. In particular, if each of the random variables X s is almost surely constant, e.g. if Ωis a singleton (a set with just one element) then (X s) is a decreasing sequence, if (X s)is a supermartingale. And it is an increasing sequence, if (X s) is a submartingale.Hence martingales are (in a certain sense) the stochastic generalization of constant sequences.

Exercise 12.7 Let (X t), (Y t) be adapted to the same filtration and α, β ∈ R. Show the following

1. If (X t) and (Y t) are martingales, then (αX t + βY t) is a martingale.

2. If (X t) and (Y t) are supermartingales, then so is (X t ∧ Y t) = (min(X t, Y t))

3. If (X t)) is a submartingale, so is

X +t , F t

.

4. If (X t) is a martingale taking values in an open set J ≤ R and

q : J → R

is convex then (q X t, F t) is a submartingale, if q (X t) is integrable for all t.

Of course, at first glance the definition of a martingale may look a bit weird. We willtherefore give a couple of examples to show that it is not as strange as expected.

Example 12.8 Let (X n) be an i.i.d. sequence of R-valued random variables. Put S n =X 1 + . . . + X n and consider the canonical filtration F n = σ (S m, m ≤ n). By Lemma 11.18 we have

E [X n+1 | S 1, . . . , S n] = E [X n+1] P-a.s.

and by part 2. of Exercise 11.7

E [X i | S 1, . . . , S n] = X i P-a.s.

for all i = 1, . . . , n. Adding these n + 1 equations gives

E [S n+1 | F n] = S n + E [X n+1] P-a.s.

If EX i = 0 for all i, then E [S n+1 | F n] = S n

i.e. (S n) is a martingale. If E [X i] ≤ 0 then

E [S n+1 | F n] ≤ S n,

i.e. (S n) is a supermartingale. In the same way (S n) is a submartingale if EX i ≥ 0.

51

Page 54: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 54/69

Example 12.9 Consider the following game. For each n ∈ N a coin with probability p for heads is tossed. If it shows heads ( X n = +1) our player receives money otherwise he ( X n = −1) looses money. The way he wins or looses is determined in the following way.Before the game starts he determines a sequence (n)n of functions

n :

H, T

n

→ R

+.

In round number n + 1 he plays for n (X 1, . . . , X n) Euros depending on how the first ngames ended. If we denote by S n his capital at time n, then

S 1 = X 1 and S n+1 = S n + n (X 1, . . . , X n) X n+1.

Hence

E [S n+1 | X 1, . . . , X n] = S n + n (X 1, . . . , X n) · E [X n+1 | X 1, . . . , X n]

= S n + n (X 1, . . . , X n) E (X n+1)

= S n + (2 p − 1) n (X 1, . . . , X n) ,

since X n+1 is independent of X 1, . . . , X n and E (X n+1) = 2 p − 1. Hence for p = 12

E [S n+1 | X 1, . . . , X n] = S n

so (S n) is a martingale while for p > 12

E [S n+1 | X 1, . . . , X n] ≥ S n,

hence (S n) is a submartingale and for p < 12

E [S n+1 | X 1, . . . X n] ≤ S n,

so (S n) is a supermartingale. This explains the idea that ”martingales are generalizations of fair games”.

Exercise 12.10 Let X 1, X 2, . . . be a sequence of independent random variables with finite variance V(X i) = σ2

i . Then ni=1(X i − E(X i))2 −n

i=1 σ2i is a martingale with respect

to the filtration F n = σ(X 1, . . . , X n).

Exercise 12.11 Consider the gambler’s martingale. Consider an i.i.d. sequence (X n)∞n=1

of Bernoulli variables with values –1 and 1, each with probability 1/2. Consider the sequence (Y n) such that Y n = 2n−1 if X 1 = · · · = X n−1 = −1, and Y n = 0 if X i = 1 for some i ≤ n−1.Show that S n =

ni=1 X iY i is a martingale. Show that S n almost surely converges and

determine its limit S ∞. Observe that S n = E(S ∞ | F n).

Example 12.12 In a sense Example 12.8 is both, a special case and a generalization of the following example. To this end let X 1, . . . , X n, . . . denote an i.i.d. sequence of R

d-valued random variables. Assume

P (X i = +k) = P (X i = −k) = 1

2d

52

Page 55: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 55/69

for all i = 1, 2, . . . and all k = 1, . . . , d. Here k denotes the k-th unit vector. Define the stochastic process S n by

S 0 = 0,

and

S n =n

i=1

X i.

This process is called a random walk in d directions. Some of its properties will be discussed below. First we will see that indeed (S n) is a martingale. Indeed,

E [S n+1 | X 1, . . . , X n] = E [X n+1] + S n = S n.

As a matter of fact, not only is (S n) a martingale, but, in a certain sense it is the discrete time martingale.

Since the random walk in d dimensions is the model for a discrete time martingale(the standard model of a continuous time martingale will be introduced in the followingsection) it is worth while studying some of its properties. This has been done in thousandsof research papers in the past 50 years. We will just mention one interesting property here,that reveals a dichotomy in the random walk’s behavior for dimensions d = 1, 2 or d ≥ 3.

Definition 12.13 Let (S n) be a stochastic process in Zd, i.e. for each n ∈ N, S n is a random variable with values in Zd. (S n) is called recurrent in a state x ∈ Zd, if

P (S n = x infinitely often in n) = 1.

It is called transient in x, if P (S n = x infinitely often in n) < 1.

(S n) is called recurrent (transient), if each x ∈ Zd is recurrent (transient).

Proposition 12.14 In the situation of Example 12.12, if x ∈ Zd is recurrent, then al l

y ∈ Zd are recurrent.

Exercise 12.15 Show proposition 12.14.

We will show a variant of the following

Theorem 12.16 The random walk (S n) introduced in Example 12.12 is recurrent in di-mensions d = 1, 2 and transient for d ≥ 3.

To prove a version of Theorem 12.16 we will first discuss the property of recurrence:

Lemma 12.17 Let f k denote probability that the random walk return to the origin after k steps for the first time, and let pk denote probability that the random walk return to the origin after k steps. Then a random walk (S n) is recurrent if and only if

k f k = 1 and

this is the case if and only if

k pk = ∞.

53

Page 56: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 56/69

Proof. The first equivalence is easy. Denote by Ωk the set of all realizations of the randomwalk returning to the origin for the first time after k steps. Then if the random walk (S n)is recurrent, with probability one there exists a k > 0 such that S k = 0 and S l = 0 for all0 < l < k. Hence

k f k = 1. On the other hand, if

k f k = 1, then P(

k Ωk) = 1. Hence

with probability one there exists a k > 0 such that S k = 0 and S l = 0 for all 0 < l < k.But then the situation at times 0 and k is completely the same and hence there existsk > k such that S k = 0 and S l = 0 for all k < l < k. Iterating this gives that S k = 0 forinfinitely many k’s with probability one.

In order to relate f k and pk we derive the following recursion

pk = f k + f k−1 p1 + · · · + f 0 pk (12.3)

(the last summand ist just added for completeness, we have f 0 = 0). Indeed this is againeasy to see. The left hand side is just the probability to be at the origin at time k. Thisevent is the disjoint uinion of the events to be at 0 for the first time after 1 ≤ l ≤ k stepsand to walk from zero to zero in the remaining steps. Hence we obtain.

pk =k

i=1

f i pk−i and p0 = 1. (12.4)

Define the generating functions

F (z ) =k≥0

f kz k and P (z ) =k≥0

pkz k.

Multiplying the left and right sides in (12.4) with z k and summing them from k = 0 toinfinity gives

P (z ) = 1 + P (z )F (z )i.e.

F (z ) = 1 − 1/P (z ).

By Abel’s theorem∞k=1

f k = F (1) = limz↑1

F (z ) = 1 − limz↑1

1

P (z ).

First assume that

k pk < ∞. Then

limz

↑1

P (z ) = P (1) = k

pk < ∞

and thus

limz↑1

1

P (z ) = 1/

k

pk > 0.

Hence∞

k=1 f k < 1 and the random walk (S n) is transient.Next assume that

k pk = ∞ and fix ε > 0. Then we find N such that

N k=0

pk ≥ 2

ε.

54

Page 57: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 57/69

Then for z sufficiently close to one we haveN

k=0 pkz k ≥ 1ε

and consequently for such z

1

P (z ) ≤ 1N

k=0 pkz k≤ ε.

But this implies that

limz↑1

1P (z )

= 1/k

pk = 0

and therefore∞

k=1 f k = 1 and the random walk (S n) is transient.

Exercise 12.18 What has the Borel-Cantelli Lemma 8.1 to say in the above situation? Don’t overlook that the events S n = x ( n ∈ N) may be dependent.

We will now apply this criterion to analyze recurrence and transience for a random walksimilar to the one defined in Example 12.12.

To this end define the following random walk (Rn) in d dimensions. For k

∈ N let

Y k1 , . . . , Y kd be i.i.d. random variables, taking values in −1, +1 with P(Y k1 = 1) = P(Y k1 =−1) = 1/2. Let X k be the random vector X k = (Y k1 , . . . , Y kd ). Define R0 ≡ 0 and for k ≥ 1

Rn =n

k=1

X k.

Theorem 12.19 The random walk (Rn) defined above is recurrent in dimensions d = 1, 2and transient for d ≥ 3.

Proof. Consider a sequence of i.i.d. random variables (Z k) taking values in −1, +1 with

P(Z k = 1) = P(Z k = −1) = 1/2. Write q k = P(2k

i=1 Z i = 0). Then we apply Stirling’sformula

limn→∞

n!/(√

2πnn+1/2e−n) = 1.

to obtain

q k =

2k

k

2−2k =

2k!

k!k!2−2k

∼√

4πk2ke

2k2πk

ke

2k 2−2k

=

1πk

.

Hence the probability of a single coordinate of R2n to be zero (Rn cannot be zero if n is

odd) asymptotically behaves like

1πn

. Hence

P(Rn = 0) ∼

1

πn

d2

.

55

Page 58: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 58/69

But n

1

πn

d2

= ∞

for d = 1 and d = 2, while

n

1

πn d

2

< ∞for d ≥ 3. This proves the theorem.

We will give two results about martingales. The first one is inspired by the fact thatgiven a random variable M and a filtration F t of F , the family of conditional expectations,E(M | F t), yields a martingale. Can a martingale always be described in this way? Thatis the content of the Martingale Limit Theorem.

Theorem 12.20 Suppose M t is a martingale with respect to the filtration

F t and that

the martingale is (uniformly) square integrable, that is lim supt E(M 2t ) < ∞. Then there is a square integrable random variable M ∞ such that M t = E(M ∞ | F t) a.s.. Moreover lim M t = M ∞ in L2 sense.

Proof. The basic property that we will use, is that the space of square integrable randomvariables L2(Ω, F , P) with the L2 inner product is a complete vectorspace. In other words,a Cauchy sequence converges. Recall that for any t < s it holds that M t = E(M s | F t), andwe have seen in Theorem 11.13 that then M s − M t is perpendicular to M t. In particularwe have the Pythagoras formula

E(M 2s ) = E(M 2t ) + E((M s

−M t)

2).

This implies that E(M 2s ) is increasing in s, and therefore its limit exists and equalslim sup E(M 2t ) which is finite. Therefore given ε > 0 there is a u such that for t > s > u wehave E((M s−M t)

2) < ε. That means that M ss is a Cauchy sequence. Let M ∞ be its limit.In particular M ∞ is a random variable. Since orthogonal projection onto the suspace of F tmeasurable functions is a continuous map, it holds that E(M ∞ | F t) = lim E(M s | F t) = M t.

With some extra effort one may show that M ∞ is also the limit in the sense of almostsure convergence. The Martingale Limit Theorem is valid under more general circum-stances, for example it is sufficient that only lim supt E(

|M t

|) <

∞, in which case M

∞ is

the limit in L1 sense (as well as almost surely).An important concept for random processes is the concept of a stopping time.

Definition 12.21 A stopping time is a random variable τ : Ω → I ∪ ∞, such that for all t ∈ I , ω ; τ (ω) ≤ t is F t measurable. Here I ∪∞ is ordered such that t < ∞ for all t ∈ I . Given a process M t, the stopped process M τ ∧t is given by M τ ∧t(ω) = M s(ω) where s = τ ∧ t = min(τ, t).

Example 12.22 Given A ∈ F u a stopping time is constructed by τ = ∞ · 1Ac + u · 1A, that is, τ A(ω) = ∞ if ω /∈ A, and τ (ω) = u if ω ∈ A.

56

Page 59: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 59/69

Exercise 12.23 If T : Ω → R is constant, then T is a stopping time. If S and T are stopping times then max(S, T ) and min(S, T ) are stopping times.

A very important property of martingales is the following Martingale Stopping Theorem.

Theorem 12.24 Let

M t

t be a martingale, and τ a stopping time. Then the stopped

process M τ ∧tt is a martingale.

Proof. It is easy to see that an adapted process stopped at a stopping time is again anadapted process. We will give a proof of the martingale property for the simple stoppingtime τ A given above. If s > t ≥ u, let B ∈ F t, then

B

E(M τ ∧s | F t)dP =

B

M τ ∧sdP =

B∩A

M τ ∧s +

B∩Ac

M τ ∧s

=

B∩A

M u +

B∩Ac

M s =

B∩A

M u +

B∩Ac

E(M s | F t)

= B∩A M u +

B∩Ac M t =

B M τ ∧t.

If u ≥ t and s > t, then one can apply Theorem 11.12

E(M τ ∧s | F t) = E(E(M τ ∧s | F u) | F t) = E(M u∧s | F t) = M t = M τ ∧t.

Exercise 12.25 Modify the proof of Theorem 12.24 to show that a stopped supermartingale is a supermartingale.

Exercise 12.26 Consider the roulette game. There are several possibilities for a bet, given by a number p ∈ (0, 1) such that with probability 36/37 · p the return is p−1 times the stake and the return is zero with probability 1−36/37· p. The probabilities p are such that p−1 ∈ N.Suppose you start with an initial fortune X 0 ∈ N, and perform a sequence of bets until this fortune is reduced to zero. We are interested in the expected value of the total sum of stakes. To determine this consider the sequence of subsequent fortunes X i, and consider the sequence of stakes Y i, meaning that the stake in bet i is Y i = Y i(X 1, . . . , X i−1) ( Y i ≤ X i−1).In particular, if for this stake the probability p is chosen, either X i = X i−1 − Y i + p−1 · Y i(with probability 36/37 · p) or X i = X i−1 − Y i (with probability 1 − 36/37 · p). Show that (X i + 1/37 ·

i j=1 Y j)i is a martingale with respect to the filtration F i = σ(X 1, . . . , X i). The

stopping time N is the first time i such that X i = 0. Show that E

(N

j=1 Y j) = 37 · X 0.

13 Brownian motion

In this section we will construct the continuous time martingale, Brownian motion. Besidesthis, Brownian motion is also a building block of stochastic calculus and stochastic analysis.

In stochastic analysis one studies random functions of one variable and various kindsof integrals and derivatives thereof. The argument of these functions is usually interpretedas ‘time’, so the functions themselves can be thought of as the path of a random process.

57

Page 60: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 60/69

Here, like in other areas of mathematics, going from the discrete to the continuousyields a pay-off in simplicity and smoothness, at the price of a formally more complicatedanalysis. Compare, to make an analogy, the integral

n0

x3dx with the sumn

k=1 k3. Theintegral requires a more refined analysis for its definition and its properties, but once thishas been done the integral is easier to calculate. Similarly, in stochastic analysis you willbecome acquainted with a convenient differential calculus as a reward for some hard workin analysis.

Stochastic analysis can be applied in a wide variety of situations. We sketch a fewexamples below.

1. Some differential equations become more realistic when we allow some randomnessin their coefficients. Consider for example the following growth equation , used amongother places in population biology:

d

dtS t = (r + “N t”)S t. (13.1)

Here, S t is the size of the population at time t, r is the average growth rate of thepopulation, and the “noise” N t models random fluctuations in the growth rate.

2. At time t = 0 an investor buys stocks and bonds on the financial market, i.e., hedivides his initial capital C 0 into A0 shares of stock and B0 shares of bonds. Thebonds will yield a guaranteed interest rate r. If we assume that the stock price S tsatisfies the growth equation (13.1), then his capital C t at time t is

C t = AtS t + Btert, (13.2)

where At and Bt are the amounts of stocks and bonds held at time t. With a keen eyeon the market the investor sells stocks to buy bonds and vice versa. If his tradingsare ‘self-financing’, then dC t = AtdS t + Btd(ert). An interesting question is:

- What would he be prepared to pay for a so-called European call option , i.e.,the right (bought at time 0) to purchase at time T > 0 a share of stock at apredetermined price K ?

The rational answer, q say, was found by Black and Scholes (1973) through an analysisof the possible strategies leading from an initial investment q to a payoff C T . Theirformula is being used on the stock markets all over the world.

3. The Langevin equation describes the behaviour of a dust particle suspended in a fluid:

m d

dtV t = −ηV t + “N t”. (13.3)

Here, V t is the velocity at time t of the dust particle, the friction exerted on theparticle due to the viscosity η of the fluid is −ηV t, and the “noise” N t stands for thedisturbance due to the thermal motion of the surrounding fluid molecules collidingwith the particle.

58

Page 61: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 61/69

4. The path of the dust particle in example 3 is observed with some inaccuracy. Onemeasures the perturbed signal Z (t) given by

Z t = V t + “ N t”. (13.4)

Here N t is again a “noise”. One is interested in the best guess for the actual value of

V t, given the observation Z s for 0 ≤ s ≤ t. This is called a filtering problem : how tofilter away the noise N t. Kalman and Bucy (1961) found a linear algorithm, whichwas almost immediately applied in aerospace engineering. Filtering theory is now aflourishing and extremely useful discipline.

5. Stochastic analysis can help solve boundary value problems such as the Dirichletproblem. If the value of a harmonic function f on the boundary of some boundedregular region D ⊂ R

n is known, then one can express the value of f in the interiorof D as follows:

E (f (Bxτ )) = f (x), (13.5)

where Bxt := x + t0 N tdt is an “integrated noise” or Brownian motion , starting at x,

and τ denotes the time when this Brownian motion first reaches the boundary. (Aharmonic function f is a function satisfying ∆f = 0 with ∆ the Laplacian.)

The goal of the course Stochastic Analysis is to make sense of the above equations, and towork with them.

In all the above examples the unexplained symbol N t occurs, which is to be thought of as a “completely random” function of t, in other words, the continuous time analogue of a sequence of independent identically distributed random variables. In a first attempt tocatch this concept, let us formulate the following requirements:

1. N t is independent of N s for t = s;

2. The random variables N t (t ≥ 0) all have the same probability distribution µ;

3. E (N t) = 0.

However, when taken literally these requirements do not produce what we want. This isseen by the following argument. By requirement 1 we have for every point in time anindependent value of N t. We shall show that such a “continuous i.i.d. sequence” N t is not

measurable in t, unless it is identically 0.Let µ denote the probability distribution of N t, which by requirement 2 does not depend

on t, i.e., µ([a, b]) := P[a ≤ N t ≤ b]. Divide R into two half lines, one extending from a to−∞ and the other extending from a to ∞. If N t is not a constant function of t, then theremust be a value of a such that each of the half lines has positive measure. So

p := P(N t ≤ a) = µ ((−∞, a]) ∈ (0, 1). (13.6)

Now consider the set of time points where the noise N t is low: E := t ≥ 0 : N t ≤ a .It can be shown that with probability 1 the set E is not Lebesgue measurable. Without

59

Page 62: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 62/69

giving a full proof we can understand this as follows. Let λ denote the Lebesgue measureon R. If E were measurable, then by requirement 1 and Eq. (13.6) it would be reasonableto expect its relative share in any interval (c, d) to be p, i.e.,

λ (E ∩ (c, d)) = p (d − c) . (13.7)

On the other hand, it is known from measure theory that every measurable set E is ar-bitrarily thick somewhere with respect to the Lebesgue measure λ, i.e., for all α < 1 aninterval (c, d) can be found such that

λ (E ∩ (c, d)) > α(d − c)

(cf. Halmos (1974) Th. III.16.A). This clearly contradicts Eq. (13.7), so E is not measurable.This is a bad property of N t: for, in view of (13.1), (13.3), (13.4) and (13.5), we would liketo integrate N t.

For this reason, let us approach the problem from another angle. Instead of N t, let usconsider the integral of N t, and give it a name:

Bt :=

t0

N sds.

The three requirements on the evasive object N t then translate into three quite sensiblerequirements for Bt.

BM1. For 0 = t0 ≤ t1 ≤ · · · ≤ tn the random variables Btj+1 − Btj ( j = 0, . . . , n − 1) areindependent;

BM2. Bt has stationary increments, i.e., the joint probability distribution of

(Bt1+s − Bu1+s, Bt2+s − Bu2+s, . . . , Btn+s − Bun+s)

does not depend on s ≥ 0, where ti > ui for i = 1, 2, · · · , n are arbitrary.

BM3. E (Bt − B0) = 0 for all t.

We add a normalisation:

BM4. B0 = 0 and E(B21) = 1.

Still, these four requirements do not determine Bt. For example, the compensated Poisson jump process also satisfies them. Our fifth requirement fixes the process Bt uniquely:

BM5. t → Bt continuous a.s.

The object Bt so defined is called the Wiener process , or (by a slight abuse of physicalterminology) Brownian motion . In the next section we shall give a rigorous and explicitconstruction of this process.

Before we go into details we remark the following

60

Page 63: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 63/69

Exercise 13.1 Show that BM5 , together with BM1 and BM2 , implies the following:For any ε > 0

nP (|Bt+ 1n

− Bt| > ε) → 0 (13.8)

as n → ∞. Hint: compare with inequality (8.6).

Exercise 13.1 helps us to specify the increments of Brownian motion in the following way2.

Exercise 13.2 Suppose BM1, BM2 , BM4 and (13.8) hold. Apply the Central Limit Theorem (Lindeberg’s condition, page 36) to

X n,k := B ktn

− B (k−1)tn

and conclude that Bs+t − Bs, t > 0 has a normal distribution with variance t, i.e.

P (Bs+t − Bs ∈ A) = 1√

2πt

A

e−x2

2t dx.

As a matter of fact, BM1 and BM5 already imply that the increments Bs+t − Bs arenormally distributed3.

BM 2’. If s ≥ 0 and t > 0, then

P (Bs+t − Bs ∈ A) = 1√

2πt

A

e−x2

2t dx.

we can now define Brownian motion as follows

Definition 13.3 A one-dimensional Brownian motion is a real-valued process Bt, t ≥ 0with the properties BM1, BM2’ , and BM5 .

13.1 Construction of Brownian Motion

Whenever a stochastic process with certain porperties is defined, the most natural questionto ask is, does such a process exist? Of course, the answer is yes, otherwise these lecturenotes would not have been written.

In this section we shall construct Brownian motion on [0, T ]. For the sake of simplicity

we will take T = 1, the construction for general T can be carried out along the same lines,or, by just concatenating independent Brownian motions.The construction we shall use was given by P. Levy in 1948. Since we saw that the

increments of Brownian motion are independent Gaussian random variables, the idea is toconstruct Brownian motion from these Gaussian increments.

2See R. Durrett (1991), Probability: Theory and Examples , Section 7.1, Exercise 1.1, p. 334. Unfortu-nately, there is something wrong with this exercise. See the 3rd edition (2005) for a correct treatment.

3See e.g. I. Gihman, A. Skorohod, The Theory of Stochastic Processes I , Ch. III, § 5, Theorem 5, p. 189.For a high-tech approach, see N. Ikeda, S. Watanabe, Stochastic Differential Equations and Diffusion

Processes , Ch. II, Theorem 6.1, p. 74.

61

Page 64: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 64/69

More precisely, we start with the following observation. Suppose we already had con-structed Brownian motion, say (Bt)0≤t≤T . Take two times 0 ≤ s < t ≤ T , put θ := s+t

2 ,

and let

p(τ , x , y) := 1√

2πτ e−(y−x)2/2τ , τ > 0, x , y, ∈ R

be the Gaussian kernel centered in x with variance τ . Then, conditioned on Bs

= x andBt = z , the random variable Bθ is normal with mean µ := x+z

2 and variance σ2 := t−s

4 .

Indeed, since Bs,Bθ − Bs, and Bt − Bθ are independent we obtain

P [Bs ∈ dx, Bθ ∈ dy, Bt ∈ dz ] = p(s, 0, x) p(t − s

2 , x , y) p(

t − s

2 , y , z )dxdydz

= p(s, 0, x) p(t − s,x,z ) · 1

σ√

2πe−

(y−µ)2

2σ2 dxdydz

(which is just a bit of algebra). Dividing by

P [Bs

∈ dx, Bt

∈ dz ] = p(s, 0, x) p(t

−s,x,z )dxdz

we obtain

P [Bθ ∈ dy |Bs ∈ dx, Bt ∈ dz ] = 1

σ√

2πe−

(y−µ)2

2σ2 dy,

which is our claim.

This suggests that we might be able to construct Brownian motion on [0 , 1] by interpo-lation.

To carry out this program, we begin with a sequence ξ (n)k , k ∈ I (n), n ∈ N0 of inde-

pendent, standard normal random variables on some probability space (Ω, F , P ). Here

I (n) := k ∈ N, k ≤ 2n, k = 2l + 1 for some l ∈ Ndenotes the set of odd, positive integers less than 2n. For each n ∈ N0 we define a processB(n) := B

(n)t : 0 ≤ t ≤ 1 by recursion and linear interpolation of the preceeding process,

as follows. For n ∈ N, B(n)k/2n−1 will agree with B

(n−1)k/2n−1 , for all k = 0, 1, . . . , 2n−1. Thus for

each n we only need to specify the values of B(n)k/2n for k ∈ I (n). We start with

B(0)0 = 0 and B(1)

1 = ξ (0)1 .

If the values of B(n−1)k/2n−1, k = 0, 1 . . . 2n−1 have been defined (an thus B(n−1)

t , k/2n−1

≤ t

≤(k + 1)/2n−1 is the linear interpolation between B(n−1)k/2n−1 and B(n−1)

(k+1)/2n−1) and k ∈ I (n), we

denote s = (k − 1)/2n, t = (k + 1)/2n, µ = 12

(B(n−1)s + B

(n−1)t ) and σ2 = t−s

4 = 2−n+1 and

set in accordance with the above observations

B(n)k/2n := B

(n)(t+s)/2 := µ + σξ

(n)k .

We shall show that, almost surely, B(n)t converges uniformly in t to a continuous function

Bt (as n → ∞) and that Bt is a Brownian motion.

62

Page 65: Probability Theory(MATHIAS LOWE)

8/13/2019 Probability Theory(MATHIAS LOWE)

http://slidepdf.com/reader/full/probability-theorymathias-lowe 65/69

We start with giving a more convenient representation of the processes B (n), n = 0, 1, . . ..We define the following Haar functions by H 01(t) ≡ 1, and for n ∈ N, k ∈ I (n)

H (n)k (t) :=

2(n−1)/2, k−12n

≤ t < k2n

−2(n−1)/2, k2n

≤ t < k+12n

0 otherwise.

The Schauder functions are defined by

S (n)k (t) :=

t0

H (n)k (u)du, 0 ≤ t ≤ 1, n ∈ N0, k ∈ I (n).

Note that S (0)1 (t) = t, and that for n ≥ 1 the graphs of S

(n)k are little tents of height

2−(n+1)/2 centered at k/2n and non overlapping for different values of k ∈ I (n). Clearly,

B(0)t = ξ

(0)1 S

(0)1 (t), and by induction on n, it is readily verified that

B(n)

t

(ω) =n

m=0

k∈I (m)

ξ (m)

k

(ω)S (m)

k

(t), 0 ≤

t ≤

1, n ∈

N. (13.9)

Lemma 13.4 As n → ∞, the sequence of functions B(n)t (ω), 0 ≤ t ≤ 1, n ∈ N 0, given

by (13.9) converges uniformly in t to a continuous function Bt(ω), 0 ≤ t ≤ 1 for almost every ω ∈ Ω.

Proof. Let $b_n := \max_{k \in I(n)} |\xi^{(n)}_k|$. Observe that for $x > 0$ and each $n, k$
\[
P(|\xi^{(n)}_k| > x) = \sqrt{\frac{2}{\pi}} \int_x^\infty e^{-u^2/2}\,du
\le \sqrt{\frac{2}{\pi}} \int_x^\infty \frac{u}{x}\, e^{-u^2/2}\,du
= \sqrt{\frac{2}{\pi}}\, \frac{1}{x}\, e^{-x^2/2},
\]
which gives
\[
P(b_n > n) = P\Big(\bigcup_{k \in I(n)} \{|\xi^{(n)}_k| > n\}\Big) \le 2^n P(|\xi^{(n)}_1| > n) \le \sqrt{\frac{2}{\pi}}\, \frac{2^n}{n}\, e^{-n^2/2}
\]
for all $n \in \mathbb{N}$. Since
\[
\sum_n \sqrt{\frac{2}{\pi}}\, \frac{2^n}{n}\, e^{-n^2/2} < \infty,
\]
the Borel-Cantelli Lemma implies that there is a set $\tilde\Omega$ with $P(\tilde\Omega) = 1$ such that for $\omega \in \tilde\Omega$ there is an $n_0(\omega)$ such that for all $n \ge n_0(\omega)$ it holds true that $b_n(\omega) \le n$. But then
\[
\sum_{n \ge n_0(\omega)} \sum_{k \in I(n)} \big|\xi^{(n)}_k(\omega) S^{(n)}_k(t)\big| \le \sum_{n \ge n_0(\omega)} n\, 2^{-(n+1)/2} < \infty;
\]
so for $\omega \in \tilde\Omega$, $B^{(n)}_t(\omega)$ converges uniformly in $t$ to a limit $B_t$. The uniformity of the convergence implies the continuity of the limit $B_t$.

The following exercise facilitates the construction of Brownian motion substantially:


Exercise 13.5 Check the following in a textbook of functional analysis:

The inner product
\[
\langle f, g \rangle := \int_0^1 f(t) g(t)\,dt
\]
turns $L^2[0,1]$ into a Hilbert space, and the Haar functions $\{H^{(n)}_k;\ k \in I(n),\ n \in \mathbb{N}_0\}$ form a complete, orthonormal system. Thus the Parseval equality
\[
\langle f, g \rangle = \sum_{n=0}^{\infty} \sum_{k \in I(n)} \langle f, H^{(n)}_k \rangle \langle g, H^{(n)}_k \rangle \tag{13.10}
\]
holds true.

Applying (13.10) to $f = 1_{[0,t]}$ and $g = 1_{[0,s]}$ yields
\[
\sum_{n=0}^{\infty} \sum_{k \in I(n)} S^{(n)}_k(t) S^{(n)}_k(s) = s \wedge t, \tag{13.11}
\]
since $\langle 1_{[0,t]}, H^{(n)}_k \rangle = S^{(n)}_k(t)$ and $\langle 1_{[0,t]}, 1_{[0,s]} \rangle = s \wedge t$.
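As a quick sanity check (purely illustrative; the helper below is a scalar re-implementation of the hypothetical `schauder` function sketched earlier), one can truncate the series (13.11) and compare it with $s \wedge t$:

```python
import numpy as np

def schauder(n, k, t):
    """S_k^(n)(t) as defined above: the tent of height 2**(-(n+1)/2)
    centred at k/2**n with base width 2**(1-n); S_1^(0)(t) = t."""
    if n == 0:
        return t
    height = 2.0 ** (-(n + 1) / 2.0)
    centre, half = k / 2**n, 1.0 / 2**n
    return max(0.0, height * (1.0 - abs(t - centre) / half))

def truncated_kernel(s, t, n_max=12):
    """Partial sum of (13.11); the tail is bounded by sum_{n>n_max} 2**-(n+1)."""
    total = schauder(0, 1, s) * schauder(0, 1, t)
    for n in range(1, n_max + 1):
        for k in range(1, 2**n + 1, 2):
            total += schauder(n, k, s) * schauder(n, k, t)
    return total

print(truncated_kernel(0.3, 0.7))   # approximately 0.3 = s ∧ t
```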

Now we are able to prove

Theorem 13.6 With the above notations
\[
B_t := \lim_{n \to \infty} B^{(n)}_t
\]
is a Brownian motion on $[0,1]$.

Proof. In view of our definition of Brownian motion it suffices to prove that for $0 = t_0 < t_1 < \ldots < t_n \le 1$, the increments $(B_{t_j} - B_{t_{j-1}})_{j=1,\ldots,n}$ are independent, normally distributed with mean zero and variance $(t_j - t_{j-1})$. For this we will show that the Fourier transforms satisfy the appropriate condition, namely that for $\lambda_j \in \mathbb{R}$ (and as usual $i := \sqrt{-1}$)
\[
E\Big[\exp\Big(i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}})\Big)\Big] = \prod_{j=1}^n \exp\Big(-\frac{1}{2}\lambda_j^2 (t_j - t_{j-1})\Big). \tag{13.12}
\]

To derive (13.12) it is most natural to exploit the construction of $B_t$ from Gaussian random variables. Set $\lambda_{n+1} = 0$ and use the independence and normality of the $\xi^{(m)}_k$ to compute for $M \in \mathbb{N}$
\begin{align*}
E\Big[\exp\Big(-i \sum_{j=1}^n (\lambda_{j+1} - \lambda_j) B^{(M)}_{t_j}\Big)\Big]
&= E\Big[\exp\Big(-i \sum_{m=0}^M \sum_{k \in I(m)} \xi^{(m)}_k \sum_{j=1}^n (\lambda_{j+1} - \lambda_j) S^{(m)}_k(t_j)\Big)\Big] \\
&= \prod_{m=0}^M \prod_{k \in I(m)} E\Big[\exp\Big(-i\, \xi^{(m)}_k \sum_{j=1}^n (\lambda_{j+1} - \lambda_j) S^{(m)}_k(t_j)\Big)\Big] \\
&= \prod_{m=0}^M \prod_{k \in I(m)} \exp\Big(-\frac{1}{2}\Big(\sum_{j=1}^n (\lambda_{j+1} - \lambda_j) S^{(m)}_k(t_j)\Big)^2\Big) \\
&= \exp\Big(-\frac{1}{2} \sum_{j=1}^n \sum_{l=1}^n (\lambda_{j+1} - \lambda_j)(\lambda_{l+1} - \lambda_l) \sum_{m=0}^M \sum_{k \in I(m)} S^{(m)}_k(t_j) S^{(m)}_k(t_l)\Big).
\end{align*}

Now we send $M \to \infty$ and apply (13.11) to obtain
\begin{align*}
E\Big[\exp\Big(i \sum_{j=1}^n \lambda_j (B_{t_j} - B_{t_{j-1}})\Big)\Big]
&= E\Big[\exp\Big(-i \sum_{j=1}^n (\lambda_{j+1} - \lambda_j) B_{t_j}\Big)\Big] \\
&= \exp\Big(-\sum_{j=1}^{n-1} \sum_{l=j+1}^{n} (\lambda_{j+1} - \lambda_j)(\lambda_{l+1} - \lambda_l)\, t_j - \frac{1}{2} \sum_{j=1}^{n} (\lambda_{j+1} - \lambda_j)^2\, t_j\Big) \\
&= \exp\Big(-\sum_{j=1}^{n-1} (\lambda_{j+1} - \lambda_j)(-\lambda_{j+1})\, t_j - \frac{1}{2} \sum_{j=1}^{n} (\lambda_{j+1} - \lambda_j)^2\, t_j\Big) \\
&= \exp\Big(\frac{1}{2} \sum_{j=1}^{n-1} (\lambda_{j+1}^2 - \lambda_j^2)\, t_j - \frac{1}{2} \lambda_n^2\, t_n\Big) \\
&= \prod_{j=1}^n \exp\Big(-\frac{1}{2} \lambda_j^2 (t_j - t_{j-1})\Big).
\end{align*}
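The statement of Theorem 13.6 can also be checked empirically on the approximations $B^{(n)}$: with many simulated paths, the increments should be centred, have variance $t_j - t_{j-1}$, and be uncorrelated. A minimal Monte Carlo sketch (illustrative only; the path generator is a vectorised variant of the hypothetical `levy_construction` above):

```python
import numpy as np

def dyadic_paths(n_levels, n_paths, rng):
    """Simulate n_paths approximations B^(n_levels) on the dyadic grid,
    using the recursive midpoint construction from this section."""
    values = np.zeros((n_paths, 2))
    values[:, 1] = rng.standard_normal(n_paths)           # B_1^(0) = xi_1^(0)
    for n in range(1, n_levels + 1):
        new = np.empty((n_paths, 2**n + 1))
        new[:, 0::2] = values
        mids = 0.5 * (values[:, :-1] + values[:, 1:])
        sigma = 2.0 ** (-(n + 1) / 2.0)
        new[:, 1::2] = mids + sigma * rng.standard_normal((n_paths, 2**(n - 1)))
        values = new
    return values

rng = np.random.default_rng(1)
paths = dyadic_paths(8, 20000, rng)             # grid t_k = k/256
grid = np.linspace(0.0, 1.0, paths.shape[1])
i, j = 64, 192                                   # t_i = 0.25, t_j = 0.75
inc1 = paths[:, i] - paths[:, 0]
inc2 = paths[:, j] - paths[:, i]
print(inc1.var(), grid[i])                       # ~0.25: variance t_i - t_0
print(inc2.var(), grid[j] - grid[i])             # ~0.5 : variance t_j - t_i
print(np.corrcoef(inc1, inc2)[0, 1])             # ~0   : uncorrelated increments
```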


14 Appendix

Let $\Omega$ be a set and $\mathcal{P}(\Omega)$ the collection of subsets of $\Omega$.

Definition 14.1 A system of sets $\mathcal{R} \subset \mathcal{P}(\Omega)$ is called a ring if it satisfies
\[
\emptyset \in \mathcal{R}, \qquad
A, B \in \mathcal{R} \Rightarrow A \setminus B \in \mathcal{R}, \qquad
A, B \in \mathcal{R} \Rightarrow A \cup B \in \mathcal{R}.
\]
If additionally $\Omega \in \mathcal{R}$, then $\mathcal{R}$ is called an algebra.

Note that for $A, B \subset \Omega$ their intersection satisfies $A \cap B = A \setminus (A \setminus B)$, so a ring is also closed under finite intersections.

Definition 14.2 A system $\mathcal{D} \subset \mathcal{P}(\Omega)$ is called a Dynkin system if it satisfies
\[
\Omega \in \mathcal{D}, \qquad D \in \mathcal{D} \Rightarrow D^c \in \mathcal{D},
\]
and for every sequence $(D_n)_{n \in \mathbb{N}}$ of pairwise disjoint sets $D_n \in \mathcal{D}$, their union $\bigcup_n D_n$ is also in $\mathcal{D}$.
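For finitely many sets these definitions can be checked mechanically. The following Python toy example (the function names and the particular family are mine, purely for illustration) tests the ring and algebra conditions of Definition 14.1 on a small ground set:

```python
def is_ring(family):
    """Check Definition 14.1: family is a collection of frozensets."""
    fam = set(family)
    if frozenset() not in fam:
        return False
    return all(A - B in fam and A | B in fam for A in fam for B in fam)

def is_algebra(omega, family):
    """An algebra is a ring that additionally contains omega itself."""
    return is_ring(family) and frozenset(omega) in set(family)

omega = {1, 2, 3}
family = {frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})}
print(is_ring(family))            # True: contains {} and is closed under \ and union
print(is_algebra(omega, family))  # False: omega = {1, 2, 3} is missing
```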

The following theorem holds:

Theorem 14.3 A Dynkin system $\mathcal{D}$ is a $\sigma$-algebra if and only if for any two $A, B \in \mathcal{D}$ we have $A \cap B \in \mathcal{D}$.

Similar to the case of $\sigma$-algebras, for every system of sets $\mathcal{E} \subset \mathcal{P}(\Omega)$ there is a smallest Dynkin system $\mathcal{D}(\mathcal{E})$ generated by (and containing) $\mathcal{E}$. The importance of Dynkin systems is mainly due to the following

Theorem 14.4 For every $\mathcal{E} \subset \mathcal{P}(\Omega)$ with
\[
A, B \in \mathcal{E} \Rightarrow A \cap B \in \mathcal{E}
\]
we have $\mathcal{D}(\mathcal{E}) = \sigma(\mathcal{E})$.

Definition 14.5 Let $\mathcal{R}$ be a ring. A function
\[
\mu : \mathcal{R} \to [0, \infty]
\]
is called a volume, if it satisfies $\mu(\emptyset) = 0$ and
\[
\mu\big(\cup_{i=1}^n A_i\big) = \sum_{i=1}^n \mu(A_i) \tag{14.1}
\]
for all pairwise disjoint sets $A_1, \ldots, A_n \in \mathcal{R}$ and all $n \in \mathbb{N}$. A volume $\mu$ is called a pre-measure if
\[
\mu\big(\cup_{i=1}^\infty A_i\big) = \sum_{i=1}^\infty \mu(A_i) \tag{14.2}
\]
for every pairwise disjoint sequence of sets $(A_i)_{i \in \mathbb{N}}$ in $\mathcal{R}$ whose union $\cup_{i=1}^\infty A_i$ again belongs to $\mathcal{R}$. We will call (14.1) finite additivity and (14.2) $\sigma$-additivity.

A pre-measure $\mu$ on a $\sigma$-algebra $\mathcal{A}$ is called a measure.

Theorem 14.6 Let $\mathcal{R}$ be a ring and $\mu$ be a volume on $\mathcal{R}$. If $\mu$ is a pre-measure, then it is $\emptyset$-continuous, i.e. for all $(A_n)_n$, $A_n \in \mathcal{R}$, with $\mu(A_n) < \infty$ and $A_n \downarrow \emptyset$ it holds that $\lim_{n\to\infty} \mu(A_n) = 0$. If $\mathcal{R}$ is an algebra and $\mu(\Omega) < \infty$, then the reverse also holds: an $\emptyset$-continuous volume is a pre-measure.

Theorem 14.7 (Carathéodory) For every pre-measure $\mu$ on a ring $\mathcal{R}$ over $\Omega$ there is at least one way to extend $\mu$ to a measure on $\sigma(\mathcal{R})$.

In the case that $\mathcal{R}$ is an algebra and $\mu$ is $\sigma$-finite (i.e. $\Omega$ is the countable union of subsets of finite measure), this extension is unique.
