+ All Categories
Home > Documents > Probability Book

Probability Book

Date post: 15-Dec-2015
Category:
Upload: zjedwin
View: 8 times
Download: 0 times
Share this document with a friend
Description:
Italian
241
AN INTRODUCTION TO MEASURE THEORY AND PROBABILITY Luigi Ambrosio, Giuseppe Da Prato, Andrea Mennucci
Transcript
Page 1: Probability Book

AN INTRODUCTION TO MEASURETHEORY AND PROBABILITY

Luigi Ambrosio, Giuseppe Da Prato, Andrea Mennucci

Page 2: Probability Book

Introduction

This course consists of two parts. The first one is an introduction to the moderntheories of measure and of integration. Historically, this has been motivated by thenecessity to go beyond the classical theory of Riemann’s integration, usually taughtin elementary Calculus courses on the real line. It is therefore useful to describe thereasons that motivate this extension.

(1) It is not possible to characterize the class of Riemann’s integrable function withinthe Riemann theory. This is indeed possible within the stronger theory, due essen-tially to Lebesgue, that we are going to introduce.

(2) The extensions of Riemann’s theory to multiple integrals are very cumbersome.This extension, useful to compute areas, volumes, etc., is known as Peano–Jordantheory, and it is sometimes taught in elementary courses of integration in more thanone variable. In addition to that, important heuristic principles like Cavalieri’s onecan be proved only under technical and basically unnecessary regularity assumptionson the domains of integration.

(3) Many constructive processes typical of Analysis (limits, series, integrals depend-ing on a parameter, etc.) cannot be handled well within Riemann’s theory of inte-gration. For instance, the following statement is true (it is a particular case of theso-called dominated convergence theorem):

Theorem. Let fh : [−1, 1] → R be continuous functions pointwise converging to acontinuous function f . Assume the existence of a constant M satisfying |fh(x)| ≤Mfor all x ∈ [−1, 1] and all h ∈ N. Then

limh→∞

∫ 1

−1

fh(x) dx =

∫ 1

−1

f(x) dx.

Even though this statement makes perfectly sense within Riemann’s theory, anyattempt to prove this result within the theory (try, if you don’t believe!) seems tofail, and leads to a larger theory. In addition to that, the continuity assumption onthe limit function f is not natural, because a pointwise limit of continuous functionsneed not be continuous, and we would like to give a sense to

∫ 1

−1f(x) dx even without

i

Page 3: Probability Book

this assumption. This necessity emerges for instance in the study of the convergenceof Fourier series

f(x) =∞∑i=0

ai cos(iπx) + bi sin(iπx) x ∈ [−1, 1].

In this case the uniform convergence of the series, which implies the continuity off as well, is ensured by the condition

∑i |ai| + |bi| < ∞. On the other hand, we

will see that the “natural” condition for the convergence (in a suitable sense) of theseries is much weaker:

∞∑i=0

a2i + b2i <∞.

Under this condition the limit function f need not be continuous: for instance, iff(x) = 1 for x ∈ [−1/2, 1/2] and f(x) = 0 otherwise, then we will see that thecoefficients of the Fourier series are given by bi = 0 for all i (because f is even) andby

ai =

12

if i = 0;

sin(πi/2)

iπif i > 0.

(4) The spaces of integrable functions, as for instance

H :=

f : [−1, 1] → R :

∫ 1

−1

f 2(x) dx <∞

endowed with the scalar product

〈f, g〉 :=

∫ 1

−1

f(x)g(x) dx

and with the (pseudo) induced distance d(f, g) = 〈f − g, f − g〉1/2, are not complete,if we restrict ourselves to Riemann integrable functions only. In this sense, the pathfrom Riemann’s to Lebesgue’s theory is the same one that led from the (incomplete)set of rational numbers Q to the (complete) real line R.

Lebesgue’s theory extends Riemann’s theory in two independent directions. Thefirst one is concerned, as we already said, with more general classes of functions, notnecessarily continuous or piecewise continuous (the so-called Borel or measurablefunctions). The second direction can be better understood if we remind the verydefinition of Riemann’s integral∫ 1

−1

f(x) dx ∼n−1∑i=1

(ti+1 − ti)f(ti)

Page 4: Probability Book

where t1 = −1, tn = 1 and the approximation is better and better as the parametersupi<n ti+1− ti tends to 0. More generally, instead of integrating with respect to the“length” measure, we can integrate with respect to a generic measure µ and define∫ 1

−1

f(x) dµ(x) ∼n∑

i=1

µ(Ai)f(xi) (1)

where A1, . . . , An is a partition of [−1, 1], xi ∈ Ai and the approximation is expectedto be better and better as the parameter supi diam(Ai) tends to 0. We may think, forinstance to [−1, 1] as a possibly non-homogeneous bar, and to µ(A) as the “mass” ofthe subset A of the bar: because of non-homogeneity, µ(A) need not be proportionalto the length of A.

Once we adopt this viewpoint, we will see that it is not hard to obtain a theoryof integration in general metric spaces, and even in more general classes of spaces.On the other hand, the approximation (1), that in any case clarifies the intuitivemeaning of the integral, will remain valid for continuous functions only.

Page 5: Probability Book

Chapter 1

Measure spaces

In this chapter we shall introduce all basic concepts of measure theory, adoptingthe point of view of measures as set functions. The domains of measures mayhave different stability properties, and this leads to the concepts of ring, algebraand σ–algebra. The most basic tool developed in the chapter is Caratheodory’stheorem, which ensures in many cases the existence and the uniqueness of a σ–additive measure having some prescribed values on a set of generators of the σ–algebra. In the final part of the chapter we will apply these abstract tools to theproblem of constructing a “length” measure on the real line, the so-called Lebesguemeasure, and we will study its main properties.

1.1 Notation and preliminaries

We denote by N = 0, 1, 2, . . . the set of natural numbers, and by N∗ the set ofpositive natural numbers. Unless otherwise stated, sequences will always be indexedby natural numbers.

We shall denote by X a non-empty set, by P(X) the set of all parts of X and by∅ the empty set. For any subset A of X we shall denote by Ac its complement Ac :=x ∈ X : x /∈ A. If A, B ∈ P(X) we denote by A \ B the relative complementA ∩Bc, and by A∆B the symmetric difference (A \B) ∪ (B \ A).

Let (An) be a sequence in P(X). Then the following De Morgan identity holds,

(∞⋃

n=0

An

)c

=∞⋂

n=0

Acn.

1

Page 6: Probability Book

2 Measure spaces

Moreover, we define (1)

lim supn→∞

An :=∞⋂

n=0

∞⋃m=n

Am, lim infn→∞

An :=∞⋃

n=0

∞⋂m=n

Am.

As it can be easily checked, lim supnAn (resp. lim infnAn) consists of those elementsof X that belong to infinitely many An (resp. that belong to all but finitely manyAn).

It easy to check that if (An) is nondecreasing (i.e. An ⊂ An+1, n ∈ N), we have

lim infn→∞

An = lim supn→∞

An =∞⋃

n=0

An,

whereas if (An) is nonincreasing (i.e. An ⊃ An+1, n ∈ N), we have

lim infn→∞

An = lim supn→∞

An =∞⋂

n=0

An.

In the first case we shall write An ↑ L, and in the second one An ↓ L.

1.2 Rings, algebras and σ–algebras

Definition 1.1 (Rings and Algebras) A non empty subset A of P(X) is saidto be a ring if:

(i) ∅ belongs to A ;

(ii) A, B ∈ A =⇒ A ∪B, A ∩B ∈ A ;

(iii) A, B ∈ A =⇒ A \B ∈ A .

We say that a ring is an algebra if X ∈ A .

Notice that rings are stable only with respect to relative complement, whereasalgebras are stable under complement in X.

Let K ⊂ P(X). As the intersection of any family of algebras is still an algebra,the minimal algebra including K (that is the intersection of all algebras including

(1)Notice the analogy with liminf and limsup limits for a sequence (an) of real numbers. We havelim sup

n→∞an = inf

n∈Nsupm≥n

am and lim infn→∞

an = supn∈N

infm≥n

am. This is something more than an analogy,

see Exercise 1.1.

Page 7: Probability Book

Chapter 1 3

K ) is well defined, and called the algebra generated by K . A constructive char-acterization of the algebra generated by K can be easily achieved as follows: setF (0) = K ∪ ∅ and

F (i+1) :=⋃

A ∪B, Ac : A, B ∈ F (i)

∀i ≥ 0.

Then, the algebra A generated by K is given by⋃

i F(i). Indeed, it is immediate

to check by induction on i that A ⊃ F (i), and therefore the union of the F (i)’s iscontained in A . On the other hand, this union is easily seen to be an algebra, sothe minimality of A provides the opposite inclusion.

Definition 1.2 (σ–algebras) A non-empty subset E of P(X) is said to be a σ–algebra if:

(i) E is an algebra;

(ii) if (An) is a sequence of elements of E then∞⋃

n=0

An ∈ E .

If E is a σ–algebra and (An) ⊂ E we have⋂

nAn ∈ E by the De Morgan identity.Moreover, both sets

lim infn→∞

An, lim supn→∞

An,

belong to E .

Obviously, ∅, X and P(X) are σ–algebras, respectively the smallest and thelargest ones. Let K be a subset of P(X). As the intersection of any family ofσ–algebras is still a σ-algebra, the minimal σ–algebra including K (that is theintersection of all σ–algebras including K ) is well defined, and called the σ–algebragenerated by K . It is denoted by σ(K ).

In contrast with the case of generated algebras, it is quite hard to give a con-structive characterization of the generated σ-algebras: this requires the transfiniteinduction and it is illustrated in Exercise 1.16.

Definition 1.3 (Borel σ-algebra) If (E, d) is a metric space, the σ–algebra gen-erated by all open (resp. closed) subsets of E is called the Borel σ–algebra of E andit is denoted by B(E).

In the case when E = R the Borel σ-algebra has a particularly simple class ofgenerators.

Page 8: Probability Book

4 Measure spaces

Example 1.4 (B(R)) Let I be the set of all semi–closed intervals [a, b) with a ≤ b.Then σ(I ) coincides with B(R). In fact σ(I ) contains all open intervals (a, b) since

(a, b) =∞⋃

n=n0

[a+

1

n, b)

with n0 >1

b− a.

Moreover, any open setA in R is a countable union of open intervals. (2) An analogousargument proves that B(R) is generated by semi-closed intervals (a, b], by openintervals, by closed intervals and even by open or closed half-lines.

1.3 Additive and σ–additive functions

Let A ⊂ P(X) be a ring and let µ be a mapping from A into [0,+∞] such thatµ(∅) = 0. We say that µ is additive if

A, B ∈ A , A ∩B = ∅ =⇒ µ(A ∪B) = µ(A) + µ(B).

If µ is additive, A, B ∈ A and A ⊃ B, we have µ(A) = µ(B) + µ(A \ B), sothat µ(A) ≥ µ(B). Therefore any additive function is nondecreasing with respectto set inclusion. Moreover, by applying repeatedly the additivity property, additivemeasures satisfy

µ

(n⋃

k=1

Ak

)=

n∑k=1

µ(Ak)

for n ∈ N∗ and mutually disjoint sets A1, . . . , An ∈ A .An set function µ on A is called σ–additive if µ(∅) = 0 and for any sequence

(An) ⊂ A of mutually disjoint sets such that⋃

nAn ∈ A we have

µ

(∞⋃

n=0

An

)=

∞∑n=0

µ(An).

Obviously σ–additive functions are additive, because we can consider countableunions in which only finitely many An are nonempty.

Another useful concept is the σ–subadditivity : we say that µ is σ–subadditive if

µ(B) ≤∞∑

n=0

µ(An),

(2)Indeed, let (ak) be a sequence including all rational numbers of A and denote by Ik the largestopen interval contained in A and containing ak. We clearly have A ⊃

⋃∞k=0 Ik, but also the opposite

inclusion holds: it suffices to consider, for any x ∈ A, r > 0 such that (x − r, x + r) ⊂ A, and ksuch that ak ∈ (x− r, x+ r) to obtain (x− r, x+ r) ⊂ Ik, by the maximality of Ik, and then x ∈ Ik.

Page 9: Probability Book

Chapter 1 5

for any B ∈ A and any sequence (An) ⊂ A such that B ⊂⋃

nAn. Notice that,unlike the definition of σ–additivity, the sets An need not be disjoint here.

Remark 1.5 (σ–additivity and σ–subadditivity) Let µ be additive on a ringA and let (An) ⊂ A be mutually disjoint and such that

⋃nAn ∈ A . Then by

monotonicity we have

µ

(∞⋃

n=0

An

)≥ µ

(k⋃

n=0

An

)=

k∑n=0

µ(An), for all k ∈ N.

Therefore, letting k ↑ ∞ we get

µ

(∞⋃

n=0

An

)≥

∞∑n=0

µ(An).

Thus, to show that an additive function is σ–additive, it is enough to prove that itis σ–subadditive.

Conversely, it is not difficult to show that σ–additive set functions are σ–subadditive:indeed, if B ⊂ ∪nAn we can define A′

0 = B ∩ A0 and A′n := B ∩ An \

⋃m<nAm for

n ∈ N∗, so that B is the disjoint union of the sets A′n, to obtain

µ(B) =∞∑

n=0

µ(A′n) ≤

∞∑n=0

µ(An).

Let µ be additive on A . Then σ–additivity of µ is equivalent to continuity of µin the sense of the following proposition.

Proposition 1.6 (Continuity on nondecreasing sequences) If µ is additive ona ring A , then (i) ⇐⇒ (ii), where:

(i) µ is σ–additive;

(ii) (An) ⊂ A and A ∈ A , An ↑ A =⇒ µ(An) ↑ µ(A).

Proof. (i)=⇒(ii). In the proof of this implication we can assume with no loss ofgenerality that µ(An) <∞ for all n ∈ N. Let (An) ⊂ A , A ∈ A , An ↑ A. Then

A = A0 ∪∞⋃

n=0

(An+1 \ An),

the unions being disjoint. Since µ is σ–additive, we deduce that

µ(A) = µ(A0) +∞∑

n=0

(µ(An+1)− µ(An)) = limn→∞

µ(An),

Page 10: Probability Book

6 Measure spaces

and (ii) follows.(ii)=⇒(i). Let (An) ⊂ A be mutually disjoint and such that A :=

⋃nAn ∈ A .

Set

Bm :=m⋃

k=0

Ak.

Then Bm ↑ A and µ(Bm) =m∑0

µ(Ak) ↑ µ(A) by the assumption. This implies (i).

Proposition 1.7 (Continuity on nonincreasing sequences) Let µ be σ–additiveon a ring A . Then

(An) ⊂ A and A ∈ A , An ↓ A, µ(A0) <∞ =⇒ µ(An) ↓ µ(A). (1.1)

Proof. Setting Bn := A0 \ An, B := A0 \ A, we have Bn ↑ B, therefore theprevious proposition gives µ(Bn) ↑ µ(B). As µ(An) = µ(A0) − µ(Bn) and µ(A) =µ(A0)− µ(B) the proof is achieved.

Corollary 1.8 (Upper and lower semicontinuity of the measure) Let µ be σ–additive on a σ–algebra E and let (An) ⊂ E . Then we have

µ(lim infn→∞

An

)≤ lim inf

n→∞µ(An) (1.2)

and, if µ(X) <∞, we have also

lim supn→∞

µ(An) ≤ µ

(lim sup

n→∞An

). (1.3)

Proof. Set L := lim supnAn. Then we can write

L =∞⋂

n=0

Bn where Bn :=∞⋃

m=n

Am. (1.4)

Now, assuming µ(X) <∞, by Proposition 1.7 it follows that

µ(L) = limn→∞

µ(Bn) = infn∈N

µ(Bn) ≥ infn∈N

supm≥n

µ(Am) = lim supn→∞

µ(An).

Thus, we have proved (1.3). The inequality (1.2) can be proved similarly usingProposition 1.6, thus without using the assumption µ(X) <∞.

The following result is very useful to estimate the measure of a lim sup of sets.

Page 11: Probability Book

Chapter 1 7

Lemma 1.9 (Borel–Cantelli, first part) Let µ be σ–additive on a σ–algebra E

and let (An) ⊂ E . Assume that∞∑

n=0

µ(An) <∞. Then

µ

(lim sup

n→∞An

)= 0.

Proof. Set L = lim supn→∞

An and define Bn as in (1.4). Then the inclusion L ⊂ Bn

gives

µ(L) ≤ µ(Bn) ≤∞∑

m=n

µ(Am) for all n ∈ N.

As n→∞ we find µ(L) = 0.

1.4 Measurable spaces and measure spaces

Let E be a σ–algebra of subsets of X. Then we say that the pair (X,E ) is ameasurable space. Let µ : E → [0,+∞] be a σ–additive function. Then we call µ ameasure on (X,E ), and we call the triple (X,E , µ) a measure space.

The measure µ is said to be finite if µ(X) <∞, σ–finite if there exists a sequence(An) ⊂ E such that

⋃nAn = X and µ(An) <∞ for all n ∈ N. Finally, µ is called a

probability measure if µ(X) = 1.The simplest (but fundamental) example of a probability measure is the Dirac

mass δx, defined by

δx(B) :=

1 if x ∈ B0 if x /∈ B.

This example can be generalized as follows, see also Exercise 1.5.

Example 1.10 (Discrete measures) Assume that Y ⊂ X is a finite or countableset. Given c : Y → [0,+∞] we can define a measure on (X,P(X)) as follows:

µ(B) :=∑

x∈B∩Y

c(x) ∀B ⊂ X.

Clearly µ =∑

x∈Y c(x)δx is a finite measure if and only if∑

x∈Y c(x) <∞, and it isσ–finite if and only if c(x) ∈ [0,+∞) for all x ∈ Y .

More generally, the construction above works even when Y is uncountable, byreplacing the sum with sup

∑c∈B∩Y ′ c(x), where the supremum is made among the fi-

nite subsets Y ′ of Y . The measures arising in the previous example are called atomic,

Page 12: Probability Book

8 Measure spaces

and clearly if X is either finite or countable then any measure µ in (X,P(X)) isatomic: it suffices to notice that

µ =∑x∈X

c(x)δx with c(x) := µ(x).

In the next section we will introduce a fundamental tool for the construction ofnon-atomic measures.

Definition 1.11 (µ–negligible sets and µ–almost everywhere) Given a mea-sure space (X,E , µ), we say that B ∈ E is µ–negligible if µ(B) = 0, and we say thata property P (x) holds µ–almost everywhere if the set

x ∈ X : P (x) is false

is contained in a µ–neglibigle set.

Notice that the class of µ–negligible sets is stable under finite or countable unions.It is sometimes convenient to know that any subset of a µ–negligible set is still µ–negligible.

Definition 1.12 (µ–completion of a σ–algebra) Let (X,E , µ) be a measure space.We define

Eµ := A ∈ P(X) : for some B, C ∈ E with µ(C) = 0, A∆B ⊂ C .

It is easy to check that Eµ is still a σ–algebra, the so-called completion of E withrespect to µ.

It is also easy to check that µ can be extended to all A ∈ Eµ simply by settingµ(A) = µ(B), where B ∈ E is any set such that A∆B is contained in a µ–negligibleset of E . This extension is well defined (i.e. independent of the choice of B),still σ–additive and µ–negligible sets coincide with those sets that are contained insome B ∈ E with µ(B) = 0. As a consequence, any subset of a µ–negligible set isµ–negligible as well.

1.5 The basic extension theorem

The following result, due to Caratheodory, allows to extend a σ–additive functionon a ring A to a σ–additive function on σ(A ). It is one of the basic tools in theconstruction of non-trivial measures in many cases of interest, as we will see.

Page 13: Probability Book

Chapter 1 9

Theorem 1.13 (Caratheodory) Let A ⊂ P(X) be a ring, and let E be theσ–algebra generated by A . Let µ : A → [0,+∞] be σ–additive. Then µ can beextended to a measure on E . If µ is σ–finite, i.e. there exist An ∈ A with An ↑ Xand µ(An) <∞ for any n, then the extension is unique.

To prove this theorem we need some preliminaries: for the uniqueness the Dynkintheorem and for the existence the concepts of outer measure and additive set.

1.5.1 π–systems and Dynkin systems

A non-empty subset K of P(X) is called a π–system if

A, B ∈ K =⇒ A ∩B ∈ K .

A non-empty subset D of P(X) is called a Dynkin system if

(i) X, ∅ ∈ D ;

(ii) A ∈ D =⇒ Ac ∈ D ;

(iii) (Ai) ⊂ D mutually disjoint =⇒⋃

iAi ∈ D .

Obviously any σ–algebra is a Dynkin system. Moreover, if D is both a Dynkinsystem and a π–system then it is a σ–algebra. In fact, if (Ai) is a sequence in D ofnot necessarily disjoint sets we have

∞⋃i=0

Ai = A0 ∪ (A1 \ A0) ∪ ((A2 \ A1) \ A0) ∪ · · ·

and so⋃

iAi ∈ D by (ii) and (iii).Let us prove now the following important result.

Theorem 1.14 (Dynkin) Let K be a π–system and let D ⊃ K be a Dynkinsystem. Then σ(K ) ⊂ D .

Proof. Let D0 be the minimal Dynkin system including K . We are going to showthat D0 is a σ–algebra which will prove the theorem. For this it is enough to show,as remarked before, that the following implication holds:

A, B ∈ D0 =⇒ A ∩B ∈ D0. (1.5)

For any B ∈ D0 we set

H (B) = F ∈ D0 : B ∩ F ∈ D0.

Page 14: Probability Book

10 Measure spaces

We claim that H (B) is a Dynkin system. In fact properties (i) and (iii) are clear. Itremains to show that if F ∩B ∈ D0 then F c∩B ∈ D0 or, equivalently, F ∪Bc ∈ D0.In fact, since F ∪Bc = (F \Bc)∪Bc = (F ∩B)∪Bc and F ∩B and Bc are disjoint,we have that F ∪Bc ∈ D0 as required.

Notice first that if K ∈ K we have K ⊂ H (K) since K is a π–system.Therefore H (K) = D0, by the minimality of D0. Consequently, the followingimplication holds

K ∈ K , B ∈ D0 =⇒ K ∩B ∈ D0,

which implies K ⊂ H (B) for all B ∈ D0. Again, the fact that H (B) is a Dynkinsystem and the minimality of D0 give that H (B) = D0 for all B ∈ D0. By thedefinition of H (B), this proves (1.5).

The uniqueness part in Caratheodory’s theorem is a direct consequence of thefollowing coincidence criterion for measures; in turn, the proof of the criterion relieson Theorem 1.14.

Proposition 1.15 (Coincidence criterion) Let µ1, µ2 be measures in (X,E ) andassume that:

(i) the coincidence set

D := A ∈ E : µ1(A) = µ2(A)

contains a π–system K with σ(K ) = E ;

(ii) there exists a nondecreasing sequence (Xi) ⊂ K with µ1(Xi) = µ2(Xi) < ∞and Xi ↑ X.

Then µ1 = µ2.

Proof. We first assume that µ1(X) = µ2(X) is finite. Under this assumption D isa Dynkin system including the π–system K (stability of D is ensured precisely bythe finiteness assumption). Thus, by the Dynkin theorem, D = E , which impliesthat µ1 = µ2.

Assume now that we are in the general case and let Xi be given by assumption(ii). Fix i ∈ N and define the σ–algebra Ei of subsets of Xi by

Ei := A ⊂ Xi : A ∈ E .

We may obviously consider µ1 and µ2 as finite measures in the measurable space(Xi,Ei). Since these measures coincide on the π–system

Ki := A ⊂ Xi : A ∈ K

Page 15: Probability Book

Chapter 1 11

we obtain, by the previous step, that µ1 and µ2 coincide on σ(Ki) ⊂ P(Xi).Now, let us prove the inclusion

B ∈ E : B ⊂ Xi ⊂ σ(Ki). (1.6)

IndeedB ⊂ X : B ∩Xi ∈ σ(Ki)

is a σ–algebra containing K (here we use the fact that Xi ∈ K ), and thereforecontains E . Hence any element of E contained in Xi belongs to σ(Ki).

By (1.6) we obtain µ1(B∩Xi) = µ2(B∩Xi) for all B ∈ E and all i ∈ N. Passingto the limit as i→∞, since B is arbitrary we obtain that µ1 = µ2.

1.5.2 The outer measure

Let µ be defined on A ⊂ P(X). For any E ∈ P(X) we define:

µ∗(E) := inf

∞∑i=0

µ(Ai) : Ai ∈ A , E ⊂∞⋃i=0

Ai

.

µ∗(E) is called the outer measure of E. We can easily show that:

• µ∗ is a nondecreasing set function, namely µ∗(E) ≤ µ∗(F ) whenever E ⊂ F ⊂X;

• µ∗ extends µ if µ is σ–subadditive. Indeed, choose E ∈ A ; since E ⊂⋃

iAi

then µ(E) ≤∑

i µ(Ai), so we deduce µ∗(E) ≥ µ(E); but, by choosing A0 = Eand An = ∅ for n ≥ 1, we obtain that µ∗(E) = µ(E).

Proposition 1.16 The set function µ∗ is σ–subadditive on P(X).

Proof. Let (Ei) ⊂ P(X) and set E :=⋃

iEi. Assume that∑

i µ∗(Ei) are finite

(otherwise the assertion is trivial). Then, since µ∗(Ei) is finite for any i ∈ N, forany ε > 0 there exist Ai,j ∈ A such that

∞∑j=0

µ(Ai,j) < µ∗(Ei) +ε

2i+1, Ei ⊂

∞⋃j=0

Ai,j, i ∈ N.

Consequently∞∑

i, j=0

µ(Ai,j) ≤∞∑i=0

µ∗(Ei) + ε.

Page 16: Probability Book

12 Measure spaces

Since E ⊂∞⋃

i,j=0

Ai,j we have

µ∗(E) ≤∞∑

i,j=0

µ(Ai,j) ≤∞∑i=0

µ∗(Ei) + ε

and the conclusion follows from the arbitrariness of ε.

Let now define the additive sets, according to Caratheodory. A set A ∈ P(X)is called additive if

µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ Ac) for all E ∈ P(X). (1.7)

We denote by G the family of all additive sets.Notice that, since µ∗ is σ–subadditive, (1.7) is equivalent to

µ∗(E) ≥ µ∗(E ∩ A) + µ∗(E ∩ Ac) for all E ∈ P(X). (1.8)

Obviously, the class G of additive sets is stable under complement; moreover, bytaking E = A ∪B with A ∈ G and A ∩B = ∅, we obtain the additivity property

µ∗(A ∪B) = µ∗(A) + µ∗(B). (1.9)

Other important properties of G are listed in the next proposition.

Theorem 1.17 Assume that A is a ring. Then G is a σ–algebra containing Aand µ∗ is σ–additive on G .

From Theorem 1.17 the existence part of the Caratheodory theorem clearly fol-lows.Proof. We proceed in three steps: we show that G contains A , that G is a σ–algebraand that µ∗ is additive on G . As pointed in Remark 1.5, if µ∗ is σ–subadditive andadditive on the σ–algebra G , then µ∗ is σ–additive.Step 1. A ⊂ G . Let A ∈ A and E ∈ P(X), we have to show (1.8). Assumeµ∗(E) < ∞ (otherwise (1.8) trivially holds), fix ε > 0 and choose (Bi) ⊂ A suchthat

E ⊂∞⋃i=0

Bi, µ∗(E) + ε >

∞∑i=0

µ(Bi).

Then, by the definition of µ∗, it follows that

µ∗(E) + ε >

∞∑i=0

µ(Bi) =∞∑i=0

[µ(Bi ∩ A) + µ(Bi ∩ Ac)] ≥ µ∗(E ∩ A) + µ∗(E ∩ Ac).

Page 17: Probability Book

Chapter 1 13

Since ε is arbitrary we have µ∗(E) ≥ µ∗(E ∩ A) + µ∗(E ∩ Ac), and (1.8) follows.Step 2. G is an algebra and µ∗ is additive on G . We already know that A ∈ Gimplies Ac ∈ G . Let us prove now that if A, B ∈ G then A ∪ B ∈ G . For anyE ∈ P(X) we have

µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ Ac)

= µ∗(E ∩ A) + µ∗(E ∩ Ac ∩B) + µ∗(E ∩ Ac ∩Bc)

= [µ∗(E ∩ A) + µ∗(E ∩ Ac ∩B)] + µ∗(E ∩ (A ∪B)c).

(1.10)

Since(E ∩ A) ∪ (E ∩ Ac ∩B) = E ∩ (A ∪B),

we have by the subadditivity of µ∗,

µ∗(E ∩ A) + µ∗(E ∩ Ac ∩B) ≥ µ∗(E ∩ (A ∪B)).

So, by (1.10) it follows that

µ∗(E) ≥ µ∗(E ∩ (A ∪B)) + µ∗(E ∩ (A ∪B)c),

and A ∪B ∈ G as required. The additivity of µ∗ on G follows directly from (1.9).Step 3. G is a σ–algebra. Let (An) ⊂ G . We are going to show that S :=

⋃nAn ∈

G . Since we know that G is an algebra, it is not restrictive to assume that all setsAn are mutually disjoint. Set Sn :=

⋃n0 Ai, for n ∈ N.

For any n ∈ N, by using the σ–subadditivity of µ∗ and by applying (1.7) repeat-edly, we get

µ∗(E ∩ S) + µ∗(E ∩ Sc) ≤∞∑i=0

µ∗(E ∩ Ai) + µ∗(E ∩ Sc)

= limn→∞

[n∑

i=0

µ∗(E ∩ Ai) + µ∗(E ∩ Sc)

]= lim

n→∞[µ∗(E ∩ Sn) + µ∗(E ∩ Sc)] .

Since Sc ⊂ Scn it follows that

µ∗(E ∩ S) + µ∗(E ∩ Sc) ≤ lim supn→∞

[µ∗(E ∩ Sn) + µ∗(E ∩ Scn)] = µ∗(E).

So, S ∈ G and G is a σ–algebra.

Page 18: Probability Book

14 Measure spaces

Remark 1.18 We have proved that

σ(A ) ⊂ G ⊂ P(X). (1.11)

One can show that the inclusions above are strict in general. In fact, in the casewhen X = R and σ(A ) is the Borel σ-algebra, Exercise 1.17 shows that σ(A ) hasthe cardinality of continuum, while G has the cardinality of P(R), since it containsall subsets of Cantor’s middle third set (see Exercise 1.7). An example of a non-additive set will be built in Remark 1.21, so that also the second inclusion in (1.11)is strict.

1.6 The Lebesgue measure in R

In this section we build the Lebesgue measure on the real line R. To this aim, weconsider first the set I of all bounded right open intervals of R

I := (a, b] : a, b ∈ R, a < b

and the collection A containing ∅ and the finite unions of elements of I . Our choiceof half-open intervals ensures that A is a ring, because I is stable under intersectionand relative complement (the families of open and closed intervals, instead, do nothave this property).

We define length((a, b]) := b− a. More generally, any non-empty A ∈ A can bewritten, possibly in many ways, as a disjoint finite union of intervals Ii, i = 1, . . . , N ;we define

λ(A) :=N∑

i=1

length(Ii). (1.12)

Setting λ(∅) = 0, it is not hard to show by elementary methods that λ is welldefined (i.e. λ(A) does not depend on the chosen decomposition) and additive onA (3). In the next theorem we shall rigorously prove these facts, and more.

Theorem 1.19 The set function λ defined in (1.12) is σ–additive on A .

Proof. (λ is well defined) Given disjoint partitions I1, . . . , In and J1, . . . , Jm ofA ∈ A , we say that J1, . . . , Jm is finer than I1, . . . , In if any interval Ii is thedisjoint union of some of the intervals Jj. Obviously, given any two partitions,there exists a third partition finer than both: it suffices to take all intersections

(3)The reader acquainted with Riemann’s theory of integration can also notice that λ(A) is theRiemann integral of the characteristic function 1A of A, and deduce the additivity property of λdirectly by the additivity properties of the Riemann integral

Page 19: Probability Book

Chapter 1 15

of elements of the first partition with elements of the second partition, neglectingthe empty intersections. Given these remarks, to show that λ is well defined, itsuffices to show that

∑i λ(Ii) =

∑j λ(Jj) if J1, . . . , Jm is finer than I1, . . . , In. This

statement reduces to the fact that λ(I) =∑

k λ(Fk) if I ∈ I is the disjoint unionof some elements Fk ∈ I ; this last statement can be easily proved, starting fromthe identity (a, b] = (a, c] ∪ (c, b], by induction on the number of the intervals Fk.(λ is additive) If F, G ∈ A and F ∩ G = ∅, any disjoint decompositions of F inintervals I1, . . . , In ∈ I and any disjoint decomposition ofG in intervals J1, . . . , Jm ∈I provide a decomposition I1, . . . , In, J1, . . . , Jm of F ∪G in intervals belonging toI . Using this decomposition to compute λ(F ∪G) the additivity easily follows.(λ is σ–additive) Let (Fn) ⊂ A be a sequence of disjoints sets in A and assumethat

F :=∞⋃

n=0

Fn (1.13)

also belongs to A .We prove the additivity property first in the case when F = (x, y] ∈ I . It is

also not restrictive to assume that the series∑

n λ(Fn) is convergent. As any Fn

is a finite union of intervals, say Nn, we can find, given any ε > 0, a finite unionF ′

n ⊃ Fn of intervals in I such that λ(F ′n) ≤ λ(Fn) + ε/2n (just increase the right

endpoint of each interval in Fn by a small factor, to obtain a larger interval in I ,increasing the length at most by ε/(Nn2n)). Let also F ′′

n be the internal part ofF ′

n, that still includes Fn, and let x′ ∈ (x, y]. Then, since [x′, y] ⊂⋃

n F′′n , the

Heine–Borel theorem (4) provides an integer k such that [x′, y] ⊂⋃k

0 F′n. Hence, the

additivity of λ in A gives

y − x′ ≤ λ

(k⋃

n=0

F ′n

)≤

k∑n=0

λ(F ′n)

≤k∑

n=0

λ(Fn) +ε

2n≤ 2ε+

∞∑n=0

λ(Fn).

By letting first ε ↓ 0 and then letting x′ ↓ x we obtain that λ(F ) ≤∑∞

0 λ(Fn).The opposite inequality simply follows by the monotonicity and the additivity of λ,because the finite unions of the sets Fn are contained in F .

(4)Any bounded and closed interval J contained in the union of a family Aii∈I of open setsis contained in the union of finitely many of them. If the family is countable, as in our case, theproof is simple: assume with no loss of generality that I = N and An ⊂ An+1, and assume bycontradiction that there exist xn ∈ J \ An for all n; by the Bolzano–Weierstrass theorem thereexists a subsequence (xn(k)) converging to some x ∈ J . If n is such that x ∈ An, for k large enoughxn(k) belongs to An, because An is open. But this is not possible, as soon as n(k) ≥ n, becausexn(k) /∈ An(k) and An(k) ⊃ An

Page 20: Probability Book

16 Measure spaces

In the general case, let

F =k⋃

i=1

Ii,

where I1, . . . , Ik are disjoint sets in I . Then, since for any i ∈ 1, . . . , k we havethat Ii is the disjoint union of Ii ∩ Fn, we know by the previous step that

λ(Ii) =∞∑

n=0

λ(Ii ∩ Fn).

Adding these identities for i = 1, . . . , k, commuting the sums on the right hand sideand eventually using the additivity of λ on A we obtain

λ(F ) =k∑

i=1

λ(Ii ∩ F ) =∞∑

n=0

k∑i=1

λ(Ii ∩ Fn) =∞∑

n=0

λ(Fn).

We say that a measure µ in (R,B(R)) is translation invariant if µ(A+h) = µ(A)for all A ∈ B(R) and h ∈ R (notice that, by Exercise 1.2, the class of Borel sets istranslation invariant as well). We say also that µ is locally finite if µ(I) <∞ for allbounded intervals I ⊂ R.

Theorem 1.20 (Lebesgue measure in R) There exists a unique, up to multipli-cation with constants, translation invariant and locally finite measure λ in (R,B(R)).The unique such measure λ satisfying λ([0, 1]) = 1 is called Lebesgue measure.

Proof. (Existence) Let A be the class of finite unions of intervals and let λ : A →[0,+∞) be the σ–additive set function defined in (1.12). According to Theorem 1.19λ admits a unique extension, that we still denote by λ, to σ(A ) = B(R). Clearly λis locally finite, and we can use the uniqueness of the extension to prove translationinvariance: indeed, for any h ∈ R also the σ–additive measure A 7→ λ(A + h) is anextension of λ|A . As a consequence λ(A) = λ(A+ h) for all h ∈ R.

(Uniqueness) Let ν be a translation invariant and locally finite measure in (R,B(R))and set c := ν([0, 1]). Notice first that the set of atoms of ν is at most countable(Exercise 1.5), and since R is uncountable there exists at least one x such thatν(x) = 0. By translation invariance this holds for all x, i.e., ν has no atom.

Excluding the trivial case c = 0 (that gives ν ≡ 0 by translation invariance andσ–additivity), we are going to show that ν = cλ on the class A of finite unions ofintervals; by the uniqueness of the extension in Caratheodory theorem this wouldimply that ν = cλ on B(R).

Page 21: Probability Book

Chapter 1 17

By finite additivity and translation invariance it suffices to show that ν([0, t)) =ct for any t ≥ 0 (by the absence of atoms the same holds for the intervals (0, t), (0, t],[0, t]). Notice first that, for any integer q ≥ 1, [0, 1) is the union of q disjoint intervalsall congruent to [0, 1/q); as a consequence, additivity and translation invariance give

ν([0, 1/q)

)=ν([0, 1))

q=c

q.

Similarly, for any integer p ≥ 1 the interval [0, p/q) is the union of p disjoint intervalsall congruent to [0, 1/q); again additivity and translation invariance give

ν([0,p

q)) = pν

([0,

1

q))

= cp

q.

By approximation we eventually obtain that ν([0, t)) = ct for all t ≥ 0.

The completion of the Borel σ–algebra with respect to λ is the so-called σ-algebraof Lebesgue measurable sets. It coincides with the class C of additive sets withrespect to λ∗ considered in the proof of Caratheodory theorem (see Exercise 1.10).

Remark 1.21 (Outer Lebesgue measure and non-measurable sets) The mea-sure λ∗ used in the proof of Caratheodory’s theorem is also called outer Lebesguemeasure, and it is defined on all parts of R. The terminology is slightly misleadinghere, since λ∗, though σ–subadditive, fails to be σ–additive. In particular, thereexist subsets of R that are not Lebesgue measurable. To see this, let us considerthe equivalence relation in R defined by x ∼ y if x− y ∈ Q and let us pick a singleelement x ∈ [0, 1] in any equivalence

class induced by this relation, thus forming a set A ⊂ [0, 1]. Were this setLebesgue measurable, all the sets A + h would still be measurable, by translationinvariance, and the family of sets A+ hh∈Q would be a countable and measurablepartition of R, with λ∗(A + h) = c independent of h ∈ Q. Now, if c = 0 we reach acontradiction with the fact that λ∗(R) = ∞, while if c > 0 we consider all sets A+hwith h ∈ Q ∩ [−1, 1] to obtain

3 = λ∗([−1, 2]) ≥∑

h∈Q∩[−1,1]

λ∗(A+ h) = ∞,

reaching again a contradiction.Notice that this example is not constructive and strongly requires the axiom of

choice (also the arguments based on cardinality, see Exercise 1.17 and Exercise 1.18,have this limitation). On the other hand, one can give constructive examples ofLebesgue measurable sets that are not Borel (see for instance 2.2.11 in [2]).

Page 22: Probability Book

18 Measure spaces

The construction done in the previous remark rules out the existence of locallyfinite and translation invariant σ–additive measures defined on all parts of R. InRn, with n ≥ 3, the famous Banach–Tarski paradox shows that it is also impossibleto have a locally finite, invariant under rigid motions and finitely additive measuredefined on all parts of Rn.

1.7 Inner and outer regularity of measures on met-

ric spaces

Let (E, d) be a metric space and let µ be a finite measure on (E,B(E)). We shallprove a regularity property of µ.

Proposition 1.22 For any B ∈ B(E) we have

µ(B) = supµ(C) : C ⊂ B, closed = infµ(A) : A ⊃ B, open. (1.14)

Proof. Let us setK = B ∈ B(E) : (1.14) holds.

It is enough to show that K is a σ–algebra of parts of E including the open sets ofE. Obviously K contains E and ∅. Moreover, if B ∈ K then its complement Bc

belongs to K . Let us prove now that (Bn) ⊂ K implies⋃

nBn ∈ K . Fix ε > 0.We are going to show that there exist a closed set C and an open set A such that

C ⊂∞⋃

n=0

Bn ⊂ A, µ(A \ C) ≤ 2ε. (1.15)

Let n ∈ N. Since Bn ∈ K there exist an open set An and a closed set Cn such thatCn ⊂ Bn ⊂ An and

µ(An \ Cn) ≤ ε

2n+1.

Setting A :=⋃

nAn, S :=⋃

nCn we have S ⊂⋃

nBn ⊂ A and µ(A \ S) ≤ ε.However, A is open but S is not necessarily closed. So, we approximate S by settingSn :=

⋃n0 Ck. The set Sn is obviously closed, Sn ↑ S and consequently µ(Sn) ↑ µ(S).

Therefore there exists nε ∈ N such that µ(S \ Snε) < ε. Now, setting C = Snε

we have C ⊂⋃

nBn ⊂ A and µ(A \ C) < µ(A \ S) + µ(S \ C) < 2ε. Therefore⋃nBn ∈ K . We have proved that K is a σ–algebra. It remains to show that K

contains the open subsets of E. In fact, let A be open and set

Cn =

x ∈ E : d(x,Ac) ≥ 1

n

,

Page 23: Probability Book

Chapter 1 19

where d(x,Ac) := infy∈Ac d(x, y) is the distance function from Ac. Then Cn areclosed subsets of A, and moreover Cn ↑ A, which implies µ(A \ Cn) ↓ 0. Thus theconclusion follows.

The following result is a straightforward consequence of Proposition 1.22.

Corollary 1.23 Let µ, ν be finite measures in (E,B(E)), such that µ(C) = ν(C)for any closed subset C of E. Then µ = ν.

Page 24: Probability Book

20 Measure spaces

EXERCISES

1.1 Given A ⊂ X, denote by 1A : X → 0, 1 its characteristic function, equal to 1 on A and equalto 0 on Ac. Show that

1A∪B = max1A,1B, 1A∩B = min1A,1B, 1Ac = 1X − 1A

and that

lim supn→∞

An = A ⇐⇒ lim supn→∞

1An= 1A, lim inf

n→∞An = A ⇐⇒ lim inf

n→∞1An

= 1A.

1.2 Let A ⊂ Rn be a Borel set. Show that for h ∈ Rn and t ∈ R the sets

A+ h := a+ h : a ∈ A , tA := ta : a ∈ A

are Borel as well.1.3 Find an example of a σ–additive measure µ on a σ–algebra A such that there exist An ∈ Awith An ↓ A and infn µ(An) > µ(A).1.4 Let µ be additive and finite, on an algebra A . Show that µ is σ–additive if and only if it iscontinuous along nonincreasing sequences.1.5 Let µ be a finite measure on (X,E ). Show that the set of atoms of µ, defined by

Aµ := x ∈ X : x ∈ E and µ(x) > 0

is at most countable. Show that the same is true for σ–finite measures, and provide an example ofa measure space for which this property fails.1.6 Let (X,E , µ) be a measure space, with µ finite. We say that µ is diffuse if for all A ∈ Ewith µ(A) > 0 there exists B ⊂ A with 0 < µ(B) < µ(A). Show that, if µ is diffuse, thenµ(E ) = [0, µ(X)]. Show also that if X is a separable metric space and E is the Borel σ–algebra,then µ is diffuse if and only if µ has no atom.1.7 Let λ be the Lebesgue measure in [0, 1]. Show the existence of a λ–negligible set having thecardinality of the continuum. Hint: consider the classical Cantor’s middle third set, obtained byremoving the interval (1/3, 2/3) from [0, 1], then by removing the intervals (1/9, 2/9) and (7/9, 8/9),and so on.1.8 Let λ be the Lebesgue measure in [0, 1]. Show the existence, for any ε > 0, of a closed setC ⊂ [0, 1] containing no interval and such that λ(C) > 1− ε. Hint: remove from [0, 1] a sequenceof open intervals, centered on the rational points of [0, 1].1.9 ? Let λ be the Lebesgue measure in [0, 1]. Construct a Borel set E ⊂ (0, 1) such that

0 < λ(E ∩ I) < λ(I)

for any open interval I ⊂ (0, 1).1.10 Let (X,E , µ) be a measure space and let µ∗ : P(X) → [0,+∞] be the outer measure inducedby µ. Show that the completed σ–algebra Eµ is contained in the class C of additive sets withrespect to µ∗.1.11 Let (X,E , µ) be a measure space and let µ∗ : P(X) → [0,+∞] be the outer measure inducedby µ. Show that for all A ⊂ X there exists B ∈ E containing A with µ(B) = µ∗(A).1.12 Let (X,E , µ) be a measure space. Check the following statements, made in Definition 1.12:

Page 25: Probability Book

Chapter 1 21

(i) Eµ is a σ–algebra;

(ii) the extension µ(A) := µ(B), where B ∈ E is any set such that A∆B is contained in aµ–negligible set of E , is well defined and σ–additive on Eµ;

(iii) µ–negligible sets of Eµ are characterized by the property of being cointained in a µ–negligibleset of E .

1.13 ? Let (X,E , µ) be a measure space and let µ∗ : P(X) → [0,+∞] be the outer measureinduced by µ. Show that if µ(X) is finite, the class C of additive sets with respect to µ∗ coincideswith the class of Eµ–measurable sets. Hint: one inclusion is provided by Exercise 1.10. For theother one, given an additive set A, by applying Exercise 1.11 twice, find first a set B ∈ E withµ∗(B \A) = 0, and then a set C ∈ E with µ(C) = 0 and B \A ⊂ C.1.14 Find a σ–algebra E ⊂ P(N) containing infinitely many sets and such that any B ∈ E differentfrom ∅ has an infinite cardinality.1.15 Find µ : P(N) → 0,+∞ that is additive, but not σ–additive.1.16 ? Let ω be the first uncountable ordinal and, for K ⊂ P(X), define by transfinite inductiona family F (i), i ∈ ω, as follows: F (0) := K ∪ ∅,

F (i) :=

∞⋃k=0

Ak, Bc : (Ak) ⊂ F (j), B ∈ F (j)

,

if i is the successor of j, and F (i) :=⋃

j∈i F (j) otherwise. Show that⋃

i∈ω F (i) = σ(K ).

1.17 ? Show that B(R) has the cardinality of the continuum. Hint: use the construction of theprevious exercise, and the fact that ω has at most the cardinality of continuum.1.18 ? Show that the σ–algebra L of Lebesgue measurable sets has the same cardinality of P(R),thus strictly greater than the continuum. Hint: consider all subsets of Cantor’s middle third set.1.19 ? ? Show that the cardinality of any σ–algebra is either finite or uncountable.1.20 ? ? Find an example of a set function µ : E ⊂ P(X) → [0,+∞), with E σ–algebra and µadditive, but not σ–additive (the construction of this example requires Zorn’s lemma).

Page 26: Probability Book

22 Measure spaces

Page 27: Probability Book

Chapter 2

Integration

This chapter is devoted to the construction of the integral of E –measurable functionsin general measure spaces (Ω,E , µ), and its main continuity and lower semicontinuityproperties. Having built in the previous chapter the Lebesgue measure in the realline R, we obtain as a byproduct the Lebesgue integral on R; in the last section wecompare Lebesgue and Riemann integral.

2.1 Inverse image of a function

Let X be a non empty set. For any function ϕ : X → Y and any I ∈ P(Y ) we set

ϕ−1(I) := x ∈ X : ϕ(x) ∈ I = ϕ ∈ I.

The set ϕ−1(I) is called the inverse image of I.Let us recall some elementary properties of ϕ−1 (the easy proofs are left to the

reader as an exercise):

(i) ϕ−1(Ic) = (ϕ−1(I))c for all I ∈ P(Y );

(ii) if Jii∈I ⊂ P(Y ) we have⋃i∈I

ϕ−1(Ji) = ϕ−1(⋃

i∈I

Ji

),

⋂i∈I

ϕ−1(Ji) = ϕ−1(⋂

i∈I

Ji

).

In particular, if I ∩ J = ∅ we have ϕ−1(I) ∩ ϕ−1(J) = ∅. Also, if E ⊂ P(Y ) andwe consider the family ϕ−1(E ) of subset of X defined by

ϕ−1(E ) :=ϕ−1(I) : I ∈ E

, (2.1)

we have that ϕ−1(E ) is a σ–algebra whenever E is a σ–algebra.

23

Page 28: Probability Book

24 Integration

2.2 Measurable and Borel functions

We are given measurable spaces (X,E ) and (Y,F ). We say that a function ϕ : X →Y is (E ,F )–measurable if ϕ−1(F ) ⊂ E . If (Y,F ) = (R,B(R)), we say that ϕ isa real valued E –measurable function, and if (X, d) is a metric space and E is theBorel σ–algebra, we say that ϕ is a real valued Borel function.

The following simple but useful proposition shows that the measurability condi-tion needs to be checked only on a class of generators.

Proposition 2.1 (Measurability criterion) Let G ⊂ F be such that σ(G ) = F .Then ϕ : X → Y is (E ,F )–measurable if and only if ϕ−1(I) ∈ E for all I ∈ G(equivalently, iff ϕ−1(G ) ⊂ E ).

Proof. Consider the family D := I ∈ F : ϕ−1(I) ∈ E . By the above-mentionedproperties of ϕ−1 as an operator between P(Y ) and P(X), it follows that D is aσ–algebra including G . So, it coincides with σ(G ) = F .

A simple consequence of the previous proposition is the fact that any continuousfunction is a Borel function: more precisely, assume that ϕ : X → Y is continuousand that E = B(X) and F = B(Y ). Then, the σ–algebra

A ⊂ Y : ϕ−1(A) ∈ B(X)

contains the open subsets of Y (as, by the continuity of ϕ, ϕ−1(A) is open in X, andin particular Borel, whenever A is open in Y ), and then it contains the generatedσ–algebra, i.e. B(Y ).

The following proposition, whose proof is straightforward, shows that the classof measurable functions is stable under composition.

Proposition 2.2 Let (X,E ), (Y,F ), (Z,G ) be measurable spaces and let ϕ : X →Y and ψ : Y → Z be respectively (E ,F )–measurable and (F ,G )–measurable. Thenψ ϕ is (E ,G )–measurable.

It is often convenient to consider functions with values in the extended spaceR := R ∪ +∞,−∞, the so-called extended functions. We say that a mappingϕ : X → R is E –measurable if

ϕ−1(−∞), ϕ−1(+∞) ∈ E and ϕ−1(I) ∈ E , ∀I ∈ B(R). (2.2)

This condition can also be interpreted in terms of measurability between E and asuitable Borel σ–algebra in R, see Exercise 2.3. Analogously, when (X, d) is a metricspace and E is the Borel σ–algebra, we say that ϕ : X → R is Borel whenever theconditions above hold.

The following proposition shows that extended E –measurable functions are sta-ble under pointwise limits and countable supremum and infimum.

Page 29: Probability Book

Chapter 2 25

Proposition 2.3 Let (ϕn) be a sequence of extended E –measurable functions. Thenthe following functions are E –measurable:

supn∈N

ϕn(x), infn∈N

ϕn(x), lim supn→∞

ϕn(x), lim infn→∞

ϕn(x).

Proof. Let us prove that ϕ(x) := supn ϕn(x) is E –measurable (all other cases canbe deduced from this one, or directly proved by similar arguments). For any a ∈ Rwe have

ϕ−1([−∞, a]) =∞⋂

n=0

ϕ−1n ([−∞, a]) ∈ E .

In particular ϕ = −∞ ∈ E , so that ϕ−1((−∞, a)) ∈ E for all a ∈ R ∪ +∞; byletting a ↑ ∞ we get ϕ−1(R) ∈ E . As a consequence, the class

I ∈ B(R) : ϕ−1(I) ∈ E

is a σ–algebra containing the intervals of the form (−∞, a] with a ∈ R, and thereforecoincides with B(R). Eventually, ϕ = +∞ = X \ [ϕ−1(R) ∪ ϕ = −∞] belongsto E as well.

2.3 Partitions and simple functions

Let (X,E ) be a measurable space. A function ϕ : X → R is said to be simple if itsrange ϕ(X) is a finite set. The class of simple functions is obviously a real vectorspace, as the range of ϕ+ ψ is contained in

a+ b : a ∈ range(ϕ), b ∈ range(ψ) .

If ϕ(X) = a1, . . . , an, with ai 6= aj if i 6= j, setting Ai = ϕ−1(ai), i = 1, . . . , nwe can canonically represent ϕ as

ϕ(x) =n∑

k=1

ak1Ak, x ∈ X. (2.3)

Moreover, A1, . . . , An is a finite partition of X (i.e. Ai are mutually disjoint andtheir union is equal to X). However, a simple function ϕ has many representationsof the form

ϕ(x) =N∑

k=1

a′k1A′k, x ∈ X,

Page 30: Probability Book

26 Integration

where A′1, . . . , A

′N need not be mutually disjoint and a′k need not be in the range of

ϕ. For instance1[0,1) + 31[1,2] = 1[0,2] + 21[1,2].

It is easy to check that a simple function is E –measurable if, and only if, all levelsets Ak in (2.3) are E –measurable; in this case we shall also say that Ak is a finiteE –measurable partition of X.

Now we show that any nonnegative E –measurable function can be approximatedby simple functions; a variant of this result, with a different construction, is proposedin Exercise 2.8.

Proposition 2.4 Let ϕ be a nonnegative extended E –measurable function. For anyn ∈ N∗, define

ϕn(x) =

i− 1

2nifi− 1

2n≤ ϕ(x) <

i

2n, i = 1, 2, . . . , n2n;

n if ϕ(x) ≥ n.

(2.4)

Then ϕn are E –measurable, (ϕn) is nondecreasing and convergent to ϕ. If in additionϕ is bounded the convergence is uniform.

Proof. It is not difficult to check that (ϕn) is nondecreasing. Moreover, we have

0 ≤ ϕ(x)− ϕn(x) ≤ 1

2nif ϕ(x) < n, x ∈ X,

and0 ≤ ϕ(x)− ϕn(x) = ϕ(x)− n if ϕ(x) ≥ n, x ∈ X.

So, the conclusion easily follows.

2.4 Integral of a nonnegative E –measurable func-

tion

We are given a measure space (X,E , µ). We start to define the integral for simplenonnegative functions.

2.4.1 Integral of simple functions

Let ϕ be a nonnegative simple E –measurable function, and let us represent it as

ϕ(x) =N∑

k=1

ak1Ak, x ∈ X,

Page 31: Probability Book

Chapter 2 27

with N ∈ N, a1, . . . , aN ≥ 0 and A1, . . . , AN in E . Then we define (using thestandard convention in the theory of integration that 0 · ∞ = 0),

∫X

ϕdµ :=N∑

k=1

akµ(Ak).

It is easy to see that the definition does not depend on the choice of the representa-tion formula for ϕ. Indeed, let b1, . . . , bM be the range of ϕ and let ϕ =

∑M1 bj1Bj

,with Bj := ϕ−1(bj), be the canonical representation of ϕ. We have to prove that

N∑k=1

akµ(Ak) =M∑

j=1

bjµ(Bj). (2.5)

As the Bi’s are pairwise disjoint, (2.5) follows by adding the M identities

N∑k=1

akµ(Ak ∩Bj) = bjµ(Bj) j = 1, . . . ,M. (2.6)

In order to show (2.6) we fix j and consider, for I ⊂ 1, . . . , N, the sets

AI := x ∈ Bj : x ∈ Ai iff i ∈ I ,

so that AI are a E –measurable partition of Bj and x ∈ AI iff the set of i’s forwhich x ∈ Ai coincides with I. Then, using first the fact that AI ⊂ Ai if i ∈ I,and Ai ∩ AI = ∅ otherwise, and then the fact that

∑k∈I

ak = bj whenever AI 6= ∅

(because∑N

1 ak1Akcoincides with bj, the constant value of ϕ on Bj), we have

N∑k=1

akµ(Ak ∩Bj) =N∑

k=1

∑I

akµ(Ak ∩ AI) =∑

I

N∑k=1

akµ(Ak ∩ AI)

=∑

I

∑k∈I

akµ(AI) =∑

I

bjµ(AI) = bjµ(Bj).

Proposition 2.5 Let ϕ, ψ be simple nonnegative E –measurable functions on X andlet α, β ≥ 0. Then αϕ+ βψ is simple, E –measurable and we have∫

X

(αϕ+ βψ) dµ = α

∫X

ϕdµ+ β

∫X

ψ dµ

Page 32: Probability Book

28 Integration

Proof. Let

ϕ =n∑

k=1

ak1Ak, ψ =

m∑h=1

bh1Bh

with Ak, Bh finite E –measurable partitions of X. Then Ak ∩ Bh is a finiteE –measurable partition of X and αϕ+ βψ is constant (and equal to αak + βbh) onany element Ak ∩Bh of the partition. Therefore the level sets of αϕ+ βψ are finiteunions of elements of this partition and the E –measurability of αϕ+βψ follows (seealso Exercise 2.2). Then, writing

ϕ(x) =n∑

k=1

m∑h=1

ak1Ak∩Bh(x), ψ(x) =

n∑k=1

m∑h=1

bh1Ak∩Bh(x), x ∈ X,

we arrive at the conclusion.

2.4.2 The repartition function

Let ϕ : X → R be E –measurable. The repartition function F of ϕ, relative to µ, isdefined by

F (t) := µ(ϕ > t), t ∈ R.The function F is nonincreasing and satisfies

limt→−∞

F (t) = limn→−∞

F (n) = limn→∞

µ(ϕ > −n) = µ(ϕ > −∞),

and, if µ is finite,

limt→+∞

F (t) = limn→∞

F (n) = limn→∞

µ(ϕ > n) = µ(ϕ = +∞),

since

ϕ > −∞ =∞⋃

n=1

ϕ > −n, ϕ = +∞ =∞⋂

n=1

ϕ > n.

Other important properties of F are provided by the following result.

Proposition 2.6 Let ϕ : X → R be E –measurable and let F be its repartition func-tion.

(i) For any t0 ∈ R we have limt→t+0

F (t) = F (t0), that is, F is right continuous.

(ii) If µ is finite, for any t0 ∈ R we have limt→t−0

F (t) = µ(ϕ ≥ t0), that is, F has

left limits (1).

(1)In the literature F is called a cadlag function.

Page 33: Probability Book

Chapter 2 29

Proof. Let us prove (i). We have

limt→t+0

F (t) = limn→+∞

F (t0 +1

n) = lim

n→+∞µ(ϕ > t0 +

1

n) = µ(ϕ > t0) = F (t0),

since

ϕ > t0 =∞⋃

n=1

ϕ > t0 +

1

n

= lim

n→∞

ϕ > t0 +

1

n

.

So, (i) follows. We prove now (ii). We have

limt→t−0

F (t) = limn→+∞

F (t0 −1

n) = lim

n→+∞µ(ϕ > t0 −

1

n) = µ(ϕ ≥ t0),

since

ϕ ≥ t0 =∞⋂

n=1

ϕ > t0 −

1

n

= lim

n→∞

ϕ > t0 −

1

n

and (ii) follows.

From Proposition 2.6 it follows that, in the case when µ is finite, F is continuousat t0 iff µ(ϕ = t0) = 0.

Now we want to extend the integral operator to nonnegative E –measurable func-tions. Let ϕ be a nonnegative, simple and E –measurable function and let

ϕ(x) =n∑

k=0

ak1Ak, x ∈ X,

with n ∈ N∗, 0 = a0 < a1 < a2 < · · · < an <∞. Then the repartition function F ofϕ is given by

F (t) =

µ(A1) + µ(A2) + · · ·+ µ(An) = F (0) if 0 ≤ t < a1

µ(A2) + µ(A3) + · · ·+ µ(An) = F (a1) if a1 ≤ t < a2

· · · · · · · · · · · · · · ·µ(An) = F (an−1) if an−1 ≤ t < an

0 = F (an) if t ≥ an.

Consequently, we can write∫X

ϕ(x) dµ(x) =n∑

k=1

akµ(Ak) =n∑

k=1

ak(F (ak−1)− F (ak)) (2.7)

=n∑

k=1

akF (ak−1)−n∑

k=1

akF (ak) =n−1∑k=0

ak+1F (ak)−n−1∑k=0

akF (ak)

=n−1∑k=0

(ak+1 − ak)F (ak) =

∫ ∞

0

F (t) dt.

Page 34: Probability Book

30 Integration

Example 2.7 We set X = R, µ = λ,

A1 = [1, 2] ∪ [10, 11], A2 = [2, 3], A3 = [3, 4], A4 = [4, 6], A5 = [7, 10],a1 = 5, a2 = 8, a3 = 10, a4 = 7, a5 = 2

and ϕ :=∑5

k=1 ak1Akto be the simple function shown in Fig. 2.1. It is easy to

verify that F has the graph shown in the right picture in Fig. 2.1. The color schemeused for the areas below the two graphs in 2.1 proves graphically that the areas areidentical.

ϕ F

1

1

2

3

4

5

6

7

9

8

10

4 6 7 8 1091 2 3 5

Figure 2.1: a simple function ϕ, and its repartition F

Now, we want to define the integral of any nonnegative extended E –measurablefunction by generalizing formula (2.7). For this, we need first to define the integralof any nonnegative nonincreasing function in (0,+∞).

2.4.3 The archimedean integral

We generalize here the (inner) Riemann integral to any nonincreasing function f :[0,+∞) → [0,+∞]. The strategy is to consider the supremum of the areas ofpiecewise constant minorants of f .

Let Σ be the set of all finite decompositions σ = t1, . . . , tN of [0,+∞], whereN ∈ N∗ and 0 = t0 ≤ t1 < · · · < tN < +∞.

Let now f : [0,+∞) → [0,+∞] be a nonincreasing function. For any σ =t0, t1, . . . , tN ∈ Σ we consider the partial sum

If (σ) :=N−1∑k=0

f(tk+1)(tk+1 − tk).

Page 35: Probability Book

Chapter 2 31

We define ∫ ∞

0

f(t) dt := supIf (σ) : σ ∈ Σ. (2.8)

The integral∫∞

0f(t) dt is called the archimedean integral of f . It enjoys the usual

properties of the Riemann integral (see Exercise 2.5) but, among these, we willneed only the monotonicity with respect to f in the sequel. For our purposes themost relevant property of the Archimedean integral is instead the continuity undermonotonically nondecreasing sequences, a property not satisfied by the Riemannintegral.

Proposition 2.8 Let fn ↑ f , with fn : [0,+∞) → [0,+∞] nonincreasing. Then∫ ∞

0

fn(t) dt ↑∫ ∞

0

f(t) dt.

Proof. It is obvious that ∫ ∞

0

fn(t) dt ≤∫ ∞

0

f(t) dt.

To prove the converse inequality, fix L <∫∞

0f(t) dt. Then there exists σ =

t1, . . . , tN ∈ Σ such that

N−1∑k=0

f(tk)(tk+1 − tk) > L.

Since for n large enough∫ ∞

0

fn(t) dt ≥N−1∑k=0

fn(tk+1)(tk+1 − tk) > L,

letting n→∞ we find that

supn∈N

∫ ∞

0

fn(t) dt ≥ L.

This implies

supn∈N

∫ ∞

0

fn(t) dt ≥∫ ∞

0

f(t) dt

and the conclusion follows.

Page 36: Probability Book

32 Integration

2.4.4 Integral of a nonnegative E –measurable function

We are given a measure space (X,E , µ) and an extended nonnegative E –measurablefunction ϕ. Having the identity (2.7) in mind, we define∫

X

ϕdµ : =

∫ ∞

0

µ(ϕ > t) dt. (2.9)

Notice that the function t 7→ µ(ϕ > t) ∈ [0,+∞] is nonnegative and nonincreasingin [0,+∞), so that its archimedean integral is well defined and (2.9) extends, bythe remarks made at the end of Section 2.4.2, the integral elementarily defined onsimple functions. If the integral is finite we say that ϕ is µ–integrable.

It follows directly from the analogous properties of the archimedean integral thatthe integral so defined is monotone, i.e.

ϕ ≥ ψ =⇒∫

X

ϕdµ ≥∫

X

ψ dµ.

Indeed, ϕ ≥ ψ implies ϕ > t ⊃ ψ > t andmu(ϕ > t) ≥ µ(ψ > t) for all t > 0. Furthermore, the integral is invariantunder modifications of ϕ in µ–negligible sets, that is

ϕ = ψ µ–a.e. in X =⇒∫

X

ϕdµ =

∫X

ψ dµ.

To show this fact it suffices to notice that ϕ = ψ µ–a.e. in X implies that thesets ϕ > t and ψ > t differ in a µ–negligible set for all t > 0, thereforeµ(ϕ > t) = µ(ψ > t) for all t > 0.

Let us prove the following basic Markov inequality.

Proposition 2.9 For any a ∈ (0,∞) we have

µ(ϕ ≥ a) ≤ 1

a

∫X

ϕ(x) dµ(x). (2.10)

Proof. For any a ∈ (0,+∞) we have, recalling that ϕ ≥ a ⊂ ϕ > t for anyt ∈ (0, a), so that µ(ϕ > t) ≥ µ(ϕ ≥ a) for all t ∈ (0, a). The monotonicity ofthe archimedean integral gives∫

X

ϕ(x) dµ(x) =

∫ ∞

0

µ(ϕ > t) dt ≥∫ ∞

0

1(0,a)(t)µ(ϕ > t) dt ≥ aµ(ϕ ≥ a).

The Markov inequality has some important consequences.

Page 37: Probability Book

Chapter 2 33

Proposition 2.10 Let ϕ : X → [0,+∞] be an extended E –measurable function.

(i) If ϕ is µ–integrable then the set ϕ = +∞ has µ–measure 0, that is, ϕ isfinite µ–a.e. in X.

(ii) The integral of ϕ vanishes iff ϕ is equal to 0 µ–a.e. in X.

Proof. (i) Since∫

Xϕdµ <∞ we deduce from (2.10) that

lima→+∞

µ(ϕ > a) = 0.

Since

ϕ = ∞ =∞⋂

n=1

ϕ > n,

we have thatµ(ϕ = ∞) = lim

n→+∞µ(ϕ > n) = 0.

(ii) If∫

Xϕdµ = 0 we deduce from (2.10) that µ(ϕ > a) = 0 for all a > 0.

Since

µ(ϕ > 0) = limn→+∞

µ(ϕ > 1

n) = 0,

the conclusion follows. The other implication follows by the invariance of the inte-gral.

Proposition 2.11 (Monotone convergence) Let (ϕn) be a nondecreasing sequenceof extended E –measurable functions and set ϕ(x) := lim

n→∞ϕn(x) for any x ∈ X. Then∫ ∞

0

ϕn(x) dµ(x) ↑∫ ∞

0

ϕ(x) dµ(x).

Proof. It suffices to notice that µ(ϕn > t) ↑ µ(ϕ > t) for all t > 0, and then toapply Proposition 2.8.

Now, by Proposition 2.4 we obtain the following important approximation prop-erty.

Proposition 2.12 Let ϕ : X → [0,+∞] be an extended E –measurable function.Then there exist simple E –measurable functions ϕn : X → [0,+∞) such that ϕn ↑ ϕ,so that ∫ ∞

0

ϕn(x) dµ(x) ↑∫ ∞

0

ϕ(x) dµ(x).

Page 38: Probability Book

34 Integration

Remark 2.13 (Construction of Lebesgue and Riemann integrals) Proposition 2.12could be used as an alternative, and equivalent, definition of the Lebesgue integral:we can just define it as the supremum of the integral of minorant simple functions.This alternative definition is closer to the definitions of Archimedean integrals andof inner Riemann integral: the only (fundamental) difference is due to the choiceof the family of “simple” functions. In all cases simple functions take finitely manyvalues, but within the Lebesgue theory their level sets belong to a σ–algebra, and sothe family of simple function is much richer, in comparison with the other theories.

We can now prove the additivity property of the integral.

Proposition 2.14 Let ϕ, ψ : X → [0, ∞] be E–measurable functions. Then

∫_X (ϕ + ψ) dµ = ∫_X ϕ dµ + ∫_X ψ dµ.

Proof. Let ϕn, ψn be simple functions with ϕn ↑ ϕ and ψn ↑ ψ. Then, the additivity of the integral on simple functions gives

∫_X (ϕn + ψn) dµ = ∫_X ϕn dµ + ∫_X ψn dµ.

We conclude passing to the limit as n → ∞ and using the monotone convergence theorem.

The following Fatou’s lemma, providing a semicontinuity property of the integral, is of basic importance.

Lemma 2.15 (Fatou) Let ϕn : X → [0, +∞] be extended E–measurable functions. Then we have

∫_X lim inf_{n→∞} ϕn(x) dµ(x) ≤ lim inf_{n→∞} ∫_X ϕn(x) dµ(x).    (2.11)

Proof. Setting ϕ(x) := lim inf_n ϕn(x) and ψn(x) := inf_{m≥n} ϕm(x), we have that ψn(x) ↑ ϕ(x). Consequently, by the monotone convergence theorem,

∫_X ϕ(x) dµ(x) = lim_{n→∞} ∫_X ψn(x) dµ(x).

On the other hand,

∫_X ψn(x) dµ(x) ≤ ∫_X ϕn(x) dµ(x),


so that

∫_X ϕ(x) dµ(x) ≤ lim inf_{n→∞} ∫_X ϕn(x) dµ(x).

In particular, if the ϕn are pointwise converging to ϕ, we have

∫_X ϕ(x) dµ(x) ≤ lim inf_{n→∞} ∫_X ϕn(x) dµ(x).

2.5 Integral of functions with a variable sign

Let ϕ : X → R be an extended E–measurable function. We say that ϕ is µ–integrable if both the positive part ϕ⁺(x) := max{ϕ(x), 0} and the negative part ϕ⁻(x) := max{−ϕ(x), 0} of ϕ are µ–integrable in X. As ϕ = ϕ⁺ − ϕ⁻, in this case it is natural to define

∫_X ϕ(x) dµ(x) := ∫_X ϕ⁺(x) dµ(x) − ∫_X ϕ⁻(x) dµ(x).

As |ϕ| = ϕ⁺ + ϕ⁻, the additivity properties of the integral give that

ϕ is µ–integrable if and only if ∫_X |ϕ| dµ < ∞.

Let ϕ : X → R and let A ∈ E be such that 1_A ϕ is µ–integrable. We define also

∫_A ϕ(x) dµ(x) := ∫_X 1_A(x) ϕ(x) dµ(x).

In the following proposition we summarize the main properties of the integral.

Proposition 2.16 Let ϕ, ψ : X → R be µ–integrable functions.

(i) For any α, β ∈ R we have that αϕ + βψ is µ–integrable and

∫_X (αϕ + βψ) dµ = α ∫_X ϕ dµ + β ∫_X ψ dµ.

(ii) If ϕ ≤ ψ in X we have

∫_X ϕ dµ ≤ ∫_X ψ dµ.

(iii) |∫_X ϕ dµ| ≤ ∫_X |ϕ| dµ.


Proof. (i) Since (−ϕ)⁺ = ϕ⁻ and (−ϕ)⁻ = ϕ⁺, we have ∫_X −ϕ dµ = −∫_X ϕ dµ. So, possibly replacing ϕ by −ϕ and ψ by −ψ, we can assume that α ≥ 0 and β ≥ 0. We have

(αϕ + βψ)⁺ + αϕ⁻ + βψ⁻ = (αϕ + βψ)⁻ + αϕ⁺ + βψ⁺,

so that we can integrate both sides and use the additivity on nonnegative functions to obtain

∫_X (αϕ + βψ)⁺ dµ + α ∫_X ϕ⁻ dµ + β ∫_X ψ⁻ dµ = ∫_X (αϕ + βψ)⁻ dµ + α ∫_X ϕ⁺ dµ + β ∫_X ψ⁺ dµ.

Rearranging terms we obtain (i).

(ii) It follows by the monotonicity of the integral on nonnegative functions and from the inequalities ϕ⁺ ≤ ψ⁺ and ϕ⁻ ≥ ψ⁻.

(iii) Since −|ϕ| ≤ ϕ ≤ |ϕ|, the conclusion follows from (ii).

Another consequence of the additivity property of the integral is the additivity of the real-valued map

A ∈ E ↦ ∫_A ϕ dµ

whenever ϕ is µ–integrable. We will see in the next section that, as a consequence of the dominated convergence theorem, this map is even σ–additive.

2.6 Convergence of integrals

In this section we study the problem of commuting limit and integral; we have already seen that this can be done in some particular cases, as when the functions are nonnegative and monotonically converge to their supremum, and now we investigate some more general cases, relevant for the applications.

Proposition 2.17 (Lebesgue dominated convergence theorem) Let (ϕn) be a sequence of E–measurable functions pointwise converging to ϕ. Assume that there exists a nonnegative µ–integrable function ψ such that

|ϕn(x)| ≤ ψ(x)  ∀x ∈ X, n ∈ N.

Then the functions ϕn and the function ϕ are µ–integrable and

lim_{n→∞} ∫_X ϕn dµ = ∫_X ϕ dµ.


Proof. Passing to the limit as n → ∞ we obtain that ϕ is E–measurable and |ϕ| ≤ ψ in X. In particular ϕ is µ–integrable. Since ϕn + ψ is nonnegative, by the Fatou lemma we have

∫_X (ϕ + ψ) dµ ≤ lim inf_{n→∞} ∫_X (ϕn + ψ) dµ.

Consequently,

∫_X ϕ dµ ≤ lim inf_{n→∞} ∫_X ϕn dµ.    (2.12)

In a similar way we have

∫_X (ψ − ϕ) dµ ≤ lim inf_{n→∞} ∫_X (ψ − ϕn) dµ.

Consequently,

∫_X ϕ dµ ≥ lim sup_{n→∞} ∫_X ϕn dµ.    (2.13)

Now the conclusion follows by (2.12) and (2.13).

An important consequence of the dominated convergence theorem is the absolute continuity property of the integral of µ–integrable functions ϕ:

for any ε > 0 there exists δ > 0 such that µ(A) < δ =⇒ ∫_A |ϕ| dµ < ε.    (2.14)

The proof of this property is sketched in Exercise 2.9.

2.6.1 Uniform integrability and Vitali convergence theorem

In this subsection we assume for simplicity that the measure µ is finite. A family {ϕi}_{i∈I} of R–valued µ–integrable functions is said to be µ–uniformly integrable if

lim_{µ(A)→0} ∫_A |ϕi(x)| dµ(x) = 0, uniformly in i ∈ I.

This means that for any ε > 0 there exists δ > 0 such that

µ(A) < δ  =⇒  ∫_A |ϕi(x)| dµ(x) ≤ ε  ∀ i ∈ I.

This property obviously extends from single functions to families of functions the absolute continuity property of the integral.

Notice that any family {ϕi}_{i∈I} dominated by a single µ–integrable function ϕ (i.e. such that |ϕi| ≤ |ϕ| for any i ∈ I) is obviously µ–uniformly integrable. Taking this remark into account, we are going to prove the following extension of the dominated convergence theorem, known as Vitali Theorem.


Theorem 2.18 (Vitali) Assume that µ is a finite measure and let (ϕn) be a µ–uniformly integrable sequence of functions with sup_n ∫_X |ϕn| dµ < ∞ and pointwise converging to ϕ. Then ϕ is µ–integrable and

lim_{n→∞} ∫_X ϕn dµ = ∫_X ϕ dµ.

To prove the Vitali theorem we need the following Egorov Lemma.

Lemma 2.19 (Egorov) Assume that µ is a finite measure and let (ϕn) be a sequence of E–measurable functions pointwise converging to a function ϕ. Then for any δ > 0 there exists a set A_δ ∈ E such that µ(A_δ) < δ and ϕn → ϕ uniformly in X \ A_δ.

Proof. For any integer m ≥ 1 we write X as the increasing union of the sets B_{n,m}, where

B_{n,m} := {x ∈ X : |ϕi(x) − ϕ(x)| < 1/m  ∀i ≥ n}.

Since µ is finite there exists n(m) such that µ(B_{n(m),m}) > µ(X) − 2^{−m} δ. We denote by A_δ the union of the sets X \ B_{n(m),m}, so that

µ(A_δ) ≤ ∑_{m=1}^∞ µ(X \ B_{n(m),m}) < ∑_{m=1}^∞ δ/2^m = δ.

Now, given any ε > 0, we can choose m > 1/ε to obtain that

|ϕn(x) − ϕ(x)| ≤ 1/m < ε  for all x ∈ B_{n(m),m}, n ≥ n(m).

As X \ A_δ ⊂ B_{n(m),m}, this proves the uniform convergence of ϕn to ϕ on X \ A_δ.

Proof of the Vitali Theorem. Since the integrals of |ϕn| are uniformly bounded, Fatou’s Lemma gives that ϕ is µ–integrable. To prove the convergence of the integrals, fix ε > 0 and find δ > 0 such that ∫_A |ϕn| dµ < ε whenever µ(A) < δ. Again, Fatou’s Lemma yields that ∫_A |ϕ| dµ ≤ ε whenever µ(A) < δ.

Assume now that A is given by the Egorov Lemma, so that ϕn → ϕ uniformly on X \ A. Then, writing

∫_X (ϕ − ϕn) dµ = ∫_{X\A} (ϕ − ϕn) dµ + ∫_A (ϕ − ϕn) dµ

and using the fact that lim_n sup_{X\A} |ϕn − ϕ| = 0, we obtain

|∫_X (ϕ − ϕn) dµ| ≤ 3ε

for n large enough. The statement follows letting ε ↓ 0.
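A concrete instance of the Egorov phenomenon used above (our illustration, not from the text): on [0, 1] with the Lebesgue measure, ϕn(x) = xⁿ converges pointwise to 0 on [0, 1) but not uniformly; removing a set of measure δ restores uniform convergence.

    # phi_n(x) = x^n on [0, 1): pointwise limit 0, but sup over [0, 1) is 1 for all n.
    # Removing A_delta = (1 - delta, 1), of measure delta, restores uniformity:
    delta = 0.01
    for n in [10, 100, 1000]:
        sup_off_A = (1 - delta) ** n   # sup over [0, 1 - delta]: x^n is increasing
        print(n, sup_off_A)            # -> 0, i.e. uniform convergence off A_delta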


2.7 A characterization of Riemann integrable functions

The integrals ∫_J f dλ, with J = [a, b] a closed interval of the real line and λ equal to the Lebesgue measure in R, are traditionally denoted with the classical notation ∫_a^b f dx or with ∫_J f dx. This is due to the fact that Riemann’s and Lebesgue’s integrals coincide on the class of Riemann integrable functions.

We denote by I^*(f) and I_*(f) the upper and lower Riemann integral of f respectively, the latter defined by taking the supremum of the sums ∑_{i=1}^{n−1} a_i(t_{i+1} − t_i) in correspondence of all step functions

h = ∑_{i=1}^{n−1} a_i 1_{[t_i, t_{i+1})} ≤ f,  a = t_1 < · · · < t_n = b,    (2.15)

and the former considering the infimum in correspondence of all step functions h ≥ f. We denote by I(f) the Riemann integral, equal to the upper and lower integral whenever the two integrals coincide.

As the Lebesgue integral of the function h in (2.15) coincides with ∑_{i=1}^{n−1} a_i(t_{i+1} − t_i), we have

∫_J g dλ = I(g)  for any step function g : J → R.

Now, if f : J → R is continuous, we can choose a uniformly bounded sequence of step functions g_h converging pointwise to f (for instance splitting J into equal intervals [x_i, x_{i+1}) and setting a_i = min_{[x_i, x_{i+1}]} f) whose Riemann integrals converge to I(f). Therefore, passing to the limit in the identity above with g = g_h, and using the dominated convergence theorem, we get

∫_J f dλ = I(f)  for any continuous function f : J → R.

We are going to generalize this fact, providing a full characterization, within the Lebesgue theory, of Riemann integrable functions.

Theorem 2.20 Let f : J = [a, b] → R be a bounded function. Then f is Riemann integrable if and only if the set of its discontinuity points is Lebesgue negligible. If this is the case, we have that f is B(J)_λ–measurable and

∫_J f dλ = I(f).    (2.16)


Proof. Let

f_*(x) := inf { lim inf_{h→∞} f(x_h) : x_h → x },  f^*(x) := sup { lim sup_{h→∞} f(x_h) : x_h → x }.    (2.17)

It is not hard to show (see Exercise 2.6 and Exercise 2.7) that f_* is lower semicontinuous and f^* is upper semicontinuous, therefore both f_* and f^* are Borel functions.

We are going to show that I_*(f) = ∫_J f_* dλ and I^*(f) = ∫_J f^* dλ. These two equalities yield the conclusion, as f is continuous at λ–a.e. point in J iff f^* − f_* = 0 λ–a.e. in J, and this holds iff (because f^* − f_* ≥ 0)

∫_J (f^* − f_*) dλ = 0.

Furthermore, if the set of discontinuity points of f is λ–negligible, the Borel function f^* differs from f only in a λ–negligible set, thus f is B(J)_λ–measurable (because {f > t} differs from the Borel set {f^* > t} only in a λ–negligible set, see also Exercise 2.4) and its integral coincides with ∫_J f^* dλ = ∫_J f_* dλ; this leads to (2.16).

Since I^*(f) = −I_*(−f) and f^* = −(−f)_*, we need only to prove the first of the two equalities, i.e.

∫_J f_* dλ = I_*(f).    (2.18)

In order to check the inequality ≤ in (2.18) we apply Exercise 2.11, finding a sequence of continuous functions g_h ↑ f_* ≤ f and obtaining, thanks to the monotone convergence theorem,

∫_J f_* dλ = sup_{h∈N} ∫_J g_h dλ = sup_{h∈N} I(g_h) = sup_{h∈N} I_*(g_h) ≤ I_*(f).

In order to prove ≥ in (2.18) we fix a step function h ≤ f in [a, b) as in (2.15) and we notice that f ≥ a_i = h in (t_i, t_{i+1}) implies f_* ≥ a_i in the same interval. Hence f_* ≥ h in J \ {t_1, . . . , t_n} and, being the set of the t_i’s Lebesgue negligible, we have

∫_J f_* dλ ≥ ∫_J h dλ = I(h).

As h is arbitrary the inequality is achieved.


EXERCISES

2.1 Show that any of the conditions listed below is equivalent to the E–measurability of ϕ : X → R.

(i) ϕ⁻¹((−∞, t]) ∈ E for all t ∈ R;

(ii) ϕ⁻¹((−∞, t)) ∈ E for all t ∈ R;

(iii) ϕ⁻¹([a, b]) ∈ E for all a, b ∈ R;

(iv) ϕ⁻¹([a, b)) ∈ E for all a, b ∈ R;

(v) ϕ⁻¹((a, b)) ∈ E for all a, b ∈ R.

2.2 Let ϕ, ψ : X → R be E–measurable. Show that ϕ + ψ and ϕψ are E–measurable. Hint: prove that

{ϕ + ψ < t} = ∪_{r∈Q} ({ϕ < r} ∩ {ψ < t − r})

and

{ϕ² > a} = {ϕ > √a} ∪ {ϕ < −√a},  a ≥ 0.

2.3 Let us define a distance d in R̄ by

d(x, y) := |arctan x − arctan y|,

where, by convention, arctan(±∞) = ±π/2.

(i) Show that (R̄, d) is a compact metric space (the so-called compactification of R) and that A ⊂ R is open relative to the Euclidean distance if, and only if, it is open relative to d;

(ii) use (i) to show that, given a measurable space (X, E), f : X → R̄ is E–measurable according to (2.2) if and only if it is measurable between E and the Borel σ–algebra of (R̄, d).

2.4 Let (X, E, µ) be a measure space and let E_µ be the completion of E induced by µ. Show that f : X → R is E_µ–measurable iff there exists an E–measurable function g such that {f ≠ g} is contained in a µ–negligible set of E.

2.5 Let us endow Σ with the usual partial ordering: σ = {t_1, . . . , t_N} ≤ ζ = {s_1, . . . , s_M} if and only if σ ⊂ ζ. Show that σ ↦ I_f(σ) is nondecreasing. Use this fact to show that f ↦ ∫_0^∞ f(t) dt is additive.

2.6 Let f : R → R be a function. Show that the functions f_*, f^* defined in (2.17) are respectively lower semicontinuous and upper semicontinuous.

2.7 Let f : R → R be a bounded function. Using Exercise 2.6 show that {f_* ≤ t} and {f^* ≥ t} are closed for all t ∈ R. In particular deduce that

Σ = {x ∈ R : f is continuous at x}

belongs to B(R).

2.8 Let (a_n) ⊂ (0, ∞) with

∑_{i=0}^∞ a_i = ∞,  lim_{i→∞} a_i = 0.

Show that for any E–measurable ϕ : X → [0, +∞] there exist A_i ∈ E such that ϕ = ∑_i a_i 1_{A_i}. Hint: set ϕ_0 := ϕ, A_0 := {ϕ_0 ≥ a_0} and ϕ_1 := ϕ_0 − a_0 1_{A_0} ≥ 0. Then, set A_1 := {ϕ_1 ≥ a_1} and ϕ_2 := ϕ_1 − a_1 1_{A_1}, and so on.


2.9 Let ϕ : X → R be µ–integrable. Show that property (2.14) holds. Hint: assume by contradiction its failure for some ε > 0 and find A_i with µ(A_i) < 2^{−i} and ∫_{A_i} |ϕ| dµ ≥ ε. Then, notice that B := lim sup_i A_i is µ–negligible, consider

B_n := ∪_{i≥n} A_i \ B ↓ ∅,

and apply the dominated convergence theorem.

2.10

• Prove that if ϕ_n → ϕ in L¹(X, E, µ), then (ϕ_n) is µ–uniformly integrable.

• Find a space (X, E, µ) and a sequence (ϕ_n) that is µ–uniformly integrable, for which there is no g ∈ L¹(X, E, µ) satisfying |ϕ_n| ≤ g for all n ∈ N.

2.11 Let (X, d) be a metric space and let g : X → [0, ∞] be lower semicontinuous and not identically equal to ∞. For any λ > 0 define

g_λ(x) := inf_{y∈X} {g(y) + λ d(x, y)}.

Check that:

(a) |g_λ(x) − g_λ(x′)| ≤ λ d(x, x′) for all x, x′ ∈ X;

(b) g_λ ↑ g as λ ↑ ∞.

2.12 Let f : R² → R satisfy the following two properties:

(i) x ↦ f(x, y) is continuous in R for all y ∈ R;

(ii) y ↦ f(x, y) is continuous in R for all x ∈ R.

Show that f is a Borel function. Hint: first reduce to the case when f is bounded. Then, for ε > 0 consider the functions

f_ε(x, y) := (1/2ε) ∫_{x−ε}^{x+ε} f(x′, y) dx′,

proving that the f_ε are continuous and f_ε → f as ε ↓ 0.


Chapter 3

Spaces of integrable functions

This chapter is devoted to the properties of the so-called Lp spaces, the spaces of functions whose p-th power is integrable. Throughout the chapter a measure space (X, E, µ) will be fixed.

3.1 Spaces ℒp(X, E, µ) and Lp(X, E, µ)

Let Y be a real vector space. We recall that a norm ‖·‖ on Y is a nonnegative real-valued map defined on Y such that:

(i) ‖y‖ = 0 if and only if y = 0;

(ii) ‖αy‖ = |α| ‖y‖ for all α ∈ R and y ∈ Y;

(iii) ‖y₁ + y₂‖ ≤ ‖y₁‖ + ‖y₂‖ for all y₁, y₂ ∈ Y.

The space Y, endowed with the norm ‖·‖, is called a normed space. Y is also a metric space when endowed with the distance d(y₁, y₂) = ‖y₁ − y₂‖ (the triangle inequality is a direct consequence of (iii)). If (Y, d) is a complete metric space, we say that (Y, ‖·‖) is a Banach space.

We denote by ℒ¹(X, E, µ) the real vector space of all µ–integrable functions on (X, E). We define

‖ϕ‖₁ := ∫_X |ϕ(x)| dµ(x),  ϕ ∈ ℒ¹(X, E, µ).

We have clearly

‖αϕ‖₁ = |α| ‖ϕ‖₁  ∀α ∈ R, ϕ ∈ ℒ¹(X, E, µ),



and

‖ϕ + ψ‖₁ ≤ ‖ϕ‖₁ + ‖ψ‖₁  ∀ϕ, ψ ∈ ℒ¹(X, E, µ),

so that conditions (ii) and (iii) in the definition of the norm are fulfilled. However, ‖·‖₁ is not a norm in general, since ‖ϕ‖₁ = 0 if and only if ϕ = 0 µ–a.e. in X, so (i) fails.

Then, we can consider the following equivalence relation R on ℒ¹(X, E, µ),

ϕ ∼ ψ ⇐⇒ ϕ = ψ µ–a.e. in X,    (3.1)

and denote by L¹(X, E, µ) the quotient space of ℒ¹(X, E, µ) with respect to R. In other words, L¹(X, E, µ) is the quotient vector space of ℒ¹(X, E, µ) with respect to the vector subspace made by functions vanishing µ–a.e. in X.

For any ϕ ∈ ℒ¹(X, E, µ) we denote by ϕ̃ the equivalence class determined by ϕ, and we set

ϕ̃ + ψ̃ := (ϕ + ψ)̃,  α ϕ̃ := (αϕ)̃.    (3.2)

It is easily seen that these definitions do not depend on the choice of representatives in the equivalence classes, and endow L¹(X, E, µ) with the structure of a real vector space, whose origin is the equivalence class of functions vanishing µ–a.e. in X. Furthermore, setting

‖ϕ̃‖₁ := ‖ϕ‖₁,  ϕ̃ ∈ L¹(X, E, µ),

it is also easy to see that this definition does not depend on the particular element ϕ chosen in ϕ̃, and that (ii), (iii) still hold. Now, if ‖ϕ̃‖₁ = 0 we have that the integral of |ϕ| is zero, and therefore ϕ̃ = 0. Therefore L¹(X, E, µ), endowed with the norm ‖·‖₁, is a normed space.

To simplify the notation, ϕ̃ is typically identified with ϕ whenever the formula does not depend on the choice of the function in the equivalence class: for instance, quantities as µ(ϕ > t) or ∫_X ϕ dµ have this independence, as well as most statements and results in Measure Theory and Probability, so this slight abuse of notation is justified. It should be noted, however, that formulas like ϕ(x) = 0, for some fixed x ∈ X, do not make sense in L¹(X, E, µ), since they depend on the representative chosen (unless µ({x}) > 0).

More generally, if an exponent p ∈ (0, ∞) is given, we can apply a similar construction to the space

ℒp(X, E, µ) := {ϕ : ϕ is E–measurable and ∫_X |ϕ|ᵖ dµ < ∞}.

Since |x + y|ᵖ ≤ |x|ᵖ + |y|ᵖ if p ≤ 1, and |x + y|ᵖ ≤ 2^{p−1}(|x|ᵖ + |y|ᵖ) if p ≥ 1, it turns out that ℒp(X, E, µ) is a vector space, and we shall denote by Lp(X, E, µ) the


quotient vector space, with respect to the equivalence relation (3.1). Still we can define the sum and the product by a real number as in (3.2), to obtain that Lp(X, E, µ) has the structure of a real vector space. The case p = 2 is particularly relevant for the theory, as we will see.

Sometimes we will omit either E or µ, writing Lp(X, µ) or even Lp(X). This typically happens when (X, d) is a metric space and E is the Borel σ–algebra, or when X ⊂ R and µ is the Lebesgue measure.

3.2 The Lp norm

For any ϕ ∈ Lp(X, E, µ) we define

‖ϕ‖_p := (∫_X |ϕ|ᵖ dµ)^{1/p}.

We are going to show that ‖·‖_p is a norm for any p ∈ [1, +∞). Notice that we already checked this fact when p = 1, and that the homogeneity condition (ii) trivially holds, whatever the value of p is. Furthermore, condition (i) holds precisely because Lp(X, E, µ) consists, strictly speaking, of equivalence classes induced by (3.1). So, the only condition that needs to be checked is the subadditivity condition (iii), and in the sequel we can assume p > 1.

The concept of Legendre transform will be useful. Let f : R → R be a function; we define its Legendre transform f* : R → R ∪ {+∞} by

f*(y) := sup_{x∈R} {xy − f(x)},  y ∈ R.

Then the following inequality clearly holds:

xy ≤ f(x) + f*(y)  ∀x, y ∈ R,    (3.3)

and actually f* could be equivalently defined as the smallest function with this property.

Example 3.1 Let p > 1 and let

f(x) = xᵖ/p if x ≥ 0,  f(x) = 0 if x < 0.


Then, by an elementary computation, we find that

f*(y) = y^q/q if y ≥ 0,  f*(y) = +∞ if y < 0,

where q = p/(p − 1) (equivalently, 1/p + 1/q = 1). Consequently, the following inequality, known as Young inequality, holds:

xy ≤ xᵖ/p + y^q/q,  x, y ≥ 0.    (3.4)
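A quick numerical sanity check of (3.4) (ours, not from the text): sample random pairs x, y ≥ 0, verify the bound, and observe the equality case y = x^{p−1}.

    import random

    random.seed(1)
    p = 3.0
    q = p / (p - 1)  # dual exponent: 1/p + 1/q = 1

    for _ in range(5):
        x, y = random.uniform(0, 10), random.uniform(0, 10)
        lhs, rhs = x * y, x**p / p + y**q / q
        assert lhs <= rhs + 1e-12
        print("xy =", lhs, "<=", rhs)

    # equality case: y = x^(p-1) makes both sides equal to x^p
    x = 2.0
    y = x ** (p - 1)
    print(x * y, x**p / p + y**q / q)  # the two values coincide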

Motivated by the previous example, we say that p and q are dual (or conjugate) exponents if 1/p + 1/q = 1, i.e. q = p/(p − 1). The duality relation is symmetric in (1, +∞), and obviously 2 is self-dual.

Example 3.2 Let f(x) = eˣ, x ∈ R. Then

f*(y) := sup_{x∈R} {xy − eˣ} = +∞ if y < 0,  0 if y = 0,  y log y − y if y > 0.

Consequently, the following inequality holds:

xy ≤ eˣ + y log y − y,  x, y ≥ 0.    (3.5)

3.2.1 Hölder and Minkowski inequalities

Proposition 3.3 (Hölder inequality) Assume that ϕ ∈ Lp(X, E, µ) and ψ ∈ Lq(X, E, µ), with p and q dual exponents in (1, +∞). Then ϕψ ∈ L¹(X, E, µ) and

‖ϕψ‖₁ ≤ ‖ϕ‖_p ‖ψ‖_q.    (3.6)

Proof. If either ‖ϕ‖_p = 0 or ‖ψ‖_q = 0 then one of the two functions vanishes µ–a.e. in X, hence ϕψ vanishes µ–a.e. and the inequality is trivial. If both ‖ϕ‖_p and ‖ψ‖_q are strictly positive, by the 1–homogeneity of both sides in (3.6) with respect to ϕ and ψ, we can assume with no loss of generality that the two norms are equal to 1.

Now we apply (3.4) to |ϕ(x)| and |ψ(x)| to obtain

|ϕ(x)ψ(x)| ≤ |ϕ(x)|ᵖ/p + |ψ(x)|^q/q.


Integrating over X with respect to µ yields

∫_X |ϕ(x)ψ(x)| dµ(x) ≤ 1/p + 1/q = 1.
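For intuition, here is a small numerical illustration of (3.6) (our own sketch, not from the text) for the counting measure on a finite set, where the integrals reduce to finite sums.

    import random

    random.seed(2)
    p = 2.5
    q = p / (p - 1)  # dual exponent

    phi = [random.uniform(-1, 1) for _ in range(100)]
    psi = [random.uniform(-1, 1) for _ in range(100)]

    lhs = sum(abs(a * b) for a, b in zip(phi, psi))        # ||phi psi||_1
    norm_p = sum(abs(a) ** p for a in phi) ** (1 / p)      # ||phi||_p
    norm_q = sum(abs(b) ** q for b in psi) ** (1 / q)      # ||psi||_q

    print(lhs, "<=", norm_p * norm_q)
    assert lhs <= norm_p * norm_q + 1e-12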

Proposition 3.4 (Minkowski inequality) Assume that p ∈ [1, +∞) and ϕ, ψ ∈ Lp(X, E, µ). Then ϕ + ψ ∈ Lp(X, E, µ) and

‖ϕ + ψ‖_p ≤ ‖ϕ‖_p + ‖ψ‖_p.    (3.7)

Proof. The case p = 1 is obvious. Assume that p ∈ (1, ∞). Then we have

∫_X |ϕ + ψ|ᵖ dµ ≤ ∫_X |ϕ + ψ|^{p−1} |ϕ| dµ + ∫_X |ϕ + ψ|^{p−1} |ψ| dµ.

Since |ϕ + ψ|^{p−1} ∈ Lq(X, E, µ), where q = p/(p − 1), using the Hölder inequality we find that

∫_X |ϕ + ψ|ᵖ dµ ≤ (∫_X |ϕ + ψ|ᵖ dµ)^{1/q} (‖ϕ‖_p + ‖ψ‖_p),

and the conclusion follows.

By the previous proposition it follows that ‖ · ‖p is a norm on Lp(X,E , µ).

3.3 Convergence in Lp(X,E , µ) and completeness

We have seen in the previous section that Lp(X, E, µ) is a normed space for all p ∈ [1, +∞). In this section we prove some properties of the convergence in these spaces, obtaining as a byproduct the following result.

Theorem 3.5 Lp(X,E , µ) is a Banach space for any p ∈ [1,+∞).

This theorem will be a direct consequence of the following proposition, that provides also a relation between convergence in Lp and convergence µ–a.e. in X.

Proposition 3.6 Let p ∈ [1, +∞) and let (ϕn) be a Cauchy sequence in Lp(X, E, µ). Then:

(i) there exists a subsequence (ϕ_{n(k)}) converging µ–a.e. to a function ϕ ∈ Lp(X, E, µ);

(ii) (ϕn) converges to ϕ in Lp(X, E, µ), so that Lp(X, E, µ) is a Banach space.


Proof. Let (ϕn) be a Cauchy sequence in Lp(X, E, µ). Choose a subsequence (ϕ_{n(k)}) such that

‖ϕ_{n(k+1)} − ϕ_{n(k)}‖_p < 2^{−k}  ∀k ∈ N.

Next, set

g(x) := ∑_{k=0}^∞ |ϕ_{n(k+1)}(x) − ϕ_{n(k)}(x)|,  x ∈ X.

By the monotone convergence theorem and the subadditivity of the Lp norm it follows that

(∫_X gᵖ(x) dµ(x))^{1/p} = lim_{N→∞} (∫_X (∑_{k=0}^{N−1} |ϕ_{n(k+1)} − ϕ_{n(k)}|)ᵖ dµ)^{1/p} ≤ lim_{N→∞} ∑_{k=0}^{N−1} 2^{−k} = 2 < ∞.

Therefore g is finite µ–a.e., that is, there exists B ∈ E such that µ(B) = 0 and g(x) < ∞ for all x ∈ Bᶜ. Set now

ϕ(x) := ϕ_{n(0)}(x) + ∑_{k=0}^∞ (ϕ_{n(k+1)}(x) − ϕ_{n(k)}(x)),  x ∈ Bᶜ.

The series above is absolutely convergent for any x ∈ Bᶜ; moreover, replacing the series in the definition of ϕ by the finite sum ∑_{k=0}^{N−1} (ϕ_{n(k+1)}(x) − ϕ_{n(k)}(x)) we obtain ϕ(x) = lim_N ϕ_{n(N)}(x). Therefore, if we define (for instance) ϕ = 0 on the µ–negligible set B, we obtain that ϕ_{n(N)} → ϕ µ–a.e. on X.

The inequality |ϕ| ≤ |ϕ_{n(0)}| + g gives that |ϕ|ᵖ is µ–integrable, so that ϕ ∈ Lp(X, E, µ). We claim that ϕ_{n(k)} → ϕ in Lp(X, E, µ) as k → ∞. In fact, since

|ϕ(x) − ϕ_{n(h)}(x)| ≤ ∑_{k=h}^∞ |ϕ_{n(k+1)}(x) − ϕ_{n(k)}(x)|,  x ∈ X,

we have, again by monotone convergence and subadditivity of the norm,

(∫_X |ϕ(x) − ϕ_{n(h)}(x)|ᵖ dµ(x))^{1/p} ≤ ∑_{k=h}^∞ (∫_X |ϕ_{n(k+1)}(x) − ϕ_{n(k)}(x)|ᵖ dµ(x))^{1/p} ≤ ∑_{k=h}^∞ 2^{−k},

and the conclusion follows. So, (i) is proved.


Let us show (ii). Since (ϕn) is Cauchy, for any ε > 0 there exists n_ε ∈ N such that

n, m > n_ε  =⇒  ‖ϕn − ϕm‖_p < ε.

Now choose k ∈ N such that n(k) > n_ε and ‖ϕ − ϕ_{n(k)}‖_p < ε. For any n > n_ε we then have

‖ϕ − ϕn‖_p ≤ ‖ϕ − ϕ_{n(k)}‖_p + ‖ϕ_{n(k)} − ϕn‖_p ≤ 2ε.

Remark 3.7 (Lp convergence versus µ–a.e. convergence) The argument used in the previous proof applies also to converging sequences (as these sequences are obviously Cauchy), and proves that any sequence (ϕn) strongly converging to ϕ in Lp(X, E, µ) admits a subsequence (ϕ_{n(k)}) converging µ–a.e. to ϕ: precisely, this happens whenever ∑_{k=0}^∞ ‖ϕ_{n(k+1)} − ϕ_{n(k)}‖_p < ∞.

In general, however, convergence in Lp does not imply convergence µ–a.e.: the functions

ϕ₀ = 1_{[0,1]},
ϕ₁ = 1_{[0,1/2]},  ϕ₂ = 1_{[1/2,1]},
ϕ₃ = 1_{[0,1/3]},  ϕ₄ = 1_{[1/3,2/3]},  ϕ₅ = 1_{[2/3,1]},
. . .

converge to 0 in Lp(0, 1), but are nowhere pointwise converging.
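The “typewriter” sequence above is easy to generate; the following sketch (ours, not from the text) prints the decaying Lp norms and lists the infinitely many indices whose interval contains a fixed point, so the sequence converges at no point.

    def typewriter(n):
        """Interval [a, b) of the n-th typewriter function (0-indexed)."""
        m, start = 1, 0              # block of m intervals of width 1/m
        while n >= start + m:
            start += m
            m += 1
        j = n - start
        return j / m, (j + 1) / m

    p = 2.0
    for n in [0, 1, 5, 20, 100]:
        a, b = typewriter(n)
        print(n, (a, b), "norm:", (b - a) ** (1 / p))   # -> 0

    # any fixed x is covered by one interval in every block:
    x = 0.3
    hits = [n for n in range(200) if typewriter(n)[0] <= x < typewriter(n)[1]]
    print("indices whose interval contains x:", hits[:8], "...")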

The previous remark shows that we can expect to infer pointwise convergence from convergence in Lp only modulo the extraction of a subsequence. Now, we ask ourselves about the converse implication: given a sequence (ϕn) in Lp(X, E, µ) pointwise converging to a function ϕ ∈ Lp(X, E, µ), we want to find conditions ensuring the convergence of (ϕn) to ϕ in Lp(X, E, µ). Indeed, pointwise convergence does not imply Lp convergence in general, as the following example shows.

Example 3.8 Let X = [0, 1], E = B([0, 1]) and let µ = λ be the Lebesgue measure. Set

ϕn(x) = n if x ∈ [0, 1/n],  ϕn(x) = 0 if x ∈ (1/n, 1].

Then ϕn(x) → 0 for all x ∈ (0, 1], but ‖ϕn‖₁ = 1.

Proposition 3.9 Let (ϕn) be a sequence in Lp(X, E, µ) pointwise convergent to a function ϕ ∈ Lp(X, E, µ), with (|ϕn|ᵖ) µ–uniformly integrable. Then ϕn → ϕ in Lp(X, E, µ).


Proof. The functions h_n := |ϕn − ϕ|ᵖ are pointwise converging to 0 and, because of the inequality

h_n ≤ 2^{p−1}(|ϕn|ᵖ + |ϕ|ᵖ),

they are also easily seen to be µ–uniformly integrable. Therefore, applying the Vitali Theorem 2.18 to h_n we obtain the conclusion.

3.4 The space L∞(X,E , µ)

Let ϕ : X → R be an E–measurable function. We say that ϕ is µ–essentially bounded if there exists a real number M > 0 such that

µ(|ϕ| > M) = 0.

If ϕ is µ–essentially bounded there exists a number, denoted by ‖ϕ‖_∞, such that

‖ϕ‖_∞ = min {t ≥ 0 : µ(|ϕ| > t) = 0}.    (3.8)

This easily follows from the fact that the function t ↦ µ(|ϕ| > t) is right continuous (Proposition 2.6), so the infimum is attained.

Notice also that ‖ϕ‖_∞ is characterized by the property

‖ϕ‖_∞ ≤ M ⇐⇒ |ϕ| ≤ M µ–a.e. in X.    (3.9)

We shall denote by L∞(X, E, µ) the space of all equivalence classes of µ–essentially bounded functions with respect to the equivalence relation ∼ in (3.1), thus identifying functions that coincide µ–a.e. in X.

Several properties of the Lp spaces extend up to the case p = ∞: first of all L∞(X, E, µ) is a real vector space and we have the Minkowski inequality

‖ϕ + ψ‖_∞ ≤ ‖ϕ‖_∞ + ‖ψ‖_∞.    (3.10)

Indeed, by (3.9) and the triangle inequality, |ϕ(x) + ψ(x)| ≤ ‖ϕ‖_∞ + ‖ψ‖_∞ µ–a.e. in X, therefore (3.8) provides (3.10). As a consequence, L∞(X, E, µ), endowed with the norm ‖·‖_∞, is a normed space.

The Hölder inequality takes the form

∫_X |ϕψ| dµ ≤ ‖ϕ‖_∞ ∫_X |ψ| dµ.    (3.11)

Indeed, we have just to notice that |ϕ(x)ψ(x)| ≤ ‖ϕ‖_∞ |ψ(x)| for µ–a.e. x ∈ X, and then integrate with respect to µ. This inequality can still be written as (3.6), provided we agree that q = 1 is the dual exponent of p = ∞ (and conversely).

For finite measures we can apply the Hölder inequality to obtain that the Lp spaces are nested; in particular L∞ is the smallest one and L¹ is the largest one.


Remark 3.10 (Inclusions between Lp spaces) Assume that µ is finite. Then, if 1 ≤ r ≤ s ≤ ∞ we have

L^r(X, E, µ) ⊃ L^s(X, E, µ).

In fact, if r < s and ϕ ∈ L^s(X, E, µ) we have, in view of the Hölder inequality (with p = s/r and q = s/(s − r)),

∫_X |ϕ(x)|^r dµ(x) ≤ (∫_X |ϕ(x)|^s dµ(x))^{r/s} (∫_X 1_X dµ(x))^{1−r/s},

and so

‖ϕ‖_r ≤ (µ(X))^{(s−r)/rs} ‖ϕ‖_s.    (3.12)

By (3.12) we obtain that p ↦ µ(X)^{−1/p} ‖ϕ‖_p is nondecreasing for ϕ ∈ ∩_{p<∞} Lp(X, E, µ), so that it has a limit as p → ∞. Since µ(X)^{−1/p} → 1 as p → ∞, we obtain that lim_{p→∞} ‖ϕ‖_p exists, finite or infinite. The following proposition characterizes L∞(X, E, µ) and the L∞ norm in terms of this limit.

Proposition 3.11 Assume that µ is finite and let ϕ be in the intersection ∩_{p<∞} Lp(X, E, µ). Then ϕ ∈ L∞(X, E, µ) if and only if the limit lim_{p→∞} ‖ϕ‖_p is finite. If this is the case, ‖ϕ‖_∞ coincides with the value of the limit.

Proof. If p ≥ 1 we have, by the Markov inequality,

µ(|ϕ| ≥ a) = µ(|ϕ|ᵖ ≥ aᵖ) ≤ a^{−p} ‖ϕ‖_pᵖ.

Consequently, ‖ϕ‖_p ≥ a µ(|ϕ| ≥ a)^{1/p}, which yields lim_p ‖ϕ‖_p ≥ a whenever µ(|ϕ| ≥ a) > 0. So, if the limit is finite, we have ϕ ∈ L∞(X, E, µ) and ‖ϕ‖_∞ ≤ lim_p ‖ϕ‖_p. The converse inequality follows directly from (3.11); the same inequality also proves that if the limit is not finite, then ϕ ∉ L∞(X, E, µ).
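To see the limit in action, here is a small numerical sketch (ours, not the book's): for a bounded function on [0, 1] with the Lebesgue measure (so µ(X) = 1), a Riemann-sum estimate of ‖ϕ‖_p increases towards the sup norm as p grows.

    def lp_norm(f, p, steps=100_000):
        """Riemann-sum estimate of the L^p norm of f on [0, 1] (Lebesgue measure)."""
        s = sum(abs(f((k + 0.5) / steps)) ** p for k in range(steps)) / steps
        return s ** (1 / p)

    phi = lambda x: 4 * x * (1 - x)   # bounded, sup norm 1 (attained at x = 1/2)

    for p in [1, 2, 4, 8, 16, 32, 64]:
        print(p, lp_norm(phi, p))      # increases towards ||phi||_inf = 1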

In the next remark we characterize the convergence in L∞, proving also that L∞(X, E, µ) is a Banach space: as a matter of fact, convergence in L∞(X, E, µ) differs from convergence in the supremum norm only because a µ–negligible set is neglected.

Remark 3.12 (L∞(X, E, µ) is a Banach space) Assume that (ϕn) ⊂ L∞(X, E, µ) is a Cauchy sequence, and let us consider the µ–negligible set

B := ∪_{n,m=0}^∞ {x ∈ X : |ϕn(x) − ϕm(x)| > ‖ϕn − ϕm‖_∞}.


Then sup_{Bᶜ} |ϕn − ϕm| ≤ ‖ϕn − ϕm‖_∞; as a consequence, the completeness of the space of bounded functions defined on Bᶜ provides a bounded function ϕ : Bᶜ → R such that ϕn → ϕ uniformly on Bᶜ. Extending ϕ in an arbitrary E–measurable way (for instance with the value 0) to the whole of X, we get ϕn → ϕ in L∞(X, E, µ). A similar argument proves that ϕn → ϕ in L∞(X, E, µ) if and only if there exists a µ–negligible set B ∈ E satisfying ϕn → ϕ uniformly on Bᶜ.

We know that |∫_X ϕ dµ| is always less than or equal to ∫_X |ϕ| dµ. A nice and useful generalization of this fact is the so-called Jensen inequality.

Recall that, if J ⊂ R is an interval, a continuous function g : J → R is said to be convex if

g((x + y)/2) ≤ (g(x) + g(y))/2  ∀x, y ∈ J.    (3.13)

By several approximations (see Exercise 3.7) one can prove that a convex function g satisfies g(tx + (1 − t)y) ≤ t g(x) + (1 − t) g(y) for all x, y ∈ J and t ∈ [0, 1], and even that

g(∑_{i=1}^n t_i x_i) ≤ ∑_{i=1}^n t_i g(x_i)  whenever t_i ≥ 0, x_i ∈ J and ∑_{i=1}^n t_i = 1.    (3.14)

In the proof we use an elementary property of nonnegative convex functions g : R → [0, +∞) satisfying g(t) → +∞ as |t| → +∞, namely the existence of a minimum point t₀; moreover, the function g is nondecreasing in [t₀, +∞) and nonincreasing in (−∞, t₀] (see Exercise 3.8).

Proposition 3.13 (Jensen) Assume that µ is a probability measure. Let g : R → [0, +∞) be convex and let ϕ ∈ L¹(X, E, µ). Then we have

g(∫_X ϕ dµ) ≤ ∫_X g(ϕ) dµ.    (3.15)

Proof. Let us first show (3.15) when ϕ is simple. Let

ϕ = ∑_{i=1}^n α_i 1_{A_i},

where n ≥ 1 is an integer, α₁, . . . , α_n ∈ R and A₁, . . . , A_n are mutually disjoint sets in E whose union is X, so that

∑_{i=1}^n µ(A_i) = 1.


Then, from (3.14) we infer

g(∫_X ϕ dµ) = g(∑_{i=1}^n α_i µ(A_i)) ≤ ∑_{i=1}^n g(α_i) µ(A_i) = ∫_X g(ϕ) dµ.

In the general case, let us first assume that g(t) → +∞ as |t| → +∞. Then, by Exercise 3.8 we know that g has a minimum point t₀, and that g is nondecreasing in [t₀, +∞) and nonincreasing in (−∞, t₀]. We can assume with no loss of generality (possibly replacing g(t) by g(t − t₀) and ϕ by ϕ + t₀) that g attains its minimum value at t₀ = 0, and that ∫_X g(ϕ) dµ is finite. Furthermore, replacing g by g − g(0), we can assume that the minimum value of g is 0.

Let ϕ_n^± be nonnegative simple functions satisfying ϕ_n^± ↑ ϕ^±; the simple functions ϕ_n^+ − ϕ_n^− converge to ϕ^+ − ϕ^− = ϕ in L¹(X, E, µ). In addition, since g is monotone in (−∞, 0] and [0, +∞), the monotone convergence theorem gives

∫_X g(ϕ_n^+) dµ ↑ ∫_X g(ϕ^+) dµ,  ∫_X g(−ϕ_n^−) dµ ↑ ∫_X g(−ϕ^−) dµ,

so that (since g(0) = 0, ϕ_n^+ ϕ_n^− = 0 and ϕ^+ ϕ^− = 0)

∫_X g(ϕ_n^+ − ϕ_n^−) dµ = ∫_X g(ϕ_n^+) dµ + ∫_X g(−ϕ_n^−) dµ converges to ∫_X g(ϕ^+) dµ + ∫_X g(−ϕ^−) dµ = ∫_X g(ϕ) dµ.

Passing to the limit as n → ∞ in Jensen’s inequality for the simple functions ϕ_n^+ − ϕ_n^−,

g(∫_X (ϕ_n^+ − ϕ_n^−) dµ) ≤ ∫_X g(ϕ_n^+ − ϕ_n^−) dµ,

we get (3.15).

Finally, the assumption that g(t) → +∞ as |t| → +∞ can be removed by considering the functions g_ε(t) := g(t) + ε|t|: we obtain

g(∫_X ϕ dµ) + ε |∫_X ϕ dµ| ≤ ∫_X g(ϕ) dµ + ε ∫_X |ϕ| dµ,

and Jensen’s inequality follows by letting ε ↓ 0.

A much simpler proof of the same result will be given in the second part, in Lemma 9.8: while this proof was based on (3.14), the other one is based on the approximation of a convex function by affine functions. Both viewpoints are important in the theory of convex functions.
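For intuition, a quick Monte Carlo check of (3.15) (our sketch, not from the text), with the convex function g(x) = x² and the probability measure replaced by an empirical sample:

    import random

    random.seed(3)
    g = lambda x: x * x   # convex, nonnegative

    sample = [random.gauss(1.0, 2.0) for _ in range(100_000)]
    mean = sum(sample) / len(sample)                  # ~ integral of phi
    lhs = g(mean)                                     # g of the integral
    rhs = sum(g(x) for x in sample) / len(sample)     # ~ integral of g(phi)

    print("g(mean) =", lhs, "<= mean of g =", rhs)
    assert lhs <= rhs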

3.5 Dense subsets of Lp(X,E , µ)

Proposition 3.14 For any p ∈ [1, +∞], the space of all simple integrable functions is dense in Lp(X, E, µ).


Proof. Let f ∈ Lp(X, E, µ) with f ≥ 0. Then the conclusion follows from Proposition 2.12 and the dominated convergence theorem. In the general case we write f as f⁺ − f⁻ and approximate both parts in Lp by simple functions.

We consider now the special situation when X is a metric space, E is the σ–algebra of all Borel subsets of X and µ is any finite measure on (X, E).

We denote by C_b(X) the space of all continuous bounded functions on X. Clearly, C_b(X) ⊂ Lp(X, E, µ) for all p ∈ [1, +∞].

Proposition 3.15 For any p ∈ [1, +∞) and any finite measure µ, C_b(X) is dense in Lp(X, E, µ).

Proof. Let C be the closure of C_b(X) in Lp(X, E, µ); obviously C is a vector space, as C_b(X) is a vector space. In view of Proposition 3.14 it is enough to show that for any Borel set I ∈ B(X) there exists a sequence (ϕn) ⊂ C_b(X) such that ϕn → 1_I in Lp(X, E, µ).

Assume first that I is closed. Set

ϕn(x) = 1 − n d(x, I) if d(x, I) ≤ 1/n,  ϕn(x) = 0 if d(x, I) ≥ 1/n,

where

d(x, I) := inf {d(x, y) : y ∈ I}.

It is easy to see that the ϕn are continuous, that 0 ≤ ϕn ≤ 1 and that ϕn(x) → 1_I(x), hence the dominated convergence theorem implies that ϕn → 1_I in Lp(X, E, µ).

Now, let

G := {I ∈ B(X) : 1_I ∈ C}.

It is easy to see that G is a Dynkin system (which includes the π–system of closed sets), so that by the Dynkin theorem we have G = B(X).

Remark 3.16 C_b(X) is a closed subspace of L∞(X, E, µ), and therefore it is not dense in general. Indeed, if (ϕn) ⊂ C_b(X) is Cauchy in L∞(X, E, µ), then it uniformly converges up to a µ–negligible set B (just take as B, as in Remark 3.12, the union of the µ–negligible sets {|ϕn − ϕm| > ‖ϕn − ϕm‖_∞}). Therefore (ϕn) uniformly converges on Bᶜ and on its closure K. Denoting by ϕ ∈ C_b(K) its uniform limit, by Tietze’s extension theorem we may extend ϕ to a function, that we still denote by ϕ, in C_b(X). As X \ K ⊂ B is µ–negligible, it follows that ϕn → ϕ in L∞(X, E, µ).


EXERCISES

3.1 Assume that µ is σ–finite, but not finite. Provide examples showing that no inclusion holds between the spaces Lp(X, E, µ) in general. Nevertheless, show that for any E–measurable function ϕ : X → R the set

{p ∈ [1, ∞] : ϕ ∈ Lp(X, E, µ)}

is an interval. Hint: consider for instance the Lebesgue measure on R.

3.2 Let 1 ≤ p ≤ q < ∞ and f ∈ Lq(X, E, µ). Show that, regardless of any finiteness assumption on µ, for any δ ∈ (0, 1) we can write f = g + f̃, with g ∈ Lp(X, E, µ), f̃ ∈ Lq(X, E, µ) and ‖f̃‖_q ≤ δ‖f‖_q.

3.3 Let p ∈ (1, ∞), ϕ ∈ Lp and ψ ∈ Lq, with q = p′ the dual exponent, be such that ‖ϕψ‖₁ = ‖ϕ‖_p ‖ψ‖_q. Show that either ψ = 0 or there exists a constant λ ∈ [0, +∞) such that |ϕ| = λ|ψ|^{q−1} µ–a.e. in X. Hint: first investigate the case of equality in Young’s inequality.

3.4 Prove the following variant of Hölder’s inequality, known as Young’s inequality: if ϕ ∈ Lp, ψ ∈ Lq and 1/p + 1/q = 1/r, with r ≥ 1, we have that ϕψ ∈ L^r and ‖ϕψ‖_r ≤ ‖ϕ‖_p ‖ψ‖_q.

3.5 Let (ϕn) ⊂ L¹(X, E, µ) satisfy lim inf_n ϕn ≥ ϕ µ–a.e. in X. Show that

∫_X ϕn dµ = ∫_X ϕ dµ = 1  =⇒  ∫_X |ϕ − ϕn| dµ → 0.

Hint: notice that the positive part and the negative part of ϕ − ϕn have the same integral, to obtain

∫_X |ϕ − ϕn| dµ = 2 ∫_X (ϕ − ϕn)⁺ dµ.

Then, apply the dominated convergence theorem.

3.6 Show the following extension of Fatou’s lemma: if ϕn ≥ −ψn, with ψn ∈ L¹(X) nonnegative and ψn → ψ in L¹(X), then

lim inf_{n→∞} ∫_X ϕn dµ ≥ ∫_X lim inf_{n→∞} ϕn dµ.

Hint: prove first the statement under the additional assumption that ψn → ψ µ–a.e. in X.

3.7 Show that (3.13) implies g(tx + (1 − t)y) ≤ t g(x) + (1 − t) g(y) for all x, y ∈ J and t ∈ [0, 1]. Then, deduce from this property (3.14). Hint: it is useful to consider dyadic numbers t = k/2^m, with k ≤ 2^m integer.

3.8 Let g : R → [0, +∞) be a convex function such that g(t) → +∞ as |t| → +∞. Show the existence of t₀ ∈ R where g attains its minimum value. Then, show that g is nondecreasing in [t₀, +∞) and nonincreasing in (−∞, t₀].

3.9 Let (ϕn) ⊂ L¹(X, E, µ) be nonnegative functions. Show that the conditions

lim inf_{n→∞} ϕn ≥ ϕ µ–a.e. in X,  lim sup_{n→∞} ∫_X ϕn dµ ≤ ∫_X ϕ dµ < ∞

imply the convergence of ϕn to ϕ in L¹(X, E, µ). Hint: use Exercise 3.5.

3.10 Let {ϕi}_{i∈I} be a family of functions satisfying

sup_{i∈I} ∫_X Φ(|ϕi|) dµ = M < +∞


and assume that Ψ(c) := Φ(c)/c is nondecreasing and tends to +∞ as c → +∞. Show that {ϕi}_{i∈I} is µ–uniformly integrable. Hint: use the inequalities

∫_A |ϕi| dµ ≤ ∫_{A∩{|ϕi|≥c}} Φ(|ϕi|)/Ψ(c) dµ + ∫_{A∩{|ϕi|<c}} |ϕi| dµ ≤ M/Ψ(c) + c µ(A),

and then choose c sufficiently large, such that M/Ψ(c) < ε/2.

3.11 ⋆ Assuming that (X, d) is a metric space, E = B(X) and µ is finite, prove Lusin’s theorem: for any ε > 0 and any f ∈ L¹(X, E, µ) there exists a closed set C ⊂ X such that µ(X \ C) < ε and f|_C is continuous. Hint: use the density of C_b(X) in L¹ and Egorov’s theorem.


Chapter 4

Hilbert spaces

In this chapter we recall the basic facts regarding real vector spaces endowed with a scalar product. We introduce the concept of Hilbert space and show that, even for the infinite-dimensional ones, continuous linear functionals are induced by the scalar product. Moreover, we see that even in some classes of infinite-dimensional spaces (the so-called separable ones) there exists a well-defined notion of basis (the so-called complete orthonormal systems), obtained replacing finite sums with converging series. Even though the presentation will be self-contained, we assume that the reader has already some familiarity with these concepts (basis, scalar product, representation of linear functionals) in finite-dimensional spaces.

4.1 Scalar products, pre-Hilbert and Hilbert spaces

A real pre–Hilbert space is a real vector space H endowed with a mapping

H × H → R,  (x, y) ↦ ⟨x, y⟩,

called scalar product, such that:

(i) ⟨x, x⟩ ≥ 0 for all x ∈ H, and ⟨x, x⟩ = 0 if and only if x = 0;

(ii) ⟨x, y⟩ = ⟨y, x⟩ for all x, y ∈ H;

(iii) ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩ for all x, y, z ∈ H and α, β ∈ R.

In the following, H represents a real pre–Hilbert space.

The scalar product allows us to introduce the concept of orthogonality. We say that two elements x and y of H are orthogonal if ⟨x, y⟩ = 0.



We are going to prove that the function

‖x‖ := √⟨x, x⟩,  x ∈ H,

is a norm in H. For this we need the following Cauchy–Schwarz inequality.

Proposition 4.1 For any x, y ∈ H we have

|⟨x, y⟩| ≤ ‖x‖ ‖y‖.    (4.1)

In (4.1) equality holds if and only if x and y are linearly dependent.

Proof. Set

F(λ) := ‖x + λy‖² = λ²‖y‖² + 2λ⟨x, y⟩ + ‖x‖²,  λ ∈ R.

Since F(λ) ≥ 0 for all λ ∈ R, the discriminant of this quadratic polynomial is nonpositive, i.e.

|⟨x, y⟩|² − ‖x‖² ‖y‖² ≤ 0,

which yields (4.1).

If x and y are linearly dependent, it is clear that |⟨x, y⟩| = ‖x‖ ‖y‖. Assume conversely that ⟨x, y⟩ = ±‖x‖ ‖y‖ and that y ≠ 0. Then we have F(λ) = (‖x‖ ± λ‖y‖)², so that, choosing λ = ∓‖x‖/‖y‖, we find F(λ) = 0. This implies x + λy = 0, so that x and y are linearly dependent.

Now we can prove easily that ‖·‖ is a norm in H. In fact, it is clear that ‖αx‖ = |α|‖x‖ for all α ∈ R and all x ∈ H. Moreover, taking into account (4.1), we have for all x, y ∈ H

‖x + y‖² = ⟨x + y, x + y⟩ = ‖x‖² + ‖y‖² + 2⟨x, y⟩ ≤ ‖x‖² + ‖y‖² + 2‖x‖ ‖y‖ = (‖x‖ + ‖y‖)²,

so that ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Therefore a pre–Hilbert space H is a normed space and, in particular, a metric space. If H, endowed with the distance induced by the norm, is complete, we say that H is a Hilbert space.

Example 4.2 (i) Rⁿ is a Hilbert space with the canonical scalar product

⟨x, y⟩ := ∑_{k=1}^n x_k y_k,


inducing the Euclidean distance, where x = (x₁, . . . , x_n), y = (y₁, . . . , y_n) ∈ Rⁿ.

(ii) Let (X, E, µ) be a measure space. Then L²(X, E, µ), endowed with the scalar product

⟨ϕ, ψ⟩ := ∫_X ϕ(x)ψ(x) dµ(x),  ϕ, ψ ∈ L²(X, E, µ),

is a Hilbert space (completeness follows from Theorem 3.5).

(iii) Let ℓ² be the space of all sequences of real numbers x = (x_k) such that ∑_{k=0}^∞ x_k² < ∞. ℓ² is a vector space with the usual operations,

a(x_k) = (a x_k), a ∈ R,  (x_k) + (y_k) = (x_k + y_k),  (x_k), (y_k) ∈ ℓ².

The space ℓ², endowed with the scalar product

⟨x, y⟩ := ∑_{k=0}^∞ x_k y_k,  x = (x_k), y = (y_k) ∈ ℓ²,

is a Hilbert space. This follows from (ii) taking X = N, E = P(X) and µ({x}) = 1 for all x ∈ X.

(iv) Let X = C([0, 1]) be the linear space of all real continuous functions on [0, 1]. X is a pre–Hilbert space with the scalar product

⟨f, g⟩ := ∫_0^1 f(t)g(t) dt.

However, X is not a Hilbert space: indeed, X is dense in, but strictly contained in, L²(0, 1).

Finite-dimensional pre-Hilbert spaces H are always Hilbert spaces: indeed, if {v₁, . . . , v_n}, with n = dim H, is a basis of H, the Gram-Schmidt orthonormalization process (recalled in Exercise 4.4) provides an orthonormal basis {e₁, . . . , e_n} of H (i.e. ‖e_i‖ = 1 and e_i is orthogonal to e_j for i ≠ j), and the map

x = ∑_{i=1}^n ⟨x, e_i⟩ e_i ↦ (⟨x, e₁⟩, . . . , ⟨x, e_n⟩)

(mapping x to the Euclidean vector of its coordinates with respect to this basis) is easily seen to provide an isometry with Rⁿ: indeed,

‖∑_{i=1}^n ⟨x, e_i⟩ e_i‖² = ∑_{i,j=1}^n ⟨x, e_i⟩⟨x, e_j⟩⟨e_i, e_j⟩ = ∑_{i=1}^n (⟨x, e_i⟩)².

Thus, Rⁿ being complete, H is complete.


4.2 The projection theorem

It is useful to notice that for any x, y ∈ H the following parallelogram identity holds:

‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖²,  x, y ∈ H.    (4.2)

One can show that identity (4.2) characterizes pre-Hilbert spaces among normed spaces, and Hilbert spaces among Banach spaces, see Exercise 4.1.

Theorem 4.3 Let H be a Hilbert space and let Y be a closed subspace of H. Then for any x ∈ H there exists a unique y ∈ Y, called the projection of x on Y and denoted by π_Y(x), such that

‖x − y‖ = min_{z∈Y} ‖x − z‖.

Moreover, y is characterized by the property

⟨x − y, z⟩ = 0  for all z ∈ Y.    (4.3)

Proof. Set d := inf_{z∈Y} ‖x − z‖ and choose y_n ∈ Y such that ‖x − y_n‖ ↓ d. We are going to show that (y_n) is a Cauchy sequence.

For any m, n ∈ N we have, by the parallelogram identity (4.2),

‖(x − y_n) + (x − y_m)‖² + ‖(x − y_n) − (x − y_m)‖² = 2‖x − y_n‖² + 2‖x − y_m‖².

Consequently,

‖y_n − y_m‖² = 2‖x − y_n‖² + 2‖x − y_m‖² − 4 ‖x − (y_n + y_m)/2‖².

Taking into account that (y_n + y_m)/2 ∈ Y, we find

‖y_n − y_m‖² ≤ 2‖x − y_n‖² + 2‖x − y_m‖² − 4d²,

so that ‖y_n − y_m‖ → 0 as n, m → ∞. Thus, (y_n) is a Cauchy sequence and, since the space is complete and Y is closed, it converges to an element y ∈ Y. Since ‖x − y_n‖ → ‖x − y‖, we find that ‖x − y‖ = d. Existence is thus proved. Uniqueness follows again by the parallelogram identity, which gives

‖y − y′‖² ≤ 2‖x − y‖² + 2‖x − y′‖² − 4 ‖x − (y + y′)/2‖² ≤ 2d² + 2d² − 4d² = 0

whenever y and y′ are minimizers.


Let us prove (4.3). Fix z ∈ Y and define

F(λ) := ‖x − y − λz‖² = λ²‖z‖² − 2λ⟨x − y, z⟩ + ‖x − y‖²,  λ ∈ R.

Since F attains its minimum at λ = 0, we have 0 = F′(0) = −2⟨x − y, z⟩, as claimed.

Conversely, if (4.3) holds for all z ∈ Y, we have

‖x − y − z‖² = ‖z‖² + ‖x − y‖² ≥ ‖x − y‖²  for all z ∈ Y.

Remark 4.4 (Projection on convex closed sets) The previous proof works, with absolutely no modification, to show that for any convex closed set K ⊂ H and any x ∈ H there exists a unique solution y = π_K(x) of the problem

min_{z∈K} ‖x − z‖.

In this case, however, π_K(x) is not characterized by (4.3), but by a one-sided condition, namely ⟨x − π_K(x), z − π_K(x)⟩ ≤ 0 for all z ∈ K, see Exercise 4.3.

Corollary 4.5 Let Y be a closed proper subspace of H. Then there exists x₀ ∈ H \ {0} such that ⟨x₀, y⟩ = 0 for all y ∈ Y.

Proof. It is enough to choose an element z₀ in H which does not belong to Y and set x₀ = z₀ − π_Y(z₀).

Fix an integer N ≥ 1, an N-dimensional subspace H_N ⊂ H and an orthonormal basis {e₁, . . . , e_N} of it. The following result gives the best approximation of an element x by a linear combination of e₁, . . . , e_N.

Proposition 4.6 The projection of an element x ∈ H on H_N is given by

π_{H_N}(x) = ∑_{k=1}^N ⟨x, e_k⟩ e_k.

Proof. We have to show that for any y₁, . . . , y_N ∈ R we have

‖x − ∑_{k=1}^N x_k e_k‖² ≤ ‖x − ∑_{k=1}^N y_k e_k‖²,    (4.4)


where x_k = ⟨x, e_k⟩. We have in fact

‖x − ∑_{k=1}^N y_k e_k‖² = ‖x‖² + ∑_{k=1}^N y_k² − 2 ∑_{k=1}^N x_k y_k = ‖x‖² − ∑_{k=1}^N x_k² + ∑_{k=1}^N (x_k − y_k)².

This quantity is clearly minimal when x_k = y_k, and

‖x − ∑_{k=1}^N x_k e_k‖² = ‖x‖² − ∑_{k=1}^N x_k².    (4.5)

An alternative proof of the Proposition, based on the characterization (4.3) of π_{H_N}(x), is proposed in Exercise 4.5.
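A minimal numerical sketch of Proposition 4.6 (our illustration, not part of the text): in H = R⁴ we project a vector onto the span of two orthonormal vectors and verify the orthogonality relation (4.3).

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def scale(c, u):
        return [c * a for a in u]

    def add(u, v):
        return [a + b for a, b in zip(u, v)]

    # orthonormal pair spanning a 2-dimensional subspace H_N of R^4
    e1 = [1.0, 0.0, 0.0, 0.0]
    e2 = [0.0, 0.6, 0.8, 0.0]
    x = [2.0, 1.0, -1.0, 3.0]

    # pi_{H_N}(x) = sum_k <x, e_k> e_k
    proj = add(scale(dot(x, e1), e1), scale(dot(x, e2), e2))
    residual = [a - b for a, b in zip(x, proj)]

    # (4.3): the residual is orthogonal to every vector of H_N
    print(proj)
    print(dot(residual, e1), dot(residual, e2))  # both ~ 0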

4.3 Linear continuous functionals

A linear functional F on H is a mapping F : H → R such that

F(αx + βy) = αF(x) + βF(y)  ∀x, y ∈ H, ∀α, β ∈ R.

F is said to be bounded if there exists K ≥ 0 such that

|F(x)| ≤ K‖x‖  for all x ∈ H.

Proposition 4.7 A linear functional F is continuous if, and only if, it is bounded.

Proof. It is obvious that if F is bounded then it is continuous (even Lipschitz continuous). Assume conversely that F is continuous and, by contradiction, that it is not bounded. Then for any n ∈ N there exists x_n ∈ H such that |F(x_n)| ≥ n²‖x_n‖. Setting y_n = x_n/(n‖x_n‖), we have ‖y_n‖ = 1/n → 0, whereas |F(y_n)| ≥ n, which is a contradiction.

The following basic Riesz theorem gives an intrinsic representation formula for all linear continuous functionals.

Proposition 4.8 Let F be a linear continuous functional on H. Then there exists a unique x₀ ∈ H such that

F(x) = ⟨x, x₀⟩  ∀x ∈ H.    (4.6)


Proof. Assume that F ≠ 0 (otherwise x₀ = 0 works) and let Y = F⁻¹(0) = Ker F. Then Y ≠ H is closed (because F is continuous) and a vector space (because F is linear), so that by Corollary 4.5 (multiplying by a suitable constant) there exists z₀ ∈ H such that F(z₀) = 1 and

⟨z₀, z⟩ = 0  for all z ∈ Ker F.

On the other hand, for any x ∈ H the element z = x − F(x)z₀ belongs to Ker F, since F(z) = F(x) − F(x)F(z₀) = 0. Therefore

⟨z₀, x − F(x)z₀⟩ = 0  for all x ∈ H,

so that

⟨x, z₀⟩ − F(x)‖z₀‖² = 0,

and (4.6) follows setting x₀ = z₀/‖z₀‖².

It remains to prove uniqueness. Let y₀ ∈ H be such that

F(x) = ⟨x, x₀⟩ = ⟨x, y₀⟩,  x ∈ H.

Then, choosing x = x₀ − y₀, we find ‖x₀ − y₀‖² = 0, so that x₀ = y₀.

4.4 Bessel inequality, Parseval identity and orthonormal systems

Let us discuss the concept of basis in a Hilbert space H, assuming with no loss of generality that the dimension of H is not finite. We use Kronecker’s notation δ_{hk}, equal to 1 for h = k and equal to 0 if h ≠ k.

Definition 4.9 (Orthonormal system) A sequence (e_k)_{k∈N} ⊂ H is called an orthonormal system if

⟨e_h, e_k⟩ = δ_{hk},  h, k ∈ N.

Proposition 4.10 Let (e_k)_{k∈N} be an orthonormal system in H.

(i) For any x ∈ H we have

∑_{k=0}^∞ |⟨x, e_k⟩|² ≤ ‖x‖².    (4.7)

(ii) For any x ∈ H the series ∑_{k=0}^∞ ⟨x, e_k⟩ e_k is convergent in H (¹).

(¹) A series ∑_{k=0}^∞ x_k of vectors in a Banach space E is said to be convergent if the sequence of the finite sums ∑_{k=0}^n x_k is convergent in E.


(iii) The equality in (4.7) holds iff

x = ∑_{k=0}^∞ ⟨x, e_k⟩ e_k.    (4.8)

Inequality (4.7) is called the Bessel inequality and, when the equality holds, the Parseval identity.

Proof. (i) Let n ∈ N. Then by (4.5) we have

‖x − ∑_{k=0}^n ⟨x, e_k⟩ e_k‖² = ‖x‖² − ∑_{k=0}^n |⟨x, e_k⟩|²,    (4.9)

so that (4.7) follows by the arbitrariness of n.

(ii) Let n, p ∈ N and set

s_n = ∑_{k=0}^n ⟨x, e_k⟩ e_k.

Then

‖s_{n+p} − s_n‖² = ‖∑_{k=n+1}^{n+p} ⟨x, e_k⟩ e_k‖² = ∑_{k=n+1}^{n+p} |⟨x, e_k⟩|².

Since the series ∑_{k=0}^∞ |⟨x, e_k⟩|² is convergent by (i), the sequence (s_n) is Cauchy and the conclusion follows.

(iii) Passing to the limit as n → ∞ in (4.9) we find

‖x − ∑_{k=0}^∞ ⟨x, e_k⟩ e_k‖² = ‖x‖² − ∑_{k=0}^∞ |⟨x, e_k⟩|².

This proves statement (iii).

Definition 4.11 (Complete orthonormal system) An orthonormal system (e_k)_{k∈N} is called complete if

x = ∑_{k=0}^∞ ⟨x, e_k⟩ e_k  ∀x ∈ H.


Example 4.12 Let H = ℓ² as in Example 4.2(iii). Then, it is easy to see that the system (e_k), where

e_k := (0, 0, . . . , 0, 1, 0, 0, . . .) (with the digit 1 in the k-th position),

is complete. Indeed, if x = (x_k) ∈ ℓ² we have that ⟨x, e_i⟩ = x_i (the i-th component of the sequence x), so that

‖x − ∑_{i=0}^n ⟨x, e_i⟩ e_i‖² = ∑_{k=n+1}^∞ x_k² → 0.

We already noticed that Rⁿ is the canonical model of n-dimensional Hilbert spaces H, because any choice of an orthonormal basis {v₁, . . . , v_n} of H induces the linear isometry

a ↦ ∑_{i=1}^n a_i v_i

from Rⁿ to H (which, as a consequence, preserves also the scalar product, see Exercise 4.2). For similar reasons, ℓ² is the canonical model of all spaces H having a complete orthonormal system (e_k)_{k∈N}: in this case the linear isometry from ℓ² to H is given by

a ↦ ∑_{i=0}^∞ a_i e_i.

Proposition 4.13 (Completeness criterion) Let (e_n) be an orthonormal system. Then (e_n) is complete if and only if the vector space E spanned by (e_n) is dense in H.

Proof. If (e_n) is complete we have that any x ∈ H is the limit of the finite sums ∑_{i=0}^N ⟨x, e_i⟩ e_i, which all belong to E; therefore E is dense. Conversely, if E is dense, for any x ∈ H and any ε > 0 we can find a vector z = ∑_{i=0}^n a_i e_i with ‖z − x‖ < ε. By applying Proposition 4.6 twice (first to the vector space spanned by e₀, . . . , e_m, and then to the vector space spanned by e₀, . . . , e_n) we get

‖x − ∑_{i=0}^m ⟨x, e_i⟩ e_i‖ ≤ ‖x − ∑_{i=0}^n ⟨x, e_i⟩ e_i‖ ≤ ‖x − ∑_{i=0}^n a_i e_i‖ < ε

for m ≥ n. Since ε is arbitrary, this proves that the Fourier series of x converges to x.

The following proposition provides a necessary and sufficient condition for the existence of a complete orthonormal system. We recall that a metric space (X, d) is said to be separable if there exists a countable dense subset D ⊂ X.


Theorem 4.14 A Hilbert space H admits a complete orthonormal system (e_k)_{k∈N} if and only if H, as a metric space, is separable.

Proof. If H admits a complete orthonormal system (e_k)_{k∈N}, then H is separable, because the collection D of finite sums with rational coefficients of the vectors e_k provides a countable dense subset (indeed, the closure of D contains the finite sums of the vectors e_k, and then the whole space).

Conversely, assume that H is separable and let (v_n) be a dense sequence. We define e₀ = v₀; e₁ = v_{k₁}, where k₁ is the first k > k₀ = 0 such that v_k is linearly independent from v₀; e₂ = v_{k₂}, where k₂ is the first k > k₁ such that v_k is linearly independent from e₀, e₁; and so on. In this way we have built a sequence (e_i) of linearly independent vectors generating the same vector space generated by (v_n). Let S be this vector space, and let us represent it as ∪_n S_n, where S_n is the vector space generated by e₀, . . . , e_n. Notice that S is dense, as all v_n belong to S.

By applying the Gram-Schmidt process to (e_i), an operation that does not change the vector spaces S_n generated by the vectors e₀, . . . , e_n, we can also assume that (e_i) is an orthonormal system.

Now, fix x ∈ H and let d_n := min {‖x − y‖ : y ∈ S_n}; since the union of the S_n is dense we have that d_n ↓ 0 as n → ∞, and therefore Proposition 4.6 gives that for any ε > 0 there exists an integer m such that

d_n = ‖x − ∑_{k=0}^n ⟨x, e_k⟩ e_k‖ < ε  ∀n ≥ m.

Since ε is arbitrary we obtain that

x = ∑_{k=0}^∞ ⟨x, e_k⟩ e_k.

This proves that (e_n) is complete.
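The Gram-Schmidt process invoked in the proof is easy to state in code. Here is a minimal sketch (ours, not from the text) in Rⁿ with the Euclidean scalar product; it orthonormalizes a list of vectors and discards the linearly dependent ones, as in the construction of (e_i) above.

    def gram_schmidt(vectors, tol=1e-12):
        """Orthonormalize `vectors` (lists of floats) w.r.t. the Euclidean
        scalar product; linearly dependent vectors are discarded."""
        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))

        basis = []
        for v in vectors:
            # subtract the projection on the span of the basis built so far
            w = list(v)
            for e in basis:
                c = dot(w, e)
                w = [a - c * b for a, b in zip(w, e)]
            norm = dot(w, w) ** 0.5
            if norm > tol:                  # skip dependent vectors
                basis.append([a / norm for a in w])
        return basis

    B = gram_schmidt([[1, 1, 0], [1, 0, 1], [2, 1, 1]])  # third is dependent
    print(B)   # two orthonormal vectors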


EXERCISES

4.1 Let (X, ‖·‖) be a normed space, and assume that the norm satisfies the parallelogram identity (4.2). Set

⟨x, y⟩ := (1/4)‖x + y‖² − (1/4)‖x − y‖²,  x, y ∈ X.

Show that ⟨·, ·⟩ is a scalar product whose induced norm is ‖·‖.

4.2 Use the identity of the previous exercise to show that any linear isometry between pre-Hilbert spaces preserves also the scalar product.

4.3 Show that, in the situation considered in Remark 4.4, π_K(x) is characterized by the property

⟨x − π_K(x), z − π_K(x)⟩ ≤ 0  ∀z ∈ K.

4.4 Let H be a finite dimensional pre-Hilbert space and let {v₁, . . . , v_n}, with n = dim H, be a basis of it. Define

f₁ = v₁,  f₂ = v₂ − (⟨v₂, f₁⟩/⟨f₁, f₁⟩) f₁,  f₃ = v₃ − (⟨v₃, f₁⟩/⟨f₁, f₁⟩) f₁ − (⟨v₃, f₂⟩/⟨f₂, f₂⟩) f₂,  . . .

Show that e_i = f_i/‖f_i‖ is an orthonormal system in H (notice that v_k − f_k is the projection of v_k on the vector space generated by v₁, . . . , v_{k−1}).

4.5 Let H be a Hilbert space, and let X be an infinite-dimensional closed separable subspace. Show that

π_X(x) = ∑_{k=0}^∞ ⟨x, e_k⟩ e_k  ∀x ∈ H,

where (e_k) is any complete orthonormal system of X. Hint: show that the vector x − ∑_k ⟨x, e_k⟩ e_k is orthogonal to all vectors of X.

4.6 Let X be the space of functions f : [0, 1] → R such that f(x) ≠ 0 for at most countably many x, and ∑_x f²(x) < +∞. Show that X, endowed with the scalar product

⟨f, g⟩ := ∑_{x∈[0,1]} f(x)g(x),

is a non-separable Hilbert space. Hint: a possible solution is to check that X corresponds to L²([0, 1], P([0, 1]), µ), where µ is the counting measure on [0, 1].

4.7 Let (e_k)_{k∈N} be a complete orthonormal system of H. Show that, for any x, y ∈ H, we have

∑_{k=0}^∞ ⟨x, e_k⟩⟨y, e_k⟩ = ⟨x, y⟩.    (4.10)

4.8 ⋆ Show that for any Hilbert space H there exists a family (not necessarily finite or countable) of vectors {e_i}_{i∈I} such that:

(i) ⟨e_i, e_j⟩ is equal to 1 if i = j, and to 0 otherwise;

(ii) for any vector x ∈ H there exists a countable set J ⊂ I with

x = ∑_{i∈J} ⟨x, e_i⟩ e_i.

Hint: use Zorn’s lemma.



Chapter 5

Fourier series

In this chapter we study the problem of representing a given T-periodic function as a superposition, for a suitable choice of the coefficients, of more “elementary” ones. This problem was first studied by J. Fourier in the case when the elementary functions are the trigonometric ones (nowadays we know that many different choices are indeed possible). Thanks to the theory of L² spaces and of Hilbert spaces developed in the previous chapters, the problem can be formalized by looking for complete orthonormal systems in L² made by trigonometric functions.

We shall mostly be concerned with the case of 2π-periodic functions, but a simple change of scale (see Remark 5.1) easily provides the translation of the results to arbitrary periods.

We are concerned with the measure space ((−π, π), B((−π, π)), λ), where λ is the Lebesgue measure. As usual, we shall write for brevity L²(−π, π). We shall denote by ⟨·, ·⟩ the canonical scalar product

⟨f, g⟩ := ∫_{(−π,π)} f(x)g(x) dλ = ∫_{−π}^π f(x)g(x) dx,  f, g ∈ L²(−π, π).

Let us consider, as a family of elementary functions, the trigonometric system, given by:

1/√(2π);  (1/√π) cos kx, k ∈ N, k ≥ 1;  (1/√π) sin kx, k ∈ N, k ≥ 1.    (5.1)

It is easy to check with integration by parts that this is an orthonormal system in L²(−π, π), see Exercise 5.1. Thus, in view of Proposition 4.10, the series of functions

S(x) = (1/2)a₀ + ∑_{k=1}^∞ (a_k cos kx + b_k sin kx)    (5.2)



is convergent in L²(−π, π) for any f ∈ L²(−π, π), where

a_k := (1/π) ∫_{−π}^π f(y) cos ky dy,  k ∈ N

(notice that a₀/2 is the mean value of f on (−π, π)), and

b_k := (1/π) ∫_{−π}^π f(y) sin ky dy,  k ∈ N, k ≥ 1.

Indeed, the term a₀/2 corresponds to

⟨f, 1/√(2π)⟩ · 1/√(2π),

and the terms a_k cos kx, b_k sin kx for k ≥ 1 correspond respectively to

⟨f, (1/√π) cos kx⟩ (1/√π) cos kx,  ⟨f, (1/√π) sin kx⟩ (1/√π) sin kx.

Formula (5.2) is called the trigonometric Fourier series of f. The Bessel inequality (4.7) reads in this context as follows:

(1/π) ∫_{−π}^π |f(x)|² dx ≥ (1/2)a₀² + ∑_{k=1}^∞ (a_k² + b_k²).    (5.3)

First, we shall find sufficient conditions on f ensuring the pointwise convergence of the series S(x) to f(x) in (−π, π). Then, we shall show that the trigonometric system is complete, so that the inequality above is actually an equality. As shown in Exercise 5.4 and Exercise 5.5, the trigonometric system, the trigonometric series and the form of the coefficients become much nicer and more symmetric in the complex-valued Hilbert space L²((−π, π); C):

f(x) = ∑_{n∈Z} a_n e^{inx},  where a_n := (1/2π) ∫_{−π}^π f(x) e^{−inx} dx.

Remark 5.1 (2T-periodic functions) If f ∈ L²(−T, T) we can write instead

f(x) = ∑_{k=0}^∞ a_k cos(πkx/T) + ∑_{k=1}^∞ b_k sin(πkx/T)

with

a_k := (1/2T) ∫_{−T}^T f(x) dx if k = 0,  a_k := (1/T) ∫_{−T}^T f(x) cos(πkx/T) dx if k > 0,

b_k := (1/T) ∫_{−T}^T f(x) sin(πkx/T) dx.


5.1 Pointwise convergence of the Fourier series

For any integer N ≥ 1 we consider the partial sum

S_N(x) := (1/2)a₀ + ∑_{k=1}^N (a_k cos kx + b_k sin kx),  x ∈ [−π, π).

Since the functions cos kx and sin kx are 2π–periodic, it is natural to extend f to the whole of R as a 2π–periodic function, setting

f(x + 2πn) = f(x),  x ∈ [−π, π), n = ±1, ±2, . . . .

We shall denote in the sequel by H_{l,r}(z) the “Heaviside” function

H_{l,r}(z) := l if z ≤ 0,  H_{l,r}(z) := r if z > 0.

Lemma 5.2 For any integer N ≥ 1 and x, l, r ∈ R we have

S_N(x) − (l + r)/2 = (1/2π) ∫_{−π}^π [(f(x + τ) − H_{l,r}(τ)) / sin(τ/2)] sin[(N + 1/2)τ] dτ.    (5.4)

Proof. Write

S_N(x) = (1/2)a₀ + ∑_{k=1}^N (a_k cos kx + b_k sin kx)
= (1/π) ∫_{−π}^π f(y) [1/2 + ∑_{k=1}^N (cos kx cos ky + sin kx sin ky)] dy
= (1/π) ∫_{−π}^π f(y) [1/2 + ∑_{k=1}^N cos k(x − y)] dy.

To evaluate the sum, we notice that for any z ∈ R

[1/2 + ∑_{k=1}^N cos kz] sin(z/2) = (1/2) [sin(z/2) + ∑_{k=1}^N (sin[(k + 1/2)z] − sin[(k − 1/2)z])] = (1/2) sin[(N + 1/2)z].


Therefore

1/2 + ∑_{k=1}^N cos kz = (1/2) sin[(N + 1/2)z] / sin(z/2)    (5.5)

and so

S_N(x) = (1/2π) ∫_{−π}^π f(y) sin[(N + 1/2)(x − y)] / sin((x − y)/2) dy.    (5.6)

Now, setting τ = y − x we get

S_N(x) = (1/2π) ∫_{−π−x}^{π−x} f(x + τ) sin[(N + 1/2)τ] / sin(τ/2) dτ = (1/2π) ∫_{−π}^π f(x + τ) sin[(N + 1/2)τ] / sin(τ/2) dτ,

since the function under the integral is 2π–periodic. Now, integrating (5.5) over [−π, π] yields

1 = (1/2π) ∫_{−π}^π sin[(N + 1/2)τ] / sin(τ/2) dτ,

so that

(1/π) ∫_0^π sin[(N + 1/2)τ] / sin(τ/2) dτ = 1 = (1/π) ∫_{−π}^0 sin[(N + 1/2)τ] / sin(τ/2) dτ.

If we multiply these two identities by l and r respectively, and subtract the resulting identities from (5.6), (5.4) follows.

Proposition 5.3 (Dini’s test) Let x, l, r ∈ R be such that

∫_{−π}^π |f(x + τ) − H_{l,r}(τ)| / |sin(τ/2)| dτ < ∞.    (5.7)

Then the Fourier series of f converges to (l + r)/2 at x.

Dini’s test shows a remarkable property of the Fourier series: while the specific values of the coefficients a_k and b_k depend on the behaviour of f on the whole interval (−π, π), and the same holds for the Fourier series, the character of the series (convergent or not) at a given point x depends only on the behaviour of f in a neighbourhood of x: indeed, it is this behaviour that influences the integrability of (f(x + τ) − H_{l,r}(τ))/sin(τ/2) (the only singularity being at τ = 0). This is well illustrated in the next example.

Page 77: Probability Book

Chapter 5 73

Example 5.4 Assume that f : [−π, π] → R is L-Lipschitz continuous, i.e.

|f(x)− f(y)| ≤ L|x− y| ∀ x, y ∈ [−π, π]

for some L ≥ 0. Then Dini’s test is fulfilled at any x ∈ R\Zπ choosing l = r = f(x),

and at any x ∈ Zπ choosing l = f(x−) and r = f(x+) (1). Indeed, with these choicesof l and r, the quotient

f(x+ τ)−Hl,r(τ)

sin(τ/2)

is bounded in a neighbourhood of 0.The same conclusions hold when f is α–Holder continuous for some α ∈ (0, 1], i.e.

|f(x)− f(y)| ≤ L|x− y|α, ∀ x, y ∈ [−π, π]

for some L ≥ 0: in this case the quotient is bounded from above, near 0, by thefunction L|τ |α/| sin(τ/2)| ∼ 2L|τ |α−1 which is integrable.

More generally, the argument of the previous example can be used to showthat the Fourier series is pointwise convergent for piecewise C1 functions f : atcontinuity points x the series converges to f(x), and at (jump) discontinuity pointsx it converges to (f(x−) + f(x+))/2. However, the mere continuity of f is notsufficient to ensure pointwise convergence of the Fourier series.

In order to prove Proposition 5.3, we need the following Riemann–Lebesguelemma, a tool interesting in itself.

Lemma 5.5 Let (ek) be an orthonormal system in L2(−π, π). Assume that thereexists M > 0 such that ‖ek‖∞ ≤ M for all k ∈ N. Then for any f ∈ L1(−π, π) wehave

limk→∞

∫ π

−π

f(x)ek(x) dx = 0. (5.8)

Proof. Notice first that if f ∈ L2(−π, π) the conclusion of the lemma is trivial. Wehave in fact in this case ∫ π

−π

f(x)ek(x) dx = 〈f, ek〉

and, since by Bessel’s inequality the series∑∞

1 |〈f, ek〉|2 is convergent, we havelimk〈f, ek〉 = 0.

Let us now consider the general case. We know that bounded continuous func-tions are dense in L1(−π, π), hence for any ε > 0 we can find g ∈ Cb(−π, π) suchthat ‖f − g‖1 < ε. As a consequence

|〈f, ek〉| = |〈f − g, ek〉|+ |〈g, ek〉| ≤Mε+ |〈g, ek〉|(1)here we denote by g(x−), g(x+) the left and right limits of g at x

Page 78: Probability Book

74 Fourier series

and letting k →∞ we obtain lim supk |〈f, ek〉| ≤Mε. Since ε is arbitrary the proofis achieved.

Proof of Proposition 5.3. Set

g(τ) :=f(x+ τ)−Hl,r(τ)

sin(τ/2)∈ L1(−π, π). (5.9)

Then, writing

sin[(N +1

2)t] = sinNt cos

1

2t+ cosNt sin

1

2t

and applying the Riemann–Lebesgue lemma to g cos t/2 (with eN = sinNt) and tog sin(t/2) (with eN = cosNt) we obtain from (5.4) that SN(x) converge to (l+ r)/2.

5.2 Completeness of the trigonometric system

Proposition 5.6 The trigonometric system (5.1) is complete. In particular equalityholds in (5.3).

Proof. We show that the vector space E generated by the trigonometric systemis dense in L2(−π, π). Let H ′ be the closure, in the L2(−π, π) norm, of E, that iseasily seen to be still a vector space as well. We will prove in a series of steps thatH ′ contains larger and larger classe of functions.

Let f : [−π, π] → [0,+∞) be a Lipschitz function, and let us prove that it belongsto H ′. Indeed, we know from Example 5.7 that SN → f pointwise in (−π, π). Onthe other hand, we already know from Proposition 4.10(ii) that the Fourier seriesis convergent in L2(−π, π) to some function g (which is indeed, by Exercise 4.5,the orthogonal projection of f on H ′), therefore a subsequence (SN(k)) is convergingλ-almost everywhere to g. It follows that g = f and SN → f in L2(−π, π).

If now g : [−π, π] → [0,+∞) is continuous, we know that g can be monotonicallyapproximated by the Lipschitz functions

gλ(x) := miny∈[−π,π]

(g(y) + λ|x− y|

), x ∈ [−π, π]

(see Exercise 2.11), that converge to g also in L2(−π, π) by the dominated conver-gence theorem. As a consequence also g belongs to H ′. Since H ′ is invariant byaddition of constants, we proved that all continuous functions in [−π, π] belong toH ′. We conclude using the density of this class of functions in L2(−π, π).

Page 79: Probability Book

Chapter 5 75

Remark 5.7 Let f ∈ L2(−π, π). Then, the Parseval identity reads as follows

1

π

∫ π

−π

|f(x)|2 dx =1

2a2

0 +∞∑

k=1

(a2k + b2k). (5.10)

For instance, taking f(x) = x one finds the following nice relation between π andthe harmonic series with power 2:

∞∑k=1

1

k2=π2

6.

Finally, we notice that there exist other important examples of complete or-thonormal systems, besides the trigonometric one. Some of them are illustrated inthe exercises.

5.3 Uniform convergence of the Fourier series

We conclude by studying the uniform convergence of the Fourier series. We recallthat a series

∑∞0 xn in a Banach space E is said to be totally convergent if the

numerical series∑∞

0 ‖xn‖ is convergent. Using the completeness of E it is notdifficult to check (see Exercise 5.2) that any totally convergent series is convergent(as we have seen in the previous chapter, this means that the finite sums

∑N0 xn

converge in E to a vector, denoted by∑∞

0 xn).Now we show that the Fourier series of C1 functions f with f(−π) = f(π)

are uniformly convergent: the proof uses an important relation (5.11) between theFourier series of a function and the Fourier series of its derivative.

Proposition 5.8 Assume that f ∈ C1([−π, π]) and that f(−π) = f(π). Then theFourier series of f converges uniformly to f in [−π, π].

Proof. We first notice that f is Lipschitz continuous, so that by Proposition 5.3 wehave

f(x) =1

2a0 +

∞∑k=1

(ak cos kx+ bk sin kx) x ∈ [−π, π].

Let us consider the Fourier series of the derivative f ′ of f ,

∞∑k=1

(a′k cos kx+ b′k sin kx) x ∈ [−π, π],

Page 80: Probability Book

76 Fourier series

where, for k ≥ 1 integer (notice that a′0 = 0 because the mean value of f ′ on (−π, π)is 0),

a′k =1

π

∫ π

−π

f ′(y) cos ky dy, b′k =1

π

∫ π

−π

f ′(y) sin ky dy. (5.11)

As easily checked through an integration by parts (using again the fact that f(−π) =f(π)), we have a′k = kbk and b′k = −kak. Then, by the Bessel inequality it followsthat

∞∑k=1

k2(a2k + b2k) =

∞∑k=1

((a′k)2 + (b′k)

2) ≤ 1

π

∫ π

−π

|f ′(x)|2 dx <∞. (5.12)

Therefore the Fourier series of f is totally convergent in C([−π, π]) and thereforeuniformly convergent. We have in fact

∞∑k=1

|ak cos kx+ bk sin kx| ≤∞∑

k=1

(|ak|+ |bk|)

(∞∑

k=1

k2(|ak|+ |bk|)2

)1/2( ∞∑k=1

k−2

)1/2

<∞.

Page 81: Probability Book

Chapter 5 77

EXERCISES

5.1 Check that the trigonometric system (5.1) is orthogonal.5.2 Let E be a Banach space. Show that any totally convergent series

∑n xn, with (xn) ⊂ E, is

convergent. Moreover, ∥∥∥ ∞∑n=0

xn

∥∥∥ ≤ ∞∑n=0

‖xn‖. (5.13)

Hint: estimate ‖∑N

0 xn −∑M

0 xn‖ with the triangle inequality.5.3 Prove that the following systems on L2(0, π) are orthonormal and complete√

sin kx, k ≥ 1,

and1√π

;

√2π

cos kx, k ≥ 1.

5.4 Show thatek(x) :=

1√2πeikx, k ∈ Z

is a complete orthonormal system in L2 ((−π, π);C). Hint: in order to show completeness, considerfirst the cases where f is real-valued or if is real-valued.5.5 Let (ek) be as in Exercise 5.4. Using the Parseval identity show that∫ π

−π

|f(x)|2 dx =12π

∑k∈Z

(∫ π

−π

f(x)e−ikx dx

)2

∀f ∈ L2 ((−π, π);C) .

5.6 Let f ∈ L2 ((−π, π);C) and let SNf =∑N

−N 〈f, ek〉ek, with N ≥ 1, be the Fourier sumscorresponding to the complete orthonormal system in Exercise 5.4. Show that

f(x)−SNf(x) =∫ π

−π

GN (x−y)(f(x)−f(y)) dy with GN (z) :=12π

(1 + 2Re

[ei(N+1)z − eiz

eiz − 1

]).

Hint: use the identities z+ z = 2Re(z), e−iy = eiy,∑N

0 eiky =∑N

0 (eiy)k = (ei(N+1)z−1)/(eiy−1).5.7 Arguing as in Remark 5.7, show that

∑∞1 k−4 = π4/90. Hint: consider the function f(x) = x2.

5.8 Chebyschev polynomials Cn in L2(a, b), with (a, b) bounded interval, are the ones obtainedby applying the Gram-Schmidt procedure to the vectors 1, x, x2, x3, . . .. They are also calledLegendre polynomials when (a, b) = (−1, 1).

(a) Compute explicitly the first three Legendre polynomials.

(b) Show that Cnn∈N is a complete orthonormal system. Hint: use the density of polynomialsin C([a, b]).

(c) ? Show that the n-th Legendre polynomial Pn is given by

Pn(x) =

√2n+ 1

21

2nn!dn

dnx(x2 − 1)n.

Page 82: Probability Book

78 Fourier series

Page 83: Probability Book

Chapter 6

Operations on measures

In this chapter we collect many useful tools in Analysis and Probability that will bewidely used in the following chapters. We will study the product of measures (bothfinite and countable), the product of measures by L1 functions, the Radon–Nikodymtheorem, the convergence of measures on the real line R and the Fourier transform.

6.1 The product measure and Fubini–Tonelli the-

orem

Let (X,F ) and (Y,G ) be measurable spaces. Let us consider the product spaceX × Y . A set of the form A× B, where A ∈ F and B ∈ G , is called a measurablerectangle. We denote by R the family of all measurable rectangles. R is obviouslya π–system. The σ–algebra generated by R is called the product σ–algebra of Fand G . It is denoted by F × G .

Given σ–finite measures µ in (X,F ) and ν in (Y,G ), we are going to define theproduct measure µ× ν in (X × Y,F × G ).

First, for any E ∈ F × G we define the sections of E, setting for x ∈ X andy ∈ Y ,

Ex := y ∈ Y : (x, y) ∈ E, Ey := x ∈ X : (x, y) ∈ E.

Proposition 6.1 Assume that µ and ν are σ–finite and let E ∈ F × G . Then thefollowing statements hold.

(i) Ex ∈ G for all x ∈ X and Ey ∈ F for all y ∈ Y .

(ii) The functions

x 7→ ν(Ex), y 7→ µ(Ey),

79

Page 84: Probability Book

80 Operations on measures

are F–measurable and G –measurable respectively. Moreover,∫X

ν(Ex) dµ(x) =

∫Y

µ(Ey) dν(y). (6.1)

Proof. We shall first prove both statements in the case when µ and ν are finite.Assume first that E = A×B is a measurable rectangle. Then, if (x, y) ∈ X × Y wehave

Ex =

B if x ∈ A∅ if x /∈ A, Ey =

A if y ∈ B∅ if y /∈ B.

Consequently,

ν(Ex) = 1A(x)ν(B), µ(Ey) = 1B(y)µ(A),

so that (6.1) clearly holds.

Now, let D be the family of all E ∈ F × G such that (i) is fulfilled. Clearly,D is a Dynkin system including the π–system R. Therefore, (i) follows from theDynkin theorem. Analogously, let D be the family of all E ∈ F × G such that(ii) is fulfilled. Clearly, D is a Dynkin system including the π–system R (stabilityunder complement follows by the identities ν((Ec)x) = ν(Y )−ν(Ex) and µ((Ec)y) =µ(X)− µ(Ey)). Therefore, (ii) follows from the Dynkin theorem as well.

In the general σ–finite case we argue by approximation: if E ∈ F×G , F 3 Xh ↑X and G 3 Yh ↑ Y satisfy µ(Xh) <∞ and ν(Yh) <∞, we define the σ–algebras

Fh := A ⊂ Xh : A ∈ F , Gh := B ⊂ Yh : B ∈ G .

Then,

B ∩Xh × Yh ∈ Fh × Gh for all B ∈ F × G ,

because the class of sets B ⊂ X×Y satisfying B∩Xh×Yh ∈ Fh×Gh is a σ–algebracontaining the measurable rectangles. Now we apply (ii) to the sets Eh = E∩Xh×Yh,which belong to Fh × Gh. Passing to the limit as h→∞, the continuity propertiesof measures give the measurability in the limit and (6.1) as well.

Theorem 6.2 (Product measure) If µ and ν are σ–finite, there exists a uniquemeasure λ in (X × Y,F × G ) satisfying

λ(A×B) = µ(A)ν(B) for all A ∈ F , B ∈ G .

We denote λ by µ × ν. Furthermore µ × ν is σ–finite and is finite (respectively aprobability measure) if both µ and ν are finite (respectively, probability measures).

Page 85: Probability Book

Chapter 6 81

Proof. Existence is easy: we set

λ(E) =

∫X

ν(Ex) dµ(x) =

∫Y

µ(Ey) dν(y), E ∈ F × G . (6.2)

Using the continuity and additivity properties of the integral, it is immediate tocheck that λ is a measure on (X × Y,F × G ). In the case of σ–finite measures,uniqueness follows by the the coincidence criterion for positive measures stated inProposition 1.15: indeed, the value of the product measure is uniquely determinedon the π–system K made by rectangles A×B with µ(A) and ν(B) finite, and thanksto the σ–finiteness assumption there exist En = An×Bn ∈ K with En ↑ X ×Y .

Corollary 6.3 Let E ∈ F × G be such that µ × ν(E) = 0. Then µ(Ey) = 0 forν–almost all y ∈ Y and ν(Ex) = 0 for µ–almost all x ∈ X.

Proof. It follows directly from (6.2).

We consider here the measure space (X × Y,F × G , λ), where λ = µ× ν and µand ν are σ–finite.

Theorem 6.4 (Fubini–Tonelli) Let F : X×Y → [0,+∞] be a F×G –measurablemapping. Then the following statements hold.

(i) For any x ∈ X (resp. y ∈ Y ), the function y 7→ F (x, y) (resp. x 7→ F (x, y))is G –measurable (resp. F–measurable).

(ii) The functions

x 7→∫

Y

F (x, y) dν(y), y 7→∫

X

F (x, y) dµ(x)

are respectively F–measurable and G –measurable.

(iii) We have ∫X×Y

F (x, y) dλ(x, y) =

∫X

[∫Y

F (x, y) dν(y)

]dµ(x)

=

∫Y

[∫X

F (x, y) dµ(x)

]dν(y).

(6.3)

Proof. Assume first that F = 1E, with E ∈ F × G . Then we have

F (x, y) = 1Ex(y), x ∈ X, F (x, y)(x) = 1Ey(x), y ∈ Y,

so (i), (ii) and (iii) follow from Proposition 6.1. Consequently, by linearity, (i)–(iii)hold when F is a simple function. If F is general, it is enough to approximate it bya monotonically increasing sequence of simple functions and then pass to the limitusing the monotone convergence theorem.

Page 86: Probability Book

82 Operations on measures

Remark 6.5 (The definition of integral revisited) We noticed in Remark 2.13that the integral of nonnegative functions can also be defined without using thearchimedean integral, by considering minorant simple functions. If we follow thisapproach, the identity that we used to define the integral can be derived by applyingthe Fubini–Tonelli theorem to the subgraph

E := (x, t) ∈ X × R : 0 < t < f(x) ,

with the product measure µ × λ, λ being the Lebesgue measure. Indeed, it is notdifficult to show that E is F ×B(R)–measurable whenever f is F -measurable, sothat∫ ∞

0

µ(f > t) dt =

∫ ∞

0

µ(Et) dt = µ× λ(E) =

∫X

λ(Ex) dµ(x) =

∫X

f(x) dµ(x).

Of course, splitting F in positive and negative parts, also the case of extendedreal valued maps can be considered:

Corollary 6.6 Let F : X×Y → [−∞,+∞] be a F×G –measurable mapping. ThenF is µ× ν–integrable if and only if:

(i) for µ–a.e. x ∈ X the function y 7→ F (x, y) is ν–integrable;

(ii) the function x 7→∫

Y|F (x, y)| dν(y) is µ–integrable.

If these conditions hold, we have∫X×Y

F (x, y) dµ× ν(x, y) =

∫X

[∫Y

F (x, y) dν(y)

]dµ(x). (6.4)

Notice that, strictly speaking, the function in (ii) is defined only out of a µ–negligible set; by µ–integrability of it we mean µ–integrability of any F–measurableextension of it (for instance we may set it equal to 0 wherever

∫Y|F (x, y)| dν(y) is

not finite).

Remark 6.7 (Finite products) The previous constructions extend without anydifficulty to finite products of measurable spaces (Xi,Fi). Namely, the product

σ-algebra F :=×n

i Fi in the cartesian product X :=×n

1 Xi is generated by therectangles

A1 × · · · × An : Ai ∈ Fi, 1 ≤ i ≤ n .Furthermore, if µi are σ–finite measures in (Xi,Fi), integrals with respect to the

product measure µ =×n

1 µi are defined by∫X

F (x) dµ(x) =

∫X1

∫X2

· · ·∫

Xn

F (x1, . . . , xn) dµn(xn) · · · dµ2(x2) dµ1(x1),

Page 87: Probability Book

Chapter 6 83

and any permutation in the order of the integrals would produce the same result.Finally, the product measure is uniquely determined, in the σ–finite case, by theproduct rule

µ (A1 × · · · × An) =n

Πi=1

µi(Ai) Ai ∈ Fi, 1 ≤ i ≤ n.

It is also not hard to show that the product is associative, both at the level ofσ–algebras and measures, see Exercise 6.1.

6.2 The Lebesgue measure on Rn

This section is devoted to the construction, the characterization and the main prop-erties of the Lebesgue measure in Rn, i.e. the length measure in R1, the area measurein R2, the volume measure in R3 and so on.

Definition 6.8 (Lebesgue measure in Rn) Let us consider the measure space(R,B(R),L 1), where L 1 is the Lebesgue measure on (R,B(R)). Then, we can

define the measure space (Rn,n

×i=1

B(R),L n) with L n :=×n

1 L 1. We say that L n

is the Lebesgue measure on Rn.

Since (see Exercise 6.2)

B(Rn) =n

×i=1

B(R),

we can equivalently consider L n as a measure in (Rn,B(Rn)), forgetting its construc-tion as a product measure (indeed, there exist alternative and direct constructionsof L n independent of the concept of product measure).

As in the one-dimensional case, we will keep using the classical notation∫E

f(x) dx E ⊂ Rn, f : E → R

for integrals with respect to Lebesgue measure L n (or Riemann integrals in morethan one independent variable).

In the computation of Lebesgue integrals a particular role is sometimes played bythe dimensional constant ωn = L n(B(0, 1)) (so that ω1 = 2, ω2 = π, ω3 = 4π/3,. . . ).A general formula for the computation of ωn can be given using Euler’s Γ function:

Γ(z) :=

∫ ∞

0

tz−1e−t dt z > 0.

Page 88: Probability Book

84 Operations on measures

Indeed, we have

ωn =πn/2

Γ(n2

+ 1). (6.5)

A proof of this formula, based on the identity Γ(z + 1) = zΓ(z) (which gives alsoΓ(n) = (n− 1)! for n ≥ 1 integer) is proposed in Exercise 6.7.

We are going to show that L n is invariant under translations and rotations. Forthis we need some notation. For any a ∈ Rn and any δ > 0 we set

Q(a, δ) :=x ∈ Rn : ai ≤ x < ai + δ, ∀ i = 1, . . . , n

=

n

×i=1

[ai, ai + δ).

Q(a, δ) is called the δ–box with corner at a. For all N ∈ N we consider the family

QN = Q(2−Nk, 2−N) : k = (k1, . . . , kn) ∈ Zn.

It is also clear that each box in QN is Borel and that its Lebesgue measure is 2−nN .Now we set

Q =∞⋃

N=0

QN .

It is clear that all boxes in QN are mutually disjoint and that their union is Rn.Furthermore, if N < M , Q ∈ QN and Q′ ∈ QM , then either Q′ ⊂ Q or Q∩Q′ = ∅.

Lemma 6.9 Let U be a non empty open set in Rn. Then U is the disjoint union ofboxes in Q.

Proof. For any x ∈ U , let Qx ∈ Q be the biggest box such that x ∈ Qx ⊂ U . Thisbox is uniquely defined: indeed, fix an x; for any m there is only one box Qx,m ∈ Qm

such that x ∈ Qx,m; moreover, since U is open, for m large enough Qx,m ⊂ U ; wecan then define Qx = Qx,m where m is the smallest integer m such that Qx,m ⊂ U .

This family Qxx∈U is a partition of U , that is, for any x, y ∈ U , either Qx = Qy

or Qx∩Qy = ∅; indeed, if we suppose that Qx∩Qy 6= ∅, then one of the two boxes iscontained in the other, say Qx ⊂ Qy. This leads to x ∈ Qx ⊂ Qy ⊂ U , contradictingthe definition of Qx unless Qx = Qy.

From Lemma 6.9 it follows easily that the σ–algebra generated by Q coincideswith B(Rn).

Proposition 6.10 (Properties of the Lebesgue measure) The following state-ments hold.

(i) (translation invariance) For any E ∈ B(Rn), x ∈ Rn we have L n(E + x) =L n(E), where

E + x = y + x : y ∈ Rn.

Page 89: Probability Book

Chapter 6 85

(ii) If µ is a translation invariant measure on (Rn,B(Rn)) such that µ(K) < ∞for any compact set K, there exists a number Cµ ≥ 0 such that

µ(E) = CµLn(E) ∀ E ∈ B(Rn).

(iii) (rotation invariance) For any orthogonal matrix R ∈ L(Rn;Rn) we have

L n(R(E)) = L n(E) ∀ E ∈ B(Rn).

(iv) For any T ∈ L(Rn;Rn) we have

L n(T (E)) = |detT |L n(E) ∀ E ∈ B(Rn).

Proof. Fix x ∈ Rn. The measures L n(E) and L n(E+x) coincide on the π–systemof boxes; thanks to Lemma 6.9, this π–system generates the Borel σ–algebra, sothat the coincidence criterion for measures stated in Proposition 1.15 gives thatL n(E) = L n(E + x) for all Borel sets E.Let us prove (ii). Let Q0 ∈ Q0 and set Cµ = µ(Q0). Since Q0 is included in acompact set, we have Cµ < ∞. Since µ is translation invariant, all boxes in Q0

have the same µ measure. Now, let QN ∈ QN . Since Q0 is the disjoint union of2−nN boxes in QN which have all the same µ measure (again by the translationinvariance) we have that

µ(QN) = CµLn(QN).

So, Lemma 6.9 gives that µ(A) = CµL n(A) for any open set, and therefore for anyBorel set.Let us now prove (iii). By the translation invariance of L n, the measure µ(E) =L n(R(E)) is easily seen to be translation invariant (because R(E + z) = R(E) +R(z)), hence L n(R(E)) = CL n(E) for some contant C. We can identify theconstant C choosing E equal to the unit ball, finding C = 1.Finally, let us prove (iv). By polar decomposition we can write T = R S withS =

√T ∗ T symmetric and nonnegative definite, and R orthogonal. Notice that

on one hand |detT | = detS (because detR ∈ −1, 1) and on the other hand, by(iii) we have

L n(T (E)) = L n(R(S(E))) = L n(S(E)).

Hence, it suffices to show that L n(S(E)) = detSL n(E) for any symmetric andnonnegative definite matrix S. By the translation invariance of L n(S(E)) thereexists a constant C such that L n(S(E)) = CL n(E) for any Borel set E. In thiscase we can identify the constant C choosing as E a suitable n-dimensional cube:

Page 90: Probability Book

86 Operations on measures

denoting by (ei) an orthonormal basis of eigenvectors of S, with eigenvalues αi ≥ 0(whose product is detS), choosing

E =

n∑

i=1

ciei : |ci| ≤ 1

, so that S(E) =

n∑

i=1

αiciei : |ci| ≤ 1

,

the rotation invariance of L n gives L n(E) = 1 and L n(S(E)) = α1 · · ·αn. There-fore C = detS and the proof is complete.

6.3 Countable products

We are here concerned with a sequence (Xi,Fi, µi), i = 1, 2, . . ., of probabilityspaces. We denote by X the product space

X :=∞

×k=1

Xk

and by x = (xk) the generic element of X.We are going to define a σ–algebra of subsets of X. Let us first introduce the

cylindrical sets in X. A cylindrical set In,A is a set of the following form

In,A = x : (x1, . . . , xn) ∈ A,

where n ≥ 1 is an integer and A ∈ ×n

1 Fk. This representation is not unique;however, since

In,A = A×∞

×k=n+1

Xk

we have that In,A = Im,B with n < m implies B = A×Xn+1 × · · · ×Xm.We denote by C the family of all cylindrical sets of X. Notice also that

Icn,A = In,Ac ,

so that C is stable under complement. If In,A and Im,B belong to C we can assumeby the previous remarks that m = n, so that In,A ∪ In,B = In,A∪B belongs to C .Therefore C is an algebra.

The σ–algebra generated by C is called the product σ–algebra of Fi. It is denotedby

×k=1

Fk.

Page 91: Probability Book

Chapter 6 87

Now we define a function µ on C , setting

µ(In,A) =

( n

×k=1

µk

)(A), In,A ∈ C . (6.6)

This definition is well posed, again thanks to the fact that In,A = Im,B with n < mimplies B = A×Xn+1 × · · · ×Xm. It is easy to check that µ is additive: indeed, ifIn,A and Im,B are disjoint, using the previous remark we can assume with no loss ofgenerality that n = m, and therefore the equality µ(In,A ∪ In,B) = µ(In,A) + µ(In,B)follows by ( n

×k=1

µk

)(A ∪B) =

( n

×k=1

µk

)(A) +

( n

×k=1

µk

)(B).

We set

µ :=∞

×k=1

µk.

Theorem 6.11 The set function µ defined in (6.6) is σ–additive on C and there-fore, by the Caratheodory theorem, it has a unique extension to a probability measureon (X,×∞

1 Fk) that is denoted by

×k=1

µk

Proof. To prove the σ–additivity of µ it is enough to show the continuity of µ at∅, or equivalently the implication

(Ej) ⊂ C , (Ej) nonincreasing, µ(Ej) ≥ ε0 > 0 =⇒∞⋂

n=1

Ej 6= ∅. (6.7)

In the following we are given a nonincreasing sequence (Ej) on C such that µ(Ej) ≥ε0 > 0. To prove (6.7), we need some more notation. We set

X(n) =∞

×k=n+1

Xk, µ(n) =∞

×k=n+1

µk, n ≥ 1,

and consider the sections of Ej defined as

Ej(x1) =x(1) ∈ X(1) : (x1, x

(1)) ∈ Ej

, x1 ∈ X1.

Ej(x1) is a cylindrical subset of X(1) and by the Fubini theorem we have

µ(Ej) =

∫X1

µ(1)(Ej(x1)) dµ1(x1) ≥ ε0 > 0, j ≥ 1. (6.8)

Page 92: Probability Book

88 Operations on measures

Set nowFj,1 =

x1 ∈ X1 : µ(1)(Ej(x1)) ≥

ε0

2

, j ≥ 1.

Then Fj,1 is not empty and by (6.8) we have

µ(Ej) =

∫Fj,1

µ(1)(Ej(x1)) dµ1(x1) +

∫F c

j,1

µ(1)(Ej(x1)) dµ1(x1)

≤ µ1(Fj,1) +ε0

2.

Therefore µ1(Fj,1) ≥ ε0/2 for all j ≥ 1.Obviously (Fj,1) is a nonincreasing sequence of subsets of X1. Since µ1 is σ–

additive, it is continuous at 0. Therefore, there exists α1 ∈⋂∞

1 Fj,1 and so

µ(1)(Ej(α1)) ≥ε0

2, j ≥ 1. (6.9)

Consequently we haveEj(α1) 6= ∅, j ≥ 1. (6.10)

Now we iterate the procedure: for any x2 ∈ X2 we consider the section

Ej(α1, x2) =x(2) ∈ X(2) : (α1, x2, x

(2)) ∈ Ej

, j ≥ 1.

By the Fubini theorem we have

µ(1)(Ej(α1)) =

∫X2

µ(2)(Ej(α1, x2)) dµ2(x2). (6.11)

We setFj,2 =

x2 ∈ X2 : µ(2)(Ej(α1, x2)) ≥

ε0

4

, j ≥ 1.

Then by (6.9) and (6.10) we have

ε0

2≤ µ(1)(Ej(α1)) =

∫X2

µ(2)(Ej(α1, x2)) dµ2(x2)

=

∫Fj,2

µ(2)(Ej(α1, x2)) dµ2(x2) +

∫[Fj,2]c

µ(2)(Ej(α1, x2)) dµ2(x2)

≤ µ2(Fj,2) +ε0

4.

Therefore µ2(Fj,2) ≥ ε0/4. Since (Fj,2) is nonincreasing and µ2 is σ–additive, thereexists α2 ∈ X2 such that

µ2(Ej(α1, α2)) ≥ε0

4, j ≥ 1,

Page 93: Probability Book

Chapter 6 89

and consequently we haveEj((α1, α2)) 6= ∅. (6.12)

Arguing in a similar way we see that there exists a sequence (αk) ⊂ X such that

Ej(α1, . . . , αn) 6= ∅, for all j, n ≥ 1, (6.13)

where

Ej(α1, . . . , αn) =x ∈ X(n) : (α1, . . . , αn, x

(n)) ∈ Ej

, j, n ≥ 1.

This implies, as easily seen, that (αn) ∈⋂∞

1 Ej. Therefore⋂∞

1 Ej is not empty, asrequired.

Page 94: Probability Book

90 Operations on measures

EXERCISES

6.1 Let (X1,F1), (X2,F2), (X3,F3) be measurable spaces. Show that

(F1 ×F2)×F3 = F1 × (F2 ×F3).

If we are given measures µi in Fi, i = 1, 2, 3, show also that (µ1 × µ2)× µ3 = µ1 × (µ2 × µ3).6.2 Let us consider the measurable spaces (R,B(R)), (Rn,B(Rn)). Show that

B(Rn) =n

×i=1

B(R).

Hint: to show the inclusion ⊂, use Lemma 6.9.6.3 Let Ln be the σ–algebra of Lebesgue measurable sets in Rn. Show that

L1 ×L1 ( L2.

Hint: to show the strict inclusion, consider the set E = F × 0, where F ⊂ R is not Lebesguemeasurable.

6.4 Show that the product σ–algebra is also generated by the family of products×∞1 Ai where

Ai ∈ Fi and Ai 6= Xi only for finitely many i.6.5 Writing properly L 3 as a product measure, compute L 3(T ), where

T =(x, y, z) : x2 + y2 < r2 and y2 + z2 < r2

.

Answer: 16r3/3.6.6 [Computation of ωn] Find a recursive formula linking ωn to ωn−2, and use it to show thatω2k = πk/k! and ω2k+1 = 2k+1πk/(2k + 1)!!, where (2k + 1)!! is the product of all odd integersbetween 1 and 2k + 1. Hint: use the Fubini–Tonelli theorem.6.7 Use Exercise 6.6 and the identities Γ(1) = 1, Γ(1/2) =

√π and Γ(z+1) = zΓ(z) to show (6.5).

6.8 Let µ and ν be σ–finite measures on (X,F ) and (Y,G ) respectively and let λ = µ × ν. LetE = (F × G )λ, as defined in Definition 1.12, and let ζ be the extension of λ to E . Show thisversion of the Fubini–Tonelli Theorem 6.4: for any E –measurable function F : X × Y → [0,+∞]the following statements hold:

(i) for µ–a.e x ∈ X the function y 7→ F (x, y) is ν–measurable;

(ii) the function x 7→∫

Y

F (x, y) dν(y), set to zero at all points x such that y 7→ F (x, y) is not

ν–integrable, is µ–measurable;

(iii)∫

X

∫YF (x, y) dµ(x) dµ(y) =

∫X×Y

F (x, y) dζ(x, y).

6.9 Using the notation of the Fubini-Tonelli theorem, let X = Y = [0, 1], F = G = B([0, 1]), letµ be the Lebesgue measure and let ν be the counting measure. Let D = (x, x) : x ∈ [0, 1] bethe diagonal in X × Y ; check that

∫Xν(Dx) dµ(x) 6=

∫Yµ(Dy) dν(y).

6.10 ? Let (fh) be converging to f in L1(X×Y, µ× ν). Show the existence of a subsequence h(k)such that fh(k)(x, ·) converge to f(x, ·) in L1(Y, ν) for µ–a.e. x ∈ X. Show by an example that, ingeneral, this property is not true for the whole sequence.

Page 95: Probability Book

Chapter 6 91

6.4 Comparison of measures

In this section we study some relations between measures in a measurable space(X,F ).

The first (immediate) one is the order relation: viewing measures as set functions,we say that µ ≤ ν if µ(B) ≤ ν(B) for all B ∈ F . It is not hard to see that the spaceof measures endowed with this order relation is a complete lattice (see Exercise 6.13):in particular

µ ∨ ν(B) = sup µ(A1) + ν(A2) : A1, A2 ∈ F , (A1, A2) partition of B

and

µ ∧ ν(B) = inf µ(A1) + ν(A2) : A1, A2 ∈ F , (A1, A2) partition of B .

Another relation between measures is linked to the concept of product of afunction by a measure.

Definition 6.12 Let µ be a measure in (X,F ) and let f ∈ L1(X,F , µ) be nonneg-ative. We set

fµ(B) :=

∫B

f dµ ∀B ∈ F . (6.14)

It is immediate to check, using the additivity and the continuity properties ofthe integral, that fµ is a finite measure. Furthermore, the following simple ruleprovides a way for the computation of integrals with respect to fµ:∫

X

h d(fµ) =

∫X

hf dµ, (6.15)

whenever h is F–measurable and nonnegative (or hf is µ–integrable, see Exer-cise 6.11). It suffices to check the identity (7.4) on characteristic functions h = 1B

(and in this case it reduces to (6.14)), and then for simple functions. The monotoneconvergence theorem then gives the general result.

Notice also that, by definition, fµ(B) = 0 whenever µ(B) = 0. We formalizethis relation between measures in the next definition.

Definition 6.13 (Absolute continuity) Let µ, ν be measures in F . We say thatν is absolutely continuous with respect to µ, and write ν µ, if all µ–negligible setsare ν–negligible, i.e.

µ(A) = 0 =⇒ ν(A) = 0.

Page 96: Probability Book

92 Operations on measures

For finite measures, the absolute continuity property can also be given in a(seemingly) stronger way, see Exercise 6.14.

The following theorem shows that absolute continuity of ν with respect to µ isnot only necessary, but also sufficient to ensure the representation ν = fµ.

Theorem 6.14 (Radon–Nikodym) Let µ and ν be finite measures on (X,F )such that ν µ. Then there exists a unique nonnegative ρ ∈ L1(X,F , µ) such that

ν(E) =

∫E

ρ(x) dµ(x) ∀E ∈ F . (6.16)

We are going to show a more general result, whose statement needs two moredefinitions. We say that a measure µ is concentrated on a F–measurable set A ifµ(X \ A) = 0. For instance, the Dirac measure δa is concentrated on a, andthe Lebesgue measure in R is concentrated on the irrational numbers, and fµ isconcentrated (whatever µ is) on f 6= 0.

Definition 6.15 (Singular measures) Let µ, ν be measures in (X,F ). We saythat µ is singular with respect to ν, and write µ ⊥ ν, if there exist disjoint F–measurable sets A, B such that µ is concentrated on A and ν is concentrated onB.

The relation of singularity, as stated, is clearly symmetric. However, it can alsobe stated in a (seemingly) asymmetric way, by saying that µ ⊥ ν if µ is concentratedon a ν–negligible set A (just take B = Ac to see the equivalence with the previousdefinition).

Example 6.16 Let X = R, F = B(R), µ the Lebesgue measure on (X,F ) andν = δx0 the Dirac measure at x0 ∈ R. Then µ is concentrated on A := R \ x0,whereas ν is concentrated on B := x0. So, µ and ν are singular.

Theorem 6.17 (Lebesgue) Let µ and ν be measures on (X,F ), with µ σ–finiteand ν finite. Then the following assertions hold.

(i) There exist two unique finite measures νa and νs on (X,F ) such that

ν = νa + νs, νa µ, νs ⊥ µ. (6.17)

(ii) There exists a unique ρ ∈ L1(X,F , µ) such that νa = ρµ.

Page 97: Probability Book

Chapter 6 93

(6.17) is called the Lebesgue decomposition of ν with respect to µ. The functionρ in (ii) is called the density of ν with respect to µ and it is sometimes denoted by

ρ : =dν

dµ.

Radon–Nikodym theorem simply follows by Legesgue theorem noticing that, in thecase when ν µ the uniqueness of the decomposition gives νa = ν and νs = 0, sothat ν = νa = ρµ.

Proof of Theorem 6.17. We assume first that also µ is finite. Set λ = µ + νand notice that, obviously, µ λ and ν λ. Define a linear functional F onL2(X,F , λ) setting

F (ϕ) :=

∫X

ϕ(x) dν(x), ϕ ∈ L2(X,F , λ).

The functional F is well defined and bounded (and consequently continuous) since,in view of the Holder inequality, we have

|F (ϕ)| ≤∫

X

|ϕ(x)| dν(x) ≤∫

X

|ϕ(x)| dλ(x) ≤ [λ(X)]1/2 ‖ϕ‖L2(X,F ,λ).

Now, thanks to Riesz theorem, there exists a unique function f ∈ L2(X,F , λ) suchthat ∫

X

ϕ(x) dν(x) =

∫X

f(x)ϕ(x) dλ(x) ∀ϕ ∈ L2(X,F , λ). (6.18)

Setting ϕ = 1E, with E ∈ F , yields

ν(E) =

∫E

f(x) dλ(x) ≥ 0,

which implies, by the arbitrariness of E, f(x) ≥ 0, λ–a.e. and, in particular, both µ–a.e. and ν–a.e. In the sequel we shall assume, possibly modifying f in a λ–negligibleset, and preserving the validity of (6.18), that f ≥ 0 everywhere. By (6.18) it follows∫

X

ϕ(x)(1− f(x)) dν(x) =

∫X

f(x)ϕ(x) dµ(x) ∀ϕ ∈ L2(X,F , λ). (6.19)

Setting ϕ = 1E, with E ∈ F , yields∫E

(1− f(x)) dν(x) =

∫E

f(x) dµ(x) ≥ 0

Page 98: Probability Book

94 Operations on measures

because f ≥ 0. Thus, being E arbitrary, we obtain that f(x) ≤ 1 for ν–a.e. x ∈ X.Set now

A := x ∈ X : 0 ≤ f(x) < 1, B := x ∈ X : f(x) ≥ 1,

so that (A,B) is a F–measurable partition of X, and

νa(E) := ν(E ∩ A), νs(E) := ν(E ∩B) ∀E ∈ F ,

so that νa = 1Aν is concentrated on A, νs = 1Bν is concentrated on B and ν =νa + νs.

Then, setting in (6.19) ϕ = 1B, we see that

µ(B) ≤∫

B

f dµ =

∫B

(1− f) dν = 0

because f = 1 ν–a.e. on B. It follows that νs is singular with respect to µ.We show now that the existence of ρ such that νa = ρµ. Heuristically, this can be

obtained choosing in (6.19) the function ϕ = (1− f)−11E∩A, but since this functionneed not be in L2(X,F , λ) we argue by approximation: set in (6.19)

ϕ(x) = (1 + f(x) + · · ·+ fn(x))1E∩A(x)

where n ≥ 1 and E ∈ F . Then we obtain∫E∩A

(1− fn+1(x)) dν(x) =

∫E∩A

[f(x) + f 2(x) + · · ·+ fn+1(x)] dµ(x).

Set ρ(x) = 0 for x ∈ B and

ρ(x) := limn→∞

[f(x) + f 2(x) + · · ·+ fn+1(x)] =f(x)

1− f(x), x ∈ A.

Then, by the monotone convergence theorem it follows that

νa(E) = ν(E ∩ A) =

∫E∩A

ρ(x) dµ(x) =

∫E

ρ(x) dµ(x).

Setting E = X we see that ρ ∈ L1(X,F , µ), and the arbitrariness of E gives thatνa = ρµ.

Now we consider the case when µ is σ–finite. In this case there exists a sequenceof pairwise disjoint sets (Xn) ⊂ F such that

X =∞⋃

n=0

Xn with µ(Xn) <∞.

Page 99: Probability Book

Chapter 6 95

Let us apply Theorem 6.17 to the finite measures µn = 1Xnµ, νn = 1Xnν. For anyn ∈ N let νn = (νn)a + (νn)s = ρnµn + (νn)s be the Lebesgue decomposition of νn

with respect to µn. Now, set

νa :=∞∑

n=0

(νn)a, νs :=∞∑

n=0

(νn)s, ρ :=∞∑

n=0

ρn1Xn .

Sincek∑

n=0

(νn)a + (νn)s =k∑

n=0

νn = 1∪k0Xn

ν,

we can pass to the limit as k →∞ to obtain that νa and νs are finite measures, andν = νa + νs. Moreover, for any E ∈ F we have, using the monotone convergencetheorem,

νa(E) =∞∑

n=0

(νn)a(E) =∞∑

n=0

∫E

ρn(x) dµn(x) =

∫E

∞∑n=0

ρn(x)1Xn dµ(x) =

∫E

ρ(x) dµ(x).

So, νa µ, and setting E = X we see that ρ is integrable with respect to µ. Finally,it is easy to see that νs ⊥ µ, because if we denote by Bn ∈ F µ–negligible sets where(νn)s are concentrated, we have that νs is concentrated on the µ–negligible set ∪nBn.

Finally, let us prove the uniqueness of νa and νs: assume that

ν = νa + νs = ν ′a + ν ′s

and let B, B′ be µ–negligible sets where νs and ν ′s are respectively concentrated.Then, as B ∪B′ is µ–negligible and both νs and ν ′s are concentrated on B ∪B′, forany set E ∈ F we have

νs(E) = νs(E ∩ (B ∪B′)) = ν(E ∩ (B ∪B′)) = ν ′s(E ∩ (B ∪B′)) = ν ′s(E).

It follows that νs = ν ′s and therefore νa = ν ′a.

Remark 6.18 If µ is not σ–finite then the Lebesgue decomposition does not holdin general. Consider for instance the case when X = [0, 1], F = B([0, 1]), µ is thecounting measure and ν = L 1. Then ν µ (as the only µ–negligible set is theempty set) but there is no ρ : [0, 1] → [0,∞] satisfying

ν(E) =

∫E

ρ dµ.

Indeed, this function should be µ-integrable and therefore it can be nonzero only ina set at most countable.

Page 100: Probability Book

96 Operations on measures

EXERCISES

6.11 Show that a F–measurable function h is fµ–integrable if and only if fh is µ–integrable.6.12 Show that (fµ) ∨ (gµ) = (f ∨ g)µ and (fµ) ∧ (gµ) = (f ∧ g)µ whenever f, g ∈ L1(X,F , µ)are nonnegative.6.13 Let µii∈I be a family of measures in (X,F ). Show that

µ(B) := inf

∞∑k=0

µi(k)(Bk) : i : N→ I, (Bk) countable F–measurable partition of B

is the greatest lower bound of the family µii∈I , i.e. µ ≤ µi for all i ∈ I and it is the largestmeasure with this property. Show also that

µ(B) := sup

∞∑k=0

µi(k)(Bk) : i : N→ I, (Bk) countable F–measurable partition of B

is the smallest upper bound of the family µii∈I , i.e. µ ≥ µi for all i ∈ I and it is the smallestmeasure with this property.6.14 Let µ, ν be measures in (X,F ) with ν finite. Then ν µ if and only if for all ε > 0 thereexists δ > 0 such that

A ∈ F , µ(A) < δ =⇒ ν(A) < ε.

6.15 Assume that ν µ and that ν ⊥ µ. Show that ν = 0.6.16 Assume that σ ≤ µ+ ν and that σ ⊥ ν. Show that σ ≤ µ.

Page 101: Probability Book

Chapter 6 97

6.5 Signed measures

Let (X,F ) be a measurable space. In this section we see how the concept ofmeasure, still viewed as a set function, can be extended dropping the nonnegativityassumption on A 7→ µ(A).

We recall that sequence (Ei) ⊂ F of pairwise disjoint sets such that⋃∞

0 Ei = Eis called a countable F–measurable partition of E.

Definition 6.19 (Signed measures and total variation) A signed measure µin (X,F ) is a map µ : F → R such that

µ(E) =∞∑i=0

µ(Ei),

for all countable F–measurable partitions (Ei) of E.

Notice that the series above is absolutely convergent by the arbitrariness of (Ei):indeed, if σ : N→ N is a permutation, then (Eσ(i)) is still a partition of E, hence

∞∑i=0

µ(Ei) =∞∑i=0

µ(Eσ(i)).

This implies that the series is absolutely convergent.Let µ be a signed measure. Then we define the total variation |µ| of µ as follows:

|µ|(E) = sup

∞∑i=0

|µ(Ei)| : (Ei) F–measurable partition of E

, E ∈ F .

Proposition 6.20 Let µ be a signed measure and let |µ| be its total variation. Then|µ| is a finite measure on (X,F ).

Proof. It is immediate to check that |µ| is a nondecreasing set function.Step 1. If A, B ∈ F are disjoint, we have

|µ|(A ∪B) = |µ|(A) + |µ|(B).

Indeed, let E = A ∪ B and let (Ei) be a countable F–measurable partition of E.Set

Aj = A ∩ Ej, Bj = B ∩ Ej, j ∈ N.

Page 102: Probability Book

98 Operations on measures

Then (Aj) is a countable F–measurable partition of A and (Bj) a countable F–measurable partition of B and we have Ej = Aj ∪Bj, j ∈ N. Moreover,

∞∑j=0

|µ(Ej)| ≤∞∑

j=1

|µ(Aj)|+∞∑

j=0

|µ(Bj)| ≤ |µ|(A) + |µ|(B),

which yields |µ|(A ∪B) ≤ |µ|(A) + |µ|(B).Let us prove the converse inequality, assuming with no loss of generality that

|µ|(A ∪ B) < ∞. Since both |µ|(A) and |µ|(B) are finite, for any ε > 0 there existcountable F–measurable partitions (Aε

k) of A and (Bεk) of B such that

∞∑k=0

|µ(Aεk)| ≥ |µ|(A)− ε

2,

∞∑k=0

|µ(Bεk)| ≥ |µ|(B)− ε

2.

Since (Aεk, B

εk) is a countable F–measurable partition of A ∪B, we have that

|µ|(A ∪B) ≥∞∑

k=1

(|µ(Aεk)|+ |µ(Bε

k)|) ≥ |µ|(A) + |µ|(B)− ε.

By the arbitrariness of ε we have |µ|(A ∪B) ≥ |µ|(A) + |µ|(B).Step 2. |µ| is σ–additive. Since |µ| is additive by Step 1, it is enough to show that|µ| is σ–subadditive, i.e. |µ(A)| ≤

∑∞0 |µ|(Ai) whenever (Ai) ⊂ F is a partition of

A. This can be proved arguing as in the first part of Step 1, i.e. building from apartition (Ej) of A partitions (Ej ∩ Ai) of all sets Ai.Step 3. |µ|(X) < ∞. Assume by contradiction that |µ|(X) = ∞. Then we claimthat

there exists a partition X = A ∪B such that |µ(A)| ≥ 1 and |µ|(B) = ∞. (6.20)

By the claim the conclusion follows since we can use it to construct by recurrence(replacing X with B and so on), a disjoint sequence (An) ⊂ F such that |µ(An)| ≥1. Assume, to fix the ideas, that µ(An) ≥ 1 for infinitely many n, and denoteby E the union of these sets: then, the σ–additivity of µ forces µ(E) = +∞, acontradiction. Analogously, if µ(An) ≤ −1 for infinitely many n, we find a set Esuch that µ(E) = −∞.

Let us prove (6.20). By the assumption |µ|(X) = ∞ it follows the existence of apartition (Xn) of X such that

∞∑n=0

|µ(Xn)| > 2(1 + |µ(X)|).

Page 103: Probability Book

Chapter 6 99

Then either the sum of those µ(Xn) which are nonnegative or the absolute value ofthe sum of those µ(Xn) which are nonpositive is greater than 1 + |µ(X)|. To fix theideas, assume that for a subsequence (Xn(k)) we have µ(Xn(k)) ≥ 0 and

∞∑k=0

µ(Xn(k)) > 1 + |µ(X)|.

Set A =⋃∞

0 Xn(k) and B = Ac. Then we have |µ(A)| > 1 + |µ(X)| and

|µ(B)| = |µ(X)− µ(A)| ≥ |µ(A)| − |µ(X)| > 1.

Since|µ|(X) = |µ|(A) + |µ|(B) = ∞,

either |µ|(B) = +∞ or |µ|(A) = +∞. In the first case we are done, in the secondone we exchange A and B. So, the claim is proved and the proof is complete.

Let µ be a signed measure on (X,F ). We define

µ+ :=1

2(|µ|+ µ), µ− :=

1

2(|µ| − µ),

so thatµ = µ+ − µ− and |µ| = µ+ + µ−. (6.21)

The measure µ+ (resp. µ−) is called the positive part (resp. negative part) of µ andthe first equation in (6.21) is called the Jordan representation of µ.

Remark 6.21 It is easy to check that Theorems 6.17 and 6.14 hold when ν is asigned measure: it suffices to split it into its positive and negative part, see alsoExercise 6.17.

The following theorem proves also that µ+ and µ− are singular, and provides acanonical representation of µ± as suitable restrictions of ±µ.

Theorem 6.22 (Hahn decomposition) Let µ be a signed measure on (X,F )and let µ+ and µ− be its positive and negative parts. Then there exists a F–measurable partition (A,B) of X such that

µ+(E) = µ(A ∩ E) and µ−(E) = −µ(B ∩ E) ∀E ∈ F . (6.22)

Proof. Let us first notice that µ |µ|. Thus, by the Radon–Nikodym theorem,there exists h ∈ L1(X,F , |µ|) such that

µ(E) =

∫E

h d|µ| ∀E ∈ F . (6.23)

Page 104: Probability Book

100 Operations on measures

Let us prove that |h(x)| = 1 for |µ|–a.e. x ∈ X. Indeed, set

E1 := x ∈ X : h(x) > 1, F1 := x ∈ X : h(x) < −1

We first show that |µ|(E1) = |µ|(F1) = 0. Since have

|µ|(E1) ≥ µ(E1) =

∫E1

h d|µ| ≥ |µ|(E1),

and the second inequality is strict if |µ|(E1) > 0, we have that |µ|(E1) = 0. In asimilar way one can prove that |µ|(F1) = 0, so that |h| ≤ 1 |µ|–a.e. in X. Now, letr ∈ (0, 1) and set

Gr := x ∈ X : |h(x)| < r.

Let (Gr,k) be a countable F–measurable partition of Gr. Then we have

|µ(Gr,k)| =∣∣∣∣∫

Gr,k

h d|µ|∣∣∣∣ ≤ ∫

Gr,k

|h| d|µ| ≤ r|µ|(Gr,k).

Therefore∞∑

k=0

|µ(Gr,k)| ≤ r|µ|(Gr),

which yields, by the arbitrariness of the partition of Gr, |µ|(Gr) ≤ r|µ|(Gr). Thus|µ|(Gr) = 0 and letting r ↑ 1 we obtain that |µ|(|h| < 1) = 0. Hence, possiblymodifying h in |µ|–negligible set, we can assume with no loss of generality that htakes its values in −1, 1.

Now, to conclude the proof, we set

A := x ∈ X : h(x) = 1, B := x ∈ X : h(x) = −1.

Then for any E ∈ F we have

µ+(E) =1

2(|µ|(E) + µ(E)) =

1

2

∫E

(1 + h)d|µ| =∫

E∩A

hd|µ| = µ(E ∩ A),

and

µ−(E) =1

2(|µ|(E)− µ(E)) =

1

2

∫E

(1− h)d|µ| = −∫

E∩B

hd|µ| = −µ(E ∩B).

Page 105: Probability Book

Chapter 6 101

EXERCISES

6.17 Using the decomposition of ν in positive and negative part, show that Lebesgue decompositionis still possible when µ is σ–finite and ν is a signed measure. Using the Hahn decomposition extendthis result to the case when even µ is a signed measure. Are these decompositions unique?6.18 Show that |fµ| = |f |µ for any f ∈ L1(X,E , µ).

Page 106: Probability Book

102 Operations on measures

6.6 Measures in R

In this section we estabilish a 1-1 correspondence between finite Borel measures in Rand a suitable class of nondecreasing functions. In one direction this correspondenceis elementary, and based on the concept of repartition function.

Given a finite measure µ in (R,B(R)), we call repartition function of µ thefunction F : R→ [0,+∞) defined by

F (x) := µ ((−∞, x]) x ∈ R.

Notice that obviously (1) F is nondecreasing, right continuous, and satisfies

limx→−∞

F (x) = 0, limx→+∞

F (x) ∈ [0,+∞). (6.24)

Moreover, F is continuous at x if and only if x is not an atom of µ.The following result shows that this list of properties characterizes the functions

that are repartition functions of some finite measure µ; in addition the measure isuniquely determined by its repartition function.

Theorem 6.23 Let F : R → [0,+∞) be a nondecreasing and right continuousfunction satisfying (6.24). Then there exists a unique finite measure µ in (R,B(R))such that F is the repartition function of µ.

Proof. The proof follows the same lines of the construction of the Lebesguemeasure in Section 1.6, with a simplification due to the fact that we can also considerunbounded intervals (because we are dealing with finite measures). We set

I := (a, b] : a ∈ [−∞,+∞), b ∈ R, a < b

and denote by A the ring generated by I : it consists, as it can be easily checked, ofall finite disjoint unions of intervals in I . We define, with the convention F (−∞) =0,

µ((a, b]) := F (b)− F (a) ∀(a, b] ∈ I . (6.25)

This definition is justified by the fact that, if µ were a measure and F were itsrepartition function, (6.25) would be valid, because (a, b] = (−∞, b]\(−∞, a]. Thenwe extend µ to A with the same mechanism used in the proof of Theorem 1.19, andcheck that µ is additive on A . Also, the same argument used in that proof showsthat µ is even σ–additive: in order to prove that µ(F ) =

∑i µ(Fi) whenever F and

all Fi belong to A one first reduces to the case when F = (a, b] belongs to I ; then,

(1)The arguments are similar to those used in Section 2.4.2, in connection with the properties ofthe function t 7→ µ(ϕ > t)

Page 107: Probability Book

Chapter 6 103

one enlarges Fi to F ′i ∈ A with µ(F ′

i ) < µ(Fi) + δ2−i and, using the fact that allintervals [a′, b] with a′ > a are contained in a finite union of the sets F ′

i , obtains

µ((a′, b]) ≤∞∑i=0

µ(F ′i ) ≤ 2δ +

∞∑i=0

µ(Fi).

Letting first δ ↓ 0 and then a′ ↓ a we obtain the σ–subadditivity property µ(F ) ≤∑i µ(Fi), and the opposite inequality follows by monotonicity.By Caratheodory theorem µ has a unique extension, that we still denote by µ,

to B(R) = σ(A ). Setting a = −∞ and letting b tend to +∞ in the identity (6.25)we obtain that µ(R) = F (+∞) ∈ R. From (6.25) with a = −∞ we obtain that therepartition function of µ is F .

Given a nondecreasing and right continuous function F satisfying (6.24), theStieltjes integral ∫

R

f dF

is defined as∫f dµF , where µF is the finite measure built in the previous theorem.

The notation dF is justified by the fact that, when f =∑

i zi1(ai,bi], we have (bythe very definition of µF )∫

R

f dF =

∫R

f dµF =∑

i

zi(F (bi)− F (ai)).

This approximation of the Stieltjes integral will play a role in the proof of Theo-rem 6.28.

6.7 Convergence of measures on R

In this section we study a notion of convergence for measures on the real line thatis quite useful, both from the analytic and the probabilistic viewpoints.

Definition 6.24 (Weak convergence) Let (µh) be a sequence of finite measureson R. We say that (µh) weakly converges to a finite measure µ on R if the repartitionfunctions Fh of µh are pointwise converging to the repartition function F of µ on aco-countable set, i.e. if

limh→∞

µh (−∞, x]) = µ ((−∞, x]) with at most countably many exceptions.

(6.26)

Page 108: Probability Book

104 Operations on measures

Since the repartition function is right continuous, it is uniquely determined by(6.26). Then, since the measure is uniquely determined by its repartition function,we obtain that the weak limit, if exists, is unique. The following fundamentalexample shows why we admit at most countably many exceptions in the convergenceof the repartition functions.

Example 6.25 [Convergence to the Dirac mass] Let ρ ∈ C∞(R) be a nonneg-ative function such that

∫Rρ dx = 1 (an important example is the Gauss function

(2π)−1/2e−x2/2). We consider the rescaled functions ρh(x) = hρ(hx) and the inducedmeasures µh = ρhL 1, all probability measures. Then, it is immediate to check thatµh weakly converge to δ0: for x > 0 we have indeed

µh ((−∞, x]) =

∫ x

−∞ρh(y) dy =

∫ hx

−∞ρ(y) dy → 1

because hx→ +∞ as h→ +∞. An analogous argument shows that µh ((−∞, x]) →0 for any x < 0. If ρ is even, at x = 0 we don’t have pointwise convergence of therepartition functions: all the repartition functions Fh satisfy Fh(0) = 1/2, whileF (0) = 1.

Weak convergence is a quite flexible tool, because it allows also an oppositebehaviour, the approximation of continuous measures (i.e. with no atom) by atomicones, see for instance Exercise 6.19.

From now on we will consider only, for the sake of simplicity, the case of weakconvergence of probability measures. Before stating a compactness theorem for theweak convergence of probability measures, we introduce the following terminology.

Definition 6.26 (Tightness) We say that a family of probability measures µii∈I

in R is tight if for any ε > 0 there exists a closed interval J ⊂ R such that

µi(R \ J) ≤ ε ∀i ∈ I.

Clearly any finite family of probability measures is tight. One can also check(see Exercise 6.22) that µii∈I is tight if and only if

limx→−∞

Fi(x) = 0, limx→+∞

Fi(x) = 1 uniformly with respect to i ∈ I, (6.27)

where Fi are the repartition functions of µi. Furthermore, (see Exercise 6.23) anyweakly converging sequence is tight. Conversely, we have the following compactnessresult for tight sequences:

Page 109: Probability Book

Chapter 6 105

Theorem 6.27 (Compactness) Let (µh) be a tight sequence of probability mea-sures on R. Then there exists a subsequence (µh(k)) weakly converging to a probabilitymeasure µ.

Proof. We denote by Fh the repartition functions of µh. By a diagonal argumentwe can find a subsequence (Fh(k)) pointwise converging on Q. We denote by G thepointwise limit, obviously a nondecreasing function. We extend G by monotonicitysetting

G(x) := sup G(q) : q ∈ Q, q ≤ x x ∈ R

and let E be the co-countable set of the discontinuity points of G.Let us check that Fh(k) is pointwise converging to G on R\E: for x /∈ E we have

indeed

lim supk→∞

Fh(k)(x) ≤ infq∈Q, q>x

lim supk→∞

Fh(k)(q) = infq∈Q, q>x

G(q) = G(x),

and analogously

lim infk→∞

Fh(k)(x) ≥ supq∈Q, q<x

lim supk→∞

Fh(k)(q) = supq∈Q, q<x

G(q) = G(x).

Since (µh) is tight, we have also

limx→−∞

Fh(x) = 0, limx→+∞

Fh(x) = 1

uniformly with respect to h, hence G(−∞) = 0 and G(+∞) = 1.Notice now that the nondecreasing function

F (x) := limy↓x

G(y)

is right continuous, and still satisfies F (−∞) = 0 and F (+∞) = 1, therefore (ac-cording to Theorem 6.23) F is the repartition function of a probability measure µ.Since F = G on R \ E, we have Fh(k) → F pointwise on R \ E, and this proves theweak convergence of µh(k) to µ.

The following theorem provides a characterization of the weak convergence interms of convergence of the integrals of continuous and bounded functions.

Theorem 6.28 Let µh, µ be probability measures in R. Then µh weakly convergeto µ if and only if

limh→∞

∫R

g dµh =

∫R

g dµ ∀g ∈ Cb(R). (6.28)

Page 110: Probability Book

106 Operations on measures

Proof. Assuming that µh → µ weakly, we denote by Fh and F the correspondingrepartition functions and fix g ∈ Cb(R). LetM = sup |g| and ε > 0. By Exercise 6.23the sequence (µh) is tight, so that we can find t > 0 satisfying µh (R \ (−t, t]) < ε forany h ∈ N; we may assume (possibly choosing a larger t) that also µ((R \ (−t, t]) < εand both −t and t are points where the repartition functions are converging. Thanksto the uniform continuity of g in [−t, t] we can find δ > 0 such that

x, y ∈ [−t, t], |x− y| < δ =⇒ |g(x)− g(y)| < ε. (6.29)

Hence, we can find points t1, . . . , tn in [−t, t] such that t1 = −t, tn = t, thereis convergence of the repartition functions in all points ti, and ti+1 − ti < δ fori = 1, . . . , n− 1. By (6.29) it follows that sup(−t,t] |g − f | < ε, where

f :=n−1∑i=1

g(ti)1(ti,ti+1].

Splitting the integrals on R as the sum of an integral on (−t, t] and an integral on(−t, t]c we have∣∣∣∣∫

R

g dµh −∫

(−t,t]

f dµh

∣∣∣∣ ≤Mε+ ε = (M + 1)ε ∀h ∈ N, (6.30)

and analogously ∣∣∣∣∫R

g dµ−∫

(−t,t]

f dµ

∣∣∣∣ ≤Mε+ ε = (M + 1)ε. (6.31)

Since∫(−t,t]

f dµh =n−1∑i=1

g(ti) [Fh(ti+1)− Fh(ti)] →n−1∑i=1

g(ti) [F (ti+1)− F (ti)] =

∫(−t,t]

f dµ,

adding and subtracting∫

(−t,t]f dµh, and using (6.30) and (6.31), we conclude that

lim suph→∞

∣∣∣∣∫R

g dµh −∫R

g dµ

∣∣∣∣ ≤ (M + 1)ε.

Since ε is arbitrary, (6.28) is proved.Conversely, assume that (6.28) holds. Given x ∈ R, define the open set A =

(−∞, x); we can easily find (gk) ⊆ Cb(R) monotonically converging to 1A and deducefrom (6.28) the inequality

lim infh→∞

µh(A) ≥ supk∈N

lim infh→∞

∫R

gk dµh = supk∈N

∫R

gk dµ = µ(A).

Page 111: Probability Book

Chapter 6 107

Analogously, using a sequence (gk) ⊆ Cb(R) such that gk ↓ 1C , with C = (−∞, x],we deduce from (6.28) the inequality

lim suph→∞

µh(C) ≤ infk∈N

lim suph→∞

∫R

gk dµh = infk∈N

∫R

gk dµ = µ(C).

Therefore we have convergence of the repartition functions for any x ∈ R such thatµ(A) = µ(C), i.e. for any x that is not an atom of µ. We conclude thanks toExercise 1.5.

Notice that in (6.28) there is no mention to the order structure of R, and onlythe metric structure (i.e. the space Cb(R)) comes into play. In a general context,of probability measures on a metric space (X, d) endowed with the Borel σ–algebraB(X), we say that µh weakly converge to µ if

limh→∞

∫X

g dµh =

∫X

g dµ for any function g ∈ Cb(X).

EXERCISES

6.19 Show that the probability measures

µh :=1h

h∑i=1

δ ih

weakly converge to the probability measure 1[0,1]L1.

6.20 Let Fh : R→ R be nondecreasing functions pointwise converging to a nondecreasing functionF : R→ R on a dense set D ⊂ R. Show that Fh(x) → F (x) at all points x where F is continuous.6.21 Consider all atomic measures of the form

h2∑i=−h2

aiδ ih,

where h ∈ N and a−h, . . . , ah ≥ 0. Show that for any finite Borel measure µ in R there exists asequence of measures (µh) of the previous form that weakly converges to µ.6.22 Show that a family µii∈I of probability measures in R is tight if and only if (6.27) holds.6.23 Show that any sequence (µh) of probability measures weakly convergent to a probabilitymeasure is tight. Hint: if µ is the weak limit and ε > 0 is given, choose an integer n ≥ 1 suchthat µ([1− n, n− 1]) > 1− ε and points x ∈ (−n, 1− n) and y ∈ (n− 1, n) where the repartitionfunctions of µh are converging to the repartition function of µ.6.24 We want to extend what was shown in this section from the realm of probability measures tothat of finite measures. Let (µh), µ be finite positive Borel measures on R, and let Fh, F be theirrepartition functions. Consider the following implications:

(a) limh

∫Rg dµh =

∫Rg dµ ∀g ∈ Cb(R) (that is, (6.28));

Page 112: Probability Book

108 Operations on measures

(b) limh

∫Rg dµh =

∫Rg dµ ∀g ∈ Cc(R);

(c) Fh converge to F at all points where F is continuous;

(d) Fh converge to F on a dense subset of R;

(e) limh µh(R) = µ(R);

(f) (µh) is tight.

Find an example where (b) holds but (a), (c), (e) do not hold and prove the following implications:a ⇒ b, e, a ⇒ c, d ⇔ c, b ∧ e ⇒ c, d ∧ e ⇒ f , d ∧ f ⇒ e, d ∧ f ⇒ a. As a corollary, if(e) holds (as it happens in the case when all µh and µ are probability measures) we obtain thata⇔ b⇔ c⇔ d⇒ f .

6.8 Fourier transform

The Fourier transform is a basic tool in Pure and Applied Mathematics, Physicsand Engineering. Here we just mention a few basic facts, focussing on the use ofthis transform in Measure Theory and Probability.

Definition 6.29 (Fourier transform of a function) Let f ∈ L1(R,C). We set

f(ξ) :=

∫R

f(x)e−ixξ dx ∀ξ ∈ R.

The function f is called Fourier transform of f .

Since the map ξ 7→ f(x)e−iξx is continuous, and bounded by |f(x)|, the domi-nated convergence theorem gives that f(ξ) is continuous. The same upper boundalso shows that f is bounded, and sup |f | ≤ ‖f‖1. More generally, the followingresult holds:

Theorem 6.30 Let k ∈ N be such that∫R|x|k|f |(x) dx < ∞. Then f ∈ Ck(R,C)

andDpf(ξ) = (−i)pxpf(ξ) ∀p = 0, . . . , k.

The proof of Theorem 6.30 is a straightforward consequence of the differentiationtheorem for integrals depending on a parameter (in this case, the ξ variable):

Dpξ

∫R

f(x)e−ixξ dx =

∫R

Dpξ

(f(x)e−ixξ

)dx = (−i)p

∫R

xpf(x)e−ixξ dx.

According to the previous result, the Fourier transform allows to transform differ-entiations (in the x variable) into multiplications (in the ξ variable), thus allowingan algebraic solution of many linear differential equations.

Page 113: Probability Book

Chapter 6 109

In the sequel we need an explicit expression of the Fourier transform of a Gaussianfunction. For σ > 0, let

ρσ(x) :=e−|x|

2/(2σ2)

(2πσ2)1/2(6.32)

be the rescaled Gaussian functions, already considered in Example 6.25. Then∫R

ρσ(x)e−iξx dx = e−ξ2σ2/2 ∀ξ ∈ R. (6.33)

The proof of this identity is sketched in Exercise 6.23.

Remark 6.31 (Discrete Fourier transform) If f : R → R is a 2T-periodic function, then we can write the Fourier series (corresponding, up to a linear change of variables, to those considered in Chapter 5 for 2π-periodic functions)

f = ∑_{n∈Z} a_n e^{inπx/T}    in L²((−T, T);C),    (6.34)

with

a_n = (1/2T) ∫_{−T}^{T} f(x)e^{−inπx/T} dx,    e^{inπx/T} = cos(nπx/T) + i sin(nπx/T).    (6.35)

Remark 6.32 (Inverse Fourier transform) For g ∈ L¹(R,C) we define the inverse Fourier transform of g as the function

ǧ(x) := (1/2π) ∫_R g(ξ)e^{ixξ} dξ, x ∈ R.

It can be shown (see for instance Chapter VI.1 in [7]) that the maps f ↦ f̂ and g ↦ ǧ are each the inverse of the other on the so-called Schwartz space S(R,C) of smooth functions rapidly decreasing at infinity:

S(R,C) := { f ∈ C^∞(R,C) : lim_{|x|→∞} |x|^k |D^i f|(x) = 0 ∀k, i ∈ N }.

In particular we have

f(x) = (f̂)ˇ(x) = ∫_R a_ξ e^{ixξ} dξ    with a_ξ := (1/2π) ∫_R f(x)e^{−iξx} dx.

These formulas can be viewed as the continuous counterpart of the discrete Fourier transform (6.34), (6.35). In this sense, the a_ξ are generalized Fourier coefficients, corresponding to the "frequency" ξ. The difference with Fourier series is that any frequency is allowed, not only the integer multiples nπ/T of a given one.


6.8.1 Fourier transform of a measure

In this section we are concerned in particular with the concept of the Fourier transform of a measure.

Definition 6.33 (Fourier transform of a measure) Let µ be a finite measure on R. We set

µ̂(ξ) := ∫_R e^{−ixξ} dµ(x), ∀ξ ∈ R.

The function µ̂ : R → C is called the Fourier transform of µ.

Notice that Definition 6.29 is consistent with Definition 6.33, because µ̂ = f̂ whenever µ = fL 1. Notice also that, by the dominated convergence theorem, the function µ̂ is continuous. Moreover µ̂(0) = µ(R) and, by estimating from above the modulus of the integral with the integral of the modulus (see also Exercise 6.25), we obtain that |µ̂(ξ)| ≤ µ(R) for all ξ ∈ R. Still using the differentiation theorems under the integral sign, one can check that for k ∈ N the following implication holds:

∫_R |x|^k dµ(x) < ∞ =⇒ µ̂ ∈ C^k(R,C) and D^p µ̂(ξ) = ((−i)^p x^p µ)^∧(ξ) ∀p = 0, . . . , k.    (6.36)

Let us see other basic examples of Fourier transforms of probability measures:

Example 6.34 (1) If µ = δ_{x₀} then µ̂(ξ) = e^{−ix₀ξ}.

(2) If µ = pδ₁ + qδ₀ (with p + q = 1) is the Bernoulli measure with parameter p, then µ̂(ξ) = q + pe^{−iξ}.

(3) If

µ = ∑_{i=0}^{n} \binom{n}{i} p^i q^{n−i} δ_i

is the binomial measure with parameters n, p, then

µ̂(ξ) = (q + pe^{−iξ})^n, ∀ξ ∈ R.

(4) If µ = e^{−x} 1_{(0,∞)}(x) L 1 is the exponential measure, then

µ̂(ξ) = 1/(1 + iξ), ∀ξ ∈ R.

(5) If µ = (2a)^{−1} 1_{(−a,a)} L 1 is the uniform measure in [−a, a], then

µ̂(ξ) = sin(aξ)/(aξ), ∀ξ ∈ R \ {0}.

(6) If µ = [π(1 + x²)]^{−1} L 1 is the Cauchy measure, then (2)

µ̂(ξ) = e^{−|ξ|}, ∀ξ ∈ R.
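Formulas such as (3) and (4) above are easy to double-check numerically; here is a sketch in Python (the parameters n, p, the test frequency ξ and the truncation of the integral are arbitrary choices made for the example):

```python
import numpy as np
from math import comb

xi = 1.3                                   # an arbitrary test frequency

# (3) binomial measure: a finite atomic measure, so mu_hat is a finite sum
n, p = 7, 0.4
q = 1 - p
lhs = sum(comb(n, i) * p**i * q**(n - i) * np.exp(-1j * xi * i) for i in range(n + 1))
print(abs(lhs - (q + p * np.exp(-1j * xi))**n))        # ~ 1e-16

# (4) exponential measure: Riemann sum of the density e^{-x} on (0, 50)
x = np.linspace(0.0, 50.0, 2000001)
dx = x[1] - x[0]
lhs = np.sum(np.exp(-x) * np.exp(-1j * xi * x)) * dx
print(abs(lhs - 1 / (1 + 1j * xi)))                    # small quadrature error
```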

Theorem 6.35 Any finite measure µ in R is uniquely determined by its Fourier transform µ̂.

Proof. For σ > 0 we denote by ρ_σ the rescaled Gaussian functions in (6.32). According to Exercise 6.23 we have

e^{−z²σ²/2} = ∫_R ρ_σ(w)e^{−izw} dw.

Setting z = (x − y)/σ² and dividing both sides by (2πσ²)^{1/2} we deduce that

ρ_σ(x − y) = (2πσ²)^{−1/2} ∫_R ρ_σ(w)e^{−iw(x−y)/σ²} dw.

Using the Fubini–Tonelli theorem we obtain

∫_R ρ_σ(x − y) dµ(x) = ∫_R (2πσ²)^{−1/2} ( ∫_R ρ_σ(w)e^{−iw(x−y)/σ²} dw ) dµ(x) = ∫_R (ρ_σ(w)/(2πσ²)^{1/2}) µ̂(w/σ²) e^{iyw/σ²} dw.    (6.37)

As a consequence, the integrals h_σ(y) = ∫_R ρ_σ(y − x) dµ(x) are uniquely determined by µ̂. But, still using the Fubini–Tonelli theorem, one can check the identity

∫_R ( ∫_R g(y)ρ_σ(x − y) dy ) dµ(x) = ∫_R h_σ(y)g(y) dy, ∀g ∈ C_b(R).    (6.38)

Passing to the limit as σ ↓ 0 and noticing that (by Example 6.25, which provides the weak convergence of ρ_σ L 1 to δ₀ as σ ↓ 0, or by a direct verification)

∫_R g(y)ρ_σ(x − y) dy = ∫_R g(x − z)ρ_σ(z) dz → g(x), ∀x ∈ R,

from the dominated convergence theorem we obtain that all the integrals ∫_R g dµ, for g ∈ C_b(R), are uniquely determined. Hence µ is uniquely determined by its Fourier transform.

(2) This computation can be done using the residue theorem in complex analysis.


Remark 6.36 It is also possible to show an explicit inversion formula for the Fourier transform. Indeed, (6.38) holds not only for continuous functions, but also for bounded Borel functions; choosing a < b that are not atoms of µ and g = 1_{(a,b)}, we have that ∫_R g(y)ρ_σ(x − y) dy → g(x) for µ-a.e. x (precisely, for x ∉ {a, b}), so that (6.38) and (6.37) give

µ((a, b)) = lim_{σ↓0} ∫_a^b h_σ(y) dy = lim_{σ↓0} ∫_a^b ∫_R (e^{−w²/(2σ²)}/(2πσ²)) µ̂(w/σ²) e^{iyw/σ²} dw dy.

The change of variables w = tσ² and the Fubini theorem give

µ((a, b)) = lim_{σ↓0} (1/2π) ∫_R e^{−t²σ²/2} µ̂(t) (e^{itb} − e^{ita})/(it) dt,    (6.39)

for all points a < b that are not atoms of µ.
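The inversion formula (6.39) can be tried out numerically. The sketch below in Python recovers µ((a, b)) for the exponential measure of Example 6.34(4) from its transform alone; the values of a, b, the small σ and the truncation of the integral are arbitrary choices for the illustration.

```python
import numpy as np

mu_hat = lambda t: 1 / (1 + 1j * t)       # transform of the exponential measure
a, b, sigma = 0.5, 2.0, 0.02              # small sigma mimics the limit in (6.39)

N = 1000001                                # odd, so that t = 0 is a grid point
t = np.linspace(-1000.0, 1000.0, N)
dt = t[1] - t[0]
with np.errstate(invalid="ignore", divide="ignore"):
    kernel = (np.exp(1j * t * b) - np.exp(1j * t * a)) / (1j * t)
kernel[N // 2] = b - a                     # the limit of the kernel as t -> 0
integrand = np.exp(-t**2 * sigma**2 / 2) * mu_hat(t) * kernel
approx = np.real(np.sum(integrand) * dt) / (2 * np.pi)

print(approx, np.exp(-a) - np.exp(-b))     # mu((a,b)) computed directly
```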

According to Theorem 6.28 we have the implication

µ_h → µ weakly =⇒ µ̂_h → µ̂ pointwise in R.    (6.40)

The following theorem, due to Lévy, gives essentially the converse implication, allowing one to deduce weak convergence from the convergence of the Fourier transforms.

Theorem 6.37 (Lévy) Let (µ_h) be probability measures in R. If f_h = µ̂_h converge pointwise in R to some function f, and if f is continuous at 0, then f = µ̂ for some probability measure µ in R and µ_h → µ weakly.

Proof. Let us show first that (µ_h) is tight. Fix a > 0; taking into account that sin ξ is an odd function and using the Fubini theorem we get

∫_{−a}^{a} σ̂(ξ) dξ = ∫_{−a}^{a} ∫_R e^{−ixξ} dσ(x) dξ = ∫_R ∫_{−a}^{a} cos(xξ) dξ dσ(x) = ∫_R (2/x) sin(ax) dσ(x)

for any probability measure σ. Hence, using the inequalities |sin t| ≤ |t| for all t and |sin t| ≤ |t|/2 for |t| ≥ 2, we get

(1/a) ∫_{−a}^{a} (1 − σ̂(ξ)) dξ = 2 − 2 ∫_R (sin(ax)/(ax)) dσ(x) = 2 ∫_R (1 − sin(ax)/(ax)) dσ(x) ≥ σ(R \ [−2/a, 2/a]).    (6.41)

For ε > 0 we can find, by the continuity of f at 0, a > 0 such that

∫_{−a}^{a} (1 − f(ξ)) dξ < εa.

By the dominated convergence theorem we get h₀ ∈ N such that

∫_{−a}^{a} (1 − µ̂_h(ξ)) dξ < εa, ∀h ≥ h₀.    (6.42)

As a^{−1} ∫_{−a}^{a} (1 − µ̂_h(ξ)) dξ → 0 as a ↓ 0 for any fixed h, we infer that we can find b ∈ (0, a] such that (6.42) holds with b replacing a for all h ∈ N. From (6.41) we get µ_h(R \ [−n, n]) < ε for all h ∈ N, as soon as n > 2/b.

The sequence being tight, we can extract a subsequence (µ_{h(k)}) weakly converging to a probability measure µ and deduce from (6.40) that f = µ̂. It remains to show that the whole sequence (µ_h) weakly converges to µ: if this is not the case there exist ε > 0, g ∈ C_b(R) and a subsequence h′(k) such that

| ∫_R g dµ_{h′(k)} − ∫_R g dµ | ≥ ε, ∀k ∈ N.

But, possibly extracting one more subsequence, we can assume that the µ_{h′(k)} weakly converge to a probability measure σ; in particular

| ∫_R g dσ − ∫_R g dµ | ≥ ε > 0.    (6.43)

As we are assuming that f_h = µ̂_h converge pointwise to f = µ̂, we obtain that σ̂ = lim_k µ̂_{h′(k)} = µ̂, hence µ̂ = σ̂. From Theorem 6.35 we obtain that µ = σ, contradicting (6.43).

Notice that pointwise convergence of the Fourier transforms alone is not enough to conclude weak convergence, unless we know that the limit function is continuous: let us consider, for instance, the rescaled Gaussian kernels used in the proof of Theorem 6.35 and the behaviour of the Gaussian measures µ_σ = ρ_σ L 1 as σ ↑ ∞; in this case, from Exercise 6.23 we infer that the Fourier transforms converge pointwise in R to the discontinuous function equal to 1 at ξ = 0 and equal to 0 elsewhere. In this case we don't have weak convergence of the measures: we have, instead, the so-called phenomenon of dispersion of the whole mass at infinity,

lim_{σ↑∞} µ_σ(R \ [−n, n]) = lim_{σ↑∞} µ₁(R \ [−n/σ, n/σ]) = µ₁(R \ {0}) = 1, ∀n ∈ N,

and the family of measures {µ_σ} is far from being tight as σ ↑ ∞.
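A sketch in Python of this phenomenon (the interval [−5, 5], the test frequency and the values of σ are arbitrary choices): as σ ↑ ∞ the transforms µ̂_σ(ξ) = e^{−ξ²σ²/2} collapse onto the discontinuous limit, while the mass that µ_σ leaves in any fixed interval vanishes.

```python
from math import exp, erf, sqrt

xi = 0.1                                   # any fixed frequency different from 0
for sigma in (1, 10, 100, 1000):
    ft = exp(-xi**2 * sigma**2 / 2)        # the transform of mu_sigma at xi, by (6.33)
    mass = erf(5 / (sigma * sqrt(2)))      # mu_sigma([-5, 5]), in closed form
    print(sigma, ft, mass)                 # both -> 0: the mass escapes to infinity
```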


EXERCISES

6.23 Check the identity (6.33).

6.24 ⋆ Show that µ̂ is uniformly continuous in R for any finite measure µ.

6.25 Let µ be a probability measure in R. Show that if |µ̂| attains its maximum at ξ₀ ≠ 0, then there exist x₀ ∈ R and c_n ∈ [0, ∞) such that

µ = ∑_{n∈Z} c_n δ_{x_n}    with x_n = x₀ + 2nπ/ξ₀.

Use this fact to show that |µ̂| ≡ 1 in R if and only if µ is a Dirac mass.

Chapter 7

The fundamental theorem of the integral calculus

In this chapter we take a closer look at a classical theme, namely the fundamental theorem of the integral calculus, looking for optimal conditions on f ensuring the validity of the formula

f(x) − f(y) = ∫_y^x f′(s) ds.

Notice indeed that in the classical theory of Riemann integration there is a gap between the conditions imposed to give a meaning to the integral ∫_a^x g(s) ds (i.e. Riemann integrability of g) and those that ensure its differentiability as a function of x (for instance, typically one requires the continuity of g). We will see that this gap basically disappears in Lebesgue's theory, and that there is a precise characterization of the class of functions representable as c + ∫_a^x g(s) ds for a suitable (Lebesgue) integrable function g and for some constant c.

The following definition is due to Vitali.

Definition 7.1 (Absolutely continuous functions) Let I ⊂ R be an interval. We say that f : I → R is absolutely continuous if for any ε > 0 there exists δ > 0 for which the implication

∑_{i=1}^{n} (b_i − a_i) < δ =⇒ ∑_{i=1}^{n} |f(b_i) − f(a_i)| < ε    (7.1)

holds for any finite family {(a_i, b_i)}_{1≤i≤n} of pairwise disjoint intervals contained in I.

An absolutely continuous function is obviously uniformly continuous, but the converse is not true, see Example 7.7.


Let f : [a, b] → R be absolutely continuous. For any x ∈ [a, b] define

F(x) = sup_{σ∈Σ_{a,x}} ∑_{i=1}^{n} |f(x_i) − f(x_{i−1})|,

where Σ_{a,x} is the set of all decompositions σ = {a = x₀ < x₁ < · · · < x_n = x} of [a, x]. F is called the total variation of f. Let us check that F is finite: let δ > 0 satisfy the implication (7.1) with ε = 1; then, any sum in the definition of F can be split into at most 2(x − a)/δ + 1 partial sums, each corresponding to a family of intervals with total length less than δ/2; as a consequence, (7.1) gives

F(x) ≤ (2/δ)(x − a) + 1.

We set

f₊(x) = (F(x) + f(x))/2,    f₋(x) = (F(x) − f(x))/2,

so that

f(x) = f₊(x) − f₋(x),    F(x) = f₊(x) + f₋(x),    x ∈ [a, b].

Lemma 7.2 Let f : [a, b] → R be absolutely continuous and let F be its total variation. Then F, f₊, f₋ are nondecreasing and absolutely continuous.

Proof. Let x ∈ [a, b), y ∈ (x, b] and σ = {a = x₀ < x₁ < · · · < x_n = x}. Then we have

F(y) ≥ |f(y) − f(x)| + ∑_{i=1}^{n} |f(x_i) − f(x_{i−1})|.

Taking the supremum over all σ ∈ Σ_{a,x} yields

F(y) ≥ |f(y) − f(x)| + F(x),

which implies that F, f₊, f₋ are nondecreasing. It remains to show that F is absolutely continuous. Let ε > 0 and let δ = δ(ε) > 0 be such that the implication (7.1) holds for all finite families (a_i, b_i), 1 ≤ i ≤ n, of pairwise disjoint intervals with ∑_i (b_i − a_i) < δ. For any i = 1, . . . , n we can find σ_i = {a_i = x_{0,i} < x_{1,i} < · · · < x_{n_i,i} = b_i} such that

F(b_i) − F(a_i) ≤ ε/n + ∑_{k=1}^{n_i} |f(x_{k,i}) − f(x_{k−1,i})|,    1 ≤ i ≤ n.    (7.2)

Indeed, if {a = y₀ < y₁ < · · · < y_{m_i} = b_i} is a partition such that

F(b_i) ≤ ε/n + ∑_{k=1}^{m_i} |f(y_k) − f(y_{k−1})|,

we can assume with no loss of generality (adding one more element to the partition if necessary) that y_k = a_i for some k; then, it suffices to estimate the first k terms of the above sum with F(a_i), and to relabel the remaining points as x_{0,i} = y_k, . . . , x_{n_i,i} = y_{m_i}, with n_i = m_i − k, to obtain (7.2). Adding the inequalities (7.2) and taking into account that the union of the disjoint intervals (x_{k−1,i}, x_{k,i}) (for 1 ≤ i ≤ n, 1 ≤ k ≤ n_i) has total length less than δ, from the absolute continuity property of f we get

∑_{i=1}^{n} (F(b_i) − F(a_i)) ≤ ε + ε = 2ε.

This proves that F is absolutely continuous.

The absolute continuity property characterizes integral functions, as the following theorem shows.

Theorem 7.3 Let I = [a, b] ⊂ R. A function f : I → R is representable as

f(x) = f(a) + ∫_a^x g(t) dt, ∀x ∈ I,    (7.3)

for some g ∈ L¹(I) if and only if f is absolutely continuous.

Proof. (Sufficiency) If f is representable as in (7.3), we have

|f(x) − f(y)| ≤ ∫_x^y |g(s)| ds, ∀x, y ∈ I, x ≤ y.

Hence, setting A = ∪_i (a_i, b_i), the absolute continuity property follows from the implication

L 1(A) < δ =⇒ ∫_A |g| ds < ε.

The existence, given ε > 0, of δ > 0 with this property is ensured by Exercise 6.14 (with µ = L 1 and ν = gL 1).

(Necessity) According to Lemma 7.2, we can write f as the difference of two nondecreasing absolutely continuous functions. Hence, we can assume with no loss of generality that f is nondecreasing, and possibly adding to f a constant we shall assume that f(a) = 0. We extend f to the whole of R setting f ≡ 0 in (−∞, a) and f ≡ f(b) in (b, ∞). It is clear that this extension, which we still denote by f, retains the monotonicity and absolute continuity properties.

By Theorem 6.23 we obtain a unique finite measure ν on (R, B(R)) without atoms (because f is continuous) such that f is the repartition function of ν. As f is constant on (−∞, a) and on (b, +∞), we obtain that ν is concentrated on I, so that

f(x) = ν((−∞, x]) = ν((a, x]), ∀x ∈ R.    (7.4)

Now, if we were able to show that ν ≪ 1_I L 1, by the Radon–Nikodym theorem we would find g ∈ L¹(I) such that ν = gL 1, so that (7.4) would give

f(x) = ∫_a^x g(s) ds, ∀x ∈ I.

Hence, it remains to show that ν ≪ 1_I L 1. Taking into account the identity ν((a, b)) = f(b) − f(a), the absolute continuity property can be rewritten as follows: for any ε > 0 there exists δ > 0 such that

L 1(A) < δ =⇒ ν(A) ≤ ε

for any finite union of open intervals A ⊂ I. But, by approximation, the same implication holds for all open sets, because any such set is the countable union of open intervals. By Proposition 1.22, ensuring an approximation from above with open sets, the same implication holds for Borel sets B ⊂ I as well. This proves that ν ≪ 1_I L 1 and concludes the proof.

We will need the following nice and elementary covering theorem.

Theorem 7.4 (Vitali covering theorem) Let {B_{r_i}(x_i)}_{i∈I} be a finite family of balls in a metric space (X, d). Then there exists J ⊂ I such that the balls {B_{r_i}(x_i)}_{i∈J} are pairwise disjoint, and

∪_{i∈I} B_{r_i}(x_i) ⊂ ∪_{i∈J} B_{3r_i}(x_i).    (7.5)

Proof. We proceed as follows: first we pick a ball with largest radius, then we pick a second ball with largest radius among those that don't intersect the first ball, then we pick a third ball with largest radius among those that don't intersect the first or the second ball, and so on. The process stops when either there is no ball left, or all the remaining balls intersect at least one of the balls already chosen. The family of chosen balls is disjoint by construction. If x ∈ B_{r_i}(x_i) and the ball B_{r_i}(x_i) has not been chosen, then there is a chosen ball B_{r_j}(x_j) intersecting it, so that d(x_i, x_j) < r_i + r_j. Moreover, if B_{r_j}(x_j) is the first chosen ball with this property, then r_j ≥ r_i (otherwise, if r_i > r_j, either the ball B_{r_i}(x_i) or a ball with larger radius would have been chosen instead of B_{r_j}(x_j)), so that d(x_i, x_j) < 2r_j. It follows that

d(x, x_j) ≤ d(x, x_i) + d(x_i, x_j) < r_i + 2r_j ≤ 3r_j.

As x is arbitrary, this proves (7.5).
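The greedy selection used in the proof is straightforward to implement. Below is a sketch in Python for Euclidean balls given as (center, radius) pairs (the function name and the random data are ours, chosen for the example), together with a check of the two conclusions of the theorem.

```python
import numpy as np

def vitali_subfamily(balls):
    """Greedy selection from the proof of Theorem 7.4: repeatedly pick a
    ball of largest radius among those disjoint from the balls chosen so far."""
    chosen = []
    for c, r in sorted(balls, key=lambda b: -b[1]):   # radii in decreasing order
        if all(np.linalg.norm(c - cj) >= r + rj for cj, rj in chosen):
            chosen.append((c, r))
    return chosen

rng = np.random.default_rng(0)
balls = [(rng.uniform(0, 10, size=2), rng.uniform(0.1, 1.0)) for _ in range(50)]
sub = vitali_subfamily(balls)

# (7.5): every ball B_r(c) of the family lies inside some enlarged ball B_{3r_j}(c_j)
assert all(any(np.linalg.norm(c - cj) + r <= 3 * rj for cj, rj in sub)
           for c, r in balls)
print(len(balls), "balls,", len(sub), "disjoint balls selected")
```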

It is natural to think that the function g in (7.3) is, as in the classical fundamental theorem of integral calculus, the derivative of f. This is true, but far from trivial, and it follows from the following weak continuity result (due to Lebesgue) for integrable functions. We state the result in more than one variable, as the proof in this case does not require any extra difficulty.

Theorem 7.5 (Continuity in mean) Let f ∈ L¹(R^n). Then, for L n-a.e. x ∈ R^n we have

lim_{r↓0} (1/(ω_n r^n)) ∫_{B_r(x)} |f(y) − f(x)| dy = 0.

The terminology "continuity in mean" can be explained as follows: it is easy to show that the integral means

(1/(ω_n r^n)) ∫_{B_r(x)} f(y) dy

of a continuous function f converge to f(x) as r ↓ 0 for any x ∈ R^n, because they belong to the interval [min_{B_r(x)} f, max_{B_r(x)} f]. The previous theorem tells us that the same convergence occurs, for L n-a.e. x ∈ R^n, for any integrable function f. This simply follows from the inequality

| (1/(ω_n r^n)) ∫_{B_r(x)} f(y) dy − f(x) | = (1/(ω_n r^n)) | ∫_{B_r(x)} (f(y) − f(x)) dy | ≤ (1/(ω_n r^n)) ∫_{B_r(x)} |f(y) − f(x)| dy.

By the local nature of this statement, the same property holds for locally integrable functions.

Proof of Theorem 7.5. Given ε, δ > 0 and an open ball B, it suffices to check that the set

A := { x ∈ B : lim sup_{r↓0} (1/(ω_n r^n)) ∫_{B_r(x)} |f(y) − f(x)| dy > 2ε }

has Lebesgue measure less than (3^n + 1)δ. To this aim, we write f as the sum of a "good" part g and a "bad", but small, part h, i.e. f = g + h with g : B → R bounded and continuous, and ‖h‖_{L¹(B)} < εδ; this decomposition is possible because Proposition 3.15 ensures the density of bounded continuous functions in L¹(B).

The continuity of g gives

lim_{r↓0} (1/(ω_n r^n)) ∫_{B_r(x)} |g(y) − g(x)| dy = 0, ∀x ∈ B.

Hence, as f = g + h, we have A ⊂ A₁, where

A₁ := { x ∈ B : lim sup_{r↓0} (1/(ω_n r^n)) ∫_{B_r(x)} |h(y) − h(x)| dy > 2ε }.

Then, it suffices to show that L n(A₁) ≤ (3^n + 1)δ. By the triangle inequality, we also have A₁ ⊂ A₂ ∪ A₃ with

A₂ := { x ∈ B : |h(x)| > ε }

and

A₃ := { x ∈ B : sup_{r∈(0,1)} (1/(ω_n r^n)) ∫_{B_r(x)} |h(y)| dy > ε }.

Markov's inequality ensures that L n(A₂) ≤ ‖h‖_{L¹(B)}/ε < δ, so that we only need to show that L n(A₃) ≤ 3^n δ.

Notice that A₃ is open and bounded, and that for any x ∈ A₃ there exists r ∈ (0, 1), depending on x, such that B_r(x) ⊂ B and

∫_{B_r(x)} |h(y)| dy > ε ω_n r^n.

Let K ⊂ A₃ be a compact set and let {B_{r_i}(x_i)}_{i∈I} be a finite family of these balls whose union covers K. By applying Vitali's covering theorem to this family of balls, we can find a disjoint subfamily {B_{r_i}(x_i)}_{i∈J} such that the union of the enlarged balls B_{3r_i}(x_i) still covers K. Adding the previous inequalities with x = x_i and r = r_i and summing over i ∈ J we get

L n(K) ≤ ∑_{i∈J} ω_n(3r_i)^n ≤ (3^n/ε) ∑_{i∈J} ∫_{B_{r_i}(x_i)} |h(y)| dy ≤ (3^n/ε) ∫_B |h(y)| dy ≤ 3^n δ.

As K is arbitrary we obtain that L n(A₃) ≤ 3^n δ.

As K is arbitrary we obtain that L n(A3) ≤ 3nδ.

By applying the theorem to a characteristic function f = 1E we getlimr↓0

L n (E ∩Br(x))

ωnrn= 1 for L n-a.e. x ∈ E

limr↓0

L n (E ∩Br(x))

ωnrn= 0 for L n-a.e. x ∈ Rn \ E

Page 125: Probability Book

Chapter7 121

for any E ∈ B(Rn); points of the first type are called density points, whereas pointsof the second type are called rarefaction points.

Using the continuity in mean of integrable functions we obtain the fundamental theorem of calculus within the (natural) class of absolutely continuous functions.

Theorem 7.6 Let I ⊂ R be an interval and let f : I → R be absolutely continuous. Then f is differentiable at L 1-a.e. point of I. In addition f′ is Lebesgue integrable in I and

f(x) = f(a) + ∫_a^x f′(s) ds, ∀x ∈ I.    (7.6)

Proof. Let g be as in (7.3), and let x₀ ∈ I be a point where

lim_{r↓0} (1/r) ∫_{x₀−r}^{x₀+r} |g(s) − g(x₀)| ds = 0,    (7.7)

and notice that

(f(x₀ + r) − f(x₀))/r = (1/r) ∫_{x₀}^{x₀+r} g(s) ds = g(x₀) + (1/r) ∫_{x₀}^{x₀+r} (g(s) − g(x₀)) ds

for r > 0. Hence, passing to the limit as r ↓ 0, from (7.7) we get f′₊(x₀) = g(x₀); a similar argument shows that f′₋(x₀) = g(x₀). As, according to the previous theorem, L 1-a.e. point x₀ satisfies (7.7), we obtain that f is differentiable, with derivative equal to g, L 1-a.e. in I. It suffices to replace g with f′ in (7.3) to obtain (7.6).

One might think that differentiability L 1-a.e. and integrability of the derivative are sufficient for the validity of (7.6) (these are the minimal requirements to give a meaning to the formula). However, this is not true, as the Heaviside function 1_{(0,∞)} fulfils these conditions but fails to be (absolutely) continuous. Then, one might think that one should also require the continuity of f to have (7.6). It turns out that not even this is enough: we build in the next example the Cantor–Vitali function, also called the devil's staircase: a continuous function having derivative equal to 0 L 1-a.e., but not constant. This example shows why a stronger condition, namely absolute continuity, is needed.

Example 7.7 (Cantor–Vitali function) Let

X := { f ∈ C([0, 1]) : f(0) = 0, f(1) = 1 }.

This is a closed subspace of the complete metric space C([0, 1]), hence X is complete as well. For any f : [0, 1] → R we set

Tf(x) :=
  f(3x)/2              if 0 ≤ 3x ≤ 1,
  1/2                  if 1 < 3x < 2,
  1/2 + f(3x − 2)/2    if 2 ≤ 3x ≤ 3.    (7.8)

It is easy to see that T maps X into X, and that T is a contraction (with Lipschitz constant equal to 1/2). Hence, by the contraction principle, there is a unique f ∈ X such that Tf = f.

Let us check that f has zero derivative L 1-a.e. in [0, 1]. As f = Tf, f is constant, and equal to 1/2, in (1/3, 2/3). Inserting this information again in the identity f = Tf we obtain that f is locally constant (equal to 1/4 and to 3/4) on (1/9, 2/9) ∪ (7/9, 8/9). Continuing in this way, one finds that f is locally constant on the union of 2^{n−1} intervals, each of length 3^{−n}, for n ≥ 1. The complement C = [0, 1] \ A of the union A of these intervals is Cantor's middle third set (see also Exercise 1.7), and since

L 1(A) = ∑_{n=1}^{∞} 2^{n−1}/3^n = (1/2) ∑_{n=1}^{∞} (2/3)^n = 1,

we know that L 1(C) = 0. At any point of A the derivative of f is obviously 0.

In connection with the previous example, notice also that f maps A, a set of full Lebesgue measure in (0, 1), into the countable set of dyadic rationals {k2^{−n} : k, n ≥ 1}. On the other hand, it maps C, a Lebesgue negligible set, onto [0, 1], a set with strictly positive Lebesgue measure.
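Since T is a contraction of the complete space X, the Cantor–Vitali function can be computed by simply iterating (7.8); here is a sketch in Python (the grid resolution and the number of iterations are arbitrary choices for the illustration):

```python
import numpy as np

def T(f, x):
    # the operator (7.8), acting on a function f given as a callable
    y = np.empty_like(x)
    lo, mid, hi = 3*x <= 1, (3*x > 1) & (3*x < 2), 3*x >= 2
    y[lo] = f(3*x[lo]) / 2
    y[mid] = 0.5
    y[hi] = 0.5 + f(3*x[hi] - 2) / 2
    return y

x = np.linspace(0.0, 1.0, 3**7 + 1)
vals = x.copy()                    # start the iteration from f_0(x) = x, f_0 in X
for _ in range(40):                # contraction factor 1/2: very fast convergence
    vals = T(lambda t: np.interp(t, x, vals), x)

# f is constant, equal to 1/2, on the middle third (1/3, 2/3)
third = (x > 1/3) & (x < 2/3)
print(vals[third].min(), vals[third].max())     # both print 0.5
```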

EXERCISES

7.1 Let H : R → R satisfy the Lipschitz condition

|H(x) − H(y)| ≤ C|x − y|, ∀x, y ∈ R,

and let f : [a, b] → R be an absolutely continuous function. Show that H ∘ f is absolutely continuous.

7.2 ⋆ Let E ⊆ R be a Borel set and assume that any t ∈ R is either a point of density or a point of rarefaction of E. Show that either L 1(E) = 0 or L 1(R \ E) = 0. (Remark: the same result is true in R^n, but with a much harder proof, see [2], 4.5.11.)

7.3 [Lipschitz change of variables] ⋆ Let f : I = [a, b] → R be a Lipschitz map. Show that

∫_{f(a)}^{f(b)} ϕ(y) dy = ∫_a^b ϕ(f(x))f′(x) dx

for any bounded Borel function ϕ : f(I) → R.

7.4 Use the previous exercise to show that, for any Lipschitz function f : R → R and any L 1–negligible set N ∈ B(R), the derivative f′ vanishes L 1-a.e. on f^{−1}(N).

Chapter 8

Measurable transformations

In this chapter we study the classical problem of the change of variables in the integral from a new viewpoint. We will compute how the Lebesgue measure in R^n changes under a sufficiently regular transformation, generalizing what we have already seen for linear, or affine, maps. As a byproduct we obtain a quite general change of variables formula for integrals with respect to the Lebesgue measure.

8.1 Image measure

We are given two measurable spaces (X, E) and (Y, F), a measure µ on (X, E) and an (E, F)–measurable mapping F : X → Y. We define a measure F#µ on (Y, F) by setting

F#µ(I) := µ(F^{−1}(I)), I ∈ F.    (8.1)

It is easy to see that F#µ is well defined, by the measurability assumption on F, and σ-additive on F. F#µ is called the image measure of µ by F.

The following change of variable formula is simple, but of basic importance.

Proposition 8.1 Let ϕ : Y → [0, ∞] be an F–measurable function. Then we have

∫_X ϕ(F(x)) dµ(x) = ∫_Y ϕ(y) dF#µ(y).    (8.2)

Proof. It is enough to prove (8.2) when ϕ is a simple function, and so for any ϕ of the form ϕ = 1_I, where I ∈ F. In this case we have ϕ ∘ F = 1_{F^{−1}(I)}, hence (8.2) reduces to (8.1).

In the following example we discuss the relation between the change of variables formula (8.2), which even on the real line involves no derivative, and the classical one. The difference is due to the fact that in (8.2) we are not using the density of F#µ with respect to L 1. It is precisely in this density that the derivative of F shows up.


Example 8.2 Let F : R → R be of class C¹ and such that F′(t) > 0 for all t ∈ R. Let A be the image of F (an open interval, by the assumptions made on F) and let ψ : A → R be continuous. Then for any interval [a, b] ⊂ A the following elementary formula of change of variables holds (just put y = F(x) in the right integral):

∫_{F^{−1}(a)}^{F^{−1}(b)} ψ(F(x)) dx = ∫_a^b ψ(y) (1/F′(F^{−1}(y))) dy.    (8.3)

On the other hand, choosing ϕ = ψ1_I with I = [a, b] in (8.2), we have

∫_{F^{−1}(a)}^{F^{−1}(b)} ψ(F(x)) dx = ∫_a^b ψ(y) dF#L 1.

Hence,

∫_a^b ψ(y) (1/F′(F^{−1}(y))) dy = ∫_a^b ψ(y) dF#L 1.

Since a, b and ψ are arbitrary, (8.3) can be interpreted by saying that F#L 1 ≪ L 1 and

F#L 1 = (1/(F′ ∘ F^{−1})) L 1.

In the next section we shall generalize this formula to R^n, and even in one space dimension we will see that the assumption that F′ > 0 everywhere can be weakened (see also Exercise 8.3).

8.2 Change of variables in multiple integrals

We consider here the measure space (R^n, B(R^n), L n), where L n is the Lebesgue measure.

We recall a few basic facts from calculus in several variables: given an open set U ⊂ R^n and a mapping F : U → R^n, F is said to be differentiable at x ∈ U if there exists a linear operator DF(x) ∈ L(R^n; R^n) (1) such that

lim_{|h|→0} |F(x + h) − F(x) − DF(x)h| / |h| = 0.

The operator DF(x), if it exists, is unique, and is called the differential of F at x. If F is affine, i.e. F(x) = Tx + a for some T ∈ L(R^n; R^n) and a ∈ R^n, we have DF(x) = T for all x ∈ U.

(1) L(R^n; R^m) is the Banach space of all linear mappings T : R^n → R^m endowed with the sup norm ‖T‖ = sup{|Tx| : x ∈ R^n, |x| = 1}.


If F is differentiable at x ∈ U we define the Jacobian determinant JF(x) of F at x by setting

JF(x) = det DF(x).

If F is differentiable at any x ∈ U and the mapping DF : U → L(R^n; R^n) is continuous, we say that F is of class C¹. If, in addition, F is bijective between U and an open domain A and F^{−1} is of class C¹ in A, we say that F is a C¹ diffeomorphism of U onto A. In this case DF(x) is invertible and

D(F^{−1})(F(x)) = (DF(x))^{−1}, ∀x ∈ U.

Finally, by Proposition 6.10 we know that for T ∈ L(R^n; R^n) we have

L n(T(E)) = |det T| L n(E), ∀E ∈ B(R^n).    (8.4)

8.3 Image measure of L n by a C¹ diffeomorphism

In this section we study how the Lebesgue measure changes under the action of a C¹ map F. The relevant quantity will be the function |JF|, which really corresponds to the distortion factor of the measure.

Let U ⊂ R^n be open. The critical set C_F of F ∈ C¹(U; R^n) is defined by

C_F := { x ∈ U : JF(x) = 0 }.

Lemma 8.3 The image F(C_F) of the critical set is Lebesgue negligible.

Proof. Let K ⊂ C_F be a compact set and ε > 0; for any x ∈ K the set DF(x)(B₁(0)) is Lebesgue negligible (because DF(x) is singular, so that DF(x)(R^n) is contained in an (n−1)-dimensional subspace of R^n), hence we can find δ = δ(ε, x) > 0 such that

L n({ z ∈ R^n : dist(z − F(x), DF(x)(B₁(0))) < δ }) < ε.

By a scaling argument we get

L n({ z ∈ R^n : dist(z − F(x), DF(x)(B_r(0))) < δr }) < εr^n, ∀r > 0.

On the other hand, since |F(y) − F(x) − DF(x)(y − x)| < δr in B_r(x), provided r is small enough, we get

F(B_r(x)) ⊂ { z ∈ R^n : dist(z − F(x), DF(x)(B_r(0))) < δr },

so that L n(F(B_r(x))) < εr^n for r > 0 small enough.


Since the family of balls {B_{r/3}(x)}_{x∈K} covers the compact set K, we can find a finite family {B_{r_i/3}(x_i)}_{i∈I} whose union still covers K and extract from it, thanks to Vitali's covering theorem, a subfamily {B_{r_i/3}(x_i)}_{i∈J} made of pairwise disjoint balls such that the union of the enlarged balls {B_{r_i}(x_i)}_{i∈J} covers K. In particular, covering F(K) by the union of F(B_{r_i}(x_i)) for i ∈ J, we get

L n(F(K)) ≤ ∑_{i∈J} εr_i^n = (3^n ε/ω_n) ∑_{i∈J} ω_n (r_i/3)^n ≤ (3^n ε/ω_n) L n(U).

Letting ε ↓ 0 we obtain that L n(F(K)) = 0. Since K is arbitrary, by approximation (recall that C_F, being a closed subset of U, can be written as the countable union of compact subsets of U) we obtain that L n(F(C_F)) = 0.

The following theorem provides a necessary and sufficient condition for the absolute continuity of F#L n with respect to L n, assuming a C¹ regularity of F.

Theorem 8.4 Let U ⊂ R^n be an open set and let F : U → R^n be of class C¹, whose restriction to U \ C_F is injective. Then:

(i) F#(1_U L n) is absolutely continuous with respect to L n if and only if C_F is Lebesgue negligible.

(ii) If F#(1_U L n) ≪ L n we have

F#(1_U L n) = (1/(|JF| ∘ F^{−1})) 1_{F(U\C_F)} L n.    (8.5)

Proof. (i) If L n(C_F) > 0, we have F#(1_U L n)(F(C_F)) ≥ L n(C_F) > 0, and F#(1_U L n) fails to be absolutely continuous with respect to L n, because we proved in Lemma 8.3 that F(C_F) is Lebesgue negligible.

Let G be the inverse of the restriction of F to the open set U \ C_F. The local invertibility theorem ensures that the domain A = F(U \ C_F) of G is an open set, that G is of class C¹ in A, and that DG(y) = (DF)^{−1}(G(y)) for all y ∈ A. Let us assume now that C_F is Lebesgue negligible and let us show that F^{−1}(E) is Lebesgue negligible whenever E ⊂ F(U) is Lebesgue negligible. Since the part of F^{−1}(E) contained in C_F is L n–negligible, we can assume with no loss of generality that E ∩ F(C_F) = ∅, i.e. E ⊂ A. Let A_M be the open sets

A_M := { y ∈ A : ‖DG(y)‖ < M }.

We will prove that

L n(G(K)) ≤ (3M)^n L n(K)    (8.6)

for any compact set K ⊂ A_M. So, F#(1_U L n) ≤ (3M)^n L n on the compact sets of A_M and therefore on the Borel sets; in particular

L n(G(E ∩ A_M)) ≤ (3M)^n L n(E ∩ A_M) = 0,

and letting M ↑ ∞ we obtain that L n(G(E)) = 0, because E ⊂ A.

In order to show (8.6) we consider a bounded open set B contained in A_M and containing K, and the family of balls B_r(y) ⊂ B with y ∈ K and r > 0 sufficiently small (possibly depending on y), such that

|G(z) − G(y) − DG(y)(z − y)| < (M − ‖DG(y)‖)|z − y|, ∀z ∈ B_r(y).

In particular, since |DG(y)(z − y)| ≤ ‖DG(y)‖|z − y|, the triangle inequality gives

|G(z) − G(y)| ≤ M|z − y|, ∀z ∈ B_r(y),

and therefore G(B_r(y)) ⊂ B_{Mr}(G(y)) for any of these balls. Since the family of balls {B_{r/3}(y)}_{y∈K} covers K, we can find a finite family {B_{r_i/3}(y_i)}_{i∈I} whose union still covers K and extract from it, thanks to Vitali's covering theorem, a subfamily {B_{r_i/3}(y_i)}_{i∈J} made of pairwise disjoint balls such that the union of the enlarged balls {B_{r_i}(y_i)}_{i∈J} covers K. In particular, by our choice of the radii of the balls, the family {B_{Mr_i}(G(y_i))}_{i∈J} covers G(K). We then have

L n(G(K)) ≤ ∑_{i∈J} ω_n (Mr_i)^n = (3M)^n ∑_{i∈J} ω_n (r_i/3)^n ≤ (3M)^n L n(B).

Letting B ↓ K we obtain (8.6).

Let us prove (ii). We denote by h the Radon–Nikodym derivative of F#(1_U L n) with respect to L n; by Theorem 7.5 we have that

h(y) = lim_{r↓0} (1/(ω_n r^n)) ∫_{B_r(y)} h(z) dz = lim_{r↓0} L n(G(B_r(y))) / (ω_n r^n), for L n–a.e. y ∈ A.

Taking into account that F#(1_U L n) is concentrated on A, and that 1/(|JF| ∘ F^{−1}) = |JG|, it remains to prove that for all y₀ ∈ A we have

lim_{r↓0} L n(G(B_r(y₀))) / (ω_n r^n) = |JG|(y₀).    (8.7)

For the sake of simplicity we only consider the case when y₀ = 0 and G(0) = 0 (this is not restrictive, up to a translation in the domain and in the codomain). We divide the rest of the proof in two steps.


Step 1. We assume in addition that DG(0) = I and show that

lim_{r↓0} L n(G(B_r(0))) / (ω_n r^n) = 1,    (8.8)

which is equivalent to (8.7) in this case.

Since DG(0) = I, we have, by the definition of derivative,

lim_{|y|→0} |G(y) − y| / |y| = 0.

So, for any ε ∈ (0, 1) there exists δ_ε > 0 such that if |y| < δ_ε we have |G(y) − y| < ε|y|, so that

(1 − ε)|y| ≤ |G(y)| ≤ (1 + ε)|y|, ∀y ∈ B_{δ_ε}(0).    (8.9)

In particular

B_{(1−ε)r}(0) ⊂ G(B_r(0)) ⊂ B_{(1+ε)r}(0), ∀r < δ_ε.    (8.10)

Now, by (8.10) it follows that

(1 − ε)^n ≤ L n(G(B_r(0))) / (ω_n r^n) ≤ (1 + ε)^n,

provided r < δ_ε, and this proves that (8.8) holds.

Step 2. Set T = DG(0) and H(x) = T^{−1}G(x), so that DH(0) = I. Then we have G(B_r(0)) = T(H(B_r(0))) and so, thanks to (8.4),

L n(G(B_r(0))) = L n(T(H(B_r(0)))) = |det T| L n(H(B_r(0))),

which implies

lim_{r↓0} L n(G(B_r(0))) / (ω_n r^n) = |det T| lim_{r↓0} L n(H(B_r(0))) / (ω_n r^n) = |det T|.

The proof is complete.

Example 8.5 (Polar and spherical coordinates) Let us consider the polar coordinates

(ρ, θ) ↦ (ρ cos θ, ρ sin θ).

Here U = (0, ∞) × (0, 2π) and the critical set is empty, because the modulus of the Jacobian determinant is ρ.

In the case of the spherical coordinates

(ρ, θ, φ) ↦ (ρ cos θ sin φ, ρ sin θ sin φ, ρ cos φ)

we have U = (0, ∞) × (0, 2π) × (0, π) and the critical set is empty, because the Jacobian determinant is −ρ² sin φ, whose modulus ρ² sin φ does not vanish on U.


Theorem 8.6 (Change of variables formula) Let U ⊂ R^n be an open set and let F : U → R^n be of class C¹, with Lebesgue negligible critical set C_F, and injective on U \ C_F. Then

∫_{F(U)} ϕ(y) dy = ∫_U ϕ(F(x))|JF|(x) dx    (8.11)

for any Borel function ϕ : F(U) → [0, +∞].

Proof. By (8.2) and (8.5) we have

∫_{F(U\C_F)} ψ(y)/|JF|(F^{−1}(y)) dy = ∫_{U\C_F} ψ(F(x)) dx = ∫_U ψ(F(x)) dx

for any nonnegative Borel function ψ. Taking into account that F(C_F) is Lebesgue negligible and choosing ψ(y) = ϕ(y)|JF|(F^{−1}(y)) we conclude.


EXERCISES

8.1 Let (X, F), (Y, G) and (Z, H) be measurable spaces and let f : X → Y, g : Y → Z be measurable maps. Show that

g#(f#µ) = (g ∘ f)#µ

for any measure µ in (X, F).

8.2 Let f : {0, 1}^N → [0, 1] be the map associating to a sequence (a_i) ⊂ {0, 1} the real number ∑_i a_i 2^{−i−1} ∈ [0, 1]. Show that

f#( ×_{i=0}^{∞} (δ₀/2 + δ₁/2) ) = 1_{[0,1]} L 1.

8.3 ⋆ Show the existence of a strictly increasing C¹ function F : R → R such that F#L 1 is not absolutely continuous with respect to L 1.

8.4 ⋆⋆ Remove the injectivity assumption in Theorem 8.4, showing that

F#(1_U L n) = ( ∑_{x∈F^{−1}(y)\C_F} 1/|JF|(x) ) 1_{F(U\C_F)} L n

for any C¹ function F : U → R^n with Lebesgue negligible critical set.


Chapter 9

General concepts of Probability

In this chapter we will introduce the basic concepts and terminology of Probability Theory. We will see that many notions of Measure Theory have a direct counterpart in Probability Theory, and we shall, from now on, systematically adopt the notation and the terminology of the latter. For instance, we shall often use the word "law" in place of "probability measure". On the other hand, we will also encounter new concepts, typical of Probability Theory: the most important one is surely the concept of independence.

We will not give here a precise definition of the concept "probability of an event", and we shall assume this concept as a primitive one. The axiomatization of Probability Theory gives, a posteriori, an interpretation of this concept based on the so-called law of large numbers: according to this result the probability coincides with the asymptotic frequency of successful events. For instance, the probability of getting 3 after tossing an (ideal) die is 1/6 because, tossing it n times, the number k_n of times one obtains 3 satisfies

lim_{n→∞} k_n/n = 1/6.

It is also important to estimate (but still in a probabilistic sense) the rate of convergence of the frequencies k_n/n: this leads to the so-called central limit theorem.

9.1 Basic terminology

Definition 9.1 (Probability space) A probability space is a triplet (Ω, A, P), where (Ω, A) is a measurable space and P : A → [0, 1] is a probability measure.

The elements A ∈ A are usually called events; we shall denote by ω the generic point of Ω, also called an elementary event. As we already said, in this context probability measures will also be called laws.

Finally, given a probability space (Ω, A, P) and a property P(ω) whose truth or falsity depends on ω, we say that "P holds almost surely" if the set

{ω ∈ Ω : P(ω) is false}

is contained in a P-negligible set of A. This of course corresponds to the "P holds P-almost everywhere" terminology typical of Analysis.

9.2 Basic examples

Let us see now some important, and in some sense canonical, examples of probabilityspaces.

Example 9.2 (1) Let Ω = 0, 1, A = P(Ω) and P = pδ1 + qδ0, with p + q = 1.The law P is called Bernoulli law with parameter p. This space corresponds to therandom choice between two possibilities, indexed with 1 and 0, having probabilitiesrespectively p and q. The canonical example, with p = q = 1/2, is the toss of a (per-fect) coin. Many variants are obviously possible: for instance, if Ω = 1, 2, 3, 4, 5, 6and P =

∑61 δi/6, we have a probability space corresponding to the toss of a die.

(2) Let Ω = 0, 1n, A = P(Ω) and P = ×n

1 (pδ1 + qδ0), with p + q = 1. Thisprobability space corresponds, as we will see, to n random independent choicesbetween two possibilities having probability p and q. Even in this case the canonicalexample, with p = q = 1/2, is given by n consecutive tosses of a coin.

(3) Let Ω = 0, 1N, A =×∞0 P(0, 1) and P =×∞

0 (pδ1 + qδ0), with p + q =1. This probability space corresponds, as we will see, to a sequence of randomindependent choices between two possibilities having probability p and q. Even inthis case the canonical example, with p = q = 1/2, is given by a sequence of tossesof a coin.(4) Let Ω = [0, 1], A = B([0, 1]) and P(A) = L 1(A). The law P is said to beuniform in [0, 1]. This probability space corresponds to the choice of a randomnumber in [0, 1], with a uniform distribution. Let us mention that this exampleis strictly linked to the previous one (with p = q = 1/2) through the map thatassociates to a number its binary expansion (see Exercise 8.2). Notice also that inthis case all elementary events have null probability, and that there are subsets of[0, 1] that do not correspond to events because (as we noticed in Chapter 1) thereis no uniform probability measure defined on the whole of P([0, 1]).

More generally, if D ⊂ Rn is a Borel set with L n(D) < ∞, the probabilitymeasure in (D,B(D)) defined by

P(B) :=1

L n(D)L n(B ∩D)


induces the uniform distribution in D.

(5) Let Ω = (0, +∞), A = B(Ω) and P = (1/λ)e^{−t/λ} L 1, with λ > 0. This probability measure is called the exponential measure with parameter λ.

(6) Let Ω = R, A = B(R). Given µ ∈ R and σ > 0 we denote by N(µ, σ²) the Gaussian (or normal) law with parameters µ and σ. This law, absolutely continuous with respect to the Lebesgue measure L 1, has density

f_{µ,σ}(x) := (1/(√(2π)σ)) e^{−|x−µ|²/(2σ²)}.

We will see that µ and σ² represent respectively the mean and the variance of N(µ, σ²). The Gaussian law has a fundamental role in Probability Theory, thanks to the central limit theorem: this celebrated result shows that the deviations from the mean values, in a sequence of n independent and identically distributed trials, asymptotically display a Gaussian distribution.

(7) Assume that (Ω, A) = (N, P(N)) and λ > 0. The Poisson law with parameter λ is defined by

P({n}) = (λ^n/n!) e^{−λ}, ∀n ∈ N.

Equivalently P = ∑_n (λ^n/n!) e^{−λ} δ_n, so that

P(A) = ∑_{n∈A} (λ^n/n!) e^{−λ}, ∀A ⊂ N.

This law arises in some counting processes.

(8) Assume that (Ω, A) = (N \ {0}, P(N \ {0})) and p ∈ [0, 1]. The geometric law with parameter p is defined by

P({n}) = p(1 − p)^{n−1}, ∀n ≥ 1.

Equivalently P = ∑_{n=1}^{∞} p(1 − p)^{n−1} δ_n. Also this law arises in some counting processes.

The concept of random variable is strictly linked to the concept of probability space: it corresponds to the intuitive idea of a quantity X whose individual values X(ω) are not known, so that one tries to compute at least the statistical distribution, or law, of these values.

Definition 9.3 (Random variable) If (Ω, A, P) is a probability space and (Ω′, A′) is a measurable space, any (A − A′)-measurable function X : Ω → Ω′ is said to be a random variable with values in Ω′.

It is customary in Probability Theory to write

{X ∈ A}    for    {ω : X(ω) ∈ A},

where A ∈ A′. A random variable X is said to be finite (resp. discrete) if its range X(Ω) is finite (resp. countable).

Example 9.4 Let (Ω, A, P) be the space of Example 9.2(2). Then the integer-valued function

X(ω) := ∑_{i=1}^{n} ω_i

is a finite random variable with values in (N, P(N)). In the case of Example 9.2(3) the function

X(ω) := ∑_{i=0}^{∞} ω_i/2^{i+1}

is a random variable with values in [0, 1].

9.3 Expectation, variance and standard deviation

We shall often consider real-valued random variables. In this case we tacitly assume that the σ-algebra is B(R). For extended real random variables we consider, according to our definition of measurability for this class of maps, the σ-algebra whose generators are the elements of B(R) and {+∞}, {−∞}. For maps X taking values in N ∪ {+∞} this σ-algebra reduces to P(N ∪ {+∞}).

For these classes of random variables we can define the important concepts of expectation, variance, covariance and standard deviation.

Definition 9.5 (Expectation) Let X be an extended real random variable on a probability space (Ω, A, P). If X is P-integrable, we define the expectation of X as the real number

E_P(X) := ∫_Ω X(ω) dP(ω),

omitting P when it is clear from the context. More generally we define E(X) ∈ R̄ whenever the integral makes sense, i.e. either X₊ or X₋ is P-integrable.

The expectation is indeed the mean value of the random variable: for instance, if X has a finite number of values z₁, . . . , z_p, we have

E(X) = ∫_Ω X dP = ∑_{i=1}^{p} z_i P(X = z_i),    (9.1)

the weighted mean of these values, with weights equal to P(X = z_i).

Notice also that, thanks to the properties of the integral, the operator E is nondecreasing and satisfies

E(X + a) = E(X) + a,    E(aX + Y) = aE(X) + E(Y), ∀a ∈ R.    (9.2)

Moreover, E is continuous along nondecreasing sequences of random variables uniformly bounded from below (by the monotone convergence theorem), and along nonincreasing sequences of random variables uniformly bounded from above. Using these properties it is immediate to check that the random variables considered in Example 9.4 have expectation np and p respectively.

Hölder's inequality reads in this context as

|E(X)| ≤ [E(|X|^p)]^{1/p}, ∀p ∈ [1, ∞).    (9.3)

Analogously, Markov's inequality becomes

tP(X ≥ t) ≤ E(X), ∀t ≥ 0,    (9.4)

for any nonnegative extended random variable X. Recall also that this yields the implications

E(|X|) < ∞ =⇒ |X| ∈ R almost surely    (9.5)

and

E(|X|) = 0 =⇒ |X| = 0 almost surely.    (9.6)

We say that an integrable extended random variable X is centered if E(X) = 0. Any integrable extended random variable X can be transformed into a centered one with a translation: it suffices to replace X with X − E(X).

The standard deviation, introduced with the next definition, measures the mean deviation of a random variable from its expectation.

Definition 9.6 (Variance and standard deviation) Let X be an extended real random variable on (Ω, A, P). If |X| is P-integrable, we define the variance of X as the number

V[X] := E(|X − E(X)|²) = ∫_Ω |X(ω) − E(X)|² dP(ω).

The number σ(X) := √V[X] ∈ [0, ∞] is called the standard deviation of X.

Notice that V[X] = 0 if and only if X(ω) = E(X) almost surely (i.e. X is equivalent to a constant), and that

σ(X + a) = σ(X),    σ(aX) = |a|σ(X), ∀a ∈ R.    (9.7)

Other properties of the variance are given in Exercise 9.2. A random variable is said to be normalized if σ(X) = 1. Any square integrable extended random variable X not equivalent to a constant can be transformed into a normalized one with a homothety: it suffices to replace X with X/σ(X).

The variance is also called the mean square deviation. Expanding the square in the definition of V[X] and using (9.2) we get a more manageable, but maybe less intuitive, expression:

V[X] = E(X² + E²(X) − 2X·E(X)) = E(X²) + E²(X) − 2E(X)·E(X) = E(X²) − E²(X).    (9.8)

In particular σ(X) is finite if and only if X is square integrable, i.e. E(X²) < ∞. Using these properties it is easy to check that for the random variables considered in Example 9.4 the variances are npq and pq/3 respectively.
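For the first of these random variables, expectation and variance can also be verified by brute-force enumeration of the finite space Ω = {0, 1}^n of Example 9.2(2); a sketch in Python (the small values of n and p are arbitrary choices):

```python
from itertools import product

n, p = 6, 0.3
q = 1 - p

E = EX2 = 0.0
for omega in product((0, 1), repeat=n):   # Omega = {0,1}^n with the product law
    X = sum(omega)                        # the random variable of Example 9.4
    prob = p**X * q**(n - X)
    E += X * prob
    EX2 += X**2 * prob

print(E, n * p)                           # expectation: np
print(EX2 - E**2, n * p * q)              # variance E(X^2) - E^2(X): npq
```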

For discrete random variables X, taking values in {z₁, . . . , z_p}, we also have

σ²(X) = ∑_{i=1}^{p} (z_i − z̄)² P(X = z_i) = ∑_{i=1}^{p} z_i² P(X = z_i) − z̄²,

with z̄ = E(X) defined by (9.1).

We recall the Cauchy–Schwarz inequality (4.1) in L², which can be written as

|E(XY)| ≤ √(E(X²)E(Y²))    (9.9)

for any couple X, Y of square integrable random variables. In particular, if E(X) = 0 and E(Y) = 0, we get

|E(XY)| ≤ σ(X)σ(Y).    (9.10)

Definition 9.7 (Covariance and uncorrelated variables) Let X, Y be square integrable extended real random variables on (Ω, A, P). We define the covariance of the pair X, Y as the number

V[X, Y] := E((X − E(X))(Y − E(Y))).

When V[X, Y] = 0 we say that X, Y are uncorrelated.

By the Cauchy–Schwarz inequality the covariance is a real number, and using again the bilinearity of E we obtain the more manageable expression

V[X, Y] = E(XY) − E(X)E(Y).    (9.11)

In addition,

|V[X, Y]| ≤ σ(X)σ(Y)    (9.12)

for any square integrable random variables X, Y (it suffices to apply (9.9) to the centered variables X − E(X) and Y − E(Y)). See also Exercise 9.5.

Finally let us rewrite, and give a different proof of, Jensen's inequality 3.13 for expectations: it shows in particular that the inequality E(|X|^p) ≥ [E(|X|)]^p holds for all p ∈ [1, +∞).

Lemma 9.8 (Jensen's inequality) Let X be an integrable real random variable on (Ω, A, P) and let f : R → R be convex. Then

E(f(X)) ≥ f(E(X)).

Proof. Recall that the set of points where a real-valued convex function is not differentiable is at most countable; moreover, at any differentiability point x, the affine function L(y) = f(x) + f′(x)(y − x) bounds f(y) from below. In particular, since L(X) is integrable, the negative part of f(X) is integrable and the expectation of f(X) makes sense. From (9.2) we get

E(f(X)) ≥ E(L(X)) = L(E(X)) = f(x) + f′(x)(E(X) − x).

Choosing a sequence (x_n) of differentiability points converging to E(X) we obtain the stated inequality.

Notice that the above proof uses only the fact that f can be represented as the supremum of a family of affine functions, because for affine functions equality holds in Jensen's inequality. It can be proved that f : R → R ∪ {+∞} has this representation if and only if it is convex and lower semicontinuous. As a consequence, Jensen's inequality still holds for convex and lower semicontinuous functions f : R → R ∪ {+∞}.

9.4 Law and characteristic function of a random variable

Let us now introduce the basic concept of the law (or distribution) of a random variable.

Definition 9.9 (Law of a random variable) Let X : Ω → Ω′ be a random variable as in Definition 9.3. The image measure µ = X#P of P through X, defined by

µ(A′) := P(X ∈ A′), ∀A′ ∈ A′,

is called the law of X.

Recall that the change of variable formula for the image measure gives

E_P(f(X)) = ∫_Ω f(X(ω)) dP(ω) = ∫_R f(t) dµ(t),    (9.13)

where µ is the law of X, whenever the integrals make sense. From (9.13) we get

E(X) = ∫_R t dµ(t),    σ²(X) = ∫_R t² dµ(t) − ( ∫_R t dµ(t) )².    (9.14)

Example 9.10 (binomial law) If X has a finite number of values z₁, . . . , z_p, then the law of X is simply

∑_{i=1}^{p} P(X = z_i) δ_{z_i}.

The law of the first random variable considered in Example 9.4 is

∑_{i=0}^{n} \binom{n}{i} p^i q^{n−i} δ_i,

called the binomial distribution with parameters n and p. This can be checked either with a direct computation, or using the concept of conditional probability distribution that we will introduce later on. The law of the second random variable in Example 9.4 is the uniform distribution in [0, 1] (see Exercise 8.2).

The law of a random variable gives us information on the statistical distribution of the values of the variable, and many properties of the random variable can be inferred directly from the properties of its law. Two random variables are said to be identically distributed if they have the same law. For instance, if Ω = (0, 1), A = B(Ω) and P is the uniform measure, then [2x] and 1 − [2x], both with values in {0, 1}, have the same law (δ₀ + δ₁)/2, even though they are nowhere equal.

Notice also that identically distributed random variables need not be defined on the same probability space; on the other hand, the notion makes sense only if the two variables take their values in the same measurable space. For instance, suppose we endow the space {0, 1}^N with the probability defined in Example 9.2(3) and we endow [0, 1] with the uniform distribution defined in 9.2(4): then

X(ω) := ∑_{n=0}^{∞} ω_n/2^{n+1}, ω ∈ {0, 1}^N;    X(t) = t, t ∈ [0, 1],

are identically distributed, and the law of both variables is the uniform distribution in [0, 1] (see Exercise 8.2).

Definition 9.11 (Characteristic function of a random variable) Let X be an integrable extended real random variable on a probability space (Ω, A, P). The characteristic function of X is the complex-valued function defined by

X̂(ξ) := ∫_Ω e^{−iξX(ω)} dP(ω) = E(e^{−iξX}).

According to (9.13), we have

X̂(ξ) = ∫_Ω e^{−iξX(ω)} dP(ω) = ∫_R e^{−iξy} dX#P(y),

hence the characteristic function of X is nothing but the Fourier transform of the law of X, so that X̂ = Ŷ whenever X and Y are identically distributed. Recall also that the Fourier transform of any probability measure µ in R is a bounded continuous function (even uniformly continuous, see Exercise 6.24), and that µ is uniquely determined by its Fourier transform (see Theorem 6.35); in particular we have the equivalence

X̂ = Ŷ ⟺ X and Y are identically distributed.    (9.15)

For extended real integrable random variables, the expectation, the variance, the standard deviation and the characteristic function depend only on the law of the random variable (see (9.14)). In other words, if X and Y are integrable and identically distributed, then X and Y have the same mean, variance, standard deviation and characteristic function. Exercise 9.6 shows that also a kind of converse implication holds: namely, if X and Y take their values in the same measurable space (Ω′, A′), and if E(f(X)) = E(f(Y)) for any bounded A′–measurable function f : Ω′ → R, then X and Y are identically distributed.

The invariance of these concepts explains, at least in part, the fact that the probabilistic notation rarely emphasizes the domain or even the underlying probability measure P, unlike what happens in Analysis: it suffices to compare the probabilistic notation E(X) with the typical analytic one ∫_Ω X dP.

According to (9.14), it makes sense to talk of the expectation, variance and standard deviation of a law on (R, B(R)).

Example 9.12 (Expectation, variance, characteristic function of the Poisson law) Let P be the Poisson law with parameter λ. Then, if X(n) = n we have

E(X) = ∑_{n=0}^{∞} nP(X = n) = e^{−λ} ∑_{n=1}^{∞} n λ^n/n! = λe^{−λ} ∑_{m=0}^{∞} λ^m/m! = λ.

Moreover,

E(X²) = ∑_{n=0}^{∞} n²P(X = n) = e^{−λ} ∑_{n=1}^{∞} n² λ^n/n! = λe^{−λ} ∑_{m=0}^{∞} (m + 1) λ^m/m! = λ + λe^{−λ} ∑_{m=0}^{∞} m λ^m/m! = λ + λ² e^{−λ} ∑_{n=0}^{∞} λ^n/n! = λ + λ².

Hence, the variance of P is E(X²) − [E(X)]² = λ. Finally we have

P̂(ξ) = e^{−λ} ∑_{n=0}^{∞} e^{−iξn} λ^n/n! = e^{λ(e^{−iξ}−1)}, ∀ξ ∈ R.
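These series manipulations can be cross-checked numerically by truncating the Poisson series; a sketch in Python (λ, the truncation level N and the test frequency ξ are arbitrary choices):

```python
import numpy as np
from math import factorial, exp

lam, N = 2.5, 60                   # truncating at N = 60 is far in the tail
n = np.arange(N)
pn = np.array([lam**k / factorial(k) * exp(-lam) for k in range(N)])

print(np.sum(n * pn), lam)                            # E(X) = lambda
print(np.sum(n**2 * pn) - np.sum(n * pn)**2, lam)     # variance = lambda
xi = 0.8
print(np.sum(pn * np.exp(-1j * xi * n)),              # truncated series for P^(xi)
      np.exp(lam * (np.exp(-1j * xi) - 1)))           # closed form above
```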

Example 9.13 (Expectation, variance, characteristic function of the geometric law) Let P = ∑_{n=1}^{∞} p(1 − p)^{n−1} δ_n be the geometric law with parameter p. Then, the identity ∑_{n=1}^{∞} nx^{n−1} = (1 − x)^{−2} (which can be obtained by differentiation of ∑_{n=0}^{∞} x^n = 1/(1 − x)) with x = 1 − p gives

E(n) = ∑_{n=1}^{∞} np(1 − p)^{n−1} = p/(1 − (1 − p))² = 1/p.

Multiplying the previous identity by x we get ∑_{n=1}^{∞} nx^n = x/(1 − x)², and differentiating both sides we get

∑_{n=1}^{∞} n²x^{n−1} = 1/(1 − x)² + 2x/(1 − x)³.

This identity, with x = 1 − p, gives

E(n²) = ∑_{n=1}^{∞} n²p(1 − p)^{n−1} = 1/p + 2(1 − p)/p² = 2/p² − 1/p,

hence the variance of P is (1 − p)/p². Finally we have

P̂(ξ) = ∑_{n=1}^{∞} e^{−inξ} p(1 − p)^{n−1} = pe^{−iξ} ∑_{m=0}^{∞} (e^{−iξ}(1 − p))^m = pe^{−iξ}/(1 − (1 − p)e^{−iξ}).

Let us list the expectation, variance and characteristic function of the other basic laws on R (or on N) seen in this chapter.

Example 9.14 (Expectation, variance, characteristic function of the main laws)

(1) The Bernoulli law with parameter p has expectation p and variance pq. Its characteristic function is F(ξ) = q + pe^{−iξ}.

(2) The binomial law with parameters n and p has expectation np and variance npq. Its characteristic function is

F(ξ) = (q + pe^{−iξ})^n, ∀ξ ∈ R.

(3) The uniform law in [0, 1] has expectation 1/2 and variance 1/12. Its characteristic function is

F(ξ) = (1 − e^{−iξ})/(iξ), ∀ξ ∈ R \ {0}.

(4) The exponential law in (0, ∞) with parameter λ has expectation λ and variance λ². Its characteristic function is (see Example 6.34 and change variables)

F(ξ) = 1/(1 + iλξ), ∀ξ ∈ R.

(5) The Gaussian law N(µ, σ²) has expectation µ and variance σ². Its characteristic function, according to (6.33), is F(ξ) = e^{−iµξ} e^{−ξ²σ²/2}.


EXERCISES

9.1 Let X be a discrete random variable such that X(Ω) ⊂ N ∪ {+∞}. Show that

E(X) = ∑_{n=0}^{∞} P(X > n).    (9.16)

9.2 Consider two random variables X, Y with values in R^n and R^m respectively. The covariance V[X, Y] is the n × m matrix whose (i, j)-th entry is

(V[X, Y])_{i,j} := E((X_i − E(X_i))(Y_j − E(Y_j)))

and the variance is the n × n matrix V[X] := V[X, X]. Prove that:

(i) V[X, c] = 0 and V[c] = 0 for all c ∈ R^n;

(ii) V[X, Z] = V[Z, X]^t, where M ↦ M^t is the matrix transposition;

(iii) V[aX + bZ, Y] = aV[X, Y] + bV[Z, Y];

(iv) V[aX + bZ] = a²V[X] + b²V[Z] + abV[X, Z] + abV[Z, X];

(v) AV[X, Y] = V[AX, Y], so that V[X, BY] = V[X, Y]B^t and V[AX] = AV[X]A^t.

9.3 Given the notation of the previous exercise, define

σ(X) := √(tr V[X]),

where tr(A) is the trace of the matrix A. Prove that:

(i) σ(X) = σ(AX) if A is an n × n orthogonal matrix;

(ii) σ(aX + c) = |a|σ(X), where a ∈ R and c ∈ R^n.

9.4 (correlation) If σ(X) ≠ 0 and σ(Y) ≠ 0, we define the correlation R of the random variables X and Y taking values in R^n as

R[X, Y] := V[X, Y]/(σ(X)σ(Y)).

Prove that

−1 ≤ tr R[X, Y] ≤ 1

and that, if tr R[X, Y] = 1 (resp. tr R[X, Y] = −1), then λX = Y + c almost surely for some λ > 0 and c ∈ R^n (resp. λ < 0).

9.5 Let us endow the space of square integrable R^n–valued random variables with the scalar product ⟨X, Y⟩ = E(X · Y). Let

W := { X ∈ [L²(Ω, A, P)]^n : E(X) = 0 }

be the set of centered random variables. Show that:

(i) X ↦ (X − E(X)) is the orthogonal projection on W;

(ii) for X, Y ∈ W we have tr V[X, Y] = ⟨X, Y⟩, σ(X) = ‖X‖.


9.6 Let X, Y be random variables with values in (Ω′, A′). Then X and Y are identically distributed if and only if E(f(X)) = E(f(Y)) for any bounded A′–measurable function f : Ω′ → R.

9.7 Show the following refinement of Jensen's inequality: if f : R → R is strictly convex and E(f(X)) = f(E(X)), then X = E(X) almost surely.

9.8 Let (E, d) be a metric space endowed with the Borel σ–algebra and let X_n, Y_n be Borel random variables. Assume that X_n → X and Y_n → Y almost surely, and that X_n and Y_n are identically distributed for any n ∈ N. Show that X and Y are identically distributed.

9.9 ⋆⋆ [On De Finetti's coherence]
Consider a betting office. The bookmaker fixes the price of a bet A_n at q_n (this is the "quote"): this means that the bettor can buy an amount c_n of the bet (this is the "stake"), and if the outcome of the bet is positive, s/he will receive q_n c_n. (1)
Equivalently, setting p_n = 1/q_n, (by (ab)using the arbitrariness of c_n) we will say that the bettor can buy an amount c_n p_n of the bet, and if the outcome is positive, s/he will receive c_n.
It is common intuition that p_n represents, in a sense, the probability of A_n, according to the bookmaker's judgment. Indeed, the discussion that follows was proposed by De Finetti as a definition of what Probability is. (2)
To include in one single formula both the winning and the losing case, we will say that the bettor who buys an amount c_n p_n of the bet receives c_n 1_{A_n}(ω), where 1_{A_n}(ω) is 1 if the bet succeeds (that is, if the possibility ω ∈ A_n occurs) and 0 otherwise.
We suppose that there is a family of bets A_1, . . . , A_N (and we suppose that A_1, . . . , A_N ⊂ Ω, where Ω is a set); we collect the above quantities in vectors, with c⃗ := (c_1, . . . , c_N), p⃗ := (p_1, . . . , p_N), and also define the vector-valued function I⃗ : Ω → {0, 1}^N, I⃗ := (1_{A_1}, . . . , 1_{A_N}), so that

c⃗ · p⃗ := ∑_{n=1}^{N} c_n p_n,    c⃗ · I⃗ := ∑_{n=1}^{N} c_n 1_{A_n}

are respectively the amount paid and the amount won.
A "Dutch book" is a situation where it is possible to buy a set of bets that will guarantee a gain:

∃ c⃗ such that ∀ω, c⃗ · p⃗ < c⃗ · I⃗(ω).    (9.17)

We may argue that no sane bookmaker would allow such a situation! We will also assume that c_n can be chosen negative (that is, the bettor can sell the bet to the bookmaker – this is not usually possible in gambling, but is instead possible in stock exchange markets). If the above (9.17) is false for all possible c⃗ ∈ R^N, then we will say that the choice p⃗ is coherent.
Let B = {I⃗(ω) : ω ∈ Ω} ⊂ {0, 1}^N be the image of I⃗.

• ⋆ Prove that (9.17) is false iff p⃗ is in the convex hull of B.

• Suppose that (9.17) is false: since p⃗ is in the convex hull of B, there exist numbers λ_e ∈ [0, 1] such that ∑_{e∈B} λ_e = 1 and p⃗ = ∑_{e∈B} e λ_e. Let B_e = {ω ∈ Ω : I⃗(ω) = e} be the counterimage of e ∈ B: the family {B_e}_{e∈B} is a partition of Ω. Let τ be the algebra generated by that partition, and let P be the probability on τ such that P(B_e) = λ_e. Prove that A_1, . . . , A_N ∈ τ and that P extends p⃗, that is, P(A_n) = p_n.

We conclude that any sane bookmaker who buys and sells bets must use a true Probability as the model for his stakes; that is, coherence implies that probability is a measure (as Kolmogorov defined it).

(1) Note that the gain is (q_n − 1)c_n, that is, the quote includes the stake (so to say): this is the language of European/continental gambling. In the language of British gambling, instead, the quote does not include the stake.

(2) This exposition is based on the paper "On De Finetti Coherence and Kolmogorov Probability", by V. S. Borkar, V. R. Konda and S. K. Mitter, Stat. Prob. Lett. 66 (2004) 417–421.

9.10 What happens in the previous exercise if we decide that c_n ≥ 0? Assume for simplicity that the A_n are pairwise disjoint (this is the case, for example, when the bet is on the winner of a race). Assume that (9.17) is false: what does this imply on p_1, . . . , p_N?

Page 149: Probability Book

Chapter 10

Conditional probability andindependence

Let (Ω,A ,P) be a probability space. Given two events A and B, with P(B) > 0,the conditional probability P(A

∣∣B) of A given B is given by

P(A∣∣B) :=

P(A ∩B)

P(B). (10.1)

Using a measure-theoretic notation, P(·∣∣B) is the probability measure 1BP/P(B).

The notion of conditional probability is linked to the interpretation of probabilityas “quantitative evaluation of the possibility that an event occurs”; this evaluationhas to be based on the available informations. This is why the knowledge that theevent B occurred modifies the initial probability distribution P, and leads to theconditional probability 1BP/P(B).

For instance, if Ω = 1, 2, 3, 4, 5, 6 is the canonical probability space associatedto the toss of a die, then

P(6∣∣2, 4, 6) =

1

3and P

(6∣∣1, 3, 5) = 0.

Or, if Ω = [0, 1] with the canonical probability structure, then

P

(x ≤ 1

3∣∣x ≤ 1

2)

=2

3.

Maybe the wrong identification between the concepts of probability and con-ditional probability is underlying all wrong beliefs about delayed numbers in lot-teries and similar games: if one randomly extracts 5 integer numbers between 1and 90 (re-inserting the 5 chosen numbers in the ballot-box after each extraction),

145

Page 150: Probability Book

146 Conditional probability and independence

the probability that a given number, say 36, is not chosen in 100 extractions is(1 − 5/90)100 ∼ 3 · 10−3, but the conditional probability that it is not chosen in the100-th extraction, knowing that is has not been chosen in the first 99 extractions, is5/90 (this happens because this process has “no memory”, as we will see later on).

If one looks at probability as the limit of frequencies of succesful events, theformula (10.1) can be justified as follows: in a scheme of n trials we have

inn∼ P(A ∩B) and

bnn∼ P(B),

where in, bn are respectively the number of times the events A ∩B, B occurred. Inorder to compute the conditional probability P (A

∣∣B) we consider, as it must be,only the fraction in/bn, ignoring the n − bn cases in which B did not occur. Then,multiplying and dividing by n we get

P(A∣∣B) ∼ in

bn=inn· nbn∼ P(A ∩B)

P(B).

Notice also that if (B1, . . . , Bn) is a partition of Ω made by sets in A , then thefollowing Law of total probability holds:

P(A) =n∑

i=1

P(Bi)P(A∣∣Bi). (10.2)

This is the identity used, more or less implicitly, in all exercises of discrete probabilityin which the probability of an event is computed by considering separately the casesA ∩B1, . . . , A ∩Bn.

10.1 Independence of pairs of events, families and

random variables

Two events A and B are said to be independent if P(A∣∣B) = P(A), i.e. if P(A∩B) =

P(A)P(B). Intuitively, A is independent of B if the probability of the occurrence ofA is not affected by the knowledge that B occurred. For instance, if Ω = H,Tn

is the canonical space associated to n-consecutive tosses of a coin, then

P(ωn = T

∣∣ωn−1 = H)

=1

2and P

(ωn = T

∣∣ωn−1 = T)

=1

2

because the events are independent. On the other hand, in the case of the toss of adie, the events 2 and 2, 4, 6 are clearly not independent.

Page 151: Probability Book

Chapter10 147

Definition 10.1 (Independence of two families) We say that two families A1, A2

contained in A are independent if P(A1 ∩ A2) = P(A1)P(A2) for all A1 ∈ A1,A2 ∈ A2.

In order to extend the concept of independence to pairs of random variables weneed the concept of σ–algebra generated by a random variable X: for X : (Ω,A ) →(E,E ) be measurable, we shall denote by σ(X) the smallest σ–algebra G whichmakes X (G ,E )–measurable. Obviously

σ(X) = X ∈ A : A ∈ E .

Indeed, the right hand side is a σ–algebra and it is the minimal one with thisproperty. Moreover, if F is a family of generators of E (for instance halflines, inthe case X is a real-valued variable), then G is the σ–algebra generated by

X ∈ A : A ∈ F .

Definition 10.2 (Independence of two random variables) Let X : (Ω,A ) →(E,E ), Y : (Ω,A ) → (F,G ) be random variables. We say that X and Y areindependent if σ(X) is independent of σ(Y ), i.e.

P(X ∈ A ∩ Y ∈ B) = P(X ∈ A)P(Y ∈ B) ∀A ∈ E , ∀B ∈ F .

Notice also that the concept of independence at the level of random variablesdoes not require that the variables have the same target: it requires, instead, thatthey are defined on the same probability space (compare this with the concept ofbeing identically distributed).

We give now some important characterizations of the independence property forpairs or random variables taking their values respectively in the measurable spaces(E,E ) and (F,F ).

Proposition 10.3 The following properties are all equivalent to the independenceof X and Y :

(a) P(X ∈ A ∩ Y ∈ B) = P(X ∈ A)P(Y ∈ B) for A ∈ E ′ and B ∈F ′, where E ′, F ′ are generators of E and F respectively, stable under finiteintersection;

(b)E(f(X)g(Y )) = E(f(X))E(g(Y )) (10.3)

for any pair of functions f : E → R, g : F → R, with f E –measurable and gF–measurable, and f, g either both bounded or both nonnegative;

Page 152: Probability Book

148 Conditional probability and independence

(c) the law of the joint random variable Z = (X, Y ) : (Ω,A ) → (X × Y,E ×F )is the product of the laws of X and Y .

If (E, dE) and (F, dF ) are metric spaces and E , F are the corresponding Borelσ–algebras, we have a fourth equivalent property:

(d) E(f(X)g(Y )) = E(f(X))E(g(Y )) for any pair of continuous functions f : E →R, g : F → R, with f, g either both bounded or both nonnegative.

Proof. The independence, in what follows denoted by (i), is equivalent to (a).Indeed, it suffices to notice that for B ∈ F ′ fixed, the collection of sets A ∈ Esatisfying

P(X ∈ A ∩ Y ∈ B) = P(X ∈ A)P(Y ∈ B) (10.4)

is a Dynkin class and contains E ′. Therefore, by the Dynkin theorem, this collectioncoincides with E . As a consequence, the collection of sets F ∈ F such that (10.4)holds for all E ∈ E is a Dynkin class and contains F ′. Again, Dynkin theorem givesthat this collection coincides with F .

We are now going to show that (i)⇒(b)⇒(c)⇒(i).(i)⇒(b). The formula is true for characteristic functions f = 1A, g = 1B (in thiscase it reduces to (i)), hence for simple functions. Using Proposition 2.4 we obtainits validity for nonnegative measurable functions. Eventually, splitting in positiveand negative part, it holds for bounded measurable functions.(b)⇒(c). Denoting by λ the law of Z, choosing f = 1A and g = 1B with A ∈ E andB ∈ F , we get

λ(A×B) = Eλ(1A×B) = E(Z1 ∈ A ∩ Z2 ∈ B)= E(f(X)g(Y )) = E(f(X))E(g(Y )) = µ(A)ν(B),

with µ, ν respectively equal to the laws of X and Y . Hence λ = µ× ν.(c)⇒(i). Denoting by µ, ν the laws of X and Y respectively, we have

P(X ∈ A ∩ Y ∈ B) = P(Z1 ∈ A ∩ Z2 ∈ B) = µ× ν(A×B)

= µ(A)ν(B) = P(X ∈ A)P(Y ∈ B).

Finally, it is clear that (b)⇒(d) if the σ–algebras in E and F are the Borelσ–algebras, as continuous functions are measurable. We then show that (d)⇒(a)where we set E ′ and F ′ to be respectively the open sets in E and F . Let then A ⊂ Eand B ⊂ F be open; by Exercise 2.11 we know that there exist two sequences ofnonnegative continuous functions fn : E → [0, 1] and gn : F → [0, 1] monotonicallyconverging to 1A and 1B: we then just pass to the limit into

E(fn(X)gn(Y )) = E(fn(X))E(gn(Y ))

Page 153: Probability Book

Chapter10 149

to obtain, by monotone approximation, that E(1A(X)1B(Y )) = E(1A(X))E(1B(Y )),that is P(X ∈ A ∩ Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Remark 10.4 (Criterion for indepencence of two σ–algebras) The proof that(a) implies the independence of the random variables can also be used to show thatA1 is independent of A2 if and only if

P(A ∩B) = P(A)P(B) ∀A ∈ F1, ∀B ∈ F2,

for suitable π–systems F1 and F2 generating A1 and A2 respectively.

10.2 Independence of families of events and of

random variables

In this section we will see how the concept of independence can be extended toarbitrary families of events and random variables.

Definition 10.5 (Independence of families Ai) Let Ai ⊂ A , for i ∈ I. Wesay that Aii∈I are independent if

P

(⋂i∈J

Ai

)=∏i∈J

P(Ai) ∀Ai ∈ Ai, i ∈ J

for any finite set J ⊂ I.

In order to extend the concept of independence to families of random variables,let us consider a family, indexed by i ∈ I, of random variables Xi : Ω → Ei, with(Ei,Ei) measurable spaces.

Definition 10.6 (Independence of random variables) The family of randomvariables Xii∈I is said to be independent if the σ–algebras σ(Xi)i∈I are indepen-dent, namely

P

(⋂i∈J

Xi ∈ Ai

)=∏i∈J

P(Xi ∈ Ai) ∀Ai ∈ Ei, i ∈ J (10.5)

for all finite sets J ⊂ I.

Page 154: Probability Book

150 Conditional probability and independence

Exercise 10.2 shows that also this concept can be formulated in terms of indepen-dence of pairs of events. Notice also that a family of events (or of random variables)is independent if and only if any finite subfamily is independent. Moreover, inde-pendence is preserved by taking smaller families of events; at the level of randomvariables this gives the implication

Xii∈I independent =⇒ φi(Xi)i∈I independent, (10.6)

because σ(φi(Xi)) ⊂ σ(Xi) whatever φi is. We shall discuss another less elementarystability property in Proposition 10.10 .

Proposition 10.3 can be extended with no great difficulty to the case of a familyof (A ,Ei)–measurable random variables Xi : Ω → Ei, indexed by any set I.

Proposition 10.7 Let Xi : Ω → Ei be (A ,Ei)–measurable random variables. Thefollowing properties are all equivalent to the independence of (Xi)i∈I :

(a) there exists π–systems E ′i generating Ei such that

P

(⋂i∈J

Xi ∈ Ai

)=∏i∈J

P(Xi ∈ Ai)

holds for any finite set J ⊂ I and any choice of Ai ∈ E ′i , i ∈ J ;

(b) ∏i∈J

E(fi(Xi)) = E(∏

i∈J

fi(Xi))

(10.7)

for any finite set J ⊂ I, for any choice of Ei–measurable functions fi : Ei → R,either all bounded or all nonnegative;

(c) for all J ⊂ I finite (or countable), the law of the joint random variable Z =

(Xi)i∈I : (Ω,A ) → (×i∈I

Ei,×i∈I

Ei) is the product of the laws of Xi.

If Ei are metric spaces and Ei are the corresponding Borel σ–algebras, we have afourth equivalent property:

(d) equality (10.7) holds for any finite set J ⊂ I, for any choice of real continuousfunctions fi : Ei → R, either all bounded or all nonnegative.

We omit the proof, that is a simple generalization of the one of Proposition 10.3,with the exception of statement (a). In the case of statement (a), the necessity forindependence is obvious, and sufficiency can be proved by induction. Precisely, one

Page 155: Probability Book

Chapter10 151

proves by induction on k that for all sets J = j1, . . . , jn with cardinality n ≥ k,we have

P

(n⋂

i=1

Xji∈ Aji

)=

n∏i=1

P(Xji∈ Aji

) (10.8)

for Aji∈ Eji

, 1 ≤ i ≤ k and Aji∈ E ′

ji, i > k (if n > k). The induction argument

follows the same scheme used in the proof of Proposition 10.3(a): for k = 1 one fixesAji

∈ E ′ji, i > 1, and the class

Aj1 ∈ Ej1 : P

(n⋂

i=1

Xji∈ Aji

)=

n∏i=1

P(Xji∈ Aji

)

.

By the assumptions made in (a), this class contains E ′j1

and it is Dynkin, therefore(10.8) holds for Aj1 ∈ Ej1 and Aji

∈ E ′ji, i > 1. Continuing in this way, when k

reaches the cardinality of J we prove the independence property (10.5).The above proposition has an interesting corollary for discrete random variables:

Corollary 10.8 Suppose X1, . . . , Xn : Ω → C are discrete random variables takingvalues in a countable set C. Suppose that for all choices of x1, . . . , xn ∈ C we have

P

(n⋂

i=1

Xi = xi

)=

n∏i=1

P(Xi = xi).

Then, X1, . . . , Xn are independent.

The proof follows from (a), choosing E ′i equal to the the family of singletons

of C. Another consequence of the proposition is the fact that independence forfamilies of singletons Aii∈I is equivalent to the independence of the correspondingcharacteristic functions.

Corollary 10.9 Given Aii∈I ⊂ A , the following are equivalent:

(i) the family Aii∈I is independent, according to Definition 10.5;

(ii) the family of characteristic functions 1Aii∈I is independent, according to

Definition 10.6.

The proof again follows from (a), choosing Ei = P0, 1 and E ′i = 1, 0, 1.

The independence of random variables enjoys another stability property under com-position: it corresponds to the intuitive fact that functions generated from indepen-dent variables are independent. Before stating it, we define σ–algebra generated by

Page 156: Probability Book

152 Conditional probability and independence

a family of random variables Xii∈I as the smallest σ–algebra A ′ that makes allXi (A ′,Ei)–measurable. It is generated by the π–system⋂

i∈J

Xi ∈ Ai Ai ∈ Ei, i ∈ J , J ⊂ I finite.

Proposition 10.10 (Stability of independence under composition) Suppose

that Xii∈I are independent and that Xi : Ω → Ei are (A ,Ei)–measurable. Then,

for all finite J, K ⊂ I disjoint and all functions f , g respectively defined in×i∈J Ei

and×i∈K Ei, and measurable with respect to the product σ–algebras, we have thatf(Xii∈J) and g(Xii∈K) are independent.

Proof. The σ–algebra generated f(Xii∈J) (resp. f(Xii∈K) is contained inthe σ–algebra AJ (resp. AK) generated by Xii∈J (resp. Xii∈K), so it sufficesto show that the σ–algebras AJ and AK are independent. Now, AJ and AK arerespectively generated by the the π–systems FJ and FK obtained by consideringfinite intersections Xi ∈ Ai, with i ∈ J or i ∈ K, and the independence assumptionon Xii∈I gives P(E∩F ) = P(E)P(F ) for all E ∈ FJ and F ∈ FK . By Remark 10.4we conclude that AJ is independent of AK .

In the case when I = N we can reverse somehow the implication stated inProposition 10.10, obtaining an independence criterion that will be useful in thefollowing.

Proposition 10.11 (Recursive independence criterion) Let (Xn) be a sequenceof random variables such that, for any n ≥ 1, Xn is independent from (X0, . . . , Xn−1)(i.e. σ(Xn) is independent of σ(Xii<n)). Then (Xn) is an independent sequence.

Proof. Let 0 ≤ i1 < i2 < · · · < ik and Aj = X−1ij

(Bj) be in the σ–algebragenerated by Xij , j = 1, . . . , k. Setting n = ik, the event An is independent fromthe event ∩j<kAij , as the former belongs to the σ–algebra generated by Xn and thelatter to the σ–algebra generated by (X0, . . . , Xn−1). We have then

P

(k⋂

j=1

Aij

)= P

(k−1⋂j=1

Aij

)· P(Aik).

Continuing in this way we get P(∩k1Aij) =

∏k1 P(Aij).

By combining the above ideas, it is possible to create many other “independencecriterions”: see Exercise 10.1,. . . , Exercise 10.4; but not all of them work, as seenin Exercise 10.5.

Page 157: Probability Book

Chapter10 153

10.2.1 Independence of real valued variables

Whenever X and Y are real valued and integrable we can obtain from Propo-sition 10.3(b) with f and g equal to the positive and the negative part, thatE(X+Y +) = E(X+)E(Y +), E(X+Y −) = E(X+)E(Y −), E(X−Y +) = E(X−)E(Y +),E(X−Y −) = E(X−)E(Y −). Splitting X and Y in positive and negative part, itfollows that XY is integrable and

E(XY ) = E(X)E(Y ). (10.9)

In this identity the independence is really crucial: without this assumption we cansay that XY is integrable only if X and Y are square integrable, and typicallyE(XY ) 6= E(X)E(Y )! An analogous splitting in real and imaginary part shows that(10.9) still holds for integrable complex-valued random variables X, Y .

If moreover (X1, . . . , Xn) are real integrable and independent, then applyingequation (10.7) with fi(x) = |x| we obtain that all products

∏k1 Xj with 1 ≤ k ≤ n

are integrable; moreover, by Proposition 10.10, the product X1 · · ·Xn−1 is indepen-dent from Xn. Using these facts we can repeatedly apply (10.9) to obtain

E

(n∏

j=1

Xj

)=

n∏j=1

E(Xj). (10.10)

Again, an analogous argument holds for integrable complex-valued random variables.The identity (10.9) has a remarkable consequence: if X1, . . . , Xn are pairwise

independent, then they are uncorrelated, so that the variance of the sumX1+· · ·+Xn

is the sum of the variances of the Xi. Indeed, since Xi − ci is still independent ofXj − cj when i 6= j and ci, cj ∈ R, in the proof of this fact we can assume with noloss of generality that the Xi are centered. Then, we have

σ2(X1 + · · ·+Xn) = E(|X1 + · · ·+Xn|2) =n∑

i, j=1

E(XiXj) (10.11)

=n∑

i=1

E(X2i ) =

n∑i=1

σ2(Xi).

This fact shows that the variance of the arithmetic mean of the Xi can be estimatedby C/n, if all variances of Xi are bounded by C. This simple observation will bethe basis of the proof of the strong law of large numbers, in the next chapter.

At the same time it is important to remark that any family of random variablesmay be “independent” whereas the concept of uncorrelation makes sense only forreal (or, at most vector valued) random variables.

Page 158: Probability Book

154 Conditional probability and independence

Example 10.12 Using the independence of∑n−1

1 Xi of Xn we can check that therandom variable Sn =

∑n1 Xi in Example 9.4 has binomial law with parameters n

and p. We can show that

P(Sn = i) =(ni

)piqn−i, 1 ≤ i ≤ n

by induction on n using the fact that Sn−1 is independent of Xn and (10.2):

P(Sn = i) = qP(Sn = i

∣∣Xn = 0)

+ pP(Sn = i

∣∣Xn = 1)

= qP(Sn−1 = i

∣∣Xn = 0)

+ pP(Sn−1 = i− 1

∣∣Xn = 1)

=

(n− 1

i

)piqn−1−i · q +

(n− 1

i− 1

)pi−1qn−1−i+1 · p =

(ni

)piqn−1.

We now show a basic rule for computing the characteristic function of the sumof n independent random variables.

Proposition 10.13 If Y1, . . . , Yn are independent random variables, then

n∑j=1

Yj =n∏

j=1

Yj.

Proof. For any ξ ∈ R we have

n∑j=1

Yj(ξ) = E

(e−iξ

nPj=1

Yj

)= E(

n∏j=1

e−iξYj),

whileYj(ξ) = E(e−iξYj), 1 ≤ j ≤ n.

The conclusion follows by (10.10) with Xj = e−iξYj , noticing that (Xj) are stillindependent.

10.3 Independent sequences with prescribed laws

A sequence of random variables (Xi) all taking values in the same measurable space(E,E ) is called a discrete stochastic process. Here “discrete” refers to the fact thatthe index i varies in N, while in many other important applications also continuousparameters are used (typically Xt, with t ∈ R).

One of the most important applications of Proposition 10.7 consists in the pos-sibility of building, given a sequence (Pi) of laws in the measurable spaces (Ω′

i,A′

i ),

Page 159: Probability Book

Chapter10 155

an independent sequence (Xi) of random variables with values in Ω′i and having laws

Pi. In many cases the law Pi = µ does not depend on i; in this case we will say thatthe sequence (Xi) is independent and identically distributed.

In order to build a probability space and independent mapsXi with this property,it suffices to take as probability space

(Ω,A ,P) :=

(∞∏i=1

Ω′i,

×i=1

Ai,∞

×i=1

Pi

),

and as functionsXi(ω) the canonical projections on the i-th coordinate, i.e. Xi(ω) =ωi. Since, by the definition of product measure

P(Xi ∈ Ai) = P(ω : ωi ∈ Ai) = Pi(Ai) ∀Ai ∈ Ai,

we obtain that the law of Xi is µi. More generally, an analogous argument givesthat the law of (Xi1 , . . . , Xin) is µi1 × · · · × µin , because for any choice of Ak ∈ Aik ,1 ≤ k ≤ n, the event

ω ∈ Ω : wij ∈ Aj j = 1, . . . , k

is cylindrical and has P-probability

k∏j=1

Pij(Aj) =n

×j=1

µij(A1 × · · · × An).

The independence (not only pairwise, but as a whole) of the variables Xi follows byProposition 10.7(c).

We begin with a fundamental example, that can be built as a product of infinitecopies of the Bernoulli law (as in Example 9.2(3)).

Example 10.14 (Bernoulli process) A discrete Bernoulli process with parame-ter p ∈ [0, 1] is a sequence of independent and identically distributed random vari-ables Xi with Bernoulli law with parameter p.

Given a discrete stochastic process (Xn), one can typically build many newrandom variables starting from Xn, as shown in the following example.

Example 10.15 (k-success time and Pascal laws) We consider a discrete Bernoulliprocess with parameter p. We consider the random variable

T (ω) := inf n ≥ 1 : Xn(ω) = 1 ,

Page 160: Probability Book

156 Conditional probability and independence

so that T represents the time of the first success. As T > n is the intersection ofthe sets Xi = 0 for i = 1, . . . , n, the independence of (Xn) gives

P(T > n) = (1− p)n.

In particular one has that almost surely T is finite, if p > 0. Moreover, writingT = n = T > n− 1 \ T > n, we get

P(T = n) = p(1− p)n−1 ∀n ≥ 1.

This shows that T has a geometric law with parameter p, so that E(T ) = 1/p. Wehave also

P(T − n = k

∣∣T > n)

=P(T = n+ k)P(T > n)

= P(T = k), (10.12)

therefore the law of T − n, conditioned to T > n, has still geometric law withparameter p (one says that the geometric law has no memory).

Let us build now new random variables as follows: T1 = T ,

T2(ω) := min n > T1(ω) : Xn(ω) = 1

and, recursively, Tk+1(ω) := min n > Tk(ω) : Xn(ω) = 1. The variables Tk repre-sent the time of the k-th success. Let us compute the law of Tk, called Pascal lawof parameter p, k: since for n ≥ k we have

Tk = n = Xn = 1 ∩ X1 + · · ·+Xn−1 = k − 1

we get

P(Tk = n) = p ·(n− 1

k − 1

)pk−1(1− p)n−1−(k−1)

=

(n− 1

k − 1

)pk(1− p)n−k n = k, k + 1, . . .

Let eventually U1 = T1 and Un+1 = Tn+1 − Tn be the times of return to success :then (Un) is a sequence of independent geometrical random variables. (The prooffollows from Exercise 10.1).

A very particular, but interesting, case of the previous construction correspondsto the situation when (Ω′

i,A′

i ) = (−1, 1,P(−1, 1)) and the laws Pi are all equalto µ = (δ−1 + δ1)/2.

Page 161: Probability Book

Chapter10 157

Example 10.16 (Rademacher sequence) Let

(Ω,A ,P) :=([0, 1],B([0, 1]),L 1

)and let us divide [0, 1] in 2n closed intervals with length 2−n, setting Rn alternativelyequal to −1 and 1 in the interior of these intervals, defining Rn on the extreme pointsof the intervals in such a way that Rn is right continuous (other choices are possible,without changing the law of Rn). The law of Rn is µ because

P (Rn = ±1) =1

2.

We will check that (Rn) is independent in two ways, a direct one and a moretheoretical one. Thanks to the recursive independence criterion 10.11 it suffices tocheck that for any n the σ–algebra generated by Rn+1 is independent of the σ–algebra generated by R1, . . . , Rn. The latter σ–algebra is generated by intervals Iwith length 2−n on which each Rn+1 is equal to −1 in the first half, and equal to 1in the second half (the values in the mid points are irrelevant), hence

P(Rn+1 = ±1

∣∣I) =P(Rn+1 = ±1 ∩ I)

P(I)= 2−n−1+n = P(Rn+1 = ±1).

This proves that Rn+1 is independent of the σ–algebra generated by R1, . . . , Rn.Another way to show the independence is to notice that Rn = (2Xn − 1) φ,

where

φ : ([0, 1],B([0, 1])) →

(∞∏i=1

0, 1,∞

×i=1

P(0, 1)

)is the map associating to any number in [0, 1] its binay expansion (in case of non-uniqueness one can choose the one with finitely many digits equal to 1) and Xn isthe map that to ω ∈

∏∞1 0, 1 associates its i-th coordinate. Notice also that φ

maps the uniform measure in [0, 1] in the product measure P =×∞1 (δ0 + δ1)/2 (see

Exercise 8.2). Being the sequence Yn = 2Xn − 1 independent, we have

L 1

(n⋂

j=1

R−1ij

(Aij)

)= L 1

(φ−1(

n⋂j=1

Y −1ij

(Aij))

)= P

(n⋂

j=1

Y −1ij

(Aij)

)

=n∏

j=1

P(Y −1ij

(Aij)) =n∏

j=1

L 1(φ−1(Y −1ij

(Aij)))

=n∏

j=1

L 1(R−1ij

(Aij))

for any choice of 1 ≤ i1 < i2 · · · < in and Aij ⊆ −1, 1.

Page 162: Probability Book

158 Conditional probability and independence

The Rademacher sequence provides us with a simple example of a family ofrandom variables pairwise independent but not globally independent: it suffices toconsider the triplet of functions (R1, R2, R1R2); where this triplet independent, wewould obtain thanks to Proposition 10.10 that R1R2 is independent from itself, andthis is clearly false (see also Exercise 10.7). On the other hand, it is easy to checkthat the pairs (R1, R1R2) and (R2, R1R2) are independent.

From that example we can easily generate another important example: a processof random variables that are pairwise independent but not all independent; to thisend. let P3 be the law of (R1, R2, R1R2), and use the construction shown at thebeginning of the section once again, to build infinite variables that have law P3 andare independent.

Page 163: Probability Book

Chapter10 159

EXERCISES

10.1[Recursive discrete independence criterion] Let (Xn) be a sequence of discrete random vari-ables taking values in a countable set C; suppose that, for any n ≥ 1, for all choices of x0, . . . , xn+1 ∈C, letting B =

⋂ni=0Xi = xi,

P (Xn+1 = xn+1 | B) = P (Xn+1 = xn+1)

whenever P (B) > 0. Then (Xn) is an independent sequence.

10.2 (Independence is ensured if we ask that any choice of disjoint subfamilies be independent) Show thatXii∈I is independent if and only if, chosen J, K ⊂ I disjoint and denoting by AI and AK theσ–algebras generated by Xii∈J and Xii∈K , we have

P(A ∩B) = P(A)P(B) ∀A ∈ AJ , B ∈ AK .

10.3 (Independence is still ensured if we only assume that each variable is independent from all the others) Showthat Xii∈I is independent if and only if, for any chosen j ∈ I, denoting by Aj and A¬j theσ–algebras generated by Xj and (Xi)i 6=j , we have

P(A ∩B) = P(A)P(B) ∀A ∈ Aj , B ∈ A¬j .

10.4 (In the case of events, we can test independence of one event w.r.t. a finite sub family of all the others)

Suppose that Xi takes values in 0, 1. Show that Xii∈I is independent if and only if, for anychosen j ∈ I and K ⊂ I finite and with j 6∈ K, defining A = Xj = 1, B = Xi = 1, i ∈ K, wehave

P(A ∩B) = P(A)P(B).

(Hint: Xi = 1Ciwith Ci ∈ A ; so A = Cj and B =

⋂i∈K Ci; by Corollary 10.9, show that the

events Cii∈I are independent).

Note that in the Exercise 10.2 we test independence of each Xj against the block (Xi)i∈K where K = i : i 6= j;whereas in 10.4 we consider more generally all choices j, K ⊂ I where j 6∈ K. The following example shows that

we cannot generalize 10.4 further, that is, supposing that P(A ∩B) = P(A)P(B) holds only when K = i : i 6= j.10.5 Show that there exists an example of 3 variables (X1, X2, X3) taking values in 0, 1, suchthat (X1, X2, X3) are not independent, but, for any chosen j ∈ 1, 2, 3 and defining A = Xj = 1,B = Xi = 1, i 6= j, they satisfy

P(A ∩B) = P(A)P(B).

(Hint: let V be the set of all possible densities p ∈ R8 of the law of (X1, X2, X3) on 0, 13 that satisfy the requisite;

prove that V is a smooth manifold of dimension 4 in a neighbourhood of (1/8, 1/8 . . . , 1/8); whereas the space of

all densities of independent variables is at most 3 dimensional.)

10.6(Ricomposition of independence) Let (Xi)i∈J be a family of random variables. Let J bea partition of J in non empty subsets; for I ∈ J let YI = (Xi)i∈I be the vector random variable;suppose that,

• for any I ∈ J , the family (Xi) with i ∈ I is a family of independent random variables

• the family of variables YI , for I ∈ J , is a family of independent random variables

Page 164: Probability Book

160 Conditional probability and independence

Prove that (Xi)i∈J is a family of independent random variables.Find simple examples where (Xi)i∈J is not a family of independent random variables, but one ofthe two properties above holds.10.7 Show that a real Borel random variable X is independent of itself if and only if X is almostsurely equal to a constant.Note also that a random variable X almost surely equal to a constant, is independent of any otherrandom variable.10.8 Let X, Y be real independent random variables with X ≤ Y . Show the existence of a constantc such that X ≤ c ≤ Y almost surely.10.9 Find an example of a random variable X that is independent of itself but not almost surelyequal to a constant.(Hint: find a σ–algebra τ on R such that x ∈ τ,∀x ∈ R, and find a probability P on (R, τ) suchthat Px = 0 and P(A) ∈ 0, 1,∀A ∈ τ ; let X : (R, τ,P) → (R, τ) be the identity.)10.10 Suppose that X, Y are two real independent random variables, and X − Y is almost surelyconstant: prove that X and Y are almost surely constant.10.11(Density and independence) Let X1, . . . , Xn be random variables on (Ω,F ,P) withvalues in a space (C,C , µ); let νi := X#P be the law of Xi; suppose that νi µ; let ρi ∈L1(Ω,F ,P) be the density of νi w.r.t. µ (as defined by the Radon–Nikodym Theorem 6.14). LetY = [X1, . . . , Xn] be the block random variables with values in a space (Cn,C n) let ν = Y#P andµ = µn. Prove that these two are equivalent:

• X1, . . . , Xn are independent

• ν µ, andρ(x) = ρ1(x1) · · · ρn(xn) for µ almost all x ∈ Cn

where ρ is the density of ν w.r.t. µ, and x = (x1, . . . xn) ∈ Cn.

The above is often applied to the case of discrete variables (when C is finite or countable): supposethat µ is the counting measure, then any measure on C is absolutely continuous w.r.t. to µ; inthat case, the marginal density satisfies

PXi ∈ A =∑xi∈A

ρi(x)

for any A ∈ C; and the block satisfies

P[X1 . . . , Xn] ∈ B =∑x∈B

ρ1(x1) · · · ρn(xn) ∀B ⊂ Cn

if and only if X1, . . . , Xn are independent.10.12? Let µ, ν be probability measures in R. Show that the convolution µ∗ν of µ and ν, definedby

µ ∗ ν(A) :=∫R

∫R

1A(x+ y) dµ(x) dν(y) A ∈ B(R)

is a probability measure in R. Show that ∗ is a commutative and associative product among theprobability measures in R, and find the identity element of this product. Show that∫

R

f(z)d(µ ∗ ν)(z) =∫R

∫R

f(x+ y) dµ(x) dν(y)

Page 165: Probability Book

Chapter10 161

for any Borel bounded function f : R→ R.10.13? Using Fubini–Tonelli theorem show that the law of the sum of two real independentvariables X, Y is given by µ ∗ ν, where µ is the law of X and ν is the law of Y .10.14 Let (Xn)n∈N be a Bernoulli process of parameter p ∈ (0, 1); let T (ω) := inf n ≥ 0 : Xn(ω) = 1be the first success (starting from n = 0); let Y := XT+1: compute PY = 1 and PY = 1 | Xn =1.10.15 Show that the geometric law is the only law with values in N such that the “no memory”property (10.12) holds. (See in Example 10.15 for more details).10.16 We recall that, given λ > 0, the exponential law with parameter λ is the probability law on[0,∞) defined by E (λ) = 1

λ e−t/λL 1.Let X be a real Borel random variable, positive almost surely. Let µ = X#P be the law, concen-trated in [0,∞). Show that the three following facts are equivalent:

a) µ = E (λ) for some λ > 0;

b) P(X > a+ b

∣∣X > a)

= P(X > b) ∀a, b ≥ 0;

c) there exists λ > 0 such that, for all t > 0, defining Qt(·) = P(·|X > t), we have

λ = EQt(X − t) :=

∫(X(ω)− t) dQt(ω).

The second condition above means that the exponential law has no memory (it is then the “con-tinuous equivalent” of the geometric law; see the previous exercise).

Page 166: Probability Book

162 Conditional probability and independence

Page 167: Probability Book

Chapter 11

Convergence of random variables

In this chapter we will study and compare various concepts of convergence for se-quences of random variables: the almost sure convergence, the Lp convergence, al-ready seen in the measure-theoretic part of this book, the convergence in probabilityand finally the convergence in law.

We are given a probability space (Ω,A ,P), a sequence of extended real randomvariables (Xn) and a random variable X on (Ω,A ,P). For p ∈ [1,∞), we recall that

limn→∞

Xn = X in Lp(Ω,A ,P)

if Xn, X ∈ Lp(Ω,A ,P) and

limn→∞

E (|X −Xn|p) = 0.

11.1 Convergence in probability

We say that a sequence (Xn) of extended real random variables converges in proba-bility to X if (with the convention (+∞)− (+∞) = 0, (−∞)− (−∞) = 0)

limn→∞

P(|X −Xn| > δ) = 0 ∀δ > 0. (11.1)

Condition (11.1) is equivalent to the following one

∀ε > 0, ∃ nε ∈ N such that P(|X −Xn| > ε) < ε, ∀ n ≥ nε. (11.2)

Indeed, assume that (11.2) holds and fix δ > 0. Then by (11.2) it follows that forany integer k ≥ 1 there exists nk ∈ N such that

n ≥ nk =⇒ P(|X −Xn| > 1/k) <1

k. (11.3)

163

Page 168: Probability Book

164 Convergence of random variables

Since |X −Xn| > 1/k ⊃ |X −Xn| > δ for k > 1/δ, by (11.3) it follows that

n ≥ maxnk,1

δ =⇒ P(|X −Xn| > δ) < 1

k.

Therefore P(|X −Xn| > δ) → 0 as n→∞. The converse implication is obvious,just taking δ = ε and using the definition of the limit.

The characterization (11.2) of convergence in probability suggests the introduc-tion of the following distance

d(X, Y ) = inf δ > 0 : P(|X − Y | > δ) < δ . (11.4)

The distance d is well defined, and obviously d(X, Y ) ≤ 1, because P(|X − Y | >δ) < δ for any δ > 1. Notice also that, by monotonicity

P(|X − Y | > δ) < δ ∀δ > d(X, Y ). (11.5)

We will prove that d induces a distance in the class X (Ω) of equivalence classes ofextended real random variables in (Ω,A ,P). Here the equivalence relation is theone induced by the almost sure coincidence, i.e.

X ∼ Y ⇐⇒ X = Y almost surely.

In the following we shall for simplicity identify (as we did in the first part of thebook) elements of X (Ω) with the corresponding equivalence classes, whenever thestatement does not depend on the choice of a particular representative.

Proposition 11.1 The space (X (Ω), d) is a metric space and the convergence in(X (Ω), d) coincides with the convergence in probability.

Proof. It is not hard to check that the function d defined in (11.4) is a distanceon equivalence classes: the symmetry property and the fact that d(X, Y ) = 0 if andonly if X = Y almost surely are straightforward, while the triangle inequality canbe proved by the following argument: if δ1 > d(X, Y ) and δ2 > d(Y, Z) then (11.5)gives

P(|X − Y | > δ1) < δ1 and P(|Y − Z| > δ2) < δ2

and the inclusion |X − Z| > δ1 + δ2 ⊂ |X − Y | > δ1 ∪ |Y − Z| > δ2 gives

P(|X − Z| > δ1 + δ2) < δ1 + δ2,

so that d(Y, Z) ≤ δ1 + δ2. Letting δ1 ↓ d(X, Y ), δ2 ↓ d(Y, Z) we obtain the triangleinequality.The definition of d implies that P(|Xn−X| > δ) ≤ P(|Xn−X| > ε) < ε as soon

Page 169: Probability Book

Chapter11 165

as d(Xn, X) < ε and ε < δ. Hence, convergence in (X (Ω), d) implies convergence inmeasure. Conversely, if Xn → X in measure and δ > 0, then P(|Xn−X| > δ) < δfor n large enough, hence d(Xn, X) ≤ δ for n large enough. This proves thatd(Xn, X) → 0.

In the following theorem we collect the relations between almost sure conver-gence, convergence in Lp and convergence in probability.

Theorem 11.2 Let (Xn), X be extended real random variables. Then:

(i) if (Xn) converges to X almost surely, then (Xn) converges to X in probability.Conversely, if (Xn) converges to X in probability, there exists a subsequence(Xn(k)) converging to X almost surely.

(ii) if Xn, X ∈ Lp(Ω,A , µ) for some p ∈ [1,∞), then convergence of (Xn) to Xin Lp(Ω,A , µ) implies convergence in probability. The converse implicationholds only if |Xn|p are P–uniformly integrable.

Proof. (i) If Xn → X almost surely, then 1|Xn−X|>δ → 0 almost surely for allδ > 0. Therefore its expectation, i.e. the probability of |Xn−X| > δ, tends to 0.Assume now that Xn → X in probability. For any integer k ≥ 1 we can find aninteger n(k) such that

P(|Xn(k) −X| > 1

k) < 2−k,

because P(|Xn − X| > 1/k) < 2−k for n large enough. For the same reason, wecan also choose recursively n(k) with the property above and in such a way thatn(k) > n(k − 1). For any δ > 0 we have |Xn(k) −X| > δ ⊂ |Xn(k) −X| > 1/kfor k ≥ 1/δ, hence

∞∑k=1

P(|Xn(k) −X| > δ) <∞ ∀δ > 0.

According to Borel-Cantelli Lemma 1.9, this implies that the lim supk|Xn(k)−X| >δ is P–negligible for any δ > 0. This means that Xn(k) → X almost surely.(ii) If Xn → X in Lp(Ω,A , µ), then Markov’s inequality (9.4) gives

P(|Xn −X| > δ) ≤ 1

δpE(|Xn −X|p)

and therefore Xn → X in probability. Conversely, assume that Xn → X almostsurely, and that |Xn|p are P-uniformly integrable. If, by contradiction, Xn do notconverge to X in Lp(Ω,A ,P), we can find ε > 0 and a subsequence n(k) such that

E(|Xn(k) −X|p) ≥ ε ∀k ∈ N.

Page 170: Probability Book

166 Convergence of random variables

But, on any subsequence Xn(k(l)) extracted from Xn(k) and converging almost surelyto X, we can apply Vitali’s convergence Theorem 2.18 to obtain that E(|Xn(k(l)) −X|p) → 0 as l→∞. This contradicts the previous inequality with k = k(l).

We have already seen in Remark 3.7 and Example 3.8 that the statements inTheorem 11.2 are optimal: there exist sequences converging in measure (even in anyLp with p < ∞) that are not converging almost surely, and there exist sequencesconverging almost surely but not in L1.

In the next remark we see how the concepts introduced so far can be adapted tothe more general case of random variables with values in a metric space (E, dE).

Remark 11.3 (Metric space-valued random variables) If (E, dE) is a metricspace and Xn, X are E-valued random variables (with the Borel σ–algebra in E),we say that Xn → X in probability if

limn→∞

P(dE(Xn, X) > δ) = 0 ∀δ > 0.

and we say that Xn → X almost surely if dE(Xn, X) → 0 almost surely. If (E, dE) iscomplete, all the results presented in the section hold in this more general frameworkwhich includes, as a particular case, vector-valued random variables.

11.2 Convergence in law

We say that a sequence of real random variables (Xn) converges in law to a realrandom variable X if the probability measures (Xn)#P in R converge weakly toX#P. Recall that, according to the very definition of convergence for laws on thereal line given in Definition 6.24, this means

limn→∞

(Xn)#P((−∞, x]) = X#P((−∞, x])

with at most countably many exceptions x, and so

limn→∞

P(Xn ≤ x) = P(X ≤ x) (11.6)

with at most countably many exceptions x. Moreover, from Theorem 6.27 we deducethat:

Theorem 11.4 (Xn) is convergent in law to X if and only if

limn→∞

∫R

ϕ(x) d(Xn)#P(x) =

∫R

ϕ(x)X#P(x) ∀ϕ ∈ Cb(R),

or, equivalently (by the change of variables formula),

limn→∞

E(ϕ(Xn)) = E(ϕ(X)) ∀ϕ ∈ Cb(R). (11.7)

Page 171: Probability Book

Chapter11 167

Convergence in law is a very weak convergence that does not take at all intoaccount the pointwise behaviour of Xn and X (for instance, if Xn → X in lawand Y has the same law of X, then obviously Xn → Y in law). In particular,choosing identically distributed X and Y , with X 6= Y almost surely, one sees thatconvergence of Xn to X in law can’t imply any form of almost sure convergence ofXn (or in probability). We are going to show that convergence in probability impliesconvergence in law and that the converse implication holds only when the previousconstruction is impossible, i.e. when the law of X is a Dirac mass.

Proposition 11.5 If Xn → X in probability then Xn → X in law.

Proof. Let ϕ ∈ Cb(R). If Xn → X almost surely, then ϕ(Xn) → ϕ(X) almost surelyand the dominated convergence theorem gives that (11.7) holds. In the general casewhen Xn → X in probability, assume by contradiction that E(ϕ(Xn)) does notconverge to E(ϕ(X)): then, we can find ε > 0 and a subsequence n(k) such that∣∣E(Xn(k))− E(X)

∣∣ ≥ ε ∀k ∈ N.

But, on any subsequenceXn(k(l)) extracted fromXn(k) converging almost surely toX,we already proved that E(Xn(k(l))) converges to E(X). This contradicts the previousinequality with k = k(l).

Proposition 11.6 Assume that (Xn) converges in law to a constant c. Then (Xn)converges to c in probability.

Proof. Obviously the law of the constant random variable c is δc, whose distributionfunction is

Fc(x) =

1 if x ≤ c0 if x > c.

It is enough to show that P(|Xn − X| > δ) → 0 for any δ > 0 such that theconvergence of the distribution functions of the laws ofXn to Fc occurs with x = c+δand x = c− δ. Since for any random variable Y we have

P(|Y − c| > δ) = P(Y > δ + c) + P(Y < −δ + c)≤ 1− P(Y ≤ δ + c) + P(Y ≤ −δ + c)

we obtain

limn→∞

P(|Xn − c| > δ) ≤ 1 + limn→∞

−P(Xn ≤ δ + c) + P(Xn ≤ −δ + c)= 1− 1 + 0 = 0.

Finally, recalling that weak convergence of probability measures can be character-

ized in terms of pointwise convergence of the Fourier transforms (see Theorem 6.37),we get:

Page 172: Probability Book

168 Convergence of random variables

Theorem 11.7 Let Xn, X be real random variables. Then Xn → X in law if andonly if Xn(ξ) → X(ξ) for all ξ ∈ R.

Page 173: Probability Book

Chapter11 169

EXERCISES

11.1 Show that Xn → X in probability if, and only if, E(min1, |Xn −X|) → 0 as n→∞.11.2 Show that convergence in probability of real random variables is invariant under compositionwith continuous maps, namely Xn → X in probability implies ϕ(Xn) → ϕ(X) in probabilitywhenever ϕ : R→ R is continuous.11.3 Suppose that the laws of the real random variables Xn are all supported inside a closedcountable set C ⊂ R. Show that Xn → X in probability if and only if limn P(Xn = x) =P(X = x) for all x ∈ C.11.4? Show that (X (Ω), d) is a complete metric space.

Page 174: Probability Book

170 Convergence of random variables

Page 175: Probability Book

Chapter 12

Sequences of independentvariables

In this chapter we study the asymptotic behaviour of sequences of independentvariables and of some random variables that can be generated from the sequences(for instance the arithmetic means, or the series). In Section 12.1 we start from thecase of sequences of independent events and discuss Kolmogorov’s dichotomy. InSection 12.2 we present some formulations the law of large numbers, dealing with theconvergence (almost sure or in Lp) of the arithmetic means. In Section 12.3 we brieflyillustrate the utility of these concepts: the convergence of Bernstein polynomials, theMonte Carlo method, the convergence of empirical distribution functions. Finally,in Section 12.4 we prove the central limit theorem, that characterizes the oscillationsof the arithmetic means around the expected value, after a suitable normalization.

12.1 Sequences of independent events

In this section we study the case of a sequence of characteristic functions, i.e. asequence of events. We will see that, in the case when the events are indepen-dent, a deterministic behaviour arises in the limit. This phenomenon is known asKolmogorov’s dichotomy. The results of this section will be used to study, moregenerally, independent sequences of random variables.

Lemma 12.1 (Borel–Cantelli) Let (An) be a sequence of independent events.Then

∞∑n=0

P(An) <∞ ⇐⇒ P(lim supn→∞

An) = 0.

171

Page 176: Probability Book

172 Sequences of independent variables

Proof. The proof of the implication ⇒ does not rely on independence, and itwas already mentioned in Chapter 1; let us recall its simple proof:

P

(lim sup

n→∞An

)= lim

p→∞P

(∞⋃

n=p

An

)≤ lim

p→∞

∞∑n=p

P(An) = 0.

Conversely, for any p ∈ N we have

P

(Ω \

∞⋃n=p

An

)= P

(∞⋂

n=p

(Ω \ An)

)=

∞∏n=p

(1− P(An)) .

Taking limits as p→∞ we get

limp→∞

∞∏n=p

(1− P(An)) = 1.

Choosing p0 such that the infinite product∏∞

p0(1− P(An)) is strictly positive, from

the inequality (1− x) ≤ exp(−x) we infer

exp

(−

∞∑n=p0

P(An)

)> 0.

Therefore∑∞

p0P(An) <∞ and the series

∑n P(An) converges.

From the Borel-Cantelli lemma we obtain the implication

∞∑n=0

P(An) = ∞ =⇒ P(lim supn→∞

An) > 0. (12.1)

Actually, using Kolmogorov’s dicothomy, we have the much stronger implication

∞∑n=0

P(An) = ∞ =⇒ P(lim supn→∞

An) = 1. (12.2)

In order to justify (12.2) and state the Kolmogorov’s dichotomy we need to statesome preliminary definitions. Given a sequence of σ-algebras An ⊂ A , we denoteby A∞, the terminal σ–algebra of the family: it is defined by ∩nBn, where Bn ⊂ Ais the σ–algebra generated by

∞⋃i=n

Ai.

Page 177: Probability Book

Chapter12 173

For instance, if (Xn) is an independent sequence of random variables and An arethe σ-algebras generated by Xn, then events of the form

lim supn→∞

An, An ∈ An; A :=

ω ∈ Ω :

∞∑n=0

Xn(ω) converges

(12.3)

are terminal, while events of the formω ∈ A :

∞∑n=0

Xn(ω) > 1

typically are not.

Kolmogorov’s dichotomy is the first example of an important phenomenon: theappearance of a deterministic behaviour from the superposition of many randomevents. Similar phenomena will be investigated in the law of large numbers and inergodic theorems.

Theorem 12.2 (Kolmogorov’s dichotomy) Let (An) be an independent sequenceof σ–algebras contained in A and let A∞ be the terminal σ–algebra of the sequence.Then

P(A) ∈ 0, 1 ∀A ∈ A∞.

In particular any random A∞–measurable variable is P–equivalent to a constant.

Proof. We denote by A ′n the collection of finite intersections of sets in ∪n

0Ai

and with A ′′n the collection of finite intersections of sets in ∪∞n+1Ai. For A ∈ A ′

n andB ∈ A ′′

n we have, thanks to the independence assumption,

P(A ∩B) = P(A)P(B).

Keeping A ∈ A ′n fixed, the class of sets B ∈ A satisfying the identity above is a

Dynkin class containing the class A ′′n stable under finite intersections, hence Theo-

rem 1.14 gives that this class contains the σ–algebra generated by A ′′n , i.e. Bn+1.

This proves that

P(A ∩B) = P(A)P(B) ∀A ∈ A ′n, B ∈ Bn+1.

Keeping now B ∈ A∞ fixed,

P(A ∩B) = P(A)P(B) ∀A ∈ A ′′n ,

because any A ∈ A ′′n belongs to A ′

m for m sufficiently large, and B ∈ Bm+1.Therefore a symmetric argument based on Theorem 1.14 allows to conclude thatthe equality holds for all A ∈ Bn and, in particular, for A ∈ A∞. We proved that

P(A ∩B) = P(A)P(B) ∀A, B ∈ A∞.

Page 178: Probability Book

174 Sequences of independent variables

For A ∈ A∞ and B = A we get P(A) = P2(A), hence either P(A) = 0 or P(A) = 1.Finally, let X be a real and A∞–measurable random variable. The function

f(t) := P(X ≤ t) takes its values in 0, 1, is nondecreasing and f(−∞) = 0,f(+∞) = 1. Therefore there exists t ∈ R such that f(t) = 1 for t > t and f(t) = 0for t < t. As a consequence X = t almost surely.

Example 12.3 Let Ω = −1, 1N, endowed with the canonical product σ–algebra

and of the product measure P = ×i(δ−1 + δ1)/2. Let us consider the randomvariables

Xn(ω) :=ωn

n+ 1.

Then, Kolmogorov’s dichotomy tells us that∑

nXn is either convergent almostsurely, or not convergent almost surely, because the event A in (12.3) belongs tothe terminal σ–algebra. With more sophisticated tools one can show that actuallyP(A) = 1.

12.2 The law of large numbers

As we already remarked, the law of large numbers provides a posteriori a justificationof our intuition of probability as an asymptotic frequency. It shows that the means

Un :=X1 + · · ·+Xn

n(12.4)

built from a sequence (Xn) of independent and identically distributed random vari-ables converge to the (common) expectation E(Xn). The convergence can of courseoccur in several ways (almost sure, in probability, in law, in Lp). Typically one saysthat the law of large numbers is strong if the convergence of the means is an almostsure one.

Theorem 12.4 (Law of large numbers) Let p ∈ [1,∞), let (Xn) be a sequenceof identically distributed random variables in Lp, let Un be given by (12.4) and let µbe the expected value of the Xn. Then

(a) If Xn are pairwise independent, then Un → µ in Lp, and also almost surely ifp ≥ 2.

(b) If (Xn) is independent, then Un → µ almost surely.

Proof. Statement (b) will be proved later on in a more general context, the lawof large numbers for stationary sequences. Here we just prove statement (a).

Page 179: Probability Book

Chapter12 175

Replacing if necessary Xn with Xn − E(Xn) (recall that, thanks to (10.6) thepairwise independence assumption is preserved), we can assume with no loss ofgenerality thatXn are centered. We assume first that p = 2 and show the almost sureconvergence of Un to 0. Setting σ2 = σ2(X1), because of the pairwise independenceof the Xi, we have

E(U2n) =

1

n2

n∑i, j=1

E(XiXj) =1

n2

n∑i=1

E2(Xi) =σ2

n.

Setting n(k) = k2, we have

E(∞∑

k=1

U2k2) =

∞∑k=1

E(U2k2) =

∞∑k=1

σ2

k2<∞

hence∑

k U2k2 <∞ almost surely. This proves that Uk2 → 0 almost surely.

It remains to show that (Un) tends to 0 almost surely. We denote by k(n) theinteger such that k2(n) ≤ n < (k(n) + 1)2 and notice that k2(n)/n→ 1 as n→∞.Using the pairwise independence assumption again we get

∞∑n=1

E

(∣∣∣∣Un −k2(n)

nUk2(n)

∣∣∣∣2)

=∞∑

n=1

1

n2E

∣∣∣∣∣∣n∑

i=k2(n)+1

Xi

∣∣∣∣∣∣2

=∞∑

n=1

(n− k2(n))σ2

n2≤ σ2

∞∑n=1

2k(n) + 1

n2<∞

because k(n) ≤√n. We conclude that Un − k2(n)

nUk2(n) tends to 0 almost surely.

Since Uk2(n) → 0 almost surely the convergence of Un is proved.It remains to show that, in the case when Xi ∈ Lp, Un → 0 in Lp. Fix ε > 0,

let k be such that∫|X1|>k |X1|p dP < (ε/4)p and write Xn = X ′

n + X ′′n, where

X ′n = k∧Xn∨−k. Notice that (X ′

n) are still pairwise independent (thanks to (10.6))and identically distributed, and that |X ′

n| ≤ k. Therefore, we can apply to X ′n the

strong law of large numbers to obtain, thanks to the dominated convergence theorem,that the arithmetic means U ′

n of X ′n converge to E(X ′

1) in Lp. As a consequence,there exists n0 ∈ N such that ‖U ′

n − E(X ′1)‖p ≤ ε/2 per n ≥ n0.

By our choice of k, we have ‖X ′′n‖p < ε/4. For n ≥ n0, denoting by U ′′

n thearithmetic means of X ′′

n and using the fact that E(X ′1) + E(X ′′

1 ) = E(X1) = 0 wehave

‖Un‖p ≤ ‖U ′n − E(X ′

1)‖p + ‖U ′′n − E(X ′′

1 )‖p ≤ε

2+ 2 sup

n‖X ′′

n‖p <ε

2+

4= ε.

Page 180: Probability Book

176 Sequences of independent variables

This proves that Un → 0 in Lp.

In the particular case of independent sequences Xn = 1An , all having Bernoullilaw with parameter µ = P(An), the law of large numbers becomes

µ = limn→∞

Un = limn→∞

card (i ∈ [1, n] : Xi = 1)n

almost surely.

This result is consistent with the interpretation of probability as the limit of theratio between the number of succesful events and the total number of events.

Example 12.5 (Probabilistic analysis of a Bernoulli game) Let Ω = a, bNendowed with the product σ–algebra and the product measure P =×i(pδa + qδb),with p+ q = 1. If a > 0 > b and K > 0, the quantity

Kn(ω) := K + Sn(ω) = K +n∑

i=1

ωi−1

represents the capital after n games of a player, having probability p of winning thesum a and probability q of losing the sum −b; of course K stands for the initialcapital. Let us define the random variables

S− := lim infn→∞

Sn, S+ := lim supn→∞

Sn.

If ap 6= −bq (unfair game) we have either S+ = S− = +∞ almost surely, if ap > −bq,or S+ = S− = −∞ almost surely if ap < −bq; this statement follows directly fromthe strong law of large numbers, because Sn/n converges almost surely to ap+ bq.

On the other hand, if ap = −bq (fair game) then S± = ±∞ almost surely. Inorder to prove this statement we set σ2 = σ2(pδa + qδb) and we use this property (aconsequence of the central limit theorem, that we will see later on)

limn→∞

P(Sn ≤ σ

√nt)

=1√2π

∫ t

−∞e−s2/2 ds ∀t ∈ R (12.5)

to infer

P(S+ = +∞) ≥ P

(lim sup

n→∞Sn > σ

√n)

≥ lim supn→∞

P(Sn > σ

√n)

=1√2π

∫ +∞

1

e−s2/2 ds > 0.

Being S+ = +∞ an event of the terminal σ–algebra we must conclude, thanksto Kolmogorov’s dicothomy, that S+ = +∞ almost surely. The argument for S− isanalogous.

Page 181: Probability Book

Chapter12 177

Coming back to Kn, we obtain that, even in the case of a fair game, almostsurely there exists n such that Kn < 0, whatever K is. Moreover, even assumingthat the bank provides an unbounded credit to the player (the more realistic casewith a bounded credit will be considered later on), the quantity Kn has almostsurely strong oscillations as n→∞.

Remark 12.6 (Rate of convergence in the law of large numbers) The law oflarge numbers gives us informations on the asymptotic behaviour of the arithmeticmeans Un of independent and identically distributed variables Xi, ensuring the al-most sure and L1 convergence to µ = E(Xi). On the other hand, from the practicalpoint of view, we would like to know how large n should be to reach a sufficientlygood degree of approximation: in other words, we wish to have a rate of convergence.The (mean) rate of convergence can be guessed recalling that σ2(Un) = σ2/n, withσ2 = σ2(Xi). Markov’s inequality then gives

P (|Un − µ| > t) ≤ σ2

nt2∀t > 0. (12.6)

Choosing α ∈ (0, 1] and defining t in such a way that σ2/(nt2) = α we obtain thatwith probability 1− α the values of Un belong to the interval

Iα :=

[µ− σ√

nα, µ+

σ√nα

].

Symmetrically, we can say that with probability 1−α the (unknown) value µ belongsto the (known) interval [Un−σ/

√nα, Un +σ/

√nα]. More precise informations, not

only on the size of Un−µ, but also on their asymptotic distribution, will come fromthe central limit theorem. The number 1 − α represents the confidence level andmeasures, as we have seen, the probability of a correct estimation of µ with Un.As it is natural, we must look for larger and larger integers n if we wish to have aconfidence level close to 1.

Page 182: Probability Book

178 Sequences of independent variables

12.3 Some applications

12.3.1 Density of Bernstein polynomials

We give a “probabilistic” proof, due to Bernstein, of the density of polynomials inC([0, 1]). Given f ∈ C([0, 1]), we will prove that the polynomials (called indeedBernstein polynomials)

Pn,f (x) :=n∑

i=0

(ni

)xi(1− x)n−if

(i

n

)x ∈ [0, 1]

uniformly converge to f as n→∞.

Let us preliminarly remark that the characteristic function 1A of an event A withexpectation p = P(A) has variance σ2 = p− p2 ≤ 1. Let 11, . . . , 1n be independentcharacteristic functions of events with probability p and let us consider the randomvariable

Sn :=11 + · · ·+ 1n

n.

Then E(Sn) = p and, by the independence assumption, we have (see (10.11))

σ2(Sn) =σ2(11) + · · ·+ σ2(1n)

n2≤ 1

n.

Still by the independence assumption, we have seen that the law of nSn is a binomiallaw with parameters n and p, therefore the law µn of Sn is

µn =n∑

i=0

(ni

)pi(1− p)n−iδ i

n

.

Notice that, by the law of large numbers, Sn converge almost surely to p and thereforein law. It follows that µn → δp weakly, and therefore

f(p) =

∫ 1

0

f dδp = limn→∞

∫ 1

0

f dµn

= limn→∞

n∑i=0

(ni

)pi(1− p)n−if

(i

n

)= lim

n→∞Pn,f (p)

for any continuous function f : [0, 1] → R.

A closer analysis reveals that the convergence is uniform in p. Indeed, let C =sup |f | and, for given ε > 0, let δ > 0 be given by the absolute continuity of f

Page 183: Probability Book

Chapter12 179

(i.e. |f(a)− f(b)| < ε whenever |a− b| < δ) and let n ≥ 1 be an integer such that2C ≤ nεδ2; then we can estimate f(p)− Pn,f (p) as follows

|f(p)− Pn,f (p)| = |E(f(p)− f(Sn))| ≤ E(|f(p)− f(Sn)|)≤ ε+ 2CP(|p− Sn| ≥ δ),

where we have written the domain of integration as the disjoint union of |p−Sn| <δ and |p−Sn| ≥ δ. Finally, using the Markov inequality and the estimate on thevariance of Sn we get

|f(p)− Pn,f (p)| ≤ ε+2C

δ2σ2(Sn) ≤ ε+

2C

δ2n≤ 2ε.

Notice that the underlying measure space (Ω,A ,P) played no explicit role inthe proof (this quite typical of many, but not all, probabilistic arguments). Itsuffices to know that, given p ∈ [0, 1] and an integer n, there exist n independentevents with probability p. The simplest choice corresponds to the measure spacein Example 9.2(2): in this case the characteristic functions are simply ω 7→ ωi,i = 1, . . . , n.

12.3.2 The Monte Carlo method

Let f be an integrable function in I = [0, 1], with respect to L 1. We illustrate here

a “probabilistic” method for the computation of the integral∫ 1

0f(x) dx. Let (Xn)

be an independent sequence of random variables, all having as law the uniform lawin I. Then, since f(Xi) are independent and identically distributed, the law of largenumbers gives

limn→∞

1

n

n∑i=1

f(Xi) = E (f(X1)) =

∫ 1

0

f(x) dx

almost surely and in L1. Hence, if we are able to generate this sequence (or, better,a good approximation of this sequence) on a computer with a random numbergenerator, then we can expect that the means above provide a good approximationof∫ 1

0f(x) dx. If f is square integrable, by Markov inequality we have a probabilistic

estimate of the error:

P

(∣∣∣∣∣∫ 1

0

f(x) dx− 1

n

n∑i=1

f(Xi)

∣∣∣∣∣ > t

)≤ σ2(f(X))

nt2≤∫ 1

0f 2 dx

nt2.

The Monte Carlo method can be extended with minor difficulties to the computation of d-dimensional integrals on cubes C = [0, 1]^d, and is particularly useful when d is large. In this case, for any integrable function f, we have

∫_C f(x) dx = lim_{n→∞} (1/n) ∑_{i=1}^n f(X_i)   almost surely and in L¹,

where (X_n) is an independent and identically distributed sequence of random variables with values in C, all having a uniform law (i.e. 1_C L^d).
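As a minimal illustration (a Python sketch assuming NumPy; the integrands are our choices), the following code estimates a one-dimensional and a five-dimensional integral with the same number of samples; by the estimate above, the typical error is of order σ(f(X_1))/√n, independently of the dimension d.

import numpy as np

rng = np.random.default_rng(0)

def monte_carlo(f, n, d=1):
    # (1/n) sum_i f(X_i) with X_i independent and uniform in [0,1]^d
    x = rng.random((n, d))
    return f(x).mean()

# integral of x^2 over [0,1] is 1/3
print(monte_carlo(lambda x: x[:, 0]**2, n=100_000))

# integral of exp(-|x|^2) over the cube [0,1]^5
print(monte_carlo(lambda x: np.exp(-(x**2).sum(axis=1)), n=100_000, d=5))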

12.3.3 Empirical distribution

Assume that we need to compute empirically the law µ of a random phenomenon. In many real situations one has at disposal an independent sequence of random variables (X_i) (the outcome of a sequence of experiments), all identically distributed with law µ. A canonical procedure to estimate the distribution function F of µ is to define the empirical distribution function

F_n(x, ω) := (1/n) ∑_{i=1}^n 1_{X_i(ω)≤x}   ∀x ∈ R.

Notice that, for x fixed, F_n(x, ω) is itself a random variable. For ω fixed, instead, F_n(x, ω) is the repartition function of a law made by the sum of finitely many Dirac masses (concentrated at X_i(ω), 1 ≤ i ≤ n), so its graph has the typical form of a histogram.

The law of large numbers, applied to the independent and identically distributed sequence (1_{X_n≤x}), implies

lim_{n→∞} F_n(x, ω) = E(1_{X_1≤x}) = F(x)   almost surely and in L¹

for every x ∈ R. Therefore the empirical distribution function approximates the distribution function of µ. Taking into account the identities

E(F_n(x, ·)) = F(x),   σ²(F_n(x, ·)) = σ²(x)/n,

where σ²(x) = F(x)(1 − F(x)) is the variance of 1_{X_n≤x}, and using (12.6) we can give a more precise estimate, but necessarily of a probabilistic type, of the error made at the n-th step:

P(|F_n(x, ·) − F(x)| > t) ≤ σ²(x)/(nt²) ≤ 1/(4nt²)   ∀t > 0, ∀x ∈ R.


12.4 The central limit theorem

Let us consider a sequence (X_n) of independent, identically distributed and square-integrable random variables X_i. Denoting by U_n the arithmetic means of the X_n, we know that the expected value of U_n is µ, with µ = E(X_i), and the standard deviation of U_n is equal to σ/√n, where σ = σ(X_i). Therefore, in order to know not only the mean size of the deviations of U_n from µ, of order √(1/n), but also their (asymptotic) distribution, it is natural to rescale U_n − µ by the factor √n/σ. The central limit theorem shows that, surprisingly, whatever the law of X_n is, these rescaled variables asymptotically display a Gaussian distribution N(0, 1).

Before the statement and the proof of the central limit theorem we state a simple Calculus lemma.

Lemma 12.7 Let ϕ : R → R be such that ϕ(0) = 1, ϕ′(0) = 0 and ϕ″(0) = M. Then

lim_{t→0} ϕ(t)^{1/t²} = e^{M/2}.

Proof. Taking logarithms and making a second-order Taylor expansion we get

lim_{t→0} ln(ϕ(t))/t² = lim_{t→0} ln(1 + Mt²/2 + o(t²))/t² = M/2,

because log(1 + z) = z + o(z) as z → 0.
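For instance, ϕ(t) = cos t satisfies the assumptions with M = −1, and a quick computation (a Python sketch, not part of the text) confirms the limit e^{−1/2}:

import math

for t in (0.5, 0.1, 0.01, 0.001):
    print(t, math.cos(t) ** (1 / t**2))   # phi(t)^(1/t^2) with phi = cos
print(math.exp(-0.5))                     # the predicted limit e^{-1/2}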

Theorem 12.8 (Central limit theorem) Let (X_n) be an independent sequence of identically distributed and square-integrable random variables. Setting µ = E(X_i), σ² = σ²(X_i), let the arithmetic means U_n of the X_n and their normalization Y_n be defined by

U_n := (1/n) ∑_{i=1}^n X_i,   Y_n := (U_n − µ)/(σ/√n).

Then the laws of Y_n weakly converge to the normal law N(0, 1).

Proof. Let µ_n be the laws of Y_n and let ϕ_n = µ̂_n be the Fourier transforms of the µ_n. By Lévy's Theorem 6.37, it suffices to show that ϕ_n(ξ) pointwise converge to ϕ_∞(ξ) = e^{−ξ²/2}, the Fourier transform of the normal law N(0, 1) (Exercise 6.23).

Let us denote by ν the law of X_i − µ and by ϕ(ξ) the Fourier transform of ν. We have then ϕ(0) = 1 and (by (6.36) and the change of variable rule for the image measure)

ϕ′(0) = −i ∫_R t dν(t) = −i E(X_i − µ) = 0,

ϕ″(0) = (−i)² ∫_R t² dν(t) = −E((X_i − µ)²) = −σ².

Since Y_n = ∑_{i=1}^n (X_i − µ)/(√n σ), using Proposition 10.13 (and the fact that each X_i − µ has law ν) we get

ϕ_n(ξ) = ∏_{i=1}^n ϕ(ξ/(√n σ)) = [ϕ(ξ/(√n σ))]^n.

By Lemma 12.7 with M = −σ² we get

lim_{n→∞} ϕ_n(ξ) = lim_{n→∞} [ϕ(ξ/(√n σ))]^n = lim_{t→0} ϕ(t)^{ξ²/(t²σ²)} = e^{−ξ²/2}.

Let F_n(t) = P(Y_n ≤ t) be the repartition function of the law of Y_n and let G(t) be the repartition function of N(0, 1). Then, the central limit theorem tells us that F_n(t) → G(t) for all t ∈ R, with at most countably many exceptions. But using the fact that G is continuous (because N(0, 1) has no atom) we can actually prove that F_n → G uniformly in R. This follows by the next lemma.

Lemma 12.9 Let µ_n, µ be laws in R, let F_n, F be the corresponding repartition functions and assume that µ_n → µ weakly. If F is a continuous function, then

lim_{n→∞} sup_{t∈R} |F_n(t) − F(t)| = 0.  (12.7)

Proof. Given ε > 0, let t_−, t_+ ∈ R be such that F(t_−) < ε and F(t_+) > 1 − ε. Since F is continuous in [t_− − 1, t_+ + 1] there exists δ > 0 such that

|F(s) − F(t)| < ε   for any s, t ∈ [t_− − 1, t_+ + 1] such that |s − t| < δ.

Let t_1, ..., t_p be real numbers at which the repartition functions converge, such that 0 < t_{i+1} − t_i ≤ δ, t_− − 1 ≤ t_1 ≤ t_− and t_+ + 1 ≥ t_p ≥ t_+. There exists an integer n_0 such that |F_n(t_i) − F(t_i)| < ε for i = 1, ..., p and n ≥ n_0.

For n ≥ n_0 and t ≥ t_p we have

F_n(t) − F(t) ≥ F_n(t_p) − 1 ≥ F(t_p) − ε − 1 ≥ −2ε,   F_n(t) − F(t) ≤ 1 − F(t_p) ≤ ε.

Analogously, |F_n(t) − F(t)| ≤ 2ε if t ≤ t_1. For t ∈ [t_1, t_p], choosing i in such a way that t ∈ [t_i, t_{i+1}], we get

F_n(t) − F(t) ≥ F_n(t_i) − F(t_{i+1}) ≥ −ε − ε,   F_n(t) − F(t) ≤ F_n(t_{i+1}) − F(t_i) ≤ ε + ε.

Since ε is arbitrary, (12.7) follows.


Using the central limit theorem and Lemma 12.9 we can now show (12.5): indeed, we have (recall that µ = 0 in this case)

lim_{n→∞} P(S_n ≤ σ√n t) = lim_{n→∞} P(Y_n ≤ t) = (1/√(2π)) ∫_{−∞}^t e^{−s²/2} ds

for any t ∈ R.

The Berry–Esseen theorem quantifies the speed of convergence of the repartition functions in the central limit theorem, under a slightly stronger integrability assumption on the X_i.

Theorem 12.10 (Berry–Esseen) Let (X_n) be an independent sequence of identically distributed random variables with E(|X_n|³) < ∞. Defining Y_n as in Theorem 12.8 we have

sup_{t∈R} |F_n(t) − F(t)| ≤ E(|X_i|³)/(σ³√n)   ∀n ≥ 1,

where F_n(t) is the repartition function of the law of Y_n and F(t) is the repartition function of the normal law N(0, 1).


EXERCISES

12.1 Show that the law of large numbers holds also in the following form: assume that the X_n are pairwise independent, σ²(X_n) ≤ C < ∞ with C independent of n, and E(X_n) → µ ∈ R̄. Then U_n → µ almost surely and, if µ ∈ R, we have U_n → µ in L².

12.2 Using the Berry–Esseen Theorem, improve (12.5), showing that

lim sup_{n→∞} √n P(|S_n| > M) ≤ C   ∀M > 0

with C depending only on the law of X_i.


Chapter 13

Stationary sequences and elements of ergodic theory

Let T : Ω → Ω be a map, and let us define the iterates {T^(n)}_{n∈N} of T by

T^(0) = Id,   T^(n+1) = T ∘ T^(n) = T^(n) ∘ T.

Given any ω ∈ Ω, we call {T^(n)(ω)} the orbit generated by T starting from ω, and the collection of all such orbits the discrete dynamical system in Ω induced by T. Here “discrete” refers to the fact that we may think of n as a discrete time parameter (dynamical systems with a continuous time typically involve ordinary differential equations).

The aim of ergodic theory is to study, mostly with probabilistic tools, the behaviour of the orbits in situations when either an explicit computation of them is not possible, or it provides very little information.

Some typical examples of dynamical systems are the arithmetic progressions ω + nα, with α ∈ R fixed, or the geometric progressions m^n ω, with m ∈ R, respectively induced by the maps T(ω) = ω + α and T(ω) = mω. The induced dynamical systems are somehow trivial in R, but much less trivial if we think of these arithmetic operations modulo 2π, thus considering the dynamics on the circle S¹ = R/(2π) (in this case we have to require m to be an integer, in order to have a well defined geometric progression on the circle).

13.1 Stationary sequences and law of large numbers

In this section we consider some possible extensions of the law of large numbers, in which the independence condition is replaced by a much weaker one, namely stationarity.

Definition 13.1 (Stationary sequences) Let (X_n) be a sequence of real-valued random variables. We say that (X_n) is stationary if, when seen as maps with values in R^N endowed with the product σ–algebra, the maps (X_n) and (X_{n+1}) are identically distributed.

As the σ–algebra of R^N is generated by cylindrical sets we can rewrite the stationarity condition as follows:

P(∩_{i=0}^n {X_i ∈ A_i}) = P(∩_{i=0}^n {X_{i+1} ∈ A_i}),  (13.1)

where the condition above has to be fulfilled for any choice of A_0, ..., A_n in the σ–algebra of Borel sets of R. This formula yields immediately that X_n and X_{n+1} are identically distributed for any n ∈ N. By transitivity we obtain that all variables X_n are identically distributed, for any stationary sequence (X_n). Moreover, still by transitivity one obtains that the sequences (X_n) and (X_{n+k}) (still thought of as R^N–valued random maps) are identically distributed for any k ∈ N.

It is easy to check that an independent sequence (X_n) is stationary if and only if all variables X_n are identically distributed: we already proved that one implication holds without any independence assumption, and the other one (now under the independence assumption) simply follows by the fact that the laws of (X_n) and (X_{n+1}) are respectively given by

×_{i=0}^∞ µ_i,   ×_{i=0}^∞ µ_{i+1},

where µ_i are the laws of X_i. Therefore the laws of the two sequences coincide if and only if µ_i = µ_{i+1} for all integers i ≥ 0.

On the other hand, we will see that ergodic theory provides many natural examples of stationary sequences that are not independent.

Lemma 13.2 (Maximal lemma for stationary sequences) Let (X_n) be a stationary sequence, and let Y_n = X_0 + ⋯ + X_n. Setting Λ = ∪_n {Y_n > 0}, we have ∫_Λ X_0 dP ≥ 0.

Proof. We denote by M_n the random variable max_{0≤k≤n} (X_0 + ⋯ + X_k)^+ and by N_n the shifted random variable

max_{0≤k≤n} (X_1 + ⋯ + X_{k+1})^+.

Thanks to the stationarity assumption the variables M_n and N_n are nonnegative and identically distributed, so that they have the same expectation. Notice also that Λ is the monotone limit of the family of sets Λ_n = {M_n > 0}, therefore it suffices to show that ∫_{Λ_n} X_0 dP ≥ 0 for any n ≥ 0. Let us prove that

X_0 + N_n ≥ max_{0≤k≤n} Y_k = M_n   on Λ_n.  (13.2)

Indeed, if M_n(ω) > 0 then we can find k ∈ [0, n] such that Y_k(ω) = M_n(ω), and estimate

Y_k(ω) = X_0(ω) + ⋯ + X_k(ω) ≤ X_0(ω) + (X_1(ω) + ⋯ + X_k(ω))^+ ≤ X_0(ω) + N_n(ω).

From (13.2) we get

∫_{Λ_n} X_0 dP ≥ ∫_{Λ_n} M_n dP − ∫_{Λ_n} N_n dP ≥ E(M_n) − E(N_n) = 0

because M_n is zero outside of Λ_n and N_n ≥ 0.

The law of large numbers still holds for stationary sequences, in the following form.

Theorem 13.3 (Law of large numbers for stationary sequences) Let p ∈ [1, ∞) and let (X_n) be a stationary sequence of random variables in L^p. Then the arithmetic means

U_n := (X_0 + ⋯ + X_{n−1})/n

converge almost surely and in L^p.

Proof. Being the X_i identically distributed, we have E(|U_n|) ≤ E(|X_0|); as a consequence, Fatou's lemma gives

E(lim inf_{n→∞} |U_n|) ≤ lim inf_{n→∞} E(|U_n|) < ∞,

and therefore lim inf_n |U_n| is finite almost surely. It is easily seen that the set of points where (U_n) is not pointwise convergent is the countable union of the events

S_{ab} := {lim inf_{n→∞} U_n < a, lim sup_{n→∞} U_n > b},   a < b, a, b ∈ Q.

We need only to show that all these events have null probability. Let a, b ∈ Q with a < b, set S = S_{ab} and define

X′_n := (X_n − b) 1_S,   U′_n := (X′_0 + X′_1 + ⋯ + X′_{n−1})/n.


Using Exercise 9.6 we can check that (X′_n) is still stationary: indeed, for any bounded function f : R^N → R, measurable with respect to the product σ–algebra, we have

E(f((X′_{n+1}))) = E(f(((X_{n+1} − b) 1_S((X_{n+1}))))) = E(f(((X_n − b) 1_S((X_n))))) = E(f((X′_n))).

By the arbitrariness of f the variables (X′_n) and (X′_{n+1}) are identically distributed. Setting Λ′ = ∪_n {U′_n > 0}, we have Λ′ ⊂ S because the X′_n vanish on the complement of S, but also S ⊂ Λ′, because at any point in S we have U′_n = U_n − b > 0 for infinitely many n. Therefore S = Λ′ and the maximal lemma gives

∫_S X_0 dP − b P(S) = ∫_S X′_0 dP ≥ 0.

By an analogous argument, based on the random variables X″_n = (a − X_n) 1_S, we obtain that a P(S) − ∫_S X_0 dP ≥ 0. Hence (a − b) P(S) ≥ 0 and we obtain P(S) = 0.

This proves the almost sure convergence of the U_n, and it remains to show their convergence in L^p, when X_i ∈ L^p. Here we argue as in the proof of Theorem 12.4: fix ε > 0, let k be such that ∫_{{|X_0|>k}} |X_0|^p dP < (ε/4)^p and write X_n = X′_n + X″_n, where X′_n = (X_n ∧ k) ∨ (−k). Notice that (X′_n) is still stationary, and that |X′_n| ≤ k. Therefore, we can apply to the U′_n the dominated convergence theorem to obtain that the arithmetic means U′_n of the X′_n converge in L^p. As a consequence, there exists n_0 ∈ N such that ‖U′_n − U′_m‖_p ≤ ε/2 for n ≥ m ≥ n_0. On the other hand, denoting by U″_n the arithmetic means of the X″_n, we have (using the convexity of the p-th power)

E(|U_n − U′_n|^p) = E(|U″_n|^p) ≤ (1/n) ∑_{i=0}^{n−1} E(|X″_i|^p) < ε^p/4^p.

Therefore, the triangle inequality gives

‖U_n − U_m‖_p ≤ ‖U_n − U′_n‖_p + ‖U′_n − U′_m‖_p + ‖U′_m − U_m‖_p < ε/4 + ε/2 + ε/4 = ε

for m ≥ n ≥ n_0. As ε is arbitrary, this proves that (U_n) is a Cauchy sequence, and convergence follows by the completeness of L^p spaces.

13.2 Measure-preserving transformations and ergodic theorems

A measure-preserving transformation is a (A, A)–measurable map T : Ω → Ω such that T_#P = P, i.e. such that P(T^{−1}(A)) = P(A) for all A ∈ A.

Measure-preserving transformations T provide in a natural way stationary sequences as follows: given any (initial) real-valued random variable X = X_0, one defines X_n := X ∘ T^(n), where T^(n) is the n-th iterate of T. The stationarity of (X_n) is a simple consequence of (13.1), because

P(∩_{i=0}^n {X_i ∈ A_i}) = P(∩_{i=0}^n T^{−1}({X_i ∈ A_i})) = P(∩_{i=0}^n {X_{i+1} ∈ A_i}).

The converse is true as well: if the σ–algebra generated by the process (X_n) (i.e. the σ–algebra generated by T) coincides with A, and the process (X_n) is stationary, then T is measure-preserving.

For this class of stationary sequences, the law of large numbers is better knownas ergodic theorem. Here we consider only a particular case of the ergodic theorem,corresponding to the case of ergodic measure-preserving maps.

Before stating a precise result, we need some definitions. A random variable X is said to be T-invariant if X ∘ T = X almost surely; analogously, an event A is said to be T-invariant if 1_A is T-invariant, namely if

ω ∈ A ⟺ ω ∈ T^{−1}(A)   almost surely.

It is immediate to check that the class of T-invariant sets is a complete σ–algebra (see Exercise 13.2), and that the σ–algebra generated by a T-invariant random variable is made by T-invariant sets.

Definition 13.4 (Ergodic maps) A (A, A)–measurable map T : Ω → Ω is said to be ergodic if T is measure-preserving and if any T-invariant event A has either probability 0 or probability 1.

Trajectories associated to ergodic maps have the following remarkable property: no matter how small P(A) is, if P(A) > 0 then the orbits starting from ω almost surely hit A infinitely many times; see Exercise 13.4 for the simple proof of this fact. So, in some sense, ergodic dynamics mix the elements of Ω as much as possible.

A more precise, but less elementary, result is provided by the ergodic theorem, which provides a probabilistic description of the behaviour of the discrete orbits {T^(n)(ω)}_{n∈N} of an ergodic map T. According to this result, we know not only that almost surely any set A with P(A) > 0 is visited infinitely many times, but also that the asymptotic frequency of visits is P(A). Precisely we have

P(A) = lim_{n→∞} (1/n) card({i ∈ [0, n−1] : T^(i)(ω) ∈ A})   almost surely  (13.3)

for any event A ⊂ Ω. The behaviour of the sequence nα mod(2π) fits into this general picture: it is not only dense in [0, 2π], but also asymptotically uniformly distributed in [0, 2π], provided α/(2π) ∉ Q.


Theorem 13.5 (Ergodic theorem) Let (Ω, A, P) be a probability space and let T : Ω → Ω be an ergodic map. Then, for any random variable X ∈ L^p, p ∈ [1, ∞), the arithmetic means

(X + X ∘ T + ⋯ + X ∘ T^(n−1))/n  (13.4)

converge almost surely and in L^p to E(X).

Proof. Setting X_n = X ∘ T^(n) and denoting by U_n their arithmetic means, we already checked that (X_n) is stationary, hence Theorem 13.3 tells us that the U_n converge almost surely and in L^p.

Denoting by L the limit of the U_n, we have E(L) = E(U_n) = E(X), taking into account the fact that the X_n are identically distributed and U_n → L in L^p. Notice also that L is T-invariant: it suffices to pass to the limit as n → ∞ in the relation

(n/(n+1)) U_n ∘ T = U_{n+1} − X/(n+1)

to obtain that L ∘ T = L almost surely. We have then a random variable L which is measurable with respect to a σ–algebra (the one of T-invariant sets) whose elements have, according to the ergodicity of T, either probability 0 or probability 1 (these σ–algebras are called degenerate). Then, the same argument seen in the proof of Theorem 12.2 for the terminal σ–algebra of an independent sequence of events shows that there exists a real number c such that L = c almost surely. Since E(X) = E(L) = c we conclude that U_n → E(X) almost surely.

By applying the ergodic theorem to the random variable X = 1_A, whose expectation is P(A), we obtain (13.3). Another suggestive interpretation of the ergodic theorem is the following one: thinking of n as a discrete time parameter and of ω as a spatial parameter (this is indeed the case in many applications), the sums (13.4) can be considered as temporal means of (n, ω) ↦ X_n(ω), while E(X) = E(X_n) is obviously a spatial mean, which is time-independent. Therefore the ergodic theorem tells us that, asymptotically, the temporal means converge to the spatial mean.

It is sometimes useful to deduce the ergodicity of a map T : X → X from the ergodicity of another one S : Y → Y through a sort of change of variables, induced by a map g : Y → X. To this aim, we introduce the following definition.

Definition 13.6 (Conjugate maps) We say that T : X → X is conjugate to S : Y → Y by g : Y → X if T(g(y)) = g(S(y)) for all y ∈ Y. Any map g with these properties is called a conjugating map.


The identity T ∘ g = g ∘ S leads to the commutativity of the diagrams

Y --S--> Y          Y --S^(n)--> Y
|g       |g         |g           |g
v        v          v            v
X --T--> X          X --T^(n)--> X

and basically tells us that the behaviour of the iterates of T in X can be read, modulo the change of variables induced by g, through the behaviour of the iterates of S in Y. Notice however that we are not requiring g to be 1-1 (although in many examples this is the case), so the conjugation relation is not symmetric in general (for this reason, this relation is sometimes called semi-conjugacy).

Theorem 13.7 (Invariance of ergodicity) Let (X, A), (Y, B) be measurable spaces. If T : X → X is conjugate to S : Y → Y via a (B, A)–measurable map g : Y → X, and S is ergodic with respect to P′, then T is ergodic with respect to P := g_#P′.

Proof. The commutativity of the diagram at the level of the maps implies commutativity of the diagram also at the level of the measures:

P′ --S_#--> P′
|g_#        |g_#
v           v
P  --T_#--> P

Therefore T_#(P) = T_#(g_#(P′)) = g_#(S_#(P′)) = g_#P′ = P. This proves that T is P–measure preserving.

Let now A ∈ A be a T–invariant set, which means that T^{−1}(A) ∆ A is contained in a P–negligible set; then g^{−1}(A) is an S–invariant set, because

S^{−1}(g^{−1}(A)) ∆ g^{−1}(A) = g^{−1}(T^{−1}(A)) ∆ g^{−1}(A) = g^{−1}(T^{−1}(A) ∆ A)

and g^{−1} maps P–negligible sets into P′–negligible sets. Hence P(A) = P′(g^{−1}(A)) ∈ {0, 1} because S is ergodic.

As an example, let us prove that the map D : [0, 1] → [0, 1] defined by

D(x) := 2x if 0 ≤ x < 1/2,   2x − 1 if 1/2 ≤ x ≤ 1  (13.5)


satisfies all the assumptions of the ergodic theorem. It suffices to use the fact that the doubling map D_2(θ) = 2θ is ergodic in R/2π (see the next section for a detailed proof) and to use the measure-preserving map g(x) = 2πx. The identity 2πD(x) = D_2(2πx) is immediate to check, and shows that D is conjugate to D_2. This simple example shows that ergodic maps need not be continuous.

13.2.1 Ergodic processes

We close this section mentioning briefly how the concept of ergodicity can be given also for stationary sequences (X_n) (for simplicity we consider only real-valued random variables). To this aim, given a sequence (X_n) of random variables, one can define Ω′ = R^N with the product σ–algebra A′, and canonically consider the law P′ of the random variable (X_n) as a probability measure on (Ω′, A′). On the space R^N there exists a canonical dynamics, induced by the shift map:

S(ω) = (ω_1, ..., ω_n, ...),   ω = (ω_0, ω_1, ...) ∈ R^N.

It is not difficult to check, going back to the definitions, that S is P′–measure-preserving if and only if (X_n) is stationary. This equivalence suggests the following definition:

Definition 13.8 (Ergodic processes) We say that a process (X_n) with law P′ in R^N is ergodic if the shift map S is ergodic in R^N relative to P′.

The ergodic theorem still holds for ergodic processes: this can be seen either revisiting the proof of Theorem 13.5 or using the following indirect argument: we already know that the means U_n of a stationary process (X_n) converge almost surely. The convergence to a constant, as happens in the case X_n = X ∘ T^(n), can be obtained as follows: the law of the means U_n coincides with the law of the means

U′_n := (ω_0 + ⋯ + ω_{n−1})/n,

seen as random variables in (Ω′, A′). But since ω_n = ω_0 ∘ S^(n) and the action of S is (by assumption) ergodic, we can apply the ergodic Theorem 13.5 to obtain that the U′_n converge almost surely to a constant. Since almost sure convergence implies (convergence in probability and) convergence in law, it follows that U′_n, and therefore U_n, converge in law to a constant. Since we already know that the U_n converge almost surely, and almost sure convergence implies convergence in law to the same limit, we obtain that the almost sure limit of U_n is (equivalent to) a constant.


Some dynamical systems, when read in suitable coordinates, can be read as dynamical systems induced by a shift map. Here is a typical example: let us consider the map D in (13.5) and the map g : {0, 1}^N → [0, 1] defined by

g((a_n)) := ∑_{n=0}^∞ a_n 2^{−n−1}

(notice that g is not injective, because dyadic numbers k/2^m have two binary expansions). Then, D is conjugate to the shift map S in {0, 1}^N via g: indeed, it is not difficult to check that x = g((a_n)) = ∑_i a_i 2^{−i−1} implies D(x) = ∑_i a_{i+1} 2^{−i−1}, which means precisely that g(S((a_n))) = D(g((a_n))).

13.3 Examples

In this section we present some fundamental examples of ergodic dynamical systems.

13.3.1 Arithmetic progressions on the circle

Let Ω = S¹ ∼ R/2π with the Borel σ–algebra and the law induced by the arclength, divided by 2π. For α ∈ S¹ the map T(θ) = θ + α is easily shown to be measure-preserving, and T^(n)(θ) = θ + nα.

Let us prove now that if α/π ∉ Q then T is ergodic; we give an elementary proof, based on the fact that the group

{nα + 2πm : n, m ∈ Z}

is dense in R, and another one, based on Fourier series. The first argument goes as follows: setting u = 1_A, and thinking of u as a 2π-periodic function, by invariance and periodicity we get

u(θ + nα + 2πm) = u(θ)   for L¹-a.e. θ,

for any n, m ∈ Z. Multiplying both sides by a 2π-periodic function φ we get

0 = ∫₀^{2π} [u(θ + nα + 2πm) − u(θ)]/(nα + 2πm) · φ(θ) dθ = ∫₀^{2π} u(θ) [φ(θ − nα − 2πm) − φ(θ)]/(nα + 2πm) dθ

for any n, m ∈ Z. If φ′ exists and is bounded in (0, 2π), choosing sequences (n_k) and (m_k) such that n_kα + 2πm_k ↓ 0 we can pass to the limit as k → ∞ to obtain

∫₀^{2π} u(θ)φ′(θ) dθ = 0   ∀φ ∈ C¹_c(0, 2π).


Lemma 13.9 below tells us that either P(A) = 0 or P(A) = 1.

The second proof goes as follows: let

u(θ) = ∑_{k∈Z} u_k e^{ikθ}

be the Fourier series of u, and take the left composition with T (which preserves, by the measure-preserving property of T, the L² convergence of the series) to obtain

u(θ) = u ∘ T(θ) = ∑_{k∈Z} u_k e^{ikα} e^{ikθ}.

The uniqueness of the Fourier expansion gives that e^{ikα} = 1 for any k such that u_k ≠ 0. If u is not identically equal to 0, there exists such a k, and therefore 1 = e^{ikα}. As α/(2π) ∉ Q, this can happen only if k = 0, i.e. 1_A is equivalent to the constant 1.

Therefore the ergodic theorem gives

lim_{n→∞} (1/n) ∑_{i=0}^{n−1} X(θ + iα) = (1/2π) ∫₀^{2π} X(θ) dθ

almost surely (with respect to the initial point θ) for any random variable X ∈ L¹.

In the previous formula we can choose in particular any bounded continuous function X on the circle, to obtain that (1/n) ∑_{i=0}^{n−1} δ_{θ+iα} weakly converge to P as n → ∞ for almost any θ. In this particular case, as the sequences θ + iα and θ′ + iα differ by an additive constant and the limit measure is translation invariant, the validity of this property for some θ implies its validity for all others.

When instead α/2π = p/q is a rational number, then T is measure-preserving but not ergodic: it suffices to take a small sector S with (normalized) length less than 1/q and define

A := ∪_{i=0}^{q−1} T^(i)(S)

to obtain a T-invariant set with measure qP(S) ∈ (0, 1).

Notice also that, taking X(θ) = θ, we have a stationary but not independent sequence, not even pairwise independent: indeed the difference X_n − X_m is constant, therefore X_n is not independent of X_m, as seen in Exercise 10.10.

Finally, let us state and prove Lemma 13.9, that we used to show that arithmeticprogressions on the circle are ergodic.

Lemma 13.9 (de la Vallée Poussin) Let I = (a, b) be a bounded interval and let u ∈ L¹(a, b) be such that ∫_a^b uφ′ dt = 0 for any φ ∈ C¹([a, b]) with φ(a) = φ(b). Then there exists a constant c such that u = c L¹–almost everywhere in (a, b).


Proof. We give an elementary proof of the lemma. By a change of variables we can assume with no loss of generality that I = (0, 1). Note that u is almost everywhere equal to a constant in (0, 1) if and only if

∫₀¹ uψ dt = (∫₀¹ u dt)(∫₀¹ ψ dt)  (13.6)

for any continuous function ψ : [0, 1] → R: indeed the above identity can be written as

∫₀¹ (u − ∫₀¹ u dt) ψ dt = 0,

and the density of continuous functions in L²(0, 1) shows that u = ∫₀¹ u dt L¹-almost everywhere in (0, 1).

Given ψ ∈ C([0, 1]), let c = ∫₀¹ ψ dt and define φ(t) := ∫₀^t (ψ − c) ds; then φ(0) = φ(1) = 0, so that the assumption on u gives

0 = ∫₀¹ uφ′ dt = ∫₀¹ u(ψ − c) dt = ∫₀¹ uψ dt − (∫₀¹ u dt)(∫₀¹ ψ dt),

which is precisely (13.6).

13.3.2 Geometric progressions on the circle

Let m > 1 be an integer. The map D_m : S¹ → S¹ defined by θ ↦ mθ satisfies all the assumptions of the ergodic theorem. This example shows that ergodic maps need not be injective.

First, D_m is measure-preserving: for instance in the case m = 2 (the general case is analogous) we see that

D₂^{−1}(A) = (1/2)A ∪ (π + (1/2)A)   ∀A ∈ B(S¹).  (13.7)

Now, let us prove the ergodicity of D_m. If S ⊂ S¹ is D_m-invariant, we easily see that

∫_{S¹} 1_S(θ) e^{−inmθ} dP(θ) = ∫_{S¹} 1_S(mθ) e^{−inmθ} dP(θ) = ∫_{S¹} 1_S(θ) e^{−inθ} dP(θ)

(in the first equality we used the invariance of S, and in the second one the measure-preserving property of D_m). Repeating this argument k times we get

∫_{S¹} 1_S(θ) e^{−inm^kθ} dP(θ) = ∫_{S¹} 1_S(θ) e^{−inθ} dP(θ)   ∀k ∈ N


and we can use the Riemann–Lebesgue lemma (see Lemma 5.5) to infer that 1_S is orthogonal to all functions e^{inθ}, n ∈ Z \ {0}. It follows that 1_S is (equivalent to) a constant.

This map provides also an example of a situation where we don't have convergence of the arithmetic means to E(X) for all ω: indeed, if m, h are integers and ω = 2πh/m^k, then the orbit reaches 2πh ≡ 0 in at most k steps, and then it remains constant.
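This exceptional behaviour can be checked directly with exact rational arithmetic (a Python sketch; note that plain floating point is useless here, since every double-precision orbit of x ↦ 2x mod 1 collapses to 0 after about 53 steps for typical inputs):

from fractions import Fraction

def D2(x):
    # the doubling map, written as x -> 2x mod 1 on [0,1)
    return (2 * x) % 1

x = Fraction(3, 32)          # a point of the form h/2^k (here k = 5)
for _ in range(5):
    x = D2(x)
print(x)                      # 0: the orbit is eventually constant

y = Fraction(1, 3)            # odd denominator: the orbit never hits 0
for _ in range(4):
    print(y)                  # 1/3, 2/3, 1/3, 2/3, ...
    y = D2(y)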

13.3.3 The triangular map

Let

T(x) := min{2x, 2(1 − x)},   x ∈ [0, 1].  (13.8)

It is easy to prove that T is measure-preserving, using a reasoning similar to that of (13.7). We will show that T is conjugate to the map D in (13.5). More precisely, we shall work on X = [0, 1] \ D, where D ⊂ [0, 1] is the set of dyadic numbers (of the form h2^{−n} for h, n integers, h ≤ 2^n). This restriction is motivated by the fact that any x ∈ X has a unique binary expansion (see (13.9) below), and we will read the maps T and D (as well as the conjugating map g) in these binary coordinates. At the same time, this restriction to X is justified by the fact that both T and D map X into X, and D into D, and since D has null probability it plays no role in the ergodic theorem.

We now proceed to define g : X → X. Let in all of the following x ∈ X be expressed in binary form as

x = ∑_{n≥1} a_n 2^{−n}   with a_n ∈ {0, 1}.  (13.9)

Using the formula 2x = a_1 + ∑_{n≥1} a_{n+1} 2^{−n} it is easy to check that

D(x) = ∑_{n≥1} a_{n+1} 2^{−n}.


Analogously, since 2(1 − x) = 2 ∑_{n≥1} (1 − a_n) 2^{−n}, we get

T(x) = ∑_{n≥1} b_{n+1} 2^{−n},   where b_n = a_n if a_1 = 0, and b_n = 1 − a_n if a_1 = 1 (n ≥ 2).

We define the function g(x) by

g(x) := ∑_{n≥1} b_n 2^{−n}   with b_n ∈ {0, 1} given by b_n ≡ ∑_{i=1}^n a_i  (13.10)

where the last identity is understood in the Z₂ arithmetic. Then

g(D(x)) = ∑_{n≥1} d_n 2^{−n}   with d_n ∈ {0, 1} given by d_n ≡ ∑_{i=1}^n a_{i+1}   for n ≥ 1.

Analogously

T(g(x)) = ∑_{n≥1} c_{n+1} 2^{−n},   with c_n ∈ {0, 1} given by

c_n ≡ ∑_{i=1}^n a_i   if b_1 = 0,
c_n ≡ 1 − ∑_{i=1}^n a_i   if b_1 = 1,

for n ≥ 2.

Therefore, in order to show that g(D(x)) = T(g(x)), it suffices to check that d_n = c_{n+1} for all n ≥ 1. If a_1 = b_1 = 0, we obtain

c_{n+1} ≡ ∑_{i=1}^{n+1} a_i = ∑_{i=2}^{n+1} a_i ≡ d_n.

If instead a_1 = b_1 = 1 we obtain

c_{n+1} ≡ 1 − ∑_{i=1}^{n+1} a_i = −∑_{i=2}^{n+1} a_i ≡ ∑_{i=2}^{n+1} a_i ≡ d_n.

It is also possible to show that the map g is measure-preserving, see Exercise 13.5. Since T is conjugate to D, which in turn is conjugate to the ergodic map D₂ (as we proved at the end of the previous section), and all the conjugating maps are measure-preserving, we obtain as a corollary that T is ergodic.
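The identity g(D(x)) = T(g(x)) can also be verified symbolically on truncated digit strings (a Python sketch; the helper names are ours): both sides are computed as digit operations and compared.

import random

def g_digits(a):
    # b_n = a_1 + ... + a_n (mod 2), as in (13.10)
    out, s = [], 0
    for ai in a:
        s ^= ai
        out.append(s)
    return out

def D_digits(a):
    # D shifts the binary expansion by one digit
    return a[1:]

def T_digits(b):
    # the triangular map in binary coordinates
    return b[1:] if b[0] == 0 else [1 - bi for bi in b[1:]]

random.seed(0)
a = [random.randint(0, 1) for _ in range(30)]
print(g_digits(D_digits(a)) == T_digits(g_digits(a)))   # True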

13.3.4 The logistic map

In the next and final example we see how it might sometimes be useful to consider non-canonical probability structures to obtain the measure-preserving property, and the ergodicity as well.

Let Ω = [0, 1], λ > 1, and let us consider the map

F_λ(x) := λx(1 − x),   x ∈ [0, 1].

Notice that the unique nonzero fixed point x̄ of F_λ (i.e. the solution of F_λ(x) = x) is x̄ = 1 − 1/λ. It is well known that, as λ increases from 1 to 4, the iterates of F_λ show a transition from a deterministic behaviour (convergence for all nonzero initial points to the fixed point 1 − 1/λ) to a chaotic behaviour. We see here that, in the particular case λ = 4 (well inside the chaotic regime), the iterates of F_λ are well described by the ergodic theorem.

[Figure 13.1: the graph of F₄]

Let T : [0, 1] → [0, 1] be the triangular

map that was defined in (13.8). Let moreover h : [0, 1] → [0, 1] be defined by h(θ) = sin²(πθ/2). We want to show that T and F₄ are conjugated by h, checking the identity

F₄ ∘ h(θ) = h ∘ T(θ)   ∀θ ∈ [0, 1].  (13.11)

Indeed, h(T(θ)) = sin²(πθ) and standard trigonometric identities give

F₄(h(θ)) = 4 sin²(πθ/2)(1 − sin²(πθ/2)) = 4 sin²(πθ/2) cos²(πθ/2) = sin²(πθ).

The map T is ergodic on the space [0, 1] with its canonical probability structure, as seen in Section 13.3.3. Therefore, by Theorem 13.7, the map F₄ is ergodic on ([0, 1], B([0, 1]), P), with P = h_#(1_{[0,1]} L¹).

Finally, notice that we can use the fact that h^{−1}(x) = (2/π) arcsin √x, and use the change of variables formula to get

P(A) = L¹(h^{−1}(A)) = ∫_A (h^{−1})′(x) dx = (1/π) ∫_A 1/√(x(1−x)) dx   ∀A ∈ B([0, 1]).

[Figure 13.2: the graph of h]

[Figure 13.3: the graphs of T, h and F₄, illustrating F₄ ∘ h = h ∘ T]

In particular P ≪ L¹ and, the density being strictly positive in [0, 1], the classes of P–negligible and L¹–negligible sets of [0, 1] coincide. However, since the density of P is very large when x is close to 0 or to 1, the ergodic theorem tells us that the iterates of F₄ pass through these regions more often, compared to the central part of the interval.
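This last remark can be observed numerically (a Python sketch with NumPy; in floating point the computed trajectory is only a pseudo-orbit of F₄, but its statistics still reproduce the invariant law): the fraction of time the orbit spends in an interval [a, b) is compared with P([a, b)) = (2/π)(arcsin √b − arcsin √a).

import numpy as np

x, N = 0.123456789, 200_000
samples = np.empty(N)
for i in range(N):
    x = 4.0 * x * (1.0 - x)        # one step of F_4
    samples[i] = x

for a, b in ((0.0, 0.1), (0.45, 0.55), (0.9, 1.0)):
    freq = np.mean((samples >= a) & (samples < b))
    P = (2/np.pi) * (np.arcsin(np.sqrt(b)) - np.arcsin(np.sqrt(a)))
    print((a, b), round(freq, 4), round(P, 4))
# the intervals near 0 and 1 are visited far more often than the central one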


EXERCISES

13.1 Given A, B ⊂ Ω, let A∆B := (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B) be the symmetric difference. Note that A∆B = C iff 1_A + 1_B = 1_C in Z₂, and A ∩ B = C iff 1_A 1_B = 1_C. Prove that A∆B = B∆A, that (A∆B)∆C = A∆(B∆C) and that A∆B = C iff A = C∆B.

13.2 Prove that an event A is T-invariant if and only if

P(A ∆ T^{−1}(A)) = 0.

Suppose that T is measure-preserving. Using Exercise 13.1, prove that the class of T-invariant sets is a complete σ–algebra. Hint: prove that if the A_n are invariant, and T^{−1}(A_n) = A_n ∆ E_n with P(E_n) = 0, then

(∪_n A_n) \ (∪_n E_n) ⊂ ∪_n T^{−1}(A_n) ⊂ (∪_n A_n) ∪ (∪_n E_n).

13.3 Suppose that T is measure-preserving. Given any random variable X = X_0, define X_n := X ∘ T^(n). Let A_I be the invariant σ–algebra of T, and let A_∞ be the terminal σ–algebra of the process (X_n) (that was defined for Kolmogorov's dichotomy Theorem 12.2). Suppose that the σ–algebra generated by the process (X_n) is A. Prove that

A_I ⊂ A_∞.  (13.12)

13.4 Let T be an ergodic map, and let A be an event with P(A) > 0. Show that the event

A_∞ := lim sup_{n→∞} {ω : T^(n)(ω) ∈ A}

has probability 1. Hint: show that A_∞ is T-invariant, and that P(A_∞) > 0.

13.5 Show that the map g in (13.10) is measure-preserving in [0, 1]. Hint: use the representation of the Lebesgue measure as a product measure in {0, 1}^{N*} (N* = N \ {0}), induced by the binary expansion x = ∑_{n≥1} a_n 2^{−n}.

13.6 Let (Ω, A, P) be a probability space. Prove that T : Ω → Ω is ergodic if and only if for any A, B ∈ A

lim_n (1/n) ∑_{k=1}^n P(T^{−k}(A) ∩ B) = P(A) P(B).  (13.13)

Hint: Let f = 1_A, so as to rewrite the above probabilities as

P(T^{−k}(A) ∩ B) = ∫_B f(T^k(ω)) P(dω).

Let then X_k(ω) = f(T^k(ω)); by the ergodic Theorem 13.5,

lim_n (1/n) ∑_{k=0}^{n−1} X_k = E(X_0)

in L¹, and this means that

lim_n ∫_B (1/n) ∑_{k=0}^{n−1} X_k(ω) P(dω) = lim_n (1/n) ∑_{k=0}^{n−1} ∫_B X_k(ω) P(dω) = P(B) E(X_0).


Conversely, suppose that (13.13) holds; let A be T-invariant; then P(T^{−k}(A) ∩ B) = P(A ∩ B), so (13.13) becomes P(A ∩ B) = P(A)P(B) for all B ∈ A. In particular P(A) = P²(A), which implies P(A) ∈ {0, 1}.

13.7 Let (Ω, A, P) be a probability space and let T : Ω → Ω be a measure-preserving map. Let L² = L²(Ω, A, P). Define U : L² → L² by U(X) = X ∘ T; prove that it is linear and unitary (that is, ⟨U(X), U(Y)⟩ = ⟨X, Y⟩ for all X, Y ∈ L²).

13.8 [Von Neumann Ergodic Theorem] Let (Ω, A, P) be a probability space and let T : Ω → Ω be a measure-preserving map. Let L² = L²(Ω, A, P). Let A_I be the invariant σ–algebra of T, so that H = L²(Ω, A_I, P) is the set of square-integrable T–invariant functions. Prove that H is closed in L² and let π : L² → H be the orthogonal projection. Prove that, for any random variable X ∈ L², the arithmetic means

(1/n)(X + X ∘ T + ⋯ + X ∘ T^(n−1))

converge in L² to π(X). Hint: Decompose L² = H ⊕ K, with K the orthogonal complement of H, and decompose X = X′ + X″ with X′ = π(X) ∈ H and X″ ∈ K. Define U : L² → L² by U(X) = X ∘ T; recall the previous exercise; note that U(H) = H and prove that U(K) ⊥ H, which means that U(K) ⊂ K. Let Z_n be the linear operator

Z_n(X) := (1/n)(X + X ∘ T + ⋯ + X ∘ T^(n−1)) = (1/n)(X + U(X) + ⋯ + U^{n−1}(X));

then U^(n)(X′) = Z_n(X′) = X′, while Z_n(X″) ∈ K. By Theorem 13.3, lim_n Z_n(X) = Y in L², and lim_n Z_n(X″) = Y″ ∈ K; note that Y = X′ + Y″. Arguing as in the ergodic Theorem 13.5, prove that Y ∈ H: then Y″ = 0, so lim_n Z_n(X) = X′ = π(X).


13.4 Appendix

13.4.1 Continuity and differentiability of integrals depending on a parameter

Let (X, F, µ) be a measure space and let I ⊂ R be an open interval. We consider a function f : I × X → R with the property

x ↦ f(t, x) ∈ L¹(X, F, µ)   ∀t ∈ I.  (13.14)

Under this assumption, the function

F(t) = ∫_X f(t, x) dµ(x),   t ∈ I,

is well defined, and we would like to find conditions ensuring its continuity and its differentiability.

Theorem 13.10 (Continuity of F) Assume that t ↦ f(t, x) is continuous in I for µ–a.e. x ∈ X, and that there exists g ∈ L¹(X, F, µ) satisfying

|f(t, x)| ≤ g(x)   ∀t ∈ I, x ∈ X.  (13.15)

Then F is continuous in I.

Proof. Let t ∈ I and let (t_h) ⊂ I be converging to t. Since t ↦ f(t, x) is continuous for µ–a.e. x ∈ X, we have

lim_{h→+∞} f(t_h, x) = f(t, x)   for µ–a.e. x ∈ X.

By (13.15) we can apply the dominated convergence theorem to obtain

lim_{h→+∞} F(t_h) = lim_{h→+∞} ∫_X f(t_h, x) dµ(x) = ∫_X f(t, x) dµ(x) = F(t).

The next example shows that F may fail to be continuous if (13.15) does not hold.

Example 13.11 Let I = X = R, let µ be the Lebesgue measure and let

f(t, x) := (|t| − |x|)/t²  if |x| < |t|;   0  if |x| ≥ |t|.

Then we find that F(t) = 1 for t ≠ 0 and F(0) = 0, hence F is not continuous. One can also easily check that, indeed, sup_t |f(t, x)| is not integrable, therefore (13.15) does not hold.


Example 13.12 The function

F(t) = ∫_1^∞ cos(tx)/x² dx

is continuous in R. It suffices to note that t ↦ cos(tx)/x² is continuous, and to define g(x) = x^{−2} to show that (13.15) holds.

In the next theorem we consider the differentiability properties of F.

Theorem 13.13 (Differentiability of F) Assume that t ↦ f(t, x) is differentiable (resp. continuously differentiable) in I for µ–a.e. x ∈ X and that there exists g ∈ L¹(X, F, µ) satisfying (D_t denotes differentiation with respect to the t variable)

|D_t f(t, x)| + |f(t, x)| ≤ g(x)   ∀t ∈ I, x ∈ X.  (13.16)

Then F is differentiable (resp. continuously differentiable) in I and

D_t F(t) = ∫_X D_t f(t, x) dµ(x)   ∀t ∈ I.  (13.17)

Proof. Let t_0 ∈ I and let (r_h) ⊂ R \ {0} be converging to 0. We obviously have

(F(t_0 + r_h) − F(t_0))/r_h = ∫_X (f(t_0 + r_h, x) − f(t_0, x))/r_h dµ(x).  (13.18)

Given x ∈ X such that t ↦ f(t, x) is differentiable in I, by the mean value theorem there exists s_h(x) between 0 and r_h satisfying

(f(t_0 + r_h, x) − f(t_0, x))/r_h = D_t f(t_0 + s_h(x), x).

In particular (13.16) gives

|(f(t_0 + r_h, x) − f(t_0, x))/r_h| ≤ g(x).

Now we can use the dominated convergence theorem in (13.18) to obtain

lim_{h→∞} (F(t_0 + r_h) − F(t_0))/r_h = ∫_X lim_{h→∞} (f(t_0 + r_h, x) − f(t_0, x))/r_h dµ(x) = ∫_X D_t f(t_0, x) dµ(x).

Since t_0 and the sequence (r_h) are arbitrary, this proves (13.17).


Example 13.14 The function

F(t) = ∫_1^∞ cos(tx)/x³ dx

is continuously differentiable in R. It suffices to notice that t ↦ cos(tx)/x³ is continuously differentiable, and that the modulus of its t-derivative can be estimated with g(x) = x^{−2}. We have also D_t F(t) = −∫_1^∞ sin(tx)/x² dx.

More generally, an easy induction argument shows that if t ↦ f(t, x) is k times (continuously) differentiable in I for µ–a.e. x ∈ X, and

∑_{p=0}^k |D_t^p f(t, x)| ≤ g(x)   ∀t ∈ I, x ∈ X

for some g ∈ L¹(X, F, µ), then F is k times (continuously) differentiable in I and

D_t^p F(t) = ∫_X D_t^p f(t, x) dµ(x)   ∀t ∈ I, p = 0, ..., k.

Finally, if X = R, we can also consider the case when the domain of integration depends on t, namely

F(t) = ∫_{α(t)}^{β(t)} f(t, x) dµ(x)

for suitable functions α, β : I → R. Then F(t) = G(t, α(t), β(t)) with

G(t, u, v) = ∫_u^v f(t, x) dµ(x).

If f : I × X → R is continuous, and continuously differentiable with respect to the t variable, then

D_t G(t, u, v) = ∫_u^v f_t(t, x) dx,   D_u G(t, u, v) = −f(t, u),   D_v G(t, u, v) = f(t, v).

Therefore the chain rule gives

D_t F(t) = D_t G(t, α(t), β(t)) + D_u G(t, α(t), β(t)) α′(t) + D_v G(t, α(t), β(t)) β′(t)
        = ∫_{α(t)}^{β(t)} f_t(t, x) dµ(x) + β′(t) f(t, β(t)) − α′(t) f(t, α(t)).

For instance, we have

D_t (∫_{sin t}^{t³} e^{xt}/(1 + x²) dx) = ∫_{sin t}^{t³} x e^{xt}/(1 + x²) dx + 3t² e^{t⁴}/(1 + t⁶) − cos t · e^{t sin t}/(1 + sin² t).


Chapter 14

Solutions of some exercises

In this chapter we provide solutions to the main exercises proposed in the text, and in particular of those marked with one or two stars.

Chapter 1

Exercise 1.2. We prove the statement for the translations, the proof for the dilations being similar. Fix h ∈ R and consider the class

F := {A ∈ B(R) : A + h ∈ B(R)}.

Then F is a σ–algebra containing the intervals, because the class I of intervals is invariant under translations. Therefore F ⊃ σ(I) = B(R). This proves that A + h is Borel whenever A is Borel.

Exercise 1.3. Set X = N and µ := ∑_n δ_n. Then the sets A_n := {n, n+1, ...} satisfy µ(A_n) = +∞, but their intersection is empty.

Exercise 1.4. Let A_n ↑ A with A_n, A ∈ A. Then the sets B_n := A \ A_n satisfy B_n ↓ ∅, so that by assumption µ(B_n) ↓ µ(∅) = 0. Since µ is finite, µ(B_n) = µ(A) − µ(A_n), so that µ(A_n) ↑ µ(A).

Exercise 1.5. For any n ∈ N* the set A_n of all atoms x such that µ({x}) ≥ 1/n has at most cardinality nµ(X): indeed, if we choose k elements x_1, ..., x_k in this set, adding the inequalities µ({x_i}) ≥ 1/n we find k/n ≤ µ(X), whence the upper bound on the cardinality of A_n follows.

If µ is σ–finite, we choose X_i ↑ X with X_i ∈ E and µ(X_i) < ∞ and repeat the previous argument with the sets A_{i,n} := {x ∈ A_µ ∩ X_i : µ({x}) ≥ 1/n}, whose union gives A_µ. If no finiteness assumption is made, the statement fails: take X = R, E = P(R) and µ(A) = 0 if A = ∅ and µ(A) = +∞ otherwise.

Exercise 1.6. Let µ be diffuse and t ∈ [0, µ(X)]. We have to find A ∈ E with µ(A) = t. The case t = 0 is obvious, so we consider the case when t ∈ (0, µ(X)]. We build recursively a nondecreasing family of sets B_i with µ(B_i) ≤ t and t − µ(B_{i+1}) ≤ (t − µ(B_i))/2. Obviously the set B = ∪_n B_n will satisfy µ(B) = t.

We start proving the following property: for any A ∈ E and any ε > 0 we can find B ∈ E with A ∩ B = ∅ and µ(B) ≤ εµ(A^c). Indeed, ...

Now we proceed to the construction of the sets B_i. We set B_0 = ∅ and, given B_i, we define B_{i+1} = B_i ∪ C_i, where C_i is disjoint from B_i and chosen in such a way....

Now, let us consider the case when X is a separable metric space and E = B(X). If µ({x}) > 0 for some x ∈ X, obviously µ is not diffuse. Conversely, if A ∈ B(X) is given, with µ(A) > 0 and µ(B) ∈ {0, µ(A)} for all B ⊂ A, we can fix a countable dense set (x_i) ⊂ X and define

r_0 := sup{r ≥ 0 : µ(A ∩ B_r(x_0)) = 0}.

Since r ↦ µ(A ∩ B_r(x_0)) is right continuous, the maximality of r_0 easily implies that µ(A ∩ B_{r_0}(x_0)) > 0, and therefore µ(A ∩ B_{r_0}(x_0)) = µ(A). Now we iterate this construction, setting A_1 := A ∩ B_{r_0}(x_0) and defining

r_1 := sup{r ≥ 0 : µ(A_1 ∩ B_r(x_1)) = 0},

so that µ(A_1 ∩ B_{r_1}(x_1)) = µ(A_1) = µ(A). Continuing in this way, we obtain a nonincreasing family of sets (A_i) with µ(A_i) = µ(A); it follows that µ(∩_i A_i) = µ(A) > 0. On the other hand, any point x ∈ ∩_i A_i satisfies

d(x, x_i) = r_i   ∀i ∈ N.

By the density of the family (x_i), this intersection contains at most one point (and at least one, because the measure is positive). It follows that this point is an atom of µ.

Exercise 1.7. Cantor's middle third set can be obtained as follows: let C_0 = [0, 1], let C_1 be the set obtained from C_0 by removing the interval (1/3, 2/3), let C_2 be the set obtained from C_1 by removing the intervals (1/9, 2/9) and (7/9, 8/9), and so on. Each set C_n consists of 2^n intervals with length 3^{−n}, so that λ(C_n) = (2/3)^n → 0. It follows that the intersection C of all the sets C_n is a closed and λ–negligible set.

In order to show that C has the cardinality of the continuum (at this stage it is not even obvious that C ≠ ∅!) we recall that numbers x ∈ [0, 1] can be represented with a ternary, instead of a decimal, expansion: this means that we can write

x = ∑_{i≥1} a_i 3^{−i} = 0.a_1a_2a_3 ...

with the ternary digits a_i ∈ {0, 1, 2}. As for decimal expansions, this representation is not unique; for instance 1/3 can be written either as 0.1 or as 0.0222 .... Whenever this ambiguity arises, we will opt for the second representation; with this choice it is easy to check that C_1 corresponds to the set of numbers not having 1 as first ternary digit, C_2 corresponds to the set of numbers not having 1 as a second ternary digit, and so on. It follows that C is the set of numbers not having 1 in their ternary expansion: since the map

C ∋ x = ∑_{i=1}^∞ a_i 3^{−i} ↦ (a_1, a_2, ...) ∈ {0, 2}^{N*}

provides a bijection of C with {0, 2}^{N*}, whose cardinality is the continuum, this proves that C has the cardinality of the continuum.

Exercise 1.8. Let {q_n}_{n∈N} be an enumeration of the rational numbers in [0, 1], and set

A := ∪_{n=0}^∞ (q_n − (ε/4) 2^{−n}, q_n + (ε/4) 2^{−n}).

Then A ⊂ R is open and λ(A) < ∑_n ε 2^{−n−1} = ε (why is the inequality strict?). Therefore [0, 1] \ A is a closed set with Lebesgue measure strictly greater than 1 − ε and with an empty interior, because [0, 1] \ A does not intersect Q.

Exercise 1.9. Let {I_n}_{n∈N} be an enumeration of the open intervals of (0, 1) with rational endpoints. By the construction in Exercise 1.8, for any interval I and any δ ∈ (0, λ(I)) we can find a compact set C ⊂ I with an empty interior such that 0 < λ(C) < δ. We will define

E := ∪_{i=0}^∞ C_i,

where C_n ⊂ I_n are compact sets with an empty interior, λ(C_n) > 0 and λ(C_n) < δ_n. The choice of C_n and δ_n will be done recursively. Notice first that

λ(E ∩ I_n) ≥ λ(C_n ∩ I_n) > 0   ∀n ∈ N,

so we have only to take care of the condition λ(E ∩ I_n) < λ(I_n). Set β_n = λ(I_n \ ∪_0^n C_i) and notice that β_n > 0 because all the C_i have an empty interior. Since

λ(I_n ∩ E) ≤ λ(I_n ∩ ∪_0^n C_i) + ∑_{i=n+1}^∞ δ_i = λ(I_n) − β_n + ∑_{i=n+1}^∞ δ_i,

it suffices to choose δ_n (and C_n) in such a way that ∑_{i=n+1}^∞ δ_i < β_n. This is possible, choosing for instance δ_{n+1} > 0 satisfying

δ_{n+1} < min{(1/2)β_n, (1/4)β_{n−1}, ..., (1/2^{n+1})β_0},

to get δ_i < 2^{n−i}β_n for i > n.

Exercise 1.10. Let A be µ–measurable and let B, C ∈ E be satisfying A∆B ⊂ C and µ(C) = 0. For any set D ⊂ X we have, by monotonicity of µ*,

µ*(D ∩ A) + µ*(D \ A) ≤ µ*(D ∩ (B ∪ C)) + µ*((D \ B) ∪ C).

Since µ*(D ∩ C) ≤ µ*(C) = µ(C) = 0, by using twice the subadditivity of µ* and then the additivity of B we get

µ*(D ∩ A) + µ*(D \ A) ≤ µ*(D ∩ B) + µ*(D \ B) = µ*(D).

Since D is arbitrary, this proves that A is additive.

Exercise 1.11. The statement is trivial if µ*(A) = ∞. If not, for any n ∈ N* we can find, by the definition of µ*, a countable union A_n of sets of N such that A_n ⊃ A and µ(A_n) ≤ µ*(A) + 1/n. Then, setting B := ∩_n A_n, we have B ⊃ A and µ(B) ≤ inf_n µ*(A) + 1/n = µ*(A). The converse inequality µ(B) ≥ µ*(A) follows by the monotonicity of µ*, taking into account that µ*(B) = µ(B).

Exercise 1.12. E_µ is a σ–algebra: stability under complement is immediate, because A^c∆B^c = A∆B; if A_i∆B_i ⊂ C_i, then (∪_i A_i)∆(∪_i B_i) ⊂ ∪_i C_i, and since µ–negligible sets are stable under countable unions, this proves that E_µ is stable under countable unions.

The extension µ(A) := µ(B), where B ∈ E is any set such that A∆B is contained in a µ–negligible set of E, is well defined and σ–additive on E_µ: if A∆B ⊂ C and A∆B′ ⊂ C′, then B∆B′ ⊂ C ∪ C′; consequently, if µ(C) = µ(C′) = 0 it must be µ(B) = µ(B′). The σ–additivity can be proven with an argument analogous to the one used to show that E_µ is a σ–algebra.

µ–negligible sets of E_µ are characterized by the property of being contained in a µ–negligible set of E: if A ∈ E_µ is µ–negligible, there exist µ–negligible sets B, C ∈ E with A∆B ⊂ C; as a consequence A is contained in the µ–negligible set B ∪ C ∈ E. Conversely, if A ⊂ X is contained in a µ–negligible set C ∈ E we may take B = ∅ to conclude that A ∈ E_µ and µ(A) = 0.

Exercise 1.13. Let A be additive; by Exercise 1.11 we can find a set B ∈ E containing A with µ(B) = µ*(A). The additivity of A and the equality µ*(B) = µ(B) give

µ(B) = µ*(A) + µ*(B \ A).

As a consequence µ*(B \ A) = 0. Now we apply Exercise 1.11 again, to find a µ–negligible set C ∈ E containing B \ A. It follows that A∆B is contained in C, and therefore A is µ–measurable.

Exercise 1.14. Let us first build a family of pairwise disjoint sets {A_i}_{i∈I} ⊂ P(N), with I and all the sets A_i having an infinite cardinality, and ∪_i A_i = N (the construction of the σ–algebra will be more clear if we keep I and N distinct). The family {A_i} can be obtained, for instance, through a bijective correspondence S between N × N and N, setting A_i := S({i} × N). Then, we define π : N → I by

π(n) = i, where i ∈ I is the unique index such that n ∈ A_i,

and (with the convention π^{−1}(∅) = ∅)

F := {π^{−1}(J) : J ⊂ I}.

It is immediate to check that F is a σ–algebra, that A_i = π^{−1}({i}) ∈ F and that any nonempty set in F contains one of the sets A_i. Therefore F contains infinitely many sets, and all of them except ∅ have an infinite cardinality.

Exercise 1.15. It suffices to define µ(A) = 0 if A has a finite cardinality, and +∞ otherwise. A finite union of sets has an infinite cardinality if and only if at least one of the sets has an infinite cardinality, and this shows that µ is additive.

The solutions of the next exercises require a more advanced knowledge of set theory, and in particular the theory of ordinals, the transfinite induction, the behaviour of cardinality under unions and products, and Zorn's lemma. We shall denote by ω the smallest uncountable ordinal and by χ the cardinality of the continuum.

Exercise 1.16. Notice that F^(j) ⊂ σ(K) implies

{∪_{k=0}^∞ A_k, B^c : (A_k) ⊂ F^(j), B ∈ F^(j)} ⊂ σ(K).

Therefore, if i is the successor of j, we obtain F^(i) ⊂ σ(K); analogously, if i has no predecessor, and F^(j) ⊂ σ(K) for all j ∈ i, then ∪_{j∈i} F^(j), namely F^(i), is contained in σ(K). Using these two facts, one obtains by transfinite induction that F^(i) ⊂ σ(K) for all i ∈ ω. An analogous induction argument shows that F^(i) ⊂ F^(j) whenever i ∈ j.

So, the union U := ∪_{i∈ω} F^(i) is contained in σ(K) and, to prove that equality holds, it suffices to show that this union is a σ–algebra. Let (B_k) ⊂ U and let i_k ∈ ω be such that B_k ∈ F^(i_k). Since the i_k are countable and ω is uncountable, we have i := ∪_k i_k ∈ ω and all the sets B_k belong to F^(i). It follows that their union belongs to F^(j), where j is the successor of i, and therefore to U. An analogous (and simpler) argument proves that U is stable under complement.

Exercise 1.17. Obviously B(R) has at least the cardinality of the continuum, so we need only to show an upper bound on the cardinality of B(R). The proof is based on the fact that a union ∪_{i∈J} X_i and a product ×_{i∈J} X_i have cardinality not greater than χ if the index set J and all the sets X_i have cardinality not greater than χ. Let F^(i) be defined as in Exercise 1.16, with K having at most the cardinality of the continuum. Using the previous property of products, with J even countable, one can prove by transfinite induction that, for all i ∈ ω, F^(i) has at most cardinality χ. If we choose as K the class of intervals, whose cardinality is (at most) χ, we find

B(R) = σ(K) = ∪_{i∈ω} F^(i).

Now we use the above mentioned property of unions, with J = ω and X_i = F^(i), to conclude that B(R) has at most the cardinality of the continuum.

Exercise 1.18. Obviously L has cardinality not greater than the cardinality of P(R); by the Bernstein theorem (1) it suffices to show that the cardinality of P(R) is not greater than the cardinality of L: if C is the Cantor set of Exercise 1.7, we know that P(R) is in one-to-one correspondence with P(C), because C has the cardinality of the continuum; on the other hand, any subset of C obviously belongs to L, because C has null Lebesgue measure.

Exercise 1.19. Let E ⊂ P(X) be a σ–algebra. We assume first that:

for any x, y ∈ X with x ≠ y there exists C ∈ E such that x ∈ C and y ∉ C.  (14.1)

Under this assumption, if X is not finite (and so at least countable) the σ–algebra E has at least the cardinality of the countable subsets of X, and therefore at least the cardinality of the continuum. Indeed, let X′ ⊂ X be a countable set and let us build for any x ∈ X′ a set B_x ∈ E such that x ∈ B_x and y ∉ B_x for all y ∈ X′ \ {x}. To this aim, it suffices to consider the intersection

B_x := ∩_{y∈X′\{x}} C_{xy},

where C_{xy} ∈ E contains x, but not y. Now, for any S ⊂ X′, we consider the set

B_S := ∪_{x∈S} B_x.

In order to show that E is uncountable, it suffices to show that S ↦ B_S is injective, because the cardinality of P(X′) is the continuum. Assume that S, S′ ⊂ X′ and S ≠ S′; possibly exchanging the roles of S and S′ we can assume that, for some x ∈ X′, x ∈ S and x ∉ S′. Then x ∈ B_x ⊂ B_S, but x ∉ B_{S′}, because x ∉ B_{x′} for all x′ ∈ S′. This proves that B_S ≠ B_{S′}, therefore S ↦ B_S is injective.

(1) If A has cardinality not greater than B, and B has cardinality not greater than A, then there exists a bijection between A and B.


Now we consider the general case, removing the assumption (14.1). Let F be a σ–algebra in Y and let us consider the equivalence relation

y ∼ y′ if and only if (y ∈ B ⟺ y′ ∈ B) ∀B ∈ F

(equivalently, y ∼ y′ if and only if 1_B(y) = 1_B(y′) for all B ∈ F). We consider the quotient set X = Y/∼ and the projection map π : Y → X (π(y) = [y] is the equivalence class of y). The definition of ∼ gives B = π^{−1}(π(B)) for all B ∈ F, hence B ↦ π(B) is injective from F to P(X). We define the σ–algebra

E := {π(B) : B ∈ F},

so that B ↦ π(B) is a bijective map from F to E. If [y] ≠ [y′] then y and y′ are not equivalent, therefore there exists B ∈ F containing y but not y′, so that [y] ∈ π(B) and [y′] ∉ π(B). This proves that (14.1) is fulfilled, and so E is either finite or uncountable; as a consequence, the same holds for F.

Exercise 1.20. We begin our construction with an algebra τ_0 in P(N) and µ_0 : τ_0 → {0, 1} which is additive but not σ–additive. For instance we may take as τ_0 the algebra generated by the singletons {x}, x ∈ N (i.e. the sets A ⊂ N such that either A or A^c is finite) and set

µ_0(A) := 0 if A is finite;   1 if A^c is finite.

We will extend µ_0 to an additive function, that we still denote by µ_0, defined on the whole of P(N). If such an extension exists, it can't be σ–additive, because µ_0({n}) = 0 for all n ∈ N, while µ_0(N) = 1.

In the class of pairs (τ, µ) with τ an algebra and µ : τ → {0, 1} additive, we define the order relation (τ, µ) ≤ (τ′, µ′) by τ ⊂ τ′ and µ′|_τ = µ; then we consider the class of all (τ, µ) satisfying (τ, µ) ≥ (τ_0, µ_0). By Zorn's lemma, we can find a maximal (τ̄, µ̄) in this class: indeed, it is easy to check that any nondecreasing family {(τ_i, µ_i)}_{i∈I} in this class, with I totally ordered, has an upper bound (τ′, µ′), defined by

τ′ := ∪_{i∈I} τ_i and µ′(A) := µ_i(A) where A ∈ τ_i.

We will show that the maximality of (τ̄, µ̄) forces τ̄ to coincide with P(N), so that µ̄ will be the desired extension of µ_0.

Let us assume by contradiction that τ̄ ⊊ P(N) and choose Z ⊂ N with Z ∉ τ̄. We notice that

{(A_1 ∩ Z) ∪ (A_2 ∩ Z^c) : A_1, A_2 ∈ τ̄}

is the algebra generated by τ̄ ∪ {Z}. Moreover, either Z or Z^c satisfies the following property:

for all A ∈ τ̄ with µ̄(A) = 1, Z ∩ A ≠ ∅.  (14.2)

If not, we would be able to find A_1, A_2 ∈ τ̄ with A_1 ∩ Z = A_2 ∩ Z^c = ∅ and µ̄(A_1) = µ̄(A_2) = 1, so that A_1 and A_2 would be disjoint and µ̄(A_1 ∪ A_2) = 2, contradicting the fact that µ̄ maps τ̄ into {0, 1}. Possibly replacing Z by its complement we shall assume that Z fulfils (14.2).

Now we extend µ̄ to the algebra generated by τ̄ ∪ {Z}, as follows:

µ̃(B) := µ̄(A_1) whenever A_1, A_2 ∈ τ̄ and B = (A_1 ∩ Z) ∪ (A_2 ∩ Z^c).  (14.3)

Let us check that µ̃ is well defined and additive.

1. µ̃ is well defined: if

B = (A_1 ∩ Z) ∪ (A_2 ∩ Z^c) = (A_3 ∩ Z) ∪ (A_4 ∩ Z^c),

then (A_1 ∩ Z) = (A_3 ∩ Z), and if µ̄(A_1) ≠ µ̄(A_3) then one of the two numbers, say µ̄(A_1), equals 1, while µ̄(A_3) = 0. Defining A := A_1 \ A_3 we have µ̄(A) = 1 and A ∩ Z = ∅, contradicting (14.2).

2. µ̃ is additive: suppose B, B′ are disjoint, with B = (A_1 ∩ Z) ∪ (A_2 ∩ Z^c) and B′ = (A′_1 ∩ Z) ∪ (A′_2 ∩ Z^c). Then A_1 ∩ A′_1 ∩ Z = ∅. Setting A″_1 := A′_1 \ A_1 we still have B′ = (A″_1 ∩ Z) ∪ (A′_2 ∩ Z^c), and then we can use the additivity of µ̄ to conclude that

µ̃(B ∪ B′) = µ̄(A_1 ∪ A″_1) = µ̄(A_1) + µ̄(A″_1) = µ̃(B) + µ̃(B′).

If B ∈ τ̄ we can choose A_1 = A_2 = B in (14.3) to obtain that µ̃(B) = µ̄(B), so that µ̃ extends µ̄ to the algebra generated by τ̄ ∪ {Z}. This violates the maximality of (τ̄, µ̄).

Chapter 2

Exercise 2.3. (i) The verification of the axioms of distance is immediate. In order to prove the compactness of R̄, let us consider a sequence (x_n) ⊂ R̄. If sup_n x_n = +∞ we can find for any k an index n(k) such that x_{n(k)} ≥ k; it follows that d(x_{n(k)}, +∞) = |arctan x_{n(k)} − π/2| tends to 0, so that x_{n(k)} → +∞ in the metric space. Analogously, if inf_n x_n = −∞ we can find a subsequence converging to −∞ in (R̄, d). Finally, if both sup_n x_n and inf_n x_n are finite, the sequence (x_n) is bounded and we can extract, thanks to the Bolzano–Weierstrass theorem, a subsequence x_{n(k)} converging to x ∈ R. The continuity of z ↦ arctan z implies that x_{n(k)} → x in (R̄, d). To prove the equivalence of the two topologies, let us work with closed sets: if C ⊂ R is closed with respect to the (R̄, d) topology, then it is closed with respect to the Euclidean topology, because |x_n − x| → 0 implies |arctan x_n − arctan x| → 0. On the other hand, if |arctan x_n − arctan x| → 0 then for n large enough arctan x_n belongs to an interval I := (arctan x − ε, arctan x + ε) ⊂ (−π/2, π/2); the continuity of y ↦ tan y in I implies that x_n → x. This proves the converse implication, and the equivalence of the two topologies.

(ii) We notice first that, according to (i), the elements of B(R), as well as {−∞} and {+∞}, belong to B(R̄). Therefore, if f is measurable between E and the Borel σ–algebra of (R̄, d), then it is E–measurable according to (2.2). According to the measurability criterion, in order to prove the converse implication it suffices to show that B(R̄) is generated by B(R) ∪ {{−∞}, {+∞}}: this follows by the fact that if C ⊂ R̄ is closed, then

C = (C ∩ R) ∪ (C ∩ {−∞}) ∪ (C ∩ {+∞})

(again by (i)) belongs to the algebra generated by B(R) ∪ {{−∞}, {+∞}}, therefore the σ–algebra generated by this family of sets contains B(R̄).

Exercise 2.4. If f 6= g is contained in a µ–negligible set C of E , for someE –measurable function g, then f > t∆g > t ⊂ C for all t ∈ R, and sinceg > t ∈ E it follows that f > t ∈ Eµ; this means that f is Eµ–measurable.Conversely, assume that f is Eµ–measurable and find for all q ∈ Q a set Bq ∈ E anda µ–negligible set Cq ∈ E with f > q∆Bq ⊂ Cq. We define

g(x) := sup q ∈ Q : x ∈ Bq , C :=⋃q∈Q

Cq.

Since g ≤ t =⋂

q≤tBq we have that g is E –measurable. Let us prove thatf(x) = g(x) for all x /∈ C: for any such x we have x ∈ Bq for all q < f(x), thereforeg(x) ≥ f(x); if the inequality were strict, there would exist q ∈ Q with x ∈ Bq andq > f(x), therefore x would be in Bq \ f > q ⊂ Cq ⊂ C.

Exercise 2.5. If σ ≤ τ we can find a nondecreasing family of partitions σ1, . . . , σn

with σ1 = σ, σn = τ and σi+1 \ σi containing just one point. Therefore, in theproof of the monotonicity of σ 7→ Iσ(f) we need only to show that Iσ(f) ≤ Iσ∪t(f)whenever t ∈ (0,∞) \ σ. Let σ = t0, . . . , tN and let i be the last index such thatti < t. If i < N we use the inequality

(ti+1− ti)f(ti+1) = (ti+1− t)f(ti+1) + (t− ti)f(ti+1) ≤ (ti+1− t)f(ti+1) + (t− ti)f(t)

adding to both sides∑

j 6=i(tj+1 − tj)f(tj+1) we obtain Iσ(f) ≤ Iσ∪t(f). If i = Nthe argument is even easier, because the difference Iσ∪t(f) − Iσ(f) is given by(t− tN)f(t).

Page 218: Probability Book

214 Stationary sequences and elements of ergodic theory

Now, let f, g : (0,+∞) → [0,+∞) be given; since Iσ(f + g) = Iσ(f) + Iσ(g) weget Iσ(f + g) ≤

∫∞0f(t) dt+

∫∞0g(t) dt. Since σ ∈ Σ is arbitrary, this proves that∫ ∞

0

f(t) + g(t) dt ≤∫ ∞

0

f(t) dt+

∫ ∞

0

g(t) dt.

In order to prove the converse inequality, fix L <∫∞

0f(t) dt, M <

∫∞0g(t) dt and

find σ, η ∈ Σ with Iσ(f) > L and Iη(g) > M ; then∫ ∞

0

f(t) + g(t) dt ≥ Iσ∪η(f + g) = Iσ∪η(f) + Iσ∪η(g) ≥ Iσ(f) + Iη(g) > L+M.

Letting L ↑∫∞

0f(t) dt and M ↑

∫∞0g(t) dt the inequality is proved.

Exercise 2.6. We will prove that f∗ is lower semicontinuous, the proof of the uppersemicontinuity of f ∗ being analogous. Let (xn) ⊂ R be converging to x and use thedefinition of f∗(xn) to find yn ∈ R such that

|xn − yn| <1

nand f(yn) ≤ f∗(xn) +

1

n.

Then (yn) still converges to x, so that

f∗(x) ≤ lim infn→∞

f(yn) ≤ lim infn→∞

f∗(xn) +1

n= lim inf

n→∞f∗(xn).

Exercise 2.7. Let t ∈ R and let (xn) ⊂ f∗ ≤ t be convergent to x. Then, thelower semicontinuity of f∗ gives

f∗(x) ≤ lim infn→∞

f∗(xn) ≤ t.

This proves that x ∈ f∗ ≤ t, so that f∗ ≤ t is closed. The proof for f ∗ issimilar. Since the Borel σ–algebra is generated by halflines, it follows that f ∗ andf∗ are Borel, and the same is true for the set f∗ = f ∗, that coincides with Σ.

Exercise 2.8. Set ϕ0 := ϕ, A0 := ϕ0 ≥ a0 and ϕ1 := ϕ − a01A0 ≥ 0. Then, setA1 := ϕ1 ≥ a1 and ϕ2 := ϕ1 − a11A1 and so on. By construction we have that0 ≤ ϕi+1 ≤ ϕi ≤ · · · ≤ ϕ0 = ϕ, hence

ϕ = ϕn+1 +n∑

i=0

(ϕi − ϕi+1) = ϕn+1 +n∑

i=0

ai1Ai.

This proves that ϕ ≥∑

i ai1Ai. If the inequality were strict for some x ∈ X, we

could find ε > 0 such that ϕi(x) ≥ ε for all i ∈ N, and since ai < ε for i large

Page 219: Probability Book

Chapter13 215

enough, we would get x ∈ Ai for i large enough. But since the series∑

i ai is notconvergent, we would get

∑i ai1Ai

(x) = ∞, a contradiction.

Exercise 2.9. Assume by contradiction that the absolute continuity property fails.Then, for some ε > 0 we can find Ai with µ(Ai) < 2−i and

∫Ai|ϕ| dµ ≥ ε. It follows

that the set B := lim supiAi is µ–negligible, and

Bn :=⋃i≥n

Ai \B ↓ ∅.

Since∫

Bn|ϕ| dµ ≥

∫An|ϕ| dµ ≥ ε we find a contradiction with the dominated con-

vergence theorem applied to the functions 1Bn|ϕ|, pointwise converging to 0.

Exercise 2.10. Let ε > 0 be given and let δ > 0 be such that∫

A|ϕ| dµ < ε/2

whenever A ∈ E and µ(A) < δ. The triangle inequality gives, with the same choiceof A,

∫A|ϕn| dµ < ε for n > n0, provided ‖ϕn − ϕ‖1 < ε/2 for n > n0. Since

ϕ1, . . . , ϕn0 are integrable, we can find δi > 0 such that∫

A|ϕi| dµ < ε whenever

A ∈ E and µ(A) < δi. If δ0 = minδ,mini δi, we have∫

A|ϕn| dµ < ε/2 whenever

n ∈ N, A ∈ E and µ(A) < δ.A possible example for the second question is Ω = [0, 1], µ = λ the Lebesguemeasure, and ϕn = 2n

n1[2−n,21−n). The uniform integrability is a direct consequence

of the convergence of ϕn to 0 in L1. If ϕn ≤ g, then

∞supn=1

ϕn =∞∑

n=1

ϕn ≤ g

but∫ ∑∞

1 ϕn =∑∞

1 1/n = ∞.

Exercise 2.11. (a) For any y ∈ X we have

gλ(x) ≤ g(y) + λd(x, y) ≤ g(y) + λd(x′, y) + λd(x, x′).

Since y is arbitrary we get gλ(x) ≤ gλ(x′) + λd(x, x′). Reversing the roles of x and

x′ the inequality is achieved.(b) Clearly the family (gλ) is monotone with respect to λ, and since we can

always choose y = x in the minimization problem we have gλ(x) ≤ g(x). Assumethat supλ gλ(x) is finite (otherwise the statement is trivial) and let xλ such thatgλ(x) + λ−1 ≥ g(xλ) + λd(x, xλ). This inequality implies that xλ → x as λ → ∞and, now neglecting the term λd(x, xλ), that

gλ(x) +1

λ≥ g(xλ).

Passing to the limit in this inequality as λ→∞ and using the lower semicontinuityof g we get supλ gλ(x) ≥ g(x).

Page 220: Probability Book

216 Stationary sequences and elements of ergodic theory

Exercise 2.12. Let us first assume that f is bounded. For ε > 0 we consider thefunctions

fε(x, y) :=1

∫ x+ε

x−ε

f(x′, y) dx′.

Since x 7→ f(x, y) is continuous, we can apply the mean value theorem to obtainthat fε(x, y) → f(x, y) as ε ↓ 0. So, in order to show that f is a Borel function, weneed only to show that fε are Borel.

We will prove indeed that fε are continuous: let xn → x and yn → y; sincef(x′, yn) → f(x′, y) for all x′ ∈ R, we have

1[xn−ε,xn+ε](x′)f(x′, yn) → 1[x−ε,x+ε](x

′)f(x′, y)

for all x′ ∈ R \ x − ε, x + ε. Therefore, since f is bounded, the dominatedconvergence theorem yields

fε(x, y) =1

∫R

1[x−ε,x+ε](x′)f(x′, y) dx′

=1

2εlim

n→∞

∫R

1[xn−ε,xn+ε](x′)f(x′, yn) dx′ = lim

n→∞fε(xn, yn).

In the general case when f is not bounded we approximate it by the boundedfunctions fh(x) := max−h,minf(x), h, with h ∈ N, that are still separatelycontinuous, and therefore Borel.

Chapter 3

Exercise 3.1. On the real line, endowed with the Lebesgue measure, the function(1 + |x|)−1 belongs to L2, but not to L1, and the function |x|−1/2 belongs to L1, butnot to L2. Turningback to the general case, if ϕ ∈ Lp1 ∩ Lp2 with p1 ≤ p2, from theinequality

|ϕ|p ≤ max|ϕ|p1 , |ϕ|p2 ≤ |ϕ|p1 + |ϕ|p2 ∀p ∈ [p1, p2]

(that can be verified considering separately the cases |ϕ| ≤ 1 and |ϕ| > 1) we getthat ϕ ∈ Lp for all p ∈ [p1, p2].

Exercise 3.2. Let 1 ≤ p ≤ q < ∞ and f ∈ Lq(X,E , µ). Show that, regardless ofany finiteness assumption on µ, for any δ ∈ (0, 1) we can write f = g + f , withg ∈ Lp(X,E , µ), f ∈ Lq(X,E , µ) and ‖f‖q ≤ δ‖f‖q.

Exercise 3.3. By homogeneity we can assume that ‖ϕ‖p = 1 and ‖Ψ‖q = 1. Since∫X

( |ϕ|pp

+|ψ|q

q− |ϕ||ψ|

)dµ =

‖ϕ‖p

p+‖ψ‖q

q− 1 = 0

Page 221: Probability Book

Chapter13 217

and the function among parentheses is nonnegative, it follows that if vanishes µ–a.e.In particular, for µ–a.e. x, |ϕ(x)| is a minimizer of

y 7→ yq

q− |ψ(x)|y

in [0,+∞). But this problem has a unique minimizer, given by |ψ(x)|q−1, and weconclude.

Exercise 3.4. It suffices to apply Holder inequality to the functions |ϕ|r and |ψ|r,with the dual exponents p/r and q/r, to obtain

‖ϕψ‖rr ≤ ‖|ϕ|r‖p/r‖|ψ|r‖q/r = ‖ϕ‖r

p‖ψ‖rq.

Exercise 3.5. The positive part and the negative part of ϕ − ϕn have the sameintegral, hence ∫

X

|ϕ− ϕn| dµ = 2

∫X

(ϕ− ϕn)+ dµ.

The condition lim infn ϕn ≥ ϕ ensures that (ϕ− ϕn)+ is pointwise convergent to 0;therefore the dominated convergence theorem gives the result.

Exercise 3.6. If ψn → ψ µ–a.e. we apply Fatou’s lemma to the functions ψn + ϕn

to obtain

lim infn→∞

∫X

ψn + ϕn dµ ≥∫

X

ψ + ϕdµ.

Therefore

lim supn→∞

∫X

ψn dµ+ lim infn→∞

∫X

ϕn dµ ≥∫

X

ϕdµ+

∫X

ψ dµ.

Subtracting∫ψ dµ from both sides the statement is achieved. In the general case,

let n(k) be a subsequence such that limk

∫ϕn(k) dµ = lim infn

∫Xϕn, and let n(k(s))

be a further subsequence converging to ϕ µ–a.e. Then

lim infn→∞

∫X

ϕn dµ = lims→∞

∫X

ϕn(k(s)) dµ ≥∫

X

lim infn

ϕn(k(s)) dµ

≥∫

X

lim infn→∞

ϕn dµ.

Exercise 3.7. Show that (3.13) implies f(tx+(1− t)y) ≤ tf(x)+ (1− t)f(y) for allx, y ∈ J and t ∈ [0, 1]. Then, deduce from this property (3.14). Hint: it is useful toconsider dyadic numbers t = k/2m, with k ≤ 2m integer.

Exercise 3.8.

Page 222: Probability Book

218 Stationary sequences and elements of ergodic theory

Exercise 3.9. Fatou’s lemma gives lim infn

∫ϕn dµ ≥

∫lim infn ϕn dµ ≥

∫ϕdµ.

Therefore tn :=∫ϕn dµ → t :=

∫ϕdµ; we can apply Exercise 3.5 to the functions

ϕn/tn to obtain that ϕn/tn → ϕ/t in L1. From this, taking into account that tn → t,the convergence of ϕn to ϕ in L1 follows.

Exercise 3.10. Let Ψ(c) := Φ(c)/c and notice that |ϕi| ≤ cΦ(|ϕi|)/Φ(c) = Φ(|ϕi|)/Ψ(c)on |ϕi| ≥ c. Therefore∫

A

|ϕi| dµ ≤∫

A∩|ϕi|≥c

Φ(|ϕi|)Ψ(c)

dµ+

∫A∩|ϕi|<c

|ϕi| dµ ≤M

Ψ(c)+ cµ(A).

Let us choose c sufficiently large, such that M/Ψ(c) < ε/2, and then δ > 0 suchthat cδ < ε/2. The inequality above yields

∫A|ϕi| dµ < ε whenever µ(A) < δ.

Exercise 3.11. Let (fn) ⊂ Cb(X) be converging in L1 to f , and let fn(k) be asubsequence pointwise convergent µ–a.e. to f . Then, given any ε > 0, by Egorovtheorem we can find a Borel set B ⊂ X with µ(B) < ε and fn(k) → f uniformlyon Bc. By the inner regularity of the measure we can find a closed set C ⊂ Bc

such that µ(X \ C) < ε. The function f restricted to C, being the uniform limit ofcontinuous functions, is continuous.

Chapter 4

Exercise 4.1. Notice that 〈·, ·〉 is obviously symmetric, and 〈x, x〉 = ‖x‖2 ≥ 0, withequality only if x = 0. Notice that the parallelogram identity gives

‖x+x′+2y‖2+‖x−x′‖2 = 2‖x+y‖2+2‖x′+y‖2 = 8〈x, y〉+8〈x′, y〉−2‖x−y‖2−2‖x′−y‖2

and

‖x+x′−2y‖2+‖x−x′‖2 = 2‖x−y‖2+2‖x′−y‖2 = 8〈x,−y〉+8〈x′,−y〉−2‖x+y‖2−2‖x′+y‖2.

Subtracting and dividing by 4 we get

〈x+ x′, 2y〉 = 4〈x, y〉+ 4〈x′, y〉 − 2〈x, y〉 − 2〈x′, y〉.

So, we proved that 〈x + x′, 2y〉 = 2〈x, y〉 + 2〈x′, y〉. Using the relation 〈u, 2v〉 =4〈u/2, v〉 (due to the definition of 〈·, ·〉 and the homogeneity of ‖ · ‖), we get

〈x+ x′

2, y〉 =

1

2〈x, y〉+

1

2〈x′, y〉.

Setting x = t1v, x′ = t2v, and defining the continuous function φ(t) = 〈tv, y〉, we

get

φ(t1 + t2

2) =

1

2φ(t1) +

1

2φ(t2).

Page 223: Probability Book

Chapter13 219

This means that φ and −φ are convex in R, so that φ is an affine function, and sinceφ(0) = 0 we get φ(t) = tφ(0), i.e. 〈tu, y〉 = t〈u, y〉. Coming back to the identityabove, we get 〈x+ x′, y〉 = 〈x, y〉+ 〈x′, y〉.Exercise 4.3. Assume that y = πK(x). For all z ∈ K and t ∈ [0, 1] we havey + t(z − y) belongs to K, so that

‖y + t(z − y)− x‖2 ≥ ‖y − z‖2.

Expanding the squares we get

t2‖z − y‖2 + 2t〈z − y, y − x〉 ≥ 0 ∀t ∈ [0, 1].

This implies (either dividing by t > 0 and passing to the limit as t ↓ 0, or computingthe right derivative at t = 0) that 〈z − y, x − y〉 ≤ 0. Conversely, if for somey ∈ K this condition holds for all z ∈ K, the argument can be reversed to get‖y+ t(z− y)−x‖ ≥ ‖y−x‖ for all t ≥ 0. Choosing t = 1 we get ‖z−x‖ ≥ ‖y−x‖,proving that y = πK(x).

Exercise 4.4. Notice that f2 6= 0, otherwise v2 would be linearly dependent fromv1, and therefore the vector space spanned by f1, f2 coincides with the vectorspace spanned by v1, v2. This implies that f3 6= 0, otherwise v3 would be linearlydependent from f1, f2, and therefore from v1, v2. As a consequence the vectorspace spanned by f1, f2, f3 coincides with the vector space spanned by v1, v2, v3.Continuing in this way, we see that the vector space Yk spanned by f1, . . . , fk (orequivalently by e1, . . . , ek) coincides with the vector space spanned by v1, . . . , vk.The orthogonality of the vectors ei can be obtained just noticing that

fk = vk −k−1∑i=1

〈vk, ei〉ei.

So, fk = vk−πYk−1(vk) is orthogonal to all vectors in Yk−1. It follows that 〈ek, ei〉 = 0

for all i < k.

Exercise 4.5. Let y = x −∑

k〈x, ek〉ek; we know that the series converges in Hby Bessel’s inequality. In order to show that

∑k〈x, ek〉ek = πX(x) it suffices to

prove that y is orthogonal to all vectors in X. But since any vector v ∈ X can berepresented as a series, it suffices to show that 〈v, ei〉 = 0 for all i. The continuityand linearity of the scalar product give

〈y, ei〉 = 〈x, ei〉 −∞∑

k=0

〈x, ek〉〈x, ei〉 = 〈x, ei〉 − 〈x, ei〉 = 0.

Exercise 4.7. By Parseval identity we know that x 7→ (〈x, ei) is a linear isometryfrom H to `2. As a consequence, taking Exercise 4.2 into account, the scalar productis preserved.

Page 224: Probability Book

220 Stationary sequences and elements of ergodic theory

Exercise 4.8. We consider the class of orthonormal systems eii∈I of H, orderedby inclusion. Zorn’s lemma ensures the existence of a maximal system eii∈I . LetV be the subspace spanned by ei, let Y be its closure (still a subspace) and let usprove that Y = H. Indeed, if Y were a proper subspace of H, we would be ableto find, thanks to Corollary 4.5, a unit vector e orthogonal to all vectors in Y , andin particular to all vectors ei. Adding e to the family eii∈I the maximality of thefamily would be violated. Now, by the just proved density of V in H, given anyx ∈ H we can find a sequence of vectors (vn), finite combinations of vectors ei, suchthat ‖x − vn‖ → 0. If we denote by Jn ⊂ I the set of indeces used to build thevectors v1, . . . , vn, and by Hn the vector space spanned by eii∈Jn , we know byProposition 4.6 that

‖x−∑i∈Jn

〈x, ei〉ei‖ ≤ ‖x− vn‖ → 0.

As a consequence, setting J = ∪nJn, we have x =∑

i∈J〈x, ei〉ei.

Chapter 5

Exercise 5.1. The functions sinmx cos lx are odd, therefore their integral on (−π, π)vanishes. To show that sinmx is orthogonal to sin lx when l 6= m, we integrate twiceby parts to get∫ π

−π

sinmx sin lx dx =m

l

∫ π

−π

cosmx cos lx dx =m2

l2

∫ π

−π

sinmx sin lx dx.

The integrals of products cosmx cos lx can be handled analogously.

Exercise 5.2. Since for N < M we have

‖N∑

n=0

xn −M∑

n=0

xn‖ ≤M∑

i=N+1

‖xi‖ ≤∞∑

i=N+1

‖xi‖

we obtain that (∑N

0 xi) is a Cauchy sequence in E. Therefore the completeness ofE provides the convergence of the series. Passing to the limit as N → ∞ in theinequality ‖

∑N0 xi‖ ≤

∑N0 ‖xi‖ and using the continuity of the norm we obtain

(5.13).

Exercise 5.3. We consider only the first system gk =√

2/π sin kx, the proof forthe second one being analogous. The fact that (gk) is orthonormal can be easilychecked noticing that gk are restrictions to (0, π) of odd functions, and using theorthogonality of sin kx in L2(−π, π). Analogously, if f ∈ L2(0, π) let us consider itsextension f to (−π, π) as an odd function and its Fourier series, which obviously

Page 225: Probability Book

Chapter13 221

contains no cosinus. In (0, π) we have

N∑k=1

bk sin kx =N∑

k=1

〈f, gk〉gk,

where the scalar products are understood in L2(0, π). Therefore, from the conver-gence of the Fourier series in L2(−π, π) to f , which implies convergence in L2(0, π)to f , the completeness follows.

Exercise 5.4. Clearly 〈ek, ek〉 = 1, while∫ π

−π

eikxe−ilx dx =1

i(k − l)

∫ π

−π

ei(k−l)x dx = 0 whenever k 6= l.

As a consequence (ek) is an orthonormal system.Since the Fourier series SNf =

∑N−N〈f, ek〉ek of f depends linearly on f , in order to

show completeness we need only to show SNf → f when f is real-valued and whenf is imaginary-valued (i.e. if is real-valued). We consider only the first case, thesecond one being analogous. Setting ck = 〈f, ek〉, we have

ck =1√2π

∫ π

π

f(x) cos kx− if(x) sin kx dx.

As a consequence, for k ≥ 1 we have√

2ck = ak − ibk, where ak and bk are thecoefficients of the real Fourier series of f , and for k ≤ −1 we have

√2/πck =

a−k + ib−k. For k = 0, instead, we have c0 =√π/2a0. Taking into account these

relations and setting b0 = 0, we have

N∑k=−N

ckeikx

√2π

=1

2

N∑k=−N

(ak cos kx+ bk sin kx) + i(ak sin kx− bk cos kx)

=a0

2+

N∑k=1

ak cos kx+ bk sin kx,

and the convergence of SNf to f follows by the convergence in the real-valued case.

Exercise 5.5. It suffices to note that

1

(∫ π

−π

f(x)e−ikx dx

)2

= (〈f, ek〉)2,

where (ek) is the orthonormal system of Exercise 5.4 and to use its completeness.

Page 226: Probability Book

222 Stationary sequences and elements of ergodic theory

Exercise 5.6. From the identity∑N

0 eikz = (ei(N+1)z − 1)/(eiz − 1) we get

N∑k=1

eikz =ei(N+1)z − eiz

eiz − 1. (14.4)

Hence

SNf(x) =N∑

k=−N

1

(∫ π

−π

f(y)e−iky dy)eikx =

N∑k=−N

1

∫ π

−π

f(y)eik(x−y) dy

=1

∫ π

−π

f(y) dy +1

π

∫ π

−π

f(y)Reei(N+1)(x−y) − ei(x−y)

ei(x−y) − 1dy.

Using the fact that (ei(N+1)z− eiz)/(eiz−1) have, still because of (14.4), mean value0 on (−π, π), we get

f(x)− SNfx =1

∫ π

−π

(f(x)− f(y))

[1 + 2Re

ei(N+1)(x−y) − ei(x−y)

ei(x−y) − 1

]dy.

Exercise 5.7. We apply the Parseval identity to the function f(x) = x2, whoseFouries series contains no sinus. It is simple to check, by integration by parts, thata0 = 2π2/3 and that ak = 4k−2 cos kx for k ≥ 1. We have then

1

π

∫ π

−π

x4 dx =2

5π4 =

a20

2+

∞∑k=1

a2k =

4

18π4 +

∞∑k=1

16

k4.

Rearranging terms, we get∑∞

1 k−4 = π4/90.

Exercise 5.8. The polynomials Pn are given by Qn/‖Qn‖2, where Qn are recursivelydefined by Q0 = 1 and

Qn(x) := xn −n−1∑k=0

〈xn, Qk〉〈Qk, Qk〉

Qk(x) = xn −n−1∑k=0

〈xn, Pk〉Pk(x) ∀n ≥ 1.

(a) Since Q0 = 1, P0 = 1/√

2 and Q1 = x− 〈x, P0〉P0 = x, because 〈x, P0〉 = 0. Asa consequence P1(x) =

√3/2x. Since 〈x2, P1〉 = 0, we have also

Q2(x) = x2 − 〈x2, P0〉P0 − 〈x2, P1〉P1 = x2 − 1

3

and this leads, with simple calculations, to P2(x) =√

45/8(x2 − 1/3).(b) Let H be the closure of the vector space spanned by Cn. This space containsall monomials xn, and therefore all polynomials. Since the polynomials are dense

Page 227: Probability Book

Chapter13 223

in C([a, b]), for the sup norm, they are also dense in L2(a, b). It follows that H =L2(a, b). By Proposition 4.13 we conclude that (Cn) is complete.(c) Set

zn :=

√2n+ 1

2

1

2nn!, Pn(x) := zn

dn

dnx(x2 − 1)n

Clearly the polynomial Pn has degree n. So, in order to show that Pn = Pn, wehave to show that Pn is orthogonal to all monomials xk, k = 0, . . . , n− 1, and that‖Pn‖2 = 1. Since Pn has zeros at ±1 with multiplicity n, all its derivatives at ±1with order less than n are zero. Therefore, for k < n we have

〈Pn, xk〉 = zn

[xk d

n−1

dn−1x(x2 − 1)n

]1

−1

− k

∫ 1

−1

xk−1 dn−1

dn−1x(x2 − 1)n dx

= · · ·

= (−1)kk!zn

[dn−k

dn−kx(x2 − 1)n

]1

−1

= 0.

In order to prove that ‖Pn‖2 = 1, still integrating by parts we have

〈Pn, Pn〉 = −z2n

∫ 1

−1

dn−1

dn−1x(x2−1)n d

n+1

dn+1x(x2−1)n dx = · · · = z2

n

∫ 1

−1

(1−x2)n d2n

d2nx(x2−1)n dx.

(14.5)On the other hand∫ 1

−1

(1−x2)n dx = 2n

∫ 1

−1

(1−x2)n−1x2 dx = −2n

∫ 1

−1

(1−x2)n dx+2n

∫ 1

−1

(1−x2)n−1 dx,

so that∫ 1

−1

(1−x2)n dx =2n

2n+ 1

∫ 1

−1

(1−x2)n−1 dx = · · · = (2n)!!

(2n+ 1)!!

∫ 1

−1

(1−x2)0 dx =2(2n)!!

(2n+ 1)!!.

Taking into account that

d2n

d2nx(x2 − 1)n = (2n)! = (2n)!!(2n− 1)!! = 2nn!(2n− 1)!!

from (14.5) we get

〈Pn, Pn〉 =2n+ 1

2

1

22n(n!)2

2(2n)!!

(2n+ 1)!!2nn!(2n− 1)!! = 1.

Chapter 6

Page 228: Probability Book

224 Stationary sequences and elements of ergodic theory

Exercise 6.1. Let us prove the inclusion

(F1 ×F2)×F3 ⊂ F1 × (F2 ×F3),

the proof of the converse one being analogous. We have to show that all productsA×B, with A ∈ F1×F2 and B ∈ F3 belong to F1× (F2×F3). Keeping B fixed,the class of sets A for which this property holds is a σ–algebra that contains theπ–system of measurable rectangles A = A1 × A2 (because A× B = A1 × (A2 × B)and A2 ×B ∈ F2 ×F3), and therefore the whole product σ–algebra F1 ×F2.For all A in the product σ–algebra we have

(µ1 × µ2)× µ3(A) =

∫X1×X2

µ3(Ax1x2) dµ1 × µ2(x1, x2) =

∫X1

∫X2

µ3(Ax1x2) dµ2(x2) dµ1(x1)

=

∫X1

µ2 × µ3(Ax1) dµ1(x1) = µ1 × (µ2 × µ3)(A).

Exercise 6.2. Obviously the cubes belong to×n

1 B(R), and thanks to Lemma 6.9

the same is true for the open sets. It follows that B(Rn) is contained in×n

1 B(R).Fix now A2, . . . , An open sets and consider the class

M :=B ⊂ R : B × A2 × · · · × An ∈ B(Rn)

.

This class contains the open sets (because the product of open sets is open) andis a σ–algebra, so it contains B(R). We have thus proved that all rectangles B1 ×A2 × · · · ×An, with B1 Borel and A2, . . . , An open, belong to B(Rn). all rectanglesB1 ×A2 × · · · ×An, with B1 Borel and A2, . . . , An open, belong to B(Rn). Now wefix B1 Borel, A3, . . . , An open and consider the class

M ′ :=B2 ⊂ R : B1 ×B2 × A3 × · · · × An ∈ B(Rn)

.

It contains, by the previous step, the open sets and it is a σ–algebra, so it containsthe Borel sets. We proved that all rectangles B1 ×B2 ×A3 × · · · ×An, with B1, B2

Borel and A3, . . . , An open, belong to B(Rn). Continuing in this way, in finitelymany steps we obtain that products of Borel sets are Borel, whence the inclusion×n

1 B(R) ⊂ B(Rn) follows.

Exercise 6.3. Assume that A, B ∈ L1; then there exist Borel sets A′, B′ andBorel Lebesgue negligible sets NA, NB with A∆A′ ⊂ NA and B∆B′ ⊂ NB. SinceA′ ×B′ ∈ B(R2), by the previous exercise,

(A×B)∆(A′ ×B′) ⊂ (NA × R) ∪ (R×NB)

and NA×R and R×NB are L 2 negligible, we obtain that A×B ∈ L2. This provesthat L2 contains the generators of L1 ×L1, and therefore the whole σ–algebra. In

Page 229: Probability Book

Chapter13 225

order to show the strict inclusion, we consider the set E = F ×0, where F ⊂ R isnot Lebesgue measurable. Since E is L 2–negligible we have E ∈ L2. On the otherhand, since the 0 section E0 coincides with F , and therefore it does not belong toL1, the set E can’t belong to the product of the two σ–algebras.

Exercise 6.4. Let A be the σ–algebra generated by these sets; since these sets areobviously cylindrical, A is contained in the product σ–algebra. The class of setsB ⊂×n

1 Xi such that B × Xn+1 × Xn+2 × · · · ∈ A is a σ–algebra containing themeasurable rectangles A1 × · · · ×An, and therefore contains the product σ–algebra×n

1 Fi. Therefore A contains the cylindrical sets and, by definition, the wholeproduct σ–algebra.

Exercise 6.5. The sections Ty := (x, z) : (x, y, z) ∈ T are squares with length

side 2√r2 − |y|2 for 0 ≤ |y| ≤ r, hence

L 3(T ) =

∫ r

−r

L 2(Ty) dy = 8

∫ r

0

(r2 − y2) dy = 8(r3 − 1

3r3) =

16

3r3.

Exercise 6.6. For x ∈ Rn (with n ≥ 3) let

r := (x21 + x2

2)1/2, Ar :=

(x3, . . . , xn) : (x2

3 + · · ·+ x2n) < 1− r2

.

Then, using polar coordinates we get

ωn =

∫r<1

L n−2(Ar) dx1dx2 = 2πωn−2

∫ 1

0

r(1− r2)(n−2)/2 dr =2π

nωn−2.

Therefore

ω2k =2k−1πk−1

2k(2k − 2) · · · 4ω2 =

πk

k!

and an analogous argument gives ω2k+1 = 2k+1πk/(2k + 1)!!.

Exercise 6.7. In order to show that ωn = πn/2

Γ(n2

+1)we show that the right hand side

satisfies the same recursion formula of the previous exercise. Since (thanks to theidentities Γ(1) = 1, Γ(1/2) =

√π) the formula holds when n = 1, 2, this will prove

that the identity holds for all n. For n ≥ 2 we have

πn/2

Γ(n2

+ 1)=π · π(n−2)/2

n2Γ(n

2)

=2π

n

π(n−2)/2

Γ( (n−2)2

+ 1).

Exercise 6.8. We know, by Exercise 2.4, that there exist a λ–negligible set N ∈F ×G and a F ×G –measurable function F : X×Y → [0,+∞] such that F 6= Fis contained in N . By applying the Fubini–Tonelli theorem to 1N we obtain that

Page 230: Probability Book

226 Stationary sequences and elements of ergodic theory

Nx is ν–negligible in Y for µ–a.e. x ∈ X. Since F (x, ·) 6= F (x, ·) ⊂ Nx, stillExercise 2.4 gives that F (x, ·) is ν–measurable for µ–a.e. x ∈ X. This provesstatement (i). Since, still for µ–a.e. x ∈ X, the integral on Y (with respect to ν)of F (x, ·) coincides with the integral of F (x, ·), statements (ii) and (iii) follow byapplying the Fubini–Tonelli theorem to F .

Exercise 6.9. Indeed, µ(Dy) = µ(y) = 0 for all y ∈ Y , so that∫

Yµ(Dy) dν(y) =

0. On the other hand, ν(Dx) = ν(x) = 1 for all x ∈ X, so that∫

Xν(Dx) dµ(x) =

1.

Exercise 6.10. Let (h(k)) be a subsequence such that∑

k ‖fh(k)−f‖1 is convergent.Then the Fubini–Tonelli theorem gives∫

X

( ∞∑k=0

∫Y

|fh(k)(x, y)−f(x, y)| dν(y))dµ(x) =

∞∑k=0

∫X×Y

|fh(k)(x, y)−f(x, y)| dµ×ν <∞.

It follows that∑

k ‖fh(k)(x, ·) − f(x, ·)‖L1(ν) is finite for µ–a.e. x ∈ X, and for anysuch x the functions fh(k)(x, ·) converge to f in L1(ν). Choosing Y = y andν = δy, to provide a counterexample it is sufficient to consider any example (seeRemark 3.7) of a sequence converging in L1 but not µ–almost everywhere.

Exercise 6.11. It suffices to apply (7.4) to |h| to show that∫|h| dfµ is finite if and

only if∫|h|f dµ is finite.

Exercise 6.12. We prove the property for the sup, the property for the inf beinganalogous. If A = B1 ∪B2 with B1 ∈ F and B2 ∈ F disjoint, we have

fµ(B1) + gµ(B2) =

∫B1

f dµ+

∫B2

g dµ ≤∫

B1

f ∨ g dµ+

∫B2

f ∨ g dµ

=

∫A

f ∨ g dµ.

By the arbitrariness of this decomposition, this proves that [(fµ) ∨ (gµ)](A) ≤(f ∨ g)µ(A). The converse inequality can be obtained noticing that, in the chain ofequalities-inequality above, the inequality becomes an equality if we choose B1 =A ∩ f ≥ g and B2 = A ∩ f > g.Exercise 6.13. It is easy to check that µ ≤ µi (respectively, µ ≥ µi) for all i ∈ I,and that any measure ν with this property is less than µ (resp. greater than µ):just write ν(B) =

∑k ν(Bk) ≤

∑k µi(k)(Bk) (resp. ≥

∑k µi(k)(Bk). So, it remains

to show that µ and µ are σ-additive. For any map i : N → I, A1, A2 ∈ F disjointand any countable F–measurable partition of A1 ∪ A2 we have

∞∑k=0

µi(k)(Bk) =∞∑

k=0

µi(k)(Bk ∩ A1) +∞∑

k=0

µi(k)(Bk ∩ A2).

Page 231: Probability Book

Chapter13 227

Estimating the right hand side from below with µ(A1)+µ(A2) we get (because (Bk)is arbitrary) that µ is superadditive, i.e. µ(A1 ∪ A2) ≥ µ(A1) + µ(A2). With asimilar argument one can prove not only that µ is subadditive, but also that µ isσ–subadditive (it suffices to consider a countable F–measurable family, instead of2 sets).

Now, let us prove that µ is subadditive and µ is superadditive. Let A1, A2 ∈ Fbe disjoint and let B1

k, B2k be countable F–measurable partitions of A1 and A2

respectively. If i1, i2 : N→ I we define i(2k) = i1(k), B2k = B1k and i(2n+1) = i2(n),

B2k+1 = B2k, so that

µ(A1 ∪ A2) ≤∞∑

k=0

µi(k)(Bk) =∞∑

k=0

µi1(k)(B1k) +

∞∑k=0

µi2(k)(B2k).

By the arbitrariness ofB1k, B

2k, i1 and i2 we conclude that µ(A1∪A2) ≤ µ(A1)+µ(A2).

With a similar argument one can prove that µ is even σ–subadditive (one has to usea bijection between N× N and N) and that µ is superadditive.

Exercise 6.14. If for all ε > 0 there exists δ > 0 satisfying

A ∈ F , µ(A) < δ =⇒ ν(A) < ε

then ν µ: indeed, if µ(A) = 0 the implication above holds for all ε > 0, henceν(A) = 0. If ν is finite, to prove the converse we argue by contradiction. Assumethat, for some ε0, we can find sets An ∈ F with µ(An) < 2−n and ν(An) ≥ ε0.Then, by the Borel–Cantelli lemma the set A := lim supnAn is µ–negligible. On theother hand, we have

ν( ∞⋃

m=n

Am

)≥ ν(An) ≥ ε0

and therefore (here we use the assumption that ν is finite) ν(A) ≥ ε0, contradictingthe absolute continuity of ν with respect to µ.

Exercise 6.15. Let B ∈ F be a µ–negligible set where ν is concentrated. Thenν(E) = ν(E ∩ B) for all E ∈ F . But, by the absolute continuity of ν with respectto µ, we have ν(E ∩B) = 0 because E ∩B ⊂ B is µ–negligible.

Exercise 6.16. Let B ∈ F be a ν–negligible set where σ is concentrated. Then

σ(E) = σ(E ∩B) ≤ µ(E ∩B) + ν(E ∩B) ≤ µ(E) ∀E ∈ F ,

where we used the fact that ν(E ∩B) = 0 because E ∩B ⊂ B is ν–negligible.

Exercise 6.17. Let ν = ν+ − ν− and let ν+ = ν+a + ν+s , ν− = ν−a + ν−s be

the Lebesgue decompositions with respect to µ of ν+ and ν− respectively. Then,

Page 232: Probability Book

228 Stationary sequences and elements of ergodic theory

νa := ν+a − ν−a and νs := ν+

s − ν−s provide a decomposition ν = νa + νs with νa, νs

signed, |νa| µ and |νs| ⊥ µ.If µ is signed and A provides a Hahn decomposition of µ (i.e. µ+(E) = µ(E ∩A)

and µ−(E) = −µ(E ∩ Ac)), we repeat the decomposition above in A, relative to νand µ+, and in B = Ac, relative to ν and µ−. Denoting by νA

a + νAs and νB

a + νBs

the two decompositions obtained,

νa(E) := νAa (E ∩ A) + νB

a (E ∩B), νs(E) := νAs (E ∩ A) + νB

s (E ∩B)

provides the desired decomposition ν = νa + νs with |νa| |µ| and |νs| ⊥ |µ|.The uniqueness of these decompositions can be proved with the same argument

used in the case of nonnegative measures.

Exercise 6.18. Let B ∈ F and let (Bi) be a F–measurable partition of B; since

∞∑i=0

|fµ(Bi)| =∞∑i=0

∣∣∣∣∫Bi

f dµ

∣∣∣∣ ≤ ∞∑i=0

∫Bi

|f | dµ =

∫B

|f | dµ,

we obtain that |fµ|(B) ≤ |f |µ(B). To prove the converse inequality fix ε > 0 anddefine Bi = B ∩ f−1(Ii), where Ii = ε[i, i+ 1), i ∈ Z. Since the oscillation of |f − εi|and ||f | − ε|i|| in f−i(Ii) are less than ε, we get∣∣∣∣∫

Bi

f dµ− εiµ(Bi)

∣∣∣∣ ≤ εµ(Bi),

∣∣∣∣∫Bi

|f | dµ− ε|i|µ(Bi)

∣∣∣∣ ≤ εµ(Bi),

hence ∣∣∣∣∫Bi

|f | dµ−∣∣∫

Bi

f dµ∣∣∣∣∣∣ ≤ 2εµ(Bi).

It follows that∑i∈Z

|fµ(Bi)| =∑i∈Z

∣∣∣∣∫Bi

f dµ

∣∣∣∣ ≥∑i∈Z

∫Bi

|f | dµ− 2εµ(Bi)

=

∫B

|f | dµ− 2εµ(B).

Since ε is arbitrary the converse inequality follows.

Exercise 6.19. If x < 0 or x ≥ 1 all repartition functions are respectively equal to0 or 1, so we need to consider only the case x ∈ [0, 1). The repartition function of1[0,1]L

1 obviously is equal to x, while

µh((−∞, x]) =#i ∈ [1, h] : i ≤ hx

h=

[hx]

h,

Page 233: Probability Book

Chapter13 229

where [s] denotes the integer part of s. Using the inequalities s − 1 < [s] ≤ s withs = hx we obtain that µh((−∞, x]) → x.

Exercise 6.20. The argument is similar to the one used in the proof of Theorem 6.27:if y < x < y′ and y, y′ ∈ D we have

F (y) = limh→∞

Fh(y) ≤ lim infh→∞

Fh(x) ≤ lim suph→∞

Fh(x) ≤ limh→∞

Fh(y′) = F (y′).

Letting y ↑ x and y′ ↓ x, we conclude.

Exercise 6.21. We define a−h2 = µ((−∞,−h]) and, for −h2 < i ≤ h2, ai = µ((i−1)/h, i/h]). Let us denote by µh the measure obtained in this way. If x ∈ (−h, h]and i is the smallest integer in (−h2, h2] such that x ≤ i/h, we have

µ(−∞, x− 1

h]) ≤ µ((−∞,

i− 1

h]) =

i−1∑j=−h2

ai ≤ µh((−∞, x]).

If x is not an atom of µ, this proves that lim infh µh((−∞, x]) ≥ µ((−∞, x]). Anal-ogously

µ(−∞, x+1

h]) ≥ µ((−∞,

i

h]) =

i∑j=−h2

ai ≥ µh((−∞, x]).

If x is not an atom of µ, this proves that lim suph µh((−∞, x]) ≤ µ((−∞, x]).

Exercise 6.22. Let us assume that (6.27) holds. If Fi(x) → 1 as x→ +∞ uniformlyin i ∈ I, for any ε > 0 we can find x such that 1 − Fi(x) < ε/2 for all i ∈ I.Analogously, we can find y < x such that Fi(y) < ε/2 for all i ∈ I. Then, the intervalI = (y, x] satisfies µi(I) > 1− ε for all i ∈ I, because Ic = (−∞, y] ∪ (x,+∞).

Exercise 6.23. If µ is the weak limit and ε > 0 is given, let us choose an integern ≥ 1 such that µ([1−n, n−1]) > 1−ε and points x ∈ (−n, 1−n) and y ∈ (n−1, n)where the repartition functions of µh are converging to the repartition function ofµ. Then, since µ((∞, x]) + 1 − µ((−∞, y]) = µ(R \ (x, y)) < ε, there exists nε ∈ N

such that supn≥nεµn((∞, x])+ 1−µn((−∞, y]) < ε. Let now x′ and y′ be satisfying

µn((∞, x′]) + 1− µn((−∞, y′]) < ε ∀n = 0, . . . , nε − 1.

Then, the interval I = [minx, x′,maxy, y′] satisfies infn µn(I) > 1− ε.

Exercise 6.24.

(a) limh

∫Rg dµh =

∫Rg dµ ∀g ∈ Cb(R) (that is, (6.28));

(b) limh

∫Rg dµh =

∫Rg dµ ∀g ∈ Cc(R);

(c) Fh converge to F on all points where F is continuous;

Page 234: Probability Book

230 Stationary sequences and elements of ergodic theory

(d) Fh converge to F on a dense subset of R;

(e) limh µh(R) = µ(R);

(f) (µh) is tight.

We consider the functions ρh(x) := ρ(x + h), where ρ(x) = (2π)−1/2e−x2/2 is theGaussian, and µh = ρhλ (λ being the Lebesgue measure), µ = 0. In this case (c),(d), do not hold, because Fh(x) → 1 6= 0 = F (x) for all x ∈ R, (e) does not holdand (b) holds.a⇒ b, e. This is easy, because Cc(R) ⊂ Cb(R) and 1R ∈ Cb(R).a⇒ c. This follows by second part of the proof of Theorem 6.28.d⇔ c. This is Exercise 6.20.b ∧ e ⇒ c. This follows by the same argument used in the proof of second part ofTheorem 6.28: the sequence (gk) monotonically convergent to 1A can be chosen inCc(R), and this shows that lim infh µh(A) ≥ µ(A) for all A ⊂ R open. Using (e) andpassing to the complementary sets, we obtain lim suph µh(C) ≤ µ(C) for all C ⊂ R

closed.d⇒ f . This follows by the same argument used in the solution of Exercise 6.23.d ∧ f ⇒ e. For all x ∈ D, with D dense, we have limh µh((−∞, x]) = µ((−∞, x]).Since µh((−∞, x]) → µh(R) as x→ +∞ uniformly in h, we can pass to the limit asx ∈ D → +∞ to obtain limh µh(R) = limx→+∞ µ((−∞, x]) = µ(R).d ∧ f ⇒ a. This follows by the same argument used in the first part of the proof ofTheorem 6.28, choosing the points ti in the partitions to be in the dense set whereconvergence occurs.

Exercise 6.23. Set

g(ξ) :=1√

2πσ2

∫R

eiξxe−x2/(2σ2) dx.

Notice that g(0) = 1, and that differentiation theorems under the integral sign (2)

and an integration by parts give

g′(ξ) =1√

2πσ2

∫R

ieiξx(xe−x2/(2σ2)) dx =σ2

√2πσ2

∫R

id

dxeiξxe−x2/(2σ2) dx

= − ξσ2

√2πσ2

∫R

eiξxe−x2/(2σ2) dx.

Therefore g satisfies the linear differential equation g′(ξ) = −σ2ξg(ξ), whose generalsolution is g(ξ) = ce−σ2ξ2/2. Taking into account that g(0) = 1, c = 1.

(2)In this case, the application of the theorem is justified by the fact that supξ∈I | ddξ e

iξxe−x2/(2σ2)|is Lebesgue integrable for all bounded intervals I

Page 235: Probability Book

Chapter13 231

Exercise 6.24. Let us approximate µ by µn = 1(−n,n)µ; using the inequality

|eiξx − eiηx| ≤ |x||ξ − η| x, ξ, η ∈ R

we obtain that

|µn(ξ)− µn(η)| ≤ |ξ − η|∫R

|x| dµn(x) ≤ n|ξ − η|,

therefore µn is uniformly continuous. Since |µn(ξ)− µ(ξ)| ≤ µ(R \ [−n, n]), we havethat µn → µ uniformly as n→∞, therefore µ is uniformly continuous (indeed, givenε > 0, find n such that sup |µn− µ| < ε/2 and δ = ε/(2n) to obtain |µn(ξ)− µn(η)| ≤ε/2 whenever |ξ − η| < δ, and then |µ(ξ)− µ(η)| < ε).

Exercise 6.25. Obviously |µ(ξ0)| = 1, and we set c = µ(ξ0) = eiθ for some θ ∈ R.Since ∫

R

|1− ceixξ0|2 dµ(x) = 2− cc− cc = 0,

we obtain that eixξ0 = c for µ–a.e. x ∈ R. This implies that xξ0 − θ ∈ 2πZ forµ–a.e. x ∈ R, so that µ is concentrated on the set of points (2nπ + θ)/ξ0n∈N, andit suffices to set x0 = θ/ξ0 to obtain the stated representation of µ as a sum of Diracmasses.Obviously |µ| ≡ 1 if µ is a Dirac mass. Conversely, if |µ| ≡ 1, we find x0 withµ(x0) > 0 and ξ0, ξ

′0 ∈ R \ 0 with ξ0/ξ

′0 /∈ Q to obtain that µ is concentrated

on the set 2nπ/ξ0 + x0n∈N and on the set 2nπ/ξ′0 + x0n∈N. By our choice of ξ0and ξ′0, the intersection of the two sets is the singleton x0, and this proves thatµ = δx0 .

Chapter 7

Exercise 7.1. Let C > 0 be such that |H(x)−H(y)| ≤ C|x−y| for all x, y ∈ R. Letε > 0 and let δ > 0 be such that

∑i |f(bi)− f(ai)| < ε/C whenever

∑i(bi−ai) < δ.

We have∑

i |H(f(bi))−H(f(ai))| ≤ C∑

i |f(bi)− f(ai)| whenever∑

i(bi− ai) < δ.In particular, choosing f(t) = t, we see that Lipschitz functions are absolutelycontinuous.

Exercise 7.2. We assume that both L 1(E) > 0 and L 1(R \ E) > 0. Let a ∈ R

be such that L 1((a,∞) ∩ E) > 0 and L 1((a,∞) \ E) > 0, and define F (t) =L 1(E ∩ (a, t)). By our choice of a, F (t) and (t− a)− F (t) are not identically 0 in(a,+∞).

If t > a is a rarefaction point of E, we have

F ′+(t) = lim

h↓0

F (t+ h)− F (t)

h= lim

h↓0

L 1((t, t+ h) ∩ E)

h= 0.

Page 236: Probability Book

232 Stationary sequences and elements of ergodic theory

Analogously, F ′−(t) = 0 and we find that F ′ is equal to 0 at all rarefaction points.

A similar argument proves that F ′ = 1 at all density points. Let now t0 ∈ (a,∞)where 0 < F (t0) < (t0 − a) − F (t0) and apply the mean value theorem to obtaint′0 ∈ (a, t0) such that

F (t0) = (t0 − a)F ′(t′0).

By our choice of t0 it follows that F ′(t′0) ∈ (0, 1), a contradiction (because either t′0is a density point or a rarefaction point).

Exercise 7.3. Assume first that ϕ is continuous and bounded. Let H(z) :=∫ z

f(a)ϕ(y) dy. By the (classical) fundamental theorem of the integral calculus, H is

differentiable and H ′(z) = ϕ(z) for all z ∈ f(I). By the chain rule and Exerise 7.1,the function

F (t) :=

∫ f(t)

f(a)

ϕ(y) dy = H(f(t))

is absolutely continuous and it has derivative equal to H ′(f(t))f ′(t) = ϕ(f(t))f ′(t)at all points t where f is differentiable. On the other hand, still by the fundamentaltheorem of the integral calculus, the function

G(t) :=

∫ t

a

ϕ(f(x))f ′(x) dx

has derivative equal to (ϕ f)f ′ L 1-a.e. in [a, b]. Since both F and G vanish att = a, they coincide.

By dominated convergence theorem, the identity of the two functions persists ifϕ = 1A, with A open (because 1A is the pointwise limit of continuous functions).By applying Dynkin theorem to the class M of the sets E ∈ B(f(I)) such that∫ f(t)

f(a)1E(y) dy =

∫ t

a1E(f(x))f ′(x) dx we obtain that the formula holds for all ϕ =

1E with E Borel. Eventually we obtain it for simple functions and, by uniformapproximation, for bounded Borel functions.

Exercise 7.4. Choosing g = 1N , by Exercise 7.3 we get∫ b

a1f−1(N)f

′ dx = 0, because1N f = 1f−1(N). Let h+ and h− be respectively the positive and negative part off ′1f−1(N). Since ∫ b

a

h+ dx−∫ b

a

h− dx =

∫ b

a

f ′1f−1(N) dx = 0

for all intervals (a, b), it follows that h+ = h− L 1-a.e. in R. As a consequence,f ′ = 0 L 1-a.e. in f−1(N).

Chapter 8

Page 237: Probability Book

Chapter13 233

Exercise 8.1. Both are measures in (Z,H ). If B ∈ H then g f#µ(B) =µ(f−1(g−1(B))), because (g f)−1 = f−1 g−1. On the other hand,

g#(f#µ)(B) = f#µ(g−1(B)) = µ(f−1(g−1(B))).

Exercise 8.2 Let f : 0, 1N → [0, 1] be the map associating to a sequence (ai) ⊂0, 1 the real number

∑i ai2

−i−1 ∈ [0, 1]. Show that

f#

( ∞

×i=0

(1

2δ0 +

1

2δ1)

)= 1[0,1]L

1.

Hint: compute the two measures on intervals whose extreme points are dyadic num-bers.

Exercise 8.3. Let A ⊂ R be a dense open set whose complement C has strictlypositive Lebesgue measure (Exercise 1.8), and let

ϕ(t) := min 1, dist(t, C) t ∈ R.

By construction the function ϕ is continuous, nonnegative, bounded by 1, and van-ishes precisely on C. Then, set

F (t) :=

∫ t

0ϕ(s) ds if t ≥ 0;

−∫ 0

tϕ(s) ds if t < 0.

We have F ′ = ϕ, so that F ∈ C1 and its critical set CF = C has positive Lebesguemeasure. It follows that F#L 1 is not absolutely continuous with respect to L 1.

Finally, since∫ b

aϕdt > 0 whenever a < b (because A∩ (a, b) 6= ∅) we obtain that F

is strictly increasing.

Exercise 8.4 ? ? Remove the injectivity assumption in Theorem 8.4, showing that

F#(1UL n) =∑

x∈F−1(y)\CF

1

|JF |(x)1F (U\CF )L

n.

for any C1 function F : U → Rn with Lebesgue negligible critical set. Hint: use thelocal invertibility theorem and a suitable decomposition of U \ CF to reduce to thecase of an injective function.

Chapter 9

Exercise 9.1. We use identity E(X) =∫∞

0P(X > t) dt and the equality

X > t = X > n t ∈ (n, n+ 1), n ∈ N

Page 238: Probability Book

234 Stationary sequences and elements of ergodic theory

to write the integral as the series∑∞

0 P(X > n).Exercise 9.2. (i), (ii), (ii) are obvious. (ii) and (iii) imply that (X, Y ) 7→ V [X, Y ]is bilinear and this gives immediately (iv). Finally, (v) can be proved just in thecase of centered variables X. Since AX is centered we have

(V [AX, Y ])ij = E( n∑

k=1

aikXk(Yj−E(Yj))

=n∑

k=1

aikE(Xk(Yj−E(Yj)) = (AV [X, Y ])ij.

Exercise 9.3. (i) We know from the previous exercise that V [AX] = AV (X)At,hence V [AX] = AV (X)A−1, because A is orthogonal. It follows that V [X] andV [AX] have the same trace. (ii) is an obvious consequence of the analogous prop-erties of V [X, Y ].

Exercise 9.4. Without loss of generality we assume that X and Y are centered.For all λ > 0 we have

2|E(〈X, Y 〉)| ≤ 2E(|〈X, Y 〉|) ≤ λE(|X|2) +1

λE(|Y |2)

and minimizing with respect to λ we get |E(〈X, Y 〉)| ≤ σ(X)σ(Y ). Since the traceof V [X, Y ] equals E(〈X, Y 〉), this proves that |R[X, Y ]| ≤ 1. If equality holds andboth σ(X) and σ(Y ) are positive, we will prove that either X/σ(X) = Y/Σ(X) orX/σ(X) = −Y/Σ(X). Possibly replacing X and Y by X/σ(X) and Y/σ(Y ) weshall assume that σ(X) = σ(Y ) = 1. Assuming R[X, Y ] = 1 we have

E(|X − Y |2) = E(|X|2 + |Y 2| − 2〈X, Y 〉) = σ2(X) + σ2(Y )− 2E(〈X, Y 〉) = 0.

The case R[X, Y ] = −1 is analogous.

Exercise 9.5. (i) Let Z = X − E(X), obviously in W . We obtain that Z is theorthogonal projection of X on W by the simple identity E((X − Z) · (Y − Z)) =E(X) · E(Y − Z) = 0 for all Y ∈ W . (ii) is obvious.

Exercise 9.6. If X and Y are identically distributed, then f(X) and f(Y ) areidentically distributed (their common law is f#µ, where µ is the common law ofX and Y ), and therefore have the same expectation. Conversely, it suffices to takecharacteristic functions f = 1A to obtain P(X ∈ A) = P(Y ∈ A) for all A ∈ A ′.

Exercise 9.7. Clearly it suffices to show that X is equal to a constant almostsurely. If this is not the case, we can find t ∈ R such that a := P(X < t) > 0 andb := P(X ≥ t) > 0. Then ...

Show the following refinement of Jensen’s inequality: if f : R → R is strictlyconvex, and E(f(X)) = f(E(X)), then X = E(X) almost surely.

Exercise 9.8. For any f ∈ Cb(E) the random variables f(Xn) and f(Yn) are iden-tically distributed, so that taking expectations the dominated convergence theorem

Page 239: Probability Book

Chapter13 235

gives E(f(X)) = E(f(Y )). This means that∫f dµX =

∫f dµY , where µX and µY

are the laws of X and Y respectively. Since f is arbitrary, this proves that µX = µY .

Exercise 9.9

Exercise 9.10

Chapter 10

Exercise 10.16. Note that X has law µ if and only if P(X > b) = e−b/λ

(a) =⇒ (b) can be immediately obtained by calculating the probabilities on bothsides; indeed

e−(a+b)/λ

e−a/λ= e−b/λ.

(b) =⇒ (c). Condition (b) implies that that the law of X − t with respect to Qt isequal to the law of X, so the result follows by setting λ = E(X).(c) =⇒ (a). We know (by definition) that

∫X dP =

∫∞0P(X > s) ds. The

formula in (c) can be written in this form

λ =

∫ ∞

0

Qt(X − t > s) ds

but ∫ ∞

0

Qt(X − t > s) ds =

∫ ∞

0

P(X > s+ t | X > t) ds

=

∫∞0P(X > s+ t) dsPX > t

=

∫∞tP(X > s) dsPX > t

.

This means that the function ϕ(t) = P(X > t) satisfies the integral equation

λϕ(t) =

∫ ∞

t

ϕ(s) ds

which, coupled with the condition ϕ(0) = 1, implies that ϕ(t) = e−t/λ, and thenthat X has exponential law.

Chapter 11

Exercise 11.1. Let g(t) := min1, t and let Yn := |Xn − X|. Since E(g(Yn)) =E(min1, |Xn −X|) → 0, then by Markov inequality

P(Yn > δ) = P(g(Yn) > g(δ)) ≤ E(g(Yn))

g(δ),

Page 240: Probability Book

236 Stationary sequences and elements of ergodic theory

so that Xn → X in probability. If Xn → X in probability and δ ∈ (0, 1), then

E(g(Yn)) =

∫Yn>δ

g(Yn)dP+

∫Yn≤δ

g(Yn)dP ≤

≤ PYn > δ+ δ

(we used g ≤ 1 to estimated the first integral and g(t) ≤ t to estimate the secondone), that is less than 2δ for n sufficiently large.

Exercise 11.2. If ϕ(Xn) do not converge in probability to ϕ(X), there exist ε > 0and a subsequence n(k) such that d(ϕ(Xn(k)), ϕ(X)) > ε for all k. At the sametime, since Xn(k) → X in probability, by Theorem 11.2(i) there exists a subse-quence n(k(l)) of n(k) such that (Xn(k(l))) converges to X almost surely, henceϕ(Xn(k(l))) converge to ϕ(X) almost surely; by applying Theorem 11.2 again we getthat ϕ(Xn(k(l))) converge to ϕ(X) in probability, leading to a contradiction.

Exercise 11.4. Given a Cauchy sequence (Xn) ⊂ X (Ω), it suffices to prove thatany subsequence (Xn(k)) satisfying

∞∑k=0

d(Xn(k+1), Xn(k)) <∞

converges to some X ∈ X (Ω), relative to d. Setting δk = d(Xn(k+1), Xn(k)), to showthe convergence of (Xn(k)) we can use the inequalities (see (11.5))

P(|Xn(k+1) −Xn(k)| > 2δk) < 2δk

to obtain the convergence of the series∑

k P(|Xn(k+1)−Xn(k)| > 2δk). By Lemma 1.9the set

N := lim supk→∞

|Xn(k+1) −Xn(k)| > 2δk

is P–negligible, and since∑

k δk < ∞ it is easily seen that, for all ω ∈ Ω \ N ,(Xn(k)(ω)) converges (because |Xn(k+1)(ω)−Xn(k)(ω)| ≤ 2δk for k ≥ k(ω)). Denotingby X(ω) its limit, Theorem 11.2(i) gives that Xn(k) → X also in probability.

Chapter 12

Exercise 12.1. If E(Xn) = µ for all n the proof given in Theorem 12.4 works (theonly variant is that σ2(Un) ≤ C/n, instead of σ2(Un) = σ2/n). In the general case,we write Xn = Yn + µn, with µn = E(Xn), and the almost sure convergence of Yn

to 0 and of µn to µ give the result. A similar argument provides, when µ ∈ R, theconvergence of Un to µ in L2.

Page 241: Probability Book

Bibliography

[1] L.Carleson: On the convergence and growth of Fourier series. Acta Math.,116 (1966), 135–157.

[2] H.Federer: Geometric Measure Theory. Springer, 1969.

[3] G.B.Folland: A course in abstract harmonic analysis. Studies in AdvancedMathematics, CRC Press, 1995.

[4] W.Rudin: Real and complex analysis. Mc Graw–Hill, 1987.

[5] D.W. Stroock, S.R.S. Varadhan (1997): Multidimensional diffusion pro-cesses. Springer Verlag, second ed.

[6] S.Wagon: The Banach-Tarski paradox. Cambridge University Press, 1985.

[7] K.Yosida: Functional Analysis. Springer, 1980.


Recommended