
ECE 541 - NOTES ON PROBABILITY THEORY AND STOCHASTIC PROCESSES

MAJEED M. HAYAT

These notes are made available to students of the course ECE-541 at the University of New Mexico. They cannot and will not be used for any type of profitable reproduction or sale. Parts of the materials have been extracted from texts and notes by other authors. These include the text of Probability Theory by Y. Chow and H. Teicher, the notes on stochastic analysis by T. G. Kurtz, the text of Real and Complex Analysis by W. Rudin, the notes on measure and integration by A. Beck, the text of Probability and Measure by P. Billingsley, the text of Introduction to Stochastic Processes by P. G. Hoel, S. C. Port, and C. J. Stone, and possibly other sources. In compiling these notes I tried, as much as possible, to make explicit reference to these sources. If any omission is found, it is due to my oversight, for which I seek the authors' forgiveness. I am truly indebted to these outstanding scholars and authors, some of whom were my teachers.

Date: November 15, 2006.


1. Fundamentals of Probability: Set-1

1.1. Experiments. The most fundamental component in probability theory is the notion of a physical (or sometimes imaginary) "experiment," whose "outcome" is revealed when the experiment is completed. Probability theory aims to provide the tools that enable us to assess the likelihood of an outcome or, more generally, the likelihood of a collection of outcomes. Let us consider the following example:

Example 1. Shooting a dart: Consider shooting a single dart at a target (board) represented by the closed unit disc, D, centered at the point (0, 0). We write D = {(x, y) ∈ IR² : x² + y² ≤ 1}. Here, IR denotes the set of real numbers (same as (−∞, ∞)), and IR² is the set of all points in the plane (IR² = IR × IR, where × denotes the Cartesian product of sets). We read the above description of D as "the set of all points (x, y) in IR² such that (or with the property) x² + y² ≤ 1."

1.2. Outcomes and the Sample Space. Now we define what we mean by an outcome: an outcome can be "missing the target," or "miss" for short, in which case the dart misses the board entirely, or it can be the dart's location in the scenario that it hits the board. Note that we have implicitly decided that we do not care where the dart lands whenever it misses the board. (The definition of an outcome is arbitrary and therefore not unique for any experiment; it depends on whatever makes sense to us.) Mathematically, we form what is called the sample space as the set containing all possible outcomes. If we call this set Ω, then for our dart example, Ω = {miss} ∪ D, where the symbol ∪ denotes set union (we say x ∈ A ∪ B if and only if x ∈ A or x ∈ B). We write ω ∈ Ω to denote a specific outcome from the sample space Ω. For example, we may have ω = "miss," ω = (0, 0), or ω = (0.1, 0.2); however, according to our definition of an outcome, ω cannot be (1, 1).
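Since the notes contain no code, a small Python sketch may help make the sample space concrete. The function name and the tuple representation of outcomes are illustrative choices, not part of the notes:

```python
def in_sample_space(omega):
    """Return True if omega is a valid outcome of the dart experiment,
    i.e. a member of the sample space {miss} union D."""
    if omega == "miss":
        return True
    x, y = omega                     # otherwise omega must be a point (x, y)
    return x**2 + y**2 <= 1          # it must lie on the closed unit disc D

print(in_sample_space("miss"))      # True: a miss is an outcome
print(in_sample_space((0.1, 0.2)))  # True: a point on the board is an outcome
print(in_sample_space((1, 1)))      # False: (1, 1) is outside D
```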

1.3. Events. An event is a collection of outcomes, that is, a subset of Ω. Such a subset can be associated with a question that we may ask about the outcome of the experiment and whose answer can be determined after the outcome is revealed. For example, the question "Q1: Did the dart land within 0.5 of the bull's eye?" can be associated with the subset of Ω (or event) given by E = {(x, y) ∈ IR² : x² + y² ≤ 1/4}. We would like to call the set E an event. Now consider the complement of Q1, that is, Q2: "Did the dart not land within 0.5 of the bull's eye?", with which we can associate the event Eᶜ, where the superscript "c" denotes set complementation (relative to Ω). Namely, Eᶜ = Ω \ E, which is the set of outcomes (or points or members) of Ω that are not already in E. The notation "\" represents set difference (or subtraction). In general, for any two sets A and B, A \ B is the set of all elements that are in A but not in B. Note that Eᶜ = {(x, y) ∈ D : x² + y² > 1/4} ∪ {miss}. The point here is that if E is an event, then we would want Eᶜ to qualify as an event as well, since we would like to be able to ask the logical negation of any question. In addition, we would also like to be able to form a logical "or" of any two questions about the experiment outcome. Thus, if E1 and E2 are events, we would like their union to also be an event.

Here is another illustrative example: for each n = 1, 2, . . ., define the subset En = {(x, y) ∈ IR² : x² + y² ≤ 1 − 1/n} and let ⋃_{n=1}^∞ En be their countable union. (Notation: we say ω ∈ ⋃_{n=1}^∞ En if and only if ω ∈ En for some n. Similarly, we say ω ∈ ⋂_{n=1}^∞ En if and only if ω ∈ En for all n.) It is not hard to see (prove it) that ⋃_{n=1}^∞ En = {(x, y) ∈ IR² : x² + y² < 1}, which corresponds to the valid question "did the dart land inside the board?" Thus, we would want this countable union of events to be an event as well.

Example 2. For each n ≥ 1, take An = (−1/n, 1/n). Then ⋂_{n=1}^∞ An = {0} and ⋃_{n=1}^∞ An = (−1, 1). Prove these.

Finally, we should be able to ask whether or not the experiment was conducted; that is, we would like to label the sample space Ω itself as an event. With this (hopefully) motivating introduction, we proceed to formally define what we mean by events.

Definition 1. A collection F of subsets of Ω is called a σ-algebra (read as sigma-algebra) if:

(1) Ω ∈ F;
(2) ⋃_{n=1}^∞ En ∈ F whenever En ∈ F, n = 1, 2, . . .;
(3) Eᶜ ∈ F whenever E ∈ F.

If F is a σ-algebra, then its members are called events. Note that F is a collection of subsets and not a union of subsets; thus, F itself is not a subset of Ω.

Here are some consequences of the above definition (you should prove all of them):

(1) ∅ ∈ F.
(2) ⋂_{n=1}^∞ En ∈ F whenever En ∈ F, n = 1, 2, . . .. Here, the countable intersection ⋂_{n=1}^∞ En is defined as follows: ω ∈ ⋂_{n=1}^∞ En if and only if ω ∈ En for all n.
(3) A \ B ∈ F whenever A, B ∈ F. (First prove that A \ B = A ∩ Bᶜ.)
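On a finite sample space, the defining conditions can be checked mechanically. The following Python sketch (illustrative, not from the notes) does so; note that on a finite collection, closure under pairwise union is equivalent to condition (2):

```python
def is_sigma_algebra(F, omega):
    """Check Definition 1 on a finite sample space: Omega is in F, F is closed
    under complement, and F is closed under union (pairwise unions suffice
    when F is finite)."""
    F = {frozenset(s) for s in F}
    omega = frozenset(omega)
    if omega not in F:
        return False
    if any(omega - a not in F for a in F):
        return False
    return all(a | b in F for a in F for b in F)

omega = {1, 2, 3}
F1 = [set(), {1}, {2, 3}, {1, 2, 3}]   # a sigma-algebra on omega
F2 = [set(), {1}, {1, 2, 3}]           # not one: {1}'s complement {2, 3} is missing
print(is_sigma_algebra(F1, omega))     # True
print(is_sigma_algebra(F2, omega))     # False
```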

Generally, members of any σ-algebra (not necessarily corresponding to a random experiment and a sample space Ω) are called measurable sets. Measurable sets were first introduced in the branch of mathematics called analysis. They were adapted to probability by the great mathematician Kolmogorov. In probability theory, we call measurable subsets of Ω events.

Definition 2. Let Ω be a sample space and let F be a σ-algebra of events. We call

the pair (Ω,F) a measurable space.

Definition 3. A collection D of subsets of Ω is called a sub-σ-algebra of F if

(1) D ⊂ F (this means that if A ∈ D, then automatically A ∈ F);
(2) D is itself a σ-algebra.

Example 3. {∅, Ω} is a sub-σ-algebra of any σ-algebra.

Example 4. The power set of Ω, which is the set of all subsets of Ω, is a σ-algebra. In fact, it is a maximal σ-algebra in the sense that it contains any other σ-algebra. The power set of a set Ω is often denoted by 2^Ω.

Interpretation: Once again, we emphasize that it is meaningful to think of a σ-algebra as the collection of all valid questions that one may ask about an experiment. The collection has to satisfy certain self-consistency rules, dictated by the requirements for a σ-algebra, but what we mean by "valid" is really up to us, as long as the self-consistency rules defining the collection of events are met.

Generation of σ-algebras: Let M be a collection of events (not necessarily a σ-algebra); that is, M ⊂ F. This could be a collection of certain events of interest. For example, in the dart experiment we may define M = {{miss}, {(x, y) ∈ IR² : 1/4 ≤ x² + y² ≤ 1/2}}. The main question now is the following: Can we construct a minimal (or smallest) σ-algebra that contains M? If such a σ-algebra exists, call it F_M, then it must possess the following property: if D is another σ-algebra containing M, then necessarily F_M ⊂ D. Hence F_M is minimal. The following theorem states that there is such a minimal σ-algebra.

Theorem 1. Let M be a collection of events. Then there is a minimal σ-algebra containing M.

Before we prove the theorem, let us look at an example.

Example 5. Suppose that Ω = (−∞, ∞) and M = {(−∞, 1), (0, ∞)}. It is easy to check that the minimal σ-algebra containing M is F_M = {∅, Ω, (−∞, 1), (0, ∞), (0, 1), (−∞, 0], [1, ∞), (−∞, 0] ∪ [1, ∞)}. Explain where each member comes from.
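On a finite sample space, the minimal σ-algebra containing a collection M can be computed by repeatedly closing under complement and union until nothing new appears. The following Python sketch (illustrative names; a finite analogue of Example 5 with two overlapping sets) does exactly that:

```python
def generate_sigma_algebra(M, omega):
    """Smallest sigma-algebra on the finite set omega containing every set in M,
    obtained by iterating closure under complement and (pairwise) union."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(s) for s in M}
    while True:
        new = {omega - a for a in F} | {a | b for a in F for b in F}
        if new <= F:          # nothing new was produced: F is closed
            return F
        F |= new

# finite analogue of Example 5: two overlapping generating sets
omega = {1, 2, 3}
F = generate_sigma_algebra([{1, 2}, {2, 3}], omega)
print(len(F))   # 8 sets, mirroring the 8 members found in Example 5
```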

Proof of Theorem: Let K_M be the collection of all σ-algebras that contain M. We observe that such a collection is not empty, since it contains at least the power set 2^Ω. Let us label each member of K_M by an index α, namely D_α, where α ∈ I for some index set I. Define F_M = ⋂_{α∈I} D_α. We need to show that 1) F_M is a σ-algebra containing M, and 2) F_M is minimal. Note that each D_α contains Ω (since each D_α is a σ-algebra); thus, F_M contains Ω. Next, if A ∈ F_M, then A ∈ D_α for each α ∈ I. Thus, Aᶜ ∈ D_α for each α ∈ I (since each D_α is a σ-algebra), which implies that Aᶜ ∈ ⋂_{α∈I} D_α. Next, suppose that A1, A2, . . . ∈ F_M. Then A1, A2, . . . ∈ D_α for each α ∈ I. Moreover, ⋃_{n=1}^∞ An ∈ D_α for each α ∈ I (again, since each D_α is a σ-algebra), and thus ⋃_{n=1}^∞ An ∈ ⋂_{α∈I} D_α. This completes the proof that F_M is a σ-algebra. Now suppose that F′_M is another σ-algebra containing M; we will show that F_M ⊂ F′_M. First note that since K_M is the collection of all σ-algebras containing M, it must be true that F′_M = D_α* for some α* ∈ I. Now if A ∈ F_M, then necessarily A ∈ D_α* (since D_α* is one of the members of the intersection that defines F_M), and consequently A ∈ F′_M. This establishes that F_M ⊂ F′_M and completes the proof of the theorem. □

Example 6. Let U be the collection of all open sets in IR. Then, according to the above theorem, there exists a minimal σ-algebra containing U. This σ-algebra is called the Borel σ-algebra, B, and its elements are called the Borel subsets of IR. Note that by virtue of set complementation, union, and intersection, B contains all closed sets, half-open intervals, their countable unions and intersections, and so on.

(Reminder: A subset U of IR is called open if for every x ∈ U there exists an open interval centered at x which lies entirely in U. Closed sets are defined as complements of open sets. These definitions extend to IRⁿ in a straightforward manner.)

Restrictions of σ-algebras: Let F be a σ-algebra. For any measurable set U, we define F ∩ U as the collection of all intersections between U and the members of F, that is, F ∩ U = {V ∩ U : V ∈ F}. It is easy to show that F ∩ U is also a σ-algebra; it is called the restriction of F to U.

Example 7. Back to the dart experiment: What is a reasonable σ-algebra for this experiment? Any such σ-algebra should contain all the Borel subsets of D and the miss event, so we can take M = {{miss}} ∪ (B ∩ D). It is easy to check that in this case

(1)   F_M = ((B ∩ D) ∪ {miss}) ∪ (B ∩ D),

where for any σ-algebra F and any measurable set U, we define F ∪ U as the collection of all unions between U and the members of F, that is, F ∪ U = {V ∪ U : V ∈ F}. Note that F ∪ U is not always a σ-algebra (contrary to F ∩ U), but in this example it is, because {miss} is the complement of D.


1.4. Random Variables. Motivation: Recall the dart experiment, and define the following transformation X : Ω → IR on the sample space:

X(ω) = 1 if ω ∈ D, and X(ω) = 0 if ω = miss,

where, as before, D = {(x, y) ∈ IR² : x² + y² ≤ 1}. Consider the collection of outcomes that we can identify if we knew that X fell in the interval (−∞, r). More precisely, we want to identify {ω ∈ Ω : X(ω) ∈ (−∞, r)}, or equivalently, the set X⁻¹((−∞, r)), which is the inverse image of the set (−∞, r) under the transformation (or function) X. It can be checked that (you must work these out carefully):

for r ≤ 0, X⁻¹((−∞, r)) = ∅;
for 0 < r ≤ 1, X⁻¹((−∞, r)) = {miss};
and for r > 1, X⁻¹((−∞, r)) = Ω.

The important thing to note is that in each case X⁻¹((−∞, r)) is an event; that is, X⁻¹((−∞, r)) ∈ F_M, where F_M was defined earlier for this experiment in (1). This is a direct consequence of the way we defined the function X.

Here is another transformation defined on the outcomes of the dart experiment. Define

(2)   Y(ω) = 10 if ω = miss, and Y(ω) = √(x² + y²) if ω = (x, y) ∈ D.

Let us consider the collection of outcomes that correspond to Y < 1/2, which we can write as {ω ∈ Ω : Y(ω) < 1/2} = Y⁻¹((−∞, 1/2)). Note that

Y⁻¹((−∞, 1/2)) = {(x, y) ∈ IR² : x² + y² < 1/4} ∈ F_M.

Moreover, we can also show that

Y⁻¹((−∞, 2)) = D ∈ F_M,
Y⁻¹((−∞, 11)) = D ∪ {miss} = Ω ∈ F_M,
Y⁻¹((−∞, 0)) = ∅ ∈ F_M.

We emphasize again that Y⁻¹((−∞, r)) is always an event; that is, Y⁻¹((−∞, r)) ∈ F_M, where F_M was defined earlier for this experiment in (1). Again, this is a direct consequence of the way we defined the function Y.

Motivated by these examples, we proceed to define what we mean by a random

variable in general.

Definition 4. Let (Ω, F) be a measurable space. A transformation X : Ω → IR is said to be F-measurable if for every r ∈ IR, X⁻¹((−∞, r)) ∈ F. Any measurable X is called a random variable.

Now let X be a random variable and consider the collection of events M = {{ω ∈ Ω : X(ω) ∈ (−∞, r)} : r ∈ IR}, which can also be written more conveniently as {X⁻¹((−∞, r)) : r ∈ IR}. As before, let F_M be the minimal σ-algebra containing M. Then F_M is the "information" that the random variable X conveys about the experiment. From this point on we denote this σ-algebra by σ(X), the σ-algebra generated by the random variable X. In the above example of X, σ(X) = {∅, {miss}, Ω, D}. This concept is demonstrated in the next example. (Also see HW#1 for another example of σ(X).)

Example 8. Back to the dart example: Let M′ := {∅, {miss}, Ω}; then it is easy to check that F_M′ = {∅, {miss}, Ω, D}, which also happens to be a subset of F_M as defined in (1). Moreover, we observe that in fact X⁻¹((−∞, r)) ∈ F_M′ for any r ∈ IR. Intuitively, F_M′, which we also call σ(X), can be identified as the set of all events that the function X can convey about the experiment. In particular, F_M′ consists of precisely those events whose occurrence or nonoccurrence can be determined through our observation of the value of X. In other words, F_M′ is all the "information" that the mapping X can provide about the outcome of the experiment. Note that F_M′ is much smaller than the original σ-algebra F, which was ((B ∩ D) ∪ {miss}) ∪ (B ∩ D). As we have seen before, {∅, {miss}, Ω, D} ⊂ ((B ∩ D) ∪ {miss}) ∪ (B ∩ D). Thus, X can only partially inform us of the true outcome of the experiment; it can only tell us whether we hit the target or missed it, nothing else, which is precisely the information contained in F_M′.


Facts about Measurable Transformations: Let (Ω, F) be a measurable space. The following statements are equivalent:

(1) X is a measurable transformation.
(2) X⁻¹((−∞, r]) ∈ F for all r ∈ IR.
(3) X⁻¹((r, ∞)) ∈ F for all r ∈ IR.
(4) X⁻¹([r, ∞)) ∈ F for all r ∈ IR.
(5) X⁻¹((a, b)) ∈ F for all a ≤ b.
(6) X⁻¹([a, b)) ∈ F for all a ≤ b.
(7) X⁻¹([a, b]) ∈ F for all a ≤ b.
(8) X⁻¹((a, b]) ∈ F for all a ≤ b.
(9) X⁻¹(B) ∈ F for all B ∈ B.

Using (9), we can equivalently define σ(X) as {X⁻¹(B) : B ∈ B}, which can be directly shown to be a σ-algebra (prove it).
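Fact (9) suggests a direct way to compute σ(X) on a finite sample space: collect the preimages of all subsets of the range of X. A Python sketch (the two-point "board" and the names are illustrative) makes the point that X only distinguishes hit from miss:

```python
from itertools import combinations

def sigma_of_X(X, omega):
    """sigma(X) = { X^{-1}(B) : B a subset of the range of X }, computed on a
    finite sample space; by fact (9) this is the sigma-algebra generated by X."""
    values = sorted(set(X[w] for w in omega))
    sigma = set()
    for r in range(len(values) + 1):
        for B in combinations(values, r):
            sigma.add(frozenset(w for w in omega if X[w] in B))
    return sigma

# the dart-example X: 1 on a hit, 0 on a miss (board shrunk to two points)
omega = ["miss", "hit1", "hit2"]
X = {"miss": 0, "hit1": 1, "hit2": 1}
sigma = sigma_of_X(X, omega)
print(len(sigma))   # 4 events: {}, {miss}, {hit1, hit2}, Omega
```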


2. Set 2: Fundamentals of Probability, Continued

2.1. Probability Measure.

Definition 5. Consider the measurable space (Ω, F) and a function P that maps F into IR. P is called a probability measure if

(1) P(E) ≥ 0 for any E ∈ F;
(2) P(Ω) = 1;
(3) if E1, E2, . . . ∈ F and Ei ∩ Ej = ∅ whenever i ≠ j, then P(⋃_{n=1}^∞ En) = Σ_{n=1}^∞ P(En).

The following properties follow directly from the above definition.

Property 1. P(∅) = 0.

Proof. Put E1 = Ω and E2 = E3 = . . . = ∅ in (3) and use (2) to get 1 = P(Ω) = P(Ω ∪ ∅ ∪ ∅ ∪ . . .) = P(Ω) + Σ_{n=2}^∞ P(∅) = 1 + Σ_{n=2}^∞ P(∅). Thus Σ_{n=2}^∞ P(∅) = 0, which implies that P(∅) = 0, since P(∅) ≥ 0 according to (1).

Property 2. If E1, E2, . . . , En ∈ F and Ei ∩ Ej = ∅ whenever i ≠ j, then P(⋃_{i=1}^n Ei) = Σ_{i=1}^n P(Ei).

Proof. Put En+1 = En+2 = . . . = ∅ and the result follows from (3), since P(∅) = 0 (from Property 1).

Property 3. If E1, E2 ∈ F and E1 ⊂ E2, then P(E1) ≤ P(E2).

Proof. Note that E1 ∪ (E2 \ E1) = E2 and E1 ∩ (E2 \ E1) = ∅. Thus, by Property 2 (with n = 2), P(E2) = P(E1) + P(E2 \ E1) ≥ P(E1), since P(E2 \ E1) ≥ 0.
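Properties 1 through 3 can be verified exactly on a finite probability space, where a measure on the power set is determined by its point masses. A Python sketch with exact rational arithmetic (the point masses are illustrative):

```python
from fractions import Fraction

# point masses determine a probability measure on the power set of Omega
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def P(event):
    """Probability of an event, i.e. a subset of Omega = {'a', 'b', 'c'}."""
    return sum(p[w] for w in event)

print(P(set(p)))                                  # P(Omega) = 1
E1, E2 = {"a"}, {"a", "b"}
print(P(E1 | (E2 - E1)) == P(E1) + P(E2 - E1))    # additivity on disjoint sets
print(P(E1) <= P(E2))                             # monotonicity (Property 3)
```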

Property 4. If A1 ⊂ A2 ⊂ A3 ⊂ . . . ∈ F, then

(3)   lim_{n→∞} P(An) = P(⋃_{n=1}^∞ An).


Proof. Put B1 = A1, B2 = A2 \ A1, . . . , Bn = An \ An−1, . . .. Then it is easy to check that ⋃_{n=1}^∞ An = ⋃_{n=1}^∞ Bn and ⋃_{n=1}^m An = ⋃_{n=1}^m Bn for any m ≥ 1, and that Bi ∩ Bj = ∅ whenever i ≠ j. Hence, P(⋃_{i=1}^n Ai) = Σ_{i=1}^n P(Bi) and P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Bi). But ⋃_{i=1}^n Ai = An, so that P(An) = Σ_{i=1}^n P(Bi). Now since Σ_{i=1}^n P(Bi) converges to Σ_{i=1}^∞ P(Bi), we conclude that P(An) converges to Σ_{i=1}^∞ P(Bi), which is equal to P(⋃_{i=1}^∞ Ai).

Property 5. If A1 ⊃ A2 ⊃ A3 ⊃ . . ., then lim_{n→∞} P(An) = P(⋂_{n=1}^∞ An).

Proof. See HW.

Property 6. For any A ∈ F, 0 ≤ P(A) ≤ 1.

The triplet (Ω,F , P ) is called a probability space.

Example 9. Recall the dart experiment. We now define P on (Ω, F_M) as follows. Assign P({miss}) = 0.5, and for any A ∈ B ∩ D, assign P(A) = area(A)/(2π). It is easy to check that P defines a probability measure. (For example, P(Ω) = P(D ∪ {miss}) = P(D) + P({miss}) = area(D)/(2π) + 0.5 = 0.5 + 0.5 = 1. Check the other requirements as an exercise.)
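Under the measure of Example 9, a hit is uniformly distributed on the disc (probability proportional to area), so the measure can be simulated. A Monte Carlo sketch in Python (illustrative names; rejection sampling for the uniform point) estimates P(E) for the event E of Section 1.3, whose exact value is area(E)/(2π) = (π/4)/(2π) = 1/8:

```python
import random

random.seed(0)

def sample_outcome():
    """One outcome of the dart experiment under the measure of Example 9:
    'miss' with probability 0.5, else a uniform point on the unit disc."""
    if random.random() < 0.5:
        return "miss"
    while True:                      # rejection sampling on the disc
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

N = 200_000
hits_within_half = 0
for _ in range(N):
    w = sample_outcome()
    if w != "miss" and w[0] ** 2 + w[1] ** 2 <= 0.25:
        hits_within_half += 1

print(hits_within_half / N)   # close to P(E) = 1/8 = 0.125
```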

2.2. Distributions and distribution functions: Consider a probability space (Ω, F, P), and consider a random variable X defined on it. Up to this point, we have developed a formalism that allows us to ask questions of the form "what is the probability that X assumes a value in a Borel set B?" Symbolically, this is written as P({ω ∈ Ω : X(ω) ∈ B}), or for short, P{X ∈ B}, with the understanding that the set {X ∈ B} is an event (i.e., a member of F). Answering all questions of the above form is tantamount to assigning a number in the interval [0, 1] to every Borel set. Thus, we can think of a mapping from B into [0, 1] whose knowledge will provide an answer to all questions of the form described earlier. We call this mapping the distribution of the random variable X, and we denote it by µ_X. Formally, µ_X : B → [0, 1] according to the rule µ_X(B) = P{X ∈ B}, B ∈ B.


Proposition 1. µX defines a probability measure on the measurable space (IR,B).

Proof. See HW.

Distribution Functions: Recall from your undergraduate probability course that we often associate with each random variable a distribution function, defined as F_X(x) = P{X ≤ x}. This function can also be obtained from the distribution of X, µ_X, by evaluating µ_X at B = (−∞, x], which is a Borel set. That is, F_X(x) := µ_X((−∞, x]). Note that for any x ≤ y, µ_X((x, y]) = F_X(y) − F_X(x).

Property 7. F_X is nondecreasing.

Proof. For x1 ≤ x2, (−∞, x1] ⊂ (−∞, x2], so µ_X((−∞, x1]) ≤ µ_X((−∞, x2]), since µ_X is a probability measure (see Property 3 above).

Property 8. lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.

Proof. Note that (−∞, ∞) = ⋃_{n=1}^∞ (−∞, n], and by Property 4 above, lim_{n→∞} µ_X((−∞, n]) = µ_X(⋃_{n=1}^∞ (−∞, n]) = µ_X((−∞, ∞)) = 1, since µ_X is a probability measure. Thus we have proved that lim_{n→∞} F_X(n) = 1. The same argument can be repeated with the sequence n replaced by any increasing sequence xn → ∞. Thus lim_{n→∞} F_X(xn) = 1 for any such sequence, and consequently lim_{x→∞} F_X(x) = 1.

The proof of the second assertion is left as an exercise.

Property 9. F_X is right continuous, that is, lim_{x↓y} F_X(x) = F_X(y).

Proof. Note that F_X(y) = µ_X((−∞, y]) and (−∞, y] = ⋂_{n=1}^∞ (−∞, y + n⁻¹]. So, by Property 5 above, lim_{n→∞} µ_X((−∞, y + n⁻¹]) = µ_X(⋂_{n=1}^∞ (−∞, y + n⁻¹]) = µ_X((−∞, y]). Thus we have proved that lim_{n→∞} F_X(y + n⁻¹) = F_X(y). In the same fashion, we can generalize the result to obtain lim_{n→∞} F_X(y + xn) = F_X(y) for any sequence for which xn ↓ 0. This completes the proof.

Property 10. F_X has a left limit at every point, that is, lim_{x↑y} F_X(x) exists.


Proof. Note that (−∞, y) = ⋃_{n=1}^∞ (−∞, y − n⁻¹]. So, by Property 4 above, lim_{n→∞} F_X(y − n⁻¹) = lim_{n→∞} µ_X((−∞, y − n⁻¹]) = µ_X(⋃_{n=1}^∞ (−∞, y − n⁻¹]) = µ_X((−∞, y)) =: F_X(y−). In the same fashion, we can generalize the result to obtain lim_{n→∞} F_X(y − xn) = µ_X((−∞, y)) for any sequence xn ↓ 0.

Property 11. F_X has at most countably many discontinuities.

Proof. Let D be the set of discontinuity points of F_X. We first show a simple fact: the jumps (or discontinuity intervals) corresponding to distinct discontinuity points are disjoint. More precisely, pick α, β ∈ D and suppose, without loss of generality, that α < β. Let Iα = (F_X(α−), F_X(α)] and Iβ = (F_X(β−), F_X(β)] represent the discontinuities associated with α and β, respectively. Note that F_X(α) ≤ F_X(β−); this follows from the definition of F_X(β−), the fact that α < β, and the fact that F_X is nondecreasing. Hence Iα and Iβ are disjoint. From this we conclude that the discontinuities (jumps) associated with the points of D form a collection of disjoint intervals. (*)

Next, note that D = ⋃_{n=1}^∞ Dn, where Dn = {x ∈ IR : F_X(x) − F_X(x−) > n⁻¹}. In words, Dn is the set of all discontinuity points whose jumps are greater than n⁻¹. Since the discontinuities corresponding to the points of Dn form a collection of disjoint intervals, the total length of the union of these disjoint intervals cannot exceed 1 (why?). Hence, if we denote the cardinality of Dn by D#n, then n⁻¹ D#n ≤ 1, or D#n ≤ n. Hence D is countable, since it is a countable union of finite sets (this is a fact from elementary set theory).

Note: We could have finished the proof right after (*) by associating the points of D with a disjoint collection of intervals, each containing a rational number. In turn, we can associate the points of D with distinct rational numbers, which proves that D is countable.

Discrete random variables: If the distribution function of a random variable is piecewise constant, then we say that the random variable is discrete. Note that in this case the number of discontinuities is at most countably infinite, and the random variable may assume at most countably many values, say a1, a2, . . .. It is easy to see that for a discrete random variable X, E[X] = Σ_{i=1}^∞ ai P{X = ai} = Σ_{i=1}^∞ ai (F_X(ai) − F_X(ai−)).
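The formula E[X] = Σ ai (F_X(ai) − F_X(ai−)) can be checked numerically: the jump of the distribution function at each value ai recovers P{X = ai}. A Python sketch (the values and probabilities are illustrative; a small epsilon stands in for the left limit):

```python
values = [0, 1, 3]          # the a_i
probs  = [0.2, 0.5, 0.3]    # P{X = a_i}

def F(x):
    """F_X(x) = P(X <= x) for this discrete random variable."""
    return sum(p for v, p in zip(values, probs) if v <= x)

eps = 1e-9                  # F(a - eps) approximates the left limit F_X(a-)
EX = sum(a * (F(a) - F(a - eps)) for a in values)
print(EX)   # equals 0*0.2 + 1*0.5 + 3*0.3, i.e. 1.4 up to floating-point error
```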

Absolutely continuous random variables: If there is a Borel function f_X : IR → [0, ∞) with the property that ∫_{−∞}^∞ f_X(t) dt = 1, such that F_X(x) = ∫_{−∞}^x f_X(t) dt, then we say that X is absolutely continuous. In this case we call f_X the probability density function (pdf) of X. Also note that if F_X is differentiable, then such a density exists and is given by the derivative of F_X.

Example 10. (Problem 3.8 in the textbook) Let X be a uniformly distributed random variable on [0, 2]. Let the function g be as shown in Fig. 1 below. Compute the distribution function of Y = g(X).

[Figure 1: the graph of g(x) (left) and the resulting distribution function F_Y(y) (right), which jumps from 0 to 1/2 at y = 0 and rises linearly to 1 at y = 1.]

Solution: If y < 0, F_Y(y) = 0, since Y is nonnegative. If 0 ≤ y ≤ 1, F_Y(y) = P({X ≤ y/2} ∪ {X > 1 − y/2}) = 0.5y + 0.5. Finally, if y > 1, F_Y(y) = 1, since Y ≤ 1. The graph of F_Y(y) is shown above. Note that F_Y(y) is indeed right continuous everywhere, but it is discontinuous at 0.

In this example the random variable X is absolutely continuous (why?). On the other hand, the random variable Y is not absolutely continuous, because there is no Borel function f_Y such that F_Y(x) = ∫_{−∞}^x f_Y(t) dt. Observe that F_Y has a jump at 0, and we cannot reproduce this jump by integrating a Borel function over it (we would need a "delta function," which is not really a function, let alone a Borel function). Nor is Y a discrete random variable, since Y can assume any value from an uncountable collection of real numbers in [0, 1].
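The solution can be checked by simulation. Reading g off Figure 1 as a tent function on [0, 1] that is 0 on (1, 2] is an assumption on my part, but it reproduces exactly the F_Y computed above (the jump of 1/2 at 0 and the slope 1/2 on [0, 1]):

```python
import random
random.seed(1)

def g(x):
    """The map of Fig. 1, read off as a tent on [0, 1] and 0 afterwards
    (an assumption; it reproduces the F_Y computed in the solution)."""
    if x <= 0.5:
        return 2 * x
    if x <= 1.0:
        return 2 - 2 * x
    return 0.0

N = 200_000
samples = [g(random.uniform(0, 2)) for _ in range(N)]   # X ~ Uniform[0, 2]

def F_Y(y):
    """Empirical distribution function of Y = g(X)."""
    return sum(s <= y for s in samples) / N

print(round(F_Y(0.0), 2))   # jump at 0: about 0.5
print(round(F_Y(0.5), 2))   # about 0.5*0.5 + 0.5 = 0.75
```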


2.3. Expectation. Recall that in an undergraduate probability course one talks about the expectation, average, or mean of a random variable. This is done by carrying out an integration (in the Riemann sense) with respect to a probability density function. It turns out that the definition of an expectation does not require having a probability density function (pdf); it is based on a more-or-less intuitive notion of an average. We will follow this general approach here and then connect it to the usual expectation with respect to a pdf whenever the pdf exists. We begin by introducing the expectation of a nonnegative random variable, and we generalize thereafter.

Consider a nonnegative random variable X, and for each n ≥ 1 define the sum Sn = Σ_{i=1}^∞ (i/2ⁿ) P{i/2ⁿ < X ≤ (i+1)/2ⁿ}. We claim that Sn is nondecreasing in n. If this is the case (to be proven shortly), then we know that Sn is either convergent (to a finite number) or Sn ↑ ∞. In either case, we call the limit of Sn the expectation of X, and symbolically we denote it as E[X]. Thus, we define E[X] = lim_{n→∞} Sn. To see the monotonicity of Sn, we follow Chow and Teicher [1] and observe that

{i/2ⁿ < X ≤ (i+1)/2ⁿ} = {2i/2ⁿ⁺¹ < X ≤ (2i+2)/2ⁿ⁺¹} = {2i/2ⁿ⁺¹ < X ≤ (2i+1)/2ⁿ⁺¹} ∪ {(2i+1)/2ⁿ⁺¹ < X ≤ (2i+2)/2ⁿ⁺¹};

thus,

Sn = Σ_{i=1}^∞ (2i/2ⁿ⁺¹) (P{2i/2ⁿ⁺¹ < X ≤ (2i+1)/2ⁿ⁺¹} + P{(2i+1)/2ⁿ⁺¹ < X ≤ (2i+2)/2ⁿ⁺¹})
≤ (1/2ⁿ⁺¹) P{1/2ⁿ⁺¹ < X ≤ 2/2ⁿ⁺¹} + Σ_{i=1}^∞ (2i/2ⁿ⁺¹) P{2i/2ⁿ⁺¹ < X ≤ (2i+1)/2ⁿ⁺¹} + Σ_{i=1}^∞ ((2i+1)/2ⁿ⁺¹) P{(2i+1)/2ⁿ⁺¹ < X ≤ (2i+2)/2ⁿ⁺¹}
= Σ_{i odd} (i/2ⁿ⁺¹) P{i/2ⁿ⁺¹ < X ≤ (i+1)/2ⁿ⁺¹} + Σ_{i even} (i/2ⁿ⁺¹) P{i/2ⁿ⁺¹ < X ≤ (i+1)/2ⁿ⁺¹}
= Σ_{i=1}^∞ (i/2ⁿ⁺¹) P{i/2ⁿ⁺¹ < X ≤ (i+1)/2ⁿ⁺¹} = S_{n+1}.

If E[X] < ∞, we say that X is integrable.
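The dyadic sums Sn can be computed numerically for a distribution whose interval probabilities are known in closed form. A Python sketch (the choice X ~ Exp(1), with P{a < X ≤ b} = e^(−a) − e^(−b) and E[X] = 1, is mine for illustration; the tail cutoff is a numerical convenience):

```python
import math

def S(n, tail=40.0):
    """Dyadic lower sum S_n for E[X] with X ~ Exp(1); terms beyond `tail`
    contribute less than exp(-40) and are dropped."""
    h = 2.0 ** (-n)
    total, i = 0.0, 1
    while i * h <= tail:
        total += (i * h) * (math.exp(-i * h) - math.exp(-(i + 1) * h))
        i += 1
    return total

for n in (2, 4, 8, 12):
    print(n, S(n))   # nondecreasing in n, approaching E[X] = 1 from below
```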

General Case: If X is any random variable, then we decompose it into two parts: a positive part and a negative part. More precisely, let X⁺ = max(0, X) and X⁻ = max(0, −X); then clearly X = X⁺ − X⁻, and both X⁺ and X⁻ are nonnegative. Hence, we use our definition of expectation for a nonnegative random variable to define E[X] = E[X⁺] − E[X⁻] whenever E[X⁺] < ∞ or E[X⁻] < ∞, or both. In the cases E[X⁺] < ∞, E[X⁻] = ∞ and E[X⁻] < ∞, E[X⁺] = ∞, we define E[X] = −∞ and E[X] = ∞, respectively. Finally, E[X] is not defined when E[X⁺] = E[X⁻] = ∞. Also, since |X| = X⁺ + X⁻, E[X] is finite if and only if E[|X|] < ∞.

Special Case: Consider any event E and let X = I_E, a binary random variable. (I_E(ω) = 1 if ω ∈ E and I_E(ω) = 0 otherwise; I_E is called the indicator function of the event E.) Then E[X] = P(E).

Proof. Note that in this case

(4)   Sn = ((2ⁿ − 1)/2ⁿ) P{(2ⁿ − 1)/2ⁿ < X ≤ 1},

and the second factor is simply P(E). Thus Sn → P(E).

When P(E) = 1, we say that X = 1 with probability one, or almost surely. In general, if X is equal to a constant c almost surely, then E[X] = c.

Another Special Case: Consider a random variable that takes only finitely many values, that is, X = Σ_{i=1}^n ai I_{Ai}, where Ai, i = 1, . . . , n, are events. It is straightforward to show (prove it, following the approach of the previous case) that E[X] = Σ_{i=1}^n ai P(Ai).

Notation and Terminology: E[X] is also written as ∫_Ω X(ω) P(dω), which is called the Lebesgue integral of X with respect to the probability measure P. Often, cumbersome notation is avoided by writing ∫_Ω X P(dω) or simply ∫_Ω X dP.

Linearity of Expectation: The expectation is linear, that is, E[aX + bY ] = aE[X] + bE[Y ]. This can be seen, for example, by observing that any nonnegative random variable can be approximated from below by functions of the form ∑_{i=1}^∞ x_i I_{x_i < X ≤ x_{i+1}}(ω), where for any event E, the random variable I_E(ω) = 1 if ω ∈ E and I_E(ω) = 0 otherwise. (Recall that I_E is called the indicator function of the set E.) Indeed, we have seen such an approximation through our definition of the expectation. Namely, if we define

(5) X_n(ω) = ∑_{i=1}^∞ (i/2^n) I_{i/2^n < X ≤ (i+1)/2^n}(ω),
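The dyadic approximation in (5) is concrete enough to compute. A small sketch (not from the notes; the sample value x below is arbitrary) evaluating X_n at a single value of X and checking that the approximation error is at most 1/2^n:

```python
import math

# Sketch: the dyadic approximation in (5), X_n = sum_i (i/2^n) I{ i/2^n < X <= (i+1)/2^n },
# evaluated at a single value x of X.
def dyadic_approx(x, n):
    """Return the value X_n assigns when X = x (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    i = math.ceil(x * 2 ** n) - 1   # the unique i with i/2^n < x <= (i+1)/2^n
    return i / 2 ** n

x = 0.73
approx = [dyadic_approx(x, n) for n in range(1, 12)]
# X_n(x) increases to x, with 0 <= x - X_n(x) <= 1/2^n.
assert all(0.0 <= x - a <= 2.0 ** -n for n, a in zip(range(1, 12), approx))
```

The gap bound 0 ≤ x − X_n(x) ≤ 1/2^n is exactly why X_n(ω) → X(ω) for every ω.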


then it is easy to check that X_n(ω) → X(ω) as n → ∞. In fact, E[X] was defined as lim_{n→∞} E[X_n], where E[X_n] precisely coincides with the sum S_n described above [recall that E[I_E(ω)] = P(E)]. Now, to prove the linearity of expectation, we note that if X and Y are random variables with defined expectations, then we can approximate them by X_n and Y_n, respectively. Also, X_n + Y_n approximates X + Y . Next, we observe that for nonnegative X and Y ,

E[X_n] + E[Y_n] = ∑_{i=0}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n} + ∑_{j=0}^∞ (j/2^n) P{j/2^n < Y ≤ (j+1)/2^n}

= ∑_{j=0}^∞ ∑_{i=0}^∞ (i/2^n) P({i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n}) + ∑_{j=0}^∞ ∑_{i=0}^∞ (j/2^n) P({i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n})

= ∑_{j=0}^∞ ∑_{i=0}^∞ ((i+j)/2^n) P({i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n}).

(The i = 0 and j = 0 terms vanish, but including them makes the cells partition the events {X ≥ 0} and {Y ≥ 0}, which justifies the second equality.) But ∑_{j=0}^∞ ∑_{i=0}^∞ ((i+j)/2^n) P({i/2^n < X ≤ (i+1)/2^n} ∩ {j/2^n < Y ≤ (j+1)/2^n}) = E[X_n + Y_n], since X_n + Y_n takes the value (i+j)/2^n precisely on the intersection event. (Think about it.)

Thus, we have shown that E[X_n] + E[Y_n] = E[X_n + Y_n]. Now take limits of both sides and use the definition of E[X], E[Y ], and E[X + Y ] as the limits of E[X_n], E[Y_n], and E[X_n + Y_n], respectively, to conclude that E[X] + E[Y ] = E[X + Y ]. The homogeneity property E[aX] = aE[X] can be proved similarly.

It is easy to show that if X ≥ 0 then E[X] ≥ 0; in addition, if X ≤ Y almost surely (this means that P{X ≤ Y } = 1), then E[X] ≤ E[Y ].

Expectations in the Context of Distributions: Recall that for a nonnegative random variable X, E[X] = lim_{n→∞} ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n}. But we had seen earlier that P{i/2^n < X ≤ (i+1)/2^n} = µ_X((i/2^n, (i+1)/2^n]). So we can write ∑_{i=1}^∞ (i/2^n) P{i/2^n < X ≤ (i+1)/2^n} as ∑_{i=1}^∞ (i/2^n) µ_X((i/2^n, (i+1)/2^n]). We denote the limit of the latter by ∫_IR x dµ_X, which is read as the Lebesgue integral of x with respect to the probability measure (or distribution) µ_X. In summary, we have E[X] = ∫_Ω X dP = ∫_IR x dµ_X. Of course we can extend this notion to a general X in the usual way (i.e., splitting X into its positive and negative parts).

A pair of random variables. Suppose that X and Y are rv's defined on the measurable space (Ω, F). For any B1, B2 ∈ B, we use the convenient notation {X ∈ B1, Y ∈ B2} for the event {X ∈ B1} ∩ {Y ∈ B2}. Moreover, we can also think of the pair X and Y as a vector (X, Y ) in IR^2.

Now consider any B ∈ B^2 (the set of all Borel sets on the plane) of the form B1 × B2, where B1, B2 ∈ B (i.e., rectangles with measurable sides). We note that {(X, Y ) ∈ B} is always an event because {(X, Y ) ∈ B} = {X ∈ B1} ∩ {Y ∈ B2}. In fact, {(X, Y ) ∈ B} is an event for any B ∈ B^2, not just those that are rectangles. Here is the proof.

Let D be the collection of all Borel sets B in B^2 for which {(X, Y ) ∈ B} ∈ F. As we have just seen, D contains all rectangles (i.e., Borel sets of the form B1 × B2). Further, it is easy to show that D qualifies as a Dynkin class of sets (see below). Also, the collection S of all Borel sets in the plane that are rectangles is closed under finite intersection (this is because (B1 × B2) ∩ (C1 × C2) = (B1 ∩ C1) × (B2 ∩ C2)). To summarize, D is a Dynkin class containing the collection S, which is closed under finite intersection and consists of the rectangular Borel sets in the plane. Now by the Dynkin class Theorem (see below), D contains the sigma algebra generated by S, which is just B^2. Hence, {(X, Y ) ∈ B} ∈ F for any B ∈ B^2.

Thus, for any Borel set B in the plane, we can define the joint distribution of X and Y as µ_XY(B) ≜ P{(X, Y ) ∈ B}; for a rectangle B = B1 × B2 this reduces to P{X ∈ B1, Y ∈ B2}. This can also be generalized in the obvious way to define a joint distribution of multiple (say n) random variables over B^n, the Borel subsets of IR^n.

Now back to the definition of a Dynkin class and the Dynkin class Theorem. A collection D of subsets of a set Ω is called a Dynkin class if (1) Ω ∈ D, (2) E2 \ E1 ∈ D whenever E1 ⊂ E2 and E1, E2 ∈ D, and (3) if E1 ⊂ E2 ⊂ . . . and each En ∈ D, then ∪_{n=1}^∞ En ∈ D. A collection S of subsets of Ω is called a π-class if A ∩ B ∈ S whenever A, B ∈ S. The Dynkin Class Theorem (see Chow and Teicher, for example) states that if S is a π-class, D is a Dynkin class, and S ⊂ D, then D contains the σ-algebra generated by S. Moreover, a Dynkin class that also happens to be a π-class is itself a σ-algebra; in particular, the smallest Dynkin class containing a π-class S is precisely the σ-algebra generated by S. We will not prove this Theorem here although its proof is elementary.

Expectation in the context of distribution function: Note that we can use the definition of a distribution function to write (for a non-negative rv X) E[X] = lim_{n→∞} ∑_{i=1}^∞ (i/2^n) (F_X((i+1)/2^n) − F_X(i/2^n)). Now, without being too fussy, we can imagine that if F_X is differentiable with derivative f_X, then the above limit is the Riemann integral ∫_{−∞}^∞ x f_X(x) dx, which is the usual formula for the expectation of a random variable that has a probability density function.
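This passage from the distribution function to the density formula can be sanity-checked numerically. A sketch (assuming, purely as an example, an Exponential(1) density f(x) = e^{−x}, x ≥ 0, whose mean is known to be 1) comparing a Riemann sum of x f_X(x) with the known mean:

```python
import math

# Sketch: midpoint-rule Riemann approximation of E[X] = integral of x f(x) dx
# for the assumed example X ~ Exponential(1), so the exact mean is 1.
def mean_by_riemann(f, lo, hi, steps):
    """Midpoint-rule approximation of the integral of x*f(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum((lo + (k + 0.5) * dx) * f(lo + (k + 0.5) * dx) * dx
               for k in range(steps))

f = lambda x: math.exp(-x)            # Exponential(1) density on [0, infinity)
approx = mean_by_riemann(f, 0.0, 50.0, 200_000)
assert abs(approx - 1.0) < 1e-3       # truncation beyond 50 is negligible
```

The truncation point 50 and step count are arbitrary; they are chosen so the discretization and tail errors are both far below the tolerance.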

Expectation of a function of a random variable. Now that we have E[X], we can define E[g(X)] in the obvious way, where g is a Borel-measurable transformation from IR to IR. (Recall that this means that g^{−1}(B) ∈ B for any B ∈ B.) In particular, E[g(X)] = ∫_Ω g(X) dP = ∫_IR g(x) dµ_X. Also, in the event that X has a density function, we obtain the usual expression ∫_{−∞}^∞ g(x) f_X(x) dx.

Markov Inequality: If φ is a non-decreasing function on [0, ∞), then P{|X| > ε} ≤ (1/φ(ε)) E[φ(|X|)]. In particular, if we take φ(t) = t^2 and consider X − X̄ in place of X, where X̄ ≜ E[X], then we have P{|X − X̄| > ε} ≤ E[(X − X̄)^2]/ε^2. Note that the numerator is simply the variance of X; this special case is known as the Chebyshev inequality.

Also, if we take φ(t) = t, then we have P{|X| > ε} ≤ E[|X|]/ε.

Proof. Since φ is nondecreasing, {|X| > ε} ⊆ {φ(|X|) ≥ φ(ε)}. Further, note that 1 ≥ I_{|X|>ε}(ω) for any ω ∈ Ω (since any indicator function is either 0 or 1). Now note that φ(|X|) ≥ I_{|X|>ε} φ(|X|) ≥ I_{|X|>ε} φ(ε). Take expectations of both ends of the inequalities to obtain E[φ(|X|)] ≥ φ(ε) E[I_{|X|>ε}] = φ(ε) P{|X| > ε}, and the desired result follows. (See the first special case after the definition of expectation.)
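A quick Monte Carlo sketch of the φ(t) = t case (the uniform distribution and threshold below are assumed purely for illustration):

```python
import random

# Sketch: Monte Carlo check of the Markov inequality P{|X| > eps} <= E[|X|]/eps
# for the assumed example X ~ Uniform[0, 1].
random.seed(0)
samples = [random.random() for _ in range(100_000)]
eps = 0.8
lhs = sum(1 for x in samples if abs(x) > eps) / len(samples)   # P{|X| > eps}
rhs = sum(abs(x) for x in samples) / len(samples) / eps        # E[|X|]/eps
assert lhs <= rhs
```

Here lhs is about 0.2 while the bound rhs is about 0.625, illustrating that the Markov bound is valid but typically loose.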

Chernoff Bound: This is an application of the Markov inequality and a useful tool for upper-bounding error probabilities in digital communications. Let X be a non-negative random variable and let θ be nonnegative. Then by taking φ(t) = e^{θt}, we obtain

P{X > x} ≤ e^{−θx} E[e^{θX}].


The right-hand side can be minimized over θ > 0 to yield what is called the Chernoff bound for the random variable X.

Independence: Consider a probability space (Ω,F , P), and let D1 and D2 be two

sub σ-algebras of F . We say that D1 and D2 are independent if P(A∩B) = P(A)P(B)

for every A ∈ D1 and B ∈ D2.

For example, if X and Y are random variables on (Ω, F), then we say that they are independent if σ(X) and σ(Y ) are independent. Note that in this case we automatically have µ_XY = µ_X × µ_Y , F_XY (x, y) = F_X(x)F_Y (y), and f_XY (x, y) = f_X(x)f_Y (y) if these densities exist.


3. Elementary Hilbert Space Theory

Most of the material in this section is extracted from the excellent book by W. Rudin, Real & Complex Analysis [2]. A complex vector space H is called an inner product space if ∀x, y ∈ H, we have a complex-valued scalar, < x, y >, read "the inner product between vectors x and y," such that the following properties are satisfied (throughout, * denotes complex conjugation):

(1) < x, y > = < y, x >*, ∀x, y ∈ H

(2) < x + y, z > = < x, z > + < y, z >, ∀x, y, z ∈ H

(3) < αx, y > = α < x, y >, ∀x, y ∈ H, α ∈ |C

(4) < x, x > ≥ 0, ∀x ∈ H, and < x, x > = 0 only if x = 0

Note that (3) and (1) together imply < x, αy > = α* < x, y >. Also, (3) and (4) together imply < x, x > = 0 if and only if x = 0. It will be convenient to write < x, x > as ‖x‖^2, which we later use to introduce a norm on H.

3.1. Schwarz Inequality. It follows from properties (1)-(4) that | < x, y > | ≤ ‖x‖ ‖y‖, ∀x, y ∈ H.

Proof: Put A = ‖x‖^2, C = ‖y‖^2, and B = | < x, y > |. By (4), < x − αry, x − αry > ≥ 0 for any unit-modulus α ∈ |C and all r ∈ IR, which can be further written as (using (2)-(3)):

(6) ‖x‖^2 − rα < y, x > − rα* < x, y > + r^2 ‖y‖^2 ≥ 0.

Now choose α so that α < y, x > = | < x, y > |. Thus, (6) can be cast as ‖x‖^2 − 2r| < x, y > | + r^2 ‖y‖^2 ≥ 0, or A − 2rB + r^2 C ≥ 0, which is true for any r ∈ IR. (If C = 0, then y = 0 and the inequality is trivial, so assume C > 0.) Let r_{1,2} = (2B ± √(4B^2 − 4AC))/(2C) = (B ± √(B^2 − AC))/C denote the roots of the equation A − 2rB + r^2 C = 0. Since A − 2rB + r^2 C ≥ 0 for all real r, it must be true that B^2 − AC ≤ 0 (otherwise there would be two distinct real roots and the quadratic would be negative between them), which implies | < x, y > |^2 ≤ ‖x‖^2 ‖y‖^2, or | < x, y > | ≤ ‖x‖ ‖y‖.

Note that | < x, y > | = ‖x‖ ‖y‖ whenever x = βy, β ∈ |C. Conversely, if | < x, y > | = ‖x‖ ‖y‖ and y ≠ 0, then it is easy to verify that ‖x − (< x, y >/‖y‖^2) y‖^2 = ‖x‖^2 − | < x, y > |^2/‖y‖^2 = 0, which implies (using (4)) x = (< x, y >/‖y‖^2) y. Thus, | < x, y > | = ‖x‖ ‖y‖ if and only if x is proportional to y (or y = 0).


3.2. Triangle Inequality. ‖x + y‖ ≤ ‖x‖ + ‖y‖, ∀x, y ∈ H.

Proof. This follows from the Schwarz inequality. Expand ‖x + y‖^2 to obtain ‖x + y‖^2 = ‖x‖^2 + ‖y‖^2 + < x, y > + < x, y >* ≤ ‖x‖^2 + ‖y‖^2 + 2| < x, y > |. Now by the Schwarz inequality, ‖x‖^2 + ‖y‖^2 + 2| < x, y > | ≤ ‖x‖^2 + ‖y‖^2 + 2‖x‖‖y‖ = (‖x‖ + ‖y‖)^2, from which the desired result follows. Note that ‖x + y‖^2 = ‖x‖^2 + ‖y‖^2 if < x, y > = 0 (why?).

This is a generalization of the customary triangle inequality for complex numbers, which states that |x − z| ≤ |x − y| + |y − z|, x, y, z ∈ |C.

3.3. Norm. We say ‖ · ‖ is a norm on H if:

(1) ‖x‖ ≥ 0.

(2) ‖x‖ = 0 only if x = 0.

(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖.

(4) ‖αx‖ = |α| ‖x‖, α ∈ |C.

With the triangle inequality at hand, we can define ‖ · ‖ on members of H as follows: ‖x‖ ≜ < x, x >^{1/2}. You should check that this actually defines a norm. This yields a "yardstick for distance" between two vectors x, y ∈ H, defined as ‖x − y‖. We can now say that H is a normed space.

3.4. Convergence. We can then talk about convergence: a sequence xn ∈ H is said

to be convergent to x, written as xn → x, or limn→∞ xn = x, if for every ε > 0, there

exists N ∈ IN such that ‖xn − x‖ < ε whenever n > N .

3.5. Completeness. An inner-product space H is complete if every Cauchy sequence in H converges to a point in H. A sequence {y_n}_{n=1}^∞ in H is called a Cauchy sequence if for any ε > 0, ∃ N ∈ IN such that ‖y_n − y_m‖ ≤ ε for all n, m > N.

Now, if H is complete, then H is called a Hilbert space.

Fact: If H is complete, then it is closed. This is because any convergent sequence is automatically a Cauchy sequence.

3.6. Convex Sets. A set E in a vector space V is said to be a convex set if for any x, y ∈ E and t ∈ (0, 1), the point z_t = tx + (1 − t)y belongs to E. In other words, the line segment between x and y lies in E. Note that if E is a convex set, then the translation of E, E + x ≜ {y + x : y ∈ E}, is also a convex set.

3.7. Parallelogram Law. For any x and y in an inner-product space, ‖x + y‖^2 + ‖x − y‖^2 = 2‖x‖^2 + 2‖y‖^2. This can be simply verified using the properties of an inner product. See the schematic below.

Figure 2. A parallelogram with diagonals x + y and x − y.

3.8. Orthogonality. If < x, y > = 0, then we say that x and y are orthogonal, and we write x ⊥ y. Note that ⊥ is a symmetric relation; that is, x ⊥ y ⇔ y ⊥ x.

Pick a vector x ∈ H, and collect all vectors in H that are orthogonal to x: x⊥ = {y ∈ H : < x, y > = 0}. We claim that x⊥ is a closed subspace. To see that it is a subspace, we must show that x⊥ is closed under addition and scalar multiplication; these are immediate from the definition of an inner product (see Assignment 3). To see that x⊥ is closed, we must show that the limit of every convergent sequence in x⊥ is also in x⊥. Let x_n be a sequence in x⊥, and assume that x_n → x_0. We need to show that x_0 ∈ x⊥. To see this, note that | < x_0, x > | = | < x_0 − x_n + x_n, x > | = | < x_0 − x_n, x > + < x_n, x > | = | < x_0 − x_n, x > | ≤ ‖x_0 − x_n‖ ‖x‖, by the Schwarz inequality. But the term on the right converges to zero, so | < x_0, x > | = 0, which implies x_0 ∈ x⊥.

Let M ⊂ H be a subspace of H. We define M⊥ ≜ ⋂_{x∈M} x⊥. It is easy to see that M⊥ is actually a subspace of H and that M⊥ ∩ M = {0}. It is also true that M⊥ is closed, simply because it is an intersection of closed sets (recall that each x⊥ is closed).

Theorem 2. Any non-empty, closed and convex set E in a Hilbert space H contains a unique element of smallest norm. That is, there exists x_o ∈ E such that ‖x_o‖ ≤ ‖x‖ for all x ∈ E, with equality only when x = x_o.


Proof (see also Fig. 3): Existence: Let δ = inf{‖x‖ : x ∈ E}. In the parallelogram law, replace x by x/2 and y by y/2 and multiply through by 4 to obtain

‖x − y‖^2 = 2‖x‖^2 + 2‖y‖^2 − 4‖(x + y)/2‖^2.

Since (x + y)/2 ∈ E whenever x, y ∈ E (because E is convex), ‖(x + y)/2‖^2 ≥ δ^2. Hence,

(7) ‖x − y‖^2 ≤ 2‖x‖^2 + 2‖y‖^2 − 4δ^2.

By the definition of δ, there exists a sequence {y_n}_{n=1}^∞ in E such that lim_{n→∞} ‖y_n‖ = δ. Now replace x by y_n and y by y_m in (7) to obtain

(8) ‖y_n − y_m‖^2 ≤ 2‖y_n‖^2 + 2‖y_m‖^2 − 4δ^2.

Hence,

lim_{n,m→∞} ‖y_n − y_m‖^2 ≤ 2 lim_{n→∞} ‖y_n‖^2 + 2 lim_{m→∞} ‖y_m‖^2 − 4δ^2 = 2δ^2 + 2δ^2 − 4δ^2 = 0,

and we conclude that {y_n}_{n=1}^∞ is a Cauchy sequence. Since H is complete, there ∃ x_o ∈ H such that y_n → x_o; since E is closed, x_o ∈ E. We next show that ‖x_o‖ = δ, which completes the existence of a minimal-norm member of E. Note that by the triangle inequality,

| ‖x_o‖ − ‖y_n‖ | ≤ ‖x_o − y_n‖.

Now take limits of both sides to conclude that lim_{n→∞} ‖y_n‖ = ‖x_o‖. But we already know that lim_{n→∞} ‖y_n‖ = δ, which implies that ‖x_o‖ = δ.

Uniqueness: Suppose that x ≠ x′, x, x′ ∈ E, and ‖x‖ = ‖x′‖ = δ. Replace y in (7) by x′ to obtain

‖x − x′‖^2 ≤ 2δ^2 + 2δ^2 − 4δ^2 = 0,

which implies that x = x′. Hence the minimal-norm element in E is unique. □

Example 11. For any fixed n, the set |C^n of all n-tuples of complex numbers, x = (x_1, x_2, . . . , x_n), x_i ∈ |C, is a Hilbert space, where with y = (y_1, y_2, . . . , y_n) we define < x, y > ≜ ∑_{i=1}^n x_i y_i*.
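As a quick numerical illustration of Example 11 (the vectors below are an assumed toy instance), here is the |C^n inner product together with checks of properties (1), the Schwarz inequality, and the triangle inequality:

```python
# Sketch: the C^n inner product <x, y> = sum_i x_i * conj(y_i) from Example 11,
# with numeric checks of its Hermitian symmetry, Schwarz, and triangle inequalities.
def inner(x, y):
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    return abs(inner(x, x)) ** 0.5     # <x, x> is real and nonnegative

x = [1 + 2j, 3 - 1j, 0.5j]
y = [2 - 1j, 1j, 4.0 + 0j]

assert abs(inner(x, y) - inner(y, x).conjugate()) < 1e-12       # property (1)
assert abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12            # Schwarz
assert norm([a + b for a, b in zip(x, y)]) <= norm(x) + norm(y) + 1e-12  # triangle
```

The conjugate on the second argument is what makes < x, x > real, so the norm is well defined.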


Figure 3. Elevating M by x: in H = IR^2, the translate x + M, with Qx the point of x + M at the shortest distance from the origin.

Figure 4. In H = IR^2 with M = IR, the decomposition x = Px + Qx, with Px ∈ M and Qx ∈ M⊥.

Example 12. L^2[a, b] = {f : ∫_a^b |f(x)|^2 dx < ∞} is a Hilbert space, with < f, g > ≜ ∫_a^b f(x) g(x)* dx and ‖f‖ = √(< f, f >) = [∫_a^b |f|^2 dx]^{1/2} = ‖f‖_2. Actually, we need a Schwarz inequality to see that < f, g > is finite for f, g ∈ L^2[a, b], so that we indeed have an inner product in this case. More on this to follow.

Theorem 3. Projection Theorem. If M ⊂ H is a closed subspace of a Hilbert space

H, then for every x ∈ H there exists a unique decomposition x = Px + Qx, where

Px ∈ M , and Qx ∈ M⊥. Moreover,

(1) ‖x‖^2 = ‖Px‖^2 + ‖Qx‖^2.

(2) If we think of Px and Qx as mappings from H to M and M⊥, respectively,

then the mappings P and Q are linear.


(3) Px is the nearest point in M to x, and Qx is the nearest point in M⊥ to

x. Px and Qx are called the orthogonal projections of x into M and M⊥,

respectively.

Proof.

Existence: Consider x + M; we claim that x + M is closed. Recall that M is closed, which means that if x_n ∈ M and x_n → x_o, then x_o ∈ M. Pick a convergent sequence in x + M, call it z_n. Now z_n = x + y_n for some y_n ∈ M. Since z_n is convergent, so is y_n, and the limit of y_n is in M, so lim_{n→∞} z_n = x + lim_{n→∞} y_n ∈ x + M.

We next show that x + M is convex. Pick x_1 and x_2 ∈ x + M. We need to show that for any 0 ≤ α ≤ 1, αx_1 + (1 − α)x_2 ∈ x + M. But x_1 = x + y_1, y_1 ∈ M, and x_2 = x + y_2, y_2 ∈ M. So αx_1 + (1 − α)x_2 = x + αy_1 + (1 − α)y_2 ∈ x + M, since αy_1 + (1 − α)y_2 ∈ M.

By the minimal-norm theorem, there exists a member of x + M of smallest norm; call it Qx, and let Px = x − Qx. Note that Px ∈ M (writing Qx = x + m with m ∈ M gives Px = −m). We need to show that Qx ∈ M⊥, namely that < Qx, y > = 0 ∀ y ∈ M. Write z = Qx, so that ‖z‖ ≤ ‖w‖ ∀ w ∈ x + M. Fix y ∈ M with ‖y‖ = 1 and α ∈ |C, and take w = z − αy ∈ x + M. Then ‖z‖^2 ≤ ‖z − αy‖^2 = < z − αy, z − αy > = ‖z‖^2 − α* < z, y > − α < y, z > + |α|^2, so that 0 ≤ −α* < z, y > − α < y, z > + |α|^2. Pick α = < z, y >; we obtain 0 ≤ −| < z, y > |^2. This can hold only if < z, y > = 0, and by homogeneity z is then orthogonal to every y ∈ M; therefore, Qx ∈ M⊥.

Uniqueness: Suppose that x = Px + Qx = (Px)′ + (Qx)′, where Px, (Px)′ ∈ M

and Qx, (Qx)′ ∈ M⊥. Then, Px− (Px)′ = (Qx)′−Qx, where the left side belongs to

M while the right side belongs to M⊥. Hence, each side can only be the zero vector

(why?), and we conclude that Px = (Px)′ and (Qx)′ = Qx.

Minimum Distance Properties: To show that Px is the nearest point in M to x, pick any y ∈ M and observe that ‖x − y‖^2 = ‖Px + Qx − y‖^2 = ‖Qx + (Px − y)‖^2 = ‖Qx‖^2 + ‖Px − y‖^2 (since Px − y ∈ M is orthogonal to Qx). The right-hand side is minimized when ‖Px − y‖ = 0, which happens if and only if y = Px. The fact that Qx is the nearest point in M⊥ to x can be shown similarly.

Linearity: Take x, y ∈ H. Then, we have x = Px + Qx and y = Py + Qy. Now

ax+by = aPx+aQx+bPy+bQy. On the other hand, ax+by = P (ax+by)+Q(ax+by).


Thus, we can write P (ax+ by)−aPx− bPy = −Q(ax+ by)+aQx+ bQy and observe

that the left side is in M while the right side is in M⊥. Hence, each side can only be

the zero vector. Therefore, P (ax + by) = aPx + bPy and Q(ax + by) = aQx + bQy.

□
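The decomposition in the Projection Theorem can be visualized numerically. A sketch in H = IR^3 with M = span{m} for a single (assumed) direction m, checking that Qx ⊥ M and that property (1) of the theorem holds:

```python
# Sketch (assumed toy instance): orthogonal projection in H = R^3 onto
# M = span{m}; checks x = Px + Qx with Qx ⟂ M and ||x||^2 = ||Px||^2 + ||Qx||^2.
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

m = [1.0, 2.0, 2.0]                  # spanning direction of M
x = [3.0, -1.0, 4.0]                 # vector to decompose

c = dot(x, m) / dot(m, m)            # coefficient of x along m
Px = [c * u for u in m]              # projection onto M
Qx = [u - v for u, v in zip(x, Px)]  # residual, lies in M-perp

assert abs(dot(Qx, m)) < 1e-12                                 # Qx ⟂ M
assert abs(dot(x, x) - (dot(Px, Px) + dot(Qx, Qx))) < 1e-12    # Pythagoras
```

The coefficient formula c = < x, m >/< m, m > is exactly the one-dimensional case of the nearest-point property of Px.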

4. Conditional expectations for L2 random variables

4.1. Hölder inequality for expectations. Consider the probability space (Ω, F, P), and let X and Y be random variables. The next result is called the Hölder inequality for expectations.

Theorem 4. If p > 1, q > 1 with p^{−1} + q^{−1} = 1, then E[|XY |] ≤ E[|X|^p]^{1/p} E[|Y |^q]^{1/q}.

Proof: (See for example [1] p. 105.)

Note that if E[|X|^p]^{1/p} = 0, then X = 0 a.s., and hence E[|XY |] = 0 and the inequality holds. If E[|X|^p]^{1/p} > 0 and E[|Y |^q]^{1/q} = ∞, then the inequality also holds trivially. Next, consider the case when 0 < E[|X|^p]^{1/p} < ∞ and 0 < E[|Y |^q]^{1/q} < ∞. Let U = |X|/E[|X|^p]^{1/p} and V = |Y |/E[|Y |^q]^{1/q}, and note that E[U^p] = E[V^q] = 1. Using the concavity of the logarithm function, it is easy to see that for any a, b > 0,

log(a^p/p + b^q/q) ≥ (1/p) log a^p + (1/q) log b^q,

and the term on the right is simply log ab. From this it follows that

ab ≤ a^p/p + b^q/q.

Thus,

E[UV ] ≤ (1/p) E[U^p] + (1/q) E[V^q] = 1/p + 1/q = 1,

from which the desired result follows. □

When p = q = 2, we have E[|XY |] ≤ E[X^2]^{1/2} E[Y^2]^{1/2}, and this result is called the Schwarz inequality for expectations.
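A Monte Carlo sketch of the Schwarz inequality for expectations (the dependent pair below is an assumed example; note the inequality needs no independence):

```python
import random

# Sketch: empirical check of E[|XY|] <= E[X^2]^{1/2} E[Y^2]^{1/2} for an assumed
# dependent pair: X ~ Uniform[-1, 1], Y = X^2 + small uniform noise.
random.seed(1)
n = 50_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [x ** 2 + 0.1 * random.uniform(-1, 1) for x in xs]

exy = sum(abs(x * y) for x, y in zip(xs, ys)) / n            # E[|XY|]
ex2 = (sum(x * x for x in xs) / n) ** 0.5                    # E[X^2]^{1/2}
ey2 = (sum(y * y for y in ys) / n) ** 0.5                    # E[Y^2]^{1/2}
assert exy <= ex2 * ey2
```

The sample version holds exactly, not just approximately, because it is the Cauchy-Schwarz inequality applied to the empirical measure.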

Let H be the collection of square-integrable random variables. For X, Y ∈ H, E[(aX + bY )^2] = a^2 E[X^2] + b^2 E[Y^2] + 2ab E[XY ] ≤ a^2 E[X^2] + b^2 E[Y^2] + 2|ab| E[X^2]^{1/2} E[Y^2]^{1/2} < ∞, where we have used the Schwarz inequality for expectations in the last step. Hence, H is a vector space. Next, the Schwarz inequality for expectations also tells us that if X and Y are square integrable, then E[XY ] is defined (i.e., finite). We can therefore define < X, Y > ≜ E[XY ] and see that it defines an inner product. (Technically we need to consider the equivalence classes of random variables that are equal a.s.; otherwise, we may not have an inner product. Can you see why?) We can also define the norm ‖X‖_2 ≜ < X, X >^{1/2}. This is called the L^2-norm associated with (Ω, F, P). Hence, we can recast H as the vector space of all random variables that have finite L^2-norm. This collection is often written as L^2(Ω, F, P). It can be shown that L^2(Ω, F, P) is complete (see [2], pp. 67). Hence, L^2(Ω, F, P) is a Hilbert space. In fact, L^2(Ω, D, P) is a Hilbert space for any σ-algebra D ⊂ F.

4.2. The L^2 conditional expectation. Let X, Y ∈ L^2(Ω, F, P). Clearly, L^2(Ω, σ(X), P) is a closed subspace of L^2(Ω, F, P) (why? See Problem 5 in Assignment 3). We can now apply the projection theorem to Y , with L^2(Ω, σ(X), P) being our closed subspace, and obtain the decomposition Y = PY + QY , where PY ∈ L^2(Ω, σ(X), P) and QY ∈ L^2(Ω, σ(X), P)⊥. We also have the property that ‖Y − PY ‖_2 ≤ ‖Y − Y ′‖_2 ∀ Y ′ ∈ L^2(Ω, σ(X), P). We call PY the conditional expectation of Y given σ(X), which we will write from this point on as E[Y |σ(X)].
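The projection view can be made concrete. A sketch (the toy distribution below, Y = X + Gaussian noise with X taking three values, is assumed for illustration): when X is discrete, E[Y |σ(X)] is the group-mean function of X, and, as the projection theorem predicts, it minimizes the mean squared error over σ(X)-measurable candidates:

```python
import random

# Sketch: for discrete X, E[Y | sigma(X)] is the group-mean function of X, and it
# minimizes the empirical E[(Y - g(X))^2] over functions g of X (assumed toy data).
random.seed(2)
pairs = [(x, x + random.gauss(0, 1))
         for x in (random.randint(0, 2) for _ in range(30_000))]

def mse(g):
    """Empirical E[(Y - g(X))^2] for a candidate g given as a dict over X's values."""
    return sum((y - g[x]) ** 2 for x, y in pairs) / len(pairs)

# Conditional expectation: the average of Y over each level of X.
sums = {0: 0.0, 1: 0.0, 2: 0.0}
counts = {0: 0, 1: 0, 2: 0}
for x, y in pairs:
    sums[x] += y
    counts[x] += 1
cond = {v: sums[v] / counts[v] for v in sums}

# Any other sigma(X)-measurable candidate does at least as badly.
other = {0: 0.5, 1: 0.5, 2: 2.5}
assert mse(cond) <= mse(other)
```

The group mean minimizes the within-group sum of squares, so no other function of X can achieve a smaller empirical L^2 distance to Y.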

Exercise 1. (1) Show that if X is a D-measurable r.v., then so is aX, a ∈ IR. (2) Show that if X and Y are D-measurable, so is X + Y . [See Homework 3]

The above exercise shows that the collection of square-integrable D-measurable random variables is indeed a vector space.

A function h : IR → IR is said to be a Borel function if h−1(B) ∈ B for every B ∈ B.

Theorem 5. Consider (Ω, F, P), and let X be a random variable. A random variable Z is σ(X)-measurable if and only if Z = h(X) for some Borel function h. □ (The proof is based on a corollary to the Dynkin class Theorem. We omit the proof, which can be found in [1].)


Since E[X|σ(Y )] is σ(Y )-measurable by definition, we can write it explicitly as a Borel function of Y . For this reason, we often write E[X|Y ] in place of E[X|σ(Y )].

4.3. Properties of Conditional Expectation. Our definition of conditional expectation lends itself to many powerful properties that are useful in practice. One of the properties also leads to an equivalent definition of the conditional expectation, which is actually the way it is commonly defined—see the comments after Property 4.3.3. However, proving the existence of the conditional expectation according to the alternative definition requires knowledge of a major theorem in measure theory called the Radon-Nikodym Theorem, which is beyond the scope of this course. That is why in this course we took a path (for defining the conditional expectation) that is based on the projection theorem, which is an important theorem in signal processing for other reasons as well. Here are some key properties of conditional expectations.

4.3.1. Property 1. For any constant a, E[a|D] = a. This follows trivially from the observation that Pa = a, where P is the projection onto L^2(Ω, D, P) (why?).

4.3.2. Property 2 (Linearity). For any constants a, b, E[aX + bY |D] = aE[X|D] + bE[Y |D]. This follows trivially from the linearity of the projection operator P (see the Projection Theorem).

4.3.3. Property 3. Let Z = E[X|D]; then E[XY ] = E[ZY ] for all Y ∈ L^2(Ω, D, P). Interpretation: Z contains all the information in X that is relevant to any D-measurable random variable Y .

Proof: Note that E[XY ] = E[(PX + QX)Y ] = E[(PX)Y ] + E[(QX)Y ] = E[(PX)Y ] = E[ZY ]. The last equality follows from the definition of Z.

Conversely, if a random variable Z ∈ L^2(Ω, D, P) has the property that E[ZY ] = E[XY ] ∀Y ∈ L^2(Ω, D, P), then Z = E[X|D]. To see this, we need to show that Z = PX. Note that by assumption E[Y (X − Z)] = 0 for any Y ∈ L^2(Ω, D, P). Therefore, E[Y (PX + QX − Z)] = 0, or E[Y PX] + E[Y QX] − E[Y Z] = 0, or E[Y (Z − PX)] = 0 ∀ Y ∈ L^2(Ω, D, P) (since E[Y QX] = 0). In particular, if we take Y = Z − PX, we conclude that E[(Z − PX)^2] = 0, which implies Z = PX almost surely.

Thus, we arrive at an alternative definition of the L^2 conditional expectation.

Definition 6. We define Z ≜ E[X|D] if (1) Z ∈ L^2(Ω, D, P) and (2) E[ZY ] = E[XY ] ∀ Y ∈ L^2(Ω, D, P).

We will use this new definition frequently in the remainder of this chapter.

4.3.4. Property 4 (Smoothing Property). For any X, E[X] = E[E[X|D]]. To see this, note that if Z = E[X|D], then E[ZY ] = E[XY ] for all Y ∈ L^2(Ω, D, P). Now take Y = 1 to conclude that E[X] = E[Z].

4.3.5. Property 5. If Y ∈ L^2(Ω, D, P), then E[XY |D] = Y E[X|D]. To show this, we check the second definition of the conditional expectation. Note that Y E[X|D] ∈ L^2(Ω, D, P), so all we need to show is that E[(XY )W ] = E[(Y E[X|D])W ] for any W ∈ L^2(Ω, D, P). But we already know from the definition of E[X|D] that E[(WY )E[X|D]] = E[(WY )X], since WY ∈ L^2(Ω, D, P).

As a special case, we have E[XY ] = E[E[XY |Y ]] = E[Y E[X|Y ]].

4.3.6. Property 6. For any Borel function g : IR → IR, E[g(Y )|D] = g(Y ) whenever

Y ∈ L2(Ω,D, P). (For example, we have E[g(Y )|Y ] = g(Y ).) To prove this, all we

need to do is to observe that g(Y ) ∈ L2(Ω,D, P) and then apply Property 4.3.5 with

X = 1.

4.3.7. Property 7 (Iterated Conditioning). Let D1 ⊂ D2 ⊂ F. Then, E[X|D1] = E[E[X|D2] | D1].

Proof: Let Z = E[E[X|D2] | D1]. First note that Z ∈ L^2(Ω, D1, P), so we only need to show that E[ZY ] = E[XY ] ∀ Y ∈ L^2(Ω, D1, P). Now E[ZY ] = E[Y E[E[X|D2] | D1]], and we already know from the definition of E[E[X|D2] | D1] that E[Y ′ E[E[X|D2] | D1]] = E[Y ′ E[X|D2]] for any Y ′ ∈ L^2(Ω, D1, P); taking Y ′ = Y gives E[ZY ] = E[Y E[X|D2]]. Now since Y ∈ L^2(Ω, D1, P) implies Y ∈ L^2(Ω, D2, P), we use Property 4.3.5 to obtain Y E[X|D2] = E[Y X|D2]. Thus, E[ZY ] = E[E[Y X|D2]], and by Property 4.3.4, E[E[Y X|D2]] = E[Y X], which completes the proof.

Also note that E[X|D1] = E[E[X|D1] | D2]. This is much easier to prove. (Prove it.)

The next property is very important in applications.

4.3.8. Property 8. If X and Y are independent r.v.'s, and if g : IR^2 → IR is Borel measurable, then E[g(X, Y )|σ(Y )] = h(Y ), where for any scalar t, h(t) ≜ E[g(X, t)]. As a consequence, E[g(X, Y )] = E[h(Y )].

Proof: We prove this in the case g(x, y) = g1(x)g2(y). (The general result follows through an application of a corollary to the Dynkin class theorem.) First note that E[g1(X)g2(Y )|σ(Y )] = g2(Y )E[g1(X)|σ(Y )]. Also note that h(t) = g2(t)E[g1(X)], so h(Y ) = g2(Y )E[g1(X)]. We need to show that h(Y ) = E[g1(X)g2(Y )|σ(Y )]. To do so, we need to show that E[h(Y )Z] = E[g1(X)g2(Y )Z] for every Z ∈ L^2(Ω, σ(Y ), P). But E[h(Y )Z] = E[g2(Y )E[g1(X)]Z] = E[g1(X)] E[g2(Y )Z], and by the independence of X and Y , the latter is equal to E[g1(X)g2(Y )Z].

The stated consequence follows from Property 4.3.4. Also, the above result can be easily generalized to random vectors.
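A quick Monte Carlo sketch of the stated consequence (the distributions and the function g(x, y) = x^2 y below are assumed purely for illustration): for independent X ~ N(0, 1) and Y ~ Uniform[0, 2], here h(t) = E[X^2] t, so E[g(X, Y )] should match E[h(Y )] = E[X^2] E[Y ]:

```python
import random

# Sketch: numeric check of Property 4.3.8's consequence with the assumed example
# g(x, y) = x^2 * y, X ~ N(0, 1) independent of Y ~ Uniform[0, 2].
random.seed(3)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.uniform(0, 2) for _ in range(n)]       # generated independently of xs

lhs = sum(x * x * y for x, y in zip(xs, ys)) / n            # E[g(X, Y)]
rhs = (sum(x * x for x in xs) / n) * (sum(ys) / n)          # E[X^2] E[Y] = E[h(Y)]
assert abs(lhs - rhs) < 0.05
```

Both sides are close to 1 here; the agreement relies on the independence of the two sample streams, mirroring the independence hypothesis in Property 4.3.8.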

4.4. Some applications of Conditional Expectations.

Example 13 (Photon counting).

It is known that the energy of a photon is hν, where h is Planck's constant and ν is the optical frequency (color) of the photon in Hertz. We are interested in detecting and

Figure 5. Photon counting: incident radiation produces N photons; the detector emits one electric pulse per detected photon, and a counter integrates the pulses to yield the count M.

counting the number of photons in a certain time interval. We assume that we have a detector that, upon the detection of the ith photon, generates an instantaneous pulse (a delta function) whose area is a random non-negative integer, Gi. We then integrate the sum of all delta functions (i.e., responses) in a given interval and call that the photon count M . We will also assume that the Gi's are mutually independent, and they are also independent of the number of photons N that impinge on the detector in the given time interval. Mathematically, we can write

M = ∑_{i=1}^N Gi.

Let us try to calculate E[M ], the average number of detected photons in a given time interval. Assume that we know P{N = k} (k = 0, 1, 2, 3, . . .) and that P{Gi = k} is also known (k = 0, 1, 2, 3, . . .).

By using Property 4.3.4, we know that E[M ] = E[E[M |N ]]. To find E[M |N ], we appeal to Property 4.3.8. We think of the random sequence {Gi}_{i=1}^∞ as X, and the random variable N as Y . With these, the function h becomes

h(t) = E[∑_{i=1}^t Gi] = ∑_{i=1}^t E[Gi].

Let us assume further that the Gi's are identically distributed, so that E[Gi] ≡ E[G1]. With this, h(t) = tE[G1] and h(N ) = E[G1]N . Therefore, E[M ] = E[h(N )] = E[N ]E[G1].
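The identity E[M ] = E[N ]E[G1] is easy to verify by simulation. A sketch assuming, as an example, N ~ Poisson(4) (sampled by inversion) and gains Gi uniform on {0, 1, 2}, so that E[N ]E[G1] = 4 · 1 = 4:

```python
import random, math

# Sketch: Monte Carlo check of E[M] = E[N] E[G_1] for M = sum_{i=1}^N G_i,
# with assumed N ~ Poisson(4) and i.i.d. gains G_i uniform on {0, 1, 2}.
random.seed(4)

def sample_poisson(lam):
    """Sample a Poisson(lam) variate by inversion (sequential search)."""
    u, k = random.random(), 0
    p = math.exp(-lam)
    s = p
    while u > s:
        k += 1
        p *= lam / k
        s += p
    return k

gains = [0, 1, 2]          # P{G = k} = 1/3 each, so E[G_1] = 1
trials = 50_000
total = 0.0
for _ in range(trials):
    n = sample_poisson(4.0)
    total += sum(random.choice(gains) for _ in range(n))

mean_m = total / trials    # should be close to E[N] E[G_1] = 4
assert abs(mean_m - 4.0) < 0.15
```

Each trial draws a fresh N and fresh gains, enforcing the independence assumptions of the example.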


What is the variance of M? Let us first calculate E[M^2]. By Property 4.3.4, E[M^2] = E[E[M^2|N ]], and by Property 4.3.8, E[M^2|N ] is equal to h2(N ), where

h2(t) ≜ E[(∑_{i=1}^t Gi)^2] = E[∑_{i=1}^t ∑_{j=1}^t GiGj] = E[∑_{i≠j} GiGj] + E[∑_{i=1}^t Gi^2] = (t^2 − t)E[G1]^2 + t(E[G1]^2 + σ_G^2),

where σ_G^2 ≜ E[G1^2] − E[G1]^2 is the variance of G1, which is common to all the other Gi's since they are identically distributed.

Next, E[M^2] = E[h2(N )] = E[N^2 − N ]E[G1]^2 + (E[G1]^2 + σ_G^2)E[N ] = E[G1]^2(σ_N^2 + N̄^2 − N̄ ) + (E[G1]^2 + σ_G^2)N̄ , where N̄ ≜ E[N ] and σ_N^2 ≜ E[N^2] − N̄^2 is the variance of N . Subtracting (E[M ])^2 = E[G1]^2 N̄^2 gives the variance σ_M^2 = E[G1]^2 σ_N^2 + σ_G^2 N̄ . Finally, if we assume that N is a Poisson random variable (as in coherent light), then σ_N^2 = N̄ and the variance of M becomes

σ_M^2 = (E[G1]^2 + σ_G^2) N̄ = E[G1^2] N̄ .

Exercise: For an integer-valued random variable X, we define the generating function of X as ψ_X(s) ≜ E[s^X], s ∈ |C, |s| ≤ 1. We can think of the generating function as the z-transform of the probability mass function associated with X.

For the photon-counting example described above, show that ψ_M(s) = ψ_N(ψ_G(s)), where ψ_G is the common generating function of the Gi's.
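The composition formula of the exercise can be sanity-checked at a single point s. A sketch assuming N ~ Poisson(λ) and i.i.d. Gi ~ Bernoulli(p), for which the standard generating functions are ψ_N(s) = exp(λ(s − 1)) and ψ_G(s) = 1 − p + ps, so ψ_N(ψ_G(s)) = exp(λp(s − 1)):

```python
import random, math

# Sketch of the exercise: compare a Monte Carlo estimate of psi_M(s) = E[s^M]
# with psi_N(psi_G(s)) for the assumed choices N ~ Poisson(lam), G_i ~ Bernoulli(p).
random.seed(5)
lam, p, s = 3.0, 0.5, 0.7

def sample_poisson(l):
    """Sample a Poisson(l) variate by inversion (sequential search)."""
    u, k = random.random(), 0
    q = math.exp(-l)
    c = q
    while u > c:
        k += 1
        q *= l / k
        c += q
    return k

trials = 100_000
acc = 0.0
for _ in range(trials):
    m = sum(1 for _ in range(sample_poisson(lam)) if random.random() < p)
    acc += s ** m

mc = acc / trials
formula = math.exp(lam * p * (s - 1))   # psi_N(psi_G(s)) for these choices
assert abs(mc - formula) < 0.01
```

For this particular pair the composition even identifies the law of M: exp(λp(s − 1)) is the generating function of a Poisson(λp) random variable (the "thinning" property).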

Example 14 (First Occurrence of Successive Hits).

Consider the problem of flipping a coin successively. Suppose P{H} = p and P{T} = q = 1 − p. In this case, we define Ω ≜ ⨉_{i=1}^∞ Ωi, where for each i, Ωi = {H, T}. This means that members of Ω are simply of the form ω = (ω1, ω2, . . .), where ωi ∈ Ωi. For each Ωi, we define Fi = {∅, {H}, {T}, {H, T}}. We take the F associated with Ω as the minimal σ-algebra containing all cylinders in Ω. An event E in ⨉_{i=1}^∞ Ωi is called a cylinder if E is of the form {(ω1, ω2, . . .) ∈ Ω : ω_{i_1} ∈ B1, . . . , ω_{i_k} ∈ Bk}, k finite, Bj ∈ F_{i_j}. In words, in a cylinder, only finitely many of the coordinates are specified. (For example, think of a vertical cylinder in IR^3, where we specify only two coordinates out of three; now extend this to IR^∞.) Let Xi = I{ωi = H} be a {0, 1}-valued random variable, and define X on ⨉_{i=1}^∞ Ωi as X ≜ (X1, X2, . . .). Finally, we define P on cylinders by forming the product of the probabilities of the coordinates (enforcing independence). For each (Ωi, Fi), we define Pi{H} = p and Pi{T} = q = 1 − p.


We would like to define a random variable that tells us when a run of successive heads appears for the first time. More precisely, define the random variable T₁ as a function of X as follows: T₁ ≜ min{i ≥ 1 : X_i = 1}. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, . . .), then T₁ = 3. More generally, we define T_k = min{i : X_{i−k+1} = X_{i−k+2} = · · · = X_i = 1}, the first time at which k successive heads have appeared. For each k, T_k is a r.v. on Ω. For example, if X = (0, 0, 1, 0, 1, 1, 0, 1, 1, 1, . . .), then T₂ = 6 and T₃ = 10.

Let y_k = E[T_k]. In what follows we will characterize y_k.

Special case: k = 1

It is easy to see that P{T₁ = 1} = p, P{T₁ = 2} = qp, P{T₁ = 3} = q²p, . . . , P{T₁ = n} = q^{n−1}p. Recall from undergraduate probability that the above probability law is called the geometric law, and T₁ is called a geometric random variable. Also, y₁ = ∑_{i=1}^∞ ipq^{i−1} = 1/p.

General case: k > 1

We begin by observing that T_k is actually an explicit function, f, say, of T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, . . . . Note, moreover, that T_{k−1} and (X_{T_{k−1}+1}, X_{T_{k−1}+2}, . . .) are independent. Therefore, by Property 4.3.8, E[T_k] = E[f(T_{k−1}, X_{T_{k−1}+1}, X_{T_{k−1}+2}, . . .)] = E[h(T_{k−1})], where h(t) = E[f(t, X_{t+1}, X_{t+2}, . . .)].

Now it is easy to check that E[f(t, X_{t+1}, . . .)|X_{t+1}] = (t + 1)I_{X_{t+1}=1} + (t + 1 + y_k)I_{X_{t+1}=0}. This essentially says that if it took us t flips to see k − 1 consecutive heads for the first time, then if the (t + 1)st flip is a head, we have achieved k successive heads at time t + 1; alternatively, if the (t + 1)st flip is a tail, then we have to start all over again (start afresh) while we have already wasted t + 1 units of time.

Now, E[f(t, X_{t+1}, . . .)] = E[E[f(t, X_{t+1}, . . .) | X_{t+1}]] = E[(t + 1)I_{X_{t+1}=1} + (t + 1 + y_k)I_{X_{t+1}=0}] = (t + 1)p + (t + 1 + y_k)q. Thus, h(t) = (t + 1)p + (t + 1 + y_k)q and y_k = E[h(T_{k−1})] = E[(T_{k−1} + 1)p + (T_{k−1} + 1 + y_k)q] = p + py_{k−1} + qy_{k−1} + qy_k + q.


Finally, we obtain

(9) y_k = p^{−1}y_{k−1} + p^{−1}.

We now invoke the initial condition y₁ = 1/p, which completes the characterization of y_k.

For example, if p = 1/2, then y₁ = 2, y₂ = 2y₁ + 2 = 6, and so on.
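The recursion (9), together with y₁ = 1/p, can be checked against a brute-force simulation of the coin-flipping experiment; the following sketch is illustrative:

```python
import random

def yk_recursion(p, k):
    """y_k from (9): y_k = p^{-1} y_{k-1} + p^{-1}, with y_1 = 1/p."""
    y = 1.0 / p
    for _ in range(2, k + 1):
        y = y / p + 1.0 / p
    return y

def yk_simulated(p, k, trials, seed=2):
    """Monte Carlo estimate of E[T_k]: flip until k consecutive heads appear."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        run, n = 0, 0
        while run < k:
            n += 1
            run = run + 1 if rng.random() < p else 0   # extend or reset the head run
        total += n
    return total / trials

p, k = 0.5, 3
print(yk_recursion(p, k))          # 14.0 for p = 1/2, k = 3 (y1=2, y2=6, y3=14)
print(yk_simulated(p, k, 200_000))
```

The simulated mean waiting time matches the value produced by iterating (9).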

Example 15 (Importance of the Independence Hypothesis in Property 4.3.8).

Let Y = XZ, where Z = 1 + X. Now, E[Y] = E[XZ] = E[X(1 + X)] = E[X] + E[X²]. However, if we erroneously attempt to use Property 4.3.8 (by ignoring the dependence of Z on X), we will have h(t) = E[Xt] = E[X]t, so that E[h(Z)] = E[E[X]Z] = E[X]E[Z] = E[X](1 + E[X]) = E[X] + E[X]². Note that the two answers are different, since E[X²] ≠ E[X]² in general.

5. The L1 conditional expectation

Consider a probability space (Ω, F, P) and let X be an integrable random variable, that is, E[|X|] < ∞. We write X ∈ L¹(Ω, F, P). Let D be a sub σ-algebra. For each M ≥ 1, we define X ∧ M as X when |X| ≤ M and M otherwise. Note that X ∧ M ∈ L²(Ω, F, P) (since it is bounded), so we can talk about E[(X ∧ M)|D], which we call Z_M. The question is: can we somehow use Z_M to define E[X|D]? Intuitively, we would want to think of E[X|D] as some kind of limit of Z_M as M → ∞. So one approach for the construction of E[X|D] is to take a "limit" of Z_M and then show that the limit, Z, say, satisfies Definition 6. It turns out that such an approach leads to the following definition, which we will adopt from now on.

Definition 7. [L¹ Conditional Expectation] If X is an integrable random variable, then we define Z = E[X|D] if Z has the following two properties: (1) Z is D-measurable; (2) E[ZY] = E[XY] for any bounded D-measurable random variable Y.

All the properties of the L² conditional expectation carry over to the L¹ conditional expectation.


We make the final remark that if a random variable X is square integrable, then it is integrable. The proof is easy. Suppose that X is square integrable. Then, E[|X|] = E[|X|I_{|X|≤1}] + E[|X|I_{|X|>1}] ≤ E[I_{|X|≤1}] + E[X²I_{|X|>1}] ≤ 1 + E[X²] < ∞.

6. Conditional probabilities

Now we will connect conditional expectations to the familiar conditional probabilities. In particular, we recall from undergraduate probability that if A and B are events, with P(B) > 0, then we define the conditional probability of A given that the event B has occurred as P(A|B) ≜ P(A ∩ B)/P(B). What is the connection between this definition and our notion of a conditional expectation?

Consider the σ-algebra D = {∅, Ω, B, B^c}, and consider E[I_A|D]. Because of the special form of this D, and because we know that this conditional expectation is D-measurable, we infer that E[I_A|D] can assume only two values: one value on B and another value on B^c. That is, we can write E[I_A|D] = aI_B + bI_{B^c}, where a and b are constants. We claim that a = P(A ∩ B)/P(B) and b = P(A ∩ B^c)/P(B^c). Note that because 0 < P(B) < 1, we are not dividing by zero. As seen from a homework problem, we can prove that (P(A ∩ B)/P(B))I_B + (P(A ∩ B^c)/P(B^c))I_{B^c} is actually the conditional expectation E[I_A|D] by showing that it satisfies the two defining properties listed in Definition 6 or Definition 7. Thus, E[I_A|D] encompasses both P(A|B) and P(A|B^c); that is, P(A|B) and P(A|B^c) are simply the values of E[I_A|D] on B and B^c, respectively. Also note that P(·|B) is actually a probability measure for each B as long as P(B) > 0.
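The claim can be verified exactly on a small finite example (the fair die and the events A and B below are assumptions made for illustration): we check that aI_B + bI_{B^c} satisfies the defining property E[ZY] = E[I_A Y] for every D-measurable Y.

```python
from fractions import Fraction

# Assumed finite example: a fair die; A = {even outcomes}, B = {1, 2, 3}.
Omega = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                      # probability of each outcome
A, B = {2, 4, 6}, {1, 2, 3}
Bc = set(Omega) - B

def prob(E):
    return p * len(E)

a = prob(A & B) / prob(B)               # P(A|B)   = 1/3
b = prob(A & Bc) / prob(Bc)             # P(A|B^c) = 2/3

def Z(w):
    """Candidate for E[I_A | D]: a on B, b on B^c, with D = {0, Omega, B, B^c}."""
    return a if w in B else b

def E(f):
    """Expectation over the six equally likely outcomes."""
    return sum(p * f(w) for w in Omega)

# Defining property: E[Z*Y] = E[I_A * Y] for every D-measurable Y.
checks = [(E(lambda w: Z(w) * Y(w)), E(lambda w: (w in A) * Y(w)))
          for Y in (lambda w: 1, lambda w: w in B, lambda w: w in Bc)]
print(checks)
```

The exact rational arithmetic shows equality for Y = 1, I_B, and I_{B^c}, which spans all bounded D-measurable random variables.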

7. Joint densities and marginal densities

Let X and Y be random variables on (Ω, F, P) with a distribution μ_XY(B) defined on all Borel subsets B of the plane. Suppose that X and Y have a joint density. That is, there exists an integrable function f_XY(·, ·) : IR² → IR, such that

(10) μ_XY(B) = ∫_B f_XY(x, y) dx dy, for any B ∈ B².


7.1. Marginal densities. Consider μ_XY(A × IR), where A ∈ B. Notice that A ↦ μ_XY(A × IR) is a probability measure on B. Also, by the integral representation shown in (10),

(11) μ_XY(A × IR) = ∫_{A×IR} f_XY(x, y) dx dy = ∫_A (∫_IR f_XY(x, y) dy) dx ≡ ∫_A f_X(x) dx,

where f_X(·) is called the marginal density of X. Note that f_X(·) qualifies as the pdf of X. Thus, we arrive at the familiar result that the pdf of X can be obtained from the joint pdf of X and Y through integration. Similarly, f_Y(y) = ∫_IR f_XY(x, y) dx.

7.2. Conditional densities. Suppose that X and Y have a joint density f_XY. Then,

(12) E[Y|X] = (∫_IR y f_XY(X, y) dy) / (∫_IR f_XY(X, y) dy),

and in particular,

(13) E[Y|X = x] = (∫_IR y f_XY(x, y) dy) / (∫_IR f_XY(x, y) dy) ≡ g(x).

Proof

We will verify the definition of a conditional expectation given in Definition 7. First, we must show that g(X) is σ(X)-measurable. This follows from the fact that g is Borel measurable. (This is because g is obtained by integrating a Borel-measurable function on the plane over one of the variables. This is proved in a course on integration. Let's not worry about the details now.)

Next, we must show that E[g(X)W] = E[Y W] for every bounded σ(X)-measurable W. Without loss of generality, assume that W = h(X) for some Borel function h. To


see that E[g(X)h(X)] = E[Y h(X)], write

E[g(X)h(X)] = E[((∫_IR y f_XY(X, y) dy) / (∫_IR f_XY(X, y) dy)) h(X)]

= ∫_IR ∫_IR ((∫_IR t f_XY(x, t) dt) / (∫_IR f_XY(x, t) dt)) h(x) f_XY(x, y) dx dy

= ∫_IR ∫_IR ((∫_IR t f_XY(x, t) dt) / f_X(x)) h(x) f_XY(x, y) dx dy

= ∫_IR ((∫_IR t f_XY(x, t) dt) / f_X(x)) h(x) (∫_IR f_XY(x, y) dy) dx

= ∫_IR ((∫_IR t f_XY(x, t) dt) / f_X(x)) h(x) f_X(x) dx

= ∫_IR (∫_IR t f_XY(x, t) dt) h(x) dx

= ∫_IR ∫_IR t h(x) f_XY(x, t) dt dx

= E[Y h(X)].

In the above, we have used two facts about integration: (1) that we can write a double integral as an iterated integral, and (2) that we can exchange the order of integration in the iterated integrals. These two properties are always true as long as the integrand is non-negative. This is a consequence of what is called Fubini's Theorem in analysis [2, Chapter 8]. To complete the proof, we need to address the concern over the points at which f_X(·) = 0 (in which case we will not be able to cancel the f_X's in the numerator and the denominator in the above development). However, it is straightforward to show that P{f_X(X) = 0} = 0, which guarantees that we can exclude these problematic points from the integration in the x-direction without changing the integral. □

With the above results, we can think of

f_XY(x, y) / ∫_IR f_XY(x, y) dy


as a conditional pdf. In particular, if we define

(14) f_{Y|X}(y|x) ≜ f_XY(x, y) / ∫_IR f_XY(x, y) dy = f_XY(x, y) / f_X(x),

then we can calculate the conditional expectation E[Y|X] using the formula ∫_IR y f_{Y|X}(y|X) dy, which is the familiar result that we know from undergraduate probability.
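Formula (13) can be exercised on a concrete joint density. The choice f_XY(x, y) = x + y on [0, 1]² below is an assumption for illustration; for it, (13) gives g(x) = (x/2 + 1/3)/(x + 1/2), which we compare against a brute-force conditional average:

```python
import random

def g(x):
    """E[Y|X=x] from (13) for the assumed joint density f_XY(x,y) = x + y on [0,1]^2."""
    return (x / 2 + 1 / 3) / (x + 1 / 2)

def cond_mean_mc(x0, half_width=0.02, trials=1_000_000, seed=3):
    """Rejection-sample (X,Y) from f_XY, then average Y over the slab |X - x0| < half_width."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if rng.random() < (x + y) / 2:      # accept with probability f_XY(x,y)/max f_XY
            if abs(x - x0) < half_width:
                total += y
                count += 1
    return total / count

est = cond_mean_mc(0.5)
print(g(0.5), est)      # formula value 7/12 vs the empirical slab average
```

The empirical conditional mean near x = 1/2 agrees with g(1/2) = 7/12 up to sampling noise and the small bias introduced by the finite slab width.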

8. Convergence of random sequences

Consider a sequence of random variables X₁, X₂, . . . , defined on the product space (Ω, F, P), where, as before, Ω = X_{i=1}^∞ Ω_i is the infinite-dimensional Cartesian product space and F is the smallest σ-algebra containing all cylinders. Since each X_i is a random variable, when we talk about convergence of random variables we must do so in the context of functions.

8.1. Pointwise convergence. Let us take Ω_i = [0, 1], F = B ∩ [0, 1], and take P as Lebesgue measure (generalized length) on [0, 1]. Consider the sequence of functions f_n(t) defined as follows:

(15) f_n(t) = { n²t, 0 ≤ t ≤ 1/(2n); n − n²t, 1/(2n) < t ≤ 1/n; 0, otherwise }.

It is straightforward to see that for any t, the sequence of functions f_n(t) converges to the constant function 0. We say that f_n(t) → 0 pointwise, everywhere, or surely. Generally, a sequence of random variables X_n converges to a random variable X if for every ω ∈ Ω, X_n(ω) → X(ω). To be more precise, for every ε > 0 and every ω ∈ Ω, there exists n₀ ∈ IN such that |X_n(ω) − X(ω)| < ε whenever n ≥ n₀. Note that in this definition n₀ not only depends upon the choice of ε but also on the choice of ω. Note that in the example above we cannot drop the dependence of n₀ on ω. To see that, observe that sup_t |f_n(t) − 0| = n/2; thus, there is no single n₀ that will work for every t.
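A quick numerical look at (15) illustrates both claims: for a fixed t the values f_n(t) eventually vanish, while the peak value n/2 grows without bound:

```python
def f(n, t):
    """The tent function in (15): slope n^2 up to the peak n/2 at t = 1/(2n), support [0, 1/n]."""
    if 0 <= t <= 1 / (2 * n):
        return n * n * t
    if 1 / (2 * n) < t <= 1 / n:
        return n - n * n * t
    return 0.0

t0 = 0.01
vals = [f(n, t0) for n in (10, 100, 1000, 10000)]
print(vals)        # for fixed t0, f_n(t0) = 0 once 1/n < t0: pointwise convergence
peaks = [f(n, 1 / (2 * n)) for n in (10, 100, 1000)]
print(peaks)       # sup_t f_n(t) = n/2 grows, so the convergence is not uniform
```

For t₀ = 0.01 the sequence is eventually identically zero, yet the suprema 5, 50, 500, . . . diverge: n₀ must depend on the point t.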


8.2. Uniform convergence. The strongest type of convergence is what's called uniform convergence, in which case n₀ is independent of the choice of ω. More precisely, a sequence of random variables X_n converges to a random variable X uniformly if for every ε > 0, there exists n₀ ∈ IN such that |X_n(ω) − X(ω)| < ε for any ω ∈ Ω whenever n ≥ n₀. This is equivalent to saying that for every ε > 0, there exists n₀ ∈ IN such that sup_{ω∈Ω} |X_n(ω) − X(ω)| < ε whenever n ≥ n₀. Can you make a small change in the example above to make the convergence uniform?

8.3. Almost-sure convergence (or almost-everywhere convergence). Now consider a slight variation of the example in (15):

(16) f_n(t) = { n²t, 0 < t ≤ 1/(2n); n − n²t, 1/(2n) < t ≤ 1/n; 1, t ∈ (1/n, 1] ∩ ℚ; 0, t ∈ (1/n, 1] ∩ ℚ^c }.

Note that in this case f_n(t) → 0 for t ∈ [0, 1) ∩ ℚ^c. Note that the Lebesgue measure (or generalized length) of the set of points for which f_n(t) does not converge to the function 0 is zero. (The latter is because the measure of the set of rational numbers is zero. To see that, let r₁, r₂, . . . be an enumeration of the rationals. Pick any ε > 0, and define the intervals J_n = (r_n − 2^{−n−1}ε, r_n + 2^{−n−1}ε). Now, if we sum up the lengths of all of these intervals, we obtain ε. However, since ℚ ⊂ ∪_{n=1}^∞ J_n, the "length" of the set ℚ cannot exceed ε. Since ε can be selected arbitrarily small, we conclude that the Lebesgue measure of ℚ must be zero.) In this case, we say that the sequence of functions f_n converges to the constant function 0 almost everywhere.

Generally, we say that a sequence of random variables X_n converges to a random variable X almost surely if there exists A ∈ F, with the property P(A) = 1, such that for every ω ∈ A, X_n(ω) → X(ω). This is the strongest convergence statement that we can make in a probabilistic sense (because we don't care about the points that belong to a set that has probability zero). Note that we can express this type of convergence by saying that X_n → X almost surely (or a.s.) if P{ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)} = 1, or simply when P{lim_{n→∞} X_n = X} = 1.


8.4. Convergence in probability (or convergence in measure). We say that the sequence X_n converges to a random variable X in probability (also called convergence in measure) when for every ε > 0, lim_{n→∞} P{|X_n − X| > ε} = 0. This is a weaker type of convergence than almost-sure convergence, as seen next.

Theorem 6. Almost sure convergence implies convergence in probability.

Proof

Let ε > 0 be given. For each N ≥ 1, define E_N = {|X_n − X| > ε for some n ≥ N}. If we define E ≜ ∩_{N=1}^∞ E_N, we observe that ω ∈ E implies that X_n(ω) does not converge to X(ω). (This is because if ω ∈ E, then no matter how large we pick N, N₀ say, there will be an n ≥ N₀ such that |X_n(ω) − X(ω)| > ε.) Hence, since we know that X_n converges to X a.s., it must be true that P(E) = 0. Now observe that since E₁ ⊃ E₂ ⊃ · · ·, P(E) = lim_{N→∞} P(E_N), and therefore lim_{N→∞} P(E_N) = 0. Next, observe that if ω ∈ {|X_n − X| > ε} and n ≥ N, then ω ∈ {|X_m − X| > ε for some m ≥ N} = E_N. Thus, for n ≥ N, {|X_n − X| > ε} ⊂ E_N and P{|X_n − X| > ε} ≤ P(E_N). Hence, lim_{n→∞} P{|X_n − X| > ε} = 0. □

This theorem is not true, however, in infinite measure spaces. For example, take Ω = IR, F = B, and the Lebesgue measure m. Let X_n(ω) = ω/n. Then clearly, X_n → 0 everywhere, but m{|X_n − 0| > ε} = ∞ for any ε > 0. Where did we use the fact that P is a finite measure in the proof of Theorem 6?

8.5. Convergence in the mean-square sense (or in L²). We say that the sequence X_n converges to a random variable X in the mean-square sense (also in L²) if lim_{n→∞} E[|X_n − X|²] = 0. Recall that this is precisely the type of convergence we defined in the Hilbert-space context, which we used to introduce the conditional expectation. This is a stronger type of convergence than convergence in probability, as seen next.

Theorem 7. Convergence in the mean-square sense implies convergence in probability.


Proof

This is an easy consequence of the Markov inequality when taking ψ(·) = (·)², which yields P{|X_n − X| > ε} ≤ E[|X_n − X|²]/ε². □

Convergence in probability does not imply almost-sure convergence, as seen from the following example.

Example 16 (The marching-band functions). Consider Ω = [0, 1], take P to be Lebesgue measure on [0, 1], and for n ∈ {2^k, . . . , 2^{k+1} − 1}, k = 0, 1, 2, . . . , define

(17) X_n(ω) = I_{[2^{−k}(n−2^k), 2^{−k}(n−2^k+1))}(ω).

In words, the indicators sweep across [0, 1) in "generations": generation k consists of the 2^k indicators of the dyadic intervals of width 2^{−k}. Note that for any ω ∈ [0, 1), X_n(ω) = 1 for infinitely many n's (once per generation); thus, X_n does not converge to 0 anywhere. At the same time, if n ∈ {2^k, . . . , 2^{k+1} − 1}, then P{|X_n − 0| > ε} = 2^{−k}; hence, lim_{n→∞} P{|X_n − 0| > ε} = 0 and X_n converges to 0 in probability.

Note that in the above example, lim_{n→∞} E[|X_n − 0|²] = 0, so X_n also converges to 0 in L².
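The two claims can be checked numerically. The sketch below indexes the marching band by writing n = 2^k + j with j ∈ {0, . . . , 2^k − 1}, so that the n-th function is the indicator of the dyadic interval [j2^{−k}, (j+1)2^{−k}) — one convenient way to enumerate the generations:

```python
def X(n, omega):
    """Marching-band indicator: for 2^k <= n < 2^{k+1}, the indicator of
    the dyadic interval [2^{-k}(n - 2^k), 2^{-k}(n - 2^k + 1)) of width 2^{-k}."""
    k = n.bit_length() - 1          # largest k with 2^k <= n
    j = n - 2 ** k                  # position of the interval within [0, 1)
    lo, hi = j / 2 ** k, (j + 1) / 2 ** k
    return 1 if lo <= omega < hi else 0

omega = 0.3
hits = [n for n in range(1, 2 ** 12) if X(n, omega)]
print(len(hits))    # one hit per generation k = 0..11: X_n(omega) = 1 infinitely often

# P{X_n = 1} computed on an aligned grid of 1024 points: equals 2^{-k}.
prob = [sum(X(n, i / 1024) for i in range(1024)) / 1024 for n in (4, 64, 1024)]
print(prob)         # shrinks to 0, so X_n -> 0 in probability
```

Every point of [0, 1) is hit once per generation (so never converges pointwise), while the measure of {X_n = 1} shrinks like 2^{−k}.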

Recall the example given in (15) and let's modify it slightly as follows:

(18) f_n(t) = { n^{3/2}t, 0 ≤ t ≤ 1/(2n); n^{1/2} − n^{3/2}t, 1/(2n) < t ≤ 1/n; 0, otherwise }.

In this case, f_n continues to converge to 0 at every point; nonetheless, E[|f_n − 0|²] = 1/12 for every n. Thus, f_n does not converge to 0 in L².
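The constant 1/12 follows by direct integration — the tent in (18) has height √n/2 over a base of width 1/n, so ∫₀¹ f_n(t)² dt = 2∫₀^{1/(2n)} (n^{3/2}t)² dt = 1/12 for every n — and a numerical quadrature confirms it:

```python
def fn(n, t):
    """The rescaled tent in (18): height sqrt(n)/2 at t = 1/(2n), support [0, 1/n]."""
    if 0 <= t <= 1 / (2 * n):
        return n ** 1.5 * t
    if 1 / (2 * n) < t <= 1 / n:
        return n ** 0.5 - n ** 1.5 * t
    return 0.0

def l2_norm_sq(n, steps=200_000):
    """Midpoint-rule approximation of E[|f_n|^2] over the support [0, 1/n]."""
    h = 1.0 / (n * steps)
    return sum(fn(n, (i + 0.5) * h) ** 2 for i in range(steps)) * h

vals = [l2_norm_sq(n) for n in (2, 8, 32)]
print(vals)    # all close to 1/12: the L2 norm does not shrink as n grows
```

The squared L² norm is the same for every n, which is exactly why pointwise convergence to 0 here does not yield L² convergence.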

Lemma 1. If f is a continuous function, then Xn → X in probability implies

f(Xn) → f(X) in probability.

Proof: Left as an exercise.


8.6. Convergence in distribution. A sequence X_n is said to converge to a random variable X in distribution if the distribution functions F_{X_n}(x) converge to the distribution function of X, F_X(x), at every point of continuity of F_X(x).

Theorem 8. Convergence in probability implies convergence in distribution.

Proof (Adapted from T. G. Kurtz)

Pick ε > 0 and note that

P{X ≤ x + ε} = P{X ≤ x + ε, X_n > x} + P{X ≤ x + ε, X_n ≤ x}

and

P{X_n ≤ x} = P{X_n ≤ x, X > x + ε} + P{X_n ≤ x, X ≤ x + ε}.

Thus,

(19) P{X_n ≤ x} − P{X ≤ x + ε} = P{X_n ≤ x, X > x + ε} − P{X ≤ x + ε, X_n > x}.

Note that since {X_n ≤ x, X > x + ε} ⊂ {|X_n − X| > ε},

P{X_n ≤ x} − P{X ≤ x + ε} ≤ P{|X_n − X| > ε}.

By taking the limit superior of both sides and noting that lim_{n→∞} P{|X_n − X| > ε} = 0, we obtain lim sup_{n→∞} P{X_n ≤ x} ≤ P{X ≤ x + ε}.

Now replace every occurrence of ε in (19) with −ε and rearrange to obtain

−P{X_n ≤ x} + P{X ≤ x − ε} = −P{X_n ≤ x, X > x − ε} + P{X ≤ x − ε, X_n > x}.

Since {X ≤ x − ε, X_n > x} ⊂ {|X_n − X| > ε}, we have

−P{X_n ≤ x} + P{X ≤ x − ε} ≤ P{|X_n − X| > ε}.

By taking the limit inferior of both sides, we obtain lim inf_{n→∞} P{X_n ≤ x} ≥ P{X ≤ x − ε}.

Combining, we obtain P{X ≤ x − ε} ≤ lim inf_{n→∞} P{X_n ≤ x} ≤ lim sup_{n→∞} P{X_n ≤ x} ≤ P{X ≤ x + ε}. Thus, by letting ε → 0 we conclude that lim_{n→∞} F_{X_n}(x) = F_X(x) at every point of continuity of F_X. □


Theorem 9 (Bounded Convergence Theorem). Suppose that X_n converges to X in probability, and that |X_n| ≤ M a.s., ∀n, for some fixed M. Then E[X_n] → E[X] as n → ∞.

The following example shows that the boundedness requirement is not superfluous. Take Ω = [0, 1], F = B ∩ [0, 1], and m as Lebesgue measure (i.e., m(A) = ∫_A dx). Consider functions f_n(t) such as f_n = nI_{(0,1/n)} (the figure is omitted here). Note that f_n(t) → 0, ∀t ∈ Ω. But E[|f_n − 0|] = 1 for every n. Therefore, E[f_n] ↛ E[0] = 0.

Proof of Theorem (Adapted from A. Beck).

Let E_n = {|X_n| > M} and note that P(E_n) = 0 by the boundedness assumption. Define E₀ = ∪_{n=1}^∞ E_n and note that P(E₀) ≤ ∑_{n=1}^∞ P(E_n) = P(E₁) + P(E₂) + · · · = 0.

We will first show that |X| ≤ M a.s. Consider W_{n,k} ≜ {|X_n − X| > 1/k}, k = 1, 2, . . . . Since X_n converges to X in probability, P(W_{n,k}) → 0 as n → ∞. Let F_k = {|X| > M + 1/k}. Note that F_k ⊂ W_{n,k} a.s., regardless of n, and P(F_k) ≤ P(W_{n,k}). Then, P(F_k) ≤ lim_{n→∞} P(W_{n,k}) = 0, or P(F_k) = 0. Therefore, P(∪_{k=1}^∞ F_k) = 0. Since ∪_{k=1}^∞ F_k = {|X| > M}, |X| ≤ M a.s.

Let ε > 0 be given. Let F₀ ≜ ∪_{k=1}^∞ F_k and G_n = {|X_n − X| > ε}. Note that Ω can be written as the disjoint union (F₀ ∪ E₀) ∪ (G_n ∪ F₀ ∪ E₀)^c ∪ (G_n \ (F₀ ∪ E₀)). Thus, |E[X_n − X]| ≤ E[|X_n − X|] = E[|X_n − X|I_{F₀∪E₀}] + E[|X_n − X|I_{G_n\(F₀∪E₀)}] + E[|X_n − X|I_{(G_n∪F₀∪E₀)^c}] ≤ 0 + 2M E[I_{G_n}] + ε = 2M E[I_{G_n}] + ε.


Since P(G_n) → 0, there exists n₀ such that n ≥ n₀ implies P(G_n) ≤ ε/(2M). Hence, for n ≥ n₀, |E[X_n] − E[X]| ≤ 2M · ε/(2M) + ε = 2ε, which establishes E[X_n] → E[X] as n → ∞. □

Lemma 2 (Adapted from T. G. Kurtz). Suppose that X ≥ 0 a.s. Then, lim_{M→∞} E[X ∧ M] = E[X].

Proof

First note that the statement is true when X is a discrete random variable. Then recall that we can always approximate a random variable X monotonically from below by a discrete random variable. More precisely, define X_n ≤ X as in (5) and note that X_n ↑ X and E[X_n] ↑ E[X]. Now since X_n ∧ M ≤ X ∧ M ≤ X, we have

E[X_n] = lim_{M→∞} E[X_n ∧ M] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X].

Take the limit as n → ∞ to obtain

E[X] ≤ lim inf_{M→∞} E[X ∧ M] ≤ lim sup_{M→∞} E[X ∧ M] ≤ E[X],

from which the lemma follows. □

Theorem 10 (Monotone Convergence Theorem, Adapted from T. Kurtz). Suppose

that 0 ≤ Xn ≤ X and Xn converges to X in probability. Then, limn→∞ E[Xn] = E[X].

Proof

For M > 0, Xn ∧M ≤ Xn ≤ X. Thus,

E[Xn ∧M ] ≤ E[Xn] ≤ E[X].


By the bounded convergence theorem, lim_{n→∞} E[X_n ∧ M] = E[X ∧ M]. Hence,

E[X ∧ M] ≤ lim inf_{n→∞} E[X_n] ≤ lim sup_{n→∞} E[X_n] ≤ E[X].

Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we finally obtain

E[X] ≤ lim inf_{n→∞} E[X_n] ≤ lim sup_{n→∞} E[X_n] ≤ E[X],

and the theorem is proven. □

Lemma 3 (Fatou's Lemma, Adapted from T. G. Kurtz). Suppose that X_n ≥ 0 a.s. and X_n converges to X in probability. Then, lim inf_{n→∞} E[X_n] ≥ E[X].

Proof

For M > 0, X_n ∧ M ≤ X_n. Thus,

E[X ∧ M] = lim_{n→∞} E[X_n ∧ M] ≤ lim inf_{n→∞} E[X_n],

where the left equality is due to the bounded convergence theorem. Now by Lemma 2, lim_{M→∞} E[X ∧ M] = E[X], and we obtain E[X] ≤ lim inf_{n→∞} E[X_n]. □

Theorem 11 (Dominated Convergence Theorem). Suppose that X_n → X in probability, |X_n| ≤ Y, and E[Y] < ∞. Then, E[X_n] → E[X].

Proof:

Since X_n + Y ≥ 0 and Y + X_n → Y + X in probability, we use Fatou's lemma to write lim inf_{n→∞} E[Y + X_n] ≥ E[Y + X]. This implies E[Y] + lim inf_{n→∞} E[X_n] ≥ E[Y] + E[X], or lim inf_{n→∞} E[X_n] ≥ E[X]. Similarly, lim inf_{n→∞} E[Y − X_n] ≥ E[Y − X], which implies E[Y] + lim inf_{n→∞}(−E[X_n]) ≥ E[Y] − E[X], or lim inf_{n→∞}(−E[X_n]) ≥ −E[X]. But lim inf_{n→∞}(−x_n) = −lim sup_{n→∞} x_n, and therefore lim sup_{n→∞} E[X_n] ≤ E[X]. In summary, we have lim sup_{n→∞} E[X_n] ≤ E[X] ≤ lim inf_{n→∞} E[X_n], which implies lim_{n→∞} E[X_n] = E[X]. □

9. The L1 Conditional expectation, revisited

Consider a probability space (Ω, F, P) and let X ∈ L¹(Ω, F, P). Let D be a sub σ-algebra. For each n ≥ 1, we define X_n ≜ X ∧ n (defined as X when |X| ≤ n and n otherwise). Note that X_n ∈ L²(Ω, F, P) (since it is bounded), so we can talk about


E[X_n|D] as the projection of X_n onto L²(Ω, D, P), which we call Z_n. We will now show that Z_n is a Cauchy sequence in L¹, which means lim_{m,n→∞} E[|Z_n − Z_m|] = 0. We already know that E[(Z_n − Z_m)Y] = E[(X_n − X_m)Y] for any Y ∈ L²(Ω, D, P). In particular, pick Y = I_{Z_n−Z_m>0} − I_{Z_n−Z_m≤0}. In this case, (Z_n − Z_m)Y = |Z_n − Z_m|. Thus, we conclude that E[|Z_n − Z_m|] = E[(X_n − X_m)(I_{Z_n−Z_m>0} − I_{Z_n−Z_m≤0})]. But the right-hand side is no greater than E[|X_n − X_m|] (why?), and we obtain E[|Z_n − Z_m|] ≤ E[|X_n − X_m|]. However, E[|X_n − X_m|] ≤ E[|X|I_{|X|≥min(m,n)}] → 0 as m, n → ∞ by the dominated convergence theorem (verify this), and we conclude that lim_{m,n→∞} E[|Z_n − Z_m|] = 0.

From a key theorem in analysis (e.g., see Rudin, Chapter 3), we know that any L¹ space is complete. Thus, there exists Z ∈ L¹(Ω, D, P) such that lim_{n→∞} E[|Z_n − Z|] = 0. Further, Z_n has a subsequence, Z_{n_k}, that converges almost surely to Z. We take this Z as a candidate for E[X|D]. But first, we must show that Z satisfies E[ZY] = E[XY] for any bounded, D-measurable Y. This is easy to show. Suppose that |Y| < M for some M. We already know that E[Z_nY] = E[X_nY], and by the dominated convergence theorem, E[X_nY] → E[XY] (since |X_n| ≤ |X|). Also, |E[Z_nY] − E[ZY]| ≤ E[|Z_n − Z||Y|] ≤ M E[|Z_n − Z|] → 0. Hence, E[Z_nY] → E[ZY]. This leads to the conclusion that E[ZY] = E[XY] for any bounded, D-measurable Y.

In summary, we have found a Z ∈ L¹(Ω, D, P) such that E[ZY] = E[XY] for any bounded, D-measurable Y. We define this Z as E[X|D].

10. Central Limit Theorems

To establish this theorem, we need the concept of a characteristic function. Let Y be a r.v. We define the characteristic function of Y, φ_Y(u), as E[e^{iuY}], where E[e^{iuY}] exists if E[|Re(e^{iuY})|] < ∞ and E[|Im(e^{iuY})|] < ∞. Note that |Re(e^{iuY})| ≤ |e^{iuY}| = 1 and |Im(e^{iuY})| ≤ 1. Therefore, E[|Re(e^{iuY})|] ≤ 1 and E[|Im(e^{iuY})|] ≤ 1. Hence, E[e^{iuY}] is well defined for any u ∈ IR. If Y has a pdf f_Y(y), then E[e^{iuY}] = ∫_{−∞}^∞ e^{iuy} f_Y(y) dy, which is the Fourier transform of f_Y evaluated at −u.

Example 17. Y ∈ {0, 1} and P{Y = 1} = p. In this case φ_Y(u) = E[e^{iuY}] = pe^{iu} + (1 − p)e^{iu·0} = pe^{iu} + (1 − p).


Example 18. Y ∼ N(0, 1). Then, E[e^{iuY}] = e^{−u²/2} (see the justification below).

We also need the following result, which relates distributions to characteristic functions.

10.1. Levy's Inversion Lemma.

Lemma 4. If φ_X(u) is the characteristic function of X, then

lim_{c→∞} (1/(2π)) ∫_{−c}^{c} ((e^{−iua} − e^{−iub})/(iu)) φ_X(u) du = P{a < X < b} + (P{X = a} + P{X = b})/2.

See Chow and Teicher for a proof.

As a special case, if X is an absolutely continuous random variable with pdf f_X, then (1/(2π)) ∫_{−∞}^∞ e^{−iux} φ_X(u) du = f_X(x).

Theorem 12 (Central limit theorem). Suppose {X_k} is a sequence of i.i.d. random variables with E[X_k] = μ and Var{X_k} = σ². Let S_n = ∑_{k=1}^n X_k and Y_n = (S_n − nμ)/(√n σ). Then, Y_n converges to Z in distribution, where Z ∼ N(0, 1) (zero-mean Gaussian random variable with unit variance).

Sketch of the proof

Observe that

Φ_{Y_n}(ω) = E[e^{jωY_n}] = E[e^{jω(∑_{k=1}^n X_k − nμ)/(√n σ)}] = E[∏_{k=1}^n e^{(jω/(√n σ))(X_k − μ)}] = ∏_{k=1}^n E[e^{(jω/(√n σ))(X_k − μ)}] = ∏_{k=1}^n Φ_{X_k−μ}(ω/(√n σ)) = (Φ_{X_1−μ}(ω/(√n σ)))^n,

where the interchange of expectation and product is justified by independence, and the last equality by the identical distributions.


We now expand Φ_{X_k−μ}(ω) as

Φ_{X_k−μ}(ω) = E[e^{jω(X_k−μ)}]
= E[1 + jω(X_k − μ) + ((jω)²/2!)(X_k − μ)² + ((jω)³/3!)(X_k − μ)³ + · · ·]
= 1 + jωE[X_k − μ] − (ω²/2)E[(X_k − μ)²] + ((jω)³/3!)E[(X_k − μ)³] + · · ·
= 1 − (ω²/2)E[(X_k − μ)²] + ((jω)³/3!)E[(X_k − μ)³] + · · · .

Then we have

Φ_{X_k−μ}(ω/(√n σ)) = 1 − (ω²/(2nσ²))E[(X_k − μ)²] − (jω³/(3! n^{3/2} σ³))E[(X_k − μ)³] + · · · ≈ 1 − (ω²/(2nσ²))E[(X_k − μ)²] = 1 − ω²/(2n).

Then

Φ_{Y_n}(ω) = (1 − ω²/(2n))^n

and

lim_{n→∞} Φ_{Y_n}(ω) = lim_{n→∞} (1 − (ω²/2)/n)^n = e^{−ω²/2}.


On the other hand, the characteristic function of Z is Φ_Z(ω) = E[e^{jωZ}]. Then

Φ_Z(ω) = E[e^{jωZ}] = ∫_{−∞}^∞ e^{jωz} f_Z(z) dz = ∫_{−∞}^∞ e^{jωz} (1/√(2π)) e^{−z²/2} dz = (1/√(2π)) ∫_{−∞}^∞ e^{−(z² − 2jωz)/2} dz = (1/√(2π)) ∫_{−∞}^∞ e^{−[(z − jω)² + ω²]/2} dz = (1/√(2π)) e^{−ω²/2} ∫_{−∞}^∞ e^{−(z − jω)²/2} dz = e^{−ω²/2}.

Therefore, lim_{n→∞} Φ_{Y_n}(ω) = Φ_Z(ω) and F_{Y_n} → F_Z as n → ∞ by the inversion lemma. □
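The theorem is easy to visualize by simulation. The sketch below assumes (only for illustration) exponential(1) summands, so μ = σ = 1, and compares the empirical CDF of Y_n with the N(0, 1) CDF Φ:

```python
import math
import random

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Assumed example: X_k i.i.d. exponential(1), so mu = sigma = 1.
n, reps = 200, 10_000
rng = random.Random(4)
Yn = [(sum(rng.expovariate(1.0) for _ in range(n)) - n) / math.sqrt(n)
      for _ in range(reps)]

# Empirical CDF of Y_n at a few points versus the N(0,1) CDF.
est = {x: sum(1 for y in Yn if y <= x) / reps for x in (0.0, 1.0)}
print(est, Phi(0.0), Phi(1.0))
```

Even for heavily skewed summands, the empirical distribution of Y_n is already close to the Gaussian limit at n = 200.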

10.2. Central limit theorem for non-identical random variables. Suppose {X_k} is a sequence of random variables that are independent but not identically distributed, with E{X_k} = μ_k and Var{X_k} = σ_k². Let S_n = ∑_{k=1}^n X_k. Then E{S_n} = ∑_{k=1}^n μ_k ≜ η_n and Var{S_n} = ∑_{k=1}^n σ_k² ≜ β_n².

Let Y_n = (S_n − η_n)/β_n and suppose {X_k} satisfies the following two conditions:

a) lim_{n→∞} β_n² = lim_{n→∞} ∑_{k=1}^n σ_k² = ∞;

b) there exist α > 2 and c < ∞ such that E[|X_k|^α] ≤ c ∀k.

Then, Y_n converges in distribution to a zero-mean, unit-variance Gaussian random variable (for a proof see, for example, Chow and Teicher). Simple conditions that imply the two conditions above are:

a′) there is a constant ε > 0 such that σ_k² > ε ∀k;

b′) all densities f_k(x) vanish outside a finite interval (−c, c) for some c.

11. Strong Law of Large Numbers

Suppose that X₁, X₂, . . . , are i.i.d. random variables, E[X₁] = μ < ∞, and E[(X₁ − μ)²] = σ² < ∞. Let S_n ≜ n^{−1} ∑_{i=1}^n X_i. Note that E[S_n] = μ < ∞ and E[(S_n − μ)²] =


n^{−1}σ². By Chebyshev's inequality,

P{|S_n − μ| > ε} ≤ σ²/(nε²),

and therefore we conclude that S_n converges to μ in probability. This is called the weak law of large numbers.

As it turns out, this convergence occurs almost surely, yielding what is called the strong law of large numbers. To prove one version of the strong law, one needs the notion of a sequence of events occurring infinitely often. Roughly speaking, if A₁, A₂, . . . , is a sequence of events (think, for example, of the sequence {|S_n − μ| > ε}), then one can look for the collection of all outcomes ω, each of which belongs to infinitely many of the events A_n. For example, if ω_o is such an outcome, and if we take A_n = {|S_n − μ| > ε}, where ε > 0, then we would know that S_n(ω_o) cannot converge to μ.

More generally, we define the event {A_n occurs infinitely often}, or for short {A_n i.o.}, as

{A_n i.o.} ≜ ∩_{n=1}^∞ ∪_{k=n}^∞ A_k.

The above event is also referred to as the limit superior of the sequence A_n; that is, {A_n i.o.} = lim sup_{n→∞} A_n. The terminology of limit superior makes sense since ∪_{k=n}^∞ A_k is a decreasing sequence and the intersection yields the infimum of the decreasing sequence.

Before we prove a key result on the probability of events such as {A_n i.o.}, we will give an example showing their use.

Let Y_n be any sequence, and let Y be a random variable (possibly a candidate limit of the sequence provided that the sequence is convergent). Consider the event {lim_{n→∞} Y_n = Y}, the collection of all outcomes ω for which Y_n(ω) → Y(ω). It is easy to see that

(20) {lim_{n→∞} Y_n = Y}^c = ∪_{ε>0, ε∈ℚ} {|Y_n − Y| > ε i.o.}.

This is because if ω_o ∈ {|Y_n − Y| > ε i.o.} for some ε > 0, then we are guaranteed that for any n ≥ 1, |Y_k(ω_o) − Y(ω_o)| > ε for infinitely many k ≥ n; thus, Y_n(ω_o) does not converge to Y(ω_o).


Lemma 5 (Borel-Cantelli Lemma). Let A1, A2, . . . be any sequence of events. If Σ_{n=1}^∞ P(An) < ∞, then P{An i.o.} = 0.

Proof

From (11) we know that {An i.o.} ⊂ ∪_{k≥n} Ak for any n, and thus P({An i.o.}) ≤ P(∪_{k≥n} Ak) for any n. Now note that Σ_{n=1}^∞ P(An) < ∞ implies Σ_{k=n}^∞ P(Ak) → 0 as n → ∞. Since P(∪_{k=n}^∞ Ak) ≤ Σ_{k=n}^∞ P(Ak), and since ∪_{k=n}^∞ Ak is a decreasing sequence of events, P(∩_{n=1}^∞ ∪_{k=n}^∞ Ak) = lim_{n→∞} P(∪_{k=n}^∞ Ak) = 0. Hence, P({An i.o.}) = 0. □
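The tail-sum argument in the proof can be checked numerically. With the hypothetical choice P(Ak) = 1/k^2 (a summable sequence of probabilities), the bound P(∪_{k≥n} Ak) ≤ Σ_{k≥n} 1/k^2 visibly shrinks to 0 as n grows, which is exactly what forces P{An i.o.} = 0:

```python
# Numerical sketch of the tail bound in the Borel-Cantelli proof:
# if sum_n P(A_n) < infinity, then the tail sums sum_{k>=n} P(A_k) -> 0,
# and these dominate P(union_{k>=n} A_k). Here P(A_k) = 1/k^2 is a
# hypothetical summable choice.

def tail_sum(n, terms=10**5):
    """Approximate sum_{k>=n} 1/k^2 by a long partial sum."""
    return sum(1.0 / k**2 for k in range(n, n + terms))

print(tail_sum(1))    # close to pi^2/6 ~ 1.6449: the full series is finite
print(tail_sum(100))  # close to 1/100: the tail bound is already small
```

The first value approximates the convergent series Σ 1/k^2 = π^2/6; the tails behave like 1/n and tend to 0.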

The next Theorem is an application of the Borel-Cantelli Lemma.

Theorem 13 (Strong law of large numbers). Let X1, X2, . . . be an i.i.d. sequence with E[X1] = µ < ∞, E[X1^2] = µ2 < ∞, and E[X1^4] = µ4 < ∞. Then Sn ≜ (1/n) Σ_{k=1}^n Xk → µ almost surely.

Proof

Without loss of generality, we assume that µ = 0. From (20), we know that

(21) {lim_{n→∞} Sn = 0}^c = ∪_{ε>0, ε∈Q} {|Sn| > ε i.o.}.

By the Markov inequality (with ψ(·) = (·)^4),

P{|Sn| > ε} ≤ E[Sn^4]/ε^4 = (1/(n^4 ε^4)) Σ_{i=1}^n Σ_{j=1}^n Σ_{ℓ=1}^n Σ_{m=1}^n E[Xi Xj Xℓ Xm].

Considering the numerator, we observe that all the terms in the summation that are of the form E[Xi^3]E[Xj], E[Xi^2]E[Xj]E[Xℓ], or E[Xi]E[Xj]E[Xℓ]E[Xm] (with distinct indices) are zero, since the Xi's are independent with mean zero. Hence,

Σ_{i=1}^n Σ_{j=1}^n Σ_{ℓ=1}^n Σ_{m=1}^n E[Xi Xj Xℓ Xm] = n E[X1^4] + 3n(n − 1) (E[X1^2])^2 = n µ4 + 3n(n − 1) µ2^2,

and therefore

P{|Sn| > ε} ≤ (n µ4 + 3n(n − 1) µ2^2)/(n^4 ε^4).

Hence, we conclude that

Σ_{n=1}^∞ P{|Sn| > ε} < ∞.


By the Borel-Cantelli Lemma,

P(|Sn| > ε i.o.) = 0, ∀ε > 0.

Hence, by (21),

P({lim_{n→∞} Sn = 0}^c) = 0,

and the theorem is proved. □
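A simulation sketch of the theorem: sample means of i.i.d. Uniform(0,1) draws (µ = 1/2, and all moments are finite, so the fourth-moment hypothesis of Theorem 13 holds) settle near µ as n grows. The seed and sample sizes are arbitrary choices for this illustration:

```python
import random

# Sample means of i.i.d. Uniform(0,1) draws settle near mu = 0.5,
# illustrating (not proving) almost-sure convergence.

random.seed(0)

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))  # the printed means approach 0.5 as n grows
```

A single run only exhibits one sample path of Sn; the theorem asserts that almost every such path converges.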


12. Markov Chains

The majority of the material in this section is extracted from the book by Billingsley (Probability and Measure) and the book by Hoel et al. (Introduction to Stochastic Processes).

Up to this point we have seen many examples and properties (e.g., convergence) of i.i.d. sequences. However, many sequences have correlation among their members. Markov chains are a class of sequences that exhibit a simple type of correlation in the sequence of random variables. In words, as far as calculating probabilities pertaining to future values of the sequence conditional on the present and the past, it is sufficient to know the present. In other words, as far as the future is concerned, knowledge of the present contains all the knowledge of the past. We next define discrete-space Markov chains (MCs) formally.

Consider a sequence of discrete random variables X1, X2, . . . defined on some probability space (Ω, F, P). Let S be a finite or countable set representing the set of all possible values that each Xn may assume. Without loss of generality, assume that S = {0, 1, 2, . . .}. The sequence is a Markov chain if

P{Xn+1 = j | X0 = i0, . . . , Xn = in} = P{Xn+1 = j | Xn = in}.

For each pair i and j in S, we define pij ≜ P{Xn+1 = j | Xn = i}. Note that we have implicitly assumed (for now) that P{Xn+1 = j | Xn = i} does not depend on n, in which case the chain is called homogeneous. Also note that the definition of pij implies that Σ_{j∈S} pij = 1 for every i ∈ S. The numbers pij are called the transition probabilities of the MC. The initial probabilities are defined as αi ≜ P{X0 = i}. Note that the only restriction on the αi's is that they are nonnegative and must add up to 1. We denote the matrix whose (i, j)th entry is pij by IP, which is termed the transition matrix of the MC. Note that if S is infinite, then IP is a matrix with infinitely many rows and columns. The transition matrix IP is called a stochastic matrix, as each row sums to one and all its elements are nonnegative. Figure 6 is a state diagram illustrating the transition probabilities for a two-state MC.


[Figure: a two-node state diagram, with states 0 and 1, self-loop arrows labeled P11 and P22, and arrows between the two states labeled P12 and P21.]

Figure 6.
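A two-state chain like the one in Figure 6 can be simulated directly from its transition matrix. The matrix below is a hypothetical numerical choice (each row lists the transition probabilities out of that state and must sum to one):

```python
import random

# Minimal sketch of simulating a two-state Markov chain.
# Row i of P lists p_i0, p_i1; rows sum to one (stochastic matrix).

P = [[0.9, 0.1],
     [0.5, 0.5]]

def simulate(P, x0, n_steps, rng):
    """Generate X_0, ..., X_n by sampling each step from row X_k of P."""
    path = [x0]
    for _ in range(n_steps):
        u, row = rng.random(), P[path[-1]]
        # Inverse-CDF sampling over the current row (two states only).
        path.append(0 if u < row[0] else 1)
    return path

rng = random.Random(7)
print(simulate(P, 0, 20, rng))
```

Each step depends only on the current state, which is precisely the Markov property defined above.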

12.1. Higher-Order Transitions. Consider P{X0 = i0, X1 = i1, X2 = i2}. By applying Bayes' rule repeatedly (i.e., P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C)), we obtain

P{X0 = i0, X1 = i1, X2 = i2} = P{X0 = i0} P{X1 = i1 | X0 = i0} P{X2 = i2 | X0 = i0, X1 = i1} = αi0 pi0i1 pi1i2.

More generally, it is easy to verify that

P{X0 = i0, X1 = i1, . . . , Xm = im} = αi0 pi0i1 · · · pim−1im

for any sequence i0, i1, . . . , im of states.

Further, it is also easy to see that

P{X3 = i3, X2 = i2 | X1 = i1, X0 = i0} = pi1i2 pi2i3.

More generally,

(22) P{Xm+ℓ = jℓ, 1 ≤ ℓ ≤ n | Xs = is, 0 ≤ s ≤ m} = pimj1 pj1j2 · · · pjn−1jn.

With these preliminaries, we can define the n-step transition probability as

p^(n)_ij = P{Xm+n = j | Xm = i}.

If we now observe that

{Xm+n = j} = ∪_{j0,...,jn−1} {Xm+n = j, Xm+n−1 = j0, . . . , Xm+1 = jn−1},

we conclude that

p^(n)_ij = Σ_{k1,...,kn−1} pik1 pk1k2 · · · pkn−1j.


From matrix theory, we recognize p^(n)_ij as the (i, j)th entry of IP^n, the nth power of the transition matrix IP. It is convenient to use the convention p^(0)_ij = δij, consistent with the fact that IP^0 is the identity matrix I.

Finally, it is straightforward to verify that

p^(m+n)_ij = Σ_{v∈S} p^(m)_iv p^(n)_vj,

and that Σ_{j∈S} p^(n)_ij = 1. This means IP^n = IP^k IP^ℓ whenever k + ℓ = n, which is consistent with taking powers of a matrix.
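The identity p^(m+n)_ij = Σ_v p^(m)_iv p^(n)_vj (the Chapman-Kolmogorov relation) can be checked numerically for a small hypothetical stochastic matrix:

```python
# Sketch checking IP^(m+n) = IP^m IP^n for a hypothetical 3-state chain.

def matmul(A, B):
    return [[sum(A[i][v] * B[v][j] for v in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    # n-fold product; n = 0 gives the identity, matching p^(0)_ij = delta_ij.
    result = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        result = matmul(result, A)
    return result

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

lhs = matpow(P, 5)                         # IP^(2+3)
rhs = matmul(matpow(P, 2), matpow(P, 3))   # IP^2 IP^3
print(max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3)))
```

The printed discrepancy is at floating-point rounding level; the rows of every power also remain probability vectors, reflecting Σ_{j∈S} p^(n)_ij = 1.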

For convenience, we now define the conditional probabilities Pi(A) ≜ P{A | X0 = i}, A ∈ F. With this notation, and as an immediate consequence of (22), we have

Pi{X1 = i1, . . . , Xm = j, Xm+1 = jm+1, . . . , Xm+n = jm+n} = Pi{X1 = i1, . . . , Xm = j} Pj{X1 = jm+1, . . . , Xn = jm+n}.

Example 19. A gambling problem (see related homework problem).

The initial fortune of a gambler is X0 = xo. After playing each hand, he/she either increases or decreases the fortune by one dollar, with probabilities p and q, respectively (p + q = 1). The gambler quits either upon bankruptcy (the fortune reaches 0) or upon reaching a goal of L dollars. Let Xn denote the fortune at time n. Note that

(23) P{Xn = j | X1 = i1, . . . , Xn−1 = in−1} = P{Xn = j | Xn−1 = in−1}.

Hence, the sequence Xn is a Markov chain with state space S = {0, 1, . . . , L}. Also, the time-independent transition probabilities are given by

(24) pij = P{Xn+1 = j | Xn = i} =
    p, if j = i + 1 and 0 < i < L,
    q, if j = i − 1 and 0 < i < L,
    1, if i = j = 0 or i = j = L,
    0, otherwise.


The (L + 1) × (L + 1) probability transition matrix IP ≜ ((pij)) is therefore

(25) IP =
    [ 1  0  0  0  ...  0  0 ]
    [ q  0  p  0  ...  0  0 ]
    [ 0  q  0  p  ...  0  0 ]
    [ ...                ... ]
    [ 0  0  ...   q  0  p   ]
    [ 0  0  ...   0  0  1   ]

Note that the sum of any row is 1, a characteristic of a stochastic matrix.

Exercise 2. Show that λ = 1 is always an eigenvalue for any stochastic matrix IP.

Let P(x) be the probability of achieving the goal of L dollars, where x is the initial fortune. In a homework problem we have shown that

(26) P(x) = pP(x + 1) + qP(x − 1), x = 1, 2, . . . , L − 1,

with boundary conditions P(0) = 0 and P(L) = 1. Similarly, define Q(x) as the probability of going bankrupt. Then,

(27) Q(x) = pQ(x + 1) + qQ(x − 1), x = 1, 2, . . . , L − 1,

with boundary conditions Q(0) = 1 and Q(L) = 0.

For example, if p = q = 1/2, then (see homework solutions)

(28) P(x) = x/L,

(29) Q(x) = 1 − x/L.

Thus, lim_{L→∞} P(x) = 0. (What is the implication of this?)

Exercise 3. Show that, for fixed x,

(30) lim_{L→∞} P(x) = 0

when q > p.
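For p ≠ q, the standard gambler's-ruin solution of the recurrence (26) with the stated boundary conditions is P(x) = (1 − (q/p)^x)/(1 − (q/p)^L). This closed form is not derived in the notes above, so the sketch below treats it as a claimed solution, verifies it against the recurrence numerically, and illustrates (not proves) the limiting behavior when q > p:

```python
# Verify the claimed closed-form solution of (26) for p != q, and
# illustrate that the winning probability vanishes as L grows when q > p.

def win_prob(x, L, p):
    q = 1.0 - p
    r = q / p
    return (1 - r**x) / (1 - r**L)

# Check the recurrence P(x) = p P(x+1) + q P(x-1) at interior states.
p, L = 0.4, 10
residuals = [abs(win_prob(x, L, p)
                 - (p * win_prob(x + 1, L, p) + (1 - p) * win_prob(x - 1, L, p)))
             for x in range(1, L)]
print(max(residuals))  # numerically zero

# With q > p (here q = 0.6), the chance of reaching L vanishes as L grows.
for goal in (10, 50, 200):
    print(goal, win_prob(3, goal, p))
```

The boundary conditions P(0) = 0 and P(L) = 1 hold exactly, and the printed probabilities shrink geometrically in L.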


Back to the gambler's ruin problem: take L = 4 and p = 0.6. Then,

(31) IP =
    [ 1    0    0    0    0   ]
    [ 0.4  0    0.6  0    0   ]
    [ 0    0.4  0    0.6  0   ]
    [ 0    0    0.4  0    0.6 ]
    [ 0    0    0    0    1   ]

Also, straightforward calculation yields

(32) IP^2 =
    [ 1     0     0     0     0    ]
    [ 0.4   0.24  0     0.36  0    ]
    [ 0.16  0     0.48  0     0.36 ]
    [ 0     0.16  0     0.24  0.6  ]
    [ 0     0     0     0     1    ]

and

(33) lim_{n→∞} IP^n =
    [ 1       0  0  0  0      ]
    [ 0.5846  0  0  0  0.4154 ]
    [ 0.3077  0  0  0  0.6923 ]
    [ 0.1231  0  0  0  0.8769 ]
    [ 0       0  0  0  1      ]

(This can be obtained, for example, using the Cayley-Hamilton Theorem.) Note that the first column of the limit is

(34) Q = (Q(0), . . . , Q(4))^T,

and the last column is

(35) P = (P(0), . . . , P(4))^T.


12.2. Transience and Recurrence. Suppose that X0 = i and define

Eij ≜ ∪_{n=1}^∞ {Xn = j}

as the event that the chain eventually visits state j, provided that X0 = i. A state i is said to be recurrent if Pi(Eii) = 1; otherwise, it is said to be transient.

Note that we can write Eij as the disjoint union

Eij = ∪_{k=1}^∞ {Xk = j, Xk−1 ≠ j, . . . , X1 ≠ j}.

In the above, we are saying that the chain can eventually visit state j precisely in the following mutually exclusive ways: visit j for the first time in one step, or visit j for the first time in two steps, and so on. If we define

f^(n)_ij ≜ Pi{X1 ≠ j, . . . , Xn−1 ≠ j, Xn = j}

as the probability of a first visit to state j at time n, provided that X0 = i, then the quantity

fij ≜ Σ_{n=1}^∞ f^(n)_ij

is precisely Pi(Eij). Hence, a state i is recurrent precisely when fii = 1, and it is transient when fii < 1.

12.2.1. Visiting a state infinitely often. Suppose that X0 = i and consider the event Ajk, defined as the event that the chain visits state j at least k times. Note that if Ajk occurs, then the chain must have made a visit from i to j and then revisited j at least k − 1 times. Realizing that the Pi probability that the chain visits state j from state i is fij, and that the probability that the chain, starting from j, revisits j at least k − 1 times is fjj^(k−1), we conclude that Pi(Ajk) ≤ fij fjj^(k−1). On the other hand, if the chain happens to visit j from i and also revisit j at least k − 1 times, then the event Ajk has occurred. Hence, Pi(Ajk) = fij fjj^(k−1).

Note that Ajk is a decreasing sequence of events in k. Consequently,

Pi(∩k Ajk) = lim_{k→∞} Pi(Ajk) = 0 if fjj < 1, and = fij if fjj = 1.


However, ∩k Ajk is precisely the event {Xn = j i.o.}. Hence, we arrive at the conclusion

(36) Pi{Xn = j i.o.} = 0 if fjj < 1, and = fij if fjj = 1.

Taking j = i yields

(37) Pi{Xn = i i.o.} = 0 if fii < 1, and = 1 if fii = 1.

Thus, Pi{Xn = i i.o.} is either 0 or 1!

We have therefore proved the following result.

Theorem 14. Recurrence of i is equivalent to Pi{Xn = i i.o.} = 1, and transience of i is equivalent to Pi{Xn = i i.o.} = 0.

The next theorem further characterizes recurrence and transience.

Theorem 15 (Adapted from Billingsley). Recurrence of i is equivalent to Σn p^(n)_ii = ∞; transience of i is equivalent to Σn p^(n)_ii < ∞.

Proof.

Since p^(n)_ii = Pi{Xn = i}, by the Borel-Cantelli lemma Σn p^(n)_ii < ∞ implies Pi{Xn = i i.o.} = 0. According to (37), this implies fii < 1. Hence, we have shown that Σn p^(n)_ii < ∞ implies transience. Consequently, recurrence implies that Σn p^(n)_ii = ∞. We next prove that fii < 1 (i.e., transience) implies Σn p^(n)_ii < ∞, which will, in turn, imply that Σn p^(n)_ii = ∞ implies recurrence, and the entire theorem will therefore be proved.

Observe that the chain visits j from i in n steps precisely in the following mutually exclusive ways: visit j from i for the first time in n steps; visit j from i for the first time in n − 1 steps and then revisit j in one step; visit j from i for the first time in n − 2 steps and then revisit j in two steps; and so on. More precisely, we write

p^(n)_ij = Pi{Xn = j} = Σ_{s=0}^{n−1} Pi{X1 ≠ j, . . . , Xn−s−1 ≠ j, Xn−s = j, Xn = j}
         = Σ_{s=0}^{n−1} Pi{X1 ≠ j, . . . , Xn−s−1 ≠ j, Xn−s = j} Pj{Xs = j}
         = Σ_{s=0}^{n−1} f^(n−s)_ij p^(s)_jj.

Putting j = i and summing over n = 1 to n = ℓ, we obtain

Σ_{n=1}^ℓ p^(n)_ii = Σ_{n=1}^ℓ Σ_{s=0}^{n−1} f^(n−s)_ii p^(s)_ii = Σ_{s=0}^{ℓ−1} p^(s)_ii Σ_{n=s+1}^ℓ f^(n−s)_ii ≤ fii Σ_{s=0}^{ℓ−1} p^(s)_ii.

The last inequality comes from the fact that Σ_{n=s+1}^ℓ f^(n−s)_ii = Σ_{u=1}^{ℓ−s} f^(u)_ii ≤ fii. Realizing that p^(0)_ii = 1, we conclude that (1 − fii) Σ_{n=1}^ℓ p^(n)_ii ≤ fii. Hence, if fii < 1, then Σ_{n=1}^ℓ p^(n)_ii ≤ fii/(1 − fii), and the series Σ_{n=1}^∞ p^(n)_ii is therefore convergent, since the partial sums are increasing and bounded. □
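Theorem 15 can be watched in action on the gambler's chain with L = 4 and p = 0.6. State 0 is absorbing (recurrent), so its partial sums of p^(n)_00 grow without bound, while the partial sums for the transient state 1 converge. (By a standard geometric-series argument, the limit for state 1 equals f11/(1 − f11); a first-step computation, not carried out in the notes, gives f11 = 6/19, so the series should be 6/13.)

```python
# Partial sums of p^(n)_ii for a recurrent state (0) and a transient
# state (1) of the L = 4, p = 0.6 gambler's chain.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

IP = [[1.0, 0.0, 0.0, 0.0, 0.0],
      [0.4, 0.0, 0.6, 0.0, 0.0],
      [0.0, 0.4, 0.0, 0.6, 0.0],
      [0.0, 0.0, 0.4, 0.0, 0.6],
      [0.0, 0.0, 0.0, 0.0, 1.0]]

s00 = s11 = 0.0
M = IP
for n in range(1, 501):
    s00 += M[0][0]   # p^(n)_00 = 1 for every n: the sum diverges linearly
    s11 += M[1][1]   # transient state: the sum converges
    M = matmul(M, IP)

print(s00)  # 500.0
print(s11)  # a finite limit, consistent with transience of state 1
```

The contrast between the two partial sums is exactly the dichotomy of Theorem 15.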

Exercise 4. Show that in the gambler’s problem discussed earlier, states 0 and L are

recurrent, but states 1, . . . , L− 1 are transient.
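One way to check Exercise 4 numerically (for the specific case L = 4, p = 0.6; an illustration, not a proof) is to compute f_ii = p_ii + Σ_{k≠i} p_ik h_k, where h_k = Pk(ever hit i) solves the first-step equations h_k = p_ki + Σ_{j≠i} p_kj h_j. These equations are standard first-step analysis, not derived in the notes; the fixed-point iteration below, started from h = 0, converges monotonically to the minimal nonnegative solution, which is the hitting probability:

```python
# Return probabilities f_ii for the L = 4, p = 0.6 gambler's chain.

IP = [[1.0, 0.0, 0.0, 0.0, 0.0],
      [0.4, 0.0, 0.6, 0.0, 0.0],
      [0.0, 0.4, 0.0, 0.6, 0.0],
      [0.0, 0.0, 0.4, 0.0, 0.6],
      [0.0, 0.0, 0.0, 0.0, 1.0]]

def return_prob(P, i, sweeps=2000):
    """f_ii via hitting probabilities h_k = P_k(ever reach i)."""
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        # Monotone fixed-point iteration for the first-step equations.
        h = [P[k][i] + sum(P[k][j] * h[j] for j in range(n) if j != i)
             for k in range(n)]
    return P[i][i] + sum(P[i][k] * h[k] for k in range(n) if k != i)

for i in range(5):
    print(i, return_prob(IP, i))  # 1.0 for states 0 and 4, < 1 otherwise
```

States 0 and 4 return with probability one (they are absorbing), while the interior states have f_ii < 1 and are therefore transient, as Exercise 4 asserts.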

12.3. Irreducible chains. A MC is said to be irreducible if for every i, j ∈ S there exists an n (generally dependent upon i and j) such that p^(n)_ij > 0. In words, if the chain is irreducible then we can always go from any state to any other state in finite time with nonzero probability. The next theorem shows that for irreducible chains, either all states are recurrent or all states are transient; there is no option in between. Thus, we can say that an irreducible MC is either recurrent or transient.

Theorem 16 (Adapted from Billingsley). If the Markov chain is irreducible, then one of the following two alternatives holds.

(i) All states are recurrent, Pi(∩j {Xn = j i.o.}) = 1 for all i, and Σn p^(n)_ij = ∞ for all i and j.

(ii) All states are transient, Pi(∪j {Xn = j i.o.}) = 0 for all i, and Σn p^(n)_ij < ∞ for all i and j.

Proof.

By irreducibility, for each i and j there exist integers r and s such that p^(r)_ij > 0 and p^(s)_ji > 0. Note that if a chain with X0 = i visits j from i in r steps, then visits j from j in n steps, and then visits i from j in s steps, then it has visited i from i in r + n + s steps. Therefore,

p^(r+s+n)_ii ≥ p^(r)_ij p^(n)_jj p^(s)_ji.

Now if we sum both sides of the above inequality over n and use the fact that p^(r)_ij p^(s)_ji > 0, we conclude that Σn p^(n)_ii < ∞ implies Σn p^(n)_jj < ∞. (Why do we need the fact that p^(r)_ij p^(s)_ji > 0 to reach this conclusion?) In particular, if one state is transient, then all other states are also transient (i.e., fjj < 1 for all j). In this case, (36) tells us that Pi{Xn = j i.o.} = 0 for all i and j. Therefore, Pi(∪j {Xn = j i.o.}) ≤ Σj Pi{Xn = j i.o.} = 0 for all i.

Next, note that Σ_{n=1}^∞ p^(n)_ij = Σ_{n=1}^∞ Σ_{v=1}^n f^(v)_ij p^(n−v)_jj = Σ_{v=1}^∞ f^(v)_ij Σ_{m=0}^∞ p^(m)_jj ≤ Σ_{m=0}^∞ p^(m)_jj, where the inequality follows from the fact that Σ_{v=1}^∞ f^(v)_ij = fij ≤ 1. Hence, if j is transient (in which case Σn p^(n)_jj < ∞ according to Theorem 15), then Σn p^(n)_ij < ∞ for all i. We have therefore established the second alternative of the theorem.

If any state is not transient, then the only possibility is that all states are recurrent. In this case Pj{Xn = j i.o.} = 1 by Theorem 14, and {Xn = j i.o.} is an almost certain event for all j with respect to Pj; namely, Pj(A ∩ {Xn = j i.o.}) = Pj(A) for any A ∈ F. Hence, we can write

p^(m)_ji = Pj({Xm = i} ∩ {Xn = j i.o.})
         ≤ Σ_{n>m} Pj{Xm = i, Xm+1 ≠ j, . . . , Xn−1 ≠ j, Xn = j}
         = Σ_{n>m} p^(m)_ji f^(n−m)_ij = p^(m)_ji fij.

By irreducibility, there is an m0 for which p^(m0)_ji > 0; hence, fij = 1. Now by (36), Pi{Xn = j i.o.} = 1. The only item left to prove is that Σn p^(n)_ij = ∞ under the first alternative. Note that if Σn p^(n)_ij < ∞ for some i and j, then the Borel-Cantelli lemma would dictate that Pi{Xn = j i.o.} = 0, and the chain would not be recurrent, which is a contradiction. Hence, under the first alternative, Σn p^(n)_ij = ∞ for all i and j. □

Special Case: Suppose that Σn p^(n)_ij < ∞ for all i and j and that S is a finite set. Then Σ_{j∈S} Σn p^(n)_ij = Σn Σ_{j∈S} p^(n)_ij. But since Σ_{j∈S} p^(n)_ij = 1, we obtain Σ_{j∈S} Σn p^(n)_ij = Σn 1 = ∞, which is a contradiction (the left side is a finite sum of finite quantities). Hence, in a finite irreducible MC, alternative (ii) is impossible; hence, every finite state-space irreducible MC is recurrent. Consequently, if the transition matrix of a finite state-space MC has all positive entries (making the chain irreducible), then the chain is recurrent.

12.4. Birth and death Markov chains. Most of the material here is extracted from the book by Hoel et al. Birth and death chains are common examples of Markov chains; they include random walks. Consider a Markov chain on the finite or infinite set of nonnegative integers {0, . . . , d} (in the case of an infinite set we take d = ∞). The transition function is of the form

pij = qi if j = i − 1, ri if j = i, pi if j = i + 1, and pij = 0 otherwise.

Note that pi + qi + ri = 1, q0 = 0, and pd = 0 if d < ∞. We also assume that pi and qi are positive for 0 < i < d.

For a and b in S such that a < b, set

u(x) = Px(Ta < Tb), a < x < b,

where Ty ≜ min{n ≥ 1 : Xn = y} denotes the first hitting time of state y (with Ty = ∞ if y is never visited), and set u(a) = 1 and u(b) = 0. If the birth and death chain starts at y, then in one step it goes to y − 1, y, or y + 1 with respective probabilities qy, ry, and py. It follows that

u(y) = qy u(y − 1) + ry u(y) + py u(y + 1), a < y < b.


Since ry = 1 − py − qy, we can rewrite this as

u(y + 1) − u(y) = (qy/py)(u(y) − u(y − 1)), a < y < b.

Set γ0 = 1 and

γy = (q1 · · · qy)/(p1 · · · py), 0 < y < d.

Since γz/γz−1 = qz/pz, we see from this that

u(y + 1) − u(y) = (γa+1/γa) · · · (γy/γy−1) (u(a + 1) − u(a)) = (γy/γa)(u(a + 1) − u(a)).

Consequently,

u(y) − u(y + 1) = (γy/γa)(u(a) − u(a + 1)), a ≤ y < b.

Summing this equation over y = a, . . . , b − 1 and recalling that u(a) = 1 and u(b) = 0, we conclude that

(u(a) − u(a + 1))/γa = 1 / (Σ_{y=a}^{b−1} γy).

Thus the equation becomes

u(y) − u(y + 1) = γy / (Σ_{z=a}^{b−1} γz), a ≤ y < b.

Summing this equation over y = x, . . . , b − 1 and again using the formula u(b) = 0, we obtain

u(x) = (Σ_{y=x}^{b−1} γy) / (Σ_{y=a}^{b−1} γy), a < x < b.

It now follows from the definition of u(x) that

Px(Ta < Tb) = (Σ_{y=x}^{b−1} γy) / (Σ_{y=a}^{b−1} γy), a ≤ x ≤ b.

By subtracting both sides of this equation from 1, we see that

Px(Tb < Ta) = (Σ_{y=a}^{x−1} γy) / (Σ_{y=a}^{b−1} γy), a ≤ x ≤ b.
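The hitting-probability formula can be checked numerically in a special case. With constant rates py = p and qy = q, we get γy = (q/p)^y, and the formula should reduce to the gambler's-ruin probabilities of the previous section (e.g., Px(T_L < T_0) = x/L when p = q = 1/2). The sketch below uses this constant-rate assumption:

```python
# Hitting probabilities from the gamma weights, for constant rates p, q.

def gammas(p, q, b):
    # gamma_0 = 1, gamma_y = (q_1 ... q_y)/(p_1 ... p_y) = (q/p)**y here.
    return [(q / p) ** y for y in range(b)]

def reach_b_before_a(x, a, b, p, q):
    """Px(Tb < Ta) = (sum_{y=a}^{x-1} gamma_y) / (sum_{y=a}^{b-1} gamma_y)."""
    g = gammas(p, q, b)
    return sum(g[a:x]) / sum(g[a:b])

# Symmetric walk: Px(T_5 < T_0) = x/5.
for x in range(1, 5):
    print(x, reach_b_before_a(x, 0, 5, 0.5, 0.5))  # 0.2, 0.4, 0.6, 0.8
```

For p = 0.6, q = 0.4, a = 0, b = 4, the same routine reproduces the absorption probabilities seen in the limit matrix (33), e.g., 27/65 for x = 1.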


In the remainder of this section we consider a birth and death chain on the nonnegative integers which is irreducible, i.e., such that px > 0 for x ≥ 0 and qx > 0 for x ≥ 1. We will determine when such a chain is recurrent and when it is transient.

As a special case of the hitting-probability formula above (take a = 0, b = n, and x = 1),

P1(T0 < Tn) = 1 − 1/(Σ_{y=0}^{n−1} γy), n > 1.

Consider now a birth and death chain starting in state 1. Since the chain can move at most one step to the right at a time (viewing the transition from state to state as movement along the real number line),

1 ≤ T2 < T3 < · · · .

It follows that {T0 < Tn}, n > 1, forms a nondecreasing sequence of events, and hence, by continuity of probability along monotone sequences of events,

lim_{n→∞} P1(T0 < Tn) = P1(T0 < Tn for some n > 1).

The ordering above implies that Tn ≥ n, and thus Tn → ∞ as n → ∞; hence the event {T0 < Tn for some n > 1} occurs if and only if the event {T0 < ∞} occurs. We can therefore rewrite the last display as

lim_{n→∞} P1(T0 < Tn) = P1(T0 < ∞).

Combining this with the formula for P1(T0 < Tn) above, we get

P1(T0 < ∞) = 1 − 1/(Σ_{y=0}^∞ γy),

where the right side is interpreted as 1 when the series diverges.

We are now in a position to show that the birth and death chain is recurrent if and only if

Σ_{y=0}^∞ γy = ∞.

If the birth and death chain is recurrent, then P1(T0 < ∞) = 1, and the divergence of Σ_{y=0}^∞ γy follows from the formula just derived. To obtain the converse, we observe that p0y = 0 for y ≥ 2, and hence

P0(T0 < ∞) = r0 + p0 P1(T0 < ∞).

Suppose Σ_{y=0}^∞ γy = ∞. Then, by the formula for P1(T0 < ∞),

P1(T0 < ∞) = 1.

From this and the preceding display we conclude that

P0(T0 < ∞) = r0 + p0 = 1.

Thus 0 is a recurrent state, and since the chain is assumed to be irreducible, it must be a recurrent chain.

In summary, we have shown that an irreducible birth and death chain on {0, 1, 2, . . .} is recurrent if and only if

Σ_{x=1}^∞ (q1 · · · qx)/(p1 · · · px) = ∞.
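For a constant-rate chain (a hypothetical special case with px = p and qx = q for all interior x), γx = (q/p)^x, so the criterion says the chain is recurrent when q ≥ p (the series diverges) and transient when q < p (a convergent geometric series). The partial sums make this visible:

```python
# Partial sums of gamma_x = (q/p)**x, the quantity in the recurrence criterion.

def gamma_partial_sum(p, q, n_terms):
    return sum((q / p) ** x for x in range(1, n_terms + 1))

print(gamma_partial_sum(0.5, 0.5, 10**5))  # grows like n: divergent, recurrent
print(gamma_partial_sum(0.6, 0.4, 10**5))  # approaches (q/p)/(1 - q/p) = 2: transient
```

A finite partial sum cannot prove divergence, of course; it only illustrates the dichotomy the criterion formalizes.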


Consider a birth and death chain on {0, 1, . . . , d} or on the nonnegative integers; in the latter case we set d = ∞. We assume without further mention that the chain is irreducible, i.e., that

px > 0 for 0 ≤ x < d and qx > 0 for 0 < x ≤ d

if d is finite, and that

px > 0 for 0 ≤ x < ∞ and qx > 0 for 0 < x < ∞

if d is infinite.

Suppose d is infinite. The system of equations

Σ_x π(x) pxy = π(y), y ∈ S,

defining a stationary distribution becomes

π(0)r0 + π(1)q1 = π(0),
π(y − 1)py−1 + π(y)ry + π(y + 1)qy+1 = π(y), y ≥ 1.

Since

py + qy + ry = 1,

these equations reduce to

q1π(1) − p0π(0) = 0,
qy+1π(y + 1) − pyπ(y) = qyπ(y) − py−1π(y − 1), y ≥ 1.

It follows easily from these equations and induction that

qy+1π(y + 1) − pyπ(y) = 0, y ≥ 0,

and hence that

π(y + 1) = (py/qy+1) π(y), y ≥ 0.

Consequently,

π(x) = ((p0 · · · px−1)/(q1 · · · qx)) π(0), x ≥ 1.

Set

πx = 1 for x = 0, and πx = (p0 · · · px−1)/(q1 · · · qx) for x ≥ 1.

Then the previous display can be written as

π(x) = πx π(0), x ≥ 0.

Conversely, any π of this form satisfies the stationarity equations.

Suppose now that Σ_x πx < ∞, or equivalently, that

Σ_{x=1}^∞ (p0 · · · px−1)/(q1 · · · qx) < ∞.

We conclude that the birth and death chain has a unique stationary distribution, given by

π(x) = πx / (Σ_{y=0}^∞ πy), x ≥ 0.

Suppose instead that this summability condition fails to hold, i.e., that

Σ_x πx = ∞.

We then conclude that any solution of the stationarity equations is either identically zero or has infinite sum, and hence that there is no stationary distribution.

In summary, the chain has a stationary distribution if and only if Σ_x πx < ∞, and the stationary distribution, when it exists, is given by the normalization above.

Suppose now that d < ∞. By essentially the same arguments used above, we conclude that the unique stationary distribution is given by

π(x) = πx / (Σ_{y=0}^d πy), 0 ≤ x ≤ d,

where πx, 0 ≤ x ≤ d, is defined as before.
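The finite-d formula can be verified directly: build the weights πx, normalize, and check the stationarity equations Σ_x π(x) pxy = π(y). The rates below are a hypothetical choice for a d = 3 chain:

```python
# Stationary distribution of a finite birth and death chain (d = 3).

d = 3
p = [0.5, 0.4, 0.3, 0.0]   # p_d = 0
q = [0.0, 0.2, 0.3, 0.6]   # q_0 = 0
r = [1 - p[x] - q[x] for x in range(d + 1)]

# Unnormalized weights pi_x = (p_0 ... p_{x-1}) / (q_1 ... q_x).
w = [1.0]
for x in range(1, d + 1):
    w.append(w[-1] * p[x - 1] / q[x])
total = sum(w)
pi = [v / total for v in w]

# Transition matrix of the chain, then the balance check.
IP = [[0.0] * (d + 1) for _ in range(d + 1)]
for x in range(d + 1):
    IP[x][x] = r[x]
    if x < d:
        IP[x][x + 1] = p[x]
    if x > 0:
        IP[x][x - 1] = q[x]

residual = max(abs(sum(pi[x] * IP[x][y] for x in range(d + 1)) - pi[y])
               for y in range(d + 1))
print(pi, residual)  # residual is numerically zero
```

The residual vanishes because the construction enforces the detailed-balance relations π(x)px = π(x + 1)qx+1, which imply the stationarity equations for a birth and death chain.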


1. Y. S. Chow and H. Teicher, Probability Theory: Independence, Interchangeability, Martingales. Springer Texts in Statistics (New York), 3rd ed., 1997. ISBN: 0-387-40607-7.

2. W. Rudin, Real and Complex Analysis. New York: McGraw-Hill, 1987.

3. Thomas G. Kurtz, Stochastic Analysis. Class notes at the University of Wisconsin-Madison.

4. Anatole Beck, Integration and Measure. Class notes at the University of Wisconsin-Madison.

5. P. G. Hoel, S. C. Port, and C. J. Stone, Introduction to Stochastic Processes. Waveland Press, Inc., 1987.

