8/18/2019 STAT3004 Course Notes
STOCHASTIC MODELLING - STAT3004/STAT7018, 2015, Semester 2
Contents
1 Basics of Set-Theoretical Probability Theory
2 Random Variables
2.1 Definition and Distribution
2.2 Common Distributions
2.3 Moments and Quantiles
2.4 Moment Generating Functions
3 Several Random Variables
3.1 Joint Distributions
3.2 Covariance, Correlation, Independence
3.3 Sums of Random Variables and Convolutions
3.4 Change of Variables
4 Conditional Probability
4.1 Conditional Probability of Events
4.2 Discrete Random Variables
4.3 Mixed Cases
4.4 Random Sums
4.5 Conditioning on Continuous Random Variables
4.6 Joint Conditional Distributions
5 Elements of Matrix Algebra
6 Stochastic Processes and Markov Chains
6.1 Introduction and Definitions
6.2 Markov Property
6.3 Stationarity
6.4 Transition Matrices and Initial Distributions
6.5 Examples of Markov Chains
6.6 Extending the Markov Property
6.7 Multi-Step Transition Functions
6.8 Hitting Times and Strong Markov Property
6.9 First Step Analysis
6.10 Transience and Recurrence
6.11 Decomposition of the State Space
6.12 Computing Hitting Probabilities
6.13 Martingales
6.14 Special Chains
6.15 Summary
7 Stationary Distribution and Equilibrium
7.1 Introduction and Definitions
7.2 Basic Properties of Stationary and Steady State Distributions
7.3 Periodicity and Smoothing
7.4 Positive and Null Recurrence
7.5 Existence and Uniqueness of Stationary Distributions
7.6 Examples of Stationary Distributions
7.7 Convergence to the Stationary Distribution
7.8 Summary
8 Pure Jump Processes
8.1 Definitions
8.2 Characterizing a Markov Jump Process
8.3 S = {0, 1}
8.4 Poisson Processes
8.5 Inhomogeneous Poisson Processes
8.6 Special Distributions Associated with the Poisson Process
8.7 Compound Poisson Processes
8.8 Birth and Death Processes
8.9 Infinite Server Queue
8.10 Long-run Behaviour of Jump Processes
9 Gaussian Processes
9.1 Univariate Gaussian Distribution
9.2 Bivariate Gaussian Distribution
9.3 Multivariate Gaussian Distribution
9.4 Gaussian Processes and Brownian Motion
9.5 Brownian Motion via Random Walks
9.6 Brownian Bridge
9.7 Geometric Brownian Motion
9.8 Integrated Brownian Motion
9.9 White Noise
Part I: Review Probability & Conditional Probability
1 Basics of Set-Theoretical Probability Theory
Sets and Events. We first recall a little set theory and its terminology insofar as it is
relevant to probability. To start, we shall refer to the set of all possible outcomes that
a random experiment may take on as the sample space and denote it by Ω. In probability
theory, Ω is regarded as a set; its elements are called sample points.
An event A is then most simply thought of as a suitable subset of Ω, that is A ⊆ Ω, and we shall generally use the terms event and set interchangeably. (For the technically
minded, not all subsets of Ω can be included as legitimate events for measure theoretic
reasons, but for our purposes, we will ignore this subtlety.)
Example 1.1. Consider the random process of flipping a coin twice. For this scenario,
the sample space Ω is the set of all possible outcomes, namely Ω = {HH, HT, TH, TT} (discounting, of course, the possibility that the coin lands on its side and assuming that
the coin has two distinct sides H and T). One obvious event might be that of getting an
H on the first of the two tosses, in other words A = {HH, HT}.
Basic Set Operations. There are four basic set operations: union (∪), intersection (∩), complementation (c), and cardinality (#).
Let A, B ⊆ Ω. The union of two sets is the set containing all elements ω ∈ Ω in either
of the original sets, written A ∪ B; it is the event that A or B (or both) happens. The
intersection of two sets is the set containing all elements ω ∈ Ω common to both original
sets, written A ∩ B; it is the event that A and B happen simultaneously.
The complement of a set A is the set containing all elements ω ∈ Ω of the sample space
which are not in A, written Ac. So, clearly, Ωc = ∅, ∅c = Ω, and (Ac)c = A. Ac is the event
that A does not happen. (Notational note: occasionally, the complement of A is denoted
by Ā, but this is rarely done in statistics due to the potential for confusion with sample
means.)
If two sets A and B have no elements in common, they are called disjoint, and thus
A ∩ B = ∅, where ∅ signifies the empty or null set (the impossible event). Also, if A ⊆ B
then clearly A ∩ B = A, so that in particular A ∩ Ω = A for any event A.
Using unions and intersections, we can now define a very useful set theory concept, the
partition. A collection of sets A1, . . . , Ak is a partition of Ω if their combined union is equal
to the entire sample space and they are all mutually disjoint; that is, A1 ∪ · · · ∪ Ak = Ω
and Ai ∩ Aj = ∅ for any i ≠ j. In other words, a partition is a collection of events one and
only one of which must occur. In addition, note that the collection of sets {A, Ac} forms
a very simple but nonetheless extremely useful partition.
Finally, the cardinality of a set is simply the number of elements it contains. Thus,
in Example 1.1 above, #Ω = 4 while #A = 2. A set is called countable if we can enumerate
it, in a possibly non-unique way, by the natural numbers; for instance, ∅ is countable.
Also, a set A with finitely many elements is countable, i.e. #A is finite. Examples of
countable but infinite sets are the natural numbers N = {1, 2, 3, . . . } and the integers
Z = {. . . , −2, −1, 0, 1, 2, . . . }. The rational numbers Q are also countable. Intervals (a, b),
(a, b] and the real line R = (−∞, ∞) are examples of uncountable sets.
Basic Set Theory Rules. The distributive laws:

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),   (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).

De Morgan's rules:

(A ∪ B)c = Ac ∩ Bc,   (A ∩ B)c = Ac ∪ Bc.
You should convince yourself of the validity of these rules through the use of Venn dia-
grams. Formal proofs are elementary.
Basic Probability Rules. We now use the above set theory nomenclature to discuss the
basic tenets of probability. Informally, the probability of an event A is simply the chance
that it will occur. If the elements of the sample space Ω are finite in number and may be
considered “equally likely”, then we may calculate the probability of an event A as
P(A) = #A/#Ω.
More generally, of course, we will have to rely on our long-run frequency interpretation
of the probability of an event; namely, the probability of an event is the proportion of
times that it would occur among a (generally hypothetical) infinite number of equivalent
repetitions of the random experiment.
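Since Ω here is small, the counting rule P(A) = #A/#Ω can be verified by direct enumeration. A minimal Python sketch of Example 1.1 (the variable names are ours, not the notes'):

```python
from itertools import product

# Sample space for two coin flips: all ordered pairs of H and T.
omega = [''.join(flips) for flips in product('HT', repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# Event A: an H on the first of the two tosses.
A = [outcome for outcome in omega if outcome[0] == 'H']

# Classical rule P(A) = #A / #Omega, valid when outcomes are equally likely.
p_A = len(A) / len(omega)
print(p_A)  # 0.5
```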
Zero & Unity Rules. All probabilities must fall between 0 and 1, i.e. 0 ≤ P(A) ≤ 1. In particular, P(∅) = 0 and P(Ω) = 1.
Subset Rule. If A ⊆ B, then P(A) ≤ P(B).
Inclusion-Exclusion Law. The inclusion-exclusion rule states that the probability of the
union of two events is equal to the sum of the probabilities of the two events minus the
probability of the intersection of the two events, which has been in some sense “double
counted” in the sum of the initial two probabilities, so that
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Notice that the final subtracted term disappears if the two events A and B are disjoint,
more generally:
Additivity. Assume that A1, . . . , An ⊆ Ω with Ai ∩ Aj = ∅ for i ≠ j. Then
P(A1 ∪ · · · ∪ An) = P(A1) + · · · + P(An).
Countable Additivity. Assume that A1, A2, A3, . . . ⊆ Ω is a sequence of events with Ai ∩ Aj = ∅ for i ≠ j. Then
P(A1 ∪ A2 ∪ A3 ∪ · · · ) = P(A1) + P(A2) + P(A3) + · · ·
Complement Rule. The probability of the complement of an event is equal to one minus
the probability of the event itself, so that P(Ac) = 1 − P(A). This rule is easily derived from the Inclusion-Exclusion rule.
Product Rule. Two events A and B are said to be independent if and only if they satisfy
the equation P(A ∩ B) = P(A)P(B).
The Law of Total Probability. The law of total probability is a way of calculating a probability by breaking it up into several (hopefully easier to deal with) pieces. If the sets
A1, . . . , Ak form a partition, then the probability of an event B may be calculated as:
P(B) = Σ_{i=1}^{k} P(B ∩ Ai).
Again, heuristic verification is straightforward from a Venn diagram.
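The law of total probability is also easy to check mechanically on a small finite sample space. Here is an illustrative Python sketch using a fair die and the even/odd partition (an example of our choosing, not from the notes):

```python
from fractions import Fraction

# One fair die: sample space and uniform probabilities.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    # Probability of an event = sum of the probabilities of its outcomes.
    return sum(P[w] for w in event)

# Partition of the sample space: A1 = evens, A2 = odds.
A1 = {2, 4, 6}
A2 = {1, 3, 5}

# Event B: roll at least 5.
B = {5, 6}

# Law of total probability: P(B) = sum over i of P(B ∩ Ai).
total = prob(B & A1) + prob(B & A2)
print(total)  # 1/3
```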
2 Random Variables
2.1 Definition and Distribution
Definition. A random variable X is a numerically valued function X : Ω → R (R denoting the real line) whose domain is a sample space Ω. If the range of X is a countable subset of
the real line then we call X a discrete random variable . (For the technically minded, not
all numerical functions X : Ω → R are random variables for measure theoretic reasons,but for our purposes, we will ignore this subtlety.)
Below we introduce the notion of a continuous random variable . A continuous random
variable takes values in an uncountable set such as intervals or the real line. Note that
a random variable cannot be continuous if the sample space on which it is defined is
countable; however, a random variable defined on an uncountable sample space may still be discrete. In the coin tossing scenario of Example 1.1 above, the quantity X which
records the number of heads in the outcome is a discrete random variable.
Distribution of a Random Variable. Since random variables are functions on a sample
space, we can determine probabilities regarding random variables by determining the
probability of the associated subset of Ω. The probability of a random variable X being
in some subset I ⊆ R of the real line is equivalent to the probability of the event
A = {ω ∈ Ω : X(ω) ∈ I}:

P(X ∈ I) = P({ω ∈ Ω : X(ω) ∈ I}).

Note that we have used the notion of a random variable as a function on the sample space
when we use the notation X(ω). The collection of all probabilities P(X ∈ I) is called the
distribution of X.
Probability Mass Function (PMF). If X is discrete, then it is clearly desirable to find
pX(x) = P(X = x), the probability mass function (or pmf) of X, because it is possible to
characterise the distribution of X in terms of its pmf pX via

P(X ∈ I) = Σ_{i ∈ I} pX(i).

If X is discrete, we have Σ_{x ∈ Range(X)} pX(x) = 1.
Cumulative Distribution Function (CDF). For any random variable X : Ω → R, the function
F X (x) = P(X ≤ x) , x ∈ R
is called the cumulative distribution function (CDF) of X. The CDF of X determines the
distribution of X (the collection of all probabilities P(X ∈ I) can be computed from the
CDF of X).
If X is a discrete random variable then its cumulative distribution function is a step
function:

FX(x) = P(X ≤ x) = Σ_{y ∈ Range(X): y ≤ x} pX(y).
(Absolutely) Continuous Random Variable. Assume that X is a random variable
such that
P(X ∈ I) = ∫_I fX(x) dx,

where fX(x) is some nonnegative function with ∫_{−∞}^{∞} fX(x) dx = 1. Then X is called a
continuous random variable admitting a density fX. In this case, the CDF is still a valid
entity, being continuous and given by

FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(ξ) dξ.

Observe that the concept of a pmf is completely useless when dealing with continuous
r.v.'s, as we have P(X = x) = 0 for all x. The Fundamental Theorem of Calculus thus
shows that fX(x) = (d/dx)FX(x) = F′X(x), which in turn leads to the informal identity

P(x < X ≤ x + dx) = FX(x + dx) − FX(x) = dFX(x) = fX(x) dx,

which is where the density function f gets its name, since in some sense it describes how
the probability is spread over the real line.
(Notational note: We will attempt to stick to the convention that capital letters denote
random variables while the corresponding lower case letters indicate possible values or
realisations of the random variable.)
2.2 Common Distributions
The real importance of CDFs, pmfs and densities is that they completely characterize
the random variable from which they were derived. In other words, if we know the CDF
(or equivalently the pmf or density) then we know everything there is to know about the
random variable. For most random variables that we might think of, of course, writing
down a pmf, say, would entail the long and tedious process of listing all the possible values
and their associated probabilities. However, there are some types of important random
variables which arise over and over and for which simple formulae for their CDFs, pmfs
or densities have been found. Some common CDFs, pmfs and densities are listed below:
Discrete Distributions

Name: pmf, parameters, support

Poisson(λ): p(x) = e^{−λ} λ^x / x!,  λ > 0,  x ∈ N0 = {0, 1, 2, . . . }

Binomial(n, p): p(x) = C(n, x) p^x (1 − p)^{n−x},  n ∈ N = {1, 2, 3, . . . }, 0 < p < 1,  x ∈ {0, 1, . . . , n}

Negative Binomial(r, p): p(x) = C(x + r − 1, r − 1) p^r (1 − p)^x,  r ∈ N, 0 < p < 1,  x ∈ N0 = {0, 1, 2, . . . }

Geometric(p): p(x) = p(1 − p)^{x−1},  0 < p < 1,  x ∈ N = {1, 2, 3, . . . }

Hypergeometric(n, N, M): p(x) = C(M, x) C(N − M, n − x) / C(N, n),  N ∈ N, M ∈ {0, . . . , N}, n ∈ {1, . . . , N},  x ∈ {max(0, n + M − N), . . . , min(M, n)}
Continuous Distributions

Name: density, parameters, support

Normal(µ, σ²): f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  µ ∈ R, σ² > 0,  x ∈ R = (−∞, ∞)

Exponential(λ): f(x) = λ e^{−λx},  λ > 0,  x ∈ (0, ∞)

Uniform(a, b): f(x) = 1/(b − a),  −∞ < a < b < ∞,  x ∈ (a, b)

Weibull(α, λ): f(x) = αλ x^{α−1} e^{−λx^α},  α > 0, λ > 0,  x ∈ (0, ∞)

Gamma(α, λ): f(x) = (λ/Γ(α)) (λx)^{α−1} e^{−λx},  α > 0, λ > 0,  x ∈ (0, ∞)

Chi-Squared(k): f(x) = (1/(2^{k/2} Γ(k/2))) x^{(k−2)/2} e^{−x/2},  k ∈ N = {1, 2, 3, . . . },  x ∈ (0, ∞)

Beta(α, β): f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},  α, β > 0,  x ∈ (0, 1)

Student's t_k: f(x) = (Γ((k + 1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},  k ∈ N,  x ∈ (−∞, ∞)

Fisher-Snedecor F_{m,n}: f(x) = (Γ((m + n)/2)/(Γ(m/2)Γ(n/2))) (m/n)^{m/2} x^{(m−2)/2} (1 + mx/n)^{−(m+n)/2},  m, n ∈ N,  x ∈ (0, ∞)
The factorials n! and the binomial coefficients C(n, x) are defined as follows: 0! := 1
and, for n ∈ N and x ∈ {0, . . . , n},

n! := n × (n − 1) × · · · × 1,   C(n, x) := n!/(x!(n − x)!).

The gamma function, Γ(α), is defined by the integral

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,

from which it follows that if α is a positive integer, then Γ(α) = (α − 1)!. Also, note that
for α = 1, the Gamma(1, λ) distribution is equivalent to the Exponential(λ) distribution,
while for λ = 1/2, the Gamma(α, 1/2) distribution is equivalent to the Chi-squared distribution
with 2α degrees of freedom. Similarly, the Geometric(p) distribution is closely related
to the Negative Binomial distribution with r = 1.
Above we listed formulas only for those x where p(x) > 0 or f(x) > 0; for the remaining
x we have p(x) = 0 or f(x) = 0. We write X ∼ Q to indicate that X has the distribution
Q: for instance, X ∼ Normal(0, 1) refers to a continuous random variable X which has the
density fX(x) = (1/√(2π)) e^{−x²/2}. Similarly, Y ∼ Poisson(5) refers to a discrete random
variable Y with range N0 having pmf pY(x) = e^{−5} 5^x/x! for x ∈ N0.
Exercise. (a) Let X ∼ Exponential(λ). Check that the CDF of X satisfies FX(x) = 1 − e^{−λx} for x ≥ 0. Graph this function for x ∈ [−1, 4] for the parameter λ = 1.
(b) Let X ∼ Geometric(p). Check that the CDF of X satisfies FX(x) = 1 − (1 − p)^x for x ∈ {1, 2, 3, . . . }. Graph this function for x ∈ [−1, 4] (hint: step function).
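Both CDFs in this exercise are easy to tabulate numerically. The following Python sketch is one possible check; the ⌊x⌋ in the geometric case is our choice for extending the formula between integers, which produces the step function the hint refers to:

```python
import math

def exp_cdf(x, lam=1.0):
    # CDF of Exponential(lam): 0 for x < 0, 1 - e^{-lam*x} otherwise.
    return 0.0 if x < 0 else 1.0 - math.exp(-lam * x)

def geom_cdf(x, p=0.5):
    # CDF of Geometric(p) on {1, 2, 3, ...}: a step function,
    # F(x) = 1 - (1 - p)^floor(x) for x >= 1, and 0 for x < 1.
    return 0.0 if x < 1 else 1.0 - (1.0 - p) ** math.floor(x)

# The geometric CDF at an integer x agrees with summing the pmf p(1-p)^{k-1}.
p = 0.5
direct = sum(p * (1 - p) ** (k - 1) for k in range(1, 4))
print(geom_cdf(3, p), direct)  # both 0.875
```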
2.3 Moments and Quantiles
Moments. The mth moment of a random variable X is the expected value of the random
variable X^m and is defined as

E[X^m] = Σ_{x ∈ Range(X)} x^m pX(x),

if X is discrete, and as

E[X^m] = ∫_{−∞}^{∞} x^m fX(x) dx,
if X is continuous (provided, of course, that the quantities on the right hand sides exist).
In particular, when m = 1, the first moment of X is generally referred to as its mean
and is often denoted as µX , or just µ when there is no chance of confusion. The expected
value of a random variable is one measure of the centre of its distribution.
General Formulae. A good, though somewhat informal, way of thinking of the expected
value is that it is the value we would tend to get if we were to average the outcomes of a
very large number of equivalent realisations of the random variable. From this idea, it is
easy to generalize the moment definition to encompass the expectations of any function,
g, of a random variable as either
E[g(X)] = Σ_{x ∈ Range(X)} g(x) p(x),

or

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,
depending on whether X is discrete or continuous.
Central Moments and Variance. The idea of moments is often extended by defining
the central moments, which are the moments of the centred random variable X − µX. The
first central moment is, of course, equal to zero. The second central moment is generally
referred to as the variance of X, and denoted Var(X) or sometimes σ²X. The variance is
a measure of the amount of dispersion in the distribution of X ; that is, random variables
with high variances are likely to produce realisations which are far from the mean, while
low variance random variables have realisations which will tend to cluster closely about
the mean. A simple calculation shows the relationship between the moments and the
central moments; for example, we have
Var(X) = E[(X − µX)²] = E[X²] − µ²X.
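The identity Var(X) = E[X²] − µ² can be confirmed on a small discrete example. A Python sketch using exact rational arithmetic and a fair die (an example of our choosing):

```python
from fractions import Fraction

# X = score of a fair six-sided die; pmf is uniform on {1,...,6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())

# Variance two ways: by the definition E[(X - mu)^2], and via E[X^2] - mu^2.
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())
var_moments = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

print(mean, var_def, var_moments)  # 7/2 35/12 35/12
```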
One drawback to the variance is that, by its definition, its units are not comparable
to those of X . To avert this problem, we often use the square root of the variance,
σX = √Var(X), which is called the standard deviation of the random variable X.

Quantiles and Median. Another way to characterize the location (i.e. centre) and
spread of the distribution of a random variable is through its quantiles. The (1 − α)-quantile
of the distribution of X is any value να which satisfies:

P(X ≤ να) ≥ 1 − α  and  P(X ≥ να) ≥ α.

Note that the definition does not necessarily uniquely define the quantile; in other
words, there may be several distinct (1 − α)-quantiles of a distribution. However, for most
continuous distributions that we shall meet the quantiles will be unique. In particular,
the α = 1/2 quantile is called the median of the distribution and is another measure of the
centre of the distribution, since there is a 50% chance that a realisation of X will fall below
it and also a 50% chance that the realisation will be above the median value. The α = 3/4
and α = 1/4 quantiles are generally referred to as the first and third quartiles, respectively,
and their difference, called the interquartile range (or IQR), is another measure of spread
in the distribution.
Expectation via Tails. Calculating the mean of a random variable from the definition
can often involve painful integration and algebra. Sometimes, there are simpler ways. For
example, if X is a non-negative integer-valued random variable (i.e. its range contains
only non-negative integers), then we can calculate the mean of X as
µ = Σ_{x=0}^{∞} P(X > x).

The validity of this can be easily seen by a term rearrangement argument:

Σ_{x=0}^{∞} P(X > x) = Σ_{x=0}^{∞} Σ_{y=x+1}^{∞} p(y) = Σ_{y=1}^{∞} Σ_{x=0}^{y−1} p(y) = Σ_{y=1}^{∞} y p(y) = µ.
More generally, if X is an arbitrary, but non-negative random variable with cumulative
distribution function F , then
µ = ∫_0^∞ {1 − F(x)} dx.
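For a concrete check of the tail-sum formula, take X ∼ Geometric(p), for which P(X > x) = (1 − p)^x and E[X] = 1/p. A short Python sketch (truncating the infinite sum is our approximation):

```python
# Check mu = sum_{x>=0} P(X > x) for X ~ Geometric(p) on {1, 2, 3, ...},
# where P(X > x) = (1 - p)^x and the exact mean is 1/p.
p = 0.25

# Truncate the tail sum; the geometric tail decays fast enough that
# a few hundred terms give machine-precision agreement.
tail_sum = sum((1 - p) ** x for x in range(500))

print(tail_sum, 1 / p)  # both very close to 4.0
```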
Example 2.1. Let a > 0 and U be uniformly distributed on (0, a). Using at least two
methods find E[U ].
Solution: U is a continuous random variable with density f U (u) = 1/a for 0 < u < a
(otherwise, fU(u) = 0 for u ∉ (0, a)).

Method I:

E[U] = (1/a) ∫_0^a u du = (1/a) [u²/2]_0^a = a/2.

Method II: Note that U is a nonnegative random variable taking values only in (0, a).
Also, FU(u) = (1/a) ∫_0^u dv = u/a if 0 < u < a; otherwise, FU(u) = 0 for u ≤ 0 and
FU(u) = 1 for u ≥ a. Consequently, the tail integral becomes

E[U] = ∫_0^∞ {1 − FU(u)} du = ∫_0^a {1 − FU(u)} du = ∫_0^a {1 − (u/a)} du = a − (1/a)[u²/2]_0^a = a/2.
2.4 Moment Generating Functions
A more general method of calculating moments is through the use of the moment gener-
ating function (or mgf ), which is defined as
m(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} dF(x),

provided the expectation exists for all values of t in a neighborhood of the origin. To
obtain the moments of X we note that (provided sufficient regularity conditions which
justify the interchange of the operations of differentiation and integration are satisfied)

(d^m/dt^m) E[e^{tX}] |_{t=0} = E[X^m e^{tX}] |_{t=0} = E[X^m].
Example 2.2. Suppose that X has a Poisson(λ) distribution. The moment generating
function of X is given by:
m(t) = Σ_{x=0}^{∞} e^{tx} p(x) = Σ_{x=0}^{∞} e^{tx} λ^x e^{−λ}/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)},

where we have used the series expansion Σ_{n=0}^{∞} x^n/n! = e^x. Taking derivatives of m(t) shows
that

m′(t) = e^{λ(e^t − 1)} (λe^t)  ⟹  m′(0) = E[X] = λ,
m″(t) = e^{λ(e^t − 1)} (λe^t)² + e^{λ(e^t − 1)} (λe^t)  ⟹  m″(0) = E[X²] = λ² + λ.

Finally, this shows that Var(X) = E[X²] − {E[X]}² = (λ² + λ) − λ² = λ.
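The series manipulation and the two derivatives above can be sanity-checked numerically. The Python sketch below compares the closed form of m(t) against a truncated version of its defining series, and approximates m′(0) and m″(0) by finite differences (the step size h and the parameter values are arbitrary choices of ours):

```python
import math

# Compare the closed-form Poisson mgf m(t) = exp(lam*(e^t - 1))
# with the defining series sum_x e^{tx} e^{-lam} lam^x / x!.
lam, t = 2.0, 0.3

closed_form = math.exp(lam * (math.exp(t) - 1.0))
series = sum(math.exp(t * x) * math.exp(-lam) * lam ** x / math.factorial(x)
             for x in range(60))

# Moments via small central finite differences of m(t) around t = 0:
# m'(0) should be E[X] = lam, m''(0) should be E[X^2] = lam^2 + lam.
h = 1e-5
m = lambda t: math.exp(lam * (math.exp(t) - 1.0))
first_deriv = (m(h) - m(-h)) / (2 * h)
second_deriv = (m(h) - 2 * m(0.0) + m(-h)) / h ** 2

print(closed_form, series, first_deriv, second_deriv)
```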
Example 2.3. Suppose that X has a Gamma(α, λ) distribution. The moment generating
function of X is given by:
m(t) = ∫_0^∞ e^{tx} (λ/Γ(α)) (λx)^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ e^{−(λ−t)x} x^{α−1} dx
     = (λ^α/(λ − t)^α) ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = (λ/(λ − t))^α,   t < λ,

where we have used the fact that ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = 1, since
it is the integral of the density of a Gamma(α, λ − t) distribution over the full range of its
sample space, provided that t < λ (the parameters of a Gamma distribution must be
positive). So, differentiating this function shows that:

m′(t) = αλ^α/(λ − t)^{α+1}  ⟹  m′(0) = E[X] = α/λ,
m″(t) = α(α + 1)λ^α/(λ − t)^{α+2}  ⟹  m″(0) = E[X²] = (α² + α)/λ².

Thus, we can see that Var(X) = E[X²] − {E[X]}² = α/λ².
If X and Y are random variables having finite mgfs in an open interval containing zero,
and their mgfs are equal there, then X and Y have the same distribution. Generally, such
demonstrations rely on algebraic calculation followed by recognition of the resulting function
as the moment generating function of some specific distribution. As a useful reference,
then, we now give the moment generating functions of some of the distributions noted in
the previous section:

Discrete Distributions

Poisson(λ): m(t) = exp(λ(e^t − 1)),  t ∈ R
Binomial(n, p): m(t) = (1 − p + pe^t)^n,  t ∈ R
Geometric(p): m(t) = pe^t/(1 − (1 − p)e^t),  (1 − p)e^t < 1
Generating Functions. More generally, the concept of a generating function of a se-
quence is quite useful in many areas of probability theory. The generating function of a
sequence of numbers {a0, a1, a2, . . . } is defined as

A(s) = a0 + a1 s + a2 s² + · · · = Σ_{n=0}^{∞} an s^n,
provided the series converges for values of s in a neighborhood of the origin.
Note that from this definition, the moment generating function is just the generating
function of the sequence an = E[X n]/n!. As with the mgf , the elements of the sequence
can be recovered by successive differentiation of the generating function and subsequent
evaluation at s = 0 (as well as a rescaling by n! for the appropriate value of n).
In particular, if X is a discrete random variable taking non-negative integer values,
then setting an = P(X = n) yields the probability generating function,
P(s) = E[s^X] = E[e^{X log s}].

Note that m(t) = P(e^t), so that there is a clear link between the moment generating
function and the probability generating function. In particular, moment-like quantities
can be found via derivatives of P(s) evaluated at s = 1 = e^0. For example,

m″(t) |_{t=0} = {P″(e^t) e^{2t} + P′(e^t) e^t} |_{t=0} = P″(1) + P′(1).

Also, if we let qn = P(X > n), then Q(s) = Σ_n qn s^n is a tail probability generating
function, and

Q(s) = (1 − P(s))/(1 − s).

This can be seen by noting that the coefficient of s^n in the function (1 − s)Q(s) is

qn − qn−1 = P(X > n) − P(X > n − 1) = P(X > n) − {P(X = n) + P(X > n)} = −P(X = n)

for n ≥ 1, and q0 = P(X > 0) = 1 − P(X = 0) for n = 0, so that

(1 − s)Q(s) = 1 − P(X = 0) − Σ_{n=1}^{∞} P(X = n) s^n = 1 − P(s).

We saw earlier that E[X] = Σ_n qn, so E[X] = Q(1) = lim_{s→1} {1 − P(s)}/(1 − s); thus,
the graph of {1 − P(s)}/(1 − s) has a removable singularity (a hole) rather than an
asymptote at s = 1, as long as the expectation of X exists and is finite.
Additional Remarks. For the mathematically minded, we note that occasionally the
mgf (which, for positive random variables is also sometimes called the Laplace transform
of the density or pmf ) will not exist (for example, the t- and F -distributions have non-
existent moment generating functions, since the necessary integrals are infinite) and this
is why it is often more convenient to work with the characteristic function (also known to some as the Fourier transform of the density or pmf), ψ(t) = E[e^{itX}], which always
exists, but this requires some knowledge of complex analysis. One of the most useful
features of the characteristic function (and of the moment generating function, in the
cases where it exists) is that it uniquely specifies the distribution from which it arose (i.e.
no two distinct distributions have the same characteristic function), and many difficult
properties of distributions can be derived easily from the corresponding properties of
characteristic functions. For example, the Central Limit Theorem is easily proved using
moment generating functions, as are some important relationships regarding the various
distributions listed above.
3 Several Random Variables
3.1 Joint distributions
The joint distribution of two random variables X and Y describes how the outcomes of
the two random variables are probabilistically related. Specifically, the joint distribution
function is defined as
F XY (x, y) = F (x, y) = P(X ≤ x and Y ≤ y).
Usually, the subscripts are omitted when no ambiguity is possible.
If X and Y are both discrete, then they have a joint probability mass function
defined by p(x, y) = P(X = x and Y = y). Otherwise, there may exist a joint density,
defined as that function fXY which satisfies:

FXY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(ξ, η) dη dξ.
In this case, we call X and Y (jointly) continuous.
The case where one of X and Y is discrete and one continuous is of interest, but is
slightly more complicated and we will deal with it when it comes up.
The function F X (x) = limy→∞ F (x, y) is called the marginal distribution function of
X , and similarly the marginal distribution function of Y is F Y (y) = limx→∞ F (x, y). If X
and Y are discrete, then the marginal probability mass functions are simply
pX(x) = Σ_{y ∈ Range(Y)} p(x, y)  and  pY(y) = Σ_{x ∈ Range(X)} p(x, y).
If X and Y are continuous, then the marginal densities of X and Y are given by
fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy  and  fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx,
respectively. Note that the marginal density at a particular value is obtained by simply
integrating the joint density along the appropriate horizontal or vertical line.
The expectation of a function h of the two random variables X and Y is calculated in
a fashion similar to the expectations of functions of single random variables, namely,
E[h(X, Y)] = Σ_{x ∈ Range(X)} Σ_{y ∈ Range(Y)} h(x, y) p(x, y)
if X and Y are discrete, or
E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy
if X and Y are continuous.
Note that the above definitions show that regardless of the type of random variables,
E[aX + bY ] = aE[X ] + bE[Y ] for any constants a and b. Also, analogous definitions and
results hold for any finite group of random variables. For example the joint distribution
of X 1, . . . , X k is
F (x1, . . . , xk) = P(X 1 ≤ x1 and . . . and X k ≤ xk).
3.2 Covariance, Correlation, Independence

Independence. If it happens that F(x, y) = FX(x)FY(y), then the random variables X
and Y are said to be independent. If both random variables are continuous, then
the above condition is equivalent to f(x, y) = fX(x)fY(y), while if both are discrete it is
the same as p(x, y) = pX(x)pY(y). Note the similarity of these definitions to that for the
independence of events.
Given two jointly distributed random variables X and Y, we can calculate their means,
µX and µY, and their standard deviations, σX and σY, using their marginal distributions.
Provided these means and standard deviations exist, we can use the joint distribution
to calculate the covariance between X and Y, which is defined as

Cov(X, Y) = σXY = E[(X − µX)(Y − µY)] = E[XY] − µX µY.

Two random variables are said to be uncorrelated if their covariance is zero. Note that
if X and Y are independent then they are certainly uncorrelated, since the factorization of
the pmf or density implies that E[XY ] = E[X ]E[Y ] = µX µY . However, two uncorrelated
random variables need not be independent. Note also that it is an easy calculation to
show that if X , Y , V and W are jointly distributed random variables and a, b, c and d
are constants, then
Cov(aX + bY, cV + dW) = ac σXV + ad σXW + bc σYV + bd σYW;
in other words, the covariance operator is bilinear .
Finally, if we scale σXY by the product of the two standard deviations, we get the
correlation coefficient, ρ = σXY/(σX σY), which satisfies −1 ≤ ρ ≤ 1.
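A standard example of uncorrelated but dependent random variables is X uniform on {−1, 0, 1} with Y = X². This example is our choice, not from the notes, but it illustrates the warning above; a Python sketch with exact arithmetic:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2. Then Cov(X, Y) = 0 although
# Y is a function of X, so the pair is uncorrelated but not independent.
pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

E_X = sum(x * p for x, p in pmf_X.items())           # 0
E_Y = sum(x ** 2 * p for x, p in pmf_X.items())      # E[X^2] = 2/3
E_XY = sum(x * x ** 2 * p for x, p in pmf_X.items()) # E[X^3] = 0

cov = E_XY - E_X * E_Y
print(cov)  # 0

# Independence fails: P(X = 1 and Y = 1) != P(X = 1) * P(Y = 1).
p_joint = pmf_X[1]                  # 1/3
p_prod = pmf_X[1] * Fraction(2, 3)  # 2/9
print(p_joint, p_prod)
```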
3.3 Sums of Random Variables and Convolutions
We saw that the expectation of Z = X + Y was simply the sum of the individual expec-
tations of X and Y for any two random variables. Unfortunately, this is about the extent
of what we can say in general. If, however, X and Y are independent, the distribution of
Z can be determined by means of a convolution:
F_Z(z) = ∫_{−∞}^{∞} F_X(z − ξ) dF_Y(ξ) = ∫_{−∞}^{∞} F_Y(z − ξ) dF_X(ξ).
In the case where both X and Y are discrete, we can write the convolution formula
using pmf's:

p_Z(z) = Σ_{x∈Range(X)} p_X(x) p_Y(z − x) = Σ_{y∈Range(Y)} p_X(z − y) p_Y(y).
If X and Y are both continuous, we can rewrite the convolution formula using densities:
f_Z(z) = ∫_{−∞}^{∞} f_X(z − ξ) f_Y(ξ) dξ = ∫_{−∞}^{∞} f_Y(z − ξ) f_X(ξ) dξ.
Note that, in the same way that marginal densities are found by integrating along hor-
izontal or vertical lines, the density of Z at the value z is found by integrating along
the line x + y = z, and of course using independence to state that f_XY(ξ, z − ξ) = f_X(ξ) f_Y(z − ξ).
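For a concrete discrete illustration (our own example, not from the notes), the convolution formula for pmf's gives the distribution of the sum of two fair dice directly:

```python
# pmf of a single fair die on {1, ..., 6}
p = {x: 1 / 6 for x in range(1, 7)}

# Discrete convolution: p_Z(z) = sum over x of p_X(x) * p_Y(z - x)
p_z = {z: sum(p[x] * p.get(z - x, 0.0) for x in p) for z in range(2, 13)}
```

The result is the familiar triangular pmf on {2, . . . , 12}: for instance p_z[7] = 6/36 = 1/6 and p_z[2] = 1/36, and the values sum to 1.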
Since convolutions are a bit cumbersome, we now note an advantage of mgf ’s. If X and Y are independent, then
m_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = m_X(t) m_Y(t).
So, the mgf of a sum of independent random variables is the product of the mgf ’s of the
summands. This fact makes many calculations regarding sums of independent random
variables much easier to demonstrate:
Suppose that X ∼ Gamma(α_X, λ) and Y ∼ Gamma(α_Y, λ) are two independent
random variables, and we wish to determine the distribution of Z = X + Y. We could
use the convolution formula, but this would require some extremely difficult (though not
impossible, of course) integration. However, recalling the moment generating function of
the Gamma distribution, we see that, for any t < λ:

m_{X+Y}(t) = m_X(t) m_Y(t) = (λ/(λ − t))^{α_X} (λ/(λ − t))^{α_Y} = (λ/(λ − t))^{α_X + α_Y},

which easily shows that X + Y ∼ Gamma(α_X + α_Y, λ).
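The mgf argument can be sanity-checked by simulation (a sketch with arbitrary parameter choices α_X = 2, α_Y = 3, λ = 1.5 of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
a_x, a_y, lam = 2.0, 3.0, 1.5
n = 400_000

# numpy parameterizes the gamma by shape and scale = 1/lambda
z = rng.gamma(a_x, 1.0 / lam, n) + rng.gamma(a_y, 1.0 / lam, n)

# Gamma(a_x + a_y, lam) has mean (a_x + a_y)/lam and variance (a_x + a_y)/lam^2
mean_err = abs(z.mean() - (a_x + a_y) / lam)
var_err = abs(z.var() - (a_x + a_y) / lam ** 2)
```

Both errors come out close to zero, consistent with X + Y ∼ Gamma(α_X + α_Y, λ).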
3.4 Change of Variables
We saw previously that we could find the expectation of g(X) using the distribution of X.
Suppose, however, that we want to know more about the new random variable Y = g(X ).
If g is a strictly monotone function, we can find the distribution of Y by noting that
F_Y(y) = P(Y ≤ y) = P({g(X) ≤ y}) = P({X ≤ g^{−1}(y)}) = F_X(g^{−1}(y)),
if g is increasing, and
F_Y(y) = P({Y ≤ y}) = P({g(X) ≤ y}) = P({X ≥ g^{−1}(y)}) = 1 − F_X(g^{−1}(y)) + P({X = g^{−1}(y)}),

if g is decreasing (if g is not strictly monotone, we need to be a bit more clever, but we won't deal with that case here). Now, if X is continuous and g is a smooth function (i.e.
has a continuous derivative) then the differentiation chain rule yields
f_Y(y) = (1/|g′(g^{−1}(y))|) f_X(g^{−1}(y)) = (1/|g′(x)|) f_X(x),

where y = g(x) (note that when X is continuous, the CDF of Y in the case when g is
decreasing simplifies, since P({X = g^{−1}(y)}) = 0).
A similar formula holds for joint distributions, except that the derivative factor becomes
the reciprocal of the modulus of the determinant of the Jacobian matrix for the
transformation function g. In other words, if X_1 and X_2 have joint density f_{X_1 X_2} and
g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (y_1, y_2) is an invertible transformation, then
the joint density of Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) is

f_{Y_1 Y_2}(y_1, y_2) = |J(x_1, x_2)|^{−1} f_{X_1 X_2}(x_1, x_2),

where y_1 = g_1(x_1, x_2), y_2 = g_2(x_1, x_2), and |J(x_1, x_2)| is the determinant of the
Jacobian matrix J(x_1, x_2), whose (i, j)th element is J_ij(x_1, x_2) = ∂g_i(x_1, x_2)/∂x_j.
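As a quick numerical check of the univariate formula (a sketch using our own choice of example: X exponential and g(x) = x², which is strictly increasing on the positive half-line), the distribution function of Y = X² should be F_Y(y) = F_X(√y) = 1 − e^{−√y}:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 500_000)  # f_X(x) = e^{-x} for x > 0
y = x ** 2                         # g(x) = x^2 is strictly increasing on x > 0

# F_Y(y) = F_X(g^{-1}(y)) = 1 - exp(-sqrt(y)); compare against the empirical CDF
y0 = 4.0
predicted = 1 - np.exp(-np.sqrt(y0))
empirical = float(np.mean(y <= y0))
err = abs(predicted - empirical)
```

At y0 = 4 the predicted value 1 − e^{−2} agrees with the empirical frequency to within simulation error.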
4 Conditional Probability
4.1 Conditional Probability of Events
So far, we have discussed the probabilities of events in a rather static situation. However,
typically, we wish to know how the outcomes of certain events will subsequently affect the
chances of later events. To describe such situations, we need to use conditional probability
for events.
Suppose that we wish to know the chance that an event A will occur. Then we have
seen that we want to calculate P(A). However, if we are in possession of the knowledge
that the event B has already occurred, then we would likely change our belief about the
chance of A occurring: think, for example, of A as the event "it will rain today" and B as the
event "the sky is overcast". We use the notation P(A|B) to signify the probability of A
given that B has occurred, and we define it as

P(A|B) = P(A ∩ B)/P(B),

provided P(B) ≠ 0.

If we think of probabilities as areas in a Venn diagram, then a conditional probability
amounts to restricting the sample space down from Ω to B and then finding the relative
area of that part of A which is also in B to the total area of the restricted sample space,
namely B itself.

Multiplication Rule. In many of our subsequent applications, conditional probabilities
will be dictated as primary data by the circumstances of the process under study. In this
case, the above definition will find its most useful function in the form
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Independence. Also, we can rephrase independence of events in terms of conditional
probabilities; namely, two events A and B are independent if and only if
P(A|B) = P(A) and P(B|A) = P(B).

(Note that only one of the above two conditions need be verified, since if one is true the
other follows from the definition of conditional probability.) In other words, two events
are independent if the chance of one occurring is unaffected by whether or not the other
has occurred.
Total Probability Law. Recalling the law of total probability, we can use this new
identity to show that if the sets B_1, . . . , B_k form a partition, then

P(A) = Σ_{i=1}^{k} P(A|B_i) P(B_i).

Bayes' Rule. Finally, a very useful formula exists which relates the conditional probability
of A given B to the conditional probability of B given A, and goes by the name of
Bayes' Rule. Bayes' rule states that
P(B|A) = P(A ∩ B)/P(A) = P(A|B) P(B) / [P(A|B) P(B) + P(A|B^c) P(B^c)],
which follows from the definition of conditional probability and the law of total probability,
since B and Bc form a partition. In fact, we can generalize Bayes’ rule by letting B1, . . . , Bk
be a more general partition, so that
P(B_i|A) = P(A|B_i) P(B_i) / Σ_{j=1}^{k} P(A|B_j) P(B_j).
Example 4.1. Suppose there are three urns labelled I , II and III , the first containing
4 red and 8 blue balls, the second containing 3 red and 9 blue, and the third 6 red and 6
blue. (a) If an urn is picked at random and subsequently a ball is picked at random from
the chosen urn, what is the chance that the chosen ball will be red? (b) If a red ball is
drawn, what is the chance that it came from the first urn?
Solution: Let R be the event that the chosen ball is red. Then, from the description of
the situation it is clear that:
P(I) = P(II) = P(III) = 1/3,
P(R|I) = 4/12 = 1/3,  P(R|II) = 3/12 = 1/4,  P(R|III) = 6/12 = 1/2.
(a) Since the events I, II and III clearly form a partition (i.e. one and only one of them
must occur), we can use the law of total probability to find
P(R) = P(R|I) P(I) + P(R|II) P(II) + P(R|III) P(III) = 13/36.
(b) Using Bayes’ rule,
P(I|R) = P(R|I) P(I) / [P(R|I) P(I) + P(R|II) P(II) + P(R|III) P(III)] = (1/3)(1/3) / (13/36) = 4/13.
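The urn calculation is easy to reproduce with exact arithmetic (a small sketch using Python's fractions module; the dictionary layout is our own):

```python
from fractions import Fraction as F

# Prior probability of choosing each urn, and P(red | urn)
prior = {'I': F(1, 3), 'II': F(1, 3), 'III': F(1, 3)}
p_red = {'I': F(4, 12), 'II': F(3, 12), 'III': F(6, 12)}

# (a) Law of total probability
p_r = sum(prior[u] * p_red[u] for u in prior)

# (b) Bayes' rule
p_I_given_r = prior['I'] * p_red['I'] / p_r
```

This returns p_r = 13/36 and p_I_given_r = 4/13, matching the hand calculation.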
4.2 Discrete Random Variables
Conditional pmf . The conditional probability mass function derives from the definition
of conditional probability for events in a straightforward manner:
p_{X|Y}(x|y) = P(X = x|Y = y) = P(X = x and Y = y)/P(Y = y) = p_XY(x, y)/p_Y(y),
as long as p_Y(y) > 0. Note that for each y, p_{X|Y}(·|y) is a pmf, i.e. Σ_x p_{X|Y}(x|y) = 1, but the
same is not true for each fixed x. Also, the law of total probability becomes

p_X(x) = Σ_{y∈Range(Y)} p_{X|Y}(x|y) p_Y(y).
Example 4.2. Suppose that N has a geometric distribution with parameter 1 − β, and
that, conditional on N, X has a negative binomial distribution with parameters p and N.
In other words,

p_N(n) = (1 − β) β^{n−1} for n = 1, 2, . . .

and

p_{X|N}(x|n) = C(x + n − 1, n − 1) p^x (1 − p)^n for x = 0, 1, . . . ,

where C(m, k) denotes the binomial coefficient. Find the marginal distribution of X.
Solution: Using the law of total probability, for x = 0, 1, 2, . . . ,

p_X(x) = Σ_{n=1}^{∞} p_{X|N}(x|n) p_N(n)
       = Σ_{n=1}^{∞} C(x + n − 1, n − 1) p^x (1 − p)^n (1 − β) β^{n−1}
       = Σ_{n=1}^{∞} C(x + n − 1, x) p^x (1 − p)^n (1 − β) β^{n−1}
       = (1 − β) β^{−1} p^x Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^{n+1}
       = [(1 − β)(1 − p) p^x / (1 − (1 − p)β)^{x+1}] Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^n (1 − (1 − p)β)^{x+1}
       = [(1 − β)(1 − p) / (1 − (1 − p)β)] [p / (1 − (1 − p)β)]^x,

where the final sum equals 1 because its terms form a negative binomial pmf in n.
Consequently, X + 1 ∈ {1, 2, 3, . . . } is geometric with parameter (1 − β)(1 − p)/(1 − (1 − p)β).
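This marginal can be sanity-checked by simulation (a sketch with our own parameter choices; note that numpy's negative_binomial(n, q) counts failures before the n-th success when each trial succeeds with probability q, which matches p_{X|N} above with q = 1 − p):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, p = 0.4, 0.3
n_sim = 300_000

# N is geometric on {1, 2, ...} with success probability 1 - beta
N = rng.geometric(1 - beta, n_sim)
# Given N = n: P(X = x | N = n) = C(x+n-1, n-1) p^x (1-p)^n,
# i.e. failures before the n-th success with success probability 1 - p
X = rng.negative_binomial(N, 1 - p)

# The text shows X is geometric on {0, 1, ...} with ratio r = p/(1 - (1-p)*beta),
# so P(X = 0) = 1 - r and E[X] = r/(1 - r)
r = p / (1 - (1 - p) * beta)
mean_err = abs(X.mean() - r / (1 - r))
p0_err = abs(np.mean(X == 0) - (1 - r))
```

Both errors come out close to zero, consistent with the geometric marginal derived above.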
Conditional Expectation. The conditional expectation of g(X ) given Y = y, denoted
as E[g(X )|Y = y], is defined as
E[g(X)|Y = y] = Σ_{x∈Range(X)} g(x) p_{X|Y}(x|y).
The law of total probability then shows that
E[g(X)] = Σ_x g(x) p_X(x) = Σ_x g(x) Σ_y p_{X|Y}(x|y) p_Y(y) = Σ_y E[g(X)|Y = y] p_Y(y).
Note that the conditional expectation can be regarded as a function of y; that is, it is
a numerical function defined on the sample space of Y and is thus a random variable,
denoted by E[g(X )|Y ], and we therefore have
E[g(X)] = E[E[g(X)|Y]].

A similar expression can be obtained for variances:
Var(X) = E[X²] − (E[X])²
       = E[E[X²|Y]] − (E[E[X|Y]])²
       = E[E[X²|Y]] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
       = E[ E[X²|Y] − (E[X|Y])² ] + { E[(E[X|Y])²] − (E[E[X|Y]])² }
       = E[Var(X|Y)] + Var(E[X|Y]).

Note that we have defined Var(X|Y) := E[X²|Y] − (E[X|Y])² = σ²_{X|Y}, which, like the
conditional expectation, is now a random variable.
Example 4.3. Let Y have a distribution with mean µ and variance σ². Conditional on
Y = y, suppose that X has a distribution with mean −y and variance y². Find the
variance of X.

Solution: From the information given, E[X|Y] = −Y and Var(X|Y) = Y². Thus,

Var(X) = E[Y²] + Var(−Y) = σ² + µ² + (−1)² Var(Y) = 2σ² + µ².
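The decomposition Var(X) = E[Var(X|Y)] + Var(E[X|Y]) and the answer 2σ² + µ² can be checked by simulation (a sketch; we take Y normal with µ = 1 and σ = 2, and X | Y = y normal with mean −y and standard deviation |y|, one concrete choice of our own that is consistent with the stated conditional moments):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
n = 1_000_000

y = rng.normal(mu, sigma, n)
# Conditional on Y = y: mean -y and variance y^2 (standard deviation |y|)
x = rng.normal(-y, np.abs(y))

# Example 4.3 predicts Var(X) = 2*sigma^2 + mu^2, and E[X] = E[-Y] = -mu
var_err = abs(x.var() - (2 * sigma ** 2 + mu ** 2))
mean_err = abs(x.mean() + mu)
```

The sample variance comes out close to 2σ² + µ² = 9, as predicted.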
Since the conditional expectation is the expectation with respect to the conditional
probability mass function p_{X|Y}(x|y), conditional expectations behave in most ways like ordinary expectations. For example,
1. E[ag(X 1) + bh(X 2)|Y ] = aE[g(X 1)|Y ] + bE[h(X 2)|Y ]
2. If g ≥ 0 then E[g(X )|Y ] ≥ 0
3. E[g(X, Y )|Y = y] = E[g(X, y)|Y = y]
4. If X and Y are independent, E[g(X )|Y ] = E[g(X )]
5. E[g(X )h(Y )|Y ] = h(Y )E[g(X )|Y ]
6. E[g(X )h(Y )] = E[h(Y )E[g(X )|Y ]]
In particular, it follows from properties 1 and 5 that E[a|Y] = a for any constant a, and
E[h(Y)|Y] = h(Y) for any function h.

Remark: the formulae 1.–6. are applicable in more general situations, even if X or Y
is not discrete (cf. random sums for some applications).
4.3 Mixed Cases
If X is a continuous random variable and N is a discrete random variable, then the
conditional distribution function F_{X|N}(x|n) of X given that N = n can be defined in the
obvious way:

F_{X|N}(x|n) = P(X ≤ x and N = n)/P(N = n).
From this definition, we can easily define the conditional probability density function as
f_{X|N}(x|n) = (d/dx) F_{X|N}(x|n).
As in the discrete case, the conditional density behaves much like an ordinary density, so
that, for example,
P(a < X ≤ b, N = n) = P(a < X ≤ b|N = n) P(N = n) = p_N(n) ∫_a^b f_{X|N}(x|n) dx.
Note that the key feature to this and the discrete case was that the conditioning random
variable N was discrete, so that we would be able to guarantee that there would be some
possible values of n such that P(N = n) > 0. It is possible to condition on continuous
random variables and the properties are much the same, but we just need to take a bit of
care since technically the probability of any individual outcome of a continuous random
variable is zero.
4.4 Random Sums
Suppose we have an infinite sequence of independent and identically distributed random
variables ξ 1, ξ 2, . . ., and a discrete non-negative integer valued random variable N which
is independent of the ξ ’s. We can then define the random sum
X = ξ_1 + . . . + ξ_N = Σ_{k=1}^{N} ξ_k.
(Note that for convenience, we will define the sum of zero terms to be zero.)
Moments. If we let
E[ξ_k] = µ,  Var(ξ_k) = σ²,  E[N] = ν,  Var(N) = τ²,
then we can derive the mean and variance of X as
E[X] = E[E[X|N]] = Σ_{n=0}^{∞} E[X|N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_N | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n] p_N(n)
     = µ Σ_{n=1}^{∞} n p_N(n) = µν,
and the variance as

Var(X) = E[(X − µν)²] = E[(X − Nµ + Nµ − µν)²]
       = E[(X − Nµ)²] + E[µ²(N − ν)²] + 2 E[µ(X − Nµ)(N − ν)]
       = E[ E[(X − Nµ)² | N] ] + E[µ²(N − ν)²] + 2 E[ E[µ(X − Nµ)(N − ν) | N] ]
       = νσ² + µ²τ²,

since

E[X − Nµ | N = n] = E[ Σ_{i=1}^{n} ξ_i − nµ ] = 0;
E[(X − Nµ)² | N = n] = E[ ( Σ_{i=1}^{n} ξ_i − nµ )² ] = nσ².
Example 4.4. Total Grandchildren - Suppose that individuals in a certain species have
a random number of offspring independently of one another with a known distribution
having mean µ and variance σ2. Let X be the number of grandchildren of a single parent,
so that X = ξ_1 + . . . + ξ_N, where N is the random number of original offspring and ξ_k is the
random number of offspring of the kth child of the original parent. Then E[N] = E[ξ_k] = µ
and Var(N) = Var(ξ_k) = σ², so that

E[X] = µ² and Var(X) = µσ²(1 + µ).
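These moment formulas are easy to test by simulation (a sketch; for concreteness we let both the number of children N and each ξ_k be Poisson with mean µ = 2, our own choice, so that σ² = µ):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 2.0
n_sim = 200_000

N = rng.poisson(mu, n_sim)
# X = xi_1 + ... + xi_N; a sum of N iid Poisson(mu) terms is Poisson(N*mu),
# so each random sum can be drawn in one shot
X = rng.poisson(N * mu)

# With nu = mu and sigma^2 = tau^2 = mu: E[X] = mu^2, Var(X) = mu*sigma^2*(1 + mu)
mean_err = abs(X.mean() - mu ** 2)
var_err = abs(X.var() - mu ** 2 * (1 + mu))
```

With µ = 2 the predicted mean is 4 and the predicted variance is 12; both sample errors come out near zero.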
Distribution of Random Sums. In addition to moments, we need to know the distri-
bution of the random sum X . If the ξ ’s are continuous and have density function f (z ),
then the distribution of ξ 1 + . . . + ξ n is the n-fold convolution of f , denoted by f (n)(z )
and recursively defined by
f^{(1)}(z) = f(z),
f^{(n)}(z) = ∫_{−∞}^{∞} f^{(n−1)}(z − u) f(u) du for n > 1.
Since N is independent of the ξ's, f^{(n)} is also the density of X given N = n ≥ 1.
Thus, if we assume that P(N = 0) = 0, the law of total probability says

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n).
NOTE: If we don't assume that P(N = 0) = 0, then X has a "mixed" distribution: there
is an atom of probability P(X = 0) = P(N = 0) at zero, while for 0 ≤ a < b,

P(a < X ≤ b) = ∫_a^b [ Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) ] dx.

Example 4.5. Suppose that the ξ's have the exponential density f(z) = λ e^{−λz}
for z ≥ 0, and suppose also that N has a geometric distribution with parameter p, so that
p_N(n) = p(1 − p)^{n−1} for n = 1, 2, . . .. In this case,
f^{(2)}(z) = ∫_{−∞}^{∞} f(z − u) f(u) du = ∫_0^z λ² e^{−λz} du = λ² e^{−λz} ∫_0^z du = λ² z e^{−λz}.
In fact, it is straightforward to use mathematical induction to show that f^{(n)}(z) =
(λ^n / (n−1)!) z^{n−1} e^{−λz} for z ≥ 0, which is a Gamma(n, λ) density (a fact which is much
more easily demonstrated using moment generating functions!). Thus, the density of
X is
f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) = Σ_{n=1}^{∞} (λ^n / (n−1)!) x^{n−1} e^{−λx} p (1 − p)^{n−1}
       = λp e^{−λx} Σ_{n=1}^{∞} {λ(1 − p)x}^{n−1} / (n − 1)!
       = λp e^{−λx} e^{λ(1−p)x}
       = λp e^{−λpx}.
So, X has an exponential distribution with parameter λp, or a Gamma(1, λp). Note that
the distribution of the random sum is not the same as the distribution of the non-random
sum.
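A quick simulation confirms this "geometric sum of exponentials is exponential" fact (a sketch with our own choices λ = 2 and p = 0.25):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, p = 2.0, 0.25
n_sim = 300_000

# N geometric on {1, 2, ...}; given N = n, the sum of n iid Exp(lam) terms
# is Gamma(n, lam), so each random sum can be drawn in one shot
N = rng.geometric(p, n_sim)
X = rng.gamma(N, 1.0 / lam)

# The text shows X ~ Exp(lam * p): mean 1/(lam*p) and P(X > t) = exp(-lam*p*t)
mean_err = abs(X.mean() - 1.0 / (lam * p))
tail_err = abs(np.mean(X > 1.0) - np.exp(-lam * p * 1.0))
```

Both the sample mean and the sample tail probability agree with the Exp(λp) predictions.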
4.5 Conditioning on Continuous Random Variables
Conditional Density. Note that in the previous sections we have been able to use our
definition of conditional probability for events, since the conditioning events {Y = y}
have non-zero probability for discrete random variables. If we want to find the conditional
distribution of X given Y = y, and Y is continuous, we cannot use, as we might first try,

F_{X|Y}(x|y) = P(X ≤ x|Y = y) = P(X ≤ x and Y = y)/P(Y = y),
since both probabilities in the final fraction are zero. Instead, we shall define the conditional
density function as

f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y),
for values of y such that f Y (y) > 0. The conditional distribution function is then given
by
F_{X|Y}(x|y) = ∫_{−∞}^{x} f_{X|Y}(ξ|y) dξ.
Conditional Expectation. Finally, we can define
E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx,
as expected, and this version of the conditional expectation still satisfies all of the nice
properties that we derived in the previous sections for discrete conditioning variables. For
example,
P(a < X ≤ b|Y = y) = F_{X|Y}(b|y) − F_{X|Y}(a|y) = ∫_a^b f_{X|Y}(x|y) dx
                   = ∫_{−∞}^{∞} 1_{(a,b]}(x) f_{X|Y}(x|y) dx = E[1_{(a,b]}(X)|Y = y],

where the function 1_I(x) is the indicator function of the set I, i.e. 1_I(x) = 1 if x ∈ I and
1_I(x) = 0 otherwise.
Note that, as is the case with ordinary expectations and indicators, the conditional
probability of the random variable having an outcome in I is equal to the conditional
expectation of the indicator function of that event. (Recall that

P(X ∈ I) = ∫_I f_X(x) dx = ∫_{−∞}^{∞} 1_I(x) f_X(x) dx = E[1_I(X)]

for ordinary expectations and probabilities.)

We can use the above fact to show a new form of the law of total probability, which
is often a very useful method of finding probabilities; namely,
P(a < X ≤ b) = ∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy.
To see why this is true, note that
∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} ∫_a^b f_{X|Y}(x|y) dx f_Y(y) dy
                                        = ∫_{−∞}^{∞} ∫_a^b f_XY(x, y) dx dy
                                        = P(a < X ≤ b and −∞ < Y < ∞)
                                        = P(a < X ≤ b).
In fact, we can generalize this notion even further to show that
P{a < g(X, Y) ≤ b} = ∫_{−∞}^{∞} P{a < g(X, y) ≤ b|Y = y} f_Y(y) dy.
Example 4.6. Suppose X and Y are continuous random variables having joint density
function
f XY (x, y) = ye−xy−y for x, y > 0.
(a) Find the conditional distribution of X given Y = y.
(b) Find the distribution function of Z = XY.
Solution: (a) First, we must find the marginal density of Y , which is
f_Y(y) = ∫_{−∞}^{∞} f_XY(x, y) dx = ∫_0^∞ y e^{−xy−y} dx = e^{−y} ∫_0^∞ y e^{−xy} dx = e^{−y}, y > 0.
Therefore,
f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y) = y e^{−xy}, x > 0.

In other words, conditional on Y = y, X has an exponential distribution with parameter
y, and thus F_{X|Y}(x|y) = 1 − e^{−xy}.

(b) To find the distribution of Z = XY, we write
F_Z(z) = P(Z ≤ z) = P(XY ≤ z) = ∫_{−∞}^{∞} P(XY ≤ z|Y = y) f_Y(y) dy
       = ∫_0^∞ P(X ≤ z/y|Y = y) e^{−y} dy
       = ∫_0^∞ (1 − e^{−z}) e^{−y} dy
       = 1 − e^{−z},
so that Z has an exponential distribution with parameter 1.
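Simulation check (a sketch): draw Y ~ Exp(1), then X | Y = y ~ Exp(y), and confirm that Z = XY behaves like a standard exponential:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000

y = rng.exponential(1.0, n)   # marginal density f_Y(y) = e^{-y}, y > 0
x = rng.exponential(1.0 / y)  # conditional on Y = y, X ~ Exp(rate y), i.e. scale 1/y

z = x * y                     # the text shows Z = XY ~ Exp(1)
cdf_err = abs(float(np.mean(z <= 1.0)) - (1 - np.exp(-1.0)))
mean_err = abs(z.mean() - 1.0)
```

The empirical CDF at z = 1 matches 1 − e^{−1}, and the sample mean matches 1, to within simulation error.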
4.6 Joint Conditional Distributions
If X, Y and Z are jointly distributed random variables and Z is discrete, we can define
the joint conditional distribution of X and Y given Z in the obvious way:

F_{XY|Z}(x, y|z) = P(X ≤ x and Y ≤ y|Z = z) = P(X ≤ x and Y ≤ y and Z = z)/P(Z = z).
If X , Y and Z are all continuous, then we define the joint conditional density of X and
Y given Z as
f_{XY|Z}(x, y|z) = f_XYZ(x, y, z)/f_Z(z),
where f XY Z (x,y,z ) is the joint density function of X , Y and Z and f Z (z ) is the marginal
density function of Z .
The random variables X and Y are said to be conditionally independent given Z
if F_{XY|Z}(x, y|z) = F_{X|Z}(x|z) F_{Y|Z}(y|z), where F_{X|Z}(x|z) = lim_{y→∞} F_{XY|Z}(x, y|z) and
F_{Y|Z}(y|z) = lim_{x→∞} F_{XY|Z}(x, y|z) are the conditional distributions of X given Z and
Y given Z, respectively. As with unconditional independence, an equivalent characterization
when the random variables involved are continuous is that the densities factor as
f_{XY|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z). (NOTE: In an obvious extension of the formula for
unconditional densities,

f_{X|Z}(x|z) = ∫_{−∞}^{∞} f_{XY|Z}(x, y|z) dy,

with a similar definition for f_{Y|Z}(y|z).)
with a similar definition for f Y |Z (y|z ).)As with the case for unconditional joint distributions, a useful concept is the condi-
tional covariance, defined as
Cov(X, Y |Z ) = E[XY |Z ] − E[X |Z ]E[Y |Z ],
and the conditional correlation coefficient, which is simply the conditional covariance
scaled by the product of the conditional standard deviations, σX |Z =
Var(X |Z ) andσY |Z =
Var(Y |Z ). Note that if two random variables are conditionally independent
then they are conditionally uncorrelated (i.e. the conditional covariance is zero), but the
converse is not true. Also, just because two random variables are conditionally independent
or uncorrelated does not necessarily imply that they are unconditionally independent oruncorrelated.
5 Elements of Matrix Algebra
To prepare our analysis of Markov chains it is convenient to recall some elements of matrix
algebra:

A matrix A is a rectangular array with n rows and m columns, with real-valued entries
A(i, j) (A(i, j) refers to the element in the ith row and the jth column). In short, we write
A = (A(i, j)) ∈ R^{n×m} (verbally, A is an n × m matrix).

Example 5.1. Note that

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}

has A(1, 2) = 2.
We have different operations when dealing with matrices:

Scalar Multiplication. Let a ∈ R and A = (A(i, j)) ∈ R^{n×m}. The scalar multiple aA
is defined by taking the product of the real number a with each of the components of A,
giving rise to a new matrix C = (C(i, j)) := aA ∈ R^{n×m} with C(i, j) := a A(i, j).

Example 5.2. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}.

Then (a = 2)

C = 2A = [ 2  4  6
           8 10 12 ] ∈ R^{2×3}.
Transposition. Let A = (A(i, j)) ∈ R^{n×m}. The transpose of A is denoted by A′ = (A′(i, j));
it is an R^{m×n} matrix with entries A′(i, j) := A(j, i). (We interchange the roles of columns
and rows.)

Example 5.3. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}  ⇒  A′ = [ 1 4
                                   2 5
                                   3 6 ] ∈ R^{3×2}.
Sum of Matrices. Let A = (A(i, j)), B = (B(i, j)) ∈ R^{n×m}. By componentwise adding
the entries we get a new matrix C = (C(i, j)) =: A + B ∈ R^{n×m}, where C(i, j) =
A(i, j) + B(i, j).
Example 5.4. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3},  B = [ 1 1 −2
                                1 3  6 ] ∈ R^{2×3}

⇒ C = A + B = [ 2 3  1
                5 8 12 ] ∈ R^{2×3}.
Product of Matrices. Let A = (A(i, j)) ∈ R^{n×m} and B = (B(i, j)) ∈ R^{m×r}. (The number
m of A's columns must match the number m of B's rows.) Then the matrix product
AB = A · B =: C is the matrix C = (C(i, j)) ∈ R^{n×r} with entries

C(i, j) := Σ_{k=1}^{m} A(i, k) B(k, j),  1 ≤ i ≤ n, 1 ≤ j ≤ r.

By inspection: the entry C(i, j) is the Euclidean inner product of the ith row of A with the
jth column of B.
Example 5.5. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3},  B = [ 1 4 2
                                2 5 1
                                3 6 1 ] ∈ R^{3×3}.

To compute AB it is convenient to adopt the following scheme:

                      [ 1 4 2
                        2 5 1
                        3 6 1 ]
  [ 1 2 3 ]   [ 1×1 + 2×2 + 3×3   1×4 + 2×5 + 3×6   7     ]
  [ 4 5 6 ]   [ 4×1 + 5×2 + 6×3   . . .             . . . ]

Fill in the dots. The result is

C = AB = [ 14    32      7
           32   . . .   . . . ].
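As a cross-check, numpy computes the same product in one line; comparing C against your hand computation verifies the dotted entries (the code below only spells out the entries already shown in the scheme):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 4, 2],
              [2, 5, 1],
              [3, 6, 1]])

# A is 2x3 and B is 3x3, so C = AB is 2x3; C[i, j] is the inner product
# of row i of A with column j of B
C = A @ B
```

The first row of C recovers the entries 14, 32 and 7 computed in the scheme above.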
Product with Vectors and Matrices. This is a special case of the general matrix
multiplication: let x ∈ R^n and A = (A(i, j)) ∈ R^{n×m}. If we regard x ∈ R^{1×n} as a matrix
with only one row, then xA ∈ R^{1×m} is defined by the corresponding matrix multiplication;
the result is a row vector. If instead we take x ∈ R^{n×1} to be a column vector, then Ax
is defined only when m = n; if n ≠ m, then Ax is not defined, even though x is a column
vector. The dimensions must always match.
Power of Matrices. Let I ∈ R^{n×n} be the identity matrix: I = (I(i, j)) ∈ R^{n×n} with
entries I(i, j) = 1 if i = j and I(i, j) = 0 if i ≠ j. The identity matrix is a diagonal matrix
(only the elements of the diagonal are nonzero) with unit entries on the diagonal. For any
A ∈ R^{n×m} we have IA = A (and for all B ∈ R^{m×n} we have BI = B). For matrices where
the number of columns equals the number of rows, A ∈ R^{n×n}, we can define the pth power
A^p, p ∈ N_0 = {0, 1, 2, 3, 4, . . . }, by iteration:

A^0 := I,  A^1 := A,  A^p := A^{p−1} A = A A^{p−1}.
Example 5.6. Let

A = [ 1 2
      3 4 ].

Find A^0, A^1, A^2 and A^3.

Answer:

A^0 = I = [ 1 0          A^1 = A = [ 1 2          A^2 = [  7 10          A^3 = [ 37  54
            0 1 ],                   3 4 ],               15 22 ],               81 118 ].
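These powers are easy to verify with numpy, whose np.linalg.matrix_power implements exactly the iteration A^p = A^{p−1} A:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# A^0 is the identity; higher powers follow by repeated multiplication
powers = [np.linalg.matrix_power(A, k) for k in range(4)]
```

Inspecting the list reproduces the four matrices in the answer above.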
Example 5.7. (a) Show (A′)′ = A.
(b) Show A + B = B + A.
(c) Show (A + B)′ = A′ + B′.
(d) Show (AB)′ = B′ A′.
(e) Give an example of square matrices A, B ∈ R^{2×2} showing that AB ≠ BA ('not
commutative').
(Also see Tutorials.)
Part II: Markov Chains
6 Stochastic Process and Markov Chains
6.1 Introduction and Definitions
General Stochastic Process. A stochastic process is a family of random variables,
{X_t}_{t∈T}, indexed by a parameter t which belongs to an ordered index set T. For notational
convenience, we will sometimes use X(t) instead of X_t. We use t because the indexing is
For example, the price of a particular stock at the close of each day’s trading would be
a stochastic process indexed by time. Of course, the index does not have to be time; it may
be a spatial indicator, for example the number of defects in specified regions of a computer
chip. In fact, the indexing may be almost anything. Indeed, if we consider the index to
be individuals, we can consider a random sample X 1, . . . , X n to be a stochastic process.
Of course, this would be a rather special stochastic process in that the random variables
making up the stochastic process would be independent of each other. In general, we will
want to deal with stochastic processes where the random variables may be dependent on
one another.
As with individual random variables, we shall be interested in the set S of values
which the random variables may take on, but we shall generally refer to this set as the
state space in this context. Again, as with single random variables, the state space may
be either discrete or continuous. In addition, however, we must now also consider whether
the index set T is discrete or continuous. In this section, we shall be considering the
case where the index set is the discrete set of natural numbers T = N_0 = {0, 1, 2, . . .};
such processes are usually referred to as discrete time stochastic processes. We will start
by examining processes with discrete state spaces and later move on to processes with
continuous time sets T .
Markov Chain. The simplest sort of stochastic processes are of course those for which
the random variables X t are independent. However, the next simplest type of process, and
the starting point for our journey through the theory of stochastic processes, is called a
Markov chain. A Markov chain is a stochastic process having:
1) a countable state space S ,
2) a discrete index set T = {0, 1, 2, . . .},
3) the Markov property, and
4) stationary transition probabilities.
The final two properties listed are discussed next:
6.2 Markov Property
In general, we have defined a stochastic process so that the immediate future may depend
on both the present and the entire past. This framework is a bit too general for an initial
investigation of the concepts involved in stochastic processes. A discrete time process with
discrete state space will be said to have the Markov property if
P(X_{t+1} = x_{t+1} | X_0 = x_0, . . . , X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t).

In other words, the future depends only on the present and not on the past.

At first glance, this may seem a silly property, in the sense that it would never really
happen. However, it turns out that Markov chains can give surprisingly good approxima-
tions to real situations.
Example. As an example, suppose our stochastic process of interest is the total amount
of something (money, perhaps) that we have accumulated at the end of each day. Often, it
is a very reasonable assumption that tomorrow’s amount depends only on what we have
today and not on how we arrived at today’s amount. Indeed, this will be the case if, for
instance, each day’s incremental amount is independent of those for the previous days.
Thus, a very common and useful stochastic process possessing the Markov property is the
sequence of partial totals in a random sum, i.e. X t = ξ 1 + . . . + ξ t where ξ 1, ξ 2 . . . is a
sequence of independent random variables. In this case, it is clear that X t+1 = X t+ξ t+1 de-
pends only on the value of X t (and, of course, on the value of ξ t+1, but this is independent
of all the previous ξ ’s and thus of the previous X ’s as well).
6.3 Stationarity
Suppose we know that at time t, our Markov chain is in state x, and we want to know
about what will happen at time t + 1. The probability of X t+1 being equal to y in this
instance is referred to as the one-step transition probability of going from state x to state
y at time t, and is denoted by P_{t,t+1}(x, y), or sometimes P_{xy}^{t,t+1}. (Note that, for convenience
of terminology, even if x = y we will still refer to this as a transition). If we are dealing
with a Markov chain, then we know that
P_{t,t+1}(x, y) = P(X_{t+1} = y | X_t = x),

since the outcome of X_{t+1} only depends on the value of X_t. If, for any value t in the index
set, we have

P_{t,t+1}(x, y) = P(x, y) = P_{xy} for all x, y ∈ S,

that is, the one-step transition probabilities are the same at all times t, then the process
is said to have stationary transition probabilities. Here, the word stationary describes the
fact that the probability of going from one specified state to another does not change
with time. Note that for the partial totals in a random sum, the process has stationary
transition probabilities if and only if the ξ ’s are identically distributed.
6.4 Transition Matrices and Initial Distributions
Let’s start by considering the simplest type of Markov chain, namely, a chain with state
space of cardinality 2, say S = {0, 1}. (Actually, this is the second-simplest type of chain,
the simplest being one with only one possible state, but this case is rather unenlightening.)
Suppose that at any time t,

P(X_{t+1} = 1 | X_t = 0) = p,  P(X_{t+1} = 0 | X_t = 0) = 1 − p,
P(X_{t+1} = 0 | X_t = 1) = q,  P(X_{t+1} = 1 | X_t = 1) = 1 − q,

and that at time t = 0,

P(X_0 = 0) = π_0(0),  P(X_0 = 1) = π_0(1).
We will generally use the notation π_t to refer to the pmf of the discrete random variable
X_t when dealing with discrete time Markov chains, so that π_t(x) = p_{X_t}(x) = P(X_t = x).
When the state space is finite, we can arrange the transition probabilities, P xy, into a
matrix called the transition matrix . For the two-state Markov chain described above the
transition matrix is

P = [ P(0, 0) P(0, 1)     [ P_00 P_01     [ 1−p   p
      P(1, 0) P(1, 1) ] =   P_10 P_11 ] =    q   1−q ].

Note that for any fixed x, the pmf of X_t given X_{t−1} = x is p_{X_t|X_{t−1}}(y|x) = P(x, y). Thus,
the sum of the values in any row of the matrix P will be 1. If the state space is not finite,
then we will often refer to P(x, y) as the transition function of the Markov chain.
Similarly, if S is finite, we can arrange the initial distribution as a row vector, for
example, π_0 = {π_0(0), π_0(1)} in the case of the two-state chain above.

It is an important fact that P and π_0 are enough to completely characterize a Markov
chain, and we shall examine this more thoroughly a little later. As an example, however,
let’s compute some quantities associated with the above two-state chain.
Example 6.1. For the two-state Markov chain above, let’s examine the chance that X t
will equal 0. To do so, we note
π_t(0) = P(X_t = 0)
       = P(X_t = 0 | X_{t−1} = 0) P(X_{t−1} = 0) + P(X_t = 0 | X_{t−1} = 1) P(X_{t−1} = 1)
       = (1 − p) π_{t−1}(0) + q π_{t−1}(1)
       = q + (1 − p − q) π_{t−1}(0),

using π_{t−1}(1) = 1 − π_{t−1}(0). By iterating this procedure:

π_1(0) = q + (1 − p − q) π_0(0)
π_2(0) = q + (1 − p − q) π_1(0) = q + (1 − p − q){q + (1 − p − q) π_0(0)}
       = q + (1 − p − q) q + (1 − p − q)² π_0(0)
   ...
π_t(0) = q Σ_{i=0}^{t−1} (1 − p − q)^i + (1 − p − q)^t π_0(0)
       = q/(p + q) + (1 − p − q)^t { π_0(0) − q/(p + q) },

where we have used the well-known summation formula for a geometric series,

Σ_{i=0}^{n−1} r^i = (1 − r^n)/(1 − r).
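The recursion and its closed-form solution can be compared numerically (a sketch with arbitrary illustrative values p = 0.2, q = 0.5, π_0(0) = 0.9 of our own):

```python
# Arbitrary illustrative values (not from the notes)
p, q, pi0 = 0.2, 0.5, 0.9
t_max = 20

# Iterate the recursion pi_t(0) = q + (1 - p - q) * pi_{t-1}(0)
pi = pi0
for _ in range(t_max):
    pi = q + (1 - p - q) * pi

# Closed form: pi_t(0) = q/(p+q) + (1-p-q)^t * (pi_0(0) - q/(p+q))
closed = q / (p + q) + (1 - p - q) ** t_max * (pi0 - q / (p + q))
```

The two agree to floating-point precision, and since |1 − p − q| < 1 here, both are already extremely close to the limit q/(p + q).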
First, note that we can thus calculate the distribution of any of the X t’s using only the
entries of P and π_0. Second, as long as p, q ∈ (0, 1), we have |1 − p − q| < 1, so that
π_t(0) → q/(p + q) as t → ∞, regardless of the initial distribution. Finally, the probabilities
for transitions over two steps can be arranged into
the two-step transition matrix,
P² = P × P = [ (1 − p)² + pq      p(2 − p − q)
                q(2 − p − q)    (1 − q)² + pq ].

We will discuss general n-step transition matrices shortly.
A formal proof that P and π0 fully characterize a Markov chain is beyond the scope
of this class. However, we will try to give the basic idea behind the proof now. It should
seem intuitively reasonable that anything we want to know about a Markov chain {X_t}_{t≥0}
can be built up from probabilities of the form

P(X_n = x_n, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_0 = x_0) × P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}) × P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  ...
  = P(x_{n−1}, x_n) P(x_{n−2}, x_{n−1}) · · · P(x_0, x_1) π_0(x_0)
  = π_0(x_0) Π_{i=1}^{n} P(x_{i−1}, x_i).
Notice that the above simply states that the probability that the chain follows a particular
path for the first n steps can be found by simply multiplying the probabilities of the
necessary transitions. Note also that we directly required both the Markov property and
stationarity for this demonstration. Indeed, the above identity is an equivalent form of
the stationary Markov property. As a technical detail, we must be careful that none of
the conditioning events in the above derivation have probability zero. However, this will
only occur when the original path is not possible (i.e. the specified xi’s do not form a
legitimate set of outcomes), in which case the original probability will clearly be zero as
will at least one of the factors in the final product, so the result still holds true.
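The path-probability identity is also easy to check for a small chain: the probability of any particular path is π_0(x_0) times the product of the one-step transition probabilities, and these path probabilities sum to one over all paths of a given length. A minimal sketch, using an arbitrary two-state chain chosen only for illustration:

```python
from itertools import product

# An arbitrary two-state chain and initial distribution, for illustration.
p, q = 0.3, 0.2
P = {(0, 0): 1 - p, (0, 1): p, (1, 0): q, (1, 1): 1 - q}
pi0 = {0: 0.6, 1: 0.4}

def path_prob(path):
    """P(X_0 = x_0, ..., X_n = x_n) = pi_0(x_0) * product of P(x_{i-1}, x_i)."""
    prob = pi0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[(a, b)]
    return prob

# Sanity check: the probabilities of all paths of a given length sum to 1,
# so these path probabilities really do define the law of (X_0, ..., X_n).
n = 4
total = sum(path_prob(w) for w in product([0, 1], repeat=n + 1))
print(abs(total - 1.0) < 1e-12)   # True
```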
For the sake of completeness, we note that the characterization is a one-to-one
correspondence. That is, every Markov chain is completely determined by its initial
distribution and transition matrix, and any initial distribution and transition matrix (recall
that a transition matrix must satisfy the property that each of its rows sums to unity)
determine some Markov chain.
As a final comment, we note that it is the transition function P, rather than the initial
distribution π_0, which is the more fundamental aspect of a Markov chain. We shall
see specifically why this is so in the results to follow, but it should be clear that changing
initial distributions will generally only slightly affect the overall behaviour of the chain,
while a change in P will generally result in dramatic changes.
6.5 Examples of Markov Chains
We now present some of the most commonly used Markov chains.
Random Walk: Let p(u) be a probability mass function on the integers. A random walk
is a Markov chain with transition function P(x, y) = p(y − x) for integer-valued x and y.
Here we have S = Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . . }. For instance, if p(−1) = p(1) = 0.5,
then the chain is the simple symmetric random walk, where at each stage the chain takes
either one step forward or backward. Such models are sometimes used to describe the
motion of a suspended particle. One question of interest we might ask is how far the
particle will travel. Another might be whether the particle ever returns to its starting
position and, if so, how often. Often, the simple random walk is extended so that p(1) = p,
p(−1) = q and p(0) = r, where p, q and r are non-negative numbers less than one such
that p + q + r = 1.
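Such questions are easy to explore by simulation. The following sketch (an illustrative Python snippet, with arbitrary values of p, q and r; not part of the notes) simulates the extended walk and counts returns to the starting position.

```python
import random

def random_walk(steps, p=0.4, q=0.4, r=0.2, x0=0):
    """Simulate the extended simple random walk started at x0: at each
    stage move +1 with probability p, -1 with probability q, and stay
    put with probability r = 1 - p - q.  Returns [X_0, ..., X_steps]."""
    path = [x0]
    for _ in range(steps):
        u = random.random()
        step = 1 if u < p else (-1 if u < p + q else 0)
        path.append(path[-1] + step)
    return path

random.seed(1)
path = random_walk(10_000)
# One of the questions above: how often does the particle return home?
returns = sum(1 for x in path[1:] if x == path[0])
print("returns to the starting position:", returns)
```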
Ehrenfest chain: The Ehrenfest chain is often used as a simple model for the diffusion
of molecules across a membrane. Suppose that we have two distinct boxes and d distinct
labelled balls. Initially, the balls are distributed between the two boxes. At each step, a
ball is selected at random and is moved from the box that it is in to the other box. If X_t
denotes the number of balls in the first box after t transitions, then {X_t}_{t≥0} is a Markov
chain with state space S = {0, . . . , d}. The transition function can be easily computed as
follows: if at time t there are x balls in the first box, then there is probability x/d that
a ball will be removed from this box and put in the other, and a probability of (d − x)/d
that a new ball will be added to this box from the other. Thus

P(x, y) =  x/d        if y = x − 1
           1 − x/d    if y = x + 1
           0          otherwise.
For this chain, we might ask if an “equilibrium” is reached.
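A quick simulation suggests what this equilibrium looks like. The sketch below (illustrative Python; the value d = 10 is an arbitrary choice, not from the notes) tracks the long-run fraction of time the chain spends in each state.

```python
import random
from collections import Counter

def ehrenfest(steps, d=10, x0=0, rng=None):
    """Simulate the Ehrenfest chain and count visits to each state.
    With x balls in box 1, a uniformly chosen ball changes boxes, so the
    chain steps to x - 1 with probability x/d and to x + 1 otherwise."""
    rng = rng or random
    x, visits = x0, Counter()
    for _ in range(steps):
        x = x - 1 if rng.random() < x / d else x + 1
        visits[x] += 1
    return visits

visits = ehrenfest(200_000, d=10, rng=random.Random(2))
# The long-run fractions of time spent in each state settle down to an
# equilibrium concentrated near d/2 (in fact Binomial(d, 1/2)).
print(max(visits, key=visits.get))   # the most-visited state, near d/2 = 5
```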
Gambler’s ruin: Suppose a gambler starts out with x dollars and makes a series of one
dollar bets against the house. Assume that the respective probabilities of winning and
losing the bet are p and q = 1 − p, and that if the capital ever reaches 0, the betting
ends and the gambler’s fortune remains 0 forever after. This Markov chain has state space
S = N_0 = {0, 1, 2, 3, . . . } and transition function

P(x, y) =  q    if y = x − 1 and x > 0
           p    if y = x + 1 and x > 0
           1    if x = y = 0
           0    otherwise.

In particular, P(0, 0) = 1 and P(0, y) = 0 for y ≠ 0. Note that a state a which satisfies
P(a, a) = 1 and P(a, y) = 0 for y ≠ a is called an absorbing state. We might wish to ask
what the chance is that the gambler is ruined (i.e. loses all his/her initial stake) and how
long it might take. Also, we might modify this chain to incorporate a strategy whereby the
gambler quits when his/her fortune reaches d. For this chain, the above transition matrix
still holds except that the definition given for P(x, y) now holds only for 1 ≤ x ≤ d − 1,
and d becomes an absorbing state. One interpretation of this modification is that two
gamblers are betting against each other and between them they have a total capital of d
dollars. Letting X_t represent the fortune of one of the gamblers yields the gambler’s ruin
chain on {0, 1, . . . , d}.

Birth and death chains: The Ehrenfest and Gambler’s ruin chains are special cases of
a birth and death chain. A birth and death chain has state space S = N_0 = {0, 1, 2, . . .}
and has transition function

P(x, y) =  q_x    if y = x − 1
           r_x    if y = x
           p_x    if y = x + 1
           0      otherwise,

where p_x is the chance of a “birth”, q_x the chance of a “death” and 0 ≤ p_x, q_x, r_x ≤ 1 such
that p_x + q_x + r_x = 1. Note that we allow the chance of births and deaths to depend on
x, the current population. We will study birth and death chains in more detail later.
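Since the gambler’s ruin chain above is a simple birth and death chain, it makes a convenient test case. The sketch below (illustrative Python, not part of the notes) estimates the ruin probability by simulation and compares it with the classical closed form for fair bets.

```python
import random

def ruin_probability(x, d, p, trials=100_000, rng=None):
    """Estimate by simulation the chance that the gambler's ruin chain
    on {0, ..., d}, started at x with win probability p, is absorbed at
    0 (ruin) rather than at d (the gambler quits ahead)."""
    rng = rng or random
    ruined = 0
    for _ in range(trials):
        fortune = x
        while 0 < fortune < d:
            fortune += 1 if rng.random() < p else -1
        ruined += fortune == 0
    return ruined / trials

# For fair bets (p = 1/2) the exact ruin probability from x is (d - x)/d.
est = ruin_probability(x=3, d=10, p=0.5, rng=random.Random(3))
print(est)   # close to (10 - 3)/10 = 0.7
```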
Queuing chain: Consider a service facility at which people arrive during each discrete
time interval according to a distribution with probability mass function p(u). If anyone is
in the queue at the start of a time period, then a single person is served and removed from
the queue. Thus, the transition function for this chain is P(0, y) = p(y) and P(x, y) =
p(y − x + 1) for x ≥ 1. In other words, if there is no one in the queue, then the chance of
having y people in the queue by the next time interval is just the chance of y people arriving,
namely p(y), while if x people are currently in the queue, one will definitely be served and
removed, and thus to get to y individuals in the queue we require the arrival of y − (x − 1)
additional individuals. Two obvious questions to ask about this chain are when the queue
will be emptied and how often.
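For a concrete arrival distribution these questions can be explored by simulation. The sketch below assumes a particular pmf p(u) (0, 1 or 2 arrivals per period, with probabilities 0.4, 0.4, 0.2), chosen only for illustration and not taken from the notes; since the mean arrival rate 0.8 is below the service rate of 1 per period, the queue empties out regularly.

```python
import random

def queue_chain(steps, arrivals, x0=0, rng=None):
    """Simulate the queuing chain: in each period one customer is served
    (if anyone is waiting) and a random number of customers arrive.
    `arrivals(rng)` draws one value from the arrival pmf p(u).
    Returns the number of periods ending with an empty queue."""
    rng = rng or random
    x, empty = x0, 0
    for _ in range(steps):
        x = max(x - 1, 0) + arrivals(rng)
        empty += x == 0
    return empty

# An assumed arrival pmf: 0, 1 or 2 arrivals with probabilities
# 0.4, 0.4, 0.2, so the mean number of arrivals per period is 0.8.
def arrivals(rng):
    u = rng.random()
    return 0 if u < 0.4 else (1 if u < 0.8 else 2)

empty = queue_chain(100_000, arrivals, rng=random.Random(4))
# Balancing arrivals against services (one service per non-empty period)
# shows the long-run fraction of empty periods is 1 - 0.8 = 0.2.
print(empty / 100_000)   # close to 0.2
```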
Branching chain: Consider objects or entities, such as bacteria, which generate a number
of offspring according to the probability mass function p(u). If at each time increment,
the existing objects produce a random number of offspring and then expire, then X_t, the
total number of objects at generation t, is a Markov chain with

P(x, y) = P(ξ_1 + . . . + ξ_x = y),

where the ξ_i’s are independent random variables, each having probability mass function
p(u). A natural question to ask for such a chain is if and when extinction will
occur.
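Extinction can likewise be estimated by simulation. The sketch below assumes an illustrative offspring pmf with mean greater than one (not from the notes); for comparison, the exact extinction probability is the smallest root of the offspring generating-function equation, which here works out to 1/2.

```python
import random

def dies_out(p, max_gen=200, cap=1_000, rng=None):
    """Simulate one branching chain from a single ancestor and report
    whether the line goes extinct.  p is the offspring pmf as a list
    (p[k] = chance of k offspring); a population exceeding `cap` is
    treated as having escaped extinction for good."""
    rng = rng or random
    x = 1
    for _ in range(max_gen):
        if x == 0:
            return True
        if x > cap:
            return False
        # Each of the x current objects reproduces independently.
        x = sum(rng.choices(range(len(p)), weights=p, k=x))
    return False

# An assumed offspring pmf with mean 0.25*0 + 0.25*1 + 0.5*2 = 1.25 > 1.
p = [0.25, 0.25, 0.5]
rng = random.Random(5)
trials = 2_000
est = sum(dies_out(p, rng=rng) for _ in range(trials)) / trials
# The extinction probability solves s = 0.25 + 0.25 s + 0.5 s^2,
# whose smallest root is s = 1/2.
print(est)   # close to 0.5
```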
6.6 Extending the Markov Property
Recall that we have said that the Markov property is equivalent to the identity