8/18/2019 STAT3004 Course Notes
STOCHASTIC MODELLING - STAT3004/STAT7018, 2015, Semester 2
Contents
1 Basics of Set-Theoretical Probability Theory
2 Random Variables
2.1 Definition and Distribution
2.2 Common Distributions
2.3 Moments and Quantiles
2.4 Moment Generating Functions
3 Several Random Variables
3.1 Joint Distributions
3.2 Covariance, Correlation, Independence
3.3 Sums of Random Variables and Convolutions
3.4 Change of Variables
4 Conditional Probability
4.1 Conditional Probability of Events
4.2 Discrete Random Variables
4.3 Mixed Cases
4.4 Random Sums
4.5 Conditioning on Continuous Random Variables
4.6 Joint Conditional Distributions
5 Elements of Matrix Algebra
6 Stochastic Processes and Markov Chains
6.1 Introduction and Definitions
6.2 Markov Property
6.3 Stationarity
6.4 Transition Matrices and Initial Distributions
6.5 Examples of Markov Chains
6.6 Extending the Markov Property
6.7 Multi-Step Transition Functions
6.8 Hitting Times and Strong Markov Property
6.9 First Step Analysis
6.10 Transience and Recurrence
6.11 Decomposition of the State Space
6.12 Computing Hitting Probabilities
6.13 Martingales
6.14 Special Chains
6.15 Summary
7 Stationary Distribution and Equilibrium
7.1 Introduction and Definitions
7.2 Basic Properties of Stationary and Steady State Distributions
7.3 Periodicity and Smoothing
7.4 Positive and Null Recurrence
7.5 Existence and Uniqueness of Stationary Distributions
7.6 Examples of Stationary Distributions
7.7 Convergence to the Stationary Distribution
7.8 Summary
8 Pure Jump Processes
8.1 Definitions
8.2 Characterizing a Markov Jump Process
8.3 S = {0, 1}
8.4 Poisson Processes
8.5 Inhomogeneous Poisson Processes
8.6 Special Distributions Associated with the Poisson Process
8.7 Compound Poisson Processes
8.8 Birth and Death Processes
8.9 Infinite Server Queue
8.10 Long-run Behaviour of Jump Processes
9 Gaussian Processes
9.1 Univariate Gaussian Distribution
9.2 Bivariate Gaussian Distribution
9.3 Multivariate Gaussian Distribution
9.4 Gaussian Processes and Brownian Motion
9.5 Brownian Motion via Random Walks
9.6 Brownian Bridge
9.7 Geometric Brownian Motion
9.8 Integrated Brownian Motion
9.9 White Noise
Part I: Review Probability & Conditional Probability
1 Basics of Set-Theoretical Probability Theory
Sets and Events. We first recall a little set theory and its terminology insofar as it is
relevant to probability. To start, we shall refer to the set of all possible outcomes that
a random experiment may take on as the sample space and denote it by Ω. In probability
theory, Ω is regarded as a set; its elements are called sample points.
An event A is then most simply thought of as a suitable subset of Ω, that is A ⊆ Ω, and we shall generally use the terms event and set interchangeably. (For the technically
minded, not all subsets of Ω can be included as legitimate events for measure theoretic
reasons, but for our purposes, we will ignore this subtlety.)
Example 1.1. Consider the random process of flipping a coin twice. For this scenario,
the sample space Ω is the set of all possible outcomes, namely Ω = {HH, HT, TH, TT} (discounting, of course, the possibility that the coin lands on its side and assuming that
the coin has two distinct sides H and T). One obvious event might be that of getting an
H on the first of the two tosses, in other words A = {HH, HT}.
Basic Set Operations. There are four basic set operations: union (∪), intersection (∩), complementation (c), and cardinality (#).
Let A, B ⊆ Ω. The union of two sets is the set containing all elements ω ∈ Ω in either
of the original sets, written A ∪ B; it is the event that A or B (or both) happens. The
intersection of two sets is the set containing all elements ω ∈ Ω common to both original
sets, written A ∩ B; it is the event that A and B happen simultaneously.
The complement of a set A is the set containing all elements ω ∈ Ω of the sample space
which are not in A, written Ac. So, clearly, Ωc = ∅, ∅c = Ω, and (Ac)c = A. Ac is the event
that A does not happen. (Notational note: occasionally, the complement of A is denoted
by Ā, but this is rarely done in statistics due to the potential for confusion with sample
means.)
If two sets A and B have no elements in common, they are called disjoint, and thus
A ∩ B = ∅, where ∅ signifies the empty or null set (the impossible event). Also, if A ⊆ B
then clearly A ∩ B = A, so that in particular A ∩ Ω = A for any event A.
Using unions and intersections, we can now define a very useful set theory concept, the
partition. A collection of sets A1, . . . , Ak is a partition of Ω if their combined union is equal
to the entire sample space and they are all mutually disjoint; that is, A1 ∪ · · · ∪ Ak = Ω
and Ai ∩ Aj = ∅ for any i ≠ j. In other words, a partition is a collection of events one and
only one of which must occur. In addition, note that the collection of sets {A, Ac} forms
a very simple but nonetheless extremely useful partition.
Finally, the cardinality of a set is simply the number of elements it contains. Thus,
in Example 1.1 above, #Ω = 4 while #A = 2. A set is called countable if we can enumerate
it, in a possibly non-unique way, by the natural numbers; for instance, ∅ is countable.
Also, a set A with finitely many elements is countable, i.e. #A is finite. Examples of
countable but infinite sets are the natural numbers N = {1, 2, 3, . . . } and the integers
Z = {. . . , −2, −1, 0, 1, 2, . . . }. The rational numbers Q are also countable. Intervals (a, b),
(a, b] and the real line R = (−∞, ∞) are examples of uncountable sets.
Basic Set Theory Rules. The distributive laws:

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C),   (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).

De Morgan's rules:

(A ∪ B)c = Ac ∩ Bc,   (A ∩ B)c = Ac ∪ Bc.
You should convince yourself of the validity of these rules through the use of Venn dia-
grams. Formal proofs are elementary.
Basic Probability Rules. We now use the above set theory nomenclature to discuss the
basic tenets of probability. Informally, the probability of an event A is simply the chance
that it will occur. If the elements of the sample space Ω are finite in number and may be
considered “equally likely”, then we may calculate the probability of an event A as
P(A) = #A/#Ω.
More generally, of course, we will have to rely on our long-run frequency interpretation
of the probability of an event; namely, the probability of an event is the proportion of
times that it would occur among a (generally hypothetical) infinite number of equivalent
repetitions of the random experiment.
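Since Ω here is small, the counting rule P(A) = #A/#Ω can be verified by direct enumeration. A minimal Python sketch of Example 1.1 (the variable names are ours, not the notes'):

```python
from itertools import product

# Sample space for two coin flips: all ordered pairs of H and T.
omega = [''.join(flips) for flips in product('HT', repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# Event A: an H on the first of the two tosses.
A = [outcome for outcome in omega if outcome[0] == 'H']

# Classical rule P(A) = #A / #Omega, valid when outcomes are equally likely.
p_A = len(A) / len(omega)
print(p_A)  # 0.5
```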
Zero & Unity Rules. All probabilities must fall between 0 and 1, i.e. 0 ≤ P(A) ≤ 1. In particular, P(∅) = 0 and P(Ω) = 1.
Subset Rule. If A ⊆ B, then P(A) ≤ P(B).
Inclusion-Exclusion Law. The inclusion-exclusion rule states that the probability of the
union of two events is equal to the sum of the probabilities of the two events minus the
probability of the intersection of the two events, which has been in some sense “double
counted” in the sum of the initial two probabilities, so that
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Notice that the final subtracted term disappears if the two events A and B are disjoint,
more generally:
Additivity. Assume that A1, . . . , An ⊆ Ω with Ai ∩ Aj = ∅ for i ≠ j. Then
P(A1 ∪ · · · ∪ An) = P(A1) + · · · + P(An).
Countable Additivity. Assume that A1, A2, A3, . . . ⊆ Ω is a sequence of events with Ai ∩ Aj = ∅ for i ≠ j. Then
P(A1 ∪ A2 ∪ A3 ∪ · · · ) = P(A1) + P(A2) + P(A3) + · · ·
Complement Rule. The probability of the complement of an event is equal to one minus
the probability of the event itself, so that P(Ac) = 1 − P(A). This rule is easily derived from the Inclusion-Exclusion rule.
Product Rule. Two events A and B are said to be independent if and only if they satisfy
the equation P(A ∩ B) = P(A)P(B).
The Law of Total Probability. The law of total probability is a way of calculating a probability by breaking it up into several (hopefully easier to deal with) pieces. If the sets
A1, . . . , Ak form a partition, then the probability of an event B may be calculated as:
P(B) = Σ_{i=1}^{k} P(B ∩ Ai).
Again, heuristic verification is straightforward from a Venn diagram.
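The law of total probability is also easy to check mechanically on a small finite sample space. Here is an illustrative Python sketch using a fair die and the even/odd partition (an example of our choosing, not from the notes):

```python
from fractions import Fraction

# One fair die: sample space and uniform probabilities.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    # Probability of an event = sum of the probabilities of its outcomes.
    return sum(P[w] for w in event)

# Partition of the sample space: A1 = evens, A2 = odds.
A1 = {2, 4, 6}
A2 = {1, 3, 5}

# Event B: roll at least 5.
B = {5, 6}

# Law of total probability: P(B) = sum over i of P(B ∩ Ai).
total = prob(B & A1) + prob(B & A2)
print(total)  # 1/3
```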
2 Random Variables
2.1 Definition and Distribution
Definition. A random variable X is a numerically valued function X : Ω → R (R denoting the real line) whose domain is a sample space Ω. If the range of X is a countable subset of
the real line then we call X a discrete random variable . (For the technically minded, not
all numerical functions X : Ω → R are random variables for measure theoretic reasons,but for our purposes, we will ignore this subtlety.)
Below we introduce the notion of a continuous random variable . A continuous random
variable takes values in an uncountable set such as intervals or the real line. Note that
a random variable cannot be continuous if the sample space on which it is defined is
countable; however, a random variable defined on an uncountable sample space may still be discrete. In the coin tossing scenario of Example 1.1 above, the quantity X which
records the number of heads in the outcome is a discrete random variable.
Distribution of a Random Variable. Since random variables are functions on a sample
space, we can determine probabilities regarding random variables by determining the
probability of the associated subset of Ω. The probability of a random variable X being
in some subset I ⊆ R of the real line is equivalent to the probability of the event
A = {ω ∈ Ω : X(ω) ∈ I}:

P(X ∈ I) = P({ω ∈ Ω : X(ω) ∈ I}).

Note that we have used the notion of a random variable as a function on the sample space
when we use the notation X(ω). The collection of all probabilities P(X ∈ I) is called the
distribution of X.
Probability Mass Function (PMF). If X is discrete, then it is clearly desirable to find
pX(x) = P(X = x), the probability mass function (or pmf) of X, because it is possible to
characterise the distribution of X in terms of its pmf pX via

P(X ∈ I) = Σ_{i ∈ I} pX(i).

If X is discrete, we have Σ_{x ∈ Range(X)} pX(x) = 1.
Cumulative Distribution Function (CDF). For any random variable X : Ω → R, the function
F X (x) = P(X ≤ x) , x ∈ R
is called the cumulative distribution function (CDF) of X. The CDF of X determines the
distribution of X (the collection of all probabilities P(X ∈ I) can be computed from the
CDF of X).
If X is a discrete random variable then its cumulative distribution function is a step
function:

FX(x) = P(X ≤ x) = Σ_{y ∈ Range(X): y ≤ x} pX(y).
(Absolutely) Continuous Random Variable. Assume that X is a random variable
such that
P(X ∈ I) = ∫_I fX(x) dx,

where fX(x) is some nonnegative function with ∫_{−∞}^{∞} fX(x) dx = 1. Then X is called a
continuous random variable admitting a density fX. In this case, the CDF is still a valid
entity, being continuous and given by

FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(ξ) dξ.

Observe that the concept of a pmf is completely useless when dealing with continuous
r.v.'s, as we have P(X = x) = 0 for all x. The Fundamental Theorem of Calculus thus
shows that fX(x) = (d/dx)FX(x) = F′X(x), which in turn leads to the informal identity

P(x < X ≤ x + dx) = FX(x + dx) − FX(x) = dFX(x) = fX(x) dx,

which is where the density function f gets its name, since in some sense it describes how
the probability is spread over the real line.
(Notational note: We will attempt to stick to the convention that capital letters denote
random variables while the corresponding lower case letters indicate possible values or
realisations of the random variable.)
2.2 Common Distributions
The real importance of CDFs, pmfs and densities is that they completely characterize
the random variable from which they were derived. In other words, if we know the CDF
(or equivalently the pmf or density) then we know everything there is to know about the
random variable. For most random variables that we might think of, of course, writing
down a pmf, say, would entail the long and tedious process of listing all the possible values
and their associated probabilities. However, there are some types of important random
variables which arise over and over and for which simple formulae for their CDFs, pmfs
or densities have been found. Some common CDFs, pmfs and densities are listed below:
Discrete Distributions

Name: pmf, parameters, support

Poisson(λ): p(x) = e^{−λ} λ^x / x!,  λ > 0,  x ∈ N0 = {0, 1, 2, . . . }

Binomial(n, p): p(x) = C(n, x) p^x (1 − p)^{n−x},  n ∈ N = {1, 2, 3, . . . }, 0 < p < 1,  x ∈ {0, 1, . . . , n}

Negative Binomial(r, p): p(x) = C(x + r − 1, r − 1) p^r (1 − p)^x,  r ∈ N, 0 < p < 1,  x ∈ N0 = {0, 1, 2, . . . }

Geometric(p): p(x) = p(1 − p)^{x−1},  0 < p < 1,  x ∈ N = {1, 2, 3, . . . }

Hypergeometric(n, N, M): p(x) = C(M, x) C(N − M, n − x) / C(N, n),  N ∈ N, M ∈ {0, . . . , N}, n ∈ {1, . . . , N},  x ∈ {max(0, n + M − N), . . . , min(M, n)}
Continuous Distributions

Name: density, parameters, support

Normal(µ, σ²): f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  µ ∈ R, σ² > 0,  x ∈ R = (−∞, ∞)

Exponential(λ): f(x) = λ e^{−λx},  λ > 0,  x ∈ (0, ∞)

Uniform(a, b): f(x) = 1/(b − a),  −∞ < a < b < ∞,  x ∈ (a, b)

Weibull(α, λ): f(x) = αλ x^{α−1} e^{−λx^α},  α > 0, λ > 0,  x ∈ (0, ∞)

Gamma(α, λ): f(x) = (λ/Γ(α)) (λx)^{α−1} e^{−λx},  α > 0, λ > 0,  x ∈ (0, ∞)

Chi-Squared(k): f(x) = (1/(2^{k/2} Γ(k/2))) x^{(k−2)/2} e^{−x/2},  k ∈ N = {1, 2, 3, . . . },  x ∈ (0, ∞)

Beta(α, β): f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},  α, β > 0,  x ∈ (0, 1)

Student's t_k: f(x) = (Γ((k + 1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},  k ∈ N,  x ∈ (−∞, ∞)

Fisher-Snedecor F_{m,n}: f(x) = (Γ((m + n)/2)/(Γ(m/2)Γ(n/2))) (m/n)^{m/2} x^{(m−2)/2} (1 + mx/n)^{−(m+n)/2},  m, n ∈ N,  x ∈ (0, ∞)
The factorials n! and the binomial coefficients C(n, x) are defined as follows: 0! := 1
and, for n ∈ N and x ∈ {0, . . . , n},

n! := n × (n − 1) × · · · × 1,   C(n, x) := n!/(x!(n − x)!).

The gamma function, Γ(α), is defined by the integral

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,

from which it follows that if α is a positive integer, then Γ(α) = (α − 1)!. Also, note that
for α = 1, the Gamma(1, λ) distribution is equivalent to the Exponential(λ) distribution,
while for λ = 1/2, the Gamma(α, 1/2) distribution is equivalent to the Chi-squared distribution
with 2α degrees of freedom. Similarly, the Geometric(p) distribution is closely related
to the Negative Binomial distribution with r = 1.
Above we listed formulas only for those x where p(x) > 0 or f(x) > 0; for the remaining
x we have p(x) = 0 or f(x) = 0. We write X ∼ Q to indicate that X has the distribution
Q: for instance, X ∼ Normal(0, 1) refers to a continuous random variable X which has the
density fX(x) = (1/√(2π)) e^{−x²/2}. Similarly, Y ∼ Poisson(5) refers to a discrete random
variable Y with range N0 having pmf pY(x) = e^{−5} 5^x/x! for x ∈ N0.
Exercise. (a) Let X ∼ Exponential(λ). Check that the CDF of X satisfies FX(x) = 1 − e^{−λx} for x ≥ 0. Graph this function for x ∈ [−1, 4] for the parameter λ = 1.
(b) Let X ∼ Geometric(p). Check that the CDF of X satisfies FX(x) = 1 − (1 − p)^x for x ∈ {1, 2, 3, . . . }. Graph this function for x ∈ [−1, 4] (hint: step function).
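Both CDFs in this exercise are easy to tabulate numerically. The following Python sketch is one possible check; the ⌊x⌋ in the geometric case is our choice for extending the formula between integers, which produces the step function the hint refers to:

```python
import math

def exp_cdf(x, lam=1.0):
    # CDF of Exponential(lam): 0 for x < 0, 1 - e^{-lam*x} otherwise.
    return 0.0 if x < 0 else 1.0 - math.exp(-lam * x)

def geom_cdf(x, p=0.5):
    # CDF of Geometric(p) on {1, 2, 3, ...}: a step function,
    # F(x) = 1 - (1 - p)^floor(x) for x >= 1, and 0 for x < 1.
    return 0.0 if x < 1 else 1.0 - (1.0 - p) ** math.floor(x)

# The geometric CDF at an integer x agrees with summing the pmf p(1-p)^{k-1}.
p = 0.5
direct = sum(p * (1 - p) ** (k - 1) for k in range(1, 4))
print(geom_cdf(3, p), direct)  # both 0.875
```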
2.3 Moments and Quantiles
Moments. The mth moment of a random variable X is the expected value of the random
variable X^m and is defined as

E[X^m] = Σ_{x ∈ Range(X)} x^m pX(x),

if X is discrete, and as

E[X^m] = ∫_{−∞}^{∞} x^m fX(x) dx,
if X is continuous (provided, of course, that the quantities on the right hand sides exist).
In particular, when m = 1, the first moment of X is generally referred to as its mean
and is often denoted as µX , or just µ when there is no chance of confusion. The expected
value of a random variable is one measure of the centre of its distribution.
General Formulae. A good, though somewhat informal, way of thinking of the expected
value is that it is the value we would tend to get if we were to average the outcomes of a
very large number of equivalent realisations of the random variable. From this idea, it is
easy to generalize the moment definition to encompass the expectations of any function,
g, of a random variable as either
E[g(X)] = Σ_{x ∈ Range(X)} g(x) p(x),

or

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,
depending on whether X is discrete or continuous.
Central Moments and Variance. The idea of moments is often extended by defining
the central moments, which are the moments of the centred random variable X − µX. The
first central moment is, of course, equal to zero. The second central moment is generally
referred to as the variance of X, and denoted Var(X) or sometimes σ²X. The variance is
a measure of the amount of dispersion in the distribution of X ; that is, random variables
with high variances are likely to produce realisations which are far from the mean, while
low variance random variables have realisations which will tend to cluster closely about
the mean. A simple calculation shows the relationship between the moments and the
central moments; for example, we have
Var(X) = E[(X − µX)²] = E[X²] − µ²X.
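The identity Var(X) = E[X²] − µ² can be confirmed on a small discrete example. A Python sketch using exact rational arithmetic and a fair die (an example of our choosing):

```python
from fractions import Fraction

# X = score of a fair six-sided die; pmf is uniform on {1,...,6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())

# Variance two ways: by the definition E[(X - mu)^2], and via E[X^2] - mu^2.
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())
var_moments = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

print(mean, var_def, var_moments)  # 7/2 35/12 35/12
```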
One drawback to the variance is that, by its definition, its units are not comparable
to those of X . To avert this problem, we often use the square root of the variance,
σX = √Var(X), which is called the standard deviation of the random variable X.

Quantiles and Median. Another way to characterize the location (i.e. centre) and
spread of the distribution of a random variable is through its quantiles. The (1 − α)-quantile
of the distribution of X is any value να which satisfies:

P(X ≤ να) ≥ 1 − α  and  P(X ≥ να) ≥ α.

Note that the definition does not necessarily uniquely define the quantile; in other
words, there may be several distinct (1 − α)-quantiles of a distribution. However, for most
continuous distributions that we shall meet the quantiles will be unique. In particular,
the α = 1/2 quantile is called the median of the distribution and is another measure of the
centre of the distribution, since there is a 50% chance that a realisation of X will fall below
it and also a 50% chance that the realisation will be above the median value. The α = 3/4
and α = 1/4 quantiles are generally referred to as the first and third quartiles, respectively,
and their difference, called the interquartile range (or IQR), is another measure of spread
in the distribution.
Expectation via Tails. Calculating the mean of a random variable from the definition
can often involve painful integration and algebra. Sometimes, there are simpler ways. For
example, if X is a non-negative integer-valued random variable (i.e. its range contains
only non-negative integers), then we can calculate the mean of X as
µ = Σ_{x=0}^{∞} P(X > x).

The validity of this can be easily seen by a term rearrangement argument:

Σ_{x=0}^{∞} P(X > x) = Σ_{x=0}^{∞} Σ_{y=x+1}^{∞} p(y) = Σ_{y=1}^{∞} Σ_{x=0}^{y−1} p(y) = Σ_{y=1}^{∞} y p(y) = µ.
More generally, if X is an arbitrary, but non-negative random variable with cumulative
distribution function F , then
µ = ∫_0^∞ {1 − F(x)} dx.
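For a concrete check of the tail-sum formula, take X ∼ Geometric(p), for which P(X > x) = (1 − p)^x and E[X] = 1/p. A short Python sketch (truncating the infinite sum is our approximation):

```python
# Check mu = sum_{x>=0} P(X > x) for X ~ Geometric(p) on {1, 2, 3, ...},
# where P(X > x) = (1 - p)^x and the exact mean is 1/p.
p = 0.25

# Truncate the tail sum; the geometric tail decays fast enough that
# a few hundred terms give machine-precision agreement.
tail_sum = sum((1 - p) ** x for x in range(500))

print(tail_sum, 1 / p)  # both very close to 4.0
```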
Example 2.1. Let a > 0 and U be uniformly distributed on (0, a). Using at least two
methods find E[U ].
Solution: U is a continuous random variable with density f U (u) = 1/a for 0 < u < a
(otherwise, fU(u) = 0 for u ∉ (0, a)).

Method I:

E[U] = (1/a) ∫_0^a u du = (1/a) [u²/2]_0^a = a/2.

Method II: Note that U is a nonnegative random variable taking values only in (0, a).
Also, FU(u) = (1/a) ∫_0^u dv = u/a if 0 < u < a; otherwise, FU(u) = 0 for u ≤ 0 and
FU(u) = 1 for u ≥ a. Consequently, the tail integral becomes

E[U] = ∫_0^∞ {1 − FU(u)} du = ∫_0^a {1 − FU(u)} du = ∫_0^a {1 − (u/a)} du = a − (1/a)[u²/2]_0^a = a/2.
2.4 Moment Generating Functions
A more general method of calculating moments is through the use of the moment gener-
ating function (or mgf ), which is defined as
m(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} dF(x),

provided the expectation exists for all values of t in a neighborhood of the origin. To
obtain the moments of X we note that (provided sufficient regularity conditions which
justify the interchange of the operations of differentiation and integration are satisfied)

(d^m/dt^m) E[e^{tX}] |_{t=0} = E[X^m e^{tX}] |_{t=0} = E[X^m].
Example 2.2. Suppose that X has a Poisson(λ) distribution. The moment generating
function of X is given by:
m(t) = Σ_{x=0}^{∞} e^{tx} p(x) = Σ_{x=0}^{∞} e^{tx} λ^x e^{−λ}/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)},

where we have used the series expansion Σ_{n=0}^{∞} x^n/n! = e^x. Taking derivatives of m(t) shows
that

m′(t) = e^{λ(e^t − 1)} (λe^t)  ⟹  m′(0) = E[X] = λ,
m″(t) = e^{λ(e^t − 1)} (λe^t)² + e^{λ(e^t − 1)} (λe^t)  ⟹  m″(0) = E[X²] = λ² + λ.

Finally, this shows that Var(X) = E[X²] − {E[X]}² = (λ² + λ) − λ² = λ.
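The series manipulation and the two derivatives above can be sanity-checked numerically. The Python sketch below compares the closed form of m(t) against a truncated version of its defining series, and approximates m′(0) and m″(0) by finite differences (the step size h and the parameter values are arbitrary choices of ours):

```python
import math

# Compare the closed-form Poisson mgf m(t) = exp(lam*(e^t - 1))
# with the defining series sum_x e^{tx} e^{-lam} lam^x / x!.
lam, t = 2.0, 0.3

closed_form = math.exp(lam * (math.exp(t) - 1.0))
series = sum(math.exp(t * x) * math.exp(-lam) * lam ** x / math.factorial(x)
             for x in range(60))

# Moments via small central finite differences of m(t) around t = 0:
# m'(0) should be E[X] = lam, m''(0) should be E[X^2] = lam^2 + lam.
h = 1e-5
m = lambda t: math.exp(lam * (math.exp(t) - 1.0))
first_deriv = (m(h) - m(-h)) / (2 * h)
second_deriv = (m(h) - 2 * m(0.0) + m(-h)) / h ** 2

print(closed_form, series, first_deriv, second_deriv)
```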
Example 2.3. Suppose that X has a Gamma(α, λ) distribution. The moment generating
function of X is given by:
m(t) = ∫_0^∞ e^{tx} (λ/Γ(α)) (λx)^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ e^{−(λ−t)x} x^{α−1} dx
     = (λ^α/(λ − t)^α) ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = (λ/(λ − t))^α,   t < λ,

where we have used the fact that ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = 1, since
it is the integral of the density of a Gamma(α, λ − t) distribution over the full range of its
sample space, provided that t < λ (the parameters of a Gamma distribution must be
positive). So, differentiating this function shows that:

m′(t) = αλ^α/(λ − t)^{α+1}  ⟹  m′(0) = E[X] = α/λ,
m″(t) = α(α + 1)λ^α/(λ − t)^{α+2}  ⟹  m″(0) = E[X²] = (α² + α)/λ².

Thus, we can see that Var(X) = E[X²] − {E[X]}² = α/λ².
If X and Y are random variables having finite mgfs in an open interval containing zero,
and their mgfs are equal there, then X and Y have the same distribution. Generally, such
demonstrations rely on algebraic calculation followed by recognition of the resulting function
as the moment generating function of some specific distribution. As a useful reference,
then, we now give the moment generating functions of some of the distributions noted in
the previous section:

Discrete Distributions

Poisson(λ): m(t) = exp(λ(e^t − 1)),  t ∈ R
Binomial(n, p): m(t) = (1 − p + pe^t)^n,  t ∈ R
Geometric(p): m(t) = pe^t/(1 − (1 − p)e^t),  (1 − p)e^t < 1
Generating Functions. More generally, the concept of a generating function of a se-
quence is quite useful in many areas of probability theory. The generating function of a
sequence of numbers {a0, a1, a2, . . . } is defined as

A(s) = a0 + a1 s + a2 s² + · · · = Σ_{n=0}^{∞} an s^n,
provided the series converges for values of s in a neighborhood of the origin.
Note that from this definition, the moment generating function is just the generating
function of the sequence an = E[X n]/n!. As with the mgf , the elements of the sequence
can be recovered by successive differentiation of the generating function and subsequent
evaluation at s = 0 (as well as a rescaling by n! for the appropriate value of n).
In particular, if X is a discrete random variable taking non-negative integer values,
then setting an = P(X = n) yields the probability generating function,
P(s) = E[s^X] = E[e^{X log s}].

Note that m(t) = P(e^t), so that there is a clear link between the moment generating
function and the probability generating function. In particular, moment-like quantities
can be found via derivatives of P(s) evaluated at s = 1 = e^0. For example,

m″(t) |_{t=0} = {P″(e^t) e^{2t} + P′(e^t) e^t} |_{t=0} = P″(1) + P′(1).

Also, if we let qn = P(X > n), then Q(s) = Σ_n qn s^n is a tail probability generating
function, and

Q(s) = (1 − P(s))/(1 − s).

This can be seen by noting that the coefficient of s^n in the function (1 − s)Q(s) is

qn − qn−1 = P(X > n) − P(X > n − 1) = P(X > n) − {P(X = n) + P(X > n)} = −P(X = n)

for n ≥ 1, and q0 = P(X > 0) = 1 − P(X = 0) for n = 0, so that

(1 − s)Q(s) = 1 − P(X = 0) − Σ_{n=1}^{∞} P(X = n) s^n = 1 − P(s).

We saw earlier that E[X] = Σ_n qn, so E[X] = Q(1) = lim_{s→1} {1 − P(s)}/(1 − s); thus,
the graph of {1 − P(s)}/(1 − s) has a removable singularity (a hole) rather than an
asymptote at s = 1, as long as the expectation of X exists and is finite.
Additional Remarks. For the mathematically minded, we note that occasionally the
mgf (which, for positive random variables is also sometimes called the Laplace transform
of the density or pmf ) will not exist (for example, the t- and F -distributions have non-
existent moment generating functions, since the necessary integrals are infinite) and this
is why it is often more convenient to work with the characteristic function (also known to some as the Fourier transform of the density or pmf), ψ(t) = E[e^{itX}], which always
exists, but this requires some knowledge of complex analysis. One of the most useful
features of the characteristic function (and of the moment generating function, in the
cases where it exists) is that it uniquely specifies the distribution from which it arose (i.e.
no two distinct distributions have the same characteristic function), and many difficult
properties of distributions can be derived easily from the corresponding properties of
characteristic functions. For example, the Central Limit Theorem is easily proved using
moment generating functions, as are some important relationships regarding the various
distributions listed above.
3 Several Random Variables
3.1 Joint distributions
The joint distribution of two random variables X and Y describes how the outcomes of
the two random variables are probabilistically related. Specifically, the joint distribution
function is defined as
F XY (x, y) = F (x, y) = P(X ≤ x and Y ≤ y).
Usually, the subscripts are omitted when no ambiguity is possible.
If X and Y are both discrete, then they have a joint probability mass function
defined by p(x, y) = P(X = x and Y = y). Otherwise, there may exist a joint density,
defined as that function fXY which satisfies:

FXY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(ξ, η) dη dξ.
In this case, we call X and Y (jointly) continuous.
The case where one of X and Y is discrete and one continuous is of interest, but is
slightly more complicated and we will deal with it when it comes up.
The function F X (x) = limy→∞ F (x, y) is called the marginal distribution function of
X , and similarly the marginal distribution function of Y is F Y (y) = limx→∞ F (x, y). If X
and Y are discrete, then the marginal probability mass functions are simply
pX(x) = Σ_{y ∈ Range(Y)} p(x, y)  and  pY(y) = Σ_{x ∈ Range(X)} p(x, y).
If X and Y are continuous, then the marginal densities of X and Y are given by
fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy  and  fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx,
respectively. Note that the marginal density at a particular value is obtained by simply
integrating the joint density along the appropriate horizontal or vertical line.
The expectation of a function h of the two random variables X and Y is calculated in
a fashion similar to the expectations of functions of single random variables, namely,
E[h(X, Y)] = Σ_{x ∈ Range(X)} Σ_{y ∈ Range(Y)} h(x, y) p(x, y)
if X and Y are discrete, or
E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy
if X and Y are continuous.
Note that the above definitions show that regardless of the type of random variables,
E[aX + bY ] = aE[X ] + bE[Y ] for any constants a and b. Also, analogous definitions and
results hold for any finite group of random variables. For example the joint distribution
of X 1, . . . , X k is
F (x1, . . . , xk) = P(X 1 ≤ x1 and . . . and X k ≤ xk).
3.2 Covariance, Correlation, Independence

Independence. If it happens that F(x, y) = FX(x)FY(y), then the random variables X
and Y are said to be independent. If both random variables are continuous, then
the above condition is equivalent to f(x, y) = fX(x)fY(y), while if both are discrete it is
the same as p(x, y) = pX(x)pY(y). Note the similarity of these definitions to that for the
independence of events.
Given two jointly distributed random variables X and Y, we can calculate their means,
µX and µY, and their standard deviations, σX and σY, using their marginal distributions.
Provided these means and standard deviations exist, we can use the joint distribution
to calculate the covariance between X and Y, which is defined as

Cov(X, Y) = σXY = E[(X − µX)(Y − µY)] = E[XY] − µX µY.

Two random variables are said to be uncorrelated if their covariance is zero. Note that
if X and Y are independent then they are certainly uncorrelated, since the factorization of
the pmf or density implies that E[XY ] = E[X ]E[Y ] = µX µY . However, two uncorrelated
random variables need not be independent. Note also that it is an easy calculation to
show that if X , Y , V and W are jointly distributed random variables and a, b, c and d
are constants, then
Cov(aX + bY, cV + dW) = ac σXV + ad σXW + bc σYV + bd σYW;
in other words, the covariance operator is bilinear .
Finally, if we scale σXY by the product of the two standard deviations, we get the
correlation coefficient, ρ = σXY/(σX σY), which satisfies −1 ≤ ρ ≤ 1.
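A standard example of uncorrelated but dependent random variables is X uniform on {−1, 0, 1} with Y = X². This example is our choice, not from the notes, but it illustrates the warning above; a Python sketch with exact arithmetic:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2. Then Cov(X, Y) = 0 although
# Y is a function of X, so the pair is uncorrelated but not independent.
pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

E_X = sum(x * p for x, p in pmf_X.items())           # 0
E_Y = sum(x ** 2 * p for x, p in pmf_X.items())      # E[X^2] = 2/3
E_XY = sum(x * x ** 2 * p for x, p in pmf_X.items()) # E[X^3] = 0

cov = E_XY - E_X * E_Y
print(cov)  # 0

# Independence fails: P(X = 1 and Y = 1) != P(X = 1) * P(Y = 1).
p_joint = pmf_X[1]                  # 1/3
p_prod = pmf_X[1] * Fraction(2, 3)  # 2/9
print(p_joint, p_prod)
```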
3.3 Sums of Random Variables and Convolutions
We saw that the expectation of Z = X + Y was simply the sum of the individual expec-
tations of X and Y for any two random variables. Unfortunately, this is about the extent
of what we can say in general. If, however, X and Y are independent, the distribution of
Z can be determined by means of a convolution:
F_Z(z) = ∫_{−∞}^{∞} F_X(z − ξ) dF_Y(ξ) = ∫_{−∞}^{∞} F_Y(z − ξ) dF_X(ξ).
In the case where both X and Y are discrete, we can write the convolution formula
using pmf's:

p_Z(z) = Σ_{x∈Range(X)} p_X(x) p_Y(z − x) = Σ_{y∈Range(Y)} p_X(z − y) p_Y(y).
If X and Y are both continuous, we can rewrite the convolution formula using densities:
f_Z(z) = ∫_{−∞}^{∞} f_X(z − ξ) f_Y(ξ) dξ = ∫_{−∞}^{∞} f_Y(z − ξ) f_X(ξ) dξ.
Note that, in the same way that marginal densities are found by integrating along hor-
izontal or vertical lines, the density of Z at the value z is found by integrating along
the line x + y = z, and of course using independence to state that f_XY(ξ, z − ξ) = f_X(ξ) f_Y(z − ξ).
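For a concrete discrete illustration (our own example, not from the notes), the convolution formula for pmf's gives the distribution of the sum of two fair dice directly:

```python
# pmf of a single fair die on {1, ..., 6}
p = {x: 1 / 6 for x in range(1, 7)}

# Discrete convolution: p_Z(z) = sum over x of p_X(x) * p_Y(z - x)
p_z = {z: sum(p[x] * p.get(z - x, 0.0) for x in p) for z in range(2, 13)}
```

The result is the familiar triangular pmf on {2, . . . , 12}: for instance p_z[7] = 6/36 = 1/6 and p_z[2] = 1/36, and the values sum to 1.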
Since convolutions are a bit cumbersome, we now note an advantage of mgf ’s. If X and Y are independent, then
m_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = m_X(t) m_Y(t).
So, the mgf of a sum of independent random variables is the product of the mgf ’s of the
summands. This fact makes many calculations regarding sums of independent random
variables much easier to demonstrate:
Suppose that X ∼ Gamma(α_X, λ) and Y ∼ Gamma(α_Y, λ) are two independent
random variables, and we wish to determine the distribution of Z = X + Y. We could
use the convolution formula, but this would require some extremely difficult (though not
impossible, of course) integration. However, recalling the moment generating function of
the Gamma distribution, we see that, for any t < λ:

m_{X+Y}(t) = m_X(t) m_Y(t) = (λ/(λ − t))^{α_X} (λ/(λ − t))^{α_Y} = (λ/(λ − t))^{α_X + α_Y},

which easily shows that X + Y ∼ Gamma(α_X + α_Y, λ).
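The mgf argument can be sanity-checked by simulation (a sketch with arbitrary parameter choices α_X = 2, α_Y = 3, λ = 1.5 of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
a_x, a_y, lam = 2.0, 3.0, 1.5
n = 400_000

# numpy parameterizes the gamma by shape and scale = 1/lambda
z = rng.gamma(a_x, 1.0 / lam, n) + rng.gamma(a_y, 1.0 / lam, n)

# Gamma(a_x + a_y, lam) has mean (a_x + a_y)/lam and variance (a_x + a_y)/lam^2
mean_err = abs(z.mean() - (a_x + a_y) / lam)
var_err = abs(z.var() - (a_x + a_y) / lam ** 2)
```

Both errors come out close to zero, consistent with X + Y ∼ Gamma(α_X + α_Y, λ).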
3.4 Change of Variables
We saw previously that we could find the expectation of g(X) using the distribution of X.
Suppose, however, that we want to know more about the new random variable Y = g(X ).
If g is a strictly monotone function, we can find the distribution of Y by noting that
F_Y(y) = P(Y ≤ y) = P({g(X) ≤ y}) = P({X ≤ g^{−1}(y)}) = F_X(g^{−1}(y)),
if g is increasing, and
F_Y(y) = P({Y ≤ y}) = P({g(X) ≤ y}) = P({X ≥ g^{−1}(y)}) = 1 − F_X(g^{−1}(y)) + P({X = g^{−1}(y)}),

if g is decreasing (if g is not strictly monotone, we need to be a bit more clever, but we won't deal with that case here). Now, if X is continuous and g is a smooth function (i.e.
has a continuous derivative) then the differentiation chain rule yields
f_Y(y) = (1/|g′(g^{−1}(y))|) f_X(g^{−1}(y)) = (1/|g′(x)|) f_X(x),

where y = g(x) (note that when X is continuous, the CDF of Y in the case when g is
decreasing simplifies, since P({X = g^{−1}(y)}) = 0).
A similar formula holds for joint distributions, except that the derivative factor becomes
the reciprocal of the modulus of the determinant of the Jacobian matrix for the
transformation function g. In other words, if X_1 and X_2 have joint density f_{X_1 X_2} and
g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (y_1, y_2) is an invertible transformation, then
the joint density of Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) is

f_{Y_1 Y_2}(y_1, y_2) = |J(x_1, x_2)|^{−1} f_{X_1 X_2}(x_1, x_2),

where y_1 = g_1(x_1, x_2), y_2 = g_2(x_1, x_2), and |J(x_1, x_2)| is the determinant of the
Jacobian matrix J(x_1, x_2), whose (i, j)th element is J_ij(x_1, x_2) = ∂g_i(x_1, x_2)/∂x_j.
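As a quick numerical check of the univariate formula (a sketch using our own choice of example: X exponential and g(x) = x², which is strictly increasing on the positive half-line), the distribution function of Y = X² should be F_Y(y) = F_X(√y) = 1 − e^{−√y}:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 500_000)  # f_X(x) = e^{-x} for x > 0
y = x ** 2                         # g(x) = x^2 is strictly increasing on x > 0

# F_Y(y) = F_X(g^{-1}(y)) = 1 - exp(-sqrt(y)); compare against the empirical CDF
y0 = 4.0
predicted = 1 - np.exp(-np.sqrt(y0))
empirical = float(np.mean(y <= y0))
err = abs(predicted - empirical)
```

At y0 = 4 the predicted value 1 − e^{−2} agrees with the empirical frequency to within simulation error.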
4 Conditional Probability
4.1 Conditional Probability of Events
So far, we have discussed the probabilities of events in a rather static situation. However,
typically, we wish to know how the outcomes of certain events will subsequently affect the
chances of later events. To describe such situations, we need to use conditional probability
for events.
Suppose that we wish to know the chance that an event A will occur. Then we have
seen that we want to calculate P(A). However, if we are in possession of the knowledge
that the event B has already occurred, then we would likely change our belief about the
chance of A occurring: think, for example, of A as the event "it will rain today" and B as the
event "the sky is overcast". We use the notation P(A|B) to signify the probability of A
given that B has occurred, and we define it as

P(A|B) = P(A ∩ B)/P(B),

provided P(B) ≠ 0.

If we think of probabilities as areas in a Venn diagram, then a conditional probability
amounts to restricting the sample space down from Ω to B and then finding the relative
area of that part of A which is also in B to the total area of the restricted sample space,
namely B itself.

Multiplication Rule. In many of our subsequent applications, conditional probabilities
will be dictated as primary data by the circumstances of the process under study. In this
case, the above definition will find its most useful function in the form
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).
Independence. Also, we can rephrase independence of events in terms of conditional
probabilities; namely, two events A and B are independent if and only if
P(A|B) = P(A) and P(B|A) = P(B).

(Note that only one of the above two conditions need be verified, since if one is true the
other follows from the definition of conditional probability.) In other words, two events
are independent if the chance of one occurring is unaffected by whether or not the other
has occurred.
Total Probability Law. Recalling the law of total probability, we can use this new
identity to show that if the sets B_1, . . . , B_k form a partition, then

P(A) = Σ_{i=1}^{k} P(A|B_i) P(B_i).

Bayes' Rule. Finally, a very useful formula exists which relates the conditional probability
of A given B to the conditional probability of B given A, and goes by the name of
Bayes' Rule. Bayes' rule states that
P(B|A) = P(A ∩ B)/P(A) = P(A|B) P(B) / [P(A|B) P(B) + P(A|B^c) P(B^c)],
which follows from the definition of conditional probability and the law of total probability,
since B and Bc form a partition. In fact, we can generalize Bayes’ rule by letting B1, . . . , Bk
be a more general partition, so that
P(B_i|A) = P(A|B_i) P(B_i) / Σ_{j=1}^{k} P(A|B_j) P(B_j).
Example 4.1. Suppose there are three urns labelled I , II and III , the first containing
4 red and 8 blue balls, the second containing 3 red and 9 blue, and the third 6 red and 6
blue. (a) If an urn is picked at random and subsequently a ball is picked at random from
the chosen urn, what is the chance that the chosen ball will be red? (b) If a red ball is
drawn, what is the chance that it came from the first urn?
Solution: Let R be the event that the chosen ball is red. Then, from the description of
the situation it is clear that:
P(I) = P(II) = P(III) = 1/3,
P(R|I) = 4/12 = 1/3,  P(R|II) = 3/12 = 1/4,  P(R|III) = 6/12 = 1/2.
(a) Since the events I, II and III clearly form a partition (i.e. one and only one of them
must occur), we can use the law of total probability to find
P(R) = P(R|I) P(I) + P(R|II) P(II) + P(R|III) P(III) = 13/36.
(b) Using Bayes’ rule,
P(I|R) = P(R|I) P(I) / [P(R|I) P(I) + P(R|II) P(II) + P(R|III) P(III)] = (1/3)(1/3) / (13/36) = 4/13.
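The urn calculation is easy to reproduce with exact arithmetic (a small sketch using Python's fractions module; the dictionary layout is our own):

```python
from fractions import Fraction as F

# Prior probability of choosing each urn, and P(red | urn)
prior = {'I': F(1, 3), 'II': F(1, 3), 'III': F(1, 3)}
p_red = {'I': F(4, 12), 'II': F(3, 12), 'III': F(6, 12)}

# (a) Law of total probability
p_r = sum(prior[u] * p_red[u] for u in prior)

# (b) Bayes' rule
p_I_given_r = prior['I'] * p_red['I'] / p_r
```

This returns p_r = 13/36 and p_I_given_r = 4/13, matching the hand calculation.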
4.2 Discrete Random Variables
Conditional pmf . The conditional probability mass function derives from the definition
of conditional probability for events in a straightforward manner:
p_{X|Y}(x|y) = P(X = x|Y = y) = P(X = x and Y = y)/P(Y = y) = p_XY(x, y)/p_Y(y),
as long as p_Y(y) > 0. Note that for each y, p_{X|Y}(·|y) is a pmf, i.e. Σ_x p_{X|Y}(x|y) = 1, but the
same is not true for each fixed x. Also, the law of total probability becomes

p_X(x) = Σ_{y∈Range(Y)} p_{X|Y}(x|y) p_Y(y).
Example 4.2. Suppose that N has a geometric distribution with parameter 1 − β, and
that, conditional on N, X has a negative binomial distribution with parameters p and N.
In other words,

p_N(n) = (1 − β) β^{n−1} for n = 1, 2, . . .

and

p_{X|N}(x|n) = C(x + n − 1, n − 1) p^x (1 − p)^n for x = 0, 1, . . . ,

where C(m, k) denotes the binomial coefficient. Find the marginal distribution of X.
Solution: Using the law of total probability, for x = 0, 1, 2, . . . ,

p_X(x) = Σ_{n=1}^{∞} p_{X|N}(x|n) p_N(n)
       = Σ_{n=1}^{∞} C(x + n − 1, n − 1) p^x (1 − p)^n (1 − β) β^{n−1}
       = Σ_{n=1}^{∞} C(x + n − 1, x) p^x (1 − p)^n (1 − β) β^{n−1}
       = (1 − β) β^{−1} p^x Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^{n+1}
       = [(1 − β)(1 − p) p^x / (1 − (1 − p)β)^{x+1}] Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^n (1 − (1 − p)β)^{x+1}
       = [(1 − β)(1 − p) / (1 − (1 − p)β)] [p / (1 − (1 − p)β)]^x,

where the final sum equals 1 because its terms form a negative binomial pmf in n.
Consequently, X + 1 ∈ {1, 2, 3, . . . } is geometric with parameter (1 − β)(1 − p)/(1 − (1 − p)β).
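This marginal can be sanity-checked by simulation (a sketch with our own parameter choices; note that numpy's negative_binomial(n, q) counts failures before the n-th success when each trial succeeds with probability q, which matches p_{X|N} above with q = 1 − p):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, p = 0.4, 0.3
n_sim = 300_000

# N is geometric on {1, 2, ...} with success probability 1 - beta
N = rng.geometric(1 - beta, n_sim)
# Given N = n: P(X = x | N = n) = C(x+n-1, n-1) p^x (1-p)^n,
# i.e. failures before the n-th success with success probability 1 - p
X = rng.negative_binomial(N, 1 - p)

# The text shows X is geometric on {0, 1, ...} with ratio r = p/(1 - (1-p)*beta),
# so P(X = 0) = 1 - r and E[X] = r/(1 - r)
r = p / (1 - (1 - p) * beta)
mean_err = abs(X.mean() - r / (1 - r))
p0_err = abs(np.mean(X == 0) - (1 - r))
```

Both errors come out close to zero, consistent with the geometric marginal derived above.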
Conditional Expectation. The conditional expectation of g(X ) given Y = y, denoted
as E[g(X )|Y = y], is defined as
E[g(X)|Y = y] = Σ_{x∈Range(X)} g(x) p_{X|Y}(x|y).
The law of total probability then shows that
E[g(X)] = Σ_x g(x) p_X(x) = Σ_x g(x) Σ_y p_{X|Y}(x|y) p_Y(y) = Σ_y E[g(X)|Y = y] p_Y(y).
Note that the conditional expectation can be regarded as a function of y; that is, it is
a numerical function defined on the sample space of Y and is thus a random variable,
denoted by E[g(X )|Y ], and we therefore have
E[g(X)] = E[E[g(X)|Y]].

A similar expression can be obtained for variances:
Var(X) = E[X²] − (E[X])²
       = E[E[X²|Y]] − (E[E[X|Y]])²
       = E[E[X²|Y]] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
       = E[ E[X²|Y] − (E[X|Y])² ] + { E[(E[X|Y])²] − (E[E[X|Y]])² }
       = E[Var(X|Y)] + Var(E[X|Y]).

Note that we have defined Var(X|Y) := E[X²|Y] − (E[X|Y])² = σ²_{X|Y}, which, like the
conditional expectation, is now a random variable.
Example 4.3. Let Y have a distribution with mean µ and variance σ². Conditional on
Y = y, suppose that X has a distribution with mean −y and variance y². Find the
variance of X.

Solution: From the information given, E[X|Y] = −Y and Var(X|Y) = Y². Thus,

Var(X) = E[Y²] + Var(−Y) = σ² + µ² + (−1)² Var(Y) = 2σ² + µ².
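The decomposition Var(X) = E[Var(X|Y)] + Var(E[X|Y]) and the answer 2σ² + µ² can be checked by simulation (a sketch; we take Y normal with µ = 1 and σ = 2, and X | Y = y normal with mean −y and standard deviation |y|, one concrete choice of our own that is consistent with the stated conditional moments):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
n = 1_000_000

y = rng.normal(mu, sigma, n)
# Conditional on Y = y: mean -y and variance y^2 (standard deviation |y|)
x = rng.normal(-y, np.abs(y))

# Example 4.3 predicts Var(X) = 2*sigma^2 + mu^2, and E[X] = E[-Y] = -mu
var_err = abs(x.var() - (2 * sigma ** 2 + mu ** 2))
mean_err = abs(x.mean() + mu)
```

The sample variance comes out close to 2σ² + µ² = 9, as predicted.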
Since the conditional expectation is the expectation with respect to the conditional
probability mass function p_{X|Y}(x|y), conditional expectations behave in most ways like ordinary expectations. For example,
1. E[ag(X 1) + bh(X 2)|Y ] = aE[g(X 1)|Y ] + bE[h(X 2)|Y ]
2. If g ≥ 0 then E[g(X )|Y ] ≥ 0
3. E[g(X, Y )|Y = y] = E[g(X, y)|Y = y]
4. If X and Y are independent, E[g(X )|Y ] = E[g(X )]
5. E[g(X )h(Y )|Y ] = h(Y )E[g(X )|Y ]
6. E[g(X )h(Y )] = E[h(Y )E[g(X )|Y ]]
In particular, it follows from properties 1 and 5 that E[a|Y] = a for any constant a, and
E[h(Y)|Y] = h(Y) for any function h.

Remark: the formulae 1.–6. are applicable in more general situations, even if X or Y
is not discrete (cf. random sums for some applications).
4.3 Mixed Cases
If X is a continuous random variable and N is a discrete random variable, then the
conditional distribution function F_{X|N}(x|n) of X given that N = n can be defined in the
obvious way:

F_{X|N}(x|n) = P(X ≤ x and N = n)/P(N = n).
From this definition, we can easily define the conditional probability density function as
f_{X|N}(x|n) = (d/dx) F_{X|N}(x|n).
As in the discrete case, the conditional density behaves much like an ordinary density, so
that, for example,
P(a < X ≤ b, N = n) = P(a < X ≤ b|N = n) P(N = n) = p_N(n) ∫_a^b f_{X|N}(x|n) dx.
Note that the key feature to this and the discrete case was that the conditioning random
variable N was discrete, so that we would be able to guarantee that there would be some
possible values of n such that P(N = n) > 0. It is possible to condition on continuous
random variables and the properties are much the same, but we just need to take a bit of
care since technically the probability of any individual outcome of a continuous random
variable is zero.
4.4 Random Sums
Suppose we have an infinite sequence of independent and identically distributed random
variables ξ 1, ξ 2, . . ., and a discrete non-negative integer valued random variable N which
is independent of the ξ ’s. We can then define the random sum
X = ξ_1 + . . . + ξ_N = Σ_{k=1}^{N} ξ_k.
(Note that for convenience, we will define the sum of zero terms to be zero.)
Moments. If we let
E[ξ_k] = µ,  Var(ξ_k) = σ²,  E[N] = ν,  Var(N) = τ²,
then we can derive the mean and variance of X as
E[X] = E[E[X|N]] = Σ_{n=0}^{∞} E[X|N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_N | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n] p_N(n)
     = µ Σ_{n=1}^{∞} n p_N(n) = µν,
and the variance as

Var(X) = E[(X − µν)²] = E[(X − Nµ + Nµ − µν)²]
       = E[(X − Nµ)²] + E[µ²(N − ν)²] + 2 E[µ(X − Nµ)(N − ν)]
       = E[ E[(X − Nµ)² | N] ] + E[µ²(N − ν)²] + 2 E[ E[µ(X − Nµ)(N − ν) | N] ]
       = νσ² + µ²τ²,

since

E[X − Nµ | N = n] = E[ Σ_{i=1}^{n} ξ_i − nµ ] = 0;
E[(X − Nµ)² | N = n] = E[ ( Σ_{i=1}^{n} ξ_i − nµ )² ] = nσ².
Example 4.4. Total Grandchildren - Suppose that individuals in a certain species have
a random number of offspring independently of one another with a known distribution
having mean µ and variance σ2. Let X be the number of grandchildren of a single parent,
so that X = ξ_1 + . . . + ξ_N, where N is the random number of original offspring and ξ_k is the
random number of offspring of the kth child of the original parent. Then E[N] = E[ξ_k] = µ
and Var(N) = Var(ξ_k) = σ², so that

E[X] = µ² and Var(X) = µσ²(1 + µ).
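These moment formulas are easy to test by simulation (a sketch; for concreteness we let both the number of children N and each ξ_k be Poisson with mean µ = 2, our own choice, so that σ² = µ):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = 2.0
n_sim = 200_000

N = rng.poisson(mu, n_sim)
# X = xi_1 + ... + xi_N; a sum of N iid Poisson(mu) terms is Poisson(N*mu),
# so each random sum can be drawn in one shot
X = rng.poisson(N * mu)

# With nu = mu and sigma^2 = tau^2 = mu: E[X] = mu^2, Var(X) = mu*sigma^2*(1 + mu)
mean_err = abs(X.mean() - mu ** 2)
var_err = abs(X.var() - mu ** 2 * (1 + mu))
```

With µ = 2 the predicted mean is 4 and the predicted variance is 12; both sample errors come out near zero.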
Distribution of Random Sums. In addition to moments, we need to know the distri-
bution of the random sum X . If the ξ ’s are continuous and have density function f (z ),
then the distribution of ξ 1 + . . . + ξ n is the n-fold convolution of f , denoted by f (n)(z )
and recursively defined by
f^{(1)}(z) = f(z),
f^{(n)}(z) = ∫_{−∞}^{∞} f^{(n−1)}(z − u) f(u) du for n > 1.
Since N is independent of the ξ's, f^{(n)} is also the density of X given N = n ≥ 1.
Thus, if we assume that P(N = 0) = 0, the law of total probability says

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n).
NOTE: If we don't assume that P(N = 0) = 0, then X has a "mixed" distribution: there
is an atom of probability P(X = 0) = P(N = 0) at zero, while for 0 ≤ a < b,

P(a < X ≤ b) = ∫_a^b [ Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) ] dx.

Example 4.5. Suppose that the ξ's have the exponential density f(z) = λ e^{−λz}
for z ≥ 0, and suppose also that N has a geometric distribution with parameter p, so that
p_N(n) = p(1 − p)^{n−1} for n = 1, 2, . . .. In this case,
f^{(2)}(z) = ∫_{−∞}^{∞} f(z − u) f(u) du = ∫_0^z λ² e^{−λz} du = λ² e^{−λz} ∫_0^z du = λ² z e^{−λz}.
In fact, it is straightforward to use mathematical induction to show that f^{(n)}(z) =
(λ^n / (n−1)!) z^{n−1} e^{−λz} for z ≥ 0, which is a Gamma(n, λ) density (a fact which is much
more easily demonstrated using moment generating functions!). Thus, the density of
X is
f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) = Σ_{n=1}^{∞} (λ^n / (n−1)!) x^{n−1} e^{−λx} p (1 − p)^{n−1}
       = λp e^{−λx} Σ_{n=1}^{∞} {λ(1 − p)x}^{n−1} / (n − 1)!
       = λp e^{−λx} e^{λ(1−p)x}
       = λp e^{−λpx}.
So, X has an exponential distribution with parameter λp, or a Gamma(1, λp). Note that
the distribution of the random sum is not the same as the distribution of the non-random
sum.
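A quick simulation confirms this "geometric sum of exponentials is exponential" fact (a sketch with our own choices λ = 2 and p = 0.25):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, p = 2.0, 0.25
n_sim = 300_000

# N geometric on {1, 2, ...}; given N = n, the sum of n iid Exp(lam) terms
# is Gamma(n, lam), so each random sum can be drawn in one shot
N = rng.geometric(p, n_sim)
X = rng.gamma(N, 1.0 / lam)

# The text shows X ~ Exp(lam * p): mean 1/(lam*p) and P(X > t) = exp(-lam*p*t)
mean_err = abs(X.mean() - 1.0 / (lam * p))
tail_err = abs(np.mean(X > 1.0) - np.exp(-lam * p * 1.0))
```

Both the sample mean and the sample tail probability agree with the Exp(λp) predictions.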
4.5 Conditioning on Continuous Random Variables
Conditional Density. Note that in the previous sections we have been able to use our
definition of conditional probability for events, since the conditioning events {Y = y}
have non-zero probability for discrete random variables. If we want to find the conditional
distribution of X given Y = y, and Y is continuous, we cannot use, as we might first try,

F_{X|Y}(x|y) = P(X ≤ x|Y = y) = P(X ≤ x and Y = y)/P(Y = y),
since both probabilities in the final fraction are zero. Instead, we shall define the conditional
density function as

f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y),
for values of y such that f Y (y) > 0. The conditional distribution function is then given
by
F_{X|Y}(x|y) = ∫_{−∞}^{x} f_{X|Y}(ξ|y) dξ.
Conditional Expectation. Finally, we can define
E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx,
as expected, and this version of the conditional expectation still satisfies all of the nice
properties that we derived in the previous sections for discrete conditioning variables. For
example,
P(a < X ≤ b|Y = y) = F_{X|Y}(b|y) − F_{X|Y}(a|y) = ∫_a^b f_{X|Y}(x|y) dx
                   = ∫_{−∞}^{∞} 1_{(a,b]}(x) f_{X|Y}(x|y) dx = E[1_{(a,b]}(X)|Y = y],

where the function 1_I(x) is the indicator function of the set I, i.e. 1_I(x) = 1 if x ∈ I and
1_I(x) = 0 otherwise.
Note that, as is the case with ordinary expectations and indicators, the conditional
probability of the random variable having an outcome in I is equal to the conditional
expectation of the indicator function of that event. (Recall that

P(X ∈ I) = ∫_I f_X(x) dx = ∫_{−∞}^{∞} 1_I(x) f_X(x) dx = E[1_I(X)]

for ordinary expectations and probabilities.)

We can use the above fact to show a new form of the law of total probability, which
is often a very useful method of finding probabilities; namely,
P(a < X ≤ b) = ∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy.
To see why this is true, note that
∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} ∫_a^b f_{X|Y}(x|y) dx f_Y(y) dy
                                        = ∫_{−∞}^{∞} ∫_a^b f_XY(x, y) dx dy
                                        = P(a < X ≤ b and −∞ < Y < ∞)
                                        = P(a < X ≤ b).
In fact, we can generalize this notion even further to show that
P{a < g(X, Y) ≤ b} = ∫_{−∞}^{∞} P{a < g(X, y) ≤ b|Y = y} f_Y(y) dy.
Example 4.6. Suppose X and Y are continuous random variables having joint density
function
f XY (x, y) = ye−xy−y for x, y > 0.
(a) Find the conditional distribution of X given Y = y.
(b) Find the distribution function of Z = XY.
Solution: (a) First, we must find the marginal density of Y , which is
f_Y(y) = ∫_{−∞}^{∞} f_XY(x, y) dx = ∫_0^∞ y e^{−xy−y} dx = e^{−y} ∫_0^∞ y e^{−xy} dx = e^{−y}, y > 0.
Therefore,
f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y) = y e^{−xy}, x > 0.

In other words, conditional on Y = y, X has an exponential distribution with parameter
y, and thus F_{X|Y}(x|y) = 1 − e^{−xy}.

(b) To find the distribution of Z = XY, we write
F_Z(z) = P(Z ≤ z) = P(XY ≤ z) = ∫_{−∞}^{∞} P(XY ≤ z|Y = y) f_Y(y) dy
       = ∫_0^∞ P(X ≤ z/y|Y = y) e^{−y} dy
       = ∫_0^∞ (1 − e^{−z}) e^{−y} dy
       = 1 − e^{−z},
so that Z has an exponential distribution with parameter 1.
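Simulation check (a sketch): draw Y ~ Exp(1), then X | Y = y ~ Exp(y), and confirm that Z = XY behaves like a standard exponential:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000

y = rng.exponential(1.0, n)   # marginal density f_Y(y) = e^{-y}, y > 0
x = rng.exponential(1.0 / y)  # conditional on Y = y, X ~ Exp(rate y), i.e. scale 1/y

z = x * y                     # the text shows Z = XY ~ Exp(1)
cdf_err = abs(float(np.mean(z <= 1.0)) - (1 - np.exp(-1.0)))
mean_err = abs(z.mean() - 1.0)
```

The empirical CDF at z = 1 matches 1 − e^{−1}, and the sample mean matches 1, to within simulation error.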
4.6 Joint Conditional Distributions
If X, Y and Z are jointly distributed random variables and Z is discrete, we can define
the joint conditional distribution of X and Y given Z in the obvious way:

F_{XY|Z}(x, y|z) = P(X ≤ x and Y ≤ y|Z = z) = P(X ≤ x and Y ≤ y and Z = z)/P(Z = z).
If X , Y and Z are all continuous, then we define the joint conditional density of X and
Y given Z as
f_{XY|Z}(x, y|z) = f_XYZ(x, y, z)/f_Z(z),
where f XY Z (x,y,z ) is the joint density function of X , Y and Z and f Z (z ) is the marginal
density function of Z .
The random variables X and Y are said to be conditionally independent given Z
if F_{XY|Z}(x, y|z) = F_{X|Z}(x|z) F_{Y|Z}(y|z), where F_{X|Z}(x|z) = lim_{y→∞} F_{XY|Z}(x, y|z) and
F_{Y|Z}(y|z) = lim_{x→∞} F_{XY|Z}(x, y|z) are the conditional distributions of X given Z and
Y given Z, respectively. As with unconditional independence, an equivalent characterization
when the random variables involved are continuous is that the densities factor as
f_{XY|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z). (NOTE: In an obvious extension of the formula for
unconditional densities,

f_{X|Z}(x|z) = ∫_{−∞}^{∞} f_{XY|Z}(x, y|z) dy,

with a similar definition for f_{Y|Z}(y|z).)
with a similar definition for f Y |Z (y|z ).)As with the case for unconditional joint distributions, a useful concept is the condi-
tional covariance, defined as
Cov(X, Y |Z ) = E[XY |Z ] − E[X |Z ]E[Y |Z ],
and the conditional correlation coefficient, which is simply the conditional covariance
scaled by the product of the conditional standard deviations, σX |Z =
Var(X |Z ) andσY |Z =
Var(Y |Z ). Note that if two random variables are conditionally independent
then they are conditionally uncorrelated (i.e. the conditional covariance is zero), but the
converse is not true. Also, just because two random variables are conditionally independent
or uncorrelated does not necessarily imply that they are unconditionally independent oruncorrelated.
5 Elements of Matrix Algebra
To prepare our analysis of Markov chains it is convenient to recall some elements of matrix
algebra:

A matrix A is a rectangular array with n rows and m columns, with real-valued entries
A(i, j) (A(i, j) refers to the element in the ith row and the jth column). In short, we write
A = (A(i, j)) ∈ R^{n×m} (verbally, A is an n × m matrix).

Example 5.1. Note that

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}

has A(1, 2) = 2.
We have different operations when dealing with matrices:

Scalar Multiplication. Let a ∈ R and A = (A(i, j)) ∈ R^{n×m}. The scalar multiple aA
is defined by taking the product of the real number a with each of the components of A,
giving rise to a new matrix C = (C(i, j)) := aA ∈ R^{n×m} with C(i, j) := a A(i, j).

Example 5.2. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}.

Then (a = 2)

C = 2A = [ 2  4  6
           8 10 12 ] ∈ R^{2×3}.
Transposition. Let A = (A(i, j)) ∈ R^{n×m}. The transpose of A is denoted by A′ = (A′(i, j));
it is an R^{m×n} matrix with entries A′(i, j) := A(j, i). (We interchange the roles of columns
and rows.)

Example 5.3. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3}  ⇒  A′ = [ 1 4
                                   2 5
                                   3 6 ] ∈ R^{3×2}.
Sum of Matrices. Let A = (A(i, j)), B = (B(i, j)) ∈ R^{n×m}. By componentwise adding
the entries we get a new matrix C = (C(i, j)) =: A + B ∈ R^{n×m}, where C(i, j) =
A(i, j) + B(i, j).
Example 5.4. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3},  B = [ 1 1 −2
                                1 3  6 ] ∈ R^{2×3}

⇒ C = A + B = [ 2 3  1
                5 8 12 ] ∈ R^{2×3}.
Product of Matrices. Let A = (A(i, j)) ∈ R^{n×m} and B = (B(i, j)) ∈ R^{m×r}. (The number
m of A's columns must match the number m of B's rows.) Then the matrix product
AB = A · B =: C is the matrix C = (C(i, j)) ∈ R^{n×r} with entries

C(i, j) := Σ_{k=1}^{m} A(i, k) B(k, j),  1 ≤ i ≤ n, 1 ≤ j ≤ r.

By inspection: the entry C(i, j) is the Euclidean inner product of the ith row of A with the
jth column of B.
Example 5.5. Let

A = [ 1 2 3
      4 5 6 ] ∈ R^{2×3},  B = [ 1 4 2
                                2 5 1
                                3 6 1 ] ∈ R^{3×3}.

To compute AB it is convenient to adopt the following scheme:

                      [ 1 4 2
                        2 5 1
                        3 6 1 ]
  [ 1 2 3 ]   [ 1×1 + 2×2 + 3×3   1×4 + 2×5 + 3×6   7     ]
  [ 4 5 6 ]   [ 4×1 + 5×2 + 6×3   . . .             . . . ]

Fill in the dots. The result is

C = AB = [ 14    32      7
           32   . . .   . . . ].
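As a cross-check, numpy computes the same product in one line; comparing C against your hand computation verifies the dotted entries (the code below only spells out the entries already shown in the scheme):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 4, 2],
              [2, 5, 1],
              [3, 6, 1]])

# A is 2x3 and B is 3x3, so C = AB is 2x3; C[i, j] is the inner product
# of row i of A with column j of B
C = A @ B
```

The first row of C recovers the entries 14, 32 and 7 computed in the scheme above.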
Product with Vectors and Matrices. This is a special case of the general matrix
multiplication: let x ∈ R^n and A = (A(i, j)) ∈ R^{n×m}. If we regard x ∈ R^{1×n} as a matrix
with only one row, then xA ∈ R^{1×m} is defined by the corresponding matrix multiplication;
the result is a row vector. If instead we take x ∈ R^{n×1} to be a column vector, then Ax
is defined only when m = n; if n ≠ m, then Ax is not defined, even though x is a column
vector. The dimensions must always match.
Power of Matrices. Let I ∈ R^{n×n} be the identity matrix: I = (I(i, j)) ∈ R^{n×n} with
entries I(i, j) = 1 if i = j and I(i, j) = 0 if i ≠ j. The identity matrix is a diagonal matrix
(only the elements of the diagonal are nonzero) with unit entries on the diagonal. For any
A ∈ R^{n×m} we have IA = A (and for all B ∈ R^{m×n} we have BI = B). For matrices where
the number of columns equals the number of rows, A ∈ R^{n×n}, we can define the pth power
A^p, p ∈ N_0 = {0, 1, 2, 3, 4, . . . }, by iteration:

A^0 := I,  A^1 := A,  A^p := A^{p−1} A = A A^{p−1}.
Example 5.6. Let

A = [ 1 2
      3 4 ].

Find A^0, A^1, A^2 and A^3.

Answer:

A^0 = I = [ 1 0          A^1 = A = [ 1 2          A^2 = [  7 10          A^3 = [ 37  54
            0 1 ],                   3 4 ],               15 22 ],               81 118 ].
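These powers are easy to verify with numpy, whose np.linalg.matrix_power implements exactly the iteration A^p = A^{p−1} A:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# A^0 is the identity; higher powers follow by repeated multiplication
powers = [np.linalg.matrix_power(A, k) for k in range(4)]
```

Inspecting the list reproduces the four matrices in the answer above.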
Example 5.7. (a) Show (A′)′ = A.
(b) Show A + B = B + A.
(c) Show (A + B)′ = A′ + B′.
(d) Show (AB)′ = B′ A′.
(e) Give an example of square matrices A, B ∈ R^{2×2} showing that AB ≠ BA ('not
commutative').
(Also see Tutorials.)
Part II: Markov Chains
6 Stochastic Process and Markov Chains
6.1 Introduction and Definitions
General Stochastic Process. A stochastic process is a family of random variables,
{X_t}_{t∈T}, indexed by a parameter t which belongs to an ordered index set T. For notational
convenience, we will sometimes use X(t) instead of X_t. We use t because the indexing is
For example, the price of a particular stock at the close of each day’s trading would be
a stochastic process indexed by time. Of course, the index does not have to be time; it may
be a spatial indicator, for example the number of defects in specified regions of a computer
chip. In fact, the indexing may be almost anything. Indeed, if we consider the index to
be individuals, we can consider a random sample X 1, . . . , X n to be a stochastic process.
Of course, this would be a rather special stochastic process in that the random variables
making up the stochastic process would be independent of each other. In general, we will
want to deal with stochastic processes where the random variables may be dependent on
one another.
As with individual random variables, we shall be interested in the set S of values
which the random variables may take on, but we shall generally refer to this set as the
state space in this context. Again, as with single random variables, the state space may
be either discrete or continuous. In addition, however, we must now also consider whether
the index set T is discrete or continuous. In this section, we shall be considering the
case where the index set is the discrete set of natural numbers T = N_0 = {0, 1, 2, . . .};
such processes are usually referred to as discrete time stochastic processes. We will start
by examining processes with discrete state spaces and later move on to processes with
continuous time sets T .
Markov Chain. The simplest sort of stochastic processes are of course those for which
the random variables X t are independent. However, the next simplest type of process, and
the starting point for our journey through the theory of stochastic processes, is called a
Markov chain. A Markov chain is a stochastic process having:
1) a countable state space S ,
2) a discrete index set T = {0, 1, 2, . . .},
3) the Markov property, and
4) stationary transition probabilities.
The final two properties listed are discussed next:
6.2 Markov Property
In general, we have defined a stochastic process so that the immediate future may depend
on both the present and the entire past. This framework is a bit too general for an initial
investigation of the concepts involved in stochastic processes. A discrete time process with
discrete state space will be said to have the Markov property if
P(X_{t+1} = x_{t+1} | X_0 = x_0, . . . , X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t).

In other words, the future depends only on the present and not on the past.

At first glance, this may seem a silly property, in the sense that it would never really
happen. However, it turns out that Markov chains can give surprisingly good approxima-
tions to real situations.
Example. As an example, suppose our stochastic process of interest is the total amount
of something (money, perhaps) that we have accumulated at the end of each day. Often, it
is a very reasonable assumption that tomorrow’s amount depends only on what we have
today and not on how we arrived at today’s amount. Indeed, this will be the case if, for
instance, each day’s incremental amount is independent of those for the previous days.
Thus, a very common and useful stochastic process possessing the Markov property is the
sequence of partial totals in a random sum, i.e. X t = ξ 1 + . . . + ξ t where ξ 1, ξ 2 . . . is a
sequence of independent random variables. In this case, it is clear that X t+1 = X t+ξ t+1 de-
pends only on the value of X t (and, of course, on the value of ξ t+1, but this is independent
of all the previous ξ ’s and thus of the previous X ’s as well).
6.3 Stationarity
Suppose we know that at time t, our Markov chain is in state x, and we want to know
about what will happen at time t + 1. The probability of X t+1 being equal to y in this
instance is referred to as the one-step transition probability of going from state x to state
y at time t, and is denoted by P_{t,t+1}(x, y), or sometimes P_{xy}^{t,t+1}. (Note that, for convenience
of terminology, even if x = y we will still refer to this as a transition). If we are dealing
with a Markov chain, then we know that
P_{t,t+1}(x, y) = P(X_{t+1} = y | X_t = x),

since the outcome of X_{t+1} only depends on the value of X_t. If, for any value t in the index
set, we have

P_{t,t+1}(x, y) = P(x, y) = P_{xy} for all x, y ∈ S,

that is, the one-step transition probabilities are the same at all times t, then the process
is said to have stationary transition probabilities. Here, the word stationary describes the
fact that the probability of going from one specified state to another does not change
with time. Note that for the partial totals in a random sum, the process has stationary
transition probabilities if and only if the ξ ’s are identically distributed.
6.4 Transition Matrices and Initial Distributions
Let’s start by considering the simplest type of Markov chain, namely, a chain with state
space of cardinality 2, say S = {0, 1}. (Actually, this is the second-simplest type of chain,
the simplest being one with only one possible state, but this case is rather unenlightening.)
Suppose that at any time t,

P(X_{t+1} = 1 | X_t = 0) = p,  P(X_{t+1} = 0 | X_t = 0) = 1 − p,
P(X_{t+1} = 0 | X_t = 1) = q,  P(X_{t+1} = 1 | X_t = 1) = 1 − q,

and that at time t = 0,

P(X_0 = 0) = π_0(0),  P(X_0 = 1) = π_0(1).
We will generally use the notation π_t to refer to the pmf of the discrete random variable
X_t when dealing with discrete time Markov chains, so that π_t(x) = p_{X_t}(x) = P(X_t = x).
When the state space is finite, we can arrange the transition probabilities, P xy, into a
matrix called the transition matrix . For the two-state Markov chain described above the
transition matrix is

P = [ P(0, 0) P(0, 1)     [ P_00 P_01     [ 1−p   p
      P(1, 0) P(1, 1) ] =   P_10 P_11 ] =    q   1−q ].

Note that for any fixed x, the pmf of X_t given X_{t−1} = x is p_{X_t|X_{t−1}}(y|x) = P(x, y). Thus,
the sum of the values in any row of the matrix P will be 1. If the state space is not finite,
then we will often refer to P(x, y) as the transition function of the Markov chain.
Similarly, if S is finite, we can arrange the initial distribution as a row vector, for
example, π_0 = {π_0(0), π_0(1)} in the case of the two-state chain above.

It is an important fact that P and π_0 are enough to completely characterize a Markov
chain, and we shall examine this more thoroughly a little later. As an example, however,
let’s compute some quantities associated with the above two-state chain.
Example 6.1. For the two-state Markov chain above, let’s examine the chance that X t
will equal 0. To do so, we note
π_t(0) = P(X_t = 0)
       = P(X_t = 0 | X_{t−1} = 0) P(X_{t−1} = 0) + P(X_t = 0 | X_{t−1} = 1) P(X_{t−1} = 1)
       = (1 − p) π_{t−1}(0) + q π_{t−1}(1)
       = q + (1 − p − q) π_{t−1}(0),

using π_{t−1}(1) = 1 − π_{t−1}(0). By iterating this procedure:

π_1(0) = q + (1 − p − q) π_0(0)
π_2(0) = q + (1 − p − q) π_1(0) = q + (1 − p − q){q + (1 − p − q) π_0(0)}
       = q + (1 − p − q) q + (1 − p − q)² π_0(0)
   ...
π_t(0) = q Σ_{i=0}^{t−1} (1 − p − q)^i + (1 − p − q)^t π_0(0)
       = q/(p + q) + (1 − p − q)^t { π_0(0) − q/(p + q) },

where we have used the well-known summation formula for a geometric series,

Σ_{i=0}^{n−1} r^i = (1 − r^n)/(1 − r).
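The recursion and its closed-form solution can be compared numerically (a sketch with arbitrary illustrative values p = 0.2, q = 0.5, π_0(0) = 0.9 of our own):

```python
# Arbitrary illustrative values (not from the notes)
p, q, pi0 = 0.2, 0.5, 0.9
t_max = 20

# Iterate the recursion pi_t(0) = q + (1 - p - q) * pi_{t-1}(0)
pi = pi0
for _ in range(t_max):
    pi = q + (1 - p - q) * pi

# Closed form: pi_t(0) = q/(p+q) + (1-p-q)^t * (pi_0(0) - q/(p+q))
closed = q / (p + q) + (1 - p - q) ** t_max * (pi0 - q / (p + q))
```

The two agree to floating-point precision, and since |1 − p − q| < 1 here, both are already extremely close to the limit q/(p + q).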
First, note that we can thus calculate the distribution of any of the X t’s using only the
entries of P and π_0. Second, as long as p, q ∈ (0, 1), we have |1 − p − q| < 1, so that
π_t(0) → q/(p + q) as t → ∞, regardless of the initial distribution. Finally, the probabilities
for transitions over two steps can be arranged into
the two-step transition matrix,
P² = P × P = [ (1 − p)² + pq      p(2 − p − q)
                q(2 − p − q)    (1 − q)² + pq ].

We will discuss general n-step transition matrices shortly.
A formal proof that P and π0 fully characterize a Markov chain is beyond the scope
of this class. However, we will try to give the basic idea behind the proof now. It should
seem intuitively reasonable that anything we want to know about a Markov chain {X_t}_{t≥0}
can be built up from probabilities of the form

P(X_n = x_n, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_0 = x_0) × P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}) × P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  ...
  = P(x_{n−1}, x_n) P(x_{n−2}, x_{n−1}) · · · P(x_0, x_1) π_0(x_0)
  = π_0(x_0) Π_{i=1}^{n} P(x_{i−1}, x_i).
Notice that the above simply states that the probability that the chain follows a particular
path for the first n steps can be found by simply multiplying the probabilities of the
necessary transitions. Note also that we directly required both the Markov property and
stationarity for this demonstration. Indeed, the above identity is an equivalent form of
the stationary Markov property. As a technical detail, we must be careful that none of
the conditioning events in the above derivation have probability zero. However, this will
only occur when the original path is not possible (i.e. the specified xi’s do not form a
legitimate set of outcomes), in which case the original probability will clearly be zero as
will at least one of the factors in the final product, so the result still holds true.
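The path-probability identity is also easy to check for a small chain: the probability of any particular path is π_0(x_0) times the product of the one-step transition probabilities, and these path probabilities sum to one over all paths of a given length. A minimal sketch, using an arbitrary two-state chain chosen only for illustration:

```python
from itertools import product

# An arbitrary two-state chain and initial distribution, for illustration.
p, q = 0.3, 0.2
P = {(0, 0): 1 - p, (0, 1): p, (1, 0): q, (1, 1): 1 - q}
pi0 = {0: 0.6, 1: 0.4}

def path_prob(path):
    """P(X_0 = x_0, ..., X_n = x_n) = pi_0(x_0) * product of P(x_{i-1}, x_i)."""
    prob = pi0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[(a, b)]
    return prob

# Sanity check: the probabilities of all paths of a given length sum to 1,
# so these path probabilities really do define the law of (X_0, ..., X_n).
n = 4
total = sum(path_prob(w) for w in product([0, 1], repeat=n + 1))
print(abs(total - 1.0) < 1e-12)   # True
```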
For the sake of completeness, we note that the characterization is a one-to-one
correspondence. That is, every Markov chain is completely determined by its initial
distribution and transition matrix, and any initial distribution and transition matrix (recall
that a transition matrix must satisfy the property that each of its rows sums to unity)
determine some Markov chain.
As a final comment, we note that it is the transition function P, rather than the initial
distribution π_0, which is the more fundamental aspect of a Markov chain. We shall
see specifically why this is so in the results to follow, but it should be clear that changing
initial distributions will generally only slightly affect the overall behaviour of the chain,
while a change in P will generally result in dramatic changes.
6.5 Examples of Markov Chains
We now present some of the most commonly used Markov chains.
Random Walk: Let p(u) be a probability mass function on the integers. A random walk
is a Markov chain with transition function P(x, y) = p(y − x) for integer-valued x and y.
Here we have S = Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . . }. For instance, if p(−1) = p(1) = 0.5,
then the chain is the simple symmetric random walk, where at each stage the chain takes
either one step forward or backward. Such models are sometimes used to describe the
motion of a suspended particle. One question of interest we might ask is how far the
particle will travel. Another might be whether the particle ever returns to its starting
position and, if so, how often. Often, the simple random walk is extended so that p(1) = p,
p(−1) = q and p(0) = r, where p, q and r are non-negative numbers less than one such
that p + q + r = 1.
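Such questions are easy to explore by simulation. The following sketch (an illustrative Python snippet, with arbitrary values of p, q and r; not part of the notes) simulates the extended walk and counts returns to the starting position.

```python
import random

def random_walk(steps, p=0.4, q=0.4, r=0.2, x0=0):
    """Simulate the extended simple random walk started at x0: at each
    stage move +1 with probability p, -1 with probability q, and stay
    put with probability r = 1 - p - q.  Returns [X_0, ..., X_steps]."""
    path = [x0]
    for _ in range(steps):
        u = random.random()
        step = 1 if u < p else (-1 if u < p + q else 0)
        path.append(path[-1] + step)
    return path

random.seed(1)
path = random_walk(10_000)
# One of the questions above: how often does the particle return home?
returns = sum(1 for x in path[1:] if x == path[0])
print("returns to the starting position:", returns)
```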
Ehrenfest chain: The Ehrenfest chain is often used as a simple model for the diffusion
of molecules across a membrane. Suppose that we have two distinct boxes and d distinct
labelled balls. Initially, the balls are distributed between the two boxes. At each step, a
ball is selected at random and is moved from the box that it is in to the other box. If X_t
denotes the number of balls in the first box after t transitions, then {X_t}_{t≥0} is a Markov
chain with state space S = {0, . . . , d}. The transition function can be easily computed as
follows: if at time t there are x balls in the first box, then there is probability x/d that
a ball will be removed from this box and put in the other, and a probability of (d − x)/d
that a new ball will be added to this box from the other. Thus

P(x, y) =  x/d        if y = x − 1
           1 − x/d    if y = x + 1
           0          otherwise.
For this chain, we might ask if an “equilibrium” is reached.
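A quick simulation suggests what this equilibrium looks like. The sketch below (illustrative Python; the value d = 10 is an arbitrary choice, not from the notes) tracks the long-run fraction of time the chain spends in each state.

```python
import random
from collections import Counter

def ehrenfest(steps, d=10, x0=0, rng=None):
    """Simulate the Ehrenfest chain and count visits to each state.
    With x balls in box 1, a uniformly chosen ball changes boxes, so the
    chain steps to x - 1 with probability x/d and to x + 1 otherwise."""
    rng = rng or random
    x, visits = x0, Counter()
    for _ in range(steps):
        x = x - 1 if rng.random() < x / d else x + 1
        visits[x] += 1
    return visits

visits = ehrenfest(200_000, d=10, rng=random.Random(2))
# The long-run fractions of time spent in each state settle down to an
# equilibrium concentrated near d/2 (in fact Binomial(d, 1/2)).
print(max(visits, key=visits.get))   # the most-visited state, near d/2 = 5
```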
Gambler’s ruin: Suppose a gambler starts out with x dollars and makes a series of one
dollar bets against the house. Assume that the respective probabilities of winning and
losing the bet are p and q = 1 − p, and that if the capital ever reaches 0, the betting
ends and the gambler’s fortune remains 0 forever after. This Markov chain has state space
S = N_0 = {0, 1, 2, 3, . . . } and transition function

P(x, y) =  q    if y = x − 1 and x > 0
           p    if y = x + 1 and x > 0
           1    if x = y = 0
           0    otherwise.

In particular, P(0, 0) = 1 and P(0, y) = 0 for y ≠ 0. Note that a state a which satisfies
P(a, a) = 1 and P(a, y) = 0 for y ≠ a is called an absorbing state. We might wish to ask
what the chance is that the gambler is ruined (i.e. loses all his/her initial stake) and how
long it might take. Also, we might modify this chain to incorporate a strategy whereby the
gambler quits when his/her fortune reaches d. For this chain, the above transition matrix
still holds except that the definition given for P(x, y) now holds only for 1 ≤ x ≤ d − 1,
and d becomes an absorbing state. One interpretation of this modification is that two
gamblers are betting against each other and between them they have a total capital of d
dollars. Letting X_t represent the fortune of one of the gamblers yields the gambler’s ruin
chain on {0, 1, . . . , d}.

Birth and death chains: The Ehrenfest and Gambler’s ruin chains are special cases of
a birth and death chain. A birth and death chain has state space S = N_0 = {0, 1, 2, . . .}
and has transition function

P(x, y) =  q_x    if y = x − 1
           r_x    if y = x
           p_x    if y = x + 1
           0      otherwise,

where p_x is the chance of a “birth”, q_x the chance of a “death” and 0 ≤ p_x, q_x, r_x ≤ 1 such
that p_x + q_x + r_x = 1. Note that we allow the chance of births and deaths to depend on
x, the current population. We will study birth and death chains in more detail later.
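Since the gambler’s ruin chain above is a simple birth and death chain, it makes a convenient test case. The sketch below (illustrative Python, not part of the notes) estimates the ruin probability by simulation and compares it with the classical closed form for fair bets.

```python
import random

def ruin_probability(x, d, p, trials=100_000, rng=None):
    """Estimate by simulation the chance that the gambler's ruin chain
    on {0, ..., d}, started at x with win probability p, is absorbed at
    0 (ruin) rather than at d (the gambler quits ahead)."""
    rng = rng or random
    ruined = 0
    for _ in range(trials):
        fortune = x
        while 0 < fortune < d:
            fortune += 1 if rng.random() < p else -1
        ruined += fortune == 0
    return ruined / trials

# For fair bets (p = 1/2) the exact ruin probability from x is (d - x)/d.
est = ruin_probability(x=3, d=10, p=0.5, rng=random.Random(3))
print(est)   # close to (10 - 3)/10 = 0.7
```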
Queuing chain: Consider a service facility at which people arrive during each discrete
time interval according to a distribution with probability mass function p(u). If anyone is
in the queue at the start of a time period, then a single person is served and removed from
the queue. Thus, the transition function for this chain is P(0, y) = p(y) and P(x, y) =
p(y − x + 1) for x ≥ 1. In other words, if there is no one in the queue, then the chance of
having y people in the queue by the next time interval is just the chance of y people arriving,
namely p(y), while if x people are currently in the queue, one will definitely be served and
removed, and thus to get to y individuals in the queue we require the arrival of y − (x − 1)
additional individuals. Two obvious questions to ask about this chain are when the queue
will be emptied and how often.
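For a concrete arrival distribution these questions can be explored by simulation. The sketch below assumes a particular pmf p(u) (0, 1 or 2 arrivals per period, with probabilities 0.4, 0.4, 0.2), chosen only for illustration and not taken from the notes; since the mean arrival rate 0.8 is below the service rate of 1 per period, the queue empties out regularly.

```python
import random

def queue_chain(steps, arrivals, x0=0, rng=None):
    """Simulate the queuing chain: in each period one customer is served
    (if anyone is waiting) and a random number of customers arrive.
    `arrivals(rng)` draws one value from the arrival pmf p(u).
    Returns the number of periods ending with an empty queue."""
    rng = rng or random
    x, empty = x0, 0
    for _ in range(steps):
        x = max(x - 1, 0) + arrivals(rng)
        empty += x == 0
    return empty

# An assumed arrival pmf: 0, 1 or 2 arrivals with probabilities
# 0.4, 0.4, 0.2, so the mean number of arrivals per period is 0.8.
def arrivals(rng):
    u = rng.random()
    return 0 if u < 0.4 else (1 if u < 0.8 else 2)

empty = queue_chain(100_000, arrivals, rng=random.Random(4))
# Balancing arrivals against services (one service per non-empty period)
# shows the long-run fraction of empty periods is 1 - 0.8 = 0.2.
print(empty / 100_000)   # close to 0.2
```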
Branching chain: Consider objects or entities, such as bacteria, which generate a number
of offspring according to the probability mass function p(u). If at each time increment,
the existing objects produce a random number of offspring and then expire, then X_t, the
total number of objects at generation t, is a Markov chain with

P(x, y) = P(ξ_1 + . . . + ξ_x = y),

where the ξ_i’s are independent random variables, each having probability mass function
p(u). A natural question to ask for such a chain is if and when extinction will
occur.
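Extinction can likewise be estimated by simulation. The sketch below assumes an illustrative offspring pmf with mean greater than one (not from the notes); for comparison, the exact extinction probability is the smallest root of the offspring generating-function equation, which here works out to 1/2.

```python
import random

def dies_out(p, max_gen=200, cap=1_000, rng=None):
    """Simulate one branching chain from a single ancestor and report
    whether the line goes extinct.  p is the offspring pmf as a list
    (p[k] = chance of k offspring); a population exceeding `cap` is
    treated as having escaped extinction for good."""
    rng = rng or random
    x = 1
    for _ in range(max_gen):
        if x == 0:
            return True
        if x > cap:
            return False
        # Each of the x current objects reproduces independently.
        x = sum(rng.choices(range(len(p)), weights=p, k=x))
    return False

# An assumed offspring pmf with mean 0.25*0 + 0.25*1 + 0.5*2 = 1.25 > 1.
p = [0.25, 0.25, 0.5]
rng = random.Random(5)
trials = 2_000
est = sum(dies_out(p, rng=rng) for _ in range(trials)) / trials
# The extinction probability solves s = 0.25 + 0.25 s + 0.5 s^2,
# whose smallest root is s = 1/2.
print(est)   # close to 0.5
```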
6.6 Extending the Markov Property
Recall that we have said that the Markov property is equivalent to the identity