
    STOCHASTIC MODELLING - STAT3004/STAT7018, 2015, Semester 2

Contents

1 Basics of Set-Theoretical Probability Theory

2 Random Variables
  2.1 Definition and Distribution
  2.2 Common Distributions
  2.3 Moments and Quantiles
  2.4 Moment Generating Functions

3 Several Random Variables
  3.1 Joint distributions
  3.2 Covariance, Correlation, Independence
  3.3 Sums of Random Variables and Convolutions
  3.4 Change of Variables

4 Conditional Probability
  4.1 Conditional Probability of Events
  4.2 Discrete Random Variables
  4.3 Mixed Cases
  4.4 Random Sums
  4.5 Conditioning on Continuous Random Variables
  4.6 Joint Conditional Distributions

5 Elements of Matrix Algebra

6 Stochastic Processes and Markov Chains
  6.1 Introduction and Definitions
  6.2 Markov Property
  6.3 Stationarity
  6.4 Transition Matrices and Initial Distributions
  6.5 Examples of Markov Chains
  6.6 Extending the Markov Property
  6.7 Multi-Step Transition Functions
  6.8 Hitting Times and Strong Markov Property
  6.9 First Step Analysis
  6.10 Transience and Recurrence
  6.11 Decomposition of the State Space
  6.12 Computing hitting probabilities
  6.13 Martingales
  6.14 Special chains
  6.15 Summary

7 Stationary Distribution and Equilibrium
  7.1 Introduction and Definitions
  7.2 Basic Properties of Stationary and Steady State Distributions
  7.3 Periodicity and Smoothing
  7.4 Positive and Null Recurrence
  7.5 Existence and Uniqueness of Stationary Distributions
  7.6 Examples of Stationary Distributions
  7.7 Convergence to the Stationary Distribution
  7.8 Summary

8 Pure Jump Processes
  8.1 Definitions
  8.2 Characterizing a Markov Jump Process
  8.3 S = {0, 1}
  8.4 Poisson Processes
  8.5 Inhomogeneous Poisson Processes
  8.6 Special Distributions Associated with the Poisson Process
  8.7 Compound Poisson Processes
  8.8 Birth and Death Processes
  8.9 Infinite Server Queue
  8.10 Long-run Behaviour of Jump Processes

9 Gaussian Processes
  9.1 Univariate Gaussian Distribution
  9.2 Bivariate Gaussian Distribution
  9.3 Multivariate Gaussian Distribution
  9.4 Gaussian Processes and Brownian Motion
  9.5 Brownian Motion via Random Walks
  9.6 Brownian Bridge
  9.7 Geometric Brownian Motion
  9.8 Integrated Brownian Motion
  9.9 White Noise

Part I: Review of Probability & Conditional Probability

    1 Basics of Set-Theoretical Probability Theory

Sets and Events. We need to recall a little bit of set theory and its terminology insofar as it is relevant to probability. To start, we shall refer to the set of all possible outcomes that a random experiment may take on as the sample space and denote it by Ω. In probability theory Ω is conceived of as a set. Its elements are called samples.

An event A is then most simply thought of as a suitable subset of Ω, that is A ⊆ Ω, and we shall generally use the terms event and set interchangeably. (For the technically minded, not all subsets of Ω can be included as legitimate events for measure-theoretic reasons, but for our purposes, we will ignore this subtlety.)

Example 1.1. Consider the random experiment of flipping a coin twice. For this scenario, the sample space Ω is the set of all possible outcomes, namely Ω = {HH, HT, TH, TT} (discounting, of course, the possibility that the coin lands on its side and assuming that the coin has two distinct sides H and T). One obvious event might be that of getting an H on the first of the two tosses, in other words A = {HH, HT}.

Basic Set Operations. There are four basic set operations: union (∪), intersection (∩), complementation (c), and cardinality (#).

Let A, B ⊆ Ω. The union of two sets is the set which contains all the elements ω ∈ Ω lying in either of the original sets, and we write A ∪ B. A ∪ B is the event that either A or B or both happen. The intersection of two sets is the set which contains all the elements ω ∈ Ω which are common to the two original sets, and we write A ∩ B. A ∩ B is the event that both A and B happen simultaneously.

The complement of a set A is the set containing all of the elements ω ∈ Ω in the sample space which are not in the original set A, and we write A^c. So, clearly, Ω^c = ∅, ∅^c = Ω, and (A^c)^c = A. A^c is the event that A does not happen. (Notational note: occasionally, the complement of A is denoted by Ā, but this is rarely done in statistics due to the potential for confusion with sample means.)

Note that if two sets A and B have no elements in common then they are referred to as disjoint, and thus A ∩ B = ∅, where ∅ signifies the empty or null set (the impossible event). Also, if A ⊆ B then clearly A ∩ B = A, so that in particular A ∩ Ω = A for any event A.

Using unions and intersections, we can now define a very useful set-theoretic concept, the partition. A collection of sets A_1, . . . , A_k is a partition of Ω if their combined union is equal to the entire sample space and they are all mutually disjoint; that is, A_1 ∪ . . . ∪ A_k = Ω and A_i ∩ A_j = ∅ for any i ≠ j. In other words, a partition is a collection of events one and only one of which must occur. In addition, note that the collection of sets {A, A^c} forms a very simple but nonetheless extremely useful partition.

Finally, the cardinality of a set is simply the number of elements it contains. Thus, in Example 1.1 above, #Ω = 4 while #A = 2. A set is called countable if we can enumerate it, in a possibly nonunique way, by natural numbers; for instance, ∅ is countable. Also a set A with finitely many elements is countable, i.e. #A is finite. Examples of countable but infinite sets are the natural numbers N = {1, 2, 3, . . . } and the integers Z = {. . . , −2, −1, 0, 1, 2, . . . }. Also the rational numbers Q are countable. Intervals (a, b), (a, b] and the real line R = (−∞, ∞) are examples of uncountable sets.

Basic Set Theory Rules.

The distributive laws:

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)

De Morgan's rules:

(A ∪ B)^c = A^c ∩ B^c;   (A ∩ B)^c = A^c ∪ B^c

You should convince yourself of the validity of these rules through the use of Venn diagrams. Formal proofs are elementary.
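These rules are also easy to check mechanically on small finite sets. Below is a minimal Python sketch using the built-in set type; the particular sets Omega, A, B and C are arbitrary choices for illustration:

```python
# A minimal sketch verifying the distributive laws and De Morgan's rules
# on small finite sets; Omega, A, B, C are arbitrary illustrative choices.
Omega = set(range(10))                 # sample space {0, 1, ..., 9}
A, B, C = {1, 2, 3}, {3, 4, 5}, {5, 6, 7}

comp = lambda S: Omega - S             # complement relative to Omega

# Distributive laws
assert (A | B) & C == (A & C) | (B & C)
assert (A & B) | C == (A | C) & (B | C)

# De Morgan's rules
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)
print("All four identities hold for these sets.")
```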

Basic Probability Rules. We now use the above set theory nomenclature to discuss the basic tenets of probability. Informally, the probability of an event A is simply the chance that it will occur. If the elements of the sample space Ω are finite in number and may be considered “equally likely”, then we may calculate the probability of an event A as

P(A) = #A / #Ω.

More generally, of course, we will have to rely on our long-run frequency interpretation of the probability of an event; namely, the probability of an event is the proportion of times that it would occur among a (generally hypothetical) infinite number of equivalent repetitions of the random experiment.

Zero & Unity Rules. All probabilities must fall between 0 and 1, i.e. 0 ≤ P(A) ≤ 1. In particular, P(∅) = 0 and P(Ω) = 1.

Subset Rule. If A ⊆ B, then P(A) ≤ P(B).


    Inclusion-Exclusion Law.  The inclusion-exclusion rule states that the probability of the

    union of two events is equal to the sum of the probabilities of the two events minus the

    probability of the intersection of the two events, which has been in some sense “double

    counted” in the sum of the initial two probabilities, so that

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Notice that the final subtracted term disappears if the two events A and B are disjoint. More generally:

Additivity. Assume that A_1, . . . , A_n ⊆ Ω with A_i ∩ A_j = ∅ for i ≠ j. Then

P(A_1 ∪ · · · ∪ A_n) = P(A_1) + · · · + P(A_n).

Countable Additivity. Assume that A_1, A_2, A_3, . . . ⊆ Ω is a sequence of events with A_i ∩ A_j = ∅ for i ≠ j. Then

P(A_1 ∪ A_2 ∪ A_3 ∪ . . . ) = P(A_1) + P(A_2) + P(A_3) + . . .

Complement Rule. The probability of the complement of an event is equal to one minus the probability of the event itself, so that P(A^c) = 1 − P(A). This rule is easily derived from the Inclusion-Exclusion rule.

Product Rule. Two events A and B are said to be independent if and only if they satisfy the equation P(A ∩ B) = P(A)P(B).

The Law of Total Probability. The law of total probability is a way of calculating a probability by breaking it up into several (hopefully easier to deal with) pieces. If the sets A_1, . . . , A_k form a partition, then the probability of an event B may be calculated as

P(B) = Σ_{i=1}^{k} P(B ∩ A_i).

Again, heuristic verification is straightforward from a Venn diagram.
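These rules can be verified by direct enumeration on the sample space of Example 1.1. In the sketch below, A is the event from Example 1.1 (an H on the first toss), while B (an H on the second toss) is an extra event introduced here for illustration:

```python
from itertools import product
from fractions import Fraction

# Equally likely outcomes for two coin flips (Example 1.1)
Omega = set(product("HT", repeat=2))           # {('H','H'), ('H','T'), ...}
P = lambda E: Fraction(len(E), len(Omega))     # P(A) = #A / #Omega

A = {w for w in Omega if w[0] == "H"}          # H on the first toss
B = {w for w in Omega if w[1] == "H"}          # H on the second toss

# Inclusion-exclusion, complement rule, and total probability over {B, B^c}
assert P(A | B) == P(A) + P(B) - P(A & B)
assert P(Omega - A) == 1 - P(A)
assert P(A) == P(A & B) + P(A & (Omega - B))
print(P(A), P(A | B))                          # 1/2 3/4
```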


    2 Random Variables

    2.1 Definition and Distribution

Definition. A random variable X is a numerically valued function X : Ω → R (R denoting the real line) whose domain is a sample space Ω. If the range of X is a countable subset of the real line then we call X a discrete random variable. (For the technically minded, not all numerical functions X : Ω → R are random variables for measure-theoretic reasons, but for our purposes, we will ignore this subtlety.)

Below we introduce the notion of a continuous random variable. A continuous random variable takes values in an uncountable set such as an interval or the real line. Note that a random variable cannot be continuous if the sample space on which it is defined is countable; however, a random variable defined on an uncountable sample space may still be discrete. In the coin tossing scenario of Example 1.1 above, the quantity X which records the number of heads in the outcome is a discrete random variable.

Distribution of a Random Variable. Since random variables are functions on a sample space, we can determine probabilities regarding random variables by determining the probability of the associated subset of Ω. The probability of a random variable X being in some subset I ⊆ R of the real line is equivalent to the probability of the event A = {ω ∈ Ω : X(ω) ∈ I}:

P(X ∈ I) = P({ω ∈ Ω : X(ω) ∈ I}).

Note that we have used the notion of a random variable as a function on the sample space when we use the notation X(ω). The collection of all probabilities P(X ∈ I) is called the distribution of X.

Probability Mass Function (PMF). If X is discrete, then it is clearly desirable to find p_X(x) = P(X = x), the probability mass function (or pmf) of X, because it is possible to characterise the distribution of X in terms of its pmf p_X via

P(X ∈ I) = Σ_{i∈I} p_X(i).

If X is discrete, we have Σ_{x∈Range(X)} p_X(x) = 1.

Cumulative Distribution Function (CDF). For any random variable X : Ω → R the function

F_X(x) = P(X ≤ x),   x ∈ R,

is called the cumulative distribution function (CDF) of X. The CDF of X determines the distribution of X (the collection of all probabilities P(X ∈ I) can be computed from the CDF of X).

If X is a discrete random variable then its cumulative distribution function is a step function:

F_X(x) = P(X ≤ x) = Σ_{y∈Range(X): y≤x} p_X(y).

(Absolutely) Continuous Random Variable. Assume that X is a random variable such that

P(X ∈ I) = ∫_I f_X(x) dx,

where f_X(x) is some nonnegative function with ∫_{−∞}^{∞} f_X(x) dx = 1. Then X is called a continuous random variable admitting a density f_X. In this case, the CDF is still a valid entity, being continuous and given by

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(u) du.

Observe that the concept of a pmf is completely useless when dealing with continuous r.v.'s, as we have P(X = x) = 0 for all x. The Fundamental Theorem of Calculus thus shows that f(x) = (d/dx)F(x) = F′(x), which in turn leads to the informal identity

P(x < X ≤ x + dx) = F(x + dx) − F(x) = dF(x) = f(x) dx,

which is where the density function f gets its name, since in some sense it describes how the probability is spread over the real line.

(Notational note: We will attempt to stick to the convention that capital letters denote random variables while the corresponding lower case letters indicate possible values or realisations of the random variable.)

    2.2 Common Distributions

The real importance of CDFs, pmfs and densities is that they completely characterize the random variable from which they were derived. In other words, if we know the CDF (or equivalently the pmf or density) then we know everything there is to know about the random variable. For most random variables that we might think of, of course, writing down a pmf, say, would entail the long and tedious process of listing all the possible values and their associated probabilities. However, there are some types of important random variables which arise over and over and for which simple formulae for their CDFs, pmfs or densities have been found. Some common pmfs and densities are listed below:


Discrete Distributions (C(n, x) denotes the binomial coefficient, defined below)

Poisson(λ):  p(x) = e^{−λ} λ^x / x!,   λ > 0,   x ∈ N_0 = {0, 1, 2, . . . }

Binomial(n, p):  p(x) = C(n, x) p^x (1 − p)^{n−x},   n ∈ N = {1, 2, 3, . . . }, 0 < p < 1,   x ∈ {0, 1, . . . , n}

Negative Binomial(r, p):  p(x) = C(x + r − 1, r − 1) (1 − p)^r p^x,   r ∈ N, 0 < p < 1,   x ∈ N_0 = {0, 1, 2, . . . }

Geometric(p):  p(x) = p(1 − p)^{x−1},   0 < p < 1,   x ∈ N = {1, 2, 3, . . . }

Hypergeometric(n, N, M):  p(x) = C(M, x) C(N − M, n − x) / C(N, n),   M ∈ {0, . . . , N}, n ∈ {1, . . . , N},   x ∈ {max(0, n + M − N), . . . , min(M, n)}

Continuous Distributions

Normal(µ, σ²):  f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),   µ ∈ R, σ² > 0,   x ∈ R = (−∞, ∞)

Exponential(λ):  f(x) = λ e^{−λx},   λ > 0,   x ∈ (0, ∞)

Uniform(a, b):  f(x) = 1/(b − a),   −∞ < a < b < ∞,   x ∈ (a, b)

Weibull(α, λ):  f(x) = αλ x^{α−1} e^{−λx^α},   α > 0, λ > 0,   x ∈ (0, ∞)

Gamma(α, λ):  f(x) = (λ/Γ(α)) (λx)^{α−1} e^{−λx},   α > 0, λ > 0,   x ∈ (0, ∞)

Chi-Squared(k):  f(x) = (1/(2^{k/2} Γ(k/2))) x^{(k−2)/2} e^{−x/2},   k ∈ N = {1, 2, 3, . . . },   x ∈ (0, ∞)

Beta(α, β):  f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},   α, β > 0,   x ∈ (0, 1)

Student's t_k:  f(x) = (Γ((k + 1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},   k ∈ N,   x ∈ (−∞, ∞)

Fisher-Snedecor F_{m,n}:  f(x) = (Γ((m + n)/2)/(Γ(m/2)Γ(n/2))) (m/n)^{m/2} x^{(m−2)/2} (1 + mx/n)^{−(m+n)/2},   m, n ∈ N,   x ∈ (0, ∞)


The factorials n! and the binomial coefficients C(n, x) are defined as follows: 0! := 1 and, for n ∈ N and x ∈ {0, . . . , n},

n! := n × (n − 1) × · · · × 1,   C(n, x) := n! / (x!(n − x)!).

The gamma function, Γ(α), is defined by the integral

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,

from which it follows that if α is a positive integer, then Γ(α) = (α − 1)!. Also, note that for α = 1, the Gamma(1, λ) distribution is equivalent to the Exponential(λ) distribution, while for λ = 1/2, the Gamma(α, 1/2) distribution is equivalent to the Chi-squared distribution with 2α degrees of freedom. Similarly, the Geometric(p) distribution is closely related to the Negative Binomial distribution when r = 1.

Above we listed formulas only for those x where p(x) > 0 or f(x) > 0. For the remaining x we have p(x) = 0 or f(x) = 0. We write X ∼ Q to indicate that X has the distribution Q: for instance, X ∼ Normal(0, 1) refers to a continuous random variable X which has the density f_X(x) = (1/√(2π)) e^{−x²/2}. Similarly, Y ∼ Poisson(5) refers to a discrete random variable Y with range N_0 having pmf p_Y(x) = e^{−5} 5^x / x! for x ∈ N_0.

Exercise. (a) Let X ∼ Exponential(λ). Check that the CDF of X satisfies F_X(x) = 1 − e^{−λx} for x ≥ 0. Graph this function for x ∈ [−1, 4] for the parameter λ = 1.
(b) Let X ∼ Geometric(p). Check that the CDF of X satisfies F_X(x) = 1 − (1 − p)^x, x ∈ {1, 2, 3, . . . }. Graph this function for x ∈ [−1, 4] (hint: step function).


    2.3 Moments and Quantiles

Moments. The mth moment of a random variable X is the expected value of the random variable X^m and is defined as

E[X^m] = Σ_{x∈Range(X)} x^m p_X(x)

if X is discrete, and as

E[X^m] = ∫_{−∞}^{∞} x^m f_X(x) dx

if X is continuous (provided, of course, that the quantities on the right hand sides exist). In particular, when m = 1, the first moment of X is generally referred to as its mean and is often denoted as µ_X, or just µ when there is no chance of confusion. The expected value of a random variable is one measure of the centre of its distribution.

General Formulae. A good, though somewhat informal, way of thinking of the expected value is that it is the value we would tend to get if we were to average the outcomes of a very large number of equivalent realisations of the random variable. From this idea, it is easy to generalize the moment definition to encompass the expectation of any function g of a random variable as either

E[g(X)] = Σ_{x∈Range(X)} g(x) p(x)

or

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,

depending on whether X is discrete or continuous.

Central Moments and Variance. The idea of moments is often extended by defining the central moments, which are the moments of the centred random variable X − µ_X. The first central moment is, of course, equal to zero. The second central moment is generally referred to as the variance of X, and denoted Var(X) or sometimes σ_X². The variance is a measure of the amount of dispersion in the distribution of X; that is, random variables with high variances are likely to produce realisations which are far from the mean, while low variance random variables have realisations which will tend to cluster closely about the mean. A simple calculation shows the relationship between the moments and the central moments; for example, we have

Var(X) = E[(X − µ_X)²] = E[X²] − µ_X².


One drawback to the variance is that, by its definition, its units are not comparable to those of X. To avert this problem, we often use the square root of the variance, σ_X = √Var(X), which is called the standard deviation of the random variable X.

Quantiles and Median. Another way to characterize the location (i.e. centre) and spread of the distribution of a random variable is through its quantiles. The (1 − α)-quantile of the distribution of X is any value ν_α which satisfies

P(X ≤ ν_α) ≥ 1 − α   and   P(X ≥ ν_α) ≥ α.

Note that this definition does not necessarily uniquely define the quantile; in other words, there may be several distinct (1 − α)-quantiles of a distribution. However, for most continuous distributions that we shall meet the quantiles will be unique. In particular, the α = 1/2 quantile is called the median of the distribution and is another measure of the centre of the distribution, since there is a 50% chance that a realisation of X will fall below it and also a 50% chance that the realisation will be above the median value. The α = 3/4 and α = 1/4 quantiles are generally referred to as the first and third quartiles, respectively, and their difference, called the interquartile range (or IQR), is another measure of spread in the distribution.

Expectation via Tails. Calculating the mean of a random variable from the definition can often involve painful integration and algebra. Sometimes, there are simpler ways. For example, if X is a non-negative integer-valued random variable (i.e. its range contains only non-negative integers), then we can calculate the mean of X as

µ = Σ_{x=0}^{∞} P(X > x).

The validity of this can be easily seen by a term rearrangement argument:

Σ_{x=0}^{∞} P(X > x) = Σ_{x=0}^{∞} Σ_{y=x+1}^{∞} p(y) = Σ_{y=1}^{∞} Σ_{x=0}^{y−1} p(y) = Σ_{y=1}^{∞} y p(y) = µ.

More generally, if X is an arbitrary, but non-negative random variable with cumulative distribution function F, then

µ = ∫_0^∞ {1 − F(x)} dx.

Example 2.1. Let a > 0 and U be uniformly distributed on (0, a). Using at least two methods find E[U].

Solution: U is a continuous random variable with density f_U(u) = 1/a for 0 < u < a (otherwise, f_U(u) = 0 if u ∉ (0, a)).

Method I:

E[U] = (1/a) ∫_0^a u du = (1/a) [u²/2]_0^a = a/2.

Method II: Note that U is a nonnegative random variable taking values only in (0, a). Also, F_U(u) = (1/a) ∫_0^u dv = u/a if 0 < u < a. Otherwise, we have either F_U(u) = 0 for u ≤ 0, or F_U(u) = 1 for u ≥ a. Consequently, the tail integral becomes

E[U] = ∫_0^∞ {1 − F_U(u)} du = ∫_0^a {1 − (u/a)} du = a − (1/a)[u²/2]_0^a = a/2.

    2.4 Moment Generating Functions

A more general method of calculating moments is through the use of the moment generating function (or mgf), which is defined as

m(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} dF(x),

provided the expectation exists for all values of t in a neighborhood of the origin. To obtain the moments of X we note that (provided sufficient regularity conditions which justify the interchange of the operations of differentiation and integration are satisfied)

(d^m/dt^m) E[e^{tX}] |_{t=0} = E[X^m e^{tX}] |_{t=0} = E[X^m].

Example 2.2. Suppose that X has a Poisson(λ) distribution. The moment generating function of X is given by:

m(t) = Σ_{x=0}^{∞} e^{tx} p(x) = Σ_{x=0}^{∞} e^{tx} λ^x e^{−λ} / x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)},

where we have used the series expansion Σ_{n=0}^{∞} x^n/n! = e^x. Taking derivatives of m(t) shows that

m′(t) = e^{λ(e^t − 1)} (λe^t)   ⟹   m′(0) = E[X] = λ,
m″(t) = e^{λ(e^t − 1)} (λe^t)² + e^{λ(e^t − 1)} (λe^t)   ⟹   m″(0) = E[X²] = λ² + λ.

Finally, this shows that Var(X) = E[X²] − {E[X]}² = (λ² + λ) − λ² = λ.
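The differentiation in this example can be delegated to a computer algebra system; a brief sympy sketch recovering E[X] = λ and Var(X) = λ from the mgf:

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
m = sp.exp(lam * (sp.exp(t) - 1))   # mgf of the Poisson(lambda) distribution

EX = sp.diff(m, t).subs(t, 0)       # m'(0)  = E[X]
EX2 = sp.diff(m, t, 2).subs(t, 0)   # m''(0) = E[X^2]

print(sp.simplify(EX))              # lambda
print(sp.expand(EX2))               # lambda**2 + lambda
print(sp.simplify(EX2 - EX**2))     # variance: lambda
```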


Example 2.3. Suppose that X has a Gamma(α, λ) distribution. The moment generating function of X is given by:

m(t) = ∫_0^∞ e^{tx} (λ/Γ(α)) (λx)^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ e^{−(λ−t)x} x^{α−1} dx
     = (λ^α/(λ − t)^α) ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = (λ/(λ − t))^α,   t < λ,

where we have used the fact that ∫_0^∞ ((λ − t)/Γ(α)) e^{−(λ−t)x} {(λ − t)x}^{α−1} dx = 1, since it is the integral of the density of a Gamma(α, λ − t) distribution over the full range of its sample space, provided that t < λ, since the parameters of a Gamma distribution must be positive. So, differentiating this function shows that:

m′(t) = αλ^α/(λ − t)^{α+1}   ⟹   m′(0) = E[X] = α/λ,
m″(t) = α(α + 1)λ^α/(λ − t)^{α+2}   ⟹   m″(0) = E[X²] = (α² + α)/λ².

Thus, we can see that Var(X) = E[X²] − {E[X]}² = α/λ².

If X, Y are random variables having a finite mgf in an open interval containing zero and their corresponding mgfs are equal, then X and Y have the same distribution. Generally, such demonstrations rely on algebraic calculation followed by recognition of resultant functions as the moment generating function of some specific distribution. As a useful reference, then, we now give the moment generating functions of some of the distributions noted in the previous section:

Discrete Distributions

Poisson(λ):  m(t) = exp(λ(e^t − 1)),   t ∈ R
Binomial(n, p):  m(t) = (1 − p + pe^t)^n,   t ∈ R
Geometric(p):  m(t) = pe^t/(1 − (1 − p)e^t),   (1 − p)e^t < 1

Generating Functions. More generally, the concept of a generating function of a sequence is quite useful in many areas of probability theory. The generating function of a sequence of numbers {a_0, a_1, a_2, . . . } is defined as

A(s) = a_0 + a_1 s + a_2 s² + . . . = Σ_{n=0}^{∞} a_n s^n,

provided the series converges for values of s in a neighborhood of the origin.

Note that from this definition, the moment generating function is just the generating function of the sequence a_n = E[X^n]/n!. As with the mgf, the elements of the sequence can be recovered by successive differentiation of the generating function and subsequent evaluation at s = 0 (as well as a rescaling by n! for the appropriate value of n).

In particular, if X is a discrete random variable taking non-negative integer values, then setting a_n = P(X = n) yields the probability generating function,

P(s) = E[s^X] = E[e^{X log s}].

Note that m(t) = P(e^t), so that there is a clear link between the moment generating function and the probability generating function. In particular, moment-like quantities can be found via derivatives of P(s) evaluated at s = 1 = e^0. For example,

m″(t) |_{t=0} = P″(e^t)e^{2t} + P′(e^t)e^t |_{t=0} = P″(1) + P′(1).

Also, if we let q_n = P(X > n), then Q(s) = Σ_n q_n s^n is a tail probability generating function and

Q(s) = {1 − P(s)}/(1 − s).

This can be seen by noting that the coefficient of s^n in the function (1 − s)Q(s) is

q_n − q_{n−1} = P(X > n) − P(X > n − 1) = P(X > n) − {P(X = n) + P(X > n)} = −P(X = n)

if n ≥ 1, and q_0 = P(X > 0) = 1 − P(X = 0) if n = 0, so that

(1 − s)Q(s) = 1 − P(X = 0) − Σ_{n=1}^{∞} P(X = n)s^n = 1 − P(s).

We saw earlier that E[X] = Σ_n q_n, so E[X] = Q(1) = lim_{s→1} {1 − P(s)}/(1 − s), and thus the graph of {1 − P(s)}/(1 − s) approaches a finite limit rather than having a vertical asymptote at s = 1, as long as the expectation of X exists and is finite.

Additional Remarks. For the mathematically minded, we note that occasionally the mgf (which, for positive random variables, is also sometimes called the Laplace transform of the density or pmf) will not exist (for example, the t- and F-distributions have non-existent moment generating functions, since the necessary integrals are infinite), and this is why it is often more convenient to work with the characteristic function (also known to some as the Fourier transform of the density or pmf), ψ(t) = E[e^{itX}], which always exists, but this requires some knowledge of complex analysis. One of the most useful features of the characteristic function (and of the moment generating function, in the cases where it exists) is that it uniquely specifies the distribution from which it arose (i.e. no two distinct distributions have the same characteristic function), and many difficult properties of distributions can be derived easily from the corresponding properties of characteristic functions. For example, the Central Limit Theorem is easily proved using moment generating functions, as are some important relationships regarding the various distributions listed above.


    3 Several Random Variables

    3.1 Joint distributions

The joint distribution of two random variables X and Y describes how the outcomes of the two random variables are probabilistically related. Specifically, the joint distribution function is defined as

F_{XY}(x, y) = F(x, y) = P(X ≤ x and Y ≤ y).

Usually, the subscripts are omitted when no ambiguity is possible.

If X and Y are both discrete, then they will have a joint probability mass function defined by p(x, y) = P(X = x and Y = y). Alternatively, there may exist a joint density, defined as that function f_{XY} which satisfies

F_{XY}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{XY}(ξ, η) dη dξ.

In this case, we call X and Y (jointly) continuous.

The case where one of X and Y is discrete and one continuous is of interest, but is slightly more complicated and we will deal with it when it comes up.

The function F_X(x) = lim_{y→∞} F(x, y) is called the marginal distribution function of X, and similarly the marginal distribution function of Y is F_Y(y) = lim_{x→∞} F(x, y). If X and Y are discrete, then the marginal probability mass functions are simply

p_X(x) = Σ_{y∈Range(Y)} p(x, y)   and   p_Y(y) = Σ_{x∈Range(X)} p(x, y).

If X and Y are continuous, then the marginal densities of X and Y are given by

f_X(x) = ∫_{−∞}^{∞} f_{XY}(x, y) dy   and   f_Y(y) = ∫_{−∞}^{∞} f_{XY}(x, y) dx,

respectively. Note that the marginal density at a particular value is derived by simply integrating the area under the joint density along the appropriate horizontal or vertical line.

The expectation of a function h of the two random variables X and Y is calculated in a fashion similar to the expectations of functions of single random variables, namely,

E[h(X, Y)] = Σ_{x∈Range(X)} Σ_{y∈Range(Y)} h(x, y) p(x, y)

if X and Y are discrete, or

E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy

if X and Y are continuous.

Note that the above definitions show that regardless of the type of random variables, E[aX + bY] = aE[X] + bE[Y] for any constants a and b. Also, analogous definitions and results hold for any finite group of random variables. For example the joint distribution of X_1, . . . , X_k is

F(x_1, . . . , x_k) = P(X_1 ≤ x_1 and . . . and X_k ≤ x_k).

3.2 Covariance, Correlation, Independence

Independence. If it happens that F(x, y) = F_X(x)F_Y(y) for all x and y, then the random variables X and Y are said to be independent. If both the random variables are continuous, then the above condition is equivalent to f(x, y) = f_X(x)f_Y(y), while if both are discrete it is the same as p(x, y) = p_X(x)p_Y(y). Note the similarity of these definitions to that for the independence of events.

Given two jointly distributed random variables X and Y, we can calculate their means, µ_X and µ_Y, and their standard deviations, σ_X and σ_Y, using their marginal distributions. Provided these means and standard deviations exist, we can use the joint distribution to calculate the covariance between X and Y, which is defined as Cov(X, Y) = σ_{XY} = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y.

Two random variables are said to be uncorrelated if their covariance is zero. Note that if X and Y are independent then they are certainly uncorrelated, since the factorization of the pmf or density implies that E[XY] = E[X]E[Y] = µ_X µ_Y. However, two uncorrelated random variables need not be independent. Note also that it is an easy calculation to show that if X, Y, V and W are jointly distributed random variables and a, b, c and d are constants, then

Cov(aX + bY, cV + dW) = ac σ_{XV} + ad σ_{XW} + bc σ_{YV} + bd σ_{YW};

in other words, the covariance operator is bilinear.

Finally, if we scale σ_{XY} by the product of the two standard deviations, we get the correlation coefficient, ρ = σ_{XY}/(σ_X σ_Y), which satisfies −1 ≤ ρ ≤ 1.


    3.3 Sums of Random Variables and Convolutions

We saw that the expectation of Z = X + Y was simply the sum of the individual expectations of X and Y for any two random variables. Unfortunately, this is about the extent of what we can say in general. If, however, X and Y are independent, the distribution of Z can be determined by means of a convolution:

F_Z(z) = ∫_{−∞}^{∞} F_X(z − ξ) dF_Y(ξ) = ∫_{−∞}^{∞} F_Y(z − ξ) dF_X(ξ).

In the case where both X and Y are discrete, we can write the convolution formula using pmfs:

p_Z(z) = Σ_{x∈Range(X)} p_X(x) p_Y(z − x) = Σ_{y∈Range(Y)} p_X(z − y) p_Y(y).

If X and Y are both continuous, we can rewrite the convolution formula using densities:

f_Z(z) = ∫_{−∞}^{∞} f_X(z − ξ) f_Y(ξ) dξ = ∫_{−∞}^{∞} f_Y(z − ξ) f_X(ξ) dξ.

Note that, in the same way that marginal densities are found by integrating along horizontal or vertical lines, the density of Z at the value z is found by integrating along the line x + y = z, and of course using the independence to state that f_{XY}(ξ, z − ξ) = f_X(ξ) f_Y(z − ξ).

Since convolutions are a bit cumbersome, we now note an advantage of mgfs. If X and Y are independent, then

m_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = m_X(t) m_Y(t).

So, the mgf of a sum of independent random variables is the product of the mgfs of the summands. This fact makes many calculations regarding sums of independent random variables much easier to demonstrate:

Suppose that X ∼ Gamma(α_X, λ) and Y ∼ Gamma(α_Y, λ) are two independent random variables, and we wish to determine the distribution of Z = X + Y. We could use the convolution formula, but this would require some extremely difficult (though not impossible, of course) integration. However, recalling the moment generating function of the Gamma distribution we see that, for any t < λ:

m_{X+Y}(t) = m_X(t) m_Y(t) = (λ/(λ − t))^{α_X} (λ/(λ − t))^{α_Y} = (λ/(λ − t))^{α_X + α_Y},

which easily shows that X + Y ∼ Gamma(α_X + α_Y, λ).
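A quick Monte Carlo sanity check of this closure property; the shapes and rate below are arbitrary, and note that numpy/scipy parametrize the Gamma by shape and scale = 1/λ rather than by rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
aX, aY, lam, n = 2.0, 3.5, 1.5, 100_000  # arbitrary shapes and common rate

Z = rng.gamma(aX, 1 / lam, n) + rng.gamma(aY, 1 / lam, n)

# Kolmogorov-Smirnov test of the sample against Gamma(aX + aY, lam)
ks = stats.kstest(Z, stats.gamma(aX + aY, scale=1 / lam).cdf)
print(ks.pvalue)  # a large p-value is consistent with the claim
```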


    3.4 Change of Variables

We saw previously that we could find the expectation of g(X) using the distribution of X. Suppose, however, that we want to know more about the new random variable Y = g(X). If g is a strictly monotone function, we can find the distribution of Y by noting that

F_Y(y) = P(Y ≤ y) = P({g(X) ≤ y}) = P({X ≤ g^{−1}(y)}) = F_X(g^{−1}(y))

if g is increasing, and

F_Y(y) = P({Y ≤ y}) = P({g(X) ≤ y}) = P({X ≥ g^{−1}(y)}) = 1 − F_X(g^{−1}(y)) + P({X = g^{−1}(y)})

if g is decreasing (if g is not strictly monotone, we need to be a bit more clever, but we won't deal with that case here). Now, if X is continuous and g is a smooth function (i.e. has a continuous derivative) then the differentiation chain rule yields

f_Y(y) = (1/|g′(g^{−1}(y))|) f_X(g^{−1}(y)) = (1/|g′(x)|) f_X(x),

where y = g(x) (note that when X is continuous, the CDF of Y in the case when g is decreasing simplifies since P({X = g^{−1}(y)}) = 0).

A similar formula holds for joint distributions, except that the derivative factor becomes the reciprocal of the modulus of the determinant of the Jacobian matrix of the transformation function g. In other words, if X_1 and X_2 have joint density f_{X_1 X_2} and g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (y_1, y_2) is an invertible transformation, then the joint density of Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) is

f_{Y_1 Y_2}(y_1, y_2) = |J(x_1, x_2)|^{−1} f_{X_1 X_2}(x_1, x_2),

where y_1 = g_1(x_1, x_2), y_2 = g_2(x_1, x_2), and |J(x_1, x_2)| is the determinant of the Jacobian matrix J(x_1, x_2), which has (i, j)th element J_{ij}(x_1, x_2) = ∂g_i(x_1, x_2)/∂x_j.
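As a concrete check of the univariate formula, take X ∼ Exponential(1) and the strictly increasing map g(x) = x² on (0, ∞), so that F_Y(y) = F_X(√y) = 1 − e^{−√y}; a short simulation comparing this with the empirical CDF:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(1.0, 200_000)  # X ~ Exponential(1), F_X(x) = 1 - exp(-x)
y = x ** 2                         # Y = g(X) with g(x) = x^2, increasing on (0, inf)

# F_Y(y) = F_X(g^{-1}(y)) = 1 - exp(-sqrt(y)); compare with the empirical CDF
for q in [0.25, 1.0, 4.0]:
    print(np.mean(y <= q), 1 - np.exp(-np.sqrt(q)))
```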


    4 Conditional Probability

    4.1 Conditional Probability of Events

So far, we have discussed the probabilities of events in a rather static situation. However, typically, we wish to know how the outcomes of certain events will subsequently affect the chances of later events. To describe such situations, we need to use conditional probability for events.

Suppose that we wish to know the chance that an event A will occur. Then we have seen that we want to calculate P(A). However, if we are in possession of the knowledge that the event B has already occurred, then we would likely change our belief about the chance of A occurring: for example, if A is the event "it will rain today" and B is the event "the sky is overcast". We use the notation P(A|B) to signify the probability of A given that B has occurred, and we define it as

P(A|B) = P(A ∩ B)/P(B),

provided P(B) ≠ 0.

If we think of probabilities as areas in a Venn diagram, then a conditional probability amounts to restricting the sample space down from Ω to B and then finding the relative area of that part of A which is also in B to the total area of the restricted sample space, namely B itself.

Multiplication Rule. In many of our subsequent applications, conditional probabilities will be dictated as primary data by the circumstances of the process under study. In this case, the above definition will find its most useful function in the form

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).

Independence. Also, we can rephrase independence of events in terms of conditional probabilities; namely, two events A and B are independent if and only if

P(A|B) = P(A)   and   P(B|A) = P(B).

(Note that only one of the above two conditions need be verified, since if one is true the other follows from the definition of conditional probability.) In other words, two events are independent if the chance of one occurring is unaffected by whether or not the other has occurred.

Total Probability Law. Recalling the law of total probability, we can use this new identity to show that if the sets B_1, . . . , B_k form a partition then

P(A) = Σ_{i=1}^{k} P(A|B_i)P(B_i).

Bayes' Rule. Finally, a very useful formula exists which relates the conditional probability of A given B to the conditional probability of B given A, and goes by the name of Bayes' Rule. Bayes' rule states that

P(B|A) = P(A ∩ B)/P(A) = P(A|B)P(B) / {P(A|B)P(B) + P(A|B^c)P(B^c)},

which follows from the definition of conditional probability and the law of total probability, since B and B^c form a partition. In fact, we can generalize Bayes' rule by letting B_1, . . . , B_k be a more general partition, so that

P(B_i|A) = P(A|B_i)P(B_i) / Σ_{j=1}^{k} P(A|B_j)P(B_j).

Example 4.1. Suppose there are three urns labelled I, II and III, the first containing 4 red and 8 blue balls, the second containing 3 red and 9 blue, and the third 6 red and 6 blue. (a) If an urn is picked at random and subsequently a ball is picked at random from the chosen urn, what is the chance that the chosen ball will be red? (b) If a red ball is drawn, what is the chance that it came from the first urn?

Solution: Let R be the event that the chosen ball is red. Then, from the description of the situation it is clear that:

P(I) = P(II) = P(III) = 1/3,   P(R|I) = 4/12 = 1/3,   P(R|II) = 3/12 = 1/4,   P(R|III) = 6/12 = 1/2.

(a) Since the events I, II and III clearly form a partition (i.e. one and only one of them must occur), we can use the law of total probability to find

P(R) = P(R|I)P(I) + P(R|II)P(II) + P(R|III)P(III) = 13/36.

(b) Using Bayes' rule,

P(I|R) = P(R|I)P(I) / {P(R|I)P(I) + P(R|II)P(II) + P(R|III)P(III)} = (1/3)(1/3) / (13/36) = 4/13.
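The same computation can be scripted with exact rational arithmetic; a small sketch of both parts:

```python
from fractions import Fraction

# Urn compositions (red, blue) for urns I, II and III
urns = {"I": (4, 8), "II": (3, 9), "III": (6, 6)}
p_urn = Fraction(1, 3)                     # each urn is equally likely

# (a) Law of total probability
p_red = sum(p_urn * Fraction(r, r + b) for r, b in urns.values())
print(p_red)                               # 13/36

# (b) Bayes' rule for P(I | R)
r, b = urns["I"]
print(p_urn * Fraction(r, r + b) / p_red)  # 4/13
```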


    4.2 Discrete Random Variables

Conditional pmf. The conditional probability mass function derives from the definition of conditional probability for events in a straightforward manner:

p_{X|Y}(x|y) = P(X = x|Y = y) = P(X = x and Y = y)/P(Y = y) = p_{XY}(x, y)/p_Y(y),

as long as p_Y(y) > 0. Note that for each y, p_{X|Y} is a pmf, i.e. Σ_x p_{X|Y}(x|y) = 1, but the same is not true for each fixed x. Also, the law of total probability becomes

p_X(x) = Σ_{y∈Range(Y)} p_{X|Y}(x|y) p_Y(y).

Example 4.2. Suppose that N has a geometric distribution with parameter 1 − β, and that conditional on N, X has a negative binomial distribution with parameters p and N. In other words,

p_N(n) = (1 − β)β^{n−1}   for n = 1, 2, . . .

and

p_{X|N}(x|n) = C(x + n − 1, n − 1) p^x (1 − p)^n   for x = 0, 1, . . . .

Find the marginal distribution of X.

Solution: Using the law of total probability, for x = 0, 1, 2, 3, . . . ,

p_X(x) = Σ_{n=1}^{∞} p_{X|N}(x|n) p_N(n)
       = Σ_{n=1}^{∞} C(x + n − 1, n − 1) (1 − p)^n p^x (1 − β)β^{n−1}
       = Σ_{n=1}^{∞} C(x + n − 1, x) (1 − p)^n p^x (1 − β)β^{n−1}
       = (1 − β)β^{−1} p^x Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^{n+1}
       = ((1 − β)(1 − p)p^x / (1 − (1 − p)β)^{x+1}) Σ_{n=0}^{∞} C(x + n, x) [β(1 − p)]^n (1 − (1 − p)β)^{x+1}
       = ((1 − β)(1 − p) / (1 − (1 − p)β)) (p / (1 − (1 − p)β))^x,

where the sum in the second-to-last line equals 1, being the sum over n of a Negative Binomial(x + 1, β(1 − p)) pmf over its full range. Consequently, X + 1 ∈ {1, 2, 3, . . . } is geometric with parameter (1 − β)(1 − p)/(1 − (1 − p)β).
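A Monte Carlo check of this result; β and p below are arbitrary. Note that numpy's geometric sampler is supported on {1, 2, . . . }, matching p_N, and that numpy's negative_binomial(n, q) counts failures before the nth success with success probability q, which matches the pmf above when q = 1 − p:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, p, n_sim = 0.6, 0.3, 200_000

N = rng.geometric(1 - beta, n_sim)   # p_N(n) = (1 - beta) * beta**(n - 1)
X = rng.negative_binomial(N, 1 - p)  # X | N = n as in the example

q = (1 - beta) * (1 - p) / (1 - (1 - p) * beta)  # claimed geometric parameter of X + 1
print(np.mean(X + 1), 1 / q)         # sample mean of X + 1 vs theoretical 1/q
print(np.mean(X == 0), q)            # P(X + 1 = 1) should equal q
```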


Conditional Expectation. The conditional expectation of g(X) given Y = y, denoted E[g(X)|Y = y], is defined as

E[g(X)|Y = y] = Σ_{x∈Range(X)} g(x) p_{X|Y}(x|y).

The law of total probability then shows that

E[g(X)] = Σ_x g(x) p_X(x) = Σ_x g(x) Σ_y p_{X|Y}(x|y) p_Y(y) = Σ_y E[g(X)|Y = y] p_Y(y).

Note that the conditional expectation can be regarded as a function of y; that is, it is a numerical function defined on the sample space of Y and is thus a random variable, denoted by E[g(X)|Y], and we therefore have

E[g(X)] = E[E[g(X)|Y]].

A similar expression can be obtained for variances:

Var(X) = E[X²] − (E[X])²
       = E[E[X²|Y]] − (E[E[X|Y]])²
       = E[E[X²|Y]] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
       = E[E[X²|Y] − (E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
       = E[Var(X|Y)] + Var(E[X|Y]).

Note that we have defined Var(X|Y) := E[X²|Y] − (E[X|Y])² = σ²_{X|Y}, which, like the conditional expectation, is now a random variable.

Example 4.3. Let Y have a distribution with mean µ and variance σ². Conditional on Y = y, suppose that X has a distribution with mean −y and variance y². Find the variance of X.

Solution: From the information given, E[X|Y] = −Y and Var(X|Y) = Y². Thus,

Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = E[Y²] + Var(−Y) = (σ² + µ²) + (−1)² Var(Y) = 2σ² + µ².
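A simulation sketch of this example, taking Y to be normal and X given Y = y to be normal as well; these are arbitrary choices, since the answer depends only on the stated conditional means and variances:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 1.0, 2.0, 1_000_000  # arbitrary parameter choices

Y = rng.normal(mu, sigma, n)        # E[Y] = mu, Var(Y) = sigma^2
X = rng.normal(-Y, np.abs(Y))       # E[X|Y] = -Y, Var(X|Y) = Y^2

print(np.var(X), 2 * sigma**2 + mu**2)  # both close to 9.0 here
```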

Since the conditional expectation is the expectation with respect to the conditional probability mass function p_{X|Y}(x|y), conditional expectations behave in most ways like ordinary expectations. For example,

1. E[a g(X_1) + b h(X_2)|Y] = a E[g(X_1)|Y] + b E[h(X_2)|Y]
2. If g ≥ 0 then E[g(X)|Y] ≥ 0
3. E[g(X, Y)|Y = y] = E[g(X, y)|Y = y]
4. If X and Y are independent, E[g(X)|Y] = E[g(X)]
5. E[g(X)h(Y)|Y] = h(Y)E[g(X)|Y]
6. E[g(X)h(Y)] = E[h(Y)E[g(X)|Y]]

In particular, it follows from properties 1 and 5 that E[a|Y] = a for any constant a, and E[h(Y)|Y] = h(Y) for any function h.

Remark: the formulae 1–6 are applicable in more general situations, even if X or Y is not discrete (cf. random sums for some applications).

    4.3 Mixed Cases

If X is a continuous random variable and N is a discrete random variable, then the conditional distribution function F_{X|N}(x|n) of X given that N = n can be defined in the obvious way:

F_{X|N}(x|n) = P(X ≤ x and N = n)/P(N = n).

From this definition, we can easily define the conditional probability density function as

f_{X|N}(x|n) = (d/dx) F_{X|N}(x|n).

As in the discrete case, the conditional density behaves much like an ordinary density, so that, for example,

P(a < X ≤ b, N = n) = P(a < X ≤ b|N = n)P(N = n) = p_N(n) ∫_a^b f_{X|N}(x|n) dx.

Note that the key feature of this and the discrete case was that the conditioning random variable N was discrete, so that we would be able to guarantee that there would be some possible values of n such that P(N = n) > 0. It is possible to condition on continuous random variables and the properties are much the same, but we just need to take a bit of care since technically the probability of any individual outcome of a continuous random variable is zero.


    4.4 Random Sums

Suppose we have an infinite sequence of independent and identically distributed random variables ξ_1, ξ_2, . . . , and a discrete non-negative integer-valued random variable N which is independent of the ξ's. We can then define the random sum

X = ξ_1 + . . . + ξ_N = Σ_{k=1}^{N} ξ_k.

(Note that for convenience, we will define the sum of zero terms to be zero.)

Moments. If we let

E[ξ_k] = µ,   Var(ξ_k) = σ²,   E[N] = ν,   Var(N) = τ²,

then we can derive the mean of X as

E[X] = E[E[X|N]] = Σ_{n=0}^{∞} E[X|N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_N |N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n|N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + . . . + ξ_n] p_N(n) = µ Σ_{n=1}^{∞} n p_N(n)
     = µν,

and the variance as

Var(X) = E[(X − µν)²] = E[(X − Nµ + Nµ − µν)²]
       = E[(X − Nµ)²] + E[µ²(N − ν)²] + 2E[µ(X − Nµ)(N − ν)]
       = E[E[(X − Nµ)²|N]] + E[µ²(N − ν)²] + 2E[E[µ(X − Nµ)(N − ν)|N]]
       = νσ² + µ²τ²,

since

E[X − Nµ|N = n] = E[Σ_{i=1}^{n} ξ_i − nµ] = 0,
E[(X − Nµ)²|N = n] = E[(Σ_{i=1}^{n} ξ_i − nµ)²] = nσ².


Example 4.4. Total Grandchildren. Suppose that individuals in a certain species have a random number of offspring independently of one another with a known distribution having mean µ and variance σ². Let X be the number of grandchildren of a single parent, so that X = ξ_1 + . . . + ξ_N, where N is the random number of original offspring and ξ_k is the random number of offspring of the kth child of the original parent. Then E[N] = E[ξ_k] = µ and Var(N) = Var(ξ_k) = σ², so that

E[X] = µ²   and   Var(X) = µσ²(1 + µ).
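A quick numerical check of this example, taking the offspring distribution to be Poisson(µ), an arbitrary choice for which σ² = µ; the sketch uses the standard fact that a sum of n independent Poisson(µ) counts is Poisson(nµ):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n_sim = 1.8, 500_000

N = rng.poisson(mu, n_sim)  # number of children of each simulated parent
X = rng.poisson(N * mu)     # grandchildren: sum of N Poisson(mu) counts

print(np.mean(X), mu**2)              # E[X] = mu^2
print(np.var(X), mu * mu * (1 + mu))  # Var(X) = mu*sigma^2*(1+mu) with sigma^2 = mu
```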

Distribution of Random Sums. In addition to moments, we need to know the distribution of the random sum X. If the ξ's are continuous and have density function f(z), then the distribution of ξ_1 + . . . + ξ_n is the n-fold convolution of f, denoted by f^{(n)}(z) and recursively defined by

f^{(1)}(z) = f(z),
f^{(n)}(z) = ∫_{−∞}^{∞} f^{(n−1)}(z − u) f(u) du   for n > 1.

Since N is independent of the ξ's, f^{(n)} is also the conditional density of X given N = n ≥ 1. Thus, if we assume that P(N = 0) = 0, the law of total probability says

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n).

NOTE: If we don't assume that P(N = 0) = 0, then we have a "mixed" distribution, with an atom of mass p_N(0) at zero (since X = 0 whenever N = 0), so that

P(a < X ≤ b) = ∫_a^b {Σ_{n=1}^{∞} f^{(n)}(x) p_N(n)} dx

for a < b, with the additional term p_N(0) added whenever a < 0 ≤ b.

Example 4.5. Suppose that the ξ's are exponentially distributed with parameter λ, i.e. with common density

f(z) = λe^{−λz}

for z ≥ 0, and suppose also that N has a geometric distribution with parameter p, so that p_N(n) = p(1 − p)^{n−1} for n = 1, 2, . . . . In this case,

f^{(2)}(z) = ∫_{−∞}^{∞} f(z − u)f(u) du = ∫_0^z λ² e^{−λ(z−u)} e^{−λu} du = λ² e^{−λz} ∫_0^z du = λ² z e^{−λz}.

In fact, it is straightforward to use mathematical induction to show that f^{(n)}(z) = (λ^n/(n − 1)!) z^{n−1} e^{−λz} for z ≥ 0, which is a Gamma(n, λ) density (a fact which is much more easily demonstrated using moment generating functions!). Thus, the distribution of X is

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) = Σ_{n=1}^{∞} (λ^n/(n − 1)!) x^{n−1} e^{−λx} p(1 − p)^{n−1}
       = λp e^{−λx} Σ_{n=1}^{∞} {λ(1 − p)x}^{n−1}/(n − 1)! = λp e^{−λx} e^{λ(1−p)x}
       = λp e^{−λpx}.

So, X has an exponential distribution with parameter λp, or a Gamma(1, λp). Note that the distribution of the random sum is not the same as the distribution of the non-random sum.
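A Monte Carlo confirmation; λ and p below are arbitrary. Given N = n, the sum of n independent Exponential(λ) variables is drawn in one step as Gamma(n, λ), using the convolution fact just derived:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, p, n_sim = 2.0, 0.25, 500_000

N = rng.geometric(p, n_sim)  # P(N = n) = p(1-p)^(n-1), n = 1, 2, ...
X = rng.gamma(N, 1 / lam)    # given N, Gamma(N, lam) = sum of N Exponential(lam)

print(np.mean(X), 1 / (lam * p))           # Exponential(lam*p) mean
print(np.mean(X > 1.0), np.exp(-lam * p))  # Exponential(lam*p) tail at x = 1
```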

    4.5 Conditioning on Continuous Random Variables

Conditional Density. Note that in the previous sections we have been able to use our definition of conditional probability for events since the conditioning events {Y = y} have non-zero probability for discrete random variables. If we want to find the conditional distribution of X given Y = y, and Y is continuous, we cannot use, as we might first try,

F_{X|Y}(x|y) = P(X ≤ x|Y = y) = P(X ≤ x and Y = y)/P(Y = y),

since both probabilities in the final fraction are zero. Instead, we shall define the conditional density function as

f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y),

for values of y such that f_Y(y) > 0. The conditional distribution function is then given by

F_{X|Y}(x|y) = ∫_{−∞}^{x} f_{X|Y}(ξ|y) dξ.


Conditional Expectation. Finally, we can define

E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx,

as expected, and this version of the conditional expectation still satisfies all of the nice properties that we derived in the previous sections for discrete conditioning variables. For example,

P(a < X ≤ b|Y = y) = F_{X|Y}(b|y) − F_{X|Y}(a|y) = ∫_a^b f_{X|Y}(x|y) dx
                   = ∫_{−∞}^{∞} 1_{(a,b]}(x) f_{X|Y}(x|y) dx = E[1_{(a,b]}(X)|Y = y],

where the function 1_I(x) is the indicator function of the set I, i.e. 1_I(x) = 1 if x ∈ I and 1_I(x) = 0 otherwise.

Note that, as is the case with ordinary expectations and indicators, the conditional probability of the random variable having an outcome in I is equal to the conditional expectation of the indicator function of that event. (Recall that

P(X ∈ I) = ∫_I f_X(x) dx = ∫_{−∞}^{∞} 1_I(x) f_X(x) dx = E[1_I(X)]

for ordinary expectations and probabilities.)

We can use the above fact to show a new form of the law of total probability, which is often a very useful method of finding probabilities; namely,

P(a < X ≤ b) = ∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy.

To see why this is true, note that

∫_{−∞}^{∞} P(a < X ≤ b|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} {∫_a^b f_{X|Y}(x|y) dx} f_Y(y) dy = ∫_{−∞}^{∞} ∫_a^b f_{XY}(x, y) dx dy
   = P(a < X ≤ b and −∞ < Y < ∞) = P(a < X ≤ b).

In fact, we can generalize this notion even further to show that

P{a < g(X, Y) ≤ b} = ∫_{−∞}^{∞} P{a < g(X, y) ≤ b|Y = y} f_Y(y) dy.


Example 4.6. Suppose X and Y are continuous random variables having joint density function

f_{XY}(x, y) = y e^{−xy−y}   for x, y > 0.

(a) Find the conditional distribution of X given Y = y.
(b) Find the distribution function of Z = XY.

Solution: (a) First, we must find the marginal density of Y, which is

f_Y(y) = ∫_{−∞}^{∞} f_{XY}(x, y) dx = ∫_0^∞ y e^{−xy−y} dx = e^{−y} ∫_0^∞ y e^{−xy} dx = e^{−y},   y > 0.

Therefore,

f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y) = y e^{−xy},   x, y > 0.

In other words, conditional on Y = y, X has an exponential distribution with parameter y, and thus F_{X|Y}(x|y) = 1 − e^{−xy}.

(b) To find the distribution of Z = XY, we write

F_Z(z) = P(Z ≤ z) = P(XY ≤ z) = ∫_{−∞}^{∞} P(XY ≤ z|Y = y) f_Y(y) dy
       = ∫_0^∞ P(X ≤ z/y|Y = y) e^{−y} dy = ∫_0^∞ (1 − e^{−z}) e^{−y} dy
       = 1 − e^{−z},

so that Z has an exponential distribution with parameter 1.
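Part (b) is easy to confirm by simulation: draw Y from its Exponential(1) marginal, then X given Y = y from the Exponential(y) conditional found in part (a), and compare the tail of Z = XY with e^{−z}:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000

Y = rng.exponential(1.0, n)  # marginal: f_Y(y) = e^{-y}
X = rng.exponential(1 / Y)   # conditional: Exponential(Y), i.e. mean 1/Y

Z = X * Y
for z in [0.5, 1.0, 2.0]:
    print(np.mean(Z > z), np.exp(-z))  # empirical vs Exponential(1) tail
```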

    4.6 Joint Conditional Distributions

If X, Y and Z are jointly distributed random variables and Z is discrete, we can define the joint conditional distribution of X and Y given Z in the obvious way,

F_{XY|Z}(x, y|z) = P(X ≤ x and Y ≤ y|Z = z) = P(X ≤ x and Y ≤ y and Z = z)/P(Z = z).

If X, Y and Z are all continuous, then we define the joint conditional density of X and Y given Z as

f_{XY|Z}(x, y|z) = f_{XYZ}(x, y, z)/f_Z(z),

where f_{XYZ}(x, y, z) is the joint density function of X, Y and Z and f_Z(z) is the marginal density function of Z.


The random variables $X$ and $Y$ are said to be conditionally independent given $Z$ if $F_{XY|Z}(x, y \mid z) = F_{X|Z}(x \mid z) F_{Y|Z}(y \mid z)$, where $F_{X|Z}(x \mid z) = \lim_{y \to \infty} F_{XY|Z}(x, y \mid z)$ and $F_{Y|Z}(y \mid z) = \lim_{x \to \infty} F_{XY|Z}(x, y \mid z)$ are the conditional distributions of $X$ given $Z$ and of $Y$ given $Z$, respectively. As with unconditional independence, an equivalent characterization when the random variables involved are continuous is that the densities factor as $f_{XY|Z}(x, y \mid z) = f_{X|Z}(x \mid z) f_{Y|Z}(y \mid z)$. (NOTE: In an obvious extension of the formula for unconditional densities,

$$f_{X|Z}(x \mid z) = \int_{-\infty}^{\infty} f_{XY|Z}(x, y \mid z)\, dy,$$

with a similar definition for $f_{Y|Z}(y \mid z)$.)

As with the case for unconditional joint distributions, a useful concept is the conditional covariance, defined as

$$\mathrm{Cov}(X, Y \mid Z) = E[XY \mid Z] - E[X \mid Z]\, E[Y \mid Z],$$

and the conditional correlation coefficient, which is simply the conditional covariance scaled by the product of the conditional standard deviations, $\sigma_{X|Z} = \sqrt{\mathrm{Var}(X \mid Z)}$ and $\sigma_{Y|Z} = \sqrt{\mathrm{Var}(Y \mid Z)}$. Note that if two random variables are conditionally independent then they are conditionally uncorrelated (i.e. the conditional covariance is zero), but the converse is not true. Also, just because two random variables are conditionally independent or uncorrelated does not necessarily imply that they are unconditionally independent or uncorrelated.
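The last point is easy to see in a small simulation (added here as an illustration; taking $X$ and $Y$ to be independent $N(z, 1)$ draws given $Z = z$ is just one convenient choice): $X$ and $Y$ are conditionally independent given $Z$, yet unconditionally $\mathrm{Cov}(X, Y) = \mathrm{Var}(Z) \ne 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Given Z = z, let X and Y be independent N(z, 1): conditionally independent.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# Cov(X, Y | Z) = 0 for every z, but unconditionally Cov(X, Y) = Var(Z) = 1.
print(np.cov(x, y)[0, 1])   # approximately 1, not 0
```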


5 Elements of Matrix Algebra

To prepare our analysis of Markov chains it is convenient to recall some elements of matrix algebra.

A matrix $A$ is a rectangular array with $n$ rows and $m$ columns of real-valued entries $A(i, j)$, where $A(i, j)$ refers to the element in the $i$th row and the $j$th column. We write briefly $A = (A(i, j)) \in \mathbb{R}^{n \times m}$ (verbally, $A$ is an $n \times m$ matrix).

Example 5.1. Note that

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}, \qquad A(1, 2) = 2.$$

We have different operations when dealing with matrices:

Scalar Multiplication. Let $a \in \mathbb{R}$ and $A = (A(i, j)) \in \mathbb{R}^{n \times m}$. The scalar multiplication $aA$ is defined by taking the product of the real number $a$ with each of the components of $A$, giving rise to a new matrix $C = (C(i, j)) := aA \in \mathbb{R}^{n \times m}$ with $C(i, j) := aA(i, j)$.

Example 5.2. Let

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}.$$

Then, with $a = 2$,

$$C = 2A = \begin{pmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \end{pmatrix} \in \mathbb{R}^{2 \times 3}.$$

Transposition. Let $A = (A(i, j)) \in \mathbb{R}^{n \times m}$. The transpose of $A$ is denoted by $A^\top = (A^\top(i, j))$; it is an $\mathbb{R}^{m \times n}$ matrix with entries $A^\top(i, j) := A(j, i)$. (We interchange the roles of columns and rows.)

Example 5.3. Let

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3} \;\Rightarrow\; A^\top = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} \in \mathbb{R}^{3 \times 2}.$$

Sum of Matrices. Let $A = (A(i, j)), B = (B(i, j)) \in \mathbb{R}^{n \times m}$. By componentwise adding the entries we get a new matrix $C = (C(i, j)) =: A + B \in \mathbb{R}^{n \times m}$, where $C(i, j) = A(i, j) + B(i, j)$.


Example 5.4. Let

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}, \qquad B = \begin{pmatrix} 1 & 1 & -2 \\ 1 & 3 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}$$
$$\Rightarrow\; C = A + B = \begin{pmatrix} 2 & 3 & 1 \\ 5 & 8 & 12 \end{pmatrix} \in \mathbb{R}^{2 \times 3}.$$

Product of Matrices. Let $A = (A(i, j)) \in \mathbb{R}^{n \times m}$, $B = (B(i, j)) \in \mathbb{R}^{m \times r}$. (The number $m$ of $A$'s columns must match the number $m$ of $B$'s rows.) Then the matrix product $AB = A \cdot B := C$ is the matrix $C = (C(i, j)) \in \mathbb{R}^{n \times r}$ with entries

$$C(i, j) := \sum_{k=1}^{m} A(i, k) B(k, j), \quad 1 \le i \le n, \; 1 \le j \le r.$$

By inspection: the entry $C(i, j)$ is the Euclidean inner product of the $i$th row of $A$ with the $j$th column of $B$.

Example 5.5. Let

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \in \mathbb{R}^{2 \times 3}, \qquad B = \begin{pmatrix} 1 & 4 & 2 \\ 2 & 5 & 1 \\ 3 & 6 & 1 \end{pmatrix}.$$

To compute $AB$ it is convenient to adopt the following scheme, writing $B$ above and to the right of $A$, so that each entry of the product sits at the crossing of a row of $A$ and a column of $B$:

                   1  4  2
                   2  5  1
                   3  6  1

    1  2  3        1x1+2x2+3x3    1x4+2x5+3x6    7
    4  5  6        4x1+5x2+6x3    ...            ...

Fill in the dots. The result is

$$C = AB = \begin{pmatrix} 14 & 32 & 7 \\ 32 & \dots & \dots \end{pmatrix}.$$
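To check your hand computation (the dotted entries above are left for you to fill in), a short NumPy sketch performs the same multiplication:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 4, 2],
              [2, 5, 1],
              [3, 6, 1]])

# The @ operator is exactly the row-by-column product defined above.
print(A @ B)   # compare with your filled-in scheme
```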

Product with Vectors and Matrices. This is a special case of the general matrix multiplication: let $x \in \mathbb{R}^n$ and $A = (A(i, j)) \in \mathbb{R}^{n \times m}$. If we contrive $x \in \mathbb{R}^{1 \times n}$ as a matrix with only one row, then $xA \in \mathbb{R}^{1 \times m}$ is defined by the corresponding matrix multiplication; the result is a row vector. If we insist on $x \in \mathbb{R}^{n \times 1}$ being a column vector, then $x^\top A$ and $A^\top x$ are still well defined. If $n \ne m$ then $Ax$ is not defined, even if $x$ is a column vector. The dimensions must always match.

Power of Matrices. Let $I \in \mathbb{R}^{n \times n}$ be the identity matrix: $I = (I(i, j)) \in \mathbb{R}^{n \times n}$ with entries $I(i, j) = 1$ if $i = j$ and $I(i, j) = 0$ if $i \ne j$. The identity matrix is a diagonal matrix (only the elements of the diagonal are nonzero) with unit entries on the diagonal. For any $A \in \mathbb{R}^{n \times m}$ we have $IA = A$ (and for all $B \in \mathbb{R}^{m \times n}$ we have $BI = B$). For matrices where the number of columns equals the number of rows, $A \in \mathbb{R}^{n \times n}$, we can define the $p$th power $A^p$, $p \in \mathbb{N}_0 = \{0, 1, 2, 3, 4, \dots\}$, by iteration:

$$A^0 := I, \qquad A^1 := A, \qquad A^p := A^{p-1} A = A A^{p-1}.$$

Example 5.6. Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$. Find $A^0$, $A^1$, $A^2$ and $A^3$.

Answer:

$$A^0 = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad A^1 = A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad A^2 = \begin{pmatrix} 7 & 10 \\ 15 & 22 \end{pmatrix}, \quad A^3 = \begin{pmatrix} 37 & 54 \\ 81 & 118 \end{pmatrix}.$$
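These powers can be checked with numpy.linalg.matrix_power (a verification sketch added here, not part of the original notes):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# A^0 = I, A^1 = A, and A^p = A^{p-1} A by iteration.
for p in range(4):
    print(f"A^{p} =")
    print(np.linalg.matrix_power(A, p))
```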

Example 5.7. (a) Show $(A^\top)^\top = A$.
(b) Show $A + B = B + A$.
(c) Show $(A + B)^\top = A^\top + B^\top$.
(d) Show $(AB)^\top = B^\top A^\top$.
(e) Give an example of square matrices $A, B \in \mathbb{R}^{2 \times 2}$ showing that $AB \ne BA$ ("not commutative").

(Also see Tutorials.)


    Part II: Markov Chains

    6 Stochastic Process and Markov Chains

    6.1 Introduction and Definitions

General Stochastic Process. A stochastic process is a family of random variables, $\{X_t\}_{t \in T}$, indexed by a parameter $t$ which belongs to an ordered index set $T$. For notational convenience, we will sometimes use $X(t)$ instead of $X_t$. We use $t$ because the indexing is most commonly associated with time.

For example, the price of a particular stock at the close of each day's trading would be a stochastic process indexed by time. Of course, the index does not have to be time; it may be a spatial indicator, such as the number of defects in specified regions of a computer chip. In fact, the indexing may be almost anything. Indeed, if we consider the index to be individuals, we can consider a random sample $X_1, \dots, X_n$ to be a stochastic process. Of course, this would be a rather special stochastic process in that the random variables making up the stochastic process would be independent of each other. In general, we will want to deal with stochastic processes where the random variables may be dependent on one another.

As with individual random variables, we shall be interested in the set $S$ of values which the random variables may take on, but we shall generally refer to this set as the state space in this context. Again, as with single random variables, the state space may be either discrete or continuous. In addition, however, we must now also consider whether the index set $T$ is discrete or continuous. In this section, we shall be considering the case where the index set is the discrete set of natural numbers $T = \mathbb{N}_0 = \{0, 1, 2, \dots\}$; such processes are usually referred to as discrete time stochastic processes. We will start by examining processes with discrete state spaces and later move on to processes with continuous time sets $T$.

Markov Chain. The simplest sort of stochastic process is of course one for which the random variables $X_t$ are independent. However, the next simplest type of process, and the starting point for our journey through the theory of stochastic processes, is called a Markov chain. A Markov chain is a stochastic process having:

1) a countable state space $S$,

2) a discrete index set $T = \{0, 1, 2, \dots\}$,

3) the Markov property, and

4) stationary transition probabilities.

The final two properties listed are discussed next:

6.2 Markov Property

In general, we have defined a stochastic process so that the immediate future may depend on both the present and the entire past. This framework is a bit too general for an initial investigation of the concepts involved in stochastic processes. A discrete time process with discrete state space will be said to have the Markov property if

$$P(X_{t+1} = x_{t+1} \mid X_0 = x_0, \dots, X_t = x_t) = P(X_{t+1} = x_{t+1} \mid X_t = x_t).$$

In other words, the future depends only on the present and not on the past.

At first glance, this may seem a silly property, in the sense that it would never really happen. However, it turns out that Markov chains can give surprisingly good approximations to real situations.

Example. As an example, suppose our stochastic process of interest is the total amount of something (money, perhaps) that we have accumulated at the end of each day. Often, it is a very reasonable assumption that tomorrow's amount depends only on what we have today and not on how we arrived at today's amount. Indeed, this will be the case if, for instance, each day's incremental amount is independent of those for the previous days. Thus, a very common and useful stochastic process possessing the Markov property is the sequence of partial totals in a random sum, i.e. $X_t = \xi_1 + \dots + \xi_t$ where $\xi_1, \xi_2, \dots$ is a sequence of independent random variables. In this case, it is clear that $X_{t+1} = X_t + \xi_{t+1}$ depends only on the value of $X_t$ (and, of course, on the value of $\xi_{t+1}$, but this is independent of all the previous $\xi$'s and thus of the previous $X$'s as well).
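As a concrete illustration (a sketch assuming, purely for example's sake, i.i.d. $\pm 1$ increments), the partial totals are straightforward to simulate:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20

# Independent increments xi_1, ..., xi_T, here +1 or -1 with equal probability.
xi = rng.choice([-1, 1], size=T)

# X_t = xi_1 + ... + xi_t; since X_{t+1} = X_t + xi_{t+1} with xi_{t+1}
# independent of the past, the partial totals have the Markov property.
X = np.concatenate(([0], np.cumsum(xi)))
print(X)
```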


6.3 Stationarity

Suppose we know that at time $t$ our Markov chain is in state $x$, and we want to know about what will happen at time $t + 1$. The probability of $X_{t+1}$ being equal to $y$ in this instance is referred to as the one-step transition probability of going from state $x$ to state $y$ at time $t$, and is denoted by $P^{t,t+1}(x, y)$, or sometimes $P^{t,t+1}_{xy}$. (Note that for convenience of terminology, even if $x = y$ we will still refer to this as a transition.) If we are dealing with a Markov chain, then we know that

$$P^{t,t+1}(x, y) = P(X_{t+1} = y \mid X_t = x),$$

since the outcome of $X_{t+1}$ only depends on the value of $X_t$. If, for any value $t$ in the index set, we have

$$P^{t,t+1}(x, y) = P(x, y) = P_{xy} \quad \text{for all } x, y \in S,$$

that is, the one-step transition probabilities are the same at all times $t$, then the process is said to have stationary transition probabilities. Here, the word stationary describes the fact that the probability of going from one specified state to another does not change with time. Note that for the partial totals in a random sum, the process has stationary transition probabilities if and only if the $\xi$'s are identically distributed.

6.4 Transition Matrices and Initial Distributions

Let's start by considering the simplest type of Markov chain, namely, a chain with state space of cardinality 2, say, $S = \{0, 1\}$. (Actually, this is the second-simplest type of chain, the simplest being one with only one possible state, but this case is rather unenlightening.) Suppose that at any time $t$,

$$P(X_{t+1} = 1 \mid X_t = 0) = p, \qquad P(X_{t+1} = 0 \mid X_t = 0) = 1 - p,$$
$$P(X_{t+1} = 0 \mid X_t = 1) = q, \qquad P(X_{t+1} = 1 \mid X_t = 1) = 1 - q,$$

and that at time $t = 0$,

$$P(X_0 = 0) = \pi_0(0), \qquad P(X_0 = 1) = \pi_0(1).$$

We will generally use the notation $\pi_t$ to refer to the pmf of the discrete random variable $X_t$ when dealing with discrete time Markov chains, so that $\pi_t(x) = p_{X_t}(x) = P(X_t = x)$.

When the state space is finite, we can arrange the transition probabilities, $P_{xy}$, into a matrix called the transition matrix. For the two-state Markov chain described above the transition matrix is

$$P = \begin{pmatrix} P(0, 0) & P(0, 1) \\ P(1, 0) & P(1, 1) \end{pmatrix} = \begin{pmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{pmatrix} = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}.$$

Note that for any fixed $x$, the pmf of $X_t$ given $X_{t-1} = x$ is $p_{X_t|X_{t-1}}(y \mid x) = P(x, y)$. Thus, the sum of the values in any row of the matrix $P$ will be 1. If the state space is not finite then we will often refer to $P(x, y)$ as the transition function of the Markov chain.

Similarly, if $S$ is finite, we can arrange the initial distribution as a row vector, for example, $\pi_0 = \{\pi_0(0), \pi_0(1)\}$ in the case of the two-state chain above.

It is an important fact that $P$ and $\pi_0$ are enough to completely characterize a Markov chain, and we shall examine this more thoroughly a little later. As an example, however, let's compute some quantities associated with the above two-state chain.

Example 6.1. For the two-state Markov chain above, let's examine the chance that $X_t$ will equal 0. To do so, we note

$$\pi_t(0) = P(X_t = 0) = P(X_t = 0 \mid X_{t-1} = 0)\,P(X_{t-1} = 0) + P(X_t = 0 \mid X_{t-1} = 1)\,P(X_{t-1} = 1)$$
$$= (1 - p)\,\pi_{t-1}(0) + q \underbrace{\pi_{t-1}(1)}_{=\,1 - \pi_{t-1}(0)} = q + (1 - p - q)\,\pi_{t-1}(0).$$

By iterating this procedure:

$$\pi_1(0) = q + (1 - p - q)\,\pi_0(0),$$
$$\pi_2(0) = q + (1 - p - q)\,\pi_1(0) = q + (1 - p - q)\{q + (1 - p - q)\,\pi_0(0)\} = q + (1 - p - q)q + (1 - p - q)^2\,\pi_0(0),$$
$$\vdots$$
$$\pi_t(0) = q \sum_{i=0}^{t-1} (1 - p - q)^i + (1 - p - q)^t\,\pi_0(0) = \frac{q}{p + q} + (1 - p - q)^t \left(\pi_0(0) - \frac{q}{p + q}\right),$$

where we have used the well-known summation formula for a geometric series,

$$\sum_{i=0}^{n-1} r^i = \frac{1 - r^n}{1 - r}.$$


First, note that we can thus calculate the distribution of any of the $X_t$'s using only the entries of $P$ and $\pi_0$. Second, as long as $p$ and $q$ are not both 0 and not both 1, we have $|1 - p - q| < 1$, so that $(1 - p - q)^t \to 0$ and hence $\pi_t(0) \to q/(p + q)$ as $t \to \infty$, regardless of the initial distribution $\pi_0$. Third, multiplying $P$ by itself yields the two-step transition matrix,

$$P^2 = P \times P = \begin{pmatrix} (1 - p)^2 + pq & p(2 - p - q) \\ q(2 - p - q) & (1 - q)^2 + pq \end{pmatrix}.$$

We will discuss general $n$-step transition matrices shortly.
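Both the recursion and the closed form for $\pi_t(0)$ are easy to confirm numerically (a sketch with arbitrary illustrative values $p = 0.3$, $q = 0.2$ and $\pi_0 = (0.9, 0.1)$; the matrix form $\pi_t = \pi_0 P^t$ anticipates the $n$-step transition matrices discussed shortly):

```python
import numpy as np

p, q = 0.3, 0.2                            # illustrative values
P = np.array([[1 - p, p],
              [q, 1 - q]])                 # one-step transition matrix
pi0 = np.array([0.9, 0.1])                 # some initial distribution

for t in (1, 2, 5, 50):
    pi_t = pi0 @ np.linalg.matrix_power(P, t)          # row vector pi_0 P^t
    closed = q / (p + q) + (1 - p - q) ** t * (pi0[0] - q / (p + q))
    print(t, pi_t[0], closed)              # agree; both approach q/(p+q) = 0.4
```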

A formal proof that $P$ and $\pi_0$ fully characterize a Markov chain is beyond the scope of this class. However, we will try to give the basic idea behind the proof now. It should seem intuitively reasonable that anything we want to know about a Markov chain $\{X_t\}_{t \ge 0}$ can be built up from probabilities of the form

$$P(X_n = x_n, \dots, X_0 = x_0) = P(X_n = x_n \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0) \times P(X_{n-1} = x_{n-1}, \dots, X_0 = x_0)$$
$$= P(X_n = x_n \mid X_{n-1} = x_{n-1}) \times P(X_{n-1} = x_{n-1}, \dots, X_0 = x_0)$$
$$\vdots$$
$$= P(x_{n-1}, x_n) P(x_{n-2}, x_{n-1}) \cdots P(x_0, x_1)\, \pi_0(x_0) = \pi_0(x_0) \prod_{i=1}^{n} P(x_{i-1}, x_i).$$

Notice that the above simply states that the probability that the chain follows a particular path for the first $n$ steps can be found by simply multiplying the probabilities of the necessary transitions. Note also that we directly required both the Markov property and stationarity for this demonstration. Indeed, the above identity is an equivalent form of the stationary Markov property. As a technical detail, we must be careful that none of the conditioning events in the above derivation have probability zero. However, this will only occur when the original path is not possible (i.e. the specified $x_i$'s do not form a legitimate set of outcomes), in which case the original probability will clearly be zero as will at least one of the factors in the final product, so the result still holds true.
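The factorization translates directly into code: a path's probability is $\pi_0(x_0)$ times the successive one-step transition probabilities. A minimal sketch (reusing the two-state chain with illustrative $p = 0.3$, $q = 0.2$ and an illustrative initial distribution):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.2, 0.8]])        # two-state chain with p = 0.3, q = 0.2
pi0 = np.array([0.5, 0.5])        # illustrative initial distribution

def path_probability(path):
    """P(X_0 = path[0], ..., X_n = path[n]) = pi_0(x_0) * prod_i P(x_{i-1}, x_i)."""
    prob = pi0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a, b]
    return prob

print(path_probability([0, 0, 1, 1]))   # 0.5 * 0.7 * 0.3 * 0.8 = 0.084
```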

For the sake of completeness, we note that the characterization is a one-to-one correspondence. That is, every Markov chain is completely determined by its initial distribution and transition matrix, and any initial distribution and transition matrix (recall that a transition matrix must satisfy the property that each of its rows sums to unity) determine some Markov chain.

As a final comment, we note that it is the transition function $P$ which is the more fundamental aspect of a Markov chain, rather than the initial distribution $\pi_0$. We shall see why this is so specifically in the results to follow, but it should be clear that changing initial distributions will generally only slightly affect the overall behaviour of the chain, while a change in $P$ will generally result in dramatic changes.

6.5 Examples of Markov Chains

We now present some of the most commonly used Markov chains.

Random Walk: Let $p(u)$ be a probability mass function on the integers. A random walk is a Markov chain with transition function $P(x, y) = p(y - x)$ for integer-valued $x$ and $y$. Here we have $S = \mathbb{Z} = \{\dots, -3, -2, -1, 0, 1, 2, 3, \dots\}$. For instance, if $p(-1) = p(1) = 0.5$, then the chain is the simple symmetric random walk, where at each stage the chain takes either one step forward or backward. Such models are sometimes used to describe the motion of a suspended particle. One question of interest we might ask is how far the particle will travel. Another might be whether the particle ever returns to its starting position and, if so, how often. Often, the simple random walk is extended so that $p(1) = p$, $p(-1) = q$ and $p(0) = r$, where $p$, $q$ and $r$ are non-negative numbers less than one such that $p + q + r = 1$.

Ehrenfest chain: The Ehrenfest chain is often used as a simple model for the diffusion of molecules across a membrane. Suppose that we have two distinct boxes and $d$ distinct labelled balls. Initially, the balls are distributed between the two boxes. At each step, a ball is selected at random and is moved from the box that it is in to the other box. If $X_t$ denotes the number of balls in the first box after $t$ transitions, then $\{X_t\}_{t \ge 0}$ is a Markov chain with state space $S = \{0, \dots, d\}$. The transition function can be easily computed as follows: if at time $t$ there are $x$ balls in the first box, then there is probability $x/d$ that a ball will be removed from this box and put in the other, and a probability of $(d - x)/d$ that a new ball will be added to this box from the other. Thus

$$P(x, y) = \begin{cases} \dfrac{x}{d} & y = x - 1 \\[4pt] 1 - \dfrac{x}{d} & y = x + 1 \\[4pt] 0 & \text{otherwise.} \end{cases}$$

For this chain, we might ask if an "equilibrium" is reached.
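The Ehrenfest transition function is small enough to assemble explicitly as a matrix (a sketch with, say, $d = 4$ balls):

```python
import numpy as np

d = 4                                      # illustrative number of balls
P = np.zeros((d + 1, d + 1))
for x in range(d + 1):
    if x > 0:
        P[x, x - 1] = x / d                # a ball leaves the first box
    if x < d:
        P[x, x + 1] = 1 - x / d            # a ball enters the first box

print(P)
print(P.sum(axis=1))                       # every row sums to 1
```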

Gambler's ruin: Suppose a gambler starts out with $x$ dollars and makes a series of one dollar bets against the house. Assume that the respective probabilities of winning and losing the bet are $p$ and $q = 1 - p$, and that if the capital ever reaches 0, the betting ends and the gambler's fortune remains 0 forever after. This Markov chain has state space $S = \mathbb{N}_0 = \{0, 1, 2, 3, \dots\}$ and transition function

$$P(x, y) = \begin{cases} q & y = x - 1 \text{ and } x > 0 \\ p & y = x + 1 \text{ and } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

for $x \ge 1$, together with $P(0, 0) = 1$ and $P(0, y) = 0$ for $y \ne 0$. Note that a state which satisfies $P(a, a) = 1$ and $P(a, y) = 0$ for $y \ne a$ is called an absorbing state. We might wish to ask what the chance is that the gambler is ruined (i.e. loses all his/her initial stake) and how long it might take. Also, we might modify this chain to incorporate a strategy whereby the gambler quits when his/her fortune reaches $d$. For this chain, the above transition function still holds except that the definition given for $P(x, y)$ now holds only for $1 \le x \le d - 1$, and $d$ becomes an absorbing state. One interpretation of this modification is that two gamblers are betting against each other and between them they have a total capital of $d$ dollars. Letting $X_t$ represent the fortune of one of the gamblers yields the gambler's ruin chain on $\{0, 1, \dots, d\}$.
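Ruin probabilities for the modified chain are easy to estimate by simulation (a sketch with illustrative values $p = 0.45$, initial fortune $x_0 = 5$ and target $d = 10$):

```python
import numpy as np

rng = np.random.default_rng(4)
p, x0, d = 0.45, 5, 10                     # illustrative parameters

def is_ruined(p, x0, d):
    """Play one-dollar bets until absorption at 0 (ruin) or at d (quit)."""
    x = x0
    while 0 < x < d:
        x += 1 if rng.random() < p else -1
    return x == 0

ruins = sum(is_ruined(p, x0, d) for _ in range(20_000))
print(ruins / 20_000)                      # estimated ruin probability
```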

Birth and death chains: The Ehrenfest and Gambler's ruin chains are special cases of a birth and death chain. A birth and death chain has state space $S = \mathbb{N}_0 = \{0, 1, 2, \dots\}$ and transition function

$$P(x, y) = \begin{cases} q_x & y = x - 1 \\ r_x & y = x \\ p_x & y = x + 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $p_x$ is the chance of a "birth", $q_x$ the chance of a "death", and $0 \le p_x, q_x, r_x \le 1$ such that $p_x + q_x + r_x = 1$. Note that we allow the chance of births and deaths to depend on $x$, the current population. We will study birth and death chains in more detail later.

Queuing chain: Consider a service facility at which people arrive during each discrete time interval according to a distribution with probability mass function $p(u)$. If anyone is in the queue at the start of a time period then a single person is served and removed from the queue. Thus, the transition function for this chain is $P(0, y) = p(y)$ and, for $x \ge 1$, $P(x, y) = p(y - x + 1)$. In other words, if there is no one in the queue then the chance of having $y$ people in the queue by the next time interval is just the chance of $y$ people arriving, namely $p(y)$, while if $x$ people are currently in the queue, one will definitely be served and removed, and thus to get to $y$ individuals in the queue we require the arrival of $y - (x - 1)$ additional individuals. Two obvious questions to ask about this chain are when the queue will be emptied and how often.
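One transition of the queuing chain is simple to code (a sketch assuming, for illustration, that the arrival distribution $p(u)$ is Poisson with mean 0.9):

```python
import numpy as np

rng = np.random.default_rng(5)

def queue_step(x, mean_arrivals=0.9):
    """One time period: serve one person if the queue is non-empty,
    then add the new arrivals drawn from p(u) (here Poisson)."""
    served = 1 if x > 0 else 0
    return x - served + rng.poisson(mean_arrivals)

x = 0
visits_to_empty = 0
for _ in range(1000):
    x = queue_step(x)
    visits_to_empty += (x == 0)
print(x, visits_to_empty)                  # final queue length, times emptied
```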

Branching chain: Consider objects or entities, such as bacteria, which generate a number of offspring according to the probability mass function $p(u)$. If at each time increment the existing objects produce a random number of offspring and then expire, then $X_t$, the total number of objects at generation $t$, is a Markov chain with

$$P(x, y) = P(\xi_1 + \dots + \xi_x = y),$$

where the $\xi_i$'s are independent random variables each having probability mass function given by $p(u)$. A natural question to ask for such a chain is if and when extinction will occur.
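One generation of the branching chain replaces each of the $x$ current objects by an independent offspring count drawn from $p(u)$. A sketch (assuming, for illustration, a Poisson offspring distribution with mean 0.95; with mean offspring below 1 extinction is certain):

```python
import numpy as np

rng = np.random.default_rng(6)

def next_generation(x, mean_offspring=0.95):
    """X_{t+1} = xi_1 + ... + xi_x: total offspring of the x current objects."""
    return int(rng.poisson(mean_offspring, size=x).sum()) if x > 0 else 0

x = 10
for t in range(50):
    x = next_generation(x)
print(x)   # frequently 0: the population has often gone extinct by t = 50
```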

    6.6 Extending the Markov Property

    Recall that we have said that the Markov property is equivalent to the identity

