
Bruce K. Driver

180B-C Lecture Notes, Winter and Spring, 2011

March 29, 2011 File:180Lec.tex


Contents

Part 180B Notes

0 Basic Probability Facts / Conditional Expectations
  0.1 Course Notation
  0.2 Some Discrete Distributions
  0.3 A Stirling's Formula Like Approximation

1 Course Overview and Plan
  1.1 180B Course Topics:

2 Covariance and Correlation

3 Geometric aspects of L2(P)

4 Linear prediction and a canonical form

5 Conditional Expectation
  5.1 Conditional Expectation for Discrete Random Variables
  5.2 General Properties of Conditional Expectation
  5.3 Conditional Expectation for Continuous Random Variables
  5.4 Conditional Variances
  5.5 Summary on Conditional Expectation Properties

6 Random Sums

Part I Discrete Time Markov Chains

7 Markov Chains Basics
  7.1 Examples
  7.2 Hitting Times

8 Markov Conditioning
  8.1 Hitting Time Estimates
  8.2 First Step Analysis
  8.3 Finite state space examples
  8.4 Random Walk Exercises
  8.5 Computations avoiding the first step analysis
    8.5.1 General facts about sub-probability kernels

9 Markov Chains in the Long Run (Results)
  9.1 A Touch of Class
    9.1.1 A number theoretic lemma
  9.2 Transience and Recurrence Classes
    9.2.1 Transience and Recurrence for R.W.s by Fourier Series Methods (optional reading)
  9.3 Invariant / Stationary (sub) distributions
  9.4 The basic limit theorems
  9.5 Stopping Times
  9.6 Proof Ideas (optional reading)

10 Finite State Space Results and Examples
  10.1 Some worked examples
  10.2 Life Time Processes
  10.3 Sampling Plans
  10.4 Extra Homework Problems

11 Discrete Renewal Theorem (optional reading)

Part II Continuous Time Processes

12 Continuous Distributions

13 Exponential Random Variables
  13.1 Exercises

14 Math 180B (W 2011) Final Exam Information

15 Order statistics (you may skip this chapter!)

16 Point Processes
  16.1 Poisson and Geometric Random Variables (Review)
  16.2 Law of rare numbers
  16.3 Bernoulli point process (Homogeneous Case)
    16.3.1 The Scaling Limit (Homogeneous Case)
  16.4 Poisson Point Process (Homogeneous Case)
    16.4.1 The homogeneous Poisson process on R+
  16.5 Examples
  16.6 Poisson Point Process (Non-Homogeneous Case)
    16.6.1 Why Poisson Processes?
  16.7 Construction of Generalized Poisson Processes
  16.8 Bernoulli point process (Non-Homogeneous Case)
  16.9 The Continuum limit in the non-homogeneous case

References


Part 180B Notes


0 Basic Probability Facts / Conditional Expectations

0.1 Course Notation

1. $(\Omega, P)$ will denote a probability space and $S$ will denote a set called the state space.

2. $\mathbb{E}Z$ will denote the expectation of a random variable $Z : \Omega \to \mathbb{R}$, which is defined as follows. If $Z$ takes on only a finite number of real values $z_1, \dots, z_m$ we define
$$\mathbb{E}Z = \sum_{i=1}^m z_i\, P(Z = z_i).$$
For general $Z \geq 0$ we set $\mathbb{E}Z = \lim_{n\to\infty} \mathbb{E}Z_n$ where $\{Z_n\}_{n=1}^\infty$ is any sequence of discrete random variables such that $0 \leq Z_n \uparrow Z$ as $n \uparrow \infty$. Finally, if $Z$ is real valued with $\mathbb{E}|Z| < \infty$ (in which case we say $Z$ is integrable) we set $\mathbb{E}Z = \mathbb{E}Z_+ - \mathbb{E}Z_-$ where $Z_\pm = \max(\pm Z, 0)$.

3. The expectation has the following basic properties:

   a. Linearity: $\mathbb{E}[X + cY] = \mathbb{E}X + c\,\mathbb{E}Y$ where $X$ and $Y$ are any integrable random variables and $c \in \mathbb{R}$.

   b. MCT: the monotone convergence theorem holds; if $0 \leq Z_n \uparrow Z$, then $\uparrow \lim_{n\to\infty} \mathbb{E}[Z_n] = \mathbb{E}[Z]$ (with $\infty$ allowed as a possible value).

   c. DCT: the dominated convergence theorem holds; if $\mathbb{E}[\sup_n |Z_n|] < \infty$ and $\lim_{n\to\infty} Z_n = Z$, then $\mathbb{E}[\lim_{n\to\infty} Z_n] = \mathbb{E}Z = \lim_{n\to\infty} \mathbb{E}Z_n$.

   d. Fatou's Lemma: Fatou's lemma holds; if $0 \leq Z_n \leq \infty$, then $\mathbb{E}[\liminf_{n\to\infty} Z_n] \leq \liminf_{n\to\infty} \mathbb{E}[Z_n]$.

4. If $S$ is a discrete set, i.e. finite or countable, and $X : \Omega \to S$, we let $\rho_X(s) := P(X = s)$. More generally, if $X_i : \Omega \to S_i$ for $1 \leq i \leq n$, we let
$$\rho_{X_1,\dots,X_n}(s) := P(X_1 = s_1, \dots, X_n = s_n)$$
for all $s = (s_1, \dots, s_n) \in S_1 \times \dots \times S_n$.

5. If $S$ is $\mathbb{R}$ or $\mathbb{R}^n$ and $X : \Omega \to S$ is a continuous random variable, we let $\rho_X(x)$ be the probability density function of $X$, namely,
$$\mathbb{E}[f(X)] = \int_S f(x)\, \rho_X(x)\, dx.$$

6. Given random variables $X$ and $Y$ we let:

   a) $\mu_X := \mathbb{E}X$ be the mean of $X$.
   b) $\operatorname{Var}(X) := \mathbb{E}[(X - \mu_X)^2] = \mathbb{E}X^2 - \mu_X^2$ be the variance of $X$.
   c) $\sigma_X = \sigma(X) := \sqrt{\operatorname{Var}(X)}$ be the standard deviation of $X$.
   d) $\operatorname{Cov}(X, Y) := \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y$ be the covariance of $X$ and $Y$.
   e) $\operatorname{Corr}(X, Y) := \operatorname{Cov}(X, Y)/(\sigma_X \sigma_Y)$ be the correlation of $X$ and $Y$.

7. Tonelli's theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}_+$, then
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x, y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x, y) \quad (\text{with } \infty \text{ being allowed}).$$

8. Fubini's theorem: if $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}$ is a function such that
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, |f(x, y)| = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, |f(x, y)| < \infty,$$
then
$$\int_{\mathbb{R}^k} dx \int_{\mathbb{R}^l} dy\, f(x, y) = \int_{\mathbb{R}^l} dy \int_{\mathbb{R}^k} dx\, f(x, y).$$

Proposition 0.1. Suppose that $X$ is an $\mathbb{R}^k$-valued random variable, $Y$ is an $\mathbb{R}^l$-valued random variable independent of $X$, and $f : \mathbb{R}^k \times \mathbb{R}^l \to \mathbb{R}_+$. Then (assuming $X$ and $Y$ have continuous distributions),
$$\mathbb{E}[f(X, Y)] = \int_{\mathbb{R}^k} \mathbb{E}[f(x, Y)]\, \rho_X(x)\, dx$$

and similarly,
$$\mathbb{E}[f(X, Y)] = \int_{\mathbb{R}^l} \mathbb{E}[f(X, y)]\, \rho_Y(y)\, dy.$$

Proof. Independence implies that
$$\rho_{(X,Y)}(x, y) = \rho_X(x)\, \rho_Y(y).$$
Therefore,
$$\mathbb{E}[f(X, Y)] = \int_{\mathbb{R}^k \times \mathbb{R}^l} f(x, y)\, \rho_X(x)\, \rho_Y(y)\, dx\, dy = \int_{\mathbb{R}^k} \left[ \int_{\mathbb{R}^l} dy\, f(x, y)\, \rho_Y(y) \right] \rho_X(x)\, dx = \int_{\mathbb{R}^k} \mathbb{E}[f(x, Y)]\, \rho_X(x)\, dx.$$
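A quick Monte Carlo sanity check of Proposition 0.1 may help fix ideas. The distributions and the test function below are my own choice for illustration (they are not specified in the notes): $X \sim \operatorname{Exp}(1)$, $Y \sim N(0,1)$ independent, and $f(x, y) = x + y^2$, for which $\mathbb{E}[f(x, Y)] = x + 1$.

```python
# Sketch only: X ~ Exp(1), Y ~ N(0,1) independent, f(x, y) = x + y**2 are my
# illustrative assumptions.  Then E[f(x, Y)] = x + 1, so the right-hand side of
# Proposition 0.1 equals E[X] + 1 = 2, matching the direct estimate of E[f(X, Y)].
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
X = rng.exponential(scale=1.0, size=n)
Y = rng.normal(size=n)

lhs = np.mean(X + Y**2)   # E[f(X, Y)] estimated directly
rhs = np.mean(X + 1.0)    # integral of E[f(x, Y)] rho_X(x) dx, using E[f(x, Y)] = x + 1
print(lhs, rhs)           # both should be close to 2
```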

0.2 Some Discrete Distributions

Definition 0.2 (Generating Function). Suppose that $N : \Omega \to \mathbb{N}_0$ is an integer valued random variable on a probability space $(\Omega, \mathcal{B}, P)$. The generating function associated to $N$ is defined by
$$G_N(z) := \mathbb{E}[z^N] = \sum_{n=0}^\infty P(N = n)\, z^n \quad \text{for } |z| \leq 1. \tag{0.1}$$

By standard power series considerations, it follows that $P(N = n) = \frac{1}{n!} G_N^{(n)}(0)$, so that $G_N$ can be used to completely recover the distribution of $N$.

Proposition 0.3 (Generating Functions). The generating function satisfies
$$G_N^{(k)}(z) = \mathbb{E}\big[N(N-1)\cdots(N-k+1)\, z^{N-k}\big] \quad \text{for } |z| < 1$$
and
$$G^{(k)}(1) = \lim_{z \uparrow 1} G^{(k)}(z) = \mathbb{E}[N(N-1)\cdots(N-k+1)],$$
where it is possible that one and hence both sides of this equation are infinite. In particular, $G'(1) := \lim_{z \uparrow 1} G'(z) = \mathbb{E}N$ and, if $\mathbb{E}N^2 < \infty$,
$$\operatorname{Var}(N) = G''(1) + G'(1) - [G'(1)]^2. \tag{0.2}$$

Proof. By standard power series considerations, for $|z| < 1$,
$$G_N^{(k)}(z) = \sum_{n=0}^\infty P(N = n) \cdot n(n-1)\cdots(n-k+1)\, z^{n-k} = \mathbb{E}\big[N(N-1)\cdots(N-k+1)\, z^{N-k}\big]. \tag{0.3}$$
Since, for $z \in (0, 1)$,
$$0 \leq N(N-1)\cdots(N-k+1)\, z^{N-k} \uparrow N(N-1)\cdots(N-k+1) \text{ as } z \uparrow 1,$$
we may apply the MCT to pass to the limit as $z \uparrow 1$ in Eq. (0.3) to find
$$G^{(k)}(1) = \lim_{z \uparrow 1} G^{(k)}(z) = \mathbb{E}[N(N-1)\cdots(N-k+1)].$$

Exercise 0.1 (Some Discrete Distributions). Let $p \in (0, 1]$ and $\lambda > 0$. In the four parts below, the distribution of $N$ is described. You should work out the generating function $G_N(z)$ in each case and use it to verify the given formulas for $\mathbb{E}N$ and $\operatorname{Var}(N)$.

1. Bernoulli($p$): $P(N = 1) = p$ and $P(N = 0) = 1 - p$. You should find $\mathbb{E}N = p$ and $\operatorname{Var}(N) = p - p^2$.

2. Bin($n, p$): $P(N = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \dots, n$. ($P(N = k)$ is the probability of $k$ successes in a sequence of $n$ independent yes/no experiments with probability of success $p$.) You should find $\mathbb{E}N = np$ and $\operatorname{Var}(N) = n(p - p^2)$.

3. Geometric($p$): $P(N = k) = p(1-p)^{k-1}$ for $k \in \mathbb{N}$. ($P(N = k)$ is the probability that the $k$th trial is the first success in a sequence of independent trials with probability of success $p$.) You should find $\mathbb{E}N = 1/p$ and $\operatorname{Var}(N) = \frac{1-p}{p^2}$.

4. Poisson($\lambda$): $P(N = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ for all $k \in \mathbb{N}_0$. You should find $\mathbb{E}N = \lambda = \operatorname{Var}(N)$.

Solution to Exercise (0.1).

1. $G_N(z) = p z^1 + (1-p) z^0 = pz + 1 - p$. Therefore $G_N'(z) = p$ and $G_N''(z) = 0$, so that $\mathbb{E}N = p$ and $\operatorname{Var}(N) = 0 + p - p^2$.

2. $G_N(z) = \sum_{k=0}^n z^k \binom{n}{k} p^k (1-p)^{n-k} = (pz + (1-p))^n$. Therefore,
$$G_N'(z) = n (pz + (1-p))^{n-1} p, \qquad G_N''(z) = n(n-1)(pz + (1-p))^{n-2} p^2,$$
and
$$\mathbb{E}N = np \quad \text{and} \quad \operatorname{Var}(N) = n(n-1)p^2 + np - (np)^2 = n(p - p^2).$$



3. For the geometric distribution,
$$G_N(z) = \mathbb{E}[z^N] = \sum_{k=1}^\infty z^k p (1-p)^{k-1} = \frac{zp}{1 - z(1-p)} \quad \text{for } |z| < (1-p)^{-1}.$$
Differentiating this equation in $z$ implies
$$\mathbb{E}[N z^{N-1}] = G_N'(z) = \frac{p[1 - z(1-p)] + (1-p)pz}{(1 - z(1-p))^2} = \frac{p}{(1 - z(1-p))^2}$$
and
$$\mathbb{E}[N(N-1) z^{N-2}] = G_N''(z) = \frac{2(1-p)p}{(1 - z(1-p))^3}.$$
Therefore $\mathbb{E}N = G_N'(1) = 1/p$,
$$\mathbb{E}[N(N-1)] = \frac{2(1-p)p}{p^3} = \frac{2(1-p)}{p^2},$$
and
$$\operatorname{Var}(N) = \frac{2(1-p)}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{1}{p^2} - \frac{1}{p} = \frac{1-p}{p^2}.$$

Alternative method. Starting with $\sum_{n=0}^\infty z^n = \frac{1}{1-z}$ for $|z| < 1$ we learn that
$$\frac{1}{(1-z)^2} = \frac{d}{dz} \frac{1}{1-z} = \sum_{n=0}^\infty n z^{n-1} = \sum_{n=1}^\infty n z^{n-1}$$
and
$$\sum_{n=0}^\infty n^2 z^{n-1} = \frac{d}{dz} \frac{z}{(1-z)^2} = \frac{(1-z)^2 + 2z(1-z)}{(1-z)^4} = \frac{1+z}{(1-z)^3}.$$
Taking $z = 1 - p$ in these formulas shows
$$\mathbb{E}N = p \sum_{n=1}^\infty n (1-p)^{n-1} = p \cdot \frac{1}{p^2} = \frac{1}{p}$$
and
$$\mathbb{E}N^2 = p \sum_{n=1}^\infty n^2 (1-p)^{n-1} = p \cdot \frac{2-p}{p^3} = \frac{2-p}{p^2},$$
and therefore,
$$\operatorname{Var}(N) = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}.$$

4. In the Poisson case,
$$G_N(z) = \mathbb{E}[z^N] = \sum_{k=0}^\infty z^k \frac{\lambda^k}{k!} e^{-\lambda} = e^{-\lambda} e^{\lambda z} = e^{\lambda(z-1)}$$
and $G_N^{(k)}(z) = \lambda^k e^{\lambda(z-1)}$. Therefore $\mathbb{E}N = \lambda$ and $\mathbb{E}[N(N-1)] = \lambda^2$, so that $\operatorname{Var}(N) = \lambda^2 + \lambda - \lambda^2 = \lambda$.
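The identity $\operatorname{Var}(N) = G''(1) + G'(1) - [G'(1)]^2$ from Eq. (0.2) is easy to check numerically. The sketch below, with parameters of my own choosing, computes the factorial moments $G'(1) = \mathbb{E}N$ and $G''(1) = \mathbb{E}[N(N-1)]$ directly from the pmfs of a binomial and a (truncated) Poisson distribution.

```python
# A small numeric check of Exercise 0.1 / Eq. (0.2); the parameters n, p, lambda
# below are my own illustrative choices.
from math import comb, exp, factorial

def check(pmf_pairs):
    g1 = sum(k * p for k, p in pmf_pairs)            # G'(1)  = E[N]
    g2 = sum(k * (k - 1) * p for k, p in pmf_pairs)  # G''(1) = E[N(N-1)]
    return g1, g2 + g1 - g1**2                       # (E[N], Var(N) via Eq. (0.2))

n, p, lam = 10, 0.3, 4.0
binom = [(k, comb(n, k) * p**k * (1 - p)**(n - k)) for k in range(n + 1)]
poisson = [(k, lam**k * exp(-lam) / factorial(k)) for k in range(200)]  # truncation error is negligible

print(check(binom))    # expect (np, n(p - p^2)) = (3.0, 2.1)
print(check(poisson))  # expect (lambda, lambda) = (4.0, 4.0)
```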

Remark 0.4 (Memoryless property of the geometric distribution). Suppose that the $X_i$ are i.i.d. Bernoulli random variables with $P(X_i = 1) = p$, $P(X_i = 0) = 1 - p$, and $N = \inf\{i \geq 1 : X_i = 1\}$. Then
$$P(N = k) = P(X_1 = 0, \dots, X_{k-1} = 0, X_k = 1) = (1-p)^{k-1} p,$$
so that $N$ is geometric with parameter $p$. Using this representation we easily and intuitively see that
$$P(N = n + k \mid N > n) = \frac{P(X_1 = 0, \dots, X_{n+k-1} = 0, X_{n+k} = 1)}{P(X_1 = 0, \dots, X_n = 0)} = P(X_{n+1} = 0, \dots, X_{n+k-1} = 0, X_{n+k} = 1) = P(X_1 = 0, \dots, X_{k-1} = 0, X_k = 1) = P(N = k).$$
This can be verified from first principles as well:
$$P(N = n + k \mid N > n) = \frac{P(N = n + k)}{P(N > n)} = \frac{p(1-p)^{n+k-1}}{\sum_{j > n} p(1-p)^{j-1}} = \frac{p(1-p)^{n+k-1}}{\sum_{j=0}^\infty p(1-p)^{n+j}} = \frac{(1-p)^{n+k-1}}{(1-p)^n \sum_{j=0}^\infty (1-p)^j} = \frac{(1-p)^{k-1}}{\frac{1}{1-(1-p)}} = p(1-p)^{k-1} = P(N = k).$$
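A short simulation makes the memoryless property concrete. The particular values of $p$, $n$, and $k$ below are my own choices for illustration.

```python
# Simulation sketch of Remark 0.4 (parameters chosen by me): compare the
# conditional frequency P(N = n + k | N > n) with the unconditional P(N = k).
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 0.3, 4, 2
N = rng.geometric(p, size=10**6)        # numpy's geometric is supported on {1, 2, ...}

lhs = np.mean(N[N > n] == n + k)        # conditional relative frequency
rhs = np.mean(N == k)                   # unconditional relative frequency
print(lhs, rhs, p * (1 - p)**(k - 1))   # all three should nearly agree
```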

Exercise 0.2. Let $S_{n,p} \stackrel{d}{=} \operatorname{Bin}(n, p)$, $k \in \mathbb{N}$, and $p_n = \lambda_n/n$ where $\lambda_n \to \lambda > 0$ as $n \to \infty$. Show that
$$\lim_{n\to\infty} P(S_{n,p_n} = k) = \frac{\lambda^k}{k!} e^{-\lambda} = P(\operatorname{Poi}(\lambda) = k).$$
Thus we see that for $p = O(1/n)$ and $k$ not too large relative to $n$, we have for large $n$,
$$P(\operatorname{Bin}(n, p) = k) \cong P(\operatorname{Poi}(pn) = k) = \frac{(pn)^k}{k!} e^{-pn}.$$
(We will come back to the Poisson distribution and the related Poisson process later on.)


Solution to Exercise (0.2). We have
$$P(S_{n,p_n} = k) = \binom{n}{k} (\lambda_n/n)^k (1 - \lambda_n/n)^{n-k} = \frac{\lambda_n^k}{k!} \cdot \frac{n(n-1)\cdots(n-k+1)}{n^k} (1 - \lambda_n/n)^{n-k}.$$
The result now follows since
$$\lim_{n\to\infty} \frac{n(n-1)\cdots(n-k+1)}{n^k} = 1$$
and
$$\lim_{n\to\infty} \ln (1 - \lambda_n/n)^{n-k} = \lim_{n\to\infty} (n-k) \ln(1 - \lambda_n/n) = -\lim_{n\to\infty} [(n-k)\lambda_n/n] = -\lambda.$$
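The convergence in Exercise 0.2 is already visible for modest $n$. The values of $\lambda$ and $k$ below are my own illustrative choices.

```python
# Numeric illustration of Exercise 0.2: for fixed k, P(Bin(n, lambda/n) = k)
# approaches the Poisson(lambda) pmf as n grows.  (lambda and k chosen by me.)
from math import comb, exp, factorial

lam, k = 2.5, 3
poisson_pmf = lam**k * exp(-lam) / factorial(k)
for n in (10, 100, 1000, 10000):
    p = lam / n
    binom_pmf = comb(n, k) * p**k * (1 - p)**(n - k)
    print(n, binom_pmf, poisson_pmf)
```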

0.3 A Stirling's Formula Like Approximation

Theorem 0.5. Suppose that $f : (0, \infty) \to \mathbb{R}$ is an increasing, concave down function (like $f(x) = \ln x$) and let $s_n := \sum_{k=1}^n f(k)$. Then
$$s_n - \tfrac{1}{2}(f(n) + f(1)) \leq \int_1^n f(x)\, dx \leq s_n - \tfrac{1}{2}[f(n+1) + 2f(1)] + \tfrac{1}{2} f(2) \leq s_n - \tfrac{1}{2}[f(n) + 2f(1)] + \tfrac{1}{2} f(2).$$

Proof. On the interval $[k-1, k]$, $f(x)$ is larger than the straight line segment joining $(k-1, f(k-1))$ and $(k, f(k))$, and thus
$$\frac{1}{2}(f(k) + f(k-1)) \leq \int_{k-1}^k f(x)\, dx.$$
Summing this inequality on $k = 2, \dots, n$ shows
$$s_n - \frac{1}{2}(f(n) + f(1)) = \sum_{k=2}^n \frac{1}{2}(f(k) + f(k-1)) \leq \sum_{k=2}^n \int_{k-1}^k f(x)\, dx = \int_1^n f(x)\, dx.$$
For the upper bound on the integral we observe that $f(x) \leq f(k) - f'(k)(x - k)$ for all $x$ and therefore
$$\int_{k-1}^k f(x)\, dx \leq \int_{k-1}^k [f(k) - f'(k)(x - k)]\, dx = f(k) - \frac{1}{2} f'(k).$$
Summing this inequality on $k = 2, \dots, n$ then implies
$$\int_1^n f(x)\, dx \leq \sum_{k=2}^n f(k) - \frac{1}{2} \sum_{k=2}^n f'(k).$$
Since $f''(x) \leq 0$, $f'(x)$ is decreasing, and therefore $f'(x) \leq f'(k-1)$ for $x \in [k-1, k]$; integrating this inequality over $[k-1, k]$ gives
$$f(k) - f(k-1) \leq f'(k-1).$$
Summing the result on $k = 3, \dots, n+1$ then shows
$$f(n+1) - f(2) \leq \sum_{k=2}^n f'(k),$$
and thus it follows that
$$\int_1^n f(x)\, dx \leq \sum_{k=2}^n f(k) - \frac{1}{2}(f(n+1) - f(2)) = s_n - \frac{1}{2}[f(n+1) + 2f(1)] + \frac{1}{2} f(2) \leq s_n - \frac{1}{2}[f(n) + 2f(1)] + \frac{1}{2} f(2).$$


Example 0.6 (Approximating $n!$). Let us take $f(n) = \ln n$ and recall that
$$\int_1^n \ln x\, dx = n \ln n - n + 1.$$
Thus we may conclude that
$$s_n - \frac{1}{2} \ln n \leq n \ln n - n + 1 \leq s_n - \frac{1}{2} \ln n + \frac{1}{2} \ln 2.$$
Thus it follows that
$$\left(n + \frac{1}{2}\right) \ln n - n + 1 - \ln \sqrt{2} \leq s_n \leq \left(n + \frac{1}{2}\right) \ln n - n + 1.$$
Exponentiating these inequalities (note that $s_n = \ln n!$) then gives the following upper and lower bounds on $n!$:
$$\frac{e}{\sqrt{2}} \cdot e^{-n} n^{n+1/2} \leq n! \leq e \cdot e^{-n} n^{n+1/2}.$$
These bounds compare well with Stirling's formula (Theorem 0.9), which implies
$$n! \sim \sqrt{2\pi}\, e^{-n} n^{n+1/2}, \quad \text{i.e. by definition} \quad \lim_{n\to\infty} \frac{n!}{e^{-n} n^{n+1/2}} = \sqrt{2\pi}.$$
Observe that
$$\frac{e}{\sqrt{2}} \cong 1.9221 \leq \sqrt{2\pi} \cong 2.5066 \leq e \cong 2.7183.$$
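The bounds of Example 0.6 and the Stirling limit can be inspected numerically; the short sketch below (values of $n$ chosen by me) works in log space to avoid overflow.

```python
# Quick numeric look at Example 0.6: the ratio n! / (e**(-n) * n**(n + 1/2))
# stays between e/sqrt(2) and e and tends to sqrt(2*pi).
from math import e, exp, lgamma, log, pi, sqrt

for n in (5, 10, 50, 100):
    log_core = -n + (n + 0.5) * log(n)      # log of e^{-n} n^{n+1/2}
    ratio = exp(lgamma(n + 1) - log_core)   # n! / (e^{-n} n^{n+1/2}), since lgamma(n+1) = log(n!)
    print(n, e / sqrt(2), ratio, e, sqrt(2 * pi))
```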

Definition 0.7 (Gamma Function). The Gamma function, $\Gamma : \mathbb{R}_+ \to \mathbb{R}_+$, is defined by
$$\Gamma(x) := \int_0^\infty u^{x-1} e^{-u}\, du. \tag{0.4}$$
(The reader should check that $\Gamma(x) < \infty$ for all $x > 0$.)

Here are some of the more basic properties of this function.

Example 0.8 ($\Gamma$-function properties). Let $\Gamma$ be the gamma function; then:

1. $\Gamma(1) = 1$, as is easily verified.

2. $\Gamma(x+1) = x\Gamma(x)$ for all $x > 0$, as follows by integration by parts:
$$\Gamma(x+1) = \int_0^\infty e^{-u} u^{x+1} \frac{du}{u} = \int_0^\infty u^x \left(-\frac{d}{du} e^{-u}\right) du = x \int_0^\infty u^{x-1} e^{-u}\, du = x\, \Gamma(x).$$
In particular, it follows from items 1. and 2. and induction that
$$\Gamma(n+1) = n! \quad \text{for all } n \in \mathbb{N}. \tag{0.5}$$

3. $\Gamma(1/2) = \sqrt{\pi}$. This last assertion is a bit trickier. One proof is to make use of the fact (proved in a later lemma) that
$$\int_{-\infty}^\infty e^{-a r^2}\, dr = \sqrt{\frac{\pi}{a}} \quad \text{for all } a > 0. \tag{0.6}$$
Taking $a = 1$ and making the change of variables $u = r^2$ implies
$$\sqrt{\pi} = \int_{-\infty}^\infty e^{-r^2}\, dr = 2 \int_0^\infty e^{-r^2}\, dr = \int_0^\infty u^{-1/2} e^{-u}\, du = \Gamma(1/2).$$

4. A simple induction argument using items 2. and 3. now shows that
$$\Gamma\left(n + \frac{1}{2}\right) = \frac{(2n-1)!!}{2^n} \sqrt{\pi},$$
where $(-1)!! := 1$ and $(2n-1)!! = (2n-1)(2n-3)\cdots 3 \cdot 1$ for $n \in \mathbb{N}$.
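The identities (0.5) and the half-integer formula in item 4 are easy to spot check numerically; the sketch below is my own quick illustration using the standard library's Gamma function.

```python
# Spot check of Example 0.8: Gamma(n+1) = n! and
# Gamma(n + 1/2) = (2n-1)!! / 2**n * sqrt(pi).
from math import factorial, gamma, pi, sqrt

def double_factorial_odd(n):
    # (2n-1)!! = (2n-1)(2n-3)...3*1, with the convention (-1)!! := 1 for n = 0
    out = 1
    for j in range(1, 2 * n, 2):
        out *= j
    return out

for n in range(6):
    print(n, gamma(n + 1), factorial(n),
          gamma(n + 0.5), double_factorial_odd(n) / 2**n * sqrt(pi))
```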

Theorem 0.9 (Stirling's formula). The Gamma function (see Definition 0.7) satisfies Stirling's formula,
$$\lim_{x\to\infty} \frac{\Gamma(x+1)}{\sqrt{2\pi}\, e^{-x} x^{x+1/2}} = 1. \tag{0.7}$$
In particular, if $n \in \mathbb{N}$, we have
$$n! = \Gamma(n+1) \sim \sqrt{2\pi}\, e^{-n} n^{n+1/2},$$
where we write $a_n \sim b_n$ to mean $\lim_{n\to\infty} \frac{a_n}{b_n} = 1$.


1 Course Overview and Plan

This course is an introduction to some basic topics in the theory of stochastic processes. After finishing the discussion of multivariate distributions and conditional probabilities initiated in Math 180A, we will study Markov chains in discrete time. We then begin our investigation of stochastic processes in continuous time with a detailed discussion of the Poisson process. These two topics will be combined in Math 180C when we study Markov chains in continuous time and renewal processes.

In the next two quarters we will study some aspects of stochastic processes. Stochastic (from the Greek στόχος, for aim or guess) means random. A stochastic process is one whose behavior is non-deterministic, in that a system's subsequent state is determined both by the process's predictable actions and by a random element. However, according to M. Kac¹ and E. Nelson², any kind of time development (be it deterministic or essentially probabilistic) which is analyzable in terms of probability deserves the name of stochastic process.

Mathematically we will be interested in a collection of random variables or vectors $\{X_t\}_{t \in T}$ with $X_t : \Omega \to S$ ($S$ is the state space) on some probability space $(\Omega, P)$. Here $T$ is typically $\mathbb{R}_+$ or $\mathbb{Z}_+$, but not always.

Example 1.1.
1. $X_t$ is the value of a spinner at times $t \in \mathbb{Z}_+$.
2. $X_t$ denotes the prices of a stock (or stocks) on the stock market.
3. $X_t$ denotes the value of your portfolio at time $t$.
4. $X_t$ is the position of a dust particle, as in Brownian motion.
5. $X_A$ is the number of stars in a region $A$ of space, or the number of raisins in a region of a cake, etc.
6. $X_n \in S = \operatorname{Perm}(\{1, \dots, 52\})$ is the ordering of cards in a deck of cards after the $n$th shuffle.

Our goal in this course is to introduce and analyze models for such random objects. This is clearly going to require that we make assumptions on $\{X_t\}$, which will typically be some sort of dependency structure. This is where we will begin our study, namely heading towards conditional expectations and related topics.

¹ M. Kac & J. Logan, in Fluctuation Phenomena, eds. E. W. Montroll & J. L. Lebowitz, North-Holland, Amsterdam, 1976.
² E. Nelson, Quantum Fluctuations, Princeton University Press, Princeton, 1985.

1.1 180B Course Topics:

1. Review the linear algebra of orthogonal projections, in the context of least squares approximations, in the context of Probability Theory.
2. Use the least squares theory to interpret covariance and correlations.
3. Review of conditional probabilities for discrete random variables.
4. Introduce conditional expectations as least squares approximations.
5. Develop conditional expectation relative to discrete random variables.
6. Give a short introduction to martingale theory. (Not done!)
7. Study in some detail discrete time Markov chains.
8. Review of conditional probability densities for continuous random variables.
9. Develop conditional expectations relative to continuous random variables.
10. Begin our study of the Poisson process. (Started in 180C.)

The bulk of this quarter will involve the study of Markov chains and processes. These are processes for which the past and future are independent given the present. This is a typical example of a dependency structure that we will consider in this course. For an example of such a process, let $S = \mathbb{Z}$ and place a coin at each site of $S$ (perhaps the coins are biased with different probabilities of heads at each site of $S$). Let $X_0 = s_0$ be some fixed point in $S$, then flip the coin at $s_0$ and move to the right one step if the result is heads and to the left one step if the result is tails. Repeat this process to determine the position $X_{n+1}$ from the position $X_n$ along with a flip of the coin at $X_n$. This is a typical example of a Markov process.

Before going into these and other processes in more detail we are going to develop the extremely important concept of conditional expectation. The idea is as follows. Suppose that $X$ and $Y$ are two random variables with $\mathbb{E}|Y|^2 < \infty$. We wish to find the function $h$ such that $h(X)$ is the minimizer of $\mathbb{E}(Y - f(X))^2$ over all functions $f$ such that $\mathbb{E}[f(X)^2] < \infty$; that is, $h(X)$ is a least squares approximation to $Y$ among random variables of the form $f(X)$, i.e.
$$\mathbb{E}(Y - h(X))^2 = \min_f \mathbb{E}(Y - f(X))^2. \tag{1.1}$$

Fact: a minimizing function $h$ always exists and is "essentially unique." We denote $h(X)$ by $\mathbb{E}[Y|X]$ and call it the conditional expectation of $Y$ given


$X$. We are going to spend a fair amount of time filling in the details of this construction and becoming familiar with this concept.

As a warm up to conditional expectation, we are going to consider the simpler problem of best linear approximations. The goal now is to find $a_0, b_0 \in \mathbb{R}$ such that
$$\mathbb{E}\big(Y - (a_0 X + b_0)\big)^2 = \min_{a, b \in \mathbb{R}} \mathbb{E}\big(Y - (aX + b)\big)^2. \tag{1.2}$$
This is the same sort of problem as finding conditional expectations, except we now only allow functions of the form $f(x) = ax + b$. (You should be able to find $a_0$ and $b_0$ using the first derivative test from calculus! We will carry this out using linear algebra ideas below; see also the numerical sketch that follows.) It turns out that finding $(a_0, b_0)$ solving Eq. (1.2) only requires knowing the first and second moments of $X$ and $Y$ and $\mathbb{E}[XY]$. On the other hand, finding $h(X)$ solving Eq. (1.1) requires full knowledge of the joint distribution of $(X, Y)$.

By the way, you are asked to show on your first homework that $\min_{c \in \mathbb{R}} \mathbb{E}(Y - c)^2 = \operatorname{Var}(Y)$, which occurs for $c = \mathbb{E}Y$. Thus $\mathbb{E}Y$ is the least squares approximation to $Y$ by a constant function and $\operatorname{Var}(Y)$ is the least squares error associated with this problem.
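Here is the numerical sketch referred to above. The joint distribution is a toy choice of my own ($Y = 2X + 1$ plus noise); the point is only that $(a_0, b_0)$ is determined by first and second moments and $\mathbb{E}[XY]$.

```python
# Minimal sketch of the best linear approximation problem in Eq. (1.2), under a
# toy joint distribution of my own choosing.  Setting the a- and b-derivatives of
# E(Y - aX - b)^2 to zero gives a0 = Cov(X, Y)/Var(X) and b0 = EY - a0*EX.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=10**5)
Y = 2.0 * X + 1.0 + rng.normal(size=10**5)

EX, EY, EX2, EXY = X.mean(), Y.mean(), (X**2).mean(), (X * Y).mean()
a0 = (EXY - EX * EY) / (EX2 - EX**2)   # Cov(X, Y) / Var(X)
b0 = EY - a0 * EX
print(a0, b0)                          # should be close to (2, 1)
```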


2 Covariance and Correlation

Suppose that $(\Omega, P)$ is a probability space. We say that $X : \Omega \to \mathbb{R}$ is integrable if $\mathbb{E}|X| < \infty$ and square integrable if $\mathbb{E}|X|^2 < \infty$. We denote the set of integrable random variables by $L^1(P)$ and the square integrable random variables by $L^2(P)$. When $X$ is integrable we let $\mu_X := \mathbb{E}X$ be the mean of $X$. If $\Omega$ is a finite set, then
$$\mathbb{E}[|X|^p] = \sum_{\omega \in \Omega} |X(\omega)|^p\, P(\{\omega\}) < \infty$$
for any $0 < p < \infty$. So when the sample space is finite, requiring integrability or square integrability is no restriction at all. On the other hand, when $\Omega$ is infinite life can become a little more complicated.

Example 2.1. Suppose that $N$ is geometric with parameter $p$, so that $P(N = k) = p(1-p)^{k-1}$ for $k \in \mathbb{N} = \{1, 2, 3, \dots\}$. If $X = f(N)$ for some function $f : \mathbb{N} \to \mathbb{R}$, then
$$\mathbb{E}[f(N)] = \sum_{k=1}^\infty p(1-p)^{k-1} f(k)$$
when the sum makes sense. So if $X_\lambda = \lambda^N$ for some $\lambda > 0$ we have
$$\mathbb{E}[X_\lambda^2] = \sum_{k=1}^\infty p(1-p)^{k-1} \lambda^{2k} = p\lambda^2 \sum_{k=1}^\infty \big[(1-p)\lambda^2\big]^{k-1} < \infty$$
iff $(1-p)\lambda^2 < 1$, i.e. $\lambda < 1/\sqrt{1-p}$. Thus we see that $X_\lambda \in L^2(P)$ iff $\lambda < 1/\sqrt{1-p}$.

Lemma 2.2. $L^2(P)$ is a subspace of the vector space of random variables on $(\Omega, P)$. Moreover, if $X, Y \in L^2(P)$, then $XY \in L^1(P)$ and in particular (take $Y = 1$) it follows that $L^2(P) \subset L^1(P)$.

Proof. If $X, Y \in L^2(P)$ and $c \in \mathbb{R}$, then $\mathbb{E}|cX|^2 = c^2 \mathbb{E}|X|^2 < \infty$, so that $cX \in L^2(P)$. Since
$$0 \leq (|X| - |Y|)^2 = |X|^2 + |Y|^2 - 2|X||Y|,$$
it follows that
$$|XY| \leq \frac{1}{2}|X|^2 + \frac{1}{2}|Y|^2 \in L^1(P).$$
Moreover,
$$(X + Y)^2 = X^2 + Y^2 + 2XY \leq X^2 + Y^2 + 2|XY| \leq 2(X^2 + Y^2),$$
from which it follows that $\mathbb{E}(X + Y)^2 < \infty$, i.e. $X + Y \in L^2(P)$.

Definition 2.3. The covariance, $\operatorname{Cov}(X, Y)$, of two square integrable random variables $X$ and $Y$ is defined by
$$\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}X \cdot \mathbb{E}Y,$$
where $\mu_X := \mathbb{E}X$ and $\mu_Y := \mathbb{E}Y$. The variance of $X$ is
$$\operatorname{Var}(X) = \operatorname{Cov}(X, X) = \mathbb{E}[X^2] - (\mathbb{E}X)^2 \tag{2.1}$$
$$= \mathbb{E}[(X - \mu_X)^2]. \tag{2.2}$$
We say that $X$ and $Y$ are uncorrelated if $\operatorname{Cov}(X, Y) = 0$, i.e. $\mathbb{E}[XY] = \mathbb{E}X \cdot \mathbb{E}Y$. More generally we say $\{X_k\}_{k=1}^n \subset L^2(P)$ are uncorrelated iff $\operatorname{Cov}(X_i, X_j) = 0$ for all $i \neq j$.

Definition 2.4 (Correlation). Given two non-constant random variables we define $\operatorname{Corr}(X, Y) := \frac{\operatorname{Cov}(X, Y)}{\sigma(X) \cdot \sigma(Y)}$ to be the correlation of $X$ and $Y$.

It follows from Eqs. (2.1) and (2.2) that
$$0 \leq \operatorname{Var}(X) \leq \mathbb{E}[X^2] \quad \text{for all } X \in L^2(P). \tag{2.3}$$

Exercise 2.1. Let $X, Y$ be two random variables on $(\Omega, \mathcal{B}, P)$:

1. Show that $X$ and $Y$ are independent iff $\operatorname{Cov}(f(X), g(Y)) = 0$ (i.e. $f(X)$ and $g(Y)$ are uncorrelated) for all bounded measurable functions $f, g : \mathbb{R} \to \mathbb{R}$. (In this setting $X$ and $Y$ may take values in some arbitrary state space $S$.)

2. If $X, Y \in L^2(P)$ and $X$ and $Y$ are independent, then $\operatorname{Cov}(X, Y) = 0$. Note well: we will see in examples below that $\operatorname{Cov}(X, Y) = 0$ does not necessarily imply that $X$ and $Y$ are independent.


Solution to Exercise (2.1). (Only roughly sketched the proof of this in class.)

1. Since
$$\operatorname{Cov}(f(X), g(Y)) = \mathbb{E}[f(X) g(Y)] - \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)],$$
it follows that $\operatorname{Cov}(f(X), g(Y)) = 0$ iff
$$\mathbb{E}[f(X) g(Y)] = \mathbb{E}[f(X)]\,\mathbb{E}[g(Y)],$$
from which item 1. easily follows.

2. Let $f_M(x) = x 1_{|x| \leq M}$ and $g_M(y) = y 1_{|y| \leq M}$; then by independence,
$$\mathbb{E}[f_M(X) g_M(Y)] = \mathbb{E}[f_M(X)]\,\mathbb{E}[g_M(Y)]. \tag{2.4}$$
Since
$$|f_M(X) g_M(Y)| \leq |XY| \leq \tfrac{1}{2}(X^2 + Y^2) \in L^1(P),$$
$$|f_M(X)| \leq |X| \leq \tfrac{1}{2}(1 + X^2) \in L^1(P), \quad \text{and} \quad |g_M(Y)| \leq |Y| \leq \tfrac{1}{2}(1 + Y^2) \in L^1(P),$$
we may use the DCT three times to pass to the limit as $M \to \infty$ in Eq. (2.4) to learn that $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$, i.e. $\operatorname{Cov}(X, Y) = 0$. (These technical details were omitted in class.)

End of 1/3/2011 Lecture.

Example 2.5. Suppose that $P(X \in dx, Y \in dy) = e^{-y} 1_{0 < x < y}\, dx\, dy$. Recall that
$$\int_0^\infty y^k e^{-\lambda y}\, dy = \left(-\frac{d}{d\lambda}\right)^k \int_0^\infty e^{-\lambda y}\, dy = \left(-\frac{d}{d\lambda}\right)^k \frac{1}{\lambda} = k!\, \frac{1}{\lambda^{k+1}}.$$
Therefore,
$$\mathbb{E}Y = \int\!\!\int y\, e^{-y} 1_{0 < x < y}\, dx\, dy = \int_0^\infty y^2 e^{-y}\, dy = 2,$$
$$\mathbb{E}Y^2 = \int\!\!\int y^2 e^{-y} 1_{0 < x < y}\, dx\, dy = \int_0^\infty y^3 e^{-y}\, dy = 3! = 6,$$
$$\mathbb{E}X = \int\!\!\int x\, e^{-y} 1_{0 < x < y}\, dx\, dy = \frac{1}{2} \int_0^\infty y^2 e^{-y}\, dy = 1,$$
$$\mathbb{E}X^2 = \int\!\!\int x^2 e^{-y} 1_{0 < x < y}\, dx\, dy = \frac{1}{3} \int_0^\infty y^3 e^{-y}\, dy = \frac{1}{3}\, 3! = 2,$$
and
$$\mathbb{E}[XY] = \int\!\!\int x y\, e^{-y} 1_{0 < x < y}\, dx\, dy = \frac{1}{2} \int_0^\infty y^3 e^{-y}\, dy = \frac{3!}{2} = 3.$$
Therefore $\operatorname{Cov}(X, Y) = 3 - 2 \cdot 1 = 1$, $\sigma^2(X) = 2 - 1^2 = 1$, $\sigma^2(Y) = 6 - 2^2 = 2$, and
$$\operatorname{Corr}(X, Y) = \frac{1}{\sqrt{2}}.$$
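These moments can be checked by simulation. The sampling scheme below is my own (it is not given in the notes): under the joint density $e^{-y} 1_{0<x<y}$, $Y$ has marginal density $y e^{-y}$ (a Gamma(2,1) density) and, given $Y$, $X$ is uniform on $(0, Y)$.

```python
# Monte Carlo sketch of Example 2.5: check EX = 1, EY = 2, Cov(X, Y) = 1 and
# Corr(X, Y) = 1/sqrt(2) by sampling Y ~ Gamma(2, 1) and X | Y ~ Uniform(0, Y).
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
Y = rng.gamma(shape=2.0, scale=1.0, size=n)   # marginal density y * e^{-y}
X = rng.uniform(0.0, 1.0, size=n) * Y         # X | Y ~ Uniform(0, Y)

cov = np.mean(X * Y) - X.mean() * Y.mean()
corr = cov / (X.std() * Y.std())
print(X.mean(), Y.mean(), cov, corr)          # expect about 1, 2, 1, 0.707
```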

Lemma 2.6. The covariance function $\operatorname{Cov}(X, Y)$ is bilinear in $X$ and $Y$, and $\operatorname{Cov}(X, Y) = 0$ if either $X$ or $Y$ is constant. For any constant $k$, $\operatorname{Var}(X + k) = \operatorname{Var}(X)$ and $\operatorname{Var}(kX) = k^2 \operatorname{Var}(X)$. If $\{X_k\}_{k=1}^n$ are uncorrelated $L^2(P)$ random variables and $S_n := \sum_{k=1}^n X_k$, then
$$\operatorname{Var}(S_n) = \sum_{k=1}^n \operatorname{Var}(X_k).$$

Proof. We leave most of this simple proof to the reader. As an example of the type of argument involved, let us prove $\operatorname{Var}(X + k) = \operatorname{Var}(X)$:
$$\operatorname{Var}(X + k) = \operatorname{Cov}(X + k, X + k) = \operatorname{Cov}(X + k, X) + \operatorname{Cov}(X + k, k) = \operatorname{Cov}(X + k, X) = \operatorname{Cov}(X, X) + \operatorname{Cov}(k, X) = \operatorname{Cov}(X, X) = \operatorname{Var}(X),$$
wherein we have used the bilinearity of $\operatorname{Cov}(\cdot, \cdot)$ and the property that $\operatorname{Cov}(Y, k) = 0$ whenever $k$ is a constant.


Example 2.7. Suppose that $X$ and $Y$ are distributed as follows:

             Y = -1   Y = 0   Y = 1   ρ_X
    X = 1      0       1/4      0     1/4
    X = 0     1/4      1/4     1/4    3/4
    ρ_Y       1/4      1/2     1/4

so that $\rho_{X,Y}(1, -1) = P(X = 1, Y = -1) = 0$, $\rho_{X,Y}(1, 0) = P(X = 1, Y = 0) = 1/4$, etc. In this case $XY = 0$ a.s., so that $\mathbb{E}[XY] = 0$, while
$$\mathbb{E}[X] = 1 \cdot \frac{1}{4} + 0 \cdot \frac{3}{4} = \frac{1}{4} \quad \text{and} \quad \mathbb{E}Y = (-1)\,\frac{1}{4} + 0 \cdot \frac{1}{2} + 1 \cdot \frac{1}{4} = 0,$$
so that $\operatorname{Cov}(X, Y) = 0 - \frac{1}{4} \cdot 0 = 0$. Again $X$ and $Y$ are not independent, since
$$\rho_{X,Y}(x, y) \neq \rho_X(x)\, \rho_Y(y).$$
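For concreteness, the same conclusion can be read off the table directly; the short sketch below is my own encoding of the table in Example 2.7.

```python
# Direct check of Example 2.7 from its probability table: Cov(X, Y) = 0 even
# though the joint pmf is not the product of the marginals.
import numpy as np

xs = np.array([1.0, 0.0])                 # values of X (rows)
ys = np.array([-1.0, 0.0, 1.0])           # values of Y (columns)
joint = np.array([[0.0, 0.25, 0.0],
                  [0.25, 0.25, 0.25]])    # P(X = x, Y = y)

rho_X, rho_Y = joint.sum(axis=1), joint.sum(axis=0)
EX, EY = xs @ rho_X, ys @ rho_Y
EXY = xs @ joint @ ys
print(EXY - EX * EY)                                # covariance = 0
print(np.allclose(joint, np.outer(rho_X, rho_Y)))   # False: X and Y are not independent
```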

Example 2.8. Let $X$ have an even distribution (i.e. a density $\rho$ with $\rho(-x) = \rho(x)$) and let $Y = X^2$. Then
$$\operatorname{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X^2] \cdot \mathbb{E}X = 0,$$
since
$$\mathbb{E}[X^{2k+1}] = \int_{-\infty}^\infty x^{2k+1} \rho(x)\, dx = 0 \quad \text{for all } k \in \mathbb{N}.$$
On the other hand, $\operatorname{Cov}(Y, X^2) = \operatorname{Cov}(Y, Y) = \operatorname{Var}(Y) \neq 0$ in general, so that $Y$ is not independent of $X$.

Example 2.9 (Not done in class). Let $X$ and $Z$ be independent with $P(Z = \pm 1) = \frac{1}{2}$ and take $Y = XZ$. Then $\mathbb{E}Z = 0$ and
$$\operatorname{Cov}(X, Y) = \mathbb{E}[X^2 Z] - \mathbb{E}[X]\,\mathbb{E}[XZ] = \mathbb{E}[X^2] \cdot \mathbb{E}Z - \mathbb{E}[X]\,\mathbb{E}[X]\,\mathbb{E}Z = 0.$$
On the other hand it should be intuitively clear that $X$ and $Y$ are not independent, since knowledge of $X$ typically will give some information about $Y$. To verify this assertion, let us suppose that $X$ is a discrete random variable with $P(X = 0) = 0$. Then
$$P(X = x, Y = y) = P(X = x, xZ = y) = P(X = x) \cdot P(Z = y/x),$$
while
$$P(X = x)\, P(Y = y) = P(X = x) \cdot P(XZ = y).$$
Thus for $X$ and $Y$ to be independent we would have to have
$$P(xZ = y) = P(XZ = y) \quad \text{for all } x, y.$$
This is clearly not going to be true in general. For example, suppose that $P(X = 1) = \frac{1}{2} = P(X = 2)$. Taking $x = y = 1$ in the previously displayed equation would imply
$$\frac{1}{2} = P(Z = 1) = P(XZ = 1) = P(X = 1, Z = 1) = P(X = 1)\, P(Z = 1) = \frac{1}{4},$$
which is false.

Presumably you saw the following exercise in Math 180A.

Exercise 2.2 (A Weak Law of Large Numbers). Assume $\{X_n\}_{n=1}^\infty$ is a sequence of uncorrelated, square integrable random variables which are identically distributed, i.e. $X_n \stackrel{d}{=} X_m$ for all $m, n \in \mathbb{N}$. Let $S_n := \sum_{k=1}^n X_k$, $\mu := \mathbb{E}X_k$ and $\sigma^2 := \operatorname{Var}(X_k)$ (these are independent of $k$). Show:
$$\mathbb{E}\left[\frac{S_n}{n}\right] = \mu, \qquad \mathbb{E}\left(\frac{S_n}{n} - \mu\right)^2 = \operatorname{Var}\left(\frac{S_n}{n}\right) = \frac{\sigma^2}{n}, \quad \text{and}$$
$$P\left(\left|\frac{S_n}{n} - \mu\right| > \varepsilon\right) \leq \frac{\sigma^2}{n \varepsilon^2}$$
for all $\varepsilon > 0$ and $n \in \mathbb{N}$.
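A quick simulation illustrates the statement of Exercise 2.2; the choice of i.i.d. Exp(1) variables (so $\mu = 1$, $\sigma^2 = 1$) and of $\varepsilon$ is mine.

```python
# Simulation sketch of Exercise 2.2: the variance of S_n/n shrinks like sigma^2/n,
# and the observed tail frequency stays below the Chebyshev bound sigma^2/(n eps^2).
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, eps, trials = 1.0, 1.0, 0.1, 5000
for n in (10, 100, 1000):
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)   # many copies of S_n / n
    print(n, means.var(), sigma2 / n,
          np.mean(np.abs(means - mu) > eps), sigma2 / (n * eps**2))
```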


3 Geometric aspects of $L^2(P)$

Definition 3.1 (Inner Product). For $X, Y \in L^2(P)$, let $(X, Y) := \mathbb{E}[XY]$ and $\|X\| := \sqrt{(X, X)} = \sqrt{\mathbb{E}|X|^2}$.

Example 3.2 (This was already mentioned in Lecture 1 with $N = 4$). Suppose that $\Omega = \{1, \dots, N\}$ and $P(\{i\}) = \frac{1}{N}$ for $1 \leq i \leq N$. Then
$$(X, Y) = \mathbb{E}[XY] = \frac{1}{N} \sum_{i=1}^N X(i) Y(i) = \frac{1}{N}\, \mathbf{X} \cdot \mathbf{Y},$$
where
$$\mathbf{X} := (X(1), X(2), \dots, X(N))^{\mathsf{T}} \quad \text{and} \quad \mathbf{Y} := (Y(1), Y(2), \dots, Y(N))^{\mathsf{T}}.$$
Thus the inner product we have defined in this case is essentially the dot product that you studied in Math 20F.

Remark 3.3. The inner product on $H := L^2(P)$ satisfies:

1. $(aX + bY, Z) = a(X, Z) + b(Y, Z)$, i.e. $X \to (X, Z)$ is linear.
2. $(X, Y) = (Y, X)$ (symmetry).
3. $\|X\|^2 := (X, X) \geq 0$, with $\|X\|^2 = 0$ iff $X = 0$.

Notice that combining properties (1) and (2) shows that $X \to (Z, X)$ is linear for fixed $Z \in H$, i.e.
$$(Z, aX + bY) = a(Z, X) + b(Z, Y).$$
The following identity will be used frequently in the sequel without further mention:
$$\|X + Y\|^2 = (X + Y, X + Y) = \|X\|^2 + \|Y\|^2 + (X, Y) + (Y, X) = \|X\|^2 + \|Y\|^2 + 2(X, Y). \tag{3.1}$$

Theorem 3.4 (Schwarz Inequality). Let $(H, (\cdot, \cdot))$ be an inner product space. Then for all $X, Y \in H$,
$$|(X, Y)| \leq \|X\|\|Y\|,$$
and equality holds iff $X$ and $Y$ are linearly dependent. Applying this result to $|X|$ and $|Y|$ shows
$$\mathbb{E}[|XY|] \leq \|X\| \cdot \|Y\|.$$

Proof. If $Y = 0$, the result holds trivially. So assume that $Y \neq 0$ and observe: if $X = \alpha Y$ for some $\alpha \in \mathbb{C}$, then $(X, Y) = \alpha \|Y\|^2$ and hence
$$|(X, Y)| = |\alpha| \|Y\|^2 = \|X\|\|Y\|.$$
Now suppose that $X \in H$ is arbitrary and let $Z := X - \|Y\|^{-2}(X, Y) Y$. (So $\|Y\|^{-2}(X, Y) Y$ is the "orthogonal projection" of $X$ along $Y$; see Figure 3.1.)

[Figure 3.1: the picture behind the proof of the Schwarz inequality.]

Then
$$0 \leq \|Z\|^2 = \left\| X - \frac{(X, Y)}{\|Y\|^2} Y \right\|^2 = \|X\|^2 + \frac{|(X, Y)|^2}{\|Y\|^4} \|Y\|^2 - 2\left(X, \frac{(X, Y)}{\|Y\|^2} Y\right) = \|X\|^2 - \frac{|(X, Y)|^2}{\|Y\|^2},$$
from which it follows that $0 \leq \|Y\|^2\|X\|^2 - |(X, Y)|^2$, with equality iff $Z = 0$, or equivalently iff $X = \|Y\|^{-2}(X, Y) Y$.

Alternative argument: Let $c \in \mathbb{R}$ and $Z := X - cY$; then
$$0 \leq \|Z\|^2 = \|X - cY\|^2 = \|X\|^2 - 2c(X, Y) + c^2\|Y\|^2.$$


The right side of this equation is minimized at $c = (X, Y)/\|Y\|^2$, and for this value of $c$ we find
$$0 \leq \|X - cY\|^2 = \|X\|^2 - (X, Y)^2/\|Y\|^2,$$
with equality iff $X = cY$. Solving this last inequality for $|(X, Y)|$ gives the result.

Corollary 3.5. The norm $\|\cdot\|$ satisfies the triangle inequality and $(\cdot, \cdot)$ is continuous on $H \times H$.

Proof. If $X, Y \in H$, then, using Schwarz's inequality,
$$\|X + Y\|^2 = \|X\|^2 + \|Y\|^2 + 2(X, Y) \leq \|X\|^2 + \|Y\|^2 + 2\|X\|\|Y\| = (\|X\| + \|Y\|)^2.$$
Taking the square root of this inequality shows $\|\cdot\|$ satisfies the triangle inequality. (The rest of this proof may be skipped.)

Checking that $\|\cdot\|$ satisfies the remaining axioms of a norm is now routine and will be left to the reader. If $X, Y, \Delta X, \Delta Y \in H$, then
$$|(X + \Delta X, Y + \Delta Y) - (X, Y)| = |(X, \Delta Y) + (\Delta X, Y) + (\Delta X, \Delta Y)| \leq \|X\|\|\Delta Y\| + \|Y\|\|\Delta X\| + \|\Delta X\|\|\Delta Y\| \to 0 \text{ as } \Delta X, \Delta Y \to 0,$$
from which it follows that $(\cdot, \cdot)$ is continuous.

Definition 3.6. Let $(H, (\cdot, \cdot))$ be an inner product space. We say $X, Y \in H$ are orthogonal, and write $X \perp Y$, iff $(X, Y) = 0$. More generally, if $A \subset H$ is a set, $X \in H$ is orthogonal to $A$ (write $X \perp A$) iff $(X, Y) = 0$ for all $Y \in A$. Let $A^\perp = \{X \in H : X \perp A\}$ be the set of vectors orthogonal to $A$. A subset $S \subset H$ is an orthogonal set if $X \perp Y$ for all distinct elements $X, Y \in S$. If $S$ further satisfies $\|X\| = 1$ for all $X \in S$, then $S$ is said to be an orthonormal set.

Proposition 3.7. Let $(H, (\cdot, \cdot))$ be an inner product space. Then:

1. (Pythagorean Theorem) If $S \subset\subset H$ is a finite orthogonal set, then
$$\left\| \sum_{X \in S} X \right\|^2 = \sum_{X \in S} \|X\|^2. \tag{3.2}$$

2. (Parallelogram Law) (Skip this one.) For all $X, Y \in H$,
$$\|X + Y\|^2 + \|X - Y\|^2 = 2\|X\|^2 + 2\|Y\|^2. \tag{3.3}$$

Proof. Items 1. and 2. are proved by the following elementary computations:
$$\left\| \sum_{X \in S} X \right\|^2 = \Big( \sum_{X \in S} X, \sum_{Y \in S} Y \Big) = \sum_{X, Y \in S} (X, Y) = \sum_{X \in S} (X, X) = \sum_{X \in S} \|X\|^2$$
and
$$\|X + Y\|^2 + \|X - Y\|^2 = \|X\|^2 + \|Y\|^2 + 2(X, Y) + \|X\|^2 + \|Y\|^2 - 2(X, Y) = 2\|X\|^2 + 2\|Y\|^2.$$

Theorem 3.8 (Least Squares Approximation Theorem). Suppose that $V$ is a subspace of $H := L^2(P)$, $X \in V$, and $Y \in L^2(P)$. Then the following are equivalent:

1. $\|Y - X\| \leq \|Y - Z\|$ for all $Z \in V$ (i.e. $X$ is a least squares approximation to $Y$ by an element from $V$), and
2. $(Y - X) \perp V$.

Moreover there is "essentially" at most one $X \in V$ satisfying 1., or equivalently 2. We denote this random variable by $Q_V Y$ and call it the orthogonal projection of $Y$ along $V$.

Proof. (1 $\Rightarrow$ 2) If 1. holds then $f(t) := \|Y - (X + tZ)\|^2$ has a minimum at $t = 0$ and therefore $f'(0) = 0$. Since
$$f(t) := \|Y - X - tZ\|^2 = \|Y - X\|^2 + t^2\|Z\|^2 - 2t(Y - X, Z),$$
we may conclude that
$$0 = f'(0) = -2(Y - X, Z).$$
As $Z \in V$ was arbitrary, we may conclude that $(Y - X) \perp V$.

(2 $\Rightarrow$ 1) Now suppose that $(Y - X) \perp V$ and $Z \in V$. Then $(Y - X) \perp (X - Z)$ and so
$$\|Y - Z\|^2 = \|Y - X + X - Z\|^2 = \|Y - X\|^2 + \|X - Z\|^2 \geq \|Y - X\|^2. \tag{3.4}$$
Moreover, if $Z$ is another best approximation to $Y$, then $\|Y - Z\|^2 = \|Y - X\|^2$, which according to Eq. (3.4) happens iff
$$\|X - Z\|^2 = \mathbb{E}(X - Z)^2 = 0,$$
i.e. iff $X = Z$ a.s.


End of Lecture 3: 1/07/2011 (Given by Tom Laetsch)

Corollary 3.9 (Orthogonal Projection Formula). Suppose that $V$ is a subspace of $H := L^2(P)$ and $\{X_i\}_{i=1}^N$ is an orthogonal basis for $V$. Then
$$Q_V Y = \sum_{i=1}^N \frac{(Y, X_i)}{\|X_i\|^2} X_i \quad \text{for all } Y \in H.$$

Proof. The best approximation $X \in V$ to $Y$ is of the form $X = \sum_{i=1}^N c_i X_i$, where the $c_i \in \mathbb{R}$ need to be chosen so that $(Y - X) \perp V$. Equivalently put, we must have
$$0 = (Y - X, X_j) = (Y, X_j) - (X, X_j) \quad \text{for } 1 \leq j \leq N.$$
Since
$$(X, X_j) = \sum_{i=1}^N c_i (X_i, X_j) = c_j \|X_j\|^2,$$
we see that $c_j = (Y, X_j)/\|X_j\|^2$, i.e.
$$Q_V Y = X = \sum_{i=1}^N \frac{(Y, X_i)}{\|X_i\|^2} X_i.$$

Example 3.10. Given $Y \in L^2(P)$, the best approximation to $Y$ by a constant function $c$ is given by
$$c = \frac{\mathbb{E}[Y \cdot 1]}{\mathbb{E}1^2} \cdot 1 = \mathbb{E}Y.$$
You already proved this on your first homework by a direct calculus exercise.
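The projection formula of Corollary 3.9 is easy to test in the finite setting of Example 3.2. The sketch below uses my own choice of $N$ and of the vectors; the only thing taken from the notes is the inner product $(X, Y) = \mathbb{E}[XY] = \frac{1}{N}\mathbf{X}\cdot\mathbf{Y}$ and the projection formula itself.

```python
# Numerical sketch of Corollary 3.9 on Omega = {1, ..., N} with uniform P:
# project Y onto V = span{1, X} using an orthogonal basis of V and verify that
# the residual Y - Q_V Y is orthogonal to V.
import numpy as np

N = 6

def inner(a, b):
    return np.dot(a, b) / N          # (a, b) = E[a b] under the uniform measure

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = X**2
ones = np.ones(N)
basis = [ones, X - inner(X, ones) / inner(ones, ones) * ones]   # orthogonal basis of span{1, X}

QY = sum(inner(Y, b) / inner(b, b) * b for b in basis)          # Corollary 3.9
residual = Y - QY
print([inner(residual, b) for b in basis])   # both close to 0: (Y - Q_V Y) is orthogonal to V
```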


4 Linear prediction and a canonical form

Corollary 4.1 (Correlation Bounds). For all square integrable random variables $X$ and $Y$,
$$|\operatorname{Cov}(X, Y)| \leq \sigma(X) \cdot \sigma(Y),$$
or equivalently,
$$|\operatorname{Corr}(X, Y)| \leq 1.$$

Proof. This is a simple application of Schwarz's inequality (Theorem 3.4):
$$|\operatorname{Cov}(X, Y)| = |\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]| \leq \|X - \mu_X\| \cdot \|Y - \mu_Y\| = \sigma(X) \cdot \sigma(Y).$$

Since $\operatorname{Corr}(X, Y) > 0$ iff $\operatorname{Cov}(X, Y) > 0$ iff $\mathbb{E}[(X - \mu_X)(Y - \mu_Y)] > 0$, we see that $X$ and $Y$ are positively correlated iff $X - \mu_X$ and $Y - \mu_Y$ tend to have the same sign more often than not, while $X$ and $Y$ are negatively correlated iff $X - \mu_X$ and $Y - \mu_Y$ tend to have opposite signs more often than not. This description is of course rather crude, given that it ignores the size of $X - \mu_X$ and $Y - \mu_Y$, but it should give the reader a little intuition into the meaning of correlation. (See Corollary 4.4 below for the special case where $\operatorname{Corr}(X, Y) = 1$ or $\operatorname{Corr}(X, Y) = -1$.)

Theorem 4.2 (Linear Prediction Theorem). Let $X$ and $Y$ be two square integrable random variables. Then
$$\sigma(Y)\sqrt{1 - \operatorname{Corr}^2(X, Y)} = \min_{a, b \in \mathbb{R}} \|Y - (aX + b)\| = \|Y - W\|, \tag{4.1}$$
where
$$W = \mu_Y + \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}(X - \mu_X) = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} X + \left( \mathbb{E}Y - \mu_X \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} \right).$$

Proof. Let $\mu = \mathbb{E}X$ and $\bar{X} = X - \mu$. Then $\{1, \bar{X}\}$ is an orthogonal set and $V := \operatorname{span}\{1, X\} = \operatorname{span}\{1, \bar{X}\}$. Thus the best approximation of $Y$ by a random variable of the form $aX + b$ is given by
$$W = (Y, 1)\, 1 + \frac{(Y, \bar{X})}{\|\bar{X}\|^2} \bar{X} = \mathbb{E}Y + \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}(X - \mu_X).$$
The root mean square error of this approximation is, with $\bar{Y} := Y - \mu_Y$,
$$\|Y - W\|^2 = \left\| \bar{Y} - \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} \bar{X} \right\|^2 = \sigma^2(Y) - \frac{\operatorname{Cov}^2(X, Y)}{\sigma^2(X)} = \sigma^2(Y)\big(1 - \operatorname{Corr}^2(X, Y)\big),$$
so that
$$\|Y - W\| = \sigma(Y)\sqrt{1 - \operatorname{Corr}^2(X, Y)}.$$

Example 4.3. Suppose that $P(X \in dx, Y \in dy) = e^{-y} 1_{0 < x < y}\, dx\, dy$. Recall from Example 2.5 that
$$\mathbb{E}X = 1, \quad \mathbb{E}Y = 2, \quad \mathbb{E}X^2 = 2, \quad \mathbb{E}Y^2 = 6,$$
$$\sigma(X) = 1, \quad \sigma(Y) = \sqrt{2}, \quad \operatorname{Cov}(X, Y) = 1, \quad \text{and} \quad \operatorname{Corr}(X, Y) = \frac{1}{\sqrt{2}}.$$
So in this case
$$W = 2 + \frac{1}{1}(X - 1) = X + 1$$
is the best linear predictor of $Y$, and the root mean square error in this prediction is
$$\|Y - W\| = \sqrt{2}\,\sqrt{1 - \frac{1}{2}} = 1.$$
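Example 4.3 can be checked with the same sampler I used after Example 2.5 (my own construction: $Y \sim \operatorname{Gamma}(2,1)$ and $X \mid Y \sim \operatorname{Uniform}(0, Y)$).

```python
# Monte Carlo sketch of Example 4.3: the fitted best linear predictor should be
# close to W = X + 1, with root mean square error close to 1.
import numpy as np

rng = np.random.default_rng(5)
n = 10**6
Y = rng.gamma(shape=2.0, scale=1.0, size=n)   # marginal density y * e^{-y}
X = rng.uniform(0.0, 1.0, size=n) * Y         # X | Y ~ Uniform(0, Y)

a = (np.mean(X * Y) - X.mean() * Y.mean()) / X.var()   # Cov(X, Y) / Var(X)
b = Y.mean() - a * X.mean()
rms = np.sqrt(np.mean((Y - (a * X + b))**2))
print(a, b, rms)                              # expect roughly 1, 1, 1
```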

Corollary 4.4. If $\operatorname{Corr}(X, Y) = \pm 1$, then
$$Y = \mu_Y \pm \frac{\sigma(Y)}{\sigma(X)}(X - \mu_X),$$
i.e. $Y - \mu_Y$ is a positive (negative) multiple of $X - \mu_X$ if $\operatorname{Corr}(X, Y) = 1$ ($\operatorname{Corr}(X, Y) = -1$).


Proof. According to Eq. (4.1) of Theorem 4.2, if $\operatorname{Corr}(X, Y) = \pm 1$ then $\|Y - W\| = 0$, i.e. $Y = W$, and therefore
$$Y = \mu_Y + \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}(X - \mu_X) = \mu_Y \pm \frac{\sigma_X \sigma_Y}{\sigma_X^2}(X - \mu_X) = \mu_Y \pm \frac{\sigma_Y}{\sigma_X}(X - \mu_X),$$
wherein we have used $\operatorname{Cov}(X, Y) = \operatorname{Corr}(X, Y)\,\sigma_X \sigma_Y = \pm\sigma_X \sigma_Y$.

Theorem 4.5 (Canonical form). If $X, Y \in L^2(P)$, then there are two mean zero, uncorrelated random variables $Z_1, Z_2$ such that $\|Z_1\| = \|Z_2\| = 1$ and
$$X = \mu_X + \sigma(X) Z_1, \quad \text{and} \quad Y = \mu_Y + \sigma(Y)\,[\cos\theta \cdot Z_1 + \sin\theta \cdot Z_2],$$
where $0 \leq \theta \leq \pi$ is chosen such that $\cos\theta := \operatorname{Corr}(X, Y)$.

Proof. (Just sketch the main idea in class!) The proof amounts to applying the Gram-Schmidt procedure to $\bar{X} := X - \mu_X$ and $\bar{Y} := Y - \mu_Y$ to find $Z_1$ and $Z_2$, followed by expressing $X$ and $Y$ uniquely in terms of the linearly independent set $\{1, Z_1, Z_2\}$. The details follow.

Performing Gram-Schmidt on $\{\bar{X}, \bar{Y}\}$ gives $Z_1 = \bar{X}/\sigma(X)$ and
$$\tilde{Z}_2 = \bar{Y} - \frac{(\bar{Y}, \bar{X})}{\sigma(X)^2} \bar{X}.$$
To get $Z_2$ we need to normalize $\tilde{Z}_2$, using
$$\mathbb{E}\tilde{Z}_2^2 = \sigma(Y)^2 - 2\frac{(\bar{Y}, \bar{X})}{\sigma(X)^2}(\bar{X}, \bar{Y}) + \frac{(\bar{Y}, \bar{X})^2}{\sigma(X)^4}\sigma(X)^2 = \sigma(Y)^2 - \frac{(\bar{X}, \bar{Y})^2}{\sigma(X)^2} = \sigma(Y)^2\big(1 - \operatorname{Corr}^2(X, Y)\big) = \sigma(Y)^2 \sin^2\theta.$$
Therefore $Z_1 = \bar{X}/\sigma(X)$ and
$$Z_2 := \frac{\tilde{Z}_2}{\|\tilde{Z}_2\|} = \frac{\bar{Y} - \frac{(\bar{Y}, \bar{X})}{\sigma(X)^2}\bar{X}}{\sigma(Y)\sin\theta} = \frac{\bar{Y} - \frac{\sigma(X)\sigma(Y)\operatorname{Corr}(X, Y)}{\sigma(X)^2}\bar{X}}{\sigma(Y)\sin\theta} = \frac{\bar{Y} - \frac{\sigma(Y)}{\sigma(X)}\cos\theta \cdot \bar{X}}{\sigma(Y)\sin\theta} = \frac{\bar{Y} - \sigma(Y)\cos\theta \cdot Z_1}{\sigma(Y)\sin\theta}.$$
Solving for $\bar{X}$ and $\bar{Y}$ shows
$$\bar{X} = \sigma(X) Z_1 \quad \text{and} \quad \bar{Y} = \sigma(Y)\,[\sin\theta \cdot Z_2 + \cos\theta \cdot Z_1],$$
which is equivalent to the desired result.

Remark 4.6. It is easy to give a second proof of Corollary 4.4 based on Theorem 4.5. Indeed, if $\operatorname{Corr}(X, Y) = 1$, then $\theta = 0$ and $Y - \mu_Y = \sigma(Y) Z_1 = \frac{\sigma(Y)}{\sigma(X)}(X - \mu_X)$, while if $\operatorname{Corr}(X, Y) = -1$, then $\theta = \pi$ and therefore $Y - \mu_Y = -\sigma(Y) Z_1 = -\frac{\sigma(Y)}{\sigma(X)}(X - \mu_X)$.

Exercise 4.1 (A correlation inequality). Suppose that $X$ is a random variable and $f, g : \mathbb{R} \to \mathbb{R}$ are two increasing functions such that both $f(X)$ and $g(X)$ are square integrable, i.e. $\mathbb{E}|f(X)|^2 + \mathbb{E}|g(X)|^2 < \infty$. Show $\operatorname{Cov}(f(X), g(X)) \geq 0$. Hint: let $Y$ be another random variable which has the same law as $X$ and is independent of $X$, and consider
$$\mathbb{E}[(f(Y) - f(X)) \cdot (g(Y) - g(X))].$$


5

Conditional Expectation

Notation 5.1 (Conditional Expectation 1) Given Y ∈ L1 (P ) and A ⊂ Ωlet

E [Y : A] := E [1AY ]

and

E [Y |A] =

E [Y : A] /P (A) if P (A) > 0

0 if P (A) = 0. (5.1)

(In point of fact, when P (A) = 0 we could set E [Y |A] to be any real number.We choose 0 for definiteness and so that Y → E [Y |A] is always linear.)

Example 5.2 (Conditioning for the uniform distribution). Suppose that Ω is afinite set and P is the uniform distribution on P so that P (ω) = 1

#(Ω) for

all ω ∈ W. Then for non-empty any subset A ⊂ Ω and Y : Ω → R we haveE [Y |A] is the expectation of Y restricted to A under the uniform distributionon A. Indeed,

E [Y |A] =1

P (A)E [Y : A] =

1

P (A)

∑ω∈A

Y (ω)P (ω)

=1

# (A) /# (Ω)

∑ω∈A

Y (ω)1

# (Ω)=

1

# (A)

∑ω∈A

Y (ω) .

Lemma 5.3. If $P(A) > 0$ then $E[Y|A] = E_{P(\cdot|A)}Y$ for all $Y\in L^1(P)$.

Proof. I will only prove this lemma when $Y$ is a discrete random variable, although the result does hold in general. So suppose that $Y:\Omega\to S$ where $S$ is a finite or countable subset of $\mathbb{R}$. Then taking expectation relative to $P(\cdot|A)$ of the identity $Y = \sum_{y\in S} y\,1_{Y=y}$ gives
$$ E_{P(\cdot|A)}Y = \sum_{y\in S} y\,E_{P(\cdot|A)}1_{Y=y} = \sum_{y\in S} y\,P(Y=y|A) = \sum_{y\in S} y\,\frac{P(Y=y, A)}{P(A)} = \frac{1}{P(A)}\sum_{y\in S} y\,E[1_A 1_{Y=y}] = \frac{1}{P(A)}E\Big[1_A\sum_{y\in S} y\,1_{Y=y}\Big] = \frac{1}{P(A)}E[1_A Y] = E[Y|A]. $$

Lemma 5.4. No matter whether $P(A)>0$ or $P(A)=0$ we always have
$$ |E[Y|A]| \le E[|Y|\,|A] \le \sqrt{E[|Y|^2|A]}. \tag{5.2} $$

Proof. If $P(A)=0$ then all terms in Eq. (5.2) are zero and so the inequalities hold. For $P(A)>0$ we have, using the Schwarz inequality of Theorem 3.4, that
$$ |E[Y|A]| = |E_{P(\cdot|A)}Y| \le E_{P(\cdot|A)}|Y| \le \sqrt{E_{P(\cdot|A)}|Y|^2\cdot E_{P(\cdot|A)}1} = \sqrt{E_{P(\cdot|A)}|Y|^2}. $$
This completes the proof, as $E_{P(\cdot|A)}|Y| = E[|Y|\,|A]$ and $E_{P(\cdot|A)}|Y|^2 = E[|Y|^2|A]$.

Notation 5.5 Let $S$ be a set (often $S=\mathbb{R}$ or $S=\mathbb{R}^N$) and suppose that $X:\Omega\to S$ is a function. (So $X$ is a random variable if $S=\mathbb{R}$ and a random vector when $S=\mathbb{R}^N$.) Further let $V_X$ denote those random variables $Z\in L^2(P)$ which may be written as $Z=f(X)$ for some function $f:S\to\mathbb{R}$. (This is a subspace of $L^2(P)$ and we let $\mathcal{F}_X := \{f:S\to\mathbb{R} : f(X)\in L^2(P)\}$.)

Definition 5.6 (Conditional Expectation 2). Given a function $X:\Omega\to S$ and $Y\in L^2(P)$, we define $E[Y|X] := Q_{V_X}Y$ where $Q_{V_X}$ is orthogonal projection onto $V_X$. (Fact: $Q_{V_X}Y$ always exists. The proof requires technical details beyond the scope of this course.)

Remark 5.7. By definition, $E[Y|X] = h(X)$ where $h\in\mathcal{F}_X$ is chosen so that $[Y-h(X)]\perp V_X$, i.e. $E[Y|X]=h(X)$ iff $(Y-h(X), f(X))=0$ for all $f\in\mathcal{F}_X$. So in summary, $E[Y|X]=h(X)$ iff
$$ E[Yf(X)] = E[h(X)f(X)] \quad\text{for all } f\in\mathcal{F}_X. \tag{5.3} $$

Corollary 5.8 (Law of total expectation). For all random variables $Y\in L^2(P)$, we have $EY = E(E(Y|X))$.

Proof. Take $f=1$ in Eq. (5.3).

This notion of conditional expectation is rather abstract. It is now time to see how to explicitly compute conditional expectations. (In general this can be quite tricky to carry out in concrete examples!)


5.1 Conditional Expectation for Discrete Random Variables

Recall that if $A$ and $B$ are events with $P(A)>0$, then we define $P(B|A) := \frac{P(B\cap A)}{P(A)}$. By convention we will set $P(B|A)=0$ if $P(A)=0$.

Example 5.9. If $\Omega$ is a finite set with $N$ elements, $P$ is the uniform distribution on $\Omega$, and $A$ is a non-empty subset of $\Omega$, then $P(\cdot|A)$ restricted to events contained in $A$ is the uniform distribution on $A$. Indeed, with $a = \#(A)$ and $B\subset A$, we have
$$ P(B|A) = \frac{P(B\cap A)}{P(A)} = \frac{P(B)}{P(A)} = \frac{\#(B)/N}{\#(A)/N} = \frac{\#(B)}{\#(A)} = \frac{\#(B)}{a}. $$

Theorem 5.10. Suppose that $S$ is a finite or countable set and $X:\Omega\to S$; then $E[Y|X] = h(X)$ where $h(s) := E[Y|X=s]$ for all $s\in S$.

Proof. First Proof. Our goal is to find $h(s)$ such that
$$ E[Yf(X)] = E[h(X)f(X)] \quad\text{for all bounded } f. $$
Let $S' = \{s\in S : P(X=s)>0\}$; then
$$ E[Yf(X)] = \sum_{s\in S}E[Yf(X):X=s] = \sum_{s\in S'}E[Yf(X):X=s] = \sum_{s\in S'} f(s)\,E[Y|X=s]\,P(X=s) = \sum_{s\in S'} f(s)h(s)\,P(X=s) = \sum_{s\in S} f(s)h(s)\,P(X=s) = E[h(X)f(X)], $$
where $h(s) := E[Y|X=s]$.

Second Proof. Suppose $S$ is a finite set such that $P(X=s)>0$ for all $s\in S$. Then
$$ f(X) = \sum_{s\in S} f(s)\,1_{X=s}, $$
which shows that $V_X = \operatorname{span}\{1_{X=s} : s\in S\}$. As $\{1_{X=s}\}_{s\in S}$ is an orthogonal set, we may compute
$$ E[Y|X] = \sum_{s\in S}\frac{(Y, 1_{X=s})}{\|1_{X=s}\|^2}\,1_{X=s} = \sum_{s\in S}\frac{E[Y:X=s]}{P(X=s)}\,1_{X=s} = \sum_{s\in S}E[Y|X=s]\cdot 1_{X=s} = h(X). $$

Example 5.11. Suppose that $X$ and $Y$ are discrete random variables with joint distribution given by

            Y = -1   Y = 0   Y = 1   rho_X
   X = 1      0       1/4     0       1/4
   X = 0     1/4      1/4    1/4      3/4
   rho_Y     1/4      1/2    1/4

We then have
$$ E[Y|X=1] = \frac{1}{1/4}\left(-1\cdot 0 + 0\cdot\tfrac14 + 1\cdot 0\right) = 0 \quad\text{and}\quad E[Y|X=0] = \frac{1}{3/4}\left(-1\cdot\tfrac14 + 0\cdot\tfrac14 + 1\cdot\tfrac14\right) = 0, $$
and therefore $E[Y|X]=0$. On the other hand,
$$ E[X|Y=-1] = \frac{1}{1/4}\left(1\cdot 0 + 0\cdot\tfrac14\right) = 0,\quad E[X|Y=0] = \frac{1}{1/2}\left(1\cdot\tfrac14 + 0\cdot\tfrac14\right) = \tfrac12,\quad E[X|Y=1] = \frac{1}{1/4}\left(1\cdot 0 + 0\cdot\tfrac14\right) = 0. $$
Therefore
$$ E[X|Y] = \tfrac12\,1_{Y=0}. $$
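As a quick sanity check (not part of the original notes), here is a short sketch that computes $E[X|Y=y]$ for the joint pmf of Example 5.11 directly from the defining formula $h(y)=\sum_x x\,p(x,y)/p_Y(y)$.

```python
import numpy as np

# Sketch only: conditional expectation from a joint pmf table (Example 5.11).
x_vals = np.array([1.0, 0.0])           # rows:   X = 1, X = 0
y_vals = np.array([-1.0, 0.0, 1.0])     # columns: Y = -1, 0, 1
joint = np.array([[0.00, 0.25, 0.00],   # P(X=1, Y=.)
                  [0.25, 0.25, 0.25]])  # P(X=0, Y=.)

rho_Y = joint.sum(axis=0)               # marginal of Y
h = (x_vals @ joint) / rho_Y            # h(y) = E[X | Y = y]
print(dict(zip(y_vals, h)))             # {-1.0: 0.0, 0.0: 0.5, 1.0: 0.0}
```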

Example 5.12. Let $X$ and $Y$ be discrete random variables with values in $\{1,2,3\}$ whose joint distribution and marginals are given by

            X = 1   X = 2   X = 3   rho_Y
   Y = 1     .1      .2      .3      .6
   Y = 2     .15     .15     0       .3
   Y = 3     .05     0       .05     .1
   rho_X     .3      .35     .35

Then
$$ \rho_{X|Y}(1,3) = P(X=1|Y=3) = \frac{.05}{.1} = \frac12,\qquad \rho_{X|Y}(2,3) = P(X=2|Y=3) = \frac{0}{.1} = 0,\qquad \rho_{X|Y}(3,3) = P(X=3|Y=3) = \frac{.05}{.1} = \frac12. $$
Therefore,
$$ E[X|Y=3] = 1\cdot\tfrac12 + 2\cdot 0 + 3\cdot\tfrac12 = 2, \quad\text{or}\quad h(3) := E[X|Y=3] = \frac{1}{.1}(1\cdot .05 + 2\cdot 0 + 3\cdot .05) = 2. $$
Similarly,
$$ h(1) := E[X|Y=1] = \frac{1}{.6}(1\cdot .1 + 2\cdot .2 + 3\cdot .3) = 2\tfrac13, \qquad h(2) := E[X|Y=2] = \frac{1}{.3}(1\cdot .15 + 2\cdot .15 + 3\cdot 0) = 1.5, $$
and so
$$ E[X|Y] = h(Y) = 2\tfrac13\cdot 1_{Y=1} + 1.5\cdot 1_{Y=2} + 2\cdot 1_{Y=3}. $$

Example 5.13 (Number of girls in a family). Suppose the number of children in a family is a random variable $X$ with mean $\mu$, and given $X=n$ for $n\ge 1$, each of the $n$ children in the family is a girl with probability $p$ and a boy with probability $1-p$. Problem: what is the expected number of girls in a family?

Solution. Intuitively, the answer should be $p\mu$. To show this is correct, let $G$ be the random number of girls in a family. Then
$$ E[G|X=n] = p\cdot n, $$
as $G = 1_{A_1}+\dots+1_{A_n}$ on $\{X=n\}$, where $A_i$ is the event that the $i$th child is a girl. We are given $P(A_i|X=n)=p$, so that $E[1_{A_i}|X=n]=p$ and hence $E[G|X=n]=p\cdot n$. Therefore $E[G|X]=p\cdot X$ and
$$ E[G] = E\,E[G|X] = E[p\cdot X] = p\mu. $$

Example 5.14. Suppose that $X$ and $Y$ are i.i.d. random variables with the geometric distribution,
$$ P(X=k) = P(Y=k) = (1-p)^{k-1}p \quad\text{for } k\in\mathbb{N}. $$
We compute, for $n>m$,
$$ P(X=m|X+Y=n) = \frac{P(X=m, X+Y=n)}{P(X+Y=n)} = \frac{P(X=m, Y=n-m)}{\sum_{k+l=n}P(X=k, Y=l)}, $$
where
$$ P(X=m, Y=n-m) = p^2(1-p)^{m-1}(1-p)^{n-m-1} = p^2(1-p)^{n-2} $$
and
$$ \sum_{k+l=n}P(X=k, Y=l) = \sum_{k+l=n}(1-p)^{k-1}p\,(1-p)^{l-1}p = \sum_{k+l=n}p^2(1-p)^{n-2} = p^2(1-p)^{n-2}\sum_{k=1}^{n-1}1. $$
Thus we have shown
$$ P(X=m|X+Y=n) = \frac{1}{n-1} \quad\text{for } 1\le m<n. $$
From this it follows that
$$ E[f(X)|X+Y=n] = \frac{1}{n-1}\sum_{m=1}^{n-1}f(m) \quad\text{and so}\quad E[f(X)|X+Y] = \frac{1}{X+Y-1}\sum_{m=1}^{X+Y-1}f(m). $$
As a check, if $f(m)=m$ we have
$$ E[X|X+Y] = \frac{1}{X+Y-1}\sum_{m=1}^{X+Y-1}m = \frac{1}{X+Y-1}\cdot\frac12(X+Y-1)(X+Y-1+1) = \frac12(X+Y), $$
as we will see holds in fair generality; see Example 5.24 below.

Example 5.15 (Durrett Example 4.6.2, p. 205). Suppose we want to determine the expected value of
$$ Y = \#\text{ of rolls to complete one game of craps.} $$
Let $X$ be the sum we obtain on the first roll. In this game, if

   X in {2, 3, 12} =: L   =>  the game ends and you lose,
   X in {7, 11} =: W      =>  the game ends and you win, and
   X in {4, 5, 6, 8, 9, 10} =: P  =>  X is your "point."

In the last case, you roll the dice again and again until you get either $X$ (your point) or $7$. (If you hit $X$ before the $7$ then you win.) We are going to compute $EY$ as $E[E[Y|X]]$.

Clearly if $x\in L\cup W$ then $E[Y|X=x]=1$, while if $x\in P$, then $E[Y|X=x] = 1+EN_x$ where $N_x$ is the number of rolls needed to hit either $x$ or $7$. This is a geometric random variable with parameter $p_x$ (the probability of rolling an $x$ or a $7$) and so $EN_x = 1/p_x$. For example if $x=4$, then $p_x = \frac{3+6}{36} = \frac{9}{36}$ (3 is the number of ways to roll a 4 and 6 is the number of ways to roll a 7) and hence $1+EN_x = 1+4 = 5$. Similar calculations give the following table:

   x in           {2,3,7,11,12}   {4,10}   {5,9}   {6,8}
   E[Y|X = x]          1           45/9    46/10   47/11
   P(X in set)        12/36        6/36     8/36   10/36

(For example, there are 5 ways to get a 6 and 6 ways to get a 7, so when $x=6$ we are waiting for an event with probability $11/36$; the mean of this geometric random variable is $36/11$ and adding the first roll gives $E[Y|X=6] = 47/11$. Similarly for $x=8$, and $P(X=6\text{ or }8) = (5+5)/36$.) Putting the pieces together and using the law of total expectation gives
$$ EY = E[E[Y|X]] = 1\cdot\frac{12}{36} + \frac{45}{9}\cdot\frac{6}{36} + \frac{46}{10}\cdot\frac{8}{36} + \frac{47}{11}\cdot\frac{10}{36} = \frac{557}{165} \cong 3.376 \text{ rolls.} $$
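Here is a small Monte Carlo sketch (not part of the original notes) estimating the same quantity, $EY = 557/165 \approx 3.376$, by simulating many games of craps.

```python
import random

# Sketch only: estimate the expected number of rolls in one game of craps.
def roll():
    return random.randint(1, 6) + random.randint(1, 6)

def rolls_in_one_game():
    rolls, x = 1, roll()
    if x in (2, 3, 12) or x in (7, 11):   # immediate loss or win
        return rolls
    while True:                            # keep rolling until the point x or a 7
        rolls += 1
        if roll() in (x, 7):
            return rolls

random.seed(0)
n_games = 200_000
print(sum(rolls_in_one_game() for _ in range(n_games)) / n_games)   # ~ 3.376
```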

The following two facts are often helpful when computing conditional expectations.

Proposition 5.16 (Bayes formula). Suppose that $A\subset\Omega$ and $\{A_i\}$ is a partition of $A$. Then
$$ E[Y|A] = \frac{1}{P(A)}\sum_i E[Y|A_i]\,P(A_i) = \frac{\sum_i E[Y|A_i]\,P(A_i)}{\sum_i P(A_i)}. $$
If we further assume that $E[Y|A_i]=c$ independent of $i$, then $E[Y|A]=c$.

The proof of this proposition is straightforward and is left to the reader.

Proposition 5.17. Suppose that $X_i:\Omega\to S_i$ for $1\le i\le n$ are independent random functions with each $S_i$ discrete. Then for any $T_i\subset S_i$ we have
$$ E[u(X_1,\dots,X_n)|X_1\in T_1,\dots,X_n\in T_n] = E[u(Y_1,\dots,Y_n)], $$
where $Y_i:\Omega\to T_i$ for $1\le i\le n$ are independent random functions such that $P(Y_i=t) = P(X_i=t|X_i\in T_i)$ for all $t\in T_i$.

Proof. The proof is contained in the following computation:
$$ \begin{aligned}
E[u(X_1,\dots,X_n)|X_1\in T_1,\dots,X_n\in T_n] &= \frac{E[u(X_1,\dots,X_n):X_1\in T_1,\dots,X_n\in T_n]}{P(X_1\in T_1,\dots,X_n\in T_n)} \\
&= \frac{1}{P(X_1\in T_1,\dots,X_n\in T_n)}\sum_{t_i\in T_i}u(t_1,\dots,t_n)\,P(X_1=t_1,\dots,X_n=t_n) \\
&= \frac{1}{\prod_i P(X_i\in T_i)}\sum_{(t_1,\dots,t_n)\in T_1\times\dots\times T_n}u(t_1,\dots,t_n)\prod_i P(X_i=t_i) \\
&= \sum_{(t_1,\dots,t_n)\in T_1\times\dots\times T_n}u(t_1,\dots,t_n)\prod_i\frac{P(X_i=t_i)}{P(X_i\in T_i)} \\
&= \sum_{(t_1,\dots,t_n)\in T_1\times\dots\times T_n}u(t_1,\dots,t_n)\prod_i P(Y_i=t_i) \\
&= \sum_{(t_1,\dots,t_n)\in T_1\times\dots\times T_n}u(t_1,\dots,t_n)\,P(Y_1=t_1,\dots,Y_n=t_n) = E[u(Y_1,\dots,Y_n)].
\end{aligned} $$

Here is an example of how to use these two propositions.

Example 5.18. Suppose we roll a die $n$ times with results $\{X_i\}_{i=1}^n$ where $X_i\in\{1,2,3,4,5,6\}$ for each $i$. Let
$$ Y = \sum_{i=1}^n 1_{\{1,3,5\}}(X_i) = \text{number of odd rolls} \quad\text{and}\quad Z = \sum_{i=1}^n 1_{\{3,4,6\}}(X_i) = \text{number of times 3, 4, or 6 is rolled.} $$
We wish to compute $E[Z|Y]$. So let $0\le y\le n$ be given and let $A$ be the event where $X_i$ is odd for $1\le i\le y$ and $X_i$ is even for $y<i\le n$. Then
$$ E[Z|A] = y\cdot\frac13 + (n-y)\cdot\frac23, $$
where $\frac13 = P(X_1\in\{3,4,6\}|X_1\text{ is odd})$ and $\frac23 = P(X_1\in\{3,4,6\}|X_1\text{ is even})$. Now the event $\{Y=y\}$ can be partitioned into events like the one above, labeled by which $y$ of the $n$ slots carry the odd rolls, and the result is the same for all such choices by symmetry. Therefore by Proposition 5.16 we may conclude
$$ E[Z|Y=y] = y\cdot\frac13 + (n-y)\cdot\frac23, $$
and therefore
$$ E[Z|Y] = Y\cdot\frac13 + (n-Y)\cdot\frac23. $$
As a check notice that
$$ E\,E[Z|Y] = EY\cdot\frac13 + (n-EY)\cdot\frac23 = \frac n2\cdot\frac13 + \left(n-\frac n2\right)\cdot\frac23 = \frac n6 + \frac n3 = \frac12 n = EZ. $$
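The identity $E[Z|Y=y] = y/3 + 2(n-y)/3$ is easy to check numerically; the following is a simulation sketch (not part of the original notes).

```python
import numpy as np

# Sketch only: check E[Z | Y = y] = y/3 + 2(n - y)/3 from Example 5.18.
rng = np.random.default_rng(1)
n, trials = 10, 400_000
rolls = rng.integers(1, 7, size=(trials, n))
Y = np.isin(rolls, [1, 3, 5]).sum(axis=1)    # number of odd rolls
Z = np.isin(rolls, [3, 4, 6]).sum(axis=1)    # number of rolls in {3, 4, 6}

for y in (0, 3, 7, 10):
    print(y, Z[Y == y].mean(), y / 3 + 2 * (n - y) / 3)
```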

The next lemma generalizes this result.

Lemma 5.19. Suppose that $X_i:\Omega\to S$ for $1\le i\le n$ are i.i.d. random functions into a discrete set $S$. Given a subset $A\subset S$ let
$$ Z_A := \sum_{i=1}^n 1_A(X_i) = \#\{i : X_i\in A\}. $$
If $B$ is another subset of $S$, then
$$ E[Z_A|Z_B] = Z_B\cdot P(X_1\in A|X_1\in B) + (n-Z_B)\cdot P(X_1\in A|X_1\notin B). \tag{5.4} $$

Proof. Intuitively, for a typical trial there are $Z_B$ of the $X_i$ in $B$ and for these $i$ we have $E[1_A(X_i)|X_i\in B] = P(X_1\in A|X_1\in B)$. Likewise there are $n-Z_B$ of the $X_i$ in $S\setminus B$ and for these $i$ we have $E[1_A(X_i)|X_i\notin B] = P(X_1\in A|X_1\notin B)$. On these grounds we are quickly led to Eq. (5.4).

To prove Eq. (5.4) rigorously we will compute $E[Z_A|Z_B=m]$ by partitioning $\{Z_B=m\}$ as $\cup_\Lambda Q_\Lambda$, where $\Lambda$ runs through subsets of $\{1,\dots,n\}$ with $m$ elements and
$$ Q_\Lambda = (\cap_{i\in\Lambda}\{X_i\in B\})\cap(\cap_{i\in\Lambda^c}\{X_i\notin B\}). $$
Then according to Proposition 5.17,
$$ E[Z_A|Q_\Lambda] = E\Big[\sum_{i=1}^n 1_A(Y_i)\Big], $$
where the $Y_i$ are independent and
$$ P(Y_i=s) = P(X_i=s|X_i\in B) = P(X_1=s|X_1\in B)\ \text{for } i\in\Lambda, \qquad P(Y_i=s) = P(X_i=s|X_i\notin B) = P(X_1=s|X_1\notin B)\ \text{for } i\notin\Lambda. $$
Therefore,
$$ E[Z_A|Q_\Lambda] = \sum_{i=1}^n E1_A(Y_i) = \sum_{i\in\Lambda}P(X_1\in A|X_1\in B) + \sum_{i\notin\Lambda}P(X_1\in A|X_1\notin B) = m\cdot P(X_1\in A|X_1\in B) + (n-m)\cdot P(X_1\in A|X_1\notin B). $$
As the result is independent of the choice of $\Lambda$ with $\#(\Lambda)=m$, we may use Proposition 5.16 to conclude that
$$ E[Z_A|Z_B=m] = m\cdot P(X_1\in A|X_1\in B) + (n-m)\cdot P(X_1\in A|X_1\notin B). $$
As $0\le m\le n$ is arbitrary, Eq. (5.4) follows.

As a check notice that $EZ_A = n\cdot P(X_1\in A)$ while
$$ \begin{aligned}
E\,E[Z_A|Z_B] &= EZ_B\cdot P(X_1\in A|X_1\in B) + E(n-Z_B)\cdot P(X_1\in A|X_1\notin B) \\
&= n\,P(X_1\in B)\,P(X_1\in A|X_1\in B) + (n-n\,P(X_1\in B))\,P(X_1\in A|X_1\notin B) \\
&= n\left[P(X_1\in A|X_1\in B)P(X_1\in B) + P(X_1\in A|X_1\notin B)P(X_1\notin B)\right] \\
&= n\left[P(X_1\in A, X_1\in B) + P(X_1\in A, X_1\notin B)\right] = n\,P(X_1\in A) = EZ_A.
\end{aligned} $$

5.2 General Properties of Conditional Expectation

Let us pause for a moment to record a few basic general properties of conditional expectations.

Proposition 5.20 (Contraction Property). For all $Y\in L^2(P)$, we have $E|E[Y|X]|\le E|Y|$. Moreover if $Y\ge 0$ then $E[Y|X]\ge 0$ (a.s.).

Proof. Let $E[Y|X] = h(X)$ (with $h:S\to\mathbb{R}$) and then define
$$ f(x) = \begin{cases}1 & \text{if } h(x)\ge 0 \\ -1 & \text{if } h(x)<0.\end{cases} $$
Since $h(x)f(x) = |h(x)|$, it follows from Eq. (5.3) that
$$ E[|h(X)|] = E[Yf(X)] = |E[Yf(X)]| \le E[|Yf(X)|] = E|Y|. $$
For the second assertion take $f(x) = 1_{h(x)<0}$ in Eq. (5.3) in order to learn
$$ E[h(X)1_{h(X)<0}] = E[Y1_{h(X)<0}] \ge 0. $$
As $h(X)1_{h(X)<0}\le 0$ we may conclude that $h(X)1_{h(X)<0} = 0$ a.s.

Because of this proposition we may extend the notion of conditional expectation to $Y\in L^1(P)$ as stated in the following theorem which we do not bother to prove here.

Theorem 5.21. Given $X:\Omega\to S$ and $Y\in L^1(P)$, there exists an "essentially unique" function $h:S\to\mathbb{R}$ such that Eq. (5.3) holds for all bounded functions $f:S\to\mathbb{R}$. (As above we write $E[Y|X]$ for $h(X)$.) Moreover the contraction property, $E|E[Y|X]|\le E|Y|$, still holds.

Theorem 5.22 (Basic properties). Let $Y, Y_1$, and $Y_2$ be integrable random variables and $X:\Omega\to S$ be given. Then:

1. $E(Y_1+Y_2|X) = E(Y_1|X)+E(Y_2|X)$.
2. $E(aY|X) = aE(Y|X)$ for all constants $a$.
3. $E(g(X)Y|X) = g(X)E(Y|X)$ for all bounded functions $g$.
4. $E(E(Y|X)) = EY$. (Law of total expectation.)
5. If $Y$ and $X$ are independent then $E(Y|X) = EY$.

Proof. 1. Let $h_i(X) = E[Y_i|X]$; then for all bounded $f$,
$$ E[Y_1f(X)] = E[h_1(X)f(X)] \quad\text{and}\quad E[Y_2f(X)] = E[h_2(X)f(X)], $$
and adding these two equations together implies
$$ E[(Y_1+Y_2)f(X)] = E[(h_1(X)+h_2(X))f(X)] = E[(h_1+h_2)(X)f(X)] $$
for all bounded $f$. Therefore we may conclude that
$$ E(Y_1+Y_2|X) = (h_1+h_2)(X) = h_1(X)+h_2(X) = E(Y_1|X)+E(Y_2|X). $$
2. The proof is similar to 1 but easier and so is omitted.
3. Let $h(X) = E[Y|X]$, so that $E[Yf(X)] = E[h(X)f(X)]$ for all bounded functions $f$. Replacing $f$ by $g\cdot f$ implies
$$ E[Yg(X)f(X)] = E[h(X)g(X)f(X)] = E[(h\cdot g)(X)f(X)] $$
for all bounded functions $f$. Therefore we may conclude that
$$ E[Yg(X)|X] = (h\cdot g)(X) = h(X)g(X) = g(X)E(Y|X). $$
4. Take $f\equiv 1$ in Eq. (5.3).
5. If $X$ and $Y$ are independent and $\mu := E[Y]$, then
$$ E[Yf(X)] = E[Y]E[f(X)] = \mu E[f(X)] = E[\mu f(X)], $$
from which it follows that $E[Y|X] = \mu$ as desired.

The next theorem says that conditional expectations essentially depend only on the distribution of $(X,Y)$ and nothing else.

Theorem 5.23. Suppose that $(X,Y)$ and $(\tilde X,\tilde Y)$ are random vectors such that $(X,Y)\overset{d}{=}(\tilde X,\tilde Y)$, i.e. $E[f(X,Y)] = E[f(\tilde X,\tilde Y)]$ for all bounded (or non-negative) functions $f$. If $h(X) = E[u(X,Y)|X]$, then $E[u(\tilde X,\tilde Y)|\tilde X] = h(\tilde X)$.

Proof. By assumption we know that
$$ E[u(X,Y)f(X)] = E[h(X)f(X)] \quad\text{for all bounded } f. $$
Since $(X,Y)\overset{d}{=}(\tilde X,\tilde Y)$, this is equivalent to
$$ E[u(\tilde X,\tilde Y)f(\tilde X)] = E[h(\tilde X)f(\tilde X)] \quad\text{for all bounded } f, $$
which is equivalent to $E[u(\tilde X,\tilde Y)|\tilde X] = h(\tilde X)$.

Example 5.24. Let $\{X_i\}_{i=1}^\infty$ be i.i.d. random variables with $E|X_i|<\infty$ for all $i$ and let $S_m := X_1+\dots+X_m$ for $m=1,2,\dots$. We wish to show
$$ E[S_m|S_n] = \frac mn S_n \quad\text{for all } m\le n. $$
To prove this, first observe by symmetry¹ that
$$ E(X_i|S_n) = h(S_n) \text{ independent of } i. $$
Therefore
$$ S_n = E(S_n|S_n) = \sum_{i=1}^n E(X_i|S_n) = \sum_{i=1}^n h(S_n) = n\cdot h(S_n). $$
Thus we see that
$$ E(X_i|S_n) = \frac1n S_n $$
and therefore
$$ E(S_m|S_n) = \sum_{i=1}^m E(X_i|S_n) = \sum_{i=1}^m\frac1n S_n = \frac mn S_n. $$
If $m>n$, then $S_m = S_n + X_{n+1}+\dots+X_m$. Since $X_i$ is independent of $S_n$ for $i>n$, it follows that
$$ E(S_m|S_n) = E(S_n|S_n) + E(X_{n+1}|S_n) + \dots + E(X_m|S_n) = S_n + (m-n)\mu \quad\text{for } m\ge n, $$
where $\mu = EX_i$.

¹ Apply Theorem 5.23 using $(X_1, S_n)\overset{d}{=}(X_i, S_n)$ for $1\le i\le n$.
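A quick simulation sketch (not part of the original notes) illustrating $E[S_m|S_n] = \frac mn S_n$ for i.i.d. die rolls, by averaging $S_m$ over trials with a fixed value of $S_n$:

```python
import numpy as np

# Sketch only: check E[S_m | S_n] = (m/n) S_n by conditioning on a few values of S_n.
rng = np.random.default_rng(2)
m, n, trials = 3, 10, 500_000
X = rng.integers(1, 7, size=(trials, n))       # i.i.d. die rolls
Sm, Sn = X[:, :m].sum(axis=1), X.sum(axis=1)

for s in (25, 35, 45):                          # a few observed values of S_n
    print(s, Sm[Sn == s].mean(), m / n * s)
```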

Example 5.25 (See Durrett, #8, p. 213). Suppose that $X$ and $Y$ are two integrable random variables such that
$$ E[X|Y] = 18-\frac35 Y \quad\text{and}\quad E[Y|X] = 10-\frac13 X. $$
We would like to find $EX$ and $EY$. To do this we use the law of total expectation to find
$$ EX = E\,E[X|Y] = E\left(18-\frac35 Y\right) = 18-\frac35 EY \quad\text{and}\quad EY = E\,E[Y|X] = E\left(10-\frac13 X\right) = 10-\frac13 EX. $$
Solving this pair of linear equations shows $EX = 15$ and $EY = 5$.

5.3 Conditional Expectation for Continuous Random Variables

(This section will be covered later in the course when first needed.)

Suppose that $Y$ and $X$ are continuous random variables which have a joint density, $\rho_{(Y,X)}(y,x)$. Then by definition of $\rho_{(Y,X)}$ we have, for all bounded or non-negative $U$, that
$$ E[U(Y,X)] = \int\!\!\int U(y,x)\,\rho_{(Y,X)}(y,x)\,dy\,dx. \tag{5.5} $$
The marginal density associated to $X$ is then given by
$$ \rho_X(x) := \int\rho_{(Y,X)}(y,x)\,dy, \tag{5.6} $$
and recall from Math 180A that the conditional density $\rho_{(Y|X)}(y,x)$ is defined by
$$ \rho_{(Y|X)}(y,x) = \begin{cases}\frac{\rho_{(Y,X)}(y,x)}{\rho_X(x)} & \text{if }\rho_X(x)>0 \\ 0 & \text{if }\rho_X(x)=0.\end{cases} \tag{5.7} $$
Observe that if $\rho_{(Y,X)}(y,x)$ is continuous, then
$$ \rho_{(Y,X)}(y,x) = \rho_{(Y|X)}(y,x)\,\rho_X(x) \quad\text{for all } (x,y). \tag{5.8} $$
Indeed, if $\rho_X(x)=0$, then
$$ 0 = \rho_X(x) = \int\rho_{(Y,X)}(y,x)\,dy, $$
from which it follows that $\rho_{(Y,X)}(y,x)=0$ for all $y$. If $\rho_{(Y,X)}$ is not continuous, Eq. (5.8) still holds for "a.e." $(x,y)$, which is good enough.

Lemma 5.26. In the notation above,
$$ \rho(x,y) = \rho_{(Y|X)}(y,x)\,\rho_X(x) \quad\text{for a.e. } (x,y). \tag{5.9} $$

Proof. By definition, Eq. (5.9) holds when $\rho_X(x)>0$, and $\rho(x,y)\ge\rho_{(Y|X)}(y,x)\rho_X(x)$ for all $(x,y)$. Moreover,
$$ \int\!\!\int\rho_{(Y|X)}(y,x)\rho_X(x)\,dx\,dy = \int\!\!\int\rho_{(Y|X)}(y,x)\rho_X(x)1_{\rho_X(x)>0}\,dx\,dy = \int\!\!\int\rho(x,y)1_{\rho_X(x)>0}\,dx\,dy = \int\rho_X(x)1_{\rho_X(x)>0}\,dx = \int\rho_X(x)\,dx = 1 = \int\!\!\int\rho(x,y)\,dx\,dy, $$
or equivalently,
$$ \int\!\!\int\left[\rho(x,y)-\rho_{(Y|X)}(y,x)\rho_X(x)\right]dx\,dy = 0, $$
which implies the result.


Theorem 5.27. Keeping the notation above, for all bounded or non-negative $U$ we have $E[U(Y,X)|X] = h(X)$ where
$$ h(x) = \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy \tag{5.10} $$
$$ = \begin{cases}\dfrac{\int U(y,x)\rho_{(Y,X)}(y,x)\,dy}{\int\rho_{(Y,X)}(y,x)\,dy} & \text{if }\int\rho_{(Y,X)}(y,x)\,dy>0 \\ 0 & \text{otherwise.}\end{cases} \tag{5.11} $$
In the future we will usually denote $h(x)$ informally by $E[U(Y,x)|X=x]$,² so that
$$ E[U(Y,x)|X=x] := \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy. \tag{5.12} $$

² Warning: this is not consistent with Eq. (5.1), as $P(X=x)=0$ for continuous distributions.

Proof. We are looking for $h:S\to\mathbb{R}$ such that
$$ E[U(Y,X)f(X)] = E[h(X)f(X)] \quad\text{for all bounded } f. $$
Using Lemma 5.26, we find
$$ E[U(Y,X)f(X)] = \int\!\!\int U(y,x)f(x)\rho_{(Y,X)}(y,x)\,dy\,dx = \int\!\!\int U(y,x)f(x)\rho_{(Y|X)}(y,x)\rho_X(x)\,dy\,dx = \int\Big[\int U(y,x)\rho_{(Y|X)}(y,x)\,dy\Big]f(x)\rho_X(x)\,dx = \int h(x)f(x)\rho_X(x)\,dx = E[h(X)f(X)], $$
where $h$ is given as in Eq. (5.10).

Example 5.28 (Durrett 8.15, p. 145). Suppose that $X$ and $Y$ have joint density $\rho(x,y) = 8xy\cdot 1_{0<y<x<1}$. We wish to compute $E[u(X,Y)|Y]$. To this end we compute
$$ \rho_Y(y) = \int_{\mathbb{R}}8xy\cdot 1_{0<y<x<1}\,dx = 8y\int_y^1 x\,dx = 8y\cdot\frac{x^2}{2}\Big|_y^1 = 4y(1-y^2). $$
Therefore,
$$ \rho_{X|Y}(x,y) = \frac{\rho(x,y)}{\rho_Y(y)} = \frac{8xy\cdot 1_{0<y<x<1}}{4y(1-y^2)} = \frac{2x\cdot 1_{0<y<x<1}}{1-y^2}, $$
and so
$$ E[u(X,Y)|Y=y] = \int_{\mathbb{R}}\frac{2x\cdot 1_{0<y<x<1}}{1-y^2}u(x,y)\,dx = \frac{2\cdot 1_{0<y<1}}{1-y^2}\int_y^1 u(x,y)\,x\,dx, $$
and so
$$ E[u(X,Y)|Y] = \frac{2}{1-Y^2}\int_Y^1 u(x,Y)\,x\,dx $$
is the best approximation to $u(X,Y)$ by a function of $Y$ alone.
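For $u(x,y)=x$ the formula above gives $E[X|Y=y] = \frac{2(1-y^3)}{3(1-y^2)}$; the following is a simulation sketch (not part of the original notes) that checks this value by rejection sampling from the joint density.

```python
import numpy as np

# Sketch only: sample from the density 8xy on {0 < y < x < 1} by rejection and
# compare the conditional mean of X near y0 with 2(1 - y0^3)/(3(1 - y0^2)).
rng = np.random.default_rng(3)
pts = rng.uniform(size=(4_000_000, 2))
x, y = pts[:, 0], pts[:, 1]
keep = (y < x) & (rng.uniform(size=x.size) < x * y)   # accept with prob. proportional to 8xy
x, y = x[keep], y[keep]

y0, eps = 0.5, 0.01
est = x[np.abs(y - y0) < eps].mean()
exact = 2 * (1 - y0**3) / (3 * (1 - y0**2))
print(est, exact)
```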

Proposition 5.29. Suppose that $X, Y$ are independent random functions; then
$$ E[U(Y,X)|X] = h(X), \quad\text{where}\quad h(x) := E[U(Y,x)]. $$

Proof. I will prove this in the continuous distribution case and leave the discrete case to the reader. (The theorem is true in general but requires measure theory in order to prove it in full generality.) The independence assumption is equivalent to $\rho_{(Y,X)}(y,x) = \rho_Y(y)\rho_X(x)$. Therefore,
$$ \rho_{(Y|X)}(y,x) = \begin{cases}\rho_Y(y) & \text{if }\rho_X(x)>0 \\ 0 & \text{if }\rho_X(x)=0,\end{cases} $$
and therefore $E[U(Y,X)|X] = h_0(X)$ where
$$ h_0(x) = \int U(y,x)\rho_{(Y|X)}(y,x)\,dy = 1_{\rho_X(x)>0}\int U(y,x)\rho_Y(y)\,dy = 1_{\rho_X(x)>0}E[U(Y,x)] = 1_{\rho_X(x)>0}h(x). $$
If $f$ is a bounded function of $x$, then
$$ E[h_0(X)f(X)] = \int h_0(x)f(x)\rho_X(x)\,dx = \int_{\{x:\rho_X(x)>0\}}h_0(x)f(x)\rho_X(x)\,dx = \int_{\{x:\rho_X(x)>0\}}h(x)f(x)\rho_X(x)\,dx = \int h(x)f(x)\rho_X(x)\,dx = E[h(X)f(X)]. $$
So for all practical purposes, $h(X) = h_0(X)$, i.e. $h(X) = h_0(X)$ a.s. (Indeed, take $f(x) = \operatorname{sgn}(h(x)-h_0(x))$ in the above equation to learn that $E|h(X)-h_0(X)| = 0$.)


Theorem 5.30 (Iterated conditioning 1). Let $X, Y$, and $Z$ be random vectors and suppose that $(X,Y)$ is distributed according to $\rho_{(X,Y)}(x,y)\,dx\,dy$. Then
$$ E[Z|Y=y] = \int E[Z|Y=y, X=x]\,\rho_{X|Y}(x,y)\,dx \quad\text{for }\rho_Y(y)\,dy\text{-a.e. } y. $$

Proof. Let $h(x,y) := E[Z|Y=y, X=x]$ so that
$$ E[Zv(X,Y)] = E[h(X,Y)v(X,Y)] \quad\text{for all } v. $$
Taking $v(x,y) = g(y)$ to be a function of $Y$ alone shows
$$ E[Zg(Y)] = E[h(X,Y)g(Y)] \quad\text{for all } g. $$
Thus it follows that
$$ E[Z|Y=y] = E[h(X,Y)|Y=y] = \int h(x,y)\,\rho_{X|Y}(x,y)\,dx = \int E[Z|Y=y, X=x]\,\rho_{X|Y}(x,y)\,dx. $$

Remark 5.31. Often, $E[Z|Y=y_0]$ may be computed as
$$ \lim_{\varepsilon\downarrow 0}E[Z\,|\,|Y-y_0|<\varepsilon]. $$
To understand this formula, suppose that $h(y) := E[Z|Y=y]$ and $\rho_Y(y)$ are continuous near $y_0$ and $\rho_Y(y_0)>0$. Then
$$ E[Z\,|\,|Y-y_0|<\varepsilon] = \frac{E[Z:|Y-y_0|<\varepsilon]}{P(|Y-y_0|<\varepsilon)} = \frac{E[h(Y)1_{|Y-y_0|<\varepsilon}]}{P(|Y-y_0|<\varepsilon)} = \frac{\int h(y)1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy} \to h(y_0)\text{ as }\varepsilon\downarrow 0, $$
wherein we have used $h(y)\cong h(y_0)$ for $y$ near $y_0$ and therefore
$$ \frac{\int h(y)1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy} \cong \frac{\int h(y_0)1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy}{\int 1_{|y-y_0|<\varepsilon}\rho_Y(y)\,dy} = h(y_0). $$

Here is a consequence of this result.

Theorem 5.32 (Iterated conditioning 2). Suppose that $\Omega$ is partitioned into disjoint sets $\{A_i\}_{i=1}^n$ and $y_0$ is given such that the following limits exist:
$$ P(A_i|Y=y_0) := \lim_{\varepsilon\downarrow 0}P(A_i\,|\,|Y-y_0|<\varepsilon) \quad\text{and}\quad E[Z|Y=y_0, A_i] := \lim_{\varepsilon\downarrow 0}E[Z\,|\,|Y-y_0|<\varepsilon, A_i]. $$
(In particular we are assuming that $P(|Y-y_0|<\varepsilon, A_i)>0$ for all $\varepsilon>0$.) Then
$$ E[Z|Y=y_0] = \sum_{i=1}^n E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0). $$

Proof. Since
$$ E[Z\,|\,|Y-y_0|<\varepsilon, A_i] = \frac{E[Z1_{A_i}1_{|Y-y_0|<\varepsilon}]}{P(A_i\cap\{|Y-y_0|<\varepsilon\})} = \frac{E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon]\,P(|Y-y_0|<\varepsilon)}{P(A_i\cap\{|Y-y_0|<\varepsilon\})} = \frac{E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon]}{P(A_i\,|\,|Y-y_0|<\varepsilon)}, $$
it follows that
$$ \lim_{\varepsilon\downarrow 0}E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow 0}\big(E[Z\,|\,|Y-y_0|<\varepsilon, A_i]\cdot P(A_i\,|\,|Y-y_0|<\varepsilon)\big) = E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0). $$
Moreover,
$$ \sum_{i=1}^n\lim_{\varepsilon\downarrow 0}E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow 0}\sum_{i=1}^n E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \lim_{\varepsilon\downarrow 0}E\Big[Z\sum_{i=1}^n 1_{A_i}\,\Big|\,|Y-y_0|<\varepsilon\Big] = \lim_{\varepsilon\downarrow 0}E[Z\,|\,|Y-y_0|<\varepsilon] = E[Z|Y=y_0], $$
and therefore
$$ E[Z|Y=y_0] = \sum_{i=1}^n\lim_{\varepsilon\downarrow 0}E[Z1_{A_i}\,|\,|Y-y_0|<\varepsilon] = \sum_{i=1}^n E[Z|Y=y_0, A_i]\,P(A_i|Y=y_0), $$
as claimed.


Remark 5.33 (An advanced remark). The last result is in fact a special case of Theorem 5.30 wherein we take $X = \sum_i i\,1_{A_i}$. In this case, any function $u(X,Y)$ may be written as
$$ u(X,Y) = \sum_i 1_{X=i}\,u(i,Y). $$
Thus if we let $h(x,y) := E[Z|(X=x, Y=y)]$, i.e. $h(X,Y) = E[Z|(X,Y)]$ a.s., then on one hand
$$ E[h(X,Y)u(X,Y)] = \sum_i E[h(X,Y)1_{X=i}u(i,Y)] = \sum_i E[h(i,Y)u(i,Y)1_{X=i}], $$
while on the other,
$$ E[h(X,Y)u(X,Y)] = E[Zu(X,Y)] = \sum_i E[Z1_{X=i}u(i,Y)]. $$
Taking $u(i,Y) = \delta_{i,j}v(Y)$ and comparing the resulting expressions shows
$$ E[h(j,Y)v(Y)1_{X=j}] = E[Z1_{X=j}v(Y)] \quad\text{for all } v, $$
and therefore that
$$ E[Z1_{X=j}|Y] = E[h(j,Y)1_{X=j}|Y] = h(j,Y)\cdot E[1_{X=j}|Y]. $$
Summing this equation on $j$ then shows
$$ E[Z|Y] = \sum_j E[Z1_{X=j}|Y] = \sum_j h(j,Y)\cdot E[1_{X=j}|Y], $$
which reads
$$ E[Z|Y=y] = \sum_j E[Z|X=j, Y=y]\cdot E[1_{X=j}|Y=y] = \sum_j E[Z|X=j, Y=y]\cdot P(X=j|Y=y) = \sum_j E[Z|A_j, Y=y]\cdot P(A_j|Y=y) \quad(\mu_Y\text{-a.s.}), $$
where $\mu_Y$ is the law of $Y$, i.e. $\mu_Y(A) := P(Y\in A)$.

Example 5.34. Suppose that $\{T_k\}_{k=1}^n$ are independent random times such that $P(T_k>t) = e^{-\lambda_k t}$ for all $t\ge 0$, for some $\lambda_k>0$. Let $\{\tilde T_k\}_{k=1}^n$ be the order statistics of the sequence, i.e. $\{\tilde T_k\}_{k=1}^n$ is the sequence $\{T_k\}_{k=1}^n$ arranged in increasing order, so that $\tilde T_1<\tilde T_2<\dots<\tilde T_n$. Further let $K=i$ on $\{\tilde T_1 = T_i\}$. Then
$$ E[f(\tilde T_2-\tilde T_1)|\tilde T_1=t] = \sum_i E[f(\tilde T_2-\tilde T_1)|\tilde T_1=t, K=i]\,P(K=i|\tilde T_1=t), $$
where $\{\tilde T_1=t, K=i\} = \{t=T_i<T_j\text{ for }j\ne i\}$ and therefore
$$ E[f(\tilde T_2-\tilde T_1)|\tilde T_1=t, K=i] = E[f(\tilde T_2-t)|\tilde T_1=t, K=i] = E\Big[f\Big(\min_{j\ne i}T_j-t\Big)\,\Big|\,t=T_i<\min_{j\ne i}T_j\Big] = E\Big[f\Big(\min_{j\ne i}T_j-t\Big)\,\Big|\,t<\min_{j\ne i}T_j\Big] = E\Big[f\Big(\min_{j\ne i}T_j\Big)\Big], $$
wherein we have used that $T_i$ is independent of $S := \min_{j\ne i}T_j \overset{d}{=} E(\lambda-\lambda_i)$, where $\lambda = \lambda_1+\dots+\lambda_n$. Since $S$ is an exponential random variable, $P(S>t+s|S>t) = P(S>s)$, i.e. $S-t$ under $P(\cdot|S>t)$ has the same distribution as $S$ under $P$. Thus we have shown
$$ E[f(\tilde T_2-\tilde T_1)|\tilde T_1=t] = \sum_i E\Big[f\Big(\min_{j\ne i}T_j\Big)\Big]\,P(K=i|\tilde T_1=t). $$
We now compute informally,
$$ P(K=i|\tilde T_1=t) = \frac{P(t=T_i<T_j\text{ for }j\ne i)}{P(t=\tilde T_1)} = \frac{e^{-(\lambda-\lambda_i)t}\cdot P(T_i=t)}{P(\tilde T_1=t)} = \frac{e^{-(\lambda-\lambda_i)t}\cdot\lambda_i e^{-\lambda_i t}\,dt}{\lambda e^{-\lambda t}\,dt} = \frac{\lambda_i}{\lambda}. $$
Here is the above computation done more rigorously:
$$ P(K=i\,|\,t<\tilde T_1\le t+\varepsilon) = \frac{P(T_i<T_j\text{ for }j\ne i,\ t<T_i\le t+\varepsilon)}{P(t<\tilde T_1\le t+\varepsilon)} = \frac{\int_t^{t+\varepsilon}P(\tau<T_j\text{ for }j\ne i)\,\lambda_i e^{-\lambda_i\tau}\,d\tau}{\int_t^{t+\varepsilon}\lambda e^{-\lambda\tau}\,d\tau} \xrightarrow{\varepsilon\downarrow 0} \frac{P(t<T_j\text{ for }j\ne i)\,\lambda_i e^{-\lambda_i t}}{\lambda e^{-\lambda t}} = \frac{e^{-(\lambda-\lambda_i)t}\lambda_i e^{-\lambda_i t}}{\lambda e^{-\lambda t}} = \frac{\lambda_i}{\lambda}. $$
In summary we have shown
$$ E[f(\tilde T_2-\tilde T_1)|\tilde T_1=t] = \sum_i E\Big[f\Big(\min_{j\ne i}T_j\Big)\Big]\frac{\lambda_i}{\lambda}. $$


5.4 Conditional Variances

Definition 5.35 (Conditional Variance). Suppose that $Y\in L^2(P)$ and $X:\Omega\to S$ are given. We define
$$ Var(Y|X) = E[Y^2|X]-(E[Y|X])^2 \tag{5.13} $$
$$ = E[(Y-E[Y|X])^2|X] \tag{5.14} $$
to be the conditional variance of $Y$ given $X$.

Theorem 5.36. Suppose that $Y\in L^2(P)$ and $X:\Omega\to S$ are given; then
$$ Var(Y) = E[Var(Y|X)] + Var(E[Y|X]). $$

Proof. Taking expectations of Eq. (5.13) implies
$$ E[Var(Y|X)] = E\,E[Y^2|X]-E(E[Y|X])^2 = EY^2-E(E[Y|X])^2 = Var(Y)+(EY)^2-E(E[Y|X])^2. $$
The result follows from this identity and the fact that
$$ Var(E[Y|X]) = E(E[Y|X])^2-(E\,E[Y|X])^2 = E(E[Y|X])^2-(EY)^2. $$
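The following is a simulation sketch (not part of the original notes) of Theorem 5.36 for a model that reappears in Example 6.3: $X$ uniform on $\{1,\dots,6\}$ and, given $X=n$, $Y\sim\mathrm{Binomial}(n,1/2)$, so that $E[Var(Y|X)] = EX/4$ and $Var(E[Y|X]) = Var(X)/4$.

```python
import numpy as np

# Sketch only: law of total variance, Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
rng = np.random.default_rng(4)
trials = 1_000_000
X = rng.integers(1, 7, size=trials)
Y = rng.binomial(X, 0.5)

var_Y = Y.var()
total = X.mean() / 4 + X.var() / 4     # E[X]/4 + Var(X)/4 for this model
print(var_Y, total)                     # both approximately 77/48 = 1.604...
```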

5.5 Summary on Conditional Expectation Properties

Let $Y$ and $X$ be random variables such that $EY^2<\infty$ and let $h$ be a function from the range of $X$ to $\mathbb{R}$. Then the following are equivalent:

1. $h(X) = E(Y|X)$, i.e. $h(X)$ is the conditional expectation of $Y$ given $X$.
2. $E(Y-h(X))^2 \le E(Y-g(X))^2$ for all functions $g$, i.e. $h(X)$ is the best approximation to $Y$ among functions of $X$.
3. $E(Y\cdot g(X)) = E(h(X)\cdot g(X))$ for all functions $g$, i.e. $Y-h(X)$ is orthogonal to all functions of $X$. Moreover, this condition uniquely determines $h(X)$.

The methods for computing $E(Y|X)$ are given in the next two propositions.

Proposition 5.37 (Discrete Case). Suppose that $Y$ and $X$ are discrete random variables and $p(y,x) := P(Y=y, X=x)$. Then $E(Y|X) = h(X)$, where
$$ h(x) = E(Y|X=x) = \frac{E(Y:X=x)}{P(X=x)} = \frac{1}{p_X(x)}\sum_y y\,p(y,x) \tag{5.15} $$
and $p_X(x) = P(X=x)$ is the marginal distribution of $X$, which may be computed as $p_X(x) = \sum_y p(y,x)$.

Proposition 5.38 (Continuous Case). Suppose that $Y$ and $X$ are random variables which have a joint probability density $\rho(y,x)$ (i.e. $P(Y\in dy, X\in dx) = \rho(y,x)\,dy\,dx$). Then $E(Y|X) = h(X)$, where
$$ h(x) = E(Y|X=x) := \frac{1}{\rho_X(x)}\int_{-\infty}^\infty y\,\rho(y,x)\,dy \tag{5.16} $$
and $\rho_X(x)$ is the marginal density of $X$, which may be computed as
$$ \rho_X(x) = \int_{-\infty}^\infty\rho(y,x)\,dy. $$

Intuitively, in all cases, $E(Y|X)$ on the set $\{X=x\}$ is $E(Y|X=x)$. This intuition should help motivate some of the basic properties of $E(Y|X)$ summarized in the next theorem.

Theorem 5.39. Let $Y, Y_1, Y_2$ and $X$ be random variables. Then:

1. $E(Y_1+Y_2|X) = E(Y_1|X)+E(Y_2|X)$.
2. $E(aY|X) = aE(Y|X)$ for all constants $a$.
3. $E(f(X)Y|X) = f(X)E(Y|X)$ for all functions $f$.
4. $E(E(Y|X)) = EY$.
5. If $Y$ and $X$ are independent then $E(Y|X) = EY$.
6. If $Y\ge 0$ then $E(Y|X)\ge 0$.

Remark 5.40. Property 4 in Theorem 5.39 turns out to be a very powerful method for computing expectations. I will finish this summary by writing out Property 4 in the discrete and continuous cases:
$$ EY = \sum_x E(Y|X=x)\,p_X(x) \quad\text{(Discrete Case)} $$
where
$$ E(Y|X=x) = \begin{cases}\frac{E(Y1_{X=x})}{P(X=x)} & \text{if } P(X=x)>0 \\ 0 & \text{otherwise,}\end{cases} $$
and
$$ E[U(Y,X)] = \int E(U(Y,X)|X=x)\,\rho_X(x)\,dx \quad\text{(Continuous Case)} $$
where
$$ E[U(Y,x)|X=x] := \int U(y,x)\,\rho_{(Y|X)}(y,x)\,dy $$
and
$$ \rho_{(Y|X)}(y,x) = \begin{cases}\frac{\rho_{(Y,X)}(y,x)}{\rho_X(x)} & \text{if }\rho_X(x)>0 \\ 0 & \text{if }\rho_X(x)=0.\end{cases} $$


6 Random Sums

Suppose that $\{X_i\}_{i=1}^\infty$ is a collection of random variables and let
$$ S_n := \begin{cases}X_1+\dots+X_n & \text{if } n\ge 1 \\ 0 & \text{if } n=0.\end{cases} $$
Given a $\mathbb{Z}_+$-valued random variable $N$, we wish to consider the random sum
$$ S_N = X_1+\dots+X_N. $$
We are now going to suppose for the rest of this subsection that $N$ is independent of $\{X_i\}_{i=1}^\infty$, and for $f\ge 0$ we let
$$ Tf(n) := E[f(S_n)] \quad\text{for all } n\in\mathbb{N}_0. $$

Theorem 6.1. Suppose that $N$ is independent of $\{X_i\}_{i=1}^\infty$ as above. Then for any positive function $f$ we have
$$ E[f(S_N)] = E[Tf(N)]. $$
Moreover this formula holds for any $f$ such that
$$ E[|f(S_N)|] = E[T|f|(N)] < \infty. $$

Proof. If $f\ge 0$ we have
$$ E[f(S_N)] = \sum_{n=0}^\infty E[f(S_N):N=n] = \sum_{n=0}^\infty E[f(S_n):N=n] = \sum_{n=0}^\infty E[f(S_n)]\,P(N=n) = \sum_{n=0}^\infty(Tf)(n)\,P(N=n) = E[Tf(N)]. $$
The moreover part follows from general nonsense not really covered in this course.

Theorem 6.2. Suppose that $\{X_i\}_{i=1}^\infty$ are uncorrelated $L^2(P)$ random variables with $\mu = EX_i$ and $\sigma^2 = Var(X_i)$ independent of $i$. Assuming that $N\in L^2(P)$ is independent of the $\{X_i\}$, then
$$ E[S_N] = \mu\cdot EN \tag{6.1} $$
and
$$ Var(S_N) = \sigma^2 E[N] + \mu^2 Var(N). \tag{6.2} $$

Proof. Taking $f(x)=x$ in Theorem 6.1 and using $Tf(n) = E[S_n] = n\mu$, we find
$$ E[S_N] = E[\mu N] = \mu\cdot EN, $$
as claimed. Next take $f(x)=x^2$ in Theorem 6.1; using
$$ Tf(n) = E[S_n^2] = Var(S_n)+(ES_n)^2 = \sigma^2 n+(n\mu)^2, $$
we find that
$$ E[S_N^2] = E[\sigma^2 N+\mu^2 N^2] = \sigma^2 E[N]+\mu^2 E[N^2]. $$
Combining these results shows
$$ Var(S_N) = \sigma^2 E[N]+\mu^2 E[N^2]-\mu^2(EN)^2 = \sigma^2 E[N]+\mu^2 Var(N). $$

Example 6.3 (Karlin and Taylor E.3.1. p77). A six-sided die is rolled, and the number $N$ on the uppermost face is recorded. Then a fair coin is tossed $N$ times, and the total number $Z$ of heads to appear is observed. Determine the mean and variance of $Z$ by viewing $Z$ as a random sum of $N$ Bernoulli random variables. Determine the probability mass function of $Z$, and use it to find the mean and variance of $Z$.

We have $Z = S_N = X_1+\dots+X_N$ where $X_i=1$ if heads on the $i$th toss and zero otherwise. In this case
$$ EX_1 = \frac12,\qquad Var(X_1) = \frac12-\left(\frac12\right)^2 = \frac14, $$
$$ EN = \frac16(1+\dots+6) = \frac16\cdot\frac{7\cdot 6}{2} = \frac72,\qquad EN^2 = \frac16(1^2+2^2+3^2+4^2+5^2+6^2) = \frac{91}{6},\qquad Var(N) = \frac{91}{6}-\left(\frac72\right)^2 = \frac{35}{12}. $$
Therefore,
$$ EZ = EX_1\cdot EN = \frac12\cdot\frac72 = \frac74 \qquad\text{and}\qquad Var(Z) = \frac14\cdot\frac72+\left(\frac12\right)^2\cdot\frac{35}{12} = \frac{77}{48} = 1.6042. $$
Alternatively, we have
$$ P(Z=k) = \sum_{n=1}^6 P(Z=k|N=n)P(N=n) = \frac16\sum_{n=k\vee 1}^6 P(Z=k|N=n) = \frac16\sum_{n=k\vee 1}^6\binom nk\left(\frac12\right)^n, $$
from which
$$ EZ = \sum_{k=0}^6 k\,P(Z=k) = \sum_{k=1}^6 k\,\frac16\sum_{n=k}^6\binom nk\left(\frac12\right)^n = \frac74 \qquad\text{and}\qquad EZ^2 = \sum_{k=0}^6 k^2P(Z=k) = \sum_{k=1}^6 k^2\,\frac16\sum_{n=k}^6\binom nk\left(\frac12\right)^n = \frac{14}{3}, $$
so that
$$ Var(Z) = \frac{14}{3}-\left(\frac74\right)^2 = \frac{77}{48}. $$
Explicitly,
$$ P(Z=0) = \frac16\sum_{n=1}^6\binom n0\left(\frac12\right)^n = \frac{21}{128},\qquad P(Z=1) = \frac16\sum_{n=1}^6\binom n1\left(\frac12\right)^n = \frac{5}{16},\qquad P(Z=2) = \frac16\sum_{n=2}^6\binom n2\left(\frac12\right)^n = \frac{33}{128}, $$
$$ P(Z=3) = \frac16\sum_{n=3}^6\binom n3\left(\frac12\right)^n = \frac16,\qquad P(Z=4) = \frac16\sum_{n=4}^6\binom n4\left(\frac12\right)^n = \frac{29}{384},\qquad P(Z=5) = \frac16\sum_{n=5}^6\binom n5\left(\frac12\right)^n = \frac{1}{48},\qquad P(Z=6) = \frac16\binom 66\left(\frac12\right)^6 = \frac{1}{384}. $$
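The arithmetic in Example 6.3 is easy to verify exactly; the following is a short sketch (not part of the original notes) computing the pmf of $Z$, its mean and variance, and the value predicted by Eq. (6.2).

```python
from fractions import Fraction
from math import comb

# Sketch only: exact check of Example 6.3 and of Eq. (6.2) in Theorem 6.2.
half = Fraction(1, 2)
pmf = {k: Fraction(1, 6) * sum(comb(n, k) * half**n for n in range(max(k, 1), 7))
       for k in range(7)}

mean = sum(k * p for k, p in pmf.items())
var = sum(k * k * p for k, p in pmf.items()) - mean**2
print(sum(pmf.values()), mean, var)            # 1, 7/4, 77/48

mu, sig2 = Fraction(1, 2), Fraction(1, 4)      # moments of one coin toss
EN, VarN = Fraction(7, 2), Fraction(35, 12)    # moments of the die roll N
print(sig2 * EN + mu**2 * VarN)                # 77/48 again, as Eq. (6.2) predicts
```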

Remark 6.4. If the $X_i$ are i.i.d., we may work out the moment generating function, $\mathrm{mgf}_{S_N}(t) := E[e^{tS_N}]$, as follows. Conditioning on $N=n$ shows
$$ E[e^{tS_N}|N=n] = E[e^{tS_n}|N=n] = E[e^{tS_n}] = [Ee^{tX_1}]^n = [\mathrm{mgf}_{X_1}(t)]^n, $$
so that
$$ E[e^{tS_N}|N] = [\mathrm{mgf}_{X_1}(t)]^N = e^{N\ln(\mathrm{mgf}_{X_1}(t))}. $$
Taking expectations of this equation using the law of total expectation gives
$$ \mathrm{mgf}_{S_N}(t) = \mathrm{mgf}_N(\ln(\mathrm{mgf}_{X_1}(t))). $$

Exercise 6.1 (Karlin and Taylor II.3.P2). For each given $p$, let $Z$ have a binomial distribution with parameters $p$ and $N$. Suppose that $N$ is itself binomially distributed with parameters $q$ and $M$. Formulate $Z$ as a random sum and show that $Z$ has a binomial distribution with parameters $pq$ and $M$.

Solution to Exercise (Karlin and Taylor II.3.P2). Let $\{X_i\}_{i=1}^\infty$ be i.i.d. Bernoulli random variables with $P(X_i=1)=p$ and $P(X_i=0)=1-p$. Then $Z\overset{d}{=}X_1+\dots+X_N$. We now compute
$$ \begin{aligned}
P(Z=k) &= \sum_{n=k}^M P(Z=k|N=n)P(N=n) = \sum_{l=0}^{M-k}P(Z=k|N=k+l)P(N=k+l) \\
&= \sum_{l=0}^{M-k}p^k(1-p)^l\binom{k+l}{k}\cdot\binom{M}{k+l}q^{k+l}(1-q)^{M-(k+l)} \\
&= (pq)^k\sum_{l=0}^{M-k}(1-p)^l\frac{M!}{k!\,l!\,(M-k-l)!}q^l(1-q)^{M-k-l} \\
&= \binom Mk(pq)^k\sum_{l=0}^{M-k}\frac{(M-k)!}{l!\,(M-k-l)!}[(1-p)q]^l(1-q)^{M-k-l} \\
&= \binom Mk(pq)^k\sum_{l=0}^{M-k}\binom{M-k}{l}[(1-p)q]^l(1-q)^{M-k-l} \\
&= \binom Mk(pq)^k[(1-p)q+(1-q)]^{M-k} = \binom Mk(pq)^k[1-pq]^{M-k},
\end{aligned} $$
as claimed. See pages 58-59 of the notes where this is carried out.

Alternatively. Let $\{\xi_i\}$ be i.i.d. Bernoulli random variables with parameter $q$ and $\{\eta_i\}$ be i.i.d. Bernoulli random variables with parameter $p$, independent of the $\{\xi_i\}$. Then let $N = \eta_1+\dots+\eta_M$ and $Z = \xi_1\eta_1+\dots+\xi_M\eta_M$. Notice that $\{\xi_i\eta_i\}_{i=1}^M$ are Bernoulli random variables with parameter $pq$, so that $Z$ is binomial with parameters $pq$ and $M$. Further, $N$ is binomial with parameters $p$ and $M$. Let $B(i_1,\dots,i_n)$ be the event where $\eta_{i_1}=\eta_{i_2}=\dots=\eta_{i_n}=1$ with all others being zero; then
$$ \{N=n\} = \cup_{i_1<\dots<i_n}B(i_1,\dots,i_n), $$
so that
$$ P(Z=k|N=n) = \frac{\sum_{i_1<\dots<i_n}P(\{Z=k\}\cap B(i_1,\dots,i_n))}{\sum_{i_1<\dots<i_n}P(B(i_1,\dots,i_n))} = \frac{\sum_{i_1<\dots<i_n}P(Z=k|B(i_1,\dots,i_n))P(B(i_1,\dots,i_n))}{\sum_{i_1<\dots<i_n}P(B(i_1,\dots,i_n))} = \frac{\sum_{i_1<\dots<i_n}\binom nk q^k(1-q)^{n-k}P(B(i_1,\dots,i_n))}{\sum_{i_1<\dots<i_n}P(B(i_1,\dots,i_n))} = \binom nk q^k(1-q)^{n-k}, $$
and this gives another, more intuitive proof of the result.


Part I

Discrete Time Markov Chains


7 Markov Chains Basics

For this chapter, let $S$ be a finite or at most countable state space and $p:S\times S\to[0,1]$ be a Markov kernel, i.e.
$$ \sum_{y\in S}p(x,y) = 1 \quad\text{for all } x\in S. \tag{7.1} $$
A probability on $S$ is a function $\pi:S\to[0,1]$ such that $\sum_{x\in S}\pi(x)=1$. Further, let $\mathbb{N}_0 = \mathbb{N}\cup\{0\}$,
$$ \Omega := S^{\mathbb{N}_0} = \{\omega = (s_0, s_1,\dots) : s_j\in S\}, $$
and for each $n\in\mathbb{N}_0$, let $X_n:\Omega\to S$ be given by
$$ X_n(s_0, s_1,\dots) = s_n. $$

Notation 7.1 We will denote $(X_0, X_1, X_2,\dots)$ by $X$.

Definition 7.2 (Markov probabilities). A (time homogeneous) Markov probability¹, $P$, on $\Omega$ with transition kernel $p$ is a probability on $\Omega$ such that
$$ P(X_{n+1}=x_{n+1}|X_0=x_0, X_1=x_1,\dots,X_n=x_n) = P(X_{n+1}=x_{n+1}|X_n=x_n) = p(x_n, x_{n+1}), \tag{7.2} $$
where $\{x_j\}_{j=1}^{n+1}$ are allowed to range over $S$ and $n$ over $\mathbb{N}_0$. The identity in Eq. (7.2) is only to be checked for those $x_j\in S$ such that $P(X_0=x_0, X_1=x_1,\dots,X_n=x_n)>0$. (Poetically, a Markov chain does not remember its past; its future moves are determined only by its present location and not how it got there.)

¹ The set $\Omega$ is sufficiently big that it is no longer so easy to give a rigorous definition of a probability on $\Omega$. For the purposes of this class, a probability on $\Omega$ should be taken to mean an assignment $P(A)\in[0,1]$ for all subsets $A\subset\Omega$ such that $P(\emptyset)=0$, $P(\Omega)=1$, and
$$ P(A) = \sum_{n=1}^\infty P(A_n) $$
whenever $A = \cup_{n=1}^\infty A_n$ with $A_n\cap A_m = \emptyset$ for all $m\ne n$. (There are technical problems with this definition which are addressed in a course on "measure theory." We may safely ignore these problems here.)

If a Markov probability $P$ is given we will often refer to $\{X_n\}_{n=0}^\infty$ as a Markov chain. The condition in Eq. (7.2) may also be written as
$$ E[f(X_{n+1})\,|\,X_0, X_1,\dots,X_n] = E[f(X_{n+1})\,|\,X_n] = \sum_{y\in S}p(X_n, y)f(y) \tag{7.3} $$
for all $n\in\mathbb{N}_0$ and any bounded function $f:S\to\mathbb{R}$.

Proposition 7.3 (Markov joint distributions). If $P$ is a Markov probability as in Definition 7.2 and $\pi(x) := P(X_0=x)$, then for all $n\in\mathbb{N}_0$ and $\{x_j\}\subset S$,
$$ P(X_0=x_0,\dots,X_n=x_n) = \pi(x_0)\,p(x_0,x_1)\cdots p(x_{n-1},x_n). \tag{7.4} $$
Conversely if $\pi:S\to[0,1]$ is a probability and $\{X_n\}_{n=0}^\infty$ is a sequence of random variables satisfying Eq. (7.4) for all $n$ and $\{x_j\}\subset S$, then $(\{X_n\}, P, p)$ satisfies Definition 7.2.

Proof. ($\Longrightarrow$) This formal proof is by induction on $n$. I will do the cases $n=1$ and $n=2$ here. For $n=1$, if $\pi(x_0) = P(X_0=x_0) = 0$ then both sides of Eq. (7.4) are zero and there is nothing to prove. If $\pi(x_0) = P(X_0=x_0)>0$, then
$$ P(X_0=x_0, X_1=x_1) = P(X_1=x_1|X_0=x_0)P(X_0=x_0) = \pi(x_0)\cdot p(x_0,x_1). $$
Now for the case $n=2$. Let $c := P(X_0=x_0, X_1=x_1) = \pi(x_0)\cdot p(x_0,x_1)$. If $c=0$ then again both sides of Eq. (7.4) vanish, while if $c>0$ we have, by assumption and the case $n=1$, that
$$ \begin{aligned}
P(X_0=x_0, X_1=x_1, X_2=x_2) &= P(X_2=x_2|X_0=x_0, X_1=x_1)\cdot P(X_0=x_0, X_1=x_1) \\
&= P(X_2=x_2|X_1=x_1)\cdot P(X_0=x_0, X_1=x_1) \\
&= p(x_1,x_2)\cdot\pi(x_0)p(x_0,x_1) = \pi(x_0)p(x_0,x_1)p(x_1,x_2).
\end{aligned} $$
The formal induction argument is now left to the reader.

($\Longleftarrow$) If
$$ \pi(x_0)p(x_0,x_1)\cdots p(x_{n-1},x_n) = P(X_0=x_0,\dots,X_n=x_n)>0, $$
then by Eq. (7.4) and the definition of conditional probabilities we find
$$ P(X_{n+1}=x_{n+1}|X_0=x_0, X_1=x_1,\dots,X_n=x_n) = \frac{P(X_0=x_0,\dots,X_n=x_n, X_{n+1}=x_{n+1})}{P(X_0=x_0,\dots,X_n=x_n)} = \frac{\pi(x_0)p(x_0,x_1)\cdots p(x_{n-1},x_n)p(x_n,x_{n+1})}{\pi(x_0)p(x_0,x_1)\cdots p(x_{n-1},x_n)} = p(x_n,x_{n+1}), $$
as desired.

Fact 7.4 To each probability $\pi$ on $S$ there is a unique Markov probability, $P_\pi$, on $\Omega$ such that $P_\pi(X_0=x) = \pi(x)$ for all $x\in S$. Moreover, $P_\pi$ is uniquely determined by Eq. (7.4).

Notation 7.5 We will abbreviate the expectation $E_{P_\pi}$ with respect to $P_\pi$ by $E_\pi$. Moreover if
$$ \pi(y) = \delta_x(y) := \begin{cases}1 & \text{if } x=y \\ 0 & \text{if } x\ne y,\end{cases} \tag{7.5} $$
we will write $P_x$ for $P_\pi = P_{\delta_x}$ and $E_x$ for $E_{\delta_x}$.

For a general probability $\pi$ on $S$, it follows from Proposition 7.3 and Corollary 7.6 that
$$ P_\pi = \sum_{x\in S}\pi(x)P_x \quad\text{and}\quad E_\pi = \sum_{x\in S}\pi(x)E_x. \tag{7.6} $$

Corollary 7.6. If $\pi$ is a probability on $S$ and $u:S^{n+1}\to\mathbb{R}$ is a bounded or non-negative function, then
$$ E_\pi[u(X_0,\dots,X_n)] = \sum_{x_0,\dots,x_n\in S}u(x_0,\dots,x_n)\,\pi(x_0)p(x_0,x_1)\cdots p(x_{n-1},x_n). $$

Definition 7.7 (Matrix multiplication). If $q:S\times S\to[0,1]$ is another Markov kernel we let $p\cdot q:S\times S\to[0,1]$ be defined by
$$ (p\cdot q)(x,y) := \sum_{z\in S}p(x,z)q(z,y). \tag{7.7} $$
We also let
$$ p^n := \underbrace{p\cdot p\cdots p}_{n\text{ times}}. $$
If $\pi:S\to[0,1]$ is a probability we let $(\pi\cdot q):S\to[0,1]$ be defined by
$$ (\pi\cdot q)(y) := \sum_{x\in S}\pi(x)q(x,y). $$
As the definition suggests, $p\cdot q$ is the multiplication of matrices and $\pi\cdot q$ is the multiplication of a row vector $\pi$ with a matrix $q$. It is easy to check that $\pi\cdot q$ is still a probability and that $p\cdot q$ and $p^n$ are Markov kernels. A key point to keep in mind is that a Markov process is completely specified by its transition kernel, $p:S\times S\to[0,1]$. For example, we have the following method for computing $P_x(X_n=y)$.

Lemma 7.8. Keeping the above notation, $P_x(X_n=y) = p^n(x,y)$ and, more generally,
$$ P_\pi(X_n=y) = \sum_{x\in S}\pi(x)p^n(x,y) = (\pi\cdot p^n)(y). $$

Proof. We have from Eq. (7.4) that
$$ \begin{aligned}
P_x(X_n=y) &= \sum_{x_0,\dots,x_{n-1}\in S}P_x(X_0=x_0, X_1=x_1,\dots,X_{n-1}=x_{n-1}, X_n=y) \\
&= \sum_{x_0,\dots,x_{n-1}\in S}\delta_x(x_0)p(x_0,x_1)\cdots p(x_{n-2},x_{n-1})p(x_{n-1},y) \\
&= \sum_{x_1,\dots,x_{n-1}\in S}p(x,x_1)\cdots p(x_{n-2},x_{n-1})p(x_{n-1},y) = p^n(x,y).
\end{aligned} $$
The formula for $P_\pi(X_n=y)$ easily follows from this formula.

To get a feeling for Markov chains, I suggest the reader play around with the simulation provided by Stefan Waner and Steven R. Costenoble at www.zweigmedia.com/RealWorld/markov/markov.html -- see Figure 7.1 below.

Fig. 7.1. See www.zweigmedia.com/RealWorld/markov/markov.html for a Markov chain simulator for chains with a state space of 4 elements or less. The user describes the chain by filling in the transition matrix P.
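Lemma 7.8 is also easy to illustrate numerically; the following is a small sketch (not part of the original notes) computing $P_x(X_n=y) = p^n(x,y)$ and $P_\pi(X_n=y) = (\pi\cdot p^n)(y)$ by matrix multiplication, using the 3-state kernel of Example 7.10 below.

```python
import numpy as np

# Sketch only: n-step transition probabilities as matrix powers.
p = np.array([[1/4, 1/2, 1/4],
              [1/2, 0.0, 1/2],
              [1/3, 1/3, 1/3]])

pn = np.linalg.matrix_power(p, 5)       # p^5; row x gives P_x(X_5 = .)
pi = np.array([0.2, 0.5, 0.3])          # an initial distribution
print(pn)
print(pi @ pn)                           # P_pi(X_5 = y) = (pi . p^5)(y)
print(pn.sum(axis=1))                    # each row still sums to 1
```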


7.1 Examples

Notation 7.9 Associated to a transition kernel, $p$, is a jump graph (or jump diagram) gotten by taking $S$ as the set of vertices and then, for $x,y\in S$, drawing an arrow from $x$ to $y$ if $p(x,y)>0$ and labeling this arrow by the value $p(x,y)$.

Example 7.10. The transition matrix

         1     2     3
   1    1/4   1/2   1/4
   2    1/2    0    1/2
   3    1/3   1/3   1/3

is represented by the jump diagram in Figure 7.2.

[Jump diagram omitted.] Fig. 7.2. A simple 3 state jump diagram. We typically abbreviate the jump diagram on the left by the one on the right. That is, we infer by conservation of probability that there has to be probability 1/4 of staying at 1, 1/3 of staying at 3 and 0 probability of staying at 2.

Example 7.11. The jump diagram for

         1     2     3
   1    1/4   1/2   1/4
   2    1/2    0    1/2
   3    1/3   1/3   1/3

is shown in Figure 7.3.

[Jump diagram omitted.] Fig. 7.3. In the above diagram there are jumps from 1 to 1 with probability 1/4 and jumps from 3 to 3 with probability 1/3 which are not explicitly shown but must be inferred by conservation of probability.

Example 7.12. Suppose that $S=\{1,2,3\}$; then

         1     2     3
   1     0     1     0
   2    1/2    0    1/2
   3     1     0     0

has the jump graph given in Figure 7.4.

[Jump diagram omitted.] Fig. 7.4. A simple 3 state jump diagram.

Example 7.13 (Ehrenfest Urn Model). Let a beaker filled with a particle-fluid mixture be divided into two parts $A$ and $B$ by a semipermeable membrane. Let $X_n = (\#$ of particles in $A)$, which we assume evolves by choosing a particle at random from $A\cup B$ and then replacing this particle in the opposite bin from which it was found. Modeling $X_n$ as a Markov process we find
$$ P(X_{n+1}=j\,|\,X_n=i) = \begin{cases}0 & \text{if } j\notin\{i-1, i+1\} \\ \frac iN & \text{if } j=i-1 \\ \frac{N-i}{N} & \text{if } j=i+1\end{cases} =: q(i,j). $$
As these probabilities do not depend on $n$, $X_n$ is a time homogeneous Markov chain.


Exercise 7.1. Consider a rat in a maze consisting of 7 rooms which is laid out as in the following figure:

   1 2 3
   4 5 6
   7

In this figure rooms are connected by either vertical or horizontal adjacent passages only, so that 1 is connected to 2 and 4 but not to 5, and 7 is only connected to 4. At each time $t\in\mathbb{N}_0$ the rat moves from her current room to one of the adjacent rooms with equal probability (the rat always changes rooms at each time step). Find the one step $7\times 7$ transition matrix, $P$, with entries given by $P_{ij} := P(X_{n+1}=j|X_n=i)$, where $X_n$ denotes the room the rat is in at time $n$.

Solution to Exercise (7.1). The rat moves to each adjacent room with probability $1/D$, where $D$ is the number of doors in the room where the rat is currently located. The transition matrix is therefore (7.8)

          1     2     3     4     5     6     7
   1      0    1/2    0    1/2    0     0     0
   2     1/3    0    1/3    0    1/3    0     0
   3      0    1/2    0     0     0    1/2    0
   4     1/3    0     0     0    1/3    0    1/3
   5      0    1/3    0    1/3    0    1/3    0
   6      0     0    1/2    0    1/2    0     0
   7      0     0     0     1     0     0     0

and the corresponding jump diagram is given in Figure 7.5.

[Jump diagram omitted.] Fig. 7.5. The jump diagram for our rat in the maze.
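The transition matrix in Eq. (7.8) is easy to build programmatically from the room adjacency; the following is a sketch (not part of the original notes).

```python
import numpy as np

# Sketch only: build the rat-in-the-maze matrix of Eq. (7.8) from adjacency.
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2, 6], 4: [1, 5, 7],
       5: [2, 4, 6], 6: [3, 5], 7: [4]}
P = np.zeros((7, 7))
for room, nbrs in adj.items():
    for nbr in nbrs:
        P[room - 1, nbr - 1] = 1 / len(nbrs)   # equal probability over the doors

assert np.allclose(P.sum(axis=1), 1)           # each row is a probability vector
print(np.linalg.matrix_power(P, 4)[0])         # P_1(X_4 = .) for the rat started in room 1
```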

Exercise 7.2 (2-step MC). Consider the following simple (i.e. no-brainer) two state "game" consisting of moving between two sites labeled 1 and 2. At each site you find a coin with sides labeled 1 and 2. The probability of flipping a 2 at site 1 is $a\in(0,1)$ and of flipping a 1 at site 2 is $b\in(0,1)$. If you are at site $i$ at time $n$, then you flip the coin at this site and move or stay at the current site as indicated by the coin toss. We summarize this scheme by the jump diagram of Figure 7.6. It is reasonable to suppose that your location, $X_n$, at time $n$ is modeled by a Markov process with state space $S=\{1,2\}$. Explain (briefly) why this is a time homogeneous chain and find the one step transition probabilities,
$$ p(i,j) = P(X_{n+1}=j|X_n=i) \quad\text{for } i,j\in S. $$
Use your result and basic linear (matrix) algebra to compute $\lim_{n\to\infty}P(X_n=1)$. Your answer should be independent of the possible starting distributions, $\pi = (\pi_1,\pi_2)$, for $X_0$, where $\pi_i := P(X_0=i)$.

[Jump diagram omitted.] Fig. 7.6. The generic jump diagram for a two state Markov chain.

Solution to Exercise (7.2). Writing $q$ as a matrix with the entry in the $i$th row and $j$th column being $q(i,j)$, we have
$$ q = \begin{bmatrix}1-a & a \\ b & 1-b\end{bmatrix}. $$
If $P(X_0=i) = \nu_i$ for $i=1,2$, then
$$ P(X_n=1) = \sum_{k=1}^2\nu_k q^n_{k,1} = [\nu q^n]_1, $$
where we now write $\nu = (\nu_1,\nu_2)$ as a row vector. A simple computation shows that
$$ \det(q^{\mathrm{tr}}-\lambda I) = \det(q-\lambda I) = \lambda^2+(a+b-2)\lambda+(1-b-a) = (\lambda-1)(\lambda-(1-a-b)). $$
Note that
$$ q\begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}1\\1\end{bmatrix} $$
since $\sum_j q(i,j)=1$ -- this is a general fact. Thus we always know that $\lambda_1=1$ is an eigenvalue of $q$. The second eigenvalue is $\lambda_2 = 1-a-b$. We now find the eigenvectors of $q^{\mathrm{tr}}$:
$$ \operatorname{Nul}(q^{\mathrm{tr}}-\lambda_1 I) = \operatorname{Nul}\left(\begin{bmatrix}-a & b\\ a & -b\end{bmatrix}\right) = \mathbb{R}\cdot\begin{bmatrix}b\\a\end{bmatrix}, \quad\text{while}\quad \operatorname{Nul}(q^{\mathrm{tr}}-\lambda_2 I) = \operatorname{Nul}\left(\begin{bmatrix}b & b\\ a & a\end{bmatrix}\right) = \mathbb{R}\cdot\begin{bmatrix}1\\-1\end{bmatrix}. $$
Thus we may write
$$ \nu = \alpha(b,a)+\beta(1,-1), $$
where
$$ 1 = \nu\cdot(1,1) = \alpha(b,a)\cdot(1,1) = \alpha(a+b). $$
Thus, with $\alpha = \frac{1}{a+b}$ and $\beta = \nu_1-\alpha b = -(\nu_2-\alpha a)$, we have
$$ \nu = \frac{1}{a+b}(b,a)+\beta(1,-1), $$
and therefore, since $(b,a)$ and $(1,-1)$ are left eigenvectors of $q$ for $\lambda_1$ and $\lambda_2$,
$$ \nu q^n = \frac{1}{a+b}(b,a)q^n+\beta(1,-1)q^n = \frac{1}{a+b}(b,a)+\beta(1,-1)\lambda_2^n. $$
By our assumptions on $a,b\in(0,1)$ it follows that $|\lambda_2|<1$ and therefore
$$ \lim_{n\to\infty}\nu q^n = \frac{1}{a+b}(b,a), $$
and we have shown
$$ \lim_{n\to\infty}P(X_n=1) = \frac{b}{a+b} \quad\text{and}\quad \lim_{n\to\infty}P(X_n=2) = \frac{a}{a+b}, $$
independent of the starting distribution $\nu$. Also observe that the convergence is exponentially fast.
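The limit is easy to confirm numerically; here is a sketch (not part of the original notes) with arbitrary illustrative values of $a$ and $b$.

```python
import numpy as np

# Sketch only: nu q^n converges to (b, a)/(a + b) for any starting distribution nu.
a, b = 0.3, 0.45
q = np.array([[1 - a, a],
              [b, 1 - b]])

nu = np.array([0.9, 0.1])                     # an arbitrary starting distribution
print(nu @ np.linalg.matrix_power(q, 50))     # approximately [b/(a+b), a/(a+b)]
print(np.array([b, a]) / (a + b))
```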

Example 7.14. As we will see in concrete examples (see the homework and the text), many Markov chains arise in the following general fashion. Let $S$ and $T$ be discrete sets, $\alpha:S\times T\to S$ be a function, and $\{\xi_n\}_{n=1}^\infty$ be i.i.d. random functions with values in $T$. Then, given a random function $X_0$ independent of the $\{\xi_n\}_{n=1}^\infty$ with values in $S$, define $X_n$ inductively by $X_{n+1} = \alpha(X_n,\xi_{n+1})$ for $n=0,1,2,\dots$. We will see that $\{X_n\}_{n=0}^\infty$ satisfies the Markov property with
$$ p(x,y) = P(\alpha(x,\xi)=y), $$
where $\xi\overset{d}{=}\xi_n$. To verify this is a Markov process, first observe that $\xi_{n+1}$ is independent of $\{X_k\}_{k=0}^n$, as $X_k$ depends on $(X_0,\xi_1,\dots,\xi_k)$ for all $k$. Therefore
$$ P[X_{n+1}=x_{n+1}\,|\,X_0=x_0,\dots,X_n=x_n] = P[\alpha(X_n,\xi_{n+1})=x_{n+1}\,|\,X_0=x_0,\dots,X_n=x_n] = P[\alpha(x_n,\xi_{n+1})=x_{n+1}\,|\,X_0=x_0,\dots,X_n=x_n] = P(\alpha(x_n,\xi_{n+1})=x_{n+1}) = p(x_n,x_{n+1}). $$

Example 7.15 (Random Walks on the line). Suppose we have a walk on the line where the probability of jumping to the right is $p$ and to the left is $q=1-p$. In this case
$$ P_{ij} = \begin{cases}p & \text{if } j=i+1 \\ q & \text{if } j=i-1 \\ 0 & \text{otherwise,}\end{cases} $$
i.e. $P$ is the doubly infinite matrix (rows and columns indexed by $\dots,-1,0,1,2,\dots$) with $q$ just below the diagonal, $0$ on the diagonal, and $p$ just above the diagonal. The jump diagram for such a walk is given in Figure 7.7. This fits into Example 7.14 by taking $S=\mathbb{Z}$, $T=\{\pm 1\}$, $\alpha(s,t)=s+t$, and $\xi_n\overset{d}{=}\xi$ where $P(\xi=+1)=p$ and $P(\xi=-1)=q=1-p$.

[Jump diagram omitted.] Fig. 7.7. The jump diagram for a possibly biased simple random walk on the line.

Example 7.16 (See III.3.1 of Karlin and Taylor). Let $\xi_n$ denote the demand of a commodity during the $n$th period. We will assume that $\{\xi_n\}_{n=1}^\infty$ are i.i.d. with $P(\xi_n=k)=a_k$ for $k\in\mathbb{N}_0$. Let $X_n$ denote the quantity of stock on hand at the end of the $n$th period, which is subject to the following replacement policy. We choose $s,S\in\mathbb{N}_0$ with $s<S$; if $X_n\le s$ we immediately replenish the stock to have $S$ on hand at the beginning of the next period, while if $X_n>s$ we do not add any stock. Thus,
$$ X_{n+1} = \begin{cases}X_n-\xi_{n+1} & \text{if } s<X_n\le S \\ S-\xi_{n+1} & \text{if } X_n\le s,\end{cases} $$
see Figure 3.1 on p. 106 of the book. Notice that we allow the stock to go negative, indicating the demand is not met. It now follows that
$$ P(X_{n+1}=y|X_n=x) = \begin{cases}P(\xi_{n+1}=x-y) & \text{if } s<x\le S \\ P(\xi_{n+1}=S-y) & \text{if } x\le s\end{cases} = \begin{cases}a_{x-y} & \text{if } s<x\le S \\ a_{S-y} & \text{if } x\le s.\end{cases} $$

Example 7.17 (Discrete queueing model). Let $X_n = \#$ of people in line at time $n$, let $\{\xi_n\}$ be i.i.d. numbers of customers arriving for service in a period, and assume one person is served per period if there are people in the queue (think of a taxi stand). Therefore $X_{n+1} = (X_n-1)_+ + \xi_n$, and assuming that $P(\xi_n=k)=a_k$ for all $k\in\mathbb{N}_0$ we have, for $i\ge 1$,
$$ P(X_{n+1}=j\,|\,X_n=i) = \begin{cases}0 & \text{if } j<i-1 \\ P(\xi_n=0)=a_0 & \text{if } j=i-1 \\ P(\xi_n=j-(i-1))=a_{j-i+1} & \text{if } j\ge i,\end{cases} $$
while $P(X_{n+1}=j|X_n=0)=a_j$, i.e. $P$ is the matrix

          0    1    2    3    4   ...
   0     a0   a1   a2   a3   ...
   1     a0   a1   a2   a3   ...
   2      0   a0   a1   a2   ...
   3      0    0   a0   a1   a2  ...
   ...

Remark 7.18 (Memoryless property of the geometric distribution). Suppose that $\{X_i\}$ are i.i.d. Bernoulli random variables with $P(X_i=1)=p$ and $P(X_i=0)=1-p$, and let $N = \inf\{i\ge 1 : X_i=1\}$. Then $P(N=k) = P(X_1=0,\dots,X_{k-1}=0, X_k=1) = (1-p)^{k-1}p$, so that $N$ is geometric with parameter $p$. Using this representation we easily and intuitively see that
$$ P(N=n+k|N>n) = \frac{P(X_1=0,\dots,X_{n+k-1}=0, X_{n+k}=1)}{P(X_1=0,\dots,X_n=0)} = P(X_{n+1}=0,\dots,X_{n+k-1}=0, X_{n+k}=1) = P(X_1=0,\dots,X_{k-1}=0, X_k=1) = P(N=k). $$
This can be verified from first principles as well:
$$ P(N=n+k|N>n) = \frac{P(N=n+k)}{P(N>n)} = \frac{p(1-p)^{n+k-1}}{\sum_{j>n}p(1-p)^{j-1}} = \frac{p(1-p)^{n+k-1}}{\sum_{j=0}^\infty p(1-p)^{n+j}} = \frac{(1-p)^{n+k-1}}{(1-p)^n\sum_{j=0}^\infty(1-p)^j} = \frac{(1-p)^{k-1}}{\frac{1}{1-(1-p)}} = p(1-p)^{k-1} = P(N=k). $$

Exercise 7.3 (III.3.P4. (Queueing model)). Consider the queueing model of Section 3.4 of Karlin and Taylor. Now suppose that at most a single customer arrives during a single period, but that the service time of a customer is a random variable $Z$ with the geometric probability distribution
$$ P(Z=k) = \alpha(1-\alpha)^{k-1} \quad\text{for } k\in\mathbb{N}. $$
Specify the transition probabilities for the Markov chain whose state is the number of customers waiting for service or being served at the start of each period. Assume that the probability that a customer arrives in a period is $\beta$ and that no customer arrives with probability $1-\beta$.

Solution to Exercise (III.3.P4). Notice that the probability that the service of the customer currently being served finishes at the end of the current period is $\alpha = P(Z=m+1|Z>m)$; this is the memoryless property of the geometric distribution. A $k\to k$ transition can happen in two ways: (i) a new customer arrives and the customer being served finishes, or (ii) no new customer arrives and the customer in service does not finish. The total probability of a $k\to k$ transition is therefore
$$ \beta\alpha+(1-\beta)(1-\alpha) = 1-\alpha-\beta+2\alpha\beta. $$
(If $k=0$ this formula must be amended; the probability of a $0\to 0$ transition is simply $1-\beta$.) A $k\to k+1$ transition occurs if a new customer arrives but the customer in service does not finish; this has probability $(1-\alpha)\beta$ ($\beta$ if $k=0$). Finally, for $k\ge 1$, the probability of a $k\to k-1$ transition is $\alpha(1-\beta)$; see Figure 7.8 for the jump diagram.

[Jump diagram omitted.] Fig. 7.8. A jump diagram for a simple queueing model.
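As a sanity check (not part of the original notes), the following sketch builds a finite truncation of this transition matrix and verifies that the three probabilities for each interior state sum to one.

```python
import numpy as np

# Sketch only: truncated transition matrix for the queueing chain of Exercise 7.3
# on states 0..K (the last row is distorted by the artificial truncation).
alpha, beta, K = 0.6, 0.3, 10
P = np.zeros((K + 1, K + 1))
P[0, 0], P[0, 1] = 1 - beta, beta
for k in range(1, K + 1):
    P[k, k - 1] = alpha * (1 - beta)                   # service ends, no arrival
    P[k, k] = beta * alpha + (1 - beta) * (1 - alpha)  # both happen or neither does
    if k < K:
        P[k, k + 1] = (1 - alpha) * beta               # arrival, service continues

print(P[:4, :5])
print(P[:K].sum(axis=1))    # rows 0..K-1 all sum to 1
```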

Proposition 7.19 (Historical MC). Suppose that $\{X_n\}_{n=0}^\infty$ is a Markov chain with transition probabilities $p(x,y)$ for $x,y\in S$. Then for any $m\in\mathbb{N}$,
$$ Y_n := (X_n, X_{n+1},\dots,X_{n+m}) $$
is a Markov chain with values in $S^{m+1}$ whose transition kernel, $q$, is given by
$$ q((a_0,\dots,a_m),(b_0,\dots,b_m)) = \delta(b_0,a_1)\cdots\delta(b_{m-1},a_m)\,p(a_m,b_m). $$

Proof. Let me give the proof for $m=2$ only, as this should suffice to explain the ideas. We have
$$ \begin{aligned}
&P(Y_{n+1}=(b_0,b_1,b_2)\,|\,Y_n=(a_0,a_1,a_2), Y_{n-1}=*,\dots,Y_0=*) \\
&= P\big((X_{n+1},X_{n+2},X_{n+3})=(b_0,b_1,b_2)\,\big|\,(X_n,X_{n+1},X_{n+2})=(a_0,a_1,a_2),\ X_{n-1}=*,\dots,X_0=*\big) \\
&= P\big((a_1,a_2,X_{n+3})=(b_0,b_1,b_2)\,\big|\,(X_n,X_{n+1},X_{n+2})=(a_0,a_1,a_2),\ X_{n-1}=*,\dots,X_0=*\big) \\
&= \delta(b_0,a_1)\,\delta(b_1,a_2)\,P(X_{n+3}=b_2\,|\,X_{n+2}=a_2, X_{n+1}=*,\dots,X_0=*) \\
&= \delta(b_0,a_1)\,\delta(b_1,a_2)\,p(a_2,b_2).
\end{aligned} $$

Example 7.20. Suppose we flip a fair coin repeatedly and would like to find the first time the pattern HHT appears. To do this we will later examine the Markov chain, Y_n = (X_n, X_{n+1}, X_{n+2}), where {X_n}_{n=0}^∞ is the sequence of unbiased independent coin flips with values in {H, T}. The state space for Y_n is

S = {TTT, THT, TTH, THH, HHH, HTT, HTH, HHT} .

The transition matrix for recording three flips in a row of a fair coin is

              TTT THT TTH THH HHH HTT HTH HHT
P = 1/2  TTT [ 1   0   1   0   0   0   0   0 ]
         THT [ 0   0   0   0   0   1   1   0 ]
         TTH [ 0   1   0   1   0   0   0   0 ]
         THH [ 0   0   0   0   1   0   0   1 ]
         HHH [ 0   0   0   0   1   0   0   1 ]
         HTT [ 1   0   1   0   0   0   0   0 ]
         HTH [ 0   1   0   1   0   0   0   0 ]
         HHT [ 0   0   0   0   0   1   1   0 ] .
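The matrix above can also be generated directly from the rule of Proposition 7.19. The following sketch (an assumption of mine that numpy is available; it is not part of the notes) builds the eight-state transition matrix for the fair-coin pattern chain and checks that each row contains exactly two entries equal to 1/2.

import itertools
import numpy as np

states = ["".join(s) for s in itertools.product("HT", repeat=3)]
P = np.zeros((8, 8))
for i, a in enumerate(states):
    for j, b in enumerate(states):
        # q((a0,a1,a2),(b0,b1,b2)) = delta(b0,a1) delta(b1,a2) p(a2,b2)
        if b[0] == a[1] and b[1] == a[2]:
            P[i, j] = 0.5                      # p(a2, b2) = 1/2 for a fair coin
assert np.allclose(P.sum(axis=1), 1.0)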

7.2 Hitting Times

Skip this section. It is redone better later.

We assume that {X_n}_{n=0}^∞ is a Markov chain with values in S and transition kernel P. I will often write p (x, y) for P_{xy}. We are going to further assume that B ⊂ S is a non-empty proper subset of S and A = S \ B.

Definition 7.21 (Hitting times). Given a subset B ⊂ S we let T_B be the first time X_n hits B, i.e.

T_B = min {n : X_n ∈ B}

with the convention that T_B = ∞ if {n : X_n ∈ B} = ∅. We call T_B the first hitting time of B by X = {X_n}_n .

Observe that

{T_B = n} = {X_0 ∉ B, . . . , X_{n−1} ∉ B, X_n ∈ B} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ B}

and

{T_B > n} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ A}

so that {T_B = n} and {T_B > n} only depend on (X_0, . . . , X_n) . A random time, T : Ω → N ∪ {0, ∞} , with either of these properties is called a stopping time.


Lemma 7.22. For any random time T : Ω → N ∪ {0, ∞} we have

P (T = ∞) = lim_{n→∞} P (T > n)   and   E T = \sum_{k=0}^∞ P (T > k) .

Proof. The first equality is a consequence of the continuity of P and the fact that

{T > n} ↓ {T = ∞} .

The second equality is proved as follows;

E T = \sum_{m>0} m P (T = m) = \sum_{0<k≤m<∞} P (T = m) = \sum_{k=1}^∞ P (T ≥ k) = \sum_{k=0}^∞ P (T > k) .

Notation 7.23 Let Q be P restricted to A, i.e. Q_{x,y} = P_{x,y} for all x, y ∈ A. In particular we have

Q^N_{x,y} := \sum_{x_1,...,x_{N−1} ∈ A} Q_{x,x_1} Q_{x_1,x_2} . . . Q_{x_{N−1},y}   for all x, y ∈ A.

Corollary 7.24. Continuing the notation introduced above, for any x ∈ A we have

P_x (T_B = ∞) = lim_{N→∞} \sum_{y∈A} Q^N_{x,y}

and

E_x [T_B] = \sum_{N=0}^∞ \sum_{y∈A} Q^N_{x,y}

with the convention that

Q^0_{x,y} = δ_{x,y} = { 1 if x = y ;  0 if x ≠ y } .

Proof. The results follow from Lemma 7.22 after observing that

P_x (T_B > N) = P_x (X_0 ∈ A, . . . , X_N ∈ A) = \sum_{x_1,...,x_N ∈ A} p (x, x_1) p (x_1, x_2) . . . p (x_{N−1}, x_N) = \sum_{y∈A} Q^N_{x,y} .   (7.9)

Proposition 7.25. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 such that P_x (T_B = ∞) ≤ α for all x ∈ A; then P_x (T_B = ∞) = 0 for all x ∈ A. [In words: if there is a "uniform" chance that X hits B starting from any site, then X will surely hit B.]

Proof. Taking N = m + n in Eq. (7.9) shows

P_x (T_B > m + n) = \sum_{y,z∈A} Q^m_{x,y} Q^n_{y,z} = \sum_{y∈A} Q^m_{x,y} P_y (T_B > n) .   (7.10)

Letting n → ∞ (using D.C.T.) in this equation shows,

P_x (T_B = ∞) = \sum_{y∈A} Q^m_{x,y} P_y (T_B = ∞) ≤ α \sum_{y∈A} Q^m_{x,y} = α P_x (T_B > m) .

Finally letting m → ∞ shows P_x (T_B = ∞) ≤ α P_x (T_B = ∞) , i.e. P_x (T_B = ∞) = 0 for all x ∈ A.

We will see in examples later that it is possible for P_x (T_B = ∞) = 0 while E_x T_B = ∞. The next theorem gives a criterion which avoids this scenario.

Theorem 7.26. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 and n < ∞ such that P_x (T_B > n) ≤ α for all x ∈ A; then

E_x (T_B) ≤ n / (1 − α) < ∞

for all x ∈ A. [In words: if there is a "uniform" chance that X hits B starting from any site within a fixed number of steps, then the expected hitting time of B is finite and bounded independent of the starting point.]

Proof. From Eq. (7.10) for any m ∈ N we have

P_x (T_B > m + n) = \sum_{y∈A} Q^m_{x,y} P_y (T_B > n) ≤ α \sum_{y∈A} Q^m_{x,y} = α P_x (T_B > m) .

One easily uses this relationship to show inductively that

P_x (T_B > kn) ≤ α^k for all k = 0, 1, 2, . . . .

We then have,

E_x T_B = \sum_{k=0}^∞ P (T_B > k) ≤ \sum_{k=0}^∞ n P (T_B > kn) ≤ \sum_{k=0}^∞ n α^k = n / (1 − α) < ∞,


wherein we have used,

P (TB > kn+m) ≤ P (TB > kn) for m = 0, . . . , n− 1.

Corollary 7.27. If A = S \ B is a finite set and P_x (T_B = ∞) < 1 for all x ∈ A, then E_x T_B < ∞ for all x ∈ A.

Proof. Let α_0 = max_{x∈A} P_x (T = ∞) < 1. Now fix α ∈ (α_0, 1) . Using

α_0 ≥ P_x (T = ∞) = ↓ lim_{n→∞} P_x (T > n)

we will have P_x (T > m) ≤ α for m ≥ N_x for some N_x < ∞. Taking n := max {N_x : x ∈ A} < ∞ (A is a finite set), we will have P_x (T > n) ≤ α for all x ∈ A and we may now apply Theorem 7.26.

Definition 7.28 (First return time). For any x ∈ S, let R_x := min {n ≥ 1 : X_n = x} where the minimum of the empty set is defined to be ∞.

On the event {X_0 ≠ x} we have R_x = T_x := min {n ≥ 0 : X_n = x} – the first hitting time of x. So R_x is really manufactured for the case where X_0 = x, in which case T_x = 0 while R_x is the first return time to x.

Exercise 7.4. Let x ∈ S. Show;

a) for all n ∈ N_0,

P_x (R_x > n + 1) ≤ \sum_{y ≠ x} p (x, y) P_y (T_x > n) .   (7.11)

b) Use Eq. (7.11) to conclude that if P_y (T_x = ∞) = 0 for all y ≠ x then P_x (R_x = ∞) = 0, i.e. X_n will return to x when started at x.

c) Sum Eq. (7.11) on n ∈ N_0 to show

E_x [R_x] ≤ P_x (R_x > 0) + \sum_{y ≠ x} p (x, y) E_y [T_x] .   (7.12)

d) Now suppose that S is a finite set and P_y (T_x = ∞) < 1 for all y ≠ x, i.e. there is a positive chance of hitting x from any y ≠ x in S. Explain how Eq. (7.12) combined with Corollary 7.27 shows that E_x [R_x] < ∞.

Solution to Exercise (7.4). a) Using the first step analysis we have,

P_x (R_x > n + 1) = E_x [1_{R_x > n+1}] = E_{p(x,·)} [1_{R_x(x,X) > n+1}]
  = p (x, x) E_x [1_{R_x(x,X) > n+1}] + \sum_{y ≠ x} p (x, y) E_y [1_{R_x(x,X) > n+1}] .

On the event X_0 = x we have R_x (x, X) = 1, which is not greater than n + 1, so that E_x [1_{R_x(x,X) > n+1}] = 0, while on the event X_0 ≠ x we have R_x (x, X) = T_x (X) + 1 so that for y ≠ x,

E_y [1_{R_x(x,X) > n+1}] = E_y [1_{T_x(X)+1 > n+1}] = P_y (T_x > n) .

Putting these comments together proves Eq. (7.11).

b) Let n → ∞ in Eq. (7.11) using DCT in order to conclude,

P_x (R_x = ∞) ≤ \sum_{y ≠ x} p (x, y) P_y (T_x = ∞) = 0.

c) Using Lemma 7.22^2 twice along with Fubini's theorem for sums we have,

E_x [R_x] = P_x (R_x > 0) + \sum_{n=0}^∞ P_x (R_x > n + 1)
  ≤ P_x (R_x > 0) + \sum_{n=0}^∞ \sum_{y ≠ x} p (x, y) P_y (T_x > n)
  = P_x (R_x > 0) + \sum_{y ≠ x} p (x, y) \sum_{n=0}^∞ P_y (T_x > n)
  = P_x (R_x > 0) + \sum_{y ≠ x} p (x, y) E_y [T_x] .

d) From Corollary 7.27 with B = {x} , we know that E_y [T_x] < ∞ for all y ≠ x. Thus the right side of Eq. (7.12) is a finite sum of finite terms and therefore is finite. This then implies E_x R_x < ∞.

^2 That is, E T = \sum_{k=0}^∞ P (T > k) .


8 Markov Conditioning

We assume that {X_n}_{n=0}^∞ is a Markov chain with values in S and transition kernel P and that π : S → [0, 1] is a probability on S. As usual we write P_π for the unique probability satisfying Eq. (7.4) and we will often write p (x, y) for P_{xy}.

Theorem 8.1 (Markov conditioning). Let π be a probability on S and let F (X) = F (X_0, X_1, . . . ) be a random variable^1 depending on X. Then for each m ∈ N we have

E_π [F (X_0, X_1, . . . )] = E_π [ E^{(Y)}_{X_m} F (X_0, X_1, . . . , X_{m−1}, Y_0, Y_1, . . . ) ]   (8.1)

where E^{(Y)}_x denotes the expectation with respect to an independent copy, Y, of the chain X which starts at x ∈ S. To be more explicit,

E_π [F (X_0, X_1, . . . )] = E_π [h (X_0, . . . , X_m)]

where for all x_0, . . . , x_m ∈ S,

h (x_0, . . . , x_m) := E_{x_m} [F (x_0, . . . , x_{m−1}, X_0, X_1, . . . )] .

[In words, given X_0, . . . , X_m, the process (X_m, X_{m+1}, . . . ) has the same distribution as an independent copy (Y_0, Y_1, . . . ) of the chain X where Y is required to start at X_m.]

Alternatively stated: if x_0, x_1, . . . , x_m ∈ S with P_π (X_0 = x_0, . . . , X_m = x_m) > 0, then

E_π [F (X_0, X_1, . . . ) | X_0 = x_0, . . . , X_m = x_m] = E_{x_m} [F (x_0, x_1, . . . , x_{m−1}, X_0, X_1, . . . )]   (8.2)

or equivalently put,

E_π [F (X_0, X_1, . . . ) | X_0, . . . , X_m] = E^{(Y)}_{X_m} [F (X_0, X_1, . . . , X_{m−1}, Y_0, Y_1, . . . )] .   (8.3)

Proof. Fact: by "limiting" arguments beyond the scope of this course it suffices to prove Eq. (8.1) for F (X) of the form F (X) = F (X_0, X_1, . . . , X_N) with N < ∞. Now for such a function we have,

^1 In this theorem we assume that F is either bounded or non-negative.

E_π [F (X_0, X_1, . . . , X_N) : X_0 = x_0, . . . , X_m = x_m]
  = \sum_{x_{m+1},...,x_N ∈ S} F (x_0, . . . , x_m, x_{m+1}, . . . , x_N) [ π (x_0) p (x_0, x_1) . . . p (x_{m−1}, x_m) · p (x_m, x_{m+1}) . . . p (x_{N−1}, x_N) ]
  = P_π (X_0 = x_0, . . . , X_m = x_m) · \sum_{x_{m+1},...,x_N ∈ S} F (x_0, . . . , x_m, x_{m+1}, . . . , x_N) p (x_m, x_{m+1}) . . . p (x_{N−1}, x_N)
  = P_π (X_0 = x_0, . . . , X_m = x_m) · \sum_{y_1,...,y_{N−m} ∈ S} F (x_0, . . . , x_m, y_1, y_2, . . . , y_{N−m}) p (x_m, y_1) . . . p (y_{N−m−1}, y_{N−m})
  = P_π (X_0 = x_0, . . . , X_m = x_m) h (x_0, . . . , x_m) .   (8.4)

Summing this equation on x_0, . . . , x_m in S gives Eq. (8.1) and dividing this equation by P_π (X_0 = x_0, . . . , X_m = x_m) proves Eq. (8.2).

To help cement the ideas above, let me pause to write out the above argument in the special case where m = 2 and N = 5. In this case we have;

E_π [F (X_0, X_1, . . . , X_5) : X_0 = x_0, X_1 = x_1, X_2 = x_2]
  = \sum_{x_3,x_4,x_5 ∈ S} F (x_0, x_1, x_2, x_3, x_4, x_5) [ π (x_0) p (x_0, x_1) p (x_1, x_2) · p (x_2, x_3) p (x_3, x_4) p (x_4, x_5) ]
  = P_π (X_0 = x_0, X_1 = x_1, X_2 = x_2) · \sum_{x_3,x_4,x_5 ∈ S} F (x_0, x_1, x_2, x_3, x_4, x_5) [ p (x_2, x_3) p (x_3, x_4) p (x_4, x_5) ]
  = P_π (X_0 = x_0, X_1 = x_1, X_2 = x_2) · \sum_{y_1,y_2,y_3 ∈ S} F (x_0, x_1, x_2, y_1, y_2, y_3) [ p (x_2, y_1) p (y_1, y_2) p (y_2, y_3) ]
  = P_π (X_0 = x_0, X_1 = x_1, X_2 = x_2) · E^{(Y)}_{x_2} [F (x_0, x_1, Y_0, Y_1, Y_2, Y_3)] .


8.1 Hitting Time Estimates

We assume that {X_n}_{n=0}^∞ is a Markov chain with values in S and transition kernel P. I will often write p (x, y) for P_{xy}. We are going to further assume that B ⊂ S is a non-empty proper subset of S and A = S \ B.

Definition 8.2 (Hitting times). Given a subset B ⊂ S we let T_B be the first time X_n hits B, i.e.

T_B = min {n : X_n ∈ B}

with the convention that T_B = ∞ if {n : X_n ∈ B} = ∅. We call T_B the first hitting time of B by X = {X_n}_n .

Observe that

{T_B = n} = {X_0 ∉ B, . . . , X_{n−1} ∉ B, X_n ∈ B} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ B}

and

{T_B > n} = {X_0 ∈ A, . . . , X_{n−1} ∈ A, X_n ∈ A}

so that {T_B = n} and {T_B > n} only depend on (X_0, . . . , X_n) . A random time, T : Ω → N ∪ {0, ∞} , with either of these properties is called a stopping time.

Lemma 8.3. For any random time T : Ω → N ∪ {0, ∞} we have

P (T = ∞) = lim_{n→∞} P (T > n)   and   E T = \sum_{k=0}^∞ P (T > k) .

Proof. The first equality is a consequence of the continuity of P and the fact that

{T > n} ↓ {T = ∞} .

The second equality is proved as follows;

E T = \sum_{m>0} m P (T = m) = \sum_{0<k≤m<∞} P (T = m) = \sum_{k=1}^∞ P (T ≥ k) = \sum_{k=0}^∞ P (T > k) .

Let us now use Theorem 8.1 to give variants of the proofs of our hitting time results above. In what follows π will denote a probability on S.

Corollary 8.4. Let B ⊂ S and T_B be as above; then for n, m ∈ N we have

P_π (T_B > m + n) = E_π [1_{T_B > m} P_{X_m} [T_B > n]] .   (8.5)

Proof. Using Theorem 8.1,

P_π (T_B > m + n) = E_π [1_{T_B(X) > m+n}]
  = E_π [ E^{(Y)}_{X_m} [1_{T_B(X_0,...,X_{m−1},Y_0,Y_1,...) > m+n}] ]
  = E_π [ E^{(Y)}_{X_m} [1_{T_B(X) > m} · 1_{T_B(Y) > n}] ]
  = E_π [ 1_{T_B(X) > m} E^{(Y)}_{X_m} [1_{T_B(Y) > n}] ]
  = E_π [1_{T_B > m} P_{X_m} [T_B > n]] .

Corollary 8.5. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 such that P_x (T_B = ∞) ≤ α for all x ∈ A; then P_π (T_B = ∞) = 0. [In words: if there is a "uniform" chance that X hits B starting from any site, then X will surely hit B from any point in A.]

Proof. Since T_B = 0 on {X_0 ∈ B} we in fact have P_x (T_B = ∞) ≤ α for all x ∈ S. Letting n → ∞ in Eq. (8.5) shows,

P_π (T_B = ∞) = E_π [1_{T_B > m} P_{X_m} [T_B = ∞]] ≤ E_π [1_{T_B > m} α] = α P_π (T_B > m) .

Now letting m → ∞ in this equation shows P_π (T_B = ∞) ≤ α P_π (T_B = ∞) from which it follows that P_π (T_B = ∞) = 0.

Corollary 8.6. Suppose that B ⊂ S is a non-empty proper subset of S and A = S \ B. Further suppose there is some α < 1 and n < ∞ such that P_x (T_B > n) ≤ α for all x ∈ A; then

E_π (T_B) ≤ n / (1 − α) < ∞ .

[In words: if there is a "uniform" chance that X hits B starting from any site within a fixed number of steps, then the expected hitting time of B is finite and bounded independent of the starting distribution.]

Proof. Again using T_B = 0 on {X_0 ∈ B} we may conclude that P_x (T_B > n) ≤ α for all x ∈ S. Letting m = kn in Eq. (8.5) shows

P_π (T_B > kn + n) = E_π [1_{T_B > kn} P_{X_{kn}} [T_B > n]] ≤ E_π [1_{T_B > kn} · α] = α P_π (T_B > kn) .

Iterating this equation using the fact that P_π (T_B > 0) ≤ 1 shows P_π (T_B > kn) ≤ α^k for all k ∈ N_0. Therefore with the aid of Lemma 8.3 and the observation,

P (T_B > kn + m) ≤ P (T_B > kn) for m = 0, . . . , n − 1,

we find,


E_π T_B = \sum_{k=0}^∞ P_π (T_B > k) ≤ \sum_{k=0}^∞ n P_π (T_B > kn) ≤ \sum_{k=0}^∞ n α^k = n / (1 − α) < ∞.

Corollary 8.7. If A = S \ B is a finite set and P_x (T_B = ∞) < 1 for all x ∈ A, then E_π T_B < ∞.

Proof. Since

P_x (T > m) ↓ P_x (T = ∞) < 1 for all x ∈ A

we can find M_x < ∞ such that P_x (T > M_x) < 1. Using the fact that A is a finite set we let n := max_{x∈A} M_x < ∞ and then take α := max_{x∈A} P_x (T > n) < 1. Corollary 8.6 now applies to complete the proof.

8.2 First Step Analysis

The next theorem (which is a special case of Theorem 8.1) is the basis of the first step analysis developed in this section.

Theorem 8.8 (First step analysis). Let F (X) = F (X_0, X_1, . . . ) be some function of the paths (X_0, X_1, . . . ) of our Markov chain; then for all x, y ∈ S with p (x, y) > 0 we have

E_x [F (X_0, X_1, . . . ) | X_1 = y] = E_y [F (x, X_0, X_1, . . . )]   (8.6)

and

E_x [F (X_0, X_1, . . . )] = E_{p(x,·)} [F (x, X_0, X_1, . . . )] = \sum_{y∈S} p (x, y) E_y [F (x, X_0, X_1, . . . )] .   (8.7)

Proof. Equation (8.6) follows directly from Theorem 8.1,

E_x [F (X_0, X_1, . . . ) | X_1 = y] = E_x [F (X_0, X_1, . . . ) | X_0 = x, X_1 = y] = E_y [F (x, X_0, X_1, . . . )] .

Equation (8.7) now follows from Eq. (8.6), the law of total expectation, and the fact that P_x (X_1 = y) = p (x, y) .

Let us now suppose, until further notice, that B is a non-empty proper subset of S, A = S \ B, and T_B = T_B (X) is the first hitting time of B by X.

Notation 8.9 Given a transition matrix P = (p (x, y))_{x,y∈S} we let Q := (p (x, y))_{x,y∈A} and R := (p (x, y))_{x∈A, y∈B} so that, schematically,

        A  B
P = A [ Q  R ]
    B [ ∗  ∗ ] .

Remark 8.10. To construct the matrices Q and R from P, let P′ be P with the rows corresponding to B omitted. To form Q from P′, remove the columns of P′ corresponding to B, and to form R from P′, remove the columns of P′ corresponding to A.

Example 8.11. If S = {1, 2, 3, 4, 5, 6, 7} , A = {1, 2, 4, 5, 6} , B = {3, 7} , and

         1    2    3    4    5    6    7
P = 1 [  0   1/2   0   1/2   0    0    0 ]
    2 [ 1/3   0   1/3   0   1/3   0    0 ]
    3 [  0   1/2   0    0    0   1/2   0 ]
    4 [ 1/3   0    0    0   1/3   0   1/3 ]
    5 [  0   1/3   0   1/3   0   1/3   0 ]
    6 [  0    0   1/2   0   1/2   0    0 ]
    7 [  0    0    0    0    0    0    1 ] ,

then

          1    2    3    4    5    6    7
P′ = 1 [  0   1/2   0   1/2   0    0    0 ]
     2 [ 1/3   0   1/3   0   1/3   0    0 ]
     4 [ 1/3   0    0    0   1/3   0   1/3 ]
     5 [  0   1/3   0   1/3   0   1/3   0 ]
     6 [  0    0   1/2   0   1/2   0    0 ] .

Deleting the 3 and 7 columns of P′ gives

                   1    2    4    5    6
Q = P_{A,A} = 1 [  0   1/2  1/2   0    0 ]
              2 [ 1/3   0    0   1/3   0 ]
              4 [ 1/3   0    0   1/3   0 ]
              5 [  0   1/3  1/3   0   1/3 ]
              6 [  0    0    0   1/2   0 ]

and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

                   3    7
R = P_{A,B} = 1 [  0    0  ]
              2 [ 1/3   0  ]
              4 [  0   1/3 ]
              5 [  0    0  ]
              6 [ 1/2   0  ] .

Before continuing on you may wish to first visit Example 8.14 below.

Theorem 8.12 (Hitting distributions). Let h : B → R be a bounded or non-negative function and let u : S → R be defined by

u (x) := E_x [h (X_{T_B}) : T_B < ∞] for x ∈ A.

Then u = h on B and

u (x) = \sum_{y∈A} p (x, y) u (y) + \sum_{y∈B} p (x, y) h (y) for all x ∈ A.   (8.8)

In matrix notation this becomes

u = Q u + R h  =⇒  u = (I − Q)^{−1} R h,

i.e.

E_x [h (X_{T_B}) : T_B < ∞] = [ (I − Q)^{−1} R h ]_x for all x ∈ A.   (8.9)

As a special case if h (s) = δ_y (s) for some y ∈ B, then Eq. (8.9) becomes,

P_x (X_{T_B} = y : T_B < ∞) = [ (I − Q)^{−1} R ]_{x,y} .   (8.10)

Proof. To shorten the notation we will use the convention that h (X_{T_B}) = 0 if T_B = ∞, so that we may simply write u (x) := E_x [h (X_{T_B})] . Let

F (X_0, X_1, . . . ) = h (X_{T_B(X)}) = h (X_{T_B(X)}) 1_{T_B(X) < ∞} ;

then for x ∈ A we have F (x, X_0, X_1, . . . ) = F (X_0, X_1, . . . ) . Therefore by the first step analysis (Theorem 8.8) we learn

u (x) = E_x h (X_{T_B(X)}) = E_x F (x, X_1, . . . ) = \sum_{y∈S} p (x, y) E_y F (x, X_0, X_1, . . . )
  = \sum_{y∈S} p (x, y) E_y F (X_0, X_1, . . . ) = \sum_{y∈S} p (x, y) E_y [h (X_{T_B(X)})]
  = \sum_{y∈A} p (x, y) E_y [h (X_{T_B(X)})] + \sum_{y∈B} p (x, y) h (y)
  = \sum_{y∈A} p (x, y) u (y) + \sum_{y∈B} p (x, y) h (y) .

Theorem 8.13 (Travel averages). Given g : A → [0, ∞] , let w (x) := E_x [ \sum_{n < T_B} g (X_n) ] . Then w (x) satisfies

w (x) = \sum_{y∈A} p (x, y) w (y) + g (x) for all x ∈ A.   (8.11)

In matrix notation this becomes,

w = Q w + g  =⇒  w = (I − Q)^{−1} g

so that

E_x [ \sum_{n < T_B} g (X_n) ] = [ (I − Q)^{−1} g ]_x .

The following two special cases are of most interest;

1. Suppose g (x) = δ_y (x) for some y ∈ A; then \sum_{n < T_B} g (X_n) = \sum_{n < T_B} δ_y (X_n) is the number of visits of the chain to y and

E_x (# visits to y before hitting B) = E_x [ \sum_{n < T_B} δ_y (X_n) ] = (I − Q)^{−1}_{x,y} .

2. Suppose that g (x) = 1; then \sum_{n < T_B} g (X_n) = T_B and we may conclude that

E_x [T_B] = [ (I − Q)^{−1} 1 ]_x

where 1 is the column vector consisting of all ones.

Proof. Let F (X_0, X_1, . . . ) = \sum_{n < T_B(X_0,X_1,...)} g (X_n) be the sum of the values of g along the chain before its first exit from A, i.e. entrance into B. With this interpretation in mind, if x ∈ A, it is easy to see that

F (x, X_0, X_1, . . . ) = { g (x) if X_0 ∈ B ;  g (x) + F (X_0, X_1, . . . ) if X_0 ∈ A } = g (x) + 1_{X_0 ∈ A} · F (X_0, X_1, . . . ) .

Therefore by the first step analysis (Theorem 8.8) it follows that

w (x) = E_x F (X_0, X_1, . . . ) = \sum_{y∈S} p (x, y) E_y F (x, X_0, X_1, . . . )
  = \sum_{y∈S} p (x, y) E_y [g (x) + 1_{X_0 ∈ A} · F (X_0, X_1, . . . )]
  = g (x) + \sum_{y∈A} p (x, y) E_y [F (X_0, X_1, . . . )]
  = g (x) + \sum_{y∈A} p (x, y) w (y) .


8.3 Finite state space examples

Example 8.14. Consider the Markov chain determined by

         1    2    3    4
P = 1 [  0   1/3  1/3  1/3 ]
    2 [ 3/4  1/8  1/8   0  ]
    3 [  0    0    1    0  ]
    4 [  0    0    0    1  ]

whose hitting diagram is given in Figure 8.1.

[Fig. 8.1. The jump diagram for this chain; the states 3 and 4 are absorbing.]

Notice that 3 and 4 are absorbing states. Let h_i = P_i (X_n hits 3) = P_i (X_n hits 3 before 4) for i = 1, 2, 3, 4. Clearly h_3 = 1 while h_4 = 0 and by the first step analysis we have

h_i = P_i (X_n hits 3) = \sum_{j=1}^4 P_i (X_n hits 3 | X_1 = j) p (i, j) = \sum_{j=1}^4 p (i, j) h_j ,

and hence

h_1 = (1/3) h_2 + (1/3) h_3 + (1/3) h_4 = (1/3) h_2 + 1/3
h_2 = (3/4) h_1 + (1/8) h_2 + (1/8) h_3 = (3/4) h_1 + (1/8) h_2 + 1/8 .   (8.12)

Solving

h_1 = (1/3) h_2 + 1/3   and   h_2 = (3/4) h_1 + (1/8) h_2 + 1/8

for h_1 and h_2 shows,

P_1 (X_n hits 3) = h_1 = 8/15 ≅ 0.533   and   P_2 (X_n hits 3) = h_2 = 3/5 .

Similarly if we let h_i = P_i (X_n hits 4) instead, from Eqs. (8.12) with h_3 = 0 and h_4 = 1, we find

h_1 = (1/3) h_2 + 1/3
h_2 = (3/4) h_1 + (1/8) h_2

which has solutions,

P_1 (X_n hits 4) = h_1 = 7/15 ≅ 0.467   and   P_2 (X_n hits 4) = h_2 = 2/5 = 0.4 .

Of course we did not really need to compute these, since

P_1 (X_n hits 3) + P_1 (X_n hits 4) = 1   and   P_2 (X_n hits 3) + P_2 (X_n hits 4) = 1.

Similarly, if T = T_{3,4} is the first hitting time of {3, 4} and u_i := E_i T, we have,

u_i = \sum_{j=1}^4 E_i [T | X_1 = j] p (i, j)

where

E_i [T | X_1 = j] = { 1 if j ∈ {3, 4} ;  1 + E_j T = 1 + u_j if j ∈ {1, 2} } .

Therefore it follows that

u_i = \sum_{j=1}^4 1 · p (i, j) + \sum_{j=1}^2 p (i, j) u_j = 1 + \sum_{j=1}^2 p (i, j) u_j

and this leads to the equations,

u_1 = 1 + (1/3) u_2
u_2 = 1 + (3/4) u_1 + (1/8) u_2

which has solutions


E_1 [T] = u_1 = 29/15   and   E_2 [T] = u_2 = 14/5 .

Example 8.15 (Example 8.14 revisited). We may also consider Example 8.14 using the matrix formalism. For this we have

          1    2    3    4
P′ = 1 [  0   1/3  1/3  1/3 ]
     2 [ 3/4  1/8  1/8   0  ] ,

        1    2                    3    4
Q = 1 [ 0   1/3 ]   and   R = 1 [ 1/3  1/3 ]
    2 [ 3/4 1/8 ]             2 [ 1/8   0  ] .

Matrix manipulations now show,

E_i (# visits to j before hitting {3, 4}) = (I − Q)^{−1} = [ 7/5  8/15 ] = [ 1.4  0.5333 ]   (rows i = 1, 2; columns j = 1, 2),
                                                           [ 6/5   8/5 ]   [ 1.2  1.6    ]

E_i T_{3,4} = (I − Q)^{−1} [1, 1]^T = [ 29/15 ] = [ 1.9333 ]   (i = 1, 2), and
                                       [ 14/5  ]   [ 2.8    ]

P_i (X_{T_{3,4}} = j) = (I − Q)^{−1} R = [ 8/15  7/15 ]   (rows i = 1, 2; columns j = 3, 4).
                                          [ 3/5   2/5  ]
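The matrix manipulations above are easy to reproduce numerically; here is a minimal sketch (using numpy, which is my own choice rather than anything in the notes) that recovers the three displays.

import numpy as np

Q = np.array([[0.0, 1/3],
              [3/4, 1/8]])
R = np.array([[1/3, 1/3],
              [1/8, 0.0]])

M = np.linalg.inv(np.eye(2) - Q)   # E_i(# visits to j before hitting {3,4})
print(M)                           # [[1.4, 0.5333], [1.2, 1.6]]
print(M @ np.ones(2))              # E_i T_{3,4} = [29/15, 14/5]
print(M @ R)                       # hitting probabilities [[8/15, 7/15], [3/5, 2/5]]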

The output of one simulation from www.zweigmedia.com/RealWorld/markov/markov.html is in Figure 8.2 below.

Example 8.16. Let us continue the rat in the maze Exercise 7.1 and now suppose that room 3 contains food while room 7 contains a mouse trap:

1 2 3 (food)
4 5 6
7 (trap)

Recall that the transition matrix for this chain with sites 3 and 7 absorbing is given by,

         1    2    3    4    5    6    7
P = 1 [  0   1/2   0   1/2   0    0    0 ]
    2 [ 1/3   0   1/3   0   1/3   0    0 ]
    3 [  0    0    1    0    0    0    0 ]
    4 [ 1/3   0    0    0   1/3   0   1/3 ]
    5 [  0   1/3   0   1/3   0   1/3   0 ]
    6 [  0    0   1/2   0   1/2   0    0 ]
    7 [  0    0    0    0    0    0    1 ] ,

[Fig. 8.2. Simulation output for Example 8.15. In this run, rather than making sites 3 and 4 absorbing, we have made them transition back to 1. To get an approximate value for P_1 (X_n hits 3) we compute (State 3 Hits)/(State 3 Hits + State 4 Hits); in this run that gives 171/(171 + 154) = 0.526, a little lower than the predicted value of 0.533. You can try your own runs of this simulator.]

see Figure 8.3 for the corresponding jump diagram for this chain.

[Fig. 8.3. The jump diagram for our proverbial rat in the maze. Here we assume the rat is "absorbed" at sites 3 and 7.]


We would like to compute the probability that the rat reaches the food before he is trapped. To answer this question we let A = {1, 2, 4, 5, 6} , B = {3, 7} , and T := T_B be the first hitting time of B. Then deleting the 3 and 7 rows of P leaves the matrix,

          1    2    3    4    5    6    7
P′ = 1 [  0   1/2   0   1/2   0    0    0 ]
     2 [ 1/3   0   1/3   0   1/3   0    0 ]
     4 [ 1/3   0    0    0   1/3   0   1/3 ]
     5 [  0   1/3   0   1/3   0   1/3   0 ]
     6 [  0    0   1/2   0   1/2   0    0 ] .

Deleting the 3 and 7 columns of P′ gives

                   1    2    4    5    6
Q = P_{A,A} = 1 [  0   1/2  1/2   0    0 ]
              2 [ 1/3   0    0   1/3   0 ]
              4 [ 1/3   0    0   1/3   0 ]
              5 [  0   1/3  1/3   0   1/3 ]
              6 [  0    0    0   1/2   0 ]

and deleting the 1, 2, 4, 5, and 6 columns of P′ gives

                   3    7
R = P_{A,B} = 1 [  0    0  ]
              2 [ 1/3   0  ]
              4 [  0   1/3 ]
              5 [  0    0  ]
              6 [ 1/2   0  ] .

Therefore,

I − Q = [   1   −1/2  −1/2    0     0  ]
        [ −1/3    1     0   −1/3    0  ]
        [ −1/3    0     1   −1/3    0  ]
        [   0   −1/3  −1/3    1   −1/3 ]
        [   0     0     0   −1/2    1  ] ,

and using a computer algebra package we find

E_i [# visits to j before hitting {3, 7}] = (I − Q)^{−1} =

         1     2     4     5     6    (j)
  1 [ 11/6   5/4   5/4    1    1/3 ]
  2 [  5/6   7/4   3/4    1    1/3 ]
  4 [  5/6   3/4   7/4    1    1/3 ]
  5 [  2/3    1     1     2    2/3 ]
  6 [  1/3   1/2   1/2    1    4/3 ]   (i) .

In particular we may conclude,

[ E_1 T ]                      [ 17/3 ]
[ E_2 T ]                      [ 14/3 ]
[ E_4 T ]  = (I − Q)^{−1} 1 =  [ 14/3 ]
[ E_5 T ]                      [ 16/3 ]
[ E_6 T ]                      [ 11/3 ] ,

and

[ P_1 (X_T = 3)  P_1 (X_T = 7) ]                    [ 7/12  5/12 ]
[ P_2 (X_T = 3)  P_2 (X_T = 7) ]                    [ 3/4   1/4  ]
[ P_4 (X_T = 3)  P_4 (X_T = 7) ] = (I − Q)^{−1} R = [ 5/12  7/12 ]
[ P_5 (X_T = 3)  P_5 (X_T = 7) ]                    [ 2/3   1/3  ]
[ P_6 (X_T = 3)  P_6 (X_T = 7) ]                    [ 5/6   1/6  ] .

Since the event of hitting 3 before 7 is the same as the event {X_T = 3} , the desired hitting probabilities are

[ P_1 (X_T = 3) ]   [ 7/12 ]
[ P_2 (X_T = 3) ]   [ 3/4  ]
[ P_4 (X_T = 3) ] = [ 5/12 ]
[ P_5 (X_T = 3) ]   [ 2/3  ]
[ P_6 (X_T = 3) ]   [ 5/6  ] .

We can also derive these hitting probabilities from scratch using the first step analysis. In order to do this let

h_i = P_i (X_T = 3) = P_i (X_n hits 3 (food) before 7 (trapped)) .

By the first step analysis we will have,

h_i = \sum_j P_i (X_T = 3 | X_1 = j) P_i (X_1 = j) = \sum_j p (i, j) P_j (X_T = 3) = \sum_j p (i, j) h_j

where h_3 = 1 and h_7 = 0. Looking at the jump diagram in Figure 8.3 we easily find


h_1 = (1/2) (h_2 + h_4)
h_2 = (1/3) (h_1 + h_3 + h_5) = (1/3) (h_1 + 1 + h_5)
h_4 = (1/3) (h_1 + h_5 + h_7) = (1/3) (h_1 + h_5)
h_5 = (1/3) (h_2 + h_4 + h_6)
h_6 = (1/2) (h_3 + h_5) = (1/2) (1 + h_5)

and the solutions to these equations are (as seen before) given by

[ h_1 = 7/12, h_2 = 3/4, h_4 = 5/12, h_5 = 2/3, h_6 = 5/6 ] .   (8.13)

Similarly, if

k_i := P_i (X_T = 7) = P_i (X_n is trapped before dinner) ,

we need only use the above equations with h replaced by k and now taking k_3 = 0 and k_7 = 1 to find,

k_1 = (1/2) (k_2 + k_4)
k_2 = (1/3) (k_1 + k_5)
k_4 = (1/3) (k_1 + k_5 + 1)
k_5 = (1/3) (k_2 + k_4 + k_6)
k_6 = (1/2) k_5

and then solve to find,

[ k_1 = 5/12, k_2 = 1/4, k_4 = 7/12, k_5 = 1/3, k_6 = 1/6 ] .   (8.14)

Notice that the sum of the hitting probabilities in Eqs. (8.13) and (8.14) add up to 1, as they should.
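The rat-in-the-maze numbers above can be reproduced with a few lines of code; the following sketch (again using numpy as an assumed tool, not something from the notes) computes (I − Q)^{−1} for A = {1, 2, 4, 5, 6}, B = {3, 7}.

import numpy as np

Q = np.array([[0,   1/2, 1/2, 0,   0  ],
              [1/3, 0,   0,   1/3, 0  ],
              [1/3, 0,   0,   1/3, 0  ],
              [0,   1/3, 1/3, 0,   1/3],
              [0,   0,   0,   1/2, 0  ]])
R = np.array([[0,   0  ],
              [1/3, 0  ],
              [0,   1/3],
              [0,   0  ],
              [1/2, 0  ]])

M = np.linalg.inv(np.eye(5) - Q)
print(M @ np.ones(5))   # E_i T_B = [17/3, 14/3, 14/3, 16/3, 11/3]
print(M @ R)            # first column = P_i(X_T = 3) = [7/12, 3/4, 5/12, 2/3, 5/6]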

Example 8.17 (A modified rat maze). Here is the modified maze:

1 2 3 (food)
4 5
6 (trap)

We now let T = T_{3,6} be the first time to absorption – we assume that 3 and 6 are made absorbing states.^2 The transition matrix is given by

         1    2    3    4    5    6
P = 1 [  0   1/2   0   1/2   0    0 ]
    2 [ 1/3   0   1/3   0   1/3   0 ]
    3 [  0    0    1    0    0    0 ]
    4 [ 1/3   0    0    0   1/3  1/3 ]
    5 [  0   1/2   0   1/2   0    0 ]
    6 [  0    0    0    0    0    1 ] .

The corresponding Q and R matrices in this case are;

        1    2    4    5                   3    6
Q = 1 [ 0   1/2  1/2   0  ]      R = 1 [   0    0  ]
    2 [ 1/3  0    0   1/3 ]          2 [  1/3   0  ]
    4 [ 1/3  0    0   1/3 ]          4 [   0   1/3 ]
    5 [ 0   1/2  1/2   0  ] ,        5 [   0    0  ] .

After some matrix manipulation we then learn,

                                             1    2    4    5   (j)
E_i [# visits to j] = (I_4 − Q)^{−1} = 1 [   2   3/2  3/2   1 ]
                                       2 [   1    2    1    1 ]
                                       4 [   1    1    2    1 ]
                                       5 [   1   3/2  3/2   2 ]   (i) ,

                                          3    6
P_i [X_T = j] = (I_4 − Q)^{−1} R = 1 [  1/2  1/2 ]
                                   2 [  2/3  1/3 ]
                                   4 [  1/3  2/3 ]
                                   5 [  1/2  1/2 ] ,

E_i [T] = (I_4 − Q)^{−1} [1, 1, 1, 1]^T = [6, 5, 5, 6]^T   (i = 1, 2, 4, 5).

So for example, P_4 (X_T = 3 (food)) = 1/3, E_4 (number of visits to 1) = 1, E_5 (number of visits to 2) = 3/2, E_1 T = E_5 T = 6, and E_2 T = E_4 T = 5.

^2 It is not necessary to make states 3 and 6 absorbing. In fact it does not matter at all what the transition probabilities are for the chain leaving either of the states 3 or 6, since we are going to stop when we hit these states. This is reflected in the fact that the first thing we do in the first step analysis is to delete rows 3 and 6 from P. Making 3 and 6 absorbing simply saves a little ink.


For practice let us compute h_i = P_i (X_n hits 3 before 6) = P_i (X_T = 3 (food)). By the first step analysis we have,

h_6 = 0
h_3 = 1
h_5 = (1/2) (h_2 + h_4)
h_4 = (1/3) (h_1 + h_5 + h_6)
h_2 = (1/3) (h_1 + h_3 + h_5)
h_1 = (1/2) (h_2 + h_4)

which have solutions

[ h_1 = 1/2, h_2 = 2/3, h_3 = 1, h_4 = 1/3, h_5 = 1/2, h_6 = 0 ] .   (8.15)

Similarly if h_i = P_i (X_n hits 6 before 3) = P_i (X_T = 6) we have

h_6 = 1
h_3 = 0
h_5 = (1/2) (h_2 + h_4)
h_4 = (1/3) (h_1 + h_5 + h_6)
h_2 = (1/3) (h_1 + h_3 + h_5)
h_1 = (1/2) (h_2 + h_4)

which have solutions

[ h_1 = 1/2, h_2 = 1/3, h_3 = 0, h_4 = 2/3, h_5 = 1/2, h_6 = 1 ] .   (8.16)

Notice that the sum of the hitting probabilities in Eqs. (8.15) and (8.16) add up to 1 as they should. These results are in agreement with our previous results using the matrix method as well.

Exercise 8.1 (III.4.P11 on p.132). An urn contains two red and two green balls. The balls are chosen at random, one by one, and removed from the urn. The selection process continues until all of the green balls have been removed from the urn. What is the probability that a single red ball is in the urn at the time that the last green ball is chosen?

Solution to Exercise (III.4.P11 on p.132). Let's choose the states to be (G, R) = (i, j) with i, j = 0, 1, 2 so that (1, 2) implies that there is one green ball and two red balls in the urn. Let B = {(0, 0), (0, 1), (0, 2)} and

T = T_B = min {n ≥ 0 : X_n = (0, 0) or (0, 1) or (0, 2)} .

We wish to compute P (X_T = (0, 1) | X_0 = (2, 2)). The transition matrix for this chain is given by;

            (0,0) (0,1) (0,2) (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)
P = (0,0) [   1     0     0     0     0     0     0     0     0 ]
    (0,1) [   0     1     0     0     0     0     0     0     0 ]
    (0,2) [   0     0     1     0     0     0     0     0     0 ]
    (1,0) [   1     0     0     0     0     0     0     0     0 ]
    (1,1) [   0    1/2    0    1/2    0     0     0     0     0 ]
    (1,2) [   0     0    1/3    0    2/3    0     0     0     0 ]
    (2,0) [   0     0     0     1     0     0     0     0     0 ]
    (2,1) [   0     0     0     0    2/3    0    1/3    0     0 ]
    (2,2) [   0     0     0     0     0    1/2    0    1/2    0 ] .

Using the matrix method, first we remove the (0, 0), (0, 1), (0, 2) rows of P;

             (0,0) (0,1) (0,2) (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)
P′ = (1,0) [   1     0     0     0     0     0     0     0     0 ]
     (1,1) [   0    1/2    0    1/2    0     0     0     0     0 ]
     (1,2) [   0     0    1/3    0    2/3    0     0     0     0 ]
     (2,0) [   0     0     0     1     0     0     0     0     0 ]
     (2,1) [   0     0     0     0    2/3    0    1/3    0     0 ]
     (2,2) [   0     0     0     0     0    1/2    0    1/2    0 ]

and now form Q by removing the (0, 0), (0, 1), (0, 2) columns of P′ and R by keeping the (0, 0), (0, 1), (0, 2) columns of P′;


            (1,0) (1,1) (1,2) (2,0) (2,1) (2,2)
Q = (1,0) [   0     0     0     0     0     0 ]
    (1,1) [  1/2    0     0     0     0     0 ]
    (1,2) [   0    2/3    0     0     0     0 ]
    (2,0) [   1     0     0     0     0     0 ]
    (2,1) [   0    2/3    0    1/3    0     0 ]
    (2,2) [   0     0    1/2    0    1/2    0 ] ,

            (0,0) (0,1) (0,2)
R = (1,0) [   1     0     0  ]
    (1,1) [   0    1/2    0  ]
    (1,2) [   0     0    1/3 ]
    (2,0) [   0     0     0  ]
    (2,1) [   0     0     0  ]
    (2,2) [   0     0     0  ] .

So

                                                      (0,0) (0,1) (0,2)
P_{(a,b)} [X_{T_B} = (c, d)] = (I − Q)^{−1} R = (1,0) [  1     0     0  ]
                                                (1,1) [ 1/2   1/2    0  ]
                                                (1,2) [ 1/3   1/3   1/3 ]
                                                (2,0) [  1     0     0  ]
                                                (2,1) [ 2/3   1/3    0  ]
                                                (2,2) [ 1/2   1/3   1/6 ]

and therefore,

P_{(2,2)} (X_T = (0, 1)) = P (X_T = (0, 1) | X_0 = (2, 2)) = 1/3.
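The urn computation is also easy to automate. Here is a sketch (my own illustrative code; the helper p below simply encodes the draw-without-replacement dynamics assumed in the exercise) that rebuilds Q and R and recovers the answer 1/3.

import numpy as np
from fractions import Fraction as F

A = [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]   # transient states (green > 0)
B = [(0, 0), (0, 1), (0, 2)]                            # absorbing states (no green left)

def p(s, t):
    g, r = s
    n = g + r
    if g > 0 and t == (g - 1, r):
        return F(g, n)              # draw a green ball
    if g > 0 and r > 0 and t == (g, r - 1):
        return F(r, n)              # draw a red ball (greens still remain)
    return F(0)

Q = np.array([[p(s, t) for t in A] for s in A], dtype=float)
R = np.array([[p(s, t) for t in B] for s in A], dtype=float)
hit = np.linalg.inv(np.eye(len(A)) - Q) @ R
print(hit[A.index((2, 2)), B.index((0, 1))])            # 0.3333... = 1/3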

Theorem 8.18. Let h : B → [0, ∞] and g : A → [0, ∞] be given and for x ∈ S let^3

w (x) := E_x [ h (X_{T_B}) · \sum_{n < T_B} g (X_n) : T_B < ∞ ]   and
g_h (x) := g (x) E_x [h (X_{T_B}) : T_B < ∞] ;

then

w (x) = E_x [ \sum_{n < T_B} g_h (X_n) : T_B < ∞ ] .   (8.17)

^3 Recall from Theorem 8.12 that u_h = (I − Q)^{−1} R h, i.e. u = h on B and u satisfies

u (x) = \sum_{y∈A} p (x, y) u (y) + \sum_{y∈B} p (x, y) h (y) for all x ∈ A.

Remark 8.19. Recall that we can find u_h (x) := E_x [h (X_{T_B}) : T_B < ∞] using Theorem 8.12 and then we can solve for w (x) using Theorem 8.13 with g replaced by g_h (x) = g (x) u_h (x) . So in the matrix language we solve for w (x) as follows;

u_h := (I − Q)^{−1} R h,
g_h := g ∗ u_h, and
w = (I − Q)^{−1} g_h,

where [a ∗ b]_x := a_x · b_x – the entry by entry product of column vectors.

Proof. First proof. Let H (X) := h (X_{T_B}) 1_{T_B < ∞}; then using 1_{n < T_B(X)} = 1_{X_0 ∈ A, ..., X_n ∈ A} and

H (X_0, . . . , X_{n−1}, X_n, . . . ) = H (X_n, X_{n+1}, . . . ) when X_0, . . . , X_n ∈ A

along with the Markov property in Theorem 8.1 shows;

w (x) = \sum_{n=0}^∞ E_x [H (X) · 1_{n < T_B} g (X_n)]
  = \sum_{n=0}^∞ E_x [H (X) · 1_{X_0 ∈ A, ..., X_n ∈ A} g (X_n)]
  = \sum_{n=0}^∞ E_x [ E^{(Y)}_{X_n} [H (X_0, . . . , X_{n−1}, Y )] · 1_{X_0 ∈ A, ..., X_n ∈ A} g (X_n) ]
  = \sum_{n=0}^∞ E_x [ E^{(Y)}_{X_n} H (Y ) · 1_{X_0 ∈ A, ..., X_n ∈ A} g (X_n) ]
  = \sum_{n=0}^∞ E_x [ u_h (X_n) · 1_{X_0 ∈ A, ..., X_n ∈ A} g (X_n) ]
  = \sum_{n=0}^∞ E_x [ u_h (X_n) · g (X_n) 1_{n < T_B} ]
  = E_x [ \sum_{n < T_B} u_h (X_n) · g (X_n) ] = E_x [ \sum_{n < T_B} g_h (X_n) ] .

Second proof. Let G (X) := \sum_{n < T_B} g (X_n) and observe that


H (x, Y ) = { H (Y ) if x ∈ A ;  h (x) if x ∈ B }   and   G (x, Y ) = g (x) + G (Y )

and so by the first step analysis we find,

w (x) = E_x [H (X) G (X)] = E_{p(x,·)} [H (x, Y ) G (x, Y )]
  = E_{p(x,·)} [H (x, Y ) (g (x) + G (Y ))]
  = g (x) E_{p(x,·)} [H (x, Y )] + E_{p(x,·)} [H (x, Y ) G (Y )] .

The first step analysis also shows (see the proof of Theorem 8.12)

u_h (x) := E_x [h (X_{T_B}) 1_{T_B < ∞}] = E_x [H (X)] = E_{p(x,·)} [H (x, Y )]

and therefore,

w (x) = g (x) u_h (x) + E_{p(x,·)} [H (x, Y ) G (Y )] .

Since G (Y ) = 0 if Y_0 ∈ B and H (x, Y ) = H (Y ) if Y_0 ∈ A we find,

E_{p(x,·)} [H (x, Y ) G (Y )] = \sum_{y∈S} p (x, y) E_y [H (x, Y ) G (Y )]
  = \sum_{y∈A} p (x, y) E_y [H (x, Y ) G (Y )]
  = \sum_{y∈A} p (x, y) E_y [H (Y ) G (Y )]
  = \sum_{y∈A} p (x, y) w (y)

and hence

w (x) = g (x) u_h (x) + \sum_{y∈A} p (x, y) w (y) = g_h (x) + \sum_{y∈A} p (x, y) w (y) .

But Theorem 8.13 with g replaced by g_h then shows w is given by Eq. (8.17).

But Theorem 8.13 with g replaced by gh then shows w is given by Eq. (8.17).

Example 8.20 (A possible carnival game). Suppose that B is the disjoint union of L and W and suppose that you win \sum_{n < T_B} g (X_n) if you end in W and win nothing when you end in L. What is the least we can expect to have to pay to play this game and where in A := S \ B should we choose to start the game? To answer these questions we should compute our expected winnings (w (x)) for each starting point x ∈ A;

w (x) = E_x [ 1_W (X_{T_B}) \sum_{n < T_B} g (X_n) ] .

Once we find w we should expect to pay at least C := max_{x∈A} w (x) and we should start at a location x_0 ∈ A where w (x_0) = max_{x∈A} w (x) = C. As an application of Theorem 8.18 we know that

w (x) = [ (I − Q)^{−1} g_h ]_x

where^4

g_h (x) = g (x) E_x [1_W (X_{T_B})] = g (x) P_x (X_{T_B} ∈ W ) .

Let us now specialize these results to the chain in Example 8.14 where

         1    2    3    4
P = 1 [  0   1/3  1/3  1/3 ]
    2 [ 3/4  1/8  1/8   0  ]
    3 [  0    0    1    0  ]
    4 [  0    0    0    1  ] .

Let us make 4 the winning state and 3 the losing state (i.e. h (3) = 0 and h (4) = 1) and let g = (g (1) , g (2)) be the payoff function. We have already seen that

[ u_h (1) ]   [ P_1 (X_{T_B} = 4) ]   [ 7/15 ]                       [ (7/15) g_1 ]
[ u_h (2) ] = [ P_2 (X_{T_B} = 4) ] = [ 2/5  ]    so that g ∗ u_h =  [ (2/5) g_2  ]

and therefore

]and therefore

[w (1)w (2)

]= (I −Q)

−1

[715g125g2

]=

[75

815

65

85

] [715g125g2

]=

[4975g1 + 16

75g21425g1 + 16

25g2

].

Let us examine a few different choices for g.

1. When g (1) = 32 and g (2) = 7, we have

[ w (1) ]   [ 7/5  8/15 ] [ (7/15) · 32 ]   [ 112/5 ]   [ 22.4 ]
[ w (2) ] = [ 6/5   8/5 ] [ (2/5) · 7   ] = [ 112/5 ] = [ 22.4 ]

and so it does not matter where we start and we are going to have to pay at least $22.40 to play.

^4 Intuitively, the effective payoff for a visit to site x is g (x) · P_x (we win) + 0 · P_x (we lose) .


2. When g (1) = 10 = g (2) , then

[ w (1) ]   [ 7/5  8/15 ] [ (7/15) · 10 ]   [ 26/3 ]   [ 8.6667 ]
[ w (2) ] = [ 6/5   8/5 ] [ (2/5) · 10  ] = [  12  ] = [ 12.0   ]

and we should enter the game at site 2. We are going to have to pay at least $12 to play.

3. If g (1) = 20 and g (2) = 7,

[ w (1) ]   [ 7/5  8/15 ] [ (7/15) · 20 ]   [ 364/25 ]   [ 14.56 ]
[ w (2) ] = [ 6/5   8/5 ] [ (2/5) · 7   ] = [ 392/25 ] = [ 15.68 ]

and again we should enter the game at site 2. We are going to have to pay at least $15.68 to play.
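The three payoff choices above can be checked in one short loop; here is a sketch of the recipe u_h = (I − Q)^{−1} R h, g_h = g ∗ u_h, w = (I − Q)^{−1} g_h from Remark 8.19 (the use of numpy is my own assumption).

import numpy as np

Q = np.array([[0.0, 1/3], [3/4, 1/8]])
R = np.array([[1/3, 1/3], [1/8, 0.0]])
h = np.array([0.0, 1.0])                    # lose at 3, win at 4
M = np.linalg.inv(np.eye(2) - Q)
u_h = M @ R @ h                             # [7/15, 2/5]

for g in ([32, 7], [10, 10], [20, 7]):      # the three payoff choices above
    w = M @ (np.array(g, dtype=float) * u_h)
    print(g, w)                             # [22.4, 22.4], [8.667, 12.0], [14.56, 15.68]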

8.4 Random Walk Exercises

Exercise 8.2 (Uniqueness of solutions to 2nd order recurrence relations). Let a, b, c be real numbers with a ≠ 0 ≠ c, α, β ∈ Z ∪ {±∞} with α < β, and f : Z ∩ (α, β) → R be a given function. Show that there is exactly one function u : [α, β] ∩ Z → R with prescribed values on two consecutive points in [α, β] ∩ Z which satisfies the second order recurrence relation:

a u (x + 1) + b u (x) + c u (x − 1) = f (x) for all x ∈ Z ∩ (α, β) .   (8.18)

Equivalently: if u and w both satisfy Eq. (8.18) and u = w on two consecutive points in (α, β) ∩ Z, then u (x) = w (x) for all x ∈ [α, β] ∩ Z.

Solution to Exercise (8.2). Suppose we are given u (x_0) = μ and u (x_0 + 1) = ν for some μ, ν ∈ R and x_0 ∈ Z such that {x_0, x_0 + 1} ⊂ [α, β] ∩ Z. If x_0 − 1 ∈ [α, β] ∩ Z, then according to Eq. (8.18),

a u (x_0 + 1) + b u (x_0) + c u (x_0 − 1) = f (x_0) ,

from which we may uniquely determine u (x_0 − 1) . Repeating this procedure we see that we can uniquely determine u (x) for all x ∈ [α, x_0] ∩ Z. Similarly, if x_0 + 2 ∈ [α, β] ∩ Z, Eq. (8.18) implies,

a u (x_0 + 2) + b u (x_0 + 1) + c u (x_0) = f (x_0 + 1) ,

from which we may uniquely determine u (x_0 + 2) . Repeating this procedure we see that we can uniquely determine u (x) for all x ∈ [x_0 + 1, β] ∩ Z and hence there is exactly one function satisfying Eq. (8.18) with u (x_0) = μ and u (x_0 + 1) = ν.

Exercise 8.3 (General homogeneous solutions). Let a, b, c be real numbers with a ≠ 0 ≠ c, α, β ∈ Z ∪ {±∞} with α < β, and suppose {u (x) : x ∈ [α, β] ∩ Z} solves the second order homogeneous recurrence relation

a u (x + 1) + b u (x) + c u (x − 1) = 0 for all x ∈ Z ∩ (α, β) ,   (8.19)

i.e. Eq. (8.18) with f (x) ≡ 0. Show:

1. for any λ ∈ C,

a λ^{x+1} + b λ^x + c λ^{x−1} = λ^{x−1} p (λ)   (8.20)

where p (λ) = a λ^2 + b λ + c is the characteristic polynomial associated to Eq. (8.18).

Let λ_± = (−b ± \sqrt{b^2 − 4ac}) / (2a) be the roots of p (λ) and suppose for the moment that b^2 − 4ac ≠ 0. From Eq. (8.20) it follows that for any choice of A_± ∈ R, the function,

w (x) := A_+ λ_+^x + A_− λ_−^x ,   (8.21)

solves Eq. (8.19) for all x ∈ Z.

2. Show there is a unique choice of constants, A_± ∈ R, such that the function u (x) is given by

u (x) := A_+ λ_+^x + A_− λ_−^x for all α ≤ x ≤ β.

3. Now suppose that b^2 = 4ac and λ_0 := −b / (2a) is the double root of p (λ) . Show for any choice of A_0 and A_1 in R that

w (x) := (A_0 + A_1 x) λ_0^x   (8.22)

solves Eq. (8.19) for all x ∈ Z. Hint: Differentiate Eq. (8.20) with respect to λ and then set λ = λ_0.

4. Again show that any function u solving Eq. (8.19) is of the form u (x) = (A_0 + A_1 x) λ_0^x for α ≤ x ≤ β for some unique choice of constants A_0, A_1 ∈ R.

Solution to Exercise (8.3). Items 1. and 3. follow by a direct verification. Indeed if u (x) = λ^x, then

a u (x + 1) + b u (x) + c u (x − 1) = a λ^{x+1} + b λ^x + c λ^{x−1} = λ^{x−1} [a λ^2 + b λ + c] = λ^{x−1} p (λ) .

Since the equation is linear it now follows that w given in Eq. (8.21) solves Eq. (8.19). It can be directly verified that x λ_0^x solves Eq. (8.19) when b^2 = 4ac and then linearity again shows that Eq. (8.22) solves Eq. (8.19) in this case. Alternatively, if we differentiate Eq. (8.20) we find


a (x + 1) λ^x + b x λ^{x−1} + c (x − 1) λ^{x−2} = (x − 1) λ^{x−2} p (λ) + λ^{x−1} p′ (λ) .

Since λ_0 is a double root for p, p (λ_0) = 0 = p′ (λ_0), so if we evaluate the previous equation at λ = λ_0 and multiply the result by λ_0 we find,

0 = a (x + 1) λ_0^{x+1} + b x λ_0^x + c (x − 1) λ_0^{x−1} ,

i.e. u (x) = x λ_0^x solves Eq. (8.19).

According to Exercise 8.2, to prove items 2. and 4. we have to show that we may adjust the constants A_± or A_0 and A_1 so that u (x) = μ and u (x + 1) = ν where μ and ν are given numbers in R and x ∈ [α, β) ∩ Z. In the case of item 2. (b^2 − 4ac ≠ 0) this amounts to solving,

A_+ λ_+^x + A_− λ_−^x = μ   and   A_+ λ_+^{x+1} + A_− λ_−^{x+1} = ν

which can always be done since

det [ λ_+^x  λ_−^x ; λ_+^{x+1}  λ_−^{x+1} ] = λ_+^x λ_−^x (λ_− − λ_+) ≠ 0.

Here we have used λ_± ≠ 0 since p (0) = c ≠ 0.^5

Similarly in the case of item 4. (b^2 − 4ac = 0), we must show there exist A_0, A_1 ∈ R such that

(A_0 + A_1 x) λ_0^x = μ   and   (A_0 + A_1 (x + 1)) λ_0^{x+1} = ν

which is again the case since

det [ λ_0^x  x λ_0^x ; λ_0^{x+1}  (x + 1) λ_0^{x+1} ] = λ_0^{2x+1} ≠ 0.

Again we have used λ_0 ≠ 0 since p (0) = c ≠ 0. In fact in this case,

|λ_0| = |b / (2a)| = \sqrt{ac} / |a| = \sqrt{c/a} ≠ 0.

^5 In fact, λ_+ · λ_− = (b^2 − (b^2 − 4ac)) / (4a^2) = c/a ≠ 0.

In the next group of exercises you are going to use first step analysis to show that a simple unbiased random walk on Z is null recurrent. We let {X_n}_{n=0}^∞ be the Markov chain with values in Z with transition probabilities given by

P (X_{n+1} = x ± 1 | X_n = x) = 1/2 for all n ∈ N_0 and x ∈ Z.

Further let a, b ∈ Z with a < 0 < b and

T_{a,b} := min {n : X_n ∈ {a, b}}   and   T_b := inf {n : X_n = b} .

We know by Corollary 8.7 that E_0 [T_{a,b}] < ∞, from which it follows that P (T_{a,b} < ∞) = 1 for all a < 0 < b. For these reasons we will ignore the event {T_{a,b} = ∞} in what follows below.

Exercise 8.4. Let w (x) := P_x (X_{T_{a,b}} = b) := P (X_{T_{a,b}} = b | X_0 = x) .

1. Use first step analysis to show for a < x < b that

w (x) = (1/2) (w (x + 1) + w (x − 1))   (8.23)

provided we define w (a) = 0 and w (b) = 1.

2. Use the results of Exercises 8.2 and 8.3 to show

P_x (X_{T_{a,b}} = b) = w (x) = (x − a) / (b − a) .   (8.24)

3. Let

T_b := { min {n : X_n = b} if X_n hits b ;  ∞ otherwise }

be the first time X_n hits b. Explain why {X_{T_{a,b}} = b} ⊂ {T_b < ∞} and use this along with Eq. (8.24) to conclude that P_x (T_b < ∞) = 1 for all x < b. (By symmetry this result holds true for all x ∈ Z.)

Exercise 8.5. The goal of this exercise is to give a second proof of the fact that P_x (T_b < ∞) = 1. Here is the outline:

1. Let w (x) := P_x (T_b < ∞) . Again use first step analysis to show that w (x) satisfies Eq. (8.23) for all x with w (b) = 1.
2. Use Exercises 8.2 and 8.3 to show that there is a constant, c, such that

w (x) = c (x − b) + 1 for all x ∈ Z.

3. Explain why c must be zero to again show that P_x (T_b < ∞) = 1 for all x ∈ Z.

Exercise 8.6. Let T = T_{a,b} and u (x) := E_x T := E [T | X_0 = x] .

1. Use first step analysis to show for a < x < b that

u (x) = (1/2) (u (x + 1) + u (x − 1)) + 1   (8.25)

with the convention that u (a) = 0 = u (b) .


2. Show that

u (x) = A_0 + A_1 x − x^2   (8.26)

solves Eq. (8.25) for any choice of constants A_0 and A_1.
3. Choose A_0 and A_1 so that u (x) satisfies the boundary conditions, u (a) = 0 = u (b) . Use this to conclude that

E_x T_{a,b} = −ab + (b + a) x − x^2 = −a (b − x) + bx − x^2 .   (8.27)

Remark 8.21. Notice that T_{a,b} ↑ T_b = inf {n : X_n = b} as a ↓ −∞, and so passing to the limit as a ↓ −∞ in Eq. (8.27) shows

E_x T_b = ∞ for all x < b.

Combining the last couple of exercises together shows that X_n is "null recurrent."
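Both Eq. (8.24) and Eq. (8.27) are easy to test by simulation. The following Monte Carlo sketch is my own illustration (the values of a, b, x and the number of trials are arbitrary choices), not part of the exercises.

import random

def run(x, a, b):
    # Run one unbiased walk started at x until it hits a or b.
    t = 0
    while a < x < b:
        x += random.choice((-1, 1))
        t += 1
    return x == b, t

a, b, x, trials = -4, 6, 0, 200_000
hits_b, total_t = 0, 0
for _ in range(trials):
    hit, t = run(x, a, b)
    hits_b += hit
    total_t += t

print(hits_b / trials, (x - a) / (b - a))              # ≈ 0.4  (Eq. 8.24)
print(total_t / trials, -a * b + (a + b) * x - x**2)   # ≈ 24   (Eq. 8.27)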

Exercise 8.7. Let T = T_b. The goal of this exercise is to give a second proof of the fact that u (x) := E_x T = ∞ for all x ≠ b. Here is the outline. Let u (x) := E_x T ∈ [0, ∞] = [0, ∞) ∪ {∞} .

1. Note that u (b) = 0 and, by a first step analysis, that u (x) satisfies Eq. (8.25) for all x ≠ b – allowing for the possibility that some of the u (x) may be infinite.
2. Argue, using Eq. (8.25), that if u (x) < ∞ for some x < b then u (y) < ∞ for all y < b. Similarly, if u (x) < ∞ for some x > b then u (y) < ∞ for all y > b.
3. If u (x) < ∞ for all x > b then u (x) must be of the form in Eq. (8.26) for some A_0 and A_1 in R such that u (b) = 0. However, this would imply u (x) = E_x T → −∞ as x → ∞, which is impossible since E_x T ≥ 0 for all x. Thus we must conclude that E_x T = u (x) = ∞ for all x > b. (A similar argument works if we assume that u (x) < ∞ for all x < b.)

For the remaining exercises in this section we will assume that p ∈ (1/2, 1) and q = 1 − p so that p/q > 1.

Exercise 8.8 (Biased random walks I). Let p ∈ (1/2, 1) and consider the biased random walk {X_n}_{n≥0} on S = Z where X_n = ξ_0 + ξ_1 + · · · + ξ_n, {ξ_i}_{i=1}^∞ are i.i.d. with P (ξ_i = 1) = p ∈ (0, 1) and P (ξ_i = −1) = q := 1 − p, and ξ_0 = x for some x ∈ Z. Let T = T_0 be the first hitting time of 0 and u (x) := P_x (T < ∞) .

a) Use the first step analysis to show

u (x) = p u (x + 1) + q u (x − 1) for x ≠ 0 and u (0) = 1.   (8.28)

b) Use Eq. (8.28) along with Exercises 8.2 and 8.3 to show for some a_± ∈ R that

u (x) = (1 − a_+) + a_+ (q/p)^x for x ≥ 0 and   (8.29)
u (x) = (1 − a_−) + a_− (q/p)^x for x ≤ 0.   (8.30)

c) By considering the limit as x → −∞ conclude that a_− = 0 and u (x) = 1 for all x < 0, i.e. P_x (T_0 < ∞) = 1 for all x ≤ 0.

Exercise 8.9 (Biased random walks II). The goal of this exercise is to evaluate P_x (T_0 < ∞) for x ≥ 0. To do this let B_n := {0, n} and T_n := T_{0,n}. Let h (x) := P_x (X_{T_n} = 0), where {X_{T_n} = 0} is the event of hitting 0 before n.

a) Use the first step analysis to show

h (x) = p h (x + 1) + q h (x − 1) with h (0) = 1 and h (n) = 0.

b) Show the unique solution to this equation is given by

P_x (X_{T_n} = 0) = h (x) = ((q/p)^x − (q/p)^n) / (1 − (q/p)^n) .

c) Argue that

P_x (T < ∞) = lim_{n→∞} P_x (X_{T_n} = 0) = (q/p)^x for all x ≥ 0 (which is < 1 when x > 0).

The following formula summarizes Exercises 8.8 and 8.9; for 1/2 < p < 1,

P_x (T < ∞) = { (q/p)^x if x ≥ 0 ;  1 if x < 0 } .   (8.31)
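Equation (8.31) is also easy to check by simulation. Here is a rough sketch (my own illustrative code; the escape_level cutoff is a practical approximation, since a walk with upward drift that reaches a high level returns to 0 only with negligible probability (q/p)^{escape_level}).

import random

def hits_zero(x, p, escape_level=100):
    # Run the biased walk from x; stop when it hits 0 or climbs to escape_level.
    while 0 < x < escape_level:
        x += 1 if random.random() < p else -1
    return x == 0

p, q, x, trials = 0.6, 0.4, 3, 20_000
est = sum(hits_zero(x, p) for _ in range(trials)) / trials
print(est, (q / p) ** x)        # both ≈ (2/3)^3 ≈ 0.296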

Example 8.22 (Biased random walks III). Continue the notation in Exercise 8.8. Let us start to compute E_x T. Since P_x (T = ∞) > 0 for x > 0 we already know that E_x T = ∞ for all x > 0. Nevertheless we will deduce this fact again here. Letting u (x) = E_x T, it follows by the first step analysis that, for x ≠ 0,

u (x) = p [1 + u (x + 1)] + q [1 + u (x − 1)] = p u (x + 1) + q u (x − 1) + 1   (8.32)

with u (0) = 0. Notice u (x) = ∞ is a solution to this equation, while if u (n) < ∞ for some n ≠ 0 then Eq. (8.32) implies that u (x) < ∞ for all x ≠ 0 with the same sign as n. A particular solution to this equation may be found by trying u (x) = αx to learn,

αx = p α (x + 1) + q α (x − 1) + 1 = αx + α (p − q) + 1,


which is valid for all x provided α = (q − p)^{−1}. The general finite solution to Eq. (8.32) is therefore,

u (x) = (q − p)^{−1} x + a + b (q/p)^x .   (8.33)

Using the boundary condition u (0) = 0 allows us to conclude that a + b = 0 and therefore,

u (x) = (q − p)^{−1} x + a [1 − (q/p)^x] .   (8.34)

Notice that u (x) → −∞ as x → +∞ no matter how a is chosen, and therefore we must conclude that the desired solution to Eq. (8.32) is u (x) = ∞ for x > 0, as we already mentioned. In the next exercise you will compute E_x T for x < 0.

Exercise 8.10 (Biased random walks IV). Continue the notation in Example 8.22. Using the outline below, show

E_x T = |x| / (p − q) for x ≤ 0.   (8.35)

In the following outline n is a negative integer, T_n is the first hitting time of n, so that T_{n,0} = T_n ∧ T = min {T, T_n} is the first hitting time of {n, 0} . By Corollary 8.7 we know that u (x) := E_x [T_{n,0}] < ∞ for all n ≤ x ≤ 0 and by a first step analysis one sees that u (x) still satisfies Eq. (8.32) for n < x < 0 and has boundary conditions u (n) = 0 = u (0) .

a) From Eq. (8.34) we know that, for some a ∈ R,

E_x [T_{n,0}] = u (x) = (q − p)^{−1} x + a [1 − (q/p)^x] .

Use u (n) = 0 in order to show

a = a_n = n / ((1 − (q/p)^n) (p − q))

and therefore,

E_x [T_{n,0}] = (1 / (p − q)) [ |x| + n (1 − (q/p)^x) / (1 − (q/p)^n) ] for n ≤ x ≤ 0.

b) Argue that E_x T = lim_{n→−∞} E_x [T_n ∧ T] and use this and part a) to prove Eq. (8.35).

8.5 Computations avoiding the first step analysis

You may (SHOULD) skip the rest of this chapter!!

Theorem 8.23. Let n denote a non-negative integer. If h : B → R is measurable and either bounded or non-negative, then

E_x [h (X_n) : T_B = n] = (Q_A^{n−1} Q [1_B h]) (x)

and

E_x [h (X_{T_B}) : T_B < ∞] = ( \sum_{n=0}^∞ Q_A^n Q [1_B h] ) (x) .   (8.36)

If g : A → R_+ is a measurable function, then for all x ∈ A and n ∈ N_0,

E_x [g (X_n) 1_{n < T_B}] = (Q_A^n g) (x) .

In particular we have

E_x [ \sum_{n < T_B} g (X_n) ] = \sum_{n=0}^∞ (Q_A^n g) (x) =: u (x) ,   (8.37)

where by convention, \sum_{n < T_B} g (X_n) = 0 when T_B = 0.

Proof. Let x ∈ A. In computing each of these quantities we will use;

{T_B > n} = {X_i ∈ A for 0 ≤ i ≤ n}   and
{T_B = n} = {X_i ∈ A for 0 ≤ i ≤ n − 1} ∩ {X_n ∈ B} .

From the second identity above it follows that

E_x [h (X_n) : T_B = n] = E_x [h (X_n) : (X_1, . . . , X_{n−1}) ∈ A^{n−1}, X_n ∈ B]
  = \int_{A^{n−1} × B} \prod_{j=1}^n Q (x_{j−1}, dx_j) h (x_n)
  = (Q_A^{n−1} Q [1_B h]) (x)

and therefore

E_x [h (X_{T_B}) : T_B < ∞] = \sum_{n=1}^∞ E_x [h (X_n) : T_B = n] = \sum_{n=1}^∞ Q_A^{n−1} Q [1_B h] = \sum_{n=0}^∞ Q_A^n Q [1_B h] .


Similarly,

E_x [g (X_n) 1_{n < T_B}] = \int_{A^n} Q (x, dx_1) Q (x_1, dx_2) . . . Q (x_{n−1}, dx_n) g (x_n) = (Q_A^n g) (x)

and therefore,

E_x [ \sum_{n=0}^∞ g (X_n) 1_{n < T_B} ] = \sum_{n=0}^∞ E_x [g (X_n) 1_{n < T_B}] = \sum_{n=0}^∞ (Q_A^n g) (x) .

In practice it is not so easy to sum the series in Eqs. (8.36) and (8.37). Thus we would like to have another way to compute these quantities. Since \sum_{n=0}^∞ Q_A^n is a geometric series, we expect that

\sum_{n=0}^∞ Q_A^n = (I − Q_A)^{−1},

which is basically correct, at least when (I − Q_A) is invertible. This suggests that if u (x) = E_x [h (X_{T_B}) : T_B < ∞] , then (see Eq. (8.36))

u = Q_A u + Q [1_B h] on A,   (8.38)

and if u (x) = E_x [ \sum_{n < T_B} g (X_n) ] , then (see Eq. (8.37))

u = Q_A u + g on A.   (8.39)

That these equations are valid is the content of Corollary 8.29 below and Theorem 8.13 above. We will give another direct proof in Theorem 8.28 below as well.

Lemma 8.24. Keeping the notation above we have

E_x T = \sum_{n=0}^∞ \sum_{y∈A} Q^n (x, y) for all x ∈ A,   (8.40)

where E_x T = ∞ is possible.

Proof. By definition of T we have for x ∈ A and n ∈ N_0 that,

P_x (T > n) = P_x (X_1, . . . , X_n ∈ A) = \sum_{x_1,...,x_n ∈ A} p (x, x_1) p (x_1, x_2) . . . p (x_{n−1}, x_n) = \sum_{y∈A} Q^n (x, y) .   (8.41)

Therefore Eq. (8.40) now follows from Lemma 8.3 and Eq. (8.41).

Proposition 8.25. Let us continue the notation above and let us further assume that A is a finite set and

P_x (T < ∞) = P_x (X_n ∈ B for some n) > 0 ∀ x ∈ A.   (8.42)

Under these assumptions, E_x T < ∞ for all x ∈ A and in particular P_x (T < ∞) = 1 for all x ∈ A. In this case we may write Eq. (8.40) as

(E_x T)_{x∈A} = (I − Q)^{−1} 1   (8.43)

where 1 (x) = 1 for all x ∈ A.

Proof. Since {T > n} ↓ {T = ∞} and P_x (T = ∞) < 1 for all x ∈ A, it follows that there exists an m ∈ N and 0 ≤ α < 1 such that P_x (T > m) ≤ α for all x ∈ A. Since P_x (T > m) = \sum_{y∈A} Q^m (x, y), it follows that the row sums of Q^m are all less than α < 1. Further observe that

\sum_{y∈A} Q^{2m} (x, y) = \sum_{y,z∈A} Q^m (x, z) Q^m (z, y) = \sum_{z∈A} Q^m (x, z) \sum_{y∈A} Q^m (z, y) ≤ \sum_{z∈A} Q^m (x, z) α ≤ α^2.

Similarly one may show that \sum_{y∈A} Q^{km} (x, y) ≤ α^k for all k ∈ N. Therefore from Eq. (8.41) with n replaced by km, we learn that P_x (T > km) ≤ α^k for all k ∈ N, which then implies that

\sum_{y∈A} Q^n (x, y) = P_x (T > n) ≤ α^{⌊n/m⌋} for all n ∈ N,

where ⌊t⌋ = k ∈ N_0 if k ≤ t < k + 1, i.e. ⌊t⌋ is the largest integer not exceeding t. Therefore, we have

E_x T = \sum_{n=0}^∞ \sum_{y∈A} Q^n (x, y) ≤ \sum_{n=0}^∞ α^{⌊n/m⌋} ≤ m · \sum_{l=0}^∞ α^l = m / (1 − α) < ∞.

So it only remains to prove Eq. (8.43). From the above computations we see that \sum_{n=0}^∞ Q^n is convergent. Moreover,


(I − Q) \sum_{n=0}^∞ Q^n = \sum_{n=0}^∞ Q^n − \sum_{n=0}^∞ Q^{n+1} = I

and therefore (I − Q) is invertible and \sum_{n=0}^∞ Q^n = (I − Q)^{−1}. Finally,

(I − Q)^{−1} 1 = \sum_{n=0}^∞ Q^n 1 = ( \sum_{n=0}^∞ \sum_{y∈A} Q^n (x, y) )_{x∈A} = (E_x T)_{x∈A}

as claimed.

Remark 8.26. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, . . . } with 0 being an absorbing state. Let T = T_0, i.e. B = {0}, so that A = N is now an infinite set. From Remark 8.21, we learn that E_i T = ∞ for all i > 0. This shows that we can not in general drop the assumption that A (here A = {1, 2, . . . }) is a finite set in the statement of Proposition 8.25.

8.5.1 General facts about sub-probability kernels

Definition 8.27. Suppose (A, \mathcal{A}) is a measurable space. A sub-probability kernel on (A, \mathcal{A}) is a function ρ : A × \mathcal{A} → [0, 1] such that ρ (·, C) is \mathcal{A}/\mathcal{B}_R – measurable for all C ∈ \mathcal{A} and ρ (x, ·) : \mathcal{A} → [0, 1] is a measure for all x ∈ A.

As with probability kernels we will identify ρ with the linear map, ρ : \mathcal{A}_b → \mathcal{A}_b, given by

(ρ f) (x) = ρ (x, f) = \int_A f (y) ρ (x, dy) .

Of course we have in mind that A = S_A and ρ = Q_A. In the following lemma let ‖g‖_∞ := sup_{x∈A} |g (x)| for all g ∈ \mathcal{A}_b.

Theorem 8.28. Let ρ be a sub-probability kernel on a measurable space (A, \mathcal{A}) and define u_n (x) := (ρ^n 1) (x) for all x ∈ A and n ∈ N_0. Then;

1. u_n is a decreasing sequence so that u := lim_{n→∞} u_n exists and is in \mathcal{A}_b. (When ρ = Q_A, u_n (x) = P_x (T_B > n) ↓ u (x) = P_x (T_B = ∞) as n → ∞.)
2. The function u satisfies ρ u = u.
3. If w ∈ \mathcal{A}_b and ρ w = w then |w| ≤ ‖w‖_∞ u. In particular the equation, ρ w = w, has a non-zero solution w ∈ \mathcal{A}_b iff u ≠ 0.
4. If u = 0 and g ∈ \mathcal{A}_b, then there is at most one w ∈ \mathcal{A}_b such that w = ρ w + g.
5. Let

U := \sum_{n=0}^∞ u_n = \sum_{n=0}^∞ ρ^n 1 : A → [0, ∞]   (8.44)

and suppose that U (x) < ∞ for all x ∈ A. Then for each g ∈ \mathcal{A}_b,

w = \sum_{n=0}^∞ ρ^n g   (8.45)

is absolutely convergent,

|w| ≤ ‖g‖_∞ U,   (8.46)

ρ (x, |w|) < ∞ for all x ∈ A, and w solves w = ρ w + g. Moreover if v also solves v = ρ v + g and |v| ≤ C U for some C < ∞, then v = w. Observe that when ρ = Q_A,

U (x) = \sum_{n=0}^∞ P_x (T_B > n) = \sum_{n=0}^∞ E_x (1_{T_B > n}) = E_x ( \sum_{n=0}^∞ 1_{T_B > n} ) = E_x [T_B] .

6. If g : A → [0, ∞] is any measurable function then

w := \sum_{n=0}^∞ ρ^n g : A → [0, ∞]

is a solution to w = ρ w + g. (It may be that w ≡ ∞ though!) Moreover if v : A → [0, ∞] satisfies v = ρ v + g then w ≤ v. Thus w is the minimal non-negative solution to v = ρ v + g.
7. If there exists α < 1 such that u ≤ α on A then u = 0. (When ρ = Q_A, this states that P_x (T_B = ∞) ≤ α for all x ∈ A implies P_x (T_B = ∞) = 0 for all x ∈ A.)
8. If there exists an α < 1 and an n ∈ N such that u_n = ρ^n 1 ≤ α on A, then there exists C < ∞ such that

u_k (x) = (ρ^k 1) (x) ≤ C β^k for all x ∈ A and k ∈ N_0

where β := α^{1/n} < 1. In particular, U ≤ C (1 − β)^{−1} and u = 0 under this assumption. (When ρ = Q_A this assertion states; if P_x (T_B > n) ≤ α for all x ∈ A, then P_x (T_B > k) ≤ C β^k and E_x T_B ≤ C (1 − β)^{−1} for all k ∈ N_0.)

Proof. We will prove each item in turn.

1. First observe that u_1 (x) = ρ (x, A) ≤ 1 = u_0 (x) and therefore,

u_{n+1} = ρ^{n+1} 1 = ρ^n u_1 ≤ ρ^n 1 = u_n.

We now let u := lim_{n→∞} u_n so that u : A → [0, 1] .
2. Using DCT we may let n → ∞ in the identity ρ u_n = u_{n+1} in order to show ρ u = u.


3. If w ∈ \mathcal{A}_b with ρ w = w, then

|w| = |ρ^n w| ≤ ρ^n |w| ≤ ‖w‖_∞ ρ^n 1 = ‖w‖_∞ · u_n.

Letting n → ∞ shows that |w| ≤ ‖w‖_∞ u.
4. If w_i ∈ \mathcal{A}_b solves w_i = ρ w_i + g for i = 1, 2, then w := w_2 − w_1 satisfies w = ρ w and therefore |w| ≤ ‖w‖_∞ u = 0.
5. Let U := \sum_{n=0}^∞ u_n = \sum_{n=0}^∞ ρ^n 1 : A → [0, ∞] and suppose U (x) < ∞ for all x ∈ A. Then u_n (x) → 0 as n → ∞ and so bounded solutions to ρ u = u are necessarily zero. Moreover we have, for all k ∈ N_0, that

ρ^k U = \sum_{n=0}^∞ ρ^k u_n = \sum_{n=0}^∞ u_{n+k} = \sum_{n=k}^∞ u_n ≤ U.   (8.47)

Since the tails of convergent series tend to zero it follows that lim_{k→∞} ρ^k U = 0. Now if g ∈ \mathcal{A}_b, we have

\sum_{n=0}^∞ |ρ^n g| ≤ \sum_{n=0}^∞ ρ^n |g| ≤ \sum_{n=0}^∞ ρ^n ‖g‖_∞ = ‖g‖_∞ · U < ∞   (8.48)

and therefore \sum_{n=0}^∞ ρ^n g is absolutely convergent. Making use of Eqs. (8.47) and (8.48) we see that

\sum_{n=1}^∞ ρ |ρ^n g| ≤ ‖g‖_∞ · ρ U ≤ ‖g‖_∞ U < ∞

and therefore (using DCT),

w = \sum_{n=0}^∞ ρ^n g = g + \sum_{n=1}^∞ ρ^n g = g + ρ \sum_{n=1}^∞ ρ^{n−1} g = g + ρ w,

i.e. w solves w = g + ρ w. If v : A → R is measurable such that |v| ≤ C U and v = g + ρ v, then y := w − v solves y = ρ y with |y| ≤ (C + ‖g‖_∞) U. It follows that

|y| = |ρ^n y| ≤ (C + ‖g‖_∞) ρ^n U → 0 as n → ∞,

i.e. 0 = y = w − v.
6. If g ≥ 0 we may always define w by Eq. (8.45), allowing for w (x) = ∞ for some or even all x ∈ A. As in the proof of the previous item (with DCT being replaced by MCT), it follows that w = ρ w + g. If v ≥ 0 also solves v = g + ρ v, then

v = g + ρ (g + ρ v) = g + ρ g + ρ^2 v

and more generally by induction we have

v = \sum_{k=0}^n ρ^k g + ρ^{n+1} v ≥ \sum_{k=0}^n ρ^k g.

Letting n → ∞ in this last equation shows that v ≥ w.
7. If u ≤ α < 1 on A, then by item 3. with w = u we find that

u ≤ ‖u‖_∞ · u ≤ α u

which clearly implies u = 0.
8. If u_n ≤ α < 1, then for any m ∈ N we have,

u_{n+m} = ρ^m u_n ≤ α ρ^m 1 = α u_m.

Taking m = kn in this inequality shows u_{(k+1)n} ≤ α u_{kn}. Thus a simple induction argument shows u_{kn} ≤ α^k for all k ∈ N_0. For general l ∈ N_0 we write l = kn + r with 0 ≤ r < n. We then have,

u_l = u_{kn+r} ≤ u_{kn} ≤ α^k = α^{(l−r)/n} ≤ C α^{l/n}

where C = α^{−(n−1)/n}.

Corollary 8.29. If h : B → [0, ∞] is measurable, then u (x) := E_x [h (X_{T_B}) : T_B < ∞] is the unique minimal non-negative solution to Eq. (8.38), while if g : A → [0, ∞] is measurable, then u (x) = E_x [ \sum_{n < T_B} g (X_n) ] is the unique minimal non-negative solution to Eq. (8.39).

Exercise 8.11. Keeping the notation of Exercises 8.8 and 8.10, use Corollary 8.29 to show again that P_x (T_0 < ∞) = (q/p)^x for all x > 0 and E_x T_0 = x / (q − p) for x < 0. You should do so without making use of the extraneous hitting times, T_n for n ≠ 0.

Solution to Exercise (8.11). From Eq. (8.28) of Exercise 8.8 we have seenfor x > 1 that

Px (T0 <∞) = a+ (1− a) (q/p)x


for some a ∈ [0, 1]. Since

d/da [a + (1 − a)(q/p)^x] = 1 − (q/p)^x > 0,

the right side will be smallest when a = 0 and therefore we may (Corollary 8.29) conclude that

P_x(T_0 < ∞) = (q/p)^x for all x > 0.

Similarly, from Eq. (8.34) of Exercise 8.10 we have seen that if E_x T_0 < ∞ for some and hence all x < 0, then

E_x T_0 = (q − p)^{-1} x + a[1 − (q/p)^x]

for some a ≤ 0. Since the right side of this equation is minimized by taking a = 0, we again have by Corollary 8.29 that

E_x T_0 = (q − p)^{-1} x for all x < 0.

Corollary 8.30. If P_x(T_B = ∞) = 0 for all x ∈ A and h : B → R is a bounded measurable function, then u(x) := E_x[h(X_{T_B})] is the unique solution to Eq. (8.38).

Corollary 8.31. Suppose now that A = B^c is a finite subset of S such that P_x(T_B = ∞) < 1 for all x ∈ A. Then there exists C < ∞ and β ∈ (0, 1) such that P_x(T_B > n) ≤ Cβ^n, and in particular E_x T_B < ∞ for all x ∈ A.

Proof. Let α_0 = max_{x∈A} P_x(T_B = ∞) < 1. We know that

lim_{n→∞} P_x(T_B > n) = P_x(T_B = ∞) ≤ α_0 for all x ∈ A.

Therefore if α ∈ (α_0, 1), using the fact that A is a finite set, there exists an n sufficiently large such that P_x(T_B > n) ≤ α for all x ∈ A. The result now follows from item 8. of Theorem 8.28.


9 Markov Chains in the Long Run (Results)

Throughout this chapter {X_n}_{n=0}^∞ will be a Markov chain on a discrete state space S with Markov kernel p : S × S → [0, 1], along with the following notation.

Notation 9.1 For i, j ∈ S we define the following quantities:

T_i := min{n ≥ 0 : X_n = i} – the first hitting time of i,   (9.1)
R_i := min{n ≥ 1 : X_n = i} – the first passage time of i,   (9.2)
M_i := ∑_{n=0}^∞ 1_{X_n = i} – the number of visits to i,   (9.3)
f^{(n)}_{i,i} := P_i(R_i = n),   (9.4)
f_{i,i} := ∑_{n=0}^∞ f^{(n)}_{i,i} = ∑_{n=0}^∞ P_i(R_i = n) = P_i(R_i < ∞),   (9.5)
m_i := E_i[R_i : R_i < ∞] = ∑_{n=0}^∞ n f^{(n)}_{i,i}, and   (9.6)
m_{i,j} := E_i R_j.   (9.7)
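As a concrete illustration of these quantities, here is a small Python sketch that estimates f_{i,i} and m_i by simulating the first passage time R_i; the 3-state matrix below is chosen only for illustration and is not one of the examples that follow.

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 3-state chain used only to illustrate the notation.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

def sample_R(i, nmax=10_000):
    """One sample of R_i = min{n >= 1 : X_n = i} for the chain started at i."""
    x = i
    for n in range(1, nmax):
        x = rng.choice(3, p=P[x])
        if x == i:
            return n
    return np.inf

samples = np.array([sample_R(0) for _ in range(20_000)], dtype=float)
print("estimate of f_{0,0} = P_0(R_0 < inf):", np.mean(np.isfinite(samples)))
print("estimate of m_0 = E_0[R_0 : R_0 < inf]:", samples[np.isfinite(samples)].mean())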

9.1 A Touch of Class

Definition 9.2. A state j is accessible from i (written i → j) iff P_i(T_j < ∞) > 0, and i ←→ j (i communicates with j) iff i → j and j → i. Notice that i → j iff there is a path i = x_0, x_1, . . . , x_n = j in S such that p(x_0, x_1) p(x_1, x_2) · · · p(x_{n−1}, x_n) > 0.

Definition 9.3. For each i ∈ S, let C_i := {j ∈ S : i ←→ j} be the communicating class of i. The state space, S, is partitioned into a disjoint union of its communicating classes. If there is only one communicating class we say that the chain is irreducible.

Definition 9.4. A communicating class C ⊂ S is closed provided the probability that X_n leaves C given that it started in C is zero. In other words, P_{ij} = 0 for all i ∈ C and j ∉ C.

The notion of being closed just introduced follows the usual mathematical conventions in that C is closed for a chain X iff X can not leave C if it starts in C. In particular, it makes sense to restrict a Markov chain to a closed communicating class. Mathematically this means that {p(x, y)}_{x,y∈C} forms a Markov transition matrix, i.e.

∑_{y∈C} p(x, y) = 1 for all x ∈ C.

Example 9.5. Consider the Markov chain with jump diagram given in Figure 9.1. In this example the communicating classes are {1, 2}, {3, 4}, and {5}, with the latter classes being closed. The class {1, 2} is not closed.

Fig. 9.1. A 5 state Markov chain with 3 communicating classes.
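On a computer, the communicating classes of Definition 9.3 can be found by viewing S as a directed graph with an edge i → j whenever p(i, j) > 0 and grouping mutually reachable states. A rough Python sketch (the matrix below is one way to encode the chain of Figure 9.1, using the transition probabilities written out later in Example 9.48):

import numpy as np

# States 1,...,5 of Figure 9.1 are indices 0,...,4 here.
P = np.array([[0,   1/2, 0,   0,   1/2],
              [1/2, 0,   0,   1/2, 0  ],
              [0,   0,   1/2, 1/2, 0  ],
              [0,   0,   1/3, 2/3, 0  ],
              [0,   0,   0,   0,   1  ]])
n = len(P)

# reach[i, j] = True iff j is accessible from i (transitive closure by squaring).
reach = (np.eye(n) + P) > 0
for _ in range(n):
    reach = (reach.astype(int) @ reach.astype(int)) > 0

classes = []
for i in range(n):
    C = frozenset(j for j in range(n) if reach[i, j] and reach[j, i])
    if C not in classes:
        classes.append(C)

for C in classes:
    closed = all(P[i, j] == 0 for i in C for j in range(n) if j not in C)
    print(sorted(s + 1 for s in C), "closed" if closed else "not closed")
# Output: [1, 2] not closed, [3, 4] closed, [5] closed, as in Example 9.5.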

Example 9.6. Let {X_n}_{n=0}^∞ denote the fair random walk on S = Z; this chain is irreducible. On the other hand, if {X_n}_{n=0}^∞ is the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state, then the communicating classes are {0} (closed) and {1, 2, . . .} (not closed).

Definition 9.7. For each i ∈ S, let d(i) be the greatest common divisor of {n ≥ 1 : P^n_{ii} > 0} with the convention that d(i) = 0 if P^n_{ii} = 0 for all n ≥ 1. We refer to d(i) as the period of i. We say a site i is aperiodic if d(i) = 1.


Example 9.8. Each site of the fair random walk on S = Z has period 2. For the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state, each i ≥ 1 has period 2 while 0 has period 1, i.e. 0 is aperiodic.

Theorem 9.9. The period function is constant on each communicating class of a Markov chain.

Proof. Let x, y ∈ C, a = d(x) and b = d(y). Now suppose that P^m_{xy} > 0 and P^n_{yx} > 0; then P^{m+n}_{x,x} ≥ P^m_{xy} P^n_{yx} > 0 and so a | (m + n). Further suppose that P^l_{y,y} > 0 for some l ∈ N. Then

P^{m+n+l}_{x,x} ≥ P^m_{xy} P^l_{y,y} P^n_{yx} > 0

and therefore a | (m + n + l), which coupled with a | (m + n) implies a | l. We may therefore conclude that a ≤ b (in fact a | b) as b = gcd{l ∈ N : P^l_{y,y} > 0}. Similarly we show that b ≤ a and therefore b = a.

Lemma 9.10. If d(i) is the period of site i, then

1. if m ∈ N and P^m_{i,i} > 0 then d(i) divides m,
2. P^{n d(i)}_{i,i} > 0 for all n ∈ N sufficiently large, and
3. i is aperiodic iff P^n_{i,i} > 0 for all n ∈ N sufficiently large.

In summary, A_i := {m ∈ N : P^m_{i,i} > 0} ⊂ d(i)N and d(i)n ∈ A_i for all n ∈ N sufficiently large.

Proof. Choose n_1, . . . , n_k ∈ {n ≥ 1 : P^n_{ii} > 0} such that d(i) = gcd(n_1, . . . , n_k). For part 1. we also know that d(i) = gcd(n_1, . . . , n_k, m) and therefore d(i) divides m. For part 2., if m_l ∈ N we have,

(P^{∑_{l=1}^k m_l n_l})_{i,i} ≥ ∏_{l=1}^k [P^{n_l}_{i,i}]^{m_l} > 0.

This observation along with the number theoretic Lemma 9.15 below is enough to show P^{n d(i)}_{i,i} > 0 for all n ∈ N sufficiently large. The third item is a special case of item 2.
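In practice one can estimate d(i) directly from the definition by collecting the return lengths {n ≤ N : P^n_{ii} > 0} for a moderately large N and taking their gcd. A small Python sketch, run on the 6-state chain written out in Example 9.12 below (the cutoff N = 60 is an arbitrary illustrative choice):

import numpy as np
from math import gcd
from functools import reduce

# The 6-state matrix of Example 9.12.
P = np.array([[0,   1, 0, 0, 0,   0],
              [0,   0, 1, 0, 0,   0],
              [0,   0, 0, 1, 0,   0],
              [1/2, 0, 0, 0, 1/2, 0],
              [0,   0, 0, 0, 0,   1],
              [1,   0, 0, 0, 0,   0]], dtype=float)

def period(P, i, N=60):
    """gcd of {n in [1, N] : P^n_{ii} > 0}; equals d(i) once N is large enough."""
    Pn = np.eye(len(P))
    return_lengths = []
    for n in range(1, N + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_lengths.append(n)
    return reduce(gcd, return_lengths, 0)   # convention: gcd of empty set is 0

print([period(P, i) for i in range(6)])     # every state has period 2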

Example 9.11. Suppose that

P =
[ 0  1 ]
[ 1  0 ].

Then P^m = P if m is odd and P^m = I if m is even. Therefore d(i) = 2 for i = 1, 2 and in this case P^{2n}_{i,i} = 1 > 0 for all n ∈ N. However, observe that P^2 is no longer irreducible – there are now two communicating classes.

[Figure: P jump diagram and P^2 jump diagram.]
Fig. 9.2. All arrows are assumed to have weight 1 unless otherwise specified. Notice that each state has period d = 2 and that P^2 is the transition matrix having two aperiodic communication classes.

Example 9.12. Consider the Markov chain with jump diagram given in Figure 9.2. In this example, d(i) = 2 for all i and all states for P^2 are aperiodic. However P^2 is no longer irreducible. This is an indication of what happens in general. In terms of matrices,

P =
    1    2    3    4    5    6
1 [ 0    1    0    0    0    0  ]
2 [ 0    0    1    0    0    0  ]
3 [ 0    0    0    1    0    0  ]
4 [ 1/2  0    0    0    1/2  0  ]
5 [ 0    0    0    0    0    1  ]
6 [ 1    0    0    0    0    0  ]

and

P^2 =
    1    2    3    4    5    6
1 [ 0    0    1    0    0    0  ]
2 [ 0    0    0    1    0    0  ]
3 [ 1/2  0    0    0    1/2  0  ]
4 [ 0    1/2  0    0    0    1/2]
5 [ 1    0    0    0    0    0  ]
6 [ 0    1    0    0    0    0  ].

Example 9.13. Consider the Markov chain with jump diagram given in Figure 9.3. Assume there are no implied jumps from a site back to itself, i.e. P_{i,i} = 0 for all i. This chain is then irreducible and has period 2. To calculate the period notice that starting at y there is an


[Figure: P jump diagram and P^2 jump diagram, with vertices x and y.]
Fig. 9.3. Assume there are no implied jumps from a site back to itself, i.e. P_{i,i} = 0 for all i. This chain is then irreducible and has period 2.

obvious loop of length 4, and starting at x there is one of length 6. Therefore the period must divide both 4 and 6 and so must be either 2 or 1. The period is not 1 as one can only return to a site with an even number of jumps in this picture. If on the other hand there were any one vertex, i, with P_{i,i} > 0, then the period of the chain would have been one, i.e. the chain would have been aperiodic. Further notice that the jump diagram for P^2 is no longer irreducible: the red vertices and the blue vertices split apart. This has to happen as a consequence of Proposition 9.14 below.

Proposition 9.14. If P is the Markov matrix for a finite state irreducible aperiodic chain, then there exists n_0 ∈ N such that P^n_{ij} > 0 for all i, j ∈ S and n ≥ n_0.

Proof. Let i, j ∈ S. By Lemma 9.10 with d(i) = 1 we know that P^m_{i,i} > 0 for all m large. As P is irreducible there exists a ∈ N such that P^a_{ij} > 0, and therefore P^{m+a}_{i,j} ≥ P^m_{i,i} P^a_{i,j} > 0 for all m sufficiently large. This shows that for all i, j ∈ S there exists n_{i,j} ∈ N such that P^n_{ij} > 0 for all n ≥ n_{i,j}. Since there are only finitely many pairs (i, j) we may now take n_0 := max{n_{i,j} : i, j ∈ S} < ∞.

9.1.1 A number theoretic lemma

Lemma 9.15 (A number theory lemma). Suppose that 1 is the greatest common divisor of a set of positive integers, Γ := {n_1, . . . , n_k}. Then there exists N ∈ N such that the set

A = {m_1 n_1 + · · · + m_k n_k : m_i ≥ 0 for all i}

contains all n ∈ N with n ≥ N. More generally, if q = gcd(Γ) (perhaps not 1), then A ⊂ qN and A contains all points qn for n sufficiently large.

Proof. First proof. The set I := {m_1 n_1 + · · · + m_k n_k : m_i ∈ Z for all i} is an ideal in Z and as Z is a principal ideal domain there is a q ∈ I with q > 0 such that I = qZ = {qm : m ∈ Z}. In fact q = min(I ∩ N). Since q ∈ I we know that q = m_1 n_1 + · · · + m_k n_k for some m_i ∈ Z, and so if l is a common divisor of n_1, . . . , n_k then l divides q. Moreover, as I = qZ and n_i ∈ I for all i, we know that q | n_i as well. This shows that q = gcd(n_1, n_2, . . . , n_k).

Now suppose that n ≫ n_1 + · · · + n_k is given and large (to be explained shortly). Write n = l(n_1 + · · · + n_k) + r with l ∈ N and 0 ≤ r < n_1 + · · · + n_k and therefore,

nq = ql(n_1 + · · · + n_k) + rq
   = ql(n_1 + · · · + n_k) + r(m_1 n_1 + · · · + m_k n_k)
   = (ql + rm_1) n_1 + · · · + (ql + rm_k) n_k

where

ql + rm_i ≥ ql − (n_1 + · · · + n_k)|m_i|,

which is greater than 0 for l, and hence n, sufficiently large.

Second proof. (The following proof is from Durrett [1].) We first will show that A contains two consecutive positive integers, a and a + 1. To prove this let

k := min{|b − a| : a, b ∈ A with a ≠ b}

and choose a, b ∈ A with b = a + k. If k > 1, there exists n ∈ Γ ⊂ A such that k does not divide n. Let us write n = mk + r with m ≥ 0 and 1 ≤ r < k. It then follows that (m + 1)b and (m + 1)a + n are in A,

(m + 1)b = (m + 1)(a + k) > (m + 1)a + mk + r = (m + 1)a + n,

and

(m + 1)b − [(m + 1)a + n] = k − r < k.

This contradicts the definition of k and therefore k = 1. Let N = a^2. If n ≥ N, then n − a^2 = ma + r for some m ≥ 0 and 0 ≤ r < a. Therefore,

n = a^2 + ma + r = (a + m)a + r = (a + m − r)a + r(a + 1) ∈ A.

9.2 Transience and Recurrence Classes

Definition 9.16 (First return time). For any x ∈ S, let R_x := min{n ≥ 1 : X_n = x} where the minimum of the empty set is defined to be ∞.


On the event {X_0 ≠ x} we have R_x = T_x := min{n ≥ 0 : X_n = x} – the first hitting time of x. So R_x is really manufactured for the case where X_0 = x, in which case T_x = 0 while R_x is the first return time to x.

Definition 9.17. A state i ∈ S is:

1. transient if P_i(R_i < ∞) < 1 (⇐⇒ P_i(R_i = ∞) > 0),
2. recurrent if P_i(R_i < ∞) = 1 (⇐⇒ P_i(R_i = ∞) = 0),
   a) positive recurrent if 1/(E_i R_i) > 0, i.e. E_i R_i < ∞,
   b) null recurrent if it is recurrent (P_i(R_i < ∞) = 1) and 1/(E_i R_i) = 0, i.e. E_i R_i = ∞.

We let S_t, S_r, S_pr, and S_nr be the transient, recurrent, positive recurrent, and null recurrent states respectively.

Theorem 9.18 (Class properties). Each of the conditions on a state i ∈ S in Definition 9.17 is a class property. More explicitly, if i, j ∈ S communicate (i ←→ j), then i is transient, positive recurrent, or null recurrent iff j has the same property.

Lemma 9.19 (Recurrent classes are closed). Let C ⊂ S be a communicating class. Then

C not closed =⇒ C is transient,

or equivalently put,

C is recurrent =⇒ C is closed.

Proof. If C is not closed and i ∈ C, there is a j ∉ C such that i → j, i.e. there is a path i = x_0, x_1, . . . , x_n = j with all of the {x_l}_{l=0}^n being distinct such that

P_i(X_0 = i, X_1 = x_1, . . . , X_{n−1} = x_{n−1}, X_n = x_n = j) > 0.

Since j ∉ C we must have j ↛ C and therefore, on the event

A := {X_0 = i, X_1 = x_1, . . . , X_{n−1} = x_{n−1}, X_n = x_n = j},

X_m ∉ C a.s. for all m ≥ n and therefore R_i = ∞ a.s. on the event A, which has positive probability, i.e. P_i(R_i = ∞) ≥ P_i(A) > 0.

Proposition 9.20 (Return time estimates). Let x ∈ S and π : S → [0, 1] be a probability on S.

1. If there exists α < 1 such that P_y(T_x = ∞) ≤ α for all y ∈ S, then P_x(R_x = ∞) = 0.
2. If there exists α < 1 and n ∈ N such that P_y(T_x > n) ≤ α for all y ∈ S, then

E_π[R_x] ≤ 1 + n/(1 − α) < ∞.

Proof. 1. By Corollary 8.5 our hypothesis guarantees P_y(T_x = ∞) = 0 for all y ∈ S. Hence, using the first step analysis we find,

P_π(R_x = ∞) = ∑_{y∈S∖{x}} p(x, y) P_π(R_x = ∞ | X_1 = y) = ∑_{y∈S∖{x}} p(x, y) P_y(T_x = ∞) = 0,

wherein we have used R_x = 1 if X_1 = x.

2. By Corollary 8.6, our hypothesis guarantees E_y[T_x] ≤ n/(1 − α) for all y ∈ S. Hence, using the first step analysis we find,

E_π(R_x) = p(x, x) + ∑_{y∈S∖{x}} p(x, y) E_π(R_x | X_1 = y)
         = p(x, x) + ∑_{y∈S∖{x}} p(x, y)[1 + E_y T_x] ≤ 1 + n/(1 − α) < ∞.

The last item now follows using the exact same techniques as in the proof of Corollary 8.7.

Corollary 9.21. If C ⊂ S is a finite closed communicating class, then C is positively recurrent and in fact E_y R_x < ∞ for any x, y ∈ C.

Proof. Since C is closed we may restrict our Markov chain to C, and since C is a communicating class we know that P_y(T_x = ∞) < 1 for all x, y ∈ C. Because C is finite it follows that max_{x,y∈C} P_y(T_x > n) = α < 1 for some n sufficiently large – see the proof of Corollary 8.7. Therefore Proposition 9.20 applies to show E_y R_x < ∞ for all x, y ∈ C.

Corollary 9.22. Suppose #(S) < ∞ and C is a communicating class in S. If C is closed then every x ∈ C is positively recurrent, and if C is not closed then every x ∈ C is transient. We will refer to the class C as being (positively) recurrent or transient respectively. We also have the equivalence of the following statements:

1. C is closed.
2. C is positive recurrent.
3. C is recurrent.

Proposition 9.23. In particular, if #(S) < ∞, then the recurrent (= positively recurrent) states are precisely the union of the closed communicating classes and the transient states are what is left over.


Proof. This is a simple combination of the results in Lemma 9.19 and Corollary 9.21. See Corollary 9.45 for another proof.

Example 9.24. Let P be the Markov matrix with jump diagram given in Figure 9.1 above and repeated below in Figure 9.4. As we saw in Example 9.5 above, the communicating classes are {1, 2}, {3, 4}, {5}. The latter two are closed and hence positively recurrent while {1, 2} is transient. Each of the classes is aperiodic since P_{i,i} > 0 for all i = 1, 2, 3, 4, 5 in this example.

Fig. 9.4. A 5 state Markov chain with 3 communicating classes.

Proposition 9.25. Suppose that C ⊂ S is a finite communicating class and let T = inf{n ≥ 0 : X_n ∉ C} be the first exit time from C. If C is not closed, then not only is C transient but E_i T < ∞ for all i ∈ C and in particular

E_j[M_i] ≤ E_j T < ∞ for all i, j ∈ C.

Proof. These results follow from Corollary 8.6 and the fact that

T = ∑_{i∈C} M_i.

Warning: when #(S) = ∞ or, more importantly, #(C) = ∞, life is not so simple.

Remark 9.26. Let {X_n}_{n=0}^∞ denote the fair random walk on {0, 1, 2, . . .} with 0 being an absorbing state. The communicating classes are now {0} and {1, 2, . . .}, with the latter class not being closed and hence transient. Using Exercise 8.6 or Exercise 8.7, it follows that E_i T = ∞ for all i > 0, which shows we can not drop the assumption that #(C) < ∞ in the first statement in Proposition 9.25. Similarly, using the fair random walk example, we see that it is not possible to drop the condition that #(C) < ∞ for the equivalence statements as well.

The next examples show that if C ⊂ S is closed and #(C) = ∞, then C could be recurrent or it could be transient. Transient in this case means the chain goes off to “infinity,” i.e. eventually leaves every finite subset of C never to return again.

Example 9.27. Let S = Z and X = {X_n} be the standard fair random walk on Z, i.e. P(X_{n+1} = x ± 1 | X_n = x) = 1/2. Then S itself is a closed class and every element of S is (null) recurrent. Indeed, using Exercise 8.4 or Exercise 8.5 and the first step analysis we know that

P_0[R_0 = ∞] = (1/2)(P_0[R_0 = ∞ | X_1 = 1] + P_0[R_0 = ∞ | X_1 = −1])
             = (1/2)(P_1[T_0 = ∞] + P_{−1}[T_0 = ∞]) = (1/2)(0 + 0) = 0.

This shows 0 is recurrent. Similarly, using Exercise 8.6 or Exercise 8.7 and the first step analysis we find,

E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1])
         = (1/2)(1 + E_1[T_0] + 1 + E_{−1}[T_0]) = (1/2)(∞ + ∞) = ∞

and so 0 is null recurrent. As this chain is invariant under translation it follows that every x ∈ Z is a null recurrent site.

Example 9.28. Let S = Z and X = {X_n} be a biased random walk on Z, i.e. P(X_{n+1} = x + 1 | X_n = x) = p and P(X_{n+1} = x − 1 | X_n = x) = q := 1 − p with p > 1/2. Then every site is now transient. Recall from Exercises 8.8 and 8.9 (see Eq. (8.31)) that

P_x(T_0 < ∞) = (q/p)^x if x ≥ 0 and P_x(T_0 < ∞) = 1 if x < 0.   (9.8)

Using these results and the first step analysis implies,

P_0[R_0 = ∞] = p P_0[R_0 = ∞ | X_1 = 1] + q P_0[R_0 = ∞ | X_1 = −1]
             = p P_1[T_0 = ∞] + q P_{−1}[T_0 = ∞]
             = p[1 − (q/p)^1] + q(1 − 1)
             = p − q = 2p − 1 > 0.


Fig. 9.5. A positively recurrent Markov chain.

Example 9.29. Again let S = Z and p ∈ (1/2, 1), and suppose that {X_n} is the random walk on Z described by the jump diagram in Figure 9.5. In this case, using the results of Exercise 8.10 we learn that

E_0[R_0] = (1/2)(E_0[R_0 | X_1 = 1] + E_0[R_0 | X_1 = −1])
         = (1/2)(1 + E_1[T_0] + 1 + E_{−1}[T_0])
         = 1 + (1/2)(1/(p − q) + 1/(p − q)) = 1 + 1/(p − q) = 2p/(2p − 1) < ∞.

This shows the site 0 is positively recurrent. Thus, according to Theorem 9.18, every site in Z is positively recurrent. (Notice that E_0[R_0] → ∞ as p ↓ 1/2, i.e. as the chain becomes closer to the unbiased random walk of Example 9.27.)

Theorem 9.30 (Recurrence Conditions). Let j ∈ S. Then the following are equivalent:

1. j is recurrent, i.e. P_j(R_j < ∞) = 1,
2. P_j(X_n = j i.o. n) = 1,
3. E_j M_j = ∑_{n=0}^∞ P^n_{jj} = ∞.

Moreover, if C ⊂ S is a recurrent communicating class, then P_i(∩_{j∈C}{X_n = j i.o. n}) = 1 for all i ∈ C. In words, if we start in C then every state in C is visited an infinite number of times.

Theorem 9.31 (Transient States). Let j ∈ S. Then the following are equivalent:

1. j is transient, i.e. P_j(R_j < ∞) < 1,
2. P_j(X_n = j i.o. n) = 0, and
3. E_j M_j = ∑_{n=0}^∞ P^n_{jj} < ∞.

More generally, if ν : S → [0, 1] is any probability and j ∈ S is transient, then

E_ν M_j = ∑_{n=0}^∞ P_ν(X_n = j) < ∞ =⇒ lim_{n→∞} P_ν(X_n = j) = 0 and P_ν(X_n = j i.o. n) = 0.   (9.9)

Example 9.32. Let us revisit the fair random walk on Z described before Exercise 8.4. In this case P_0(X_n = 0) = 0 if n is odd and

P_0(X_{2n} = 0) = (2n choose n) (1/2)^{2n} = [(2n)!/(n!)^2] (1/2)^{2n}.

Making use of Stirling’s formula, n! ∼ √(2π) n^{n+1/2} e^{−n}, we find,

(1/2)^{2n} (2n)!/(n!)^2 ∼ (1/2)^{2n} [√(2π) (2n)^{2n+1/2} e^{−2n}] / [2π n^{2n+1} e^{−2n}] = √(1/π) · 1/√n

and therefore,

∑_{n=0}^∞ P_0(X_n = 0) = ∑_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + ∑_{n=1}^∞ √(1/π) · 1/√n = ∞,

which shows again that this walk is recurrent.
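The Stirling estimate is easy to check numerically. The following rough Python sketch uses the recursion P_0(X_{2n} = 0) = P_0(X_{2(n−1)} = 0) · (2n − 1)/(2n) (a convenience for the computation, not a formula from the notes) to compare the return probabilities with 1/√(πn) and to watch the partial sums of the series diverge.

from math import pi, sqrt

# p_n = P_0(X_{2n} = 0) = C(2n, n)/4^n, with p_0 = 1 and p_n = p_{n-1}*(2n-1)/(2n).
p, partial = 1.0, 1.0
for n in range(1, 100_001):
    p *= (2 * n - 1) / (2 * n)
    partial += p
    if n in (10, 100, 1000, 10_000, 100_000):
        print(f"n={n:6d}  p_n={p:.6f}  1/sqrt(pi n)={1/sqrt(pi*n):.6f}  partial sum={partial:.2f}")
# The partial sums keep growing (roughly like 2*sqrt(n/pi)), illustrating recurrence.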

Example 9.33. The above method may easily be modified to show that the biased random walk on Z (see Exercise 8.8) is transient. In this case 1/2 < p < 1 and

P_0(X_{2n} = 0) = (2n choose n) p^n (1 − p)^n = (2n choose n) [p(1 − p)]^n.

Since p(1 − p) has a maximum at p = 1/2 of 1/4, we have ρ_p := 4p(1 − p) < 1 for 1/2 < p < 1. Therefore,

P_0(X_{2n} = 0) = (2n choose n) [ρ_p · 1/4]^n = (2n choose n) (1/2)^{2n} ρ_p^n ∼ √(1/π) · (1/√n) ρ_p^n.

Hence

∑_{n=0}^∞ P_0(X_n = 0) = ∑_{n=0}^∞ P_0(X_{2n} = 0) ∼ 1 + ∑_{n=1}^∞ (1/√n) ρ_p^n ≤ 1 + 1/(1 − ρ_p) < ∞,

which again shows the biased random walk is transient.

9.2.1 Transience and Recurrence for R.W.s by Fourier Series Methods (optional reading)

In the next result we will give another way to compute (or at least estimate) E_x M_y for random walks and thereby determine if the walk is transient or recurrent.


Theorem 9.34. Let {ξ_i}_{i=1}^∞ be i.i.d. random vectors with values in Z^d and for x ∈ Z^d let X_n := x + ξ_1 + · · · + ξ_n for all n ≥ 1 with X_0 = x. As usual, let M_z := ∑_{n=0}^∞ 1_{X_n = z} denote the number of visits to z ∈ Z^d. Then

E M_z = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} e^{i(x−z)·θ} / (1 − α E[e^{iξ_1·θ}]) dθ,

where dθ = dθ_1 · · · dθ_d, and in particular,

E M_x = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} 1 / (1 − α E[e^{iξ_1·θ}]) dθ.   (9.10)

Proof. For 0 < α ≤ 1 let

M_y^{(α)} := ∑_{n=0}^∞ α^n 1_{X_n = y}

so that M_y^{(1)} = M_y. Given any θ ∈ [−π, π]^d we have,

∑_{y∈Z^d} E M_y^{(α)} e^{iy·θ} = E ∑_{y∈Z^d} M_y^{(α)} e^{iy·θ} = E ∑_{y∈Z^d} ∑_{n=0}^∞ α^n 1_{X_n = y} e^{iy·θ}
  = E ∑_{n=0}^∞ α^n ∑_{y∈Z^d} 1_{X_n = y} e^{iy·θ} = E ∑_{n=0}^∞ α^n e^{iθ·X_n} = ∑_{n=0}^∞ α^n E[e^{iθ·X_n}]

where

E[e^{iθ·X_n}] = e^{iθ·x} E[e^{iθ·(ξ_1 + ··· + ξ_n)}] = e^{iθ·x} E ∏_{j=1}^n e^{iθ·ξ_j} = e^{iθ·x} ∏_{j=1}^n E e^{iθ·ξ_j} = e^{iθ·x} · (E[e^{iξ_1·θ}])^n.

Combining the last two equations shows,

∑_{y∈Z^d} E M_y^{(α)} e^{iy·θ} = e^{iθ·x} / (1 − α E[e^{iξ_1·θ}]).

Multiplying this equation by e^{−iz·θ} for some z ∈ Z^d we find, using the orthogonality of {e^{iy·θ}}_{y∈Z^d}, that

E M_z^{(α)} = (1/2π)^d ∫_{[−π,π]^d} ∑_{y∈Z^d} E M_y^{(α)} e^{iy·θ} e^{−iz·θ} dθ = (1/2π)^d ∫_{[−π,π]^d} e^{iθ·(x−z)} / (1 − α E[e^{iξ_1·θ}]) dθ.

Since M_z^{(α)} ↑ M_z as α ↑ 1, the result is now a consequence of the monotone convergence theorem.

Example 9.35. Suppose that P(ξ_i = 1) = p and P(ξ_i = −1) = q := 1 − p and x = 0. Then

E[e^{iξ_1·θ}] = p e^{iθ} + q e^{−iθ} = p(cos θ + i sin θ) + q(cos θ − i sin θ) = cos θ + i(p − q) sin θ.

Therefore, according to Eq. (9.10) we have,

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^{π} 1 / (1 − α(cos θ + i(p − q) sin θ)) dθ.   (9.11)

(We could compute the integral in Eq. (9.11) exactly using complex contour integral methods but I will not do this here.)

The integrand in Eq. (9.11) may be written as,

[1 − α cos θ + iα(p − q) sin θ] / [(1 − α cos θ)^2 + α^2 (p − q)^2 sin^2 θ].

As sin θ is odd while the denominator is now even, we find,

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^{π} (1 − α cos θ) / [(1 − α cos θ)^2 + α^2 (p − q)^2 sin^2 θ] dθ.   (9.12)

a) Let us first suppose that p = 1/2 = q, in which case the above equation reduces to

E_0 M_0 = lim_{α↑1} (1/2π) ∫_{−π}^{π} 1/(1 − α cos θ) dθ = (1/2π) ∫_{−π}^{π} 1/(1 − cos θ) dθ,

wherein we have used the MCT (for θ ∼ 0) and DCT (for θ away from 0) to justify passing the limit inside of the integral. Since 1 − cos θ ≈ θ^2/2 for θ near zero and ∫_{−ε}^{ε} (1/θ^2) dθ = ∞, it follows that E_0 M_0 = ∞ and the fair random walk on Z is recurrent.
b) Now suppose that p ≠ 1/2 and let us write α := 1 − ε for some ε which we will eventually let tend down to zero. With this notation the integrand f_α(θ) in Eq. (9.12) satisfies,


f_α(θ) = [1 − cos θ + ε cos θ] / [(1 − cos θ + ε cos θ)^2 + (1 − ε)^2 (p − q)^2 sin^2 θ]
       = (1 − cos θ) / [(1 − cos θ + ε cos θ)^2 + (1 − ε)^2 (p − q)^2 sin^2 θ]
         + ε cos θ / [(1 − cos θ + ε cos θ)^2 + (1 − ε)^2 (p − q)^2 sin^2 θ]
       ≤ (1 − cos θ) / [(1 − ε)^2 (p − q)^2 sin^2 θ] + ε cos θ / [ε^2 cos^2 θ + (1 − ε)^2 (p − q)^2 sin^2 θ].

The first term is bounded in θ and ε because

lim_{θ↓0} (1 − cos θ)/sin^2 θ = 1/2

and therefore only makes a finite contribution to the integral. Integrating the second term near zero and making the change of variables u = sin θ (so du = cos θ dθ) shows,

∫_{−δ}^{δ} ε cos θ / [ε^2 cos^2 θ + (1 − ε)^2 (p − q)^2 sin^2 θ] dθ
  = ∫_{−sin δ}^{sin δ} ε / [ε^2 (1 − u^2) + (1 − ε)^2 (p − q)^2 u^2] du
  ≤ ∫_{−sin δ}^{sin δ} ε / [ε^2 (1/2) + (1/2)(p − q)^2 u^2] du
  = 2 ∫_{−sin δ}^{sin δ} ε / [ε^2 + (p − q)^2 u^2] du,

provided δ is sufficiently small but fixed and ε is small. Lastly we make the change of variables u = εx/|p − q| in order to find

∫_{−δ}^{δ} ε cos θ / [ε^2 cos^2 θ + (1 − ε)^2 (p − q)^2 sin^2 θ] dθ ≤ (4/|p − q|) ∫_0^{sin(δ)|p−q|/ε} 1/(1 + x^2) dx ↑ 2π/|p − q| < ∞ as ε ↓ 0.

Combining these estimates shows the limit in Eq. (9.12) is finite, so the random walk is transient when p ≠ 1/2.

Example 9.36 (Unbiased R.W. in Z^d). Now suppose that P(ξ_i = ±e_j) = 1/(2d) for j = 1, 2, . . . , d and X_n = ξ_1 + · · · + ξ_n. In this case,

E[e^{iθ·ξ_1}] = (1/d)[cos(θ_1) + · · · + cos(θ_d)]

and so according to Eq. (9.10) we find (as before)

E M_0 = lim_{α↑1} (1/2π)^d ∫_{[−π,π]^d} dθ / (1 − α (1/d)[cos(θ_1) + · · · + cos(θ_d)])
      = (1/2π)^d ∫_{[−π,π]^d} dθ / (1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)])   (by MCT and DCT).

Again the integrand is singular near θ = 0, where

1 − (1/d)[cos(θ_1) + · · · + cos(θ_d)] ≅ 1 − (1/d)[d − (1/2)‖θ‖^2] = ‖θ‖^2/(2d).

Hence it follows that E M_0 < ∞ iff ∫_{‖θ‖≤R} (1/‖θ‖^2) dθ < ∞ for R < ∞. The last integral is well known to be finite iff d ≥ 3, as can be seen by computing in polar coordinates. For example, when d = 2 we have

∫_{‖θ‖≤R} (1/‖θ‖^2) dθ = 2π ∫_0^R (1/r^2) r dr = 2π ln r |_0^R = 2π(ln R − ln 0) = ∞,

while when d = 3,

∫_{‖θ‖≤R} (1/‖θ‖^2) dθ = 4π ∫_0^R (1/r^2) r^2 dr = 4πR < ∞.

In this way we have shown that the unbiased random walk in Z and Z^2 is recurrent while it is transient in Z^d for d ≥ 3.

9.3 Invariant / Stationary (sub) distributions

Example 9.37 (Example 9.24 revisited). As a warm-up, let us again consider the Markov chain whose jump diagram is given in Figure 9.1. Let us further suppose that we start the chain at 1. We would like to compute lim_{n→∞} P_1(X_n = j) for j = 1, 2, . . . , 5. Let B = {3, 4, 5}. Since there is a positive chance of hitting B from either 1 or 2 we know that E_i T_B < ∞ for i = 1, 2. Let h_i = P_i(X_{T_B} = 5) for i = 1, 2. Then h_5 = 1, h_3 = h_4 = 0 and the first step analysis shows,

h_1 = (1/2) h_5 + (1/2) h_2 = 1/2 + (1/2) h_2,
h_2 = (1/2) h_1 + (1/2) h_4 = (1/2) h_1,

and therefore h_1 = 1/2 + (1/4) h_1, i.e. P_1(X_{T_B} = 5) = h_1 = 2/3. With this information in hand we may now conclude


Fig. 9.6. A 5 state Markov chain – sites 1 and 2 are transient and 5 is absorbing.

lim_{n→∞} P_1(X_n = 1) = 0 = lim_{n→∞} P_1(X_n = 2),
lim_{n→∞} P_1(X_n = 5) = 2/3, and lim_{n→∞} P_1(X_n ∈ {3, 4}) = 1/3.

The question now becomes how the chain distributes itself within the closed class {3, 4}. We are now going to address this issue.

Throughout this chapter {X_n}_{n=0}^∞ will be a Markov chain on a discrete state space S with Markov kernel p : S × S → [0, 1] and corresponding matrix P. If π_j := lim_{n→∞} P_ν(X_n = j) exists, then

π_j = lim_{n→∞} P_ν(X_{n+1} = j) = lim_{n→∞} ∑_{k∈S} P_ν(X_{n+1} = j | X_n = k) P_ν(X_n = k)
    = lim_{n→∞} ∑_{k∈S} P_ν(X_n = k) P_{kj} ?= ∑_{k∈S} lim_{n→∞} P_ν(X_n = k) P_{kj} = ∑_{k∈S} π_k P_{kj}.

Thus we expect that any “limiting distribution” should also be a “stationary” or “invariant” distribution.

Definition 9.38. A function π : S → [0, 1] is a sub-probability if ∑_{j∈S} π(j) ≤ 1. We call π(S) := ∑_{j∈S} π(j) the mass of π. So a probability is a sub-probability with mass one.

Definition 9.39. We say a sub-probability π : S → [0, 1] is invariant or stationary relative to P if πP = π, i.e.

∑_{i∈S} π(i) p_{ij} = π(j) for all j ∈ S.   (9.13)

An invariant probability, π : S → [0, 1], is called an invariant distribution.

Lemma 9.40. A probability π : S → [0, 1] is an invariant distribution for P iff

P_π(X_n = j) = π(j) for all j ∈ S and n ∈ N.

Proof. A simple induction argument shows that π = πP implies that π = πP^n for all n ∈ N. This remark along with the following identity completes the proof;

P_π(X_n = j) = ∑_{i∈S} π(i) P_i(X_n = j) = ∑_{i∈S} π(i) P^n_{ij} = (πP^n)_j.

Example 9.41. Suppose that S = {1, 2, 3} and

P =
    1    2    3
1 [ 0    1    0   ]
2 [ 1/2  0    1/2 ]
3 [ 1    0    0   ]

has the jump graph given by Figure 9.7. Notice that P^2_{11} > 0 and P^3_{11} > 0, so that P is “aperiodic.”

Fig. 9.7. A simple 3 state jump diagram.

We now find the invariant distribution:

Nul(P − I)^{tr} = Nul
[ −1   1/2   1  ]
[  1   −1    0  ]
[  0   1/2  −1  ]
= R · (2, 2, 1)^{tr}.

Therefore the invariant distribution is given by

π = (1/5) [2  2  1] = [0.4  0.4  0.2].

Let us now observe that


P^2 =
[ 1/2  0    1/2 ]
[ 1/2  1/2  0   ]
[ 0    1    0   ],

P^3 =
[ 1/2  1/2  0   ]
[ 1/4  1/2  1/4 ]
[ 1/2  0    1/2 ],

and

P^20 =
[ 409/1024  205/512   205/1024 ]
[ 205/512   409/1024  205/1024 ]
[ 205/512   205/512   51/256   ]
≈
[ 0.39941  0.40039  0.20020 ]
[ 0.40039  0.39941  0.20020 ]
[ 0.40039  0.40039  0.19922 ],

and so it certainly appears that lim_{n→∞} P^n_{ij} = π_j independent of i.
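Both computations in this example are easy to reproduce with a few lines of NumPy. The sketch below is one standard way to do it (find π as a suitably normalized eigenvector of P^tr for the eigenvalue 1, then compare with a high matrix power); it is an added illustration rather than the computation in the notes.

import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

# Invariant distribution as the normalized eigenvector of P^tr with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print("pi =", pi)                           # [0.4, 0.4, 0.2]

# Convergence of the rows of P^n to pi.
print("P^20 =")
print(np.linalg.matrix_power(P, 20))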

Example 9.42. Suppose that {X_n}_{n=0}^∞ is the fair random walk on S = Z, so that P(X_{n+1} = x ± 1 | X_n = x) = 1/2 for all x ∈ S and n ∈ N_0. This chain has no stationary distribution. To see this, suppose π : S → [0, 1] were to exist; then by definition,

π(y) = ∑_{x∈S} π(x) p(x, y) = (1/2)[π(y − 1) + π(y + 1)].

From Exercise 8.3 we know that the general solution to this equation is of the form

π(x) = a + bx for some a, b ∈ R.

In order for π(x) ≥ 0 for all x we must have b = 0 and a ≥ 0. However, there is no choice of a such that ∑_{x∈S} π(x) = 1. We will see explicitly in the next chapter that when #(S) < ∞, every chain will have at least one stationary distribution.

Exercise 9.1 (2 - step M, see 7.2). Consider the following simple (i.e. no-brainer) two state “game” consisting of moving between two sites labeled 1 and 2. At each site you find a coin with sides labeled 1 and 2. The probability of flipping a 2 at site 1 is a ∈ [0, 1] and of flipping a 1 at site 2 is b ∈ [0, 1]. We assume that 0 < a + b < 2, i.e. it is not the case that both a and b are zero or that both are 1. If you are at site i at time n, then you flip the coin at this site and move or stay at the current site as indicated by the coin toss. We summarize this scheme by the “jump diagram” of Figure 9.8.

Fig. 9.8. The generic jump diagram for a two state Markov chain.

It is reasonable to suppose that your location, X_n, at time n is modeled by a Markov process with state space S = {1, 2}. Explain (briefly) why this is a time homogeneous chain and find the one step transition probabilities,

p(i, j) = P(X_{n+1} = j | X_n = i) for i, j ∈ S.

Use your result and basic linear (matrix) algebra to compute lim_{n→∞} P(X_n = 1). Your answer should be independent of the possible starting distributions, π = (π_1, π_2), for X_0, where π_i := P(X_0 = i).

Solution to Exercise (7.2). The Markov matrix for this chain is

P =
[ 1−a   a  ]
[  b   1−b ].

If P(X_0 = i) = ν_i for i = 1, 2, then

P(X_n = j) = ∑_{k=1}^2 ν_k P^n_{k,j} = [νP^n]_j,

where we now write ν = (ν_1, ν_2) as a row vector. A simple computation shows that

det(P^{tr} − λI) = det(P − λI) = λ^2 + (a + b − 2)λ + (1 − b − a) = (λ − 1)(λ − (1 − a − b)).

For any Markov matrix, ∑_j P_{ij} = 1 and therefore P1 = 1, where 1 is the column vector with all entries being 1. Thus we always know that λ_1 = 1 is an eigenvalue of P. The second eigenvalue is λ_2 = 1 − a − b. We now find the eigenvectors of P^{tr}:

Nul(P^{tr} − λ_1 I) = Nul
[ −a   b ]
[  a  −b ]
= R · (b, a)^{tr},

and so the invariant distribution is

π = (1/(a + b)) [b  a].

Similarly we have

Nul(P^{tr} − λ_2 I) = Nul
[ b  b ]
[ a  a ]
= R · (1, −1)^{tr}.

Writing ν = απ + β(1, −1),


we find

1 = ν · (1, 1) = απ · (1, 1) = α, and
ν · (a, −b) = απ · (a, −b) + β(1, −1) · (a, −b) = β(a + b) =⇒ β = (1/(a + b)) ν · (a, −b) = (ν_1 a − ν_2 b)/(a + b).

At any rate we have ν = π + β(1, −1), and therefore,

νP^n = πP^n + β(1, −1)P^n = π + λ_2^n β(1, −1).

Taking ν = (1, 0) and (0, 1) in this expression shows,

P^n =
[ π ]
[ π ]
+ (λ_2^n/(a + b))
[  a  −a ]
[ −b   b ].

By our assumption that 0 < a + b < 2 we have |λ_2| < 1 and therefore

lim_{n→∞} νP^n = π,

and we have shown

lim_{n→∞} P(X_n = 1) = π_1 = b/(a + b) and lim_{n→∞} P(X_n = 2) = π_2 = a/(a + b),

independent of the starting distribution ν. Also observe that the convergence is exponentially fast. For the two degenerate cases not considered here see Examples 10.8 and 10.9 below.

If we let h_i = E_i R_1 we have,

h_1 = (1 − a) · 1 + a(1 + h_2) = a h_2 + 1,
h_2 = b · 1 + (1 − b)(1 + h_2) = (1 − b) h_2 + 1,

and therefore

h_2 = 1/b and h_1 = a/b + 1 = (a + b)/b.

Notice that E_1 R_1 = 1/π_1. One similarly shows that E_2 R_2 = 1/π_2. (Notice that E_2 R_1 = 1/b is easy to understand as 1/b is the mean of the geometric random variable R_1 under P_2 – we are waiting for the first time we jump to 1, which happens at each time with probability b.)
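The relation E_i R_i = 1/π_i observed here is easy to confirm by simulation. A rough Python sketch (the values a = 0.3, b = 0.2 below are illustrative choices, not from the exercise):

import numpy as np

rng = np.random.default_rng(2)
a, b = 0.3, 0.2                            # illustrative flip probabilities
P = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b, a]) / (a + b)

def sample_R(i):
    """One sample of the first return time R_i for the chain started at i."""
    x, n = i, 0
    while True:
        x = rng.choice(2, p=P[x])
        n += 1
        if x == i:
            return n

for i in (0, 1):
    est = np.mean([sample_R(i) for _ in range(50_000)])
    print(f"E_{i+1} R_{i+1} ~ {est:.3f}   vs   1/pi_{i+1} = {1/pi[i]:.3f}")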

9.4 The basic limit theorems

Example 9.43 (Example 9.41 revisited). Recall from Example 9.41 that S = {1, 2, 3} and

P =
    1    2    3
1 [ 0    1    0   ]
2 [ 1/2  0    1/2 ]
3 [ 1    0    0   ]

has the jump graph given by Figure 9.9. We saw that the stationary distribution for this chain was given by

Fig. 9.9. A simple 3 state jump diagram.

π = (1/5) [2  2  1].

We are now going to compute E_i R_j in this example using the first step analysis. Let h_i := E_i R_1; then

h_1 = 1 + h_2,
h_2 = (1/2) · 1 + (1/2)(1 + h_3),
h_3 = 1,

which have solutions [h_1 = 5/2, h_2 = 3/2, h_3 = 1], and in particular

m_1 := E_1 R_1 = 5/2 = 1/π_1.

If h_i = E_i R_2 we find,

h_1 = 1,
h_2 = (1/2)(1 + h_1) + (1/2)(1 + h_3),
h_3 = 1 + h_1,


which have solutions [h_1 = 1, h_2 = 5/2, h_3 = 2], and in particular

m_2 := E_2 R_2 = 5/2 = 1/π_2.

Similarly, if h_i = E_i R_3 we find,

h_1 = 1 + h_2,
h_2 = (1/2) · 1 + (1/2)(1 + h_1),
h_3 = 1 + h_1,

which have solutions [h_1 = 4, h_2 = 3, h_3 = 5], and in particular

m_3 := E_3 R_3 = 5 = 1/π_3.

The results of this example hold in general as is seen in the next theorems – also see Proposition 10.6 below.
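The three small linear systems above can be solved in one stroke: for a target state j, solve (I − Q)k = 1 for the expected hitting times k_i = E_i T_j (Q being P with row and column j removed) and then use one more first step, E_j R_j = 1 + ∑_i P_{j,i} k_i. A sketch of this recipe for the chain of this example (an added illustration, not the computation in the notes):

import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

def mean_return_time(P, j):
    """E_j R_j via first step analysis: solve (I - Q) k = 1 with Q = P minus
    row/column j, then E_j R_j = 1 + sum_i P[j, i] * k_i (with k_j = 0)."""
    idx = [i for i in range(len(P)) if i != j]
    Q = P[np.ix_(idx, idx)]
    k_small = np.linalg.solve(np.eye(len(idx)) - Q, np.ones(len(idx)))
    k = np.zeros(len(P))
    k[idx] = k_small
    return 1.0 + P[j] @ k

print([mean_return_time(P, j) for j in range(3)])   # [2.5, 2.5, 5.0] = 1/pi_j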

Theorem 9.44. Suppose that P = (p_{ij}) is an irreducible Markov kernel and π_j := 1/(E_j R_j) for all j ∈ S. Then:

1. For all i, j ∈ S, we have

lim_{N→∞} (1/N) ∑_{n=0}^N 1_{X_n = j} = π_j, P_i – a.s.,   (9.14)

and

lim_{N→∞} (1/N) ∑_{n=1}^N P_i(X_n = j) = lim_{N→∞} (1/N) ∑_{n=0}^N P^n_{ij} = π_j.   (9.15)

2. If µ : S → [0, 1] is an invariant sub-probability, then either µ(i) > 0 for all i or µ(i) = 0 for all i.
3. P has at most one invariant distribution.
4. P has a (necessarily unique) invariant distribution, µ : S → [0, 1], iff P is positive recurrent, in which case µ(i) = π(i) = 1/(E_i R_i) > 0 for all i ∈ S.

(These results may of course be applied to the restriction of a general non-irreducible Markov chain to any one of its communicating classes.)

Using this result we can give another proof of Proposition 9.25.

Corollary 9.45. If C is a closed finite communicating class, then C is positive recurrent. (Recall that we already know that C is recurrent by Proposition 9.23.)

Proof. For i, j ∈ C, let

π_j := lim_{N→∞} (1/N) ∑_{n=1}^N P_i(X_n = j) = 1/(E_j R_j)

as in Theorem 9.44. Since C is closed, ∑_{j∈C} P_i(X_n = j) = 1 and therefore,

∑_{j∈C} π_j = lim_{N→∞} (1/N) ∑_{j∈C} ∑_{n=1}^N P_i(X_n = j) = lim_{N→∞} (1/N) ∑_{n=1}^N ∑_{j∈C} P_i(X_n = j) = 1.

Therefore π_j > 0 for some j ∈ C and hence for all j ∈ C by Theorem 9.44 with S replaced by C. Hence we have E_j R_j < ∞, i.e. every j ∈ C is a positive recurrent state.

Theorem 9.46. Let P be an irreducible Markov chain and ν a probability on S.

1. If P is null-recurrent or transient, then lim_{n→∞} P^n_{ij} = 0 for all i, j ∈ S and more generally

lim_{n→∞} P_ν(X_n = i) = 0.

2. If P is a positive-recurrent and aperiodic Markov transition kernel, then

lim_{n→∞} P^n_{ij} = 1/(E_j R_j) =: π_j

and more generally,

lim_{n→∞} P_ν(X_n = j) = π_j.

Furthermore, if C is an aperiodic communicating class and ν is any probability on S, then

lim_{n→∞} P_ν(X_n = j) := lim_{n→∞} ∑_{i∈S} ν(i) P^n_{ij} = P_ν(R_j < ∞) · 1/(E_j R_j) for all j ∈ C.

If C is transient or null-recurrent then, no matter whether C is aperiodic or not, we will have lim_{n→∞} P_ν(X_n = j) = 0 for all j ∈ C.


Theorem 9.47 (General Convergence Theorem). Let ν : S → [0, 1] be any probability, i ∈ S, C be the communicating class containing i,

{X_n hits C} := {X_n ∈ C for some n},

and

π_i := π_i(ν) = P_ν(X_n hits C) · 1/(E_i R_i),

where 1/∞ := 0. Then:

1. P_ν – a.s.,

lim_{N→∞} (1/N) ∑_{n=1}^N 1_{X_n = i} = (1/(E_i R_i)) 1_{X_n hits C},

2.

lim_{N→∞} (1/N) ∑_{n=1}^N P_ν(X_n = i) = π_i = P_ν(X_n hits C) · 1/(E_i R_i),

3. π is an invariant sub-probability for P, and
4. the mass of π is

∑_{i∈S} π_i = ∑_{C : pos. recurrent} P_ν(X_n hits C) ≤ 1.

5. If C is a positively recurrent communicating class, then we can find µ_i := 1/(E_i R_i) for i ∈ C as the unique invariant distribution for the chain restricted to C, i.e. if we extend µ to S by setting µ_j = 0 for j ∉ C, then µ : S → [0, 1] must satisfy µ = µP with ∑_{i∈C} µ_i = 1.

The loss of mass can only occur when #(S) = ∞, and this loss happens through loss of mass (sand) to infinity in the transient and null recurrent classes with infinitely many points. For example, in the fair random walk a unit lump of sand starting at zero spreads out with 1/2 going to +∞ and the other half going to −∞, while for the biased random walk with p > 1/2 the sand all gets shoveled to +∞.

The proofs of these theorems (sketched in Section 9.6 below) rely on the important notion of a stopping time, which we introduce in Section 9.5.

Example 9.48 (Example 9.24 revisited). Let us now finish our analysis of the chain with jump diagram in Figure 9.10. As before, suppose the chain starts at 1 and let B = {3, 4, 5} denote the recurrent states. We have already argued that

lim_{n→∞} P_1(X_n = 5) = P_1(X_{T_B} = 5) = 2/3,
lim_{n→∞} P_1(X_n = 1) = 0 = lim_{n→∞} P_1(X_n = 2), and
lim_{n→∞} P_1(X_n ∈ {3, 4}) = 1/3.

Fig. 9.10. A 5 state Markov chain – sites 1 and 2 are transient and 5 is absorbing.

From Exercise 9.1, the invariant distribution for the chain restricted to {3, 4} is

π = (π_3, π_4) = (1/3, 1/2) · 1/(1/3 + 1/2) = [2/5  3/5],

but only 1/3 of the sand makes it to {3, 4} and so we find,

lim_{n→∞} (P_1(X_n = 3), P_1(X_n = 4)) = (1/3) [2/5  3/5] = [2/15  1/5].

So, all in all, we have shown

lim_{n→∞} (P_1(X_n = 1), P_1(X_n = 2), P_1(X_n = 3), P_1(X_n = 4), P_1(X_n = 5))
  = (0, 0, 2/15, 1/5, 2/3) ≅ [0.0  0.0  0.13333  0.2  0.66667].

The Markov matrix P in this case is

P =
[ 0    1/2  0    0    1/2 ]
[ 1/2  0    0    1/2  0   ]
[ 0    0    1/2  1/2  0   ]
[ 0    0    1/3  2/3  0   ]
[ 0    0    0    0    1   ]

and

P^100 =
[ 7.8886×10^{−31}  0.0              0.13333  0.2  0.66667 ]
[ 0.0              7.8886×10^{−31}  0.26667  0.4  0.33333 ]
[ 0.0              0.0              0.4      0.6  0.0     ]
[ 0.0              0.0              0.4      0.6  0.0     ]
[ 0.0              0.0              0.0      0.0  1.0     ].


The first row is already close to the limiting distribution we computed. One may similarly work out the corresponding long time probabilities when starting the chain at 2 to find,

lim_{n→∞} (P_2(X_n = 1), P_2(X_n = 2), P_2(X_n = 3), P_2(X_n = 4), P_2(X_n = 5))
  = (0, 0, 4/15, 2/5, 1/3) ≅ [0.0  0.0  0.26667  0.4  0.33333].

The case of starting at the recurrent sites is even easier; you should think about this and in particular explain the last two rows of P^100 above.
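The limits in this example are easy to check on a computer by taking a large matrix power; a rough sketch (the exponent 200 is an arbitrary large choice):

import numpy as np

# The 5-state chain of Figure 9.10 (states 1..5 are indices 0..4).
P = np.array([[0,   1/2, 0,   0,   1/2],
              [1/2, 0,   0,   1/2, 0  ],
              [0,   0,   1/2, 1/2, 0  ],
              [0,   0,   1/3, 2/3, 0  ],
              [0,   0,   0,   0,   1  ]])

Pn = np.linalg.matrix_power(P, 200)
print(np.round(Pn[:2], 5))
# Rows for the transient starting states 1 and 2 converge to
# [0, 0, 2/15, 1/5, 2/3] and [0, 0, 4/15, 2/5, 1/3], as computed above.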

9.5 Stopping Times

Definition 9.49 (Stopping times). Let τ be an N_0 ∪ {∞} - valued random variable which is a functional of a sequence of random variables, {X_n}_{n=0}^∞, which we write by abuse of notation as τ = τ(X_0, X_1, . . .). We say that τ is a stopping time if for all n ∈ N_0 the indicator random variable 1_{τ=n} is a functional of (X_0, . . . , X_n). Thus for each n ∈ N_0 there should exist a function σ_n such that 1_{τ=n} = σ_n(X_0, . . . , X_n). In other words, the event {τ = n} may be described using only (X_0, . . . , X_n) for all n ∈ N_0.

Example 9.50. Here are some examples. In these examples we will always use the convention that the minimum of the empty set is +∞.

1. The random time τ = min{k : |X_k| ≥ 5} (the first time, k, such that |X_k| ≥ 5) is a stopping time since

{τ = k} = {|X_1| < 5, . . . , |X_{k−1}| < 5, |X_k| ≥ 5}.

2. Let W_k := X_1 + · · · + X_k; then the random time

τ = min{k : W_k ≥ π}

is a stopping time since,

{τ = k} = {W_j = X_1 + · · · + X_j < π for j = 1, 2, . . . , k − 1, and X_1 + · · · + X_{k−1} + X_k ≥ π}.

3. For t ≥ 0, let N(t) = #{k : W_k ≤ t}. Then

{N(t) = k} = {X_1 + · · · + X_k ≤ t, X_1 + · · · + X_{k+1} > t},

which shows that N(t) is not a stopping time. On the other hand, since

{N(t) + 1 = k} = {N(t) = k − 1} = {X_1 + · · · + X_{k−1} ≤ t, X_1 + · · · + X_k > t},

we see that N(t) + 1 is a stopping time!
4. If τ is a stopping time then so is τ + 1 because,

1_{τ+1=k} = 1_{τ=k−1} = σ_{k−1}(X_0, . . . , X_{k−1}),

which is also a function of (X_0, . . . , X_k) which happens not to depend on X_k.
5. On the other hand, if τ is a stopping time it is not necessarily true that τ − 1 is still a stopping time, as seen in item 3. above.
6. One can also see that the last time, k, such that |X_k| ≥ π is typically not a stopping time. (Think about this.)

Remark 9.51. If τ is an {X_n}_{n=0}^∞ - stopping time then

1_{τ≥n} = 1 − 1_{τ<n} = 1 − ∑_{k<n} σ_k(X_0, . . . , X_k) =: u_n(X_0, . . . , X_{n−1}).

That is, for a stopping time τ, 1_{τ≥n} is a function of (X_0, . . . , X_{n−1}) only, for all n ∈ N_0.

The following presentation of Wald’s equation is taken from Ross [4, p.59-60].

Theorem 9.52 (Wald’s Equation). Suppose that {X_n}_{n=1}^∞ is a sequence of i.i.d. random variables, f(x) is a non-negative function of x ∈ R, and τ is a stopping time. Then

E[∑_{n=1}^τ f(X_n)] = Ef(X_1) · Eτ.   (9.16)

This identity also holds if the f(X_n) are real valued but integrable and τ is a stopping time such that Eτ < ∞. (See Resnick for more identities along these lines.)

Proof. If f(X_n) ≥ 0 for all n, then the following computations need no justification:

E[∑_{n=1}^τ f(X_n)] = E[∑_{n=1}^∞ f(X_n) 1_{n≤τ}] = ∑_{n=1}^∞ E[f(X_n) 1_{n≤τ}]
  = ∑_{n=1}^∞ E[f(X_n) u_n(X_1, . . . , X_{n−1})]
  = ∑_{n=1}^∞ E[f(X_n)] · E[u_n(X_1, . . . , X_{n−1})]
  = ∑_{n=1}^∞ E[f(X_n)] · E[1_{n≤τ}] = Ef(X_1) ∑_{n=1}^∞ E[1_{n≤τ}]
  = Ef(X_1) · E[∑_{n=1}^∞ 1_{n≤τ}] = Ef(X_1) · Eτ.


If E|f(X_n)| < ∞ and Eτ < ∞, the above computation with f replaced by |f| shows that all sums appearing above equal E|f(X_1)| · Eτ < ∞. Hence we may remove the absolute values to again arrive at Eq. (9.16).

Example 9.53. Let {X_n}_{n=1}^∞ be i.i.d. such that P(X_n = 0) = P(X_n = 1) = 1/2 and let

τ := min{n : X_1 + · · · + X_n = 10}.

For example, τ is the first time we have flipped 10 heads of a fair coin. By Wald’s equation (valid because X_n ≥ 0 for all n) we find

10 = E[∑_{n=1}^τ X_n] = EX_1 · Eτ = (1/2) Eτ

and therefore Eτ = 20 < ∞.
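A quick simulation confirms the answer Eτ = 20; the sketch below (sample size chosen arbitrarily) simply flips a fair coin until 10 heads have appeared and averages the stopping times.

import numpy as np

rng = np.random.default_rng(3)

def sample_tau():
    """First time the number of heads of a fair coin reaches 10."""
    heads, n = 0, 0
    while heads < 10:
        heads += rng.integers(0, 2)   # one fair coin flip (0 or 1)
        n += 1
    return n

taus = [sample_tau() for _ in range(20_000)]
print(np.mean(taus))                  # close to E tau = 20, as Wald's equation predicts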

Example 9.54 (Gambler’s ruin). Let {X_n}_{n=1}^∞ be i.i.d. such that P(X_n = −1) = P(X_n = 1) = 1/2 and let

τ := min{n : X_1 + · · · + X_n = 1}.

So τ may represent the first time that a gambler is ahead by 1. Notice that EX_1 = 0. If Eτ < ∞, then we would have τ < ∞ a.s. and Wald’s equation would give,

1 = E[∑_{n=1}^τ X_n] = EX_1 · Eτ = 0 · Eτ,

which can not hold. Hence it must be that

Eτ = E[first time that a gambler is ahead by 1] = ∞.

9.6 Proof Ideas (optional reading)

The proofs of the limit theorems described in Section 9.4 above rely on the fact that Markov chains enjoy the “strong Markov property.” The strong Markov property basically asserts that the m ∈ N_0 appearing in Eq. (8.1) may be replaced by any stopping time. As before, let {X_n}_{n=0}^∞ be a Markov chain with transition probabilities P_{i,j} = p(i, j) and ν : S → [0, 1] be any probability on S. {X_n}_{n=0}^∞ denotes a Markov chain with state space S for the rest of this chapter.

Theorem 9.55 (Strong Markov Property). Let ν be a probability on S and F(X) = F(X_0, X_1, . . .) be a random variable¹ depending on X. Then for any stopping time τ = τ(X_0, X_1, . . .),

E_ν[F(X_0, X_1, . . .) : τ < ∞] = E_ν[1_{τ<∞} E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, Y_0, Y_1, . . .)]].   (9.17)

(In words, given the path of the chain up to and including time τ, the path after and including time τ has the same distribution as the chain started at X_τ.)

¹ In this theorem we assume that F is either bounded or non-negative.

Proof. The proof of this deep result is now rather easy to reduce to Theorem 8.1. Indeed, making use of Definition 9.49 and Theorem 8.1,

E_ν[F(X_0, X_1, . . .) : τ < ∞]
  = ∑_{m=0}^∞ E_ν[F(X_0, X_1, . . .) · 1_{τ=m}]
  = ∑_{m=0}^∞ E_ν[F(X_0, X_1, . . .) · σ_m(X_0, . . . , X_m)]
  = ∑_{m=0}^∞ E_ν[E^{(Y)}_{X_m}[F(X_0, X_1, . . . , X_{m−1}, Y_0, Y_1, . . .) σ_m(X_0, . . . , X_m)]]
  = ∑_{m=0}^∞ E_ν[σ_m(X_0, . . . , X_m) E^{(Y)}_{X_m}[F(X_0, X_1, . . . , X_{m−1}, Y_0, Y_1, . . .)]]
  = ∑_{m=0}^∞ E_ν[1_{τ=m} E^{(Y)}_{X_m}[F(X_0, X_1, . . . , X_{m−1}, Y_0, Y_1, . . .)]]
  = ∑_{m=0}^∞ E_ν[1_{τ=m} E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, Y_0, Y_1, . . .)]]
  = E_ν[1_{τ<∞} E^{(Y)}_{X_τ}[F(X_0, X_1, . . . , X_{τ−1}, Y_0, Y_1, . . .)]].

The strong Markov property rather immediately leads to the following key identity (see Notation 9.1),

P_ν(M_j ≥ k) = P_ν(T_j < ∞) · P_j(R_j < ∞)^{k−1}   (9.18)

for all j ∈ S and k ∈ N. The idea behind this identity is that M_j ≥ k iff X visits j once, which happens with probability P_ν(T_j < ∞), and then revisits j (k − 1) more times. Because of the strong Markov property, the times between visits to j are all independent, and once the chain hits j the probability that the next visit occurs in finite time is P_j(R_j < ∞). Putting these remarks together gives Eq. (9.18). Given this result we deduce that


E_ν M_j = ∑_{k=1}^∞ P_ν(M_j ≥ k) = P_ν(T_j < ∞) · ∑_{k=1}^∞ P_j(R_j < ∞)^{k−1}
        = P_ν(T_j < ∞) / (1 − P_j(R_j < ∞)) = P_ν(T_j < ∞) / P_j(R_j = ∞) = P_ν(T_j < ∞) · E_j M_j   (9.19)

with the conventions that 0/0 = 0 and 0 · ∞ = 0. In particular, taking ν = δ_j we have,

E_j M_j = 1/P_j(R_j = ∞) = { ∞ if j ∈ S_r, < ∞ if j ∈ S_t }   and   (9.20)

P_j(M_j = ∞) = lim_{k→∞} P_j(M_j ≥ k) = lim_{k→∞} P_j(R_j < ∞)^{k−1} = { 1 if j ∈ S_r, 0 if j ∈ S_t }.   (9.21)

These formulas show that

j ∈ S_r ⇐⇒ 1 = P_j(M_j = ∞) = P_j(X visits j i.o.) ⇐⇒ E_j M_j = ∞,

or equivalently that

j ∈ S_t ⇐⇒ 0 = P_j(M_j = ∞) = P_j(X visits j i.o.) ⇐⇒ E_j M_j < ∞.

Let us further observe that

E_ν M_j = E_ν ∑_{k=0}^∞ 1_{X_k = j} = ∑_{k=0}^∞ E_ν 1_{X_k = j} = ∑_{k=0}^∞ P_ν(X_k = j) = ∑_{k=0}^∞ [νP^k]_j

and in particular the Green’s function, G_{ij} := E_i M_j, is given by

G_{ij} = E_i M_j = ∑_{k=0}^∞ P^k_{ij} (G_{ij} = ∞ is quite possible).

Notice that formally, G_{ij} = (I − P)^{−1}_{ij}.

Remark 9.56. The assertion from Eq. (9.19) that E_ν M_j = P_ν(T_j < ∞) E_j M_j also follows from the strong Markov property as follows. As M_j = 0 if T_j = ∞ we have,

E_ν M_j = E_ν[M_j : T_j < ∞]
        = E_ν[E^{(Y)}_j[M_j(X_0, . . . , X_{T−1}, Y_0, Y_1, . . .)] : T_j < ∞]
        = E_ν[E^{(Y)}_j[M_j(Y_0, Y_1, . . .)] : T_j < ∞]
        = P_ν(T_j < ∞) E_j M_j.

In particular, taking ν = δ_i shows,

G_{ij} = E_i M_j = P_i(T_j < ∞) E_j M_j = P_i(T_j < ∞) G_{jj},

and so if j ∈ S_t we find,

P_i(T_j < ∞) = G_{ij}/G_{jj}.

Now suppose that j ∈ S_r and i ←→ j. Since i and j communicate, there exist α and β in N such that P^α_{ij} > 0 and P^β_{ji} > 0. Therefore

E_i M_i ≥ ∑_{n≥0} P^{n+α+β}_{ii} ≥ ∑_{n≥0} P^α_{ij} P^n_{jj} P^β_{ji} = P^α_{ij} P^β_{ji} · E_j M_j = ∞.

This shows that i ∈ S_r as well, and hence recurrence, and therefore transience, are class properties. Moreover, using P_j(X_β = i) = P^β_{ji} > 0 along with the Markov property we find;

P_j(X_β = i) = P_j(X_β = i, M_j = ∞)
             = E_j[1_{X_β = i} P^{(Y)}_i(M_j(X_0, . . . , X_{β−1}, Y_0, Y_1, . . .) = ∞)]
             = E_j[1_{X_β = i} P^{(Y)}_i(M_j(Y_0, Y_1, . . .) = ∞)]
             = P_j(X_β = i) · P_i(M_j = ∞),

from which we may conclude that P_i(M_j = ∞) = 1.² In conclusion, if C is a recurrent communicating class and ν : S → [0, 1] is a probability on S such that ν(C) = 1, then

P_ν(∩_{j∈C} {M_j = ∞}) = 1,

because P_i(M_j = ∞) = 1 for all j ∈ C implies P_i(∩_{j∈C} {M_j = ∞}) = 1 for all i ∈ C and therefore,

P_ν(∩_{j∈C} {M_j = ∞}) = ∑_{i∈C} ν(i) P_i(∩_{j∈C} {M_j = ∞}) = ∑_{i∈C} ν(i) = 1.

The next key observation is that for any site j ∈ S we expect the relative frequency, F_j := lim_{N→∞} (1/N) ∑_{n=1}^N 1_{X_n = j}, of time that the chain spends at j to be given by³

² Basically, on {X_β = i} we still have to hit j infinitely many times after time β, and as the process starts afresh at i at time β this will happen iff P_i(M_j = ∞) = 1.
³ Notice that

P_ν(R_j < ∞) · 1/(E_j R_j) = P_ν(T_j < ∞) · 1/(E_j R_j)


F_j := lim_{N→∞} (1/N) ∑_{n=1}^N 1_{X_n = j} = P_ν(R_j < ∞) · 1/(E_j R_j) = P_ν(T_j < ∞) · 1/(E_j R_j) (P_ν – a.s.).   (9.22)

The logic here is that in order for F_j ≠ 0 we must at least visit (revisit) j once, which happens with probability P_ν(R_j < ∞), and then after that the chain revisits the site j roughly every E_j R_j units of time; hence the formula in Eq. (9.22). To make this part of the argument precise one must use the strong Markov property along with the strong law of large numbers. Taking expectations of Eq. (9.22) then shows,

lim_{N→∞} (1/N) ∑_{n=1}^N P_ν(X_n = j) = P_ν(R_j < ∞) · 1/(E_j R_j).

In particular, if S is irreducible and recurrent, then P_ν(R_j < ∞) = 1 and we find,

π_j := lim_{N→∞} (1/N) ∑_{n=1}^N P_ν(X_n = j) = 1/(E_j R_j)   (9.23)

independent of the choice of ν. The next two propositions dig a little deeper and in particular show that positive recurrence is a class property.

Proposition 9.57. Suppose that {X_n} is an irreducible, recurrent Markov chain and let π_j = 1/(E_j R_j) for all j ∈ S as in Eq. (9.23). Then either π_i = 0 for all i ∈ S (in which case {X_n} is null recurrent) or π_i > 0 for all i ∈ S (in which case {X_n} is positive recurrent). Moreover, if π_i > 0 then

∑_{i∈S} π_i = 1 and   (9.24)

∑_{i∈S} π_i P_{ij} = π_j for all j ∈ S.   (9.25)

(Continuation of footnote 3:) as both terms are 0 if E_j R_j = ∞, while if E_j R_j < ∞, then P_j(R_j < ∞) = 1 = P_j(T_j < ∞) and so

P_ν(T_j < ∞) = ∑_{k≠j} ν(k) P_k(T_j < ∞) + ν(j) P_j(T_j < ∞)
             = ∑_{k≠j} ν(k) P_k(R_j < ∞) + ν(j) · 1
             = ∑_{k≠j} ν(k) P_k(R_j < ∞) + ν(j) · P_j(R_j < ∞)
             = P_ν(R_j < ∞).

That is, π = (π_i)_{i∈S} is the unique stationary distribution for P.

Proof. Let us define

T^n_{ki} := (1/n) ∑_{l=1}^n P^l_{ki},   (9.26)

which, according to Eq. (9.23) with ν = δ_k, satisfies,

lim_{n→∞} T^n_{ki} = π_i for all i, k ∈ S.

Fatou’s lemma implies, for each k ∈ S, that

α := ∑_{i∈S} π_i = ∑_{i∈S} lim inf_{n→∞} T^n_{ki} ≤ lim inf_{n→∞} ∑_{i∈S} T^n_{ki} = 1.

Moreover, by Fatou’s lemma and the observation that

(T^n P)_{ki} = (1/n) ∑_{l=1}^n P^{l+1}_{ki} = (1/n) ∑_{l=1}^n P^l_{ki} + (1/n)[P^{n+1}_{ki} − P_{ki}] → π_i as n → ∞,   (9.27)

we may also conclude that,

∑_{i∈S} π_i P_{ij} = ∑_{i∈S} lim_{n→∞} T^n_{li} P_{ij} ≤ lim inf_{n→∞} ∑_{i∈S} T^n_{li} P_{ij} = lim inf_{n→∞} [T^n P]_{l,j} = π_j.

(Here l ∈ S may be chosen arbitrarily.) Thus

∑_{i∈S} π_i =: α ≤ 1 and ∑_{i∈S} π_i P_{ij} ≤ π_j for all j ∈ S.   (9.28)

By induction it also follows that

∑_{i∈S} π_i P^k_{ij} ≤ π_j for all j ∈ S.   (9.29)

So if π_j = 0 for some j ∈ S, then given any i ∈ S, there is an integer k such that P^k_{ij} > 0, and by Eq. (9.29) we learn that π_i = 0. This shows that either π_i = 0 for all i ∈ S or π_i > 0 for all i ∈ S, i.e. that positive recurrence is a class property.

For the rest of the proof we assume that π_i > 0 for all i ∈ S. If there were some j ∈ S such that ∑_{i∈S} π_i P_{ij} < π_j, we would have from Eq. (9.28) that

α = ∑_{i∈S} π_i = ∑_{i∈S} ∑_{j∈S} π_i P_{ij} = ∑_{j∈S} ∑_{i∈S} π_i P_{ij} < ∑_{j∈S} π_j = α,

which is a contradiction, and Eq. (9.25) is proved.


From Eq. (9.25) and induction we also have

∑_{i∈S} π_i P^k_{ij} = π_j for all j ∈ S

for all k ∈ N and therefore,

∑_{i∈S} π_i T^k_{ij} = π_j for all j ∈ S.   (9.30)

Since 0 ≤ T^k_{ij} ≤ 1 and ∑_{i∈S} π_i = α ≤ 1, we may use the dominated convergence theorem to pass to the limit as k → ∞ in Eq. (9.30) to find

π_j = lim_{k→∞} ∑_{i∈S} π_i T^k_{ij} = ∑_{i∈S} lim_{k→∞} π_i T^k_{ij} = ∑_{i∈S} π_i π_j = απ_j.

Since π_j > 0, this implies that α = 1 and hence Eq. (9.24) is now verified.

Proposition 9.58. Suppose that P is an irreducible Markov kernel which admits a stationary distribution µ. Then P is positive recurrent and µ_j = π_j = 1/(E_j R_j) for all j ∈ S. In particular, an irreducible Markov kernel has at most one invariant distribution, and it has exactly one iff P is positive recurrent.

Proof. Suppose that µ = (µ_i) is a stationary distribution for P, i.e. ∑_{i∈S} µ_i = 1 and µ_j = ∑_{i∈S} µ_i P_{ij} for all j ∈ S. Then we also have

µ_j = ∑_{i∈S} µ_i T^k_{ij} for all k ∈ N,   (9.31)

where T^k_{ij} is defined above in Eq. (9.26). As in the proof of Proposition 9.57, we may use the dominated convergence theorem to find,

µ_j = lim_{k→∞} ∑_{i∈S} µ_i T^k_{ij} = ∑_{i∈S} lim_{k→∞} µ_i T^k_{ij} = ∑_{i∈S} µ_i π_j = π_j.

Alternative Proof. If P were not positive recurrent then P is either transient or null-recurrent, in which case lim_{n→∞} T^n_{ij} = 1/(E_j R_j) = 0 for all i, j. So letting k → ∞ in Eq. (9.31), using the dominated convergence theorem, allows us to conclude that µ_j = 0 for all j, which contradicts the fact that µ was assumed to be a distribution.

The last main item to prove is the second item of Theorem 9.46. Namely, if P is a positive-recurrent and aperiodic Markov transition kernel, then

lim_{n→∞} P^n_{ij} = 1/(E_j R_j) =: π_j for all i ∈ S, and
lim_{n→∞} P_ν(X_n = j) = π_j for all probabilities ν on S.

As we have seen, P_ν(R_j < ∞) = 1 and so

lim_{n→∞} P_ν(X_n = j) = lim_{n→∞} P_ν(X_n = j, R_j < ∞)
  = lim_{n→∞} ∑_{k=1}^∞ P_ν(X_n = j, R_j = k)
  = lim_{n→∞} ∑_{k=1}^∞ P_ν(X_n = j | R_j = k) P_ν(R_j = k).

Using the dominated convergence theorem, it follows that

lim_{n→∞} P_ν(X_n = j) = ∑_{k=1}^∞ lim_{n→∞} P_ν(X_n = j | R_j = k) P_ν(R_j = k),

provided we can show lim_{n→∞} P_ν(X_n = j | R_j = k) exists. On the other hand, by the strong Markov property,

lim_{n→∞} P_ν(X_n = j | R_j = k) = lim_{n→∞} P_j(X_{n−k} = j) = lim_{n→∞} P_j(X_n = j),

provided lim_{n→∞} P_j(X_n = j) =: π_j exists. Thus, if we can show lim_{n→∞} P_j(X_n = j) = π_j exists, we will have also shown,

lim_{n→∞} P_ν(X_n = j) = ∑_{k=1}^∞ π_j · P_ν(R_j = k) = π_j

for every probability ν on S. In order to compute lim_{n→∞} P_j(X_n = j) we will recast the problem in terms of a renewal process.

We now fix j ∈ S and let W_n denote the time of the nth visit to j after time 0. (So W_1 = R_j, for example.) At each time W_n the Markov process renews itself and restarts at j ∈ S. The time to the next visit to j given W_n is distributed according to R_j and is independent of the W_k for k ≤ n. With this notation we have

P^n_{jj} = P_j(X_n = j) = P_j(W_k = n for some k ≤ n).

With this reformulation the assertions in Theorem 9.46 are now a consequence of the renewal Theorem 11.5 below. The proof of the renewal theorem relies on the “renewal equation” relating P^n_{jj} to

f^{(n)}_{jj} = P_j(R_j = n) = P_j(X_1 ≠ j, . . . , X_{n−1} ≠ j, X_n = j).

Proposition 9.59 (Renewal Equation). The numbers P^n_{jj} and f^{(n)}_{j,j} are related by the “renewal equation,”

P^n_{jj} = ∑_{k=1}^n P_j(R_j = k) P^{n−k}_{jj} = ∑_{k=1}^n f^{(k)}_{jj} P^{n−k}_{jj}.   (9.32)


Proof. To prove Eq. (9.32) we first observe for $n \ge 1$ that $\{X_n = j\}$ is the disjoint union of $\{X_n = j,\, R_j = k\}$ for $1 \le k \le n$ and therefore by the Markov property,
\[ P^n_{jj} = P_j(X_n = j) = \sum_{k=1}^n E_j\big(1_{R_j = k} \cdot 1_{X_n = j}\big) = \sum_{k=1}^n E_j\big(1_{R_j = k} \cdot E_j 1_{X_{n-k} = j}\big) \]
\[ = \sum_{k=1}^n E_j\big(1_{R_j = k}\big)\, E_j\big(1_{X_{n-k} = j}\big) = \sum_{k=1}^n P_j(R_j = k)\, P_j(X_{n-k} = j) = \sum_{k=1}^n P^{n-k}_{jj}\, P_j(R_j = k). \]
Alternatively we have,
\[ P^n_{jj} = P_j(X_n = j) = \sum_{k=1}^n P_j(R_j = k,\, X_n = j) = \sum_{k=1}^n P_j(X_1 \ne j, \dots, X_{k-1} \ne j, X_k = j, X_n = j) \]
\[ = \sum_{k=1}^n P_j(X_1 \ne j, \dots, X_{k-1} \ne j, X_k = j)\, P^{n-k}_{jj} = \sum_{k=1}^n P^{n-k}_{jj}\, P_j(R_j = k). \]
It is interesting to notice that knowing $P^n_{jj}$ we may recover $f^{(n)}_{jj}$ from Eq. (9.32) and vice versa. The reader is invited to prove this for her/himself.


10

Finite State Space Results and Examples

For the majority of this chapter we suppose that $S = \{1, 2, \dots, n\}$ and $P_{ij}$ is a Markov matrix. For a few of the results to follow we will allow $S$ to be a countable set.

Proposition 10.1. The Markov matrix $P$ on a finite state space has at least one invariant distribution.

Proof. If $\mathbf{1} := \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix}^{\mathrm{tr}}$, then $P\mathbf{1} = \mathbf{1}$ (this is just a restatement of the fact that the row sums of $P$ are all equal to one), from which it follows that
\[ 0 = \det(P - I) = \det\big(P^{\mathrm{tr}} - I\big). \]
Therefore there exists a non-zero row vector $\nu$ such that $P^{\mathrm{tr}}\nu^{\mathrm{tr}} = \nu^{\mathrm{tr}}$, or equivalently that $\nu P = \nu$. At this point we would be done if we knew that $\nu_i \ge 0$ for all $i$ – but we don't. So let $\pi_i := |\nu_i|$ and observe that
\[ \pi_i = |\nu_i| = \Big|\sum_{k=1}^n \nu_k P_{ki}\Big| \le \sum_{k=1}^n |\nu_k|\, P_{ki} = \sum_{k=1}^n \pi_k P_{ki}. \]
We now claim that in fact $\pi = \pi P$. If this were not the case we would have $\pi_i < \sum_{k=1}^n \pi_k P_{ki}$ for some $i$ and therefore
\[ 0 < \sum_{i=1}^n \pi_i < \sum_{i=1}^n \sum_{k=1}^n \pi_k P_{ki} = \sum_{k=1}^n \sum_{i=1}^n \pi_k P_{ki} = \sum_{k=1}^n \pi_k, \]
which is a contradiction. So all that is left to do is normalize $\pi_i$ so that $\sum_{i=1}^n \pi_i = 1$ and we are done.

Remark 10.2. The same proof as above shows: if $\#(S) = \infty$ and $\nu : S \to \mathbb R$ is a nonzero function such that $\alpha := \sum_{i\in S} |\nu_i| < \infty$ and $\nu = \nu P$, then $\pi_i := |\nu_i|/\alpha$ for all $i \in S$ is an invariant distribution for $P$.

Recall that $P$ is irreducible means that for all $i, j \in S$ there exists $n \in \mathbb N_0$ such that $P^n_{ij} > 0$. Alternatively put, this implies that $P_i(T_j < \infty) > 0$ for all $i, j \in S$.

Proposition 10.3. Suppose that $P$ is an irreducible Markov matrix on $S$ ($\#(S) = \infty$ is OK here) and suppose $\pi : S \to \mathbb R$ is a function such that $\alpha := \sum_{i\in S} |\pi_i| < \infty$ and $\pi = \pi P$. Then either $\pi_i = 0$ for all $i$, $\pi_i > 0$ for all $i$, or $\pi_i < 0$ for all $i$.

Proof. By Remark 10.2, $\mu_i := |\pi_i|$ satisfies $\mu = \mu P$. Suppose that $\pi_i > 0$ for some $i$ and let $\nu_j := \mu_j - \pi_j \ge 0$ for all $j$, so that $\nu_i = 0$ and $\nu = \nu P$. Since $P$ is irreducible, we can find $n$ such that $P^n_{ji} > 0$ and therefore,
\[ 0 = \nu_i = \sum_{k\in S} \nu_k P^n_{ki} \ge \nu_j P^n_{ji}, \]
which implies $\nu_j = 0$. As $j \in S$ is arbitrary, it follows that $\nu_j = 0$ for all $j$, i.e. $\mu_j = \pi_j$ for all $j \in S$, and therefore $\pi_j \ge 0$ for all $j$. Now for $j \in S$ choose $n \in \mathbb N_0$ such that $P^n_{ij} > 0$. Then
\[ \pi_j = \sum_{k\in S} \pi_k P^n_{kj} \ge \pi_i P^n_{ij} > 0 \]
and we have shown $\pi_j > 0$ for all $j$.

If $\pi_i < 0$ for some $i \in S$ then the above argument applies to $-\pi$ to see that $-\pi_j > 0$ for all $j$, i.e. $\pi_j < 0$ for all $j$. This suffices to complete the proof of the proposition.

Corollary 10.4. If $P$ is an irreducible Markov matrix, then $P$ has at most one invariant distribution.

Proof. Suppose that $\lambda$ and $\pi$ are two invariant distributions and let $\nu := \pi - \lambda$. Then $\nu = \nu P$ and so by Proposition 10.3 either $\nu_i > 0$ for all $i$, $\nu_i < 0$ for all $i$, or $\nu_i = 0$ for all $i$. If $\nu_i > 0$ for all $i$ then
\[ 0 < \sum_{i\in S} \nu_i = \sum_{i\in S} \pi_i - \sum_{i\in S} \lambda_i = 1 - 1 = 0, \]
which is a contradiction. Similarly one shows that $\nu_i < 0$ for all $i$ is not possible either, and therefore $0 = \nu_i = \pi_i - \lambda_i$ for all $i \in S$.

Corollary 10.5. If $P$ is an irreducible Markov matrix on a finite state space $S$, then $P$ has precisely one invariant distribution $\pi$.


Proof. Combine Corollary 10.4 with Proposition 10.1.

We now suppose that $\#(S) < \infty$ and $P$ is irreducible. By Corollary 8.31 we know that $E_i[R_j] = E_iT_j < \infty$ for all $i \ne j$ and from Exercise 7.4 that $E_iR_i < \infty$ also holds. The fact that $E_iR_i < \infty$ for all $i \in S$ will come out of the proof of the next proposition as well.

Proposition 10.6. If $P$ is irreducible, then there is precisely one invariant distribution, $\pi$, which is given by $\pi_i = 1/(E_iR_i) > 0$ for all $i \in S$.

Proof. First observe that
\[ R_j(i, X) = \begin{cases} 1 & \text{if } X_0 = j \\ 1 + R_j(X) & \text{if } X_0 \ne j \end{cases} \;=\; 1 + 1_{X_0 \ne j} R_j(X). \]
Therefore by the first step analysis,
\[ E_i[R_j] = E_i[R_j(X)] = E_{p(i,\cdot)}[R_j(i, X)] = E_{p(i,\cdot)}\big[1 + 1_{X_0 \ne j} R_j(X)\big] = 1 + \sum_{k \ne j} P_{ik} E_k[R_j]. \quad (10.1) \]
Here is a slight recasting of this same argument:
\[ E_i[R_j] = \sum_{k=1}^n E_i[R_j \mid X_1 = k]\, P_{ik} = \sum_{k \ne j} E_i[R_j \mid X_1 = k]\, P_{ik} + P_{ij} \cdot 1 = \sum_{k \ne j} (E_k[R_j] + 1)\, P_{ik} + P_{ij} \cdot 1 = \sum_{k \ne j} E_k[R_j]\, P_{ik} + 1, \]
which is again Eq. (10.1).

Now suppose that $\pi$ is any invariant distribution for $P$; then multiplying Eq. (10.1) by $\pi_i$ and summing on $i$ shows
\[ \sum_{i=1}^n \pi_i E_i[R_j] = \sum_{i=1}^n \pi_i \sum_{k \ne j} P_{ik} E_k[R_j] + \sum_{i=1}^n \pi_i \cdot 1 = \sum_{k \ne j} \pi_k E_k[R_j] + 1. \]
Since $\sum_{k \ne j} \pi_k E_k[R_j] < \infty$ we may cancel it from both sides of this equation in order to learn $\pi_j E_j[R_j] = 1$. This shows that $\pi_j > 0$, $E_j[R_j] < \infty$, and $\pi_j = 1/(E_jR_j)$ for all $j \in S$.

We may use Eq. (10.1) to compute $E_i[R_j]$ in examples. To do this, fix $j$ and set $v_i := E_iR_j$. Then Eq. (10.1) states that $v = P^{(j)}v + \mathbf{1}$ where
\[ P^{(j)} := \big[\, P_1 \,|\, \dots \,|\, P_{j-1} \,|\, 0 \,|\, P_{j+1} \,|\, \dots \,|\, P_n \,\big] \]
denotes $P$ with the $j^{\text{th}}$ column replaced by all zeros. Thus we have $v = \big(I - P^{(j)}\big)^{-1}\mathbf{1}$, i.e.
\[ E_iR_j = \Big[\big(I - P^{(j)}\big)^{-1}\mathbf{1}\Big]_i, \quad (10.2) \]
i.e.
\[ \begin{bmatrix} E_1R_j \\ \vdots \\ E_nR_j \end{bmatrix} = \big(I - P^{(j)}\big)^{-1}\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}. \quad (10.3) \]
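Equation (10.3) is easy to implement numerically. The following Python sketch (my own illustration, not part of the original notes; the function name is mine) computes the vector $(E_iR_j)_{i\in S}$ for a given Markov matrix by zeroing out the $j^{\text{th}}$ column and solving the resulting linear system, and it is checked here on the two–state chain of Example 10.8 below.

```python
import numpy as np

def expected_return_times(P, j):
    """v[i] = E_i[R_j], computed as v = (I - P^(j))^{-1} 1,
    where P^(j) is P with its j-th column replaced by zeros."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    Pj = P.copy()
    Pj[:, j] = 0.0                      # zero out the j-th column
    return np.linalg.solve(np.eye(n) - Pj, np.ones(n))

# Two-state chain of Example 10.8: P = [[0, 1], [1, 0]].
P = [[0.0, 1.0], [1.0, 0.0]]
print(expected_return_times(P, 0))      # [2. 1.]
print(expected_return_times(P, 1))      # [1. 2.]
```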

Remark 10.7. We can also derive Eq. (10.2) from first principles:
\[ E_iR_j = \sum_{n=0}^\infty P_i(R_j > n) = 1 + \sum_{n=1}^\infty P_i(R_j > n) = 1 + \sum_{n=1}^\infty P_i(X_1 \ne j, \dots, X_n \ne j) \quad (10.4) \]
\[ = 1 + \sum_{n=1}^\infty \sum_{x_1, \dots, x_n \in S\setminus\{j\}} p(i, x_1)\, p(x_1, x_2) \cdots p(x_{n-1}, x_n) = 1 + \sum_{n=1}^\infty \Big(\big[P^{(j)}\big]^n \mathbf{1}\Big)_i = \sum_{n=0}^\infty \Big(\big[P^{(j)}\big]^n \mathbf{1}\Big)_i = \Big[\big(I - P^{(j)}\big)^{-1}\mathbf{1}\Big]_i. \]
Multiplying Eq. (10.4) by $\pi(i)$ and summing on $i$ implies,
\[ E_\pi R_j = 1 + \sum_{n=1}^\infty P_\pi(X_1 \ne j, \dots, X_n \ne j). \]
Assuming that $\pi$ is an invariant distribution of the chain this leads to


\[ E_\pi R_j = 1 + \sum_{n=1}^\infty P_\pi(X_1 \ne j, \dots, X_n \ne j) = 1 + \sum_{n=1}^\infty P_\pi(X_0 \ne j, \dots, X_{n-1} \ne j) = 1 + \sum_{n=0}^\infty P_\pi(X_0 \ne j, \dots, X_n \ne j) \]
\[ = 1 + (1 - \pi(j)) + \sum_{n=1}^\infty P_\pi(X_0 \ne j, X_1 \ne j, \dots, X_n \ne j) \]
\[ = 1 + (1 - \pi(j)) + \sum_{n=1}^\infty \big[ P_\pi(X_1 \ne j, \dots, X_n \ne j) - P_\pi(X_0 = j, X_1 \ne j, \dots, X_n \ne j) \big] \]
\[ = 1 + \sum_{n=1}^\infty P_\pi(X_1 \ne j, \dots, X_n \ne j) + (1 - \pi(j)) - \sum_{n=1}^\infty P_\pi(X_0 = j, X_1 \ne j, \dots, X_n \ne j) \]
\[ = E_\pi R_j + (1 - \pi(j)) - \pi(j)\sum_{n=1}^\infty P_j(X_1 \ne j, \dots, X_n \ne j) = E_\pi R_j + 1 - \pi(j)\Big[ 1 + \sum_{n=1}^\infty P_j(X_1 \ne j, \dots, X_n \ne j) \Big] = E_\pi R_j + 1 - \pi(j)\, E_jR_j. \]
Formally cancelling $E_\pi R_j$ from both sides once again gives $\pi(j)\, E_jR_j = 1$.

10.1 Some worked examples

Example 10.8. Let $S = \{1, 2\}$ and $P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ with jump diagram in Figure 10.1.

Fig. 10.1. A non-random chain.

In this case $P^{2n} = I$ while $P^{2n+1} = P$ and therefore $\lim_{n\to\infty} P^n$ does not exist. On the other hand it is easy to see that the invariant distribution, $\pi$, for $P$ is $\pi = \begin{bmatrix} 1/2 & 1/2 \end{bmatrix}$ and, moreover,
\[ \frac{P + P^2 + \dots + P^N}{N} \to \frac12\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} \pi \\ \pi \end{bmatrix}. \]
Let us compute
\[ \begin{bmatrix} E_1R_1 \\ E_2R_1 \end{bmatrix} = \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \]
and
\[ \begin{bmatrix} E_1R_2 \\ E_2R_2 \end{bmatrix} = \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \]
so that indeed $\pi_1 = 1/E_1R_1$ and $\pi_2 = 1/E_2R_2$. Of course $R_1 = 2$ ($P_1$-a.s.) and $R_2 = 2$ ($P_2$-a.s.), so that it is obvious that $E_1R_1 = E_2R_2 = 2$.

Example 10.9. Again let $S = \{1, 2\}$ and $P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ with jump diagram in Figure 10.2. In this case the chain is not irreducible and every $\pi = \begin{bmatrix} a & b \end{bmatrix}$ with $a + b = 1$ and $a, b \ge 0$ is an invariant distribution.

Fig. 10.2. A simple non-irreducible chain.

Example 10.10. Suppose that $S = \{1, 2, 3\}$ and
\[ P = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix} \quad (\text{states } 1, 2, 3) \]
has the jump graph given by Figure 10.3. Notice that $P^2_{11} > 0$ and $P^3_{11} > 0$ so that $P$ is "aperiodic." We now find the invariant distribution:
\[ \mathrm{Nul}\,(P - I)^{\mathrm{tr}} = \mathrm{Nul}\begin{bmatrix} -1 & 1/2 & 1 \\ 1 & -1 & 0 \\ 0 & 1/2 & -1 \end{bmatrix} = \mathbb R\begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix}. \]
Therefore the invariant distribution is given by
\[ \pi = \frac15\begin{bmatrix} 2 & 2 & 1 \end{bmatrix}. \]
Let us now observe that


Fig. 10.3. A simple 3 state jump diagram.

\[ P^2 = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \qquad P^3 = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}^3 = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/4 & 1/2 & 1/4 \\ 1/2 & 0 & 1/2 \end{bmatrix}, \]
\[ P^{20} = \begin{bmatrix} 409/1024 & 205/512 & 205/1024 \\ 205/512 & 409/1024 & 205/1024 \\ 205/512 & 205/512 & 51/256 \end{bmatrix} = \begin{bmatrix} 0.39941 & 0.40039 & 0.20020 \\ 0.40039 & 0.39941 & 0.20020 \\ 0.40039 & 0.40039 & 0.19922 \end{bmatrix}. \]
Let us also compute the $E_iR_3$ via
\[ \begin{bmatrix} E_1R_3 \\ E_2R_3 \\ E_3R_3 \end{bmatrix} = \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 3 \\ 5 \end{bmatrix}, \]
so that
\[ \frac{1}{E_3R_3} = \frac15 = \pi_3. \]
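Here is a short Python check (mine, not from the original text) of the numbers in this example: the invariant distribution is found as a null vector of $(P - I)^{\mathrm{tr}}$ (via an eigenvector computation) and the return times $E_iR_3$ come from Eq. (10.3).

```python
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

# Invariant distribution: a null vector of (P - I)^tr, i.e. a left
# eigenvector of P for the eigenvalue 1, normalized to sum to one.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
print(pi)                                        # [0.4, 0.4, 0.2] = (1/5)[2, 2, 1]

print(np.linalg.matrix_power(P, 20))             # already close to rows of pi

P3 = P.copy(); P3[:, 2] = 0                      # Eq. (10.3) with j = 3
print(np.linalg.solve(np.eye(3) - P3, np.ones(3)))   # [4, 3, 5]
```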

Example 10.11. The transition matrix,
\[ P = \begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 \end{bmatrix} \quad (\text{states } 1, 2, 3), \]
is represented by the jump diagram in Figure 10.4. This chain is aperiodic. We find the invariant distribution as,
\[ \mathrm{Nul}\,(P - I)^{\mathrm{tr}} = \mathrm{Nul}\begin{bmatrix} -3/4 & 1/2 & 1/3 \\ 1/2 & -1 & 1/3 \\ 1/4 & 1/2 & -2/3 \end{bmatrix} = \mathbb R\begin{bmatrix} 1 \\ 5/6 \\ 1 \end{bmatrix} = \mathbb R\begin{bmatrix} 6 \\ 5 \\ 6 \end{bmatrix}. \]

Fig. 10.4. In the above diagram there are jumps from 1 to 1 with probability 1/4 and jumps from 3 to 3 with probability 1/3 which are not explicitly shown but must be inferred by conservation of probability.

Therefore
\[ \pi = \frac{1}{17}\begin{bmatrix} 6 & 5 & 6 \end{bmatrix} = \begin{bmatrix} 0.35294 & 0.29412 & 0.35294 \end{bmatrix}. \]
In this case
\[ P^{10} \cong \begin{bmatrix} 0.35298 & 0.29404 & 0.35298 \\ 0.35289 & 0.29423 & 0.35289 \\ 0.35295 & 0.29410 & 0.35295 \end{bmatrix}. \]
Let us also compute
\[ \begin{bmatrix} E_1R_2 \\ E_2R_2 \\ E_3R_2 \end{bmatrix} = \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 1/4 & 0 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/3 & 0 & 1/3 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 11/5 \\ 17/5 \\ 13/5 \end{bmatrix}, \]
so that $1/E_2R_2 = 5/17 = \pi_2$.

Example 10.12. Consider the following Markov matrix,
\[ P = \begin{bmatrix} 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 0 & 0 & 3/4 \\ 1/2 & 1/2 & 0 & 0 \\ 0 & 1/4 & 3/4 & 0 \end{bmatrix} \quad (\text{states } 1, 2, 3, 4) \]
with jump diagram in Figure 10.5. Since this matrix is doubly stochastic (i.e. $\sum_{i=1}^4 P_{ij} = 1$ for all $j$ as well as $\sum_{j=1}^4 P_{ij} = 1$ for all $i$), it is easy to check that $\pi = \frac14\begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$. Let us compute $E_3R_3$ as follows


Fig. 10.5. The jump diagram for Q.

\[ \begin{bmatrix} E_1R_3 \\ E_2R_3 \\ E_3R_3 \\ E_4R_3 \end{bmatrix} = \left( I - \begin{bmatrix} 1/4 & 1/4 & 0 & 1/4 \\ 1/4 & 0 & 0 & 3/4 \\ 1/2 & 1/2 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 50/17 \\ 52/17 \\ 4 \\ 30/17 \end{bmatrix}, \]
so that $E_3R_3 = 4 = 1/\pi_3$ as it should be. Similarly,
\[ \begin{bmatrix} E_1R_2 \\ E_2R_2 \\ E_3R_2 \\ E_4R_2 \end{bmatrix} = \left( I - \begin{bmatrix} 1/4 & 0 & 1/4 & 1/4 \\ 1/4 & 0 & 0 & 3/4 \\ 1/2 & 0 & 0 & 0 \\ 0 & 0 & 3/4 & 0 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 54/17 \\ 4 \\ 44/17 \\ 50/17 \end{bmatrix} \]
and again $E_2R_2 = 4 = 1/\pi_2$.
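Instead of solving linear systems one may also check the doubly stochastic claim by simulation: run the chain for a long time and record the fraction of time spent in each state, which by the results of this chapter should approach $\pi = \frac14\begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$. A rough Python sketch (my own; the step count is arbitrary) follows.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.25, 0.00, 0.00, 0.75],
              [0.50, 0.50, 0.00, 0.00],
              [0.00, 0.25, 0.75, 0.00]])

n_steps = 200_000
counts = np.zeros(4)
x = 0
for _ in range(n_steps):
    x = rng.choice(4, p=P[x])             # one step of the chain
    counts[x] += 1
print(counts / n_steps)                   # each entry is close to 1/4
```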

Example 10.13. Consider the following example,
\[ P = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix} \quad (\text{states } 1, 2, 3) \]
with jump diagram given in Figure 10.6.

Fig. 10.6. The jump diagram associated to P.

We have
\[ P^2 = \begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \end{bmatrix} \quad \text{and} \quad P^3 = \begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \end{bmatrix}. \]
To have a picture of what is going on here, imagine that $\pi = (\pi_1, \pi_2, \pi_3)$ represents the amount of sand at the sites 1, 2, and 3 respectively. During each time step we move the sand on the sites around according to the following rule: the sand at site $j$ after one step is $\sum_i \pi_i p_{ij}$, namely site $i$ contributes the fraction $p_{ij}$ of its sand, $\pi_i$, to site $j$. Everyone does this to arrive at a new distribution. Hence $\pi$ is an invariant distribution if each $\pi_i$ remains unchanged, i.e. $\pi = \pi P$. (Keep in mind the sand is still moving around; it is just that the size of the piles remains unchanged.)

As a specific example, suppose $\pi = (1, 0, 0)$ so that all of the sand starts at 1. After the first step, the pile at 1 is split into two and $1/2$ is sent to 2, to get $\pi^{(1)} = (1/2, 1/2, 0)$, which is the first row of $P$. At the next step site 1 keeps $1/2$ of its sand ($= 1/4$) and still receives nothing, while site 2 again receives the other $1/2$ and keeps half of what it had ($= 1/4 + 1/4$), and site 3 then gets $1/2 \cdot 1/2 = 1/4$, so that $\pi^{(2)} = \begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}$, which is the first row of $P^2$. It turns out in this case that this is the invariant distribution. Formally,
\[ \begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}\begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix} = \begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}. \]
In general we expect to reach the invariant distribution only in the limit as $n \to \infty$.
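The "sand" picture is easy to follow numerically; the small Python sketch below (mine, not part of the original notes) iterates $\nu \mapsto \nu P$ starting from $\nu = (1, 0, 0)$ and shows the distribution settling at $\begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}$ after one extra step, as claimed above.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])

nu = np.array([1.0, 0.0, 0.0])            # all of the sand starts at site 1
for n in range(1, 5):
    nu = nu @ P                           # redistribute the sand by one step
    print(n, nu)
# step 1: [0.5, 0.5, 0.0]; steps 2, 3, 4: [0.25, 0.5, 0.25]
```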


Notice that if $\pi$ is any stationary distribution, then $\pi P^n = \pi$ for all $n$ and in particular,
\[ \pi = \pi P^2 = \begin{bmatrix} \pi_1 & \pi_2 & \pi_3 \end{bmatrix}\begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \\ 1/4 & 1/2 & 1/4 \end{bmatrix} = \begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}. \]
Hence $\begin{bmatrix} \frac14 & \frac12 & \frac14 \end{bmatrix}$ is the unique stationary distribution for $P$ in this case.

Example 10.14 (§3.2, p. 108, Ehrenfest Urn Model). Let a beaker filled with a particle–fluid mixture be divided into two parts $A$ and $B$ by a semipermeable membrane. Let $X_n = (\#$ of particles in $A)$, which we assume evolves by choosing a particle at random from $A \cup B$ and then replacing this particle in the opposite bin from which it was found. Suppose there are $N$ total particles in the flask; then the transition probabilities are given by,
\[ p_{ij} = P(X_{n+1} = j \mid X_n = i) = \begin{cases} 0 & \text{if } j \notin \{i-1, i+1\} \\ i/N & \text{if } j = i-1 \\ (N-i)/N & \text{if } j = i+1. \end{cases} \]
For example, if $N = 2$ we have
\[ (p_{ij}) = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{bmatrix} \quad (\text{states } 0, 1, 2), \]
and if $N = 3$, then we have in matrix form,
\[ (p_{ij}) = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1/3 & 0 & 2/3 & 0 \\ 0 & 2/3 & 0 & 1/3 \\ 0 & 0 & 1 & 0 \end{bmatrix} \quad (\text{states } 0, 1, 2, 3). \]
In the case $N = 2$,
\[ P^2 = \begin{bmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{bmatrix} \quad \text{and} \quad P^3 = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{bmatrix} = P, \]
and when $N = 3$,
\[ P^2 = \begin{bmatrix} 1/3 & 0 & 2/3 & 0 \\ 0 & 7/9 & 0 & 2/9 \\ 2/9 & 0 & 7/9 & 0 \\ 0 & 2/3 & 0 & 1/3 \end{bmatrix}, \qquad P^3 = \begin{bmatrix} 0 & 7/9 & 0 & 2/9 \\ 7/27 & 0 & 20/27 & 0 \\ 0 & 20/27 & 0 & 7/27 \\ 2/9 & 0 & 7/9 & 0 \end{bmatrix}, \]
\[ P^{25} \cong \begin{bmatrix} 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \end{bmatrix}, \qquad P^{26} \cong P^{100} \cong \begin{bmatrix} 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \\ 0.25 & 0.0 & 0.75 & 0.0 \\ 0.0 & 0.75 & 0.0 & 0.25 \end{bmatrix}. \]
We also have
\[ (P - I)^{\mathrm{tr}} = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 1/3 & -1 & 2/3 & 0 \\ 0 & 2/3 & -1 & 1/3 \\ 0 & 0 & 1 & -1 \end{bmatrix}^{\mathrm{tr}} = \begin{bmatrix} -1 & 1/3 & 0 & 0 \\ 1 & -1 & 2/3 & 0 \\ 0 & 2/3 & -1 & 1 \\ 0 & 0 & 1/3 & -1 \end{bmatrix} \]
and
\[ \mathrm{Nul}\big((P - I)^{\mathrm{tr}}\big) = \mathbb R\begin{bmatrix} 1 \\ 3 \\ 3 \\ 1 \end{bmatrix}. \]
Hence if we take $\pi = \frac18\begin{bmatrix} 1 & 3 & 3 & 1 \end{bmatrix}$, then
\[ \pi P = \frac18\begin{bmatrix} 1 & 3 & 3 & 1 \end{bmatrix}\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1/3 & 0 & 2/3 & 0 \\ 0 & 2/3 & 0 & 1/3 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \frac18\begin{bmatrix} 1 & 3 & 3 & 1 \end{bmatrix} = \pi, \]
so $\pi$ is the stationary distribution. Notice that


\[ \frac12\big(P^{25} + P^{26}\big) \cong \begin{bmatrix} 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \\ 0.125 & 0.375 & 0.375 & 0.125 \end{bmatrix} = \begin{bmatrix} \pi \\ \pi \\ \pi \\ \pi \end{bmatrix}. \]
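The pattern $\pi = \frac{1}{2^N}\big[\binom{N}{0}\ \binom{N}{1}\ \dots\ \binom{N}{N}\big]$ suggested by the $N = 3$ computation holds for every $N$. The Python sketch below (my own check, not part of the original notes; the choice $N = 7$ is arbitrary) builds the Ehrenfest matrix for a general $N$ and verifies $\pi P = \pi$ for the binomial$(N, 1/2)$ distribution.

```python
import numpy as np
from math import comb

def ehrenfest_matrix(N):
    """Transition matrix on {0, ..., N}: p(i, i-1) = i/N, p(i, i+1) = (N-i)/N."""
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i > 0:
            P[i, i - 1] = i / N
        if i < N:
            P[i, i + 1] = (N - i) / N
    return P

N = 7
P = ehrenfest_matrix(N)
pi = np.array([comb(N, k) for k in range(N + 1)]) / 2 ** N
print(np.allclose(pi @ P, pi))            # True: binomial(N, 1/2) is stationary
```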

Example 10.15. Let us consider the Markov matrix,
\[ P = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix} \quad (\text{states } 1, 2, 3). \]
In this case we have
\[ P^{25} \cong \begin{bmatrix} 0.3999 & 0.40015 & 0.19995 \\ 0.40002 & 0.3999 & 0.20007 \\ 0.40015 & 0.3999 & 0.19995 \end{bmatrix}, \qquad P^{26} \cong \begin{bmatrix} 0.40002 & 0.3999 & 0.20007 \\ 0.40002 & 0.40002 & 0.19995 \\ 0.3999 & 0.40015 & 0.19995 \end{bmatrix}, \]
\[ P^{100} \cong \begin{bmatrix} 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \\ 0.4 & 0.4 & 0.2 \end{bmatrix}, \]
and observe that
\[ \begin{bmatrix} 0.4 & 0.4 & 0.2 \end{bmatrix}\begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0.4 & 0.4 & 0.2 \end{bmatrix}, \]
so that $\pi = \begin{bmatrix} 0.4 & 0.4 & 0.2 \end{bmatrix}$ is a stationary distribution for $P$.

10.2 Life Time Processes

A computer component has life time $T$, with $P(T = k) = a_k$ for $k \in \mathbb N$. Let $X_n$ denote the age of the component in service at time $n$. The set up is then
\[ [0, T_1] \cup (T_1, T_1 + T_2] \cup (T_1 + T_2, T_1 + T_2 + T_3] \cup \dots, \]
so for example if $(T_1, T_2, T_3, \dots) = (1, 3, 4, \dots)$, then
\[ X_0 = 0,\ X_1 = 0,\ X_2 = 1,\ X_3 = 2,\ X_4 = 0,\ X_5 = 1,\ X_6 = 2,\ X_7 = 3,\ X_8 = 0, \dots. \]
The transition probabilities are then
\[ P(X_{n+1} = 0 \mid X_n = k) = P(T = k+1 \mid T > k) = \frac{a_{k+1}}{\sum_{m > k} a_m}, \]
\[ P(X_{n+1} = k+1 \mid X_n = k) = P(T > k+1 \mid T > k) = \frac{P(T > k+1)}{P(T > k)} = \frac{\sum_{m > k+1} a_m}{\sum_{m > k} a_m} = \frac{\sum_{m > k} a_m - a_{k+1}}{\sum_{m > k} a_m} = 1 - \frac{a_{k+1}}{\sum_{m > k} a_m}. \]
See Exercise IV.2.E6 of Karlin and Taylor for a concrete example involving a chain of this form.

There is another way to look at this same situation, namely let $Y_n$ denote the remaining life of the part in service at time $n$. So if $(T_1, T_2, T_3, \dots) = (1, 3, 4, \dots)$, then
\[ Y_0 = 1,\ Y_1 = 3,\ Y_2 = 2,\ Y_3 = 1,\ Y_4 = 4,\ Y_5 = 3,\ Y_6 = 2,\ Y_7 = 1, \dots, \]
and the corresponding transition matrix is determined by
\[ P(Y_{n+1} = k-1 \mid Y_n = k) = 1 \quad \text{if } k \ge 2, \]
while
\[ P(Y_{n+1} = k \mid Y_n = 1) = P(T = k). \]

Example 10.16 (Exercise IV.2.E6 revisited). Let $Y_n$ denote the remaining life of the part in service at time $n$. So if $(T_1, T_2, T_3, \dots) = (1, 3, 4, \dots)$, then
\[ Y_0 = 1,\ Y_1 = 3,\ Y_2 = 2,\ Y_3 = 1,\ Y_4 = 4,\ Y_5 = 3,\ Y_6 = 2,\ Y_7 = 1, \dots. \]
If
\[ \begin{array}{c|ccccc} k & 0 & 1 & 2 & 3 & 4 \\ \hline P(T = k) & 0 & 0.1 & 0.2 & 0.3 & 0.4 \end{array} \]
the transition matrix is now given by
\[ P = \begin{bmatrix} 1/10 & 1/5 & 3/10 & 2/5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \quad (\text{states } 1, 2, 3, 4) \]


whose invariant distribution is given by
\[ \pi = \frac{1}{30}\begin{bmatrix} 10 & 9 & 7 & 4 \end{bmatrix} = \begin{bmatrix} \frac13 & \frac{3}{10} & \frac{7}{30} & \frac{2}{15} \end{bmatrix}. \]
The failure of a part is indicated by $Y_n$ being 1, and so again the failure frequency is $\frac13$ of the time as found before. Observe that the expected life time of a part is
\[ E[T] = 1 \cdot 0.1 + 2 \cdot 0.2 + 3 \cdot 0.3 + 4 \cdot 0.4 = 3. \]
Thus we see that $\pi_1 = 1/ET$, which is what we should have expected. To go a little further, notice that from the jump diagram in Figure 10.7,

Fig. 10.7. The jump diagram for this "renewal" chain.

one sees that
\[ \begin{array}{c|cccc} k & 1 & 2 & 3 & 4 \\ \hline P_1(R_1 = k) & a_1 = 0.1 & a_2 = 0.2 & a_3 = 0.3 & a_4 = 0.4 \end{array} \]
and therefore,
\[ E_1R_1 = 1a_1 + 2a_2 + 3a_3 + 4a_4 = ET, \]
and hence $\pi_1 = 1/E_1R_1 = 1/ET$ in general for this type of chain.
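The identity $\pi_1 = 1/E[T]$ can be tested for any lifetime distribution by building the remaining–life chain directly; here is a Python sketch (mine, using the lifetime distribution of Example 10.16) that does so.

```python
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4])        # a_k = P(T = k) for k = 1, ..., 4
m = len(a)

# Remaining-life chain: from state 1 jump to k with probability a_k,
# from state k >= 2 move deterministically to k - 1.
P = np.zeros((m, m))
P[0, :] = a
for k in range(1, m):
    P[k, k - 1] = 1.0

w, V = np.linalg.eig(P.T)                 # stationary distribution of the chain
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

ET = np.dot(np.arange(1, m + 1), a)       # E[T] = 3 for these numbers
print(pi)                                 # [1/3, 3/10, 7/30, 2/15]
print(pi[0], 1 / ET)                      # both equal 1/3
```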

10.3 Sampling Plans

This section summarizes the results of Section IV.2.3 of Karlin and Taylor. There one considers a production line where each item manufactured has probability $0 \le p \le 1$ of being defective. Let $i$ and $r$ be two integers and sample the output of the line as follows. We begin by sampling every item until we have found $i$ in a row which are good. Then we sample each of the next items with probability $\frac1r$, determined randomly at the end of production of each item. (If $r = 6$ we might throw a die each time a product comes off the line and sample that product when we roll a 6, say.) If we find a bad part we start the process over again. We now describe this as a Markov chain with states $\{E_k\}_{k=0}^i$, where $E_k$ denotes that we have seen $k$ good parts in a row for $0 \le k < i$, and $E_i$ indicates we are in stage II where we are randomly choosing to sample an item with probability $\frac1r$. The transition probabilities for this chain are given by
\[ P(X_{n+1} = E_{k+1} \mid X_n = E_k) = q := 1 - p \quad \text{for } k = 0, 1, 2, \dots, i-1, \]
\[ P(X_{n+1} = E_0 \mid X_n = E_k) = p \quad \text{if } 0 \le k \le i-1, \]
\[ P(X_{n+1} = E_0 \mid X_n = E_i) = \frac{p}{r} \quad \text{and} \quad P(X_{n+1} = E_i \mid X_n = E_i) = 1 - \frac{p}{r}, \]
with all other transitions being zero. The stationary distribution for this chain satisfies the equations
\[ \pi_k = \sum_{l=0}^{i} P(X_{n+1} = E_k \mid X_n = E_l)\,\pi_l, \]
so that
\[ \pi_0 = \sum_{k=0}^{i-1} p\,\pi_k + \frac{p}{r}\pi_i, \qquad \pi_1 = q\pi_0,\ \pi_2 = q\pi_1,\ \dots,\ \pi_{i-1} = q\pi_{i-2}, \qquad \pi_i = q\pi_{i-1} + \Big(1 - \frac{p}{r}\Big)\pi_i. \]
These equations may be solved (see Section IV.2.3 of Karlin and Taylor) to find in particular that
\[ \pi_k = \frac{p(1-p)^k}{1 + (r-1)(1-p)^i} \quad \text{for } 0 \le k < i \quad \text{and} \quad \pi_i = \frac{r(1-p)^i}{1 + (r-1)(1-p)^i}. \]
See Karlin and Taylor for more comments on this solution.
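A quick numerical sanity check of these formulas (my own sketch; the particular values of $p$, $r$ and $i$ are arbitrary) is to build the chain on $\{E_0, \dots, E_i\}$ and compare its computed stationary distribution with the displayed expressions.

```python
import numpy as np

p, r, i = 0.1, 6, 4                        # defect prob., sampling rate 1/r, run length i
q = 1 - p

# The chain on states E_0, ..., E_i described above.
P = np.zeros((i + 1, i + 1))
for k in range(i):
    P[k, k + 1] = q                        # another good item in a row
    P[k, 0] = p                            # a defective item restarts the count
P[i, 0] = p / r                            # a sampled item is found defective
P[i, i] = 1 - p / r

w, V = np.linalg.eig(P.T)                  # stationary distribution = left 1-eigenvector
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

denom = 1 + (r - 1) * q ** i
formula = np.array([p * q ** k / denom for k in range(i)] + [r * q ** i / denom])
print(np.allclose(pi, formula))            # True
```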

10.4 Extra Homework Problems

Exercises 10.1 – 10.4 refer to the following Markov matrix:
\[ P = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1/2 & 0 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1/4 & 3/4 & 0 \end{bmatrix} \quad (\text{states } 1, \dots, 6). \quad (10.5) \]
We will let $\{X_n\}_{n=0}^\infty$ denote the Markov chain associated to $P$.


Exercise 10.1. Make a jump diagram for this matrix and identify the recurrent and transient classes. Also find the invariant distributions for the chain restricted to each of the recurrent classes.

Exercise 10.2. Find all of the invariant distributions for P.

Exercise 10.3. Compute the hitting probabilities, $h_5 = P_5(\{X_n\} \text{ hits } \{3, 4\})$ and $h_6 = P_6(\{X_n\} \text{ hits } \{3, 4\})$.

Exercise 10.4. Find limn→∞ P6 (Xn = j) for j = 1, 2, 3, 4, 5, 6.


11

Discrete Renewal Theorem (optional reading)

We will follow Feller’s proof of Theorem 3 in [2, Chapter XIII.11, p. 313](The proof starts on p. 335. See Norris [3, Problem 1.8.5 on p. 46] for a Markovchain proof of the discrete renewal theorem.)

Renewal process. Suppose we have a box of identical components, eachnumbered by 1, 2, 3, . . . . Let Ti denote the lifetime of the ith component andassume that Ti∞i=1 are i.i.d. non-negative random variables with values in N sowith probability 1 each component is good when it is put into service. At timezero we put the first component into service, when it fails we immediatelyreplace it by the second, when the second fails we immediately replace it bythe third, and so on. Based on this scenario we make the following definition.

Definition 11.1 (Renewal Process). Let $\{T_k\}_{k=1}^\infty$ be $\mathbb N$–valued i.i.d. random variables and $\mu := ET_1 > 0$ with $\mu = \infty$ being an allowed value. Further let
\[ W_n := T_1 + T_2 + \dots + T_n = \sum_{i=1}^n T_i \]
be the time of the $n^{\text{th}}$ "renewal." The renewal process is the counting process defined by
\[ N(t) = \#\{n : W_n \le t\} = \max\{n : W_n \le t\} \quad \text{for } t \in \mathbb N. \]
We also let $Z_n$ denote the event that a renewal occurs at time $n$, i.e.
\[ Z_n = \{W_k = n \text{ for some } k \le n\} = \cup_{k \le n}\{W_k = n\}. \]

So N (t) counts the number of renewals which have occurred at time t orless. The random variable, Wn, is the time of the nth renewal whereas Tn isthe time between the (n− 1)

thand the nth renewals. Since the inter-renewal

times, Tn∞n=1, are i.i.d., the process probabilistically restarts at each renewal.

Example 11.2 (Markov Chain). Suppose that Xn∞n=0 is a recurrent Markovchain on some state space S. Suppose the chain starts at some site, x ∈ S, andlet Tk∞k=1 be the subsequent return times for the chain to x. It follows by thestrong Markov property, that Tk∞k=1 are i.i.d. random variables. In this caseN (t) counts the number of returns to x before or equal to time t. This examplehas a analogue for continuous time Markov chains as well.

Our goal in this chapter is to show (Theorem 11.5), under appropriate hy-pothesis, that the following reasonable looking result is in fact true;

limn→∞

P (renewal occurs at time n) = limn→∞

P (Zn) =1

µ. (11.1)

The point is that we should expect to replace parts every µ = ET1 – units oftime on average which his consistent with the previous equation. In fact, it isnot hard to see that Eq. (11.1) can not hold in general. For example, supposethat P (Tk = 2) = 1 for all k. Then Wn = 2n for all n ∈ N and hence P (Zn) = 1if n is even while P (Zn) = 0 when n is odd. Thus in order for Eq. (11.1) tohold we will need to put some assumption on the distribution of the Tn∞n=1

to avoid this “periodic” type behavior – these consideration appear in Lemma11.6 below.

Lemma 11.3 (Renewal Equation). Let fn := P (T = n) for all n ∈ N (weassume that f0 = P (T = 0) = 0) and

un := P (Zn) = P (the part is new at time n)

so that u0 = 1. Then un∞n=0 and fn∞n=1 are related by the renewal equa-tion,

un =

n∑k=1

fkun−k =

n∑k=0

fkun−k. (11.2)

Proof. Conditioning on T1 we find,

un = P (Zn) =

∞∑k=1

P (Zn|T1 = k)P (T1 = k)

=∑k≤n

P (Zn|T1 = k) fk =∑k≤n

P (Zn−k) fk =∑k≤n

un−kfk,

wherein we have used Zn (T1, T2, . . . ) on the set T1 = k ≤ n is equal to

Zn−k (T2, T3, . . . )d= Zn−k (T1, T2, . . . ) .

We can determine un from Eq. (11.2). For example,


\[ u_1 = f_1u_0 = f_1, \quad (11.3) \]
\[ u_2 = f_1u_1 + f_2u_0 = f_1^2 + f_2, \quad (11.4) \]
\[ u_3 = f_1u_2 + f_2u_1 + f_3u_0 = f_1^3 + 2f_1f_2 + f_3. \quad (11.5) \]
Since
\[ 0 \le u_n \le \sum_{i=1}^n f_i \cdot \max_{0 \le i < n} u_i \le \max_{0 \le i < n} u_i, \]
a simple induction argument shows that $0 \le u_n \le 1$ for all $n$, as should be the case since $u_n = P(Z_n)$.
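The recursion (11.2) is also easy to iterate numerically, and doing so already hints at the renewal theorem (Theorem 11.5 below): $u_n \to 1/\mu$. A small Python sketch (mine, not part of the original notes), using the lifetime distribution from Example 10.16:

```python
# Iterate the renewal equation u_n = sum_{k=1}^n f_k u_{n-k} with u_0 = 1.
f = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}       # f_k = P(T = k)
mu = sum(k * p for k, p in f.items())       # mu = E[T] = 3 here

u = [1.0]
for n in range(1, 41):
    u.append(sum(f.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))

print(u[-1], 1 / mu)                        # both close to 1/3
```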

Conversely given 0 ≤ un ≤ 1 with u0 = 1, there is a unique sequence fi∞i=1

such that

un =

n∑i=1

fiun−i for all n ∈ N.

Indeed, from the Eq. (11.3) we find f1 = u1 and then we may recover the fnrecursively using

fn = un −n−1∑i=1

fiun−i

which is just a rewrite of Eq. (11.2).It is useful to extend un to all n ∈ Z by setting un = 0 if n < 0. With this

definition we may rewrite Eq. (11.2) as

u0 = 1 and un =

∞∑i=1

fiun−i if n 6= 0. (11.6)

Let us now define (θνu)n = uν+n for all n ∈ Z in which case we may reformulateEq. (11.6) as,

(θνu)−ν = 1 and (θνu)n =

∞∑i=1

fi (θνu)n−i if n 6= −ν. (11.7)

The following lemma gives another key reformulation of Eq. (11.2).

Lemma 11.4. Let

ρk := P (T > k) = P (T ≥ k + 1) =

∞∑m=k+1

fm for k ∈ N0

(ρ0 = 1) . The sequence, un∞n=0 determined by Eq. (11.2) satisfies,

ρNu0 + ρN−1u1 + · · ·+ ρ1uN−1 + ρ0uN = 1 for all N ∈ N0. (11.8)

and conversely, Eq. (11.8) along determines the same sequence as in Eq. (11.2).

Proof. The ideal of the proof is that Eq. (11.8) results from summing Eq.(11.2) over 1 ≤ n ≤ N. Since

u1 = f1u0,

u2 = f1u1 + f2u0,

u3 = f1u2 + f2u1 + f3u0

...

uN = f1uN−1 + f2uN−2 . . . fN−1u1 + fNu0

we see that

u1 + · · ·+ uN = (f1 + · · ·+ fN )u0 + (f1 + · · ·+ fN−1)u1 + · · ·+ f1uN−1

= 1− ρN + (1− ρN−1)u1 + (1− ρN−2)u2 · · ·+ (1− ρ1)uN−1.

Moving all terms except for the lone 1 on the right side to the left of thisequation proves Eq. (11.8).

Conversely, if Eq. (11.8) holds, then

u1 = 1− ρ1 = f1,

u1 + u2 = (1− ρ2) + (1− ρ1)u1 = f1 + f2 + f1u1

u1 + u2 + u3 = f1 + f2 + f3 + (f1 + f2)u1 + f1u2,

and so subtracting the first from the second and the second from the thirdequations shows,

u2 = f2 + f1u1 = f2u0 + f1u1 and

u3 = f3 + f2u1 + f1u2 = f3u0 + f2u1 + f1u2.

We leave the general argument to the reader.Recall from Lemma 7.22 that

µ := ET =

∞∑k=0

P (T > k) =

∞∑k=0

ρk. (11.9)

Theorem 11.5 (Discrete Renewal Theorem). Suppose that T is a N valuedrandom variable with P (T = n) = fn for all n ∈ N and µ = ET Further supposethat gcd (n : fn 6= 0) = 1 and un is the sequence of number determined byu0 = 1 and then recursively by Eq. (11.6). Then limn→∞ un = 1

µ where µ is

given in Eq. (11.9).

Proof. We begin by showing, if η := limn→∞ un exists, then η = 1µ . As a

first attempt one might try to let n → ∞ in Eq. (11.2). However this tacticgives no information since, using DCT for sums,


η = limn→∞

un = limn→∞

∞∑i=1

fiun−i1i≤n =

∞∑i=1

fi limn→∞

[un−i1i≤n] =

∞∑i=1

fi · η = η.

Let us instead pass to the limit in Eq. (11.8),

1 =

N∑n=0

ρN−nun =

N∑n=0

ρnuN−n =

∞∑n=0

ρnuN−n1n≤N . (11.10)

It follows for any M ∈ N that

1 = limN→∞

(M∑n=0

ρnuN−n +

N∑n=M+1

ρnuN−n

)

≥ ηM∑n=0

ρn +

N∑n=M+1

ρn ≥ ηM∑n=0

ρn. (11.11)

Letting M →∞ in the last inequality shows 1 ≥ η ·µ and so if µ =∞ we musthave η = 0. On the other hand if µ < ∞, we may apply the DCT to the rightmember in Eq. (11.10) to learn,

1 = limN→∞

∞∑n=0

ρnuN−n1n≤N =

∞∑n=0

ρn limN→∞

[uN−n1n≤N ] =

∞∑n=0

ρnη = µη.

We now continue on with the formal proof. Now let η := lim supn→∞ un. Ifit happens that η = 0, then limn→∞ un = η exists and by what we have justproved η = 1

µ and we must have µ =∞ in this case. So now suppose that η > 0and choose ν1 < ν2 < . . . such that

limk→∞

uνk = η := lim supn→∞

un.

By passing to a subsequence if necessary we may assume1 that

wn = limk→∞

(θνku)n = limk→∞

uνk+n

exists for all n ∈ Z. Furthermore η = w0 and 0 ≤ wn ≤ η for all n ∈ Z asfollows from the definition of the limsup. Passing to the limit as k →∞ in Eq.(11.7) with ν = νk implies,

wn = limk→∞

(θνku)n = limk→∞

∞∑i=1

fi (θνku)n−i =

∞∑i=1

fiwn−i for all n ∈ Z.

1 This is proved using Cantor’s diagonalization argument and reflects a simple versionof Tychonoff’s theorem, namely [0, 1]Z is a compact metrizable space.

An application of Lemma 11.6 below applied to wn/η shows wn = η for all n,i.e.

limk→∞

uνk+n = η for all n ∈ Z.

From the variant of Eq. (11.11) with N = νk ↑ ∞ shows

1 = limk→∞

(M∑n=0

ρnu νk−n +

νk∑n=M+1

ρnu νk−n

)≥ η

M∑n=0

ρn → ηµ

and hence we may conclude that µ ≤ 1/η < ∞. Knowing this we may pass tothe limit in Eq. (11.10) with N = νk to learn again that µ · η = 1, i.e. η = 1/µ.So to finish the proof we must show that η0 := lim infn→∞ un = η and for thissuffices to show η ≤ η0 which we will now prove.

Choose a subsequence (αk) of N such that

limk→∞

uαk = lim infn→∞

un = η0.

Taking Eq. (11.8) with N = αk shows,

1 =

αk∑l=0

ρluαk−l =

n∑l=0

ρluαk−l +

αk∑l=n+1

ρluαk−l

for any n ≤ αk. Passing the limit as k →∞ in this identity implies

1 ≤n∑l=0

ρl lim supk→∞

uαk−l + lim supk→∞

[αk∑

l=n+1

ρluαk−l

]≤ ρ0η0 +

n∑l=1

ρlη +

∞∑l=n+1

ρl.

Finally letting n→∞ in this last inequality shows

1 ≤ η0 + (µ− 1) η = η0 − η + 1 =⇒ η ≤ η0.

Lemma 11.6. Suppose that wnn∈Z ⊂ [0, 1] with w0 = 1 and

wn =

∞∑k=1

fkwn−k for all n ∈ Z. (11.12)

If gcd (A) = 1 where A := k : fk > 0 , then wn = 1 for all n.

Proof. First suppose that fk > 0 for all k. Taking Eq. (11.12) with n = 0shows,

1 = w0 =

∞∑k=1

fkw−k ≤∞∑k=1

fk = 1 (11.13)


and therefore fkw−k = fk for all k ≥ 1 and w−k for all k ≥ 1. Thus we haveshown wn = 1 for n ≤ 0. Using this fact it follows that

w1 =∞∑k=1

fkw1−k =∞∑k=1

fk = 1,

w2 =

∞∑k=1

fkw2−k =

∞∑k=1

fk = 1, etc.

Now to the general case.From Eq. (11.13) we may conclude that w−k = 1 for all k ∈ A. If a ∈ A we

find,

1 = w−a =

∞∑k=1

fkw−a−k ≤∞∑k=1

fk = 1

and therefore w−a−b = 1 for all a, b ∈ A.This then implies that w−m = 1 forall m ∈ A+ where A+ is the positive linear combinations of elements in A. Thenumber theoretic lemmas shows that A+ contains all n ≥ N. Thus we haveshown that w−n = 1 for all n ≥ N. Now using,

w−N+1 =

∞∑k=1

fkw−N+1−k =

∞∑k=1

fk = 1 and then

w−N+2 =

∞∑k=1

fkw−N+2−k =

∞∑k=1

fk = 1, etc.

we may show inductively that wn = 1 for all n.Let us end with what happens in the periodic case.

Lemma 11.7. Let un∞n=0 and fn∞n=1 be a pair of sequences related by Eq.(11.2) and assume that u0 = 1. Then

a = gcd (n ≥ 1 : fn > 0) = gcd (n ≥ 1 : un > 0) = b.

Proof. Let A := n ≥ 1 : fn > 0 and B := n ≥ 1 : un > 0 so that a =gcd (A) and b = gcd (B) . We know that fn = 0 if a does not divide n. So ifa > 1, we have

un =

n∑k=1

fkun−k =

n∑k=1

0un−k = 0 for 1 ≤ n < a,

and hence

ua+n =

a+n∑k=1

fkua+n−k = faun = fa · 0 = 0 for 1 ≤ n < a.

Similarly,

u2a+n =

2a+n∑k=1

fku2a+n−k = faua+n + f2aun = 0 for 1 ≤ n < a.

Continuing this way inductively shows that uka+n = 0 for all k ∈ N and 1 ≤n < a. So if un > 0 then we must have a|n, i.e. a is a divisor of B and thereforea|b.

Similarly, we know that un = 0 if b does not divide n. Therefore for 1 ≤ n <b,

fn = un −n−1∑i=1

fiun−i = 0 and then

fb+n = ub+n −n+b−1∑i=1

fiun+b−i = ub+n − fbun = 0.

Similarly,

f2b+n = u2b+n −n+2b−1∑i=1

fiun+2b−i = u2b+n − fbun+b − f2bun = 0,

and therefore by induction one may show that fkb+n = 0 for all k ∈ N and1 ≤ n < b. Thus if fn > 0 we must have b|n, i.e. b is a divisor of A and thereforeb|a.Theorem 11.8. Suppose that T is a N valued random variable withP (T = n) = fn for all n ∈ N and µ = ET. Let d := gcd (n : fn 6= 0)and un be the sequence of number determined by u0 = 1 and then recursivelyby Eq. (11.6). Then un = 0 unless d|n and limn→∞ udn = d

µ where µ is given

in Eq. (11.9).

Proof. The assertion that un = 0 unless d|n is a consequence of Lemma11.7. Now let Un := und and Fn = fnd, then

Un = und =

nd∑k=1

fkund−k =

n∑k=1

fkdund−kd =

n∑k=1

FkUn−k

where gcd (n : Fn 6= 0) = 1. Thus we may apply Theorem 11.5 in order tolearn,

limn→∞

und = limn→∞

Un =1

µ0

where

µ0 =

∞∑k=1

kFk =

∞∑k=1

kfkd =1

d

∞∑k=1

kdfkd =1

d

∞∑n=1

nfn =µ

d.


Part II

Continuous Time Processes


12

Continuous Distributions

In this short chapter I will gather some facts about continuous distributionswe will be needing for the rest of the course. Recall a random vector X :=(X1, . . . , Xk) has a continuous distribution if there is a non-negative functionρX (x) such that

E [f (X1, . . . , Xk)] =

∫Rkf (x1, . . . , xk) ρ (x1, . . . , xk) dx1 . . . dxk

for all f ≥ 0 or f bounded onRk. We will often abbreviate the above equationas

E [f (X)] =

∫Rkf (x) ρ (x) dx or by

P (X ∈ [x,x + dx]) = ρ (x) dx.

Two random vectors X and Y with continuous distributions ρX and ρY areindependent if ρ(X,Y) (x,y) = ρX (x) ρY (y) , i.e.

E [f (X,Y)] =

∫Rk+l

f (x,y) ρX (x) ρY (y) dxdy.

In order to deal with all of these multiple integrals we will frequently use thefollowing two fundamental theorems which will be stated without proof. (Wenow drop the arrows from vectors, it will be up to you to understand where xand y live from the context of the situation at hand.)

Theorem 12.1 (Tonelli’s theorem). If f : Rk × Rl → R+, then∫Rkdx

∫Rldyf (x, y) =

∫Rldy

∫Rkdxf (x, y) (with ∞ being allowed).

Theorem 12.2 (Fubini’s theorem). If f : Rk × Rl → R is a function suchthat ∫

Rkdx

∫Rldy |f (x, y)| =

∫Rldy

∫Rkdx |f (x, y)| <∞,

then ∫Rkdx

∫Rldyf (x, y) =

∫Rldy

∫Rkdxf (x, y) .

This theorems lead to the following very useful proposition which is a specialcase of conditional expectation in this setting.

Proposition 12.3. Suppose that X is an Rk – valued random variable, Y isan Rl – valued random variable independent of X, and f : Rk × Rl → R+ then(assuming X and Y have continuous distributions),

E [f (X,Y )] =

∫Rk

E [f (x, Y )] ρX (x) dx.

and similarly,

E [f (X,Y )] =

∫Rl

E [f (X, y)] ρY (y) dy

Proof. Independence implies that

ρ(X,Y ) (x, y) = ρX (x) ρY (y) .

Therefore,

E [f (X,Y )] =

∫Rk×Rl

f (x, y) ρX (x) ρY (y) dxdy

=

∫Rk

[∫Rldyf (x, y) ρY (y)

]ρX (x) dx

=

∫Rk

E [f (x, Y )] ρX (x) dx.

Here is another very useful theorem.

Theorem 12.4. Suppose that X,Y are independent random variables withcontinuous distributions then X+Y also has a continuous distribution ρX+Y =ρX ∗ ρY where

(ρX ∗ ρY ) (z) =

∫RρX (z − y) ρY (y) dy.

Moreover, if X,Y ≥ 0 (i.e. ρX (x) = 0 = ρY (y) for x, y ≤ 0, then

(ρX ∗ ρY ) (z) =

∫ z0ρX (z − y) ρY (y) dy if z ≥ 0

0 if z < 0.


Proof. From Proposition 12.3,

E [f (X + Y )] =

∫RE [f (X + y)] ρY (y) dy

where

E [f (X + y)] =

∫Rf (x+ y) ρX (x) dx =

∫Rf (z) ρX (z − y) dz.

(We made the change of variables, z = x+ y, i.e. x = z − y.) Therefore,

E [f (X + Y )] =

∫R

(∫Rf (z) ρX (z − y) dz

)ρY (y) dy

=

∫Rf (z)

(∫RρX (z − y) ρY (y) dy

)dz

=

∫Rf (z) (ρX ∗ ρY ) (z) dz.

Alternatively and more directly;

E [f (X + Y )] =

∫R2

f (x+ y) ρX (x) ρY (y) dx dy.

Making the change of variables z = x+ y, i.e. x = z − y shows,

E [f (X + Y )] =

∫R2

f (z) ρX (z − y) ρY (y) dydz

=

∫Rf (z)

[∫RρX (z − y) ρY (y) dy

]dz.

If X,Y ≥ 0 then

(ρX ∗ ρY ) (z) =

∫ ∞0

ρX (z − y) ρY (y) dy =

∫ ∞0

1z≥yρX (z − y) ρY (y) dy

=

∫ z0ρX (z − y) ρY (y) dy if z ≥ 0

0 if z < 0.

Definition 12.5 (Gamma Distribution). Let k, θ > 0, we say a positiverandom variable, X, has the Gamma(k, θ) – distribution (abbreviated by writing

Xd=Gamma(k, θ)), if

P (X ∈ [x, x+ dx]) = xk−1 e−x/θ

θkΓ (k)1x>0dx

where

Γ (k) :=

∫ ∞0

tk−1e−tdt for all k > 0.

When k = 1 and θ = 1/λ we say that X is exponentially distributed and write

Xd= E (λ) , i.e.

P (X ∈ [x, x+ dx]) = λ1

Γ (1)e−x/θ1x>0dx = λe−x/θ1x>0dx.

Let us observe that Γ (k) has been chosen to give the correct normalizationsince,

1

θk

∫ ∞0

xk−1e−x/θdx =

∫ ∞0

yk−1e−ydy = Γ (k) ,

wherein we made the change of variables, x = θy.

Example 12.6 (Gamma Distribution Sums). In this example we will show

Gamma(k, θ)⊥⊥+ Gamma(l, θ)

d=Gamma(k + l, θ) , i.e. X

d=Gamma(k, θ)

and Yd=Gamma(l, θ) , and X and Y are independent, then

X + Yd=Gamma(k + l, θ) . We will do this by two methods.

Method 1. It suffices to compute the convolution,∫ z

0

(z − y)k−1 e

−(z−y)/θ

θkΓ (k)yl−1 e

−y/θ

θlΓ (l)dy =

1

θk+lΓ (k)Γ (l)e−z/θ

∫ z

0

(z − y)k−1

yl−1dy.

Making the change of variables, y = tz, in the above integral shows,∫ z

0

(z − y)k−1

yl−1dy =

∫ 1

0

(z − tz)k−1(tz)

l−1zdt = zk+l−1

∫ 1

0

(1− t)k−1tl−1dt.

Putting this together shows,

ρX ∗ ρY (z) =zk+l−1

θk+lΓ (k)Γ (l)e−z/θ ·

∫ 1

0

(1− t)k−1tl−1dt.

As the latter is a probability distribution, we must have

1

Γ (k)Γ (l)

∫ 1

0

(1− t)k−1tl−1dt =

1

Γ (k + l).

an identity which may also be verified directly, see Lemma 12.7 below.Method 2. In Exercise 12.1 below, you are asked to show

E[etX]

= (1− θt)−k for t < θ−1.


Using this remark and the independence of $X$ and $Y$ it follows that
\[ E\big[e^{t(X+Y)}\big] = E\big[e^{tX}e^{tY}\big] = E\big[e^{tX}\big]E\big[e^{tY}\big] = (1 - \theta t)^{-k}(1 - \theta t)^{-l} = (1 - \theta t)^{-(k+l)} \]
and therefore $X + Y \overset{d}{=} \mathrm{Gamma}(k+l, \theta)$.
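As a numerical sanity check of this identity (my own sketch, not part of the notes; the parameter values are arbitrary), one may simulate independent Gamma$(k, \theta)$ and Gamma$(l, \theta)$ samples and compare a histogram of their sum with the Gamma$(k+l, \theta)$ density of Definition 12.5.

```python
import numpy as np
from math import gamma as Gamma

rng = np.random.default_rng(1)
k, l, theta, n = 2.5, 1.5, 2.0, 200_000

# numpy's gamma sampler uses the same (shape, scale) parametrization.
z = rng.gamma(k, theta, n) + rng.gamma(l, theta, n)

# Gamma(k + l, theta) density from Definition 12.5.
t = np.linspace(0.01, 30, 300)
dens = t ** (k + l - 1) * np.exp(-t / theta) / (theta ** (k + l) * Gamma(k + l))

# Compare a histogram of X + Y against that density.
hist, edges = np.histogram(z, bins=t, density=True)
mid = 0.5 * (edges[1:] + edges[:-1])
print(np.max(np.abs(hist - np.interp(mid, t, dens))))  # small: sampling error only
```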

Exercise 12.1 (Gamma Distribution Moments). If Xd=Gamma(k, θ) ,

show X has moment generating function, MX (t) := E[etX], given by

MX (t) := E[etX]

= (1− θt)−k for t < θ−1.

Differentiate your result in t to show

E [Xm] = k (k + 1) . . . (k +m− 1) θm for all m ∈ N0.

In particular, E [X] = kθ and Var (X) = kθ2. (Notice that when k = 1 and

θ = λ−1, Xd= E (λ) .)

Solution to Exercise (12.1). For t < θ−1

E[etX]

=

∫ ∞0

f (x; k, θ) etxdx =1

θkΓ (k)

∫ ∞0

etxxk−1e−x/θdx

=1

θkΓ (k)

∫ ∞0

xk−1e−x(θ−1−t)dx

=1

θkΓ (k)

∫ ∞0

xke−x(θ−1−t) dx

x

and hence making the change of variables, y = x(θ−1 − t

)in this integral

implies

E[etX]

=1

θkΓ (k)

(θ−1 − t

)−k ∫ ∞0

yke−ydy

y= (1− θt)−k .

For the second part observe that

d

dt(1− θt)−k = kθ (1− θt)−(k+1)

and hence by induction,(d

dt

)m(1− θt)−k = k (k + 1) . . . (k +m− 1) θm (1− θt)−(k+m)

.

Therefore,

E [Xm] =

(d

dt

)m|t=0E

[etX]

=

(d

dt

)m|t=0 (1− θt)−k = k (k + 1) . . . (k +m− 1) θm

and in particular,

E [X] = kθ, E[X2]

= k (k + 1) θ2 and Var (X) = kθ2.

Lemma 12.7 (Beta function (may be skipped!)). Let

B (x, y) :=

∫ 1

0

tx−1 (1− t)y−1dt for Rex,Re y > 0. (12.1)

Then

B (x, y) =Γ (x)Γ (y)

Γ (x+ y).

Proof. Let u = t1−t so that t = u (1− t) or equivalently, t = u

1+u and

1− t = 11+u and dt = (1 + u)

−2du.

B (x, y) =

∫ ∞0

(u

1 + u

)x−1(1

1 + u

)y−1(1

1 + u

)2

du

=

∫ ∞0

ux−1

(1

1 + u

)x+y

du.

Recalling that

Γ (z) :=

∫ ∞0

e−ttzdt

t.

We find ∫ ∞0

e−λttzdt

t=

∫ ∞0

e−t(t

λ

)zdt

t=

1

λzΓ (z) ,

i.e.1

λz=

1

Γ (z)

∫ ∞0

e−λttzdt

t.

Taking λ = (1 + u) and z = x+ y shows

B (x, y) =

∫ ∞0

ux−1 1

Γ (x+ y)

∫ ∞0

e−(1+u)ttx+y dt

tdu

=1

Γ (x+ y)

∫ ∞0

dt

t

x

e−ttx+y

∫ ∞0

du

uuxe−ut

=1

Γ (x+ y)

∫ ∞0

dt

t

x

e−ttx+y Γ (x)

tx

=Γ (x)

Γ (x+ y)

∫ ∞0

dt

t

x

e−tty =Γ (x)Γ (y)

Γ (x+ y).


Fig. 12.1. Plot of $t/(1-t)$.


13

Exponential Random Variables

Recall the geometric distribution of parameter $p$ is the distribution of the discrete random variable $T = \min\{n : X_n = 1\}$, where $\{X_n\}_{n=1}^\infty$ are i.i.d. Bernoulli random variables such that $P(X_n = 1) = p = 1 - P(X_n = 0)$. The distribution of $T$ is given by
\[ P(T = k) = q^{k-1}p \]
and observe that
\[ P(T > k) = p\sum_{l=k+1}^\infty q^{l-1} = pq^k\frac{1}{1-q} = q^k. \]
Recall that $T$ is forgetful in the sense that
\[ P(T = k+n \mid T > k) = \frac{q^{k+n-1}p}{q^k} = q^{n-1}p = P(T = n) \quad \text{for } n = 1, 2, 3, \dots, \]
which represents the fact that knowing a success has not occurred by time $k$ in no way helps you guess when the next success will occur.

Given N ∈ N observe that

P

(1

NTp > t

)= P (Tp > tN) = q[tN ] = (1− p)[tN ]

.

Thus if we let p = λ/N for some λ > 0 and then let N →∞ we find,

limN→∞

P

(1

NTλ/N > t

)= limN→∞

(1− λ

N

)[tN ]

= exp

(limN→∞

(tN + εN (t)) ln

(1− λ

N

))= e−λt.

This leads us to the exponential distribution.

Definition 13.1. A random variable T ≥ 0 is said to be exponential withparameter λ ∈ [0,∞) provided, P (T > t) = e−λt for all t ≥ 0. We will write

Td= E (λ) for short.

Recall if P (T > t) = F (t) , then

P (a < T ≤ b) = P (T > a)− P (T > b) = F (a)− F (b) =

∫ b

a

(−F ′ (t)) dt

so thatP (T ∈ (t, t+ dt)) = −F ′ (t) dt.

In particular if Td= E (λ) , then

P (T ∈ (t, t+ dt)) = λ1t≥0e−λtdt

or in other words,

E [f (T )] =

∫ ∞0

f (t)λe−λtdt for “all” f : [0,∞)→ R. (13.1)

Proposition 13.2. Suppose that Td= E (λ) , then

ET k = k!λ−k for all k ∈ N.

In particular,

ET =1

λand Var (T ) = 2λ−2 − λ−2 = λ−2. (13.2)

Proof. By differentiating under the integral sign we find,

ET =

∫ ∞0

τλe−λτdτ = λ

(− d

)∫ ∞0

e−λτdτ = λ

(− d

)λ−1 = λ−1.

More generally, repeating this procedure shows,

ET k =

∫ ∞0

τke−λτλdτ = λ

(− d

)k ∫ ∞0

e−λτdτ = λ

(− d

)kλ−1 = k!λ−k.

(13.3)Alternatively we may compute the moment generating function for T,

MT (a) := E[eaT]

=

∫ ∞0

eaτλe−λτdτ

=

∫ ∞0

eaτλe−λτdτ =λ

λ− a=

1

1− aλ−1(13.4)


which is valid for a < λ. On the other hand, we know that

E[eaT]

=

∞∑n=0

an

n!E [Tn] for |a| < λ. (13.5)

Comparing this with Eq. (13.4) again shows that Eq. (13.3) is valid.Here is yet another way to understand and generalize Eq. (13.4). We simply

make the change of variables, u = λτ in the integral in Eq. (13.3) to learn,

ET k = λ−k∫ ∞

0

uke−udu = λ−kΓ (k + 1) .

This last equation is valid for all k ∈ (−1,∞) – in particular k need not be aninteger.

Theorem 13.3 (Memoryless property). A random variable, T ∈ (0,∞] hasan exponential distribution iff it satisfies the memoryless property:

P (T > s+ t|T > s) = P (T > t) for all s, t ≥ 0,

where as usual, P (A|B) := P (A ∩B) /P (B) when p (B) > 0. (Note that Td=

E (0) means that P (T > t) = e0t = 1 for all t > 0 and therefore that T = ∞a.s.)

Proof. (The following proof is taken from [3].) Suppose first that Td= E (λ)

for some λ > 0. Then

P (T > s+ t|T > s) =P (T > s+ t)

P (T > s)=e−λ(s+t)

e−λs= e−λt = P (T > t) .

For the converse, let g (t) := P (T > t) , then by assumption,

g (t+ s)

g (s)= P (T > s+ t|T > s) = P (T > t) = g (t)

whenever g (s) 6= 0 and g (t) is a decreasing function. Therefore if g (s) = 0 forsome s > 0 then g (t) = 0 for all t > s. Thus it follows that

g (t+ s) = g (t) g (s) for all s, t ≥ 0.

Since T > 0, we know that g (1/n) = P (T > 1/n) > 0 for some n andtherefore, g (1) = g (1/n)

n> 0 and we may write g (1) = e−λ for some 0 ≤ λ <

∞.Observe for p, q ∈ N, g (p/q) = g (1/q)

pand taking p = q then shows,

e−λ = g (1) = g (1/q)q. Therefore, g (p/q) = e−λp/q so that g (t) = e−λt for all

t ∈ Q+ := Q ∩ R+. Given r, s ∈ Q+ and t ∈ R such that r ≤ t ≤ s we have,since g is decreasing, that

e−λr = g (r) ≥ g (t) ≥ g (s) = e−λs.

Hence letting s ↑ t and r ↓ t in the above equations shows that g (t) = e−λt for

all t ∈ R+ and therefore Td= E (λ) .

Theorem 13.4. Let I be a countable set and let Tkk∈I be independent ran-dom variables such that Tk ∼ E (λk) with λ :=

∑k∈I λk ∈ (0,∞) . Let

T := infk Tk and let K = k on the set where Tj > Tk for all j 6= k. On thecomplement of all these sets, define K = ∗ where ∗ is some point not in I. ThenP (K = ∗) = 0, K and T are independent, T ∼ E (λ) , and P (K = k) = λk/λ.

Proof. Let us first suppose that I = 1, 2, . . . , n is a finite set. Then, usingProposition 0.1 we find,

P (K = k, T > t) = P (∩j 6=k Tj > Tk , Tk > t)

=

∫ ∞t

dtkλke−λktkP (∩j 6=k Tj > tk)

=

∫ ∞t

dtkλke−λktk ·

∏j 6=k

P (Tj > tk)

=

∫ ∞t

dtkλke−λktk ·

∏j 6=k

e−λjtk

=

∫ ∞t

dτλke−λτ =

λkλe−λt. (13.6)

Summing this equation on k shows P (T > t) = e−λt, i.e. Td= E (λ) . In partic-

ular P (T > 0) = 1 so taking t = 0 in Eq. (13.6) shows P (K = k) = λkλ . Using

these remarks we may rewrite Eq. (13.6) as

P (K = k, T > t) = P (K = k)P (T > t)

which is the required independence of K and T.Case if # (I) = ∞. Let k ∈ I and t ∈ R+ and Λn be a sequence of finite

subsets of I such that k ∈ Λn ↑ I as n ↑ ∞ and let λ (n) :=∑i∈Λn λi. Using

what we have just proved,

P (K = k, T > t) = P (∩j 6=k Tj > Tk , Tk > t) = limn→∞

P (∩j∈Λn Tj > Tk , Tk > t)

= limn→∞

λkλ (n)

e−λ(n)t =λkλe−λt. (13.7)


Summing this equation on k shows P (K 6= ∗, T > t) = e−λt and taking t = 0implies P (K 6= ∗, T > 0) = 1 which implies P (K 6= ∗) = 1 = P (T > 0) , i.e.P (K = ∗) = 0 or P (K ∈ I) = 1. The remaining results now follow exactly asin the case where # (I) <∞.
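Theorem 13.4 (in the finite case) is also easy to test by simulation. The Python sketch below (mine, not part of the original notes; the rates are arbitrary) draws the clocks, records which one rings first and when, and compares the empirical answers with $\lambda_k/\lambda$ and $1/\lambda$; the last line illustrates the independence of $K$ and $T$ by checking that the mean of $T$ barely depends on which clock rang first.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([0.5, 1.0, 2.5])                       # the rates lambda_k
n = 200_000

T_k = rng.exponential(1 / lam, size=(n, len(lam)))    # clock k has mean 1 / lambda_k
K = T_k.argmin(axis=1)                                # which clock rings first
T = T_k.min(axis=1)                                   # time of the first ring

print(np.bincount(K) / n, lam / lam.sum())            # P(K = k) ~ lambda_k / lambda
print(T.mean(), 1 / lam.sum())                        # E[T] ~ 1 / lambda
# Independence of K and T: the conditional means of T are all close to 1 / lambda.
print([T[K == k].mean() for k in range(len(lam))])
```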

Remark 13.5. This is a rather interesting result which says that given knowledgeof which clock rings in no way helps you predict when it rang!! So imagine thatyou are sitting at bus stop which is served by many different bus lines whichfor reasons of traffic lights and other delays come at exponential times withrates λi independently on one another. You are an experienced rider so youknow these rates and hence you know that you expect to wait 1

λ – amount oftime where λ = λ1 + · · ·+ λn. You don’t know for sure which bus you will geton as any of them could come first. However, on average you will certainly beriding the busses for which λi are largest. On the other hand, knowing whichbus has just come does impact which bus you are likely to see next and howlong you are likely to wait for that bus. One might guess that if we are givethat bus #1 is the first to arrive, then the further waiting time for the secondbus is E (λ2 + · · ·+ λn) and the probability that the second bus will be #j isλj/ (λ2 + · · ·+ λn) .

(You might also think in terms of a fluid analogy here as well. Imagine asteady unit speed wind hitting your face. Assume the air contains n – types ofparticle labeled by i = 1, 2, . . . , n and the density of the ith – practice is λi for1 ≤ i ≤ n. Then the rate that some particle is hitting your face is the sum ofall of the rates, λ = λ1 + · · · + λn while the likely hood that at any time youwill be hit by a particle of the ith – type is λi

λ .)

Corollary 13.6. Let Tknk=1 be independent random variables such that Tk ∼E (λk) with λ :=

∑nk=1 λk. Further let T := mink Tk, K1 = k1 and K2 =

k2 6= k1 on the set where Tj > Tk2 > Tk1 for all j /∈ k1, k2 , and S =∑nk=1 1K1=k minj 6=k (Tj − Tk) . So T is the time the first clock rings, S is the

time between the first and second ring, K1 is the number of the first clock toring and K2 is the number of the second clock to ring. Then

P (K1 = k1) =λk1λ, P (K2 = k2|K1 = k1) =

λk2λ− λk1

for k2 6= k1

and given K1 = k1,K2 = k2 , Td= E (λ) and S

d= E (λ− λk1) , i.e.

Td= E (λ) and S

d= E (λ− λk1) relative to P (·|K1 = k1,K2 = k2) .

In particular the distribution of S is determined by

E [f (S)] =

∫ ∞0

f (s) ·n∑k=1

λkλ

(λ− λk) e−(λ−λk)sds.

(In summary, independent of which clock rings, the ringing time is T is dis-tributed as E (λ) . Given the kth – clock rings first, then the time to the next

ring S is distributed as minj 6=k Tjd= E (λ− λk) .) Moreover, if all λk = µ

independent of k, then Td= E (nµ) , S

d= E ((n− 1)µ) , and P (K = k) = 1

n .

Proof. We are going to compute P (K1 = k1,K2 = k2, T > t, S > s) usingthe identity,

K1 = k1,K2 = k2, T > t, S > s = ∩j /∈k1,k2 Tj > Tk2 > Tk1 + s > t+ s .

To simplify notation for the rest of the computation we will now assume thatk1 = 1 and k2 = 2 where

K1 = 1,K2 = 2, T > t, S > s = ∩j /∈1,2 Tj > T2 > T1 + s > t+ s

and so with µ = λ3 + · · ·+ λn,

P (K1 = 1,K2 = 2, T > t, S > s) =

∫ ∞t

dxλ1e−λ1x

∫ ∞x+s

dyλ2e−λ2yP (∩j Tj > y)

=

∫ ∞t

dxλ1e−λ1x

∫ ∞x+s

dyλ2e−λ2ye−µy

=λ2

λ2 + µ

∫ ∞t

dxλ1e−λ1xe−(µ+λ2)(x+s)

=λ2

λ2 + µe−(µ+λ2)s

∫ ∞t

dxλ1e−λx

=λ2

λ2 + µe−(µ+λ2)s · λ1

λe−λt

=λ1

λ

λ2

λ− λ1e−(λ−λ1)s · e−λt

which is the desired result.Hence it follows that

E [f (S)] =

n∑k=1

E [f (S) |K = k]P (K = k)

=

n∑k=1

λkλ

∫ ∞0

(λ− λk) e−(λ−λk)sf (s) ds

=

∫ ∞0

n∑k=1

λkλ

(λ− λk) e−(λ−λk)sf (s) ds


Example 13.7 (K.&T. I.V.P9, p. 52). A flashlight requires two good batteries inorder to shine. Suppose, for the sake of this academic exercise, that the lifetimesof batteries in use are independent random variables that are exponentiallydistributed with parameter λ = 1. Reserve batteries do not deteriorate. Youbegin with five fresh batteries. On average, how long can you shine your light?

Answer: According to Theorem 13.4, the time until one of the first two batteries fails is an exponential random variable with parameter $2 = 1 + 1$, so its expected value is $\frac12$. The remaining lifetime of the good battery after the other battery has died is still an exponential random variable with parameter 1, see Corollary 13.6. Thus when we replace the dead battery by one of the reserve batteries the whole process starts again, and it repeats for a total of 4 times, so the expected run time of our light is $4 \cdot \frac12 = 2$ hours.
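A direct simulation (my own sketch, not part of the original notes) of the five–battery flashlight confirms the two–hour answer without appealing to the memoryless property at all:

```python
import numpy as np

rng = np.random.default_rng(3)
n, total = 100_000, 0.0
for _ in range(n):
    reserves = list(rng.exponential(1.0, size=5))    # five fresh rate-1 batteries
    in_use = [reserves.pop(), reserves.pop()]        # two of them go into the light
    t = 0.0
    while len(in_use) == 2:                          # the light shines while both work
        burn = min(in_use)                           # time until the next failure
        t += burn
        in_use = [b - burn for b in in_use if b > burn]
        if reserves:                                 # swap in a reserve if one is left
            in_use.append(reserves.pop())
    total += t
print(total / n)                                     # close to 2 (hours)
```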

Corollary 13.8. Continuing the notation in Corollary 13.6, the distribution ofS + T (the time of the second ring) is given by

P (S + T ∈ (τ, τ + dτ) =

n∑k=1

(λ− λk) e−λτ[eλkτ − 1

]dτ

=

n∑k=1

(λ− λk)[e−(λ−λk)τ − e−λτ

]dτ

=

[n∑k=1

(λ− λk) e−(λ−λk)τ − (n− 1)λe−λτ

]dτ.

Proof. To compute the distribution of the second ring, T + S, we use whatwe have already proved along with the law of total expectations;

E [f (T + S)] =

n∑k=1

E [f (T + S) |K = k]P (K = k)

=

n∑k=1

λkλ

∫R2

+

f (t+ s)λe−λt (λ− λk) e−(λ−λk)sdsdt

=

n∑k=1

λkλ

∫R2

+

f (t+ s)λe−λ(t+s) (λ− λk) eλksdsdt

=

n∑k=1

λkλ

∫ ∞0

ds

∫ ∞s

dτf (τ)λe−λτ (λ− λk) eλks

=

n∑k=1

λkλ

∫R2

+

dsdτ10≤s≤τ<∞f (τ)λe−λτ (λ− λk) eλks

=

n∑k=1

λkλ

∫ ∞0

dτf (τ)λe−λτ (λ− λk)1

λkeλks|s=τs=0

=

n∑k=1

∫ ∞0

dτf (τ) e−λτ (λ− λk)[eλkτ − 1

]=

∫ ∞0

f (τ)

(n∑k=1

(λ− λk)[e−(λ−λk)τ − e−λτ

])dτ

=

∫ ∞0

f (τ)

(n∑k=1

(λ− λk) e−λτ[eλkτ − 1

])dτ.

This identity suffice to complete the proof.

Example 13.9 (Special Case). Suppose that Ti2i=1 are exponential indepen-dent random variables with parameter λi for i = 1, 2. If λ = λ1 + λ2,T1 := min (T1, T2) , T2 := max (T1, T2) , and

K =

1 if T1 = T1

2 if T1 = T2

then

P(K = 1, T1 > t, T2 − T1 > s

)=λ1

λe−λte−λ2s and

P(K = 2, T1 > t, T2 − T1 > s

)=λ2

λe−λte−λ1s.

Furthermore,


P(T2 ∈ (t, t+ dt)

)=

2∑k=1

(λ− λk) e−λt[eλkt − 1

]dt

=

2∑k=1

(λ− λk)[e−(λ−λk)t − e−λt

]dt

= λ2

[e−λ2t − e−(λ1+λ2)t

]+ λ1

[e−λ1t − e−(λ1+λ2)t

]= λ1e

−λ1t + λ2e−λ2t − λe−λt.

Theorem 13.10. Suppose that S ∼ E (λ) and R ∼ E (µ) are independent.Then for t ≥ 0 we have

µP (S ≤ t < S +R) = λP (R ≤ t < R+ S) .

Proof. We have

P (S ≤ t < S +R) =

∫ t

0

λe−λsP (t < s+R) ds = λ

∫ t

0

e−λse−µ(t−s)ds.

Similarly interchanging the roles of S and R (hence µ and λ) implies

P (R ≤ t < R+ S) = µ

∫ t

0

e−µse−λ(t−s)ds = µ

∫ t

0

e−µ(t−σ)e−λσdσ

wherein we have made the change of variables σ = t − s. Comparing the lasttwo displayed equations shows

µP (S ≤ t < S +R) = µλ

∫ t

0

e−λse−µ(t−s)ds = λP (R ≤ t < R+ S) .

Alternatively,

λ

∫ t

0

e−λse−µ(t−s)ds = µλe−µt∫ t

0

e−(λ−µ)sds

= µλe−µt · 1− e−(λ−µ)t

λ− µ

= µλ · e−µt − e−λt

λ− µ

which is symmetric in µ and λ.

Lemma 13.11. If 0 ≤ x ≤ 12 , then

e−2x ≤ 1− x ≤ e−x. (13.8)

Moreover, the upper bound in Eq. (13.8) is valid for all x ∈ R.

Fig. 13.1. A graph of 1− x and e−x showing that 1− x ≤ e−x for all x.

Proof. The upper bound follows by the convexity of e−x, see Figure 13.1.For the lower bound we use the convexity of ϕ (x) = e−2x to conclude that theline joining (0, 1) = (0, ϕ (0)) and

(1/2, e−1

)= (1/2, ϕ (1/2)) lies above ϕ (x)

for 0 ≤ x ≤ 1/2. Then we use the fact that the line 1− x lies above this line toconclude the lower bound in Eq. (13.8), see Figure 13.2.

Fig. 13.2. A graph of 1−x (in red), the line joining (0, 1) and(1/2, e−1

)(in green), e−x

(in purple), and e−2x (in black) showing that e−2x ≤ 1− x ≤ e−x for all x ∈ [0, 1/2] .

For an∞n=1 ⊂ [0, 1] , let

∞∏n=1

(1− an) := limN→∞

N∏n=1

(1− an) .

Page: 113 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 120: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

114 13 Exponential Random Variables

The limit exists since,∏Nn=1 (1− an) decreases as N increases.

Exercise 13.1. Show; if an∞n=1 ⊂ [0, 1), then

∞∏n=1

(1− an) = 0 ⇐⇒∞∑n=1

an =∞.

The implication, ⇐= , holds even if an = 1 is allowed.

Solution to Exercise (13.1). By Eq. (13.8) we always have,

N∏n=1

(1− an) ≤N∏n=1

e−an = exp

(−

N∑n=1

an

)

which upon passing to the limit as N →∞ gives

∞∏n=1

(1− an) ≤ exp

(−∞∑n=1

an

).

Hence if∑∞n=1 an =∞ then

∏∞n=1 (1− an) = 0.

Conversely, suppose that∑∞n=1 an <∞. In this case an → 0 as n→∞ and

so there exists an m ∈ N such that an ∈ [0, 1/2] for all n ≥ m. Therefore byEq. (13.8), for any N ≥ m,

N∏n=1

(1− an) =

m∏n=1

(1− an) ·N∏

n=m+1

(1− an)

≥m∏n=1

(1− an) ·N∏

n=m+1

e−2an =

m∏n=1

(1− an) · exp

(−2

N∑n=m+1

an

)

≥m∏n=1

(1− an) · exp

(−2

∞∑n=m+1

an

).

So again letting N →∞ shows,

∞∏n=1

(1− an) ≥m∏n=1

(1− an) · exp

(−2

∞∑n=m+1

an

)> 0.

Theorem 13.12. Let Tj∞j=1 be independent random variables such that Tjd=

E (λj) with 0 < λj <∞ for all j. Then:

1. If∑∞n=1 λ

−1n < ∞ then P (

∑∞n=1 Tn =∞) = 0 (i.e. P (

∑∞n=1 Tn <∞) =

1).

2. If∑∞n=1 λ

−1n =∞ then P (

∑∞n=1 Tn =∞) = 1.

Proof. 1. Since

E

[ ∞∑n=1

Tn

]=

∞∑n=1

E [Tn] =

∞∑n=1

λ−1n <∞

it follows that∑∞n=1 Tn <∞ a.s., i.e. P (

∑∞n=1 Tn =∞) = 0.

2. By the DCT, independence, and Eq. (13.4) with a = −1,

E[e−∑∞

n=1Tn]

= limN→∞

E[e−∑N

n=1Tn

]= limN→∞

N∏n=1

E[e−Tn

]= limN→∞

N∏n=1

(1

1 + λ−1n

)=

∞∏n=1

(1− an)

where

an = 1− 1

1 + λ−1n

=1

1 + λn.

Hence by Exercise 13.1, E[e−∑∞

n=1Tn]

= 0 iff ∞ =∑∞n=1 an which hap-

pens iff∑∞n=1 λ

−1n = ∞ as you should verify. This completes the proof since

E[e−∑∞

n=1Tn]

= 0 iff e−∑∞

n=1Tn = 0 a.s. or equivalently

∑∞n=1 Tn =∞ a.s.

13.1 Exercises

Exercise 13.2. Suppose that T1, T2 are independent random variables with

Tid= E (λi) with λi > 0 for i = 1, 2. Show

P (T1 + T2 ∈ (w,w + dw)) = 1w≥0λ1λ2

λ2 − λ1

[e−λ1w − e−λ2w

]dw,

i.e. show

E [f (T1 + T2)] =

∫ ∞0

f (w)λ1λ2

λ2 − λ1

[e−λ1w − e−λ2w

]dw

for all bounded or non-negative functions f. If λ1 = λ2 = λ the above formulashould be interpreted as

E [f (T1 + T2)] =

∫ ∞0

f (w)λ2we−λwdw.

(See Exercise 13.4 for an extension of this last formula.)

Page: 114 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 121: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

13.1 Exercises 115

Exercise 13.3. For n ∈ N and t > 0 show

Vn (t) :=

∫0≤s1≤s2≤···≤sn≤t

ds1 . . . dsn =tn

n!.

Hints: first observe that V1 (t) = t. Now show

Vn (t) =

∫ t

0

Vn−1 (s) ds

and complete the proof by induction.

Exercise 13.4. Suppose that Tini=1 are i.i.d. exponential random times withparameter λ and let Wn = T1 + · · ·+ Tn. Shown

P (Wn ∈ (w,w + dw)) =λnwn−1

(n− 1)!e−λwdw,

i.e. show

E [f (Wn)] =

∫ ∞0

f (w)λnwn−1

(n− 1)!e−λwdw for all f ≥ 0.

Hint: you may find Exercise 13.3 helpful.

Solution to Exercise (13.2). We start with the case n = 2. In this case,letting w = t1 + t2, i.e. t2 = w − t1, we find,

E [f (T1 + T2)] =

∫f (t1 + t2)λ1λ2e

−(λ1t1+λ2t2)dt1dt2

=

∫f (w)λ1λ2e

−(λ1t1+λ2(w−t1))10≤t1≤wdt1dw

=

∫dwe−λ2wf (w)λ1λ2e

(λ2−λ1)t110≤t1≤wdt1

=

∫dwf (w)

λ1λ2

λ2 − λ1

[e(λ2−λ1)w − 1

]e−λ2w

=

∫dwf (w)

λ1λ2

λ2 − λ1

[e−λ1w − e−λ2w

].

provided λ2 6= λ1. When λ2 = λ1 we have instead,

E [f (T1 + T2)] =

∫dwe−λwf (w)λ210≤t1≤wdt1 =

∫ ∞0

λ2we−λwf (w) dw.

Solution to Exercise (13.3). For n = 1,

V1 (t) :=

∫0≤s1≤t

ds1 = t.

For n ≥ 2,

Vn−1 (sn) =

∫0≤s1≤s2≤···≤sn

ds1 . . . dsn−1

and therefore,

Vn (t) :=

∫0≤sn≤t

Vn−1 (sn) dsn.

Hence it follows that

V2 (t) =

∫ t

0

V1 (s) ds =

∫ t

0

sds =t2

2,

V3 (t) =

∫ t

0

V2 (s) ds =

∫ t

0

s2

2ds =

t3

3!

...

Vn (t) =tn

n!.

Solution to Exercise (13.4). Making the change of variables w = t1+· · ·+tn(holding t1, . . . , tn−1 fixed), implies

E [f (T1 + · · ·+ Tn)] = λn∫f (t1 + · · ·+ tn) e−λ

∑n

i=1ti

n∏i=1

dti

= λn∫f (w) e−λw

n−1∏i=1

dti · 1w≥t1+···+tn−1dw

= λn∫f (w) e−λwVn−1 (w) dw

= λn∫ ∞

0

f (w) e−λwwn−1

(n− 1)!dw.

Exercise 13.5 (Simulating E (λ) - R.V.’s). Suppose that U is a random

variable uniformly distributed in [0, 1] and λ > 0 is given. Show T := − 1λ lnU

d=

E (λ) .

Solution to Exercise (13.5). As 0 < U < 1 a.s. T := − 1λ lnU is a (0,∞) –

valued random variable. Given t > 0 we then have,

P (T > t) = P

(− 1

λlnU > t

)= P (lnU < −λt) = P

(U < e−λt

)= e−λt,

which shows Td= E (λ) .

Page: 115 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 122: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex
Page 123: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

14

Math 180B (W 2011) Final Exam Information

Things to know and understand for the 180B final exam.

1. A basic understanding of Variance, Covariance, Correlation, and Linearprediction.

2. Know the basic properties of conditional expectations.3. Be able to compute conditional expectations involving conditioning on a

discrete random variable.4. Be able to apply the law of total expectations.5. Know how to compute the mean and variance of sums of the form Z =X1 + · · · + XN where N is independent of Xi∞i=1 and the Xi∞i=1 arei.i.d.

6. Be able to find one step transition probabilities for a Markov chain.7. Know how to go back and forth between the jump diagram for a Markov

chain and its transition matrix.8. Know how to compute basic probabilities for a Markov chain in terms of

the one step probabilities.9. In the case of finite state spaces you should know that ETB <∞ if Pi(TB <∞) > 0 for all i ∈ S. In words, if there is a positive chance to hit B fromall starting locations then you will hit B for sure.

10. Be able to use the first step analysis to compute expected hitting times,expected number of visits to site, and hitting probabilities.

11. You should be able to find the communication classes of a Markov chain(i.e. of its transition matrix) and;

a) find the periods of each class and in particular be able to recognizeaperiodic classes,

b) determine if the class is closed or not and use this to determine if afinite communication class is recurrent or transient.

12. You should have a basic understanding of the limiting behavior of Markovchains. For example you should know for any initial distribution ν that;

a) limn→∞ Pν (Xn = i) = 0 and limN→∞1N

∑Nn=1 1Xn=i if i is a transient

(or null recurrent) site.b) If the chain has a finite state space and is regular (i.e. only one com-

munication class which is aperiodic), then

limn→∞

Pν (Xn = i) = πi =1

N

N∑n=1

1Xn=i

where π is the unique invariant distribution for the chain. You shouldknow that π = πP, and πi = 1/EiRi – where Ri = min n ≥ 1 : Xn = iis the first return time to i. You should be able to find π.

13. You should be able to use the formulas,

E [f (X,Y)] =

∫Rk

E [f (x,Y)] ρX (x) dx =

∫Rl

E [f (X,y)] ρY (y) dy

which is valid when X and Y are independent random vectors.14. You should know the definition of exponential random variables and their

basic properties like;

a) they are memoryless, i.e. if Td= E (λ) , then given T > t, T − t has the

same distribution as T. To be precise, P (T − t > s|T > t) = P (T > s)for all s ≥ 0.

b) You should also be able to apply Theorem 14.1 below to compute ba-sic expectations and probabilities involving sequences of independentexponential random variables.

Theorem 14.1. Let Tknk=1 be independent random variables such that Tk ∼E (λk) with λ :=

∑nk=1 λk. Further let T := mink Tk, K = k, and S =∑n

k=1 1K=k minj 6=k (Tj − Tk) . So T is the time the first clock rings, S is thetime between the first and second ring, K is the number of the first clock toring. Then

P (K = k) =λkλ

and Td= E (λ)

and moreover given K = k , T and S are independent with Td= E (λ) and

Sd= E (λ− λk) . i.e. relative to P (·|K = k) , T and S are independent, T

d=

E (λ) and Sd= E (λ− λk) . In even more detail this states,

E [f (S, T ) |K = k] =

∫ ∞0

∫ ∞0

f (s, t) (λ− λk) e−(λ−λk)sλe−λtdsdt

and so by the law of total expectations,

E [f (S, T )] =

n∑k=1

λkλ

∫ ∞0

∫ ∞0

f (s, t) (λ− λk) e−(λ−λk)sλe−λtdsdt.

Page 124: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex
Page 125: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

15

Order statistics (you may skip this chapter!)

Definition 15.1 (Order Statistics). Suppose that X1, . . . , Xn are non-negative random variables such that P (Xi = Xj) = 0 for all i 6= j. The order

statistics of X1, . . . , Xn are the random variables, X1, X2, . . . , Xn defined by

Xk = max#(Λ)=k

min Xi : i ∈ Λ (15.1)

where Λ always denotes a subset of 1, 2, . . . , n in Eq. (15.1).

The reader should verify that X1 ≤ X2 ≤ · · · ≤ Xn, X1, . . . , Xn =X1, X2, . . . , Xn

with repetitions, and that X1 < X2 < · · · < Xn if

Xi 6= Xj for all i 6= j. In particular if P (Xi = Xj) = 0 for all i 6= j then

P (∪i6=j Xi = Xj) = 0 and X1 < X2 < · · · < Xn a.s. In this case we have

E[f(X1, . . . , Xn

)]=∑σ∈Sn

E [f (Xσ1, . . . , Xσn) : Xσ1 < Xσ2 < · · · < Xσn] ,

(15.2)where the sum is over the symmetric groups, Sn.

Lemma 15.2. If f : Rn+ → R is a bounded (non-negative) symmetric function(i.e. f (wσ1, . . . , wσn) = f (w1, . . . , wn) for all σ ∈ Sn and (w1, . . . , wn) ∈ Rn+)then

E[f(X1, . . . , Xn

)]= E [f (X1, . . . , Xn)] .

Proof. From Eq. (15.2) and the symmetry of f we find,

E[f(X1, . . . , Xn

)]=∑σ∈Sn

E [f (Xσ1, . . . , Xσn) : Xσ1 < Xσ2 < · · · < Xσn]

=∑σ∈Sn

E [f (X1, . . . , Xn) : Xσ1 < Xσ2 < · · · < Xσn]

= E [f (X1, . . . , Xn)]

as Xσ1 < Xσ2 < · · · < Xσnσ∈Sn partitions the sample space up to a zero prob-ability set.

Lemma 15.3. Suppose that (X1, . . . , Xn) have a continuous distribution func-tion, ρ (x1, . . . , xn) , i.e.

E [f (X1, . . . , Xn)] =

∫Rn

+

f (x1, . . . , xn) ρ (x1, . . . , xn) dx1 . . . dxn

for all bounded functions f. If

ρ (x1, . . . , xn) :=∑σ∈Sn

ρ (xσ1, . . . , xσn)

is the symmetrization of ρ (where the sum is over the permutation group Sn),then

E[f(X1, X2, . . . , Xn

)]=

∫0≤x1≤x2≤···≤xn<∞

f (x1, . . . , xn) ρ (x1, . . . , xn) dx1 . . . dxn

Proof. From Eq. (15.2) and definition of ρ we find,

E[f(X1, . . . , Xn

)]=∑σ∈Sn

∫Rn

+

10≤xσ1≤xσ2≤···≤xσnf (xσ1, . . . , xσn) ρ (x1, . . . , xn) dx1 . . . dxn

=∑σ∈Sn

∫Rn

+

10≤x1≤x2≤···≤xnf (x1, . . . , xn) ρ (xσ−11, . . . , xσ−1n) dx1 . . . dxn

=

∫Rn

+

10≤x1≤x2≤···≤xnf (x1, . . . , xn)∑σ∈Sn

ρ (xσ−11, . . . , xσ−1n) dx1 . . . dxn

=

∫Rn

+

10≤x1≤x2≤···≤xnf (x1, . . . , xn)∑σ∈Sn

ρ (xσ1, . . . , xσn) dx1 . . . dxn.

Example 15.4. If Xknk=1 are i.i.d. with each having continuous distributionρ (x) for x ≥ 0, then

ρ (x1, . . . , xn) = n!ρ (x1) . . . ρ (xn) for 0 ≤ x1 < x2 < · · · < xn.

Page 126: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

120 15 Order statistics (you may skip this chapter!)

Exercise 15.1. Suppose that Tknk=1 are i.i.d. E (λ) – random variables and

let Sk := Tk − Tk−1 with T0 = 0. Show Sknk=1 are independent and Skd=

E ((n− k + 1)) for k = 1, 2, . . . , n.

Solution to Exercise (15.1). We know that

ρ (t1, . . . , tn) = n!λne−λ(t1+···+tn)

and therefore,

E [f (S1, . . . , Sn)]

= E[f(T1, T2 − T1 . . . , Tn − Tn−1

)]=

∫0≤t1<t2<···<tn

f (t1, t2 − t1 . . . , tn − tn−1)n!λne−λ(t1+···+tn)dt1 . . . dtn.

=

∫0≤t1<t2<···<tn

f (t1, t2 − t1 . . . , tn − tn−1)n!λne−λ(t1+···+tn)dt1 . . . dtn

We now make the change of variables, si = ti−ti−1 so that ti = s1 +s2 + · · ·+sito find,

E [f (S1, . . . , Sn)]

=

∫si≥0

f (s1, s2 . . . , sn)n!λne−λns1+(n−1)s2+···+2sn−1+s1ds1 . . . dsn

=

∫si≥0

f (s1, s2 . . . , sn)

n∏k=1

kλe−λ(n−k+1)skdsk

from which it follows that Sknk=1 are independent and Skd= E (λ (n+ 1− k))

for k = 1, 2, . . . , n.

Exercise 15.2. Suppose that X1, . . . , Xn are non-negative1 random variablessuch that P (Xi = Xj) = 0 for all i 6= j. Show;

1. If f : ∆n → R is bounded (non-negative) measurable, then

E[f(X1, . . . , Xn

)]=∑σ∈Sn

E [f (Xσ1, . . . , Xσn) : Xσ1 < Xσ2 < · · · < Xσn] ,

(15.3)where Sn is the permutation group on 1, 2, . . . , n .

1 The non-negativity of the Xi are not really necessary here but this is all we needto consider.

2. If we further assume that X1, . . . , Xn are i.i.d. random variables, then

E[f(X1, . . . , Xn

)]= n! · E [f (X1, . . . , Xn) : X1 < X2 < · · · < Xn] .

(15.4)

(It is not important that f(X1, . . . , Xn

)is not defined on the null set,

∪i 6=j Xi = Xj .)3. f : Rn+ → R is a bounded (non-negative) measurable symmetric function

(i.e. f (wσ1, . . . , wσn) = f (w1, . . . , wn) for all σ ∈ Sn and (w1, . . . , wn) ∈Rn+) then

E[f(X1, . . . , Xn

)]= E [f (X1, . . . , Xn)] .

4. Suppose that Y1, . . . , Yn is another collection of non-negative random vari-ables such that P (Yi = Yj) = 0 for all i 6= j such that

E [f (X1, . . . , Xn)] = E [f (Y1, . . . , Yn)]

for all bounded (non-negative) measurable symmetric functions from Rn+ →R. Show that

(X1, . . . , Xn

)d=(Y1, . . . , Yn

).

Hint: if g : ∆n → R is a bounded measurable function, define f : Rn+ → Rby;

f (y1, . . . , yn) =∑σ∈Sn

1yσ1<yσ2<···<yσng (yσ1, yσ2, . . . , yσn)

and then show f is symmetric.

Lemma 15.5. Suppose that Tknk=1 are i.i.d. E (λ) – random times and Wi =T1 + · · ·+ Ti for all i. Then

E [f (W1,W2, . . . ,Wn)] =

∫0≤w1≤w2≤···≤wn

f (w1, w2, . . . , wn)λne−λwndw1 . . . dwn.

(15.5)

Proof. By definition,

E [f (W1,W2, . . . ,Wn)]

= E [f (T1, T1 + T2, . . . , T1 + T2 + · · ·+ Tn)]

=

∫Rn

+

f (t1, t1 + t2, . . . , t1 + t2 + · · ·+ tn)λne−λ(t1+···+tn)dt1 . . . dtn.

Making the change of variables, wi = t1 + t2 + · · · + ti for each i above leadsdirectly to Eq. (15.5).

Page: 120 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 127: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

Notation 15.6 For each n ∈ N and T ≥ 0 let

∆n (T ) := (w1, . . . , wn) ∈ Rn : 0 < w1 < w2 < · · · < wn < T

and let

∆n := ∪T>0∆n (T ) = (w1, . . . , wn) ∈ Rn : 0 < w1 < w2 < · · · < wn <∞ .

Corollary 15.7. If t ∈ R+ and f : ∆n (t)→ R is a bounded (or non-negative)measurable function, then

E [f (W1, . . . ,Wn) |Wn ≤ t < Wn+1]

=n!

tn

∫∆n(t)

f (w1, w2, . . . , wn) dw1 . . . dwn

= E[f(U1, . . . , Un

)]where

Ui

ni=1

are the order statistics of Uini=1 where the Ui are i.i.d. uni-

formly distributed random variables on [0, t] .

Proof. Applying Eq. (15.5) at level n+ 1 with

g (w1, w2, . . . , wn+1) = f (w1, w2, . . . , wn) 1wn≤t<wn+1

to learn

E [f (W1, . . . ,Wn) : Wn ≤ t < Wn+1]

=

∫0<w1<···<wn<t<wn+1

f (w1, w2, . . . , wn)λn+1e−λwn+1dw1 . . . dwndwn+1

=

∫∆n(t)

f (w1, w2, . . . , wn)λne−λtdw1 . . . dwn. (15.6)

Taking f ≡ 1 in this equation and then doing the integrals implies (see Exercise13.3),

P (Wn ≤ t < Wn+1) =tn

n!· λne−λt.

Dividing Eq. (15.6) by the probability P (Wn ≤ t < Wn+1) we just found com-pletes the proof in light of Exercise 15.2.

Page 128: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex
Page 129: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16

Point Processes

Suppose that S is some region of space (i.e. a subset of Rd for some d). Ourfirst goal is to introduce the notion of a collection of random point objects in S.Our goal is to model phenomenon like the location of raisins in a raisin bread,stars in the sky, starfish on the sea floor, etc. The informal idea is that for eachs ∈ S we would like to flip a (possibly biased) coin and place an object at site sif the coin shows heads and leave the site empty otherwise. We further want todo this independently for each point in S. However if we take this point of viewliterally we will see that we have typically placed an infinite number of pointsin any infinite subset of S. This would not form a very good model the locationof stars in the sky, etc. but nevertheless this is the spirit of the Poisson pointprocess.

According to Wikipedia. The following examples are well-modeled by a Pois-son process:

• The number of goals in (90 minutes of) a soccer match.• The arrival of “customers” in a queue.• The number of adjacent pairs of lockers open within an area.• The number of raindrops falling over an area.• The number of photons hitting a photodetector.• The number of telephone calls arriving at a switchboard, or at an automatic

phone-switching system.• The number of particles emitted via radioactive decay by an unstable sub-

stance, where the rate decays as the substance stabilizes.• The long-term behavior of the number of web page requests arriving at

a server, except for unusual circumstances such as coordinated denial ofservice attacks or flash crowds. Such a model assumes homogeneity as wellas weak stationarity.”

So what is a Poisson point process? We are going to answer this questionby taking a “scalling limit” of its discrete cousin – the Bernoulli point process.

Definition 16.1 (Bernoulli point process). Let S be a finite or countableset and p : S → [0, 1] be a given function. Further let Yxx∈S be independentBernoulli random variables with P (Yx = 1) = p (x) for all x ∈ S. When Yx = 1we say there is a particle present at site x while when Yx = 0 indicates x isvacant site. Given A ⊂ S let

η (A) :=∑x∈A

Yx = # x ∈ A : Yx = 1

count the number of particles in A. We also let

Λ = Λη = x ∈ S : Yx = 1 = x ∈ S : η (x) = 1

be the random subset of S indicating the occupied sites in S.We refer to η (·)and Λ as a Bernoulli point process on S.

Notice that η (x) = Yx so that η (A)A⊂fS contains all the same infor-

mation as the original random field Yxx∈S . If A1, . . . , Ak are finite disjoint

subsets of S, then η (Al)kl=1 are independent. In what follows we will writeq (x) for 1− p (x) .

We will study the Bernoulli process and its scaling limit (the Poisson pro-cess) in more detail shortly. For now have a look at Figures 16.1 and 16.2 whichshow two “typical” sample points of the Poisson point process. Please noticethe clumping or “Poisson bursting” in these pictures. The moral is that randomdoes not mean uniform!

The reader should also have a look at the nice article; Shark attacks andthe Poisson approximation by Byron Schmuland at

http : //www.stat.ualberta.ca/people/schmu/preprints/poisson.pdf.

The Poisson approximation used implicitly in this article is more delicate thanwhat we will cover in this class, see Robin Pemantle

http : //www.math.upenn.edu/˜pemantle/Lectures/lec2.61−chen−stein.pdf

for a discussion of this point and see the papers;

1. Arratia, R., Goldstein, L. and Gordon, L. (1989) “Two moments suffice forPoisson Approximation”, Ann. Prob. 17:9–25.

2. Arratia, R., Goldstein, L. and Gordon, L. (1990) “Poisson Approximationand the Chen-Stein method”, Stat. Sci. 5:403–424. (This is a review paper.)

Page 130: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

124 16 Point Processes

λ = 10

Fig. 16.1. A Poisson point process typical sample point in the unit square and theunit interval with “intensity” equal to 10. This picture was generalted with the aid ofTheorem 16.23 below.

16.1 Poisson and Geometric Random Variables (Review)

Before studying the Bernoulli point process and its scaling limit we will pauseto recall and develop some more properties of some associated distributions.Recall from Exercise 0.1 that a Random variable, N, is Poisson distributedwith intensity, λ, if

P (N = k) =λk

k!e−λ for all k ∈ N0.

We will abbreviate this in the future by writing Nd= Poi (λ) . Let us also recall

the following facts about Poi (λ) , see Exercise 0.1.

λ = 100

Fig. 16.2. A Poisson point process typical sample point in the unit square and theunit interval with “intensity” equal to 100. This picture was generalted with the aidof Theorem 16.23 below.

Lemma 16.2 (Poisson properties). If Nd= Poi (λ) , (i.e. P (N = k) =

λk

k! e−λ for all k ∈ N0), then

GN (z) := E[zN]

= eλ(z−1).

You should find EN = λ = Var (N) . Moreover if Md= Poi (µ) with M and N

independent, then M + Nd= Poi (µ+ λ) , in short we abbreviate this by saying

Poi (λ)⊥⊥+ Poi (µ)

d= Poi (µ+ λ) .

Proof. Using Taylor’s series for the exponential function gives;

Page: 124 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 131: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.1 Poisson and Geometric Random Variables (Review) 125

GN (z) = E[zN]

=

∞∑k=0

zkλk

k!e−λ = e−λeλz = eλ(z−1).

Therefore, for all k ∈ N0,

E [N (N − 1) . . . (N − k + 1)] = G(k)N (z) = λkeλ(z−1)

and in particular,

EN = λ, E [N · (N − 1)] = λ2, and

Var (N) = λ2 + λ− λ2 = λ.

Lastly,

E[zM+N

]= E

[zM · zN

]= E

[zM]· E[zN]

= eµ(z−1) · eλ(z−1) = e(µ+λ)(z−1)

which suffices to prove Poi (λ)⊥⊥+ Poi (µ)

d= Poi (µ+ λ) .

Alternatively, we may show this directly with the aid of the binomialformula,

P (M +N = n) =

n∑k=0

P (M = k,N = n− k)

=

n∑k=0

P (M = k) · P (N = n− k)

=

n∑k=0

µk

k!e−µ · λn−k

(n− k)!e−λ

= e−(µ+λ) 1

n!

n∑k=0

(n

k

)µkλn−k =

(µ+ λ)n

n!e−(µ+λ).

Lemma 16.3. Suppose that Ni∞i=1 are independent Poisson random variableswith parameters, λi∞i=1 such that λ :=

∑∞i=1 λi < ∞. Then N :=

∑∞i=1Ni is

Poisson with parameter λ.

Proof. By Lemma 16.2, for each n ∈ N,∑ni=1Ni

d= Poi (

∑ni=1 λi) . Since

for each k ∈ N0, ∑ni=1Ni = k ↓ N = k as n ↑ ∞ we have

P (N = k) = limn→∞

P

(n∑i=1

Ni = k

)= limn→∞

(∑ni=1 λi)

k

k!exp

(−

n∑i=1

λi

)

=λk

k!e−λ

which shows Nd= Poi (λ) .

Lemma 16.4. Suppose that Ni∞i=1 are independent Poisson random variableswith parameters, λi∞i=1 such that

∑∞i=1 λi =∞. Then

∑∞i=1Ni =∞ a.s.

Proof. Let Λn = λ1 + · · ·+ λn, then

P

( ∞∑i=1

Ni ≥ k

)≥ P

(n∑i=1

Ni ≥ k

)= 1− e−Λn

k−1∑l=0

Λlnl!→ 1 as n→∞.

Therefore P (∑∞i=1Ni ≥ k) = 1 for all k ∈ N and hence,

P

( ∞∑i=1

Ni ≥ ∞

)= P

(∩∞k=1

∞∑i=1

Ni ≥ k

)= 1.

Definition 16.5 (Geometric distribution). A integer valued random vari-able, N, is said to have a geometric distribution with parameter, p ∈ (0, 1)provided,

P (N = k) = p (1− p)k−1for k ∈ N.

Let us recall that

EN = 1/p and Var (N) =1− pp2

. (16.1)

To prove this let q = 1− p and notice that

EN =

∞∑k=1

kpqk−1 = pd

dq

∞∑k=1

qk = pd

dq

q

1− q

= p · 1− q + q

(1− q)2 = p · 1

(1− q)2 =1

p

and

EN2 =

∞∑k=1

kpqk−1 = pd

dq

[qd

dq

∞∑k=1

qk

]

= pd

dq

q

(1− q)2 = p · (1− q)2+ 2 (1− q) q

(1− q)4

=p+ 2q

p2=

2− pp2

and hence

Var (N) = EN2 − (EN)2

=2− pp2− 1

p2=

1− pp2

.

Page: 125 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 132: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

126 16 Point Processes

16.2 Law of rare numbers

In order to take the scaling limit of the Bernoulli model described in the previoussection we will need the limiting results of this section.

Lemma 16.6. Suppose that aε > 0 and 0 < pε ≤ 1 with limε↓0 aε = ∞ andβ := limε↓0 aεpε exists as a positive number. Then for any bounded function,γε, of ε we have,

limε↓0

(1− pε)aε+γε = e−β .

Proof. Since pε → 0 and γε is bounded it follows that limε↓0 pεγε = 0 andtherefore that limε↓0 pε (aε + γε) = β. Since, by Taylor’s theorem,

ln (1− pε) =(−pε +O

(p2 (ε)

)),

we find

ln (1− pε)aε+γε = (aε + γε) ln (1− pε)= (aε + γε)

(−pε +O

(p2 (ε)

))→ −β as ε ↓ 0.

Exponentiating this limiting result proves the lemma.

Definition 16.7. Suppose that S is a finite or countable set and Xn and Xare random functions with values in S. We say that Xn converges to X indistribution provided

limn→∞

P (Xn = s) = P (X = s) for all s ∈ S.

We will abbreviate this type of convergence by writing Xn =⇒ X. We alsowrite Xn =⇒ X to mean limn→∞ P (Xn > x) = P (X > x) for all x ∈ Rwhen Xn ∪ X are random variables and X has a continuous distribution.

Notation 16.8 For any a ≥ 0, let [a] be the nearest integer to a which is nolarger than a. For example [3.5] = 3 and [3] = 3.

Proposition 16.9 (Law of rare numbers I). Suppose that aε > 0 and 0 ≤pε < 1 are chosen so that pε → 0 and aε →∞ in such a way that limε↓0 pεaε = βexists with β ∈ (0,∞) . Then, in the limit as ε ↓ 0,

Bin ([aε] , pε) =⇒ Poi (β) and (16.2)

1

aεGeo (pε) =⇒ E (β) . (16.3)

To be more precise, for every k ∈ N0 and every t > 0,

limn→∞

P (Bin ([aε] , pε) = k) = P (Poi (β) = k) =βk

k!e−β and

limn→∞

P

(1

aεGeo (pε) > t

)= P (E (β) > t) = e−βt.

Proof. In what follows we will abbreviate [aε] by n and pε by p and letβε := np = [aε] pε Then for 0 ≤ k ≤ n,

P (Bin (n, p) = k) =

(n

k

)pk (1− p)n

=pk

k!n (n− 1) . . . (n− k + 1) · (1− p)n−k

=1

k!

(np

1− p

)k· 1 ·

(1− 1

n

)· · · · ·

(1− k − 1

n

)(1− p)n

=1

k!

(βε

1− βε/n

)k· 1 ·

(1− 1

n

)· · · · ·

(1− k − 1

n

)(1− βε/n)

n.

As ε ↓ 0, n→∞ and βε → β and so passing to the limit as ε ↓ 0 in the previousequation for fixed k gives,

limε↓0

P (Bin (n, p) = k) =βk

k!e−β

which is Eq. (16.2). Similarly, using Lemma 16.6,

P

(1

aεGeo (pε) > t

)= P (Geo (pε) > aεt)

= P (Geo (pε) > [aεt])

= (1− pε)[aεt] → e−βt as n→∞,

which proves Equation (16.3).The next theorem is a substantial generalization of Eq. (16.2) while at the

same time allowing us to give quantitative bounds on the error in this equation,see Corollary 16.11 below.

Theorem 16.10 (Law of rare events II). Let Zini=1 be independentBernoulli random variables with P (Zi = 1) = pi ∈ (0, 1) and P (Zi = 0) =

1− pi, S := Z1 + · · ·+ Zn, a := p1 + · · ·+ pn, and Xd= Poi (a) . Then for any

A ⊂ N0 we have

|P (S ∈ A)− P (X ∈ A)| ≤n∑i=1

p2i , (16.4)

(Of course this estimate has no content unless∑ni=1 p

2i < 1.)

Proof. Let Xini=1 be independent Poisson random variables with Xid=

Poi (pi) so that X = X1 + · · ·+Xnd= Poi (a) . (We do not assume that (Xi, Zi)

are independent!) It then follows that

Page: 126 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 133: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.2 Law of rare numbers 127

|P (S ∈ A)− P (X ∈ A)| ≤ E |1A (S)− 1A (X)| .

Making use of the observations that |1A (S)− 1A (X)| ≤ 1 and|1A (S)− 1A (X)| = 0 unless Zi 6= Xi for some i we may conclude,

|1A (S)− 1A (X)| ≤ 1∪ni=1Zi 6=Xi ≤

n∑i=1

1Zi 6=Xi

and therefore,

|P (S ∈ A)− P (X ∈ A)| ≤n∑i=1

E [1Zi 6=Xi ] =

n∑i=1

P (Zi 6= Xi) .

We are now going to complete the proof by manufacturing the random vec-tors (Xi, Zi)ni=1 so that P (Zi 6= Xi) is as small as possible. We will constructthat (Xi, Zi) as functions of Ui where Uini=1 are i.i.d. random variables dis-tributed uniformly on [0, 1] .

If we letZi := 1(1−pi,1] (Ui) = 11−pi<Ui≤1,

then Zini=1 are independent Bernoulli random variables with P (Zi = 1) = pi.

Further let J i0 := (0, e−pi ] and then choose disjoint intervalsJ ik∞k=1

of (e−pi , 1]

such that∣∣J ik∣∣ =

pkik! e−pi and define

Xi :=

∞∑k=0

k1Jik

(Ui)

as in see Figure 16.3. Then Xid= Poi (pi) and we have,

P (Xi 6= Zi) = [αi (0)− (1− pi)] + 1− αi (1)

=[e−pi − (1− pi)

]+ 1− e−pi (1 + pi)

= pi(1− e−pi

)≤ p2

i .

Corollary 16.11. Let λ > 0 be given, then

dTV

(Bern

n, n

),Poi (λ)

)≤ n ·

n

)2

=λ2

n.

See Figures 16.4 – 16.7.

Fig. 16.3. Plots of Xi and Zi as functions of Ui. In this figure αi (k) :=P (Poi (pi) ≤ k) .

Fig. 16.4. Plot of the probability functions for Bern(

5100

, 100)

in black and Poi (5)in green.

Page: 127 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 134: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

128 16 Point Processes

Fig. 16.5. Plot of the probability functions for Bern(

51000

, 1000)

in black and Poi (5)in green.

Compare the Poisson approximation with the central limit approximation.

So consider let νd= Poi (λ) which we view as ν =

∑nk=1 νk where νk∞k=1 are

i.i.d. with νkd= Poi (λ/n) . As Eν = λ = Var (λ) , the central limit then states,

Zλ := ν−λ√λ

d≈ N (0, 1) so that

νd≈ λ+

√λN (0, 1)

which should be closer to the truth as λ ↑ ∞. This explains why the pictureswith λ = 15 and then λ = 30 looks more Gaussian than the pictures with λ = 5.

16.3 Bernoulli point process (Homogeneous Case)

Notation 16.12 Suppose A is a set and f is a symmetric function on An.Given any subset, C ⊂ A, with n - elements, we write f (C) for f (x1, . . . , xn)where C = x1, . . . , xn . It does not matter how we order the points in C as fis a symmetric function of its arguments.

Fig. 16.6. Plot of the probability functions for Bern(

151000

, 1000)

in black and Poi (15)in green.

Theorem 16.13 (Uniform Bernoulli process). Suppose that p = p (x) isindependent of x in Definition 16.1. In this case if A is a finite subset of S

(we denote this by writing A ⊂f S in the future), then η (A)d= Bin (p,# (A)) .

Moreover;

1. if 1 ≤ n ≤ # (A) and C ⊂ A is a subset with n elements, then

P (Λ ∩A = C|η (A) = n) =1(

#(A)n

) , (16.5)

i.e. given η (A) = n, Λ ∩ A is distributed uniformly among all subsets of Awith n – elements.

2. if Uini=1 are i.i.d. random functions with values in A distributed uniformlyin A, then Λ ∩ A given η (A) = n and U1, . . . , Un given no coincidencesare equally distributed. The event, Γn, of no coincidences in the sequence(U1, U2, . . . , Un) is formally defined by;

Γn := [∪i 6=j Ui = Uj]c = ω : # U1 (ω) , U2 (ω) , . . . , Un (ω) = n .(16.6)

Page: 128 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 135: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.3 Bernoulli point process (Homogeneous Case) 129

Fig. 16.7. Plot of the probability functions for Bern(

301000

, 1000)

in black and Poi (30)in green.

3. If f is a symmetric function on An, then

E [f (Λ ∩A) ||η (A) = n] = E [f (U1, . . . , Un) |Γn] . (16.7)

Proof. The assertion that η (A)d= Bin (p,# (A)) should be obvious to the

reader as η (A) is a sum of a := # (A) independent Bernoulli random variableswith parameter p.

1. If C ⊂ A has n – elements, then

P (η (C) = n, η (A) = n) = P (η (C) = n, η (A \ C) = 0) = pnqa−n

while

P (η (A) = n) =

(a

n

)pnq(a−n).

Therefore, P (η (C) = n|η (A) = n) = 1/(an

)as claimed.

2. To prove the last assertion write C = x1, . . . , xn where the xi are alldistinct points in A and notice that

U1, . . . , Un = C = ∪σ∈Sn U1 = xσ1, . . . , Un = xσn .

Therefore,

P (U1, . . . , Un = C, Γn) = P (U1, . . . , Un = C)

=∑σ∈Sn

P (U1 = xσ1, . . . , Un = xσn)

=∑σ∈Sn

(1

a

)n= n! ·

(1

a

)nwhere Sn is the set of permutations1 on 1, 2, . . . , n . Since

Γn = ∪C⊂A;#(C)=n U1, . . . , Un = C

it follows that

P (Γn) =∑

C⊂A;#(C)=n

P (U1, . . . , Un = C) =

(a

n

)· n! ·

(1

a

)nand therefore,

P (U1, . . . , Un = C|Γn) =1(an

) .3. Equation (16.7) is really just a rewriting of item 2. The point is that item

2. implies the distribution of f (Λ ∩A) given η (A) = n is the same asf (U1, . . . , Un) given Γn and therefore the expectations agree.

We will also like to pay special attention to case where S = Sε = ε ·N ⊂ R+

for some ε > 0. In this setting we will make use of the order on S by definingW ε

0 = 0 and then define W εn inductively by

W εn = min

t ∈ S : t > W e

n−1 and Yt = 1.

So W εn is the time that the nth 1 appears in the sequence, (Yε, Y2ε, Y3ε, . . . ) . We

further let T εn := W εn−W ε

n−1 so that T εn is the time between “events” n−1 andn. For example if (Yε, Y2ε, Y3ε, . . . ) = (0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, . . . ) , thenW ε

1 = 3ε, W ε2 = 5ε, W ε

3 = 9ε, W ε4 = 11ε, W ε

5 = 12ε, . . . . and T ε1 = 3ε, T ε2 = 2ε,T ε3 = 4ε, T ε4 = 2ε, T ε5 = 1ε, . . . . Notice that W ε

n = T ε1 + · · · + T εn for n ≥ 1.We refer to W ε

n as the waiting time for the nth – event and T εn as the nth –interarrival time.

1 A permutation σ ∈ Sn is a bijective function from 1, 2, . . . , n to itself.

Page: 129 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 136: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

130 16 Point Processes

Theorem 16.14. Continuing the notation above,

P(T εn+1 > t|W ε

1 , . . . ,Wεn

)= qt/ε for all t ∈ Sε. (16.8)

In particular it follows that T εn+1 is independent of W ε1 , . . . ,W

εn and in particular

T εn∞n=1 are all i.i.d. random variables such that T εn

d= ε ·Geo (p) .

Proof. Given (W ε1 , . . . ,W

εn) , the sites above W ε

n are independent of all thesites up to and before W ε

n and therefore,

P(T εn+1 > t|W ε

1 , . . . ,Wεn

)=

∏τ∈Sε:W ε

n<τ≤W εn+t

q = qt/ε (16.9)

for all t ∈ Sε.

Theorem 16.15. For any bounded or non-negative function of f (w1, . . . , wn)with w1 < w2 < · · · < wn in Sε we have

E [f (W ε1 , . . . ,W

εn)] =

∑0<w1<w2<···<wn<∞

f (w1, . . . , wn) qwn/ε ·(p

q

)n.

Proof. We have

E [f (W ε1 , . . . ,W

εn)]

=∑

0<w1<w2<···<wn<∞f (w1, . . . , wn)

∏x∈(0,wn]\w1,...,wn

q ·n∏i=1

p

=∑

0<w1<w2<···<wn<∞f (w1, . . . , wn)

∏x∈(0,wn]

q ·n∏i=1

p

q

=∑

0<w1<w2<···<wn<∞f (w1, . . . , wn) qwn/ε ·

(p

q

)n.

Corollary 16.16. Suppose that t ∈ Sε and n ∈ N are given such that t/ε ≥ nand let Uini=1 be i.i.d. random variables uniformly distributed in (0, t] ∩ Sε.Further let Γn = [∪i 6=j Ui = Uj]c be the event where no two of the Uini=1

are equal and let(U1, . . . , Un

)be the order statistics of (U1, . . . , Un) , i.e.(

U1, . . . , Un

)is the sequence (U1, . . . , Un) arranged in increasing order. Then

E [f (W ε1 , . . . ,W

εn) |Nt = n] = E

[f(U1, . . . , Un

)|Γn]

for all bounded or non-negative function of f (w1, . . . , wn) with w1 < w2 < · · · <wn.

Proof. Since Nt = n =W εn ≤ t < W ε

n+1

it follows from Theorem 16.15

that

E [f (W ε1 , . . . ,W

εn) : Nt = n]

=∑

0<w1<w2<···<wn≤t<wn+1<∞

f (w1, . . . , wn) qwn+1/ε ·(p

q

)n+1

=∑

0<w1<w2<···<wn≤t

f (w1, . . . , wn) qt/ε+1 1

1− q·(p

q

)n+1

=∑

0<w1<w2<···<wn≤t

f (w1, . . . , wn) pnqt/ε−n.

Taking f ≡ 1 in the formula shows,

P (Nt = n) =∑

0<w1<w2<···<wn≤t

pnqt/ε−n =

(t/ε

n

)pnqt/ε−n

as we already know should be the case. In particular it follows that

E [f (W ε1 , . . . ,W

εn) |Nt = n] =

1(t/εn

) ∑0<w1<w2<···<wn≤t

f (w1, . . . , wn) (16.10)

which is the uniform distribution on increasing sequences (w1, . . . , wn) in (0, t]∩Sε.

On the other hand,

E[f(U1, . . . , Un

): Γn

]=∑σ∈Sn

E [f (Uσ1, . . . , Uσn) : Uσ1 < · · · < Uσn ]

= n! · E [f (U1, . . . , Un) : U1 < · · · < Un]

=n!

(t/ε)n

∑0<w1<w2<···<wn≤t

f (w1, . . . , wn) .

Taking f ≡ 1 in this equation shows

P (Γn) =n!

(t/ε)n ·(t/ε

n

)and therefore,

E[f(U1, . . . , Un

)|Γn]

=1(t/εn

) ∑0<w1<w2<···<wn≤t

f (w1, . . . , wn)

which combined with Eq. (16.10) completes the proof.

Page: 130 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 137: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.3 Bernoulli point process (Homogeneous Case) 131

16.3.1 The Scaling Limit (Homogeneous Case)

Our next goal is to take a scaling limit of the Bernoulli point process. Givenε > 0, let S = Sε := εZd :=

εx : x ∈ Zd

and suppose that Yxx∈εZd are i.i.d.

Bernoulli random variables with parameter p = pε. Given a reasonable subsetA ⊂ Rd with finite volume, let Aε := A ∩

[εZd

]and

ηε (A) :=∑x∈Aε

Yx = # (x ∈ Aε : Yx = 1) .

For simplicity, in this section we will assume that p (x) = pε is independent of xbut will depend on ε. Our first goal is to understand how to choose pε in orderto get a non-trivial limit.

A

εZd

ε

Fig. 16.8. An overlay of εZd and a subset A ⊂ Rd. The finite set Aε is indicated bythe solid dots. Observe from this figure that |A| ∼= εd ·# (Aε) .

We will denote the volume of A by |A| = Vol (A) . As indicated in Figure16.8, we have, |A| ∼= εd ·# (Aε) and and therefore,

E [ηε (A)] = pε ·# (Aε) ∼= |A| ·pεεd.

In order to get a non-trivial limit, we will now suppose that pε = λεd (or atleast that limε↓0 pεε

−d = λ) for some λ > 0.

Theorem 16.17. Let pε = λεd. Then;

1. for “nice” A ⊂ Rd,

ηε (A)d= Bin

(λεd,# (Aε)

)=⇒ Poi

(limε↓0

λεd ·# (Aε)

)= Poi (λ |A|) as ε ↓ 0.

2. More generally if A1, . . . Ak are nice finite volume disjoint subsets of Rdthen

(ηε (A1) , . . . , ηε (Ak)) =⇒ (N (A1) , . . . , N (Ak))

where N (A1) , . . . , N (Ak) are independent Poison random variables with

N (Ai)d= Poi (λ |Ai|) .

3. If f is a symmetric continuous function on An, then

limε↓0

E [f (Ληε ∩A) |ηε (A) = n] = limε↓0

E [f (Uε1 , . . . , Uεn) |Γ εn]

= E [f (U1 . . . , Un)]

where Uini=1 are i.i.d. uniformly distributed random vectors in A. (Noticethat P (Γ εn) → 1 as ε ↓ 0 and hence there is not need for conditioning onno-coincidences in the final limit.)

4. For the d = 1, we know that T εn∞n=1 =⇒ Tn∞n=1 where the Tn∞n=1 are

i.i.d. E (λ) – random variables.5. For the d = 1, if f (w1, . . . , wn) is a bounded continuous function of 0 <w1 < w2 < · · · < wn <∞, then

limε↓0

E [f (W ε1 , . . . ,W

εn)] = E [f (W1, . . . ,Wn)]

where Wn = T1 + · · · + Tn for all n ≥ 1 and Ti∞i=1 are i.i.d. E (λ) –random variables.

Proof. Most of these statements are easy to verify.

1. This one follows from the law of rare numbers in Proposition 16.9.2. Since ηε (Al)kl=1 are independent,

limε→0

P(ηε (Al) = nlkl=1

)= limε→0

k∏l=1

P (ηε (Al) = nl) =

k∏l=1

P (N (Al) = nl)

which is the assertion in the second item.3. Let Γn := (w1, . . . , wn) ∈ An : wi 6= wj for all i 6= j . From item 3. of The-

orem 16.13,

Page: 131 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 138: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

132 16 Point Processes

E [f (Ληε ∩A) |ηε (A) = n]

= E [f (Uε1 , . . . , Uεn) |Γ εn]

=1

(# (A ∩ εZd))n∑

wi∈A∩εZdf (w1, . . . , wn) 1Γn (w1, . . . wn)

∼=(εd)n

|A|n∑

wi∈A∩εZdf (w1, . . . , wn) 1Γn (w1, . . . wn)

=1

|A|n∑

wi∈A∩εZdf (w1, . . . , wn) 1Γn (w1, . . . wn)

(εd)n

ε↓0−→ 1

|A|n∫An

f (w1, . . . , wn) 1Γn (w1, . . . wn) dw1 . . . dwn

=1

|A|n∫An

f (w1, . . . , wn) dw1 . . . dwn = E [f (U1 . . . , Un)] .

4. Let ti > 0 be given. From Theorem 16.14, we know,

P (T εi > tini=1) =

n∏i=1

P (T εi > ti) =

n∏i=1

(1− ελ)[ti/ε]]

→n∏i=1

e−λti =

n∏i=1

P (Ti > ti) .

5. Formally,

limε↓0

E [f (W ε1 , . . . ,W

εn)] = lim

ε↓0E [f (T ε1 , T

ε1 + T ε2 , . . . , T

ε1 + T ε2 + · · ·+ T εn)]

= E [f (T1, T1 + T2, . . . , T1 + T2 + · · ·+ Tn)]

= E [f (W1, . . . ,Wn)] .

Hopefully this theorem will serve as sufficient motivation for the definitionsand theorems of the next section.

16.4 Poisson Point Process (Homogeneous Case)

In what follows we will write A ⊂fv Rd is a A ⊂ Rd with |A| < ∞, i.e. A is afinite volume subset of Rd.

Definition 16.18 (Poisson Point Process). A Poisson point process onRd with constant intensity λ is a collection of random variables N (A)A⊂fvRdwith values in N0 with the following properties;

1. N (A)d= Poi (λ |A|) for all A ⊂ Rd,

2. N (Al)kl=1 are independent random variables whenever Alkl=1 are dis-joint subsets of Rd, and

3. if Al∞l=1 are disjoint subsets of Rd then

N (∪∞l=1Al) =

∞∑l=1

N (Al) ,

i.e. N is a counting process.4. N (x) ≤ 1 for all x ∈ Rd.

Let Λ :=x ∈ Rd : N (x) = 1

.

Theorem 16.19. For all A ⊂ Rd, N (A) = # (Λ ∩A) =∑x∈A∩ΛN (x) .

Proof. Let N (A) := # (Λ ∩A) which is a measure. Moreover µ (A) :=N (A) − N (A) is a finite integer valued measure such that µ (x) = 0 forall x ∈ S. I claim that such a measure is necessarily 0. For if µ (S) > 0 thenassuming S is a rectangle (this suffices) we may subdivide S over and overagain to find a sequence of sub-rectangles Sn ⊂ S such that Sn ↓ x or ∅ whileµ (Sn) > 0 for all n. But his would violate the countable additivity of µ asµ (Sn) ↓ µ (x) = 0 or µ (∅) = 0. Therefore we have shown µ ≡ 0 and thereforethat N = N .

Lemma 16.20 (May be safely skipped). If we have a process satisfyingitems 1. – 3. of Definition 16.18, then by throwing out a set of zero probabilitywe may arrange for item 4. to hold as well.

Proof. We suppose for simplicity that |S| < ∞. For each ε > 0 letSεjηεj=1

denote a subdivision of S into disjoint sets with∣∣Sεj ∣∣ = O (ε) .

The “event” N (x) ≥ 2 for some x ∈ S is then contained in the event,Eε := ∪ηεj=1

N(Sεj)≥ 2

where,

P (Eε) =

ηε∑j=1

P(N(Sεj)≥ 2)≤ C

ηε∑j=1

∣∣Sεj ∣∣2 = O (ε)

ηε∑j=1

∣∣Sεj ∣∣ = O (ε) |S| .

Therefore the event E0 := ∩∞n=1E1/n has zero probability andN (x) ≥ 2 for some x ∈ S ⊂ E0. Therefore by replacing our proba-bility space by Ω \ E0 we may assume that N (x) ≤ 1 for all x ∈ Sindependent of the random sample point.

Remark 16.21 (Key facts). Based on Theorem 16.17 we should expect;

Page: 132 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 139: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.4 Poisson Point Process (Homogeneous Case) 133

1. If A ⊂ Rd is a finite volume set, Ui∞i=1 are i.i.d. random vectors uniformly

distributed in A, and νd= Poi (λ |A|) independent of the Ui∞i=1 , then

N (B) :=

ν∑i=1

1B (Ui) for all B ⊂ A

is a Poisson point process on A. Indeed, we have formally we should have

limε↓0

E [f (Ληε ∩A) |ηε (A) = n] = E [f (Λ ∩A) |N (A) = n]

while by Theorem 16.17 we know

limε↓0

E [f (Ληε ∩A) |ηε (A) = n] = E [f (U1 . . . , Un)] .

The distribution of Λ ∩ A given N (A) = n to be the same as the distri-

bution of U1, . . . , Un . Hence if Alkl=1 are subsets of A it follows that

N (Al) = # (Al ∩ Λ)kl=1 given N (A) = n has the same distribution as

∑ni=1 1Al (Ui)

k

l=1, i.e.

P(N (Al) = mlkl=1 |N (A) = n

)= P

n∑i=1

1Al (Ui) = ml

kl=1

.

Multiplying this equation by P (N (A) = n) = P (ν = n) and summing onl implies,

P(N (Al) = mlkl=1

)=

∞∑n=0

P

n∑i=1

1Al (Ui) = ml

kl=1

P (ν = n)

=

∞∑n=0

P

n∑i=1

1Al (Ui) = ml

kl=1

, ν = n

= P

ν∑i=1

1Al (Ui) = ml

kl=1

.

2. Similarly, if Ti∞i=1 are i.i.d. E (λ) – random variables and Wn := T1 +· · ·+ Tn for all n, then

N (A) =

∞∑n=1

1A (Wn) for all A ⊂ R+

is a Poisson point process on R+.

These claims are in fact true and are proved in Theorems 16.23 and 16.29below. If the reader is willing to take these facts for granted she/he may wishto skim most of the rest of this section.

Theorem 16.22. Suppose that A ⊂ Rd is a finite volume set2 and f : An → Ris a bounded symmetric function of its arguments. Then

E [f (Λ ∩A) |N (A) = n] = E [f (U1, . . . , Un)]

where Uini=1 are i.i.d. random vectors with values in A uniformly distributedin A, i.e.

P (Ui ∈ B) :=|B||A|

for all B ⊂ A.

Proof. Let Bini=1 be a collection of disjoint subsets of A and letf (x1, . . . , xn) :=

∑σ∈Sn

∏ni=1 1Bi (xσi) . Then on the event N (A) = n ,

f (Λ ∩A) =

1 if N (Bi) = 1 for 1 ≤ i ≤ n0 otherwise

.

Therefore,

E [f (Λ ∩A) : N (A) = n] = P (N (Bi) = 1 for 1 ≤ i ≤ n and N (A) = n)

= P (N (Bi) = 1 for 1 ≤ i ≤ n and N (A\ ∪i Bi) = 0)

=

N∏i=1

e−λ|Bi|λ |Bi| · e−λ|A\∪iBi| = e−λ|A|N∏i=1

λ |Bi|

whereas,

P (N (A) = n) =(λ |A|)n

n!e−λ|A|

from which it follows that

E [f (Λ ∩A) |N (A) = n] = n! ·N∏i=1

λ |Bi|λ |A|

= n! ·N∏i=1

|Bi||A|

.

On the other hand,

E [f (U1, . . . , Un)] =∑σ∈Sn

En∏i=1

1Bi (Uσi) = n! ·n∏i=1

P (Ui ∈ B) = n! ·N∏i=1

|Bi||A|

.

The result for general f now follows by an approximation argument (π – λtheorem to be precise for general f).

2 We will denote this by A ⊂fv Rd in the future, i.e. A ⊂fv Rd means A ⊂ Rd and|A| <∞.

Page: 133 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 140: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

134 16 Point Processes

Theorem 16.23 (Converse to Theorem 16.22). Suppose A ⊂fv Rd andlet Ui∞i=1 be i.i.d. random vectors uniformly distributed in A ⊂fv Rd and

νd= Poi (λ |A|) with ν independent of the Ui∞i=1 . Then the counting process,

N (B) :=∑νi=1 1B (Ui) , for all B ⊂ A is a Poisson process on A with intensity

λ.

Proof. Suppose that Akmk=1 ⊂ BA is a measurable partition of A. Ifnkmk=1 ⊂ N0 and n := n1 + · · ·+ nm, then on the event, ∩mk=1 N (Ak) = nkwe must have ν = N (A) = n. Therefore,

P (∩mk=1 N (Ak) = nk) = P (∩mk=1 N (Ak) = nk |ν = n)P (ν = n)

= P

(∩mk=1

n∑i=1

1Ak (Ui) = nk

|ν = n

)P (ν = n)

=n!

n1! · · ·nm!

[|A1||A|

]n1

· · ·[|Am||A|

]nmP (ν = n)

=n!

n1! · · ·nm!

[|A1||A|

]n1

· · ·[|Am||A|

]nm (λ |A|)n

n!e−(λ|A|)

=

m∏k=1

(λ |Ak|)nk

nk!e−λ|Ak|.

This shows that the N (Ak)mk=1 are independent and are Poisson randomvariables and that N (Ak) = Poi (λ |Ak|) .

Alternatively; suppose that zk ∈ C and consider,

E

[m∏k=1

zN(Ak)k : ν = n

]= P (ν = n)

1

|A|n∫An

m∏k=1

z

∑n

i=11Ak (si)

k ds1 . . . dsn

= e−λ|A|λn

n!

∫An

m∏k=1

n∏i=1

z1Ak (si)

k ds1 . . . dsn

= e−λ|A|λn

n!

n∏i=1

[∫A

m∏k=1

z1Ak (si)

k dsi

]

= e−λ|A|λn

n!

n∏i=1

(m∑k=1

zk · |Ak|

)

= e−λ|A|1

n!

m∑k=1

zk · |Ak|

)n.

Summing this equation on n shows, From this it follows as in the previoussolution that

E

[m∏k=1

zN(Ak)k

]= e−λ|A| · exp

m∑k=1

zk · |Ak|

)= exp

(m∑k=1

(zk − 1)λ |Ak|

).

(16.11)Taking z2 = · · · = zm = 1 in this equation shows,

E[zN(A1)1

]= exp ((z1 − 1) |A1|)

from which we conclude that N (A1) = Poi (λ |A1|) . Similarly we showN (Ak) = Poi (λ |Ak|) for all k and Eq. (16.11) may also be written as,

E

[m∏k=1

zN(Ak)k

]=

m∏k=1

E[zN(Ak)k

]which proves the independence of the N (Ak)mk=1 .

16.4.1 The homogeneous Poisson process on R+

Now suppose that N (A)A⊂R+is a Poisson point process on R+ with intensity

λ and Nt := N ((0, t]) which is also referred to as the Poisson process whend = 1. Notice that N ((a, b]) = Nb − Na for all 0 ≤ a < b < ∞. Notice thatthe process Ntt≥0 is integer valued, non-decreasing, and right continuous, seeFigure 16.9.

Definition 16.24 (Arrival times). Given a right continuous non-decreasingfunction with values in N0, we define the arrival times Wn∞n=0 of the processby W0 = 0 and then inductively by,

Wn+1 = inf t > Wn : t ∈ Λ = inf t > Wn : N ((0, t]) = n+ 1 .

Furthermore, for each n ∈ N let Tn := Wn−Wn−1. We refer to Tn∞n=1 as theinterarrival times, see Figure 16.9.

Definition 16.25 (Poisson Process I). Let (Ω,P ) be a probability space andNt : Ω → N0 be a random variable for each t ≥ 0. We say that Ntt≥0 is a

Poisson process with intensity λ if; 1) N0 = 0, 2) Nt −Nsd= Poi (λ (t− s)) for

all 0 ≤ s < t <∞, 3) Ntt≥0 has independent increments, and 4) t→ Nt (ω)is right continuous and non-decreasing for all ω ∈ Ω.

Let N∞ (ω) :=↑ limt↑∞Nt (ω) and observe that N∞ =∑∞k=0 (Nk −Nk−1) = ∞ a.s. by Lemma 16.4. Therefore, we may and do

assume that N∞ (ω) =∞ for all ω ∈ Ω.

Page: 134 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 141: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.4 Poisson Point Process (Homogeneous Case) 135

W1

W2

W3

W4

W5 W6

W7

W8 W9W0 W10

T1

T9

T7

Fig. 16.9. A sample path of a Poisson process with the the arival and interarivaltimes indicated on the picture.

Notation 16.26 For each n ∈ N and t > 0 let

∆n (t) := (w1, . . . , wn) ∈ Rn : 0 < w1 < w2 < · · · < wn < t

and let

∆n := ∪t>0∆n (t) = (w1, . . . , wn) ∈ Rn : 0 < w1 < w2 < · · · < wn <∞ .

Theorem 16.27. Suppose that Ntt≥0 is a Poisson process with intensity λas in Definition 16.25,

Wn := inf t : Nt = n for all n ∈ N0

be the first time Nt reaches n. (The Wn∞n=0 are well defined off a set ofmeasure zero and Wn < Wn+1 for all n by the right continuity of Ntt≥0 .)Then for all positive or bounded g : ∆n → R we have

E [g (W1, . . . ,Wn)] =

∫∆n

g (w1, . . . , wn)λne−λwndw1 . . . dwn. (16.12)

Moreover the interarrival times, Tn := Wn −Wn−1∞n=1 , are i.i.d. E (λ) – ran-dom variables.

Proof. Suppose that Ji = (ai, bi] with bi ≤ ai+1 < ∞ for all i. We willbegin by showing

P (∩ni=1 Wi ∈ Ji) = λnn−1∏i=1

m (Ji) ·∫Jn

e−λwndwn (16.13)

= λn∫J1×J2×···×Jn

e−λwndw1 . . . dwn. (16.14)

To show this let Ki := (bi−1, ai] where b0 = 0. Then for Wi ∈ Ji for all i theremust be no jumps in Ki (i.e. N (Ki) = 0) there must be exactly one jump ineach Ji for i < n and at least one jump in Jn, i.e.

∩ni=1 Wi ∈ Ji = [∩ni=1 N (Ki) = 0] ∩[∩n−1i=1 N (Ji) = 1

]∩ N (Jn) ≥ 1 .

Therefore,

P (∩ni=1 Wi ∈ Ji) =

n∏i=1

e−λm(Ki) ·n−1∏i=1

e−λm(Ji)λm (Ji) ·(

1− e−λm(Jn))

= λn−1n−1∏i=1

m (Ji) ·[e−λan − e−λbn

]= λn−1

n−1∏i=1

m (Ji) ·∫Jn

λe−λwndwn.

In other words,

E [1J1×···×Jn (W1, . . . ,Wn)] =

∫Rn

+

1J1×···×Jn (w1, . . . , wn)λne−λwndwn.

By an approximation argument3 we may now conclude Eq. (16.12) holds.Now suppose that f (t1, . . . , tn) with ti ≥ 0 is a given positive or bounded

function and then define g on ∆n by

g (w1, . . . , wn) := f (w1, w2 − w1, . . . , wn − wn−1) .

Using this function in Eq. (16.12) then shows,

E [f (T1, . . . , Tn)] =

∫∆n

f (w1, w2 − w1, . . . , wn − wn−1)λne−λwndw1 . . . dwn

and then making the change of variables ti = wi − wi−1 for each i implies,

E [f (T1, . . . , Tn)] =

∫Rn

+

f (t1, t2, . . . , tn)λne−λ(t1+···+tn)dt1 . . . dtn

=

∫Rn

+

f (t1, t2, . . . , tn)

n∏j=1

λe−λtjdtj

3 The formal argument involves Dynkin’s π – λ theorem and the fact thatσ (J1 × · · · × Jn) = B∆n .

Page: 135 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 142: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

136 16 Point Processes

which shows Tini=1 are i.i.d. exponential random variables with parameter λ.

Corollary 16.28. If Ntt≥0 is a Poisson process as in Definition 16.25, n ∈N, and f : ∆n → R is a positive or bounded function, then

E [f (W1, . . . ,Wn) : Nt = n] = λne−λt∫∆n(t)

f (w1, w2, . . . , wn) dw1 . . . dwn,

(16.15)and

E [f (W1, . . . ,Wn) |Nt = n] =n!

tn

∫∆n(t)

f (w1, w2, . . . , wn) dw1 . . . dwn,

(16.16)and

E [f (W1, . . . ,Wn) |Nt = n] = E[f(U1, . . . , Un

)](16.17)

whereU1, . . . , Un

are the order statistics4 of a sequence, Uini=1 , of i.i.d.

random variables uniformly distributed in [0, t] . (In other words, the distribu-tion of the Wini=1 given Nt = n is the same as the order statistics forU1, . . . , Un .) Moreover, if f : [0, t]

n → R is a symmetric function, then

E [f (W1, . . . ,Wn) |Nt = n] = E [f (U1, . . . , Un)] . (16.18)

Proof. Making use of the observation that Nt = n = Wn ≤ t < Wn+1 ,we may apply Eq. (16.12) at level n+ 1 with

g (w1, w2, . . . , wn+1) = f (w1, w2, . . . , wn) 1wn≤t<wn+1

to learn

E [f (W1, . . . ,Wn) : Nt = n]

=

∫0<w1<···<wn<t<wn+1

f (w1, w2, . . . , wn)λn+1e−λwn+1dw1 . . . dwndwn+1

and then doing the integral over wn+1 gives Eq. (16.15). Dividing Eq. (16.15)

by P (Nt = n) = (λt)n

n! e−λt then gives Eq. (16.16). Equation 16.17 is now easilyproved;

4 That isU1, . . . , Un

is the sequence U1, . . . , Un arranged to be in increasing

order.

E[f(U1, . . . , Un

)]=∑σ∈Sn

E [f (Uσ1, . . . , Uσn) : Uσ1 < Uσ2 < · · · < Uσn]

= n!E [f (U1, . . . , Un) : U1 < U2 < · · · < Un]

=n!

tn

∫∆n(t)

f (w1, w2, . . . , wn) dw1 . . . dwn

= E [f (W1, . . . ,Wn) |Nt = n] .

If we further assume that f : [0, t]n → R is a symmetric function, then

E[f(U1, . . . , Un

)]=∑σ∈Sn

E [f (Uσ1, . . . , Uσn) : Uσ1 < Uσ2 < · · · < Uσn]

=∑σ∈Sn

E [f (U1, . . . , Un) : Uσ1 < Uσ2 < · · · < Uσn]

= E [f (U1, . . . , Un)]

which combined with Eq. (16.17) proves Eq. (16.18).

Theorem 16.29 (Converse to Theorem 16.27). Suppose that Ti∞i=1 arei.i.d. E (λ) – random variables and Wn := T1 + · · · + Tn for all n ∈ N. Thenthe counting process,

N (A) :=

∞∑n=1

1A (Wn) for all A ⊂ R+

is a Poisson process on R+ with intensity λ.

Proof. It is easy to reverse the logic in the proofs of Theorem 16.27 andCorollary 16.28 in order to show the sequence Wn∞n=1 satisfies Eq. (16.18).With this observation, we apply Theorem 16.23 with A = (0, t] for any t > 0 inorder to conclude that N is a Poisson process with intensity λ.

16.5 Examples

Example 16.30 (V.I.E5). Suppose that a random variable X is distributed ac-cording to a Poisson distribution with parameter λ. The parameter λ is itselfa random variable exponentially distributed with density f (x) = θe−θx. Findthe probability mass function for X. By assumption,

P (X = k|λ = x) =xk

k!e−x

and therefore,

Page: 136 job: 180Lec macro: svmonob.cls date/time: 29-Mar-2011/9:44

Page 143: 180B-C Lecture Notes, Winter and Spring, 2011bdriver/math180C_S2011/Lecture Notes/180CLec1b.pdf · Bruce K. Driver 180B-C Lecture Notes, Winter and Spring, 2011 March 29, 2011 File:180Lec.tex

16.6 Poisson Point Process (Non-Homogeneous Case) 137

P (X = k) =

∫ ∞0

P (X = k|λ = x) f (x) dx

=

∫ ∞0

xk

k!e−xθe−θxdx =

∫ ∞0

xk

k!θe−(θ+1)xdx.

Making the change of variables, y = (θ + 1)x, then shows

P (X = k) =θ

k! (θ + 1)k+1

∫ ∞0

yke−ydx

k! (θ + 1)k+1

Γ (k + 1) =θ

(θ + 1)k+1

.

Letting q = 11+θ so that θ = 1

q − 1, the above expression becomes,

P (X = k) =

(1

q− 1

)qk+1 = qk (1− q) = qkp

where p = 1− q = θ1+θ . Thus X is basically geometric with parameter p.

Example 16.31 (V.I.E6). Messages arrive at a telegraph office as a Poisson pro-cess with mean rate of 3 messages per hour. (a) What is the probability thatno messages arrive during the morning hours 8:00 A.M. to noon? (b) What isthe distribution of the time at which the first afternoon message arrives?

a) P (X (12)−X (8) = 0) = e−3·4 = e−12. b) Starting the clock at noon wehave T is distributed according to E (3) , i.e. P (T > t) = e−3(t−12).

Example 16.32 (V.I.E7). Suppose that customers arrive at a facility accordingto a Poisson process having rate λ = 2. Let X(t) be the number of customersthat have arrived up to time t. Determine the following probabilities and con-ditional probabilities:

a P (X(1) = 2) = e−2 22

2! = 2e−2 = 0.270 67.b P (X(1) = 2 and X(3) = 6) = P (X(1) = 2, X(3) − X (1) = 4) = 2e−2 ·

(2·2)4

4! e−2·2 = 643 e−6.

c P (X(1) = 2|X(3) = 6) =643 e−6

(3·2)66! e−6

= 80243 .

d P (X(3) = 6|X(1) = 2) =643 e−6

2e−2 = 323 e−4.

Example 16.33 (V.II.E4). Suppose that a book of 600 pages contains a total of240 typographical errors. Develop a Poisson approximation for the probabilitythat three particular successive pages are error-free. Ans. We have λ = 240

600 = 25

error per page. Thus

P (N ((k, k + 3]) = 0) = e−25 ·3 = e−

65 = e−1. 2 ∼= 0.301 19.

Example 16.34 (V.II.P2). Suppose that 100 tags, numbered 1, 2, ..., 100, are placed into an urn, and 10 tags are drawn successively, with replacement. Let $A$ be the event that no tag is drawn twice. Show that
\[
P(A) = 1\cdot\left(1-\frac{1}{100}\right)\left(1-\frac{2}{100}\right)\cdots\left(1-\frac{9}{100}\right) = 0.6282.
\]
Use the approximation $1-x\cong e^{-x}$ for $x$ near zero to get
\[
P(A) \cong \exp\left(-\frac{1}{100}(1+2+\cdots+9)\right) = e^{-0.45} \cong 0.6376.
\]
Interpret this in terms of the law of rare events.

ANS. Let us generalize the problem to $k$ tags drawn from $N$ tags. (When $N=365$ this is the birthday problem.) We then have
\[
P(A) = \frac{N\cdot(N-1)\cdots(N-k+1)}{N^k}
= 1\cdot\left(1-\frac{1}{N}\right)\cdot\left(1-\frac{2}{N}\right)\cdots\left(1-\frac{k-1}{N}\right)
\cong e^{-\frac{1}{N}}e^{-\frac{2}{N}}\cdots e^{-\frac{k-1}{N}}
= \exp\left(-\frac{1}{N}\cdot\frac{k(k-1)}{2}\right).
\]
If we choose $k$ tags and record their values, then the number of pairs of these values is $\frac{k(k-1)}{2}$. If we let $\varepsilon_{\mathrm{pair}}=1$ if the two tags in a pair agree and $0$ otherwise, then $M:=\sum_{\mathrm{pairs}}\varepsilon_{\mathrm{pair}}$ represents the number of matching pairs. If (which is not true) all of the $\varepsilon_{\mathrm{pair}}$ were independent, we would argue by our Poisson approximation result, using $P(\varepsilon_{\mathrm{pair}}=1)=\frac{1}{N}$, that $M \overset{d}{\cong} \mathrm{Poi}\left(\frac{1}{N}\cdot\frac{k(k-1)}{2}\right)$ and in particular that
\[
P(A) \cong P(M=0) \cong \exp\left(-\frac{1}{N}\cdot\frac{k(k-1)}{2}\right).
\]
This expression is plotted with $N=365$ in Figure 16.10.
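The quality of this approximation (the content of Figure 16.10) can be checked in a few lines of Python; the snippet below is an illustrative addition, not part of the notes.

import numpy as np

def exact_no_match(N, k):
    # P(A) = prod_{j=1}^{k-1} (1 - j/N), the exact probability of no repeats
    return float(np.prod(1.0 - np.arange(1, k) / N))

def poisson_approx(N, k):
    # law-of-rare-events approximation exp(-k(k-1)/(2N))
    return float(np.exp(-k * (k - 1) / (2.0 * N)))

print(exact_no_match(100, 10), poisson_approx(100, 10))    # the tag example above
for k in (10, 23, 40):                                     # birthday problem, N = 365
    print(k, round(exact_no_match(365, k), 4), round(poisson_approx(365, k), 4))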

16.6 Poisson Point Process (Non-Homogeneous Case)

The notions and results in the previous section have a natural generalization to the non-homogeneous setting where we now allow the intensity $\lambda$ to vary with location. In more detail, let $\lambda:\mathbb{R}^d\to(0,\infty)$ be a continuous and bounded (for simplicity) function and replace $\lambda|A|$ in all of the formulas above by
\[
\lambda(A) := \int_A \lambda(x)\, dx_1\dots dx_d \quad\text{for all } A\subset\mathbb{R}^d.
\]
Notice that when $\lambda(x)=\lambda$ is a constant then $\lambda(A)=\lambda|A|$.


Fig. 16.10. The probability of no birthday coincidences versus class size.

Definition 16.35 (Poisson Point Process II). Suppose that $\lambda:\mathbb{R}^d\to(0,\infty)$ is a continuous and bounded (for simplicity) function. A Poisson point process on $\mathbb{R}^d$ with intensity density $\lambda$ is a collection of random variables $\{N(A)\}_{A\subset\mathbb{R}^d}$ with values in $\mathbb{N}_0$ with the following properties;

1. $N(A)\overset{d}{=}\mathrm{Poi}(\lambda(A))$ for all $A\subset\mathbb{R}^d$ where
\[
\lambda(A) := \int_A\lambda(x)\,dx_1\dots dx_d.
\]
2. $\{N(A_l)\}_{l=1}^k$ are independent random variables whenever $\{A_l\}_{l=1}^k$ are disjoint subsets of $\mathbb{R}^d$, and
3. if $\{A_l\}_{l=1}^\infty$ are disjoint subsets of $\mathbb{R}^d$ then
\[
N\left(\cup_{l=1}^\infty A_l\right) = \sum_{l=1}^\infty N(A_l),
\]
i.e. $N$ is a counting process.
4. $N(\{x\})\le 1$ for all $x\in\mathbb{R}^d$. (This restriction guarantees that $N$ carries the same information as the random subset of $\mathbb{R}^d$ defined by
\[
\Lambda := \left\{x\in\mathbb{R}^d : N(\{x\})=1\right\}.
\]
Notice that we may recover $N$ via $N(A)=\#(A\cap\Lambda)$ for all $A\subset\mathbb{R}^d$.)

The next theorem is the direct analogue of Theorem 16.22 and explains the distribution of the points described by a non-homogeneous Poisson point process.

Theorem 16.36 (Distribution of the points). Suppose that $A\subset\mathbb{R}^d$ is a set with $\lambda(A)<\infty$ and $f:A^n\to\mathbb{R}$ is a bounded symmetric function of its arguments. Then
\[
E\left[f(\Lambda\cap A)\,|\,N(A)=n\right] = E\left[f(U_1,\dots,U_n)\right]
\]
where $\{U_i\}_{i=1}^n$ are i.i.d. random vectors with values in $A$ distributed as;
\[
P(U_i\in B) := \lambda_A(B) := \lambda(B)/\lambda(A) \quad\text{for all } B\subset A.
\]

Proof. Let $\{B_i\}_{i=1}^n$ be a collection of disjoint subsets of $A$ and let $f(x_1,\dots,x_n) := \sum_{\sigma\in S_n}\prod_{i=1}^n 1_{B_i}(x_{\sigma i})$. Then on the event $\{N(A)=n\}$,
\[
f(\Lambda\cap A) = \begin{cases} 1 & \text{if } N(B_i)=1 \text{ for } 1\le i\le n, \\ 0 & \text{otherwise.}\end{cases}
\]
Therefore,
\[
E\left[f(\Lambda\cap A) : N(A)=n\right] = P\left(N(B_i)=1 \text{ for } 1\le i\le n \text{ and } N(A)=n\right)
= P\left(N(B_i)=1 \text{ for } 1\le i\le n \text{ and } N(A\setminus\cup_i B_i)=0\right)
= \prod_{i=1}^n e^{-\lambda(B_i)}\lambda(B_i)\cdot e^{-\lambda(A\setminus\cup_i B_i)}
= e^{-\lambda(A)}\prod_{i=1}^n\lambda(B_i),
\]
whereas,
\[
P(N(A)=n) = \frac{\lambda(A)^n}{n!}e^{-\lambda(A)},
\]
from which it follows that
\[
E\left[f(\Lambda\cap A)\,|\,N(A)=n\right] = n!\cdot\prod_{i=1}^n\frac{\lambda(B_i)}{\lambda(A)}.
\]
On the other hand,
\[
E\left[f(U_1,\dots,U_n)\right] = \sum_{\sigma\in S_n}E\prod_{i=1}^n 1_{B_i}(U_{\sigma i}) = n!\cdot\prod_{i=1}^n P(U_i\in B_i) = n!\cdot\prod_{i=1}^n\frac{\lambda(B_i)}{\lambda(A)}.
\]
The result for general $f$ now follows by an approximation argument (the $\pi$ – $\lambda$ theorem to be precise for general $f$).
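In practice Theorem 16.36 says that, conditionally on the number of points in A, the points themselves may be generated as i.i.d. draws from the normalized intensity. The following minimal Python sketch of this recipe is an illustration added here (not from the notes); the region A = [0,1]^2, the intensity lam(x) = 1 + x_1 + x_2, and its bound lam_max = 3 are arbitrary choices, and the i.i.d. draws are produced by rejection sampling.

import numpy as np

rng = np.random.default_rng(2)

def lam(x):
    # illustrative intensity density on A = [0,1]^2 (an arbitrary choice)
    return 1.0 + x[0] + x[1]

def sample_conditional_points(n):
    # given N(A) = n, draw n i.i.d. points with density lam/lam(A) by rejection
    pts, lam_max = [], 3.0             # lam_max = sup of lam on A
    while len(pts) < n:
        x = rng.uniform(0.0, 1.0, size=2)
        if rng.uniform(0.0, lam_max) <= lam(x):
            pts.append(x)
    return np.array(pts)

print(sample_conditional_points(5))    # five points of the conditional configuration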

16.6.1 Why Poisson Processes?

In this section, suppose that $\lambda:\mathbb{R}^d\to(0,\infty)$ is a continuous and bounded (for simplicity) function and $\lambda(A):=\int_A\lambda(x)\,dx$ as above. In what follows below we will write $F(A)=o(\lambda(A))$ provided there exists an increasing function, $\delta:\mathbb{R}_+\to\mathbb{R}_+$, such that $\delta(x)\to 0$ as $x\to 0$ and $|F(A)|\le\lambda(A)\,\delta(\lambda(A))$ for all $A\in\mathcal{M}$.


Lemma 16.37 (Coupling Estimates). Suppose $X$ and $Y$ are any random variables on a probability space, $(\Omega,P)$, and $A\in\mathcal{B}_{\mathbb{R}}$. Then
\[
|P(X\in A)-P(Y\in A)| \le P\left(\{X\in A\}\,\triangle\,\{Y\in A\}\right) \le P(X\ne Y).
\]

Proof. Since $|1_A(X)-1_A(Y)|\le 1$ and $|1_A(X)-1_A(Y)|=0$ on $\{X=Y\}$, it follows that $|1_A(X)-1_A(Y)|\le 1_{X\ne Y}$. Therefore,
\[
|P(X\in A)-P(Y\in A)| = \left|E\left[1_A(X)-1_A(Y)\right]\right| \le E\,|1_A(X)-1_A(Y)| \le E\,1_{X\ne Y} = P(X\ne Y).
\]

Proposition 16.38 (Why Poisson). Also assume that $\{N(A) : A\subset\mathbb{R}^d\}$ is a collection of $\mathbb{N}_0$ – valued random variables with the following properties;

1. If $\{A_j\}_{j=1}^n$ are disjoint subsets of $\mathbb{R}^d$, then $\{N(A_i)\}_{i=1}^n$ are independent random variables and
\[
N\left(\sum_{i=1}^n A_i\right) = \sum_{i=1}^n N(A_i) \quad\text{a.s.}
\]
2. $P(N(A)\ge 2) = o(\lambda(A))$.
3. $|P(N(A)\ge 1)-\lambda(A)| = o(\lambda(A))$.

Then $N(A)\overset{d}{=}\mathrm{Poi}(\lambda(A))$ for all $A\in\mathcal{M}$ and in particular $EN(A)=\lambda(A)$ for all $A\in\mathcal{M}$.

Proof. Let $A\subset\mathbb{R}^d$ and $\varepsilon>0$ be given. Choose a partition $\{A_i^\varepsilon\}_{i=1}^N$ of $A$ such that $\lambda(A_i^\varepsilon)\le\varepsilon$ for all $i$. Let $Z_i := 1_{N(A_i^\varepsilon)\ge 1}$ and $S:=\sum_{i=1}^N Z_i$. Using
\[
N(A) = \sum_{i=1}^N N(A_i^\varepsilon)
\]
and Lemma 16.37, we have
\[
|P(N(A)=k)-P(S=k)| \le P(N(A)\ne S) \le \sum_{i=1}^N P(Z_i\ne N(A_i^\varepsilon)).
\]
Since $\{Z_i\ne N(A_i^\varepsilon)\} = \{N(A_i^\varepsilon)\ge 2\}$ and $P(N(A_i^\varepsilon)\ge 2) = o(\lambda(A_i^\varepsilon))$, it follows that
\[
|P(N(A)=k)-P(S=k)| \le \sum_{i=1}^N\lambda(A_i^\varepsilon)\,\delta(\lambda(A_i^\varepsilon)) \le \sum_{i=1}^N\lambda(A_i^\varepsilon)\,\delta(\varepsilon) = \delta(\varepsilon)\,\lambda(A). \tag{16.19}
\]
On the other hand, $\{Z_i\}_{i=1}^N$ are independent Bernoulli random variables with
\[
P(Z_i=1) = P(N(A_i^\varepsilon)\ge 1),
\]
and we let $a_\varepsilon := \sum_{i=1}^N P(N(A_i^\varepsilon)\ge 1)$. Then by the law of rare events (Theorem 16.10),
\[
\left|P(S=k)-\frac{a_\varepsilon^k}{k!}e^{-a_\varepsilon}\right| \le \sum_{i=1}^N\left[P(N(A_i^\varepsilon)\ge 1)\right]^2 \le \sum_{i=1}^N\left[\lambda(A_i^\varepsilon)+o(\lambda(A_i^\varepsilon))\right]^2
\le \sum_{i=1}^N\lambda(A_i^\varepsilon)^2\,(1+\delta'(\varepsilon))^2 \le (1+\delta'(\varepsilon))^2\,\varepsilon\,\lambda(A). \tag{16.20}
\]
Combining Eqs. (16.19) and (16.20) shows
\[
\left|P(N(A)=k)-\frac{a_\varepsilon^k}{k!}e^{-a_\varepsilon}\right| \le \left[\delta(\varepsilon)+(1+\delta'(\varepsilon))^2\varepsilon\right]\lambda(A), \tag{16.21}
\]
where $a_\varepsilon$ satisfies
\[
|a_\varepsilon-\lambda(A)| = \left|\sum_{i=1}^N\left[P(N(A_i^\varepsilon)\ge 1)-\lambda(A_i^\varepsilon)\right]\right|
\le \sum_{i=1}^N\left|P(N(A_i^\varepsilon)\ge 1)-\lambda(A_i^\varepsilon)\right| \le \sum_{i=1}^N o(\lambda(A_i^\varepsilon))
\le \sum_{i=1}^N\lambda(A_i^\varepsilon)\,|\delta'(\lambda(A_i^\varepsilon))| \le \lambda(A)\,\delta'(\varepsilon).
\]
Hence we may let $\varepsilon\downarrow 0$ in Eq. (16.21) to find
\[
P(N(A)=k) = \frac{(\lambda(A))^k}{k!}e^{-\lambda(A)}.
\]

You may omit reading the rest of this chapter other than Theorem 16.40.


16.7 Construction of Generalized Poisson Processes

Exercise 16.1 (A Generalized Poisson Process II). Continuing the notation as in Definition 16.35. Given $S\subset\mathbb{R}^d$ such that $\lambda(S)<\infty$, let $\{U_i\}_{i=1}^\infty$ be i.i.d. $S$ – valued random variables distributed by
\[
P(U_i\in B) = \frac{\lambda(B)}{\lambda(S)} \quad\text{for all } B\subset S.
\]
Also let $\nu$ be a $\mathrm{Poi}(\lambda(S))$ random variable independent of the $\{U_i\}_{i=1}^\infty$. Show, for $B\subset S$, that $N(B):=\sum_{i=1}^\nu 1_B(U_i)$ is a Poisson point process on $S$ with intensity measure, $\lambda$.

Solution to Exercise (16.1). Suppose that $\{A_k\}_{k=1}^m$ is a measurable partition of $S$, $\{n_k\}_{k=1}^m\subset\mathbb{N}_0$, and $n:=n_1+\cdots+n_m$. Then on the event $\cap_{k=1}^m\{N(A_k)=n_k\}$ we must have $\nu=N(S)=n$. Therefore,
\[
P\left(\cap_{k=1}^m\{N(A_k)=n_k\}\right) = P\left(\cap_{k=1}^m\{N(A_k)=n_k\}\,|\,\nu=n\right)P(\nu=n)
= P\left(\cap_{k=1}^m\left\{\sum_{i=1}^n 1_{A_k}(U_i)=n_k\right\}\,\Big|\,\nu=n\right)P(\nu=n)
\]
\[
= \frac{n!}{n_1!\cdots n_m!}\left[\frac{\lambda(A_1)}{\lambda(S)}\right]^{n_1}\cdots\left[\frac{\lambda(A_m)}{\lambda(S)}\right]^{n_m}P(\nu=n)
= \frac{n!}{n_1!\cdots n_m!}\left[\frac{\lambda(A_1)}{\lambda(S)}\right]^{n_1}\cdots\left[\frac{\lambda(A_m)}{\lambda(S)}\right]^{n_m}\frac{\lambda(S)^n}{n!}e^{-\lambda(S)}
= \prod_{k=1}^m\frac{\lambda(A_k)^{n_k}}{n_k!}e^{-\lambda(A_k)}.
\]
This shows that the $\{N(A_k)\}_{k=1}^m$ are independent Poisson random variables with $N(A_k)\overset{d}{=}\mathrm{Poi}(\lambda(A_k))$.

Alternatively; suppose that $z_k\in\mathbb{C}$ and consider,
\[
E\left[\prod_{k=1}^m z_k^{N(A_k)} : \nu=n\right]
= P(\nu=n)\int_{S^n}\prod_{k=1}^m z_k^{\sum_{i=1}^n 1_{A_k}(s_i)}\,\frac{\lambda(s_1)}{\lambda(S)}\dots\frac{\lambda(s_n)}{\lambda(S)}\,ds_1\dots ds_n
\]
\[
= e^{-\lambda(S)}\frac{1}{n!}\int_{S^n}\prod_{k=1}^m\prod_{i=1}^n z_k^{1_{A_k}(s_i)}\,\lambda(s_1)\,ds_1\dots\lambda(s_n)\,ds_n
= e^{-\lambda(S)}\frac{1}{n!}\int_{S^n}\prod_{i=1}^n\prod_{k=1}^m z_k^{1_{A_k}(s_i)}\,\lambda(s_1)\,ds_1\dots\lambda(s_n)\,ds_n
\]
\[
= e^{-\lambda(S)}\frac{1}{n!}\prod_{i=1}^n\left(\int_S\prod_{k=1}^m z_k^{1_{A_k}(s_i)}\,\lambda(s_i)\,ds_i\right)
= e^{-\lambda(S)}\frac{1}{n!}\prod_{i=1}^n\left(\sum_{k=1}^m z_k\,\lambda(A_k)\right)
= e^{-\lambda(S)}\frac{1}{n!}\left(\sum_{k=1}^m z_k\,\lambda(A_k)\right)^n.
\]
Summing this equation on $n$ implies,
\[
E\left[\prod_{k=1}^m z_k^{N(A_k)}\right] = \exp\left(\sum_{k=1}^m(z_k-1)\,\lambda(A_k)\right),
\]
from which we conclude $\{N(A_i)\}_{i=1}^m$ are independent Poisson random variables with $N(A_i)\overset{d}{=}\mathrm{Poi}(\lambda(A_i))$.
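The construction in Exercise 16.1 translates directly into a simulation recipe: draw the total number of points from Poi(lambda(S)) and then scatter that many i.i.d. points with density lambda/lambda(S). The minimal Python sketch below is an added illustration (not from the notes); the choices S = [0,1] and lambda(x) = 2x, so that lambda(S) = 1, are arbitrary, and the check compares the mean and variance of N([0,1/2]) with lambda([0,1/2]) = 1/4.

import numpy as np

rng = np.random.default_rng(3)
lam_S, trials = 1.0, 50000             # lambda(S) = integral of 2x over [0,1] = 1

def sample_ppp():
    # Exercise 16.1 construction: nu ~ Poi(lambda(S)), then nu i.i.d. points
    nu = rng.poisson(lam_S)
    # the density 2x on [0,1] has cdf x^2, so inverse-cdf sampling gives sqrt(U)
    return np.sqrt(rng.uniform(0.0, 1.0, size=nu))

counts = np.array([np.sum(sample_ppp() <= 0.5) for _ in range(trials)])
print(counts.mean(), counts.var())     # both should be close to 1/4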

Exercise 16.2 (A Generalized Poisson Process III). Continuing the notation as in Definition 16.35. Further choose $\{S_l\}_{l=1}^\infty$ to be a partition of $\mathbb{R}^d$ such that $0<\lambda(S_l)<\infty$ for all $l$. Suppose that for each $l\in\mathbb{N}$, $N_l$ is a Poisson process on $\mathbb{R}^d$ with intensity measure $\lambda_l$ defined by
\[
\lambda_l(B) = \lambda(B\cap S_l) := \int_{S_l\cap B}\lambda(x)\,dx \quad\text{for all } B\subset\mathbb{R}^d.
\]
(Such a process was shown to exist in Exercise 16.1.) Assume the $\{N_l\}_{l=1}^\infty$ are all independent. Show that $N:=\sum_{l=1}^\infty N_l$ is a Poisson point process on $\mathbb{R}^d$ with intensity measure, $\lambda$. To be more precise, observe that $N$ is a random counting measure on $\mathbb{R}^d$ which satisfies (as you should show);

1. For each $A\subset\mathbb{R}^d$ with $\lambda(A)<\infty$, show $N(A)\overset{d}{=}\mathrm{Poi}(\lambda(A))$. Also show $N(A)=\infty$ a.s. if $\lambda(A)=\infty$.
2. If $\{A_k\}_{k=1}^m\subset\mathbb{R}^d$ are disjoint sets with $\lambda(A_k)<\infty$, show $\{N(A_k)\}_{k=1}^m$ are independent random variables.


Solution to Exercise (16.2). We prove each item in turn.

1. For $A\subset\mathbb{R}^d$ we have $N(A)=\sum_{l=1}^\infty N_l(A)$ where $\{N_l(A)\}_{l=1}^\infty$ are independent Poisson random variables with $N_l(A)\overset{d}{=}\mathrm{Poi}(\lambda_l(A))$ and $\lambda(A)=\sum_{l=1}^\infty\lambda_l(A)$. The claims in the first item now follow from Lemmas 16.3 and 16.4.
2. If $\{A_k\}_{k=1}^m\subset\mathbb{R}^d$ are disjoint sets with $\lambda(A_k)<\infty$, then $\{(N_1(A_k),N_2(A_k),\dots)\}_{k=1}^m$ are independent sequences and therefore
\[
\left\{N(A_k)=\sum_{l=1}^\infty N_l(A_k)\right\}_{k=1}^m
\]
are independent.

There is one more case of interest which is the so called marked Poisson process.

Definition 16.39 (Marked Poisson Process). Keeping the notation above, let $\{W_n\}_{n=1}^\infty$ be the arrival times of a homogeneous Poisson process on $\mathbb{R}_+$ with intensity $\lambda$ and let $\{Y_n\}_{n=1}^\infty$ be i.i.d. random functions with values in some measure space $S$. We refer to $\{(W_n,Y_n)\}_{n=1}^\infty$ as a marked Poisson process. We let $\eta$ be the associated counting process on $\mathbb{R}_+\times S$ defined by,
\[
\eta(B) := \sum_{k=1}^\infty 1_B(W_k,Y_k) \quad\text{for all } B\subset\mathbb{R}_+\times S.
\]

Theorem 16.40 (Marked Poisson Process). Continuing the notation used in Definition 16.39, the counting process $\eta$ is a non-homogeneous Poisson process with intensity measure, $\lambda(t)\,dt\cdot\rho(ds)$, where $\rho$ is the distribution of $Y_n$.

Proof. The best way to prove this is as follows. Let $T<\infty$ be given and restrict our considerations to $(0,T]\times S$ for the moment, and let $\{U_i\}_{i=1}^\infty$ be i.i.d. random variables uniformly distributed in $(0,T]$ which are independent of the $\{(W_n,Y_n)\}_{n=1}^\infty$.

If $f$ is any symmetric function on $[(0,T]\times S]^n$ we have
\[
E\left[f((W_1,Y_1),\dots,(W_n,Y_n))\,|\,N_T=n\right] = E\left[F(W_1,\dots,W_n)\,|\,N_T=n\right],
\]
where
\[
F(w_1,\dots,w_n) := \int_S\dots\int_S d\rho(y_1)\dots d\rho(y_n)\,f((w_1,y_1),\dots,(w_n,y_n))
\]
for all $w_i\in[0,T]$. As $F$ is still a symmetric function and $(W_1,\dots,W_n)$ given $\{N_T=n\}$ are distributed according to the order statistics of $(U_1,\dots,U_n)$ by Theorem 16.22, we may conclude,
\[
E\left[F(W_1,\dots,W_n)\,|\,N_T=n\right] = E\left[F(U_1,\dots,U_n)\right] = E\left[f((U_1,Y_1),\dots,(U_n,Y_n))\right],
\]
where the last equality follows from our assumption that the $\{U_i\}$ are independent of the $\{(W_n,Y_n)\}_{n=1}^\infty$. Furthermore this assumption guarantees that $N_T\overset{d}{=}\mathrm{Poi}(\lambda T)$ and is independent of $\{(U_i,Y_i)\}_{i=1}^\infty$. Therefore it follows that $\tilde\eta(B) := \sum_{k=1}^{N_T}1_B(U_k,Y_k)$ is a Poisson process on $(0,T]\times S$ with intensity measure $\lambda\,dt\cdot\rho(ds)$ according to Exercise 16.1. Moreover, the above considerations show that $\eta$ and $\tilde\eta$ have the same distribution and hence $\eta$ is also a Poisson process on $(0,T]\times S$ with intensity measure $\lambda\,dt\cdot\rho(ds)$.
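Theorem 16.40 also yields a simulation recipe for marked processes: generate the homogeneous arrival times and attach independent marks. The short Python sketch below is an added illustration (not from the notes); the rate, horizon, choice of E(1) marks, and the test window (1,3] x [0,1) are all arbitrary, and the check is that the count of marked points in that window has mean and variance close to lam * |(1,3]| * rho([0,1)).

import numpy as np

rng = np.random.default_rng(4)
lam, T, trials = 1.5, 4.0, 30000       # illustrative rate and time horizon

def marked_count():
    n = rng.poisson(lam * T)                   # N_T ~ Poi(lam*T)
    W = rng.uniform(0.0, T, size=n)            # arrival times given N_T = n (Theorem 16.22)
    Y = rng.exponential(1.0, size=n)           # i.i.d. marks, here E(1) for illustration
    return np.sum((W > 1.0) & (W <= 3.0) & (Y < 1.0))

counts = np.array([marked_count() for _ in range(trials)])
expected = lam * (3.0 - 1.0) * (1.0 - np.exp(-1.0))   # lam * |(1,3]| * rho([0,1))
print(counts.mean(), counts.var(), expected)          # all approximately equal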

Before leaving this section we want to specialize our considerations to the case where $S=\mathbb{R}_+=[0,\infty)$.

Theorem 16.41. Let $\{W_n\}_{n=0}^\infty$ and $\{T_n\}_{n=1}^\infty$ be as in Definition 16.24. Then
\[
E\left[f(W_1,\dots,W_n)\right] = \int_{\Delta_n}f(w_1,\dots,w_n)\,e^{-\lambda((0,w_n])}\,\lambda(dw_1)\dots\lambda(dw_n) \tag{16.22}
\]
and
\[
P(T_n\in[t,t+dt]\,|\,W_1,\dots,W_{n-1}) = e^{-\lambda((W_{n-1},W_{n-1}+t])}\,\lambda(W_{n-1}+t)\,dt, \tag{16.23}
\]
or equivalently put,
\[
P(T_n>t\,|\,W_1,\dots,W_{n-1}) = e^{-\lambda((W_{n-1},W_{n-1}+t])} \quad\text{for all } t\ge 0. \tag{16.24}
\]

Proof. Let $J_i:=(a_i,b_i]$ with $b_i<a_{i+1}$ for each $i$, and let $s=a_n$ and $t:=b_n$. Then
\[
P(W_i\in J_i \text{ for } 1\le i\le n)
= P\left(N(J_i)=1 \text{ for } 1\le i<n,\ N(J_n)\ge 1,\ \text{and } N((0,t]\setminus\cup J_i)=0\right)
\]
\[
= \left[\prod_{i=1}^{n-1}\lambda(J_i)\,e^{-\lambda(J_i)}\right]\cdot\left[1-e^{-\lambda(J_n)}\right]\cdot e^{-\lambda((0,t]\setminus\cup J_i)}
= \prod_{i=1}^{n-1}\lambda(J_i)\cdot\left[e^{-\lambda((0,s])}-e^{-\lambda((0,t])}\right]
\]
\[
= \int_{\mathbb{R}_+^n}\lambda(dw_1)\dots\lambda(dw_n)\,1_{J_1}(w_1)\dots 1_{J_n}(w_n)\,e^{-\lambda((0,w_n])}, \tag{16.25}
\]
wherein we have used the identity,


\[
e^{-\lambda((0,s])}-e^{-\lambda((0,t])} = \int_t^s\frac{d}{d\tau}e^{-\lambda((0,\tau])}\,d\tau = -\int_t^s e^{-\lambda((0,\tau])}\,\lambda(\tau)\,d\tau
= \int_s^t e^{-\lambda((0,\tau])}\,\lambda(\tau)\,d\tau = \int_s^t e^{-\lambda((0,\tau])}\,\lambda(d\tau).
\]

Equation (16.22) now follows from Eq. (16.25) and a limiting argument (technically the $\pi$ – $\lambda$ theorem is used here). Using this result it also follows that
\[
E\left[f(W_1,\dots,W_{n-1},T_n)\,g(W_1,\dots,W_{n-1})\right]
= E\left[f(W_1,\dots,W_{n-1},W_n-W_{n-1})\,g(W_1,\dots,W_{n-1})\right]
\]
\[
= \int_{\Delta_n}f(w_1,\dots,w_{n-1},w_n-w_{n-1})\,g(w_1,\dots,w_{n-1})\,e^{-\lambda((0,w_n])}\,\lambda(dw_1)\dots\lambda(dw_n)
\]
\[
= \int_{\Delta_{n-1}}h(w_1,\dots,w_{n-1})\,g(w_1,\dots,w_{n-1})\,e^{-\lambda((0,w_{n-1}])}\,\lambda(dw_1)\dots\lambda(dw_{n-1})
= E\left[h(W_1,\dots,W_{n-1})\,g(W_1,\dots,W_{n-1})\right] \tag{16.26}
\]
where,
\[
h(w_1,\dots,w_{n-1}) = \int_{w_{n-1}}^\infty f(w_1,\dots,w_{n-1},w_n-w_{n-1})\,e^{-\lambda((w_{n-1},w_n])}\,\lambda(w_n)\,dw_n
= \int_0^\infty f(w_1,\dots,w_{n-1},t)\,e^{-\lambda((w_{n-1},w_{n-1}+t])}\,\lambda(w_{n-1}+t)\,dt,
\]
wherein we have made the change of variables $w_n=w_{n-1}+t$. As Eq. (16.26) holds for all $g$ we have shown,
\[
E\left[f(W_1,\dots,W_{n-1},T_n)\,|\,W_1,\dots,W_{n-1}\right] = h(W_1,\dots,W_{n-1})
= \int_0^\infty f(W_1,\dots,W_{n-1},t)\,e^{-\lambda((W_{n-1},W_{n-1}+t])}\,\lambda(W_{n-1}+t)\,dt.
\]
This suffices to prove Eq. (16.23) and using this result we also find,
\[
P(T_n>t\,|\,W_1,\dots,W_{n-1}) = \int_t^\infty e^{-\lambda((W_{n-1},W_{n-1}+\tau])}\,\lambda(W_{n-1}+\tau)\,d\tau
= \int_{\lambda((W_{n-1},W_{n-1}+t])}^\infty e^{-u}\,du = e^{-\lambda((W_{n-1},W_{n-1}+t])},
\]
wherein we have made the change of variable, $u=\lambda((W_{n-1},W_{n-1}+\tau])$.

Theorem 16.42 (Converse to Theorem 16.41). Suppose that $\{W_n\}_{n=0}^\infty$ is a sequence of non-negative random variables such that Eq. (16.22) holds for all $f$ and $n$. Then the counting process, $N(A):=\sum_{n=1}^\infty 1_A(W_n)$ for all $A\subset(0,\infty)$, is a non-homogeneous Poisson process with intensity measure, $d\lambda(t)=\lambda(t)\,dt$.

Proof. The proof is similar to the proof of Theorem 16.29 and will thus be omitted.

16.8 Bernoulli point process (Non-Homogeneous Case)

The next result generalizes Theorem 16.13 to the non-homogeneous case.

Theorem 16.43 (Non-uniform Bernoulli process). Suppose that $A\subset_f S$ and $C\subset A$ is any subset with $n$ elements where $0\le n\le a:=\#(A)$. Then
\[
P(C=A\cap\Lambda\,|\,\eta(A)=n) = \frac{\prod_{x\in C}\frac{p(x)}{q(x)}}{\sum_{D\subset A:\#D=n}\prod_{x\in D}\frac{p(x)}{q(x)}}. \tag{16.27}
\]
Moreover, if $\{U_i\}_{i=1}^n$ are i.i.d. random functions with values in $A$ so that
\[
P(U_i=x) = p_A(x) := \frac{1}{Z_A}\frac{p(x)}{q(x)} \quad\text{with}\quad Z_A := \sum_{x\in A}\frac{p(x)}{q(x)}, \tag{16.28}
\]
then the distribution of the random set $\Lambda\cap A$ given $\{\eta(A)=n\}$ is the same as the distribution of the random set $\{U_1,\dots,U_n\}$ given $\Gamma_n$, where $\Gamma_n$ is the event of no coincidences as in Eq. (16.6).

Even more generally, if we let $\{U_i\}_{i=1}^n$ be independent functions on some measure space $(\Gamma,\mu)$ such that $\mu(U_i=x)=p(x)/q(x)$, then for any subset $C\subset A$ of size $n$,
\[
P(\Lambda\cap A=C\,|\,\eta(A)=n) = \mu\left(\{U_1,\dots,U_n\}=C\,|\,\Gamma_n\right)
\]
where,
\[
\Gamma_n := \Gamma_n(A) := \left[\cup_{i\ne j}\{U_i=U_j\}\right]^c\cap\,\cap_{i=1}^n\{U_i\in A\}.
\]
(One can take $\Gamma=S^n$ and $\mu(\{(x_1,\dots,x_n)\})=\prod_{i=1}^n p(x_i)/q(x_i)$.)

Proof. Let $K_A:=\prod_{y\in A}q(y)$. Given $C\subset A$ with $\#(C)=n$ and $a:=\#(A)$, we find,
\[
P(C=A\cap\Lambda) = P(\eta(C)=n,\ \eta(A)=n) = P(\eta(C)=n,\ \eta(A\setminus C)=0)
= \prod_{x\in C}p(x)\cdot\prod_{y\in A\setminus C}q(y) = K_A\cdot\prod_{x\in C}\frac{p(x)}{q(x)},
\]
while
\[
P(\eta(A)=n) = K_A\cdot\sum_{D\subset A:\#D=n}\prod_{x\in D}\frac{p(x)}{q(x)}.
\]


Therefore,
\[
P(C=A\cap\Lambda\,|\,\eta(A)=n) = \frac{K_A\cdot\prod_{x\in C}\frac{p(x)}{q(x)}}{K_A\cdot\sum_{D\subset A:\#D=n}\prod_{x\in D}\frac{p(x)}{q(x)}},
\]
which immediately leads to Eq. (16.27).

On the other hand, if $C=\{x_1,\dots,x_n\}$ where the $x_i$ are all distinct points in $A$, then
\[
\{\{U_1,\dots,U_n\}=C\} = \cup_{\sigma\in S_n}\{U_1=x_{\sigma 1},\dots,U_n=x_{\sigma n}\}.
\]
Therefore,
\[
\mu\left(\{U_1,\dots,U_n\}=C,\ \Gamma_n\right) = \mu\left(\{U_1,\dots,U_n\}=C\right)
= \sum_{\sigma\in S_n}\mu\left(U_1=x_{\sigma 1},\dots,U_n=x_{\sigma n}\right)
= \sum_{\sigma\in S_n}\prod_{i=1}^n p_A(x_{\sigma i}) = n!\cdot\prod_{i=1}^n p_A(x_i) = n!\cdot\prod_{x\in C}p_A(x).
\]
Since
\[
\Gamma_n = \Gamma_n(A) = \cup_{C\subset A;\#(C)=n}\{\{U_1,\dots,U_n\}=C\}
\]
it follows that
\[
P(\Gamma_n) = \sum_{C\subset A;\#(C)=n}P\left(\{U_1,\dots,U_n\}=C\right) = \sum_{C\subset A;\#(C)=n}n!\cdot\prod_{x\in C}p_A(x)
\]
and therefore
\[
P\left(\{U_1,\dots,U_n\}=C\,|\,\Gamma_n\right) = \frac{n!\cdot\prod_{x\in C}p_A(x)}{\sum_{D\subset A;\#(D)=n}n!\cdot\prod_{x\in D}p_A(x)}
= \frac{\prod_{x\in C}p_A(x)}{\sum_{D\subset A;\#(D)=n}\prod_{x\in D}p_A(x)}
= \frac{\prod_{x\in C}p(x)/q(x)}{\sum_{D\subset A;\#(D)=n}\prod_{x\in D}p(x)/q(x)}. \tag{16.29}
\]
Combining Eqs. (16.27) and (16.29) completes the proof.

Remark 16.44. The results of Theorem 16.13 are of course a special case of those in Theorem 16.43 upon realizing that, in the homogeneous case,
\[
p_A := \frac{p/q}{a\cdot p/q} = \frac{1}{a}
\]
and that Eq. (16.27) reduces to Eq. (16.5) as follows;
\[
P(C=A\cap\Lambda\,|\,\eta(A)=n) = \frac{\prod_{x\in C}\frac{p}{q}}{\sum_{D\subset A:\#D=n}\prod_{x\in D}\frac{p}{q}} = \frac{1}{\sum_{D\subset A:\#D=n}1} = \frac{1}{\binom{\#(A)}{n}}.
\]

We would also like to pay special attention to the case where $S=S_\varepsilon=\varepsilon\cdot\mathbb{N}\subset\mathbb{R}_+$ for some $\varepsilon>0$. In this setting we will make use of the order on $S$ by defining $W_0=0$ and then defining $W_n$ inductively by
\[
W_n = \min\{t\in S : t>W_{n-1}\text{ and }Y_t=1\}.
\]
So $W_n$ is the time that the $n^{\mathrm{th}}$ 1 appears in the sequence $(Y_\varepsilon,Y_{2\varepsilon},Y_{3\varepsilon},\dots)$. We further let $T_n:=W_n-W_{n-1}$ so that $T_n$ is the time between "events" $n-1$ and $n$. For example, if $(Y_\varepsilon,Y_{2\varepsilon},Y_{3\varepsilon},\dots)=(0,0,1,0,1,0,0,0,1,0,1,1,0,\dots)$, then $W_1=3\varepsilon$, $W_2=5\varepsilon$, $W_3=9\varepsilon$, $W_4=11\varepsilon$, $W_5=12\varepsilon,\dots$ and $T_1=3\varepsilon$, $T_2=2\varepsilon$, $T_3=4\varepsilon$, $T_4=2\varepsilon$, $T_5=\varepsilon,\dots$. Notice that $W_n=T_1+\cdots+T_n$ for $n\ge 1$. We refer to $W_n$ as the waiting time for the $n^{\mathrm{th}}$ event and $T_n$ as the $n^{\mathrm{th}}$ interarrival time.

Remark 16.45. Notice that
\[
P(W_1>t) = \prod_{0<x\le t}q(x)
\]
and so
\[
P(W_1=\infty) = \prod_{0<x<\infty}q(x) = \prod_{0<x<\infty}[1-p(x)],
\]
which is positive iff $\sum_{x\in S}p(x)<\infty$. On the other hand, if $\sum_{x\in S}p(x)=\infty$, then $P(W_n=\infty)=0$ for all $n$. For example, when $n=2$,
\[
\{W_2=\infty\} = \{W_1=\infty\}\cup\{W_2=\infty,\,W_1<\infty\}
\]
and therefore,
\[
P(W_2=\infty) = P(W_1=\infty)+P(W_2=\infty,\,W_1<\infty) = 0+P(W_2=\infty,\,W_1<\infty).
\]
Since $\sum_{x\in S}p(x)=\infty$ implies $\prod_{w_1<x}q(x)=0$ for all $w_1<\infty$, it follows that


\[
P(W_2=\infty,\,W_1<\infty) = \sum_{0<w_1<\infty}\prod_{0<x<w_1}q(x)\cdot p(w_1)\cdot\prod_{w_1<x}q(x)
= \sum_{0<w_1<\infty}\prod_{0<x<w_1}q(x)\cdot p(w_1)\cdot 0 = 0.
\]
The full induction argument is now left to the reader.

For the rest of this section we will assume $\sum_{x\in S}p(x)=\infty$. It is not necessarily the case that $EW_1<\infty$ under this assumption. For example, if $S=\mathbb{N}$ and $p(x)=\frac{1}{x}$, then $\sum_{x=1}^\infty p(x)=\infty$ while
\[
EW_1 = \sum_{t>0}P(W_1>t) = \sum_{t>0}\prod_{0<x\le t}q(x) = \sum_{t>0}\prod_{0<x\le t}[1-p(x)]
\sim \sum_{t>0}\exp\left(-\sum_{0<x\le t}p(x)\right) = \sum_{t>0}\exp\left(-\sum_{0<x\le t}\frac{1}{x}\right)
\sim \sum_{t>0}\exp(-\ln t) = \sum_{t>0}\frac{1}{t} = \infty.
\]

Theorem 16.46. Continuing the above notation, the distributions of the $T_n$ are determined inductively by;
\[
P(T_{n+1}>t\,|\,W_1,\dots,W_n) = \prod_{\tau\in S_\varepsilon:\,W_n<\tau\le W_n+t}q(\tau) \quad\text{for all } t\in S_\varepsilon. \tag{16.30}
\]

Proof. The simple proof is left to the reader.

Corollary 16.47. If $p(x)=p$ is independent of $x$ in Theorem 16.46, then Eq. (16.30) reduces to
\[
P(T_{n+1}>t\,|\,W_1,\dots,W_n) = q^{t/\varepsilon} \quad\text{for all } t\in S_\varepsilon. \tag{16.31}
\]
In particular, $T_{n+1}$ is independent of $W_1,\dots,W_n$, and consequently $\{T_n\}_{n=1}^\infty$ are i.i.d. random variables such that $\frac{1}{\varepsilon}T_n\overset{d}{=}\mathrm{Geo}(p)$.
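Corollary 16.47 is easy to confirm by simulation. The following minimal Python sketch is an added illustration (not from the notes; eps and p are arbitrary choices): it generates a long Bernoulli sequence on the lattice S_eps, extracts the interarrival times T_n, and compares the mean of (1/eps) T_n with the Geo(p) mean 1/p (under the convention that Geo(p) takes values in {1, 2, ...}).

import numpy as np

rng = np.random.default_rng(5)
eps, p, steps = 0.1, 0.05, 200000          # illustrative lattice spacing and success probability

Y = rng.uniform(size=steps) < p            # Y_eps, Y_{2 eps}, ... as Bernoulli(p) variables
W = eps * (np.flatnonzero(Y) + 1)          # waiting times W_n = lattice sites where Y = 1
T = np.diff(np.concatenate(([0.0], W)))    # interarrival times T_n = W_n - W_{n-1}

print((T / eps).mean(), 1.0 / p)           # both should be close to 1/p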

Theorem 16.48. For any bounded or non-negative function $f(w_1,\dots,w_n)$ with $w_1<w_2<\cdots<w_n$ in $S_\varepsilon$ we have
\[
E\left[f(W_1,\dots,W_n)\right] = \sum_{0<w_1<w_2<\cdots<w_n<\infty}f(w_1,\dots,w_n)\,Q(w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}
\]
where
\[
Q(w) := \prod_{x\in(0,w]}q(x)
\]
and $(0,w]:=\{x\in S_\varepsilon : 0<x\le w\}$ for all $w\in S_\varepsilon$.

Proof. We have
\[
E\left[f(W_1,\dots,W_n)\right]
= \sum_{0<w_1<w_2<\cdots<w_n<\infty}f(w_1,\dots,w_n)\prod_{x\in(0,w_n]\setminus\{w_1,\dots,w_n\}}q(x)\cdot\prod_{i=1}^n p(w_i)
\]
\[
= \sum_{0<w_1<w_2<\cdots<w_n<\infty}f(w_1,\dots,w_n)\prod_{x\in(0,w_n]}q(x)\cdot\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}
= \sum_{0<w_1<w_2<\cdots<w_n<\infty}f(w_1,\dots,w_n)\,Q(w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}.
\]

Corollary 16.49. Suppose that $t\in S_\varepsilon$ and $n\in\mathbb{N}$ are given such that $t/\varepsilon\ge n$ and let $\{U_i\}_{i=1}^n$ be i.i.d. random variables with values in $(0,t]\cap S_\varepsilon$ such that $P(U_i=x)=c\cdot\frac{p(x)}{q(x)}$ for all $x\in(0,t]$, where $c^{-1}:=\sum_{x\in(0,t]}p(x)/q(x)$. Let $\Gamma_n=\left[\cup_{i\ne j}\{U_i=U_j\}\right]^c$ be the event where no two of the $\{U_i\}_{i=1}^n$ are equal and let $(\tilde U_1,\dots,\tilde U_n)$ be the order statistics of $(U_1,\dots,U_n)$. Then
\[
E\left[f(W_1,\dots,W_n)\,|\,N_t=n\right] = E\left[f(\tilde U_1,\dots,\tilde U_n)\,|\,\Gamma_n\right]
\]
for all bounded or non-negative functions $f(w_1,\dots,w_n)$ with $w_1<w_2<\cdots<w_n$.

Proof. Since $\{N_t=n\}=\{W_n\le t<W_{n+1}\}$ it follows from Theorem 16.48 that
\[
E\left[f(W_1,\dots,W_n):N_t=n\right]
= \sum_{0<w_1<w_2<\cdots<w_n\le t<w_{n+1}<\infty}f(w_1,\dots,w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}\cdot Q(w_{n+1})\,\frac{p(w_{n+1})}{q(w_{n+1})}.
\]
Taking $f\equiv 1$ in the formula shows,
\[
P(N_t=n) = \sum_{0<w_1<w_2<\cdots<w_n\le t<w_{n+1}<\infty}\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}\cdot Q(w_{n+1})\,\frac{p(w_{n+1})}{q(w_{n+1})}
= Z(n,t)\cdot\sum_{t<w<\infty}Q(w)\,\frac{p(w)}{q(w)}
\]
where


\[
Z(n,t) := \sum_{0<w_1<w_2<\cdots<w_n\le t}\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}.
\]

Therefore,
\[
E\left[f(W_1,\dots,W_n)\,|\,N_t=n\right] = \frac{1}{Z(n,t)}\sum_{0<w_1<w_2<\cdots<w_n\le t}f(w_1,\dots,w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}.
\]
On the other hand,
\[
E\left[f(\tilde U_1,\dots,\tilde U_n):\Gamma_n\right] = \sum_{\sigma\in S_n}E\left[f(U_{\sigma 1},\dots,U_{\sigma n}):U_{\sigma 1}<\cdots<U_{\sigma n}\right]
= n!\cdot E\left[f(U_1,\dots,U_n):U_1<\cdots<U_n\right]
= n!\,c^n\cdot\sum_{0<w_1<w_2<\cdots<w_n\le t}f(w_1,\dots,w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}.
\]
Taking $f\equiv 1$ in this equation shows
\[
P(\Gamma_n) = n!\,c^n\cdot\sum_{0<w_1<w_2<\cdots<w_n\le t}\prod_{i=1}^n\frac{p(w_i)}{q(w_i)} = n!\,c^n\cdot Z(n,t)
\]
and therefore,
\[
E\left[f(\tilde U_1,\dots,\tilde U_n)\,|\,\Gamma_n\right] = \frac{1}{Z(n,t)}\sum_{0<w_1<w_2<\cdots<w_n\le t}f(w_1,\dots,w_n)\prod_{i=1}^n\frac{p(w_i)}{q(w_i)}
= E\left[f(W_1,\dots,W_n)\,|\,N_t=n\right]
\]
as claimed.

16.9 The Continuum limit in the non-homogeneous case

Suppose that $\rho:\mathbb{R}^d\to(0,\infty)$ is a positive bounded continuous function and for $\varepsilon>0$ let $S_\varepsilon=\varepsilon\mathbb{Z}^d$ and $p_\varepsilon(x):=\varepsilon^d\rho(x)$. (If $\varepsilon>0$ is sufficiently small, then $0<p_\varepsilon(x)<1$ for all $x\in\mathbb{R}^d$. From now on we assume that $\varepsilon>0$ is small enough so that this happens.) Further let $\{Y_x\}_{x\in S_\varepsilon}$ be independent Bernoulli random variables with $P(Y_x=1)=p_\varepsilon(x)$ for all $x\in S_\varepsilon$. Given a subset $A\subset\mathbb{R}^d$ we let $A_\varepsilon:=A\cap S_\varepsilon$, $\eta_\varepsilon(A):=\sum_{x\in A_\varepsilon}Y_x$, and $\Lambda_\varepsilon:=\{x\in S_\varepsilon : \eta_\varepsilon(\{x\})=1\}$. Notice that if $f:A^n\to\mathbb{R}$ is a symmetric function of its arguments and $C\subset A$ is a subset with $n$ elements, we may define $f(C):=f(x_1,\dots,x_n)$ where $x_1,\dots,x_n$ is a listing of the elements of $C$ in any order we choose.

Theorem 16.50. Keeping the notation and assumptions above, if $A\subset\mathbb{R}^d$ is a nice bounded set, then $\eta_\varepsilon(A)\Longrightarrow\mathrm{Poi}\left(\int_A\rho(x)\,dx\right)$. Moreover, if $f:A^n\to\mathbb{R}$ is a symmetric function of its arguments, then
\[
\lim_{\varepsilon\downarrow 0}E\left[f(\Lambda_\varepsilon\cap A)\,|\,\eta_\varepsilon(A)=n\right] = \frac{\int_{A^n}f(x)\,\rho(x_1)\dots\rho(x_n)\,dx_1\dots dx_n}{\left[\int_A\rho(x)\,dx\right]^n}. \tag{16.32}
\]

Proof. First assertion. By definition,
\[
\eta_\varepsilon(A) = \sum_{x\in A_\varepsilon}Y_x
\]
where $P(Y_x=1)=\rho(x)\varepsilon^d$. So according to the law of rare events,
\[
d_{TV}\left(\eta_\varepsilon(A),\ \mathrm{Poi}\left(\sum_{x\in A_\varepsilon}\rho(x)\varepsilon^d\right)\right) \le \sum_{x\in A_\varepsilon}\rho^2(x)\,\varepsilon^{2d}
= \varepsilon^d\cdot\sum_{x\in A_\varepsilon}\rho^2(x)\,\varepsilon^d \to 0 \text{ as }\varepsilon\downarrow 0.
\]
Since $\sum_{x\in A_\varepsilon}\rho(x)\varepsilon^d\to\int_A\rho(x)\,dx$ it further follows that
\[
d_{TV}\left(\eta_\varepsilon(A),\ \mathrm{Poi}\left(\int_A\rho(x)\,dx\right)\right)\to 0 \text{ as }\varepsilon\downarrow 0.
\]

Second assertion. According to Eq. (16.27) we have,
\[
E\left[f(\Lambda_\varepsilon\cap A_\varepsilon)\,|\,\eta_\varepsilon(A)=n\right]
= \frac{\sum_{C\subset A_\varepsilon:\#C=n}f(C)\prod_{x\in C}\left(\frac{1}{1-\varepsilon^d\rho(x)}\rho(x)\varepsilon^d\right)}{\sum_{C\subset A_\varepsilon:\#C=n}\prod_{x\in C}\left(\frac{1}{1-\varepsilon^d\rho(x)}\rho(x)\varepsilon^d\right)}.
\]

So to compute this limit it suffices to show
\[
\lim_{\varepsilon\downarrow 0}\sum_{C\subset A_\varepsilon:\#C=n}f(C)\prod_{x\in C}\left[\frac{1}{1-\varepsilon^d\rho(x)}\rho(x)\varepsilon^d\right] = \frac{1}{n!}\int_{A^n}f(x)\,\rho(x_1)\dots\rho(x_n)\,dx_1\dots dx_n.
\]
As $\prod_{x\in C}\frac{1}{1-\varepsilon^d\rho(x)}\to 1$ uniformly in $C\subset A$ with $n$ elements as $\varepsilon\downarrow 0$, it suffices to show,
\[
\lim_{\varepsilon\downarrow 0}\sum_{C\subset A_\varepsilon:\#C=n}f(C)\prod_{x\in C}\left[\rho(x)\varepsilon^d\right] = \frac{1}{n!}\int_{A^n}f(x)\,\rho(x_1)\dots\rho(x_n)\,dx_1\dots dx_n.
\]


Let us next observe that, if $M$ is a bound on $|f|$, then
\[
\sum_{x_2,\dots,x_n\in A_\varepsilon}|f(x_2,x_2,\dots,x_n)|\,p_\varepsilon(x_2)\cdot\prod_{i=2}^n p_\varepsilon(x_i)
\le M\sum_{x_2,\dots,x_n\in A_\varepsilon}p_\varepsilon(x_2)\cdot\prod_{i=2}^n p_\varepsilon(x_i)
= M\sum_{x\in A_\varepsilon}p_\varepsilon^2(x)\left(\sum_{x\in A_\varepsilon}p_\varepsilon(x)\right)^{n-2}
\]
\[
\le M\,\|\rho\|_\infty\,\varepsilon^d\left(\sum_{x\in A_\varepsilon}p_\varepsilon(x)\right)^{n-1}
\to M\,\|\rho\|_\infty\cdot 0\cdot\left(\int_A\rho(x)\,dx\right)^{n-1} = 0
\]
as $\varepsilon\downarrow 0$. A slight generalization of this argument shows that
\[
\sum_{x_1,\dots,x_n\in A_\varepsilon}f(x_1,\dots,x_n)\prod_{i=1}^n p_\varepsilon(x_i)
- {\sum_{x_1,\dots,x_n\in A_\varepsilon}}^{\!\prime}\,f(x_1,\dots,x_n)\prod_{i=1}^n p_\varepsilon(x_i) \to 0 \text{ as }\varepsilon\downarrow 0,
\]
where ${\sum}'_{x_1,\dots,x_n\in A_\varepsilon}$ indicates the sum is over $x_1,\dots,x_n\in A_\varepsilon$ with all points being distinct. Since $f(x_1,\dots,x_n)\prod_{i=1}^n p_\varepsilon(x_i)$ is symmetric in all its variables, it follows that
\[
{\sum_{x_1,\dots,x_n\in A_\varepsilon}}^{\!\prime}\,f(x_1,\dots,x_n)\prod_{i=1}^n p_\varepsilon(x_i) = n!\cdot\sum_{C\subset A_\varepsilon:\#C=n}f(C)\prod_{x\in C}\left[\rho(x)\varepsilon^d\right].
\]
Combining this with the fact that
\[
\lim_{\varepsilon\downarrow 0}\sum_{x_1,\dots,x_n\in A_\varepsilon}f(x_1,\dots,x_n)\prod_{i=1}^n p_\varepsilon(x_i) = \int_{A^n}f(x)\,\rho(x_1)\dots\rho(x_n)\,dx_1\dots dx_n
\]
proves Eq. (16.32).

Corollary 16.51. Let us now restrict to the 1-dimensional case where we write $\lambda(t)$ for $\rho(t)$ and let $N(\cdot)$ be the Poisson process with intensity density $\lambda$. Further let $W_0=0$ and define $W_n$ inductively by;
\[
W_{n+1} := \inf\{t>W_n : N((0,t]) = n+1\}
\]
and then set $T_n:=W_n-W_{n-1}$ for all $n\in\mathbb{N}$. Then the distribution of the $\{T_n\}_{n=1}^\infty$ and hence the $\{W_n\}$ is determined inductively by
\[
P(T_{n+1}>t\,|\,W_1,\dots,W_n) = \exp\left(-\int_{W_n}^{W_n+t}\lambda(\tau)\,d\tau\right).
\]

Proof. We give an informal argument here based on passing to the limit in Theorem 16.46. In that theorem we take $p(t):=p_\varepsilon(t):=\lambda(t)\,\varepsilon$. Then formally passing to the limit as $\varepsilon\downarrow 0$ in Eq. (16.30) gives the desired result;
\[
P\left(T^\varepsilon_{n+1}>t\,|\,W^\varepsilon_1,\dots,W^\varepsilon_n\right)
= \prod_{\tau\in S_\varepsilon:\,W^\varepsilon_n<\tau\le W^\varepsilon_n+t}(1-\varepsilon\lambda(\tau))
= \exp\left(\sum_{\tau\in S_\varepsilon:\,W^\varepsilon_n<\tau\le W^\varepsilon_n+t}\ln(1-\varepsilon\lambda(\tau))\right)
\]
\[
= \exp\left(\sum_{\tau\in S_\varepsilon:\,W^\varepsilon_n<\tau\le W^\varepsilon_n+t}\left[-\varepsilon\lambda(\tau)+O(\varepsilon^2)\right]\right)
\to \exp\left(-\int_{W_n}^{W_n+t}\lambda(\tau)\,d\tau\right).
\]
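Corollary 16.51 also gives a direct recipe for simulating the arrival times of a non-homogeneous process: since P(T_{n+1} > t | W_1, ..., W_n) = exp(-∫_{W_n}^{W_n+t} λ(τ)dτ), the integrated intensity between consecutive arrivals is a standard exponential variable. The Python sketch below is an added illustration (not from the notes); the intensity lam(t) = 2 + sin(t) and the step size dt are arbitrary choices, and the integral is inverted by a crude Riemann-sum accumulation.

import numpy as np

rng = np.random.default_rng(6)

def lam(t):
    # illustrative intensity density, an arbitrary choice for this sketch
    return 2.0 + np.sin(t)

def next_arrival(w_prev, dt=1e-3):
    # solve int_{w_prev}^{w} lam(tau) d tau = E with E ~ E(1), by accumulating Riemann sums
    E = rng.exponential(1.0)
    w, acc = w_prev, 0.0
    while acc < E:
        acc += lam(w) * dt
        w += dt
    return w

W, w = [], 0.0
for _ in range(10):
    w = next_arrival(w)
    W.append(round(w, 3))
print(W)   # the first ten arrival times W_1, ..., W_10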


