Page 1: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

A User's Guide to Measure Theoretic Probability

This book grew from a need to teach a rigorous probability course to a mixed audience - statisticians, mathematically inclined biostatisticians, mathematicians, economists, and students of finance - at the advanced undergraduate/introductory graduate level, without the luxury of a course in measure theory as a prerequisite. The core of the book covers the basic topics of independence, conditioning, martingales, convergence in distribution, and Fourier transforms. In addition, there are numerous sections treating topics traditionally thought of as more advanced, such as coupling and the KMT strong approximation, option pricing via the equivalent martingale measure, and Fernique's inequality for Gaussian processes.

In a further break with tradition, the necessary measure theory is developed via the identification of integrals with linear functionals on spaces of measurable functions, allowing quicker access to the full power of the measure theoretic methods.

The book is not just a presentation of mathematically rigorous theory; it is also a discussion of why some of that theory takes its current form and how anyone could have thought of those clever ideas in the first place. It is intended as a secure starting point for anyone who needs to invoke rigorous probabilistic arguments and to understand what they mean.

David Pollard is Professor of Statistics and Mathematics at Yale University in New Haven, Connecticut. His interests center on probability, measure theory, theoretical and applied statistics, and econometrics. He believes strongly that research and teaching (at all levels) should be intertwined. His book, Convergence of Stochastic Processes (Springer-Verlag, 1984), successfully introduced many researchers and graduate students to empirical process theory.


CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS

Editorial Board
R. Gill (Department of Mathematics, Utrecht University)
B. D. Ripley (Department of Statistics, University of Oxford)
S. Ross (Department of Industrial Engineering, University of California, Berkeley)
M. Stein (Department of Statistics, University of Chicago)
D. Williams (School of Mathematical Sciences, University of Bath)

This series of high quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

Already Published
1. Bootstrap Methods and Their Application, by A. C. Davison and D. V. Hinkley
2. Markov Chains, by J. Norris
3. Asymptotic Statistics, by A. W. van der Vaart
4. Wavelet Methods for Time Series Analysis, by Donald B. Percival and Andrew T. Walden
5. Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers, by Thomas Leonard and John S. J. Hsu
6. Empirical Processes in M-Estimation, by Sara van de Geer
7. Numerical Methods of Statistics, by John F. Monahan


A User's Guide to Measure Theoretic Probability

DAVID POLLARD
Yale University

CAMBRIDGE UNIVERSITY PRESS


CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Dubai, Tokyo, Mexico City

Cambridge University Press
32 Avenue of the Americas, New York, NY 10013-2473, USA

www.cambridge.org
Information on this title: www.cambridge.org/9780521002899

© David Pollard 2002

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2002
7th printing 2010

A catalog record for this publication is available from the British Library.

Library of Congress Cataloging in Publication Data
Pollard, David, 1950–
A user's guide to measure theoretic probability / David Pollard.
p. cm. – (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references and index.
ISBN 0-521-80242-3 – ISBN 0-521-00289-3 (pbk.)
1. Probabilities. 2. Measure theory. I. Title.
II. Cambridge series in statistical and probabilistic mathematics.
QA273 .P7735 2001
519.2 – dc21 2001035270

ISBN 978-0-521-80242-0 Hardback
ISBN 978-0-521-00289-9 Paperback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.


Contents

PREFACE xi

CHAPTER 1: MOTIVATION
§1 Why bother with measure theory? 1
§2 The cost and benefit of rigor 3
§3 Where to start: probabilities or expectations? 5
§4 The de Finetti notation 7
*§5 Fair prices 11
§6 Problems 13
§7 Notes 14

CHAPTER 2: A MODICUM OF MEASURE THEORY
§1 Measures and sigma-fields 17
§2 Measurable functions 22
§3 Integrals 26
*§4 Construction of integrals from measures 29
§5 Limit theorems 31
§6 Negligible sets 33
*§7 Lp spaces 36
*§8 Uniform integrability 37
§9 Image measures and distributions 39
§10 Generating classes of sets 41
*§11 Generating classes of functions 43
§12 Problems 45
§13 Notes 51

CHAPTER 3: DENSITIES AND DERIVATIVES
§1 Densities and absolute continuity 53
*§2 The Lebesgue decomposition 58
§3 Distances and affinities between measures 59
§4 The classical concept of absolute continuity 65
*§5 Vitali covering lemma 68
*§6 Densities as almost sure derivatives 70
§7 Problems 71
§8 Notes 75

CHAPTER 4: PRODUCT SPACES AND INDEPENDENCE
§1 Independence 77
§2 Independence of sigma-fields 80
§3 Construction of measures on a product space 83
§4 Product measures 88
*§5 Beyond sigma-finiteness 93
§6 SLLN via blocking 95
*§7 SLLN for identically distributed summands 97
*§8 Infinite product spaces 99
§9 Problems 102
§10 Notes 108

CHAPTER 5: CONDITIONING
§1 Conditional distributions: the elementary case 111
§2 Conditional distributions: the general case 113
§3 Integration and disintegration 116
§4 Conditional densities 118
*§5 Invariance 121
§6 Kolmogorov's abstract conditional expectation 123
*§7 Sufficiency 128
§8 Problems 131
§9 Notes 135

CHAPTER 6: MARTINGALE ET AL.
§1 What are they? 138
§2 Stopping times 142
§3 Convergence of positive supermartingales 147
§4 Convergence of submartingales 151
*§5 Proof of the Krickeberg decomposition 152
*§6 Uniform integrability 153
*§7 Reversed martingales 155
*§8 Symmetry and exchangeability 159
§9 Problems 162
§10 Notes 166

CHAPTER 7: CONVERGENCE IN DISTRIBUTION
§1 Definition and consequences 169
§2 Lindeberg's method for the central limit theorem 176
§3 Multivariate limit theorems 181
§4 Stochastic order symbols 182
*§5 Weakly convergent subsequences 184
§6 Problems 186
§7 Notes 190

CHAPTER 8: FOURIER TRANSFORMS
§1 Definitions and basic properties 193
§2 Inversion formula 195
§3 A mystery? 198
§4 Convergence in distribution 198
*§5 A martingale central limit theorem 200
§6 Multivariate Fourier transforms 202
*§7 Cramér-Wold without Fourier transforms 203
*§8 The Lévy-Cramér theorem 205
§9 Problems 206
§10 Notes 208


CHAPTER 9: BROWNIAN MOTION
§1 Prerequisites 211
§2 Brownian motion and Wiener measure 213
§3 Existence of Brownian motion 215
*§4 Finer properties of sample paths 217
§5 Strong Markov property 219
*§6 Martingale characterizations of Brownian motion 222
*§7 Functionals of Brownian motion 226
*§8 Option pricing 228
§9 Problems 230
§10 Notes 234

CHAPTER 10: REPRESENTATIONS AND COUPLINGS
§1 What is coupling? 237
§2 Almost sure representations 239
*§3 Strassen's Theorem 242
*§4 The Yurinskii coupling 244
§5 Quantile coupling of Binomial with normal 248
§6 Haar coupling—the Hungarian construction 249
§7 The Komlós-Major-Tusnády coupling 252
§8 Problems 256
§9 Notes 258

CHAPTER 11: EXPONENTIAL TAILS AND THE LAW OF THE ITERATED LOGARITHM
§1 LIL for normal summands 261
§2 LIL for bounded summands 264
*§3 Kolmogorov's exponential lower bound 266
*§4 Identically distributed summands 268
§5 Problems 271
§6 Notes 272

CHAPTER 12: MULTIVARIATE NORMAL DISTRIBUTIONS
§1 Introduction 274
*§2 Fernique's inequality 275
*§3 Proof of Fernique's inequality 276
§4 Gaussian isoperimetric inequality 278
*§5 Proof of the isoperimetric inequality 280
§6 Problems 285
§7 Notes 287

APPENDIX A: MEASURES AND INTEGRALS
§1 Measures and inner measure 289
§2 Tightness 291
§3 Countable additivity 292
§4 Extension to the He-closure 294
§5 Lebesgue measure 295
§6 Integral representations 296
§7 Problems 300
§8 Notes 300


APPENDIX B: HILBERT SPACES
§1 Definitions 301
§2 Orthogonal projections 302
§3 Orthonormal bases 303
§4 Series expansions of random processes 305
§5 Problems 306
§6 Notes 306

APPENDIX C: CONVEXITY
§1 Convex sets and functions 307
§2 One-sided derivatives 308
§3 Integral representations 310
§4 Relative interior of a convex set 312
§5 Separation of convex sets by linear functionals 313
§6 Problems 315
§7 Notes 316

APPENDIX D: BINOMIAL AND NORMAL DISTRIBUTIONS
§1 Tails of the normal distributions 317
§2 Quantile coupling of Binomial with normal 320
§3 Proof of the approximation theorem 324
§4 Notes 328

APPENDIX E: MARTINGALES IN CONTINUOUS TIME
§1 Filtrations, sample paths, and stopping times 329
§2 Preservation of martingale properties at stopping times 332
§3 Supermartingales from their rational skeletons 334
§4 The Brownian filtration 336
§5 Problems 338
§6 Notes 338

APPENDIX F: DISINTEGRATION OF MEASURES
§1 Representation of measures on product spaces 339
§2 Disintegrations with respect to a measurable map 342
§3 Problems 343
§4 Notes 345

INDEX 347


Preface

This book began life as a set of handwritten notes, distributed to students in my one-semester graduate course on probability theory, a course that had humble aims: to help the students understand results such as the strong law of large numbers, the central limit theorem, conditioning, and some martingale theory. Along the way they could expect to learn a little measure theory and maybe even a smattering of functional analysis, but not as much as they would learn from a course on Measure Theory or Functional Analysis.

In recent years the audience has consisted mainly of graduate students in statistics and economics, most of whom have not studied measure theory. Most of them have no intention of studying measure theory systematically, or of becoming professional probabilists, but they do want to learn some rigorous probability theory—in one semester.

Faced with the reality of an audience that might have neither the time nor the inclination to devote itself completely to my favorite subject, I sought to compress the essentials into a course as self-contained as I could make it. I tried to pack into the first few weeks of the semester a crash course in measure theory, with supplementary exercises and a whirlwind exposition (Appendix A) for the enthusiasts. I tried to eliminate duplication of mathematical effort if it served no useful role. After many years of chopping and compressing, the material that I most wanted to cover all fit into a one-semester course, divided into 25 lectures, each lasting from 60 to 75 minutes. My handwritten notes filled fewer than a hundred pages.

I had every intention of making my little stack of notes into a little book. But I couldn't resist expanding a bit here and a bit there, adding useful reference material, spelling out ideas that I had struggled with on first acquaintance, slipping in extra topics that my students have seemed to need when writing dissertations, and pulling in material from other courses I have taught and neat tricks I have learned from my friends. And soon it wasn't so little any more.

Many of the additions ended up in starred Sections, which contain harder material or topics that can be skipped over without loss of continuity.

My treatment includes a few eccentricities that might upset some of my professional colleagues. My most obvious departure from tradition is in the use of linear functional notation for expectations, an approach I first encountered in books by de Finetti. I attempt to explain the virtues of this notation in the first two Chapters. Another slight novelty—at least for anyone already exposed to the Kolmogorov interpretation of conditional expectations—appears in my treatment of conditioning, in Chapter 5. For many years I have worried about the wide gap between the free-wheeling conditioning calculations of an elementary probability course and the formal manipulations demanded by rigor. I claim that a treatment starting from the idea of conditional distributions offers one way of bridging the gap,


at least for many of the statistical applications of conditioning that have troubled me the most.

The twelve Chapters and six Appendixes contain general explanations, remarks, opinions, and blocks of more formal material. Theorems and Lemmas contain the most important mathematical details. Examples contain gentler, or less formal, explanations and illustrations. Supporting theoretical material is presented either in the form of Exercises, with terse solutions, or as Problems (at the ends of the Chapters) that work step-by-step through material that missed the cutoff as Exercises, Lemmas, or Theorems. Some Problems are routine, to give students an opportunity to digest the ideas in the text without great mental effort; some Problems are hard.

A possible one-semester course

Here is a list of the material that I usually try to cover in the one-semester graduate course.

Chapter 1: Spend one lecture on why measure theory is worth the effort, using a few of the Examples as illustrations. Introduce de Finetti notation, identifying sets with their indicator functions and writing P for both probabilities of sets and expectations of random variables. Mention, very briefly, the fair price Section as an alternative to the frequency interpretation.

Chapter 2: Cover the unstarred Sections carefully, but omitting many details from the Examples. Postpone Section 7 until Chapter 3. Postpone Section 8 until Chapter 6. Describe briefly the generating class theorem for functions, from Section 11, without proofs.

Chapter 3: Cover Section 1, explaining the connection with the elementary notion of a density. Take a short excursion into Hilbert space (explaining the projection theorem as an extension of the result for Euclidean spaces) before presenting the simple version of Radon-Nikodym. Mention briefly the classical concept of absolute continuity, but give no details. Maybe say something about total variation.

Chapter 4: Cover Sections 1 and 2, leaving details of some arguments to the students. Give a reminder about generating classes of functions. Describe the construction of μ ⊗ Λ, only for a finite kernel Λ, via the iterated integral. Cover product measures, using some of the Examples from Section 4. Explain the need for the blocking idea from Section 6, using the Maximal Inequality to preview the idea of a stopping time. Mention the truncation idea behind the version of the SLLN for independent, identically distributed random variables with finite first moments, but skip most of the proof.

Chapter 5: Discuss Section 1 carefully. Cover the high points of Sections 2 through 4. (They could be skipped without too much loss of continuity, but I prefer not to move straight into Kolmogorov conditioning.) Cover Section 6.


Chapter 6: Cover Sections 1 through 4, but skipping over some Examples. Characterize uniformly integrable martingales, using Section 6 and some of the material postponed from Section 8 of Chapter 2, unless short of time.

Chapter 7: Cover the first four Sections, skipping some of the examples of central limit theorems near the end of Section 2. Downplay multivariate results.

Chapter 8: Cover Sections 1, 2, 4, and 6.

If time is left over, cover a topic from the remaining Chapters.

Acknowledgments

I am particularly grateful to Richard Gill, who is a model of the constructive critic. His comments repeatedly exposed weaknesses and errors in the manuscript. My colleagues Joe Chang and Marten Wegkamp asked helpful questions while using earlier drafts to teach graduate probability courses. Andries Lenstra provided some important historical references.

Many cohorts of students worked through the notes, revealing points of obscurity and confusion. In particular, Jeankyung Kim, Gheorghe Doros, Daniela Cojocaru, and Peter Radchenko read carefully through several chapters and worked through numerous Problems. Their comments led to a lot of rewriting.

Finally, I thank Lauren Cowles for years of good advice, and for her inexhaustible patience with an author who could never stop tinkering.

David Pollard
New Haven

February 2001


Chapter 1

Motivation

SECTION 1 offers some reasons for why anyone who uses probability should know about the measure theoretic approach.

SECTION 2 describes some of the added complications, and some of the compensating benefits, that come with the rigorous treatment of probabilities as measures.

SECTION 3 argues that there are advantages in approaching the study of probability theory via expectations, interpreted as linear functionals, as the basic concept.

SECTION 4 describes the de Finetti convention of identifying a set with its indicator function, and of using the same symbol for a probability measure and its corresponding expectation.

SECTION *5 presents a fair-price interpretation of probability, which emphasizes the linearity properties of expectations. The interpretation is sometimes a useful guide to intuition.
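The de Finetti convention summarized for SECTION 4 is easy to see in action. The following sketch is not from the book; it is a hypothetical Python illustration in which events are identified with their indicator functions, so that one operator `P` serves as both probability (of sets) and expectation (of random variables):

```python
# Hypothetical illustration of the de Finetti convention: identify an event
# with its indicator function, and let a single operator P act as both
# probability and expectation. Sample space: two fair coin tosses.
from fractions import Fraction

omega = [(i, j) for i in (0, 1) for j in (0, 1)]
weight = {w: Fraction(1, 4) for w in omega}   # each point has probability 1/4

def P(f):
    """Expectation of the random variable f; probability when f is an indicator."""
    return sum(weight[w] * f(w) for w in omega)

# The event {first toss is a head}, identified with its indicator function.
A = lambda w: 1 if w[0] == 1 else 0
# A genuine random variable: the total number of heads.
X = lambda w: w[0] + w[1]

print(P(A))  # probability of the event: 1/2
print(P(X))  # expectation of the random variable: 1
```

The point of the convention is that no separate notation is needed for the two uses of `P`; applying it to an indicator automatically produces a probability.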

1. Why bother with measure theory?

Following the appearance of the little book by Kolmogorov (1933), which set forth a measure theoretic foundation for probability theory, it has been widely accepted that probabilities should be studied as special sorts of measures. (More or less true—see the Notes to the Chapter.) Anyone who wants to understand modern probability theory will have to learn something about measures and integrals, but it takes surprisingly little to get started.

For a rigorous treatment of probability, the measure theoretic approach is a vast improvement over the arguments usually presented in undergraduate courses. Let me remind you of some difficulties with the typical introduction to probability.

Independence

There are various elementary definitions of independence for random variables. For example, one can require factorization of distribution functions,

P{X ≤ x, Y ≤ y} = P{X ≤ x} P{Y ≤ y}    for all real x, y.

The problem with this definition is that one needs to be able to calculate distribution functions, which can make it impossible to establish rigorously some desirable properties of independence. For example, suppose X₁, …, X₄ are independent random variables. How would you show that

Y = X₁X₂

is independent of

Z = sin(X₃ + X₃² + X₃X₄ + X₄² + √(X₃² + X₄²)),

by means of distribution functions? Somehow you would need to express events {Y ≤ y, Z ≤ z} in terms of the events {Xᵢ ≤ xᵢ}, which is not an easy task. (If you did figure out how to do it, I could easily make up more taxing examples.)

You might also try to define independence via factorization of joint density functions, but I could invent further examples to make your life miserable, such as problems where the joint distributions of the random variables are not even given by densities. And if you could grind out the joint densities, probably by means of horrible calculations with Jacobians, you might end up with the mistaken impression that independence had something to do with the smoothness of the transformations.

The difficulty disappears in a measure theoretic treatment, as you will see in Chapter 4. Facts about independence correspond to facts about product measures.
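The factorization of distribution functions can at least be checked numerically. Here is a hypothetical Monte Carlo sketch in Python (not from the book; the sample size and the evaluation point (0.3, 0.7) are arbitrary choices) for independent uniform variables:

```python
# Hypothetical numerical check: for independent X and Y, the joint
# distribution function factorizes, P{X <= x, Y <= y} = P{X <= x} P{Y <= y}.
# Both sides are estimated by Monte Carlo for independent uniforms.
import random

random.seed(0)
n = 200_000
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

x0, y0 = 0.3, 0.7
joint = sum(1 for x, y in zip(xs, ys) if x <= x0 and y <= y0) / n
px = sum(1 for x in xs if x <= x0) / n
py = sum(1 for y in ys if y <= y0) / n

print(joint, px * py)  # the two estimates agree to Monte Carlo accuracy
```

Of course a simulation only illustrates the property at one point; the measure theoretic argument via product measures establishes it for all events at once.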

Discrete versus continuous

Most introductory texts offer proofs of the Tchebychev inequality,

P{|X - μ| ≥ ε} ≤ var(X)/ε²,

where μ denotes the expected value of X. Many texts even offer two proofs, one for the discrete case and another for the continuous case. Indeed, introductory courses tend to split into at least two segments. First one establishes all manner of results for discrete random variables and then one reproves almost the same results for random variables with densities.

Unnecessary distinctions between discrete and continuous distributions disappear in a measure theoretic treatment, as you will see in Chapter 3.
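To echo the point numerically, a single check can apply the same inequality to a discrete and a continuous distribution. This is a hypothetical Python sketch, not from the book; the choices of a fair die and a standard normal are arbitrary:

```python
# Hypothetical check of Tchebychev's inequality
#     P{|X - mu| >= eps} <= var(X)/eps**2
# for one discrete and one continuous distribution: a single statement
# covers both cases, as in the measure theoretic treatment.
import random

def tail_prob(sample, mu, eps):
    """Empirical probability that |X - mu| >= eps."""
    return sum(1 for x in sample if abs(x - mu) >= eps) / len(sample)

random.seed(1)
eps = 1.0

# Discrete case: a fair die; mu = 3.5, var = 35/12.
die = [random.randint(1, 6) for _ in range(100_000)]
assert tail_prob(die, 3.5, eps) <= (35 / 12) / eps**2

# Continuous case: standard normal; mu = 0, var = 1.
norm = [random.gauss(0.0, 1.0) for _ in range(100_000)]
assert tail_prob(norm, 0.0, eps) <= 1.0 / eps**2

print("Tchebychev bound holds in both cases")
```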

Univariate versus multivariate

The unnecessary repetition does not stop with the discrete/continuous dichotomy. After one masters formulae for functions of a single random variable, the whole process starts over for several random variables. The univariate definitions acquire a prefix joint, leading to a whole host of new exercises in multivariate calculus: joint densities, Jacobians, multiple integrals, joint moment generating functions, and so on.

Again the distinctions largely disappear in a measure theoretic treatment. Distributions are just image measures; joint distributions are just image measures for maps into product spaces; the same definitions and theorems apply in both cases. One saves a huge amount of unnecessary repetition by recognizing the role of image measures (described in Chapter 2) and recognizing joint distributions as measures on product spaces (described in Chapter 4).

Approximation of distributions

Roughly speaking, the central limit theorem asserts: if ξ₁, …, ξₙ are independent random variables with zero expected values and variances summing to one, and if none of the ξᵢ makes too large a contribution to their sum, then ξ₁ + … + ξₙ is approximately N(0, 1) distributed.

What exactly does that mean? How can something with a discrete distribution, such as a standardized Binomial, be approximated by a smooth normal distribution? The traditional answer (which is sometimes presented explicitly in introductory texts) involves pointwise convergence of distribution functions of random variables; but the central limit theorem is seldom established (even in introductory texts) by checking convergence of distribution functions. Instead, when proofs are given, they typically involve checking of pointwise convergence for some sort of generating function. The proof of the equivalence between convergence in distribution and pointwise convergence of generating functions is usually omitted. The treatment of convergence in distribution for random vectors is even murkier.

As you will see in Chapter 7, it is far cleaner to start from a definition involving convergence of expectations of "smooth functions" of the random variables, an approach that covers convergence in distribution for random variables, random vectors, and even random elements of metric spaces, all within a single framework.
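The closeness of a standardized Binomial to the normal can be made concrete. The following hypothetical Python sketch (not from the book; n = 1000 is an arbitrary choice) computes the exact Binomial distribution function and measures its uniform distance from the N(0, 1) distribution function:

```python
# Hypothetical illustration of convergence of distribution functions: the
# distribution function of a standardized Binomial(n, 1/2) is uniformly
# close to that of N(0, 1) once n is large, despite being a step function.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, p = 1000, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# Exact Binomial probabilities and the running distribution function.
pmf = [math.comb(n, k) * 0.5**n for k in range(n + 1)]
cdf, running = [], 0.0
for q in pmf:
    running += q
    cdf.append(running)

# Uniform distance between the two distribution functions at the jump points.
sup_dist = max(abs(cdf[k] - normal_cdf((k - mu) / sigma)) for k in range(n + 1))
print(sup_dist)  # small: the step function hugs the smooth curve
```

The distance is dominated by the jump sizes of the Binomial distribution function, which shrink like 1/√n, so the pointwise-convergence answer is at least numerically visible even before the theory of Chapter 7 explains it.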

***

In the long run the measure theoretic approach will save you much work and help you avoid wasted effort with unnecessary distinctions.

2. The cost and benefit of rigor

In traditional terminology, probabilities are numbers in the range [0,1] attached to events, that is, to subsets of a sample space Ω. They satisfy the rules

(i) P∅ = 0 and PΩ = 1

(ii) for disjoint events A₁, A₂, …, the probability of their union, P(∪ᵢAᵢ), is equal to ΣᵢPAᵢ, the sum of the probabilities of the individual events.

When teaching introductory courses, I find that it pays to be a little vague about the meaning of the dots in (ii), explaining only that it lets us calculate the probability of an event by breaking it into disjoint pieces whose probabilities are summed. Probabilities add up in the same way as lengths, areas, volumes, and masses. The fact that we sometimes need a countable infinity of pieces (as in calculations involving potentially infinite sequences of coin tosses, for example) is best passed off as an obvious extension of the method for an arbitrarily large, finite number of pieces.
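A concrete calculation of the coin-tossing kind: the event "a head eventually appears" splits into the countably many disjoint events "first head at toss n", for n = 1, 2, …, and countable additivity assigns it probability Σₙ 2⁻ⁿ = 1. A hypothetical Python sketch (not from the book) shows how quickly the pieces account for all the probability:

```python
# Hypothetical example of a calculation that genuinely needs countably many
# pieces: toss a fair coin until the first head. The disjoint events
# "first head at toss n" have probabilities 2**-n, and countable additivity
# gives P{a head eventually appears} = sum over n of 2**-n = 1.
from fractions import Fraction

partial = sum(Fraction(1, 2**n) for n in range(1, 51))
print(partial)             # equals 1 - 2**-50, already extremely close to 1
print(float(1 - partial))  # the missing mass shrinks geometrically to zero
```

No finite collection of the pieces captures the whole event, which is why the finite additivity of an introductory course does not suffice here.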

In fact the extension is not at all obvious, mathematically speaking. As explained by Hawkins (1979), the possibility of having the additivity property (ii) hold for countable collections of disjoint events, a property known officially as countable additivity, is one of the great discoveries of modern mathematics. In his 1902 doctoral dissertation, Henri Lebesgue invented a method for defining lengths of complicated subsets of the real line, in a countably additive way. The definition has the subtle feature that not every subset has a length. Indeed, under the usual axioms of set theory, it is impossible to extend the concept of length to all subsets of the real line while preserving countable additivity.

The same subtlety carries over to probability theory. In general, the collection of events to which countably additive probabilities are assigned cannot include all subsets of the sample space. The domain of the set function P (the probability measure) is usually just a sigma-field, a collection of subsets of Ω with properties that will be defined in Chapter 2.

Many probabilistic ideas are greatly simplified by reformulation as properties of sigma-fields. For example, the unhelpful multitude of possible definitions for independence coalesce nicely into a single concept of independence for sigma-fields.

The sigma-field limitation turns out to be less of a disadvantage than might be feared. In fact, it has positive advantages when we wish to prove some probabilistic fact about all events in some sigma-field A. The obvious line of attack—first find an explicit representation for the typical member of A, then check the desired property directly—usually fails. Instead, as you will see in Chapter 2, an indirect approach often succeeds.

(a) Show directly that the desired property holds for all events in some subclass E of "simpler sets" from A.

(b) Show that A is the smallest sigma-field for which A ⊇ E.

(c) Show that the desired property is preserved under various set theoretic operations. For example, it might be possible to show that if two events have the property then so does their union.

(d) Deduce from (c) that the collection B of all events with the property forms a sigma-field of subsets of Ω. That is, B is a sigma-field, which, by (a), has the property B ⊇ E.

(e) Conclude from (b) and (d) that B ⊇ A. That is, the property holds for all members of A.

REMARK. Don't worry about the details for the moment. I include the outline in this Chapter just to give the flavor of a typical measure theoretic proof. I have found that some students have trouble adapting to this style of argument.

The indirect argument might seem complicated, but, with the help of a few key theorems, it actually becomes routine. In the literature, it is not unusual to see applications abbreviated to a remark like "a simple generating class argument shows ...," with the reader left to fill in the routine details.

Lebesgue applied his definition of length (now known as Lebesgue measure) to the construction of an integral, extending and improving on the Riemann integral. Subsequent generalizations of Lebesgue's concept of measure (as in the 1913 paper of Radon and other developments described in the Epilogue to Hawkins 1979) eventually opened the way for Kolmogorov to identify probabilities with measures on sigma-fields of events on general sample spaces. From the Preface to Kolmogorov (1933), in the 1950 translation by Morrison:

The purpose of this monograph is to give an axiomatic foundation for thetheory of probability. The author set himself the task of putting in their naturalplace, among the general notions of modern mathematics, the basic concepts ofprobability theory—concepts which until recently were considered to be quitepeculiar.

This task would have been a rather hopeless one before the introductionof Lebesgue's theories of measure and integration. However, after Lebesgue'spublication of his investigations, the analogies between measure of a set andprobability of an event, and between integral of a function and mathematicalexpectation of a random variable, became apparent. These analogies allowed offurther extensions; thus, for example, various properties of independent randomvariables were seen to be in complete analogy with the corresponding propertiesof orthogonal functions. But if probability theory was to be based on the aboveanalogies, it still was necessary to make the theories of measure and integrationindependent of the geometric elements which were in the foreground withLebesgue. This has been done by Frechet.

While a conception of probability theory based on the above general viewpoints has been current for some time among certain mathematicians, there was lacking a complete exposition of the whole system, free of extraneous complications. (Cf., however, the book by Fréchet ... )

Kolmogorov identified random variables with a class of real-valued functions (the measurable functions) possessing properties allowing them to coexist comfortably with the sigma-field. Thereby he was also able to identify the expectation operation as a special case of integration with respect to a measure. For the newly restricted class of random variables, in addition to the traditional properties

(i) E(c1X1 + c2X2) = c1E(X1) + c2E(X2), for constants c1 and c2,

(ii) E(X) ≥ E(Y) if X ≥ Y,

he could benefit from further properties implied by the countable additivity of the probability measure.

As with the sigma-field requirement for events, the measurability restriction on the random variables came with benefits. In modern terminology, no longer was E just an increasing linear functional on the space of real random variables (with some restrictions to avoid problems with infinities), but also it had acquired some continuity properties, making possible a rigorous treatment of limiting operations in probability theory.

3. Where to start: probabilities or expectations?

From the example set by Lebesgue and Kolmogorov, it would seem natural to start with probabilities of events, then extend, via the operation of integration, to the study of expectations of random variables. Indeed, in many parts of the mathematical world that is the way it goes: probabilities are the basic quantities, from which expectations of random variables are derived by various approximation arguments.


The apparently natural approach is by no means the only possibility, as anyone brought up on the works of the fictitious French author Bourbaki could affirm. (The treatment of measure theory, culminating with Bourbaki 1969, started from integrals defined as linear functionals on appropriate spaces of functions.) Moreover, historically speaking, expectation has a strong claim to being the preferred starting point for a theory of probability. For instance, in his discussion of the 1657 book Calculating in Games of Chance by Christian Huygens, Hacking (1978, page 97) commented:

The fair prices worked out by Huygens are just what we would call the expectations of the corresponding gambles. His approach made expectation a more basic concept than probability, and this remained so for about a century.

The fair price interpretation is sketched in Section 5.

The measure theoretic history of integrals as linear functionals also extends

back to the early years of the twentieth century, starting with Daniell (1918), who developed a general theory of integration via extension of linear functionals from small spaces of functions to larger spaces. It is also significant that, in one of the greatest triumphs of measure theory, Wiener (1923, Section 10) defined what is now known as Wiener measure (thereby providing a rigorous basis for the mathematical theory of Brownian motion) as an averaging operation for functionals defined on Brownian motion paths, citing Daniell (1919) for the basic extension theorem.

There are even better reasons than historical precedent for working with expectations as the basic concept. Whittle (1992), in the Preface to an elegant, intermediate level treatment of Probability via Expectation, presented some arguments:

(i) To begin with, people probably have a better intuition for what is meant by an 'average value' than for what is meant by a 'probability.'

(ii) Certain important topics, such as optimization and approximation problems, can be introduced and treated very quickly, just because they are phrased in terms of expectations.

(iii) Most elementary treatments are bedeviled by the apparent need to ring the changes of a particular proof or discussion for all the special cases of continuous or discrete distribution, scalar or vector variables, etc. In the expectations approach these are indeed seen as special cases, which can be treated with uniformity and economy.

His list continued. I would add that:

(a) It is often easier to work with the linearity properties of integrals than with the additivity properties of measures. For example, many useful probability inequalities are but thinly disguised consequences of pointwise inequalities, translated into probability form by the linearity and increasing properties of expectations.

(b) The linear functional approach, via expectations, can save needless repetition of arguments. Some theorems about probability measures, as set functions, are just special cases of more general results about expectations.


(c) When constructing new probability measures, we save work by defining the integral of measurable functions directly, rather than passing through the preliminary step of building the set function then establishing theorems about the corresponding integrals. As you will see repeatedly, definitions and theorems sometimes collapse into a single operation when expressed directly in terms of expectations, or integrals.

* * *

I will explain the essentials of measure theory in Chapter 2, starting from the traditional set-function approach but working as quickly as I can towards systematic use of expectations.

4. The de Finetti notation

The advantages of treating expectation as the basic concept are accentuated by the use of an elegant notation strongly advocated by de Finetti (1972, 1974). Knowing that many traditionally trained probabilists and statisticians find the notation shocking, I will introduce it slowly, in an effort to explain why it is worth at least a consideration. (Immediate enthusiastic acceptance is more than I could hope for.)

Ordinary algebra is easier than Boolean algebra. The correspondence A ↔ I_A between subsets A of a fixed set X and their indicator functions,

    I_A(x) = 1 if x ∈ A,  I_A(x) = 0 if x ∉ A,

transforms Boolean algebra into ordinary pointwise algebra with functions. I claim that probability theory becomes easier if one works systematically with expectations of indicator functions, E I_A, rather than with the corresponding probabilities of events.

Let me start with the assertions about algebra and Boolean algebra. The operations of union and intersection correspond to pointwise maxima (denoted by max or the symbol ∨) and pointwise minima (denoted by min or the symbol ∧), or pointwise products:

    I_{∪_i A_i}(x) = ∨_i I_{A_i}(x)  and  I_{∩_i A_i}(x) = ∧_i I_{A_i}(x) = ∏_i I_{A_i}(x).

Complements correspond to subtraction from one: I_{A^c}(x) = 1 − I_A(x). Derived operations, such as the set theoretic difference A\B := A ∩ B^c and the symmetric difference, AΔB := (A\B) ∪ (B\A), also have simple algebraic counterparts:

    I_{A\B}(x) = (I_A(x) − I_B(x))⁺ := max(0, I_A(x) − I_B(x)),
    I_{AΔB}(x) = |I_A(x) − I_B(x)|.

To check these identities, just note that the functions take only the values 0 and 1, then determine which combinations of indicator values give a 1. For example, |I_A(x) − I_B(x)| takes the value 1 when exactly one of I_A(x) and I_B(x) equals 1.
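The identities can also be checked mechanically. Here is a small sketch (not part of the original text; the sets A and B are arbitrary examples) that compares the indicator arithmetic against Python's built-in set operations:

```python
# Sketch: check the indicator-function identities pointwise on a small
# finite set X = {0,...,9}; A and B are arbitrary example sets.
X = range(10)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

IA = {x: 1 if x in A else 0 for x in X}   # indicator of A
IB = {x: 1 if x in B else 0 for x in X}   # indicator of B

for x in X:
    # union <-> pointwise maximum; intersection <-> pointwise minimum/product
    assert max(IA[x], IB[x]) == (1 if x in A | B else 0)
    assert min(IA[x], IB[x]) == IA[x] * IB[x] == (1 if x in A & B else 0)
    # complement <-> subtraction from one
    assert 1 - IA[x] == (1 if x not in A else 0)
    # set difference and symmetric difference
    assert max(0, IA[x] - IB[x]) == (1 if x in A - B else 0)
    assert abs(IA[x] - IB[x]) == (1 if x in A ^ B else 0)
print("all indicator identities hold pointwise")
```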


The algebra looks a little cleaner if we omit the argument x. For example, the horrendous set theoretic relationship

    (∩_{i=1}^k A_i) Δ (∩_{i=1}^k B_i) ⊆ ∪_{i=1}^k (A_i Δ B_i)

corresponds to the pointwise inequality

    |∏_{i=1}^k I_{A_i} − ∏_{i=1}^k I_{B_i}| ≤ ∨_{i=1}^k |I_{A_i} − I_{B_i}|,

whose verification is easy: when the right-hand side takes the value 1 the inequality is trivial, because the left-hand side can take only the values 0 or 1; and when the right-hand side takes the value 0, we have I_{A_i} = I_{B_i} for all i, which makes the left-hand side zero.

<1> Example. One could establish an identity such as

    (AΔB)Δ(CΔD) = AΔ(BΔ(CΔD))

by expanding both sides into a union of many terms. It is easier to note the pattern for indicator functions. The set AΔB is the region where I_A + I_B takes an odd value (that is, the value 1); and (AΔB)ΔC is the region where (I_A + I_B) + I_C takes an odd value. And so on. In fact both sides of the set theoretic identity equal the region where I_A + I_B + I_C + I_D takes an odd value. Associativity of set theoretic differences is a consequence of associativity of pointwise addition.□
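The parity argument is easy to confirm by brute force. A sketch (not from the book; the four sets are arbitrary examples, and Python's `^` operator plays the role of Δ):

```python
# Sketch: both bracketings of the symmetric differences of A, B, C, D equal
# the region where I_A + I_B + I_C + I_D takes an odd value.
X = set(range(8))
A, B, C, D = {0, 1, 2}, {1, 3, 5}, {2, 3, 6}, {0, 5, 6, 7}

lhs = (A ^ B) ^ (C ^ D)          # ^ is symmetric difference for Python sets
rhs = A ^ (B ^ (C ^ D))
odd = {x for x in X
       if ((x in A) + (x in B) + (x in C) + (x in D)) % 2 == 1}
assert lhs == rhs == odd
print("both bracketings equal the odd-parity region:", sorted(odd))
```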

<2> Example. The lim sup of a sequence of sets {A_n : n ∈ N} is defined as

    limsup_n A_n := ∩_{n=1}^∞ ∪_{i≥n} A_i.

That is, the lim sup consists of those x for which, to each n there exists an i ≥ n such that x ∈ A_i. Equivalently, it consists of those x for which x ∈ A_i for infinitely many i. In other words,

    I_{limsup_n A_n} = limsup_n I_{A_n}.

Do you really need to learn the new concept of the lim sup of a sequence of sets? Theorems that work for lim sups of sequences of functions automatically carry over to theorems about sets. There is no need to prove everything twice. The correspondence between sets and their indicators saves us from unnecessary work.□
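To see the carry-over in action, here is a small sketch (not from the book) for a periodic sequence of sets, where the lim sup can be computed by hand:

```python
# Sketch: for the periodic sequence A_1, A_2, ... = E, F, E, F, ..., the
# lim sup is E ∪ F (each point of E ∪ F lies in A_i for infinitely many i).
# For a 0-1 sequence, the lim sup of the indicators is 1 exactly when the
# value 1 occurs infinitely often -- here, the max over one period.
X = range(6)
E, F = {0, 1, 2}, {2, 3}

def indicator(S, x):
    return 1 if x in S else 0

for x in X:
    period = [indicator(E, x), indicator(F, x)]   # one period of I_{A_n}(x)
    limsup_of_indicators = max(period)            # periodic case
    assert limsup_of_indicators == indicator(E | F, x)
print("I_{limsup A_n} = limsup I_{A_n}, checked pointwise")
```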

After some repetition, it becomes tiresome to have to keep writing the I for the indicator function. It would be much easier to write something like Ã in place of I_A. The indicator of the lim sup of a sequence of sets would then be written limsup_n Ã_n, with only the tilde to remind us that we are referring to functions. But why do we need reminding? As the example showed, the concept for the lim sup of sets is really just a special case of the concept for sequences of functions. Why preserve a distinction that hardly matters?

There is a well established tradition in Mathematics for choosing notation that eliminates inessential distinctions. For example, we use the same symbol 3 for the natural number and the real number, writing 3 + 6 = 9 as an assertion both about addition of natural numbers and about addition of real numbers.


It does not matter if we cannot tell immediately which interpretation is intended, because we know there is a one-to-one correspondence between natural numbers and a subset of the real numbers, which preserves all the properties of interest. Formally, there is a map ψ : N → R for which

    ψ(x +_natural y) = ψ(x) +_real ψ(y)  for all x, y in N,

with analogous equalities for other operations. (Notice that I even took care to distinguish between addition as a function from N × N to N and as a function from R × R to R.) The map ψ is an isomorphism between N and a subset of R.

REMARK. Of course there are some situations where we need to distinguish between a natural number and its real counterpart. For example, it would be highly confusing to use indistinguishable symbols when first developing the properties of the real number system from the properties of the natural numbers. Also, some computer languages get very upset when a function that expects a floating point argument is fed an integer variable; some languages even insist on an explicit conversion between types.

We are faced with a similar overabundance of notation in the correspondence between sets and their indicator functions. Formally, and traditionally, we have a map A ↦ I_A from sets into a subset of the nonnegative real functions. The map preserves the important operations. It is firmly in the Mathematical tradition that we should follow de Finetti's suggestion and use the same symbol for a set and its indicator function.

REMARK. A very similar convention has been advocated by the renowned computer scientist, Donald Knuth, in an expository article (Knuth 1992). He attributed the idea to Kenneth Iverson, the inventor of the programming language APL.

In de Finetti's notation the assertion from Example <2> becomes

    limsup_n A_n = limsup_n A_n,

a fact that is quite easy to remember. The theorem about lim sups of sequences of sets has become incorporated into the notation; we have one less theorem to remember.

The second piece of de Finetti notation is suggested by the same logic that encourages us to replace +_natural and +_real by the single addition symbol: use the same symbol when extending the domain of definition of a function. For example, the symbol "sin" denotes both the function defined on the real line and its extension to the complex domain. More generally, if we have a function g with domain G_0, which can be identified with a subset G̃_0 of some G̃ via a correspondence x ↔ x̃, and if g̃ is a function on G̃ for which g̃(x̃) = g(x) for x in G_0, then why not write g instead of g̃ for the function with the larger domain?

With probability theory we often use P to denote a probability measure, as a map from a class A (a sigma-field) of subsets of some Ω into the subinterval [0,1] of the real line. The correspondence A ↔ Ã := I_A, between a set A and its indicator function Ã, establishes a correspondence between A and a subset of the collection of


random variables on Ω. The expectation maps random variables into real numbers, in such a way that E(Ã) = P(A). This line of thinking leads us to de Finetti's second suggestion: use the same symbol for expectation and probability measure, writing PX instead of EX, and so on.

The de Finetti notation has an immediate advantage when we deal with several probability measures, P, Q, ... simultaneously. Instead of having to invent new symbols E_P, E_Q, ..., we reuse P for the expectation corresponding to P, and so on.

REMARK. You might have the concern that you will not be able to tell whether PA refers to the probability of an event or the expected value of the corresponding indicator function. The ambiguity should not matter. Both interpretations give the same number; you will never be faced with a choice between two different values when choosing an interpretation. If this ambivalence worries you, I would suggest going systematically with the expectation/indicator function interpretation. It will never lead you astray.

<3> Example. For a finite collection of events A_1, ..., A_n, the so-called method of inclusion and exclusion asserts that the probability of the union ∪_{i≤n} A_i equals

    Σ_i P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) − ... ± P(A_1 ∩ A_2 ∩ ... ∩ A_n).

The equality comes by taking expectations on both sides of an identity for (indicator) functions,

    ∪_{i≤n} A_i = Σ_i A_i − Σ_{i<j} A_i A_j + Σ_{i<j<k} A_i A_j A_k − ... ± A_1 A_2 ... A_n.

The right-hand side of this identity is just the expanded version of 1 − ∏_{i≤n}(1 − A_i). The identity is equivalent to

    ∏_{i≤n}(1 − A_i) = 1 − Σ_i A_i + Σ_{i<j} A_i A_j − ... ∓ A_1 A_2 ... A_n,

which presents two ways of expressing the indicator function of ∩_{i≤n} A_i^c. See Problem [1] for a generalization.□
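The indicator identity behind inclusion-exclusion is easy to verify pointwise by brute force. A sketch (not from the book; the three sets are arbitrary examples):

```python
# Sketch: check, pointwise, that the indicator of the union equals both the
# alternating sum over index sets and 1 - prod_i (1 - A_i), with sets
# represented through their 0-1 indicator values.
from itertools import combinations

X = range(8)
sets = [{0, 1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}]
n = len(sets)

for x in X:
    ind = [1 if x in A else 0 for A in sets]
    # alternating sum over nonempty index sets J
    alternating = sum(
        (-1) ** (k + 1) * sum(all(ind[i] for i in J)
                              for J in combinations(range(n), k))
        for k in range(1, n + 1))
    product_form = 1
    for a in ind:
        product_form *= 1 - a        # prod_i (1 - A_i)
    union = 1 if any(ind) else 0
    assert alternating == 1 - product_form == union
print("inclusion-exclusion identity holds pointwise")
```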

<4> Example. Consider Tchebychev's inequality, P{|X − μ| ≥ ε} ≤ var(X)/ε², for each ε > 0, and each random variable X with expected value μ := PX and finite variance, var(X) := P(X − μ)². On the left-hand side of the inequality we have the probability of an event. Or is it the expectation of an indicator function? Either interpretation is correct, but the second is more helpful. The inequality is a consequence of the increasing property for expectations invoked for a pair of functions, {|X − μ| ≥ ε} ≤ (X − μ)²/ε². The indicator function on the left-hand side takes only the values 0 and 1. The quadratic function on the right-hand side is nonnegative, and is ≥ 1 whenever the left-hand side equals 1.□
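Read this way, the inequality can be checked directly on a toy distribution. A sketch (the six equally weighted values are an arbitrary example, not from the book):

```python
# Sketch: Tchebychev's inequality as an expectation of the pointwise
# inequality  I{|X - mu| >= eps} <= (X - mu)^2 / eps^2,
# checked on a small discrete distribution with equal weights.
values = [-2.0, -1.0, 0.0, 1.0, 2.0, 5.0]
p = 1 / len(values)                       # uniform weights
mu = sum(v * p for v in values)
var = sum((v - mu) ** 2 * p for v in values)
eps = 2.0

# pointwise: indicator <= quadratic / eps^2 at every sample point
for v in values:
    ind = 1 if abs(v - mu) >= eps else 0
    assert ind <= (v - mu) ** 2 / eps ** 2

# taking expectations preserves the inequality
prob = sum(p for v in values if abs(v - mu) >= eps)
assert prob <= var / eps ** 2
print(f"P(|X - mu| >= {eps}) = {prob:.3f} <= var/eps^2 = {var / eps ** 2:.3f}")
```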

***

For the remainder of the book, I will be using the same symbol for a set and its indicator function, and writing P instead of E for expectation.

REMARK. For me, the most compelling reason to adopt the de Finetti notation, and work with P as a linear functional defined for random variables, was not that I would save on symbols, nor any of the other good reasons listed at the end of


Section 3. Instead, I favor the notation because, once the initial shock of seeing old symbols used in new ways wore off, it made probability theory easier. I can truly claim to have gained better insight into classical techniques through the mere fact of translating them into the new notation. I even find it easier to invent new arguments when working with a notation that encourages thinking in terms of linearity, and which does not overemphasize the special role for expectations of functions that take only the values 0 and 1 by according them a different symbol.

The hope that I might convince probability users of some of the advantages of de Finetti notation was, in fact, one of my motivations for originally deciding to write yet another book about an old subject.

*5. Fair prices

For the understanding of this book the interpretation of probability as a model for uncertainty is not essential. You could study it purely as a piece of mathematics, divorced from any interpretation, but then you would forgo much of the intuition that accompanies the various interpretations.

The most widely accepted view interprets probabilities and expectations as long run averages, anticipating the formal laws of large numbers that make precise a sense in which averages should settle down to expectations over a long sequence of independent trials. As an aid to intuition I also like another interpretation, which does not depend on a preliminary concept of independence, and which concentrates attention on the linearity properties of expectations.

Consider a situation—a bet if you will—where you stand to receive an uncertain return X. You could think of X as a random variable, a real-valued function on a set Ω. For the moment forget about any probability measure on Ω. Suppose you consider p(X) to be the fair price to pay now in order to receive X at some later time. (By fair I mean that you should be prepared to take either side of the bet. In particular, you should be prepared to accept a payment p(X) from me now in return for giving me an amount X later.) What properties should p(·) have?

REMARK. As noted in Section 3, the value p(X) corresponds to an expected value of the random variable X. If you already know about the possibility of infinite expectations, you will realize that I would have to impose some restrictions on the class of random variables for which fair prices are defined, if I were seriously trying to construct a rigorous system of axioms. It would suffice to restrict the argument to bounded random variables.

Your net return will be the random quantity X'(ω) := X(ω) − p(X). Call the random variable X' a fair return, the net return from a fair trade. Unless you start worrying about utilities—in which case you might consult Savage (1954) or Ferguson (1967, Section 1.4)—you should find the following properties reasonable.

(i) fair + fair = fair. That is, if you consider p(X) fair for X and p(Y) fair for Y then you should be prepared to make both bets, paying p(X) + p(Y) to receive X + Y.

(ii) constant × fair = fair. That is, you shouldn't object if I suggest you pay 2p(X) to receive 2X (actually, that particular example is a special case of (i))


or 3.76p(X) to receive 3.76X, or −p(X) to receive −X. The last example corresponds to willingness to take either side of a fair bet. In general, to receive cX you should pay cp(X), for constant c.

Properties (i) and (ii) imply that the collection of all fair returns is a vector space.

There is a third reasonable property that goes by several names: coherency or nonexistence of a Dutch book, the no-arbitrage requirement, or the no-free-lunch principle:

(iii) There is no fair return X' for which X'(ω) ≥ 0 for all ω, with strict inequality for at least one ω.

(Students of decision theory might be reminded of the concept of admissibility.) If you were to declare such an X' to be fair I would be delighted to offer you the opportunity to receive a net return of −10^100 X'. I couldn't lose.

<5> Lemma. Properties (i), (ii), and (iii) imply that p(·) is an increasing linear functional on random variables. The fair returns are those random variables for which p(X) = 0.

Proof. For constants α and β, and random variables X and Y with fair prices p(X) and p(Y), consider the combined effect of the following fair bets:

    you pay me αp(X) to receive αX;
    you pay me βp(Y) to receive βY;
    I pay you p(αX + βY) to receive αX + βY.

Your net return is a constant,

    c = p(αX + βY) − αp(X) − βp(Y).

If c > 0 you violate (iii); if c < 0 take the other side of the bet to violate (iii). That proves linearity.

To prove that p(·) is increasing, suppose X(ω) ≥ Y(ω) for all ω. If you claim that p(X) < p(Y) then I would be happy for you to accept the bet that delivers

    (Y − p(Y)) − (X − p(X)) = −(X − Y) − (p(Y) − p(X)),

which is always < 0.

If both X and X − p(X) are considered fair, then the constant return p(X) = X − (X − p(X)) is fair, which would contradict (iii) unless p(X) = 0.□

As a special case, consider the bet that returns 1 if an event F occurs, and 0 otherwise. If you identify the event F with the random variable taking the value 1 on F and 0 on F^c (that is, the indicator of the event F), then it follows directly from Lemma <5> that p(·) is additive: p(F_1 ∪ F_2) = p(F_1) + p(F_2) for disjoint events F_1 and F_2. That is, p defines a finitely additive set function on events. The set function p(·) has most of the properties required of a probability measure. As an exercise you might show that p(∅) = 0 and p(Ω) = 1.

Contingent bets

Things become much more interesting if you are prepared to make a bet to receive an amount X, but only when some event F occurs. That is, the bet is made contingent


on the occurrence of F. Typically, knowledge of the occurrence of F should change the fair price, which we could denote by p(X | F). Expressed more compactly, the bet that returns (X − p(X | F))F is fair. The indicator function F ensures that money changes hands only when F occurs.

<6> Lemma. If Ω is partitioned into disjoint events F_1, ..., F_k, and X is a random variable, then p(X) = Σ_{i=1}^k p(F_i) p(X | F_i).

Proof. For a single F_i, argue by linearity that

    0 = p(XF_i − p(X | F_i)F_i) = p(XF_i) − p(X | F_i)p(F_i).

Sum over i, using linearity again, together with the fact that X = Σ_i XF_i, to deduce that p(X) = Σ_i p(XF_i) = Σ_i p(F_i)p(X | F_i), as asserted.□
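In a discrete setting, with p interpreted as expectation under a probability on a finite Ω and p(X | F) as the usual conditional expectation, the Lemma can be checked numerically. A sketch (the weights and values below are arbitrary made-up examples, not from the book):

```python
# Sketch: check p(X) = sum_i p(F_i) p(X | F_i) for a finite Omega, with
# p(X | F) computed as the conditional expectation E(X | F).
omega = [0, 1, 2, 3, 4, 5]
prob = {0: 0.1, 1: 0.2, 2: 0.15, 3: 0.25, 4: 0.2, 5: 0.1}
X = {0: 3.0, 1: -1.0, 2: 2.0, 3: 0.5, 4: 4.0, 5: -2.0}
partition = [{0, 1}, {2, 3}, {4, 5}]      # disjoint events covering omega

p_X = sum(prob[w] * X[w] for w in omega)  # the fair price / expectation
total = 0.0
for F in partition:
    p_F = sum(prob[w] for w in F)
    cond = sum(prob[w] * X[w] for w in F) / p_F   # p(X | F)
    total += p_F * cond
assert abs(p_X - total) < 1e-12
print(f"p(X) = {p_X:.4f} equals sum of p(F_i) p(X | F_i) = {total:.4f}")
```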

Why should we restrict the Lemma to finite partitions? If we allowed countable partitions we would get the countable additivity property—the key requirement in the theory of measures. I would be suspicious of such an extension of the simple argument for finite partitions. It makes a tacit assumption that a combination of countably many fair bets is again fair. If we accept that assumption, then why not accept that arbitrary combinations of fair bets are fair? For uncountably infinite collections we would run into awkward contradictions. For example, suppose ω is generated from a uniform distribution on [0,1]. Let X_t be the random variable that returns 1 if ω = t and 0 otherwise. By symmetry one might expect p(X_t) = c for some constant c that doesn't depend on t. But there can be no c for which

    1 = p(1) = p(Σ_{0≤t≤1} X_t) = Σ_{0≤t≤1} p(X_t) = 0 if c = 0, or ±∞ if c ≠ 0.

Perhaps our intuition about the infinite rests on shaky analogies with the finite.

REMARK. I do not insist that probabilities must be interpreted as fair prices, just as I do not accept that all probabilities must be interpreted as assertions about long run frequencies. It is convenient that both interpretations lead to almost the same mathematical formalism. You are free to join either camp, or both, and still play by the same probability rules.

6. Problems

[1] Let A_1, ..., A_N be events in a probability space (Ω, F, P). For each subset J of {1, 2, ..., N} write A_J for ∩_{i∈J} A_i. Define S_k := Σ_{|J|=k} P(A_J), where |J| denotes the number of indices in J. For 0 ≤ m ≤ N show that the probability P{exactly m of the A_i's occur} equals C(m, m)S_m − C(m+1, m)S_{m+1} + ... ± C(N, m)S_N, where C(k, m) denotes the binomial coefficient. Hint: For a dummy variable z, show that ∏_{i=1}^N (A_i^c + zA_i) = Σ_{k=0}^N Σ_{|J|=k} (z − 1)^k A_J. Expand the left-hand side, take expectations, then interpret the coefficient of z^m.
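The asserted formula can be verified by brute force on a small example (the three events below are arbitrary choices, not from the book):

```python
# Sketch: check P{exactly m of the A_i occur} = sum_{k>=m} (-1)^(k-m) C(k,m) S_k
# on a finite probability space with equally likely points.
from itertools import combinations
from math import comb

omega = range(12)
p = 1 / 12
events = [set(range(0, 6)), set(range(3, 9)), set(range(5, 12))]
N = len(events)

def S(k):
    # S_k = sum over |J| = k of P(intersection of A_i, i in J)
    if k == 0:
        return 1.0   # the empty intersection is all of Omega
    return sum(
        len(set.intersection(*(events[i] for i in J))) * p
        for J in combinations(range(N), k))

for m in range(N + 1):
    exact = sum(p for w in omega
                if sum(w in A for A in events) == m)
    formula = sum((-1) ** (k - m) * comb(k, m) * S(k)
                  for k in range(m, N + 1))
    assert abs(exact - formula) < 1e-12
print("P{exactly m of the A_i occur} matches the S_k formula for all m")
```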

[2] Rederive the assertion of Lemma <6> by consideration of the net return from the following system of bets: (i) for each i, pay c_i p(F_i) in order to receive c_i if F_i occurs, where c_i := p(X | F_i); (ii) pay −p(X) in order to receive −X; (iii) for each i, make a bet contingent on F_i, paying c_i (if F_i occurs) to receive X.


[3] For an increasing sequence of events {A_n : n ∈ N} with union A, show PA_n ↑ PA.

7. Notes

See Dubins & Savage (1964) for an illustration of what is possible in a theory ofprobability without countable additivity.

The ideas leading up to Lebesgue's creation of his integral are described in fascinating detail in the excellent book of Hawkins (1979), which has been the starting point for most of my forays into the history of measure theory. Lebesgue first developed his new definition of the integral for his doctoral dissertation (Lebesgue 1902), then presented parts of his theory in the 1902-1903 Peccot course of lectures (Lebesgue 1904). The 1928 revision of the 1904 volume greatly expanded the coverage, including a treatment of the more general (Lebesgue-)Stieltjes integral. See also Lebesgue (1926), for a clear description of some of the ideas involved in the development of measure theory, and the Note Historique of Bourbaki (1969), for a discussion of later developments.

Of course it is a vast oversimplification to imagine that probability theory abruptly became a specialized branch of measure theory in 1933. As Kolmogorov himself made clear, the crucial idea was the measure theory of Lebesgue. Kolmogorov's little book was significant not just for "putting in their natural place, among the general notions of modern mathematics, the basic concepts of probability theory", but also for adding new ideas, such as probability distributions in infinite dimensional spaces (reinventing results of Daniell 1919) and a general theory of conditional probabilities and conditional expectations.

Measure theoretic ideas were used in probability theory well before 1933. For example, in the Note at the end of Lévy (1925) there was a clear statement of the countable additivity requirement for probabilities, but Lévy did not adopt the complete measure theoretic formalism; and Khinchin & Kolmogorov (1925) explicitly constructed their random variables as functions on [0,1], in order to avail themselves of the properties of Lebesgue measure.

It is also not true that acceptance of the measure theoretic foundation was total and immediate. For example, eight years after Kolmogorov's book appeared, von Mises (1941, page 198) asserted (emphasis in the original):

In recapitulating this paragraph I may say: First, the axioms of Kolmogorov are concerned with the distribution function within one kollektiv and are supplementary to my theory, not a substitute for it. Second, using the notion of measure zero in an absolute way without reference to the arbitrarily assumed measure system, leads to essential inconsistencies.

See also the argument for the measure theoretic framework in the accompanying paper by Doob (1941), and the comments by both authors that follow (von Mises & Doob 1941).

For more about Kolmogorov's pivotal role in the history of modern probability, see: Shiryaev (2000), and the other articles in the same collection; the memorial


articles in the Annals of Probability, volume 17 (1989); and von Plato (1994), which also contains discussions of the work of von Mises and de Finetti.

REFERENCES

Bourbaki, N. (1969), Intégration sur les espaces topologiques séparés, Éléments de mathématique, Hermann, Paris. Fascicule XXXV, Livre VI, Chapitre IX.

Daniell, P. J. (1918), 'A general form of integral', Annals of Mathematics (series 2) 19, 279-294.

Daniell, P. J. (1919), 'Functions of limited variation in an infinite number of dimensions', Annals of Mathematics (series 2) 21, 30-38.

de Finetti, B. (1972), Probability, Induction, and Statistics, Wiley, New York.

de Finetti, B. (1974), Theory of Probability, Wiley, New York. First of two volumes translated from Teoria delle probabilità, published 1970. The second volume appeared under the same title in 1975.

Doob, J. L. (1941), 'Probability as measure', Annals of Mathematical Statistics 12, 206-214.

Dubins, L. & Savage, L. (1964), How to Gamble if You Must, McGraw-Hill.

Ferguson, T. S. (1967), Mathematical Statistics: A Decision Theoretic Approach, Academic Press, Boston.

Fréchet, M. (1915), 'Sur l'intégrale d'une fonctionnelle étendue à un ensemble abstrait', Bull. Soc. Math. France 43, 248-265.

Hacking, I. (1978), The Emergence of Probability, Cambridge University Press.

Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.

Khinchin, A. Y. & Kolmogorov, A. (1925), 'Über Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden', Mat. Sbornik 32, 668-677.

Knuth, D. E. (1992), 'Two notes on notation', American Mathematical Monthly 99, 403-422.

Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin. Second English Edition, Foundations of Probability 1950, published by Chelsea, New York.

Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7. Included in the first volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique.

Lebesgue, H. (1904), Leçons sur l'intégration et la recherche des fonctions primitives, first edn, Gauthier-Villars, Paris. Included in the second volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique. Second edition published in 1928. Third edition, 'an unabridged reprint of the second edition, with minor changes and corrections', published in 1973 by Chelsea, New York.


Lebesgue, H. (1926), 'Sur le développement de la notion d'intégrale', Matematisk Tidsskrift B. English version in the book Measure and Integral, edited and translated by Kenneth O. May.

Lévy, P. (1925), Calcul des Probabilités, Gauthier-Villars, Paris.

Radon, J. (1913), 'Theorie und Anwendungen der absolut additiven Mengenfunktionen', Sitzungsberichten der Kaiserlichen Akademie der Wissenschaften in Wien. Mathematisch-naturwissenschaftliche Klasse 122, 1295-1438.

Savage, L. J. (1954), The Foundations of Statistics, Wiley, New York. Second edition, Dover, New York, 1972.

Shiryaev, A. N. (2000), Andrei Nikolaevich Kolmogorov: a biographical sketchof his life and creative paths, in 'Kolmogorov in Perspective', AmericanMathematical Society/London Mathematical Society,

von Mises, R. (1941), 'On the foundations of probability and statistics', Annals ofMathematical Statistics 12, 191-205.

von Mises, R. & Doob, J. L. (1941), 'Discussion of papers on probability theory',Annals of Mathematical Statistics 12, 215-217.

von Plato, J. (1994), Creating Modern Probability: its Mathematics, Physics andPhilosophy in Historical Perspective, Cambridge University Press.

Whittle, P. (1992), Probability via Expectation, third edn, Springer-Verlag, New York.First edition 1970, under the title "Probability".

Wiener, N. (1923), 'Differential-space', Journal of Mathematics and Physics 2, 131-174. Reprinted in Selected papers ofNorbert Wiener, MIT Press, 1964.


Chapter 2

A modicum of measure theory

SECTION 1 defines measures and sigma-fields.

SECTION 2 defines measurable functions.

SECTION 3 defines the integral with respect to a measure as a linear functional on a cone of measurable functions. The definition sidesteps the details of the construction of integrals from measures.

SECTION *4 constructs integrals of nonnegative measurable functions with respect to a countably additive measure.

SECTION 5 establishes the Dominated Convergence theorem, the Swiss Army knife of measure theoretic probability.

SECTION 6 collects together a number of simple facts related to sets of measure zero.

SECTION *7 presents a few facts about spaces of functions with integrable pth powers, with emphasis on the case p = 2, which defines a Hilbert space.

SECTION 8 defines uniform integrability, a condition slightly weaker than domination. Convergence in L¹ is characterized as convergence in probability plus uniform integrability.

SECTION 9 defines the image measure, which includes the concept of the distribution of a random variable as a special case.

SECTION 10 explains how generating class arguments, for classes of sets, make measure theory easy.

SECTION *11 extends generating class arguments to classes of functions.

1. Measures and sigma-fields

As promised in Chapter 1, we begin with measures as set functions, then work quickly towards the interpretation of integrals as linear functionals. Once we are past the purely set-theoretic preliminaries, I will start using the de Finetti notation (Section 1.4) in earnest, writing the same symbol for a set and its indicator function.

Our starting point is a measure space: a triple (X, A, μ), with X a set, A a class of subsets of X, and μ a function that attaches a nonnegative number (possibly +∞) to each set in A. The class A and the set function μ are required to have properties that facilitate calculations involving limits along sequences.


<1> Definition. Call a class A a sigma-field of subsets of X if:

(i) the empty set ∅ and the whole space X both belong to A;

(ii) if A belongs to A then so does its complement Aᶜ;

(iii) if A₁, A₂, ... is a countable collection of sets in A then both the union ∪ᵢAᵢ and the intersection ∩ᵢAᵢ are also in A.

Some of the requirements are redundant as stated. For example, once we have ∅ ∈ A then (ii) implies X ∈ A. When we come to establish properties about sigma-fields it will be convenient to have the list of defining properties pared down to a minimum, to reduce the amount of mechanical checking. The theorems will be as sparing as possible in the amount of work they require for establishing the sigma-field properties, but for now redundancy does not hurt.

The collection A need not contain every subset of X, a fact forced upon us in general if we want μ to have the properties of a countably additive measure.

<2> Definition. A function μ defined on the sigma-field A is called a (countably additive, nonnegative) measure if:

(i) 0 ≤ μA ≤ ∞ for each A in A;

(ii) μ∅ = 0;

(iii) if A₁, A₂, ... is a countable collection of pairwise disjoint sets in A then

μ(∪ᵢAᵢ) = Σᵢ μAᵢ.

A measure μ for which μX = 1 is called a probability measure, and the corresponding (X, A, μ) is called a probability space. For this special case it is traditional to use a symbol like P for the measure, a symbol like Ω for the set, and a symbol like F for the sigma-field. A triple (Ω, F, P) will always denote a probability space.

Usually the qualifications "countably additive, nonnegative" are omitted, on the grounds that these properties are the most commonly assumed—the most common cases deserve the shortest names. Only when there is some doubt about whether the measures are assumed to have all the properties of Definition <2> should the qualifiers be attached. For example, one speaks of "finitely additive measures" when an analog of property (iii) is assumed only for finite disjoint collections, or "signed measures" when the value of μA is not necessarily nonnegative. When finitely additive or signed measures are under discussion it makes sense to mention explicitly when a particular measure is nonnegative or countably additive, but, in general, you should go with the shorter name.

Where do measures come from? The most basic constructions start from set functions μ defined on small collections of subsets E, such as the collection of all subintervals of the real line. One checks that μ has properties consistent with the requirements of Definition <2>. One seeks to extend the domain of definition while preserving the countable additivity properties of the set function. As you saw in Chapter 1, theorems guaranteeing existence of such extensions were the culmination of a long sequence of refinements in the concept of integration (Hawkins 1979). They represent one of the great achievements of modern mathematics, even though those theorems now occupy only a handful of pages in most measure theory texts.


Finite additivity has several appealing interpretations (such as the fair prices of Section 1.5) that have given it ready acceptance as an axiom for a model of real-world uncertainty. Countable additivity is sometimes regarded with suspicion, or justified as a matter of mathematical convenience. (However, see Problem [6] for an equivalent form of countable additivity, which has some claim to intuitive appeal.) It is difficult to develop a simple probability theory without countable additivity, which gives one the licence (for only a small fee) to integrate series term-by-term, differentiate under integrals, and interchange other limiting operations.

The classical constructions are significant for my exposition mostly because they ensure existence of the measures needed to express the basic results of probability theory. I will relegate the details to the Problems and to Appendix A. If you crave a more systematic treatment you might consult one of the many excellent texts on measure theory, such as Royden (1968).

The constructions do not—indeed cannot, in general—lead to countably additive measures on the class of all subsets of a given X. Typically, they extend a set function defined on a class of sets E to a measure defined on the sigma-field σ(E) generated by E, or to only slightly larger sigma-fields. By definition,

σ(E) := smallest sigma-field on X containing all sets from E
      = {A ⊆ X : A ∈ F for every sigma-field F with E ⊆ F}.

The representation given by the second line ensures existence of a smallest sigma-field containing E. The method of definition is analogous to many definitions of "smallest ... containing a fixed class" in mathematics—think of generated subgroups or linear subspaces spanned by a collection of vectors, for example. For the definition to work one needs to check that sigma-fields have two properties:

(i) If {Fᵢ : i ∈ J} is a nonempty collection of sigma-fields on X then ∩_{i∈J} Fᵢ, the collection of all the subsets of X that belong to every Fᵢ, is also a sigma-field.

(ii) For each E there exists at least one sigma-field F containing all the sets in E.

You should check property (i) as an exercise. Property (ii) is trivial, because the collection of all subsets of X is a sigma-field.

REMARK. Proofs of existence of nonmeasurable sets typically depend on some deep set-theoretic principle, such as the Axiom of Choice. Mathematicians who can live with different rules for set theory can have bigger sigma-fields. See Dudley (1989, Section 3.4) or Oxtoby (1971, Section 5) for details.

<3> Exercise. Suppose X consists of five points a, b, c, d, and e. Suppose E consists of two sets, E₁ = {a, b, c} and E₂ = {c, d, e}. Find the sigma-field generated by E.

SOLUTION: For this simple example we can proceed by mechanical application of the properties that a sigma-field σ(E) must possess. In addition to the obvious ∅ and X, it must contain each of the sets

F₁ := {a, b} = E₁ ∩ E₂ᶜ and F₂ := {c} = E₁ ∩ E₂,

F₃ := {d, e} = E₁ᶜ ∩ E₂ and F₄ := {a, b, d, e} = F₁ ∪ F₃.


Further experimentation creates no new members of σ(E); the sigma-field consists of the sets

∅, F₁, F₂, F₃, F₁ ∪ F₃, F₁ ∪ F₂ = E₁, F₂ ∪ F₃ = E₂, X.

The sets F₁, F₂, F₃ are the atoms of the sigma-field; every member of σ(E) is a union of some collection (possibly empty) of the Fᵢ. The only measurable subsets of an Fᵢ are the empty set and Fᵢ itself. There are no measurable protons or neutrons hiding inside these atoms.
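For a finite space like this one, the mechanical closure under complements and unions can be carried out literally by a short program. The following sketch (my own illustration, not from the text; the helper name `generate_sigma_field` is hypothetical) recovers the eight sets of σ(E) for the five-point example:

```python
from itertools import combinations

def generate_sigma_field(space, generators):
    """Close the generating sets under complement and finite union.
    On a finite space this closure is exactly the generated sigma-field
    (intersections come for free via De Morgan's laws)."""
    space = frozenset(space)
    sets = {frozenset(), space} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            c = space - a                       # complement
            if c not in sets:
                sets.add(c); changed = True
        for a, b in combinations(list(sets), 2):
            u = a | b                           # finite union
            if u not in sets:
                sets.add(u); changed = True
    return sets

# The five-point example: E1 = {a,b,c}, E2 = {c,d,e}
sf = generate_sigma_field("abcde", ["abc", "cde"])
print(len(sf))  # → 8: all unions of the atoms {a,b}, {c}, {d,e}
```

The eight sets produced are exactly the unions of the three atoms, matching the list displayed above.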

An unsystematic construction might work for finite sets, but it cannot generate all members of a sigma-field in general. Indeed, we cannot even hope to list all the members of an infinite sigma-field. Instead we must find a less explicit way to characterize its sets.

<4> Example. By definition, the Borel sigma-field on the real line, denoted by B(ℝ), is the sigma-field generated by the open subsets. We could also denote it by σ(G), where G stands for the class of all open subsets of ℝ. There are several other generating classes for B(ℝ). For example, as you will soon see, the class E of all intervals (−∞, t], with t ∈ ℝ, is a generating class.

It might appear a hopeless task to prove that σ(E) = B(ℝ) if we cannot explicitly list the members of both sigma-fields, but actually the proof is quite routine. You should try to understand the style of argument because it is often used in probability theory.

The equality of sigma-fields is established by two inclusions, σ(E) ⊆ σ(G) and σ(G) ⊆ σ(E), both of which follow from more easily established results. First we must prove that E ⊆ σ(G), showing that σ(G) is one of the sigma-fields F that enter into the intersection defining σ(E), and hence σ(E) ⊆ σ(G). The other inclusion follows similarly if we show that G ⊆ σ(E).

Each interval (−∞, t] in E has the representation ∩_{n=1}^∞ (−∞, t + n⁻¹), a countable intersection of open sets. The sigma-field σ(G) contains all open sets, and it is stable under countable intersections. It therefore contains each (−∞, t]. That is, E ⊆ σ(G).

The argument for G ⊆ σ(E) is only slightly harder. It depends on the fact that an open subset of the real line can be written as a countable union of open intervals. Such an interval has the representation (a, b) = (−∞, b) ∩ (−∞, a]ᶜ, and (−∞, b) = ∪_{n=1}^∞ (−∞, b − n⁻¹]. That is, every open set can be built up from sets in E using operations that are guaranteed not to take us outside the sigma-field σ(E).

My explanation has been moderately detailed. In a published paper the reasoning would probably be abbreviated to something like "a generating class argument shows that ...," with the routine details left to the reader.

REMARK. The generating class argument often reduces to an assertion like: A is a sigma-field and A ⊇ E, therefore A = σ(A) ⊇ σ(E).

<5> Example. A class E of subsets of a set X is called a field if it contains the empty set and is stable under complements, finite unions, and finite intersections. For a field E, write E_δ for the class of all possible intersections of countable subclasses of E, and E_σ for the class of all possible unions of countable subclasses of E.


Of course if E is a sigma-field then E = E_δ = E_σ, but, in general, the inclusions σ(E) ⊇ E_δ ⊇ E and σ(E) ⊇ E_σ ⊇ E will be proper. For example, if X = ℝ and E consists of all finite unions of half-open intervals (a, b], with possibly a = −∞ or b = +∞, then the set of rationals does not belong to E_σ and the complement of the same set does not belong to E_δ.

Let μ be a finite measure on σ(E). Even though σ(E) might be much larger than either E_σ or E_δ, a generating class argument will show that all sets in σ(E) can be inner approximated by E_δ, in the sense that

μA = sup{μF : A ⊇ F ∈ E_δ}  for each A in σ(E),

and outer approximated by E_σ, in the sense that

μA = inf{μG : A ⊆ G ∈ E_σ}  for each A in σ(E).

REMARK. Incidentally, I chose the letters G and F to remind myself of open and closed sets, which have similar approximation properties for Borel measures on metric spaces—see Problem [12].

It helps to work on both approximation properties at the same time. Denote by S₀ the class of all sets in σ(E) that can be both inner and outer approximated. A set B belongs to S₀ if and only if, to each ε > 0, there exist F ∈ E_δ and G ∈ E_σ such that F ⊆ B ⊆ G and μ(G\F) < ε. I'll call the sets F and G an ε-sandwich for B.

Trivially S₀ ⊇ E, because each member of E belongs to both E_σ and E_δ. The approximation result will follow if we show that S₀ is a sigma-field, for then we will have S₀ = σ(S₀) ⊇ σ(E).

Symmetry of the definition ensures that S₀ is stable under complements: if F ⊆ B ⊆ G is an ε-sandwich for B, then Gᶜ ⊆ Bᶜ ⊆ Fᶜ is an ε-sandwich for Bᶜ. To show that S₀ is stable under countable unions, consider a countable collection {Bₙ : n ∈ ℕ} of sets from S₀. We need to slice the bread thinner as n gets larger: choose ε/2ⁿ-sandwiches Fₙ ⊆ Bₙ ⊆ Gₙ for each n. The union ∪ₙBₙ is sandwiched between the sets G := ∪ₙGₙ and H := ∪ₙFₙ, and the sets are close in μ measure because

μ(∪ₙGₙ \ ∪ₙFₙ) ≤ Σₙ μ(Gₙ\Fₙ) < Σₙ ε/2ⁿ = ε.

REMARK. Can you prove this inequality? Do you see why ∪ₙGₙ \ ∪ₙFₙ ⊆ ∪ₙ(Gₙ\Fₙ), and why countable additivity implies that the measure of a countable union of (not necessarily disjoint) sets is smaller than the sum of their measures? If not, just wait until Section 3, after which you can argue that ∪ₙGₙ \ ∪ₙFₙ ≤ Σₙ(Gₙ\Fₙ), as an inequality between indicator functions, and μ(Σₙ(Gₙ\Fₙ)) = Σₙ μ(Gₙ\Fₙ) by Monotone Convergence.

We have an ε-sandwich, but the bread might not be of the right type. It is certainly true that G ∈ E_σ (a countable union of countable unions is a countable union), but the set H need not belong to E_δ. However, the sets H_N := ∪_{n≤N} Fₙ do belong to E_δ, and countable additivity implies that μH_N ↑ μH.

REMARK. Do you see why? If not, wait for Monotone Convergence again.

If we choose a large enough N we have a 2ε-sandwich H_N ⊆ ∪ₙBₙ ⊆ G.


The measure m on B(ℝ) for which m(a, b] = b − a is called Lebesgue measure. Another sort of generating class argument (see Section 10) can be used to show that the values m(B) for B in B(ℝ) are uniquely determined by the values given to intervals; there can exist at most one measure on B(ℝ) with the stated property. It is harder to show that at least one such measure exists. Despite any intuitions you might have about length, the construction of Lebesgue measure is not trivial—see Appendix A. Indeed, Henri Lebesgue became famous for proving existence of the measure and showing how much could be done with the new integration theory.

The name Lebesgue measure is also given to an extension of m to a measure on a sigma-field, sometimes called the Lebesgue sigma-field, which is slightly larger than B(ℝ). I will have more to say about the extension in Section 6.

Borel sigma-fields are defined in similar fashion for any topological space X. That is, B(X) denotes the sigma-field generated by the open subsets of X.

Sets in a sigma-field A are said to be measurable or A-measurable. In probability theory they are also called events. Good functions will also be given the title measurable. Try not to get confused when you really need to know whether an object is a set or a function.

2. Measurable functions

Let X be a set equipped with a sigma-field A, and Y be a set equipped with a sigma-field B, and T be a function (also called a map) from X to Y. We say that T is A\B-measurable if the inverse image {x ∈ X : Tx ∈ B} belongs to A for each B in B. Sometimes the inverse image is denoted by {T ∈ B} or T⁻¹B. Don't be fooled by the T⁻¹ notation into treating T⁻¹ as a function from Y into X: it's not, unless T is one-to-one (and onto, if you want to have domain Y). Sometimes an A\B-measurable map is referred to in abbreviated form as just A-measurable, or just B-measurable, or just measurable, if there is no ambiguity about the unspecified sigma-fields.

[Figure: the map T carrying points of (X, A) into (Y, B), with each inverse image T⁻¹B pulled back into A.]

For example, if Y = ℝ and B equals the Borel sigma-field B(ℝ), it is common to drop the B(ℝ) specification and refer to the map as being A-measurable, or as being Borel measurable if A is understood and there is any doubt about which sigma-field to use for the real line. In this book, you may assume that any sigma-field on ℝ is its Borel sigma-field, unless explicitly specified otherwise. It can get confusing if you misinterpret where the unspecified sigma-fields live. My advice would be that you imagine a picture showing the two spaces involved, with any missing sigma-field labels filled in.


Sometimes the functions come first, and the sigma-fields are chosen specifically to make those functions measurable.

<6> Definition. Let H be a class of functions on a set X. Suppose the typical h in H maps X into a space Y_h, equipped with a sigma-field B_h. Then the sigma-field σ(H) generated by H is defined as σ{h⁻¹(B) : B ∈ B_h, h ∈ H}. It is the smallest sigma-field A₀ on X for which each h in H is A₀\B_h-measurable.

<7> Example. If B = σ(E) for some class E of subsets of Y then a map T is A\σ(E)-measurable if and only if T⁻¹E ∈ A for every E in E. You should prove this assertion by checking that {B ∈ B : T⁻¹B ∈ A} is a sigma-field, and then arguing from the definition of a generating class.

In particular, to establish A\B(ℝ)-measurability of a map into the real line it is enough to check the inverse images of intervals of the form (t, ∞), with t ranging over ℝ. (In fact, we could restrict t to a countable dense subset of ℝ, such as the set of rationals: How would you build an interval (t, ∞) from intervals (tᵢ, ∞) with rational tᵢ?) That is, a real-valued function f is Borel-measurable if {x ∈ X : f(x) > t} ∈ A for each real t. There are many similar assertions obtained by using other generating classes for B(ℝ). Some authors use particular generating classes for the definition of measurability, and then derive facts about inverse images of Borel sets as theorems.

It will be convenient to consider not just real-valued functions on a set X, but also functions from X into the extended real line ℝ̄ := [−∞, ∞]. The Borel sigma-field B(ℝ̄) is generated by the class of open sets, or, more explicitly, by all sets in B(ℝ) together with the two singletons {−∞} and {∞}. It is an easy exercise to show that B(ℝ̄) is generated by the class of all sets of the form (t, ∞], for t in ℝ, and by the class of all sets of the form [−∞, t), for t in ℝ. We could even restrict t to any countable dense subset of ℝ.

<8> Example. Let a set X be equipped with a sigma-field A. Let {fₙ : n ∈ ℕ} be a sequence of A\B(ℝ)-measurable functions from X into ℝ. Define functions f and g by taking pointwise suprema and infima: f(x) := supₙ fₙ(x) and g(x) := infₙ fₙ(x). Notice that f might take the value +∞, and g might take the value −∞, at some points of X. We may consider both as maps from X into ℝ̄. (In fact, the whole argument is unchanged if the fₙ functions themselves are also allowed to take infinite values.)

The function f is A\B(ℝ̄)-measurable because

{x : f(x) > t} = ∪ₙ{x : fₙ(x) > t} ∈ A  for each real t:

for each fixed x, the supremum of the real numbers fₙ(x) is strictly greater than t if and only if fₙ(x) > t for at least one n. Example <7> shows why we have only to check inverse images for such intervals.

The same generating class is not as convenient for proving measurability of g. It is not true that an infimum of a sequence of real numbers is strictly greater than t if and only if all of the numbers are strictly greater than t: think of the sequence {n⁻¹ : n = 1, 2, 3, ...}, whose infimum is zero. Instead you should argue via the identity {x : g(x) < t} = ∪ₙ{x : fₙ(x) < t} ∈ A for each real t.


From Example <8> and the representations lim supₙ fₙ(x) = inf_{n∈ℕ} sup_{m≥n} fₘ(x) and lim infₙ fₙ(x) = sup_{n∈ℕ} inf_{m≥n} fₘ(x), it follows that the lim sup or lim inf of a sequence of measurable (real- or extended real-valued) functions is also measurable. In particular, if the limit exists it is measurable.

Measurability is also preserved by the usual algebraic operations—sums, differences, products, and so on—provided we take care to avoid illegal pointwise calculations such as ∞ − ∞ or 0/0. There are several ways to establish these stability properties. One of the more direct methods depends on the fact that ℝ has a countable dense subset, as illustrated by the following argument for sums.

<9> Example. Let f and g be B(ℝ)-measurable functions, with pointwise sum h(x) := f(x) + g(x). (I exclude infinite values because I don't want to get caught up with inconclusive discussions of how we might proceed at points x where f(x) = +∞ and g(x) = −∞, or f(x) = −∞ and g(x) = +∞.) How can we prove that h is also a B(ℝ)-measurable function?

It is true that

{x : h(x) > t} = ∪_{s∈ℝ} ({x : f(x) = s} ∩ {x : g(x) > t − s}),

and it is true that the set {x : f(x) = s} ∩ {x : g(x) > t − s} is measurable for each s and t, but sigma-fields are not required to have any particular stability properties for uncountable unions. Instead we should argue that at each x for which f(x) + g(x) > t there exists a rational number r such that f(x) > r > t − g(x). Conversely, if there is an r lying strictly between f(x) and t − g(x) then f(x) + g(x) > t. Thus

{x : h(x) > t} = ∪_{r∈ℚ} ({x : f(x) > r} ∩ {x : g(x) > t − r}),

where ℚ denotes the countable set of rational numbers. A countable union of intersections of pairs of measurable sets is measurable. The sum is a measurable function.
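The rational-witness argument behind the countable union can be checked numerically at individual points. The following sketch (my own illustration, not from the text; the helper name `witness_rational` is hypothetical) searches for a rational r with f(x) > r > t − g(x), which exists exactly when f(x) + g(x) > t; with a finite denominator bound the search is exact once 1/q_max is smaller than the gap f(x) + g(x) − t:

```python
from fractions import Fraction
from math import floor

def witness_rational(fx, gx, t, q_max=1000):
    """Look for a rational r with fx > r > t - gx.  For each denominator q
    we try the smallest p/q strictly above t - gx; such a witness exists
    whenever fx + gx - t > 1/q, so q_max bounds the resolution of the search."""
    for q in range(1, q_max + 1):
        p = floor((t - gx) * q) + 1     # smallest p with p/q > t - gx
        r = Fraction(p, q)
        if fx > r:                      # r > t - gx already holds by construction
            return r
    return None

# (fx, gx, t, whether fx + gx > t); none of the gaps are near 1/q_max
cases = [(1.0, 2.0, 2.5, True), (0.25, 0.25, 0.5, False),
         (-1.0, 0.5, -0.75, True), (3.0, -3.0, 0.0, False)]
for fx, gx, t, expect in cases:
    assert (witness_rational(fx, gx, t) is not None) == expect
```

Finding a witness always certifies f(x) + g(x) > t; the converse direction is where the denominator bound matters, which is why the sample gaps are kept well away from 1/q_max.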

As a little exercise you might try to extend the argument from the last Example to the case where f and g are allowed to take the value +∞ (but not the value −∞). If you want practice at playing with rationals, try to prove measurability of products (be careful with inequalities if dividing by negative numbers) or try Problem [4], which shows why a direct attack on the lim sup requires careful handling of inequalities in the limit.

The real significance of measurability becomes apparent when one works through the construction of integrals with respect to measures, as in Section 4. For the moment it is important only that you understand that the family of all measurable functions is stable under most of the familiar operations of analysis.

<10> Definition. The class M(X, A), or M(X) or just M for short, consists of all A\B(ℝ̄)-measurable functions from X into ℝ̄. The class M⁺(X, A), or M⁺(X) or just M⁺ for short, consists of the nonnegative functions in M(X, A).

If you desired exquisite precision you could write M(X, A, ℝ̄, B(ℝ̄)), to eliminate all ambiguity about domain, range, and sigma-fields.

The collection M⁺ is a cone (stable under sums and multiplication of functions by positive constants). It is also stable under products, pointwise limits of sequences,


and suprema or infima of countable collections of functions. It is not a vector space, because it is not stable under subtraction; but it does have the property that if f and g belong to M⁺ and g takes only real values, then the positive part (f − g)⁺, defined by taking the pointwise maximum of f(x) − g(x) with 0, also belongs to M⁺. You could adapt the argument from Example <9> to establish the last fact.

It proves convenient to work with M⁺ rather than with the whole of M, thereby eliminating many problems with ∞ − ∞. As you will soon learn, integrals have some convenient properties when restricted to nonnegative functions.

For our purposes, one of the most important facts about M⁺ will be the possibility of approximation by simple functions, that is, by measurable functions of the form s := Σᵢ αᵢAᵢ, for finite collections of real numbers αᵢ and events Aᵢ from A. If the Aᵢ are disjoint, s(x) equals αᵢ when x ∈ Aᵢ, for some i, and is zero otherwise. If the Aᵢ are not disjoint, the nonzero values taken by s are sums of various subsets of the {αᵢ}. Don't forget: the symbol Aᵢ gets interpreted as an indicator function when we start doing algebra. I will write M⁺_simple for the cone of all simple functions in M⁺.

<11> Lemma. For each f in M⁺ the sequence {fₙ} ⊆ M⁺_simple, defined by

fₙ := 2⁻ⁿ Σ_{i=1}^{4ⁿ} {f ≥ i/2ⁿ},

has the property 0 ≤ f₁(x) ≤ f₂(x) ≤ ... ≤ fₙ(x) ↑ f(x) at every x.

REMARK. The definition of fₙ involves algebra, so you must interpret {f ≥ i/2ⁿ} as the indicator function of the set of all points x for which f(x) ≥ i/2ⁿ.

Proof. At each x, count the number of nonzero indicator values. If f(x) ≥ 2ⁿ, all 4ⁿ summands contribute a 1, giving fₙ(x) = 2ⁿ. If k2⁻ⁿ ≤ f(x) < (k + 1)2⁻ⁿ, for some integer k from {0, 1, 2, ..., 4ⁿ − 1}, then exactly k of the summands contribute a 1, giving fₙ(x) = k2⁻ⁿ. (Check that the last assertion makes sense when k equals 0.) That is, for 0 ≤ f(x) < 2ⁿ, the function fₙ rounds f(x) down to an integer multiple of 2⁻ⁿ, from which the convergence and monotone increasing properties follow.

If you do not find the monotonicity assertion convincing, you could argue, more formally, that

fₙ = 2⁻⁽ⁿ⁺¹⁾ Σ_{i=1}^{4ⁿ} ({f ≥ 2i/2ⁿ⁺¹} + {f ≥ 2i/2ⁿ⁺¹}) ≤ 2⁻⁽ⁿ⁺¹⁾ Σ_{i=1}^{4ⁿ} ({f ≥ (2i−1)/2ⁿ⁺¹} + {f ≥ 2i/2ⁿ⁺¹}) ≤ fₙ₊₁,

which reflects the effect of doubling the maximum value and halving the step size when going from the nth to the (n+1)st approximation.
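The rounding description in the proof can be tested directly. Here is a small sketch (my own illustration, not from the text; the helper name `f_n` is hypothetical), computing the dyadic approximations via the indicator-sum formula, written as a count of the indicators that fire:

```python
def f_n(f, n):
    """n-th dyadic approximation to a nonnegative function f:
    f_n(x) = 2^-n * sum_{i=1}^{4^n} 1{f(x) >= i/2^n},
    i.e. f(x) rounded down to an integer multiple of 2^-n, capped at 2^n."""
    def approx(x):
        count = sum(1 for i in range(1, 4**n + 1) if f(x) >= i / 2**n)
        return count / 2**n
    return approx

f = lambda x: x * x   # an example member of M+
for x in [0.0, 0.3, 1.7, 100.0]:
    vals = [f_n(f, n)(x) for n in range(1, 6)]
    assert all(a <= b for a, b in zip(vals, vals[1:]))   # monotone increasing in n
    assert all(v <= f(x) for v in vals)                  # approximation from below
```

At x = 100 the cap is visible: the approximations run 2, 4, 8, ..., doubling the maximum value at each step exactly as the proof describes.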


As an exercise you might prove that the product of functions in M⁺ also belongs to M⁺, by expressing the product as a pointwise limit of products of simple functions. Notice how the convention 0 × ∞ = 0 is needed to ensure the correct limit behavior at points where one of the factors is zero.

3. Integrals

Just as ∫ₐᵇ f(x) dx represents a sort of limiting sum of f(x) values weighted by small lengths of intervals—the ∫ sign is a long "S", for sum, and the dx is a sort of limiting increment—so can the general integral ∫ f(x) μ(dx) be defined as a limit of weighted sums, but with weights provided by the measure μ. The formal definition involves limiting operations that depend on the assumed measurability of the function f. You can skip the details of the construction (Section 4) by taking the following result as an axiomatic property of the integral.

<12> Theorem. For each measure μ on (X, A) there is a uniquely determined functional, a map μ̃ from M⁺(X, A) into [0, ∞], having the following properties:

(i) μ̃(I_A) = μA for each A in A;

(ii) μ̃(0) = 0, where the first zero stands for the zero function;

(iii) for nonnegative real numbers α, β and functions f, g in M⁺, μ̃(αf + βg) = αμ̃(f) + βμ̃(g);

(iv) if f, g are in M⁺ and f ≤ g everywhere then μ̃(f) ≤ μ̃(g);

(v) if f₁, f₂, ... is a sequence in M⁺ with 0 ≤ f₁(x) ≤ f₂(x) ≤ ... ↑ f(x) for each x in X, then μ̃(fₙ) ↑ μ̃(f).

I will refer to (iii) as linearity, even though M⁺ is not a vector space. It will imply a linearity property when μ̃ is extended to a vector subspace of M. Property (iv) is redundant because it follows from (iii) and nonnegativity. Property (ii) is also redundant: put A = ∅ in (i); or, interpreting 0 × ∞ as 0, put α = β = 0 and f = g = 0 in (iii). We need to make sure the bad case μ̃f = ∞, for all f in M⁺, does not slip through if we start stripping away redundant requirements.

Notice that the limit function f in (v) automatically belongs to M⁺. The limit assertion itself is called the Monotone Convergence property. It corresponds directly to countable additivity of the measure. Indeed, if {Aᵢ : i ∈ ℕ} is a countable collection of disjoint sets from A then the functions fₙ := A₁ + ... + Aₙ increase pointwise to the indicator function of A = ∪_{i∈ℕ}Aᵢ, so that Monotone Convergence and linearity imply μA = Σᵢ μAᵢ.

REMARK. You should ponder the role played by +∞ in Theorem <12>. For example, what does αμ̃(f) mean if α = 0 and μ̃(f) = ∞? The interpretation depends on the convention that 0 × ∞ = 0.

In general you should be suspicious of any convention involving ±∞. Pay careful attention to cases where it operates. For example, how would the five assertions be affected if we adopted a new convention, whereby 0 × ∞ = 6? Would the Theorem still hold? Where exactly would it fail? I feel uneasy if it is not clear how a convention is disposing of awkward cases. My advice: be very, very


careful with any calculations involving infinity. Subtle errors are easy to miss when concealed within a convention.

There is a companion to Theorem <12> that shows why it is largely a matter of taste whether one starts from measures or integrals as the more primitive measure theoretic concept.

<13> Theorem. Let μ̃ be a map from M⁺ to [0, ∞] that satisfies properties (ii) through (v) of Theorem <12>. Then the set function defined on the sigma-field A by (i) is a (countably additive, nonnegative) measure, with μ̃ the functional that it generates.

Lemma <11> provides the link between the measure μ and the functional μ̃. For a given f in M⁺, let {fₙ} be the sequence defined by the Lemma. Then

μ̃f = lim_{n→∞} μ̃fₙ = lim_{n→∞} 2⁻ⁿ Σ_{i=1}^{4ⁿ} μ{f ≥ i/2ⁿ},

the first equality by Monotone Convergence, the second by linearity. The value of μ̃f is uniquely determined by μ, as a set function on A. It is even possible to use the equality, or something very similar, as the basis for a direct construction of the integral, from which properties (i) through (v) are then derived, as you will see in Section 4.
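The limit can be watched in action for a concrete case. The sketch below (my own illustration, not from the text; `dyadic_integral` and `level` are hypothetical names) evaluates the dyadic sums for f(x) = x² under Lebesgue measure on [0, 1], where the level-set measure has the closed form m{x ∈ [0, 1] : x² ≥ c} = 1 − √c for c ≤ 1; the sums increase towards the familiar value ∫₀¹ x² dx = 1/3:

```python
from math import sqrt

def dyadic_integral(mu_level_set, n):
    """Evaluate 2^-n * sum_{i=1}^{4^n} mu{f >= i/2^n}, the n-th term of the
    limit that defines the integral of f with respect to mu."""
    return sum(mu_level_set(i / 2**n) for i in range(1, 4**n + 1)) / 2**n

# Lebesgue measure of {x in [0,1] : x^2 >= c}: 1 - sqrt(c) for c <= 1, else 0.
level = lambda c: 1 - sqrt(c) if c <= 1 else 0.0

for n in [2, 4, 6, 8]:
    print(n, dyadic_integral(level, n))   # increases towards 1/3
```

Because fₙ rounds f down to a multiple of 2⁻ⁿ, each sum undershoots the integral by at most 2⁻ⁿ, so the n = 8 term already agrees with 1/3 to within 2⁻⁸.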

In summary: There is a one-to-one correspondence between measures on the sigma-field 𝒜 and increasing linear functionals on M⁺(𝒜) with the Monotone Convergence property. To each measure μ there is a uniquely determined functional μ̃ for which μ̃(I_A) = μ(A) for every A in 𝒜. The functional μ̃ is usually called an integral with respect to μ, and is variously denoted by ∫ f dμ or ∫ f(x) μ(dx) or ∫_X f dμ or ∫ f(x) dμ(x). With the de Finetti notation, where we identify a set A with its indicator function, the functional μ̃ is just an extension of μ from a smaller domain (indicators of sets in 𝒜) to a larger domain (all of M⁺).

Accordingly, we should have no qualms about denoting it by the same symbol. I will write μf for the integral. With this notation, assertion (i) of Theorem <12> becomes: μA = μA for all A in 𝒜. You probably can't tell that the A on the left-hand side is an indicator function and the μ is an integral, but you don't need to be able to tell: that is precisely what (i) asserts.

REMARK. In elementary algebra we rely on parentheses, or precedence, to make our meaning clear. For example, both (ax) + b and ax + b have the same meaning, because multiplication has higher precedence than addition. With traditional notation, the ∫ and the dμ act like parentheses, enclosing the integrand and separating it from following terms. With linear functional notation, we sometimes need explicit parentheses to make the meaning unambiguous. As a way of eliminating some parentheses, I often work with the convention that integration has lower precedence than exponentiation, multiplication, and division, but higher precedence than addition or subtraction. Thus I intend you to read μfg + 6 as (μ(fg)) + 6. I would write μ(fg + 6) if the 6 were part of the integrand.


Chapter 2: A modicum of measure theory

Some of the traditional notations also remove ambiguity when functions of several variables appear in the integrand. For example, in ∫ f(x, y) μ(dx) the y variable is held fixed while the μ operates on the first argument of the function. When a similar ambiguity might arise with linear functional notation, I will append a superscript, as in μ^x f(x, y), to make clear which variable is involved in the integration.

<14> Example. Suppose μ is a finite measure (that is, μX < ∞) and f is a function in M⁺. Then μf < ∞ if and only if Σ_{n=1}^∞ μ{f ≥ n} < ∞.

The assertion is just a pointwise inequality in disguise. By considering separately values for which k ≤ f(x) < k + 1, for k = 0, 1, 2, …, you can verify the pointwise inequality between functions,

    Σ_{n=1}^∞ {f ≥ n} ≤ f ≤ 1 + Σ_{n=1}^∞ {f ≥ n}.

In fact, the sum on the left-hand side defines ⌊f(x)⌋, the largest integer ≤ f(x), and the right-hand side denotes the smallest integer > f(x). From the leftmost inequality,

    μf ≥ μ(Σ_{n=1}^∞ {f ≥ n})            increasing
       = lim_{N→∞} μ(Σ_{n=1}^N {f ≥ n})   Monotone Convergence
       = lim_{N→∞} Σ_{n=1}^N μ{f ≥ n}     linearity
       = Σ_{n=1}^∞ μ{f ≥ n}.

A similar argument gives a companion upper bound. Thus the pointwise inequality integrates out to Σ_{n=1}^∞ μ{f ≥ n} ≤ μf ≤ μX + Σ_{n=1}^∞ μ{f ≥ n}, from which the asserted equivalence follows. □
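The sandwich Σₙ μ{f ≥ n} ≤ μf ≤ μX + Σₙ μ{f ≥ n} is easy to check numerically. Here is a sketch on a small discrete measure (the masses and values are my own example, not from the text):

```python
# finite measure given by point masses p_i sitting at points where f takes value v_i
masses = [0.2, 0.3, 0.5]
values = [0.7, 2.4, 3.1]

mu_f = sum(p * v for p, v in zip(masses, values))           # mu f
mu_X = sum(masses)                                          # total mass
tails = sum(sum(p for p, v in zip(masses, values) if v >= n)
            for n in range(1, 5))                           # mu{f >= n}; zero for n >= 4

print(tails, mu_f, mu_X + tails)   # sandwich: tails <= mu_f <= mu_X + tails
assert tails <= mu_f <= mu_X + tails
```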

Extension of the integral to a larger class of functions

Every function f in M can be decomposed into a difference f = f⁺ − f⁻ of two functions in M⁺, where f⁺(x) := max(f(x), 0) and f⁻(x) := max(−f(x), 0). To extend μ from M⁺ to a linear functional on M we should define μf := μf⁺ − μf⁻. This definition works if at least one of μf⁺ and μf⁻ is finite; otherwise we get the dreaded ∞ − ∞. If both μf⁺ < ∞ and μf⁻ < ∞ (or equivalently, f is measurable and μ|f| < ∞) the function f is said to be integrable or μ-integrable. The linearity property (iii) of Theorem <12> carries over partially to M if ∞ − ∞ problems are excluded, although it becomes tedious to handle all the awkward cases involving ±∞. The constants α and β need no longer be nonnegative. Also if both f and g are integrable and if f ≤ g then μf ≤ μg, with obvious extensions to certain cases involving ∞.

<15> Definition. The set of all real-valued, μ-integrable functions in M is denoted by ℒ¹(μ).

The set ℒ¹(μ) is a vector space (stable under pointwise addition and multiplication by real numbers). The integral μ defines an increasing linear functional on ℒ¹(μ), in the sense that μf ≥ μg if f ≥ g pointwise. The Monotone Convergence property implies other powerful limit results for functions in ℒ¹(μ), as described in Section 5. By restricting μ to ℒ¹(μ), we eliminate problems with ∞ − ∞.


For each f in ℒ¹(μ), its ℒ¹ norm is defined as ‖f‖₁ := μ|f|. Strictly speaking, ‖·‖₁ is only a seminorm, because ‖f‖₁ = 0 need not imply that f is the zero function: as you will see in Section 6, it implies only that μ{f ≠ 0} = 0. It is common practice to ignore the small distinction and refer to ‖·‖₁ as a norm.

<16> Example. Let Ψ be a convex, real-valued function on ℝ. The function Ψ is measurable (because {Ψ ≤ t} is an interval for each real t), and for each x₀ in ℝ there is a constant α such that Ψ(x) ≥ Ψ(x₀) + α(x − x₀) for all x (Appendix C).

Let P be a probability measure, and X be an integrable random variable. Choose x₀ := PX. From the inequality Ψ(x) ≥ −|Ψ(x₀)| − |α|(|x| + |x₀|) we deduce that PΨ(X)⁻ ≤ |Ψ(x₀)| + |α|(P|X| + |x₀|) < ∞. Thus we should have no ∞ − ∞ worries in taking expectations (that is, integrating with respect to P) to deduce that PΨ(X) ≥ Ψ(x₀) + α(PX − x₀) = Ψ(PX), a result known as Jensen's inequality. One way to remember the direction of the inequality is to note that 0 ≤ var(X) = PX² − (PX)², which corresponds to the case Ψ(x) = x². □
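A quick empirical check of Jensen's inequality with Ψ(x) = x², using the empirical measure of a simulated sample in place of P (my own illustration, not from the text):

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # sample, standing in for P

psi = lambda x: x * x                    # a convex Psi
mean = sum(xs) / len(xs)                 # PX
lhs = sum(psi(x) for x in xs) / len(xs)  # P Psi(X)

assert lhs >= psi(mean)                  # Jensen: P Psi(X) >= Psi(PX)
print(lhs - psi(mean))                   # the gap is exactly the sample variance
```

The inequality holds for every sample, not just on average, because the empirical measure is itself a probability measure.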

Integrals with respect to Lebesgue measure

Lebesgue measure m on ℬ(ℝ) corresponds to length: m[a, b] = b − a for each interval. I will occasionally revert to the traditional ways of writing such integrals,

    mf = ∫_{−∞}^{∞} f(x) dx    and    m^x(f(x){a ≤ x ≤ b}) = ∫_a^b f(x) dx.

Don't worry about confusing the Lebesgue integral with the Riemann integral over finite intervals. Whenever the Riemann is well defined, so is the Lebesgue, and the two sorts of integral have the same value. The Lebesgue is a more general concept. Indeed, facts about the Riemann are often established by an appeal to theorems about the Lebesgue. You do not have to abandon what you already know about integration over finite intervals.

The improper Riemann integral, ∫_{−∞}^∞ f(x) dx = lim_{n→∞} ∫_{−n}^n f(x) dx, also agrees with the Lebesgue integral provided m|f| < ∞. If m|f| = ∞, as in the case of the function f(x) := Σ_{n=1}^∞ {n ≤ x < n + 1}(−1)ⁿ/n, the improper Riemann integral might exist as a finite limit, while the Lebesgue integral mf does not exist.
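The alternating example can be checked numerically (a sketch of my own; the strips of f sit on [1, ∞), so ∫₁^N f dx = Σ_{n<N} (−1)ⁿ/n):

```python
from math import log

def partial_riemann(N):
    # integral of f over [1, N): each strip [n, n+1) contributes (-1)^n / n
    return sum((-1)**n / n for n in range(1, N))

def partial_abs(N):
    # integral of |f| over [1, N): harmonic partial sum
    return sum(1.0 / n for n in range(1, N))

print(partial_riemann(10**6))  # converges to -log(2), about -0.6931
print(partial_abs(10**6))      # diverges, growing like log(N)
```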

*4. Construction of integrals from measures

To construct the integral μ̃ as a functional on M⁺(X, 𝒜), starting from a measure μ on the sigma-field 𝒜, we use approximation from below by means of simple functions.

First we must define μ̃ on M⁺_simple. The representation of a simple function as a linear combination of indicator functions is not unique, but the additivity properties of the measure μ will let us use any representation to define the integral. For example, if s := 3A₁ + 7A₂ = 3A₁A₂ᶜ + 10A₁A₂ + 7A₁ᶜA₂, then

    3μ(A₁) + 7μ(A₂) = 3μ(A₁A₂ᶜ) + 10μ(A₁A₂) + 7μ(A₁ᶜA₂).


More generally, if s := Σᵢ αᵢAᵢ has another representation s = Σⱼ βⱼBⱼ, then Σᵢ αᵢμAᵢ = Σⱼ βⱼμBⱼ. Proof? Thus we can unambiguously define μ̃(s) for a simple function s := Σᵢ αᵢAᵢ by μ̃(s) := Σᵢ αᵢμAᵢ.

Define the increasing functional μ̃ on M⁺ by

    μ̃(f) := sup{μ̃(s) : f ≥ s ∈ M⁺_simple}.

That is, the integral of f is a supremum of integrals of nonnegative simple functions less than f.

From the representation of simple functions as linear combinations of disjoint sets in 𝒜, it is easy to show that μ̃(I_A) = μA for every A in 𝒜. It is also easy to show that μ̃(0) = 0, and μ̃(αf) = αμ̃(f) for nonnegative real α, and

<17>    μ̃(f + g) ≥ μ̃(f) + μ̃(g).

The last inequality, which is usually referred to as the superadditivity property, follows from the fact that if f ≥ u and g ≥ v, and both u and v are simple, then f + g ≥ u + v with u + v simple.

Only the Monotone Convergence property and the companion to <17>,

<18>    μ̃(f + g) ≤ μ̃(f) + μ̃(g),

require real work. Here you will see why measurability is needed.

Proof of inequality <18>. Let s be a simple function ≤ f + g, and let ε be a small positive number. It is enough to construct simple functions u, v with u ≤ f and v ≤ g such that u + v ≥ (1 − ε)s. For then μ̃f + μ̃g ≥ μ̃u + μ̃v ≥ (1 − ε)μ̃s, from which the subadditivity inequality <18> follows by taking a supremum over simple functions then letting ε tend to zero.

For simplicity of notation I will assume s to be very simple: s := A. You can repeat the argument for each Aᵢ in a representation Σᵢ αᵢAᵢ with disjoint Aᵢ to get the general result. Suppose ε = 1/m for some positive integer m. Write εⱼ for j/m. Define simple functions

    u := A{f ≥ 1} + Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} εⱼ₋₁,
    v := Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} (1 − εⱼ).

The measurability of f ensures 𝒜-measurability of all the sets entering into the definitions of u and v. For the inequality v ≤ g, notice that f + g ≥ 1 on A, so g ≥ 1 − εⱼ = v when εⱼ₋₁ ≤ f < εⱼ on A. Finally, note that the simple functions were chosen so that

    u + v = A{f ≥ 1} + Σ_{j=1}^m A{εⱼ₋₁ ≤ f < εⱼ} (1 − ε) ≥ (1 − ε)A,

as desired. □

Proof of the Monotone Convergence property. Suppose fₙ ∈ M⁺ and fₙ ↑ f. Suppose f ≥ s := Σᵢ αᵢAᵢ, with the Aᵢ disjoint sets in 𝒜 and αᵢ > 0. Define approximating simple functions sₙ := Σᵢ (1 − ε)αᵢAᵢ{fₙ > (1 − ε)αᵢ}. Clearly sₙ ≤ fₙ. The


simple function sₙ is one of those that enters into the supremum defining μ̃fₙ. It follows that

    μ̃fₙ ≥ μ̃(sₙ) = (1 − ε) Σᵢ αᵢ μ(Aᵢ{fₙ > (1 − ε)αᵢ}).

On the set Aᵢ the functions fₙ increase monotonely to f, which is ≥ αᵢ. The sets Aᵢ{fₙ > (1 − ε)αᵢ} expand up to the whole of Aᵢ. Countable additivity implies that the μ measures of those sets increase to μAᵢ. It follows that

    lim_{n→∞} μ̃fₙ ≥ lim sup_n μ̃sₙ ≥ (1 − ε)μ̃s.

Take a supremum over simple s ≤ f then let ε tend to zero to complete the proof. □

5. Limit theorems

Theorem <13> identified an integral on M⁺ as an increasing linear functional with the Monotone Convergence property:

<19>    if 0 ≤ fₙ ↑ f then μ(lim_{n→∞} fₙ) = lim_{n→∞} μfₙ.

Two direct consequences of this limit property have important applications throughout probability theory. The first, Fatou's Lemma, asserts a weaker limit property for nonnegative functions when the convergence and monotonicity assumptions are dropped. The second, Dominated Convergence, drops the monotonicity and nonnegativity but imposes an extra domination condition on the convergent sequence {fₙ}. I have slowly realized over the years that many simple probabilistic results can be established by Dominated Convergence arguments. The Dominated Convergence Theorem is the Swiss Army Knife of probability theory.

It is important that you understand why some conditions are needed before we can interchange integration (which is a limiting operation) with an explicit limit, as in <19>. Variations on the following example form the basis for many counterexamples.

<20> Example. Let μ be Lebesgue measure on ℬ[0, 1] and let {αₙ} be a sequence of positive numbers. The function fₙ(x) := αₙ{0 < x < 1/n} converges to zero, pointwise, but its integral μ(fₙ) = αₙ/n need not converge to zero. For example, αₙ = n² gives μfₙ → ∞; the integrals diverge. And

    αₙ := 6n for n even, 3n for n odd    gives    μfₙ = 6 for n even, 3 for n odd.

The integrals oscillate. □
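A sketch of the two behaviors (my own few lines, not from the text): fₙ vanishes at each fixed x > 0 once n > 1/x, while the integrals αₙ/n do whatever {αₙ} dictates.

```python
def f(alpha, n, x):
    # f_n(x) = alpha_n * {0 < x < 1/n}
    return alpha if 0 < x < 1.0 / n else 0.0

# pointwise: at any fixed x > 0, f_n(x) = 0 for all n > 1/x
assert f(100**2, 100, 0.3) == 0.0

# yet the integrals mu(f_n) = alpha_n / n can diverge or oscillate:
print([n**2 / n for n in (1, 10, 100)])                             # alpha_n = n^2: diverges
print([(6 * n if n % 2 == 0 else 3 * n) / n for n in range(1, 7)])  # oscillates 3, 6, 3, ...
```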

<21> Fatou's Lemma. For every sequence {fₙ} in M⁺ (not necessarily convergent),

    μ(lim inf_{n→∞} fₙ) ≤ lim inf_{n→∞} μfₙ.

Proof. Write f for lim inf fₙ. Remember what a liminf means. Define gₙ := inf_{m≥n} f_m. Then gₙ ≤ fₙ for every n and the {gₙ} sequence increases monotonely to the function f. By Monotone Convergence, μf = lim_{n→∞} μgₙ. By the increasing property, μgₙ ≤ μfₙ for each n, and hence lim_{n→∞} μgₙ ≤ lim inf_{n→∞} μfₙ. □

Page 47: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

32 Chapter 2: A modicum of measure theory

For dominated sequences of functions, a splicing together of two Fatou Lemma assertions gives two liminf consequences that combine to produce a limit result. (See Problem [10] for a generalization.)

<22> Dominated Convergence. Let {fₙ} be a sequence of μ-integrable functions for which limₙ fₙ(x) exists for all x. Suppose there exists a μ-integrable function F, which does not depend on n, such that |fₙ(x)| ≤ F(x) for all x and all n. Then the limit function is integrable and μ(lim_{n→∞} fₙ) = lim_{n→∞} μfₙ.

Proof. The limit function is also bounded in absolute value by F, and hence it is integrable.

Apply Fatou's Lemma to the two sequences {F + fₙ} and {F − fₙ} in M⁺, to get

    μ(lim inf (F + fₙ)) ≤ lim inf μ(F + fₙ) = lim inf (μF + μfₙ),
    μ(lim inf (F − fₙ)) ≤ lim inf μ(F − fₙ) = lim inf (μF − μfₙ).

Simplify, using the fact that a liminf is the same as a lim for convergent sequences.

    μ(F ± lim fₙ) ≤ μF + lim inf (±μfₙ).

Notice that we cannot yet assert that the liminf on the right-hand side is actually a limit. The negative sign turns a liminf into a limsup. Cancel out the finite number μF then rearrange, leaving

    lim sup μfₙ ≤ μ(lim fₙ) ≤ lim inf μfₙ.

The convergence assertion follows. □

REMARK. You might well object to some of the steps in the proof on ∞ − ∞ grounds. For example, what does F(x) + fₙ(x) mean at a point where F(x) = ∞ and fₙ(x) = −∞? To eliminate such problems, replace F by F{F < ∞} and fₙ by fₙ{F < ∞}, then appeal to Lemma <26> in the next Section to ensure that the integrals are not affected.

The function F is said to dominate the sequence {fₙ}. The assumption in Theorem <22> could also be written as μ(supₙ |fₙ|) < ∞, with F := supₙ |fₙ| as the dominating function. It is a common mistake amongst students new to the result to allow F to depend on n.

Dominated Convergence turns up in many situations that you might not at first recognize as examples of an interchange in the order of two limit procedures.

<23> Example. Do you know why

    (d/dt) ∫₀¹ e^{xt} x^{5/2} (1 − x)^{7/2} dx = ∫₀¹ e^{xt} x^{7/2} (1 − x)^{7/2} dx ?

Of course I just differentiated under the integral sign, but why is that allowed? The neatest justification uses a Dominated Convergence argument.

More generally, for each t in an interval (−δ, δ) about the origin let f(·, t) be a μ-integrable function on X, such that the function f(x, ·) is differentiable in (−δ, δ)


for each x. We need to justify taking the derivative at t = 0 inside the integral, to conclude that

<24>    (d/dt) (μ^x f(x, t)) |_{t=0} = μ^x ( (∂/∂t) f(x, t) |_{t=0} ).

Domination of the partial derivative will suffice. Write g(t) for μ^x f(x, t) and Δ(x, t) for the partial derivative (∂/∂t) f(x, t). Suppose there exists a μ-integrable function M such that

    |Δ(x, t)| ≤ M(x)    for all x, all t ∈ (−δ, δ).

To establish <24>, it is enough to show that

<25>    (g(hₙ) − g(0)) / hₙ → μ^x Δ(x, 0)

for every sequence {hₙ} of nonzero real numbers tending to zero. (Please make sure that you understand why continuous limits can be replaced by sequential limits in this way. It is a common simplification.) With no loss of generality, suppose δ > hₙ > 0 for all n. The ratio on the left-hand side of <25> equals the μ integral of the function fₙ(x) := (f(x, hₙ) − f(x, 0)) / hₙ. By assumption, fₙ(x) → Δ(x, 0) for every x. The sequence {fₙ} is dominated by M: by the mean-value theorem, for each x there exists a tₓ in (−hₙ, hₙ) ⊂ (−δ, δ) for which |fₙ(x)| = |Δ(x, tₓ)| ≤ M(x).

An appeal to Dominated Convergence completes the argument. □
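Example <23> can be checked numerically. The sketch below (my own, using a crude midpoint rule; the grid size and the step h are arbitrary choices) compares a symmetric difference quotient of g(t) := ∫₀¹ e^{xt} x^{5/2}(1 − x)^{7/2} dx at t = 0 with the integral of the t-derivative:

```python
from math import exp

def integrate(fn, n=20_000):
    # crude midpoint rule on [0, 1]; fine enough for this bounded integrand
    h = 1.0 / n
    return sum(fn((k + 0.5) * h) for k in range(n)) * h

def g(t):
    return integrate(lambda x: exp(x * t) * x**2.5 * (1 - x)**3.5)

h = 1e-5
finite_diff = (g(h) - g(-h)) / (2 * h)               # d/dt g(t) at t = 0
inside = integrate(lambda x: x**3.5 * (1 - x)**3.5)  # derivative taken under the integral
print(finite_diff, inside)                           # the two numbers agree
```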

6. Negligible sets

A set N in 𝒜 for which μN = 0 is said to be μ-negligible. (Some authors use the term μ-null, but I find it causes confusion with null as a name for the empty set.) As the name suggests, we can usually ignore bad things that happen only on a negligible set. A property that holds everywhere except possibly for those x in a μ-negligible set of points is said to hold μ-almost everywhere or μ-almost surely, abbreviated to a.e. [μ] or a.s. [μ], with the [μ] omitted when understood.

There are several useful facts about negligible sets that are easy to prove and exceedingly useful to have formally stated. They depend on countable additivity, via its Monotone Convergence generalization. I state them only for nonnegative functions, leaving the obvious extensions for ℒ¹(μ) to you.

<26> Lemma. For every measure μ:

(i) if g ∈ M⁺ and μg < ∞ then g < ∞ a.e. [μ];

(ii) if g, h ∈ M⁺ and g = h a.e. [μ] then μg = μh;

(iii) if N₁, N₂, … is a sequence of negligible sets then ∪ᵢ Nᵢ is also negligible;

(iv) if g ∈ M⁺ and μg = 0 then g = 0 a.e. [μ].

Proof. For (i): Integrate out the inequality g ≥ n{g = ∞} for each positive integer n to get ∞ > μg ≥ nμ{g = ∞}. Let n tend to infinity to deduce that μ{g = ∞} = 0.

For (ii): Invoke the increasing and Monotone Convergence properties of integrals, starting from the pointwise bound h ≤ g + ∞{h ≠ g} = limₙ (g + n{h ≠ g})


to deduce that μh ≤ limₙ (μg + nμ{h ≠ g}) = μg. Reverse the roles of g and h to get the reverse inequality.

For (iii): Invoke Monotone Convergence for the right-hand side of the pointwise inequality ∪ᵢ Nᵢ ≤ Σᵢ Nᵢ to get μ(∪ᵢ Nᵢ) ≤ μ(Σᵢ Nᵢ) = Σᵢ μNᵢ = 0.

For (iv): Put Nₙ := {g ≥ 1/n} for n = 1, 2, …. Then μNₙ ≤ nμg = 0, from which it follows that {g > 0} = ∪ₙ Nₙ is negligible. □

REMARK. Notice the appeals to countable additivity, via the Monotone Convergence property, in the proofs. Results such as (iv) fail without countable additivity, which might trouble those brave souls who would want to develop a probability theory using only finite additivity.

Property (iii) can be restated as: if A ∈ 𝒜 and A is covered by a countable family of negligible sets then A is negligible. Actually we can drop the assumption that A ∈ 𝒜 if we enlarge the sigma-field slightly.

<27> Definition. The μ-completion of the sigma-field 𝒜 is the class 𝒜_μ of all those sets B for which there exist sets A₀, A₁ in 𝒜 with A₀ ⊆ B ⊆ A₁ and μ(A₁\A₀) = 0.

You should check that 𝒜_μ is a sigma-field and that μ has a unique extension to a measure on 𝒜_μ defined by μB := μA₀ = μA₁, with A₀ and A₁ as in the Definition. More generally, for each f in M⁺(X, 𝒜_μ), you should show that there exist functions f₀, g₀ in M⁺(X, 𝒜) for which f₀ ≤ f ≤ f₀ + g₀ and μg₀ = 0. Of course, we then have μf := μf₀.

The Lebesgue sigma-field on the real line is the completion of the Borel sigma-field with respect to Lebesgue measure.

<28> Example. Here is one of the standard methods for proving that some measurable set A has zero μ measure. Find a measurable function f for which f(x) > 0, for all x in A, and μ(fA) = 0. From part (iv) of Lemma <26> deduce that fA = 0 a.e. [μ]. That is, f(x) = 0 for almost all x in A. The set A = {x ∈ A : f(x) > 0} must be negligible. □

Many limit theorems in probability theory assert facts about sequences that hold only almost everywhere.

<29> Example. (Generalized Borel-Cantelli lemma) Suppose {fₙ} is a sequence in M⁺ for which Σₙ μfₙ < ∞. By Monotone Convergence, μ Σₙ fₙ = Σₙ μfₙ < ∞. Part (i) of Lemma <26> then gives Σₙ fₙ(x) < ∞ for μ almost all x.

For the special case of a probability measure with each fₙ an indicator function of a set in 𝒜, the convergence property is called the Borel-Cantelli lemma: if Σₙ PAₙ < ∞ then Σₙ Aₙ < ∞ almost surely. That is,

    P{ω ∈ Ω : ω ∈ Aₙ for infinitely many n} = 0,

a trivial result that, nevertheless, is the basis for much probabilistic limit theory. The event in the last display is often written in abbreviated form, {Aₙ i.o.}.

REMARK. For sequences of independent events, there is a second part to the Borel-Cantelli lemma (Problem [1]), which asserts that if Σₙ PAₙ = ∞ then P{Aₙ i.o.} = 1. Problem [2] establishes an even stronger converse, replacing independence by a weaker limit property.


The Borel-Cantelli argument often takes the following form when invoked to establish almost sure convergence. You should make sure you understand the method, because the details are usually omitted in the literature.

Suppose {Xₙ} is a sequence of random variables (all defined on the same Ω) for which Σₙ P{|Xₙ| > ε} < ∞ for each ε > 0. By Borel-Cantelli, to each ε > 0 there is a P-negligible set N(ε) for which Σₙ {|Xₙ(ω)| > ε} < ∞ if ω ∈ N(ε)ᶜ. A sum of integers converges if and only if the summands are eventually zero. Thus to each ω in N(ε)ᶜ there exists a finite n(ε, ω) such that |Xₙ(ω)| ≤ ε when n ≥ n(ε, ω).

We have an uncountable family of negligible sets {N(ε) : ε > 0}. We are allowed to neglect only countable unions of negligible sets. Replace ε by a sequence of values such as 1, 1/2, 1/3, 1/4, …, tending to zero. Define N := ∪_{k∈ℕ} N(1/k), which, by part (iii) of Lemma <26>, is negligible. For each ω in Nᶜ we have |Xₙ(ω)| ≤ 1/k when n ≥ n(1/k, ω). Consequently, Xₙ(ω) → 0 as n → ∞ for each ω in Nᶜ; the sequence {Xₙ} converges to zero almost surely. □
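The method can be watched in a simulation (my own sketch, not from the text): take Xₙ to be the indicator of an event of probability 1/n², so that Σₙ P{|Xₙ| > ε} ≤ Σₙ 1/n² < ∞. Along each simulated path only finitely many Xₙ are nonzero:

```python
import random

random.seed(0)

def nonzero_count(n_max=10_000):
    # one sample path: X_n = 1 with probability 1/n^2, else 0 (independent draws)
    return sum(1 for n in range(1, n_max + 1) if random.random() < 1.0 / n**2)

paths = [nonzero_count() for _ in range(200)]
# sum_n 1/n^2 < oo, so Borel-Cantelli says each path has finitely many nonzero terms;
# the expected number per path is sum_{n <= 10^4} 1/n^2, close to pi^2/6 ~ 1.64
print(max(paths), sum(paths) / len(paths))
```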

For measure theoretic arguments with a fixed μ, it is natural to treat as identical those functions that are equal almost everywhere. Many theorems have trivial modifications with equalities replaced by almost sure equalities, and convergence replaced by almost sure convergence, and so on. For example, Dominated Convergence holds in a slightly strengthened form:

Let {fₙ} be a sequence of measurable functions for which fₙ(x) → f(x) at μ almost all x. Suppose there exists a μ-integrable function F, which does not depend on n, such that |fₙ(x)| ≤ F(x) for μ almost all x and all n. Then f is integrable and μfₙ → μf.

Most practitioners of probability learn to ignore negligible sets (and then suffer slightly when they come to some stochastic process arguments where the handling of uncountable families of negligible sets requires more delicacy). For example, if I could show that a sequence {fₙ} converges almost everywhere I would hardly hesitate to write: Define f := limₙ fₙ. What happens at those x where fₙ(x) does not converge? If hard pressed I might write:

    Define f(x) := limₙ fₙ(x) on the set where the limit exists, and f(x) := 0 otherwise.

You might then wonder if the function so-defined were measurable (it is), or if the set where the limit exists is measurable (it is). A sneakier solution would be to write: Define f(x) := lim supₙ fₙ(x). It doesn't much matter what happens on the negligible set where the limsup is not equal to the liminf, which happens only when the limit does not exist.

A more formal way to equate functions equal almost everywhere is to work with equivalence classes, [f] := {g ∈ M : f = g a.e. [μ]}. The almost sure equivalence also partitions ℒ¹(μ) into equivalence classes, for which we can define μ[f] := μg for an arbitrary choice of g from [f]. The collection of all these equivalence classes is denoted by L¹(μ). The L¹ norm, ‖[f]‖₁ := ‖f‖₁, is a true norm on L¹, because [f] equals the equivalence class of the zero function when ‖[f]‖₁ = 0. Few authors are careful about maintaining the distinction between f and [f], or between L¹(μ) and ℒ¹(μ).


*7. Lp spaces

For each real number p with p ≥ 1 the ℒᵖ-norm is defined on M(X, 𝒜, μ) by ‖f‖ₚ := (μ|f|ᵖ)^{1/p}. Problem [17] shows that the map f ↦ ‖f‖ₚ satisfies the triangle inequality, ‖f + g‖ₚ ≤ ‖f‖ₚ + ‖g‖ₚ, at least when restricted to real-valued functions in M.

As with the ℒ¹-norm, it is not quite correct to call ‖·‖ₚ a norm, for two reasons: there are measurable functions for which ‖f‖ₚ = ∞, and there are nonzero measurable functions for which ‖f‖ₚ = 0. We avoid the first complication by restricting attention to the vector space ℒᵖ := ℒᵖ(X, 𝒜, μ) of all real-valued, 𝒜-measurable functions for which ‖f‖ₚ < ∞. We could avoid the second complication by working with the vector space Lᵖ := Lᵖ(X, 𝒜, μ) of μ-equivalence classes of functions in ℒᵖ(X, 𝒜, μ). That is, the members of Lᵖ are the μ-equivalence classes, [f] := {g ∈ ℒᵖ : g = f a.e. [μ]}, with f in ℒᵖ. (See Problem [20] for the limiting case, p = ∞.)
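A numerical sanity check of the triangle inequality for several values of p, on a finite measure given by point masses (my own example, not from the text):

```python
def lp_norm(f, masses, p):
    # ||f||_p = (mu |f|^p)^(1/p) for a finite measure made of point masses
    return sum(m * abs(v)**p for v, m in zip(f, masses)) ** (1.0 / p)

masses = [0.1, 0.4, 0.2, 0.3]
f = [1.0, -2.0, 0.5, 3.0]
g = [0.0, 1.5, -1.0, -2.5]
fg = [a + b for a, b in zip(f, g)]

for p in (1, 2, 3, 7):
    lhs, rhs = lp_norm(fg, masses, p), lp_norm(f, masses, p) + lp_norm(g, masses, p)
    assert lhs <= rhs + 1e-12   # Minkowski: ||f+g||_p <= ||f||_p + ||g||_p
    print(p, round(lhs, 4), "<=", round(rhs, 4))
```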

REMARK. The correct term for ‖·‖ₚ on ℒᵖ is pseudonorm, meaning that it has all the properties of a norm (triangle inequality, and ‖cf‖ = |c| ‖f‖ for real constants c) except that it might be zero for nonzero functions. Again, few authors are careful about maintaining the distinction between ℒᵖ and Lᵖ.

Problem [19] shows that the norm defines a complete pseudometric on ℒᵖ (and a complete metric on Lᵖ). That is, if {fₙ} is a Cauchy sequence of functions in ℒᵖ (meaning that ‖fₙ − f_m‖ₚ → 0 as min(m, n) → ∞) then there exists a function f in ℒᵖ for which ‖fₙ − f‖ₚ → 0. The limit function f is unique up to a μ-equivalence.

For our purposes, the case where p equals 2 will be the most important. The pseudonorm is then generated by an inner product (or, more correctly, a "pseudo" inner product), ⟨f, g⟩ := μ(fg). That is, ‖f‖₂² := ⟨f, f⟩. The inner product has the properties:

(a) ⟨αf + βg, h⟩ = α⟨f, h⟩ + β⟨g, h⟩ for all real α, β, all f, g, h in ℒ²;

(b) ⟨f, g⟩ = ⟨g, f⟩ for all f, g in ℒ²;

(c) ⟨f, f⟩ ≥ 0, with equality if and only if f = 0 a.e. [μ].

If we work with the equivalence classes of L² then (c) is replaced by the assertion that ⟨[f], [f]⟩ equals zero if and only if [f] is zero, as required for a true inner product.

A vector space equipped with an inner product whose corresponding norm defines a complete metric is called a Hilbert space, a generalization of ordinary Euclidean space. Arguments involving Hilbert spaces look similar to their analogs for Euclidean space, with an occasional precaution against possible difficulties with infinite dimensionality. Many results in Probability and Statistics rely on Hilbert space methods: information inequalities; the Blackwell-Rao theorem; the construction of densities and abstract conditional expectations; Hellinger differentiability; prediction in time series; Gaussian process theory; martingale theory; stochastic integration; and much more.


Some of the basic theory for Hilbert space is established in Appendix B. For the next several Chapters, the following two Hilbert space results, specialized to L² spaces, will suffice.

(1) Cauchy-Schwarz inequality: |μ(fg)| ≤ ‖f‖₂ ‖g‖₂ for all f, g in ℒ²(μ), which follows from the Hölder inequality (Problem [15]).

(2) Orthogonal projections: Let ℋ₀ be a closed subspace of ℒ²(μ). For each f in ℒ² there is an f₀ in ℋ₀, the (orthogonal) projection of f onto ℋ₀, for which f − f₀ is orthogonal to ℋ₀, that is, ⟨f − f₀, g⟩ = 0 for all g in ℋ₀. The point f₀ minimizes ‖f − h‖ over all h in ℋ₀. The projection f₀ is unique up to a μ-almost sure equivalence.

REMARK. A closed subspace ℋ₀ of ℒ² must contain all f in ℒ² for which there exist fₙ ∈ ℋ₀ with ‖fₙ − f‖₂ → 0. In particular, if f belongs to ℋ₀ and g = f a.e. [μ] then g must also belong to ℋ₀. If ℋ₀ is closed, the set of equivalence classes H₀ = {[f] : f ∈ ℋ₀} must be a closed subspace of L²(μ), and ℋ₀ must equal the union of all equivalence classes in H₀.

For us the most important subspaces of ℒ²(X, 𝒜, μ) will be defined by the sub-sigma-fields 𝒜₀ of 𝒜. Let ℋ₀ = ℒ²(X, 𝒜₀, μ). The corresponding L²(X, 𝒜₀, μ) is a Hilbert space in its own right, and therefore it is a closed subspace of L²(X, 𝒜, μ). Consequently ℋ₀ is a complete subspace of ℒ²: if {fₙ} is a Cauchy sequence in ℋ₀ then there exists an f₀ ∈ ℋ₀ such that ‖fₙ − f₀‖₂ → 0. However, {fₙ} also converges to every other 𝒜-measurable f for which f = f₀ a.e. [μ]. Unless 𝒜₀ contains all μ-negligible sets from 𝒜, the limit f need not be 𝒜₀-measurable; the subspace ℋ₀ need not be closed. If we work instead with the corresponding L²(X, 𝒜, μ) and L²(X, 𝒜₀, μ) we do get a closed subspace, because the equivalence class of the limit function is uniquely determined.
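For a finite X the projection onto ℒ²(X, 𝒜₀, μ) can be computed by hand: when 𝒜₀ is generated by a partition, the projection replaces f by its μ-weighted average over each block. The toy example below (my own; uniform μ on six points, blocks {0,1,2} and {3,4,5}) checks the orthogonality property ⟨f − f₀, g⟩ = 0 for an 𝒜₀-measurable g:

```python
# X = {0,...,5} with uniform probability; A0 generated by the partition {0,1,2} | {3,4,5}.
blocks = [(0, 1, 2), (3, 4, 5)]
mu = [1 / 6] * 6
f = [2.0, -1.0, 4.0, 0.0, 3.0, -2.0]

# projection f0: average of f over each block (uniform weights within a block here)
f0 = [0.0] * 6
for block in blocks:
    avg = sum(f[i] for i in block) / len(block)
    for i in block:
        f0[i] = avg

def inner(u, v):
    return sum(m * a * b for m, a, b in zip(mu, u, v))

# f - f0 is orthogonal to every A0-measurable g, i.e. every g constant on blocks:
g = [5.0, 5.0, 5.0, -7.0, -7.0, -7.0]
resid = [a - b for a, b in zip(f, f0)]
print(inner(resid, g))   # 0 up to rounding
```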

*8. Uniform integrability

Suppose {fₙ} is a sequence of measurable functions converging almost surely to a limit f. If the sequence is dominated by some μ-integrable function F, then 2F ≥ |fₙ − f| → 0 almost surely, from which it follows, via Dominated Convergence, that μ|fₙ − f| → 0. That is, domination plus almost sure convergence imply convergence in ℒ¹(μ) norm. The converse is not true: μ equal to Lebesgue measure and fₙ(x) := n{(n + 1)⁻¹ < x ≤ n⁻¹} provides an instance of ℒ¹ convergence without domination.

At least when we deal with finite measures, there is an elegant circle of equivalences, involving a concept (convergence in measure) slightly weaker than almost sure convergence and a concept (uniform integrability) slightly weaker than domination. With no loss of generality, I will explain the connections for a sequence of random variables {Xₙ} on a probability space (Ω, ℱ, P).

The sequence is said to converge in probability to a random variable X, sometimes written Xₙ →ᴾ X, if P{|Xₙ − X| > ε} → 0 for each ε > 0. Problem [14] guides you through the proofs of the following facts.


(a) If {Xₙ} converges to X almost surely then Xₙ → X in probability, but the converse is false: there exist sequences that converge in probability but not almost surely.

(b) If {Xₙ} converges in probability to X, there is an increasing sequence of positive integers {n(k)} for which lim_{k→∞} X_{n(k)} = X almost surely.

If a random variable Z is integrable then a Dominated Convergence argument shows that P|Z|{|Z| > M} → 0 as M → ∞. Uniform integrability requires that the convergence holds uniformly over a class of random variables. Very roughly speaking, it lets us act almost as if all the random variables were bounded by a constant M, at least as far as ℒ¹ arguments are concerned.

<30> Definition. A family of random variables {Z_t : t ∈ T} is said to be uniformly integrable if sup_{t∈T} P|Z_t|{|Z_t| > M} → 0 as M → ∞.

It is sometimes slightly more convenient to check for uniform integrability by means of an ε-δ characterization.

<31> Lemma. A family of random variables {Z_t : t ∈ T} is uniformly integrable if and only if both the following conditions hold:

(i) sup_{t∈T} P|Z_t| < ∞;

(ii) for each ε > 0 there exists a δ > 0 such that sup_{t∈T} P|Z_t|F < ε for every event F with PF < δ.

REMARK. Requirement (i) is superfluous if, for each δ > 0, the space Ω can be partitioned into finitely many pieces each with measure less than δ.

Proof. Given uniform integrability, (i) follows from P|Z_t| ≤ M + P|Z_t|{|Z_t| > M}, and (ii) follows from P|Z_t|F ≤ M·PF + P|Z_t|{|Z_t| > M}.

Conversely, if (i) and (ii) hold then the event {|Z_t| > M} is a candidate for the F in (ii) when M is so large that P{|Z_t| > M} ≤ sup_{t∈T} P|Z_t|/M < δ. It follows that sup_{t∈T} P|Z_t|{|Z_t| > M} ≤ ε if M is large enough.
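A standard example (mine, not the text's) shows why condition (i) alone does not suffice: on (0, 1] with Lebesgue measure, the functions Z_n = n·1{(0, 1/n]} all have expectation 1, yet the tail contributions P|Z_n|{|Z_n| > M} do not shrink uniformly in n.

```python
# Z_n = n * indicator of (0, 1/n], on (0, 1] with Lebesgue measure.
# Exact rational arithmetic avoids any floating-point noise.
from fractions import Fraction

def expectation(n):
    """P|Z_n| = n * length of (0, 1/n] = 1 for every n."""
    return n * Fraction(1, n)

def tail_expectation(n, M):
    """P|Z_n| 1{|Z_n| > M}: the whole mass when n > M, else nothing."""
    return expectation(n) if n > M else Fraction(0)

# Condition (i) of Lemma <31> holds: sup_n P|Z_n| = 1 < infinity.
assert all(expectation(n) == 1 for n in range(1, 200))
# But the tails do not vanish uniformly: the sup stays at 1 for every M.
assert max(tail_expectation(n, M=100) for n in range(1, 200)) == 1
```

The same family also violates condition (ii): the events F_n = (0, 1/n] have measure shrinking to zero while P|Z_n|F_n stays equal to 1.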

[Diagram: X_n → X almost surely implies, by (a), X_n → X in probability, with the reverse implication (b) holding along a subsequence; domination implies uniform integrability; and convergence in probability together with uniform integrability is equivalent to X_n → X in L¹(P), with each X_n integrable.]

The diagram summarizes the interconnections between the various convergence concepts, with each arrow denoting an implication. The relationship between almost sure convergence and convergence in probability corresponds to results (a) and (b) noted above. A family {Z_t : t ∈ T} dominated by an integrable random variable Y


is also uniformly integrable, because Y{Y > M} ≥ |Z_t|{|Z_t| > M} for every t. Only the implications leading to and from the box for the L¹ convergence remain to be proved.

<32> Theorem. Let {X_n : n ∈ ℕ} be a sequence of integrable random variables. The following two conditions are equivalent.

(i) The sequence is uniformly integrable and it converges in probability to a random variable X_∞, which is necessarily integrable.

(ii) The sequence converges in L¹ norm, P|X_n − X_∞| → 0, with a limit X_∞ that is necessarily integrable.

Proof. Suppose (i) holds. The assertion about integrability of X_∞ follows from Fatou's lemma, because |X_n| → |X_∞| almost surely along some subsequence, so that P|X_∞| ≤ liminf_n P|X_n| ≤ sup_n P|X_n| < ∞. To prove L¹ convergence, first split according to whether |X_n − X_∞| is less than ε or not, and then split according to whether max(|X_n|, |X_∞|) is less than some large constant M or not:

P|X_n − X_∞| ≤ ε + P(|X_n| + |X_∞|){|X_n − X_∞| > ε}
            ≤ ε + 2M P{|X_n − X_∞| > ε} + P(|X_n| + |X_∞|){|X_n| ∨ |X_∞| > M}.

Split the event {|X_n| ∨ |X_∞| > M} according to which of the two random variables is larger, to bound the last term by 2P|X_n|{|X_n| > M} + 2P|X_∞|{|X_∞| > M}. Invoke uniform integrability of {X_n} and integrability of X_∞ to find an M that makes this bound small, uniformly in n. With M fixed, the convergence in probability sends 2M P{|X_n − X_∞| > ε} to zero as n → ∞.

Conversely, if the sequence converges in L¹, then X_∞ must be integrable, because P|X_∞| ≤ P|X_n| + P|X_n − X_∞| for each n. When |X_n| ≤ M or |X_∞| > M/2, the inequality

|X_n|{|X_n| > M} ≤ |X_∞|{|X_∞| > M/2} + 2|X_n − X_∞|

is easy to check; and when |X_∞| ≤ M/2 and |X_n| > M, it follows from the inequality |X_n − X_∞| ≥ |X_n| − |X_∞| ≥ |X_n|/2. Take expectations, choose M large enough to make the contribution from X_∞ small, then let n tend to infinity to find an n₀ such that P|X_n|{|X_n| > M} ≤ ε for n ≥ n₀. Increase M if necessary to handle the corresponding tail contributions for n < n₀.

9. Image measures and distributions

Suppose μ is a measure on a sigma-field A of subsets of X and T is a map from X into a set Y, equipped with a sigma-field B. If T is A\B-measurable we can carry μ over to Y, by defining

<33>    νB := μ(T⁻¹B) for each B in B.

Actually the operation is more one of carrying the sets back to the measure rather than carrying the measure over to the sets, but the net result is the definition of a new set function on B.


It is easy to check that ν is a measure on B, using facts such as T⁻¹(Bᶜ) = (T⁻¹B)ᶜ and T⁻¹(∪_i B_i) = ∪_i T⁻¹B_i. It is called the image measure of μ under T, or just the image measure, and is denoted by μT⁻¹ or μT or T(μ), or even just Tμ. The third and fourth forms, which I prefer to use, have the nice property that if μ is a point mass concentrated at x then T(μ) denotes a point mass concentrated at T(x).

Starting from definition <33> we could prove facts about integrals with respect to the image measure ν. For example, we could show

<34>    νg = μ(g ∘ T) for all g ∈ M⁺(Y, B).

The small circle symbol ∘ denotes the composition of functions: (g ∘ T)(x) := g(Tx). The proof of <34> could follow the traditional path: first argue by linearity from <33> to establish the result for simple functions; then take monotone limits of simple functions to extend to M⁺(Y, B).

There is another method for constructing image measures that gets <34> all in one step. Define an increasing linear functional ν on M⁺(Y, B) by νg := μ(g ∘ T). It inherits the Monotone Convergence property directly from μ, because if 0 ≤ g_n ↑ g then 0 ≤ g_n ∘ T ↑ g ∘ T. By Theorem <13> it corresponds to a uniquely determined measure on B. When restricted to indicator functions of measurable sets the new measure coincides with the measure defined by <33>, because if g is the indicator function of B then g ∘ T is the indicator function of T⁻¹B. (Why?) We have gained a theorem with almost no extra work, by starting with the linear functional as the definition of the image measure.

Using the notation Tμ for the image measure, we could rewrite the defining equality as (Tμ)(g) := μ(g ∘ T), at least for all g ∈ M⁺(Y, B), a relationship that I find easier to remember.

REMARK. In the last sentence I used the qualifier at least, as a reminder that the equality could easily be extended to other cases. For example, by splitting into positive and negative parts then subtracting, we could extend the equality to functions in L¹(Y, B, ν). And so on.

Several familiar probabilistic objects are just image measures. If X is a random variable, the image measure X(P) on B(ℝ) is often written P_X, and is called the distribution of X. More generally, if X and Y are random variables defined on the same probability space, they together define a random vector, a measurable (see Chapter 4) map T(ω) = (X(ω), Y(ω)) from Ω into ℝ². The image measure T(P) on B(ℝ²) is called the joint distribution of X and Y, and is often denoted by P_{X,Y}. Similar terminology applies for larger collections of random variables.
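For measures with finitely many atoms, both definition <33> and the companion formula νg = μ(g ∘ T) can be checked by direct computation. The sketch below is my own illustration with hypothetical helper names (not from the text); it pushes a four-point distribution forward through a map that is not one-to-one.

```python
# Image measure of a discrete measure, represented as {point: mass},
# and a check of the identity (T mu)g = mu(g o T) from <34>.
from collections import defaultdict

def pushforward(mu, T):
    """Image measure T(mu): carry each atom's mass to its image point."""
    nu = defaultdict(float)
    for x, mass in mu.items():
        nu[T(x)] += mass
    return dict(nu)

def integral(mu, g):
    """Integral of g against a discrete measure."""
    return sum(mass * g(x) for x, mass in mu.items())

mu = {-2: 0.25, -1: 0.25, 1: 0.25, 2: 0.25}
T = lambda x: x * x                  # not one-to-one: atoms merge
nu = pushforward(mu, T)
g = lambda y: y + 3

assert nu == {4: 0.5, 1: 0.5}
# The one-step identity <34>: nu(g) = mu(g o T).
assert integral(nu, g) == integral(mu, lambda x: g(T(x)))
```

If mu is a point mass at x, the same function returns a point mass at T(x), matching the remark about the notation T(μ).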

Image measures also figure in a construction that is discussed nonrigorously in many introductory textbooks. Let P be a probability measure on B(ℝ). Its distribution function (also known as a cumulative distribution function) is defined by F_P(x) := P(−∞, x] for x ∈ ℝ. Don't confuse distribution, as a synonym for probability measure, with distribution function, which is a function derived from the measures of a particular collection of sets. The distribution function has the following properties.

(a) It is increasing, with lim_{x→−∞} F_P(x) = 0 and lim_{x→∞} F_P(x) = 1.


(b) It is continuous from the right: to each ε > 0 and x ∈ ℝ, there exists a δ > 0 such that F_P(x) ≤ F_P(y) ≤ F_P(x) + ε for x ≤ y ≤ x + δ.

Property (a) follows from the fact that the integral is an increasing functional, and from Dominated Convergence applied to the sequences (−∞, −n] ↓ ∅ and (−∞, n] ↑ ℝ as n → ∞. Property (b) also follows from Dominated Convergence, applied to the sequence (−∞, x + 1/n] ↓ (−∞, x].

Except in introductory textbooks, and in works dealing with the order properties of the real line (such as the study of ranks and order statistics), distribution functions have a reduced role to play in modern probability theory, mostly in connection with the following method for building measures on B(ℝ) as images of Lebesgue measure. In probability theory the construction often goes by the name of the quantile transformation.

<35> Example. There is a converse to the assertions (a) and (b) about distribution functions. Suppose F is a right-continuous, increasing function on ℝ for which lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. Then there exists a probability measure P such that P(−∞, x] = F(x) for all real x. To construct such a P, consider the quantile function q, defined by q(t) := inf{x : F(x) ≥ t} for 0 < t < 1.

By right continuity of the increasing function F, the set {x ∈ ℝ : F(x) ≥ t} is a closed interval of the form [a, ∞), with a = q(t). That is, for all x ∈ ℝ and all t ∈ (0, 1),

<36>    F(x) ≥ t if and only if x ≥ q(t).

In general there are many plausible, but false, equalities related to <36>. For example, it is not true in general that F(q(t)) = t. However, if F is continuous and strictly increasing, then q is just the inverse function of F, and the plausible equalities hold.

Let m denote Lebesgue measure restricted to the Borel sigma-field on (0, 1). The image measure P := q(m) has the desired property,

P(−∞, x] = m{t : q(t) ≤ x} = m{t : t ≤ F(x)} = F(x),

the first equality by definition of the image measure, and the second by equivalence <36>. The result is often restated as: if ξ has a Uniform(0, 1) distribution then q(ξ) has distribution function F.
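The quantile construction, and the warning that F(q(t)) = t can fail, are easy to see numerically for a step distribution function. The sketch below is my own example with names of my choosing (not the text's); it uses the two-point distribution with mass 1/2 at 0 and mass 1/2 at 1.

```python
import random

# Two-point distribution: F(x) = 0 for x < 0, = 1/2 for 0 <= x < 1,
# and = 1 for x >= 1.  The quantile function q(t) = inf{x : F(x) >= t}
# works out to 0 for t <= 1/2 and 1 for t > 1/2.
def F(x):
    return 0.0 if x < 0 else (0.5 if x < 1 else 1.0)

def q(t):
    return 0.0 if t <= 0.5 else 1.0

# Property <36>: F(x) >= t if and only if x >= q(t).
for _ in range(1000):
    t = random.uniform(0.01, 0.99)
    x = random.uniform(-2.0, 2.0)
    assert (F(x) >= t) == (x >= q(t))

# A plausible-but-false equality: F(q(t)) = t fails for this step F.
assert F(q(0.3)) == 0.5

# Quantile transform: q(xi), with xi Uniform(0, 1), has d.f. F.
random.seed(0)
sample = [q(random.random()) for _ in range(10000)]
frac_zero = sample.count(0.0) / len(sample)
assert abs(frac_zero - 0.5) < 0.05   # mass at 0 close to 1/2
```

When F is continuous and strictly increasing the same q reduces to the inverse function F⁻¹, and the check of F(q(t)) = t would then succeed.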

10. Generating classes of sets

To prove that all sets in a sigma-field A have some property one often resorts to a generating-class argument. The simplest form of such an argument has three steps:

(i) Show that all members of a subclass ℰ have the property.

(ii) Show that A ⊆ σ(ℰ).

(iii) Show that A₀ := {A ∈ A : A has the property} is a sigma-field.

Then one deduces that A₀ = σ(A₀) ⊇ σ(ℰ) ⊇ A, whence A₀ = A. That is, the property holds for all sets in A.


For some properties, direct verification of all the sigma-field requirements for A₀ proves too difficult. In such situations an indirect argument sometimes succeeds if ℰ has some extra structure. For example, if it is possible to establish that A₀ is a λ-system of sets, then one need only check one extra requirement for ℰ in order to produce a successful generating-class argument.

<37> Definition. A class D of subsets of X is called a λ-system if

(i) X ∈ D,

(ii) if D₁, D₂ ∈ D and D₁ ⊇ D₂ then D₁\D₂ ∈ D,

(iii) if {D_n} is an increasing sequence of sets in D then ∪_n D_n ∈ D.

REMARK. Some authors start from a slightly different definition, replacing requirement (iii) by

(iii)′ if {D_n} is a sequence of disjoint sets in D then ∪_n D_n ∈ D.

The change in definition would have little effect on the role played by λ-systems. Many authors (including me, until recently) use the name Dynkin class instead of λ-system, but the name Sierpinski class would be more appropriate. See the Notes at the end of this Chapter.

Notice that a λ-system is also a sigma-field if and only if it is stable under finite intersections. This stability property can be inherited from a subclass ℰ, as in the next Theorem, which is sometimes referred to as the π-λ theorem. The π stands for product, an indirect reference to the stability of the subclass ℰ under finite intersections (products). I think that the letter λ stands for limit, an indirect reference to property (iii).

<38> Theorem. If ℰ is stable under finite intersections, and if D is a λ-system with D ⊇ ℰ, then D ⊇ σ(ℰ).

Proof. It would be enough to show that D is a sigma-field, by establishing that it is stable under finite intersections, but that is a little more than I know how to do. Instead we need to work with a (possibly) smaller λ-system D₀, with D ⊇ D₀ ⊇ ℰ, for which generating class arguments can extend the assumption

<39>    E₁E₂ ∈ ℰ for all E₁, E₂ in ℰ

to an assertion that

<40>    D₁D₂ ∈ D₀ for all D₁, D₂ in D₀.

It will then follow that D₀ is a sigma-field, which contains ℰ, and hence D₀ ⊇ σ(ℰ). The choice of D₀ is easy. Let {D_α : α ∈ A} be the collection of all λ-systems with D_α ⊇ ℰ, one of them being the D we started with. Let D₀ equal the intersection of all these D_α. That is, let D₀ consist of all sets D for which D ∈ D_α for each α. I leave it to you to check the easy details that prove D₀ to be a λ-system. In other words, D₀ is the smallest λ-system containing ℰ; it is the λ-system generated by ℰ.

To upgrade <39> to <40> we have to replace each E_i on the left-hand side by a D_i in D₀, without going outside the class D₀. The trick is to work one component at a time. Start with E₁. Define D₁ := {A : AE ∈ D₀ for each E ∈ ℰ}. From <39>, we have D₁ ⊇ ℰ. If we show that D₁ is a λ-system then it will follow that D₁ ⊇ D₀, because D₀ is the smallest λ-system containing ℰ. Actually, the assertion that D₁ is


a λ-system is trivial; it follows immediately from the λ-system properties for D₀ and identities like (A₁\A₂)E = (A₁E)\(A₂E) and (∪_i A_i)E = ∪_i(A_iE).

The inclusion D₁ ⊇ D₀ implies that D₁E₂ ∈ D₀ for all D₁ ∈ D₀ and all E₂ ∈ ℰ. Put another way (this step is the only subtlety in the proof), we can assert that the class D₂ := {B : BD ∈ D₀ for each D ∈ D₀} contains ℰ. Just write D₁ instead of D, and E₂ instead of B, in the definition to see that it is only a matter of switching the order of the sets.

Argue in the same way as for D₁ to show that D₂ is also a λ-system. It then follows that D₂ ⊇ D₀, which is another way of expressing assertion <40>.

The proof of the last Theorem is typical of many generating class arguments, in that it is trivial once one knows what one has to check. The Theorem, or its analog for classes of functions (see the next Section), will be my main method for establishing sigma-field properties. You will be getting plenty of practice at filling in the details behind frequent assertions of "a generating class argument shows that ...". Here is a typical example to get you started.

<41> Exercise. Let μ and ν be finite measures on B(ℝ) with the same distribution function. That is, μ(−∞, t] = ν(−∞, t] for all real t. Show that μB = νB for all B ∈ B(ℝ), that is, μ = ν as Borel measures.

SOLUTION: Write ℰ for the class of all intervals (−∞, t], with t ∈ ℝ. Clearly ℰ is stable under finite intersections. From Example <4>, we know that σ(ℰ) = B(ℝ). It is easy to check that the class D := {B ∈ B(ℝ) : μB = νB} is a λ-system. For example, if B_n ∈ D and B_n ↑ B then μB = lim_n μB_n = lim_n νB_n = νB, by Monotone Convergence. It follows from Theorem <38> that D ⊇ σ(ℰ) = B(ℝ), and the equality of the two Borel measures is established.

When you employ a λ-system argument be sure to verify the properties required of ℰ. The next Example shows what can happen if you forget about the stability under finite intersections.

<42> Example. Consider a set X consisting of four points, labelled nw, ne, sw, and se. Let ℰ consist of X and the subsets N = {nw, ne}, S = {sw, se}, E = {ne, se}, and W = {nw, sw}. Notice that ℰ generates the sigma-field of all subsets of X, but it is not stable under finite intersections. Let μ and ν be probability measures for which

μ(nw) = 1/2    μ(ne) = 0      ν(nw) = 0      ν(ne) = 1/2
μ(sw) = 0      μ(se) = 1/2    ν(sw) = 1/2    ν(se) = 0

Both measures give the value 1/2 to each of N, S, E, and W, but they differ in the values they give to the four singletons.
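The failure is easy to verify by brute force. The sketch below (helper names mine, not the text's) confirms that μ and ν agree on every set in the generating class yet disagree on the singletons.

```python
# Example <42> checked numerically: mu and nu agree on X, N, S, E, W
# but not on the singletons, because {X, N, S, E, W} is not stable
# under finite intersections (e.g. N and E intersect in {ne}).
mu = {"nw": 0.5, "ne": 0.0, "sw": 0.0, "se": 0.5}
nu = {"nw": 0.0, "ne": 0.5, "sw": 0.5, "se": 0.0}

def measure(m, A):
    """Measure of a subset A of the four-point space."""
    return sum(m[x] for x in A)

N, S = {"nw", "ne"}, {"sw", "se"}
E, W = {"ne", "se"}, {"nw", "sw"}
X = N | S

for A in (X, N, S, E, W):
    assert measure(mu, A) == measure(nu, A)      # agreement on the class
assert measure(mu, {"nw"}) != measure(nu, {"nw"})  # disagreement below it
```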

* 11. Generating classes of functions

Theorem <38> is often used as the starting point for proving facts about measurable functions. One first invokes the Theorem to establish a property for sets in a sigma-field, then one extends by taking limits of simple functions to M⁺ and beyond, using Monotone Convergence and linearity arguments. Sometimes it is simpler to invoke an analog of the λ-system property for classes of functions.


<43> Definition. Call a class H⁺ of bounded, nonnegative functions on a set X a λ-cone if:

(i) H⁺ is a cone, that is, if h₁, h₂ ∈ H⁺ and α₁ and α₂ are nonnegative constants then α₁h₁ + α₂h₂ ∈ H⁺;

(ii) each nonnegative constant function belongs to H⁺;

(iii) if h₁, h₂ ∈ H⁺ and h₁ ≥ h₂ then h₁ − h₂ ∈ H⁺;

(iv) if {h_n} is an increasing sequence of functions in H⁺ whose pointwise limit h is bounded then h ∈ H⁺.

Typically H⁺ consists of the nonnegative functions in a vector space of bounded functions that is stable under pairwise maxima and minima.

REMARK. The name λ-cone is not standard. I found it hard to come up with a name that was both suggestive of the defining properties and analogous to the name for the corresponding classes of sets. For a while I used the term Dynkin-cone but abandoned it for historical reasons. (See the Notes.) I also toyed with the name cdl-cone, as a reminder that the cone contains the (positive) constant functions and that it is stable under (proper) differences and (monotone increasing) limits of uniformly bounded sequences.

The sigma-field properties of λ-cones are slightly harder to establish than their λ-system analogs, but the reward of more streamlined proofs will make the extra, one-time effort worthwhile. First we need an analog of the fact that a λ-system that is stable under finite intersections is also a sigma-field.

<44> Lemma. If a λ-cone H⁺ is stable under the formation of pointwise products of pairs of functions then it consists of all bounded, nonnegative, σ(H⁺)-measurable functions, where σ(H⁺) denotes the sigma-field generated by H⁺.

Proof. First note that H⁺ must be stable under uniform limits. For suppose h_n → h uniformly, with h_n ∈ H⁺. Write δ_n for 2⁻ⁿ. With no loss of generality we may suppose h_n + δ_n ≥ h ≥ h_n − δ_n for all n. Notice that

h_n + 3δ_n = h_n + δ_n + δ_{n−1} ≥ h + δ_{n−1} ≥ h_{n−1}.

From the monotone convergence 0 ≤ h_n + 3(δ₁ + ... + δ_n) ↑ h + 3, deduce that h + 3 ∈ H⁺, and hence, via the proper difference property (iii), h ∈ H⁺.

Via uniform limits we can now show that H⁺ is stable under composition with any continuous nonnegative function f. Let h be a member of H⁺, bounded above by a constant D. By a trivial generalization of Problem [25], there exists a sequence of polynomials p_n(·) such that sup_{0≤t≤D} |p_n(t) − f(t)| ≤ 1/n. The function f_n(h) := p_n(h) + 1/n takes only nonnegative values, and it converges uniformly to f(h). Suppose f_n(t) = a₀ + a₁t + ... + a_k tᵏ. Then

f_n(h) = Σ_{i : a_i ≥ 0} a_i hⁱ − Σ_{i : a_i < 0} |a_i| hⁱ ≥ 0.

By virtue of properties (i) and (ii) of λ-cones, and the assumed stability under products, both terms on the right-hand side belong to H⁺. The proper differencing property then gives f_n(h) ∈ H⁺. Pass uniformly to the limit to get f(h) ∈ H⁺.

Write ℰ for the class of all sets of the form {h < C}, with h ∈ H⁺ and C a positive constant. From Example <7>, every h in H⁺ is σ(ℰ)-measurable, and


hence σ(ℰ) = σ(H⁺). For fixed h and C, the continuous function (1 − (h/C)ⁿ)⁺ of h belongs to H⁺, and it increases monotonely to the indicator of {h < C}. Thus the indicators of all sets in ℰ belong to H⁺. The assumptions about H⁺ ensure that the class D of all sets whose indicator functions belong to H⁺ is stable under finite intersections (products), complements (subtract from 1), and increasing countable unions (monotone increasing limits). That is, D is a λ-system, stable under finite intersections, and containing ℰ. It is a sigma-field containing ℰ. Thus D ⊇ σ(ℰ) = σ(H⁺). That is, H⁺ contains all indicators of sets in σ(H⁺).

Finally, let k be a bounded, nonnegative, σ(H⁺)-measurable function. From the fact that the indicator of each set {k ≥ i/2ⁿ}, for i = 1, ..., 4ⁿ, belongs to the cone H⁺, we have k_n := 2⁻ⁿ Σ_{i=1}^{4ⁿ} {k ≥ i/2ⁿ} ∈ H⁺. The functions k_n increase monotonely to k, which consequently also belongs to H⁺.

<45> Theorem. Let H⁺ be a λ-cone of bounded, nonnegative functions, and 𝒢 be a subclass of H⁺ that is stable under the formation of pointwise products of pairs of functions. Then H⁺ contains all bounded, nonnegative, σ(𝒢)-measurable functions.

Proof. Let H₀⁺ be the smallest λ-cone containing 𝒢. From the previous Lemma, it is enough to show that H₀⁺ is stable under pairwise products.

Argue as in Theorem <38> for λ-systems of sets. A routine calculation shows that H₁⁺ := {h ∈ H₀⁺ : hg ∈ H₀⁺ for all g in 𝒢} is a λ-cone containing 𝒢, and hence H₁⁺ = H₀⁺. That is, h₀g ∈ H₀⁺ for all h₀ ∈ H₀⁺ and g ∈ 𝒢. Similarly, the class H₂⁺ := {h ∈ H₀⁺ : h₀h ∈ H₀⁺ for all h₀ in H₀⁺} is a λ-cone. By the result for H₁⁺ we have H₂⁺ ⊇ 𝒢, and hence H₂⁺ = H₀⁺. That is, H₀⁺ is stable under products.

<46> Exercise. Let μ be a finite measure on B(ℝᵏ). Write C₀ for the vector space of all continuous real functions on ℝᵏ with compact support. Suppose f belongs to L¹(μ). Show that for each ε > 0 there exists a g in C₀ such that μ|f − g| < ε. That is, show that C₀ is dense in L¹(μ) under its L¹ norm.

SOLUTION: Define H as the collection of all bounded functions in L¹(μ) that can be approximated arbitrarily closely by functions from C₀. Check that the class H⁺ of nonnegative functions in H is a λ-cone. Trivially it contains C₀⁺, the class of nonnegative members of C₀. The sigma-field σ(C₀⁺) coincides with the Borel sigma-field. (Why?) The class H⁺ therefore consists of all bounded, nonnegative Borel measurable functions.

To approximate a general f in L¹(μ), first reduce to the case of nonnegative functions by splitting into positive and negative parts. Then invoke Dominated Convergence to find a finite n for which μ|f⁺ − f⁺ ∧ n| < ε, then approximate f⁺ ∧ n, a member of H⁺, by a function in C₀. See Problem [26] for the extension of the approximation result to infinite measures.

12. Problems

[1] Suppose events A₁, A₂, ..., in a probability space (Ω, F, P), are independent: meaning that P(A_{i₁}A_{i₂}...A_{i_k}) = PA_{i₁}·PA_{i₂}···PA_{i_k} for all choices of distinct subscripts i₁, i₂, ..., i_k, all k. Suppose Σ_{i=1}^∞ PA_i = ∞.


(i) Using the inequality e⁻ˣ ≥ 1 − x, show that

P max_{n≤i≤m} A_i = 1 − Π_{n≤i≤m} (1 − PA_i) ≥ 1 − exp(−Σ_{n≤i≤m} PA_i).

(ii) Let m then n tend to infinity, to deduce (via Dominated Convergence) that P limsup_i A_i = 1. That is, P{A_i i.o.} = 1.

REMARK. The result gives a converse for the Borel-Cantelli lemma from Example <29>. The next Problem establishes a similar result under weaker assumptions.

[2] Let A₁, A₂, ... be events in a probability space (Ω, F, P). Define X_n = A₁ + ... + A_n and σ_n = PX_n. Suppose σ_n → ∞ and ||X_n/σ_n||₂ → 1. (Compare with the inequality ||X_n/σ_n||₂ ≥ 1, which follows from Jensen's inequality.)

(i) Show that P{|X_n/σ_n − 1| > 1/k} ≤ k²(||X_n/σ_n||₂² − 1) for each positive integer k.

(ii) By an appropriate choice of k (depending on n) in (i), deduce that Σ_i A_i = ∞ almost surely.

(iii) Prove that Σ_{i≥m} A_i ≥ 1 almost surely, for each fixed m. Hint: Show that the two convergence assumptions also hold for the sequence A_m, A_{m+1}, ....

(iv) Deduce that P{ω ∈ A_i i.o.} = 1.

(v) If {B_i} is a sequence of events for which Σ_i PB_i = ∞ and P(B_iB_j) = PB_i·PB_j for i ≠ j, show that P{ω ∈ B_i i.o.} = 1.

[3] Suppose T is a function from a set X into a set Y, and suppose that Y is equipped with a σ-field B. Define A as the sigma-field of sets of the form T⁻¹B, with B in B. Suppose f ∈ M⁺(X, A). Show that there exists a B\B[0, ∞]-measurable function g from Y into [0, ∞] such that f(x) = g(T(x)), for all x in X, by following these steps.

(i) Show that A is a σ-field on X. (It is called the σ-field generated by the map T. It is often denoted by σ(T).)

(ii) Show that {f ≥ i/2ⁿ} = T⁻¹B_{i,n} for some B_{i,n} in B. Define

f_n = 2⁻ⁿ Σ_{i=1}^{4ⁿ} {f ≥ i/2ⁿ} and g_n = 2⁻ⁿ Σ_{i=1}^{4ⁿ} B_{i,n}.

Show that f_n(x) = g_n(T(x)) for all x.

(iii) Define g(y) = limsup_n g_n(y) for each y in Y. Show that g has the desired property. (Question: Why can't we define g(y) = lim_n g_n(y)?)

[4] Let g₁, g₂, ... be A\B(ℝ)-measurable functions from X into ℝ. Show that {limsup_n g_n > t} = ∪_{r∈ℚ, r>t} ∩_{m=1}^∞ ∪_{n≥m} {g_n > r}. Deduce, without any appeal to Example <8>, that limsup_n g_n is A\B(ℝ)-measurable. Warning: Be careful about


strict inequalities that turn into nonstrict inequalities in the limit: it is possible to have x_n > x for all n and still have limsup_n x_n = x.

[5] Suppose a class of sets ℰ cannot separate a particular pair of points x, y: for every E in ℰ, either {x, y} ⊆ E or {x, y} ⊆ Eᶜ. Show that σ(ℰ) also cannot separate the pair.

[6] A collection of sets F₀ that is stable under finite unions, finite intersections, and complements is called a field. A nonnegative set function μ defined on F₀ is called a finitely additive measure if μ(∪_{i≤n} F_i) = Σ_{i≤n} μF_i for every finite collection of disjoint sets in F₀. The set function is said to be countably additive on F₀ if μ(∪_{i∈ℕ} F_i) = Σ_{i∈ℕ} μF_i for every countable collection of disjoint sets in F₀ whose union belongs to F₀. Suppose μX < ∞. Show that μ is countably additive on F₀ if and only if μA_n ↓ 0 for every decreasing sequence in F₀ with empty intersection. Hint: For the argument in one direction, consider the union of differences A_i\A_{i+1}.

[7] Let f₁, ..., f_n be functions in M⁺(X, A), and let μ be a measure on A. Show that μ(∨_i f_i) ≤ Σ_i μf_i ≤ μ(∨_i f_i) + Σ_{i<j} μ(f_i ∧ f_j), where ∨ denotes pointwise maxima of functions and ∧ denotes pointwise minima.

[8] Let μ be a finite measure and f be a measurable function. For each positive integer k, show that μ|f|ᵏ < ∞ if and only if Σ_{n=1}^∞ n^{k−1} μ{|f| ≥ n} < ∞.

[9] Suppose ν := Tμ, the image of the measure μ under the measurable map T. Show that f ∈ L¹(ν) if and only if f ∘ T ∈ L¹(μ), in which case νf = μ(f ∘ T).

[10] Let {h_n}, {f_n}, and {g_n} be sequences of μ-integrable functions that converge μ almost everywhere to limits h, f, and g. Suppose h_n(x) ≤ f_n(x) ≤ g_n(x) for all x. Suppose also that μh_n → μh and μg_n → μg. Adapt the proof of Dominated Convergence to prove that μf_n → μf.

[11] A collection of sets is called a monotone class if it is stable under unions of increasing sequences and intersections of decreasing sequences. Adapt the argument from Theorem <38> to prove: if a class ℰ is stable under finite unions and complements then σ(ℰ) equals the smallest monotone class containing ℰ.

[12] Let μ be a finite measure on the Borel sigma-field B(X) of a metric space X. Call a set B inner regular if μB = sup{μF : B ⊇ F closed} and outer regular if μB = inf{μG : B ⊆ G open}.

(i) Prove that the class B₀ of all Borel sets that are both inner and outer regular is a sigma-field. Deduce that every Borel set is inner regular.

(ii) Suppose μ is tight: for each ε > 0 there exists a compact K_ε such that μK_εᶜ < ε. Show that the F in the definition of inner regularity can then be assumed compact.

(iii) When μ is tight, show that there exists a sequence of disjoint compact subsets {K_i : i ∈ ℕ} of X such that μ(∪_i K_i)ᶜ = 0.

[13] Let μ be a finite measure on the Borel sigma-field of a complete, separable metric space X. Show that μ is tight: for each ε > 0 there exists a compact K_ε such that μK_εᶜ < ε. Hint: For each positive integer n, show that the space X is a countable


union of closed balls with radius 1/n. Find a finite family of such balls whose union B_n has μ measure greater than μX − ε/2ⁿ. Show that ∩_n B_n is compact, using the total-boundedness characterization of compact subsets of complete metric spaces.

[14] A sequence of random variables {X_n} is said to converge in probability to a random variable X, written X_n →_P X, if P{|X_n − X| > ε} → 0 for each ε > 0.

(i) If X_n → X almost surely, show that {|X_n − X| > ε} → 0 almost surely. Deduce via Dominated Convergence that X_n converges in probability to X.

(ii) Give an example of a sequence {Xn} that converges to X in probability but notalmost surely.

(iii) Suppose X_n → X in probability. Show that there is an increasing sequence of positive integers {n(k)} for which Σ_k P{|X_{n(k)} − X| > 1/k} < ∞. Deduce that X_{n(k)} → X almost surely.

[15] Let f and g be measurable functions on (X, A, μ), and r and s be positive real numbers for which r⁻¹ + s⁻¹ = 1. Show that μ|fg| ≤ (μ|f|ʳ)^{1/r} (μ|g|ˢ)^{1/s} by arguing as follows. First dispose of the trivial case where one of the factors on the right-hand side is 0 or ∞. Then, without loss of generality (why?), assume that μ|f|ʳ = 1 = μ|g|ˢ. Use concavity of the logarithm function to show that |fg| ≤ |f|ʳ/r + |g|ˢ/s, and then integrate with respect to μ. This result is called the Hölder inequality.
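A quick numeric sanity check of the inequality, for counting measure on four points (a sketch with made-up data, not a solution to the Problem):

```python
# Holder for counting measure on four points: mu|fg| on the left,
# (mu|f|^r)^(1/r) * (mu|g|^s)^(1/s) on the right, with 1/r + 1/s = 1.
f = [0.3, -1.2, 2.0, 0.7]
g = [1.5, 0.4, -0.9, 2.2]
r, s = 3.0, 1.5                      # 1/3 + 2/3 = 1

lhs = sum(abs(a * b) for a, b in zip(f, g))
rhs = (sum(abs(a)**r for a in f))**(1 / r) * \
      (sum(abs(b)**s for b in g))**(1 / s)
assert lhs <= rhs + 1e-12            # the Holder bound holds
```

Taking r = s = 2 recovers the Cauchy-Schwarz case of the same bound.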

[16] Generalize the Hölder inequality (Problem [15]) to more than two measurable functions: for f₁, ..., f_k and positive real numbers r₁, ..., r_k for which Σ_i r_i⁻¹ = 1, show that μ|f₁ ··· f_k| ≤ Π_i (μ|f_i|^{r_i})^{1/r_i}.

[17] Let (X, A, μ) be a measure space, f and g be measurable functions, and r be a real number with r ≥ 1. Define ||f||_r = (μ|f|ʳ)^{1/r}. Follow these steps to prove Minkowski's inequality: ||f + g||_r ≤ ||f||_r + ||g||_r.

(i) From the inequality |x + y|ʳ ≤ |2x|ʳ + |2y|ʳ deduce that ||f + g||_r < ∞ if ||f||_r < ∞ and ||g||_r < ∞.

(ii) Dispose of trivial cases, such as ||f||_r = 0 or ||f||_r = ∞.

(iii) For arbitrary positive constants c and d argue by convexity that

(|f + g|/(c + d))ʳ ≤ (c/(c + d)) (|f|/c)ʳ + (d/(c + d)) (|g|/d)ʳ.

(iv) Integrate, then choose c = ||f||_r and d = ||g||_r to complete the proof.

[18] For f in L¹(μ) define ||f||₁ = μ|f|. Let {f_n} be a Cauchy sequence in L¹(μ), that is, ||f_n − f_m||₁ → 0 as min(m, n) → ∞. Show that there exists an f in L¹(μ) for which ||f_n − f||₁ → 0, by following these steps.

(i) Find an increasing sequence {n(k)} such that Σ_{k=1}^∞ ||f_{n(k)} − f_{n(k+1)}||₁ < ∞. Deduce that the function H := Σ_{k=1}^∞ |f_{n(k)} − f_{n(k+1)}| is integrable.

(ii) Show that there exists a real-valued, measurable function f for which

H ≥ |f_{n(k)}(x) − f(x)| → 0 as k → ∞, for μ almost all x.


Deduce that ||f_{n(k)} − f||₁ → 0 as k → ∞.

(iii) Show that f belongs to L¹(μ) and ||f_n − f||₁ → 0 as n → ∞.

[19] Let {f_n} be a Cauchy sequence in L^p(X, A, μ), that is, ||f_n − f_m||_p → 0 as min(m, n) → ∞. Show that there exists a function f in L^p(X, A, μ) for which ||f_n − f||_p → 0, by following these steps.

(i) Find an increasing sequence {n(k)} such that C := Σ_{k=1}^∞ ||f_{n(k)} − f_{n(k+1)}||_p < ∞. Define H_∞ = lim_{N→∞} H_N, where H_N = Σ_{k=1}^N |f_{n(k)} − f_{n(k+1)}| for 1 ≤ N < ∞. Use the triangle inequality to show that μH_Nᵖ ≤ Cᵖ for all finite N. Then use Monotone Convergence to deduce that μH_∞ᵖ ≤ Cᵖ.

(ii) Show that there exists a real-valued, measurable function f for which f_{n(k)}(x) → f(x) as k → ∞, a.e. [μ].

(iii) Show that |f_{n(k)} − f| ≤ Σ_{i≥k} |f_{n(i)} − f_{n(i+1)}| ≤ H_∞ a.e. [μ]. Use Dominated Convergence to deduce that ||f_{n(k)} − f||_p → 0 as k → ∞.

(iv) Deduce from (iii) that f belongs to L^p(X, A, μ) and ||f_n − f||_p → 0 as n → ∞.

[20] For each random variable X on a probability space (Ω, F, P) define

||X||_∞ := inf{c ∈ [0, ∞] : |X| ≤ c almost surely}.

Let L^∞ := L^∞(Ω, F, P) denote the set of equivalence classes of real-valued random variables with ||X||_∞ < ∞. Show that ||·||_∞ is a norm on L^∞, which is a vector space, complete under the metric defined by ||·||_∞.

[21] Let $\{X_t : t \in T\}$ be a collection of real-valued random variables with possibly uncountable index set $T$. Complete the following argument to show that there exists a countable subset $T_0$ of $T$ such that the random variable $X := \sup_{t \in T_0} X_t$ has the properties

(a) $X \ge X_t$ almost surely, for each $t \in T$;

(b) if $Y \ge X_t$ almost surely, for each $t \in T$, then $Y \ge X$ almost surely.

(The random variable $X$ is called the essential supremum of the family. It is denoted by $\operatorname{ess\,sup}_{t \in T} X_t$. Part (b) shows that it is unique up to an almost sure equivalence.)

(i) Show that properties (a) and (b) are unaffected by a monotone, one-to-one transformation such as $x \mapsto x/(1 + |x|)$. Deduce that there is no loss of generality in assuming $|X_t| \le 1$ for all $t$.

(ii) Let $\delta := \sup\{\mathbb{P}\sup_{t \in S} X_t : \text{countable } S \subseteq T\}$. Choose countable $T_n$ such that $\mathbb{P}\sup_{t \in T_n} X_t > \delta - 1/n$. Let $T_0 := \cup_n T_n$. Show that $\mathbb{P}\sup_{t \in T_0} X_t = \delta$.

(iii) Suppose $t \notin T_0$. From the inequality $\delta \ge \mathbb{P}(X_t \vee X) \ge \mathbb{P}X = \delta$ deduce that $X \ge X_t$ almost surely.

(iv) For a $Y$ as in assertion (b), show that $Y \ge \sup_{t \in T_0} X_t = X$ almost surely.

[22] Let $\Psi$ be a convex, increasing function for which $\Psi(0) = 0$ and $\Psi(x) \to \infty$ as $x \to \infty$. (For example, $\Psi(x)$ could equal $x^p$ for some fixed $p \ge 1$, or $\exp(x) - 1$, or $\exp(x^2) - 1$.) Define $\mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ to be the set of all real-valued measurable functions on $X$ for which $\mu\Psi(|f|/c_0) < \infty$ for some positive real $c_0$. Define


$\|f\|_\Psi := \inf\{c > 0 : \mu\Psi(|f|/c) \le 1\}$, with the convention that the infimum of an empty set equals $+\infty$. For each $f$, $g$ in $\mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and each real $t$ prove the following assertions.

(i) $\|f\|_\Psi < \infty$. Hint: Apply Dominated Convergence to $\mu\Psi(|f|/c)$.

(ii) $f + g \in \mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and the triangle inequality holds: $\|f+g\|_\Psi \le \|f\|_\Psi + \|g\|_\Psi$. Hint: If $c > \|f\|_\Psi$ and $d > \|g\|_\Psi$, deduce that

$$\Psi\left(\frac{|f+g|}{c+d}\right) \le \frac{c}{c+d}\,\Psi\left(\frac{|f|}{c}\right) + \frac{d}{c+d}\,\Psi\left(\frac{|g|}{d}\right)$$

by convexity of $\Psi$.

(iii) $tf \in \mathcal{L}^\Psi(X, \mathcal{A}, \mu)$ and $\|tf\|_\Psi = |t|\,\|f\|_\Psi$.

REMARK. $\|\cdot\|_\Psi$ is called an Orlicz "norm"—to make it a true norm one should work with equivalence classes of functions equal $\mu$ almost everywhere. The $L^p$ norms correspond to the special case $\Psi(x) = x^p$, for some $p \ge 1$.
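As a numerical illustration (not from the text), the infimum defining $\|f\|_\Psi$ can be computed by bisection, since $c \mapsto \mu\Psi(|f|/c)$ decreases in $c$. The sketch below takes $\Psi(x) = x^2$ and counting measure on two points, where the Orlicz norm reduces to the $\mathcal{L}^2$ norm; all names and tolerances are hypothetical choices.

```python
# Bisection for the Orlicz norm ||f||_Psi = inf{c > 0 : mu Psi(|f|/c) <= 1}.
# With Psi(x) = x^2 and counting measure on two points this is the L2 norm.
def orlicz_norm(f_vals, psi, lo=1e-9, hi=1e9, iters=200):
    for _ in range(iters):
        c = 0.5 * (lo + hi)
        if sum(psi(abs(v) / c) for v in f_vals) <= 1:
            hi = c        # c is large enough; shrink from above
        else:
            lo = c
    return hi

norm = orlicz_norm([3.0, 4.0], lambda x: x * x)
print(norm)               # close to 5.0, the L2 norm of (3, 4)
assert abs(norm - 5.0) < 1e-6
```

The same routine works for the other examples of $\Psi$ mentioned above, since only convexity and monotonicity are used.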

[23] Define $\|f\|_\Psi$ and $\mathcal{L}^\Psi$ as in Problem [22]. Let $\{f_n\}$ be a Cauchy sequence in $\mathcal{L}^\Psi(\mu)$, that is, $\|f_n - f_m\|_\Psi \to 0$ as $\min(m,n) \to \infty$. Show that there exists an $f$ in $\mathcal{L}^\Psi(\mu)$ for which $\|f_n - f\|_\Psi \to 0$, by following these steps.

(i) Let $\{g_i\}$ be a nonnegative sequence in $\mathcal{L}^\Psi(\mu)$ for which $C := \sum_i \|g_i\|_\Psi < \infty$. Show that the function $G := \sum_i g_i$ is finite almost everywhere and $\|G\|_\Psi \le \sum_i \|g_i\|_\Psi < \infty$. Hint: Use Problem [22] to show that $\mu\Psi\left(\sum_{i \le n} g_i/C\right) \le 1$ for each $n$, then justify a passage to the limit.

(ii) Find an increasing sequence $\{n(k)\}$ such that $\sum_{k=1}^{\infty} \|f_{n(k)} - f_{n(k+1)}\|_\Psi < \infty$. Deduce that the functions $H_i := \sum_{k \ge i} |f_{n(k)} - f_{n(k+1)}|$ satisfy

$$\infty > \|H_1\|_\Psi \ge \|H_2\|_\Psi \ge \ldots \to 0.$$

(iii) Show that there exists a real-valued, measurable function $f$ for which $|f_{n(k)}(x) - f(x)| \to 0$ as $k \to \infty$, for $\mu$ almost all $x$.

(iv) Given $\epsilon > 0$, choose $L$ so that $\|H_L\|_\Psi < \epsilon$. For $i \ge L$, show that

$$\Psi(H_L/\epsilon) \ge \Psi(|f_{n(L)} - f_{n(i)}|/\epsilon) \to \Psi(|f_{n(L)} - f|/\epsilon).$$

Deduce that $\|f_{n(L)} - f\|_\Psi \le \epsilon$.

(v) Show that $f$ belongs to $\mathcal{L}^\Psi(\mu)$ and $\|f_n - f\|_\Psi \to 0$ as $n \to \infty$.

[24] Let $\Psi$ be a convex increasing function with $\Psi(0) = 0$, as in Problem [22]. Let $\Psi^{-1}$ denote its inverse function. If $X_1, \ldots, X_N \in \mathcal{L}^\Psi(X, \mathcal{A}, \mathbb{P})$, for a probability measure $\mathbb{P}$, show that

$$\mathbb{P}\max_{i \le N} |X_i| \le \Psi^{-1}(N)\,\max_{i \le N} \|X_i\|_\Psi.$$

Hint: Consider $\Psi\left(\mathbb{P}\max_i |X_i|/C\right)$ with $C > \max_{i \le N} \|X_i\|_\Psi$.

REMARK. Compare with van der Vaart & Wellner (1996, page 96): if also $\limsup_{x,y\to\infty} \Psi(x)\Psi(y)/\Psi(cxy) < \infty$ for some constant $c > 0$ then $\|\max_{i \le N} |X_i|\|_\Psi \le K\,\Psi^{-1}(N)\max_{i \le N} \|X_i\|_\Psi$ for a constant $K$ depending only on $\Psi$. See page 105 of their Problems and Complements for related counterexamples.
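A numerical sanity check of this kind of maximal bound, in the special case $\Psi(x) = x^2$ (where $\|\cdot\|_\Psi$ is the $\mathcal{L}^2$ norm and $\Psi^{-1}(N) = \sqrt{N}$), on a hypothetical four-point probability space with uniform weights:

```python
# Check of P max_i |X_i| <= Psi^{-1}(N) max_i ||X_i||_Psi for Psi(x) = x^2,
# on a hypothetical 4-point probability space with uniform weights.
from math import sqrt

probs = [0.25, 0.25, 0.25, 0.25]
X = [[1.0, -2.0, 0.5, 0.0],        # three random variables, listed by outcome
     [0.0, 1.5, -1.0, 2.0],
     [2.0, 0.0, 0.0, -1.0]]

def l2_norm(x):                    # ||X||_Psi reduces to the L2 norm here
    return sqrt(sum(p * v * v for p, v in zip(probs, x)))

exp_max = sum(p * max(abs(row[i]) for row in X) for i, p in enumerate(probs))
bound = sqrt(len(X)) * max(l2_norm(x) for x in X)

print(exp_max, bound)
assert exp_max <= bound
```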


[25] For each $\theta$ in $[0,1]$ let $X_{n,\theta}$ be a random variable with a Binomial$(n, \theta)$ distribution. That is, $\mathbb{P}\{X_{n,\theta} = k\} = \binom{n}{k}\theta^k(1-\theta)^{n-k}$ for $k = 0, 1, \ldots, n$. You may assume these elementary facts: $\mathbb{P}X_{n,\theta} = n\theta$ and $\mathbb{P}(X_{n,\theta} - n\theta)^2 = n\theta(1-\theta)$. Let $f$ be a continuous function defined on $[0,1]$.

(i) Show that $p_n(\theta) := \mathbb{P}f(X_{n,\theta}/n)$ is a polynomial in $\theta$.

(ii) Suppose $|f| \le M$, for a constant $M$. For a fixed $\epsilon > 0$, invoke (uniform) continuity to find a $\delta > 0$ such that $|f(s) - f(t)| \le \epsilon$ whenever $|s - t| \le \delta$, for all $s, t$ in $[0,1]$. Show that

$$|f(x/n) - f(\theta)| \le \epsilon + 2M\{|(x/n) - \theta| > \delta\}.$$

(iii) Deduce that $\sup_{0\le\theta\le 1} |p_n(\theta) - f(\theta)| \le 2\epsilon$ for $n$ large enough. That is, deduce that $f(\cdot)$ can be uniformly approximated by polynomials over the range $[0,1]$, a result known as the Weierstrass approximation theorem.
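The construction in Problem [25] can be sketched numerically: $p_n$ is the Bernstein polynomial of $f$, and its uniform error shrinks as $n$ grows. The test function and grid below are hypothetical choices for illustration.

```python
# Bernstein-polynomial sketch for Problem [25]: p_n(theta) = E f(X/n),
# where X ~ Binomial(n, theta). The test function f is a hypothetical choice.
from math import comb

def bernstein(f, n, theta):
    return sum(f(k / n) * comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)          # continuous but not smooth

def sup_error(n, grid=101):
    return max(abs(bernstein(f, n, i / (grid - 1)) - f(i / (grid - 1)))
               for i in range(grid))

errs = [sup_error(n) for n in (10, 50, 200)]
print(errs)                          # uniform error decreases with n
assert errs[0] > errs[2] and errs[2] < 0.05
```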

[26] Extend the approximation result from Example <46> to the case of an infinite measure $\mu$ on $\mathcal{B}(\mathbb{R}^k)$ that gives finite measure to each compact set. Hint: Let $B$ be a closed ball of radius large enough to ensure $\mu|f|B^c < \epsilon$. Write $\mu_B$ for the restriction of $\mu$ to $B$. Invoke the result from the Example to find a $g$ in $C_0$ such that $\mu_B|f - g| < \epsilon$. Find $C_0$ functions $1 \ge h_i \downarrow B$. Consider approximations $gh_i$ for $i$ large enough.

13. Notes

I recommend Royden (1968) as a good source for measure theory. The books of Ash (1972) and Dudley (1989) are also excellent references, for both measure theory and probability. Dudley's book contains particularly interesting historical notes.

See Hawkins (1979, Chapter 4) to appreciate the subtlety of the idea of a negligible set.

The result from Problem [10] is often attributed to Pratt (1960), but, as he noted in his 1966 Acknowledgment of Priority, it is actually much older.

Theorem <38> (the $\pi$-$\lambda$ theorem for generating classes of sets) is often attributed to Dynkin (1960, Section 1.1), although Sierpiński (1928) had earlier proved a slightly stronger result (covering generation of sigma-rings, not just sigma-fields). I adapted the analogous result for classes of functions, Theorem <45>, from Protter (1990, page 7) and Dellacherie & Meyer (1978, page 14). Compare with the "Sierpiński Stability Lemma" for sets, and the "Functional Sierpiński Lemma" presented by Hoffmann-Jørgensen (1994, pages 8, 54, 60).

REFERENCES

Ash, R. B. (1972), Real Analysis and Probability, Academic Press, New York.
Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Dynkin, E. B. (1960), Theory of Markov Processes, Pergamon.
Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.
Hoffmann-Jørgensen, J. (1994), Probability with a View toward Statistics, Vol. I, Chapman and Hall, New York.
Oxtoby, J. (1971), Measure and Category, Springer-Verlag.
Pratt, J. W. (1960), 'On interchanging limits and integrals', Annals of Mathematical Statistics 31, 74–77. Acknowledgement of priority, same journal, vol. 37 (1966), page 1407.
Protter, P. (1990), Stochastic Integration and Differential Equations, Springer, New York.
Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.
Sierpiński, W. (1928), 'Un théorème général sur les familles d'ensembles', Fundamenta Mathematicae 12, 206–210.
van der Vaart, A. W. & Wellner, J. A. (1996), Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag.


Chapter 3

Densities and derivatives

SECTION 1 explains why the traditional split of introductory probability courses into two segments—the study of discrete distributions, and the study of "continuous" distributions—is unnecessary in a measure theoretic treatment. Absolute continuity of one measure with respect to another measure is defined. A simple case of the Radon-Nikodym theorem is proved.

SECTION *2 establishes the Lebesgue decomposition of a measure into parts absolutely continuous and singular with respect to another measure, a result that includes the Radon-Nikodym theorem as a particular case.

SECTION 3 shows how densities enter into the definitions of various distances between measures.

SECTION 4 explains the connection between the classical concept of absolute continuity and its measure theoretic generalization. Part of the Fundamental Theorem of Calculus is deduced from the Radon-Nikodym theorem.

SECTION *5 establishes the Vitali covering lemma, the key to the identification of derivatives as densities.

SECTION *6 presents the proof of the other part of the Fundamental Theorem of Calculus, showing that absolutely continuous functions (on the real line) are Lebesgue integrals of their derivatives, which exist almost everywhere.

1. Densities and absolute continuity

Nonnegative measurable functions create new measures from old.

Let $(X, \mathcal{A}, \mu)$ be a measure space, and let $\Delta(\cdot)$ be a function in $\mathcal{M}^+(X, \mathcal{A})$. The increasing, linear functional defined on $\mathcal{M}^+(X, \mathcal{A})$ by $\nu f := \mu(f\Delta)$ inherits from $\mu$ the Monotone Convergence property, which identifies it as an integral with respect to a measure on $\mathcal{A}$.

The measure $\mu$ is said to dominate $\nu$; the measure $\nu$ is said to have density $\Delta$ with respect to $\mu$. This relationship is often indicated symbolically as $\Delta = d\nu/d\mu$, which fits well with the traditional notation,

$$\int f(x)\,d\nu(x) = \int f(x)\,\frac{d\nu}{d\mu}(x)\,d\mu(x).$$

The $d\mu$ symbols "cancel out," as in the change of variable formula for Lebesgue integrals.


REMARK. The density $d\nu/d\mu$ is often called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, a reference to the result described in Theorem <4> below. The word derivative suggests a limit of a ratio of $\nu$ and $\mu$ measures of "small" sets. For $\mu$ equal to Lebesgue measure on a Euclidean space, $d\nu/d\mu$ can indeed be recovered as such a limit. Section 4 explains the one-dimensional case. Chapter 6 will give another interpretation via martingales.

For example, if $\mu$ is Lebesgue measure on $\mathcal{B}(\mathbb{R})$, the probability measure defined by the density $\Delta(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ with respect to $\mu$ is called the standard normal distribution, usually denoted by $N(0,1)$. If $\mu$ is counting measure on $\mathbb{N}_0$ (that is, mass 1 at each nonnegative integer), the probability measure defined by the density $\Delta(x) = e^{-\theta}\theta^x/x!$ is called the Poisson($\theta$) distribution, for each positive constant $\theta$. If $\mu$ is Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$, the probability measure defined by the density $\Delta(x,y) = (2\pi)^{-1}\exp\left(-(x^2+y^2)/2\right)$ with respect to $\mu$ is called the standard bivariate normal distribution. The qualifier joint sometimes creeps into the description of densities with respect to Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$ or $\mathcal{B}(\mathbb{R}^k)$. From a measure theoretic point of view the qualifier is superfluous, but it is a comforting probabilistic tradition perhaps worthy of preservation.

Under mild assumptions, which rule out the sort of pathological behavior involving sets with infinite measure described by Problems [1] and [2], it is not hard to show (Problem [3]) that the density is unique up to a $\mu$-equivalence. The simplest way to avoid the pathologies is an assumption that the dominating measure $\mu$ is sigma-finite, meaning that there exists a partition of $X$ into countably many disjoint measurable sets $X_1, X_2, \ldots$ with $\mu X_i < \infty$ for each $i$.

Existence of a density is a property that depends on two measures. Even measures that don't fit the traditional idea of a continuous distribution can be specified by densities, as in the case of measures dominated by a counting measure. Some introductory texts use the technically correct term density in that case, much to the confusion of students who have come to think that densities have something to do with continuous functions. More generally, every measure could be thought of as having a density, because $d\mu/d\mu = 1$, which is a perfectly useless fact. Densities are useful because they allow integrals with respect to one measure to be reexpressed as integrals with respect to a different measure.

The distribution function of a probability measure dominated by Lebesgue measure on $\mathcal{B}(\mathbb{R})$ is continuous, as a map from $\mathbb{R}$ into $\mathbb{R}$; but not every probability measure with a continuous distribution function has a density with respect to Lebesgue measure.

<1> Example. Let $m$ denote Lebesgue measure on $[0,1)$. Each point $x$ in $[0,1)$ has a binary expansion $x = \sum_{n=1}^{\infty} x_n 2^{-n}$ with each $x_n$ either 0 or 1. To ensure uniqueness of the $\{x_n\}$, choose the expansion that ends in a string of zeros when $x$ is a dyadic rational. The set $\{x_n = 1\}$ is then a finite union of intervals of the form $[a, b)$, with both endpoints dyadic rationals. The map $T(x) := \sum_{n=1}^{\infty} 2x_n 3^{-n}$ from $[0,1)$ back into itself is measurable. The image measure $\nu := T(m)$ concentrates on the


compact subset $C = \cap_n C_n$ of $[0,1]$ obtained by successive removal of "middle thirds" of subintervals:

$$C_1 = [0, 1/3] \cup [2/3, 1], \qquad C_2 = [0, 1/9] \cup [2/9, 3/9] \cup [6/9, 7/9] \cup [8/9, 1],$$

and so on. The set $C$, which is called the Cantor set, has Lebesgue measure less than $mC_n = (2/3)^n$ for every $n$. That is, $mC = 0$.

The distribution function $F(x) := \nu[0, x]$, for $0 \le x \le 1$, has the strange property that it is constant on each of the open intervals that make up each $C_n^c$, because $\nu$ puts zero mass in those intervals. Thus $F$ has a zero derivative at each point of $C^c = \cup_n C_n^c$, a set with Lebesgue measure one. □
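The map $T$ can be sketched numerically with finitely many binary digits; the images avoid the removed middle thirds, a rough check that $\nu$ concentrates on the Cantor set. The digit cutoff is an assumption for illustration only.

```python
# Finite-precision sketch of the map T from the example: the binary digits
# of x are doubled and reinterpreted base 3. `digits` is a hypothetical cutoff.
def T(x, digits=30):
    y = 0.0
    for n in range(1, digits + 1):
        x *= 2
        bit = int(x)        # n-th binary digit of the original x
        x -= bit
        y += 2 * bit * 3.0 ** (-n)
    return y

vals = [T(k / 16) for k in range(16)]
# Images stay in [0,1] and avoid the first removed middle third (1/3, 2/3):
assert all(0.0 <= v <= 1.0 for v in vals)
assert all(not (1 / 3 < v < 2 / 3) for v in vals)
```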

The distinction between a continuous function and a function expressible as an integral was recognized early in the history of measure theory, with the name absolute continuity being used to denote the stronger property. The original definition of absolute continuity (Section 4) is now a special case of a streamlined characterization that applies not just to measures on the real line.

REMARK. Probability measures dominated by Lebesgue measure correspond to the continuous distributions of introductory courses, although the correct term is distributions absolutely continuous with respect to Lebesgue measure. By extension, random variables whose distributions are dominated by Lebesgue measure are sometimes called "continuous random variables," which I regard as a harmful abuse of terminology. There need be nothing continuous about a "continuous random variable" as a function from a set $\Omega$ into $\mathbb{R}$. Indeed, there need be no topology on $\Omega$; the very concept of continuity for the function might be void. Many a student of probability has been misled into assuming topological properties for "continuous random variables." I try to avoid using the term.

<2> Definition. A measure $\nu$ is said to be absolutely continuous with respect to a measure $\mu$, written $\nu \ll \mu$, if every $\mu$-negligible set is also $\nu$-negligible.

Clearly, a measure $\nu$ given by a density with respect to a dominating measure $\mu$ is also absolutely continuous with respect to $\mu$.

<3> Example. If the measure $\nu$ is finite, there is an equivalent formulation of absolute continuity that looks more like a continuity property.

Let $\nu$ and $\mu$ be measures on $(X, \mathcal{A})$, with $\nu X < \infty$. Suppose that to each $\epsilon > 0$ there exists a $\delta > 0$ such that $\nu A \le \epsilon$ for each $A \in \mathcal{A}$ with $\mu A \le \delta$. Then clearly $\nu$ must be absolutely continuous with respect to $\mu$: for each measurable set $A$ with $\mu A = 0$ we must have $\nu A \le \epsilon$ for every positive $\epsilon$.

Conversely, if $\nu$ fails to have the $\epsilon$-$\delta$ property then there exists some $\epsilon > 0$ and a sequence of sets $\{A_n\}$ with $\mu A_n \to 0$ but $\nu A_n \ge \epsilon$ infinitely often. With no loss of generality (or by working with a subsequence) we may assume that $\mu A_n \le 2^{-n}$ and $\nu A_n \ge \epsilon$ for all $n$. Define $A := \{A_n \text{ i.o.}\} = \limsup_n A_n$. Finiteness of $\sum_n \mu A_n$ implies $\sum_n A_n < \infty$ a.e. $[\mu]$, and hence $\mu A = 0$; but Dominated Convergence, using the assumption $\nu X < \infty$, gives $\nu A = \lim_n \nu\left(\sup_{i \ge n} A_i\right) \ge \epsilon$. Thus $\nu$ is not absolutely continuous with respect to $\mu$.

In other words, the $\epsilon$-$\delta$ property is equivalent to absolute continuity, at least when $\nu$ is a finite measure. The equivalence can fail if $\nu$ is not a finite measure.


For example, if $\mu$ denotes Lebesgue measure on $\mathcal{B}(\mathbb{R})$ and $\nu$ is the measure defined by the density $|x|$, then the interval $(n, n + n^{-1})$ has $\mu$ measure $n^{-1}$ but $\nu$ measure greater than 1. □

REMARK. For finite measures it might seem that absolute continuity should also have an equivalent formulation in terms of functionals on $\mathcal{M}^+$. However, even if $\nu \ll \mu$, it need not be true that to each $\epsilon > 0$ there exists a $\delta > 0$ such that $f \in \mathcal{M}^+$ and $\mu f \le \delta$ imply $\nu f \le \epsilon$. For example, let $\mu$ be Lebesgue measure on $\mathcal{B}(0,1)$ and $\nu$ be the finite measure with density $\Delta(x) := x^{-1/2}$ with respect to $\mu$. The functions $f_n(x) := \Delta(x)\{n^{-2} \le x \le n^{-1}\}$ have the property that $\mu f_n \le \int_0^{1/n} x^{-1/2}\,dx \to 0$ as $n \to \infty$, but $\nu f_n = \int_{n^{-2}}^{n^{-1}} x^{-1}\,dx = \log n \to \infty$, even though $\nu \ll \mu$.

Existence of a density and absolute continuity are equivalent properties if we exclude some pathological examples, such as those presented in Problems [1] and [2].

<4> Radon-Nikodym Theorem. Let $\mu$ be a sigma-finite measure on a space $(X, \mathcal{A})$. Then every sigma-finite measure that is absolutely continuous with respect to $\mu$ has a density, which is unique up to $\mu$-equivalence.

The Theorem is a special case of the slightly more general result known as the Lebesgue decomposition, which is proved in Section 2 using projections in Hilbert spaces. Most of the ideas needed to prove the general version of the Theorem appear in simpler form in the following proof of a special case.

<5> Lemma. Suppose $\nu$ and $\mu$ are finite measures on $(X, \mathcal{A})$, with $\nu \le \mu$, that is, $\nu f \le \mu f$ for all $f$ in $\mathcal{M}^+(X, \mathcal{A})$. Then $\nu$ has a density $\Delta$ with respect to $\mu$ for which $0 \le \Delta \le 1$ everywhere.

Proof. The linear subspace $\mathcal{H}_0 := \{f \in \mathcal{L}^2(\mu) : \nu f = 0\}$ of $\mathcal{L}^2(\mu)$ is closed for convergence in the $\mathcal{L}^2(\mu)$ norm: if $\mu|f_n - f|^2 \to 0$ then, by the Cauchy-Schwarz inequality,

$$|\nu f_n - \nu f|^2 \le \left(\nu 1^2\right)\left(\nu|f_n - f|^2\right) \le (\nu X)\,\mu|f_n - f|^2 \to 0.$$

Except when $\nu$ is the zero measure (in which case the result is trivial), the constant function 1 does not belong to $\mathcal{H}_0$. From Section 2.7, there exist functions $g_0 \in \mathcal{H}_0$ and $g_1$ orthogonal to $\mathcal{H}_0$ for which $1 = g_0 + g_1$. Notice that $\nu g_1 = \nu 1 \ne 0$. The desired density will be a constant multiple of $g_1$.

Consider any $f$ in $\mathcal{L}^2(\mu)$. With $C := \nu f/\nu g_1$ the function $f - Cg_1$ belongs to $\mathcal{H}_0$, because $\nu(f - Cg_1) = \nu f - C\nu g_1 = 0$. Orthogonality of $f - Cg_1$ and $g_1$ gives

$$0 = \langle f - Cg_1, g_1 \rangle = \mu(fg_1) - \frac{\nu f}{\nu g_1}\,\mu(g_1^2),$$

which rearranges to $\nu f = \mu(f\Delta)$ where $\Delta := (\nu g_1/\mu g_1^2)\,g_1$.

The inequality $0 \le \nu\{\Delta < 0\} = \mu\Delta\{\Delta < 0\}$ ensures that $\Delta \ge 0$ a.e. $[\mu]$; and the inequalities $\nu\{\Delta > 1\} = \mu\Delta\{\Delta > 1\} \ge \mu\{\Delta > 1\} \ge \nu\{\Delta > 1\}$ force $\Delta \le 1$ a.e. $[\mu]$, for otherwise the middle inequality would be strict. Replacement of $\Delta$ by $\Delta\{0 \le \Delta \le 1\}$ therefore has no effect on the representation $\nu f = \mu(f\Delta)$ for $f \in \mathcal{L}^2(\mu)$. To extend the equality to $f$ in $\mathcal{M}^+$, first invoke it for the $\mathcal{L}^2(\mu)$ function $n \wedge f$, then invoke Monotone Convergence twice as $n$ tends to infinity. □


Not all measures on $(X, \mathcal{A})$ need be dominated by a given $\mu$. The extreme example is a measure $\nu$ that is singular with respect to $\mu$, meaning that there exists a measurable subset $S$ for which $\mu S^c = 0 = \nu S$. That is, the two measures concentrate on disjoint parts of $X$, a situation denoted by writing $\nu \perp \mu$. Perhaps it would be better to say that the two measures are mutually singular, to emphasize the symmetry of the relationship. For example, discrete measures—those that concentrate on countable subsets—are singular with respect to Lebesgue measure on the real line. There also exist singular measures (with respect to Lebesgue measure) that give zero mass to each countable set, such as the probability measure $\nu$ from Example <1>.

Avoidance of all probability measures except those dominated by a counting measure or a Lebesgue measure—as in introductory probability courses—imposes awkward constraints on what one can achieve with probability theory. The restriction becomes particularly tedious for functions of more than a single random variable, for then one is limited to smooth transformations for which image measures and densities (with respect to Lebesgue measure) can be calculated by means of Jacobians. The unfortunate effects of an artificially restricted theory permeate much of the statistical literature, where sometimes inappropriate and unnecessary requirements are imposed merely to accommodate a lack of an appropriate measure theoretic foundation.

REMARK. Why should absolute continuity with respect to Lebesgue measure or counting measure play such a central role in introductory probability theory? I believe the answer is just a matter of definition, or rather, a lack of definition. For a probability measure $P$ concentrated on a countable set of points, expectations $Pg(X)$ become countable sums, which can be handled by elementary methods. For general probability measures the definition of $\mathbb{P}g(X)$ is typically not a matter of elementary calculation. However, if $X$ has a distribution $P$ with density $\Delta$ with respect to Lebesgue measure, then $\mathbb{P}g(X) = Pg = \int \Delta(x)g(x)\,dx$. The last integral has the familiar look of a Riemann integral, which is the subject of elementary Calculus courses. Seldom would $\Delta$ or $g$ be complicated enough to require the interpretation as a Lebesgue integral—one stays away from such functions when teaching an introductory course.

From the measure theoretic viewpoint, densities are not just a crutch for support of an inadequate integration theory; they become a useful tool for exploiting absolute continuity for pairs of measures.

In much statistical theory, the actual choice of dominating measure matters little. The following result, which is often called Scheffé's lemma, is typical.

<6> Exercise. Suppose $\{P_n\}$ is a sequence of probability measures with densities $\{\Delta_n\}$ with respect to a measure $\mu$. Suppose $\Delta_n$ converges almost everywhere $[\mu]$ to the density $\Delta := dP/d\mu$ of a probability measure $P$. Show that $\mu|\Delta_n - \Delta| \to 0$.

SOLUTION: Write $\mu|\Delta - \Delta_n|$ as

$$\mu(\Delta - \Delta_n)^+ + \mu(\Delta - \Delta_n)^- = 2\mu(\Delta - \Delta_n)^+ - \mu(\Delta - \Delta_n).$$

On the right-hand side, the second term equals zero, because both densities integrate to 1, and the first term tends to zero by Dominated Convergence, because $\Delta \ge (\Delta - \Delta_n)^+ \to 0$ a.e. $[\mu]$. □
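A numerical sketch of Scheffé's lemma (hypothetical densities, not from the text): on $(0,1)$ with Lebesgue measure, $\Delta_n(x) = (1 + 1/n)x^{1/n}$ is a probability density converging pointwise to the uniform density $\Delta \equiv 1$, and a Riemann-sum approximation of $\mu|\Delta_n - \Delta|$ shrinks accordingly.

```python
# Sketch of Scheffe's lemma on (0,1): Delta_n(x) = (1 + 1/n) x^(1/n) integrates
# to 1 and converges pointwise to 1; the L1 distance tends to 0.
def l1_distance(n, grid=100_000):
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) * h                              # midpoint rule
        total += abs((1 + 1.0 / n) * x ** (1.0 / n) - 1.0) * h
    return total

dists = [l1_distance(n) for n in (1, 10, 100)]
print(dists)
assert dists[0] > dists[1] > dists[2] and dists[2] < 0.02
```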


Convergence in $\mathcal{L}^1(\mu)$ of densities is equivalent to convergence of the probability measures in total variation, a topic discussed further in Section 3.

*2. The Lebesgue decomposition

Absolute continuity and singularity represent the two extremes for the relationship between two measures on the same space.

<7> Lebesgue decomposition. Let $\mu$ be a sigma-finite measure on a space $(X, \mathcal{A})$. To each sigma-finite measure $\nu$ on $\mathcal{A}$ there exists a $\mu$-negligible set $\mathcal{N}$ and a real-valued $\Delta$ in $\mathcal{M}^+$ such that

<8>  $\nu f = \nu(f\mathcal{N}) + \mu(f\Delta)$  for all $f \in \mathcal{M}^+(X, \mathcal{A})$.

The set $\mathcal{N}$ and the function $\Delta\mathcal{N}^c$ are unique up to a $\nu + \mu$ almost sure equivalence.

REMARK. Of course the value of $\Delta$ on $\mathcal{N}$ has no effect on $\mu(f\Delta)$ or $\nu(f\mathcal{N})$. Some authors adopt the convention that $\Delta = \infty$ on the set $\mathcal{N}$. The convention has no effect on the equality <8>, but it does make $\Delta$ unique $\nu + \mu$ almost surely.

The restriction $\nu_\perp$ of $\nu$ to $\mathcal{N}$ is called the part of $\nu$ that is singular with respect to $\mu$. The restriction $\nu_{\text{abs}}$ of $\nu$ to $\mathcal{N}^c$ is called the part of $\nu$ that is absolutely continuous with respect to $\mu$. Problem [11] shows that the decomposition $\nu = \nu_\perp + \nu_{\text{abs}}$, into singular and dominated components, is unique.

There is a corresponding decomposition $\mu = \mu_\perp + \mu_{\text{abs}}$ into parts singular and absolutely continuous with respect to $\nu$. Together the two decompositions partition the underlying space into four measurable sets: a set that is negligible for both measures; a set where they are mutually absolutely continuous; and two sets where the singular components $\mu_\perp$ and $\nu_\perp$ concentrate.

[Diagram: the space splits into four regions according to the value of $\Delta$: a region where $\Delta = 0$ and $\nu = 0$, on which $\mu_\perp$ concentrates; a region where $0 < \Delta < \infty$, on which the measures are mutually absolutely continuous, with $d\nu_{\text{abs}} = \Delta\,d\mu_{\text{abs}}$ and $d\mu_{\text{abs}} = (1/\Delta)\,d\nu_{\text{abs}}$; a region where $\mu = \nu = 0$, in which neither measure puts mass; and a region where $1/\Delta = 0$, on which $\nu_\perp$ concentrates.]

Proof. Consider first the question of existence. With no loss of generality we may assume that both $\nu$ and $\mu$ are finite measures. The general decomposition would follow by piecing together the results for countably many disjoint subsets of $X$.

Define $\lambda := \nu + \mu$. Note that $\nu \le \lambda$. From Lemma <5>, there is an $\mathcal{A}$-measurable function $\Delta_0$, taking values in $[0,1]$, for which $\nu f = \lambda(f\Delta_0)$ for all $f$ in $\mathcal{M}^+$. Define $\mathcal{N} := \{\Delta_0 = 1\}$, a set that has zero $\mu$ measure because

$$\nu\{\Delta_0 = 1\} = \nu\Delta_0\{\Delta_0 = 1\} + \mu\Delta_0\{\Delta_0 = 1\} = \nu\{\Delta_0 = 1\} + \mu\{\Delta_0 = 1\}.$$


Define

$$\Delta := \frac{\Delta_0}{1 - \Delta_0}\{\Delta_0 < 1\}.$$

We have to show that $\nu(f\mathcal{N}^c) = \mu(f\Delta)$ for all $f$ in $\mathcal{M}^+$. For such an $f$, and each positive integer $n$, the defining property of $\Delta_0$ gives

$$\nu(f \wedge n)\{\Delta_0 < 1\} = \nu(f \wedge n)\Delta_0\{\Delta_0 < 1\} + \mu(f \wedge n)\Delta_0\{\Delta_0 < 1\},$$

which rearranges (no problems with $\infty - \infty$) to

$$\nu(f \wedge n)(1 - \Delta_0)\{\Delta_0 < 1\} = \mu(f \wedge n)\Delta_0\{\Delta_0 < 1\}.$$

Appeal twice to Monotone Convergence as $n \to \infty$ to deduce

$$\nu f(1 - \Delta_0)\{\Delta_0 < 1\} = \mu f\Delta_0\{\Delta_0 < 1\} \qquad \text{for all } f \in \mathcal{M}^+.$$

Replace $f$ by $f\{\Delta_0 < 1\}/(1 - \Delta_0)$, which also belongs to $\mathcal{M}^+$, to complete the proof of existence.

The proof of uniqueness of the representation, up to various almost sure equivalences, follows a similar style of argument. Problem [9] will step you through the details. □

3. Distances and affinities between measures

Let $\mu_1$ and $\mu_2$ be two finite measures on $(X, \mathcal{A})$. We may suppose both $\mu_1$ and $\mu_2$ are absolutely continuous with respect to some dominating (nonnegative) measure $\lambda$, with densities $m_1$ and $m_2$. For example, we could choose $\lambda = \mu_1 + \mu_2$.

For the purposes of this subsection, a finite signed measure will be any set function expressible as a difference $\mu_1 - \mu_2$ of finite, nonnegative measures. In fact every real-valued, countably additive set function defined on a sigma-field can be represented in that way, but we shall not be needing the more basic characterization. If $\mu_i$ has density $m_i$ with respect to $\lambda$ then $\mu_1 - \mu_2$ has density $m_1 - m_2$.

Throughout the section, $\mathcal{M}_{\text{bdd}}$ will denote the space of all bounded, real-valued, $\mathcal{A}$-measurable functions on $X$, and $\mathcal{M}^+_{\text{bdd}}$ will denote the cone of nonnegative functions in $\mathcal{M}_{\text{bdd}}$.

There are a number of closely related distances between the measures, all of which involve calculations with densities. Several easily proved facts about these distances have important applications in mathematical statistics.

Total variation distance

The total variation norm $\|\mu\|_1$ of the signed measure $\mu$ is defined as $\sup_\pi \sum_{A \in \pi} |\mu A|$, where the supremum ranges over all partitions $\pi$ of $X$ into finitely many measurable sets. The total variation distance between two signed measures is the norm of their difference. In terms of the density $m$ for $\mu$ with respect to $\lambda$, we have

$$\sum_{A \in \pi} |\mu A| = \sum_{A \in \pi} |\lambda(mA)| \le \sum_{A \in \pi} \lambda(|m|A) = \lambda|m|.$$


Equality is achieved for a partition with two sets: $A_1 := \{m \ge 0\}$ and $A_2 := \{m < 0\}$. In particular, the total variation distance between two measures equals the $\mathcal{L}^1$ distance between their densities; in fact, total variation distance is often referred to as $\mathcal{L}^1$ distance. The initial definition of $\|\mu\|_1$, in which the dominating measure does not appear, shows that the $\mathcal{L}^1$ norm does not depend on the particular choice of dominating measure. The $\mathcal{L}^1$ norm also equals $\sup_{|f| \le 1} |\mu f|$, because $|\mu f| = |\lambda(mf)| \le \lambda(|m|\,|f|) \le \lambda|m|$ for $|f| \le 1$, with equality for $f := \{m \ge 0\} - \{m < 0\}$.

REMARK. Some authors also refer to $v(\mu_1, \mu_2) := \sup_{A \in \mathcal{A}} |\mu_1 A - \mu_2 A|$ as the total variation distance, which can be confusing. In the special case when $\mu_1 X = \mu_2 X$, the signed measure $\mu := \mu_1 - \mu_2$ satisfies $\mu\{m_1 > m_2\} = -\mu\{m_1 < m_2\}$, whence $\|\mu\|_1 = 2v(\mu_1, \mu_2)$.
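A quick numerical check of the two formulas (hypothetical four-point measures, not from the text): $\|\mu_1 - \mu_2\|_1 = \lambda|m_1 - m_2|$, while $\sup_A |\mu_1 A - \mu_2 A|$ is attained at $A = \{m_1 > m_2\}$ and equals half the $\mathcal{L}^1$ norm when the total masses agree.

```python
# Two hypothetical probability measures on a 4-point space, dominated by
# counting measure lambda, with densities m1 and m2.
m1 = [0.1, 0.4, 0.3, 0.2]
m2 = [0.3, 0.1, 0.4, 0.2]

tv_l1 = sum(abs(a - b) for a, b in zip(m1, m2))          # lambda|m1 - m2|
sup_diff = sum(a - b for a, b in zip(m1, m2) if a > b)   # attained at {m1 > m2}

print(tv_l1, sup_diff)
assert abs(tv_l1 - 2 * sup_diff) < 1e-9   # equal masses: ||mu||_1 = 2 v(mu1, mu2)
```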

The affinity between two finite signed measures

The affinity between $\mu_1$ and $\mu_2$ is defined as

$$\alpha_1(\mu_1, \mu_2) := \inf\{\mu_1 f_1 + \mu_2 f_2 : f_1, f_2 \in \mathcal{M}^+_{\text{bdd}},\ f_1 + f_2 \ge 1\}$$
$$= \inf\{\lambda(m_1 f_1 + m_2 f_2) : f_1, f_2 \in \mathcal{M}^+_{\text{bdd}},\ f_1 + f_2 \ge 1\}$$
$$= \lambda(m_1 \wedge m_2),$$

the infimum being achieved by $f_1 = \{m_1 \le m_2\}$ and $f_2 = \{m_1 > m_2\}$. That is, the affinity equals the minimum of $\mu_1 A + \mu_2 A^c$ over all $A$ in $\mathcal{A}$.

REMARK. For probability measures, the minimizing set has the statistical interpretation of the (nonrandomized) test between the two hypotheses $\mu_1$ and $\mu_2$ that minimizes the sum of the type one and type two errors.

The pointwise equality $2(m_1 \wedge m_2) = m_1 + m_2 - |m_1 - m_2|$ gives the connection between affinity and $\mathcal{L}^1$ distance,

$$2\alpha_1(\mu_1, \mu_2) = \lambda\left(m_1 + m_2 - |m_1 - m_2|\right) = \mu_1 X + \mu_2 X - \|\mu_1 - \mu_2\|_1.$$
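The identity $2\alpha_1(\mu_1,\mu_2) = \mu_1 X + \mu_2 X - \|\mu_1 - \mu_2\|_1$ is easy to verify numerically for discrete measures (the densities below are hypothetical):

```python
# Hypothetical densities with respect to counting measure on 4 points.
m1 = [0.1, 0.4, 0.3, 0.2]
m2 = [0.3, 0.1, 0.4, 0.2]

affinity = sum(min(a, b) for a, b in zip(m1, m2))     # lambda(m1 ^ m2)
l1_dist = sum(abs(a - b) for a, b in zip(m1, m2))
masses = sum(m1) + sum(m2)

# 2 lambda(m1 ^ m2) = lambda(m1 + m2 - |m1 - m2|):
assert abs(2 * affinity - (masses - l1_dist)) < 1e-9
print(affinity)   # 0.1 + 0.1 + 0.3 + 0.2, i.e. 0.7 up to rounding
```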

Both the affinity and the $\mathcal{L}^1$ distance are related to a natural ordering on the space of all finite signed measures on $\mathcal{A}$, defined by:

$\nu \le \nu'$ means $\nu f \le \nu' f$ for each $f$ in $\mathcal{M}^+_{\text{bdd}}$.

<9> Example. To each pair of finite, signed measures $\mu_1$ and $\mu_2$ there exists a largest measure $\nu$ for which $\nu \le \mu_1$ and $\nu \le \mu_2$. It is determined by its density $m_1 \wedge m_2$ with respect to the dominating $\lambda$. It is easy to see that the $\nu$ so defined is smaller than both $\mu_1$ and $\mu_2$. To prove that it is the largest such measure, let $\mu$ be any other signed measure with the same property. Without loss of generality, we may assume $\mu$ has a density $m$ with respect to $\lambda$. For each bounded, nonnegative $f$, we then have

$$\lambda(mf) = \mu f = \mu f\{m_1 \ge m_2\} + \mu f\{m_1 < m_2\} \le \mu_2 f\{m_1 \ge m_2\} + \mu_1 f\{m_1 < m_2\} = \lambda\left((m_1 \wedge m_2)f\right).$$

The particular choice $f = \{m > m_1 \wedge m_2\}$ then leads to the conclusion that $m \le m_1 \wedge m_2$ a.e. $[\lambda]$, and hence $\mu \le \nu$ as measures.


The measure $\nu$ is also denoted by $\mu_1 \wedge \mu_2$ and is called the measure theoretic minimum of $\mu_1$ and $\mu_2$. For nonnegative measures, the affinity $\alpha_1(\mu_1, \mu_2)$ equals the $\mathcal{L}^1$ norm of $\mu_1 \wedge \mu_2$. □

By a similar argument, it is easy to show that the density $|m_1 - m_2|$ defines the smallest measure $\nu$ (necessarily nonnegative) for which $\nu \ge \mu_1 - \mu_2$ and $\nu \ge \mu_2 - \mu_1$.

Hellinger distance between probability measures

Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a dominating measure $\lambda$. The square roots of the densities, $\sqrt{p}$ and $\sqrt{q}$, are both square integrable; they both belong to $\mathcal{L}^2(\lambda)$. The Hellinger distance between the two measures is defined as the $\mathcal{L}^2$ distance between the square roots of their densities,

$$H(P, Q)^2 := \lambda\left(\sqrt{p} - \sqrt{q}\right)^2 = \lambda\left(p + q - 2\sqrt{pq}\right) = 2 - 2\lambda\sqrt{pq}.$$

Again the distance does not depend on the choice of dominating measure (Problem [13]). The quantity $\lambda\sqrt{pq}$ is called the Hellinger affinity between the two probabilities, and is denoted by $\alpha_2(P, Q)$. Clearly $\sqrt{pq} \ge p \wedge q$, from which it follows that $\alpha_2(P, Q) \ge \alpha_1(P, Q)$ and $H(P, Q)^2 \le \|P - Q\|_1$. The Cauchy-Schwarz inequality gives a useful lower bound:

$$\|P - Q\|_1^2 = \left(\lambda\left|\sqrt{p} - \sqrt{q}\right|\left(\sqrt{p} + \sqrt{q}\right)\right)^2 \le \lambda\left(\sqrt{p} - \sqrt{q}\right)^2\,\lambda\left(\sqrt{p} + \sqrt{q}\right)^2 = H(P, Q)^2\left(2 + 2\lambda\sqrt{pq}\right).$$

Substituting for the Hellinger affinity, we get $\|P - Q\|_1 \le H(P, Q)\left(4 - H(P, Q)^2\right)^{1/2}$, which is smaller than $2H(P, Q)$. The Hellinger distance defines a bounded metric on the space of all probability measures on $\mathcal{A}$. Convergence in that metric is equivalent to convergence in $\mathcal{L}^1$ norm, because

<10>  $H(P, Q)^2 \le \|P - Q\|_1 \le 2H(P, Q)$.

The Hellinger distance satisfies the inequality 0 ≤ H(P, Q) ≤ √2. The equality at 0 occurs when √p = √q almost surely [λ], that is, when P = Q as measures on 𝒜. Equality at √2 occurs when the Hellinger affinity is zero, that is, when pq = 0 almost surely [λ], which is the condition that P and Q be mutually singular. For example, discrete distributions (concentrated on a countable set) are always at the maximum Hellinger distance from nonatomic distributions (zero mass at each point).

REMARK. Some authors prefer to have an upper bound of 1 for the Hellinger distance; they include an extra factor of one half in the definition of H(P, Q)².
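The inequalities in <10>, the Cauchy-Schwarz refinement, and the bounds 0 ≤ H ≤ √2 are easy to check numerically on finite sample spaces. A NumPy sketch (the function names and random trials are our own, not the book's):

```python
import numpy as np

def hellinger(p, q):
    # H(P, Q) for densities p, q with respect to counting measure
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def tv_l1(p, q):
    # L1 distance ||P - Q||_1
    return np.sum(np.abs(p - q))

rng = np.random.default_rng(1)
for _ in range(200):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    H, L = hellinger(p, q), tv_l1(p, q)
    assert H ** 2 <= L + 1e-12                         # lower half of <10>
    assert L <= H * np.sqrt(4.0 - H * H) + 1e-12       # Cauchy-Schwarz bound
    assert L <= 2.0 * H + 1e-12                        # upper half of <10>
    assert -1e-12 <= H <= np.sqrt(2.0) + 1e-12

# mutually singular measures attain the maximum distance sqrt(2)
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.3, 0.7])
assert np.isclose(hellinger(p, q), np.sqrt(2.0))
```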

Relative entropy

Let P and Q be two probability measures with densities p and q with respect to some dominating measure λ. The relative entropy (also known as the Kullback-Leibler "distance," even though it is not a metric) between P and Q is defined as

D(P‖Q) := λ(p log(p/q)).


At first sight, it is not obvious that the definition cannot suffer from the ∞ − ∞ problem. A Taylor expansion comes to the rescue: for x > −1,

<11>  (1 + x) log(1 + x) = x + R(x),

with R(x) = ½x²/(1 + x*) for some x* between 0 and x. When p > 0 and q > 0, put x = (p − q)/q, discard the nonnegative remainder term, then multiply through by q to get p log(p/q) ≥ p − q. The same inequality also holds at points where p > 0 and q = 0, with the left-hand side interpreted as ∞; and at points where p = 0 we get no contribution to the defining integral. It follows not only that λ(p (log(p/q))⁻) < ∞ but also that D(P‖Q) ≥ λ(p − q) = 0. The relative entropy is well defined and nonnegative. (Nonnegativity would also follow via Jensen's inequality.) Moreover, if λ{p > 0 = q} > 0 then p log(p/q) is infinite on a set of positive measure, which forces D(P‖Q) = ∞. That is, the relative entropy is infinite unless P is absolutely continuous with respect to Q. It can also be infinite even if P and Q are mutually absolutely continuous (Problem [15]).

As with the ℒ¹ and Hellinger distances, the relative entropy does not depend on the choice of the dominating measure λ (Problem [14]).

It is easy to deduce from the conditions for equality in Jensen's inequality that D(P‖Q) = 0 if and only if P = Q. An even stronger assertion follows from the inequality

<12>  D(P‖Q) ≥ H²(P, Q).

This inequality is trivially true unless P is absolutely continuous with respect to Q, in which case we can take λ equal to Q. For that case, define η = √p − 1. Note that Qη² = H²(P, Q) and

1 = Qp = Q(1 + η)² = 1 + 2Qη + Qη²,

which implies that 2Qη = −H²(P, Q). Hence, using the inequality log(1 + t) ≥ t/(1 + t) for t > −1,

D(P‖Q) = 2Q((1 + η)² log(1 + η)) ≥ 2Q((1 + η)η) = 2Qη + 2Qη² = H²(P, Q),

as asserted.

In a similar vein, there is a lower bound for the relative entropy involving the ℒ¹-distance. Inequalities <10> and <12> together imply D(P‖Q) ≥ ¼‖P − Q‖₁². A more direct argument will give a slightly better bound,

<13>  D(P‖Q) ≥ ½‖P − Q‖₁².

The improvement comes from a refinement (Problem [19]) of the error term in inequality <11>, namely, R(x) ≥ ½x²/(1 + x/3) for x > −1.
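Both lower bounds for the relative entropy can be checked numerically on discrete distributions. The following sketch is our own (with the convention D = ∞ when P fails to be absolutely continuous with respect to Q); it tests nonnegativity, <12>, and <13> on random pairs:

```python
import numpy as np

def kl(p, q):
    # D(P||Q) = sum p log(p/q); infinite unless P << Q
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def hellinger_sq(p, q):
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

rng = np.random.default_rng(2)
for _ in range(200):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    D = kl(p, q)
    assert D >= -1e-12                                      # nonnegativity
    assert D >= hellinger_sq(p, q) - 1e-12                  # inequality <12>
    assert D >= 0.5 * np.sum(np.abs(p - q)) ** 2 - 1e-12    # inequality <13>
```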


To establish <13> we may once more assume, with no loss of generality, that P is absolutely continuous with respect to Q. Write 1 + δ for the density. Notice that Qδ = 0. Then deduce that

D(P‖Q) = Q((1 + δ) log(1 + δ) − δ) ≥ ½Q(δ²/(1 + δ/3)).

Multiply the right-hand side by 1 = Q(1 + δ/3), then invoke the Cauchy-Schwarz inequality to bound the product from below by half the square of

Q(|δ| (1 + δ/3)^{−1/2} (1 + δ/3)^{1/2}) = Q|δ| = ‖P − Q‖₁.

The asserted inequality <13> follows.

<14> Example. Let Pθ denote the N(θ, 1) distribution on the real line, with density φ(x − θ) with respect to Lebesgue measure, where φ(x) = exp(−x²/2)/√(2π). Each of the three distances between P₀ and Pθ can be calculated in closed form:

D(P₀‖Pθ) = P₀ˣ(−½x² + ½(x − θ)²) = ½θ²,

and

H²(P₀, Pθ) = 2 − (2/√(2π)) ∫_{−∞}^{∞} exp(−¼x² − ¼(x − θ)²) dx = 2 − 2 exp(−θ²/8),

and

‖P₀ − Pθ‖₁ = ∫_{−∞}^{∞} |φ(x) − φ(x − θ)| dx

= 2 ∫_{−∞}^{∞} (φ(x) − φ(x − θ))⁺ dx

= 2 ∫_{−∞}^{θ/2} (φ(x) − φ(x − θ)) dx = 2(Φ(½θ) − Φ(−½θ)),

where Φ denotes the N(0, 1) distribution function.

For θ near zero, Taylor expansion gives

H²(P₀, Pθ) = ¼θ² + O(θ⁴) and ‖P₀ − Pθ‖₁ = √(2/π) θ + O(θ³).

Inequalities <10>, <12>, and <13> then become, for θ near zero,

¼θ² + O(θ⁴) ≤ √(2/π) θ + O(θ³) ≤ θ + O(θ³),

½θ² ≥ ¼θ² + O(θ⁴),

½θ² ≥ (1/π) θ² + O(θ⁴).

The comparisons show that there is little room for improvement, except possibly for the lower bound in <10>.
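The three closed forms can be compared against direct numerical quadrature. A sketch (the grid, step size, and tolerances are our own choices):

```python
import numpy as np
from math import erf, exp, sqrt

theta = 0.8
x = np.linspace(-12.0, 12.0, 400001)
dx = x[1] - x[0]
phi0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # density of P_0
phit = np.exp(-(x - theta)**2 / 2) / np.sqrt(2 * np.pi)  # density of P_theta

# Riemann-sum quadrature for the three distances
D_num  = np.sum(phi0 * (-(x**2) / 2 + (x - theta)**2 / 2)) * dx
H2_num = 2 - 2 * np.sum(np.sqrt(phi0 * phit)) * dx
L1_num = np.sum(np.abs(phi0 - phit)) * dx

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # N(0,1) distribution function
assert abs(D_num - theta**2 / 2) < 1e-5
assert abs(H2_num - (2 - 2 * exp(-theta**2 / 8))) < 1e-5
assert abs(L1_num - 2 * (Phi(theta / 2) - Phi(-theta / 2))) < 1e-5
```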

<15> Example. The method from the previous Example can be extended to other families of densities, providing a better indication of how much improvement might be possible in the three inequalities. I will proceed heuristically, ignoring questions of convergence and integrability, and assuming existence of well behaved Taylor expansions.

Suppose Pθ has a density exp(g(x − θ)) with respect to Lebesgue measure on the real line. Write g′ and g″ for the first and second derivatives of the function g(x), so that

g(x − θ) − g(x) = −θg′(x) + ½θ²g″(x) + terms of order |θ|³

and

exp(g(x − θ) − g(x)) = 1 + (−θg′(x) + ½θ²g″(x)) + ½(−θg′(x) + ½θ²g″(x))² + …

= 1 − θg′(x) + ½θ²(g″(x) + g′(x)²) + terms of order |θ|³.

From the equalities

1 = ∫_{−∞}^{∞} exp(g(x − θ)) dx

= P₀ˣ exp(g(x − θ) − g(x))

= 1 − θP₀g′ + ½θ²P₀(g″ + (g′)²) + terms of order |θ|³

we can conclude first that P₀g′ = 0, and then P₀(g″ + (g′)²) = 0.

REMARK. The last two assertions can be made rigorous by domination assumptions, allowing derivatives to be taken inside integral signs. Readers familiar with the information inequality from theoretical statistics will recognize the two ways of representing the information function for the family of densities, together with the zero-mean property of the score function.

Continuing the heuristic, we will get approximations for the three distances for values of θ near zero. First,

D(P₀‖Pθ) = P₀(g(x) − g(x − θ)) = ½θ²P₀(g′)² + ⋯.

From the Taylor approximation,

exp(½g(x − θ) − ½g(x)) = 1 − ½θg′ + ¼θ²(g″ + ½(g′)²) + ⋯

we get

H²(P₀, Pθ) = 2 − 2P₀ˣ exp(½g(x − θ) − ½g(x)) = ¼θ²P₀(g′)² + ⋯.

And finally,

‖P₀ − Pθ‖₁ = P₀|exp(g(x − θ) − g(x)) − 1| = |θ| P₀|g′| + ⋯.

If we could find a g for which (P₀|g′|)² = P₀(g′)², the two inequalities

2H(P₀, Pθ) ≥ ‖P₀ − Pθ‖₁ and D(P₀‖Pθ) ≥ ½‖P₀ − Pθ‖₁²

would become sharp up to terms of order θ². The desired equality for g would force |g′| to be a constant, which would lead us to the density f(x) = ½e^{−|x|}. Of course log f is not differentiable at the origin, an oversight we could remedy by a slight smoothing of |x| near the origin. In that way we could construct arbitrarily smooth densities with P₀(g′)² as close as we please to (P₀|g′|)². There can be no improvement in the constants in the last two displayed inequalities.
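The heuristic can be tested numerically on the Laplace family itself, where |g′| = 1 almost everywhere: both 2H ≥ ‖·‖₁ and D ≥ ½‖·‖₁² should be close to equalities for small θ. A sketch (the grids and tolerances are our own choices):

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
lap = lambda t: 0.5 * np.exp(-np.abs(t))   # Laplace density; |g'| = 1 a.e.

for theta, tol in ((0.2, 0.05), (0.05, 0.01)):
    p, q = lap(x), lap(x - theta)
    H = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
    L1 = np.sum(np.abs(p - q)) * dx
    D = np.sum(p * np.log(p / q)) * dx
    # the two inequalities hold ...
    assert 2.0 * H >= L1 - 1e-9 and D >= 0.5 * L1 ** 2 - 1e-9
    # ... and for the (near-)Laplace family they are nearly equalities
    assert abs(2.0 * H / L1 - 1.0) < tol
    assert abs(D / (0.5 * L1 ** 2) - 1.0) < tol
```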


4. The classical concept of absolute continuity

The fundamental problem of Calculus concerns the interpretation of differentiation and integration, for functions of a real variable, as inverse operations: Which functions on the real line can be expressed as integrals of their derivatives? Clearly, if H(x) = ∫ₐˣ h(t) dt, with h Lebesgue integrable, then H is a continuous function, but it also possesses a stronger property.

<16> Definition. A real valued function H defined on an interval [a, b] of the real line is said to be absolutely continuous if to each ε > 0 there exists a δ > 0 such that Σᵢ |H(bᵢ) − H(aᵢ)| < ε for all finite collections of nonoverlapping subintervals [aᵢ, bᵢ] of [a, b] for which Σᵢ (bᵢ − aᵢ) < δ.

REMARK. In the Definition, and subsequently, nonoverlapping means that the interiors of the intervals are disjoint. For example, [0, 1] and [1, 2] are nonoverlapping. Their intersection has zero Lebesgue measure.

Note the strong similarity between Definition <16> and the reformulation of the measure theoretic definition of absolute continuity as an ε-δ property, in Example <3>.

The connection between absolute continuity of functions and integration of derivatives was established by Lebesgue (1904). It is one of the most celebrated results of classical analysis.

<17> Fundamental Theorem of Calculus. A real valued function H defined on an interval [a, b] is absolutely continuous if and only if the following three conditions hold:

(i) the derivative H′(x) exists at Lebesgue almost all points of [a, b];

(ii) the derivative H′ is Lebesgue integrable;

(iii) H(x) − H(a) = ∫ₐˣ H′(t) dt for each x in [a, b].

REMARK. Of course it is actually immaterial for (ii) and (iii) how H′ is defined on the Lebesgue negligible set of points at which the derivative does not exist. For example, we could take H′ as the measurable function lim supₙ→∞ n(H(x + n⁻¹) − H(x)). The proof in Section 6 provides two other natural choices.

We may fruitfully think of the Fundamental Theorem as making two separate assertions:

(FT₁) A real-valued function H on [a, b] is absolutely continuous if and only if it has the representation

<18>  H(x) = H(a) + ∫ₐˣ h(t) dt for all x in [a, b]

for some Lebesgue integrable h defined on [a, b]. The function h is unique up to Lebesgue almost sure equivalence.

(FT₂) If H is absolutely continuous then it is differentiable almost everywhere, with derivative equal Lebesgue almost everywhere to the function h from <18>.


As shown at the end of this Section, Assertion FT₁ can be recovered from the Radon-Nikodym theorem for measures. The proof of Assertion FT₂ (Section 6), which identifies the density h with a derivative defined almost everywhere, requires an auxiliary result known as the Vitali Covering Lemma (Section 5).

Notice what the Fundamental Theorem does not assert: that differentiability almost everywhere should allow the function to be recovered as an integral of that derivative. As a pointwise limit of continuous functions, (H(x + δ) − H(x))/δ, the derivative H′(x) is measurable, when it exists. However, it need not be Lebesgue integrable, as noted by Lebesgue in his doctoral thesis (Lebesgue 1902): the function F(x) := x² sin(1/x²), with F(0) = 0, has a derivative at all points of the real line, but F′(x) behaves like −2x⁻¹ cos(1/x²) for x near 0, which prevents integrability on any interval that contains 0 (Problem [20]). The function F is not absolutely continuous.

Neither does the Fundamental Theorem assert that absolute continuity follows from almost sure existence of H′ with ∫ₐᵇ |H′(x)| dx finite. For example, the function F constructed in Example <1>, whose derivative exists and is equal to zero outside a set of zero Lebesgue measure, cannot be recovered by integration of the derivative, even though that derivative is integrable. It is therefore quite surprising that existence of even a one sided, integrable derivative everywhere is enough to ensure absolute continuity.

<19> Theorem. Let H be a continuous function defined on an interval [a, b], with a (real-valued) right-hand derivative h(x) := lim_{δ↓0} (H(x + δ) − H(x))/δ existing at each point of [a, b). If h is Lebesgue integrable then H(x) = H(a) + ∫ₐˣ h(t) dt for each x in [a, b], and hence H is an absolutely continuous function on [a, b].

See Problem [21] for an outline of the proof. The Theorem justifies the usual treatment in introductory Calculus courses of integration as an inverse operation to differentiation.
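A tiny numeric illustration of the representation <18> (the step function h and the evaluation points are our own choices): for h(t) = 1{t ≥ ½}, integrable with a single jump, H(x) = (x − ½)⁺ is absolutely continuous, its difference quotients recover h away from the jump, and its increments obey the ε-δ property of Definition <16> because |h| ≤ 1.

```python
import numpy as np

# H(x) = integral_0^x h(t) dt for the step function h(t) = 1{t >= 1/2}
h = lambda t: (np.asarray(t) >= 0.5).astype(float)
H = lambda x: np.maximum(np.asarray(x) - 0.5, 0.0)

delta = 1e-8
x = np.array([0.2, 0.4, 0.7, 0.9])          # points away from the jump t = 1/2
dq = (H(x + delta) - H(x)) / delta          # right-hand difference quotients
assert np.allclose(dq, h(x))                # H'(x) = h(x) off a negligible set

# epsilon-delta property: since |h| <= 1, the total increment of H over
# nonoverlapping intervals of total length s is at most s
rng = np.random.default_rng(3)
ends = np.sort(rng.uniform(0.0, 1.0, 10)).reshape(5, 2)
total_len = np.sum(ends[:, 1] - ends[:, 0])
total_inc = np.sum(np.abs(H(ends[:, 1]) - H(ends[:, 0])))
assert total_inc <= total_len + 1e-12
```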

Proof of FT₁. If H is given by representation <18> then its increments are controlled by the measure ν defined on [a, b] by its density |h| with respect to Lebesgue measure m on 𝔅[a, b]. If {[aᵢ, bᵢ] : i = 1, …, k} is a family of nonoverlapping subintervals of [a, b] then Σᵢ₌₁ᵏ |H(bᵢ) − H(aᵢ)| ≤ Σᵢ₌₁ᵏ ν[aᵢ, bᵢ]. From Example <3>, when the Lebesgue measure Σᵢ (bᵢ − aᵢ) of the set A = ∪ᵢ[aᵢ, bᵢ] is small enough, the ν measure of A is also small, because ν is absolutely continuous (in the sense of measures) with respect to m. It follows that H is absolutely continuous as a function on [a, b].

Conversely, an absolutely continuous function H defines two nonnegative functions F⁺ and F⁻ on [a, b] by

<20>  F±(x) = sup over π(x) of Σᵢ (H(xᵢ) − H(xᵢ₋₁))±,

where the suprema run over the collection π(x) of all finite partitions a = x₀ < x₁ < … < x_k = x of [a, x], for each x in [a, b]. At x = a all partitions are degenerate and F±(a) = 0. By splitting [a, x] into a finite union of intervals of length less than δ, you will see that both functions are real valued. More precisely, F±(x) ≤ ε⌈(b − a)/δ⌉, where ε and δ come from Definition <16>.


Both F± are increasing functions, for the following reason. First note that insertion of an extra point into a partition {xᵢ} increases the defining sums. In particular, if a ≤ y < x then we may arrange that y is a point in the π(x) partition, so that the sums on the right-hand side of <20> are larger than the corresponding sums for π(y). More precisely, F±(x) = F±(y) + sup Σᵢ (H(yᵢ) − H(yᵢ₋₁))±, where the supremum runs over all finite partitions y = y₀ < y₁ < … < y_k = x of [y, x]. When |x − y| < δ, with δ > 0 as in Definition <16>, each sum for the supremum is less than ε, and hence |F±(x) − F±(y)| ≤ ε. That is, both F± are continuous functions on [a, b]. A similar argument applied to finite nonoverlapping collections of intervals {[yᵢ, xᵢ]} leads to the stronger conclusion that both F± are absolutely continuous functions.

REMARK. We didn't really need absolute continuity to break H into a difference of two increasing functions. It would suffice to have sup Σᵢ |H(yᵢ) − H(yᵢ₋₁)| < ∞, where the supremum runs over all finite partitions y = y₀ < y₁ < … < y_k = x of [y, x]. Such a function is said to have bounded variation on the interval [a, b]. Close inspection of the arguments in Section 6 would reveal that functions of bounded variation have a derivative almost everywhere. Without absolute continuity, we cannot recover the function by integrating that derivative, as shown by Example <1>.

As shown in Section 2.9 using quantile transformations, increasing functions such as F± correspond to distribution functions of measures: there exist two finite measures ν± on 𝔅[a, b] such that ν±(a, x] = F±(x) for x in [a, b]. Notice that ν± put zero mass at each point, by continuity of F±. The absolute continuity of the F± functions translates into an assertion regarding the class 𝓔 of all subsets of [a, b] expressible as a finite union of intervals (yᵢ, xᵢ]: for each ε > 0 there exists a δ > 0 such that ν±E ≤ ε for every set E in 𝓔 with mE < δ.

A simple λ-class generating argument shows that 𝓔 is a dense subclass of 𝔅[a, b] in the ℒ¹(μ) sense, where μ = m + ν⁺ + ν⁻. That is, for each set B in 𝔅[a, b] and each ε′ > 0 there exists an E in 𝓔 for which μ|B △ E| < ε′. In particular, if we put ε′ = min(δ/2, ε) then mB < δ/2 implies mE < δ, from which we get ν±E ≤ ε and ν±B ≤ 2ε. That is, both ν± are absolutely continuous (as measures) with respect to m.

By the Radon-Nikodym Theorem, there exist m-integrable functions h± for which ν±f = m(h±f) for all f in 𝓜⁺. In particular, F±(x) = ν±(a, x] = ∫ₐˣ h±(t) dt for a ≤ x ≤ b.

Finally, note that for each partition {xᵢ} in π(x) we have

Σᵢ (H(xᵢ) − H(xᵢ₋₁))⁺ − Σᵢ (H(xᵢ) − H(xᵢ₋₁))⁻ = Σᵢ (H(xᵢ) − H(xᵢ₋₁)),

which reduces to H(x) − H(a) after cancellations. We can choose the partition to give simultaneously values as close as we please to both suprema in <20>. In the limit we get

H(x) − H(a) = F⁺(x) − F⁻(x) = ∫ₐˣ (h⁺(t) − h⁻(t)) dt.

That is, representation <18> holds with h = h⁺ − h⁻. The uniqueness of the representing h can be established by another λ-class generating argument. □


*5. Vitali covering lemma

Suppose D is a Borel subset of ℝᵈ with finite Lebesgue measure mD. There are various ways in which we may approximate D by simpler sets. For example, from Section 2.1 and related problems, to each ε > 0 there exists an open set G ⊇ D and a compact set K ⊆ D for which m(G\K) < ε. In particular, we could take G as a countable union of open cubes. In general, we cannot hope to represent K as a union of closed cubes, because those sets would all have to lie in the interior of D, a set whose measure might be strictly smaller than mD. However, if we allow the cubes to poke slightly outside D we can even approximate by a finite union of disjoint closed cubes, as a consequence of a Vitali covering theorem.

There are many different results presented in texts as the Vitali theorem, each involving different levels of generality and different degrees of subtlety in the proof. For the application to FT₂, a simple version for approximation by intervals on the real line would suffice, but I will present a version for ℝᵈ, in which the sets are not assumed to be cubes, but are instead required to be "not too skinny". The extra generality is useful; it causes few extra difficulties in the proof, and indeed helps to highlight the beautiful underlying idea; and the pictures look nicer in more than one dimension.

Of course skinny is not a technical term. Instead, for a fixed constant γ > 0, let us say that a measurable set F is γ-regular if there exists an open ball B_F with B_F ⊇ F and mF/mB_F ≥ γ. We will sometimes need to write B_F as B(x, r), the open ball with center x and radius r.

<21> Definition. Call a collection 𝒱 of closed subsets of ℝᵈ a γ-regular Vitali covering of a set E if each member of 𝒱 is γ-regular and if, to each ε > 0, each point of E belongs to a set F (depending on the point and ε) from 𝒱 with diameter less than ε.

Put another way, if we write 𝒱_ε for {F ∈ 𝒱 : diam(F) < ε}, then the Vitali covering property is equivalent to E ⊆ ∪{F : F ∈ 𝒱_ε} for every ε > 0. Notice that if G is an open set with G ⊇ E, then {F ∈ 𝒱 : F ⊆ G} is also a Vitali covering for E: for each x in E, all points close enough to x must lie within G.

<22> Vitali Covering Lemma. For some fixed γ > 0, let 𝒱 be a γ-regular Vitali covering for a set D in 𝔅(ℝᵈ) with finite Lebesgue measure. Then there exists a countable family {Fᵢ} of disjoint sets from 𝒱 for which the set D\(∪ᵢ Fᵢ) has zero Lebesgue measure.

REMARK. More refined versions of the Theorem (such as the results presented by Saks 1937, Section IV.3, or Wheeden & Zygmund 1977, Section 7.3) allow γ to vary from point to point within D, and relax assumptions of measurability or finiteness of mD. We will have no need for the more general versions.

Proof. The result is trivial if mD is zero, so we may assume that mD > 0. The proof works via repeated reduction of mD by removal of unions of finite families of disjoint sets from 𝒱. The method for the first step sets the pattern.

Fix an ε > 0. As you will see later, we need ε small enough that the constant ρ := 3⁻ᵈ(1 − ε)γ − ε is strictly positive.


Find an open set G and a compact set K for which G ⊇ D ⊇ K and m(G\K) < εmD. Discard all those members of 𝒱 that are not subsets of G. What remains is still a Vitali covering of D.

REMARK. As the proof proceeds, various sets will be discarded from 𝒱. Rather than invent a new symbol for the class of sets that remains after each discard, I will reuse the symbol 𝒱. That is, 𝒱 will denote different things at different stages of the proof, but in every case it will be a Vitali covering for a subset of interest.

The family of open balls {B_F : F ∈ 𝒱} covers the compact set K. It has a finite subcover, corresponding to sets F₁, …, F_m. The ball B_{Fᵢ} is of the form B(xᵢ, rᵢ). We may assume that the sets are numbered so that r₁ ≥ r₂ ≥ … ≥ r_m.

Define a subset J of {1, 2, …, m} by successively considering each j, in the order 1, 2, …, m, for inclusion in J, rejecting j if Fⱼ intersects an Fᵢ for an i already accepted into J. For example, with the five sets shown in the picture, we would have J = {1, 2, 5}: we include 1, then 2 (because F₂ ∩ F₁ = ∅), reject 3 (because F₃ ∩ F₁ ≠ ∅), reject 4 (because F₄ ∩ F₂ ≠ ∅), then accept 5 (because F₅ ∩ F₁ = ∅ and F₅ ∩ F₂ = ∅).

For each excluded j there is an i in J for which i < j and Fᵢ ∩ Fⱼ ≠ ∅. The ordering of the radii ensures that B(xᵢ, 3rᵢ) ⊇ B(xⱼ, rⱼ): if z ∈ Fᵢ ∩ Fⱼ and y ∈ B(xⱼ, rⱼ), we have

|xᵢ − y| ≤ |xᵢ − z| + |z − y| < rᵢ + 2rⱼ ≤ 3rᵢ.

Thus ∪_{i∈J} B(xᵢ, 3rᵢ) ⊇ ∪_{j≤m} B(xⱼ, rⱼ) ⊇ K. The open balls B(xᵢ, 3rᵢ) might poke outside G, but the corresponding sets Fᵢ from 𝒱, and their union, E₁ := ∪_{i∈J} Fᵢ, are closed subsets of G.

Pairwise disjointness of the sets {Fᵢ : i ∈ J} allows us to calculate the measure of their union by adding, which, together with the regularity property

mFᵢ ≥ γ mB(xᵢ, rᵢ) = (γ3⁻ᵈ) mB(xᵢ, 3rᵢ),

gives us a lower bound for the measure of E₁,

(3ᵈ/γ) mE₁ = (3ᵈ/γ) Σ_{i∈J} mFᵢ ≥ Σ_{i∈J} mB(xᵢ, 3rᵢ) ≥ m(∪_{i∈J} B(xᵢ, 3rᵢ)) ≥ mK ≥ (1 − ε)mD.

It follows that

m(D\E₁) ≤ mG − mE₁ < (1 + ε)mD − (3⁻ᵈ(1 − ε)γ) mD = (1 − ρ)mD.

That is, E₁ carves away at least a fraction ρ of the Lebesgue measure of D, thereby completing the first step in the construction.

The second step is analogous. Discard all those members of 𝒱 that intersect E₁. What remains is a new Vitali covering 𝒱 for D\E₁, because each point of D\E₁ is at a strictly positive distance from the closed set E₁. Repeat the argument from the previous paragraphs to find disjoint sets from 𝒱 whose union, E₂, removes at least a fraction ρ of the Lebesgue measure of D\E₁. Then m(D\(E₁ ∪ E₂)) ≤ (1 − ρ)²mD, and the sets that make up E₁ ∪ E₂ are all disjoint. And so on. In the limit we have the countable disjoint family whose existence is asserted by the Lemma. □
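The selection of the index set J in the proof is a greedy algorithm, and the tripling property can be checked directly on random balls (taking the sets F equal to the balls B_F themselves, which are trivially γ-regular; all names below are our own):

```python
import numpy as np

def greedy_disjoint(centers, radii):
    # process balls in order of decreasing radius; accept a ball only if it
    # is disjoint from every ball already accepted (the construction of J)
    order = np.argsort(-np.asarray(radii))
    accepted = []
    for j in order:
        if all(np.linalg.norm(centers[j] - centers[i]) >= radii[j] + radii[i]
               for i in accepted):
            accepted.append(j)
    return accepted

rng = np.random.default_rng(4)
centers = rng.uniform(0.0, 10.0, size=(60, 2))
radii = rng.uniform(0.2, 1.5, size=60)
J = greedy_disjoint(centers, radii)

# every ball B(x_j, r_j) lies inside B(x_i, 3 r_i) for some accepted i
for j in range(60):
    assert any(np.linalg.norm(centers[j] - centers[i]) + radii[j] <= 3.0 * radii[i]
               for i in J)
```

The assertion mirrors the displayed triangle-inequality step: a rejected ball meets an accepted ball of at least equal radius, so tripling the accepted radius swallows it.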

*6. Densities as almost sure derivatives

With Lemma <22> we have the means to prove FT₂: if

H(x) − H(a) = ∫ₐˣ h(t) dt for a ≤ x ≤ b

for a Lebesgue integrable function h, then H has a derivative at almost all points of [a, b] and that derivative coincides with h almost everywhere.

It suffices if we consider the case where h is nonnegative, from which the general case would follow by breaking h into its positive and negative parts. For nonnegative h, write ν for the measure on 𝔅[a, b] with density h with respect to Lebesgue measure m. Define 𝓔ₙ(x) as the set of all nondegenerate (that is, nonzero length) closed intervals E for which x ∈ E ⊆ (x − n⁻¹, x + n⁻¹). Notice that if the derivative H′(x) exists then it can be recovered as a limit of ratios νEₙ/mEₙ, with Eₙ ∈ 𝓔ₙ(x).

Define functions

fₙ(x) = sup{νE/mE : E ∈ 𝓔ₙ(x)}  and  gₙ(x) = inf{νE/mE : E ∈ 𝓔ₙ(x)}.

Both the sets {fₙ > r} and {gₙ < r} are open, and hence both fₙ and gₙ are Borel measurable. For example, if fₙ(x) > r then there exists an interval E in 𝓔ₙ(x) with νE > r mE. Continuity of H at the endpoints of E lets us expand E slightly, ensuring that x is an interior point of E while keeping the ratio above r and keeping E within (x − n⁻¹, x + n⁻¹). If y is close enough to x, the interval E also belongs to 𝓔ₙ(y), and hence fₙ(y) > r. The monotone limits f(x) := infₙ fₙ(x) and g(x) := supₙ gₙ(x) are also Borel measurable.

Clearly f(x) ≥ g(x) everywhere. For 0 < δ < 1/n, both [x, x + δ] and [x − δ, x] belong to 𝓔ₙ(x), and hence both

(H(x + δ) − H(x))/δ  and  (H(x) − H(x − δ))/δ

lie between gₙ(x) and fₙ(x). In the limit as first δ tends to zero then n tends to infinity we have

g(x) ≤ liminf_{δ↓0} (H(x + δ) − H(x))/δ ≤ limsup_{δ↓0} (H(x + δ) − H(x))/δ ≤ f(x),

with an analogous pair of inequalities for [x − δ, x]. At points where g(x) = h(x) = f(x) the derivative H′(x) must exist and be equal to h(x). The next Lemma will help us prove the almost sure equality of f, g, and h.

<23> Lemma. Let A be a Borel set with finite Lebesgue measure and r be a positive constant.

(i) If f(x) > r for all x in A then νA ≥ rmA.

(ii) If g(x) < r for all x in A then νA ≤ rmA.

Proof. Let G be an open set and K be a compact set, with G ⊇ A ⊇ K.

For (i), note that the collection 𝒱 of all nondegenerate closed subintervals E of G that satisfy the inequality νE ≥ rmE is a Vitali covering of A (because fₙ > r on A for every n). By Lemma <22> there is a countable family of disjoint intervals {Eᵢ} from 𝒱 whose union L covers A up to an m-negligible set. We then have

νG ≥ νL        because G contains all intervals from 𝒱

   = Σᵢ νEᵢ    disjoint intervals

   ≥ Σᵢ rmEᵢ   definition of 𝒱

   = rmL       disjoint intervals

   ≥ rmA       covering property.

Take the infimum over G to obtain the assertion from (i).

The argument for (ii) is similar: reverse the direction of the inequality in the definition of 𝒱, then interchange the roles of ν and rm in the string of inequalities for the covering. □

To complete the proof of FT₂, for any pair of rational numbers with r > s > 0 let A = {x ∈ [a, b] : h(x) ≤ s < r ≤ f(x)}. From Lemma <23> and the fact that ν has density h with respect to m we get smA ≥ νA ≥ rmA. Conclude that mA = 0. Cast out countably many negligible sets, one for each pair of rationals, to deduce that h ≥ f a.e. [m] on [a, b]. Argue similarly to show that h ≤ g a.e. [m] on [a, b]. Together with the fact that f ≥ g, these two inequalities imply that f = g = h at almost all points of [a, b].
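Numerically, the squeezing of the ratios νE/mE down to h(x) is easy to watch for a smooth density (the density h, the point x, and the sampling scheme below are our own illustrative choices):

```python
import numpy as np

# nu has density h with respect to Lebesgue measure m, so that
# nu[u, v] = P(v) - P(u) for the antiderivative P of h
h = lambda t: t ** 2
P = lambda t: t ** 3 / 3.0

x = 1.3
for n in (10, 100, 1000):
    rng = np.random.default_rng(n)
    u = rng.uniform(x - 1.0 / n, x, 50)   # random intervals [u, v] with
    v = rng.uniform(x, x + 1.0 / n, 50)   # x in [u, v] c (x - 1/n, x + 1/n)
    ratios = (P(v) - P(u)) / (v - u)
    # both f_n(x) and g_n(x) lie within Lip(h)/n of h(x) near this x
    assert np.max(np.abs(ratios - h(x))) < 3.0 / n
```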

7. Problems

[1] Let μ be the measure defined for all subsets of 𝔛 by μA = +∞ if A ≠ ∅, and μ∅ = 0. Show that μf = μ(2f) for all nonnegative functions f.

[2] Let λ denote the measure on ℝ with λA equal to the number of points in A (possibly infinite), and let μA = ∞ for all nonempty A. Show that λ has no density with respect to μ, even though both measures have the same negligible sets.

[3] Suppose Δ₁ and Δ₂ are functions in 𝓜⁺(𝔛, 𝒜) for which μ(fΔ₁) = μ(fΔ₂) for all f in 𝓜⁺(𝔛, 𝒜), for some measure μ.

(i) Show that Δ₁ = Δ₂ a.e. [μ] if μΔ₁ < ∞. Hint: Consider the equality μΔ₁{Δ₁ > Δ₂} = μΔ₂{Δ₁ > Δ₂}.

(ii) Show that the assumption of finiteness of μΔ₁ is not required for the conclusion Δ₁ = Δ₂ a.e. [μ] if μ is a sigma-finite measure. Hint: For each positive rational number r show that μ((Δ₁ − Δ₂){Δ₁ > r > Δ₂}A) = 0 for each set A with μA < ∞.

[4] Let P and Q be probability measures on (Ω, 𝓕), and let 𝓔 be a generating class for 𝓕 that is stable under finite intersections and contains Ω. Suppose there exists a p in ℒ¹(Q) such that PE = Q(pE) for every E in 𝓔. Show that P has density p with respect to Q.

[5] Does Example <3> have an analog for ℒᵖ convergence, for some p > 1: if ν is a finite measure dominated by a measure λ, and if {fₙ} is a sequence of measurable functions for which λ|fₙ|ᵖ → 0, does it follow that ν|fₙ|ᵖ → 0?

[6] Let ν and μ be finite measures on the sigma-field σ(𝓔) generated by a countable field 𝓔. Suppose that for each ε > 0 there exists a δ > 0 such that νE < ε for each E in 𝓔 with μE < δ. Show that ν is absolutely continuous with respect to μ, as measures on σ(𝓔). Hint: Suppose μA = 0 for an A in σ(𝓔). Show that there exists a countable subclass {Eₙ} of 𝓔 such that ∪ₙEₙ ⊇ A and μ(∪ₙEₙ) < δ. Use the properties of fields to argue that the Eₙ sets can be chosen disjoint. Deduce that μ(∪_{n≤N} Eₙ) < δ for each finite N. Conclude that νA ≤ lim_N ν(∪_{n≤N} Eₙ) ≤ ε.

[7] Let ν and μ be finite measures on (𝔛, 𝒜). Suppose ν has Lebesgue decomposition (Δ, 𝒩) with respect to μ. Find the Lebesgue decomposition of μ with respect to ν. Hint: Consider 1/Δ on {0 < Δ < ∞}.

[8] Show that a measure μ is sigma-finite if and only if there exists a strictly positive, measurable function Ψ for which μΨ < ∞. Hint: If μ is sigma-finite, consider functions of the form Ψ(x) = Σᵢ aᵢ{x ∈ Aᵢ}, for appropriate partitions.

[9] Suppose (𝒩₁, Δ₁) and (𝒩₂, Δ₂) are both Lebesgue decompositions for a finite measure ν with respect to a finite measure μ. Prove 𝒩₁ = 𝒩₂ a.e. [ν + μ] and Δ₁𝒩₁ᶜ = Δ₂𝒩₂ᶜ a.e. [ν + μ] by the following steps.

(i) Show that 𝒩₁ = 𝒩₂ a.e. [ν] by means of the equality ν(𝒩₂𝒩₁ᶜ) = ν(𝒩₂𝒩₁ᶜ𝒩₁) + μ(𝒩₂𝒩₁ᶜΔ₁) = 0, and a companion equality obtained by interchanging the roles of the two decompositions. Using the trivial fact that 𝒩₁ = 𝒩₂ a.e. [μ] (because both sets are μ-negligible), deduce that 𝒩₁ = 𝒩₂ a.e. [ν + μ].

(ii) Use the result from part (i) to show that

μ(fΔ₁𝒩₁ᶜ) = ν(f𝒩₁ᶜ) = ν(f𝒩₂ᶜ) = μ(fΔ₂𝒩₂ᶜ) for all f ∈ 𝓜⁺.

Deduce that Δ₁𝒩₁ᶜ = Δ₂𝒩₂ᶜ a.e. [μ] by arguing as in Problem [3].

(iii) Use part (ii) to show that

ν{Δ₁𝒩₁ᶜ > Δ₂𝒩₂ᶜ} = ν({Δ₁𝒩₁ᶜ > Δ₂𝒩₂ᶜ}𝒩₁) + μ({Δ₁𝒩₁ᶜ > Δ₂𝒩₂ᶜ}Δ₁𝒩₁ᶜ) = 0.

Hint: What value does Δ₁𝒩₁ᶜ take on 𝒩₁? Argue similarly for the companion equality. Deduce that Δ₁𝒩₁ᶜ = Δ₂𝒩₂ᶜ a.e. [ν + μ].

(iv) Extend the argument to prove the analogous uniqueness assertion for the Lebesgue decomposition when ν and μ are sigma-finite measures. Hint: Decompose 𝔛 into countably many sets {𝔛ᵢ : i ∈ ℕ} with (ν + μ)𝔛ᵢ < ∞ for each i.

[10] Suppose ν and μ are both sigma-finite measures on (𝔛, 𝒜).

(i) Show that ν + μ is also sigma-finite. Deduce via Problem [8] that there exists a strictly positive, measurable function Ψ₀ for which νΨ₀ < ∞ and μΨ₀ < ∞.

(ii) Define ν₀f = ν(fΨ₀) and μ₀f = μ(fΨ₀) for each f in 𝓜⁺. Show that ν₀ and μ₀ are both finite measures.

(iii) From the Lebesgue decomposition ν₀f = ν₀(f𝒩) + μ₀(fΔ) for all f ∈ 𝓜⁺, where μ₀𝒩 = 0 and Δ ∈ 𝓜⁺, derive the corresponding Lebesgue decomposition for ν with respect to μ.

(iv) From the uniqueness result of Problem [9], applied to ν₀ and μ₀, deduce that the Lebesgue decomposition for ν is unique up to the appropriate almost sure equivalences.

[11] Let ν be a finite measure for which ν = ν₁ + λ₁ = ν₂ + λ₂, where each νᵢ is dominated by a fixed sigma-finite measure μ and each λᵢ is singular with respect to μ. Show that ν₁ = ν₂ and λ₁ = λ₂. Hint: If μSᵢᶜ = 0 = λᵢSᵢ and dνᵢ/dμ = Δᵢ, for i = 1, 2, show that μ(fΔ₁S₁S₂) = μ(fΔ₂S₁S₂) for all f in 𝓜⁺, by arguing as in Problem [3].

[12] Let μ₁ and μ₂ be finite measures with densities m₁ and m₂ with respect to a sigma-finite measure λ. Define (μ₁ − μ₂)⁺ as the measure with density (m₁ − m₂)⁺ with respect to λ.

(i) Show that (μ₁ − μ₂)⁺B = sup over 𝒜 of Σ_{A∈𝒜} ((μ₁ − μ₂)(AB))⁺, the supremum running over all finite partitions 𝒜 of 𝔛.

(ii) Show that |μ₁ − μ₂| = (μ₁ − μ₂)⁺ + (μ₂ − μ₁)⁺.

[13] Let P and Q be probability measures with densities p and q with respect to a sigma-finite measure λ. For fixed α ≥ 1, show that Δ_α(P, Q) := λ|p^{1/α} − q^{1/α}|^α does not depend on the choice of dominating measure. Hint: Let μ be another sigma-finite dominating measure. Write ψ for the density of λ with respect to λ + μ. Show that dP/d(λ + μ) = ψp and dQ/d(λ + μ) = ψq. Express Δ_α(P, Q) as an integral with respect to λ + μ. Argue similarly for μ.

[14] Adapt the argument from the previous Problem to show that the relative entropy D(P‖Q) does not depend on the choice of dominating measure.

[15] Let $P$ be the standard Cauchy distribution on the real line, and let $Q$ be the standard normal distribution. Show that $D(P\|Q) = \infty$, even though $P$ and $Q$ are mutually absolutely continuous.

[16] Let $P$ and $Q$ be probability measures defined on the same sigma-field $\mathcal{F}$. A randomized test is a measurable function $f$ with $0 \le f \le 1$. (For observation $\omega$, the value $f(\omega)$ gives the probability of rejecting the "hypothesis" $P$.) Find the test that minimizes $Pf + Q(1 - f)$.

[17] Let $P$ and $Q$ be finite measures defined on the same sigma-field $\mathcal{F}$, with densities $p$ and $q$ with respect to a measure $\mu$. Suppose $\mathcal{X}_0$ is a measurable set with the property that there exists a nonnegative constant $K$ such that $q \ge Kp$ on $\mathcal{X}_0$ and $q \le Kp$ on $\mathcal{X}_0^c$. For each $\mathcal{F}$-measurable function with $0 \le f \le 1$ and $Pf \le P\mathcal{X}_0$, prove that $Qf \le Q\mathcal{X}_0$. Hint: Prove that $(q - Kp)(2\mathcal{X}_0 - 1) \ge (q - Kp)(2f - 1)$, then integrate. To statisticians this result is known as the Neyman-Pearson Lemma.
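The optimality in this Problem can be seen concretely on a finite space. In the Python sketch below (an invented six-point example, not from the text, with $P$ and $Q$ drawn at random), no randomized test of the same $P$-size beats the likelihood-ratio region $\mathcal{X}_0 = \{q \ge Kp\}$.

```python
import numpy as np

# Six-atom illustration of the Neyman-Pearson property in Problem [17]:
# X0 = {q >= K p} should maximize Qf among tests f with Pf <= P(X0).

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))        # density of P (counting measure)
q = rng.dirichlet(np.ones(6))        # density of Q

K = 1.0
X0 = (q >= K * p).astype(float)      # indicator of the likelihood-ratio region
size, power = float(p @ X0), float(q @ X0)

beaten = False
for _ in range(2000):                # random competing randomized tests
    f = rng.uniform(0, 1, size=6)
    if p @ f <= size and q @ f > power + 1e-12:
        beaten = True
print(beaten)   # -> False
```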


[18] Let $\mu = \mu_1 - \mu_2$ and $\mu' = \mu_1' - \mu_2'$ be signed measures, in the sense of Section 3. Show that $\nu := (\mu_1 + \mu_2') \wedge (\mu_1' + \mu_2) - (\mu_2 + \mu_2')$ is the largest signed measure for which $\nu \le \mu$ and $\nu \le \mu'$.

[19] Let $f(x) = (1 + x)\log(1 + x) - x$, for $x > -1$. Use the representations to show that $f(x) \ge \tfrac{1}{2}x^2(1 + x/3)^{-1}$.
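The inequality can be sanity-checked numerically before attempting the proof. The Python fragment below (illustration only; the grid endpoints are arbitrary) samples $x$ over a wide range:

```python
import numpy as np

# Numerical check of Problem [19]: f(x) = (1+x)log(1+x) - x should
# dominate (x^2/2) / (1 + x/3) for x > -1.

x = np.linspace(-0.999, 50.0, 100001)
f = (1.0 + x) * np.log1p(x) - x          # log1p(x) = log(1 + x), stable near 0
bound = 0.5 * x**2 / (1.0 + x / 3.0)
print(bool(np.all(f >= bound - 1e-12)))  # -> True
```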

[20] Show that the function $x^{-1}\cos(x^{-2})$ is not integrable on $[0, 1]$. Hint: Consider contributions from intervals where $x^{-2}$ lies in a range $n\pi \pm \pi/4$. The intervals have lengths of order $n^{-3/2}$, but the $x^{-1}$ contributes a factor of order $n^{1/2}$.

[21] Prove Theorem <19> by establishing the following assertions. With no loss of generality, assume $a = 0$ and $b = 1$ and $H(0) = 0$. Write $m$ for Lebesgue measure on $\mathcal{B}[0, 1]$. Fix $\epsilon > 0$. Define $A_n := \{n\epsilon \le h < (n + 1)\epsilon\}$ for each integer $n$, both positive and negative. Note that

$h^+(x) = \sum_{n \ge 0} h(x)\{x \in A_n\}$ and $h^-(x) = -\sum_{n < 0} h(x)\{x \in A_n\}$.

(i) There exist open sets $G_n$ and compact sets $K_n$ with $K_n \subseteq A_n \subseteq G_n$ and $m(G_n \backslash K_n) < \epsilon_n$, with $\epsilon_n$ so small that the functions

$f(x) := \sum_{n \ge 0}(n + 1)\epsilon\{x \in G_n\}$ and $g(x) := \sum_{n < 0}|n + 1|\epsilon\{x \in K_n\}$

satisfy the inequalities $f \ge h^+$ and $g \le h^-$ and $m(f - h^+) + m(h^- - g) < \epsilon$.

(ii) The function $h_\epsilon := \epsilon + f - \sum_{-N \le n < 0}|n + 1|\epsilon\{x \in K_n\}$, for a large enough positive integer $N$, has the properties $h_\epsilon \ge h + \epsilon$ and $m(h_\epsilon - h) < 3\epsilon$.

(iii) The function $h_\epsilon$ is lower semicontinuous, that is, for each real number $r$ the set $\{h_\epsilon > r\}$ is open.

(iv) For $0 \le x \le 1$, define a continuous function $G(x) := \int_0^x h_\epsilon(t)\,dt - H(x)$. It achieves its maximum at some point $y$ in $[0, 1]$. We cannot have $y < 1$, for otherwise there would be a small $\delta > 0$ for which $h_\epsilon(t) \ge h(y) + \epsilon$ for $y \le t \le y + \delta \le 1$, and $\big(H(y + \delta) - H(y)\big)/\delta \le h(y) + \epsilon/2$, from which we could conclude that $\big(G(y + \delta) - G(y)\big)/\delta \ge h(y) + \epsilon - \big(h(y) + \epsilon/2\big) = \epsilon/2 > 0$, a contradiction. Thus $G(1) \ge G(0) = 0$. That is, $H(1) \le \int_0^1 h_\epsilon(t)\,dt \le \int_0^1 h(t)\,dt + 4\epsilon$, for each $\epsilon > 0$.

(v) Similarly $H(x) \le \int_0^x h(t)\,dt$ for $0 \le x \le 1$.

(vi) An analogous argument applied to $-H$ gives the reverse inequality, letting us conclude that $H(x) = \int_0^x h(t)\,dt$ for $0 \le x \le 1$. The function $H$ is absolutely continuous on $[0, 1]$.

[22] Let $H$ be a continuous real function defined on $[0, 1)$. Suppose the right-hand derivative $h$ exists and is finite everywhere. If $h$ is an increasing function, show that $H$ is convex. Hint: For fixed $0 \le x_0 < x_1 < 1$ and $0 \le \alpha \le 1$ define $x_\alpha = (1 - \alpha)x_0 + \alpha x_1$. Show that

$(1 - \alpha)H(x_0) + \alpha H(x_1) - H(x_\alpha) = \int_0^1 \big(\alpha\{x_\alpha \le t < x_1\} - (1 - \alpha)\{x_0 \le t < x_\alpha\}\big)h(t)\,dt \ge 0.$


[23] Let $h$ be integrable with respect to Lebesgue measure $m$ on $\mathbb{R}^d$. Show that

$\lim_{r \to 0} \frac{m^x\big(\{|x - z| < r\}\,h(x)\big)}{m B(z, r)} = h(z)$ a.e. $[m]$.

Generalize by replacing open balls by decreasing sequences of closed sets $F_n \downarrow \{z\}$ that are regular in the sense of Section 5.

8. Notes

The Fundamental Theorem <n> is due to Lebesgue, proved in part in his doctoral dissertation (Lebesgue 1902), and completed in his Peccot lectures (Lebesgue 1904). Read Hawkins (1979) if you want to appreciate the subtlety and great significance of Lebesgue's contributions. Note, in particular, Hawkins's comments on page 145 regarding the introduction of the term absolute continuity. Compare the footnotes on page 129 (as reprinted in his collected works) of the 1904 edition of the Peccot lectures (here in English translation),

For a function to be an indefinite integral, it is further necessary that its total variation over a countably infinite collection of intervals of total length $\ell$ tend to zero with $\ell$.

If, in the statement on page 94, one does not require $f(x)$ to be bounded, nor $F(x)$ to have bounded derived numbers, but only to satisfy the preceding condition, one obtains a definition of the integral equivalent to the one developed in this Chapter, applicable to all summable functions, bounded or not.

and on page 188 of the reprinted 1928 edition,

In the first edition of this book I had noted this statement, in a footnote on page 128, quite incidentally and without proof. M. Vitali rediscovered this theorem and published the first proof of it (Acc. Reale delle Sc. di Torino, 1904-1905). It was on the occasion of this theorem that M. Vitali introduced, for functions of one variable, the name absolutely continuous function, and showed the simplicity and clarity that the whole theory takes on when this notion is made its foundation.

The essay by Lebesgue (1926) contains a clear account of the transformation of absolute continuity from a property of functions to a property of measures. The book of Benedetto (1976), which is particularly interesting for its discussion of the role played by Vitali, and the Notes to Chapters 5 and 7 of Dudley (1989), provide more historical background.

For the discussion in Sections 4 through 6, I borrowed ideas from Saks (1937, Chapters 4 and 7), Royden (1968, Chapter 8), Benedetto (1976, Chapter 4), and Wheeden & Zygmund (1977, Chapter 7). The methods of Section 6 extend easily to higher dimensional Euclidean spaces.

See Royden (1968, Chapter 6) and Dunford & Schwartz (1958, Chapter 3) for more about the Radon-Nikodym theorem. According to Dudley (1989, page 141), the method of proof used in Section 2 is due to von Neumann (1940), but I have not seen the original paper.


Inequality <13> is due to Kemperman (1969), Csiszár (1967), and Kullback (1967). (In fact, Kullback established a slightly better inequality.) My proof is based on Kemperman's argument. See Devroye (1987, Chapter 1) for other inequalities involving distances between probability densities.

REFERENCES

Benedetto, J. J. (1976), Real Variable and Integration, Mathematische Leitfäden, Teubner, Stuttgart. Subtitled "with Historical Notes".

Csiszár, I. (1967), 'Information-type measures of difference of probability distributions and indirect observations', Studia Scientiarum Mathematicarum Hungarica 2, 299-318.

Devroye, L. (1987), A Course in Density Estimation, Birkhäuser, Boston.

Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.

Dunford, N. & Schwartz, J. T. (1958), Linear Operators, Part I: General Theory, Wiley.

Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.

Kemperman, J. H. B. (1969), On the optimum rate of transmitting information, in 'Probability and Information Theory', Lecture Notes in Mathematics 89, Springer-Verlag, pages 126-169.

Kullback, S. (1967), 'A lower bound for discrimination information in terms of variation', IEEE Transactions on Information Theory 13, 126-127.

Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7. Included in the first volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique.

Lebesgue, H. (1904), Leçons sur l'intégration et la recherche des fonctions primitives, first edn, Gauthier-Villars, Paris. Included in the second volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique. Second edition published in 1928. Third edition, 'an unabridged reprint of the second edition, with minor changes and corrections', published in 1973 by Chelsea, New York.

Lebesgue, H. (1926), 'Sur le développement de la notion d'intégrale', Matematisk Tidsskrift B. English version in the book Measure and Integral, edited and translated by Kenneth O. May.

Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.

Saks, S. (1937), Theory of the Integral, second edn, Dover. English translation of the second edition of a volume first published in French in 1933. Page references are to the 1964 Dover edition.

von Neumann, J. (1940), 'On rings of operators, III', Annals of Mathematics 41, 94-161.

Wheeden, R. & Zygmund, A. (1977), Measure and Integral: An Introduction to Real Analysis, Marcel Dekker.


Chapter 4

Product spaces and independence

SECTION 1 introduces independence as a property that justifies some sort of factorization of probabilities or expectations. A key factorization Theorem is stated, with proof deferred to the next Section, as motivation for the measure theoretic approach. The Theorem is illustrated by a derivation of a simple form of the strong law of large numbers, under an assumption of bounded fourth moments.

SECTION 2 formally defines independence as a property of sigma-fields. The key Theorem from Section 1 is used as motivation for the introduction of a few standard techniques for dealing with independence. Product sigma-fields are defined.

SECTION 3 describes a method for constructing measures on product spaces, starting from a family of kernels.

SECTION 4 specializes the results from Section 3 to define product measures. The Tonelli and Fubini theorems are deduced. Several important applications are presented.

SECTION *5 discusses some difficulties encountered in extending the results of Sections 3 and 4 when the measures are not sigma-finite.

SECTION 6 introduces a blocking technique to refine the proof of the strong law of large numbers from Section 1, to get a version that requires only a second moment condition.

SECTION *7 introduces a truncation technique to further refine the proof of the strong law of large numbers, to get a version that requires only a first moment condition for identically distributed summands.

SECTION *8 discusses the construction of probability measures on products of countably many spaces.

1. Independence

Much classical probability theory, such as the laws of large numbers and central limit theorems, rests on assumptions of independence, which justify factorizations for probabilities of intersections of events or expectations for products of random variables.

An elementary treatment usually starts from the definition of independence for events. Two events $A$ and $B$ are said to be independent if $P(AB) = (PA)(PB)$; three events $A$, $B$, and $C$ are said to be independent if not only $P(ABC) = (PA)(PB)(PC)$ but also $P(AB) = (PA)(PB)$ and $P(AC) = (PA)(PC)$ and $P(BC) = (PB)(PC)$. And so on. There are similar definitions for independence of random variables, in terms of joint distribution functions or joint densities. The definitions have two things in common: they all assert some type of factorization; and they do not lend themselves to elementary derivation of desirable facts about independence. The measure theoretic approach, by contrast, simplifies the study of independence by eliminating unnecessary duplications of definitions, replacing them by a single concept of independence for sigma-fields, from which useful consequences are easily deduced. For example, the following key assertion is impossible to derive by elementary means, but requires only routine effort (see Section 2) to establish by measure theoretic arguments.

<1> Theorem. Let $Z_1, \ldots, Z_n$ be independent random variables on a probability space $(\Omega, \mathcal{F}, P)$. If $f \in \mathcal{M}^+(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ and $g \in \mathcal{M}^+(\mathbb{R}^{n-k}, \mathcal{B}(\mathbb{R}^{n-k}))$ then $f(Z_1, \ldots, Z_k)$ and $g(Z_{k+1}, \ldots, Z_n)$ are independent random variables, and

$P f(Z_1, \ldots, Z_k)\,g(Z_{k+1}, \ldots, Z_n) = \big(P f(Z_1, \ldots, Z_k)\big)\big(P g(Z_{k+1}, \ldots, Z_n)\big).$

<2> Corollary. The same conclusion (independence and factorization) holds for Borel measurable functions $f$ and $g$ taking both positive and negative values if both $f(Z_1, \ldots, Z_k)$ and $g(Z_{k+1}, \ldots, Z_n)$ are integrable.

As you will see at the end of Section 2, the Corollary follows easily from addition and subtraction of analogous results for the functions $f^\pm$ and $g^\pm$. Problem [10] shows that the result also extends to cases where some of the integrals are infinite, provided $\infty - \infty$ problems are ruled out.

The best way for you to understand the worth of Theorem <1> and its Corollary is to see it used. At the risk of interrupting the flow of ideas, I will digress slightly to present an instructive application.

The proof of the strong law of large numbers (often referred to by means of the acronym SLLN) illustrates well the use of Corollary <2>. Actually, several slightly different results answer to the name SLLN. A law of large numbers asserts convergence of averages to expectations, in some sense. The word "strong" specifies almost sure convergence. The various SLLN's differ in the assumptions made about the individual summands. The most common form invoked in statistical applications goes as follows.

<3> Theorem. (Kolmogorov) Let $X_1, X_2, \ldots$ be independent, integrable random variables, each with the same distribution and common expectation $\mu$. Then the average $(X_1 + \ldots + X_n)/n$ converges almost surely to $\mu$.

REMARK. If $P|X_1| = \infty$ then $(X_1 + \ldots + X_n)/n$ cannot converge almost surely to a finite limit (Problem [21]). Moreover Kolmogorov's zero-one law (Example <12>) implies that it cannot even converge to a finite limit at each point of a set with strictly positive probability. If only one of $PX_1^\pm$ is infinite, the average still converges almost surely to $PX_1$ (Problem [20]).

A complete proof of this form of the SLLN is quite a challenge. The classical proof (a modified version of which appears in Sections 6 and 7) combines a number of tricks that are more easily understood if introduced as separate ideas and not just rolled into one monolithic argument. The basic idea is not too hard to grasp when we have bounded fourth moments; it involves little more than an application of Corollary <2> and an appeal to the Borel-Cantelli lemma from Section 2.6.


For theoretical purposes, for summands that need not all have the same distribution, it is cleaner to work with the centered variables $X_i - PX_i$, which is equivalent to an assumption that all variables have zero expected values.

<4> Theorem. Let $X_1, X_2, \ldots$ be independent random variables with $PX_i = 0$ for every $i$ and $\sup_i PX_i^4 < \infty$. Then $(X_1 + \ldots + X_n)/n \to 0$ almost surely.

Proof. Define $S_n := X_1 + \ldots + X_n$. It is good enough to show, for each $\epsilon > 0$, that

<5> $\qquad \sum_n P\{|S_n|/n > \epsilon\} < \infty.$

Do you remember why? If not, you should refer to Section 2.6 for a detailed explanation of the Borel-Cantelli argument: the series $\sum_n \{|S_n|/n > \epsilon\}$ must converge almost surely, which implies that $\limsup |S_n/n| \le \epsilon$ almost surely, from which the conclusion $\limsup |S_n/n| = 0$ follows after a casting out of a sequence of negligible sets.

Bound the $n$th term of the sum in <5> by $(n\epsilon)^{-4} P(X_1 + \ldots + X_n)^4$. Expand the fourth power:

$(X_1 + \ldots + X_n)^4 = \sum_i X_i^4 + (\text{lots of terms like } X_1^3 X_2) + (\text{terms like } 6X_1^2 X_2^2) + (\text{lots of terms like } X_1^2 X_2 X_3) + (\text{lots of terms like } X_1 X_2 X_3 X_4).$

The contributions to $P(X_1 + \ldots + X_n)^4$ from the five groups of terms are:

[1] $\sum_{i \le n} PX_i^4 \le nM$, where $M := \sup_i PX_i^4$;

[2] zero, because $P(X_1^3 X_2) = (PX_1^3)(PX_2) = 0$;

[3] less than $12\binom{n}{2}M$, because $P(X_i^2 X_j^2) \le PX_i^4 + PX_j^4 \le 2M$;

[4] zero, because $P(X_1^2 X_2 X_3) = (PX_1^2 X_2)(PX_3) = 0$;

[5] zero, because $P(X_1 X_2 X_3 X_4) = (PX_1 X_2 X_3)(PX_4) = 0$.

Notice all the factorizations due to independence. Combining these bounds and equalities we get $P\{|S_n|/n > \epsilon\} = O(n^{-2})$, from which <5> follows.

If you feel that Theorem <4> is good enough for 'practical purposes,' and that all the extra work to whittle a fourth moment assumption down to a first moment assumption is hardly worth the gain in generality, you might like to contemplate the following example. How natural, or restrictive, would it be if we were to assume finite fourth moments?

<6> Example. Let $\{P_\theta : \theta = 0, 1, \ldots, N\}$ be a finite family of distinct probability measures, defined by densities $\{p_\theta\}$ with respect to a measure $\mu$. Suppose observations $X_1, X_2, \ldots$ are generated independently from $P_0$. The maximum likelihood estimator $\hat\theta_n(\omega)$ is defined as the value that maximizes $L_n(\theta, \omega) := \prod_{i \le n} p_\theta(X_i(\omega))$. The SLLN will show that $P\{\hat\theta_n = 0 \text{ eventually}\} = 1$. That is, the maximum likelihood estimator eventually picks the true value of $\theta$.


It will be enough to show, for each $\theta \ne 0$, that with probability one, $\log(L_n(\theta)/L_n(0)) < 0$ eventually. For fixed $\theta \ne 0$ define $\ell_i := \log\big(p_\theta(X_i)/p_0(X_i)\big)$. By Jensen's inequality, with a strict inequality because $P_\theta \ne P_0$,

$P_0\ell_i = P_0\log(p_\theta/p_0) < \log P_0(p_\theta/p_0) = \log \mu^x\big(p_\theta(x)\{p_0(x) \ne 0\}\big) \le 0.$

By the SLLN (or its extension from Problem [20] if $P\ell_i = -\infty$), for almost all $\omega$ there exists a finite $n_0(\omega, \theta)$ for which $0 > n^{-1}\sum_{i \le n}\ell_i := n^{-1}\log(L_n(\theta)/L_n(0))$ when $n \ge n_0(\omega, \theta)$. When $n \ge \max_\theta n_0(\omega, \theta)$, we have $\max_{\theta \ge 1} L_n(\theta) < L_n(0)$, in which case the maximizing $\hat\theta_n$ prefers $0$ to each $\theta \ge 1$.

REMARK. Notice that the argument would not work if the index set were infinite. To handle such sets, one typically imposes compactness assumptions to reduce to the finite case, by means of a much-imitated method originally due to Wald (1949).
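Example <6> is easy to simulate. In the Python sketch below, a made-up Bernoulli family stands in for the finite family of densities (the success probabilities are my own invented parameters, not from the text); the maximum likelihood index settles on the true value $\theta = 0$.

```python
import numpy as np

# Simulation for Example <6>: a finite family of distinct Bernoulli
# models p_theta; data generated from P_0; the MLE should pick theta = 0.

rng = np.random.default_rng(1)
success = np.array([0.3, 0.5, 0.7, 0.9])      # p_theta for theta = 0, 1, 2, 3
X = rng.binomial(1, success[0], size=2000)    # observations from P_0

loglik = np.array([np.sum(X * np.log(s) + (1 - X) * np.log(1 - s))
                   for s in success])          # log L_n(theta) for each theta
theta_hat = int(np.argmax(loglik))
print(theta_hat)   # -> 0
```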

2. Independence of sigma-fields

Technically speaking, the best treatment of independence starts with the concept of independent sub-sigma-fields of $\mathcal{F}$, for a fixed probability space $(\Omega, \mathcal{F}, P)$. This Section will develop the appropriate definitions and techniques for dealing with independence of sigma-fields, using the ideas needed for the proof of Theorem <2> as motivation.

<7> Definition. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Sub-sigma-fields $\mathcal{G}_1, \ldots, \mathcal{G}_n$ of $\mathcal{F}$ are said to be independent if

$P(G_1 \ldots G_n) = (PG_1)\ldots(PG_n)$ for all $G_i \in \mathcal{G}_i$, for $i = 1, \ldots, n$.

An infinite collection of sub-sigma-fields $\{\mathcal{G}_i : i \in I\}$ is said to be independent if each finite subcollection is independent, that is, if $P(\cap_{i \in S} G_i) = \prod_{i \in S} PG_i$ for all finite subsets $S$ of $I$, and all choices $G_i \in \mathcal{G}_i$ for each $i$ in $S$.

The definition neatly captures all the factorizations involved in the elementarydefinitions of independence for more than two events.

<8> Example. Let $A$, $B$, and $C$ be events. They generate sigma-fields $\mathcal{A} = \{\emptyset, A, A^c, \Omega\}$, and $\mathcal{B} = \{\emptyset, B, B^c, \Omega\}$, and $\mathcal{C} = \{\emptyset, C, C^c, \Omega\}$. Independence of the three sigma-fields requires factorization for $4^3 = 64$ triples of events, amongst which are the four factorizations stated at the start of Section 1 as the elementary definition of independence for the three events $A$, $B$, and $C$. In fact, all 64 factorizations are consequences of those four. For example, any factorization where one of the factors is the empty set will reduce to the identity $0 = 0$. The factorization $P(AB^cC) = (PA)(PB^c)(PC)$ follows from $P(AC) = (PA)(PC)$ and $P(ABC) = (PA)(PB)(PC)$, by subtraction. And so on.
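The bookkeeping in this Example can be delegated to a computer. The Python sketch below (the probabilities 0.3, 0.5, 0.8 are arbitrary illustrative choices) builds the three generated sigma-fields on the eight atoms of $\sigma(A, B, C)$ and runs through all $4^3 = 64$ triples.

```python
from itertools import product

# Check all 4^3 = 64 factorizations for sigma-fields generated by three
# independent events A, B, C with P(A), P(B), P(C) = 0.3, 0.5, 0.8.

pA, pB, pC = 0.3, 0.5, 0.8
atom = {(a, b, c): (pA if a else 1 - pA) * (pB if b else 1 - pB)
                   * (pC if c else 1 - pC)
        for a, b, c in product((0, 1), repeat=3)}   # the 8 atoms
omega = set(atom)

def prob(event):                      # event: a set of atoms
    return sum(atom[w] for w in event)

A = {w for w in omega if w[0]}
B = {w for w in omega if w[1]}
C = {w for w in omega if w[2]}
sigma = lambda E: [set(), E, omega - E, omega]   # {empty, E, E^c, Omega}

ok = all(abs(prob(G1 & G2 & G3) - prob(G1) * prob(G2) * prob(G3)) < 1e-12
         for G1, G2, G3 in product(sigma(A), sigma(B), sigma(C)))
print(ok)   # -> True
```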

Generating class arguments, such as the $\pi$-$\lambda$ Theorem from Section 2.10, make it easy to derive facts about independent sigma-fields. For example, Problem [8] uses such arguments in a routine way to establish the following result.


<9> Theorem. Let $\mathcal{E}_1, \ldots, \mathcal{E}_n$ be classes of measurable sets, each class stable under finite intersections and containing the whole space $\Omega$. If

$P(E_1 E_2 \ldots E_n) = (PE_1)(PE_2)\ldots(PE_n)$ for all $E_i \in \mathcal{E}_i$, for $i = 1, 2, \ldots, n$,

then the sigma-fields $\sigma(\mathcal{E}_1), \sigma(\mathcal{E}_2), \ldots, \sigma(\mathcal{E}_n)$ are independent.

REMARK. The requirement that $\Omega \in \mathcal{E}_i$ for each $i$ is just a sneaky way of getting factorizations for intersections of fewer than $n$ sets.

<10> Corollary. Let $\{\mathcal{E}_i : i \in I\}$ be classes of measurable sets, each stable under finite intersections. If $P(\cap_{i \in S} E_i) = \prod_{i \in S} PE_i$ for all finite subsets $S$ of $I$, and all choices $E_i \in \mathcal{E}_i$ for each $i$ in $S$, then the sigma-fields $\sigma(\mathcal{E}_i)$, for $i \in I$, are independent.

Proof. Notice the alternative to requiring $\Omega \in \mathcal{E}_i$ for every $i$. Theorem <9> establishes independence for each finite subcollection.

<11> Corollary. Let $\{\mathcal{G}_i : i \in I\}$ be independent sigma-fields. If $\{I_j : j \in J\}$ are disjoint subsets of $I$, then the sigma-fields $\sigma(\cup_{i \in I_j}\mathcal{G}_i)$, for $j \in J$, are independent.

Proof. Invoke Corollary <10> with $\mathcal{E}_j$ consisting of the collection of all finite intersections of sets chosen from $\cup_{i \in I_j}\mathcal{G}_i$.

<12> Example. Let $\{\mathcal{G}_i : i \in \mathbb{N}\}$ be a sequence of independent sigma-fields. For each $n$ let $\mathcal{H}_n$ denote the sigma-field generated by $\cup_{i > n}\mathcal{G}_i$. The tail sigma-field is defined as $\mathcal{H}_\infty := \cap_n \mathcal{H}_n$. Kolmogorov's zero-one law asserts that, for each $H$ in $\mathcal{H}_\infty$, either $PH = 0$ or $PH = 1$. Equivalently, the sigma-field $\mathcal{H}_\infty$ is independent of itself, so that $P(HH) = (PH)(PH)$ for every $H$ in $\mathcal{H}_\infty$.

For each finite $n$, Corollary <11> implies independence of $\mathcal{H}_n, \mathcal{G}_1, \ldots, \mathcal{G}_n$. From the fact that $\mathcal{H}_\infty \subseteq \mathcal{H}_n$ for every $n$, it then follows that each finite subcollection of $\{\mathcal{H}_\infty, \mathcal{G}_i : i \in \mathbb{N}\}$ is independent, and hence the whole collection of sigma-fields is independent. From Corollary <11> again, $\mathcal{H}_\infty$ and $\mathcal{G}_\infty := \sigma(\cup_{i \in \mathbb{N}}\mathcal{G}_i)$ are independent. To complete the argument, note that $\mathcal{G}_\infty \supseteq \mathcal{H}_\infty$.

Random variables (or random vectors, or random elements of more general spaces) inherit their definition of independence from the sigma-fields they generate. Recall that if $X$ is a map from $\Omega$ into a set $\mathcal{X}$, equipped with a sigma-field $\mathcal{A}$, then the sigma-field $\sigma(X)$ on $\Omega$ generated by $X$ is defined as the smallest sigma-field $\mathcal{G}$ for which $X$ is $\mathcal{G}\backslash\mathcal{A}$-measurable. It consists of all sets of the form $\{\omega \in \Omega : X(\omega) \in A\}$, with $A \in \mathcal{A}$.

REMARK. The extra generality gained by allowing maps into arbitrary measurablespaces will not be wasted; but in the first instance you could safely imagine eachspace to be the real line, ignoring the fact that the definition also covers independenceof random vectors and independence of stochastic processes.

<13> Definition. Measurable maps $X_i$, for $i \in I$, from $\Omega$ into measurable spaces $(\mathcal{X}_i, \mathcal{A}_i)$ are said to be independent if the sigma-fields that they generate are independent, that is, if

$P\Big(\bigcap_{i \in S}\{X_i \in A_i\}\Big) = \prod_{i \in S} P\{X_i \in A_i\},$

for all finite subsets $S$ of the index set $I$, and all choices of $A_i \in \mathcal{A}_i$ for $i \in S$.


Results about independent random variables are usually easy to deduce from the corresponding results about independent sigma-fields.

<15> Example. Real random variables $X_1$ and $X_2$ for which

$P\{X_1 \le x_1,\, X_2 \le x_2\} = P\{X_1 \le x_1\}P\{X_2 \le x_2\}$ for all $x_1, x_2$ in $\mathbb{R}$

are independent, because the collections of sets $\mathcal{E}_i = \{\{X_i \le x\} : x \in \mathbb{R}\}$ are both stable under finite intersections, and $\sigma(X_i) = \sigma(\mathcal{E}_i)$.

We now have the tools needed to establish Theorem <2>. Write $X_1$ for $f(Z_1, \ldots, Z_k)$ and $X_2$ for $g(Z_{k+1}, \ldots, Z_n)$. Write $\mathcal{G}_i$ for $\sigma(Z_i)$, the sigma-field generated by the random variable $Z_i$. From Corollary <11>, the sigma-fields $\mathcal{F}_1 := \sigma(\mathcal{G}_1 \cup \ldots \cup \mathcal{G}_k)$ and $\mathcal{F}_2 := \sigma(\mathcal{G}_{k+1} \cup \ldots \cup \mathcal{G}_n)$ are independent. If we can show that $X_1$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R})$-measurable and $X_2$ is $\mathcal{F}_2\backslash\mathcal{B}(\mathbb{R})$-measurable, then their independence will follow: we will have the desired factorization for all sets of the form $\{X_1 \in A_1\}$ and $\{X_2 \in A_2\}$, for Borel sets $A_1$ and $A_2$.

Consider first the measurability property for $X_1$. Temporarily write $Z$ for $(Z_1, \ldots, Z_k)$, a map from $\Omega$ into $\mathbb{R}^k$. We need to show that the set

$\{X_1 \in A\} = \{Z \in f^{-1}(A)\}$

belongs to $\mathcal{F}_1$ for every $A$ in $\mathcal{B}(\mathbb{R})$. The $\mathcal{B}(\mathbb{R}^k)\backslash\mathcal{B}(\mathbb{R})$-measurability of $f$ ensures that $f^{-1}(A) \in \mathcal{B}(\mathbb{R}^k)$. We therefore need only show that $\{Z \in B\} \in \mathcal{F}_1$ for every $B$ in $\mathcal{B}(\mathbb{R}^k)$, that is, that the map $Z$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R}^k)$-measurable.

As with many measure theoretic problems, it is better to turn the question around and ask: For how extensive a class of sets $B$ does $\{Z \in B\}$ belong to $\mathcal{F}_1$? It is very easy to show that the class $\mathcal{B}_0$ of all such $B$ is a sigma-field; so $Z$ is an $\mathcal{F}_1\backslash\mathcal{B}_0$-measurable function. Moreover, for all choices of $A_i \in \mathcal{B}(\mathbb{R})$, the set

$D := \{(z_1, \ldots, z_k) \in \mathbb{R}^k : z_i \in A_i \text{ for } i = 1, \ldots, k\}$

belongs to $\mathcal{B}_0$ because $\{Z \in D\} = \cap_i\{Z_i \in A_i\} \in \mathcal{F}_1$. As shown in Problem [6], the collection of all such $D$ sets generates the Borel sigma-field $\mathcal{B}(\mathbb{R}^k)$. Thus $\mathcal{B}(\mathbb{R}^k) \subseteq \mathcal{B}_0$, and $\{Z \in B\} \in \mathcal{F}_1$ for all $B \in \mathcal{B}(\mathbb{R}^k)$. It follows that $X_1$ is $\mathcal{F}_1\backslash\mathcal{B}(\mathbb{R})$-measurable. Similarly, $X_2$ is $\mathcal{F}_2\backslash\mathcal{B}(\mathbb{R})$-measurable. The random variables $X_1$ and $X_2$ are independent, as asserted by Theorem <2>.

The whole argument can be carried over to random elements of more generalspaces if we work with the right sigma-fields.

<16> Definition. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be sets equipped with sigma-fields $\mathcal{A}_1, \ldots, \mathcal{A}_n$. The set of all ordered $n$-tuples $(x_1, \ldots, x_n)$, with $x_i \in \mathcal{X}_i$ for each $i$, is denoted by $\mathcal{X}_1 \times \ldots \times \mathcal{X}_n$ or $\times_{i \le n}\mathcal{X}_i$. It is called the product of the $\{\mathcal{X}_i\}$. A set of the form

$A_1 \times \ldots \times A_n = \{(x_1, \ldots, x_n) \in \mathcal{X}_1 \times \ldots \times \mathcal{X}_n : x_i \in A_i \text{ for each } i\},$

with $A_i \in \mathcal{A}_i$ for each $i$, is called a measurable rectangle. The product sigma-field $\mathcal{A}_1 \otimes \ldots \otimes \mathcal{A}_n$ on $\mathcal{X}_1 \times \ldots \times \mathcal{X}_n$ is defined to be the sigma-field generated by all measurable rectangles.

REMARK. Even if $n$ equals 2 and $\mathcal{X}_1 = \mathcal{X}_2 = \mathbb{R}$, there is no presumption that either $A_1$ or $A_2$ is an interval: a measurable rectangle might be composed of many disjoint pieces. The symbol $\otimes$ in place of $\times$ is intended as a reminder that $\mathcal{A}_1 \otimes \mathcal{A}_2$ consists of more than the set of all measurable rectangles $A_1 \times A_2$.

If $Z_i$ is an $\mathcal{F}\backslash\mathcal{A}_i$-measurable map from $\Omega$ into $\mathcal{X}_i$, for $i = 1, \ldots, n$, then the map $\omega \mapsto Z(\omega) = (Z_1(\omega), \ldots, Z_n(\omega))$ from $\Omega$ into $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_n$ is $\mathcal{F}\backslash\mathcal{A}$-measurable, where $\mathcal{A}$ denotes the product sigma-field $\mathcal{A}_1 \otimes \ldots \otimes \mathcal{A}_n$. If $f$ is an $\mathcal{A}\backslash\mathcal{B}(\mathbb{R})$-measurable real-valued function on $\mathcal{X}$ then $f(Z)$ is $\mathcal{F}\backslash\mathcal{B}(\mathbb{R})$-measurable.

The second assertion of Theorem <1> is now reduced to a factorization property for products of independent random variables, a result easily deduced from the defining factorization for independence of sigma-fields by means of the usual approximation arguments.

<17> Lemma. Let $X$ and $Y$ be independent random variables. If either $X \ge 0$ and $Y \ge 0$, or both $X$ and $Y$ are integrable, then $P(XY) = (PX)(PY)$. The product $XY$ is integrable if both $X$ and $Y$ are integrable.

Proof. Consider first the case of nonnegative variables. Express $X$ and $Y$ as monotone increasing limits of simple random variables (as in Section 2.2), $X_n := 2^{-n}\sum_{1 \le i \le 4^n}\{X \ge i/2^n\}$ and $Y_n := 2^{-n}\sum_{1 \le j \le 4^n}\{Y \ge j/2^n\}$. Then, for each $n$,

$P(X_n Y_n) = 4^{-n}\sum_{i,j} P\big(\{X \ge i/2^n\}\{Y \ge j/2^n\}\big)$

$= 4^{-n}\sum_{i,j}\big(P\{X \ge i/2^n\}\big)\big(P\{Y \ge j/2^n\}\big)$ by independence

$= \big(2^{-n}\sum_i P\{X \ge i/2^n\}\big)\big(2^{-n}\sum_j P\{Y \ge j/2^n\}\big)$

$= (PX_n)(PY_n).$

Invoke Monotone Convergence twice in the passage to the limit to deduce $P(XY) = (PX)(PY)$.

For the case of integrable random variables, factorize expectations for products of positive and negative parts, $P(X^\pm Y^\pm) = (PX^\pm)(PY^\pm)$. Each of the four products represented by the right-hand side is finite. Complete the argument by splitting each term on the right-hand side of the decomposition

$XY = X^+Y^+ - X^+Y^- - X^-Y^+ + X^-Y^-$

into a product of expectations, then refactorize as $(PX^+ - PX^-)(PY^+ - PY^-)$. Integrability of $XY$ follows from a similar decomposition for $|XY|$.
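The simple-function approximation used in the proof is easy to see numerically. The Python sketch below (illustration only; the point $x = 2.718$ is an arbitrary choice) evaluates $X_n$ at a single point and checks that the values increase toward $x$.

```python
import numpy as np

# The dyadic approximation from the proof of Lemma <17>:
# X_n := 2^{-n} sum_{i=1}^{4^n} 1{X >= i/2^n} increases to X pointwise.

def approx(x, n):
    i = np.arange(1, 4**n + 1)
    return float(2.0**(-n) * np.sum(x >= i / 2.0**n))

x = 2.718
vals = [approx(x, n) for n in range(1, 8)]
monotone = all(a <= b for a, b in zip(vals, vals[1:]))
print(monotone and vals[-1] <= x and x - vals[-1] < 2.0**(-7))   # -> True
```

Each $X_n$ rounds $x$ down to the grid of multiples of $2^{-n}$ (capped at $2^n$), which is why the sequence increases and the error at level $n$ is below $2^{-n}$.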

3. Construction of measures on a product space

The probabilistic concepts of independence and conditioning are both closely related to the measure theoretic constructions for measures on product spaces. As you will see in Chapter 5, conditioning may be thought of as an inverse operation to a general construction whereby a measure on a product space is built from families of measures on the component spaces. For probability measures the components have the interpretation of distributions involved in a two-stage experiment. Product measures, and independence, correspond to the special case where the second stage of the experiment does not depend on the first stage. Many traditional facts about independence, such as the assertion of Theorem <2>, have interpretations as facts about product measures.

If you want to understand independence then you should learn about product measures. If you want to understand conditioning you should learn about the more general construction. We kill two birds with one stone by starting with the general case. The full generality will be needed in Chapter 5.

To keep the notation simple, I will mostly consider only measures on a product of two spaces, $(\mathcal{X}, \mathcal{A})$ and $(\mathcal{Y}, \mathcal{B})$. Sometimes I will abbreviate symbols like $\mathcal{M}^+(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B})$ to $\mathcal{M}^+(\mathcal{X} \times \mathcal{Y})$, with the product sigma-field assumed, or to $\mathcal{M}^+(\mathcal{A} \otimes \mathcal{B})$, with the product space assumed.

To each measure $\Gamma$ on $\mathcal{A} \otimes \mathcal{B}$ there correspond two marginal measures $\mu$ and $\lambda$, defined by

$\mu A := \Gamma(A \times \mathcal{Y})$ for $A \in \mathcal{A}$ and $\lambda B := \Gamma(\mathcal{X} \times B)$ for $B \in \mathcal{B}$.

Equivalently, $\mu$ is the image of $\Gamma$ under the coordinate projection $X$, which takes $(x, y)$ to $x$, and $\lambda$ is the image under the projection onto the other coordinate space. In particular, if $\Gamma(\mathcal{X} \times \mathcal{Y}) = 1$ then the marginals give the distributions of the coordinate projections as $\mathcal{X}$- or $\mathcal{Y}$-valued random variables on the probability space $(\mathcal{X} \times \mathcal{Y}, \mathcal{A} \otimes \mathcal{B}, \Gamma)$.

In general, each marginal has the same total mass as $\Gamma$, which can lead to bizarre behavior if $\Gamma(\mathcal{X} \times \mathcal{Y}) = \infty$. For example, if $\mathcal{X} = \mathcal{Y} = \mathbb{R}$ and $\Gamma$ is Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$ then each marginal assigns infinite mass to all except the Lebesgue negligible subsets of $\mathbb{R}$.

As you will see in Section 4, if $\Gamma$ is a probability measure under which the coordinate projections define independent random variables then $\Gamma$ can be reconstructed as a product of its marginal distributions. Without independence, $\Gamma$ is not completely determined by its marginals. Instead we need a whole family $\Lambda = \{\lambda_x : x \in \mathcal{X}\}$ of measures on $\mathcal{B}$, together with the marginal $\mu$. The construction will make sense, and be useful, for more than just probability measures, provided the members of $\Lambda$ are tied together by a measurability assumption.

<18> Definition. Call a family of measures $\Lambda = \{\lambda_x : x \in \mathcal{X}\}$ on $\mathcal{B}$ a kernel from $(\mathcal{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$ if the map $x \mapsto \lambda_x B$ is $\mathcal{A}$-measurable for each $B$ in $\mathcal{B}$. In addition, call $\Lambda$ a probability kernel if $\lambda_x(\mathcal{Y}) = 1$ for each $x$.

When there is no ambiguity regarding the sigma-fields, I will also speak of kernels from $\mathcal{X}$ to $\mathcal{Y}$.

REMARK. Probability kernels are also known by several other names: Markov kernels, randomizations, conditional distributions, and (particularly when $\mathcal{X}$ is interpreted as a parameter space) statistical models.

Suppose μ is a measure on A and Λ = {λ_x : x ∈ X} is a kernel from (X, A) to (Y, B). The main idea behind the construction is definition of a measure on A ⊗ B via an iterated integral, μ^x (λ_x^y f(x, y)).

REMARK. Remember the notation for distinguishing between arguments. The superscript y means that the λ_x measure integrates out f(x, y) with x held fixed. The measure μ then integrates out over the resulting function of x. Notice that



superscripts denote dummy variables of integration, and subscripts denote variables that are held fixed. In traditional notation the iterated integral would be written ∫∫ f(x, y) λ_x(dy) μ(dx), or ∫ ( ∫ f(x, y) λ_x(dy) ) μ(dx).

For the iterated integral to make sense, we need to establish two key measurability properties: for each product measurable function f and each fixed x, the map y ↦ f(x, y) should be B-measurable; and the map x ↦ λ_x^y f(x, y) should be A-measurable. In order to establish conditions under which these two measurability properties hold, I will make use of the generating class method for λ-cones, as developed in Section 2.11.

REMARK. It is unfortunate that the letter λ should have two distinct meanings in this Chapter: as a measure or member of a family of measures on Y, and as a prefix suggesting the idea of stability under bounded limits. Sometimes there are not enough good symbols to go around.

Recall that a λ-cone on a set Ω is a family H+ of bounded, nonnegative functions on Ω with the properties:

(i) H+ is a cone, that is, if h₁, h₂ ∈ H+ and α₁ and α₂ are nonnegative constants then α₁h₁ + α₂h₂ ∈ H+;

(ii) each nonnegative constant function belongs to H+;

(iii) if h₁, h₂ ∈ H+ and h₁ ≥ h₂ then h₁ − h₂ ∈ H+;

(iv) if {h_n} is an increasing sequence of functions in H+ whose pointwise limit h is bounded then h ∈ H+.

Recall also that: if a sigma-field F on Ω is generated by a subclass G (of a λ-cone H+) that is stable under the formation of pointwise products of pairs of functions, then every bounded, nonnegative, F-measurable function belongs to H+.

Let me illustrate the application of λ-cone methods, by establishing the first of the desired measurability properties. It would be more elegant to combine the proofs for both properties into a single generating class argument, but I feel it is good to see the method first in a simple setting.

<19> Lemma. For each f in M(X × Y, A ⊗ B), and each fixed x in X, the function y ↦ f(x, y) is B-measurable.

Proof. It is enough to consider the case of bounded nonnegative f. The general case would then follow by splitting f into positive and negative parts, and representing each part as a pointwise limit of bounded functions, f± = lim_n (f± ∧ n).

Write H+ for the collection of all bounded, nonnegative, A ⊗ B-measurable functions on X × Y for which the stated measurability property holds. It is routine to check the four properties identifying H+ as a λ-cone. It contains the class G of all indicator functions g(x, y) := {x ∈ A, y ∈ B} of measurable rectangles, because g(x, ·) is either the zero function or the indicator of the set B. The class G is stable under pointwise products, and it generates A ⊗ B. □

The application of the generating argument to establish the second desired measurability property can be surprisingly delicate for kernels assigning infinite measure to Y. A finiteness assumption eliminates all difficulties.


<20> Theorem. Let Λ = {λ_x : x ∈ X} be a kernel from (X, A) to (Y, B) with λ_x Y < ∞ for all x, and let μ be a measure on A. Then, for each function f in M+(X × Y, A ⊗ B),

(i) y ↦ f(x, y) is B-measurable for each fixed x;

(ii) x ↦ λ_x^y f(x, y) is A-measurable;

(iii) the iterated integral (μ ⊗ Λ)(f) := μ^x (λ_x^y f(x, y)), for f in M+(X × Y, A ⊗ B),

defines a measure on A ⊗ B.

Proof. Property (i) merely restates the assertion of Lemma <19>. For (ii), consider the class H+ := {f ∈ M+(X × Y) : f bounded and satisfying (ii)}. Note that λ_x^y f(x, y) < ∞ for each x if f ∈ H+, so there is no problem with infinite values when subtracting to show that λ_x^y f₁(x, y) − λ_x^y f₂(x, y) is A-measurable if f₁ ≥ f₂ and both functions belong to H+. It is just as easy to check the other three properties needed to show that H+ is a λ-cone. All indicator functions of measurable rectangles belong to H+, because λ_x^y{x ∈ A, y ∈ B} = {x ∈ A}(λ_x B). Property (ii) therefore holds for all bounded functions in M+(X × Y). A Monotone Convergence argument,

    λ_x^y f(x, y) = lim_{n→∞} λ_x^y (n ∧ f(x, y)),

extends the property to all of M+(X × Y). It follows immediately that the iterated integral is well defined, thereby defining an increasing linear functional μ ⊗ Λ on M+(X × Y). The functional has the Monotone Convergence property: if 0 ≤ f_n ↑ f then

    (μ ⊗ Λ)(f) = μ^x ( lim_n λ_x^y f_n(x, y) )    Monotone Convergence for λ_x
               = lim_n μ^x ( λ_x^y f_n(x, y) )    Monotone Convergence for μ.

Thus μ ⊗ Λ has all the properties required of a functional corresponding to a measure on A ⊗ B. □
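On finite spaces the iterated integral of Theorem <20> reduces to a pair of weighted sums, and (μ ⊗ Λ)(f) agrees with summing f against the joint masses μ{x_i} λ_{x_i}{y_j}. A sketch with made-up weights:

```python
import numpy as np

# Finite X = {0, 1} and Y = {0, 1, 2}; mu is a measure on X, and row i
# of L is the measure lambda_{x_i} on Y.  (Hypothetical weights.)
mu = np.array([0.5, 0.5])
L = np.array([[0.2, 0.3, 0.5],
              [0.6, 0.3, 0.1]])     # a probability kernel: rows sum to 1

f = np.array([[1.0, 2.0, 3.0],      # f(x_i, y_j)
              [4.0, 5.0, 6.0]])

inner = (L * f).sum(axis=1)         # x -> lambda_x^y f(x, y)
iterated = (mu * inner).sum()       # (mu ⊗ Λ)(f) = mu^x (lambda_x^y f(x, y))

joint = mu[:, None] * L             # the measure mu ⊗ Λ as point masses
print(iterated, (joint * f).sum())  # the two computations agree
```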

If μ is a probability measure and Λ is a probability kernel, then μ ⊗ Λ defines a probability measure on the product sigma-field of X × Y. It defines a joint distribution for the coordinate projections, X and Y. In Chapter 5, the probability measure λ_x will be identified as the conditional distribution for Y, given that X = x. The construction can be extended further by means of a second probability kernel, N = {ν_{x,y} : (x, y) ∈ X × Y}, from (X × Y, A ⊗ B) to another space, (Z, C). For f in M+(X × Y × Z, A ⊗ B ⊗ C), the iterated integral

    ((μ ⊗ Λ) ⊗ N) f = (μ ⊗ Λ)^{x,y} ( ν_{x,y}^z f(x, y, z) ) = μ^x ( λ_x^y ( ν_{x,y}^z f(x, y, z) ) )

is well defined. It corresponds to a probability measure on A ⊗ B ⊗ C, which defines a joint distribution for the three coordinate projections, X, Y, Z. We can also define a probability kernel Λ ⊗ N from X to Y × Z by means of the map x ↦ λ_x^y ( ν_{x,y}^z g(y, z) ), for g in M+(Y × Z, B ⊗ C), thereby identifying the joint distribution as μ ⊗ (Λ ⊗ N). It makes no difference which way we interpret the probability on A ⊗ B ⊗ C, which from now on I will write as μ ⊗ Λ ⊗ N. A similar construction works for any finite number of probability kernels. The extension to a countably infinite sequence involves a more delicate argument, which will be presented in Section 8.


Infinite kernels

The construction of μ ⊗ Λ, for general measures and kernels, extends easily to product measurable sets that can be covered by a countable collection of measurable rectangles, in each of which an analog of the conditions of Theorem <20> holds.

<21> Definition. Say that a kernel is sigma-finite if there exist countably many measurable rectangles A_i × B_i for which X × Y = ∪_{i∈N} A_i × B_i and for which λ_x B_i < ∞ for all x ∈ A_i, for each i.

<22> Corollary. If Λ is a sigma-finite kernel, then the three assertions of Theorem <20> still hold. The countably additive measure μ ⊗ Λ on A ⊗ B, defined via the iterated integral, is sigma-finite if μ is a sigma-finite measure on A.

Proof. Only property (ii), the measurability of x ↦ λ_x^y f(x, y) for each f in M+(X × Y), requires any new argument, because the finiteness of λ_x Y was used only in the proof of the corresponding assertion in the Theorem.

Temporarily, I will use the word rectangle as an abbreviation for measurable rectangle. The difference of two rectangles can be written as a disjoint union of two other rectangles: (A × B)\(C × D) = ((A ∩ C^c) × B) ∪ ((A ∩ C) × (B ∩ D^c)). In particular, we can write A₂ × B₂ as a disjoint union of two rectangles, each disjoint from A₁ × B₁; then write A₃ × B₃ as a disjoint union of at most four rectangles, each disjoint from both A₁ × B₁ and A₂ × B₂; and so on. In other words, with no loss of generality, we may assume the rectangles A_i × B_i are pairwise disjoint, thereby ensuring that Σ_{i∈N} {x ∈ A_i, y ∈ B_i} = 1.

For f in M+(X × Y), the integral λ_x^y f(x, y) breaks into a countable sum,

    λ_x^y ( Σ_{i∈N} {x ∈ A_i, y ∈ B_i} f(x, y) ) = Σ_{i∈N} {x ∈ A_i} λ_x^y ( {y ∈ B_i} f(x, y) ).

For the ith summand, we may regard λ_x as a kernel from A_i to B_i, which lets us invoke Theorem <20> to establish measurability of that summand as a function of x. Thus λ_x^y f(x, y) is a countable sum of A-measurable functions, which establishes (ii).

If μ is sigma-finite, we may assume, with no loss of generality, that μA_i < ∞ for every i. Define A_{i,n} := {x ∈ A_i : λ_x B_i ≤ n}. Then (μ ⊗ Λ)(A_{i,n} × B_i) ≤ n μA_{i,n} < ∞, and X × Y = ∪_{i∈N, n∈N} (A_{i,n} × B_i). □

<23> Example. If X = Y = R and each λ_x and μ equals (one-dimensional) Lebesgue measure m on B(R), then the resulting μ ⊗ Λ may be taken as the definition of two-dimensional Lebesgue measure m₂ on B(R²). (For a more direct construction, see Section A.5 of Appendix A.) □

<24> Example. Let μ denote the N(0,1) distribution on B(R), and λ_x denote the N(ρx, 1 − ρ²) distribution, also on B(R), for a constant ρ with |ρ| < 1. For f ∈ M+(R²),

    (μ ⊗ Λ) f = m^x ( (2π)^{−1/2} exp(−x²/2) · m^y ( (2π(1 − ρ²))^{−1/2} exp(−(y − ρx)²/(2(1 − ρ²))) f(x, y) ) )

              = m^x m^y ( exp( −x²/2 − (y − ρx)²/(2(1 − ρ²)) ) / (2π√(1 − ρ²)) · f(x, y) ).


That is, μ ⊗ Λ is absolutely continuous with respect to two-dimensional Lebesgue measure, with density

    (2π√(1 − ρ²))^{−1} exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ).

The probability measure μ ⊗ Λ is called the standard bivariate normal distribution with correlation ρ. □
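The two-stage description suggests a simulation: draw x from μ = N(0,1), then y from λ_x = N(ρx, 1 − ρ²). A Monte Carlo sketch checking that the resulting pair has standard normal marginals and correlation near ρ (the value ρ = 0.7 is an arbitrary choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.7, 200_000

x = rng.standard_normal(n)                  # x drawn from mu = N(0, 1)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
                                            # y | x drawn from N(rho*x, 1 - rho^2)

print(np.corrcoef(x, y)[0, 1])              # close to rho
print(y.mean(), y.var())                    # close to 0 and 1: the marginal of y
```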

4. Product measures

A particularly important special case of the constructions from the previous Section arises when Λ = {λ} for a fixed measure λ on B, that is, when λ_x = λ for all x. Sigma-finiteness of the kernel Λ is then equivalent to sigma-finiteness of λ in the usual sense: the space Y should be a countable union of sets each with finite λ measure. I will abbreviate μ ⊗ {λ} to μ ⊗ λ, the conventional notation for the product of the measures μ and λ. That is, μ ⊗ λ is the measure on A ⊗ B defined by the linear functional

    (μ ⊗ λ)(f) := μ^x ( λ^y f(x, y) )   for f ∈ M+(X × Y, A ⊗ B).

In particular, (μ ⊗ λ)(A × B) = (μA)(λB).

If μ is also sigma-finite, we can reverse the roles of the two measures, integrating first with respect to μ and then with respect to λ, to build another measure on A ⊗ B, which again would give the value (μA)(λB) to the measurable rectangle A × B. As shown in Problem [11], the equality for the generating class of all measurable rectangles ensures that the new linear functional defines the same measure on A ⊗ B.

<25> Tonelli Theorem. If μ is a sigma-finite measure on (X, A), and λ is a sigma-finite measure on (Y, B), then, for each f in M+(X × Y, A ⊗ B),

(i) y ↦ f(x, y) is B-measurable for each fixed x, and x ↦ f(x, y) is A-measurable for each fixed y;

(ii) x ↦ λ^y f(x, y) is A-measurable, and y ↦ μ^x f(x, y) is B-measurable;

(iii) (μ ⊗ λ) f = μ^x ( λ^y f(x, y) ) = λ^y ( μ^x f(x, y) ).

Proof. Assertions (i) and (ii) follow immediately from Corollary <22>. The third assertion merely restates the fact that the linear functionals defined by both iterated integrals correspond to the same measure on A ⊗ B. □

See Problem [12] for an example emphasizing the need for sigma-finiteness.

<26> Example. Let μ be a sigma-finite measure on A. For f in M+(X, A) and each constant p ≥ 1, we can express μ(f^p) as an iterated integral,

    μ(f^p) = μ^x ( m^y ( p y^{p−1} {y : f(x) > y > 0} ) ),

where m denotes Lebesgue measure on B(R). It is not hard—although a little messy, as you will see from Problem [2]—to show that the function g(x, y) := p y^{p−1}{f(x) > y > 0}, on X × R, is product measurable. Tonelli lets us reverse the


order of integration. Abbreviating μ^x{y : f(x) > y > 0} to μ{f > y} and writing the Lebesgue integral in traditional notation, we then conclude that

    μ(f^p) = p ∫₀^∞ y^{p−1} μ{f > y} dy.

In particular, if μ(f^p) < ∞ then μ{f > y} must decrease to zero faster than y^{−p} as y → ∞. □
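The identity μ(f^p) = p ∫₀^∞ y^{p−1} μ{f > y} dy is easy to check numerically. A sketch with μ the empirical measure of an exponential sample, f the identity, and p = 2, so that both sides estimate the second moment (which is 2 for the standard exponential):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 2.0
sample = np.sort(rng.exponential(size=n))   # mu = empirical measure, f = identity

lhs = np.mean(sample**p)                    # mu(f^p)

# Right side: p * integral of y^{p-1} * mu{f > y} over y > 0, by a Riemann sum.
ys = np.linspace(0.0, sample[-1], 4001)[1:]
dy = ys[1] - ys[0]
tail = 1.0 - np.searchsorted(sample, ys, side="right") / n   # mu{f > y}
rhs = p * np.sum(ys**(p - 1) * tail) * dy

print(lhs, rhs)    # the two sides agree up to quadrature error
```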

The definition of product measures, and the Tonelli Theorem, can be extended to collections μ₁, ..., μ_n of more than two sigma-finite measures, as for kernels.

<27> Example. Apparently every mathematician is supposed to know the value of the constant C := ∫_{−∞}^{∞} exp(−x²) dx. With the help of Tonelli, you too will discover that C = √π. Let m denote Lebesgue measure on B(R) and m₂ = m ⊗ m denote Lebesgue measure on B(R²). Then

    C² = m^x m^y exp(−x² − y²) = m^{x,y} ( m^z {x² + y² < z} e^{−z} ).

The m₂ measure of the ball {x² + y² < z}, for fixed positive z, equals πz. A change in the order of integration leaves m^z ( π{0 < z} z e^{−z} ) = π as the value for C². □
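A quadrature sketch of the same conclusion: iterating two one-dimensional integrals of exp(−x²), as Tonelli licenses for m ⊗ m, gives C² = π.

```python
import numpy as np

# Midpoint rule for C = integral of exp(-x^2); the tails beyond |x| = 8
# contribute less than exp(-64) and can be ignored.
edges = np.linspace(-8.0, 8.0, 2001)
mid = (edges[:-1] + edges[1:]) / 2
h = edges[1] - edges[0]

C = np.sum(np.exp(-mid**2)) * h   # one-dimensional integral
C2 = C**2                         # Tonelli: the double integral is the square

print(C2, np.pi)                  # C^2 = pi, so C = sqrt(pi)
```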

The Tonelli Theorem is often invoked to establish integrability of a product measurable (extended-) real-valued function f, by showing that at least one of the iterated integrals μ^x ( λ^y |f(x, y)| ) or λ^y ( μ^x |f(x, y)| ) is finite. In that case, the Theorem also asserts equality for pairs of iterated integrals for the positive and negative parts of the function:

    μ^x λ^y f⁺(x, y) = λ^y μ^x f⁺(x, y) < ∞,

with a similar assertion for f⁻. As a consequence, the A-measurable set

    N_μ := {x : λ^y f⁺(x, y) = ∞ or λ^y f⁻(x, y) = ∞}

has zero μ-measure, and the analogously defined B-measurable set N_λ has zero λ measure. For x ∉ N_μ the integral λ^y f(x, y) := λ^y f⁺(x, y) − λ^y f⁻(x, y) is well defined and finite. If we replace f by the product measurable function f̃(x, y) := f(x, y){x ∉ N_μ, y ∉ N_λ}, the negligible sets of bad behavior disappear, leaving an assertion similar to the Tonelli Theorem but for integrable functions taking both positive and negative values. Less formally, we can rely on the convention that a function can be left undefined on a negligible set without affecting its integrability properties.

<28> Corollary (Fubini Theorem). For sigma-finite measures μ and λ, and a product measurable function f with (μ ⊗ λ)|f| < ∞,

(i) y ↦ f(x, y) is B-measurable for each fixed x; and x ↦ f(x, y) is A-measurable for each fixed y;

(ii) the integral λ^y f(x, y) is well defined and finite μ almost everywhere, and x ↦ λ^y f(x, y) is μ-integrable; the integral μ^x f(x, y) is well defined and finite λ almost everywhere, and y ↦ μ^x f(x, y) is λ-integrable;

(iii) (μ ⊗ λ) f = μ^x ( λ^y f(x, y) ) = λ^y ( μ^x f(x, y) ).


REMARKS. If we add similar almost sure qualifiers to assertion (i), then the Fubini Theorem also works for functions that are measurable with respect to F̄, the μ ⊗ λ completion of the product sigma-field. The result is easy to deduce from the Theorem as stated, because each F̄-measurable function f can be sandwiched between two product measurable functions, f₀ ≤ f ≤ f₁, with f₀ = f₁ a.e. [μ ⊗ λ]. Many authors work with the slightly more general version, stated for the completion, but then the Tonelli Theorem also needs almost sure qualifiers.

Without integrability of the function f, the Fubini Theorem can fail, as shown by Problem [13]. Strictly speaking, the sigma-finiteness of the measures is not essential, but little is gained by eliminating it from the assumptions of the Theorem. As explained in the next Section, under the traditional definition of products for general measures, integrable functions must almost concentrate on a countable union of measurable rectangles each with finite product measure.

Product measures correspond to joint distributions for independent random variables or random elements. For example, suppose X is a random element of a space (X, A) (that is, X is an F\A-measurable map from Ω into X), and Y is a random element of a space (Y, B). Let X have (marginal) distribution μ and Y have (marginal) distribution λ. That is, P{X ∈ A} = μA for each A in A, and P{Y ∈ B} = λB for each B in B. The joint distribution of X and Y is the image measure of P under the map ω ↦ (X(ω), Y(ω)). It is the probability measure Q on A ⊗ B defined by QD = P{(X, Y) ∈ D}. If X and Y are independent, and if D is a measurable rectangle, then QD factorizes:

    Q(A × B) = P{X ∈ A, Y ∈ B} = P{X ∈ A} P{Y ∈ B} = (μA)(λB).

That is, QD = (μ ⊗ λ) D for each D in the generating class of measurable rectangles. It follows that Q = μ ⊗ λ, as measures on the product sigma-field.

Conversely, if Q is a product measure then P{X ∈ A, Y ∈ B} factorizes, implying that X and Y are independent.

In short: random variables (or random elements of general spaces) are independent if and only if their joint distribution is the product of their marginal distributions.

Facts about independence can often be deduced from analogous facts about product measures. In effect, the proofs of the Tonelli/Fubini Theorems are equivalent to several of the standard generating class and Monotone Convergence tricks used for independence arguments. Moreover, the results for product measures are stated for functions, which eliminates a layer of argument needed to extend independence factorizations from sets to functions.

<29> Example. The factorization asserted by Theorem <2> follows from Tonelli's Theorem. Let X = (Z₁, ..., Z_k) and Y = (Z_{k+1}, ..., Z_n). The function (x, y) ↦ f(x)g(y) is product measurable. Why? With Q, μ, and ν as above,

    P( f(X) g(Y) ) = Q^{x,y} ( f(x) g(y) )         image measure
                   = (μ ⊗ ν)^{x,y} ( f(x) g(y) )   independence
                   = ν^y ( μ^x f(x) g(y) )         Tonelli
                   = (P f(X)) (P g(Y))             image measures.


In the third line on the right-hand side the factor g(y) behaves like a constant for the μ integral. □

<30> Example. The image of μ ⊗ ν, a product of finite measures on R^k × R^k, under the map T(x, y) = x + y is called the convolution of the two measures, and is denoted by μ ⋆ ν (or ν ⋆ μ, the order of the factors being irrelevant). If μ and ν are probability measures, and X and Y are independent random vectors with distributions μ and ν, then the product measure μ ⊗ ν gives their joint distribution, and the convolution μ ⋆ ν gives the distribution of the sum X + Y.

By the Tonelli Theorem and the definition of image measure,

    (μ ⋆ ν)(f) = μ^x ν^y f(x + y)   for f ∈ M+(R^k).

When ν has a density δ(·) with respect to k-dimensional Lebesgue measure m, the innermost integral on the right-hand side can be written, in traditional notation, as ∫ δ(y) f(x + y) dy, which invites us to make a change of variable and rewrite it as ∫ δ(y − x) f(y) dy. More formally, we could invoke the invariance of Lebesgue measure under translations (Problem [15]) to justify the reexpression. Whichever way we justify the change, the convolution becomes

    (μ ⋆ ν)(f) = μ^x m^y δ(y − x) f(y) = m^y μ^x δ(y − x) f(y)   by Tonelli.

Writing g(y) := μ^x δ(y − x), we have (μ ⋆ ν) f = m^y ( g(y) f(y) ). That is, μ ⋆ ν has density g with respect to m. □
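A numerical sketch of the density formula g(y) = μ^x δ(y − x), with μ a two-point measure (hypothetical masses, chosen only for illustration) and δ the N(0,1) density: the convolution density is the μ-weighted average of shifted copies of δ, and it integrates to one.

```python
import numpy as np

# mu puts mass 0.3 at x = -1 and mass 0.7 at x = 2 (made-up numbers);
# delta is the N(0,1) density, so nu = N(0,1).
atoms = np.array([-1.0, 2.0])
masses = np.array([0.3, 0.7])

def delta(t):
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

ys = np.linspace(-10.0, 12.0, 4001)
h = ys[1] - ys[0]
g = sum(m * delta(ys - a) for a, m in zip(atoms, masses))  # g(y) = mu^x delta(y-x)

total = g.sum() * h           # integrates to 1: g is the density of mu * nu
mean = (ys * g).sum() * h     # mean of X + Y = mean of mu (1.1) + mean of nu (0)
print(total, mean)
```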

REMARK. The statistical techniques known as density estimation and nonparametric smoothing rely heavily on convolutions.

<31> Exercise. The N(θ, σ²) distribution on the real line is defined by the density (2πσ²)^{−1/2} exp(−(x − θ)²/(2σ²)) with respect to Lebesgue measure. Show that the convolution of two normal distributions is normal: N(θ₁, σ₁²) ⋆ N(θ₂, σ₂²) = N(θ₁ + θ₂, σ₁² + σ₂²).

SOLUTION: The convolution formula from Example <30> (with μ as the N(θ₁, σ₁²) distribution and δ as the N(θ₂, σ₂²) density) becomes

    g(y) = (2πσ₁σ₂)^{−1} ∫ exp( −(x − θ₁)²/(2σ₁²) − (y − x − θ₂)²/(2σ₂²) ) dx.

Make the change of variable z = x − θ₁, and replace y by θ₁ + θ₂ + y (to get a neater expression):

    g(θ₁ + θ₂ + y) = (2πσ₁σ₂)^{−1} ∫ exp( −z²/(2σ₁²) − (y − z)²/(2σ₂²) ) dz.

Complete the square, to rewrite the exponent as

    −( (σ₁² + σ₂²)/(2σ₁²σ₂²) ) ( z − σ₁²y/(σ₁² + σ₂²) )² − y²/(2(σ₁² + σ₂²)).


The coefficient of y² simplifies to −(2σ²)^{−1}, where σ² = σ₁² + σ₂². When integrated out, the quadratic in z contributes a factor √(2π) σ₁σ₂/σ, leaving the appropriate multiple of exp(−y²/(2σ²)). The convolution result follows. □
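A simulation consistent with the conclusion N(θ₁, σ₁²) ⋆ N(θ₂, σ₂²) = N(θ₁ + θ₂, σ₁² + σ₂²), with arbitrary parameter choices:

```python
import numpy as np

rng = np.random.default_rng(2)
th1, s1, th2, s2, n = 1.0, 2.0, -3.0, 1.5, 500_000

x = rng.normal(th1, s1, n)     # X ~ N(1, 4)
y = rng.normal(th2, s2, n)     # Y ~ N(-3, 2.25), independent of X
s = x + y                      # should be N(-2, 6.25)

print(s.mean(), s.var())       # near -2 and 6.25
# A crude distributional check: standardize and compare one tail probability
# with the standard normal value P{N(0,1) > 1} = 0.1587.
z = (s + 2.0) / 2.5
print(np.mean(z > 1.0))
```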

REMARK. We could avoid most of the algebra in the Example, by noting that g(θ₁ + θ₂ + y) has the form C₁ exp(−C₂y²), for constants C₁ and C₂. Nonnegativity and integrability of the density would force both constants to be strictly positive, which would show that g must be some N(μ, σ²) density. Calculation of means and variances would then identify μ and σ².

There is also a quicker derivation of the result, based on Fourier transforms (see Chapter 8), but a distaste for circular reasoning has compelled me to inflict on you the grubby Calculus details of the direct convolution argument: I will derive the form of the N(θ, σ²) Fourier transform by means of a central limit theorem, for the proof of which I will use the convolution property of normals.

<32> Example. Recall from Section 2.2 the definition of the distribution function F_X and its corresponding quantile function for a random variable X:

    F_X(x) = P{X ≤ x}   for x ∈ R,
    q_X(u) = inf{t : F_X(t) ≥ u}   for 0 < u < 1.

The quantile function is almost an inverse to the distribution function, in the sense that F_X(x) ≥ u if and only if q_X(u) ≤ x. As a random variable on (0,1) equipped with its Borel sigma-field and Lebesgue measure P, the function X̃ := q_X(u) has the same distribution as X. Similarly, if Y has distribution function F_Y and quantile function q_Y, the random variable Ỹ := q_Y(u) has the same distribution as Y.
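Both facts are easy to check on an empirical distribution function. A sketch (the helper `quantile` is my own construction, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.sort(rng.exponential(size=2000))
n = len(data)

def F(x):
    """Empirical distribution function: (number of data points <= x) / n."""
    return np.searchsorted(data, x, side="right") / n

def quantile(u):
    """q(u) = inf{t : F(t) >= u}, for 0 < u <= 1."""
    idx = np.ceil(np.asarray(u) * n).astype(int) - 1
    return data[idx]

# The inverse-like property: F(x) >= u if and only if q(u) <= x.
for u in (0.1, 0.5, 0.9):
    for x in (0.05, 0.7, 2.3):
        assert (F(x) >= u) == (quantile(u) <= x)

# q(U), with U uniform on (0,1), reproduces the distribution of the data.
u = rng.uniform(size=100_000)
resampled = quantile(u)
print(np.median(resampled))    # near the exponential median log 2 = 0.693
```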

Notice that X̃ and Ỹ are both defined on the same (0,1), even though the original variables need not be defined on the same space. If X and Y do happen to be defined on the same Ω their joint distribution need not be the same as the joint distribution for X̃ and Ỹ. In fact, the new variables are closer to each other, in various senses. For example, several applications of Tonelli will show that P|X − Y|^p ≥ P|X̃ − Ỹ|^p for each p ≥ 1.

As a first step, calculate an inequality for tail probabilities.

    P{X > x, Y > y} ≤ min( P{X > x}, P{Y > y} )
                    = 1 − F_X(x) ∨ F_Y(y)
                    = ∫₀¹ {u > F_X(x) ∨ F_Y(y)} du
                    = ∫₀¹ {x < q_X(u), y < q_Y(u)} du
<33>                = P{X̃ > x, Ỹ > y}.

We also have P{X > x} = P{X̃ > x} and P{Y > y} = P{Ỹ > y}, from equality of the marginal distributions. By subtraction,

    P( {X > x} + {Y > y} − 2{X > x, Y > y} )
<34>    ≥ P( {X̃ > x} + {Ỹ > y} − 2{X̃ > x, Ỹ > y} )   for all x and y.


The left-hand side can be rewritten as

    P( {X(ω) > x, y ≥ Y(ω)} + {X(ω) ≤ x, y < Y(ω)} ),

a nonnegative function just begging for an application of Tonelli. For each real constant s, put y = x + s then integrate over x with respect to Lebesgue measure m on B(R). Tonelli lets us interchange the order of integration, leaving

    P( m^x{X(ω) > x ≥ Y(ω) − s} + m^x{X(ω) ≤ x < Y(ω) − s} )
    = P( (X(ω) − Y(ω) + s)⁺ + (Y(ω) − s − X(ω))⁺ )
    = P|X − Y + s|.

Argue similarly for the right-hand side of <34>, to deduce that

    P|X − Y + s| ≥ P|X̃ − Ỹ + s|   for all real s.

For each nonnegative t, invoke the inequality for s = t then s = −t, then add:

    P( |X − Y + t| + |X − Y − t| ) ≥ P( |X̃ − Ỹ + t| + |X̃ − Ỹ − t| )   for all t ≥ 0.

An appeal to the identity |z + t| + |z − t| = 2t + 2(|z| − t)⁺, for z ∈ R and t ≥ 0, followed by a cancellation of common terms, then leaves us with a useful relationship, which neatly captures the idea that X̃ and Ỹ are more tightly coupled than X and Y:

<35>    P( |X − Y| − t )⁺ ≥ P( |X̃ − Ỹ| − t )⁺   for all t ≥ 0.

Various interesting inequalities follow from <35>. Putting t equal to zero we get P|X − Y| ≥ P|X̃ − Ỹ|. For p > 1, note the identity

    D^p = p(p − 1) ∫₀^∞ (D − t)⁺ t^{p−2} dt = p(p − 1) m₀^t ( t^{p−2} (D − t)⁺ )   for D ≥ 0,

where m₀ denotes Lebesgue measure on B(R⁺). Temporarily write Δ for |X − Y| and Δ̃ for |X̃ − Ỹ|. Two more appeals to Tonelli then give

    P|X − Y|^p = p(p − 1) m₀^t ( t^{p−2} P^ω (Δ(ω) − t)⁺ )
               ≥ p(p − 1) m₀^t ( t^{p−2} P^ω (Δ̃(ω) − t)⁺ ) = P|X̃ − Ỹ|^p.

See Problem [17] for the analogous inequality, Pψ(|X − Y|) ≥ Pψ(|X̃ − Ỹ|), for every convex, increasing function ψ on R⁺. □
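A simulation sketch of the conclusion P|X − Y|^p ≥ P|X̃ − Ỹ|^p. Take X ~ Exp(1) and Y ~ Exp(2) independent (one possible joint distribution, chosen arbitrarily), and compare with the quantile coupling built from a single uniform variable:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 400_000, 2.0

# An arbitrary joint distribution: independent X ~ Exp(1), Y ~ Exp(2).
X = rng.exponential(scale=1.0, size=n)
Y = rng.exponential(scale=0.5, size=n)

# Quantile coupling: same marginals, driven by one uniform variable.
u = rng.uniform(size=n)
Xt = -np.log(1 - u)            # Exp(1) quantile function
Yt = -np.log(1 - u) / 2        # Exp(2) quantile function

lhs = np.mean(np.abs(X - Y) ** p)     # approx P|X - Y|^p    (exact value 1.5)
rhs = np.mean(np.abs(Xt - Yt) ** p)   # approx P|Xt - Yt|^p  (exact value 0.5)
print(lhs, rhs)                       # lhs >= rhs, as the inequality asserts
```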

*5. Beyond sigma-finiteness

Even when the kernel Λ is not sigma-finite, it is possible to define μ ⊗ Λ as a measure on the whole of A ⊗ B. Unfortunately, the obvious method—by means of an iterated integral—can fail for measurability reasons. Even when λ_x does not depend on x, the map x ↦ λ^y f(x, y) need not be measurable for all f in M+(X × Y, A ⊗ B), as shown by the following counterexample.


<36> Example. Let X equal [0,1], equipped with its Borel sigma-field A, and Y also equal [0,1], but equipped with the sigma-field B of all subsets of [0,1]. Let E be a subset of [0,1], not A-measurable. Let λ be the counting measure on E, interpreted as a measure on B. That is, λB equals the cardinality of E ∩ B, for every subset B of [0,1]. The Borel measurable subset D = {x = y} of [0,1]² also belongs to the larger sigma-field A ⊗ B. The function x ↦ λ^y{(x, y) ∈ D} equals 1 when x ∈ E, and is 0 otherwise; it is not A-measurable. □

In general, the difficulty lies in the possible nonmeasurability of the function x ↦ λ_x^y f(x, y). When f equals the indicator function of a measurable rectangle the difficulty disappears, because x ↦ λ_x B is assumed to be A-measurable for every B in B. If μ^x ( λ_x B {x ∈ A} ) < ∞ then A₀ := {x ∈ A : λ_x B = ∞} has zero μ measure. The method of Theorem <20> defines a measure on the product sigma-field of (A\A₀) × B, by means of an iterated integral,

    Γf := μ^x ( λ_x^y f(x, y) )   if f ∈ M+ and f = 0 outside (A\A₀) × B.

For the remainder of the Section, I will use the letter Γ, instead of the symbol μ ⊗ Λ, to avoid any inadvertent suggestion of an iterated integral.

The same definition also works if f is required to be zero only outside A × B, provided we ignore possible nonmeasurability of λ_x^y f(x, y) on a μ-negligible set. More formally, we could assume μ to be complete, so that the behavior for x in A₀ has no effect on the measurability. The method of Corollary <22> can then be applied to extend Γ to a measure on a larger collection of subsets.

<37> Definition. Write R for the collection of all A ⊗ B-measurable sets D for which there exist measurable rectangles with D ⊆ ∪_{i∈N} A_i × B_i and μ^x ( {x ∈ A_i} λ_x B_i ) < ∞ for each i. Denote by M+(X × Y, R), or just M+(R), the collection of functions f in M+(X × Y, A ⊗ B) for which {f ≠ 0} ∈ R.

The collection R is stable under countable unions, countable intersections, and differences (that is, if R_i ∈ R then R₁\R₂ ∈ R)—that is, R is a sigma-ring. It need not be stable under complements, because the whole product space X × Y might not have a covering by a sequence of measurable rectangles with the desired property—that is, R need not be a sigma-field.

The definition of a countably additive measure on a sigma-ring is essentially the same as the definition for a sigma-field. There is a one-to-one correspondence between measures defined on R and increasing linear functionals on M+(R) with the Monotone Convergence property. In particular, the iterated integral μ^x ( λ_x^y f(x, y) ) defines an increasing linear functional on M+(R), corresponding to a measure on R.

For product measurable sets not in R there is no unique way to proceed. A minor extension of the classical approach for product measures (using approximation from above, as in Section 12.4 of Royden 1968) suggests we should define Γ(D) as the infimum of Σ_{i∈N} μ^x ( {x ∈ A_i} λ_x B_i ), taken over all countable coverings of D by measurable rectangles, ∪_{i∈N} A_i × B_i. This definition is equivalent to putting Γ(D) = ∞ when D ∈ (A ⊗ B)\R. Consequently, we would have Γ(f) = ∞ for each nonnegative, product measurable f not in M+(R). For the


particular case where Λ ≡ λ, if f is product measurable and Γ(|f|) < ∞, then {f ≠ 0} ⊆ ∪_{i∈N} A_i × B_i, for some countable collection of measurable rectangles with Σ_{i∈N} (μA_i)(λB_i) < ∞. For each i in the set J_μ := {i : μA_i = ∞} we must have λB_i = 0, which implies that the set N_λ := ∪_{i∈J_μ} B_i is λ-negligible. Similarly, the set N_μ := ∪_{i∈J_λ} A_i, where J_λ := {i : λB_i = ∞}, is μ-negligible. When restricted to X₀ := ∪_{i∉J_μ} A_i, the measure μ is sigma-finite; and when restricted to Y₀ := ∪_{i∉J_λ} B_i, the measure λ is sigma-finite. Corollary <28> gives the assertions of the Fubini Theorem for the restriction of f to X₀ × Y₀. For trivial reasons, the Fubini Theorem also holds for the restriction of f to N_μ × Y or X × N_λ.

REMARK. In effect, with the classical approach, the general Fubini Theorem is made to hold by forcing integrable functions to concentrate on regions where both measures are sigma-finite. The general form of the Fubini Theorem is really just a minor extension of the theorem for sigma-finite measures. I see some virtue in being content with the definition of the general μ ⊗ Λ as a measure on the sigma-ring R.

6. SLLN via blocking

The fourth moment bound in Theorem <4> is an unnecessarily strong requirement. It reflects the crudity of the Borel-Cantelli argument based on <5> as a method for proving almost sure convergence. Successive terms contribute almost the same tail probability to the sum, because the averages do not change very rapidly with n. It is possible to do better by breaking the sequence of averages into blocks of similar terms. We need to choose the blocks so that the maximum of the terms over each block behaves almost like a single average, for then the Borel-Cantelli argument need be applied only to a subsequence of the averages.

Probabilistic bounds involving maxima of collections of random variables are often called maximal inequalities. For the SLLN problem there are several types of maximal inequality that could be used. I am fond of the following result because it also proves useful in other contexts. The proof introduces the important idea of a first passage time.

<38> Maximal Inequality. Let Z₁, ..., Z_N be independent random variables, and ε₁, ε₂, and β be positive constants for which P{|Z_i + Z_{i+1} + ... + Z_N| ≤ ε₂} ≥ 1/β for each i. Then

    P{ max_{i≤N} |Z₁ + ... + Z_i| ≥ ε₁ + ε₂ } ≤ β P{ |Z₁ + ... + Z_N| ≥ ε₁ }.

REMARK. The same inequality holds if the {Z_i} are independent random vectors, and |·| is interpreted as length. The inequality is but one example of a large class of results based on a simple principle. Suppose we wish to bound the probability that a sequence {S_i : i = 0, 1, ..., N} ever enters some region R. If it does enter R there will be a first time, τ, at which it does so. If we can show that the process has a (conditional) probability of at least 1/β of 'staying near the scene of the crime' (that is, of remaining within a distance ε of R at time N), then the probability of the event {S_N within ε of R} should be at least 1/β times the probability of the event {the process hits R}. Of course the time τ will be a random variable, which


means that S_N − S_τ need not behave like a typical increment S_N − S_i. We avoid that complication by arguing separately for each event {τ = i}.

There are also conditional probability versions of the inequality, relaxing the independence assumption.

Proof. Define S_i := Z_1 + ... + Z_i and T_i := Z_{i+1} + ... + Z_N = S_N − S_i. Define a random variable τ (a first passage time) by

τ := first i for which |S_i| ≥ ε_1 + ε_2, with τ := N if |S_i| < ε_1 + ε_2 for all i.

Notice that the events {τ = i}, for i = 1, ..., N, are disjoint. The probability on the left-hand side of the asserted inequality equals

P{τ = i and |S_i| ≥ ε_1 + ε_2 for some i}
  = Σ_{i=1}^N P{τ = i, |S_i| ≥ ε_1 + ε_2}      (disjoint events)
  ≤ Σ_{i=1}^N P{τ = i, |S_i| ≥ ε_1 + ε_2} βP{|T_i| ≤ ε_2}      (definition of β).

The event {τ = i, |S_i| ≥ ε_1 + ε_2} is a function of Z_1, ..., Z_i; the event {|T_i| ≤ ε_2} is a function of Z_{i+1}, ..., Z_N. By independence, the product of the probabilities in the last displayed line is equal to the probability of the intersection. If |S_i| ≥ ε_1 + ε_2 and |T_i| ≤ ε_2 then certainly |S_N| ≥ ε_1. The sum is therefore less than β Σ_{i=1}^N P{τ = i, |S_N| ≥ ε_1} = βP{|S_N| ≥ ε_1}, as asserted.

REMARK. Notice how the disjointness of the events {τ = i} for i = 1, ..., N was used twice: first to break the maximal event into pieces and then to reassemble the bounds on the pieces. Also, you should ponder the choice of τ value for the case where |S_i| < ε_1 + ε_2 for all i. Where would the proof fail if I had chosen τ = 1 instead of τ = N for that case?
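The inequality can be checked exactly for a short symmetric random walk, by enumerating all sign sequences. Everything in the sketch below is an illustrative choice, not from the text: steps Z_i = ±1, N = 10, ε_1 = 4, ε_2 = 2. The constant β is computed from the hypothesis, then both sides of the inequality are compared.

```python
from itertools import product

# Exact check of the Maximal Inequality <38> for simple random walk.
# All choices (Z_i = +/-1, N = 10, eps1 = 4, eps2 = 2) are illustrative.
N, eps1, eps2 = 10, 4, 2
outcomes = list(product((-1, 1), repeat=N))        # all 2^N equally likely paths
total = len(outcomes)

# hypothesis: P{|Z_i + ... + Z_N| <= eps2} >= 1/beta for each i
tail_probs = [sum(1 for z in outcomes if abs(sum(z[i:])) <= eps2) / total
              for i in range(N)]
beta = 1 / min(tail_probs)

# left-hand side: P{max_i |S_i| >= eps1 + eps2}
lhs = sum(1 for z in outcomes
          if max(abs(sum(z[:i + 1])) for i in range(N)) >= eps1 + eps2) / total
# right-hand side: beta * P{|S_N| >= eps1}
rhs = beta * sum(1 for z in outcomes if abs(sum(z)) >= eps1) / total
print(lhs <= rhs)
```

With these choices the bound is not vacuous: the right-hand side stays below 1 while comfortably dominating the left-hand side.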

The Maximal Inequality will control the behavior of the averages over blocks of successive partial sums, chosen short enough that the β constants stay bounded. The Borel-Cantelli argument can then be applied along the subsequence corresponding to the endpoints of the blocks. The longer the blocks, the sparser the subsequence of endpoints, and the easier it becomes to establish convergence for the sum of tail probabilities along the subsequence. With block lengths increasing geometrically fast, under mild second moment conditions we get both acceptable β values and tail probabilities decreasing rapidly enough for the Borel-Cantelli argument.

<39> Theorem. (Kolmogorov) Let X_1, X_2, ... be independent random variables with PX_i = 0 for every i. If Σ_i PX_i²/i² < ∞ then (X_1 + ... + X_n)/n → 0 almost surely.

Proof. Define S_i := X_1 + ... + X_i and

V(i) := σ_1² + ... + σ_i² = PS_i²   where σ_i² := PX_i²,

B_k := {n : n_k < n ≤ n_{k+1}}   where n_k := 2^k, for k = 1, 2, ... .


4.6 SLLN via blocking 97

REMARK. The n_k define the blocks of terms to which the Maximal Inequality will be applied. You might try experimenting with other block sizes, to get a feel for how much latitude we have here. For example, what would happen if we took n_k = 3^k, or k^k?

The series Σ_k V(n_k)/n_k² is convergent, because

Σ_{k=1}^∞ V(n_k)/n_k² = Σ_{k=1}^∞ (1/4^k) Σ_{i=1}^{n_k} σ_i² = Σ_{i=1}^∞ σ_i² Σ_{k≥k(i)} 1/4^k.

The innermost sum on the right-hand side is just the tail of a geometric series, started at k(i), the smallest k for which i ≤ 2^k. The sum equals (4/3) 4^{−k(i)} ≤ (4/3) i^{−2}, because 2^{k(i)} ≥ i; convergence of the whole series then follows from the assumption Σ_i σ_i²/i² < ∞.
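The geometric tail calculation can be verified numerically. The sketch below (with n_k = 2^k, as in the proof; the ranges are arbitrary truncations) checks both the closed form (4/3)·4^{−k(i)} and the bound (4/3)/i².

```python
# Check of the geometric-series tail: with n_k = 2^k,
# sum_{k >= k(i)} 1/n_k^2 = (4/3) * 4^(-k(i)) <= (4/3) / i^2, since 2^k(i) >= i.
def k_of(i):
    k = 1
    while 2 ** k < i:
        k += 1
    return k                       # smallest k >= 1 with i <= 2^k

ok = True
for i in range(1, 2001):
    k0 = k_of(i)
    tail = sum(4.0 ** (-k) for k in range(k0, k0 + 60))   # truncated tail sum
    closed = (4 / 3) * 4.0 ** (-k0)
    ok = ok and abs(tail - closed) < 1e-12 and tail <= (4 / 3) / i ** 2 + 1e-12
print(ok)
```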

The almost sure convergence assertion of the Theorem is equivalent to

<40>   max_{n∈B_k} |S_n|/n → 0 almost surely as k → ∞.

It therefore suffices if we show, for each ε > 0, that

<41>   Σ_k P{max_{n∈B_k} |S_n|/n ≥ 2ε} < ∞.

Replace the n in the denominator for the kth summand by the smaller number n_k, then expand the range from B_k to {n ≤ n_{k+1}}, to bound the probability by

P{max_{n≤n_{k+1}} |S_n| ≥ 2εn_k}.

By the Maximal Inequality, applied with ε_1 = ε_2 = εn_k, this probability is less than

β_k P{|S_{n_{k+1}}| ≥ εn_k} ≤ β_k V(n_{k+1})/(εn_k)²,

the second bound coming from Chebyshev's inequality, where

β_k^{−1} := min_{n≤n_{k+1}} P{|S_{n_{k+1}} − S_n| ≤ εn_k} ≥ 1 − max_{n≤n_{k+1}} (V(n_{k+1}) − V(n))/(εn_k)² ≥ 1 − V(n_{k+1})/(εn_k)² → 1 as k → ∞,

because V(n_{k+1})/n_k² = 4V(n_{k+1})/n_{k+1}² is the kth term of a convergent series. The kth term in <41> is therefore eventually smaller than a constant multiple of V(n_{k+1})/n_{k+1}², which establishes the desired convergence.
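A quick Monte Carlo sketch illustrates the theorem for summands that are not identically distributed. The choices below (X_i = ±i^{1/4}, the seed, the sample size) are all arbitrary; they satisfy the hypothesis, since Σ_i PX_i²/i² = Σ_i i^{−3/2} < ∞.

```python
import random

# Simulation sketch for Theorem <39>: independent X_i = +/- i^(1/4),
# so PX_i = 0 and sum_i PX_i^2 / i^2 = sum_i i^(-3/2) < infinity.
random.seed(12345)
n_max = 200_000
s = 0.0
averages = {}
for i in range(1, n_max + 1):
    s += random.choice((-1.0, 1.0)) * i ** 0.25
    if i in (2_000, 20_000, 200_000):
        averages[i] = s / i
print(averages)   # |S_n / n| should be small for the larger n
```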

*7. SLLN for identically distributed summands

Theorem <39> is not only useful in its own right, but also it acts as the stepping stone towards the version of the SLLN stated as Theorem <3>, where the moment assumptions are relaxed even further, at the cost of a requirement that the variables have identical distributions. The proof of Theorem <3> requires a truncation argument and several appeals to the assumption of identical distributions in order to reduce the calculation of bounds involving many X_i to pointwise inequalities


involving only a typical X_1. Actually slightly less would suffice. We only need a way to bound various integrals by quantities that do not change with n.

Proof of Theorem <3>. With no loss of generality, suppose μ = 0 (or replace X_i by the centered variable X_i − μ), so that the Theorem becomes: for independent, identically distributed, integrable random variables X_1, X_2, ... with PX_i = 0,

(X_1 + ... + X_n)/n → 0 almost surely.

Break each X_i into two contributions: a central part, Y_i = X_i{|X_i| ≤ i}, which, when recentered at its expectation μ_i = PX_i{|X_i| ≤ i}, becomes a candidate for Theorem <39>; and a tail part, X_i − Y_i = X_i{|X_i| > i}, which can be handled by relatively crude summability arguments. Notice that, after truncation, the summands no longer have identical distributions.

REMARK. The choice of i as the ith truncation level is determined by the available moment information. More generally, finiteness of a pth moment would allow us to truncate at i^{1/p} for almost sure convergence arguments.

The truncation constants increase fast enough to allow us to dispose of various fragments by means of the first moment assumption. For example, using the identity 0 = PX_i = μ_i + PX_i{|X_i| > i}, and the identical distribution assumption, we have

<42>   |μ_i| = |PX_i{|X_i| > i}| ≤ P|X_i| min(1, |X_i|/i) = P|X_1| min(1, |X_1|/i) → 0,

the final convergence to zero following by Dominated Convergence. Notice how the contribution from X_i{|X_i| ≤ i} was related to a contribution from −X_i{|X_i| > i}. Without that trick, the analog of <42> would have given the useless upper bound P|X_1|.

The rate of growth of the truncation constants is also fast enough to ensure that the truncation has little effect on the summands, as far as the almost sure convergence of the averages is concerned:

<43>   Σ_{i=1}^∞ P{X_i ≠ Y_i} ≤ Σ_{i=1}^∞ P{|X_i| > i} = Σ_{i=1}^∞ P{|X_1| > i} ≤ P|X_1| < ∞.
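The summability bound used here, Σ_{i≥1} P{|X_1| > i} ≤ P|X_1|, rests on the pointwise inequality Σ_{i≥1}{|x| > i} ≤ |x|. Both are easy to confirm on a concrete distribution; the sketch below takes X exponential with mean 1 (an assumed choice, so that P{X > i} = e^{−i} and P|X| = 1).

```python
import math

# sum_i P{|X| > i} <= P|X| for X ~ Exponential(1):
# the left side is a geometric series summing to 1/(e - 1) ~= 0.582.
lhs = sum(math.exp(-i) for i in range(1, 200))
rhs = 1.0                                  # P|X| = E X = 1 for Exponential(1)
print(lhs, rhs)

# the underlying deterministic inequality: sum_{i>=1} {x > i} <= x for x >= 0
for x in (0.3, 1.0, 2.7, 15.2):
    assert sum(1 for i in range(1, 100) if x > i) <= x
```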

Again the identical distributions have reduced the argument to calculation of a pointwise bound involving the single X_1. It follows, via the Borel-Cantelli lemma, that with probability one X_i(ω) = Y_i(ω) for all large enough i: there exists a positive integer i_0(ω) such that, for n ≥ i_0(ω),

| (1/n) Σ_{i≤n} X_i(ω) − (1/n) Σ_{i≤n} Y_i(ω) | ≤ (1/n) Σ_{i≤i_0(ω)} |X_i(ω)| → 0.

Notice that the last sum does not change as n increases. The truncation constants increase slowly enough to allow us to bound second

moment quantities for the truncated variables by first moments of the original variables. Here I leave some minor calculus details to you (see Problem [25]). First you should establish a deterministic inequality: there exists a finite constant C such that

<44>   x² Σ_{i=1}^∞ {|x| ≤ i}/i² ≤ C|x|   for each real x.
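The inequality asserts only that some finite C works. The sketch below probes the ratio x² Σ_{i≥|x|} i^{−2} / |x| numerically on a grid (the grid and the truncation level are arbitrary choices) and suggests that C = 2 suffices, with the worst case near |x| = 1.

```python
import math

# Probe of the inequality x^2 * sum_{i >= |x|} i^{-2} <= C |x|.
M = 1_000_000
suffix = [0.0] * (M + 2)
for i in range(M, 0, -1):
    suffix[i] = suffix[i + 1] + 1.0 / (i * i)   # suffix[m] ~ sum_{i >= m} i^{-2}

def ratio(x):
    m = max(1, math.ceil(x))      # first index i with x <= i
    return x * x * suffix[m] / x

worst = max(ratio(k / 100) for k in range(1, 10_000))   # x in (0, 100)
print(worst)   # maximum ratio, attained near x = 1, is about 1.64 < 2
```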


From this bound and the inequality P(Y_i − μ_i)² ≤ PY_i² = PX_1²{|X_1| ≤ i}, deduce that

Σ_{i=1}^∞ P(Y_i − μ_i)²/i² ≤ C P|X_1| < ∞.

It follows by Theorem <39> that

<45>   (1/n) Σ_{i≤n} (Y_i − μ_i) → 0 almost surely.

The asserted SLLN for the identically distributed {X_i} then follows from the results <42>, <43>, and <45>.

*8. Infinite product spaces

The SLLN makes an assertion about an infinite sequence {X_n : n ∈ ℕ} of independent random variables. How do we know such sequences exist? More generally, for an assertion about any sequence of random variables {X_n}, with prescribed behavior for the joint distributions of finite subcollections of the variables, how do we know there exists a probability space (Ω, ℱ, P) on which the {X_n} can be defined? Depending on one's attitude towards rigor, the question of existence is either a technical detail or a vital preliminary to any serious study of limit theorems. If you prefer, at least initially, to take the existence as a matter of faith, then you will probably want to skip this Section.

To accommodate more general random objects, suppose X_i takes values in a measurable space (𝒳_i, 𝒜_i). For finite collections of random elements X_1, ..., X_n, we can take Ω as a product space ×_{i≤n} 𝒳_i equipped with its product sigma-field ⊗_{i≤n} 𝒜_i, with the random elements X_i as the coordinate projections. For infinite collections {X_n : n ∈ ℕ} of random elements, we need a way of building a probability measure on Ω = ×_{i∈ℕ} 𝒳_i, the set of all infinite sequences ω := (x_1, x_2, ...) with x_i ∈ 𝒳_i for each i. The measure should live on the product sigma-field ℱ := ⊗_{i∈ℕ} 𝒜_i, which is generated by the measurable hyperrectangles ×_{i∈ℕ} A_i with A_i ∈ 𝒜_i for each i.

It is natural to start from a prescribed family {P_n : n ∈ ℕ} of desired finite dimensional distributions for the random elements. That is, P_n is a probability measure defined on the product sigma-field ℱ_n := 𝒜_1 ⊗ ... ⊗ 𝒜_n of the product space Ω_n := 𝒳_1 × ... × 𝒳_n, for each n. The X_i's correspond to the coordinate projections on these product spaces. Of course we need the P_n's to be consistent in the distributions they give to the variables, in the sense that

<46>   P_{n+1}(F × 𝒳_{n+1}) = P_{n+1}{(X_1, ..., X_n) ∈ F, X_{n+1} ∈ 𝒳_{n+1}} = P_n F   for all F in ℱ_n,

or, equivalently,

P_{n+1} g(x_1, ..., x_n) = P_n g(x_1, ..., x_n)   for all g ∈ M⁺(Ω_n, ℱ_n).

Such a family of probability measures is said to be a consistent family of finite dimensional distributions. Within this framework, the existence problem becomes:

For a consistent family of finite dimensional distributions {P_n : n ∈ ℕ}, when does there exist a probability measure P on ℱ for which the joint distribution of the coordinate projections (X_1, ..., X_n) equals P_n, for each finite n?


Roughly speaking, when such a P exists (as a countably additive probability measure on ℱ), we are justified in speaking of the joint distribution of the whole sequence {X_n : n ∈ ℕ}. If such a P does not exist, assertions such as almost sure convergence require delicate interpretation.

The sigma-field ℱ_n is also generated by the cone 𝒢_n⁺ of all bounded, nonnegative, ℱ_n-measurable real functions on Ω_n. Write ℋ_n⁺ for the corresponding cone of all functions on Ω of the form h(ω) = g_n(ω|n), where ω|n denotes the initial segment (x_1, ..., x_n) of ω and g_n ∈ 𝒢_n⁺. A routine argument shows that the cone ℋ⁺ := ∪_{n∈ℕ} ℋ_n⁺ generates the product sigma-field ℱ on Ω. Consistency of the family {P_n : n ∈ ℕ} lets us define unambiguously an increasing linear functional P on ℋ⁺ by

<47>   Ph := P_n g_n   if h(ω) = g_n(ω|n).

The functional is well defined because condition <46> ensures that P_n g_n = P_{n+1} g_{n+1} when h(ω) = g_{n+1}(ω|n+1) = g_n(ω|n), that is, when h depends on only the first n coordinates of ω. Some authors would call P a finitely additive probability.

From Appendix A.6, the functional P has an extension to a (countably additive) probability measure on ℱ if and only if it is σ-smooth at 0, meaning that Ph_i ↓ 0 for every sequence {h_i} in ℋ⁺ for which 1 ≥ h_i ↓ 0 pointwise.

REMARK. The σ-smoothness property requires more than countable additivity of each individual P_n. Indeed, Andersen & Jessen (1948) gave an example of weird spaces {𝒳_i} for which P is not countably additive (see Problem [29] for details). The failure occurs because a decreasing sequence {h_i} in ℋ⁺ need not depend on only a fixed, finite set of coordinates. If there were such a finite set, that is, if h_i(ω) = g_i(ω|n) for a fixed n, then we would have Ph_i = P_n g_i → 0 if h_i ↓ 0. In general, h_i might depend on the first n_i coordinates, with n_i → ∞, precluding the reduction to a fixed P_n.

In the literature, sufficient conditions for σ-smoothness generally take one of two forms. The first, due to Daniell (1919), and later rediscovered by Kolmogorov (1933, Section III.4) in slightly greater generality, imposes topological assumptions on the 𝒳_i coordinate spaces and a tightness assumption on the P_n's. Probabilists who have a distaste for topological solutions to probability puzzles sometimes find the Daniell/Kolmogorov conditions unpalatable, in general, even though those conditions hold automatically for products of real lines.

REMARK. Remember that a probability measure Q on a topological space 𝒳 is called tight if, to each ε > 0 there is a compact subset K_ε of 𝒳 such that QK_ε^c < ε. If 𝒳 = 𝒳_1 × ... × 𝒳_n, then Q is tight if and only if each of its marginals is tight.

The other sufficient condition, due to Ionescu Tulcea (1949), makes no assumptions about existence of a topology, but instead requires a connection between the {P_n} by means of a family of probability kernels, Λ_{n+1} = {λ_{ω_n} : ω_n ∈ Ω_n} from Ω_n to 𝒳_{n+1}, for each n ∈ ℕ. Indeed, a sequence of probability measures can be constructed from such a family, starting from an arbitrary probability measure P_1 on 𝒜_1:

<48>   P_n := P_1 ⊗ Λ_2 ⊗ Λ_3 ⊗ ... ⊗ Λ_n.


More succinctly, P_n = P_{n−1} ⊗ Λ_n for n ≥ 2. The requirement that each Λ_n be a probability kernel ensures that {P_n : n ∈ ℕ} is a consistent family of finite dimensional distributions.
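In the finite-state case the construction P_n = P_{n−1} ⊗ Λ_n can be carried out explicitly, and the consistency condition checked by summing out the last coordinate. Everything in the sketch below (the two-point coordinate spaces, P_1, and the kernel) is invented for illustration.

```python
# Finite-state sketch of the kernel construction P_n = P_{n-1} (x) Lambda_n.
X = ("a", "b")                              # each coordinate space
P1 = {("a",): 0.3, ("b",): 0.7}

def kernel(history):
    # an arbitrary probability kernel from Omega_n to X_{n+1}:
    # the next coordinate repeats the last symbol with probability 0.8
    return {x: (0.8 if x == history[-1] else 0.2) for x in X}

def extend(Pn):
    # build P_{n+1} = P_n (x) Lambda_{n+1} as a measure on (n+1)-tuples
    return {omega + (x,): p * q
            for omega, p in Pn.items()
            for x, q in kernel(omega).items()}

P = dict(P1)
for _ in range(4):
    Pprev, P = P, extend(P)

# consistency: summing out the last coordinate recovers the previous measure
marginal = {}
for omega, p in P.items():
    marginal[omega[:-1]] = marginal.get(omega[:-1], 0.0) + p
consistent = all(abs(marginal[w] - Pprev[w]) < 1e-12 for w in Pprev)
print(consistent)
```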

<49> Theorem. Suppose {P_n : n ∈ ℕ} is the consistent family of finite dimensional distributions defined via probability kernels, P_n = P_{n−1} ⊗ Λ_n for each n, as in <48>. Then the P defined by <47> has an extension to a countably additive probability measure on the product sigma-field ℱ.

Proof. We have only to establish the σ-smoothness at 0. Equivalently, suppose we have a sequence {h_i : i ∈ ℕ} in ℋ⁺ with 1 ≥ h_i ↓ h, but for which inf_i Ph_i ≥ ε for some ε > 0. We need to show that h is not the zero function. That is, we need to find a point ω̄ = (x̄_1, x̄_2, ...) in Ω for which h(ω̄) > 0.

Construct ω̄ one coordinate at a time. With no loss of generality we may assume h_n depends on only the first n coordinates, that is, h_n(ω) = g_n(ω|n), so that

<50>   Ph_n = P_n g_n = (P_1 ⊗ Λ_2 ⊗ ... ⊗ Λ_n) g_n(x_1, x_2, ..., x_n) ≥ ε.

The product Λ_2 ⊗ ... ⊗ Λ_n defines a probability kernel from 𝒳_1 to ×_{2≤i≤n} 𝒳_i. Define functions f_n on 𝒳_1 by

f_n(x_1) := (λ_{x_1} ⊗ Λ_3 ⊗ ... ⊗ Λ_n) g_n(x_1, x_2, ..., x_n)   for n ≥ 2,

with f_1(x_1) := g_1(x_1). Then P_1 f_n = P_n g_n ≥ ε for each n. The assumed monotonicity,

<51>   g_n(ω|n) = h_n(ω) ≥ h_{n+1}(ω) = g_{n+1}(ω|n+1)   for all ω,

implies that {f_n : n ∈ ℕ} is a decreasing sequence of measurable functions. By Dominated Convergence, ε ≤ P_1 f_n ↓ P_1 inf_{n∈ℕ} f_n, from which we may deduce existence of at least one value x̄_1 for which f_n(x̄_1) ≥ ε for all n. In particular, g_1(x̄_1) ≥ ε.

Hold x̄_1 fixed for the rest of the argument. The defining property of x̄_1 becomes

(λ_{x̄_1} ⊗ Λ_3 ⊗ ... ⊗ Λ_n) g_n(x̄_1, x_2, ..., x_n) ≥ ε   for n ≥ 2.

Notice the similarity to the final inequality in <50>, with the role of P_1 taken over by λ_{x̄_1}. Repeating the argument, we find an x̄_2 for which

(λ_{(x̄_1,x̄_2)} ⊗ Λ_4 ⊗ ... ⊗ Λ_n) g_n(x̄_1, x̄_2, x_3, ..., x_n) ≥ ε   for n ≥ 3,

and with g_2(x̄_1, x̄_2) ≥ ε.

And so on. In this way construct an ω̄ = (x̄_1, x̄_2, ...) for which h_n(ω̄) = g_n(ω̄|n) ≥ ε for all n. In the limit we have h(ω̄) ≥ ε, which ensures that h is not the zero function. Sigma-smoothness at zero, and the asserted countable additivity for P, follow.

<52> Corollary. For probability measures P_i on arbitrary measure spaces (𝒳_i, 𝒜_i), there exists a probability measure P (also denoted by ⊗_{i∈ℕ} P_i) for which P(A_1 × A_2 × ... × A_k × ×_{i>k} 𝒳_i) = Π_{i≤k} P_i A_i, for all measurable rectangles.

Proof. Specialize to the case where λ_{ω_{i−1}} = P_i for all ω_{i−1} ∈ Ω_{i−1}.

The proof for existence of a countably additive P under topological assumptions is similar, with only a slight change in the definition of ℋ⁺.


<53> Theorem. (Daniell/Kolmogorov extension theorem) Suppose {P_n : n ∈ ℕ} is a consistent family of finite dimensional distributions. If each 𝒳_i is a separable metric space, equipped with its Borel sigma-field 𝒜_i, and if each P_n is tight, then the P defined by <47> has an extension to a tight probability measure on the product sigma-field ℱ.

REMARK. As you will see in Chapter 5, the tightness assumption actually ensures that the P_n satisfy the conditions of Theorem <49>. However, the construction of the probability kernels requires even more topological work than the direct proof of Theorem <53> given below.

Proof. As shown in Section A.6 of Appendix A, we may simplify definition <47> by restricting it to bounded, continuous, nonnegative g_n. Countable additivity of P is implied by its σ-smoothness at zero for the smaller ℋ⁺ class. That is, it suffices to prove that Ph_n ↓ 0 for every sequence {h_n : n ∈ ℕ} from ℋ⁺ that decreases pointwise to the zero function.

As in the proof of Theorem <49>, we may consider a sequence {h_i : i ∈ ℕ} in ℋ⁺ with 1 ≥ h_i ↓ h, but for which inf_i Ph_i ≥ ε for some ε > 0. Once again, with no loss of generality we may assume h_n depends on only the first n coordinates, that is, h_n(ω) = g_n(ω|n); but now the g_n functions are also continuous. Again we need to find a point ω̄ = (x̄_1, x̄_2, ...) in Ω for which h(ω̄) > 0.

The tightness assumption lets us construct compact subsets K_n ⊆ Ω_n for which P_n K_n > 1 − ε/2, with the added property that K_n × 𝒳_{n+1} ⊇ K_{n+1}, for every n: use the fact that sup_K P_{n+1}((K_n × 𝒳_{n+1}) ∩ K) = P_{n+1}(K_n × 𝒳_{n+1}) = P_n K_n, with the supremum running over all compact subsets K of Ω_{n+1}. The construction of K_n ensures that P_n g_n(ω_n){ω_n ∈ K_n} ≥ ε/2, so the compact set L_n := {ω_n ∈ K_n : g_n(ω_n) ≥ ε/2} is nonempty. Moreover, L_n × 𝒳_{n+1} ⊇ L_{n+1}.

By a Cantor diagonalization argument (Problem [30]), ∩_{n∈ℕ}(L_n × ×_{i>n} 𝒳_i) ≠ ∅. That is, there exists an ω̄ in Ω for which ω̄|n ∈ K_n and h_n(ω̄) ≥ ε/2 for every n. It follows that h(ω̄) ≥ ε/2. The probability measure P gives mass > 1 − ε/2 to the set ∩_{n∈ℕ}(K_n × ×_{i>n} 𝒳_i), which Problem [30] shows is compact.

9. Problems

[1] Let B_1, B_2, ... be independent events for which Σ_{i=1}^∞ PB_i = ∞. Show that P{B_n infinitely often} = 1 by following these steps.

(i) Show that P(B_1^c B_2^c ... B_n^c) ≤ exp(−Σ_{i=1}^n PB_i) → 0.

(ii) Deduce that Π_{i=1}^∞ B_i^c = 0 almost surely.

(iii) Deduce that ∪_{i=1}^∞ B_i = 1 almost surely.

(iv) Deduce that ∪_{i≥m} B_i = 1 almost surely, for each finite m. Hint: The events B_m, B_{m+1}, ... are independent.

(v) Complete the proof.
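Step (i) can be seen numerically. The sketch below makes the hypothetical choice PB_i = 1/i (so that Σ_i PB_i = ∞) and uses independence to write the probability of no occurrence among B_2, ..., B_{n−1} as a product, bounded via 1 − p ≤ e^{−p}.

```python
import math

# P(B_2^c ... B_n^c) = prod_i (1 - p_i) <= exp(-sum_i p_i), via 1 - p <= e^{-p}.
n = 50_000
prod_none, sum_p = 1.0, 0.0
for i in range(2, n):            # start at 2 so that 1 - p_i > 0
    p = 1.0 / i
    prod_none *= 1.0 - p
    sum_p += p
print(prod_none, math.exp(-sum_p))   # both tend to 0 as n grows
```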

Remark: This result is a converse to the Borel-Cantelli Lemma discussed in Section 2.6. A stronger converse was established in the Problems to Chapter 2.


[2] In Example <26> we needed the function g(x, y) = p y^{p−1}{f(x) > y > 0} to be product measurable if f is 𝒜-measurable. Prove this fact by following these steps.

(i) The map φ : ℝ² → ℝ defined by φ(s, t) = {s > t} is ℬ(ℝ²)\ℬ(ℝ)-measurable. (What do you know about inverse images for φ?)

(ii) Prove that the map (x, y) ↦ (f(x), y) is 𝒜 ⊗ ℬ(ℝ)\ℬ(ℝ) ⊗ ℬ(ℝ)-measurable.

(iii) Show that the composition of measurable functions is measurable, if the various sigma-fields fit together in the right way.

(iv) Complete the argument.

[3] Show that every continuous real function on a topological space 𝒳 is ℬ(𝒳)\ℬ(ℝ)-measurable. Hint: What do you know about inverse images of open sets?

[4] Let 𝒳 be a topological space. Say that its topology is countably generated if there exists a countable class of open sets 𝒢_0 such that G = ∪{G_0 : G ⊇ G_0 ∈ 𝒢_0} for each open set G. Show that such a 𝒢_0 generates the Borel sigma-field on 𝒳.

[5] Let (𝒳, d) be a separable metric space with a countable dense subset 𝒳_0. Show that its topology is countably generated. [Lindelöf's theorem.] Hint: Let 𝒢_0 be the countable class of all open balls of rational radius centered at a point of 𝒳_0. If x ∈ G, find a rational r such that G contains the ball of radius 2r centered at x, then find a point of 𝒳_0 lying within a distance r of x.

[6] Let 𝒳 and 𝒴 be topological spaces equipped with their Borel sigma-fields ℬ(𝒳) and ℬ(𝒴). Equip 𝒳 × 𝒴 with the product topology and its Borel sigma-field ℬ(𝒳 × 𝒴). (The open sets in the product space are, by definition, all possible unions of sets G × H, with G open in 𝒳 and H open in 𝒴.)

(i) Show that ℬ(𝒳) ⊗ ℬ(𝒴) ⊆ ℬ(𝒳 × 𝒴).

(ii) If both 𝒳 and 𝒴 have countably generated topologies, prove equality of the two sigma-fields on the product space.

(iii) Show that ℬ(ℝⁿ) = ℬ(ℝ) ⊗ ... ⊗ ℬ(ℝ).

[7] Let 𝒳 be a set with cardinality greater than 2^{ℵ_0}, the cardinality of the set of all sequences of 0's and 1's. Equip 𝒳 with the discrete topology, for which all subsets are open.

(i) Show that the Borel sigma-field ℬ(𝒳 × 𝒳) consists of all subsets of 𝒳 × 𝒳.

(ii) Show that ℬ(𝒳) ⊗ ℬ(𝒳) equals

∪{σ(ℰ) : ℰ a countable class of measurable rectangles}.

(iii) For a given countable class 𝒞 of subsets of 𝒳 define an equivalence relation x ∼ y if and only if {x ∈ C} = {y ∈ C} for all C in 𝒞. Show that there are at most 2^{ℵ_0} equivalence classes. Deduce that there exists at least one pair of distinct points x_0 and y_0 such that x_0 ∼ y_0. Deduce that σ(𝒞) cannot separate the pair (x_0, y_0).

(iv) Show that the diagonal Δ = {(x, y) ∈ 𝒳 × 𝒳 : x = y} cannot belong to the product sigma-field ℬ(𝒳) ⊗ ℬ(𝒳), which is therefore a proper sub-sigma-field


of ℬ(𝒳 × 𝒳). Hint: Suppose Δ ∈ σ(ℰ) for some countable class ℰ of measurable rectangles. Find a pair of distinct points x_0, y_0 such that no member of ℰ, or of σ(ℰ), can extract a proper subset of F = {(x_0, x_0), (y_0, x_0), (x_0, y_0), (y_0, y_0)}, but FΔ = {(x_0, x_0), (y_0, y_0)} is such a proper subset.

[8] Prove Theorem <9> by following these steps.

(i) For fixed E_2, ..., E_n in ℰ_2, ..., ℰ_n, define 𝒟_1 as the class of all events D for which P(D E_2 ... E_n) = (PD)(PE_2) ... (PE_n). Show that 𝒟_1 is a λ-class.

(ii) Deduce that 𝒟_1 ⊇ σ(ℰ_1). That is, the analog of the hypothesized factorization with ℰ_1 replaced by σ(ℰ_1) also holds.

(iii) Argue similarly that each subsequent ℰ_i can also be replaced by its σ(ℰ_i).

[9] Let Z̄_n = (Z_1 + ... + Z_n)/n, for a sequence {Z_i} of independent random variables.

(i) Use the Kolmogorov zero-one law (Example <12>) to show that the set {lim sup Z̄_n > r} is a tail event for each constant r, and hence it has probability either zero or one. Deduce that lim sup Z̄_n = c_0 almost surely, for some constant c_0 (possibly ±∞).

(ii) If Z̄_n converges to a finite limit (possibly random) at each ω in a set A with PA > 0, show that in fact there must exist a finite constant c_0 for which Z̄_n → c_0 almost surely.

[10] Let X and Y be independent, real-valued random variables for which P(XY) is well defined, that is, either P(XY)⁺ < ∞ or P(XY)⁻ < ∞. Suppose neither X nor Y is degenerate (equal to zero almost surely). Show that both PX and PY are well defined and P(XY) = (PX)(PY). Hint: What would you learn from ∞ > P(XY)⁺ = P(X⁺Y⁺) + P(X⁻Y⁻)?

[11] For sigma-finite measures μ and λ, on sigma-fields 𝒜 and ℬ, show that the only measure Γ on 𝒜 ⊗ ℬ for which Γ(A × B) = (μA)(λB) for all A ∈ 𝒜 and B ∈ ℬ is the product measure μ ⊗ λ. Hint: Use the π-λ theorem when both measures are finite. Extend to the sigma-finite case by breaking the underlying product space into a countable union of measurable rectangles, A_i × B_j, with μA_i < ∞ and λB_j < ∞, for all i, j.

[12] Let m denote Lebesgue measure and λ denote counting measure (that is, λA equals the number of points in A, possibly infinite), both on ℬ(ℝ). Let f(x, y) = {x = y}. Show that m^x λ^y f(x, y) = ∞ but λ^y m^x f(x, y) = 0. Why does the Tonelli Theorem not apply?

[13] Let λ and μ both denote counting measure on the sigma-field of all subsets of ℕ. Let f(x, y) := {y = x} − {y = x + 1}. Show that μ^x λ^y f(x, y) = 0 but λ^y μ^x f(x, y) = 1. Why does the Fubini Theorem not apply?
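The two iterated sums here can be computed exactly, because for counting measure each section f(x, ·) and f(·, y) has finite support. The sketch below carries out the calculation.

```python
# f(x, y) = {y == x} - {y == x + 1} on N x N, N = {1, 2, ...}.
# For each fixed x the y-sum is 0; for each fixed y the x-sum is {y == 1}.
def f(x, y):
    return (1 if y == x else 0) - (1 if y == x + 1 else 0)

def sum_over_y(x):
    # f(x, .) is supported on {x, x + 1}, so the infinite sum reduces to this
    return f(x, x) + f(x, x + 1)

def sum_over_x(y):
    # f(., y) is supported on {y, y - 1} intersected with N
    return sum(f(x, y) for x in {y, y - 1} if x >= 1)

first_order = sum(sum_over_y(x) for x in range(1, 1000))    # = 0
second_order = sum(sum_over_x(y) for y in range(1, 1000))   # = 1
print(first_order, second_order)
```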

[14] For nonnegative random variables Z_1, ..., Z_m, show that P(Z_1 Z_2 ... Z_m) is no greater than ∫_0^1 q_{Z_1}(u) q_{Z_2}(u) ... q_{Z_m}(u) du, with q_{Z_i} the quantile function for Z_i.

[15] Let m denote Lebesgue measure on ℝ^k. For each fixed a in ℝ^k, define m_a f := m^x f(x + a). Check that m_a is a measure on ℬ(ℝ^k) for which m_a[b, c] = m[b, c],


for each rectangle [b, c] = ×_i [b_i, c_i]. Deduce that m_a = m. That is, m is invariant under translation: m_a f = ∫ f(x + a) dx = ∫ f(x) dx = mf.

[16] Let μ and ν be finite measures on ℬ(ℝ). Define distribution functions F(t) := μ(−∞, t] and G(t) := ν(−∞, t].

(i) Show that there are at most countably many points x, the atoms, for which μ{x} > 0.

(ii) Show that

ν^t F(t) + μ^t G(t) = μ(ℝ)ν(ℝ) + Σ_i μ{x_i}ν{x_i},

where {x_i : i ∈ ℕ} contains all the atoms for both measures.

(iii) Explain how (ii) is related to the integration-by-parts formula:

∫ F(t) (dG(t)/dt) dt = F(∞)G(∞) − ∫ G(t) (dF(t)/dt) dt.

Hint: Read Section 3.4.
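The formula in part (ii) can be confirmed exactly for a pair of purely atomic measures; the atoms and masses below are arbitrary choices.

```python
# Discrete check of nu^t F(t) + mu^t G(t) = mu(R) nu(R) + sum_x mu{x} nu{x}.
mu = {0.0: 0.5, 1.0: 1.5, 2.5: 1.0}     # mu{x} for each atom x
nu = {1.0: 2.0, 2.0: 0.5, 2.5: 0.5}

F = lambda t: sum(p for x, p in mu.items() if x <= t)   # F(t) = mu(-inf, t]
G = lambda t: sum(p for y, p in nu.items() if y <= t)

lhs = (sum(p * F(t) for t, p in nu.items())
       + sum(p * G(s) for s, p in mu.items()))
rhs = (sum(mu.values()) * sum(nu.values())
       + sum(mu[x] * nu[x] for x in mu if x in nu))
print(lhs, rhs)   # equal (both 12.5 here)
```

The diagonal {s = t} is counted by both iterated integrals, which is exactly where the atom term comes from.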

[17] In the notation of Example <32>, show that PΨ(|X − Y|) ≥ PΨ(|X̃ − Ỹ|), for every convex, increasing function Ψ on ℝ⁺. Hint: Use the fact that Ψ(x) = Ψ(0) + ∫_0^x H(t) dt, where H is an increasing right-continuous function on ℝ⁺. Represent H(t) as μ(0, t] for some measure μ. Consider P m^s μ^t {0 < t < s < Δ}, remembering inequality <35>.

[18] Let P = ⊗_{i≤n} P_i and Q = ⊗_{i≤n} Q_i, where P_i and Q_i are defined on the same sigma-field 𝒜_i. Show that

H²(P, Q) = 2 − 2 Π_{i≤n} (1 − ½H²(P_i, Q_i)) ≤ Σ_{i≤n} H²(P_i, Q_i),

where H² denotes squared Hellinger distance, as defined in Section 3.3. Hint: For the equality, factorize Hellinger affinities calculated using dominating measures λ_i = (P_i + Q_i)/2. For the inequality, establish the identity Σ_{i≤n} y_i + Π_{i≤n} (1 − y_i) ≥ 1 for all 0 ≤ y_i ≤ 1.
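For products of Bernoulli measures everything is computable in closed form, which makes both the factorization and the inequality easy to check; the parameter vectors below are arbitrary choices.

```python
from itertools import product
from math import sqrt, prod

# Products of Bernoulli(p_i) and Bernoulli(q_i) measures on {0,1}^4.
p = (0.2, 0.5, 0.7, 0.9)
q = (0.3, 0.4, 0.8, 0.6)

def h2(a, b):   # squared Hellinger distance between Bernoulli(a), Bernoulli(b)
    return (sqrt(a) - sqrt(b)) ** 2 + (sqrt(1 - a) - sqrt(1 - b)) ** 2

def mass(params, outcome):
    return prod(a if x == 1 else 1 - a for a, x in zip(params, outcome))

# direct computation of H^2 between the two product measures
H2 = sum((sqrt(mass(p, w)) - sqrt(mass(q, w))) ** 2
         for w in product((0, 1), repeat=len(p)))
factorized = 2 - 2 * prod(1 - h2(a, b) / 2 for a, b in zip(p, q))
bound = sum(h2(a, b) for a, b in zip(p, q))
print(abs(H2 - factorized) < 1e-9, factorized <= bound)
```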

[19] (One-sided version of <38>) Let Z_1, ..., Z_N be independent random variables, and ε_1, ε_2, and β be nonnegative constants for which P{Z_i + Z_{i+1} + ... + Z_N ≥ −ε_2} ≥ 1/β for each i. Show that P{max_{i≤N} (Z_1 + ... + Z_i) ≥ ε_1 + ε_2} ≤ βP{Z_1 + ... + Z_N ≥ ε_1}.

[20] Let X_1, X_2, ... be independent, identically distributed random variables with PX_i⁺ = ∞ > PX_i⁻. Show that Σ_{i≤n} X_i/n → ∞ almost surely. Hint: Apply Theorem <3> to the random variables {X_i⁻} and {m ∧ X_i⁺}, for positive constants m. Note that P(m ∧ X_i⁺) → ∞ as m → ∞.

[21] Let X_1, X_2, ... be independent, identically distributed random variables with P|X_i| = ∞. Let S_n = X_1 + ... + X_n. Show that S_n/n cannot converge almost surely to a finite value. Hint: If S_n/n → c, show that (S_{n+1} − S_n)/n → 0 almost surely. Deduce from Problem [1] that Σ_n P{|X_n| > n} < ∞. Argue for a contradiction by showing that P|X_1| ≤ 1 + Σ_n P{|X_n| > n}.

[22] (Kronecker's lemma) Let {b_i} and {x_i} be sequences of real numbers for which 0 < b_1 ≤ b_2 ≤ ... → ∞ and Σ_{i=1}^∞ x_i is convergent (with a finite limit). Show that Σ_{i=1}^n b_i x_i / b_n → 0 as n → ∞, by following these steps.


(i) Express b_i as a_1 + ... + a_i, for a sequence {a_j} of nonnegative numbers. By a change in the order of summation, show, for m < n, that Σ_{i=1}^n b_i x_i equals

c_m + Σ_{j=1}^n a_j Σ_{i=max(m+1, j)}^n x_i,   where c_m := Σ_{i=1}^m b_i x_i.

(ii) Given ε > 0, find an m such that |Σ_{i=p}^n x_i| < ε whenever m < p ≤ n.

(iii) With m as in (ii), and n > m, show that |Σ_{i=1}^n b_i x_i| ≤ |c_m| + ε Σ_{j=1}^n a_j = |c_m| + ε b_n.

(iv) Deduce the asserted convergence.
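A concrete instance of the lemma: the choices b_i = i and x_i = (−1)^i / i are arbitrary but satisfy the hypotheses (b_i increases to ∞, and the alternating series Σ x_i converges).

```python
# Kronecker's lemma in action: b_i = i, x_i = (-1)^i / i, so b_i * x_i = (-1)^i
# and (sum_{i<=n} b_i x_i) / b_n is bounded by about 1/n in absolute value.
n = 100_000
weighted = 0.0
for i in range(1, n + 1):
    weighted += i * ((-1) ** i / i)
avg = weighted / n               # (sum_{i<=n} b_i x_i) / b_n
print(avg)                       # very close to 0
```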

[23] Let Y_1, Y_2, ... be independent, identically distributed random variables, with P|Y_1|^α < ∞ for some fixed α with 0 < α < 1. Define β := 1/α and Z_i := Y_i{|Y_i|^α ≤ i}.

(i) Show that Σ_{i=1}^∞ PZ_i²/i^{2β} < ∞ and Σ_{i=1}^∞ P|Z_i|/i^β < ∞.

(ii) Deduce that Σ_{i=1}^∞ Z_i/i^β is convergent almost surely.

(iii) Show that Σ_{i=1}^∞ P{|Y_i|^α > i} < ∞.

(iv) Deduce that Σ_{i=1}^∞ Y_i/i^β is convergent almost surely.

(v) Deduce via Kronecker's Lemma (Problem [22]) that n^{−1/α}(Y_1 + ... + Y_n) converges to 0 almost surely.

[24] Let S_n = X_1 + ... + X_n, a sum of independent, identically distributed random variables with PX_1^{2k} < ∞ for some positive integer k and PX_i = 0. Show that PS_n^{2k} = O(n^k) as n → ∞.
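For Rademacher summands and k = 2 the moment can be computed exactly: expanding PS_n⁴ for X_i = ±1 gives n + 3n(n − 1) = 3n² − 2n = O(n²). The sketch below (an illustrative special case, not a proof of the problem) confirms this by enumeration.

```python
from itertools import product

# PS_n^4 for X_i = +/-1: exact enumeration confirms PS_n^4 = 3n^2 - 2n = O(n^2).
for n in range(1, 11):
    fourth = sum(sum(w) ** 4 for w in product((-1, 1), repeat=n)) / 2 ** n
    assert fourth == 3 * n * n - 2 * n
print("PS_n^4 = 3n^2 - 2n for n = 1, ..., 10")
```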

[25] Establish inequality <44>. Hint: First show that Σ_{i≥k} i^{−2} decreases like k^{−1}, for k = 2, 3, ..., by comparing with ∫_{k−1}^∞ y^{−2} dy. Then convert to a continuous range. It helps to consider the case |x| ≤ 2 separately, to avoid inadvertent division by zero.

[26] (Etemadi 1981) Let X_1, X_2, ... be independent, identically distributed, integrable random variables, with expected value μ. Give another proof of the SLLN by following these steps. As in Section 7, define Y_i = X_i{|X_i| ≤ i} and μ_i = PY_i. Let T_n = Σ_{i≤n} Y_i.

(i) Argue that there is no loss of generality in assuming X_i ≥ 0. Hint: Consider positive and negative parts.

(ii) Show that PT_n/n → μ as n → ∞ and (S_n − T_n)/n → 0 almost surely.

(iii) Show that var(T_n) ≤ Σ_{i≤n} PX_1²{X_1 ≤ i}.

(iv) For a fixed ρ > 1, let {k_n} be an increasing sequence of positive integers such that k_n/ρⁿ → 1. Show that Σ_n {i ≤ k_n}/k_n² ≤ C/i² for each positive integer i, for some constant C.

(v) Use parts (iii) and (iv), and Problem [25], to show that

Σ_n P{|T_{k_n} − PT_{k_n}| > εk_n} ≤ Cε^{−2} Σ_i PX_1²{X_1 ≤ i}/i² < ∞.

Deduce via the Borel-Cantelli lemma that (T_{k_n} − PT_{k_n})/k_n → 0 almost surely.

(vi) Deduce that S_{k_n}/k_n → μ almost surely, as n → ∞.

(vii) For each ρ' > ρ, show that

S_{k_n}/(ρ'k_n) ≤ S_m/m ≤ ρ'S_{k_{n+1}}/k_{n+1}   for k_n ≤ m ≤ k_{n+1},


when n is large enough.

(viii) Deduce that lim sup S_m/m and lim inf S_m/m both lie between μ/ρ' and μρ', with probability one.

(ix) Cast out a sequence of negligible sets as ρ' decreases to 1 to deduce that S_m/m → μ almost surely.

[27] Let P be a probability measure on (Ω, ℱ). Suppose X is a (possibly nonmeasurable) subset of Ω with outer probability 1, that is, PF = 1 for every set F with X ⊆ F ∈ ℱ.

(i) If F_1 and F_2 are sets in ℱ for which F_1X = F_2X, show that PF_1 = PF_2.

(ii) Write 𝒜 for the collection of all sets of the form FX, with F ∈ ℱ. Show that 𝒜 is a sigma-field (the so-called trace sigma-field).

(iii) Show that M⁺(X, 𝒜) consists of the restrictions to X of the functions in M⁺(Ω, ℱ).

(iv) Show that Q(FX) := PF is a well defined (by (i)) probability measure on 𝒜.

(v) Show that Qf_0 = Pf, for each f ∈ M⁺(Ω, ℱ) whose restriction to X equals f_0.

[28] Let m denote Lebesgue measure on the Borel sigma-field of [0, 1). Define an equivalence relation on [0, 1) by x ∼ y if x − y is rational. Let A_0 be any subset containing exactly one point from each equivalence class. Let {r_i : i ∈ ℕ} be an enumeration of the set of all rational numbers in (0, 1). Write A_i for the set of all numbers of the form x + r_i, with x ∈ A_0 and addition carried out modulo 1.

(i) Show that the sets A_i, for i ≥ 0, are disjoint, and ∪_{i≥0} A_i = [0, 1).

(ii) Let D_0 be a Borel measurable subset of A_0. Define D_i as the set of points x + r_i, with x ∈ D_0 and addition carried out modulo 1. Show that mD_0 = mD_i for each i, and 1 ≥ Σ_{i≥0} mD_i. Deduce that mD_0 = 0.

(iii) Deduce that A_0 cannot be measurable (not even for the completion of the Borel sigma-field), for otherwise [0, 1) would be a countable union of Lebesgue negligible sets.

(iv) Suppose D is a Borel measurable subset of ∪_{i≥n} A_i, for some finite n. Show that [0, 1) contains countably many disjoint translations of D. Deduce that mD = 0.

(v) Define X_n = ∪_{i≥n} A_i. Deduce from (iv) that each X_n has Lebesgue outer measure 1 (that is, mB = 1 for each Borel set B ⊇ X_n), but ∩_n X_n = ∅.

[29] Let $m$ denote Lebesgue measure on $[0, 1)$, and let $\{X_n : n \in \mathbb{N}\}$ be a decreasing sequence of (nonmeasurable) subsets with $\cap_n X_n = \emptyset$ but with each $X_n$ having outer measure 1, as in Problem [28]. Write $\mathcal{A}_n$ for the trace of $\mathcal{B} := \mathcal{B}[0, 1)$ on $X_n$, as in Problem [27]. Write $\Omega_n$ for $\times_{i \le n} X_i$ and $\mathcal{F}_n$ for $\otimes_{i \le n} \mathcal{A}_i$, as in Section 8.

(i) Show that each function $f$ in $\mathcal{M}^+(\Omega_n, \mathcal{F}_n)$ is the restriction to $\Omega_n$ of a function $\bar{f}$ in $\mathcal{M}^+([0, 1)^n, \mathcal{B}^n)$.

(ii) Show that $\mathbb{P}_n f := m^x \bar{f}(x, x, \ldots, x)$, for $f \in \mathcal{M}^+(\Omega_n, \mathcal{F}_n)$, defines a consistent family of finite dimensional distributions.

108 Chapter 4: Product spaces and independence

(iii) Let $g_n$ denote the indicator function of $\{\omega_n \in \Omega_n : x_1 = x_2 = \cdots = x_n\}$, and let $f_n(\omega) = g_n(\omega_{|n})$ be the corresponding functions in $\mathcal{M}^+$. Show that $\mathbb{P}_n g_n = 1$ for all $n$, even though $f_n \downarrow 0$ pointwise.

(iv) Deduce that the functional $\mathbb{P}$, as defined by <47>, is not sigma-smooth at zero.

[30] Let $\Omega = \times_{i \in \mathbb{N}} X_i$ be a product of metric spaces. Define $\Omega_n = \times_{i \le n} X_i$ and $S_n = \times_{i > n} X_i$. For each $n$ let $K_n$ be a compact subset of $\Omega_n$, with the property that $H_n := \cap_{i \le n} (K_i \times S_i) \ne \emptyset$ for each finite $n$. Write $\pi_i$ for the projection map from $\Omega$ onto $\Omega_i$. Show that $H := \cap_{i \in \mathbb{N}} H_i$ is a nonempty compact subset of $\Omega$, by following these steps. (Remember, for metric spaces, compactness is equivalent to the property that each sequence has a convergent subsequence.)

(i) Let $\{z_n : n \in \mathbb{N}\}$ be a sequence with $z_n \in H_n$. Use compactness of each $K_i$ to find subsequences $\mathbb{N}_1 \supseteq \mathbb{N}_2 \supseteq \mathbb{N}_3 \supseteq \cdots$ of $\mathbb{N}$ for which $y(i) := \lim_{n \in \mathbb{N}_i} \pi_i z_n$ exists, as a point of $K_i$. Define $\mathbb{N}_\infty$ to be the subsequence whose $i$th member equals the $i$th member of $\mathbb{N}_i$. Show that $\lim_{n \in \mathbb{N}_\infty} \pi_i z_n = y(i)$ for every $i$.

(ii) Show that the first $i$ components of $y(i+1)$ coincide with $y(i)$, for each $i$. Deduce that there exists a $y$ in $\Omega$ for which $\pi_i y = y(i) \in K_i$ for every $i$. Deduce that $y \in H$, and hence $H \ne \emptyset$.

(iii) Specialize to the case where $z_n \in H$ for every $n$. Show that there is a subsequence that converges to a point of $H$.

[31] Let $T$ be an uncountable index set, and let $\Omega = \times_{t \in T} X_t$ denote the set of all functions $\omega : T \to \cup_t X_t$ with $\omega(t) \in X_t$ for each $t$. Let $\mathcal{A}_t$ be a sigma-field on $X_t$. For each $S \subseteq T$, define $\mathcal{F}_S$ to be the smallest sigma-field for which each of the maps $\omega \mapsto \omega(s)$, for $s \in S$, is $\mathcal{F}_S\backslash\mathcal{A}_s$-measurable. The product sigma-field $\otimes_{t \in T} \mathcal{A}_t$ is defined to equal $\mathcal{F}_T$.

(i) Show that $\mathcal{F}_T = \cup_S \mathcal{F}_S$, the union running over all countable subsets $S$ of $T$.

(ii) For each countable $S \subseteq T$, let $P_S$ be a probability measure on $\mathcal{F}_S$, with the property that $P_S$ equals the restriction of $P_{S'}$ to $\mathcal{F}_S$ if $S \subseteq S'$. Show that $PF := P_S F$ for $F \in \mathcal{F}_S$ defines a countably additive probability measure on $\mathcal{F}_T$.

10. Notes

According to the account by Hawkins (1979, pages 154-162), the Fubini result (in a 1907 paper, discussed by Hawkins) was originally stated only for products of Lebesgue measure on bounded intervals. I do not personally know whether the appellation Tonelli's Theorem is historically justified, but Dudley (1989, page 113) cited a 1909 paper of Tonelli as correcting an error in the Fubini paper. As noted by Hawkins, Lebesgue also has a claim to being the inventor of the measure theoretic version of the theorem for iterated integrals: in his thesis (pages 44-51 of Lebesgue 1902) he established a form of the theorem for bounded measurable functions defined on a rectangle. He expressed the two-dimensional Lebesgue integral as an iterated inner or outer one-dimensional integral. With modern hindsight, his result essentially contains the Fubini Theorem for Lebesgue measure on intervals, but the reformulation involves some later refinements. Royden (1968, Section 12.4) gave an excellent discussion of the distinctions between the two theorems, Tonelli and Fubini, and the need for something like sigma-finiteness in the Tonelli theorem.

I do not know, in general, whether my definition of sigma-finiteness of a kernel, in the sense of Definition <21>, is equivalent to the apparently weaker property of sigma-finiteness of each measure $\lambda_x$, for $x \in \mathcal{X}$.

The inequality between $\mathcal{L}^2$ norms in Example <32> was noted by Fréchet (1957). He cited earlier works of Salvemini, Bass, and Dall'Aglio (none of which I have seen) containing more specialized forms of the result, based on Fréchet (1951). The $\mathcal{L}^1$ version of the result is also in the literature (see the comments by Dudley 1989, page 342). I do not know whether the general version of the inequality, as in Problem [17], has been stated before.

The Maximal Inequality <38> is usually attributed to a 1939 paper of Ottaviani, which I have not seen. Theorem <39> is due to Kolmogorov (1928). Theorem <3> was stated without proof by Kolmogorov (1933, p 57; English edition p 69), with the remark that the proof had not been published. However, the necessary techniques for the proof were already contained in his earlier papers (Kolmogorov 1928, 1930). Problem [26] presents a slight repackaging of an alternative method of proof due to Etemadi (1981). By splitting summands into positive and negative parts, he was able to greatly simplify the method for handling blocks of partial sums.

Daniell (1919) constructed measures on countable products of bounded subintervals of the real line. Kolmogorov (1933, Section III.4), apparently unaware of Daniell's work, proved the extension theorem for arbitrary products of real lines. As shown by Problem [31], the extension from countable to uncountable products is almost automatic. Theorem <49> is due to Ionescu Tulcea (1949). See Doob (1953, pages 613-615) or Neveu (1965, Section 5.1) for different arrangements of the proof. Apparently (Doob 1953, p 639) there was quite a history of incorrect assertions before the question of existence of measures on infinite product spaces was settled. Andersen & Jessen (1948), as well as providing a counterexample (Problem [29]) to the general analog of the Kolmogorov extension theorem, also suggested that the Ionescu Tulcea form was more widely known:

In the terminology of the theory of probability this means, that the case of dependent variables cannot be treated for abstract variables in the same manner as for unrestricted real variables. Professor Doob has kindly pointed out, what was also known to us, that this case may be dealt with along similar lines as the case of independent variables (product measures) when conditional probability measures are supposed to exist. This question will be treated in a forthcoming paper by Doob and Jessen.

REFERENCES

Andersen, E. S. & Jessen, B. (1948), 'On the introduction of measures in infinite product sets', Danske Vid. Selsk. Mat.-Fys. Medd.

Daniell, P. J. (1919), 'Functions of limited variation in an infinite number of dimensions', Annals of Mathematics (series 2) 21, 30-38.

Doob, J. L. (1953), Stochastic Processes, Wiley, New York.

Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.

Etemadi, N. (1981), 'An elementary proof of the strong law of large numbers', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 55, 119-122.

Fréchet, M. (1951), 'Sur les tableaux de corrélation dont les marges sont données', Annales de l'Université de Lyon 14, 53-77.

Fréchet, M. (1957), 'Sur la distance de deux lois de probabilité', Comptes Rendus de l'Académie des Sciences, Paris, Série I Math 244, 689-692.

Hawkins, T. (1979), Lebesgue's Theory of Integration: Its Origins and Development, second edn, Chelsea, New York.

Ionescu Tulcea, C. T. (1949), 'Mesures dans les espaces produits', Lincei Rend. Sc. fis. mat. e nat. 7, 208-211.

Kolmogorov, A. (1928), 'Über die Summen durch den Zufall bestimmter unabhängiger Größen', Mathematische Annalen 99, 309-319. Corrections: same journal, volume 102, 1929, pages 484-488.

Kolmogorov, A. (1930), 'Sur la loi forte des grands nombres', Comptes Rendus de l'Académie des Sciences, Paris 191, 910-912.

Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin. Second English Edition, Foundations of Probability 1950, published by Chelsea, New York.

Lebesgue, H. (1902), Intégrale, longueur, aire. Doctoral dissertation, submitted to Faculté des Sciences de Paris. Published separately in Ann. Mat. Pura Appl. 7. Included in the first volume of his Œuvres Scientifiques, published in 1972 by L'Enseignement Mathématique.

Neveu, J. (1965), Mathematical Foundations of the Calculus of Probability, Holden-Day, San Francisco.

Royden, H. L. (1968), Real Analysis, second edn, Macmillan, New York.

Wald, A. (1949), 'Note on the consistency of the maximum likelihood estimate', Annals of Mathematical Statistics 20, 595-601.


Chapter 5

Conditioning

SECTION 1 considers the elementary case of conditioning on a map that takes only finitely many different values, as motivation for the general definition.

SECTION 2 defines conditional probability distributions for conditioning on the value of a general measurable map.

SECTION 3 discusses existence of conditional distributions by means of a slightly more general concept, disintegration, which is essential for the understanding of general conditional densities.

SECTION 4 defines conditional densities. It develops the general analog of the elementary formula for a conditional density: (joint density)/(marginal density).

SECTION *5 illustrates how conditional distributions can be identified by symmetry considerations. The classical Borel paradox is presented as a warning against the misuse of symmetry.

SECTION 6 discusses the abstract Kolmogorov conditional expectation, explaining why it is natural to take the conditioning information to be a sub-sigma-field.

SECTION *7 discusses the statistical concept of sufficiency.

1. Conditional distributions: the elementary case

In introductory probability courses, conditional probabilities of events are defined as ratios, $P(A \mid B) = P(AB)/PB$, provided $PB \ne 0$. The division by $PB$ ensures that $P(\cdot \mid B)$ is also a probability measure, which puts zero mass outside the set $B$, that is, $P(B^c \mid B) = 0$. The conditional expectation of a random variable $X$ is defined as its expectation with respect to $P(\cdot \mid B)$, or, more succinctly, $P(X \mid B) = P(XB)/PB$. If $PB = 0$, the conditional probabilities and conditional expectations are either left undefined or are extracted by some heuristic limiting argument. For example, if $Y$ is a random variable with $P\{Y = y\} = 0$ for each possible value $y$, one hopes that something like $P(A \mid Y = y) = \lim_{\delta \to 0} P(A \mid y < Y \le y + \delta)$ exists and is a probability measure for each fixed $y$. Rigorous proofs lie well beyond the scope of the typical introductory course.

In applications of conditioning, the definitions get turned around, to derive probabilities and expectations from conditional distributions constructed by appeals to symmetry or modelling assumptions. The typical calculation starts from a partition of the sample space $\Omega$ into finitely many disjoint events, such as the sets $\{T = t\}$ where some random variable $T$ takes each of its possible values $1, 2, \ldots, n$. From the probabilities $P\{T = t\}$ and the conditional distributions $P(\cdot \mid T = t)$, for each $t$, one calculates expected values as weighted averages,

<1>  $PX = \sum_t P\big(X\{T = t\}\big) = \sum_t P(X \mid T = t)\, P\{T = t\}.$

Notice that the weights $Q\{t\} := P\{T = t\}$ define a probability measure $Q$ on the range space $\mathcal{T} = \{1, 2, \ldots, n\}$, the distribution of $T$ under $P$. (That is, $Q = TP$.) Also, if there is no ambiguity about the choice of $T$, it helps notationally to abbreviate the conditional distribution to $P_t(\cdot)$, writing $P_t(X)$ instead of $P(X \mid T = t)$. The probability measure $P_t$ lives on $\Omega$, with $P_t\{T \ne t\} = 0$ for each $t$ in $\mathcal{T}$. With these simplifications in notation, formula <1> can be rewritten more concisely as

<2>  $PX = Q^t\big(P_t X\big) \qquad\text{for } X \in \mathcal{M}^+(\Omega, \mathcal{F}),$

with the interpretation that the probability measure $P$ is a weighted average of the family of probability measures $\mathcal{P} = \{P_t : t \in \mathcal{T}\}$. The new formula also has a suggestive interpretation as a two-step method for generating an observation $\omega$ from $P$:

(i) First generate a $t$ from the distribution $Q$ on $\mathcal{T}$.

(ii) Given the value $t$ from step (i), generate $\omega$ from the distribution $P_t$.

Notice that $P_t$ concentrates on the set of $\omega$ for which $T(\omega) = t$. The value of $T(\omega)$ from step (ii) must therefore equal the $t$ from step (i).
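The two-step method is easy to check by simulation. The sketch below uses a toy example of my own (not from the text): $\Omega = \{1, \ldots, 6\}$ with the uniform distribution $P$, and $T(\omega) = \omega \bmod 2$, so that $Q\{0\} = Q\{1\} = 1/2$ and each $P_t$ is uniform on the level set $\{T = t\}$. Generating $t$ from $Q$ and then $\omega$ from $P_t$ should reproduce draws from $P$.

```python
import random

# Toy illustration (my own example, not from the text): Omega = {1,...,6}
# with uniform P, and T(omega) = omega mod 2.
random.seed(1)

omega_space = [1, 2, 3, 4, 5, 6]
T = lambda w: w % 2

Q = {0: 0.5, 1: 0.5}                       # distribution of T under P
level_sets = {t: [w for w in omega_space if T(w) == t] for t in Q}

def two_step_sample():
    t = random.choices(list(Q), weights=list(Q.values()))[0]  # step (i): t from Q
    return random.choice(level_sets[t])                       # step (ii): omega from P_t

draws = [two_step_sample() for _ in range(60000)]
freqs = {w: draws.count(w) / len(draws) for w in omega_space}
print(freqs)   # each omega should appear with frequency close to 1/6
```

The empirical frequencies approximate the uniform distribution $P$, as formula <2> predicts.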

<3> Example. Suppose a deck of 26 red and 26 black cards is well shuffled, that is, all 52! permutations of the cards are equally likely. Let $A$ denote the event {top and bottom cards red}, and let $T$ be the map into $\mathcal{T} = \{\text{red}, \text{black}\}$ that gives the color of the top card. Then $T$ has distribution $Q$ given by

$Q\{\text{red}\} = P\{T = \text{red}\} = 1/2 \qquad\text{and}\qquad Q\{\text{black}\} = P\{T = \text{black}\} = 1/2.$

By symmetry, the conditional distribution $P_t$ gives equal probability to all permutations of the remaining 51 cards. In particular,

$P_{\text{red}} A = P\{\text{top and bottom cards red} \mid T = \text{red}\} = 25/51,$

$P_{\text{black}} A = P\{\text{top and bottom cards red} \mid T = \text{black}\} = 0/51,$

from which we deduce that

$PA = Q\{\text{red}\}\,(25/51) + Q\{\text{black}\}\,(0/51) = 25/102.$

Notice how we were able to assign $P_t$ probabilities by appeals to symmetry, rather than by a direct calculation of a ratio.
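A quick Monte Carlo check of the calculation (an illustration of my own, not part of the text): the unconditional frequency of the event $A$ should approach $25/102 \approx 0.245$, and the frequency among shuffles with a red top card should approach $25/51$.

```python
import random

# Monte Carlo check: P(A) = 25/102 and P(A | T = red) = 25/51 for a
# well-shuffled deck of 26 red and 26 black cards.
random.seed(2)
deck = ['R'] * 26 + ['B'] * 26

n = 100000
both_red = red_top = both_red_given_red = 0
for _ in range(n):
    random.shuffle(deck)
    if deck[0] == 'R':                      # top card red, i.e. T = red
        red_top += 1
        if deck[-1] == 'R':                 # bottom card also red: event A
            both_red_given_red += 1
            both_red += 1

print(both_red / n, 25 / 102)               # unconditional probability of A
print(both_red_given_red / red_top, 25 / 51)  # conditional on a red top card
```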

Section 2 will describe the extension of formula <2> to more general families of conditional distributions $\{P_t : t \in \mathcal{T}\}$. Unfortunately, for subtle technical reasons (discussed in Appendix F), conditional distributions do not always exist. In such situations we must settle for a weaker concept of conditional expectation, following an approach introduced by Kolmogorov (1933, Chapter 5), as explained in Section 6.

REMARK. I claim that it is usually easier to think in terms of conditional distributions, despite the technical caveats regarding existence. Kolmogorov's abstract conditional expectations can be thought of as rescuing just one of the desirable properties of conditional distributions in situations where the full conditional distribution does not exist. The rescue comes at the cost of some loss in intuitive appeal, and with some undesirable side effects. Not all intuitively desirable properties of conditioning survive the nonexistence of a conditional probability distribution. Section 7 will provide an example, where the abstract approach to conditioning allows some counterintuitive cases to slip through the definition of sufficiency.

In some situations, such as the study of martingales, we need conditional expectations for only a small collection of random variables. In those situations we do not need the full conditional distribution, and so the abstract Kolmogorov approach suffices.

2. Conditional distributions: the general case

With some small precautions about negligible sets, the representation of $P$ as a weighted average of distributions living on the level sets $\{T = t\}$, as in <2>, makes sense in a more general setting.

<4> Definition. Let $T$ be an $\mathcal{F}\backslash\mathcal{B}$-measurable map from a probability space $(\Omega, \mathcal{F}, P)$ into a measurable space $(\mathcal{T}, \mathcal{B})$. Let $Q$ equal $TP$, the distribution of $T$ under $P$. Call a family $\mathcal{P} = \{P_t : t \in \mathcal{T}\}$ of probability measures on $\mathcal{F}$ the conditional probability distribution of $P$ given $T$ if

(i) $P_t\{T \ne t\} = 0$ for $Q$ almost all $t$ in $\mathcal{T}$;

(ii) the map $t \mapsto P_t^\omega f(\omega)$ is $\mathcal{B}$-measurable and $P^\omega f(\omega) = Q^t P_t^\omega f(\omega)$, for each $f$ in $\mathcal{M}^+(\Omega, \mathcal{F})$.

In the language of Chapter 4, the family $\mathcal{P}$ is a probability kernel from $(\mathcal{T}, \mathcal{B})$ to $(\Omega, \mathcal{F})$. The fine print about an exceptional $Q$-negligible set in (i) protects us against those $t$ not in the range of $T$; if $\{T = t\}$ were empty then $P_t$ would have nowhere to live. We could equally well escape embarrassment by allowing $P_t$ to have total mass not equal to 1 for a $Q$-negligible set of $t$.

The Definition errs slightly in referring to the conditional probability distribution, as if it were unique. Clearly we could change $P_t$ on a $Q$-negligible set of $t$ and still have the two defining properties satisfied. Under mild conditions, that is the extent of the possible nonuniqueness: see Theorem <9>.

REMARK. Many authors work with the slightly weaker concept of a regular conditional distribution, substituting a requirement such as

$P^\omega\big(h(T\omega)\,X(\omega)\big) = Q^t\big(h(t)\, P_t X\big) \qquad\text{for all } h \text{ in } \mathcal{M}^+(\mathcal{T}, \mathcal{B}),$

for the concentration property (i). As you will see in Section 3, the difference between the two concepts comes down to little more than a question of measurability of a particular subset of $\Omega \times \mathcal{T}$.

In some problems, where intuitively obvious candidates for the conditional distributions exist, it is easy to check directly properties (i) and (ii) of the Definition.


<5> Exercise. Let $P$ denote Lebesgue measure on $\mathcal{B}([0, 1]^2)$. Let $T(x, y) = \max(x, y)$. Show that the conditional probability distributions $\{P_t\}$ for $P$ given $T$ are uniform on the sets $\{T = t\}$.

SOLUTION: Write $m$ for Lebesgue measure on $\mathcal{B}[0, 1]$. For $0 < t \le 1$, formalize the idea of a uniform distribution on the set

$\{T = t\} = \{(x, t) : 0 \le x \le t\} \cup \{(t, y) : 0 \le y \le t\}$

by defining

$P_t f := \frac{1}{2t}\, m^x\big(f(x, t)\{0 \le x \le t\}\big) + \frac{1}{2t}\, m^y\big(f(t, y)\{0 \le y \le t\}\big).$

You should check that $P_t\{T \ne t\} = 0$ by direct substitution. The definition of $P_t$ for $t = 0$ will not matter, because $P\{T = 0\} = 0$. Tonelli gives measurability of $t \mapsto P_t f$ for each $f$ in $\mathcal{M}^+([0, 1]^2)$.

The image measure $Q$ is determined by the values it gives to the generating class of all intervals $[0, t]$:

$Q[0, t] = P\{(x, y) : \max(x, y) \le t\} = (m[0, t])^2 = t^2 \qquad\text{for } 0 \le t \le 1.$

That is, $Q$ is the measure that has density $2t$ with respect to $m$. It remains to show that the $\{P_t\}$ satisfy property (ii) required of a conditional distribution:

$Q^t P_t f = m^t\, 2t\, P_t f = m^t m^x\big(f(x, t)\{0 \le x \le t\}\big) + m^t m^y\big(f(t, y)\{0 \le y \le t\}\big).$

Replace the dummy variables $t, y$ in the last iterated integral by new dummy variables $x, t$ to see that the sum is just a decomposition of $Pf = m \otimes m\, f$ into contributions from the two triangular regions $\{x < t\}$ and $\{t < x\}$. The overlap between the two regions, and the missing edges, all have zero $m \otimes m$ measure.
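Property (ii) can also be checked numerically. The sketch below (the test function $f(x, y) = xy$ is my own choice, not from the text) approximates $Q^t(P_t f)$ by midpoint rules, with $Q$ having density $2t$ and $P_t$ averaging $f$ over the two edges of $\{T = t\}$, and compares the result with $Pf = \frac12 \cdot \frac12 = \frac14$.

```python
# Numerical check of property (ii) for the max example; the test function
# f(x,y) = x*y is an assumed choice for illustration.
def P_t_f(t, f, m=500):
    h = t / m
    edge_x = sum(f((i + 0.5) * h, t) for i in range(m)) * h   # m^x f(x,t){0<=x<=t}
    edge_y = sum(f(t, (i + 0.5) * h) for i in range(m)) * h   # m^y f(t,y){0<=y<=t}
    return (edge_x + edge_y) / (2 * t)                        # uniform average on {T=t}

def Q_P_f(f, m=500):
    h = 1.0 / m                                               # midpoint rule in t,
    return sum(2 * t * P_t_f(t, f) * h                        # with Q-density 2t
               for t in ((i + 0.5) * h for i in range(m)))

f = lambda x, y: x * y
print(Q_P_f(f))   # close to P f = (integral of x)(integral of y) = 1/4
```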

The decomposition of the uniform distribution on the square, into an average of uniform conditional distributions on the boundaries of subsquares, has an extension to more general sets, but the conditional distributions need no longer be uniform.

<6> Example. Let $K$ be a compact subset of $\mathbb{R}^d$, star-shaped around the origin. That is, if $y \in K$ and $0 \le t \le 1$ then $ty \in K$.

[Figure: a star-shaped set $K$, showing a point $x$, its radial projection $\psi(x) = x/p(x)$ onto the boundary $\partial K$, and a scaled copy $K_t$.]

For each $t > 0$, let $K_t$ denote the compact subset $\{ty : y \in K\}$. The sets $\{K_t : t > 0\}$ are nested, shrinking to $\{0\}$ as $t$ decreases to zero. The sets define a function $p : \mathbb{R}^d \to \mathbb{R}^+$, by $p(x) := \inf\{t > 0 : x \in K_t\} = \inf\{t : x/t \in K\}$. In fact, the infimum is achieved for each $x$ in $K\backslash\{0\}$, with $1 \ge p(x) > 0$. Only when $x = 0$ do we have $p(x) = 0$. The set $\{x : p(x) = 1\}$ is a subset of the boundary, $\partial K$, a proper subset unless $0$ lies in the interior of $K$. For each $x$ in $\mathbb{R}^d\backslash\{0\}$, the point $\psi(x) := x/p(x)$ lies in $\partial K$. Notice that $\psi(x/t) = \psi(x)$ for each $t > 0$.

Let $P$ denote the uniform distribution on $K$, defined by

$P^x f(x) = \frac{m^x\big(f(x)\{x \in K\}\big)}{mK}, \qquad\text{where } m \text{ is Lebesgue measure on } \mathbb{R}^d.$

The scaling property of Lebesgue measure under the transformation $x \mapsto x/t$,

$m^x f(x/t) = t^d\, m^y f(y) \qquad\text{for } f \in \mathcal{M}^+(\mathbb{R}^d) \text{ and } t > 0,$

will imply independence of $p(x)$ and $\psi(x)$ under $P$. For $0 \le t \le 1$ and $g \in \mathcal{M}^+(\mathbb{R}^d)$,

$P^x\big(g(\psi(x))\{p(x) \le t\}\big) = m^x\big(g(\psi(x/t))\{x/t \in K\}\big)/mK = t^d\, m^y\big(g(\psi(y))\{y \in K\}\big)/mK.$

In particular, $P\{p(x) \le t\} = t^d$ for $0 \le t \le 1$, and

$P^x\big(g(\psi(x))\{p(x) \le t\}\big) = P\{p(x) \le t\}\cdot P^x g(\psi(x)).$

A generating class argument then leads to the factorization needed for independence. Write $\mu$ for the distribution of $\psi(x)$, a probability measure concentrated on $\partial K$, and $Q$ for the distribution with density $d\,t^{d-1}$ with respect to Lebesgue measure on $[0, 1]$. The image of $\mu$ under the map $x \mapsto xr$, for a fixed $r$ in $[0, 1]$, defines a probability measure $P_r$ concentrated on the set $\{x : p(x) = r\} \subseteq \partial K_r$. The defining equality $x = p(x)\psi(x)$ then has the interpretation: if $R$ has distribution $Q$, independently of $Y$, which has distribution $\mu$, then $RY$ has distribution $P$, that is,

$P^x f(x) = Q^r \mu^y f(ry) = Q^r P_r^x f(x) \qquad\text{for } f \in \mathcal{M}^+(\mathbb{R}^d).$

The probability kernel $\{P_r : 0 \le r \le 1\}$ is the conditional distribution for $P$ given $p(x)$.
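For the simplest star-shaped set, the closed unit disk in $\mathbb{R}^2$, the construction is easy to check by simulation: $p(x) = |x|$, $\psi(x) = x/|x|$, $P\{p(x) \le t\} = t^2$, and the radial part is independent of the angular part. (The disk, and the upper half-plane as a test set for $\psi$, are my own concrete choices for illustration.)

```python
import math, random

# Simulation check for K = closed unit disk in R^2 (my own concrete choice):
# p(x) = |x|, psi(x) = x/|x|, so P{p(x) <= t} = t^2 and p is independent of psi.
random.seed(3)

def uniform_in_disk():
    while True:                      # rejection sampling from [-1,1]^2
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

n, t = 100000, 0.7
pts = [uniform_in_disk() for _ in range(n)]
p_small = sum(math.hypot(x, y) <= t for x, y in pts) / n
p_joint = sum(math.hypot(x, y) <= t and y > 0 for x, y in pts) / n

print(p_small, t ** 2)          # P{p(x) <= t} = t^d = 0.49
print(p_joint, 0.5 * t ** 2)    # factorization: P{p <= t, psi upper half} = 0.245
```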

<7> Example. Consider $\mathbb{P} := P^n$, the joint distribution for $n$ independent observations $x_1, \ldots, x_n$ from $P$, with $P$ equal to the uniform distribution on a compact, star-shaped subset $K$ of $\mathbb{R}^d$, as in Example <6>. Define $T(x) := \max_{i \le n} p(x_i)$. We may generate each $x_i$ from a pair of independent variables, $x_i = r_i y_i$, with $y_i \in \partial K$ having distribution $\mu$, and $r_i \in [0, 1]$ having distribution $Q$. More formally,

$\mathbb{P} f(x_1, \ldots, x_n) = \mathbb{Q}^r M^y f(r_1 y_1, \ldots, r_n y_n),$

where $r := (r_1, \ldots, r_n)$ and $y := (y_1, \ldots, y_n)$, with $M := \mu^n$, the product measure on $(\partial K)^n$, and $\mathbb{Q} := Q^n$, the product measure on $[0, 1]^n$. Then we have $T = \max_{i \le n} r_i$, with distribution $\nu$ for which $\nu[0, t] = (Q[0, t])^n = t^{nd}$ for $0 \le t \le 1$. Thus $\nu$ has density $nd\, t^{nd-1}$ with respect to Lebesgue measure on $[0, 1]$.

Problem [2] shows how to construct the conditional distribution $\mathbb{Q}_t$, for $\mathbb{Q}$ given $T = t$, namely, generate $n - 1$ independent observations $s_2, \ldots, s_n$ from the conditional distribution $Q_t := Q(\cdot \mid s \le t)$, then take $(r_1, \ldots, r_n)$ as a random permutation of $(t, s_2, \ldots, s_n)$. The representation for $P$ becomes

$\mathbb{P} f(x_1, \ldots, x_n) = \nu^t \mathbb{Q}_t^r M^y f(r_1 y_1, \ldots, r_n y_n).$

Thus the conditional distribution $\mathbb{P}_t$ equals $\mathbb{Q}_t \otimes M$, for each $t$ in $[0, 1]$. Less formally, to generate a random sample $x := (x_1, \ldots, x_n)$ from the conditional probability distribution $\mathbb{P}_t$:

(i) independently generate $w$ from $\mu$ and $y_2, \ldots, y_n$ from $P^{n-1}$;

(ii) take $x$ as a random permutation of $(tw, ty_2, \ldots, ty_n)$.

REMARK. The two-step construction lets us reduce calculations involving $n$ independent observations from $P$ to the special case where one of the points lies on the boundary $\partial K$. The conditioning machinery provides a rigorous setting for Crofton's theorem from geometric probability.

The theorem dates from an 1885 article by M. F. Crofton in the Encyclopedia Britannica. As noted by Eisenberg & Sullivan (2000), existing statements and proofs (such as those in Chapter 2 of Kendall & Moran 1963, or Chapter 5 of Solomon 1978) of the theorem are rather on the heuristic side. Eisenberg and Sullivan derived a version of the theorem by means of Kolmogorov conditional expectations, with supplementary smoothness assumptions. If the boundary is smooth, the measure $\mu$ is absolutely continuous with respect to surface measure, with a density expressible in terms of quantities familiar from differential geometry (as discussed by Baddeley 1977).

Of course there are many situations where one cannot immediately guess the form of the conditional distributions, and then one must rely on systematic methods such as those to be discussed in Sections 4 and 5.

3. Integration and disintegration

Suppose $P$ has conditional distribution $\mathcal{P} = \{P_t : t \in \mathcal{T}\}$ given $T$, as in Definition <4>. As shown in Section 4.3, from the probability kernel $\mathcal{P}$ and the distribution $Q$ it is possible to construct a probability measure $Q \otimes \mathcal{P}$ on the product sigma-field by means of an iterated integral, $(Q \otimes \mathcal{P})(g) := Q^t\big(P_t^\omega g(\omega, t)\big)$ for $g \in \mathcal{M}^+(\Omega \times \mathcal{T}, \mathcal{F} \otimes \mathcal{B})$. This measure has marginal distributions $P$ and $Q$, as may be seen by restricting $g$ to functions of $\omega$ alone or $t$ alone. The concentration property (i) of the Definition ensures that $P_t^\omega g(\omega, t) = P_t^\omega g(\omega, T\omega)$ for $Q$ almost all $t$, which, by property (ii) of the Definition, leads to

<8>  $(Q \otimes \mathcal{P})(g) = Q^t\big(P_t^\omega g(\omega, t)\big) = Q^t\big(P_t^\omega g(\omega, T\omega)\big) = P^\omega g(\omega, T\omega).$

That is, $Q \otimes \mathcal{P}$ is the joint distribution of $\omega$ and $T\omega$, the image of $P$ under the map $\gamma : \omega \mapsto (\omega, T\omega)$, which "lifts $P$ up to live on the graph."

[Figure: the image measure $\gamma(P) = Q \otimes \mathcal{P}$ concentrates on $\operatorname{graph}(T)$; each $P_t$ concentrates on the level set $\{T = t\}$.]


That is, the joint distribution concentrates on the set

$\operatorname{graph}(T) := \{(\omega, t) \in \Omega \times \mathcal{T} : t = T\omega\},$

the graph of the map $T$. Indeed, provided the graph is a product measurable set, existence of the conditional distribution is equivalent to the representation of the image measure $\gamma(P)$ as $Q \otimes \mathcal{P}$.

Existence of conditional distributions follows as a special case of a more general decomposition. The following Theorem is proved in Appendix F, where it is also shown that condition (ii) is satisfied by every sigma-finite Borel measure on a complete separable metric space.

<9> Theorem. Let $\mathcal{X}$ be a metric space equipped with its Borel sigma-field $\mathcal{A}$, and let $T$ be a measurable map from $\mathcal{X}$ into a space $\mathcal{T}$, equipped with a sigma-field $\mathcal{B}$. Suppose $\lambda$ is a sigma-finite measure on $\mathcal{A}$ and $\mu$ is a sigma-finite measure on $\mathcal{B}$ dominating the image measure $T\lambda$. Suppose:

(i) the graph of $T$ is $\mathcal{A} \otimes \mathcal{B}$-measurable;

(ii) the measure $\lambda$ is expressible as a countable sum of finite measures, each with compact support.

Then there exists a kernel $\Lambda = \{\lambda_t : t \in \mathcal{T}\}$ from $(\mathcal{T}, \mathcal{B})$ to $(\mathcal{X}, \mathcal{A})$ for which the image of $\lambda$ under the map $x \mapsto (x, Tx)$ equals $\mu \otimes \Lambda$, that is,

<10>  $\lambda^x g(x, Tx) = (\mu \otimes \Lambda)(g) := \mu^t \lambda_t^x g(x, t) \qquad\text{for } g \in \mathcal{M}^+(\mathcal{T} \times \mathcal{X}, \mathcal{B} \otimes \mathcal{A}).$

Moreover, property <10> is equivalent to the two requirements

(iii) $\lambda_t\{T \ne t\} = 0$ for $\mu$ almost all $t$;

(iv) $\lambda^x f(x) = \mu^t \lambda_t^x f(x)$, for each $f$ in $\mathcal{M}^+(\mathcal{X}, \mathcal{A})$;

or to the assertion that $Tx = t$ for $\mu \otimes \Lambda$ almost all $(x, t)$. The kernel $\Lambda$ is unique up to a $\mu$ almost sure equivalence.

REMARK. The uniqueness assertion is quite strong. If $\Lambda = \{\lambda_t : t \in \mathcal{T}\}$ and $\tilde\Lambda = \{\tilde\lambda_t : t \in \mathcal{T}\}$ are two kernels with the stated properties, it requires $\lambda_t = \tilde\lambda_t$, as measures on $\mathcal{A}$, for $\mu$ almost all $t$, and not just that $\lambda_t A = \tilde\lambda_t A$ a.e. $[\mu]$, for each $A$.

A kernel $\Lambda$ with the properties described by the Theorem is called a $(T, \mu)$-disintegration of $\lambda$. The construction of a disintegration is a sort of reverse operation to the "integration" methods used in Section 4.3 to construct measures on product spaces. As you will see in the next Section, disintegrations are essential for the understanding of general conditional densities.

You should not worry too much about the details of Theorem <9>. I have stated it in detail so that you can see how existence of disintegrations involves topological assumptions (or topologically inspired assumptions, as in Pachl 1978). Dellacherie & Meyer (1978, page 78) lamented that "The theorem on disintegration of measures has a bad reputation, and probabilists often try to avoid the use of conditional distributions . . . But it really is simple and easy to prove." Perhaps the unpopularity is due, in part, to the role of topology in the proof. Many probabilists seem to regard topology as completely extraneous to any discussion of conditioning, or even to any discussion of abstract probability theory. Nevertheless, it is a sad reality of measure theoretic life that the axioms for countable additivity are not well suited to dealing with uncountable families of negligible sets, and that occasionally they need a little topological help.

REMARK. Topological ideas also come to the rescue of countable additivity in the modern theory of stochastic processes in continuous time. I think it is no accident that the abstract theory of (stochastic) processes has flourished more readily amongst the probabilists who are influenced by the French approach to measure theory, where measures are linear functionals and topological requirements are accepted as natural.

<11> Example. Let $\lambda$ denote Lebesgue measure on $\mathcal{B}(\mathbb{R}^2)$, let $X$ denote the map that projects $\mathbb{R}^2$ onto the first coordinate axis, and let $\mu$ denote the one-dimensional Lebesgue measure on that axis. Write $\lambda_x$ for one-dimensional Lebesgue measure transplanted to live on $\{x\} \times \mathbb{R}$. That is, $\lambda_x$ is defined on $\mathcal{B}(\mathbb{R}^2)$ but it concentrates all its mass along a line orthogonal to the first coordinate axis. By the Tonelli Theorem, $\{\lambda_x : x \in \mathbb{R}\}$ is an $(X, \mu)$-disintegration of $\lambda$.

REMARK. Many authors (including Dellacherie & Meyer 1978, Section III-70) require $\Lambda$ to be a probability kernel, a restriction that excludes interesting and useful cases, such as the decomposition of Lebesgue measure from Example <11>. As shown by Problem [3], $\Lambda$ can be chosen as a probability kernel if and only if $\mu$ is equal to the image measure $T\lambda$. In particular, because the image of two-dimensional Lebesgue measure under a coordinate projection is not a sigma-finite measure, there is no way we could have chosen $\Lambda$ as a probability kernel in Example <11>.

4. Conditional densities

In introductory courses one learns to calculate conditional densities by dividing marginal densities into joint densities. Such a calculation has a natural generalization for conditional distributions: it is merely a matter of reinterpreting the meanings of the joint and marginal densities.

Recall the elementary construction. Suppose $P$ has density $p(x, y)$ with respect to Lebesgue measure $\lambda$ on $\mathcal{B}(\mathbb{R}^2)$. Then the $X$-marginal has density $q(x) := \int p(x, y)\,dy$ with respect to Lebesgue measure $\mu$ on the $X$-axis, and the conditional distribution $P_x(\cdot)$ is given by the conditional density $p(y \mid x) := p(x, y)/q(x)$. Typically one does not worry too hard about how to define conditional distributions when $q(x) = 0$ or $q(x) = \infty$, but maybe one should.

The $dy$ in the definition of $q(x)$ corresponds to the disintegrating Lebesgue measure $\lambda_x$ from Example <11>. The marginal density $q(x)$ equals $\lambda_x p$. It is the density, with respect to Lebesgue measure $\mu$ on $\mathbb{R}$, of the image of $P$ under the coordinate map $X$. The probability distribution $P_x(\cdot)$ is absolutely continuous with respect to $\lambda_x$, with density $p(\cdot \mid x)$ standardized to integrate to 1.
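The elementary recipe can be checked numerically. In the sketch below the joint density $p(x, y) = x + y$ on the unit square is an assumed example of my own (not from the text); the computed marginal is $q(x) = x + 1/2$, and each conditional density $p(\cdot \mid x) = p(x, \cdot)/q(x)$ integrates to 1.

```python
# Numerical check of (conditional density) = (joint density)/(marginal density),
# for the assumed joint density p(x,y) = x + y on the unit square.
def integrate(g, a, b, m=2000):          # composite midpoint rule
    h = (b - a) / m
    return sum(g(a + (i + 0.5) * h) for i in range(m)) * h

p = lambda x, y: x + y                   # assumed joint density
q = lambda x: integrate(lambda y: p(x, y), 0.0, 1.0)   # marginal density, = x + 1/2

def conditional_mass(x):
    qx = q(x)                            # evaluate the marginal once
    return integrate(lambda y: p(x, y) / qx, 0.0, 1.0)

for x in (0.1, 0.5, 0.9):
    print(x, q(x), conditional_mass(x))  # q(x) = x + 0.5, conditional mass = 1.0
```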

To generalize the elementary formula, suppose a probability measure Pis dominated by a sigma-finite measure X, which has a (7\ /^-disintegrationA = {Xt : / € 7}. The density p(x) := dP/dX corresponds to the joint density, andq(t) := Xx

tp{x) corresponds to the marginal density. If the analogy is appropriate,the ratio pt(x) := p(x)/q(t) should correspond to the conditional density, with Xt

Page 134: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf


as the dominating measure. In fact, if we took a little care to avoid 0/0 or ∞/∞ by inserting an indicator function {0 < q < ∞} into the definition, the elementary formula would indeed carry over to the general setting. For some purposes (such as the discussion of sufficiency in Section 7) it is better not to force p_t(x) to be zero when q(t) = 0 or q(t) = ∞. The wording in part (iii) of the next Theorem is designed to accommodate a slightly more flexible choice for the conditional density.

<12> Theorem. Suppose P is a probability measure with density p with respect to a sigma-finite measure λ that has a (T, μ)-disintegration Λ = {λ_t : t ∈ T}. Then

(i) the image measure Q := TP has density q(t) := λ_t p with respect to μ;

(ii) the set {(x, t) : q(t) = ∞ or q(t) = 0 < p(x)} has zero μ ⊗ Λ measure.

Let p_t(x) be an A ⊗ B-measurable function for which q(t)p_t(x) = p(x) a.e. [μ ⊗ Λ]. Then

(iii) the P_t defined by dP_t/dλ_t := p_t(·) is a probability measure, for Q almost all t, and P has conditional probability distributions {P_t : t ∈ T} given T.

Proof. For each h in M+(T),

    Qh = P^x h(Tx)                    image measure
       = λ^x (p(x)h(Tx))             density p = dP/dλ
       = μ^t (h(t) λ_t^x p(x))       by <10> with g(x, t) = p(x)h(t).

Thus Q has density q(t) := λ_t^x p(x) with respect to μ.

From the fact that μq = 1 we have μ{q = ∞} = 0. Also, if q(t) = 0 for some t then p(x) = 0 for λ_t almost all x. Assertion (ii) follows.

For Assertion (iii), first note that P_t concentrates on {T = t}, for μ almost all t (and hence Q almost all t), because it is absolutely continuous with respect to λ_t. The set {t : q(t) = 0 or ∞} has zero Q measure. When 0 < q(t) < ∞ we have λ_t^x p_t(x) = λ_t^x p(x)/q(t) = 1. Also, for f ∈ M+(X),

    Pf = λ^x (p(x)f(x))                   density p = dP/dλ
       = μ^t λ_t^x (p(x)f(x))             disintegration of λ
       = μ^t λ_t^x (q(t)p_t(x)f(x))       assumption on p_t
       = μ^t (q(t) λ_t^x (p_t(x)f(x))) = Q^t (P_t^x f(x)),

as required for the conditional distribution. □

REMARK. Notice that I did not prove that every P_t is a probability measure. In fact, the best we can assert, as in Problem [3], is that Q-almost all P_t are probability measures. Of course we have no control over p_t(x) when q(t) is zero or infinite. For maximum precision you might like to make appropriate almost sure modifications to Definition <4>.

<13> Example. Let {P_θ : θ ∈ Θ} be a family of probability measures on X. If the map θ ↦ P_θ f is measurable for each f ∈ M+(X), and if π is a probability (a Bayesian prior distribution) on Θ, then Q := π ⊗ 𝒫 is a probability measure on X ⊗ Θ. The coordinate maps X (onto X) and T (onto Θ) have Q as their joint distribution. The conditional distribution of X given T = θ is P_θ. The conditional distribution Q_x(·) = Q(· | X = x) is called the Bayesian posterior distribution.

Page 135: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

120 Chapter 5: Conditioning

If P_θ has a density p(x, θ) with respect to a sigma-finite μ on X, then

    Qf = π^θ μ^x (p(x, θ) f(x, θ))   for f ∈ M+(X ⊗ Θ).

That is, Q has density p(x, θ) with respect to the product measure μ ⊗ π. The product measure has the trivial disintegration (μ ⊗ π)_x = π, if we regard π as a measure living on {x} ⊗ Θ. It follows from Theorem <12> that the posterior distribution Q_x has density p(x, θ)/π^θ p(x, θ) with respect to π. Why would a Bayesian probably not worry about the negligible sets where such a ratio is not well defined?
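On a finite parameter set the posterior-density formula can be checked directly. The following sketch (a hypothetical normal location model with a three-point prior, not taken from the text) computes the density p(x, θ)/π^θ p(x, θ) of Q_x with respect to π:

```python
import math

# Hypothetical finite model: x | theta ~ N(theta, 1), theta on a three-point grid.
# The posterior density with respect to the prior pi is
#   p(x, theta) / sum_theta' pi(theta') p(x, theta'),
# the form given by Theorem <12> with the trivial disintegration.

def density(x, theta):
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

thetas = [-1.0, 0.0, 1.0]
prior = {-1.0: 0.25, 0.0: 0.5, 1.0: 0.25}

def posterior(x):
    q = sum(prior[t] * density(x, t) for t in thetas)  # marginal: pi^theta p(x, theta)
    return {t: density(x, t) / q for t in thetas}      # density of Q_x w.r.t. pi

post = posterior(0.8)
probs = {t: prior[t] * post[t] for t in thetas}        # posterior probabilities
assert abs(sum(probs.values()) - 1.0) < 1e-12          # integrates to 1 against pi
```

The assertion verifies that the ratio, standardized by the marginal, does integrate to 1 against the prior, as the Theorem promises for Q-almost all x.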

<14> Example. Let P and Q be probability measures on (X, A), with densities p and q with respect to a sigma-finite measure λ. The Hellinger distance H(P, Q) was defined in Section 3.3 as the L²(λ) distance between √p and √q. Suppose λ has a (T, μ)-disintegration Λ = {λ_t : t ∈ T}, with T a measurable map from X into T.

The image measures TP and TQ have densities p̄(t) = λ_t(p) and q̄(t) = λ_t(q) with respect to μ. By the Cauchy-Schwarz inequality, λ_t √(pq) ≤ √(p̄(t)q̄(t)), and hence

    λ^x √(p(x)q(x)) = μ^t λ_t^x √(p(x)q(x)) ≤ μ^t √(p̄(t)q̄(t)).

That is, the Hellinger affinity between P and Q is smaller than the Hellinger affinity between TP and TQ. Equivalently, H²(TP, TQ) ≤ H²(P, Q).
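For discrete distributions, with counting measure as the dominating λ and T the first coordinate, the affinity inequality reduces to a finite sum that can be verified numerically. A small sketch (the two joint distributions are hypothetical):

```python
import math

# With counting measure as lambda and T the first coordinate, the
# Cauchy-Schwarz step says
#   sum_{x,y} sqrt(p*q)  <=  sum_x sqrt(pbar(x)*qbar(x)).

p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def affinity(a, b):
    return sum(math.sqrt(a[k] * b[k]) for k in a)

def marginal(a):
    m = {}
    for (x, y), v in a.items():
        m[x] = m.get(x, 0.0) + v   # image under the coordinate projection
    return m

# H^2 = 2 - 2*affinity, so larger affinity means smaller Hellinger distance.
assert affinity(p, q) <= affinity(marginal(p), marginal(q)) + 1e-12
```

Mapping the observations through T can only increase the affinity, which is the discrete shadow of H²(TP, TQ) ≤ H²(P, Q).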

<15> Example. Let {P_θ : θ ∈ Θ} be a family of probability measures with densities p(x, θ) with respect to a sigma-finite measure λ on (X, A). The method of maximum likelihood promises good properties for the estimator defined by maximizing p(x, θ) as a function of θ, for each observed x.

If x itself is not known, but instead the value of a measurable function T(x), taking values in some space (T, B), is observed, the method still applies. If Q_θ, the distribution of T under P_θ, has density q(t, θ) with respect to some sigma-finite measure μ, the estimate of θ is obtained by maximizing q(t, θ) as a function of θ, for the observed value t = T(x).

REMARK. Often we can conceive of x as a one-to-one function of some pair of statistics (S(x), T(x)). By observing only T(x) we find ourselves working with incomplete data.

If direct maximization of θ ↦ q(t, θ) is awkward, there is an alternative iterative method, due to Dempster, Laird & Rubin (1977), which at least offers a way of finding a sequence of θ values with q(t, θ) increasing at each iteration. Their method is called the EM-algorithm, the E standing for expectation, the M for maximization.

Each iteration consists of a two-step procedure to replace an initial guess θ_0 by a new guess θ_1.

(E-step) Calculate G(θ) := P_{θ_0}(log p(x, θ) | T = t).
(M-step) Maximize G, or at least find a θ_1 for which G(θ_1) ≥ G(θ_0).

It is easiest to analyze the algorithm when λ has a (T, μ)-disintegration, so that dQ_θ/dμ = q(t, θ) = λ_t^x p(x, θ). Throughout the calculation t is held fixed, so there is no ambiguity in writing ν for the measure λ_t. Similarly, it is cleaner to write π_i


for the density of P_{θ_i}(· | T = t) with respect to ν, and q_i instead of q(t, θ_i), for i = 0, 1. Then p(x, θ_i) = π_i(x)q_i, and

    0 ≤ G(θ_1) − G(θ_0) = ν^x π_0(x) log (π_1(x)q_1 / (π_0(x)q_0)),

which implies log(q_1/q_0) ≥ ν^x (π_0(x) log(π_0(x)/π_1(x))). The last integral defines the Kullback-Leibler distance, which is always nonnegative, by Jensen's inequality. Perhaps more informative is the lower bound (ν|π_1 − π_0|)²/2, proved in Section 3.3, which quantifies the improvement in q(t, θ) from the EM iteration.

REMARK. I paid no attention to possible division by zero in the calculations leading to the lower bound for log(q_1/q_0). You might like to provide appropriate integrability assumptions, and insert indicator functions to guard against 0/0. It would also be helpful to ignore the story about missing data and conditioning: as the simplifications in notation should make clear, the EM algorithm works for any q(θ) expressible as ν^x p(x, θ), for any measure ν and any ν-integrable, nonnegative p. Some regularity assumptions are needed to ensure finiteness of the various integrals appearing in the Example.
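As a concrete illustration of the monotonicity just proved, here is a minimal EM sketch for a hypothetical two-component normal mixture with unknown mixing weight θ (none of the specifics come from the text); the assertion checks that the observed likelihood q(t, θ) never decreases across iterations:

```python
import math, random

# Observe t = (x_1, ..., x_n) from theta*N(0,1) + (1-theta)*N(3,1); the missing
# datum for each x_i is its component label.  The E-step computes the
# responsibilities (the conditional expectation of the labels), and the M-step
# has a closed form: the new theta is the average responsibility.

def phi(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

def log_lik(xs, theta):
    return sum(math.log(theta * phi(x, 0.0) + (1 - theta) * phi(x, 3.0)) for x in xs)

def em_step(xs, theta):
    r = [theta * phi(x, 0.0) / (theta * phi(x, 0.0) + (1 - theta) * phi(x, 3.0))
         for x in xs]                 # E-step: responsibilities of component 0
    return sum(r) / len(r)            # M-step: maximizing mixing weight

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(60)] + \
     [random.gauss(3.0, 1.0) for _ in range(40)]

theta = 0.5
for _ in range(20):
    new_theta = em_step(xs, theta)
    # the inequality log q(t, theta_1) >= log q(t, theta_0) from the text
    assert log_lik(xs, new_theta) >= log_lik(xs, theta) - 1e-9
    theta = new_theta
```

With 60 of the 100 observations drawn from the first component, the iterations drift toward a mixing weight near 0.6, and the likelihood increases at every step, exactly as the Kullback-Leibler bound guarantees.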

*5. Invariance

Formal verification of the disintegration property can sometimes be reduced to a symmetry, or invariance, argument, as in the following Exercise.

<16> Exercise. Let P be a rotationally symmetric probability measure on B(R²), such as the standard bivariate normal distribution. That is, if S_θ denotes the map for rotation through an angle θ about the origin, then S_θ P = P for all θ. Define R(x) to equal (x_1² + x_2²)^{1/2}, the distance from the point x = (x_1, x_2) to the origin. Show that P has conditional probability distributions P_r uniform on the sets {R = r}. That is, show that the polar angle θ(x) is uniformly distributed on [0, 2π), independently of R.

SOLUTION: Let Q denote the distribution of R under P. Let λ_r denote the uniform probability measure on {R = r}, which can be characterized by invariance, S_θ λ_r = λ_r, for θ in a countable, dense set Θ_0 of rotational angles (Problem [14]). We need to show that P_r = λ_r, for Q-almost all r.

The problem lies in the transfer of rotational invariance from P to each of the conditional probabilities. Consider a fixed θ in the countable Θ_0 and a fixed f in M+(R²). Then

    Pf = (S_θ P)^y f(y)          invariance of P
       = P^x f(S_θ x)            image measure
       = Q^r P_r^x f(S_θ x)      disintegration
       = Q^r (S_θ P_r)^y f(y)    image measure.

When P_r concentrates on the set {R = r}, so does S_θ P_r. It follows that the family {S_θ P_r : r ∈ R+} is another disintegration for P. By uniqueness of disintegrations, there exists a Q-negligible set N_θ such that S_θ P_r = P_r for all r outside N_θ. Cast out a sequence of negligible sets to deduce that, for Q almost all r, the probability


measure concentrates on {R = r} and is invariant under all Θ_0 rotations, which implies that P_r = λ_r, as asserted.

Invariance of P_r implies that θ(x) has the uniform conditional distribution, m := Uniform[0, 2π), under P_r. Thus, for g ∈ M+(R+) and h ∈ M+[0, 2π),

    P^x g(R(x))h(θ(x)) = Q^r P_r^x g(r)h(θ(x)) = (Q^r g(r)) (m^θ h(θ)).

By a generating class argument it follows that the joint distribution of R(x) and θ(x) equals the product Q ⊗ m on B(R+) ⊗ B[0, 2π), which implies independence.
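The conclusion of the Exercise is easy to probe by simulation. The sketch below (sampling scheme hypothetical, not from the text) draws from the standard bivariate normal and checks that the polar angle is roughly uniform and roughly independent of R:

```python
import math, random

# Monte Carlo illustration: for the standard bivariate normal the polar angle
# is Uniform[0, 2*pi) and independent of the radius R.

random.seed(2)
n = 20000
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
angles = [math.atan2(y, x) % (2 * math.pi) for x, y in pts]
radii = [math.hypot(x, y) for x, y in pts]

# Uniformity: each angular quadrant gets about a quarter of the points.
for k in range(4):
    frac = sum(1 for a in angles
               if k * math.pi / 2 <= a < (k + 1) * math.pi / 2) / n
    assert abs(frac - 0.25) < 0.02

# Independence, in the form of the product factorization from the
# generating class argument: P{R <= r0, angle <= a0} ~ P{R <= r0} P{angle <= a0}.
r0, a0 = 1.0, math.pi
joint = sum(1 for r, a in zip(radii, angles) if r <= r0 and a <= a0) / n
prod = (sum(1 for r in radii if r <= r0) / n) * \
       (sum(1 for a in angles if a <= a0) / n)
assert abs(joint - prod) < 0.02
```

The tolerances are loose Monte Carlo bounds; with 20000 draws the empirical fractions sit well inside them.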

REMARK. A similar argument would work for any sigma-finite measure λ that is invariant under a group G of measurable transformations on X, if the level sets {T = t} are also invariant. Let G_0 be a countable subset of G. Then the measures {λ_t} must be invariant under each transformation in G_0, except possibly for those t in a negligible set that depends on G_0. Often invariance under a suitable G_0 will characterize the {λ_t} up to multiplicative constants.

Appeals to symmetry can be misleading if not formalized as invariance arguments. The classical Borel paradox, in the next Example, is a case in point.

<17> Example. Let x be a point chosen at random (from the uniform distribution P) on the surface of the Earth. Intuitively, if the point lies somewhere along the equator then the longitude should be uniformly distributed over the range [−180°, 180°]. But what is so special about the equator? Given that the point lies on any particular great circle, its position should be uniformly distributed around that circle. In particular, for a great circle through the poles (that is, conditional on the longitude) there should be conditional probability 1/4 that the point lies north of latitude 45°N. Average out over the longitude to deduce that the point has probability 1/4 of lying in the spherical cap extending from the north pole down to the 45° parallel of latitude. Unfortunately, that cap does not cover 1/4 of the earth's surface area, as one would require for a point uniformly distributed over the whole surface. That is Borel's paradox.
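The arithmetic behind the paradox is a one-line calculation: a spherical cap reaching from the pole down to colatitude ψ covers the fraction (1 − cos ψ)/2 of the sphere's surface, so the cap above latitude 45°N covers about 0.146, not 1/4:

```python
import math

# Surface-area fraction of the cap from the north pole down to latitude 45N
# (colatitude 45 degrees): (1 - cos(45 deg)) / 2.
cap_fraction = (1 - math.cos(math.radians(45))) / 2

assert abs(cap_fraction - 0.1464) < 1e-3   # about 14.6 percent of the surface
assert cap_fraction < 0.25                 # not the 1/4 the faulty argument gives
```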

The apparent paradox does not mean that one cannot argue by symmetry in assigning conditional distributions. As Kolmogorov (1933, page 51) put it, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible. For we can obtain a probability distribution for [the latitude] on the meridian circle only if we regard this circle as an element of the decomposition of the entire spherical surface into meridian circles with the given poles." In other words, the conditional distributions on circles are not defined unless the disintegration in which they are level sets is specified. Even then, the conditional distributions are only determined up to an almost sure equivalence.

We can, however, argue by invariance for a uniform conditional distribution on almost all circles of constant latitude in their roles as level sets for a latitude disintegration. Suppose T picks off the latitude of the point on the surface. The sets of constant T value are the parallels of constant latitude. Each such parallel is mapped onto itself by rotations about the polar axis; the sets {T = t} are invariant under each member of that group S of rotations. The uniform measure P is also invariant under such rotations. As in Example <16>, it follows that almost all P_t measures must be invariant under S. That is, almost all the conditional probability


distributions are uniform around circles of constant latitude. It follows that almost all great circles of constant latitude must carry a uniform conditional distribution. Unfortunately, the equator is the only such great circle; it corresponds to a negligible set of T values. Thus there is no harm in assuming that P_0, the conditional distribution around the equator, is uniform, just as there is no harm in assuming it to carry any other conditional distribution. A uniform conditional distribution around the equator is not forced by the T disintegration.

What happens if we change T to pick off the longitude of the random point on the surface? It would be nonsense to argue that the conditional distributions stay the same: the level sets supporting those distributions are not even the same as before. Couldn't we just argue again by invariance? Certainly not. The great circles of constant T are no longer invariant under a group of rotations; the disintegrating measures need no longer be invariant; the conditional distributions around the great circles of constant longitude are not uniform.

There is a general lesson to be learned from the last Example. Even if almost all level sets {T = t} in one disintegration must carry particular distributions, it does not follow that similar sets in a different disintegration must carry the same distributions. It is an even worse error to assume that a conditional distribution that could be assigned to a region as a typical level set of one disintegration must be the same as the conditional distribution similar regions must carry if they appear as level sets of another disintegration.

6. Kolmogorov's abstract conditional expectation

Let (Ω, F, P) be a probability space, and T be an F\B-measurable map into a set T equipped with a sigma-field B. Write Q for the image measure TP.

Suppose the conditional distribution {P_t : t ∈ T} of P given T exists. For a fixed X in M+(Ω, F), the conditional expectation g_X(t) := P_t^ω X(ω) is a function in M+(T, B), with the property that P_t^ω (α(Tω)X(ω)) = α(t)g_X(t) a.e. [Q], and hence

<18>    P^ω (α(Tω)X(ω)) = Q^t (α(t)g_X(t))   for each α ∈ M+(T, B).

Even when the conditional distribution does not exist, it turns out that it is still possible to find a function t ↦ g(t, X) satisfying an analog of <18>, for each fixed X. Kolmogorov (1933) suggested that such a function should be interpreted as the conditional expectation of X given T = t. Most authors write E(X | T = t) for g(t, X). This abstract conditional expectation has many, but not all, of the properties associated with an expectation with respect to a conditional distribution. Kolmogorov's suggestion has the great merit that it imposes no extra regularity assumptions on the underlying probability spaces and measures, but at the cost of greater abstraction.

The properties of the Kolmogorov conditional expectation X ↦ g(t, X) are analogous to the properties of increasing linear functionals with the Monotone Convergence property, except for omnipresent a.e. [Q] qualifiers.


<19> Theorem. There is a map X ↦ g(t, X) from M+(Ω, F) to M+(T, B) with the property that P^ω (α(Tω)X(ω)) = Q^t (α(t)g(t, X)) for each α ∈ M+(T, B). For each X, the function t ↦ g(t, X) is unique up to a Q-equivalence. The map has the following properties.

(i) If X = h(T) for some h in M+(T, B) then g(t, X) = h(t) a.e. [Q].

(ii) For X_1, X_2 in M+(Ω, F) and h_1, h_2 in M+(T, B),

    g(t, h_1(T)X_1 + h_2(T)X_2) = h_1(t)g(t, X_1) + h_2(t)g(t, X_2)   a.e. [Q].

(iii) If 0 ≤ X_1 ≤ X_2 a.e. [P] then g(t, X_1) ≤ g(t, X_2) a.e. [Q].

(iv) If 0 ≤ X_1 ≤ X_2 ≤ ... ↑ X a.e. [P] then g(t, X_n) ↑ g(t, X) a.e. [Q].

To establish the existence and the four properties of the function g(t, X), we must systematically translate to corresponding assertions about averages, by means of the following simple result.

<20> Lemma. Let h_1 and h_2 be functions in M+(T, B) for which

    Q^t (α(t)h_1(t)) ≤ Q^t (α(t)h_2(t))   for all α ∈ M+(T, B).

Then h_1 ≤ h_2 a.e. [Q]. If the inequality is replaced by an equality, that is, if Q^t (α(t)h_1(t)) = Q^t (α(t)h_2(t)) for all α, then h_1 = h_2 a.e. [Q].

Proof. For a positive rational number r, let A_r := {t : h_2(t) < r < h_1(t)}. There are no ∞ − ∞ problems in subtracting to get 0 ≤ Q^t ((h_2(t) − h_1(t)){t ∈ A_r}), which forces QA_r = 0 because h_2 − h_1 is strictly negative on A_r. Take the union over all positive rational r to deduce that Q{h_2 < h_1} = 0. Reverse the roles of h_1 and h_2 to get the companion assertion, Q{h_1 < h_2} = 0, for the case where the inequality is replaced by an equality. □

Proof of Theorem <19>. As before, to simplify notation, abbreviate M+(Ω, F) to M+(Ω) and M+(T, B) to M+(T).

Fix an X in M+(Ω). For each n ∈ N, define an increasing linear functional ν_n on M+(T) by ν_n(α) := P^ω ((X(ω) ∧ n)α(Tω)) for α ∈ M+(T). It is easy to check that ν_n(α) ≤ nQ(α) and that ν_n has the Monotone Convergence property. Thus each ν_n corresponds to a finite measure on B, absolutely continuous with respect to Q. By the Radon-Nikodym Theorem (Section 3.1), there exist bounded densities γ_n in M+(T) for which ν_n α = Q(γ_n α) for each α in M+(T). By Lemma <20>, the inequality Q(γ_n α) = ν_n α ≤ ν_{n+1} α = Q(γ_{n+1} α) for all α ∈ M+(T, B) implies that γ_n ≤ γ_{n+1} a.e. [Q]. That is, {γ_n} is increasing a.e. [Q] to γ := limsup_n γ_n ∈ M+(T). Two appeals to Monotone Convergence then give the desired equality,

    P^ω (α(Tω)X(ω)) = lim_{n→∞} ν_n α = lim_{n→∞} Q^t (γ_n(t)α(t)) = Q^t (γ(t)α(t)),

for α ∈ M+(T). The uniqueness of γ up to Q-equivalence follows directly from Lemma <20>. Arbitrarily choose from the Q-equivalence class of all possible γ's one member, and call it g(t, X).

Again by Lemma <20>, the first three assertions are equivalent to the relationships, for all α ∈ M+(T, B),

(i) Q(α(t)g(t, X)) = Q(α(t)h(t)),


(ii) Q(α(t)g(t, h_1(T)X_1 + h_2(T)X_2)) = Q(α(t)h_1(t)g(t, X_1) + α(t)h_2(t)g(t, X_2)),
(iii) Q(α(t)g(t, X_1)) ≤ Q(α(t)g(t, X_2)).

Systematically replace all expressions like Q(F(t)g(t, Z)) by the corresponding P(F(T)Z), and expressions like QG(t) by the corresponding PG(T), to get further equivalent relationships,

(i) P(α(T)X) = P(α(T)h(T)),
(ii) P(α(T)(h_1(T)X_1 + h_2(T)X_2)) = P(α(T)h_1(T)X_1 + α(T)h_2(T)X_2),
(iii) P(α(T)X_1) ≤ P(α(T)X_2).

The first three assertions follow.

For the fourth assertion, note that (iii) implies g(t, X_n) ↑ γ(t) := limsup_n g(t, X_n) a.e. [Q]. Then apply Monotone Convergence twice, with α in M+(T), to get

    Q(α(t)γ(t)) = lim Q(α(t)g(t, X_n)) = lim P(α(T)X_n) = P(α(T)X) = Q(α(t)g(t, X)),

from which (iv) follows via Lemma <20>. □

At the risk of misinterpreting the symbol as an expectation of X with respect to a conditional probability measure P(· | T = t), I will write P(X | T = t) instead of g(t, X) for the Kolmogorov conditional expectation, by analogy with the traditional E(X | T = t).
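On a finite Ω the construction is transparent: g(t, X) is just the P-weighted average of X over the level set {T = t}, and the defining identity can be checked exactly. A sketch with a hypothetical die-and-parity example (exact rational arithmetic avoids rounding):

```python
from fractions import Fraction as F

# Finite Omega: a fair die, with T the parity statistic.  The Kolmogorov
# conditional expectation g(t, X) is the average of X over {T = t}, and the
# defining identity P(alpha(T) X) = Q^t(alpha(t) g(t, X)) holds exactly.

omega = [1, 2, 3, 4, 5, 6]
P = {w: F(1, 6) for w in omega}
T = lambda w: w % 2                  # parity
X = lambda w: F(w)                   # face value

def g(t):
    level = [w for w in omega if T(w) == t]
    return sum(P[w] * X(w) for w in level) / sum(P[w] for w in level)

Q = {t: sum(P[w] for w in omega if T(w) == t) for t in (0, 1)}

for alpha in ({0: 1, 1: 0}, {0: 0, 1: 1}, {0: 2, 1: 5}):
    lhs = sum(P[w] * alpha[T(w)] * X(w) for w in omega)   # P(alpha(T) X)
    rhs = sum(Q[t] * alpha[t] * g(t) for t in (0, 1))     # Q(alpha * g(., X))
    assert lhs == rhs
```

Here g(0) = 4 (the average of the even faces) and g(1) = 3 (the average of the odd faces), and no negligible-set qualifiers are needed because Q charges every level set.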

If it were possible to combine the negligible sets corresponding to the a.e. [Q] qualifiers on each of the assertions (ii), (iii), and (iv) of the Theorem into a single Q-negligible set N, the maps {g(t, ·) : t ∉ N} would define a family of increasing linear functionals on M+(Ω, F), each with the Monotone Convergence property. Moreover, (ii) would even allow us to treat functions of T as constants when conditioning on T. Such functionals would then define the conditional distribution as a family of probability measures on F. Unfortunately, we accumulate uncountably many negligible sets as we cycle (ii)-(iv) through all the required combinations of functions in M+(Ω, F) and positive constants. If we could somehow reduce (ii)-(iv) to a countable collection of requirements, each involving the exclusion of a single negligible set, then the task of constructing the conditional distribution would become easy. The topological assumptions in Theorem <9> are used precisely to achieve this effect.

REMARK. The accumulation of negligible sets causes no difficulty when we deal with only countably many random variables X. In such circumstances, the Kolmogorov conditional expectation effectively has the properties of an expectation with respect to a conditional distribution.

For general P-integrable X, define

    P(X | T = t) := P(X^+ | T = t) − P(X^− | T = t),

with the conditional expectations for the positive and negative parts of X defined as in Theorem <19>. Notice that both those conditional expectations can be taken (almost) everywhere finite, because they have finite expectations: choose α ≡ 1 in the defining equality to get Q^t (P(X^± | T = t)) = PX^± < ∞. To avoid any problem with infinite expectations, we could also restrict the α to be a bounded B-measurable


random variable, and define P(X | T = t) for integrable X as the Q-integrable random variable g(t, X) for which

<21>    P(α(T)X) = Q(α(t)g(t, X))   for all bounded, B-measurable α.

Brave readers might also want to dabble with conditional expectations for other cases where only one of PX^+ or PX^− is finite. Beware of conditional ∞ − ∞, whatever that might mean!

There are conditional analogs of the Fatou Lemma (Problem [7]), Dominated Convergence (Problem [8]), Jensen's inequality (Problem [13]), and so on. The derivations are essentially the same as for ordinary expectations, because the accumulation of countably many Q-negligible sets causes no difficulty when we deal with only countably many random variables X.

Conditioning on a sub-sigma-field

The choice of Q as the image measure lets us rewrite Q^t (α(t)g(t, X)) as P(α(T)g(T, X)). Both g(T, X) and α(T) are σ(T)-measurable functions on Ω. Indeed (recall the Problems to Chapter 2), every σ(T)\B[0, ∞]-measurable map into [0, ∞] must be of the form α(T) for some α in M+(T, B). The defining property may be recast as: to each X in M+(Ω, F) there exists an X_T := g(T, X) in M+(Ω, σ(T)) for which

<22>    P(WX) = P(WX_T)   for each W in M+(Ω, σ(T)).

The random variable X_T, which is also denoted by P(X | T) (or E(X | T), in traditional notation), is unique up to a P almost sure equivalence.

REMARK. Be sure that you understand the distinction between the functions g(t) := g(t, X) and g(T) = g ∘ T; they live on different spaces, T and Ω. If we write P(X | T = t) for g(t) it is tempting to write the nonsensical P(X | T = T) for g(T), instead of P(X | T), or even P(X | σ(T)) (see below).

The conditional expectation P(X | T) depends on T only through the sigma-field σ(T). If S were another random element for which σ(T) = σ(S), the conditional expectation P(X | S) would be defined by the same requirement <22>. That is, except for the unavoidable nonuniqueness within a P-equivalence class of random variables, we have P(X | T) = P(X | S). We could regard the conditioning information as coming from a sub-sigma-field of F rather than from a T that generates the sub-sigma-field.

<23> Definition. Let X belong to M+(Ω, F) and P be a probability measure on F. For each sub-sigma-field G of F, the conditional expectation P(X | G) is the random variable X_G in M+(Ω, G) for which

<24>    P(gX) = P(gX_G)   for each g in M+(Ω, G).

The variable X_G is called the conditional expectation of X given the sub-sigma-field G. It is unique up to P-equivalence.

Be careful that you remember to check that X_G is G-measurable. Otherwise you might be tempted to leap to the conclusion that X_G equals X, as a trivial solution to the equality <24>. Such an identification is valid only when X itself


is G-measurable. That is, P(X | G) = X when X is G-measurable (cf. part (i) of Theorem <19>).

REMARK. The traditional notation for P(X | G) is E(X | G). It would perhaps be more precise to speak of a conditional expectation, to stress the nonuniqueness. But, as with most situations involving an almost sure equivalence, the tradition is to ignore such niceties, except for the occasional almost sure reminder.

Some authors write the conditional expectation P(X | G) as GX, a prefix notation that stresses its role as a linear map from random variables to random variables. If I were not such a traditionalist, I might adopt this more concise notation, which has much to recommend it.

The existence and almost sure uniqueness of P(X | G) present no new challenges: it is just P(X | T) for the identity map T from (Ω, F) onto (Ω, G). Theorem <19> specializes easily to give us "measure-like" properties of P(· | G).

<25> Theorem. For a fixed sub-sigma-field G of F, conditional expectations have the following properties.

(i) P(X | G) = X a.e. [P], for each G-measurable X.

(ii) For X_1, X_2 in M+(F) and g_1, g_2 in M+(G),

    P(g_1 X_1 + g_2 X_2 | G) = g_1 P(X_1 | G) + g_2 P(X_2 | G)   a.e. [P].

(iii) If 0 ≤ X_1 ≤ X_2 then P(X_1 | G) ≤ P(X_2 | G) a.e. [P].

(iv) If 0 ≤ X_1 ≤ X_2 ≤ ... ↑ X then P(X_n | G) ↑ P(X | G) a.e. [P].

REMARK. You might find it helpful to think of G as partial information: the information one obtains about ω by learning the value g(ω) for each G-measurable random variable g. The value of the conditional expectation P(X | G) is determined by the "G-information"; it is a prediction of X(ω) based on partial information. (The conditional expectation can also be interpreted as the fair price to pay for the random return X(ω) after one learns the G-information about ω, as with the fair-price interpretation described in Section 1.5.) Whatever you prefer as a guide to intuition, be warned: you should not take the interpretation too literally, because there are examples (Problem [5]) where one can determine ω precisely from the values of all G-measurable g, even with G a proper sub-sigma-field of F.

<26> Example. Suppose an X in M+(Ω, F) is independent of all G-measurable random variables. (That is, σ(X) is independent of G.) What does the partial information heuristic for conditioning on sigma-fields suggest for the value P(X | G)?

Information from G leaves us just as ignorant about X as when we started, when the prediction of X was the constant PX. It would seem that we should have P(X | G) = PX when X is independent of G. Indeed, the constant random variable C := PX is G-measurable and, for all g in M+(G), independence gives P(gX) = (Pg)(PX) = P(gC), as required to show that C = P(X | G).

<27> Example. Suppose G_0 and G_1 are sub-sigma-fields of F with G_0 ⊆ G_1, and let X ∈ M+(Ω, F). Let X_i = P(X | G_i), for i = 0, 1. That is, X_i ∈ M+(Ω, G_i) and P(g_i X) = P(g_i X_i) for each g_i in M+(Ω, G_i). In particular, P(g_0 X_1) = P(g_0 X) = P(g_0 X_0) for each g_0 in M+(Ω, G_0), because M+(Ω, G_0) ⊆ M+(Ω, G_1). That is, the random variable X_0 has the two properties characterizing P(X_1 | G_0). Put another way, P(P(X | G_1) | G_0) = P(X | G_0) almost surely.


<28> Example. When PX² < ∞, the defining property of the conditional expectation X_G = P(X | G) may be written as P((X − X_G)g) = 0 for all bounded, G-measurable g. A generating class argument extends the equality to all square-integrable, G-measurable Z. That is, X − X_G is orthogonal to L²(Ω, G, P). The (equivalence class of the) conditional expectation is just the projection of X onto L²(Ω, G, P), as a closed subspace of L²(Ω, F, P).

Some abstract conditioning results are readily explained in this setting. For example, if G_0 ⊆ G_1, for two sub-sigma-fields of F, then the assertion

    P(P(X | G_1) | G_0) = P(X | G_0)   almost surely

from Example <27> corresponds to the fact that π_0 ∘ π_1 = π_0, where π_i denotes the projection map from L²(Ω, F, P) onto H_i := L²(Ω, G_i, P). The π_0 on the left-hand side kills any component in H_1 that is orthogonal to H_0, leaving only the component in H_0.
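The projection picture can be made concrete on a finite Ω, where conditioning on a sub-sigma-field means averaging over its atoms; the sketch below (atoms chosen arbitrarily, not from the text) checks the identity π_0 ∘ π_1 = π_0, that is, the tower property of Example <27>:

```python
# Conditional expectation on a finite Omega: average over the atoms of the
# sub-sigma-field, i.e. project onto the functions constant on each atom.

omega = list(range(12))
P = {w: 1 / 12 for w in omega}

def cond_exp(X, atoms):
    out = {}
    for b in atoms:
        avg = sum(P[w] * X[w] for w in b) / sum(P[w] for w in b)
        for w in b:
            out[w] = avg           # constant on each atom
    return out

X = {w: float(w * w) for w in omega}
g1 = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]   # finer field G_1
g0 = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]       # coarser field G_0 (atoms are unions of G_1 atoms)

X1 = cond_exp(X, g1)               # P(X | G_1)
X0 = cond_exp(X, g0)               # P(X | G_0)
tower = cond_exp(X1, g0)           # P(P(X | G_1) | G_0)
assert all(abs(tower[w] - X0[w]) < 1e-12 for w in omega)
```

Projecting onto H_1 first and then onto H_0 lands at the same function as projecting straight onto H_0, because the G_0 atoms are unions of G_1 atoms.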

*7. Sufficiency

The intuitive definition of sufficiency says that a statistic T (a measurable map into some space T) is sufficient for a family of probability measures 𝒫 = {P_θ : θ ∈ Θ} if the conditional distributions given T do not depend on θ. The value of T(ω) carries all the "information about θ" there is to learn from an observation ω on P_θ.

There are (at least) two ways of making the definition precise, using either the Kolmogorov notion of conditional expectation or the properties of conditional distributions in the sense of Section 2. To distinguish between the two approaches, I will use nonstandard terminology.

<29> Definition. Say that T is strongly sufficient for a family of probability measures 𝒫_Θ = {P_θ : θ ∈ Θ} on (Ω, F) if there is a probability kernel {P_t : t ∈ T} that serves as a conditional distribution for each P_θ given T.

Say that T is weakly sufficient for 𝒫_Θ if for each X in M+(Ω, F) there exists a version of the Kolmogorov conditional expectation P_θ(X | T = t) that does not depend on θ. Say that a sub-sigma-field G of F is weakly sufficient for 𝒫_Θ if, for each X in M+(Ω, F), there exists a version of P_θ(X | G) that does not depend on θ.

REMARK. I add the subscript to 𝒫_Θ to ensure that there is no confusion with 𝒫 as a probability kernel from T to Ω.

The concept of strong sufficiency places more restrictions on the underlying probability spaces, but those restrictions have the added benefit of making some arguments more intuitive. The weaker concept has the advantage that it requires only the Kolmogorov definition of conditioning. However, it also has the drawback of allowing some counterintuitive consequences. For example, if a sub-sigma-field G_0 is weakly sufficient for a family 𝒫_Θ then, in the intuitive sense explained in Section 6, it provides all the information available about θ from an observation on an unknown P_θ in 𝒫_Θ. If G_1 is another sub-sigma-field, with G_0 ⊆ G_1, then intuitively G_1 should provide even more information about θ, which should make G_1 weakly sufficient for 𝒫_Θ. Unfortunately, as the next Example shows, the intuition is not correct.


<30> Example. Let Ω be the real line, equipped with its Borel sigma-field F. For each θ > 0 define P_θ as the probability measure on F that puts mass 1/2 at each of the two points ±θ. Say that a set B is symmetric if −ω ∈ B for each ω in B. Let S be a fixed symmetric set not in F, and not containing the origin. Write G_0 for the sub-sigma-field of all symmetric Borel sets, and G_1 for the sub-sigma-field of all Borel sets B for which B ∩ S^c is symmetric.

A simple generating class argument shows that a Borel measurable function g is G_0-measurable if and only if g(ω) = g(−ω) for all ω, and it is G_1-measurable if and only if g(ω) = g(−ω) for all ω in S^c.

Consider an X in M+(F). The symmetric function X_0(ω) := (1/2)(X(ω) + X(−ω)) is G_0-measurable. For all θ > 0 and all g_0 in M+(G_0),

    P_θ(Xg_0) = (1/2)(X(θ)g_0(θ) + X(−θ)g_0(−θ)) = X_0(θ)g_0(θ) = P_θ(X_0 g_0).

That is, X_0 is a version of P_θ(X | G_0) that does not depend on θ. The sub-sigma-field G_0 is weakly sufficient for 𝒫_Θ = {P_θ : θ > 0}. In fact, a stronger assertion is possible. The sigma-field G_0 is generated by the map T(ω) = |ω|. For each t we can take the conditional distribution P_t as the probability measure that places mass 1/2 at ±t. The statistic T is strongly sufficient.

Now consider $X$ equal to the indicator of the half-line $H = [0, \infty)$. Could there be a $\mathcal{G}_1$-measurable function $X_1$ for which $P_\theta(Xg_1) = P_\theta(X_1g_1)$ for all $g_1$ in $M^+(\mathcal{G}_1)$? If such an $X_1$ existed we would have

$$\tfrac12 g_1(\theta) = P_\theta\big(X(\omega)g_1(\omega)\big) = \tfrac12\big(X_1(\theta)g_1(\theta) + X_1(-\theta)g_1(-\theta)\big) \qquad\text{for all } \theta > 0.$$

For $\theta$ in $H \cap S$, take $g_1$ as the indicator function of the singleton set $\{\theta\}$, and then as the indicator function of the singleton set $\{-\theta\}$, to deduce that $X_1(\theta) = 1$ and $X_1(-\theta) = 0$. For $\theta$ in $H\backslash S$ take $g_1 \equiv 1$ to deduce (via the $\mathcal{G}_1$-measurability property $X_1(\theta) = X_1(-\theta)$) that $X_1(\theta) = X_1(-\theta) = 1/2$. That is, $\{X_1 = 1\} = H \cap S$, which would contradict the Borel measurability of $X_1$ and the nonmeasurability of $S$, because $S$ is the union of $H \cap S$ and its reflection $-(H \cap S)$. There can be no $\mathcal{G}_1$-measurable function $X_1$ that is a version of $P_\theta(X \mid \mathcal{G}_1)$ for all $\theta$. The sub-sigma-field $\mathcal{G}_1$ is not weakly sufficient for $\mathcal{P}_\Theta$. □

REMARK. The failure of weak sufficiency for $\mathcal{G}_1$ is due to the fact that the family $\mathcal{P}_\Theta$ is not dominated: that is, there exists no sigma-finite measure dominating each $P_\theta$. Problem [17] shows that the failure cannot occur for dominated families.

The concepts of strong and weak sufficiency coincide when $\Omega$ is a finite set and $P_\theta\{\omega\} > 0$ for each $\omega$ in $\Omega$ and each $\theta$ in $\Theta$. If $\{\omega : T\omega = t\}$ is nonempty then the sum $g(t, \theta) := \sum_{\omega'}\{T(\omega') = t\}P_\theta\{\omega'\}$ is nonzero and

$$P_\theta\{\omega \mid T = t\} = \begin{cases} P_\theta\{\omega\}/g(t, \theta) & \text{if } T(\omega) = t, \\ 0 & \text{otherwise.} \end{cases}$$

If $T$ is sufficient, the left-hand side is a function of $\omega$ and $t$ alone. If we write it as $H(\omega, t)$, then sufficiency implies

$$P_\theta\{\omega\} = \begin{cases} g(t, \theta)H(\omega, t) & \text{if } T(\omega) = t, \\ 0 & \text{otherwise,} \end{cases}$$


130 Chapter 5: Conditioning

or, more succinctly, $P_\theta\{\omega\} = g(T(\omega), \theta)h(\omega)$, where $h(\omega) := H(\omega, T(\omega))$. Conversely, if $P_\theta\{\omega\}$ has such a factorization then, for $\omega$ with $T(\omega) = t$,

$$P_\theta\{\omega \mid T = t\} = \frac{g(t, \theta)h(\omega)}{\sum_{\omega'}\{T(\omega') = t\}\,g(t, \theta)h(\omega')} = \frac{h(\omega)}{\sum_{\omega'}\{T(\omega') = t\}\,h(\omega')},$$

which does not depend on $\theta$. Thus $T$ is sufficient if $P_\theta\{\omega\}$ factorizes in the way shown.
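On a finite $\Omega$ the factorization criterion can be exercised quite concretely. The toy family below is my own choice, not from the text: three Bernoulli$(\theta)$ coordinates, with $T$ the number of ones. The point masses factorize as $P_\theta\{\omega\} = g(T\omega, \theta)h(\omega)$ with $g(t, \theta) = \theta^t(1-\theta)^{3-t}$ and $h \equiv 1$, so the conditional distribution given $T$ should be free of $\theta$.

```python
from itertools import product

# Finite-Omega illustration of the factorization criterion:
# Omega = {0,1}^3, P_theta = Bernoulli(theta)^3, T(w) = number of ones.
# P_theta{w} = theta^T(w) * (1-theta)^(3-T(w)) factorizes through T,
# so P_theta{w | T = T(w)} should not depend on theta.

n = 3
Omega = list(product((0, 1), repeat=n))
T = lambda w: sum(w)

def pmf(w, theta):
    t = T(w)
    return theta**t * (1 - theta)**(n - t)

def conditional(w, theta):
    """P_theta{w | T = T(w)}, computed by direct division."""
    t = T(w)
    total = sum(pmf(v, theta) for v in Omega if T(v) == t)
    return pmf(w, theta) / total

for w in Omega:
    vals = {round(conditional(w, th), 12) for th in (0.2, 0.5, 0.9)}
    assert len(vals) == 1      # same conditional mass for every theta
print("conditional distribution given T is free of theta")
```

Each conditional distribution here is uniform on the set $\{T = t\}$, matching the cancellation of the $g(t, \theta)$ factor in the display above.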

The factorization criterion also extends to more general settings. The proof of the analog for weak sufficiency, due to Halmos & Savage (1949), appears in the textbook of Lehmann (1959, Section 2.6). The corresponding proof for strong sufficiency is easier to recognize as a direct generalization of the argument for finite $\Omega$.

<31> Example. Suppose each $P_\theta$ has a density $p(\omega, \theta)$ with respect to a sigma-finite measure $\lambda$, and that $\lambda$ has a $(T, \mu)$-disintegration $\{\lambda_t : t \in \mathcal{T}\}$. If the density factorizes, $p(\omega, \theta) = g(T\omega, \theta)h(\omega)$, then Theorem <12> will show that $T$ is strongly sufficient for $\mathcal{P}_\Theta$.

Ignoring problems related to division by zero, we would expect the conditional distribution $\mathbb{P}_t$ to be dominated by $\lambda_t$, with density

$$\frac{p(\omega, \theta)}{\lambda_t p(\cdot, \theta)} = \frac{g(T\omega, \theta)h(\omega)}{\lambda_t\big(g(T\omega, \theta)h(\omega)\big)}.$$

For $\lambda_t$ almost all $\omega$, the $T\omega$ in the last numerator and denominator are both equal to $t$, allowing us to cancel out a $g(t, \theta)$ factor, leaving a conditional density that doesn't depend on $\theta$.

Let us try to be more careful about division by zero. Define

$$p_t(\omega) := \frac{h(\omega)}{H(t)}\{0 < H(t) < \infty\} \qquad\text{where } H(t) := \lambda_t h.$$

A proof that $p_t$ is a conditional density for each $P_\theta$ requires careful handling of contributions from the sets where $H$ is zero or infinite. According to Theorem <12>, it suffices to show, for each fixed $\theta$, that

$$q(t, \theta)p_t(\omega) = p(\omega, \theta) \qquad\text{a.e. } [\mu \otimes \lambda_t],$$

where

$$q(t, \theta) := \lambda_t^\omega\big(g(T\omega, \theta)h(\omega)\big) = g(t, \theta)H(t) \qquad\text{a.e. } [\mu].$$

The aberrant negligible sets are allowed to depend on $\theta$. Because $T\omega = t$ a.e. $[\mu \otimes \lambda_t]$, the task reduces to showing that

$$g(t, \theta)h(\omega)\{0 < H(t) < \infty\} = g(t, \theta)h(\omega) \qquad\text{a.e. } [\mu \otimes \lambda_t].$$

We need to show that the contributions to the right-hand side from the sets where $H$ is zero or infinite vanish a.e. $[\mu \otimes \lambda_t]$. Equivalently, we need to show

$$\mu^t\lambda_t^\omega\big(g(t, \theta)h(\omega)\{H(t) = 0 \text{ or } \infty\}\big) = 0.$$

The $g(t, \theta)\{H(t) = 0 \text{ or } \infty\}$ factor slips outside the innermost integral with respect to $\lambda_t$, leaving $\mu^t\big(g(t, \theta)H(t)\{H(t) = 0 \text{ or } \infty\}\big)$. Clearly the product $g(t, \theta)H(t)$ is zero on the set $\{H = 0\}$. That $g(t, \theta)$ must be zero almost everywhere on the set $\{H = \infty\}$ follows from the fact that $q(t, \theta)$ is a probability density:

$$1 = \mu^t q(t, \theta) \ge \mu^t\big(g(t, \theta)H(t)\{H(t) = \infty\}\big),$$

which can be finite only if $g(t, \theta) = 0$ a.e. $[\mu]$ on the set $\{H = \infty\}$.


The strong sufficiency follows. See Problem [18] for a converse, where strong sufficiency implies existence of a factorizable density. □

<32> Example. Let $P_\theta$ denote the uniform distribution on $[0, \theta]^2$, for $\theta > 0$. Let $T(x, y) := \max(x, y)$. As in Example <5>, the conditional probability distributions $\mathbb{P}_t$ given $T$ are uniform on the sets $\{T = t\}$. That is, $T$ is a strongly sufficient statistic for the family $\{P_\theta : \theta > 0\}$. The same conclusion follows directly from the factorization criterion, because $P_\theta$ has density $p(x, y, \theta) = \theta^{-2}\{T(x, y) \le \theta\}$ with respect to Lebesgue measure on $(\mathbb{R}^+)^2$. □
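The factorization claimed for Example <32> can be checked mechanically. The short sketch below (my own illustration; it assumes nothing beyond the displayed density) verifies pointwise that $p(x, y, \theta)$ splits as $g(T(x, y), \theta)h(x, y)$ with $g(t, \theta) := \theta^{-2}\{t \le \theta\}$ and $h \equiv 1$.

```python
# Pointwise check of the factorization p(x, y, theta) = g(max(x,y), theta) * h(x, y)
# for the uniform distribution on the square [0, theta]^2.

def p(x, y, theta):
    """Density of P_theta with respect to Lebesgue measure on (R+)^2."""
    return theta**-2 if max(x, y) <= theta else 0.0

def g(t, theta):
    """The factor depending on the data only through T = max(x, y)."""
    return theta**-2 if t <= theta else 0.0

h = lambda x, y: 1.0   # the theta-free factor is identically 1

points = [(0.3, 0.8), (1.2, 0.1), (2.5, 2.4)]
for theta in (0.5, 1.0, 3.0):
    for (x, y) in points:
        assert p(x, y, theta) == g(max(x, y), theta) * h(x, y)
print("density factorizes through T = max(x, y)")
```

Because the only dependence on $\theta$ passes through $T$, the factorization criterion delivers the strong sufficiency of $T$ without any direct computation of conditional distributions.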

8. Problems

[1] Suppose $g_1$ and $g_2$ are maps from $(\mathcal{X}, \mathcal{A})$ to $(\mathcal{Y}, \mathcal{B})$, with product-measurable graphs. Define $\psi_i(x) := (x, g_i(x))$. Let $P_i := \psi_i(\mu_i)$, the image measures of probability measures $\mu_i$ on $\mathcal{A}$. Let $P := \alpha_1P_1 + \alpha_2P_2$, for constants $\alpha_i > 0$ with $\alpha_1 + \alpha_2 = 1$. Let $X$ denote the coordinate map onto the $\mathcal{X}$ space. Show that the conditional probability distribution $P_x$ concentrates on the points $(x, g_1(x))$ and $(x, g_2(x))$. Find the conditional probabilities assigned to each point. Hint: Consider the density of $\mu_i$ with respect to $\alpha_1\mu_1 + \alpha_2\mu_2$.

[2] Let $Q$ be a nonatomic probability measure on $\mathcal{B}(\mathbb{R})$ with corresponding distribution function $F$. Let $\mathbb{Q} := Q^n$, the joint distribution of $n$ independent observations $x_1, \ldots, x_n$ from $Q$. Define $T(x_1, \ldots, x_n) := \max_{i \le n} x_i$.

(i) Show that $T$ has distribution $\nu$, defined by $\nu(-\infty, t] := F(t)^n$.

(ii) For each fixed $t \in \mathbb{R}$, show that

$$\nu(-\infty, t] = \sum_{i \le n} \mathbb{Q}\{x_1 \le t, \ldots, x_n \le t,\ T = x_i\} = n\,\mathbb{Q}\{x_1 \le t,\ x_2 < x_1,\ x_3 < x_1, \ldots, x_n < x_1\}.$$

Deduce that $\nu$ has density $nF^{n-1}$ with respect to $Q$.

(iii) Write $Q_t$ for the distribution of $x_1$ conditional on $x_1 < t$. That is, $Q_tg := Q\big(g(x)\{x < t\}\big)/F(t)$. Write $\epsilon_t$ for the probability measure degenerate at the point $t$. Show that the conditional distribution $\mathbb{Q}_t$, for $\mathbb{Q}$ given $T = t$, equals

$$\mathbb{Q}_t = n^{-1}\sum_{i \le n} Q_t \otimes \cdots \otimes Q_t \otimes \epsilon_t \otimes Q_t \otimes \cdots \otimes Q_t, \qquad\text{with } \epsilon_t \text{ as the } i\text{th factor}.$$

That is, to generate a sample $x_1, \ldots, x_n$ from $\mathbb{Q}_t$, select an $x_i$ to equal $t$ (with probability $n^{-1}$ for each $i$), then generate the remaining $n - 1$ observations independently from the conditional distribution $Q_t$.

[3] Show that a $(T, \mu)$-disintegration $\lambda = \{\lambda_t : t \in \mathcal{T}\}$ of a sigma-finite measure $\lambda$ can be chosen as a probability kernel if and only if the sigma-finite measure $\mu$ is equal to the image measure $T\lambda$. Hint: Write $\ell(t)$ for the total mass of $\lambda_t$. Show that $(T\lambda)g = \mu^t\big(g(t)\ell(t)\big)$ for all $g$ in $M^+(\mathcal{B})$. Prove that the right-hand side reduces to $\mu g$ for all $g$ if and only if $\ell(t) = 1$ for $\mu$ almost all $t$.


[4] For families of probability measures $\mathcal{P}$ and $\mathcal{Q}$ defined on the same sigma-field, define $\alpha_2(\mathcal{P}, \mathcal{Q})$ to equal $\sup\{\alpha_2(P, Q) : P \in \mathcal{P}, Q \in \mathcal{Q}\}$, where $\alpha_2$ denotes the Hellinger affinity, as in Section 3.3. Write $\mathcal{P} \otimes \mathcal{P}$ for $\{P_1 \otimes P_2 : P_i \in \mathcal{P}\}$, and $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ for its convex hull. Show that $\alpha_2\big(\mathrm{co}(\mathcal{P} \otimes \mathcal{P}), \mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})\big) \le \alpha_2\big(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q})\big)^2$ by the following steps. Write $\Delta$ for $\alpha_2\big(\mathrm{co}(\mathcal{P}), \mathrm{co}(\mathcal{Q})\big)$.

(i) For given $P = \sum_i \beta_i P_{1i} \otimes P_{2i}$ in $\mathrm{co}(\mathcal{P} \otimes \mathcal{P})$ and $Q = \sum_j \gamma_j Q_{1j} \otimes Q_{2j}$ in $\mathrm{co}(\mathcal{Q} \otimes \mathcal{Q})$, let $\lambda$ be any probability measure dominating all the component measures, with corresponding densities $p_{1i}(x)$, and so on. Show that $P$ has density $p(x, y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)$ with respect to $\lambda \otimes \lambda$, with a similar expression for the density $q(x, y)$ of $Q$.

(ii) Write $X$ for the projection map of $\mathcal{X} \times \mathcal{X}$ onto its first coordinate. Show that $XP$ has density $p(x) := \sum_i \beta_i p_{1i}(x)$ with respect to $\lambda$, with a similar expression for the density $q(x)$ of $XQ$. Deduce that $XP \in \mathrm{co}(\mathcal{P})$ and $XQ \in \mathrm{co}(\mathcal{Q})$.

(iii) Define $p_x(y) := \sum_i \beta_i p_{1i}(x)p_{2i}(y)/p(x)$ on the set $\{x : p(x) > 0\}$. Define $q_x(y)$ analogously. Show that $\lambda^y\sqrt{p_x(y)q_x(y)} \le \Delta$ for all $x$.

(iv) Show that $\alpha_2(P, Q) = \lambda^x\big(\sqrt{p(x)q(x)}\;\lambda^y\sqrt{p_x(y)q_x(y)}\big) \le \Delta^2$.

[5] Let $P$ be Lebesgue measure on $\mathcal{B}$, the Borel sigma-field of $[0,1]$. Let $\mathcal{G}$ denote the sigma-field generated by all the singletons in $[0,1]$.

(i) Show that each member of $\mathcal{G}$ has probability either zero or one.

(ii) Deduce that $P(X \mid \mathcal{G}) = PX$ for each $X$ in $M^+(\mathcal{B})$.

(iii) Show for each Borel measurable $X$ that $X(\omega)$ is uniquely determined once we know the values of all $\mathcal{G}$-measurable random variables.

[6] Let $X$ be an integrable random variable, and $Z$ be a $\mathcal{G}$-measurable random variable for which $XZ$ is integrable. Show that $P(XZ) = P(YZ)$ where $Y := P(X \mid \mathcal{G})$.

[7] Suppose $X_n \in M^+(\mathcal{F})$. Show that $P(\liminf_n X_n \mid \mathcal{G}) \le \liminf_n P(X_n \mid \mathcal{G})$ almost surely, for each sub-sigma-field $\mathcal{G}$. Hint: Derive from the conditional form of Monotone Convergence. Imitate the proof of Fatou's Lemma.

[8] Suppose $X_n \to X$ almost surely, with $|X_n| \le H$ for an integrable $H$. Show that $P(X_n \mid \mathcal{G}) \to P(X \mid \mathcal{G})$ almost surely. Hint: Derive from the conditional form of Fatou's Lemma (Problem [7]). Imitate the proof of Dominated Convergence.

[9] Suppose $PX^2 < \infty$. Show that $\operatorname{var}(X) = \operatorname{var}\big(P(X \mid \mathcal{G})\big) + P\big(\operatorname{var}(X \mid \mathcal{G})\big)$.

[10] Let $X$ be an integrable random variable on a probability space $(\Omega, \mathcal{F}, P)$. Let $\mathcal{G}$ be a sub-sigma-field of $\mathcal{F}$ containing all $P$-negligible sets. Show that $X$ is $\mathcal{G}$-measurable if and only if $P(XW) = 0$ for every bounded random variable $W$ with $P(W \mid \mathcal{G}) = 0$ almost surely. (Compare with the corresponding statement for random variables that are square integrable: $Z \in \mathcal{L}^2(\mathcal{G})$ if and only if it is orthogonal to every square integrable $W$ that is orthogonal to $\mathcal{L}^2(\mathcal{G})$.) Hint: For a fixed real $t$ with $P\{X = t\} = 0$ define $Z_t := P\{X > t \mid \mathcal{G}\}$ and $W_t := \{X > t\} - Z_t$. Show that $(X - t)W_t \ge 0$ almost surely, but $P\big((X - t)W_t\big) = 0$. Deduce that $W_t = 0$ almost surely, and hence $\{X > t\}$ is $\mathcal{G}$-measurable.


[11] Let $S_0, S_1, \ldots, S_N$ be random vectors, and $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_N \subseteq \mathcal{F}$ be sigma-fields such that $S_i$ is $\mathcal{F}_i$-measurable for each $i$. Let $\epsilon$, $\alpha$, and $\beta$ be positive numbers for which $\beta P\{|S_N - S_i| \le (1 - \alpha)|S_i| \mid \mathcal{F}_i\} \ge \{|S_i| > \epsilon\}$ almost surely. Show that $P\{\max_i |S_i| > \epsilon\} \le \beta P\{|S_N| > \alpha\epsilon\}$. Hint: Let $\tau$ denote the first $i$ for which $|S_i| > \epsilon$. Show that $\{\tau = i\} \in \mathcal{F}_i$. Use the definition of conditional expectation to bound $P\{|S_i| > \epsilon, \tau = i\}$ by $\beta P\{|S_N| > \alpha\epsilon, \tau = i\}$.

[12] Suppose $P$ is a probability measure on $(\mathcal{X}, \mathcal{A})$ with density $p(x)$ with respect to a probability measure $\lambda$. Let $Q$ equal the image $TP$, and $\mu$ equal $T\lambda$, for a measurable map $T$ into $(\mathcal{T}, \mathcal{B})$. Without assuming existence of disintegrations or conditional distributions, show the following.

(i) $Q$ has density $q(t) := \lambda(p \mid T = t)$ with respect to $\mu$.

(ii) For $X \in M^+(\mathcal{X})$, show that $P(X \mid T = t)$, the Kolmogorov conditional expectation, is given by $\{0 < q(t) < \infty\}\lambda(Xp \mid T = t)/q(t)$, up to a $Q$-equivalence.

[13] For each convex function $\psi$ on the real line there exists a countable family of linear functions for which $\psi(x) = \sup_{i \in \mathbb{N}}(a_i + b_ix)$ for all $x$ (see Appendix C). Use this representation to prove the conditional form of Jensen's inequality: if $X$ and $\psi(X)$ are both integrable then $P(\psi(X) \mid \mathcal{G}) \ge \psi\big(P(X \mid \mathcal{G})\big)$ almost surely. Hint: For each $i$ argue that $P(\psi(X) \mid \mathcal{G}) \ge a_i + b_iP(X \mid \mathcal{G})$ almost surely. Question: Is integrability of $\psi(X)$ really needed?

[14] Let $P$ be a probability measure on $\mathcal{B}(\mathbb{R}^2)$ that is invariant under rotations $S_\theta$ for a dense set $\Theta_0$ of angles $\theta$. Show that it is invariant under all rotations. Hint: For bounded, continuous $f$, if $\theta_i \to \theta$ then $P^xf(S_{\theta_i}x) \to P^xf(S_\theta x)$.

[15] Find an invariance argument that produces the conditional distributions from Exercise <5>. Hint: Consider transformations $g_\theta$ that map $(x, y)$ into $(x, y_\theta)$, where $y_\theta := y + \theta$ if $y + \theta \le x$, and $y_\theta := y + \theta - x$ otherwise.

[16] Let $\mathcal{Q}$ be a family of probability measures on a sigma-field $\mathcal{A}$, dominated by a fixed sigma-finite measure $\lambda$. That is, each $Q$ in $\mathcal{Q}$ has a density $\delta_Q$ with respect to $\lambda$. Show that $\mathcal{Q}$ is also dominated by a probability measure of the form $\nu = \sum_{j=1}^\infty 2^{-j}Q_j$, with $Q_j \in \mathcal{Q}$ for each $j$, by following these steps.

(i) Show that there is no loss of generality in assuming that $\lambda$ is a finite measure. Hint: Express $\lambda$ as a sum of finite measures $\sum_i \lambda_i$, then choose positive constants $a_i$ so that $\sum_i a_i\lambda_i$ has finite total mass.

(ii) For each countable subfamily $\mathcal{S}$ of $\mathcal{Q}$ define $L(\mathcal{S}) := \lambda\left(\cup_{Q \in \mathcal{S}}\{\delta_Q > 0\}\right)$. Define $L := \sup\{L(\mathcal{S}) : \mathcal{S} \subseteq \mathcal{Q} \text{ countable}\}$. Find countable subsets $\mathcal{S}_n$ for which $L(\mathcal{S}_n) \uparrow L$. Show that $L = L(\mathcal{S}^*)$, where $\mathcal{S}^* := \cup_n\mathcal{S}_n$.

(iii) Write $X_0$ for $\cup_{Q \in \mathcal{S}^*}\{\delta_Q > 0\}$. For each $Q_0$ in $\mathcal{Q}$, show that $\lambda\left(\{\delta_{Q_0} > 0\}\backslash X_0\right) = 0$. Hint: $L(\mathcal{S}^* \cup \{Q_0\}) \le L(\mathcal{S}^*)$.

(iv) Enumerate $\mathcal{S}^*$ as $\{Q_j : j \in \mathbb{N}\}$. Define $\nu := \sum_{j=1}^\infty 2^{-j}Q_j$. If $f \in M^+$ and $\nu f = 0$, show that $\lambda(fX_0) = 0$. Deduce that $Q_0f = 0$ for all $Q_0 \in \mathcal{Q}$. That is, $\nu$ dominates $\mathcal{Q}$.


[17] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures on $\mathcal{F}$, dominated by a sigma-finite measure $\lambda$. Suppose $\mathcal{G}_0$ is a weakly sufficient sub-sigma-field, and $\mathcal{G}_0 \subseteq \mathcal{G}_1$ for another sub-sigma-field $\mathcal{G}_1 \subseteq \mathcal{F}$. Show that $\mathcal{G}_1$ is also weakly sufficient, by following these steps.

(i) With no loss of generality (Problem [16]), suppose $\lambda = \sum_{j \in \mathbb{N}} 2^{-j}P_{\theta_j}$. Write $f_\theta$ for the ($\mathcal{G}_0$-measurable) density of $P_\theta|_{\mathcal{G}_0}$ with respect to $\lambda|_{\mathcal{G}_0}$. For an $X$ in $M^+(\mathcal{F})$, write $X_0$ for the version of $P_\theta(X \mid \mathcal{G}_0)$ that doesn't depend on $\theta$. Show that $X_0$ is also a version of $\lambda(X \mid \mathcal{G}_0)$. Deduce that $P_\theta X = P_\theta X_0 = \lambda(f_\theta X_0) = \lambda(f_\theta X)$, that is, $f_\theta$ is also the density of $P_\theta$ with respect to $\lambda$, as measures on $\mathcal{F}$.

(ii) For an $X$ in $M^+(\mathcal{F})$, write $X_1$ for $\lambda(X \mid \mathcal{G}_1)$. For $g_1 \in M^+(\mathcal{G}_1)$, show that $P_\theta(g_1X) = \lambda(f_\theta g_1X) = \lambda(f_\theta g_1X_1) = P_\theta(g_1X_1)$. Deduce that $X_1$ is a version of $P_\theta(X \mid \mathcal{G}_1)$ that doesn't depend on $\theta$.

[18] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures dominated by a sigma-finite measure $\lambda$. Suppose $T$ is a strongly sufficient statistic for $\mathcal{P}_\Theta$, meaning that there exists a probability kernel $\{\mathbb{P}_t : t \in \mathcal{T}\}$ that is a conditional distribution for each $P_\theta$ given $T$. Show that there exist versions of the densities $dP_\theta/d\lambda$ of the form $g(T\omega, \theta)h(\omega)$, by following these steps.

(i) From Problem [16], there exists a dominating probability measure for $\mathcal{P}_\Theta$ of the form $\mathbb{P} = \sum_i 2^{-i}P_{\theta_i}$, for some countable subfamily $\{P_{\theta_i}\}$ of $\mathcal{P}_\Theta$. Show that $\{\mathbb{P}_t : t \in \mathcal{T}\}$ is also a conditional distribution for $\mathbb{P}$ given $T$.

(ii) By part (i) of Theorem <12>, $Q_\theta := TP_\theta$ is dominated by $Q := T\mathbb{P}$. Write $g(t, \theta)$ for some choice of the density $dQ_\theta/dQ$. Show that $P_\theta f = \mathbb{P}^\omega\big(g(T\omega, \theta)f(\omega)\big)$ for $f \in M^+(\mathcal{F})$.

(iii) Deduce that

$$\frac{dP_\theta}{d\lambda}(\omega) = g(T\omega, \theta)\,\frac{d\mathbb{P}}{d\lambda}(\omega) \qquad\text{a.e. } [\lambda],$$

in the sense that the right-hand side can be taken as a density for $P_\theta$.

[19] Let $\mathcal{P}_\Theta = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures dominated by a sigma-finite measure $\lambda$. Suppose $T$ is a strongly sufficient statistic for $\mathcal{P}_\Theta$. Suppose $T^*$ is another statistic, taking values in a set $\mathcal{T}^*$ equipped with a countably generated sigma-field $\mathcal{B}^*$ containing the singletons, such that $\sigma(T) \subseteq \sigma(T^*)$. Show that $T$ can be written as a measurable function of $T^*$. Deduce via the factorization criterion that $T^*$ is also strongly sufficient.

[20] Let $(\mathcal{X}, \mathcal{A}, P)$ be a probability space, and let $\mathbb{P}$ equal $P^n$, the $n$-fold product measure on $\mathcal{A}^n$. For each $x$ in $\mathcal{X}$, let $\delta(x)$ denote the point mass at $x$. For $x$ in $\mathcal{X}^n$, let $T(x)$ denote the so-called empirical measure, $n^{-1}\sum_{i \le n}\delta(x_i)$. Intuitively, if we are given the measure $T(x)$, the conditional distribution for $\mathbb{P}$ should give mass $1/n!$ to each of the $n!$ permutations of $(x_1, \ldots, x_n)$. Formalize this intuition by constructing a $(T, T\mathbb{P})$-disintegration for $\mathbb{P}$. Warning: Not as easy as it seems. What is the $(\mathcal{T}, \mathcal{B})$ for this problem? What happens if the empirical measure is not supported by $n$ distinct points? How should you define $\mathbb{P}_t$ when $t$ does not correspond to a possible realization of an empirical measure?

9. Notes

The idea of defining abstract conditional probabilities and expectations as Radon-Nikodym derivatives is due to Kolmogorov (1933, Chapter 5).

(Regular) conditional distributions have a more complicated history. Loève (1978, Section 30.2) mentioned that the problem of existence was "investigated principally by Doob," but he cited no specific reference. Doob (1953, page 624) cited a counterexample to the unrestricted existence of a regular conditional probability, which also appears in the exercises to Section 48 of the 1969 printing of Halmos (1950). Doob's remarks suggest that the original edition of the Halmos book contained a slightly weaker form of the counterexample. Doob also noted that the counterexample destroyed a claim made in Doob (1938), an error pointed out by Dieudonné (no citation) and Andersen & Jessen (1948). Blackwell (1956) cited Dieudonné (1948) as the source of a counterexample for unrestricted existence of regular conditional probabilities. Blackwell also proved existence of regular conditional distributions for (what are now known as) Blackwell spaces.

In point process theory, disintegrations appear as Palm distributions, that is, conditional distributions given a point of the process at a particular position (Kallenberg 1969).

Pfanzagl (1979) gave conditions under which a regular conditional distribution can be obtained as an elementary limit of ratios of probabilities. The existence of limits of carefully chosen ratios can also be established by martingale methods (see Chapter 6).

The Barndorff-Nielsen, Blaesild & Eriksen (1989) book contains much material on the invariance properties of conditional distributions.

Halmos & Savage (1949) cited a 1935 paper by Neyman, which I have not seen, as the source of the factorization criterion for (weak) sufficiency. See Bahadur (1954) for a detailed discussion of sufficiency. Example <30> is based on a meatier counterexample of Burkholder (1961). The traditional notion of sufficiency (what I have called weak sufficiency) has strange behavior for undominated families.

I learned much about the subtleties of conditioning while working on the paper Chang & Pollard (1997), where we explored quite a range of statistical applications.

The result in Problem [4] is taken from Le Cam (1973) (see also Le Cam 1986, Section 16.4), who used it to establish asymptotic results in the theory of estimation and testing. Donoho & Liu (1991) adapted Le Cam's ideas to establish results about achievability of lower bounds for minimax rates of convergence of estimators.


REFERENCES

Andersen, E. S. & Jessen, B. (1948), 'On the introduction of measures in infinite product sets', Danske Vid. Selsk. Mat.-Fys. Medd.

Baddeley, A. (1977), 'Integrals on a moving manifold and geometrical probability', Advances in Applied Probability 9, 588-603.

Bahadur, R. R. (1954), 'Sufficiency and statistical decision functions', Annals of Mathematical Statistics 25, 423-462.

Barndorff-Nielsen, O. E., Blaesild, P. & Eriksen, P. S. (1989), Decomposition and Invariance of Measures, and Statistical Transformation Models, Vol. 58 of Springer Lecture Notes in Statistics, Springer-Verlag, New York.

Blackwell, D. (1956), On a class of probability spaces, in J. Neyman, ed., 'Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability', Vol. I, University of California Press, Berkeley, pp. 1-6.

Burkholder, D. L. (1961), 'Sufficiency in the undominated case', Annals of Mathematical Statistics 32, 1191-1200.

Chang, J. & Pollard, D. (1997), 'Conditioning as disintegration', Statistica Neerlandica 51, 287-317.

Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion)', Journal of the Royal Statistical Society, Series B 39, 1-38.

Dieudonné, J. (1948), 'Sur le théorème de Lebesgue-Nikodym, III', Ann. Univ. Grenoble 23, 25-53.

Donoho, D. L. & Liu, R. C. (1991), 'Geometrizing rates of convergence, II', Annals of Statistics 19, 633-667.

Doob, J. L. (1938), 'Stochastic processes with integral-valued parameter', Transactions of the American Mathematical Society 44, 87-150.

Doob, J. L. (1953), Stochastic Processes, Wiley, New York.

Eisenberg, B. & Sullivan, R. (2000), 'Crofton's differential equation', American Mathematical Monthly, pp. 129-139.

Halmos, P. R. (1950), Measure Theory, Van Nostrand, New York. July 1969 reprinting.

Halmos, P. R. & Savage, L. J. (1949), 'Application of the Radon-Nikodym theorem to the theory of sufficient statistics', Annals of Mathematical Statistics 20, 225-241.

Kallenberg, O. (1969), Random Measures, Akademie-Verlag, Berlin. US publisher: Academic Press.

Kendall, M. G. & Moran, P. A. P. (1963), Geometric Probability, Griffin.

Kolmogorov, A. N. (1933), Grundbegriffe der Wahrscheinlichkeitsrechnung, Springer-Verlag, Berlin. Second English Edition, Foundations of Probability 1950, published by Chelsea, New York.

Le Cam, L. (1973), 'Convergence of estimates under dimensionality restrictions', Annals of Statistics 1, 38-53.

Le Cam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, New York.

Lehmann, E. L. (1959), Testing Statistical Hypotheses, Wiley, New York. Later edition published by Chapman and Hall.

Loève, M. (1978), Probability Theory, Springer, New York. Fourth Edition, Part II.

Pachl, J. (1978), 'Disintegration and compact measures', Mathematica Scandinavica 43, 157-168.

Pfanzagl, J. (1979), 'Conditional distributions as derivatives', Annals of Probability 7, 1046-1050.

Solomon, H. (1978), Geometric Probability, NSF-CBMS Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics.


Chapter 6

Martingale et al.

SECTION 1 gives some examples of martingales, submartingales, and supermartingales.

SECTION 2 introduces stopping times and the sigma-fields corresponding to "information available at a random time." A most important Stopping Time Lemma is proved, extending the martingale properties to processes evaluated at stopping times.

SECTION 3 shows that positive supermartingales converge almost surely.

SECTION 4 presents a condition under which a submartingale can be written as a difference between a positive martingale and a positive supermartingale (the Krickeberg decomposition). A limit theorem for submartingales then follows.

SECTION *5 proves the Krickeberg decomposition.

SECTION *6 defines uniform integrability and shows how uniformly integrable martingales are particularly well behaved.

SECTION *7 shows that martingale theory works just as well when time is reversed.

SECTION *8 uses reverse martingale theory to study exchangeable probability measures on infinite product spaces. The de Finetti representation and the Hewitt-Savage zero-one law are proved.

1. What are they?

The theory of martingales (and submartingales and supermartingales and other related concepts) has had a profound effect on modern probability theory. Whole branches of probability, such as stochastic calculus, rest on martingale foundations. The theory is elegant and powerful: amazing consequences flow from an innocuous assumption regarding conditional expectations. Every serious user of probability needs to know at least the rudiments of martingale theory.

A little notation goes a long way in martingale theory. A fixed probability space $(\Omega, \mathcal{F}, P)$ sits in the background. The key new ingredients are:

(i) a subset $T$ of the extended real line $\overline{\mathbb{R}}$;

(ii) a filtration $\{\mathcal{F}_t : t \in T\}$, that is, a collection of sub-sigma-fields of $\mathcal{F}$ for which $\mathcal{F}_s \subseteq \mathcal{F}_t$ if $s \le t$;

(iii) a family of integrable random variables $\{X_t : t \in T\}$ adapted to the filtration, that is, $X_t$ is $\mathcal{F}_t$-measurable for each $t$ in $T$.


The set $T$ has the interpretation of time, the sigma-field $\mathcal{F}_t$ has the interpretation of information available at time $t$, and $X_t$ denotes some random quantity whose value $X_t(\omega)$ is revealed at time $t$.

<1> Definition. A family of integrable random variables $\{X_t : t \in T\}$ adapted to a filtration $\{\mathcal{F}_t : t \in T\}$ is said to be a martingale (for that filtration) if

(MG)  $X_s = P(X_t \mid \mathcal{F}_s)$ almost surely, for all $s < t$.

Equivalently, the random variables should satisfy

(MG)′  $PX_sF = PX_tF$ for all $F \in \mathcal{F}_s$, all $s < t$.

REMARK. Often the filtration is fixed throughout an argument, or the particular choice of filtration is not important for some assertion about the random variables. In such cases it is easier to talk about a martingale $\{X_t : t \in T\}$ without explicit mention of the filtration. If in doubt, we could always work with the natural filtration, $\mathcal{F}_t := \sigma\{X_s : s \le t\}$, which takes care of adaptedness, by definition.

Analogously, if there is a need to identify the filtration explicitly, it is convenient to speak of a martingale $\{(X_t, \mathcal{F}_t) : t \in T\}$, and so on.

Property (MG) has the interpretation that $X_s$ is the best predictor for $X_t$ based on the information available at time $s$. The equivalent formulation (MG)′ is a minor repackaging of the definition of the conditional expectation $P(X_t \mid \mathcal{F}_s)$. The $\mathcal{F}_s$-measurability of $X_s$ comes as part of the adaptation assumption. Approximation by simple functions, and a passage to the limit, gives another equivalence,

(MG)″  $PX_sZ = PX_tZ$ for all $Z \in M_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$,

where $M_{\mathrm{bdd}}(\mathcal{F}_s)$ denotes the set of all bounded, $\mathcal{F}_s$-measurable random variables. The formulations (MG)′ and (MG)″ have the advantage of removing the slippery concept of conditioning on sigma-fields from the definition of a martingale. One could develop much of the basic theory without explicit mention of conditioning, which would have some pedagogic advantages, even though it would obscure one of the important ideas behind the martingale concept.

Several of the desirable properties of martingales are shared by families of random variables for which the defining equalities (MG) and (MG)′ are relaxed to inequalities. I find that one of the hardest things to remember about these martingale relatives is which name goes with which direction of the inequality.

<2> Definition. A family of integrable random variables $\{X_t : t \in T\}$ adapted to a filtration $\{\mathcal{F}_t : t \in T\}$ is said to be a submartingale (for that filtration) if it satisfies any (and hence all) of the following equivalent conditions:

(subMG)  $X_s \le P(X_t \mid \mathcal{F}_s)$ almost surely, for all $s < t$;

(subMG)′  $PX_sF \le PX_tF$ for all $F \in \mathcal{F}_s$, all $s < t$;

(subMG)″  $PX_sZ \le PX_tZ$ for all $Z \in M^+_{\mathrm{bdd}}(\mathcal{F}_s)$, all $s < t$.

The family is said to be a supermartingale (for that filtration) if $\{-X_t : t \in T\}$ is a submartingale. That is, the analogous requirements (superMG), (superMG)′, and (superMG)″ reverse the direction of the inequalities.


REMARK. It is largely a matter of taste, or convenience of notation for particular applications, whether one works primarily with submartingales or supermartingales.

For most of this Chapter, the index set $T$ will be discrete, either finite or equal to $\mathbb{N}$, the set of positive integers, or equal to one of

$$\mathbb{N}_0 := \{0\} \cup \mathbb{N} \qquad\text{or}\qquad \overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\} \qquad\text{or}\qquad \overline{\mathbb{N}}_0 := \{0\} \cup \mathbb{N} \cup \{\infty\}.$$

For some purposes it will be useful to have a distinctively labelled first or last element in the index set. For example, if a limit $X_\infty := \lim_{n \in \mathbb{N}} X_n$ can be shown to exist, it is natural to ask whether $\{X_n : n \in \overline{\mathbb{N}}\}$ also has sub- or supermartingale properties. Of course such a question only makes sense if a corresponding sigma-field $\mathcal{F}_\infty$ exists. If it is not otherwise defined, I will take $\mathcal{F}_\infty$ to be the sigma-field $\sigma\left(\cup_{t<\infty}\mathcal{F}_t\right)$.

Continuous time theory, where $T$ is a subinterval of $\mathbb{R}$, tends to be more complicated than discrete time. The difficulties arise, in part, from problems related to management of uncountable families of negligible sets associated with uncountable collections of almost sure equality or inequality assertions. A nontrivial part of the continuous time theory deals with sample path properties, that is, with the behavior of a process $X_t(\omega)$ as a function of $t$ for fixed $\omega$, or with properties of $X$ as a function of two variables. Such properties are typically derived from probabilistic assertions about finite or countable subfamilies of the $\{X_t\}$ random variables. An understanding of the discrete-time theory is an essential prerequisite for more ambitious undertakings in continuous time; see Appendix E.

For discrete time, the (MG)′ property becomes

$$PX_nF = PX_mF \qquad\text{for all } F \in \mathcal{F}_n, \text{ all } n < m.$$

It suffices to check the equality for $m = n + 1$, with $n \in \mathbb{N}_0$, for then repeated appeals to the special case extend the equality to $m = n + 2$, then $m = n + 3$, and so on. A similar simplification applies to submartingales and supermartingales.

<3> Example. Martingales generalize the theory for sums of independent random variables. Let $\xi_1, \xi_2, \ldots$ be independent, integrable random variables with $P\xi_n = 0$ for $n \ge 1$. Define $X_0 := 0$ and $X_n := \xi_1 + \cdots + \xi_n$. The sequence $\{X_n : n \in \mathbb{N}_0\}$ is a martingale with respect to the natural filtration, because for $F \in \mathcal{F}_{n-1}$,

$$P(X_n - X_{n-1})F = (P\xi_n)(PF) = 0 \qquad\text{by independence.}$$

You could write $F$ as a measurable function of $X_1, \ldots, X_{n-1}$, or of $\xi_1, \ldots, \xi_{n-1}$, if you prefer to work with random variables. □
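The defining property (MG)′ can be verified exactly on a small finite probability space. The sketch below is my own toy example, not from the text: four fair $\pm 1$ coin flips, with $X_n$ the partial sums. It checks $PX_nF = PX_{n+1}F$ for every cylinder event $F$ in $\mathcal{F}_n$, by brute-force summation over sample paths.

```python
from itertools import product

# Exact finite-space check of (MG)': for independent mean-zero increments
# (fair +/-1 flips) and partial sums X_k, the equality P(X_n 1_F) = P(X_{n+1} 1_F)
# holds for every event F determined by the first n flips.

n = 4
paths = list(product((-1, 1), repeat=n))   # each path has probability 2^-n
prob = 1.0 / len(paths)
X = lambda path, k: sum(path[:k])          # X_k along a sample path

# F ranges over the cylinder events {path[:m] == prefix}, which generate F_m.
for m in range(n):
    for prefix in product((-1, 1), repeat=m):
        F = [p for p in paths if p[:m] == prefix]
        lhs = sum(X(p, m) * prob for p in F)
        rhs = sum(X(p, m + 1) * prob for p in F)
        assert abs(lhs - rhs) < 1e-12
print("(MG)' holds for every cylinder event")
```

Since the cylinder events generate the natural filtration, a generating class argument extends the displayed equalities to all of $\mathcal{F}_n$.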

<4> Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale and let $\Psi$ be a convex function for which each $\Psi(X_n)$ is integrable. Then $\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale: the required almost sure inequality, $P(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi(X_{n-1})$, is a direct application of the conditional expectation form of Jensen's inequality.

The companion result for submartingales is: if the convex function $\Psi$ is increasing, if $\{X_n\}$ is a submartingale, and if each $\Psi(X_n)$ is integrable, then $\{\Psi(X_n) : n \in \mathbb{N}_0\}$ is a submartingale, because

$$P(\Psi(X_n) \mid \mathcal{F}_{n-1}) \ge \Psi\big(P(X_n \mid \mathcal{F}_{n-1})\big) \ge \Psi(X_{n-1}).$$


Two good examples to remember: if $\{X_n\}$ is a martingale and each $X_n$ is square integrable then $\{X_n^2\}$ is a submartingale; and if $\{X_n\}$ is a submartingale then $\{X_n^+\}$ is also a submartingale. □

<5> Example. Let $\{X_n : n \in \mathbb{N}_0\}$ be a martingale written as a sum of increments, $X_n := X_0 + \xi_1 + \cdots + \xi_n$. Not surprisingly, the $\{\xi_i\}$ are called martingale differences. Each $\xi_n$ is integrable, with $P(\xi_n \mid \mathcal{F}_{n-1}) = 0$ almost surely, for $n \in \mathbb{N}$.

A new martingale can be built by weighting the increments using predictable functions $\{H_n : n \in \mathbb{N}\}$, meaning that each $H_n$ should be an $\mathcal{F}_{n-1}$-measurable random variable, a more stringent requirement than adaptedness. The value of the weight becomes known before time $n$; it is known before it gets applied to the next increment.

If we assume that each $H_n\xi_n$ is integrable then the sequence

$$Y_n := X_0 + H_1\xi_1 + \cdots + H_n\xi_n$$

is both integrable and adapted. It is a martingale, because

$$P(Y_n - Y_{n-1})F = P\big(\xi_n \cdot H_nF\big) \qquad\text{for } F \in \mathcal{F}_{n-1},$$

which equals zero by a simple generalization of (MG)″, with $Z := H_nF$. (Use Dominated Convergence to accommodate integrable $Z$.) If $\{X_n : n \in \mathbb{N}_0\}$ is just a submartingale, a similar argument shows that the new sequence is also a submartingale, provided the predictable weights are also nonnegative. □
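The same brute-force device verifies the claim about predictable weighting. In the toy construction below (again my own, not from the text), the weight $H_n$ depends only on the first $n-1$ of four fair coin flips, and the weighted increments $Y_n - Y_{n-1} = H_n\xi_n$ are checked to integrate to zero over every cylinder event in $\mathcal{F}_n$.

```python
from itertools import product

# Martingale transform on a finite space: weight the increments of a fair
# +/-1 coin-flip martingale by a predictable rule H_n (H_n uses only the
# flips strictly before time n); the weighted sums remain a martingale.

n = 4
paths = list(product((-1, 1), repeat=n))
prob = 1.0 / len(paths)

def H(path, k):
    """Predictable weight: a function of the flips before time k only."""
    return 1.0 + sum(path[:k - 1])**2

def Y(path, k):
    """Y_k = sum over i <= k of H_i * xi_i (taking X_0 = 0)."""
    return sum(H(path, i) * path[i - 1] for i in range(1, k + 1))

for m in range(n):
    for prefix in product((-1, 1), repeat=m):
        F = [p for p in paths if p[:m] == prefix]
        # increment Y_{m+1} - Y_m = H_{m+1} * xi_{m+1} integrates to zero on F
        assert abs(sum((Y(p, m + 1) - Y(p, m)) * prob for p in F)) < 1e-12
print("weighted sums remain a martingale")
```

On each cylinder event the weight $H_{m+1}$ is constant, so the check reduces to the zero conditional expectation of $\xi_{m+1}$, which is exactly the argument in the Example.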

<6> Example. Suppose $X$ is an integrable random variable and $\{\mathcal{F}_t : t \in T\}$ is a filtration. Define $X_t := P(X \mid \mathcal{F}_t)$. Then the family $\{X_t : t \in T\}$ is a martingale with respect to the filtration, because for $s < t$,

$$P(X_tF) = P(XF) \quad\text{if } F \in \mathcal{F}_t, \qquad\qquad P(X_sF) = P(XF) \quad\text{if } F \in \mathcal{F}_s,$$

and $\mathcal{F}_s \subseteq \mathcal{F}_t$. (We have just reproved the formula for conditioning on nested sigma-fields.) □

<7> Example. Every sequence {X_n : n ∈ N_0} of integrable random variables adapted to a filtration {F_n : n ∈ N_0} can be broken into a sum of a martingale plus a sequence of accumulated conditional expectations. To establish this fact, consider the increments ξ_n := X_n − X_{n−1}. Each ξ_n is integrable, but it need not have zero conditional expectation given F_{n−1}, the property that characterizes martingale differences. Extraction of the martingale component is merely a matter of recentering the increments to zero conditional expectations. Define η_n := P(ξ_n | F_{n−1}) and

M_n := X_0 + (ξ_1 − η_1) + … + (ξ_n − η_n),
A_n := η_1 + … + η_n.

Then X_n = M_n + A_n, with {M_n} a martingale and {A_n} a predictable sequence.

Often {A_n} will have some nice behavior, perhaps due to the smoothing involved in the taking of a conditional expectation, or perhaps due to some other special property of the {X_n}. For example, if {X_n} were a submartingale the η_i would all be nonnegative (almost surely) and {A_n} would be an increasing sequence of random variables. Such properties are useful for establishing limit theory and inequalities; see Example <18> for an illustration of the general method. □
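To make the decomposition concrete (a sketch of mine, with made-up numbers): for a random walk whose increments have known conditional mean μ, the recentering gives η_n = μ, so A_n = nμ and M_n = X_n − nμ.

```python
# Doob decomposition of a drifting walk: X_n = xi_1 + ... + xi_n with
# P(xi_n | F_{n-1}) = mu, so eta_n = mu, A_n = n * mu (predictable,
# increasing for mu >= 0) and M_n = X_n - n * mu (the martingale part).
mu = 0.25
xis = [1, 1, -1, 1, -1, -1, 1, 1]        # one fixed sample path
X = [0.0]
for xi in xis:
    X.append(X[-1] + xi)

A = [n * mu for n in range(len(X))]       # accumulated conditional means
M = [x - a for x, a in zip(X, A)]         # recentered increments

assert all(abs(X[n] - (M[n] + A[n])) < 1e-12 for n in range(len(X)))
print(M[-1], A[-1])
```

Here {A_n} is deterministic, hence trivially predictable, and increasing because μ ≥ 0, mirroring the submartingale case described above.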

Page 157: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

142 Chapter 6: Martingale et al.

REMARK. The representation of a submartingale as a martingale plus anincreasing, predictable process is sometimes called the Doob decomposition. Thecorresponding representation for continuous time, which is exceedingly difficult toestablish, is called the Doob-Meyer decomposition.

2. Stopping times

The martingale property requires equalities P(X_s F) = P(X_t F), for s < t and F ∈ F_s. Much of the power of the theory comes from the fact that analogous inequalities hold when s and t are replaced by certain types of random times. To make sense of the broader assertion, we need to define objects such as F_τ and X_τ for random times τ.

<8> Definition. A random variable τ taking values in T̄ := T ∪ {∞} is called a stopping time for a filtration {F_t : t ∈ T} if {τ ≤ t} ∈ F_t for each t in T.

In discrete time, with T = N_0, the defining property is equivalent to

{τ = n} ∈ F_n for each n in N_0,

because {τ ≤ n} = ∪_{i≤n}{τ = i} and {τ = n} = {τ ≤ n}{τ ≤ n − 1}^c.

<9> Example. Let {X_n : n ∈ N_0} be adapted to a filtration {F_n : n ∈ N_0}, and let B be a Borel subset of R. Define τ(ω) := inf{n : X_n(ω) ∈ B}, with the interpretation that the infimum of the empty set equals +∞. That is, τ(ω) = +∞ if X_n(ω) ∉ B for all n. The extended-real-valued random variable τ is a stopping time because

{τ ≤ n} = ∪_{i≤n}{X_i ∈ B} ∈ F_n for n ∈ N_0.

It is called the first hitting time of the set B. Do you see why it is convenient to allow stopping times to take the value +∞? □
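In discrete time the first hitting time is easy to compute path by path; the sketch below (mine, not from the text) also checks the defining property that the event {τ ≤ n} is determined by X_0, …, X_n alone.

```python
import math

# First hitting time of B = [2, infinity), with inf of the empty set
# interpreted as +infinity, exactly as in Example <9>.
def first_hitting_time(xs, in_B):
    for n, x in enumerate(xs):
        if in_B(x):
            return n
    return math.inf

path = [0, 1, 0, 1, 2, 3, 1]
tau = first_hitting_time(path, lambda x: x >= 2)
print(tau)  # 4: the first index with X_n >= 2

# Stopping-time property: {tau <= n} is decided by X_0, ..., X_n alone.
for n in range(len(path)):
    assert (tau <= n) == any(x >= 2 for x in path[:n + 1])
```

A path that never enters B returns +∞, which is why it is convenient to allow stopping times with that value.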

If F_i corresponds to the information available up to time i, how should we define a sigma-field F_τ to correspond to information available up to a random time τ? Intuitively, on the part of Ω where τ = i the sets in the sigma-field F_τ should be the same as the sets in the sigma-field F_i. That is, we could hope that

{F{τ = i} : F ∈ F_τ} = {F{τ = i} : F ∈ F_i} for each i.

These equalities would be suitable as a definition of F_τ in discrete time; we could define F_τ to consist of all those F in F for which

<10> F{τ = i} ∈ F_i for all i ∈ N_0.

For continuous time such a definition could become vacuous if all the sets {τ = t} were negligible, as sometimes happens. Instead, it is better to work with a definition that makes sense in both discrete and continuous time, and which is equivalent to <10> in discrete time.

<11> Definition. Let τ be a stopping time for a filtration {F_t : t ∈ T}, taking values in T̄ := T ∪ {∞}. If the sigma-field F_∞ is not already defined, take it to be σ(∪_{t∈T} F_t). The pre-τ sigma-field F_τ is defined to consist of all F for which F{τ ≤ t} ∈ F_t for all t ∈ T.


The class F_τ would not be a sigma-field if τ were not a stopping time: the property Ω ∈ F_τ requires {τ ≤ t} ∈ F_t for all t.

REMARK. Notice that F_τ ⊆ F_∞ (because F{τ ≤ ∞} ∈ F_∞ if F ∈ F_τ), with equality when τ ≡ ∞. More generally, if τ takes a constant value, t, then F_τ = F_t. It would be very awkward if we had to distinguish between random variables taking constant values and the constants themselves.

<12> Example. The stopping time τ is measurable with respect to F_τ, because, for each a ∈ R⁺ and t ∈ T,

{τ ≤ a}{τ ≤ t} = {τ ≤ a ∧ t} ∈ F_{a∧t} ⊆ F_t.

That is, {τ ≤ a} ∈ F_τ for all a ∈ R⁺, from which the F_τ-measurability follows by the usual generating class argument. It would be counterintuitive if the information corresponding to the sigma-field F_τ did not include the value taken by τ itself. □

<13> Example. Suppose σ and τ are both stopping times, for which σ ≤ τ always. Then F_σ ⊆ F_τ because

F{τ ≤ t} = (F{σ ≤ t}){τ ≤ t} for all t ∈ T,

and both sets on the right-hand side are F_t-measurable if F ∈ F_σ. □

<14> Exercise. Show that a random variable Z is F_τ-measurable if and only if Z{τ ≤ t} is F_t-measurable for all t in T.

SOLUTION: For necessity, write Z as a pointwise limit of F_τ-measurable simple functions Z_n, then note that each Z_n{τ ≤ t} is a linear combination of indicator functions of F_t-measurable sets.

For sufficiency, it is enough to show that {Z > a} ∈ F_τ and {Z < −a} ∈ F_τ, for each a ∈ R⁺. For the first requirement, note that {Z > a}{τ ≤ t} = {Z{τ ≤ t} > a}, which belongs to F_t for each t, because Z{τ ≤ t} is assumed to be F_t-measurable. Thus {Z > a} ∈ F_τ. Argue similarly for the other requirement. □

The definition of X_τ is almost straightforward. Given random variables {X_t : t ∈ T} and a stopping time τ, we should define X_τ as the function taking the value X_t(ω) when τ(ω) = t. If τ takes only values in T there is no problem. However, a slight embarrassment would occur when τ(ω) = ∞ if ∞ were not a point of T, for then X_∞(ω) need not be defined. In the happy situation when there is a natural candidate for X_∞, the embarrassment disappears with little fuss; otherwise it is wiser to avoid the difficulty altogether by working only with the random variable X_τ{τ < ∞}, which takes the value zero when τ is infinite.

Measurability of X_τ{τ < ∞}, even with respect to the sigma-field F, requires further assumptions about the {X_t} for continuous time. For discrete time the task is much easier. For example, if {X_n : n ∈ N_0} is adapted to a filtration {F_n : n ∈ N_0}, and τ is a stopping time for that filtration, then

X_τ{τ < ∞}{τ ≤ t} = Σ_{i∈N_0} X_i{τ = i}{τ ≤ t}.

For i > t the ith summand is zero; for i ≤ t it equals X_i{τ = i}, which is F_t-measurable. The F_τ-measurability of X_τ{τ < ∞} then follows by Exercise <14>.


The next Exercise illustrates the use of stopping times and the σ-fields they define. The discussion does not directly involve martingales, but they are lurking in the background.

<15> Exercise. A deck of 52 cards (26 reds, 26 blacks) is dealt out one card at a time, face up. Once, and only once, you will be allowed to predict that the next card will be red. What strategy will maximize your probability of predicting correctly?

SOLUTION: Write R_i for the event {ith card is red}. Assume all permutations of the deck are equally likely initially. Write F_n for the σ-field generated by R_1, …, R_n. A strategy corresponds to a stopping time τ that takes values in {0, 1, …, 51}: you should try to maximize P R_{τ+1}.

Surprisingly, P R_{τ+1} = 1/2 for all such stopping rules. The intuitive explanation is that you should always be indifferent, given that you have observed cards 1, 2, …, τ, between choosing card τ + 1 or choosing card 52. That is, it should be true that P(R_{τ+1} | F_τ) = P(R_52 | F_τ) almost surely; or, equivalently, that P(R_{τ+1}F) = P(R_52 F) for all F ∈ F_τ; or, equivalently, that

P(R_{k+1}F{τ = k}) = P(R_52 F{τ = k}) for all F ∈ F_τ and k = 0, 1, …, 51.

We could then deduce that P R_{τ+1} = P R_52 = 1/2. Of course, we only need the case F = Ω, but I'll carry along the general F as an illustration of technique while proving the assertion in the last display. By definition of F_τ,

F{τ = k} = (F{τ ≤ k}){τ ≤ k − 1}^c ∈ F_k.

That is, F{τ = k} must be of the form {(R_1, …, R_k) ∈ B} for some Borel subset B of R^k. Symmetry of the joint distribution of R_1, …, R_52 implies that the random vector (R_1, …, R_k, R_{k+1}) has the same distribution as the random vector (R_1, …, R_k, R_52), whence

P R_{k+1}{(R_1, …, R_k) ∈ B} = P R_52{(R_1, …, R_k) ∈ B}.

See Section 8 for more about symmetry and martingale properties. □
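The conclusion is easy to check exactly on a toy deck (my sketch; the particular stopping rule is an arbitrary choice, not from the text): with 3 reds and 3 blacks, any stopping rule predicts correctly with probability exactly 1/2.

```python
from itertools import permutations

# Enumerate all orderings of a 3-red, 3-black deck.  The rule "stop as soon
# as more blacks than reds have been seen (never later than 5 cards seen)"
# is a stopping time: it looks only at cards already dealt.  Having stopped
# at tau, predict card tau+1 (0-based index tau).
deck = 'RRRBBB'

def tau(cards):
    reds = blacks = 0
    for n in range(5):
        if blacks > reds:
            return n          # predict card n+1 (index n)
        if cards[n] == 'R':
            reds += 1
        else:
            blacks += 1
    return 5                  # forced to predict the last card

perms = list(permutations(deck))
wins = sum(1 for p in perms if p[tau(p)] == 'R')
print(wins, len(perms), wins / len(perms))  # 360 720 0.5
```

Swapping in any other stopping rule leaves the answer at exactly 1/2, which is the content of the Exercise.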

The hidden martingale in the previous Exercise is X_n, the proportion of red cards remaining in the deck after n cards have been dealt. You could check the martingale property by first verifying that P(R_{n+1} | F_n) = X_n (an equality that is obvious if one thinks in terms of conditional distributions), then calculating

(52 − n − 1) P(X_{n+1} | F_n) = P((52 − n)X_n − R_{n+1} | F_n) = (52 − n)X_n − P(R_{n+1} | F_n) = (52 − n − 1)X_n.

The problem then asks for the stopping time τ to maximize

P R_{τ+1} = Σ_i P(R_{i+1}{τ = i}) = Σ_i P(X_i{τ = i}) because {τ = i} ∈ F_i
          = P X_τ.

The martingale property tells us that P X_0 = P X_i for i = 1, …, 51. If we could extend the equality to random i, by showing that P X_τ = P X_0, then the surprising conclusion from the Exercise would follow.

Clearly it would be useful if we could always assert that P X_σ = P X_τ for every martingale, and every pair of stopping times. Unfortunately (or should I say fortunately?) the result is not true without some extra assumptions. The simplest and most useful case concerns finite time sets. If σ takes values in a finite set T, and if each X_t is integrable, then |X_σ| ≤ Σ_{t∈T} |X_t|, which eliminates any integrability difficulties. For an infinite index set, the integrability of X_σ is not automatic.

<16> Stopping Time Lemma. Suppose σ and τ are stopping times for a filtration {F_t : t ∈ T}, with T finite. Suppose both stopping times take only values in T. Let F be a set in F_σ for which σ(ω) ≤ τ(ω) when ω ∈ F. If {X_t : t ∈ T} is a submartingale, then P(X_σ F) ≤ P(X_τ F). For supermartingales, the inequality is reversed. For martingales, the inequality becomes an equality.

Proof. Consider only the submartingale case. For simplicity of notation, suppose T = {0, 1, …, N}. Write each X_n as a sum of increments, X_n = X_0 + ξ_1 + … + ξ_n. The inequality σ ≤ τ, on F, lets us write

X_τ F − X_σ F = (X_0 F + Σ_{1≤i≤N} {i ≤ τ}Fξ_i) − (X_0 F + Σ_{1≤i≤N} {i ≤ σ}Fξ_i) = Σ_{1≤i≤N} {σ < i ≤ τ}Fξ_i.

Note that {σ < i ≤ τ}F = ({σ ≤ i − 1}F){τ ≤ i − 1}^c ∈ F_{i−1}. The expected value of each summand is nonnegative, by (subMG)′. □

REMARK. If σ ≤ τ everywhere, the inequality for all F in F_σ implies that X_σ ≤ P(X_τ | F_σ) almost surely. That is, the submartingale (or martingale, or supermartingale) property is preserved at bounded stopping times.

The Stopping Time Lemma, and its extensions to various cases with infinite index sets, is basic to many of the most elegant martingale properties. Results for a general stopping time τ, taking values in N̄_0 := N_0 ∪ {∞}, can often be deduced from results for τ ∧ N, followed by a passage to the limit as N tends to infinity. (The random variable τ ∧ N is a stopping time, because {τ ∧ N ≤ n} equals the whole of Ω when N ≤ n, and equals {τ ≤ n} when N > n.) As Problem [1] shows, the finiteness assumption on the index set T is not just a notational convenience; Lemma <16> can fail for infinite T.

It is amazing how many of the classical inequalities of probability theory canbe derived by invoking the Lemma for a suitable martingale (or submartingale orsupermartingale).

<17> Exercise. Let ξ_1, …, ξ_N be independent random variables (or even just martingale increments) for which Pξ_i = 0 and Pξ_i² < ∞ for each i. Define S_i := ξ_1 + … + ξ_i. Prove the Kolmogorov maximal inequality: for each ε > 0,

P{ max_{1≤i≤N} |S_i| ≥ ε } ≤ P S_N²/ε².

SOLUTION: The random variables X_i := S_i² form a submartingale, for the natural filtration. Define stopping times τ ≡ N and σ := first i such that |S_i| ≥ ε, with the convention that σ = N if |S_i| < ε for every i. Why is σ a stopping time? Check the pointwise bound,

ε²{max_i |S_i| ≥ ε} = ε²{X_σ ≥ ε²} ≤ X_σ.


What happens in the case when σ equals N because |S_i| < ε for every i? Take expectations, then invoke the Stopping Time Lemma (with F = Ω) for the submartingale {X_i}, to deduce

ε² P{max_i |S_i| ≥ ε} ≤ P X_σ ≤ P X_τ = P S_N²,

as asserted. □
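Exercise <17> can be verified exactly for a fair ±1 walk by enumerating every path (a sketch of mine, with an arbitrary choice of N and ε):

```python
from itertools import product

# Exact check of Kolmogorov's inequality for a fair +/-1 walk:
# P{ max_i |S_i| >= eps } <= P S_N^2 / eps^2, and here P S_N^2 = N.
N, eps = 8, 3

def running_max_abs(path):
    s, m = 0, 0
    for xi in path:
        s += xi
        m = max(m, abs(s))
    return m

paths = list(product([-1, 1], repeat=N))
lhs = sum(1 for p in paths if running_max_abs(p) >= eps) / len(paths)
rhs = N / eps ** 2
print(lhs, rhs)
assert lhs <= rhs
```

The left side involves the whole running maximum, yet it is controlled by the second moment of the final sum alone, which is the point of the inequality.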

Notice how the Kolmogorov inequality improves upon the elementary bound P{|S_N| ≥ ε} ≤ P S_N²/ε². Actually it is the same inequality, applied to S_σ instead of S_N, supplemented by a useful bound for P S_σ² made possible by the submartingale property. Kolmogorov (1928) established his inequality as the first step towards a proof of various convergence results for sums of independent random variables. More versatile maximal inequalities follow from more involved appeals to the Stopping Time Lemma. For example, a strong law of large numbers can be proved quite efficiently (Bauer 1981, Section 6.3) by an appeal to the next inequality.

<18> Exercise. Let 0 = S_0, S_1, …, S_N be a martingale with v_i := P(S_i − S_{i−1})² < ∞ for each i. Let γ_1 ≥ γ_2 ≥ … ≥ γ_N be nonnegative constants. Prove the Hájek-Rényi inequality:

P{ max_{1≤i≤N} γ_i|S_i| ≥ 1 } ≤ Σ_{i≤N} γ_i² v_i.

SOLUTION: Define F_i := σ(S_1, …, S_i). Write P_i for P(· | F_i). Define η_i := γ_i²S_i² − γ_{i−1}²S_{i−1}², and A_i := P_{i−1}η_i. By the Doob decomposition from Example <7>, the sequence M_k := Σ_{i=1}^k (η_i − A_i) is a martingale with respect to the filtration {F_i}; and γ_k²S_k² = (A_1 + … + A_k) + M_k. Define stopping times σ ≡ 0 and

τ := first i such that γ_i|S_i| ≥ 1, with τ = N if γ_i|S_i| < 1 for all i.

The main idea is to bound each partial sum A_1 + … + A_k by a single random variable A, whose expectation will become the right-hand side of the asserted inequality.

Construct A from the martingale differences ξ_i := S_i − S_{i−1} for i = 1, …, N. For each i, use the fact that S_{i−1} is F_{i−1}-measurable to bound the contribution of A_i:

A_i = P_{i−1}(γ_i²(S_{i−1} + ξ_i)² − γ_{i−1}²S_{i−1}²)
    = γ_i² P_{i−1}ξ_i² + 2γ_i² S_{i−1} P_{i−1}ξ_i + (γ_i² − γ_{i−1}²) S_{i−1}².

The middle term on the last line vanishes, by the martingale difference property, and the last term is nonpositive, because γ_i² ≤ γ_{i−1}². The sum of the three terms is less than the nonnegative quantity γ_i² P(ξ_i² | F_{i−1}), and

A := Σ_{i≤N} γ_i² P_{i−1}ξ_i² ≥ Σ_{i≤k} A_i,

for each k, as required.


The asserted inequality now follows via the Stopping Time Lemma:

P{ max_{1≤i≤N} γ_i|S_i| ≥ 1 } ≤ P(γ_τ²S_τ²) = P(A_1 + … + A_τ) + P M_τ ≤ P A,

because A_1 + … + A_τ ≤ A and P M_τ = P M_σ = 0. □

The method of proof in Example <18> is worth remembering; it can be usedto derive several other bounds.

3. Convergence of positive supermartingales

In several respects the theory for positive (meaning nonnegative) supermartingales {X_n : n ∈ N_0} is particularly elegant. For example (Problem [5]), the Stopping Time Lemma extends naturally to pairs of unbounded stopping times for positive supermartingales. Even more pleasantly surprising, positive supermartingales converge almost surely, to an integrable limit, as will be shown in this Section.

The key result for the proof of convergence is an elegant lemma (Dubins'sInequality) that shows why a positive supermartingale {Xn} cannot oscillate betweentwo levels infinitely often.

For fixed constants α and β with 0 ≤ α < β < ∞ define increasing sequences of random times at which the process might drop below α or rise above β:

σ_1 := inf{i ≥ 0 : X_i ≤ α},  τ_1 := inf{i ≥ σ_1 : X_i ≥ β},
σ_2 := inf{i ≥ τ_1 : X_i ≤ α},  τ_2 := inf{i ≥ σ_2 : X_i ≥ β},

and so on, with the convention that the infimum of an empty set is taken as +∞.

[Figure: a sample path oscillating between the levels α and β, with the times σ_1 < τ_1 < σ_2 < τ_2 marked along the time axis starting from 0.]

Because the {X_i} are adapted to {F_i}, each σ_i and τ_i is a stopping time for the filtration. For example,

{τ_1 ≤ k} = {X_i ≤ α, X_j ≥ β for some i ≤ j ≤ k},

which could be written out explicitly as a finite union of events involving only X_0, …, X_k.

When τ_k is finite, the segment {X_i : σ_k ≤ i ≤ τ_k} is called the kth upcrossing of the interval [α, β] by the process {X_n : n ∈ N_0}. The event {τ_k ≤ N} may be described, slightly informally, by saying that the process completes at least k upcrossings of [α, β] up to time N.


<20> Dubins's inequality. For a positive supermartingale {(X_n, F_n) : n ∈ N_0} and constants 0 ≤ α < β < ∞, and stopping times as defined above,

P{τ_k < ∞} ≤ (α/β)^k for k ∈ N.

Proof. Choose, and temporarily hold fixed, a finite positive integer N. Define τ_0 to be identically zero. For k ≥ 1, using the fact that X_{τ_k} ≥ β when τ_k < ∞ and X_{σ_k} ≤ α when σ_k < ∞, we have

P(β{τ_k ≤ N} + X_N{τ_k > N}) ≤ P X_{τ_k∧N}
   ≤ P X_{σ_k∧N}   (Stopping Time Lemma)
   ≤ P(α{σ_k ≤ N} + X_N{σ_k > N}),

which rearranges to give

β P{τ_k ≤ N} ≤ α P{σ_k ≤ N} + P X_N ({σ_k > N} − {τ_k > N})
   ≤ α P{τ_{k−1} ≤ N}   because τ_{k−1} ≤ σ_k ≤ τ_k and X_N ≥ 0.

That is, P{τ_k ≤ N} ≤ (α/β) P{τ_{k−1} ≤ N} for k ≥ 1. Repeated appeals to this inequality, followed by a passage to the limit as N → ∞, lead to Dubins's Inequality. □

REMARK. When 0 = α < β we have P{τ_1 < ∞} = 0. By considering a sequence of β values decreasing to zero, we deduce that on the set {σ_1 < ∞} we must have X_n = 0 for all n ≥ σ_1. That is, if a positive supermartingale hits zero then it must stay there forever.

Notice that the main part of the argument, before N was sent off to infinity, involved only the variables X_0, …, X_N. The result may fruitfully be reexpressed as an assertion about positive supermartingales with a finite index set.

<21> Corollary. Let {(X_n, F_n) : n = 0, 1, …, N} be a positive supermartingale with a finite index set. For each pair of constants 0 ≤ α < β < ∞, the probability that the process completes at least k upcrossings of [α, β] is less than (α/β)^k.
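The upcrossing times σ_k, τ_k are simple to compute along any finite path; the following sketch (mine, not from the text) counts completed upcrossings of [α, β] exactly as in the definitions above.

```python
# Count completed upcrossings of [alpha, beta]: alternately search for
# sigma_k (first drop to <= alpha) and tau_k (subsequent rise to >= beta),
# with the infimum of an empty set treated as +infinity (search fails).
def upcrossings(xs, alpha, beta):
    count, i, n = 0, 0, len(xs)
    while True:
        while i < n and xs[i] > alpha:      # looking for sigma_{k+1}
            i += 1
        if i == n:
            return count
        while i < n and xs[i] < beta:       # looking for tau_{k+1}
            i += 1
        if i == n:
            return count
        count += 1

path = [5, 1, 4, 0.5, 3, 6, 2, 1, 7]
print(upcrossings(path, 1, 4))  # 3 completed upcrossings of [1, 4]
```

Dubins's inequality bounds the probability that this count reaches k for a positive supermartingale; a path that converges must have a finite count for every rational pair α < β, which is how the convergence theorem below uses it.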

<22> Theorem. Every positive supermartingale converges almost surely to a nonnegative, integrable limit.

Proof. To prove almost sure convergence (with possibly an infinite limit) of the sequence {X_n}, it is enough to show that the event

D := {ω : limsup X_n(ω) > liminf X_n(ω)}

is negligible. Decompose D into a countable union of events

D_{α,β} := {limsup X_n > β > α > liminf X_n},

with α, β ranging over all pairs of rational numbers. On D_{α,β} we must have τ_k < ∞ for every k. Thus P D_{α,β} ≤ (α/β)^k for every k, which forces P D_{α,β} = 0, and hence P D = 0.

The sequence X_n converges to X_∞ := liminf X_n on the set D^c. Fatou's lemma, and the fact that P X_n is nonincreasing, ensure that X_∞ is integrable. □

<23> Exercise. Suppose {ξ_i} are independent, identically distributed random variables with P{ξ_i = +1} = p and P{ξ_i = −1} = 1 − p. Define the partial sums S_0 = 0 and S_i = ξ_1 + … + ξ_i for i ≥ 1. For 1/2 ≤ p < 1, show that P{S_i = −1 for at least one i} = (1 − p)/p.

SOLUTION: Consider a fixed p with 1/2 < p < 1. Define θ := (1 − p)/p. Define τ := inf{i ∈ N : S_i = −1}. We are trying to show that P{τ < ∞} = θ. Observe that X_n := θ^{S_n} is a positive martingale with respect to the filtration F_n := σ(ξ_1, …, ξ_n): by independence and the equality P θ^{ξ_n} = 1,

P(X_n F) = P(θ^{ξ_n} θ^{S_{n−1}} F) = (P θ^{ξ_n}) P(X_{n−1} F) = P(X_{n−1} F) for F in F_{n−1}.

The sequence {X_{τ∧n}} is a positive martingale (Problem [3]). It follows that there exists an integrable X_∞ such that X_{τ∧n} → X_∞ almost surely. The sequence {S_n} cannot converge to a finite limit because |S_n − S_{n−1}| = 1 for all n. On the set where τ = ∞, convergence of θ^{S_n} to a finite limit is possible only if S_n → ∞ and θ^{S_n} → 0. Thus,

X_{τ∧n} → θ^{−1}{τ < ∞} + 0{τ = ∞} almost surely.

The bounds 0 ≤ X_{τ∧n} ≤ θ^{−1} allow us to invoke Dominated Convergence to deduce that 1 = P X_{τ∧n} → θ^{−1} P{τ < ∞}.

Monotonicity of P{τ < ∞} as a function of p extends the solution to p = 1/2. □
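A numerical check of the Exercise (my sketch, not from the text): dynamic programming over a long finite horizon approximates P{S_i = −1 for some i}; the truncation error is tiny because the walk drifts upward.

```python
# Check that P{ S_i = -1 for some i } = (1 - p)/p for a p-biased +/-1 walk
# with p > 1/2, via dynamic programming over 400 steps.
p = 2.0 / 3.0
dist = {0: 1.0}     # dist[s] = P{ S_n = s and the walk has not hit -1 }
hit = 0.0
for _ in range(400):
    new = {}
    for s, q in dist.items():
        for xi, w in ((1, p), (-1, 1.0 - p)):
            t = s + xi
            if t == -1:
                hit += q * w
            else:
                new[t] = new.get(t, 0.0) + q * w
    dist = new
print(hit)  # very close to (1 - p)/p = 0.5
```

With p = 2/3 the predicted value is θ = 1/2, and the probability of a first visit to −1 after step 400 is negligible, so the truncated sum is accurate to many digits.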

The almost sure limit X_∞ of a positive supermartingale {X_n} satisfies the inequality liminf P X_n ≥ P X_∞, by Fatou. The sequence {P X_n} is decreasing. Under what circumstances does it converge to P X_∞? Equality certainly holds if {X_n} converges to X_∞ in L¹ norm. In fact, convergence of expectations is equivalent to L¹ convergence, because

P|X_n − X_∞| = P(X_∞ − X_n)⁺ + P(X_∞ − X_n)⁻
             = 2 P(X_∞ − X_n)⁺ − (P X_∞ − P X_n).

On the right-hand side the first contribution tends to zero, by Dominated Convergence, because X_∞ ≥ (X_∞ − X_n)⁺ → 0 almost surely. (I just reproved Scheffé's lemma.)

<24> Corollary. A positive supermartingale {X_n} converges in L¹ to its limit X_∞ if and only if P X_n → P X_∞.

<25> Example. Female bunyips reproduce once every ten years, according to a fixed offspring distribution P on N_0. Different bunyips reproduce independently of each other. What is the behavior of the number Z_n of nth generation offspring from Lucy bunyip, the first of the line, as n gets large? (The process {Z_n : n ∈ N_0} is usually called a branching process.)

Write μ for the expected number of offspring for a single bunyip. If reproduction went strictly according to averages, the nth generation size would equal μ^n. Intuitively, if μ > 1 there could be an explosion of the bunyip population; if μ < 1 bunyips would be driven to extinction; if μ = 1, something else might happen. A martingale argument will lead rigorously to a similar conclusion.

Given Z_{n−1} = k, the size of the nth generation is a sum of k independent random variables, each with distribution P. Perhaps we could write Z_n = Σ_{i=1}^{Z_{n−1}} ξ_{ni}, with the {ξ_{ni} : i = 1, …, Z_{n−1}} (conditionally) independently distributed like P. I have a few difficulties with that representation. For example, where is ξ_{n3} defined?


Just on {Z_{n−1} ≥ 3}? On all of Ω? Moreover, the notation invites the blunder of ignoring the randomness of the range of summation, leading to an absurd assertion that P Σ_{i=1}^{Z_{n−1}} ξ_{ni} equals Σ_{i=1}^{Z_{n−1}} P ξ_{ni} = Z_{n−1}μ. The corresponding assertion for an expectation conditional on Z_{n−1} is correct, but some of the doubts still linger.

It is much better to start with an entire family {ξ_{ni} : n ∈ N, i ∈ N} of independent random variables, each with distribution P, then define Z_0 := 1 and Z_n := Σ_{i∈N} ξ_{ni}{i ≤ Z_{n−1}} for n ≥ 1. The random variable Z_n is measurable with respect to the sigma-field F_n := σ{ξ_{ki} : k ≤ n, i ∈ N}, and, almost surely,

P(Z_n | F_{n−1}) = Σ_{i∈N} {i ≤ Z_{n−1}} P(ξ_{ni} | F_{n−1})   because Z_{n−1} is F_{n−1}-measurable
               = Σ_{i∈N} {i ≤ Z_{n−1}} P(ξ_{ni})   because ξ_{ni} is independent of F_{n−1}
               = μ Z_{n−1}.

If μ ≤ 1, the {Z_n} sequence is a positive supermartingale with respect to the {F_n} filtration. By Theorem <22>, there exists an integrable random variable Z_∞ with Z_n → Z_∞ almost surely.

A sequence of integers Z_n(ω) can converge to a finite limit k only if Z_n(ω) = k for all n large enough. If k > 0, the convergence would imply that, with nonzero probability, only finitely many of the independent events {Σ_{i≤k} ξ_{ni} ≠ k} can occur. By the converse to the Borel-Cantelli lemma, it would follow that Σ_{i≤k} ξ_{ni} = k almost surely, which can happen only if P{1} = 1. In that case, Z_n ≡ 1 for all n. If P{1} < 1, then Z_n must converge to zero, with probability one, if μ ≤ 1. The bunyips die out if the average number of offspring is less than or equal to 1.

If μ > 1 the situation is more complex. If P{0} = 0 the population cannot decrease and Z_n must then diverge to infinity with probability one. If P{0} > 0 the convex function g(t) := P^x t^x (the probability generating function of P) must have a unique value θ with 0 < θ < 1 for which g(θ) = θ: the strictly convex function h(t) := g(t) − t has h(0) = P{0} > 0 and h(1) = 0, and its left-hand derivative P^x(x t^{x−1} − 1) converges to μ − 1 > 0 as t increases to 1. The sequence {θ^{Z_n}} is a positive martingale:

P(θ^{Z_n} | F_{n−1}) = {Z_{n−1} = 0} + Σ_{k≥1} P(θ^{ξ_{n1}+…+ξ_{nk}}{Z_{n−1} = k} | F_{n−1})
                    = {Z_{n−1} = 0} + Σ_{k≥1} {Z_{n−1} = k} g(θ)^k
                    = g(θ)^{Z_{n−1}} = θ^{Z_{n−1}}   because g(θ) = θ.

The positive martingale {θ^{Z_n}} has an almost sure limit, W. The sequence {Z_n} must converge almost surely, with an infinite limit when W = 0. As with the situation when μ ≤ 1, the only other possible limit for Z_n is 0, corresponding to W = 1. Because 0 ≤ θ^{Z_n} ≤ 1 for every n, Dominated Convergence and the martingale property give P{W = 1} = lim_{n→∞} P θ^{Z_n} = P θ^{Z_0} = θ.


In summary: On the set D := {W = 1}, which has probability θ, the bunyip population eventually dies out; on D^c, the population explodes, that is, Z_n → ∞.

It is possible to say a lot more about the almost sure behavior of the process {Z_n} when μ > 1. For example, the sequence X_n := Z_n/μ^n is a positive martingale, which must converge to an integrable random variable X. On the set {X > 0}, the process {Z_n} grows geometrically fast, like μ^n. On the set D we must have X = 0, but it is not obvious whether or not we might have X = 0 for some realizations where the process does not die out.

There is a simple argument to show that, in fact, either X = 0 almost surely or X > 0 almost surely on D^c. With a little extra bookkeeping we could keep track of the first generation ancestor of each bunyip in the later generations. If we write Z_n^{(j)} for the number of members of the nth generation descended from the jth (possibly hypothetical) member of the first generation, then Z_n = Σ_{j∈N} Z_n^{(j)}{j ≤ Z_1}. The Z_n^{(j)}, for j = 1, 2, …, are independent random variables, each with the same distribution as Z_{n−1}, and each independent of Z_1. In particular, for each j, we have Z_n^{(j)}/μ^{n−1} → X^{(j)} almost surely, where the X^{(j)}, for j = 1, 2, …, are independent random variables, each distributed like X, and

μX = Σ_{j∈N} X^{(j)}{j ≤ Z_1} almost surely.

Write φ for P{X = 0}. Then, by independence,

φ = Σ_{k∈N_0} φ^k P{Z_1 = k} = g(φ).

We must have either φ = 1, meaning that X = 0 almost surely, or else φ = θ, in which case X > 0 almost surely on D^c. The latter must be the case if X is nondegenerate, that is, if P{X > 0} > 0, which happens if and only if P^x(x log(1 + x)) < ∞; see Problem [14]. □
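The extinction probability θ is easy to compute numerically as the smaller fixed point of the generating function g (my sketch, with a made-up offspring law, not from the text):

```python
# Extinction probability as the smaller fixed point of the generating
# function g.  Hypothetical offspring law: P{0} = 1/4, P{2} = 3/4, so
# mu = 3/2 > 1, g(t) = 1/4 + (3/4) t^2, and the fixed points in [0, 1]
# are theta = 1/3 and 1.  Iterating t -> g(t) from 0 climbs to theta.
def g(t):
    return 0.25 + 0.75 * t * t

theta = 0.0
for _ in range(200):
    theta = g(theta)
print(theta)  # converges to 1/3
```

The iteration t ↦ g(t) started from 0 increases to the smallest fixed point because g is increasing and convex on [0, 1], which matches the uniqueness argument in the Example.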

4. Convergence of submartingales

Theorem <22> can be extended to a large class of submartingales by means of thefollowing decomposition Theorem, whose proof appears in the next Section.

<26> Krickeberg decomposition. Let {S_n : n ∈ N_0} be a submartingale for which sup_n P S_n⁺ < ∞. Then there exists a positive martingale {M_n} and a positive supermartingale {X_n} such that S_n = M_n − X_n almost surely, for each n.

<27> Corollary. A submartingale with sup_n P S_n⁺ < ∞ converges almost surely to an integrable limit.

For a direct proof of this convergence result, via an upcrossing inequality forsupermartingales that are not necessarily nonnegative, see Problem [11].

REMARK. Finiteness of sup_n P S_n⁺ is equivalent to finiteness of sup_n P|S_n|, because |S_n| = 2S_n⁺ − (S_n⁺ − S_n⁻) and, by the submartingale property, P(S_n⁺ − S_n⁻) = P S_n increases with n.

<28> Example. (Section 68 of Lévy 1937.) Let {M_n : n ∈ N_0} be a martingale such that |M_n − M_{n−1}| ≤ 1 for all n, and M_0 = 0. In order that M_n(ω) converge to a finite limit, it is necessary that sup_n M_n(ω) be finite. In fact, it is also a sufficient condition. More precisely,

{ω : lim M_n(ω) exists as a finite limit} = {ω : sup_n M_n(ω) < ∞} almost surely.

To establish that the right-hand side is (almost surely) a subset of the left-hand side, for a fixed positive C define τ as the first n for which M_n > C, with τ = ∞ when sup_n M_n ≤ C. The martingale X_n := M_{τ∧n} is bounded above by the constant C + 1, because the increment (if any) that pushes M_n above C cannot be larger than 1. In particular, sup_n P X_n⁺ < ∞, which ensures that {X_n} converges almost surely to a finite limit. On the set {sup_n M_n ≤ C} we have M_n = X_n for all n, and hence M_n also converges almost surely to a finite limit on that set. Take a union over a sequence of C values increasing to ∞ to complete the argument.

REMARK. Convergence of M_n(ω) to a finite limit also implies that sup_n |M_n(ω)| < ∞. The result therefore contains the surprising assertion that, almost surely, finiteness of sup_n M_n(ω) implies finiteness of sup_n |M_n(ω)|.

As a special case, consider a sequence {A_n} of events adapted to a filtration {F_n}. The martingale M_n := Σ_{i=1}^n (A_i − P(A_i | F_{i−1})) has increments bounded in absolute value by 1. For almost all ω, finiteness of Σ_{n∈N}{ω ∈ A_n} implies sup_n M_n(ω) < ∞, and hence convergence of the sum of conditional probabilities. Argue similarly for the martingale {−M_n} to conclude that

{ω : Σ_{n=1}^∞ {ω ∈ A_n} < ∞} = {ω : Σ_{n=1}^∞ P(A_n | F_{n−1}) < ∞} almost surely,

a remarkable generalization of the Borel-Cantelli lemma for sequences of independent events. □

*5. Proof of the Krickeberg decomposition

It is easiest to understand the proof by reinterpreting the result as an assertion about measures. To each integrable random variable X on (Ω, F, P) there corresponds a signed measure μ defined on F by μF := P(XF) for F ∈ F. The measure can also be written as a difference of two nonnegative measures μ⁺ and μ⁻, defined by μ⁺F := P(X⁺F) and μ⁻F := P(X⁻F), for F ∈ F.

By equivalence (MG)′, a sequence of integrable random variables {X_n : n ∈ N_0} adapted to a filtration {F_n : n ∈ N_0} is a martingale if and only if the corresponding sequence of measures {μ_n} on F has the property

<29> μ_{n+1}|_{F_n} = μ_n|_{F_n} for each n,

where, in general, ν|_G denotes the restriction of a measure ν to a sub-sigma-field G. Similarly, the defining inequality (subMG)′ for a submartingale, μ_{n+1}F := P(X_{n+1}F) ≥ P(X_n F) =: μ_n F for all F ∈ F_n, is equivalent to

<30> μ_{n+1}|_{F_n} ≥ μ_n|_{F_n} for each n.


Now consider the submartingale {S_n : n ∈ N_0} from the statement of the Krickeberg decomposition. Define an increasing functional λ : M⁺(F) → [0, ∞] by

λf := limsup_n P(S_n⁺ f) for f ∈ M⁺(F).

Notice that λ1 = limsup_n P S_n⁺, which is finite, by assumption. The functional also has a property analogous to absolute continuity: if Pf = 0 then λf = 0.

Write λ_k for the restriction of λ to M⁺(F_k). For f in M⁺(F_k), the submartingale property for {S_n⁺} ensures that P(S_n⁺ f) increases with n for n ≥ k. Thus

<31> λ_k f := λf = lim_n P(S_n⁺ f) = sup_{n≥k} P(S_n⁺ f) if f ∈ M⁺(F_k).

The increasing functional λ_k is linear (because linearity is preserved by limits), and it inherits the Monotone Convergence property from P: for functions in M⁺(F_k) with 0 ≤ f_i ↑ f,

sup_i λ_k f_i = sup_i sup_{n≥k} P(S_n⁺ f_i) = sup_{n≥k} sup_i P(S_n⁺ f_i) = sup_{n≥k} P(S_n⁺ f) = λ_k f.

It defines a finite measure on F_k that is absolutely continuous with respect to P|_{F_k}. Write M_k for the corresponding density in M⁺(Ω, F_k).

The analog of <29> identifies [Mk] as a nonnegative martingale, becauseA,*+i| = X\Jk = Xk. Moreover, Mk > S? almost surely because

¥Mk{Mk < S+) = Xk{Mk < S+] > VSt{Mk < S+l

the last inequality following from <3i> with / := {Mk < S£). The random variablesXk := Mk - Sk are almost surely nonnegative. Also, for F e 7*,

¥XkF = FMkF - FSkF > FMMF - FSMF = FXMF,

because [Mk] is a martingale and {Sk} is a submartingale. It follows that {Xk} is asupermartingale, as required for the Krickeberg decomposition.
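The decomposition can be checked by exact computation in a finite setting (a sketch of my own, not from the text): take the submartingale S_k := |W_k| for a fair-coin walk W over a horizon N, replace the limsup functional by its finite-horizon analog M_k := ℙ(S_N | ℱ_k), and verify that M is a martingale dominating S while X := M − S is a nonnegative supermartingale.

```python
from itertools import product
from fractions import Fraction

N = 6
paths = list(product([-1, 1], repeat=N))   # fair-coin paths, each with probability 2^-N

def S(prefix):
    # submartingale S_k = |W_k|, with W_k the partial sum of the first k steps
    return abs(sum(prefix))

def M(prefix):
    # finite-horizon analog of the density M_k: conditional expectation of S_N
    # given the first k coin tosses (average over all extensions of the prefix)
    exts = [p for p in paths if p[:len(prefix)] == prefix]
    return Fraction(sum(S(p) for p in exts), len(exts))

for k in range(N):
    for prefix in product([-1, 1], repeat=k):
        m_k, x_k = M(prefix), M(prefix) - S(prefix)
        nxt = [prefix + (e,) for e in (-1, 1)]
        assert sum(M(p) for p in nxt) / 2 == m_k           # {M_k} is a martingale
        assert x_k >= 0                                    # M_k >= S_k
        assert sum(M(p) - S(p) for p in nxt) / 2 <= x_k    # {X_k} is a supermartingale
print("S = M - X verified exactly for the |walk| submartingale, N =", N)
```

With `Fraction` arithmetic every conditional expectation is exact, so the martingale and supermartingale properties hold as identities rather than up to rounding.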

*6. Uniform integrability

Corollary <27> gave a sufficient condition for a submartingale {X_n} to converge almost surely to an integrable limit X_∞. If {X_n} happens to be a martingale, we know that X_n = ℙ(X_{n+m} | ℱ_n) for arbitrarily large m. It is tempting to leap to the conclusion that

<32>    X_n = ℙ(X_∞ | ℱ_n)?

as suggested by a purely formal passage to the limit as m tends to infinity. One should perhaps look before one leaps.

<33> Example. Reconsider the limit behavior of the partial sums {S_n} from Example <23>, but with p = 1/3 and θ = 2. The sequence X_n = 2^{S_n} is a positive martingale. By the strong law of large numbers, S_n/n → −1/3 almost surely, which gives S_n → −∞ almost surely and X_∞ = 0 as the limit of the martingale. Clearly X_n is not equal to ℙ(X_∞ | ℱ_n). □
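A numerical companion to this example (my own sketch, not part of the text): the one-step factor ℙ2^ξ equals 1 exactly, confirming the martingale property, yet every simulated path of X_n = 2^{S_n} collapses toward zero.

```python
import random

def step_factor(p=1/3, theta=2.0):
    # E[theta^xi] for a step xi = +1 with probability p, -1 with probability 1-p
    return p * theta + (1 - p) / theta

def final_X(n, p=1/3, rng=None):
    # one simulated value of X_n = 2^{S_n}
    rng = rng or random.Random(0)
    s = sum(1 if rng.random() < p else -1 for _ in range(n))
    return 2.0 ** s

rng = random.Random(12345)
paths = [final_X(2000, rng=rng) for _ in range(200)]
print(step_factor())   # 1 up to rounding: the martingale property
print(max(paths))      # tiny: the paths head to the degenerate limit 0
```

Each X_n has expectation 1, but the almost sure limit is 0: expectation does not pass to the limit without uniform integrability.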



REMARK. The branching process of Example <25> with μ = 1 provides another case of a nontrivial martingale converging almost surely to zero.

As you will learn in this Section, the condition for the validity of <32> (without the cautionary question mark) is uniform integrability. Remember that a family of random variables {Z_t : t ∈ T} is said to be uniformly integrable if sup_{t∈T} ℙ|Z_t|{|Z_t| > M} → 0 as M → ∞. Remember also the following characterization of ℒ¹ convergence, which was proved in Section 2.8.

<34> Theorem. Let {Z_n : n ∈ ℕ} be a sequence of integrable random variables. The following two conditions are equivalent.

(i) The sequence is uniformly integrable and it converges in probability to a random variable Z_∞, which is necessarily integrable.

(ii) The sequence converges in ℒ¹ norm, ℙ|Z_n − Z_∞| → 0, to an integrable random variable Z_∞.

The necessity of uniform integrability for <32> follows immediately from a general property of conditional expectations.

<35> Lemma. For a fixed integrable random variable Z, the family of all conditional expectations {ℙ(Z | 𝒢) : 𝒢 a sub-sigma-field of ℱ} is uniformly integrable.

Proof. Write Z_𝒢 for ℙ(Z | 𝒢). With no loss of generality, we may suppose Z ≥ 0, because |Z_𝒢| ≤ ℙ(|Z| | 𝒢). Invoke the defining property of the conditional expectation, and the fact that {Z_𝒢 > M²} ∈ 𝒢, to rewrite ℙZ_𝒢{Z_𝒢 > M²} as

ℙZ{Z_𝒢 > M²} ≤ Mℙ{Z_𝒢 > M²} + ℙZ{Z > M}.

The first term on the right-hand side is less than MℙZ_𝒢/M² = ℙZ/M, which tends to zero as M → ∞. The other term also tends to zero, because Z is integrable. □

More generally, if X is an integrable random variable and {ℱ_n : n ∈ ℕ₀} is a filtration then X_n := ℙ(X | ℱ_n) defines a uniformly integrable martingale. In fact, every uniformly integrable martingale must be of this form.

<36> Theorem. Every uniformly integrable martingale {X_n : n ∈ ℕ₀} converges almost surely and in ℒ¹ to an integrable random variable X_∞, for which X_n = ℙ(X_∞ | ℱ_n). Moreover, if X_n := ℙ(X | ℱ_n) for some integrable X then X_∞ = ℙ(X | ℱ_∞), where ℱ_∞ := σ(∪_{n∈ℕ} ℱ_n).

Proof. Uniform integrability implies finiteness of sup_n ℙ|X_n|, which lets us deduce via Corollary <27> the almost sure convergence to the integrable limit X_∞. Almost sure convergence implies convergence in probability, which uniform integrability and Theorem <34> strengthen to ℒ¹ convergence. To show that X_n = ℙ(X_∞ | ℱ_n), fix an F in ℱ_n. Then, for all positive m,

|ℙX_∞F − ℙX_nF| ≤ ℙ|X_∞ − X_{n+m}| + |ℙX_{n+m}F − ℙX_nF|.

The ℒ¹ convergence makes the first term on the right-hand side converge to zero as m tends to infinity. The second term is zero for all positive m, by the martingale property. Thus ℙX_∞F = ℙX_nF for every F in ℱ_n.



If ℙ(X | ℱ_n) = X_n = ℙ(X_∞ | ℱ_n) then ℙXF = ℙX_∞F for each F in ∪_n ℱ_n. A generating class argument then gives the equality for all F in ℱ_∞, which characterizes the ℱ_∞-measurable random variable X_∞ as the conditional expectation ℙ(X | ℱ_∞). □

REMARK. More concisely: the uniformly integrable martingales {X_n : n ∈ ℕ} are precisely those that can be extended to martingales {X_n : n ∈ ℕ ∪ {∞}}. Such a martingale is sometimes said to be closed on the right.

<37> Example. Classical statistical models often consist of a parametric family 𝒫 = {ℙ_θ : θ ∈ Θ} of probability measures that define joint distributions of infinite sequences ω := (ω₁, ω₂, ...) of possible observations. More formally, each ℙ_θ could be thought of as a probability measure on ℝ^ℕ, with random variables X_i as the coordinate maps.

For simplicity, suppose Θ is a Borel subset of a Euclidean space. The parameter θ is said to be consistently estimated by a sequence of measurable functions θ̂_n = θ̂_n(ω₁, ..., ω_n) if

<38>    ℙ_θ{|θ̂_n − θ| > ε} → 0    for each ε > 0 and each θ in Θ.

A Bayesian would define a joint distribution ℚ := π ⊗ 𝒫 for θ and ω by equipping Θ with a prior probability distribution π. The conditional distributions ℚ_{n,t} given the random vectors T_n := (X₁, ..., X_n) are called posterior distributions. We could also regard π_{n,ω}(·) := ℚ_{n,T_n(ω)}(·) as random probability measures on the product space. An expectation with respect to π_{n,ω} is a version of a conditional expectation given the sigma-field ℱ_n := σ(X₁, ..., X_n).

A mysterious sounding result of Doob (1949) asserts that mere existence of some consistent estimator for θ ensures that the π_{n,ω} distributions will concentrate around the right value, in the delicate sense that for π-almost all θ, the π_{n,ω} measure of each neighborhood of θ tends to one for ℙ_θ-almost all ω.

The mystery dissipates when one understands the role of the consistent estimator. When averaged out over the prior, property <38> implies (via Dominated Convergence) that ℚ{(θ, ω) : |θ̂_n(ω) − θ| > ε} → 0. A ℚ-almost surely convergent subsequence identifies θ as an ℱ_∞-measurable random variable, τ(ω), on the product space, up to a ℚ equivalence. That is, θ = τ(ω) a.e. [ℚ].

Let 𝒰 be a countable collection of open sets generating the topology of Θ. That is, each open set should equal the union of the 𝒰-sets that it contains. For each U in 𝒰, the sequence of posterior probabilities π_{n,ω}{θ ∈ U} = ℚ{θ ∈ U | ℱ_n} defines a uniformly integrable martingale, which converges ℚ-almost surely to

ℚ{θ ∈ U | ℱ_∞} = {θ ∈ U}    a.s., because {θ ∈ U} = {τ(ω) ∈ U} ∈ ℱ_∞.

Cast out a sequence of ℚ-negligible sets, leaving a set E with ℚE = 1 and π_{n,ω}{θ ∈ U} → {θ ∈ U} for all U in 𝒰, all (θ, ω) ∈ E, which implies Doob's result. □
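The simplest instance of Doob's result can be simulated (a sketch under my own toy assumptions: a two-point parameter set Θ = {1/3, 2/3}, Bernoulli(θ) coordinates, and a uniform prior). The empirical frequency consistently estimates θ, and the posterior mass at the true parameter tends to one.

```python
import math
import random

def posterior_mass_at_true(theta_true, n, seed=7):
    # posterior for a uniform prior on {1/3, 2/3} after n Bernoulli(theta_true)
    # observations; returns the posterior probability assigned to theta_true
    rng = random.Random(seed)
    other = 1.0 - theta_true
    log_odds = 0.0   # log posterior odds: theta_true versus the other point
    for _ in range(n):
        x = 1 if rng.random() < theta_true else 0
        num = theta_true if x else 1 - theta_true
        den = other if x else 1 - other
        log_odds += math.log(num / den)
    return 1.0 / (1.0 + math.exp(-log_odds))

print(posterior_mass_at_true(2/3, 500))
print(posterior_mass_at_true(1/3, 500))
```

The per-observation drift of the log odds is the Kullback-Leibler divergence between the two Bernoulli laws, which is strictly positive, so the posterior concentrates at the true parameter in both cases.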

*7. Reversed martingales

Martingale theory gets easier when the index set T has a largest element, as in the case T = −ℕ₀ := {−n : n ∈ ℕ₀}. Equivalently, one can reverse the "direction of time," by considering families of integrable random variables {X_t : t ∈ T} adapted to decreasing filtrations, families of sub-sigma-fields {𝒢_t : t ∈ T} for which 𝒢_s ⊇ 𝒢_t when s < t. For such a family, it is natural to define 𝒢_∞ := ∩_{t∈T} 𝒢_t if it is not already defined.

<39> Definition. Let {X_n : n ∈ ℕ₀} be a sequence of integrable random variables, adapted to a decreasing filtration {𝒢_n : n ∈ ℕ₀}. Call {(X_n, 𝒢_n) : n ∈ ℕ₀} a reversed supermartingale if ℙ(X_n | 𝒢_{n+1}) ≤ X_{n+1} almost surely, for each n. Define reversed submartingales and reversed martingales analogously.

That is, {(X_n, 𝒢_n) : n ∈ ℕ₀} is a reversed supermartingale if and only if {(X_{−n}, 𝒢_{−n}) : n ∈ −ℕ₀} is a supermartingale. In particular, for each fixed N, the finite sequence X_N, X_{N−1}, ..., X₀ is a supermartingale with respect to the filtration 𝒢_N ⊆ 𝒢_{N−1} ⊆ ... ⊆ 𝒢₀.

<40> Example. If {𝒢_n : n ∈ ℕ₀} is a decreasing filtration and X is an integrable random variable, the sequence X_n := ℙ(X | 𝒢_n) defines a uniformly integrable (Lemma <35>) reversed martingale. □

The theory for reversed positive supermartingales is analogous to the theory from Section 3, except for the slight complication that the sequence {ℙX_n : n ∈ ℕ₀} might be increasing, and therefore it is not automatically bounded.

<41> Theorem. For every reversed positive supermartingale {(X_n, 𝒢_n) : n ∈ ℕ₀}:

(i) there exists an X_∞ in M⁺(Ω, 𝒢_∞) for which X_n → X_∞ almost surely;

(ii) ℙ(X_n | 𝒢_∞) ↑ X_∞ almost surely;

(iii) ℙ|X_n − X_∞| → 0 if and only if sup_n ℙX_n < ∞.

Proof. The Corollary <21> to Dubins's Inequality bounds by (α/β)^k the probability that X_N, X_{N−1}, ..., X₀ completes at least k upcrossings of the interval [α, β], no matter how large we take N. As in the proof of Theorem <22>, it then follows that ℙ{limsup X_n > β > α > liminf X_n} = 0, for each pair 0 ≤ α < β < ∞, and hence X_n converges almost surely to a nonnegative limit X_∞, which is necessarily 𝒢_∞-measurable.

I will omit most of the "almost sure" qualifiers for the remainder of the proof. Temporarily abbreviate ℙ(· | 𝒢_n) to ℙ_n(·), for n ∈ ℕ₀ ∪ {∞}, and write Z_n for ℙ_∞X_n. From the reversed supermartingale property, ℙ_{n+1}X_n ≤ X_{n+1}, and the rule for iterated conditional expectations we get

Z_n = ℙ_∞X_n = ℙ_∞(ℙ_{n+1}X_n) ≤ ℙ_∞X_{n+1} = Z_{n+1}.

Thus Z_n ↑ Z_∞ := limsup_n Z_n, which is 𝒢_∞-measurable.

For (ii) we need to show Z_∞ = X_∞ almost surely. Equivalently, as both variables are 𝒢_∞-measurable, we need to show ℙ(Z_∞G) = ℙ(X_∞G) for each G in 𝒢_∞. For such a G,

ℙ(Z_∞G) = sup_{n∈ℕ₀} ℙ(Z_nG)    Monotone Convergence
        = sup_n ℙ(X_nG)    definition of Z_n := ℙ_∞X_n
        = sup_n sup_{m∈ℕ} ℙ((X_n ∧ m)G)    Monotone Convergence, for fixed n
        = sup_m sup_n ℙ((X_n ∧ m)G).




The sequence {X_n ∧ m : n ∈ ℕ₀} is a uniformly bounded, reversed positive supermartingale, for each fixed m. Thus ℙ((X_n ∧ m)G) increases with n, and, by Dominated Convergence, its limit equals ℙ((X_∞ ∧ m)G). Thus

ℙ(Z_∞G) = sup_m ℙ((X_∞ ∧ m)G) = ℙ(X_∞G),

the final equality by Monotone Convergence. Assertion (ii) follows.

Monotone Convergence and (ii) imply that ℙX_∞ = sup_n ℙX_n. Finiteness of the supremum is equivalent to integrability of X_∞. The sufficiency in assertion (iii) follows by the usual Dominated Convergence trick (also known as Scheffé's lemma):

ℙ|X_∞ − X_n| = 2ℙ(X_∞ − X_n)⁺ − (ℙX_∞ − ℙX_n) → 0.

For the necessity, just note that an ℒ¹ limit of a sequence of integrable random variables is integrable. □

The analog of the Krickeberg decomposition extends the result to reversed submartingales {(X_n, 𝒢_n) : n ∈ ℕ₀}. The sequence M_n := ℙ(X₀⁺ | 𝒢_n) is a reversed positive martingale for which M_n ≥ ℙ(X₀ | 𝒢_n) ≥ X_n almost surely. Thus X_n = M_n − (M_n − X_n) decomposes X_n into a difference of a reversed positive martingale and a reversed positive supermartingale.

<42> Corollary. Every reversed submartingale {(X_n, 𝒢_n) : n ∈ ℕ₀} converges almost surely. The limit is integrable, and the sequence also converges in the ℒ¹ sense, if inf_n ℙX_n > −∞.

Proof. Apply Theorem <41> to both reversed positive supermartingales {M_n − X_n} and {M_n}, noting that sup_n ℙM_n = ℙX₀⁺ and sup_n ℙ(M_n − X_n) = ℙX₀⁺ − inf_n ℙX_n. □

<43> Corollary. Every reversed martingale {(X_n, 𝒢_n) : n ∈ ℕ₀} converges almost surely, and in ℒ¹, to the limit X_∞ := ℙ(X₀ | 𝒢_∞), where 𝒢_∞ := ∩_n 𝒢_n.

Proof. The identification of the limit as the conditional expectation follows from the facts that ℙ(X₀G) = ℙ(X_nG), for each n, and |ℙ(X_nG) − ℙ(X_∞G)| ≤ ℙ|X_n − X_∞| → 0, for each G in 𝒢_∞. □

Reversed martingales arise naturally from symmetry arguments.

<44> Example. Let {ξ_i : i ∈ ℕ} be a sequence of independent random elements taking values in a set 𝒳 equipped with a sigma-field 𝒜. Suppose each ξ_i induces the same distribution P on 𝒜, that is, ℙf(ξ_i) = Pf for each f in M⁺(𝒳, 𝒜). For each n define the empirical measure P_{n,ω} (or just P_n, if there is no need to emphasize the dependence on ω) on 𝒜 as the probability measure that places mass n⁻¹ at each of the points ξ₁(ω), ..., ξ_n(ω). That is, P_{n,ω}f := n⁻¹ Σ_{i≤n} f(ξ_i(ω)).

Intuitively speaking, knowledge of P_n tells us everything about the values ξ₁(ω), ..., ξ_n(ω) except for the order in which they were produced. Conditionally on P_n, we know that ξ₁ should be one of the points supporting P_n, but we don't know which one. The conditional distribution of ξ₁ given P_n should put probability n⁻¹ at each support point, and it seems we should then have

<45>    ℙ(f(ξ₁) | P_n) = P_nf.

REMARK. Here I am arguing, heuristically, assuming P_n concentrates on n distinct points. A similar heuristic could be developed when there are ties, but there is no point in trying to be too precise at the moment. The problem of ties would disappear from the formal argument.

Similarly, if we knew all P_i for i ≥ n then we should be able to locate ξ_i(ω) exactly for i ≥ n + 1, but the values of ξ₁(ω), ..., ξ_n(ω) would still be known only up to random relabelling of the support points of P_n. The new information would tell us no more about ξ₁ than we already knew from P_n. In other words, we should have

<46>    ℙ(f(ξ₁) | 𝒢_n) = ... = ℙ(f(ξ_n) | 𝒢_n) = P_nf    where 𝒢_n := σ(P_n, P_{n+1}, ...),

which would then give

ℙ(P_nf | 𝒢_{n+1}) = n⁻¹ Σ_{i=1}^n ℙ(f(ξ_i) | 𝒢_{n+1}) = P_{n+1}f.

That is, {(P_nf, 𝒢_n) : n ∈ ℕ} would be a reversed martingale, for each fixed f. It is possible to define 𝒢_n rigorously, then formalize the preceding heuristic argument to establish the reversed martingale property. I will omit the details, because it is simpler to replace 𝒢_n by the closely related n-symmetric sigma-field 𝒮_n, to be defined in the next Section, then invoke the more general symmetry arguments (Example <50>) from that Section to show that {(P_nf, 𝒮_n) : n ∈ ℕ} is a reversed martingale for each P-integrable f.

Corollary <43> ensures that P_nf → ℙ(f(ξ₁) | 𝒮_∞) almost surely. As you will see in the next Section (Theorem <51>, to be precise), the sigma-field 𝒮_∞ is trivial (it contains only events with probability zero or one) and ℙ(X | 𝒮_∞) = ℙX, almost surely, for each integrable random variable X. In particular, for each P-integrable function f we have

n⁻¹ Σ_{i≤n} f(ξ_i) = P_nf → Pf    almost surely.

The special case 𝒳 := ℝ and ℙ|ξ₁| < ∞ and f(x) = x recovers the Strong Law of Large Numbers (SLLN) for independent, identically distributed summands.
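A quick simulation of the conclusion (my own sketch, with an assumed standard normal P and f(x) = x², so Pf = 1): the empirical averages P_nf settle onto Pf as n grows.

```python
import random

def empirical_mean(f, n, seed=42):
    # P_n f for n iid standard normal observations
    rng = random.Random(seed)
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

errors = [abs(empirical_mean(lambda x: x * x, n) - 1.0) for n in (10**3, 10**5)]
print(errors)
```

The errors shrink at the usual n^{-1/2} rate; the reversed-martingale argument above is what guarantees the almost sure convergence behind this picture.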

In statistical problems it is sometimes necessary to prove a uniform analog of the SLLN (a USLLN):

Δ_n := sup_θ |P_nf_θ − Pf_θ| → 0    almost surely,

where {f_θ : θ ∈ Θ} is a class of P-integrable functions on 𝒳. Corollary <42> can greatly simplify the task of establishing such a USLLN.

To avoid measurability difficulties, let me consider only the case where Θ is countable. Write X_{n,θ} for P_nf_θ − Pf_θ. Also, assume that the envelope F := sup_θ |f_θ| is P-integrable, so that ℙΔ_n ≤ ℙ(P_nF + PF) = 2PF < ∞.

For each fixed θ, we know that {(X_{n,θ}, 𝒮_n) : n ∈ ℕ} is a reversed martingale, and hence

ℙ(Δ_n | 𝒮_{n+1}) ≥ sup_θ |ℙ(X_{n,θ} | 𝒮_{n+1})| = sup_θ |X_{n+1,θ}| = Δ_{n+1}.

That is, {(Δ_n, 𝒮_n) : n ∈ ℕ} is a reversed submartingale. From Corollary <42>, Δ_n converges almost surely to an 𝒮_∞-measurable random variable Δ_∞, which by the triviality of 𝒮_∞ (Theorem <51> again) is (almost surely) constant. To prove the USLLN, we have only to show that the constant is zero. For example, it would suffice to show ℙΔ_n → 0, a great simplification of the original task. See Pollard (1984, Section II.5) for details. □

*8. Symmetry and exchangeability

The results in this Section involve probability measures on infinite product spaces. You might want to consult Section 4.8 for notation and the construction of product measures.

The symmetry arguments from Example <44> did not really require anassumption of independence. The reverse martingale methods can be applied tomore general situations where probability distributions have symmetry properties.

Rather than work with random elements of a space (𝒳, 𝒜), it is simpler to deal with their joint distribution as a probability measure on the product sigma-field 𝒜^ℕ of the product space 𝒳^ℕ, the space of all sequences, x := (x₁, x₂, ...), on 𝒳. We can think of the coordinate maps ξ_i(x) := x_i as a sequence of random elements of (𝒳, 𝒜), when it helps.

<47> Definition. Call a one-to-one map π from ℕ onto itself an n-permutation if π(i) = i for i > n. Write ℛ(n) for the set of all n! distinct n-permutations. Call ∪_{n∈ℕ} ℛ(n) the set of all finite permutations of ℕ. Write S_π for the map, from 𝒳^ℕ back onto itself, defined by

S_π x := (x_{π(1)}, x_{π(2)}, ..., x_{π(n)}, x_{n+1}, ...)    if π ∈ ℛ(n).

Say that a function h on 𝒳^ℕ is n-symmetric if it is unaffected by all n-permutations, that is, if h_π(x) := h(S_π x) = h(x) for every n-permutation π.

<48> Example. Let f be a real valued function on 𝒳. Then the function Σ_{i=1}^m f(x_i) is n-symmetric for every m ≥ n, and the function limsup_{m→∞} Σ_{i=1}^m f(x_i)/m is n-symmetric for every n.

Let g be a real valued function on 𝒳 ⊗ 𝒳. Then g(x₁, x₂) + g(x₂, x₁) is 2-symmetric. More generally, Σ_{1≤i≠j≤m} g(x_i, x_j) is n-symmetric for every m ≥ n.

For every real valued function f on 𝒳^ℕ, the function

F(x) := (n!)⁻¹ Σ_{π∈ℛ(n)} f_π(x) = (n!)⁻¹ Σ_{π∈ℛ(n)} f(x_{π(1)}, x_{π(2)}, ..., x_{π(n)}, x_{n+1}, ...)

is n-symmetric. □
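For finitely many coordinates the definition can be checked by brute force. The helper below (my own naming, not from the text) tests invariance of h under every n-permutation, treating a finite list as the first coordinates of a sequence:

```python
from itertools import permutations

def is_n_symmetric(h, x, n):
    # brute-force check that h is unaffected by every n-permutation of x;
    # x is a finite list standing in for an infinite sequence, so h may
    # depend only on the len(x) coordinates supplied
    base = h(x)
    for perm in permutations(range(n)):
        y = [x[perm[i]] if i < n else x[i] for i in range(len(x))]
        if h(y) != base:
            return False
    return True

x = [3, 1, 4, 1, 5, 9]
sum_first_4 = lambda s: sum(s[:4])     # n-symmetric for every n <= 4
weighted = lambda s: s[0] + 2 * s[1]   # order-dependent, not 2-symmetric
print(is_n_symmetric(sum_first_4, x, 4),
      is_n_symmetric(sum_first_4, x, 2),
      is_n_symmetric(weighted, x, 2))
```

This mirrors the example: a sum of the first m coordinates passes the test for every n ≤ m, while an order-dependent function fails already for n = 2.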

<49> Definition. A probability measure ℙ on 𝒜^ℕ is said to be exchangeable if it is invariant under S_π for every finite permutation π, that is, if ℙh = ℙh_π for every h in M⁺(𝒳^ℕ, 𝒜^ℕ) and every finite permutation π. Equivalently, under ℙ the random vector (ξ_{π(1)}, ξ_{π(2)}, ..., ξ_{π(n)}) has the same distribution as (ξ₁, ξ₂, ..., ξ_n), for every n-permutation π, and every n.

The collection of all sets in 𝒜^ℕ whose indicator functions are n-symmetric forms a sub-sigma-field 𝒮_n of 𝒜^ℕ, the n-symmetric sigma-field. The 𝒮_n-measurable functions are those 𝒜^ℕ-measurable functions that are n-symmetric. The sigma-fields {𝒮_n : n ∈ ℕ} form a decreasing filtration on 𝒳^ℕ, with 𝒮₁ = 𝒜^ℕ.

<50> Example. Suppose ℙ is exchangeable. Let f be a fixed ℙ-integrable function on 𝒳^ℕ. Then a symmetry argument will show that

ℙ(f | 𝒮_n) = (n!)⁻¹ Σ_{π∈ℛ(n)} f_π.

The function on the right-hand side, call it F(x), is n-symmetric, and hence 𝒮_n-measurable. Also, for each bounded, 𝒮_n-measurable function H,

ℙ(f(x)H(x)) = ℙ(f_π H_π)    for all π, by exchangeability
            = ℙ(f_π H)    for all π in ℛ(n),

because H_π = H for every n-permutation. Averaging over all π in ℛ(n) gives ℙ(fH) = ℙ(FH), which identifies F as a version of ℙ(f | 𝒮_n).

As a special case, if f depends only on the first coordinate then we have

ℙ(f(x₁) | 𝒮_n) = (n!)⁻¹ Σ_{π∈ℛ(n)} f(x_{π(1)}) = P_nf,

where P_n denotes the empirical measure, as in Example <44>. □

When the coordinate maps are independent under an exchangeable ℙ, the symmetric sigma-field 𝒮_∞ becomes trivial, and conditional expectations (such as ℙ(f(x₁) | 𝒮_∞)) reduce to constants.

<51> Theorem. (Hewitt-Savage zero-one law) If ℙ = P^ℕ, the symmetric sigma-field 𝒮_∞ is trivial: for each F in 𝒮_∞, either ℙF = 0 or ℙF = 1.

Proof. Write h(x) for the indicator function of F, a set in 𝒮_∞. By definition, h_π = h for every finite permutation π. Equip 𝒳^ℕ with the filtration ℱ_n := σ{x_i : i ≤ n}. Notice that ℱ_∞ := σ(∪_{n∈ℕ} ℱ_n) = 𝒜^ℕ = 𝒮₁.

The martingale Y_n := ℙ(F | ℱ_n) converges almost surely to ℙ(F | ℱ_∞) = F, and also, by Dominated Convergence, ℙ|h − Y_n|² → 0.

The ℱ_n-measurable random variable Y_n may be written as h_n(x₁, ..., x_n), for some 𝒜^n-measurable h_n on 𝒳^n. The random variable Z_n := h_n(x_{n+1}, ..., x_{2n}) is independent of Y_n, and it too converges in ℒ² to h: if π denotes the 2n-permutation that interchanges i and i + n, for 1 ≤ i ≤ n, then, by exchangeability,

ℙ|h(x) − Z_n|² = ℙ|h_π(x) − h_n(x_{π(n+1)}, ..., x_{π(2n)})|² = ℙ|h(x) − Y_n|² → 0.

The random variables Z_n and Y_n are independent, and they both converge in ℒ²(ℙ)-norm to F. Thus

0 = lim ℙ|Y_n − Z_n|² = lim (ℙY_n² − 2(ℙY_n)(ℙZ_n) + ℙZ_n²) = ℙF − 2(ℙF)² + ℙF.

It follows that either ℙF = 0 or ℙF = 1. □

In a sense made precise by Problem [17], the product measures P^ℕ are the extreme examples of exchangeable probability measures: they are the extreme points in the convex set of all exchangeable probability measures on 𝒜^ℕ. A celebrated result of de Finetti (1937) asserts that all the exchangeable probabilities can be built up from mixtures of product measures, in various senses. The simplest general version of the de Finetti result is expressed as an assertion of conditional independence.

<52> Theorem. Under an exchangeable probability distribution ℙ on (𝒳^ℕ, 𝒜^ℕ), the coordinate maps are conditionally independent given the symmetric sigma-field 𝒮_∞. That is, for all sets A_i in 𝒜,

ℙ(x₁ ∈ A₁, ..., x_m ∈ A_m | 𝒮_∞) = ℙ(x₁ ∈ A₁ | 𝒮_∞) × ... × ℙ(x_m ∈ A_m | 𝒮_∞)

almost surely, for every m.

Proof. Consider only the typical case where m = 3. The proof of the general case is similar. Write f_i for the indicator function of A_i. Abbreviate ℙ(· | 𝒮_n) to ℙ_n, for n ∈ ℕ ∪ {∞}. From Example <50>, for n ≥ 3,

n³(ℙ_nf₁(x₁))(ℙ_nf₂(x₂))(ℙ_nf₃(x₃)) = Σ{1 ≤ i, j, k ≤ n} f₁(x_i)f₂(x_j)f₃(x_k).

On the right-hand side, there are n(n − 1)(n − 2) triples of distinct subscripts (i, j, k), leaving O(n²) of them with at least one duplicated subscript. The latter contribute a sum bounded in absolute value by a multiple of n²; the former appear in the sum that Example <50> identifies as n(n − 1)(n − 2)ℙ_n(f₁(x₁)f₂(x₂)f₃(x₃)). Thus

(ℙ_nf₁(x₁))(ℙ_nf₂(x₂))(ℙ_nf₃(x₃)) = O(n⁻¹) + (1 + O(n⁻¹))ℙ_n(f₁(x₁)f₂(x₂)f₃(x₃)).

By the convergence of reverse martingales, in the limit we get

(ℙ_∞f₁(x₁))(ℙ_∞f₂(x₂))(ℙ_∞f₃(x₃)) = ℙ_∞(f₁(x₁)f₂(x₂)f₃(x₃)),

the desired factorization. □

When conditional distributions exist, it is easy to extract from Theorem <52> the representation of ℙ as a mixture of product measures.

<53> Theorem. Let 𝒜 be the Borel sigma-field of a separable metric space 𝒳. Let ℙ be an exchangeable probability measure on 𝒜^ℕ, under which the distribution P of x₁ is tight. Then there exists an 𝒮_∞-measurable map T into 𝒯 := [0, 1]^ℕ, with distribution Q, for which conditional distributions {ℙ_t : t ∈ 𝒯} exist, and ℙ_t = P_t^ℕ, a product measure, for Q-almost all t.

Proof. Let ℰ := {E_i : i ∈ ℕ} be a countable generating class for the sigma-field 𝒜, stable under finite intersections and containing 𝒳. For each i let T_i(x) be a version of ℙ(x₁ ∈ E_i | 𝒮_∞). By symmetry, T_i(x) is also a version of ℙ(x_j ∈ E_i | 𝒮_∞), for every j. Define T as the map from 𝒳^ℕ into 𝒯 := [0, 1]^ℕ for which T(x) has ith coordinate T_i(x).

The joint distribution of x₁ and T is a probability measure Γ on the product sigma-field of 𝒳 × 𝒯, with marginals P and Q. As shown in Section 1 of Appendix F, the assumptions on ℙ ensure existence of a probability kernel 𝒫 := {P_t : t ∈ 𝒯} for which

ℙg(x₁, T) = Γ^{x,t} g(x, t) = Q^t P_t^x g(x, t)    for all g in M⁺(𝒳 × 𝒯).

In particular, by definition of T_i and the 𝒮_∞-measurability of T,

Q^t(t_i h(t)) = ℙ(T_i h(T)) = ℙ({x₁ ∈ E_i} h(T)) = Q^t(h(t) P_t E_i)



for all h in M⁺(𝒯), which implies that P_t E_i = t_i a.e. [Q], for each i.

For every finite subcollection {E_{i₁}, ..., E_{i_n}} of ℰ, Theorem <52> asserts

ℙ{x₁ ∈ E_{i₁}, ..., x_n ∈ E_{i_n} | 𝒮_∞} = Π_{j=1}^n ℙ{x_j ∈ E_{i_j} | 𝒮_∞} = Π_{j=1}^n T_{i_j}(x),

which integrates to

ℙ{x₁ ∈ E_{i₁}, ..., x_n ∈ E_{i_n}} = Q^t Π_{j=1}^n t_{i_j} = Q^t Π_{j=1}^n P_t E_{i_j}.

A routine generating class argument completes the proof. □
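A concrete finite check of the de Finetti picture (my own example, exact arithmetic): draw-sequence probabilities for a Pólya urn started with one white and one black ball depend only on the count of whites, so the law is exchangeable, and they agree with the Beta(1,1) mixture of coin-tossing measures, ∫₀¹ θ^k(1−θ)^{n−k} dθ = k!(n−k)!/(n+1)!.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def polya_prob(seq):
    # probability of the 0/1 draw sequence seq from a Polya urn that starts
    # with one white (1) and one black (0) ball; the drawn ball is returned
    # together with one extra ball of the same colour
    w, b, p = 1, 1, Fraction(1)
    for x in seq:
        p *= Fraction(w if x else b, w + b)
        if x:
            w += 1
        else:
            b += 1
    return p

n = 4
for seq in product([0, 1], repeat=n):
    k = sum(seq)
    mixture = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))
    assert polya_prob(seq) == mixture   # exchangeable, and exactly a Beta(1,1) mixture
print("all length-4 urn sequences match the de Finetti mixture")
```

The urn is not a product measure (early draws change later odds), yet every permutation of a sequence has the same probability, exactly as the mixture representation demands.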

9. Problems

[1] Follow these steps to construct an example of a martingale {Z_i} and a stopping time τ for which ℙZ₀ ≠ ℙZ_τ{τ < ∞}.

(i) Let ξ₁, ξ₂, ... be independent, identically distributed random variables with ℙ{ξ_i = +1} = 1/3 and ℙ{ξ_i = −1} = 2/3. Define X₀ := 0 and X_i := ξ₁ + ... + ξ_i and Z_i := 2^{X_i}, for i ≥ 1. Show that {Z_i} is a martingale with respect to an appropriate filtration.

(ii) Define τ := inf{i : X_i = −1}. Show that τ is a stopping time, finite almost everywhere. Hint: Use the SLLN.

(iii) Show that ℙZ₀ > ℙZ_τ. (Should you worry about what happens on the set {τ = ∞}?)

[2] Let τ be a stopping time for the natural filtration generated by a sequence of random variables {Z_n : n ∈ ℕ}. Show that ℱ_τ = σ{Z_{τ∧n} : n ∈ ℕ}.

[3] Let {(Z_n, ℱ_n) : n ∈ ℕ₀} be a (sub)martingale and τ be a stopping time. Show that {(Z_{τ∧n}, ℱ_n) : n ∈ ℕ₀} is also a (sub)martingale. Hint: For F in ℱ_{n−1}, consider separately the contributions to ℙZ_{n∧τ}F and ℙZ_{(n−1)∧τ}F from the regions {τ ≤ n − 1} and {τ ≥ n}.

[4] Let τ be a stopping time for a filtration {ℱ_i : i ∈ ℕ₀}. For an integrable random variable X, define X_i := ℙ(X | ℱ_i). Show that

ℙ(X | ℱ_τ) = Σ_{i∈ℕ₀} {τ = i}X_i = X_τ    almost surely.

Hint: Start with X ≥ 0, so that there are no convergence problems.

[5] Let {(X_n, ℱ_n) : n ∈ ℕ₀} be a positive supermartingale, and let σ and τ be stopping times (not necessarily bounded) for which σ ≤ τ on a set F in ℱ_σ. Show that ℙX_σ{σ < ∞}F ≥ ℙX_τ{τ < ∞}F. Hint: For each positive integer N, show that F_N := F{σ ≤ N} ∈ ℱ_{σ∧N}. Use the Stopping Time Lemma to prove that ℙX_{σ∧N}F_N ≥ ℙX_{τ∧N}F_N ≥ ℙX_τ{τ ≤ N}F, then invoke Monotone Convergence.



[6] For each positive supermartingale {(X_n, ℱ_n) : n ∈ ℕ₀}, and stopping times σ ≤ τ, show that ℙ(X_τ{τ < ∞} | ℱ_σ) ≤ X_σ{σ < ∞} almost surely.

[7] (Kolmogorov 1928) Let ξ₁, ..., ξ_n be independent random variables with ℙξ_i = 0 and |ξ_i| ≤ 1 for each i. Define X_i := ξ₁ + ... + ξ_i and V_i := ℙX_i². For each ε > 0 show that ℙ{max_{i≤n} |X_i| ≤ ε} ≤ (1 + ε)²/V_n. Note the direction of the inequalities. Hint: Define a stopping time τ for which V_n ℙ{max_{i≤n} |X_i| ≤ ε} ≤ ℙX_τ²{τ = n}. Show that ℙX_τ² ≤ (1 + ε)².

[8] (Birnbaum & Marshall 1961) Let 0 = X₀, X₁, ... be nonnegative integrable random variables that are adapted to a filtration {ℱ_i}. Suppose there exist constants θ_i, with 0 ≤ θ_i ≤ 1, for which

(*)    ℙ(X_i | ℱ_{i−1}) ≥ θ_i X_{i−1}    for i ≥ 1.

Let C₁ ≥ C₂ ≥ ... ≥ C_{N+1} = 0 be constants. Prove the inequality

(**)    ℙ{max_{i≤N} C_i X_i ≥ 1} ≤ Σ_{i=1}^N (C_i − θ_{i+1}C_{i+1})ℙX_i,

by following these steps.

(i) Interpret (*) to mean that there exist nonnegative, ℱ_{i−1}-measurable random variables Y_{i−1} for which ℙ(X_i | ℱ_{i−1}) = Y_{i−1} + θ_i X_{i−1} almost surely. Put Z_i := X_i − Y_{i−1} − θ_i X_{i−1}. Show that C_i X_i ≤ C_{i−1}X_{i−1} + C_i Z_i + C_i Y_{i−1} almost surely.

(ii) Deduce that C_i X_i ≤ M_i + A_i, where M_i := Σ_{j≤i} C_j Z_j is a martingale with M₀ = 0 and A_i := Σ_{j≤i} C_j Y_{j−1}.

(iii) Show that the left-hand side of inequality (**) is less than ℙC_τX_τ for an appropriate stopping time τ, then rearrange the sum for ℙA_N to get the asserted upper bound.

[9] (Doob 1953, page 317) Suppose S₁, ..., S_n is a nonnegative submartingale, with ℙS_n^p < ∞ for some fixed p > 1. Let q > 1 be defined by p⁻¹ + q⁻¹ = 1. Show that ℙ(max_{i≤n} S_i^p) ≤ q^p ℙS_n^p, by following these steps.

(i) Write M_n for max_{i≤n} S_i. For fixed x > 0, and an appropriate stopping time τ, apply the Stopping Time Lemma to show that

xℙ{M_n ≥ x} ≤ ℙS_τ{S_τ ≥ x} ≤ ℙS_n{M_n ≥ x}.

(ii) Show that ℙX^p = ∫₀^∞ px^{p−1} ℙ{X ≥ x} dx for each nonnegative random variable X.

(iii) Show that ℙM_n^p ≤ qℙ(S_n M_n^{p−1}).

(iv) Bound the last product using Hölder's inequality, then rearrange to get the stated inequality. (Any problems with infinite values?)

[10] Let (Ω, ℱ, ℙ) be a probability space such that ℱ is countably generated: that is, ℱ = σ{B₁, B₂, ...} for some sequence of sets {B_i}. Let μ be a finite measure on ℱ, dominated by ℙ. Let ℱ_n := σ{B₁, ..., B_n}.



(i) Show that there is a partition π_n of Ω into at most 2^n disjoint sets from ℱ_n such that each F in ℱ_n is a union of sets from π_n.

(ii) Define ℱ_n-measurable random variables X_n by: for ω ∈ A ∈ π_n,

X_n(ω) := μA/ℙA if ℙA > 0, and X_n(ω) := 0 otherwise.

Show that ℙX_nF = μF for all F in ℱ_n.

(iii) Show that (X_n, ℱ_n) is a positive martingale.

(iv) Show that {X_n} is uniformly integrable. Hint: What do you know about μ{X_n > M}?

(v) Let X_∞ denote the almost sure limit of the {X_n}. Show that ℙX_∞F = μF for all F in ℱ. That is, show that X_∞ is a density for μ with respect to ℙ.

[11] Let {(X_n, ℱ_n) : n ∈ ℕ₀} be a submartingale. For fixed constants α < β (not necessarily nonnegative), define stopping times σ₁ ≤ τ₁ ≤ σ₂ ≤ ..., as in Section 3. Establish the upcrossing inequality,

ℙ{τ_k ≤ N} ≤ ℙ(X_N − α)⁺ / (k(β − α)),

for each positive integer N, by following these steps.

(i) Show that Z_n := (X_n − α)⁺ is a positive submartingale, with Z_{σ_i} = 0 if σ_i < ∞ and Z_{τ_i} ≥ β − α if τ_i < ∞.

(ii) For each i, show that Z_{τ_i∧N} − Z_{σ_i∧N} ≥ (β − α){τ_i ≤ N}. Hint: Consider separately the three cases σ_i > N, σ_i ≤ N < τ_i, and τ_i ≤ N.

(iii) Show that −ℙZ_{σ₁∧N} + ℙZ_{τ_k∧N} ≥ k(β − α)ℙ{τ_k ≤ N}. Hint: Take expectations then sum over i in the inequality from part (ii). Use the Stopping Time Lemma for submartingales to prove ℙZ_{τ_i∧N} − ℙZ_{σ_{i+1}∧N} ≤ 0.

(iv) Show that ℙZ_{τ_k∧N} ≤ ℙZ_N = ℙ(X_N − α)⁺.
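The stopping times σ_i, τ_i just count completed upcrossings; a small helper (mine, not part of the problem) makes the bookkeeping concrete on a fixed path:

```python
def upcrossings(xs, alpha, beta):
    # number of completed upcrossings of [alpha, beta]: passages of the path
    # from a value <= alpha (time sigma_i) to a later value >= beta (time tau_i)
    count, below = 0, False
    for x in xs:
        if not below and x <= alpha:
            below = True
        elif below and x >= beta:
            below = False
            count += 1
    return count

path = [0, -1, 2, 0, -2, 3, 1, -1, 4]
print(upcrossings(path, 0, 1))   # three completed upcrossings of [0, 1]
```

The upcrossing inequality bounds the expected value of this count for a submartingale path; finiteness of the count for every rational pair α < β is what forces almost sure convergence in Problem [12].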

[12] Reprove Corollary <27> (a submartingale {X_n : n ∈ ℕ₀} converges almost surely to an integrable limit if sup_n ℙX_n⁺ < ∞) by following these steps.

(i) For fixed α < β, use the upcrossing inequality from Problem [11] to prove that

ℙ{liminf_n X_n < α < β < limsup_n X_n} = 0.

(ii) Deduce that {X_n} converges almost surely to a limit random variable X that might take the values ±∞.

(iii) Prove that ℙ|X_n| ≤ 2ℙX_n⁺ − ℙX₁ for every n. Deduce via Fatou's lemma that ℙ|X| < ∞.

[13] Suppose the offspring distribution in Example <25> has finite mean μ > 1 and variance σ².

(i) Show that var(Z_n) = σ²μ^{n−1} + μ²var(Z_{n−1}).

(ii) Write X_n for the martingale Z_n/μ^n. Show that sup_n var(X_n) < ∞.

(iii) Deduce that X_n converges both almost surely and in ℒ¹ to the limit X, and hence ℙX = 1. In particular, the limit X cannot be degenerate at 0.
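Part (i) can be confirmed by exact computation for a small offspring law (my own choice: P{0} = P{1} = 1/4, P{2} = 1/2, so μ = 5/4):

```python
from fractions import Fraction

offspring = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
mu = sum(k * p for k, p in offspring.items())
sigma2 = sum(k * k * p for k, p in offspring.items()) - mu * mu

def convolve(d1, d2):
    out = {}
    for a, pa in d1.items():
        for b, pb in d2.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * pb
    return out

def generation(n):
    # exact distribution of Z_n, starting from Z_0 = 1
    dist = {1: Fraction(1)}
    for _ in range(n):
        new = {}
        for z, pz in dist.items():
            s = {0: Fraction(1)}          # sum of z iid offspring counts
            for _ in range(z):
                s = convolve(s, offspring)
            for v, pv in s.items():
                new[v] = new.get(v, Fraction(0)) + pz * pv
        dist = new
    return dist

def var(dist):
    m = sum(v * p for v, p in dist.items())
    return sum(v * v * p for v, p in dist.items()) - m * m

for n in (1, 2, 3):
    assert var(generation(n)) == sigma2 * mu**(n - 1) + mu**2 * var(generation(n - 1))
print("var(Z_n) = sigma^2 mu^(n-1) + mu^2 var(Z_(n-1)) verified for n = 1, 2, 3")
```

Exact rational arithmetic makes the recursion an identity here, which is a useful sanity check before proving it in general via the conditional variance formula.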



[14] Suppose the offspring distribution P from Example <25> has has finite mean ix > 1.Write Xn for the martingale Zn//in, which converges almost surely to an integrablelimit random variable X. Show that the limit X is nondegenerate if and only if thecondition

(XLOGX) PX (X log(l + x)) < oo,

holds. Follow these steps. Write fin for Px (x{x < fi"}) and Pn (•) for expectationsconditional on 7n.

(i) Show that En O1 ~~ Mn) = Px (x Yln^x > M11})* which converges to a finite limit

if and only if (XLOGX) holds.

(ii) Define Xn := pr* E,&,,{&,, < M*Hi < Z»-i}. Show that P«-iXn = M«almost surely. Show also that

x - xN = ££L*+i ( x n - xB-i) > E ^ N + I (*» - *»-i) a l m o s t

(iii) Show that, for some constant Cu

# Xn] < E n Hn'XP[x > MW} < Q/i < 00.

Deduce that Yln (Xn — Xn) converges almost surely to a finite limit,

(iv) Write varn_i for the conditional variance corresponding to Pn-i. Show that

Deduce, via (ii), that

Zn P (XH ' HnXn-xlll)2 < En M"""1 Px2[x < fl") < C2/X < OO,

for some constant C2. Conclude that Ew (Xn — V<nXn-\/v) is a martingale,which converges both almost surely and in £ l .

(v) Deduce from (iii), (iv), and the fact that E«(^« ~ ^«-i) converges almostsurely, that En Xn-\{\ - M«/M) converges almost surely to a finite limit.

(vi) Suppose P{X > 0} > 0. Show that there exists an ω for which both Σ_n X_{n-1}(ω)(1 − μ_n/μ) < ∞ and lim X_{n-1}(ω) > 0. Deduce via (i) that (XLOGX) holds.

(vii) Suppose (XLOGX) holds. From (i) deduce that P(Σ_n X_{n-1}(1 − μ_n/μ)) < ∞. Deduce via (iv) that Σ_n (X̄_n − X_{n-1}) converges in 𝓛¹. Deduce via (ii) that PX ≥ P(X_N + Σ_{n≥N+1} (X̄_n − X_{n-1})) → 1 as N → ∞, from which it follows that X is nondegenerate. (In fact, P|X_n − X| → 0. Why?)
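Step (i) rests on the termwise identity μ − μ_n = Pˣ(x{x > μ^n}), which makes the two sides of (i) agree for every finite range of summation. A quick numeric check (Python; the offspring law is hypothetical and has finite support, so (XLOGX) holds trivially):

```python
# Check sum_{n<=N} (mu - mu_n) = P^x( x * #{n <= N : x > mu^n} )
# for a hypothetical finite-support offspring law.
p = {0: 0.1, 1: 0.2, 2: 0.3, 5: 0.25, 20: 0.15}
mu = sum(x * q for x, q in p.items())           # mean, here > 1

N = 10
# mu_n = P^x( x {x <= mu^n} ), the truncated means
mu_n = [sum(x * q for x, q in p.items() if x <= mu ** n) for n in range(1, N + 1)]
lhs = sum(mu - m for m in mu_n)
# P^x( x * number of n <= N with x > mu^n )
rhs = sum(x * q * sum(1 for n in range(1, N + 1) if x > mu ** n)
          for x, q in p.items())
print(lhs, rhs)
```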

[15] Let {ξ_i : i ∈ N} be a martingale difference array for which Σ_{i∈N} P(ξ_i²/i²) < ∞.

(i) Define X_n := Σ_{i=1}^n ξ_i/i. Show that sup_n PX_n² < ∞. Deduce that X_n(ω) converges to a finite limit for almost all ω.

(ii) Invoke Kronecker's lemma to deduce that n^{-1} Σ_{i=1}^n ξ_i → 0 almost surely.
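Kronecker's lemma, the deterministic ingredient in part (ii), says: if Σ_i a_i/i converges, then n^{-1} Σ_{i≤n} a_i → 0. A small illustration (Python; the sequence a_i = (−1)^i i^{0.4} is just a convenient choice for which Σ a_i/i converges, as an alternating series):

```python
# Kronecker's lemma in action: partial averages of a_i = (-1)^i * i**0.4
# tend to zero, because sum a_i / i converges (alternating series).
def avg(n):
    return sum((-1) ** i * i ** 0.4 for i in range(1, n + 1)) / n

for n in (10, 100, 1000, 10000):
    print(n, avg(n))
```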

[16] Suppose {X_n : n ∈ N} is an exchangeable sequence of square-integrable random variables. Show that cov(X₁, X₂) ≥ 0. Hint: Each X_i must have the same variance, V; each pair X_i, X_j, for i ≠ j, must have the same covariance, C. Consider var(Σ_{i≤n} X_i) for arbitrarily large n.
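The nonnegativity shows up in simulation. A sketch (Python; the mixture construction is hypothetical): a sequence that is iid conditionally on a random level θ is exchangeable, and cov(X₁, X₂) = var(θ), which cannot be negative.

```python
import random

random.seed(0)

# Exchangeable pair: conditionally iid given theta, so cov(X1, X2) = var(theta) = 1.
def draw_pair():
    theta = random.choice([-1.0, 1.0])
    return theta + random.gauss(0, 1), theta + random.gauss(0, 1)

pairs = [draw_pair() for _ in range(200_000)]
m1 = sum(x for x, _ in pairs) / len(pairs)
m2 = sum(y for _, y in pairs) / len(pairs)
cov = sum((x - m1) * (y - m2) for x, y in pairs) / len(pairs)
print(cov)   # near 1, and in particular nonnegative
```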


166 Chapter 6: Martingale et al.

[17] (Hewitt & Savage 1955, Section 5) Let P be exchangeable, in the sense of Definition <49>.

(i) Let f be a bounded, 𝓐^n-measurable function on X^n. Define X := f(x₁, …, x_n) and Y := f(x_{n+1}, …, x_{2n}). Use Problem [16] to show that P(XY) ≥ (PX)(PY), with equality if P is a product measure.

(ii) Suppose P = α₁Q₁ + α₂Q₂, with α_i > 0 and α₁ + α₂ = 1, where Q₁ and Q₂ are distinct exchangeable probability measures. Let f be a bounded, measurable function on some X^n for which μ₁ := Q₁f(x₁, …, x_n) ≠ Q₂f(x₁, …, x_n) =: μ₂. Define X and Y as in part (i). Show that P(XY) > (PX)(PY). Hint: Use strict convexity of the square function to show that α₁μ₁² + α₂μ₂² > (α₁μ₁ + α₂μ₂)². Deduce that P is not a product measure.

(iii) Suppose P is not a product measure. Explain why there exists an E ∈ 𝓐^n and a bounded measurable function g for which

P({z ∈ E} g(x_{n+1}, x_{n+2}, …)) ≠ (P{z ∈ E}) (P g(x_{n+1}, x_{n+2}, …)),

where z := (x₁, …, x_n). Define α := P{z ∈ E}. Show that 0 < α < 1. For each h ∈ M⁺(X^N, 𝓐^N), define

Q₁h := P({z ∈ E} h(x_{n+1}, x_{n+2}, …)) / α,

Q₂h := P({z ∈ Eᶜ} h(x_{n+1}, x_{n+2}, …)) / (1 − α).

Show that Q₁ and Q₂ correspond to distinct exchangeable probability measures for which P = αQ₁ + (1 − α)Q₂. That is, P is not an extreme point of the set of all exchangeable probability measures on 𝓐^N.

10. Notes

De Moivre used what would now be seen as a martingale method in his solution of the gambler's ruin problem. (Apparently first published in 1711, according to Thatcher (1957). See pages 51–53 of the 1967 reprint of the third edition of de Moivre (1718).)

The name martingale is due to Ville (1939). Lévy (1937, chapter VIII), expanding on earlier papers (Lévy 1934, 1935a, 1935b), had treated martingale differences, identifying them as sequences satisfying his condition (C). He extended several results for sums of independent variables to martingales, including Kolmogorov's maximal inequality and strong law of large numbers (the version proved in Section 4.6), and even a central limit theorem, extending Lindeberg's method (to be discussed, for independent summands, in Section 7.2). He worked with martingales stopped at random times, in order to have sums of conditional variances close to specified constant values.

Doob (1940) established convergence theorems (without using stopping times) for martingales and reversed martingales, calling them sequences with "property E." He acknowledged (footnote to page 458) that the basic maximal inequalities were "implicit in the work of Ville" and that the method of proof he used "was used by Lévy (1937), in a related discussion." It was Doob, especially with his stochastic


processes book (Doob 1953—see, in particular, the historical notes to Chapter VII, starting page 629), who was the major driving force behind the recognition of martingales as one of the most important tools of probability theory. See Lévy's comments in Note II of the 1954 edition of Lévy (1937) and in Lévy (1970, page 118) for the relationship between his work and Doob's.

I first understood some martingale theory by reading the superb text of Ash (1972, Chapter 7), and from conversations with Jim Pitman. The material in Section 3 on positive supermartingales was inspired by an old set of notes for lectures given by Pitman at Cambridge. I believe the lectures were based in part on the original French edition of the book Neveu (1975). I have also borrowed heavily from that book, particularly so for Theorems <26> and <41>. The book of Hall & Heyde (1980), although aimed at central limit theory and its application, contains much about martingales in discrete time. Dellacherie & Meyer (1982, Chapter V) covered discrete-time martingales as a preliminary to the detailed study of martingales in continuous time.

Exercise <15> comes from Aldous (1983, p. 47). Inequality <20> is due to Dubins (1966). The upcrossing inequality of Problem [11] comes from the same paper, slightly weakening an analogous inequality of Doob (1953, page 316). Krickeberg (1963, Section IV.3) established the decomposition (Theorem <26>) of submartingales as differences of positive supermartingales.

I adapted the branching process result of Problem [14], which is due to Kesten & Stigum (1966), from Asmussen & Hering (1983, Chapter II).

The reversed submartingale part of Example <44> comes from Pollard (1981). The zero-one law of Theorem <51> for symmetric events is due to Hewitt & Savage (1955). The study of exchangeability has progressed well beyond the original representation theorem. Consult Aldous (1983) if you want to know more.

REFERENCES

Aldous, D. (1983), 'Exchangeability and related topics', Springer Lecture Notes in Mathematics 1117, 1–198.

Ash, R. B. (1972), Real Analysis and Probability, Academic Press, New York.

Asmussen, S. & Hering, H. (1983), Branching Processes, Birkhäuser.

Bauer, H. (1981), Probability Theory and Elements of Measure Theory, second English edn, Academic Press.

Birnbaum, Z. W. & Marshall, A. W. (1961), 'Some multivariate Chebyshev inequalities with extensions to continuous parameter processes', Annals of Mathematical Statistics pp. 687–703.

de Finetti, B. (1937), 'La prévision: ses lois logiques, ses sources subjectives', Annales de l'Institut Henri Poincaré 7, 1–68. English translation by H. Kyburg in Kyburg & Smokler 1980.

de Moivre, A. (1718), The Doctrine of Chances, first edn. Second edition 1738. Third edition, from 1756, reprinted in 1967 by Chelsea, New York.


Dellacherie, C. & Meyer, P. A. (1982), Probabilities and Potential B: Theory of Martingales, North-Holland, Amsterdam.

Doob, J. L. (1940), 'Regularity properties of certain families of chance variables', Transactions of the American Mathematical Society 47, 455–486.

Doob, J. L. (1949), 'Application of the theory of martingales', Colloques Internationaux du Centre National de la Recherche Scientifique pp. 23–27.

Doob, J. L. (1953), Stochastic Processes, Wiley, New York.

Dubins, L. E. (1966), 'A note on upcrossings of semimartingales', Annals of Mathematical Statistics 37, 728.

Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, New York, NY.

Hewitt, E. & Savage, L. J. (1955), 'Symmetric measures on Cartesian products', Transactions of the American Mathematical Society 80, 470–501.

Kesten, H. & Stigum, B. P. (1966), 'Additional limit theorems for indecomposable multidimensional Galton-Watson processes', Annals of Mathematical Statistics 37, 1463–1481.

Kolmogorov, A. (1928), 'Über die Summen durch den Zufall bestimmter unabhängiger Größen', Mathematische Annalen 99, 309–319. Corrections: same journal, volume 102, 1929, pages 484–488.

Krickeberg, K. (1963), Wahrscheinlichkeitstheorie, Teubner. English translation, 1965, Addison-Wesley.

Kyburg, H. E. & Smokler, H. E. (1980), Studies in Subjective Probability, second edn, Krieger, Huntington, New York. Reprint of the 1964 Wiley edition.

Lévy, P. (1934), 'L'addition de variables aléatoires enchaînées et la loi de Gauss', Bull. Soc. Math. France 62, 42–43.

Lévy, P. (1935a), 'Propriétés asymptotiques des sommes de variables aléatoires enchaînées', Comptes Rendus de l'Académie des Sciences, Paris 199, 627–629.

Lévy, P. (1935b), 'Propriétés asymptotiques des sommes de variables aléatoires enchaînées', Bull. Soc. Math. 59, 1–32.

Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Second edition, 1954.

Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.

Neveu, J. (1975), Discrete-Parameter Martingales, North-Holland, Amsterdam.

Pollard, D. (1981), 'Limit theorems for empirical processes', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 181–195.

Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.

Thatcher, A. R. (1957), 'A note on the early solutions of the problem of the duration of play', Biometrika 44, 515–518.

Ville, J. (1939), Étude Critique de la Notion de Collectif, Gauthier-Villars, Paris.


Chapter 7

Convergence in distribution

SECTION 1 defines the concepts of weak convergence for sequences of probability measures on a metric space, and of convergence in distribution for sequences of random elements of a metric space, and derives some of their consequences. Several equivalent definitions for weak convergence are noted.

SECTION 2 establishes several more equivalences for weak convergence of probability measures on the real line, then derives some central limit theorems for sums of independent random variables by means of Lindeberg's substitution method.

SECTION 3 explains why the multivariate analogs of the methods from Section 2 are not often explicitly applied.

SECTION 4 develops the calculus of stochastic order symbols.

SECTION *5 derives conditions under which sequences of probability measures have weakly convergent subsequences.

1. Definition and consequences

Roughly speaking, central limit theorems give conditions under which sums of random variables have approximate normal distributions. For example:

If ξ₁, …, ξ_n are independent random variables with Pξ_i = 0 for each i and Σ_i var(ξ_i) = 1, and if none of the ξ_i makes too large a contribution to their sum, then Σ_i ξ_i is approximately N(0, 1) distributed.

The traditional way to formalize approximate normality requires, for each real x, that P{Σ_i ξ_i ≤ x} ≈ P{Z ≤ x}, where Z has a N(0,1) distribution. Of course the variable Z is used just as a convenient way to describe a calculation with the N(0,1) probability measure; Z could be replaced by any other random variable with the same distribution. The assertion does not mean that Σ_i ξ_i ≈ Z, as functions defined on a common probability space. Indeed, the Z need not even live on the same space as the {ξ_i}. We could remove the temptation to misinterpret the approximation by instead writing P{Σ_i ξ_i ≤ x} ≈ P(−∞, x], where P denotes the N(0,1) probability measure.

Assertions about approximate distributions of random variables are usually expressed as limit theorems. For example, the sum could be treated as one of a sequence of such sums, with the approximation interpreted as an assertion of convergence to a limit. We thereby avoid all sorts of messy details about the size of


error terms, replacing them by a less specific assurance that the errors all disappear in the limit. Explicit approximations would be better, but limit theorems are often easier to work with.

In this Chapter you will learn about the notion of convergence traditionally used for central limit theorems. To accommodate possible extensions (such as the theories for convergence in distribution of stochastic processes, as in Pollard 1984), I will start from a more general concept of convergence in distribution for random elements of a general metric space, then specialize to the case of real random variables. I must admit I am motivated not just by a desire for added generality. I also wish to discourage my readers from clinging to inconvenient, old-fashioned definitions involving pointwise convergence of distribution functions (at points of continuity of the limit function).

Let X be a metric space, with metric d(·, ·), equipped with its Borel σ-field 𝓑 := 𝓑(X). A random element of X is just an 𝓕\𝓑(X)-measurable map from some probability space (Ω, 𝓕, P) into X. Remember that the image measure XP is called the distribution of X under P.

The concept of convergence in distribution of a sequence of random elements {X_n} depends on X_n only through its distribution. It is really a concept of convergence for probability measures, the image measures X_nP_n. There are many equivalent ways to define convergence of a sequence of probability measures {P_n}, all defined on 𝓑(X). I feel that it is best to start from a definition that is easy to work with, and from which useful conclusions can be drawn quickly.

<1> Definition. A real-valued function ℓ on a metric space X is said to satisfy a Lipschitz condition if there exists a finite constant K for which

|ℓ(x) − ℓ(y)| ≤ K d(x, y) for all x and y in X.

Write BL(X) for the vector space of all bounded Lipschitz functions on X.

The space BL(X) has a simple characterization via the quantity ‖f‖_BL, defined for all real-valued functions f on X by

<2>   ‖f‖_BL := max(K₁, 2K₂), where K₁ := sup_{x≠y} |f(x) − f(y)|/d(x, y) and K₂ := sup_x |f(x)|.

I have departed slightly from the usual definition of ‖·‖_BL, in order to get the neat bound

<3>   |f(x) − f(y)| ≤ ‖f‖_BL (1 ∧ d(x, y)) for all x, y ∈ X.

The space BL(X) consists precisely of those functions ℓ for which ‖ℓ‖_BL < ∞. It is easy to show that ‖·‖_BL is a norm when restricted to BL(X). Moreover, slightly tedious checking of various pointwise cases leads to inequalities such as

‖f₁ ∨ f₂‖_BL ≤ max(‖f₁‖_BL, ‖f₂‖_BL),

which implies that BL(X) is stable under the formation of pointwise maxima of pairs of functions. (Replace f_i by −f_i to deduce the same bound for ‖f₁ ∧ f₂‖_BL, and hence stability under pairwise minima.)
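The pointwise checking behind such inequalities is easy to probe numerically. The sketch below (Python) estimates K₁ and K₂ on a grid for two sample functions; it takes ‖f‖_BL = max(K₁, 2K₂), one concrete choice of norm that delivers the bound |f(x) − f(y)| ≤ ‖f‖_BL (1 ∧ d(x, y)), and checks the lattice bound for pointwise maxima:

```python
import math

# Grid estimates of K1 (Lipschitz constant) and K2 (sup norm) on [-5, 5],
# then a check of ||f1 v f2||_BL <= max(||f1||_BL, ||f2||_BL),
# taking ||f||_BL := max(K1, 2*K2) as one concrete choice of the norm.
h = 0.01
xs = [i * h - 5.0 for i in range(1001)]

def bl_norm(vals):
    k1 = max(abs(a - b) / h for a, b in zip(vals, vals[1:]))
    k2 = max(abs(v) for v in vals)
    return max(k1, 2 * k2)

f1 = [math.sin(x) for x in xs]
f2 = [0.8 * math.cos(2 * x) for x in xs]
fmax = [max(a, b) for a, b in zip(f1, f2)]

n1, n2, nmax = bl_norm(f1), bl_norm(f2), bl_norm(fmax)
print(n1, n2, nmax)
```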


REMARK. It is also easy to show that BL(X) is complete under ‖·‖_BL: that is, if {ℓ_n : n ∈ N} ⊂ BL(X) and ‖ℓ_n − ℓ_m‖_BL → 0 as min(m, n) → ∞ then ‖ℓ_n − ℓ‖_BL → 0 for a uniquely determined ℓ in BL(X). In fact, BL(X) is a Banach lattice, but that fact will play no explicit role in this book.

It is easy to manufacture useful members of BL(X) by means of the distance function

d(x, B) := inf{d(x, y) : y ∈ B} for B ⊆ X.

Problem [1] shows that |d(x, B) − d(y, B)| ≤ d(x, y). Thus functions such as

ℓ_{α,β,B}(x) := α ∧ β d(x, B), for positive constants α and β, and B ⊆ X,

all belong to BL(X).

<4> Definition. Say that a sequence of probability measures {P_n}, defined on 𝓑(X), converges weakly to a probability measure P, on 𝓑(X), if P_nℓ → Pℓ for all ℓ in BL(X). Write P_n ⇝ P to denote weak convergence.

REMARK. Functional analytically minded readers might prefer the term weak-* convergence. Many authors use the symbol ⇒ to denote weak convergence. I beg my students to avoid this notation, because it too readily leads to indecipherable homework assertions, such as P_n ⇒ P ⇒ TP_n ⇒ TP. (Which ⇒ is an implication sign?)

<5> Definition. Say that a sequence X₁, X₂, … of random elements of X converges in distribution to a probability measure P on 𝓑(X) if their distributions converge weakly to P. Denote this convergence by X_n ⇝ P. If X is a random element with distribution P, write X_n ⇝ X to mean X_n ⇝ P.

REMARK. Convergence in distribution is also called convergence in law (the word law is a synonym for distribution) or weak convergence. For the study of abstract empirical processes it was necessary (Hoffmann-Jørgensen 1984, Dudley 1985) to extend the definition to nonmeasurable maps X_n into X. For that case, the concept of distribution for X_n is not defined. It turns out that the most natural and successful definition requires convergence of outer expectations, P*h(X_n) → Ph, for bounded, continuous functions h. That is, the convergence X_n ⇝ P becomes the primary concept, with no corresponding generalization needed for weak convergence of probability measures.


It is important to remember that convergence in distribution, in general, says nothing about pointwise convergence of the X_n as functions. Indeed, each X_n might be defined on a different probability space, (Ω_n, 𝓕_n, P_n), so that the very concept of pointwise convergence is void. In that case, X_n ⇝ P means that X_n(P_n) ⇝ P, that is,

P_nℓ(X_n) := (X_nP_n)(ℓ) → Pℓ for all ℓ in BL(X);


and X_n ⇝ X, with X defined on (Ω, 𝓕, P), means Pℓ(X_n) → Pℓ(X) for all ℓ in BL(X).

Similarly, convergence in probability need not be well defined if the X_n live on different probability spaces. There is, however, one important exceptional case. If X_n ⇝ P with P a probability measure putting mass 1 at a single point x₀ in X, then, for each ε > 0,

P_n{d(X_n, x₀) > ε} ≤ P_nℓ_ε(X_n) → Pℓ_ε = 0, where ℓ_ε(x) := 1 ∧ (d(x, x₀)/ε).

That is, X_n converges in probability to x₀.

<6> Example. Suppose X_n ⇝ P and {X′_n} is another sequence for which d(X_n, X′_n) converges to zero in probability. Then X′_n ⇝ P. For if ‖ℓ‖_BL := K < ∞,

|P_nℓ(X_n) − P_nℓ(X′_n)| ≤ P_n|ℓ(X_n) − ℓ(X′_n)|
   ≤ K P_n(1 ∧ d(X_n, X′_n))   by <3>
   ≤ K P_n{d(X_n, X′_n) > ε} + Kε   for each ε > 0.

The first term in the final bound tends to zero as n → ∞, for each ε > 0, by the assumed convergence in probability. It follows that P_nℓ(X′_n) → Pℓ.

REMARK. A careful probabilist would worry about measurability of d(X_n, X′_n). If X were a separable metric space, Problem [6] would ensure measurability. In general, one could reinterpret the assertion of convergence in probability to mean: there exists a sequence of measurable functions {Δ_n} with Δ_n ≥ d(X_n, X′_n) and Δ_n → 0 in probability. The argument for the Example would be scarcely affected.

Convergence for expectations of the functions in BL(X) will lead to convergence for expectations of other types of function, by means of various approximation schemes. The argument for semicontinuous functions is typical. Recall that a function g : X → ℝ is said to be lower semicontinuous (LSC) if {x : g(x) > t} is an open set for each fixed t. Similarly, a function f : X → ℝ is said to be upper semicontinuous (USC) if {x : f(x) < t} is an open set for each fixed t. (That is, f is USC if and only if −f is LSC.) If a function is both LSC and USC then it is continuous. The prototypical example for lower semicontinuity is the indicator function of an open set; the prototypical example for upper semicontinuity is the indicator function of a closed set.

<7> Lemma. If g is a lower semicontinuous function that is bounded from below by a constant, on a metric space X, then there exists a sequence {ℓ_i : i ∈ N} in BL(X) for which ℓ_i(x) ↑ g(x) at each x.


Proof. With no loss of generality, we may assume g ≥ 0. For each t ≥ 0, the set F_t := {g ≤ t} is closed. The sequence of nonnegative BL functions ℓ_{k,t}(x) := t ∧ (k d(x, F_t)), for k ∈ N, increases pointwise to t{g > t}, because d(x, F_t) > 0 if and only if g(x) > t. (Compare with Problem [3].)

The countable collection 𝓢 of all ℓ_{k,t} functions, for k ∈ N and positive rational t, has pointwise supremum equal to g. Enumerate 𝓢 as {h₁, h₂, …}, then define ℓ_i := max_{j≤i} h_j.
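The approximating functions of the proof are concrete enough to tabulate. A sketch (Python) for the LSC function g = indicator of the open interval (0, 1) on the real line, where F = complement of (0, 1) and ℓ_k(x) = 1 ∧ k d(x, F):

```python
# BL approximation from below of the indicator of the open interval (0, 1):
# l_k(x) = min(1, k * d(x, F)), with F the closed complement of (0, 1),
# increases pointwise to the indicator as k grows.
def dist_to_F(x):
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return min(x, 1.0 - x)

def l(k, x):
    return min(1.0, k * dist_to_F(x))

for x in (-0.5, 0.0, 0.01, 0.5, 0.99, 1.0):
    print(x, [round(l(k, x), 3) for k in (1, 10, 100, 1000)])
```

Each ℓ_k vanishes on the closed complement and climbs toward 1 inside the interval, exactly the behavior the proof exploits.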

<8> Theorem. Suppose P_n ⇝ P. Then

(i) liminf_{n→∞} P_ng ≥ Pg for each lower semicontinuous function g that is bounded from below by a constant;

(ii) limsup_{n→∞} P_nf ≤ Pf for each upper semicontinuous function f that is bounded from above by a constant.

Proof. For a given LSC g, bounded from below, invoke Lemma <7> to find an increasing sequence {ℓ_i} in BL(X) with ℓ_i ↑ g pointwise. Then, for fixed i,

liminf_n P_ng ≥ liminf_n P_nℓ_i = Pℓ_i, because P_nℓ_i → Pℓ_i.

Take the supremum over i, using Monotone Convergence to show sup_i Pℓ_i = Pg, to obtain (i). Put g = −f to deduce (ii).

When specialized to the case of indicator functions, we have

<9>   liminf_n P_nG ≥ PG for all open G, and limsup_n P_nF ≤ PF for all closed F, if P_n ⇝ P.

<10> Example. Let f be a bounded, measurable function on X. The collection of all lower semicontinuous functions g with g ≤ f has a largest member, because the supremum of any family 𝓢 of LSC functions is also LSC: {x : sup_{g∈𝓢} g(x) > t} = ∪_{g∈𝓢}{x : g(x) > t}, a union of open sets. By analogy with the notation for the interior of a set, write f̊ for this largest LSC function ≤ f. The analogy is helpful, because f̊ equals the indicator function of the interior B̊ when f equals the indicator of a set B. Similarly, there is a smallest USC function f̄ that is everywhere ≥ f, and f̄ is the indicator of the closure B̄ when f is the indicator of B.

For simplicity suppose 0 ≤ f ≤ 1. We have f̄(x) ≥ f(x) ≥ f̊(x) for all x. At a point x where f(x) = f̊(x), the set {y : f̊(y) > f(x) − ε} is an open neighborhood of x within which f(y) > f(x) − ε. Similarly, if f(x) = f̄(x), there is a neighborhood of x on which f(y) < f(x) + ε. In short, f is continuous at each point of the set {x : f̊(x) = f̄(x)}. Conversely, if f is continuous at a point x then, for each ε > 0, there is some open neighborhood G of x for which |f(x) − f(y)| < ε for y ∈ G. We may assume ε < 1. Then the function f is sandwiched between an LSC function and a USC function,

(f(x) − ε){y ∈ G} − 2{y ∉ G} ≤ f(y) ≤ (f(x) + ε){y ∈ G} + 2{y ∉ G},

which differ by only 2ε at x, thereby implying that f̊(x) = f̄(x). The Borel measurable set C_f := {x : f̊(x) = f̄(x)} is precisely the set of all points at which f is continuous.


Now suppose P_n ⇝ P. From Theorem <8>, and the inequality f̄ ≥ f ≥ f̊, we have

Pf̄ ≥ limsup_n P_nf̄ ≥ limsup_n P_nf ≥ liminf_n P_nf ≥ liminf_n P_nf̊ ≥ Pf̊.

If Pf̄ = Pf̊, which happens if PC_f = 1, then we have convergence, P_nf → Pf. That is, for a bounded, Borel measurable, real function f on X,

<11>   P_nf → Pf if P_n ⇝ P and f is continuous P almost everywhere.

When specialized to the case of an indicator function of a set B ∈ 𝓑(X), we have

<12>   P_nB → PB if P_n ⇝ P and P(∂B) = 0,

because the discontinuities of the indicator function of a set occur only at its boundary. A set with zero P measure on its boundary is called a P-continuity set. For example, an interval (−∞, x] on the real line is a continuity set for every probability measure that puts zero mass at the point x. When specialized to real random variables, assertion <12> gives the traditional convergence of distribution functions at continuity points of the limit distribution function.
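The role of the P-continuity condition shows up already for point masses. A minimal sketch (Python): P_n = point mass at 1/n converges weakly to P = point mass at 0, yet the distribution functions fail to converge exactly at x = 0, the one point where the boundary of (−∞, x] carries P mass.

```python
# Distribution functions for P_n = point mass at 1/n and P = point mass at 0.
def F_n(n, x):
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    return 1.0 if x >= 0.0 else 0.0

print([F_n(n, 0.25) for n in (1, 2, 4, 8, 16)])  # settles at F(0.25) = 1
print([F_n(n, 0.0) for n in (1, 2, 4, 8, 16)])   # stuck at 0, yet F(0) = 1
```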

REMARK. Intuitively speaking, the closeness of probability measures in a weak convergence sense is not sensitive to changes that have only small effects on functions in BL(X): arbitrary relocations of small amounts of mass (because the functions in BL(X) are bounded), or small relocations of arbitrarily large amounts of mass (because the functions in BL(X) satisfy a Lipschitz condition). The P-continuity condition, P(∂B) = 0, ensures that small rearrangements of P mass near the boundary of B cannot have much effect on PB. See Problem [4] for a way of making this idea precise, by constructing sequences P′_n ⇝ P and P″_n ⇝ P for which P′_nB → PB but P″_nB does not converge to PB.

Problem [10] shows that convergence for all P-continuity sets implies the convergence P_nf → Pf for all bounded, measurable functions f that are continuous a.e. [P], and, in particular, for all f in BL(X). Thus, any one of the assertions in the following summary diagram of equivalences could be taken as the definition of weak convergence, and then the other equivalences would become theorems. It is largely a matter of taste, or convenience, which equivalent form one chooses as the definition. It is worth noting that, because of the equivalences, the concept of weak convergence does not depend on the particular choice for the metric: all metrics generating the same topology lead to the same concept.

REMARK. Billingsley (1968, Section 2) applied the name Portmanteau theorem to a subset of the equivalences shown in the following diagram. The circle of ideas behind these equivalences goes back to Alexandroff (1940-43), who worked in an abstract, nontopological setting. Prohorov (1956) developed a very useful theory for weak convergence in complete separable metric spaces. Independently, Le Cam (1957) developed an analogous theory for more general topological spaces. (See also Varadarajan 1961.) For arbitrary topological spaces, Topsøe (1970, page 41) chose the semicontinuity assertions (or more precisely, their analogs for generalized sequences) to define weak convergence. Such a definition is needed to build a nonvacuous theory, because there exist nontrivial spaces for which the only continuous functions are the constants.


Equivalences for weak convergence of Borel probability measures on a general metric space (the arrows of the original diagram are rendered here as a list; further equivalences, for the special case of probability measures on the real line, are stated in the next Section):

P_nB → PB for all Borel sets B with P(∂B) = 0   (Problem [10])

liminf_n P_nG ≥ PG for all open G  ⟺  limsup_n P_nF ≤ PF for all closed F

liminf_n P_ng ≥ Pg for all bounded, lower semicontinuous functions g  ⟺  limsup_n P_nf ≤ Pf for all bounded, upper semicontinuous functions f   (Theorem <8>)

P_nf → Pf for all bounded f with P{x ∈ X : f discontinuous at x} = 0   (Example <10>)

P_nf → Pf for all bounded, continuous f

P_nℓ → Pℓ for all ℓ in BL(X)   (the definition)

The following consequence of <11> is often referred to as the Continuous Mapping Theorem, even though the mapping in question need not be continuous. It would be more accurate, if clumsier, to call it the Almost-Surely-Continuous Mapping Theorem.

<13> Corollary. Let T be a 𝓑(X)\𝓑(𝒴)-measurable map from X into another metric space 𝒴, which is continuous at each point of a measurable subset C_T. If P_n ⇝ P and PC_T = 1 then T(P_n) ⇝ T(P).

Proof. For ℓ ∈ BL(𝒴), the composition f := ℓ ∘ T is continuous at each point of C_T. From <11>, we have (TP_n)(ℓ) := P_nℓ(T) → Pℓ(T) =: (TP)(ℓ).

REMARK. The equivalent assertion for random elements is: if X_n ⇝ X and T is continuous at almost all realizations X(ω) then T(X_n) ⇝ T(X).

<14> Example. Suppose Y_n ⇝ Y, as random elements of a metric space (𝒴, d_𝒴), and Z_n → z₀ in probability, as random elements of a metric space (𝒵, d_𝒵). Equip X := 𝒴 × 𝒵 with the metric

d(x₁, x₂) := max(d_𝒴(y₁, y₂), d_𝒵(z₁, z₂)) where x_i := (y_i, z_i).

If ℓ ∈ BL(𝒴 × 𝒵) then the function y ↦ ℓ(y, z₀) belongs to BL(𝒴), and hence P_nℓ(Y_n, z₀) → Pℓ(Y, z₀). That is, the random elements X′_n := (Y_n, z₀) converge in distribution to (Y, z₀). The random element X_n := (Y_n, Z_n) is close to X′_n: in fact, d(X_n, X′_n) = d_𝒵(Z_n, z₀) → 0 in probability. By Example <6>, it follows that X_n ⇝ (Y, z₀). If T is a measurable function of (y, z) that is continuous at almost all realizations (Y(ω), z₀) of the limit process, Corollary <13> then gives T(Y_n, Z_n) ⇝ T(Y, z₀). The special cases where 𝒴 = 𝒵 = ℝ and T(y, z) = y + z or T(y, z) = yz are sometimes referred to as Slutsky's theorem.
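A simulation sketch of the Slutsky case (Python; the particular summands are illustrative): Y_n is built from a coin-flip central limit theorem, Z_n → 2 in probability, so Y_n + Z_n should be approximately N(2, 1).

```python
import math
import random

random.seed(1)

def sample(n):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    y = s / math.sqrt(n)                          # Y_n ~> N(0, 1)
    z = 2.0 + random.gauss(0, 1) / math.sqrt(n)   # Z_n -> 2 in probability
    return y + z                                  # Slutsky: ~> N(2, 1)

n, reps = 400, 4000
vals = [sample(n) for _ in range(reps)]
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
print(mean, var)   # roughly 2 and 1
```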


<15> Example. Suppose ξ₁, ξ₂, … are independent, identically distributed random ℝᵏ-vectors with Pξ_i = 0 and P(ξ_iξ_i′) = I_k. Define Y_n := (ξ₁ + … + ξ_n)/√n. As you will see in Section 3, the sequence Y_n converges in distribution to a random vector Y whose components are independent N(0, 1) variables. Corollary <13> gives convergence in distribution of the squared lengths, |Y_n|² ⇝ |Y|². The limit distribution is χ²_k, by definition of that distribution.

Statistical applications of this result sometimes involve the added complication that the ξ_i are not necessarily standardized to have covariance equal to the identity matrix, but instead P(ξ_iξ_i′) = V. In that case, Y_n ⇝ Y, where Y has the N(0, V) distribution. If V is nonsingular, the random variable Y_n′V⁻¹Y_n converges in distribution to Y′V⁻¹Y, which again has a χ²_k distribution. Sometimes V has to be estimated, by means of a nonsingular random matrix V_n, leading to consideration of the random variable Y_n′V_n⁻¹Y_n. If V_n converges in probability to V (meaning convergence in probability of each component), then it follows from Example <14> that (Y_n, V_n) ⇝ (Y, V), in the sense of convergence of random elements of ℝ^{k+k²}. Corollary <13> then gives Y_n′V_n⁻¹Y_n ⇝ Y′V⁻¹Y, because the map (y, A) ↦ y′A⁻¹y is continuous at each point of ℝ^{k+k²} where the matrix A is nonsingular.
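A simulation sketch for k = 2 (Python; stdlib only, with coin-flip coordinates standing in for the ξ_i): the squared length |Y_n|² should be approximately χ²₂, whose tail is P{χ²₂ > t} = e^{-t/2}.

```python
import math
import random

random.seed(2)

def sq_length(n):
    s1 = sum(random.choice((-1, 1)) for _ in range(n))
    s2 = sum(random.choice((-1, 1)) for _ in range(n))
    return (s1 * s1 + s2 * s2) / n                # |Y_n|^2 ~> chi^2_2

n, reps = 500, 2000
vals = [sq_length(n) for _ in range(reps)]
for t in (1.0, 2.0, 4.0):
    emp = sum(v > t for v in vals) / reps
    print(t, emp, math.exp(-t / 2))               # empirical vs exact tail
```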

2. Lindeberg's method for the central limit theorem

The first Section showed what rewards we reap once we have established convergence in distribution of a sequence of random elements of a metric space. I will now explain a method to establish such convergence, for sums of real random variables, using the central limit theorem (CLT) as the motivating example. It will be notationally convenient to work directly with convergence in distribution of random variables, rather than weak convergence of probability measures. To simplify even further, let us assume all random variables are defined on a single probability space (Ω, 𝓕, P).

The CLT, in its various forms, gives conditions under which a sum of independent variables ξ₁ + … + ξ_k has an approximate normal distribution, in the sense that Pℓ(ξ₁ + … + ξ_k), for ℓ in BL(ℝ), is close to the corresponding expectation for the normal distribution. Lindeberg's (1922) method transforms the sum into a sum of normal increments by successive replacement of each ξ_i by a normally distributed η_i with the same expected value and variance. The errors accumulated during the replacements are bounded using a Taylor expansion of the function f. Of course the method can work only if f is smooth enough to allow the Taylor expansion. More precisely, it requires functions with bounded derivatives up to third order. We therefore need first to check that convergence of expectations for such smooth functions suffices to establish convergence in distribution. In fact, convergence for an even smaller class of functions will suffice.
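The flavor of the substitution scheme can be previewed numerically. The sketch below (Python; f and the coin-flip summands are illustrative) compares Pf(S_k), computed exactly from the binomial law of S_k = (ξ₁ + … + ξ_k)/√k with ξ_i = ±1, against the N(0,1) expectation obtained by numerical integration; the gap shrinks as k grows, which is what the Taylor-expansion bookkeeping quantifies.

```python
import math

f = lambda x: 1.0 / (1.0 + x * x)   # bounded, with bounded derivatives of all orders

def coin_expectation(k):
    """Exact P f(S_k) for S_k = (xi_1 + ... + xi_k)/sqrt(k), xi_i = +-1 fair coins."""
    return sum(math.comb(k, j) * 0.5 ** k * f((2 * j - k) / math.sqrt(k))
               for j in range(k + 1))

def normal_expectation(m=20001, zmax=10.0):
    """P f(Z) for Z ~ N(0,1), by Riemann sum."""
    h = 2 * zmax / (m - 1)
    total = 0.0
    for i in range(m):
        z = -zmax + i * h
        total += f(z) * math.exp(-z * z / 2)
    return total * h / math.sqrt(2 * math.pi)

target = normal_expectation()
gap = {k: abs(coin_expectation(k) - target) for k in (4, 16, 64, 256)}
print(gap)
```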

<16> Lemma. If Pf(X_n) → Pf(X) for each f in the class C^∞(ℝ) of all bounded functions with bounded derivatives of all orders, then X_n ⇝ X.

Page 192: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

7.2 Lindeberg's method for the central limit theorem 111

Proof. Let Z have a N(0,1) distribution. For a fixed ℓ ∈ BL(R) and σ > 0, define a smoothed function by convolution,

ℓ_σ(x) := Pℓ(x + σZ) = (σ√(2π))^{−1} ∫ exp(−(y − x)²/(2σ²)) ℓ(y) dy.

The function ℓ_σ belongs to C^∞(R): a Dominated Convergence argument justifies repeated differentiation under the integral sign (Problem [15]). As σ tends to zero, ℓ_σ converges uniformly to ℓ, because

|ℓ_σ(x) − ℓ(x)| ≤ P|ℓ(x + σZ) − ℓ(x)| ≤ ‖ℓ‖_BL P(1 ∧ σ|Z|) → 0,

again by Dominated Convergence.

Given ε > 0, fix a σ for which sup_x |ℓ_σ(x) − ℓ(x)| < ε. Then observe that |Pℓ(X_n) − Pℓ(X)| lies within 2ε of |Pℓ_σ(X_n) − Pℓ_σ(X)|, which converges to zero as n → ∞, because ℓ_σ ∈ C^∞(R). □

REMARK. There is nothing special about the choice of the N(0,1) distribution for Z in the construction of ℓ_σ. It matters only that the distribution have a density with bounded derivatives of all orders, and that differentiation under the integral sign in the convolution integral can be justified.
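The smoothing step can be checked numerically. A small Python sketch (not from the text): with ℓ(x) = min(1, |x|), the Monte Carlo version of the convolution ℓ_σ approaches ℓ uniformly as σ shrinks, in line with the bound ‖ℓ‖_BL P(1 ∧ σ|Z|).

```python
import math
import random

random.seed(1)

def ell(x):
    # Bounded Lipschitz test function (bound 1, Lipschitz constant 1).
    return min(1.0, abs(x))

zs = [random.gauss(0.0, 1.0) for _ in range(4000)]  # fixed N(0,1) sample

def ell_sigma(x, sigma):
    # Monte Carlo version of the convolution ell_sigma(x) = P ell(x + sigma*Z).
    return sum(ell(x + sigma * z) for z in zs) / len(zs)

grid = [i / 10 for i in range(-20, 21)]
gaps = []
for sigma in (0.5, 0.1, 0.02):
    gaps.append(max(abs(ell_sigma(x, sigma) - ell(x)) for x in grid))
print([round(g, 4) for g in gaps])  # shrinks as sigma decreases
```

The largest gaps occur near the kinks of ℓ (at 0 and ±1), exactly where smoothing has the most work to do.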

For weak convergence of probability measures on the real line, we can augment the equivalences from Section 1 by a further collection involving specific properties of the real line.

P_n(−∞, x] → P(−∞, x]  for all x ∈ R with P{x} = 0

P_n f → P f  for all f in C³(R)

P_n f → P f  for all f in C^∞(R)

P_n^x e^{ixt} → P^x e^{ixt}  for all real t

Further equivalences for weak convergence of probability measures on the real line. Lemma <16> handles the implications leading from C^∞(R) to weak convergence. Example <10> then gives the convergence of distribution functions. Problem [5] gives the approximation arguments leading from convergence of distribution functions (in fact, it even treats the messier case for R²) to weak convergence. The final equivalence, involving the complex exponentials, will be explained in Chapter 8.

Lindeberg's method needs only bounded derivatives up to third order. Define C³(R) to be the class of all such bounded functions. Of course convergence of expectations for functions in C³(R) is more than enough to establish convergence in distribution, because C³(R) ⊇ C^∞(R).


For a fixed f in C³(R), define

C := (1/6) sup_x |f‴(x)|,

which is finite by assumption. By Taylor's Theorem,

f(x + y) = f(x) + y f′(x) + ½ y² f″(x) + R(x, y),

where R(x, y) = (1/6) y³ f‴(x*) for some x* lying between x and x + y. Consequently,

|R(x, y)| ≤ C|y|³  for all x and y.

If X and Y are independent random variables, with P|Y|³ < ∞, then

Pf(X + Y) = Pf(X) + P(Y f′(X)) + ½ P(Y² f″(X)) + P R(X, Y).

Using independence to factorize two of the terms and bounding |R(X, Y)| by C|Y|³, we get

|Pf(X + Y) − Pf(X) − (PY)(P f′(X)) − ½(PY²)(P f″(X))| ≤ C P|Y|³.

Suppose Z is another random variable independent of X, with P|Z|³ < ∞ and PZ = PY and PZ² = PY². Subtract, cancelling out the first and second moment contributions, to leave

<17>    |Pf(X + Y) − Pf(X + Z)| ≤ C (P|Y|³ + P|Z|³).

For the particular case where Z has a N(μ, σ²) distribution, with μ := PY and σ² := var(Y), the third moment bound simplifies slightly. For convenience, write W for (Z − μ)/σ, which has a N(0,1) distribution. Then

P|Z|³ ≤ 8|μ|³ + 8σ³ P|W|³
      ≤ 8(P|Y|)³ + 8(P|Y|²)^{3/2} P|W|³
      ≤ (8 + 8 P|W|³) P|Y|³   by Jensen's inequality.

The right-hand side of <17> is therefore less than C₁ P|Y|³, where C₁ denotes the constant (9 + 8 P|W|³) C.

Now consider independent random variables ξ₁, …, ξ_k with

μ_i := Pξ_i,   σ_i² := var(ξ_i),   P|ξ_i|³ < ∞.

Independently of all the {ξ_i}, generate η_i distributed N(μ_i, σ_i²), for i = 1, …, k. Choose the {η_i} so that all 2k variables are independent. Define

S := ξ₁ + … + ξ_k   and   T := η₁ + … + η_k.

The sum T has a normal distribution with

PT = μ₁ + … + μ_k   and   var(T) = σ₁² + … + σ_k².

Repeated application of inequality <17> will lead to the third moment bound for the difference Pf(S) − Pf(T).

For each i define

X_i := ξ₁ + … + ξ_{i−1} + η_{i+1} + … + η_k,   Y_i := ξ_i,   Z_i := η_i.


The variables X_i, Y_i, and Z_i are independent. From <17>, with the upper bound simplified for the normally distributed Z_i,

|Pf(X_i + Y_i) − Pf(X_i + Z_i)| ≤ C₁ P|ξ_i|³

for each i. At i = k and i = 1 we recover the two sums of interest, X_k + Y_k = ξ₁ + … + ξ_k = S and X₁ + Z₁ = η₁ + … + η_k = T. Each substitution of a Z_i for a Y_i replaces one more ξ_i by the corresponding η_i; the k substitutions replace all the ξ_i by the η_i. The accumulated change in the expectations is bounded by a sum of third moment terms,

<18>    |Pf(S) − Pf(T)| ≤ C₁ Σ_{i≤k} P|ξ_i|³.

We have only to add an extra subscript to get the basic central limit theorem. It is cleanest to state the theorem in terms of a triangular array of random variables,

ξ_{1,1}, ξ_{1,2}, …, ξ_{1,k(1)}
ξ_{2,1}, ξ_{2,2}, …, ξ_{2,k(2)}
ξ_{3,1}, ξ_{3,2}, …, ξ_{3,k(3)}
⋮

The variables within each row are assumed independent. Nothing need be assumed about the relationship between variables in different rows; all calculations are carried out for a fixed row. By working with triangular arrays we eliminate various centering and scaling constants that might otherwise be needed.

<19> Theorem. Let ξ_{n,1}, …, ξ_{n,k(n)}, for n = 1, 2, …, be a triangular array of random variables, independent within rows, such that

(i) Σ_i Pξ_{n,i} → μ, with μ finite,

(ii) Σ_i var(ξ_{n,i}) → σ² < ∞,

(iii) Σ_i P|ξ_{n,i}|³ → 0.

Then Σ_{i≤k(n)} ξ_{n,i} ⇝ N(μ, σ²) as n → ∞.

Proof. Choose f in C³(R). Apply inequality <18> and (iii) to show that Pf(Σ_i ξ_{n,i}) equals Pf(T_n) + o(1), where T_n is N(μ_n, σ_n²) distributed, with μ_n → μ by (i) and σ_n² → σ² by (ii). Deduce (see Problem [11]) that T_n ⇝ N(μ, σ²), whence Σ_i ξ_{n,i} ⇝ N(μ, σ²). □
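As a numerical sanity check on the theorem (Python, my own illustration): the row ξ_{n,i} = s_i/√n built from fair ±1 signs satisfies the three conditions with μ = 0 and σ² = 1, and Σ_i P|ξ_{n,i}|³ = n·n^{−3/2} = 1/√n → 0, so the row sums should look like N(0,1).

```python
import math
import random

random.seed(2)

def Phi(x):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, sims = 100, 10000
hits = 0
for _ in range(sims):
    # Row sum of the triangular array: xi_{n,i} = +-1/sqrt(n), prob 1/2 each.
    s = sum(1 if random.random() < 0.5 else -1 for _ in range(n)) / math.sqrt(n)
    hits += s <= 0.5
freq = hits / sims
print(round(freq, 3), "vs", round(Phi(0.5), 3))
```

The empirical frequency of {row sum ≤ 0.5} lands close to Φ(0.5) ≈ 0.69, as the theorem predicts.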

<20> Exercise. If X_n has a Bin(n, p_n) distribution and if n p_n(1 − p_n) → ∞, show that

(X_n − n p_n)/√(n p_n(1 − p_n)) ⇝ N(0, 1).

SOLUTION: Manufacture a random variable with the same distribution as the standardized X_n as follows. Let A_{n,1}, …, A_{n,n} be independent events with PA_{n,i} = p_n for i = 1, …, n. Define

ξ_{n,i} := ({A_{n,i}} − p_n)/σ_n   where σ_n := √(n p_n(1 − p_n)).

Then Σ_{i≤n} ξ_{n,i} has the same distribution as (X_n − n p_n)/σ_n.


Check the conditions of Theorem <19> with μ := 0 and σ² := 1 for these ξ_{n,i}. The centering was chosen to give Pξ_{n,i} = 0, so (i) holds. By direct calculation,

var(ξ_{n,i}) = p_n(1 − p_n)/σ_n² = 1/n,

so Σ_i var(ξ_{n,i}) = 1. Requirement (ii) holds. Finally, because |{A_{n,i}} − p_n| ≤ 1,

Σ_i P|ξ_{n,i}|³ ≤ σ_n^{−1} Σ_i P|ξ_{n,i}|² = σ_n^{−1} → 0.

It follows that Σ_{i≤n} ξ_{n,i} ⇝ N(0, 1), as required. □
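A quick simulation of the Exercise (Python, not part of the text): the standardized Binomial count has mean near 0 and variance near 1 once n p_n (1 − p_n) is reasonably large.

```python
import math
import random

random.seed(3)

n, p = 500, 0.04                      # n*p*(1-p) = 19.2
sigma = math.sqrt(n * p * (1 - p))
sims = 5000
vals = []
for _ in range(sims):
    x = sum(random.random() < p for _ in range(n))   # Bin(n, p) draw
    vals.append((x - n * p) / sigma)                 # standardized count
mean = sum(vals) / sims
var = sum(v * v for v in vals) / sims - mean ** 2
print(round(mean, 3), round(var, 3))  # should be near 0 and 1
```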

Theorem <19> can be extended to random variables that don't satisfy the moment conditions—indeed, to variables that don't even have finite moments—by means of truncation arguments. The theorem for independent, identically distributed random variables with finite variances illustrates the truncation technique well. The method of proof will delight all fans of Dominated Convergence, which once again emerges as the right tool to handle truncation arguments for identically distributed variables.

<21> Theorem. Let X₁, X₂, … be independent, identically distributed random variables with PX₁ = 0 and PX₁² = 1. Then (X₁ + … + X_n)/√n ⇝ N(0, 1).

Proof. The argument will depend on three applications of Dominated Convergence, with dominating variable X₁².

Apply Theorem <19> to the variables ξ_{n,i} := X_i{|X_i| ≤ √n}/√n, for i = 1, …, n. Notice that

Pξ_{n,i} = −P X_i{|X_i| > √n}/√n   because PX_i = 0,

which gives the bound

|Σ_i Pξ_{n,i}| ≤ P X₁²{|X₁| > √n} → 0.

To control the sum of variances, use the identical distributions together with the fact that Pξ_{n,i} = μ_n/n, with μ_n := Σ_i Pξ_{n,i} → 0, to deduce that

Σ_i var(ξ_{n,i}) = P X₁²{|X₁| ≤ √n} − μ_n²/n → 1.

For the third moment bound use

Σ_i P|ξ_{n,i}|³ ≤ P(X₁² (1 ∧ |X₁|/√n)) → 0.

It follows that Σ_i ξ_{n,i} ⇝ N(0, 1). Complete the proof via an appeal to Example <6>, using the inequality

P{Σ_{i≤n} X_i/√n ≠ Σ_i ξ_{n,i}} ≤ n P{|X₁| > √n} ≤ P X₁²{|X₁| > √n} → 0

to show that Σ_{i≤n} X_i/√n − Σ_i ξ_{n,i} → 0 in probability. □
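The truncation step can be watched in action. In this Python sketch (my own construction, not the book's), the symmetric Pareto-type variables have PX = 0 and PX² = 3 but P|X|³ = ∞, so Theorem <19> cannot apply directly; yet truncation at √n changes the normalized sum on only a vanishing fraction of realizations.

```python
import math
import random

random.seed(4)

def sample_X():
    # Symmetric heavy-tailed variable: |X| is Pareto(3) (P{|X| > t} = t**-3
    # for t >= 1), sign is a fair coin.  EX = 0, EX^2 = 3, E|X|^3 = infinity.
    mag = (1.0 - random.random()) ** (-1.0 / 3.0)
    return mag if random.random() < 0.5 else -mag

EX2 = 3.0
n, sims = 400, 4000
affected = 0
for _ in range(sims):
    xs = [sample_X() for _ in range(n)]
    s_full = sum(xs) / math.sqrt(n * EX2)
    s_trunc = sum(x for x in xs if abs(x) <= math.sqrt(n)) / math.sqrt(n * EX2)
    affected += s_full != s_trunc
frac = affected / sims
print("fraction of sums changed by truncation:", round(frac, 3))
# roughly n * P{|X| > sqrt(n)} = n**(-1/2) = 0.05 for this n
```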

Similar truncation techniques can be applied to derive even more general forms of the central limit theorem from Theorem <19>. Often we have to deal


with conditions expressible as H_n(ε) → 0 for each ε > 0, for some sequence of functions H_n. It is convenient to be able to replace ε by a sequence {ε_n} that converges to zero and also have H_n(ε_n) → 0. Roughly speaking, if a condition holds for all fixed ε then it will also hold for sequences ε_n tending to zero slowly enough.

<22> Lemma. Suppose H_n(ε) → 0 as n → ∞, for each fixed ε > 0. Then there exists a sequence ε_n → 0 such that H_n(ε_n) → 0.

Proof. For each positive integer k there exists an n_k such that |H_n(1/k)| < 1/k for n ≥ n_k. Without loss of generality, assume n₁ < n₂ < …. Define

ε_n := arbitrary for n < n₁,   ε_n := 1/k for n_k ≤ n < n_{k+1}.

That is, for n ≥ n₁, we have ε_n = 1/k_n, where k_n is the positive integer k for which n_k ≤ n < n_{k+1}. Clearly k_n → ∞ as n → ∞. Also, for n ≥ n₁, we have |H_n(ε_n)| ≤ 1/k_n, which converges to zero as n → ∞. □
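The proof is effectively an algorithm. A Python rendering (mine; `eps_sequence` is an invented name, and for simplicity it assumes H(·, ε) is decreasing in n, so a single check locates each n_k):

```python
def H(n, eps):
    # Example sequence: H_n(eps) = 1/(n*eps) -> 0 for each fixed eps > 0,
    # yet H_n(1/n) = 1 for every n, so eps_n must shrink slowly enough.
    return 1.0 / (n * eps)

def eps_sequence(H, N):
    """eps_1..eps_N following Lemma <22>: n_k := first n with H(n, 1/k) < 1/k,
    then eps_n := 1/k on n_k <= n < n_{k+1} (and an arbitrary 1.0 before n_1)."""
    n_of = {}
    k, n = 1, 1
    while True:
        while n <= N and H(n, 1.0 / k) >= 1.0 / k:
            n += 1
        if n > N:
            break
        n_of[k] = n
        k += 1
    eps = []
    for m in range(1, N + 1):
        ks = [k for k, nk in n_of.items() if nk <= m]
        eps.append(1.0 / max(ks) if ks else 1.0)
    return eps

N = 10000
eps = eps_sequence(H, N)
print(eps[-1], H(N, eps[-1]))  # both small: eps_n -> 0 and H_n(eps_n) -> 0
```

For this H the construction yields ε_n of order 1/√n, which is indeed "tending to zero slowly enough."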

The form of the central limit theorem in the next Exercise is essentially due to Lindeberg (1922). The result actually includes Theorem <21> as a special case.

<23> Exercise. Let {X_{n,i}} be a triangular array of random variables, independent within each row, such that:

(i) PX_{n,i} = 0 for all n and i;

(ii) Σ_i PX_{n,i}² = 1;

(iii) L_n(ε) := Σ_i P X_{n,i}²{|X_{n,i}| > ε} → 0 for each ε > 0 [Lindeberg's condition].

Show that Σ_i X_{n,i} ⇝ N(0, 1).

SOLUTION: Invoke Lemma <22> with H_n(ε) := L_n(ε)/ε² to find ε_n tending to zero slowly enough to ensure that L_n(ε_n)/ε_n² → 0. Define a triangular array of random variables ξ_{n,i} := X_{n,i}{|X_{n,i}| ≤ ε_n}. Notice that

P{ξ_{n,i} ≠ X_{n,i} for some i} ≤ Σ_i P{|X_{n,i}| > ε_n} ≤ L_n(ε_n)/ε_n² → 0.

By Example <6> it suffices to show that Σ_i ξ_{n,i} ⇝ N(0, 1).

Check the conditions of Theorem <19>. For the first moments:

|Σ_i Pξ_{n,i}| = |Σ_i P X_{n,i}{|X_{n,i}| > ε_n}| ≤ L_n(ε_n)/ε_n → 0.

For the variances, use the fact that Pξ_{n,i} = −P X_{n,i}{|X_{n,i}| > ε_n} to show that

Σ_i var(ξ_{n,i}) = Σ_i P X_{n,i}²{|X_{n,i}| ≤ ε_n} − Σ_i (P X_{n,i}{|X_{n,i}| > ε_n})².

The first sum on the right-hand side equals Σ_i PX_{n,i}² − L_n(ε_n), which tends to 1. The second sum is bounded above by L_n(ε_n). For the third moments:

Σ_i P|ξ_{n,i}|³ ≤ ε_n Σ_i P ξ_{n,i}² ≤ ε_n → 0.

The central limit theorem follows. □
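For a discrete triangular array, Lindeberg's quantity L_n(ε) is a finite sum and can be computed exactly. A Python illustration (my own toy model, not from the text) with rows of n variables taking values ±1/√n:

```python
import math

def lindeberg_L(row, eps):
    """L_n(eps) = sum_i P X_{n,i}^2 {|X_{n,i}| > eps} for a row of discrete
    variables; each variable is given as a list of (value, probability) pairs."""
    return sum(x * x * p
               for dist in row
               for x, p in dist
               if abs(x) > eps)

def signs_row(n):
    # n iid variables taking +-1/sqrt(n) with probability 1/2 each:
    # sum of variances is 1, and L_n(eps) vanishes once 1/sqrt(n) <= eps.
    v = 1.0 / math.sqrt(n)
    return [[(v, 0.5), (-v, 0.5)] for _ in range(n)]

Ls = [lindeberg_L(signs_row(n), eps=0.1) for n in (4, 100, 2500)]
print(Ls)  # the first row still carries mass beyond eps; later rows give 0
```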

3. Multivariate limit theorems

The arguments for proving convergence in distribution of random R^k vectors—multivariate limit theorems—are similar to those for random variables. Indeed,


with an occasional reinterpretation of a square as a squared length of a vector, and a product as an inner product, the arguments in Section 2 carry over to random vectors.

Perhaps the only subtlety in the multivariate analog of Theorem <19> arises from the factorization of quadratic terms like P(Y′f″(X)Y), with Y independent of the k × k random matrix of second derivatives f″(X). We could resort to an explicit expansion into a sum of terms P(Y_i Y_j f″_{ij}(X)), but it is more elegant to reinterpret the quadratic as the expected trace of a 1 × 1 matrix, then rearrange it as

P trace(Y′f″(X)Y) = P trace(f″(X)YY′) = trace((P f″(X)) (P YY′)).

We are again in the position to approximate Pf(Σ_i X_i) by means of a sequence of substitutions of variables with the same expected values and covariances.

The new variables should be chosen as multivariate normals. For each μ ∈ R^k and each nonnegative definite matrix V, the N(μ, V) could be defined as the distribution of μ + RW, with W a vector of independent N(0,1)'s and R any k × k matrix for which RR′ = V. It is easy to check that

P(μ + RW) = μ + R(PW) = μ   and   var(μ + RW) = R var(W) R′ = RR′ = V.

The Fourier tools from Chapter 8 offer the simplest method for showing that the distribution does not depend on the particular choice of R.

Problem [14] shows how to adapt the one-dimensional calculations to derive a multivariate analog of approximation <18>. The assertion of Theorem <19> holds if we reinterpret σ² to be a variance matrix. The derivations of other multivariate central limit theorems follow in much the same way as before.

<24> Example. If X₁, X₂, … are independent, identically distributed random vectors with PX_i = 0 and P|X_i|² < ∞, then (X₁ + … + X_n)/√n ⇝ N(0, V), where V := P(X₁X₁′). □
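A two-dimensional Monte Carlo check of the Example (Python, stdlib only; my own construction): for X_i = (u, u + v/2) with u, v independent Uniform(−1/2, 1/2), the normalized sums have empirical covariance close to V = P(X₁X₁′), whose entries are 1/12, 1/12, and 5/48.

```python
import math
import random

random.seed(6)

n, sims = 300, 3000
c00 = c01 = c11 = 0.0
for _ in range(sims):
    s1 = s2 = 0.0
    for _ in range(n):
        u = random.random() - 0.5       # Uniform(-1/2, 1/2): mean 0, var 1/12
        v = random.random() - 0.5
        s1 += u
        s2 += u + v / 2.0               # correlated second coordinate
    z1, z2 = s1 / math.sqrt(n), s2 / math.sqrt(n)
    c00 += z1 * z1
    c01 += z1 * z2
    c11 += z2 * z2
c00, c01, c11 = c00 / sims, c01 / sims, c11 / sims
print(round(c00, 3), round(c01, 3), round(c11, 3))
# targets: 1/12 = 0.083..., 1/12, and 1/12 + 1/48 = 5/48 = 0.104...
```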

In short, the multivariate versions of the results from Section 2 present little extra challenge if we merely translate the methods into vector notation. Fans of the more traditional approach to the theory (based on pointwise convergence of distribution functions) might find the extensions via multivariate distribution functions more tedious. Textbooks seldom engage in such multivariate exercises, because there is a more pleasant alternative: By means of a simple Fourier device (to be discussed in Section 8.6), multivariate results can often be reduced directly to their univariate analogs. There is not much incentive to engage in multivariate proofs, except as a way of deriving explicit error bounds (see Section 10.4).

4. Stochastic order symbols

You are probably familiar with the O(·) and o(·) notation from real analysis. They allow one to avoid specifying constants in many arguments, thereby simplifying the notational load. For example, it is neater to write

f(x) = f(0) + x f′(0) + o(|x|) near 0,


than to write out the formal epsilon-delta details of the limit properties that define differentiability of f at 0.

The order symbols have stochastic analogs, which are almost indispensable in advanced asymptotic theory. As with any very concise notation, it is easy to conceal subtle errors if the symbols are not used carefully; but without them, all but the simplest arguments become notationally overwhelming.

<25> Definition. For random vectors {X_n} and nonnegative random variables {α_n}, write X_n = O_p(α_n) to mean: for each ε > 0 there is a finite constant M_ε such that P{|X_n| > M_ε α_n} < ε eventually. Write X_n = o_p(α_n) to mean: P{|X_n| > ε α_n} → 0 for each ε > 0.

Typically {α_n} is a sequence of constants, but occasionally random bounds are useful.

REMARK. The notation allows us to write things like o_p(α_n) = O_p(α_n), but not O_p(α_n) = o_p(α_n), meaning that if X_n is of order o_p(α_n) then it is also of order O_p(α_n), but not conversely. It might help to think of O_p(·) and o_p(·) as defining classes of sequences of random variables, and perhaps even to write X_n ∈ O_p(α_n) or X_n ∈ o_p(α_n) instead of X_n = O_p(α_n) or X_n = o_p(α_n).

<26> Example. The assertion X_n = o_p(1) means the same as X_n → 0 in probability. When specialized to random vectors, the result in Example <6> becomes: if X_n ⇝ P then X_n + o_p(1) ⇝ P. The o_p(1) here replaces the sequence X_n − X_n′. □

<27> Example. If X_n ⇝ P then X_n = O_p(1). From <9> with G as the open ball of radius M centered at the origin, liminf P{X_n ∈ G} ≥ PG. If M is large enough, PG > 1 − ε, which implies that P{|X_n| ≥ M} < ε eventually. □

<28> Example. Be careful with the interpretation of an assertion such as O_p(1) + O_p(1) = O_p(1). The three O_p(1) symbols do not refer to the same sequence of random vectors; it would be a major blunder to cancel out the O_p(1) to conclude that O_p(1) = 0. The assertion is actually shorthand for: if X_n = O_p(1) and Y_n = O_p(1) then X_n + Y_n = O_p(1).

The assertion is easy to verify. Given ε > 0, choose a constant M so that P{|X_n| > M} < ε and P{|Y_n| > M} < ε eventually. Then, eventually,

P{|X_n + Y_n| > 2M} ≤ P{|X_n| > M} + P{|Y_n| > M} < 2ε.

If you worry about the 2ε in the bound, replace ε by ε/2 throughout the previous paragraph. □

<29> Example. If {X_n} is a sequence of real random variables of order O_p(1), what can be asserted about {1/X_n}? Nothing. The stochastic order symbol O_p(·) conveys no information about lower bounds. For example, if X_n = 1/n then X_n = O_p(1) but 1/X_n = n → ∞. You should invent other examples.

The blunder of asserting 1/O_p(1) = O_p(1) is quite common. Be warned. □

<30> Example. For a sequence of constants {a_n} that tends to zero, suppose X_n = O_p(a_n). Let g be a function, defined on the range space of the {X_n}, for which g(x) = o(|x|) near 0. Then the random variables g(X_n) are of order o_p(a_n). To prove the assertion, for given ε > 0 and ε′ > 0, find M and then δ > 0 such that P{|X_n| > M a_n} < ε′ eventually, and |g(x)| ≤ ε|x|/M when |x| ≤ δ. When n is large enough, we have M a_n ≤ δ and

P{|g(X_n)| > ε a_n} ≤ P{|X_n| > M a_n} < ε′.

That is, g(X_n) = o_p(a_n), or, more cryptically, o(O_p(a_n)) = o_p(a_n). □

<31> Example. The so-called delta method gives a simple way to analyze smooth transformations of sequences of random vectors. Suppose x₀ is a fixed vector in R^k, and {X_n} is a sequence of random vectors for which Z_n = √n(X_n − x₀) ⇝ Z. Suppose g is a measurable function from R^k into R^ℓ that is differentiable at x₀. That is, there exists an ℓ × k matrix D such that

g(x₀ + δ) = g(x₀) + Dδ + R(δ),

where |R(δ)| = o(|δ|) as δ → 0. If we replace δ by the random quantity Z_n/√n we get √n(g(X_n) − g(x₀)) = DZ_n + √n R(Z_n/√n). From Example <30> we have R(Z_n/√n) = o(O_p(1/√n)) = o_p(1/√n), from which it follows, via Example <26>, that √n(g(X_n) − g(x₀)) = DZ_n + o_p(1) ⇝ DZ. □
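A delta-method simulation in Python (my example, not the book's): with X̄_n the mean of n Exp(1) draws, x₀ = 1, and g(x) = x², the derivative is g′(1) = 2, so √n(g(X̄_n) − g(1)) should be approximately N(0, 4).

```python
import math
import random

random.seed(7)

n, sims = 200, 4000
vals = []
for _ in range(sims):
    # Sample mean of n Exp(1) draws (inverse-CDF sampling): mean 1, var 1/n.
    xbar = sum(-math.log(1.0 - random.random()) for _ in range(n)) / n
    # Delta method: sqrt(n)*(g(xbar) - g(1)) with g(x) = x**2, so D = g'(1) = 2.
    vals.append(math.sqrt(n) * (xbar ** 2 - 1.0))
mean = sum(vals) / sims
var = sum((v - mean) ** 2 for v in vals) / sims
print(round(mean, 2), round(var, 2))  # near 0 and near (g'(1))**2 * var(X) = 4
```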

*5. Weakly convergent subsequences

In a compact metric space, each sequence has a convergent subsequence. Sequences of probability measures that concentrate most of their mass on compact subsets of a metric space have a similar property, a result that provides a powerful method for proving existence of probability measures.

<32> Definition. A probability measure P on B(X) is said to be tight if to each positive ε there exists a compact set K_ε such that PK_ε > 1 − ε.

For the purposes of weak convergence arguments it is more convenient to have tight measures identified with particular linear functionals on BL(X), with X a metric space. The following characterization is a special case of a result proved in Section 6 of Appendix A.

<33> Theorem. A linear functional λ : BL(X)⁺ → R⁺ with λ1 = 1 defines a tight probability measure if and only if it is functionally tight: to each positive ε there exists a compact set K_ε such that λℓ < ε for every ℓ in BL(X)⁺ for which ℓ ≤ K_ε^c.

In order that a limit functional on BL(X)⁺ inherit the functional tightness property from a convergent sequence, it suffices that an analogous property hold "uniformly along the sequence." It turns out that a property slightly weaker than requiring sup_n P_n K_ε^c < ε is enough.

<34> Definition. Call a sequence of probability measures {P_n} on the Borel sigma-field of a metric space uniformly tight if to each ε > 0 there exists a compact set K_ε such that limsup_{n→∞} P_n G^c < ε for every open set G containing K_ε.

Uniform tightness implies (and, apart from a few inconsequential constant factors, is equivalent to) the assertion that for each ε > 0 there is a compact set K_ε such that


limsup_{n→∞} P_n ℓ ≤ 2ε for every ℓ in BL(X)⁺ for which 0 ≤ ℓ ≤ K_ε^c: for such an ℓ, the open set G_ε := {ℓ < ε} contains K_ε and

<35>    P_n ℓ ≤ ε + P_n G_ε^c < 2ε   eventually.

This equivalent form of uniform tightness is better suited to the passage to the limit.

<36> Theorem. (Prohorov 1956, Le Cam 1957) Every uniformly tight sequence of probability measures on the Borel sigma-field of a metric space has a subsequence that converges weakly to a tight probability measure.

Construction of the limit distribution, in the form of a tight linear functional on BL(X)⁺, will be achieved by a Cantor diagonalization argument, applied to a countable family of functions of the type described in the following Lemma.

<37> Lemma. For each δ > 0, ε > 0, and each compact set K there exists a finite collection 𝒢 = {g₀, g₁, …, g_k} ⊆ BL(X)⁺ such that:

(i) g₀(x) + g₁(x) + … + g_k(x) = 1 for each x ∈ X;

(ii) the diameter of each set {g_i > 0}, for i ≥ 1, is less than δ;

(iii) g₀ ≤ ε on K.

REMARK. A finite collection of nonnegative, continuous functions that sum to one everywhere is called a continuous partition of unity.

Proof. Let x₁, …, x_k be the centers of open balls of radius δ/4 whose union covers the compact set K. Define functions f₀ ≡ ε/2 and f_i(x) := (1 − 2d(x, x_i)/δ)⁺, for i ≥ 1, in BL(X)⁺. Notice that f_i(x) = 0 if d(x, x_i) ≥ δ/2, for i ≥ 1. Thus the set {f_i > 0} has diameter less than δ for i ≥ 1.

The function F(x) := Σ_{i=0}^k f_i(x) is everywhere greater than ε/2, and it belongs to BL(X)⁺. The nonnegative functions g_i := f_i/F are bounded by 1 and satisfy a Lipschitz condition:

|g_i(x) − g_i(y)| = |F(y)f_i(x) − F(x)f_i(y)| / (F(x)F(y)) ≤ (F(y)|f_i(x) − f_i(y)| + |F(y) − F(x)| f_i(y)) / (F(x)F(y)).

For each x in K there is an i for which d(x, x_i) < δ/4. For that i we have f_i(x) ≥ 1/2 and g₀(x) ≤ f₀(x)/f_i(x) ≤ ε. The g_i sum to 1 everywhere. They are the required functions. □
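The construction in the proof is concrete enough to code. A Python sketch on the real line (mine; `partition_of_unity` is an invented helper), with K = [0, 1]:

```python
def partition_of_unity(centers, delta, eps):
    """g_0, ..., g_k as in Lemma <37> on the real line:
    f_0 = eps/2, f_i(x) = (1 - 2|x - x_i|/delta)^+, g_i = f_i / sum_j f_j."""
    k = len(centers)

    def f(i, x):
        if i == 0:
            return eps / 2.0
        return max(0.0, 1.0 - 2.0 * abs(x - centers[i - 1]) / delta)

    def g(i, x):
        return f(i, x) / sum(f(j, x) for j in range(k + 1))

    return g, k

delta, eps = 0.4, 0.1
centers = [j * delta / 4.0 for j in range(11)]   # balls of radius delta/4 cover [0,1]
g, k = partition_of_unity(centers, delta, eps)

xs = [t / 100.0 for t in range(-50, 151)]        # grid extending beyond K = [0,1]
sums = [sum(g(i, x) for i in range(k + 1)) for x in xs]
g0_on_K = max(g(0, x) for x in xs if 0.0 <= x <= 1.0)
print(round(max(abs(s - 1.0) for s in sums), 12), round(g0_on_K, 3))
```

The g_i sum to 1 everywhere by construction, and g₀ stays below ε on K because every point of [0, 1] lies within δ/4 of some center.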

Proof of Theorem <36>. Write K_i for the compact set given by Definition <34> with ε := 1/i. For each i in N write 𝒢_i for the finite collection of functions in BL(X)⁺ constructed via Lemma <37> with δ := ε := 1/i and K equal to K_i. The class 𝒢 := ∪_{i∈N} 𝒢_i is countable.

For each g in 𝒢 the sequence of real numbers P_n g is bounded. It has a convergent subsequence. Via a Cantor diagonalization argument, we can construct a single subsequence N₁ ⊆ N for which lim_{n∈N₁} P_n g exists for every g in 𝒢.

The approximation properties of 𝒢 will ensure existence of the limit λℓ := lim_{n∈N₁} P_n ℓ for every ℓ in BL(X)⁺. With no loss of generality, suppose ‖ℓ‖_BL ≤ 1.


Given ε > 0, choose an i > 1/ε, then write 𝒢_i = {g₀, g₁, …, g_k}, with indexing as in Lemma <37>. The open set G := {g₀ < ε} contains K_i, which ensures that limsup_n P_n G^c < ε.

For each j ≥ 1 let x_j be any point at which g_j(x_j) > 0. If x is any other point with g_j(x) > 0 we have |ℓ(x) − ℓ(x_j)| ≤ d(x, x_j) < ε. It follows that, for every x in X,

|ℓ(x) − Σ_{j≥1} g_j(x)ℓ(x_j)| ≤ g₀(x) + Σ_{j≥1} g_j(x)|ℓ(x) − ℓ(x_j)| ≤ {x ∈ G^c} + 2ε,

which integrates to give

|P_n ℓ − Σ_{j≥1} ℓ(x_j) P_n g_j| ≤ P_n G^c + 2ε.

It then follows, via the existence of lim_{n∈N₁} P_n g_j, that limsup_{n∈N₁} P_n ℓ differs from liminf_{n∈N₁} P_n ℓ by at most 6ε. The limit λℓ := lim_{n∈N₁} P_n ℓ exists.

The limit functional λ inherits linearity from the P_n. Clearly λ1 = 1. It inherits tightness from the uniform tightness of the sequence, as in <35>. From Theorem <33>, the functional λ corresponds to a tight probability measure to which {P_n : n ∈ N₁} converges in distribution. □

REMARK. For readers who know more about general topology: The Cantor diagonalization argument could be replaced by an argument with ultrafilters, or universal subnets, for uniformly tight nets of probability measures on more general topological spaces. Lemma <37> was needed only to allow us to work with sequences; it could be avoided.

<38> Example. Let {X_n} be a sequence of R^k-valued random vectors of order O_p(1). If P{|X_n| > M} < ε eventually, then limsup P{X_n ∉ G} ≤ ε for every open set G that contains the compact ball {x : |x| ≤ M}. That is, {X_n} is uniformly tight. It has a subsequence that converges in distribution to a probability measure on B(R^k). □

6. Problems

[1] Let B be a subset of a metric space X. For each pair of points x₁ and x₂ in X, show that inf_{y∈B} d(x₁, y) ≤ d(x₁, x₂) + inf_{y∈B} d(x₂, y). Deduce that the function f_B(x) := d(x, B) satisfies the Lipschitz condition |f_B(x) − f_B(y)| ≤ d(x, y).

[2] For real-valued functions f, g on X, prove that ‖fg‖_BL ≤ ‖f‖_BL ‖g‖_BL.

[3] Let B be a subset of a metric space X. Show that

{x : d(x, B) = 0} = B̄ := closure of B,

{x : d(x, B^c) > 0} = B° := interior of B,

{x : d(x, B) = 0 = d(x, B^c)} = B̄\B° = ∂B = boundary of B.

Hint: If d(x, B) = 0, there exist points x_n ∈ B with d(x, x_n) → 0.

[4] Let P be a probability measure on a separable metric space, and B be a Borel set for which P(∂B) > 0.

(i) For each ε > 0, show that there is a partition of ∂B into disjoint Borel sets {D_i : i ∈ N}, each with diameter less than ε. Hint: Consider the union of balls of radius ε/3 centered at the points of a countable dense subset of ∂B.

(ii) For each i, find points x_i ∈ B and y_i ∈ B^c such that d(x_i, D_i) < ε and d(y_i, D_i) < ε. Define a probability measure P′_ε by replacing all the P mass in D_i by a point mass (of size PD_i) at x_i, for each i. Define P″_ε similarly, by concentrating the mass in each D_i at y_i. Show that P′_ε B = P B̄ and P″_ε B = P B° for each ε > 0.

(iii) Show that, for each ℓ with ‖ℓ‖_BL := K < ∞,

|Pℓ − P′_ε ℓ| ≤ Kε P(∂B)   and   |Pℓ − P″_ε ℓ| ≤ Kε P(∂B).

Deduce that P′_ε ⇝ P and P″_ε ⇝ P as ε → 0, even though P′_ε B = P B̄ and P″_ε B = P B°, which cannot both equal PB, because P(∂B) > 0.

[5] For x = (x₁, x₂) ∈ R², define Q_x := {(y₁, y₂) ∈ R² : y₁ ≤ x₁, y₂ ≤ x₂}, the quadrant with vertex x. Let {X_n} be a sequence of random vectors and P be a probability measure such that P{X_n ∈ Q_x} → PQ_x for all x such that P(∂Q_x) = 0. Show that X_n ⇝ P by following these steps.

(i) Write L for the class of all lines parallel to a coordinate axis. Show that all except countably many lines in L have zero P measure. (Hint: How many horizontal lines can have P measure greater than 1/n?)

(ii) Given ℓ ∈ BL(R²), with 0 ≤ ℓ ≤ 1, show that there exist disjoint rectangles S₁, …, S_m with sides parallel to the coordinate axes, such that (a) P(∂S_i) = 0 for each i; (b) the set B := ∪_i S_i has P measure greater than 1 − ε; (c) the function ℓ oscillates by less than ε on each S_i.

(iii) Choose arbitrary points x_i in S_i, for i = 1, …, m. Define the function g_ε(x) := Σ_i {x ∈ S_i} ℓ(x_i). Show that |g_ε(x) − ℓ(x)| ≤ ε + {x ∉ B} for all x.

(iv) Use a sandwiching argument to show that Pℓ(X_n) → Pℓ, then deduce that X_n ⇝ P.

[6] Let X be a metric space. Show that the map ψ : (x, y) ↦ d(x, y) is continuous. Deduce that ψ is B(X²)\B(R)-measurable. If X is separable, deduce that ψ is B(X) ⊗ B(X)\B(R)-measurable, and hence that ω ↦ d(X(ω), Y(ω)) is measurable if X and Y are random elements of X.

[7] If Pℓ = Qℓ for each ℓ in BL(X), show that P = Q as measures on B(X).

[8] Let X be a separable metric space. Show that Δ(P, Q) := sup{|Pℓ − Qℓ| : ‖ℓ‖_BL ≤ 1} is a metric for weak convergence. That is, show that P_n ⇝ P if and only if Δ(P_n, P) → 0, by following these steps. (Read the proof of Lemma <37> for hints.)

(i) Show that Δ is a metric, and |Pℓ − Qℓ| ≤ ‖ℓ‖_BL Δ(Q, P), for each ℓ. Deduce that P_nℓ → Pℓ, for each ℓ ∈ BL(X), if Δ(P_n, P) → 0.

(ii) Let X₀ := {x_i : i ∈ N} be dense in X. For a fixed ε > 0, define f₀ ≡ ε and f_i(x) := (1 − d(x, x_i)/ε)⁺. Define G_k := ∪_{i=1}^k {f_i > 1/2}. Choose k so that PG_k^c < ε. Show that each function ℓ_i := f_i/Σ_{0≤j≤k} f_j belongs to BL(X). Show that ℓ₀(x) ≤ 2ε for x ∈ G_k. Show that Σ_{i=0}^k ℓ_i ≡ 1. Show that diam{ℓ_i > 0} ≤ 2ε for i ≥ 1.

(iii) For each i ≥ 1, choose an x_i from {ℓ_i > 0}. Show that

|ℓ(x) − Σ_{i≥1} ℓ_i(x)ℓ(x_i)| ≤ 4ε + {x ∈ G_k^c}   if ‖ℓ‖_BL ≤ 1.

(iv) For each probability measure Q, and each ℓ with ‖ℓ‖_BL ≤ 1, show that

|Qℓ − Pℓ| ≤ 8ε + QG_k^c + PG_k^c + Σ_{i=1}^k |Qℓ_i − Pℓ_i|.

(v) If P_n ⇝ P, deduce that Δ(P_n, P) ≤ 10ε eventually.

[9] Let Y and Z be metric spaces, such that Y is separable. Let d be the metric on Y × Z defined in Example <14>. Let P₀, P be probability measures on B(Y), and Q₀, Q be probability measures on B(Z). For a fixed ℓ in BL(Y × Z), define h_P(z) := P^y ℓ(y, z) and g_Q(y) := Q^z ℓ(y, z).

(i) Show that ‖h_P‖_BL ≤ ‖ℓ‖_BL and ‖g_Q‖_BL ≤ ‖ℓ‖_BL.

(ii) Let Δ_Y and Δ_Z denote the analogs of the metric from Problem [8]. Show that

|(P ⊗ Q)ℓ − (P₀ ⊗ Q₀)ℓ| ≤ |P g_Q − P₀ g_Q| + |Q h_{P₀} − Q₀ h_{P₀}|.

(iii) Show that Δ_{Y×Z}(P ⊗ Q, P₀ ⊗ Q₀) ≤ Δ_Y(P, P₀) + Δ_Z(Q, Q₀).

(iv) If P_n ⇝ P₀ and Q_n ⇝ Q₀, show that P_n ⊗ Q_n ⇝ P₀ ⊗ Q₀, even if Z is not separable. (For separable Z, the result follows from (iii); otherwise use (ii).)

[10] Suppose {X_n} are random elements of X, and P is a probability measure on B(X), for which P{X_n ∈ B} → PB for every P-continuity set B. Let f be a bounded measurable function on X (with no loss of generality assume 0 ≤ f ≤ 1) that is continuous at all points except those of a P-negligible set N.

(i) For each real t, show that the boundary of the set {f > t} is contained in N ∪ {f = t}. Deduce that {f > t} is a P-continuity set for almost all (Lebesgue measure) t. Hint: Consider sequences x_n → x and y_n → x with f(x_n) > t ≥ f(y_n).

(ii) Show that Pf(X_n) = ∫₀¹ P{f(X_n) > t} dt → Pf.

[11] Suppose μ_n → μ and σ_n² → σ², with both limits finite. Let Z have a N(0,1) distribution. Show that |Pℓ(μ_n + σ_n Z) − Pℓ(μ + σZ)| ≤ ‖ℓ‖_BL P(1 ∧ (|μ_n − μ| + |σ_n − σ| |Z|)). Deduce, by Dominated Convergence, that N(μ_n, σ_n²) ⇝ N(μ, σ²).

[12] Suppose X_n has a N(μ_n, σ_n²) distribution, and X_n ⇝ P.

(i) Show that μ := lim μ_n and σ² := lim σ_n² must exist as finite limits, and that P must be the N(μ, σ²) distribution. Hint: Choose M with P{M} = 0 = P{−M} and P[−M, M] > 3/4. If |μ_n| > M or if σ_n is large enough, show that P{|X_n| > M} ≥ 1/2. Show that all convergent subsequences of {(μ_n, σ_n)} must converge to the same limit.

(ii) Extend the result to sequences of random vectors. Hint: Use part (i) to prove boundedness of {μ_n} and each diagonal element of {V_n}. Use Cauchy-Schwarz to deduce that all elements of {V_n} are bounded.

[13] Suppose the random variables {X_n} converge in distribution to X, and that {A_n} and {B_n} are sequences of constants with A_n → A and B_n → B (both limits finite). Show that A_nX_n + B_n ⇝ AX + B. Generalize to random vectors.

[14] Let Y be a random k-vector with μ := PY and V := var(Y). Let V have the representation V = LΛ²L′, with L an orthogonal matrix and Λ := diag(λ₁, …, λ_k), each λ_i nonnegative. Define R := LΛ. Let W be a random k-vector of independent N(0,1) random variables.

(i) Show that |μ| ≤ P|Y|. Hint: For a unit vector u in the direction of μ, use the fact that u′Y ≤ |Y|.

(ii) Show that P|RW|³ = P(Σ_i λ_i² W_i²)^{3/2} ≤ (trace V)^{3/2} P|N(0,1)|³.

(iii) Show that P|μ + RW|³ ≤ 8 P|Y|³ + 8 (P|Y|²)^{3/2} P|N(0,1)|³.

[15] Let f be a bounded, measurable, real-valued function on the real line. Let k be a Lebesgue integrable function, with derivative k′. Suppose there exists a Lebesgue integrable function M with |k(x + δ) − k(x)| ≤ |δ|M(x) for all |δ| ≤ 1 and all x.

(i) Define g(x) := ∫ f(x + y)k(y) dy = ∫ f(z)k(z − x) dz. Use Dominated Convergence to justify differentiation under the integral sign, to prove that g is differentiable, with derivative g′(x) = −∫ f(x + y)k′(y) dy.

(ii) Let k(x) := p(x) exp(−x²/2), with p a polynomial in x. Show that k and each of its derivatives satisfies the stated assumptions. Deduce that the corresponding g has bounded derivatives of all orders. Hint: Consider the case p(x) := x^d. Show that |e^t − 1| ≤ |t| e^{|t|} for all real t.

(iii) For each σ > 0, show that the function x ↦ k(x/σ)/σ and each of its derivatives also satisfies the assumptions.

[16] Let $\{X_n\}$ be a sequence of random variables, all defined on the same probability space. If $X_n = o_p(1)$, we know from Chapter 2 that there is a subsequence $\{X_{n_i} : i \in \mathbb{N}\}$ for which $X_{n_i}(\omega) = o(1)$ for almost all $\omega$. If, instead, $X_n = O_p(1)$, must there exist a subsequence for which $X_{n_i}(\omega) = O(1)$ for almost all $\omega$? Hint: Let $\{\xi_i : i \in \mathbb{N}_0\}$ be a sequence of independent random variables, each distributed Uniform(0,1). Consider $X_n := (\xi_0 - \xi_n)^{-1}$.

[17] Let $\psi$ be a strictly increasing function on $\mathbb{R}^+$ with $\psi(0) = 0$ and $\psi(t) \to 1$ as $t \to \infty$. Show that a sequence of random vectors $\{X_n\}$ is of order $O_p(1)$ if and only if $\limsup_n P\psi(|X_n|) < 1$.

[18] Let $\{X_n\}$ and $\{Y_n\}$ be sequences of random $k$-vectors, with $X_n$ and $Y_n$ defined on the same space and independent of each other. Suppose $X_n - Y_n = O_p(1)$. Show that there exists a sequence of nonrandom vectors $\{a_n\}$ for which $X_n - a_n = O_p(1)$. Hint: For probability measures $P$ and $Q$, show that if $P^x Q^y \psi(x - y) \le M$ then $P^x \psi(x - y) \le M$ for at least one $y$. Use Problem [17].


[19] If $X$ has a Poisson($\lambda$) distribution, show that $\sqrt{X} - \sqrt{\lambda} \rightsquigarrow N(0, 1/4)$ as $\lambda \to \infty$.
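A numerical sanity check of Problem [19] (an illustration, not part of the text): for $X$ distributed Poisson($\lambda$) with $\lambda$ large, $\sqrt{X} - \sqrt{\lambda}$ should have mean near 0 and variance near 1/4, as the delta method predicts. The particular value $\lambda = 400$ and the summation range are arbitrary choices.

```python
import math

# For X ~ Poisson(lam) with lam large, sqrt(X) - sqrt(lam) should have
# mean near 0 and variance near 1/4.
lam = 400.0
ks = range(200, 651)                 # carries essentially all the Poisson(400) mass
pmf = [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in ks]
dev = [math.sqrt(k) - math.sqrt(lam) for k in ks]
mass = sum(pmf)
mean = sum(p * d for p, d in zip(pmf, dev))
var = sum(p * d * d for p, d in zip(pmf, dev)) - mean ** 2
```

The exact probability mass function is summed over roughly ten standard deviations on each side of the mean, so the truncation error is negligible.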

[20] Let $\{X_{n,i}\}$ be a triangular array of random variables, independent within each row and satisfying

(a) $\sum_i P\{|X_{n,i}| > \epsilon\} \to 0$ for each $\epsilon > 0$,

(b) $\sum_i \mathrm{var}\big(X_{n,i}\{|X_{n,i}| \le \epsilon\}\big) \to 1$ for each $\epsilon > 0$.

Show that $\sum_i X_{n,i} - A_n \rightsquigarrow N(0,1)$, where $A_n := \sum_i P X_{n,i}\{|X_{n,i}| \le 1\}$. Hint: Consider truncated variables $\eta_{n,i} := X_{n,i}\{|X_{n,i}| \le \epsilon_n\}$ and $\xi_{n,i} := \eta_{n,i} - P\eta_{n,i}$, for a suitable $\{\epsilon_n\}$ sequence.

[21] Let $\{X_{n,i}\}$ be a triangular array of random variables, independent within each row and satisfying

(i) $\max_i |X_{n,i}| \to 0$ in probability,

(ii) $\sum_i P X_{n,i}\{|X_{n,i}| \le \epsilon\} \to \mu$ for each $\epsilon > 0$,

(iii) $\sum_i \mathrm{var}\big(X_{n,i}\{|X_{n,i}| \le \epsilon\}\big) \to \sigma^2 < \infty$ for each $\epsilon > 0$.

Show that $\sum_i X_{n,i} \rightsquigarrow N(\mu, \sigma^2)$. Hint: Define $\eta_{n,i} := X_{n,i}\{|X_{n,i}| \le \epsilon_n\}$ and $\xi_{n,i} := \eta_{n,i} - P\eta_{n,i}$, where $\epsilon_n$ tends to zero slowly enough that: (a) $P\{\max_i |X_{n,i}| > \epsilon_n\} \to 0$; (b) $\sum_i P X_{n,i}\{|X_{n,i}| \le \epsilon_n\} \to \mu$; and (c) $\sum_i \mathrm{var}\big(X_{n,i}\{|X_{n,i}| \le \epsilon_n\}\big) \to \sigma^2$.

[22] If each of the components $\{X_{n,i}\}$, for $i = 1, \ldots, k$, of a sequence of random $k$-vectors $\{X_n\}$ is of order $O_p(1)$, show that $X_n = O_p(1)$.

[23] Suppose $f(x) = o(|x|)$ as $x \to 0$, and $g(x) = O(|x|^k)$ as $x \to 0$. Suppose $X_n = O_p(a_n)$ for some sequence of constants $a_n$ tending to zero. Derive a bound for the rate at which $f(g(X_n))$ tends to zero.

7. Notes

See Daw & Pearson (1972) and Stigler (1986a, Chapter 2) for discussion of De Moivre's 1733 derivation of the normal approximation to the binomial distribution (reproduced on pages 243-250 of the 1967 reprint of de Moivre 1718)—the first central limit theorem.

Theorem <19> is essentially due to Liapounoff (1900, 1901), although the method of proof is due to Lindeberg (1922). I adapted my exposition of Lindeberg's method, in Section 2, from Billingsley (1968, Section 7), via Pollard (1984, Section III.4). The development of the CLT, from the simple idea described in Section 1 to formal limit theorems, has a long history, culminating in the work of several authors during the 1920's and 30's. For example, building on Lindeberg's ideas, Lévy (1931, Section 10) conjectured the form of the general necessary and sufficient condition for a sum of independent random variables to be approximately normally distributed, but established only the sufficiency. Apparently independently of each other, Lévy (1935) and Feller (1935) established necessary conditions for the CLT, under an assumption that individual summands satisfy a mild "asymptotic negligibility" condition. See the discussion by Le Cam (1986). Chapter 4 of Lévy (1937) and Chapter 3 of Lévy (1970) provide more insights into Lévy's thinking about the CLT and the role of the normal distribution. The idea of using truncation to obtain general CLT's from the Lindeberg version of the theorem runs through much of Lévy's work.

Later works, such as Gnedenko & Kolmogorov (1949/68, Chapter 5) and Petrov (1972/75, Section IV.4), treat the CLT as a special case of more general limit theorems for infinitely divisible distributions—compare, for example, the direct argument in Problem [20] with Theorem 3 in Section 25 of the former or with Theorem 16 in Chapter 4 of the latter.

Theorem <36> for complete separable metric spaces is due to Prohorov (1956). Independently, Le Cam (1957) proved similar results for more general topological spaces. The monograph of Billingsley (1968) is still an excellent reference for the theory of weak convergence on metric spaces. Together with the slightly more abstract account by Parthasarathy (1967), Billingsley's exposition stimulated widespread interest in weak convergence methods by probabilists and statisticians (including me) during the 1970's. See Dudley (1989, Chapter 11) for an elegant treatment that weaves in more recent ideas.

The stochastic order notation of Section 4 is due to Mann & Wald (1943). For further examples see the survey paper by Chernoff (1956) and Pratt (1959).

REFERENCES

Alexandroff, A. D. (1940-43), 'Additive set functions in abstract spaces', Mat. Sbornik. Chapter 1: 50(NS 8) 1940, 307-342; Chapters 2 and 3: 51(NS 9) 1941, 563-621; Chapters 4 and 5: 55(NS 13) 1943, 169-234.
Billingsley, P. (1968), Convergence of Probability Measures, Wiley, New York.
Chernoff, H. (1956), 'Large sample theory: parametric case', Annals of Mathematical Statistics 27, 1-22.
Daw, R. H. & Pearson, E. S. (1972), 'Abraham De Moivre's 1733 derivation of the normal curve: a bibliographical note', Biometrika 59, 677-680.
de Moivre, A. (1718), The Doctrine of Chances, first edn. Second edition 1738. Third edition, from 1756, reprinted in 1967 by Chelsea, New York.
Dudley, R. M. (1985), 'An extended Wichura theorem, definitions of Donsker classes, and weighted empirical distributions', Springer Lecture Notes in Mathematics 1153, 141-178. Springer, New York.
Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.
Feller, W. (1935), 'Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung', Mathematische Zeitschrift 40, 521-559. Part II, same journal, 42 (1937), 301-312.
Gnedenko, B. V. & Kolmogorov, A. N. (1949/68), Limit Theorems for Sums of Independent Random Variables, Addison-Wesley. English translation in 1968, of original Russian edition from 1949.
Hoffmann-Jørgensen, J. (1984), Stochastic Processes on Polish Spaces, unpublished manuscript, Aarhus University, Denmark.
Le Cam, L. (1957), 'Convergence in distribution of stochastic processes', University of California Publications in Statistics 2, 207-236.
Le Cam, L. (1986), 'The central limit theorem around 1935', Statistical Science 1, 78-96.
Lévy, P. (1931), 'Sur les séries dont les termes sont des variables éventuelles indépendantes', Studia Mathematica 3, 119-155.
Lévy, P. (1935), 'Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou enchaînées', Journal de Math. Pures Appl. 14, 347-402.
Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Second edition, 1954.
Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.
Liapounoff, A. M. (1900), 'Sur une proposition de la théorie des probabilités', Bulletin de l'Académie impériale des Sciences de St.-Pétersbourg 13, 359-386.
Liapounoff, A. M. (1901), 'Nouvelle forme du théorème sur la limite de probabilité', Mémoires de l'Académie impériale des Sciences de St.-Pétersbourg.
Lindeberg, J. W. (1922), 'Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung', Mathematische Zeitschrift 15, 211-225.
Mann, H. B. & Wald, A. (1943), 'On stochastic limit and order relationships', Annals of Mathematical Statistics 14, 217-226.
Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York.
Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.
Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.
Pratt, J. W. (1959), 'On a general concept of "in probability"', Annals of Mathematical Statistics 30, 549-558.
Prohorov, Y. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and Its Applications 1, 157-214.
Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, Massachusetts.
Topsøe, F. (1970), Topology and Measure, Vol. 133 of Springer Lecture Notes in Mathematics, Springer-Verlag, New York.
Varadarajan, V. S. (1961), 'Measures on topological spaces', Mat. Sbornik 55(97), 35-100. American Mathematical Society Translations 48 (1965), 161-228.


Chapter 8

Fourier transforms

SECTION 1 presents a few of the basic properties of Fourier transforms that make them such a valuable tool of probability theory.

SECTION 2 exploits a mysterious coincidence, involving the Fourier transform and the density function of the normal distribution, to establish inversion formulas for recovering distributions from Fourier transforms.

SECTION *3 explains why the coincidence from Section 2 is not really so mysterious.

SECTION 4 shows that the inversion formula from Section 2 has a continuity property, which explains why pointwise convergence of Fourier transforms implies convergence in distribution.

SECTION *5 establishes a central limit theorem for triangular arrays of martingale differences.

SECTION 6 extends the theory to multivariate distributions, pointing out how the calculations reduce to one-dimensional analogs for linear combinations of coordinate variables—the Cramér and Wold device.

SECTION *7 provides a direct proof (no Fourier theory) of the fact that the family of (one-dimensional) distributions for all linear combinations of a random vector uniquely determines its multivariate distribution.

SECTION *8 illustrates the use of complex-variable methods to prove a remarkable property of the normal distribution—the Lévy-Cramér theorem.

1. Definitions and basic properties

Some probabilistic calculations simplify when reexpressed in terms of suitable transformations, such as the probability generating function (especially for random variables taking only positive integer values), the Laplace transform (especially for random variables taking only nonnegative values), or the moment generating function (for random variables with rapidly decreasing tail probabilities). The Fourier transform shares many of the desirable properties of these transforms without the restrictions on the types of random variable to which it is best applied, but with the slight drawback that we must deal with random variables that can take complex values.

The integral of a complex-valued function, $f := g + ih$, is defined by splitting into real ($\Re f := g$) and imaginary ($\Im f := h$) parts, $\mu f := \mu g + i\mu h$. These integrals inherit linearity and the dominated convergence property from their real-valued counterparts. The increasing functional property becomes meaningless for complex integrals—the complex numbers are not ordered. The inequality $|\mu f| \le \mu|f|$ still holds if $|\cdot|$ is interpreted as the modulus of a complex number (Problem [1]).

The Fourier transform (which is often referred to as the characteristic function in the probability and statistics literature) of a probability measure $P$ on $\mathcal{B}(\mathbb{R})$ is defined by

$$\psi_P(t) := P^x e^{ixt} \quad\text{for } t \text{ in } \mathbb{R}.$$

Similarly, the Fourier transform of a real random variable $X$ is defined by

$$\psi_X(t) := P^\omega \exp(iX(\omega)t) \quad\text{for } t \text{ in } \mathbb{R}.$$

That is, $\psi_X$ is the Fourier transform of the distribution of $X$.

REMARK. Without the $i$ in the exponent, we would be defining the moment generating function, $P^x e^{xt}$, which might be infinite except at $t = 0$, as in the case of the Cauchy distribution (Problem [6]).

Fourier transforms are well defined for every probability measure on $\mathcal{B}(\mathbb{R})$, and

$$|\psi_P(t)| = |P^x \exp(ixt)| \le P^x|\exp(ixt)| = 1 \quad\text{for all real } t.$$

They are uniformly continuous, because

$$|\psi_P(t + \delta) - \psi_P(t)| \le P^x\big|e^{ix(t+\delta)} - e^{ixt}\big| = P^x\big|e^{ix\delta} - 1\big|,$$

which tends to zero as $\delta \to 0$ by Dominated Convergence. As a map from $\mathbb{R}$ into the complex plane $\mathbb{C}$, the Fourier transform defines a curve that always lies within the unit disk. The curve touches the boundary of the disk at $1 + 0i$, corresponding to $t = 0$. If it also touches for some nonzero value of $t$, then $P$ must concentrate on a regularly spaced, countable subset of $\mathbb{R}$ (Problem [2]). If $P$ is absolutely continuous with respect to Lebesgue measure then $\psi_P(t) \to 0$ as $t \to \pm\infty$ (Problem [4]).

Fourier methods are particularly effective for dealing with sums of independent random variables, chiefly due to the following simplification.

<1> Theorem. If $X_1, \ldots, X_n$ are independent then the Fourier transform of $X_1 + \ldots + X_n$ factorizes into $\psi_{X_1}(t)\cdots\psi_{X_n}(t)$, for all real $t$.

Proof. Extend the factorization property of real functions of the $X_j$'s to the complex functions $\exp(itX_j)$.

REMARK. Be careful that you do not invent a false converse to the Theorem. If $X$ has a Cauchy distribution and $Y = X$ then $\psi_{X+Y}(t) = \exp(-2|t|) = \psi_X(t)\psi_Y(t)$, but $X$ and $Y$ are certainly not independent (Problem [6]).
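A quick check of the factorization in Theorem <1>, in an exactly computable case (an illustration only, with arbitrary choices of $n$, $p$, and $t$): the sum of $n$ independent Bernoulli($p$) variables is Binomial($n$, $p$), so its transform, computed directly from the Binomial probabilities, should equal the $n$-fold product of the individual Bernoulli transforms $1 - p + pe^{it}$.

```python
import cmath
import math

# Fourier transform of a Binomial(n, p) sum, computed two ways.
n, p, t = 7, 0.3, 1.7
# Directly from the Binomial pmf:
ft_sum = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) * cmath.exp(1j * k * t)
             for k in range(n + 1))
# As the product of n Bernoulli(p) transforms, per Theorem <1>:
ft_product = (1 - p + p * cmath.exp(1j * t)) ** n
```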

There are various ways to extract information about a distribution from its Fourier transform. For example, the next Theorem shows that existence of finite moments of the distribution gives polynomial approximations to the Fourier transform, corresponding to the purely formal operation of taking expectations term-by-term in the Taylor expansion for $\exp(iXt)$.


<2> Theorem. If $P|X|^k < \infty$, for a positive integer $k$, then the Fourier transform has the approximation

$$\psi_X(t) = 1 + itPX + \frac{(it)^2}{2!}PX^2 + \cdots + \frac{(it)^k}{k!}PX^k + o(|t|^k) \quad\text{for } t \text{ near } 0.$$

Proof. Apply Problem [5] to $P\cos(Xt)$ and $P\sin(Xt)$.

<3> Example. The Poisson($\lambda$) distribution has Fourier transform

$$\sum\nolimits_{k=0}^{\infty} \frac{e^{-\lambda}\lambda^k}{k!}\, e^{ikt} = \exp\big(\lambda(e^{it} - 1)\big).$$

An appeal to the central limit theorem and a suitable passage to the limit will lead us to the Fourier transform for the normal distribution.

Suppose $X_1, X_2, \ldots$ is a sequence of independent random variables, each distributed Poisson(1). By the central limit theorem for identically distributed summands, $Z_n := (X_1 + \ldots + X_n - n)/\sqrt{n} \rightsquigarrow N(0,1)$. For each fixed $t$ the function $e^{ixt}$ is bounded and continuous in $x$. Thus $\lim_{n\to\infty} P\exp(itZ_n)$ is the Fourier transform of the $N(0,1)$ distribution. Evaluate the limit using Theorem <1>:

$$P\exp(itZ_n) = \prod\nolimits_{k=1}^{n} P\exp\big(it(X_k - 1)/\sqrt{n}\big) = \exp\big(n(e^{it/\sqrt{n}} - 1) - it\sqrt{n}\big).$$

The last exponent has the approximation

$$n\Big(\frac{it}{\sqrt{n}} - \frac{t^2}{2n} + O(n^{-3/2})\Big) - it\sqrt{n} = -\frac{t^2}{2} + O(n^{-1/2}) \quad\text{for each fixed } t.$$

Notice the way the error term behaves as a function of $n$ for fixed $t$. In the limit we get $\exp(-t^2/2)$ as the Fourier transform of the $N(0,1)$ distribution.
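The behavior of that error term can be seen numerically (an illustration, not part of the text; the value of $t$ is arbitrary): the exponent $n(e^{it/\sqrt{n}} - 1) - it\sqrt{n}$ approaches $-t^2/2$, and the gap shrinks at the rate $n^{-1/2}$.

```python
import cmath
import math

# The exponent from Example <3> should approach -t^2/2 as n grows,
# with error of order |t|^3 / sqrt(n).
t = 1.5
gaps = []
for n in (10**2, 10**4, 10**6):
    exponent = n * (cmath.exp(1j * t / math.sqrt(n)) - 1) - 1j * t * math.sqrt(n)
    gaps.append(abs(exponent - (-t * t / 2)))
```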

REMARK. If $Z$ has a $N(0,1)$ distribution and $s$ is real,

$$P\exp(sZ) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp\big(sz - \tfrac12 z^2\big)\,dz = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp\big(\tfrac12 s^2 - \tfrac12 (z - s)^2\big)\,dz = \exp(s^2/2).$$

Formally, we have only to replace $s$ by $it$ to get the Fourier transform for $Z$. A rigorous justification requires some complex-variable theory, such as the uniqueness of analytic continuations.

2. Inversion formula

Written out in terms of the $N(0,1)$ density, the final result from Example <3> becomes

$$\int_{-\infty}^{\infty} \frac{\exp\big(iyt - \tfrac12 y^2\big)}{\sqrt{2\pi}}\,dy = \exp(-\tfrac12 t^2) \quad\text{for real } t.$$


The function on the right-hand side looks very like a normal density; it lacks only the standardizing constant. If we substitute $t/\sigma$ for $t$, then make a change of variables $z = -y/\sigma$, we get an integral representation for the $N(0,\sigma^2)$ density $\phi_\sigma$:

<4>$$\frac{1}{2\pi}\int_{-\infty}^{\infty} \exp\big(-izt - \tfrac12\sigma^2 z^2\big)\,dz = \frac{\exp\big(-\tfrac12 t^2/\sigma^2\big)}{\sigma\sqrt{2\pi}} =: \phi_\sigma(t).$$

This equality is a special case of a general inversion formula that relates Fourier transforms to densities. Indeed, the general formula can be derived from the special case.
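Display <4> can be checked numerically (an illustration only; $\sigma$, $t$, and the integration grid are arbitrary choices): a Riemann sum for the left-hand integral should reproduce the $N(0,\sigma^2)$ density at $t$.

```python
import numpy as np

# Riemann-sum check of display <4>:
# (1/2pi) * int exp(-izt - sigma^2 z^2 / 2) dz  =  N(0, sigma^2) density at t.
sigma, t = 0.7, 0.9
z = np.linspace(-15.0, 15.0, 200001)
dz = z[1] - z[0]
integrand = np.exp(-1j * z * t - 0.5 * sigma**2 * z**2)
lhs = integrand.sum().real * dz / (2 * np.pi)
rhs = np.exp(-0.5 * t**2 / sigma**2) / (sigma * np.sqrt(2 * np.pi))
```

The Gaussian factor makes the integrand decay so fast that both the grid truncation and the discretization error are far below the comparison tolerance.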

Suppose $Z$ has a $N(0,1)$ distribution, independently of a random variable $X$ with Fourier transform $\psi(\cdot)$. From the convolution formula for densities with respect to Lebesgue measure (Section 4.4), for $\sigma > 0$ the sum $X + \sigma Z$ has a distribution with density $f_{(\sigma)}(y) := P^\omega \phi_\sigma\big(y - X(\omega)\big)$. Substituting for $\phi_\sigma$ from <4>, we have

$$f_{(\sigma)}(y) = \frac{1}{2\pi} P^\omega \int_{-\infty}^{\infty} \exp\Big(-iz\big(y - X(\omega)\big) - \tfrac12\sigma^2 z^2\Big)\,dz.$$

The integrand, which is bounded in absolute value by $\exp(-\sigma^2 z^2/2)$, is integrable (as a function of $\omega$ and $z$) with respect to the product of $P$ with Lebesgue measure. Interchanging the order of integration, as justified by the Fubini theorem, we get the basic integral representation,

<5>$$f_{(\sigma)}(y) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \psi(z)\exp\big(-izy - \tfrac12\sigma^2 z^2\big)\,dz.$$

Notice that $\psi(t)\exp(-\sigma^2 t^2/2)$ is the Fourier transform of $X + \sigma Z$. If we write $P$ for the distribution of $X$, the formula becomes

<6>$$\text{density of } P \star N(0,\sigma^2) \;=\; \frac{1}{2\pi}\int e^{-izy}\,\big(\text{Fourier transform of } P \star N(0,\sigma^2)\big)\,dz,$$

a special case of the inversion formula to be proved in Theorem <10>.

Limiting arguments as $\sigma$ tends to zero in the basic formula <5>, or <6>, lead to several important conclusions.

<7> Theorem. The distribution of a random variable is uniquely determined by its Fourier transform.

Proof. For $h$ a bounded measurable function on $\mathbb{R}$ and $f_{(\sigma)}$ as defined above,

$$Ph(X + \sigma Z) = \int_{-\infty}^{\infty} h(y) f_{(\sigma)}(y)\,dy,$$

which shows, via formula <5>, that the Fourier transform of the random variable $X$ uniquely determines the density for the distribution of $X + \sigma Z$. Specialize to $h$ in $C(\mathbb{R})$, the class of all bounded, continuous real functions on $\mathbb{R}$. By Dominated Convergence, $Ph(X + \sigma Z) \to Ph(X)$ as $\sigma \to 0$. Thus the Fourier transform uniquely determines all the expectations $Ph(X)$, with $h$ ranging over $C(\mathbb{R})$. A generating class argument shows that these expectations uniquely determine the distribution of $X$, as a probability measure on $\mathcal{B}(\mathbb{R})$.


You might feel tempted to rearrange limit operations in the previous proof, to arrive at an assertion that

$$Ph(X) = \lim_{\sigma\to 0}\int h(y) f_{(\sigma)}(y)\,dy = \int h(y)\lim_{\sigma\to 0} f_{(\sigma)}(y)\,dy,$$

and thereby conclude that the distribution of $X$ has density

<8>$$f(y) = \lim_{\sigma\to 0} f_{(\sigma)}(y) = \frac{1}{2\pi}\int \psi(z)\exp(-izy)\,dz$$

with respect to Lebesgue measure. Of course such temptation should be resisted. The migration of limits inside integrals typically requires some domination assumptions. Moreover, it would be exceedingly strange to derive densities for measures that are not absolutely continuous with respect to Lebesgue measure.

<9> Example. Let $P$ denote the probability measure that puts mass 1/2 at $\pm 1$. It has Fourier transform $\psi(t) = \big(e^{it} + e^{-it}\big)/2$. The integral on the right-hand side of <8> does not exist for this Fourier transform. Application of formulas <5> then <4> gives

$$f_{(\sigma)}(y) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \tfrac12\big(e^{iz} + e^{-iz}\big)\exp\big(-izy - \tfrac12\sigma^2 z^2\big)\,dz = \tfrac12\phi_\sigma(y - 1) + \tfrac12\phi_\sigma(y + 1).$$

That is, $f_{(\sigma)}$ is the density for the mixture $\tfrac12 N(-1,\sigma^2) + \tfrac12 N(+1,\sigma^2)$, a density that is trying to behave like two point masses. The limit of $f_{(\sigma)}$ does not exist in the ordinary sense.
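A numerical illustration of Example <9> (not part of the text; $\sigma$ and $y$ are arbitrary choices): applying formula <5> with $\psi(z) = \cos z$ reproduces the two-bump normal mixture density.

```python
import numpy as np

# Formula <5> with psi(z) = cos(z), the transform of the +-1 two-point law,
# should give the mixture (1/2) phi_sigma(y-1) + (1/2) phi_sigma(y+1).
sigma, y = 0.5, 0.3
z = np.linspace(-40.0, 40.0, 400001)
dz = z[1] - z[0]
f_sigma = (np.cos(z) * np.exp(-1j * z * y - 0.5 * sigma**2 * z**2)).sum().real \
    * dz / (2 * np.pi)

def phi(x, s=sigma):
    # N(0, s^2) density
    return np.exp(-0.5 * x**2 / s**2) / (s * np.sqrt(2 * np.pi))

mixture = 0.5 * phi(y - 1.0) + 0.5 * phi(y + 1.0)
```

As $\sigma$ shrinks, the two bumps sharpen toward point masses and the pointwise limit of $f_{(\sigma)}$ fails to exist, exactly as the Example warns.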

<10> Theorem. If $P$ has Fourier transform $\psi$ for which $\int_{-\infty}^{\infty} |\psi(z)|\,dz < \infty$, then $P$ is absolutely continuous with respect to Lebesgue measure, with a density

$$f(y) := \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-izy}\psi(z)\,dz$$

that is bounded and uniformly continuous.

Proof. The convolution density $f_{(\sigma)}(y)$, as in <5>, is bounded by the constant $C := \int|\psi|/2\pi$. Uniform continuity follows as for Fourier transforms. By Dominated Convergence, $f_{(\sigma)}(y)$ converges pointwise to $f$. Thus $0 \le f \le C$. If $h$ is continuous and vanishes outside a bounded interval $[-M, M]$, a second appeal to Dominated Convergence gives

$$Ph = \lim_{\sigma\to 0}\int_{-M}^{M} h(y) f_{(\sigma)}(y)\,dy = \int_{-M}^{M} h(y) f(y)\,dy.$$

That is, $Ph = \int hf$ for a large enough class of functions $h$ to ensure (via a generating class argument) that $P$ has density $f$.
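Theorem <10> in action (a numerical illustration, not part of the text): $\psi(z) = e^{-|z|}$ is integrable, and it is the Fourier transform of the standard Cauchy distribution, so the inversion integral should return the Cauchy density $1/\big(\pi(1 + y^2)\big)$ at any chosen $y$.

```python
import numpy as np

# Inversion of the integrable transform psi(z) = exp(-|z|) at a single point y.
y = 0.8
z = np.linspace(-60.0, 60.0, 600001)
dz = z[1] - z[0]
f_y = (np.exp(-np.abs(z)) * np.exp(-1j * z * y)).sum().real * dz / (2 * np.pi)
target = 1.0 / (np.pi * (1.0 + y**2))
```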

REMARK. The inversion formula for integrable Fourier transforms is the basis for a huge body of theory, including Edgeworth expansions, rates of convergence in the central limit theorem, density estimation, and more. See for example the monograph of Petrov (1972/75).


*3. A mystery?

I have always been troubled by the mysterious workings of Fourier transforms. For example, it seems an enormous stroke of luck that the Fourier transform of the normal distribution is proportional to its density function—the key to the inversion formula <5>. Perhaps it would be better to argue slightly less elegantly, to see what is really going on.

Start from the slightly less mysterious fact that there exists at least one random variable $W$ whose Fourier transform $\psi_0$ is Lebesgue integrable. (See Problem [3] for one way to ensure integrability of the Fourier transform.) Symmetrize by means of an independent copy $W'$ of $W$ to get a random variable $W - W'$ with a real-valued Fourier transform $\psi = |\psi_0|^2$, which is also Lebesgue integrable. That is, there exists a probability distribution $Q$ whose Fourier transform $\psi$ is both nonnegative and Lebesgue integrable. For some finite constant $c_Q$, the function $c_Q\psi$ defines a density (with respect to Lebesgue measure $m$ on the real line) of a probability measure $\widetilde{Q}$ on $\mathcal{B}(\mathbb{R})$.

For $h$ bounded and measurable,

$$P \otimes \widetilde{Q}\, h(x + \sigma y) = P^x m^y\Big(c_Q\big(Q^z e^{iyz}\big)h(x + \sigma y)\Big).$$

If $h$ vanishes outside a bounded interval, Fubini lets us interchange the order of integration, make a change of variable $w = x + \sigma y$ in the Lebesgue integration, then change order of integration again, to reexpress the right-hand side as

$$\frac{c_Q}{\sigma}\, Q^z m^w\Big(h(w)\, P^x \exp\big(iz(w - x)/\sigma\big)\Big),$$

a function of the Fourier transform of $P$. Again we have a special case of the inversion formula, from which all results flow.

The method in Section 2 corresponds to the case where both $Q$ and $\widetilde{Q}$ are normal distributions, but that coincidence is not vital to the method.

4. Convergence in distribution

The representation <5> shows not only that the density $f_{(\sigma)}$ of $X + \sigma Z$ is uniquely determined by the Fourier transform $\psi_X$ of $X$ but also that it depends on $\psi_X$ in a continuous way. The factor $\exp(-\sigma^2 z^2/2)$ ensures that small perturbations of $\psi_X$ do not greatly affect the integral. The traditional way to make this continuity idea precise involves an assertion about pointwise convergence of Fourier transforms.

The proof makes use of a simple fact about smoothing and a simple consequence of Dominated Convergence known as Scheffé's Lemma:

Let $f, f_1, f_2, \ldots$ be nonnegative, $\mu$-integrable functions for which $f_n \to f$ a.e. $[\mu]$ and $\mu f_n \to \mu f$. Then $\mu|f_n - f| \to 0$.

See Section 3.1 for the proof.
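A small numerical illustration of Scheffé's Lemma (not from the text; the family of densities and the grid are arbitrary choices): the $N(0, 1 + 1/n)$ densities converge pointwise to the $N(0,1)$ density and all integrate to one, so the $L^1$ distances should shrink to zero.

```python
import numpy as np

# L1 distances between N(0, 1 + 1/n) densities and the N(0,1) density.
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]

def density(v):
    # N(0, v) density evaluated on the grid
    return np.exp(-0.5 * x**2 / v) / np.sqrt(2 * np.pi * v)

f = density(1.0)
l1 = [np.abs(density(1.0 + 1.0 / n) - f).sum() * dx for n in (1, 10, 100)]
```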


<11> Lemma. Let $X, X_1, X_2, \ldots$ and $Z$ be random variables for which $X_n + \sigma Z \rightsquigarrow X + \sigma Z$ for each $\sigma > 0$. Then $X_n \rightsquigarrow X$.

Proof. For $\ell$ in $\mathrm{BL}(\mathbb{R})$,

$$|P\ell(X_n) - P\ell(X)| \le P|\ell(X_n) - \ell(X_n + \sigma Z)| + |P\ell(X_n + \sigma Z) - P\ell(X + \sigma Z)| + P|\ell(X + \sigma Z) - \ell(X)|.$$

On the right-hand side, the middle term tends to zero, by assumption. The other two terms are both bounded by $\|\ell\|_{\mathrm{BL}}\, P(1 \wedge \sigma|Z|)$, which tends to zero as $\sigma \to 0$.

<12> Continuity Theorem. Let $X, X_1, X_2, \ldots$ be random variables with Fourier transforms $\psi, \psi_1, \psi_2, \ldots$ for which $\psi_n(t) \to \psi(t)$ for each real $t$. Then $X_n \rightsquigarrow X$.

Proof. Let $Z$ have a $N(0,1)$ distribution independent of $X$ and all the $X_n$. The random variables $X_n + \sigma Z$ have distributions with densities

$$f^{n}_{(\sigma)}(y) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \psi_n(t)\exp\big(-ity - \tfrac12\sigma^2 t^2\big)\,dt$$

with respect to Lebesgue measure, and $X + \sigma Z$ has a distribution with a similarly defined density $f_{(\sigma)}$. By Dominated Convergence, $f^{n}_{(\sigma)}(y) \to f_{(\sigma)}(y)$ as $n \to \infty$, for each fixed $y$. If $h$ is bounded and measurable,

$$|Ph(X_n + \sigma Z) - Ph(X + \sigma Z)| = \Big|\int h(y) f^{n}_{(\sigma)}(y)\,dy - \int h(y) f_{(\sigma)}(y)\,dy\Big| \le M\int \big|f^{n}_{(\sigma)}(y) - f_{(\sigma)}(y)\big|\,dy, \quad\text{where } M := \sup|h|,$$

which tends to zero by Scheffé's Lemma. Thus $X_n + \sigma Z \rightsquigarrow X + \sigma Z$ for each $\sigma > 0$, and the asserted convergence in distribution follows via Lemma <11>.

<13> Example. Suppose $X_1, X_2, \ldots$ are independent, identically distributed random variables with $PX_i = 0$ and $PX_i^2 = 1$. From Chapter 7 we know that

$$\frac{X_1 + \ldots + X_n}{\sqrt{n}} \rightsquigarrow N(0,1).$$

Here is the proof of the same result using Fourier transforms.

Let $\psi(\cdot)$ be the Fourier transform of $X_1$. From Theorem <2>,

$$\psi(t) = 1 - \tfrac12 t^2 + o(t^2) \quad\text{as } t \to 0.$$

In particular, for fixed $t$,

$$\psi(t/\sqrt{n}) = 1 - \tfrac12 t^2/n + o(1/n) \quad\text{as } n \to \infty.$$

The standardized sum has Fourier transform

$$\psi(t/\sqrt{n})^n = \big(1 - \tfrac12 t^2/n + o(1/n)\big)^n \to \exp(-\tfrac12 t^2).$$

The limit equals the $N(0,1)$ Fourier transform. By the Continuity Theorem, the asymptotic normality of the standardized sum follows.

Certainly the calculations in the previous Example involve less work than the Lindeberg argument plus the truncations used in Chapter 7 to establish the same central limit theorem. I would point out, however, the amount of theory needed to establish the Continuity Theorem. Moreover, for the corresponding Fourier transform proofs of central limit theorems for more general triangular arrays, the calculations would parallel those for the Lindeberg method.

REMARK. A slightly stronger version of the Continuity Theorem (due to Cramér—see the Notes) also goes by the same name. The assumptions of the Theorem can be weakened (Problem [9]) to mere existence of a pointwise limit $\psi(t) := \lim_n \psi_n(t)$ with $\psi$ continuous at the origin. Initially it need not be identified as the Fourier transform of some specified distribution. The stronger version of the Theorem asserts the existence of a distribution $P$ for which $\psi$ is the Fourier transform, such that $X_n \rightsquigarrow P$.

*5. A martingale central limit theorem

Fourier methods have some advantages over the methods of Chapter 7. For example, the proof of the following important theorem seems to depend in an essential way on the factorization properties of the exponential function.

REMARK. The Lindeberg method for independent summands, as explained in Section 7.2, can be extended to martingale differences under a natural assumption on the sum of conditional variances—see Lévy (1937, Section 67).

<14> Theorem. (McLeish 1974) For each $n$ in $\mathbb{N}$ let $\{\xi_{nj} : j = 0, \ldots, k_n\}$ be a martingale difference array, with respect to a filtration $\{\mathcal{F}_{nj}\}$, for which:

(i) $\sum_j \xi_{nj}^2 \to 1$ in probability;

(ii) $\max_j |\xi_{nj}| \to 0$ in probability;

(iii) $\sup_n P\max_j \xi_{nj}^2 < \infty$.

Then $\sum_j \xi_{nj} \rightsquigarrow N(0,1)$ as $n \to \infty$.

Proof. Write $S_n$ for $\sum_j \xi_{nj}$ and $M_n$ for $\max_j |\xi_{nj}|$. Denote expectations conditional on $\mathcal{F}_{nj}$ by $P_j(\cdot)$. The omission of the $n$ subscript should cause no confusion, because all calculations will be carried out for a fixed $n$. Let me also abbreviate $\xi_{nj}$ to $\xi_j$, and simplify notation by assuming $k_n = n$.

By the Continuity Theorem <12>, it suffices to show that $P\exp(itS_n) \to \exp(-t^2/2)$, for each fixed $t$ in $\mathbb{R}$. A Taylor expansion of $\log(1 + it\xi_j)$ when $M_n$ is small gives $\exp\big(it\xi_j + t^2\xi_j^2/2\big) \approx 1 + it\xi_j$, and hence, via (i),

$$\exp(itS_n) \approx \exp(-t^2/2)\prod\nolimits_{j\le n}\big(1 + it\xi_j\big).$$

Define $X_m := \prod_{j\le m}\big(1 + it\xi_j\big)$, for $m = 1, \ldots, n$, with $X_0 := 1$. Each $X_m$ has expected value 1, because $PX_m = P\big(X_{m-1}(1 + itP_{m-1}\xi_m)\big) = PX_{m-1} = \ldots = PX_0$, which suggests $P\exp(itS_n) \approx \exp(-t^2/2)PX_n = \exp(-t^2/2)$.

For the formal proof, use the error bound

$$\log(1 + z) = z - z^2/2 + r(z) \quad\text{with } |r(z)| \le |z|^3 \text{ for } |z| \le 1/2.$$
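A numerical spot check of that error bound (an illustration, not a proof; the sampling of radii and angles is arbitrary): with $r(z) := \log(1+z) - z + z^2/2$, the ratio $|r(z)|/|z|^3$ stays below 1 on the disk $|z| \le 1/2$.

```python
import cmath

# Scan the disk |z| <= 1/2 and record the worst ratio |r(z)| / |z|^3.
worst = 0.0
for i in range(1, 201):
    radius = i / 400.0                       # radii up to 1/2
    for j in range(120):
        z = radius * cmath.exp(2j * cmath.pi * j / 120)
        r = cmath.log(1 + z) - z + z * z / 2
        worst = max(worst, abs(r) / abs(z) ** 3)
```

Near the origin the ratio approaches 1/3, the coefficient of the leading $z^3$ term, so the bound is not wasteful.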


Temporarily write $z_j$ for $it\xi_j$. Notice that $\sum_{j\le n} z_j^2 = -t^2 + o_p(1)$, by (i). Also, when $|t|M_n \le 1/2$, which happens with probability tending to 1,

$$\sum\nolimits_{j\le n} |r(z_j)| \le |t|^3 \sum\nolimits_{j\le n} |\xi_j|^3 \le |t|^3 M_n \sum\nolimits_{j\le n} \xi_j^2 = o_p(1) \quad\text{by (ii) and (i)}.$$

Thus

$$\exp\big(itS_n + \tfrac12 t^2\big) = X_n \exp\Big(\tfrac12 t^2 + \tfrac12\sum\nolimits_{j\le n} z_j^2 - \sum\nolimits_{j\le n} r(z_j)\Big) = X_n \exp\big(o_p(1)\big),$$

and

$$Y_n := \exp\big(itS_n + \tfrac12 t^2\big) - X_n \to 0 \quad\text{in probability}.$$

To strengthen the assertion to $PY_n \to 0$, it is enough to show that $\sup_n P|Y_n|^2 < \infty$, for then

$$P|Y_n| \le M^{-1}P|Y_n|^2 + P\big(|Y_n|\{|Y_n| \le M\}\big) = O(M^{-1}) + o(1).$$

(Compare with uniform integrability, as defined in Section 2.8.) The contribution from $S_n$ to $Y_n$ is bounded in absolute value by a constant. We have only to control the contribution from $X_n$.

Consider first the obvious bound,

$$|X_n|^2 = \prod\nolimits_{j\le n}\big(1 + |z_j|^2\big) \le \exp\Big(t^2\sum\nolimits_{j\le n}\xi_j^2\Big).$$

By (i), the expression on the right-hand side is of order $O_p(1)$, but it needn't be bounded everywhere by a constant. We can achieve something closer to uniform boundedness by means of a stopping time argument. Write $Q_n(m)$ for $\sum_{j\le m}\xi_j^2$. Define stopping times $\tau_n := \inf\{m : Q_n(m) > 2\}$, with the usual convention that $\tau_n := n$ if the sum never exceeds 2. Redefine $z_j$ as $it\xi_j\{j \le \tau_n\}$, a new sequence of martingale differences, for which $P\{itS_n \ne \sum_{j\le n} z_j\} \le P\{\tau_n < n\} \to 0$. We have only to prove that $P\exp\big(\tfrac12 t^2 + \sum_{j\le n} z_j\big) \to 1$.

Repeat the argument from the previous paragraph but with $X_n$ redefined using the new $z_j$'s. We then have

$$|X_n|^2 = \prod\nolimits_{j\le n}\big(1 + t^2\xi_j^2\{j \le \tau_n\}\big) \le \exp\Big(t^2\sum\nolimits_j \xi_j^2\{j < \tau_n\}\Big)\big(1 + t^2 M_n^2\big),$$

which gives $\sup_n P|X_n|^2 \le \sup_n \exp(2t^2)\big(1 + t^2 P M_n^2\big) < \infty$, by (iii). The asserted central limit theorem follows.

REMARKS.

(i) Notice the role played by the stopping times $\tau_n$. They ensure that properties of the increments that hold with probability tending to one can be made to hold everywhere, if we can ignore the effect of the increment corresponding to $\tau_n = j$. To control $\xi_{n,\tau_n}$ the Theorem had to impose constraints on $M_n$, via assumption (iii).

(ii) The sum $\sum_j \xi_{nj}^2$ plays the same role as a sum of variances in the corresponding theory for independent variables. Martingale central limit theorems sometimes impose constraints on the sum of conditional variances $P_{j-1}\xi_{nj}^2$. The sum of squared increments corresponds to the "square brackets" process $[X_n]$ of a martingale, and the sum of conditional variances corresponds to the "pointy brackets" process $\langle X_n\rangle$. These two processes are also called the quadratic variation process for $X_n$ and the compensator for $X_n^2$, respectively.


6. Multivariate Fourier transforms

The (multivariate) Fourier transform of a random $k$-vector $X$ is defined as $\psi_X(t) := P\exp(it'X)$ for $t \in \mathbb{R}^k$. Many of the results for the one-dimensional Fourier transform carry over to the multidimensional setting with only notational changes. For example, if $\psi_X$ is integrable with respect to Lebesgue measure $m_k$ on $\mathbb{R}^k$ then the distribution of $X$ is absolutely continuous with respect to $m_k$, with density

$$f(y) := (2\pi)^{-k}\, m_k^t\big(e^{-it'y}\psi_X(t)\big).$$

Once again the Fourier transform uniquely determines the distribution ofthe random vector, and the pointwise convergence of Fourier transforms impliesconvergence in distribution. These two results have two highly useful consequences:

(i) The distribution of X is uniquely determined by the family of distributionsof all linear combinations t'X, as t ranges over Rk.

(ii) If t'Xn -> t'X for each t in R* then Xn -> X.

Both assertions follow from the trivial fact that P exp(it'Y) is both the multivariate Fourier transform of the random vector Y, evaluated at t, and the Fourier transform of the random variable t'Y, evaluated at 1. The reduction to the one-dimensional case via linear combinations is usually called the Cramér-Wold device.

Consequence (ii) shows why one seldom bothers with direct proofs of multivariate limit theorems: they can usually be deduced from their one-dimensional analogues. For example, the multivariate central limit theorem of Section 7.3 is an immediate consequence of its univariate counterpart.
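A numerical aside: the "trivial fact" behind the Cramér-Wold device can be seen directly in a Monte Carlo computation, since averaging exp(it'Y) over a sample is literally the same computation as evaluating the Fourier transform of the scalar variable t'Y at the point 1. The distribution of Y, the sample size, and the tolerances below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: Y := AZ with Z standard normal, so Y is N(0, V), V = AA'.
A = np.array([[1.0, 0.0],
              [0.7, 0.5]])
V = A @ A.T
Y = rng.standard_normal((100_000, 2)) @ A.T

t = np.array([0.3, -1.1])

# Multivariate Fourier transform P exp(it'Y), approximated by a sample average.
ft_multi = np.mean(np.exp(1j * (Y @ t)))

# Fourier transform of the random variable t'Y, evaluated at the point 1.
ft_linear = np.mean(np.exp(1j * 1.0 * (Y @ t)))

# The two quantities agree sample by sample, not merely in the limit.
assert abs(ft_multi - ft_linear) < 1e-12
# And both approximate exp(-t'Vt/2), the N(0, V) Fourier transform.
assert abs(ft_multi - np.exp(-0.5 * (t @ V @ t))) < 0.05
```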

<15> Example. Suppose X is a random k-vector and Y is a random ℓ-vector for which

P exp(is'X + it'Y) = P exp(is'X) P exp(it'Y) for all s ∈ R^k, all t ∈ R^ℓ.

Pass to image measures to deduce that the joint distribution Q_{X,Y} of X and Y has the same Fourier transform as the product Q_X ⊗ Q_Y of the marginal distributions. By the uniqueness result (i) for Fourier transforms of distributions on R^{k+ℓ}, we must have Q_{X,Y} = Q_X ⊗ Q_Y. That is, X and Y are independent.

You might regard this result as a sort of converse to Theorem <1>. Don't slip into the error of checking only that the factorization holds when s and t happen to be equal.□

<16> Example. A random n-vector X is said to have a multivariate normal distribution if each linear combination t'X, for t ∈ R^n, has a normal distribution. In particular, each coordinate X_i has a normal distribution, with a finite mean and a finite variance. The random vector must have a well defined expectation μ := PX and variance matrix V := P(X − μ)(X − μ)'. The distribution of t'X is therefore N(t'μ, t'Vt), which implies that X must have Fourier transform

P exp(it'X) = exp(it'μ − ½ t'Vt) for all t in R^n.

Write N(μ, V) for the probability distribution with this Fourier transform. Every μ in R^n and nonnegative definite matrix V defines such a distribution.

For if Z is an n-vector of independent N(0,1) random variables and if we factorize


V as AA', with A an n × n matrix, then the random vector X := μ + AZ has Fourier transform ψ_X(t) = P exp(it'(μ + AZ)) = exp(it'μ) ψ_Z(A't). The random vector Z has Fourier transform ψ_Z(s) = Π_{j≤n} exp(−s_j²/2) = exp(−½|s|²), from which it follows that ψ_Z(A't) = exp(−½ t'AA't) = exp(−½ t'Vt).

If V is nonsingular, the N(μ, V) distribution is absolutely continuous with respect to n-dimensional Lebesgue measure, with density

(2π)^{−n/2} (det V)^{−1/2} exp(−½ (x − μ)'V^{−1}(x − μ)).

This result follows via the Jacobian formula (Rudin 1974, Chapter 8) for change of variable from the density (2π)^{−n/2} exp(−|x|²/2) for the N(0, I_n) distribution. (In fact, the Jacobian formula for nonsingular linear transformations is easy to establish by means of invariance arguments.)

If V is singular, the N(μ, V) distribution concentrates on a translate of a subspace, {x ∈ R^n : (x − μ)'V(x − μ) = 0} = {μ + y : Ay = 0}, where V = A'A.□
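The factorization argument of Example <16> is easy to try numerically. The sketch below (an illustration only, with μ, V, sample size, and tolerances chosen arbitrarily) uses a Cholesky factorization V = AA' and checks the sample mean and covariance of X := μ + AZ.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.5]])    # positive definite, so Cholesky applies

# Factorize V as AA'; a spectral factorization would handle singular V as well.
A = np.linalg.cholesky(V)

# X := mu + AZ with Z an n-vector of independent N(0,1) variables.
Z = rng.standard_normal((200_000, 3))
X = mu + Z @ A.T

assert np.allclose(X.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(X.T), V, atol=0.05)
```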

*7. Cramér-Wold without Fourier transforms

Fourier methods were long regarded as essential underpinning for the Cramér-Wold device. Walther (1997) recently found a beautiful direct argument that avoids use of Fourier transforms altogether. I will describe only his method for showing that linear combinations characterize a distribution, which depends on two facts about the normal distribution:

(i) If Z is a vector of k independent random variables, each distributed N(0,1), and if B is a vector of constants, then B · Z has a N(0, |B|²) distribution. (This fact can be established by a direct convolution argument, which makes no use of Fourier transforms.)

(ii) Write Φ for the standard normal distribution function. If θ_1, ..., θ_{2m+1} are distinct real numbers, then there exist real numbers a_1, ..., a_{2m+1} such that the function g(t) := Σ_i a_i Φ(θ_i t), for t > 0, is of order O(t^{2m+1}) near t = 0, and hence g(t)/t^{2m+1} is Lebesgue integrable. Moreover, the {a_i} can be chosen so that ∫_0^∞ g(t)/t^{2m+1} dt ≠ 0.

Walther gave an explicit construction for the constants {a_i} for a particular choice of the {θ_i}. (Actually, he also needed to add another well chosen constant a_0 to g to get the desired properties.) I will give a different proof of (ii) at the end of this Section.

Write m for Lebesgue measure on R^{2m}. The function g serves to define an m-integrable function F on R^{2m} by F(u) := g(1/|u|), for which

mF = m^u g(1/|u|) = ∫_0^∞ C_m r^{2m−1} g(1/r) dr = C_m ∫_0^∞ g(t)/t^{2m+1} dt ≠ 0,

where C_m denotes the surface area of the unit sphere in R^{2m}. Let h be any bounded, continuous, real function on R^{2m}. Integrability of F

justifies an appeal to Dominated Convergence to deduce that

Ph(X) = lim_{σ→0} P^ω m^u (h(X(ω) + σu) F(u)) / mF.


A change of variable in the m integral, followed by an appeal to Fubini, gives

P^ω m^u (h(X(ω) + σu) F(u)) = σ^{−2m} m^y (h(y) P^ω F((y − X(ω))/σ)).

The last expectation is determined by the distributions of linear functions of X, as seen by an appeal to property (i), first conditioning on X, for a random normal vector Z that is independent of X:

Condition on Z to see that the last expression is uniquely determined by the distributions of X · z for z in R^{2m}.

Thus the distributions of the linear combinations X · z, with z ranging over R^{2m}, uniquely determine the expectation Ph(X) for every bounded, continuous, real function h, which is enough to determine the distribution of X.

REMARK. The Cramér-Wold result for random vectors taking values in R^{2m−1} follows directly from the result for R^{2m}: we have only to append one more coordinate variable to X. There is no loss in having a proof for only even dimensions.

Proof of assertion (ii)

The Taylor expansion of the exponential function,

e^{−x} = 1 − x + x²/2! − x³/3! + ... + (−x)^{m−1}/(m−1)! + ((−x)^m/m!) exp(−x*) with 0 < x* < x,

shows that e^{−x} is the sum of a polynomial of degree m − 1 plus a remainder term, r(x), of order O(x^m) near x = 0, such that r(x) > 0 for all x > 0 if m is even, and r(x) < 0 for all x > 0 if m is odd. Replace x by x²/2, divide by √(2π), then integrate from 0 to t to derive a corresponding expansion for the standard normal distribution function: Φ(t) = p(t) + R(t) for all t > 0, where p(t) := Σ_{k=0}^{2m−1} β_k t^k and R(t) is a remainder of order O(t^{2m+1}) near t = 0 that takes only positive (if m is even) or negative (if m is odd) values. Note that β_0 = Φ(0) = 1/2. The constant K := ∫_0^∞ R(t) t^{−2m−1} dt is finite and nonzero.

By construction,

g(t) := Σ_i a_i Φ(θ_i t) = Σ_i a_i p(θ_i t) + Σ_i a_i R(θ_i t) = Σ_{k=0}^{2m−1} β_k t^k Σ_i a_i θ_i^k + Σ_i a_i R(θ_i t).

If we can choose the {a_i} such that Σ_i a_i θ_i^k = 0 for k = 0, 1, ..., 2m − 1, then the contributions from the polynomials p(θ_i t) disappear, leaving

∫_0^∞ g(t)/t^{2m+1} dt = Σ_i a_i ∫_0^∞ R(θ_i t)/t^{2m+1} dt = K Σ_i a_i θ_i^{2m}.

The integral is nonzero if we can also ensure that Σ_i a_i θ_i^{2m} ≠ 0. A simple piece of linear algebra establishes existence of a suitable vector

a := (a_1, ..., a_{2m+1}). Write U_k for (θ_1^k, ..., θ_{2m+1}^k). The vector U_{2m} could not be a linear combination Σ_{k=0}^{2m−1} γ_k U_k, for otherwise the polynomial θ^{2m} − Σ_{k=0}^{2m−1} γ_k θ^k of


degree 2m would have 2m + 1 distinct roots, θ_1, ..., θ_{2m+1}, a contradiction. The component of U_{2m} that is orthogonal to all the other U_k vectors defines a suitable a.
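The existence argument for a is entirely constructive, and a few lines of linear algebra reproduce it numerically. The values of m and the θ_i below are arbitrary illustrative choices; only distinctness matters.

```python
import numpy as np

m = 2                                            # so 2m + 1 = 5 coefficients
theta = np.array([0.5, 1.0, 1.5, 2.0, 2.5])      # distinct values (arbitrary)

# Columns U_k := (theta_1^k, ..., theta_{2m+1}^k) for k = 0, ..., 2m.
U = np.vander(theta, N=2 * m + 1, increasing=True)

# Component of U_{2m} orthogonal to span{U_0, ..., U_{2m-1}} defines a.
Q, _ = np.linalg.qr(U[:, :2 * m])                # orthonormal basis of the span
u_last = U[:, 2 * m]
a = u_last - Q @ (Q.T @ u_last)

# a annihilates the powers 0, ..., 2m-1: sum_i a_i theta_i^k = 0 ...
assert np.allclose(U[:, :2 * m].T @ a, 0.0, atol=1e-10)
# ... but not the power 2m, since U_{2m} lies outside the span.
assert abs(a @ u_last) > 1e-6
```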

*8. The Lévy-Cramér theorem

If X and Y are independent random variables, each normally distributed, it is easy to verify by direct calculation that X + Y has a normal distribution. Surprisingly, there is a converse to this simple result.

<17> Lévy-Cramér theorem. If X and Y are independent random variables with X + Y normally distributed then both X and Y have normal distributions.

REMARK. The proof of the theorem makes use of several facts about analytic functions of a complex variable, such as existence of power series expansions. See Chapters 10 and 13 of the Rudin (1974) text for the required theory.

Proof. With no loss of generality we may assume X + Y to have a standard normal distribution, and Y to have a zero median, that is, P{Y ≥ 0} ≥ ½ and P{Y ≤ 0} ≥ ½. Then, for each x > 0,

P{X > x} ≤ 2P{Y ≥ 0} P{X > x}
 = 2P{Y ≥ 0, X > x}    (independence)
 ≤ 2P{X + Y > x}
 ≤ exp(−x²/2)    (normal tail bound from Section D.1).

A similar argument gives a similar bound for the lower tail. Thus

<18> P{|X| > x} ≤ 2 exp(−x²/2) for all x > 0.

It follows (Problem [11]) that the function g(z) := P exp(zX) is well defined and is an analytic function of z throughout the complex plane C. It inherits a growth condition from the tail bound <18>:

|g(z)| ≤ P|exp(zX)| ≤ P exp(|zX|)
 = 1 + |z| ∫_0^∞ exp(|z|x) P{|X| > x} dx
 ≤ 1 + 2|z| ∫_0^∞ exp(|z|x − x²/2) dx    by <18>
<19>  ≤ exp(C + |z|²) for some constant C.

A similar argument shows that h(z) := P exp(zY) is also well defined for every z. By independence, g(z)h(z) = exp(z²/2) for all z ∈ C, and hence g(z) ≠ 0 for all z. It follows (Rudin 1974, Theorem 13.11) that there exists an analytic function γ(·) on C such that g(z) = exp(γ(z)). We may choose γ so that γ(0) = 0, because g(0) = 1. (In effect, log g can be defined as a single-valued, analytic function on C.) The analytic function γ has a power series expansion γ(z) = Σ_{n≥1} γ_n z^n that converges uniformly to γ on each bounded subset of C.


Decompose γ(re^{iθ}) into its real and imaginary parts U(r, θ) + iV(r, θ). Then, from <19>, we have exp(U) = |exp(U + iV)| ≤ exp(C + r²), which gives

<20> U(r, θ) ≤ C + r² for all re^{iθ} ∈ C.

Uniform convergence of the power series expansion for γ on the circle |z| = r lets us integrate term-by-term, giving

2" . m \2nynrn for n = 1, 2, 3 , . . .for n = 0 , - 1 , - 2 , . . .

In particular, for n = 1, 2, 3, ..., and real β,

∫_0^{2π} γ(re^{iθ}) (exp(−inθ − iβ) + exp(inθ + iβ) + 2) dθ = 2π γ_n r^n e^{−iβ}.

Choose β so that γ_n e^{−iβ} = |γ_n|, then equate real parts to deduce that

∫_0^{2π} U(r, θ)(2 + 2cos(nθ + β)) dθ = 2π|γ_n| r^n.

The integrand on the left-hand side is less than 4U(r, θ) ≤ 4(C + r²). Let r tend to infinity to deduce that γ_n = 0 for n = 3, 4, 5, .... That is,

P exp(zX) = exp(γ_1 z + γ_2 z²).

Problem [12] shows why γ_1 must be real valued and γ_2 must be nonnegative. That is, X has the Fourier transform that characterizes the normal distribution.□
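As a quick numerical sanity check of the tail bound <18>, take the simplest admissible case, X itself standard normal (Y = 0): then P{|X| > x} = erfc(x/√2), which indeed sits below 2 exp(−x²/2). The helper function below is an illustration of ours, not notation from the text.

```python
import math

def normal_two_sided_tail(x):
    """P{|X| > x} for a standard normal X, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))

# Check the bound <18> at a few points.
for x in [0.5, 1.0, 2.0, 3.0, 5.0]:
    assert normal_two_sided_tail(x) <= 2.0 * math.exp(-x * x / 2.0)
```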

9. Problems

[1] Let f = (f_1, ..., f_k) be a vector of μ-integrable, real-valued functions. Define μf as the vector (μf_1, ..., μf_k).

(i) Show that |μf| ≤ μ|f|, where |·| denotes the usual Euclidean norm. Hint: Let a be a unit vector in the direction of μf. Note that a'f ≤ |f|.

(ii) Let f = f_1 + if_2 be a complex-valued function, with μ-integrable real and imaginary parts f_1 and f_2. Show that |μf| ≤ μ|f|, where |·| denotes the modulus of a complex number.

[2] Suppose a Fourier transform has |ψ_P(t_0)| = 1 for some nonzero t_0. That is, ψ_P(t_0) = exp(iθ_0) for some real θ_0. Show that P concentrates on the lattice of points {(θ_0 + 2πn)/t_0 : n ∈ Z}. Hint: Show P^x (1 − cos(t_0 x − θ_0)) = 0. What do you know about 1 − cos(t_0 x − θ_0)?

[3] Suppose f is both integrable with respect to Lebesgue measure m on the real line and absolutely continuous, in the sense that f(x) = m^t ({t ≤ x} ḟ(t)), for all x, for some integrable function ḟ.

(i) Show that f(x) → 0 as |x| → ∞. Hint: Show m^t ḟ(t) = 0 to handle x → ∞.


(ii) Show that m^x (f(x)e^{ixt}) = −m^x (e^{ixt} ḟ(x)) / (it) for t ≠ 0. Hint: For safe Fubini, write the left-hand side as lim_{C→∞} m^x (e^{ixt}{|x| ≤ C} m^s (ḟ(s){s ≤ x})).

(iii) If a probability density has Lebesgue integrable derivatives up to kth order, prove that its Fourier transform is of order O(|t|^{−k}) as |t| → ∞.

[4] Let m denote Lebesgue measure on B(R). For each f in L¹(m) show that m^x (f(x)e^{ixt}) → 0 as t → ±∞. Hint: Check the result for f equal to the indicator function of a bounded interval. Show that the linear space spanned by such indicator functions is dense in L¹(m). Alternatively, approximate f by linear combinations of densities with integrable derivatives, then invoke Problem [3].

[5] Let g(·) be a bounded real function on the real line with bounded derivatives up to order k + 1, and let X be a random variable for which P|X|^k < ∞. Let R_k(·) be the remainder in the Taylor expansion up to kth power:

g(x) = g(0) + xg′(0) + ... + (x^k/k!) g^{(k)}(0) + R_k(x).

(i) Show that |R_k(x)| ≤ C min(|x|^k, |x|^{k+1}), for some constant C.

(ii) Invoke Dominated Convergence to show that

Pg(tX) = g(0) + tg′(0)PX + ... + (t^k/k!) g^{(k)}(0) PX^k + o(t^k) as t → 0.

[6] Let P denote the double-exponential distribution, given by the density p(x) = ½ exp(−|x|) with respect to Lebesgue measure on B(R).

(i) Show that P^x e^{sx} = 1/(1 − s²) for real s with |s| < 1.

(ii) By a leap of faith (or by an appeal to analytic continuation), deduce that ψ_P(t) = 1/(1 + t²) for t ∈ R.

(iii) The Cauchy probability distribution is given by the density q(x) = π^{−1}/(1 + x²). Apply the inversion formula from Theorem <10> to deduce that the Cauchy distribution has Fourier transform exp(−|t|).
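The transform exp(−|t|) asserted in (iii) can be verified by brute-force numerical integration of exp(itx) q(x). The grid and cutoff below are ad hoc choices; because of the heavy Cauchy tails, truncation at |x| ≤ 1000 already costs about 2/(1000π) ≈ 6 × 10⁻⁴ in accuracy.

```python
import numpy as np

# Riemann-sum approximation to psi(t) = integral of exp(itx) q(x) dx, |x| <= 1000.
x = np.linspace(-1000.0, 1000.0, 2_000_001)
dx = x[1] - x[0]
q = 1.0 / (np.pi * (1.0 + x * x))        # Cauchy density

for t in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    ft = np.sum(q * np.exp(1j * t * x)) * dx
    assert abs(ft - np.exp(-abs(t))) < 2e-3
```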

[7] Suppose (X, Y) has a multivariate normal distribution with cov(X, Y) = 0. Show that X and Y are independent. (Hint: What is the Fourier transform of P_X ⊗ P_Y?)

[8] Let X_1, ..., X_n be independent, Uniform(0,1) distributed random variables. Calculate the logarithm of the Fourier transform of n^{−1/2}(X_1 + ... + X_n − n/2) up to a remainder term of order O(n^{−2}).

[9] Suppose X is a random variable with Fourier transform ψ_X.

(i) For each δ > 0, show that

(1/(2δ)) ∫_{−δ}^{δ} (1 − ψ_X(t)) dt ≥ c_0 P{|X| > 1/δ},

where 1 − c_0 := sup_{|x|>1} |sin x / x|.

(ii) If ψ is a complex-valued function, continuous at the origin, show that

(1/(2δ)) ∫_{−δ}^{δ} (ψ(0) − ψ(t)) dt → 0 as δ → 0.


(iii) Suppose {X_n} is a sequence of random variables whose Fourier transforms ψ_n converge pointwise to a function ψ that is continuous at the origin. Show that X_n is of order O_p(1).

(iv) Show that the ψ from part (iii) equals the Fourier transform of some random variable Z representing the limit in distribution of a subsequence of {X_n}. Deduce via Theorem <12> that X_n ⇝ Z.

[10] Suppose a random variable X has a Fourier transform ψ for which ψ(t) = 1 + O(t²) near t = 0.

(i) Show that there exists a finite constant C such that

C ≥ P^x ((1 − cos(tx))/t² {|x| ≤ M})

for all M and all t near 0. Deduce via Dominated Convergence that 2C ≥ PX².

(ii) Show that PX = 0.

[11] Suppose X is a random variable, and f(z, X) is a jointly measurable function that is analytic in a neighborhood N := {z ∈ C : |z − z_0| < δ} of a point z_0 in the complex plane. Write f′(z, X) for the derivative with respect to z. Suppose |f′(z, X)| ≤ M(X) for |z − z_0| < δ, where PM(X) < ∞. Show that Pf(z, X) is analytic in N, with derivative Pf′(z, X). Hint: Reduce to the corresponding theorem for differentiation with respect to a real variable by defining g(t, X) := f(z_0 + th, X) for 0 ≤ t ≤ 1, with h fixed.

[12] Suppose a random variable X has Fourier transform ψ(t) = exp(ict − dt²), for complex numbers c := c_1 + ic_2 and d := d_1 + id_2.

(i) Deduce from the facts that |ψ(t)| ≤ 1 and ψ(t) = 1 − c_2 t + ic_1 t + o(t) near t = 0 that c_2 = 0.

(ii) Show that X − c_1 has Fourier transform exp(−dt²) = 1 + O(t²) near t = 0. Deduce from Theorem <2> and Problem [10] that 2d = P|X − c_1|², which is nonnegative.

10. Notes

Feller (1971, Chapters 15 and 16) is a good source for facts about Fourier transforms (characteristic functions). Much of my exposition in the first four Sections is based on his presentation, with help from Breiman (1968, Chapter 8).

The idea of generating functions is quite old; see the entries under generating functions in the index to Stigler (1986a) for descriptions of the contributions of De Moivre, Simpson, Lagrange, and Laplace.

Apparently Lévy borrowed the name characteristic function from Poincaré (who used it to refer to what is now known as the moment generating function), when rediscovering the usefulness of the Fourier transform for probability theory calculations, unaware of earlier contributions.


Lévy (1922) extended the classical Fourier inversion formula to general probability distributions on the real line. He also used an inversion formula to prove a form of the Continuity Theorem slightly weaker than Theorem <12>. (He required uniform convergence on bounded intervals, but his method of proof works just as well with pointwise convergence.) He noted that the theorem could be proved by reduction to the case of bounded densities, by convolution smoothing, offering the normal as a suitable source of smoothing; compare with Lévy (1925, page 197), where he used convolution with a uniform distribution for the same purpose. The slightly stronger version of the Continuity Theorem described in Problem [9] is due to Cramér (1937), albeit originally incorrectly stated (Cramér 1976, page 525).

The book by Hall & Heyde (1980) is one of the best references for martingale theory in discrete time. It contains a slightly stronger form of Theorem <14>.

The results in Section 6 concerning characterizations via linear combinations come from Cramér & Wold (1936). In the 1998 Addendum to his 1997 paper, Walther noted that Radon (1917) had proved similar results, also without the use of Fourier theory. I have not seen Radon's paper.

I borrowed the proof of the Lévy-Cramér theorem from Chow & Teicher (1978, Section 8.4). The result was conjectured by Lévy (1934, final paragraph) then proved by Cramér (1936); see Cramér (1976, page 522) and Lévy (1970, page 111). The last part of the proof in Section 8 essentially establishes a special case of the result of Hadamard originally invoked by Cramér. See Le Cam (1986, page 80) and Loève (1973, page 3) for further discussion of why the result plays such a key role in the statement of necessary and sufficient conditions for the central limit theorem to hold.

REFERENCES

Breiman, L. (1968), Probability, first edn, Addison-Wesley, Reading, Massachusetts.

Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchangeability, Martingales, Springer, New York.

Cramér, H. (1936), 'Über eine Eigenschaft der normalen Verteilungsfunktion', Mathematische Zeitschrift 41, 405-414.

Cramér, H. (1937), Random Variables and Probability Distributions, Cambridge University Press.

Cramér, H. (1976), 'Half a century with probability theory: some personal recollections', Annals of Probability 4, 509-546.

Cramér, H. & Wold, H. (1936), 'Some theorems on distribution functions', Journal of the London Mathematical Society 11, 290-294.

Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. 2, second edn, Wiley, New York.

Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press, New York.

Le Cam, L. (1986), 'The central limit theorem around 1935', Statistical Science 1, 78-96.

Lévy, P. (1922), 'Sur la détermination des lois de probabilité par leurs fonctions caractéristiques', Comptes Rendus de l'Académie des Sciences, Paris 175, 854-856.

Lévy, P. (1925), Calcul des Probabilités, Gauthier-Villars, Paris.

Lévy, P. (1934), 'Sur les intégrales dont les éléments sont des variables aléatoires indépendantes', Ann. Scuola Norm. Sup. Pisa (2) 3, 337-366.

Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. Page references from the 1954 second edition.

Lévy, P. (1970), Quelques Aspects de la Pensée d'un Mathématicien, Blanchard, Paris.

Loève, M. (1973), 'Paul Lévy, 1886-1971', Annals of Probability 1, 1-18. Includes a list of Lévy's publications.

McLeish, D. L. (1974), 'Dependent central limit theorems and invariance principles', Annals of Probability 2, 620-628.

Petrov, V. V. (1972/75), Sums of Independent Random Variables, Springer-Verlag. English translation in 1975, from 1972 Russian edition.

Radon, J. (1917), 'Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten', Ber. Verh. Sächs. Akad. Wiss. Leipzig, Math.-Nat. Kl. 69, 262-277.

Rudin, W. (1974), Real and Complex Analysis, second edn, McGraw-Hill, New York.

Stigler, S. M. (1986a), The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, Massachusetts.

Walther, G. (1997), 'On a conjecture concerning a theorem of Cramér and Wold', Journal of Multivariate Analysis 63, 313-319. Addendum in same journal, 63, 431.


Chapter 9

Brownian motion

SECTION 1 collects together some facts about stochastic processes and the normal distribution, for easier reference.

SECTION 2 defines Brownian motion as a Gaussian process indexed by a subinterval T of the real line. Existence of Brownian motions with and without continuous sample paths is discussed. Wiener measure is defined.

SECTION 3 constructs a Brownian motion with continuous sample paths, using an orthogonal series expansion of square integrable functions.

SECTION *4 describes some of the finer properties (lack of differentiability, and a modulus of continuity) for Brownian motion sample paths.

SECTION 5 establishes the strong Markov property for Brownian motion. Roughly speaking, the process starts afresh as a new Brownian motion after stopping times.

SECTION *6 describes a family of martingales that can be built from a Brownian motion, then establishes Lévy's martingale characterization of Brownian motion with continuous sample paths.

SECTION *7 shows how square integrable functions of the whole Brownian motion path can be represented as limits of weighted sums of increments. The result is a thinly disguised version of a remarkable property of the isometric stochastic integral, which is mentioned briefly.

SECTION *8 explains how the result from Section 7 is the key to the determination of option prices in a popular model for changes in stock prices.

1. Prerequisites

Broadly speaking, Brownian motion is to stochastic process theory as the normal

distribution is to the theory for real random variables. They both arise as natural

limits for sums of small, independent contributions; they both have rescaling and

transformation properties that identify them amongst wider classes of possible

limits; and they have both been studied in great detail. Every probabilist, and

anyone dealing with continuous-time processes, should learn at least a little about

Brownian motion, one of the most basic and most useful of all stochastic processes.

This Chapter will define the process and explain a few of its properties.

The discussion will draw on a few basic ideas about stochastic processes, and

a few facts about the normal distribution, which are summarized in this Section.


A stochastic process is just a family of random variables {X_t : t ∈ T}, all defined on the same probability space, say (Ω, F, P). Throughout the Chapter, the index set T will always be R⁺ or a subinterval [0, a] of R⁺. You should think of the parameter t as time, with the stochastic process evolving in time.

I will use the symbols X_t(ω) and X(t, ω) interchangeably. The latter notation suggests that we regard the whole process as a single function X on T × Ω, and use the single letter X to refer to the whole family of random variables. We can also treat X(t, ω) as a family of functions X(·, ω) defined on T, one for each ω. Each of these functions, t ↦ X(t, ω), is called a sample path of the process. Each viewpoint (a family of random variables, a single function of two arguments, and a family of sample paths) has its advantages.

As time passes, we learn more about the process and about other random variables defined on Ω, a situation represented (as in Chapter 6) by a filtration: a family of sub-sigma-fields {F_t : t ∈ T} of F for which F_s ⊆ F_t whenever s ≤ t. A stochastic process {X_t : t ∈ T} is said to be adapted to the filtration if X_t is F_t-measurable for each t. On occasion it will be helpful to enlarge a filtration slightly, replacing F_t by the sigma-field F̃_t generated by F_t ∪ N, with N the class of P-negligible subsets of Ω. I will refer to {F̃_t : t ∈ T} as the completed filtration.

The joint distributions of the subcollections {X_t : t ∈ S}, with S ranging over all the finite subsets of T, are called the finite dimensional distributions (or fidis) of the process. If all the fidis are multivariate normal, the process is said to be Gaussian. If each X_t has zero expected value, the process is said to be centered.

The striking behavior of Brownian motion will be largely determined by just a few properties of the normal distribution (see Section 8.6):

(a) A multivariate normal distribution is uniquely determined by its vector of means and its variance matrix. (The Fourier transform is a function of those two quantities.)

(b) If X and Y have a bivariate normal distribution, then X is independent of Y if and only if cov(X, Y) = 0. That is, under an assumption of joint normality, independence is equivalent to orthogonality of X − μ_X and Y − μ_Y, in the L²(P) sense, where μ_X := PX and μ_Y := PY.

(c) If Z_1, Z_2, ..., Z_n are independent random variables, each with a N(0,1) distribution, then all linear combinations Σ_i a_i Z_i have normal distributions. The joint distributions of finite collections of linear combinations of Z_1, ..., Z_n are multivariate normal.

(d) If {X_n} is a sequence of random vectors with multivariate normal distributions that converges in distribution to a random vector X, then X has a multivariate normal distribution. The expected value PX_n must converge to PX, and the covariance matrix var(X_n) must converge to var(X). Convergence in L²(P) implies convergence in distribution.

(e) If Z_1, Z_2, ..., Z_n are independent random variables, each with a N(0,1) distribution, then P max_{i≤n} |Z_i| ≤ 2√(1 + log n) and max_{i≤n} |Z_i|/√(2 log n) → 1 almost surely as n → ∞. (See Problems [1] and [2] for proofs.)
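A simulation gives a feel for fact (e); the seed and thresholds below are arbitrary, and the expectation bound is checked only in the crude sense that a single simulated maximum respects it.

```python
import numpy as np

rng = np.random.default_rng(2)

for n in [100, 10_000, 1_000_000]:
    Z = rng.standard_normal(n)
    M = np.abs(Z).max()
    # The simulated maximum respects the bound 2 sqrt(1 + log n) ...
    assert M <= 2.0 * np.sqrt(1.0 + np.log(n))
    # ... and the ratio to sqrt(2 log n) hovers near 1 (loose check).
    ratio = M / np.sqrt(2.0 * np.log(n))
    assert 0.6 < ratio < 1.4
```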


2. Brownian motion and Wiener measure

There are several closely related definitions for Brownian motion, each focusing on a different desirable property of the process. Let us start from a minimal definition, then build towards a more comfortable set of properties.

<1> Definition. A stochastic process B := {B_t : t ∈ T}, adapted to a filtration {F_t}, is said to be a Brownian motion (for that filtration) if its increments have the following two properties:

(i) for all s < t the increment B_t − B_s is independent of F_s,

(ii) for all s < t the increment B_t − B_s has a N(0, t − s) distribution.

Equivalently,

(iii) PF exp(iθ(B_t − B_s) + ½θ²(t − s)) = PF, for all s < t, all θ ∈ R, all F ∈ F_s.

Proof of the equivalence. Necessity of (iii) follows from the independence and the fact that the N(0, t − s) distribution has Fourier transform exp(−θ²(t − s)/2). For sufficiency, first take F = Ω to show that B_t − B_s is N(0, t − s) distributed. To establish independence, we have only to show that P(g(B_t − B_s)F) = (Pg(B_t − B_s))(PF) for all bounded, measurable functions g. The assertion is trivial if PF = 0. When PF ≠ 0, equality (iii) may be rewritten as

P_F exp(iθ(B_t − B_s)) = exp(−θ²(t − s)/2) = P exp(iθ(B_t − B_s)),

where P_F(A) := P(FA)/PF for all A in F. By the uniqueness theorem for Fourier transforms, B_t − B_s has the same distribution under P_F as under P. In particular,

P_F g(B_t − B_s) = Pg(B_t − B_s), for all bounded, measurable functions g.□

Once we specify the distribution of B_0, the joint distribution of B_0, B_{t_1}, ..., B_{t_k}, for 0 < t_1 < ... < t_k, is uniquely determined. That is, the fidis are uniquely determined by Definition <1> and the distribution of B_0. If B_0 is integrable then so are all the B_t, and the process is a martingale. If B_0 has a normal distribution (possibly degenerate), then all the fidis are multivariate normal, which makes B a Gaussian process. If B_0 = x, for a constant x, the process is said to start at x. In particular, when B_0 = 0, the Brownian motion is a centered Gaussian process, whose fidis are uniquely determined by the covariances,

cov(B_s, B_t) = cov(B_s, B_s) + cov(B_s, B_t − B_s) = s + 0 if s ≤ t.

More succinctly, cov(B_s, B_t) = s ∧ t for all s, t in R⁺.

Often one speaks of a Brownian motion without explicit mention of the

filtration, in which case it is implicit that F_t equals F_t^B := σ{B_s : s ≤ t}, the natural or Brownian filtration. In that case, a simple generating class argument shows that property (i) is equivalent to the assertion:

(i)′ for all choices of t_0 < t_1 < t_2 < ... < t_k from T, the random variables {B(t_j) − B(t_{j−1}) : j = 1, 2, ..., k} and B(t_0) are independent.

A centered Gaussian process with cov(B_s, B_t) = s ∧ t for all s, t in T is a Brownian motion for the natural filtration.
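A discrete-grid simulation illustrates the covariance identity: build many paths from independent N(0, δ) increments and compare empirical covariances with s ∧ t. Grid size, number of paths, and tolerance below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

n_paths, n_steps, dt = 100_000, 100, 0.01
# Brownian motion on the grid dt, 2dt, ..., 1: cumulative sums of N(0, dt) increments.
B = np.cumsum(rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt), axis=1)
times = dt * np.arange(1, n_steps + 1)

for i, j in [(9, 49), (49, 99), (9, 99)]:
    s, t = times[i], times[j]
    emp_cov = np.mean(B[:, i] * B[:, j])     # B_0 = 0, so no centering is needed
    assert abs(emp_cov - min(s, t)) < 0.02
```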


A further property is usually added to the list of requirements for Brownian

motion, namely that it have continuous sample paths:

(iv) For each fixed ω, the sample path B(·, ω) is a continuous function on T.

Some authors give the mistaken impression that property (iv) follows from properties (i) and (ii). The proper assertion (Problem [3]) is that there exists a stochastic process that satisfies (i) and (ii), and has continuous sample paths, or, more precisely, if {B_t : t ∈ T} satisfies (i) and (ii), then there exists another process {B_t* : t ∈ T}, defined on the same probability space, for which (i), (ii), and (iv) hold, and for which P{B_t = B_t*} = 1 at every t. The new process is called a version of the original process. Notice that B_t* need not be F_t-measurable if F_t does not contain enough negligible sets, but it is measurable with respect to the completed sigma-field F̃_t. In fact, B* is a Brownian motion for the completed filtration: the added negligible sets have no effect on the calculations required to prove that B_t* − B_s* is independent of F̃_s.

In truth, a Brownian motion that did not have all, or almost all, of its sample paths continuous would not be a very nice beast. Many of the beautiful properties of Brownian motion depend on the continuity of its sample paths.

<2> Example. Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous sample paths. The quantity M(ω) := sup_t |B(t, ω)| is finite for each ω, because each continuous function is bounded on the compact set [0,1]. It is an F_1-measurable random variable, because {ω : M(ω) > x} = ∪_{s∈S} {|B_s(ω)| > x}, for any countable, dense subset S of [0,1]. A countable union of sets from F_1 also belongs to F_1.

What happens when the process does not have continuous sample paths, but (i) and (ii) are satisfied? To show you how bad it can get, I will perform pathwise surgery on B to create a version that behaves badly. As you will see, the issue is really one of managing uncountable families of negligible sets. Countable collections of negligible sets can be ignored, but uncountable collections are capable of causing real trouble.

From Problem [7], there is a partition of Ω into an uncountable union of disjoint, F-measurable sets {Ω_t : 0 ≤ t ≤ 1}, with PΩ_t = 0 for every t. Let P(ω) be an arbitrarily nasty, nonmeasurable, nonnegative function on Ω for which P(ω) > M(ω) at each ω. Define B*(t, ω) := B(t, ω){ω ∉ Ω_t} + P(ω){ω ∈ Ω_t}.

By construction, P{B*_t = B_t} = 1 for every t. The joint distributions of finite, or countable, collections of the B*_t variables are the same as the joint distributions for the corresponding collections of the B_t. In particular, B* is a Brownian motion, with respect to the completed filtration.

The construction ensures that |B*(·, ω)| is maximized at the t for which ω ∈ Ω_t, and sup_s |B*_s(ω)| = P(ω). We have built ourselves a process satisfying requirements (i) and (ii) of Definition <1>, but by deliberately violating (iv), we have created a nasty nonmeasurability. □

REMARK. The point of the Example is not that anyone might choose a bad Brownian motion, like B*, in preference to one with continuous sample paths, but rather that there is nothing in requirements (i) and (ii) to exclude the bad version. Continuity of a path requires cooperation of uncountably many random variables, a cooperation that cannot be ensured by requirements expressed purely in terms of joint distributions of finite, or countable, subfamilies of random variables.

A Brownian motion that has continuous sample paths also defines a map, ω ↦ B(·, ω), from Ω into the space C(T) of continuous, real valued functions on T. It becomes a random element of C(T) if we equip that space with its finite dimensional (or cylinder) sigma-field 𝒞(T), the sigma-field generated by the cylinder sets {x ∈ C(T) : (x(t_1), ..., x(t_k)) ∈ A}, with {t_1, ..., t_k} ranging over all finite subsets of T and A ranging over B(R^k), for each finite k.

As an F\𝒞(T)-measurable map, ω ↦ B(·, ω), from Ω into C(T), a Brownian motion induces a probability measure (its distribution or image measure) on 𝒞(T). The distribution is uniquely determined by the fidis, because the collection of cylinder sets is stable under finite intersections and it generates 𝒞(T). For the simplest case, where B_0 = 0, the distribution is called Wiener measure on 𝒞(T), or, less precisely, Wiener measure on T. I will denote it by W, relying on context to identify T.

REMARK. Each coordinate projection, X_t(x) := x(t), defines a random variable on C(T). As a stochastic process on the probability space (C(T), 𝒞(T), W), the family {X_t : t ∈ T} is a Brownian motion with continuous paths, started at 0. For many purposes, the study of Brownian motion is just the study of W.

For the remainder of the Chapter, you may assume that all Brownian motions satisfy requirements (i), (ii), and (iv), with B_0 = 0. That is, unless explicitly warned otherwise, you may assume that all Brownian motions from now on are centered Gaussian processes with continuous sample paths and cov(B_s, B_t) = s ∧ t, a process that I will refer to as standard Brownian motion (on T).
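As a numerical aside (not from the text; the path count, grid size, and the pair of times below are arbitrary choices), the defining covariance of standard Brownian motion is easy to check by simulating paths as cumulative sums of independent N(0, dt) increments:

```python
import numpy as np

# Simulate many paths of standard Brownian motion on [0, 1] by summing
# independent N(0, dt) increments, then check cov(B_s, B_t) = s ∧ t.
rng = np.random.default_rng(0)
n_paths, n_steps = 20000, 100
dt = 1.0 / n_steps
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(increments, axis=1)        # column k holds B((k + 1) * dt)

s_col, t_col = 29, 69                    # the times s = 0.3 and t = 0.7
emp_cov = np.mean(B[:, s_col] * B[:, t_col])   # E B_t = 0, so this estimates the covariance
print(emp_cov)                           # should be close to s ∧ t = 0.3
```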

3. Existence of Brownian motion

It takes some ingenuity to build a Brownian motion with continuous sample paths, a feat first achieved with mathematical rigor by Wiener (1923). This Section contains one construction, based on a few facts about Hilbert spaces, all of which are established in Section 4 of Appendix B.

Suppose H is a Hilbert space with a countable orthonormal basis {ψ_i : i ∈ N}. Let {η_i} be a sequence of independent N(0,1) random variables, defined on some probability space (Ω, F, P). For each fixed h in H, the sequence of random variables G_n(h) := Σ_{i=1}^n ⟨h, ψ_i⟩ η_i converges in L²(P) to a limit, G(h) := Σ_{i∈N} ⟨h, ψ_i⟩ η_i, and by Parseval's identity,

cov(G_n(h_1), G_n(h_2)) = Σ_{i=1}^n ⟨h_1, ψ_i⟩⟨h_2, ψ_i⟩ → Σ_{i∈N} ⟨h_1, ψ_i⟩⟨h_2, ψ_i⟩ = ⟨h_1, h_2⟩.

Note that G(h) is uniquely determined as an element of L²(P), but it is only defined up to an almost sure equivalence as a random variable.

In particular, from the facts (c) and (d) in Section 1, the random variable G(h) has a N(0, ‖h‖²) distribution. Moreover, for each finite subset {h_1, ..., h_k} of H, the random vector (G(h_1), ..., G(h_k)) has a multivariate normal distribution with zero means and covariances given by P G(h_i)G(h_j) = ⟨h_i, h_j⟩. That is, all the fidis of the process are centered multivariate normal. The family of random variables {G(h) : h ∈ H} is a Gaussian process that is sometimes called the isonormal process, indexed by the Hilbert space H (compare with Dudley 1973, page 67).
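The covariance identity has a finite-dimensional stand-in that is easy to check numerically (an illustration, not from the text: R^5 plays the role of H, the standard basis plays the role of {ψ_i}, and the dimensions and sample size are my choices):

```python
import numpy as np

# G(h) := sum_i <h, e_i> * eta_i for independent N(0,1) eta_i gives a centered
# Gaussian family with cov(G(h1), G(h2)) = <h1, h2>  (Parseval, in R^d).
rng = np.random.default_rng(1)
d, n_samples = 5, 200000
h1 = rng.normal(size=d)                  # two fixed elements of R^d
h2 = rng.normal(size=d)
eta = rng.normal(size=(n_samples, d))    # each row: one draw of (eta_1, ..., eta_d)
G1, G2 = eta @ h1, eta @ h2              # samples of G(h1) and G(h2)
emp_cov = np.mean(G1 * G2)
print(emp_cov, h1 @ h2)                  # empirical covariance vs inner product
```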

REMARK. Notice that the map h ↦ G(h) is linear and continuous, as a function from H into L²(P). Thus G(h) can be recovered, up to an almost sure equivalence, from the values {G(e_i) : i ∈ N} for any orthonormal basis {e_i : i ∈ N} for H.

To build a Brownian motion indexed by [0,1], specialize to the case where H := L²(m), with m equal to Lebesgue measure on [0,1]. Write f_t for the indicator function of the interval [0, t]. The subset {f_t : t ∈ [0,1]} of H defines a centered Gaussian process B_t := G(f_t) indexed by [0,1], with

cov(B_s, B_t) = m(f_s f_t) = m[0, s ∧ t] = s ∧ t.

That is, if we take F_t as σ{B_s : s ≤ t} then {(B_t, F_t) : t ∈ [0,1]} is a centered Gaussian process with the covariances that identify it as a Brownian motion indexed by [0,1], in the sense that properties (i) and (ii) of Definition <1> hold. The question of sample path continuity is more delicate.

Each partial sum G_n defines a process with continuous paths, because

|G_n(f_s) − G_n(f_t)|² ≤ (Σ_{i=1}^n ⟨f_s − f_t, ψ_i⟩²)(Σ_{i=1}^n η_i²),

and Σ_{i=1}^n ⟨f_s − f_t, ψ_i⟩² ≤ ‖f_s − f_t‖² = |s − t|. We need to preserve continuity in the limit as n tends to infinity. Convergence uniform in t would suffice.

Something slightly weaker than uniform convergence is easy to check if we work with the orthonormal basis of Haar functions on [0, 1]. It is most natural to specify this basis via a double indexing scheme. For k = 0, 1, ... and 0 ≤ i < 2^k, define H_{i,k} as a difference of indicator functions,

H_{i,k}(s) := {i2^{−k} < s ≤ (i + ½)2^{−k}} − {(i + ½)2^{−k} < s ≤ (i + 1)2^{−k}}.

Notice that |H_{i,k}| is the indicator function of the interval J_{i,k} := (i2^{−k}, (i + 1)2^{−k}], and H_{i,k} equals the indicator of J_{2i,k+1} minus the indicator of J_{2i+1,k+1}. Thus m H_{i,k}² = m J_{i,k} = 2^{−k}. The functions ψ_{i,k} := 2^{k/2} H_{i,k} are orthogonal, and each has L²(m) norm 1. As shown in Section 3 of Appendix B, the collection of functions Ψ := {1} ∪ {ψ_{i,k} : 0 ≤ i < 2^k, for k = 0, 1, 2, ...} is an orthonormal basis for L²(m). The Brownian motion has the L²(m) representation,

<3>  B_t = ηt + Σ_{k=0}^∞ 2^{k/2} X_k(t),  with X_k(t) := Σ_{0≤i<2^k} η_{i,k} ⟨f_t, H_{i,k}⟩,

where η and the {η_{i,k}} are mutually independent N(0, 1) random variables.

As a function of t, each ⟨f_t, H_{i,k}⟩ is nonzero only in the interval J_{i,k}, within which it is piecewise linear, achieving its maximum value of 2^{−(k+1)} at the midpoint:

⟨f_t, H_{i,k}⟩ = 2^{−(k+1)} ∧ (t − i2^{−k})⁺ − 2^{−(k+1)} ∧ (t − (2i + 1)2^{−(k+1)})⁺.

[Figure: the tent function ⟨f_t, H_{i,k}⟩ and the sawtooth X_k, with nodes at 2i/2^{k+1}, (2i + 1)/2^{k+1}, (2i + 2)/2^{k+1} and peak values η_{i,k}/2^{k+1}.]

The process X_k(t) has continuous, piecewise linear sample paths. It takes the value 0 at t = i/2^k for i = 0, 1, ..., 2^k. It takes the value η_{i,k}/2^{k+1} at the point (2i + 1)/2^{k+1}. Thus sup_t |X_k(t)| = 2^{−(k+1)} max_i |η_{i,k}|. From property (e) in Section 1,

P Σ_{k=0}^∞ 2^{k/2} sup_t |X_k(t)| ≤ Σ_{k=0}^∞ 2^{k/2} 2^{−(k+1)} C √(1 + log(2^k)) < ∞,

which implies finiteness of Σ_k 2^{k/2} sup_t |X_k(t)| almost everywhere. With probability one, the random series <3> for B_t converges uniformly in 0 ≤ t ≤ 1; almost all sample paths of B are continuous functions of t. If we redefine B(t, ω) to be identically zero for a negligible set of ω, we have a Brownian motion with all its sample paths continuous.

From the Brownian motion B indexed by [0,1] we could build a Brownian motion β indexed by R⁺ by defining

β_t := (1 + t) B(t/(1 + t)) − t B(1)  for t ∈ R⁺.

Clearly {β_t : t ∈ R⁺} is a centered Gaussian process with continuous paths and β_0 = B_0 = 0. You should check that cov(β_s, β_t) = s ∧ t, in order to complete the argument that it is a Brownian motion.
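The suggested covariance check can also be done by simulation (a Monte Carlo sketch, not part of the text; the grid resolution and the particular pair s = 1, t = 3 are arbitrary choices):

```python
import numpy as np

# beta_t := (1 + t) * B(t / (1 + t)) - t * B(1) should satisfy
# cov(beta_s, beta_t) = s ∧ t; check at s = 1, t = 3 (where s ∧ t = 1).
rng = np.random.default_rng(3)
n_paths, n_grid = 10000, 800
B = np.cumsum(rng.normal(0, np.sqrt(1 / n_grid), (n_paths, n_grid)), axis=1)

def B_at(u):
    """Value of B(u) at the nearest grid point, for 0 < u <= 1."""
    return B[:, int(round(u * n_grid)) - 1]

def beta(tt):
    return (1 + tt) * B_at(tt / (1 + tt)) - tt * B_at(1.0)

s, t = 1.0, 3.0
emp_cov = np.mean(beta(s) * beta(t))     # beta is centered
print(emp_cov)                           # should be close to s ∧ t = 1.0
```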

REMARK. We could also write Brownian motion indexed by R⁺ as a doubly infinite series,

B_t = Σ_{k=−∞}^{∞} 2^{k/2} Σ_{i≥0} η_{i,k} ⟨f_t, H_{i,k}⟩  for t ∈ R⁺,

which converges uniformly (almost surely) on each bounded subinterval. When we focus on [0,1], the terms for k < 0 contribute t Σ_{k<0} 2^{k/2} η_{0,k}, which corresponds to the ηt in expansion <3>; and for k ≥ 0, only the terms with 0 ≤ i < 2^k contribute to B_t.

*4. Finer properties of sample paths

Brownian motion can be constructed to have all of its sample paths continuous, but almost all of its paths must be nowhere differentiable. The heuristic explanation for this extreme irregularity is: existence of a derivative would imply approximate proportionality of successive small increments, implying a degree of cooperation that is highly unlikely for a succession of independent normals.

A formal proof can be built from the heuristic. Consider first a single continuous function x on [0,1], which happens to be differentiable at some point t_0 in (0,1), with a finite derivative v; that is, x(t) = x(t_0) + (t − t_0)v + o(|t − t_0|) near t_0. For a positive integer m, let i_m be the integer defined by (i_m − 1)/m ≤ t_0 < i_m/m. Then

x((i_m + j + 1)/m) − x((i_m + j)/m) = v/m + o(1/m)  for each fixed j = 0, 1, 2, ...,

and hence

m Δ_{i_m,m} x := m ( x((i_m + 2)/m) − 2 x((i_m + 1)/m) + x(i_m/m) ) → 0  as m → ∞.

Similarly, both m Δ_{i_m+2,m} x and m Δ_{i_m+4,m} x must also converge to zero. By considering successive second differences, we eliminate both t_0 and v, leaving the conclusion that if x is differentiable (with a finite derivative) at an unspecified point of (0,1) then, for all m large enough, there must exist at least one i for which

m|Δ_{i,m} x| ≤ 1  and  m|Δ_{i+2,m} x| ≤ 1  and  m|Δ_{i+4,m} x| ≤ 1.

Apply the same reasoning to each sample path of the Brownian motion {B_t : 0 ≤ t ≤ 1} to see that the set of ω for which B(·, ω) is somewhere differentiable on (0,1) is contained in

liminf_{m→∞} {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1 for some i},

where the Δ_{i,m}B are the second differences for the sample path B(·, ω). Each of Δ_{0,m}B, Δ_{2,m}B, Δ_{4,m}B, ... has a N(0, 2/m) distribution—being a difference of two independent random variables, each N(0, 1/m) distributed—and they are independent. By Fatou's lemma, the probability of the displayed event is smaller than the liminf of the probabilities

P ∪_i {ω : m|Δ_{i,m}B| ≤ 1, m|Δ_{i+2,m}B| ≤ 1, m|Δ_{i+4,m}B| ≤ 1} ≤ m (P{m|N(0, 2/m)| ≤ 1})³ = O(m^{−1/2}).

Thus almost all Brownian motion sample paths are nowhere differentiable.

The nondifferentiability is also suggested by the fact that (B_{t+δ} − B_t)/δ has a N(0, 1/δ) distribution, which could not settle down to a finite limit as δ tends to zero, for a fixed t. Indeed, the increment B_{t+δ} − B_t should be roughly of magnitude √δ. The maximum such increment, D(δ) := sup_{0≤t≤1−δ} |B_{t+δ} − B_t|, is even larger. For example, part (i) of Problem [2] shows, for small ε > 0 and k large enough, that

P{ max_{0≤i<2^k} 2^{k/2} |B((i + 1)2^{−k}) − B(i2^{−k})| ≤ (1 − ε)√(2 log 2^k) } ≤ C exp(−2^{θk}),

for constants C > 0 and θ > 0 depending on ε. A Borel-Cantelli argument with δ_k := 2^{−k} then gives

<5>  limsup_{δ↓0} D(δ)/√(2δ log(1/δ)) ≥ limsup_{k→∞} max_{0≤i<2^k} |B(iδ_k + δ_k) − B(iδ_k)|/√(2δ_k log(1/δ_k)) ≥ 1

almost surely. A similar argument (Lévy 1937, page 172; see McKean 1969, Section 1.6 for a concise presentation of the proof) leads to a sharper conclusion,

<6>  limsup_{δ↓0} D(δ)/√(2δ log(1/δ)) = 1  almost surely.
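Assertion <6> can be eyeballed on a simulated path (a rough discrete-grid illustration, not part of the text; the grid size and the span δ = 2^{−10} are arbitrary choices, and on a finite grid the ratio is only approximately 1):

```python
import numpy as np

# For a path on a 2^16-point grid, compare the maximal increment over span
# delta with Levy's modulus sqrt(2 * delta * log(1 / delta)).
rng = np.random.default_rng(4)
n = 2 ** 16
B = np.cumsum(rng.normal(0, np.sqrt(1 / n), n))

span = 2 ** 6                            # delta = 2^-10
delta = span / n
D = np.max(np.abs(B[span:] - B[:-span])) # D(delta) over the grid
ratio = D / np.sqrt(2 * delta * np.log(1 / delta))
print(ratio)                             # typically near 1
```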

More broadly, √(δ log(1/δ)) gives a global bound for the magnitude of the increments—a modulus of continuity—as shown in the next Theorem. To avoid trivial complications for δ near 1, it helps to increase the modulus slightly, by adding a term that becomes inconsequential for δ near zero.

<7> Theorem. Define h(δ) := √(δ + δ log(1/δ)) for 0 < δ ≤ 1. Then for almost all ω there is a finite constant C_ω such that

|B_t(ω) − B_s(ω)| ≤ C_ω h(|t − s|)  for all s, t in [0, 1],

where B is a standard Brownian motion indexed by [0,1].

Proof. Consider a pair with 0 ≤ s < t ≤ 1. Temporarily write δ for t − s. From the series representation <3>,

|B_t − B_s| ≤ |(t − s)η| + Σ_{k=0}^∞ 2^{k/2} |X_k(t) − X_k(s)|
≤ δ|η| + Σ_{k=0}^∞ Σ_i 2^{k/2} |⟨f_t − f_s, H_{i,k}⟩| M_k,  where M_k := max_i |η_{i,k}|.

For a fixed k, the function f_t − f_s is orthogonal to all H_{i,k} except possibly when t or s belongs to the support interval J_{i,k}. There are at most two such intervals, and for them we have the bound |⟨f_t − f_s, H_{i,k}⟩| ≤ m((s, t] ∩ J_{i,k}) ≤ |s − t| ∧ 2^{−k}. Thus

sup_{s≠t} |B_t − B_s| / h(|t − s|) ≤ |η| + sup_{0<δ≤1} Σ_{k=0}^∞ 2^{1+k/2} (δ ∧ 2^{−k}) M_k / h(δ).

From Problem [2], for almost all ω there is a k_0(ω) such that M_k(ω) ≤ 2√(log(1 + 2^k)) for k ≥ k_0(ω). From Problem [5] we then have

Σ_{k=0}^∞ 2^{k/2} (δ ∧ 2^{−k}) √(log(1 + 2^k)) ≤ C_0 h(δ)  for 0 < δ ≤ 1,

for a finite constant C_0, which, together with the fact that δ ≤ h(δ), gives

sup_{s≠t} |B_t(ω) − B_s(ω)| / h(|t − s|) < ∞  for almost all ω,

the desired bound. □

5. Strong Markov property

Let B be a standard Brownian motion indexed by R⁺. For each fixed t, define the restarted process R^tB by moving the origin to the point (t, B_t). That is, (R^tB)_s := B_{t+s} − B_t at each time s in R⁺.

For simplicity, let me temporarily write X_s instead of (R^tB)_s. As a process, {X_s : s ∈ R⁺} is adapted to the filtration G_s := F_{s+t} for s ∈ R⁺. Moreover, each increment X_{s+δ} − X_s has a N(0, δ) distribution, independent of G_s. Thus X is also a standard Brownian motion. It has distribution W, Wiener measure on 𝒞(R⁺). For each finite subset S of R⁺, the collection of random variables {X_s : s ∈ S} is independent of F_t. Via the usual sort of generating class argument it then follows that X, as a random element of C(R⁺), is independent of F_t.

That is, P(F g(X)) = (PF)(P g(X)) for each F ∈ F_t and (at least) for each bounded, 𝒞(R⁺)-measurable function g on C(R⁺). Equivalently,

P(g(X) | F_t) = Wg  almost surely.

The fact that R^tB is a Brownian motion independent of F_t is known as the Markov property of Brownian motion.

If we replace the fixed time t by a stopping time τ, we get a stronger assertion, known as the strong Markov property. Roughly speaking, the restarted process R^τB is a Brownian motion independent of the pre-τ sigma-field F_τ, which consists of all F for which F{τ ≤ t} is F_t-measurable, for 0 ≤ t < ∞.

REMARK. Remember that F_∞ is defined, if not otherwise specified, as the sigma-field generated by ∪{F_t : t ∈ R⁺}. We need to ensure that F_τ ⊆ F_∞, to avoid embarrassing ambiguities about the definition of F_τ for the extreme case of a stopping time that is everywhere infinite.

We could write the stronger assertion as P(g(X) | F_τ) = Wg almost surely on the set {τ < ∞}, but that equality does not quite capture everything we need, as you will discover when we consider the reflection principle, in Example <12>. In that Example we will meet a stopping time τ for which B_{t∧τ} = a, a positive constant, whenever τ(ω) ≤ t. We will need to make an assertion like

P{B_t ≤ a | F_τ} = P{B_t − B_τ ≤ 0 | F_τ} = ½  on the set {τ ≤ t}.

Intuitively speaking, the conditioning lets us treat the F_τ-measurable random variable τ as a constant, so that B_t − B_τ has a N(0, t − τ) conditional distribution, which is symmetric about 0. Unfortunately, this line of reasoning takes us beyond the properties of the abstract, Kolmogorov conditional expectation—if you were paying very careful attention while reading Chapter 5, you will recognize a covert appeal to existence of a conditional distribution. Fortunately, Fubini offers a way around the technicality.

Rewrite the Markov property as a pathwise decomposition of Brownian motion into two independent contributions: R^tB shifted to the origin (t, 0); and a killed process K^tB, defined by (K^tB)_s := B_{s∧t} for s ∈ R⁺. More formally, define S^t as the operator that shifts functions to the right,

(S^t x)(s) := x(s − t) for s ≥ t,  and (S^t x)(s) := 0 for 0 ≤ s < t.

Then B = K^tB + S^t R^tB. The Markov property lets us replace R^tB by a new Brownian motion, independent of F_t, without changing the distributional properties of the whole path. For example, for each F_t-measurable Y and (at least) for each bounded, product measurable real function f, we have (via a generating class argument)

<9>  P f(Y, B) = P^ω W^x f(Y(ω), K^tB(·, ω) + S^t x).

The Y could take values in some arbitrary measurable space. We might take Y = K^tB, for instance.

REMARK. If Y takes values in (X, A), we can define an A ⊗ 𝒞(R⁺) ⊗ B([0, ∞])-measurable function g by g(y, z, t) := W^x f(y, K^t z + S^t x). Then the Markov property becomes P f(Y, B) = P^ω g(Y(ω), B(·, ω), t), for F_t-measurable Y. Multiple appeals to the Tonelli/Fubini theorem would establish the necessary measurability properties.

[Figure: a sample path x decomposed into the killed path K^τ x and the shifted, restarted path S^τ R^τ x.]

Now consider the effect of replacing the fixed t by a stopping time τ. For every sample path we have B(·, ω) = K^{τ(ω)}B(·, ω) + S^{τ(ω)} R^{τ(ω)}B(·, ω). The decomposition even makes sense for those ω at which τ(ω) = ∞, because K^∞x = x and S^∞ shifts whatever we decide to define as R^∞x right out of the picture. For concreteness, perhaps we could take S^∞x ≡ 0. Of course it would be a little embarrassing to assert that R^τB is, conditionally, a Brownian motion at those ω.

<10> Strong Markov property. Let B be a standard Brownian motion for a filtration {F_t : t ∈ R⁺}, and let τ be a stopping time. Then for each F_τ-measurable random element Y, and (at least) for each bounded, product measurable function f,

P f(Y, B) = P^ω W^x f(Y, K^{τ(ω)}B(·, ω) + S^{τ(ω)} x).

REMARK. In the notation from the previous Remark, the assertion becomes: P f(Y, B) = P^ω g(Y(ω), B(·, ω), τ(ω)), for F_τ-measurable Y.

Proof. A generating class argument reduces to the case where f(y, z) := g(y)h(z), where g is bounded and measurable, and h(z) := h_0(z(s_1), ..., z(s_k)) with h_0 a bounded, continuous function on R^k and s_1, ..., s_k fixed values in R⁺. Discretize the stopping time by rounding up to the next multiple of n^{−1},

τ_n := 0{τ = 0} + Σ_{i∈N} (i/n){(i − 1)/n < τ ≤ i/n} + ∞{τ = ∞}.

For each n,

P f(Y, B) = P f(Y, B){τ = ∞} + Σ_{i∈N} P {τ_n = i/n} g(Y) h(K^{i/n}B + S^{i/n} R^{i/n}B).

The product {τ_n = i/n} g(Y) is F_{i/n}-measurable, because Y is F_τ-measurable. By <9>, with f(Y, B) replaced by ({τ_n = i/n} g(Y)) h(B), the ith summand equals

P^ω W^x {τ_n = i/n} g(Y) h(K^{i/n}B + S^{i/n} x) = P^ω W^x {τ_n = i/n} g(Y) h(K^{τ_n}B + S^{τ_n} x).

Sum over i to deduce that

P f(Y, B) = P^ω W^x g(Y) h(K^{τ_n}B + S^{τ_n} x).

As n tends to infinity, τ_n(ω) converges to τ(ω), and hence K^{τ_n}B + S^{τ_n}x → K^τB + S^τx pointwise (in particular, at each s_j), for each ω and each x ∈ C(R⁺). Continuity of h_0 then gives h(K^{τ_n}B + S^{τ_n}x) → h(K^τB + S^τx). An appeal to Dominated Convergence completes the proof. □

<11> Corollary. For each W-integrable function f on C(R⁺),

P(f(B) | F_τ) = W^x f(K^τB + S^τx)  almost surely,

for each stopping time τ.

<12> Exercise. Let B be a standard Brownian motion indexed by R⁺, and let a be a positive constant. Define τ := inf{t : B_t ≥ a}. Use the strong Markov property to find the distribution of τ.

SOLUTION: For fixed t ∈ R⁺,

P{τ ≤ t} = P{τ ≤ t, B_t > a} + P{τ ≤ t, B_t ≤ a}.

The first contribution on the right-hand side equals P{B_t > a} = P{N(0, t) > a}, because the inequality for τ is superfluous (by continuity) when B_t > a. Invoke Theorem <10> to write the second term as

P^ω ({τ(ω) ≤ t} W^x {x : B(t ∧ τ(ω), ω) + x(t − τ(ω)) ≤ a}).

For each ω with τ(ω) ≤ t we have B(t ∧ τ(ω)) = a and W{x : x(t − τ(ω)) ≤ 0} = 1/2. By Fubini, the term equals ½ P{τ ≤ t}. Thus

P{τ ≤ t} = 2 P{B_t > a} = 2 P{N(0, t) > a}.

If you differentiate you will discover that the distribution of τ has density a t^{−3/2} exp(−a²/2t)/√(2π) with respect to Lebesgue measure on R⁺. □
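The identity P{τ ≤ t} = 2P{B_t > a} invites a Monte Carlo check (a discrete-grid sketch, not from the text, with a = t = 1 chosen arbitrarily; the discrete maximum slightly underestimates the continuous supremum, so the simulated probability is biased a little downward):

```python
import numpy as np
from math import erf, sqrt

# Fraction of simulated paths with max_{s <= 1} B_s >= 1, versus the exact
# value 2 * P{B_1 > 1} = 2 * (1 - Phi(1)) from the reflection principle.
rng = np.random.default_rng(5)
n_paths, n_steps = 5000, 2000
a, t = 1.0, 1.0
B = np.cumsum(rng.normal(0, np.sqrt(t / n_steps), (n_paths, n_steps)), axis=1)
hit = np.mean(np.max(B, axis=1) >= a)                # estimate of P{tau <= t}
exact = 2 * (1 - 0.5 * (1 + erf(a / sqrt(2 * t))))   # 2 * P{N(0, t) > a} ≈ 0.3173
print(hit, exact)
```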

*6. Martingale characterizations of Brownian motion

A centered Brownian motion is a martingale. This fact is just the simplest instance of a method for building martingales from polynomial functions of B_t and t.

<13> Example. If {(B_t, F_t) : t ∈ T} is a centered Brownian motion then a direct calculation shows that {B_t² − t} is a martingale with respect to the same filtration: if s < t and F ∈ F_s, then

P B_t² F = P(B_s² + 2B_s Δ + Δ²)F,  where Δ := B_t − B_s,
= P B_s² F + 2 P(B_s F) PΔ + (PΔ²)(PF),  by independence.

After substitution of 0 for PΔ and t − s for PΔ², the equality rearranges to give P(B_t² − t)F = P(B_s² − s)F for all F in F_s, the asserted martingale property.

Similar arguments could be used for higher degree polynomials, but it is easier to code all the martingale assertions into a single identity. For each fixed complex θ, the process M_t := exp(θB_t − tθ²/2) is a (complex-valued) martingale: for F in F_s,

P M_t F = P exp(θB_s + θΔ − tθ²/2) F
= P(exp(θB_s − sθ²/2) F) P exp(θΔ − (t − s)θ²/2),  by independence,
= P M_s F,  because Δ is N(0, t − s) distributed.

A dominated convergence argument would justify the integration term-by-term to produce a power-series expansion,

P M_t F = P exp(θB_t) F exp(−θ²t/2)
= (PF + θ P B_t F + (θ²/2) P B_t² F + (θ³/6) P B_t³ F + ...)(1 − θ²t/2 + θ⁴t²/8 − ...),

with a similar expansion for P exp(θB_s)F. The series converge for all complex θ. By equating coefficients of powers of θ, we obtain a sequence of equalities that establish the martingale property for {B_t}, {B_t² − t}, and so on. As an exercise you might find the term involving B_t³, then check the martingale property by direct calculation, as I did for B_t² − t. □

Given that Fourier transforms determine distributions, it is not surprising that the martingale property of exp(θB_t − tθ²/2), for all θ, characterizes Brownian motion—essentially equivalence (iii) of Definition <1>. It is less obvious that the martingale property for the linear and quadratic polynomials alone should also characterize Brownian motion with continuous sample paths. This striking fact is actually just an elegant repackaging of a martingale central limit theorem. The continuity of the sample paths lets us express the process as a sum of many small martingale differences, whose individual contributions can be captured by Taylor expansions to quadratic terms.

<14> Theorem. (Lévy) Suppose {(X_t, F_t) : t ∈ R⁺} is a martingale with continuous sample paths, and X_0 = 0. If {(X_t² − t, F_t) : t ∈ R⁺} is also a martingale then the process is a standard Brownian motion.

Proof. From equivalence (iii) of Definition <1>, it is enough to prove, for each real θ and each F in F_s, that

<15>  P exp(iθ(X_t − X_s) + θ₂(t − s)) F = PF,  where θ₂ := ½θ².

I will present the argument only for the notationally simplest case where s = 0 and t = 1, leaving to you the minor modifications needed for the more general case.

In principle, the method of proof is just Taylor's theorem. Break X_1 into a sum of increments Σ_{k=1}^n η_k, where η_k := X(k/n) − X((k − 1)/n). Write P_k for expectations conditional on F_{k/n}. The two martingale assumptions give P_{k−1} η_k = 0 and v_{k−1} := P_{k−1} η_k² = 1/n. Notice that Σ_{k=1}^n v_{k−1} = 1. For fixed real θ, define

D_k := exp(iθ(η_1 + ... + η_k) + θ₂(v_0 + ... + v_{k−1})),

so that D_0 = 1 and D_n = exp(iθX_1 + θ₂). Continuity of the paths should make all the η_k small, suggesting

P_{k−1} D_k = D_{k−1} e^{θ₂ v_{k−1}} P_{k−1} exp(iθη_k)
≈ D_{k−1} (1 + θ₂ v_{k−1} + ...)(1 − θ₂ v_{k−1} + ...)
≈ D_{k−1},  if v_{k−1}² is small.

Averaging over an F in F_0 we get P(D_k F) ≈ P(D_{k−1} F). Repeated appeals to this approximation give P D_n F ≈ P D_0 F = PF, from which <15> would follow in the limit as n → ∞.

For a rigorous proof we must pay attention to the remainder terms, and therein lies a small difficulty. Continuity of the sample paths does make all the increments η_k small with high probability, but we need slightly more control to ensure that the remainders cause no trouble when we take expectations. A stopping time trick will solve the problem. Here we need to make use of a result for martingales in continuous time:

If {(M_t, F_t) : t ∈ T} is a martingale with right continuous sample paths, and if τ is a stopping time for the filtration, then {(M_{t∧τ}, F_t) : t ∈ T} is also a martingale.

The analogous result for discrete-time martingales was established as a Problem in Chapter 6. For continuous time, some extra measurability questions must be settled, and further approximation arguments are required. For details see Appendix E.

For Lévy's theorem, choose the stopping time so that the increments of the stopped process X_{t∧τ} are bounded by a small quantity. Fix an ε > 0. For each n in N, define

τ_n := 1 ∧ inf{t ∈ R⁺ : |X_t − X_s| ≥ ε for some s with t − n^{−1} ≤ s ≤ t}.

Each τ_n is a stopping time: by sample path continuity, the set {τ_n < t} can be written as a countable union of F_t-measurable sets, {|X_r − X_{r′}| > ε}, with r and r′ ranging over pairs of rational numbers from [0, t] with |r − r′| < n^{−1}. Sample path continuity (via uniform continuity on compact time intervals) also implies that, for each ω, we have τ_n(ω) = 1 eventually, that is, for all n ≥ n_0(ω).

For fixed n and ε, write Y_t for the martingale X(t ∧ τ_n). The martingale increments ξ_k := Y(k/n) − Y((k − 1)/n) are all bounded in absolute value by ε, and from the martingale property of Y_t² − (t ∧ τ_n),

V_{k−1} := P_{k−1} ξ_k² = P_{k−1} ((k/n) ∧ τ_n − ((k − 1)/n) ∧ τ_n) ≤ 1/n.

The conditional variance V_{k−1} is not equal to the constant 1/n, but it is true that V_0 + ... + V_{n−1} ≤ 1 and

P(V_0 + ... + V_{n−1}) = Σ_{k=1}^n P((k/n) ∧ τ_n − ((k − 1)/n) ∧ τ_n) = P(1 ∧ τ_n) → 1  as n → ∞,

from which it follows that 1 − (V_0 + ... + V_{n−1}) → 0 in probability.

Replace the D_k defined above by the analogous quantity for the Y process,

D_k := exp(iθ(ξ_1 + ... + ξ_k) + θ₂(V_0 + ... + V_{k−1})).

To keep track of the remainder terms, use the bounds (compare with Problem [8])

e^x = 1 + x + A(x)  with 0 ≤ A(x) ≤ x² for 0 ≤ x ≤ 1,
e^{iy} = 1 + iy − y²/2 + B(y)  with |B(y)| ≤ |y|³ for all y.

We still have D_0 = 1, but now P D_n F = P F exp(iθY_1 + θ₂(V_0 + ... + V_{n−1})), which, by Dominated Convergence, converges to P F exp(iθX_1 + θ₂) as n → ∞. With bounds on the remainder terms, the conditioning argument gives a more precise assertion,

P_{k−1} D_k = D_{k−1} (1 + θ₂ V_{k−1} + A(θ₂ V_{k−1})) (1 − θ₂ V_{k−1} + P_{k−1} B(θξ_k)) = D_{k−1} + R_k,

where, because |D_{k−1}| ≤ exp(θ₂), P_{k−1}|ξ_k|³ ≤ ε P_{k−1} ξ_k² = ε V_{k−1}, and V_{k−1} ≤ n^{−1},

|R_k| ≤ C_θ (n^{−1} + ε) V_{k−1},

for a constant C_θ that depends on θ. Averaging over an F in F_0 we then get

|P(F D_n) − PF| ≤ Σ_{k=1}^n |P(F D_k) − P(F D_{k−1})| ≤ Σ_{k=1}^n P|R_k| ≤ C_θ (n^{−1} + ε).

Let n tend to infinity, then ε tend to zero, to complete the argument. □

REMARK. The Lévy characterization explains why Brownian motion plays a central role in the theory of Itô diffusions. Roughly speaking, such a diffusion is an adapted process {Z_t : t ∈ R⁺} with continuous sample paths for which

P(Z_{t+δ} − Z_t | F_t) ≈ μ(Z_t)δ  and  P((Z_{t+δ} − Z_t)² | F_t) ≈ σ²(Z_t)δ

for small δ > 0, where μ and σ are suitably smooth functions. If we break [0, t] into a union of small intervals [t_i, t_{i+1}], each of length δ, then the standardized increments

Δ_i X := (Z(t_{i+1}) − Z(t_i) − μ(Z_{t_i})δ)/σ(Z_{t_i})

are martingale differences for which P((Δ_i X)² | F_{t_i}) ≈ δ. The sum X_t := Σ_i Δ_i X is a discrete martingale for which X_t² − t is also approximately a martingale. It is possible (see, for example, Stroock & Varadhan 1979, Section 4.5) to make these heuristics rigorous, by a formal passage to the limit as δ tends to zero, to build a Brownian motion X from Z, so that Z_{t+δ} − Z_t ≈ μ(Z_t)δ + σ(Z_t)(X_{t+δ} − X_t). By summing increments and formalizing another passage to the limit, we then represent Z as a solution to a stochastic integral equation,

Z_t = Z_0 + ∫_0^t μ(Z_s) ds + ∫_0^t σ(Z_s) dX_s,

showing that the diffusion is driven by the Brownian motion X.

The probability theory needed to formalize the heuristics is mostly at the level of the current Chapter. However, if you wish to pursue the idea further it would be better to invest some time in studying systematically the methods of stochastic calculus, and the theory of the Itô stochastic integral, as developed by Stroock & Varadhan (1979), or (in the more general setting of stochastic integrals with respect to martingales) by Chung & Williams (1990). See also the comments at the end of Section 7 for more about stochastic integrals.

*7. Functionals of Brownian motion

Let B be a standard Brownian motion indexed by [0, 1], with {F_t : 0 ≤ t ≤ 1} its natural filtration. How complicated can F_1-measurable random variables be? Remember that these are the random variables expressible as measurable functionals of the whole sample path. The answer for square integrable random variables is quite surprising; and it has remarkable consequences, as you will see in Section 8.

<16> Theorem. The Hilbert space H := L²(Ω, F_1, P) equals the closure of the subspace H_0 spanned by the constants together with the collection of all random variables of the form (B_t − B_s)h_s, where h_s ranges over all bounded, F_s-measurable random variables, and s, t range over all pairs for which 0 ≤ s < t ≤ 1.

Proof. It is enough if we show that a random variable Z in H that is orthogonal to H_0 is also orthogonal to every bounded random variable of the form V := g(B_{t_1}, ..., B_{t_k}). For then a generating class argument would show that Z must be orthogonal to every random variable in H, from which it would follow that Z = 0 almost surely. It even suffices to consider, in place of V, only random variables of the form U := Π_{j=1}^k exp(iθ_j B_{t_j}), for real constants θ_j. For if

P Z⁺ Π_{j=1}^k exp(iθ_j B_{t_j}) = P Z⁻ Π_{j=1}^k exp(iθ_j B_{t_j})

for all real {θ_j}, the uniqueness theorem for Fourier transforms implies equality of the measures Q^±, on σ{B_{t_1}, ..., B_{t_k}}, with densities Z^± with respect to P, and hence Q⁺V = Q⁻V for all V = g(B_{t_1}, ..., B_{t_k}).

Repeated application of the following equality (proved below),

<17>  P Z Y exp(iθB_t) = P Z Y exp(iθB_s) exp(−½θ²(t − s)),

for all real θ, all s < t, and all bounded, F_s-measurable random variables Y, will establish orthogonality of Z and U. The argument is easy. First put Y_1 := Π_{j=1}^{k−1} exp(iθ_j B_{t_j}) and θ := θ_k to deduce

P Z Y_1 exp(iθ_k B_{t_k}) = P Z Y_1 exp(iθ_k B_{t_{k−1}}) exp(−½θ_k²(t_k − t_{k−1})).

Then repeat the exercise with Y_2 := Π_{j=1}^{k−2} exp(iθ_j B_{t_j}) and θ := θ_{k−1} + θ_k to get

P Z Y_2 exp(iθ B_{t_{k−1}}) = P Z Y_2 exp(iθ B_{t_{k−2}}) exp(−½θ²(t_{k−1} − t_{k−2})).

And so on. In effect, we replace successive complex exponentials in the B_{t_j} by nonrandom factors, and replace θ_j by a sum Σ_{a≥j} θ_a. After k steps, we are left with a product of nonrandom factors multiplied by PZ, which is zero because Z is orthogonal to the constant function 1.

To prove equality <17>, break the interval [s, t] into n subintervals each of length δ := (t − s)/n, with endpoints s = s_0 < s_1 < ... < s_n = t. Write X_j for B(s_j) and Δ_j for X_j − X_{j−1}, which has a N(0, δ) distribution. Abbreviate the conditional expectation P(· | F_{s_j}) to P_j(·).

Temporarily write f(y) for e^{iθy}. Notice that f′(y) = iθf(y) and f″(y) = −2θ₂f(y), where θ₂ := θ²/2. From Problem [8],

|f(x + h) − f(x) − hf′(x) − ½h²f″(x)| ≤ |θh|³/6.

Page 242: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

9.7 Functionals of Brownian motion 227

Substitute x := X_{j−1} and h := Δ_j to get

f(X_j) = f(X_{j−1}) + iθΔ_j f(X_{j−1}) − θ₂Δ_j² f(X_{j−1}) + R_j where |R_j| ≤ |θΔ_j|³/6

 = (1 − θ₂δ)f(X_{j−1}) + iθΔ_j f(X_{j−1}) − θ₂(Δ_j² − δ)f(X_{j−1}) + R_j.

The random variable ξ_j := −θ₂(Δ_j² − δ)f(X_{j−1}) is F_{s_j}-measurable, with P_{j−1}ξ_j = 0 and P|ξ_j|² ≤ C₁δ² for some constant C₁ depending on θ. Similarly, P|R_j|² ≤ C₂δ³.

Write γ for the constant 1 − θ₂δ. Notice that γⁿ = (1 − θ₂(t − s)/n)ⁿ → exp(−θ₂(t − s)) as n → ∞. Multiply the expansion of f(X_j) by ZY then take expected values. The contribution from Δ_j f(X_{j−1}) is zero because YΔ_j f(X_{j−1}) ∈ H₀. Thus we have a recurrence formula,

PZYf(X_j) = γ PZYf(X_{j−1}) + PZY(ξ_j + R_j).

Repeated substitutions, starting with j = n, give an explicit formula,

PZYf(X_n) = γⁿ PZYf(X₀) + PZY Σ_{j=1}^n γ^{n−j}(ξ_j + R_j).

By Cauchy-Schwarz, the last term is bounded in absolute value by ‖ZY‖₂ times ‖Σ_{j=1}^n γ^{n−j}(ξ_j + R_j)‖₂.

We may assume that δ is small enough to ensure that |γ| ≤ 1. Orthogonality of the {ξ_j} then gives

‖Σ_j γ^{n−j}ξ_j‖₂² = Σ_j |γ|^{2(n−j)} P|ξ_j|² ≤ nC₁δ² = C₁(t − s)²/n → 0.

The other term is bounded by

Σ_j |γ|^{n−j}‖R_j‖₂ ≤ n√(C₂δ³) = √(C₂(t − s)³/n) → 0.

In the limit, as n tends to infinity, we get the asserted equality <17>.

REMARK. Almost the same argument shows that H₀ is dense in L¹ := L¹(Ω, F₁, P), under the L¹ norm. The proof rests on the fact that if some element X of L¹ were not in the closure of H₀, there would exist a bounded, F₁-measurable random variable Z for which P(WZ) = 0 for all W in H₀ but P(XZ) = 1. (Hahn-Banach plus (L¹)* = L^∞, for those of you who know some functional analysis.) The argument based on <17> would imply P(ZW) = 0 for all bounded, F₁-measurable random variables W, and, in particular, PZ² = 0, a contradiction.

The Theorem tells us that for each X in L²(Ω, F₁, P) and each ε > 0 there is a constant c₀ and grid points 0 = t₀ < t₁ < ... < t_{k+1} = 1 such that

X(ω) = c₀ + Σ_{i=0}^k h_i(ω)Δ_iB + R_ε where Δ_iB := B(t_{i+1}) − B(t_i),

with each h_i bounded and F_{t_i}-measurable, and PR_ε² < ε². Notice that |PX − c₀| = |PR_ε| ≤ ε, so we may as well absorb the difference c₀ − PX into the remainder, and assume c₀ = PX.

REMARK. The representation takes an even neater form if we encode the {h_i} into a single function on [0, 1] × Ω, an elementary predictable process, H(t, ω) := Σ_{i=0}^k h_i(ω){t_i < t ≤ t_{i+1}}. The function H is measurable with respect to the predictable sigma-field 𝒫, the sub-sigma-field of B[0, 1] ⊗ F₁ generated by the class of all adapted processes with left-continuous sample paths. Moreover, it is


square integrable for the measure μ := m ⊗ P, with m equal to Lebesgue measure on [0, 1],

μH² = m^t P^ω Σ_i h_i(ω)²{t_i < t ≤ t_{i+1}} = Σ_i (Ph_i²)(t_{i+1} − t_i) < ∞.

The stochastic integral of H with respect to B is defined as

∫₀¹ H dB := Σ_{i=0}^k h_iΔ_iB.

The random variable defined in this way is also square integrable,

P|∫₀¹ H dB|² = Σ_i Σ_j P(h_i h_j (Δ_iB)(Δ_jB)).

For i < j, the random variable Δ_jB is independent of the F_{t_j}-measurable random variable h_i h_j(Δ_iB); and for i = j, the random variable (Δ_iB)² is independent of h_i². The cross-product terms disappear, leaving

P|∫₀¹ H dB|² = Σ_i (Ph_i²) P(Δ_iB)² = Σ_i (Ph_i²)(t_{i+1} − t_i) = μH².

Thus the map H ↦ ∫₀¹ H dB is an isometry from the space 𝓗₀ of all elementary predictable processes into L² := L²(Ω, F₁, P). It is not too difficult to show that 𝓗₀ is dense in the space 𝓗 := L²([0, 1] × Ω, 𝒫, μ). The isometry therefore extends to a map from 𝓗 into L², which defines the stochastic integral ∫₀¹ H dB for all predictable processes that are square-integrable with respect to μ.

Theorem <16> implies that for each X in L² there exists a sequence {H_n} in 𝓗₀ for which P|X − PX − ∫₀¹ H_n dB|² → 0. In consequence, {H_n} is a Cauchy sequence,

μ|H_n − H_m|² = P|∫₀¹ H_n dB − ∫₀¹ H_m dB|² → 0 as n, m → ∞.

Completeness ensures existence of an H in 𝓗 for which

P|∫₀¹ H_n dB − ∫₀¹ H dB|² = μ|H_n − H|² → 0.

In summary: X = PX + ∫₀¹ H dB almost surely, a most elegant representation.
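As a numerical aside (my own sketch, not from the text), the isometry P|∫₀¹ H dB|² = μH² can be checked by simulation for an elementary predictable process. The particular choice h_i := sign of B(t_i) below is arbitrary; it is bounded and F_{t_i}-measurable, as the definition requires.

```python
import numpy as np

# Sanity check of the isometry P|integral of H dB|^2 = mu H^2 for an
# elementary predictable process H(t, w) = sum_i h_i(w){t_i < t <= t_{i+1}},
# with each h_i depending only on the path up to time t_i.
rng = np.random.default_rng(0)
n_paths, n_steps = 200_000, 8
t = np.linspace(0.0, 1.0, n_steps + 1)
dB = rng.normal(0.0, np.sqrt(np.diff(t)), size=(n_paths, n_steps))
B = np.cumsum(dB, axis=1)  # B at t_1, ..., t_{n_steps}

# h_0 constant; h_i = sign(B(t_i)) for i >= 1: bounded and F_{t_i}-measurable.
h = np.ones((n_paths, n_steps))
h[:, 1:] = np.sign(B[:, :-1])

stoch_int = np.sum(h * dB, axis=1)                   # integral of H dB
lhs = np.mean(stoch_int ** 2)                        # P|integral of H dB|^2
rhs = np.sum(np.mean(h ** 2, axis=0) * np.diff(t))   # mu H^2
print(lhs, rhs)  # the two sides agree up to Monte Carlo error
```

Since h² = 1 along every path, μH² equals 1 exactly here, and the simulated second moment of the stochastic integral should land within a few standard errors of it.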

*8. Option pricing

The lognormal model of stock prices assumes that the price of a particular stock (standardized to take the value 1 at time 0) at time t is given by

<18> S_t = exp(σB_t + (μ − ½σ²)t) for 0 ≤ t ≤ 1,

where B is a standard Brownian motion. The drift parameter μ is unknown, but the volatility parameter σ is known (or is at least assumed to be well estimated). Notice that S is a martingale, for the filtration F_t := σ{B_s : s ≤ t} = σ{S_s : s ≤ t}, if μ = 0.

The strange form of the parametrization makes more sense if we consider relative increments in stock price over a small time interval,

(S_{t+δ} − S_t)/S_t = exp((μ − ½σ²)δ + σΔB) − 1 where ΔB := B_{t+δ} − B_t

 = (μ − ½σ²)δ + σΔB + ½((μ − ½σ²)δ + σΔB)² + ...


The square (ΔB)² has expected value δ. The term −½σ²δ centers it to have zero mean. As you will see, the centered variable eventually gets absorbed into a small error term, leaving μδ + σΔB as the main contribution to the relative changes in stock price over short time intervals. The model is sometimes written in symbolic form as dS_t = μS_t dt + σS_t dB_t.
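The model can be simulated directly from display <18>. The following sketch (mine, not from the text; the parameter values are arbitrary) checks the martingale property numerically: with μ = 0 the average simulated price stays near S₀ = 1 at every time.

```python
import numpy as np

# Simulate the lognormal model S_t = exp(sigma*B_t + (mu - sigma^2/2)*t).
# With mu = 0 the price process is a martingale, so P S_t should stay
# near the starting value S_0 = 1.
rng = np.random.default_rng(1)
n_paths, n_steps, sigma, mu = 100_000, 50, 0.4, 0.0
t = np.linspace(0.0, 1.0, n_steps + 1)
dB = rng.normal(0.0, np.sqrt(np.diff(t)), size=(n_paths, n_steps))
B = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)], axis=1)
S = np.exp(sigma * B + (mu - 0.5 * sigma ** 2) * t)
print(S[:, -1].mean())  # close to 1 when mu = 0
```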

An option may be thought of as a random variable Y that is, potentially, a function of the entire history {S_t : 0 ≤ t ≤ 1} of the stock price: one pays an amount y₀ at time 0 in return for the promise of a return Y at time 1 that depends on the performance of the stock. (That is, an option is a refined form of gambling on the stock market.)

The question whose answer makes investment bankers rich is: What is the appropriate price y₀? The elegant answer is: y₀ = QY, where Q is a probability measure on F₁ that makes the stock price a martingale, because (as you will soon learn) there are trading schemes whose net returns can be made as close to Y − y₀ as we please, in a probabilistic sense. If a trader offered the option for a price y smaller than y₀, one could buy the option and also engage in a trading scheme for a net return arbitrarily close to (Y − y) − (Y − y₀) = y₀ − y > 0. A similar argument can be made against an asking price greater than y₀.

A trading scheme consists of a finite set of times 0 ≤ t₀ < t₁ < ... < t_{k+1} ≤ 1 at which shares need to be traded: at time t_i, buy a quantity K_i of the stock, at the cost K_iS(t_i), then sell at time t_{i+1} for an amount K_iS(t_{i+1}), for a return of K_iΔ_iS, where Δ_iS := S(t_{i+1}) − S(t_i) denotes the change in stock price per share over the time interval. The quantity K_i must be determined by the information available at time t_i, that is, K_i must be F_{t_i}-measurable. We should also allow K_i to take negative values, a purchase of a negative quantity being a sale and a sale of a negative quantity being a purchase. The return from the trading scheme is Σ_{i=0}^k K_iΔ_iS.

We could assume that the spacing of the times between trades is as small as we please. For example, purchase of K shares at time s followed by resale at time t has the same return as purchases of K at times s + iδ and sales of K at times s + (i + 1)δ, for i = 0, 1, ..., N − 1, with δ := (t − s)/N for an arbitrarily large N. Conceptually, we could even pass to the mathematical limit of continuous trading, in which case the errors of approximation could be driven to zero (almost surely).

Existence of the trading scheme to nearly duplicate the return Y − y₀ will follow from the representation in the previous Section. Consider first the case where μ = 0, which makes S a martingale. Let us assume that PY² is finite. Write y₀ for PY. For an arbitrarily small ε > 0, Theorem <16> gives a finite collection of times {t_i} and bounded, F_{t_i}-measurable random variables h_i for which

P|Y − y₀ − Σ_i h_iΔ_iB|² < ε² where Δ_iB := B(t_{i+1}) − B(t_i).

If we could replace Δ_iB by Δ_iS/(σS_{t_i}) for each i we would have a trading scheme K_i := h_i/(σS_{t_i}) with the desired approximation property. If the time intervals δ_i := t_{i+1} − t_i are all small enough, a simple calculation will show that such a substitution increases the L² error only slightly. The error term

R_{i+1} := Δ_iS/S_{t_i} − σΔ_iB = exp(σΔ_iB − ½σ²δ_i) − 1 − σΔ_iB


has zero expected value and it is independent of F_{t_i}. You will see soon that it also has a small L²(P) norm. To simplify the algebra, temporarily write τ for σ√δ_i and Z for Δ_iB/√δ_i, which has a standard normal distribution. Then

PR²_{i+1} = P(exp(τZ − ½τ²) − 1 − τZ)²

 = P exp(2τZ − τ²) + 1 + τ² − 2P((1 + τZ) exp(τZ − ½τ²))

 = exp(τ²) − 1 − τ²    cf. τe^{τ²/2} = (∂/∂τ)e^{τ²/2} = (∂/∂τ)Pe^{τZ} = P(Ze^{τZ})

 ≤ τ⁴ when τ ≤ 1.

The approximation Σ_i h_iΔ_iB to Y − y₀ equals Σ_i K_iΔ_iS − Σ_i h_iR_{i+1}/σ. The first sum represents the net return from a trading strategy. The other sum will be small in L²(P) when max_i δ_i is small. Independence eliminates cross-product terms,

P(h_iR_{i+1}h_jR_{j+1}) = P(h_iR_{i+1}h_j)(PR_{j+1}) = 0 if j > i,

and hence

P|Σ_i h_iR_{i+1}|² = Σ_i P(h_i²R²_{i+1}) ≤ Σ_i (Ph_i²)(σ⁴δ_i²) if max_i δ_i ≤ 1/σ²

 ≤ σ⁴ (max_i δ_i) Σ_i (Ph_i²) P(Δ_iB)²

 = σ⁴ (max_i δ_i) ‖Σ_i h_iΔ_iB‖₂² → 0 as max_i δ_i → 0.

The contribution from the R_i's can be absorbed into the other error of approximation, increasing the ε by an arbitrarily small amount, leaving the desired trading strategy approximation for Y − y₀.

Finally, what happens when μ is not zero? A change of measure will dispose of its effects. For a fixed constant α, let Q be the probability measure on F₁ with density exp(αB₁ − α²/2) with respect to P. Problem [11] shows that the process X_t := B_t − αt is a Brownian motion under Q. The stock price is also a function of X if we choose α = −μ/σ,

S_t = exp(σ(X_t + αt) + (μ − ½σ²)t) = exp(σX_t − ½σ²t) for 0 ≤ t ≤ 1.

If we replace P by Q, and the Brownian motion B by the Brownian motion X, then a repeat of the argument for the case μ = 0 shows that the appropriate price for the option is now y₀ = QY.

REMARK. Of course there is a hidden assumption that Y is Q square-integrable, or just Q-integrable if you accept the Remark following Theorem <16>.

9. Problems

[1] Let Z_i have a N(0, σ_i²) distribution, for i = 1, 2, ..., n. Prove that

P max_i |Z_i| ≤ √(P max_i |Z_i|²) ≤ 2σ√(1 + log n), where σ := max_i σ_i.


Hint: Argue via Jensen's inequality that

exp(P max_i |Z_i|²/4σ²) ≤ P exp(max_i |Z_i|²/4σ²) ≤ Σ_i P exp(|Z_i|²/4σ²).

Bound the right-hand side by n√2, then take logarithms.

REMARK. For Problem [1] we do not need to assume that the {Z_i} have a multivariate normal distribution. Nowhere in the argument do we need to know anything about the joint distribution. When the Z_i are independent, or nearly so, the bound is quite good. For the extreme case of n independent N(0, σ²) variables, the inequality from part (i) of Problem [2] shows that P max_{i≤n} |Z_i| ≥ C₀σ√(1 + log n) for a universal constant C₀. See Section 12.2 for a precise way (Sudakov's minoration) of capturing the idea of approximate independence when the {Z_i} have a multivariate normal distribution. When the variables are highly dependent, the bound is not good: in the extreme case where Z_i ≡ Z₁, which has a N(0, σ²) distribution, P max_{i≤n} |Z_i| = σP|N(0, 1)|, which does not increase with n.
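A quick simulation (my own, not part of the text) makes the bound of Problem [1] concrete for independent standard normals, the case σ = 1 where the bound is close to sharp:

```python
import numpy as np

# Compare the simulated value of P max_i |Z_i| with the bound
# 2*sigma*sqrt(1 + log n) from Problem [1], for sigma = 1.
rng = np.random.default_rng(2)
results = []
for n in (10, 100, 1000):
    Z = rng.normal(size=(20_000, n))
    mean_max = np.abs(Z).max(axis=1).mean()   # estimates P max_i |Z_i|
    bound = 2.0 * np.sqrt(1.0 + np.log(n))    # the bound with sigma = 1
    results.append((n, mean_max, bound))
    print(n, mean_max, bound)
```

The simulated means grow like √(2 log n), so the bound stays within a modest constant factor, as the Remark asserts.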

[2] Let Z₁, Z₂, ..., be independent random variables, each distributed N(0, 1). Define M_n := max_{i≤n} |Z_i| and ℓ_n := √(2 log n) and n(k) := 2^k. Use the tail bound from Appendix D,

((1/x) − (1/x³)) exp(−x²/2)/√(2π) ≤ P{Z₁ > x} ≤ ½ exp(−x²/2) for x > 0,

to prove that M_n/√(2 log n) → 1 almost surely. Argue as follows.

(i) For each small ε > 0, show that there exist strictly positive constants C and β for which P{M_n ≤ (1 − ε)ℓ_n} = (1 − 2P{Z₁ > (1 − ε)ℓ_n})ⁿ ≤ exp(−Cn^β), for all n large enough. Deduce that lim inf_n M_n/ℓ_n ≥ 1 almost surely.

(ii) For each ε > 0, show that P{M_n ≥ (1 + ε)ℓ_n} ≤ n/n^{(1+ε)²}. Deduce that lim sup_k M_{n(k)}/ℓ_{n(k)} ≤ 1 almost surely.

(iii) Use the inequality M_n/ℓ_n ≤ M_{n(k+1)}/ℓ_{n(k)} for n(k) ≤ n ≤ n(k + 1), and the results from parts (i) and (ii), to conclude that M_n/ℓ_n → 1 almost surely.
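The convergence in Problem [2] shows up already at moderate sample sizes. The following sketch (mine; the sample sizes are arbitrary) tracks the ratio M_n/√(2 log n) along one realization of the sequence:

```python
import numpy as np

# Track M_n / sqrt(2 log n) along a single realization of independent
# N(0,1) variables, for n = 2^6, 2^10, 2^14, 2^20.
rng = np.random.default_rng(3)
Z = np.abs(rng.normal(size=2 ** 20))
ratios = [Z[: 2 ** k].max() / np.sqrt(2 * np.log(2.0 ** k))
          for k in (6, 10, 14, 20)]
print(ratios)  # the ratios settle down near 1
```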

[3] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a centered Brownian motion whose sample paths need not be continuous. Show that there exists a centered Brownian motion {B*_t : 0 ≤ t ≤ 1} with continuous sample paths, for which P{B*_t ≠ B_t} = 0 for every t. Argue as follows. Let S_n := {i/2ⁿ : i = 0, 1, ..., 2ⁿ} and S := ∪_{n∈N} S_n. For each ω define

D_n(ω) := max_{0≤i<2ⁿ} |B((i + 1)/2ⁿ, ω) − B(i/2ⁿ, ω)|.

(i) Use Problem [1] to prove that Σ_n PD_n < ∞. Deduce that there exists a negligible set N such that ε_n(ω) := Σ_{k≥n} D_k(ω) → 0 as n → ∞ for ω ∉ N.

(ii) Let s and t be points in S_n for which |s − t| ≤ 2^{−m}, where n ≥ m. For each k, write s_k for the largest value in S_k for which s_k ≤ s, and define t_k similarly. (Thus s = s_n and t = t_n.) Show that

|B(s, ω) − B(t, ω)| ≤ D_m(ω) + 2 Σ_{k>m} D_k(ω).


(iii) Deduce that

max{|B(s, ω) − B(t, ω)| : s, t ∈ S_n and |s − t| ≤ 2^{−m}} ≤ 2ε_m(ω),

and hence

sup{|B(s, ω) − B(t, ω)| : s, t ∈ S and |s − t| ≤ 2^{−m}} ≤ 2ε_m(ω).

(iv) For each t in [0, 1] and each ω ∈ N^c, define

B*(t, ω) := lim_n sup{B_s : s ∈ S and t − 2^{−n} < s ≤ t}.

Define B*(t, ω) to be zero if ω ∈ N. (Notice that B* is measurable with respect to the completion of F_t.) Show that, for all t, t′ in [0, 1],

|B*(t, ω) − B*(t′, ω)| ≤ 2ε_m(ω) if |t − t′| ≤ 2^{−m}.

(v) Show that B*(t, ω) = lim_n B(t_n, ω) almost surely, where {t_n} is the sequence defined in step (ii). For each ε > 0, deduce that

P{|B*_t − B_t| > ε} ≤ P lim inf_n {|B(t_n) − B_t| > ε} ≤ lim inf_n P{|B(t_n) − B_t| > ε} = 0.

(Remember that B(t) − B(t_n) has a N(0, t − t_n) distribution.) Conclude that B*_t = B_t almost surely, for each t.

[4] Suppose {B_s : s ∈ S} is a Brownian motion, with S the countable set of all dyadic rationals in [0, 1], as in the previous Problem. Modify the argument from that Problem to construct a standard Brownian motion indexed by [0, 1].

[5] Show that there exists a finite constant C₀ for which

Σ_{k=0}^∞ 2^{k/2}(δ ∧ 2^{−k})√(1 + log 2^k) ≤ C₀√(δ + δ log(1/δ)) for 0 < δ < 1.

Hint: For 2^{−m} ≥ δ > 2^{−m−1}, bound the sum by

δ Σ_{k=0}^m 2^{k/2}√(1 + log 2^k) + Σ_{k>m} 2^{−k/2}√(1 + log 2^k).

The ratio of successive terms in the last sum converges to 1/√2.

[6] Let B be a standard Brownian motion indexed by [0, 1]. Show that the process X_t := B₁ − B_{1−t}, for 0 ≤ t ≤ 1, is also a Brownian motion (with respect to its own natural filtration, not the filtration for B).

[7] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion with continuous sample paths. Define τ(ω) as the smallest t at which the sample path B(·, ω) achieves its maximum value, M(ω). (Note: τ is not a stopping time.) Follow these steps to show that the sets Ω_t := {τ = t}, for 0 ≤ t ≤ 1, form a family suitable for the construction in Example <2>.

(i) Show that {τ ≤ t} = {sup_{s≤t} B_s(ω) = M(ω)}. Deduce that τ is F₁-measurable. Hint: Work with the process at rational times.

(ii) For each t in [0, 1), show that {τ = t} ⊆ {sup_{t<s≤1}(B_s − B_t) ≤ 0}. Deduce via the result from Exercise <12> that P{τ = t} = 0.

(iii) Use Problem [6] to show that P{τ = 1} = 0.


[8] For all real y, show that e^{iy} = 1 + iy + (iy)²/2! + ... + (iy)^k/k! + E_k(y), with |E_k(y)| ≤ |y|^{k+1}/(k + 1)!. Hint: for y > 0 use the fact that i∫₀^y E_k(x) dx = E_{k+1}(y).

[9] Let X := C(R⁺), equipped with its metric d for uniform convergence on compacta.

(i) Show that (X, d) is separable. Hint: Consider the countable collection of piecewise linear functions obtained by interpolating between a finite number of "vertices" with rational coordinates.

(ii) Prove that the sigma-field G generated by the finite dimensional projections coincides with the Borel sigma-field B(X). Hint: For one inclusion use continuity of the projections. For the other inclusion replace sup-norm distances by suprema over subsets of rationals.

[10] Let {B_t : t ∈ R⁺} be a standard Brownian motion. For fixed constants α > 0 and β > 0 show that P{B_t eventually hits the line α + βt} = exp(−2αβ), by following these steps. Write τ for the first hitting time on the linear barrier. For fixed real θ, let X_θ(t) denote the martingale exp(θB_t − ½θ²t).

(i) For each fixed t, show that 1 = PX_θ(t ∧ τ). Hint: You might consult Appendix E if you wish to be completely rigorous.

(ii) If θ > 2β, show that 0 ≤ X_θ(t ∧ τ) ≤ exp(θα).

(iii) If θ > 2β, show that X_θ(t ∧ τ) → 0 as t → ∞ on the set {τ = ∞}.

(iv) Deduce that 1 = P exp(θα + (θβ − ½θ²)τ){τ < ∞} for each θ > 2β. Then let θ decrease to 2β to conclude the argument.
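A crude Monte Carlo check (my own sketch; the barrier parameters and the discretization are arbitrary choices, and a discrete grid over a finite horizon can only approximate "eventually hits") compares the simulated hitting frequency with exp(−2αβ):

```python
import numpy as np

# Estimate P{B_t ever reaches the line a + b*t} and compare with exp(-2ab).
# The discrete grid slightly undercounts crossings between grid points,
# and the finite horizon truncates very late hits, so expect the estimate
# to sit a little below the exact value.
rng = np.random.default_rng(4)
a, b = 1.0, 0.5
n_paths, n_steps, horizon = 4_000, 2_000, 20.0
dt = horizon / n_steps
t = dt * np.arange(1, n_steps + 1)
B = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)), axis=1)
hit_freq = np.any(B >= a + b * t, axis=1).mean()
print(hit_freq, np.exp(-2 * a * b))  # both near exp(-1)
```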

[11] Let {(B_t, F_t) : 0 ≤ t ≤ 1} be a Brownian motion defined on a probability space (Ω, F, P). Let α be a positive real number.

(i) Show that the measure Q defined by dQ/dP := exp(αB₁ − ½α²) is a probability measure on F₁.

(ii) Show that the process X_t := B_t − αt for 0 ≤ t ≤ 1 is a Brownian motion under Q. Hint: For fixed s < t, real θ, and F in F_s, show that

Q exp(iθ(X_t − X_s) + ½θ²(t − s)) F

 = P exp(α(B₁ − B_t) − ½α²(1 − t))

 × P exp((α + iθ)(B_t − B_s) − ½(α + iθ)²(t − s))

 × P exp(αB_s − ½α²s) F.

Notice that the right-hand side reduces to an expression that does not depend on θ, which identifies it as QF.

[12] Calculate the price of an option that allows you to purchase one unit of stock for a price K at time 1, if you wish. Hint: Interpret (S₁ − K)⁺, or read a book about options.
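For a sketch of the answer (mine, not the book's): under the martingale measure Q from Section 8, S₁ = exp(σZ − ½σ²) with Z standard normal, so y₀ = Q(S₁ − K)⁺ reduces to a normal integral, Φ(σ/2 − (log K)/σ) − KΦ(−σ/2 − (log K)/σ), which is the r = 0, S₀ = 1 case of the Black-Scholes formula. The code below compares this closed form with Monte Carlo, for arbitrary parameter choices:

```python
from math import erf, log, sqrt
import numpy as np

def Phi(x):
    # standard normal distribution function, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

sigma, K = 0.3, 1.1          # arbitrary volatility and strike
d1 = sigma / 2 - log(K) / sigma
closed_form = Phi(d1) - K * Phi(d1 - sigma)

# Monte Carlo: y0 = Q(S_1 - K)^+ with S_1 = exp(sigma*Z - sigma^2/2).
rng = np.random.default_rng(5)
Z = rng.normal(size=2_000_000)
monte_carlo = np.maximum(np.exp(sigma * Z - sigma ** 2 / 2) - K, 0.0).mean()
print(closed_form, monte_carlo)  # the two estimates agree closely
```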


10. Notes

The detailed study by Brush (1968) describes the history of Brownian motion:from the recognition by Brown, in 1828, that it represented a physical, ratherthan biological, phenomenon; through the mathematical theories of Einstein andSmoluchowski, and the experimental evidence of Perrin, in the first decade of thetwentieth century.

Apparently Wiener (1923) was motivated to study the irregularity of theBrownian motion sample paths, in part, by remarks of Perrin regarding the haphazardmotion of small particles, cited (page 133) in translation as "One realizes from suchexamples how near the mathematicians are to the truth in refusing, by a logicalinstinct, to admit the pretended geometrical demonstrations, which are regardedas experimental evidence for the existence of a tangent at each point of a curve."(Compare with Kac (1966, page 34), quoting from Wiener's autobiography.) Wienerconstructed "Wiener measure" as a linear functional on the space of continuousfunctions, deriving the necessary countable additivity property from a propertyslightly weaker than the modulus condition of Theorem <7>.

The construction in Section 3 is essentially due to Lévy (1939, Section 6), who obtained the piecewise linear approximations to Brownian motion by an interpolation argument (compare with Theorem 1, Chapter 1 of Lévy 1948). He also referred to Lévy (1937, page 172) for the derivation of a Hölder condition <6> for the sample paths. The explicit construction via an orthogonal series expansion in the Haar basis is due to Ciesielski (1961).

Theorem <14> in Section 6 is due to Lévy (1948, pages 77-78 of the second edition), who merely stated the result, with a reference to Theorem 67.3 of Lévy (1937), which established a central limit theorem for (discrete time) martingales under an assumption on the conditional variances analogous to the martingale property for X_t² − t. Doob (1953, Theorem 11.9, Chapter VII) provided a formal proof, similar to the proof in Section 6. The method could be streamlined slightly, with replacement of increments over deterministic time intervals by increments over random intervals. It could also be deduced from fancier results in stochastic calculus, whose derivations ultimately reduce to calculations with small increments of the process. I feel the direct method has pedagogic advantages, because it makes very clear the vital role of sample path continuity.

The proof of Theorem <16> is a disguised version of Itô's formula for stochastic integrals, applied to a particular function of Brownian motion (compare with Durrett 1984, Section 2.14). The result is due to Kunita & Watanabe (1967), although, as noted by Clark (1970), it follows easily from an expansion due to Itô (1950). See also Dudley (1977) for an extension to F₁-measurable random variables that are not necessarily square integrable, verifying a result first asserted then retracted by Clark.

See Harrison & Pliska (1981) and Duffie (1992, Chapter 6) for a discussionof stochastic calculus and option trading. The book by Wilmott, Howison &Dewynne (1995) provides a gentler introduction to some of the finance ideas.


In the last three Sections of the Chapter, I was dabbling with ideas important in stochastic calculus, without introducing the formal machinery of the subject. Accordingly, there was some repetition of methods: chop the sample paths into small increments; make Taylor expansions; dispose of remainder terms by arguments reeking of martingale theory. As I remarked at the end of Section 6, if you wish to pursue the theory any further it would be a good idea to invest some time in studying the formal machinery. I found Chung & Williams (1990) a good place to start, with Métivier (1982) as a reliable backup, and Dellacherie & Meyer (1982) as a rigorous test of true understanding.

REFERENCES

Brush, S. G. (1968), 'A history of random processes: I. Brownian movement from Brown to Perrin', Archive for History of the Exact Sciences pp. 1-36.

Chung, K. L. & Williams, R. J. (1990), Introduction to Stochastic Integration, Birkhäuser, Boston.

Ciesielski, Z. (1961), 'Hölder condition for realization of Gaussian processes', Transactions of the American Mathematical Society 99, 403-413.

Clark, J. M. C. (1970), 'The representation of functionals of Brownian motion by stochastic integrals', Annals of Mathematical Statistics 41, 1282-1295. Correction, ibid. 42 (1971), 1778.

Dellacherie, C. & Meyer, P. A. (1982), Probabilities and Potential B: Theory of Martingales, North-Holland, Amsterdam.

Doob, J. L. (1953), Stochastic Processes, Wiley, New York.

Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of Probability 1, 66-103.

Dudley, R. M. (1977), 'Wiener functionals as Itô integrals', Annals of Probability 5, 140-141.

Duffie, D. (1992), Dynamic Asset Pricing Theory, Princeton University Press.

Durrett, R. (1984), Brownian Motion and Martingales in Analysis, Wadsworth, Belmont CA.

Harrison, J. M. & Pliska, S. R. (1981), 'Martingales and stochastic integrals in the theory of continuous trading', Stochastic Processes and their Applications 11, 215-260.

Itô, K. (1950), 'Multiple Wiener integral', J. Math. Society Japan 3, 158-169.

Kac, M. (1966), 'Wiener and integration in function spaces', Bulletin of the American Mathematical Society 72, 52-68. One of several articles in a special issue of the journal, devoted to the life and work of Norbert Wiener.

Kunita, H. & Watanabe, S. (1967), 'On square integrable martingales', Nagoya Math. J. 30, 209-245.

Lévy, P. (1937), Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris. References from the 1954 second edition.

Lévy, P. (1939), 'Sur certains processus stochastiques homogènes', Compositio Mathematica 7, 283-339.

Lévy, P. (1948), Processus stochastiques et mouvement brownien, Gauthier-Villars, Paris. Second edition, 1965.

McKean, H. P. (1969), Stochastic Integrals, Academic Press.

Métivier, M. (1982), Semimartingales: A Course on Stochastic Processes, De Gruyter, Berlin.

Stroock, D. W. & Varadhan, S. R. S. (1979), Multidimensional Diffusion Processes, Springer, New York.

Wiener, N. (1923), 'Differential-space', Journal of Mathematics and Physics 2, 131-174. Reprinted in Selected Papers of Norbert Wiener, MIT Press, 1964.

Wilmott, P., Howison, S. & Dewynne, J. (1995), The Mathematics of Financial Derivatives: a Student Introduction, Cambridge University Press.


Chapter 10

Representations and couplings

SECTION 1 illustrates the usefulness of coupling, by means of three simple examples.

SECTION 2 describes how sequences of random elements of separable metric spaces that converge in distribution can be represented by sequences that converge almost surely.

SECTION *3 establishes Strassen's Theorem, which translates the Prohorov distance between two probability measures into a coupling.

SECTION *4 establishes Yurinskii's coupling for sums of independent random vectors to normally distributed random vectors.

SECTION 5 describes a deceptively simple example (Tusnády's Lemma) of a quantile coupling, between a symmetric Binomial distribution and its corresponding normal approximation.

SECTION 6 uses the Tusnády Lemma to couple the Haar coefficients for the expansions of an empirical process and a generalized Brownian Bridge.

SECTION 7 derives one of the most striking results of modern probability theory, the KMT coupling of the uniform empirical process with the Brownian Bridge process.

1. What is coupling?

A coupling of two probability measures, P and Q, consists of a probability space (Ω, F, ℙ) supporting two random elements X and Y, such that X has distribution P and Y has distribution Q. Sometimes interesting relationships between P and Q can be coded in some simple way into the joint distribution for X and Y. Three examples should make the concept clearer.

<1> Example. Let P_α denote the Bin(n, α) distribution. As α gets larger, the distribution should "concentrate on bigger values." More precisely, for each fixed x, the tail probability P_α[x, n] should be an increasing function of α. A coupling argument will give an easy proof.

Consider a β larger than α. Suppose we construct a pair of random variables, X_α with distribution P_α and X_β with distribution P_β, such that X_α ≤ X_β almost surely. Then we will have {X_α ≥ x} ⊆ {X_β ≥ x} almost surely, from which we would recover the desired inequality, P_α[x, n] ≤ P_β[x, n], by taking expectations with respect to ℙ.

How might we construct the coupling? Binomials count successes in independent trials. Couple the trials and we couple the counts. Build the trials from


independent random variables U_i, each uniformly distributed on (0, 1). That is, define X_α := Σ_{i≤n}{U_i ≤ α} and X_β := Σ_{i≤n}{U_i ≤ β}. In fact, the construction couples all P_γ, for 0 ≤ γ ≤ 1, simultaneously.
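The construction translates directly into a few lines of simulation (my own sketch; the values of n and the two success probabilities are arbitrary): using the same uniforms for both parameters makes the coupled counts comparable pathwise, which is exactly the monotonicity the Example exploits.

```python
import numpy as np

# Monotone coupling of Bin(n, alpha) and Bin(n, beta) from shared uniforms:
# {U_i <= alpha} is a subset of {U_i <= beta} whenever alpha <= beta, so the
# coupled counts satisfy X_alpha <= X_beta on every sample path.
rng = np.random.default_rng(6)
n, n_reps = 20, 100_000
U = rng.uniform(size=(n_reps, n))
X_alpha = (U <= 0.3).sum(axis=1)   # Bin(20, 0.3)
X_beta = (U <= 0.5).sum(axis=1)    # Bin(20, 0.5), built from the same U
print(bool(np.all(X_alpha <= X_beta)))            # True, by construction
print((X_alpha >= 8).mean(), (X_beta >= 8).mean())  # tail prob increases
```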

<2> Example. Let P denote the Bin(n, α) distribution and Q denote the approximating Poisson(nα) distribution. A coupling argument will establish a total variation bound, sup_A |PA − QA| ≤ nα², an elegant means for expressing the Poisson approximation to the Binomial.

Start with the simplest case, where n equals 1. Find a probability measure ℙ concentrated on {0, 1} × N₀ with marginal distributions P := Bin(1, α) and Q := Poisson(α). The strategy is simple: put as much mass as we can on the diagonal, (0, 0) ∪ (1, 1), then spread the remaining mass as needed to get the desired marginals. The atoms on the diagonal are constrained by the inequalities

ℙ(0, 0) ≤ min(P{0}, Q{0}) = min(1 − α, e^{−α}) = 1 − α,
ℙ(1, 1) ≤ min(P{1}, Q{1}) = min(α, αe^{−α}) = αe^{−α}.

To maximize, choose ℙ(0, 0) := 1 − α and ℙ(1, 1) := αe^{−α}. The rest is arithmetic. We need ℙ(1, 0) := e^{−α} − 1 + α to attain the marginal probability Q{0}, and ℙ(0, k) := 0, for k = 1, 2, ..., to attain the marginal P{0} = 1 − α. The choices ℙ(1, k) := Q{k}, for k = 2, 3, ..., are then forced. The total off-diagonal mass equals α − αe^{−α} ≤ α².

For the general case, take ℙ to be the n-fold product of measures of the type constructed for n = 1. That is, construct n independent random vectors (X₁, Y₁), ..., (X_n, Y_n) with each X_i distributed Bin(1, α), each Y_i distributed Poisson(α), and ℙ{X_i ≠ Y_i} ≤ α². The sums X := Σ_i X_i and Y := Σ_i Y_i then have the desired Binomial and Poisson distributions, and ℙ{X ≠ Y} ≤ Σ_i ℙ{X_i ≠ Y_i} ≤ nα². The total variation bound follows from the inequality

|ℙ{X ∈ A} − ℙ{Y ∈ A}| = |ℙ{X ∈ A, X ≠ Y} − ℙ{Y ∈ A, X ≠ Y}| ≤ ℙ{X ≠ Y},

for every subset A of integers.

The first Example is an instance of a general method for coupling probability measures on the real line by means of quantile functions. Suppose P has distribution function F and Q has distribution function G, with corresponding quantile functions q_F and q_G. Remember from Section 2.9 that, for each 0 < u < 1,

u ≤ F(x) if and only if q_F(u) ≤ x.

In particular, if U is uniformly distributed on (0, 1) then

ℙ{q_F(U) ≤ x} = ℙ{U ≤ F(x)} = F(x),

so that X := q_F(U) must have distribution P. We couple P with Q by using the same U to define the random variable Y := q_G(U) with distribution Q.

A slight variation on the quantile coupling is available when G is one-to-one with range covering the whole of (0, 1). In that case, q_G is a true inverse function for G, and U = G(Y). The random variable X := q_F(G(Y)) is then an increasing function of Y, a useful property. Section 5 will describe a spectacularly successful example of a quantile coupling expressed in this form.
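A minimal illustration of the quantile coupling (my own; the choice of exponential marginals is arbitrary, made so that both quantile functions have simple closed forms):

```python
import numpy as np

# Quantile coupling of P = Exponential(1) and Q = Exponential(2) from a
# single uniform U: X = q_F(U) and Y = q_G(U) have the right marginals,
# and the pair moves together (here Y = X/2 on every sample path).
rng = np.random.default_rng(8)
U = rng.uniform(size=100_000)
X = -np.log1p(-U)            # q_F(u) = -log(1 - u) for Exponential(1)
Y = -np.log1p(-U) / 2.0      # q_G(u) = -log(1 - u)/2 for Exponential(2)
print(X.mean(), Y.mean())    # near the true means 1 and 1/2
print(bool(np.all(Y <= X)))  # True: the coupling is monotone
```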


<3> Example. Suppose {P_n} is a sequence of probability measures on the real line, for which P_n ⇝ P. Write F_n and F for the corresponding distribution functions, and q_n and q for the quantile functions. From Section 7.1 we know that F_n(x) → F(x) at each x for which P{x} = 0, which implies (Problem [1]) that q_n(u) → q(u) at Lebesgue almost all u in (0, 1). If we use a single U, distributed uniformly on (0, 1), to construct the variables X_n := q_n(U) and X := q(U), then we have X_n → X almost surely. That is, we have represented the weakly convergent sequence of measures by an almost surely convergent sequence of random variables.

REMARK. It might happen that the measures {P_n} are the distributions of some other sequence of random variables, {Y_n}. Then, necessarily, Y_n ⇝ P; but the construction does not assert that Y_n converges almost surely. Indeed, we might even have the Y_n defined on different probability spaces, which would completely rule out any possible thought of almost sure convergence. The construction ensures that each X_n has marginal distribution P_n, the same as Y_n, but the joint distribution of the X_n's has nothing to do with the joint distribution of the Y_n's (which is only well defined if the Y_n all live on the same probability space). Indeed, that is the whole point of the construction: we have artificially manufactured the joint distribution for the X_n's in order that they converge, not just in the distributional sense, but also in the almost sure sense.

The representation lets us prove facts about weak convergence by means of the tools for almost sure convergence. For example, in the problems to Chapter 7, you were asked to show that Δ(P, Q) := sup{|Pℓ − Qℓ| : ‖ℓ‖_BL ≤ 1} defines a metric for weak convergence on the set of all Borel probability measures on a separable metric space. (Refer to Section 7.1 for the definition of the bounded Lipschitz norm.) If Δ(P_n, P) → 0 then P_nf → Pf for each f with ‖f‖_BL < ∞, that is, P_n ⇝ P. Conversely, if P_n ⇝ P and we find X_n with distribution P_n and X with distribution P for which X_n → X almost surely (see Section 2 for the general case), then

Δ(P_n, P) ≤ sup_{‖ℓ‖_BL ≤ 1} ℙ|ℓ(X_n) − ℓ(X)| ≤ ℙ(2 ∧ |X_n − X|) → 0.

In effect, the general constructions of the representing variables subsume the specific calculations used in Chapter 7 to approximate {ℓ : ‖ℓ‖_BL ≤ 1} by a finite collection of functions.

2. Almost sure representations

The representation from Example <3> has extensions to more general spaces. The result for separable metric spaces gives the flavor of the general case without getting us caught up in too many measure theoretic details.

<4> Theorem. For probability measures on the Borel sigma-field of a separable metric space $\mathcal{X}$, if $P_n \rightsquigarrow P$ then there exist random elements $X_n$, with distributions $P_n$, and $X$, with distribution $P$, for which $X_n \to X$ almost surely.

Page 255: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

240 Chapter 10: Representations and couplings

The main step in the proof involves construction of a joint distribution for $X_n$ and $X$. To avoid a profusion of subscripts, it is best to isolate this part of the construction into a separate lemma. Once again, a single uniformly distributed $U$ (that is, with distribution equal to Lebesgue measure $m$ on $\mathcal{B}(0,1)$) will eventually provide the thread that ties together the various couplings into a single sequence converging almost surely. The construction builds the joint distribution via a probability kernel $K$ from $(0,1) \times \mathcal{X}$ into $\mathcal{X}$.

Recall, from Section 4.3, that such a kernel consists of a family of probability measures $\{K_{u,x}(\cdot) : u \in (0,1),\ x \in \mathcal{X}\}$ with $(u,x) \mapsto K_{u,x}B$ measurable for each fixed $B$ in $\mathcal{B}(\mathcal{X})$. We define a measure on the product sigma-field of $(0,1) \times \mathcal{X} \times \mathcal{X}$ by

$$(m \otimes P \otimes K)^{u,x,y} f(u,x,y) := m^u \left( P^x K^y_{u,x} f(u,x,y) \right).$$

Less formally: we independently generate an observation $u$ from the uniform distribution $m$ and an observation $x$ from $P$, then we generate a $y$ from the corresponding $K_{u,x}$. The expression in parentheses on the right-hand side also defines a probability distribution, $(P \otimes K)_u$, on $\mathcal{X} \times \mathcal{X}$,

$$(P \otimes K)^{x,y}_u f(x,y) := P^x K^y_{u,x} f(x,y) \quad \text{for each fixed } u.$$

In fact, $\{(P \otimes K)_u : u \in (0,1)\}$ is a probability kernel from $(0,1)$ to $\mathcal{X} \times \mathcal{X}$. Notice also that the marginal distribution $m^u P^x K_{u,x}$ for $y$ is an $m \otimes P$ average of the $K_{u,x}$ probability measures on $\mathcal{B}(\mathcal{X})$. As an exercise in generating class methods, you might check all the measurability properties needed to make these assertions precise.

<5> Lemma. Let $P$ and $Q$ be probability measures on the Borel sigma-field $\mathcal{B}(\mathcal{X})$. Suppose there is a partition of $\mathcal{X}$ into disjoint Borel sets $B_0, B_1, \ldots, B_m$, and a positive constant $\epsilon$, for which $QB_\alpha \ge (1-\epsilon)PB_\alpha$ for each $\alpha$. Then there exists a probability kernel $K$ from $(0,1) \times \mathcal{X}$ to $\mathcal{X}$ for which $Q = m^u P^x K_{u,x}$ and for which $(P \otimes K)_u$ concentrates on $\cup_\alpha (B_\alpha \times B_\alpha)$ whenever $u \le 1 - \epsilon$.

Proof. Rewrite the assumption as $QB_\alpha = \delta_\alpha + (1-\epsilon)PB_\alpha$, where the nonnegative numbers $\delta_\alpha$ must sum to $\epsilon$ because $\sum_\alpha QB_\alpha = \sum_\alpha PB_\alpha = 1$. Write $Q(\cdot \mid B_\alpha)$ for the conditional distribution, which can be taken as an arbitrary probability measure on $B_\alpha$ if $QB_\alpha = 0$. Partition the interval $(1-\epsilon, 1)$ into disjoint subintervals $J_\alpha$ with $mJ_\alpha = \delta_\alpha$. Define

$$K_{u,x}(\cdot) := \sum_\alpha \left( \{u \in J_\alpha\} + \{u \le 1-\epsilon,\ x \in B_\alpha\} \right) Q(\cdot \mid B_\alpha).$$

When $u \le 1-\epsilon$ the recipe is: generate $y$ from $Q(\cdot \mid B_\alpha)$ when $x \in B_\alpha$, which ensures that $x$ and $y$ then belong to the same $B_\alpha$. Integrate over $u$ and $x$ to find the marginal probability that $y$ lands in a Borel set $A$:

$$m^u P^x K_{u,x}A = \sum_\alpha \left( \delta_\alpha + (1-\epsilon)PB_\alpha \right) Q(A \mid B_\alpha) = \sum_\alpha (QB_\alpha)\, Q(A \mid B_\alpha) = QA,$$

as asserted. □

REMARK. Notice that the kernel $K$ does nothing clever when $u \in J_\alpha$. If we were hoping for a result closer to the quantile coupling of Example <3>, we might instead try to select $y$ from a $B_\beta$ that is close to $x$, in some sense. Such refined behavior would require a more detailed knowledge of the partition.
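The recipe in the proof of Lemma <5> can be exercised on a toy example. The sketch below is my own: it takes $\mathcal{X} = \{0,1,2\}$ with each block $B_\alpha = \{\alpha\}$ a singleton (so $Q(\cdot \mid B_\alpha)$ is the point mass at $\alpha$), simulates the kernel, and checks that the marginal $m^u P^x K_{u,x}$ returns $Q$.

```python
import random
from fractions import Fraction as F

random.seed(1)

# Toy instance of Lemma <5>: X = {0, 1, 2}, each B_a = {a} a singleton.
P = [F(1, 2), F(1, 4), F(1, 4)]
Q = [F(2, 5), F(3, 10), F(3, 10)]
eps = F(1, 5)                     # here Q[a] >= (1 - eps) * P[a] for every a

# delta_a from the proof: Q B_a = delta_a + (1 - eps) P B_a, summing to eps.
delta = [Q[a] - (1 - eps) * P[a] for a in range(3)]
assert sum(delta) == eps and all(d >= 0 for d in delta)

# Subintervals J_a of (1 - eps, 1) with lengths delta_a.
cuts = [1 - eps]
for d in delta:
    cuts.append(cuts[-1] + d)

def draw_y(u, x):
    """The kernel K_{u,x}: copy the block of x when u <= 1 - eps, else use J_a."""
    if u <= 1 - eps:
        return x
    for a in range(3):
        if cuts[a] < u <= cuts[a + 1]:
            return a
    return 2

counts = [0, 0, 0]
reps = 20000
for _ in range(reps):
    u = random.random()
    x = random.choices(range(3), weights=[float(p) for p in P])[0]
    counts[draw_y(u, x)] += 1

freqs = [c / reps for c in counts]
print("empirical marginal of y:", freqs, " target Q:", [float(q) for q in Q])
for a in range(3):
    assert abs(freqs[a] - float(Q[a])) < 0.02   # m^u P^x K_{u,x} = Q
```

With singleton blocks the diagonal concentration is automatic: whenever $u \le 1 - \epsilon$ the draw sets $y = x$.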

Proof of Theorem <4>. The idea is simple. For each $n$ we will construct an appropriate probability kernel $K^{(n)}$, from $(0,1) \times \mathcal{X}$ to $\mathcal{X}$, via an appeal to the Lemma, with $Q$ equal to the corresponding $P_n$ and $\epsilon$ depending on $n$. We then independently generate $X_n(\omega)$ from $K^{(n)}_{u,x}$ for each $n$, with $u$ an observation from $m$ independent of an observation $X(\omega) := x$ from $P$.

The inequality required by the Lemma would follow from convergence in distribution if each $B_\alpha$ were a $P$-continuity set (that is, if each boundary $\partial B_\alpha$ had zero $P$ measure; see Section 7.1), for then we would have $P_nB_\alpha \to PB_\alpha$ as $n \to \infty$. Problem [4] shows how to construct such a partition $\pi := \{B_0, B_1, \ldots, B_m\}$ for an arbitrarily small $\epsilon > 0$, with two additional properties,

(i) $PB_0 \le \epsilon$

(ii) diameter$(B_\alpha) \le \epsilon$ for each $\alpha \ge 1$.

We shall need a whole family of such partitions, $\pi_k := \{B_{\alpha,k} : \alpha = 0, 1, \ldots, m_k\}$, corresponding to values $\epsilon_k := 2^{-k}$ for each $k \in \mathbb{N}$.

To each $k$ there exists an $n_k$ for which $P_nB \ge (1-\epsilon_k)PB$ for all $B$ in $\pi_k$, when $n \ge n_k$. With no loss of generality we may assume that $1 \le n_1 < n_2 < \ldots$, which ensures that for each $n$ greater than $n_1$ there exists a unique $k := k(n)$ for which $n_k \le n < n_{k+1}$. Write $K^{(n)}_{u,x}$ for the probability kernel defined by Lemma <5> for $Q := P_n$ with $\epsilon := \epsilon_{k(n)}$ and $\pi_{k(n)}$ as the partition. Define $\mathbb{P}$ as the probability measure $m \otimes P \otimes (\otimes_{n \in \mathbb{N}} K^{(n)})$ on the product sigma-field of $\Omega := (0,1) \times \mathcal{X} \times \mathcal{X}^{\mathbb{N}}$. The generic point of $\Omega$ is a sequence $\omega := (u, x, y_1, y_2, \ldots)$. Define $X(\omega) := x$ and $X_n(\omega) := y_n$.

Why does $X_n$ converge $\mathbb{P}$-almost surely to $X$? First note that $\sum_k PB_{0,k} < \infty$. Borel-Cantelli therefore ensures that, for almost all $x$ and every $u$ in $(0,1)$, there exists a $k_0 = k_0(u, x)$ for which $u \le 1 - \epsilon_k$ and $x \notin B_{0,k}$ for all $k \ge k_0$. For such $(u,x)$ and $k \ge k_0$ we have $(x, y_n) \in \cup_{\alpha \ge 1} B_{\alpha,k} \times B_{\alpha,k}$ for $n_k \le n < n_{k+1}$, by the concentration property of the kernels. That is, both $X(\omega)$ and $X_n(\omega)$ fall within the same $B_{\alpha,k}$ with $\alpha \ge 1$, a set with diameter less than $\epsilon_k$. Think your way through that convoluted assertion and you will realize we have shown something even stronger than almost sure convergence. □

<6> Example. Suppose $P_n \rightsquigarrow P$ as probability measures on the Borel sigma-field of a separable metric space, and suppose that $\{T_n\}$ is a sequence of measurable maps into another metric space $\mathcal{Y}$. If $P$-almost all $x$ have the property that $T_n(x_n) \to T(x)$ for every sequence $\{x_n\}$ converging to $x$, then the sequence of image measures also converges in distribution, $T_nP_n \rightsquigarrow TP$, as probability measures on the Borel sigma-field of $\mathcal{Y}$. The proof is easy if we represent $\{P_n\}$ by the sequence $\{X_n\}$, as in the Theorem. For each $\ell$ in $BL(\mathcal{Y})$, we have $\ell(T_n(X_n(\omega))) \to \ell(T(X(\omega)))$ for $\mathbb{P}$-almost all $\omega$. Thus

$$(T_nP_n)\ell = \mathbb{P}\ell(T_n(X_n)) \to \mathbb{P}\ell(T(X)) = (TP)\ell,$$

by Dominated Convergence. □

I noted in Example <3> that if $Y_n$ has distribution $P_n$, and if each $Y_n$ is defined on a different probability space $(\Omega_n, \mathcal{F}_n, \mathbb{P}_n)$, then the convergence in distribution $Y_n \rightsquigarrow P$ cannot possibly imply almost sure convergence for $Y_n$. Nevertheless, using an argument similar to the proof of Theorem <4>, Dudley (1985) obtained something almost as good as almost sure convergence.

He built a single probability space $(\Omega, \mathcal{F}, \mathbb{P})$ supporting measurable maps $\psi_n$, into $\Omega_n$, and $X$, into $\mathcal{X}$, with distributions $\mathbb{P}_n = \psi_n(\mathbb{P})$ and $P = X(\mathbb{P})$, for which $Y_n(\psi_n(\omega)) \to X(\omega)$ for $\mathbb{P}$ almost all $\omega$. In effect, the $\psi_n$ maps pull $Y_n$ back to $\Omega$, where the notions of pointwise and almost sure convergence make sense.

Actually, Dudley established a more delicate result, for $Y_n$ that need not be measurable as maps into $\mathcal{X}$, a generalization needed to accommodate an application in the theory of abstract empirical processes. See Pollard (1990, Section 9) for a discussion of some of the conceptual and technical difficulties (such as the meaning of convergence in distribution for maps that don't have distributions in the usual sense) that are resolved by Dudley's construction. See Kim & Pollard (1990, Section 2) for an example of the subtle advantages of Dudley's form of the representation theorem.

*3. Strassen's Theorem

Once again let $(\mathcal{X}, d)$ be a separable metric space equipped with its Borel sigma-field $\mathcal{B}(\mathcal{X})$. For each subset $A$ of $\mathcal{X}$, and each $\epsilon > 0$, define $A^\epsilon$ to be the closed set $\{x \in \mathcal{X} : d(x, A) \le \epsilon\}$. The Prohorov distance between any $P$ and $Q$ from the set $\mathcal{P}$ of all probability measures on $\mathcal{B}(\mathcal{X})$ is defined as

$$\rho(P, Q) := \inf\{\epsilon > 0 : PB \le QB^\epsilon + \epsilon \text{ for all } B \text{ in } \mathcal{B}(\mathcal{X})\}.$$

Despite the apparent lack of symmetry in the definition, $\rho$ is a metric (Problem [3]) on $\mathcal{P}$.

REMARK. Separability of $\mathcal{X}$ is convenient, but not essential when dealing with the Prohorov metric. For example, it implies that $\mathcal{B}(\mathcal{X} \times \mathcal{X}) = \mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{X})$, which ensures that $d(X, X')$ is measurable for each pair of random elements $X$ and $X'$; and if $X_n \to X$ almost surely then $\mathbb{P}\{d(X_n, X) > \epsilon\} \to 0$ for each $\epsilon > 0$.
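For measures supported on a finite subset of the line, the infimum in the definition can be computed by brute force over all Borel sets $B$ (here, all subsets of the support) and a grid of candidate $\epsilon$. The sketch below is my own illustration, with two distributions on three points:

```python
from itertools import chain, combinations

# Two probability measures on the points {0.0, 1.0, 2.0} of the real line.
points = [0.0, 1.0, 2.0]
P = {0.0: 0.5, 1.0: 0.3, 2.0: 0.2}
Q = {0.0: 0.3, 1.0: 0.3, 2.0: 0.4}

def mass(mu, B):
    return sum(mu[x] for x in B)

def closed_blowup(B, eps):
    """B^eps = {x : d(x, B) <= eps}, restricted to the finite support."""
    return [x for x in points if any(abs(x - b) <= eps for b in B)]

def prohorov_ok(eps):
    """Check P B <= Q B^eps + eps for every nonempty subset B."""
    subsets = chain.from_iterable(combinations(points, r) for r in range(1, 4))
    for B in subsets:
        if mass(P, B) > mass(Q, closed_blowup(B, eps)) + eps + 1e-12:
            return False
    return True

# Scan a grid for the smallest workable eps.
rho = min(e / 1000.0 for e in range(0, 2001) if prohorov_ok(e / 1000.0))
print("approximate Prohorov distance:", rho)
assert prohorov_ok(rho) and not prohorov_ok(rho - 0.01)
```

For these two measures the binding sets are $B = \{0.0\}$ and $B = \{0.0, 1.0\}$, where the one-sided deficit $PB - QB$ equals $0.2$.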

If $\rho(P_n, P) \to 0$ then, for each closed $F$ we have $P_nF \le PF^\epsilon + \epsilon$ eventually, and hence $\limsup_n P_nF \le PF$, implying that $P_n \rightsquigarrow P$. Theorem <4> makes it easy to prove the converse. If $X_n$ has distribution $P_n$ and $X$ has distribution $P$, and if $X_n \to X$ almost surely, then for each $\epsilon > 0$ there is an $n_\epsilon$ such that

<7> $$\mathbb{P}\{d(X_n, X) > \epsilon\} \le \epsilon \quad \text{for } n \ge n_\epsilon.$$

For every Borel set $B$, when $n \ge n_\epsilon$ we have

$$P_nB \le \mathbb{P}\{X_n \in B,\ d(X_n, X) \le \epsilon\} + \mathbb{P}\{d(X_n, X) > \epsilon\} \le \mathbb{P}\{X \in B^\epsilon\} + \epsilon = PB^\epsilon + \epsilon.$$

Thus $\rho$ is actually a metric for weak convergence of probability measures.

The Prohorov metric also has an elegant (and useful, as will be shown by Section 4) coupling interpretation, due to Strassen (1965). I will present a slightly restricted version of the result, by placing a tightness assumption on the probabilities, in order to simplify the statement of the Theorem. (Actually, the proof will establish a stronger result; the tightness will be used only at the very end, to tidy up.) Also, the role of $\epsilon$ is slightly easier to understand if we replace it by two separate constants.

<8> Theorem. Let $P$ and $Q$ be tight probability measures on the Borel sigma-field $\mathcal{B}$ of a separable metric space $\mathcal{X}$. Let $\epsilon$ and $\epsilon'$ be positive constants. There exist random elements $X$ and $Y$ of $\mathcal{X}$ with distributions $P$ and $Q$ such that $\mathbb{P}\{d(X, Y) > \epsilon\} \le \epsilon'$ if and only if $PB \le QB^\epsilon + \epsilon'$ for all Borel sets $B$.

The argument for deducing the family of inequalities from existence of the coupling is virtually the same as <7>. For the other, more interesting direction, I follow an elegant idea of Dudley (1976, Lecture 18). By approximation arguments he reduced to the case where both $P$ and $Q$ concentrate on a finite set of atoms, and then existence of the coupling followed by an appeal to the classical Marriage Lemma (Problem [5]). I modify his argument to eliminate a few steps, by making an appeal to the following generalization (proved in Problem [6]) of that Lemma.

<9> Lemma. Let $\nu$ be a finite measure on a finite set $S$ and $\mu$ be a finite measure on a sigma-field $\mathcal{B}$ on a set $T$. Suppose $\{R_\alpha : \alpha \in S\}$ is a collection of measurable sets with the domination property that $\nu(A) \le \mu(\cup_{\alpha \in A} R_\alpha)$ for all $A \subseteq S$. Then there exists a probability kernel $K$ from $S$ to $T$ with $K_\alpha$ concentrated on $R_\alpha$ for each $\alpha$ and $\sum_{\alpha \in S} \nu\{\alpha\}K_\alpha \le \mu$.

Proof of Theorem <8>. The measure $\mathbb{P}$ will live on $\mathcal{X} \times \mathcal{X}$, with $X$ and $Y$ as the coordinate maps. It will be the limit of a weakly convergent subsequence of a uniformly tight family $\{\mathbb{P}_\delta : \delta > 0\}$, obtained by an appeal to the Prohorov/Le Cam theorem from Section 7.5.

Construct $\mathbb{P}_\delta$ via a "discretization" of $P$, which brings the problem within the ambit of Lemma <9>. For a small, positive $\delta$, which will eventually be sent to zero, partition $\mathcal{X}$ into finitely many disjoint Borel sets $B_0, B_1, \ldots, B_m$ with $PB_0 \le \delta$ and diameter$(B_\alpha) \le \delta$ for $\alpha \ge 1$. (Compare with the construction in Problem [4].) Define a probability measure $\nu$, concentrated on the finite set $S := \{0, 1, \ldots, m\}$, by $\nu\{\alpha\} := PB_\alpha$ for $\alpha = 0, \ldots, m$. Augment $\mathcal{X}$ by a point $\infty$. Extend $Q$ to a measure $\mu$ on $T := \mathcal{X} \cup \{\infty\}$ by placing mass $\epsilon'$ at $\infty$. Define $R_\alpha$ as $B_\alpha^\epsilon \cup \{\infty\}$. With these definitions, the measures $\nu$ and $\mu$ satisfy the requirements of Lemma <9>: for each subset $A$ of $S$,

$$\nu(A) = P\left( \cup_{\alpha \in A} B_\alpha \right) \le Q\left( \cup_{\alpha \in A} B_\alpha \right)^\epsilon + \epsilon' = Q\left( \cup_{\alpha \in A} B_\alpha^\epsilon \right) + \mu\{\infty\} = \mu\left( \cup_{\alpha \in A} R_\alpha \right).$$

The Lemma ensures existence of a probability kernel $K$ from $S$ to $T$, with $K_\alpha B_\alpha^\epsilon + K_\alpha\{\infty\} = K_\alpha R_\alpha = 1$ for each $\alpha$ and $\sum_\alpha \nu\{\alpha\}K_\alpha A \le \mu A$ for every Borel subset $A$ of $T$. In particular, $\sum_\alpha \nu\{\alpha\}K_\alpha B \le QB$ for all $B \in \mathcal{B}$. The nonnegative measure $Q - \sum_\alpha \nu\{\alpha\}K_\alpha|_{\mathcal{X}}$ on $\mathcal{B}$ has total mass

$$r := 1 - \sum_\alpha \nu\{\alpha\}K_\alpha\mathcal{X} = \sum_\alpha \nu\{\alpha\}K_\alpha\{\infty\} \le \mu\{\infty\} = \epsilon'.$$

Write this measure as $rQ_0$, with $Q_0$ a probability measure on $\mathcal{B}$. (If $r = 0$, choose $Q_0$ arbitrarily.) We then have $Qh = rQ_0h + \sum_\alpha \nu\{\alpha\}K_\alpha h$ for all $h \in \mathcal{M}^+(\mathcal{X})$.

Define a probability measure $\mathbb{P}_\delta$ on $\mathcal{B} \otimes \mathcal{B}$ by

$$\mathbb{P}_\delta f := P^x\left( \sum_\alpha \{x \in B_\alpha\}\left( K_\alpha|_{\mathcal{X}} + K_\alpha\{\infty\}Q_0 \right)^y f(x, y) \right) \quad \text{for } f \in \mathcal{M}^+(\mathcal{X} \times \mathcal{X}).$$

REMARK. In effect, I have converted $K$ to a probability kernel $L$ from $\mathcal{X}$ to $\mathcal{X}$, by setting $L_x$ equal to $K_\alpha|_{\mathcal{X}} + K_\alpha\{\infty\}Q_0$ when $x \in B_\alpha$. The definition of $\mathbb{P}_\delta$ is equivalent to $\mathbb{P}_\delta := P \otimes L$, in the sense of Section 4.4.

The measure $\mathbb{P}_\delta$ has marginals $P$ and $Q$ because, for $g$ and $h$ in $\mathcal{M}^+(\mathcal{X})$,

$$\mathbb{P}_\delta^{x,y} g(x) = P^x\left( \sum_\alpha \{x \in B_\alpha\}\left( K_\alpha\mathcal{X} + K_\alpha\{\infty\} \right) g(x) \right) = Pg,$$

$$\mathbb{P}_\delta^{x,y} h(y) = \sum_\alpha P\{x \in B_\alpha\}\left( K_\alpha h + K_\alpha\{\infty\}Q_0h \right) = \sum_\alpha \nu\{\alpha\}K_\alpha h + rQ_0h = Qh.$$

It concentrates most of its mass on the set $D := \cup_{\alpha=1}^m \left( B_\alpha \times B_\alpha^\epsilon \right)$:

$$\mathbb{P}_\delta D \ge \sum_{\alpha=1}^m P^x\left( \{x \in B_\alpha\} K_\alpha B_\alpha^\epsilon \right) = \sum_{\alpha \ge 1} \nu\{\alpha\}\left( 1 - K_\alpha\{\infty\} \right) \ge 1 - PB_0 - r \ge 1 - \epsilon' - \delta.$$

When $(x, y)$ belongs to $D$, we have $x \in B_\alpha$ and $d(y, B_\alpha) \le \epsilon$ for some $B_\alpha$ with diameter$(B_\alpha) \le \delta$, and hence $d(x, y) \le \delta + \epsilon$. Thus $\mathbb{P}_\delta$ assigns measure at least $1 - \epsilon' - \delta$ to the closed set $F_{\delta+\epsilon} := \{(x, y) \in \mathcal{X} \times \mathcal{X} : d(x, y) \le \delta + \epsilon\}$.

The tightness of both $P$ and $Q$ will let us eliminate $\delta$, by passing to the limit along a subsequence. For each $\eta > 0$ there exists a compact set $C_\eta$ for which $PC_\eta^c \le \eta$ and $QC_\eta^c \le \eta$. The probability measure $\mathbb{P}_\delta$, which has marginals $P$ and $Q$, puts mass at most $2\eta$ outside the compact set $C_\eta \times C_\eta$. The family $\{\mathbb{P}_\delta : \delta > 0\}$ is uniformly tight, in the sense explained in Section 7.5. As shown in that Section, there is a sequence $\{\delta_i\}$ tending to zero for which $\mathbb{P}_{\delta_i} \rightsquigarrow \mathbb{P}$, with $\mathbb{P}$ a probability measure on $\mathcal{B} \otimes \mathcal{B}$. It is a very easy exercise to check that $\mathbb{P}$ has marginals $P$ and $Q$. For each fixed $t > \epsilon$, the weak convergence implies

$$\mathbb{P}F_t \ge \limsup_i \mathbb{P}_{\delta_i}F_t \ge \limsup_i \mathbb{P}_{\delta_i}F_{\epsilon+\delta_i} \ge 1 - \epsilon'.$$

Let $t$ decrease to $\epsilon$ to complete the proof. □

*4. The Yurinskii coupling

The multivariate central limit theorem gives conditions under which a sum $S$ of independent random vectors $\xi_1, \ldots, \xi_n$ has an approximate normal distribution. Theorem <4> would translate the corresponding distributional convergence into a coupling between the standardized sum and a random vector with the appropriate normal distribution. When the random vectors have finite third moments, Theorem <10> improves the result by giving a rate of convergence (albeit in probability).

<10> Theorem. Let $\xi_1, \ldots, \xi_n$ be independent random $k$-vectors with $\mathbb{P}\xi_i = 0$ for each $i$ and $\beta := \sum_i \mathbb{P}|\xi_i|^3$ finite. Let $S := \xi_1 + \ldots + \xi_n$. For each $\delta > 0$ there exists a random vector $T$ with a $N(0, \mathrm{var}(S))$ distribution such that

$$\mathbb{P}\{|S - T| > 3\delta\} \le C_0B\left( 1 + \frac{|\log(1/B)|}{k} \right) \quad \text{where } B := \beta k \delta^{-3},$$

for some universal constant $C_0$.


REMARK. The result stated by Yurinskii (1977) took a slightly different form. I have followed Le Cam (1988, Theorem 1) in reworking Yurinskii's methods. Both those authors developed bounds on the Prohorov distance, by making an explicit choice for $\delta$. The Le Cam preprint is particularly helpful in its discussion of heuristics behind how one balances the effect of various parameters to get a good bound.

Proof. The existence of the asserted coupling (for a suitably rich probability space) will follow via Theorem <8> if we can show for each Borel subset $A$ of $\mathbb{R}^k$ that

<11> $$\mathbb{P}\{S \in A\} \le \mathbb{P}\{T \in A^{3\delta}\} + \text{ERROR},$$

with the ERROR equal to the upper bound stated in the Theorem. By choosing a smooth (bounded derivatives up to third order) function $f$ that approximates the indicator function of $A$, in the sense that $f \approx 1$ on $A$ and $f \approx 0$ outside $A^{3\delta}$, we will be able to deduce inequality <11> from the multivariate form of Lindeberg's method (Section 7.3), which gives a third moment bound for a difference in expectations,

<12> $$|\mathbb{P}f(S) - \mathbb{P}f(T)| \le C\left( \mathbb{P}|\xi_1|^3 + \ldots + \mathbb{P}|\xi_n|^3 \right) = C\beta.$$

More precisely, if the constant $C_f$ is such that

<13> $$\left| f(x+y) - f(x) - y'\dot f(x) - \tfrac{1}{2}y'\ddot f(x)y \right| \le C_f|y|^3 \quad \text{for all } x \text{ and } y,$$

then we may take $C = \left( 9 + 8\mathbb{P}|N(0,1)|^3 \right)C_f \le 15C_f$.

For a fixed Borel set $A$, Lemma <18> at the end of the Section will show how to construct a smooth function $f$ for which approximation <13> holds with $C_f = (\sigma^2\delta)^{-1}$ and for which, if $\delta \ge \sigma\sqrt{k}$,

<14> $$(1-\epsilon)\{x \in A\} \le f(x) \le \epsilon + (1-\epsilon)\{x \in A^{3\delta}\}, \quad \text{where } \sigma := \frac{\delta}{\sqrt{k(1+\alpha)}} \text{ and } \epsilon := \left( \frac{1+\alpha}{e^\alpha} \right)^{k/2}.$$

The Lindeberg bound <12>, with $C\beta \le 15C_f\beta = 15\beta/(\sigma^2\delta) = 15B(1+\alpha)$, then gives

<15> $$\mathbb{P}\{S \in A\} \le (1-\epsilon)^{-1}\mathbb{P}f(S) \le (1-\epsilon)^{-1}\left( \mathbb{P}f(T) + 15B(1+\alpha) \right) \le \mathbb{P}\{T \in A^{3\delta}\} + \epsilon', \quad \text{where } \epsilon' := \frac{\epsilon + 15B(1+\alpha)}{1-\epsilon}.$$

We need to choose $\alpha$, as a function of $k$ and $B$, to make $\epsilon'$ small.

Clearly the bound <15> is useful only when $\epsilon$ is small, in which case the $(1-\epsilon)$ factor in the denominator contributes only an extra constant factor to the final bound. We should concentrate on the numerator. Similarly, the assertion of the Theorem is trivial if $B$ is not small. Provided we make sure $C_0 \ge e$, we may assume $B \le e^{-1}$, that is, $\log(1/B) \ge 1$.

To get within a factor 2 of minimizing a sum of two nonnegative functions, one increasing and the other decreasing, it suffices to equate the two contributions. This fact suggests we choose $\alpha$ to make

$$\alpha - \log(1+\alpha) \approx \frac{2}{k}\log(1/B) + O(k^{-1}).$$


If $B$ is small then $\alpha$ will be large, which would make $\log(1+\alpha)$ small compared with $\alpha$. If we make $\alpha$ slightly larger than $2k^{-1}\log(1/B)$ we should get close to equality. Actually, we can afford to have $\alpha$ a larger multiple of $\log(1/B)$, because extra multiplicative factors will just be absorbed into the constant $C_0$. With these thoughts, it seems to me I cannot do much better than choose $\alpha$ as a large enough multiple of $k^{-1}\log(1/B)$, which at least has the virtue of giving a clean bound: $\epsilon \le B$, and hence

$$\epsilon' = \frac{\epsilon + 15B(1+\alpha)}{1-\epsilon} \le C_0B\left( 1 + \frac{\log(1/B)}{k} \right).$$

The proof is complete, except for the construction of the smooth function $f$ satisfying <14>. □

Before moving on to the construction of $f$, let us see what we can do with the coupling from the Theorem in the case of identically distributed random vectors. For convenience of notation, write $\gamma_k(x)$ for the function $C_0x\left( 1 + |\log(1/x)|/k \right)$.

<16> Example. Let $\xi_1, \xi_2, \ldots$ be independent, identically distributed random $k$-vectors with $\mathbb{P}\xi_1 = 0$, $\mathrm{var}(\xi_1) := V$, and $\mu_3 := \mathbb{P}|\xi_1|^3 < \infty$. Write $S_n$ for $\xi_1 + \ldots + \xi_n$. The central limit theorem asserts that $S_n/\sqrt{n} \rightsquigarrow N(0, V)$. Theorem <10> asserts existence of a sequence of random vectors $W_n$, each distributed $N(0, V)$, for which

$$\mathbb{P}\left\{ \left| \frac{S_n}{\sqrt{n}} - W_n \right| > 3\delta \right\} \le \gamma_k\left( \frac{k\mu_3}{\delta^3\sqrt{n}} \right).$$

For fixed $k$, we can make the right-hand side as small as we please by choosing $\delta$ as a large enough multiple of $n^{-1/6}$. Thus, with finite third moments,

$$\left| \frac{S_n}{\sqrt{n}} - W_n \right| = O_p(n^{-1/6}) \quad \text{via the Yurinskii coupling.}$$

For $k = 1$, this coupling is not the best possible. For example, under an assumption of finite third moments, a theorem of Major (1976) gives a sequence of independent random variables $Y_1, Y_2, \ldots$, each distributed $N(0, V)$, for which

$$\left| \frac{S_n}{\sqrt{n}} - \frac{Y_1 + \ldots + Y_n}{\sqrt{n}} \right| = o(n^{-1/6}) \quad \text{almost surely.}$$

Major's result has the correct joint distributions for the approximating normals, as $n$ changes, as well as providing a slightly better rate. □

<17> Example. Yurinskii's coupling (and its refinements: see, for example, the discussion near Lemma 2.12 of Dudley & Philipp 1983) is better suited to situations where the dimension $k$ can change with $n$.

Consider the case of a sequence of independent, identically distributed stochastic processes $\{X_i(t) : t \in T\}$. Suppose $\mathbb{P}X_1(t) = 0$ and $|X_1(t)| \le 1$ for every $t$. Under suitable regularity conditions on the sample paths, we might try to show that the standardized partial sum processes, $Z_n(t) := (X_1(t) + \ldots + X_n(t))/\sqrt{n}$, behave like a centered Gaussian process $\{Z(t) : t \in T\}$, with the same covariance structure as $X_1$. We might even try to couple the processes in such a way that $\sup_t |Z_n(t) - Z(t)|$ is small in some probabilistic sense.

The obvious first step towards establishing a coupling of the processes is to consider behavior on large finite subsets $T(k) := \{t_1, \ldots, t_k\}$ of $T$, where $k$ is allowed to increase with $n$. The question becomes: How rapidly can $k$ tend to infinity?

For fixed $k$, write $\xi_i$ for the random $k$-vector with components $X_i(t_j)$, for $j = 1, \ldots, k$. We seek to couple $(\xi_1 + \ldots + \xi_n)/\sqrt{n}$ with a random vector $W_n$, distributed like $\{Z(t_j) : j = 1, \ldots, k\}$. The bound is almost the same as in Example <16>, except for the fact that the third moment now has a dependence on $k$,

$$\mu_3 = \mathbb{P}|\xi_1|^3 = \mathbb{P}\left( \sum_{j=1}^k X_1(t_j)^2 \right)^{3/2} \le k^{3/2}.$$

With $B := k\mu_3/(\delta^3\sqrt{n}) \le k^{5/2}/(\delta^3\sqrt{n})$, the coupling bound becomes

$$\mathbb{P}\left\{ \left| \frac{\xi_1 + \ldots + \xi_n}{\sqrt{n}} - W_n \right| > 3\delta \right\} \le \gamma_k\left( \frac{k^{5/2}}{\delta^3\sqrt{n}} \right) \to 0 \quad \text{if } k = o(n^{1/5}) \text{ and } \delta \to 0 \text{ slowly enough.}$$

That is, $\max_{j \le k} |Z_n(t_j) - W_{n,j}| = o_p(1)$ if $k$ increases more slowly than $n^{1/5}$. □

Smoothing of indicator functions

There are at least two methods for construction of a smooth approximation $f$ to a set $A$. The first uses only the metric:

$$f(x) := \left( 1 - \frac{d(x, A)}{\delta} \right)^+.$$

For an interval in one dimension, the approximation has the effect of replacing the discontinuity at the boundary points by linear functions with slope $1/\delta$. The second method treats the indicator function of the set as an element of an $\mathcal{L}^1$ space, and constructs the approximation by means of convolution smoothing,

$$f(x) = m^w\left( \{w \in A\}\phi_\sigma(w - x) \right),$$

where $\phi_\sigma$ denotes the $N(0, \sigma^2I_k)$ density and $m$ denotes Lebesgue measure on $\mathbb{R}^k$. (Any smooth density with rapidly decreasing tails would suffice.) A combination of the two methods of smoothing will give the best bound:
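The metric smoothing is easy to visualize in one dimension. The sketch below is my own (for $A = [0, 1]$ on the real line): it tabulates $g(x) = (1 - d(x, A^\delta)/\delta)^+$, the variant used in Lemma <18>, and checks the indicator sandwich $\{x \in A^\delta\} \le g(x) \le \{x \in A^{2\delta}\}$ on a grid.

```python
# Metric smoothing of the indicator of A = [0, 1] in one dimension.
delta = 0.25

def dist_to_A(x):
    """d(x, A) for A = [0, 1]."""
    return max(0.0, -x, x - 1.0)

def g(x):
    """(1 - d(x, A^delta)/delta)^+ : equal to 1 on A^delta, 0 off A^{2 delta},
    and linear (slope 1/delta) in between."""
    dist_to_blowup = max(0.0, dist_to_A(x) - delta)
    return max(0.0, 1.0 - dist_to_blowup / delta)

for i in range(-100, 201):
    x = i / 100.0
    in_A_delta = dist_to_A(x) <= delta
    in_A_2delta = dist_to_A(x) <= 2 * delta
    # indicator sandwich: {x in A^delta} <= g(x) <= {x in A^{2 delta}}
    assert (1.0 if in_A_delta else 0.0) <= g(x) + 1e-12
    assert g(x) <= (1.0 if in_A_2delta else 0.0) + 1e-12
print("sandwich A^delta <= g <= A^{2 delta} verified on the grid")
```

Convolving this $g$ with a normal density, as in the Lemma below, then supplies the bounded third-order smoothness that the Lindeberg bound <12> needs.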


<18> Lemma. Let $A$ be a Borel subset of $\mathbb{R}^k$, and let $Z$ have a $N(0, I_k)$ distribution. For positive constants $\delta$ and $\sigma$ define

$$g(x) := \left( 1 - \frac{d(x, A^\delta)}{\delta} \right)^+ \quad \text{and} \quad f(x) := \mathbb{P}g(x + \sigma Z) = m^w\left( g(w)\phi_\sigma(w - x) \right).$$

Then $f$ satisfies <13> with $C_f := (\sigma^2\delta)^{-1}$, and approximation <14> holds.

Proof. The function $f$ inherits some smoothness from $g$ and some from the convolving normal density $\phi_\sigma$, which has derivatives

$$\dot\phi_\sigma(x) = -\frac{x}{\sigma^2}\phi_\sigma(x) \quad \text{and} \quad \ddot\phi_\sigma(x) = \frac{xx' - \sigma^2I_k}{\sigma^4}\phi_\sigma(x).$$

For fixed $x$ and $y$, the function $h(t) := f(x + ty)$, for $0 \le t \le 1$, has second derivative

$$\ddot h(t) = \mathbb{P}\left( g(x + ty + \sigma Z)\,\frac{(y'Z)^2 - |y|^2}{\sigma^2} \right).$$

The Lipschitz property $|g(x + ty + \sigma Z) - g(x + \sigma Z)| \le t|y|/\delta$ then implies

$$|\ddot h(t) - \ddot h(0)| \le \frac{t|y|}{\delta}\,\mathbb{P}\frac{|(y'Z)^2 - |y|^2|}{\sigma^2} \le \frac{2t|y|^3}{\sigma^2\delta}.$$

The asserted inequality <13> then follows from a Taylor expansion,

$$|h(1) - h(0) - \dot h(0) - \tfrac{1}{2}\ddot h(0)| = \tfrac{1}{2}|\ddot h(t^*) - \ddot h(0)| \quad \text{where } t^* \in (0,1).$$

For approximation <14>, first note that $\{x \in A^\delta\} \le g(x) \le \{x \in A^{2\delta}\}$ and $0 \le f \le 1$ everywhere. Also $\mathbb{P}\{|Z| > \delta/\sigma\} \le \epsilon$, from Problem [7]. Thus

$$f(x) \ge \mathbb{P}g(x + \sigma Z)\{|\sigma Z| \le \delta\} = \mathbb{P}\{|Z| \le \delta/\sigma\} \ge 1 - \epsilon \quad \text{if } x \in A,$$

because $x \in A$ and $|\sigma Z| \le \delta$ put $x + \sigma Z$ inside $A^\delta$, where $g = 1$; and

$$f(x) \le \mathbb{P}g(x + \sigma Z)\{|\sigma Z| \le \delta\} + \mathbb{P}\{|\sigma Z| > \delta\} \le 0 + \epsilon \quad \text{if } x \notin A^{3\delta},$$

because $x \notin A^{3\delta}$ and $|\sigma Z| \le \delta$ put $x + \sigma Z$ outside $A^{2\delta}$, where $g = 0$. □

5. Quantile coupling of Binomial with normal

As noted in Section 1, if $\eta$ is distributed $N(0, 1)$, with distribution function $\Phi$, and if $q$ denotes the $\mathrm{Bin}(n, 1/2)$ quantile function, then the random variable $X := q(\Phi(\eta))$ has exactly a $\mathrm{Bin}(n, 1/2)$ distribution. In a sense made precise by the following Lemma, $X$ is very close to the random variable $Y := n/2 + \eta\sqrt{n/4}$, which has a $N(n/2, n/4)$ distribution. The coupling of the $\mathrm{Bin}(n, 1/2)$ with its approximating $N(n/2, n/4)$ has been the starting point for a growing collection of striking approximation results, inspired by the publication of the fundamental paper of Komlós, Major & Tusnády (1975).

<19> Tusnády's Lemma. For each positive integer $n$ there exists a deterministic, increasing function $\tau(n, \cdot)$ such that the random variable $X := \tau(n, \eta)$ has a $\mathrm{Bin}(n, 1/2)$ distribution whenever $\eta$ has a $N(0, 1)$ distribution. The random variable $X$ satisfies the inequalities

$$|X - Y| \le 1 + \frac{\eta^2}{8} \quad \text{and} \quad \left| X - \frac{n}{2} \right| \le 1 + \frac{\sqrt{n}}{2}|\eta|,$$

where $Y := \dfrac{n}{2} + \eta\sqrt{\dfrac{n}{4}}$, which has a $N\left( \dfrac{n}{2}, \dfrac{n}{4} \right)$ distribution.

At first glance it is easy to underestimate the delicacy of these two inequalities. Both $X$ and $Y$ have mean $n/2$ and standard deviation of order $\sqrt{n}$. It would be no challenge to construct a coupling for which $|X - Y|$ is of order $\sqrt{n}$; the Lemma gives a coupling for which $|X - Y|$ is bounded by a quantity whose distribution does not even change with $n$.

The original proof (Tusnády 1977) of the Lemma is challenging. Appendix D contains an alternative derivation of similar inequalities. To simplify the argument, I have made no effort to derive the best constants for the bound. In fact, the precise constants appearing in the Lemma will have no importance for us. It will be enough for us to have a universal constant $C_0$ for which there exist couplings such that

<20> $$|X - Y| \le C_0(1 + \eta^2) \quad \text{and} \quad \left| X - \frac{n}{2} \right| \le C_0\sqrt{n}\left( 1 + |\eta| \right),$$

a weaker bound that follows easily from the inequalities in Appendix D.
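The quantile coupling $X := q(\Phi(\eta))$ is easy to compute exactly for moderate $n$. The sketch below is my own: it builds the $\mathrm{Bin}(n, 1/2)$ quantile function from exact binomial probabilities and $\Phi$ from `math.erf`, then compares $X$ with $Y = n/2 + \eta\sqrt{n/4}$ along a grid of $\eta$ values. The check uses the weakened form <20> with the illustrative choice $C_0 = 1$ (an assumption of mine, not the Lemma's constant), which is far from tight on this grid.

```python
import math

def binom_cdf(k, n):
    """Exact Bin(n, 1/2) distribution function at k."""
    total = sum(math.comb(n, j) for j in range(0, k + 1))
    return total / 2.0 ** n

def quantile(u, n):
    """q(u) = smallest k with F(k) >= u, the Bin(n, 1/2) quantile function."""
    for k in range(n + 1):
        if binom_cdf(k, n) >= u:
            return k
    return n

def Phi(x):
    """Standard normal distribution function via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 16
for i in range(-30, 31):
    eta = i / 10.0
    X = quantile(Phi(eta), n)               # exactly Bin(n, 1/2) distributed
    Y = n / 2.0 + eta * math.sqrt(n / 4.0)  # the approximating N(n/2, n/4)
    # The coupling error stays bounded in eta, uniformly in n.
    assert abs(X - Y) <= 1.0 + eta * eta
print("quantile coupling checked for n =", n)
```

Increasing $n$ does not loosen the check: the bound on $|X - Y|$ depends on $\eta$ alone, which is exactly the delicacy the text emphasizes.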

6. Haar coupling—the Hungarian construction

Let $x_1, \ldots, x_n$ be $n$ independent observations from the uniform distribution $P$ on $(0, 1]$. The empirical measure $P_n$ is defined as the discrete distribution that puts mass $1/n$ at each of $x_1, \ldots, x_n$. That is, $P_nf := \sum_{i=1}^n f(x_i)/n$, for each function $f$ on $(0, 1]$. Notice that $nP_nD$ has a $\mathrm{Bin}(n, PD)$ distribution for each Borel set $D$. The standardized measure $\nu_n := \sqrt{n}(P_n - P)$ is called the uniform empirical process. For each square integrable function $f$,

$$\nu_nf = n^{-1/2}\sum_{i=1}^n \left( f(x_i) - Pf \right) \rightsquigarrow N(0, \sigma_f^2) \quad \text{where } \sigma_f^2 := Pf^2 - (Pf)^2.$$

More generally, for each finite set of square integrable functions $f_1, \ldots, f_k$, the random vector $(\nu_nf_1, \ldots, \nu_nf_k)$ has a limiting multivariate normal distribution with zero means and covariances $P(f_if_j) - (Pf_i)(Pf_j)$. These finite dimensional distributions identify a Gaussian process that is closely related to the isonormal process $\{G(f) : f \in L^2(P)\}$ from Section 9.3.

Recall that $G$ is a centered Gaussian process, defined on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\mathrm{cov}(G(f), G(g)) = (f, g) = P(fg)$, the $L^2(P)$ inner product. The Haar basis, $\Psi = \{1\} \cup \{\psi_{i,k} : 0 \le i < 2^k,\ k \in \mathbb{N}_0\}$, for $L^2(P)$ consists of rescaled differences of indicator functions of intervals $J_{i,k} := J(i, k) := (i2^{-k}, (i+1)2^{-k}]$,

$$\psi_{i,k} := 2^{k/2}\left( J_{2i,k+1} - J_{2i+1,k+1} \right) = 2^{k/2}\left( 2J_{2i,k+1} - J_{i,k} \right) \quad \text{for } 0 \le i < 2^k.$$
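The Haar functions are piecewise constant on dyadic intervals, so their $L^2(0,1]$ inner products can be computed exactly on a dyadic grid. The following check of orthonormality is my own illustration:

```python
K = 4                      # dyadic grid of 2^K cells on (0, 1]
CELLS = 2 ** K

def haar(i, k):
    """psi_{i,k} = 2^{k/2} (J(2i,k+1) - J(2i+1,k+1)), tabulated on the grid."""
    scale = 2.0 ** (k / 2.0)
    vals = []
    for c in range(CELLS):
        j = c >> (K - k - 1)   # index of the level-(k+1) interval holding cell c
        vals.append(scale if j == 2 * i else -scale if j == 2 * i + 1 else 0.0)
    return vals

def inner(a, b):
    """L2(0,1] inner product; exact for step functions on the grid."""
    return sum(x * y for x, y in zip(a, b)) / CELLS

basis = [(i, k) for k in range(K) for i in range(2 ** k)]
for i, k in basis:
    assert abs(inner(haar(i, k), haar(i, k)) - 1.0) < 1e-12   # unit norm
    assert abs(inner(haar(i, k), [1.0] * CELLS)) < 1e-12      # P psi_{i,k} = 0
for a in range(len(basis)):
    for b in range(a + 1, len(basis)):
        ia, ka = basis[a]
        ib, kb = basis[b]
        assert abs(inner(haar(ia, ka), haar(ib, kb))) < 1e-12  # orthogonality
print("Haar system orthonormal on the dyadic grid, K =", K)
```

The zero-mean property $P\psi_{i,k} = 0$ checked above is what makes $\nu(\psi_{i,k}) = G(\psi_{i,k})$ in the centering argument below.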

REMARK. For our current purposes, it is better to replace $L^2(P)$ by $\mathcal{L}^2(P)$, the space of square-integrable real functions whose $P$-equivalence classes define $L^2(P)$. It will not matter that each $G(f)$ is defined only up to a $P$-equivalence. We need to work with the individual functions to have $P_nf$ well defined. It need not be true that $P_nf = P_ng$ when $f$ and $g$ differ only on a $P$-negligible set.

Each function in $\mathcal{L}^2(P)$ has a series expansion,

$$f = (f, 1)\,1 + \sum_{k=0}^\infty \sum_{0 \le i < 2^k} (f, \psi_{i,k})\,\psi_{i,k},$$

which converges in the $\mathcal{L}^2(P)$ sense. The random variables $\xi := G(1)$ and $\eta_{i,k} := G(\psi_{i,k})$ are independent, each with a $N(0, 1)$ distribution, and

$$G(f) = (f, 1)\,\xi + \sum_k \sum_i (f, \psi_{i,k})\,\eta_{i,k},$$

with convergence in the $L^2(\mathbb{P})$ sense. If we center each function $f$ to have zero expectation, we obtain a new Gaussian process, $\nu(f) := G(f - Pf) = G(f) - (Pf)G(1)$, indexed by $\mathcal{L}^2(P)$, whose covariances identify it as the limit process for $\nu_n$. Notice that $\nu(\psi_{i,k}) = G(\psi_{i,k}) = \eta_{i,k}$ almost surely, because $P\psi_{i,k} = 0$. Thus we also have a series representation for $\nu$,

<21> $$\nu(f) = \sum_k \sum_i (f, \psi_{i,k})\,\eta_{i,k}.$$

At least in a heuristic sense, we could attempt a similar series expansion of the empirical process,

<22> $$\nu_n(f) \approx \sum_k \sum_i (f, \psi_{i,k})\,\nu_n(\psi_{i,k}).$$

REMARK. Don't worry about the niceties of convergence: when the heuristics are past I will be truncating the series at some finite $k$.

The expansion suggests a way of coupling the processes $\nu_n$ and $\nu$, namely, find a probability space on which $\nu_n(\psi_{i,k}) \approx \nu(\psi_{i,k}) = \eta_{i,k}$ for a large subset of the basis functions. Such a coupling would have several advantages. First, the peculiarities of each function $f$ would be isolated in the behavior of the coefficients $(f, \psi_{i,k})$. Subject to control of those coefficients, we could derive simultaneous couplings for many different $f$'s. Second, because the functions are rescaled differences of indicator functions of intervals, the $\nu_n(\psi_{i,k})$ are rescaled differences of Binomial counts. Tusnády's Lemma offers an excellent means for building Binomials from standard normals. With some rescaling, we can then build versions of the $\nu_n(\psi_{i,k})$ from the $\eta_{i,k}$.

The secret to success is a recursive argument, corresponding to the nesting of the $J_{i,k}$ intervals. Write $\mathrm{node}(i, k)$ for $(i + 1/2)/2^k$, the midpoint of $J_{i,k}$. Regard $\mathrm{node}(2i, k+1)$ and $\mathrm{node}(2i+1, k+1)$ as the children of $\mathrm{node}(i, k)$, corresponding to the decomposition of $J_{i,k}$ into the disjoint union of the two subintervals $J_{2i,k+1}$ and $J_{2i+1,k+1}$. The parent of $\mathrm{node}(i, k)$ is $\mathrm{node}(\lfloor i/2 \rfloor, k-1)$.


For each integer $i$ with $0 \le i < 2^k$ there is a path back through the tree,

$$\mathrm{path}(i, k) := \{(i_0, 0), (i_1, 1), \ldots, (i_k, k)\} \quad \text{where } i_0 = 0 \text{ and } i_k = i,$$

for which $J(i_k, k) \subseteq J(i_{k-1}, k-1) \subseteq \ldots \subseteq J(0, 0) = (0, 1]$. That is, the path traces through all the ancestors (parent, grandparent, ...) back to the root of the tree.

[Figure: the tree of dyadic intervals $J_{i,k}$ down to level 3, with leaves labeled $(0,3), (1,3), \ldots, (7,3)$.]

The recursive argument constructs successively refined approximations to $P_n$ by assigning the numbers of observations $X_{i,k}$ amongst $x_1, x_2, \ldots, x_n$ that land in each interval $J_{i,k}$. Notice that, conditional on $X_{i,k} = N$, the two offspring counts must sum to $N$, with $X_{2i,k+1}$ having a conditional $\mathrm{Bin}(N, 1/2)$ distribution. Via Lemma <19> define

$$X_{0,1} := \tau(n, \eta_{0,0}) =: n - X_{1,1},$$

$$X_{0,2} := \tau(X_{0,1}, \eta_{0,1}) =: X_{0,1} - X_{1,2}, \qquad X_{2,2} := \tau(X_{1,1}, \eta_{1,1}) =: X_{1,1} - X_{3,2},$$

and so on. That is, recursively divide the count $X_{i,k}$ at each $\mathrm{node}(i, k)$ between the two children of the node, using the normal variable $\eta_{i,k}$ to determine the $\mathrm{Bin}(X_{i,k}, 1/2)$ count assigned to the child at $\mathrm{node}(2i, k+1)$. The joint distribution for the $X_{i,k}$ variables is the same as the joint distribution for the empirical counts $nP_nJ_{i,k}$, because we have used the correct conditional distributions.
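The dyadic splitting scheme is straightforward to simulate. In the sketch below (my own; an ordinary $\mathrm{Bin}(N, 1/2)$ draw stands in for the Tusnády function $\tau(N, \eta)$), the count at each node is split between its two children, and the counts at every level sum to $n$, just as the empirical counts $nP_nJ_{i,k}$ do.

```python
import random

random.seed(7)

def split_counts(n, depth):
    """Recursively split n observations over the dyadic intervals J(i, k).
    A Bin(N, 1/2) draw stands in for Tusnady's tau(N, eta): the count at
    node (i, k) is divided between its two children at level k + 1."""
    levels = [[n]]                       # level k holds the 2^k counts X_{i,k}
    for _ in range(depth):
        nxt = []
        for N in levels[-1]:
            left = sum(random.random() < 0.5 for _ in range(N))  # Bin(N, 1/2)
            nxt.extend([left, N - left])
        levels.append(nxt)
    return levels

levels = split_counts(100, 5)
for k, counts in enumerate(levels):
    assert len(counts) == 2 ** k
    assert sum(counts) == 100           # every level refines the same n points
print("level 5 counts:", levels[5])
```

The Hungarian construction replaces the independent `random` draws here by $\tau(N, \eta_{i,k})$, so that each Binomial split is tied to the corresponding normal variable of the limit process.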

If we continued the process forever then, at least conceptually, we would identify the locations of the $n$ observations, without labelling. Each point would be determined by a nested sequence of intervals. To avoid difficulties related to pointwise convergence of the Haar expansion, we need to stop at some finite level, say the $m$th, after which we could independently distribute the $X_{i,m}$ observations (if any) within $J_{i,m}$.

The recursive construction works well because Tusnády's Lemma, even in its weakened form <20>, provides us with a quadratic bound in the normal variables for the difference between $\nu_n(\psi_{i,k})$ and the corresponding $\eta_{i,k}$.

<23> Lemma. There exists a universal constant $C$ such that, for each $k$ and $0 \le i^* < 2^k$,

$$\sqrt{n}\,\left| \nu_n\psi_{i^*,k} - \eta_{i^*,k} \right| \le C\,2^{k/2}\left( 1 + \sum_{j=0}^k \eta_{i_j,j}^2 \right),$$

where $\{(i_j, j) : j = 0, 1, \ldots, k\}$ is the path from the root down to $\mathrm{node}(i^*, k)$.

Proof. Abbreviate J(i_j, j) to J_j, and η_{i_j,j} to η_j, and so on, for j = 0, 1, ..., k. Notice that the random variable P_nJ_j has expected value PJ_j = 2^{−j}, and a small variance, so we might hope that all of the random variables Δ_j := 2^j P_nJ_j should be close to 1. Of course Δ_0 = 1.


Consider the effect of the split of J_j into its two subintervals, J' := J(2i_j, j+1) and J'' := J(2i_j + 1, j+1). Write N for nP_nJ_j and X for nP_nJ', so that Δ_j = 2^j N/n and Δ' := 2^{j+1}P_nJ' = 2^{j+1}X/n and Δ'' := 2^{j+1}P_nJ'' = 2Δ_j − Δ'. From inequality <20>, we have X = N/2 + √N η_j/2 + R, where

<24>    |R| ≤ C_0 (1 + η_j^2)   and   |X − N/2| ≤ C_0 √N (1 + |η_j|).

By construction,

Δ' = Δ_j (2X/N) = Δ_j (N + √N η_j + 2R)/N = Δ_j (1 + η_j/√N + 2R/N),

and hence

ν_n(ψ_j) = √n 2^{j/2} P_n(2J' − J_j) = 2^{j/2}(2X − N)/√n = √Δ_j η_j + 2^{1+j/2} R/√n.

From the first inequality in <24>,

<25>    |ν_n(ψ_j) − η_j| ≤ |η_j| |√Δ_j − 1| + 2C_0 2^{j/2}(1 + η_j^2)/√n.

From the second inequality in <24>,

|Δ'' − Δ_j| = |Δ' − Δ_j| = 2^{j+1}|X − N/2|/n ≤ 2C_0 √Δ_j 2^{j/2}(1 + |η_j|)/√n.

Invoke the inequality |√a − √b| ≤ |a − b|/√b, for positive a and b, to deduce that

|√Δ_{j+1} − √Δ_j| ≤ max(|√Δ' − √Δ_j|, |√Δ'' − √Δ_j|) ≤ 2C_0 2^{j/2}(1 + |η_j|)/√n.

From <25> with j = k, and the inequality from the previous line (which telescopes, since √Δ_0 = 1), deduce that

|ν_n(ψ_k) − η_k| ≤ 2C_0 2^{k/2}(1 + η_k^2)/√n + |η_k| Σ_{j=0}^{k−1} |√Δ_{j+1} − √Δ_j|

≤ 2C_0 2^{k/2}(1 + η_k^2)/√n + 2C_0 Σ_{j=0}^{k−1} 2^{j/2}(|η_k| + |η_k η_j|)/√n.

Bound |η_k| + |η_k η_j| by 1 + η_k^2 + ½η_j^2 + ½η_k^2, then collect terms involving the η's to complete the proof.

7. The Komlós-Major-Tusnády coupling

The coupling method suggested by expansions <21> and <22> works particularly well when restricted to the set of indicator functions of intervals, f_t(x) = {0 < x ≤ t}, for 0 ≤ t ≤ 1. For that case, the limit process {ν(0, t] : 0 ≤ t ≤ 1}, which can be chosen to have continuous sample paths, is called the Brownian Bridge, or tied-down Brownian motion, often written as {B°(t) : 0 ≤ t ≤ 1}.

<26> Theorem. (KMT coupling) There exists a Brownian Bridge {B°(t) : 0 ≤ t ≤ 1} with continuous sample paths, and a uniform empirical process ν_n, for which

P{ sup_{0≤t≤1} |ν_n(0, t] − B°(t)| ≥ C_1 (x + log n)/√n } ≤ C_0 exp(−x)   for all x ≥ 0,

with constants C_1 and C_0 that depend on neither n nor x.


REMARK. Notice that the exponent on the right-hand side is somewhat arbitrary; we could change it to any other positive multiple of −x by changing the constant C_1 on the left-hand side. By the same reasoning, it would suffice to get a bound like C_2 exp(−c_2 x) + C_3 exp(−c_3 x) + C_4 exp(−c_4 x) for various positive constants C_i and c_i, for then we could recover the cleaner looking version by adjusting C_1 and C_0. In my opinion, the exact constants are unimportant; the form of the inequality is what counts. Similarly, it would suffice to consider only values of x bounded away from zero, such as x ≥ c_0, because the asserted inequality is trivial for x ≤ c_0 if C_0 ≥ e^{c_0}.

It is easier to adjust constants at the end of an argument, to get a clean-looking inequality. When reading proofs in the literature, I sometimes find it frustrating to struggle with a collection of exquisitely defined constants at the start of a proof, eventually to discover that the author has merely been aiming for a tidy final bound.

Proof. We will build ν_n from B°, allocating counts down to intervals of length 2^{−m}, as described in Section 6. It will then remain only to control the behavior of both processes over small intervals. Let T(m) denote the set of grid points {i/2^m : i = 0, 1, ..., 2^m} in [0, 1]. For each t in T(m), both series <21> and <22> terminate after k = m, because [0, t] is orthogonal to each ψ_{i,k} for k > m. That is, using the Hungarian construction we can determine P_nJ_{i,m} for each i, and then calculate

ν_n(0, t] = Σ_{k=0}^{m} Σ_i ν_n(ψ_{i,k}) (f_t, ψ_{i,k})   for t in T(m),

which we need to show is close to

B°(t) := ν(0, t] = Σ_{k=0}^{m} Σ_i η_{i,k} (f_t, ψ_{i,k})   for t in T(m).

Notice that B°(0) = B°(1) = 0 = ν_n(0, 0] = ν_n(0, 1]. We need only consider t in T(m)\{0, 1}. For each k, at most one coefficient (f_t, ψ_{i,k}) is nonzero, corresponding to the interval for which t ∈ J_{i,k}, and it is bounded in absolute value by 2^{−k/2}. The corresponding nodes determine a path (0, 0), ..., (i_j, j), ..., (i_m, m) down to the mth level. The difference between the processes at t is controlled by the quadratic function,

S_m(t) := Σ_{j=0}^{m} η_{i_j,j}^2   where t ∈ J(i_j, j) for each j,

of the normal variables at the nodes of this path:

|ν_n(0, t] − B°(t)| ≤ Σ_{k=0}^{m} 2^{−k/2} |ν_n(ψ_{i_k,k}) − η_{i_k,k}|

≤ Σ_{0≤j≤k≤m} C 2^{(j−k)/2} (1 + η_{i_j,j}^2)/√n   by Lemma <23>

≤ 4C Σ_{j=0}^{m} (1 + η_{i_j,j}^2)/√n   summing the geometric series

<27>    = 4C (1 + m + S_m(t))/√n.

As t ranges over T(m), or even over the whole of (0, 1), the path defining S_m(t) ranges over the set of all 2^m paths from the root down to the mth level. We bound the maximum difference between the two processes if we bound the maximum of S_m(t). The maximum grows roughly linearly with m, the same rate as the contribution from a single t. More precisely,

<28>    P{max_t S_m(t) > 5m + x} ≤ 2 exp(−x/4)   for each x ≥ 0.


I postpone the proof of this result, in order not to break the flow of the main argument.

REMARK. The constants 5 and 4 are not magical. They could be replaced by any other pair of constants c_1, c_2 for which P exp((N(0,1)^2 − c_1)/c_2) ≤ 1/2.

From inequalities <27> and <28> we have

<29>  P{ max_{t∈T(m)} |ν_n(0, t] − B°(t)| > 4C(1 + 6m + x)/√n } ≤ P{max_t S_m(t) > x + 5m} ≤ 2 exp(−x/4).

Provided we choose m smaller than a constant multiple of x + log n, this term will cause us no trouble.

We now have the easy task of extrapolating from the grid T(m) to the whole of (0, 1). We can make 2^{−m} exceedingly small by choosing m close to a large enough multiple of x + log n. In fact, when x ≥ 2, the choice of m such that

<30>    2n^2 e^x ≥ 2^m ≥ n^2 e^x

will suffice. As an exercise, you might want to play around with other m and the various constants to get a neater statement for the Theorem.

We can afford to work with very crude estimates. For each s in (0, 1) write t_s for the point of T(m) for which t_s < s ≤ t_s + 2^{−m}. Notice that

|ν_n(0, s] − ν_n(0, t_s]| ≤ (# points in (t_s, s])/√n + √n 2^{−m}.

The supremum over s is larger than 3/√n only when at least one J_{i,m} interval, for 0 ≤ i < 2^m, contains 2 or more observations, an event with probability less than

2^m (n choose 2) (2^{−m})^2 ≤ n^2 2^{−m} ≤ e^{−x}   for m as in <30>.

Similarly,

sup_s |B°(s) − B°(t_s)| ≤ sup_s |G[0, s] − G[0, t_s]| + sup_s |(s − t_s) G[0, 1]|

≤ max_{0≤i<2^m} sup_{s∈J_{i,m}} |G[0, s] − G[0, i/2^m]| + 2^{−m} |G[0, 1]|,

from which it follows that

P{ sup_s |B°(s) − B°(t_s)| > x/√n } ≤ 2^m P{ sup_{0≤t≤2^{−m}} |B(t)| > x/(2√n) } + P{ 2^{−m}|G[0, 1]| > x/(2√n) },

where B is a Brownian motion. The second term on the right-hand side is less than exp(−4^m x^2/(8n)). By the reflection principle for Brownian motion (Section 9.5), the first term is at most

2^{m+1} P{ |B(2^{−m})| > x/(2√n) } = 2^{m+1} P{ |N(0, 1)| > 2^{m/2} x/(2√n) } ≤ 2^{m+1} exp(−2^m x^2/(8n)).

For x ≥ 2 and m as in <30>, the sum of the two contributions from the Brownian Bridge is much smaller than e^{−x}.

From <29>, and the inequality

|ν_n(0, s] − B°(s)| ≤ |ν_n(0, s] − ν_n(0, t_s]| + |ν_n(0, t_s] − B°(t_s)| + |B°(t_s) − B°(s)|,

together with the bounds from the previous paragraph, you should be able to complete the argument.


Proof of inequality <28>. Write R_m for max_t S_m(t). Think of the binary tree of depth m as two binary trees of depth m−1 rooted at node(0,1) and node(1,1), to see that R_m has the same distribution as η_{0,0}^2 + max(T, T'), where T and T' both have the same distribution as R_{m−1}, and η_{0,0}, T, and T' are independent. Write D_k for P exp((R_k − 5k)/4). Notice that

e^{−5/4} D_0 = P exp(¼η_{0,0}^2 − 5/4) = √2 exp(−5/4) < 1/2.

For m ≥ 1, independence lets us bound D_m by

P exp(¼η_{0,0}^2 − 5/4) P exp(¼ max(T − 5(m−1), T' − 5(m−1)))

≤ ½ (P exp(¼(T − 5(m−1))) + P exp(¼(T' − 5(m−1)))) = D_{m−1}.

By induction, P exp((R_m − 5m)/4) = D_m ≤ D_0 = √2. Thus

P{R_m > 5m + x} ≤ P exp((R_m − 5m)/4) exp(−x/4) ≤ √2 exp(−x/4),

as asserted.
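The recursion behind <28> is easy to probe by simulation. The sketch below is an illustration of mine, not part of the text: it computes R_m = max_t S_m(t) by dynamic programming down a tree of independent standard normals, and compares the empirical tail frequency with the bound 2 exp(−x/4).

```python
import numpy as np

def max_path_sum(m, rng):
    """R_m = max over the 2**m root-to-leaf paths of sum_j eta_{i_j,j}**2,
    with independent standard normals eta at the nodes of a depth-m tree."""
    best = rng.standard_normal(1) ** 2        # level 0: the root alone
    for k in range(1, m + 1):
        eta2 = rng.standard_normal(2 ** k) ** 2
        best = eta2 + np.repeat(best, 2)      # each node extends its parent's best path
    return best.max()

rng = np.random.default_rng(1)
m, x = 8, 4.0
samples = np.array([max_path_sum(m, rng) for _ in range(2000)])
freq = (samples > 5 * m + x).mean()
print(freq, 2 * np.exp(-x / 4))  # the bound 2*exp(-1) ≈ 0.736; the frequency sits well below it
```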

By means of the quantile transformation, Theorem <26> extends immediately to a bound for the empirical distribution function F_n generated from a sample ξ_1, ..., ξ_n from a probability measure on the real line with distribution function F. Again writing q_F for the quantile function, and recalling that we can generate the sample as ξ_i = q_F(x_i), we have

{ξ_i ≤ t} = {x_i ≤ F(t)}   for each t,

which implies √n(F_n(t) − F(t)) = ν_n(0, F(t)]. Notice that F(t) ranges over a subset of [0,1] as t ranges over the real line; and when F has no discontinuities, the range covers all of (0,1). Theorem <26> therefore implies

<31>    P{ sup_t |√n(F_n(t) − F(t)) − B°(F(t))| ≥ C_1 (x + log n)/√n } ≤ C_0 e^{−x}   for x ≥ 0.

Put another way, we have an almost sure representation F_n(t) = F(t) + n^{−1/2}B°(F(t)) + R_n(t), where, for example, sup_t |R_n(t)| = O_p(n^{−1} log n).

REMARK. From a given Brownian Bridge B° and a given n we have constructed a sample x_1, ..., x_n from the uniform distribution. From the same B°, we could also generate a sample x'_1, ..., x'_n, x'_{n+1} of size n + 1. However, it is not true that x_i = x'_i for i ≤ n; it is not true that x_1, ..., x_n, x'_{n+1} are mutually independent. If we wished to have the samples relate properly to each other we would have to change the Brownian Bridge with n. There is a version of KMT called the Kiefer coupling, which gets the correct joint distributions between the samples at the cost of a weaker error bound. See Csörgő & Révész (1981, Chapter 4) for further explanation.

Inequality <31> lets us deduce results about the empirical distribution function F_n from analogous results about the Brownian Bridge. For example, it implies sup_t √n|F_n(t) − F(t)| ⇝ sup_t |B°(F(t))|. If F has no discontinuities, the limit distribution is the same as that of sup_s |B°(s)|. That is, we have an instant derivation of the Kolmogorov-Smirnov theorem. The Csörgő & Révész book describes other consequences that make much better use of all the hard work that went into establishing the KMT inequality.
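As a numerical illustration of that Kolmogorov-Smirnov consequence (my own sketch, not from the text): the distribution of sup_t √n|F_n(t) − F(t)| for a uniform sample can be compared with the classical series for P{sup_s |B°(s)| ≤ y}.

```python
import numpy as np

def ks_stat(n, rng):
    """sup_t sqrt(n)|F_n(t) - t| for a uniform(0,1) sample of size n."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # the supremum is attained at a jump of F_n, approached from above or below
    return np.sqrt(n) * np.maximum(i / n - x, x - (i - 1) / n).max()

def kolmogorov_cdf(y, terms=100):
    """P{sup_s |B°(s)| <= y} = 1 - 2 sum_{k>=1} (-1)**(k-1) exp(-2 k**2 y**2)."""
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1.0) ** (k - 1) * np.exp(-2 * k ** 2 * y ** 2))

rng = np.random.default_rng(2)
stats = np.array([ks_stat(500, rng) for _ in range(2000)])
# the empirical P{KS <= 1} should be close to the limit P{sup|B°| <= 1} ≈ 0.73
print((stats <= 1.0).mean(), kolmogorov_cdf(1.0))
```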


8. Problems

[1] Suppose F and F_n, for n ∈ N, are distribution functions on the real line for which F_n(x) → F(x) for each x in a dense subset D of the real line. Show that the corresponding quantile functions Q_n converge pointwise to Q at all except (at worst) a countable subset of points in (0, 1). Hint: Prove convergence at each continuity point u_0 of Q. Given points x', x'' in D with x' < x_0 = Q(u_0) < x'', find δ > 0 such that x' < Q(u_0 − δ) and Q(u_0 + δ) < x''. Deduce that

F_n(x') < F(x') + δ ≤ u_0 ≤ F(x'') − δ < F_n(x'')   eventually,

in which case x' ≤ Q_n(u_0) ≤ x''.

[2] Let P and Q be two probability measures defined on the same sigma-field A of a set X. The total variation distance v = v(P, Q) is defined as sup_{A∈A} |PA − QA|.

(i) Suppose X and Y are random elements of X, defined on the same probability space (Ω, F, ℙ), with distributions P and Q. Show that ℙ*{X ≠ Y} ≥ v(P, Q). Hint: Choose a measurable set D ⊇ {X ≠ Y} with ℙD = ℙ*{X ≠ Y}. Note that ℙ{X ∈ A} − ℙ{Y ∈ A} = ℙ({X ∈ A} ∩ D) − ℙ({Y ∈ A} ∩ D).

(ii) Suppose the diagonal Δ := {(x, y) ∈ X × X : x = y} is product measurable. Recall from Section 3.3 that v = 1 − (P ∧ Q)(X) = (P − Q)^+(X) = (Q − P)^+(X). Define a probability measure ℙ := v^{−1}(P − Q)^+ ⊗ (Q − P)^+ + λ, where λ is the image of P ∧ Q under the map x ↦ (x, x). Let X and Y be the coordinate maps. Show that X has distribution P and Y has distribution Q, and ℙ{X ≠ Y} = v.
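For finite distributions, the coupling of part (ii) is short to code. The sketch below is mine, not from the text: each draw is either a diagonal pair carried by P ∧ Q, or independent draws from the normalized defects, so that P{X ≠ Y} equals the total variation distance.

```python
import numpy as np

def maximal_coupling(p, q, rng, size=1):
    """Sample pairs (X, Y) with marginals p and q (vectors of cell
    probabilities) and P{X != Y} equal to the total variation distance,
    following the recipe of part (ii)."""
    v = 0.5 * np.abs(p - q).sum()             # total variation distance
    common = np.minimum(p, q)                 # the measure P ∧ Q, total mass 1 - v
    xs, ys = [], []
    for _ in range(size):
        if rng.uniform() < 1 - v:             # diagonal part: X = Y ~ (P∧Q)/(1-v)
            x = y = rng.choice(len(p), p=common / (1 - v))
        else:                                 # defects: X ~ (P-Q)+/v, Y ~ (Q-P)+/v
            x = rng.choice(len(p), p=np.clip(p - q, 0, None) / v)
            y = rng.choice(len(q), p=np.clip(q - p, 0, None) / v)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(3)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.2, 0.4])
X, Y = maximal_coupling(p, q, rng, size=20000)
print((X != Y).mean())  # ≈ v(P, Q) = 0.2
```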

[3] Show that the Prohorov distance is a metric. Hint: For the triangle inequality, use the inclusion (B^ε)^{ε'} ⊆ B^{ε+ε'}. For symmetry, consider ρ(P, Q) < δ < ε. Put D := (B^δ)^c. Prove that D^δ ⊆ B^c, then deduce that 1 − PB^δ ≤ QD^δ + δ ≤ 1 − QB + δ.

[4] Let P be a Borel probability measure concentrated on the closure of a countable subset S = {x_i : i ∈ N} of a metric space X. For fixed ε > 0, follow these steps to show that there exists a partition of X into finitely many P-continuity sets C_0, C_1, ..., C_m such that PC_0 < ε and diameter(C_i) < ε for i ≥ 1.

(i) For each x in X, show that there are at most countably many closed balls B centered at x with P(∂B) > 0.

(ii) For each x_i in S, find a ball B_i centered at x_i with radius between ε/4 and ε/2 and P(∂B_i) = 0.

(iii) Show that ∪_{i∈N} B_i contains the closure of S. Hint: Each point of the closure lies within ε/4 of at least one x_i.

(iv) Show that P(∪_{i≤m} B_i) > 1 − ε when m is large enough.

(v) Show that the sets C_i := B_i \ ∪_{j<i} B_j and C_0 := (∪_{i≤m} B_i)^c have the desired properties.

[5] (The Marriage Lemma) Suppose S is a finite set of princesses. Suppose each princess, σ, has a list, K(σ), of frogs desirable for marriage. For each collection A ⊆ S, the combined list of frogs equals K(A) = ∪{K(σ) : σ ∈ A}. If


each princess is to find a frog on her list to marry, then clearly the "Desirable Frog Condition" (DFC), #K(A) ≥ #A for each A ⊆ S, must be satisfied. Show that the DFC is also sufficient for happy princesses: under the DFC there exists a one-to-one map π from S into K(S) such that π(σ) ∈ K(σ) for every σ in S. Hint: Translate the following mathematical fairy tale into an inductive argument.

(i) Once upon a time there was a princess σ_0 who proposed to marry a frog τ_0 from her list. That would have left a collection S\{σ_0} of princesses with lists K(σ)\{τ_0} to choose from. If the analog of the DFC had held for those lists, an induction hypothesis would have made everyone happy.

(ii) Unfortunately, a collection A_0 ⊆ S\{σ_0} of princesses protested, on the grounds that #K(A_0)\{τ_0} < #A_0; clearly not enough frogs to go around. They pointed out that the DFC held with equality for A_0, and that their happiness could be assured only if they had exclusive access to the frogs in K(A_0).

(iii) Everyone agreed with the assertion of A_0. They got their exclusive access, and, by induction, lived happily ever after.

(iv) The other princesses then got worried. Each collection B in S\A_0 asked, "Is #K(B)\K(A_0) ≥ #B?" They were reassured, "Don't worry. Originally #K(B ∪ A_0) ≥ #B + #A_0, and we all know that #K(A_0) = #A_0, so of course

#K(B)\K(A_0) = #K(B ∪ A_0) − #K(A_0) ≥ #B.

You too can live happily ever after, by induction." And they did.
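The fairy tale is the augmenting-path argument for bipartite matching. A minimal implementation of mine (not from the text): when a princess finds all her frogs taken, the current holders try, recursively, to switch to another frog, which is exactly the reshuffling the story describes.

```python
def marriage(lists):
    """Return a one-to-one map pi with pi[s] in lists[s] for each princess s,
    or None if Hall's Desirable Frog Condition fails."""
    match = {}                                # frog -> princess currently holding it

    def try_assign(s, seen):
        for frog in lists[s]:
            if frog not in seen:
                seen.add(frog)
                # take frog if it is free, or if its holder can switch frogs
                if frog not in match or try_assign(match[frog], seen):
                    match[frog] = s
                    return True
        return False

    for s in range(len(lists)):
        if not try_assign(s, set()):
            return None                       # some collection A violates the DFC
    return {s: f for f, s in match.items()}

pi = marriage([{'a', 'b'}, {'a'}, {'b', 'c'}])
print(pi)  # the unique matching {0: 'b', 1: 'a', 2: 'c'} (printed key order may vary)
```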

[6] Prove Lemma <9> by carrying out the following steps. Write R_A for ∪_{α∈A} R_α. Argue by induction on the size of S. With no loss of generality, suppose S = {1, 2, ..., m}. Check the case m = 1. Work from the inductive hypothesis that the result is true for #S < m.

(i) Suppose there exists a proper subset A_0 of S for which νA_0 = μR_{A_0}. Define R'_α := R_α\R_{A_0} for α ∉ A_0. Show that νA ≤ μR'_A for all A ⊆ S\A_0. Construct K by invoking the inductive hypothesis separately for A_0 and S\A_0. (Compare with part (iv) of Problem [5].)

Now suppose νA < μR_A for all proper subsets A of S. Write L_α for the probability distribution μ(· | R_α), which concentrates on R_α.

(ii) Show that μ ≥ ν{1}L_1. Hint: Show μB ≥ ν{1}μ(BR_1)/μR_1 for all B ⊆ R_1.

Write ε_1 for the unit mass at 1. Let θ_0 be the largest value in [0, ν{1}] for which (μ − θ_0L_1)R_A ≥ (ν − θ_0ε_1)A for every A ⊆ S.

(iii) If θ_0 = ν{1}, use the inductive hypothesis to find a probability kernel from S\{1} into T for which μ − ν{1}L_1 ≥ Σ_{α≥2} ν{α}K_α. Define K_1 := L_1.

(iv) If θ_0 < ν{1}, show that there exists an A_0 ⊆ S for which (μ − θL_1)R_{A_0} < (ν − θε_1)A_0 when ν{1} ≥ θ > θ_0. Deduce that A_0 must be a proper subset of S for which (μ − θ_0L_1)R_{A_0} = (ν − θ_0ε_1)A_0. Invoke part (i) to find a probability kernel M for which μ − θ_0L_1 ≥ (ν{1} − θ_0)M_1 + Σ_{α≥2} ν{α}M_α. Define K_1 := (θ_0/ν{1})L_1 + (1 − θ_0/ν{1})M_1.


[7] Establish the bound P{|N(0, I_k)| > √(kx)} ≤ (xe^{1−x})^{k/2}, for x ≥ 1, as needed (with √(kx) = δ/σ) for the proof of Lemma <18>. Hint: Show that

P{|N(0, I_k)|^2 > kx} ≤ exp(−tkx)(1 − 2t)^{−k/2}   for 0 < t < 1/2,

which is minimized at t = ½(1 − x^{−1}).
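A quick Monte Carlo sanity check of this chi-square tail bound, for one choice of k and x (my own sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
k, x = 5, 2.0
# |N(0, I_k)|^2 is a sum of k squared independent standard normals
norms_sq = rng.standard_normal((200000, k)) ** 2
tail = (norms_sq.sum(axis=1) > k * x).mean()
bound = (x * np.exp(1 - x)) ** (k / 2)
print(tail, bound)  # the Monte Carlo frequency should sit below the bound ≈ 0.46
```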

[8] Let F_m and G_n be empirical distribution functions, constructed from independent samples (of sizes m and n) from the same distribution function F on the real line. Show that

√(mn/(m + n)) sup_t |F_m(t) − G_n(t)| ⇝ sup_t |B°(F(t))|   as min(m, n) → ∞.

Hint: Use <31>. Show that αB_1°(s) + βB_2°(s) is a Brownian Bridge if α^2 + β^2 = 1 and B_1°, B_2° are independent Brownian Bridges.

9. Notes

In increasing degrees of generality, representations as in Theorem <4> are due to Skorohod (1956), Dudley (1968), Wichura (1970), and Dudley (1985).

Prohorov (1956) defined his metric for probability measures on complete, separable metric spaces. Theorem <8> is due to Strassen (1965). I adapted the proof from Dudley (1976, Section 18), who used the Marriage Lemma (Problem [5]) to prove existence of the desired coupling in a special discrete case. Lemma <9> is a continuous analog of the Marriage Lemma, slightly extending the method of Pollard (1984, Lemma IV.24).

The discussion in Section 4 is adapted from an exposition of Yurinskii (1977)'s method by Le Cam (1988). I think the slightly weaker bound stated by Yurinskii may be the result of his choosing a slightly different tail bound for |N(0, I_k)|, with a correspondingly different choice for the smoothing parameter.

The idea for Example <11> comes from the construction used by Dudley & Philipp (1983) to build strong approximations for sums of independent random processes taking values in a Banach space. Massart (1989) refined the coupling technique, as applied to empirical processes, using a Hungarian coupling in place of the Yurinskii coupling.

The proof of the KMT approximation in the original paper (Komlós et al. 1975) was based on the analog of the first inequality in <20>, for |X − n/2| smaller than a tiny multiple of n. The proof of the elegant refinement in Lemma <19> appeared in a 1977 dissertation of Tusnády, in Hungarian. I have seen an annotated extract from the dissertation (courtesy of Sándor Csörgő). Csörgő & Révész (1981, page 133) remarked that Tusnády's proof is "elementary" but not "simple". I agree. Bretagnolle & Massart (1989, Appendix) published another proof, an exquisitely delicate exercise in elementary calculus and careful handling of Stirling's approximation. The method used in Appendix D resulted from a collaboration between Andrew Carter and me.

Lemma <23> repackages a construction from Komlós et al. (1975) that has been refined by several authors, most notably Bretagnolle & Massart (1989), Massart (1989), and Koltchinskii (1994).


REFERENCES

Bretagnolle, J. & Massart, P. (1989), 'Hungarian constructions from the nonasymptotic viewpoint', Annals of Probability 17, 239-256.

Csörgő, M. & Révész, P. (1981), Strong Approximations in Probability and Statistics, Academic Press, New York.

Dudley, R. M. (1968), 'Distances of probability measures and random variables', Annals of Mathematical Statistics 39, 1563-1572.

Dudley, R. M. (1976), 'Convergence of laws on metric spaces, with a view to statistical testing'. Lecture Note Series No. 45, Matematisk Institut, Aarhus University.

Dudley, R. M. (1985), 'An extended Wichura theorem, definitions of Donsker classes, and weighted empirical distributions', Springer Lecture Notes in Mathematics 1153, 141-178. Springer, New York.

Dudley, R. M. & Philipp, W. (1983), 'Invariance principles for sums of Banach space valued random elements and empirical processes', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 62, 509-552.

Kim, J. & Pollard, D. (1990), 'Cube root asymptotics', Annals of Statistics 18, 191-219.

Koltchinskii, V. I. (1994), 'Komlós-Major-Tusnády approximation for the general empirical process and Haar expansion of classes of functions', Journal of Theoretical Probability 7, 73-118.

Komlós, J., Major, P. & Tusnády, G. (1975), 'An approximation of partial sums of independent rv's, and the sample df. I', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 32, 111-131.

Le Cam, L. (1988), On the Prohorov distance between the empirical process and the associated Gaussian bridge, Technical report No. 170, Department of Statistics, U.C. Berkeley.

Major, P. (1976), 'The approximation of partial sums of independent rv's', Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 213-220.

Massart, P. (1989), 'Strong approximation for multivariate empirical and related processes, via KMT constructions', Annals of Probability 17, 266-291.

Pollard, D. (1984), Convergence of Stochastic Processes, Springer, New York.

Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2 of NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.

Prohorov, Yu. V. (1956), 'Convergence of random processes and limit theorems in probability theory', Theory of Probability and Its Applications 1, 157-214.

Skorohod, A. V. (1956), 'Limit theorems for stochastic processes', Theory of Probability and Its Applications 1, 261-290.

Strassen, V. (1965), 'The existence of probability measures with given marginals', Annals of Mathematical Statistics 36, 423-439.

Tusnády, G. (1977), A Study of Statistical Hypotheses, PhD thesis, Hungarian Academy of Sciences, Budapest. In Hungarian.


Wichura, M. J. (1970), 'On the construction of almost uniformly convergent random variables with given weakly convergent image laws', Annals of Mathematical Statistics 41, 284-291.

Yurinskii, V. V. (1977), 'On the error of the Gaussian approximation for convolutions', Theory of Probability and Its Applications 22, 236-247.


Chapter 11

Exponential tails and the law of the iterated logarithm

SECTION 1 introduces the law of the iterated logarithm (LIL) through the technically simplest case: independent standard normal summands.

SECTION 2 extends the results from Section 1 to sums of independent bounded random variables, by means of Bennett's exponential inequality. It is noted that the bounds on the variables could increase slowly without destroying the limit assertion, thereby pointing to the easy (upper) half of Kolmogorov's definitive LIL.

SECTION *3 derives the very delicate exponential lower bound for bounded summands, needed to prove the companion lower half for Kolmogorov's LIL.

SECTION *4 shows how truncation arguments extend Kolmogorov's LIL to the case of independent, identically distributed summands with finite second moments.

1. LIL for normal summands

Two important ideas run in tandem through this Chapter: the existence of exponential tail bounds for sums of independent random variables, and proofs of the law of the iterated logarithm (LIL) in various contexts. You could read the Chapter as either a study of exponential inequalities, with the LIL as a guiding application, or as a study of the LIL, with the exponential inequalities as the main technical tool.

The LIL's will all refer to partial sums S_n := X_1 + ... + X_n for sequences of independent random variables {X_i} with PX_i = 0 and var(X_i) := σ_i^2 < ∞, for each i. The words iterated logarithm refer to the role played by the function L(x) := √(2x log log x). To avoid minor inconveniences (such as having to exclude cases involving logarithms or square roots of negative numbers), I arbitrarily define L(x) as 1 for x ≤ e^e ≈ 15.15. Under various assumptions, we will be able to prove, with V_n := var(S_n), that

<1>    limsup_{n→∞} S_n/L(V_n) = 1   almost surely,

together with analogous assertions about the liminf and the almost sure behavior of the sequence {S_n/L(V_n)}. Equality <1> breaks naturally into a pair of assertions,

<2>    limsup_{n→∞} S_n/L(V_n) ≤ 1   and   limsup_{n→∞} S_n/L(V_n) ≥ 1   a.s.,


inequalities that I will refer to as the upper and lower halves of the LIL, or upper and lower LIL's, for short. In general, it will be easier to establish the upper half, because the exponential inequalities required for that case are easier to prove.

As you will see, several of the techniques used for proving LIL's are refinements of techniques used in Chapter 4 (appeals to the Borel-Cantelli lemma, truncation of summands, bounding of whole blocks of terms by means of maximal inequalities) for proving strong laws of large numbers (SLLN). Indeed, the LIL is sometimes described as providing a rate of convergence for the SLLN.

The theory is easiest to understand when specialized to the normal distribution, for which the following result holds.

<3> Theorem. For the partial sums {S_n} of a sequence of independent N(0,1) random variables, with L_n := L(n),

(i) limsup_{n→∞} S_n/L_n = 1 a.s.

(ii) liminf_{n→∞} S_n/L_n = −1 a.s.

(iii) S_n/L_n ∈ J infinitely often, a.s., for every open subinterval J of [−1, 1].
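A simulation (mine, not part of the text) gives some feeling for the theorem: along a single normal random walk, the running maximum of S_n/√(2n log log n) hovers near, and only rarely above, 1.

```python
import numpy as np

def lil_path_max(N, rng):
    """max over 16 < n <= N of S_n / sqrt(2 n loglog n) along one path of
    standard normal partial sums; the LIL says this stabilizes near 1."""
    s = np.cumsum(rng.standard_normal(N))
    n = np.arange(1, N + 1)
    good = n > 16                             # keep loglog(n) comfortably positive
    L = np.sqrt(2 * n[good] * np.log(np.log(n[good])))
    return (s[good] / L).max()

val = lil_path_max(200_000, np.random.default_rng(5))
print(val)  # typically a value not far from 1
```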

Proof. The key requirement for the proof of the upper LIL is an exponential tail bound, such as (see Appendix D for the proof)

<4>    P{S_n > x√n} ≤ ½ exp(−x^2/2)   for x ≥ 0.

If we take x_n := γ√(2 log n) for some fixed γ > 1, then

P{S_n > x_n √n} ≤ ½ exp(−γ^2 log n) = ½ n^{−γ^2},

which is summable, and so, by the Borel-Cantelli lemma,

limsup_{n→∞} S_n/√(2n log n) ≤ γ   almost surely, for fixed γ > 1.

Cast out a sequence of negligible sets, for a sequence of γ values decreasing to 1, to deduce that

<5>    limsup_{n→∞} S_n/√(2n log n) ≤ 1   almost surely.

This result is not quite what we need for the upper LIL. Somehow we must replace the √(2n log n) factor by √(2n log log n), without disturbing the almost sure bound.

As with the proof of the SLLN in Section 4.6, the improvement is achieved by collecting the S_n into blocks, then applying Borel-Cantelli to a bound for a maximum over a block. To handle the contributions from within each block we need a maximal inequality, such as the following one-sided analog of the bound from Section 4.6.

<6> Maximal Inequality. Let ξ_1, ..., ξ_N be independent random variables, and x, ε, and β be nonnegative constants such that P{Σ_{j=i}^{N} ξ_j ≥ −ε} ≥ 1/β for 2 ≤ i ≤ N. Then

P{max_{i≤N} (ξ_1 + ... + ξ_i) ≥ x + ε} ≤ β P{ξ_1 + ... + ξ_N ≥ x}.

The proof is almost identical to the proof for the two-sided bound. For independent, standard normal summands, symmetry lets us take β = 2 with ε = 0.
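For standard normal summands, the case β = 2, ε = 0 is easy to probe numerically (a sketch of mine, not from the text): the maximum of the partial sums should exceed a level at most about twice as often as the final sum does.

```python
import numpy as np

rng = np.random.default_rng(6)
# P{max_i S_i >= x} <= 2 P{S_N >= x} for symmetric summands (beta = 2, eps = 0)
paths = np.cumsum(rng.standard_normal((50000, 100)), axis=1)
x = 5.0
lhs = (paths.max(axis=1) >= x).mean()
rhs = 2 * (paths[:, -1] >= x).mean()
print(lhs, rhs)  # lhs stays at or below rhs, up to Monte Carlo error
```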


Define blocks B_k := {n : n_k < n ≤ n_{k+1}}, where n_k/ρ^k → 1 for some constant ρ > 1 (depending on γ) that needs to be specified. For a fixed γ > 1,

P{S_n > γL(n) for some n ∈ B_k}

≤ P{max_{n∈B_k} S_n > γL(n_k)}   because L(n) is increasing

≤ 2P{S_{n_{k+1}} > γL(n_k)}   by <6> with β = 2 and ε = 0

≤ exp(−½γ^2 L(n_k)^2/n_{k+1})   by <4>.

The expression in the exponent increases like

(γ^2 n_k/n_{k+1}) log log n_k ≈ (γ^2/ρ) log log ρ^k = (γ^2/ρ)(log k + log log ρ).

If ρ < γ, the bound for block B_k eventually decreases faster than k^{−a}, for some a > 1. The Borel-Cantelli lemma then implies that, with probability one, only finitely many of the events {S_n > γL(n) for some n ∈ B_k} occur. As before, it then follows that limsup_{n→∞} S_n/L(n) ≤ 1 almost surely, the upper half of the LIL. By symmetry, we also have limsup_n (−S_n/L(n)) ≤ 1 almost surely, and hence

<7>    limsup_n |S_n|/L(n) ≤ 1   almost surely.

The lower half of the LIL asserts that, with probability one, S_n > γL(n) infinitely often, for each fixed γ < 1. To prove this assertion, it is enough if we can find a sequence {n(k)} along which S_{n(k)} > γL_{n(k)} infinitely often, with probability one. (I will write n(k) and L_n, instead of n_k and L(n), to avoid nearly invisible subscripts.) The proof uses the Borel-Cantelli lemma in the other direction, namely: if {A_k} is a sequence of independent events for which Σ_k PA_k = ∞ then the event {A_k occurs infinitely often} has probability one.

The sums S_{n(k)} are not independent, because of shared summands. However the events A_k := {S_{n(k)} − S_{n(k−1)} > γL_{n(k)}} are independent. If we can choose n(k) so that Σ_k PA_k = ∞ then we will have

limsup_k (S_{n(k)} − S_{n(k−1)})/L_{n(k)} ≥ γ   almost surely.

If n(k) increases rapidly enough to ensure that L_{n(k−1)}/L_{n(k)} → 0, then <7> will force S_{n(k−1)}/L_{n(k)} → 0 almost surely, and the lower LIL will follow.

We need a lower bound for PA_k. The increment S_{n(k)} − S_{n(k−1)} has a N(0, m(k)) distribution, where m(k) := n(k) − n(k−1). From Appendix D,

<8>    P{N(0, 1) > x} ≥ exp(−θx^2/2)   for all x large enough, if θ > 1.

Thus, for fixed θ > 1,

PA_k ≥ exp( −θγ^2 · 2n(k) log log n(k) / (2m(k)) )   for k large enough.

The choice n(k) := k^k ensures that both n(k)/m(k) → 1 and L_{n(k−1)}/L_{n(k)} → 0. With θ close enough to 1, the lower bound behaves like (k log k)^{−a} for an a < 1, making Σ_k PA_k diverge, and completing the proof of (i).


Assertion (ii) follows from (i) by symmetry of the normal distribution. For (iii) note that Σ_n P{|S_n − S_{n−1}| > 2√(log n)} ≤ Σ_n exp(−2 log n) < ∞. By Borel-Cantelli, S_n − S_{n−1} = O(√(log n)) a.s., which (after some algebra) implies (S_n/L_n) − (S_{n−1}/L_{n−1}) → 0 a.s. As {S_n/L_n} oscillates between neighborhoods of +1 and −1 it must pass through each intervening J infinitely often.

2. LIL for bounded summands

The upper LIL for normal variables relied on symmetry of the summands (for the appeal to the Maximal Inequality) and the exponential bound <4>. For sequences {S_n} generated in other ways we will not have such a clean tail bound, and we need not have symmetry, but the arguments behind the LIL can be adapted in some cases.

<9> Lemma. Let T_n := ξ_1 + ... + ξ_n be a sum of independent random variables with Pξ_i = 0 and σ_i^2 := var(ξ_i) < ∞. Suppose {W_n} is an increasing sequence of constants with σ_1^2 + ... + σ_n^2 ≤ W_n → ∞, and {n(k) : k ∈ N} is an increasing sequence for which W_{n(k+1)}/W_{n(k)} is bounded. Then, for constants λ > 1 and δ > 0,

P{T_n > (λ + δ)L(W_n) for some n with n(k) < n ≤ n(k+1)}

≤ 2P{T_{n(k+1)} > λL(W_{n(k)})}   for all k large enough.

Proof. Replace L(W_n) by the lower bound L(W_{n(k)}), to put the inequality in the form amenable to an application of the Maximal Inequality <6>. Then argue, by Tchebychev's inequality, that for n(k) < n ≤ n(k+1),

P{T_{n(k+1)} − T_n ≥ −δL(W_{n(k)})} ≥ 1 − var(T_{n(k+1)} − T_n)/(δL(W_{n(k)}))^2 ≥ 1 − W_{n(k+1)}/(2δ^2 W_{n(k)} log log W_{n(k)}),

which tends to 1 as k tends to infinity.

If we are to imitate the proof for normal summands, the other key requirement is existence of an exponential tail bound. For bounded summands there is a simple exponential inequality, which looks like <4> except for the appearance of an extra factor in the exponent, a factor involving the nonnegative function

<10>    ψ(x) := 2((1 + x) log(1 + x) − x)/x^2   for x > −1 and x ≠ 0,   with ψ(0) := 1.

The function ψ is convex and decreasing (Appendix C). For the moment, it is enough to know that ψ(x) ≈ 1 when x ≈ 0, so that the inequalities look similar to the inequalities for normal tails when we focus on departures "not too far out into the tails."

< i i> Bennett's Inequality. Let Y\y • •, Yn be independent random variables with

(i) FYf = 0 and a2 := FY2 < oo

(H) Yi < M for every i, for some finite constant M.

For each constant W > o2 -\ h a2,

Yn > x} < exp [~^f i^ff) for x ~

Page 280: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

11.2 LIL for bounded summands 265

Proof. For each t > 0,

P{7, + • • • + Yn > x] < e~xt Y\^n FexpitYi).

As shown in Appendix C, the function

<12> A(x) := 2(ex - 1 - JC)/JC2, with A(0) = 1,

is nonnegative and increasing over the whole real line. Rewrite cxp(tYt) as

1 + tYi + \{t Yt)2A(t Yi) < 1 + tYi + \t2Y2&{tM).

Take expectations, then invoke 1 + a < ea to bound the tail probability by

( + \tVA(rM)) < exp (-** + If

Minimize the exponent by putting Mf equal to log(l + Mx/W), then rearrange to• get the stated upper bound.

The inequality ir(x) > (1 +x/3)~l, also established in Appendix C, gives aslight weakening of Bennett's inequality,

<13> Corollary. (Bernstein's inequality) Under the conditions of < n > ,

With Lemma <9> and Bennett's inequality, we have enough to establish anupper LIL, at least for summands X, bounded in absolute value by a fixed constant M.Assume var(Sn) := Vn -* oo as n -> oo. For a fixed p > 1 (depending on y), defineblocks by putting n(k) := max{n : Vn < pk}. The fact that fl^(ik)+1 < M2 = ^(Vn^))ensures that Vn{k)/pk -> 1 as A; -> oo. From Lemma <9> with Ww equal to Vn,

F{Sn > (k 4- 5)L(Vn) for some n with «(*:) < n < n(k + 1)}

< 2¥{Sn(M) >

2VniM) W \ VHiM)

The expression in the exponent increases like X22/o*loglog(p*)^ (o(The ^(^(1)) converges to 1, and therefore can be absorbed into other factors. Ifp < A., the bound for block Bk again eventually decreases faster than k~a, forsome a > 1. The upper half of the LIL, as in first assertion of <2>, then followsvia Borel-Cantelli and the casting out of a sequence of negligible sets.

Notice that uniform boundedness of the summands was needed twice:

(i) to show that <r2(jt)+1 = o(Vn(k))y thereby ensuring that Vn(*)/P* -* 1 as k -> oo;

(ii) to show that the argument of the ^ factor in the exponent tends to zero.

The same properties also hold for sequences with |Xn| < Mn, where {Mn} is a slowlydiverging sequence of constants. In fact, they hold if and

Vn -> oo and Mn=o \JVn/^g\o% Vn j<15> Vn -> oo and Mn=o \JVn/^g\o% Vn j as n -> oo,

Page 281: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

266 Chapter 11: Exponential tails and the LIL

a condition introduced by Kolmogorov (1929) to prove the LIL <i> for partialsums of independent random variables X, with PX, = 0 and |X,| < Af,.

The proof of the upper LIL under <15> is essentially the same as the prooffor uniformly bounded summands, as sketched above. The lower LIL requires ananalog of the exponential lower bound <8>. The next Section establishes this lowerbound, an extremely delicate exercise in Calculus. With this exponential bound, youcould prove the corresponding lower LIL by modifying the analogous proof fromSection 1 (or the proof of the lower LIL that will be sketched in Section 4).

*3. Kolmogorov's exponential lower bound

Let X,, Sn and Vn be as before. Suppose also that |X/| < 8^/V^ for / = 1,2,..., w,where 8 is a small positive constant. Then for each constant 0 > 1 there existsan JCO > 0 and a K (both depending only on 0) such that

F[Sn > Xy/Vn} > exp {-\0x2\ for JC0 < x < K/8.

Proof. The constants will be determined by a collection of requirements thatemerge during the course of the argument. As with many proofs of this type, therequirements make little intuitive sense out of context, but it is useful to have themcollected in one place, so that the dependences between the constants are clear.As you will soon see: the constant 0 will determine a small € > 0, which in turnwill determine an even smaller rj > 0 (in fact, we will choose rj slightly smallerthan 62/2), and a small, positive K depending on € and r). Specifically, we willneed K so small that

max ((1 + rj)~\ ±(1 + €)) and M

with ^ as in <io> and A as in <12>. To avoid an accumulation of many trivialconstraints, assume (redundantly) that 0 < 17 < e < 1. We will also need

and-6 - (1 + 40(1 + €) + i ( l + €)2(1 - ri) > -0/2.

The constant Jto will need to be large enough that | > exp(—exfy and

2 + 3(1 + 6)JC exp (|JC2(1 + <02K) < \ exp (jjc2(l + <02(l - *?)) for JC > JC0.

Now let the argument begin.

REMARK. AS noted by Dudley (1989, page 379), it is notoriously difficult tomanage the constants and ranges correctly for the Kolmogorov inequality. With theconstraints made explicit, I hope that my errors will be easier to detect and repair.

Page 282: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

11.3 Kolmogorov's exponential lower bound 267

With no loss of generality, assume Vn = 1 (equivalently, divide each X,by \/V^), and consequently o} = FXf < 82 for each i. By almost the samereasoning as in the proof of Bennett's inequality, for t > 0,

) =

> PJ/<w (l + \t2a2A{-t8)\ because A increasing, and Xt > -8

-t2a2A(-t8) \2 i I 9 A , — - I via log(l (1 + 30

If 0 < t < 2K/8 then A(-2K) < A(-t8) < A(0) = 1. We then have2A(-2K)\^ __,(^ U2o

J2 i+2K2 ) ~ C X P(^ (1 "the second equality coming from < n > and the fact that £ , a? = Vn = 1. We alsohave an upper bound for the same quantity,

P e x p ( f S n ) = P / tety{Sn >y}dy<\+( tetyF{Sn > y}dy.

The idea now is to choose t so that the last integrand is maximized somewherein a small interval J := [x, w], which contributes at most

/ tety

Jx<20> / tetyP{Sn >x}dy< etwF{Sn > x]

Jxto the right-hand side of <19>. We need the other contributions to <19>, from youtside / , to be relatively small. For such y9 we can use Bennett's inequality tobound the integrand by texp(ty — \y2ty{y8)). If we were to ignore the x// factor,the bound would be maximized at y — t, which suggests we make t slightly largerthan x but smaller than w. Specifically, choose t := (1 -f €)x and w := (1 H- 4C)JC,

for a small e that needs to be specified. (Note that t < 2x < 2K/8, as requiredfor <18>.)

When y is large, the \/r factor has a substantial effect on the y2. However, usingthe fact (Appendix C) that y\/r(y) is an increasing function of y, and the constraintx < K/S9 we have (y8)f(y8) > (Sx8)\/r(Sx8) > (Sx8)\lr(SK) when y > 8JC, hence

€) > 2yt by

The contribution from the region where y > 8JC is therefore small if x < K/8:/»OO /»OO

/ tetyf>{Sn>y}dy < I texp(ty - 2ty) dy = exp(-8rjc) < 1.i%x hx

Within the interval [0, 8JC] we have \fr(y8) > \/r(Sx8) > ir(SK) > (1 + rj)~l

because x < K/8, and by < n > , and the integrand tetyV{Sn > y] is less than

Notice that the exponent is maximized at y = (1 4- r])t = (1 -f rj)(l -h 6)JC, which liesin the interior of 7, with

min ((1 + rj)t - JC, w - (1 + rj)f) > ex because (1 + rj)(l + €) < 1 + 3e.

Page 283: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

268 Chapter 11: Exponential tails and the LIL

<22>

If divided by yj2n(\ + r\), a constant smaller than 3, the factor contributed by thequadratic in y turns into the N{{\ + rf)t, (1 + 77)) density. The contribution to thebound <19> from y in [0, 8JC]\ / is less than

t exp (\t\\ + r/)) 3P{|tf(0, (1 + r/))| > ex] < 3f exp

Combining the inequalities from <18> and <19>, with the right-hand sideof the latter broken into contributions bounded via <2i>, <22>, and <20> thenrewritten as functions of JC, we have

2 + 3 ( 1

+ exp (JC2(1 + 4 0 ( 1 + 6)) P{Sn > JC}

<23> > exp ( | J C 2 ( 1 + <02(l - ??)) for 0 < x < K/8.

We need to absorb the first two terms on the left-hand side into the right-hand side,which will happen for large enough x if we ensure that

Choose r] := rj(€) to make this inequality hold (a value slightly smaller than €2/2will suffice), then find x€ so that the sum of the first two terms in <23> is smallerthan half the right-hand side when x > x€. We may also assume that | > exp(—6JC2).ght-hand side when x > x€. We may also assume that | > exp(—6JC2

Then we have

( 2sn >x}> exp (~€x2 - x2(l + 4<0(l ^

when JC6 < x < K/8. Finally, we choose € so small that

-e - ( 1 + 4 0 ( 1 + 6 ) + | ( 1 +*)2(1 - 17) > - 0 / 2 ,

which is possible because the left-hand side tends to —1/2 as e decreases to zero.• Put JCO equal to the corresponding x€.

*4. Identically distributed summands

Kolmogorov's LIL for bounded summands under the constraint <15> extends toidentically distributed summands by means of a truncation argument, an idea due toHartman & Wintner (1941). That is, the normality assumption can be dropped fromTheorem <3>.

<24> Theorem. For the sequence of partial sums {£„} of a sequence of independent,identically distributed random variables {Xf} with PX; = 0 and var(X/) = 1,

(i) l imsup^^ Sn/Ln = 1 a.s.

(ii) liminf^oo Sn/Ln = - 1 a.s.

(Hi) Sn/Ln € J infinitely often, a.s., for every open subinterval J of [— 1,1].

Page 284: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

11.4 Identically distributed summands 269

Most of the ideas needed for the proof are contained in Sections 1 and 2. I willmerely sketch the arguments needed to prove Theorem <24>, with emphasis on theway the new idea, truncation, fits neatly with the other techniques.

The truncated variables will satisfy an analog of <15> with Vn := n, exceptthat the o{>) will be replaced by a fixed, small factor. The fragments discarded bythe truncation will be controlled by the the following lemma, which formalizes anidea of DeAcosta (1983).

<25> Lemma. The function

is strictly increasing and

/ ——dx<Ct for t > g{ee\Jee L(X)

for some constant C.

Proof. Differentiate.

2 ^ p f = - - -— > - ( 1 - - ) > 0 for x > ee.g(x) x log log x log x x x \ e)

The inequality also implies that1 Q (x *2.p

SL1 < Cg\x) where C := « 3.16,L(x) x e-l

• from which it follows, for t > g(ee), that f*~l(t) j^dx < feg~l(t) Cg'(x) dx < Ct.

Upper LIL

Consider first the steps needed to prove limsup(Sn/Ln) < y a.s., for a fixed y > 1.It helps to work with a smooth form of truncation, so that (Problem [4]) the variancesof the truncated variables are necessarily smaller than the variances of the originalvariables. For each positive constant M define a function from E onto [—M, M] by

<26> T(X, M) := -M{x < -M] + JC{|JC| < M} + M{x > M}.

For a fixed e > 0, which will depend on y, define, for i > 17 > 1 -f eey

with gi := g(i), as defined by Lemma <25>. Note that |/x,| < P|Z,|, becauseV(Yi + Z/) = PX/ = 0.

REMARK. Notice the dependence of the truncation level on i. Compare with thetruncation used to prove the SLLN under a first moment assumption in Section 4.7,and the truncations used to prove central limit theorems in Section 7.2. For almostsure convergence arguments it is common for each variable to have its own truncationlevel, because ultimately Borel-Cantelli assertions must depend on convergence of asingle infinite series, rather than on convergence of changing sequences of partialsums to zero.

Page 285: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

270 Chapter 11: Exponential tails and the UL

The partial sum Sn decomposes into J2i<n & + Zw<n Mi + £/<„ zi- The first

sum will be handled essentially as in Section 2. The other two sums will be smallrelative to Ln, by virtue of the following bound,

Zll < £ j _L_li < 2 J ' l l l g — n / - identical distnbutionsi=17 ^' i=17 ^' i=17 ^'

<F\\X\\y^[g l(\X\\/e) > i] I dx)\ r-?L J L(x) I\ 1=1/ /

< P (|Xi |C|Xi |/6) by Lemma <25>

< oo because FX\ < oo.

The series started from i = 1 also converges. By Kronecker's lemma,L~x Yli<n Li\l*i\/Li -> 0 as n ->oo. Similarly, finiteness of the expected value ofJ2i \Zi\~/Li implies J2i<n %* = °(Ln), almost surely.

To handle the contribution from the {ft}, write Tn for J2i<n & anc* Vn for var(rn).By Dominated Convergence, var(ft) -» 1 as i —> oo, implying Vn/n -> 1 as n -> oo.Write y as 1 4- 25, with 5 > 0. We need to prove limsup(rn/Ln) < 1 + 28 almostsurely. As in Section 2, use blocks Bk := {n(£) < n < n(k -h 1)}, with n(k)/pk -» 1,for a £ > 1 to be specified. Invoke Lemma <9> with Wn := n and A := 1 + 8to reduce to behavior along the geometrically spaced subsequence. Then invokeBennett's inequality,

Tn(k+i) > kLnik)] < exp I — f I I I.V 2 ^ + 1 ) V Vn(k+\) ))

The argument of the ^ factor behaves like

LH(M)n(k+l)For fixed y, the f factor can be brought as close to 1 as we please, by choosing €small enough. The other term in the exponent behaves like (A.2/p)loglogp*. Withappropriate choices for p and e, we therefore have the tail probability decreasingfaster than k~a, for some a > 1, which leads to the desired upper LIL.

Lower LIL

For a fixed y < 1 we need to show limsup(7n/Ln) > y almost surely. As inSection 1, look along a subsequence n(k) := kk. Write T for Tn(k) - Tn(k-\), and Vfor var(r) = Vn{k) — V^-i) . We can make V/n(k) as close to 1 as we please, bymaking k large enough. The summands contributing to T are bounded in abolutevalue by 8y/V, where 8 := 2tgnik)/W. We need to bound F[T > yLn(k)] frombelow by a term of a divergent series. Fix a 0 > 1. Write JC for yLn{k)/y/V.Inequality <16> gives

F[T > yLnik)] = F{T > xW) > exp (-^Jc2) = exp (-2V

Page 286: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

11.4 Identically distributed summands 271

provided JCQ < x < K/8, that is, provided

2n(k)loglogn(k)Y n(*)(l+o(l)) 26 V "(*)

With € small enough, the range eventually contains the desired x value. The rest ofthe argument follows as in Section 1.

Cluster points

Assertion (iii) of Theorem <24> will follow from assertion (i), by means of aningenious projection argument, borrowed from Finkelstein (1971). Construct newindependent observations {X,} with the same distribution as the {X,}, and letSn := X\ + . . . + Xn. Write Wn for the random vector (Sn, Sn)/Ln and u9 forthe unit vector (cos 0, sin 0). For each fixed 0, the random variables Wn • UQ —X, cos0 4- Xi sin# have mean zero, variance one, and they are identically distributed.From (i), limsup,,^^ (Wn • UQ) — 1 almost surely. Given 6 > 0, there exists a finitecollection of half spaces {(x,y) - uo < 1 + 6 } whose intersection lies inside the ballof radius 1 + 26 about the origin. It follows that limsup \Wn\ < 1 + 2 6 almostsurely. The geometry of the circle then forces Wn to visit each neighborhood of

the boundary point ue infinitely often, with probability one.The projection of such a neighborhood onto the horizontal axisgives a neighborhood of the point cos#, which Sn/Ln must visitinfinitely often. After a casting out of a countable sequenceof negligible sets, for a countable collection of subintervals of( -1 ,1) , we then deduce assertion (iii).

5, Problems

[1] Let X have a Bin(n, p) distribution. Define q := 1 - p. For 0 < x < nq, show that

Hint: Bound the tail probability by exp (-t(np + JC) + n log(q + pe1)) for t e R+,then minimize the expression in the exponent by Calculus. For 0 < x < nq show

—?• ) . For the second bound, usel-x/nqj

convexity of x//.

[2] Let X\,..., Xn be independent random variables with PX,- := p\ and 0 < Xt < 1 foreach i. Let p := (p\ H h pn)/n =: I -q. Show that

F {£,,„ *, > nP + X} < exp (-£.+ ( ^ ) ) for 0 < , < „,

Page 287: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

272 Chapter 11: Exponential tails and the LIL

Hint: For t € M+, bound the tail probability by

exp(-t(np + *))Pexp(f EX,-) = exp (-t(np + x) + £,<„ l og(# + A*')) •

Use concavity of the logarithm function to increase the bound to a form amenableto the method of Problem [1].

[3] Suppose X has a Poisson(A.) distribution.

(i) By direct minimization of exp (-t(X + JC)) Pexp(rX) over R+, prove that

(ii) Derive the same tail bound by a passage to the limit in the binomial boundfrom Problem [1].

[4] For each random variable X with finite variance, and each constant M, show thatvar(r(X, M)) < var(X), with r as defined in <26>. Hint: Let X' be an independentcopy of X. Show that 2var(r(X, M)) = P|r(X, M) - r(X', M)\2 and also that|T(JC, M) - T(JC', M)\ <\x- x'\ for all real x and xr.

[5] Let {Xn} be a sequence of independent, identically distributed two-dimensionalrandom vectors with PXf = 0 and var(X,) = h. Define Sn = Xi -f . . . + Xn.

(i) Show that limsup|5n|/Ln < 1 almost surely.

(ii) Show that, with probability one, the sequence Sn/Ln visits every open subset ofthe unit ball {|JC| < 1} infinitely often. Hint: Project three-dimensional randomvectors.

6. Notes

My understanding of the LIL began with the reading of Feller (1968, Section VIII.5),Lamperti (1966, Section 11) and Stout (1974, Chapter 5). I learned the idea ofregarding the Bennett inequality as a slightly corrupted (by the presence of the\js function in the exponent) analog the the normal tail bound from conversationswith Galen Shorack and Jon Wellner. Shorack (1980) systematically exploited theidea to establish very sharp LIL results for the empirical distribution function.For a beautiful exposition of the many applications of the idea to the study ofinequalities for the empirical distribution function and related processes see Shorack& Wellner (1986, Chapter 11).

The method used to establish the Bennett inequality, but not the form ofthe inequality, comes from Chow & Teicher (1978, page 338). They developedexponential inequalities suitable for derivation of LIL results. I am uncertain aboutthe earlier history of the exponential bounds. Apparently (cf. Kolmogorov & Sar-manov 1960), Bernstein's inequality comes from a 1924 paper. The Bennett (1962)and Hoeffding (1963) papers contain other tail bounds for sums of random variables,with some references to further literature.

Apparently the first versions of the LIL with a log log bound are due toKhinchin (1923, 1924). For the early history of the LIL, leading up to the definitive

Page 288: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

11.6 Notes 273

version by Kolmogorov (1929), see Feller (1943). Hartman & Wintner (1941)extended to the case of independent identically distributed summands with finitesecond moments, by means of a truncation argument. (Actually, they did notassume identical distributions, but only a domination condition for the tails, whichholds under the second moment condition for identically distributed summands.)My exposition in Section 4 draws from DeAcosta (1983), who gave an elegantalternative derivation (and extension) of the Hartman-Wintner version of the LIL.

I thank Jim Kuelbs for the proof of part (iii) of Theorem <3>.

REFERENCES

Bennett, G. (1962), 'Probability inequalities for the sum of independent randomvariables', Journal of the American Statistical Association 57, 33-45.

Chow, Y. S. & Teicher, H. (1978), Probability Theory: Independence, Interchange-ability, Martingales, Springer, New York.

DeAcosta, A. (1983), 'A new proof of the Hartman-Wintner law of the iteratedlogarithm', Annals of Probability 11, 270-276.

Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.Feller, W. (1943), 'The general form of the so-called law of the iterated logarithm',

Transactions of the American Mathematical Society 54, 373-402.Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. 1,

third edn, Wiley, New York.Finkelstein, H. (1971), 'The law of the iterated logarithm for empirical distributions',

Annals of Mathematical Statistics 42, 607-615.Hartman, P. & Wintner, A. (1941), 'On the law of the iterated logarithm', American

Journal of Mathematics 63, 169-176.Hoeffding, W. (1963), 'Probability inequalities for sums of bounded random

variables', Journal of the American Statistical Association 58, 13-30.Khinchin, A. Ya. (1923), 'Uber dyadische Briiche', Math. Zeit. 18, 109-116.Khinchin, A. Ya. (1924), 'Uber einen Satz der Wahrscheinlichkeitsrechnung',

Fundamenta Mathematicae 6, 9-20.Kolmogorov, A. (1929), 'Uber das Gesetz des Iterierten Logarithmus', Mathematische

Annalen 101, 126-135.Kolmogorov, A. N. & Sarmanov, O. V. (1960), 'The work of S. N. Bernshtein on

the theory of probability', Theory Probability and Its Applications 5, 197-203.Lamperti, J. (1966), Probability: A Survey of the Mathematical Theory, W. A.

Benjamin, New York.Shorack, G. R. (1980), 'Some law of the iterated logarithm type results for the

empirical process', Australian Journal of Statistics 22(1), 50-59.Shorack, G. R. & Wellner, J. A. (1986), Empirical Processes with Applications to

Statistics, Wiley, New York.Stout, W. F. (1974), Almost Sure Convergence, Academic Press.

Page 289: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

Chapter 12

Multivariate normal distributions

SECTION 1 explains why you will not learn from this Chapter everything there is to knowabout the multivariate normal distribution.

SECTION 2 introduces Fernique's inequality. As illustration, Sudakov's lower bound forthe expected value of a maximum of correlated normals is derived.

SECTION * J proves Fernique 's inequality.SECTION 4 introduces the Gaussian isoperimetric inequlity. As an application, BorelVs tail

bound for the distribution of the maximum of correlated normals is derived.SECTION *5 proves the Gaussian isoperimetric inequlity.

1. Introduction

Of all the probability distributions on multidimensional Euclidean spaces themultivariate normal is the most studied and, in many ways, the most tractable.In years past, the statistical subject known as "Multivariate Analysis" was almostentirely devoted to the study of the multivariate normal. The literature on Gaussianprocesses—stochastic processes whose finite dimensional distributions are allmultivariate normal—is vast. It is important to know a little about the multivariatenormal.

As you saw in Section 8.6, the multivariate normal is uniquely determined byits vector of means and its matrix of covariances. In principle, everything that onemight want to know about the distribution can be determined by calculation of meansand covariances, but in practice it is not completely straightforward. In this Chapteryou will see two elegant examples of what can be achieved: Fernique's (1975)inequality, which deduces important information about the spread in a multivariatenormal distribution from its covariances; and Borell's (1975) Gaussian isoperimetricinequality, with a proof due to Ehrhard (1983a, 1983b). Both results are proved bycareful Calculus.

The Chapter provides only a very brief glimpse of Multivariate Analysis andthe theory of Gaussian processes, two topics that are covered in great detail in manyspecialized texts. I have chosen merely to present examples that give the flavor ofsome of the more modern theory. Both the Fernique and Borell inequalities havefound numerous applications in the recent research literature.

Page 290: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

12.2 Fernique's inequality 275

2. Fernique's inequality

The larger <r2, the more spread out is the N(0,cr2) distribution. Fernique (1975,page 18) proved a striking multivariate generalization of this simple fact.

<i> Theorem. Suppose X and Y both have centered (zero means) multivariate normaldistributions, with P|X,- - Xj\2 < F\Yt - Yj\2 for all i, j . Then

P/(max/ Xt - min, Xt) < P/(max, Yt - min, Yt)

for each increasing, convex function f on R+.

The theorem lets us deduce inequalties for multivariate normal distributions,with potentially complicated covariance structures, by making comparisons withsimpler processes.

<2> Example. (Sudakov's minoration) Let Y := (Y\, Y2,..., Yn) have a centeredmultivariate normal distribution, with F\Yj - F*|2 > 82 for all j ^ k. Fernique'sinequality will show that Pmax/<n Y( > C8y/log2n, where C := P|N(0, l)\/V%.

The result is trivial for n = 1. For n > 2, let k be the largest integer for whichn > 2*. Note that 2k > k + 1 > log2n > k. Reindex the variables [Yt : 1 < i < 2*}by the vectors a in the set A = {—1, +1}* of all it-tuples of ±1 values. WriteYa instead of Yt. The precise correspondence between A and {1 : 1 < 1 < 2*} isunimportant.

Build another centered multivariate normal family {Xa : a € A},

Xa := \8k-X12 Y^i=x aiwi w h e r e Wi , . . . , Wlk arc independent N(0, l)'s.

For a ^ ,

¥\Xa - Xt\2 = \82k~l £ . ( ( * , - - ft)2 <82< F\Ya - Yp\2.

From Fernique's inequality with f(t) := t we get

P (maxa Ya — nun*, Ya) > P (maxa Xa —

Symmetry of the multivariate normal implies that maxa Ya has the same distributionas maXa(-Ya) = -min« Ya, and similarly for the X's. The last inequality implies

Prnax* Ya > Pmax« Xa = ^ / * ^

For each realization W := (Wi, . . . , W*), the maximum of J],«,-Wi is achieved wheneach a, takes the same sign as W,. (Of course, the maximizing a depends on W.)The lower bound equals

D as asserted.

REMARK. The lower bound is sharp within a constant, in the following sense. IfF\Yj-Yk\

2 < 82 for all j ^ k then Pmax,- Yt = PF,+ Pmax^yi-y!) = Pmax^y , -^)and, by Jensen's inequality and monotonicity,

exp (PmaXl(r, - Y})/28)2 < Pmax, exp ((7, - Yx)2/482) < nPexp (N (0, \

Thus Pmax, Yt is bounded above by 28y/\og(\/2n).

Page 291: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

276 Chapter 12: Multivariate normal distributions

*3. Proof of Fernique's inequality

A straightforward approximation argument (see Appendix C) reduces to the casewhere / has a bounded continuous second derivative with bounded support. Alsowe may assume that the covariance matrices Vb := var(X) and Vi := var(F) areboth nonsingular n xn matrices: We could prove the result for X + eZ and Y + €Z,where Z is a distributed Af (0, /„) independently of X and F, then let € tend to zero.

To simplify notation, I will write 9, for d/Sxj and djk for 32/dxj3xk.

For nonsingular Vb and Vi, the covariance matrix V$ := (1 — 0)Vb + 0V\ isnonsingular for 0 < 0 < 1. The N(0, V#) distribution has a density ge with respectto Lebesgue measure 9Jt on Rn. A simple calculation (Problem [1]) based on theFourier inversion formula for the N(0, Ve) density gives

^-(x) = I J2j,k AMaJW*> w h e r e A := Vi - Vo.

The Theorem will be proved if we can show that the function

H(0) := mx (f (max, xt - min, *,-) gG(x))

is increasing in 6. The assumptions on / justify differentation under the 97? to get

H'(0) = mxf (max, x, - min, jt,) —geto

= \ J2j,k ^j^x (f (max/ * ~ ™ni x0 df.k8e) by <3>.

Two integrations by parts will replace the QJl-integrals by integrals involving thenonnegative functions / ' and /", reducing the expression for H'(0) to a sum ofthe form J2j<k (^jj + &kk — 2A^) (something nonnegative). To establish such arepresentation, we need to keep track of contributions from subregions of W definedby inequalities involving the functions

L(x) := max, xt and S(x) := mint JC,

Lj (x):= max, {/ ^ j}x, and 5} (x) := min, {i ^ j}xt.

Notice that L{x) = L7(JC) v jt, and S(x) = Sj(x) A xj9 for each j .Let my denote Lebesgue measure on the y'th coordinate space R, and 9JI, denote

(n - l)-dimensional Lebesgue measures on the product of the remaining coordinatesubspaces. That is, m, integrates over Xj and 971, integrates over the remaining n - 1coordinate variables. The product of m7 0 9)ty equals 971, Lebesgue measure on Rn.

The function Xj \-> fL — S) is absolutely continuous with almost sure derivativefix; - Sj){xj > Lj] - f(Lj-Xj)[xj < Sj] = /'(L - 5) ({xj = L) - {Xj = 5}). Here,and subsequently, I ignore the 9Jt-negligible set of x for which there is a tie formaximum or minimum. The function dkge decreases to zero exponentially fast asmax, \xi | -> oo. An integration by parts with respect to Xj gives

Mxf (L - S) dfkg9 = Wlj (mjfixj v Lj - Xj A Sjty (dkge))

= -Mj (mjfiL - S) ({XJ = L) - {*, =

= -m (f(L - 5) ({*, = L] - {Xj =

Page 292: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

12.3 Proof of Fernique's inequality 277

The second integration-by-parts proceeds slightly differently for j = k or j # k.To simplify notation, I will temporarily assume j = 1 and either k = 1 or k = 2;and I will replace * 3 , . . . , xn or JC2, * 3 , . . . , xn by a long dash (—) to keep attentionfocussed on the variables actively involved in the calculation.

Mixed partial derivative

With j := 1 and k := 2, consider the final integrand of <5> as a function of JC2-Rewrite the difference of indicator functions as {x2 < x\ = L2} - {X2 > x\ = S2}.Integration by parts, with respect to JC2, gives

m2 (f'(L - S) ({JC, = L) - [xx = S}) d2ge)

= / ' (L - S){JC, = L2}ge(x)\xxl2_oo - f\L - S){Xl = S2}^(*)|~=JC1

- m2 (/"(L - S) ({X2 = L] - {X2 = 5}) ({xi = L} - {JC, = 5}) go)

= / U 2 - S2)S*(*i, jcj,— )({*, = L2} + {JC! = 52})

+ m2 (/"(L - 5) ({xi = 5, JC2 = L} + {JCI = L, JC2 = S}) g0) a.e. [m2].

The OT2 negligible set allows for JC where x\ and JC2 tie for maximum or minimum.Integrate with respect to 9^2 to get an expression for the mixed partial derivative.

mf(L - S)dl2gd = -m2 (f(L2 - S2){{xl = S2) + {x{ = L2))ge(xuxu—))

- m (f"(L - 5)({JC, = 5, JC2 = L] + {JC, = L, JC2 = S])g9(x))

:= -A{,2 - #1,2.

Notice the curious form of the integral for Ax,2- It runs over n — 1 variables (JC2

omitted) ranging over the sets where either JCI is the smallest or the largest of thosevariables. The second argument of g#, previously occupied by JC2, is now occupiedby the extremal value JC, . We would get exactly the same integral if we interchangedthe roles of JCI and JC2. Write A/t* for the analogous integral with JC, taking over therole of JCI and JC* taking over the role of JC2. Nonnegativity of / ' , and the symmetryof the roles of the two variables, ensure that A,* = A* ,7 > 0 for all j ^ k. Similarly,write Bjk for the analog of #i,2 with Xj taking over the role of xx and JC* takingover the role of JC2. Nonegativity of / " implies that Bjk = Bkj > 0 for all j ^ k.

Repeated partial derivative

The calculations for d\xge are similar. Integrate first with respect to JCI.

m, (f(L - S) ({xx =L}- {xx = S}) dvg0)

- mi (f"(L - S) ({xx =L}- {xx = S})

- m f ( L - S)ge(x) ({xx =S} + {xx = L})

Then integrate over the remaining n — 1 variables, to conclude that

Lx - S{) (ge(Sx,—) + ge(Lu—))

m (f'\L - S)ge(x) ({xx =S} + {xx = L}))

Page 293: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

278 Chapter 12: Multivariate normal distributions

When split according to which of the variables X2>..., xn achieves the extremalvalue S\ or L\y the first integral breaks into the sum Ylk>2 ^i.*> an(^ ^ e s e c o n d i12k>2 ^M- ^ e o ther repeated derivatives contribute similar expressions.

Collection of terms

Substitute the results from the two integrations by parts into <4>.

      H'(θ) = ½ Σ_{j,k} A_jk m( f(L - S) ∂_{jk} g_θ )

            = ¼ Σ_{j≠k} (A_jj + A_kk - 2A_jk) (A_{j,k} + B_{j,k}).

The assumption of the Theorem tells us that

      A_jj + A_kk - 2A_jk = PY_j² - PX_j² + PY_k² - PX_k² - 2(PY_jY_k - PX_jX_k)

                          = P|Y_j - Y_k|² - P|X_j - X_k|² ≥ 0.

Thus H'(θ) ≥ 0, and Fernique's inequality follows. □

4. Gaussian isoperimetric inequality

Let γ_k denote the standard normal, N(0, I_k), distribution on R^k. Write Φ for the one-dimensional N(0, 1) distribution function, and write φ(x) := (2π)^{-1/2} exp(-x²/2) for its density with respect to Lebesgue measure.

For each Borel subset A of R^k and each r > 0 define A^r := {x : d(x, A) ≤ r}, a set that I will call a neighborhood of A. The isoperimetric problem requires minimization of γ_k A^r over all A with a fixed value of γ_k A. Borell (1975) showed that the minimizing choice is a closed half-space H. The neighborhood H^r is another half-space. By rotational symmetry of the standard normal, the calculation of γ_k H^r reduces to a one-dimensional problem: if γ_k H := Φ(α) then γ_k H^r = Φ(r + α).

The term isoperimetric comes from an analogy with the classical isoperimetric inequality for minimization of the surface area of a set with fixed Lebesgue measure. If one avoids the tricky problem of defining the surface area of a general Borel set A by substituting the Lebesgue measure of the thin shell A^r \ A, for very small r, the problem becomes one of minimizing the Lebesgue measure mA^r for a fixed value of mA. Replace Lebesgue measure by γ_k and we have the Gaussian analog of the modified isoperimetric problem.

<7> Theorem. The Gaussian measure of the neighborhood A^r is minimized, for a given value of γ_k A, by choosing A as a closed half-space. More generally, γ_k A^r ≥ Φ(r + α) for every Borel subset A of R^k with γ_k A ≥ Φ(α).

It is the reduction from a k-dimensional problem, with k arbitrarily large, to a one-dimensional calculation for the lower bound that makes Borell's result so powerful, as shown by the inequalities in the next Example.
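Theorem <7> is easy to test numerically in one dimension. The sketch below (the two-interval test set A and the grid of r values are arbitrary choices of mine, not from the text) compares γ_1 A^r with the lower bound Φ(r + α):

```python
from statistics import NormalDist

nd = NormalDist()              # standard normal
Phi, Phi_inv = nd.cdf, nd.inv_cdf

# Arbitrary test set A = [-2, -1] u [0.5, 1.5].
def gamma_A():
    return (Phi(-1.0) - Phi(-2.0)) + (Phi(1.5) - Phi(0.5))

def gamma_Ar(r):
    # gamma_1 measure of A^r = [-2-r, -1+r] u [0.5-r, 1.5+r], fused if they overlap
    a1, b1, a2, b2 = -2.0 - r, -1.0 + r, 0.5 - r, 1.5 + r
    if b1 >= a2:
        return Phi(b2) - Phi(a1)
    return (Phi(b1) - Phi(a1)) + (Phi(b2) - Phi(a2))

alpha = Phi_inv(gamma_A())     # so that gamma_1 A = Phi(alpha)
for r in [0.1 * k for k in range(1, 31)]:
    assert gamma_Ar(r) >= Phi(r + alpha)
print("gamma_1 A^r >= Phi(r + alpha) on the whole grid of r values")
```

The half-space with the same γ_1 measure attains the bound exactly, which is the content of the Theorem.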


<8> Example. Recall that a median of a (real-valued) random variable X is any constant m for which P{X ≥ m} ≥ 1/2 and P{X ≤ m} ≥ 1/2. Such an m always exists, but it need not be unique.

Suppose Y_1, ..., Y_n have a multivariate normal distribution. Define S := max_{i≤n} |Y_i|, and let M be a median of S. (In fact, the inequalities will force M to be unique.) Define σ² := max_i var(Y_i). Borell's concentration inequality asserts that P{S ≥ M + r} ≤ P{N(0, σ²) ≥ r} and P{S ≤ M − r} ≤ P{N(0, σ²) ≤ −r}, for each r ≥ 0. More succinctly,

<9>      P{ |max_i |Y_i| − M| ≥ r } ≤ P{ |N(0, σ²)| ≥ r }      for each r ≥ 0.

That is, the spread in the distribution of max_i |Y_i| about its median is no worse than the spread in the distribution of the Y_i with largest variance.

In special cases, such as independent variables (Problem [3]), one can get tighter bounds, but Borell's inequality has two great virtues: it is impervious to the effects of possible dependence between the Y_i, and it does not depend on n. In consequence, it implies similar concentration inequalities for general Gaussian processes, with surprising consequences.

Each half of the Borell inequality follows easily from Theorem <7> if we represent the Y_i as linear functions Y_i(x) := μ_i + θ_i'x on R^n equipped with the measure γ_n. (That is, regard the vector of Y_i's as a linear transformation of a vector of independent N(0, 1)'s.) The assumption about the variances becomes var(Y_i) = |θ_i|² ≤ σ² for each i. By definition of the median M, the set

      A := {x ∈ R^n : max_{i≤n} |Y_i(x)| ≤ M}

has γ_n A ≥ 1/2 = Φ(0). If a point x lies within a distance r of a point x_0 in A, then |θ_i'x − θ_i'x_0| ≤ |θ_i| |x − x_0| ≤ σr, for each i. Thus the neighborhood A^r is contained within {x : max_{i≤n} |Y_i(x)| ≤ M + σr}, and

      P{ max_{i≤n} |Y_i| ≤ M + σr } ≥ γ_n A^r ≥ Φ(0 + r),

as asserted by the upper half of the Borell inequality. The derivation for the companion inequality is similar. □
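The two-sided form <9> can be illustrated by simulation. A minimal Monte Carlo sketch, with an equicorrelated covariance matrix of my own choosing, and with the sample median standing in for M:

```python
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: five equicorrelated standard normal coordinates.
n, rho = 5, 0.5
cov = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)
Y = rng.multivariate_normal(np.zeros(n), cov, size=200_000)

S = np.abs(Y).max(axis=1)     # S := max_i |Y_i|
M = np.median(S)              # sample median of S
sigma = 1.0                   # sigma^2 := max_i var(Y_i) = 1 here

Phi_bar = lambda x: 0.5 * (1 - erf(x / sqrt(2)))   # P{N(0,1) > x}

for r in (0.5, 1.0, 1.5, 2.0):
    lhs = np.mean(np.abs(S - M) >= r)    # P{ |S - M| >= r }
    rhs = 2 * Phi_bar(r / sigma)         # P{ |N(0, sigma^2)| >= r }
    print(f"r={r:.1f}  empirical={lhs:.4f}  bound={rhs:.4f}")
    assert lhs < rhs
```

The empirical tail sits well inside the bound, as it must: Borell's inequality holds for every covariance structure with the same maximal variance.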

<10> Example. Suppose X_1, X_2, ... is a Gaussian sequence of random variables. The supremum S := sup_i |X_i| might take infinite values, but if P{S < ∞} > 0 then Borell's inequality <9> will show that the distribution of S must have an upper tail that decreases like that of a normal distribution. More precisely, the constant σ² := sup_i var(X_i) must be finite, and there must exist some finite constant M such that

<11>      P{S ≥ M + σr} ≤ Φ̄(r) = P{N(0, 1) ≥ r}      for all r ≥ 0.

Of course there is no comparable result needed for the lower tail, because S isnonnegative.

The tail bound <11> ensures that S has finite moments of all orders, and in fact P exp(αS²) < ∞ for all α < (2σ²)^{-1}. See Problem [8].


We can establish <11> by reducing the problem to finite sets of random variables, to which Borell's inequality applies. Write S_n for max_{i≤n} |X_i| and M_n for its median. Define σ_n² := max_{i≤n} var(X_i). From <9>,

<12>      P{ |S_n − M_n| ≥ σ_n r } ≤ 2Φ̄(r).

The assumption on S ensures existence of a finite constant C and an ε > 0 such that P{S ≤ C} ≥ ε, which implies that P{|X_i| ≤ C} ≥ P{S_n ≤ C} ≥ ε for all n and i ≤ n. These inequalities place an upper bound of 2C²/(πε²) on var(X_i), and thereby also on σ², because P{|N(μ, τ²)| ≤ C} ≤ 2C/(τ√(2π)). Choose an r_0 for which Φ̄(r_0) < ε. From <12> we have P{S_n ≤ M_n − σ_n r_0} < ε, which excludes the possibility that M_n − σ_n r_0 might be larger than C. The constant M := sup_n M_n is therefore bounded above by C + σr_0. From <12> we then get P{S_n ≥ M + σr} ≤ Φ̄(r) for all n and all r ≥ 0, an assertion stronger than <11>. □

*5. Proof of the isoperimetric inequality

First note that we may assume A is closed, because A^r = Ā^r and γ_k Ā ≥ γ_k A. As a convenient abbreviation, I will call a closed set B an improvement over A if γ_k B ≥ γ_k A and γ_k B^r ≤ γ_k A^r. Following Ehrhard (1983a, 1983b), I prove the Theorem in three steps:

(i) Establish the Theorem for the one-dimensional case, which Problem [6] shows can be reduced by a simple approximation argument to the case where A is a finite union of disjoint closed intervals J_i := [x_i^-, x_i^+], where −∞ ≤ x_1^- ≤ x_1^+ < x_2^- ≤ ... ≤ x_n^+ ≤ +∞. The method of proof depends only on the fact that the logarithm of the standard normal density φ is a concave function. It works by showing that we can improve A by successively fusing each J_i with its neighboring interval on the left or the right, until eventually we are left with either a single semi-infinite interval or a union of two such intervals. A further convexity argument disposes of the two-interval case.

(ii) Establish the two-dimensional version of the Theorem by an analog of the classical Steiner symmetrization method (Billingsley 1986, Section 19), which draws on the one-dimensional result. For a Borel subset A of R², Fubini's Theorem asserts that the y-section A_y := {x ∈ R : (x, y) ∈ A} is Borel measurable, and that γ_1 A_y is a Borel measurable function of y. Define a function g(y) to satisfy the equality Φ(g(y)) := γ_1 A_y. Then the y-sections B_y of the set B := {(x, y) : x ≤ g(y)} have the same γ_1 measure as the corresponding A_y.


Intuitively speaking, the set B is obtained from A by sliding and stretching each of its y-sections into a semi-infinite interval with the same γ_1 measure. (The jagged left edge for B in the picture is supposed to suggest that the set extends off to −∞.) I will call the operation that transforms A into B a 1-shift.

Problem [5] shows that B is closed if A is closed. Fubini's theorem ensures that γ_2 B = γ_2 A. The one-dimensional version of the Theorem will then ensure that B is an improvement over A.

The same 1-shift idea works for every direction, not just for slices parallel to the x-axis. I will write S_u for the 1-shift operator in the direction of a unit vector u. (The picture corresponds to the case where u is the unit vector u := (−1, 0) that points back along the x-axis.) By means of a sequence of such shifts, we can rearrange A into a set arbitrarily close to a half-space, with an improvement at each step. A formal limit argument then establishes the two-dimensional version of the Theorem.

(iii) Establish the k-dimensional version of the Theorem by induction on the dimension. It will turn out that the two-dimensional case involves most of the hard work. For example, two applications of the result for two dimensions will give the result for three dimensions.

DETAILS OF THE PROOF

(i) One dimension

We have to show that a half-line is an improvement over A := ∪_{i≤n} J_i, a finite union of disjoint closed intervals.

If x_1^+ ≥ x_2^- − 2r we may replace J_1 and J_2 by the single interval [x_1^-, x_2^+] without changing γ_1 A^r. Thus, with no loss of generality, we may suppose J_1 and the set J := ∪_{i≥2} J_i are a distance at least 2r apart, in which case γ_1 A^r = γ_1 J_1^r + γ_1 J^r.

Define 2δ := γ_1 J_1 = Φ(x_1^+) − Φ(x_1^-) and 2t_1 := Φ(x_1^+) + Φ(x_1^-), so that J_1 has endpoints Φ^{-1}(t_1 ± δ). Consider the effect of replacing J_1 by another interval I_t := [x^-, x^+]. If we take x^- := Φ^{-1}(t − δ) and x^+ := Φ^{-1}(t + δ), with δ ≤ t ≤ t* := Φ(x_2^-) − δ, then γ_1 I_t = γ_1 J_1. When t = δ, the interval I_t is semi-infinite, [−∞, x^+]; when t = t*, the intervals I_t and J_2 touch at x^+ = x_2^-. The sets B_t := I_t ∪ J and A have the same γ_1 measure.


When t ≤ Φ(x_2^- − 2r) − δ the neighborhoods I_t^r and J^r are disjoint, and γ_1 B_t^r = γ_1 I_t^r + γ_1 J^r. For larger t, the two neighborhoods overlap, and γ_1 B_t^r ≤ γ_1 I_t^r + γ_1 J^r.

If we choose t with γ_1 I_t^r ≤ γ_1 J_1^r then B_t is an improvement over A. The following Lemma shows that we get the most improvement by pushing t to one of the extreme positions, because a concave function on [δ, t*] achieves its minimum at one of the endpoints.

<13> Lemma. For each fixed r > 0 the function G(t) := γ_1 I_t^r is a concave function of t on [δ, 1 − δ].

Proof. It is enough to show that the derivative G' is a decreasing function on (δ, 1 − δ). By direct differentiation of Φ(x^+ + r) − Φ(x^- − r) as a function of t, we have

      G'(t) = φ(x^+ + r)/φ(x^+) − φ(x^- − r)/φ(x^-).

Concavity of log φ implies that φ'(x)/φ(x) is a decreasing function of x. Thus h(x) := φ(x + r)/φ(x) is a decreasing function of x, because (log h(x))' = φ'(x + r)/φ(x + r) − φ'(x)/φ(x) ≤ 0.

REMARK. Actually, log h(x) = −xr − r²/2, which is clearly decreasing. I wrote the argument using concavity because I suspect there might be a more general version of the isoperimetric inequality provable by similar methods (cf. Bobkov 1996).

Both x^+ and x^- are increasing functions of t. Thus G' equals a decreasing function of t minus the reciprocal of another decreasing function of t, which makes G' decreasing, and G concave. □
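Lemma <13> is also easy to check numerically. A sketch using the standard library NormalDist (the values δ = 0.1 and r = 0.7 are arbitrary choices of mine):

```python
from statistics import NormalDist

nd = NormalDist()                      # standard normal
Phi, Phi_inv = nd.cdf, nd.inv_cdf

def G(t, delta=0.1, r=0.7):
    # gamma_1 measure of the r-neighborhood of I_t = [Phi^{-1}(t-delta), Phi^{-1}(t+delta)]
    x_minus, x_plus = Phi_inv(t - delta), Phi_inv(t + delta)
    return Phi(x_plus + r) - Phi(x_minus - r)

delta = 0.1
ts = [delta + k * (1 - 2 * delta) / 400 for k in range(1, 400)]
vals = [G(t) for t in ts]

# concavity: all second differences are nonpositive
second_diffs = [vals[i - 1] - 2 * vals[i] + vals[i + 1] for i in range(1, len(vals) - 1)]
assert all(d <= 1e-12 for d in second_diffs)

# hence the minimum over the grid sits at an endpoint
assert min(vals) >= min(vals[0], vals[-1]) - 1e-12
print("G is concave on (delta, 1 - delta); its minimum is at an endpoint")
```

The same script, run with r replaced by −r as in the two-interval case below, exhibits the convexity of the companion function H.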

With the appropriate t, the improvement B_t is also a finite union of disjoint intervals, ∪_{i≤n} J_i', with either J_1' := ∅ or J_1' := [−∞, x^+].

Now repeat the argument, replacing J_2' by an interval that abuts either J_1' or J_3'. And so on. After at most n such operations, A is replaced by either a single semi-infinite interval (the desired minimizing set; it doesn't matter whether the interval extends off to −∞ or to +∞) or a union D := [−∞, z^-] ∪ [z^+, ∞], for which γ_1 D = γ_1 A and γ_1 D^r ≤ γ_1 A^r.

For the second possibility we may assume that z^+ − z^- > 2r, to avoid the trivial case where D^r = R. The complement of D is an interval of the form (Φ^{-1}(t − δ), Φ^{-1}(t + δ)) for some t, where 2δ := γ_1(z^-, z^+). Calculations almost identical to those for the proof of Lemma <13>, with r replaced by −r, show that

      H(t) := Φ( Φ^{-1}(t + δ) − r ) − Φ( Φ^{-1}(t − δ) + r )

is a convex function of t. It achieves its maximum at one of the extreme positions, which corresponds to the transformation of D into a single semi-infinite interval. The proof for the one-dimensional case of Theorem <7> is complete.

(ii) Two dimensions

The 1-shifts along each section parallel to the x-axis transform the closed set A into another closed set B with γ_2 B = γ_2 A = Φ(α).


<14> Lemma. The set B is an improvement over A.

Proof. We need to show that γ_2 B^r ≤ γ_2 A^r. By Fubini, it is enough if we can show that γ_1(B^r)_y ≤ γ_1(A^r)_y for all y.

The section (B^r)_{y_1} through B^r at a fixed y_1 is a closed set. Consider the boundary point (x_1, y_1) in (B^r)_{y_1}, with x_1 as large as possible. By definition, there exists a point (x_0, y_0) in B that lies within a distance r of (x_1, y_1): if |x_1 − x_0| := ε and |y_1 − y_0| := δ then ε² + δ² ≤ r². (Both ε and δ depend on y_1, of course.) For each t ≥ 0 the point (x_1 − t, y_1) lies within a distance r of (x_0 − t, y_0), which also belongs to B. Thus B^r is of the form {(x, y) ∈ R² : x ≤ g_r(y)}, for some function g_r(·).

Write B_{y_0}[ε] and A_{y_0}[ε] for the one-dimensional ε-neighborhoods of the y_0-sections B_{y_0} and A_{y_0}. Notice that (A^r)_{y_1} contains the set G := A_{y_0}[ε] ⊗ {y_1}, because all points of G lie within a distance (ε² + δ²)^{1/2} ≤ r of A. Thus

      γ_1 (A^r)_{y_1} ≥ γ_1 A_{y_0}[ε]      because (A^r)_{y_1} ⊇ G

                      ≥ Φ(g(y_0) + ε)       from part (i), because γ_1 A_{y_0} = Φ(g(y_0))

                      ≥ Φ(g_r(y_1))         because g_r(y_1) = x_1 ≤ x_0 + ε ≤ g(y_0) + ε

                      = γ_1 (B^r)_{y_1}     by the definition of g_r.

It follows that B is an improvement over A.

To make precise the idea that a sequence of shifts can make A look more like a half-space, we need some way of measuring how close a set is to being a half-space.

The picture suggests a method. The idea is that there should be a cone C of directions that we can follow from each point of A without leaving A. (The jagged edges on A and the cone C are meant to suggest that both sets should be extended off to infinity; there is not enough room on the page to display the whole sets.) If the vertex angle θ of the cone were a full 180°, the set A would be a half-space (or the whole of R²). More formally, let us say that a set A has θ spread if there exists a cone C with vertex angle θ such that {x + y : x ∈ A, y ∈ C} ⊆ A. Call C the spreading cone.

For example, the set B produced by Lemma <14> has spread of at least zero, with spreading cone C := {(x, 0) : x ≤ 0}.


<15> Lemma. If a closed set A has θ spread, for some θ less than π, then there exists a shift S_u such that S_u A has (π + θ)/2 spread.

Proof. To make the picture easier to draw, I will assume that the axis of symmetry of the cone C points along the y-axis. The required shift then has u pointing along the x-axis.


Consider the sections through S_u A at heights y and y' := y + δ, with δ > 0. Both sections are intervals, (−∞, g(y)] and (−∞, g(y')], where γ_1 A_y := Φ(g(y)) and γ_1 A_{y'} := Φ(g(y')). Define ε := δ tan(θ/2). If x ∈ A_y then (x + t, y + δ) ∈ A for all t with |t| ≤ ε, because A has θ spread. That is, the section A_{y'} contains the one-dimensional neighborhood A_y[ε], which by part (i) has γ_1 measure at least Φ(g(y) + ε). That is,

      Φ(g(y')) = γ_1 A_{y'} ≥ γ_1 A_y[ε] ≥ Φ(g(y) + ε),

whence g(y') ≥ g(y) + ε. It follows that S_u A has spread at least (π + θ)/2. □

Repeated application of Lemma <15>, starting from A, produces improvements B_1, B_2, B_3, B_4, ..., with spreads at least 0, π/2, 3π/4, 7π/8, and so on. Each B_i has γ_2 measure equal to Φ(α) = γ_2 A, and γ_2 A^r ≥ γ_2 B_1^r ≥ γ_2 B_2^r ≥ .... We may even rotate each B_n so that its spreading cone has axis parallel to the x-axis.

The fact that B_n has spread π − 2ε_n, with ε_n → 0, and the fact that γ_2 B_n = Φ(α) together force B_n^r eventually to lie close to the half-space H := {(x, y) ∈ R² : x ≤ α + r}. More precisely, they force lim inf B_n^r ⊇ H, in the sense that each point of H must belong to B_n^r for all n large enough.

For if a point (x_0 + r, y_0) of H were not in B_n^r then the point (x_0, y_0) could not lie in B_n, which would ensure that no points of B_n lie outside the set

      D_n := {(x, y) : x ≤ x_0 + |y − y_0| tan ε_n}.

However, the set D_n converges to a half-space with γ_2 measure equal to Φ(x_0), which is strictly smaller than Φ(α) = γ_2 B_n. Eventually B_n must therefore contain points outside D_n, in which case (x_0, y_0) ∈ B_n and (x_0 + r, y_0) ∈ B_n^r. Fatou's Lemma then completes the proof of the two-dimensional isoperimetric assertion:

      γ_2 A^r ≥ lim inf γ_2 B_n^r ≥ γ_2 (lim inf B_n^r) ≥ γ_2 H = Φ(α + r).


(iii) More than two dimensions

A formal proof uses induction on the dimension, invoking the result from part (ii) to reduce the result for R^k to the result for R^{k-1}. Then another application of part (ii) reduces to R^{k-2}. And so on.

For simplicity of notation, I will explain only the reduction from R³ to R². Consider a closed subset A of R³ with γ_3 A := Φ(α). Write A_y for its y-section, so that A = ∪_y A_y ⊗ {y}. Define a function g by the equality Φ(g(y)) := γ_2 A_y. The closed set B := {(x, y, z) : x ≤ g(y)} has γ_2 B_y = γ_2 A_y for every y, and hence (Fubini) γ_3 B = γ_3 A. Call B a 2-shift of A.

The set B has all its z-sections equal to C := {(x, y) ∈ R² : x ≤ g(y)}. That is, B = C ⊗ {z ∈ R}. The closed set B^r has all its z-sections equal to the two-dimensional neighborhood C[r].

The proof that B improves upon A is almost identical to the proof of Lemma <14>. Indeed the picture from that proof can be reinterpreted as a z-section of the (three-dimensional) picture for the present proof. Only small changes in wording are needed. In fact, here is a repeat of the argument, with changes indicated in boldface:

We need to show that γ_3 B^r ≤ γ_3 A^r. By Fubini, it is enough if we can show that γ_2(B^r)_y ≤ γ_2(A^r)_y for all y.

The section (B^r)_{y_1} through B^r at a fixed y_1 is a closed set. Consider the boundary point (x_1, y_1, z_1) in (B^r)_{y_1}, with x_1 as large as possible. By definition, there exists a point (x_0, y_0, z_0), with z_0 = z_1, in B that lies within a distance r of (x_1, y_1, z_1): if |x_1 − x_0| := ε and |y_1 − y_0| := δ then ε² + δ² ≤ r². (Both ε and δ depend on y_1, of course.) For each t ≥ 0 the point (x_1 − t, y_1, z_1) lies within a distance r of (x_0 − t, y_0, z_0), which also belongs to B. Thus B^r is of the form {(x, y, z) ∈ R³ : x ≤ g_r(y), −∞ < z < ∞}, for some function g_r(·).

Write B_{y_0}[ε] and A_{y_0}[ε] for the two-dimensional ε-neighborhoods of the y_0-sections B_{y_0} and A_{y_0}. ...

And so on.

The set B := C ⊗ {z ∈ R} is not a half-space, but it can be transformed into one by means of a second 2-shift, this time with sections taken orthogonal to the z-axis. The isoperimetric theorem for R³ then follows.

The argument for higher dimensions is similar.

6. Problems

[1] Let V_0 and V_1 be positive definite matrices. For 0 ≤ θ ≤ 1 let g_θ denote the N(0, (1 − θ)V_0 + θV_1) density on R^n. Show that

      ∂g_θ/∂θ = ½ Σ_{j,k} A_jk ∂²g_θ/∂x_j∂x_k,      where A := V_1 − V_0,

by following these steps.

Page 301: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

286 Chapter 12: Multivariate normal distributions

(i) Use Fourier inversion to show that g_θ(x) = (2π)^{-n} ∫ exp(−ix't − ½t'V_θ t) dt, where V_θ := θA + V_0.

(ii) Justify differentiation under the integral sign to show that

      ∂g_θ(x)/∂θ = (2π)^{-n} ∫ (∂/∂θ) exp(−ix't − ½t'(θA + V_0)t) dt

                 = (2π)^{-n} ∫ −½ t'At exp(−ix't − ½t'(θA + V_0)t) dt

and

      ∂²g_θ(x)/∂x_j∂x_k = (2π)^{-n} ∫ (−it_j)(−it_k) exp(−ix't − ½t'(θA + V_0)t) dt.

(iii) Collect terms.
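The identity asserted in Problem [1] can be spot-checked by finite differences. A sketch for n = 2, with matrices V_0, V_1 of my own choosing:

```python
import math

# Hypothetical 2x2 example: V0 = I, V1 = [[2, .8], [.8, 1.5]]; A := V1 - V0.
V0 = [[1.0, 0.0], [0.0, 1.0]]
V1 = [[2.0, 0.8], [0.8, 1.5]]
A = [[V1[i][j] - V0[i][j] for j in range(2)] for i in range(2)]

def g(theta, x, y):
    # N(0, (1-theta)V0 + theta V1) density at (x, y), written out by hand
    a = (1 - theta) * V0[0][0] + theta * V1[0][0]
    b = (1 - theta) * V0[0][1] + theta * V1[0][1]
    c = (1 - theta) * V0[1][1] + theta * V1[1][1]
    det = a * c - b * b
    q = (c * x * x - 2 * b * x * y + a * y * y) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

theta, x, y, h = 0.3, 0.7, -0.4, 1e-4
lhs = (g(theta + h, x, y) - g(theta - h, x, y)) / (2 * h)   # d g / d theta

d11 = (g(theta, x + h, y) - 2 * g(theta, x, y) + g(theta, x - h, y)) / h**2
d22 = (g(theta, x, y + h) - 2 * g(theta, x, y) + g(theta, x, y - h)) / h**2
d12 = (g(theta, x + h, y + h) - g(theta, x + h, y - h)
       - g(theta, x - h, y + h) + g(theta, x - h, y - h)) / (4 * h**2)
rhs = 0.5 * (A[0][0] * d11 + 2 * A[0][1] * d12 + A[1][1] * d22)

print(lhs, rhs)
assert abs(lhs - rhs) < 1e-5
```

The two sides agree to within finite-difference error, as the Fourier argument in steps (i) and (ii) predicts.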

[2] Suppose X_n has a multivariate normal distribution, and X_n ⇝ X. Show that X also has a multivariate normal distribution. Hint: Reduce to the one-dimensional case by working with linear combinations. Also show that if N(μ_n, σ_n²) converges then both {μ_n} and {σ_n} must be bounded. Argue along subsequences.

[3] Let Z_1, Z_2, ... be independent random variables, each distributed N(0, 1). Show that the distribution of M_n := max_{i≤n} Z_i concentrates largely in a range of order (log n)^{-1/2} by following these steps. Define α_n := (2 log n)^{1/2} and L_n := ½ log log n.

(i) Remember (Appendix D) that the tail function Φ̄(x) := P{N(0, 1) > x} decreases like φ(x)/x as x → ∞, in the sense that the ratio of the two functions tends to 1. For a fixed constant C, define x_n := α_n − (C + L_n)/α_n. Show that log(nΦ̄(x_n)) converges to C_0 := C − ½ log(4π) as n → ∞.

(ii) Deduce that log P{M_n ≤ x_n} = n log(1 − Φ̄(x_n)) → −e^{C_0}.

(iii) Deduce that α_n(M_n − α_n) + L_n converges in distribution as n → ∞.
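Step (i) lends itself to a quick deterministic check (the constant C = 1 is an arbitrary choice of mine):

```python
from math import erfc, log, pi, sqrt

Phi_bar = lambda x: 0.5 * erfc(x / sqrt(2))   # P{N(0,1) > x}

C = 1.0
C0 = C - 0.5 * log(4 * pi)      # the claimed limit of log(n * Phi_bar(x_n))

errs = []
for n in (10**4, 10**8, 10**12, 10**16):
    a_n = sqrt(2 * log(n))
    L_n = 0.5 * log(log(n))
    x_n = a_n - (C + L_n) / a_n
    val = log(n * Phi_bar(x_n))
    errs.append(abs(val - C0))
    print(f"n=1e{round(log(n, 10))}  log(n*Phi_bar(x_n)) = {val:.4f}  (C0 = {C0:.4f})")

assert errs[-1] < errs[0]       # the error shrinks as n grows
assert errs[-1] < 0.2
```

The convergence is only logarithmic in n, which is why the range of n in the sketch is so extreme.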

[4] Let X and Z be random variables for which both P{X > x} ≤ P{Z > x} and P{X < −x} ≤ P{Z < −x}, for all x ≥ 0. Show that P exp(tX) ≤ P exp(tZ) for all nonnegative t. Hint: Consider P ∫_{−∞}^{∞} t exp(tx) {X > x} dx.

[5] Let A be a closed subset of R², with sections A_y := {x ∈ R : (x, y) ∈ A} having Gaussian measure γ_1 A_y =: Φ(g(y)). Let B denote the 1-shift {(x, y) : x ≤ g(y)}, as in Lemma <14>.

(i) If y_n → y, show that lim sup A_{y_n} ⊆ A_y, in the sense of pointwise convergence of indicator functions. (If x ∈ A_{y_n} infinitely often, deduce that (x, y) ∈ A.)

(ii) If y_n → y, use Fatou's lemma to prove that lim sup g(y_n) ≤ g(y).

(iii) If (x_n, y_n) ∈ B and (x_n, y_n) → (x, y), show that x ≤ g(y), that is, (x, y) ∈ B.

[6] Reduce the one-dimensional version of Theorem <7> to the case where A is a finite union of intervals, by following these steps. Let Φ(α) := γ_1 A.

(i) Show that it is enough to prove that γ_1 A^{r+δ} ≥ Φ(r + α) for each δ > 0, because A^{r+δ} ↓ A^r as δ decreases to zero.


(ii) Define an open set G := {x : d(x, A) < δ}. Show that γ_1 G > Φ(α). Hint: The set G\A is open.

(iii) Show that there exists a countable family {I_i} of disjoint closed intervals for which γ_1(G \ ∪_i I_i) = 0.

(iv) Choose N so that the closed set A_N := ∪_{i≤N} I_i has γ_1 measure greater than Φ(α). Show that the assertion of the Theorem for A_N implies that γ_1 A^{r+δ} ≥ Φ(r + α).

[7] (Slepian's (1962) inequality) Let X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) both have centered multivariate normal distributions with var(X_j) = var(Y_j) for each j and PX_jX_k ≤ PY_jY_k for all j ≠ k. Prove that P ∪_j {X_j > α_j} ≥ P ∪_j {Y_j > α_j} for all real numbers α_1, ..., α_n. Hint: Use equality <3> to show that m^x( Π_{i≤n} {x_i ≤ α_i} g_θ(x) ) is an increasing function of θ.
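A Monte Carlo sketch of the stated inequality, with two equicorrelated covariance matrices of my own choosing (same variances, Y more strongly correlated):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samp = 400_000

# Same variances; Y has larger cross-covariances than X.
cov_X = np.array([[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]])
cov_Y = np.array([[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]])

X = rng.multivariate_normal(np.zeros(3), cov_X, size=n_samp)
Y = rng.multivariate_normal(np.zeros(3), cov_Y, size=n_samp)

alpha = np.array([0.2, 0.5, 1.0])
p_X = np.mean((X > alpha).any(axis=1))   # P( union_j {X_j > alpha_j} )
p_Y = np.mean((Y > alpha).any(axis=1))

print(p_X, p_Y)
assert p_X > p_Y
```

More positive correlation makes the coordinates move together, so at least one large coordinate becomes less likely, exactly as Slepian's inequality asserts.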

[8] Let S be a nonnegative random variable with P{S > M + r} ≤ C_0 exp(−½r²/σ²) for all r ≥ 0, for positive constants C_0, M, and σ².

(i) Show that P exp(αS²) = 1 + ∫_0^∞ 2yα exp(αy²) P{S > y} dy.

(ii) For α < 1/(2σ²), prove that P exp(αS²) < ∞.
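The identity in (i) can be checked against a case with closed forms: for S := |N(0, 1)| we have S² ~ χ²_1, so P exp(αS²) = (1 − 2α)^{-1/2}. A sketch using a simple trapezoid rule (the choice α = 0.2 and the truncation point are mine):

```python
from math import erfc, exp, sqrt

alpha = 0.2                               # alpha < 1/(2 sigma^2) = 1/2 here
P_tail = lambda y: erfc(y / sqrt(2))      # P{|N(0,1)| > y}

# left side in closed form: E exp(alpha * S^2) with S^2 ~ chi-squared(1)
lhs = (1 - 2 * alpha) ** -0.5

# right side: 1 + integral_0^inf 2 y alpha exp(alpha y^2) P{S > y} dy
h, upper = 1e-3, 12.0                     # integrand is negligible beyond y = 12
ys = [k * h for k in range(1, int(upper / h))]
integrand = [2 * y * alpha * exp(alpha * y * y) * P_tail(y) for y in ys]
rhs = 1 + h * (sum(integrand) - 0.5 * integrand[-1])   # trapezoid; integrand(0) = 0

print(lhs, rhs)
assert abs(lhs - rhs) < 1e-3
```

The integrand decays like exp(−(½ − α)y²), which is exactly why (ii) requires α < 1/(2σ²).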

7. Notes

There is a huge amount of literature on Gaussian process theory, with which I am only partially acquainted. I took the proof of the Sudakov minoration (Example <2>) from Fernique (1975, page 27). According to Dudley (1999, notes to Section 2.3), a stronger, but incorrect, result was first stated by Sudakov (1971). The result in Example <10> is due to Marcus & Shepp (1972), who sharpened earlier results of Landau & Shepp (1970) and Fernique.

See Dudley (1973) for a detailed discussion of sample path properties of Gaussian processes, particularly regarding entropy characterizations of boundedness or continuity of paths. Dudley (1999, Chapter 2) contains much interesting information about Gaussian processes and their recent history. I also found the books by Jain & Marcus (1978), Adler (1990, Section II), and Lifshits (1995) useful references.

The Lifshits book contains an exposition of Ehrhard's method, similar to the one in Section 5, and proofs (Section 14) of the Fernique and Slepian inequalities. See also the notes in that section of his book for a discussion of related work of Schläfli and the contributions of Sudakov.

Borell's isoperimetric inequality was, apparently, also proved by similar methods by Tsirel'son and Sudakov in a 1974 paper, which I have not seen. Tsirel'son (1975) mentioned the result and the method. The notes of Ledoux (1996, Section 4) discuss the isoperimetric inequality in great detail.

REFERENCES

Adler, R. J. (1990), An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes, Vol. 12 of Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, CA.

Billingsley, P. (1986), Probability and Measure, second edn, Wiley, New York.

Bobkov, S. (1996), 'Extremal properties of half-spaces for log-concave distributions', Annals of Probability 24, 35-48.

Borell, C. (1975), 'The Brunn-Minkowski inequality in Gauss space', Inventiones Math. 30, 207-216.

Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of Probability 1, 66-103.

Dudley, R. M. (1999), Uniform Central Limit Theorems, Cambridge University Press.

Ehrhard, A. (1983a), 'Un principe de symétrisation dans les espaces de Gauss', Springer Lecture Notes in Mathematics 990, 92-101.

Ehrhard, A. (1983b), 'Symétrisation dans l'espace de Gauss', Mathematica Scandinavica 53, 281-301.

Fernique, X. (1975), 'Régularité des trajectoires des fonctions aléatoires gaussiennes', Springer Lecture Notes in Mathematics 480, 1-97.

Jain, N. C. & Marcus, M. B. (1978), Advances in probability, in J. Kuelbs, ed., 'Probability in Banach Spaces', Vol. 4, Dekker, New York, pp. 81-196.

Landau, H. J. & Shepp, L. A. (1970), 'On the supremum of a Gaussian process', Sankhyā: The Indian Journal of Statistics, Series A 32, 369-378.

Ledoux, M. (1996), 'Isoperimetry and Gaussian analysis', Springer Lecture Notes in Mathematics 1648, 165-294.

Lifshits, M. A. (1995), Gaussian Random Functions, Kluwer.

Marcus, M. B. & Shepp, L. A. (1972), 'Sample behavior of Gaussian processes', Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability 2, 423-441.

Slepian, D. (1962), 'The one-sided barrier problem for Gaussian noise', Bell System Technical Journal 41, 463-501.

Sudakov, V. N. (1971), 'Gaussian random processes and measures of solid angles in Hilbert space', Soviet Math. Doklady 12, 412-415.

Tsirel'son, V. S. (1975), 'The density of the distribution of the maximum of a Gaussian process', Theory of Probability and Its Applications 20, 847-856.


Appendix A

Measures and integrals

SECTION 1 introduces a method for constructing a measure by inner approximation, starting from a set function defined on a lattice of sets.

SECTION 2 defines a "tightness" property, which ensures that a set function has an extension to a finitely additive measure on a field determined by the class of approximating sets.

SECTION 3 defines a "sigma-smoothness" property, which ensures that a tight set function has an extension to a countably additive measure on a sigma-field.

SECTION 4 shows how to extend a tight, sigma-smooth set function from a lattice to its closure under countable intersections.

SECTION 5 constructs Lebesgue measure on Euclidean space.

SECTION 6 proves a general form of the Riesz representation theorem, which expresses linear functionals on cones of functions as integrals with respect to countably additive measures.

1. Measures and inner measure

Recall the definition of a countably additive measure on a sigma-field. A sigma-field A on a set X is a class of subsets of X with the following properties.

(SF_1) The empty set ∅ and the whole space X both belong to A.

(SF_2) If A belongs to A then so does its complement A^c.

(SF_3) For countable {A_i : i ∈ N} ⊆ A, both ∪_i A_i and ∩_i A_i are also in A.

A function μ defined on the sigma-field A is called a countably additive (nonnegative) measure if it has the following properties.

(M_1) μ∅ = 0 ≤ μA ≤ ∞ for each A in A.

(M_2) μ(∪_i A_i) = Σ_i μA_i for sequences {A_i : i ∈ N} of pairwise disjoint sets from A.

If property SF_3 is weakened to require stability only under finite unions and intersections, the class is called a field. If property M_2 is weakened to hold only for disjoint unions of finitely many sets from A, the set function is called a finitely additive measure.

Where do measures come from? Typically one starts from a nonnegative real-valued set function μ defined on a small class of sets K_0, then extends to a sigma-field A containing K_0. One must at least assume "measure-like" properties for μ on K_0 if such an extension is to be possible. At a bare minimum,


(M_0) μ is an increasing map from K_0 into R⁺ for which μ∅ = 0.

Note that we need K_0 to contain ∅ for M_0 to make sense. I will assume that M_0 holds throughout this Appendix. As a convenient reminder, I will also reserve the name set function on K_0 for those μ that satisfy M_0.

The extension can proceed by various approximation arguments. In the first three Sections of this Appendix, I will describe only the method based on approximation of sets from inside. Although not entirely traditional, the method has the advantage that it leads to measures with a useful approximation property called K_0-regularity:

      μA = sup{μK : A ⊇ K ∈ K_0}      for each A in A.

REMARK. When K_0 consists of compact sets, a measure with the inner regularity property is often called a Radon measure.

The desired regularity property makes it clear how the extension of μ must be constructed, namely, by means of the inner measure μ_*, defined for every subset A of X by μ_*A := sup{μK : A ⊇ K ∈ K_0}.

In the business of building measures it pays to start small, imposing as few conditions on the initial domain K_0 as possible. The conditions are neatly expressed by means of some picturesque terminology. Think of X as a large expanse of muddy lawn, and think of subsets of X as paving stones to lay on the ground, with overlaps permitted. Then a collection of subsets of X would be a paving for X. The analogy might seem far-fetched, but it gives a concise way to describe properties of various classes of subsets. For example, a field is nothing but a (∅, ∪f, ∩f, c) paving, meaning that it contains the empty set and is stable under the formation of finite unions (∪f), finite intersections (∩f), and complements (c). A (∅, ∪c, ∩c, c) paving is just another name for a sigma-field; the ∪c and ∩c denote countable unions and intersections. With inner approximations the natural assumption is that K_0 be at least a (∅, ∪f, ∩f) paving, that is, a lattice of subsets.

REMARK. Note well. A lattice is not assumed to be stable under differences or the taking of complements. Keep in mind the prime example, where K_0 denotes the class of compact subsets of a (Hausdorff) topological space, such as the real line. Inner approximation by compact sets has turned out to be a good thing for probability theory.

For a general lattice K_0, the role of the closed sets (remember F for fermé) is played by the class F(K_0) of all subsets F for which FK ∈ K_0 for every K in K_0. (Of course K_0 ⊆ F(K_0). The inclusion is proper if X ∉ K_0.) The sigma-field B(K_0) generated by F(K_0) will play the role of the Borel sigma-field.

The first difficulty along the path leading to countably additive measures lies in the choice of the sigma-field 𝒜, in order that the restriction of μ∗ to 𝒜 has the desired countable additivity properties. The Carathéodory splitting method identifies a suitable class of sets by means of an apparently weak substitute for the finite additivity property. Define 𝒮₀ as the class of all subsets S of 𝔛 for which

<1>   μ∗A = μ∗(AS) + μ∗(ASᶜ) for all subsets A of 𝔛.


If S ∈ 𝒮₀ then μ∗ adds the measures of the disjoint sets AS and ASᶜ correctly. As far as μ∗ is concerned, S splits the set A "properly."

<2> Lemma. The class 𝒮₀ of all subsets S with the property <1> is a field. The restriction of μ∗ to 𝒮₀ is a finitely additive measure.

Proof. Trivially 𝒮₀ contains the empty set (because μ∗∅ = 0) and it is stable under the formation of complements. To establish the field property it suffices to show that 𝒮₀ is stable under finite intersections.

Suppose S and T belong to 𝒮₀. Let A be an arbitrary subset of 𝔛. Split A into two pieces using S, then split each of those two pieces using T. From the defining property of 𝒮₀,

μ∗A = μ∗(AS) + μ∗(ASᶜ)

     = μ∗(AST) + μ∗(ASTᶜ) + μ∗(ASᶜT) + μ∗(ASᶜTᶜ).

Decompose A(ST)ᶜ similarly to see that the last three terms sum to μ∗(A(ST)ᶜ). The intersection ST splits A correctly; the class 𝒮₀ contains ST; the class is a field. If ST = ∅, choose A := S ∪ T to show that the restriction of μ∗ to 𝒮₀ is finitely additive.
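The splitting condition is easy to experiment with on a finite set. The sketch below (a toy example of mine, not from the text; the names `inner` and `splits_properly` are hypothetical) takes a two-element lattice 𝒦₀ on X = {0, 1, 2, 3}, builds the inner measure μ∗ by brute force, and tests which sets split every A properly: {0, 1} passes, but {0} fails at A = {0, 1}.

```python
from itertools import combinations

def subsets(xs):
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(sorted(xs), r)]

X = frozenset({0, 1, 2, 3})
# Toy paving K0 with a set function mu on it (made-up values).
K0 = {frozenset(): 0.0, frozenset({0, 1}): 1.0}

def inner(A):
    # mu_* A := sup{ mu K : A contains K, K in K0 }
    return max(m for K, m in K0.items() if K <= A)

def splits_properly(S):
    # Caratheodory condition: mu_* A = mu_*(AS) + mu_*(A S^c) for all A in X
    return all(abs(inner(A) - inner(A & S) - inner(A - S)) < 1e-12
               for A in subsets(X))

print(splits_properly(frozenset({0, 1})))  # True: {0,1} splits every A
print(splits_properly(frozenset({0})))     # False: fails at A = {0,1}
```

Note how the failure of {0} comes exactly from the lack of inner approximating sets for {0} and {1} separately, the situation that 𝒦₀-tightness is designed to rule out for members of 𝒦₀.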

At the moment there is no guarantee that 𝒮₀ includes all the members of 𝒦₀, let alone all the members of ℬ(𝒦₀). In fact, the Lemma has nothing to do with the choice of μ and 𝒦₀ beyond the fact that μ∗(∅) = 0. To ensure that 𝒮₀ ⊇ 𝒦₀ we must assume that μ has a property called 𝒦₀-tightness, an analog of finite additivity that compensates for the fact that the difference of two 𝒦₀ sets need not belong to 𝒦₀. Section 2 explains 𝒦₀-tightness. Section 3 adds the assumptions needed to make the restriction of μ∗ to 𝒮₀ a countably additive measure.

2. Tightness

If 𝒮₀ is to contain every member of 𝒦₀, every set K ∈ 𝒦₀ must split every set K₁ ∈ 𝒦₀ properly, in the sense of <1>: μK₁ = μ∗(K₁K) + μ∗(K₁\K). Writing K₀ for K₁K, we then have the following property as a necessary condition for 𝒦₀ ⊆ 𝒮₀. It will turn out that the property is also sufficient.

<3> Definition. Say that a set function μ on 𝒦₀ is 𝒦₀-tight if μK₁ = μK₀ + μ∗(K₁\K₀) for all pairs of sets in 𝒦₀ with K₁ ⊇ K₀.

The intuition is that there exists a set K ∈ 𝒦₀ that almost fills out K₁\K₀, in the sense that μK ≈ μK₁ − μK₀. More formally, for each ε > 0 there exists a K_ε ∈ 𝒦₀ with K_ε ⊆ K₁\K₀ and μK_ε ≥ μK₁ − μK₀ − ε. As a convenient abbreviation, I will say that such a K_ε fills out the difference K₁\K₀ within an ε.

Tightness is as close as we come to having 𝒦₀ stable under proper differences. It implies a weak additivity property: if K and H are disjoint members of 𝒦₀ then μ(H ∪ K) = μH + μK, because the supremum in the definition of μ∗((H ∪ K)\K) is achieved by H. Additivity for disjoint 𝒦₀-sets implies superadditivity for the inner measure,

<4>   μ∗(A ∪ B) ≥ μ∗A + μ∗B for all disjoint A and B,

because the union of each inner approximating H for A and each inner approximating K for B is an inner approximating set for A ∪ B. Tightness also gives us a way to relate 𝒮₀ to 𝒦₀.

<5> Lemma. Let 𝒦₀ be a (∅, ∪f, ∩f) paving, and let μ be a 𝒦₀-tight set function. Then

(i) S ∈ 𝒮₀ if and only if μK ≤ μ∗(KS) + μ∗(K\S) for all K in 𝒦₀;

(ii) the field 𝒮₀ contains the field generated by ℱ(𝒦₀).

Proof. Take a supremum in (i) over all K ⊆ A to get μ∗A ≤ μ∗(AS) + μ∗(A\S). The superadditivity property <4> gives the reverse inequality.

If S ∈ ℱ(𝒦₀) and K ∈ 𝒦₀, the pair K₁ := K and K₀ := KS are candidates for the tightness equality, μK = μ(KS) + μ∗(K\S), implying the inequality in (i).

3. Countable additivity

Countable additivity ensures that measures are well behaved under countable limit operations. To fit with the lattice properties of 𝒦₀, it is most convenient to insert countable additivity into the construction of measures via a limit requirement that has been called σ-smoothness in the literature. I will stick with that term, rather than invent a more descriptive term (such as σ-continuity from above), even though I feel that it conveys not quite the right image for a set function.

<6> Definition. Say that μ is σ-smooth (along 𝒦₀) at a set K in 𝒦₀ if μKₙ ↓ μK for every decreasing sequence of sets {Kₙ} in 𝒦₀ with intersection K.

REMARK. It is important that μ takes only (finite) real values for sets in 𝒦₀. If λ is a countably additive measure on a sigma-field 𝒜, and Aₙ ↓ A∞ with all Aᵢ in 𝒜, then we need not have λAₙ ↓ λA∞ unless λAₙ < ∞ for some n, as shown by the example of Lebesgue measure with Aₙ = [n, ∞) and A∞ = ∅.

Notice that the definition concerns only those decreasing sequences in 𝒦₀ for which ∩_{n∈ℕ} Kₙ ∈ 𝒦₀. At the moment, there is no presumption that 𝒦₀ be stable under countable intersections. As usual, the σ is to remind us of the restriction to countable families. There is a related property called τ-smoothness, which relaxes the assumption that there are only countably many Kₙ sets—see Problem [1].

Tightness simplifies the task of checking for σ-smoothness. The next proof is a good illustration of how one makes use of 𝒦₀-tightness and the fact that μ∗ has already been proven finitely additive on the field 𝒮₀.

<7> Lemma. If a 𝒦₀-tight set function on a (∅, ∪f, ∩f) paving 𝒦₀ is σ-smooth at ∅ then it is σ-smooth at every set in 𝒦₀.


Proof. Suppose Kₙ ↓ K∞, with all Kᵢ in 𝒦₀. Find an H ∈ 𝒦₀ that fills out the difference K₁\K∞ within ε. Write L for H ∪ K∞. Finite additivity of μ∗ on 𝒮₀ lets us break μKₙ into the sum

μKₙ = μK∞ + μ∗(KₙH) + μ∗(Kₙ\L).

The middle term decreases to zero as n → ∞ because KₙH ↓ K∞H = ∅. The last term is less than μ∗(K₁\L), which is less than ε, by construction.

If 𝒦₀ is stable under countable intersections, the σ-smoothness property translates easily into countable additivity for μ∗ as a set function on 𝒮₀.

<8> Theorem. Let 𝒦₀ be a lattice of subsets of 𝔛 that is stable under countable intersections, that is, a (∅, ∪f, ∩c) paving. Let μ be a 𝒦₀-tight set function on 𝒦₀, with associated inner measure μ∗A := sup{μK : A ⊇ K ∈ 𝒦₀}. Suppose μ is σ-smooth at ∅ (along 𝒦₀). Then

(i) the class

𝒮₀ := {S ⊆ 𝔛 : μK ≤ μ∗(KS) + μ∗(K\S) for all K in 𝒦₀}

is a sigma-field on 𝔛;

(ii) 𝒮₀ ⊇ ℬ(𝒦₀), the sigma-field generated by ℱ(𝒦₀);

(iii) the restriction of μ∗ to 𝒮₀ is a 𝒦₀-regular, countably additive measure;

(iv) 𝒮₀ is complete: if S₁ ⊇ B ⊇ S₀ with Sᵢ ∈ 𝒮₀ and μ∗(S₁\S₀) = 0 then B ∈ 𝒮₀.

Proof. From Lemma <5>, we know that 𝒮₀ is a field that contains ℱ(𝒦₀). To prove (i) and (ii), it suffices to show that the union S := ∪_{i∈ℕ} Tᵢ of a sequence of sets in 𝒮₀ also belongs to 𝒮₀, by establishing the inequality μK ≤ μ∗(KS) + μ∗(K\S), for each choice of K in 𝒦₀.

Write Sₙ for ∪_{i≤n} Tᵢ. For a fixed ε > 0 and each i, choose a 𝒦₀-subset Kᵢ of K\Sᵢ for which μKᵢ ≥ μ∗(K\Sᵢ) − ε/2ⁱ. Define Lₙ := ∩_{i≤n} Kᵢ. Then, by the finite additivity of μ∗ on 𝒮₀,

μ∗(K\Sₙ) − μLₙ ≤ Σ_{i≤n} (μ∗(K\Sᵢ) − μKᵢ) ≤ ε.

The sequence of sets {Lₙ} decreases to a 𝒦₀-subset L∞ of K\S. By the σ-smoothness at L∞ we have μLₙ ≤ μL∞ + ε ≤ μ∗(K\S) + ε, for n large enough, whence

μK ≤ μ∗(KSₙ) + μ∗(K\Sₙ)   because Sₙ ∈ 𝒮₀
   ≤ μ∗(KS) + μLₙ + ε
   ≤ μ∗(KS) + μ∗(K\S) + 2ε,

which gives μK ≤ μ∗(KS) + μ∗(K\S). It follows that S ∈ 𝒮₀.

When K ⊆ S, the inequality μK ≤ μ∗(KSₙ) + μ∗(K\S) + 2ε and the finite additivity of μ∗ on 𝒮₀ imply μK ≤ Σ_{i≤n} μ∗(KTᵢ) + 2ε. Take the supremum over all 𝒦₀-subsets of S, let n tend to infinity, then ε tend to zero, to deduce


that μ∗S ≤ Σ_{i∈ℕ} μ∗Tᵢ. The reverse inequality follows from the superadditivity property <4>. The set function μ∗ is countably additive on the sigma-field 𝒮₀.

For (iv), note that μK = μ∗(KS₀) + μ∗(K\S₀), which is smaller than

μ∗(KB) + μ∗(K\S₁) + μ∗(KS₁S₀ᶜ) ≤ μ∗(KB) + μ∗(K\B) + 0,

for every K in 𝒦₀.

In one particularly important case we get σ-smoothness for free, without any extra assumptions on the set function μ. A paving 𝒦₀ is said to be compact (in the sense of Marczewski 1953) if: to each countable collection {Kᵢ : i ∈ ℕ} of sets from 𝒦₀ with ∩_{i∈ℕ} Kᵢ = ∅ there is some finite n for which ∩_{i≤n} Kᵢ = ∅. In particular, if Kᵢ ↓ ∅ then Kₙ = ∅ for some n. For such a paving, the σ-smoothness property places no constraint on μ beyond the standard assumption that μ∅ = 0.

<9> Example. Let 𝒦₀ be a collection of closed, compact subsets of a topological space 𝔛. Suppose {K_α : α ∈ A} is a subcollection of 𝒦₀ for which ∩_{α∈A} K_α = ∅. Arbitrarily choose an α₀ from A. The collection of open sets G_α := K_αᶜ for α ∈ A covers the compact set K_{α₀}. By the definition of compactness, there exists a finite subcover. That is, for some α₁, ..., α_m we have K_{α₀} ⊆ ∪_{i=1}^m G_{α_i} = (∩_{i=1}^m K_{α_i})ᶜ. Thus ∩_{i=0}^m K_{α_i} = ∅. In particular, 𝒦₀ is also compact in the Marczewski sense.

REMARK. Notice that the Marczewski concept involves only countable subcollections of 𝒦₀, whereas the topological analog from Example <9> applies to arbitrary subcollections. The stronger property turns out to be useful for proving τ-smoothness, a property stronger than σ-smoothness. See Problem [1] for the definition of τ-smoothness.

4. Extension to the ∩c-closure

If 𝒦₀ is not stable under countable intersections, σ-smoothness is not quite enough to make μ∗ countably additive on 𝒮₀. We must instead work with a slightly richer approximating class, derived from 𝒦₀ by taking its ∩c-closure: the class 𝒦 of all intersections of countable subcollections from 𝒦₀. Clearly 𝒦 is stable under countable intersections. Also stability under finite unions is preserved, because

(∩_{i∈ℕ} Hᵢ) ∪ (∩_{j∈ℕ} Kⱼ) = ∩_{(i,j)∈ℕ×ℕ} (Hᵢ ∪ Kⱼ),

a countable intersection of sets from 𝒦₀. Note also that if 𝒦₀ is a compact paving then so is 𝒦.
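The displayed identity is ordinary set distributivity, and it can be spot-checked mechanically on finite families. A minimal sketch (the particular sets are made up for illustration):

```python
# Check (intersection of H_i) ∪ (intersection of K_j)
#       = intersection over all pairs (i, j) of (H_i ∪ K_j).
H = [{1, 2, 3}, {2, 3, 4}, {2, 3}]
K = [{3, 5}, {3, 5, 6}]

lhs = set.intersection(*H) | set.intersection(*K)
rhs = set.intersection(*[h | k for h in H for k in K])
assert lhs == rhs == {2, 3, 5}
```

The same index bookkeeping, with ℕ × ℕ in place of the finite index sets, is what keeps the right-hand side a countable intersection.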

The next Lemma shows that the natural extension of μ to a set function on 𝒦 inherits the desirable σ-smoothness and tightness properties.

<10> Lemma. Let μ be a 𝒦₀-tight set function on a (∅, ∪f, ∩f) paving 𝒦₀, which is σ-smooth along 𝒦₀ at ∅. Then the extension μ̃ of μ to the ∩c-closure 𝒦, defined by

<11>   μ̃H := inf{μK : H ⊆ K ∈ 𝒦₀} for H ∈ 𝒦,

is 𝒦-tight and σ-smooth (along 𝒦) at ∅.


Proof. Lemma <7> gives us a simpler expression for μ̃. If {Kₙ : n ∈ ℕ} ⊆ 𝒦₀ and Kₙ ↓ L ∈ 𝒦, then μ̃L = infₙ μKₙ, because, for each 𝒦₀-subset K with K ⊇ L, we have μKₙ ≤ μ(Kₙ ∪ K) ↓ μK, by σ-smoothness of μ at K.

The σ-smoothness at ∅ is easy. If Hₙ := ∩_{j∈ℕ} K_{n,j} ∈ 𝒦 and Hₙ ↓ ∅ then the sets Kₙ := ∩_{i≤n, j≤n} K_{i,j} belong to 𝒦₀, and Hₙ = H₁H₂⋯Hₙ ⊆ Kₙ ↓ ∅. It follows that μ̃Hₙ ≤ μKₙ ↓ 0.

The 𝒦-tightness is slightly trickier. Suppose H₁ ⊇ H₀, with both sets in 𝒦. Let {Kₙ} be a decreasing sequence of sets in 𝒦₀ with intersection H₁. For a fixed ε > 0, choose a K in 𝒦₀ with K ⊇ H₀ and μK ≤ μ̃H₀ + ε. With no loss of generality we may assume that K ⊆ K₁. Invoke 𝒦₀-tightness to find a 𝒦₀-subset L of K₁\K for which μL ≥ μK₁ − μK − ε ≥ μK₁ − μ̃H₀ − 2ε. Notice that the sequence {LKₙ} decreases to LH₁, a 𝒦-subset of H₁\K ⊆ H₁\H₀. The finite additivity of μ∗, when restricted to 𝒮₀, gives

μ(LKₙ) = μL + μKₙ − μ(L ∪ Kₙ)
       ≥ μL + μKₙ − μK₁
       → μL + μ̃H₁ − μK₁ as n → ∞
       ≥ μ̃H₁ − μ̃H₀ − 2ε, as required for 𝒦-tightness.

REMARK. It is helpful, but not essential, to have a different symbol for the extension of μ to a larger domain while we are establishing properties for that extension. For example, it reminds us not to assume that μ̃ has the same properties as μ before we have proved as much. Once the result is proven, the μ̃ has served its purpose, and it can then safely be replaced by μ.

A similar argument might be made about the distinction between 𝒦₀ and 𝒦, but there is some virtue in retaining the subscript as a reminder that 𝒦₀ is assumed stable only under finite intersections.

Together, Theorem <8> and Lemma <10> give a highly useful extension theorem for set functions defined initially on a lattice of subsets.

<12> Theorem. Let 𝒦₀ be a (∅, ∪f, ∩f) paving of subsets of 𝔛, and let 𝒦 denote its ∩c-closure. Let μ : 𝒦₀ → ℝ⁺ be a 𝒦₀-tight set function that is σ-smooth along 𝒦₀ at ∅. Then μ has a unique extension to a complete, 𝒦-regular, countably additive measure on a sigma-field 𝒮, defined by

μK := inf{μK₀ : K ⊆ K₀ ∈ 𝒦₀} for K ∈ 𝒦,

μS := sup{μK : S ⊇ K ∈ 𝒦} for S ∈ 𝒮.

The sigma-field 𝒮 contains all sets F for which FK ∈ 𝒦 for all K in 𝒦. In particular, 𝒮 ⊇ 𝒦 ⊇ 𝒦₀.

REMARK. Remember: σ-smoothness is automatic if 𝒦₀ is a compact paving.

5. Lebesgue measure

There are several ways in which to construct Lebesgue measure on ℝᵏ. The following method for ℝ² is easily extended to other dimensions.


Take 𝒦₀ to consist of all finite unions of semi-open rectangles (α₁, β₁] ⊗ (α₂, β₂]. Each difference of two semi-open rectangles can be written as a disjoint union of at most eight similar rectangles. As a consequence, every member of 𝒦₀ has a representation as a finite union of disjoint semi-open rectangles, and 𝒦₀ is stable under the formation of differences. The initial definition of Lebesgue measure m, as a set function on 𝒦₀, might seem obvious—add up the areas of the disjoint rectangles. It is a surprisingly tricky exercise to prove rigorously that m is well defined and finitely additive on 𝒦₀.
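One standard device behind the well-definedness proof is to refine any finite union of semi-open rectangles to a common coordinate grid, on which every representation decomposes into the same disjoint cells. The sketch below is my own illustration of that device (the function `area_of_union` is not from the text):

```python
def area_of_union(rects):
    """Area of a union of semi-open rectangles (a1, b1] x (a2, b2],
    each given as a tuple (a1, b1, a2, b2), via a common grid refinement."""
    xs = sorted({v for a1, b1, a2, b2 in rects for v in (a1, b1)})
    ys = sorted({v for a1, b1, a2, b2 in rects for v in (a2, b2)})
    total = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            # A grid cell lies inside the union iff its midpoint does.
            px, py = (x0 + x1) / 2, (y0 + y1) / 2
            if any(a1 < px <= b1 and a2 < py <= b2
                   for a1, b1, a2, b2 in rects):
                total += (x1 - x0) * (y1 - y0)
    return total

# Two overlapping unit squares sharing a strip of width 0.5:
print(area_of_union([(0, 1, 0, 1), (0.5, 1.5, 0, 1)]))  # 1.5
```

Because any two representations of the same member of 𝒦₀ refine to the same grid cells, the computed total does not depend on the representation—which is exactly the point that needs care in the rigorous argument.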

REMARK. The corresponding argument is much easier in one dimension. It is, perhaps, simpler to consider only that case, then obtain Lebesgue measure in higher dimensions as a completion of products of one-dimensional Lebesgue measures.

The 𝒦₀-tightness of m is trivial, because 𝒦₀ is stable under differences: if K₁ ⊇ K₀, with both sets in 𝒦₀, then K₁\K₀ ∈ 𝒦₀ and mK₁ − mK₀ = m(K₁\K₀).

To establish σ-smoothness, consider a decreasing sequence {Kₙ} with empty intersection. Fix ε > 0. If we shrink each component rectangle of Kₙ by a small enough amount we obtain a set Lₙ in 𝒦₀ whose closure L̄ₙ is a compact subset of Kₙ, and for which m(Kₙ\Lₙ) < ε/2ⁿ. The family of compact sets {L̄ₙ : n = 1, 2, ...} has empty intersection. For some finite N we must have ∩_{i≤N} L̄ᵢ = ∅, so that

mK_N ≤ m(∩_{i≤N} Lᵢ) + Σ_{i≤N} m(Kᵢ\Lᵢ) < 0 + Σ_{i≤N} ε/2ⁱ.

It follows that mKₙ tends to zero as n tends to infinity. The finitely additive measure m is σ-smooth at ∅ (along 𝒦₀). By Theorem <12>, it extends to a 𝒦-regular, countably additive measure on 𝒮, a sigma-field that contains all the sets in ℱ(𝒦).

You should convince yourself that 𝒦, the ∩c-closure of 𝒦₀, contains all compact subsets of ℝ², and ℱ(𝒦) contains all closed subsets. The sigma-field 𝒮 is complete and contains the Borel sigma-field ℬ(ℝ²). In fact 𝒮 is the Lebesgue sigma-field, the completion of the Borel sigma-field.

6. Integral representations

Throughout the book I have made heavy use of the fact that there is a one-to-one correspondence (via integrals) between measures and increasing linear functionals on M⁺ with the Monotone Convergence property. Occasionally (as in Sections 4.8 and 7.5), I needed an analogous correspondence for functionals on a subcone of M⁺. The methods from Sections 1, 2, and 3 can be used to construct measures representing such functionals if the subcone is stable under lattice-like operations.

<13> Definition. Call a collection ℋ⁺ of nonnegative real functions on a set 𝔛 a lattice cone if it has the following properties. For h, h₁ and h₂ in ℋ⁺, and α₁ and α₂ in ℝ⁺:

(H₁) α₁h₁ + α₂h₂ belongs to ℋ⁺;

(H₂) h₁\h₂ := (h₁ − h₂)⁺ belongs to ℋ⁺;

(H₃) the pointwise minimum h₁ ∧ h₂ and maximum h₁ ∨ h₂ belong to ℋ⁺;

(H₄) h ∧ 1 belongs to ℋ⁺.


The best example of a lattice cone to keep in mind is the class C₀⁺(ℝᵏ) of all nonnegative, continuous functions with compact support on some Euclidean space ℝᵏ.

REMARK. By taking the positive part of the difference in H₂, we keep the function nonnegative. Properties H₁ and H₂ are what one would get by taking the collection of all positive parts of members of a vector space of functions. Property H₄ is sometimes called Stone's condition. It is slightly weaker than an assumption that the constant function 1 should belong to ℋ⁺. Notice that the cone C₀⁺(ℝᵏ) satisfies H₄, but it does not contain nonzero constants. Nevertheless, if h ∈ ℋ⁺ and α is a positive constant then the function (h − α)⁺ = (h − α(1 ∧ h/α))⁺ belongs to ℋ⁺.
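The final identity of the Remark is a pointwise statement about real numbers, so it can be checked numerically. A small sketch (the helper `pos` is my own name for the positive-part operation):

```python
# For h >= 0 and a > 0: a * min(1, h/a) = min(a, h), so
# (h - a * min(1, h/a))+ equals (h - a)+ pointwise.
def pos(x):
    return max(x, 0.0)

for h in [0.0, 0.3, 1.0, 2.5]:
    for a in [0.5, 1.0, 2.0]:
        assert abs(pos(h - a) - pos(h - a * min(1.0, h / a))) < 1e-12
```

The point of the identity is that α(1 ∧ h/α) is built from h using only H₁ and H₄, so subtracting a positive constant stays inside the cone via H₂, even though the constant itself need not belong to ℋ⁺.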

<14> Definition. Say that a map T : ℋ⁺ → ℝ⁺ is an increasing linear functional if, for h₁, h₂ in ℋ⁺, and α₁, α₂ in ℝ⁺:

(T₁) T(α₁h₁ + α₂h₂) = α₁Th₁ + α₂Th₂;

(T₂) Th₁ ≤ Th₂ if h₁ ≤ h₂ pointwise.

Call the functional σ-smooth at 0 if

(T₃) Thₙ ↓ 0 whenever the sequence {hₙ} in ℋ⁺ decreases pointwise to zero.

Say that T has the truncation property if

(T₄) T(h ∧ n) → Th as n → ∞, for each h in ℋ⁺.

REMARK. For an increasing linear functional, T₃ is equivalent to an apparently stronger property,

(T₃′) if hₙ ↓ h∞ with all hᵢ in ℋ⁺ then Thₙ ↓ Th∞,

because Thₙ ≤ Th∞ + T(hₙ\h∞) ↓ Th∞ + 0. Property T₄ will allow us to reduce the representation of arbitrary members of ℋ⁺ as integrals to the representation for bounded functions in ℋ⁺.

If μ is a countably additive measure on a sigma-field 𝒜, and all the functions in ℋ⁺ are μ-integrable, then Th := μh defines a functional on ℋ⁺ satisfying T₁ through T₄. The converse problem—find a μ to represent a given functional T—is called the integral representation problem. Theorem <8> will provide a solution to the problem in some generality.

Let 𝒦₀ denote the class of all sets K for which there exists a countable subfamily of ℋ⁺ with pointwise infimum equal to (the indicator function of) K. Equivalently, by virtue of H₃, there is a decreasing sequence in ℋ⁺ converging pointwise to K. It is easy to show that 𝒦₀ is a (∅, ∪f, ∩c)-paving of subsets of 𝔛. Moreover, as the next Lemma shows, the functions in ℋ⁺ are related to 𝒦₀ and ℱ(𝒦₀) in much the same way that nonnegative, continuous functions with compact support in ℝᵏ are related to compact and closed sets.

<15> Lemma. For each h in ℋ⁺ and each nonnegative constant α, (i) {h ≥ α} ∈ 𝒦₀ if α > 0, and (ii) {h ≤ α} ∈ ℱ(𝒦₀).

Proof. For (i), note that {h ≥ α} = inf_{n∈ℕ} (1 ∧ n(h − α + n⁻¹)⁺), a pointwise infimum of a sequence of functions in ℋ⁺. For (ii), for a given K in 𝒦₀, find a sequence {hₙ : n ∈ ℕ} ⊆ ℋ⁺ that decreases to K. Then note that K{h ≤ α} = infₙ hₙ\(nh − nα)⁺, a set that must therefore belong to 𝒦₀.


<16> Theorem. Let ℋ⁺ be a lattice cone of functions, satisfying requirements H₁ through H₄, and T be an increasing, linear functional on ℋ⁺ satisfying conditions T₁ through T₄. Then the set function defined on 𝒦₀ by μK := inf{Th : K ≤ h ∈ ℋ⁺} is 𝒦₀-tight and σ-smooth along 𝒦₀ at ∅. Its extension to a 𝒦₀-regular measure on ℬ(𝒦₀) represents the functional, that is, Th = μh for all h in ℋ⁺. There is only one 𝒦₀-regular measure on ℬ(𝒦₀) whose integral represents T.

REMARK. Notice that we can replace the infimum in the definition of μ by an infimum along any decreasing sequence {hₙ} in ℋ⁺ with pointwise limit K. For if K ≤ h ∈ ℋ⁺, then infₙ Thₙ ≤ infₙ T(hₙ ∨ h) = Th, by T₂ and T₃′.

Proof. We must prove that μ is σ-smooth along 𝒦₀ at ∅ and 𝒦₀-tight; and then prove that Th ≥ μh and Th ≤ μh for every h in ℋ⁺.

σ-smoothness: Suppose Kₙ ∈ 𝒦₀ and Kₙ ↓ ∅. Express Kₙ as a pointwise infimum of functions {h_{n,i}} in ℋ⁺. Write hₙ for inf_{m≤n, i≤n} h_{m,i}. Then Kₙ ≤ hₙ ↓ 0, and hence μKₙ ≤ Thₙ ↓ 0 by the σ-smoothness for T and the definition of μ.

𝒦₀-tightness: Consider sets K₁ ⊇ K₀ in 𝒦₀. Choose ℋ⁺ functions g ≥ K₀ and hₙ ↓ K₁, and fix a positive constant t < 1. The ℋ⁺-function gₙ := (hₙ − n(g\t))⁺ decreases pointwise to the set L := K₁{g ≤ t} ⊆ K₁\K₀. Also, it is trivially true that g ≥ tK₁{g > t}. From the inequality gₙ + g ≥ tK₁ we get μK₁ ≤ T(gₙ + g)/t, because (gₙ + g)/t is one of the ℋ⁺-functions that enters into the definition of μK₁. Let n tend to infinity, take an infimum over all g ≥ K₀, then let t increase to 1, to deduce that μK₁ ≤ μL + μK₀, as required for 𝒦₀-tightness.

By Theorem <8>, the set function μ extends to a 𝒦₀-regular measure on ℬ(𝒦₀).

Inequality Th ≥ μh: Suppose h ≥ u := Σ_{j=1}^k αⱼAⱼ, a nonnegative simple function. We need to show that Th ≥ μu := Σⱼ αⱼμAⱼ. We may assume that the measurable sets Aⱼ are disjoint. Choose 𝒦₀ sets Kⱼ ⊆ Aⱼ, thereby defining another simple function v := Σ_{j=1}^k αⱼKⱼ ≤ u. Find sequences h_{n,j} from ℋ⁺ with h_{n,j} ↓ αⱼKⱼ, so that Σⱼ Th_{n,j} ↓ Σⱼ αⱼμKⱼ = μv. With no loss of generality, assume h ≥ h_{n,j} for all n and j. Then we have a pointwise bound, Σⱼ h_{n,j} ≤ h + Σ_{i<j} h_{n,i} ∧ h_{n,j}, because maxⱼ h_{n,j} ≤ h and each of the smaller h_{n,j} summands must appear in the last sum. Thus

Σⱼ Th_{n,j} ≤ Th + Σ_{i<j} T(h_{n,i} ∧ h_{n,j}).

As n tends to infinity, h_{n,i} ∧ h_{n,j} ↓ KᵢKⱼ = ∅. By σ-smoothness of T, the right-hand side decreases to Th, leaving μv ≤ Th. Take the supremum over all Kⱼ ⊆ Aⱼ, then take the supremum over all u ≤ h, to deduce that μh ≤ Th.

Inequality Th ≤ μh: Invoke property T₄ to reduce to the case of a bounded h. For a fixed ε > 0, approximate h by a simple function s_ε := ε Σ_{i=1}^N {h ≥ iε}, with steps of size ε. Here N is a fixed value large enough to make Nε an upper bound


for h. Notice that {h ≥ iε} ∈ 𝒦₀, by Lemma <15>. Find sequences h_{n,i} from ℋ⁺ with h_{n,i} ↓ {h ≥ iε}. Then we have

s_ε ≤ h ≤ (h ∧ ε) + s_ε ≤ (h ∧ ε) + ε Σ_{i=1}^N h_{n,i},

from which it follows that

Th ≤ T(h ∧ ε) + ε Σ_{i=1}^N Th_{n,i}
   → T(h ∧ ε) + ε Σ_{i=1}^N μ{h ≥ iε}  as n → ∞
   ≤ T(h ∧ ε) + μh  because ε Σ_{i=1}^N {h ≥ iε} = s_ε ≤ h
   → μh  as ε → 0, by σ-smoothness of T.

Uniqueness: Let ν be another 𝒦₀-regular representing measure. If hₙ ↓ K ∈ 𝒦₀, and hₙ ∈ ℋ⁺, then μK = limₙ μhₙ = limₙ Thₙ = limₙ νhₙ = νK. Regularity extends the equality to all sets in ℬ(𝒦₀).
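The sandwich s_ε ≤ h ≤ (h ∧ ε) + s_ε that drives the second inequality is a statement about real numbers and can be checked directly. A sketch (the name `s_eps` and the sample values are mine):

```python
# s_eps(h) := eps * sum_{i=1}^{N} 1{h >= i*eps}; with N*eps an upper bound
# for h, this simple function satisfies s_eps <= h <= min(h, eps) + s_eps.
def s_eps(hval, eps, N):
    return eps * sum(1 for i in range(1, N + 1) if hval >= i * eps)

eps, N = 0.1, 50          # N * eps = 5.0 bounds the sample values below
for hval in [0.0, 0.05, 0.1, 0.37, 1.0, 4.99]:
    s = s_eps(hval, eps, N)
    assert s <= hval + 1e-12                   # s_eps never exceeds h
    assert hval <= min(hval, eps) + s + 1e-12  # h undershoots by at most eps
```

Shrinking ε improves the approximation, which is why letting ε → 0 at the end of the proof squeezes Th down to μh.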

<17> Example. Let ℋ⁺ equal C₀⁺(𝔛), the cone of all nonnegative, continuous functions with compact support on a locally compact, Hausdorff space 𝔛. For example, 𝔛 might be ℝᵏ. Let T be an increasing linear functional on C₀⁺(𝔛).

Property T₄ holds for the trivial reason that each member of C₀⁺(𝔛) is bounded. Property T₃ is automatic, for a less trivial reason. Suppose hₙ ↓ 0. Without loss of generality, h₁ ≤ K for some compact K. Choose h in C₀⁺(𝔛) with h ≥ K. For fixed ε > 0, the union of the open sets {hₙ < ε} covers K. For some finite N, the set {h_N < ε} contains K, in which case h_N ≤ εK ≤ εh, and Th_N ≤ εTh. The σ-smoothness follows.

The functional T has a representation Th = μh on C₀⁺(𝔛), for a 𝒦₀-regular measure μ. The domain of μ need not contain all the Borel sets. However, by an analog of Lemma <10> outlined in Problem [1], it could be extended to a Borel measure without disturbing the representation.

<18> Example. Let ℋ⁺ be a lattice cone of bounded continuous functions on a topological space, and let T : ℋ⁺ → ℝ⁺ be a linear functional (necessarily increasing) with the property that to each ε > 0 there exists a compact set K_ε for which Th ≤ ε if 0 ≤ h ≤ K_εᶜ. (In Section 7.5, such a functional was called functionally tight.)

Suppose 1 ∈ ℋ⁺. The functional is automatically σ-smooth: if 1 ≥ hᵢ ↓ 0 then eventually K_ε ⊆ {hᵢ < ε}, in which case Thᵢ ≤ T((hᵢ − ε)⁺ + ε) ≤ ε + εT(1). In fact, the same argument shows that the functional is also τ-smooth, in the sense of Problem [2].

The functional T is represented by a measure μ on the sigma-field generated by ℋ⁺. Suppose there exists a sequence {hᵢ} ⊆ ℋ⁺ for which 1 ≥ hᵢ ↓ K_ε. (The version of the representation theorem for τ-smooth functionals, as described by Problem [2], shows that it is even enough to have ℋ⁺ generate the underlying topology.) Then μK_ε = limᵢ Thᵢ = T(1) − limᵢ T(1 − hᵢ) ≥ T(1) − ε. That is, μ is a tight measure, in the sense that it concentrates most of its mass on a compact set. It is inner regular with respect to approximation by the paving of compact sets.


7. Problems

[1] A family of sets 𝒰 is said to be downward filtering if to each pair U₁, U₂ in 𝒰 there exists a U₃ in 𝒰 with U₁ ∩ U₂ ⊇ U₃. A set function μ : 𝒦₀ → ℝ⁺ is said to be τ-smooth if inf{μU : U ∈ 𝒰} = μ(∩𝒰) for every downward filtering family 𝒰 ⊆ 𝒦₀. Write 𝒦 for the ∩a-closure of a (∅, ∪f, ∩f) paving 𝒦₀, the collection of all possible intersections of subclasses of 𝒦₀.

(i) Show that 𝒦 is a (∅, ∪f, ∩a) paving (stable under arbitrary intersections).

(ii) Show that a 𝒦₀-tight set function that is τ-smooth at ∅ has a 𝒦-tight, τ-additive extension to 𝒦.

[2] Say that an increasing functional T on ℋ⁺ is τ-smooth at zero if inf{Th : h ∈ 𝒱} = 0 for each subfamily 𝒱 of ℋ⁺ that is downward filtering to the zero function. (That is, to each h₁ and h₂ in 𝒱 there is an h₃ in 𝒱 with h₁ ∧ h₂ ≥ h₃, and the pointwise infimum of all functions in 𝒱 is everywhere zero.) Extend Theorem <16> to τ-smooth functionals by constructing a 𝒦-regular representing measure from the class 𝒦 of sets representable as pointwise infima of subclasses of ℋ⁺.

8. Notes

The construction via 𝒦-tight inner measures is a reworking of ideas from Topsøe (1970). The application to integral representations is a special case of results proved by Pollard & Topsøe (1975).

The book by Fremlin (1974) contains an extensive treatment of the relationship between measures and linear functionals. The book by König (1997) develops the theory of measure and integration with a heavy emphasis on inner regularity.

See Pfanzagl & Pierlo (1969) for an exposition of the properties of pavings compact in the sense of Marczewski.

REFERENCES

Fremlin, D. H. (1974), Topological Riesz Spaces and Measure Theory, Cambridge University Press.

König, H. (1997), Measure and Integration: An Advanced Course in Basic Procedures and Applications, Springer-Verlag.

Marczewski, E. (1953), 'On compact measures', Fundamenta Mathematicae pp. 113-124.

Pfanzagl, J. & Pierlo, N. (1969), Compact systems of sets, Vol. 16 of Springer LectureNotes in Mathematics, Springer-Verlag, New York.

Pollard, D. & Topsøe, F. (1975), 'A unified approach to Riesz type representation theorems', Studia Mathematica 54, 173-190.

Topsøe, F. (1970), Topology and Measure, Vol. 133 of Springer Lecture Notes in Mathematics, Springer-Verlag, New York.


Appendix B

Hilbert spaces

SECTION 1 defines Hilbert space and presents two basic inequalities.

SECTION 2 establishes the existence of orthogonal projections onto closed subspaces of a Hilbert space.

SECTION 3 defines orthonormal bases of Hilbert spaces. Vectors in the space have representations as infinite linear combinations (convergent series) of basis vectors.

SECTION 4 shows how to construct a random process from an orthonormal sequence of random variables and an orthonormal basis.

1. Definitions

Hilbert space is an infinite dimensional generalization of ordinary Euclidean space. Arguments involving Hilbert spaces look similar to their analogs for Euclidean space, with the addition of occasional precautions against possible difficulties with infinite dimensionality.

<1> Definition. A Hilbert space is a vector space ℋ equipped with an inner product ⟨·, ·⟩ (a map from ℋ ⊗ ℋ into ℝ) which satisfies the following requirements.

(a) ⟨αf + βg, h⟩ = α⟨f, h⟩ + β⟨g, h⟩ for all real α, β, all f, g, h in ℋ.

(b) ⟨f, g⟩ = ⟨g, f⟩ for all f, g in ℋ.

(c) ⟨f, f⟩ ≥ 0 with equality if and only if f = 0.

(d) ℋ is complete for the norm defined by ‖f‖ := √⟨f, f⟩. That is, if {fₙ} is a Cauchy sequence in ℋ, meaning ‖fₙ − fₘ‖ → 0 as min(m, n) → ∞, then there exists an f in ℋ for which ‖fₙ − f‖ → 0.

Two elements f and g of ℋ are said to be orthogonal, written f ⊥ g, if ⟨f, g⟩ = 0. An element f is said to be orthogonal to a subset G of ℋ, written f ⊥ G, if f ⊥ g for every g in G.

The prime examples of Hilbert spaces are ordinary Euclidean space and L²(μ), the set of equivalence classes of measurable real-valued functions whose squares are μ-integrable, for a fixed measure μ. See Section 2.7 for discussion of why we need to work with μ-equivalence classes to get property (c).

Hilbert space shares several properties with ordinary Euclidean space.


Cauchy-Schwarz inequality: |⟨f, g⟩| ≤ ‖f‖ ‖g‖ for all f, g in ℋ. The inequality is trivial if either ‖f‖ = 0 or ‖g‖ = 0. Otherwise it follows immediately from an expansion of the left-hand side of the inequality

0 ≤ ‖ f/‖f‖ ± g/‖g‖ ‖².

Triangle inequality: ‖f + g‖ ≤ ‖f‖ + ‖g‖ for all f, g in ℋ. The square of the left-hand side equals ‖f‖² + 2⟨f, g⟩ + ‖g‖², which is less than ‖f‖² + 2‖f‖ ‖g‖ + ‖g‖² = (‖f‖ + ‖g‖)², by Cauchy-Schwarz.

2. Orthogonal projections

Many proofs for ℝᵏ that rely only on completeness carry over to Hilbert spaces. For example, if ℋ₀ is a subspace of ℝᵏ, then to each vector x there exists a vector x₀ in ℋ₀ that is closest to x. The vector x₀ is characterized by the property that x − x₀ is orthogonal to ℋ₀. The vector x₀ is called the orthogonal projection of x onto ℋ₀, or the component of x in the subspace ℋ₀; the vector x − x₀ is called the component of x orthogonal to ℋ₀.

Projections also exist for closed subspaces of a Hilbert space ℋ. (Recall that a subset 𝒢 of ℋ is said to be closed if it contains all its limit points: if {gₙ} ⊆ 𝒢 and ‖gₙ − f‖ → 0 then f ∈ 𝒢.) Every finite dimensional subspace is automatically closed (Problem [2]); infinite dimensional subspaces need not be closed (Problem [3]).

<2> Theorem. Let ℋ₀ be a closed subspace of a Hilbert space ℋ. For each f in ℋ there is a unique f₀ in ℋ₀, the orthogonal projection of f onto ℋ₀, for which f − f₀ is orthogonal to ℋ₀. The point f₀ minimizes ‖f − h‖ over all h in ℋ₀.

Proof. For a fixed f in ℋ, define δ := inf{‖f − h‖ : h ∈ ℋ₀}. Choose {hₙ} in ℋ₀ such that ‖f − hₙ‖ → δ. For arbitrary g, g′ in ℋ, cancellation of cross-product terms leads to the identity

‖g + g′‖² + ‖g − g′‖² = 2‖g‖² + 2‖g′‖².

Put g := f − hₙ and g′ := f − hₘ to get

4‖f − (hₙ + hₘ)/2‖² + ‖hₘ − hₙ‖² = 2‖f − hₙ‖² + 2‖f − hₘ‖².

The first term on the left-hand side must be ≥ 4δ² because (hₙ + hₘ)/2 belongs to ℋ₀. Both terms on the right-hand side converge to 2δ² as min(m, n) → ∞. Thus ‖hₘ − hₙ‖ → 0 as min(m, n) → ∞. That is, {hₙ} is a Cauchy sequence.

Completeness of $\mathcal{H}$ ensures that $h_n$ converges to some $f_0$ in $\mathcal{H}$. As $\mathcal{H}_0$ is closed and each $h_n$ belongs to $\mathcal{H}_0$, we deduce that $f_0 \in \mathcal{H}_0$. The infimum $\delta$ is achieved at $f_0$, because $\|f - f_0\| \le \|f - h_n\| + \|h_n - f_0\| \to \delta$.

To prove the orthogonality assertion, for a fixed $g$ in $\mathcal{H}_0$ consider the squared distance from $f$ to $f_0 + tg$,

$$\|f - (f_0 + tg)\|^2 = \|f - f_0\|^2 - 2t\langle f - f_0, g\rangle + t^2\|g\|^2 ,$$


as a function of the real variable $t$. The vector $f_0 + tg$ belongs to $\mathcal{H}_0$. It is one of those vectors in the range of the infimum that defines $\delta$. It follows that

$$\delta^2 - 2t\langle f - f_0, g\rangle + t^2\|g\|^2 \ge \delta^2 \qquad\text{for all real } t.$$

For such an inequality to hold, the coefficient, $\langle f - f_0, g\rangle$, of the linear term in $t$ must be zero. (Otherwise what would happen for $t$ close to zero?)

To prove uniqueness, suppose both $f_0$ and $f_1$ have the projection property. Then $f_1 - f_0$ in $\mathcal{H}_0$ would be orthogonal to both $f - f_0$ and $f - f_1$, from which it follows that $f_1 - f_0$ is orthogonal to itself, whence $f_1 = f_0$. □
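In finite dimensions the projection of Theorem <2> can be computed explicitly, which makes the orthogonality property easy to verify numerically. The sketch below (editorial Python, with invented helper names; not part of the text) projects a vector of $\mathbb{R}^3$ onto the span of two vectors via Gram-Schmidt and checks that the residual is orthogonal to the subspace.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def project(f, basis):
    """Orthogonal projection of f onto span(basis); the spanning
    vectors need not be orthogonal, since we orthonormalize first."""
    ortho = []
    for v in basis:                      # Gram-Schmidt orthonormalization
        w = list(v)
        for q in ortho:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:
            ortho.append([wi / norm for wi in w])
    f0 = [0.0] * len(f)                  # sum of components along each direction
    for q in ortho:
        c = dot(f, q)
        f0 = [a + c * qi for a, qi in zip(f0, q)]
    return f0

f = [1.0, 2.0, 3.0]
H0 = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # spans the x-y plane
f0 = project(f, H0)                      # the closest point in the subspace
residual = [a - b for a, b in zip(f, f0)]
# f - f0 is orthogonal to every spanning vector of H0
assert all(abs(dot(residual, v)) < 1e-9 for v in H0)
```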

<3> Corollary. (Riesz-Fréchet Theorem) To each continuous linear map $T$ from a Hilbert space $\mathcal{H}$ into $\mathbb{R}$ there exists a unique element $h_0$ in $\mathcal{H}$ such that $Th = \langle h, h_0\rangle$ for all $h$ in $\mathcal{H}$.

Proof. Uniqueness is easy to prove: if $\langle h, h_0\rangle = \langle h, h_1\rangle$ for $h := h_0 - h_1$ then $\|h_0 - h_1\| = 0$.

Existence is also easy when $Th \equiv 0 = \langle h, 0\rangle$. So let us assume that there exists an $h_2$ with $Th_2 \ne 0$. Let $h_3$ denote the component of $h_2$ that is orthogonal to the closed (by continuity of $T$) linear subspace $\mathcal{H}_0 := \{h \in \mathcal{H} : Th = 0\}$ of $\mathcal{H}$. Note that $Th_3 = Th_2 - T(h_2 - h_3) = Th_2 \ne 0$.

For $h$ in $\mathcal{H}$, define $C_h := Th/Th_3$. The difference $h - C_h h_3$ belongs to $\mathcal{H}_0$, because $T(h - C_h h_3) = 0$, and therefore $0 = \langle h - C_h h_3, h_3\rangle = \langle h, h_3\rangle - (Th/Th_3)\|h_3\|^2$. The choice $h_0 := (Th_3/\|h_3\|^2)\, h_3$ gives the desired representation. □

3. Orthonormal bases

A family of vectors $\Psi = \{\psi_i : i \in I\}$ in a Hilbert space $\mathcal{H}$ is said to be orthonormal if $\langle \psi_i, \psi_i\rangle = 1$ for all $i$ and $\langle \psi_i, \psi_j\rangle = 0$ for all $i \ne j$. The subspace $\mathcal{H}_0$ spanned by $\Psi$ consists of all linear combinations $\sum_{i \in J} \alpha_i \psi_i$, with $J$ ranging over all finite subsets of $I$. If $\mathcal{H}_0$ is dense in $\mathcal{H}$ (that is, if the closure $\overline{\mathcal{H}}_0$ of $\mathcal{H}_0$ equals $\mathcal{H}$) then the family $\Psi$ is called an orthonormal basis for $\mathcal{H}$.

<4> Lemma. Let $\Psi$ be an orthonormal basis for $\mathcal{H}$.

(i) For each $h$ in $\mathcal{H}$, the set $I_h := \{i \in I : \langle h, \psi_i\rangle \ne 0\}$ is countable, and for every enumeration $\{i(1), i(2), \ldots\}$ of $I_h$, the sum $\sum_{k=1}^n \langle h, \psi_{i(k)}\rangle \psi_{i(k)}$ converges in norm to $h$ as $n \to \infty$.

(ii) $\langle g, h\rangle = \sum_{i \in I} \langle g, \psi_i\rangle \langle h, \psi_i\rangle$ (Parseval's identity) for all $g, h \in \mathcal{H}$. In particular,

$$\|h\|^2 = \sum\nolimits_{i \in I} |\langle h, \psi_i\rangle|^2 .$$

REMARK. The sum in assertion (ii) actually runs over only a countable subset of $I$. The assertion should be understood as convergence of partial sums for all possible enumerations of that countable subset.

Proof. For each finite subset $J$ of $I$, the subspace $\mathcal{H}_J$ spanned by the finite set $\{\psi_i : i \in J\}$ is closed—see Problem [2]. The projection of $h$ onto $\mathcal{H}_J$ equals


$h_J := \sum_{i \in J} \langle h, \psi_i\rangle \psi_i$, because $\langle h - h_J, \psi_i\rangle = \langle h, \psi_i\rangle - \langle h, \psi_i\rangle = 0$ for each $i$ in $J$. The orthogonality implies

$$\|h\|^2 = \|h - h_J\|^2 + \|h_J\|^2 = \|h - h_J\|^2 + \sum\nolimits_{i \in J} \langle h, \psi_i\rangle^2 .$$

For each $\epsilon > 0$, the set $I_h(\epsilon) := \{i \in I : |\langle h, \psi_i\rangle| \ge \epsilon\}$ has cardinality $N_\epsilon$ no greater than $\|h\|^2/\epsilon^2$, because $\|h\|^2 \ge \sum\{i \in I_h(\epsilon)\} \langle h, \psi_i\rangle^2 \ge \epsilon^2 N_\epsilon$. It follows that the set $I_h = \cup_{k \in \mathbb{N}} I_h(1/k)$ is countable, as asserted.

For a fixed $h$ let $\{i(1), i(2), \ldots\}$ be an enumeration of $I_h$. Write $J(n)$ for the initial segment $\{i(1), \ldots, i(n)\}$. Denseness of $\mathcal{H}_0$ means that to each $\epsilon > 0$ there is some finite linear combination $h_\epsilon = \sum_{i \in J} \alpha_i \psi_i$ such that $\|h - h_\epsilon\| < \epsilon$. By definition of $I_h$, the vector $h$ is orthogonal to $\psi_i$ for each $i$ in $I \setminus I_h$. It follows that the vector $h - \sum\{i \in J \cap I_h\} \alpha_i \psi_i$ is orthogonal to $\{\psi_i : i \in J \setminus I_h\}$, whence

$$\|h - h_\epsilon\|^2 = \left\|h - \sum\{i \in J \cap I_h\} \alpha_i \psi_i\right\|^2 + \left\|\sum\{i \in J \setminus I_h\} \alpha_i \psi_i\right\|^2 .$$

We reduce the distance between $h$ and $h_\epsilon$ if we discard from $J$ those $i$ not in $I_h$. Without loss of generality we may therefore assume that $J$ is a subset of $I_h$.

When $n$ is large enough that $J \subseteq J(n)$ we have $h - h_{J(n)}$ orthogonal to the subspace $\mathcal{H}_{J(n)}$, which contains both $h_{J(n)}$ and $h_\epsilon$. For such $n$ we have

$$\epsilon^2 > \|h - h_{J(n)} + h_{J(n)} - h_\epsilon\|^2 \qquad\text{definition of } h_\epsilon$$
$$= \|h - h_{J(n)}\|^2 + \|h_{J(n)} - h_\epsilon\|^2 \qquad\text{by orthogonality}$$
$$\ge \|h - h_{J(n)}\|^2 .$$

That is, for each $\epsilon > 0$ we have $\|h - h_{J(n)}\| < \epsilon$ for all $n$ large enough—precisely what it means to say that $\sum_{k=1}^{\infty} \langle h, \psi_{i(k)}\rangle \psi_{i(k)}$ converges (in norm) to $h$. The representation $\|h\|^2 = \sum_k \langle h, \psi_{i(k)}\rangle^2$ then follows via continuity (Problem [1]) of the map $g \mapsto \|g\|^2$.

For Parseval's identity, let $\{j(1), j(2), \ldots\}$ be an enumeration of $I_g \cup I_h$. Then, from the special case just established,

$$\|g \pm h\|^2 = \sum\nolimits_k \langle g \pm h, \psi_{j(k)}\rangle^2 .$$

The series for $4\langle g, h\rangle = \|g + h\|^2 - \|g - h\|^2$ is obtained by subtraction. □

<5> Example. (Haar basis) Let $m$ denote Lebesgue measure on the Borel sigma-field of $(0, 1]$. For $k = 0, 1, 2, \ldots$, partition $(0, 1]$ into subintervals $J_{i,k} := (i2^{-k}, (i + 1)2^{-k}]$, for $i = 0, 1, \ldots, 2^k - 1$. Define functions $H_{i,k} := J_{2i,k+1} - J_{2i+1,k+1}$ and $\psi_{i,k} := 2^{k/2} H_{i,k}$, for $0 \le i < 2^k$ and $k = 0, 1, 2, \ldots$

[Figure: the Haar functions $H_{0,0}$, $H_{0,1}$, $H_{1,1}$, $H_{2,2}$, $H_{3,2}$, step functions taking the values $+1$ and $-1$ on adjacent dyadic intervals $J_{0,3}, J_{1,3}, \ldots, J_{7,3}$.]

The collection of functions $\Psi := \{1\} \cup \{\psi_{i,k} : k \in \mathbb{N}_0;\ 0 \le i < 2^k\}$ is an orthonormal family in $L^2(m)$. A generating class argument will show that $\Psi$ is an orthonormal basis.


Each $J_{i,k}$ belongs to the subspace $\mathcal{H}_0$ spanned by $\Psi$:

$$J_{0,1} = \tfrac12\left(1 + H_{0,0}\right),$$
$$J_{0,2} = \tfrac12\left(J_{0,1} + H_{0,1}\right) = J_{0,1} - J_{1,2}, \qquad J_{2,2} = \tfrac12\left(J_{1,1} + H_{1,1}\right) = J_{1,1} - J_{3,2},$$

and so on. Take finite sums to deduce that $\mathcal{H}_0$ contains the class $\mathcal{E}$ of all indicator functions of dyadic intervals $(i/2^k, j/2^k]$. Note that $\mathcal{E}$ is stable under finite intersections and that it generates the Borel sigma-field. The class $\mathcal{D}$ of all Borel sets whose indicators belong to the closure $\overline{\mathcal{H}}_0$ is a $\lambda$-system (easy proof), which contains $\mathcal{E}$. Thus $\overline{\mathcal{H}}_0$ contains $\sigma(\mathcal{E})$, the Borel sigma-field on $(0, 1]$.

If $\overline{\mathcal{H}}_0$ were not equal to the whole of $L^2(m)$ there would exist a square-integrable function $f$ orthogonal to $\overline{\mathcal{H}}_0$. In particular, $f$ would be orthogonal to both $\{f > 0\}$ and $\{f < 0\}$, which would force $mf^+ = mf^- = 0$. That is, $f = 0$ a.e. $[m]$.

The component of a function $h$ in the subspace spanned by $1$ is just the constant function $mh$. Each function $h$ in $L^2(m)$ has a series expansion,

$$h = mh + \sum\nolimits_{k=0}^{\infty} \sum\nolimits_{0 \le i < 2^k} \langle h, \psi_{i,k}\rangle \psi_{i,k},$$

with convergence in the $L^2(m)$ sense. □
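The Haar expansion of the Example can be checked numerically for functions that are constant on dyadic cells. The following sketch (editorial Python; the discretization scheme is an assumption of this illustration) computes the coefficients $\langle h, \psi_{i,k}\rangle$, verifies Parseval's identity, and reconstructs $h$ exactly at the chosen resolution.

```python
import math

K = 4                      # resolution: h is constant on intervals of length 2**-K
N = 2 ** K
h = [math.sin(7 * (j + 0.5) / N) for j in range(N)]   # sample values on dyadic cells

def inner(f, g):           # L2(m) inner product for step functions on (0, 1]
    return sum(a * b for a, b in zip(f, g)) / N

def haar(i, k):            # psi_{i,k} = 2**(k/2) * (J_{2i,k+1} - J_{2i+1,k+1})
    psi = [0.0] * N
    width = N >> (k + 1)   # number of cells in each half of J_{i,k}
    start = i * 2 * width
    for j in range(start, start + width):
        psi[j] = 2 ** (k / 2)
    for j in range(start + width, start + 2 * width):
        psi[j] = -(2 ** (k / 2))
    return psi

mean = inner(h, [1.0] * N)                 # component along the constant function 1
coeffs = [(i, k, inner(h, haar(i, k))) for k in range(K) for i in range(2 ** k)]

# Parseval: ||h||^2 = (mh)^2 + sum of squared Haar coefficients
assert abs(inner(h, h) - (mean ** 2 + sum(c * c for _, _, c in coeffs))) < 1e-12

# the series expansion recovers h exactly at this resolution
recon = [mean] * N
for i, k, c in coeffs:
    psi = haar(i, k)
    recon = [r + c * p for r, p in zip(recon, psi)]
assert max(abs(a - b) for a, b in zip(h, recon)) < 1e-12
```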

4. Series expansions of random processes

Suppose $\mathcal{H}$ is a Hilbert space (necessarily separable—see Problem [6]) with a countable orthonormal basis $\Psi = \{\psi_i : i \in \mathbb{N}\}$. Let $\{\xi_i : i \in \mathbb{N}\}$ be a sequence of random variables, on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$, that is orthonormal for the $L^2(\mathbb{P})$ inner product: $\mathbb{P}\xi_i\xi_j = 1$ if $i = j$, zero otherwise. For each $h$ in $\mathcal{H}$, the sequence of random variables

$$X_n(h) := \sum\nolimits_{i \le n} \langle h, \psi_i\rangle \xi_i$$

is a Cauchy sequence in $L^2(\mathbb{P})$, because convergence of $\sum_i \langle h, \psi_i\rangle^2$ implies

$$\mathbb{P}\left|\sum\nolimits_{n < i \le m} \langle h, \psi_i\rangle \xi_i\right|^2 = \sum\nolimits_{n < i \le m} \langle h, \psi_i\rangle^2 \to 0 \qquad\text{as } n \to \infty.$$

Write $X(h)$ for the limit $\sum_{i=1}^{\infty} \langle h, \psi_i\rangle \xi_i$, which is defined up to an almost sure equivalence as a square integrable random variable.

By Problem [1] and the Parseval identity, for $g$ and $h$ in $\mathcal{H}$ we have

$$\mathbb{P} X(g) X(h) = \lim_n \mathbb{P} X_n(g) X_n(h) = \lim_n \sum\nolimits_{i \le n} \langle g, \psi_i\rangle \langle h, \psi_i\rangle = \langle g, h\rangle.$$

In particular, $\mathbb{P} X(h)^2 = \|h\|^2$. The map $h \mapsto X(h)$ is a linear isometry between $\mathcal{H}$ and a linear subspace of $L^2(\mathbb{P})$: the map preserves inner products and distances.

The most important example of a series expansion arises when all the $\xi_i$ have independent, standard normal distributions. The family of random variables $\{X(h) : h \in \mathcal{H}\}$ is then called the isonormal process indexed by $\mathcal{H}$. The particular case known as Brownian motion is discussed in Chapter 9.
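A Monte Carlo sketch (editorial Python, not part of the text; it uses independent standard normal $\xi_i$, so the partial sums are those of a truncated isonormal process, and the coordinate vectors are arbitrary choices) illustrates the isometry property $\mathbb{P} X(g) X(h) = \langle g, h\rangle$:

```python
import random

random.seed(12345)
n = 8                                   # truncation level of the basis
reps = 20000                            # Monte Carlo sample size

# coordinates of g and h in the orthonormal basis {psi_i}
g = [1.0, -0.5, 0.25, 0.0, 0.5, 0.0, 0.0, 1.0]
h = [0.5, 0.5, 0.0, -1.0, 0.25, 0.0, 2.0, 0.0]
inner_gh = sum(a * b for a, b in zip(g, h))

# X_n(h) = sum_i <h, psi_i> xi_i, with the xi_i independent standard normals
acc = 0.0
for _ in range(reps):
    xi = [random.gauss(0.0, 1.0) for _ in range(n)]
    Xg = sum(a * x for a, x in zip(g, xi))
    Xh = sum(a * x for a, x in zip(h, xi))
    acc += Xg * Xh
estimate = acc / reps                   # Monte Carlo estimate of P X(g)X(h)

assert abs(estimate - inner_gh) < 0.15  # matches <g, h> up to sampling error
```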


5. Problems

[1] Suppose $g_n \to g$ and $h_n \to h$, as elements of a Hilbert space. Show that $\langle g_n, h_n\rangle \to \langle g, h\rangle$. Hint: Use Cauchy-Schwarz to bound terms like $\langle g_n - g, h_n\rangle$.

[2] Let $\mathcal{F}$ be a finite subset of a Hilbert space $\mathcal{H}$. Show that the subspace generated by $\mathcal{F}$ is a closed subset of $\mathcal{H}$. Hint: Without loss of generality assume the elements $\{f_1, \ldots, f_k\}$ of $\mathcal{F}$ are linearly independent. For each $i$, find a vector $\psi_i$ such that $\langle f_i, \psi_i\rangle = 1$ but $\langle f_j, \psi_i\rangle = 0$ for $j \ne i$. If $h_n := \sum_i a_i(n) f_i$ converges to some $h$, deduce that $\{a_i(n)\}$ converges for each $i$.

[3] Let $\lambda$ be a finite measure on $\mathcal{B}[0, 1]$. Let $\mathcal{H}$ denote the collection of $\lambda$-equivalence classes $\{[h] : h \text{ is continuous}\}$. Show that $\mathcal{H}$ is not a closed subspace of $L^2(\lambda)$ if $\lambda$ equals Lebesgue measure. Could it be a closed subspace for some other choice of $\lambda$?

[4] Let $\mathcal{K}$ be a closed, convex subset of a Hilbert space $\mathcal{H}$.

(i) Show that to each $f$ in $\mathcal{H}$ there is a unique $f_0$ in $\mathcal{K}$ for which $\|f - f_0\| = \inf\{\|f - h\| : h \in \mathcal{K}\}$. Hint: Mimic the proof of Theorem <2>.

(ii) Show that $\langle f - f_0, g - f_0\rangle \le 0$ for all $g$ in $\mathcal{K}$. Hint: Consider the distance from $f$ to $(1 - t)f_0 + tg$ for $0 \le t \le 1$.

(iii) Give a (finite-dimensional) example where $\langle f - f_0, g - f_0\rangle < 0$ for all $g$ in $\mathcal{K} \setminus \{f_0\}$.

[5] Use Zorn's Lemma to prove that every Hilbert space has at least one orthonormal basis. Hint: Order orthonormal families by inclusion. If $\Psi$ is maximal for this ordering, show that there can be no nonzero element orthogonal to every member of $\Psi$.

[6] Let $\mathcal{H}$ be a Hilbert space with an orthonormal basis $\Psi := \{\psi_i : i \in I\}$. Show that $I$ is countable if and only if $\mathcal{H}$ is separable (that is, it contains a countable, dense subset). Hint: If $I$ is countable, consider finite linear combinations $\sum_{i \in J} \alpha_i \psi_i$ with the $\alpha_i$ rational. Conversely, if $\{h_1, h_2, \ldots\}$ is dense, construct an orthonormal basis inductively by defining $g_i := h_i - \sum\{j < i\} \langle h_i, \psi_j\rangle \psi_j$ and $\psi_i := g_i/\|g_i\|$ when $g_i \ne 0$.

6. Notes

Halmos (1957, Chapter I) is an excellent source for basic facts about Hilbert space.See Dudley (1973) and Dudley (1989, page 378) for the isonormal process.

REFERENCES

Dudley, R. M. (1973), 'Sample functions of the Gaussian process', Annals of Probability 1, 66-103.

Dudley, R. M. (1989), Real Analysis and Probability, Wadsworth, Belmont, Calif.

Halmos, P. R. (1957), Introduction to Hilbert Space, second edn, Chelsea.


Appendix C

Convexity

SECTION 1 defines convex sets and functions.

SECTION 2 shows that convex functions defined on subintervals of the real line have left- and right-hand derivatives everywhere.

SECTION 3 shows that convex functions on the real line can be recovered as integrals of their one-sided derivatives.

SECTION 4 shows that convex subsets of Euclidean spaces have nonempty relative interiors.

SECTION 5 derives various facts about separation of convex sets by linear functionals.

1. Convex sets and functions

A subset $C$ of a vector space is said to be convex if it contains all the line segments joining pairs of its points, that is,

$$\alpha x_1 + (1 - \alpha) x_2 \in C \qquad\text{for all } x_1, x_2 \in C \text{ and all } 0 \le \alpha \le 1.$$

A real-valued function $f$ defined on a convex subset $C$ (of a vector space $\mathcal{V}$) is said to be convex if

$$f(\alpha x_1 + (1 - \alpha) x_2) \le \alpha f(x_1) + (1 - \alpha) f(x_2) \qquad\text{for all } x_1, x_2 \in C \text{ and } 0 \le \alpha \le 1.$$

Equivalently, the epigraph of the function,

$$\operatorname{epi}(f) := \{(x, t) \in C \times \mathbb{R} : t \ge f(x)\},$$

is a convex subset of $C \times \mathbb{R}$. Some authors (such as Rockafellar 1970) define $f(x)$ to equal $+\infty$ for $x \in \mathcal{V} \setminus C$, so that the function is convex on the whole of $\mathcal{V}$, and $\operatorname{epi}(f)$ is a convex subset of $\mathcal{V} \times \mathbb{R}$.

This Appendix will establish several facts about convex functions and sets, mostly for Euclidean spaces. In particular, the facts include the following results as special cases.

(i) For a convex function $f$ defined at least on an open interval of the real line (possibly the whole real line), there exists a countable collection of linear functions for which $f(x) = \sup_{i \in \mathbb{N}} (a_i + b_i x)$ on that interval.

(ii) If a real-valued function $f$ has an increasing, real-valued right-hand derivative at each point of an open interval, then $f$ is convex on that interval. In particular, if $f$ is twice differentiable, with $f'' \ge 0$, then $f$ is convex.


(iii) If a convex function $f$ on a convex subset $C \subseteq \mathbb{R}^n$ has a local minimum at a point $x_0$, that is, if $f(x) \ge f(x_0)$ for all $x$ in a neighborhood of $x_0$, then $f(w) \ge f(x_0)$ for all $w$ in $C$.

(iv) If $C_1$ and $C_2$ are disjoint convex subsets of $\mathbb{R}^n$ then there exists a nonzero $\ell$ in $\mathbb{R}^n$ for which $\sup_{x \in C_1} x \cdot \ell \le \inf_{x \in C_2} x \cdot \ell$. That is, the linear functional $x \mapsto x \cdot \ell$ separates the two convex sets.

2. One-sided derivatives

Let $f$ be a convex function, defined and real-valued at least on an interval $J$ of the real line.

Consider any three points $x_1 < x_2 < x_3$, all in $J$. (For the moment, ignore the point $x_0$ shown in the picture.) Write $\alpha$ for $(x_2 - x_1)/(x_3 - x_1)$, so that $x_2 = \alpha x_3 + (1 - \alpha) x_1$. By convexity, $y_2 := \alpha f(x_3) + (1 - \alpha) f(x_1) \ge f(x_2)$. Write $S(x_i, x_j)$ for $(f(x_j) - f(x_i))/(x_j - x_i)$, the slope of the chord joining the points $(x_i, f(x_i))$ and $(x_j, f(x_j))$. Then

$$S(x_2, x_3) = \frac{f(x_3) - f(x_2)}{x_3 - x_2} \ge \frac{f(x_3) - y_2}{x_3 - x_2} = S(x_1, x_3) = \frac{y_2 - f(x_1)}{x_2 - x_1} \ge \frac{f(x_2) - f(x_1)}{x_2 - x_1} = S(x_1, x_2).$$

[Figure: the graph of $f$ with chords of slopes $S(x_1, x_2)$, $S(x_1, x_3)$, and $S(x_2, x_3)$.]

From the second inequality it follows that $S(x_1, x)$ decreases as $x$ decreases to $x_1$. That is, $f$ has right-hand derivative $D_+(x_1)$ at $x_1$, if there are points of $J$ that are larger than $x_1$. The limit might equal $-\infty$, as in the case of the function $f(x) = -\sqrt{x}$ defined on $\mathbb{R}^+$, with $x_1 = 0$. However, if there is at least one point $x_0$ of $J$ for which $x_0 < x_1$ then the limit $D_+(x_1)$ must be finite: replacing $\{x_1, x_2, x_3\}$ in the argument just made by $\{x_0, x_1, x_2\}$, we have $S(x_0, x_1) \le S(x_1, x_2)$, implying that $-\infty < S(x_0, x_1) \le D_+(x_1)$.

The inequality $S(x_1, x) \le S(x_1, x_2) \le S(x_2, x')$ for $x_1 < x \le x_2 < x'$ leads to the conclusion that $D_+$ is an increasing function. Moreover, it is continuous from the


right, because

$$D_+(x_2) \le S(x_2, x_3) \to S(x_1, x_3) \qquad\text{as } x_2 \downarrow x_1, \text{ for fixed } x_3,$$

and then $S(x_1, x_3) \to D_+(x_1)$ as $x_3 \downarrow x_1$.

Analogous arguments show that $S(x_0, x_1)$ increases to a limit $D_-(x_1)$ as $x_0$ increases to $x_1$. That is, $f$ has left-hand derivative $D_-(x_1)$ at $x_1$, if there are points of $J$ that are smaller than $x_1$.

If $x_1$ is an interior point of $J$ then both left-hand and right-hand derivatives exist, and $D_-(x_1) \le D_+(x_1)$. The inequality may be strict, as in the case where $f(x) = |x|$ with $x_1 = 0$. The left-hand derivative has properties analogous to those of the right-hand derivative. The following Theorem summarizes.

<1> Theorem. Let $f$ be a convex, real-valued function defined (at least) on a bounded interval $[a, b]$ of the real line. The following properties hold.

(i) The right-hand derivative $D_+(x)$ exists,

$$\frac{f(y) - f(x)}{y - x} \downarrow D_+(x) \qquad\text{as } y \downarrow x,$$

for each $x$ in $[a, b)$. The function $D_+(x)$ is increasing and right-continuous on $[a, b)$. It is finite for $a < x < b$, but $D_+(a)$ might possibly equal $-\infty$.

(ii) The left-hand derivative $D_-(x)$ exists,

$$\frac{f(x) - f(z)}{x - z} \uparrow D_-(x) \qquad\text{as } z \uparrow x,$$

for each $x$ in $(a, b]$. The function $D_-(x)$ is an increasing and left-continuous function on $(a, b]$. It is finite for $a < x < b$, but $D_-(b)$ might possibly equal $+\infty$.

(iii) For $a \le x < y \le b$,

$$D_+(x) \le \frac{f(y) - f(x)}{y - x} \le D_-(y).$$

(iv) $D_-(x) \le D_+(x)$ for each $x$ in $(a, b)$, and

$$f(w) \ge f(x) + c(w - x) \qquad\text{for all } w \text{ in } [a, b],$$

for each real $c$ with $D_-(x) \le c \le D_+(x)$.

Proof. Only the second part of assertion (iv) remains to be proved. For $w > x$ use

$$\frac{f(w) - f(x)}{w - x} = S(x, w) \ge D_+(x) \ge c;$$

for $w < x$ use

$$\frac{f(x) - f(w)}{x - w} = S(w, x) \le D_-(x) \le c;$$

where $S(\cdot, \cdot)$ denotes the slope function, as above. □
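Both the supporting-line inequality of part (iv) and the monotonicity of chord slopes can be spot-checked numerically. The sketch below (editorial Python, not part of the text; the test points are arbitrary) uses $f(w) = |w|$ at its kink, where $D_-(0) = -1$ and $D_+(0) = +1$, and the chord slopes of $e^x$.

```python
import math

# supporting-line inequality f(w) >= f(x) + c*(w - x) for D_-(x) <= c <= D_+(x),
# checked for f(w) = |w| at the kink x = 0
f = abs
x = 0.0
ws = [i / 10 for i in range(-30, 31)]
for c in [-1.0, -0.5, 0.0, 0.5, 1.0]:          # any c in [D_-(0), D_+(0)]
    assert all(f(w) >= f(x) + c * (w - x) - 1e-12 for w in ws)

# the chord-slope S(x, y) = (f(y) - f(x))/(y - x) increases in y, for f = exp
S = lambda u, v: (math.exp(v) - math.exp(u)) / (v - u)
slopes = [S(0.0, y) for y in [0.5, 1.0, 1.5, 2.0]]
assert all(a < b for a, b in zip(slopes, slopes[1:]))
```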

<2> Corollary. If a convex function $f$ on a convex subset $C \subseteq \mathbb{R}^n$ has a local minimum at a point $x_0$, that is, if $f(x) \ge f(x_0)$ for all $x$ in a neighborhood of $x_0$, then $f(w) \ge f(x_0)$ for all $w$ in $C$.


Proof. Consider first the case $n = 1$. Suppose $w \in C$ with $w > x_0$. The right-hand derivative $D_+(x_0) = \lim_{y \downarrow x_0} (f(y) - f(x_0))/(y - x_0)$ must be nonnegative, because $f(y) \ge f(x_0)$ for $y$ near $x_0$. Assertion (iv) of the Theorem then gives

$$f(w) \ge f(x_0) + (w - x_0) D_+(x_0) \ge f(x_0).$$

The argument for $w < x_0$ is similar. For general $\mathbb{R}^n$, apply the result for $\mathbb{R}$ along each straight line through $x_0$. □

Existence of finite left-hand and right-hand derivatives ensures that $f$ is continuous at each point of the open interval $(a, b)$. It might not be continuous at the endpoints, as shown by the example

$$f(x) = \begin{cases} -\sqrt{x} & \text{for } x > 0 \\ 1 & \text{for } x = 0. \end{cases}$$

Of course, we could recover continuity by redefining $f(0)$ to equal $0$, the value of the limit $f(0+) := \lim_{w \downarrow 0} f(w)$.

<3> Corollary. Let $f$ be a convex, real-valued function on an interval $[a, b]$. There exists a countable collection of linear functions $d_i + c_i w$, for which the convex function $\psi(w) := \sup_{i \in \mathbb{N}} (d_i + c_i w)$ is everywhere $\le f(w)$, with equality except possibly at the endpoints $w = a$ or $w = b$, where $\psi(a) = f(a+)$ and $\psi(b) = f(b-)$.

Proof. Let $X_0 := \{x_i : i \in \mathbb{N}\}$ be a countable dense subset of $(a, b)$. Define $c_i := D_+(x_i)$ and $d_i := f(x_i) - c_i x_i$. By assertion (iv) of the Theorem, $f(w) \ge d_i + c_i w$ for $a \le w \le b$, for each $i$, and hence $f(w) \ge \psi(w)$.

If $a < w < b$ then (iv) also implies that $f(x_i) \ge f(w) + (x_i - w) D_+(w)$, and hence

$$\psi(w) \ge f(x_i) + c_i(w - x_i) \ge f(w) - (x_i - w)\left(D_+(x_i) - D_+(w)\right) \qquad\text{for all } x_i > w.$$

Let $x_i$ decrease to $w$ (through $X_0$) to conclude, via right-continuity of $D_+$ at $w$, that $\psi(w) \ge f(w)$.

If $D_+(a) > -\infty$ then $f$ is continuous at $a$, and

$$f(a) \ge \psi(a) \ge \limsup_i \left(f(x_i) + (a - x_i) c_i\right) = f(a+) = f(a).$$

If $D_+(a) = -\infty$ then $f$ must be decreasing in some neighborhood $N$ of $a$, with $c_i \le 0$ when $x_i \in N$, and

$$\psi(a) \ge \sup_{x_i \in N} \left(f(x_i) + (a - x_i) c_i\right) \ge \sup_{x_i \in N} f(x_i) = f(a+).$$

If $\psi(a)$ were strictly greater than $f(a+)$, the open set $\{w : \psi(w) > f(a+)\}$ would contain a neighborhood of $a$, which would imply existence of points $w$ in $N \setminus \{a\}$ for which $\psi(w) > f(a+) \ge f(w)$, contradicting the inequality $\psi(w) \le f(w)$. A similar argument works at the other endpoint. □
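The construction in the proof is easy to imitate numerically: tangent lines taken at a fine grid of points recover a convex function as their supremum. The sketch below (editorial Python, not part of the text; the choice $f(x) = x^2$ and the grid stand in for the countable dense subset) checks both $\psi \le f$ and the near-equality.

```python
# psi(w) = sup_i (d_i + c_i * w), with c_i = D_+(x_i) and d_i = f(x_i) - c_i * x_i,
# recovers the convex function f(x) = x^2 on a dense-enough grid of tangent points
f = lambda x: x * x
Dplus = lambda x: 2 * x                      # right-hand derivative of x^2
xs = [i / 100 for i in range(-100, 101)]     # stand-in for a countable dense subset
lines = [(f(x) - Dplus(x) * x, Dplus(x)) for x in xs]
psi = lambda w: max(d + c * w for d, c in lines)

for w in [-0.987, -0.5, 0.0, 0.123, 0.75, 1.0]:
    assert psi(w) <= f(w) + 1e-12            # psi never exceeds f
    assert f(w) - psi(w) < 1e-3              # and matches it up to grid error
```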

3. Integral representations

Convex functions on the real line are expressible as integrals of one-sided derivatives.


<4> Theorem. If $f$ is real-valued and convex on $[a, b]$, with $f(a) = f(a+)$ and $f(b) = f(b-)$, then both $D_+(x)$ and $D_-(x)$ are integrable with respect to Lebesgue measure on $[a, b]$, and

$$f(x) = f(a) + \int_a^x D_+(t)\,dt = f(a) + \int_a^x D_-(t)\,dt \qquad\text{for } a \le x \le b.$$

Proof. Choose $\alpha$ and $\beta$ with $a < \alpha < \beta < x$. For a positive integer $n$, define $\delta := (\beta - \alpha)/n$ and $x_i := \alpha + i\delta$ for $i = 0, 1, \ldots, n$. Both $D_+$ and $D_-$ are bounded on $[\alpha, \beta]$. For $i = 2, \ldots, n - 1$, part (iii) of Theorem <1> and monotonicity of both one-sided derivatives gives

$$\int_{x_{i-2}}^{x_{i-1}} D_+(t)\,dt \le \delta D_+(x_{i-1}) \le f(x_i) - f(x_{i-1}) \le \delta D_-(x_i) \le \int_{x_i}^{x_{i+1}} D_-(t)\,dt,$$

which sums to give

$$\int_{x_0}^{x_{n-2}} D_+(t)\,dt \le f(x_{n-1}) - f(x_1) \le \int_{x_2}^{x_n} D_-(t)\,dt.$$

Let $n$ tend to infinity, invoking Dominated Convergence and continuity of $f$, to deduce that $\int_\alpha^\beta D_+(t)\,dt \le f(\beta) - f(\alpha) \le \int_\alpha^\beta D_-(t)\,dt$. Both inequalities must actually be equalities, because $D_-(t) \le D_+(t)$ for all $t$ in $(a, b)$.

Let $\alpha$ decrease to $a$. Monotone Convergence—the functions $D_\pm$ are bounded above by $D_+(\beta)$ on $(a, \beta]$—and continuity of $f$ at $a$ give $f(\beta) - f(a) = \int_a^\beta D_+(t)\,dt = \int_a^\beta D_-(t)\,dt$. In particular, the negative parts of both $D_\pm$ are integrable. Then let $\beta$ increase to $x$ to deduce, via a similar argument, the asserted integral expressions for $f(x) - f(a)$, and the integrability of $D_\pm$ on $[a, b]$. □

Conversely, suppose $f$ is a continuous function defined on an interval $[a, b]$, with an increasing, real-valued right-hand derivative $D_+(t)$ existing at each point of $[a, b)$. On each closed proper subinterval $[a, x]$, the function $D_+$ is bounded, and hence Lebesgue integrable. From Section 3.4, $f(x) = f(a) + \int_a^x D_+(t)\,dt$ for all $a \le x < b$. Equality for $x = b$ also follows, by continuity and Monotone Convergence. A simple argument will show that $f$ is then convex on $[a, b]$.

More generally, suppose $D$ is an increasing, real-valued function defined (at least) on $[a, b)$. Define $g(x) := \int_a^x D(t)\,dt$, for $a \le x \le b$. (Possibly $g(b) = \infty$.) Then $g$ is convex. For if $a \le x_0 < x_1 < b$ and $0 < \alpha < 1$ and $x_\alpha := (1 - \alpha) x_0 + \alpha x_1$, then

$$(1 - \alpha) g(x_0) + \alpha g(x_1) - g(x_\alpha)$$
$$= \int \left((1 - \alpha)\{t < x_0\} + \alpha\{t < x_1\} - \{t < x_\alpha\}\right) D(t)\,dt$$
$$= \int \left(\alpha\{x_\alpha \le t < x_1\} - (1 - \alpha)\{x_0 \le t < x_\alpha\}\right) D(t)\,dt$$
$$\ge \left(\alpha(x_1 - x_\alpha) - (1 - \alpha)(x_\alpha - x_0)\right) D(x_\alpha) = 0.$$
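The final computation can be checked numerically: the running integral of an increasing function is convex. The sketch below (editorial Python, not part of the text; the particular step function $D$ is a choice made for this illustration) builds $g$ by Riemann sums and verifies midpoint convexity on a grid.

```python
import math

# g(x) = integral_a^x D(t) dt is convex when D is increasing, checked on a grid
# for the increasing step function D(t) = floor(3*t) on [a, b] = [0, 1]
a, n = 0.0, 1000
D = lambda t: math.floor(3 * t)
step = 1.0 / n
# cumulative left-endpoint Riemann sums approximate g at grid points j/n
g = [0.0]
for j in range(n):
    g.append(g[-1] + D(a + j * step) * step)

# midpoint convexity on the grid: g((u + v)/2) <= (g(u) + g(v))/2
for i in range(0, n + 1, 50):
    for j in range(i, n + 1, 50):
        mid = (i + j) // 2
        assert g[mid] <= (g[i] + g[j]) / 2 + 1e-9
```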

<5> Example. Let / be a twice continuously differentiable (actually, absolutecontinuity of / ' would suffice) convex function, defined on a convex interval J c R


that contains the origin. Suppose $f(0) = f'(0) = 0$. The representations

$$f(x) = x \int \{0 \le s \le 1\} f'(xs)\,ds$$
$$= x^2 \iint \{0 \le t \le s \le 1\} f''(xt)\,dt\,ds = x^2 \int_0^1 (1 - t) f''(xt)\,dt,$$

establish the following facts.

(i) The function $f(x)/x$ is increasing.

(ii) The function $\phi(x) := 2f(x)/x^2$ is nonnegative and convex.

(iii) If $f''$ is increasing then so is $\phi$.

Moreover, Jensen's inequality for the uniform distribution $\lambda$ on the triangular region $\{0 \le t \le s \le 1\}$ implies that

$$\phi(x) = \lambda^{s,t} f''(xt) \ge f''\left(\lambda^{s,t}(xt)\right) = f''(x/3).$$

Two special cases of these results were needed in Chapter 10, to establish the Bennett inequality and to establish Kolmogorov's exponential lower bound. The choice $f(x) := e^x - 1 - x$, with $f''(x) = e^x$, leads to the conclusion that the function

$$\phi(x) = \begin{cases} 2\left(e^x - 1 - x\right)/x^2 & \text{for } x \ne 0 \\ 1 & \text{for } x = 0 \end{cases}$$

is nonnegative and increasing over the whole real line. The choice $f(x) := (1 + x)\log(1 + x) - x$, for $x \ge -1$, with $f'(x) = \log(1 + x)$ and $f''(x) = (1 + x)^{-1}$, leads to the conclusion that the function

$$\psi(x) = \begin{cases} 2\left((1 + x)\log(1 + x) - x\right)/x^2 & \text{for } x \ne 0 \\ 1 & \text{for } x = 0 \end{cases}$$

is nonnegative, convex, and decreasing. Also $x\psi(x)$ is increasing on $\mathbb{R}^+$, and $\psi(x) \ge (1 + x/3)^{-1}$. □
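The two functions singled out for Chapter 10 can be tabulated directly. The sketch below (editorial Python, not part of the text; the grid of test points is arbitrary) verifies the monotonicity claims and the bound $\psi(x) \ge (1 + x/3)^{-1}$.

```python
import math

def phi(x):    # 2*(e^x - 1 - x)/x^2, with phi(0) = 1
    return 1.0 if x == 0 else 2 * (math.exp(x) - 1 - x) / (x * x)

def psi(x):    # 2*((1+x)*log(1+x) - x)/x^2 for x > -1, with psi(0) = 1
    return 1.0 if x == 0 else 2 * ((1 + x) * math.log(1 + x) - x) / (x * x)

xs = [-0.9, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0, 2.0, 5.0]
# phi is nonnegative and increasing over the whole real line
assert all(phi(x) >= 0 for x in xs)
assert all(phi(a) <= phi(b) for a, b in zip(xs, xs[1:]))
# psi is nonnegative and decreasing, and dominates (1 + x/3)^(-1)
assert all(psi(a) >= psi(b) for a, b in zip(xs, xs[1:]))
assert all(psi(x) >= 1 / (1 + x / 3) - 1e-12 for x in xs)
```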

4. Relative interior of a convex set

Convex subsets of Euclidean spaces either have interior points, or they can beregarded as embedded in lower dimensional subspaces within which they haveinterior points.

<6> Theorem. Let $C$ be a convex subset of $\mathbb{R}^n$.

(i) There exists a smallest subspace $V$ for which $C \subseteq x_0 \oplus V := \{x_0 + x : x \in V\}$, for each $x_0 \in C$.

(ii) $\dim(V) = n$ if and only if $C$ has a nonempty interior.

(iii) If $\operatorname{int}(C) \ne \emptyset$, there exists a convex, nonnegative function $p$ defined on $\mathbb{R}^n$ for which

$$\operatorname{int}(C) = \{x : p(x) < 1\} \subseteq C \subseteq \{x : p(x) \le 1\} = \overline{C}.$$

Proof. With no loss of generality, suppose $0 \in C$. Let $x_1, \ldots, x_k$ be a maximal set of linearly independent vectors from $C$, and let $V$ be the subspace spanned by those vectors. Clearly $C \subseteq V$. If $k < n$, there exists a unit vector $w$ orthogonal to $V$, and every point $x$ of $V$ is a limit of points $x + tw$ not in $V$. Thus $C$ has an empty interior.


If $k = n$, write $\bar{x}$ for $\sum_i x_i/n$. Each member of the usual orthonormal basis has a representation as a linear combination, $e_i = \sum_j a_{i,j} x_j$. Choose an $\epsilon > 0$ for which $2n\epsilon\left(\sum_i a_{i,j}^2\right)^{1/2} < 1$ for every $j$. For every $y := \sum_i y_i e_i$ in $\mathbb{R}^n$ with $|y| < \epsilon$, the coefficients $\theta_j := (2n)^{-1} + \sum_i a_{i,j} y_i$ are positive, summing to a quantity $1 - \theta_0 < 1$, and $\bar{x}/2 + y = \theta_0 0 + \sum_j \theta_j x_j \in C$. Thus $\bar{x}/2$ is an interior point of $C$.

If $\operatorname{int}(C) \ne \emptyset$, we may, with no loss of generality, suppose $0$ is an interior point. Define a map $p : \mathbb{R}^n \to \mathbb{R}^+$ by $p(z) := \inf\{t > 0 : z/t \in C\}$. It is easy to see that $p(0) = 0$, and $p(\alpha y) = \alpha p(y)$ for $\alpha > 0$. Convexity of $C$ implies that $p(z_1 + z_2) \le p(z_1) + p(z_2)$ for all $z_i$: if $z_i/t_i \in C$ then

$$\frac{z_1 + z_2}{t_1 + t_2} = \frac{t_1}{t_1 + t_2}\,\frac{z_1}{t_1} + \frac{t_2}{t_1 + t_2}\,\frac{z_2}{t_2} \in C.$$

In particular, $p$ is a convex function. Also $p$ satisfies a Lipschitz condition: if $y = \sum_i y_i e_i$ and $z = \sum_i z_i e_i$ then

$$p(y - z) = p\left(\sum\nolimits_i (y_i - z_i) e_i\right)$$
$$\le \sum\nolimits_i \left((y_i - z_i)^+ p(e_i) + (y_i - z_i)^- p(-e_i)\right)$$
$$\le |y - z| \left(\sum\nolimits_i p(e_i)^2 \vee p(-e_i)^2\right)^{1/2}.$$

Thus $\{p < 1\}$ is open and $\{p \le 1\}$ is closed.

Clearly $p(x) \le 1$ for every $x$ in $C$; and if $p(x) < 1$ then $x_0 := x/t \in C$ for some $t < 1$, implying $x = (1 - t)0 + t x_0 \in C$. Thus $\{z : p(z) < 1\} \subseteq C \subseteq \{z : p(z) \le 1\}$. Every point $x$ with $p(x) = 1$ lies on the boundary, being a limit of points $x(1 \pm n^{-1})$ from $C$ and $C^c$. Assertion (iii) follows. □
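The gauge function $p$ of the proof can be computed by bisection, because $z/t \in C$ precisely when $t$ is beyond the threshold $p(z)$. The sketch below (editorial Python, not part of the text; the unit disc is a choice made for this illustration, where $p$ reduces to the Euclidean norm) checks the construction.

```python
import math

# gauge (Minkowski) functional p(z) = inf{t > 0 : z/t in C}, computed by
# bisection for the convex set C = closed unit disc in R^2
def in_C(x, y):                     # membership in the closed unit disc
    return x * x + y * y <= 1.0

def gauge(x, y, hi=1e6, tol=1e-9):
    lo = 0.0
    for _ in range(200):            # bisect on t: z/t in C iff t >= p(z)
        mid = (lo + hi) / 2
        if in_C(x / mid, y / mid):
            hi = mid
        else:
            lo = mid
    return hi

# for the unit disc, p(z) equals the Euclidean norm |z|
for x, y in [(0.3, 0.4), (3.0, 4.0), (-1.0, 0.0), (0.2, 0.0)]:
    assert abs(gauge(x, y) - math.hypot(x, y)) < 1e-6
```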

If $C \subseteq x_0 \oplus V \subseteq \mathbb{R}^n$, with $\dim(V) = k < n$, we can identify $V$ with $\mathbb{R}^k$ and $C$ with a subset of $\mathbb{R}^k$. By part (ii) of the Theorem, $C$ has a nonempty interior, as a subset of $x_0 \oplus V$. That is, there exist points $x$ of $C$ with open neighborhoods $N$ (in $\mathbb{R}^n$) for which $N \cap (x_0 \oplus V) \subseteq C$. The set of all such points is called the relative interior of $C$, and is denoted by $\operatorname{rel-int}(C)$. Part (iii) of the Theorem has an immediate extension,

$$\operatorname{rel-int}(C) \subseteq C \subseteq \overline{\operatorname{rel-int}(C)},$$

with a corresponding representation via a convex function $p$ defined only on $x_0 \oplus V$.

5. Separation of convex sets by linear functionals

The theorems asserting existence of separating linear functionals depend on the following simple extension result.

<7> Lemma. Let $f$ be a real-valued convex function, defined on a vector space $\mathcal{V}$. Let $T_0$ be a linear functional defined on a vector subspace $\mathcal{V}_0$, on which $T_0(x) \le f(x)$ for all $x \in \mathcal{V}_0$. Let $y_1$ be a point of $\mathcal{V}$ not in $\mathcal{V}_0$. There exists an extension of $T_0$ to a linear functional $T_1$ on the subspace $\mathcal{V}_1$ spanned by $\mathcal{V}_0 \cup \{y_1\}$ for which $T_1(z) \le f(z)$ on $\mathcal{V}_1$.


Proof. Each point $z$ in $\mathcal{V}_1$ has a unique representation $z := x + t y_1$, for some $x \in \mathcal{V}_0$ and some $t \in \mathbb{R}$. We need to find a value for $T_1(y_1)$ for which $f(x + t y_1) \ge T_0(x) + t T_1(y_1)$ for all $t \in \mathbb{R}$. Equivalently, we need a real number $c$ such that

$$\inf_{x_0 \in \mathcal{V}_0,\, t > 0} \frac{f(x_0 + t y_1) - T_0(x_0)}{t} \ \ge\ c\ \ge\ \sup_{x_1 \in \mathcal{V}_0,\, s > 0} \frac{T_0(x_1) - f(x_1 - s y_1)}{s},$$

for then $T_1(y_1) := c$ will give the desired extension.

For given $x_0$, $x_1$ in $\mathcal{V}_0$ and $s, t > 0$, define $\alpha := s/(s + t)$ and $x_\alpha := \alpha x_0 + (1 - \alpha) x_1$. Then, by convexity of $f$ on $\mathcal{V}_1$ and linearity of $T_0$ on $\mathcal{V}_0$,

$$\alpha f(x_0 + t y_1) + (1 - \alpha) f(x_1 - s y_1) \ge f(x_\alpha) \ge T_0(x_\alpha) = \alpha T_0(x_0) + (1 - \alpha) T_0(x_1),$$

which implies

$$\infty > \frac{f(x_0 + t y_1) - T_0(x_0)}{t} \ge \frac{T_0(x_1) - f(x_1 - s y_1)}{s} > -\infty.$$

The infimum over $x_0$ and $t > 0$ on the left-hand side must be greater than or equal to the supremum over $x_1$ and $s > 0$ on the right-hand side, and both bounds must be finite. Existence of the desired real $c$ follows. □

REMARK. The vector space $\mathcal{V}$ need not be finite dimensional. We can order extensions of $T_0$, bounded above by $f$, by defining $(T_\alpha, \mathcal{V}_\alpha) \ge (T_\beta, \mathcal{V}_\beta)$ to mean that $\mathcal{V}_\beta$ is a subspace of $\mathcal{V}_\alpha$, and $T_\alpha$ is an extension of $T_\beta$. Zorn's lemma gives a maximal element $(T_\gamma, \mathcal{V}_\gamma)$ of the set of extensions $\ge (T_0, \mathcal{V}_0)$. Lemma <7> shows that $\mathcal{V}_\gamma$ must equal the whole of $\mathcal{V}$, otherwise there would be a further extension. That is, $T_0$ has an extension to a linear functional $T$ defined on $\mathcal{V}$ with $T(x) \le f(x)$ for every $x$ in $\mathcal{V}$. This result is a minor variation on the Hahn-Banach theorem from functional analysis (compare with page 62 of Dunford & Schwartz 1958).

<8> Theorem. Let $C$ be a convex subset of $\mathbb{R}^n$ and $y_0$ be a point not in $\operatorname{rel-int}(C)$.

(i) There exists a linear functional $T$ on $\mathbb{R}^n$ for which $0 \ne T(y_0) \ge \sup_{x \in C} T(x)$.

(ii) If $y_0 \notin \overline{C}$, then we may choose $T$ so that $T(y_0) > \sup_{x \in C} T(x)$.

Proof. With no loss of generality, suppose $0 \in C$. Let $V$ denote the subspace spanned by $C$, as in Theorem <6>. If $y_0 \notin V$, let $\ell$ be its component orthogonal to $V$. Then $y_0 \cdot \ell > 0 = x \cdot \ell$ for all $x$ in $C$.

If $y_0 \in V$, the problem reduces to construction of a suitable linear functional $T$ on $V$: we then have only to define $T(z) := 0$ for $z \perp V$ to complete the proof. Equivalently, we may suppose that $V = \mathbb{R}^n$. Define $T_0$ on $\mathcal{V}_0 := \{r y_0 : r \in \mathbb{R}\}$ by $T_0(r y_0) := r p(y_0)$, for the $p$ defined in Theorem <6>. Note that $T_0(y_0) = p(y_0) \ge 1$, because $y_0 \notin \operatorname{rel-int}(C) = \{p < 1\}$. Clearly $T_0(x) \le p(x)$ for all $x \in \mathcal{V}_0$. Invoke Lemma <7> repeatedly to extend $T_0$ to a linear functional $T$ on $\mathbb{R}^n$, with $T(x) \le p(x)$ for all $x \in \mathbb{R}^n$. In particular,

$$T(y_0) \ge 1 \ge p(x) \ge T(x) \qquad\text{for all } x \in \overline{C} = \{p \le 1\}.$$

For (ii), note that $T(y_0) > 1$ if $y_0 \notin \overline{C}$. □

<9> Corollary. Let $C_1$ and $C_2$ be disjoint convex subsets of $\mathbb{R}^n$. Then there is a nonzero linear functional $T$ for which $\inf_{x \in C_1} T(x) \ge \sup_{x \in C_2} T(x)$.


Proof. Define $C$ as the convex set $\{x_2 - x_1 : x_i \in C_i\}$. The origin does not belong to $C$. Thus there is a nonzero linear functional for which $0 = T(0) \ge T(x_2 - x_1)$ for all $x_i \in C_i$. □

<10> Corollary. For each closed convex subset $F$ of $\mathbb{R}^n$ there exists a countable family of closed halfspaces $\{H_i : i \in \mathbb{N}\}$ for which $F = \cap_{i \in \mathbb{N}} H_i$.

Proof. Let $\{x_i : i \in \mathbb{N}\}$ be a countable dense subset of $F^c$. Define $r_i$ as the distance from $x_i$ to $F$, which is strictly positive for every $i$, because $F^c$ is open. The open ball $B(x_i, r_i)$ with radius $r_i$ and center $x_i$ is convex and disjoint from $F$. From the previous Corollary, there exists a unit vector $\ell_i$ and a constant $k_i$ for which $\ell_i \cdot y \ge k_i \ge \ell_i \cdot x$ for all $y \in B(x_i, r_i)$ and all $x \in F$. Define $H_i := \{x \in \mathbb{R}^n : \ell_i \cdot x \le k_i\}$.

Each $x$ in $F^c$ is the center of some open ball $B(x, 3\epsilon)$ disjoint from $F$. There is an $x_i$ with $|x - x_i| < \epsilon$. We then have $r_i \ge 2\epsilon$, because $B(x, 3\epsilon) \supseteq B(x_i, 2\epsilon)$, and hence $x - \epsilon \ell_i \in B(x_i, r_i)$. The separation inequality $\ell_i \cdot (x - \epsilon \ell_i) \ge k_i$ then implies $\ell_i \cdot x > k_i$, that is $x \notin H_i$. □
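The representation by countably many halfspaces can be illustrated concretely. The sketch below (editorial Python, not part of the text; the disc and the finite family of tangent directions are choices made for this illustration, standing in for the countable family of the proof) checks membership through the halfspace inequalities.

```python
import math

# F = closed unit disc as an intersection of halfspaces {x : l_i . x <= 1},
# with unit normals l_i taken from a dense set of directions
m = 360
normals = [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
           for j in range(m)]

def in_intersection(x, y):
    return all(lx * x + ly * y <= 1 + 1e-9 for lx, ly in normals)

assert in_intersection(0.3, -0.4)          # interior point of the disc
assert in_intersection(1.0, 0.0)           # boundary point
assert not in_intersection(1.2, 0.0)       # excluded by some halfspace
```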

<11> Corollary. Let $f$ be a convex (real-valued) function defined on a convex subset $C$ of $\mathbb{R}^n$, such that $\operatorname{epi}(f)$ is a closed subset of $\mathbb{R}^{n+1}$. Then there exist $\{d_i : i \in \mathbb{N}\} \subseteq \mathbb{R}^n$ and $\{c_i : i \in \mathbb{N}\} \subseteq \mathbb{R}$ such that $f(x) = \sup_{i \in \mathbb{N}} (c_i + d_i \cdot x)$ for every $x$ in $C$.

Proof. From the previous Corollary, and the definition of $\operatorname{epi}(f)$, there exist $\ell_i \in \mathbb{R}^n$ and constants $\alpha_i, k_i \in \mathbb{R}$ such that

$$\infty > t \ge f(x) \qquad\text{if and only if}\qquad t\alpha_i \ge \ell_i \cdot x - k_i \text{ for all } i \in \mathbb{N}.$$

The $i$th inequality can hold for arbitrarily large $t$ only if $\alpha_i \ge 0$. Define $\psi(x) := \sup_{\alpha_i > 0} (\ell_i \cdot x - k_i)/\alpha_i$. Clearly $f(x) \ge \psi(x)$ for $x \in C$. If $s < f(x)$ for an $x$ in $C$ then there must exist an $i$ for which $\ell_i \cdot x - f(x)\alpha_i \le k_i < \ell_i \cdot x - s\alpha_i$, thereby forcing $\alpha_i > 0$ and $s < \psi(x)$. Thus $f = \psi$ on $C$, a supremum of the asserted form. □

6. Problems

[1] Let $f$ be the convex function, taking values in $\mathbb{R} \cup \{\infty\}$, defined by

$$f(x, y) := \begin{cases} -\sqrt{y} & \text{for } y \ge 0 \text{ and } x \in \mathbb{R} \\ \infty & \text{otherwise.} \end{cases}$$

Let $T_0$ denote the linear function defined on the $x$-axis by $T_0(x, 0) := 0$ for all $x \in \mathbb{R}$. Show that $T_0$ has no extension to a linear functional on $\mathbb{R}^2$ for which $T(x, y) \le f(x, y)$ everywhere, even though $T_0 \le f$ along the $x$-axis.

[2] Suppose $X$ is a random variable for which the moment generating function, $M(t) := \mathbb{P}\exp(tX)$, exists (and is finite) for $t$ in an open interval $J$ about the origin of the real line. Write $\mathbb{P}_t$ for the probability measure with density $e^{tX}/M(t)$ with respect to $\mathbb{P}$, for $t \in J$, with corresponding variance $\operatorname{var}_t(\cdot)$. Define $\Lambda(t) := \log M(t)$.

(i) Use Dominated Convergence to justify the operations needed to show that

$$\Lambda'(t) = M'(t)/M(t) = \mathbb{P}\left(X e^{tX}/M(t)\right) = \mathbb{P}_t X,$$
$$\Lambda''(t) = \left(M(t) M''(t) - M'(t)^2\right)/M(t)^2 = \operatorname{var}_t(X).$$

Page 331: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

316 Appendix C: Convexity

(ii) Deduce that Λ is a convex function on J.

(iii) Show that Λ achieves its minimum at t = 0 if PX = 0.

[3] Let Q be a probability measure defined on a finite interval [a, b]. Write σ²_Q for its variance.

(i) Show that σ²_Q ≤ (b − a)²/4. Hint: Reduce to the case b = −a, noting that σ²_Q ≤ Q(x²).

(ii) Suppose also that Qx = 0. Define Λ(t) := log(Q e^{tx}), for t ∈ ℝ. Show that Λ″(t) ≤ (b − a)²/4, and hence Λ(t) ≤ t²(b − a)²/8 for all t ∈ ℝ.

(iii) (Hoeffding 1963) Let X₁, …, X_n be independent random variables with zero expected values, and with X_i taking values only in a finite interval [a_i, b_i]. For ε > 0, show that

   P{X₁ + ⋯ + X_n ≥ ε} ≤ inf_{t>0} e^{−tε} ∏_i P e^{tX_i} ≤ exp(−2ε²/∑_i(b_i − a_i)²).
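The Hoeffding bound in (iii) is easy to check numerically in the special case of Rademacher variables (each X_i = ±1 with probability 1/2, so [a_i, b_i] = [−1, 1]), where the exact tail probability of the sum is a Binomial tail. A minimal sketch, not part of the original text; the function names are ours:

```python
# Numerical check of Hoeffding's inequality for Rademacher variables:
# X_i = +/-1 with probability 1/2, so sum_i (b_i - a_i)^2 = 4n, and the
# sum S_n = 2*Bin(n,1/2) - n has an exact tail computable from binomial
# coefficients.
from math import comb, exp

def rademacher_tail(n, eps):
    """Exact P{X_1 + ... + X_n >= eps} for n Rademacher variables."""
    k_min = -(-(n + eps) // 2)   # smallest integer k with 2k - n >= eps
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def hoeffding_bound(n, eps):
    """exp(-2 eps^2 / sum_i (b_i - a_i)^2), with each b_i - a_i = 2."""
    return exp(-2 * eps ** 2 / (4 * n))

for n, eps in [(10, 4), (20, 6), (50, 10)]:
    exact, bound = rademacher_tail(n, eps), hoeffding_bound(n, eps)
    print(n, eps, exact, bound, exact <= bound)
```

For n = 20 and ε = 6 the exact tail is about 0.13, comfortably below the bound exp(−0.9) ≈ 0.41.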

[4] Let P be a probability measure on ℝ^k. Define M(t) := P^x(e^{t·x}) for t ∈ ℝ^k.

(i) Show that the set C := {t ∈ ℝ^k : M(t) < ∞} is convex.

(ii) Show that log M(t) is convex on rel-int(C).

[5] Let f be a convex increasing function on ℝ⁺. Show that there exists an increasing sequence of convex, increasing functions f_n, with each f_n″ bounded and continuous, such that 0 ≤ f_n(x) ≤ f_{n+1}(x) ↑ f(x) for each x. Hint: Approximate the right-hand derivative of f from below by smooth, increasing functions.

7. Notes

Most of the material described in this Appendix can be found, often in much greatergenerality, in the very thorough monograph by Rockafellar (1970).

REFERENCES

Dunford, N. & Schwartz, J. T. (1958), Linear Operators, Part I: General Theory, Wiley.

Hoeffding, W. (1963), 'Probability inequalities for sums of bounded random variables', Journal of the American Statistical Association 58, 13–30.

Rockafellar, R. T. (1970), Convex Analysis, Princeton University Press, Princeton, New Jersey.

Page 332: (Cambridge Series in Statistical and Probabilistic Mathematics) David Pollard-A User's Guide to Measure Theoretic Probability-Cambridge University Press (2001).pdf

Appendix D

Binomial and normal distributions

SECTION 1 establishes some useful bounds for the tails of the normal distribution, then uses them to derive the perturbation inequalities needed for the proof of the main result in Section 2.

SECTION 2 describes a very precise approximation to symmetric Binomial tail probabilities via the tails of the standard normal distribution. The approximation implies existence of a very tight coupling between the Binomial and its approximating normal—the key to the KMT coupling (Chapter 10) between the empirical process and a Brownian Bridge.

SECTION 3 proves the results described in Section 2.

1. Tails of the normal distributions

The N(0, 1) distribution on the real line has density function φ(x) := exp(−x²/2)/√(2π) with respect to Lebesgue measure. For many limit theorems and inequalities it is only the rate of decrease of the tail probability Φ̄(x) := P{N(0, 1) > x} that matters. The simplest approximation,

<1>   (1/x − 1/x³) φ(x) < Φ̄(x) < (1/x) φ(x)   for x > 0,

follows (compare with Feller 1968, Section VII.1 and Problem 7.1) by integrating from x to ∞ across the trivial inequalities

   (1 − 3/t⁴) φ(t) < φ(t) < (1 + 1/t²) φ(t)   for t > 0.

Less precisely,

   Φ̄(x) = (1 + O(x^{−2})) φ(x)/x   as x → ∞.

When x is close to zero there are better bounds, such as

<2>   Φ̄(x) ≤ ½ exp(−x²/2)   for x ≥ 0,

an inequality that will follow from properties of the function ρ(x) := φ(x)/Φ̄(x), defined for all real x.
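Both <1> and <2> are easy to verify numerically, using the identity Φ̄(x) = ½ erfc(x/√2). A quick sketch (the function names are ours, not the book's):

```python
# Numerical check of the normal tail bounds <1> and <2>:
#   (1/x - 1/x^3) phi(x) < Phibar(x) < phi(x)/x   for x > 0,
#   Phibar(x) <= (1/2) exp(-x^2/2)                for x >= 0.
from math import erfc, exp, pi, sqrt

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phibar(x):
    """Standard normal tail probability P{N(0,1) > x}."""
    return 0.5 * erfc(x / sqrt(2))

for x in [0.5, 1.0, 2.0, 3.0, 5.0]:
    assert (1 / x - 1 / x ** 3) * phi(x) < Phibar(x) < phi(x) / x
    assert Phibar(x) <= 0.5 * exp(-x * x / 2)
print("bounds verified")
```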


First note that inequality <1> provides upper and lower bounds for 1/ρ(x) on the positive part of the real line, which translate into

<3>   x < ρ(x) < x + x/(x² − 1)   for x > 1.

The lower bound is also valid for 0 < x ≤ 1. To a first approximation, ρ(x) increases like x. The difference r(x) := ρ(x) − x is nonnegative, and it converges to zero as x tends to infinity. As the plot suggests, the function r(·) is actually decreasing, a property that will have pleasant consequences.

[Figure: graphs of r(x) = ρ(x) − x and of the upper bound x/(x² − 1), for 0 ≤ x ≤ 10.]

<4> Theorem. The function ρ(x) := φ(x)/Φ̄(x) is increasing, with ρ(−∞) = 0 and ρ(0) = 2/√(2π) ≈ .7979. The function r(x) := ρ(x) − x decreases to zero as x tends to infinity. The function log ρ(x) is concave, and log ρ(x + δ) ≤ log ρ(x) + r(x)δ for x ∈ ℝ and δ > 0.

Proof. Temporarily write G(x) for 1/ρ(x), which equals the Laplace transform of the measure μ on ℝ⁺ with density exp(−z²/2){z > 0} with respect to Lebesgue measure:

   G(x) = Φ̄(x)/φ(x) = μ^z exp(−xz) = ∫₀^∞ exp(−xz − z²/2) dz.

For each x, define P_x to be the probability measure with density e^{−zx}/G(x) with respect to μ. Dominated Convergence then lets us differentiate under the integral sign to obtain

   G′(x)/G(x) = μ^z(−z e^{−zx})/G(x) = −P_x z < 0,

   G″(x)/G(x) = μ^z(z² e^{−zx})/G(x) = P_x z².

From the first inequality it follows that G is decreasing, and ρ is increasing. The derivatives of the function

   ψ(x) := −log ρ(x) = log G(x) = x²/2 + log Φ̄(x) + log √(2π)

satisfy the equalities

   ψ′(x) = G′(x)/G(x) = x − ρ(x) = −r(x),

   ψ″(x) = G″(x)/G(x) − (G′(x)/G(x))² = P_x z² − (P_x z)².

The last expression equals the variance of the nondegenerate distribution P_x, which is strictly positive. Thus ψ is convex, and r(x) = −ψ′(x) is decreasing. The Mean-Value Theorem then gives

   log(ρ(x + δ)/ρ(x)) = −ψ(x + δ) + ψ(x) = δ r(x*) ≤ δ r(x),

where x* lies in the interval (x, x + δ). □

REMARK. Assertion <2> is equivalent to the inequality ρ(x) ≥ ρ(0) = 2/√(2π) for x ≥ 0, a consequence of the fact that ρ is an increasing function.
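The monotonicity assertions of Theorem <4> can also be spot-checked on a grid (a numerical sketch only, not a proof; the names are ours):

```python
# Spot check of Theorem <4>: rho(x) = phi(x)/Phibar(x) is increasing,
# rho(0) = 2/sqrt(2*pi) ~ 0.7979, and r(x) = rho(x) - x is decreasing.
from math import erfc, exp, pi, sqrt

def rho(x):
    phi = exp(-x * x / 2) / sqrt(2 * pi)      # normal density
    return phi / (0.5 * erfc(x / sqrt(2)))    # density / tail probability

xs = [i / 10 for i in range(-30, 51)]         # grid on [-3, 5]
rhos = [rho(x) for x in xs]
rs = [rho(x) - x for x in xs]
assert all(a < b for a, b in zip(rhos, rhos[1:]))   # rho increasing
assert all(a > b for a, b in zip(rs, rs[1:]))       # r decreasing
print(round(rho(0.0), 4))                            # 0.7979
```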

Suppose Z is N(0, 1) distributed. The approximation Φ̄(x) ≈ φ(x)/x suggests a simple form for the conditional probability

   P{Z > x + δ | Z > x} = Φ̄(x + δ)/Φ̄(x),

namely, something close to exp(−(x + δ)²/2)/exp(−x²/2) = exp(−xδ − δ²/2), at least for large x and small δ. The ρ function from Theorem <4> lets us sharpen the approximation to a pair of most useful inequalities.

<5> Corollary. For x ∈ ℝ and δ ≥ 0,

   exp(−xδ − δ²/2) ≥ Φ̄(x + δ)/Φ̄(x) ≥ exp(−ρ(x)δ − δ²/2).

Proof. For the left-hand inequality, replace exp(−tδ) by its upper bound exp(−xδ) in the equality

   Φ̄(x + δ) = ∫_x^∞ φ(t + δ) dt = ∫_x^∞ φ(t) exp(−tδ − δ²/2) dt.

For the right-hand inequality write the ratio as

   Φ̄(x + δ)/Φ̄(x) = exp(−xδ − δ²/2 + log(ρ(x)/ρ(x + δ))),

then invoke the last assertion of Theorem <4> to bound the log term from below by −r(x)δ. Remember that ρ(x) = x + r(x). □

REMARK. Notice that the ratio of the lower and upper bounds equals exp(−r(x)δ), which lies close to one if x is large and δ/x is small.

By solving for δ as a function of the ratio of tail probabilities, we obtain the perturbation inequalities needed for the proof of the main result.

<6> Lemma. Let x, y, and z be related by the equality Φ̄(z) = e^{−y} Φ̄(x), with x ≥ 0.

(i) There exists a positive constant C₁ such that both |z − x| ≤ C₁|y| and |z − x − y/ρ(x)| ≤ C₁|y|²/ρ(x), provided y ≥ −(1 + x²)/2.

(ii) If x ≥ 2 and y ≥ 0 then 0 ≤ √(x² + 2y) − z ≤ 2y/x³.


Proof. The three quantities satisfy the equation

   g(z) = g(x) + y   where g(t) := −log Φ̄(t).

The function g is increasing, nonnegative, and convex, because g′(t) = ρ(t), an increasing function. Inequality <2> implies that g(t) ≥ log 2 + t²/2 when t ≥ 0. In particular, under the condition on y from (i),

   g(z) = g(x) + y ≥ log 2 + ½x² − ½(1 + x²) = log 2 − ½ > 0,

implying that z ≥ −κ := g^{−1}(log 2 − ½) > −∞.

Write w for z − x. For some x₁ and x₂ between x and z,

   y = g(z) − g(x) = wρ(x₁) = wρ(x) + ½w²ρ′(x₂).

From the first equality we get |w| ≤ |y|/ρ(−κ), and from the second, using the fact that 0 < ρ′ < 1, we get |w − y/ρ(x)| ≤ |y|²/(2ρ(−κ)²ρ(x)). Assertion (i) is proved.

For assertion (ii), note that w ≥ 0. From Corollary <5>,

   ρ(x)w + ½w² ≥ g(z) − g(x) ≥ xw + ½w².

Thus w must lie between w₁ and w₂, the positive solutions of xw₁ + ½w₁² = y = ρ(x)w₂ + ½w₂²:

   w₁ := h(x) ≥ w ≥ h(x + r(x)) =: w₂   where h(t) := −t + √(t² + 2y).

The function −h has a decreasing derivative, with

   0 ≤ −h′(t) = 1 − t/√(t² + 2y) = 2y/(√(t² + 2y)(√(t² + 2y) + t)) ≤ y/t²,

which implies, since x + w₁ = √(x² + 2y),

   0 ≤ (x + w₁) − z ≤ w₁ − w₂ ≤ −r(x)h′(x) ≤ (x/(x² − 1))(y/x²) ≤ 2y/x³   if x ≥ 2,

as asserted. □
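Part (ii) of the Lemma can be spot-checked with the inverse normal distribution function from the Python standard library (a numerical sketch under our own variable names, not part of the original text):

```python
# Spot check of Lemma <6>(ii): if Phibar(z) = exp(-y) * Phibar(x)
# with x >= 2 and y >= 0, then 0 <= sqrt(x^2 + 2y) - z <= 2y/x^3.
from math import exp, sqrt
from statistics import NormalDist

std = NormalDist()

def z_from(x, y):
    """Solve Phibar(z) = exp(-y) * Phibar(x) for z."""
    tail = exp(-y) * (1.0 - std.cdf(x))
    return std.inv_cdf(1.0 - tail)

for x in [2.0, 3.0, 4.0]:
    for y in [0.1, 1.0, 5.0]:
        gap = sqrt(x * x + 2 * y) - z_from(x, y)
        assert 0.0 <= gap <= 2 * y / x ** 3, (x, y, gap)
print("lemma (ii) verified on grid")
```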

2. Quantile coupling of Binomial with normal

The quantile transformation defines an increasing function Ψ_n from ℝ onto {0, 1, …, n} for which the random variable X := Ψ_n(Y) has exactly a Bin(n, 1/2) distribution when Y has a N(n/2, n/4) distribution. More precisely, Ψ_n should take the constant value k on an interval (β_{k,n}, β_{k+1,n}], for k = 0, 1, …, n, where the cutpoints −∞ = β_{0,n} < β_{1,n} < … < β_{n,n} < β_{n+1,n} = ∞ are determined by the requirement that P{X ≥ k} = P{Y ≥ β_{k,n}} for each k. The challenge lies in locating the cutpoints.

The usual normal approximation suggests that P{X ≥ k} ≈ P{Y ≥ k − 1/2}, that is, β_{k,n} ≈ k − 1/2. Numerical calculation provides further evidence that k − 1/2 is indeed very close to a lower bound, at least for small values of n.
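The cutpoints are easy to compute for small n, from the exact Binomial tails and the normal quantile function in the Python standard library. The following sketch (our code, not the book's) reproduces the behavior just described:

```python
# Compute the cutpoints beta_{k,n} defined by P{X >= k} = P{Y >= beta_{k,n}}
# for X ~ Bin(n, 1/2) and Y ~ N(n/2, n/4), and compare them with k - 1/2.
from math import comb, sqrt
from statistics import NormalDist

def cutpoint(k, n):
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    Y = NormalDist(mu=n / 2, sigma=sqrt(n) / 2)
    return Y.inv_cdf(1 - tail)        # beta with P{Y >= beta} = tail

n = 20
for k in range(11, n + 1):
    print(k, round(cutpoint(k, n) - (k - 0.5), 4))
```

For k near n/2 the difference β_{k,n} − (k − 1/2) is tiny; it grows as k approaches n, matching the drift visible in the plot below.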


[Figure: plot of β_{k,n} − (k − 1/2) as a function of k, for n/2 < k ≤ n and values of n ranging from 1 to 20. Each small cross corresponds to a different k and n. Crosses that share the same n are joined by line segments.]

Notice that the plot covers only values of k greater than n/2, a restriction justified by the following symmetry argument. The fact that n − X also has a Bin(n, 1/2) distribution implies

   P{Y ≥ β_{k,n}} = P{n − X ≥ k} = 1 − P{X ≥ n − k + 1} = 1 − P{Y ≥ β_{n−k+1,n}},

from which it follows that β_{k,n} − n/2 = n/2 − β_{n−k+1,n}, that is, β_{k,n} + β_{n−k+1,n} = n. Put another way, the intervals (β_{1,n}, β_{n,n}), (β_{2,n}, β_{n−1,n}), and so on, are all symmetric about n/2. When n is even, say n = 2m, the interval (β_{m,n}, β_{m+1,n}) is symmetric about n/2; so we have only to consider k ≥ m + 1 = (n + 2)/2. When n is odd, say n = 2m + 1, the interval (β_{m,n}, β_{m+2,n}) is symmetric about n/2 = β_{m+1,n}; so we have only to consider k ≥ m + 2 = (n + 3)/2.

[Diagram: for n = 2m (even), the cutpoints β_{m,n} and β_{m+1,n} lie symmetrically about n/2; for n = 2m + 1 (odd), the cutpoints β_{m,n}, β_{m+1,n} = n/2, and β_{m+2,n} are shown.]

The plot also suggests that β_{k,n} grows faster than k − 1/2 as k moves towards n, a suggestion supported by explicit calculations when k is close to n, for n large enough to justify simplifying approximations. It will be slightly more convenient to work with the standardized cutpoints z_{k,n} := 2(β_{k,n} − n/2)/√n and the tails of the standard normal distribution. The symmetry properties and defining equalities then become z_{n−k+1,n} = −z_{k,n} and

<7>   P{Bin(n, 1/2) ≥ k} = Φ̄(z_{k,n})   for each k.

When z_{k,n} is large, it can be well approximated via inequality <1>.

<8> Example. For k = n − B, with a B = 0, 1, 2, … fixed as n increases, from inequality <1> we have log Φ̄(z) = −log(z√(2π)) − ½z² − O(z^{−2}). For values z := c√n(1 − dℓ_n + y/n), with constants c > 0, d > 0, y, and ℓ_n := (log n)/n, we have log Φ̄(z) equal to

   −log(c√(2πn)) − o(1) − ½nc²(1 − 2dℓ_n + 2y/n + O(ℓ_n²))
      = (c²d − ½) log n − ½nc² − log(c√(2π)) − c²y + o(1).

If we choose c := √(2 log 2) ≈ 1.177 and d := (1 + 2B)/(2c²) then a bounded sequence of y values would suffice to cancel out the other terms, leaving a z for which Φ̄(z) = P{X ≥ n − B}. Thus, for fixed B,

   z_{n−B,n} = c√n − (1 + 2B)(log n)/(2c√n) + O(1/√n),

and

   β_{n−B,n} = n/2 + (√n/2) z_{n−B,n} = n(1 + c)/2 − (1 + 2B)(log n)/(4c) + O(1).

Notice that β_{n−B,n} exceeds n − B by about 0.088n plus smaller order terms. □

The method of proof for the main approximation result uses an elementary relationship between Binomial tails and beta integrals. If n points are placed independently according to the uniform distribution on (0, 1), then X := #{points in [0, 1/2]} has a Bin(n, 1/2) distribution, the kth order statistic, T_k, has a beta distribution, and

<9>   P{X ≥ k} = P{T_k ≤ 1/2} = (n!/((k − 1)!(n − k)!)) ∫₀^{1/2} t^{k−1}(1 − t)^{n−k} dt.

Essentially we have only to simplify the ratio of factorials by an appeal to Stirling's formula, then approximate the logarithm of the integrand by a Taylor expansion (around t = 1/2) to quadratic terms, to bring the right-hand side into a form close to Φ̄(y) for some y. It will then follow that Φ̄(z_{k,n}) ≈ Φ̄(y), whence z_{k,n} ≈ y. More precisely, we will be able to sandwich the beta integral between two such expressions, Φ̄(y′) ≤ Φ̄(z_{k,n}) ≤ Φ̄(y″), and then we will have y′ ≥ z_{k,n} ≥ y″ because Φ̄ is a monotone decreasing function.
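The identity <9> can be confirmed numerically, comparing the exact Binomial tail with a quadrature approximation to the beta integral (a sketch only, with composite Simpson quadrature standing in for exact integration):

```python
# Check <9>: P{X >= k} = n!/((k-1)!(n-k)!) * integral_0^{1/2} t^{k-1}(1-t)^{n-k} dt
# for X ~ Bin(n, 1/2).
from math import comb, factorial

def binom_tail(n, k):
    """Exact P{Bin(n,1/2) >= k}."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def beta_side(n, k, m=2000):
    """Right-hand side of <9>, via composite Simpson with 2m subintervals."""
    coef = factorial(n) // (factorial(k - 1) * factorial(n - k))
    f = lambda t: t ** (k - 1) * (1 - t) ** (n - k)
    a, b = 0.0, 0.5
    h = (b - a) / (2 * m)
    total = f(a) + f(b)
    total += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, m + 1))
    total += 2 * sum(f(a + 2 * i * h) for i in range(1, m))
    return coef * total * h / 3

for n, k in [(10, 6), (15, 8), (20, 13)]:
    assert abs(binom_tail(n, k) - beta_side(n, k)) < 1e-9
print("identity <9> verified")
```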

The detailed approximation looks slightly intimidating, because it needs to cover several ranges for k where different terms become important to the bounds. For k near n/2 it should provide a β_{k,n} near k − ½; for k near n it should reproduce the behavior from Example <8>; and it should make a smooth transition between the two types of behavior as k increases from n/2 to n.

<10> Theorem. There is an increasing function γ(·) with γ(0) = 1/12 and 1 + 2γ(1) = c² := 2 log 2 for which the standardized cutpoints

   z_{k,n} := 2(β_{k,n} − n/2)/√n   and   u_{k,n} := 2(k − 1/2 − n/2)/√N,

where N := n − 1, are related by the approximation

   z_{k,n} = u_{k,n} S(u_{k,n}/√N) + log(1 − u²_{k,n}/N)/(2u_{k,n} S(u_{k,n}/√N)) + R_{k,n}   where S(ε) := √(1 + 2ε²γ(ε)),

and

   −O(|u_{k,n}|³ + 1) ≤ nR_{k,n} ≤ O(|u_{k,n}|³ + log n)   uniformly in (n + 1)/2 ≤ k ≤ n − 1.

For 0 < k < (n + 1)/2, the same approximation holds but with the upper and lower bounds for R_{k,n} interchanged.

The inequality k ≥ (n + 1)/2 is equivalent to u_{k,n} ≥ 0. Equality is achieved by an integer k only when n is odd, that is, n = 2m + 1 and k = m + 1. By symmetry, in that case P{X ≥ m + 1} = P{X ≤ m} = 1/2, and hence z_{m+1,n} = 0 = u_{m+1,n}, implying R_{m+1,n} = 0. Thus we have only to consider the situation where k > (n + 1)/2 for the proof.

The peculiar √(n − 1) standardization for u_{k,n}, and the exclusion of the values k = 0 and k = n from the range, are both related to the behavior described by Example <8>. The increasing function γ(·) is defined for 0 < ε ≤ 1 as

<12>   γ(ε) := ((1 + ε) log(1 + ε) + (1 − ε) log(1 − ε) − ε²)/(2ε⁴),

with γ(0) := 1/12 as the limiting value at zero. It has an infinite derivative from the left at ε = 1, which accounts for the behavior of the function S(ε) as ε increases to 1.

If k := n − B, for a fixed B ≥ 1, then u_{k,n}/√N = 1 − 2B/N and R_{k,n} is of order O(n^{−1/2}). With δ := 2B/N, the Theorem approximates z_{n−B,n} by an expression that differs from the one in Example <8> only by terms of order O(n^{−1/2}). That is, with the √(n − 1) standardization we capture the extreme tail behavior correctly.

REMARK. A √n standardization would put an extra (log N)/√N into the approximation, thereby slightly increasing the error bound by a factor of log n. The extra factor would have no effect on the coupling bound we seek, so you could be forgiven for mostly ignoring the small difference. A much better argument for working with N instead of n, and with k − 1 instead of k, will appear in the proof, where you will see that calculations come out much more cleanly with the slightly smaller values.

The effect of the function γ is small when |u_{k,n}|/√N is close to zero, that is, when |k − n/2| = o(n). In that case, the log term can be absorbed into the error bounds, and also |u_{k,n}| is much smaller than the maximum of log n and |u_{k,n}|³. The approximation then simplifies to

   z_{k,n} = u_{k,n} + (u³_{k,n}/(12N))(1 + o(1)) ± O((log n)/n).

The corresponding approximation for the unstandardized cutpoints is

   β_{k,n} = k − ½ + ((k − n/2)³/(3n²))(1 + o(1)) ± O((log n)/√n)   for |k − n/2| = o(n).

When |k − n/2| is of order o(n^{2/3}) the cubic term contributes only a o(1) to the approximation; and the cutpoint β_{k,n} stays within o(1) of k − ½. For larger |k − n/2| the cubic term accounts for the slow drift suggested by the plot of β_{k,n} − (k − 1/2) versus k.

As a trivial consequence of the preceding discussion, there exists a constant C₀ such that k − C₀ ≤ β_{k,n} for n/2 ≤ k ≤ n. Also the inequality √(x + y) ≤ √x + √y, for positive x and y, gives us the upper bound

   S(ε) = √(1 + 2ε²γ(ε)) ≤ 1 + ε√(2γ(1)),

and hence (with Example <8> covering the case k = n), for some constant C₁,

   β_{k,n} ≤ k + C₁ + C₁(k − n/2)²/n   for n/2 ≤ k ≤ n.

These two inequalities give us an analog of Tusnády's inequality (from Section 10.5) for the coupling between the X := Ψ_n(Y), distributed Bin(n, 1/2), and the Y, distributed N(n/2, n/4). When X = k ≥ (n + 1)/2 we have Y > β_{k,n} ≥ k − C₀ = X − C₀, and consequently |X − n/2| ≤ |Y − n/2| + C₀. We also have, for some other constant C₂,

   Y ≤ β_{k+1,n} ≤ X + C₂ + C₂(X − n/2)²/n ≤ X + C₂ + C₂(|Y − n/2| + C₀)²/n.

More succinctly, for some constant C,

<13>   |X − n/2| ≤ |Y − n/2| + C   and   |Y − X| ≤ C + C|Y − n/2|²/n.

By reasons of symmetry, the same inequalities also hold when X ≤ (n + 1)/2. Inequality <13> is better than needed in Chapter 10 to establish the KMT coupling.

3. Proof of the approximation theorem

From now on, all calculations will be for a fixed n. There is no harm in dropping the n from the subscripts, writing z_k instead of z_{k,n}, and so on. Also note that the implied constants in the O(·) error terms allow us to ignore finitely many values of n, and establish the Theorem only for n large enough.

The calculations will be neater when reexpressed in terms of the integers K := k − 1 and N := n − 1 and the fractions

   α := (1 + ε)/2 = K/N   and   β := 1 − α = (1 − ε)/2 = (N − K)/N.

That is, ε = (2K/N) − 1 and u_k = ε√N. The constraint n/2 ≤ k ≤ n − 1 corresponds to ε staying slightly away from two extreme values:

<14>   1 − 2/N ≥ ε = (2K/N) − 1 ≥ { N^{−1} when n is even, 2N^{−1} when n is odd.


The assertion of Theorem <10> becomes

<15>   z_k = ε√N S(ε) + log(1 − ε²)/(2ε√N S(ε)) + R_k   where S(ε) := (1 + 2ε²γ(ε))^{1/2},

with −O(1 + ε³N^{3/2}) ≤ NR_k ≤ O(log N + ε³N^{3/2}) uniformly in the range <14>.

Representation <9> becomes

<16>   Φ̄(z_k) = ((N + 1)!/(K!(N − K)!)) ∫₀^{1/2} t^K(1 − t)^{N−K} dt.

The logarithm of the integrand equals N times the concave function H(t) := α log t + β log(1 − t), for 0 < t < 1, whose maximum occurs at α:

   H(α) = ((1 + ε)/2) log((1 + ε)/2) + ((1 − ε)/2) log((1 − ε)/2) = −log 2 + ½ε² + ε⁴γ(ε),

with γ(ε) as in <12>. Stirling's formula (Feller 1968, Section II.9),

   n! = √(2π) exp((n + ½) log n − n + λ_n)   with 1/(12n + 1) < λ_n < 1/(12n),

simplifies the ratio of factorials. Denote exp(λ_N − λ_K − λ_{N−K}) by Δ_ε. Then

   (N + 1)!/(K!(N − K)!) = (N + 1)(2Δ_ε/√(2πN(1 − ε²))) exp(−NH(α)).

Write h(s) for the concave function H(½ − s) − H(α). The representation becomes

   Φ̄(z_k) = (2(N + 1)Δ_ε/√(2πN(1 − ε²))) ∫₀^{1/2} exp(NH(t) − NH(α)) dt
           = (2(N + 1)Δ_ε/√(2πN(1 − ε²))) ∫₀^{1/2} exp(N h(s)) ds.
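Feller's form of Stirling's formula, with the two-sided bound on λ_n, can be checked directly via the log-gamma function (a sketch; note that log n! = lgamma(n + 1)):

```python
# Check Stirling's formula in Feller's form:
#   n! = sqrt(2*pi) * exp((n + 1/2)*log(n) - n + lambda_n),
# with 1/(12n + 1) < lambda_n < 1/(12n).
from math import lgamma, log, pi

def lambda_n(n):
    """lambda_n = log n! - log sqrt(2*pi) - (n + 1/2)*log n + n."""
    return lgamma(n + 1) - 0.5 * log(2 * pi) - (n + 0.5) * log(n) + n

for n in [1, 2, 5, 10, 100, 1000]:
    lam = lambda_n(n)
    assert 1 / (12 * n + 1) < lam < 1 / (12 * n), (n, lam)
print("Stirling bounds verified")
```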

The main contribution to the integral will come from a small neighborhood of s = 0. In this neighborhood, a Taylor expansion gives a good approximation to h:

   h(s) = h(0) − 2εs − 2s² + (s³/6) h‴(s*)   with 0 < s* < s
        = −ε⁴γ(ε) − 2(s + ε/2)² + REMAINDER,

because

   h(0) = H(½) − H(α) = −½ε² − ε⁴γ(ε),

   h′(s) = −2α(1 − 2s)^{−1} + 2β(1 + 2s)^{−1}   whence h′(0) = 2(β − α) = −2ε,

   h″(s) = −4α(1 − 2s)^{−2} − 4β(1 + 2s)^{−2}   whence h″(0) = −4.


The third derivative is negative and decreasing,

   h‴(s) = −16α(1 − 2s)^{−3} + 16β(1 + 2s)^{−3}   whence h‴(0) = 16(β − α) < 0,

   h⁗(s) = −96α(1 − 2s)^{−4} − 96β(1 + 2s)^{−4} < 0   so h‴ is decreasing.

Thus 0 ≥ REMAINDER ≥ (s³/6) h‴(s) for s ≥ 0.

Upper bound for the tail probability

Substitution of the upper bound for h(s) into the representation <16> gives an upper bound for the tail probability,

   Φ̄(z_k) ≤ (2(N + 1)Δ_ε/√(2πN(1 − ε²))) exp(−Nε⁴γ(ε)) ∫₀^{1/2} exp(−2N(s + ε/2)²) ds.

We make the integral larger by increasing the upper terminal from 1/2 to +∞. The change of variable t = 2√N(s + ε/2) then simplifies the inequality to

<17>   Φ̄(z_k) ≤ U_n(ε) := (1 + N^{−1}) Δ_ε exp(−Nε⁴γ(ε) − ½ log(1 − ε²)) Φ̄(ε√N).

Lower bound for the tail probability

To get an analogous lower bound, replace the upper terminal by some positive η < 1/2, and replace h(s) by the lower bound

   h(0) − 2εs − 2s² + (ηs²/6) h‴(η)   for 0 ≤ s ≤ η.

There is a constant C for which

   −h‴(η) ≤ C(ε + η)   for all η ≤ 1/4.

For small η we therefore have

   h(s) ≥ h(0) − 2εs − 2κ²s²   where κ² := 1 + C(ε + η)η/12.

Substitution in <16> gives

<18>   Φ̄(z_k) ≥ (2(N + 1)Δ_ε/√(2πN(1 − ε²))) exp(−Nε⁴γ(ε)) exp((Nε²/2)(κ^{−2} − 1)) ∫₀^η exp(−2Nκ²(s + ε/(2κ²))²) ds.

The first factor on the right-hand side is the same as the factor for U_n(ε). The change of variable t = 2√Nκ(s + ε/(2κ²)) in the integral transforms the third factor to κ^{−1} times

<19>   Φ̄(ε√Nκ^{−1}) − Φ̄(2√Nκη + ε√Nκ^{−1}).

We should try to choose η to make κ^{−1} times this difference as large as possible. Explicit maximization seems an impossible task, because the dependence on η is complicated. Corollary <5> gives a simpler lower bound, via the two inequalities

   Φ̄(2√Nκη + ε√Nκ^{−1}) ≤ Φ̄(ε√Nκ^{−1}) exp(−2Nεη − 2Nη²κ²)

and

   Φ̄(ε√Nκ^{−1}) ≥ Φ̄(ε√N) exp(ε√Nκ^{−1}δ + ½δ²)   where δ := ε√N(1 − κ^{−1}).

The difference in <19> is larger than

   Φ̄(ε√N) exp(ε√Nκ^{−1}δ + ½δ²)(1 − exp(−2Nεη − 2Nη²κ²)).

And now a miracle occurs. The exponential factor from <18> completely cancels the first exponential factor in the previous line,

   exp((Nε²/2)(κ^{−2} − 1)) exp(ε√Nκ^{−1}δ + ½δ²) = 1,

leaving us with U_n(ε)κ^{−1}(1 − exp(−2Nεη − 2Nη²κ²)) as a lower bound for Φ̄(z_k), where U_n(ε) denotes the upper bound from <17>.

The dependence on η now concentrates in two simpler factors. By trial and error I decided to choose η to make 2Nεη + 2Nη² = log N, that is, 2η = −ε + √(ε² + 2ℓ_N), where ℓ_N := (log N)/N. Then κ² ≤ 1 + Cℓ_N ≤ exp(Cℓ_N), and

<20>   Φ̄(z_k) ≥ U_n(ε)(1 − N^{−1}) exp(−Cℓ_N).

Of course this argument assumes that n is large enough to make η ≤ 1/4.

Inversion of the tail bounds

It remains only to replace the upper and lower bounds for Φ̄(z_k) by expressions of the form Φ̄(w).

First dispose of the easy case where ε ≤ C₀/√N for some constant C₀. Assertion <15> then reduces to −O(N^{−1}) ≤ z_k − ε√N ≤ O(ℓ_N), and the inequalities <17> and <20> simplify to

   exp(−O(N^{−1})) ≥ Φ̄(z_k)/Φ̄(ε√N) ≥ exp(−O(ℓ_N)).

The asserted approximation to z_k follows immediately from Lemma <6> part (i), first with y = O(N^{−1}) then with y = O(ℓ_N).

For the remainder of the proof we may assume ε ≥ C₀/√N, for a constant C₀ of our choosing. Let x_k be defined by

   Φ̄(x_k) := exp(−Nε⁴γ(ε) − ½ log(1 − ε²)) Φ̄(ε√N).

Inequalities <17> and <20> can then be written

<21>   exp(−y_k − w_k) Φ̄(x_k) ≤ Φ̄(z_k) ≤ exp(−y_k) Φ̄(x_k),

where w_k := −log(1 − N^{−1}) + Cℓ_N and y_k := −log(1 + N^{−1}) − λ_N + λ_K + λ_{N−K}. Notice that the first three contributions to y_k are of order O(N^{−1}), as is the fourth term except when ε gets close to 1, in which case it is of order O(1).

For C₀/√N ≤ ε ≤ 1 − 2/N, for a large enough C₀, the exponent term Y_N(ε) := Nε⁴γ(ε) + ½ log(1 − ε²) is nonnegative. For ε bounded away from 1, the log term is of order O(ε²). In the worst case, when ε achieves its maximum value 1 − 2/N, it is of order O(log N). From Lemma <6> part (ii),

   0 ≤ √(Nε² + 2Y_N(ε)) − x_k ≤ 2Y_N(ε)/(ε√N)³.

That is,

   x_k = √(Nε²S(ε)² + log(1 − ε²)) + O(Y_N(ε)/(ε³N^{3/2}))
       = ε√N S(ε) + log(1 − ε²)/(2ε√N S(ε)) + error terms of the order allowed for R_k in <15>,

the second approximation coming from the Taylor expansion √(1 + y) = 1 + ½y + O(y²) followed by an appeal to <12>.

Notice that x_k contributes the main term in the desired approximation <15> for z_k. The y_k and w_k in <21> contribute further error terms. From Lemma <6>, part (i), the z′_k and z″_k for which

   Φ̄(z′_k) = exp(−y_k) Φ̄(x_k)   and   Φ̄(z″_k) = exp(−y_k − w_k) Φ̄(x_k)

are given by

   z′_k = x_k + O(y_k/ρ(x_k)) = x_k + O(1 + ε³N^{3/2})/N,

   z″_k = x_k + O((y_k + w_k)/ρ(x_k)) = x_k + O(log N + ε³N^{3/2})/N.

Monotonicity of Φ̄ then implies z′_k ≥ z_k ≥ z″_k, completing the proof of the Theorem. □

4. Notes

The material in this Appendix is an expanded version of the paper Carter & Pollard (2000), where further discussion of related literature may be found.

REFERENCES

Carter, A. & Pollard, D. (2000), 'Tusnády's inequality revisited', Technical report, Yale University. http://www.stat.yale.edu/~pollard.

Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. 1,third edn, Wiley, New York.


Appendix E

Martingales in continuous time

SECTION 1 explains the importance of sample path properties in the study of martingales, and other stochastic processes, in continuous time. Versions are defined. A delicate measurability question, regarding first hitting times on general Borel sets, is discussed. The notion of a standard filtration is introduced. The Section summarizes some definitions and results from the start of Chapter 6.

SECTION 2 presents the extension of the Stopping Time Lemma to submartingales withright continuous sample paths.

SECTION 3 shows how to construct a supermartingale with cadlag sample paths, by microsurgery on paths. Under a regularity condition on the filtration, each submartingale with right continuous expected value is shown to have a cadlag version.

SECTION 4 presents a remarkable property of the Brownian filtration.

1. Filtrations, sample paths, and stopping times

A stochastic process is a collection of random variables {X_t : t ∈ T} all defined on the same probability space (Ω, F, P). The index set T is often referred to as "time." The theory of stochastic processes for "continuous time," where T is a subinterval of ℝ, tends to be more complicated than for T countable (such as T = ℕ). The difficulties arise, in part, from problems related to management of uncountable families of negligible sets associated with uncountable collections of almost sure equality or inequality assertions. A nontrivial part of the continuous time theory deals with sample path properties—that is, with the behavior of a process X_t(ω) as a function of t for fixed ω—or with properties of X as a function of two variables, X(t, ω). Such properties are vital to many arguments based on approximation of processes through their values at a finite collection of times.

REMARK. I will treat the notations X_t(ω) and X(t, ω) as interchangeable. If ω is understood, I will also abbreviate to X_t or X(t). The second form becomes more convenient when t is replaced by a more complicated expression: something like X(t₀ + τ_{k−1}) is much easier to read than X_{t₀+τ_{k−1}}(ω).

Throughout this Appendix, T will usually denote ℝ⁺ or a bounded interval, such as [0, 1]. The most desirable sample path properties are continuity, and a slightly weaker property that is known by the acronym for the French phrase meaning "continuous on the right with left limits."


<1> Definition. Call a real-valued function on an interval T ⊆ ℝ cadlag if it is continuous from the right and has a finite limit from the left, at each point t where such assertions make sense. Write D(T) for the set of all cadlag real functions on T.

For example, functions in D[0, ∞) are right continuous everywhere, with left limits on (0, ∞). A left limit at 0 makes no sense. There is no assumption of a limit existing as t → ∞. For D[0, ∞], however, the limit does exist as t → ∞, but it need not equal the value of the function at ∞.

The measurability properties of the random variables provide another link with the interpretation of T as time. Typically X_t is required to be adapted to a filtration {F_t : t ∈ T}, a family of sub-sigma-fields of F for which F_s ⊆ F_t if s ≤ t. That is, X_t is F_t-measurable. If F_∞ is not otherwise defined, I will take it to be the sigma-field generated by ∪_{t∈T} F_t.

The finite dimensional distributions (fidis) of a stochastic process do not completely determine the sample path properties. (See, for example, Section 9.2 for a Brownian motion with discontinuous sample paths.) Instead, good behavior of sample paths typically results from microsurgery that changes each X_t on a P-negligible set of ω. The surgery results in a new version of the process.

<2> Definition. Say that a process {X̃_t : t ∈ T} is a version of another process {X_t : t ∈ T} if P{X̃_t ≠ X_t} = 0 for each t in T.

REMARK. The negligible set N_t where X̃_t ≠ X_t can depend on t. If T is uncountable, we cannot dismiss ∪_{t∈T} N_t as a negligible nuisance; the union might even cover the whole of Ω. An observer who sees the two processes at a fixed time would not notice any difference worth worrying about. An observer who was able to record the whole sample path, X(·, ω) or X̃(·, ω), might be able to distinguish between the processes. For uncountable T it would be a much stronger requirement to insist that X(·, ω) = X̃(·, ω) as functions on T, for all except a negligible set of ω. Some authors (such as Métivier 1982, page 4) express the stronger property by saying that the processes are indistinguishable, or P-equal, or equivalent up to evanescence.

The surgery that creates a new version of the process can have measurability side effects. For example, in Section 3, to obtain cadlag sample paths we will define X̃_t(ω) by a limit of X_s(ω) values, with s ranging over rational values larger than t. On the negligible set where the limit does not exist, we will define X̃_t(ω) in some other way. As a result of these modifications, X̃_t need not be F_t-measurable. However it will be measurable with respect to a slightly larger sigma-field, F̃_t := ∩_{s>t} σ(N ∪ F_s), where N := {A ⊆ Ω : A ⊆ F for some F ∈ F with PF = 0}, the collection of all P-negligible sets. Remember, P has a unique completion, that is, an extension to the sigma-field generated by F ∪ N. It is no loss of generality to assume that N ⊆ F, but it would be a nontrivial requirement to insist, a priori, that N ⊆ F_t for every t.

Clearly, each F̃_t contains all P-negligible sets and F̃_t = ∩_{s>t} F̃_s (a property sometimes called right-continuity). Such a filtration is said to be standard: {F̃_t : t ∈ T} is the standard filtration generated by {F_t : t ∈ T}.

REMARK. If T happens to be a bounded interval, say T := [0, 1], there is a small notational difficulty in the definition of F̃₁, because there are no s in T with s > 1. We could remedy the problem by defining F_s := F₁ for all s > 1, which would give F̃₁ = σ(N ∪ F₁). The value of X̃₁ has no effect on whether the X̃ process has cadlag sample paths, but we must take care to choose X̃₁ to be F̃₁-measurable if the process is to be adapted.

Much of the power of martingale theory comes from the preservation of the defining equalities and inequalities when fixed times are replaced by suitable stopping times τ, that is, random variables taking values in T ∪ {∞} for which {τ ≤ t} ∈ F_t for each t in T. If τ takes values only in T, we define X_τ as the function with the value X_t(ω) when τ(ω) = t. If τ might take the value ∞, and ∞ ∉ T, then it is safer to work with X_τ{τ < ∞}, which takes the value zero when τ is infinite.

Associated with each stopping time τ is a pre-τ sigma-field F_τ, defined to consist of all F for which F{τ ≤ t} ∈ F_t for all t ∈ T ∪ {∞}. A random variable Z is F_τ-measurable if and only if Z{τ ≤ t} is F_t-measurable for all t ∈ T ∪ {∞}. If T is countable, it is easy to show that X_τ{τ < ∞} is F_τ-measurable. For continuous time, the corresponding fact requires further assumptions about the behavior of X(t, ω) as a function of two arguments.

<3> Definition. Say that a process $\{X_t : t \in \mathbb{R}^+\}$ is progressively measurable if its restriction to $[0,t]\times\Omega$ is $\mathcal{B}[0,t]\otimes\mathcal{F}_t$-measurable for each $t$ in $\mathbb{R}^+$.

Typically, good sample path properties plus adaptedness are used to deduce progressive measurability. For example, Problem [2] shows that an adapted process $\{X_t : t \in \mathbb{R}^+\}$ with right-continuous sample paths must be progressively measurable.

<4> Theorem. For a given filtration, let $\{X_t : t \in \mathbb{R}^+\}$ be a progressively measurable process and $\tau$ be a stopping time. Then $X_\tau\{\tau < \infty\}$ is $\mathcal{F}_\tau$-measurable.

Proof. Write $Z$ for $X_\tau\{\tau < \infty\}$. Clearly $0 = Z\{\tau = \infty\}$ is $\mathcal{F}_\infty$-measurable. We need to show, for each $t \in \mathbb{R}^+$, that $Z\{\tau \le t\}$ is $\mathcal{F}_t$-measurable. On the set $\{\tau \le t\}$ we can replace $\tau$ by $\tau\wedge t$, giving

$Z\{\tau \le t\} = X\left(\tau(\omega)\wedge t,\ \omega\right)\{\tau(\omega) \le t\}.$

By definition of a stopping time, the indicator on the right-hand side is $\mathcal{F}_t$-measurable. The stopping time $\tau\wedge t$ is $\mathcal{F}_{\tau\wedge t}$-measurable, and $\mathcal{F}_{\tau\wedge t} \subseteq \mathcal{F}_t$. Thus the map $\omega \mapsto (\tau(\omega)\wedge t, \omega)$ is $\mathcal{F}_t\backslash\mathcal{B}[0,t]\otimes\mathcal{F}_t$-measurable. When composed with the restriction of $X$ to $[0,t]\times\Omega$, which is $(\mathcal{B}[0,t]\otimes\mathcal{F}_t)\backslash\mathcal{B}(\mathbb{R})$-measurable by definition of progressive measurability, it gives an $\mathcal{F}_t\backslash\mathcal{B}(\mathbb{R})$-measurable function. The random variable $Z\{\tau \le t\}$ is a product of two $\mathcal{F}_t$-measurable functions. □

In discrete time, with an adapted process $\{X_n : n \in \mathbb{N}\}$ and a Borel set $B$, the first hitting time $\tau(\omega) := \inf\{n : X_n(\omega) \in B\}$ is a stopping time. (The infimum of the empty set is interpreted as $+\infty$. That is, $\tau(\omega) = +\infty$ if $X_n(\omega) \notin B$ for all $n$.) Clearly $\{\tau \le n\} = \cup_{i\le n}\{X_i \in B\} \in \mathcal{F}_n$ for each $n \in \mathbb{N}$. The analogous result for continuous time is more subtle.
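The discrete-time case lends itself to a quick illustration. The following sketch (mine, not from the text) computes the first hitting time of a set $B$ and checks that the event $\{\tau \le n\}$ is determined by the values $X_0,\dots,X_n$ alone:

```python
# A minimal discrete-time illustration (not from the text): the first
# hitting time tau := inf{n : X_n in B} satisfies
#   {tau <= n} = union over i <= n of {X_i in B},
# so tau is a stopping time for the natural filtration.

def first_hitting_time(path, B):
    """Return inf{n : path[n] in B}, with inf of the empty set = infinity."""
    for n, x in enumerate(path):
        if x in B:
            return n
    return float("inf")

# A deterministic sample path and a target set B.
path = [0, 1, 3, 2, 5, 1]
B = {2, 5}

tau = first_hitting_time(path, B)   # first index with path[n] in B

# Check the stopping-time identity for every n: whether {tau <= n} has
# occurred is decided by the values path[0..n] alone.
for n in range(len(path)):
    union_event = any(path[i] in B for i in range(n + 1))
    assert (tau <= n) == union_event

print(tau)  # -> 3
```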

Appendix E: Martingales in continuous time

For a Borel subset $B$ we can still define $\tau(\omega) := \inf\{t \in \mathbb{R}^+ : X_t(\omega) \in B\}$. For each fixed $t$ in $\mathbb{R}^+$, it need not even be true that $\{\tau \le t\}$ is a union $\cup_{s\le t}\{X_s \in B\}$, because the infimum need not be achieved; and even if the representation were valid, an uncountable union of measurable sets would be of no help. We would have more success with the representation for a strict inequality,

$\{\tau < t\} = \{\omega : X(s,\omega) \in B \text{ for some } s < t\}$
$\phantom{\{\tau < t\}}\ = \text{projection of } \{(s,\omega) : X(s,\omega) \in B,\ s < t\} \text{ onto } \Omega.$

[Figure: the product set $\{(s,\omega) : X(s,\omega) \in B,\ s < t\}$ inside $[0,t]\times\Omega$, with its projection $\{\tau < t\}$ onto $\Omega$.]

If $B$ is an open set, and if the process has right-continuous sample paths, this set takes the form of a countable union of sets $\{X_s \in B\}$, with $s$ ranging over all rational numbers in $[0,t)$. It then follows that $\{\tau < t\} \in \mathcal{F}_t$, which is almost what we need. Indeed, we would then have

$\{\tau \le t\} = \cap_{n\in\mathbb{N}}\{\tau < t + n^{-1}\} \in \cap_{n\in\mathbb{N}}\mathcal{F}_{t+n^{-1}} \subseteq \widetilde{\mathcal{F}}_t.$

That is, $\tau$ is a stopping time for the standard filtration.

The argument for open $B$ does not use the fact that $\mathcal{N}\subseteq\widetilde{\mathcal{F}}_t$. For hitting times on more general Borel sets, the negligible sets are needed. An amazing result from advanced measure theory (Dellacherie & Meyer 1978, III.1 through III.44) asserts that the projection of any $\mathcal{B}[0,t]\otimes\mathcal{F}_t$-measurable subset $D$ of $[0,t]\times\Omega$ onto $\Omega$ is measurable with respect to the sigma-field $\sigma(\mathcal{N}\cup\mathcal{F}_t)$. If $X$ is progressively measurable, the set $D := \{(s,\omega) : X(s,\omega) \in B,\ s < t\}$ is product measurable, and hence its projection $\{\tau < t\}$ is $\sigma(\mathcal{N}\cup\mathcal{F}_t)$-measurable. It then follows that $\{\tau \le t\}$ is $\widetilde{\mathcal{F}}_t$-measurable. The complete details of the proof would take another half dozen pages of careful argument. The brave and interested reader should consult the Dellacherie-Meyer monograph. For our purposes it is enough that the general question be seen as settled, and that the role of the standard filtration be understood.

<5> Theorem. If the process $\{(X_t, \mathcal{F}_t) : t \in \mathbb{R}^+\}$ is progressively measurable, and if $B$ is a Borel subset of the real line, then $\tau := \inf\{t : X_t \in B\}$ is a stopping time for the standard filtration $\{\widetilde{\mathcal{F}}_t : t \in \mathbb{R}^+\}$ generated by $\{\mathcal{F}_t : t \in \mathbb{R}^+\}$.

In a droll understatement of the subtlety of the underlying ideas, it is customary to refer to the assumption that the filtration be standard as "the usual conditions."

2. Preservation of martingale properties at stopping times

The basic Stopping Time Lemma from Section 6.2 asserted preservation of the submartingale property at stopping times taking values in a finite set:

Let $\{(Z_i, \mathcal{G}_i) : i = 0, 1, \dots, N\}$ be a submartingale, and $\sigma$ and $\tau$ be stopping times with $0 \le \sigma \le \tau \le N$, with $N$ a (finite) positive integer. For each $G$ in $\mathcal{G}_\sigma$, we have $\mathbb{P}Z_\sigma G \le \mathbb{P}Z_\tau G$. Equivalently, $Z_\sigma \le \mathbb{P}(Z_\tau \mid \mathcal{G}_\sigma)$ a.s.


The continuous time analog of the Lemma is more delicate. As with Theorem <4>, we need to make assumptions about the behavior of the martingale as a function of $t$. As you will see from the next Section, it is reasonable to require that the submartingale $X$ have sample paths that are continuous from the right. Problem [2] shows that such a process is progressively measurable, so there are no measurability difficulties in working with the value of the process at stopping times.

<6> Theorem. Let $\{(X_t, \mathcal{F}_t) : 0 \le t \le 1\}$ be a submartingale with right-continuous sample paths. Let $0 \le \sigma \le \tau \le 1$ be stopping times for the filtration. Then $X_\sigma \le \mathbb{P}(X_\tau \mid \mathcal{F}_\sigma)$ almost surely. That is, $\mathbb{P}X_\sigma F \le \mathbb{P}X_\tau F$ for each $F$ in $\mathcal{F}_\sigma$.

Proof. We need to reduce to the case of a finite index set by using the right continuity. The method is a common one. First discretize the stopping times, defining

$\tau_n := \frac{i}{2^n}$ on the set $\left\{\frac{i-1}{2^n} < \tau \le \frac{i}{2^n}\right\}$, for $i = 1, \dots, 2^n$, with $\tau_n := 0$ on $\{\tau = 0\}$.

That is, $\tau_n$ is just $\tau$ rounded up to the next integer multiple of $2^{-n}$. It is important that we round up, rather than down, because then $\{\tau_n \le i/2^n\} = \{\tau \le i/2^n\} \in \mathcal{F}_{i/2^n}$, which makes each $\tau_n$ a stopping time for the filtration $\{\mathcal{F}_{i/2^n} : i = 0, 1, \dots, 2^n\}$. As $n$ increases, the sequence $\{\tau_n(\omega)\}$ decreases to $\tau(\omega)$, for each $\omega$. The stopping times $\{\sigma_n\}$ are defined analogously.
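The dyadic rounding can be sketched in a few lines (my illustration, not from the text); the rounded values decrease to $\tau$ as $n$ grows:

```python
import math

# Sketch (not from the text): tau_n rounds tau up to the next integer
# multiple of 2^-n, so the sequence tau_n decreases to tau as n increases.

def dyadic_round_up(tau, n):
    """Round tau up to the grid {i / 2^n : i = 0, 1, 2, ...}."""
    return math.ceil(tau * 2**n) / 2**n

tau = 0.3
taus = [dyadic_round_up(tau, n) for n in range(1, 6)]
print(taus)  # -> [0.5, 0.5, 0.375, 0.3125, 0.3125]

# Rounding up preserves {tau_n <= i/2^n} = {tau <= i/2^n}, which is why
# each tau_n is again a stopping time; and the sequence decreases to tau.
assert all(a >= b >= tau for a, b in zip(taus, taus[1:]))
```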

The event $F$ belongs to $\mathcal{F}_\sigma$, which is a sub-sigma-field of $\mathcal{F}_{\sigma_n}$. The discrete version of the Stopping Time Lemma therefore implies that $\mathbb{P}X_{\sigma_n}F \le \mathbb{P}X_{\tau_n}F$.

As $n$ tends to infinity, right continuity ensures that $Z_n := X_{\sigma_n} \to X_\sigma$ and $Z'_n := X_{\tau_n} \to X_\tau$ along each sample path. To complete the proof, it is enough to show that $\{Z_n\}$ and $\{Z'_n\}$ are uniformly integrable. As the method of proof is the same in both cases, let us consider only the $Z_n$'s.

The Stopping Time Lemma also implies that $X_0, Z_n, Z_m, X_1$ is a submartingale for each $n \ge m$. In particular, the sequence $\mathbb{P}Z_n$ decreases, to a finite limit $K \ge \mathbb{P}X_0$. Given $\epsilon > 0$, there exists an $m$ such that $K \le \mathbb{P}Z_n \le \mathbb{P}Z_m \le K + \epsilon$ for all $n \ge m$. For each positive constant $C$,

$\mathbb{P}|Z_n|\{|Z_n| > C\} = \mathbb{P}Z_n\{Z_n > C\} - \mathbb{P}Z_n\{Z_n < -C\}$
$\le \mathbb{P}X_1\{Z_n > C\} - \mathbb{P}Z_n + \mathbb{P}Z_n\{Z_n \ge -C\}$  (submartingale)
$\le \mathbb{P}X_1\{Z_n > C\} - \mathbb{P}Z_m + \epsilon + \mathbb{P}Z_m\{Z_n \ge -C\}$  (if $n \ge m$)
$\le \mathbb{P}\left(|X_1| + |Z_m|\right)\{|Z_n| > C\} + \epsilon.$

The random variable $|X_1| + |Z_m|$ is integrable. The event $\{|Z_n| > C\}$ has probability bounded by

$\mathbb{P}|Z_n|/C = \left(2\mathbb{P}Z_n^+ - \mathbb{P}Z_n\right)/C \le \left(2\mathbb{P}X_1^+ - \mathbb{P}X_0\right)/C,$

which tends to zero (uniformly in $n$) as $C$ tends to infinity. Uniform integrability, and the assertion of the Theorem, follow. □

<7> Corollary. If $\{(M_t, \mathcal{F}_t) : 0 \le t \le 1\}$ is a martingale with right-continuous sample paths then $M_\sigma = \mathbb{P}(M_1 \mid \mathcal{F}_\sigma)$ for each stopping time $\sigma$ with $0 \le \sigma \le 1$.


3. Supermartingales from their rational skeletons

Write $S$ for the set of all rational numbers in $(0,1)$. For each $t$ in $(0,1]$, write $\lim_{s\uparrow\uparrow t}$ to denote a limit along $\{s \in S : s < t\}$; and for each $t$ in $[0,1)$, write $\lim_{s\downarrow\downarrow t}$ to denote a limit along $\{s \in S : s > t\}$.

<8> Theorem. Let $\{(X_t, \mathcal{F}_t) : 0 \le t \le 1\}$ be a positive supermartingale. Then there exists a $\mathbb{P}$-negligible subset $N$ of $\Omega$ such that, for each $\omega \in N^c$:

(i) $\sup_{s\in S} X_s(\omega) < \infty$;

(ii) for every $t$ in $[0,1)$, the limit $\widetilde X_t(\omega) := \lim_{s\downarrow\downarrow t} X_s(\omega)$ exists and is finite;

(iii) for every $t$ in $(0,1]$, the limit $\lim_{s\uparrow\uparrow t} X_s(\omega)$ exists and is finite.

Define $\widetilde X_1(\omega) := X_1(\omega)$ for all $\omega$, and $\widetilde X_t(\omega) := 0$ if $\omega \in N$ and $t < 1$. Then:

(iv) $\{(\widetilde X_t, \widetilde{\mathcal{F}}_t) : 0 \le t \le 1\}$ is a supermartingale with cadlag sample paths;

(v) $\mathbb{P}X_tF \ge \mathbb{P}\widetilde X_tF \ge \mathbb{P}X_{t'}F$ for all $0 \le t < t' \le 1$ and all $F$ in $\mathcal{F}_t$;

(vi) if the map $t \mapsto \mathbb{P}X_t$ is right continuous at $t_0$ then $\widetilde X_{t_0} = \mathbb{P}(X_{t_0} \mid \widetilde{\mathcal{F}}_{t_0})$ a.s.

Proof. Let $\{S_n\}$ be an increasing sequence of finite sets with union $S$. For each $n$, and each $x > 0$, the random variable $\tau_n := 1 \wedge \min\{s \in S_n : X_s > x\}$ is a stopping time. By the Stopping Time Lemma from Section 6.2,

$\mathbb{P}\{\max_{s\in S_n} X_s > x\} = \mathbb{P}\{X_{\tau_n} > x\} \le \mathbb{P}X_{\tau_n}/x \le \mathbb{P}X_0/x.$

Let $n$ tend to infinity, then $x$ tend to infinity, to see that $\sup_{s\in S} X_s(\omega) < \infty$ except for $\omega$ in some $\mathbb{P}$-negligible set $N_0$.
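A small Monte Carlo check (my own sketch, not from the text) of the maximal inequality just used, for an artificial positive supermartingale built from products of independent Uniform(0, 1.8) factors:

```python
import random

# Monte Carlo sketch (not from the text) of the maximal inequality
# P{ max_n X_n > x } <= P X_0 / x = 1 / x for a positive supermartingale
# with X_0 = 1.  Here X_n is a product of i.i.d. Uniform(0, 1.8) factors,
# each with mean 0.9 < 1, so the process is a genuine supermartingale.

random.seed(1)

def running_max(steps):
    """Maximum of the product process X_n over n = 0, ..., steps."""
    x, best = 1.0, 1.0
    for _ in range(steps):
        x *= random.uniform(0.0, 1.8)
        best = max(best, x)
    return best

x_level = 4.0
paths = 4000
hits = sum(running_max(200) > x_level for _ in range(paths))
freq = hits / paths

# The empirical frequency respects the bound 1 / x_level = 0.25; for this
# supermartingale the true probability is in fact well below the bound.
assert freq <= 1.0 / x_level
```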

For each pair of rational numbers $0 < \alpha < \beta$ and each $k \in \mathbb{N}$, define $A_n(k,\alpha,\beta)$ to consist of all $\omega$ for which there exist points $u_1 < v_1 < u_2 < \cdots < v_{k-1} < u_k < v_k$ in $S_n$ with $X(u_i,\omega) < \alpha$ and $X(v_i,\omega) > \beta$ for each $i$. By Dubins's inequality from Section 6.3, $\mathbb{P}A_n(k,\alpha,\beta) \le (\alpha/\beta)^k$ for each $n$, which implies that the event $N(\alpha,\beta) := \cap_{k\in\mathbb{N}}\cup_{n\in\mathbb{N}} A_n(k,\alpha,\beta)$ has zero probability.
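The upcrossing counts that enter Dubins's inequality can be computed mechanically. A small sketch (mine, not from the text):

```python
# Sketch (not from the text): the event A_n(k, alpha, beta) asks for k
# "upcrossings" of the band (alpha, beta): times u_1 < v_1 < ... < u_k < v_k
# with X(u_i) < alpha and X(v_i) > beta.  This helper counts how many
# such completed crossings a finite path makes.

def upcrossings(path, alpha, beta):
    """Number of completed upcrossings of the band (alpha, beta)."""
    count, below = 0, False
    for x in path:
        if not below and x < alpha:
            below = True        # found the next u_i
        elif below and x > beta:
            count += 1          # matched it with a v_i
            below = False
    return count

# A path that dips under 1 and then climbs over 2, twice: two upcrossings.
path = [1.5, 0.8, 2.5, 1.2, 0.5, 3.0, 2.1]
assert upcrossings(path, alpha=1.0, beta=2.0) == 2

# An oscillating path makes many upcrossings; Theorem <8> rules this out,
# off a negligible set, for the rational skeleton of a positive
# supermartingale, via P A_n(k, alpha, beta) <= (alpha / beta)^k.
assert upcrossings([0.0, 3.0] * 5, 1.0, 2.0) == 5
```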

Define $N$ as the union of $N_0$ with all the $N(\alpha,\beta)$, for rational $0 < \alpha < \beta$. For $\omega \notin N$, and each $t$ in $[0,1)$, we must have

$\liminf_{s\downarrow\downarrow t} X_s(\omega) = \limsup_{s\downarrow\downarrow t} X_s(\omega) < \infty,$

for otherwise there would be a pair of rational numbers $\alpha < \beta$ for which $\omega \in N(\alpha,\beta) \subseteq N$. Assertion (ii) follows. The proof for assertion (iii) is similar.

Temporarily write $L_t(\omega)$ for the limit from the left. For each $t$ in $(0,1)$ and each $\omega \in N^c$, to each $\epsilon > 0$ there exists a $\delta > 0$ such that

$|X_s(\omega) - \widetilde X_t(\omega)| \le \epsilon$ for $s$ in $S$ with $t < s < t + \delta$,
$|X_s(\omega) - L_t(\omega)| \le \epsilon$ for $s$ in $S$ with $t - \delta < s < t$.

We must therefore have

$|\widetilde X_{t'}(\omega) - \widetilde X_t(\omega)| \le \epsilon$ for $t < t' < t + \delta$,
$|\widetilde X_{t'}(\omega) - L_t(\omega)| \le \epsilon$ for $t - \delta < t' < t$.

It follows that $\widetilde X(\cdot,\omega)$ is right continuous with left limit $L_t(\omega)$ at $t$. Similar reasoning applies at $t = 0$ and $t = 1$. Thus $\widetilde X(\cdot,\omega)$ is a cadlag function for every $\omega$ in $N^c$.


Clearly $\limsup_{s\downarrow\downarrow t} X_s$ is $\mathcal{F}_s$-measurable for each $s > t$, and hence $\widetilde{\mathcal{F}}_t$-measurable. The limit $\widetilde X_t$ differs from the limsup at only a negligible set of $\omega$. The process $\{\widetilde X_t : 0 \le t \le 1\}$ is therefore adapted to the standard filtration. To complete the proof of (iv), we need to show that $\mathbb{P}\widetilde X_tF \ge \mathbb{P}\widetilde X_{t'}F$ for each $F$ in $\widetilde{\mathcal{F}}_t$, for $t < t'$. We may assume $t' < 1$. (The argument for $t' = 1$ is even simpler.) Choose sequences $\{s_n\}$ and $\{s'_n\}$ with $s'_n \downarrow\downarrow t' > s_n \downarrow\downarrow t$. By definition, $X_{s_n} \to \widetilde X_t$ and $X_{s'_n} \to \widetilde X_{t'}$, for each sample path. As in the last part of the proof of Theorem <6>, the sequences $\{X_{s_n}\}$ and $\{X_{s'_n}\}$ are uniformly integrable. (For example, put $Z_n$ equal to $-X_{s_n}$, then argue exactly as before.) Also, because $F$ differs only negligibly from sets in each $\mathcal{F}_{s_n}$, we have $\mathbb{P}X_{s_n}F \ge \mathbb{P}X_{s'_n}F$. In the limit we get the supermartingale inequality, $\mathbb{P}\widetilde X_tF \ge \mathbb{P}\widetilde X_{t'}F$.

For each positive constant $C$, the process $X_t \wedge C$ is a positive supermartingale: $\mathbb{P}(X_t \wedge C)F \ge \mathbb{P}(X_s \wedge C)F \ge \mathbb{P}(X_{t'} \wedge C)F$ if $t < s < t'$ and $F \in \mathcal{F}_t$. Invoke Dominated Convergence as $s$ decreases to $t$ through rational values to conclude that $\mathbb{P}(X_t \wedge C)F \ge \mathbb{P}(\widetilde X_t \wedge C)F \ge \mathbb{P}(X_{t'} \wedge C)F$. Let $C$ increase to infinity to deduce (v).

The equality in (vi) is equivalent to the assertion that $\mathbb{P}X_{t_0}F = \mathbb{P}\widetilde X_{t_0}F$ for all $F$ in $\widetilde{\mathcal{F}}_{t_0}$. If for some such $F$ and some $\epsilon > 0$ we had $\mathbb{P}X_{t_0}F \ge \epsilon + \mathbb{P}\widetilde X_{t_0}F$, then, via the inequality $\mathbb{P}X_{t_0}F^c \ge \mathbb{P}\widetilde X_{t_0}F^c$, we would get $\mathbb{P}X_{t_0} \ge \epsilon + \mathbb{P}\widetilde X_{t_0} \ge \epsilon + \mathbb{P}X_{t'}$ for all $t' > t_0$, which would prevent right continuity of the map $t \mapsto \mathbb{P}X_t$ at $t_0$. □

<9> Corollary. For each integrable random variable $X$, there exists a cadlag version of the martingale $\mathbb{P}(X \mid \widetilde{\mathcal{F}}_t)$.

Proof. Start from any choice of the conditional expectations $X_t := \mathbb{P}(X \mid \mathcal{F}_t)$. Note that $\mathbb{P}X_t$ is constant, and hence (trivially) continuous from the right. By property (vi) and the $\widetilde{\mathcal{F}}_t$-measurability of $X_t$ we have $\widetilde X_t = \mathbb{P}(X_t \mid \widetilde{\mathcal{F}}_t) = X_t$ almost surely. □

<10> Corollary. Each submartingale $\{(S_t, \widetilde{\mathcal{F}}_t) : t \in \mathbb{R}^+\}$ with $t \mapsto \mathbb{P}S_t$ right continuous has a cadlag version.

Proof. For each positive integer $n$, invoke the Theorem for the positive martingale $M_t^{(n)} := \mathbb{P}(S_n^+ \mid \mathcal{F}_t)$ and the positive supermartingale $X_t^{(n)} := M_t^{(n)} - S_t$, to get cadlag versions $\widetilde S_t = \widetilde M_t^{(n)} - \widetilde X_t^{(n)}$ for $n - 1 \le t \le n$. □

REMARK. The proof of the Krickeberg decomposition from Section 6.5 would carry over with only minor notational changes to continuous time. That is, each submartingale $\{(S_t, \mathcal{F}_t) : t \in \mathbb{R}^+\}$ with $\sup_t \mathbb{P}S_t^+ < \infty$ can be expressed as the difference $M_t - X_t$ of a positive martingale $M_t$ and a positive supermartingale $X_t$.

Notice that the second Corollary refers to submartingales with respect to the standard filtration. The result is not necessarily true for an arbitrary filtration. For example, take $\mathcal{F}_t := \sigma(\mathcal{N})$ for $t \le 1/2$ and $\mathcal{F}_t := \sigma(\mathcal{N}\cup\mathcal{F})$ for $t > 1/2$, and let $A$ be an event with probability $\alpha$ not equal to zero or one. Then the process $X_t(\omega) := \alpha\{t \le \frac12\} + \{\omega \in A,\ t > \frac12\}$ is a martingale for the filtration $\{\mathcal{F}_t : 0 \le t \le 1\}$. If $\{M(t)\}$ were another version of the process, we would have $M(1/2) = \alpha$ almost surely and $M(1/2 + n^{-1}) \in \{0,1\}$ almost surely, which rules out right continuity at $1/2$ for almost all paths. We have $\widetilde{\mathcal{F}}_t = \mathcal{F}_t$ except at $t = 1/2$, where $\widetilde{\mathcal{F}}_{1/2} = \sigma(\mathcal{N}\cup\mathcal{F})$. The cadlag process $\widetilde X$ from Theorem <8> agrees with $X$ except at $1/2$, where $\widetilde X_{1/2} = \{\omega \in A\} \ne X_{1/2}$ almost surely. That is, there is a single time point at which $\mathbb{P}\{X_t \ne \widetilde X_t\}$ is nonzero.


The general situation is similar (Doob 1953, pages 356-358): there are at most countably many $t$ at which $\mathbb{P}\{X_t \ne \widetilde X_t\} > 0$. If we work only with standard filtrations, the difficulty disappears.

4. The Brownian filtration

Let $\{B_t : t \in \mathbb{R}^+\}$ be a Brownian motion with continuous sample paths. Write $\mathcal{F}^B_t$ for $\sigma\{B_s : 0 \le s \le t\}$, for $t \in \mathbb{R}^+$, the natural (or Brownian) filtration on $\Omega$, and let $\{\widetilde{\mathcal{F}}^B_t : t \in \mathbb{R}^+\}$ be the corresponding standard filtration. The Markov property of Brownian motion implies that $\widetilde{\mathcal{F}}^B_t$ differs only negligibly from $\mathcal{F}^B_t$, and also that each martingale with respect to the standard filtration has a version with continuous sample paths. Before turning to the proof of these assertions, let me first dispose of a small detail that might cause some notational difficulties if left unsettled.

<11> Lemma. The process $\{(B_t, \widetilde{\mathcal{F}}^B_t) : t \in \mathbb{R}^+\}$ is also a Brownian motion.

Proof. We need to show that the increment $B_t - B_s$ is independent of each $F$ in $\widetilde{\mathcal{F}}^B_s$, when $s < t$. For each $u$ with $s < u < t$, the event $F$ differs negligibly from a member of $\mathcal{F}^B_u$. Thus $\mathbb{P}\exp\left(i\theta(B_t - B_u)\right)F = \exp\left(-\theta^2(t-u)/2\right)\mathbb{P}F$, for each real $\theta$. Invoke Dominated Convergence as $u$ decreases to $s$ to deduce that $\mathbb{P}\exp\left(i\theta(B_t - B_s)\right)F$ factorizes into $\exp\left(-\theta^2(t-s)/2\right)\mathbb{P}F$, as required for a Brownian motion. □

Now recall from Chapter 9 that we may treat $B$ as an $\mathcal{F}\backslash\mathcal{C}$-measurable map from $\Omega$ into $C[0,\infty)$, the space of continuous real functions on $\mathbb{R}^+$, equipped with its cylinder sigma-field $\mathcal{C}$. The distribution of $B$ is Wiener measure $W$, a probability measure on $\mathcal{C}$. The collection of all sets $\{B \in C\}$, with $C \in \mathcal{C}$, defines a sub-sigma-field $\mathcal{F}^B$ of $\mathcal{F}$, the sigma-field generated by the Brownian motion. A random variable is $\mathcal{F}^B$-measurable if and only if it can be written in the form $f(B)$, where $f$ is a $\mathcal{C}$-measurable map from $C[0,\infty)$ into the real line. That is, $f(B)$ is a functional of the whole Brownian motion sample path. The sigma-field $\mathcal{F}^B_t$ is also generated by the map $\omega \mapsto K^tB$, where $K^t$ is the killing functional, $(K^tx)(s) := x(t \wedge s)$, which maps $C[0,\infty)$ back into itself. Thus every $\mathcal{F}^B_t$-measurable random variable is representable as $f(K^tB)$, for some $\mathcal{C}$-measurable functional $f$ on $C[0,\infty)$.

The Markov property gives an explicit representation for the conditional expectation of every $W$-integrable function $f$ of the whole Brownian motion sample path,

<12> $\mathbb{P}\left(f(B) \mid \mathcal{F}_t\right) = W^x f(K^tB + S^tx)$ almost surely,

where $(S^tx)(s)$ equals $x(s - t)$ if $s > t$, and is zero otherwise, as in Section 9.5.

The metric $d$ for uniform convergence on compacta (convergence denoted by the symbol $\xrightarrow{ucc}$) is defined on the space $C[0,\infty)$ by $d(x,y) := \sum_{n\in\mathbb{N}} 2^{-n}\left(1 \wedge \sup_{0\le t\le n}|x(t) - y(t)|\right)$. The Borel sigma-field for this metric coincides with the cylinder sigma-field $\mathcal{C}$. The sigma-field $\mathcal{C}$ is also generated by the space $C_{bdd}$ of all bounded, $d$-continuous functions on $C[0,\infty)$. (Quite confusing: continuous functions on a space of continuous functions. Maybe the clumsier $C_{bdd}(C[0,\infty))$ would be a better symbol, to save us from confusion.) The space $C_{bdd}$ is dense in $\mathcal{L}^1(W)$.
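A numerical sketch of the metric $d$ (my illustration, not from the text), truncating the series at finitely many terms and approximating each supremum on a grid:

```python
# Sketch (not from the text): the metric for uniform convergence on
# compacta, d(x, y) = sum_n 2^{-n} (1 ∧ sup_{0<=t<=n} |x(t) - y(t)|),
# approximated by truncating the series at N terms and taking each
# supremum over a finite time grid.

def ucc_distance(x, y, N=20, grid=100):
    """Truncated approximation to d(x, y) for functions on [0, infinity)."""
    total = 0.0
    for n in range(1, N + 1):
        # sup over [0, n], approximated on a finite grid of grid+1 points
        sup_diff = max(abs(x(n * i / grid) - y(n * i / grid))
                       for i in range(grid + 1))
        total += 2.0**-n * min(1.0, sup_diff)
    return total

# Identical paths are at distance 0; paths differing by a constant c <= 1
# have sup-difference c on every [0, n], so the truncated d equals
# c * (1 - 2^-N).
f = lambda t: 0.0
g = lambda t: 0.5
assert ucc_distance(f, f) == 0.0
assert abs(ucc_distance(f, g) - 0.5 * (1 - 2.0**-20)) < 1e-12
```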


Notice that for $f$ in $C_{bdd}$, the right-hand side of <12> is a continuous function of $t$, by Dominated Convergence, because $K^uB(\cdot,\omega) + S^ux \xrightarrow{ucc} K^tB(\cdot,\omega) + S^tx$ as $u \to t$, for each sample path $B(\cdot,\omega)$ and each $x$ in $C[0,\infty)$ with $x(0) = 0$. That is, for each $f$ in $C_{bdd}$ there is a version of the martingale $\mathbb{P}(f(B) \mid \mathcal{F}_t)$ with continuous sample paths. This fact underlies the remarkable properties of the Brownian filtration.
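The representation <12> can be checked numerically in its simplest instance (a Monte Carlo sketch of my own, not from the text): for $f(x) = x(1)^2$ and $t < 1$, averaging $f(K^tB + S^tx)$ over Wiener paths $x$ reproduces the closed form $B_t^2 + (1 - t)$:

```python
import math
import random

# Monte Carlo sketch (not from the text) of representation <12> in its
# simplest case: for f(x) = x(1)^2, conditioning on the Brownian path up
# to time t gives P(B_1^2 | F_t) = B_t^2 + (1 - t), and the same value
# comes from averaging f(K^t B + S^t x) over independent Wiener paths x,
# since only the increment x(1 - t) ~ N(0, 1 - t) matters for f.

random.seed(0)
t, b_t = 0.4, 0.7       # an observed value B_t(omega) = 0.7 at time t

n = 200_000
mc = sum((b_t + random.gauss(0.0, math.sqrt(1 - t)))**2
         for _ in range(n)) / n

exact = b_t**2 + (1 - t)   # closed form for P(B_1^2 | F_t)
assert abs(mc - exact) < 0.02
```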

<13> Theorem. For each $t$ in $\mathbb{R}^+$, the standard sigma-field $\widetilde{\mathcal{F}}^B_t$ equals $\sigma(\mathcal{N}\cup\mathcal{F}^B_t)$. Each martingale for the filtration $\{\widetilde{\mathcal{F}}^B_t\}$ has a version with continuous sample paths.

Proof. Representation <12> holds for every filtration $\{\mathcal{F}_t\}$ for which $B$ is a Brownian motion. In particular, it holds if we choose $\mathcal{F}_t = \widetilde{\mathcal{F}}^B_t$. Remember that $\widetilde{\mathcal{F}}^B_t \subseteq \sigma(\mathcal{N}\cup\mathcal{F}^B_s) \subseteq \sigma(\mathcal{N}\cup\mathcal{F}^B)$ for every $s > t$. If $X$ is an integrable, $\widetilde{\mathcal{F}}^B_t$-measurable random variable, it differs only negligibly from an integrable random variable of the form $f(B)$, with $f$ a $\mathcal{C}$-measurable, $W$-integrable functional. In consequence,

$X = \mathbb{P}(X \mid \widetilde{\mathcal{F}}^B_t) = \mathbb{P}(f(B) \mid \widetilde{\mathcal{F}}^B_t) = W^x f(K^tB + S^tx)$ almost surely.

That is, $X$ differs only negligibly from an $\mathcal{F}^B_t$-measurable random variable. The first assertion about the standard filtration follows.

Now suppose $\{(M_t, \mathcal{F}_t) : t \in \mathbb{R}^+\}$ is a martingale, where $\mathcal{F}_t = \widetilde{\mathcal{F}}^B_t$, as before. By Corollary <10> we may assume that it has cadlag sample paths. It is enough to prove that almost all its sample paths are continuous on each bounded subinterval of $\mathbb{R}^+$. Consider the subinterval $[0,1]$ as a typical case.

Write $M_1$ as $f(B)$, with $f$ a $\mathcal{C}$-measurable, $W$-integrable functional. Approximate $f$ in $\mathcal{L}^1(W)$ norm by a sequence of functions $f_n$ from $C_{bdd}$: for each $n \in \mathbb{N}$ choose $f_n$ so that $W|f - f_n| \le 4^{-n}$. Define $M_n(t) := W^x f_n(K^tB + S^tx)$, a version of the martingale $\mathbb{P}(f_n(B) \mid \mathcal{F}_t)$ with continuous sample paths. Notice that

$\mathbb{P}|M_n(1) - M(1)| = \mathbb{P}\left|\mathbb{P}\left(f_n(B) - f(B) \mid \mathcal{F}_1\right)\right| \le \mathbb{P}|f_n(B) - f(B)| = W|f_n - f| \le 4^{-n}.$

Fix a finite subset $U$ of $[0,1]$, and define $\tau := 1 \wedge \inf\{t \in U : |M_n(t) - M(t)| > 2^{-n}\}$, a stopping time for the filtration $\{\mathcal{F}_t : t \in U\}$. Assume $1 \in U$. Then, via the Stopping Time Lemma applied to the submartingale $\{|M_n(t) - M(t)| : t \in U\}$, deduce that

$\mathbb{P}\{\max_{t\in U}|M_n(t) - M(t)| > 2^{-n}\} = \mathbb{P}\{|M_n(\tau) - M(\tau)| > 2^{-n}\}$
$\le 2^n\,\mathbb{P}|M_n(\tau) - M(\tau)|$
$\le 2^n\,\mathbb{P}|M_n(1) - M(1)| \le 2^{-n}.$

The last bound does not depend on $U$. Let $U$ expand up to a countable dense subset of $[0,1]$. Invoke right continuity of the $M_n - M$ sample paths to deduce that

$\mathbb{P}\{\sup_{0\le t\le 1}|M_n(t) - M(t)| > 2^{-n}\} \le 2^{-n}.$

By Borel-Cantelli, $\sup_{0\le t\le 1}|M_n(t) - M(t)| \to 0$ almost surely. That is, for almost all $\omega$, the sample path $M(\cdot,\omega)$ is a uniform limit on $[0,1]$ of the continuous functions $M_n(\cdot,\omega)$, and hence is also continuous. □


5. Problems

[1] Let $\{N_t : 0 \le t \le 1\}$ be a family of negligible sets, for Lebesgue measure on $\mathcal{B}[0,1]$, such that $\cup_t N_t = [0,1]$. For $\omega \in \Omega := [0,1]$, define $X_t(\omega) := \{\omega \in N_t\}$. Show that $\{X_t : 0 \le t \le 1\}$ is a martingale with respect to the filtration $\{\mathcal{B}[0,t] : 0 \le t \le 1\}$. Need it have cadlag sample paths?

[2] Show that an adapted process $\{X_t : t \in \mathbb{R}^+\}$ with right-continuous sample paths must be progressively measurable. Hint: For fixed $t$, consider approximations of the form

$X_k(s,\omega) := X(0,\omega)\{s = 0\} + \sum_{i=1}^{2^k} X(ti/2^k,\omega)\,\{t(i-1)/2^k < s \le ti/2^k\}.$

Why is each $X_k$ process $\mathcal{B}[0,t]\otimes\mathcal{F}_t$-measurable? Is it adapted? Why does it converge pointwise to $X$ on $[0,t]\times\Omega$?

[3] Let $\{X_t : t \in \mathbb{R}^+\}$ be a process with continuous sample paths, adapted to a filtration $\{\mathcal{F}_t : t \in \mathbb{R}^+\}$. Define $\tau(\omega) := \inf\{t \in \mathbb{R}^+ : X_t(\omega) \in F\}$, for some closed set $F$. Show that $\{\tau \le t\} = \{\omega : \inf_{s\in Q_t} d(X_s(\omega), F) = 0\}$, where $Q_t$ denotes the set of rational numbers in $[0,t]$. Deduce that $\tau$ is a stopping time for the filtration.

6. Notes

My exposition borrows ideas from Doob (1953), Métivier (1982), Breiman (1968, Chapter 14), Dellacherie & Meyer (1978, Chapter IV), and Dellacherie & Meyer (1982, Chapters V and VI).

REFERENCES

Breiman, L. (1968), Probability, first edn, Addison-Wesley, Reading, Massachusetts.
Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.
Dellacherie, C. & Meyer, P. A. (1982), Probabilities and Potential B: Theory of Martingales, North-Holland, Amsterdam.
Doob, J. L. (1953), Stochastic Processes, Wiley, New York.
Métivier, M. (1982), Semimartingales: A Course on Stochastic Processes, De Gruyter, Berlin.


Appendix F

Disintegration of measures

SECTION 1 decomposes a measure on a product space into a product of a marginal measure with a kernel.

SECTION 2 specializes the decomposition to the case of a measure concentrated on the graph of a function, establishing existence of a disintegration in the sense of Chapter 5.

1. Representation of measures on product spaces

Recall from Chapter 4 how we built a measure $\mu\otimes\Lambda$ out of a sigma-finite measure $\mu$ on $(T,\mathcal{B})$ and a sigma-finite kernel $\Lambda := \{\lambda_t : t \in T\}$ from $(T,\mathcal{B})$ to $(X,\mathcal{A})$, via an iterated integral,

$(\mu\otimes\Lambda)f := \mu^t\lambda_t^x f(x,t)$ for $f$ in $\mathcal{M}^+(X\times T,\ \mathcal{A}\otimes\mathcal{B}).$

This Section treats the inverse problem: Given a measure $\mu$ on $\mathcal{B}$ and a measure $\Gamma$ on $\mathcal{A}\otimes\mathcal{B}$, when does there exist a kernel $\Lambda$ for which $\Gamma = \mu\otimes\Lambda$? Such representations are closely related to the problem of constructing conditional distributions, as you saw in Chapter 5.
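In the discrete case the iterated integral is just a double sum, which makes the construction easy to sketch (a toy example of mine, not from the text):

```python
# Toy discrete sketch (not from the text): build (mu ⊗ Λ) from a measure
# mu on T and a kernel Λ = {lambda_t} of measures on X, via the iterated
# "integral"  (mu ⊗ Λ) f = sum_t mu{t} * sum_x lambda_t{x} * f(x, t).

T_space = ["a", "b"]
X_space = [0, 1]
mu = {"a": 0.5, "b": 1.5}                 # sigma-finite measure on T
kernel = {"a": {0: 0.2, 1: 0.8},          # lambda_a on X
          "b": {0: 1.0, 1: 0.0}}          # lambda_b on X

def product_integral(f):
    """(mu ⊗ Λ) f  =  mu^t lambda_t^x f(x, t)."""
    return sum(mu[t] * sum(kernel[t][x] * f(x, t) for x in X_space)
               for t in T_space)

# Total mass of mu ⊗ Λ is mu^t (lambda_t X); here each lambda_t has mass 1.
total = product_integral(lambda x, t: 1.0)
assert abs(total - 2.0) < 1e-12

# The T-marginal of mu ⊗ Λ has density t -> lambda_t(X) with respect to mu.
marginal_a = product_integral(lambda x, t: 1.0 * (t == "a"))
assert abs(marginal_a - 0.5) < 1e-12
```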

<1> Theorem. Let $\Gamma$ be a sigma-finite measure on the product sigma-field $\mathcal{A}\otimes\mathcal{B}$ of a product space $X\times T$, and $\mu$ be a sigma-finite measure on $\mathcal{B}$. Suppose:

(i) $X$ is a metric space and $\mathcal{A}$ is its Borel sigma-field;

(ii) the $T$-marginal of $\Gamma$ is absolutely continuous with respect to $\mu$;

(iii) $\Gamma = \sum_{i\in\mathbb{N}}\Gamma_i$, where each $\Gamma_i$ is a finite measure concentrating on a set $X_i\times T$ with $X_i$ compact.

Then there exists a kernel $\Lambda$ from $(T,\mathcal{B})$ to $(X,\mathcal{A})$ for which $\Gamma = \mu\otimes\Lambda$. The kernel is unique up to a $\mu$-equivalence.

REMARK. The uniqueness assertion means that, if $\widetilde\Lambda := \{\widetilde\lambda_t : t \in T\}$ is another kernel for which $\Gamma = \mu\otimes\widetilde\Lambda$, then $\lambda_t = \widetilde\lambda_t$, as measures on $\mathcal{A}$, for $\mu$ almost all $t$.

Heuristics

Suppose for the moment that $\Gamma$ has a representation as $\mu\otimes\Lambda$, for some kernel $\Lambda$. If we could characterize the kernel $\Lambda$ in terms of $\Gamma$ and $\mu$ alone, then we could try to construct $\Lambda$ for a general $\Gamma$ by reinterpreting the characterization as a definition.


First note that the $T$-marginal of $\Gamma$ (that is, the image of $\Gamma$ under the map $\pi$ that projects $X\times T$ onto $T$) must be absolutely continuous with respect to $\mu$, because

$(\pi\Gamma)g = \Gamma^{x,t}g(t) = \mu^t\lambda_t^x g(t) = \mu^t\left(g(t)\,\lambda_tX\right)$ for $g \in \mathcal{M}^+(T,\mathcal{B}).$

If $g = 0$ a.e. $[\mu]$ then the last integral must be zero, thereby implying $(\pi\Gamma)g = 0$. Then note that, for each fixed $f$ in $\mathcal{M}^+(X,\mathcal{A})$, the iterated integral representation for $\Gamma$ identifies $t \mapsto \lambda_t^x f(x)$ as a density with respect to $\mu$ of the measure $\Gamma_f$ on $\mathcal{B}$ defined via an increasing linear functional,

<2> $\Gamma_f g := \Gamma^{x,t}\left(f(x)g(t)\right) = \mu^t\left(g(t)\,\lambda_t^x f(x)\right)$ for $g \in \mathcal{M}^+(T,\mathcal{B}).$

Construction of the kernel, almost

Now consider how we might reverse the argument, starting from a measure $\Gamma$ on $\mathcal{A}\otimes\mathcal{B}$ for which $\pi\Gamma$ is absolutely continuous with respect to a sigma-finite measure $\mu$ on $\mathcal{B}$. Without loss of generality we may suppose $\Gamma$ is a finite measure. The result for $\Gamma$ as in the Theorem will follow by pasting together the results for each of the $\Gamma_i$.

For each $f$ in $\mathcal{M}^+(X,\mathcal{A})$, we can define (via the Radon-Nikodym Theorem from Section 3.1, even though $\Gamma_f$ need not be sigma-finite) a function $\lambda(t,f)$ as the density of $\Gamma_f$ with respect to $\mu$. Here I write $\lambda(t,f)$ instead of $\lambda_t f$ to avoid an inadvertent presumption that $f \mapsto \lambda(t,f)$ is a measure. Indeed, at the moment $\lambda(t,f)$ is defined only up to a $\mu$-equivalence; it is not even a well defined functional for each fixed $t$. Nevertheless, it has almost sure analogs of the properties, as a function of $f$, that we need. Problem [2] outlines the steps needed to prove the following result.

<3> Lemma. Let $\Gamma$ be a finite measure whose $T$-marginal is absolutely continuous with respect to the sigma-finite measure $\mu$. For each fixed $f$ in $\mathcal{M}^+(X,\mathcal{A})$, the density $\lambda(t,f) := d\Gamma_f/d\mu$, with $\Gamma_f$ as in <2>, is well defined and determined uniquely up to a $\mu$-equivalence. The map $t \mapsto \lambda(t,f)$ has the following measure-like properties, for all $\alpha_1, \alpha_2$ in $\mathcal{M}^+(T,\mathcal{B})$ and all $f, f_1, f_2, \dots$ in $\mathcal{M}^+(X,\mathcal{A})$:

(i) $\lambda(t,\alpha_1f_1 + \alpha_2f_2) = \alpha_1(t)\lambda(t,f_1) + \alpha_2(t)\lambda(t,f_2)$ a.e. $[\mu]$;

(ii) if $f_1 \le f_2$ then $\lambda(t,f_1) \le \lambda(t,f_2)$ a.e. $[\mu]$;

(iii) if $0 \le f_1 \le f_2 \le \cdots \uparrow f$ then $\lambda(t,f_n) \uparrow \lambda(t,f)$ a.e. $[\mu]$.

If it were possible to combine the negligible sets corresponding to the a.e. $[\mu]$ qualifiers on each of the assertions (i), (ii), and (iii) of the Lemma into a single $\mu$-negligible set $N$, the family of functionals $\{\lambda(t,\cdot) : t \notin N\}$ would correspond to a family of measures on $\mathcal{A}$. Unfortunately, we accumulate uncountably many negligible sets as we cycle (i)-(iii) through all the required combinations of functions in $\mathcal{M}^+(X,\mathcal{A})$ and positive constants.

There are two strategies in the literature for dealing with the multitude of negligible sets. The more elegant approach brings to bear the heavy measure theoretic machinery of the lifting theorem. (Roughly speaking, a lifting is a map that selects a representative from each $\mu$-equivalence class of bounded functions in a way that eliminates the almost sure qualifiers from properties (i) and (ii) of Lemma <3>.) The alternative is to use a separability assumption to reduce (i)-(iii) to a countable collection of requirements, each involving the exclusion of a single $\mu$-negligible set.

REMARK. Separability methods have the longer history, but lifting is clearlythe technical tool of choice when we seek more general forms for theorems aboutconditional probabilities, constructions of point processes, and other results involvingmanagement of uncountable families of negligible sets.

Construction of measures on compact metric spaces

I will follow the separability strategy for dealing with the host of negligible sets in Lemma <3>. The following result from Section A.4 will help us to build a measure from a set function defined initially on a countable collection of sets.

<4> Lemma. Let $\mathcal{K}_0$ be a $(\emptyset, \cup f, \cap f)$ paving of subsets of $X$, and let $\mathcal{K}$ denote its $\cap c$-closure. Let $\nu : \mathcal{K}_0 \to \mathbb{R}^+$ be a $\mathcal{K}_0$-tight set function that is sigma-smooth at $\emptyset$. Then $\nu$ has a unique extension to a complete, $\mathcal{K}$-regular, countably additive measure on a sigma-field $\mathcal{S} \supseteq \mathcal{K}_0$, defined by

$\nu K := \inf\{\nu K_0 : K \subseteq K_0 \in \mathcal{K}_0\}$ for $K \in \mathcal{K},$
$\nu S := \sup\{\nu K : S \supseteq K \in \mathcal{K}\}$ for $S \in \mathcal{S}.$

The requirements of the Lemma simplify greatly when $\nu = \lambda(t,\cdot)$, with $\mathcal{K}$ as the paving of all compact subsets of a compact metric space $X$, if we choose an appropriate $\mathcal{K}_0$. Problem [9] shows that there exists a countable subclass $\mathcal{K}_0$ of $\mathcal{K}$ with the following properties.

(a) The class $\mathcal{K}_0$ is a $(\emptyset, \cup f, \cap f)$ paving on $X$, meaning that $\mathcal{K}_0$ contains the empty set and that it is stable under the formation of finite unions and finite intersections.

(b) For each $K$ in $\mathcal{K}$, there exists a decreasing sequence of sets $\{K_n\}$ from $\mathcal{K}_0$ for which $K = \cap_n K_n$.

(c) For each pair of sets $K_1, K_2$ in $\mathcal{K}_0$ for which $K_1 \supseteq K_2$, there exists an increasing sequence of sets $\{H_n\}$ from $\mathcal{K}_0$ for which $\cup_n H_n = K_1\backslash K_2$.

Property (b) ensures that $\mathcal{K}$ is the $\cap c$-closure of $\mathcal{K}_0$. Compactness makes the sigma-smoothness for $\nu$ automatic. The $\mathcal{K}_0$-tightness property,

$\nu K_1 = \nu K_2 + \sup\{\nu H : K_1\backslash K_2 \supseteq H \in \mathcal{K}_0\}$ for $K_1 \supseteq K_2$ in $\mathcal{K}_0$,

reduces to

(A) $\nu(K\cup H) = \nu(K) + \nu(H)$ for all disjoint pairs $K$ and $H$ in $\mathcal{K}_0$,

(B) $\nu(K_2) + \sup_n\nu(H_n) = \nu(K_1)$, with $K_1$, $K_2$, and $\{H_n\}$ as in (c).

We now have the tools to turn Lemma <3> into a proof of Theorem <1>, first for the case of compact $X$, then for the general case.

Theorem for finite $\Gamma$ and compact $X$

Write $\nu_t(K)$ for an arbitrarily chosen version of $\lambda(t,K)$, for each $K$ in $\mathcal{K}_0$. There are only countably many ways to choose a pair of sets from $\mathcal{K}_0$. Taking the union of countably many $\mu$-negligible sets, we are left with a single set $N$ with $\mu N = 0$, for which the following properties hold when $t \notin N$:

(A)' $\nu_t(K\cup H) = \nu_t(K) + \nu_t(H)$ for all disjoint pairs $K$ and $H$ in $\mathcal{K}_0$,

(B)' $\nu_t(K_2) + \sup_n\nu_t(H_n) = \nu_t(K_1)$, with $K_1$, $K_2$, and $\{H_n\}$ as in (c).

For each such $t$, Lemma <4> lets us extend $\nu_t$ to a countably additive, finite measure $\lambda_t$ on the Borel sigma-field of $X$, for which $t \mapsto \lambda_tK$ is $\mathcal{B}$-measurable and

$\Gamma^{x,t}\left(g(t)\{x \in K\}\right) = \mu^t\left(g(t)\,\lambda_tK\right)$ for all $g \in \mathcal{M}^+(T,\mathcal{B})$ and $K \in \mathcal{K}_0.$

Define $\lambda_t$ as the zero measure when $t \in N$. A simple $\lambda$-cone generating class argument then shows that $t \mapsto \lambda_tA$ is $\mathcal{B}$-measurable for each Borel set $A$, so that $\Lambda := \{\lambda_t : t \in T\}$ is a genuine kernel. A similar $\lambda$-cone argument extends the representation for $\Gamma$ to all functions in $\mathcal{M}^+(X\times T,\ \mathcal{A}\otimes\mathcal{B})$, thereby confirming that $\Gamma = \mu\otimes\Lambda$.

For each $K$ in $\mathcal{K}_0$, the function $t \mapsto \lambda_tK$ is determined up to an almost sure equivalence as a density. For any two representing kernels we would have $\lambda_tK = \widetilde\lambda_tK$ for $t \notin N_K$, with $\mu N_K = 0$. For $t$ outside the $\mu$-negligible set $\cup_{K\in\mathcal{K}_0}N_K$, a $\lambda$-class argument establishes equality of the finite measures $\lambda_t$ and $\widetilde\lambda_t$.

General case

We may assume the compact sets $X_i$ are disjoint. Invoke the special case to find kernels $\Lambda_i := \{\lambda_{t,i} : t \in T\}$ for which $\Gamma_i = \mu\otimes\Lambda_i$ and $\lambda_{t,i}X_i^c = 0$ for each $i$. Define $\Lambda$ by $\lambda_t := \sum_{i\in\mathbb{N}}\lambda_{t,i}$. The uniqueness assertion reduces to the uniqueness for each $\Lambda_i$, together with the fact that $\Gamma$ puts zero mass outside $(\cup_iX_i)\times T$.

2. Disintegrations with respect to a measurable map

In Chapter 5, conditional distributions were identified as special cases of disintegrations, in the following wide sense.

<5> Definition. Let $(X,\mathcal{A},\lambda)$ and $(T,\mathcal{B},\mu)$ be measure spaces, and let $T$ be an $\mathcal{A}\backslash\mathcal{B}$-measurable map from $X$ into $T$. Call a kernel $\Lambda := \{\lambda_t : t \in T\}$ from $(T,\mathcal{B})$ to $(X,\mathcal{A})$ a $(T,\mu)$-disintegration of a sigma-finite measure $\lambda$ if

(i) $\lambda_t\{T \ne t\} = 0$ for $\mu$ almost all $t$, that is, each $\lambda_t$ concentrates on the fibre $\{x : Tx = t\}$;

(ii) $\lambda^x f(x) = \mu^t\lambda_t^x f(x)$, for each $f$ in $\mathcal{M}^+(X,\mathcal{A})$.

To fit disintegration into the framework of Theorem <1>, it suffices that $\lambda$ be a Radon measure: a measure on the Borel sigma-field $\mathcal{B}(X)$ for which

$\lambda A = \sup\{\lambda K : A \supseteq K, \text{ with } K \text{ compact}\}$ for $A \in \mathcal{B}(X),$

with $\lambda K < \infty$ for each compact $K$. A sigma-finite Radon measure must concentrate all its mass on a disjoint union of countably many compact Borel sets (Problem [3]). Every sigma-finite measure on the Borel sigma-field of a complete, separable metric space is a Radon measure (Problem [4]), provided it gives finite measure to compact sets.

<6> Theorem. Let $T$ be an $\mathcal{A}\backslash\mathcal{B}$-measurable map from $X$ into $T$. Let $\lambda$ be a sigma-finite Radon measure on the Borel sigma-field $\mathcal{A}$ of a metric space $X$, whose image measure $T\lambda$ is absolutely continuous with respect to a sigma-finite measure $\mu$ on $\mathcal{B}$. If the set $\mathrm{graph}(T) := \{(x, t) \in X \times T : Tx = t\}$ is $\mathcal{A} \otimes \mathcal{B}$-measurable then $\lambda$ has a $(T, \mu)$-disintegration, unique up to a $\mu$-equivalence.

Proof. The image measure $\Gamma$ of $\lambda$ under the map $x \mapsto (x, Tx)$ concentrates on $\mathrm{graph}(T)$, and it satisfies the conditions of Theorem <1>. The Theorem gives a kernel $\Lambda$ for which $\Gamma = \mu \otimes \Lambda$, that is,

$\lambda^x g(x, Tx) = \Gamma^{x,t} g(x, t) = \mu^t \lambda_t^x g(x, t)$ for all $g \in M^+(X \times T, \mathcal{A} \otimes \mathcal{B})$.

Specialize to $g(x, t) = f(x)$ to get the disintegration property (ii). Then take $g$ as the indicator function of the complement of $\mathrm{graph}(T)$ to get $0 = \Gamma\{(x, t) : t \neq Tx\} = \mu^t \lambda_t\{T \neq t\}$, which implies the disintegration property (i). The almost uniqueness of the kernel follows from the corresponding assertion in Theorem <1>.

3. Problems

[1] (repeated from Chapter 3) Let $\mu$ be a sigma-finite measure on $(T, \mathcal{B})$. Suppose $h_1$ and $h_2$ are functions in $M^+(T, \mathcal{B})$ for which $\mu(g(t)h_1(t)) \le \mu(g(t)h_2(t))$ for all $g$ in $M^+(T, \mathcal{B})$. Show that $h_1 \le h_2$ a.e. $[\mu]$. Hint: Write $T = \cup_{i \in \mathbb{N}} T_i$, with $\mu T_i < \infty$ for each $i$. Consider $g := \{t \in T_i : h_2(t) \le r < h_1(t)\}$ for rational $r$.

[2] Establish Lemma <3> by the following steps. Write $\gamma$ for the $X$-marginal of $\Gamma$.

(i) For fixed $f$ in $M^+(X, \mathcal{A})$ and each $n$ in $\mathbb{N}$, write $k_n(t, f)$ for any choice of the Radon-Nikodym derivative $d\Gamma_{n \wedge f}/d\mu$. That is, $k_n(t, f)$ is a $\mathcal{B}$-measurable function for which $\Gamma^{x,t} g(t)\big(n \wedge f(x)\big) = \mu^t g(t) k_n(t, f)$ for all $g$ in $M^+(T, \mathcal{B})$. Use Problem [1] to show that $0 \le k_n(t, f) \uparrow$ a.e. $[\mu]$.

(ii) Show that $k(t, f) := \limsup_n k_n(t, f)$ is a density for $\Gamma_f$ with respect to $\mu$. Use Problem [1] to show that the density is unique up to a $\mu$-equivalence.

(iii) For fixed $\alpha_1, \alpha_2, g$ in $M^+(T, \mathcal{B})$ and $f, f_1, f_2$ in $M^+(X, \mathcal{A})$, show that

$\mu^t\big(g(t)\,k(t, \alpha_1 f_1 + \alpha_2 f_2)\big) = \Gamma^{x,t}\big(g(t)(\alpha_1(t)f_1(x) + \alpha_2(t)f_2(x))\big) = \mu^t\big(g(t)(\alpha_1(t)k(t, f_1) + \alpha_2(t)k(t, f_2))\big).$

Deduce via (ii) that $k(t, \alpha_1 f_1 + \alpha_2 f_2) = \alpha_1(t)k(t, f_1) + \alpha_2(t)k(t, f_2)$ a.e. $[\mu]$.

(iv) If $f_1, f_2 \in M^+(X, \mathcal{A})$ and $f_1(x) \le f_2(x)$ a.e. $[\gamma]$, show that $\mu^t\big(g(t)k(t, f_1)\big) \le \mu^t\big(g(t)k(t, f_2)\big)$ for all $g$ in $M^+(T, \mathcal{B})$. Deduce that $k(t, f_1) \le k(t, f_2)$ a.e. $[\mu]$.

(v) If $f_1, f_2, \ldots \in M^+(X, \mathcal{A})$ and $f_n(x) \uparrow f(x)$ a.e. $[\gamma]$, show that

$\mu^t\big(g(t)k(t, f_n)\big) = \Gamma^{x,t}\big(g(t)f_n(x)\big) \uparrow \Gamma^{x,t}\big(g(t)f(x)\big) = \mu^t\big(g(t)k(t, f)\big)$

for all $g$ in $M^+(T, \mathcal{B})$. Deduce that $k(t, f_n) \uparrow k(t, f)$ a.e. $[\mu]$.


[3] Let $\lambda$ be a sigma-finite Radon measure on $(X, \mathcal{A})$, with $\mathcal{A}$ the Borel sigma-field. Show that there exists a sequence of disjoint compact sets $\{K_i : i \in \mathbb{N}\}$ for which $\lambda(\cup_i K_i)^c = 0$. Hint: Partition $X$ into sets $\{X_i : i \in \mathbb{N}\}$, each with finite measure. For each $i$ find disjoint compact subsets $K_{in}$ of $X_i$ for which $\lambda\big(X_i \setminus \cup_{m \le n} K_{im}\big) < 2^{-n}$.

[4] Let $\lambda$ be a finite measure on the Borel sigma-field of a complete, separable metric space $X$. Write $\mathcal{K}$ for the class of all compact subsets of $X$.

(i) Show that $\lambda$ is tight: for each $\epsilon > 0$ there exists a $K_\epsilon$ in $\mathcal{K}$ such that $\lambda K_\epsilon^c < \epsilon$.

Hint: For each positive integer $n$, show that the space $X$ is a countable union of closed balls with radius $1/n$. Find a finite family of such balls whose union $B_n$ has $\lambda$ measure greater than $\lambda X - \epsilon/2^n$. Show that $\cap_n B_n$ is compact, using the total-boundedness characterization of compact subsets of complete metric spaces (Simmons 1963, Section 25).

(ii) Deduce that $\lambda B = \sup\{\lambda K : B \supseteq K \in \mathcal{K}\}$ for each Borel set $B$. Hint: Compare with the Problems to Chapter 2.

(iii) Extend the result to sigma-finite measures $\lambda$ on a complete, separable metric space for which $\lambda K < \infty$ for each compact $K$.

[5] Let $T$ be an $\mathcal{A}\backslash\mathcal{B}$-measurable map from $(X, \mathcal{A})$ into $(\mathcal{Y}, \mathcal{B})$. Suppose $\mathcal{B}$ contains all singleton sets $\{y\}$, for $y \in \mathcal{Y}$, and is countably generated: $\mathcal{B} := \sigma(\mathcal{B}_0)$ for some countable subclass $\mathcal{B}_0$. With no loss of generality you may assume that $\mathcal{B}_0$ is stable under complements. Show that $\mathrm{graph}(T) \in \mathcal{A} \otimes \mathcal{B}$, by following these steps.

(i) Say that a set $B$ separates points $y_1$ and $y_2$ if either $y_1 \in B$ and $y_2 \in B^c$ or $y_1 \in B^c$ and $y_2 \in B$. For fixed $y_1 \neq y_2$, show that

$\mathcal{B}_1 := \{B \in \mathcal{B} : B \text{ does not separate } y_1 \text{ and } y_2\}$

is a sigma-field. Deduce that $\mathcal{B}_1$ cannot contain every $\mathcal{B}_0$ set.

(ii) If $y \neq Tx$, for some $x \in X$ and $y \in \mathcal{Y}$, show that $(x, y) \in (T^{-1}B^c) \times B$ for some $B \in \mathcal{B}_0$.

(iii) Deduce that $\mathrm{graph}(T)^c$ is a countable union of measurable rectangles.

[6] Suppose $H : \mathcal{Y} \to \mathbb{R}$ is one-to-one. Show that $\sigma(H)$ is countably generated and contains all the singleton sets.

[7] Suppose $\mathcal{C}$ is a countably generated sigma-field on a set $\mathcal{Y}$ that contains all the singleton subsets. That is, $\mathcal{C} := \sigma(\mathcal{C}_0)$, for some countable $\mathcal{C}_0 := \{C_1, C_2, \ldots\}$. Define a real-valued, $\mathcal{C}$-measurable function $H(y) := \sum_i 3^{-i}\{y \in C_i\}$ on $\mathcal{Y}$.

(i) Show that $H$ is one-to-one. Hint: If $H(y_1) = H(y_2)$, with $y_1 \neq y_2$, show that $\mathcal{C}$ cannot separate the points $y_1$, $y_2$, which implies that $\{y_1\} \notin \mathcal{C}$.

(ii) Equip the range $\mathcal{R} := H(\mathcal{Y})$ (which might not be a Borel set) with the trace sigma-field $\mathcal{D} := \{B \cap \mathcal{R} : B \in \mathcal{B}(\mathbb{R})\}$. Show that the map $G = H^{-1}$ from $\mathcal{R}$ onto $\mathcal{Y}$ is $\mathcal{D}\backslash\mathcal{C}$-measurable. Hint: Show that $G^{-1}C_i$ is the set of all points in $\mathcal{R}$ whose base-3 expansions have a 1 in the $i$th place.

(iii) Show that $\mathcal{C} = \sigma(H)$.
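The encoding in Problem [7] is concrete enough to compute. The following toy sketch (added here, with a small hypothetical space $\mathcal{Y}$ and class $\mathcal{C}_0$, truncated to finitely many sets) shows why packing the membership pattern $\{y \in C_i\}$ into base-3 digits makes $H$ one-to-one, and how part (ii) recovers membership from the digits:

```python
# Toy, finite version of Problem [7]'s encoding. Each digit of H(y) in base 3
# is 0 or 1, so no carrying can mix digits, and distinct membership patterns
# give distinct values of H.
from fractions import Fraction

def H(y, C):
    """H(y) = sum_i 3^{-i} * 1{y in C_i}, for a finite list C = [C_1, C_2, ...]."""
    return sum(Fraction(1, 3**i) for i, Ci in enumerate(C, start=1) if y in Ci)

def digit(h, i):
    """The i-th base-3 digit of h (exact, via rational arithmetic)."""
    return int(h * 3**i) % 3

# Hypothetical space Y and a class C that separates its points.
Y = {0, 1, 2, 3}
C = [{0, 1}, {0, 2}, {0, 3}]

values = {y: H(y, C) for y in Y}
# H is one-to-one because C separates points: for y1 != y2 some C_i
# contains exactly one of them, so some base-3 digit differs.
assert len(set(values.values())) == len(Y)

# The idea behind (ii): digit i equals 1 exactly when y is in C_i.
for y in Y:
    for i, Ci in enumerate(C, start=1):
        assert digit(values[y], i) == (1 if y in Ci else 0)
print("injective on Y:", len(set(values.values())) == len(Y))
```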


[8] Suppose $T$ is a function from a set $X$ into a set $T$ equipped with a sigma-field $\mathcal{B}$. Suppose $S : X \to \mathcal{Y}$ is $\sigma(T)\backslash\mathcal{C}$-measurable, where $\mathcal{C}$ is a countably generated sigma-field on $\mathcal{Y}$ containing all the singleton sets. Show that there exists a $\mathcal{B}\backslash\mathcal{C}$-measurable map $\gamma : T \to \mathcal{Y}$ such that $S = \gamma \circ T$. Hint: Represent $H \circ S$, where $H$ is defined in Problem [7], as $g \circ T$. Show that $\gamma = H^{-1} \circ g$ has the desired property.
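The factorization in Problem [8] has a simple finite-set analogue (a sketch added here, not the book's proof): on a finite set, $\sigma(T)$-measurability of $S$ amounts to $S$ being constant on each fiber $\{x : Tx = t\}$, and then $\gamma$ can be read off fiber by fiber.

```python
# Finite-set sketch of the factorization S = gamma o T. The maps and the
# set X below are hypothetical examples, chosen so that S is constant on
# each fiber of T (the finite analogue of sigma(T)-measurability).

def factor_through(T_map, S_map, X):
    """Return a dict gamma with S_map(x) == gamma[T_map(x)] for all x in X,
    assuming S_map is constant on each fiber of T_map."""
    gamma = {}
    for x in X:
        t = T_map(x)
        if t in gamma:
            # S must agree with the value already recorded for this fiber.
            assert gamma[t] == S_map(x), "S is not sigma(T)-measurable"
        else:
            gamma[t] = S_map(x)
    return gamma

X = range(8)
T_map = lambda x: x % 4          # fibers {0,4}, {1,5}, {2,6}, {3,7}
S_map = lambda x: (x % 4) ** 2   # constant on each fiber of T
gamma = factor_through(T_map, S_map, X)
assert all(S_map(x) == gamma[T_map(x)] for x in X)
```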

[9] Let $X_0$ be a countable dense subset of a compact metric space $X$. Write $B(x, r)$ for the closed ball with radius $r$ and center $x$. Define $\mathcal{K}_1$ as the collection of all finite unions of balls $B(x, r)$ with $r$ rational and $x \in X_0$. Write $\mathcal{K}_0$ for the collection of all finite intersections of sets from $\mathcal{K}_1$.

(i) Show that $\mathcal{K}_0$ is a countable $(\emptyset, \cup_f, \cap_f)$-paving of compact subsets of $X$.

(ii) For each closed subset $F$ of $X$ show that there exists a decreasing sequence of sets $\{K_n\}$ from $\mathcal{K}_0$ for which $F = \cap_n K_n$. Hint: The union of the open balls $B(x, 1/n)$ with $x \in X_0 \cap F$ covers the compact $F$. Find a finite subcover. Consider the union of the corresponding closed balls.

(iii) For each pair of closed subsets $K_1, K_2$ in $\mathcal{K}_0$ for which $K_1 \supseteq K_2$ show that there exists an increasing sequence of sets $\{H_n\}$ from $\mathcal{K}_0$ for which $\cup_n H_n = K_1 \setminus K_2$. Hint: Cover the compact set $\{x \in K_1 : d(x, K_2) \ge 2/n\}$ by finitely many closed balls with radius $1/n$, then intersect their union with $K_1$.

4. Notes

The key idea underlying all proofs of existence of disintegrations and regular conditional distributions is that of compact approximation (existence of a class of approximating sets with properties analogous to the class of compact sets in a metric space) as a means for deducing countable additivity from finite additivity. Marczewski (1953) isolated the concept of a compact system. See Pfanzagl & Pierlo (1969) for an exposition. Bourbaki (1969, Note Historique) also gave credit to Ryll-Nardzewski for disintegration (no citation), perhaps in some point-process context. In point process theory disintegrations appear as Palm distributions, that is, conditional distributions given a point of the process at a particular position (Kallenberg 1969).

I learned about disintegration from Dellacherie & Meyer (1978, Section III.8). They proved a result close to Theorem <6> using results about analytic sets (essentially to deduce product-measurability of graph(T)). Parthasarathy (1967, Sections V.7 and V.8) cited notes of Varadarajan for his existence proof for a disintegration. The definitive disintegration theorem is due to Pachl (1978), who established that approximation by a compact system of sets is essential for a general disintegration theorem. See Hoffmann-Jørgensen (1994, Sections 10.26-10.30) for an exposition of Pachl's method. Hoffmann-Jørgensen (1971), following Ionescu Tulcea & Ionescu Tulcea (1969), used a lifting argument to eliminate separability assumptions for existence of conditional distributions.

See the Notes to Chapter 5 for further comments about conditional distributionsand disintegrations.


REFERENCES

Bourbaki, N. (1969), Intégration sur les espaces topologiques séparés, Éléments de mathématique, Hermann, Paris. Fascicule XXXV, Livre VI, Chapitre IX.

Dellacherie, C. & Meyer, P. A. (1978), Probabilities and Potential, North-Holland, Amsterdam.

Hoffmann-Jørgensen, J. (1971), 'Existence of conditional probabilities', Mathematica Scandinavica 28, 257-264.

Hoffmann-Jørgensen, J. (1994), Probability with a View toward Statistics, Vol. 2, Chapman and Hall, New York.

Ionescu Tulcea, A. I. & Ionescu Tulcea, C. T. (1969), Topics in the Theory of Lifting, Springer-Verlag.

Kallenberg, O. (1969), Random Measures, Akademie-Verlag, Berlin. US publisher: Academic Press.

Marczewski, E. (1953), 'On compact measures', Fundamenta Mathematicae, 113-124.

Pachl, J. (1978), 'Disintegration and compact measures', Mathematica Scandinavica 43, 157-168.

Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, New York.

Pfanzagl, J. & Pierlo, N. (1969), Compact Systems of Sets, Vol. 16 of Springer Lecture Notes in Mathematics, Springer-Verlag, New York.

Simmons, G. F. (1963), Introduction to Topology and Modern Analysis, McGraw-Hill.


Index

SYMBOLS

$\mathcal{A} \otimes \mathcal{B}$ product sigma-field, 82
$A \backslash B$ difference of sets, 7
$A \triangle B$ symmetric difference of sets, 7
$BL(X)$ bounded Lipschitz functions, 170
$\mathcal{B}(X)$ Borel sigma-field on $X$, 22
smoothness class of functions, 177
$\mathbb{C}$ complex plane
$C(T)$ continuous, real functions on $T$, 215
$D^+(x)$ right-hand derivative, 309
$\Delta(x) := 2(e^x - 1 - x)/x^2$, 265
$d\nu/d\mu$ density, 53
$D(P \| Q)$ relative entropy, 61
$\mathcal{E}_\delta$ ($\mathcal{E}_\sigma$) countable intersections (unions) of sets from a class $\mathcal{E}$, 20
$:=$ equality, by definition
$\underline{f}$, $\overline{f}$ semicontinuous envelopes, 173
$\mathcal{F}_\tau$ pre-$\tau$ sigma-field, 142
$H(P, Q)$ Hellinger distance, 61
$\Rightarrow$ NOT weak convergence, 171
$1_A$ indicator function of a set
$\langle \cdot, \cdot \rangle$ inner product, see Appendix B
$\mathcal{L}^1(\mu)$ functions integrable wrt $\mu$, 28
$L^1$ equivalence classes of $\mathcal{L}^1$, 35
$L^\infty$, 49
$\mathcal{L}^p$, $L^p$, 36
$L(x) := (2x \log\log x)^{1/2}$, 261
$\vee$ maximum (pointwise, of functions)
$\wedge$ minimum (pointwise, of functions)
$m$ Lebesgue measure (usually), 29
$M^+(X, \mathcal{A})$ cone of nonnegative, $\mathcal{A}$-measurable functions on a set $X$, 24
M-bdd bounded members of $M^+$, 59
$\mu^x \nu^y$ iterated integral, 84
$\mu f$ same as $\int f(x)\,\mu(dx)$, 27
$(\mu_1 - \mu_2)^+$, 73
$\mu_1 \wedge \mu_2$ minimum of measures, 61
$\mu * \nu$ convolution, 91
$\mu \otimes \nu$ product measure, 88
$\mu \otimes \Lambda$ product of measure and kernel, 86
$\mathbb{N}$ natural numbers $(1, 2, \ldots)$
$\mathbb{N}_0 := \{0\} \cup \mathbb{N}$
$\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$
$\overline{\mathbb{N}}_0 := \{0\} \cup \mathbb{N} \cup \{\infty\}$
$\nu \perp \mu$ mutual singularity, 57
$O_p(\cdot)$, $o_p(\cdot)$ stochastic order symbols, 183
$\phi(x)$ standard normal density, 317
$\bar{\Phi}(x) := P\{N(0, 1) > x\}$, 317
$P_n \rightsquigarrow P$ weak convergence, 171
$P(X \mid T = t)$ conditional expectation, 125
$P(X \mid \mathcal{G})$ conditional expectation, 126
c2, 264
$\overline{\mathbb{R}}$ extended real line, $\mathbb{R} \cup \{-\infty, \infty\}$
$\mathbb{R}$ real line
$\rho(x) := \bar{\Phi}(x)/\phi(x)$, 317
$\sigma(\mathcal{E})$ sigma-field generated by a class of sets $\mathcal{E}$, 19
$\sigma(\mathcal{H})$ sigma-field generated by a class of functions $\mathcal{H}$, 23
$T\mu$ image measure, 40
$2^{\aleph_0}$ a particular cardinality, 103
$W$ Wiener measure, 215
$X_n \rightsquigarrow P$ convergence in distribution, 171
$X \times \mathcal{Y}$ product space, 82

absolute continuity
  functions, 65
  measures, 55
adapted, 138
almost surely/everywhere, 33
Andersen, E. S. and Jessen, B., 109
Bayes theory, 119, 131
  consistent estimator, 155
bet red, 144
blocking, 95, 262
Boolean algebra, 7
Borel paradox, 122
Borel-Cantelli lemma, 34, 46, 102, 150
  Lévy extension, 152
  used for LIL, 263
Borell, C., 278
bounded variation, 67
branching process, 149
Bretagnolle, J., 258
Brownian
  Bridge, 252
  filtration, 213, 336
  isonormal process, 216
  Markov property, 220, 336
  modulus of continuity, 219


  motion, standard, 215
  motion, started at $x$, 213
bunyip, 149
Burkholder, D. L., 135
cadlag, 329
Cantor
  diagonalization, 185
  set, 55
Carathéodory splitting, 290
Carter, A., 328
central limit theorem, 169
  Lindeberg, 181
  martingale, 200
  multivariate, 182
  real random variables, 176
  second moments, iid, 180, 199
  third moments, 179, 244
characteristic function, see Fourier transform
CLT, see central limit theorem
cluster point, 271
complete
  $\mathcal{L}^1$, 48
  $\mathcal{L}^p$, 49
completion, see sigma-field
conditional
  density, 118
  distribution, 113-117
  expectation, as contingent bet, 13
contingent bet, 12
Continuous Mapping Theorem, 175
convergence
  almost surely/everywhere, 34
  in $\mathcal{L}^1$, 39, 154
  in distribution, 171
  in probability, 37
  relationship between modes, 38
  weak, 171
convolution, 91
  smoothing, 247
Cramér-Wold device, 202
Crofton's theorem, 116
Csörgő, M. and Révész, P., 255
Daniell, P. J., 6
  infinite products of measures, 100
de Finetti, B.
  exchangeability, 160
  measures as functionals, 10
  sets as indicators, 7
DeAcosta, A., 269
Dellacherie, C. and Meyer, P. A., 338
delta method, 184
density
  as derivative, 65
  of measure, 53
  uniqueness, 71
Desirable Frog Condition, 257
differentiation under integral, 32
disintegration, 117, 342
distribution, see also measure/image
  Binomial and beta, 322
  finite dimensional, 99
  function, 40, 82, 174
  joint, 90
  normal, see normal distribution
  regular conditional, 113
  uniform, 115
  uniform on sphere, 122
Dominated Convergence, 31, 35, 132, 180
  differentiation under integral sign, 32
  truncation arguments, 98
dominated measure, 53, 133
Doob, J. L.
  consistent Bayes estimator, 155
  infinite products of measures, 109
  martingales, 166
  martingales in continuous time, 336
Dudley, R. M.
  almost sure representation, 241
  coupling, 246
  empirical process, 171
  Strassen's theorem, 243
Dynkin class, see λ-system
Ehrhard, A., 280
EM-algorithm, 120
empirical measure, 134, 157, 249
envelope, 158
epigraph, 307
essential supremum, 49
Etemadi, N., 109
event, see measurable set; sigma-field
exchangeable, 159
expectation
  as fair price, 11
  as linear functional, 5, 10
  iterated conditional, 128
fair price, 11
Fatou's Lemma, 31, 132
Feller, W., 272
fidi, see finite dimensional distributions
field, 47, 289
filtration
  completion, 212
  decreasing, 156
  definition, 138
  natural, 213
  standard, 330
finite dimensional distributions, 99, 212
Finkelstein, H., 271
first hitting time, 142, 331
Fourier transform
  Brownian motion, 213
  Continuity Theorem, 199
  inversion formula, 276
  lattice distribution, 194
  uniqueness, 196
Fréchet, M., 92, 109
Fubini theorem, 89
Fundamental Theorem of Calculus, 65


Gaussian process
  centered, 212
  concentration of supremum, 280
  coupling, 247
  series expansion, 249
generating class
  functions, 43
  sets, 41
graph of a map, 117, 343, 344
Haar basis, 216, 250, 304
Hahn-Banach theorem, 314
Hartman, P. and Wintner, A., 268
Hellinger distance, 61, 105, 120, 132
Hewitt-Savage zero-one law, 160
Hilbert space, 226
Hoeffding, W., 272, 316
Hoffmann-Jørgensen, J., 171
Huygens, C., 6
iid (independent, identically distributed), 98
inclusion/exclusion method, 10
independent
  random variables, 81, 83
  sigma-fields, 80
indicator function of a set, 7
inequality
  Bennett, 264
  Bernstein, 265
  Binomial tail, 271
  Birnbaum-Marshall, 163
  Borell, 279
  Cauchy-Schwarz, 37, 302
  chi squared tail, 258
  concentration, 280
  DeAcosta, 269
  Doob, 163
  Dubins, 148
  Fernique, 275
  Fourier coupling, 251
  Gaussian tail, 279
  Hájek-Rényi, 146
  Hoeffding, 316
  Hölder, 48
  isoperimetric, 278
  Jensen, 29, 133, 140
  Kolmogorov, 145, 163, 266
  maximal, 50, 95, 109
  maximum of normals, 275, 286
  Minkowski, 48
  Poisson tail, 272
  Slepian, 287
  Sudakov, 275
  Tusnády, 249, 324
  upcrossing, 148, 164
inner measure, 290
inner product $\langle \cdot, \cdot \rangle$, see Appendix B, 36
integrable
  function, 28
  uniformly, 37, 154
integral
  as linear functional, 26
  by parts, 105
  iterated, 86
Ionescu Tulcea, C. T.
  infinite products of measures, 100
isonormal process, 216, 305
$\mathcal{K}_0$-regularity, 290
$\mathcal{K}_0$-tightness, 291
kernel, 84, 113
  coupling, 240
  sigma-finite, 87, 339
Khinchin, A. Ya., 272
Kim, J., 242
KMT
  = Komlós, J., Major, P., Tusnády, G., 249
  coupling, 252
Knuth, D. E., 9
Kolmogorov, A. N.
  Borel paradox, 122
  conditional expectation, 123
  Grundbegriffe, 1, 5
  infinite products of measures, 100
  LIL, 266
  maximal inequality, 145, 163
  SLLN, 78, 96
  zero-one law, 81
Koltchinskii, V. I., 258
Krickeberg decomposition, 151, 335
Kronecker's lemma, 105
Kuelbs, J., 273
Kullback-Leibler distance, 61
λ-system of sets, 42
λ-cone of functions, 44, 85
lattice
  cone, 296
  of subsets, 290
Lebesgue, H.
  Fubini theorem, 108
  Fundamental Theorem of Calculus, 65
  measure/integral, 22, 29, 295
  thesis, 4, 14
Le Cam, L., 135, 174
  weakly convergent subsequences, 185
  Yurinskii's theorem, 245
Lévy, P.
  Brownian modulus of continuity, 218
  Brownian motion as martingale, 223
  central limit theorem, 190
  martingales, 166, 200
  normal distribution, 205
lifting theorem, 340
LIL (law of the iterated logarithm), 261
Lindeberg, J. W., 176, 200, 245
linear functional
  as integral, 27
  integral representation, 297
  iterated, 84
  sigma-smooth, 100
  tight, 184


linear isometry, 305
Lipschitz condition, 170
LSC, see semicontinuity
Major, P., 246
Marczewski, E., 294
Marriage Lemma, 243, 256
martingale
  at stopping times, 333
  Brownian motion, 223
  closed on the right, 155
  continuous time, 224
  differences, 141
  quadratic variation/compensator, 201
Massart, P., 258
maximum likelihood, 79, 120
McLeish, D. L., 200
measurability, 23
  diagonal, 103
  failure for product measure, 94
  $\mathcal{F}_\tau$, 143
  first hitting time, 332
  stability properties, 24
measurable
  function, 22
  progressively, 331
  rectangle, 82
  set, 22, see also sigma-field
measure
  absolutely continuous, 55
  affinity, 60
  consistent family, 99
  convolution, 91
  dominated, 53
  $\mathcal{L}^1$ distance, 60
  finitely additive, 289
  image, 39
  infinite products, 99
  inner, 290
  Lebesgue decomposition, 56
  marginal, 84
  product, 88
  Radon, 290, 342
  sigma-finite, 54
  signed, 59
  singular, 57
  space, 17
  total variation distance, 59
  Wiener, 215
median, 279
Métivier, M., 330
Monotone Convergence, 26
μ-negligible, 33
negligible set, 33
Neyman-Pearson Lemma, 73
norm
  bounded Lipschitz, 170
  $\mathcal{L}^1$, 29
  $L^1$, 35
  $\mathcal{L}^p$, $L^p$, 36
  Orlicz, 50, 93
normal distribution
  convolution, 91
  correlated bivariate, 88
  Fourier transform, 195
  Lévy-Cramér theorem, 205
  maximum, 212
  multivariate, 202
  quantile coupling with Binomial, 320
  $\sqrt{2\pi}$, 89
  symmetric bivariate, 121
  tail bounds, 317
notation, 27
orthogonal, 301
Parseval's identity, 215, 303
partition of unity, 185
paving, 290
  compact, 294, 345
P-continuity set, 174
permutation, 159
Philipp, W., 246
π-λ theorem, 42
Portmanteau theorem, 174
predictable
  process, 227
  sequence, 141
probability space, 18
product space, 82
Prohorov, Yu. V.
  distance, 242
  weak convergence, 174
  weakly convergent subsequences, 185
projection
  conditional expectation, 128
  in Hilbert space, 302
Pythagoras, 132
quantile function, 41, 92, 238
  coupling, 249, 320
Radon measure, 290
Radon-Nikodym Theorem, 56, 340
random element, 170
recursion, 251
relative entropy, 61
relative interior, 313
Riesz-Fréchet Theorem, 303
rotational symmetry, 121
sample path, 212, 214
  cadlag, 334
  continuous, 223, 337
Scheffé's lemma, 57, 157, 198
semicontinuity, 172
separation of convex sets, 308
sets
  cylinder, 215
  inclusion/exclusion, 10
  indicator functions, 7
  inner/outer regular, 47
  limsup, 8
  measurable, 22
  negligible, 33


  paving, 290
  symmetric difference, 7
Shorack, G. R., 272
Sierpiński class, see λ-system
sigma-field, 18
  atoms, 20
  Borel, 20
  completion, 34
  conditional expectation, 126
  countably generated, 163
  event, 22
  fidi/cylinder, 215
  generated, 19
  independence, 80
  Lebesgue, 34
  pre-τ, 142
  product, 82
  symmetric, 158, 159
  tail, 81
  trace, 107
sigma-ring, 94
σ-smooth, 100, 292
simple function, 25
Skorohod, A. V., 258
SLLN, see strong law of large numbers
Slutsky's theorem, 175
Stirling's formula, 325
stochastic integral, 225
stochastic process, 212
  continuous time, 329
  version, 214, 330
stock price, 228
Stone's condition, 297
stopping time, 142
Stopping Time Lemma, 145, 332
Strassen, V., 242
strong law of large numbers, 78
  Etemadi, N., 106
  first moment, 97
  fourth moments, 79
  LIL, as rate, 262
  second moments, 96
  uniform, 158
sub/super martingale, 139
  as measures, 152
  convergence, 147, 151
  Doob-Meyer decomposition, 142
  reversed, 156
  uniform integrability, 333
τ-smooth, 300
Tchebychev's inequality, 10
tight, 184, 344
Tonelli theorem, 88
topology, countably generated, 103
Topsøe, F., 174, 300
total variation, 59
triangular array, 179
truncation, 98, 180, 191, 269
uniformly tight, 184
upcrossing, 147, 334
USC, see semicontinuity
usual conditions, 332
Varadarajan, V. S., 174
Vitali covering, 68
Wald, A. (maximum likelihood), 80
Walther, G., 203
weak convergence
  (Fourier) Continuity Theorem, 199
  almost sure representation, 239
  equivalences, 175
  multivariate, 182
Weierstrass approximation theorem, 51
Wellner, J. A., 272
Whittle, P., 6
Wichura, M. J., 258
Wiener, N., 6, 234
Yurinskii, V. V., 245

