+ All Categories
Home > Documents > Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Date post: 11-Sep-2021
Category:
Upload: others
View: 2 times
Download: 0 times
Share this document with a friend
265
Institute of Mathematical Statistics LECTURE NOTES–MONOGRAPH SERIES Volume 55 Asymptotics: Particles, Processes and Inverse Problems Festschrift for Piet Groeneboom Eric A. Cator, Geurt Jongbloed, Cor Kraaikamp, Hendrik P. Lopuha¨ a, Jon A. Wellner, Editors Institute of Mathematical Statistics Beachwood, Ohio, USA
Transcript
Page 1: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Institute of Mathematical Statistics

LECTURE NOTES–MONOGRAPH SERIES

Volume 55

Asymptotics: Particles, Processes andInverse ProblemsFestschrift for Piet Groeneboom

Eric A. Cator, Geurt Jongbloed, Cor Kraaikamp,Hendrik P. Lopuhaa, Jon A. Wellner, Editors

Institute of Mathematical StatisticsBeachwood, Ohio, USA

Page 2: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Institute of Mathematical StatisticsLecture Notes–Monograph Series

Series Editor:R. A. Vitale

The production of the Institute of Mathematical StatisticsLecture Notes–Monograph Series is managed by the

IMS Office: Jiayang Sun, Treasurer andElyse Gustafson, Executive Director.

Library of Congress Control Number: 2007927089

International Standard Book Number (13): 978-0-940600-71-3

International Standard Book Number (10): 0-940600-71-4

International Standard Serial Number: 0749-2170

Copyright c© 2007 Institute of Mathematical Statistics

All rights reserved

Printed in Lithuania

User
Note
Asymptotics: Particles, Processes and Inverse Problems: Festschrift for Piet Groeneboom Editor: Eric A. Cator Editor: Geurt Jongbloed Editor: Cor Kraaikamp Editor: Hendrik P. Lopuhaa Editor: Jon A. Wellner Lecture Notes--Monograph Series, Volume 55 Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2007. 252 pp. Abstract: In September 2006, Piet Groeneboom officially retired as professor of statistics at Delft University of Technology and the Vrije Universiteit in Amsterdam. He did so by delivering his farewell lecture `Summa Cogitatio' (to appear in Nieuw Archief voor Wiskunde, 2007) in the Aula of the university in Delft. To celebrate Piet's impressive contributions to statistics and probability, the workshop `Asymptotics: particles, processes and inverse problems' was held from July 10 until July 14, 2006, at the Lorentz Center in Leiden. Many leading researchers in the fields of probability and statistics gave talks at this workshop, and it became a memorable event for all who attended, including the organizers and Piet himself. This volume serves as a Festschrift for Piet Groeneboom. It contains papers that were presented at the workshop as well as some other contributions, and it represents the state of the art in the areas in statistics and probability where Piet has been (and still is) most active. Furthermore, a short CV of Piet Groeneboom and a list of his publications are included. Permanent link to this monograph: http://projecteuclid.org/euclid.lnms/1196797058 ISBN:978-0-940600-71-3 ISBN:0-940600-71-4 Copyright © 2007, Institute of Mathematical Statistics.
Page 3: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Contents

PrefaceEric Cator, Geurt Jongbloed, Cor Kraaikamp, Rik Lopuhaa and Jon Wellner . . . . . v

Curriculum Vitae of Piet Groeneboom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of publications of Piet Groeneboom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

List of Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

A Kiefer–Wolfowitz theorem for convex densitiesFadoua Balabdaoui and Jon A. Wellner . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Model selection for Poisson processesLucien Birge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Scale space consistency of piecewise constant least squares estimators – anotherlook at the regressogram

Leif Boysen, Volkmar Liebscher, Axel Munk and Olaf Wittich . . . . . . . . . . . . . . 65

Confidence bands for convex median curves using sign-testsLutz Dumbgen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Marshall’s lemma for convex density estimationLutz Dumbgen, Kaspar Rufibach and Jon A. Wellner . . . . . . . . . . . . . . . . . . . 101

Escape of mass in zero-range processes with random ratesPablo A. Ferrari and Valentin V. Sisko . . . . . . . . . . . . . . . . . . . . . . . . . . 108

On non-asymptotic bounds for estimation in generalized linear modelswith highly correlated design

Sara A. van de Geer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Better Bell inequalities (passion at a distance)Richard D. Gill . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Asymptotic oracle properties of SCAD-penalized least squares estimatorsJian Huang and Huiliang Xie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

Critical scaling of stochastic epidemic modelsSteven P. Lalley . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

Additive isotone regressionEnno Mammen and Kyusang Yu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

A note on Talagrand’s convex hull concentration inequalityDavid Pollard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

A growth model in multiple dimensions and the height of a randompartial order

Timo Seppalainen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

Empirical processes indexed by estimated functionsAad W. van der Vaart and Jon A. Wellner . . . . . . . . . . . . . . . . . . . . . . . . 234

iii

Page 4: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom
Page 5: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Preface

In September 2006, Piet Groeneboom officially retired as professor of statistics atDelft University of Technology and the Vrije Universiteit in Amsterdam. He did soby delivering his farewell lecture ‘Summa Cogitatio’ ([42] in Piet’s publication list)in the Aula of the university in Delft. To celebrate Piet’s impressive contributionsto statistics and probability, the workshop ‘Asymptotics: particles, processes andinverse problems’ was held from July 10 until July 14, 2006, at the Lorentz Centerin Leiden. Many leading researchers in the fields of probability and statistics gavetalks at this workshop, and it became a memorable event for all who attended,including the organizers and Piet himself.

This volume serves as a Festschrift for Piet Groeneboom. It contains papers thatwere presented at the workshop as well as some other contributions, and it repre-sents the state of the art in the areas in statistics and probability where Piet hasbeen (and still is) most active. Furthermore, a short CV of Piet Groeneboom anda list of his publications are included.

Eric CatorGeurt JongbloedCor KraaikampRik LopuhaaDelft Institute of Applied MathematicsFaculty of Electrical Engineering,Mathematics and Computer ScienceDelft University of TechnologyThe Netherlands

Jon WellnerDepartment of StatisticsUniversity of Washington, SeattleUSA

v

Page 6: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Piet in a characteristic pose. Amsterdam, 2003.

Page 7: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Curriculum Vitae of Piet GroeneboomBorn: September 24, 1941, The HagueCitizenship: The NetherlandsDissertation: Large deviations and asymptotic efficiencies

1979, Vrije Universiteit, Amsterdam.supervisor: J. Oosterhoff.

Professional Career:

Mathematical Centre (MC) Researcher and consultant 1973–1984University of Washington, Seattle Visiting assistant professor 1979–1981University of Amsterdam Professor of statistics 1984–1988Delft University of Technology Professor of statistics 1988–2006Stanford University Visiting professor 1990Universite Paris VI Visiting professor 1994University of Washington, Seattle Visiting professor 1998,

1999, 2006University of Washington, Seattle Affiliate professor 1999–Vrije Universiteit Amsterdam Professor of statistics 2000–2006Institut Henri Poincare, Paris Visiting professor 2001

Miscellanea:

Rollo Davidson prize 1985, Cambridge UK.

Fellow of the IMS and elected member of ISI.

Visitor at MSRI, Berkeley, 1983 and 1991.

Three times associate editor of The Annals of Statistics.

Invited organizer of a DMV (Deutsche Mathematiker Vereinigung) seminarin Gunzburg, Germany, 1990.

Invited lecturer at the Ecole d’Ete de Probabilites de Saint-Flour, 1994.

vii

Page 8: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Publications of Piet GroeneboomApril 2007

1. Rank tests for independence with best strong exact Bahadur slope (with Y.Lepage and F.H. Ruymgaart), Zeitschrift fur Wahrscheinlichkeitstheorie undVerwandte Gebiete 36 (1976), 119–127.

2. Bahadur efficiency and probabilities of large deviations (with J. Oosterhoff),Statist. Neerlandica 31 (1977), 1–24.

3. Relevant variables in the advices of elementary school teachers on further edu-cation; an analysis of correlational structures (in Dutch, with J. Hoogstraten,G.J. Mellenbergh and J.P.H. van Santen), Tijdschrift voor Onderwijsresearch(Journal for Educational Research) 3 (1978), 262–280.

4. Large deviation theorems for empirical probability measures (with J. Ooster-hoff and F.H. Ruymgaart), Ann. Probability 7 (1979), 553–586.

5. Large deviations and asymptotic efficiencies, Mathematical Centre Tract 118(1980), Mathematical Centre, Amsterdam

6. Large deviations of goodness of fit statistics and linear combinations of orderstatistics (with G.R. Shorack), Ann. Probability 9 (1981), 971–987.

7. Bahadur efficiency and small-sample efficiency (with J. Oosterhoff), Int. Sta-tist. Rev. 49 (1981), 127–141.

8. The concave majorant of Brownian motion, Ann. Probability 11 (1983), 1016–1027.

9. Asymptotic normality of statistics based on convex minorants of empiricaldistribution functions (with R. Pyke), Ann. Probability 11 (1983), 328–345.

10. Estimating a monotone density, in Proceedings of the Conference in honor ofJerzy Neyman and Jack Kiefer, Vol. II (Eds. L.M. Le Cam and R.A. Olshen),Wadsworth, Inc, Belmont, California (1985), 539–555.

11. Some current developments in density estimation, in Mathematics and Com-puter Science, CWI Monograph 1 (Eds. J.W. de Bakker, M. Hazewinkel,J.K. Lenstra), Elsevier, Amsterdam (1986), 163–192.

12. Asymptotics for incomplete censored observations, Mathematical Institute,University of Amsterdam (1987), Report 87-18.

13. Limit theorems for convex hulls, Probab. Theory Related Fields 79 (1988),327–368.

14. Brownian motion with a parabolic drift and Airy functions, Probab. TheoryRelated Fields 81 (1989), 79–109.

15. Discussion on “Age-specific incidence and prevalence, a statistical perspec-tive”, by Niels Keiding in the J. Roy. Statist. Soc. Ser. A. 154 (1991), 371–412.

16. Information bounds and nonparametric maximum likelihood estimation (withJ.A. Wellner), Birkhauser Verlag (1992).

17. Discussion on “Empirical functional and efficient smoothing parameter se-lection” by P. Hall and I. Johnstone in the J. Roy. Statist. Soc. Ser. B. 54(1992), 475–530.

18. Isotonic estimators of monotone densities and distribution functions: basicfacts (with H.P. Lopuhaa), Statist. Neerlandica 47 (1993), 175–183.

19. Flow of the Rhine river near Lobith (in Dutch: “Afvoertoppen bij Lobith”), inToetsing uitgangspunten rivierdijkversterkingen, Deelrapport 2: Maatgevendebelastingen (1993), Ministerie van Verkeer en Waterstaat.

20. Limit theorems for functionals of convex hulls (with A.J. Cabo), Probab. The-ory Related Fields 100 (1994), 31–55.

viii

Page 9: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

ix

21. Nonparametric estimators for interval censoring, in Analysis of Censored Data(Eds. H. L. Koul and J. V. Deshpande), IMS Lecture Notes-Monograph Series27 (1995), 105–128.

22. Isotonic estimation and rates of convergence in Wicksell’s problem (with G.Jongbloed), Ann. Statist. 23 (1995), 1518–1542.

23. Computer assisted statistics education at Delft University of Technology,(with de P. Jong, D. Tischenko and B. van Zomeren), J. Comput. Graph.Statist. 5 (1996), 386–399.

24. Asymptotically optimal estimation of smooth functionals for interval censor-ing, part 1 (with R.B. Geskus), Statist. Neerlandica 50 (1996), 69–88.

25. Lectures on inverse problems, in Lectures On Probability and Statistics. Ecoled’Ete de de Probabilites de Saint-Flour XXIV (Ed. P. Bernard), LectureNotes in Mathematics 1648 (1996), 67–164. Springer Verlag, Berlin.

26. Asymptotically optimal estimation of smooth functionals for interval censor-ing, part 2 (with R.B. Geskus), Statist. Neerlandica 51 (1997), 201–219.

27. Extreme Value Analysis of North Sea Storm Severity (with C. Elsinghorst, P.Jonathan, L. Smulders and P.H. Taylor), Journal of Offshore Mechanics andArctic Engineering 120 (1998), 177–184.

28. Asymptotically optimal estimation of smooth functionals for interval censor-ing, case 2 (with R.B. Geskus), Ann. Statist. 27 (1999), 627–674.

29. Asymptotic normality of the L1-error of the Grenander estimator (with H.P.Lopuhaa and G. Hooghiemstra), Ann. Statist. 27 (1999), 1316–1347.

30. Integrated Brownian motion conditioned to be positive (with G. Jongbloedand J.A. Wellner), Ann. Probability 27 (1999), 1283–1303.

31. A monotonicity property of the power function of multivariate tests (withD.R. Truax), Indag. Math. 11 (2000), 209–218.

32. Computing Chernoff’s distribution (with J.A. Wellner), J. Comput. Graph.Statist. 10 (2001), 388–400.

33. A canonical process for estimation of convex functions: the “invelope” of in-tegrated Brownian motion + t4 (with G. Jongbloed and J.A. Wellner), Ann.Statist. 29 (2001), 1620–1652.

34. Estimation of convex functions: characterizations and asymptotic theory (withG. Jongbloed and J.A. Wellner), Ann. Statist. 29 (2001), 1653–1698.

35. Ulam’s problem and Hammersley’s process, Ann. Probability 29 (2001), 683–690.

36. Hydrodynamical methods for analyzing longest increasing subsequences, J.Comput. Appl. Math. 142 (2002), 83–105.

37. Kernel-type estimators for the extreme value index (with H.P. Lopuhaa andP.-P. de Wolf), Ann. Statist. 31 (2003), 1956–1995.

38. Density estimation in the uniform deconvolution model (with G. Jongbloed),Statist. Neerlandica 57 (2003), 136–157.

39. Hammersley’s process with sources and sinks (with E.A. Cator), Ann. Prob-ability 33 (2005), 879–903.

40. Second class particles and cube root asymptotics for Hammersley’s process(with E.A. Cator), Ann. Probability 34 (2006), 1273–1295.

41. Estimating the upper support point in deconvolution (with L.P. Aarts andG. Jongbloed). To appear in the Scandinavian journal of Statistics, 2007.

42. Summa Cogitatio. To appear in Nieuw Archief voor Wiskunde (magazine ofthe Royal Dutch Mathematical Association) (2007).

43. Convex hulls of uniform samples from a convex polygon, Conditionally ac-cepted for publication in Probability Theory and Related Fields.

Page 10: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

x

44. Current status data with competing risks: Consistency and rates of conver-gence of the MLE (with M.H. Maathuis and J.A. Wellner). To appear in Ann.Statist. (2007).

45. Current status data with competing risks: Limiting distribution of the MLE(with M.H. Maathuis and J.A. Wellner). To appear in Ann. Statist. (2007).

46. The support reduction algorithm for computing nonparametric function esti-mates in mixture models (with G. Jongbloed and J.A. Wellner). Submitted.

Page 11: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Contributors to this volume

Balabdaoui, F., Universite Paris-DauphineBirge, L., Universite Paris VIBoysen, L., Universitat Gottingen

Dumbgen, L., University of Bern

Ferrari, P. A., Universidade de Sao Paulo

van de Geer, S. A., ETH ZurichGill, R. D., Leiden University

Huang, J., University of Iowa

Lalley, S. P., University of ChicagoLiebscher, V., Universitat Greifswald

Mammen, E., Universitat MannheimMunk, A., Universitat Gottingen

Pollard, D., Yale University

Rufibach, K., University of Bern

Seppalainen, T., University of Wisconsin-MadisonSisko, V. V., Universidade de Sao Paulo

van der Vaart, A. W., Vrije Universiteit Amsterdam

Wellner, J. A., University of WashingtonWittich, O., Technische Universiteit Eindhoven

Xie, H., University of Iowa

Yu, K., Universitat Mannheim

xi

Page 12: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom
Page 13: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

arX

iv:m

ath/

0701

179v

2 [

mat

h.ST

] 7

Sep

200

7

IMS Lecture Notes–Monograph Series

Asymptotics: Particles, Processes and Inverse Problems

Vol. 55 (2007) 1–31c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000256

A Kiefer–Wolfowitz theorem for convex

densities

Fadoua Balabdaoui1 and Jon A. Wellner2,∗Universite Paris-Dauphine and University of Washington

Abstract: Kiefer and Wolfowitz [Z. Wahrsch. Verw. Gebiete 34 (1976) 73–85] showed that if F is a strictly curved concave distribution function (cor-responding to a strictly monotone density f), then the Maximum Likelihood

Estimator Fn, which is, in fact, the least concave majorant of the empiricaldistribution function Fn, differs from the empirical distribution function inthe uniform norm by no more than a constant times (n−1 log n)2/3 almostsurely. We review their result and give an updated version of their proof. Weprove a comparable theorem for the class of distribution functions F with con-

vex decreasing densities f , but with the maximum likelihood estimator Fn

of F replaced by the least squares estimator Fn: if X1, . . . , Xn are sampledfrom a distribution function F with strictly convex density f , then the least

squares estimator Fn of F and the empirical distribution function Fn differin the uniform norm by no more than a constant times (n−1 log n)3/5 almostsurely. The proofs rely on bounds on the interpolation error for complete splineinterpolation due to Hall [J. Approximation Theory 1 (1968) 209–218], Halland Meyer [J. Approximation Theory 16 (1976) 105–122], building on earlierwork by Birkhoff and de Boor [J. Math. Mech. 13 (1964) 827–835]. These re-sults, which are crucial for the developments here, are all nicely summarizedand exposited in de Boor [A Practical Guide to Splines (2001) Springer, NewYork].

1. Introduction: The Monotone Case

Suppose that X1, . . . , Xn are i.i.d. with monotone decreasing density f on (0,∞).

Then the maximum likelihood estimator fn of f is the well-known Grenander es-timator: i.e. the left-derivative of the least concave majorant Fn of the empiricaldistribution function Fn.

In the context of estimating a decreasing density f so that the correspondingdistribution function F is concave, Marshall [17] showed that Fn satisfies ‖Fn−F‖ ≤‖Fn − F‖ so that we automatically have

√n‖Fn − F‖ ≤ √

n‖Fn − F‖ = Op(1).Kiefer and Wolfowitz [14] sharpened this by proving the following theorem understrict monotonicity of f (and consequent strict concavity of F ). Let α1(F ) = inf{t :F (t) = 1}, and write ‖g‖ = sup0≤t≤α1(F ) |g(t)|.Theorem 1.1 (Kiefer–Wolfowitz [14]). If α1(F ) < ∞,

β1(F ) ≡ inf0<t<α1(F )

(−f ′(t)/f2(t)) > 0,

∗Supported in part by NSF Grant DMS-05-03822 and by NI-AID Grant 2R01 AI291968-04.1Centre de Recherche, en Mathematiques de la Decision, Universite Paris-Dauphine, Paris,

France, e-mail: [email protected] of Washington, Department of Statistics, Box 354322, Seattle, Washington 98195-

4322, USA, e-mail: [email protected] 2000 subject classifications: Primary 62G10, 62G20; secondary 62G30.Keywords and phrases: Brownian bridge, convex density, distance, empirical distribution, in-

velope process, monotone density, optimality theory, shape constraints.

1

Page 14: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

2 Balabdaoui and Wellner

γ1(F ) ≡ sup0<t<α1(F )(−f ′(t))/ inf0<t<α1(F ) f2(t)) < ∞, and f ′ is continuous on[0, α1(F )], then

‖Fn − Fn‖ = O((n−1 log n)2/3) almost surely.(1)

Although Kiefer and Wolfowitz did not formulate their result in this way, thestatement above follows from their proof. Also note that (1) implies that

√n‖Fn − Fn‖ = O(n−1/6(log n)2/3) → 0

almost surely, so that the MLE Fn and the empirical distribution are asymptoticallyequivalent under the hypotheses of Theorem 1.

Kiefer and Wolfowitz [14] used Theorem 1.1 to show that the MLE Fn of Fin the class of concave distributions is an asymptotically minimax estimator of F .(Also see Kiefer and Wolfowitz [15] for a generalization of the results of Kiefer andWolfowitz [14] to allow somewhat weaker conditions.)

It follows from the rather general theorem of Millar [18] that the empirical dis-tribution function Fn remains asymptotically minimax in a wide range of problemsinvolving shape- constrained families of d.f.’s F . In particular, for the classes Fk ofdistribution functions corresponding to k-monotone densities, it follows from Millar[18] that the empirical distribution function Fn is asymptotically minimax for esti-mation of F even in the smaller classes Fk. The interesting question which has notbeen addressed concerns asymptotic minimaxity of the MLEs within these classes.Our goal in this paper is to make some headway toward answering these questionsby giving a partial (and imperfect) analogue of Theorem 1.1 in the case of F2, theclass of distribution functions corresponding to the class of decreasing and convexdensities. The MLE and least squares estimators of a density f corresponding toF ∈ F2 have been studied by Groeneboom, Jongbloed and Wellner [11], and thoseresults will provide an important starting point here.

In fact, we will not study the MLE, but its natural surrogate, the least squaresestimator. This is because of the lack of a complete analogue of Marshall’s lemmafor the MLE in the convex case, while we do have such analogues for the leastsquares estimator; see Dumbgen, Rufibach and Wellner [7] and Balabdaoui andRufibach [1].

One view of the Kiefer–Wolfowitz Theorem 1.1 is that it is driven by the (familyof) corresponding local results, as follows:

Theorem 1.2 (Local process convergence, monotone case). Suppose that t0 ∈(0,∞) is fixed with f(t0) > 0 and f ′(t0) < 0, and f and f ′ continuous in a neigh-borhood of t0. Then

n2/3(Fn(t0 + n−1/3t) − Fn(t0 + n−1/3t))

⇒ Cb,c(t) − Y1(t)d=

(2f2(t0)

−f ′(t0)

)1/3

{C(at) − (W (at) − a2t2)}(2)

in (D[−K, K], ‖ · ‖) for every K > 0 where

Y1(t) ≡√

f(t0)W (t) + (1/2)f ′(t0)t2 ≡ bW (t) − ct2

for W a standard two-sided Brownian motion process starting from 0, Cb,c is theLeast Concave Majorant of Y1, C ≡ C1,1 is the least concave majorant of W (t)− t2,

and a ≡([f ′(t0)]

2/(4f(t0)))1/3

.

Page 15: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 3

The (one-dimensional) special case of (2) with t = 0 is due to Wang [26], whilethe complete result is given by Kulikov and Lopuhaa [16].

Here the logarithmic term on the right side of (1) reflects the cost of transferringthe family of (in distribution) local result to an (almost sure) global result. Hereis a heuristic proof of (2); for the complete proof, see Kulikov and Lopuhaa [16].For a similar result in the context of monotone regression, see Durot and Tocquet[8], and for a similar theorem in the context of the Wicksell problem studied byGroeneboom and Jongbloed [9], see Wang and Woodroofe [25]. For a related resultin the context of estimation of an increasing failure rate, see Wang [24].

Proof of Theorem 1.2. We rewrite the left side of (2) as

n2/3{Fn(t0 + n−1/3t) − Fn(t0 + n−1/3t)}= n2/3{Fn(t0 + n−1/3t) − F (t0) − n−1/3f(t0)t}(3)

− n2/3{Fn(t0 + n−1/3t)) − Fn(t0) − n−1/3f(t0)t}+ n2/3{Fn(t0) − Fn(t0) − (Fn(τ−

0 ) − Fn(τ−0 ))}

− n2/3{Fn(t0) − F (t0)}

where τ−0 is the first point of touch of Fn and Fn to the left of t0. From known local

theory for Fn and Fn it follows easily that

n2/3{Fn(t0 + n−1/3t)) − Fn(t0) − n−1/3f(t0)t}

⇒√

f(t0)W (t) +1

2f ′(t0)t

2 ≡ Y1(t),(4)

n2/3{Fn(t0 + n−1/3t) − F (t0) − n−1/3f(t0)t} ⇒ Cb,c(t)(5)

and

(6) n2/3{Fn(t0) − F (t0)} ⇒ Cb,c(0)

where Cb,c is the least concave majorant of Y1. It remains to handle the third term.

But since Fn(t0) − Fn(τ−0 ) = fn(t0)(t0 − τ−

0 ) by linearity of Fn on (τ−0 , τ+

0 ),

n2/3{Fn(t0) − Fn(t0) − (Fn(τ−0 ) − Fn(τ−

0 )}= −n2/3(Fn(t0) − Fn(τ−

0 ) − fn(t0)(t0 − τ−0 ))

= −n2/3(Fn(t0) − Fn(τ−0 ) − f(t0)(t0 − τ−

0 ))

+ n2/3(fn(t0) − f(t0))(t0 − τ−0 )

= n2/3{Fn(t0 + n−1/3n1/3(τ−0 − t0)) − Fn(t0)

− f(t0)n−1/3n1/3(τ−

0 − t0)}− n1/3(fn(t0) − f(t0))n

1/3(τ−0 − t0)

→d Y1(τ−) − C(1)b,c (0)τ− = Y1(τ−) − {Cb,c(0) + C

(1)b,c (0)τ−} + Cb,c(0)

= Y1(τ−) − Cb,c(τ−) + Cb,c(0) = Cb,c(0)(7)

where τ− is the first point of touch of Y1 and Cb,c to the left of 0, and henceCb,c(τ−) = Y1(τ−). Combining (4), (5), (6) and (7) with (3) it follows that

n2/3{Fn(t0 + n−1/3t) − Fn(t0 + n−1/3t)} ⇒ Cb,c(t) − Y1(t)

in (D[−K, K], ‖ · ‖) for each fixed K > 0.

Page 16: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

4 Balabdaoui and Wellner

2. The convex case

Now suppose that X1, . . . , Xn are i.i.d. with monotone decreasing and convex den-sity f on (0,∞). Then the maximum likelihood estimator fn of f is a piecewiselinear, continuous and convex function with at most one change of slope betweenthe order statistics of the data, and, as shown by Groeneboom, Jongbloed andWellner [11], is characterized by

Hn(x, fn)

{≤ 1, x ≥ 0

= 1, if f ′n(x−) < f ′

n(x+).

where, with K being the class of convex and decreasing and nonnegative functionson [0,∞),

Hn(x, f) =

[0,x]

2(x − y)/x2

f(y)dFn(y), (x, f) ∈ R

+ ×K.

As shown by Groeneboom, Jongbloed and Wellner [11], the least squares estimator

fn of f is also a piecewise linear, continuous, and convex function with at most onechange of slope between the order statistics, but is characterized by

Hn(x)

{≥ Yn(x), x ≥ 0,

= Yn(x), if f ′n(x−) < f ′

n(x+).

where Hn(x) =∫ x

0

∫ y

0 fn(u)dudy ≡∫ x

0 F (y)dy and Yn(x) =∫ x

0 Fn(y)dy. The

corresponding estimators Fn of F and Y are given by Fn(x) =∫ x

0fn(y)dy and

Hn(x) =∫ x

0 Fn(y)dy respectively. Since pointwise limit theory for both the MLEand the least squares estimators of f are available from Groeneboom, Jongbloedand Wellner [11], we begin by formulating a (family of) local convergence theoremsanalogous to Theorem 1.2 in the monotone case. These will serve as a guide informulating appropriate hypotheses in the context of our global theorem.

Theorem 2.1 (Local process convergence, convex case). If f(t0) > 0,f ′′(t0) > 0, and f(t) and f ′′(t) are continuous in a neighborhood of t0, then for

(Fn, Hn) = (Fn, Hn) or for (Fn, Hn) = (Fn, Hn),

(n3/5(Fn(t0 + n−1/5t) − Fn(t0 + n−1/5t))n4/5(Hn(t0 + n−1/5t) − Yn(t0 + n−1/5t))

)

⇒(

H(1)2 (t) − Y

(1)2 (t)

H2(t) − Y2(t)

)(8)

d=

(24

f(t0)3

f ′′(t0)

)1/5

(H(1)2,s(at) − Y

(1)2,s(at))

(243 f(t0)

4

f ′′(t0)3

)1/5

(H2,s(at) − Y2,s(at))

in (D[−K, K], ‖ · ‖) for every K > 0 where

Y2(t) ≡√

f(t0)

∫ t

0

W (s)ds +1

24f ′′(t0)t

4

Page 17: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 5

and H2 is the “invelope” process corresponding to Y2: i.e. H2 satisfies: (a) H2(t) ≥Y2(t) for all t; (b)

∫∞

−∞(H2 − Y2)dH

(3)2 = 0; and (c) H

(2)2 is convex. Here

a =

(f ′′(t0)

2

242f(t0)

)1/5

,

and H2,s, Y2,s denote the “standard” versions of H2 and Y2 with coefficients 1: i.e.

Y2,s(t) =∫ t

0 W (s)ds + t4.

Note that β2(F ) ≡ inf0<t<α1(F )(f′′(t)/f3(t)) is invariant under scale changes of

F , while δ2(F ) ≡ sup0<t<α1(F )(f′′(t)2/f(t))1/5 is equivariant under scale changes

of F ; i.e. δ2(F (c·)) = cδ2(F ).

Proof. Here is a sketch of the proof of the convergence in the first coordinate of(8). We write

n3/5(Fn(t0 + n−1/5t) − Fn(t0 + n−1/5t))

= n3/5(Fn(t0 + n−1/5t) − F (t0) − n−1/5 1

6f(t0)t

3)

− n3/5(Fn(t0 + n−1/5t) − Fn(t0) − n−1/5 1

6f(t0)t

3)

+ n3/5(Fn(t0) − Fn(t0) − (Fn(τ−0 ) − Fn(τ−

0 ))

− n3/5(Fn(t0) − F (t0)).

Here

n3/5

(Fn(t0 + n−1/5t) − F (t0) − n−1/5 1

6f(t0)t

3

)⇒ H

(1)2 (t),

n3/5

(Fn(t0 + n−1/5t) − Fn(t0) − n−1/5 1

6f(t0)t

3

)⇒ Y

(1)2 (t),

n3/5(Fn(t0) − F (t0)) ⇒ H(1)2 (0),

while

n3/5(Fn(t0) − Fn(t0) − (Fn(τ−0 ) − Fn(τ−

0 ))

= n3/5(Fn(t0 + n−1/5n1/5(τ−

0 − t0)) − Fn(t0)

− n−1/5 1

6f(t0)(n

1/5(τ−0 − t0))

3

)

− n3/5(Fn(t0 + n−1/5n1/5(τ−

0 − t0)) − F (t0)

− n−1/5 1

6f(t0)(n

1/5(τ−0 − t0))

3

)

+ n3/5(Fn(t0) − F (t0))

→d Y(1)2 (τ−) − H

(1)2 (τ−) + H

(1)2 (0) = H

(1)2 (0)

since Y(1)2 (τ−) = H

(1)2 (τ−). Combining the pieces yields the claim.

The proof for the second coordinate is similar.

Now we can formulate our main result. Fix τ < α1(F ). Our hypotheses are asfollows:

Page 18: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

6 Balabdaoui and Wellner

R1. F has continuous third derivative F (3)(t) = f ′′(t) > 0 for t ∈ [0, τ ] andβ2(F, τ) ≡ inf0<t<τ (f ′′(t)/f3(t)) > 0

R2. γ1(F, τ) ≡ sup0<t<τ (−f ′(t)/f2(t)) < ∞.R3. γ2(F, τ) ≡ sup0<t<τ f ′′(t)/ inf0<t<τ f3(t) < ∞.R4. R ≡ max{1, sup0<t<τ f(t)}/ inf0<t<τ f(t) = max{1, f(0)}/f(τ) < ∞.

In the rest of the paper we fix τ ∈ (0, α1(F )) such that R1–R4 hold, and let‖h‖ ≡ sup0≤t≤τ |h(t)|, the supremum norm of the real-valued function h on [0, τ ].

Theorem 2.2. Suppose that R1–R4 hold. Then

‖Fn − Fn‖ ≡ sup0≤t≤τ

|Fn(t) − Fn(t)| = O((n−1 log n)3/5),(9)

‖Hn − Yn‖ ≡ sup0≤t≤τ

|Hn(t) − Yn(t)| = O((n−1 log n)4/5),(10)

almost surely.

Note that (9) and (10) imply that

n1/2‖Fn − Fn‖ = O(n−1/10(log n)3/5),(11)

n1/2‖Hn − Yn‖ = O(n−3/10(log n)4/5),(12)

almost surely.To prepare for the proof of Theorem 2.2, fix 0 < τ < α1(F ) for which the

hypotheses of Theorem 2.2 hold. For an integer k ≥ 2 define a(k)j ≡ aj ≡ F−1((j/

k)F (τ)) for j = 1, . . . , k, and set a(k)0 ≡ a0 ≡ α0(F ) ≡ sup{x : F (x) = 0}. Note

that a(k)k = F−1(F (τ)) = τ for all k ≥ 2. We will often simply write aj for a

(k)j , but

the dependence of the knots {aj} on k (and the choice of k depending on n) will becrucial for our proofs. We also set ∆ja = aj −aj−1, and write |a| = max1≤j≤k ∆ja.

Let Hn,k be the complete cubic spline interpolant of Yn with knot points given by{aj , j = 0, . . . , k}. Thus Hn,k is piecewise cubic on [aj−1, aj ], j = 1, . . . , k with two

continuous derivatives H(1)n,k and H

(2)n,k; see de Boor [5], pages 39–43 and 51–56. We

will choose k = kn ∼ (Cn/ log n)1/5 → ∞ in our arguments. H(2)n,kn

is not necessarilyconvex, but we will show that it becomes convex on [0, τ ] with high probability asn → ∞, and hence Hn,kn

will play a role analogous to the role played by the linearinterpolation of Fn in the proofs of Kiefer and Wolfowitz [14]. (We will frequentlysuppress the dependence of k = kn on n, and write simply k for kn.)

Let Y be defined by Y (t) ≡∫ t

0 F (s)ds; thus Y (1) = F , Y (j) = f (j−2), forj ∈ {2, 3, 4}. We will also need the complete cubic spline interpolant Hkn

of Y ; thiswill play the role of the linear interpolant L = L(k) of F in Kiefer and Wolfowitz[14].

The cubic spline interpolant Hn,k of Yn based on the knot points {a(k)j , j =

0, . . . , k} is completely determined on [0, τ ] by the values of Yn at the knots aj ,

j = 1, . . . , k together with the values of Y(1)n = Fn at 0 and ak = τ , namely Yn(aj),

j = 1, . . . , J , Y(1)n (0) = Fn(0) = 0, and Y

(1)n (τ); see, e.g., de Boor [5], page 43. As de

Boor nicely explains in his Chapter IV, the complete cubic spline interpolant is onecase of a family of cubic interpolation methods. Taking de Boor’s function g to beour present function Yn, several different piecewise cubic interpolants of Yn can bedescribed in terms of cubic polynomials Pj on each of the intervals [aj , aj+1] wherethe interpolating function Hn(·; s) is given by Hn(x; s) = Pj(x; s) for x ∈ [aj , aj+1],

Page 19: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 7

j = 0, . . . , k − 1, and where we require

Pj(αj) = Yn(aj), Pj(αj+1) = Yn(aj+1)

P ′j(aj) = sj , P ′

j(aj+1) = sj+1,

for j = 0, . . . , k−1. Here s = (s0, . . . , sk) and the sj ’s are free parameters. Differentchoices of the sj ’s leads to different piecewise cubic functions agreeing with Yn at theknots aj ; all of these different approximating functions Hn(·; s) are continuous andhave continuous first derivatives. Of interest to us here are the following particularways of determining the sj ’s:

• sj = Y(1)n (aj) = Fn(aj), j = 0, . . . , k. This gives the piecewise cubic Hermite

interpolant of Yn, Hn(·, s) ≡ Hn,Herm.

• sj , j = 0, . . . , k chosen so that Hn(·, s) ∈ C2[0, τ ]; i.e. so that H(2)n (·, s) is

continuous and s0 = Y(1)n (0) = 0 and sk = Y

(1)n (ak) = Y

(1)n (τ). This gives the

complete cubic spline interpolant of Yn, Hn(·, s) ≡ Hn,CS ≡ Hn,k.

The complete spline interpolant Hn,CS will play the role for us that the linearinterpolant Ln of Fn played in Kiefer and Wolfowitz [14]. As we will see, however,

even though the Hermite interpolant Hn,Herm is not in C2[0, τ ] (i.e. H(2)n,Herm is

not continuous), the slopes of its piecewise linear second derivative can be given

explicitly in terms of Yn and Y(1)n = Fn at the knots, and our proof will proceed by

relating the slopes of H(2)n,Herm to the (more complicated and less explicit) slopes of

H(2)n,CS ≡ H

(2)n,kn

in order to prove point B in the following outline of our proof.Here is an outline of the proof, paralleling the proof of the K–W theorem.

Main steps, proof of (9) distribution function equivalence:

A. By the generalization of Marshall’s lemma for the convex density problem(see Dumbgen, Rufibach and Wellner [7]), for any function h with convex

derivative h′, ‖H(1)n − h‖ ≤ 2‖Fn − h‖ where H

(2)n ≡ fn. [This generalization

is not yet available for the MLE H(1)n of F in F2 corresponding to H

(2)n = fn;

see Dumbgen, Rufibach and Wellner [7] for a one-sided result.]

B. PF (An) ≡ PF {H(2)n,kn

is convex on [0, τ ]} ր 1 as n → ∞ if kn ≡ (C0β2(F )2n/

log n)1/5 for some absolute constant C0.C. On the event An,

‖H(1)n − Fn‖ = ‖H

(1)n − H

(1)n,kn

+ H(1)n,kn

− Fn‖

≤ 2‖Fn − H(1)n,kn

‖ + ‖H(1)n,kn

− Fn‖by the generalization of Marshall’s lemma (A)

= 3‖Fn − H(1)n,kn

= 3‖Fn − H(1)n,kn

− (F − H(1)kn

) + F − H(1)kn

≤ 3‖Fn − H(1)n,kn

− (F − H(1)kn

)‖ + 3‖F − H(1)kn

‖≡ 3Dn + 3En.

D. We show that Dn = O((n−1 log n)3/5) almost surely via a generalization ofthe K–W Lemma 2. We also show that En = O((n−1 log n)3/5)by an analytic(deterministic) argument.

Page 20: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

8 Balabdaoui and Wellner

Of course proving step B in this outline involves showing that the slopes of

the H(2)n,kn

become ordered with high probability for large n, and this explains our

interest in the slopes of both H(2)n,CS = H

(2)n,kn

and H(2)n,Herm.

The assertion (10) of Theorem 2.2 can be proved in a similar way if we replace

H(1)n , H

(1)n,kn

, H(1)kn

, Fn , F by Hn, Hn,kn, Hkn

, Yn, Y respectively, and if we replaceA by the following recent result of Balabdaoui and Rufibach [1]:

A′. For any function G with convex second derivative g′′, ‖Hn −G‖ ≤ ‖Yn −G‖.Proof of (9) assuming B. First the deterministic term En. As in de Boor [5], page43, let I4 denote the complete cubic spline interpolation operator, and (as in de Boor[5], page 31, let I2 be the piecewise linear (or “broken line”) interpolation operator.Then by de Boor [5], (20) on page 56, with pn ≡ 1/kn,

En = ‖F − H(1)kn

‖ = ‖Y (1) − (I4Y )(1)‖ ≤ 1

24|a|3‖Y (4)‖

≤ 1

24γ2(F, τ)p3

n = O((n−1 log n)3/5).

To handle Dn, let $3 be defined to be the space of all quadratic splines on [0, τ ],and similarly let $2 be the space of all linear splines on [0, τ ]. Then, by de Boor [5],page 56, equation (17), together with (18) on page 36, it follows that with

Dn = ‖Fn − H(1)n,kn

− (F − H(1)kn

)‖ = ‖(Yn − Y )(1) − (I4(Yn − Y ))(1)‖

≤ 19

4dist((Yn − Y )(1); $3) ≤

19

4dist((Yn − Y )(1); $2)

≤ 19

4‖(Yn − Y )(1) − I2[(Yn − Y )(1)]‖

=19

4‖(Fn − F ) − I2(Fn − F )‖

≤ 19

4ω(Fn − F ; |a|)

d=

19

4n−1/2ω(Un; pn)

= O(n−1/2√

pn log(1/pn)) almost surely

= O((n−1 log n)3/5);

here

ω(g; h) ≡ sup{|g(t) − g(s)| : |t − s| ≤ h},dist(g; S) ≡ min{‖g − f‖ : f ∈ S}, S ⊂ C[0, τ ] and

Un(t) ≡√

n(Gn(t) − t)

where Gn(t) = n−1∑n

i=1 1[0,t](ξi) is the empirical distribution function of ξ1, . . . , ξn

i.i.d. Uniform(0, 1) random variables. (See de Boor [5], pages xviii, 24, 32, and 34 fordefinition and use of dist(g; S) and the modulus of continuity ω in conjunction.)

Proof of (10) assuming B. By Hall [12] (also see Hall and Meyer [13] for optimalityof the constant and de Boor [5], page 55),

En ≡ ‖Y − Hkn‖ ≤ 5

384|a|4‖Y (4)‖ ≤ 5

384Rγ2(F )

1

k4n

.

Page 21: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 9

To handle the first term Dn, we note that

Yn − Y − (Hn,kn− Hkn

) = (Yn − Y ) − I4(Yn − Y )

where I4 is the complete spline interpolant, and, on the other hand, for any dif-ferentiable function g it follows from de Boor [5], page 45, equation (14), togetherwith (18) on page 36, that

‖g − I4g‖ ≤19

8|a|dist(g′, $3) ≤

19

8|a|dist(g′, $2)

≤ 19

8|a|‖g′ − I2g‖ ≤ 19

8|a|ω(g′, |a|).

Applying this to g = Yn − Y , it follows that

‖Yn − Y − (Hn,kn− Hkn

)‖ = ‖(Yn − Y ) − I4(Yn − Y )‖

≤ 19

8|a| ω(Fn − F, |a|)

d= n−1/2ω(Un, pn)

Therefore ω(Fn − F ; |a|) = O(n−1/2√

pn log(1/pn)) almost surely (just as in theproof of Lemma 2 for the Kiefer-Wolfowitz theorem, see Section 5), we see that theorder of Dn is

n−1/2p3/2n (log(1/pn))1/2 = O((n−1 log n)4/5) almost surely

as claimed. Thus the claim (10) is proved if we can verify that B holds.

We end this section with a short list of further problems:

• It would be of interest to prove a comparable theorem for the MLE Fn itselfrather than Fn. This involves several additional challenges, among which is acomplete analogue of Marshall’s lemma.

• Are either Fn or Fn asymptotically minimax for estimating F ∈ F2?• We conjecture that similar results hold for k−monotone densities and corre-

sponding distribution functions (k = 1 corresponds to the Kiefer and Wol-fowitz monotone density case, while k = 2 corresponds to the convex densitycase treated here). More concretely, we conjecture that under comparablehypotheses

‖Fn − Fn‖ = O((n−1 log n)(k+1)/(2k+1)) almost surely

for Fn = Fn or Fn = Fn, the least squares estimator or MLE of F ∈ Fk.Some progress on the local theory of the corresponding density estimators isgiven in Balabdaoui and Wellner [2] and Balabdaoui and Wellner [3]. On theinterpolation theory side, the results of Dubeau and Savoie [6] may be useful.

• What is the exact order (in probability or expectation) of ‖Fn − Fn‖ in thecase k = 2? Is it (n−1 log n)3/5 as perhaps suggested by the results of Durotand Tocquet [8] in the case k = 1?

3. Asymptotic convexity of H(2)n,kn

In this section we write C for the complete spline interpolation operator that mapsfunctions g ∈ C1[0, τ ] into their complete spline interpolants C[g] (based on the

Page 22: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

10 Balabdaoui and Wellner

fixed knot sequence 0 = a0 < a1 . . . < ak = τ); thus in this section our C is deBoor’s operator I4. Thus we have

Hn,k = C[Yn], Hn,k = C[Y ].

It follows from the formula for c4,i in (5) on page 40 of de Boor [5] that the slope

of H(2)n,k on the interval [aj−1, aj ] is given by

Bj ≡ Bj(CS) =12

(∆ja)3

(H

(1)n,k(aj−1) + H

(1)n,k(aj)

2∆ja − ∆jYn

)

where ∆jf ≡ f(aj) − f(aj−1) for j = 1, . . . , k and any function f on [0, τ ].

In the following we will let H denote the Hermite interpolation operator that

maps Yn to Hn: thus Hn,Herm = H[Yn], H(1)n,Herm = (H[Yn])(1), and so forth. It

is important to note that the corresponding slopes of the second derivative of the

Hermite interpolant, H(2)n,Herm = (H[Yn])(2) on [aj−1, aj ] are given by the same

formula as in the last display, but with H(1)n,k(ai) replaced by Y

(1)n (ai) = Fn(ai),

i = j − 1, j:

Bj ≡ Bj(Herm) =12

(∆ja)3

(Fn(aj−1) + Fn(aj)

2∆ja − ∆jYn

).(13)

Note that Bj is expressed explicitly as a function of the data via Fn and Yn,whereas Bj still involves Hn,k = C[Yn] and hence also the interpolation operator

C. Ordering of the slopes Bj can be shown using only Lemma 3.1 and Lemma 4.5,but (unfortunately) the generalization of Marshall’s lemma does not apply to the

Hermite interpolant because the second derivative H(2)n,Herm is not continuous at the

knots. This last formula (13) agrees with the formulas for H and Hn in Groene-boom, Jongbloed and Wellner [10] and Groeneboom, Jongbloed and Wellner [11];in particular (13) can be viewed as a finite sample analogue of the 3rd derivativeof the interpolant H given in Groeneboom, Jongbloed and Wellner [10], page 1631,but based on the fixed knots {aj} rather than random knots determined by the

optimization procedure. Note that the least squares estimator fn = H(2)n can be

viewed as the second derivative of either the Hermite interpolant or the completecubic spline interpolant of Yn since these two interpolants have been forced equalby the optimization procedure which determines the knots as random functions ofthe data.

Set

An ≡{

H(2)n,kn

is convex on [0, τ ]}

=

k−1⋂

j=1

{Bj ≤ Bj+1} .

To prove B, we want to bound

P (Acn) ≤

k−1∑

j=1

P (Bj > Bj+1).

Page 23: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 11

To prepare for this, we define

Tn,j =(C[Yn])(1)(aj−1) + (C[Yn])(1)(aj)

2∆ja − ∆jYn,

Rn,j =Y

(1)n (aj−1) + Y

(1)n (aj)

2∆ja − ∆jYn,

tn,j =(C[Y ])(1)(aj−1) + (C[Y ])(1)(aj)

2∆ja − ∆jY,

rn,j =Y (1)(aj−1) + Y (1)(aj)

2∆ja − ∆jY.

We will frequently suppress the dependence of all of these quantities on n, andsimply write Tj for Tn,j, Rj for Rn,j, and so forth. Now Bj = 12Tj/(∆ja)3, Bj =12Rj/(∆ja)3, and we can write

Tj − rj = Tj − tj + tj − rj(14)

= Rj − rj + {Tj − tj − (Rj − rj)} + tj − rj

≡ Rj − rj + Wj + bj .(15)

We regard Rj−rj as the main random term to be controlled, and view Tj−tj−(Rj−rj) ≡ Wj and tj − rj ≡ bj as second order terms, the last of which is deterministic.Thus our strategy will be to first develop an appropriate exponential bound for|Rj −rj |, and then by further separate bounds for Wj and bj, derive an exponentialbound for |Tj − rj |.

For 0 ≤ s < t < ∞, define the family of functions hs,t by

hs,t(x) = (x − (s + t)/2)1(s,t](x).

Note that

Phs,t =1

2(F (t) + F (s))(t − s) −

∫ t

s

F (u)du,

Pnhs,t =1

2(Fn(t) + Fn(s))(t − s) −

∫ t

s

Fn(u)du,

and, furthermore,

rj = Phaj−1,aj, Rj = Pnhaj−1,aj

.

Here is a (partial) analogue of Kiefer and Wolfowitz’s Lemma 1.

Lemma 3.1. Suppose that γ1(F ) < ∞ and R < ∞. Let hs,t(x) = (x − (s +

t)/2)1(s,t](x), s = a(k)j−1 ≡ aj−1, and t = a

(k)j ≡ aj so that t − s = aj − aj−1 =

k−1(1/f(a∗j )) for some a∗

j ∈ [aj−1, aj ]. Then if δn → 0 and k ≥ 5γ1(F )R,

Pr(|Rj − rj | > δnp3n) = Pr(|Pn − P |(hs,t) > δnp3

n)

= 2 exp

(−

3nδ2nf2(a∗

j )p3n

1 + pnδnf(a∗j )

)

≤ 2 exp(−3nδ2np3

nf2(a∗j )(1 + o(1)))

where o(1) depends on f(a∗j ), kn, and δn.

Page 24: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

12 Balabdaoui and Wellner

Proof. First note that |hs,t| is bounded by (t−s)/2. Thus by Bernstein’s inequality(see e.g. van der Vaart and Wellner [23], page 102),

Pr(|Pnhs,t − Phs,t| > x) ≤ 2 exp

(− nx2/2

σ2 + Mx/3

)

for σ2 ≥ V arF (hs,t(X)), M = (t − s)/2 = 1/(2f(a∗j)k) = [1/(2f(a∗

j))]pn, andx > 0. Note that

V ar(hs,t(X)) ≤ Eh2s,t(X) =

∫ t

s

(x − (t + s)/2)2dF (x)

≤ f(s)(t − s)3/12

= f(s)k−3/(12f3(a∗j )) = f(s)p3

n/(12f3(a∗j ))

≤ p3n/(6f2(a∗

j ))

for k ≥ 5γ1(F )R by Lemma 4.1. Then we obtain

Pr(|Pnhs,t − Phs,t| > δnp3n)

≤ 2 exp

(− nδ2

np6n/2

p3n/(6f(a∗

j)2) + pnδnp3

n/(6f(a∗j))

)

= 2 exp

(−

nδ2nf2(a∗

j )p3n

1/3 + pnf(a∗j )δn/3

)

= 2 exp

(−

3nδ2nf2(a∗

j )p3n

1 + pnδnf(a∗j )

)

= 2 exp(− 3nδ2

np3nf2(a∗

j )(1 + o(1)))

where the o(1) term depends on f(t) = f(aj+1), pn = 1/kn, and δn.

Remark. Note that taking δn = C/kn in Lemma 3.1 yields

Pr(|Pnhs,t − Phs,t| > Cp4n) ≤ 2 exp(−3(nC2f2(a∗

j )/k5n)(1 + o(1)))

which seems quite analogous to Lemma 4 of Kiefer and Wolfowitz (1976), but withthe power of 3 replaced by 5.

The following lemma gives a more complete version of Lemma 3.1 in that itprovides an exponential bound for |Tj − rj |.Lemma 3.2. Suppose that the hypotheses of Theorem 2.2 hold: β2(F, τ) < ∞,γ2(F, τ) < ∞, γ1(F ) < ∞ and R < ∞. Then if δn = Cpn for some constant C andk ≥ {5R ∨ 3}γ1(F ),

Pr(|Tj − rj | > 3δnp3n) ≤ 6 exp

(−

(100)−1nδ2nf2(a∗

j )p3n

1 + 30−1pnδnf(a∗j )

).

Proof. This follows from a combination of Lemma 3.1, Lemma 4.2, and Lemma 4.3.Lemma 4.2 yields

|bj | ≡ |tj − rj | ≤ R4o(1)p4n ≤ δnp3

n

if n (and hence kn) is sufficiently large. This implies that

Pr(|Tj − rj | > 3δnp3n) ≤ Pr(|Tj − tj | > 3δnp3

n − |tj − rj |)≤ Pr(|Tj − tj | > 2δnp3

n).

Page 25: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 13

In view of the decomposition (15), this yields

Pr(|Tj − rj | > 3δnp3n) ≤ Pr(|Rj − rj | > δnp3

n) + Pr(|Aj | > δnp3n)

≤ 6 exp

(−

(100)−1nδ2nf2(a∗

j )p3n

1 + 30−1pnδnf(a∗j )

)

by Lemma 3.1, Lemma 4.3, and the fact that

100−1A

1 + 30−1B=

30

100

A

30 + B≤ 3A

1 + B

for A, B > 0.

Lemma 3.3. Suppose that β2 ≡ β2(F, τ) > 0, γ1 ≡ γ1(F, τ) < ∞ and R ≡R(f, τ) < ∞ for some τ < α1(F ) ≡ inf{t : F (t) = 1}. Let

An ≡ {H(2)n,kn

is convex on [0, τ ]}.

Then

P (Acn) ≤ 12kn exp

(−Kβ2

2(F, τ)np5n

)(16)

where K−1 = 82 · 1442 · 16 · 200 = 4, 246, 732, 800 ≤ 4.3 · 109.

Proof. Since

Acn ≡

kn−1⋃

j=1

{Bj > Bj+1},

it follows that

P (Acn) ≤

kn−1∑

j=1

P (Bj > Bj+1)

=

kn−1∑

j=1

P(Bj > Bj+1,

∣∣Ti − ri

∣∣ ≤ 3δn,jp3n, i = j, j + 1

)

+

kn−1∑

j=1

P(Bj > Bj+1,

∣∣Ti − ri

∣∣ > 3δn,jp3n for i = j or i = j + 1

)

≤mn−1∑

j=0

P(Bj > Bj+1,

∣∣Ti − ri

∣∣ ≤ 3δn,jp3n, i = j, j + 1

)

+

kn−1∑

j=0

{P(∣∣Tj − rj

∣∣ > 3δn,jp3n

)+ P

(∣∣Tj+1 − rj+1

∣∣ > 3δn,jp3n

)}

= In + IIn(17)

where we take

δn,j =C(F, τ)

knf(a∗j )

= pnC(F, τ)

f(a∗j)

≡ δn

f(a∗j )

;

Page 26: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

14 Balabdaoui and Wellner

here a∗j ∈ [aj−1, aj ] satisfies ∆ja = aj − aj−1 = 1/(knf(a∗

j )), and C(F, τ) is aconstant to be determined. We first bound IIn from above. By Lemma 3.2, weknow that

P

(|Tj − rj | > 3δn,jp

3n

)≤ 6 exp

(−

(100)−1nδ2n,jf

2(a∗j )p

3n

1 + 30−1pnδn,jf(a∗j )

)

where δ2n,jf

2(a∗j )p

3n = C2(F, τ)p5

n and

1

1 + 30−1pnδn,jf(a∗j)

=1

1 + 30−1C(F, τ)p2n

>1

2

when kn > [30−1C(F, τ)]1/2 . Hence,

P

(|Tj − rj | > 3δn,jp

3n

)≤ 6 exp

(−200−1C2(F, τ)np5

n

).(18)

We also have

P

(|Tj+1 − rj+1| > 3δn,jp

3n

)≤ 6 exp

(−

100−1nδ2n,jf

2(a∗j+1)p

3n

1 + 30−1pnδn,jf(a∗j+1)

)

where a∗j+1 ∈ [aj , aj+1] and aj+1 − aj = ∆j+1a = 1/(knf(a∗

j+1). By Lemma 5.1 wehave f(aj)/f(aj+1) ≤ 2 if kn ≥ 5γ1(F, τ)R. But this implies that f(a∗

j )/f(a∗j+1) ≤ 4

since

f(a∗j )

f(a∗j+1)

=f(a∗

j )

f(aj)· f(aj)

f(aj+1)· f(aj+1)

f(a∗j+1)

≤ f(aj−1)

f(aj)

f(aj)

f(aj+1)

f(aj+1)

f(a∗j+1)

≤ 2 · 2 · 1 = 4.

Hence, we can write

δ2n,jf

2(a∗j+1) =

1

k2n

C2(F, τ)f2(a∗

j+1)

f2(a∗j )

≥ 1

k2n

C2(F, τ)1

16=

C2(F, τ)

16p2

n

and, since f(a∗j+1)/f(a∗

j ) ≤ 1,

1

1 + 30−1pnδn,jf(a∗j+1)

=1

1 + 30−1C(F, τ)p2nf(a∗

j+1)/f(a∗j )

≥ 1

1 + 30−1C(F, τ)p2n

>1

2

when kn > [30−1C(F, τ)]1/2 . Thus, we conclude that

P

(|Tj+1 − rj+1| > 3δn,jp

3n

)≤ 6 exp

(−200−1

16C2(F, τ)np5

n

)(19)

Combining (18) and (19), we get

IIn ≤ 12kn exp

(−200−1

16C2(F, τ)np5

n

).

Page 27: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 15

Now we need to handle In. Recall that

Bj = 12Tj

(∆ja)3, Bj+1 = 12

Tj+1

(∆j+1a)3.

Thus, the event{Bj > Bj+1, |Ti − ri| ≤ 3δn,jp

3n, i = j, j + 1

}

is equal to the event

{ Tj

(∆ja)3>

Tj+1

(∆j+1a)3, |Ti − ri| ≤ 3δn,jp

3n, i = j, j + 1

}.

Then, it follows thatTj

(∆ja)3≤ rj

(∆ja)3+

3δn,jp3n

(∆ja)3

andTj+1

(∆j+1a)3≥ rj+1

(∆j+1a)3− 3δn,jp

3n

(∆j+1a)3,

and hence

Tj

(∆ja)3≤[

rj

(∆ja)3− rj+1

(∆j+1a)3

]+

[rj+1

(∆j+1a)3− 3δn,jp

3n

(∆j+1a)3

]

+

[3δn,jp

3n

(∆ja)3+

3δn,jp3n

(∆j+1a)3

]

≤[

rj

(∆ja)3− rj+1

(∆j+1a)3

]+

Tj+1

(∆j+1a)3

+

[3δn,jp

3n

(∆ja)3+

3δn,jp3n

(∆j+1a)3

].

The first term in the right side of the previous inequality is the leading term in

the sense that it determines the sign of the difference of the slope of H(2)n,kn

. ByLemma 4.5, we can write

rj

(∆ja)3− rj+1

(∆j+1a)3≤ − 1

12f ′′(a∗∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a).

Let a∗j ∈ [aj−1, aj ] such that ∆ja = pn[f(a∗

j )]−1. Then, we can write

3δn,jp3n

(∆ja)3+

3δn,jp3n

(∆j+1a)3− 1

12f ′′(a∗∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a)

≤ 6δn,jf3(a∗

j ) −1

12f ′′(a∗∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a)

= 6f2(a∗j )

{δn − 1

72

f ′′(a∗∗j )

f3(a∗j )

pn +1

144f2(a∗j )

(f′′

j ∆ja − f ′′

j+1∆j+1a)

}

= 6f2(a∗j )

{δn − 1

72

f ′′(a∗∗j )

f3(a∗j )

pn +1

144f3(a∗j )

(f′′

j − f ′′

j+1

∆j+1a

∆ja

)pn

}

= 6f2(a∗j )

{δn − 1

72

f ′′(a∗∗j )

f3(a∗j )

pn +1

144

f ′′

j+1

f3(a∗j )

(f′′

j

f ′′

j+1

− ∆j+1a

∆ja

)pn

}

Page 28: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

16 Balabdaoui and Wellner

= 6f2(a∗j )

{δn − 1

72

f ′′(a∗∗j )

f3(a∗∗j )

f3(a∗∗j )

f3(a∗j )

pn

+1

144

f ′′

j+1

f3(a∗j )

(f′′

j

f ′′

j+1

− ∆j+1a

∆ja

)pn

}

≤ 6f2(a∗j )

{δn − 1

72

β2(F, τ)

8pn +

1

144

f ′′

j+1

f3(a∗j )

(f′′

j

f ′′

j+1

− ∆j+1a

∆ja

)pn

}

= 6f2(a∗j )

{δn − 1

72

β2(F, τ)

8pn

+1

144

f ′′

j+1

f3(a∗j )

(f′′

j

f ′′

j+1

− 1 + 1 − ∆j+1a

∆ja

)pn

}

where (using arguments similar to those of Lemma 4.2 and taking the bound on

|f ′′

j − f ′′

j+1| to be ǫ‖f ′′‖ which is possible by uniform continuity of f ′′ on [0, τ ])

f′′

j

f ′′

j+1

− 1 ≤∣∣∣∣∣

f′′

j

f ′′

j+1

− 1

∣∣∣∣∣ ≤ǫf(τ)3γ2(F, τ)

f ′′

j+1

if kn > max(5γ1(F, τ)R, (√

2 + 1)R/η) for a given η > 0 and

1 − ∆j+1a

∆ja≤∣∣∣∣∆j+1a

∆ja− 1

∣∣∣∣ ≤ 8γ1(F, τ)pn.

Hence

3δn,jp3n

(∆ja)3+

3δn,jp3n

(∆j+1a)3− 1

12f ′′(a∗∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a)

≤ 6f2(a∗j )

{δn − 1

72

β2(F, τ)

8pn +

1

144ǫγ2(F, τ) pn

+8

144γ2(F, τ)γ1(F, τ)p2

n

}

where we can choose ǫ and pn small enough so that

1

144ǫγ2(F, τ) +

8

144γ2(F, τ)γ1 pn ≤ 1

2 · 72 · 8β2(F, τ);

for example

ǫ <1

16

β2(F, τ)

γ2(F, τ), kn = p−1

n > 16 · 8 γ1(F, τ)

β2(F, τ).

The above choice yields

3δn,jp3n

(∆ja)3+

3δn,jp3n

(∆j+1a)3− 1

12f ′′

j∆ja +

1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a)

≤ 6f2(a∗j )

{δn − β2(F, τ)

8 · 144pn

}= 0

Page 29: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 17

by choosing

δn = C(F, τ)pn =β2(F, τ)

8 · 144pn;

i.e. C(F, τ) = β2(F, τ)/(8 · 144). For such a choice, the first term In in (17) isidentically equal to 0.

4. Appendix 1: technical lemmas

Lemma 4.1. Under the hypotheses of Theorem 2.2,

1 ≤ f(aj−1)

f(aj)≤ ∆j+1a

∆ja≤ 2

uniformly in j if k ≥ 5γ1R.

Proof. Note that for each interval Ij = [aj−1, aj ] we have

pn =

Ij

f(x)dx = f(a∗j )∆ja

{≥ f(aj)∆ja≤ f(aj−1)∆ja

where a∗j ∈ Ij . Thus

pn

∆j+1a≤ f(aj) ≤

pn

∆ja

andpn

∆ja≤ f(aj−1) ≤

pn

∆j−1a.

It follows that

1 ≤ f(aj−1)

f(aj)≤ ∆j+1a

∆j−1a=

∆j+1a

∆ja

∆ja

∆j−1a.

Thus we will establish a bound for ∆j+1a/∆ja. Note that with c ≡ F (τ) < 1

∆j+1a = aj+1 − aj = F−1(j + 1

kc) − F−1(

j

kc)

=c

k

1

f(aj)+

c2

2k2

−f ′(ξj+1)

f3(ξj+1)

=c

k

1

f(aj)

{1 +

c

2k

−f ′(ξj+1)

f2(ξj+1)

f(aj)

f(ξj+1)

}

≤ c

k

1

f(aj)

{1 +

cγ1

2kR

}.

for some ξj+1 ∈ Ij+1, where ξj+1 ∈ Ij+1, R < ∞, and γ1 < ∞.Similarly, expanding to second order (about aj again!),

∆ja = aj − aj−1 = F−1(j

kc) − F−1(

j − 1

kc)

=c

k

1

f(aj)+

c2

2k2

f ′(ξj)

f3(ξj)

=c

k

1

f(aj)

{1 +

c

2k

f ′(ξj)

f2(ξj)

f(aj)

f(ξj)

}

Page 30: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

18 Balabdaoui and Wellner

≥ c

k

1

f(aj)

{1 +

c

2k

f ′(ξj)

f2(ξj)

}

since f(aj)/f(ξj) ≤ 1 and f ′(ξj) < 0

≥ c

k

1

f(aj)

{1 − cγ1

2k

}.

where ξj ∈ Ij . Thus it follows that for k = kn so large that γ1/(2k) ≤ 1/2 we have

∆j+1a

∆ja≤ 1 + cγ1

2k R

1 − cγ1

2k

≤(

1 +cγ1

2kR

)(1 +

cγ1

k

)

= 1 +cγ1

k(R/2 + 1) +

c2γ21

2k2R

< 1 +γ1(R + 1)

k

if k = kn ≥ γ1. The last inequality here follows from

γ1

k(R/2 + 1) +

γ21

2k2R ≤ γ1

k(R + α)

if and only if

(R/2 + 1) +γ1

2kR ≤ R + α

or, equivalently, if and only if

γ1

2kR ≤ R/2 + α − 1, or k ≥ γ1

R

R + 2(α − 1)= γ1

if α = 1. It now follows that

1 ≤ f(aj−1)

f(aj)≤ ∆j+1a

∆j−1a=

∆j+1a

∆ja

∆ja

∆j−1a∆j−1a≤ 2

if

∆i+1a

∆ia≤

√2

for i = j − 1, j. But these inequalities hold if k is so large that 1 + γ1(R+1)k ≤

√2,

or k ≥ 5γ1R ≥ γ1(R + 1)/(√

2 − 1) since R ≥ 1 and 1/(√

2 − 1) ≤ 5/2.

Lemma 4.2. Under the hypotheses of Theorem 2.2,

|tj − rj |(∆ja)4

= o(1)

where the o(1) depends only on τ , γ1(F, τ), and γ2(F, τ).

Remark. Note that

max1≤j≤k

|tj − rj | ≤1

24|a|4‖Y (4)‖ =

1

24|a|4‖f ′′‖ ≤ 1

24Rγ2(F )p4

n.(20)

Page 31: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 19

This follows since

rj − tj =1

2

(Y (1)(aj−1) + Y (1)(aj)

− ((C[Y ])(1)(aj−1) − (C[Y ])(1)(aj)))

∆ja

=1

2

{(Y (1)(aj−1) − (C[Y ])(1)(aj−1)

)

+(Y (1)(aj) − (C[Y ])(1)(aj)

)}∆ja,

and hence from de Boor [5], (20), page 56, it follows that

|rj − tj | ≤1

24|a|3‖Y (4)‖∆ja ≤ 1

24|a|4‖f (2)‖,

and this yields (20). The claim of Lemma 4.2 is stronger because it makes a state-ment about the differences tj − rj relative to (∆ja)4; this is possible because onlydifferences between the derivative of the derivative of Y and the derivative of itsinterpolant C[Y ] at the knots are involved.

Proof. We have

rj − tj =1

2

(E(1)

Y (aj−1) + E(1)Y (aj)

)∆ja,(21)

where Eg = g − C[g]. Now, using the result of Problem 2a, Chapter V of de Boor[5] (compare also with the formula (3.52) given in Nurnberger [20]), we have

δjE(1)Y (aj−1) + 2E(1)

Y (aj) + (1 − δj)E(1)Y (aj+1) = βj

for j = 0, · · · , k − 1, where

δj =aj+1 − aj

aj+1 − aj−1=

∆j+1a

∆ja + ∆j+1a

and

βj =δj(−∆ja)3f ′′(ξ1,j) + (1 − δj)(∆j+1a)3f ′′(ξ2,j)

24,

ξ1,j , ξ2,j ∈ [aj−1, aj+1]. By Problem IV 7(a) in de Boor [5] and the techniques usedin Chapter III (see in particular equation (9)), a bound on the maximal value atthe knots of the derivative interpolation error can be derived using the followinginequality

max0≤j≤k

|E(1)Y (aj)| ≤ max

(|E(1)

Y (a0)|, max1≤j≤k−1

|βj |, |E(1)Y (ak)|

).(22)

By definition of the complete cubic spline, E(1)Y (a0) = E(1)

Y (ak) = 0. Thus, we willfocus now on getting a sharp bound for max1≤j≤k−1 |βj | under our hypotheses.This will be achieved as follows:• Expanding δj around 1/2: We have

δj =aj+1 − aj

(aj+1 − aj) + (aj − aj−1)=

k−1n [f(a∗

j+1)]−1

k−1n [f(a∗

j+1)]−1 + k−1

n [f(a∗j )]

−1,

Page 32: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

20 Balabdaoui and Wellner

where a∗j ∈ [aj−1, aj ] and a∗∗

j ∈ [aj , aj+1], and hence

δj =1

2+

f(a∗j+1) − f(a∗

j )

2(f(a∗j) + f(a∗

j+1))

=1

2+

f ′(a∗∗j )

2(f(a∗j) + f(a∗

j+1))(a∗∗

j − a∗j )

=1

2+

f ′(a∗∗j )

2(f(a∗j) + f(a∗

j+1))

a∗∗j − a∗

j

aj − aj−1∆ja =

1

2+ Mj ∆ja

where

|Mj| =∣∣∣∣∣

f ′(a∗∗j )

2(f(a∗j) + f(a∗

j+1))

a∗∗j − a∗

j

aj − aj−1

∣∣∣∣∣

≤ |f ′(aj−1)|4f(aj+1)

aj+1 − aj−1

aj − aj−1

≤ |f ′(aj−1)|4f(aj−1)

f(aj−1)

f(aj+1)

(aj+1 − aj

aj − aj−1+ 1

)

≤ |f ′(aj−1)|4f(aj−1)

2 · 2 · (√

2 + 1), for kn > 5γ1R

= (√

2 + 1)|f ′(aj−1)|f(aj−1)

.

• Approximation of f ′′(ξ1,j) and f ′′(ξ2,j): Define ǫ1,j and ǫ2,j by

ǫ1,j = f ′′(ξ1,j) − f ′′(aj−1), and ǫ2,j = f ′′(ξ2,j) − f ′′(aj).

By uniform continuity of f (2) = f ′′ on the compact set [0, τ ], for every ǫ > 0there exists an η = ηǫ > 0 such that |x − y| < η implies |f ′′(x) − f ′′(y)| < ǫ.Fix ǫ > 0 (to be chosen later). We have ξ1,j , ξ2,j ∈ [aj−1, aj+1], where, bythe proof of Lemma 4.1, if kn > 5γ1R,

aj+1 − aj−1 = aj+1 − aj + aj − aj−1 ≤ 1

knf(a∗j )

(√

2 + 1)

≤ (√

2 + 1)1

knf(τ)

≤ (√

2 + 1)R

kn.

Thus, if we choose kn such that kn > max(5γ1R, (

√2+1)/ηR

), then aj+1 −

aj−1 < η for all j = 1, . . . , k and furthermore

max{|f ′′(ξ1,j) − f ′′(aj−1)|, |f ′′(ξ2,j) − f ′′(aj−1)|

}< ǫ, for j = 1, . . . , k,

or, equivalently, max{|ǫ1,j|, |ǫ2,j|} < ǫ, j = 1, . . . , k.• Expanding ∆j+1a around ∆ja: We have

∆j+1a = aj+1 − aj = aj − aj−1 + [aj+1 − aj − (aj − aj−1)]

= ∆ja + ∆ja

(aj+1 − aj

aj − aj−1− 1

)= ∆ja + ∆ja ǫ3,j

Page 33: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 21

where

ǫ3,j =aj+1 − aj

aj − aj−1− 1 =

f(a∗j )

f(a∗j+1)

− 1 =f(a∗

j ) − f(a∗j+1)

f(a∗j+1)

=−f ′(a∗∗

j )

f(a∗j+1)

(a∗j+1 − a∗

j ).

Thus,

|ǫ3,j| ≤|f ′(aj−1)|f(aj+1)

(aj+1 − aj−1)

=|f ′(aj−1)|f(aj+1)

(1

knf(a∗j )

+1

knf(a∗j+1)

)

≤ 2|f ′(aj−1)|f2(aj+1)

1

kn= 2

|f ′(aj−1)|f2(aj−1)

(f(aj−1)

f(aj+1)

)21

kn

≤ 2 · 24 |f ′(aj−1)|f2(aj−1)

1

kn= 32

|f ′(aj−1)|f2(aj−1)

1

kn≤ 32γ1

1

kn.

Above, we have used the fact that kn > 5γ1R to be able to use the inequalityf(aj−1)/f(aj+1) < 22.Now, expansion of βj yields, after straightforward algebra,

24βj =[− 2Mjf

′′(aj−1)(∆ja)4]

+[ǫ1,j

(1

2+ Mj∆ja

)(−∆ja)3 + ǫ2,j

(1

2− Mj∆ja

)(∆ja)3

]

+[(1

2− Mj∆ja

)(3 + 3ǫ3,j + ǫ23,j)(f

′′(aj−1) + ǫ2,j) ǫ3,j (∆ja)3]

= T1,j + T2,j + T3,j

where

|T1,j|(∆ja)3

= 2|Mj|f ′′(aj−1)(∆ja) ≤ 2(√

2 + 1)|f ′(aj−1)|f(aj−1)

f ′′(aj−1)1

knf(a∗j )

≤ 4(√

2 + 1)|f ′(aj−1)|f2(aj−1)

f ′′(aj−1)1

kn

≤ 4(√

2 + 1)γ1f′′j

1

kn≤ 4(

√2 + 1)γ1γ2f(τ)3

1

kn

≤ 2−1(√

2 + 1)γ1γ2τ−3 ≡ M1

1

kn,

since f(τ) ≤ (2τ)−1 by (3.1), page 1669, Groeneboom, Jongbloed and Wellner[11],

|T2,j|(∆ja)3

≤ 2

(1

2+

2(√

2 + 1)γ1

kn

)ǫ ≤ 2

(1

2+

2(√

2 + 1)

5R

)ǫ = M2ǫ,

Page 34: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

22 Balabdaoui and Wellner

and

|T3,j|(∆ja)3

≤(

1

2+

2(√

2 + 1)γ1

kn

)(3 +

96γ1

kn+

322γ21

k2n

)(f

′′

j + ǫ)1

kn

≤(

1

2+

2(√

2 + 1)

5R

)(3 +

96

5R+

322

25R2

)2γ2f(τ)3

1

kn

≤(

1

2+

2(√

2 + 1)

5R

)(3 +

96

5R+

322

25R2

)2−2γ2τ

−3 1

kn= M3

1

kn

if we choose ǫ < γ2f(τ)3 = sup0<t<τ f ′′(t) and again use f(τ) ≤ (2τ)−1. Notethat by (21)

|tj − rj |(∆ja)3

≤ max1≤i≤k |E(1)(ai)|(∆ja)2

.

Thus, using (22) and combining the results obtained above, we can write forj = 1, . . . , k,

|tj − rj |(∆ja)3

≤ max1≤i≤k−1

|βi|(∆ja)2

≤ 24−1 max1≤i≤k−1

|T1,i| + |T2,i| + |T3,i|(∆ia)3

· |a|3(∆ja)2

≤[(M1 + M3)

1

kn+ M2 ǫ

] |a|3(∆ja)2

=

[(M1 + M3)

1

kn+ M2 ǫ

] |a|3(∆ja)3

∆ja(23)

But note that

|a|3(∆ja)3

= max1≤i≤k

(∆ia

∆ja

)3

≤ max1≤i≤k

(f(a∗

j )

f(a∗i )

)3

≤(

f(aj−1)

f(τ)

)3

where

f(aj−1)

f(τ)=

f(aj−1)

f(ak)=

f(aj−1)

f(aj)· f(aj)

f(aj+1)· · · f(ak−1)

f(ak)

and, for l = 0, . . . , k − 1,

f(al)

f(al+1)= 1 +

f(al) − f(al+1)

f(al+1)

= 1 +−f ′(a∗

l )

f(al+1)(al+1 − al), a∗

l ∈ [al, al+1]

= 1 +−f ′(a∗

l )

f(al+1)f(a∗∗l )

1

kn, a∗∗

l ∈ [al, al+1]

≤ 1 +−f ′(al)

f(al+1)f(a∗∗l )

1

kn

= 1 +−f ′(al)

f2(al)

f2(al)

f(al+1)f(a∗∗l )

1

kn

≤ 1 +γ1

4

1

kn, if kn > 5γ1R.

Page 35: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 23

Hence,

|a|3(∆ja)3

≤(

1 +γ1

4

1

kn

)3(kn+2−j)

≤(

1 +γ1

4

1

kn

)3(kn+2)

≤(

1 +γ1

4

1

kn

)3(kn+2)

=

(1 +

γ1

4

1

kn

)6(1 +

3γ1

4

1

3kn

)3kn

≤ 2

(1 +

3γ1

4

1

3kn

)3kn

≤ 2e3γ1/4(24)

if kn ≥ γ1/(4(21/6−1)) where we used log(1+x) ≤ x for x > 0 in the last inequality.Combining (23) with (24), it follows that if we choose

kn > max{5γ1R, γ1/(4(21/6 − 1)), (

√2 + 1)/ηR

}

then

|tj − rj |(∆ja)3

≤ 4e3γ1/4

[(M1 + M3)

1

kn+ M2ǫ

]∆ja = o(∆ja)

or

|tj − rj |(∆ja)4

= o(1)

where o(1) is uniform in j.

Lemma 4.3. Under the hypotheses of Theorem 2.2,

Pr(|Tj − tj − (Rj − rj)| ≥ δnp3

n

)≤ 4 exp

(−

(100)−1nδ2nf2(a∗

j )p3n

1 + (1/30)pnδnf(a∗j )

).

Proof. Write

Wj ≡ Tj − tj − (Rj − rj)

= −{

(Yn − Y )(1)(aj−1) + (Yn − Y )(1)(aj)

2

− (C[Yn − Y ])(1)(aj−1) + (C[Yn − Y ])(1)(aj)

2

}∆ja

≡ −1

2

(E(1)

Yn−Y (aj−1) + E(1)Yn−Y (aj)

)∆ja

where

E(1)g (t) ≡ (g − C[g])(1)(t).

But for g ∈ C1[aj−1, aj ] with g(1) of bounded variation,

g(t) = g(aj−1) + g′(aj−1)(t − aj−1) +

∫ t

aj−1

(t − u)dg(1)(u)

= Pj(t) +

∫ aj

aj−1

gu(t)dg(1)(u)

Page 36: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

24 Balabdaoui and Wellner

where gu(t) ≡ (t−u)+ = (t−u)1[t≥u]. Since C is linear and preserves linear functions

C[g](t) = Pj(t) +

∫ aj

aj−1

Cgu(t)dg(1)(u),

and this yields

Eg(t) =

∫ aj

aj−1

Egu(t)dg(1)(u)

and

E(1)g (t) =

∫ aj

aj−1

E(1)gu

(t)dg(1)(u).

Applying this second formula to g = Yn − Y yields the relation

E(1)Yn−Y (t) =

∫ aj

aj−1

E(1)gu

(t)d(Fn − F )(u).

Now gu is absolutely continuous with gu(t) =∫ t

0g(1)u (s)ds where g

(1)u (t) = 1[t≥u], so

by de Boor [5], (17) on page 56 (recalling that our C = I4 of de Boor),

‖E(1)gu

‖ = ‖g(1)u − (C[gu])(1)‖

≤ (19/4)dist(g(1)u , $3) ≤ (19/4)dist(g(1)

u , $2)

≤ (19/4)ω(g(1)u , |a|) ≤ (19/4) ≤ 5.

Thus the functions (u, t) 7→ E(1)gu (t)∆ja are bounded by a constant multiple of ∆ja,

while the functions hj,l(u) = E(1)gu (al)1[aj−1,aj ](u)∆ja, l ∈ {j − 1, j} satisfy

V ar[hj,l(X)] ≤ (∆ja)2∫ aj

aj−1

(19/4)2f(u)du ≤ 52(∆ja)3f(aj−1)

≤ 50p3n/f2(a∗

j )

for k ≥ 5γ1(F, τ)R as in the proof of Lemma 3.1 in section 3. By applying Bern-stein’s inequality much as in the proof of Lemma 3.1 we find that

Pr(|E(1)

Yn−Y (al)| > δnp3n

)

≤ 2 exp

(− nδ2

np6n/2

50p3n/f(a∗

j)2 + pn(5/3)δnp3

n/f(a∗j )

)

= 2 exp

(−

nδ2nf2(a∗

j )p3n

100 + (10/3)pnf(a∗j )δn

)

= 2 exp

(−

(100)−1nδ2nf2(a∗

j )p3n

1 + (1/30)pnδnf(a∗j )

).

Thus it follows that

Pr(|Wj | > δnp3

n

)

≤ Pr(|E(1)

Yn−Y (aj−1)| > δnp3n

)

+ Pr(|E(1)

Yn−Y (aj)| > δnp3n

)

≤ 4 exp

(−

(100)−1nδ2nf2(a∗

j )p3n

1 + (1/30)pnδnf(a∗j )

).

Page 37: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 25

This completes the proof of the claimed bound.

Lemma 4.4. Let R(s, t) be defined by

R(s, t) ≡ Phs,t

=1

2(F (t) + F (s))(t − s) −

∫ t

s

F (u)du, 0 ≤ s ≤ t < ∞.

Then

R(s, t)

{≤ 1

12f ′(s)(t − s)3 + 124 sups≤x≤t f ′′(x)(t − s)4

≥ 112f ′(s)(t − s)3 + 1

24 infs≤x≤t f ′′(x)(t − s)4.(25)

Remark. It follows from the Hadamard-Hermite inequality that for F concave,R(s, t) ≤ 0 for all s ≤ t; see e.g. Niculescu and Persson [19], pages 50 and 62-63for an exposition and many interesting extensions and generalizations. Lemmas A4and A5 give additional information under the added hypotheses that F (2) existsand F (1) is convex.

Proof. Since gs(t) ≡ R(s, t) has first three derivatives given by

g(1)s (t) =

d

dtRs(t) =

1

2f(t)(t − s) +

1

2(F (t) + F (s) − F (t))

=1

2f(t)(t − s) − 1

2(F (t) − F (s))

t=s= 0,

g(2)s (t) =

d2

dt2Rs(t) =

1

2f ′(t)(t − s) +

1

2(f(t) − f(t))

t=s= 0,

g(3)s (t) =

d3

dt3Rs(t) =

1

2f ′′(t)(t − s) +

1

2f ′(t),

we can write R(s, t) as a Taylor expansion with integral form of the remainder: fors < t,

R(s, t) = gs(t) = gs(s) + g′s(s)(t − s) +1

2!g′′s (s)(t − s)2

+1

2!

∫ t

s

g(3)s (x)(t − x)2dx

= 0 +1

2!

∫ t

s

(1

2f ′′(x)(x − s) +

1

2f ′(x)

)(t − x)2dx

=1

4

∫ t

s

f ′(x)(t − x)2dx +1

4

∫ t

s

f ′′(x)(x − s)(t − x)2dx

=1

4

∫ t

s

{f ′(s) + f ′′(x∗)(x − s)}(t − x)2dx

+1

4

∫ t

s

f ′′(x)(x − s)(t − x)2dx

=1

12f ′(s)(t − s)3 +

1

4

∫ t

s

{f ′′(x∗) + f ′′(x)}(x − s)(t − x)2dx

where |x∗ − x| ≤ |x − s| for each x ∈ [s, t]. Since∫ t

s (x − s)(t − x)2dx = (t − s)4/12we find that the inequalities (25) hold.

Page 38: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

26 Balabdaoui and Wellner

Lemma 4.5. Let rn,i ≡ P (hai−1,ai) = R(ai−1, ai), i = j, j + 1, f ′′

j= inft∈[aj−1,aj ]

f ′′(t) and f′′

j = supt∈[aj−1,aj ] f′′(t) . Then there exists a∗

j ∈ [aj−1, aj] = Ij such that

rn,j

(∆ja)3− rn,j+1

(∆j+1a)3≤ − 1

12f ′′(a∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a).

Proof. In view of (25), we have

rn,j

{≤ 1

12f ′(aj−1)(∆ja)3 + 124 supx∈Ij

f ′′(x)(∆ja)4

≥ 112f ′(aj−1)(∆ja)3 + 1

24 infx∈Ijf ′′(x)(∆ja)4,

rn,j+1

{≤ 1

12f ′(aj)(∆j+1a)3 + 124 supx∈Ij+1

f ′′(x)(∆j+1a)4

≥ 112f ′(aj)(∆j+1a)3 + 1

24 infx∈Ij+1f ′′(x)(∆j+1a)4,

and hence

rn,j

(∆ja)3− rn,j+1

(∆j+1a)3

≤ 1

12f ′(aj−1) +

1

24supx∈Ij

f ′′(x)∆ja − 1

12f ′(aj) −

1

24inf

x∈Ij+1

f ′′(x)∆j+1a

= − 1

12f ′′(a∗

j )∆ja +1

24(f

′′

j ∆ja − f ′′

j+1∆j+1a), where a∗

j ∈ Ij .

5. Appendix 2: A “modernized” proof of Kiefer and Wolfowitz [14]

Define the following interpolated versions of F and Fn. For k ≥ 1, let aj ≡ a(k)j ≡

F−1(j/k) for j = 1, . . . , k − 1, and set a0 ≡ α0(F ) and ak ≡ α1(F ). Using thenotation of de Boor [5], Chapter III, let L(k) = I2F be the piecewise linear andcontinuous function on R satisfying

L(k)(a(k)j ) = F (a

(k)j ), j = 0, . . . , ak.

Similarly, define Ln ≡ L(k)n = I2Fn; thus

L(k)n (x) = Fn(aj) + k{Fn(aj+1) − Fn(aj)}[L(k)(x) − F (aj)]

for aj ≤ x ≤ aj+1, j = 0, . . . , ak. We will eventually let k = kn and then writepn = 1/kn (so that F (aj+1) − F (aj) = 1/kn = pn).

The following basic lemma due to Marshall [17] plays a key role in the proof.

Lemma 5.1 (Marshall [17]). Let Ψ be convex on [0, 1], and let Φ be a continuousreal-valued function on [0, 1]. Let

Φ(x) = sup{h(x) : h is convex and h(z) ≤ Φ(z) for all z ∈ [0, 1]}.

Then

sup0≤x≤1

|Φ(x) − Ψ(x)| ≤ sup0≤x≤1

|Φ(x) − Ψ(x)|.

Proof. Note that for all y ∈ [0, 1], either Φ(y) = Φ(y), or y is an interior point of aclosed interval I over which Φ is linear. For such an interval, either supx∈I |Φ(x) −

Page 39: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 27

Ψ(x)| is attained at an endpoint of I (where Φ = Φ), or it is attained at an interiorpoint, where Ψ < Φ. Since Φ ≤ Φ on [0, 1], it follows that

supx∈I

|Φ(x) − Ψ(x)| ≤ supx∈I

|Φ(x) − Ψ(x)|.

Here is a second proof (due to Robertson, Wright and Dykstra [21], page 329)that does not use continuity of Φ. Let ǫ ≡ ‖Φ − Ψ‖∞. Then Ψ − ǫ is convex, andΨ(x) − ǫ ≤ Φ(x) for all x. Thus for all x

Φ(x) ≥ Φ(x) ≥ Ψ(x) − ǫ,

and henceǫ ≥ Φ(x) − Ψ(x) ≥ Φ(x) − Ψ(x) ≥ −ǫ

for all x. This implies the claimed bound.

Main steps:

A. By Marshall’s lemma, for any concave function h, ‖Fn − h‖ ≤ ‖Fn − h‖.B. PF (An) ≡ PF {L

(kn)n is concave on [0,∞)} ր 1 as n → ∞ if kn ≡ (C0β1(F )×

n/ logn)1/3 for some absolute constant C0.C. On the event An, it follows from Marshall’s lemma (step A) that

‖Fn − Fn‖ = ‖Fn − L(kn)n + L

(kn)n − Fn‖

≤ ‖Fn − L(kn)n ‖ + ‖L

(kn)n − Fn‖

= 2‖Fn − L(kn)n ‖

= 2‖Fn − L(kn)n − (F − L(kn)) + F − L(kn))‖

≤ 2‖Fn − L(kn)n − (F − L(kn))‖ + 2‖F − L(kn)‖

≡ 2(Dn + En).

D. Dn is handled by a standard “oscillation theorem”; En is handled by ananalytic (deterministic) argument.

Proof of (1) assuming B holds. Using the notation of de Boor [5], chapter III, wehave

Fn − F − (Ln − L) = Fn − F − I2(Fn − F ).

But by (18) of de Boor [5], page 36, ‖g−I2g‖ ≤ ω(g; |a|) where ω(g; |a|) is the oscil-lation modulus of g with maximum comparison distance |a| = maxj ∆aj (and notethat de Boor’s proof does not involve continuity of g). Thus it follows immediatelythat

Dn ≡ ‖Fn − F − (Ln − L)‖= ‖Fn − F − I2(Fn − F )‖≤ ω(Fn − F ; |a|) d

= n−1/2ω(Un; pn)

where Un ≡ √n(Gn − I) is the empirical process of n i.i.d. Uniform(0, 1) random

variables. From Stute’s theorem (see e.g. Shorack and Wellner [22], Theorem 14.2.1,page 542), lim sup ω(Un; pn)/

√2pn log(1/pn) = 1 almost surely if pn → 0, npn → ∞

and log(1/pn)/npn → 0. Thus we conclude that

‖Fn − F − (Ln − L)‖ = O(n−1/2√

pn log(1/pn)) = O((n−1 log n)2/3)

Page 40: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

28 Balabdaoui and Wellner

almost surely as claimed.To handle En, we use the bound given by de Boor [5], page 31, (2): ‖g − I2g‖ ≤

8−1|a|2‖g′′‖. Applying this to g = F , I2g = L(k) yields

‖F − L(k)‖ = ‖F − I2F‖ ≤ 1

8|a|2‖F ′′‖

≤ 1

8γ1(F )p2

n = O((n−1 log n)2/3).

Combining the results for Dn and En yields the stated conclusion.

It remains to show that B holds. To do this we use the following lemma.

Lemma 5.2. If pn → 0 and δn → 0, then for the uniform(0, 1) d.f. F = I,

P (|Gn(pn) − pn| ≥ δnpn) ≤ 2 exp(−1

2npnδ2

n(1 + o(1)))

where the o(1) term depends only on δn.

Proof. From Shorack and Wellner [22], Lemma 10.3.2, page 415,

P (Gn(pn)/pn ≥ λ) ≤ P

(sup

pn≤t≤1

Gn(t)

t≥ λ

)≤ exp(−npnh(λ))

where h(x) = x(log x − 1) + 1. Hence

P

(Gn(pn) − pn

pn≥ λ

)≤ exp(−npnh(1 + λ))

where h(1 + λ) ∼ λ2/2 as λ ↓ 0, by Shorack and Wellner [22], (11.1.7), page 44.Similarly, using Shorack and Wellner [22], (10.3.6) on page 416,

P

(pn − Gn(pn)

pn≥ λ

)= P

(pn

Gn(pn)≥ 1

1 − λ

)≤ exp(−npnh(1 − λ))

where h(1 − λ) ∼ λ2/2 as λ ց 0. Thus the conclusion follows with o(1) dependingonly on δn.

Here is the lemma which is used to prove B.

Lemma 5.3. If β1(F ) > 0 and γ1(F ) < ∞, then for kn large,

1 − P (An) ≤ 2kn exp(−nβ21(F )/80k3

n).

Proof. For 1 ≤ j ≤ kn, write

Tn,j ≡ Fn(aj) − Fn(aj−1), ∆ja ≡ aj − aj−1.

By linearity of L(kn)n on the sub-intervals [aj−1, aj ],

An =

kn−1⋂

j=1

{Tn,j

∆ja≥ Tn,j+1

∆j+1a

}≡

kn−1⋂

j=1

Bn,j.

Suppose that

|Tn,i − 1/kn| ≤ δn/kn, i = j, j + 1; and∆j+1a

∆ja≥ 1 + 3δn.(26)

Page 41: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 29

Then

Tn,j ≥ 1

kn− δn

kn=

1 − δn

kn, Tn,j+1 ≤ 1 + δn

kn,

and it follows that for δn ≤ 1/3

Tn,j∆j+1a

∆ja≥ 1 − δn

kn(1 + 3δn) ≥ 1 − δn

kn

1 + δn

1 − δn≥ Tn,j+1.

[1 + 3δ ≥ (1 + δ)/(1 − δ) iff (1 + 2δ − 3δ2) ≥ 1 + δ iff δ − 3δ2 ≥ 0 iff 1 − 3δ ≥ 0.]Now the ∆ part of (26) holds for 1 ≤ j ≤ kn − 1 provided δn ≤ β1(F )/6kn < 1/3.Proof: Since

d

dtF−1(t) =

1

f(F−1(t))and

d2

dt2F−1(t) = − f ′

f3(F−1(t))

we can write

∆j+1a = F−1(j + 1

k) − F−1(

j

k) = k−1

n

1

f(aj)+

1

2k2n

(−f ′(ξ)

f3(ξ)

)

for some aj ≤ ξ ≤ aj+1, and

∆ja ≤ k−1n

1

f(aj).

Combining these two inequalities yields

∆j+1a

∆ja≥ 1 + (2kn)−1f(aj)

(−f ′(ξ)

f3(ξ)

)

≥ 1 +1

2kn

(−f ′(ξ)

f2(ξ)

)≥ 1 +

1

2knβ1(F )

= 1 + 3δn

if δn ≡ β1(F )/(6kn).Thus we conclude that

1 − P (An) = P (

kn−1⋃

j=1

Bcn,j) ≤

kn−1∑

j=1

P (Bcn,j)

≤kn−1∑

j=1

2P (|Tn,j − 1/kn| > δn/kn)

≤ kn4 exp(−2−1npnδ2n1 + o(1))) = 4kn exp(−nβ2

1(F )/80k3n).

by using Lemma 5.2 and for kn sufficiently large (so that (1 + o(1)) ≥ 72/80).

Putting these results together yields Theorem 1.1.

Acknowledgments. The second author owes thanks to Lutz Dumbgen and KasparRufibach for the collaboration leading to the analogue of Marshall’s lemma whichis crucial for the development here. The second author thanks Piet Groeneboomfor many stimulating discussions about estimation under shape constraints, andparticularly about estimation of convex densities, over the past fifteen years.

Page 42: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

30 Balabdaoui and Wellner

References

[1] Balabdaoui, F. and Rufibach, K. (2007). A second marshall inequality inconvex estimation. Statist. and Probab. Lett. To appear.

[2] Balabdaoui, F. and Wellner, J. A. (2004). Estimation of a k-monotonedensity, part 1: characterizations, consistency, and minimax lower bounds.Technical report, Department of Statistics, University of Washington.

[3] Balabdaoui, F. and Wellner, J. A. (2004). Estimation of a k-monotonedensity, part 4: limit distribution theory and the spline connection. Technicalreport, Department of Statistics, University of Washington.

[4] Birkhoff, G. and de Boor, C. (1964). Error bounds for spline interpolation.J. Math. Mech. 13 827–835. MR0165294

[5] de Boor, C. (2001). A Practical Guide to Splines. Springer, New York.[6] Dubeau, F. and Savoie, J. (1997). Best error bounds for odd and even

degree deficient splines. SIAM J. Numer. Anal. 34 1167–1184. MR1451119[7] Dumbgen, L., Rufibach, K. and Wellner, J. A. (2007). Marshall’s lemma

for convex density estimation. In Asymptotics, Particles, Processes, and In-verse Problems, IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beach-wood, OH. To appear.

[8] Durot, C. and Tocquet, A.-S. (2003). On the distance between the em-pirical process and its concave majorant in a monotone regression framework.Ann. Inst. H. Poincare Probab. Statist. 39 217–240. MR1962134

[9] Groeneboom, P. and Jongbloed, G. (1995). Isotonic estimation and ratesof convergence in Wicksell’s problem. Ann. Statist. 23 1518–1542. MR1370294

[10] Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). A canon-ical process for estimation of convex functions: the “invelope” of integratedBrownian motion +t4. Ann. Statist. 29 1620–1652. MR1891741

[11] Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). Estimationof a convex function: characterizations and asymptotic theory. Ann. Statist. 291653–1698. MR1891742

[12] Hall, C. A. (1968). On error bounds for spline interpolation. J. Approxima-tion Theory 1 209–218. MR0239324

[13] Hall, C. A. and Meyer, W. W. (1976). Optimal error bounds for cubicspline interpolation. J. Approximation Theory 16 105–122. MR0397247

[14] Kiefer, J. and Wolfowitz, J. (1976). Asymptotically minimax estimationof concave and convex distribution functions. Z. Wahrsch. Verw. Gebiete 3473–85. MR0397974

[15] Kiefer, J. and Wolfowitz, J. (1977). Asymptotically minimax estimationof concave and convex distribution functions. II. In Statistical Decision Theoryand Related Topics. II (Proc. Sympos., Purdue Univ., Lafayette, Ind., 1976)193–211. Academic Press, New York. MR0443202

[16] Kulikov, V. N. and Lopuhaa, H. P. (2006). The limit process of the dif-ference between the empirical distribution function and its concave majorant.Statist. Probab. Lett. 76 1781–1786. MR2274141

[17] Marshall, A. W. (1970). Discussion on Barlow and van Zwet’s paper. InNonparametric Techniques in Statistical Inference (M. L. Puri, ed.). Proceed-ings of the First International Symposium on Nonparametric Techniques heldat Indiana University, June 1969 174–176. Cambridge University Press, Lon-don.

[18] Millar, P. W. (1979). Asymptotic minimax theorems for the sample distri-bution function. Z. Wahrsch. Verw. Gebiete 48 233–252. MR0537670

Page 43: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

A Kiefer–Wolfowitz theorem 31

[19] Niculescu, C. P. and Persson, L.-E. (2006). Convex Functions and TheirApplications. Springer, New York. MR2178902

[20] Nurnberger, G. (1989). Approximation by Spline Functions. Springer,Berlin. MR1022194

[21] Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Re-stricted Statistical Inference. Wiley, Chichester. MR0961262

[22] Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Ap-plications to Statistics. Wiley, New York. MR0838963

[23] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergenceand Empirical Processes. Springer, New York.

[24] Wang, J.-L. (1986). Asymptotically minimax estimators for distributions withincreasing failure rate. Ann. Statist. 14 1113–1131. MR0856809

[25] Wang, X. and Woodroofe, M. (2007). A Kiefer–Wolfowitz comparisontheorem for Wicksell’s problem. Ann. Statist. 35. To appear.

[26] Wang, Y. (1994). The limit distribution of the concave majorant of an em-pirical distribution function. Statist. Probab. Lett. 20 81–84. MR1294808

Page 44: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotic: Particles, Processes and Inverse ProblemsVol. 55 (2007) 32–64c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000265

Model selection for Poisson processes

Lucien Birge1

Universite Paris VI

Abstract: Our purpose in this paper is to apply the general methodologyfor model selection based on T-estimators developed in Birge [Ann. Inst. H.Poincare Probab. Statist. 42 (2006) 273–325] to the particular situation ofthe estimation of the unknown mean measure of a Poisson process. We in-troduce a Hellinger type distance between finite positive measures to serve asour loss function and we build suitable tests between balls (with respect tothis distance) in the set of mean measures. As a consequence of the existenceof such tests, given a suitable family of approximating models, we can buildT-estimators for the mean measure based on this family of models and analyzetheir performances. We provide a number of applications to adaptive intensityestimation when the square root of the intensity belongs to various smoothnessclasses. We also give a method for aggregation of preliminary estimators.

1. Introduction

This paper deals with the estimation of the mean measure µ of a Poisson process Xon X . More precisely, we develop a theoretical, but quite general method for esti-mating µ by model selection with applications to adaptive estimation and aggrega-tion of preliminary estimators. The main advantage of the method is its generality.We do not make any assumption on µ apart from the fact that it should be finiteand we allow arbitrary countable families of models provided that each model beof finite metric dimension, i.e. is not too large in a suitable sense to be explainedbelow. We do not know of any other estimation method allowing to deal with modelselection in such a generality and with as few assumptions. The main drawback ofthe method is its theoretical nature, effective computation of the estimators beingtypically computationally too costly for permitting a practical implementation. Inorder to give a more precise idea of what this paper is about, we need to start byrecalling a few well-known facts about Poisson processes that can, for instance, befound in Reiss [29].

1.1. The basics of Poisson processes

Let us denote by Q+(X ) the cone of finite positive measures on the measurablespace (X , E). Given an element µ ∈ Q+(X ), a Poisson process on X with meanmeasure µ is a point process X = {X1, . . . , XN} on X such that N has a Pois-son distribution with parameter µ(X ) and, conditionally on N , the Xi are i.i.d.with distribution µ1 = µ/µ(X ). Equivalently, the Poisson process can be viewed asa random measure ΛX =

∑Ni=1 δXi , δx denoting the Dirac measure concentrated

at the point x. Then, whatever the partition A1, . . . , An of X , the n random vari-ables ΛX(Ai) are independent with Poisson distributions and respective parameters

1UMR 7599 “Probabilites et modeles aleatoires” Laboratoire de Probabilites, boıte 188, Uni-versite Paris VI, 4 Place Jussieu, F-75252 Paris Cedex 05, France, e-mail: [email protected]

AMS 2000 subject classifications: Primary 62M30, 62G05; secondary 62G10, 41A45, 41A46.Keywords and phrases: adaptive estimation, aggregation, intensity estimation, model selection,

Poisson processes, robust tests.

32

Page 45: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 33

µ(Ai) and this property characterizes a Poisson process. We shall denote by Qµ thedistribution of a Poisson process with mean measure µ on X . We recall that, forany nonnegative measurable function φ on (X , E),

(1.1) E

[N∑

i=1

φ(Xi)

]=∫

Xφ(x) dµ(x)

and

(1.2) E

[N∏

i=1

φ(Xi)

]= exp

[∫

X[φ(x) − 1] dµ(x)

].

If µ, ν ∈ Q+(X ) and µ � ν, then Qµ � Qν and

(1.3)dQµ

dQν(X1, . . . , XN ) = exp[ν(X ) − µ(X )]

N∏

i=1

dν(Xi),

with the convention that∏0

i=1(dµ/dν)(Xi) = 1.

1.2. Introducing our loss function

From now on, we assume that we observe a Poisson process X on X with unknownmean measure µ ∈ Q+(X ) so that µ always denotes the parameter to be esti-mated. For this, we use estimators µ(X) with values in Q+(X ) and measure theirperformance via the loss function Hq(µ(X), µ) for q ≥ 1, where H is a suitabledistance on Q+(X ). To motivate its introduction, let us recall some known facts.The Hellinger distance h between two probabilities P and Q defined on the samespace and their Hellinger affinity ρ are given respectively by

(1.4) h2(P, Q) =12

∫ (√dP −

√dQ)2

, ρ(P, Q) =∫ √

dPdQ = 1 − h2(P, Q),

where dP and dQ denote the densities of P and Q with respect to any dominatingmeasure, the result being independent of the choice of such a measure. If X1, . . . , Xn

are i.i.d. with distribution P on X and Q is another distribution, it follows from anexponential inequality that, for all x ∈ R,

P

[n∑

i=1

log(

dQ

dP

)(Xi) ≥ 2x

]≤ exp

[n log

(ρ(P ,Q)

)− x

]

(1.5) ≤ exp[nh2

(P ,Q

)− x

],

which provides an upper bound for the errors of likelihood ratio tests. In particular,if µ and µ′ are two elements in Q+(X ) dominated by some measure λ, it followsfrom (1.3) and (1.2) that the Hellinger affinity ρ(Qµ, Qµ′) between µ and µ′ is givenby

(1.6) ρ(Qµ, Qµ′) =∫ √

dQµ

dQλ

dQµ′

dQλdQλ = exp

[−H2(µ, µ′)

],

Page 46: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

34 L. Birge

where

H2(µ, µ′) =12[µ(X ) + µ′(X )

]−∫ √

(dµ/dλ)(dµ′/dλ)(1.7)

=12

∫ (√dµ/dλ −

√dµ′/dλ

)2

.(1.8)

Comparing (1.8) with (1.4) indicates that H is merely the generalization of theHellinger distance h between probabilities to arbitrary finite positive measures andthe introduction of H turns Q+(X ) into a metric space. Moreover, we derive from(1.5) with n = 1 that, when X is a Poisson process with mean measure µ on X ,

(1.9) P

[log

(dQµ′

dQµ

)(X) ≥ 2x

]≤ exp

[−H2(µ, µ′) − x

].

If µ(X ) = µ′(X ) = n, then H2(µ, µ′) = nh2(µ1, µ′1) and (1.9) becomes a perfect

analogue of (1.5). The fact that the errors of likelihood ratio tests between twoprobabilities are controlled by their Hellinger affinity justifies the introduction ofthe Hellinger distance as the natural loss function for density estimation, as shownby Le Cam [26]. It also motivates the choice of Hq as a natural loss function forestimating the mean measure of a Poisson process. For simplicity, we shall firstfocus on the quadratic risk E[H2(µ(X), µ)].

1.3. Intensity estimation

A case of particular interest occurs when we have at hand a reference positivemeasure λ on X and we assume that µ � λ with dµ/dλ = s, in which cases is called the intensity (with respect to λ) of the process with mean measure µ.Denoting by L

+i (λ) the positive part of Li(λ) for i = 1, 2, we observe that s ∈ L

+1 (λ),√

s ∈ L+2 (λ) and µ ∈ Qλ = {µt = t · λ, t ∈ L

+1 (λ)}. The one-to-one correspondence

t �→ µt between L+1 (λ) and Qλ allows us to transfer the distance H to L

+1 (λ) which

gives, by (1.8),

(1.10) H(t, u) = H(µt, µu) =(1/

√2)∥∥∥

√t −

√u∥∥∥

2for t, u ∈ L

+1 (λ),

where ‖·‖2 stands for the norm in L2(λ). When µ = µs ∈ Qλ it is natural to estimateit by some element µ(X) = s(X) ·λ of Qλ, in which case H(µ(X), µ) = H(s(X), s)and our problem can be viewed as a problem of intensity estimation: design anestimator s(X) ∈ L

+1 (λ) for the unknown intensity s. From now on, given a Poisson

process X with mean measure µ, we shall denote by Eµ and Pµ (or Es and Ps whenµ = µs) the expectations of functions of X and probabilities of events dependingon X, respectively.

1.4. Model based estimation and model selection

It is common practice to try to estimate the intensity s on X by a piecewise constantfunction, i.e. a histogram estimator s(X) belonging to the set

Sm =

D∑

j=1

aj1lIj , aj ≥ 0 for 1 ≤ j ≤ D

Page 47: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 35

of nonnegative piecewise constant functions with respect to the partition {I1, . . . ,ID} = m of X with λ(Ij) > 0 for all j. More generally, given a finite familym = {ϕ1, . . . , ϕD} of elements of L2(λ), we may consider the D-dimensional linearspace Sm generated by the ϕj and try to estimate

√s by some element

√s(X) ∈

Sm. This clearly leads to difficulties since Sm is not a subset of L+2 (λ), but we

shall nevertheless show that it is possible to design an estimator sm(X) with theproperty that

(1.11) Es

[H2 (sm(X), s)

]≤ C

[inf

t∈Sm

∥∥t −√

s∥∥2

2+ |m|

],

where |m| = D stands for the cardinality of m and C is a universal constant. In thisapproach, Sm should be viewed as a model for

√s, which means an approximating

set since we never assume that√

s ∈ Sm and the risk bound (1.11) has (up to theconstant C) the classical structure of the sum of an approximation term inft∈Sm

‖t−√s‖2

2 and an estimation term |m| corresponding to the number of parameters to beestimated.

If we introduce a countable (here countable always means finite or countable)family of models {Sm, m ∈ M} of the previous form, we would like to know towhat extent it is possible to build a new estimator s(X) such that

(1.12) Es

[H2 (s(X), s)

]≤ C ′ inf

m∈M

{inf

t∈Sm

∥∥t −√

s∥∥2

2+ |m|

},

for some other constant C ′, i.e. to know whether one can design an estimator whichrealizes, up to some constant, the best compromise between the two componentsof the risk bound (1.11). The problem of understanding to what extent (1.12) doeshold has been treated in many papers using various methods, mostly based on theminimization of some penalized criterion. A special construction based on testinghas been introduced in Birge [9] and then applied to different stochastic frameworks.We shall show here that this construction also applies to Poisson processes and thenderive the numerous consequences of this property. We shall, in particular, be ableto prove the following result in Section 3.4.1 below.

Theorem 1. Let λ be some positive measure on X and ‖ · ‖2 denote the norm inL2(λ). Let {Sm}m∈M be a finite or countable family of linear subspaces of L2(λ)with respective finite dimensions Dm and let {∆m}m∈M be a family of nonnegativeweights satisfying

(1.13)∑

m∈Mexp[−∆m] ≤ Σ < +∞.

Let X be a Poisson process on X with unknown mean measure µ = µs + µ⊥ wheres ∈ L

+1 (λ) and µ⊥ is orthogonal to λ. One can build an estimator µ = µ(X) =

s(X) · λ ∈ Qλ satisfying, for all µ ∈ Q+(X ) and q ≥ 1,

[Hq(µ, µ)

]≤ C(q) [1 + Σ]

(1.14)×[√

µ⊥(X ) + infm∈M

{inf

t∈Sm

∥∥√s − t∥∥

2+√

Dm ∨ ∆m

}]q

,

with a constant C(q) depending on q only.

Page 48: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

36 L. Birge

When µ = µs ∈ Qλ, (1.14) becomes

(1.15) Es

[Hq(s, s)

]≤ C(q) [1 + Σ] inf

m∈M

{inf

t∈Sm

∥∥√s − t∥∥

2+√

Dm ∨ ∆m

}q

.

Typical examples for X and λ are [0, 1]k with the Lebesgue measure or {1; . . . ; n}with the counting measure. In this last case, the n random variables ΛX({i}) =Ni are independent Poisson variables with respective parameters si = s(i) andobserving X is equivalent to observing a set of n independent Poisson variableswith varying parameters, a framework which is usually studied under the name ofPoisson regression.

1.5. Model selection for Poisson processes, a brief review

Although there have been numerous papers devoted to estimation of the meanmeasure of a Poisson process, only a few, recently, considered the problem of modelselection, the key reference being Reynaud-Bouret [30] with extensions to moregeneral processes in Reynaud-Bouret [31]. A major difference with our approachis her use of the L2(λ)-loss, instead of the Hellinger type loss that we introducehere. It first requires that the unknown mean measure µ be dominated by λ withintensity s and that s ∈ L2(λ). Moreover, as we shall show in Section 2.3 the useof the L2-loss typically requires that s ∈ L∞(λ). This results in rather complicatedassumptions but the advantage of this approach is that it is based on penalizedprojection estimators which can be computed practically while the construction ofour estimators is too computationally intensive to be implemented on a computer,as we shall explain below. The same conclusions essentially apply to all other pa-pers dealing with the subject. The approach of Gregoire and Nembe [21], whichextends previous results of Barron and Cover [8] about density estimation to thatof intensities, has some similarities with ours. The paper by Kolaczyk and Nowak[25] based on penalized maximum likelihood focuses on Poisson regression. Meth-ods which can also be viewed as cases of model selection are those based on thethresholding of the empirical coefficients with respect to some orthonormal basis. Itis known that such a procedure is akin to model selection with models spanned byfinite subsets of a basis. They have been considered in Kolaczyk [24], Antoniadis,Besbeas and Sapatinas [1], Antoniadis and Sapatinas [2] and Patil and Wood [28].

1.6. An overview of the paper

We already justified the introduction of our Hellinger type loss-functions by theproperties of likelihood ratio tests and we shall explain, in the next section, whythe more popular L2-risk is not suitable for our purposes, at least if we want todeal with possibly unbounded intensities. To show this, we shall design a generaltool for getting lower bounds for intensity estimation, which is merely a version ofAssouad’s Lemma [3] for Poisson processes. We shall also show that recent resultsby Rigollet and Tsybakov [32] on aggregation of estimators for density estimationextend straightforwardly to the Poisson case. In Section 3, we briefly recall thegeneral construction of T-estimators introduced in Birge [9] and apply it to thespecific case of Poisson processes. We also provide an illustration based on non-linear approximating models. Section 4 is devoted to various applications of ourmethod based on families of linear models. This section essentially relies on results

Page 49: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 37

from approximation theory about the approximation of different classes of functions(typically smoothness classes) by finite dimensional linear spaces in L2. We also in-dicate how to mix different families of models and introduce an asymptotic point ofview which allows to consider convergence rates and to make a parallel with densityestimation. In Section 5, we deal with aggregation of estimators with some appli-cations to partition selection for histograms. The final Section 6 is devoted to theproof of the most important technical result in this paper, namely the existence andproperties of tests between balls of mean measures. This is the key argument whichis required to apply the construction of T-estimators to the problem of estimatingthe mean measure of a Poisson process. It also has other applications, in particularto the study of Bayesian procedures as done, for instance, in Ghosal, Ghosh andvan der Vaart [20] and subsequent work of van der Vaart and coauthors.

2. Estimation with L2-loss

2.1. From density to intensity estimation

A classical approach to density estimation is based on L2-loss. We assume that theobservations X1, . . . , Xn have a density s1 with respect to some dominating mea-sure λ and that s1 belongs to the Hilbert space L2(λ) with scalar product 〈·, ·〉 andnorm ‖ · ‖2. Given an estimator s(X1, . . . , Xn) we define its risk by E[‖s − s1‖2

2].In this theory, a central role is played by projection estimators as defined by Cen-cov [14]. Model selection based on projection estimators has been considered byBirge and Massart [11]. A more modern treatment can be found in Massart [27].Thresholding estimators based on wavelet expansions as described in Cohen, De-Vore, Kerkyacharian and Picard [15] (see also the many further references therein)can also be viewed as special cases of those. Recently Rigollet and Tsybakov [32]introduced an aggregation method based on projection estimators. Projection esti-mators have the advantage of simplicity and the drawback or requiring somewhatrestrictive assumptions on the density s1 to be estimated, not only that it belongsto L2 but most of the time to L∞. As shown in Birge [10], Section 5.4.1, the factthat s1 belongs to L∞ is essentially a necessary condition to have a control on theL2-risk of estimators of s1.

As indicated in Baraud and Birge [4] Section 4.2, there is a parallel betweenthe estimation of a density s1 from n i.i.d. observations and the estimation ofthe intensity s = ns1 from a Poisson process. This suggests to adapt the knownresults from density estimation to intensity estimation for Poisson processes. Weshall briefly explain how it works, when the Poisson process X has an intensitys ∈ L∞(λ) with L∞-norm ‖s‖∞.

The starting point is to observe that, given an element ϕ ∈ L2(λ), a naturalestimator of 〈ϕ, s〉 is ϕ(X) =

∫ϕdΛX =

∑Ni=1 ϕ(Xi). It follows from (1.1) that

(2.1) Es [ϕ(X)] = 〈ϕ, s〉 and Vars (ϕ(X)) =∫

ϕ2s dλ − 〈ϕ, s〉2 ≤ ‖s‖∞‖ϕ‖22.

Given a D-dimensional linear subspace S′ of L2(λ) with an orthonormal basisϕ1, . . . , ϕD, we can estimate s by the projection estimator with respect to S′:

s(X) =D∑

j=1

[N∑

i=1

ϕj(Xi)

]ϕj .

Page 50: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

38 L. Birge

It follows from (2.1) that its risk is bounded by

(2.2) Es

[‖s(X) − s‖2

2

]≤ inf

t∈S′‖t − s‖2

2 + ‖s‖∞D.

Note that s(X) is not necessarily an intensity since it may take negative values.This can be fixed: replacing s(X) by its positive part can only reduce the risk sinces is nonnegative.

2.2. Aggregation of preliminary estimators

The purpose of this section is to extend some recent results for aggregation ofdensity estimators due to Rigollet and Tsybakov [32] to intensity estimation. Thebasic tool for aggregation in the context of Poisson processes is the procedure of“thinning” which is the equivalent of sample splitting for i.i.d. observations, see forinstance Reiss [29], page 68. Assume that we have at our disposal a Poisson processwith mean measure µ: ΛX =

∑Ni=1 δXi and an independent sequence (Yi)i≥1 of

i.i.d. Bernoulli variables with parameter p ∈ (0, 1). Then the two random measuresΛX1 =

∑Ni=1 YiδXi and ΛX2 =

∑Ni=1(1 − Yi)δXi are two independent Poisson

processes with respective mean measures pµ and (1 − p)µ.Now assume that X is a Poisson process with intensity s with respect to λ,

that X1 and X2 have been derived from X by thinning and that we have at ourdisposal a finite family {sm(X1), m ∈ M} of estimators of ps based on the firstprocess and belonging to L2(λ). They may be projection estimators or others. Theseestimators span a D-dimensional linear subspace of L2(λ) with an orthonormalbasis ϕ1, . . . , ϕD, D ≤ |M|. Working conditionally with respect to X1, we use X2

to build a projection estimator s(X2) of (1−p)s belonging to the linear span of theestimators sm(X1). This is exactly the method used by Rigollet and Tsybakov [32]for density estimation and the proof of their Theorem 2.1 extends straightforwardlyto Poisson processes to give

Theorem 2. The aggregated estimator s based on the processes X1 and X2 bythinning of X satisfies(2.3)

Es

[‖s(X) − (1 − p)s‖2

2

]≤ Es

inf

θ∈RM

∥∥∥∥∥ps −∑

m∈Mθmsm(X1)

∥∥∥∥∥

2

2

+(1− p)‖s‖∞|M|.

Setting s(X) = s(X)/(1 − p) leads to

Es

[‖s(X) − s‖2

2

]≤ 1

(1 − p)2inf

m∈MEs

[‖ps − sm(X1)‖2

2

]+

‖s‖∞|M|1 − p

.

If we start with a finite family {Sm, m ∈ M} of finite-dimensional linear subspacesof L2(λ) with respective dimensions Dm, we may choose for sm(X1) the projectionestimator based on Sm with risk bounded by (2.2)

Es

[‖sm(X1) − ps‖2

2

]≤ inf

t∈Sm

‖t− ps‖22 + p‖s‖∞Dm = p2 inf

t∈Sm

‖t− s‖22 + p‖s‖∞Dm.

Choosing p = 1/2, we conclude that

Es

[‖s(X) − s‖2

2

]≤ inf

m∈M

{inf

t∈Sm

‖t − s‖22 + 2‖s‖∞Dm

}+ 2‖s‖∞|M|.

Page 51: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 39

2.3. Lower bounds for intensity estimation

It is rather inconvenient to get risk bounds involving the unknown and possiblyvery large L∞-norm of s and this problem becomes even more serious if s doesnot belong to L∞(λ). It is, unfortunately, impossible to avoid this problem whendealing with the L2-loss. To show this, let us start with a version of Assouad’sLemma [3] for Poisson processes.

Lemma 1. Let SD = {sδ, δ ∈ D} ⊂ L+1 (λ) be a family of intensities indexed by D =

{0; 1}D and ∆ be the Hamming distance on D given by ∆(δ, δ′) =∑D

j=1 |δj − δ′j |.Let C be the subset of D ×D defined by

C = {(δ, δ′) | ∃k, 1 ≤ k ≤ D with δk = 0, δ′k = 1 and δj = δ′j for j �= k}.

Then for any estimator δ(X) with values in D,

(2.4) supδ∈D

Esδ

[∆(δ(X), δ

)]≥ D

4

1|C|

(δ,δ′)∈Cexp

[−2H2(sδ, sδ′)

] .

If, moreover, SD ⊂ L ⊂ L+1 (λ) and L is endowed with a metric d satisfying

d2(sδ, sδ′) ≥ θ∆(δ, δ′) for all δ, δ′ ∈ D and some θ > 0, then for any estimators(X) with values in L,

(2.5) sups∈SD

Es

[d2 (s(X), s)

]≥ Dθ

16

1|C|

(δ,δ′)∈Cexp

[−2H2(sδ, sδ′)

] .

Proof. To get (2.4) it suffices to find a lower bound for

RB = 2−D∑

δ∈DEsδ

[∆(δ, δ

)]= 2−D

δ∈D

∫ D∑

k=1

∣∣∣δk − δk

∣∣∣ dQsδ,

since the left-hand side of (2.4) is at least as large as the average risk RB . It followsfrom the proof of Lemma 2 in Birge [10] with n = 1 that

RB ≥ 2−D∑

(δ,δ′)∈C

[1 −

√1 − ρ2

(Qsδ

, Qsδ′

)]≥ 2−D−1

(δ,δ′)∈Cρ2(Qsδ

, Qsδ′

).

Then (2.4) follows from (1.6) since |C| = D2D−1. Let now s(X) be an estimatorwith values in L and set δ(X) ∈ D to satisfy d(s, sδ) = infδ∈D d(s, sδ) so that,whatever δ ∈ D, d(sδ, sδ) ≤ 2d(s, sδ). It then follows from our assumptions that

supδ∈D

Esδ

[d2 (s, sδ)

]≥ 1

4supδ∈D

Esδ

[d2(sδ, sδ

)]≥ θ

4supδ∈D

Esδ

[∆(δ(X), δ

)]

and (2.5) follows from (2.4).

The simplest application of this lemma corresponds to the case D = 1 which, inits simplest form, dates back to Le Cam [26]. We consider only two intensities s0

and s1 so that θ = d2(s0, s1) and (2.5) gives, whatever the estimator s(X),

(2.6) maxi=0,1

Esi

[d2 (s(X), si)

]≥ d2(s0, s1)

16exp

[−2H2(s0, s1)

].

Another typical application of the previous lemma to intensities on [0, 1] uses thefollowing construction of a suitable set SD.

Page 52: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

40 L. Birge

Lemma 2. Let D be a positive integer and g be a function on R with support on[0, D−1) satisfying

0 ≤ g(x) ≤ 1 for all x and∫ D−1

0

g2(x) dx = a > 0.

Set, for 1 ≤ j ≤ D and 0 ≤ x ≤ 1, gj(x) = g(x − D−1(j − 1)) and, for δ ∈D, sδ(x) = a−1[1 +

∑Dj=1(δj − 1/2)gj(x)]. Then ‖sδ − sδ′‖2

2 = a−1∆(δ, δ′) andH2(sδ, sδ′) ≥ ∆(δ, δ′)/8 for all δ, δ′ ∈ D. Moreover,

(2.7) |C|−1∑

(δ,δ′)∈Cexp

[−2H2(sδ, sδ′)

]≥ exp[−2/7].

Proof. The first equality is clear. Let us then observe that our assumptions on gimply that 1 − g2(x)/7 ≤

√1 − g2(x)/4 ≤ 1 − g2(x)/8, hence, since the functions

gj have disjoint supports and are translates of g,

H2(sδ, sδ′) = (2a)−1D∑

j=1

|δj − δ′j |∫ D−1

0

[√1 + g(x)/2 −

√1 − g(x)/2

]2dx

= a−1D∑

j=1

|δj − δ′j |∫ D−1

0

[1 −

√1 − g2(x)/4

]dx = c∆(δ, δ′),

with 1/8 ≤ c ≤ 1/7. The conclusions follow.

Corollary 1. For each positive integer D and L ≥ 3D/2, one can find a finite setSD of intensities with the following properties:

(i) it is a subset of some D-dimensional affine subspace of L2([0, 1], dx);(ii) sups∈SD

‖s‖∞ ≤ L;(iii) for any estimator s(X) with values in L2([0, 1], dx) based on a Poisson

process X with intensity s,

(2.8) sups∈SD

Es

[‖s − s‖2

2

]≥ (DL/24) exp[−2/7].

Proof. Let us set θ = 2L/3 ≥ D and apply the construction of Lemma 2 withg(x) =

√D/θ 1l[0,1/D), hence a = θ−1. This results in the set SD with ‖sδ‖∞ ≤

θ[1 + (1/2)

√D/θ

]≤ 3θ/2 = L for all δ ∈ D as required. Moreover ‖sδ − sδ′‖2

2 =θ∆(δ, δ′). Then we use Lemma 1 with d being the distance corresponding to thenorm in L2([0, 1], dx) and (2.5) together with (2.7) result in (2.8).

This result implies that, if we want to use the squared L2-norm as a loss function,whatever the choice of our estimator there is no hope to find risk bounds thatare independent of the L∞-norm of the underlying intensity, even if this intensitybelongs to a finite-dimensional affine space. This provides an additional motivationfor the introduction of loss functions based on the distance H.

3. T-estimators for Poisson processes

3.1. Some notations

Throughout this paper, we observe a Poisson process X on X with unknown meanmeasure µ belonging to the metric space (Q+(X ), H) and have at hand some ref-erence measure λ on X so that µ = µs + µ⊥ with µs ∈ Qλ, s ∈ L

+1 (λ) and µ⊥

Page 53: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 41

orthogonal to λ. We denote by ‖ · ‖i the norm in Li(λ) for 1 ≤ i ≤ ∞ and by d2

the distance corresponding to the norm ‖ · ‖2. We always denote by s the intensityof the part of µ which is dominated by λ and set s1 = s/µs(X ). We also systemat-ically identify Qλ with L

+1 (λ) via the mapping t �→ µt, writing t as a shorthand for

µt ∈ Qλ. We write H(s, S′) for inft∈S′ H(s, t), a∨ b and a∧ b for the maximum andthe minimum respectively of a and b, |A| for the cardinality of a finite set A andN

� = N \ {0} for the set of positive integers. In the sequel C (or C ′, C1, . . .) denoteconstants that may vary from line to line, the form C(a, b) meaning that C is nota universal constant but depends on some parameters a and b.

3.2. Definition and properties of T-estimators

In order to explain our method of estimation and model selection, we need to recallsome general results from Birge [9] about T-estimators that we shall specialize tothe specific framework of this paper. Let (M, d) be some metric space and B(t, r)denote the open ball of center t and radius r in M .

Definition 1. A subset S′ of the metric space (M, d) is called a D-model withparameters η, D and B′ (η, B′, D > 0) if

(3.1) |S′ ∩ B(t, xη)| ≤ B′ exp[Dx2

]for all x ≥ 2 and t ∈ M.

Note that this implies that S′ is at most countable.To estimate the unknown mean measure µ of the Poisson process X, we introduce

a finite or countable family {Sm, m ∈ M} of D-models in (Qλ, H) with respectiveparameters ηm, Dm and B′ and assume that

(3.2) for all m ∈ M, Dm ≥ 1/2 and η2m ≥ (84Dm)/5,

and

(3.3)∑

m∈Mexp

[−η2

m/84]

= Σ < +∞.

Then we set S =⋃

m∈M Sm and, for each t ∈ S,

(3.4) η(t) = inf{ηm |m ∈ M and Sm � t}.

Remark. Note that if we choose for {Sm, m ∈ M} a family of D-models in(Q+(X ), H), S is countable and therefore dominated by some measure λ that wecan always take as our reference measure. This gives an a posteriori justificationfor the choice of a family of models Sm ⊂ Qλ.

Given two distinct points t, u ∈ Qλ we define a test function ψ(X) between tand u as a measurable function from X to {t, u}, ψ(X) = t meaning deciding t andψ(X) = u meaning deciding u. In order to define a T-estimator, we need a family oftest functions ψt,u(X) between distinct points t, u ∈ S with some special properties.The following proposition, to be proved in Section 6 warrants their existence.

Proposition 1. Given two distinct points t, u ∈ S there exists a test ψt,u betweent and u which satisfies

sup{µ∈Q+(X ) |H(µ,µt)≤H(t,u)/4}

Pµ[ψt,u(X) = u]

≤ exp[−(H2(t, u) − η2(t) + η2(u)

)/4],

Page 54: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

42 L. Birge

sup{µ∈Q+(X ) |H(µ,µu)≤H(t,u)/4}

Pµ[ψt,u(X) = t]

≤ exp[−(H2(t, u) − η2(u) + η2(t)

)/4],

and for all µ ∈ Q+(X ),

(3.5) Pµ[ψt,u(X) = u] ≤ exp[(

16H2(µ, µt) + η2(t) − η2(u))/4].

To build a T-estimator, we proceed as follows. We consider a family of tests ψt,u

indexed by the two-points subsets {t, u} of S with t �= u that satisfy the conclusionsof Proposition 1 and we set Rt = {u ∈ S, u �= t |ψt,u(X) = u} for each t ∈ S. Thenwe define the random function DX on S by

DX(t) =

supu∈Rt

{H(t, u)

}if Rt �= ∅;

0 if Rt = ∅.

We call T-estimator derived from S and the family of tests ψt,u(X) any measurableminimizer of the function t �→ DX(t) from S to [0, +∞] so that DX(s(X)) =inft∈S DX(t). Such a minimizer need not exist in general but it actually existsunder our assumptions.

Theorem 3. Let S =⋃

m∈M Sm ⊂ Qλ be a finite or countable family of D-modelsin (Qλ, H) with respective parameters ηm, Dm and B′ satisfying (3.2) and (3.3).Let {ψt,u} be a family of tests indexed by the two-points subsets {t, u} of S witht �= u and satisfying the conclusions of Proposition 1. Whatever µ ∈ Q+(X ), Pµ-a.s. there exists at least one T-estimator s = s(X) ∈ S derived fom this family oftests and any of them satisfies, for all s′ ∈ S,

(3.6) Pµ [H(s′, s) > y] < (B′Σ/7) exp[−y2/6

]for y ≥ 4[H(µ, µs′) ∨ η(s′)].

Setting µ(X) = s(X) · λ and µ = µs + µ⊥ with µs ∈ Qλ and µ⊥ orthogonal to λ,we also get

(3.7) Eµ

[Hq (µ, µ(X))

]≤ C(q)[1 + B′Σ] inf

m∈M

{H(s, Sm) + ηm +

√µ⊥(X )

}q

and, for intensity estimation when µ = µs,

(3.8) Es

[Hq (s, s(X))

]≤ C(q)[1 + B′Σ] inf

m∈M{H(s, Sm) + ηm}q

.

Proof. It follows from Theorem 5 in Birge [9] with a = 1/4, B = 1, κ = 4 andκ′ = 16 that T-estimators do exist, satisfy (3.6) and have a risk which is bounded,for q ≥ 1, by

(3.9) Eµ

[Hq (µ, µ(X))

]≤ C(q)[1 + B′Σ] inf

m∈M

{(inf

t∈Sm

H(µ, µt))∨ ηm

}q

.

In Birge [9], the proof of the existence of T-estimators when M is infinite was givenonly for the case that the tests ψt,u(X) have a special form, namely ψt,u(X) = uwhen γ(u, X) < γ(t, X) and ψt,u(X) = t when γ(u, X) > γ(t, X) for some suitablefunction γ. A minor modification of the proof extends the result to the general

Page 55: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 43

situation based on the assumption that (3.5) holds. It is indeed enough to use (3.5)to modify the proof of (7.18) of Birge [9] in order to get instead

Pµ [ ∃ t ∈ S with ψs′,t(X) = 1 and η(t) ≥ y] −→y→+∞

0.

The existence of s(X) then follows straightforwardly. Since H2(µ, µt) = H2(s, t) +µ⊥(X )/2, (3.7) follows from (3.9).

It follows from (3.7) that the problem of estimating µ with T-estimators alwaysreduces to intensity estimation once a reference measure λ has been chosen. Acomparison of the risk bounds (3.7) and (3.8) shows that the performance of theestimator s(X) is connected to the choice of the models in L

+1 (λ), the component

µ⊥(X ) of the risk depending only on λ. We might as well assume that µ⊥(X ) isknown since this would not change anything concerning the performance of theT-estimators for a given λ. This is why we shall essentially focus, in the sequel, onintensity estimation.

3.3. An application to multivariate intensities

Let us first illustrate Theorem 3 by an application to the estimation of the unknownintensity s (with respect to the Lebesgue measure λ) of a Poisson process on X =[−1, 1]k. For this, we introduce a family of non-linear models related to neural netswhich were popularized in the 90’s by Barron [5, 6] and other authors in view oftheir nice approximation properties with respect to functions of several variables.These models have already been studied in detail in Sections 3.2.2 and 4.2.2 ofBarron, Birge and Massart [7] and we shall therefore refer to this paper for theirproperties. We start with a family of functions φw(x) ∈ L∞([−1, 1]k) indexed by aparameter w belonging to R

k′and satisfying

(3.10) |φw(x) − φw′(x)| ≤ |w − w′|1 for all x ∈ [−1, 1]k,

where | · |1 denotes the l1-norm on Rk′

. Various examples of such families are givenin Barron, Birge and Massart [7] and one can, for instance, set φw(x) = ψ(a′x− b)with ψ a univariate Lipschitz function, a ∈ R

k, b ∈ R and w = (a, b) ∈ Rk+1.

We set M = (N \ {0, 1})3 and for m = (J, R, B) ∈ M we consider the subset ofL∞([−1, 1]k) defined by

S′m =

J∑

j=1

βjφwj (x)

∣∣∣∣∣∣

J∑

j=1

|βj | ≤ R and |wj |1 ≤ B for 1 ≤ j ≤ J

.

As shown in Lemma 5 of Barron, Birge and Massart [7], such a model can beapproximated by a finite subset Tm. More precisely, one can find a subset Tm of S′

m

with cardinality bounded by [2e(2RB + 1)]J(k′+1) and such that if u ∈ S′m, there

exists some t ∈ Tm such that ‖t − u‖∞ ≤ 1. Defining Sm as {t2, t ∈ Tm}, we getthe following property:

Lemma 3. For m = (J, R, B) ∈ (N \ {0, 1})3, we set η2m = 42J(k′ + 1) log(RB).

Then Sm is a D-model with parameters ηm, Dm = [J(k′ + 1)/4] log[2e(2RB + 1)]and 1 in the metric space (L+

1 (λ), H) and (3.2) and (3.3) are satisfied. Moreover,for any s ∈ L

+1 (λ),

(3.11)√

2H(s, Sm) ≤ inft∈S′

m

∥∥√s − t∥∥

2+ 2k/2.

Page 56: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

44 L. Birge

Proof. Since |Sm| ≤ |Tm|, to show that Sm is a D-model with the given parametersit is enough to prove, in view of (3.1), that |Tm| ≤ exp[4Dm], which is clear. Thatη2

m/84 ≥ Dm/5 follows from log[2e(2RB+1)] ≤ 4 log(RB) since RB ≥ 4. Moreover,since k′ + 1 ≥ 2, η2

m ≥ 84J log(RB), hence

m∈Mexp

[−η2

m

84

]≤∑

J≥2

n≥2

n−J

2

≤∑

J≥2

(∫ +∞

3/2

x−J dx

)2

,

so that (3.3) holds. Let now u ∈ S′m. There exists t ∈ Tm such that ‖t − u‖∞ ≤ 1,

hence ‖√s − t‖2 ≤ ‖√s − u‖2 + 2k/2. Then t2 ∈ Sm and since ‖√s −√

t2‖2 ≤‖√s − t‖2, (3.11) follows.

Let now s(X) be a T-estimator derived from the family of D-models {Sm, m ∈M}. By Theorem 3 and Lemma 3, it satisfies

Es

[H2 (s, s(X))

]≤ C inf

m∈M

{inf

t∈S′m

∥∥√s − t∥∥2

2+ 2k + η2

m

}

≤ C(k, k′) infm∈M

{inf

t∈S′m

∥∥√s − t∥∥2

2+ J log(RB)

}.(3.12)

The approximation properties of the models S′m with respect to different classes

of functions have been described in Barron, Birge and Massart [7]. They allow tobound inft∈S′

m‖√s − t‖2 when

√s belongs to such classes so that corresponding

risk bounds can be derived from (3.12).

3.4. Model selection based on linear models

3.4.1. Deriving D-models from linear spaces

In order to apply Theorem 3 we need to introduce suitable families of D-models Sm

in (Qλ, H) with good approximation properties with respect to the unknown s. Moreprecisely, it follows from (3.7) and (1.10) that they should provide approximationsof

√s in L

+2 (λ). Good approximating sets for elements of L

+2 (λ) are provided by

approximation theory and some recipes to derive D-models from such sets have beengiven in Section 6 of Birge [9]. Most results about approximation of functions inL2(λ) deal with finite dimensional linear spaces or unions of such spaces and theirapproximation properties with respect to different classes (typically smoothnessclasses) of functions. We therefore focus here on such linear subspaces of L2(λ).To translate their properties in terms of D-models, we shall invoke the followingproposition.

Proposition 2. Let S be a k-dimensional linear subspace of L2(λ) and δ > 0. Onecan find a subset S′ of Qλ which is a D-model in the metric space (Qλ, H) withparameters δ, 9k and 1 and such that, for any intensity s ∈ L

+1 (λ),

H(s, S′) ≤ 2.2[inft∈S

∥∥√s − t∥∥

2+ δ

].

Proof. Let us denote by BH and B2 the open balls in the metric spaces (L+1 (λ), H)

and (L2(λ), d2) respectively. It follows from Proposition 8 of Birge [9] that one can

Page 57: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Model selection for Poisson processes 45

find a subset T of S which is a D-model of (L2(λ), d2) with parameters δ, k/2 and1 and such that, whatever u ∈ L2(λ), d2(u, T ) ≤ d2(u, S) + δ. It follows that

(3.13)∣∣∣T ∩ B2

(t, 3r′

√2)∣∣∣ ≤ exp

[9k(r′/δ)2

]for r′ ≥ 2δ and t ∈ L2(λ).

Moreover, if t ∈ T , π(t) = max{t, 0} belongs to L+2 (λ) and satisfies d2(u, π(t)) ≤

d2(u, t) for any u ∈ L+2 (λ). We may therefore apply Proposition 12 of Birge [9] with

(M ′, d) = (L2(λ), d2), M0 = L+2 (λ), λ = 1, ε = 1/10, η = 4

√2δ and r = r′

√2 to

get a subset S of π(T ) ⊂ L+2 (λ) such that

(3.14) |S ∩B2

(t, r′

√2)| ≤ |T ∩B2

(t, 3r′

√2)| ∨ 1 for all t ∈ L2(λ) and r′ ≥ 2δ

and d2(u, S) ≤ 3.1d2(u, T ) for all u ∈ L+2 (λ). Setting S′ = {t2 ·λ, t ∈ S)} ⊂ Qλ and

using (1.10), we deduce from (3.13) and (3.14) that

|S′ ∩ BH (µt, r′) | ≤ exp

[9k(r′/δ)2

]for r′ ≥ 2δ and µt ∈ Qλ,

hence S′ is a D-model in (Qλ, H) with parameters δ, 9k and 1, and

H(s, S′) ≤(3.1/

√2)

d2

(√s, T

)< 2.2

[d2

(√s, S

)+ δ

].

We are now in a position to prove Theorem 1. For each m, let us fix η2m =

84[∆m ∨ (9Dm/5)] and use Proposition 2 to derive from Sm a D-model Sm withparameters ηm, Dm = 9Dm and 1 which also satisfies

H(s, Sm) ≤ 2.2[

inft∈Sm

∥∥√s − t∥∥

2+ ηm

].

It follows from the definition of ηm that (3.2) and (3.3) are satisfied so that Theo-rem 3 applies. The conclusion immediately follows from (3.7).

3.4.2. About the computation of T-estimators

We already mentioned that the relevance of T-estimators is mainly of a theoretical nature because of the difficulty of their implementation. Let us give here a simple illustrative example based on a single linear approximating space $\overline{S}$ for $\sqrt{s}$, of dimension $k$. To try to get a practical implementation, we shall use a simple discretization strategy. The first step is to replace $\overline{S}$, which we identify with $\mathbb{R}^k$ via the choice of a basis, by $\theta\mathbb{Z}^k$. This provides an $\eta$-net for $\mathbb{R}^k$ with respect to the Euclidean distance, with $\eta^2 = k(\theta/2)^2$. Let us concentrate here on the case of a large value of $\Gamma^2 = \int s\,d\lambda$ in order to have a large number of observations, since $N$ has a Poisson distribution with parameter $\Gamma^2$. In particular, we shall assume that $\Gamma^2$ (which plays the role of the number of observations, as we shall see in Section 4.6) is much larger than $k$. It is useless, in such a case, to use the whole of $\theta\mathbb{Z}^k$ to approximate $\sqrt{s}$ since the closest point to $\sqrt{s}$ belongs to $B(0, \Gamma + \eta)$. Of course, $\Gamma$ is unknown, but when it is large it can be safely estimated by $\sqrt{N}$ in view of the concentration properties of Poisson variables. Let us therefore assume that $N \ge \Gamma^2/2 \ge 2k$. A reasonable approximating set for $\sqrt{s}$ is therefore $T = B(0, \sqrt{2N} + \eta) \cap \theta\mathbb{Z}^k$ and, since our final model $S$ should be a subset of $L_2^+(\lambda)$, we can take $S = \{t \vee 0,\ t \in T\}$ so that $d_2(\sqrt{s}, S) \le d_2(\sqrt{s}, T) \le d_2(\sqrt{s}, \overline{S}) + \eta$. It follows from Lemma 5 of Birgé [9] that
$$|S| \le |T| \le (\pi e/2)^{k/2} \sqrt{\pi k} \left( \frac{2\sqrt{2N} + 2\eta}{\theta\sqrt{k}} + 1 \right)^{k} < K = \bigl[c\bigl(\sqrt{2N}\,\eta^{-1} + 1\bigr)\bigr]^{k},
$$


with $c = \sqrt{\pi e/2} \approx 2.07$. This implies that $S$ is a D-model with parameters $\eta$, $(\log K)/4$ and $1$. In order that (3.2) be satisfied, we need that $\eta^2 \ge 4.2 \log K$. If we choose $\eta^2 = 4.2\,k \log\bigl(c(\sqrt{N/k} + 1)\bigr)$, this inequality holds since $\eta \ge 2\sqrt{k}$, hence $K \le [c(\sqrt{N/k} + 1)]^k$. The number of tests required for building the T-estimator is $|S|(|S| - 1) < K^2$. For $N$ of the order of 100 and $k$ as small as 5, $K^2$ is of the order of $10^{10}$. This toy example illustrates the difficulty of implementing the algorithm. More realistic ones would be much worse.
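As a quick sanity check on these orders of magnitude, the following Python sketch evaluates the bound $K$ and the resulting number of pairwise tests; the constant $c$ and the choice of $\eta$ follow the formulas above (this is purely illustrative, not part of the estimation procedure):

```python
import math

def test_count(N, k):
    """Size of the discretized model and number of pairwise tests.

    Uses the bound K = [c (sqrt(N/k) + 1)]^k with c = sqrt(pi*e/2),
    corresponding to the choice eta^2 = 4.2 k log(c (sqrt(N/k) + 1)).
    """
    c = math.sqrt(math.pi * math.e / 2)      # ~ 2.07
    K = (c * (math.sqrt(N / k) + 1)) ** k    # bound on |S|
    return K, K * (K - 1)                    # tests: |S|(|S|-1) < K^2

K, tests = test_count(N=100, k=5)
print(f"K ~ {K:.2e}, pairwise tests ~ {tests:.2e}")
# K ~ 1.9e+05, tests ~ 3.5e+10 -- infeasible already for this toy problem
```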

4. Applications with linear models

We now assume that $\mu = \mu_s = s \cdot \lambda$ and focus on the estimation of the intensity $s$ by model selection, starting with linear models in $L_2(\lambda)$ that possess good approximation properties with respect to $\sqrt{s}$.

4.1. Adaptation in Besov spaces

It is now well known that wavelet bases are very good tools for representing smooth functions in $L_2([0,1]^l, dx)$. In particular, given a suitable wavelet basis $\{\varphi_{j,k},\ j \ge -1,\ k \in \Lambda(j)\}$ with $|\Lambda(-1)| \le \Gamma$ and $2^{jl} \le |\Lambda(j)| \le \Gamma 2^{jl}$ for all $j \ge 0$, any function $f \in L_2([0,1]^l, dx)$ can be written as $f = \sum_{j=-1}^{\infty} \sum_{k \in \Lambda(j)} \beta_{j,k}\varphi_{j,k}$. Moreover $f$ belongs to the Besov space $B^\alpha_{p,\infty}([0,1]^l)$ if and only if
$$\sup_{j \ge 0}\ 2^{j(\alpha + \frac{l}{2} - \frac{l}{p})} \Bigl( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Bigr)^{1/p} = |f|_{B^\alpha_{p,\infty}} < +\infty, \tag{4.1}$$
and it belongs to $B^\alpha_{p,q}([0,1]^l)$ with $q < +\infty$ if
$$\sum_{j \ge 0} \Bigl[ 2^{j(\alpha + \frac{l}{2} - \frac{l}{p})} \Bigl( \sum_{k \in \Lambda(j)} |\beta_{j,k}|^p \Bigr)^{1/p} \Bigr]^q = |f|^q_{B^\alpha_{p,q}} < +\infty.$$
Many properties of those function spaces are to be found in DeVore and Lorentz [19], DeVore [17] and Härdle, Kerkyacharian, Picard and Tsybakov [22], among other references.

As a consequence of Theorem 1, we can derive an adaptation result for the estimation of the intensity of a Poisson process when it belongs to some Besov space on $[0,1]^l$.

Theorem 4. Let $X$ be a Poisson process with unknown intensity $s$ with respect to Lebesgue measure on $[0,1]^l$. Let us assume that $\sqrt{s}$ belongs to some Besov space $B^\alpha_{p,\infty}([0,1]^l)$ for some unknown values of $p > 0$, $\alpha > l(1/p - 1/2)_+$ and $|\sqrt{s}|_{B^\alpha_{p,\infty}}$ given by (4.1). One can build a T-estimator $\hat{s}(X)$ such that
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C(\alpha, p, l) \bigl[ |\sqrt{s}|_{B^\alpha_{p,\infty}} \vee 1 \bigr]^{2l/(2\alpha + l)}. \tag{4.2}$$

Proof. We just use Proposition 13 of Birgé [9], which provides suitable families $\mathcal{M}_j(2^i)$ of linear approximation spaces for functions in $B^\alpha_{p,\infty}([0,1]^l)$, and use the family of linear spaces $\{S_m\}_{m \in \mathcal{M}}$ with $\mathcal{M} = \bigcup_{i \ge 1}\bigcup_{j \ge 0} \mathcal{M}_j(2^i)$ provided by this proposition. Then, for $m \in \mathcal{M}_j(2^i)$, $D_m \le c_1(2^i) + c_2(2^i)2^{jl}$ and we choose $\Delta_m = c_3(2^i)2^{jl} + i + j$, which implies that (1.13) holds with $\Sigma < 1$. Applying Proposition 13 of Birgé [9] with $t = \sqrt{s}$, $r = 2^i > \alpha \ge 2^{i-1}$ and $q = 2$, we derive from Theorem 1 that, if $R = |\sqrt{s}|_{B^\alpha_{p,\infty}} \vee 1$,
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C \inf_{j \ge 0} \bigl\{ C(\alpha, p, l) R^2 2^{-2j\alpha} + c_4(\alpha) 2^{jl} \bigr\}.$$
Choosing for $j$ the smallest integer such that $2^{j(l + 2\alpha)} \ge R^2$ leads to the result.

4.2. Anisotropic Hölder spaces

Let us recall that a function $f$ defined on $[0,1)$ belongs to the Hölder class $\mathcal{H}(\alpha, R)$ with $\alpha = \beta + p$, $p \in \mathbb{N}$, $0 < \beta \le 1$ and $R > 0$ if $f$ has a derivative of order $p$ satisfying $|f^{(p)}(x) - f^{(p)}(y)| \le R|x - y|^\beta$ for all $x, y \in [0,1)$. Given two multi-indices $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$ and $\mathbf{R} = (R_1, \ldots, R_k)$ in $(0, +\infty)^k$, we define the anisotropic Hölder class $\mathcal{H}(\boldsymbol{\alpha}, \mathbf{R})$ as the set of functions $f$ on $[0,1)^k$ such that, for each $j$ and each set of $k-1$ coordinates $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$, the univariate function $y \mapsto f(x_1, \ldots, x_{j-1}, y, x_{j+1}, \ldots, x_k)$ belongs to $\mathcal{H}(\alpha_j, R_j)$.

Let now a multi-integer $\mathbf{N} = (N_1, \ldots, N_k) \in (\mathbb{N}^*)^k$ be given. To it corresponds the hyperrectangle $\prod_{j=1}^k [0, N_j^{-1})$ and the partition $\mathcal{I}_{\mathbf{N}}$ of $[0,1)^k$ into $\prod_{j=1}^k N_j$ translates of this hyperrectangle. Given an integer $r \in \mathbb{N}$ and $m = (\mathbf{N}, r)$, we can define the linear space $S_m$ of piecewise polynomials on the partition $\mathcal{I}_{\mathbf{N}}$ with degree at most $r$ with respect to each variable. Its dimension is $D_m = (r+1)^k \prod_{j=1}^k N_j$. Setting $\mathcal{M} = (\mathbb{N}^*)^k \times \mathbb{N}$ and $\Delta_m = D_m$, we get (1.13) with $\Sigma$ depending only on $k$, as shown in the proof of Proposition 5, page 346 of Barron, Birgé and Massart [7]. The same proof also implies (see (4.25), page 347) the following approximation lemma.

Lemma 4. Let $f \in \mathcal{H}(\boldsymbol{\alpha}, \mathbf{R})$ with $\alpha_j = \beta_j + p_j$, $r \ge \max_{1 \le j \le k} p_j$, $\mathbf{N} = (N_1, \ldots, N_k) \in (\mathbb{N}^*)^k$ and $m = (\mathbf{N}, r)$. There exists some $g \in S_m$ such that
$$\|f - g\|_\infty \le C(k, r) \sum_{j=1}^k R_j N_j^{-\alpha_j}.$$

We are now in a position to state the following corollary of Theorem 1.

Corollary 2. Let $X$ be a Poisson process with unknown intensity $s$ with respect to the Lebesgue measure on $[0,1)^k$ and $\hat{s}$ be a T-estimator based on the family of linear models $\{S_m,\ m \in \mathcal{M}\}$ that we have previously defined. Assume that $\sqrt{s}$ belongs to the class $\mathcal{H}(\boldsymbol{\alpha}, \mathbf{R})$ and set
$$\overline{\alpha} = \Bigl( k^{-1}\sum_{j=1}^k \alpha_j^{-1} \Bigr)^{-1} \quad \text{and} \quad \overline{R} = \Bigl( \prod_{j=1}^k R_j^{1/\alpha_j} \Bigr)^{\overline{\alpha}/k}.$$
If $R_j \ge \overline{R}^{\,k/(2\overline{\alpha} + k)}$ for all $j$, then
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C(k, \boldsymbol{\alpha})\, \overline{R}^{\,2k/(2\overline{\alpha} + k)}.$$


Proof. If $\alpha_j = \beta_j + p_j$ for $1 \le j \le k$, let us set $r = \max_{1 \le j \le k} p_j$, $\eta = \overline{R}^{\,k/(2\overline{\alpha} + k)}$ and define $N_j \in \mathbb{N}^*$ by $(R_j/\eta)^{1/\alpha_j} \le N_j < (R_j/\eta)^{1/\alpha_j} + 1$, so that $N_j < 2(R_j/\eta)^{1/\alpha_j}$ for all $j$. It follows from Lemma 4 that there exists some $t \in S_m$, $m = (\mathbf{N}, r)$, with $\|\sqrt{s} - t\|_\infty \le C_1(k, \boldsymbol{\alpha})\sum_{j=1}^k R_j N_j^{-\alpha_j}$, hence $\|\sqrt{s} - t\|_2 \le kC_1(k, \boldsymbol{\alpha})\eta$. It then follows from Theorem 1 that
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C_2(k, \boldsymbol{\alpha})\Bigl[ \eta^2 + (r+1)^k\prod_{j=1}^k N_j \Bigr] \le C_3(k, \boldsymbol{\alpha})\bigl[ \eta^2 + \overline{R}^{\,k/\overline{\alpha}}\,\eta^{-k/\overline{\alpha}} \bigr].$$
The conclusion follows.

4.3. Intensities with bounded α-variation

Let us first recall that a function $f$ defined on some interval $J \subset \mathbb{R}$ has bounded $\alpha$-variation on $J$ for some $\alpha \in (0,1]$ if
$$\sup_{i \ge 1}\ \sup_{\substack{x_0 < \cdots < x_i \\ x_j \in J \text{ for } 0 \le j \le i}}\ \sum_{j=1}^i |f(x_j) - f(x_{j-1})|^{1/\alpha} = [V_\alpha(f; J)]^{1/\alpha} < +\infty, \tag{4.3}$$
the classical case of bounded variation corresponding to $\alpha = 1$. This formulation, using the power $1/\alpha$ (instead of $\alpha$), implies that an $\alpha$-Hölderian function has bounded $\alpha$-variation over any finite interval $J$. We want to build a family of linear models which are suitable for estimating intensities $s$ with support on some interval $J$ of finite length $L$ and such that $\sqrt{s}$ has bounded $\alpha$-variation on $J$ for some unknown value of $\alpha$. These models are linear spaces of piecewise constant functions on some finite partitions $m$ of $J$, namely
$$S_m = \Bigl\{ t = \sum_{j=1}^D a_j \mathbb{1}_{I_j} \Bigr\} \quad \text{when } m = \{I_1, \ldots, I_D\}.$$

We consider for $\mathcal{M}$ a special family of partitions $m$ of $J$ derived by dyadic splitting, which are in one-to-one correspondence with the family of complete binary trees. They are built according to the following "adaptive" algorithm described in Section 3.3 of DeVore [17]. This algorithm simultaneously grows a complete binary tree and a dyadic partition of $J$. It starts with a tree reduced to its root, which is associated to the interval $J$. At each step of the algorithm the set of terminal nodes of the current tree is associated to the set of intervals in the current partition. Each step of the algorithm corresponds to choosing one terminal node and adding two sons to it. For the associated partition this means dividing the interval which corresponds to this terminal node into two intervals of equal length, which then correspond to the two sons. At some stage the procedure stops and we end with a complete binary tree with $D$ terminal nodes and the associated partition of $J$ into $D$ intervals. We actually take for $\mathcal{M}$ the set of all finite partitions $m$ that can be built in that way, so that each $m$ corresponds to the complete binary tree with $|m|$ terminal nodes that was used to build the partition.
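To make the splitting scheme concrete, here is a minimal Python sketch of the adaptive algorithm. The stopping criterion `crit` is user-supplied; the proof of Proposition 3 below instantiates it as $E(I) = |I|V^2(I) > \varepsilon$, with $V$ the $\alpha$-variation over $I$. All names and the `crit` interface are ours, for illustration only:

```python
def adaptive_partition(a, b, crit, max_depth=20):
    """Grow a dyadic partition of [a, b) by repeated bisection.

    crit(left, right) returns True when the interval [left, right)
    is judged too "rough" and must be split into two equal halves.
    The result is the list of intervals of the final partition,
    i.e. the terminal nodes of a complete binary tree.
    """
    partition, stack = [], [(a, b, 0)]
    while stack:
        left, right, depth = stack.pop()
        if depth < max_depth and crit(left, right):
            mid = (left + right) / 2           # split into two equal halves
            stack.append((left, mid, depth + 1))
            stack.append((mid, right, depth + 1))
        else:
            partition.append((left, right))    # keep as a terminal node
    return sorted(partition)

# Example: split until |I| * V(I)^2 <= eps, with V(I) crudely approximated
# on a grid for f playing the role of sqrt(s) (alpha = 1, illustration only).
import math
f = lambda x: math.sqrt(1 + 0.5 * math.sin(8 * math.pi * x))
def crit(l, r, eps=1e-3, grid=64):
    xs = [l + (r - l) * i / grid for i in range(grid + 1)]
    V = sum(abs(f(u) - f(v)) for u, v in zip(xs, xs[1:]))
    return (r - l) * V ** 2 > eps

print(len(adaptive_partition(0.0, 1.0, crit)), "intervals")
```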

It is known that the number of complete binary trees with $j+1$ terminal nodes is given by the so-called Catalan numbers $(1+j)^{-1}\binom{2j}{j} \le 4^j/(1+j)$, as explained for instance in Stanley [33], page 172. Setting $\Delta_m = 2|m|$ leads to
$$\sum_{m \in \mathcal{M}} \exp[-\Delta_m] = \sum_{j \ge 0}\ \sum_{\{m \in \mathcal{M}\,\mid\,|m| = 1+j\}} \exp[-2(j+1)] \le \sum_{j \ge 0} \frac{4^j\exp[-2(j+1)]}{j+1} = e^{-2}\sum_{j \ge 0} \frac{(2/e)^{2j}}{j+1} < 1. \tag{4.4}$$
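A two-line numerical check of (4.4) (illustrative only):

```python
import math

# Catalan number C_j counts complete binary trees with j+1 terminal nodes.
catalan = lambda j: math.comb(2 * j, j) // (j + 1)
total = sum(catalan(j) * math.exp(-2 * (j + 1)) for j in range(200))
print(total)  # ~ 0.16 < 1, so (1.13) holds with Sigma < 1
```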

The approximation properties of $\bigcup_{m \in \mathcal{M}} S_m$ with respect to functions of bounded $\alpha$-variation are given by the following proposition, the proof of which was kindly communicated to the author by Ron DeVore [18].

Proposition 3. Let $f$ be a function of bounded $\alpha$-variation on the interval $J$ of finite length $L$ with $\alpha$-variation $V_\alpha(f; J)$ given by (4.3). For each $j \in \mathbb{N}$, one can find a partition $m \in \mathcal{M}$ with
$$|m| \le c_1(\alpha)2^j \quad \text{and} \quad \inf_{t \in S_m}\|f - t\|_2 \le c_2(\alpha)L^{1/2}V_\alpha(f; J)2^{-j\alpha}, \tag{4.5}$$
with $1 < c_1(\alpha) = \bigl(1 - 2^{-[1/(2\alpha)+1]}\bigr)\bigl(1 - 2^{-1/(2\alpha)}\bigr)^{-1} < 2.21$ and
$$\sqrt{2} < c_2(\alpha) = \Biggl[\frac{2^{1+2\alpha}\bigl(1 - 2^{-[1/(2\alpha)+1]}\bigr)^{1-2\alpha}}{1 - 2^{-1/(2\alpha)}}\Biggr]^{1/2} < 6.51.$$

Proof. For any interval $I \subset J$ we denote by $|I|$ its length and set $V(I) = V_\alpha(f; I)$. If $m = \{I_1, \ldots, I_D\}$ is a partition of $J$ into $D$ intervals, $f_j = |I_j|^{-1}\int_{I_j} f(x)\,dx$ and $\overline{f} = \sum_{j=1}^D f_j\mathbb{1}_{I_j}$, then $\|(f - f_j)\mathbb{1}_{I_j}\|_\infty \le V(I_j)$, hence
$$\bigl\|f - \overline{f}\bigr\|_2^2 \le \sum_{j=1}^D E(I_j) \quad \text{with } E(I) = |I|V^2(I). \tag{4.6}$$
In particular (4.5) holds with $m = \{J\}$ and $j = 0$. To study the general case we choose some $\varepsilon > 0$ and apply the adaptive algorithm described just before in the following way: at each step we inspect the intervals of the partition and if we find an interval $I$ with $E(I) > \varepsilon$ we divide it into two intervals of equal length $|I|/2$. The algorithm necessarily stops since $E(I) \le |I|V^2(J)$ for all $I \subset J$, and this results in some partition $m$ with $E(I) \le \varepsilon$ for all $I \in m$. It follows from (4.6) that if $\overline{f}$ is built on this partition, then $\|f - \overline{f}\|_2^2 \le \varepsilon|m|$. Since the case $|m| = 1$ has already been considered, we may assume that $|m| \ge 2$. Let us denote by $D_k$ the number of intervals in $m$ with length $L2^{-k}$ and set $a_k = 2^{-k}D_k$, so that $\sum_{k \ge 1} a_k = 1$ (since $D_0 = 0$). If $I$ is an interval of length $L2^{-k}$, $k > 0$, it derives from the splitting of an interval $I'$ with length $L2^{-k+1}$ such that $E(I') > \varepsilon$, hence, by (4.6), $V(I') > [\varepsilon L^{-1}2^{k-1}]^{1/2}$ and, since the set function $V^{1/\alpha}$ is subadditive over disjoint intervals, the number of such intervals $I'$ is bounded by $[V(J)]^{1/\alpha}[\varepsilon L^{-1}2^{k-1}]^{-1/(2\alpha)}$. It follows that
$$D_k \le \gamma 2^{-k/(2\alpha)} \quad \text{and} \quad a_k \le \gamma 2^{-k/(2\alpha)-k} \quad \text{with } \gamma = 2[V(J)]^{1/\alpha}[\varepsilon/(2L)]^{-1/(2\alpha)}.$$
Since $|m| = \sum_{k \ge 1} 2^k a_k$, we can derive a bound on $|m|$ from a maximization of $\sum_{k \ge 1} 2^k a_k$ under the restrictions $\sum_{k \ge 1} a_k = 1$ and $a_k \le \gamma 2^{-k[1/(2\alpha)+1]}$. One should then clearly keep the largest possible indices $k$ with the largest possible values for $a_k$. Let us fix $\varepsilon$ so that $\gamma = \bigl(1 - 2^{-[1/(2\alpha)+1]}\bigr)2^{j[1/(2\alpha)+1]}$ for some $j \ge 1$. Then, setting $a_k$ to its maximal value, we get $\sum_{k \ge j}\gamma 2^{-k[1/(2\alpha)+1]} = 1$, which implies that an upper bound for $|m|$ is
$$|m| \le \sum_{k \ge j}\gamma 2^k 2^{-k[1/(2\alpha)+1]} = \frac{\gamma 2^{-j/(2\alpha)}}{1 - 2^{-1/(2\alpha)}} = \frac{1 - 2^{-[1/(2\alpha)+1]}}{1 - 2^{-1/(2\alpha)}}\,2^j.$$
The corresponding value of $\varepsilon$ is $2L(\gamma/2)^{-2\alpha}V^2(J)$, so that
$$\bigl\|f - \overline{f}\bigr\|_2^2 \le \varepsilon|m| \le 2LV^2(J)2^{2\alpha}\frac{\gamma^{1-2\alpha}2^{-j/(2\alpha)}}{1 - 2^{-1/(2\alpha)}} = \frac{2LV^2(J)2^{2\alpha}\bigl(1 - 2^{-[1/(2\alpha)+1]}\bigr)^{1-2\alpha}}{1 - 2^{-1/(2\alpha)}}\,2^{-2\alpha j}.$$
These two bounds give (4.5) and we finally use the fact that $0 < \alpha \le 1$ to bound the two constants.

We can then derive from this proposition, (1.15) and our choice of the $\Delta_m$ that
$$\mathbb{E}_s\bigl[H^q(s, \hat{s})\bigr] \le C(q)\inf_{j \in \mathbb{N}}\Bigl\{2^{j/2} + L^{1/2}V_\alpha\bigl(\sqrt{s}; J\bigr)2^{-j\alpha}\Bigr\}^q.$$
An optimization with respect to $j \in \mathbb{N}$ then leads to the following risk bound.

An optimization with respect to j ∈ N then leads to the following risk bound.

Corollary 3. Let X be a Poisson process with unknown intensity s with respectto the Lebesgue measure on some interval J of length L. We assume that

√s has

finite α-variation equal to V on J , both α and V being unknown. One can build aT-estimator s(X) such that

(4.7) Es

[Hq(s, s)

]≤ C(q)

[(L1/2V

)∨ 1

]q/(2α+1)

.

It is not difficult to show, using Assouad’s Lemma, that, up to a constant, thisbound is optimal when q = 2.

Proposition 4. Let $L$, $\alpha$ and $V$ be given and let $\mathcal{S} \subset L_1^+(\lambda)$ be the set of intensities with respect to the Lebesgue measure on $[0, L)$ such that $\sqrt{s}$ has $\alpha$-variation bounded by $V$. Let $\hat{s}(X)$ be any estimator based on a Poisson process $X$ with unknown intensity $s \in \mathcal{S}$. There exists a universal constant $c > 0$ (independent of $s$, $L$, $\alpha$ and $V$) such that
$$\sup_{s \in \mathcal{S}}\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \ge c\bigl[\bigl(L^{1/2}V\bigr) \vee 1\bigr]^{2/(2\alpha+1)}.$$

Proof. If $L^{1/2}V < 1$, we simply apply (2.6) with $s_0 = \mathbb{1}_{[0,L)}$ and $s_1 = (1 + L^{-1/2})^2\mathbb{1}_{[0,L)}$, so that $2H^2(s_0, s_1) = 1$. If $L = 1$ and $V \ge 1$ we fix some positive integer $D$ and define $g$ with support on $[0, D^{-1})$ by
$$g(x) = x\mathbb{1}_{[0,(2D)^{-1})}(x) + \bigl(D^{-1} - x\bigr)\mathbb{1}_{[(2D)^{-1},D^{-1})}(x).$$
Then $\int_0^{1/D} g^2(x)\,dx = (12D^3)^{-1}$ and $0 \le g(x) \le (2D)^{-1}$. If we apply the construction of Lemma 2, we get a family of Lipschitz intensities $s_\delta$ with values in the interval $[12D^3 - 3D^2, 12D^3 + 3D^2] \subset [9D^3, 15D^3]$ and Lipschitz coefficient $6D^3$. It follows that if $0 \le x < y \le 1$,
$$\Bigl|\sqrt{s_\delta(x)} - \sqrt{s_\delta(y)}\Bigr| \le \frac{|s_\delta(x) - s_\delta(y)|}{6D^{3/2}} \le \frac{\bigl(6D^2\bigr) \wedge \bigl(6D^3|x-y|\bigr)}{6D^{3/2}} \le \sqrt{D}\,[1 \wedge (D|x-y|)].$$


This allows us to bound the $\alpha$-variation of $\sqrt{s_\delta}$ in the following way. For any increasing sequence $0 \le x_0 < \cdots < x_i \le 1$,
$$\sum_{j=1}^i\Bigl|\sqrt{s_\delta(x_j)} - \sqrt{s_\delta(x_{j-1})}\Bigr|^{1/\alpha} \le D^{1/(2\alpha)}\sum_{j=1}^i\mathbb{1}_{\{x_j - x_{j-1} \ge D^{-1}\}} + D^{3/(2\alpha)}\sum_{j=1}^i\mathbb{1}_{\{x_j - x_{j-1} < D^{-1}\}}(x_j - x_{j-1})^{1/\alpha}.$$
If $n = \sum_{j=1}^i\mathbb{1}_{\{x_j - x_{j-1} \ge D^{-1}\}} \le D$, then
$$D^{3/(2\alpha)}\sum_{j=1}^i\mathbb{1}_{\{x_j - x_{j-1} < D^{-1}\}}(x_j - x_{j-1})^{1/\alpha} \le D^{3/(2\alpha)}D^{-1/\alpha}(D - n) = D^{1/(2\alpha)}(D - n),$$
which shows that the $\alpha$-variation of $\sqrt{s_\delta}$ is bounded by $[D^{1/(2\alpha)}D]^\alpha = D^{(1+2\alpha)/2}$. We finally choose for $D$ the largest integer $j$ such that $j^{(1+2\alpha)/2} \le V$. Then $V^{2/(1+2\alpha)} < 2D$ and an application of Lemmas 1 and 2 shows that
$$\sup_{s \in \mathcal{S}_D}\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \ge 2^{-8}(2D)\exp[-2/7] \ge 2^{-8}\exp[-2/7]\,V^{2/(1+2\alpha)},$$
which proves our lower bound. The general case $L^{1/2}V \ge 1$ follows from a scaling argument. If $X$ is a Poisson process on $[0, L]$ with intensity $s$ (with respect to the Lebesgue measure), then $Y = L^{-1}X$ is a Poisson process on $[0,1]$ with intensity $s_L$ to which the previous results apply. Since $s_L(y) = Ls(Ly)$, it follows that $H^2(s, t) = H^2(s_L, t_L)$ and, if $\sqrt{s}$ has $\alpha$-variation bounded by $V$, $\sqrt{s_L}$ has $\alpha$-variation bounded by $L^{1/2}V$. The result for an arbitrary $L$ follows from these remarks.

4.4. Intensities with square roots in weak $\ell_q$-spaces

4.4.1. Approximation based on weak $\ell_q$-spaces

As we already mentioned, if $s \in L_1^+(\lambda)$ is an intensity with respect to $\lambda$ on $\mathcal{X}$ and we are given an orthonormal basis $\{\varphi_j,\ j \ge 1\}$ of $L_2(\lambda)$, $\sqrt{s}$ can be written as $\sum_{j \ge 1}\beta_j\varphi_j$ with $\beta = (\beta_j)_{j \ge 1} \in \ell_2 = \ell_2(\mathbb{N}^*)$ and $\sum_{j \ge 1}\beta_j^2 = \|\sqrt{s}\|_2^2 < +\infty$. Hence, for all $x > 0$, $|\{j \ge 1 \mid |\beta_j| \ge x\}| \le \|\sqrt{s}\|_2^2 x^{-2}$, which means that the sequence $\beta$ belongs to the weak $\ell_2$-space $\ell_2^w$.

More generally, given a sequence $\beta = (\beta_j)_{j \ge 1}$ converging to zero and $(a_j)_{j \ge 1}$ the rearrangement of the numbers $|\beta_j|$ in nonincreasing order (which means that $a_1 = \sup_{j \ge 1}|\beta_j|$, etc.), we say that $\beta$ belongs to the weak $\ell_q$-space $\ell_q^w$ ($q > 0$) if
$$\sup_{x > 0}x^q\,|\{j \ge 1 \mid |\beta_j| \ge x\}| = \sup_{x > 0}x^q\,|\{j \ge 1 \mid a_j \ge x\}| = |\beta|_{q,w}^q < +\infty. \tag{4.8}$$
This implies that $a_j \le |\beta|_{q,w}j^{-1/q}$ for $j \ge 1$, and the converse actually holds:
$$|\beta|_{q,w} = \inf\bigl\{y > 0 \mid a_j \le yj^{-1/q} \text{ for all } j \ge 1\bigr\}. \tag{4.9}$$

Note that, although $|\theta\beta|_{q,w} = |\theta||\beta|_{q,w}$ for $\theta \in \mathbb{R}$, $|\beta|_{q,w}$ is not a norm. For convenience, we shall call it the weight of $\beta$ in $\ell_q^w$. By extension, given the basis $\{\varphi_j,\ j \ge 1\}$, we shall say that $u \in L_2(\lambda)$ belongs to $\ell_q^w$ if $u = \sum_{j \ge 1}\beta_j\varphi_j$ and $\beta \in \ell_q^w$. As a consequence of this control on the size of the coefficients $a_j$, we get the following useful lemma.

Lemma 5. Let $\beta \in \ell_q^w$ with weight $|\beta|_{q,w}$ for some $q > 0$ and let $(a_j)_{j \ge 1}$ be the nonincreasing rearrangement of the numbers $|\beta_j|$. Then $\beta \in \ell_p$ for $p > q$ and, for all $n \ge 1$,
$$\sum_{j > n}a_j^p \le \frac{q}{p-q}\,|\beta|_{q,w}^p\,(n + 1/2)^{-(p-q)/q}. \tag{4.10}$$

Proof. By (4.9) and convexity,
$$\sum_{j > n}a_j^p \le |\beta|_{q,w}^p\sum_{j > n}j^{-p/q} \le |\beta|_{q,w}^p\int_{n+1/2}^{+\infty}x^{-p/q}\,dx.$$
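In computational terms, the weak-$\ell_q$ weight and the tail bound (4.10) only involve the sorted coefficient magnitudes, as the following Python sketch illustrates; it also shows the greedy choice of the $D$ largest coefficients behind (4.11) below. Function names are ours and the example sequence is synthetic:

```python
import numpy as np

def weak_lq_weight(beta, q):
    """|beta|_{q,w} via (4.9): smallest y with a_j <= y * j^(-1/q)."""
    a = np.sort(np.abs(beta))[::-1]          # nonincreasing rearrangement
    j = np.arange(1, len(a) + 1)
    return np.max(a * j ** (1.0 / q))

def best_terms(beta, D):
    """Indices of the D largest |beta_j|: the greedy D-term approximation."""
    return np.argsort(np.abs(beta))[::-1][:D]

rng = np.random.default_rng(0)
beta = rng.standard_normal(1000) * np.arange(1, 1001) ** (-1.5)  # fast decay
q, p, n = 0.8, 2.0, 50
a = np.sort(np.abs(beta))[::-1]
tail = np.sum(a[n:] ** p)                    # sum_{j>n} a_j^p
bound = q / (p - q) * weak_lq_weight(beta, q) ** p * (n + 0.5) ** (-(p - q) / q)
print(tail <= bound)                         # True: (4.10) holds
```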

As explained in great detail in Kerkyacharian and Picard [23] and Cohen, DeVore, Kerkyacharian and Picard [15], the fact that $u \in \ell_q^w$ for some $q < 2$ has important consequences for the approximation of $u$ by functions in suitable $D$-dimensional spaces. For $m$ any finite subset of $\mathbb{N}^*$, let us define $S_m$ as the linear span of $\{\varphi_j,\ j \in m\}$. If $u = \sum_{j \ge 1}\beta_j\varphi_j$ belongs to $\ell_q^w$ and $D$ is a positive integer, one can find some $m$ with $|m| = D$ and some $t \in S_m$ such that
$$\|u - t\|_2^2 \le (2/q - 1)^{-1}|\beta|_{q,w}^2(D + 1/2)^{1-2/q}. \tag{4.11}$$
Indeed, let us take for $m$ the set of indices of the $D$ largest numbers $|\beta_j|$. It follows from (4.10) that
$$\sum_{j \notin m}\beta_j^2 = \sum_{j > D}a_j^2 \le \frac{q}{2-q}|\beta|_{q,w}^2(D + 1/2)^{1-2/q}.$$
Setting $t = \sum_{j \in m}\beta_j\varphi_j$ gives (4.11), which provides the rate of approximation of $u$ by functions of the set $\bigcup_{\{m \mid |m| = D\}}S_m$ as a decreasing function of $D$ (which is not possible for $q = 2$). Unfortunately, this involves an infinite family of linear spaces $S_m$ of dimension $D$, since the largest coefficients of the sequence $\beta$ may have arbitrarily large indices. To derive a useful, as well as practical, approximation method for functions in $\ell_q^w$-spaces, one has to restrict to those sets $m$ which are subsets of $\{1, \ldots, n\}$ for some given value of $n$. This is what is done in Kerkyacharian and Picard [23], who show, in their Corollary 3.1, that a suitable thresholding of empirical versions of the coefficients $\beta_j$ for $j \in \{1, \ldots, n\}$ leads to estimators that have nice properties. Of course, since this approach ignores the (possibly large) coefficients with indices bigger than $n$, an additional condition on $\beta$ is required to control $\sum_{j > n}\beta_j^2$. In Kerkyacharian and Picard [23], it takes the form
$$\sum_{j > n}\beta_j^2 \le A^2 n^{-\delta} \quad \text{for all } n \ge 1, \text{ with } A \text{ and } \delta > 0, \tag{4.12}$$

while Cohen, DeVore, Kerkyacharian and Picard [15], page 178, use the similar condition BS. Such a condition is always satisfied for functions in Besov spaces $B^\alpha_{p,\infty}([0,1]^l)$ with $p \le 2$ and $\alpha > l(1/p - 1/2)$. Indeed, if
$$f = \sum_{j=-1}^{\infty}\sum_{k \in \Lambda(j)}\beta_{j,k}\varphi_{j,k}$$
belongs to such a Besov space, it follows from (4.1) that
$$\sum_{j > J}\sum_{k \in \Lambda(j)}|\beta_{j,k}|^2 \le \sum_{j > J}\Bigl(\sum_{k \in \Lambda(j)}|\beta_{j,k}|^p\Bigr)^{2/p} \le |f|^2_{B^\alpha_{p,\infty}}\sum_{j > J}2^{-2j(\alpha + \frac{l}{2} - \frac{l}{p})} \le C|f|^2_{B^\alpha_{p,\infty}}2^{-2J(\alpha + \frac{l}{2} - \frac{l}{p})}. \tag{4.13}$$
Since the number of coefficients $\beta_{j,k}$ with $j \le J$ is bounded by $C'2^{Jl}$, after a proper change in the indexing of the coefficients, the corresponding sequence $\beta$ will satisfy $\sum_{j > n}\beta_j^2 \le A^2n^{-\delta}$ with $\delta = (2\alpha/l) + 1 - (2/p)$.

4.4.2. Model selection for weak $\ell_q$-spaces

It is the very method of thresholding that imposes fixing the value of $n$ as a function of $\delta$, or imposing the value of $\delta$ when $n$ has been chosen, in order to get a good performance for the threshold estimators. Model selection is more flexible since it allows adapting the value of $n$ to the unknown values of $A$ and $\delta$. Let us assume that an orthonormal basis $\{\varphi_j,\ j \ge 1\}$ for $L_2(\lambda)$ has been chosen and that the Poisson process $X$ has an intensity $s$ with respect to $\lambda$, so that $\sqrt{s} = \sum_{j \ge 1}\beta_j\varphi_j$ with $\beta \in \ell_2$. We take for $\mathcal{M}$ the set of all subsets $m$ of $\mathbb{N}^*$ such that $|m| = 2^j$ for some $j \in \mathbb{N}$ and choose for $S_m$ the linear span of $\{\varphi_j,\ j \in m\}$ with dimension $D_m = |m|$. If $|m| = 2^j$ and $k = \inf\{i \in \mathbb{N}^* \mid 2^i \ge l \text{ for all } l \in m\}$, we set $\Delta_m = k + \log\binom{2^k}{2^j}$. Then
$$\sum_{m \in \mathcal{M}}\exp[-\Delta_m] \le \sum_{k \ge 1}\sum_{j=0}^{k}\binom{2^k}{2^j}\exp\Bigl[-k - \log\binom{2^k}{2^j}\Bigr] \le \sum_{k \ge 1}(k+1)\exp[-k],$$
which allows us to apply Theorem 1.
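The weights $\Delta_m$ above only depend on $(j, k)$; a small Python check of the bound on $\sum_m e^{-\Delta_m}$ (illustrative only):

```python
import math

# Delta_m = k + log C(2^k, 2^j); summing exp(-Delta_m) over all m sharing a
# given (j, k) multiplies by at most C(2^k, 2^j) such subsets m, leaving
# one exp(-k) per value of j.
total = sum(
    sum(math.exp(-k) for j in range(k + 1))
    for k in range(1, 60)
)
print(total)  # = sum_k (k+1) e^{-k} ~ 1.50, a finite Sigma for (1.13)
```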

Proposition 5. Let $\hat{s}$ be a T-estimator provided by Theorem 1 and based on the previous family of models $S_m$ and weights $\Delta_m$. If $\sqrt{s} = \sum_{j \ge 1}\beta_j\varphi_j$ with $\beta \in \ell_q^w$ for some $q < 2$ and (4.12) holds with $A \ge 1$ and $0 < \delta \le 1$, the risk of $\hat{s}$ at $s$ is bounded by
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C\Bigl[\Bigl(\gamma^{1-q/2}\bigl(R^2 \vee \gamma\bigr)^{q/2}\Bigr) \wedge A^{2/(1+\delta)}\Bigr],$$
with
$$R = \Bigl[\frac{q}{2-q}\Bigr]^{1/2}|\beta|_{q,w} \quad \text{and} \quad \gamma = \delta^{-1}\Bigl[\frac{\log\bigl(\delta[A \vee R]^2\bigr)}{\log 2} \vee 1\Bigr].$$

Proof. Let $(a_j)_{j \ge 1}$ be the nonincreasing rearrangement of the numbers $|\beta_j|$, let $k$ and $j \le k$ be given and let $m$ be the set of indices of the $2^j$ largest coefficients among $\{|\beta_1|, \ldots, |\beta_{2^k}|\}$. Then $D_m = 2^j$ and $\Delta_m \le k + \log\binom{2^k}{2^j}$. It follows from (4.10) and (4.12) that
$$\sum_{j \notin m}\beta_j^2 \le \Bigl(\sum_{i > 2^j}a_i^2\Bigr)\mathbb{1}_{j<k} + \sum_{i > 2^k}\beta_i^2 \le \frac{q}{2-q}|\beta|_{q,w}^2\,2^{-j(2/q-1)}\mathbb{1}_{j<k} + A^2 2^{-k\delta}.$$
This shows that one can find $t \in S_m$ such that $\|\sqrt{s} - t\|_2^2 \le R^2 2^{-j(2/q-1)}\mathbb{1}_{j<k} + A^2 2^{-k\delta}$, and it follows from (1.14) that
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C\inf_{k \ge 1}\inf_{0 \le j \le k}\Bigl\{R^2 2^{-j(2/q-1)}\mathbb{1}_{j<k} + A^2 2^{-k\delta} + 2^j + k + \log\binom{2^k}{2^j}\Bigr\}.$$


We recall that $C$ denotes a constant that may change as often as necessary. If $j = k$, $\mathbb{E}_s[H^2(s, \hat{s})] \le C[A^2 2^{-k\delta} + 2^k]$ and an optimization with respect to $k$ leads to $\mathbb{E}_s[H^2(s, \hat{s})] \le CA^{2/(1+\delta)}$. For $j < k$, we notice that $\Delta_m \le k + 2^j[1 + \log(2^{k-j})] < 3k2^j$, so that
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C\inf_{k \ge 1}\Bigl\{\bigl(A^2 2^{-k\delta}\bigr) \vee \inf_{0 \le j < k}\bigl\{\bigl(R^2 2^{-j(2/q-1)}\bigr) \vee \bigl(k2^j\bigr)\bigr\}\Bigr\}. \tag{4.14}$$
If $R^2 2^{-(k-1)(2/q-1)} > k2^{k-1}$, we may harmlessly increase $k$ until $k = K$ with
$$K = \inf\Bigl\{i \ge 1 \Bigm| i2^{i-1} \ge R^2 2^{-(i-1)(2/q-1)}\Bigr\} = \inf\Bigl\{i \ge 1 \Bigm| 2^{i-1} \ge R^q i^{-q/2}\Bigr\}$$
and therefore restrict the minimization in (4.14) to $k \ge K$. We then choose for $j$ the smallest integer $i$ such that $2^i \ge (R^2/k)^{q/2}$, which leads to
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C\inf_{k \ge K}\bigl\{\bigl(A^2 2^{-k\delta}\bigr) \vee \bigl(R^q k^{1-q/2}\bigr) \vee k\bigr\}.$$
It follows from Lemma 6 below (with $a = 1$) that, if $\delta A^2 \le 2$, $(A^2 2^{-k\delta}) \vee k \ge A^2/2$ for all $k$, which does not improve on our previous bound $CA^{2/(1+\delta)}$, so that we may assume from now on that $\delta A^2 > 2$, hence $\gamma > \delta^{-1}$. Handling this case in full generality is much more delicate, and we shall simplify the minimization problem by replacing $A$ by $\overline{A} = A \vee R$, which amounts to assuming that $A \ge R$ and leads to $\mathbb{E}_s[H^2(s, \hat{s})] \le C\inf_{k \ge K}f(k)$ with
$$f(x) = f_1(x) \vee f_2(x) \vee x; \qquad f_1(x) = \overline{A}^2 2^{-x\delta} \quad \text{and} \quad f_2(x) = R^q x^{1-q/2}.$$
We want to minimize $f(x)$, up to constants. The minimization of $f_1(x) \vee x$ follows from Lemma 6 with $\delta\overline{A}^2 > 2$. The minimum then takes the form $c_2\gamma > 0.469\gamma$ with $f_1(\gamma) = \delta^{-1} < \gamma$, hence $f(\gamma) = \gamma \vee f_2(\gamma)$. To show that $\inf_x f(x) \ge cf(\gamma)$ when $\delta\overline{A}^2 > 2$, we distinguish between two cases. If $R^2 \le \gamma$, $f(\gamma) = \gamma$ and we conclude from the fact that $\inf_x f(x) > 0.469\gamma$. If $R^2 > \gamma$, $f_2(x) > x$ for $x \le \gamma$, $f(\gamma) = f_2(\gamma) > \gamma$ and the minimum of $f(x)$ is obtained for some $x_0 < \gamma$. Hence
$$\inf_x f(x) = \inf_x\{f_1(x) \vee f_2(x)\} = R^q\inf_x\bigl\{\bigl(B2^{-\delta x}\bigr) \vee x^{1-q/2}\bigr\} \quad \text{with } B = \overline{A}^2 R^{-q}.$$
It follows from Lemma 6 with $a = (2-q)/2$ that the result of this minimization depends on the value of
$$V = \frac{2\delta}{2-q}\,\overline{A}^{4/(2-q)}R^{-2q/(2-q)} = \frac{2\overline{A}^2\delta}{2-q}\Bigl(\frac{\overline{A}}{R}\Bigr)^{2q/(2-q)} \ge \overline{A}^2\delta > 2,$$
since $\overline{A} \ge R$. Then,
$$\inf_x f(x) \ge R^q\Bigl[\frac{(2-q)\log_2 V}{3\delta}\Bigr]^{1-q/2} \ge R^q\gamma^{1-q/2}\Bigl[\frac{2-q}{3}\Bigr]^{1-q/2} > 0.45\,R^q\gamma^{1-q/2},$$
and we can conclude that, in both cases, $\inf_x f(x) \ge 0.45f(\gamma)$. Let us now fix $k$ such that $\gamma + 1 \le k < \gamma + 2$, so that $k < 3\gamma$. Then $2^{k-1} \ge 2^\gamma = (\overline{A}^2\delta)^{1/\delta}$ while $R^q k^{-q/2} \le (R^2/\gamma)^{q/2} \le (R^2\delta)^{q/2}$. This implies that $k \ge K$. Moreover $f(k) = k \vee f_2(k) < 3f(\gamma)$, which shows that $\inf_{k \ge K}f(k) < 3f(\gamma) < 6.7\inf_x f(x)$ and justifies this choice of $k$. Finally $\mathbb{E}_s[H^2(s, \hat{s})] \le C[\gamma \vee f_2(\gamma)]$.


Note that our main assumption, namely that $\beta \in \ell_q^w$, implies that $\sum_{j > n}a_j^2 \le R^2 n^{-2/q+1}$ by (4.10), while (4.12) entails that $\sum_{j > n}a_j^2 \le \sum_{j > n}\beta_j^2 \le A^2 n^{-\delta}$. Since it is only an additional assumption, it should not be strictly stronger than the main one, which is the case if $A \le R$ and $\delta \ge 2/q - 1$. It is therefore natural to assume that at least one of these inequalities does not hold.

Lemma 6. For positive parameters $a$, $B$ and $\delta$, consider on $\mathbb{R}^+$ the function $f(x) = B2^{-\delta x} \vee x^a$ and let $V = a^{-1}\delta B^{1/a}$. If $V \le 2$ then $\inf_x f(x) = c_1B$ with $2^{-a} \le c_1 < 1$. If $V > 2$, then $\inf_x f(x) = [c_2a\delta^{-1}\log_2 V]^a$ with $2/3 < c_2 < 1$.

Proof. Clearly, the minimum is obtained when $x = x_0$ is the solution of $B2^{-\delta x} = x^a$. Setting $x_0 = B^{1/a}y$ and taking base 2 logarithms leads to $y^{-1}\log_2(y^{-1}) = V$, hence $y < 1$. If $V \le 2$, then $1 < y^{-1} \le 2$ and the first result follows. If $V \ge 2$, the solution takes the form $y = zV^{-1}\log_2 V$ with $1 > z > [1 - (\log_2 V)^{-1}\log_2(\log_2 V)] > 0.469$.

4.4.3. Intensities with bounded variation on $[0,1)^2$

This section, which is devoted to the estimation of an intensity $s$ such that $\sqrt{s}$ belongs to the space $BV([0,1)^2)$, owes a lot to discussions with Albert Cohen and Ron DeVore. The approximation results that we use here should be considered as theirs. The definition and properties of the space $BV([0,1)^2)$ of functions with bounded variation on $[0,1)^2$ are given in Cohen, DeVore, Petrushev and Xu [16], where the reader can also find the missing details. It is known that, with the notations of Section 4.1 for Besov spaces, $B^1_{1,1}([0,1)^2) \subset BV([0,1)^2) \subset B^1_{1,\infty}([0,1)^2)$. This corresponds to the situation $\alpha = 1$, $l = 2$ and $p = 1$, therefore $\alpha = l(1/p - 1/2)$, a borderline case which is not covered by the results of Theorem 4. On the other hand, it is proved in Cohen, DeVore, Petrushev and Xu [16], Section 8, that if a function of $BV([0,1)^2)$ is expanded in the two-dimensional Haar basis, its coefficients belong to the space $\ell_1^w$. More precisely, if $f \in BV([0,1)^2)$ with semi-norm $|f|_{BV}$ and $f$ is expanded in the Haar basis with coefficients $\beta_j$, then $|\beta|_{1,w} \le C|f|_{BV}$, where $|\beta|_{1,w}$ is given by (4.8) and $C$ is a universal constant. We may therefore use the results of the previous section to estimate $\sqrt{s}$, but we need an additional assumption to ensure that (4.12) is satisfied. By definition $\sqrt{s}$ belongs to $L_2([0,1)^2, dx)$, but we shall assume here slightly more, namely that it belongs to $L_p([0,1)^2, dx)$ for some $p > 2$. This is enough to show that (4.12) holds.

Lemma 7. If $f \in BV([0,1)^2) \cap L_p([0,1)^2, dx)$ for some $p > 2$ and has an expansion $f = \sum_{j=-1}^{\infty}\sum_{k \in \Lambda(j)}\beta_{j,k}\varphi_{j,k}$ with respect to the Haar basis on $[0,1)^2$, then for $J \ge -1$,
$$\sum_{j > J}\sum_{k \in \Lambda(j)}|\beta_{j,k}|^2 \le C(p)\|f\|_p|f|_{B^1_{1,\infty}}\,2^{-2J(1/2 - 1/p)}.$$

Proof. It follows from Hölder's inequality that $|\beta_{j,k}| = |\langle f, \varphi_{j,k}\rangle| \le \|f\|_p\|\varphi_{j,k}\|_{p'}$ with $p'^{-1} = 1 - p^{-1}$ and, by the structure of a wavelet basis, $\|\varphi_{j,k}\|_{p'}^{p'} \le c_1 2^{-j(2-p')}$, so that $|\beta_{j,k}| \le c_2\|f\|_p 2^{-j(2/p'-1)} = c_2\|f\|_p 2^{-j(1-2/p)}$. Since $BV([0,1)^2) \subset B^1_{1,\infty}([0,1)^2)$, it follows from (4.1) with $\alpha = p = 1$ and $l = 2$ that $\sum_{k \in \Lambda(j)}|\beta_{j,k}| \le |f|_{B^1_{1,\infty}}$, so that $\sum_{k \in \Lambda(j)}|\beta_{j,k}|^2 \le c_2\|f\|_p|f|_{B^1_{1,\infty}}2^{-j(1-2/p)}$ for all $j \ge 0$. The conclusion follows.

Since the number of coefficients $\beta_{j,k}$ with $j \le J$ is bounded by $C2^{2J}$, after a proper reindexing of the coefficients the corresponding sequence $\beta$ will satisfy (4.12) with $\delta = 1/2 - 1/p$, which shows that it is essential here that $p$ be larger than 2. We finally get the following corollary of Proposition 5 with $q = 1$.

Corollary 4. One can build a T-estimator $\hat{s}$ with the following properties. Let the intensity $s$ be such that $\sqrt{s} \in BV([0,1)^2) \cap L_p([0,1)^2, dx)$ for some $p > 2$, so that the expansion of $\sqrt{s}$ in the Haar basis satisfies (4.12) with $\delta = 1/2 - 1/p$ and $A \ge 1$. Let $R = |\sqrt{s}|_{BV}$; then
$$\mathbb{E}\bigl[H^2(s, \hat{s})\bigr] \le C\Bigl[\sqrt{\gamma\,(R^2 \vee \gamma)} \wedge A^{2/(1+\delta)}\Bigr] \quad \text{with } \gamma = \delta^{-1}\Bigl[\frac{\log\bigl(\delta[A \vee R]^2\bigr)}{\log 2} \vee 1\Bigr].$$

4.5. Mixing families of models

We have studied here a few families of approximating models. Many more can be considered, and further examples can be found in Reynaud-Bouret [30] or in previous papers of the author on model selection such as Barron, Birgé and Massart [7], Birgé and Massart [12], Birgé [9] and Baraud and Birgé [4]. As indicated in the previous sections, the choice of suitable families of models is driven by results in approximation theory relative to the type of intensity we expect to encounter or, more precisely, to the type of assumptions we make about the unknown function $\sqrt{s}$. Different types of assumptions will lead to different choices of approximating models, but it is always possible to combine them. If we have built a few families of linear models $\{S_m,\ m \in \mathcal{M}_j\}$ for $1 \le j \le J$ and chosen suitable weights $\Delta_m$ such that $\sum_{m \in \mathcal{M}_j}\exp[-\Delta_m] \le \Sigma$ for all $j$, we may consider the mixed family of models $\{S_m,\ m \in \mathcal{M}\}$ with $\mathcal{M} = \bigcup_{j=1}^J\mathcal{M}_j$ and define new weights $\Delta'_m = \Delta_m + \log J$ for all $m \in \mathcal{M}$, so that (1.13) still holds with the same value of $\Sigma$. It follows from Theorem 1 that the T-estimator based on the mixed family will share the properties of the ones derived from the initial families apart, possibly, from a moderate increase in the risk of order $(\log J)^{q/2}$. The situation becomes more complex if $J$ is large or even infinite. A detailed discussion of how to mix families of models in general has been given in Birgé and Massart [12], Section 4.1, which applies with minor modifications to our case.

4.6. Asymptotics and a parallel with density estimation

The previous examples lead to somewhat unusual bounds, with no number of observations $n$ as in density estimation and no variance size $\sigma^2$ as in the case of the estimation of a normal mean. Here there is no rate of convergence, because there is no sequence of experiments, just one with a mean measure $\mu_s = s \cdot \lambda$. To get back to more familiar results with rates and asymptotics and recover some classical risk bounds, we may reformulate our problem in a slightly different form which completely parallels the one we use for density estimation. As indicated in our introduction, we may always rewrite the intensity $s$ as $s = ns_1$ with $\int s_1\,d\lambda = 1$, so that $s_1$ becomes a density and $n = \mu_s(\mathcal{X})$. We use this notation here, although $n$ need not be an integer, to emphasize the similarity between the estimation of $s$ and density estimation. When $n$ is an integer this also corresponds to observing $n$ i.i.d. Poisson processes $X_i$, $1 \le i \le n$, with intensity $s_1$ and setting $\Lambda_X = \sum_{i=1}^n\Lambda_{X_i}$. In this case (1.15) can be rewritten in the following way.


Corollary 5. Let $\lambda$ be some positive measure on $\mathcal{X}$, $X$ be a Poisson process with unknown intensity $s \in L_1^+(\lambda)$, $\{S_m,\ m \in \mathcal{M}\}$ be a finite or countable family of linear subspaces of $L_2(\lambda)$ with respective finite dimensions $D_m$ and let $\{\Delta_m\}_{m \in \mathcal{M}}$ be a family of nonnegative weights satisfying (1.13). One can build a T-estimator $\hat{s}(X)$ of $s$ satisfying, for all $s \in L_1^+(\lambda)$ such that $\int s\,d\lambda = n$, $s_1 = n^{-1}s$ and all $q \ge 1$,
$$\mathbb{E}_s\Bigl[\bigl(n^{-1/2}H(s, \hat{s})\bigr)^q\Bigr] \le C(q)[1 + \Sigma]\inf_{m \in \mathcal{M}}\Biggl[\inf_{t \in S_m}\|\sqrt{s_1} - t\|_2 + \sqrt{\frac{D_m \vee \Delta_m}{n}}\Biggr]^q.$$

Written in this form, our result appears as a complete analogue of Theorem 6 of Birgé [9] about density estimation, the normalized loss function $(H/\sqrt{n})^q$ playing the role of the Hellinger loss $h^q$ for densities. We also explained in Birgé [9], Section 8.3.3, that there is a complete parallel between density estimation and estimation in the white noise model. We can therefore extend this parallel to the estimation of the intensity of a Poisson process. This parallel has also been explained and applied to various examples in Baraud and Birgé [4], Section 4.2. As an additional consequence, all the families of models that we have introduced in Sections 3.3, 4.2, 4.3 and 4.4 could be used as well for adaptive estimation of densities or in the white noise model, and added to the examples given in Birgé [9].

To recover the familiar rates of convergence that we get when estimating densities which belong to some given function class $\mathcal{S}$, we merely have to assume that $s_1$ (rather than $s$) belongs to the class $\mathcal{S}$ and use the normalized loss function. Let us, for instance, apply this approach to intensities belonging to Besov spaces, assuming that $\sqrt{s_1} \in B^\alpha_{p,\infty}([0,1]^l)$ with $\alpha > l(1/p - 1/2)_+$ and that $|\sqrt{s_1}|_{B^\alpha_{p,\infty}} \le L$ with $L > 0$. It follows that $\sqrt{s} \in B^\alpha_{p,\infty}([0,1]^l)$ with $|\sqrt{s}|_{B^\alpha_{p,\infty}} \le L\sqrt{n}$. For $n$ large enough, $L\sqrt{n} \ge 1$ and Theorem 4 applies, leading to $\mathbb{E}_s[H^2(s, \hat{s})] \le C(\alpha, p, l)(L\sqrt{n})^{2l/(2\alpha+l)}$. Hence
$$\mathbb{E}_s\bigl[n^{-1}H^2(s, \hat{s})\bigr] \le C(\alpha, p, l)L^{2l/(2\alpha+l)}n^{-2\alpha/(2\alpha+l)},$$
which is exactly the result we get for density estimation with $n$ i.i.d. observations.

The same argument can be developed for the problem we considered in Section 4.2. If we assume that $\sqrt{s_1}$, rather than $\sqrt{s}$, belongs to $\mathcal{H}(\boldsymbol{\alpha}, \mathbf{R})$, then $\sqrt{s} \in \mathcal{H}(\boldsymbol{\alpha}, \sqrt{n}\mathbf{R})$ and the condition $R_j \ge \eta$ of Corollary 2 becomes, after this rescaling, $\sqrt{n}R_j \ge (\sqrt{n}\overline{R})^{k/(2\overline{\alpha}+k)}$, which always holds for $n$ large enough. The corresponding normalized risk bound can then be written
$$\mathbb{E}_s\bigl[n^{-1}H^2(s, \hat{s})\bigr] \le C(k, \boldsymbol{\alpha})\,\overline{R}^{\,2k/(2\overline{\alpha}+k)}n^{-2\overline{\alpha}/(2\overline{\alpha}+k)},$$
which corresponds to the rate of convergence for this problem in density estimation.

Another interesting case is the one considered in Section 4.4. Let us assume here that, instead of putting the assumptions of Proposition 5 on $\sqrt{s}$, we put them on $\sqrt{s_1}$. This implies that $\sqrt{s}$ satisfies the same assumptions with $R$ replaced by $R\sqrt{n}$ and $A$ by $A\sqrt{n}$. Then, for $n \ge n_0(A, R, \delta)$, $\gamma \le 2\delta^{-1}\log n \le nR^2$ and
$$\mathbb{E}_s\bigl[n^{-1}H^2(s, \hat{s})\bigr] \le C(q, \delta, A, R)\bigl(n^{-1}\log n\bigr)^{1-q/2}.$$
This result is comparable to the bounds obtained in Corollary 3.1 of Kerkyacharian and Picard [23], but here we do not know the relationship between $q$ and $\delta$. For the special situation of $\sqrt{s_1} \in BV([0,1)^2)$, we get $\mathbb{E}_s[n^{-1}H^2(s, \hat{s})] \le C(q, \delta, s_1)(n^{-1}\log n)^{1/2}$. One could also translate all other risk bounds in the same way.


An alternative asymptotic approach, which has been considered in Reynaud-Bouret [30], is to assume that $X$ is a Poisson process on $\mathbb{R}^k$ with intensity $s$ with respect to the Lebesgue measure on $\mathbb{R}^k$, but which is only observed on $[0,T]^k$. We therefore estimate $s\mathbb{1}_{[0,T]^k}$, letting $T$ go to infinity to get an asymptotic result. We only assume that $\int_{[0,T]^k}s(x)\,dx$ is finite for all $T > 0$, not necessarily that $\int_{\mathbb{R}^k}s(x)\,dx < +\infty$. For simplicity, let us consider the case of intensities $s$ on $\mathbb{R}^+$ with $\sqrt{s}$ belonging to the Hölder class $\mathcal{H}(\alpha, R)$. For $t$ an intensity on $\mathbb{R}^+$, we set, for $0 \le x \le 1$, $t_T(x) = Tt(Tx)$, so that $t_T$ is an intensity on $[0,1]$ and $H(t_T, u_T) = H(t\mathbb{1}_{[0,T]}, u\mathbb{1}_{[0,T]})$. Since $\sqrt{s_T} \in \mathcal{H}(\alpha, RT^{\alpha+1/2})$, it follows from Corollary 2 that there is a T-estimator $\hat{s}_T(X)$ of $s_T$ satisfying
$$\mathbb{E}_s\bigl[H^2(s_T, \hat{s}_T)\bigr] \le C(\alpha)\bigl(RT^{\alpha+1/2}\bigr)^{2/(2\alpha+1)} = C(\alpha)\,TR^{2/(2\alpha+1)}.$$
Finally, setting $\hat{s}(y) = T^{-1}\hat{s}_T(T^{-1}y)$ for $y \in [0,T]$, we get an estimator $\hat{s}(X)$ of $s\mathbb{1}_{[0,T]}$ depending on $T$ with the property that
$$\mathbb{E}_s\bigl[H^2\bigl(s\mathbb{1}_{[0,T]}, \hat{s}\bigr)\bigr] \le C(\alpha)\,TR^{2/(2\alpha+1)} \quad \text{for all } T > 0.$$

4.7. An illustration with Poisson regression

As we mentioned in the introduction, a particular case occurs when $\mathcal{X}$ is a finite set, which we shall assume here, for simplicity, to be $\{1, \ldots, 2^n\}$. In this situation, observing $X$ amounts to observing $N = 2^n$ independent Poisson variables with respective parameters $s_i = s(i)$, where $s$ denotes the intensity with respect to the counting measure. If we introduce a family of linear models $S_m$ in $\mathbb{R}^N$ to approximate $\sqrt{s} \in \mathbb{R}^N$ with respect to the Euclidean distance, we simply apply Theorem 1 to get the resulting risk bounds. In this situation, the Hellinger distance between two intensities is merely the Euclidean distance between their square roots, up to a factor $1/\sqrt{2}$.

As an example, we shall consider linear models spanned by piecewise constant functions on $\mathcal{X}$ as described in Section 1.4, i.e. $S_m = \{\sum_{j=1}^D a_j\mathbb{1}_{I_j}\}$ when $m = \{I_1, \ldots, I_D\}$ is a partition of $\mathcal{X}$ into $D = |m|$ nonvoid intervals. In order to define suitable weights $\Delta_m$, we shall distinguish between two types of partitions. First we consider the family $\mathcal{M}_{BT}$ of dyadic partitions derived from binary trees and described in Section 4.3. We already know that the choice $\Delta_m = 2|m|$ is suitable for those partitions and (4.4) applies. Note that these include the regular partitions, i.e. those for which all intervals $I_j$ have the same size $N/|m|$ and $|m| = 2^k$ for $0 \le k \le n$. For all other partitions, we simply set $\Delta_m = \log\binom{N}{|m|} + 2\log(|m|)$, so that (1.13) holds with $\Sigma < 3$ since the number of possible partitions of $\mathcal{X}$ into $|m|$ intervals is $\binom{N-2}{|m|-1}$. We omit the details. Denoting by $\|\cdot\|_2$ the Euclidean norm in $\mathbb{R}^N$, we derive from Theorem 1 the following risk bound for T-estimators:
$$\mathbb{E}_s\Bigl[\bigl\|\sqrt{s} - \sqrt{\hat{s}}\bigr\|_2^2\Bigr] \le C\Biggl[\inf_{m \in \mathcal{M}_{BT}}\Bigl\{\inf_{t \in S_m}\bigl\|\sqrt{s} - t\bigr\|_2^2 + |m|\Bigr\} \wedge \inf_{m \in \mathcal{M}\setminus\mathcal{M}_{BT}}\Bigl\{\inf_{t \in S_m}\bigl\|\sqrt{s} - t\bigr\|_2^2 + \log(|m|) + \log\binom{N}{|m|}\Bigr\}\Biggr].$$


The performance of the estimator then depends on the approximation properties of the linear spaces $S_m$ with respect to $\sqrt{s}$. For instance, if $\sqrt{s}$ varies regularly, i.e. $|\sqrt{s_i} - \sqrt{s_{i-1}}| \le R$ for all $i$, one uses a regular partition, which belongs to $\mathcal{M}_{BT}$, to approximate $\sqrt{s}$. If $\sqrt{s}$ has bounded $\alpha$-variation, as defined in Section 4.3, one uses dyadic partitions as explained in that section. If $\sqrt{s}$ is piecewise constant with $k$ jumps, it belongs to some $S_m$ and we get a risk bound of order $\log(k+1) + \log\binom{N}{k+1}$.

5. Aggregation of estimators

In this section we assume that we have at our disposal a family $\{\hat{s}_m,\ m \in \mathcal{M}'\}$ of intensity estimators (T-estimators or others) and that we want to select one of them or combine them in some way in order to get an improved estimator. We already explained in Section 2.3 how to use the procedure of thinning to derive from a Poisson process $X$ with mean measure $\mu$ two independent Poisson processes with mean measure $\mu/2$. Since estimating $\mu/2$ is equivalent to estimating $\mu$, we shall assume in this section that we have at our disposal two independent processes $X_1$ and $X_2$ with the same unknown mean measure $\mu_s$ with intensity $s$ to be estimated. We assume that the initial estimators $\hat{s}_m(X_1)$ are all based on the first process and are therefore independent of $X_2$. Proceeding conditionally on the first process, we use the second one to mix the estimators.
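The thinning step is easy to implement: each point of $X$ is assigned independently to one of two processes with probability $1/2$. A minimal Python sketch (the point representation is ours):

```python
import numpy as np

def thin(points, rng):
    """Split a Poisson process into two independent halves.

    Each point is kept in the first process with probability 1/2,
    otherwise it goes to the second; both results are Poisson
    processes with half the original mean measure.
    """
    keep = rng.random(len(points)) < 0.5
    return points[keep], points[~keep]

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, rng.poisson(200))  # Poisson process, intensity 200 on [0,1]
X1, X2 = thin(X, rng)                    # two independent processes, intensity 100
print(len(X1), len(X2))
```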

We shall consider here two different ways of aggregating estimators. The first one is suitable when we want to choose one estimator in a large (possibly infinite) family of estimators and possibly attach different prior weights to them. The second method tries to find the best linear combination from a finite family of estimators of $\sqrt{s}$.

5.1. Estimator selection

Here we start from a finite or countable family $\{\hat{s}_m,\ m \in \mathcal{M}\}$ of intensity estimators and a family of weights $\Delta_m \ge 1/10$ satisfying (1.13). Our purpose is to use the process $X_2$ to find a close to best estimator among the family $\{\hat{s}_m(X_1),\ m \in \mathcal{M}\}$.

5.1.1. A general result

Considering each estimator $\hat{s}_m(X_1)$ as a model $S_m = \{\hat{s}_m(X_1)\}$ with one single point, we set $\eta_m^2 = 84\Delta_m$. Then $S_m$ is a T-model with parameters $\eta_m$, $1/2$ and $B' = e^{-2}$, (3.2) and (3.3) hold and Theorem 3 applies. Since each model is reduced to one point, one can find a selection procedure $\hat{m}(X_2)$ such that the estimator $\hat{s}(X_1, X_2) = \hat{s}_{\hat{m}(X_2)}(X_1)$ satisfies the risk bound
$$\mathbb{E}_s\bigl[H^2(s, \hat{s}) \bigm| X_1\bigr] \le C[1 + \Sigma]\inf_{m \in \mathcal{M}}\bigl\{H^2(s, \hat{s}_m(X_1)) + \Delta_m\bigr\}.$$
Integrating with respect to the process $X_1$ gives
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C[1 + \Sigma]\inf_{m \in \mathcal{M}}\bigl\{\mathbb{E}_s\bigl[H^2(s, \hat{s}_m)\bigr] + \Delta_m\bigr\}. \tag{5.1}$$
This result completely parallels the one obtained for density estimation in Section 9.1.2 of Birgé [9].


5.1.2. Application to histograms

The simplest estimators for the intensity $s$ of a Poisson process $X$ are histograms. Let $m = \{I_1, \ldots, I_D\}$ be a finite partition of $\mathcal{X}$ such that $\lambda(I_j) > 0$ for all $j$. To this partition correspond the linear space of piecewise constant functions on the partition $m$, $S_m = \{\sum_{j=1}^D a_j\mathbb{1}_{I_j}\}$, the projection $s_m$ of $s$ onto $S_m$ and the corresponding histogram estimator $\hat{s}_m$ of $s$, given respectively by
$$s_m = \sum_{j=1}^D\Bigl(\int_{I_j}s\,d\lambda\Bigr)[\lambda(I_j)]^{-1}\mathbb{1}_{I_j} \quad \text{and} \quad \hat{s}_m = \sum_{j=1}^D N_j[\lambda(I_j)]^{-1}\mathbb{1}_{I_j} \quad \text{with } N_j = \sum_{i=1}^N\mathbb{1}_{I_j}(X_i).$$
It is proved in Baraud and Birgé [4], Lemma 2, that $H^2(s, s_m) \le 2H^2(s, S_m)$. Moreover, one can show an analogue of the risk bound obtained for the case of density estimation in Birgé and Rozenholc [13], Theorem 1. The proof is identical, replacing $h$ by $H$, $n$ by 1 and the binomial distribution of $N$ by a Poisson distribution. This leads to the risk bound
$$\mathbb{E}_s\bigl[H^2(s, \hat{s}_m)\bigr] \le H^2(s, s_m) + D/2 \le 2H^2(s, S_m) + |m|/2.$$

If we are given an arbitrary family $\mathcal{M}$ of partitions of $\mathcal{X}$ and a corresponding family of weights $\{\Delta_m,\ m \in \mathcal{M}\}$ satisfying (1.13) and $\Delta_m \ge |m|/2$, we may apply the previous aggregation method, which will result in an estimator $\hat{s}(X_1, X_2) = \hat{s}_{\hat{m}(X_2)}(X_1)$, where $\hat{m}(X_2)$ is a data-selected partition. Finally,
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C[1 + \Sigma]\inf_{m \in \mathcal{M}}\bigl\{H^2(s, S_m) + \Delta_m\bigr\}. \tag{5.2}$$
Various choices of partitions and weights have been described in Baraud and Birgé [4], together with their approximation properties with respect to different classes of functions. Numerous illustrations of applications of (5.2) can therefore be found there.
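For concreteness, a short Python sketch of the histogram intensity estimator $\hat{s}_m$ on a partition of $[0,1)$ into cells of equal length (Lebesgue $\lambda$; the function name and the simulation are ours):

```python
import numpy as np

def histogram_intensity(points, D):
    """Histogram estimator of a Poisson intensity on [0, 1).

    With cells I_j of equal length 1/D, the estimate on I_j is
    N_j / lambda(I_j), where N_j counts the observed points in I_j.
    """
    counts, _ = np.histogram(points, bins=D, range=(0.0, 1.0))  # N_j
    return counts * D                                           # N_j / (1/D)

rng = np.random.default_rng(2)
s = lambda x: 60 * (1 + np.sin(2 * np.pi * x))       # true intensity, s <= 120
grid = rng.uniform(0, 1, rng.poisson(120))           # homogeneous process at rate 120
X = grid[rng.random(len(grid)) < s(grid) / 120]      # thinning -> inhomogeneous process
print(histogram_intensity(X, D=8))                   # estimated intensity per cell
```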

5.2. Linear aggregation

Here we start with a finite family $\{\hat{s}_i(X_1),\ 1 \le i \le n\}$ of intensity estimators. We choose for $\mathcal{M}$ the set of all nonvoid subsets of $\{1, \ldots, n\}$ and to each such subset $m$ we associate the $|m|$-dimensional linear subspace $S_m$ of $L_2(\lambda)$ given by
$$S_m = \Bigl\{\sum_{j \in m}\lambda_j\sqrt{\hat{s}_j(X_1)} \text{ with } \lambda_j \in \mathbb{R} \text{ for } j \in m\Bigr\}. \tag{5.3}$$
We then set $\Delta_m = \log\binom{n}{|m|} + 2\log(|m|)$, so that (1.13) holds with $\Sigma = \sum_{i=1}^n i^{-2}$. We may therefore apply Theorem 1 to the process $X_2$ and this family of models conditionally on $X_1$, which results in the bound
$$\mathbb{E}_s\bigl[H^2(s, \hat{s}) \bigm| X_1\bigr] \le C[1 + \Sigma]\inf_{m \in \mathcal{M}}\Bigl\{\inf_{t \in S_m}\bigl\|\sqrt{s} - t(X_1)\bigr\|_2^2 + \log\binom{n}{|m|} + \log(|m|)\Bigr\}.$$
Note that the restriction of this bound to subsets $m$ such that $|m| = 1$ corresponds to a variant of estimator selection and leads, after integration, to
$$\mathbb{E}_s\bigl[H^2(s, \hat{s})\bigr] \le C[1 + \Sigma]\inf_{1 \le i \le n}\Bigl\{\inf_{\lambda > 0}\mathbb{E}_s\Bigl[\bigl\|\sqrt{s} - \lambda\sqrt{\hat{s}_i(X_1)}\bigr\|_2^2\Bigr] + \log n\Bigr\}.$$
This can be viewed as an improved version of (5.1) when we choose equal weights.


6. Testing balls in $(\mathcal{Q}^+(\mathcal{X}), H)$

6.1. The construction of robust tests

In order to use Theorem 3, we have to find tests $\psi_{t,u}$ satisfying the conclusions of Proposition 1. These tests are provided by a straightforward corollary of the following theorem.

Theorem 5. Given two elements $\pi_c$ and $\nu_c$ of $\mathcal{Q}^+(\mathcal{X})$ with respective densities $d\pi_c$ and $d\nu_c$ with respect to some dominating measure $\lambda \in \mathcal{Q}^+(\mathcal{X})$ and a number $\xi \in (0, 1/2)$, let us define $\pi_m$ and $\nu_m$ in $\mathcal{Q}^+(\mathcal{X})$ by their densities $d\pi_m$ and $d\nu_m$ with respect to $\lambda$ in the following way:
$$\sqrt{d\pi_m} = \xi\sqrt{d\nu_c} + (1-\xi)\sqrt{d\pi_c} \quad \text{and} \quad \sqrt{d\nu_m} = \xi\sqrt{d\pi_c} + (1-\xi)\sqrt{d\nu_c}.$$
Then for all $x \in \mathbb{R}$, $\mu \in \mathcal{Q}^+(\mathcal{X})$ and $X$ a Poisson process with mean measure $\mu$,
$$\mathbb{P}_\mu\Bigl[\log\Bigl(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\Bigr) \ge 2x\Bigr] \le \exp\bigl[-x + (1-2\xi)\bigl(2\xi^{-1}H^2(\mu, \nu_c) - H^2(\pi_c, \nu_c)\bigr)\bigr]$$
and
$$\mathbb{P}_\mu\Bigl[\log\Bigl(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\Bigr) \le 2x\Bigr] \le \exp\bigl[x + (1-2\xi)\bigl(2\xi^{-1}H^2(\mu, \pi_c) - H^2(\pi_c, \nu_c)\bigr)\bigr].$$

Corollary 6. Let $\pi_c$ and $\nu_c$ be two elements of $\mathcal{Q}^+(\mathcal{X})$, $0 < \xi < 1/2$ and
$$T(X) = \log\bigl((dQ_{\pi_m}/dQ_{\nu_m})(X)\bigr) - 2x,$$
with $\pi_m$ and $\nu_m$ given by Theorem 5. Define a test function $\psi$ with values in $\{\pi_c, \nu_c\}$ by $\psi(X) = \pi_c$ when $T(X) > 0$ and $\psi(X) = \nu_c$ when $T(X) < 0$ ($\psi(X)$ being arbitrary if $T(X) = 0$). If $X$ is a Poisson process with mean measure $\mu$, then
$$\mathbb{P}_\mu[\psi(X) = \pi_c] \le \exp\bigl[-x - (1-2\xi)^2H^2(\pi_c, \nu_c)\bigr] \quad \text{if } H(\mu, \nu_c) \le \xi H(\pi_c, \nu_c)$$
and
$$\mathbb{P}_\mu[\psi(X) = \nu_c] \le \exp\bigl[x - (1-2\xi)^2H^2(\pi_c, \nu_c)\bigr] \quad \text{if } H(\mu, \pi_c) \le \xi H(\pi_c, \nu_c).$$
To derive Proposition 1 we simply set $\pi_c = \mu_t$, $\nu_c = \mu_u$, $\xi = 1/4$, $x = [\eta^2(t) - \eta^2(u)]/4$ and define $\psi_{t,u} = \psi$ in Corollary 6. As to (3.5), it follows from the second bound of Theorem 5.
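Since the likelihood ratio of two Poisson processes has an explicit product form (see (1.2)–(1.3) as used in the proof of Lemma 9 below), the test of Corollary 6 is straightforward to compute once the mixed intensities are formed. A hedged Python sketch on $\mathcal{X} = [0,1)$ with Lebesgue $\lambda$; the numerical integration on a grid and all function names are ours:

```python
import numpy as np

def robust_test(points, d_pi_c, d_nu_c, xi=0.25, x=0.0, grid=2048):
    """Test between the balls around pi_c and nu_c (Corollary 6).

    d_pi_c, d_nu_c: densities of the two candidate mean measures on [0, 1).
    Mixed densities: sqrt(d_pi_m) = xi*sqrt(d_nu_c) + (1-xi)*sqrt(d_pi_c),
    and symmetrically for d_nu_m.  Returns "pi_c" or "nu_c".
    """
    u = np.linspace(0, 1, grid, endpoint=False) + 0.5 / grid
    d_pi_m = (xi * np.sqrt(d_nu_c(u)) + (1 - xi) * np.sqrt(d_pi_c(u))) ** 2
    d_nu_m = (xi * np.sqrt(d_pi_c(u)) + (1 - xi) * np.sqrt(d_nu_c(u))) ** 2
    # log dQ_pi_m/dQ_nu_m(X) = nu_m(X) - pi_m(X) + sum_i log(d_pi_m/d_nu_m)(X_i)
    mass_term = np.mean(d_nu_m - d_pi_m)            # integral over [0, 1)
    idx = np.clip((points * grid).astype(int), 0, grid - 1)
    T = mass_term + np.sum(np.log(d_pi_m[idx] / d_nu_m[idx])) - 2 * x
    return "pi_c" if T > 0 else "nu_c"

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, rng.poisson(50))              # points from intensity 50
print(robust_test(X, lambda u: 50 + 0 * u, lambda u: 80 + 0 * u))  # typically "pi_c"
```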

6.2. Proof of Theorem 5

It is based on the following technical lemmas.

Lemma 8. Let $f, g, f' \in L_2^+(\lambda)$ with $\|g/f\|_\infty \le K$. Denoting by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|_2$ the scalar product and norm in $L_2(\lambda)$, we get
$$\int gf^{-1}f'^2\,d\lambda \le K\|f - f'\|_2^2 + 2\langle g, f'\rangle - \langle g, f\rangle. \tag{6.1}$$


Proof. Denoting by $Q$ the left-hand side of (6.1), we write
$$Q = \int gf^{-1}(f' - f)^2\,d\lambda + 2\int gf'\,d\lambda - \int gf\,d\lambda,$$
hence the result.

Lemma 9. Let $\mu$, $\pi$ and $\nu$ be three mean measures with $\pi \ll \nu$ and $\|d\pi/d\nu\|_\infty \le K^2$, and let $X$ be a Poisson process with mean measure $\mu$. Then
$$\mathbb{E}_\mu\Biggl[\sqrt{\frac{dQ_\pi}{dQ_\nu}(X)}\Biggr] \le \exp\bigl[2KH^2(\mu, \nu) - 2H^2(\pi, \mu) + H^2(\pi, \nu)\bigr].$$

Proof. By (1.3) and (1.2),
$$\mathbb{E}_\mu\Biggl[\sqrt{\frac{dQ_\pi}{dQ_\nu}(X)}\Biggr] = \exp\Bigl[\frac{\nu(\mathcal{X}) - \pi(\mathcal{X})}{2}\Bigr]\,\mathbb{E}_\mu\Biggl[\prod_{i=1}^N\sqrt{\frac{d\pi}{d\nu}(X_i)}\Biggr]
= \exp\Biggl[\frac{\nu(\mathcal{X}) - \pi(\mathcal{X})}{2} + \int_{\mathcal{X}}\Bigl(\sqrt{\frac{d\pi}{d\nu}(x)} - 1\Bigr)d\mu(x)\Biggr]
= \exp\Biggl[\frac{\nu(\mathcal{X}) - \pi(\mathcal{X})}{2} - \mu(\mathcal{X}) + \int_{\mathcal{X}}\sqrt{\frac{d\pi}{d\nu}(x)}\,d\mu(x)\Biggr].$$
Using Lemma 8 and (1.7), we derive that
$$\int_{\mathcal{X}}\sqrt{\frac{d\pi}{d\nu}(x)}\,d\mu(x) \le 2KH^2(\mu, \nu) + 2\int\sqrt{d\pi\,d\mu} - \int\sqrt{d\pi\,d\nu}
= 2KH^2(\mu, \nu) - 2H^2(\pi, \mu) + \pi(\mathcal{X}) + \mu(\mathcal{X}) + H^2(\pi, \nu) - \tfrac{1}{2}[\pi(\mathcal{X}) + \nu(\mathcal{X})].$$
The conclusion follows.

To prove Theorem 5, we may assume (changing $\lambda$ if necessary) that $\mu \ll \lambda$ and set $v = \sqrt{d\mu/d\lambda}$. We also set $t_c = \sqrt{d\pi_c/d\lambda}$, $u_c = \sqrt{d\nu_c/d\lambda}$, $t_m = \xi u_c + (1-\xi)t_c$ and $u_m = \xi t_c + (1-\xi)u_c$. Then $\pi_m = t_m^2 \cdot \lambda$ and $\nu_m = u_m^2 \cdot \lambda$. Note that $t_c$, $u_c$, $t_m$, $u_m$ and $v$ belong to $L_2^+(\lambda)$ and that for two elements $w, z$ in $L_2^+(\lambda)$, $\|w - z\|_2^2 = 2H^2(w^2 \cdot \lambda, z^2 \cdot \lambda)$. Since $\|t_m/u_m\|_\infty \le (1-\xi)/\xi$, we may apply Lemma 9 with $K = (1-\xi)/\xi$ to derive that
$$L = \log\Biggl(\mathbb{E}_\mu\Biggl[\sqrt{\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)}\Biggr]\Biggr) \le \frac{1-\xi}{\xi}\|v - u_m\|_2^2 - \|v - t_m\|_2^2 + \frac{\|t_m - u_m\|_2^2}{2}.$$
Using the fact that
$$v - u_m = v - u_c + \xi(u_c - t_c), \qquad v - t_m = v - u_c + (1-\xi)(u_c - t_c), \qquad t_m - u_m = (1-2\xi)(t_c - u_c)$$
and expanding the squared norms, we get, since the scalar products cancel,
$$L \le \frac{1-2\xi}{\xi}\|v - u_c\|_2^2 + \Bigl[\xi(1-\xi) - (1-\xi)^2 + \frac{(1-2\xi)^2}{2}\Bigr]\|t_c - u_c\|_2^2,$$


which shows that
$$L \le (1-2\xi)\bigl[2\xi^{-1}H^2(\mu, \nu_c) - H^2(\pi_c, \nu_c)\bigr].$$
The exponential inequality then implies that
$$\mathbb{P}_\mu\Bigl[\log\Bigl(\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)\Bigr) \ge 2x\Bigr] \le e^{-x}\,\mathbb{E}_\mu\Biggl[\sqrt{\frac{dQ_{\pi_m}}{dQ_{\nu_m}}(X)}\Biggr] = \exp[-x + L],$$
which proves the first error bound. The second one can be proved in the same way.

Acknowledgments

Many thanks to Philippe Bougerol for some exchanges about Poisson processes and to Albert Cohen and Ron DeVore for several illuminating discussions on approximation theory, the subtleties of the space $BV(\mathbb{R}^2)$ and adaptive approximation methods. I would also like to thank the participants of the workshop Asymptotics: particles, processes and inverse problems, held in July 2006 in Leiden, for their many questions, which led to various improvements of the paper.

References

[1] Antoniadis, A., Besbeas, P. and Sapatinas, T. (2001). Wavelet shrinkage for natural exponential families with cubic variance functions. Sankhyā Ser. A 63 309–327.
[2] Antoniadis, A. and Sapatinas, T. (2001). Wavelet shrinkage for natural exponential families with quadratic variance functions. Biometrika 88 805–820.
[3] Assouad, P. (1983). Deux remarques sur l'estimation. C. R. Acad. Sci. Paris Sér. I Math. 296 1021–1024.
[4] Baraud, Y. and Birgé, L. (2006). Estimating the intensity of a random measure by histogram type estimators. Probab. Theory Related Fields. To appear. Available at arXiv:math.ST/0608663.
[5] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39 930–945.
[6] Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning 14 115–133.
[7] Barron, A. R., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–415.
[8] Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory 37 1034–1054.
[9] Birgé, L. (2006). Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist. 42 273–325.
[10] Birgé, L. (2006). Statistical estimation with model selection. Indagationes Math. 17 497–537.
[11] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55–87. Springer, New York.
[12] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203–268.
[13] Birgé, L. and Rozenholc, Y. (2006). How many bins should be put in a regular histogram. ESAIM Probab. Statist. 10 24–45.
[14] Čencov, N. N. (1962). Evaluation of an unknown distribution density from observations. Soviet Math. 3 1559–1562.
[15] Cohen, A., DeVore, R., Kerkyacharian, G. and Picard, D. (2001). Maximal spaces with given rate of convergence for thresholding algorithms. Appl. Comput. Harmon. Anal. 11 167–191.
[16] Cohen, A., DeVore, R., Petrushev, P. and Xu, H. (1999). Nonlinear approximation and the space BV(R²). Amer. J. Math. 121 587–628.
[17] DeVore, R. A. (1998). Nonlinear approximation. Acta Numerica 7 51–150.
[18] DeVore, R. A. (2006). Private communication.
[19] DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation. Springer, Berlin.
[20] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
[21] Grégoire, G. and Nembé, J. (2000). Convergence rates for the minimum complexity estimator of counting process intensities. J. Nonparametr. Statist. 12 611–643.
[22] Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statist. 129. Springer, New York.
[23] Kerkyacharian, G. and Picard, D. (2000). Thresholding algorithms, maxisets and well-concentrated bases. Test 9 283–344.
[24] Kolaczyk, E. (1999). Wavelet shrinkage estimation of certain Poisson intensity signals using corrected thresholds. Statist. Sinica 9 119–135.
[25] Kolaczyk, E. and Nowak, R. (2004). Multiscale likelihood analysis and complexity penalized estimation. Ann. Statist. 32 500–527.
[26] Le Cam, L. M. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
[27] Massart, P. (2007). Concentration inequalities and model selection. In Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXXIII, 2003 (J. Picard, ed.). Lecture Notes in Math. 1896. Springer, Berlin.
[28] Patil, P. N. and Wood, A. T. (2004). Counting process intensity estimation by orthogonal wavelet methods. Bernoulli 10 1–24.
[29] Reiss, R.-D. (1993). A Course on Point Processes. Springer, New York.
[30] Reynaud-Bouret, P. (2003). Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Related Fields 126 103–153.
[31] Reynaud-Bouret, P. (2006). Penalized projection estimators of the Aalen multiplicative intensity. Bernoulli 12 633–661.
[32] Rigollet, P. and Tsybakov, A. B. (2006). Linear and convex aggregation of density estimators. Available at arXiv:math.ST/0605292 v1.
[33] Stanley, R. P. (1999). Enumerative Combinatorics, Vol. 2. Cambridge University Press, Cambridge.


IMS Lecture Notes–Monograph Series
Asymptotics: Particles, Processes and Inverse Problems
Vol. 55 (2007) 65–84
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000274

Scale space consistency of piecewise constant least squares estimators – another look at the regressogram

Leif Boysen¹,*, Volkmar Liebscher²,†, Axel Munk¹ and Olaf Wittich³,†

Universität Göttingen, Universität Greifswald, Universität Göttingen and Technische Universiteit Eindhoven

Abstract: We study the asymptotic behavior of piecewise constant least squares regression estimates, when the number of partitions of the estimate is penalized. We show that the estimator is consistent in the relevant metric if the signal is in $L^2([0,1])$, the space of càdlàg functions equipped with the Skorokhod metric or $C([0,1])$ equipped with the supremum metric. Moreover, we consider the family of estimates under a varying smoothing parameter, also called scale space. We prove convergence of the empirical scale space towards its deterministic target.

1. Introduction

Initially, the use of piecewise constant functions for regression was proposed by [25], who called the corresponding reconstruction the regressogram and proposed it as a simple exploratory tool. For a given set of jump locations, the regressogram simply averages the data between two successive jumps. A difficult issue, however, is a proper selection of the locations of the jumps and its convergence analysis.

Approximation by step functions is well examined in approximation theory (see, e.g., [7]), and there are several statistical estimation procedures which use locally constant reconstructions. [14] studied the case where the signal is a step function with one jump and showed that in this case the signal can be estimated at the parametric $n^{-1/2}$-rate and that the jump location can be estimated at a rate of $n^{-1}$. This was generalized by [28] and [29] to step functions with a known upper bound for the number of jumps. The locally adaptive regression splines method by [16] and the taut string procedure by [6] use locally constant estimates to reconstruct unknown regression functions which belong to more general function classes. Both methods reduce the complexity of the reconstruction by minimizing the total variation of the estimator, which in turn leads to a small number of local extreme values.

*Supported by the Georg Lichtenberg program "Applied Statistics & Empirical Methods" and DFG graduate program 1023 "Identification in Mathematical Models".
†Supported in part by DFG, Sonderforschungsbereich 386 "Statistical Analysis of Discrete Structures".
‡Supported by DFG grant "Statistical Inverse Problems under Qualitative Shape Constraints".
¹Institute for Mathematical Stochastics, Georgia Augusta University Goettingen, Maschmuehlenweg 8-10, D-37073 Goettingen, Germany, e-mail: [email protected]; [email protected]
²Universität Greifswald.
³Technical University Eindhoven.
AMS 2000 subject classifications: Primary 62G05, 62G20; secondary 41A10, 41A25.
Keywords and phrases: Hard thresholding, nonparametric regression, penalized maximum likelihood, regressogram, scale spaces, Skorokhod topology.



In this work we choose a different approach and define the complexity of the reconstruction by the number of intervals where the reconstruction is constant, or equivalently by the number of jumps of the reconstruction. Compared to the total variation approach, this method obviously captures extreme plateaus more easily but is less robust to outliers. This might be of interest in applications where extreme plateaus are informative, as for example in mass spectroscopy.

Throughout the following, we assume a regression model of the type
$$Y_{i,n} = f_{i,n} + \xi_{i,n}, \quad (i = 1, \ldots, n), \tag{1}$$
where $(\xi_{i,n})_{i=1,\ldots,n}$ is a triangular array of independent zero-mean random variables and $f_{i,n}$ is the mean value of a square integrable function $f \in L^2([0,1))$ over the interval $[(i-1)/n, i/n]$ (see e.g. [9]),
$$f_{i,n} = n\int_{(i-1)/n}^{i/n} f(u)\,du. \tag{2}$$
This model is well suited for physical applications, where observations of this type are quite common.

We consider minimizers $T_\gamma(Y_n) \in \operatorname{argmin} H_\gamma(\cdot, Y_n)$ of the hard thresholding functional
$$H_\gamma(u, Y_n) = \gamma \cdot \#J(u) + \frac{1}{n}\sum_{i=1}^n(u_i - Y_{i,n})^2, \tag{3}$$
where
$$J(u) = \{i : 1 \le i \le n-1,\ u_i \ne u_{i+1}\}$$
is the set of jumps of $u$. In the following we will call the minimizers of (3) jump-penalized least squares estimators, or Jplse for short.
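Although the paper analyzes $T_\gamma(Y_n)$ abstractly, the minimizer of (3) can be computed exactly by dynamic programming over the position of the last jump, in $O(n^2)$ steps for a fixed $\gamma$ (cf. the complexity remarks in the introduction). A minimal Python sketch, assuming the data come as a NumPy array; implementation details are ours:

```python
import numpy as np

def jplse(y, gamma):
    """Minimize gamma * #jumps + (1/n) * sum (u_i - y_i)^2 over step functions.

    Dynamic program: B[r] = min over partitions of y[0:r] of
    n*gamma*(#pieces - 1) + residual sum of squares; each constant piece
    y[l:r] is fitted by its mean.  B[n]/n equals the minimal value of (3).
    """
    n = len(y)
    cs, cs2 = np.cumsum(np.r_[0.0, y]), np.cumsum(np.r_[0.0, y ** 2])
    rss = lambda l, r: cs2[r] - cs2[l] - (cs[r] - cs[l]) ** 2 / (r - l)
    B, last = np.full(n + 1, np.inf), np.zeros(n + 1, dtype=int)
    B[0] = -gamma * n                      # first piece carries no jump penalty
    for r in range(1, n + 1):
        for l in range(r):
            cost = B[l] + gamma * n + rss(l, r)
            if cost < B[r]:
                B[r], last[r] = cost, l
    u, r = np.empty(n), n                  # backtrack the optimal partition
    while r > 0:
        l = last[r]
        u[l:r] = np.mean(y[l:r])
        r = l
    return u

y = np.r_[np.zeros(20), np.ones(20)] + 0.1 * np.random.default_rng(4).standard_normal(40)
print(np.round(jplse(y, gamma=0.05), 2))   # recovers the two plateaus
```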

Clearly, choosing $\gamma$ is equivalent to choosing a number of partitions of the Jplse. Figure 1 shows the Jplse for a sample dataset and different choices of the smoothing parameter $\gamma$.

This paper complements work of the authors on convergence rates of the Jplse. [2] show that, given a proper choice of the smoothing parameter $\gamma$, it is possible to obtain optimal rates for certain classes of approximation spaces under the assumption of subgaussian tails of the error distribution. As special cases, the class of piecewise Hölder continuous functions of order $0 < \alpha \le 1$ and the class of functions with bounded total variation are obtained.

In this paper we show consistency of regressograms constructed by minimizing(3) for arbitrary L2 functions and more general assumptions on the error. If the truefunction is cadlag, we additionally show consistency in the Skorokhod topology. Thisis a substantially stronger statement than the L2 convergence and yields consistencyof the whole graph of the estimator.

In concrete applications the choice of the regularization parameter γ > 0 in (3),which controls the degree of smoothness (which means just the number of jumps)of the estimate Tγ(Yn), is a delicate and important task. As in kernel regression[18, 23], a screening of the estimates over a larger region can be useful (see [16, 26]).Adapting a viewpoint from computer vision (see [15]), [3, 4] and [17] proposed toconsider the family (Tγ(f))γ>0, denoted as scale space, as target of inference. Thiswas justified in [4] by the fact that the empirical scale space converges towards thatof the actual density or regression function pointwisely and uniformly on compact

Page 79: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 67

Fig 1. The Jplse for different values of γ. The dots represent the noisy observations of somesignal f represented by the grey line. The black line shows the estimator, with γ chosen such thatthe reconstruction has four, six, eight and ten partitions, respectively.

sets. The main motivation for analyzing the scale space is exploration of structuresas peaks and valleys in regression and detection of modes in density estimation.Properties of the scale space in kernel smoothing are that structures like modesdisappear monotonically for a shrinking resolution level and that the reconstruc-tion changes continuously with respect to the bandwidth. For the Jplse, the family(Tγ(f))γ>0 behaves quite differently. Notable distinctions are that jumps may notchange monotonically and that there are only finitely many possible different esti-mates. To deal with these features, we consider convergence of the scale space in thespace of cadlag functions equipped with the Skorokhod J1 topology. In this settingwe deduce (under identifiability assumptions) convergence of the empirical scalespace towards its deterministic target. Note that the computation of the empiricalscale space is feasible. The family (Tγ(Yn)))γ>0 can be computed in O(n3) and theminimizer for one γ in O(n2) steps (see [26]).

The paper is organized as follows. After introducing some notation in Section 2,we provide in Section 3.1 the consistency results for general functions in the L2

metric. In Section 3.2 we present the results of convergence in the Skorokhod topol-ogy. Finally in Section 3.3 convergence results for the scale space are given. Theproofs as well as a short introduction to the concept of epi-convergence, which isrequired in the main part of the proofs, are given in the Appendix.

2. Model assumptions

By S([0, 1)) = span{1[s,t) : 0 ≤ s < t ≤ 1} we will denote the space of step functionswith a finite but arbitrary number of jumps and by D([0, 1)) the cadlag space ofright continuous functions on [0, 1] with left limits and left continuous at 1. Bothwill be considered as subspaces of L2([0, 1)) with the obvious identification of afunction with its equivalence class, which is injective for these two spaces. Moregenerally, by D([0, 1), Θ) and D([0,∞), Θ) we will denote spaces of functions with

Page 80: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

68 Boysen et al.

values in a metric space (Θ, ρ), which are right continuous and have left limits. ‖ · ‖will denote the norm of L2([0, 1)) and the norm on L∞([0, 1)) is denoted by ‖ · ‖∞.

Minimizers of the hard thresholding functionals (3) will be embedded into L2([0,1)) by the map ιn : R

n �−→ L2([0, 1)),

ιn((u1, . . . , un)) =n∑

i=1

ui1[(i−1)/n,i/n).

Under the regression model (1), this leads to estimates fn = ιn(Tγn(Yn)), i.e.

fn ∈ ιn(argminHγn(·, Yn)).

Note that, for a functional F we denote by argminF the whole set of minimizers.Here and in the following (γn)n∈N is a (possibly random) sequence of smoothingparameters. We suppress the dependence of fn on γn since this choice will be clearfrom the context.

For the noise, we assume the following condition.

(A) For all n ∈ N the random variables (ξi,n)1≤i≤n are independent. Moreover,there exists a sequence (βn)n∈N with n−1βn → 0 such that

(4) max1≤i≤j≤n

(ξi,n + · · · + ξj,n)2

j − i + 1≤ βn P-a.s.,

for almost every n.

The behavior of the process (4) is well known for certain classes of i.i.d. sub-gaussian random variables (see e.g. [22]). If for example ξi,n = ξi ∼ N(0, σ2) for alli = 1, . . . , n and all n, we can choose βn = 2σ2 log n in Condition (A). The nextresult shows that (A) is satisfied for a broad class of subgaussian random variables.

Lemma 1. Assume the noise satisfies the following generalized subgaussian condi-tion

(5) Eeνξi,n ≤ eαnζν2

, (for all ν ∈ R, n ∈ N, 1 ≤ i ≤ n)

with 0 ≤ ζ < 1 and α > 0. Then there exist a C > 0 such that for βn = Cnζ log nCondition (A) is satisfied.

A more common moment condition is given by the following lemma.

Lemma 2. Assume the noise satisfies

(6) supi,n

E|ξi,n|2m < ∞, (for all n ∈ N, 1 ≤ i ≤ n)

for m > 2. Then for all C > 0 and βn = C(n log n)2/m Condition (A) is satisfied.

3. Consistency

In order to extend the functional in (3) to L2([0, 1)), we define for γ > 0, thefunctionals H∞

γ : L2([0, 1)) × L2([0, 1)) �−→ R ∪∞:

H∞γ (g, f) =

{γ · #J (g) + ‖f − g‖2

, g ∈ S([0, 1)),∞, otherwise.

Page 81: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 69

HereJ (g) = {t ∈ (0, 1) : g(t−) �= g(t+)}

is the set of jumps of g ∈ S([0, 1)). For γ = 0, we set H∞0 (g, f) = ‖f − g‖2 for all

g ∈ L2([0, 1)). The following lemma guarantees the existence of a minimizer.

Lemma 3. For any f ∈ L2([0, 1)) and all γ ≥ 0 we have

argminH∞γ (·, f) �= ∅.

In the following we assume that Yn is determined through (1), the noise ξn

satisfies (A) and (βn)n∈N is a sequence with βn/n → 0 such that (4) holds.

3.1. Convergence in L2

We start with investigating the asymptotic behavior of the Jplse when the sequenceγn converges to a constant γ greater than zero. In this case we do not recover theoriginal function in the limit, but a parsimonious representation at a certain scaleof interest determined by γ.

Theorem 1. Suppose that f ∈ L2([0, 1)) and γ > 0 are such that fγ is a uniqueminimizer of H∞

γ (·, f). Then for any (random) sequence (γn)n∈N ⊂ (0,∞) withγn → γ P-a.s., we have

fnL2([0,1))−−−−−→

n→∞fγ P-a.s.

The next theorem states the consistency of the Jplse towards the true signal forγ = 0 under some conditions on the sequence γn.

(H) (γn)n∈N satisfies γn → 0 and γnn/βn → ∞ P-a.s..

Theorem 2. Assume f ∈ L2([0, 1)) and (γn)n∈N satisfies (H). Then

fnL2([0,1))−−−−−→

n→∞f, P-a.s.

3.2. Convergence in Skorokhod topology

As we use cadlag functions for reconstructing the original signal, it is natural toask, whether it is possible to obtain consistency in the Skorokhod topology.

We remember the definition of the Skorokhod metric [12, Section 5 and 6]. LetΛ∞ denote the set of all strictly increasing continuous functions λ : R+ �−→ R+

which are onto. We define for f, g ∈ D([0,∞), Θ)

ρ(f(λ(t) ∧ u), g(t))

where L(λ) = sups �=t≥0 | log λ(t)−λ(s)t−s |. Similarly, Λ1 is the set of all strictly increas-

ing continuous onto functions λ : [0, 1] �−→ [0, 1] with appropriate definition of L.Slightly abusing notation, we set for f, g ∈ D([0, 1), Θ),

ρS(f, g) = inf{

max(L(λ), sup0≤t≤1

ρ(f(λ(t)), g(t))) : λ ∈ Λ1

}.

The topology induced by this metric is called J1 topology. After determining themetric we want to use, we find that in the situation of Theorem 1 we can establishconsistency without further assumptions, whereas in the situation of Theorem 2 fhas to belong to D([0, 1)).

Page 82: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

70 Boysen et al.

Theorem 3. (i) Under the assumptions of Theorem 1,

fnD([0,1))−−−−−→n→∞

fγ P-a.s.

(ii) If f ∈ D([0, 1)) and (γn)n∈N satisfies (H), then

fnD([0,1))−−−−−→n→∞

f P-a.s.

If f is continuous on [0, 1], then

fnL∞([0,1])−−−−−−→

n→∞f P-a.s.

3.3. Convergence of the scale spaces

As mentioned in the introduction, following [4], we now want to study the scale spacefamily (Tγ(f))γ>0 as target for inference. First we show that the map γ �→ Tγ(f)can be chosen piecewise constant with finitely many jumps.

Lemma 4. Let f ∈ L2([0, 1)). Then there exists a number m(f) ∈ N ∪ {∞} and adecreasing sequence (γm)m(f)

m=0 ⊂ R ∪∞ such that

(i) γ0 = ∞, γm(f) = 0,(ii) for all 1 ≤ i ≤ m(f) and γ′, γ′′ ∈ (γi, γi−1) we have that

argminH∞γ′ (·, f) = argminH∞

γ′′(·, f) ,

(iii) for all 1 ≤ i ≤ m(f) − 1 and γi+1 < γ′ < γi < γ′′ < γi−1 we have:

argminH∞γi

(·, f) ⊇ argminH∞γ′ (·, f) ∪ argminH∞

γ′′(·, f) ,

and(iv) for all γ′ > γ1

argminH∞∞ (·, f) = argminH∞

γ′ (·, f) = {T∞(f)} .

Here T∞(f) is defined by T∞(f)(x) =∫

f(u) du1[0,1)(x).

Thus we may consider functions τn ∈ D([0,∞), L2([0, 1))) with

τn(ζ) ∈ ιn(argminH1/ζ(·, Yn)) ,

for all ζ ≥ 0. We will call τn the empirical scale space. Similarly, we define thedeterministic scale space τ for a given function f , such that

(7) τ(ζ) ∈ argminH∞1/ζ(·, f)), (for all ζ ≥ 0).

The following theorem shows that the empirical scale space converges almost surelyto the deterministic scale space. Table 1 and Figure 2 demonstrate this in a finitesetting for the blocks signal, introduced by [10].

Theorem 4. Suppose f ∈ L2([0, 1)) is such that #argminH∞γ (·, f) = 1 for all but

a countable number of γ > 0 and #argminH∞γ (·, f) ≤ 2 for all γ ≥ 0. Then τ is

uniquely determined by (7). Moreover,

τn −−−−→n→∞

τ P-a.s.

holds both in D([0,∞), D([0, 1))) and D([0,∞), L2([0, 1))).

Page 83: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 71

Fig 2. Comparison of scale spaces. The “Blocks” data of [11] sampled at 64 points (dots) arecompared with the different parts of the scale space derived both from the data (black) and theoriginal signal (grey), starting with γ = ∞ and lowering its value from left to right and top tobottom. Note that for the original sampling rate of 2048 the scale spaces are virtually identical.

Page 84: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

72 Boysen et al.

Table 1

Comparison of scale spaces. For the “Blocks” data of [10] sampled in 64 points with a signal tonoise ratio of 7, the eleven largest γ values (see Lemma 4) for the deterministic signal (bottom)

and the noisy signal (top) are compared. The last two values of the bottom row are equal tozero, since there are only nine ways to reconstruct the deterministic signal

852 217 173 148 108 99.8 55.9 46.6 5.36 4.62 2.29885 249 159 142 100 99.1 80.2 41.3 38.9 0 0

Fig 3. Scale spaces of a sample function (grey line). The black lines show all reconstructions ofthe sample function for varying γ.

Discussion. The scale space of a penalized estimator with hard thresholdingtype penalties generally does not have the same nice properties as its counterpartsstemming from an l2- or l1-type penalty. In our case the function value at somepoint of the reconstruction does not change continuously or monotonically in thesmoothing parameter. Moreover, the set of jumps of a best reconstruction with kpartitions is not necessarily contained in the set of jumps of a best reconstructionwith k′ partitions for k < k′, see Figure 3. This leads to increased computationalcosts, as greedy algorithms in general do not yield an optimal solution. Indeed, oneneeds only O(n log n) steps to compute the estimate for a given γ if the penalty isof l1 type as in locally adaptive regression splines by [16], compared to O(n2) stepsfor the Jplse.

We mention, that penalizing the number of jumps corresponds to an L0-penaltyand is a limiting case of the [20] functional, when the dimension of the signal (image)is d = 1 [27], and results in “hard segmentation” of the data [24].

4. Proofs

Some additional notation. Throughout this section, we shorten J(fn) to Jn.We set Sn([0, 1)) = ιn(Rn), Bn = σ(Sn([0, 1))). Observe that ιn(fn) is just theconditional expectation EU0,1(f |Bn), denoting the uniform distribution on [0, 1) byU0,1. Similarly, for any finite J ⊂ (0, 1) define BJ = σ({[a, b) : a, b ∈ J ∪ {0, 1}})and the partition PJ = {[a, b) : a, b ∈ J ∪ {0, 1}, (a, b) ∩ J = ∅}. For our proofs itis convenient to formulate all minimization procedures on L2([0, 1)). Therefore weintroduce the following functionals H∞

γ , Hγ : L2([0, 1))× L2([0, 1)) �−→ R, definedas

Hγ(g, f) =

{γ#J(g) + ‖f − g‖2 − ‖f‖2

, if g ∈ Sn([0, 1)),∞, otherwise,

H∞γ (g, f) = H∞

γ (g, f) − ‖f‖2.

Page 85: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 73

Clearly for each f , H∞γ has the same minimizers as H∞

γ , differing only by a constant.The following Lemma relates the minimizers of Hγ and Hγ .

Lemma 5. For all f ∈ L2([0, 1)) and n ∈ N we have u ∈ argminHγ(·, fn) if andonly if ιn(u) ∈ argmin Hγ(·, f). Similarly, u ∈ argminHγ(·, y) for y ∈ R

n if andonly if ιn(u) ∈ argmin Hγ(·, ιn(y)).

Proof. The second assertion follows from the fact that for u, y ∈ Rn

Hγ(ιn(u), ιn(y)) = Hγ(u, y) − ‖f‖2.

Further, for u ∈ Rn we have 〈ιn(fn) − f, ιn(fn) − ιn(u)〉 = 0 which gives

Hγ(ιn(u), f) = γ#J(u) + ‖f − g‖2 − ‖f‖2

= γ#J(u) +∥∥ιn(fn) − ιn(u)

∥∥2 +∥∥f − ιn(fn)

∥∥2 − ‖f‖2

= Hγ(u, fn) + constf,n

what completes the proof.

The minimizers g ∈ S([0, 1)) of Hγ(·, f) and H∞γ (·, f) for γ > 0 are determined

by their jump set J(g) through the formula g = EU0,1(f |BJ(g)). In the sequel, weabbreviate

µI(f) = �(I)−1

I

f(u) du

to denote the mean of f on some interval I. In addition, we will use the abbreviationfJ := EU0,1(f |BJ), such that for any partition PJ of [0, 1)

fJ =∑

I∈PJ

µI(f)1I .

Further, we extend the noise in (1) to L2([0, 1)) by

ξn = ιn((ξ1,n, . . . , ξn,n)).

4.1. Technical tools

We start by giving estimates on the behavior of (ξn)J =∑

I∈PJµI(ξn)1I .

Lemma 6. Assume (ξi,n)n∈N,1≤i≤n satisfies (A). Then P-almost surely for all in-tervals I ⊂ [0, 1) and all n ∈ N

µI(ξn)2 ≤ βn

n�(I).

Proof. For intervals of the type [(i− 1)/n, j/n) with i ≤ j ∈ N the claim is a directconsequence of (4). For general intervals, [(i+p1)/n, (j−p2)/n) with p1, p2 ∈ [0, 1],we have to show that

(p1 · ξi,n + ξi+1,n + · · · + ξj−1,n + p2 · ξj,n)2 − βn(p1 + p2 + j − i − 1) ≤ 0.

The left expression is convex over [0, 1]2 if it is considered as function in (p1, p2).Hence it attains its maximum in an extreme point of [0, 1]2.

Page 86: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

74 Boysen et al.

Lemma 7. There is a set of P-probability one on which for all sequences (Jn)n∈N

of finite sets in (0, 1) the relation limn→∞ βn#Jn/n = 0 implies

(ξn)Jn

L2([0,1))−−−−−→n→∞

0.

Proof. By Lemma 6 we find

(8) ‖(ξn)Jn‖2 =∑

I∈PJn

�(I)µI(ξn)2 ≤ βn

n(#Jn + 1),

This immediately gives the assertion.

Now we wish to show that the functionals epi-converge (see section 4.4). To thisend we need two more results.

Lemma 8. Let (Jn)n∈N be a sequence of closed subsets in (0, 1) which satisfies therelation limn→∞ βn#Jn/n = 0. For (gn)n∈N ⊂ L2([0, 1)) with ‖gn − g‖ −−−−→

n→∞0,

where gn is BJn measurable, we have almost surely

‖f + ξn − gn‖2 − ‖f + ξn‖2 −−−−→n→∞

‖f − g‖2 − ‖f‖2 .

Proof. First observe that

‖f + ξn − gn‖2 − ‖f + ξn‖2 = ‖gn‖2 − 2〈f, gn〉 − 2〈ξn, gn〉= ‖gn‖2 − 2〈f, gn〉 − 2〈(ξn)Jn , gn〉 .

Since the sequence (‖gn‖)n∈N is bounded we can use Lemma 7 to deduce

〈(ξn)Jn , gn〉 P-a.s.−−−−→n→∞

0.

This completes the proof.

Before stating the next result, we recall the definition of the Hausdorff metricρH on the space of closed subsets CL(Θ) of a compact metric space (Θ, ρ). ForΘ′ ⊆ Θ � ϑ we set

dist(ϑ, Θ′) = inf{ρ(ϑ, ϑ′) : ϑ′ ∈ Θ′}.

Define

ρH(A, B) =

max{supx∈A dist(x, B), supy∈B dist(y, A)}, A, B �= ∅,

1, A �= B = ∅,

0, A = B = ∅ ,

With this metric, CL(Θ) is again compact for compact Θ [19, see].

Lemma 9. The map

L2([0, 1)) � g �→{

#J(g), g ∈ S([0, 1))∞, g �∈ S([0, 1)) ∈ N ∪ {0,∞}

is lower semi-continuous, meaning the set {g ∈ S([0, 1)) : #J(g) ≤ N} is closed forall N ∈ N ∪ {0}.

Page 87: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 75

Proof. Suppose that ‖gn − g‖ −−−−→n→∞

0 with #J(gn) ≤ N < #J(g). Using compact-

ness of the space of closed subsets CL([0, 1]) and turning possibly to a subsequence,we could arrange that J(gn) ∪ {0, 1} −−−−→

n→∞J ∪ {0, 1} for some closed J ⊂ (0, 1),

where convergence is understood in Hausdorff metric ρH . Since the cardinality islower semi-continuous with respect to the Hausdorff metric, J must be finite. Weconclude for (s, t) ∩ J = ∅ and ε > 0 that (s + ε, t − ε) ∩ J(gn) = ∅ eventually,i.e. gn is constant on (s + ε, t − ε). Next we observe that gn1(s+ε,t−ε) convergestowards g1(s+ε,t−ε) (in L2([0, 1))) what implies that g is constant on (s + ε, t − ε).Since ε > 0 was arbitrary, we derive that g is constant on (s, t). Consequently, g isin S([0, 1)) and J(g) ⊆ J . Using again lower semi-continuity of the cardinality inthe space of compact subsets of [0, 1] shows that

#J(g) > N ≥ lim supn

#J(gn) ≥ lim infn

#J(gn) ≥ #J ≥ #J(g).

This contradiction completes the proof.

Now we can state the epi-convergence of Hγn as function on L2([0, 1)).

Lemma 10. For all sequences (γn)n∈N satisfying (H) we have

Hγn(·, f + ξn)epi−−−−→

n→∞H∞

γ (·, f)

almost surely. Here Hγn , H∞γ are considered as functionals on L2([0, 1)).

Proof. We have to show that on a set with probability one we have

(i) If gn −−−−→n→∞

g then lim infn→∞ Hγn(gn, f + ξn) ≥ H∞γ (g, f).

(ii) For all g ∈ L2([0, 1)), there exists a sequence (gn)n∈N ⊂ L2([0, 1)), gn −−−−→n→∞

g

with lim supn→∞ Hγn(gn, f + ξn) ≤ H∞γ (g, f).

To this end, we fix the set where the assertions of Lemmas 7 and 8 hold simultane-ously.

Ad 4.1: Without loss of generality, we may assume that Hγn(gn, f +ξn) convergesin R ∪ ∞. If gn /∈ Sn([0, 1)) for infinitely many n or #J(gn) > H∞

γ (g, f)/γn therelation 4.1 is trivially fulfilled. Otherwise, we obtain

lim supn→∞

βn

n#J(gn) ≤ lim sup

n→∞

βn

nγnH∞

γ (g, f) = 0.

Hence we can apply Lemma 8. Together with Lemma 9 we obtain P-a.s.

lim infn→∞

Hγn(gn, f + ξn)

≥ lim infn→∞

γnJ(gn) + lim infn→∞

(‖f + ξn − gn‖2 − ‖f + ξn‖2)

≥ γJ(g) + (‖f − g‖2 − ‖f‖2) = H∞γ (g, f).

Ad 4.1: If g /∈ S([0, 1)) and γ > 0 there is nothing to prove. If γ = 0 and stillg /∈ S([0, 1)), choose gn as a best L2-approximation of g in Sn([0, 1)) with at most1/

√γn jumps.

We claim that ‖gn − g‖ → 0 as n → ∞. For that goal, let gn,k denote a bestapproximation of g in {f ∈ Sn([0, 1)) : #J(f) ≤ k} and gk one in {f ∈ S([0, 1)) :#J(f) ≤ k}.

Page 88: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

76 Boysen et al.

Moreover, for every n, k let Jkn ⊂ (0, 1) be a perturbation of J(gk), with nJk

n ∈ N,#Jk

n = #J(gk) and ρH(Jkn , J(gk)) ≤ 1/n. Denote g′n,k = gk ◦ λn,k where λn,k ∈ Λ1

fulfills λn,k(Jkn) = J(gk). Since (a, b) �→ 1[a,b) is continuous in L2([0, 1)), we obtain

readily ‖g′n,k − gk‖ → 0. This implies for any k ∈ N

lim supn→∞

‖gn − g‖ ≤ lim supn→∞

‖gn,k − g‖ ≤ lim supn→∞

∥∥g′n,k − g∥∥ = ‖gk − g‖ .

Since the right hand side can be made arbitrary small by choosing k, gn convergesto g. Then Lemma 8 yields 4.1.

If γ > 0 and g ∈ S([0, 1)), gn is chosen as a best approximation of g in Sn([0, 1))with at most #J(g) jumps. Finally, in order to obtain 4.1, argue as before.

To deduce consistency with the help of epi-convergence, one needs to show thatthe minimizers are contained in a compact set. The following lemma will be appliedto this end.

Lemma 11. Assume (Θ, ρ) is a metric space. A subset A ⊂ D([0,∞), Θ) is rela-tively compact if the following two conditions hold

(B1) For all t ∈ R+ there is a compact Kt ⊆ Θ such that

g(t) ∈ Kt, (for all g ∈ A).

(B2) For all T > 0 and all ε > 0 there exists a δ > 0 such that for all g ∈ A thereis a step function gε ∈ S([0, T ), Θ) such that

sup{ρ(g(t), gε(t)) : t ∈ [0, T )} < ε and mpl(gε) ≥ δ ,

where mpl is the minimum distance between two jumps of f ∈ S([0, T ))

mpl(f) := min{|s − t| : s �= t ∈ J(f) ∪ {0, T}}.

A subset A ⊂ D([0, 1), Θ) is relative compact if the following two conditions hold

(C1) For all t ∈ [0, 1] there is a compact Kt ⊆ Θ such that

g(t) ∈ Kt (for all g ∈ A).

(C2) For all ε > 0 there exists a δ > 0 such that for all g ∈ A there is a stepfunction gε ∈ S([0, 1), Θ) such that

sup{ρ(g(t), gε(t)) : t ∈ [0, 1]} < ε and mpl(gε) ≥ δ .

Proof. We prove only the first assertion, as the proof of the second assertion canbe carried out in the same manner.

According to [12], Theorem 6.3, it is enough to show that (B2) implies

limδ→0

supg∈A

wg(δ, T ) = 0

where

wg(δ, T ) = inf{

max1≤i≤v

sups,t∈[ti−1,ti)

ρ(g(s), g(t)) : {t1, . . . , tv−1} ⊂ (0, T ),

t0 = 0, tv = T, |ti − tj | > δ

}.

Page 89: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 77

So, fix T > 0, ε > 0 and choose δ from (B2). Then we set for g ∈ A {t0, . . . , tv} =J(gε)∪{0, T}. Clearly, mpl(gε) > δ implies |ti−tj | > δ for all i �= j. For neighboringti−1, ti ∈ J(gε) ∪ {0, T} and s, t ∈ [ti−1, ti) we derive

ρ(g(s), g(t)) ≤ ρ(g(s), gε(s)) + ρ(gε(s), gε(t)) + ρ(gε(t), g(t)) < ε + 0 + ε = 2ε.

This establishes the above condition and completes the proof.

In the context of proving compactness we will also need the following result.

Lemma 12. For any f ∈ L2([0, 1)) the set {fJ : J ⊂ (0, 1), #J < ∞} isrelatively compact in L2([0, 1)).

Proof. The proof is done in several steps.1. Since (s, t) �→ 1[s,t) is continuous,

{M∑

i=1

αi1Ii : |αi| ≤ z, Ii ⊆ [0, 1) interval

}

is the continuous image of a compact set and hence compact for all M ∈ N andz > 0.

2. If f = 1I for some interval I, we obtain for any J ⊂ [0, 1) that fJ is a linearcombination of at most three different indicator functions.

3. If f =∑M

i=1 αi1Ii is a step function and J arbitrary then fJ =∑M ′

j=1 βj1I′j

holds by 2. for some M ′ ≤ 3M . Using

βj = µI′j(f) ≤ max

i=1,...,M|αi|

as well as 1., we get that {fJ : J ⊂ [0, 1)} is relatively compact for step functionsf .

4. Suppose f ∈ L2([0, 1)) is arbitrary and ε > 0. We want to show that wecan cover {fJ : J ⊂ [0, 1)} by finitely many ε-balls. Fix a step function g suchthat ‖f − g‖ < ε/2. By the Jensen Inequality for conditional expectations, weget ‖fJ − gJ‖ < ε/2 for all finite J ⊂ [0, 1). Further, by 3., there are finite setsJ1, . . . , Jp ⊂ [0, 1) with p < ∞ such that minl=1,...,p ‖gJ − gJl

‖ < ε/2 for all finiteJ ⊂ [0, 1). This implies

minl=1,...,p

‖fJ − gJi‖ ≤ minl=1,...,p

‖gJ − gJl‖ + ‖fJ − gJ‖ < ε

and the proof is complete.

4.2. Behavior of the partial sum process

Proof of Lemma 1. The following Markov inequality is standard for triangular ar-rays fulfilling condition (A), [21], Section III, §4, and all numbers µi, i = 1, . . . , n:

P(|n∑

i=1

µiξi,n| ≥ z) ≤ 2 exp( −z2

4αnζ∑

i µ2i

)(for all z ∈ R).

From this, we derive for z2 > 12α that∑

n∈N

1≤i≤j≤n

P(|ξi,n + · · · + ξj,n| ≥ z√

j − i + 1√

nζ log n)

≤ 2∑

n∈N

n2e−z2 log n

4α = 2∑

n

n− z2−8α4α < ∞.

Page 90: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

78 Boysen et al.

Hence, for ε > 0 we have with probability one that

max1≤i≤j≤n

(ξi,n + · · · + ξj,n)2

(j − i + 1)≥ (12 + ε)αnζ log n

only finitely often.

For the proof of Lemma 2, we need an auxiliary lemma. Denote by

Dn ={(i, j) : 1 ≤ i ≤ j ≤ n such that i = k2l, j = (k + 1)2l

for some l, k ∈ {0, 1, 2, . . .}}

the set of all pairs (i, j) which are endpoints of dyadic intervals contained in{1, . . . , n}.Lemma 13. Assume x ∈ R

n such that

(9) max(i,j)∈Dn

|xi + · · · + xj |√j − i + 1

≤ c

for some c > 0. Then

max1≤i≤j≤n

|xi + · · · + xj |√j − i + 1

≤ (2 +√

2)c .

Proof. Without loss of generality we may assume that n = 2m for some m ∈ N

(and add some zeros otherwise). First, we prove by induction on m that (9) implies

(10) max1≤j≤n

|x1 + · · · + xj |√j

≤ (1 +√

2)c .

For m = 0 there is nothing to prove. Now assume that the statement is true for m.Let 2m < j ≤ 2m+1. Note that

|x1 + · · · + xj |√j

≤√

2m

√j

|x1 + · · · + x2m |√2m

+√

j − 2m

√j

|x2m+1 + · · · + xj |√j − 2m

.

Apply the induction hypothesis to the second summand to obtain

|x1 + · · · + xj |√j

≤(√

2m + (1 +√

2)√

j − 2m)

√j

c .

For 2m + 1 ≤ j ≤ 2m+1 the expression on the right hand side is maximal forj = 2m+1 with maximum (1 +

√2)c. Hence the statement holds also for m + 1 and

we have shown that (9) implies (10).The claim is again proven by induction on m. For m = 0 there is nothing to prove.

Assume that the statement is true for m. If i ≤ j ≤ 2m or 2m < i ≤ j ≤ 2m+1 thestatement follows by application of the induction hypotheses to (x1, . . . , x2m) and(x2m+1, . . . , x2m+1), respectively. Now suppose i < 2m < j. Then

|xi + · · · + xj |√j − i + 1

≤√

2m − i + 1√j − i + 1

|xi + · · · + x2m |√2m − i + 1

+√

j − 2m

√j − i + 1

|x2m+1 + · · · + xj |√j − 2m

Page 91: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 79

Application of (10) to x′ = (x2m , x2m−1, . . . , x1) and x = (x2m+1, . . . , x2m+1) thengives

|xi + · · · + xj |√j − i + 1

≤√

2m − i + 1 +√

j − 2m

√j − i + 1

(1 +√

2)c ≤√

2(1 +√

2)c .

Proof of Lemma 2. [8] show that for m ≥ 1 and some constant Cm depending onm only

E

( |ξi,n + · · · + ξj,n|2m

(j − i + 1)m

)≤ Cm

E|ξi,n|2m + · · · + E|ξj,n|2m

j − i + 1.

The Markov inequality then yields for any z > 0 and all 1 ≤ i ≤ j ≤ n

P

( |ξi,n + · · · + ξj,n|√j − i + 1

≥ z)≤

Cm supi,n E|ξi,n|2m

z2m.

Since there are at most 2n dyadic intervals contained in {1, . . . , n}, we obtain byLemma 13 for any C > 0 that

n∈N

1≤i≤j≤n

P

( |ξi,n + · · · + ξj,n|√j − i + 1

≥ C(n log n)1/m)

≤∑

n∈N

(i,j)∈Dn

P

( |ξi,n + · · · + ξj,n|√j − i + 1

≥ (2 +√

2)C(n log n)1/m)

≤Cm supi,n E|ξi,n|2m

(2 +√

2)2mC2m

n∈N

2n

n2 log2 n< ∞.

The claim follows by application of the Borel–Cantelli lemma.

4.3. Consistency of the estimator

The proofs in this section use the concept of epi-convergence. It is introduced inAppendix.

Proof of Lemma 3. For γ = 0 there is nothing to prove. Assume γ > 0 and g ∈S([0, 1)) with #J(g) > ‖f‖2/γ. This yields

H∞γ (0, f) = ‖f‖2 < H∞

γ (g, f) .

Moreover, observe that for g ∈ S([0, 1)) we have H∞γ (g, f) ≥ H∞

γ (fJ(g), f). Thus,it is enough to regard the set {fJ : #J ≤ ‖f‖2/γ}, which is relatively compact inL2([0, 1)) by Lemma 12. This proves the existence of a minimizer.

Proof of Theorem 1 and Theorem 2. By the reformulation of the minimizers inLemma 5, Lemma 10 and Theorem 5 (see Appendix) it is enough to prove thatalmost surely there is a compact set containing

n∈N

argmin Hγn(·, f + ξn) .

First note that all fn ∈ argmin Hγn(·, f + ξn) have the form (f + ξn)Jn for some(random) sets Jn. Comparing Hγn(fn, f + ξn) with Hγn(0, f + ξn) = 0, we obtainthe a priori estimate

γn#Jn ≤ ‖(f + ξn)Jn‖2 ≤ 2‖f‖2 + 2 ‖(ξn)Jn‖

2 ≤ 2‖f‖2 +2βn

n(#Jn + 1)

Page 92: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

80 Boysen et al.

for all n ∈ N. Since γn > 4βn

n eventually, we find P-a.s.

#Jn ≤2‖f‖2 + 2βn

n

γn − 2βn

n

= O(γ−1n ).

Application of Lemma 7 gives limn→∞(ξn)Jn = 0 almost surely. Since by Lemma 12,{fJn : n ∈ N} is relatively compact in L2([0, 1)), relative compactness of the set⋃

n∈Nargmin Hγn(·, f + ξn) follows immediately. This completes the proofs.

Proof of Theorem 3, part (i) . Theorem 1 and Lemma 9 imply

lim infn→∞

#Jn ≥ #J(fγ) .

Suppose lim supn→∞ #Jn ≥ #J(fγ) + 1. Let fγ,n be an approximation of fγ fromSn([0, 1)) with the same number of jumps as fγ . Then we could arrange fγ,n −−−−→

n→∞fγ such that limn→∞ Hγ(fγ,n, f + ξn) = H∞

γ (fγ , f). Moreover, we know

lim supn→∞

Hγ(fn, f + ξn) ≥ γ + H∞γ (fγ , f) = γ + lim

n→∞Hγ(fγ,n, f + ξn)

which contradicts that fn is a minimizer of Hγ(·, f + ξn) for all n. Therefore,#Jn = #J(fγ) eventually.

Next, chose by compactness a subsequence such that Jn ∪ {0, 1} converges inρH . Then, by Lemma 9, the limit must be J(fγ) ∪ {0, 1}. Consequently, the wholesequence (Jn)n∈N converges to J(fγ) in the Hausdorff metric.

Thus eventually, there is a 1-1 correspondence between PJn and PJ(fγ) such thatfor each [s, t) ∈ PJ(fγ) there are [sn, tn) ∈ PJn with

sn −−−−→n→∞

s and tn −−−−→n→∞

t .

By Lemma 6 and continuity of (s, t) �→ 1[s,t), we find

µ[sn,tn)(f + ξn) −−−−→n→∞

µ[s,t)(f) .

Construct λn ∈ Λ1 linearly interpolating λn(sn) = s. Then

L(λn) −−−−→n→∞

1

as well as

‖fn − f ◦ λn‖∞ = maxI∈PJ(fγ )

|µλ−1(I)(f + ξn) − µI(f)| −−−−→n→∞

0

which completes the proof.

Proof of Theorem 3, part (ii). The proof can be carried out in the same manner asthe proof of Theorem 4, part (ii) in [2]. The only difference is, that it is necessaryto attend the slightly different rates of the partial sum process (4).

Page 93: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 81

4.4. Convergence of scale spaces

Proof of Lemma 4. It is clear, that each g ∈ argminH∞γ (·, f) is determined by its

jump set. Further, if g1, g2 ∈ S([0, 1)) with #J(g1) = #J(g2) and ‖f−g1‖ = ‖f−g2‖then g1 is a minimizer of H∞

γ (·, f) if and only if g2 is.Since H∞

γ (0, f) = ‖f‖2 we have that γ ∈ [ν,∞) implies J(g) ≤ ‖f‖2/ν, for aminimizer g of H∞

γ (·, f). Hence on [ν,∞) we have that

minH∞γ (·, f) = min{kγ + ∆k(f) : k ≤ ‖f‖2/ν}

with ∆k(f) defined by

∆k(f) := inf{‖g − f‖ : g ∈ S([0, 1)), #J(g) ≤ k} .

For each ν the map γ �→ minH∞γ (·, f) is thus a minimum of a finite collection of

linear functions with pairwise different slopes on [ν,∞). If there are different k, k′

and γ with kγ+hk = k′γ+hk′ it follows γ = (hk′−hk)/(k−k′). From this it followsthat there are only finitely many γ where #{k : kλ + ∆k(f) = minH∞

γ (·, f)} >1. Further, argminH∞

γ (·, y) is completely determined by the k which realize thisminimum. Call those γ, for which different k realize the minimum, changepointsof γ �→ min H∞

γ (·, f). Since the above holds true for each ν > 0, there are onlycountably many changepoints in [0,∞). This completes the proof.

Proof of Theorem 4. It is easy to see that the assumptions imply J(τ) = {γm : m =1, . . . , m(x)} for the sequence (γm)m(x)

m=0 ⊂ R∪∞ of Lemma 4. Since the scale spaceτ is uniquely determined by its jump points, this proves the uniqueness claim.

For the proof of the almost sure convergence, note that Theorem 1 and Theo-rem 3, part (i) show that τn(ζ) →n→∞ τ(ζ) if ζ is a point of continuity of τ , i.e.# argminH∞

1/ζ(·, f) = 1. Convergence in all continuity points together with relativecompactness of the sequence implies convergence in the Skorokhod topology. Hence,it is enough to show that {τn : n ∈ N} is relatively compact.

To this end, we will use Lemma 11. In the proof of Theorem 1 it was shown,that the sequence (Tγ(Yn))n∈N is relatively compact in L2(0, 1). To prove relativecompactness in D([0, 1)) we follow the lines of the proof of Theorem 3, part (i).Similarly we find that

lim supn→∞

#Jn ≤ maxg∈argmin H∞

1/ζ(·,f)

#J(g) .

For each subsequence of (Tγ(Yn))n∈N, consider the subsequence of correspondingjump sets. By compactness of CL([0, 1]) we choose a converging sub-subsequenceand argue as in the proof mentioned above that the corresponding minimizers con-verge to a limit in argminH∞

1/ζ(·, f). Thus we have verified condition (B1).For the proof of (B2), we will show by contradiction that for all T > 0 we have

inf{mpl(τn|[0,T ]) : n ∈ N} > 0.

This, obviously, would imply (B2). Observe that τn jumps in ζ only if thereare two jump sets J �= J ′ such that H1/ζ((Yn)J , Yn) = H1/ζ((Yn)J ′ , Yn) andH1/ζ((Yn)J , Yn) ≤ H1/ζ((Yn)J ′′ , Yn) for all J ′′.

If (B2) is not fulfilled for (τn)n∈N, we can switch by compactness to a subsequenceand find sequences (ζ1

n)n∈N, (ζ2n)n∈N with ζ1

n, ζ2n ∈ J(τn), ζ1

n < ζ2n and ζ1

n −−−−→n→∞

ζ,

ζ2n −−−−→

n→∞ζ for some ζ ≥ 0. Choosing again a subsequence, we could assume that the

Page 94: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

82 Boysen et al.

jump sets J1n, J2

n, J3n of minimizers fk

n ∈ ιn(argminHγn(·, Yn)) for some sequencesγ1

n − 1/ζ1n ↓ 0, γ2

n ∈ (1/ζ2n, 1/ζ1

n) and γ3n − 1/ζ2

n ↑ 0 are constant and (fkn)n∈N,

k = 1, 2, 3, converge. Further, we know from this choice of γkn and Lemma 4 that

#J1n > #J2

n > #J3n. This implies

(11) γ1n + γ2

n + ‖ιn(Yn) − f1n‖2 < γ2

n + ‖ιn(Yn) − f2n‖2 < ‖ιn(Yn) − f3

n‖2.

The same arguments as in Theorem 1 and Theorem 3, part (i) respectively, yield{limn→∞ fk

n : k = 1, 2, 3} ⊆ argminH∞1/ζ(·, f). Since (11) holds for all n, the limits

are pairwise different. This contradicts #argminH∞1/ζ(·, x) ≤ 2 and proves (B2).

Thus {τn : n ∈ N} is relatively compact in D([0,∞), D([0, 1))) as well as inD([0,∞), L2[0, 1]) and the proof is complete.

Appendix: Epi-Convergence

Instead of standard techniques from penalized maximum likelihood regression, weuse the concept of epi-convergence (see for example [5, 13]). This allows for simpleformulation and more structured proofs. The main arguments to derive consistencyof estimates which are (approximate) minimizers for a sequence of functionals canbriefly be summarized by

epi-convergence + compactness + uniqueness a.s. ⇒ strong consistency.

We give here the definition of epi-(or Γ-)convergence together with the results fromvariational analysis which are relevant for the subsequent proofs.

Definition 1. Let Fn : Θ �−→ R ∪ ∞, n = 1, . . . ,∞ be numerical functions on ametric space (Θ, ρ). (Fn)n∈N epi-converges to F∞ (Symbol Fn

epi−−−−→n→∞

F∞) if

(i) for all ϑ ∈ Θ, and sequences (ϑn)n∈N with ϑn −−−−→n→∞

ϑ

F∞(ϑ) ≤ lim infn→∞

Fn(ϑn)

(ii) for all ϑ ∈ Θ there exists a sequence (ϑn)n∈N with ϑn −−−−→n→∞

ϑ such that

(12) F∞(ϑ) ≥ lim supn→∞

Fn(ϑn)

The main, useful conclusions from epi-convergence are given by the followingtheorem.

Theorem 5 ([1], Theorem 5.3.6). Suppose Fnepi−−−−→

n→∞F∞.

(i) For any converging sequence (ϑn)n∈N, ϑn ∈ argminFn, it holds necessarilylimn→∞ ϑn ∈ argminF∞.

(ii) If there is a compact set K ⊂ Θ such that ∅ �= argminFn ⊂ K for largeenough n then argminF∞ �= ∅ and

dist(ϑn, argminF∞) −−−−→n→∞

0

for any sequence (ϑn)n∈N, ϑn ∈ argminFn.(iii) If, additionally, argminF∞ is a singleton {ϑ} then

ϑn −−−−→n→∞

ϑ

for any sequence (ϑn)n∈N, ϑn ∈ argminFn.

Page 95: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Scale space consistency of regressograms 83

References

[1] Beer, G. (1993). Topologies on Closed and Closed Convex Sets. Kluwer Aca-demic Publishers Group, Dordrecht.

[2] Boysen, L., Kempe, A., Liebscher, V., Munk, A. and Wittich, O.

(2006). Consistencies and rates of convergence of jump-penalized least squaresestimators. Submitted.

[3] Chaudhuri, P. and Marron, J. S. (1999). SiZer for exploration of structuresin curves. J. Amer. Statist. Assoc. 94 807–823.

[4] Chaudhuri, P. and Marron, J. S. (2000). Scale space view of curve esti-mation. Ann. Statist. 28 408–428.

[5] Dal Maso, G. (1993). An Introduction to Γ-convergence. Birkhauser, Boston.[6] Davies, P. L. and Kovac, A. (2001). Local extremes, runs, strings and

multiresolution. Ann. Statist. 29 1–65.[7] DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation.

Springer, Berlin.[8] Dharmadhikari, S. W. and Jogdeo, K. (1969). Bounds on moments of

certain random variables. Ann. Math. Statist. 40 1506–1509.[9] Donoho, D. L. (1997). CART and best-ortho-basis: A connection. Ann. Sta-

tist. 25 1870–1911.[10] Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over lp-balls

for lq-error. Probab. Theory Related Fields 99 277–303.[11] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown

smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224.[12] Ethier, S. N. and Kurtz, T. G. (1986). Markov Processes. Wiley, New

York.[13] Hess, C. (1996). Epi-convergence of sequences of normal integrands and strong

consistency of the maximum likelihood estimator. Ann. Statist. 24 1298–1315.[14] Hinkley, D. V. (1970). Inference about the change-point in a sequence of

random variables. Biometrika 57 1–17.[15] Lindeberg, T. (1994). Scale Space Theory in Computer Vision. Kluwer,

Boston.[16] Mammen, E. and van de Geer, S. (1997). Locally adaptive regression

splines. Ann. Statist. 25 387–413.[17] Marron, J. S. and Chung, S. S. (2001). Presentation of smoothers: the

family approach. Comput. Statist. 16 195–207.[18] Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared

error. Ann. Statist. 20 712–736.[19] Matheron, G. (1975). Random Sets and Integral Geometry. Wiley, New

York–London–Sydney.[20] Mumford, D. and Shah, J. (1989). Optimal approximations by piecewise

smooth functions and associated variational problems. Comm. Pure Appl.Math. 42 577–685.

[21] Petrov, V. V. (1975). Sums of Independent Random Variables. Springer,New York.

[22] Shao, Q. M. (1995). On a conjecture of Revesz. Proc. Amer. Math. Soc. 123575–582.

[23] Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidthselection method for kernel density estimation. J. Roy. Statist. Soc. Ser. B 53683–690.

[24] Shen, J. (2005). A stochastic-variational model for soft Mumford–Shah seg-

Page 96: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

84 Boysen et al.

mentation. IMA Preprint Series 2062. Univ. Minnesota, Minneapolis.[25] Tukey, J. W. (1961). Curves as parameters, and touch estimation. In Proc.

4th Berkeley Sympos. Math. Statist. Probab. I 681–694. Univ. California Press,Berkeley.

[26] Winkler, G. and Liebscher, V. (2002). Smoothers for discontinuous signals.J. Nonparametr. Statist. 14 203–222.

[27] Winkler, G., Wittich, O., Liebscher, V. and Kempe, A. (2005). Don’tshed tears over breaks. Jahresber. Deutsch. Math.-Verein. 107 57–87.

[28] Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz’ cri-terion. Statist. Probab. Lett. 6 181–189.

[29] Yao, Y.-C. and Au, S. T. (1989). Least-squares estimation of a step function.Sankhya Ser. A 51 370–381.

Page 97: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotic: Particles, Processes and Inverse ProblemsVol. 55 (2007) 85–100c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000283

Confidence bands for convex median

curves using sign-tests

Lutz Dumbgen1

University of Bern

Abstract: Suppose that one observes pairs (x1, Y1), (x2, Y2), . . . , (xn, Yn),where x1 ≤ x2 ≤ · · · ≤ xn are fixed numbers, and Y1, Y2, . . . , Yn are inde-pendent random variables with unknown distributions. The only assumptionis that Median(Yi) = f(xi) for some unknown convex function f . We presenta confidence band for this regression function f using suitable multiscale sign-tests. While the exact computation of this band requires O(n4) steps, goodapproximations can be obtained in O(n2) steps. In addition the confidenceband is shown to have desirable asymptotic properties as the sample size ntends to infinity.

1. Introduction

Suppose that we are given data vectors x,Y ∈ Rn, where x is a fixed vector with

components x1 ≤ x2 ≤ · · · ≤ xn, and Y has independent components Yi withunknown distributions. We assume that

(1) Median(Yi) = f(xi)

for some unknown convex function f : R → R, where R denotes the extended realline [−∞,∞]. To be precise, we assume that f(xi) is some median of Yi. In whatfollows we present a confidence band (L, U) for f . That means, L = L(· |x,Y, α)and U = U(· |x,Y, α) are data-dependent functions from R into R such that

(2) P

(L(x) ≤ f(x) ≤ U(x) for all x ∈ R

)≥ 1 − α

for a given level α ∈ (0, 1).Our confidence sets are based on a multiscale sign-test. A similar method has

been applied by Dumbgen and Johns [2] to treat the case of isotonic regression func-tions, and the reader is referred to that paper for further references. The remainderof the present paper is organized as follows: Section 2 contains the explicit defin-ition of our sign-test statistic and provides some critical values. A correspondingconfidence band (L, U) is described in Section 3. This includes exact algorithms forthe computation of the upper bound U and the lower bound L whose running timeis of order O(n4) and O(n3), respectively. For large data sets these computationalcomplexities are certainly too high. Therefore we present approximate solutions inSection 4 whose running time is of order O(n2). In Section 5 we discuss the as-ymptotic behavior of the width of our confidence band as the sample size n tends

1Institute of Math. Statistics and Actuarial Science, University of Bern, Switzerland, e-mail:[email protected]

AMS 2000 subject classifications: Primary 62G08, 62G15, 62G20; secondary 62G35.Keywords and phrases: computational complexity, convexity, distribution-free, pool-adjacent-

violators algorithm, Rademacher variables, signs of residuals.

85

Page 98: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

86 L. Dumbgen

to infinity. Finally, in Section 6 we illustrate our methods with simulated and realdata.

Explicit computer code (in MatLab) for the procedures of the present paper aswell as of Dumbgen and Johns [2] may be downloaded from the author’s homepage.

2. Definition of the test statistic

Given any candidate g : R → R for f we consider the sign vectors sign(Y − g(x))and sign(g(x) − Y), where g(x) := (g(xi))n

i=1 and

sign(x) := 1{x > 0} − 1{x ≤ 0} for x ∈ R,

sign(v) :=(sign(vi)

)ni=1

for v = (vi)ni=1 ∈ R

n.

This non-symmetric definition of the sign function is necessary in order to dealwith possibly non-continuous distributions. Whenever the vector sign(Y− g(x)) orsign(g(x)−Y) contains “too many” ones in some region, the function g is rejected.Our confidence set for f comprises all convex functions g which are not rejected.

Precisely, let To : {−1, 1}n → R be some test statistic such that To(σ) ≤ To(σ)whenever σ ≤ σ component-wise. Then we define

T (v) := max{To(sign(v)), To(sign(−v))

}

for v ∈ Rn. Let ξ ∈ {−1, 1}n be a Rademacher vector, i.e. a random vector with

independent components ξi which are uniformly distributed on {−1, 1}. Further letκ = κ(n, α) be the smallest (1 − α)–quantile of T (ξ). Then

P (T (Y − f(x)) ≤ κ) ≥ P(T (ξ) ≤ κ) ≥ 1 − α;

see Dumbgen and Johns [2]. Consequently the set

C(x,Y, α) :={convex g : T (Y − g(x)) ≤ κ

}

contains f with probability at least 1 − α.As for the test statistic To, let ψ be the triangular kernel function given by

ψ(x) := max(1 − |x|, 0).

Then we define

To(σ) := maxd=1,...,�(n+1)/2�

(max

j=1,...,nTd,j(σ) − Γ

(2d − 1n

)),

where

Γ(u) := (2 log(e/u))1/2,

Td,j(σ) := βd

n∑

i=1

ψ( i − j

d

)σi with βd :=

( d−1∑

i=1−d

ψ( i

d

)2)−1/2

.

Note that Td,j(σ) is measuring whether (σi)j−d<i<j+d contains suspiciously manyones. Thus d and j can be viewed as scale and location parameter, respectively. Thenormalizing constant βd is chosen such that the standard deviation of Td,j(ξ) is notgreater than one, with equality if d ≤ j ≤ n + 1 − d. The additive correction term

Page 99: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 87

Table 1

Critical values κ(n, α)

Sample size nα 100 200 300 500 700 1000 2000 5000 10000

0.50 0.054 0.124 0.152 0.188 0.216 0.232 0.279 0.333 0.3620.10 0.792 0.860 0.867 0.904 0.902 0.915 0.970 0.991 1.0210.05 1.035 1.102 1.102 1.135 1.136 1.152 1.229 1.231 1.246

Γ((2d− 1)/n) is justified by results of Dumbgen and Spokoiny [3] about multiscaletesting. In fact, Theorem 6.1 of Dumbgen and Spokoiny [3] and Donsker’s invarianceprinciple for partial sums of the Rademacher vector ξ together imply that thedistribution of T (ξ) converges weakly to a probability distribution on [0,∞) asn → ∞.

Explicit formulae for quantiles of the limiting distribution of T (ξ) are not avail-able. Therefore we list some quantiles of T (ξ) for various values of n and α inTable 1. Each quantile has been estimated in 19999 Monte Carlo simulations.

3. Definition and exact computation of a band

In principle one could define a confidence band (L, U) via

L := inf{g ∈ C(x,Y, α)

}

= inf{convex g : To(sign(Y − g(x)) ≤ κ, To(sign(g(x) − Y) ≤ κ

},

U := sup{g ∈ C(x,Y, α)

}

= sup{convex g : To(sign(Y − g(x)) ≤ κ, To(sign(g(x) − Y) ≤ κ

}.

Throughout this paper maxima or minima of functions are defined pointwise. Un-fortunately, the explicit computation of (L, U) is far from trivial. Therefore wemodify the latter definition and compute a band (L, U) in two steps. Our upperboundary is given by

U := max{convex g : To(sign(g(x) − Y) ≤ κ

}.

Thus we just drop the constraint To(sign(Y− g(x))) ≤ κ in the definition of U andobtain U ≥ U . With U at hand, our lower boundary is defined as

L := min{

convex g : g ≤ U , To(sign(Y − g(x))) ≤ κ}

.

Here we replace the constraint To(sign(g(x) − Y)) ≤ κ in the definition of L withthe weaker constrint g ≤ U and obtain L ≤ L. In what follows we concentrateon the computation of the corresponding vectors L = (Li)∞i=1 = L(x) and U =(Ui)n

i=1 = U(x).

3.1. Computation of U

A simplified expression for U . To determine U it suffices to consider the classG consisting of the following convex functions gj,k: For 1 ≤ j < k ≤ n with xj < xk

definegj,k(x) := Yj +

Yk − Yj

xk − xj(x − xj),

Page 100: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

88 L. Dumbgen

describing a straight line connecting the data points (xj , Yj) and (xk, Yk). Moreover,for j, k ∈ {1, . . . , n} let

g0,k(x) :=

∞ if x < xk,Yk if x = xk,−∞ if x > xk.

gj,n+1(x) :=

−∞ if x < xj ,Yj if x = xj ,∞ if x > xj .

Then

(3) U = max{g : g ∈ G, To(sign(g(x) − Y)) ≤ κ

}.

For let g be any convex function such that To(sign(g(x) − Y)) ≤ κ. Let g be thelargest convex function such that g(xi) ≤ Yi for all indices i with g(xi) ≤ Yi.This function g is closely related to the convex hull of all data points (xi, Yi) withg(xi) ≤ Yi. Obviously, g ≥ g and To(sign(g(x) − Y)) = To(sign(g(x) − Y)). Letω(1) < · · · < ω(m) be indices such that xω(1) < · · · < xω(m) and

{(x, g(x)) : x ∈ R

}∩{(xi, Yi) : 1 ≤ i ≤ n

}={(xω(�), Yω(�)) : 1 ≤ � ≤ m

}.

With ω(0) := 0 and ω(m + 1) := n + 1 one may write g as the maximum ofthe functions gω(�−1),ω(�), 1 ≤ � ≤ m + 1, all of which satisfy the inequalityTo(sign(gω(�−1),ω(�)(x) − Y)) ≤ To(sign(g(x) − Y)) ≤ κ. Figure 1 illustrates theseconsiderations.Computational complexity. As we shall explain in Section 3.3, the computa-tion of To(sign(g(x) − Y)) for one single candidate function g ∈ G requires O(n2)steps. In case of To(sign(g(x) − Y)) ≤ κ we have to replace U with the vector(max(g(xi), Ui)

)ni=1

in another O(n) steps. Consequently, since G contains at mostn(n − 1)/2 + 2n = O(n2) functions, the computation of U requires O(n4) steps.

Fig 1. A function g and its associated function g.

Page 101: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 89

3.2. Computation of L

From now on we assume that U is nontrivial, i.e. that Ui = U(xi) < ∞ for somevalue xi. Moreover, letting xmin and xmax be the smallest and largest such value,we assume that xmin < xmax. Finally let To(sign(Y − U(x))) ≤ κ. Otherwise theconfidence set C(x,Y, α) would be empty, meaning that convexity of the medianfunction is not plausible.Simplified formulae for L.

Similarly as in the previous section, one may replace the set of all convex func-tions with a finite subset H = H(U). First of all let h be any convex function suchthat h ≤ U and To(sign(Y − h(x))) ≤ κ. For any real number t let z := h(t). Nowlet h = ht,z be the largest convex function such that h ≤ U and h(t) = z. Obviouslyh ≥ h, whence To(sign(Y − h(x))) ≤ κ. Consequently,

(4) L(t) = inf{

z ∈ R : To(sign(Y − ht,z(x))) ≤ κ}

.

Figure 2 illustrates the definition of ht,z. Note that ht,z is given by the convex hullof the point (t, z) and the epigraph of U , i.e. the set of all pairs (x, y) ∈ R

2 suchthat U(x) ≤ y.

Starting from equation (4) we derive a computable expression for L. For thatpurpose we define tangent parameters as follows: Let J be the set of all indicesj ∈ {1, . . . , n} such that U(xj) ≥ Yj . For j ∈ J define

slj :=

−∞ if xj ≤ xmin,

maxxi<xj

Yj − U(xi)xj − xi

else,

alj :=

xj if xj ≤ xmin,

arg maxxi<xj

Yj − U(xi)xj − xi

else,

Fig 2. The extremal function ht,z of two points (t, z).

Page 102: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

90 L. Dumbgen

srj :=

∞ if xj ≥ xmax,

minxk>xj

U(xk) − Yj

xk − xjelse,

arj :=

xj if xj ≥ xmax,

arg minxk>xj

U(xk) − Yj

xk − xjelse.

With these parameters we define auxiliary tangent functions

hlj(x) :=

{U(x) if x < al

j ,

Yj + slj(x − xj) if x ≥ al

j ,

hrj(x) :=

{Yj + sr

j(x − xj) if x ≤ arj ,

U(x) if x > arj .

Figure 3 depicts these functions hlj and hr

j . Note that

hlj(x) =

max{

h(x) : h convex, h ≤ U , h(xj) ≤ Yj

}if x ≤ xj ,

min{

h(x) : h convex, h ≤ U , h(xj) ≥ Yj

}if x ≥ xj ,

hrj(x) =

min{

h(x) : h convex, h ≤ U , h(xj) ≥ Yj

}if x ≤ xj ,

max{

h(x) : h convex, h ≤ U , h(xj) ≤ Yj

}if x ≥ xj ,

In particular, hlj(xj) = hr

j(xj) = Yj . In addition we define hl0(x) := hr

n+1(x) := −∞.Then we set

hj,k := max(hlj , h

rk) and H := {hj,k : j ∈ {0} ∪ J , k ∈ J ∪ {n + 1}} .

This class H consists of at most (n + 1)2 functions, and elementary considerationsshow that

(5) L = min{h ∈ H : To(sign(Y − h(x))) ≤ κ

}.

Fig 3. The tangent functions hl

j and hr

k.

Page 103: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 91

Computational complexity. Note first that any pair (awj , sw

j ) may be computedin O(n) steps. Consequently, before starting with L we may compute all tangentparameters in time O(n2). Then Equation (5) implies that L may be computed inO(n4) steps. However, this can be improved considerably. The reason is, roughlysaying, that for fixed j, one can determine the smallest function hr

k such thatTo(sign(Y − hj,k(x))) ≤ κ in O(n2) steps, as explained in the subsequent section.Hence a proper implementation lets us compute L in O(n3) steps.

3.3. An auxiliary routine

In this section we show that the value of To(σ) can be computed in O(n2) steps.More generally, we consider n–dimensional sign vectors σ(0),σ(1), . . . ,σ(q) suchthat for 1 ≤ � ≤ q the vectors σ(�−1) and σ(�) differ exactly in one component, say,

σ(�−1)ω(�) = 1 and σ

(�)ω(�) = −1

for some index ω(�) ∈ {1, . . . , n}. Thus σ(0) ≥ σ(1) ≥ · · · ≥ σ(q) component-wise.In particular, To(σ(�)) is non-increasing in �. It is possible to determine the number

�∗ := min({

� ∈ {0, . . . , q} : To(σ(�)) ≤ κ}∪ {∞}

)

in O(n2) steps as follows:Algorithm. We use three vector variables S, S(0) and S(1) plus two scalar variables� and d. While running the algorithm the variable S contains the current vectorσ(�), while

S(0) =( ∑

i∈[j−d+1,j+d−1]

Si

)n

j=1,

S(1) =( ∑

i∈[j−d+1,j+d−1]

(d − |j − i|)Si

)n

j=1.

Initialisation.

� ← 0, d ← 1 andS

S(0)

S(1)

← σ(0).

Induction step. Check whether

maxi=1,...,n

S(1)i ≤

( d−1∑

i=1−d

(d − i)2)1/2(

Γ((2d − 1)/n) + κ)

(6)

= ((2d2 + 1)d/3)1/2(Γ((2d − 1)/n) + κ

).

• If (6) is fulfilled and d < (n + 1)/2, then

d ← d + 1,

S(0)i ←

S(0)i + Si+d−1 for i < d,

S(0)i + Si+1−d + Si+d−1 for d ≤ i ≤ n + 1 − d,

S(0)i + Si+1−d for i > n + 1 − d,

S(1) ← S(1) + S(0).

Page 104: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

92 L. Dumbgen

• If (6) is fulfilled and d = (n + 1)/2, then

�∗ ← �.

• If (6) is violated and � < q, then

� ← � + 1,

Sω(�) ←−1,

S(0)i ← S

(0)i − 2 and

S(1)i ← S

(1)i − 2(d − |i − ω(�)|) for ω(�) − d < i < ω(�) + d.

• If Condition (6) is violated but � = q, then To(σ(q)) > κ, and

�∗ ← ∞.

As for the running time of this algorithm, note that each induction step requiresO(n) operations. Since either d or � increases each time by one, the algorithmterminates after at most n + q + 1 ≤ 2n + 1 induction steps. Together with O(n)operations for the initialisation we end up with total running time O(n2).

4. Approximate solutions

Approximation of U . Recall that the exact computation of U involves testingwhether a straight line given by a function g(·) and touching one or two datapoints (xi, Yi) satisfies the inequality To(sign(g(x) − Y)) ≤ κ. The idea of ourapproximation is to restrict our attention to straight lines whose slope belongs toa given finite set.

Step 1. At first we consider the straight lines g0,k instroduced in section 3.1, allhaving slope −∞. Let ω(1), . . . , ω(n) be a list of {1, . . . , n} such that g0,ω(1) ≤ · · · ≤g0,ω(n). In other words, for 1 < � ≤ n either xω(�−1) < xω(�), or xω(�−1) = xω(�) andYω(�−1) ≤ Yω(�). With the auxiliary procedure of Section 3.3 we can determine thethe smallest number �∗ such that To(sign(g0,ω(�∗)(x)−Y)) ≤ κ in O(n2) steps. We

write G0 := g0,ω(�∗). Note that xω(�∗) is equal to xmin = min{x : U(x) < ∞}.Step 2. For any given slope s ∈ R let a(s) be the largest real number such that

the sign vectorσ(s) :=

(sign(Yi − a(s) − sxi)

)ni=1

satisfies the inequality To(σ(s)) ≤ κ. This number can also be determined in timeO(n2). This time we have to generate and use a list ω(1), . . . , ω(n) of {1, 2, . . . , n}such that Yω(�) − sxω(�) is non-increasing in �.

Now we determine the numbers a(s1), . . . , a(sM−1) for given slopes s1 < · · · <sM−1. Then we define

G�(x) := a(s�) + s�x for 1 ≤ � < M.

Step 3. Finally we determine the largest function GM among the degeneratelinear functions g1,n+1, . . . , gn,n+1 such that To(sign(GM (x) − Y)) ≤ κ. This isanalogous to Step 1 and yields the number xmax = max{x : U(x) < ∞}.

Step 4. By means of this list of finitely many straight lines G0, G1, . . . , GM oneobtains the lower bound U∗ := max(G0, G1, . . . , GM ) for U . In fact, one could evenreplace G� with the largest convex function G� such that G�(xi) ≤ Yi whenever

Page 105: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 93

G�(xi) ≤ Yi. Each of these functions can be computed via a suitable variant of thepool-adjacent-violators algorithm in O(n) steps; see Robertson et al. [6].

Step 5. To obtain an upper bound U∗ for U , for 1 ≤ � ≤ M let H� be thesmallest concave function such that H�(xi) ≥ Yi whenever max(G�−1(xi), G�(xi)) ≥Yi. Again H� may be determined via the pool-adjacent-violators algorithm. Thenelementary considerations show that

U ≤ U∗ := max(U∗, H1, H2, . . . , HM

).

All in all, these five steps require O(Mn2) steps. By visual inpection of these twocurves U∗ and U∗ one may opt for a refined grid of slopes or use U∗ as a surrogatefor U .Approximation of L. Recall that the exact computation amounts to fixing anyfunction hl

j und finding the smallest function hrk such that To(sign(Y−hj,k(x))) ≤ κ.

Now approximations may be obtained by picking only a subset of the potential in-dices j. In addition, one may fix some functions hr

k and look for the smallest hlj

satisfying the constraint To(sign(Y−hj,k(x))) ≤ κ. Again this leads to approxima-tions L∗ and L∗ for L such that L∗ ≤ L ≤ L∗.

5. Asymptotic properties

In this section we consider a triangular array of observations xi = xn,i and Yi = Yn,i.Our confidence band (L, U) will be shown to have certain consistency properties,provided that f satisfies some smoothness condition, and that the following tworequirements are met for some constants −∞ < a < b < ∞:

(A1) Let Mn denote the empirical distribution of the design points xn,i. Thatmeans, Mn(B) := n−1#{i : xn,i ∈ B} for B ⊂ R. There is a constant c > 0 suchthat

lim infn→∞

Mn[an, bn]bn − an

≥ c

whenever a ≤ an < bn ≤ b and lim infn→∞ log(bn − an)/ log n > −1.(A2) All variables Yi = Yn,i with xn,i ∈ [a, b] satisfy the following inequalities:

P(Yi < µi + r)P(Yi > µi − r)

}≥ 1 + H(r)

2for any r > 0,

where H is some fixed function on [0,∞] such that

limr→0+

H(r)r

> 0.

These conditions (A1) and (A2) are satisfied in various standard models, aspointed out by Dumbgen and Johns [2].

Theorem 1. Suppose that assumptions (A1) and (A2) hold.(a) Let f be linear on [a, b]. Then for arbitrary a < a′ < b′ < b,

supx∈[a,b]

(f(x) − L(x)

)+

supx∈[a′,b′]

(U(x) − f(x)

)+

= Op(n−1/2).

Page 106: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

94 L. Dumbgen

(b) Let f be Holder continuous on [a, b] with exponent β ∈ (1, 2]. That means, f isdifferentiable on [a, b] such that for some constant L > 0 and arbitrary x, y ∈ [a, b],

|f ′(x) − f ′(y)| ≤ L|x − y|β−1.

Then for ρn := log(n + 1)/n and δn := ρ1/(2β+1)n ,

supx∈[a,b]

(f(x) − L(x)

)+

supx∈[a+δn,b−δn]

(U(x) − f(x)

)+

= Op

(ρβ/(2β+1)

n

).

Part (a) of this theorem explains the empirical findings in Section 6 that theband (L, U) performs particularly well in regions where the regression function fis linear.

Proof of Theorem 1, step I. At first we prove the assertions about U . Note thatfor arbitrary t, z ∈ R with z ≤ U(t) there exist parameters µ, ν ∈ R such thatz = µ + νt and

Sd,j(µ, ν) :=n∑

i=1

ψ( i − j

d

)sign(µ + νxi − Yi)

≤ β−1d

(Γ(2d − 1

n

)+ κ)

for any (d, j) ∈ Tn;(7)

here Tn denotes the set of all pairs (d, j) of integers d > 0, j ∈ [d, n+1−d]. Thereforeit is crucial to have good simultaneous upper bounds for

∣∣Sd,j(µ, ν) − Σd,j(µ, ν)∣∣,

where

Σd,j(µ, ν) := ESd,j(µ, ν) =n∑

i=1

ψ(2d − 1

n

)(2P(Yi < µ + νxi) − 1

).

One may write Sd,j(µ, ν) =∫

gd,j,µ,ν dΨn with the random measure

Ψn :=n∑

i=1

δi ⊗ δxi⊗ δYi

and the function

(i, x, y) → gd,j,µ,ν(i, x, y) := ψ( i − j

d

)sign(µ + νx − y) ∈ [−1, 1]

on R3. The family of all these functions gd,j,µ,ν is easily shown to be a Vapnik-Cerv-

onenkis subgraph class in the sense of van der Vaart and Wellner [7]. Moreover, Ψn

is a sum of n stochastically independent random probability measures. Thus well-known results from empirical process theory (cf. Pollard [5]) imply that for arbitraryη ≥ 0,

P

(sup

(d,j)∈Tn, µ,ν∈R

∣∣Sd,j(µ, ν) − Σd,j(µ, ν)∣∣ ≥ n1/2η

)

≤ C exp(−η2/C),(8)

P

(sup

µ,ν∈R

∣∣Sd,j(µ, ν) − Σd,j(Y, µ, ν)∣∣ ≥ d1/2η for some (d, j) ∈ Tn

)

≤ C exp(2 log n − η2/C),(9)

Page 107: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 95

where C ≥ 1 is a universal constant. Consequently, for any fixed α′ > 0 there is aconstant C > 0 such that the following inequalities are satisfied simultaneously forarbitrary (d, j) ∈ Tn and (µ, ν) ∈ R

2 with probability at least 1 − α′:

(10)∣∣Sd,j(µ, ν) − Σd,j(µ, ν)

∣∣ ≤{

Cn1/2,

Cd1/2 log(n + 1)1/2.

In what follows we assume (10) for some fixed C.Proof of part (a) for U . Suppose that f is linear on [a, b], and let [a′, b′] ⊂ (a, b).

By convexity of U , the maximum of U − f over [a′, b′] is attained at a′ or b′. Weconsider the first case and assume that U(a′) ≥ f(a′) + εn for some εn > 0. Thenthere exist µ, ν ∈ R satisfying (7) such that µ + νa′ = f(a′) + εn and ν ≤ f ′(a′). Inparticular, µ + νx − f(x) ≥ εn for all x ∈ [a, a′]. Now we pick a pair (dn, jn) ∈ Tn

with dn as large as possible such that

[xjn−dn+1, xjn+dn−1

]⊂ [a, a′].

Assumption (A1) implies that dn ≥ (c/2 + o(1))n. Now

Σdn,jn(µ, ν) ≥jn+dn−1∑

i=jn−dn+1

ψ( i − jn

dn

)H(εn) = dnH(εn)

by assumption (A2). Combining this inequality with (7) and (10) yields

(11) β−1dn

(Γ(2dn − 1

n

)+ κ)

≥ dnH(εn) − Cn1/2.

But β−1d = 3−1/2(2d − 1)1/2 + O(d−1/2), and x → x1/2Γ(x) is non-decreasing on

(0, 1]. Hence (11) implies that

H(εn) ≤ d−1n

((3−1/2 + o(1))(2dn − 1)1/2

(Γ(2dn − 1

n

)+ κ)

+ Cn1/2)

≤ d−1n n1/2

((3−1/2 + o(1))(Γ(1) + κ) + C

)

= O(n−1/2).

Consequently, εn = O(n−1/2).Proof of part (b) for U . Now suppose that f ′ is Holder-continuous on [a, b] with

exponent β − 1 ∈ (0, 1] and constant L > 0. Let U(x) ≥ f(x) + εn for somex ∈ [a+ δn, b− δn] and εn > 0. Then there are numbers µ, ν ∈ R satisfying (7) suchthat µ + νx = f(x) + εn. Let (dn, jn) ∈ Tn with dn as large as possible such thateither

f ′(x) ≤ ν and[xjn−dn+1, xjn+dn−1

]⊂ [x, x + δn],

or

f ′(x) ≥ ν and[xjn−dn+1, xjn+dn−1

]⊂ [x − δn, x].

Assumption (A1) implies that

dn ≥ (c/2 + o(1))δnn.

Page 108: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

96 L. Dumbgen

Moreover, for any i ∈ {jn − dn + 1, jn + dn − 1},

µ + νxi − f(xi) = εn +∫ xi

x

(ν − f ′(t)) dt

≥ εn +∫ xi

x

(f ′(x) − f ′(t)) dt

≥ εn − L

∫ δn

0

sβ−1 ds

= εn − O(δβn),

so thatΣdn,jn(µ, ν) ≥ dnH

((εn − O(δβ

n))+).

Combining this inequality with (7) and (10) yields

H((εn − O(δβ

n))+)≤ d−1

n β−1dn

(Γ(2dn − 1

n

)+ κ)

+ Cd−1/2n log(n + 1)1/2

≤ d−1n β−1

dn(21/2 log(n)1/2 + κ) + Cd−1/2

n log(n + 1)1/2

= O(δβn).(12)

This entails that εn has to be of order O(δβn) = O

β/(2β+1)n

).

Proof of Theorem 1, step II. Now we turn our attention to L. For that purpose wechange the definition of Sd,j(·, ·) and Σd,j(·, ·) as follows: Let Un be a fixed convexfunction to be specified later. Then for (t, z) ∈ R

2 we define hn,t,z to be the largestconvex function h such that h ≤ Un and h(t) ≤ z. This definition is similar to thedefinition of ht,z in Section 3.2. Indeed, if U ≤ Un and L(t) ≤ z, then

Sd,j(t, z) :=n∑

i=1

ψ( i − j

d

)sign(Yi − hn,t,z(xi))

≤ β−1d

(Γ(2d − 1

n

)+ κ)

for any (d, j) ∈ Tn.(13)

Here we set

Σd,j(t, z) := ESd,j(t, z) =n∑

i=1

ψ(2d − 1

n

)(2P(Yi > hn,t,z(xi)) − 1

).

Again we may and do assume that (10) is true for some constant C.Proof of part (a) for L. Suppose that f is linear on [a, b]. We define Un(x) :=

f(x) + γn−1/2 + 1{x �∈ [a′, b′]}∞ with constants γ > 0 and a < a′ < b′ < b. Sincelim infn→∞ P(U ≤ Un) tends to one as γ → ∞, we may assume that U ≤ Un.Suppose that L(t) ≤ z := f(t) − 2εn for some t ∈ [a, b] and εn ≥ γn−1/2. A simplegeometrical consideration shows that hn,t,z ≤ f−εn on an interval [a′′, b′′] ⊂ [a, b] oflength b′′−a′′ ≥ (b′−a′)/3. If we pick (dn, jn) ∈ Tn with dn as large as possible suchthat [xjn−dn+1, xjn+dn−1] ⊂ [a′′, b′′], then dn ≥ (c(b′ − a′)/6 + o(1))n. Moreover,(13) and (10) entail (11), whence εn = O(n−1/2).

Proof of part (b) for L. Now suppose that f ′ is Holder-continuous on [a, b] withexponent β − 1 ∈ (0, 1] and constant L > 0. Here we define Un(x) := f(x) + γδβ

n +1{x �∈ [a + δn, b − δn]}∞ with a constant γ > 0, and we assume that U ≤ Un.

Page 109: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 97

Suppose that L(t) ≤ z := f(t) − 2εn for some t ∈ [a, b] and εn > 0. If t ≤ b − 2δn,then

hn,t,z(t + λδn) ≤ z + λ(Un(t + δn) − z)= f(t) − 2εn + λ(f(t + δn) − f(t) + 2εn + γδβ

n)

= f(t) − 2(1 − λ)εn + λ

∫ δn

0

f ′(t + s) ds + λγδβn

for 0 ≤ λ ≤ 1. Thus

f(t + λδn) − hn,t,z(t + λδn)

= 2(1 − λ)εn + λ

∫ δn

0

(f ′(t + λs) − f ′(t + s)) ds − λγδβn

≥ 2(1 − λ)εn − λ

∫ δn

0

L(1 − λ)β−1sβ−1 ds − λγδβn

≥ εn − O(δβn)

uniformly for 0 ≤ λ ≤ 1/2. Analogous arguments apply in the case t ≥ a + 2δn.Consequently there is an interval [an, bn] ⊂ [a, b] of length δn/2 such that f−hn,t,z ≥εn − O(δβ

n), provided that a + 2δn ≤ b − 2δn. Again we choose (dn, jn) ∈ Tn withdn as large as possible such that [xjn−dn+1, xjn+dn−1] ⊂ [an, bn]. Then dn ≥ (c/4 +o(1))δnn, and (13) and (10) lead to (12). Thus εn = O(δβ

n) = O(ρβ/(2β+1)n ).

6. Numerical examples

At first we illustrate the confidence band (L, U) defined in Section 3 with somesimulated data. Precisely, we generated

Yi = f(xi) + σεi

with xi := (i − 1/2)/n, n = 500 and

f(x) :={

−12(x − 1/3) if x ≤ 1/3,(27/2)(x − 1/3)2 if x ≥ 1/3.

Moreover, σ = 1/2, and the random errors ε1, . . . , εn have been simulated froma student distribution with five degrees of freedom. Figure 4 depicts these datatogether with the corresponding 95%–confidence band (L, U) and f itself. Notethat the width of the band is smallest near the center of the interval (0, 1/3) onwhich f is linear. This is in accordance with part (a) of Theorem 1.

Secondly we applied our procedure to a dataset containing the income xi andthe expenditure Yi for food in the year 1973 for n = 7125 households in GreatBritain (Family Expenditure Survey 1968–1983). This dataset has also been ana-lyzed by Hardle and Marron [4]. They computed simultaneous confidence intervalsfor E(Yi) = f(xi) by means of kernel estimators and bootstrap methods. Figure 5depicts the data. In order to enhance the main portion, the axes have been chosensuch that 72 outlying observations are excluded from the display. Figure 6 shows a95%–confidence band for the isotonic median function f , as described by Dumbgenand Johns [2]. Figure 7 shows a 95%–confidence band for the concave median func-tion f , as described in the present paper. Note that the latter band has substantiallysmaller width than the former one. This is in accordance with our theoretical resultsabout rates of convergence.

Page 110: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

98 L. Dumbgen

Fig 4. Simulated data and 95%–confidence band (L, U), where n = 500.

Fig 5. Income-expenditure data.

Page 111: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Confidence bands for convex curves 99

Fig 6. 95%–confidence band for isotonic median function.

Fig 7. 95%–confidence band for concave median function.

Page 112: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

100 L. Dumbgen

Acknowledgments. The author is indebted to Geurt Jongbloed for constructivecomments. Many thanks also to Wolfgang Hardle (Humboldt Unversity Berlin) forproviding the family expenditure data.

References

[1] Davies, P. L. (1995). Data features. Statist. Neerlandica 49 185–245.[2] Dumbgen, L. and Johns, R. B. (2004). Confidence bands for isotonic median

curves using sign tests. J. Comput. Graph. Statist. 13 519–533.[3] Dumbgen, L. and Spokoiny, V. G. (2001). Multiscale testing of qualitative

hypotheses. Ann. Statist. 29 124–152.[4] Hardle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars

for nonparametric regression. Ann. Statist. 19 778–796.[5] Pollard, D. (1990). Empirical Processes: Theory and Applications. IMS, Hay-

ward, CA.[6] Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Re-

stricted Statistical Inference. Wiley, New York.[7] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and

Empirical Processes with Applications to Statistics. Springer, New York.

Page 113: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotics: Particles, Processes and Inverse ProblemsVol. 55 (2007) 101–107c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000292

Marshall’s lemma for convex density

estimation

Lutz Dumbgen1, Kaspar Rufibach1 and Jon A. Wellner2

University of Bern and University of Washington

Abstract: Marshall’s [Nonparametric Techniques in Statistical Inference(1970) 174–176] lemma is an analytical result which implies

√n–consistency

of the distribution function corresponding to the Grenander [Skand. Aktuari-etidskr. 39 (1956) 125–153] estimator of a non-decreasing probability density.The present paper derives analogous results for the setting of convex densitieson [0,∞).

1. Introduction

Let F be the empirical distribution function of independent random variables X1,X2, . . . , Xn with distribution function F and density f on the halfline [0,∞). Vari-ous shape restrictions on f enable consistent nonparametric estimation of it withoutany tuning parameters (e.g. bandwidths for kernel estimators).

The oldest and most famous example is the Grenander estimator f of f underthe assumption that f is non-increasing. Denoting the family of all such densities byF , the Grenander estimator may be viewed as the maximum likelihood estimator,

f = argmax{∫

log h dF : h ∈ F}

,

or as a least squares estimator,

f = argmin{∫ ∞

0

h(x)2dx − 2∫

h dF : h ∈ F}

;

cf. Robertson et al. [5]. Note that if F had a square-integrable density F′, then the

preceding argmin would be identical with the minimizer of∫ ∞0

(h− F′)(x)2 dx over

all non-increasing probability densities h on [0,∞).A nice property of f is that the corresponding distribution function F ,

F (r) :=∫ r

0

f(x) dx,

is automatically√

n–consistent. More precisely, since F is the least concave majo-rant of F, it follows from Marshall’s [4] lemma that

‖F − F‖∞ ≤ ‖F − F‖∞.

A more refined asymptotic analysis of F − F has been provided by Kiefer andWolfowitz [3].

1Institute of Math. Statistics and Actuarial Science, University of Bern, Switzerland, e-mail:[email protected]; [email protected]

2Dept. of Statistics, University of Washington, Seattle, USA, e-mail: [email protected] 2000 subject classifications: 62G05, 62G07, 62G20.Keywords and phrases: empirical distribution function, inequality, least squares, maximum

likelihood, shape constraint, supremum norm.

101

Page 114: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

102 L. Dumbgen, K. Rufibach and J. A. Wellner

2. Convex densities

Now we switch to the estimation of a convex probability density f on [0,∞). Aspointed out by Groeneboom et al. [2], the nonparametric maximum likelihood esti-mator fml and the least squares estimator fls are both well-defined and unique, butthey are not identical in general. Let K denote the convex cone of all convex andintegrable functions g on [0,∞). (All functions within K are necessarily nonnegativeand non-increasing.) Then

fml = argmaxh∈K

(∫log h dF −

∫ ∞

0

h(x) dx),

fls = argminh∈K

(∫ ∞

0

h(x)2dx − 2∫

h dF

).

Both estimators have the following property:

Proposition 1. Let f be either fml or fls. Then f is piecewise linear with

• at most one knot in each of the intervals (X(i), X(i+1)), 1 ≤ i < n,• no knot at any observation Xi, and• precisely one knot within (X(n),∞).

The estimators fml, fls and their distribution functions Fml, Fls are completelycharacterized by Proposition 1 and the next proposition.

Proposition 2. Let ∆ be any function on [0,∞) such that fml + t∆ ∈ K for somet > 0. Then ∫

fml

dF ≤∫

∆(x) dx.

Similarly, let ∆ be any function on [0,∞) such that fls + t∆ ∈ K for some t > 0.Then ∫

∆ dF ≤∫

∆ dFls.

In what follows we derive two inequalities relating F − F and F − F , where Fstands for Fml or Fls:

Theorem 1.

inf[0,∞)

(Fml − F ) ≥ 32

inf[0,∞)

(F − F ) − 12

sup[0,∞)

(F − F ),(1)

∥∥Fls − F∥∥∞ ≤ 2

∥∥F − F∥∥∞.(2)

Both results rely on the following lemma:

Lemma 1. Let F, F be continuous functions on a compact interval [a, b], and letF be a bounded, measurable function on [a, b]. Suppose that the following additionalassumptions are satisfied:

F (a) = F(a) and F (b) = F(b),(3)F has a linear derivative on (a, b),(4)F has a convex derivative on (a, b),(5)∫ b

r

F (y) dy ≤∫ b

r

F(y) dy for all r ∈ [a, b].(6)

Page 115: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Marshall’s lemma for convex densities 103

Then

sup[a,b]

(F − F ) ≤ 32

sup[a,b]

(F − F ) − 12(F − F )(b).

If condition (6) is replaced with

(7)∫ r

a

F (x) dx ≥∫ r

a

F(x) dx for all r ∈ [a, b],

then

inf[a,b]

(F − F ) ≥ 32

inf[a,b]

(F − F ) − 12(F − F )(a).

The constants 3/2 and 1/2 are sharp. For let [a, b] = [0, 1] and define

F (x) :={

x2 − c for x ≥ ε,(x/ε)(ε2 − c) for x ≤ ε,

F (x) := 0,

F(x) := 1{0 < x < 1}(x2 − 1/3)

for some constant c ≥ 1 and some small number ε ∈ (0, 1/2]. One easily verifiesconditions (3)–(6). Moreover,

sup[0,1]

(F − F ) = c − ε2, sup[0,1]

(F − F ) = c − 1/3 and (F − F )(1) = c − 1.

Hence the upper bound (3/2) sup(F − F ) − (1/2)(F − F )(1) equals sup(F − F ) +ε2 for any c ≥ 1. Note the discontinuity of F at 0 and 1. However, by suitableapproximation of F with continuous functions one can easily show that the constantsremain optimal even under the additional constraint of F being continuous.

Proof of Lemma 1. We define G := F − F with derivative g := G′ on (a, b). Itfollows from (3) that

max{a,b}

G = max{a,b}

(F − F ) ≤ 32

sup[a,b]

(F − F ) − 12(F − F )(b).

Therefore it suffices to consider the case that G attains its maximum at some pointr ∈ (a, b). In particular, g(r) = 0. We introduce an auxiliary linear function g on[r, b] such that g(r) = 0 and

∫ b

r

g(y) dy =∫ b

r

g(y) dy = G(b) − G(r).

Note that g is concave on (a, b) by (4)–(5). Hence there exists a number yo ∈ (r, b)such that

g − g

{≥ 0 on [r, yo],≤ 0 on [yo, b).

This entails that∫ y

r

(g − g)(u) du = −∫ b

y

(g − g)(u) du ≥ 0 for any y ∈ [r, b].

Page 116: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

104 L. Dumbgen, K. Rufibach and J. A. Wellner

Consequently,

G(y) = G(r) +∫ y

r

g(u) du

≥ G(r) +∫ y

r

g(u) du

= G(r) +(y − r)2

(b − r)2[G(b) − G(r)],

so that∫ b

r

G(y) dy ≥ (b − r)G(r) +G(b) − G(r)

(b − r)2

∫ b

r

(y − r)2 dy

= (b − r)[23G(r) +

13G(b)

]

= (b − r)[23G(r) +

13(F − F )(b)

].

On the other hand, by assumption (6),∫ b

r

G(y) dy ≤∫ b

r

(F − F )(y) dy ≤ (b − r) sup[a,b]

(F − F ).

This entails thatG(r) ≤ 3

2sup[a,b]

(F − F ) − 12(F − F )(b).

If (6) is replaced with (7), then note first that

min{a,b}

G = min{a,b}

(F − F ) ≥ 32

min{a,b}

(F − F ) − 12(F − F )(a).

Therefore it suffices to consider the case that G attains its minimum at some pointr ∈ (a, b). Now we consider a linear function g on [a, r] such that g(r) = 0 and

∫ r

a

g(x) dx =∫ r

a

g(x) dx = G(r) − G(a).

Here concavity of g on (a, b) entails that∫ x

a

(g − g)(u) du = −∫ r

x

(g − g)(u) du ≤ 0 for any x ∈ [a, r],

so that

G(x) = G(r) −∫ r

x

g(u) du

≤ G(r) −∫ r

x

g(u) du

= G(r) − (r − x)2

(r − a)2[G(r) − G(a)].

Consequently,∫ r

a

G(x) dx ≤ (r − a)G(r) − G(r) − G(a)(r − a)2

∫ r

a

(r − x)2 dx

= (r − a)[23G(r) +

13(F − F )(a)

],

Page 117: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Marshall’s lemma for convex densities 105

whereas ∫ r

a

G(x) dx ≥∫ r

a

(F − F )(x) dx ≥ (r − a) inf[a,b]

(F − F ),

by assumption (7). This leads to

G(r) ≥ 32

inf[a,b]

(F − F ) − 12(F − F )(a). �

Proof of Theorem 1. Let 0 =: t0 < t1 < · · · < tm be the knots of f , including theorigin. In what follows we derive conditions (3)–(5) and (6/7) of Lemma 1 for anyinterval [a, b] = [tk, tk+1] with 0 ≤ k < m. For the reader’s convenience we relyentirely on Proposition 2. In case of the least squares estimator, similar inequalitiesand arguments may be found in Groeneboom et al. [2].

Let 0 < ε < min1≤i≤m(ti − ti−1)/2. For a fixed k ∈ {1, . . . ,m} we define ∆1

to be continuous and piecewise linear with knots tk−1 − ε (if k > 1), tk−1, tk andtk + ε. Namely, let ∆1(x) = 0 for x /∈ (tk−1 − ε, tk + ε) and

∆1(x) :={

fml(x) if f = fml

1 if f = fls

}for x ∈ [tk−1, tk].

This function ∆1 satisfies the requirements of Proposition 2. Letting ε ↘ 0, thefunction ∆1(x) converges pointwise to

{1{tk−1 ≤ x ≤ tk}fml(x) if f = fml,

1{tk−1 ≤ x ≤ tk} if f = fls,

and the latter proposition yields the inequality

F(tk) − F(tk−1) ≤ F (tk) − F (tk−1).

Similarly let ∆2 be continuous and piecewise linear with knots at tk−1, tk−1 + ε,tk − ε and tk. Precisely, let ∆2(x) := 0 for x /∈ (tk−1, tk) and

∆2(x) :={−fml(x) if f = fml

−1 if f = fls

}for x ∈ [tk−1 + ε, tk − ε].

The limit of ∆2(x) as ε ↘ 0 equals{−1{tk−1 < x < tk}fml(x) if f = fml,

−1{tk−1 < x < tk} if f = fls,

and it follows from Proposition 2 that

F(tk) − F(tk−1) ≥ F (tk) − F (tk−1).

This shows that F(tk)−F(tk−1) = F (tk)−F (tk−1) for k = 1, . . . , m. Since F (0) = 0,one can rewrite this as

(8) F(tk) = F (tk) for k = 0, 1, . . . ,m.

Now we consider first the maximum likelihood estimator fml. For 0 ≤ k < mand r ∈ (tk, tk+1] let ∆(x) := 0 for x /∈ (tk − ε, r), let ∆ be linear on [tk − ε, tk],

Page 118: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

106 L. Dumbgen, K. Rufibach and J. A. Wellner

and let ∆(x) := (r − x)fml(x) for x ∈ [tk, r]. One easily verifies, that this function∆ satisfies the conditions of Proposition 2, too, and with ε ↘ 0 we obtain theinequality ∫ r

tk

(r − x) F(dx) ≤∫ r

tk

(r − x) F (dx).

Integration by parts (or Fubini’s theorem) shows that the latter inequality is equiv-alent to ∫ r

tk

(F(x) − F(tk)) dx ≤∫ r

tk

(F (x) − F (tk)) dx.

Since F(tk) = F (tk), we end up with∫ r

tk

F(x) dx ≤∫ r

tk

F (x) dx for k = 0, 1, . . . ,m − 1 and r ∈ (tk, tk+1].

Hence we may apply Lemma 1 and obtain (1).Finally, let us consider the least squares estimator fls. For 0 ≤ k < m and

r ∈ (tk, tk+1] let ∆(x) := 0 for x /∈ (tk − ε, r), let ∆ be linear on [tk − ε, tk] as wellas on [tk, r] with ∆(tk) := r − tk. Then applying Proposition 2 and letting ε ↘ 0yields ∫ r

tk

(r − x) F(dx) ≤∫ r

tk

(r − x) F (dx),

so that∫ r

tk

F(x) dx ≤∫ r

tk

F (x) dx for k = 0, 1, . . . ,m − 1 and r ∈ (tk, tk+1].

Thus it follows from Lemma 1 that

inf[0,∞)

(F − F ) ≥ 32

inf[0,∞)

(F − F ) − 12

sup[0,∞)

(F − F ) ≥ −2∥∥F − F

∥∥∞.

Alternatively, for 1 ≤ k ≤ m and r ∈ [tk−1, tk) let ∆(x) := 0 for x /∈ (r, tk + ε),let ∆ be linear on [r, tk] as well as on [tk, tk + ε] with ∆(tk) := −(tk − r). Thenapplying Proposition 2 and letting ε ↘ 0 yields

∫ tk

r

(tk − x) F(dx) ≥∫ tk

r

(tk − x) F (dx),

so that∫ tk

r

F(x) dx ≥∫ r

tk

F (x) dx for k = 1, 2, . . . ,m and r ∈ [tk−1, tk).

Hence it follows from Lemma 1 that

sup[0,∞)

(F − F ) ≤ 32

sup[0,∞)

(F − F ) − 12

inf[0,∞)

(F − F ) ≤ 2∥∥F − F

∥∥∞.

Acknowledgment. The authors are grateful to Geurt Jongbloed for constructivecomments and careful reading.

Page 119: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Marshall’s lemma for convex densities 107

References

[1] Grenander, U. (1956). On the theory of mortality measurement, part II.Skand. Aktuarietidskr. 39 125–153.

[2] Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). Estimationof a convex function: Characterization and asymptotic theory. Ann. Statist. 291653–1698.

[3] Kiefer, J. and Wolfowitz, J. (1976). Asymptotically minimax estimationof concave and convex distribution functions. Z. Wahrsch. Verw. Gebiete 3473–85.

[4] Marshall, A. W. (1970). Discussion of Barlow and van Zwet’s paper. In Non-parametric Techniques in Statistical Inference. Proceedings of the First Interna-tional Symposium on Nonparametric Techniques held at Indiana University,June, 1969 (M. L. Puri, ed.) 174–176. Cambridge University Press, London.

[5] Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Re-stricted Statistical Inference. Wiley, New York.

Page 120: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotics: Particles, Processes and Inverse ProblemsVol. 55 (2007) 108–120c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000300

Escape of mass in zero-range processes

with random rates

Pablo A. Ferrari 1,∗ and Valentin V. Sisko 2,†

Universidade de Sao Paulo

Abstract: We consider zero-range processes in Zd with site dependent jump

rates. The rate for a particle jump from site x to y in Zd is given by λxg(k)p(y−

x), where p(·) is a probability in Zd, g(k) is a bounded nondecreasing function

of the number k of particles in x and λ = {λx} is a collection of i.i.d. randomvariables with values in (c, 1], for some c > 0. For almost every realizationof the environment λ the zero-range process has product invariant measures{νλ,v : 0 ≤ v ≤ c} parametrized by v, the average total jump rate fromany given site. The density of a measure, defined by the asymptotic averagenumber of particles per site, is an increasing function of v. There exists aproduct invariant measure νλ,c, with maximal density. Let µ be a probabilitymeasure concentrating mass on configurations whose number of particles at sitex grows less than exponentially with ‖x‖. Denoting by Sλ(t) the semigroupof the process, we prove that all weak limits of {µSλ(t), t ≥ 0} as t → ∞ aredominated, in the natural partial order, by νλ,c. In particular, if µ dominatesνλ,c, then µSλ(t) converges to νλ,c. The result is particularly striking whenthe maximal density is finite and the initial measure has a density above themaximal.

1. Introduction

In the zero-range process there are a finite number of particles at each site of Zd.

At a rate depending monotonically on the number of particles at the site, one of theparticles jumps to another site chosen independently with a transition probabilityfunction. The rate at which particles leave any site is bounded. When the rate ateach site x is multiplied by a random variable λx chosen at time zero independentlyof the process, the system may show a phase transition in the density. For almostevery realization of the environment λ the zero-range process has product invariantmeasures {νλ,v : 0 ≤ v ≤ c} parametrized by v, the average total jump rate fromany given site. The density of a measure is the asymptotic number of particles persite (when this exists). For each v ≤ c the invariant measure νλ,v has density ρ(v),which is an increasing function of v. Our main result is to start the system with ameasure concentrating mass in configurations not growing too fast (see (3) below)and show that the distribution of the process as time goes to infinity is dominatedby the maximal measure νλ,c. This is particularly interesting when ρ(c) < ∞ andthe initial density of µ is strictly bigger than ρ(c). In this case we say that there is

∗Supported in part by FAPESP.†Supported by FAPESP (2003/00847–1) and CNPq (152510/2006–0).1Departamento de Estatistica, Instituto de Matematica e Estatistica, Universidade de Sao

Paulo, Caixa Postal 66281, CEP 05311–970 Sao Paulo, SP, Brazil, e-mail: [email protected], url:www.ime.usp.br/~pablo

2IMPA, Estrada Dona Castorina 110, CEP 22460-320 Rio de Janeiro, Brasil, e-mail:[email protected], url: www.ime.usp.br/~valentin

AMS 2000 subject classifications: 60K35, 82C22.Keywords and phrases: random environment, zero-range process.

108

Page 121: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 109

an “escape of mass”. When the initial distribution dominates the maximal invariantmeasure, the process converges to the maximal invariant measure.

The zero-range process appeared first as a network of queues when Jackson [13]showed that the product measures are invariant for the process in a finite number ofsites. Spitzer [22] introduced the process in a countable number of sites as a modelof infinite particle system with interactions. The existence of the process has beenproved by Holley [12] and Liggett [17, 19]. We use Harris [11] direct probabilisticconstruction which permits the particles to be distinguishable, so one can follow thebehavior of any particular particle. Using Liggett’s [18] approach, Andjel [1] gavea description of the set of invariant measures for the zero-range process in somecases. Balazs, Rassoul-Agha, Seppalainen, and Sethuraman [4] studied the case ofrates bounded by an exponential function of k in a one dimensional asymmetricmodel.

The study of conservative interacting particle systems in random environmentwas proposed simultaneously by Benjamini, Ferrari and Landim [5] and Evans [7],who observed the existence of phase transition in these models; see also Krug andFerrari [15]. Benjamini, Ferrari and Landim [5], Krug and Seppalainen [20] andKoukkous [14] investigated the hydrodynamic behavior of conservative processes inrandom environments; Landim [16] and Bahadoran [3] considered the same prob-lem for non-homogeneous asymmetric attractive processes; Gielis, Koukkous andLandim [9] deduced the equilibrium fluctuations of a symmetric zero-range processin a random environment; Andjel, Ferrari, Guiol and Landim [2] proved the conver-gence to the maximal invariant measure for a one-dimensional totally asymmetricnearest-neighbor zero-range process with random rates. This phenomenon is stud-ied by Seppalainen, Grigorescu and Kang [10] in one dimension. Evans and Hanney[8] have recently published a review paper on the zero-range process which includesmany references to the mathematical physics literature.

Section 2 includes definitions, results and at the end a summary of the contentsof the other sections.

2. Results

We study the zero-range process with site dependent jump rates. Let N = {0, 1,

2, . . . } and give N the discrete topology. It would seem natural to take X = NZ

d

for the state space, but for topological reasons, let us begin by setting

N = N ∪ {∞}.

We give N the topology of one point compactification and take X = NZ

d

with theproduct topology for the state space. The set X is compact. We associate with X theBorel σ-field. The product topology on X is metrizable. For x = (x1, . . . , xd) ∈ Z

d,denote the sup-norm of x by

‖x‖ = maxi=1,...,d

|xi|.

Let γ : N → [0, 2] be such that γ(0) = 2, γ(n) = 1/n, n = 1, 2, . . . , and γ(∞) = 0.For instance, the metric

d(η, ξ) =∑

x∈Zd

12‖x‖

∣∣γ(η(x)) − γ(ξ(x))∣∣

Page 122: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

110 P. A. Ferrari and V. V. Sisko

is compatible with the product topology on X . The set X is a complete separablemetric space.

Fix 0 < c < 1 and consider a collection λ = {λx}x∈Zd taking values in (c, 1] suchthat c = infx∈Zd λx. We call λ the environment. Let p : Z

d → [0, 1] be a probabilityon Z

d:∑

x∈Zd p(x) = 1. We assume that the range of p is bounded by some M > 0:p(x) = 0 if ‖x‖ > M . Moreover, suppose that the random walk with transitionfunction p(x, y) = p(y − x) is irreducible.

Let g : N → [0, 1] be a nondecreasing continuous function with 0 = g(0) < g(1)and g(∞) = lim g(k) = 1.

The zero-range process in the environment λ is a Markov process informallydescribed as follows. Initially distribute particles on the lattice Z

d, then if thereare k particles at site x, at rate λxg(k)p(y − x) a particle jumps from x to y. InSection 5 we recall the construction of a process ηt with this behavior as a functionof a Poisson process in Z

d × R, a la Harris. Let {Sλ(t), t ≥ 0} be the semigroupassociated to this process, that is,

Sλ(t)f(η) = E[f(ηt) | η0 = η].

where E is expectation and ηt = ηλt is the process with fixed environment λ. Thecorresponding generator Lλ, defined by

Lλf(η) =d

dtSλ(t)f(η)

∣∣∣t=0

,

acts on cylinder continuous functions f : NZ

d

→ R as follows:

(Lλf)(η) =∑

x∈Zd

y∈Zd

λx p(y − x) g(η(x)) [f(ηx,y) − f(η)].

where ηx,y = η− δx + δy and δz ∈ X is the configuration with just one particle at zand no particles elsewhere; addition of configurations is performed componentwise.We set ∞± 1 = ∞.

The natural state space for this Markov process is X rather than X . From theconstruction a la Harris it is possible to see that if the standard Markov processwhose semigroup is given by Sλ(t) is started in X , then it never leaves X : if µ(X ) =1, then µSλ(t)(X ) = 1 for any t.

For each v ∈ [0, c] and environment λ, denote νλ,v the product measure withmarginals

(1) νλ,v{ξ : ξ(x) = k} =1

Z(v/λx)(v/λx)k

g(k)!,

where we use the notation g(k)! = g(1) · · · g(k) and g(0)! = 1;

(2) Z(u) =∑

k≥0

uk

g(k)!

is the normalizing constant. These measures are invariant for the process [1, 13, 22].In some cases it is known that all invariant measures (concentrated on X ) are convexcombinations of measures in {νλ,v : 0 ≤ v ≤ c} (see [1, 2]).

To define the standard partial order for probability measures on X let η ≤ ξ ifη(x) ≤ ξ(x) for all x ∈ Z

d. A real valued function f defined on X is increasing

Page 123: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 111

if η ≤ ξ implies that f(η) ≤ f(ξ). If µ and ν are two probability measures on X ,µ ≤ ν if

∫fdµ ≤

∫fdν for all increasing continuous functions f . In this case we

say that ν dominates µ. This is equivalent to the existence of a probability measureν on X × X with marginals µ and ν such that

ν{(η, ξ) : η ≤ ξ} = 1,

(coupling); see Theorem 2.4 of Chapter II in [19].Since X is compact, any sequence of probability measures on X is tight, and

therefore, has a weakly convergent subsequence.Our main theorem holds for measures µ on X giving total mass to configurations

for which the number of particles in x increases less than exponentially with ‖x‖.That is, measures satisfying

(3)∞∑

n=1

e−βn∑

x:‖x‖=n

η(x) < ∞ µ-a.s. for all β > 0.

The product measure νλ,v obviously satisfies (3).We consider random rates λ = {λx}x∈Zd , a collection of independent identically

distributed random variables in (c, 1]. Call P and E the probability and expectationinduced by these variables. Assume that for any ε > 0, P(λ0 ∈ (c, c + ε)) > 0.

Theorem 1. Let µ be a probability measure on X satisfying (3). Then P-a.s.

(i) Every weak limit of µSλ(t) as t tends to infinity is dominated by νλ,c.(ii) If νλ,c ≤ µ then µSλ(t) converges to νλ,c as t goes to infinity.

The result is better understood using the notion of density of particles. Recallthat lim g(k) = 1 and notice that the function Z : [0, 1) → [0,∞) defined in (2) isanalytic. Let R : [0, 1) → [0,∞) be the strictly increasing function defined by

R(u) =1

Z(u)

k≥0

kuk

g(k)!= u

Z ′(u)Z(u)

.

It is easy to see that R is onto [0,∞). Under the measure νλ,v the expected numberof particles (density) at site x is

(4) νλ,v[η(x)] = R(v/λx),

and the expected value of the jump rate is

νλ,v[λxg(η(x))] = v.

Since v/λx < 1, for any v ∈ [0, c] and x,

(5) νλ,c[η(x)] = limv→c

R(v/λx) < ∞.

Since the rate distribution is translation invariant, taking the average with respectto the rates, the mean number of particles per site is

ρ(v) :=∫

P(dλ0)R(v/λ0).

For v ∈ [0, c), ρ(v) < ∞. Depending on the distribution of λ0, two cases are possible:ρ(c) < ∞ and ρ(c) = ∞. Since R(u) is a nondecreasing nonnegative function,

(6) ρ(c) = limv↗c

ρ(v),

Page 124: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

112 P. A. Ferrari and V. V. Sisko

The equation also holds when ρ(c) = ∞.For v ∈ [0, c], denote mv :=

∫P(dλ) νλ,v the measure obtained by first P-

choosing an environment λ and then choosing a configuration η with νλ,v. Underthis law {η(x)}x∈Zd are independent identically distributed random variables withaverage number of particles per site given by mv[η(0)] = ρ(v). By the strong law oflarge numbers,

(7) limn→∞

1(2n + 1)d

‖x‖≤n

η(x) = ρ(v) mv-a.s.

Thus, P-a.s., the limit (7) holds νλ,v-a.s.; it also holds when ρ(v) = ∞.For η ∈ X , the lower asymptotic density of η is defined by

(8) D(η) := lim infn→∞

1(2n + 1)d

‖x‖≤n

η(x),

and the upper asymptotic density of η is defined by

(9) D(η) := lim supn→∞

1(2n + 1)d

‖x‖≤n

η(x).

Take some probability measure µ satisfying (3) and some environment λ. Let µbe a weak limit of µSλ(t) along a convergent subsequence. Then Theorem 1 (i)implies

(10) D(η) ≤ ρ(c) µ -a.s.

Suppose that ρ(c) < ∞ and µ concentrates mass on configurations with lowerasymptotic density strictly bigger than ρ(c), that is,

(11) D(η) > ρ(c) µ-a.s.

Inequality (10) says that weak limits of µSλ(t) are concentrated on configurationswith the upper asymptotic density of η not greater than ρ(c). This behavior isremarkable as the process is conservative, i.e., the total number of particles is con-served, but in the above limit there is an “escape of mass”. Heuristically, a fractionof the particles get stacked at further and further sites with lower and lower rates.

Sketch of proof. The proof is based on the study of a family of zero-rangeprocesses indexed with α > 0; we call them the α-truncated process. The α-truncated process behaves as the original process but at all times there are infinitelymany particles in sites x with λ(x) ≤ c + α. The measure να

λ is invariant for theprocess. Let the measure µα be the law of a configuration chosen with µ modified byputting infinitely many particles in sites x with λ(x) ≤ c+α and leaving the othersites unchanged. We use the fact that there is a density of sites with infinitely manyparticles to show that the α-truncated process starting with µα for µ satisfying (3)converges weakly to να

λ . We prove the convergence using coupling arguments. Twoα-truncated processes starting respectively with µα and the invariant law να

λ arejointly realized using the so called “basic coupling” [19] which amounts to use thesame Poisson processes to construct both marginals. The coupling induces first andsecond class particles, the last represent the discrepancies between both marginals.

Page 125: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 113

A key element of the proof is the study of the motion of a single tagged secondclass particle in the α-truncated process. The skeleton of the trajectory of each par-ticle is a simple random walk with jump probabilities p(·) absorbed at sites x withλ(x) ≤ c + α. The interaction with the other particles and with the environment λgoverns the waiting times between jumps but does not affect the skeleton of themotion. We show that with probability one (a) only a finite number of second classparticles will visit any fixed site x: particles starting sufficiently far away will beabsorbed before arriving to x and (b) the finite number of particles hitting x willbe eventually absorbed. The weak convergence and the uniqueness of the invariantmeasure for the α-process is a consequence of this result. The α-process dominatesstochastically the original process (which corresponds to α = 0) when both startwith the same configuration. Since να

λ converges to the maximal invariant measureas α → 0, this will conclude the proof.

In Section 3 we introduce the α-truncated process, and state the two main resultswhich lead to the proof of Theorem 1: the ergodicity of the α-truncated processand the fact that it dominates the original process. In the same section we proveTheorem 1. In Section 4 we prove results for the random walk absorbed at sites xwith λx ≤ c + α, and in Section 5 we graphically construct the process, introducethe relevant couplings and prove the ergodicity and domination results.

3. The α-truncated process

We introduce a family of zero-range process with infinite number of particles atsites with sufficiently slow rates. Let α > 0, cα = c + α and λα = {λα

x}x∈Zd thetruncation given by

λαx =

{cα if λx ≤ cα,λx if λx > cα.

For each α ≥ 0 consider a X -valued zero-range process ηαt in the environment

λα. We call it the α-truncated process or just the truncated process when α is clearfrom the context. When α = 0 we have the original process: η0

t = ηt. PartitionZ

d = Λ(λ, α) ∪ Λc(λ, α) with

Λ(λ, α) = {x ∈ Zd : λx > c + α} and Λc(λ, α) = {x ∈ Z

d : λx ≤ c + α}.

We impose that ηαt (x) = ∞ for all t for all x ∈ Λc(λ, α). The truncated process ηα

t isdefined in the same way as ηt from Section 2 with the following differences. Particlesjump as before to Λc(λ, α), but since there are infinitely many particles in Λc(λ, α),the rate of jump from x ∈ Λc(λ, α) to y is (c+α)g(∞)p(y−x). Since the number ofparticles in x is always infinity, this jumps can be interpreted as creation of particlesin y. Hence the process ηα

t can be thought of as evolving in Xα := NΛ(λ,α) with

boundary conditions “infinitely many particles at sites in Λc(λ, α)”.Let Lα

λ be the generator of the α-truncated process ηαt and {Sα

λ (t), t ≥ 0} be thesemigroup associated to the generator Lα

λ . We construct this process a la Harris inSection 5.

We consider measures on configurations of the processes ηt and ηαt as measures

on X . The product measure ναλ with marginals

ναλ {ξ : ξ(x) = k} =

1Z(cα/λα

x)(cα/λα

x)k

g(k)!if x ∈ Λ(λ, α),

1{k = ∞} if x ∈ Λc(λ, α),

Page 126: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

114 P. A. Ferrari and V. V. Sisko

is invariant for the process ηαt . Since cα → c and λα(x) → λ(x) as α goes to zero,

(12) limα→0

ναλ = νλ,c weakly.

Let Tα : X → X be the truncation operator defined by

(13) Tαη(x) =

{η(x) if λx > c + α,∞ if λx ≤ c + α.

The operator Tα induces an operator on measures that we also call Tα. Defineµα := Tαµ. We clearly have

(14) µ ≤ µα.

This domination is preserved by the respective processes:

Lemma 1. Let α > 0 and t ≥ 0. Then µSλ(t) ≤ µαSαλ (t).

The truncated process converges to the invariant measure:

Proposition 1. Let µ be a probability measure on X satisfying (3). Then for anyα > 0,

(15) limt→∞

µαSαλ (t) = να

λ P-a.s.

We prove Lemma 1 and Proposition 1 in Section 5.

Proof of Theorem 1. For any α > 0, Lemma 1 and Proposition 1 imply

lim supt→∞

µSλ(t) ≤ lim supt→∞

µαSαλ (t) = να

λ .

Item (i) follows by taking α → 0 and applying (12).To prove item (ii), take µ such that νλ,c ≤ µ. In the same way as in the proof of

Lemma 1, it is easy to see that the semigroup Sλ(t), acting on measures, preservesthe ordering: νλ,cSλ(t) ≤ µSλ(t) for any t. Since νλ,c is invariant, νλ,c = νλ,cSλ(t).Therefore, by item (i),

νλ,c = lim supt→∞

νλ,cSλ(t) ≤ lim supt→∞

µSλ(t) ≤ νλ,c.

Our task is to prove Proposition 1. The point is that the skeleton of each particleis just a discrete-time random walk with absorption at the sites where λx ≤ c + α.Since there is a positive density of those sites, only a finite number of particleswill arrive at any fixed finite region. On the other hand, the absorbing sites createnew particles. We couple the process with initial measure µα with the process withinitial invariant measure να

λ in such a way that new particles are created at thesame time in the same sites to both processes. New created particles jump togetherat both marginals. We show that as time goes to infinity, in both processes onlynew particles will be present in any finite region.

Page 127: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 115

4. Family of independent random walks

Fix η such that the inequality in (3) holds. Fix α > 0. Since η and α are fixed, weomit them in the notation when it is possible. For example, Λc(λ) := Λc(λ, α).

For each x ∈ Zd, enumerate the η(x) particles at site x in some way and let

ζ ={ζn(x, i) : x ∈ Z

d, i ∈ N ∩ [1, η(x)]}

be a family of independent discrete-time random walks with starting pointsζ0(x, i) = x, x ∈ Z

d, i ∈ N ∩ [1, η(x)] and transitions governed by p(·). We usethe notation P and E for the law and expectation induced by ζ. Recall P and Eare the law and expectation induced by the environment λ. By P × P denote theproduct measure with marginals P and P.

For each (x, i) and for each subset A of Zd, denote

τ(x, i; A) = min{n ≥ 0 : ζn(x, i) ∈ A}

the first time the walk hits the set A (this could be ∞).Let us prove that if we consider the random walks in time [0, τ(x, i; Λc(λ))] only

a finite number of walks visit the origin and the number of visits of the origin byeach of the walks is finite. More formally, by N(λ, ζ) denote the last time any walkvisits the origin before entering in Λc(λ):

N(λ, ζ) = sup⋃

x

i

{m : m ∈ [0, τ(x, y; Λc(λ))] and ζm(x, i) = 0}.

Proposition 2.

(16) (P ×P){(λ, ζ) : N(λ, ζ) < ∞} = 1.

Proof. Denote θ = P(λ0 ≤ c + α). If α is small enough, then 0 < θ < 1. Call Ex,i

the subset of Zd visited by the walk ζn(x, i) in the time interval [0, τ(x, i; Λc(λ))]

and denoteCx,i,N = {(λ, ζ) : |Ex,i| ≥ N}

where N ≥ 0 and |Ex,i| is the number of elements in the set Ex,i. Since each siteof Ex,i has probability θ to be in the set Λc(λ),

(17) (P ×P)(Cx,i,N ) ≤ (1 − θ)N → 0 as N → ∞.

By hypothesis the random walk with transitions governed by p(·) is irreducible,hence it cannot be confined to a finite region. This implies that the number of newsites visited by time n goes to infinity as n increases. This and (17) implies that

(18) (P ×P)( ⋂

(x,i):i≤η(x)

{τ(x, i; Λc(λ)) < ∞

})= 1.

DefineDx,i =

{(λ, ζ) : τ(x, i; {0}) < τ(x, i; Λc(λ))

}.

Since the range of the random walk is M < ∞, we see that the random walk ζn(x, i)visits at least (the integer part of) ‖x‖/M different sites before it reaches the origin.Therefore,

(19) (P ×P)(Dx,i) ≤ (1 − θ)‖x‖/M .

Page 128: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

116 P. A. Ferrari and V. V. Sisko

Thus ∑

(x,i):i≤η(x)

(P ×P)(Dx,i) ≤∑

k

(1 − θ)k/M∑

x:‖x‖=k

η(x) < ∞

because we assumed η satisfies (3). Borel-Cantelli then implies that with (P × P)probability one only a finite number of events Dx,i happen. Thus, if we considerthe random walks in time [0, τ(x, i; Λc(λ))], then only a finite number of walks visitthe origin, and by (18), each walk visits the origin a finite number of times.

5. Construction and coupling

We construct a la Harris a Markov process ηt on X = NZ

d

corresponding to theabove description. Let (Nx,y, x, y ∈ Z

d) be a collection of independent Poissonprocess such that Nx,y has intensity p(y − x). If a Poisson event s belongs to aPoisson process Nx,y, then we say that the event has the origin x and the end y. Totune the rate with the environment and the number of particles, we associate to eachPoisson event s ∈ ∪x,yNx,y a random variable U(s), uniform in [0, 1], independentof the Poisson processes and independent of the other uniform variables. Since theprobability that any two Poisson events from ∪x,yNx,y happen at the same time iszero, all the Poisson events can be indexed by their times, in other words, they canbe ordered by their time of occurrence.

The evolution of the process ηt = ηλt in the environment λ is given by thefollowing (deterministic) rule: if the Poisson process Nx,y has an event at time sand

(20) U(s) < λxg(ηs−(x)),

then one particle is moved from x to y at that time. Since g(0) = 0, if no particleis in x, then the Poisson event produces no effect in the process in this case.

Using that p is finite range, a percolation argument shows that, for h sufficientlysmall, Z

d can be partitioned in finite (random) subsets with the following property:all Poisson events in the interval [0, h] have the origin and the end in the same subset.Since there is a finite number of Poisson events in time interval [0, h] in each of thesubsets, the Poisson events can be well ordered by their time of occurrence and thevalue of ηh for each subset can be obtained with the rule (20) proceeding from thefirst event to the last in each subset. Starting at ηh, we repeat the construction inthe interval [h, 2h] and so on. Thus, for any t, the process ηt is well defined as afunction of the Poisson processes and the uniform random variables.

The α-truncated process ηαt in the same environment λ is also realized as a

function of the Poisson processes and uniform variables with a similar rule: if thePoisson process Nx,y has an event at time s and

(21) U(s) < λαxg(ηα

s−(x)),

then one particle is moved from x to y at that time. Rules (20) and (21) induce anatural coupling between the processes ηt and ηα

t . This is the key of the proof ofLemma 1.

We use the notation P and E for the probability and expectation induced by thePoisson processes and corresponding uniform associated random variables. Noticethat this alea does not depend on λ.

Page 129: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 117

Proof of Lemma 1. Fix a configuration η0 and an environment λ and let ηα0 (x) =

η0(x) if x ∈ Λ(λ, α) and ηα0 (x) = ∞ if x ∈ Λc(λ, α). Let (ηt, η

αt ) be the cou-

pling obtained by constructing each marginal as a function of the Poisson processes(Nx,y, x, y ∈ Z

d) and uniform random variables (U(s), s ∈ ∪x,yNx,y) followingrules (20) and (21).

It suffices to show that each jump keeps the initial order. Consider the jumpassociated to a Poisson event at time s ∈ Nx,y with uniform variable U(s). Thereare two possibilities:(1) If x ∈ Λ(λ, α), then λx = λα

x . Since the function g(·) is monotone and therandom variable U(s) is the same for both marginals, the order is kept.(2) If x ∈ Λc(λ, α), then λx < λα

x . In this case a ηs−(x) particle jumps from x to y ifU(s) < λxg(ηs−(x)) and a ηα

s−(x) particle jumps from x to y if U(s) < λαxg(ηα

s−(x)).Hence, if ηs−(x) ≤ ηα

s−(x) and ηs−(y) ≤ ηαs−(y), then ηs(y) ≤ ηα

s (y). On the otherhand, ηs(x) ≤ ηα

s (x) = ∞.

To prove Proposition 1, we need the following result. It helps to prove thatthe second class particles do not stop forever at some place: eventually every suchparticle either move or coalesce.

Lemma 2. Fix an environment λ and consider the stationary process (ηαt , t ∈ R)

with time-marginal distribution ναλ and fix x ∈ Λ(λ, α). Then ηα

t (x) = 0 infinitelyoften with probability one:

(22) lim inft→∞

ηαt (x) = 0.

Proof. Consider the discrete time stationary process (ηαn(x), n ∈ N) —this is just

the process (ηt(x), t ∈ R) observed at integer times. It is sufficient to show

(23) lim infn→∞

ηαn(x) = 0

with probability one. A theorem of Poincare (Chapter IV in [21] or Theorem 3.4 ofChapter 6 in [6]) implies that for every k ∈ N,

P(ηα

n(x) = k infinitely often in n | ηα0 (x) = k

)= 1.

Returning for a moment to the continuous time process ηαt , if at time t site x has at

least one particle, then one of the particles at x will jump with probability boundedbelow by g(1)λx/(1 + λx) > 0, this is the probability the exponential jump timeof x is smaller than the jump-times of particles from the other sites to x, whose rateis bounded by g(∞)

∑y p(y, x) = 1. Fix k ∈ N. By the same reasoning, for any m,

if ηαm(x) = k, then there is a positive probability to be visiting 0 at time m + 1

independently of previous visits and uniformly in the configuration outside x attime m. Since these are independent attempts, Borel–Cantelli implies

P(ηα

n(x) = 0 infinitely often in n | ηα0 (x) = k

)= 1.

This implies (23).

Proof of Proposition 1. In an environment λ, consider the coupling process of twoversions of the process ηα

t obtained by using the same family of Poisson processes(Nx,y : x, y ∈ Z

d) and uniform random variables (U(s), s ∈⋃

x,y Nx,y). By {Sαλ (t) :

t ≥ 0} denote the semigroup of the process and by P the probability associated tothe process.

Page 130: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

118 P. A. Ferrari and V. V. Sisko

Since ναλ is invariant for ηα

t , it is enough to show that, for any α > 0, any µsatisfying (3), any λ (P-a.s.), and for any x ∈ Λ(λ, α),

(24) limt→∞

(µα × ναλ ) Sα

λ (t) {(ξ, η) : ξ(x) = η(x)} = 0.

In coupling terms, (24) reads

(25) limt→∞

∫ ∫µα(dξ) να

λ (dη) P(ξt(x) = ηt(x) | (ξ0, η0) = (ξ, η)

)= 0,

where we have denoted ξt the first coordinate of the coupled processes and ηt thesecond. Therefore, to prove the proposition it is enough to prove that, for any α > 0,any µ satisfying (3), any λ (P-a.s.), any ξ0 (µ-a.s.), any η0 (να

λ -a.s.), and for anyx ∈ Λ(λ, α),

(26) limt→∞

P(ξt(x) = ηt(x) | (ξ0, η0) = (ξ0, η0)

)= 0.

Without loss of generality we assume x = 0 and α small enough such that0 ∈ Λ(λ, α). Fix α, λ, ξ0 and η0. The configurations ξ0 and η0 are in principlenot ordered: there are sites y ∈ Λ(λ, α) such that (ξ0(y) − η0(y))+ > 0 and sitesz ∈ Λ(λ, α) such that (ξ0(z) − η0(z))− > 0. We say that we have ξη-discrepanciesin the first case and ηξ-discrepancies in the second one.

Denote ξt(z) := min{ξt(z), ηt(z)} the number of coupled particles at site z attime t. The ξ-particles move as regular zero-range particles; they are usually calledfirst class particles. There is at most one type of discrepancy at each site at anytime. Discrepancies of both types move as second class particles, i.e., ξη-discrepancyjumps from y to z with rate

(27) λαy p(z − y)[g(ξ(y)) − g(ξ(y))]

and ηξ-discrepancy jumps from y to z with rate

(28) λαy p(z − y)[g(η(y)) − g(ξ(y))]

that is, second class particles jump with the difference rate. For instance, in the caseg(k) ≡ 1, the second class particles jump only when there are no coupled particlesin the site.

If a ξη-discrepancy jumps to a site z occupied by at least one ηξ-discrepancy,then the ξη-discrepancy and one of the ηξ-discrepancies at z coalesce into a coupledξ-particle in z. Analogously, for the case when a ηξ-discrepancy jumps to a site zoccupied by at least one ξη-discrepancy. The coupled particle behaves from thismoment on as a first class particle. If a discrepancy of any type jumps to a sitez with infinite number of particles, that is, z ∈ Λc(λ, α), then the discrepancydisappears. All particles in sites x ∈ Λc(λ, α) are first class ξ-particles. Therefore,any particle that jump from any site x ∈ Λc(λ, α) is a first class particle.

At time zero there are |ξ0(y)− η0(y)| discrepancies at site y. To the ith discrep-ancy at site y at time zero, that is, discrepancy (y, i), we associate the random walkζn(y, i) from the model of Section 4.

Since the interaction with the other particles and the environment λ governs thewaiting times between jumps but does not affect the skeleton of the discrepancymotion until coalescence or absorbing time, it is possible to couple the skeleton ofthe discrepancy (y, i) with the random walk ζn(y, i) in such a way that they performthe same jumps together until (a) the coalescence of the discrepancy with another

Page 131: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Zero range processes with random rates 119

discrepancy of different type or (b) the absorption of the discrepancy at some site ofΛc(λ). In any case, the number of discrete jumps is at most τ(y, i; Λc(λ)). Therefore,the full trajectory of discrepancy (y, i) is shorter (visits not more sites and has notmore number of visits to each site) than the trajectory of the random walk ζn(y, i)in the time interval [0, τ(y, i; Λc(λ))]. Thus, Proposition 2 implies that only a finitenumber of discrepancies visit x and the number of visits of site x by each of thediscrepancies is finite.

Lemma 2 implies that there are no η-particles at x infinitely often. There-fore, there are no ηξ-discrepancies at x infinitely often. This means that everyηξ-discrepancy that at some moment is at x will eventually jump out or coalesce.It follows that after some random time there is no ηξ-discrepancies in x forever.

Moreover, if at time t site x ∈ Λ(λ, α) has no η-particles, then a ξη-discrepancyat x will jump with probability bounded below by g(1)λx/(1 + λx) > 0. Therefore,using Lemma 2, we see that after some random time there is no ξη-discrepanciesin x forever.

Acknowledgments. We thank Enrique Andjel and James Martin for discussions.Part of this paper was written when the first author was participating of the pro-

gram Principles of the Dynamics of Non-Equilibrium Systems in the Isaac NewtonInstitute for Mathematical Sciences visiting Newton Institute, in May-June 2006.

This paper was partially supported by FAPESP and CNPq.

References

[1] Andjel, E. D. (1982). Invariant measures for the zero range processes. Ann.Probab. 10 525–547. MR659526

[2] Andjel, E. D., Ferrari, P. A., Guiol, H. and Landim, C. (2000). Conver-gence to the maximal invariant measure for a zero-range process with randomrates. Stochastic Process. Appl. 90 67–81. MR1787125

[3] Bahadoran, C. (1998). Hydrodynamical limit for spatially heterogeneoussimple exclusion processes. Probab. Theory Related Fields 110 287–331.MR1616563

[4] Balazs, M., Rassoul-Agha, F., Seppalainen, T. and Sethuraman, S.

(2007). Existence of the zero range process and a deposition model with su-perlinear growth rates. To appear. math/0511287.

[5] Benjamini, I., Ferrari, P. A. and Landim, C. (1996). Asymmetric con-servative processes with random rates. Stochastic Process. Appl. 61 181–204.MR1386172

[6] Durrett, R. (1996). Probability: Theory and Examples. Duxbury Press, Bel-mont, CA. MR1609153

[7] Evans, M. R. (1996). Bose-Einstein condensation in disordered exclusionmodels and relation to traffic flow. Europhys. Lett. 36 13–18.

[8] Evans, M. R. and Hanney, T. (2005). Nonequilibrium statistical mechan-ics of the zero-range process and related models. J. Phys. A 38 R195–R240.MR2145800

[9] Gielis, G., Koukkous, A. and Landim, C. (1998). Equilibrium fluctuationsfor zero range processes in random environment. Stochastic Process. Appl. 77187–205. MR1649004

[10] Grigorescu, I., Kang, M. and Seppalainen, T. (2004). Behavior dom-inated by slow particles in a disordered asymmetric exclusion process. Ann.Appl. Probab. 14 1577–1602. MR2071435

Page 132: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

120 P. A. Ferrari and V. V. Sisko

[11] Harris, T. E. (1972). Nearest-neighbor Markov interaction processes on mul-tidimensional lattices. Advances in Math. 9 66–89. MR0307392

[12] Holley, R. (1970). A class of interactions in an infinite particle system. Ad-vances in Math. 5 291–309. MR0268960

[13] Jackson, J. R. (1957). Networks of waiting lines. Operations Res. 5 518–521.MR0093061

[14] Koukkous, A. (1999). Hydrodynamic behavior of symmetric zero-rangeprocesses with random rates. Stochastic Process. Appl. 84 297–312.MR1719270

[15] Krug, J. and Ferrari, P. A. (1996). Phase transitions in driven diffusivesystems with random rates. J. Phys. A: Math. Gen. 29 1465–1471.

[16] Landim, C. (1996). Hydrodynamical limit for space inhomogeneous one-dimensional totally asymmetric zero-range processes. Ann. Probab. 24 599–638. MR1404522

[17] Liggett, T. M. (1972). Existence theorems for infinite particle systems.Trans. Amer. Math. Soc. 165 471–481. MR0309218

[18] Liggett, T. M. (1973). An infinite particle system with zero range interac-tions. Ann. Probab. 1 240–253. MR0381039

[19] Liggett, T. M. (1985). Interacting Particle Systems. Springer, New York.MR776231

[20] Seppalainen, T. and Krug, J. (1999). Hydrodynamics and platoon forma-tion for a totally asymmetric exclusion model with particlewise disorder. J.Statist. Phys. 95 525–567. MR1700871

[21] Shiryaev, A. N. (1996). Probability. Springer, New York. MR1368405[22] Spitzer, F. (1970). Interaction of Markov processes. Advances in Math. 5

246–290. MR0268959

Page 133: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotics: Particles, Processes and Inverse ProblemsVol. 55 (2007) 121–134c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000319

On non-asymptotic bounds for estimation

in generalized linear models with highly

correlated design

Sara A. van de Geer1

ETH Zurich

Abstract: We study a high-dimensional generalized linear model and penal-ized empirical risk minimization with �1 penalty. Our aim is to provide anon-trivial illustration that non-asymptotic bounds for the estimator can beobtained without relying on the chaining technique and/or the peeling device.

1. Introduction

We study an increment bound for the empirical process, indexed by linear com-binations of highly correlated base functions. We use direct arguments, instead ofthe chaining technique. We moreover obtain bounds for an M-estimation probleminserting a convexity argument instead of the peeling device. Combining the tworesults leads to non-asymptotic bounds with explicit constants.

Let us motivate our approach. In M-estimation, some empirical average indexedby a parameter is minimized. It is often also called empirical risk minimization.To study the theoretical properties of the thus obtained estimator, the theory ofempirical processes has been a successful tool. Indeed, empirical process theorystudies the convergence of averages to expectations, uniformly over some parameterset. Some of the techniques involved are the chaining technique (see e.g. [13]), inorder to relate increments of the empirical process to the entropy of parameterspace, and the peeling device (a terminology from [10]) which goes back to [1],which allows one to handle weighted empirical processes. Also the concentrationinequalities (see e.g. [9]), which consider the concentration of the supremum of theempirical process around its mean, are extremely useful in M-estimation problems.

A more recent trend is to derive non-asymptotic bounds for M-estimators. Thepapers [6] and [4] provide concentration inequalities with economical constants.This leads to good non-asymptotic bounds in certain cases [7]. Generally however,both the chaining technique and the peeling device may lead to large constants inthe bounds. For an example, see the remark following (5).

Our aim in this paper is simply to avoid the chaining technique and the peelingdevice. Our results should primarily be seen as non-trivial illustration that bothtechniques may be dispensable, leaving possible improvements for future research.In particular, we will at this stage not try to optimize the constants, i.e. we willmake some arbitrary choices. Moreover, as we shall see, our bound for the incrementinvolves an additional log-factor, log m, where m is the number of base functions(see below).

1Seminar fur Statistik, ETH Zurich, LEO D11, 8092 Zurich, Switzerland, e-mail:[email protected]

AMS 2000 subject classification: 62G08.Keywords and phrases: convex hull, convex loss, covering number, non-asymptotic bound, pe-

nalized M-estimation.

121

Page 134: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

122 S. A. van de Geer

The M-estimation problem we consider is for a high-dimensional generalizedlinear model. Let Y ∈ Y ⊂ R be a real-valued (response) variable and x be acovariate with values in some space X . Let

{fθ(·) :=

m∑

k=1

θkψk(·), θ ∈ Θ

}

be a (subset of a) linear space of functions on X . We let Θ be a convex subset ofRm, possibly Θ = Rm. The functions {ψk}m

k=1 form a given system of real-valuedbase functions on X . The number of base functions, m, is allowed to be large.However, we do have the situation m ≤ n in mind (as we will consider the case offixed design).

Let γf : X ×Y → R be some loss function, and let {(xi, Yi)}ni=1 be observations

in X × Y . We consider the estimator with �1 penalty

(1) θn := arg minθ∈Θ

{1n

n∑

i=1

γfθ(xi, Yi) + λ

22−sn I

2(1−s)2−s (θ)

},

where

(2) I(θ) :=m∑

k=1

|θk|

denotes the �1 norm of the vector θ ∈ Rm. The smoothing parameter λn controlsthe amount of complexity regularization, and the parameter s (0 < s ≤ 1) isgoverned by the choice of the base functions (see Assumption B below). Note thatfor a properly chosen constant C depending on λn and s, we have for any I > 0,

λ2

2−sn I

2(1−s)2−s = min

λ

(λI +

C

λ2(1−s)

s

).

In other words, the penalty λ2

2−sn I

2(1−s)2−s (θ) can be seen as the usual Lasso penalty

λI(θ) with an additional penalty on λ. The choice of the latter is such that adaptionto small values of I(θ∗n) is achieved. Here, θ∗n is the target, defined in (3) below.

The loss function γf is assumed to be convex and Lipschitz (see AssumptionL below). Examples are the loss functions used in quantile regression, logistic re-gression, etc. The quadratic loss function γf (x, y) = (y − f(x))2 can be studied aswell without additional technical problems. The bounds then depend on the tailbehavior of the errors.

The covariates x1, . . . , xn are assumed to be fixed, i.e., we consider the case offixed design. For γ : X × Y → R, use the notation

Pγ :=1n

n∑

i=1

Eγ(xi, Yi).

Our target function θ∗n is defined as

(3) θ∗n := arg minθ∈Θ

Pγfθ.

When the target is sparse, i.e., when only a few of the coefficients θ∗n,k arenonzero, it makes sense to try to prove that also the estimator θn is sparse. Non-asymptotic bounds for this case (albeit with random design) are studied in [12]. It

Page 135: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 123

is assumed there that the base functions {ψk} have design matrix with eigenval-ues bounded away from zero (or at least that the base functions corresponding tothe non-zero coefficients in θ∗n have this property). In the present paper, the basefunctions are allowed to be highly correlated. We will consider the case where theyform a VC class, or more generally, have ε-covering number which is polynomial in1/ε. This means that a certain smoothness is imposed a priori, and that sparsenessis less an issue.

We use the following notation. The empirical distribution based on the sam-ple {(xi, Yi)}n

i=1 is denoted by Pn, and the empirical distribution of the covari-ates {xi}n

i=1 is written as Qn. The L2(Qn) norm is written as ‖ · ‖n. Moreover,‖ · ‖∞ denotes the sup norm (which in our case may be understood as ‖f‖∞ =max1≤i≤n |f(xi)|, for a function f on X ).

We impose four basic assumptions: Assumptions L, M, A and B.

Assumption L. The loss function γf is of the form γf (x, y) = γ(f(x), y), whereγ(·, y) is convex for all y ∈ Y. Moreover, it satisfies the Lipschitz property

|γ(fθ(x), y) − γ(fθ(x), y)| ≤ |fθ(x) − fθ(x)|,∀ (x, y) ∈ X × Y , ∀ θ, θ ∈ Θ.

Assumption M. There exists a non-decreasing function σ(·), such that all M > 0and all all θ ∈ Θ with ‖fθ − fθ∗

n‖∞ ≤ M , one has

P (γfθ− γfθ∗

n) ≥ ‖fθ − fθ∗

n‖2

n/σ2(M).

Assumption M thus assumes quadratic margin behavior. In [12], more generalmargin behavior is allowed, and the choice of the smoothing parameter does notdepend on the margin behavior. However, in the setup of the present paper, thechoice of the smoothing parameter does depend on the margin behavior.

Assumption A. It holds that

‖ψk‖∞ ≤ 1, 1 ≤ k ≤ m.

Assumption B. For some constant A ≥ 1, and for V = 2/s − 2, it holds that

N(ε, Ψ) ≤ Aε−V ,∀ ε > 0.

Here N(ε, Ψ) denotes the ε-covering number of (Ψ, ‖ · ‖n), with Ψ := {ψk}mk=1.

The paper is organized as follows. Section 2 presents a bound for the incrementsof the empirical process. Section 3 takes such a bound for granted and presents anon-asymptotic bound for ‖fθn

− fθ∗n‖n and I(θn). The two sections can be read

independently. In particular, any improvement of the bound obtained in Section 2can be directly inserted in the result of Section 3. The proofs, which are perhapsthe most interesting part of the paper, are given in Section 4.

2. Increments of the empirical process indexed by a subset of a linearspace

Let ε1, . . . , εn be i.i.d. random variables, taking values ±1 each with probability1/2. Such a sequence is called a Rademacher sequence. Consider for ε > 0 andM > 0, the quantity

Zε,M := sup‖fθ‖n≤ε, I(θ)≤M

∣∣∣∣∣1n

n∑

i=1

fθ(xi)εi

∣∣∣∣∣.

Page 136: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

124 S. A. van de Geer

We need a bound for the mean EZε,M , because this quantity will occur in theconcentration inequality (Theorem 4.1). In [12], the following trivial bound is used:

EZε,M ≤ ME

(max

1≤k≤m

∣∣ 1n

n∑

i=1

εiψk(xi)∣∣)

.

On the right hand side, one now has the mean of finitely many functions, whichis easily handled (see for example Lemma 4.1). However, when the base functionsψk are highly correlated, this bound is too rough. We need therefore to proceeddifferently.

Let conv(Ψ) = {fθ =∑m

k=1 θkψk : θk ≥ 0,∑m

k=1 θk = 1} be the convex hull ofΨ.

Recall that s = 2/(2+V ), where V is from Assumption B. From e.g. [10], Lemma3.2, it can be derived that for some constant C, and for all ε > 0,

(4) E∣∣∣∣ maxf∈conv(Ψ),‖f‖n≤ε

1n

n∑

i=1

f(xi)εi

∣∣∣∣ ≤ Cεs 1√n

.

The result follows from the chaining technique, and applying the entropy bound

(5) log N(ε, conv(Ψ)) ≤ A0ε−2(1−s), ε > 0,

which is derived in [2]. Here, A0 is a constant depending on V and A.

Remark. It may be verified that the constant C in (4) is then at least proportionalto 1/s, i.e., it is large when s is small.

Our aim is now to obtain a bound from direct calculations. Pollard ([8]) presentsthe bound

log N(ε, conv(Ψ)) ≤ A1ε−2(1−s) log

1ε, ε > 0,

where A1 is another constant depending on V and A. In other words, Pollard’sbound has an additional log-factor. On the other hand, we found Pollard’s proofa good starting point in our attempt to derive the increments directly, withoutchaining. This is one of the reasons why our direct bound below has an additionallog m factor. Thus, our result should primarily be seen as illustration that directcalculations are possible.

Theorem 2.1. For ε ≥ 16/m, and m ≥ 4, we have

E

∣∣∣∣∣ maxf∈conv(Ψ),‖f‖n≤ε

1n

n∑

i=1

f(xi)εi

∣∣∣∣∣ ≤ 20√

1 + 2Aεs

√log(6m)

n.

Clearly the set {∑m

k=1 θkψk : I(θ) ≤ 1} is the convex hull of {±ψk}mk=1. Using

a renormalization argument, one arrives at the following corollary

Corollary 2.1. We have for ε/M > 8/m and m ≥ 2

EZε,M ≤ 20√

1 + 4AM1−sεs

√log(12m)

n.

Invoking symmetrization, contraction and concentration inequalities (see Sec-tion 4), we establish the following lemma. We present it in a form convenient forapplication in the proof of Theorem 3.1.

Page 137: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 125

Lemma 2.1. Define for ε > 0, M > 0, and ε/M > 8/m, m ≥ 2,

Zε,M := sup‖fθ−fθ∗

n‖n≤ε, I(θ−θ∗

n)≤M

|(Pn − P )(γfθ− γfθ∗

n)|.

Let

λn,0 := 80√

1 + 4A

√log(12m)

n.

Then it holds for all σ > 0, that

P(Zε,M ≥ λn,0ε

sM1−s +ε2

27σ2

)≤ exp

[− nε2

2 × (27σ2)2

].

3. A non-asymptotic bound for the estimator

The following theorem presents bounds along the lines of results in [10], [11] and[3], but it is stated in a non-asymptotic form. It moreover formulates explicitly thedependence on the expected increments of the empirical process.

Theorem 3.1. Define for ε > 0 and M > 0,

Zε,M := sup‖fθ−fθ∗

n‖n≤ε, I(θ−θ∗

n)≤M

|(Pn − P )(γfθ− γfθ∗

n)|.

Let λn,0 be such that for all 8/m ≤ ε/M ≤ 1, we have

(6) EZε,M ≤ λn,0εsM1−s.

Let c ≥ 3 be some constant.Define

Mn := 22−s

2(1−s) (27)−s

2(1−s) c1

1−s I(θ∗n),

σ2n := σ2(Mn),

andεn :=

√54σ

22−sn c

12−s λ

12−s

n,0 I1−s2−s (θ∗n) ∨ 27σ2

nλn,0.

Assume that

(7) 1 ≤(

272

)− 2−s2(1−s)

c1

1−s1

σ2nλn,0

I(θ∗n) ≤(m

8

)2−s

.

Then for λn := cσsnλn,0, with probability at least

1 − exp[−

nλ2

2−s

n,0 c2

2−s I2(1−s)2−s (θ∗n)

27σ4(1−s)2−s

n

],

we have that‖fθn

− fθ∗n‖n ≤ εn

andI(θn − θ∗n) ≤ Mn.

Page 138: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

126 S. A. van de Geer

Let us formulate the asymptotic implication of Theorem 3.1 in a corollary. Forpositive sequences {an} and {bn}, we use the notation

an bn,

when0 < lim inf

n→∞

an

bn≤ lim sup

n→∞

an

bn< ∞.

The corollary yields e.g. the rate εn n−1/3 for the case where the penalty rep-resents the total variation of a function f on {x1, . . . , xn} ⊂ R (in which cases = 1/2).

Corollary 3.1. Suppose that A and s do not depend on n, and that I(θ∗n) 1 andσ2(Mn) 1 for all Mn 1. By (4), we may take λn 1/

√n, in which case, with

probability 1 − exp[−dn], it holds that ‖fθn− fθ∗

n‖n ≤ εn, and I(θn − θ∗n) ≤ Mn,

withεn n− 1

2(2−s) , Mn 1, dn nε2n n1−s2−s .

4. Proofs

4.1. Preliminaries

Theorem 4.1 (Concentration theorem [6]). Let Z1, . . . , Zn be independentrandom variables with values in some space Z and let Γ be a class of real-valuedfunctions on Z, satisfying

ai,γ ≤ γ(Zi) ≤ bi,γ ,

for some real numbers ai,γ and bi,γ and for all 1 ≤ i ≤ n and γ ∈ Γ. Define

L2 := supγ∈Γ

n∑

i=1

(bi,γ − ai,γ)2/n,

and

Z := supγ∈Γ

∣∣∣∣∣1n

n∑

i=1

(γ(Zi) − Eγ(Zi))

∣∣∣∣∣ .

Then for any positive z,

P(Z ≥ EZ + z) ≤ exp[−nz2

2L2

].

The Concentration theorem involves the expectation of the supremum of theempirical process. We derive bounds for it using symmetrization and contraction.Let us recall these techniques here.

Theorem 4.2 (Symmetrization theorem [13]). Let Z1, . . . , Zn be independentrandom variables with values in Z, and let ε1, . . . , εn be a Rademacher sequenceindependent of Z1, . . . , Zn. Let Γ be a class of real-valued functions on Z. Then

E

(supγ∈Γ

∣∣∣∣∣

n∑

i=1

{γ(Zi) − Eγ(Zi)}∣∣∣∣∣

)≤ 2E

(supγ∈Γ

∣∣∣∣∣

n∑

i=1

εiγ(Zi)

∣∣∣∣∣

).

Page 139: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 127

Theorem 4.3 (Contraction theorem [5]). Let z1, . . . , zn be non-random ele-ments of some space Z and let F be a class of real-valued functions on Z. ConsiderLipschitz functions γi : R → R, i.e.

|γi(s) − γi(s)| ≤ |s − s|, ∀ s, s ∈ R.

Let ε1, . . . , εn be a Rademacher sequence. Then for any function f∗ : Z → R, wehave

E

(supf∈F

∣∣∣∣∣

n∑

i=1

εi{γi(f(zi)) − γi(f∗(zi))}∣∣∣∣∣

)≤ 2E

(supf∈F

∣∣∣∣∣

n∑

i=1

εi(f(zi) − f∗(zi))

∣∣∣∣∣

).

We now consider the case where Γ is a finite set of functions.

Lemma 4.1. Let Z1, . . . , Zn be independent Z-valued random variables, andγ1, . . . , γm be real-valued functions on Z, satisfying

ai,k ≤ γk(Zi) ≤ bi,k,

for some real numbers ai,k and bi,k and for all 1 ≤ i ≤ n and 1 ≤ k ≤ m. Define

L2 := max1≤k≤m

n∑

i=1

(bi,k − ai,k)2/n,

Then

E

(max

1≤k≤m

∣∣∣∣∣1n

n∑

i=1

{γk(Zi) − Eγk(Zi)}∣∣∣∣∣

)≤ 2L

√log(3m)

n.

Proof. The proof uses standard arguments, as treated in e.g. [13]. Let us write for1 ≤ k ≤ m,

γk :=1n

n∑

i=1

{γk(Zi) − Eγk(Zi)

}.

By Hoeffding’s inequality, for all z ≥ 0

P (|γk| ≥ z) ≤ 2 exp[−nz2

2L2

].

Hence,

E exp[ n

4L2γ2

k

]= 1 +

∫ ∞

1

P

(|γk| ≥

√4L2

nlog t

)dt

≤ 1 + 2∫ ∞

1

1t2

dt = 3.

Thus

E(

max1≤k≤m

|γk|)

=2L√

nE

√max

1≤k≤mlog exp

[4

4L2γ2

k

]

≤ 2L√n

√log E

(max

1≤k≤mexp

[4

4L2γ2

k

])≤ 2L

√log(3m)

n.

Page 140: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

128 S. A. van de Geer

4.2. Proofs of the results in Section 2

Proof of Theorem 2.1. Let us define, for k = 1, . . . , m,

ξk :=1n

n∑

i=1

ψk(xi)εi.

We have1n

n∑

i=1

fθ(xi)εi =m∑

k=1

θkξk.

Partition {1, . . . , m} into N := N(εs, Ψ) sets Vj , j = 1, . . . , N , such that

‖ψk − ψl‖n ≤ 2εs, ∀ k, l ∈ Vj .

We can write1n

n∑

i=1

fθ(xi)εi =N∑

j=1

αj

k∈Vj

pj,kξk,

where

αj = αj(θ) :=∑

k∈Vj

θk, pj,k = pj,k(θ) :=θk

αj.

Set for j = 1, . . . , N ,nj = nj(α) := 1 + � αj

ε2(1−s) .

Choose πt,j = πt,j(θ), t = 1, . . . , nj , j = 1, . . . , N independent random variables,independent of ε1, . . . , εn, with distribution

P(πt,j = k) = pj,k, k ∈ Vj , j = 1, . . . , N.

Let ψj = ψj(θ) :=∑nj

i=1 ψπt,j /nj and ξj = ξj(θ) :=∑nj

i=1 ξπt,j /nj .We will choose a realization {(ψ∗

j , ξ∗j ) = (ψ∗j (θ), ξ∗j (θ))}N

j=1 of {(ψj , ξj)}Nj=1 de-

pending on {εi}ni=1, satisfying appropriate conditions (namely, (9) and (10) below).

We may then write∣∣∣∣∣

m∑

k=1

θkξk

∣∣∣∣∣ ≤∣∣∣∣∣

N∑

j=1

αjξ∗j

∣∣∣∣∣ +

∣∣∣∣∣

m∑

k=1

θkξk −N∑

j=1

αjξ∗j

∣∣∣∣∣.

Consider nowN∑

j=1

αjξ∗j .

Let AN := {∑N

i=1 αj = 1, αj ≥ 0}. Endow AN with the �1 metric. The ε-coveringnumber D(ε) of AN satisfies the bound

D(ε) ≤(

)N

.

Let Aε be a maximal ε-covering set of AN . For all α ∈ A there is an α′ ∈ Aε suchthat

∑Nj=1 |αj − α′

j | ≤ ε.

Page 141: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 129

We now write∣∣∣∣∣

m∑

k=1

θkξk

∣∣∣∣∣ ≤∣∣∣∣∣

N∑

j=1

(αj − α′j)ξ

∗j

∣∣∣∣∣ +

∣∣∣∣∣

m∑

k=1

θkξk −N∑

j=1

αjξ∗j

∣∣∣∣∣ +

∣∣∣∣∣

N∑

j=1

α′jξ

∗j

∣∣∣∣∣

:= i(θ) + ii(θ) + iii(θ).

Let Π be the set of possible values of the vector {πt,j : t = 1, . . . , nj , j =1, . . . , N}, as θ varies. Clearly,

i(θ) ≤ ε maxΠ

maxj

|ξj |,

where we take the maximum over all possible realizations of {ξj}Nj=1 over all θ.

For each t and j, πt,j takes its values in {1, . . . ,m}, that is, it takes at most mvalues. We have

N∑

j=1

nj ≤ N +N∑

j=1

αj

ε2(1−s)

≤ Aε−sV +∑m

k=1 θk

ε2(1−s)

= (1 + A)ε−2(1−s) ≤ K + 1.

where K is the integerK := �(1 + A)ε2(1−s) .

The number of integer sequences {nj}Nj=1 with

∑Nj=1 nj ≤ K + 1 is equal to

(N + K + 2

K + 1

)≤ 2N+K+2 ≤ 4 × 2(1+2A)ε−2(1−s)

.

So the cardinality |Π| of Π satisfies

|Π| ≤ 4 × 2(1+2A)ε−2(1−s) × m(1+A)ε−2(1−s) ≤ (2m)(1+2A)ε−2(1−s)

,

since A ≥ 1 and m ≥ 4.Now, since ‖ψ‖∞ ≤ 1 for all ψ ∈ Ψ, we know that for any convex combination∑k pkξk, one has E|

∑k pkξk|2 ≤ 1/n. Hence Eξ2

j ≤ 1/n for any fixed ξj and thus,by Lemma 4.1,

(8) εEmaxΠ

maxj

|ξj | ≤ 2ε√

1 + 2Aε−(1−s)

√log(6m)

n= 2

√1 + 2Aεs

√log(6m)

n.

We now turn to ii(θ).By construction, for i = 1, . . . , n, t = 1, . . . , nj , j = 1, . . . , N ,

Eψπt,j (xi) =∑

k∈Vj

pj,kψk(xi) := gj(xi)

and henceE(ψπt,j (xi) − gj(xi))2 ≤ max

k,l∈Vj

(ψk(xi) − ψl(xi))2.

ThusE(ψj(xi) − gj(xi))2 ≤ max

k,l∈Vj

(ψk(xi) − ψl(xi))2/nj ,

Page 142: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

130 S. A. van de Geer

and soE‖ψj − gj‖2

n ≤ maxk,l∈Vj

‖ψk − ψl‖2n/nj ≤ (2εs)2/nj = 4ε2s/nj .

Therefore

E

∥∥∥∥∥

N∑

j=1

αj(ψj − gj)

∥∥∥∥∥

2

n

=N∑

j=1

α2jE‖ψj − gj‖2

n

≤ 4ε2sN∑

j=1

α2j

nj≤ 4ε2s

N∑

j=1

α2j ε

2(1−s)

αj≤ 4ε2.

Let Eε denote conditional expectation given {εi}ni=1. Again, by construction

Eεξπt,j =∑

k∈Vj

pj,kξk := ej = ej(θ),

and henceEε(ξπt,j − ej)2 ≤ max

k,l∈Vj

(ξk − ξl)2.

ThusEε(ξj − ej)2 ≤ max

k,l∈Vj

(ξk − ξl)2/nj .

So we obtain

∣∣∣∣∣

N∑

j=1

αj(ξj − ej)

∣∣∣∣∣ ≤N∑

j=1

αjEε|ξj − ej | ≤N∑

j=1

αj maxk,l∈Vj

|ξk − ξl|√nj

≤N∑

j=1

αjε1−s

√αj

maxk,l∈Vj

|ξk − ξl| =N∑

j=1

√αjε

1−s maxk,l∈Vj

|ξk − ξl|

≤√

Nε1−s maxj

maxk,l∈Vj

|ξk − ξl| ≤√

A maxj

maxk,l∈Vj

|ξk − ξl|.

It follows that, given {εi}ni=1, there exists a realization

{(ψ∗j , ξ∗j ) = (ψ∗

j (θ), ξ∗j (θ))}Nj=1

of {(ψj , ξj)}Nj=1 such that

(9) ‖N∑

j=1

αj(ψ∗j − gj)‖2

n ≤ 4ε

as well as

(10)

∣∣∣∣∣

N∑

j=1

αj(ξ∗j − ej)

∣∣∣∣∣ ≤ 2√

Amaxj

maxk,l∈Vj

|ξk − ξl|.

Thus we haveii(θ) ≤ 2

√A max

jmaxk,l∈Vj

|ξk − ξl|.

Since E|ξk − ξl|2 ≤ 2ε2/n for all k, l ∈ Vj and all j, we have by Lemma 4.1,

(11) 2√

AEmaxj

maxk,l∈Vj

|ξk − ξl| ≤ 6√

Aεs

√log(6m)

n.

Page 143: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 131

Finally, consider iii(θ). We know that

‖fθ‖n =

∥∥∥∥∥

N∑

j=1

αjgj

∥∥∥∥∥n

≤ ε.

Moreover, we have shown in (9) that∥∥∥∥∥

N∑

j=1

αj(ψ∗j − gj)

∥∥∥∥∥n

≤ 4ε.

Also ∥∥∥∥∥

N∑

j=1

(αj − α′j)ψ

∗j

∥∥∥∥∥n

≤N∑

j=1

|αj − α′j |‖ψ∗

j ‖n ≤ ε,

since ‖ψ∗j ‖∞ ≤ 1 for all j . Thus

∥∥∥∥∥

N∑

j=1

α′jψ

∗j

∥∥∥∥∥n

≤∥∥∥∥∥

N∑

j=1

(α′j − αj)ψ∗

j

∥∥∥∥∥n

+

∥∥∥∥∥

N∑

j=1

αj(ψ∗j − gj)

∥∥∥∥∥n

+ ‖fθ‖n ≤ 6ε.

The total number of functions of the form∑N

j=1 α′jξ

∗j is bounded by

(4ε

)N

× |Π| ≤(

)Aε−2(1−s)

× (2m)(1+2A)ε−2(1−s)

≤ (2m)(1+2A)ε−2(1−s)

,

since we assume ε ≥ 16/m, and A ≥ 1. Hence, by Lemma 4.1,

(12) E maxα′∈Aε

maxΠ

|N∑

j=1

α′jξ

∗j | ≤ 12

√1 + 2Aεs

√log(6m)

n.

We conclude from (8), (11), and (12), that

Emaxθ

∣∣∣∣∣

N∑

j=1

αj(θ)ej(θ)

∣∣∣∣∣

≤ 2√

1 + 2Aεs

√log(6m)

n+ 6

√Aεs

√log(6m)

n+ 12

√1 + 2Aεs

√log(6m)

n

≤ 20√

1 + 2Aεs

√log(6m)

n.

Proof of Lemma 2.1. Let

Zε,M := sup‖fθ‖n≤ε, I(θ)≤M

∣∣∣∣∣1n

n∑

i=1

γfθ(xi, Yi)εi

∣∣∣∣∣

denote the symmetrized process. Clearly, {fθ =∑m

k=1 θkψk : I(θ) = 1} is theconvex hull of Ψ := {±ψk}m

k=1. Moreover, we have

N(ε, Ψ) ≤ 2N(ε, Ψ).

Page 144: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

132 S. A. van de Geer

Now, apply Theorem 2.1, to Ψ, and use a rescaling argument, to see that

EZε,M ≤ 20√

1 + 4AεsM1−s

√log(12m)

n.

Then from Theorem 4.2 and Theorem 4.3, we know that

EZε,M ≤ 4EZε,M .

The result now follows by applying Theorem 4.1.

4.3. Proofs of the results in Section 3

The proof of Theorem 3.1 depends on the following simple convexity trick.

Lemma 4.2. Let ε > 0 and M > 0. Define fn = tfn + (1 − t)f∗n with

t := (1 + ‖fn − f∗n‖n/ε + I(fn − f∗

n)/M)−1,

and with fn := fθnand f∗

n := fθ∗n. When it holds that

‖fn − f∗n‖n ≤ ε

3, and I(fn − f∗

n) ≤ M

3,

then‖fn − f∗

n‖n ≤ ε, and I(fn − f∗n) ≤ M.

Proof. We havefn − f∗

n = t(fn − f∗n),

so ‖fn − f∗n‖n ≤ ε/3 implies

‖fn − f∗n‖n ≤ ε

3t= (1 + ‖fn − f∗

n‖n/ε + I(fn − f∗n)/M)

ε

3.

So then

(13) ‖fn − f∗n‖n ≤ ε

2+

ε

2MI(fn − f∗

n).

Similarly, I(fn − f∗n) ≤ M/3 implies

(14) I(fn − f∗n) ≤ M

2+

M

2ε‖fn − f∗

n‖n.

Inserting (14) into (13) gives

‖fn − f∗n‖n ≤ 3ε

4+

14‖fn − f∗

n‖n,

i.e., ‖fn − f∗n‖n ≤ ε. Similarly, Inserting (13) into (14) gives I(fn − f∗

n) ≤ M .

Proof of Theorem 3.1. Note first that, by the definition of of Mn, εn and λn, itholds that

(15) λn,0εsnM1−s

n =ε2n

27σ2n

,

Page 145: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Non-asymptotic bounds for GLM 133

and also

(16) (27)s

2−s c−2

2−s λ2

2−sn M

2(1−s)2−s

n =ε2n

27σ2n

.

Defineθn = tθn + (1 − t)θ∗n,

wheret := (1 + ‖fθn

− fθ∗n‖n/εn + I(fθn

− fθ∗n)/Mn)−1.

We know that by convexity, and since θn minimizes the penalized empirical risk,we have

Pnγfθn+ λ

22−sn I

2(1−s)2−s (θn)

≤ t

(Pnγfθn

+ λ2

2−sn I

2(1−s)2−s (θn)

)+ (1 − t)

(Pnγfθ∗

n+ λ

22−sn I

2(1−s)2−s (θ∗n)

)

≤ Pnγfθ∗n

+ λ2

2−sn I

2(1−s)2−s (θ∗n).

This can be rewritten as

P (γfθn− γfθn∗) + λ

22−sn I

2(1−s)2−s (θn) ≤ −(Pn − P )(γfθn

− γfθ∗n) + λ

22−sn I

2(1−s)2−s (θ∗n).

Since I(fθn− fθ∗

n) ≤ Mn, and ‖ψk‖∞ ≤ 1 (by Assumption A), we have that

‖fθn− fθ∗

n‖∞ ≤ Mn. Hence, by Assumption M,

P (γfθn− γfθn∗) ≥ ‖fθn

− fθ∗n‖2

n/σ2n.

We thus obtain

‖fθn− fθ∗

n‖2

n

σ2n

+ λ2

2−sn I

2(1−s)2−s (θn − θ∗n)

≤‖fθn

− fθ∗n‖2

n

σ2n

+ λ2

2−sn I

2(1−s)2−s (θn) + λ

22−sn I

2(1−s)2−s (θ∗n)

≤ −(Pn − P )(γfθn− γfθ∗

n) + 2λ

22−sn I

2(1−s)2−s (θ∗n).

Now, ‖fθn− fθ∗

n‖n ≤ εn and I(θn − θ∗n) ≤ Mn. Moreover εn/Mn ≤ 1 and in view

of (7), εn/Mn ≥ 8/m. Therefore, we have by (6) and Theorem 4.1, with probabilityat least

1 − exp[− nε2n

2 × (27σ2n)2

],

that

‖fθn− fθ∗

n‖2

n

σ2n

+ λ2

2−sn I

2(1−s)2−s (θn − θ∗n)

≤ λn,0εsnM1−s

n + 2λ2

2−sn I

2(1−s)2−s (θ∗n) +

ε2n27σ2

n

≤ λn,0εsnM1−s

n + (27)s

2−s c−2

2−s λ2

2−sn M

2(1−s)2−s

n +ε2n

27σ2n

=1

9σ2n

ε2n,

Page 146: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

134 S. A. van de Geer

where in the last step, we invoked (15) and (16).It follows that

‖fθn− fθ∗

n‖n ≤ εn

3,

and also that

I2(1−s)2−s (θn − θ∗n) ≤ ε2n

9σ2n

λ− 2

2−sn ≤

(Mn

3

) 2(1−s)2−s

,

since c ≥ 3.To conclude the proof, apply Lemma 4.2.

References

[1] Alexander, K. S. (1985). Rates of growth for weighted empirical processes.Proc. Berkeley Conf. in Honor of Jerzy Neyman and Jack Kiefer 2 475–493.University of California Press, Berkeley.

[2] Ball, K. and Pajor, A. (1990). The entropy of convex bodies with “few”extreme points. Geometry of Banach Spaces (Strobl., 1989) 25–32. LondonMath. Soc. Lecture Note Ser. 158. Cambridge Univ. Press.

[3] Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of con-vergence of regularized boosting classifiers. J. Machine L. Research 4 861–894.

[4] Bousquet, O. (2002). A Bennett concentration inequality and its applicationto suprema of empirical processes. C. R. Acad. Sci. Paris 334 495–500.

[5] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces:Isoperimetry and Processes. Springer, New York.

[6] Massart, P. (2000). About the constants in Talagrand’s concentration in-equalities for empirical processes. Ann. Probab. 28 863–884.

[7] Massart, P. (2000). Some applications of concentration inequalities to sta-tistics. Ann. Fac. Sci. Toulouse 9 245–303.

[8] Pollard, D. (1990). Empirical Processes: Theory and Applications. IMS,Hayward, CA.

[9] Talagrand, M. (1995). Concentration of measure and isoperimetric inequal-ities in product spaces. Publ. Math. de l’I.H.E.S. 81 73–205.

[10] van de Geer, S. (2000). Empirical Processes in M-Estimation. CambridgeUniv. Press.

[11] van de Geer, S. (2002). M-estimation using penalties or sieves. J. Statist.Planning Inf. 108 55–69.

[12] van de Geer, S. (2006). High-dimensional generalized linear models and theLasso. Research Report 133, Seminar fur Statistik, ETH Zurich Ann. Statist.To appear.

[13] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergenceand Empirical Processes. Springer, New York.

Page 147: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

arX

iv:m

ath/

0610

115v

2 [

mat

h.ST

] 6

Sep

200

7

IMS Lecture Notes–Monograph Series

Asymptotics: Particles, Processes and Inverse Problems

Vol. 55 (2007) 135–148c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000328

Better Bell inequalities

(passion at a distance)

Richard D. Gill 1,∗,†

Mathematical Institute, Leiden University and EURANDOM, NWO

Abstract: I explain so-called quantum nonlocality experiments and discusshow to optimize them. Statistical tools from missing data maximum likelihoodare crucial. New results are given on CGLMP, CH and ladder inequalities.Open problems are also discussed.

1. The name of the game

QM vs. LR. Bell’s [5] theorem states that quantum physics (aka quantum me-chanics, QM) is incompatible with classical physics. His proof exhibits a pattern ofcorrelations, predicted in a certain situation by quantum physics, which is forbiddenby any physical theory having a certain basic (and formerly uncontroversial) prop-erty called local realism (LR). Under LR, correlations must satisfy a Bell inequality,which however under QM can be violated.

Local realism = locality + realism, is closely connected to causality; a precisemathematical formulation will follow later. As we will see then, a further basic (andalso uncontroversial) assumption called freedom needs to be made as well.

For the time being I offer the following explanatory remarks. Let us agree thatthe task of physics is to provide a causal explanation (or if you prefer, description)of reality. Events have causes (realism); cause and effect are constrained by timeand space (locality). Realism has been taken for granted in physics since Aristotle;together with locality it has been a permanent feature and criterion of basic sanitytill Einstein and others began to uncover disquieting features of quantum physics,see Einstein, Podolsky and Rosen [11], referred to hereafter as EPR.

For some, John Bell’s theorem is a reason to argue that quantum physics mustdramatically break down at some (laboratory accessible) level. For Bohr it wouldmerely have confirmed the Copenhagen view that there is no underlying classicalreality behind quantum physics, no Aristotelian/Cartesian/rationalist explanationof the random outcomes of quantum measurements. For others, it is a powerfulincentive to deliver experimental proof that Nature herself violates local realism.

∗This paper is dedicated to my friend Piet Groeneboom on the occasion of his 65th birthday.I started the research during my previous affiliation at the Mathematical Institute, Utrecht Uni-versity. I acknowledge financial support from the European Community project RESQ, contractIST-2001-37559. The paper is based on work in progress joint with Toni Acin, Marco Barbieri,Wim van Dam, Nicolas Gisin, Peter Grunwald, Jan-Ake Larsson, Philipp Pluch, Stefan Zohren,and Marek Zukowski. Last but not least, Piet’s programming assistance was vital. Lang zal hij

leven, in de gloria!†NWO is the Dutch national Science Foundation.1Mathematical Institute, Snellius Bldg, University of Leiden, Niels Bohrweg 1, 2333 CA Leiden,

Netherlands, e-mail: [email protected] ; url: http://www.math.leidenuniv.nl/∼gill

AMS 2000 subject classifications: Primary 60G42, 62M07; secondary 81P68.Keywords and phrases: latent variables, missing data, quantum non-classicality, so-called quan-

tum non-locality.

135

Page 148: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

136 Richard D. Gill

By communis opinio, the splendid experiment of Aspect, Dalibard, and Grangier[3] settled the matter in favour of quantum physics. However, insiders have longknown that that experiment has major shortcomings which imply that the matteris not settled at all. Twenty-five years later these shortcomings have still not beenovercome, despite a continuing and intense effort and much progress; see Gill [14,15], Santos [25]. I can report that certain experimenters think that a definitivesuccessful experiment might well be achieved within ten years. A competition seemsto be on to do it first. We will see.

Bell-type experiments. We are going to study the sets of all possible jointprobability distributions of the outcomes of a Bell-type experiment, under two setsof assumptions, corresponding respectively to local realism and to quantum physics.Bell’s theorem can be reformulated as saying that the set of LR probability laws isstrictly contained in the QM set. But what is a Bell-type experiment?

That is not so difficult to explain. Here is a description of a p × q × r Bellexperiment, where p, q and r are fixed integers all at least equal to 2. The experimentinvolves a diabolical source, Lucifer, and a number p of players or parties, usuallycalled Alice, Bob, and so on. Lucifer sends a package to Alice and each of herfriends by FedEx. After the packges have been handed over by Lucifer to FedEx,but before each party’s package is delivered at his or her laboratory, each of theparties commits him or herself to using one particular tool or measurement-deviceout of some fixed set of toolboxes with which to open their packages. Suppose eachparty can choose one out of q tools; each party’s tools are labelled from 1 to q.There is no connection between different party’s tools (and it is just for simplicitythat we suppose each party has the same number). The q tools of each party areconventionally called measurements or settings.

When the packages arrive, each of the parties opens their own package with themeasurement setting that they have chosen. What happens precisely now is left tothe reader’s imagination; but we suppose that the possible outcomes for each ofthe parties can all be classified into one of r different outcome categories, labelledfrom 0 to r−1. Again, there is not necessarily any connection between the outcomecategory labelled x of different measurements for the same or different parties.

Given that Alice chose setting a, Bob b, and so on, there is some joint probabilityp(x, y, . . . |a, b, . . . ) that Alice will then observe outcome x, Bob y, . . . . We supposethat the parties chose their settings a, b, . . . , at random from some joint distributionwith probabilties π(a, b, . . . ); a, b, . . . = 1, . . . , q. Altogether, one run of the wholeexperiment has outcome (a, b, . . . ;x, y, . . . ) with probability p(a, b, . . . ;x, y, . . . ) =π(a, b, . . . )p(x, y, . . . |a, b, . . . ).

If the different party’s settings are independent, then each party would in prac-tice generate their own setting in their own laboratory according to its marginaldistribution. In general however we need a trusted, independent, referee, who wewill call Piet, who generates the settings of all parties simultaneously and makessure that each one receives their own setting in separate, sealed envelopes.

One can (and should) also consider “unbalanced” experiments with possiblydifferent numbers of measurements per party, different numbers of outcomes perparty’s measurement. Moreover, more complicated multi-stage measurement strate-gies are sometimes considered. We stick here to the basic “balanced” designs, justfor ease of exposition.

The classical polytope. Local realism and freedom can be taken mean thefollowing:

Page 149: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Better Bell inequalities 137

Measurements which were not done also have outcomes; actual and potential mea-surement outcomes are independent of the measurement settings actually used by allthe parties.

The outcomes of measurements which were not actually done are obviously coun-terfactual. I am not claiming the actual existence in physical reality of these out-comes, whatever that might be supposed to mean (see EPR for one possible defi-nition). I am supposing that a mathematical model for the experiment does allowthe existence of such variables.

To argue this point, consider a computer simulation of the Bell experiment inwhich Lucifer’s packages are put together on a classical computer, using randomiza-tion if necessary, while what goes on in each party’s laboratory is also simulated ona computer. The package that is sent to each party can therefore be represented bya random number. What happens in each party’s lab is the result of inputting themessage from Lucifer, and the setting from Piet the referee, into another computerprogram which might also make use of random number generation. There can beany kind of dependence between the random numbers used in Lucifer’s, Alice’s,Bob’s . . . computers. But without loss of generality all this randomization mightas well be done at Lucifer’s computer; Alice’s computer merely evaluates somefunction of the message from Lucifer, and the setting from Piet. We see that theoutcomes are now simultaneously defined of every measurement which each partymight choose, simply by considering all possible arguments to their computers pro-grams. The assumption of freedom is simply that Piet’s settings are independentof Lucifer’s random numbers. Now, given Lucifer’s randomization, everything thathappens is completely deterministic: the outcome of each possible measurement ofeach party is fixed.

For ease of notation, consider briefly a two party experiment. Let X1, . . . , Xq

and Y1, . . . , Yq denote the counterfactual outcomes of each of Alice’s and Bob’spossible q measurements (taking values in {0, . . . , r − 1}. We may think of thesein statistical terms as missing data, in physical terms as so-called hidden vari-ables. Denote by A and B Alice’s and Bob’s random settings, each taking values in{1, . . . , q}. The actual outcomes observed by Alice and Bob are therefore X = XA

and Y = YB . The data coming from one run of the experiment, A,B,X, Y , has jointprobability distribution with mass function p(a, b;x, y) = π(a, b, . . . )p(x, y, |a, b) =π(a, b) Pr(Xa = x, Yb = y).

Now the joint probability distribution of the Xa and Yb can be arbitrary, but inany case it is a mixture of all possible degenerate distributions of these variables.Consequently, for fixed setting distribution π, the joint distribution of A,B,X, Y isalso a mixture of the possible distributions corresponding to degenerate (determin-istic) hidden variables. Since there are only finitely many degenerate distributionswhen p, q and r are all fixed, we see that

Under local realism and freedom, the joint probability laws of the observable data liein a convex polytope, whose vertices correspond to degenerate hidden variables.

We call this polytope the classical polytope.

The quantum body. Introductions to quantum statistics can be found in Gill[13], Barndorff-Nielsen et al. [4]. The bible of quantum information, Nielsen andChuang [22], is a splendid resource and has introductory material for beginnersto the field whether coming from physics, computer science or mathematics. Thebasic rule for computation of a probability distribution in quantum mechanics iscalled Born’s law: take the squared lengths of the projections of the state vector

Page 150: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

138 Richard D. Gill

into a collection of orthogonal subspaces corresponding to the different possibleoutcomes. For ease of notation, consider a two-party experiment. Take two com-plex Hilbert spaces H and K. Take a unit vector |ψ〉 in H ⊗ K. For each a, letLa

x, x = 0, . . . , r − 1, denote orthogonal closed subspaces of H, together spanningall of H. Similarly, let M b

y denote the elements of q collections of decompositionsof K into orthogonal subspaces. Finally, define p(x, y|a, b) = ‖ΠLa

x⊗ ΠMb

y|ψ〉‖2,

where Π denotes orthogonal projection into a closed subspace. The reader shouldverify (basically by Pythagoras’ theorem), that this does define a collection ofjoint probability distributions of X and Y , indexed by (a, b). As before we takep(a, b, . . . ;x, y, . . . ) = π(a, b, . . . )p(x, y, . . . |a, b, . . . ).

The following fact is not trivial:

The collection of all possible quantum probability laws of A, B, X, Y (for fixed settingdistribution π) forms a closed convex body containing the local polytope.

Beyond the 2 × 2 × 2 case very little indeed is known about this convex body.

The no-signalling polytope. The two convex bodies so far defined are forcedto live in a lower dimensional affine subspace, by the basic normalization proper-ties of probability distributions:

∑x,y p(a, b;x, y) = π(a, b) for all a, b. Moreover,

probabilities are necessarily nonnegative, so this restricts us further to some convexpolytope. However, physics (locality) implies another collection of equality con-straints, putting us into a still smaller affine subspace. These constraints are calledthe no-signalling constraints:

∑y p(a, b;x, y) should be independent of b for each

a and x, and vice versa. It is easy to check that both the local realist probabilitylaws, and the quantum probability laws, satisfy no-signalling. Quantum mechanicsis certainly a local theory as far as manifest (as opposed to hidden) variables areconcerned.

The set of probability laws satisfying no-signalling is therefore another convex poly-tope in a low dimensional affine subspace; it contains the quantum body, which inturn contains the classical polytope.

Bell and Tsirelson inequalities. “Interesting” faces of the classical polypope,i.e., faces which do not correspond to the positivity constraints, generate (general-ized) Bell inequalities, that is, linear combinations of the joint probabilities of theobservable variables which reach a maximum value at the face. Similarly, “inter-esting” supporting hyperplanes to the quantum body correspond to (generalized)Tsirelson inequalities. These latter inequalities can be recast as inequalities con-cerning expectation values of certain observables called Bell operators.

The original Bell (more precisely, CHSH – Clauser, Horne, Shimony and Holt [6])and Cirel’son [8] inequalities concern the 2 × 2 × 2 case. However we will proceedby proving Bell’s theorem – the quantum body is strictly larger than the localpolytope – in the 3× 2× 2 case for which a rather elegant proof is available due toGreenberger, Horne and Zeilinger [17].

By the way, the subtitle “passion at a distance” is a phrase coined by AbnerShimony and it expresses that though there is no action at a distance (no manifestnon-locality), still quantum physics seems to allow the physical system at Alice’ssite to have some feeling for what is going on far away at Bob’s. Rather like theoracles of antiquity, no-one can make any sense of what the oracle is saying till it istoo late . . . . But one can use these non-classical correlations, as the physicists like tocall them, to enable Alice and her friends to succeed at certain collaborative tasks,in which Lucifer is their ally while Piet is their adversary, with larger probability

Page 151: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Better Bell inequalities 139

than is possible under any possible classical-like physics. The following exampleshould inspire the reader to imagine such a task.

GHZ paradox. We consider a now famous 3×2×2 example due to Greenberger,Horne and Zeillinger [17]. We use this example partly for fun, partly to exemplifythe computation of Bell probability laws under quantum mechanics and under localrealism.

Firstly, under local realism, one can introduce hidden variables X1, X2, Y1,Y2, Z1, Z2, standing for the counterfactual outcomes of Alice, Bob and Claudia’smeasurements when assigned settings 1 or 2 by Piet. These variables are binary,and we may as well denote their possible outcomes by ±1. Now note that

(X1Y2Z2).(X2Y1Z2).(X2Y2Z1) = (X1Y1Z1).

Thus, if the setting patterns (1, 2, 2), (2, 1, 2) and (2, 2, 1) always result in X , Yand Z with XY Z = +1, it will also be the case the setting pattern (1, 1, 1) alwaysresults in X , Y and Z with XY Z = +1.

Next define the 2 × 2 matrices

σ1 =

(0 11 0

), σ2 =

(1 00 −1

).

One easily checks that σ1σ2 = −σ2σ1, (anticommutation), σ21 = σ2

2 = 1, the 2 × 2identity matrix. Since σ1 and σ2 are both Hermitean, it follows that they have realeigenvalues, which by the properties given above, must be ±1.

Now define matrices X1 = σ1 ⊗ 1 ⊗ 1, X2 = σ2 ⊗ 1 ⊗ 1, Y1 = 1 ⊗ σ1 ⊗ 1,Y2 = 1 ⊗ σ2 ⊗ 1, Z1 = 1 ⊗ 1 ⊗ σ1, Z2 = 1 ⊗ 1 ⊗ σ2. It is now easy to check that

(X1Y2Z2).(X2Y1Z2).(X2Y2Z1) = −(X1Y1Z1),

and that (X1Y2Z2), (X2Y1Z2), (X2Y2Z1) and (X1Y1Z1) commute with one another.Since these four 8× 8 Hermitean matrices commute they can be simultaneously

diagonalized. Some further elementary considerations lead one to conclude the ex-istence of a simultaneous eigenvector |ψ〉 of all four, with eigenvalues +1, +1, +1,−1 respectively. We take this to be the state |ψ〉, with the three Hilbert spaces allequal to C

2. We take the two orthogonal subspaces for the 1 and 2 measurements ofAlice, Bob, and Claudia all to be the two eigenspaces of σ1 and σ2 respectively. Thisgenerates quantum probabilties such that the setting patterns (1, 2, 2), (2, 1, 2) and(2, 2, 1) always result in X , Y and Z with XY Z = +1, while the setting pattern(1, 1, 1) always results in X , Y and Z with XY Z = −1.

Thus we have shown that a vector of quantum probabilities exists, which cannotpossibly occur under local realism. Since the classical polytope is closed, the corre-sponding quantum law must be strictly outside the classical polytope. It thereforeviolates a generalized Bell inequality corresponding to some face of the classicalpolytope, outside of which it must lie. It is left as an exercise to the reader togenerate the corresponding “GHZ inequality.”

GHZ experiment. This brings me to the point of the paper: how should onedesign good Bell experiments; and what is the connection of all this physics withmathematical statistics? Indeed there are many connections – as already alludedto, the hidden variables of a local realist theory are simply the missing data of anonparametric missing data problem.

Page 152: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

140 Richard D. Gill

In the laboratory one creates the state |ψ〉, replacing Lucifer by a source of en-tangled photons, and the measurement devices of Alice and Bob by assemblagesof polarization filters, beam splitters and photodetectors implementing hereby themeasurements corresponding to the subspaces Lx

a, etc. One also settles on a jointsetting probability π. One repeats the experiment many times, hoping to indeedobserve a quantum probability law lying outside the classical polytope, i.e., vio-lating a Bell inequality. The famous Aspect et al. [3] experiment implemented thisprogram in the 2 × 2 × 2 case, violating the so-called CHSH inequality (which wewill describe later) by a large number of standard deviations. What is being donehere is statistical hypothesis testing, where the null hypotheses is local realism,the alternative is quantum mechanics; the alternative being true by design of theexperimenter and validity of quantum mechanics.

Dirk Bouwmeester recently carried out the GHZ experiment; the results are ex-citing enough to be published in Nature (Pan et al. [23]). He claimed in a newspaperinterview that this experiment is of a rather special type: only a finite number ofrepetitions are necessary since the experiment exhibits events which are impossi-ble under classical physics, but certain under quantum mechanics. However pleasenote that the events which are certain or impossible, are only certain or impossibleconditional on some other events being certain. Since the experiment is not perfect,Bouwmeester did observe some “wrong” outcome patterns, thereby destroying byhis own logic the conclusion of his experiment. Fortunately his data does statis-tically significantly violate the accompanying GHZ inequality and publication inNature was justified! The point is: all these experiments are statistical in nature;they do not prove for sure that local realism is false; they only give statistical ev-idence for this proposition; evidence which does become overwhelming if N , thenumber of repetitions, is large enough.

How to compare different experiments. Because of the dramatic zero-onenature of the GHZ experiment, it is felt by many physicists to be much strongeror better than experiments of the original 2 × 2 × 2 CHSH type (still to be eluci-dated!) The original aim of the research described here was to supply objective andquantitative evaluation of such claims.

Now the geometric picture above naturally leads one to prefer an experimentwhere the distance from the quantum physical reality is as far as possible fromthe nearest local realistic or classical description. Much research has been done byphysicists focussing on the corresponding Euclidean distance. However, it is not soclear what this distance means operationally, and whether it is comparable overexperiments of different types. Moreover the Euclidean distance is altered by tak-ing different setting distributions π (though physicists usually only consider theuniform distribution). It is true that Euclidean distance is closely related to noiseresistance, a kind of robustness to experimental imperfection. As one mixes thequantum probability distribution more and more with completely random, uniformoutomes, corresponding to pure noise in the photodetectors, the quantum probabil-ity distribution shrinks towards the center of the classical polytope, at some pointpassing through one of its faces. The amount of noise which can be allowed whilestill admitting violation of local realism is directly related to Euclidean distance, inour picture.

Van Dam, Gill and Grunwald [10] however propose to use relative entropy, D(q :p) =

∑abxy q(abxy) log2(q(abxy)/p(abxy)), where q now stands for the “true” prob-

ability distribution under some quantum description of reality, and p stands for alocal realist probability distribution. Their program is to evaluate supq infp D(q : p)

Page 153: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Better Bell inequalities 141

where the supremum is taken over parameters at the disposal of the experimenter(the quantum state |ψ〉, the measurement projectors, the setting distribution π;while the infimum is taken over probability distributions of outcomes given settingsallowed by local realism (thus q and p in supremum and infimum actually standfor something different from the probability laws q and p lying in the quantumbody and classical polytope respectively; hopefully this abuse of notation may beexcused.

They argue that this relative entropy gives direct information about the numberof trials of the experiment required to give a desired level of confidence in theconclusion of the experiment. Two experiements which differ by a factor 2 are suchthat the one with the smaller divergence needs to be repeated twice as often as theother in order to give an equally convincing rejection of local realism.

Moreover, optimizing over different sets of quantum parameters leads to variousmeasures of “strength of non-locality.” For instance, one can ask what is the bestexperiment based on a given entangled state |ψ〉? Experiments of different formatcan be compared with one another, possibly discounting the relative entropies ac-cording to the numbers of quantum systems involved in the different experiments inthe obvious way (typically, a p party experiment involves generation of p particlesat a time, so a four party experiment should be downweighted by a factor 2 whencomparing with a two party experiment). We will give some examples later.

Finally, that paper showed how the interior infimum is basically the computationof a nonparametric maximum likelihood estimator in a missing data problem. Var-ious algorithms from statistics can be succesfully applied here, in numerical ratherthan analytical experimentation; and progams developed by Piet Groeneboom (seeGroeneboom et al. [18]) played a vital role in obtaining the results which we arenow going to display.

2. CHSH and CGLMP

The 2×2×2 case is particularly simple and well researched. In a later section, I wantto compare the corresponding two particle CHSH experiment with the three particleGHZ. In another section I will discuss properties of 2 × 2 × d experiments, whichform a natural generalization of CHSH and have received much attention both bytheorists and experimenters in recent years. We will see that many open problemsexist here and some remarkable conjectures can be posed. Preparatory to that, Iwill therefore now describe the so-called CGLPM inequality, the generalization from2 × 2 × 2 to 2 × 2 × d of CHSH.

For the 2×2×d case an important step was made by Collins, Gisin, Linden, Mas-sar and Popescu [9], in the discovery of a generalized Bell inequality (i.e., interestingface of the classical polytope), together with a quantum state and measurementswhich violated the inequality. The original specification of the inequality is rathercomplex, and its derivation also took two closely printed pages. Here I offer a newand extremely short derivation of an equivalent inequality, found very recently byStefan Zohren, which further simplifyies an already very simple version of my own.Proof of equivalence with the original CGLMP is tedious!

Recall that a Bell inequality is the face of a classical polytope of the form∑abxy cabxyp(abxy) ≤ C. Now since we are only concerned with probability dis-

tributions within the no-signalling polytope, the probabilities p(abxy) necessarilysatisfy a large number of equality constraints (normalization, no-signalling), whichallows one to rewrite the Bell inequality in many different forms; sometimes re-markably different. A canonical form can be obtained by removing, by appropriate

Page 154: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

142 Richard D. Gill

substitutions, all p(abxy) with x and y equal to one particular value from the setof possible outcomes, e.g., outcome 0, and involving also the marginals p(ax) andp(by) with x and y non zero. This is not necessarily the “nicest” form of an inequal-ity. However, in the canonical form the constant C does disappear (becomes equalto 0).

To return to CGLMP: consider four random variables X1, X2, Y1, Y2. Note thatX1 < Y2 and Y2 < X2 and X2 < Y1 implies X1 < Y1. Consequently, X < 1 ≥ Y1

implies X1 ≥ Y2 or Y2 ≥ X2 or X2 ≥ Y1, and this gives us

Pr(X1 ≥ Y1) ≤ Pr(X1 ≥ Y2) + Pr(Y2 ≥ X2) + Pr(X2 ≥ Y1).

This is a CGLMP inequality, when we further demand that all four variables takevalues in {0, . . . , d− 1}. The case d = 2 gives the CHSH inequality (though also inan unfamiliar form).

CGLMP describe a state and quantum measurements which generate probabil-ities, which violate this inequality. Take Alice and Bob’s Hilbert space each to bed-dimensional. Consider the states |ψ〉 =

∑d−1x=0 |xx〉/

√d, where |xx〉 = |x〉 ⊗ |x〉,

and |x〉 for x = 0, . . . , d − 1 is an orthonormal basis of Cd. Alice and Bob’s set-tings 1, 2 are taken to correspond to angles α1 = 0, α2 = π/4, and β1 = π/8,β2 = −π/8. When Alice or Bob receives setting a or b, each applies the diagonalunitary operation with diagonal elements exp(ixθ/d), x = 0, . . . , d−1, to their partof the quantum system, where θ stands for their own angle (setting). Next Aliceapplies the quantum Fourier transform Q to her part, and Bob its inverse (and ad-joint) Q∗; Qxy = exp(ixy/d), Q∗

xy = exp(−ixy/d). Finally Alice and Bob “measurein the computational basis”, i.e., projecting onto the one-dimensional subspacescorresponding to the bases |x〉, |y〉. Applying a unitary U and then measuringthe projector ΠM is of course the same as measuring the projector ΠU∗M ; with aview to implementation in the laboratory it is very convenient to see the differentmeasurements as actually “the same measurement” applied after different unitarytransformations of each party’s state have been applied. In quantum optics theseoperations might correspond to use of various crystals, applying an electomagneticfield across a light pulse, and so on.

That these choices gives a violation of a CGLMP inequality follows from somecomputation and we desperately need to understand what is going on here, as willbecome more obvious in a later section when I describe conjectures concerningCGLMP and these measurements.

3. Comparing some classical experiments: GHZ vs CHSH

First of all, let me briefly report some results from van Dam et al. [10] concerning thecomparison of CHSH and GHZ. It is conjectured, and supported numerically, butnot yet proved, that the best 2× 2× 2 experiment in the sense of Kullback-Leiblerdivergence is the CGLMP experiment with d = 2 described in the last section,and usually known as the CHSH experiment. The setting probabilities should beuniform, the state is maximally entangled, the measurements are those implementedby Aspect et al. It turns out that D is equal to 0.0423.... For GHZ, which is can beconjectured to be the best 3 × 2 × 2 experiment, one finds D = 0.400, with settingprobabilities uniform over the four setting patterns involved in the derivation of theparadox; zero on the other. So this experiment is apparently almost 10 times better.By the way, D = 1 would be the strength of the experiment when one repeatedlythrows a coin which always comes up heads, in order to disprove the theory that

Page 155: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Better Bell inequalities 143

Pr(heads) = 1/2. So GHZ is less than half as good as an experiment in which onecompares probabilities 1 and 1/2; let alone comparable to an experiment comparingimpossible with certain outcomes!

However, in practice the GHZ experiment is not performed exactly in this optimal fashion. To begin with, in order to produce each triple of photons, Bouwmeester generated two maximally entangled pairs of photons, measured the polarization of one of the four, and accepted the remaining set of three when the measured polarization was favourable, which happens half of the time. Since we need two pairs of photons for each triple, and discard the result half of the time, the figure of merit should be divided by four. Next, the optimal setting distribution is uniform over half of the eight possible combinations. In practice one generates settings at random at each measurement station, so that half of the combinations are actually useless. This means we have to halve again, resulting in a figure of merit for GHZ which is barely better than CHSH, and very far from the "infinity" which would correspond to an all-or-nothing experiment.

Actually things are even worse, since the pairs of photon pairs are generated at random times, and one has to be quite lucky to have two pairs generated close enough in time to one another that one has four photons to start with. Then there are the inevitable losses which further degrade the experiment . . . (more on this later). Bouwmeester needs to carry on measuring for hours in order to achieve what can be done with CHSH in minutes. Which is not to say that his experiment is not a splendid achievement!

4. CGLMP as # outcomes goes to infinity

In Acin, Gill and Gisin [2] a start is made on studying optimal 2 × 2 × r experiments, and some remarkable findings were made, though almost all conclusions depend on numerics, and even on numerics depending on conjectures.

Let me first describe one rather fundamental conjecture whose truth would take us a long way in understanding what is going on.

In general nothing is known about the geometry of the classical polytope. An impossible open problem is to somehow classify all interesting faces. It is not even known if, in general, all faces which are not trivial (i.e., correspond to nonnegativity constraints) are "interesting" in the sense of being violable by quantum mechanics. As the numbers grow, the number and type of faces grow explosively, and exhaustive enumeration has only been done for very small numbers.

Clearly there are very many symmetries: the labelling of parties, measurements and outcomes is completely arbitrary. Moreover, there are three ways in which inequalities for smaller experiments remain inequalities for larger ones. Firstly, by merging categories in the larger experiment one obtains a smaller one, and the Bell inequalities for the smaller can be lifted to the larger. Next, by simply omitting measurements one can lift Bell inequalities for smaller experiments to larger ones. Finally, by conditioning on a particular outcome of a particular measurement of a particular party, one reduces a larger experiment to one with fewer parties, and conversely one can lift a smaller inequality to a larger one.

With the understanding that interesting faces for smaller polytopes can be lifted to interesting faces of larger ones in three different ways, the following conjecture seems highly plausible:

All the faces of the 2 × 2 × r polytope are boring (nonnegativity) or interesting CGLMP, or lifted CGLMP, inequalities.


This is certainly true for r = 2, 3, 4 and 5, but beyond this there is only numerical evidence: numerical search for optimal experiments using the maximally entangled state |ψ⟩ has only uncovered the CGLMP measurements, violating the CGLMP inequality. Moreover, this is true using both the Euclidean and the relative entropy distances.

The next, stunning, finding is that the best state for these experiments is not the maximally entangled state at all! Rather, it is a state of the form Σx cx|xx⟩, where the so-called Schmidt coefficients cx are symmetric around x = (r − 1)/2, first decreasing and then increasing. This "U-shape" becomes more and more pronounced as r increases. Moreover, the shape is found for both figures of merit, though it is a different state for the two cases (even less entangled for divergence than for Euclidean distance, i.e., less entangled for statistical strength than for noise resistance). Rather thorough numerical search takes us up to about r = 20 and has been replicated by various researchers.

Taking as conjectures (a) that all faces are CGLMP, and (b) that the best measurements are also CGLMP and the state is U-shaped, we only need to optimize over the Schmidt coefficients cx. Numerically one can quite easily get up to about r = 1000 in this way; with some tricks one can go to r = 10 000 or even 100 000. Note that we are solving sup_q inf_p D(q : p), where the infimum is over the local realist polytope and the supremum is just over the cx. Now a solution (q̂, p̂) must be a stationary point for both optimizations. Differentiating with respect to the classical parameters, and recalling the form of D, one finds that one must have Σ_{abxy} (q̂abxy/p̂abxy)(pabxy − p̂abxy) = 0 for classical probabilities p on the face of the classical polytope passing through the solution p̂. But this face is a CGLMP inequality! Hence the coefficients q̂abxy/p̂abxy are the coefficients involved in this inequality, i.e., up to some normalization constants they are already known! However, the quantity we want to optimize, D itself, is Σ_{abxy} qabxy log2(qabxy/p̂abxy), and this is optimal over q at q = q̂ (i.e., this gives the accompanying Tsirelson inequality, or supporting hyperplane to the quantum body at the optimum). Since the terms in the logarithm are known (up to a normalization constant), we just have to optimize the mean of an almost known Bell operator over the state. This is a largest eigenvalue problem, numerically easy up to very, very large d.
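This last step is easy to set up numerically. The sketch below assumes the "almost known" Bell operator M (a d² × d² Hermitian matrix assembled from the inequality coefficients and the measurement projectors) is already given, which is the situation described above; restricting to states Σx cx|xx⟩ then leaves a d × d eigenvalue problem (Python, names mine):

    import numpy as np

    def best_schmidt_state(M):
        # Maximize <psi|M|psi> over psi = sum_x c_x |xx>, ||c|| = 1.
        # In the product basis |x>|y>, the vector |xx> sits at index x*d + x,
        # so the restricted operator is the d x d matrix B[x, y] = <xx|M|yy>.
        d = int(round(np.sqrt(M.shape[0])))
        idx = [x * d + x for x in range(d)]
        B = M[np.ix_(idx, idx)]
        vals, vecs = np.linalg.eigh(B)
        return vals[-1], vecs[:, -1]   # optimal value and Schmidt coefficients c_x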

All this raises the question of what happens as r → ∞. In particular, can one attain the largest conceivable violation of CGLMP, namely when the probability on the left is 1 and the three on the right are all 0, with infinite dimensional Hilbert spaces; and if so, are the corresponding state and measurements interesting and feasible experimentally? Strongly positive evidence and further conjectures are given in Zohren and Gill [27]. Some recent numerical results on r = 3 and 4 are given by Navascues et al. [21].

We think of this conjectured "perfect passion at a distance" as the optimal solution of a variant of the infamous game of Polish Poker (played in Russian bars between a Polish traveller and local Russian drinkers, with the inevitable outcome that the Pole always gets the Roubles...). Now, Alice and Bob are playing together, against Piet. Piet chooses (completely randomly) a "setting" a = 1, 2 for Alice, and b = 1, 2 for Bob. Alice doesn't know Bob's setting and vice versa. Alice and Bob must now, separately, each think of a number. Denote Alice's number by xa, Bob's by yb. Alice and Bob's aim is to attain x1 < y2 (if Piet calls "1; 2"), and y2 < x2 (if Piet calls "2; 2"), and x2 < y1 (if Piet calls "2; 1"), and y1 < x1 (if Piet calls "1; 1"). If they choose their numbers by any classical means, e.g., with classical dice, they must fail at least a quarter of the time. However, with quantum dice (i.e., with the help of a couple of bundles of photons, donated to each of them in advance by Lucifer) they can succeed with probability arbitrarily close to certainty, by taking measurements with enough outcomes. At least, according to Zohren and Gill's conjecture...

There remains the question: why are the CGLMP measurements optimal for the CGLMP inequality? Where do these angles come from, and what does this have to do with the QFT? There are some ideas about this, and the problem seems ripe to be cracked.

5. Ladder proofs

Is the CHSH experiment the best possible experiment with two maximally entangled qubits? This seemed a very good conjecture till quite recently. However, the conjecture certainly needs modification now, as I will explain.

There has been some interest recently in so-called ladder proofs of Bell's theorem. These appear to allow one to use less entangled states and get better experiments, though that dream is shown to be fallacious when one uses statistical strength as a figure of merit, rather than a criterion connected to "probability zero under LR, but positive under QM" (conditional on certain other probabilities being equal to zero). Exactly as for GHZ, the size of this positive probability is not very important: the experiment is about violating an inequality, not about showing that some probability is positive when it should be zero.

Let me explain the ladder idea. Consider the inequality

Pr(X1 ≥ Y1) ≤ Pr(X1 ≥ Y2) + Pr(Y2 ≥ X2) + Pr(X2 ≥ Y1).

Now add to this the same inequality for another pair of hidden variables:

Pr(X2 ≥ Y2) ≤ Pr(X2 ≥ Y3) + Pr(Y3 ≥ X3) + Pr(X3 ≥ Y2).

The intermediate "horizontal" 2–2 term cancels and we are left only with cross terms 1–2 and 2–3, and "end" terms 1–1 and 3–3. With a ladder built from adding four inequalities involving X1 to X5 and Y1 to Y5, out of the 25 possible comparisons only the two end horizontal terms and eight crossing terms survive, 10 terms out of the total.

Numerical optimization of D for longer and longer ladders shows that the optimal state is actually always the maximally entangled state. Moreover, much to my surprise, the best D is obtained with the ladder of X1 to X5 and Y1 to Y5, and it is much better than the original CHSH! However, it has a uniform distribution over 10 out of the 25 combinations. If one were to implement the same experiment with the uniform distribution over all 25, it would become worse than CHSH. So the new conjecture is that CHSH is the optimal 2 × 2 × 2 experiment with uncorrelated settings.

These findings come from new unpublished work with Marco Barbieri; we are thinking of actually doing this experiment.

6. CH for Bell

In a CHSH experiment an annoying feature is that some photons are not registered at all. This means that there are really three outcomes of each measurement, the third outcome being "no photon"; however, the outcome "no photon, no photon" is not observed at all. One has a random sample size from the conditional distribution given that there is an event in at least one of the two laboratories of Alice and Bob.

It is better to realise that the original, complete sample size is actually also random, and typically Poisson, hence the observed counts of the various events are all Poisson. But can we create useful Bell inequalities for this situation? The answer is yes, using the possibility of reparametrizing inequalities via the equality constraints. In a 2 × 2 × 3 experiment one can rewrite any Bell inequality as an inequality involving only the pabxy with one of x or y not zero, as well as the marginal probabilities pax, pby with x and y nonzero. The constant term in the inequality becomes 0. So one gets a linear inequality involving only observed, Poisson distributed, random variables. "Poisson statistics" allows one to supply a valid standard error even though the "total sample size" was unknown.
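The practical point is elementary: a linear combination S = Σ ci Ni of independent Poisson counts has Var(S) = Σ ci² E[Ni], so a plug-in standard error needs no knowledge of any total sample size. A minimal sketch (the coefficients and counts below are placeholders, not data from any experiment):

    import numpy as np

    def bell_statistic_se(coefs, counts):
        # S = sum_i c_i N_i with independent Poisson N_i: Var(S) = sum_i c_i^2 E[N_i],
        # estimated by plugging in the observed counts.
        coefs, counts = np.asarray(coefs, float), np.asarray(counts, float)
        S = np.dot(coefs, counts)
        se = np.sqrt(np.dot(coefs ** 2, counts))
        return S, se

    S, se = bell_statistic_se([1, -1, -1, -1], [480, 120, 130, 110])
    print(S, se, S / se)  # a positive S many SEs above 0 would witness violation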

Applying this technique in the 2 × 2 × 2 case gives a known inequality, the Clauser-Horne (CH) inequality, useful when one has binary outcomes but one of the two outcomes is not observable at all; i.e., the outcomes are "detector click" and "no detector click."

How to find a good inequality for 2 × 2 × 3? I simply add a certain probability of "no event", independently on both sides of the experiment, to the quantum probabilities belonging to the classical CHSH set-up. Next I solve the problem inf_p D(q : p) using Piet Groeneboom's programs. I observe the values of q/p which define the face of the local polytope closest to q. I rewrite the inequality in its classical form. The result is a new inequality (not quite new: Stefano Pironio informs me it is known to N. Gisin and others) which takes account of "no event" and which is linear in the observed counts.

The linearity means that the inequality can be studied using martingale techniques, to show that the experiment is "insured" against time dependence and time trends, as long as the settings are chosen randomly; cf. Gill [14, 15]. It turns out to be essentially equivalent to some rather non-linear inequalities developed by Jan-Åke Larsson, see Larsson and Gill [20], which were till now the only known way to deal with "non-events." We intend to pursue this development in the near future, combining treatment of the detection, coincidence and memory loopholes (Gill [16] and Larsson and Gill [20]).

7. Conclusions

I did not yet mention that studying the boundary of the 2 × 2 × 2 quantum body, and some different generalizations of it, led Tsirelson into some deep mathematics and connections with fundamental questions involving Grothendieck's mysterious constant; see Cirel'son [8], Tsirelson [26] (the same person . . . ), Reeds [24], and Fishburn and Reeds [12].

Bell experiments offer a rich field involving many statistical ideas, beautiful mathematics, and deep, exciting challenges. Moreover, it is a hot topic in quantum information and quantum optics. Much remains to be done.

One is left wondering why nature is like this. There are two ways nature uses to generate probabilities. One is to take a line segment of length one and cut it in two: the different experiments found by cutting it at different places are compatible with one another; one sample space will do (the unit interval). The other way is to take a line segment of length one and let it be the hypotenuse of a right-angled triangle: now the squares of the other two sides are probabilities adding to one, and the different experiments are not compatible with one another (at least, in dimension three or more, according to the Kochen–Specker theorem).

According to quantum mechanics and Bell's theorem, the world is completely different from how it has been thought of for two thousand years of Western science. As Vovk and Shafer recently argued, Kolmogorov was one of the first to take the radical step of associating the little omega of a probability space with the outcome and not the hidden cause. Before then, all probability in physics could be traced back to uncertainty in initial conditions. Going back far enough, one could invoke symmetry to reduce the situation to "equally likely elementary outcomes." Or, more subtly, sufficient chaoticity ensures that mixed-up distributions are invariant under symmetries and hence uniform. At this stage, frequentists and Bayesians use the same probabilities and get the same answers, even if they interpret their probabilities differently.

According to Bell's theorem, the randomness of quantum mechanics is truly ontological and not epistemological: it cannot be traced back to ignorance but is "for real." It is curious that the quantum physics community is currently falling under the spell of Bayesian ideas, even though their science should be telling them that the probabilities are objective. Of course, one can mix subjective uncertainties with objective quantum probabilities, but to my mind this is dissolving the baby in the bathwater, not an attractive thing to do.

Still, why is nature like this, why are the probabilities what they are? My rough feeling is as follows. Reality is discrete. Hence nature cannot be continuous. However, we do observe symmetries under continuous groups (rotations, shifts); the only way to accommodate this is to make nature random, and to have the probability distributions continuous, or even covariant, under the groups. Current research in the foundations of quantum mechanics (e.g., by Inge Helland) points to the conclusion that symmetry forces the shape of the probabilities (and even forces the complex Hilbert space); just as in the Aristotelian case, but at a much deeper level, probabilities are objectively fixed by symmetries.

References

[1] Acin, A., Gisin, N. and Toner, B. (2006). Grothendieck's constant and local models for noisy entangled quantum states. Phys. Rev. A 73 062105 (5 pp.). arxiv:quant-ph/0606138. MR2244753

[2] Acin, A., Gill, R. D. and Gisin, N. (2005). Optimal Bell tests do not require maximally entangled states. Phys. Rev. Lett. 95 210402 (4 pp.). arxiv:quant-ph/0506225.

[3] Aspect, A., Dalibard, J. and Roger, G. (1982). Experimental test of Bell's inequalities using time-varying analysers. Phys. Rev. Lett. 49 1804–1807. MR0687359

[4] Barndorff-Nielsen, O. E., Gill, R. D. and Jupp, P. E. (2003). On quantum statistical inference (with discussion). J. R. Statist. Soc. B 65 775–816. arxiv:quant-ph/0307191. MR2017871

[5] Bell, J. S. (1964). On the Einstein Podolsky Rosen paradox. Physics 1 195–200.

[6] Clauser, J. F., Horne, M. A., Shimony, A. and Holt, R. A. (1969). Proposed experiment to test local hidden-variable theories. Phys. Rev. Lett. 23 880–884.

[7] Clauser, J. F. and Horne, M. A. (1974). Experimental consequences of objective local theories. Phys. Rev. D 10 526–535.

[8] Cirel'son, B. S. (1980). Quantum generalizations of Bell's inequality. Lett. Math. Phys. 4 93–100. MR0577178

[9] Collins, D., Gisin, N., Linden, N., Massar, S. and Popescu, S. (2002). Bell inequalities for arbitrarily high dimensional systems. Phys. Rev. Lett. 88 040404 (4 pp.). arxiv:quant-ph/0106024. MR1884489


[10] van Dam, W., Gill, R. D. and Grunwald, P. D. (2005). The statistical strength of nonlocality proofs. IEEE Trans. Inf. Theory 51 2812–2835. arxiv:quant-ph/0307125. MR2236249

[11] Einstein, A., Podolsky, B. and Rosen, N. (1935). Can quantum-mechanical description of physical reality be considered complete? Phys. Rev. 47 777–780.

[12] Fishburn, P. C. and Reeds, J. A. (1994). Bell inequalities, Grothendieck's constant, and root two. SIAM J. Discr. Math. 7 48–56. MR1259009

[13] Gill, R. D. (2001). Teleportation into quantum statistics. J. Korean Statist. Soc. 30 291–325. arxiv:math.ST/0405572. MR1892211

[14] Gill, R. D. (2003a). Time, finite statistics, and Bell's fifth position. In Foundations of Probability and Physics 2 (Vaxjo, 2002). Math. Model. Phys. Eng. Cogn. Sci. 5 179–206. Vaxjo Univ. Press, Vaxjo. arxiv:quant-ph/0301059. MR2039718

[15] Gill, R. D. (2003b). Accardi contra Bell (cum mundi): The impossible coupling. In Mathematical Statistics and Applications: Festschrift for Constance van Eeden (M. Moore, S. Froda and C. Leger, eds.). IMS Lecture Notes–Monographs 42 133–154. Institute of Mathematical Statistics, Beachwood, OH. arxiv:quant-ph/0110137. MR2138290

[16] Gill, R. D. (2005). The chaotic chameleon. In Quantum Probability and Infinite Dimensional Analysis: From Foundations to Applications (M. Schurmann and U. Franz, eds.). QP–PQ: Quantum Probability and White Noise Analysis 18 269–276. World Scientific, Singapore. arxiv:quant-ph/0307217. MR2212455

[17] Greenberger, D. M., Horne, M. and Zeilinger, A. (1989). Going beyond Bell's theorem. In Bell's Theorem, Quantum Theory, and Conceptions of the Universe (M. Kafatos, ed.) 73–76. Kluwer, Dordrecht.

[18] Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2003). Vertex direction algorithms for computing nonparametric function estimates in mixture models. arxiv:math.ST/0405511.

[19] Hardy, L. (1993). Nonlocality for two particles without inequalities for almost all entangled states. Phys. Rev. Lett. 71 1665–1668. MR1234454

[20] Larsson, J.-Å. and Gill, R. D. (2004). Bell's inequality and the coincidence-time loophole. Europhys. Lett. 67 707–713. arxiv:quant-ph/0312035. MR2172249

[21] Navascues, M., Pironio, S. and Acin, A. (2006). Bounding the set of quantum correlations. arxiv:quant-ph/0607119.

[22] Nielsen, M. A. and Chuang, I. L. (2000). Quantum Computation and Quantum Information. Cambridge University Press, New York. MR1796805

[23] Pan, J. W., Bouwmeester, D., Daniell, M., Weinfurter, H. and Zeilinger, A. (2000). Experimental test of quantum nonlocality in three-photon Greenberger–Horne–Zeilinger entanglement. Nature 403 (6769) 515–519.

[24] Reeds, J. A. (1991). A new lower bound on the real Grothendieck constant. Available at http://www.dtc.umn.edu/~reedsj/bound2.dvi.

[25] Santos, E. (2005). Bell's theorem and the experiments: Increasing empirical support for local realism. Studies in History and Philosophy of Modern Physics 36 544–565. arxiv:quant-ph/0410193. MR2175810

[26] Tsirelson, B. S. (1993). Some results and problems on quantum Bell-type inequalities. Hadronic Journal Supplement 8 329–345. Available at http://www.tau.ac.il/~tsirel/download/hadron.html. MR1254597

[27] Zohren, S. and Gill, R. D. (2006). On the maximal violation of the CGLMP inequality for infinite dimensional states. Phys. Rev. Lett. To appear. arxiv:quant-ph/0612020.


IMS Lecture Notes–Monograph Series
Asymptotics: Particles, Processes and Inverse Problems
Vol. 55 (2007) 149–166
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000337

Asymptotic oracle properties of SCAD-penalized least squares estimators

Jian Huang1 and Huiliang Xie

University of Iowa

Abstract: We study the asymptotic properties of the SCAD-penalized least squares estimator in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size. We are particularly interested in the use of this estimator for simultaneous variable selection and estimation. We show that under appropriate conditions, the SCAD-penalized least squares estimator is consistent for variable selection and that the estimators of nonzero coefficients have the same asymptotic distribution as they would have if the zero coefficients were known in advance. Simulation studies indicate that this estimator performs well in terms of variable selection and estimation.

1. Introduction

Consider a linear regression model

Y = β0 + X′β + ε,

where β is a p × 1 vector of regression coefficients associated with X. We are interested in estimating β when p → ∞ as the sample size n → ∞ and when β is sparse, in the sense that many of its elements are zero. This is motivated by biomedical studies investigating the relationship between a phenotype of interest and genomic covariates such as microarray data. In many cases, it is reasonable to assume a sparse model, because the number of important covariates is usually relatively small, although the total number of covariates can be large.

We use the SCAD method to achieve variable selection and estimation of β simultaneously. The SCAD method was proposed by Fan and Li [1] in a general parametric framework for variable selection and efficient estimation. It uses a specially designed penalty function, the smoothly clipped absolute deviation (hence the name SCAD). Compared to classical variable selection methods such as subset selection, the SCAD has two advantages. First, variable selection with SCAD is continuous and hence more stable than subset selection, which is a discrete and non-continuous process. Second, the SCAD is computationally feasible for high-dimensional data. In contrast, computation in subset selection is combinatorial and not feasible when p is large. In addition to the SCAD method, several other penalized methods have also been proposed to achieve variable selection and estimation simultaneously. Examples include the bridge penalty (Frank and Friedman [3]), the LASSO (Tibshirani [11]), and the Elastic-Net (Enet) penalty (Zou and Hastie [14]), among others.

1Department of Statistics and Actuarial Science, 241 SH, University of Iowa, Iowa City, Iowa 52246, USA, e-mail: [email protected]

AMS 2000 subject classifications: Primary 62J07; secondary 62E20.
Keywords and phrases: asymptotic normality, high-dimensional data, oracle property, penalized regression, variable selection.


Fan and Li [1] and Fan and Peng [2] studied asymptotic properties of SCAD-penalized likelihood methods. Their results are concerned with local maximizers of the penalized likelihood, not with the maximum penalized estimators. These results do not imply the existence of an estimator with the properties of the local maximizer without auxiliary information about the true parameter value. Therefore, they are not applicable to the SCAD-penalized maximum likelihood estimators, nor to the SCAD-penalized least squares estimator. Knight and Fu [7] studied the asymptotic distributions of bridge estimators when the number of covariates is fixed. Huang, Horowitz and Ma [4] studied bridge estimators with a divergent number of covariates in a linear regression model. They showed that the bridge estimators have an oracle property under appropriate conditions if the bridge index is strictly between 0 and 1. Several earlier studies have investigated the properties of regression estimators with a divergent number of covariates. See, for example, Huber [5] and Portnoy [9, 10]. Portnoy proved consistency and asymptotic normality of a class of M-estimators of regression parameters under appropriate conditions. However, he did not consider penalized regression or selection of variables in sparse models.

In this paper, we study the asymptotic properties of the SCAD-penalized least squares estimator, abbreviated as the LS-SCAD estimator henceforth. We show that the LS-SCAD estimator can correctly select the nonzero coefficients with probability converging to one and that the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances as they would have if the zero coefficients were known in advance. Thus, the LS-SCAD estimators have an oracle property in the sense of Fan and Li [1] and Fan and Peng [2]. In other words, this estimator is asymptotically as efficient as the ideal estimator assisted by an oracle who knows which coefficients are nonzero and which are zero.

The rest of this article is organized as follows. In Section 2, we define the LS-SCAD estimator. The main results for the LS-SCAD estimator are given in Section 3, including the consistency and oracle properties. Section 4 describes an algorithm for computing the LS-SCAD estimator and the criterion used to choose the penalty parameter. Section 5 offers simulation studies that illustrate the finite sample behavior of this estimator. Some concluding remarks are given in Section 6. The proofs are relegated to the Appendix.

2. Penalized regression with the SCAD penalty

Let (Xi, Yi), i = 1, . . . , n be n observations satisfying

Yi = β0 + X′iβ + εi, i = 1, . . . , n,

where Yi ∈ R is a response variable, Xi is a pn × 1 covariate vector and εi has mean 0 and variance σ². Here the superscripts (n) on the covariates and parameters, which we suppress to lighten notation, are used to make it explicit that both may change with n. For simplicity, we assume β0 = 0. Otherwise we can center the covariates and responses first.

In sparse models, the pn covariates can be classified into two categories: the important ones, whose corresponding coefficients are nonzero, and the trivial ones, whose coefficients are zero. For convenience of notation, we write

β = (β′1, β′2)′,

where β′1 = (β1, . . . , βkn) and β′2 = (0, . . . , 0). Here kn (≤ pn) is the number of nontrivial covariates. Let mn = pn − kn be the number of zero coefficients. Let


Y = (Y1, . . . , Yn)′ and let X = (Xij, 1 ≤ i ≤ n, 1 ≤ j ≤ pn) be the n × pn design matrix. According to the partition of β, write X = (X1, X2), where X1 and X2 are n × kn and n × mn matrices, respectively.

Given a > 2 and λ > 0, the SCAD penalty at θ is

pλ(θ; a) = λ|θ|,                               |θ| ≤ λ,
           −(θ² − 2aλ|θ| + λ²)/[2(a − 1)],     λ < |θ| ≤ aλ,
           (a + 1)λ²/2,                        |θ| > aλ.

More insight into it can be gained through its first derivative:

p′λ(θ; a) = sgn(θ)λ,                    |θ| ≤ λ,
            sgn(θ)(aλ − |θ|)/(a − 1),   λ < |θ| ≤ aλ,
            0,                          |θ| > aλ.

The SCAD penalty is continuously differentiable on (−∞, 0) ∪ (0, ∞), but not differentiable at 0. Its derivative vanishes outside [−aλ, aλ]. As a consequence, SCAD-penalized regression can produce sparse solutions and unbiased estimates for large coefficients. More detailed discussions of this penalty can be found in Fan and Li [1].
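In code, the penalty and its derivative are a few lines (a minimal Python sketch; the function names are mine):

    import numpy as np

    def scad_penalty(theta, lam, a=3.7):
        # p_lambda(theta; a), applied elementwise
        t = np.abs(theta)
        return np.where(t <= lam, lam * t,
               np.where(t <= a * lam,
                        -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1)),
                        (a + 1) * lam**2 / 2))

    def scad_deriv(theta, lam, a=3.7):
        # right derivative p'_lambda(|theta|+; a), without the sgn factor
        t = np.abs(theta)
        return np.where(t <= lam, lam,
               np.where(t <= a * lam, (a * lam - t) / (a - 1), 0.0))

These two functions are reused in the computational sketches in Section 4 below.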

The penalized least squares objective function for estimating β with the SCAD penalty is

(1)    Qn(b; λn, a) = ‖Y − Xb‖² + n Σ_{j=1}^{pn} pλn(bj; a),

where ‖ · ‖ is the L2 norm. Given penalty parameters λn and a, the LS-SCAD estimator of β is

βn ≡ β(λn; a) = arg min_b Qn(b; λn, a).

We write βn = (β′1n, β′2n)′, partitioned the same way as β into β1 and β2.

3. Asymptotic properties of the LS-SCAD estimator

In this section we state the results on the asymptotic properties of the LS-SCAD estimator. Results for the case of fixed design are slightly different from those for the case of random design. We state them separately.

For convenience, the main assumptions required for the conclusions in this section are listed here. (A0) through (A4) are for fixed covariates. Let ρn,1 be the smallest eigenvalue of n⁻¹X′X, and let πn,kn and ωn,mn be the largest eigenvalues of n⁻¹X′1X1 and n⁻¹X′2X2, respectively. Let X′i1 = (Xi1, . . . , Xikn) and X′i2 = (Xi,kn+1, . . . , Xipn).

(A0) (a) The εi are i.i.d. with mean 0 and variance σ²;
     (b) for any j ∈ {1, . . . , pn}, ‖X·j‖² = n.
(A1) (a) lim_{n→∞} √kn λn/√ρn,1 = 0;
     (b) lim_{n→∞} √pn/√(nρn,1) = 0.
(A2) (a) lim_{n→∞} √kn λn/(√ρn,1 min_{1≤j≤kn} |βj|) = 0;
     (b) lim_{n→∞} √pn/(√(nρn,1) min_{1≤j≤kn} |βj|) = 0;
     (c) lim_{n→∞} √(pn/n)/ρn,1 = 0.
(A3) lim_{n→∞} √(max(πn,kn, ωn,mn) pn)/(√n ρn,1 λn) = 0.
(A4) lim_{n→∞} max_{1≤i≤n} X′i1(Σ_{i=1}^n Xi1X′i1)⁻¹Xi1 = 0.


For random covariates, we require conditions (B0) through (B3). Suppose the (X′i, εi) are independent and identically distributed as (X′, ε) = (X1, . . . , Xpn, ε). Analogous to the fixed design case, ρ1 denotes the smallest eigenvalue of E[XX′], and πkn and ωmn are the largest eigenvalues of E[Xi1X′i1] and E[Xi2X′i2], respectively.

(B0) (X′i, εi) = (Xi1, . . . , Xipn, εi), i = 1, . . . , n, are i.i.d. with
     (a) E[Xij] = 0, Var(Xij) = 1;
     (b) E[ε|X] = 0, Var(ε|X) = σ².
(B1) (a) lim_{n→∞} pn²/(nρ1²) = 0;
     (b) lim_{n→∞} kn λn²/ρ1 = 0.
(B2) (a) lim_{n→∞} √pn/(√(nρ1) min_{1≤j≤kn} |βj|) = 0;
     (b) lim_{n→∞} λn√kn/(√ρ1 min_{1≤j≤kn} |βj|) = 0.
(B3) lim_{n→∞} √(max(πkn, ωmn) pn)/(√n ρ1 λn) = 0.

Theorem 1 (Consistency in the fixed design setting). Under (A0)–(A1), ‖βn − β‖ →P 0 as n → ∞.

A similar result holds for the random design case.

Theorem 2 (Consistency in the random design setting). Suppose that there exists an absolute constant M4 such that for all n, max_{1≤j≤pn} E[Xj⁴] ≤ M4 < ∞. Then under (B0)–(B1), ‖βn − β‖ →P 0 as n → ∞.

For consistency, λn has to be kept small so that the SCAD penalty does not introduce any bias asymptotically. Note that in both design settings the restriction on the penalty parameter λn does not involve mn, the number of trivial covariates. This is shared by the Lq (0 < q < 1) penalized estimators in Huang, Horowitz and Ma [4]. However, unlike for the bridge estimators, no upper bound requirement is imposed on the components of β1, since the derivative of the SCAD penalty vanishes beyond a certain interval while that of the Lq penalty does not. In the fixed design case, (A1.b) is needed for model identifiability, as in classical regression. For the random design case, a stricter requirement on pn is entailed by the need for convergence of n⁻¹X′X to E[XX′] in the Frobenius norm.

The next two theorems state that the LS-SCAD estimator is consistent for variable selection.

Theorem 3 (Variable selection in the fixed design setting). Under (A0)–(A3), β2n = 0mn with probability tending to 1.

Theorem 4 (Variable selection in the random design setting). Suppose there exists an absolute constant M such that max_{1≤j≤pn} |Xj| ≤ M < ∞. Then under (B0)–(B3), β2n = 0mn with probability tending to 1.

(A2.a) and (A2.b) are identical to (A1.a) and (A1.b), respectively, provided that

lim inf_{n→∞} min_{1≤j≤kn} |βj| > 0.

(B2) places a similar requirement on min_{1≤j≤kn} |βj| as (A2). (A3) concerns the largest eigenvalues of n⁻¹X′1X1 and n⁻¹X′2X2. Due to the standardization of the covariates, πn,kn ≤ kn and ωn,mn ≤ mn, so (A3) is implied by

lim_{n→∞} pn/(√n ρn,1 λn) = 0.

Likewise, (B3) can be replaced with

lim_{n→∞} pn/(√n ρ1 λn) = 0.

Both (A3) and (B3) require λn not to converge to 0 too fast, in order for the estimator to be able to "discover" the trivial covariates. One may wonder whether there exist λn's that simultaneously satisfy (A1)–(A3) (in the random design setting, (B1)–(B3)). When lim inf_{n→∞} ρn,1 > 0 and lim inf_{n→∞} min_{1≤j≤kn} |βj| > 0, it can be checked that there exists a λn meeting both (A2) and (A3) as long as pn = o(n^{1/3}). If we further know either that kn is fixed, or that the largest eigenvalue of n⁻¹X′X is bounded from above, as is assumed in Fan and Peng [2], then pn = o(n^{1/2}) is sufficient. When both of these are true, pn = o(n) is adequate for the existence of such λn's. Similar conclusions hold for the random design case, except that pn = o(n^{1/2}) is indispensable there.

The advantage of the SCAD penalty is that once the trivial covariates have been correctly picked out, regression with or without the SCAD penalty makes no difference for the nontrivial covariates. So it is expected that β1n is asymptotically normally distributed. Let {An, n = 1, 2, . . .} be a sequence of matrices of dimension d × kn with full row rank.

Theorem 5 (Asymptotic normality in the fixed design setting). Under (A0)–(A4),

√n Σn^{−1/2} An(β1n − β1) →D N(0d, Id),

where Σn = σ² An(Σ_{i=1}^n Xi1X′i1/n)⁻¹A′n.

Theorem 6 (Asymptotic normality in the random design setting). Suppose that there exists an absolute constant M such that max_{1≤j≤pn} |Xj| ≤ M < ∞, and a σ4 such that E[ε⁴|X11] ≤ σ4 < ∞ for all n. Then under (B0)–(B3),

n^{−1/2} Σn^{−1/2} An E^{−1/2}[Xi1X′i1] Σ_{i=1}^n Xi1X′i1(β1n − β1) →D N(0d, Id),

where Σn = σ² AnA′n.

For the random design, the assumptions for asymptotic normality are no more than those for variable selection, while for the fixed design a Lindeberg-Feller type condition (A4) is needed in addition to (A0)–(A3).

4. Computation

We use the algorithm of Hunter and Li [6] to compute the LS-SCAD estimator for a given λn and a. This algorithm approximates the nonconvex target function with a convex function locally at each iteration step. We also describe the steps to compute the approximate standard error of the estimator.


4.1. Computation of the LS-SCAD estimator

Given λn and a, the target function to be minimized is

Qn(b; λn, a) = Σ_{i=1}^n (Yi − X′ib)² + n Σ_{j=1}^{pn} pλn(bj; a).

Hunter and Li [6] propose to minimize its approximation

Qn,ξ(b; λn, a) = Σ_{i=1}^n (Yi − X′ib)² + n Σ_{j=1}^{pn} pλn,ξ(bj; a)
             = Σ_{i=1}^n (Yi − X′ib)² + n Σ_{j=1}^{pn} (pλn(bj; a) − ξ ∫_0^{|bj|} p′λn(t; a)/(ξ + t) dt).

Around b(k) = (b(k),1, . . . , b(k),pn)′ this can in turn be approximated by

Sk,ξ(b; λn, a) = Σ_{i=1}^n (Yi − X′ib)² + n Σ_{j=1}^{pn} [pλn,ξ(b(k),j; a) + p′λn(|b(k),j|+; a)/(2(ξ + |b(k),j|)) · (bj² − b(k),j²)],

where ξ is a very small perturbation that prevents any component of the estimate from getting stuck at 0. Therefore the one-step estimator starting from b(k) is

b(k+1) = (X′X + nDξ(b(k); λn, a))⁻¹X′Y,

where Dξ(b(k); λn, a) is the diagonal matrix whose diagonal elements are (1/2)p′λn(|b(k),j|+; a)/(ξ + |b(k),j|), j = 1, . . . , pn. Given the tolerance τ, convergence is claimed when

|∂Qn,ξ(b)/∂bj| < τ/2  for all j = 1, . . . , pn.

Finally, the bj's that satisfy

|∂Qn,ξ(b)/∂bj − ∂Qn(b)/∂bj| = nξ p′λn(|bj|+; a)/(ξ + |bj|) < τ/2

are set to 0. A good starting point is b(0) = βLS, the least squares estimator. The perturbation ξ should be kept small, so that the difference between Qn,ξ(·) and Qn(·) is negligible. Hunter and Li [6] suggest using

ξ = (τ/(2nλn)) min{|b(0),j| : b(0),j ≠ 0}.
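A compact implementation of this iteration might look as follows. This is a sketch under my own conventions: it reuses scad_deriv from the sketch in Section 2, and for brevity it stops on small parameter changes rather than on the gradient criterion above.

    import numpy as np

    def ls_scad(X, Y, lam, a=3.7, tol=1e-5, max_iter=500):
        # Perturbed MM iteration of Hunter and Li [6] for the LS-SCAD estimator
        n, p = X.shape
        XtX, XtY = X.T @ X, X.T @ Y
        b = np.linalg.solve(XtX, XtY)                 # b(0): least squares start
        xi = tol / (2 * n * lam) * np.abs(b[b != 0]).min()
        for _ in range(max_iter):
            d = 0.5 * scad_deriv(b, lam, a) / (xi + np.abs(b))
            b_new = np.linalg.solve(XtX + n * np.diag(d), XtY)
            done = np.max(np.abs(b_new - b)) < tol    # simplified stopping rule (assumption)
            b = b_new
            if done:
                break
        # final thresholding: zero out coefficients whose perturbation bias is below tau/2
        bias = n * xi * scad_deriv(b, lam, a) / (xi + np.abs(b))
        b[bias < tol / 2] = 0.0
        return b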

4.2. Standard errors

The standard errors for the nonzero coefficient estimates can be obtained via the approximation

∂Sξ(β1n; λn, a)/∂β1n ≈ ∂Sξ(β1; λn, a)/∂β1 + [∂²Sξ(β1; λn, a)/(∂β1∂β′1)](β1n − β1).

So

β1n − β1 ≈ −(∂²Sξ(β1; λn, a)/(∂β1∂β′1))⁻¹ ∂Sξ(β1; λn, a)/∂β1
         ≈ −(∂²Sξ(β1n; λn, a)/(∂β1n∂β′1n))⁻¹ ∂Sξ(β1n; λn, a)/∂β1n.

Since

∂Sξ(β1n; λn, a)/∂βj = −2X′·jY + 2X′·jX1β1n + n βj p′λn(|βj|; a)/(ξ + |βj|)
                    = Σ_{i=1}^n [−2XijYi + 2XijX′i1β1n + βj p′λn(|βj|; a)/(ξ + |βj|)]
                    ≜ 2 Σ_{i=1}^n Uij(ξ; λn, a),

letting Uij = Uij(ξ; λn, a), we have, for j, l = 1, . . . , kn,

Cov(n^{−1/2} ∂Sξ(β1n; λn, a)/∂βj, n^{−1/2} ∂Sξ(β1n; λn, a)/∂βl)
  ≈ (4/n) Σ_{i=1}^n UijUil − (4/n²) Σ_{i=1}^n Uij Σ_{i=1}^n Uil.

Let C = (Cjl, j, l = 1, . . . , kn), where

Cjl = (1/n) Σ_{i=1}^n UijUil − (1/n²) Σ_{i=1}^n Uij Σ_{i=1}^n Uil.

The variance-covariance matrix of the estimates can then be approximated by

Cov(β1n) ≡ n (X′1X1 + nDξ(β1n; λn, a))⁻¹ C (X′1X1 + nDξ(β1n; λn, a))⁻¹.
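In code, with U assembled row-wise from the expression above, the whole recipe fits in a few lines (a sketch, again reusing scad_deriv from Section 2; variable names are mine):

    import numpy as np

    def scad_cov(X1, Y, b1, lam, a, xi):
        # Approximate covariance matrix of the nonzero-coefficient estimates b1;
        # X1 holds the n x k design columns of the selected covariates.
        n = X1.shape[0]
        # U_ij = X_ij (X_i1' b1 - Y_i) + b_j p'(|b_j|+; a) / (2 (xi + |b_j|))
        U = X1 * (X1 @ b1 - Y)[:, None] \
            + 0.5 * b1 * scad_deriv(b1, lam, a) / (xi + np.abs(b1))
        C = U.T @ U / n - np.outer(U.mean(0), U.mean(0))
        D = np.diag(0.5 * scad_deriv(b1, lam, a) / (xi + np.abs(b1)))
        H_inv = np.linalg.inv(X1.T @ X1 + n * D)
        return n * H_inv @ C @ H_inv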

4.3. Selection of λn

The above computational algorithm is for the case when λn and a are specified. In data analysis, they can be selected by minimizing the generalized cross validation score, which is defined to be

GCV(λn, a) = (‖Y − X1β1n‖²/n) / (1 − p(λn, a)/n)²,

where

p(λn, a) = tr[X1(X′1X1 + nD0(β1n; λn, a))⁻¹X′1]

is the number of effective parameters and D0(β1n; λn, a) is a submatrix of the diagonal matrix Dξ(βn; λn, a) with ξ = 0. By submatrix, we mean that the diagonal of D0(β1n; λn, a) only contains the elements corresponding to the nontrivial components in β. Note that here X1 also only includes the columns for which the corresponding elements of βn are non-vanishing.
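A direct transcription of this score (a sketch; scad_deriv as before, and λn is then chosen by minimizing the score over a grid):

    import numpy as np

    def gcv(X, Y, b, lam, a=3.7):
        # Generalized cross validation score for a fitted LS-SCAD solution b
        n = X.shape[0]
        nz = b != 0                      # the selected, nontrivial covariates
        X1, b1 = X[:, nz], b[nz]
        D0 = np.diag(0.5 * scad_deriv(b1, lam, a) / np.abs(b1))   # xi = 0
        p_eff = np.trace(X1 @ np.linalg.inv(X1.T @ X1 + n * D0) @ X1.T)
        rss = np.mean((Y - X1 @ b1) ** 2)
        return rss / (1 - p_eff / n) ** 2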

The requirement that a > 2 is implied by the SCAD penalty function. Simulation suggests that the generalized cross validation score does not change much with a for a given λ. So, to improve computing efficiency, we fix a = 3.7, as suggested by Fan and Li [1].

5. Simulation studies

In this section we illustrate the LS-SCAD estimator's finite sample properties with a simulated example.

We simulate covariates Xi, i = 1, . . . , n, from the multivariate normal distribution with mean 0 and

Cov(Xij, Xil) = ρ^{|j−l|},   1 ≤ j, l ≤ p.

The response Yi is computed as

Yi = Σ_{j=1}^p Xijβj + εi,   i = 1, . . . , n,

where βj = j for 1 ≤ j ≤ 4, βj = 0 for 5 ≤ j ≤ p, and the εi are sampled from N(0, 1). For each (n, p, ρ) ∈ {(100, 10), (500, 40)} × {0, 0.2, 0.5, 0.8}, we generated N = 400 data sets and used the algorithm in Section 4 to compute the LS-SCAD estimator, with the tolerance τ described in Section 4.1 set at 10⁻⁵. For comparison we also apply the ordinary least squares (LS) method, ordinary least squares with model selection based on AIC (abbreviated AIC), and ordinary least squares assuming that βj = 0 for j ≥ 5 is known beforehand (ORA). Note that this last estimator (ORA) is not feasible in a real data analysis setting. We use it here as a benchmark in the comparisons.

The results are summarized in Tables 1 and 2. Columns 4 through 7 in Table 1 give the biases of the estimates of βj, j = 1, . . . , 4, respectively; in parentheses following each of them are the standard deviations of these estimates. Column 8 (K̄) lists the numbers of estimates of βj, 5 ≤ j ≤ p, that are 0, averaged over the 400 replications, and their modes are given in Column 9 (K). For LS, an estimate is set to 0 if it lies within [−10⁻⁵, 10⁻⁵].

In Table 1, we see that the LS-SCAD estimates of the nontrivial coefficients have biases and standard errors comparable to the ORA estimates. This is in line with Theorems 5 and 6. The average numbers of zero estimates among the βj (j > 4), K̄, for LS-SCAD are close to p − 4, the true number of zero coefficients among the βj (j > 4). As the true number of trivial covariates increases, the LS-SCAD estimator may be able to discover more trivial ones than AIC. However, there is more variability in the number of trivial covariates discovered via LS-SCAD than via AIC.

Table 2 gives the averages, over the 400 replications, of the estimated standard errors of βj, 1 ≤ j ≤ 4, using the SCAD method. They are obtained with the approach described in Section 4.2. They are slightly smaller than the sampling standard deviations of βj, 1 ≤ j ≤ 4, which are given in parentheses in the rows for LS-SCAD in Table 1.

Suppose that for a data set the estimate of β via one of these four approaches is βn; then the average model error (AME) for this approach is computed as n⁻¹ Σ_{i=1}^n [X′i(βn − β)]². Box plots for these AMEs are given in Figure 1.


Table 1
Simulation example 1, comparison of estimators

(n, p)     ρ    Estimator       β1              β2              β3              β4            K̄       K
(100, 10)  0    LS       .0007 (.1112)  −.0034 (.0979)  −.0064 (.1127)  −.0024 (.1091)     0       0
                ORA      .0008 (.1074)  −.0054 (.0936)  −.0057 (.1072)  −.0007 (.1040)     6       6
                AIC      .0007 (.1083)  −.0026 (.1033)  −.0060 (.1156)  −.0019 (.1181)     4.91    5
                SCAD    −.0006 (.1094)  −.0037 (.0950)  −.0058 (.1094)  −.0014 (.1060)     4.62    5
           0.2  LS      −.0003 (.1051)  −.0028 (.1068)   .0093 (.1157)   .0037 (.1103)     0       0
                ORA     −.0005 (.1010)  −.0031 (.1035)   .0107 (.1131)   .0020 (.1035)     6       6
                AIC     −.0002 (.1031)  −.0024 (.1063)   .0107 (.1150)   .0021 (.1079)     4.95    5
                SCAD    −.0025 (.1035)  −.0026 (.1046)   .0104 (.1141)   .0024 (.1066)     4.64    5
           0.5  LS       .0000 (.1177)  −.0007 (.1353)   .0010 (.1438)   .0006 (.1360)     0       0
                ORA     −.0002 (.1129)  −.0072 (.1317)   .0115 (.1393)   .0022 (.1171)     6       6
                AIC     −.0003 (.1162)  −.0064 (.1338)   .0114 (.1413)   .0017 (.1294)     4.91    5
                SCAD     .0035 (.1115)  −.0219 (.1404)   .0135 (.1481)   .0006 (.1293)     4.78    5
           0.8  LS      −.0005 (.1916)  −.0229 (.2293)   .0059 (.2319)   .0060 (.2200)     0       0
                ORA     −.0039 (.1835)  −.0196 (.2197)   .0070 (.2250)   .0092 (.1787)     6       6
                AIC     −.0021 (.1857)  −.0209 (.2235)   .0063 (.2289)   .0013 (.2072)     4.85    6
                SCAD    −.0038 (.1868)  −.0197 (.2249)   .0062 (.2280)   .0032 (.2024)     4.87    6
(500, 40)  0    LS       .0021 (.0466)  −.0000 (.0475)  −.0010 (.0466)   .0014 (.0439)     0       0
                ORA      .0027 (.0446)  −.0005 (.0453)  −.0003 (.0448)   .0011 (.0426)    36      36
                AIC      .0023 (.0460)  −.0003 (.0465)  −.0004 (.0453)   .0016 (.0433)    29.91   30
                SCAD     .0027 (.0447)  −.0004 (.0454)  −.0004 (.0450)   .0013 (.0429)    32.22   35
           0.2  LS       .0018 (.0478)   .0003 (.0478)  −.0014 (.0487)   .0005 (.0437)     0       0
                ORA      .0003 (.0522)  −.0000 (.0465)  −.0010 (.0517)   .0009 (.0458)    36      36
                AIC      .0024 (.0473)   .0002 (.0471)  −.0014 (.0475)   .0018 (.0436)    29.87   30
                SCAD     .0028 (.0461)   .0002 (.0460)  −.0011 (.0475)   .0006 (.0433)    32.20   35
           0.5  LS       .0024 (.0542)   .0001 (.0617)   .0050 (.0608)  −.0048 (.0563)     0       0
                ORA      .0027 (.0526)   .0017 (.0581)   .0033 (.0597)  −.0030 (.0488)    36      36
                AIC      .0031 (.0537)   .0007 (.0603)   .0037 (.0605)  −.0038 (.0526)    29.87   32
                SCAD     .0025 (.0528)   .0017 (.0587)   .0034 (.0601)  −.0037 (.0494)    31.855  35
           0.8  LS       .0014 (.0788)  −.0012 (.1014)   .0090 (.1000)  −.0077 (.0943)     0       0
                ORA      .0010 (.0761)   .0017 (.0954)   .0060 (.0983)  −.0044 (.0704)    36      36
                AIC      .0020 (.0776)   .0003 (.0996)   .0066 (.0995)  −.0071 (.0862)    29.56   30
                SCAD     .0014 (.0773)   .0018 (.0982)   .0059 (.0990)  −.0050 (.0790)    29.38   35

Table 2
Simulated example, standard error estimate

(n, p)      (100, 10)                       (500, 40)
ρ           0      0.2    0.5    0.8       0      0.2    0.5    0.8
se(β1)      .0983  .1005  .1139  .1624     .0442  .0444  .0512  .0735
se(β2)      .0980  .1028  .1276  .2080     .0443  .0447  .0571  .0940
se(β3)      .0996  .1027  .1278  .2086     .0442  .0445  .0573  .0940
se(β4)      .0988  .1006  .1150  .1727     .0441  .0444  .0512  .0764

The LS estimator clearly has the worst performance in terms of AME, and this becomes more obvious as the number of trivial predictors increases. LS-SCAD outperforms AIC in this respect and is comparable to ORA. But it is also seen that the AMEs of LS-SCAD tend to be more diffuse as ρ increases. This is again the result of more spread-out estimates of the number of trivial covariates.

Fig 1. Box plots of the average model errors for four estimators: AIC, LS, ORA, and LS-SCAD. In the top four panels, (n, p, ρ) = (100, 10, 0), (100, 10, 0.2), (100, 10, 0.5), (100, 10, 0.8); and in the bottom four panels, (n, p, ρ) = (500, 40, 0), (500, 40, 0.2), (500, 40, 0.5), (500, 40, 0.8), where n is the sample size, p is the number of covariates, and ρ is the correlation coefficient used in generating the covariate values.

6. Concluding remarks

In this paper, we have studied the asymptotic properties of the LS-SCAD estimator when the number of covariates and regression coefficients increases to infinity as n → ∞. We have shown that this estimator can correctly identify zero coefficients with probability converging to one and that the estimators of nonzero coefficients are asymptotically normal and oracle efficient. Our results were obtained under the assumption that the number of parameters is smaller than the sample size; they are not applicable when the number of parameters is greater than the sample size, a situation which arises in microarray gene expression studies. In general, the condition p < n is needed for identification of the regression parameter and consistent variable selection. To achieve consistent variable selection in the "large p, small n" case, certain conditions are required on the design matrix. For example, Huang et al. [4] showed that, under a partial orthogonality assumption, in which the covariates of the zero coefficients are uncorrelated or only weakly correlated with the covariates of the nonzero coefficients, the univariate bridge estimators are consistent for variable selection under appropriate conditions. This result also holds for the univariate LS-SCAD estimator. Indeed, under the partial orthogonality condition, it can be shown that the simple univariate regression estimator can be used to consistently distinguish between nonzero and zero coefficients. Finally, we note that our results are only valid for a fixed sequence of penalty parameters λn. It is an interesting and difficult problem to show that the asymptotic oracle property also holds for λn determined by cross validation.

Appendix

We now give the proofs of the results stated in Section 3.

Proof of Theorem 1. By the definition of βn, necessarily Qn(βn) ≤ Qn(β). It follows that

0 ≥ ‖X(βn − β)‖² − 2ε′X(βn − β) + n Σ_{j=1}^{kn} [pλn(βn,j; a) − pλn(βj; a)]
  ≥ ‖X(βn − β)‖² − 2ε′X(βn − β) − 2⁻¹n(a + 1)knλn²
  = ‖[X′X]^{1/2}(βn − β) − [X′X]^{−1/2}X′ε‖² − ε′X[X′X]⁻¹X′ε − 2⁻¹n(a + 1)knλn².

By the Cr-inequality (Loève [8], page 155),

‖[X′X]^{1/2}(βn − β)‖² ≤ 2‖[X′X]^{1/2}(βn − β) − [X′X]^{−1/2}X′ε‖² + 2ε′X[X′X]⁻¹X′ε
                        ≤ 4ε′X[X′X]⁻¹X′ε + n(a + 1)knλn².

In the fixed design setting,

ε′X[X′X]⁻¹X′ε = E[ε′X[X′X]⁻¹X′ε] OP(1) = σ² tr(X[X′X]⁻¹X′) OP(1) = pn OP(1).

Since

‖[X′X]^{1/2}(βn − β)‖² ≥ nρn,1 ‖βn − β‖²,

we have

‖βn − β‖ = OP(√pn/√(nρn,1) + √kn λn/√ρn,1) = oP(1).

Proof of Theorem 2. Let A(n) = (A(n)jk), j, k = 1, . . . , pn, with A(n)jk = n⁻¹Σ_{i=1}^n XijXik − E[XijXik]. Let ρ1(A(n)) and ρpn(A(n)) be the smallest and the largest of the eigenvalues of A(n), respectively. Then by Theorem 4.1 in Wang and Jia [13],

ρ1(A(n)) ≤ ρn,1 − ρ1 ≤ ρpn(A(n)).

By the Cauchy inequality and the properties of eigenvalues of symmetric matrices,

max(|ρ1(A(n))|, |ρpn(A(n))|) ≤ ‖A(n)‖.

When (B1.a) holds, ‖A(n)‖ = oP(ρ1) = oP(1), as is seen from the fact that, for any ξ > 0,

P(‖A(n)‖² ≥ ξρ1²) ≤ E‖A(n)‖²/(ξρ1²) ≤ (pn²/(ξρ1²)) sup_{1≤j,k≤pn} Var(A(n)jk) ≤ pn² M4/(n ξ ρ1²).

Since ρ1 > 0 holds for all n, n⁻¹X′X is invertible with probability tending to 1. Following the argument for the fixed design case, with probability tending to 1,

‖[X′X]^{1/2}(βn − β)‖² ≤ 4ε′X[X′X]⁻¹X′ε + n(a + 1)knλn².

In the random design setting,

E[ε′X[X′X]⁻¹X′ε | ‖A(n)‖² < ρ1²/2] = σ² E[tr(X[X′X]⁻¹X′) | ‖A(n)‖² < ρ1²/2] = σ² pn.

The rest of the argument remains the same as for the fixed design case and leads to

‖βn − β‖ = OP(√pn/√(nρ1) + √kn λn/√ρ1) = oP(1).

Lemma 1 (Convergence rate in the fixed design setting). Under (A0)–(A2), ‖βn − β‖ = OP(√(pn/n)/ρn,1).

Proof. In the proof of consistency we obtained ‖βn − β‖ = OP(un), where un = λn√(kn/ρn,1) + √(pn/(nρn,1)). For any L1, provided that ‖b − β‖ ≤ 2^{L1}un,

min_{1≤j≤kn} |bj| ≥ min_{1≤j≤kn} |βj| − 2^{L1}un.

If (A2) holds, then for n sufficiently large, un/min_{1≤j≤kn} |βj| < 2^{−L1−1}. It follows that

min_{1≤j≤kn} |bj| ≥ min_{1≤j≤kn} |βj|/2,

which further implies that min_{1≤j≤kn} |bj| > aλn for n sufficiently large (assuming lim inf_{n→∞} kn > 0).

Let {hn} be a sequence converging to 0. As in the proof of Theorem 3.2.5 of Van der Vaart and Wellner [12], decompose R^{pn}\{0pn} into shells {Sn,l, l ∈ Z}, where Sn,l = {b : 2^{l−1}hn ≤ ‖b − β‖ < 2^l hn}. For b ∈ Sn,l such that 2^l hn ≤ 2^{L1}un,

Qn(b) − Qn(β) = (b − β)′X′X(b − β) − 2ε′X(b − β) + n Σ_{j=1}^{pn} pλn(bj; a) − n Σ_{j=1}^{pn} pλn(βj; a)
             ≥ (b − β)′X′X(b − β) − 2ε′X(b − β) ≜ In1 + In2,

and

In1 ≥ nρn,1 ‖b − β‖² ≥ 2^{2(l−1)} hn² nρn,1.

Thus

P(‖βn − β‖ ≥ 2^L hn)
  ≤ o(1) + Σ_{l>L, 2^l hn ≤ 2^{L1}un} P(βn ∈ Sn,l)
  ≤ o(1) + Σ_{l>L, 2^l hn ≤ 2^{L1}un} P(inf_{b∈Sn,l} Qn(b) ≤ Qn(β))
  ≤ o(1) + Σ_{l>L, 2^{l−1}hn ≤ 2^{L1}un} P(sup_{b∈Sn,l} ε′X(b − β) ≥ 2^{2l−3} hn² nρn,1)
  ≤ o(1) + Σ_{l>L, 2^{l−1}hn ≤ 2^{L1}un} E|sup_{b∈Sn,l} ε′X(b − β)| / (2^{2l−3} hn² nρn,1)
  ≤ o(1) + Σ_{l>L} 2^l hn E^{1/2}[‖ε′X‖²] / (2^{2l−3} hn² nρn,1)
  ≤ o(1) + Σ_{l>L} 2^l √(nσ²pn) / (2^{2l−3} hn nρn,1),

from which we see that ‖βn − β‖ = OP(√(pn/n)/ρn,1).

Lemma 2 (Convergence rate in the random design setting). Under (B0)–(B2), ‖βn − β‖ = OP(√(pn/n)/ρ1).

Proof. The deduction is similar to that of Lemma 1. However, since X is a random matrix in this case, extra care is needed in the following part. Let A(n) = (A(n)jk), j, k = 1, . . . , pn, with A(n)jk = (1/n)Σ_{i=1}^n XijXik − E[XjXk]. We have

P(‖βn − β‖ ≥ 2^L hn)
  ≤ Σ_{l>L, 2^l hn ≤ 2^{L1}un} P(βn ∈ Sn,l, ‖A(n)‖ ≤ ρ1/2) + o(1)
  ≤ Σ_{l>L, 2^l hn ≤ 2^{L1}un} P(inf_{b∈Sn,l} Qn(b) ≤ Qn(β), ‖A(n)‖ ≤ ρ1/2) + o(1)
  ≤ Σ_{l>L} 2^l hn E^{1/2}[‖ε′X‖² | ‖A(n)‖ ≤ ρ1/2] / (2^{2l−4} hn² nρ1) + o(1).

The first inequality follows from (B1.a). This leads to ‖βn − β‖ = OP(√(pn/n)/ρ1).

Proof of Theorem 3. By Lemma 1, ‖βn − β‖ ≤ λn with probability tending to 1 under (A3). Consider the partial derivatives of Qn(β + v). For j = kn + 1, . . . , pn, if |vj| ≤ λn,

∂Qn(β + v)/∂vj = −2 Σ_{i=1}^n Xij(εi − X′iv) + nλn sgn(vj)
              = −2 Σ_{i=1}^n Xijεi + 2 Σ_{i=1}^n XijX′i1v1 + 2 Σ_{i=1}^n XijX′i2v2 + nλn sgn(vj)
              ≜ IIn1,j + IIn2,j + IIn3,j + IIn4,j.

We examine the first three terms one by one:

E[max_{kn+1≤j≤pn} |IIn1,j|] ≤ E^{1/2}[Σ_{j=kn+1}^{pn} IIn1,j²] = 2√(nmn) σ,

max_{kn+1≤j≤pn} |IIn2,j| = 2 max_{kn+1≤j≤pn} |Σ_{i=1}^n XijX′i1v1|
  ≤ 2‖v1‖ max_{kn+1≤j≤pn} √((X·j)′X1X′1X·j)
  ≤ 2‖v1‖ max_{kn+1≤j≤pn} ‖X·j‖ ρmax^{1/2}(X1X′1)
  = 2‖v1‖ max_{kn+1≤j≤pn} ‖X·j‖ ρmax^{1/2}(X′1X1)
  = 2n√πn,kn ‖v1‖,

max_{kn+1≤j≤pn} |IIn3,j| = 2 max_{kn+1≤j≤pn} |Σ_{i=1}^n XijX′i2v2|
  ≤ 2‖v2‖ ‖X·j‖ ρmax^{1/2}(X′2X2)
  = 2n√ωn,mn ‖v2‖.

Following the above argument we have

P(∪_{kn+1≤j≤pn} {|IIn1,j| > |IIn4,j| − |IIn2,j| − |IIn3,j|})
  ≤ 2√(nmn) σ / (nλn − 2n(√πn,kn ‖v1‖ + √ωn,mn ‖v2‖)).

When (A3) holds, √n λn/√mn → ∞. Under (A1)–(A2), ‖v‖ = OP(√(pn/n)/ρn,1). Therefore

P(∪_{kn+1≤j≤pn} {|IIn1,j| > |IIn4,j| − |IIn2,j| − |IIn3,j|}) → 0 as n → ∞.

This indicates that with probability tending to 1, for all j = kn + 1, . . . , pn, the sign of ∂Qn(β + v)/∂vj is the same as that of vj, provided that |vj| < λn, which further implies that

lim_{n→∞} P(β2n = 0mn) = 1.


Proof of Theorem 4. Follow the argument in the proof of Theorem 3. Note that in the random design setting, under (B1.a),

max_{kn+1≤j≤pn} |IIn2,j| = 2 max_{kn+1≤j≤pn} |Σ_{i=1}^n XijX′i1v1|
  ≤ 2‖v1‖ max_{kn+1≤j≤pn} √((X·j)′X1X′1X·j)
  ≤ 2‖v1‖ max_{kn+1≤j≤pn} ‖X·j‖ ρmax^{1/2}(X1X′1)
  ≤ 2‖v1‖ √n M ρmax^{1/2}(X′1X1)
  ≤ 2M√n ‖v1‖ √(n[ρmax(E[X1X′1]) + ‖A11‖])
  ≤ 2n‖v1‖ √(πkn + E^{1/2}[‖A11‖²] OP(1))
  = 2n‖v1‖ √(πkn + OP(ρ1) M4^{1/2} kn/(ρ1√n))
  ≤ 4n‖v1‖ √πkn OP(1)

for n sufficiently large, where A11 denotes the upper-left kn × kn block of A(n). Similarly,

max_{kn+1≤j≤pn} |IIn3,j| ≤ 4n‖v2‖ √ωmn OP(1).

The rest of the argument is identical to that in the fixed design case and is thus omitted here.

Proof of Theorem 5. In the course of proving Lemma 1, we showed that under (A0)–(A1), ‖βn − β‖ = OP(λn√(kn/ρn,1) + √(pn/(nρn,1))). Under (A2), this implies that

‖β1n − β1‖ = oP(min_{1≤j≤kn} |βj|).

Also from (A2), λn = o(min_{1≤j≤kn} |βj|). Therefore, with probability tending to 1, all the β1n,j (1 ≤ j ≤ kn) are bounded away from [−aλn, aλn], so that the partial derivatives exist; at the same time, β2n = 0mn with probability tending to 1. Thus, with probability tending to 1, the stationarity condition holds for the first kn components. That is, β1n necessarily satisfies the equation

Σ_{i=1}^n (Yi − X′i1β1n)Xi1 = 0,   i.e.,   Σ_{i=1}^n εiXi1 = Σ_{i=1}^n Xi1X′i1(β1n − β1).

So the random vector being considered is

Zn ≜ √n Σn^{−1/2} An(β1n − β1) = √n Σ_{i=1}^n Σn^{−1/2} An(X′1X1)⁻¹Xi1εi ≜ n^{−1/2} Σ_{i=1}^n R(n)i εi,

where R(n)i = Σn^{−1/2} An(n⁻¹X′1X1)⁻¹Xi1; the equality holds with probability tending to 1. That max_{1≤i≤n} ‖R(n)i‖/√n → 0 is implied by (A4) can be seen from

‖R(n)i‖/√n = ‖Σn^{−1/2} An(n⁻¹X′1X1)⁻¹Xi1‖/√n
  ≤ n^{−1/2} ‖(n⁻¹X′1X1)^{−1/2}Xi1‖ · ρmax^{1/2}((n⁻¹X′1X1)^{−1/2}A′nΣn⁻¹An(n⁻¹X′1X1)^{−1/2})
  = n^{−1/2} ‖(n⁻¹X′1X1)^{−1/2}Xi1‖ ρmax^{1/2}(σ^{−2}Σn^{−1/2}ΣnΣn^{−1/2})
  = √(σ^{−2} X′i1(Σ_{i=1}^n Xi1X′i1)⁻¹Xi1).

Therefore, for any ξ > 0,

(1/n) Σ_{i=1}^n E[‖R(n)i εi‖² 1{‖R(n)i εi‖ > √n ξ}]
  = (1/n) Σ_{i=1}^n R(n)′i R(n)i E[εi² 1{‖R(n)i εi‖ > √n ξ}]
  ≤ (1/n) Σ_{i=1}^n R(n)′i R(n)i E[εi² 1{|εi| > √n ξ/max_{1≤i≤n} ‖R(n)i‖}]
  = (1/n) Σ_{i=1}^n R(n)′i R(n)i o(1)
  = (1/n) Σ_{i=1}^n X′i1(n⁻¹X′1X1)⁻¹A′nΣn⁻¹An(n⁻¹X′1X1)⁻¹Xi1 o(1)
  = Σ_{i=1}^n tr{(n⁻¹X′1X1)⁻¹A′nΣn⁻¹An(n⁻¹X′1X1)⁻¹ Xi1X′i1/n} o(1)
  = tr{(n⁻¹X′1X1)⁻¹A′nΣn⁻¹An} o(1)
  = tr{Σn⁻¹An((1/n)Σ_{i=1}^n Xi1X′i1)⁻¹A′n} o(1)
  = o(1) d.

So

Zn →D N(0d, Id)

follows from the Lindeberg–Feller central limit theorem and Var(Zn) = Id.

Proof of Theorem 6. The vector being considered satisfies

n^{−1/2} Σn^{−1/2} An E^{−1/2}[Xi1X′i1] Σ_{i=1}^n Xi1X′i1(β1n − β1) = n^{−1/2} Σn^{−1/2} An E^{−1/2}[Xi1X′i1] Σ_{i=1}^n εiXi1

with probability tending to 1. Let Zni = n^{−1/2} Σn^{−1/2} An E^{−1/2}[Xi1X′i1] Xi1εi, i = 1, . . . , n. The {Zni, n = 1, 2, . . . , i = 1, . . . , n} form a triangular array, and within each row they are i.i.d. random vectors. First,

Var(Σ_{i=1}^n Zni) = Var(Σn^{−1/2} An E^{−1/2}[Xi1X′i1] X11 ε) = Id.

Second, under (B1.a),

Σ_{i=1}^n E[‖Zni‖² 1{‖Zni‖ > ξ}] = n E[‖Zn1‖² 1{‖Zn1‖ > ξ}] ≤ n E^{1/2}[‖Zn1‖⁴] P^{1/2}(‖Zn1‖ > ξ) = o(1),

since

E^{1/2}[‖Zn1‖⁴] = E^{1/2}[(Z′n1Zn1)²]
  = (1/n) E^{1/2}[ε⁴ (X′11 E^{−1/2}[X11X′11] A′nΣn⁻¹An E^{−1/2}[X11X′11] X11)²]
  ≤ (1/n) σ4^{1/2} ρmax(A′nΣn⁻¹An) ρmax(E⁻¹[X11X′11]) E^{1/2}[(X′11X11)²]
  ≤ (1/n) σ4^{1/2} ρmax(Σn⁻¹AnA′n) ρ1⁻¹ E^{1/2}[(X′11X11)²]
  = (1/n) σ4^{1/2} σ⁻² ρ1⁻¹ E^{1/2}[(Σ_{j=1}^{kn} X1j²)²]
  = O(kn/(nρ1)),

and

P^{1/2}(‖Zn1‖ > ξ) ≤ E^{1/2}(Z′n1Zn1)/ξ = √d/(√n ξ).

By the Lindeberg–Feller central limit theorem we then have

n^{−1/2} Σn^{−1/2} An E^{−1/2}[Xi1X′i1] X′1X1(β1n − β1) →D N(0d, Id).

Acknowledgments. JH is honored and delighted to have the opportunity to contribute to this monograph in celebration of Professor Piet Groeneboom's 65th birthday and his contributions to mathematics and statistics. The authors thank the editors and an anonymous referee for their constructive comments, which led to significant improvement of this article.

References

[1] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.

[2] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961.

[3] Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109–148.

[4] Huang, J., Horowitz, J. L. and Ma, S. G. (2006). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Technical Report No. 360, Department of Statistics and Actuarial Science, University of Iowa.

[5] Huber, P. J. (1981). Robust Statistics. Wiley, New York.

[6] Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617–1642.

[7] Knight, K. and Fu, W. J. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378.

[8] Loève, M. (1963). Probability Theory. Van Nostrand, Princeton.

[9] Portnoy, S. (1984). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large: I. Consistency. Ann. Statist. 12 1298–1309.

[10] Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large: II. Normal approximation. Ann. Statist. 13 1403–1417.

[11] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.

[12] Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.

[13] Wang, S. and Jia, Z. (1993). Inequalities in Matrix Theory. Anhui Education Press.

[14] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 301–320.

Page 180: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotics: Particles, Processes and Inverse ProblemsVol. 55 (2007) 167–178c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000346

Critical scaling of stochastic

epidemic models∗

Steven P. Lalley1

University of Chicago

To Piet Groeneboom, on the occasion of his 39th birthday.

Abstract: In the simple mean-field SIS and SIR epidemic models, infectionis transmitted from infectious to susceptible members of a finite populationby independent p−coin tosses. Spatial variants of these models are proposed,in which finite populations of size N are situated at the sites of a latticeand infectious contacts are limited to individuals at neighboring sites. Scalinglaws for both the mean-field and spatial models are given when the infectionparameter p is such that the epidemics are critical. It is shown that in allcases there is a critical threshold for the numbers initially infected: below thethreshold, the epidemic evolves in essentially the same manner as its branchingenvelope, but at the threshold evolves like a branching process with a size-dependent drift.

1. Stochastic epidemic models

1.1. Mean-field models

The simplest and most thoroughly studied stochastic models of epidemics are mean-field models, in which all individuals of a finite population interact in the samemanner. In these models, a contagious disease is transmitted among individuals ofa homogeneous population of size N . In the simple SIS epidemic, individuals areat any time either infected or susceptible; infected individuals remain infected forone unit of time and then become susceptible. In the simple SIR epidemic (morecommonly known as the Reed-Frost model), individuals are either infected, suscep-tible, or recovered ; infected individuals remain infected for one unit of time, afterwhich they recover and acquire permanent immunity from future infection. In bothmodels, the mechanism by which infection occurs is random: At each time, for anypair (i, s) of an infected and a susceptible individual, the disease is transmittedfrom i to s with probability p = pN . These transmission events are mutually in-dependent. Thus, in both the SIR and the SIS model, the number Jt+1 = JN

t ofinfected individuals at time t + 1 is given by

(1) Jt+1 =St∑

s=1

ξs,

∗Supported by NSF Grant DMS-04-05102.1University of Chicago, Department of Statistics, 5734 S. University Avenue, Eckhart 118,

Chicago, Illinois 60637, USA, e-mail: [email protected] 2000 subject classifications: 60K30, 60H30, 60K35.Keywords and phrases: stochastic epidemic model, spatial epidemic, Feller diffusion, branching

random walk, Dawson-Watanabe process, critical scaling.

167

Page 181: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

168 S. P. Lalley

where St = SNt is the number of susceptibles at time t and the random variables ξs

are, conditional on the history of the epidemic to time t, independent, identicallydistributed Bernoulli-1 − (1 − p)Jt . In the SIR model,

Rt+1 = Rt + Jt and(2)St+1 = St − Jt+1,

where Rt is the number of recovered individuals at time t, while in the SIS model,

(3) St+1 = St + Jt − Jt+1.

In either model, the epidemic ends at the first time T when JT = 0. The most basicand interesting questions concerning these models have to do with the duration Tand size

∑t≤T Jt of the epidemic and their dependence on the infection parameter

pN and the initial conditions.

1.2. Spatial SIR and SIS epidemics

In the simple SIS and SIR epidemics, no allowance is made for geographic or socialstratifications of the population, nor for variability in susceptibility or degree ofcontagiousness. Following are descriptions of simple stochastic models that incor-porate a geographic stratification of a population. We shall call these the (spatial)SIS−d and SIR−d epidemics, with d denoting the spatial dimension.

Assume that at each lattice point x ∈ Zd is a homogeneous population of Nx in-

dividuals, each of whom may at any time be either susceptible or infected, or (in theSIR variants) recovered. These populations may be thought of as “villages”. As inthe mean-field models, infected individuals remain contagious for one unit of time,after which they recover with immunity from future infection (in the SIR variants)or once again become susceptible (in the SIS models). At each time t = 0, 1, 2, . . . ,for each pair (ix, sy) of an infected individual located at x and a susceptible indi-vidual at y, the disease spreads from ix to sy with probability α(x, y).

The simple Reed-Frost and stochastic logistic epidemics described in section 1.1terminate with probability one, regardless of the value of the infection parameterp, because the population is finite. For the spatial SIS and SIR models this isno longer necessarily the case: If

∑x∈Zd Nx = ∞ then, depending on the value

of the parameter p and the dimension d, the epidemic may persist forever withpositive probability. (For instance, if Nx = 1 for all x and α(x, y) = p for nearestneighbor pairs x, y but α(x, y) = 0 otherwise, then the SIS −d epidemic is justoriented percolation on Z

d+1, which is known to survive with positive probabilityif p exceeds a critical value pc < 1 [6].) Obvious questions of interest center on howthe epidemic spreads through space, and in cases where it eventually dies out, howfar it spreads.

The figure below shows a simulation of an SIS-1 epidemic with village sizeNx = 20224 and infection parameter 1/60672. At time 0 there were 2048 infectedindividuals at site 0; all other individuals were healthy. The epidemic lasted 713generations (only the first 450 are shown).

1.3. Epidemic models and random graphs

All of the models described above have equivalent descriptions as structured ran-dom graphs, that is, percolation processes. Consider for definiteness the simple SIR

Page 182: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Critical Scaling of Stochastic Epidemic Models 169

Fig 1.

(Reed-Frost) epidemic. In this model, no individual may be infected more than once;furthermore, for any pair x, y of individuals, there will be at most one opportunityfor infection to pass from x to y or from y to x during the course of the epidemic.Thus, one could simulate the epidemic by first tossing a p−coin for every pair x, y,drawing an edge between x and y for each coin toss resulting in a Head, and thenusing the resulting (Erdos-Renyi) random graph determined by these edges to de-termine the course of infection in the epidemic. In detail: If Y0 is the set of infectedindividuals at time 0, then the set Y1 of individuals infected at time 1 consists ofall x /∈ Y0 that are connected to individuals in Y0, and for any subsequent time n,the set Yn+1 of individuals infected at time n+1 consists of all x /∈ ∪j≤nYj who areconnected to individuals in Yn. Note that the set of individuals ultimately infectedduring the course of the epidemic is the union of those connected components ofthe random graph containing at least one vertex in Y0.

Similar random graph descriptions may be given for the simple SIS and thespatial SIS and SIR epidemic models.

1.4. Branching envelopes of epidemics

For each of the stochastic epidemic models discussed above there is an associatedbranching process that serves, in a certain sense, as a “tangent” to the epidemic.We shall refer to this branching process as the branching envelope of the epidemic.The branching envelopes of the simple mean-field epidemics are ordinary Galton-Watson processes; the envelopes of the spatial epidemics are branching randomwalks. There is a natural coupling of each epidemic with its branching envelope inwhich the set of infected individuals in the epidemic is at each time (and in thespatial models, at each location) dominated by the corresponding set of individualsin the branching envelope.

Page 183: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

170 S. P. Lalley

Fig 2.

Following is a detailed description of the natural coupling of the simple SISepidemic with its branching envelope. The branching envelope is a Galton-Watsonprocess Zn with offspring distribution Binomial-(N, p), where p is the infectionparameter of the epidemic, and whose initial generation Z0 coincides with the setof individuals who are infected at time 0. Particles in the Galton-Watson processare marked red or blue: red particles represent infected individuals in the coupledepidemic, while blue offspring of red parents represent attempted infections that arenot allowed because the attempt is made on an individual who is not susceptible,or has already been infected by another contagious individual. Colors are assignedas follows: (1) Offspring of blue particles are always blue. (2) Each red particlereproduces by tossing a p−coin N times, once for each individual i in the population.Each Head counts as an offspring, and each represents an attempted infection. Ifseveral red particles attempt to infect the same individual i, exactly one of these ismarked as a success (red), and the others are marked as failures (blue). Also, if anattempt is made to infect an individual who is not susceptible, the correspondingparticle is colored blue. Clearly, the collection of all particles (red and blue) evolvesas a Galton-Watson process, while the collection of red particles evolves as theinfected set in the SIS epidemic. See the figure below for a typical evolution of thecoupling in a population of size N = 80, 000 with p = 1/80000 and 200 individualsinitially infected.

2. Critical behavior: mean-field case

When studying the behavior of the simple SIR and SIS epidemics in large pop-ulations, it is natural to consider the scaling p = pN = λN/N for the infectionparameter p. In this scaling, λ = λN is the mean of the offspring distribution inthe branching envelope. If λ < 1 then the epidemic will end quickly, even if a largenumber of individuals are initially infected. On the other hand, if λ > 1 then withpositive probability (approximately one minus the extinction probability for theassociated branching envelope), even if only one individual is initially infected, theepidemic will be large, with a positive limiting fraction of the population eventuallybeing infected. The large-N behavior of the size of the SIR epidemic in this case is

Page 184: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Critical Scaling of Stochastic Epidemic Models 171

well understood: see for example [12] and [14].

2.1. Critical scaling: size of the epidemic

The behavior of both the SIS and SIR epidemics is more interesting in the criticalcase λN ≈ 1. When the set of individuals initially infected is sufficiently small rel-ative to the population size, the epidemic can be expected to evolve in much thesame manner as a critical Galton-Watson process with Poisson-1 offspring distrib-ution. However, when the size of the initially infected set passes a certain criticalthreshold, then the epidemic will begin to deviate substantially from the branchingenvelope. For the SIR case, the critical threshold was (implicitly) shown by [11] and[1] (see also [12]) to be at N1/3, and that the critical scaling window is of widthN−4/3:

Theorem 1 ([11], [1]). Assume that pN = 1/N +a/N4/3 +o(n−4/3), and that thenumber JN

0 of initially infected individuals is such that JN0 /N1/3 → b as the popu-

lation size N → ∞. Then as N → ∞, the size UN :=∑

t Jt obeys the asymptoticlaw

(4) UN/N2/3 D−→ Tb

where Tb is the first passage time to the level b by Wt + t2/2 + at, and Wt is astandard Wiener process.

The distribution of the first passage time Tb can be given in closed form: See[11], also [8], [13].

For the critical SIS epidemic, the critical threshold is at N1/2, and the criticalscaling window is of width N−3/2:

Theorem 2 ([4]). Assume that pN = 1/N + a/N3/2 + o(n−3/2), and that theinitial number of infected individuals satisfies JN

0 ∼ bN1/2 as N → ∞. Then thetotal number of infections UN :=

∑t Jt during the course of the epidemic obeys

(5) UN/ND−→ τ(b − a;−a)

where τ(x; y) is the time of first passage to y by a standard Ornstein-Uhlenbeckprocess started at x.

2.2. Critical scaling: time evolution of the epidemic

For both the SIR and SIS epidemics, if the number of individuals initially infectedis much below the critical threshold then the evolution of the epidemic will not differnoticeably from that of its branching envelope. It was observed by [7] (and provedby [9]) that a (near-) critical Galton-Watson process initiated by a large numberM of individuals behaves, after appropriate rescaling, approximately as a Fellerdiffusion: In particular, if ZM

n is the size of the nth generation of a Galton-Watsonwith ZM

0 ∼ bM with offspring distribution Poisson(1 + a/M)then as M → ∞,

(6) ZM[Mt]/M

D−→ Yt

where Yt satisfies the stochastic differential equation

dYt = aYt dt +√

Yt dWt,(7)Y0 = b.

Page 185: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

172 S. P. Lalley

What happens at the critical threshold, in both the SIR and SIS epidemics, is thatthe deviation from the branching envelope exhibits itself as a size-dependent driftin the limiting diffusion:

Theorem 3 ([4]). Let JN (n) = JN[n] be the number infected in the nth generation

of a simple SIS epidemic in a population of size N . Then under the hypotheses ofTheorem 2,

(8) JN (√

Nt)/√

ND−→ Yt

where Y0 = b and Yt obeys the stochastic differential equation

(9) dYt = (aYt − Y 2t ) dt +

√Yt dWt

Note that the diffusion (9) has an entrance boundary at ∞, so that it is possibleto define a version Yt of the process with initial condition Y0 = 0. When the SISepidemic is begun with JN

0 √

N initially infected, the number JNt infected will

rapidly drop (over the first ε√

N generations) until reaching a level of order√

N , andthen evolve as predicted by (8). The following figure depicts a typical evolution in apopulation of size N = 80, 000, with infection parameter p = 1/N and I0 = 10, 000initially infected.

Theorem 4 ([4]). Let JN (n) = JN[n] and RN (n) = RN

[n] be the numbers of infectedand recovered individuals in the nth generation of a simple SIR epidemic in apopulation of size N . Then under the hypotheses of Theorem 1,

(10)(

N−1/3JN (N1/3t)N−2/3RN (N1/3t)

)D−→

(J(t)R(t)

)

where J0 = b, R0 = 0, and

dJ(t) = (aJ(t) − J(t)R(t)) dt +√

J(t) dWt,(11)dR(t) = J(t) dt.

Theorems 1–2 can be deduced from Theorems 3–4 by simple time-change argu-ments (see [4]).

Fig 3.

Page 186: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Critical Scaling of Stochastic Epidemic Models 173

2.3. Critical scaling: heuristics

The critical thresholds for the SIS−d and SIR−d epidemics can be guessed by sim-ple comparison arguments using the standard couplings of the epidemics with theirbranching envelopes. Consider first the critical SIS epidemic in a population of sizeN . Recall (Section 1.4) that the branching envelope is a critical Galton-Watsonprocess whose offspring distribution is Binomial-(N, 1/N). The particles of thisGalton-Watson process are marked red or blue, in such a way that in each genera-tion the number of red particles coincides with the number of infected individualsin the SIS epidemic. Offspring of blue particles are always blue, but offspring ofred particles may be either red or blue; the blue offspring of red parents in eachgeneration represent attempted infections that are suppressed.

Assume that initially there are Nα infected individuals and thus also Nα indi-viduals in the zeroth generation of the branching envelope. By Feller’s theorem, wemay expect that the extinction time of the branching envelope will be on the orderNα, and that in each generation up to (shortly before) extinction the branchingprocess will have order Nα individuals. If α is small enough that the SIS epidemicobeys the same rough asymptotics (that is, stays alive for O(Nα) generations andhas O(Nα) infected individuals in each generation), then the number of blue off-spring of red parents in each generation will be on the order N×(N2α/N2) (becausefor each of the N individuals of the population, the chance of a double infectionis about N2α/N2). Since the duration of the epidemic will be of the same roughorder of magnitude as the size of the infected set in each generation, there shouldbe at most O(1) blue offspring of red parents in any generation (if there were more,the red population would die out long before the blue). Thus, the critical thresholdmust be at N1/2.

A similar argument applies for the SIR epidemic. The branching envelope of thecritical SIR is once again a critical Galton-Watson process with offspring distrib-ution Binomial-(N, 1/N), with constituent particles again labeled red or blue, redparticles representing infected individuals in the epidemic. The rule by which redparticles reproduce is as follows: Each red particle tosses a p−coin N times oncefor each individual i in the population. Each Head counts as an offspring, and rep-resents an attempted infection. However, if a Head occurs on a toss at individuali where i was infected in an earlier generation, then the Head results in a blueoffspring. Similarly, if more than one red particle tosses a Head at an individual iwhich has not been infected earlier, then one of these is labeled red and the excessare all labeled blue.

Assume that initially there are Nα infected individuals. As before, we may expectthat the extinction time of the branching envelope will be on the order Nα, andthat in each generation up to extinction the branching process will have orderNα individuals. If α is small enough, the extinction time and the size of the redpopulation will also be O(Nα). Consequently, the size of the recovered populationwill be (for all but the first few generations) on order N2α. Thus, in each generation,the number of blue offspring of red parents will be on order (N2α/N)×Nα (becausethe chance that a recovered individual is chosen for attempted infection by aninfected individual is O(Nα/N)). Therefore, by similar reasoning as in the SIScase, the critical threshold is at N1/3, as this is where the the number of blueoffspring of red parents in each generation is O(1).

Page 187: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

174 S. P. Lalley

3. Critical behavior: SIS-1 and SIR-1 Spatial epidemics

Consider now the spatial SIS -d and SIR-d epidemic models on the d-dimensionalinteger lattice Z

d. Assume that the village size Nx = N is the same for all sites x ∈Z

d, and that the infection probabilities α(x, y) are nearest neighbor homogeneous,and uniform, that is,

(12) α(x, y) =

{p = pN , if |x − y| ≤ 1;0, otherwise.

3.1. Scaling limits of branching random walks

The branching envelope of a spatial SIS−d or SIR−d epidemic is a nearest neighborbranching random walk on the integer lattice Z

d. This evolves as follows: Any parti-cle located at site x at time t lives for one unit of time and then reproduces, placingrandom numbers ξy of offspring at the sites y such that |y − x| ≤ 1. The randomvariables ξy are mutually independent, with Binomial-(N, pN ) distributions.

The analogue for branching random walks of Feller’s theorem for Galton-Watsonprocesses is Watanabe’s theorem. This asserts that, after suitable rescaling, as theparticle density increases, critical branching random walks converge to a limit,the Dawson-Watanabe process, also known as super Brownian motion. A precisestatement follows: Consider a sequence of branching random walks, indexed byM = 1, 2, . . . , with offspring distribution Binomial-(N, pM ) as above, and

(13) pM = pN,M =1

(2d + 1)N− a

NM.

(Note: N may depend on M .) The rescaled measure-valued process XMt associated

with the Mth branching random walk puts mass 1/M at location x/√

M and time tfor each particle of the branching random walk that is located at site x at time [Mt].(Note: The branching random walk is a discrete-time process, but the associatedmeasure-valued process runs in continuous time.)

Watanabe’s theorem ([15]). Assume that the initial values XM0 converge weakly

(as finite Borel measures on Rd) to a limit measure X0. Then under the hypothesis

(13) the measure-valued processes XMt converge in law as M → ∞ to a limit Xt:

(14) XMt =⇒ Xt.

The limit process is the Dawson-Watanabe process with killing rate a and ini-tial value X0. (The term killing rate is used because the process can be obtainedfrom the “standard” Dawson-Watanabe process (a = 0) by elimination of massat constant rate a.) The Dawson-Watanabe process Xt with killing rate a can becharacterized by a martingale property: For each test function φ ∈ C2

c (Rd),

(15) 〈Xt, φ〉 − 〈X0, φ〉 −σ

2

∫ t

0

〈Xs, ∆φ〉 ds + a

∫ t

0

〈Xs, ϕ〉 ds

is a martingale. Here σ2 = 2d/(2d + 1) is the variance of the random walk ker-nel naturally associated with the branching random walks. It is known [10] thatin dimension d = 1 the random measure Xt is for each t absolutely continuousrelative to Lebesgue measure, and the Radon-Nikodym derivative X(t, x) is jointlycontinuous in t, x (for t > 0). In dimensions d ≥ 2 the measure Xt is almost surelysingular, and is supported by a Borel set of Hausdorff dimension 2 [3].

Page 188: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Critical Scaling of Stochastic Epidemic Models 175

3.2. Spatial SIS-1 and SIR-1 epidemics: critical scaling

As in the mean-field case, there are critical thresholds for the SIS -1 and SIR-1epidemics at which they begin to deviate noticeably from their branching envelopes.These are at N2/3 and N2/5, respectively:

Theorem 5 ([5]). Fix α > 0, and let XNt be the discrete-time measure-valued

process obtained from an SIS−1 or an SIR−1 epidemic on a one-dimensional gridof size-N villages by attaching mass 1/Nα to the point (t, x/Nα/2) for each infectedindividual at site x at time [tNα]. Assume that XN

0 converges weakly to a limitmeasure X0 as the village size N → ∞. Then as N → ∞,

(16) XN[Nαt]

D−→ Xt,

where Xt is a measure-valued process with initial value X0 whose law depends onthe value of α and the type of epidemic (SIS or SIR) as follows:

(a) SIS: If α < 23 then Xt is a Dawson-Watanabe process with variance σ2.

(b) SIS: If α = 23 then Xt is a Dawson-Watanabe process with variance σ2 and

killing rate

(17) θ(x, t) = X(x, t)/2.

(c) SIR: If α < 25 then Xt is a Dawson-Watanabe process with variance σ2.

(d) SIR: If α = 25 then Xt is a Dawson-Watanabe process with variance σ2 and

killing rate

(18) θ(x, t) = X(x, t)∫ t

0

X(x, s) ds.

The Dawson-Watanabe process with variance σ2 and (continuous, adapted)killing rate θ(t, x, ω) is characterized [2] by a martingale problem similar to (15)above: for each test function φ ∈ C2

c (R),

(19) 〈Xt, φ〉 − 〈X0, φ〉 −σ

2

∫ t

0

〈Xs, ∆φ〉 ds +∫ t

0

〈Xs, θϕ〉 ds

is a martingale. The law of this process is mutually absolutely continuous relativeto that of the Dawson-Watanabe process with no killing, and there is an explicitformula for the Radon-Nikodym derivative – see [2].

3.3. Critical scaling for spatial epidemics: heuristics

Arguments similar to those given above for the mean-field SIS and SIR epidemicscan be used to guess the critical thresholds for the spatial SIS-1 and SIR-1 epi-demics. For the spatial epidemics, the associated branching envelopes are branchingrandom walks. In the standard couplings, particles of the branching envelope arelabeled either red or blue, with the red particles representing infected individualsin the epidemics. As in the mean-field cases, offspring of blue particles are alwaysblue, but offspring of red particles may be either red or blue; blue offspring of redparents represent attempted infections that are suppressed. These may be viewedas an attrition of the red population (since blue particles created by red parentsare potential red offspring that are not realized!).

Page 189: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

176 S. P. Lalley

Consider first the SIS-1 epidemic. Assume that initially there are Nα parti-cles, distributed (say) uniformly among the Nα/2 sites nearest the origin. Then byFeller’s limit theorem (recall that the total population size in a branching randomwalk is a Galton-Watson process), the branching envelope can be expected to sur-vive for OP (Nα) generations, and at any time prior to extinction the populationwill have OP (Nα) members. These will be distributed among the sites at distanceOP (Nα/2) from the origin, and therefore in dimension d = 1 there should be aboutOP (Nα/2) particles per site. Consequently, for the SIS−1 epidemic, the rate ofattrition per site per generation should be OP (Nα−1), and so the total attritionrate per generation should be OP (N3α/2−1). If α = 2/3, then the total attritionrate per generation will be OP (1), just enough so that the total attrition throughthe duration of the branching random walk envelope will be on the same order ofmagnitude as the population size Nα.

For the SIR−1 epidemic there is a similar heuristic calculation. As for the SIS−1epidemic, the branching envelope will survive for OP (Nα) generations, and upto the time of extinction the population should have OP (Nα) individuals, aboutOP (Nα/2) per site. Therefore, through Nα generations, about Nα×Nα/2 numbers jwill be retired, and so the attrition rate per site per generation should be OP (Nα/2×N3α/2), making the total attrition rate per generation OP (N5α/2). Hence, if α = 2/5then the total attrition per generation should be OP (1), just enough so that thetotal attrition through the duration of the branching random walk envelope will beon the same order of magnitude as the population size.

3.4. Critical scaling in dimensions d ≥ 2

In higher dimensions, the critical behavior of the SIS -d and SIR-d epidemics appearsto be considerably different. We expect that there will be no analogous thresholdeffect, in particular, we expect that the epidemic will behave in the same manner asthe branching envelope up to the point where the infected set is a positive fraction ofthe total population. This is because in dimensions d ≥ 2, the particles of a criticalbranching random walk quickly spread out, so that (after a short initial period)there are only OP (1) (in dimensions d ≥ 3) or OP (log N) (in dimension d = 2)particles per site. (With N particles initially, a critical branching random walktypically lives O(N) generations, and particles are distributed among the sites atdistance O(

√N) from the origin; in dimensions d ≥ 2, there are O(Nd/2) such sites,

enough to accomodate the O(N) particles of the branching random walk withoutcrowding.) Consequently, the rate at which “multiple” infections are attempted(that is, attempts by more than one contagious individual to simultaneously infectthe same susceptible) is only of order OP (1/N) (or, in dimension d = 2, orderOP (log N/N)).

The interesting questions regarding the evolution of critical epidemics in dimen-sions d ≥ 2 center on the initial stages, in the relatively small amount of time (ordero(N) generations) in which the particles spread out from their initial sites. Thesewill be discussed in the forthcoming University of Chicago Ph. D. dissertation ofXinghua Zheng.

3.5. Weak convergence of densities

There is an obvious gap in the heuristic argument of Section 3.3 above: Even if thetotal number of infected individuals is, as expected, on the order Nα, and even if

Page 190: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Critical Scaling of Stochastic Epidemic Models 177

these are concentrated in the sites at distance on the order Nα/2 from the origin, itis by no means obvious that these will distribute themselves uniformly (or at leastlocally uniformly) among these sites. The key step in filling this gap in the argumentis to show that the particles of the branching envelope distribute themselves moreor less uniformly on scales smaller than Nα/2.

Consider, as in Section 3.1, a sequence of branching random walks, indexed byM = 1, 2, . . . , with offspring distribution Binomial-(N, pM ) as above, and pM givenby (13). Let Y M

t (x) be the number of particles at site x at time [t], and let XM (t, x)be the continuous function of t ≥ 0 and x ∈ R obtained by linear interpolation fromthe values

(20) XM (t, x) =YMt(

√Mx)√

Mfor Mt ∈ Z+ and

√Mx ∈ Z.

Theorem 6 ([5]). Assume that d = 1. Assume also that the initial particle con-figuration is such that all particles are located in an interval [−κ

√M, κ

√M ] and

such that the initial particle density satisfies

(21) XM (0, x) =⇒ X(0, x)

as M → ∞ for some continuous function X(0, x) with support [−κ, κ]. Then asM → ∞,

(22) XM (t, x) =⇒ X(t, x),

where X(t, x) is the density function of the Dawson-Watanabe process with killingrate a and initial value X(0, x). The convergence is relative to the topology of uni-form convergence on compacts in the space C(R+ × R) of continuous functions.

Since the measure-valued processes associated with the densities XM (t, x) areknown to converge to the Dawson-Watanabe process, by Watanabe’s theorem, toprove Theorem 6 it suffices to establish tightness. This is done by a somewhattechnical application of the Kolmogorov-Chentsov tightness criterion, based on acareful estimation of moments. See [5] for details.

It is also possible to show that convergence of the particle density processes holdsin Theorem 5.

References

[1] Aldous, D. (1997). Brownian excursions, critical random graphs and themultiplicative coalescent. Ann. Probab. 25 812–854.

[2] Dawson, D. A. (1978). Geostochastic calculus. Canad. J. Statist. 6 143–168.[3] Dawson, D. A. and Hochberg, K. J. (1979). The carrying dimension of a

stochastic measure diffusion. Ann. Probab. 7 693–703.[4] Dolgoarshinnykh, R. and Lalley, S. P. (2006). Critical scaling for the

sis stochastic epidemic. J. Appl. Probab. 43 892–898.[5] Dolgoarshinnykh, R. and Lalley, S. P. (2006). Spatial epidemics: Crit-

ical behavior. Preprint.[6] Durrett, R. (1984). Oriented percolation in two dimensions. Ann. Probab.

12 999–1040.[7] Feller, W. (1939). Die Grundlagen der Volterraschen Theorie des Kampfes

ums Dasein in wahrscheinlichkeitstheoretischer Behandlung. Acta Bioth. Ser.A 5 11–40.

Page 191: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

178 S. P. Lalley

[8] Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airyfunctions. Probab. Theory Related Fields 81 79–109.

[9] Jirina, M. (1969). On Feller’s branching diffusion processes. Casopis Pest.Mat. 94 84–90. 107.

[10] Konno, N. and Shiga, T. (1988). Stochastic partial differential equationsfor some measure-valued diffusions. Probab. Theory Related Fields 79 201–225.

[11] Martin-Lof, A. (1998). The final size of a nearly critical epidemic, and thefirst passage time of a Wiener process to a parabolic barrier. J. Appl. Probab.35 671–682.

[12] Nagaev, A. V. and Starcev, A. N. (1970). Asymptotic analysis of a certainstochastic model of an epidemic. Teor. Verojatnost. i Primenen 15 97–105.

[13] Salminen, P. (1988). On the first hitting time and the last exit time for aBrownian motion to/from a moving boundary. Adv. in Appl. Probab. 20 411–426.

[14] Sellke, T. (1983). On the asymptotic distribution of the size of a stochasticepidemic. J. Appl. Probab. 20 390–394.

[15] Watanabe, S. (1968). A limit theorem of branching processes and continuousstate branching processes. J. Math. Kyoto Univ. 8 141–167.

Page 192: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph SeriesAsymptotics: Particles, Processes and Inverse ProblemsVol. 55 (2007) 179–195c© Institute of Mathematical Statistics, 2007DOI: 10.1214/074921707000000355

Additive isotone regression

Enno Mammen1,∗ and Kyusang Yu1,∗

Universitat Mannheim

This paper is dedicated to Piet Groeneboom

on the occasion of his 65th birthday

Abstract: This paper is about optimal estimation of the additive componentsof a nonparametric, additive isotone regression model. It is shown that asymp-totically up to first order, each additive component can be estimated as well asit could be by a least squares estimator if the other components were known.The algorithm for the calculation of the estimator uses backfitting. Conver-gence of the algorithm is shown. Finite sample properties are also comparedthrough simulation experiments.

1. Introduction

In this paper we discuss nonparametric additive monotone regression models. Wediscuss a backfitting estimator that is based on iterative application of the pooladjacent violator algorithm to the additive components of the model. Our mainresult states the following oracle property. Asymptotically up to first order, eachadditive component is estimated as well as it would be (by a least squares estimator)if the other components were known. This goes beyond the classical finding thatthe estimator achieves the same rate of convergence, independently of the numberof additive components. The result states that the asymptotic distribution of theestimator does not depend on the number of components.

We have two motivations for considering this model. First of all we think that thisis a useful model for some applications. For a discussion of isotonic additive regres-sion from a more applied point, see also Bacchetti [1], Morton-Jones et al. [32] andDe Boer, Besten and Ter Braak [7]. But our main motivation comes from statisticaltheory. We think that the study of nonparametric models with several nonparamet-ric components is not fully understood. The oracle property that is stated in thispaper for additive isotone models has been shown for smoothing estimators in someother nonparametric models. This property is expected to hold if the estimation ofthe different nonparametric components is based on local smoothing where the lo-calization takes place in different scales. An example are additive models of smoothfunctions where each localization takes place with respect to another covariate. InMammen, Linton and Nielsen [28] the oracle property has been verified for the lo-cal linear smooth backfitting estimator. As local linear estimators, also the isotonicleast squares is a local smoother. The estimator is a local average of the responsevariable but in contrast to local linear estimators the local neighborhood is chosen

∗Research of this paper was supported by the Deutsche Forschungsgemeinschaft project MA1026/7-3 in the framework of priority program SPP-1114.

1Department of Economics, Universitat Mannheim, L 7, 3–5, 68131 Mannheim, Germany, e-mail: [email protected]; [email protected]

AMS 2000 subject classifications: 62G07, 62G20.Keywords and phrases: isotone regression, additive regression, oracle property, pool adjacent

violator algorithm, backfitting.

179

Page 193: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

180 E. Mammen and K. Yu

by the data. This data adaptive choice is automatically done by the least squaresminimization. This understanding of isotonic least squares as a local smoother wasour basic motivation to conjecture that for isotonic least squares the oracle propertyshould hold as for local linear smooth backfitting.

It may be conjectured that the oracle property holds for a much larger class ofmodels. In Horowitz, Klemela and Mammen [19] a general approach was introducedfor applying one-dimensional nonparametric smoothers to an additive model. Theprocedure consists of two steps. In the first step, a fit to the additive model isconstructed by using the projection approach of Mammen, Linton and Nielsen [28].This preliminary estimator uses an undersmoothing bandwidth, so its bias termsare of asymptotically negligible higher order. In a second step, a one-dimensionalsmoother operates on the fitted values of the preliminary estimator. For the re-sulting estimator the oracle property was shown: This two step estimator is as-ymptotically equivalent to the estimator obtained by applying the one-dimensionalsmoother to a nonparametric regression model that only contains one component.It was conjectured that this result also holds in more general models where sev-eral nonparametric components enter into the model. Basically, a proof could bebased on this two step procedures. The conjecture has been verified in Horowitz andMammen [20, 22] for generalized additive models with known and with unknownlink function.

The study of the oracle property goes beyond the classical analysis of rates of con-vergence. Rates of convergence of nonparametric estimators depend on the entropyof the nonparametric function class. If several nonparametric functions enter intothe model the entropy is the sum of the entropies of the classes of the components.This implies that the resulting rate coincides with the rate of a model that onlycontains one nonparametric component. Thus, rate optimality can be shown for alarge class of models with several nonparametric components by use of empiricalprocess theory, see e.g. van de Geer [39]. Rate optimality for additive models wasfirst shown in Stone [38]. This property was the basic motivation for using additivemodels. In contrast to a full dimensional model it allows estimation with the samerate of convergence as a one-dimensional model and avoids for this reason the curseof dimensionality. On the other hand it is a very flexible model that covers manyfeatures of the data nonparametrically. For a general class of nonparametric modelswith several components rate optimality is shown in Horowitz and Mammen [21].

The estimator of this paper is based on backfitting. There is now a good under-standing of backfitting methods for additive models. For a detailed discussion of thebasic statistical ideas see Hastie and Tibshirani [18]. The basic asymptotic theoryis given in Opsomer and Ruppert [34] and Opsomer [35] for the classical backfittingand in Mammen, Linton and Nielsen [28] for the smooth backfitting. Bandwidthchoice and practical implementations are discussed in Mammen and Park [29, 30]and Nielsen and Sperlich [33]. The basic difference between smooth backfitting andbackfitting lies in the fact that smooth backfitting is based on a smoothed leastsquares criterion whereas in the classical backfitting smoothing takes place only forthe updated component. The full smoothing of the smooth backfitting algorithmstabilizes the numerical and the statistical performance of the estimator. In par-ticular this is the case for degenerated designs and for the case of many covariatesas was shown in simulations by Nielsen and Sperlich [33]. In this paper we usebackfitting without any smoothing. For this reason isotone additive least squareswill have similar problems as classical backfitting and these problems will be evenmore severe because no smoothing is used at all. Smooth backfitting methods forgeneralized additive models were introduced in Yu, Park and Mammen [42]. Haag

Page 194: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Additive isotone regression 181

[15] discusses smooth backfitting for nonparametric additive diffusion models. Testsbased on smooth backfitting have been considered in Haag [16] and Mammen andSperlich [31]. Backfitting tests have been proposed in Fan and Jiang [10]. Additiveregression is an example of a nonparametric model where the nonparametric func-tion is given as a solution of an integral equation. This has been outlined in Lintonand Mammen [24] and Carrasco, Florens and Renault [6] where also other exam-ples of statistical integral equations are given. Examples are additive models wherethe additive components are linked as in Linton and Mammen [25] and regressionmodels with dependent errors where an optimal transformation leads to an additivemodel, see Linton and Mammen [26]. The representation of estimation in additivemodels as solving an empirical integral equation can also be used to understandwhy the oracle property holds.

In this paper we verify the oracle property for additive models of isotone func-tions. It is shown that each additive component can be estimated with the sameasymptotic accuracy as if the other components would be known. We compare theperformance of a least squares backfitting estimator with a least squares isotoneestimator in the oracle model where only one additive component is unknown. Thebackfitting estimator is based on iterative applications of isotone least squares toeach additive component. Our main theoretical result is that the differences be-tween these two estimators are of second order. This result will be given in thenext section. The numerical performance of the isotone backfitting algorithm andits numerical convergence will be discussed in Section 3. Simulations for the com-parison of the isotone backfitting estimator with the oracle estimator are presentedin Section 4. The proofs are deferred to the Appendix.

2. Asymptotics for additive isotone regression

We suppose that we have i.i.d. random vectors (Y 1, X11 , . . . , X1

d), . . . , (Y n, Xn1 , . . . ,

Xnd ) and we consider the regression model

(1) E(Y i|Xi1, . . . , X

id) = c + m1(Xi

1) + · · · + md(Xid)

where mj(·)’s are monotone functions. Without loss of generality we suppose thatall functions are monotone increasing. We also assume that the covariables takevalues in a compact interval, [0, 1], say. For identifiability we add the normalizingcondition

(2)∫ 1

0

mj(xj) dxj = 0.

The least squares estimator for the regression model (1) is given as minimizer of

(3)n∑

i=1

(Y i − c − µ1(Xi1) − · · · − µd(Xi

d))2

with respect to monotone increasing functions µ1, . . . , µd and a constant c thatfulfill

∫ 1

0µj(xj) dxj = 0. The resulting estimators are denoted as m1, . . . , md and c.

We will compare the estimators mj with oracle estimators mORj that make use

of the knowledge of ml for l �= j. The oracle estimator mORj is given as minimizer

Page 195: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

182 E. Mammen and K. Yu

ofn∑

i=1

(Y i − c − µj(Xi1) −

l �=j

ml(Xil ))

2

=n∑

i=1

(mj(Xij) + εi − c − µj(Xi

1))2

with respect to a monotone increasing function µj and a constant c that fulfill∫ 1

0µj(xj)dxj = 0. The resulting estimators are denoted as mOR

j and cOR.In the case d = 1, this gives the isotonic least squares estimator proposed by

Brunk [4] which is given by

(4) m1(X(i)1 ) = max

s≤imint≥i

t∑

j=s

Y (j)/(t − s + 1)

where X(1)1 , . . . , X

(n)1 are the order statistics of X1, . . . , Xn and Y (j) is the obser-

vation at the observed point X(j)1 . Properties and simple computing algorithms

are discussed e.g. in Barlow et al. [2] and Robertson, Wright, and Dykstra [36]. Afast way to calculate the estimator is to use the Pool Adjacent Violator Algorithm(PAVA). In the next section we discuss a backfitting algorithm for d > 1 that isbased on iterative use of PAVA.

We now state a result for the asymptotic performance of mj . We use the followingassumptions. To have an economic notation, in the assumptions and in the proofswe denote different constants by the same symbol C.

(A1) The functions m1, . . . , md are differentiable and their derivatives are boundedon [0, 1]. The functions are strictly monotone, in particular for G(δ) =inf |u−v|≥δ,1≤j≤d |mj(v) − mj(u)| it holds G(δ) ≥ Cδγ for constants C, γ > 0for all δ > 0.

(A2) The d-dimensional vector Xi = (Xi1, . . . , X

id) has compact support [0, 1]d. The

density p of Xi is bounded away from zero and infinity on [0, 1]d and it iscontinuous. The tuples (Xi, Y i) are i.i.d. For j, k = 1, . . . , d the density pXk,Xj

of (Xik, Xi

j) fulfills the following Lipschitz condition for constants C, ρ > 0

sup0≤uj ,uk,vk≤1

|pXk,Xj (uk, uj) − pXk,Xj (vk, uj)| ≤ C|uk − vk|ρ.

(A3) Given Xi the error variables εi = Y i − c − m1(Xi1) − · · · − md(Xi

d) haveconditional zero mean and subexponential tails, i.e. for some γ > 0 and C ′ >0, it holds that

E[exp(γ|εi|)

∣∣∣Xi]

< C ′ a.s.

The conditional variance of εi given Xi = x is denoted by σ2(x). The condi-tional variance of εi given Xi

1 = u1 is denoted by σ21(u1). We assume that σ2

1

is continuous at x1.

These are weak smoothness conditions. We need (A3) to apply results fromempirical process theory. Condition (A1) excludes the case that a function mj hasflat parts. This is done for the following reason. Isotonic least squares regressionproduces piecewise constant estimators where for every piece the estimator is equalto the sample average of the piece. If the function is strictly monotone the pieces

Page 196: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Additive isotone regression 183

shrink to 0, a.s. If the function has flat parts these averages do not localize atthe flat parts. But in our proof we make essential use of a localization argument.We conjecture that our oracle result that is stated below also holds for the casethat there are flat parts. But we do not pursue to check this here. It is also ofminor interest because at flat parts the monotone least squares estimator is oforder mOR

j −mj = oP (n−1/3). Thus the oracle result mj − mORj = oP (n−1/3) then

only implies that mj − mj = oP (n−1/3). In particular, it does not imply that mj

and mORj have the same asymptotic distribution limit.

For d = 1 the asymptotics for m1 are well known. Note that the estimator m1 ford = 1 coincides with the oracle estimator mOR

1 for d > 1 that is based on isotonizingY i − c − m2(Xi

2) − · · · − md(Xid) = m1(Xi

1) + εi in the order of the values of Xi1

(i = 1, . . . , n). For the oracle model (or for the model (1) with d = 1) the followingasymptotic result holds under (A1)–(A3):

For all x1 ∈ (0, 1) it holds that

mOR1 (x1) − m1(x1) = OP (n−1/3).

Furthermore at points x1 ∈ (0, 1) with m′1(x1) > 0, the normalized estimator

n1/3 [2p1(x1)]1/3

σ1(x1)2/3m′1(x1)1/3

[mOR1 (x1) − m1(x1)]

converges in distribution to the slope of the greatest convex minorant of W (t)+t2, where W is a two-sided Brownian motion. Here, p1 is the density of Xi

1 .

The greatest convex minorant of a function f is defined as the greatest convexfunction g with g ≤ f , pointwise. This result can be found, e.g. in Wright [41] andLeurgans [23]. Compare also Mammen [27]. For further results on the asymptoticlaw of mOR

1 (x1) − m1(x1), compare also Groeneboom [12, 13].We now state our main result about the asymptotic equivalence of mj and mOR

j .

Theorem 1. Make the assumptions (A1)–(A3). Then it holds for c large enoughthat

supn−1/3≤xj≤1−n−1/3

|mj(xj) − mORj (xj)| = oP (n−1/3),

sup0≤xj≤1

|mj(xj) − mORj (xj)| = oP (n−2/9(log n)c)

The proof of Theorem 1 can be found in the Appendix. Theorem 1 and the abovementioned result on mOR

1 immediately implies the following corollary.

Corollary 1. Make the assumptions (A1)–(A3). For x1 ∈ (0, 1) with m′1(x1) > 0

it holds that

n1/3 [2p1(x1)]1/3

σ1(x1)2/3m′1(x1)1/3

[m1(x1) − m1(x1)]

converges in distribution to the slope of the greatest convex minorant of W (t) + t2,where W is a two-sided Brownian motion.

3. Algorithms for additive isotone regression

The one-dimensional isotonic least squares estimator can be regarded as a projectionof the observed vector (Y (1), . . . , Y (n)) onto the convex cone of isotonic vectors in R

n

Page 197: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

184 E. Mammen and K. Yu

with respect to the scalar product 〈f, g〉 ≡∑n

i=1 f (i)g(i) where f ≡ (f (1), . . . , f (n))and g ≡ (g(1), . . . , g(n)) ∈ R

n . Equivalently, we can regard it as a projection of aright continuous simple function with values (Y (1), . . . , Y (n)) onto the convex coneof right continuous simple monotone functions which can have jumps only at X

(i)1 ’s.

The projection is with respect to the L2 norm defined by the empirical measure,Pn(y, x1) which gives mass 1/n at each observations (Y i, Xi

1). Other monotonefunctions m with m(X(i)) = g(i) would also solve the least square minimization.

Now, we consider the optimization problem (3). Without loss of generality, wedrop the constant. Let Hj , j = 1, . . . , d be the sets of isotonic vectors of length nor right continuous monotone simple functions which have jumps only at Xi

j ’s withrespect to the ordered Xj ’s. It is well known that these sets are convex cones. Then,our optimization problem can be written as follows:

(5) ming∈H1+···+Hd

n∑

i=1

(Y i − gi)2.

We can rewrite (5) as an optimization problem over a product sets H1 × · · · × Hd.Say g = (g1, . . . , gd) ∈ H1 × · · · × Hd where gj ∈ Hj for j = 1, . . . , d. Thenthe minimization problem (5) can be represented as minimizing a function over acartesian product of sets, i.e.,

(6) ming∈H1×···×Hd

F (Y,g).

Here, F (Y,g) =∑n

i=1(Yi − gi

1 − · · · − gid)

2.A classical way to solve an optimization problem over product sets is a cyclic

iterative procedure where at each step we minimize F with respect to one gj ∈ Hj

while keeping the other gk ∈ Hk, j �= k fixed. That is to generate sequences g[r]j ,

r = 1, 2, . . . , j = 1, . . . , d, recursively such that g[r]j minimizes F (y, g

[r]1 , . . . , g

[r]j−1, u,

g[r−1]j+1 , . . . , g

[r−1]d ) over u ∈ Hj . This procedure for (6) entails the well known back-

fitting procedure with isotonic regressions on Xj , Π(·|Hj) which is given as

(7) g[r]j = Π

(Y − g

[r]1 − · · · − g

[r]j−1 − g

[r−1]j+1 − · · · − g

[r−1]d

∣∣∣∣Hj

),

r = 1, 2, . . . , j = 1, . . . , d, with initial values g[0]j = 0 where Y = (Y 1, . . . , Y n). For

a more precise description, we introduce a notation Yi,[r]j = Y i−g

i,[r]1 −· · ·−g

i,[r]j−1−

gi,[r−1]j+1 − · · · − g

i,[r−1]d where g

i,[r]k is the value of gk at Xi

k after the r-th iteration,

i.e. Yi,[r]j is the residual at the j-th cycle in the r-th iteration. Then, we have

gi,[r]j = max

s≤imint≥i

t∑

�=s

Y(�),[r]j /(t − s + 1).

Here, Y(�),[r]j is the residual at the k-th cycle in the r-th iteration corresponding to

the X(�)j .

Let g∗ be the projection of Y onto H1 + · · · + Hd, i.e., the minimizer of theproblem (5).

Theorem 2. The sequence, g(r,j) ≡∑

1≤k≤j g[r]k +

∑j≤k≤d g

[r−1]k , converges to g∗

as r → ∞ for j = 1, . . . , d. Moreover if the problem (6) has a solution that is unique

Page 198: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

Additive isotone regression 185

up to an additive constant, say g∗ = (g∗1 , . . . , g∗d), the sequences g[r]j converge to a

vector with constant entries as r → ∞ for j = 1, . . . , d.

In general, g∗ is not unique. Let g = (g1, . . . , gd) be a solution of (6). Suppose, e.g.that there exists a tuple of non constant vectors (f1, . . . , fd) such that f i

1+· · ·+f id =

0 for i = 1, . . . , n and gj + fj are monotone. Then, one does not have the uniquesolution for (6) since

∑dj=1 gi

j =∑d

j=1(gij + f i

j) and gj + fj are monotone. Thisphenomenon is similar to ’concurvity’, introduced in Buja et al. (1989). One simpleexample for non-uniqueness is the case that elements of X are ordered in the sameway, i.e., Xp

j ≤ Xqj ⇔ Xp

k ≤ Xqk for any (p, q) and (j, k). For example when d = 2,

if g solves (5), then (αg, (1−α)g) for any α ∈ [0, 1] solve (6). Other examples occurif elements of X are ordered in the same way for a subregion.

4. Simulations

In this section, we present some simulation results for the finite sample performance.These numerical experiments are done by R on windows. We iterate 1000 times foreach setting. For each iteration, we draw random samples from the following model

(8) Y = m1(X1) + m2(X2) + ε,

where (X1, X2) has truncated bivariate normal distribution and ε ∼ N(0, 0.52).In Table 1 and 2, we present the empirical MISE (mean integrated squared

error) of the backfitting estimator and the oracle estimator. We also report theratio (B/O), MISE of the backfitting estimator to MISE of the oracle estimator.For Table 1, we set m1(x) = x3 and m2(x) = sin(πx/2). The results in Table 1show that the backfitting estimator and the oracle estimator have a very similarfinite sample performance. See that the ratio (B/O) is near to one in most casesand converges to one as sample size grows. We observe that when two covariateshave strong negative correlation, the backfitting estimator has bigger MISE thanthe oracle estimator but the ratio (B/O) goes down to one as sample size grows.Figure 1 shows typical curves from the backfitting and oracle estimators for m1. Weshow the estimators that achieve 25%, 50% and 75% quantiles of the L2-distance

Table 1

Comparison between the backfitting and the oracle estimator: Model (8) with m1(x) = x3,m2(x) = sin(πx/2), sample size 200, 400, 800 and different values of ρ for covariate distribution

m1 m2

n ρ Backfitting Oracle B/O Backfitting Oracle B/O200 0 0.01325 0.01347 0.984 0.01793 0.01635 1.096

0.5 0.01312 0.01350 0.972 0.01817 0.01674 1.086−0.5 0.01375 0.01345 1.022 0.01797 0.01609 1.117

0.9 0.01345 0.01275 1.055 0.01815 0.01601 1.134−0.9 0.01894 0.01309 1.447 0.02363 0.01633 1.447

400 0 0.00824 0.00839 0.982 0.01068 0.01000 1.0680.5 0.00825 0.00845 0.977 0.01070 0.01001 1.063

−0.5 0.00831 0.00830 1.001 0.01081 0.00997 1.0840.9 0.00846 0.00814 1.040 0.01092 0.00997 1.095

−0.9 0.10509 0.00805 1.305 0.01311 0.00992 1.321800 0 0.00512 0.00525 0.976 0.00654 0.00621 1.053

0.5 0.00502 0.00513 0.977 0.00646 0.00614 1.052−0.5 0.00509 0.00513 0.994 0.00660 0.00620 1.066

0.9 0.00523 0.00500 1.046 0.00667 0.00611 1.091−0.9 0.00603 0.00498 1.211 0.00757 0.00612 1.220

Page 199: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

186 E. Mammen and K. Yu

between the backfitting and the oracle estimator for m1(x). We observe that thebackfitting and the oracle estimator produce almost identical curves.

Table 2 reports simulation results for the case that one component function isnot smooth. Here, m1(x) = x, |x| > 0.5; 0.5, 0 ≤ x ≤ 0.5;−0.5,−0.5 ≤ x < 0 andm2(x) = sin(πx/2). Even in this case the backfitting estimator shows a quite goodperformance. Thus the oracle property of additive isotonic least square regression

Table 2
Comparison between the backfitting and the oracle estimator: Model (8) with $m_1(x) = x$ for $|x| > 0.5$, $m_1(x) = 0.5$ for $0 \le x \le 0.5$, $m_1(x) = -0.5$ for $-0.5 \le x < 0$, $m_2(x) = \sin(πx/2)$, sample sizes 200, 400, 800 and different values of ρ for the covariate distribution

                         m_1                            m_2
   n      ρ    Backfitting   Oracle    B/O    Backfitting   Oracle    B/O
  200     0      0.01684    0.01548   1.088     0.01805    0.01635   1.104
        0.5      0.01686    0.01541   1.094     0.01756    0.01604   1.095
       -0.5      0.01726    0.01541   1.120     0.01806    0.01609   1.123
        0.9      0.01793    0.01554   1.154     0.01852    0.01628   1.138
       -0.9      0.02269    0.01554   1.460     0.02374    0.01633   1.454
  400     0      0.01016    0.00950   1.071     0.01094    0.01014   1.079
        0.5      0.00987    0.00944   1.046     0.01088    0.01025   1.062
       -0.5      0.01010    0.00944   1.070     0.01084    0.00998   1.086
        0.9      0.01000    0.00897   1.115     0.01105    0.00996   1.109
       -0.9      0.01192    0.00897   1.330     0.01308    0.00996   1.314
  800     0      0.00576    0.00552   1.044     0.00657    0.00622   1.056
        0.5      0.00578    0.00555   1.041     0.00651    0.00617   1.055
       -0.5      0.00588    0.00555   1.059     0.00657    0.00614   1.071
        0.9      0.00598    0.00538   1.110     0.00670    0.00616   1.088
       -0.9      0.00695    0.00538   1.291     0.00772    0.00612   1.262

Fig 1. The solid lines, dashed lines and dotted lines show the true curve, backfitting estimates and oracle estimates, respectively. The left, center and right panels represent fitted curves for the data sets that produce the 25%, 50% and 75% quantiles of the distance between the backfitting and the oracle estimator in Monte Carlo simulations with ρ = 0.5 and 200 observations.



Appendix: Proofs

A.1. Proof of Theorem 1

The proof of Theorem 1 is divided into several lemmas.

Lemma 3. For j = 1, . . . , d it holds that

$$\sup_{n^{-2/9} \le u_j \le 1 - n^{-2/9}} |\hat m_j(u_j) - m_j(u_j)| = O_P[(\log n)\, n^{-2/9}].$$

Proof. Because $ε^i$ has subexponential tails (see (A3)) we get that $\sup_{1\le i\le n} |ε^i| = O_P(\log n)$. This implies that $\max_{1\le j\le d} \sup_{0\le u_j\le 1} |\hat m_j(u_j)| = O_P(\log n)$. We now consider the regression problem

$$Y^i/(\log n) = c/(\log n) + m_1(X_1^i)/(\log n) + \cdots + m_d(X_d^i)/(\log n) + ε^i/(\log n).$$

Now, in this model the least squares estimators of the additive components are bounded and therefore we can use the entropy bound for bounded monotone functions (see e.g. (2.6) in van de Geer [39] or Theorem 2.7.5 in van der Vaart and Wellner [40]). By application of empirical process theory for least squares estimators (see Theorem 9.2 in van de Geer [39]) this gives

$$\frac{1}{n}\sum_{i=1}^{n} [\hat m_1(X_1^i) - m_1(X_1^i) + \cdots + \hat m_d(X_d^i) - m_d(X_d^i)]^2 = O_P[(\log n)^2 n^{-2/3}].$$

And, using Lemma 5.15 in van de Geer [39], this rate for the empirical norm can be replaced by the $L_2$ norm:

$$\int [\hat m_1(u_1) - m_1(u_1) + \cdots + \hat m_d(u_d) - m_d(u_d)]^2 p(u)\,du = O_P[(\log n)^2 n^{-2/3}].$$

Because p is bounded from below (see (A2)) this implies

$$\int [\hat m_1(u_1) - m_1(u_1) + \cdots + \hat m_d(u_d) - m_d(u_d)]^2\,du = O_P[(\log n)^2 n^{-2/3}].$$

Because of our norming assumption (2) for $m_j$ and $\hat m_j$, the left hand side of the last equality is equal to

$$\int [\hat m_1(u_1) - m_1(u_1)]^2\,du_1 + \cdots + \int [\hat m_d(u_d) - m_d(u_d)]^2\,du_d.$$

This gives

(9) $\max_{1\le j\le d} \int [\hat m_j(u_j) - m_j(u_j)]^2\,du_j = O_P[(\log n)^2 n^{-2/3}].$

We now use the fact that for j = 1, . . . , d the derivatives $m'_j$ are bounded. Together with the last bound this gives the statement of Lemma 3.


We now define localized estimators $\hat m^{OR}_{j,\mathrm{loc}}$ and $\hat m_{j,\mathrm{loc}}$. They are defined as $\hat m^{OR}_j$ and $\hat m_j$, but now the sum of squares runs only over indices i with $x_j - (\log n)^{1/γ} n^{-2/(9γ)} c_n \le X_j^i \le x_j + (\log n)^{1/γ} n^{-2/(9γ)} c_n$; i.e., $\hat m^{OR}_{j,\mathrm{loc}}$ minimizes

$$\sum_{i:|X_j^i - x_j| \le (\log n)^{1/γ} n^{-2/(9γ)} c_n} [m_j(X_j^i) + ε^i - \hat m^{OR}_{j,\mathrm{loc}}(X_j^i)]^2$$

and $\hat m_{j,\mathrm{loc}}$ minimizes

$$\sum_{i:|X_j^i - x_j| \le (\log n)^{1/γ} n^{-2/(9γ)} c_n} \Big[ Y^i - \sum_{l \ne j} \hat m_l(X_l^i) - \hat m_{j,\mathrm{loc}}(X_j^i) \Big]^2.$$

Here $c_n$ is a sequence with $c_n \to \infty$ slowly enough (see below). We now argue that

(10) $\hat m_{j,\mathrm{loc}}(x_j) = \hat m_j(x_j)$ for j = 1, . . . , d and $0 \le x_j \le 1$, with probability tending to 1.

This follows from Lemma 3, the fact that $m_j$ fulfills (A1) and the representation (compare (4)):

(11) $\hat m_j(x_j) = \max_{0 \le u \le x_j} \min_{x_j \le v \le 1} \dfrac{\sum_{i: u \le X_j^i \le v} Y_j^i}{\#\{i : u \le X_j^i \le v\}},$

(12) $\hat m_{j,\mathrm{loc}}(x_j) = \max_{x_j - (\log n)^{1/γ} n^{-2/(9γ)} c_n \le u \le x_j} \; \min_{x_j \le v \le x_j + (\log n)^{1/γ} n^{-2/(9γ)} c_n} \dfrac{\sum_{i: u \le X_j^i \le v} Y_j^i}{\#\{i : u \le X_j^i \le v\}}$

with $Y_j^i = Y^i - \sum_{l \ne j} \hat m_l(X_l^i)$. Here #A denotes the number of elements of a set A. Proceeding as in classical discussions of the case d = 1 one gets:

(13) $\hat m^{OR}_{j,\mathrm{loc}}(x_j) = \hat m^{OR}_j(x_j)$ for j = 1, . . . , d and $0 \le x_j \le 1$, with probability tending to 1.
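Representation (11) is the classical max-min ("minimum of means") form of the isotonic least squares fit. As a quick numerical check of this identity in the simplest setting, the following R lines are our own illustration for a single one-dimensional regression, with isoreg standing in for the componentwise fit:

# The isotonic LSE at the k-th ordered design point equals
# max over u <= k of min over v >= k of mean(Y[u:v]); compare (11).
set.seed(2)
n <- 50
X <- sort(runif(n))
Y <- X^2 + rnorm(n, sd = 0.2)

iso <- isoreg(X, Y)$yf
maxmin <- sapply(1:n, function(k)
  max(sapply(1:k, function(u)
    min(sapply(k:n, function(v) mean(Y[u:v]))))))

all.equal(iso, maxmin)   # TRUE up to numerical error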

We now consider the functions

$$\hat M_j(u_j, x_j) = n^{-1} \sum_{i: X_j^i \le u_j} Y_j^i - n^{-1} \sum_{i: X_j^i \le x_j} Y_j^i,$$

$$\hat M^{OR}_j(u_j, x_j) = n^{-1} \sum_{i: X_j^i \le u_j} [m_j(X_j^i) + ε^i] - n^{-1} \sum_{i: X_j^i \le x_j} [m_j(X_j^i) + ε^i],$$

$$M_j(u_j, x_j) = n^{-1} \sum_{i: X_j^i \le u_j} m_j(X_j^i) - n^{-1} \sum_{i: X_j^i \le x_j} m_j(X_j^i).$$

For $x_j - (\log n)^{1/γ} n^{-2/(9γ)} c_n \le u_j \le x_j + (\log n)^{1/γ} n^{-2/(9γ)} c_n$ we consider the functions that map $\#\{i : X_j^i \le u_j\}$ onto $\hat M_j(u_j, x_j)$, $\hat M^{OR}_j(u_j, x_j)$ or $M_j(u_j, x_j)$, respectively. Then we get $\hat m_{j,\mathrm{loc}}(x_j)$, $\hat m^{OR}_{j,\mathrm{loc}}(x_j)$ and $m_j(x_j)$ as the slopes of the greatest convex minorants of these functions at $u_j = x_j$. We now show the following lemma.


Lemma 4. For α > 0 there exists a β > 0 such that uniformly for $1 \le l, j \le d$, $0 \le x_j \le 1$ and $x_j - (\log n)^{1/γ} n^{-2/(9γ)} c_n \le u_j \le x_j + (\log n)^{1/γ} n^{-2/(9γ)} c_n$,

(14) $\hat M^{OR}_j(u_j, x_j) - M_j(u_j, x_j) = O_P\big(\{|u_j - x_j| + n^{-α}\}^{1/2} n^{-1/2} (\log n)^{β}\big),$

(15) $\hat M_j(u_j, x_j) - \hat M^{OR}_j(u_j, x_j) = -\sum_{l \ne j} n^{-1} \Big(\sum_{i: X_j^i \le u_j} - \sum_{i: X_j^i \le x_j}\Big) \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l | X_j^i)\, du_l + O_P\big(\{|u_j - x_j| + n^{-α}\}^{2/3} n^{-13/27} (\log n)^{β}\big),$

(16) $n^{-1} \Big(\sum_{i: X_j^i \le u_j} - \sum_{i: X_j^i \le x_j}\Big) \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l | X_j^i)\, du_l = n^{-1}\big[\#\{i : X_j^i \le u_j\} - \#\{i : X_j^i \le x_j\}\big] \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l | x_j)\, du_l + O_P\big(\{|u_j - x_j| + n^{-1}\}\, n^{-2ρ/(9γ)} (\log n)^{β}\big).$

Proof. Claim (14) is a standard result on partial sums. Claim (16) directly follows from (A2). For a proof of claim (15) we use the following result. For a constant C, suppose that ∆ is a difference of monotone functions on [0, 1] with uniform bound $\sup_z |∆(z)| \le C$, and that $Z_1, . . . , Z_k$ is a triangular array of independent random variables with values in [0, 1]. Then it holds uniformly over all such functions ∆ that

$$\sum_{i=1}^{k} \big\{∆(Z_i) - E[∆(Z_i)]\big\} = O_P(k^{2/3}),$$

see e.g. van de Geer [39]. This result can be extended to

$$\sum_{i=1}^{l} \big\{∆(Z_i) - E[∆(Z_i)]\big\} = O_P(k^{2/3}),$$

uniformly for $l \le k$ and for ∆ a difference of monotone functions with uniform bound $\sup_z |∆(z)| \le C$. More strongly, one can show an exponential inequality for the left hand side. This implies that up to an additional log-factor the same rate applies if such an expansion is used for a polynomially growing number of settings with different choices of k, $Z_i$ and ∆.

We apply this result, conditionally given $X_j^1, . . . , X_j^n$, with $Z_i = X_l^i$ and $∆(u) = I[n^{-2/9} \le u \le 1 - n^{-2/9}][\hat m_l(u) - m_l(u)]/(n^{-2/9} \log n)$. The last factor is justified by the statement of Lemma 3. This will be done for different choices of $k \ge n^{1-α}$. Furthermore, we apply this result with $Z_i = X_l^i$ and $∆(u) = \{I[0 \le u < n^{-2/9}] + I[1 - n^{-2/9} < u \le 1]\}[\hat m_l(u) - m_l(u)]/(\log n)$ and $k \ge n^{1-α}$. This implies claim (15).

We now show that Lemma 4 implies the following lemma.

Lemma 5. Uniformly for $1 \le j \le d$ and $n^{-1/3} \le x_j \le 1 - n^{-1/3}$ it holds that

(17) $\hat m_j(x_j) = \hat m^{OR}_j(x_j) - \sum_{l \ne j} \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l|x_j)\, du_l + o_P(n^{-1/3})$


and that with a constant c > 0, uniformly for $1 \le j \le d$ and $0 \le x_j \le n^{-1/3}$ or $1 - n^{-1/3} \le x_j \le 1$,

(18) $\hat m_j(x_j) = \hat m^{OR}_j(x_j) - \sum_{l \ne j} \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l|x_j)\, du_l + o_P(n^{-2/9}(\log n)^c).$

Proof. For a proof of (17) we use that for $n^{-1/3} \le x_j \le 1 - n^{-1/3}$

(19) $\hat m^{-}_j(x_j) \le \hat m_j(x_j) \le \hat m^{+}_j(x_j)$ with probability tending to 1,

(20) $\hat m^{OR,-}_j(x_j) \le \hat m^{OR}_j(x_j) \le \hat m^{OR,+}_j(x_j)$ with probability tending to 1,

(21) $\sup_{0 \le x_j \le 1} \hat m^{OR,+}_j(x_j) - \hat m^{OR,-}_j(x_j) = o_P(n^{-1/3}),$

where

$$\hat m^{-}_j(x_j) = \max_{x_j - e_n \le u \le x_j - d_n} \; \min_{x_j \le v \le x_j + e_n} \dfrac{\sum_{i: u \le X_j^i \le v} Y_j^i}{\#\{i : u \le X_j^i \le v\}},$$

$$\hat m^{+}_j(x_j) = \max_{x_j - e_n \le u \le x_j} \; \min_{x_j + d_n \le v \le x_j + e_n} \dfrac{\sum_{i: u \le X_j^i \le v} Y_j^i}{\#\{i : u \le X_j^i \le v\}},$$

$$\hat m^{OR,-}_j(x_j) = \max_{x_j - e_n \le u \le x_j - d_n} \; \min_{x_j \le v \le x_j + e_n} \dfrac{\sum_{i: u \le X_j^i \le v} [m_j(X_j^i) + ε^i]}{\#\{i : u \le X_j^i \le v\}},$$

$$\hat m^{OR,+}_j(x_j) = \max_{x_j - e_n \le u \le x_j} \; \min_{x_j + d_n \le v \le x_j + e_n} \dfrac{\sum_{i: u \le X_j^i \le v} [m_j(X_j^i) + ε^i]}{\#\{i : u \le X_j^i \le v\}},$$

compare (11) and (12). Here, $e_n = (\log n)^{1/γ} n^{-2/(9γ)} c_n$ and $d_n$ is chosen as $d_n = n^{-δ}$ with $1/3 < δ < 4/9$. Claims (19) and (20) follow immediately from the definitions of the considered quantities and (10) and (13). Claim (21) can be established by using well known properties of the isotone least squares estimator. Now, (15), (16), (19) and (20) imply that uniformly for $1 \le j \le d$ and $n^{-1/3} \le x_j \le 1 - n^{-1/3}$

$$\hat m^{\pm}_j(x_j) = \hat m^{OR,\pm}_j(x_j) - \sum_{l \ne j} \int [\hat m_l(u_l) - m_l(u_l)]\, p_{X_l|X_j}(u_l|x_j)\, du_l + o_P(n^{-1/3}).$$

This shows claim (17).

For the proof of (18) one checks this claim separately for $n^{-7/9}(\log n)^{-1} \le x_j \le n^{-1/3}$ or $1 - n^{-1/3} \le x_j \le 1 - n^{-7/9}(\log n)^{-1}$ (Case 1) and for $0 \le x_j \le n^{-7/9}(\log n)^{-1}$ or $1 - n^{-7/9}(\log n)^{-1} \le x_j \le 1$ (Case 2). The proof for Case 1 is similar to the proof of (17). For the proof in Case 2 one considers the set $I_{j,n} = \{i : 0 \le X_j^i \le n^{-7/9}(\log n)^{-1}\}$. It can be easily checked that with probability tending to 1 it holds that $n^{-2/9} \le X_l^i \le 1 - n^{-2/9}$ for $i \in I_{j,n}$ and $l \ne j$. Therefore it holds that $\sup_{i \in I_{j,n}} |\hat m_l(X_l^i) - m_l(X_l^i)| = O_P[(\log n) n^{-2/9}]$, see Lemma 3. Therefore for $0 \le x_j \le n^{-7/9}(\log n)^{-1}$ the estimators $\hat m_j(x_j)$ and $\hat m^{OR}_j(x_j)$ are local averages of quantities that differ by terms of order $O_P[(\log n) n^{-2/9}]$. Thus also the difference of $\hat m_j(x_j)$ and $\hat m^{OR}_j(x_j)$ is of order $O_P[(\log n) n^{-2/9}]$. This shows (18) for Case 2.

We now show that Lemma 5 implies the statement of the theorem.


Proof of Theorem 1. We rewrite equations (17) and (18) as

(22) $\hat m = \hat m^{OR} + H(\hat m - m) + ∆,$

where $\hat m$, $\hat m^{OR}$ and ∆ are tuples of functions $\hat m_j$, $\hat m^{OR}_j$ or $∆_j$, respectively. For $∆_j$ it holds that

(23) $\sup_{n^{-1/3} \le x_j \le 1 - n^{-1/3}} |∆_j(x_j)| = o_P(n^{-1/3}),$

(24) $\sup_{0 \le x_j \le 1} |∆_j(x_j)| = o_P(n^{-2/9}(\log n)^c).$

Furthermore, H is the linear integral operator that corresponds to the linear map in (17) and (18). For function tuples f we denote by Nf the normalized function tuple with $(Nf)_j(x_j) = f_j(x_j) - \int f_j(u_j)\,du_j$ and we introduce the pseudo norms

$$\|f\|_2^2 = \int [f_1(x_1) + \cdots + f_d(x_d)]^2 p(x)\,dx, \qquad \|f\|_\infty = \max_{1 \le j \le d} \sup_{0 \le x_j \le 1} |f_j(x_j)|.$$

Here $p_j$ is the marginal density of $X_j^i$ and p is the joint density of $X^i$. We make use of the following properties of H. On the subspace $F_0 = \{f : f = Nf\}$ the operator H has bounded operator norm:

(25) $\sup_{f \in F_0, \|f\|_2 = 1} \|Hf\|_2 = O(1).$

For the maximal eigenvalue $λ_{\max}$ of H it holds that

(26) $λ_{\max} < 1.$

Claim (25) follows directly from the boundedness of p. Claim (26) can be seen as follows; compare also with Yu, Park and Mammen [42]. A simple calculation gives

(27) $\int (m_1(u_1) + \cdots + m_d(u_d))^2 p(u)\,du = \|m\|_2^2 = \int m^T (I - H) m(u)\, p(u)\,du.$

Let λ be an eigenvalue of H and $m_λ$ an eigenfunction (tuple) corresponding to λ. With (27), we have

$$\|m_λ\|_2^2 = \int m_λ^T (I - H) m_λ(u)\, p(u)\,du = (1 - λ) \int m_λ^T m_λ(u)\, p(u)\,du.$$

Thus the factor 1 - λ must be strictly positive, i.e. λ < 1. This implies that I - H is invertible, and hence we get that

$$N(\hat m - m) = (I - H)^{-1} N(\hat m^{OR} - m) + (I - H)^{-1} N∆.$$

Here we used that because of (22)

$$N(\hat m - m) = N(\hat m^{OR} - m) - NH(\hat m - m) + N∆ = N(\hat m^{OR} - m) - HN(\hat m - m) + N∆.$$


We now use

$$(I - H)^{-1} = I + H + H(I - H)^{-1}H, \qquad (I - H)^{-1} = I + H(I - H)^{-1}.$$

This gives

$$N(\hat m - m) = N(\hat m^{OR} - m) + N∆ + HN(\hat m^{OR} - m) + H(I - H)^{-1}HN(\hat m^{OR} - m) + H(I - H)^{-1}N∆.$$

We now use that

(28) $\|HN(\hat m^{OR} - m)\|_2 \le \|HN(\hat m^{OR} - m)\|_\infty = o_P(n^{-1/3}),$

(29) $\sup_{f \in F_0, \|f\|_\infty = 1} \|Hf\|_\infty = O(1).$

Claim (28) follows because $\hat m^{OR}$ is a local average of the data; compare also Groeneboom [12], Groeneboom, Lopuhaä and Hooghiemstra [14] and Durot [8]. Claim (29) follows by a simple application of the Cauchy-Schwarz inequality; compare also (85) in Mammen, Linton and Nielsen [28].

This implies that

$$\|N(\hat m - m) - N(\hat m^{OR} - m) - N∆\|_\infty = o_P(n^{-1/3}).$$

Thus,

$$\sup_{n^{-1/3} \le x_j \le 1 - n^{-1/3}} |N(\hat m - \hat m^{OR})_j(x_j)| = o_P(n^{-1/3}),$$

$$\sup_{0 \le x_j \le 1} |N(\hat m - \hat m^{OR})_j(x_j)| = o_P(n^{-2/9}(\log n)^c).$$

This implies the statement of Theorem 1.

A.2. Proof of Theorem 2

For a given closed convex cone K, we call $K^* \equiv \{f : \langle f, g\rangle \le 0 \text{ for all } g \in K\}$ the dual cone of K. It is clear that $K^*$ is also a convex cone and $K^{**} = K$. It is pointed out in Barlow and Brunk [3] that if P is a projection onto K then I - P is a projection onto $K^*$, where I is the identity operator. Let $P_j$ be a projection onto $\mathcal{H}_j$; then $P_j^* \equiv I - P_j$ is a projection onto $\mathcal{H}_j^*$. The backfitting procedure (7) to solve the minimization problem (5) corresponds in the dual problem to an algorithm introduced in Dykstra [9]. See also Gaffke and Mathar [11]. We now explain this relation. Let $\mathcal{H}_j$, j = 1, . . . , d, be sets of monotone vectors in $\mathbb{R}^n$ with respect to the orders of $X_j$ and $P_j = Π(\,\cdot\,|\mathcal{H}_j)$. Denote the residuals in algorithm (7) after the k-th cycle in the r-th iteration by $h^{(r,k)}$. Then we have

(30)
$$h^{(1,1)} = Y - g^{[1]}_1 = P_1^* Y,$$
$$h^{(1,2)} = Y - g^{[1]}_1 - g^{[1]}_2 = P_1^* Y - P_2 P_1^* Y = P_2^* P_1^* Y,$$
$$\vdots$$
$$h^{(1,d)} = Y - g^{[1]}_1 - \cdots - g^{[1]}_d = P_d^* \cdots P_1^* Y;$$


(31)
$$h^{(r,1)} = Y - g^{[r]}_1 - g^{[r-1]}_2 - \cdots - g^{[r-1]}_d = P_1^*\big(Y - g^{[r-1]}_2 - \cdots - g^{[r-1]}_d\big),$$
$$\vdots$$
$$h^{(r,k)} = Y - g^{[r]}_1 - \cdots - g^{[r]}_k - g^{[r-1]}_{k+1} - \cdots - g^{[r-1]}_d = P_k^*\big(Y - g^{[r]}_1 - \cdots - g^{[r]}_{k-1} - g^{[r-1]}_{k+1} - \cdots - g^{[r-1]}_d\big),$$
$$\vdots$$
$$h^{(r,d)} = Y - g^{[r]}_1 - \cdots - g^{[r]}_d = P_d^*\big(Y - g^{[r]}_1 - \cdots - g^{[r]}_{d-1}\big).$$

With the notation $I_{r,k} \equiv -g^{[r]}_k$ for the incremental changes at the k-th cycle in the r-th iteration, equations (30) and (31) form a Dykstra algorithm to solve the following optimization problem:

(32) $\min_{h \in \mathcal{H}_1^* \cap \cdots \cap \mathcal{H}_d^*} \sum_{i=1}^{n} (Y^i - h^i)^2.$

Denote the solution of (32) by $h^*$. Theorem 3.1 of Dykstra [9] shows that $h^{(r,j)}$ converges to $h^*$ as $r \to \infty$ for j = 1, . . . , d. From the dual property it is well known that $g^* = Y - h^*$, and it is also clear that $g^{(r,j)} = Y - h^{(r,j)}$ for j = 1, . . . , d. Since $h^{(r,j)}$ converges to $h^*$, $g^{(r,j)}$ converges to $g^*$ as $r \to \infty$ for j = 1, . . . , d. The convergence of $g^{[r]}_j$ follows from Lemma 4.9 of Han [17].
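To illustrate this primal-dual relation numerically, the following R sketch is our own illustration (the toy data, the use of isoreg for the isotonic projections $P_j$ and the number of cycles are assumptions). It runs the backfitting cycles (30)-(31) and tracks the residual $h^{(r,d)} = Y - g^{[r]}_1 - \cdots - g^{[r]}_d$, which by Dykstra's Theorem 3.1 converges to the solution $h^*$ of (32):

# Backfitting as Dykstra's algorithm in the dual: a sketch.
set.seed(3)
n <- 100
X <- cbind(runif(n), runif(n))
Y <- X[, 1]^2 + sqrt(X[, 2]) + rnorm(n, sd = 0.3)

proj_iso <- function(y, x) {      # P_j: isotonic projection w.r.t. order of x
  isoreg(x, y)$yf[order(order(x))]
}

g <- matrix(0, n, 2)              # g[, j] plays the role of g_j^[r]
for (r in 1:50) {
  for (j in 1:2) {
    g[, j] <- proj_iso(Y - rowSums(g[, -j, drop = FALSE]), X[, j])
  }
  if (r %% 10 == 0) {
    h <- Y - rowSums(g)           # residual h^(r,d)
    cat("cycle", r, "  ||h|| =", round(sqrt(sum(h^2)), 5), "\n")
  }
}
# g1 + g2 = Y - h converges to the additive isotonic fit g* solving (5).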

References

[1] Bacchetti, P. (1989). Additive isotonic models. J. Amer. Statist. Assoc. 84 289–294.
[2] Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley, New York.
[3] Barlow, R. E. and Brunk, H. D. (1972). The isotonic regression problem and its dual. J. Amer. Statist. Assoc. 67 140–147.
[4] Brunk, H. D. (1958). On the estimation of parameters restricted by inequalities. Ann. Math. Statist. 29 437–454.
[5] Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models. Ann. Statist. 17 454–510.
[6] Carrasco, M., Florens, J.-P. and Renault, E. (2006). Linear inverse problems in structural econometrics: Estimation based on spectral decomposition and regularization. In Handbook of Econometrics (J. Heckman and E. Leamer, eds.) 6. North Holland.
[7] De Boer, W. J., Besten, P. J. and Ter Braak, C. F. (2002). Statistical analysis of sediment toxicity by additive monotone regression splines. Ecotoxicology 11 435–450.
[8] Durot, C. (2002). Sharp asymptotics for isotonic regression. Probab. Theory Related Fields 122 222–240.
[9] Dykstra, R. L. (1983). An algorithm for restricted least squares regression. J. Amer. Statist. Assoc. 78 837–842.
[10] Fan, J. and Jiang, J. (2005). Nonparametric inference for additive models. J. Amer. Statist. Assoc. 100 890–907.
[11] Gaffke, N. and Mathar, R. (1989). A cyclic projection algorithm via duality. Metrika 36 29–54.
[12] Groeneboom, P. (1985). Estimating a monotone density. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (L. M. LeCam and R. A. Olshen, eds.) 2 539–555. Wadsworth, Belmont, CA.
[13] Groeneboom, P. (1989). Brownian motion with a parabolic drift and Airy functions. Probab. Theory Related Fields 81 79–109.
[14] Groeneboom, P., Lopuhaä, H. P. and Hooghiemstra, G. (1999). Asymptotic normality of the L1-error of the Grenander estimator. Ann. Statist. 27 1316–1347.
[15] Haag, B. (2006). Nonparametric estimation of additive multivariate diffusion processes. Working paper.
[16] Haag, B. (2006). Nonparametric regression tests based on additive model estimation. Working paper.
[17] Han, S.-P. (1988). A successive projection method. Mathematical Programming 40 1–14.
[18] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
[19] Horowitz, J., Klemelä, J. and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli 12 271–298.
[20] Horowitz, J. and Mammen, E. (2004). Nonparametric estimation of an additive model with a link function. Ann. Statist. 32 2412–2443.
[21] Horowitz, J. and Mammen, E. (2007). Rate-optimal estimation for a general class of nonparametric regression models. Ann. Statist. To appear.
[22] Horowitz, J. and Mammen, E. (2006). Nonparametric estimation of an additive model with an unknown link function. Working paper.
[23] Leurgans, S. (1982). Asymptotic distributions of slope-of-greatest-convex-minorant estimators. Ann. Statist. 10 287–296.
[24] Linton, O. B. and Mammen, E. (2003). Nonparametric smoothing methods for a class of non-standard curve estimation problems. In Recent Advances and Trends in Nonparametric Statistics (M. G. Akritas and D. N. Politis, eds.). Elsevier, Amsterdam.
[25] Linton, O. and Mammen, E. (2005). Estimating semiparametric ARCH(∞) models by kernel smoothing methods. Econometrica 73 771–836.
[26] Linton, O. and Mammen, E. (2007). Nonparametric transformation to white noise. Econometrics. To appear.
[27] Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19 741–759.
[28] Mammen, E., Linton, O. B. and Nielsen, J. P. (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Ann. Statist. 27 1443–1490.
[29] Mammen, E. and Park, B. U. (2005). Bandwidth selection for smooth backfitting in additive models. Ann. Statist. 33 1260–1294.
[30] Mammen, E. and Park, B. U. (2006). A simple smooth backfitting method for additive models. Ann. Statist. 34 2252–2271.
[31] Mammen, E. and Sperlich, S. (2006). Additivity tests based on smooth backfitting. Working paper.
[32] Morton-Jones, T., Diggle, P., Parker, L., Dickinson, H. O. and Binks, K. (2000). Additive isotonic regression models in epidemiology. Stat. Med. 19 849–859.
[33] Nielsen, J. P. and Sperlich, S. (2005). Smooth backfitting in practice. J. Roy. Statist. Soc. Ser. B 67 43–61.
[34] Opsomer, J. D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial regression. Ann. Statist. 25 185–211.
[35] Opsomer, J. D. (2000). Asymptotic properties of backfitting estimators. J. Multivariate Anal. 73 166–179.
[36] Robertson, T., Wright, F. and Dykstra, R. (1988). Order Restricted Statistical Inference. Wiley, New York.
[37] Sperlich, S., Linton, O. B. and Härdle, W. (1999). Integration and backfitting methods in additive models: Finite sample properties and comparison. Test 8 419–458.
[38] Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705.
[39] van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
[40] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer, New York.
[41] Wright, F. T. (1981). The asymptotic behaviour of monotone regression estimates. Ann. Statist. 9 443–448.
[42] Yu, K., Park, B. U. and Mammen, E. (2007). Smooth backfitting in generalized additive models. Ann. Statist. To appear.


IMS Lecture Notes–Monograph Series
Asymptotics: Particles, Processes and Inverse Problems
Vol. 55 (2007) 196–203
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000364

A note on Talagrand's convex hull concentration inequality

David Pollard1

Yale University

Abstract: The paper reexamines an argument by Talagrand that leads to a remarkable exponential tail bound for the concentration of probability near a set. The main novelty is the replacement of a mysterious calculus inequality by an application of Jensen's inequality.

1. Introduction

Let $\mathcal{X}$ be a set equipped with a sigma-field $\mathcal{A}$. For each vector $w = (w_1, . . . , w_n)$ in $\mathbb{R}^n_+$, the weighted Hamming distance between two vectors $x = (x_1, . . . , x_n)$ and $y = (y_1, . . . , y_n)$ in $\mathcal{X}^n$ is defined as

$$d_w(x, y) := \sum_{i \le n} w_i h_i(x, y) \quad\text{where}\quad h_i(x, y) = \begin{cases} 1 & \text{if } x_i \ne y_i \\ 0 & \text{otherwise.} \end{cases}$$

For a subset A of $\mathcal{X}^n$ and $x \in \mathcal{X}^n$, the distances $d_w(x, A)$ and $D(x, A)$ are defined by

$$d_w(x, A) := \inf\{d_w(x, y) : y \in A\} \quad\text{and}\quad D(x, A) := \sup_{w \in \mathcal{W}} d_w(x, A),$$

where the supremum is taken over all weights in the set

$$\mathcal{W} := \Big\{(w_1, . . . , w_n) : w_i \ge 0 \text{ for each } i \text{ and } |w|^2 := \sum_{i \le n} w_i^2 \le 1\Big\}.$$

Talagrand ([10], Section 4.1) proved a remarkable concentration inequality for random elements $X = (X_1, . . . , X_n)$ of $\mathcal{X}^n$ with independent coordinates and subsets $A \in \mathcal{A}^n$:

(1) $P\{X \in A\}\, P\{D(X, A) \ge t\} \le \exp(-t^2/4)$ for all $t \ge 0$.

As Talagrand showed, this inequality has many applications to problems in combinatorial optimization and other areas. See [12], Chapter 6 of [7] and Section 4 of [6] for further examples.

There has been a strong push in the literature to establish concentration and deviation inequalities by "more intuitive" methods, such as those based on tensorization, as in [1, 3–5]. I suspect the search for alternative approaches has been driven by the miraculous roles played by some elementary inequalities in Talagrand's proofs.

1Statistics Department, Yale University, Box 208290 Yale Station, New Haven, CT 06520-8290,USA, e-mail: [email protected]; url: http://www.stat.yale.edu/~pollard/

AMS 2000 subject classifications: Primary 62E20; secondary 60F05, 62G08, 62G20.
Keywords and phrases: Concentration of measure, convex hull, convexity.


Talagrand [10] used an induction on n to establish his result. He invoked a slightly mysterious inequality in the inductive step,

$$\inf_{0 \le θ \le 1} u^{-θ} \exp\Big(\frac{(1 - θ)^2}{4}\Big) \le 2 - u \quad\text{for } 0 < u < 1,$$

which he borrowed from [2] – see Talagrand's discussion following his Lemma 4.1.3 for an explanation of how those authors generalized the earlier result from [8]. There is similar mystery in Talagrand's Lemma 4.2.1, which (in my notation) asserts that

$$\sup_{0 \le θ \le 1} u^{-θ/c} \exp\big(ψ_c(1 - θ)\big) \le \frac{1 + c - u}{c} \quad\text{for } 0 < u < 1,$$

where $ψ_c$ is defined in equation (6) below. (My $ψ_c(u)$ equals Talagrand's ξ(α, u) with α = 1/c.) Talagrand [9] had used this inequality to further generalize the result of [2], giving the concentration inequality listed as Theorem 4.2.4 in [10]. It was my attempts to understand how he arrived at his ξ(α, u) function that led me to the concavity argument that I present in this note.

It is my purpose to modify Talagrand's proof so that the inductive step becomes a simple application of the Hölder inequality (essentially as in the original proof) and the Jensen inequality. Most of my methods are minor variations on the methods in the papers just cited; my only claim of originality is for the recognition that the mysterious inequalities can be replaced by more familiar appeals to concavity. See also the Remarks at the end of this Section.

The distance D(x, A) has another representation, as a minimization over a convex subset of $[0, 1]^n$. Write h(x, y) for the point of $\{0, 1\}^n$ with ith coordinate $h_i(x, y)$. For each fixed x, the function h(x, ·) maps A onto a subset $h(x, A) := \{h(x, y) : y \in A\}$ of $\{0, 1\}^n$. The convex hull co(h(x, A)) of h(x, A) in $[0, 1]^n$ is compact, and

$$D(x, A) = \inf\{|ξ| : ξ \in \mathrm{co}(h(x, A))\}.$$

Each point ξ of co(h(x, A)) can be written as $\int h(x, y)\, ν(dy)$ for a ν in the set $\mathcal{P}(A)$ of all probability measures for which ν(A) = 1. That is, $ξ_i = ν\{y \in A : y_i \ne x_i\}$. Thus

(2) $D(x, A)^2 = \inf_{ν \in \mathcal{P}(A)} \sum_{i \le n} \big(ν\{y \in A : y_i \ne x_i\}\big)^2.$

Talagrand actually proved inequality (1) by showing that

(3) $P\{X \in A\}\; P \exp\Big(\frac{1}{4} D(X, A)^2\Big) \le 1.$

He also established an even stronger result, in which the $D(X, A)^2/4$ in (3) is replaced by a more complicated distance function.

For each convex, increasing function ψ with ψ(0) = 0 = ψ′(0) define

(4) $F_ψ(x, A) := \inf_{ν \in \mathcal{P}(A)} \sum_{i \le n} ψ\big(ν\{y \in A : y_i \ne x_i\}\big).$

For each c > 0, Talagrand ([10], Section 4.2) showed that

(5) $\big(P\{X \in A\}\big)^c\; P \exp\big(F_{ψ_c}(X, A)\big) \le 1,$


where

(6) $ψ_c(θ) := c^{-1}\Big((1 - θ)\log(1 - θ) - (1 - θ + c)\log\Big(\frac{(1 - θ) + c}{1 + c}\Big)\Big) = \sum_{k \ge 2} \frac{θ^k}{k} \cdot \frac{R_c + R_c^2 + \cdots + R_c^{k-1}}{k - 1} \quad\text{with } R_c := \frac{1}{c + 1},$

so that in particular $ψ_c(θ) \ge \dfrac{θ^2}{2 + 2c}$.

As you will see in Section 3, this strange function is actually the largest solution to a differential inequality,

$$ψ''(1 - θ) \le 1/(θ^2 + θc) \quad\text{for } 0 < θ < 1.$$

Inequality (5) improves on (3) because $D(x, A)^2/4 \le F_{ψ_1}(x, A)$.

Following the lead of [10], Section 4.4, we can ask for general conditions on the convex ψ under which an analog of (5) holds with some other decreasing function of P{X ∈ A} as an upper bound. The following slight modification of Talagrand's theorems gives a sufficient condition in a form that serves to emphasize the role played by Jensen's inequality.

Theorem 1. Suppose γ is a decreasing function with γ(0) = ∞ and ψ is a convex function. Define $G_ψ(η, θ) := ψ(1 - θ) + θη$ and $G_ψ(η) := \inf_{0 \le θ \le 1} G_ψ(η, θ)$ for $η \in \mathbb{R}^+$. Suppose

(i) $r \mapsto \exp\big(G_ψ(γ(r) - γ(r_0))\big)$ is concave on $[0, r_0]$, for each $r_0 \le 1$;
(ii) $(1 - p)e^{ψ(1)} + p \le e^{γ(p)}$ for $0 \le p \le 1$.

Then $P \exp(F_ψ(X, A)) \le \exp\big(γ(P\{X \in A\})\big)$ for every $A \in \mathcal{A}^n$ and every random element X of $\mathcal{X}^n$ with independent components.

The next lemma, a more general version of which is proved in Section 3, leads to a simple sufficient condition for the concavity assumption (i) of Theorem 1 to hold.

Lemma 2 (Concavity lemma). Suppose $ψ : [0, 1] \to \mathbb{R}^+$ is convex and increasing, with ψ(0) = 0 = ψ′(0) and ψ″(θ) > 0 for 0 < θ < 1. Suppose $ξ : [0, r_0] \to \mathbb{R}^+ \cup \{∞\}$ is continuous and twice differentiable on $(0, r_0)$. Suppose also that there exists some finite constant c for which $ξ″(r) \le c\,ξ′(r)^2$ for $0 < r < r_0$. If

$$ψ″(1 - θ) \le 1/(θ^2 + θc) \quad\text{for } 0 < θ < 1$$

then the function $r \mapsto \exp(G_ψ(ξ(r)))$ is concave on $[0, r_0]$.

Lemma 2 will be applied with $ξ(r) = γ(r) - γ(r_0)$ for $0 \le r \le r_0$. As shown in Section 3, the conditions of the Lemma hold for $ψ(θ) = θ^2/4$ with $γ(r) = \log(1/r)$, and also for the $ψ_c$ from (6) with $γ(r) = c^{-1}\log(1/r)$.

Remarks.

(i) If γ(0) were finite, the inequality asserted by Theorem 1 could not hold for all nonempty A and all X. For example, if each $X_i$ had a nonatomic distribution and A were a singleton set we would have $F_ψ(X, A) = nψ(1)$ almost surely. The quantity $P \exp(F_ψ(X, A))$ would exceed $\exp(γ(0))$ for large enough n. It is to avoid this difficulty that we need γ(0) = ∞.


(ii) Assumption (ii) of the Theorem, which is essentially an assumption that the asserted inequality holds for n = 1, is easy to check if γ is a convex function with γ(1) ≥ 0. For then the function $B(p) := \exp(γ(p))$ is convex with $B(1) \ge 1$ and $B′(1) = γ′(1)e^{γ(1)}$. We have

$$B(p) \ge (1 - p)e^{ψ(1)} + p \quad\text{for all } p \text{ in } [0, 1]$$

if $B′(1) \le 1 - e^{ψ(1)}$. (A worked instance is sketched right after these remarks.)

(iii) A helpful referee has noted that both my specific examples are already covered by Talagrand's results. He (or she) asked whether there are other (ψ, γ) pairs that lead to other useful concentration inequalities. A good question, but I do not yet have any convincing examples. Actually, I had originally thought that my methods would extend to the limiting case where c tends to zero, leading to an answer to the question posed on page 128 of [10]. Unfortunately my proof ran afoul of the requirement γ(0) = ∞. I suspect more progress might be made by replacing the strong assumption on ψ″ from Lemma 2 by something closer to the sufficient conditions presented in Section 3.
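As a concrete instance of the recipe in Remark (ii), here is my own computation for the first pair mentioned after Lemma 2, $ψ(θ) = θ^2/4$ with $γ(r) = \log(1/r)$:

$$B(p) = e^{γ(p)} = 1/p \ \text{ is convex with } B(1) = 1 \ge 1 \ \text{ and } \ B′(1) = -1.$$

Since $e^{ψ(1)} = e^{1/4} \approx 1.284$, the criterion $B′(1) \le 1 - e^{ψ(1)} \approx -0.284$ holds, and therefore

$$(1 - p)\,e^{1/4} + p \ \le\ 1/p = e^{γ(p)} \qquad\text{for } 0 \le p \le 1,$$

which is exactly assumption (ii) of Theorem 1 for this pair.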

2. Proof of Theorem 1

Argue by induction on n. As a way of keeping the notation straight, replace the subscript on $F_ψ(x, B)$ by an n when the argument B is a subset of $\mathcal{X}^n$. Also, work with the product measure $Q = \otimes_{i \le n} Q_i$ for the distribution of X and $Q_{-n} = \otimes_{i < n} Q_i$ for the distribution of $(X_1, . . . , X_{n-1})$. The assertion of the Theorem then becomes

$$Q \exp\big(F_n(x, A)\big) \le \exp(γ(QA)).$$

For n = 1 and $B \in \mathcal{A}$ we have $F_1(x, B) = ψ(1)\{x \notin B\} + 0\{x \in B\}$, so that $Q_1 \exp(F_1(x, B)) \le (1 - p)e^{ψ(1)} + p$, where $p = Q_1 B$. Assumption (ii) then gives the desired $\exp(γ(p))$ bound.

Now suppose that n > 1 and that the inductive hypothesis is valid for dimensions strictly smaller than n. Write Q as $Q_{-n} \otimes Q_n$. To simplify notation, write w for $x_{-n} := (x_1, . . . , x_{n-1})$ and z for $x_n$. Define the cross section $A_z := \{w \in \mathcal{X}^{n-1} : (w, z) \in A\}$ and write $R_z$ for $Q_{-n}A_z$. Define $r_0 := \sup_{z\in\mathcal{X}} R_z$. Notice that $r_0 \ge Q_n^z R_z = QA$.


The key to the proof is a recursive bound for $F_n$: for each x = (w, z) with $A_z \ne ∅$, each m with $A_m \ne ∅$, and all $θ = 1 - \bar θ \in [0, 1]$,

(7) $F_n(x, A) \le ψ(\bar θ) + θ F_{n-1}(w, A_z) + \bar θ F_{n-1}(w, A_m).$

To establish inequality (7), suppose $µ_z$ is a probability measure concentrated on $A_z$ and $µ_m$ is a probability measure concentrated on $A_m$. For a θ in [0, 1], define $ν = θµ_z \otimes δ_z + \bar θ µ_m \otimes δ_m$, a probability measure concentrated on the subset $(A_z × \{z\}) \cup (A_m × \{m\})$ of A. Notice that, for i < n,

$$ν\{y \in A : y_i \ne x_i\} = θµ_z\{w \in A_z : w_i \ne x_i\} + \bar θ µ_m\{w \in A_m : w_i \ne x_i\}$$

so that, by convexity of ψ,

$$ψ\big(ν\{y_i \ne x_i\}\big) \le θψ(µ_z\{w \in A_z : w_i \ne x_i\}) + \bar θ ψ(µ_m\{w \in A_m : w_i \ne x_i\});$$

and (remembering that $x_n = z$),

$$ν\{y \in A : y_n \ne x_n\} = \begin{cases} \bar θ & \text{if } z \ne m \\ 0 & \text{otherwise} \end{cases} \ \le \bar θ.$$

Thus

$$F_n(x, A) \le ψ(\bar θ) + θ\sum_{i<n} ψ\big(µ_z\{w_i \ne x_i\}\big) + \bar θ \sum_{i<n} ψ\big(µ_m\{w_i \ne x_i\}\big).$$

The two sums over the first n - 1 coordinates are like those that appear in the definitions of $F_{n-1}(w, A_z)$ and $F_{n-1}(w, A_m)$. Indeed, taking an infimum over all $µ_z \in \mathcal{P}(A_z)$ and $µ_m \in \mathcal{P}(A_m)$ we get the expression on the right-hand side of (7).

Take exponentials of both sides of (7) and then integrate out with respect to $Q_{-n}$ over the w component. For 0 < θ < 1 invoke the Hölder inequality, $Q_{-n}U^{θ}V^{\bar θ} \le (Q_{-n}U)^{θ}(Q_{-n}V)^{\bar θ}$, with $U = \exp(F_{n-1}(w, A_z))$ and $V = \exp(F_{n-1}(w, A_m))$, for a fixed m. For each z with $A_z \ne ∅$ we get

(8) $Q_{-n} \exp\big(F_n((w, z), A)\big) \le e^{ψ(\bar θ)}\big(Q_{-n} \exp(F_{n-1}(w, A_z))\big)^{θ}\big(Q_{-n} \exp(F_{n-1}(w, A_m))\big)^{\bar θ}.$

The inequality also holds in the extreme cases where θ = 0 or θ = 1, by continuity. The inductive hypothesis bounds the last product by

$$\exp\big(ψ(\bar θ) + θγ(R_z) + \bar θγ(R_m)\big) = \exp\big(γ(R_m) + G(γ(R_z) - γ(R_m), θ)\big).$$

The exponent is a decreasing function of $R_m$. Take an infimum over m, to replace $γ(R_m)$ by $γ(r_0)$. Then take an infimum over θ to get

(9) $Q_{-n} \exp\big(F_n((w, z), A)\big) \le \exp\big(γ(r_0) + G(ξ(R_z))\big)$

where $ξ(r) := γ(r) - γ(r_0)$ for $0 \le r \le r_0$.

If the cross section $A_z$ is empty, the set $\mathcal{P}(A_z)$ is empty. The argument leading from (7) to (9) still works if we fix θ equal to zero throughout, giving the bound

$$Q_{-n} \exp\big(F_n(x, A)\big) \le \exp\big(γ(r_0) + ψ(1)\big) \quad\text{if } A_z = ∅.$$

Thus the inequality (9) also holds with $R_z = 0$ when $A_z = ∅$, because $ξ(0) = γ(0) - γ(r_0) = ∞$ and $G(∞) = ψ(1)$.


By Assumption (i), the function $r \mapsto \exp(G(ξ(r)))$ is concave on $[0, r_0]$. Integrate both sides of (9) with respect to $Q_n$ to average out over the z variable. Then invoke Jensen's inequality and the fact that $Q_n R_z = QA$, to deduce that

$$Q \exp\big(F_n(x, A)\big) \le \exp\Big(γ(r_0) + G\big(γ(QA) - γ(r_0)\big)\Big).$$

Finally, use the inequality $G(η) \le η$ to bound the last expression by $\exp(γ(QA))$, thereby completing the inductive step.

Remark. Note that it is important to integrate with respect to $Q_n$ before using the bound on G: the upper bound $\exp(γ(R_z))$ that would result from applying $G(η) \le η$ first is a convex function of $R_z$, not concave.

3. Proof of the concavity lemma

I will establish a more detailed set of results than asserted by Lemma 2. Invoke the monotonicity and continuity of ψ′ to define g(η) as the solution to $ψ′(1 - g(η)) = η$ if $0 \le η < ψ′(1)$, and $g(η) = 0$ if $ψ′(1) \le η$. Then the following assertions are true. (I drop the ψ subscripts for notational simplicity.)

if 0 ≤ η < ψ′(1) and g(η) = 0 if ψ′(1) ≤ η. Then the following assertions are true.(I drop the ψ subscripts for notational simplicity.)

(i) $G(η) = \begin{cases} ψ\big(1 - g(η)\big) + η\,g(η) & \text{for } 0 \le η < ψ′(1) \\ ψ(1) & \text{for } ψ′(1) \le η. \end{cases}$

(ii) G is increasing and concave, with a continuous, decreasing first derivative g. In particular, G(0) = 0 and G′(0) = g(0) = 1.

(iii) $G″(η) = g′(η) = -[ψ″(1 - g(η))]^{-1}$ for $0 < η < ψ′(1)$.

(iv) $G(η) \le η$ for all $η \in \mathbb{R}^+$.

(v) Suppose $ξ : J \to \mathbb{R}^+$ is a convex function defined on a subinterval J of the real line, with ξ′ ≠ 0 on the interior of J. Suppose

$$\frac{1}{ψ″(1 - g(ξ_r))} \ge g(ξ_r)^2 + g(ξ_r)\,ξ″(r)/ξ′(r)^2$$

for all r in the interior of J for which $ξ_r := ξ(r) \in (0, ψ′(1))$. Then $r \mapsto \exp(G(ξ(r)))$ is a concave function on J.

Proof of (i) through (iv). The fact that G is concave and increasing follows from its definition as an infimum of increasing linear functions of η. (It would also follow from the fact that G′(η) = g(η), which is nonnegative and decreasing.) Replacement of the infimum over 0 ≤ θ ≤ 1 by the value at θ = 1 gives the inequality G(η) ≤ η.

If η ≥ ψ′(1), the derivative $-ψ′(1 - θ) + η$ is nonnegative on [0, 1], which ensures that the infimum is achieved at θ = 0 = g(η).

If 0 < η < ψ′(1), the infimum is achieved at the zero of the derivative, θ = g(η). Differentiation of the defining equality $ψ′(1 - g(η)) = η$ then gives the expression for g′(η). Similarly

$$G′(η) = -ψ′\big(1 - g(η)\big)g′(η) + ηg′(η) + g(η) = g(η).$$

The infimum that defines G(0) is achieved at g(0) = 1, which gives G(0) = ψ(0) = 0. Continuity of g at 0 then gives G′(0) = g(0) = 1.


Proof of (v). Note that the function $L(r) := \exp(G(ξ(r)))$ is continuous on J and takes the value $e^{ψ(1)}$ for all r at which $ξ(r) \ge ψ′(1)$. The second derivative L″(r) exists except possibly at points r for which ξ(r) = ψ′(1). In particular, L″(r) = 0 when ξ(r) > ψ′(1), and

$$L″(r) = \Big(g′(ξ_r)(ξ′_r)^2 + g(ξ_r)ξ″_r + g(ξ_r)^2(ξ′_r)^2\Big)L(r) \quad\text{for } 0 < ξ_r < ψ′(1).$$

From (iii) and the positivity of L, the last expression is ≤ 0 if and only if

$$-\frac{(ξ′_r)^2}{ψ″(1 - g(ξ_r))} + g(ξ_r)ξ″_r + g(ξ_r)^2(ξ′_r)^2 \le 0.$$

Divide through by $(ξ′_r)^2$ and rearrange to get the asserted inequality for ψ″. Lemma 2 follows as a special case of (i) through (v).

Special cases. If $\sup_r ξ″(r)/ξ′(r)^2 \le c$, with c a positive constant, the inequality from part (v) will certainly hold if

(10) $ψ″(1 - θ) \le (θ^2 + cθ)^{-1}$ for all $0 < θ < 1$.

This differential inequality can be solved, subject to the constraints 0 = ψ(0) = ψ′(0), by two integrations. Indeed,

$$ψ′(1 - θ) = \int_θ^1 ψ″(1 - t)\,dt \le \int_θ^1 \frac{dt}{t^2 + ct} = c^{-1}\Big(-\log θ + \log\Big(\frac{θ + c}{1 + c}\Big)\Big)$$

and, with $ψ_c$ defined by (6),

$$ψ(1 - θ) = \int_θ^1 ψ′(1 - t)\,dt \le c^{-1}\int_θ^1 \Big(-\log t + \log\Big(\frac{t + c}{1 + c}\Big)\Big)dt = ψ_c(1 - θ).$$

Note that $ψ_c(1 - θ)$ is the solution to the differential equation

$$ψ_c″(1 - θ) = \frac{1}{θ^2 + cθ} \quad\text{for all } 0 < θ < 1, \text{ with } ψ_c(0) = ψ_c′(0) = 0.$$

It is the largest solution to (10).

References

[1] Boucheron, S., Lugosi, G. and Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms 16 277–292.
[2] Johnson, W. B. and Schechtman, G. (1991). Remarks on Talagrand's deviation inequality for Rademacher functions. Lecture Notes in Math. 1470 72–77.
[3] Ledoux, M. (1996). On Talagrand's deviation inequalities for product measures. ESAIM Probab. Statist. 1 63–87.
[4] Lugosi, G. (2003). Concentration-of-measure inequalities. Notes from the Summer School on Machine Learning, Australian National University. Available at http://www.econ.upf.es/~lugosi/.
[5] Massart, P. (2003). Saint-Flour Lecture Notes 2003: Concentration Inequalities and Model Selection. Available at http://www.math.u-psud.fr/~massart/.
[6] McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics (M. Habib, C. McDiarmid, J. Ramirez-Alfonsen and B. Reed, eds.) 195–248. Springer, Berlin.
[7] Steele, J. M. (1997). Probability Theory and Combinatorial Optimization. SIAM, Philadelphia, PA.
[8] Talagrand, M. (1988). An isoperimetric theorem on the cube and the Kintchine-Kahane inequalities. Proc. Amer. Math. Soc. 104 905–909.
[9] Talagrand, M. (1991). A new isoperimetric inequality for product measure and the tails of sums of independent random variables. Geom. Funct. Anal. 1 211–223.
[10] Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. de l'I.H.E.S. 81 73–205.
[11] Talagrand, M. (1996a). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
[12] Talagrand, M. (1996b). A new look at independence. Ann. Probab. 24 1–34.


IMS Lecture Notes–Monograph Series
Asymptotics: Particles, Processes and Inverse Problems
Vol. 55 (2007) 204–233
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000373

A growth model in multiple dimensions and the height of a random partial order

Timo Seppäläinen1,∗

University of Wisconsin-Madison

Abstract: We introduce a model of a randomly growing interface in multidimensional Euclidean space. The growth model incorporates a random order model as an ingredient of its graphical construction, in a way that replicates the connection between the planar increasing sequences model and the one-dimensional Hammersley process. We prove a hydrodynamic limit for the height process, and a limit which says that certain perturbations of the random surface follow the characteristics of the macroscopic equation. By virtue of the space-time Poissonian construction, we know the macroscopic velocity function explicitly up to a constant factor.

1. Introduction

We introduce a model of a randomly growing interface, whose construction involves the height of a random partial order. The interface is defined by a height function on d-dimensional Euclidean space, and the related model of random order is in d + 1 dimensional space-time. Our goal is to emulate in higher dimensions the fruitful relationship between the one-dimensional Hammersley process and the model of increasing sequences among planar Poisson points. The connection between Hammersley's process and increasing sequences was suggested in Hammersley's paper [15], first utilized by Aldous and Diaconis [1], and subsequently in papers [21, 22, 25, 27]. A review of the use of Hammersley's process to study increasing sequences appeared in [14], and of the wider mathematical context in [2]. The study of higher dimensional random orders was started by Winkler [30].

The interface process we introduce is defined through a graphical representation which utilizes a homogeneous space-time Poisson point process, and in particular the heights of the partial orders among the Poisson points in space-time rectangles. This definition suggests a natural infinitesimal description, which we verify in a sense. After defining the process, we prove a hydrodynamic limit for the height function. This proceeds in a familiar way, by the path level variational formulation. The deterministic limiting height is the solution of a Hamilton-Jacobi equation given by a Hopf-Lax formula.

Next we use this process to prove a limit that in a way generalizes the law of large numbers of a second class particle in one-dimensional systems. In interacting particle systems, a second class particle is the location X(t) of the unique discrepancy between two coupled systems that initially differ by exactly one particle (see [16], part III). This makes sense for example for Hammersley's process and exclusion type processes. If the particle system lives in one-dimensional space, we can

∗Research partially supported by NSF Grants DMS-01-26775 and DMS-04-02231.
1Mathematics Department, University of Wisconsin-Madison, Madison, WI 53706-1388, USA, e-mail: [email protected]
AMS 2000 subject classifications: Primary 60K35; secondary 82C22.
Keywords and phrases: characteristics, growth model, hydrodynamic limit, increasing sequences, random order, second-class particle.


also look at the system as the height function of an interface, so that the occupation variables of the particle system define the increments of the height function. In terms of the height functions, we start by coupling two systems σ and ζ so that initially ζ = σ to the left of X(0), and ζ = σ + 1 to the right of X(0). Then at all times the point X(t) is the boundary of the set {x : ζ(x, t) = σ(x, t) + 1}.

This last idea generalizes naturally to the multidimensional interface model. We couple two height processes σ and ζ that satisfy σ ≤ ζ ≤ σ + 1 at all times, and prove that the boundary of the random set {x : ζ(x, t) = σ(x, t) + 1} follows the characteristics of the macroscopic equation.

Laws of large numbers for height functions of asymmetric interface models of the general type considered here have been earlier studied in a handful of papers. A hydrodynamic limit for ballistic deposition was proved in [24], and for models that generalize the exclusion process in [19, 20]. Articles [19, 24] deal with totally asymmetric models, and utilize the path-level variational formulation that generalizes from one-dimensional systems [23]. Article [20] introduces a different approach for partially asymmetric systems. These results are existence results only. In other words convergence to a limiting evolution is shown but nothing explicit about the limit is known, except that it is defined by a Hamilton-Jacobi equation. For partially asymmetric systems in more than one dimension it is presently not even known if the limit is deterministic.

Our motivation for introducing a new model is to have a system for which better results could be proved. An advantage over earlier results is that here we can write down explicitly the macroscopic velocity function up to a constant factor. This is because the process is constructed through a homogeneous space-time Poisson process, so we can simultaneously scale space and time. This is not possible for a lattice model. With an (almost) explicit velocity we can calculate macroscopic profiles, for example see what profiles with shocks and rarefaction fans look like. In the earlier cases at best we know that the velocity function is convex (or concave), but whether the velocity is C1 or strictly convex is a hard open question. Here this question is resolved immediately.

Hammersley's process has been a fruitful model for studying large scale behavior of one-dimensional asymmetric systems, by virtue of its connection with the increasing sequence model. For example, by a combination of the path-level variational construction and the Baik-Deift-Johansson estimates [3], one can presently prove the sharpest out-of-equilibrium fluctuation results for this process [25, 27]. The model introduced in the present paper has a similar connection with a simple combinatorial model, so it may not be too unrealistic to expect some benefit from this in the future.

Before the arrival of the powerful combinatorial and analytic approach pioneered in [3], Hammersley's process was used as a tool for investigating the increasing sequences model. This approach was successful in finding the value of the limiting constant [1, 21] and in large deviations [22]. The proofs relied on explicit knowledge of invariant distributions of Hammersley's process. A similar motivation is possible for us too, and this time the object of interest would be the height of the random partial order. But currently we have no explicit knowledge of steady states of the process introduced here, so we cannot use the process to identify the limiting constant for the random order model.

Recently Cator and Groeneboom [8] developed an approach to the one-dimensional Hammersley process that captures the correct order $t^{1/3}$ of the current fluctuations across a characteristic. The argument utilizes precise equilibrium calculations and a time reversal that connects maximizing increasing paths with trajectories of second class particles. In [4] this method is adapted to the totally asymmetric exclusion process. Whether the idea can be applied in multidimensional settings remains to be seen.

Organization of the paper. We begin by reminding the reader of the random partial order model, and then proceed to define the process and state the limit theorems. Proofs follow. A technical appendix at the end of the paper shows that the process has a natural state space that is a complete, separable metric space.

2. The height of a random partial order

Fix an integer ν ≥ 2. Coordinatewise partial orders on $\mathbb{R}^ν$ are defined for points $x = (x_1, . . . , x_ν)$ and $y = (y_1, . . . , y_ν)$ by

(1) $x \le y$ iff $x_i \le y_i$ for $1 \le i \le ν$, and $x < y$ iff $x_i < y_i$ for $1 \le i \le ν$.

We use interval notation to denote rectangles: $(a, b] = \{x \in \mathbb{R}^ν : a < x \le b\}$ for a < b in $\mathbb{R}^ν$, and similarly for [a, b] and the other types of intervals.

Consider a homogeneous rate 1 Poisson point process in $\mathbb{R}^ν$. A sequence of Poisson points $p_k$, $1 \le k \le m$, is increasing if $p_1 < p_2 < \cdots < p_m$ in the coordinatewise sense. For a < b in $\mathbb{R}^ν$, let H(a, b) denote the maximal number of Poisson points on an increasing sequence contained in the set (a, b]. Let $\mathbf{1} = (1, 1, . . . , 1) \in \mathbb{R}^ν$. (This is the only vector we will denote by a boldface.) Kingman's subadditive ergodic theorem and simple moment bounds imply the existence of constants $c_ν$ such that

(2) $\lim_{n\to\infty} \frac{1}{n} H(0, n\mathbf{1}) = c_ν$ a.s.

Presently the only known value is $c_2 = 2$, first proved by Vershik and Kerov [29] and Logan and Shepp [17]. The case ν = 2 is the same as the problem of the longest increasing subsequence of a random permutation; see [2] for a review. Bollobás and Winkler [7] proved that $c_ν \to e$ as ν → ∞.
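For ν = 2 the height $H(0, n\mathbf{1})$ is the length of the longest increasing subsequence of the Poisson points ordered by their first coordinate, so the limit (2) is easy to probe by simulation. The following R sketch is our own illustration (the patience-sorting routine and the chosen values of n are assumptions); the ratios come out somewhat below 2 and creep toward $c_2 = 2$ as n grows:

# Monte Carlo for (2) with nu = 2: H(0, n*1)/n -> c_2 = 2.
# Longest strictly increasing subsequence via patience sorting;
# Poisson point coordinates are a.s. distinct, so ties can be ignored.
lis_length <- function(y) {
  tails <- numeric(0)                  # tails[k]: smallest last element of an
  for (v in y) {                       # increasing subsequence of length k
    pos <- findInterval(v, tails) + 1  # first position with tail >= v
    if (pos > length(tails)) tails <- c(tails, v) else tails[pos] <- v
  }
  length(tails)
}

height_over_n <- function(n) {
  N  <- rpois(1, n^2)                  # rate-1 Poisson points in (0, n]^2
  xy <- cbind(runif(N, 0, n), runif(N, 0, n))
  lis_length(xy[order(xy[, 1]), 2]) / n
}

set.seed(4)
sapply(c(50, 100, 200), height_over_n)   # values a little below 2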

The general case is called the ν-dimensional random partial order, and $H(0, n\mathbf{1})$ is the height of the random partial order. The study of random partial orders was initiated by Winkler [30]. Here is an alternative construction of the random partial order on a fixed (rather than Poisson) number of elements. From the k! linear orders on a set of k elements, choose ν orders $\prec_1, \prec_2, . . . , \prec_ν$ uniformly at random with replacement. Define the random order ≺ as the intersection, namely x ≺ y iff $x \prec_j y$ for j = 1, . . . , ν. The height of the random order is the maximal size of a linearly ordered subset. Conditioned on the number k of Poisson points in the cube $(0, n\mathbf{1}]$, $H(0, n\mathbf{1})$ has the same distribution as the height of the random order ≺.

Let us also point out that by the spatial scaling of the Poisson point process, for any $b = (b_1, . . . , b_ν) > 0$ in $\mathbb{R}^ν$,

(3) $\lim_{n\to\infty} \frac{1}{n} H(0, nb) = c_ν (b_1 b_2 b_3 \cdots b_ν)^{1/ν}$ a.s.

3. The interface process

Fix a spatial dimension d ≥ 2. Appropriately interpreted, everything we say is true in d = 1 also, but does not offer anything significantly new. We describe the evolution of a random, integer-valued height function $σ = (σ(x))_{x\in\mathbb{R}^d}$. Height values ±∞ are permitted, so the range of the height function is $\mathbb{Z}^* = \mathbb{Z} \cup \{±∞\}$.


The state space of the process is the space Σ of functions $σ : \mathbb{R}^d \to \mathbb{Z}^*$ that satisfy conditions (i)–(iii):

(4) (i) Monotonicity: x ≤ y in $\mathbb{R}^d$ implies σ(x) ≤ σ(y).

The partial order x ≤ y on $\mathbb{R}^d$ is the coordinatewise one defined in Section 2.

(ii) Discontinuities restricted to a locally finite, countable collection of coordinate hyperplanes: for each bounded cube $[-q\mathbf{1}, q\mathbf{1}] \subseteq \mathbb{R}^d$, there are finite partitions

$$-q = s_i^0 < s_i^1 < \cdots < s_i^{m_i} = q$$

along each coordinate direction (1 ≤ i ≤ d), such that any discontinuity point of σ in $[-q\mathbf{1}, q\mathbf{1}]$ lies on one of the hyperplanes $\{x \in [-q\mathbf{1}, q\mathbf{1}] : x_i = s_i^k\}$, $1 \le i \le d$ and $0 \le k \le m_i$.

At discontinuities σ is continuous from above: σ(y) → σ(x) as y → x so that y ≥ x in $\mathbb{R}^d$. Since σ is $\mathbb{Z}^*$-valued, this is the same as saying that σ is constant on the left closed, right open rectangles

(5) $[s^k, s^{k+1}) \equiv \prod_{i=1}^{d} [s_i^{k_i}, s_i^{k_i+1}), \qquad k = (k_1, k_2, . . . , k_d) \in \prod_{i=1}^{d} \{0, 1, 2, . . . , m_i - 1\},$

determined by the partitions $\{s_i^k : 0 \le k \le m_i\}$, $1 \le i \le d$.

(iii) A decay condition "at −∞":

(6) for every $b \in \mathbb{R}^d$, $\lim_{M\to\infty} \sup\big\{|y|_\infty^{-d/(d+1)} σ(y) : y \le b,\ |y|_\infty \ge M\big\} = -\infty.$

The role of the (arbitrary) point b in condition (6) is to confine y so that as the limit is taken, all coordinates of y remain bounded above and at least one of them diverges to −∞. Hence we can think of this as "y → −∞" in $\mathbb{R}^d$. The $\ell^\infty$ norm on $\mathbb{R}^d$ is $|y|_\infty = \max_{1 \le i \le d} |y_i|$.

We can give Σ a complete, separable metric. Start with a natural Skorohod metric suggested by condition (ii). On bounded rectangles, this has been considered earlier by Bickel and Wichura [5], among others. This metric is then augmented with sufficient control of the left tail so that convergence in this metric preserves (6). The Borel σ-field under this metric is generated by the coordinate projections σ → σ(x). These matters are discussed in a technical appendix at the end of the paper.

Assume given an initial height function σ ∈ Σ. To construct the dynamics, assume also given a space-time Poisson point process on $\mathbb{R}^d × (0,\infty)$. We define the process $σ(t) = \{σ(x, t) : x \in \mathbb{R}^d\}$ for times t ∈ [0,∞) by

(7) $σ(x, t) = \sup_{y: y \le x} \{σ(y) + H((y, 0), (x, t))\}.$

The random variable H((y, 0), (x, t)) is the maximal number of Poisson points on an increasing sequence in the space-time rectangle

$$((y, 0), (x, t)] = \{(η, s) \in \mathbb{R}^d × (0, t] : y_i < η_i \le x_i\ (1 \le i \le d)\},$$

as defined in Section 2. One can prove that, for almost every realization of the Poisson point process, the supremum in (7) is achieved at some y, and σ(t) ∈ Σ for all t > 0. In particular, if initially σ(x) is finite then σ(x, t) remains finite for all 0 ≤ t < ∞. And if σ(x) = ±∞, then σ(x, t) = σ(x) for all 0 ≤ t < ∞. This defines a Markov process on the path space D([0,∞), Σ).


The local effect of the dynamical rule (7) is the following. Suppose $(y, t) \in \mathbb{R}^d × (0,\infty)$ is a Poisson point, and the state at time t− is σ. Then at time t the state changes to $σ^y$ defined by

(8) $σ^y(x) = \begin{cases} σ(x) + 1, & \text{if } x \ge y \text{ and } σ(x) = σ(y), \\ σ(x), & \text{for all other } x \in \mathbb{R}^d. \end{cases}$

We can express the dynamics succinctly like this: independently at all $x \in \mathbb{R}^d$, σ(x) jumps to σ(x) + 1 at rate dx (d-dimensional volume element). When a jump at x happens, the height function σ is updated to σ + 1 on the set $\{w \in \mathbb{R}^d : w \ge x,\ σ(w) = σ(x)\}$ to preserve the monotonicity property (4). It also follows that if σ(y) = ±∞ then $σ^y = σ$.

We express this in generator language as follows. Suppose φ is a bounded measurable function on Σ, supported on a compact cube $K \subseteq \mathbb{R}^d$. By this we mean that φ is a measurable function of the coordinates $(σ(x))_{x\in K}$. Define the generator L by

(9) $Lφ(σ) = \int_{\mathbb{R}^d} [φ(σ^y) - φ(σ)]\,dy.$

The next theorem verifies that L gives the infinitesimal description of the process in one basic sense.

Theorem 3.1. For bounded measurable functions φ on Σ, σ ∈ Σ, and t > 0,

(10) $E_σ[φ(σ(t))] - φ(σ) = \int_0^t E_σ[Lφ(σ(s))]\,ds.$

$E_σ$ denotes expectation under the path measure $P_σ$ of the process defined by (7) and started from state σ.

4. Hydrodynamic limit for the height process

Let $u_0 : \mathbb{R}^d \to \mathbb{R}$ be a nondecreasing locally Lipschitz continuous function, such that for any $b \in \mathbb{R}^d$,

(11) $\lim_{M\to\infty} \sup\big\{|y|_\infty^{-d/(d+1)} u_0(y) : y \le b,\ |y|_\infty \ge M\big\} = -\infty.$

The function $u_0$ represents the initial macroscopic height function. Assume that on some probability space we have a sequence of random initial height functions $\{σ_n(y, 0) : y \in \mathbb{R}^d\}$, indexed by n. Each $σ_n(\cdot, 0)$ is a.s. an element of the state space Σ. The sequence satisfies a law of large numbers:

(12) for every $y \in \mathbb{R}^d$, $n^{-1}σ_n(ny, 0) \to u_0(y)$ as n → ∞, a.s.

Additionally there is the following uniform bound on the decay at −∞:

(13) for every fixed $b \in \mathbb{R}^d$ and C > 0, with probability 1 there exist finite, possibly random, M, N > 0 such that, if n ≥ N, y ≤ b and $|y|_\infty \ge M$, then $σ_n(ny, 0) \le -Cn|y|_\infty^{d/(d+1)}$.

Augment the probability space of the initial $σ_n(\cdot, 0)$ by a space-time Poisson point process, and define the processes $σ_n(x, t)$ by (7). For $x = (x_1, . . . , x_d) \ge 0$ in $\mathbb{R}^d$, define

$$g(x) = c_{d+1}(x_1 x_2 x_3 \cdots x_d)^{1/(d+1)}.$$


The constant $c_{d+1}$ is the one from (2), and it comes from the partial order among Poisson points in d + 1 dimensional space-time rectangles.

Define a function u(x, t) on $\mathbb{R}^d × [0,\infty)$ by $u(x, 0) = u_0(x)$ and, for t > 0,

(14) $u(x, t) = \sup_{y: y \le x} \{u_0(y) + t\,g((x - y)/t)\}.$

The function u is nondecreasing in x, increasing in t, and locally Lipschitz in $\mathbb{R}^d × (0,\infty)$.
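Formula (14) is easy to evaluate numerically. As a toy illustration (our own, for d = 1 with $c_2 = 2$ and the initial profile $u_0(y) = y \wedge 0$, both assumptions), the following R lines approximate the supremum over y by a grid search:

# Hopf-Lax formula (14) for d = 1: u(x,t) = sup_{y <= x} { u0(y) + t*g((x-y)/t) }
g  <- function(s) 2 * sqrt(pmax(s, 0))     # g(x) = c_2 * x^(1/2), with c_2 = 2
u0 <- function(y) pmin(y, 0)               # a toy nondecreasing initial profile

u <- function(x, t, grid = seq(x - 50, x, by = 0.001)) {
  max(u0(grid) + t * g((x - grid) / t))    # grid stands in for sup over y <= x
}

u(1, 1)   # one macroscopic height value; the maximizing y here is y = 0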

Theorem 4.1. Suppose $u_0$ is a locally Lipschitz function on $\mathbb{R}^d$ that satisfies (11). Define u(x, t) through (14). Assume that the initial random interfaces $\{σ_n(y, 0)\}$ satisfy (12) and (13). Then for all $(x, t) \in \mathbb{R}^d × [0,\infty)$,

(15) $\lim_{n\to\infty} n^{-1}σ_n(nx, nt) = u(x, t)$ a.s.

By the monotonicity of the random height and the continuity of the limiting function, the limit (15) holds simultaneously for all (x, t) outside a single exceptional null event.

Extend g to an u.s.c. concave function on all of $\mathbb{R}^d$ by setting g ≡ −∞ outside $[0,\infty)^d$. Define the constant

(16) $κ_d = \Big(\frac{c_{d+1}}{d + 1}\Big)^{d+1}.$

The concave conjugate of g is $g^*(ρ) = \inf_x\{x \cdot ρ - g(x)\}$, $ρ \in \mathbb{R}^d$. Let $f = -g^*$. Then f(ρ) = ∞ for $ρ \notin (0,\infty)^d$, and

(17) $f(ρ) = κ_d(ρ_1 ρ_2 \cdots ρ_d)^{-1}$ for ρ > 0 in $\mathbb{R}^d$.

The Hopf-Lax formula (14) implies that u solves the Hamilton-Jacobi equation (see [10])

(18) $\partial_t u - f(\nabla u) = 0, \qquad u|_{t=0} = u_0.$

In other words, f(∇u) is the upward velocity of the interface, determined by the local slope.

The most basic case of the hydrodynamic limit starts with σ(y, 0) = 0 for y ≥ 0 and σ(y, 0) = −∞ otherwise. Then σ(x, t) = H((0, 0), (x, t)) for x ≥ 0 and −∞ otherwise. The limit is u(x, t) = t g(x/t).
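For d = 1 this basic case can also be checked by simulation, with the same longest-chain routine lis_length used in the Section 2 sketch (our own illustration; the values of x, t and n are assumptions): H((0, 0), (nx, nt)) is the longest chain among rate-1 Poisson points in (0, nx] × (0, nt], and (15) predicts $n^{-1}H \to t\,g(x/t) = c_2\sqrt{xt} = 2\sqrt{xt}$.

# d = 1 check of u(x,t) = t*g(x/t) = 2*sqrt(x*t) for the flat-corner
# initial condition; lis_length as in the Section 2 sketch.
chain_height <- function(a, b) {       # longest chain in (0, a] x (0, b]
  N   <- rpois(1, a * b)
  pts <- cbind(runif(N, 0, a), runif(N, 0, b))
  lis_length(pts[order(pts[, 1]), 2])
}

set.seed(5)
n <- 200; x <- 1; t <- 2
chain_height(n * x, n * t) / n         # compare with 2 * sqrt(x * t) ~ 2.83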

5. The defect boundary limit

Our objective is to generalize the notion of a second class particle from the one-dimensional context. The particle interpretation does not make sense now. But a second class particle also represents a defect in an interface, and is sometimes called a 'defect tracer.' This point of view we adopt. Given an initial height function σ(y, 0), perturb it by increasing the height to σ(y, 0) + 1 for points y in some set A(0). The boundary of the set A(0) corresponds to a second class particle, so we call it the defect boundary. How does the perturbation set A(·) evolve in time? To describe the behavior of this set under hydrodynamic scaling, we need to look at how the Hamilton-Jacobi equation (18) carries information in time.


For (x, t) ∈ Rd × (0,∞), let I(x, t) be the set of maximizers in (14):

(19) I(x, t) = {y ∈ Rd : y ≤ x, u(x, t) = u0(y) + tg((x − y)/t)}.

Continuity and hypothesis (11) guarantee that I(x, t) is a nonempty compact set.It turns out that these three statements (i)–(iii) are equivalent for a point (x, t):(i) the gradient ∇u in the x-variable exists at (x, t), (ii) u is differentiable at (x, t),and (iii) I(x, t) is a singleton. We call a point (x, t) with t > 0 a shock if I(x, t) hasmore than one point.

For y ∈ Rd let W (y, t) be the set of points x ∈ Rd for which y is a maximizer inthe Hopf-Lax formula (14) at time t:

(20) W (y, t) = {x ∈ Rd : x ≥ y, u(x, t) = u0(y) + tg((x − y)/t)},

and for any subset B ⊆ Rd,

(21) W (B, t) =⋃

y∈B

W (y, t).

Given a closed set B ⊆ Rd, let

(22) X(B, t) = W (B, t) ∩ W (Bc, t).

W (B, t) and W (Bc, t) are both closed sets. We can characterize x ∈ X(B, t) asfollows: if (x, t) is not a shock then the unique maximizer {y} = I(x, t) in (14) lieson the boundary of B, while if (x, t) is a shock then I(x, t) intersects both B andBc.

If dimension d = 1 and B = [a,∞) ⊆ R, an infinite interval, then X(B, t) isprecisely the set of points x for which there exists a forward characteristic x(·)such that x(0) = a and x(t) = x. By a forward characteristic we mean a Filippovsolution of dx/dt = f ′(∇u(x, t)) [9, 18]. A corresponding characterization of X(B, t)in multiple dimensions does not seem to exist at the moment.

The open ε-neighborhood of a set B ⊆ Rd is denoted by

(23) B(ε) = {x : d(x, y) < ε for some y ∈ B}.

The distance d(x, y) can be the standard Euclidean distance or any other equivalent metric; the choice makes no difference. Let us write B(−ε) for the set of x ∈ B that are at least distance ε > 0 away from the boundary:

(24) B(−ε) = {x ∈ B : d(x, y) ≥ ε for all y /∈ B} = [ (Bc)(ε) ]c.

The topological boundary of a closed set B is bd B = B ∩ cl(Bc).

Suppose two height processes σ(t) and ζ(t) are coupled through the space-time Poisson point process. This means that on some probability space are defined the initial height functions σ(y, 0) and ζ(y, 0), and a space-time Poisson point process which defines all the random variables H((y, 0), (x, t)). Process σ(x, t) is defined by (7), and process ζ(x, t) by the same formula with σ replaced by ζ, but with the same realization of the variables H((y, 0), (x, t)). If initially σ ≤ ζ ≤ σ + h for some constant h, then the evolution preserves these inequalities. We can follow the evolution of the “defect set” A(t), defined as A(t) = {x : ζ(x, t) = σ(x, t) + h} for t ≥ 0. This type of setting we now study in the hydrodynamic context. In the introduction we only discussed the case h = 1, but the proof works for general finite h.

Now for the precise assumptions. On some probability space are defined two sequences of initial height functions σn(y, 0) and ζn(y, 0). The {σn(y, 0)} satisfy the hypotheses (12) and (13) of Theorem 4.1. For some fixed positive integer h,

(25) σn(y, 0) ≤ ζn(y, 0) ≤ σn(y, 0) + h for all n and y ∈ Rd.

Construct the processes σn(t) and ζn(t) with the same realizations of the space-time Poisson point process. Then

(26) σn(x, t) ≤ ζn(x, t) ≤ σn(x, t) + h for all n and (x, t).

In particular, ζn and σn satisfy the same hydrodynamic limit. Let

(27) An(t) = {x ∈ Rd : ζn(x, t) = σn(x, t) + h}.

Our objective is to follow the evolution of the set An(t) and its boundary bd{An(t)}. We need an initial assumption at time t = 0. Fix a deterministic closed set B ⊆ Rd. We assume that for large n, n^{−1}An(0) approximates B locally, in the following sense: almost surely, for every compact K ⊆ Rd and ε > 0,

(28) B(−ε) ∩ K ⊆ {n^{−1}An(0)} ∩ K ⊆ B(ε) ∩ K for all large enough n.

Theorem 5.1. Let again u0 satisfy (11) and the processes σn satisfy (12) and (13) at time zero. Fix a positive integer h and a closed set B ⊆ Rd. Assume that the processes σn are coupled with processes ζn through a common space-time Poisson point process so that (26) holds. Define An(t) by (27) and assume An(0) satisfies (28).

If W (B, t) = ∅, then almost surely, for every compact K ⊆ Rd, An(nt) ∩ nK = ∅ for all large enough n.

Suppose W (B, t) ≠ ∅. Then almost surely, for every compact K ⊆ Rd and ε > 0,

(29) bd{n^{−1}An(nt)} ∩ K ⊆ X(B, t)(ε) ∩ K for all large enough n.

In addition, suppose no point of W (Bc, t) is an interior point of W (B, t). Then almost surely, for every compact K ⊆ Rd and ε > 0,

(30) W (B, t)(−ε) ∩ K ⊆ {n^{−1}An(nt)} ∩ K ⊆ W (B, t)(ε) ∩ K for all large enough n.

The additional hypothesis for (30), that no point of W (Bc, t) is an interior point of W (B, t), prevents B and Bc from becoming too entangled at later times. For example, it prohibits the existence of a point y ∈ bd B such that W (y, t) has nonempty interior (“a rarefaction fan with interior”).


6. Examples and technical comments

6.1. Second class particle analogy

Consider a one-dimensional Hammersley process z(t) = (zi(t))i∈Z with labeled particle locations · · · ≤ z−1(t) ≤ z0(t) ≤ z1(t) ≤ · · · on R. In terms of labeled particles, the infinitesimal jump rule is this: zi jumps to the left at exponential rate zi − zi−1, and when it jumps, its new position z′i is chosen uniformly at random from the interval (zi−1, zi). The height function is defined by σ(x, t) = sup{i : zi(t) ≤ x} for x ∈ R.
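This jump rule is easy to simulate. The sketch below is ours, not from the paper: a Gillespie-type simulation of finitely many labeled particles, where freezing the leftmost particle is an artificial boundary condition that makes the finite window well defined.

    import random

    def simulate_hammersley(z, t_max, rng=random):
        # z: increasing list of particle positions z[0] <= ... <= z[-1].
        # Particle i (i >= 1) jumps at rate z[i] - z[i-1] to a uniform
        # point of (z[i-1], z[i]); z[0] never moves (boundary condition).
        t = 0.0
        while True:
            gaps = [z[i] - z[i - 1] for i in range(1, len(z))]
            total_rate = sum(gaps)
            if total_rate <= 0.0:
                return z
            t += rng.expovariate(total_rate)
            if t > t_max:
                return z
            u = rng.uniform(0.0, total_rate)  # pick a particle, gap-weighted
            acc = 0.0
            for i in range(1, len(z)):
                acc += gaps[i - 1]
                if u <= acc:
                    z[i] = rng.uniform(z[i - 1], z[i])
                    break

    def height(z, x):
        # number of particles at or below x; up to a relabeling this is
        # the height function sigma(x, t) = sup{i : z_i(t) <= x}
        return sum(1 for zi in z if zi <= x)

For example, simulate_hammersley(sorted(random.uniform(0, 10) for _ in range(50)), t_max=5.0) evolves 50 particles on (0, 10) for 5 time units.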

Now consider another Hammersley process z̄(t) constructed with the same realization of the space-time Poisson point process as z(t). Assume that at time 0, z̄(0) has exactly the same particle locations as z(0), plus h additional particles. Then at all later times z̄(t) will have h particles more than z(t), and relative to the z(t)-process, these extra particles behave like second class particles.

Suppose the labeling of the particles is such that z̄i(t) = zi(t) to the left of all the second class particles. Let X1(t) ≤ · · · ≤ Xh(t) be the locations of the second class particles. Then the height functions satisfy σ̄(x, t) = σ(x, t) for x < X1(t), and σ̄(x, t) = σ(x, t) + h for x ≥ Xh(t). So in this one-dimensional second class particle picture, the set A(t) is the interval [Xh(t),∞). It has been proved, in the context of one-dimensional asymmetric exclusion, K-exclusion and zero-range processes, that in the hydrodynamic limit a second-class particle converges to a characteristic or shock of the macroscopic p.d.e. [12, 18, 26].

Despite this analogy, good properties of the one-dimensional situation are readily lost as we move to higher dimensions. For example, we can begin with a set A(0) that is monotone in the sense that x ∈ A(0) implies y ∈ A(0) for all y ≥ x. But this property can be immediately lost: suppose a jump happens at w such that ζ(w, 0) = σ(w, 0) but the set V = {x ≥ w : σ(x, 0) = σ(w, 0)} intersects A(0) = {x : ζ(x, 0) = σ(x, 0) + 1}. Then after this event ζ = σ on V, and cutting V away from A(0) may have broken its monotonicity.

6.2. Examples of the limit in Theorem 5.1

We consider here the simplest macroscopic profiles for which we can explicitly calculate the evolution W (B, t) of a set B, and thereby we know the limit of n^{−1}An(nt) in Theorem 5.1. These are the flat profile with constant slope, and the cases of shocks and rarefaction fans that have two different slopes. Recall the slope-dependent velocity f(ρ) = κd(ρ1ρ2ρ3 · · · ρd)^{−1} for ρ ∈ (0,∞)d, where κd is the (unknown) constant defined by (2) and (16).

For the second class particle in one-dimensional asymmetric exclusion, these cases were studied in [12, 13].

Flat profile. Fix a vector ρ ∈ (0,∞)d, and consider the initial profile u0(x) = ρ · x. Then u(x, t) = ρ · x + tf(ρ), for each (x, t) there is a unique maximizer y(x, t) = x + t∇f(ρ) in the Hopf-Lax formula, and consequently for any set B, W (B, t) = −t∇f(ρ) + B.
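For later use we record the gradient of the velocity, obtained by differentiating (17), together with a sketch of why the flat profile behaves as claimed:

∂f/∂ρi = −κd(ρ1 · · · ρd)^{−1}/ρi = −f(ρ)/ρi,  that is, ∇f(ρ) = −f(ρ)(1/ρ1, . . . , 1/ρd) < 0.

In particular u(x, t) = ρ · x + tf(ρ) satisfies ∂tu = f(ρ) = f(∇u), so it solves (18). The first-order condition for the maximizer in the Hopf-Lax formula is ∇g((x − y)/t) = ρ, and by the convex duality f = −g∗ this is solved by (x − y)/t = −∇f(ρ), which gives the stated y(x, t) = x + t∇f(ρ).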

Shock profile. Fix two vectors λ, ρ ∈ (0,∞)d, and let

(31) u0(x) = { ρ · x,  (ρ − λ) · x ≥ 0,
               λ · x,  (ρ − λ) · x ≤ 0.


Then at later times we have

u(x, t) = { ρ · x + tf(ρ),  (ρ − λ) · x ≥ t(f(λ) − f(ρ)),
            λ · x + tf(λ),  (ρ − λ) · x ≤ t(f(λ) − f(ρ)).

The Hopf-Lax formula is maximized by

y = { x + t∇f(ρ),  if (ρ − λ) · x ≥ t(f(λ) − f(ρ)),
      x + t∇f(λ),  if (ρ − λ) · x ≤ t(f(λ) − f(ρ)).

In particular, points (x, t) on the hyperplane (ρ − λ) · x = t(f(λ) − f(ρ)) are shocks, and for them both alternatives above are maximizers. In the forward evolution, W (y, t) is either a singleton or empty:

W (y, t) = { y − t∇f(ρ),  if (ρ − λ) · y ≥ t(f(λ) − f(ρ)) + t(ρ − λ) · ∇f(ρ),
             ∅,  if t(f(λ) − f(ρ)) + t(ρ − λ) · ∇f(λ) < (ρ − λ) · y < t(f(λ) − f(ρ)) + t(ρ − λ) · ∇f(ρ),
             y − t∇f(λ),  if (ρ − λ) · y ≤ t(f(λ) − f(ρ)) + t(ρ − λ) · ∇f(λ).

In this situation Theorem 5.1 is valid for all sets B.

Rarefaction fan profile. Fix two vectors λ, ρ ∈ (0,∞)d, and let

u0(x) = { λ · x,  (ρ − λ) · x ≥ 0,
          ρ · x,  (ρ − λ) · x ≤ 0.

For (x, t) such that

−t(ρ − λ) · ∇f(ρ) < (ρ − λ) · x < −t(ρ − λ) · ∇f(λ)

there exists a unique s = s(x, t) ∈ (0, 1) such that

(ρ − λ) · x = −t(ρ − λ) · ∇f(sλ + (1 − s)ρ).

Then at later times the profile can be expressed as

u(x, t) =

ρ · x + tf(ρ), if (ρ − λ) · x ≤ −t(ρ − λ) · ∇f(ρ),(sλ + (1 − s)ρ) · x + tf(sλ + (1 − s)ρ), if

−t(ρ − λ) · ∇f(ρ) < (ρ − λ) · x < −t(ρ − λ) · ∇f(λ),λ · x + tf(λ), if (ρ − λ) · x ≥ −t(ρ − λ) · ∇f(λ).

The forward evolution manifests the rarefaction fan: points y on the hyperplane (ρ − λ) · y = 0 have W (y, t) given by a curve, while for every other point y, W (y, t) is a singleton:

W (y, t) = { y − t∇f(ρ),  if (ρ − λ) · y < 0,
             {y − t∇f(sλ + (1 − s)ρ) : 0 ≤ s ≤ 1},  if (ρ − λ) · y = 0,
             y − t∇f(λ),  if (ρ − λ) · y > 0.

In Theorem 5.1, consider the half-space B = {x : (ρ − λ) · x ≥ 0}. Then

X(B, t) = {x : −t(ρ − λ) · ∇f(ρ) ≤ (ρ − λ) · x ≤ −t(ρ − λ) · ∇f(λ)},


the “rarefaction strip” in space. Statement (30) is not valid for B, because the interior of X(B, t) lies in the interiors of both W (B, t) and W (Bc, t). Statement (29) is valid, and says that the boundary of n^{−1}An(nt) is locally contained in any neighborhood of X(B, t).

In the corresponding one-dimensional setting, Ferrari and Kipnis [13] proved that on the macroscopic scale, the second class particle is uniformly distributed in the rarefaction fan. Their proof depended on explicit calculations with Bernoulli distributions, so presently we cannot approach such precise knowledge of bd{n^{−1}An(nt)}.

6.3. Some random initial conditions

We give here some natural examples of random initial conditions for Theorems 4.1 and 5.1 for the case d = 2. We construct these examples from space-time evolutions of one-dimensional Hammersley's process. The space-time coordinates (y, t) of the 1-dimensional process will equal the 2-dimensional spatial coordinates x = (x1, x2) of a height function.

Flat profiles. In one dimension, Aldous and Diaconis [1] denoted the Hammersley process by N(y, t). The function y ↦ N(y, t) (y ∈ R) can be regarded as the counting function of a point process on R. Homogeneous Poisson point processes are invariant for this process.

To construct all flat initial profiles u0(x) = ρ · x on R2, we need two parameters that can be adjusted. The rate µ of the spatial equilibrium of N(y, t) gives one parameter. Another parameter τ is the jump rate, in other words the rate of the space-time Poisson point process in the graphical construction of N(y, t). Let now N(y, t) be a process in equilibrium, defined for −∞ < t < ∞, normalized so that N(0, 0) = 0, with jump rate τ, and so that the spatial distribution at each fixed time is a homogeneous Poisson process at rate µ. Then the process of particles jumping past a fixed point in space is Poisson at rate τ/µ ([1], Lemma 8). Consequently EN(y, t) = µy + (τ/µ)t.

This way we can construct a random initial profile whose mean is a given flat initial profile: given ρ = (ρ1, ρ2) ∈ (0,∞)2, take an equilibrium process {N(y, t) : y ∈ R, t ∈ R} with µ = ρ1 and τ = ρ1ρ2, and define the initial height function for x = (x1, x2) ∈ R2 by σ((x1, x2), 0) = N(x1, x2).
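As a quick check of this choice of parameters: by the equilibrium facts just recalled,

Eσ((x1, x2), 0) = EN(x1, x2) = µx1 + (τ/µ)x2 = ρ1x1 + ρ2x2 = ρ · x,

so the mean of the random initial interface is the flat profile u0(x) = ρ · x.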

Shock profiles. Next we construct a class of initial shock profiles. Suppose ρ = (ρ1, ρ2) and λ = (λ1, λ2) satisfy ρ > λ and ρ1/ρ2 < λ1/λ2. Start by constructing the equilibrium Hammersley system {N(y, t) : y ∈ R, t ∈ R} with spatial density µ = λ1 and jump rate τ = λ1λ2. Set a = (ρ1 − λ1)/(ρ2 − λ2) > 0. Stop each Hammersley particle the first time it hits the space-time line t = −ay, and “erase” the entire evolution of N(y, t) above this line. The assumption ρ1/ρ2 < λ1/λ2 guarantees that each particle eventually hits this line. Now we have constructed the slope-λ height function σ((x1, x2), 0) = N(x1, x2) below the line (ρ − λ) · x = 0 ⇐⇒ x2 = −ax1. (Slope-λ in the sense that Eσ(x, 0) = λ · x.)

To continue the construction, put a rate τ′ = ρ1ρ2 space-time Poisson point process above the line t = −ay in the space-time picture of the 1-dim Hammersley process. Let the Hammersley particles evolve from their stopped locations on the line t = −ay, according to the usual graphical construction [1] of the process, using the rate τ′ space-time Poisson points. The construction is well defined, because given any finite time T, N(y, T) is already constructed for y ≤ −T/a, and for y > −T/a the particle trajectories can be constructed one at a time from left to right, starting with the leftmost particle stopped at a point (y,−ay) for y > −T/a.


One can check that defining σ((x1, x2), 0) = N(x1, x2) for x2 > −ax1 gives the slope-ρ height function above the line (ρ − λ) · x = 0. Now we have a random initial height function σ(x, 0) with mean Eσ(x, 0) = u0(x) as in (31).

Finally, we describe a way to define initial configurations for the coupled processes ζ and σ in the context of this shock example. We shall do it so that the set {x : ζ(x, 0) = σ(x, 0) + 1} lies inside B = {x : x2 ≥ −ax1}, and approximates it closely. Let ζ(x, 0) be the height function defined above in terms of the N(y, t) constructed in two steps, first below and then above the line t = −ay. Let zk(t) be the trajectories of the labeled Hammersley particles. These trajectories are the level curves of ζ(x, 0), namely ζ((x1, x2), 0) ≥ k iff zk(x2) ≤ x1. The construction performed above has the property that each zk(t) crosses the line t = −ay exactly once (the particles were stopped upon first hitting this line, and then continued entirely above the line).

Define new trajectories z′k(t) as follows: z′k(t) = zk(t) below the line t = −ay. From the line t = −ay the trajectory z′k(t) proceeds vertically upward (in the t-direction) until it hits the trajectory of zk+1(t). From that point onwards z′k(t) follows the trajectory of zk+1(t). This is done for all k. Let N′(y, t) be the counting function defined by N′(y, t) = sup{k : z′k(t) ≤ y}. And then set σ((x1, x2), 0) = N′(x1, x2).

The initial height functions σ(x, 0) and ζ(x, 0) thus defined have these properties: σ(x, 0) = ζ(x, 0) for x2 ≤ −ax1. For any point (x1, x2) such that x2 > −ax1 and some particle trajectory zk(t) passes between (x1,−ax1) and (x1, x2), ζ(x, 0) = σ(x, 0) + 1. This construction satisfies the hypotheses of Theorem 5.1.

6.4. Some properties of the multidimensional Hamilton-Jacobi equation

Let u(x, t) be the viscosity solution of the equation ut = f(∇u), defined by the Hopf-Lax formula (14). By assumption, the initial profile u0 is locally Lipschitz and satisfies the decay estimate (11). Hypothesis (11) is tailored to this particular velocity function, and needs to be changed if f is changed.

Part (b) of the following lemma will be needed in the proof of Theorem 5.1.

Lemma 6.1. (a) For any compact K ⊆ Rd, ⋃_{x∈K} I(x, t) is compact.
(b) W (B, t) is closed for any closed set B ⊆ Rd.

Proof. (a) By (11), as y → −∞ for y ≤ x, u0(y) + tg((x − y)/t) tends to −∞ uniformly over x in a bounded set. Also, the condition inside (19) is preserved by limits because all the functions are continuous. (b) If W (B, t) ∋ xj → x, then by (a) any sequence of maximizers yj ∈ I(xj, t) ∩ B has a convergent subsequence.

The association of I(x, t) to x is not as well-behaved as in one dimension. For example, not only is there no monotonicity, but a simple example can have x1 < x2 with maximizers yi ∈ I(xi, t) such that y2 < y1. The local Lipschitz condition on u0 guarantees that each y ∈ I(x, t) satisfies y < x (i.e. strict inequality for all coordinates).

Properties that are not hard to check include the following. Part (a) of the lemma implies that u(x, t) is locally Lipschitz on Rd × (0,∞). Lipschitz continuity does not necessarily hold down to t = 0, but continuity does. u is differentiable at (x, t) iff I(x, t) is a singleton {y}, and then ∇u(x, t) = ∇g((x − y)/t). Also, ∇u is continuous on the set where it is defined because whenever (xn, tn) → (x, t) and yn ∈ I(xn, tn), the sequence {yn} is bounded and all limit points lie in I(x, t).


A converse question is when W (y, t) has more than one point. As in one dimension, one can give a criterion based on the regularity of u0 at y. The subdifferential D−u0(x) and superdifferential D+u0(x) of u0 at x are defined by

D−u0(x) = { q ∈ Rd : lim inf_{y→x} [u0(y) − u0(x) − q · (y − x)] / ‖y − x‖ ≥ 0 }

and

D+u0(x) = { p ∈ Rd : lim sup_{y→x} [u0(y) − u0(x) − p · (y − x)] / ‖y − x‖ ≤ 0 }.

It is a fact that both D±u0(x) are nonempty iff u0 is differentiable at x, and then D±u0(x) = {∇u0(x)}.

One can check that W (y, t) ⊆ y − t∇f(D+u0(y)). Consequently if D−u0(y) is nonempty, W (y, t) cannot have more than one point. Another fact from one-dimensional systems that also holds in multiple dimensions is that if we restart the evolution at time s > 0, then all forward sets W (y, t) are empty or singletons. In other words, if u is a solution with initial profile u0, and we define ū0(x) = u(x, s) and ū(x, t) = u(x, s + t), then D−ū0(y) is never empty. This is because ∇g((x − y)/s) lies in D−{u(·, s)}(x) for every y that maximizes the Hopf-Lax formula for u(x, s).

7. Proof of the generator relation

In this section we prove Theorem 3.1. Throughout the proofs we use the abbreviation

x! = x1x2x3 · · · xd

for a point x = (x1, . . . , xd) ∈ Rd. We make the following definition related to the dynamics of the process. For a height function σ ∈ Σ and a point x ∈ Rd, let

(32) Sx(σ) = { {y ∈ Rd : y ≤ x, σ(y) = σ(x)},  if σ(x) is finite,
               ∅,  if σ(x) = ±∞.

Sx(σ) is the set in space where a Poisson point must arrive in the next instant in order to increase the height value at x. Consequently the Lebesgue measure (volume) |Sx(σ)| is the instantaneous rate at which the height σ(x) jumps up by 1. Since values σ(x) = ±∞ are not changed by the dynamics, it is sensible to set Sx(σ) empty in this case. For a set K in Rd we define

(33) SK(σ) = ⋃_{x∈K} Sx(σ),

the set in space where an instantaneous Poisson arrival would change the function σ in the set K.
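For orientation, consider the one-dimensional Hammersley picture of Section 6.1 (this illustration is ours). If σ is the height function of a particle configuration (zi) and σ(x) is finite, then Sx(σ) = [zσ(x), x], so |Sx(σ)| = x − zσ(x): the height at x increases by one exactly when a Poisson point arrives in the gap between x and the rightmost particle at or below x, in agreement with the particle jump rates of Section 6.1.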

We begin with a simple estimate.

Lemma 7.1. Let x > 0 in Rd, t > 0, and k a positive integer. Then

P{H((0, 0), (x, t)) ≥ k} ≤ (x!t)k

(k!)(d+1)≤ e−k(d+1)

where the second inequality is valid if k ≥ e2(x!t)1/(d+1). Note that above (0, 0)means the space-time point (0, 0) ∈ Rd × [0,∞).


Proof. Let γ = x!t = x1x2x3 · · · xd · t > 0 be the volume of the space-time rectangle (0, x] × (0, t]. k uniform points in this (d + 1)-dimensional rectangle form an increasing chain with probability (k!)^{−d}. Thus, with C(j, k) the binomial coefficient,

P{H((0, 0), (x, t)) ≥ k} ≤ ∑_{j≥k} (e^{−γ}γ^j/j!) C(j, k) (k!)^{−d} = γ^k(k!)^{−(d+1)} ≤ γ^k(k/e)^{−k(d+1)} ≤ e^{−k(d+1)}

if k ≥ e^2 γ^{1/(d+1)}.
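The combinatorial fact used at the start of the proof, that k i.i.d. uniform points in a (d + 1)-dimensional rectangle are totally ordered with probability (k!)^{−d}, is easy to test numerically. A minimal Monte Carlo sketch (function names are ours):

    import random

    def chain_probability(k, d, trials=100000, rng=random):
        # estimate P{k uniform points in the (d+1)-dim unit cube form an
        # increasing chain}; the exact value is (k!)**(-d)
        hits = 0
        for _ in range(trials):
            pts = sorted([rng.random() for _ in range(d + 1)] for _ in range(k))
            # after sorting by the first coordinate, the points form a chain
            # iff every remaining coordinate is also nondecreasing
            if all(pts[i][j] <= pts[i + 1][j]
                   for i in range(k - 1) for j in range(1, d + 1)):
                hits += 1
        return hits / trials

For k = 3 and d = 2 the estimate should be near (3!)^{−2} = 1/36 ≈ 0.028.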

We need to make a number of definitions that enable us to control the height functions σ ∈ Σ. For b ∈ Rd and h ∈ Z, let yb,h(σ) be the maximal point y ≤ b such that the rectangle [y, b] contains the set {x ≤ b : σ(x) ≥ h}, with yb,h(σ) = b if this set is empty. Note that if yb,h(σ) ≠ b then there exists x ≤ b such that σ(x) ≥ h and |b − x|∞ = |b − yb,h(σ)|∞.

Throughout this section we consider compact cubes of the type

K = [−q1, q1] ⊆ Rd

for a fixed number q > 0. When the context is clear we may abbreviate yh = yq1,h(σ). Define

λk(σ) = sup_{−∞<h≤k−2} (q1 − yh)! · (k − h)^{−(d+1)}.

Property (6) of the state space guarantees that λk(σ) < ∞. The minimal and maximal finite height values in K are defined by

I(K, σ) = min{σ(x) : x ∈ K, −∞ < σ(x) < ∞}

and

J(K, σ) = max{σ(x) : x ∈ K, −∞ < σ(x) < ∞}.

If σ = ±∞ on all of K we interpret I(K, σ) = ∞ = −J(K, σ). Otherwise these quantities are finite because σ can take only finitely many values in K. If σ is finite on all of K then by monotonicity I(K, σ) = σ(−q1) and J(K, σ) = σ(q1). Set

(34) ψK(σ) = ( ∑_{k=I(K,σ)+1}^{J(K,σ)+1} λk(σ)^2 )^{1/2}.

If σ = ±∞ on all of K then ψK(σ) = 0.

The next two lemmas are preliminary and illustrate how ψK(σ) appears as a bound.

Lemma 7.2. For a cube K = [−q1, q1] and σ ∈ Σ,

(35) |SK(σ)| ≤ 2^{d+1}ψK(σ).

Proof. If σ = ±∞ on all of K then both sides of (35) are zero by the definitions. Suppose I(K, σ) is finite (this is the complementary case). If x ∈ K and y ∈ Sx(σ), then y ≤ x ≤ q1 and σ(y) = σ(x) ≥ I(K, σ), and consequently y ∈ [yI(K,σ), q1]. This is true for an arbitrary point y ∈ SK(σ). Since yI(K,σ)−1 ≤ yI(K,σ), we can weaken the conclusion to SK(σ) ⊆ [yI(K,σ)−1, q1] to get

|SK(σ)| ≤ (q1 − yI(K,σ)−1)! = 2^{d+1}(q1 − yI(K,σ)−1)! / (I(K, σ) + 1 − (I(K, σ) − 1))^{d+1} ≤ 2^{d+1}λI(K,σ)+1(σ) ≤ 2^{d+1}ψK(σ).


Lemma 7.3. Define the event

(36) G = {there exist x ∈ K and y ∈ Rd such that y < x, −∞ < σ(y) ≤ σ(x) − 1 < ∞, and H((y, 0), (x, t)) ≥ σ(x) + 1 − σ(y)}.

Then for 0 < t < 1/(2e^{d+1}ψK(σ)),

Pσ(G) ≤ 2e^{2(d+1)}ψK(σ)^2 t^2.

Proof. Let the index k run through the finite values of σ(x) + 1 in K, and h represent σ(y). Then

Pσ(G) ≤ ∑_{k=I(K,σ)+1}^{J(K,σ)+1} ∑_{h≤k−2} P{H((yh, 0), (q1, t)) ≥ k − h}.

By Lemma 7.1 and the inequality j! ≥ (j/e)^j, for a fixed k the inner sum becomes

∑_{h≤k−2} ((q1 − yh)!t)^{k−h} / ((k − h)!)^{d+1} ≤ ∑_{j≥2} (λk(σ)t)^j j^{(d+1)j} / (j!)^{d+1} ≤ 2(e^{d+1}λk(σ)t)^2.

The assumption on t was used to sum the geometric series. Now sum over k.

We get the first bound on the evolution.

Lemma 7.4. Let φ be a bounded measurable function on Σ, supported on a compact cube K = [−q1, q1] ⊆ Rd. This means that φ(σ) depends on σ only through (σ(x))x∈K. Then there is a finite constant C = C(‖φ‖∞) such that, for all σ ∈ Σ and t > 0, the quantity

∆t(σ) = Eσ[φ(σ(t))] − φ(σ) − tLφ(σ)

satisfies the bound |∆t(σ)| ≤ Ct^2 ψK(σ)^2, provided 0 ≤ t < 1/(2e^{d+1}ψK(σ)).

Proof. We may assume that σ is not ±∞ on all of K. For otherwise from the definitions

Eσ[φ(σ(t))] = φ(σ) and Lφ(σ) = 0,

and the lemma is trivially satisfied. Observe that on the complement Gc of the event defined in (36),

σ(x, t) = sup_{y∈Sx(σ)∪{x}} {σ(y) + H((y, 0), (x, t))}

for all x ∈ K. (The singleton {x} is added to Sx(σ) only to accommodate those x for which σ(x) = ±∞ and Sx(σ) was defined to be empty.) Consequently on the event Gc the value φ(σ(t)) is determined by σ and the Poisson points in the space-time region SK(σ) × (0, t]. Let Dj be the event that SK(σ) × (0, t] contains j space-time Poisson points, and D≥2 = (D0 ∪ D1)c the event that this set contains at least 2 Poisson points. On the event D1, let Y ∈ Rd denote the space coordinate of the unique Poisson point, uniformly distributed on SK(σ). Then

Eσ[φ(σ(t))] = φ(σ) · Pσ(Gc ∩ D0) + E[φ(σY) · 1_{Gc∩D1}] + O( Pσ(G) + Pσ(D≥2) )
    = φ(σ) + tLφ(σ) + t^2 · O( ψK(σ)^2 + |SK(σ)|^2 ).


To get the second equality above, use

Pσ(Dj) = (j!)^{−1}(t|SK(σ)|)^j exp(−t|SK(σ)|),

Lemma 7.3 for bounding Pσ(G), and hide the constant from Lemma 7.3 and ‖φ‖∞ in the O-terms. The proof of the lemma is completed by (35).

We insert here an intermediate bound on the height H. It is a consequence of Lemma 7.1 and a discretization of space.

Lemma 7.5. Fix t > 0, α ∈ (0, 1/2), and β > e^2 t^{1/(d+1)}. Then there are finite positive constants θ0, C1 and C2 such that, for any θ ≥ θ0,

(37) P{there exist y < x such that |y|∞ ≥ θ, |x − y|∞ ≥ α|y|∞, and H((y, 0), (x, t)) ≥ β|x − y|∞^{d/(d+1)}} ≤ C1 exp(−C2 θ^{d/(d+1)}).

For positive m, define

(38) σm(x, t) ≡ sup_{y≤x, |y|∞≤m} {σ(y) + H((y, 0), (x, t))}.

Corollary 7.6. Fix a compact cube K ⊆ Rd, 0 < T < ∞, and initial state σ ∈ Σ. Then there exists a finite random variable M such that, almost surely, σ(x, t) = σM(x, t) for (x, t) ∈ K × [0, T].

Proof. If σ(x) = ±∞ then y = x is the only maximizer needed in the variational formula (7). Thus we may assume that I(K, σ) is finite.

Fix α ∈ (0, 1/2) and β > e^2 T^{1/(d+1)}. By the boundedness of K, (37), and Borel-Cantelli there is a finite random M such that

H((y, 0), (x, T)) ≤ β|x − y|∞^{d/(d+1)}

whenever x ∈ K and |y|∞ ≥ M. Increase M further so that M ≥ 1 + |x| for all x ∈ K, and σ(y) ≤ −(2β + |I(K, σ)| + 1)|y|∞^{d/(d+1)} for all y such that y ≤ x for some x ∈ K and |y|∞ ≥ M.

Now suppose y ≤ x, x ∈ K, σ(x) is finite, and |y|∞ ≥ M. Then

σ(y) + H((y, 0), (x, t)) ≤ −(2β + |I(K, σ)| + 1)|y|∞^{d/(d+1)} + β|x − y|∞^{d/(d+1)}
    ≤ −|I(K, σ)| − 1 ≤ σ(x) − 1 ≤ σ(x, t) − 1.

We see that y cannot participate in the supremum in (7) for any (x, t) ∈ K × [0, T].

To derive the generator formula we need to control the error in Lemma 7.4 uniformly over time, in the form ∆τ(σ(s)) with 0 ≤ s ≤ t and a small τ > 0. For a fixed k, λk(σ(s)) is nondecreasing in s because each coordinate of yh decreases over time. For q > 0 and k ∈ Z introduce the function

(39) Ψq,k(σ) = sup_{x≤q1} |x|∞^d / (1 ∨ {k − σ(x)})^{d+1}.

A calculation that begins with

λk(σ) ≤ q^d/2 + sup_{h≤k−2: yq1,h ≠ q1} 2^d |yq1,h|∞^d / (k − h)^{d+1}

shows that

λk(σ) ≤ q^d/2 + 2^d Ψq,k(σ).

Interface heights σ(x, s) never decrease with time, and Ψq,k(σ) is nonincreasing in k but nondecreasing in σ. Therefore we can bound as follows, uniformly over s ∈ [0, t]:

ψK(σ(s))^2 = ∑_{k=I(K,σ(s))+1}^{J(K,σ(s))+1} λk(σ(s))^2
    ≤ ( J(K, σ(t)) − I(K, σ(0)) + 1 ) · max_{I(K,σ(0))+1 ≤ k ≤ J(K,σ(t))+1} λk(σ(s))^2
(40)    ≤ ( J(K, σ(t)) − I(K, σ(0)) + 1 )^2 (q^{2d} + 1) + 2^{4d} Ψq,I(K,σ(0))(σ(t))^4.

Above we used the inequality c(a + b)^2 ≤ 2ca^2 + c^2 + b^4 for a, b, c ≥ 0. The next lemma implies that the moments Eσ[Ψq,I(K,σ)(σ(t))^p] are finite for all p < ∞.

Lemma 7.7. Let σ be an element of the state space Σ. Fix t > 0 and a point q1 ∈ Rd+. Then there exists a finite number v0(σ) such that, for v ≥ v0(σ),

(41) Pσ{ Ψq,I(K,σ)(σ(t)) > v } ≤ C1 exp(−C2 v^{1/(d+1)}),

where the finite positive constants C1, C2 are the same as in Lemma 7.5 above.

Proof. Choose α, β so that (37) is valid. Let

β1 = 2β + β(2α)^{d/(d+1)} + 2.

Fix v0 = v0(σ) > 0 so that these requirements are met: v0 ≥ 1 + |I(K, σ)|^{d+1}, and for all y ≤ x ≤ q1 such that |x|∞^d ≥ v0,

σ(x) ≤ −β1|x|∞^{d/(d+1)} and |y|∞ ≥ |x|∞ ≥ θ0.

Here θ0 is the constant that appeared in Lemma 7.5, and we used property (6) of the state space Σ.

Let v ≥ v0. We shall show that the event on the left-hand side of (41) is contained in the event in (37) with θ = v^{1/d}. Suppose the event in (41) happens, so that some x ≤ q1 satisfies

(42) v^{−1/(d+1)}|x|∞^{d/(d+1)} > I(K, σ) − σ(x, t) and |x|∞^d ≥ v.

Note that the above inequality forces σ(x, t) > −∞, while the earlier requirement on v0 forces σ(x) < ∞, and thereby also σ(x, t) < ∞. Find a maximizer y ≤ x so that

σ(x, t) = σ(y) + H((y, 0), (x, t)).

Regarding the location of y, we have two cases to consider.

Case 1. y ∈ [x − 2α|x|∞1, x]. Let y′ = x − 2α|x|∞1. Then |x − y′|∞ ≥ α|y′|∞ by virtue of α ∈ (0, 1/2). Also y′ ≤ x so the choices made above imply |y′|∞ ≥ |x|∞ ≥ v^{1/d}.

H((y′, 0), (x, t)) ≥ H((y, 0), (x, t)) = σ(x, t) − σ(y)
    > I(K, σ) − v^{−1/(d+1)}|x|∞^{d/(d+1)} − σ(x)
    ≥ (β1 − 2)|x|∞^{d/(d+1)} ≥ β(2α)^{d/(d+1)}|x|∞^{d/(d+1)}
    = β|x − y′|∞^{d/(d+1)}.


In addition to (42), we used −v^{−1/(d+1)} ≥ −1, −σ(y) ≥ −σ(x) ≥ β1|x|∞^{d/(d+1)}, and I(K, σ) ≥ −v0^{1/(d+1)} ≥ −|x|∞^{d/(d+1)}.

Case 2. y /∈ [x − 2α|x|∞1, x]. This implies |x − y|∞ ≥ α|y|∞.

H((y, 0), (x, t)) = σ(x, t) − σ(y) > I(K, σ) − v^{−1/(d+1)}|x|∞^{d/(d+1)} − σ(y)
    ≥ −|x|∞^{d/(d+1)} − v^{−1/(d+1)}|x|∞^{d/(d+1)} + β1|y|∞^{d/(d+1)}
    ≥ 2β|y|∞^{d/(d+1)} ≥ β|x − y|∞^{d/(d+1)}.

We conclude that the event in (41) lies inside the event in (37) with θ = v^{1/d}, as long as v ≥ v0, and the inequality in (41) follows from (37).

Corollary 7.8. Let K be a compact cube, ε > 0, and 0 < t < ∞. Then there exists a deterministic compact cube L such that

Pσ{ SK(σ(s)) ⊆ L for all s ∈ [0, t] } ≥ 1 − ε.

Proof. For 0 ≤ s ≤ t, x ∈ SK(σ(s)) implies that I(K, σ(s)) is finite, x ≤ q1 and σ(x, s) ≥ I(K, σ(s)). Consequently

|x|∞^d ≤ Ψq,I(K,σ(s))(σ(s)) ≤ Ψq,I(K,σ)(σ(t)).

Thus given ε, we can choose L = [−m1, m1] with m picked by Lemma 7.7 so that Pσ{ Ψq,I(K,σ)(σ(t)) > m^d } < ε.

We are ready for the last stage of the proof of Theorem 3.1.

Proposition 7.9. Let φ be a bounded measurable function on Σ supported on the compact cube K = [−q1, q1] of Rd, and σ ∈ Σ. Then

(43) Eσ[φ(σ(t))] − φ(σ) = ∫_0^t Eσ[Lφ(σ(s))] ds.

Proof. Pick a small τ > 0 so that t = mτ for an integer m, and denote the partition by sj = jτ. By the Markov property,

Eσ[φ(σ(t))] − φ(σ) = Eσ[ ∑_{j=0}^{m−1} ( E_{σ(sj)}[φ(σ(τ))] − φ(σ(sj)) ) ]

    = Eσ[ ∫_0^t ∑_{j=0}^{m−1} 1_{(sj, sj+1]}(s) Lφ(σ(sj+1)) ds ]
(44)    + τ( Lφ(σ) − Eσ[Lφ(σ(t))] ) + Eσ[ ∑_{j=0}^{m−1} ∆τ(σ(sj)) ],

where the terms ∆τ(σ(sj)) are as defined in Lemma 7.4.

We wish to argue that, as m → ∞ and simultaneously τ → 0, expression (44) after the last equality sign converges to the right-hand side of (43).

Note first that Lφ(σ) is determined by the restriction of σ to the set SK(σ) ∪ K. By Corollary 7.8 there exists a fixed compact set L such that SK(σ(s)) ∪ K ⊆ L for 0 ≤ s ≤ t with probability at least 1 − ε. By Corollary 7.6, the time evolution


{σ(x, s) : x ∈ L, 0 ≤ s ≤ t} is determined by the finitely many Poisson points in the random compact rectangle [−M, M]d × [0, t]. Consequently the process Lφ(σ(s)) is piecewise constant in time, and then the integrand ∑_{j=0}^{m−1} 1_{(sj, sj+1]}(s)Lφ(σ(sj+1)) converges to Lφ(σ(s)) pointwise as m → ∞. This happens on an event with probability at least 1 − ε, hence almost surely after letting ε → 0.

To extend the convergence to the expectation and to handle the error terms, we show that

(45) Eσ[ sup_{0≤s≤t} ψK(σ(s))^2 ] < ∞.

Before proving (45), let us see why it is sufficient. Since

(46) |Lφ(σ)| ≤ 2‖φ‖∞|SK(σ)|,

(35) and (45) imply that also the first expectation after the equality sign in (44) converges, by dominated convergence. The second and third terms of (44) vanish, through a combination of Lemma 7.4, (35), and (45).

By the bound in (40) for sup_{0≤s≤t} ψK(σ(s))^2 and by Lemma 7.7, it only remains to show that

Eσ[ ( J(K, σ(t)) − I(K, σ(0)) + 1 )^2 ] < ∞.

This follows from property (6) of σ and the bounds for H in Lemmas 7.1 and 7.5. We omit the proof since it is not different in spirit from the estimates we already developed.

This completes the proof of Theorem 3.1.

8. Proof of the limit for the height function

Introduce the scaling into the variational formula (7) and write it as

(47) σn(nx, nt) = sup_{y∈Rd: y≤x} {σn(ny, 0) + H((ny, 0), (nx, nt))}.

Lemma 8.1. Assume the processes σn satisfy (12) and (13). Fix a finite T > 0 and a point b ∈ Rd such that b > 0, and consider the bounded rectangle [−b, b] ⊆ Rd. Then with probability 1 there exist a random N < ∞ and a random point a ∈ Rd such that

(48) σn(nx, nt) = sup_{y∈[a,x]} {σn(ny, 0) + H((ny, 0), (nx, nt))}

for x ∈ [−b, b], t ∈ (0, T], n ≥ N.

Proof. For β ≥ e^2 T^{1/(d+1)} and b ∈ Rd fixed, one can deduce from Lemma 7.1 and Borel-Cantelli that, almost surely, for large enough n,

H((ni, 0), (nb, nt)) ≤ βn|b − i|∞^{d/(d+1)}

for all i ∈ Zd such that i ≤ b and |i − b|∞ ≥ 1. If y ∈ Rd satisfies y ≤ b and |y − b|∞ ≥ 1, we can take i = [y] (coordinatewise integer parts of y) and see that

(49) H((ny, 0), (nb, nt)) ≤ βn + βn|b − y|∞^{d/(d+1)}

for all such y.


In assumption (13) choose C > β so that −C + (2 + |b|∞^{d/(d+1)})β < u0(−b) − 1. Let N and M be as given by (13), but increase M further to guarantee M ≥ 1. Now take a ∈ Rd far enough below −b so that, if y ≤ b but y ≥ a fails, then |y|∞ ≥ M. [Since assumption (13) permits a random M > 0, here we may need to choose a random a ∈ Rd.] Then by (13), if y ≤ b but y ≥ a fails, then σn(ny, 0) ≤ −Cn|y|∞^{d/(d+1)}.

Now suppose x ∈ [−b, b], y ≤ x, but y ≥ a fails. Then

σn(ny, 0) + H((ny, 0), (nx, nt)) ≤ σn(ny, 0) + H((ny, 0), (nb, nt))
    ≤ −Cn|y|∞^{d/(d+1)} + βn + βn|b − y|∞^{d/(d+1)}
    ≤ n( (−C + β)|y|∞^{d/(d+1)} + β + β|b|∞^{d/(d+1)} )
    ≤ nu0(−b) − n ≤ σn(−nb, 0) − n/2 [by assumption (12), for large enough n]
    ≤ σn(nx, 0) − n/2 [by monotonicity].

This shows that in the variational formula (47) the point y = x strictly dominates all y outside [a, x].

Starting with (48) the limit (15) is proved (i) by partitioning [a, x] into small rectangles, (ii) by using monotonicity of the random variables, and the monotonicity and continuity of the limit, and (iii) by appealing to the assumed initial limits (12) and to

(50) n^{−1}H((ny, 0), (nx, nt)) → cd+1((x − y)!t)^{1/(d+1)} = tg((x − y)/t) a.s.

To derive the limit in (50) from (3) one has to fill in a technical step because in (50) the lower left corner of the rectangle (ny, nx] × (0, nt] moves as n grows. One can argue around this complication in at least two different ways: (a) The Kesten-Hammersley lemma ([28], page 20) from subadditive theory gives a.s. convergence along a subsequence, and then one fills in to get the full sequence. This approach was used in [24]. (b) Alternatively, one can use Borel-Cantelli if summable deviation bounds are available. These can be obtained by combining Theorems 3 and 9 from Bollobás and Brightwell [6].

9. Proof of the defect boundary limit

In view of the variational equation (7), let us say σ(x, t) has a maximizer y if y ≤ x and σ(x, t) = σ(y, 0) + H((y, 0), (x, t)).

Lemma 9.1. Suppose two processes σ and ζ are coupled through the space-time Poisson point process.

(a) For a positive integer m, let Dm(t) = {x : ζ(x, t) ≥ σ(x, t) + m}. Then if x ∈ Dm(t), ζ(x, t) cannot have a maximizer y ∈ Dm(0)c. And if x ∈ Dm(t)c, σ(x, t) cannot have a maximizer y ∈ Dm(0).

(b) In particular, suppose initially σ(y, 0) ≤ ζ(y, 0) ≤ σ(y, 0) + h for all y ∈ Rd, for a fixed positive integer h. Then this property is preserved for all time. If we write

A(t) = {x : ζ(x, t) = σ(x, t) + h},

then

(51) A(t) = {x : σ(x, t) has a maximizer y ∈ A(0)}.


(c) If h = 1 in part (b), we get additionally that

(52) A(t)c = {x : ζ(x, t) has a maximizer y ∈ A(0)c }.

Proof. (a) Suppose x ∈ Dm(t), y ∈ Dm(0)c, and ζ(x, t) = ζ(y, 0) + H((y, 0), (x, t)). Then by the definition of Dm(t),

σ(x, t) ≤ ζ(x, t) − m = ζ(y, 0) − m + H((y, 0), (x, t)) ≤ σ(y, 0) + H((y, 0), (x, t)) − 1,

which contradicts the variational equation (7). Thus ζ(x, t) cannot have a maximizer y ∈ Dm(0)c. The other part of (a) is proved similarly.

(b) Monotonicity implies that σ(x, t) ≤ ζ(x, t) ≤ σ(x, t) + h for all (x, t), so A(t) = Dh(t). Suppose x ∈ A(t). By (a) ζ(x, t) cannot have a maximizer y ∈ A(0)c, and so ζ(x, t) has a maximizer y ∈ A(0). Consequently

σ(x, t) = ζ(x, t) − h = ζ(y, 0) − h + H((y, 0), (x, t)) = σ(y, 0) + H((y, 0), (x, t)),

which says that σ(x, t) has a maximizer y ∈ A(0). On the other hand, if σ(x, t) has a maximizer y ∈ A(0), then by (a) again x /∈ A(t)c. This proves (51).

(c) Now A(t) = D1(t) and A(t)c = {x : σ(x, t) = ζ(x, t)}. If ζ(x, t) has a maximizer y ∈ A(0)c, then by part (a) x /∈ A(t). While if x ∈ A(t)c, again by part (a) σ(x, t) must have a maximizer y ∈ A(0)c, which then also is a maximizer for ζ(x, t). This proves (52).

Assume the sequence of processes σn(·) satisfies the hypotheses of the hydrodynamic limit Theorem 4.1 which we proved in Section 8. The defect set An(t) was defined through the (σn, ζn) coupling by (27). By (51) above, we can equivalently define it by

(53) An(t) = {x : σn(x, t) has a maximizer y ∈ An(0) }.

In the next lemma we take the point of view that some sequence of sets that depend on ω has been defined by (53), and ignore the (σn, ζn) coupling definition.

Lemma 9.2. Let B ⊆ Rd be a closed set. Suppose that for almost every sample point ω in the underlying probability space, a sequence of sets An(0) = An(0;ω) is defined, and has this property: for every compact K ⊆ Rd and ε > 0,

(54) {n^{−1}An(0)} ∩ K ⊆ B(ε) ∩ K for all large enough n.

Suppose the sets An(t) satisfy (53) and fix t > 0. Then almost surely, for every compact K ⊆ Rd and ε > 0,

(55) {n^{−1}An(nt)} ∩ K ⊆ W (B, t)(ε) ∩ K for all large enough n.

In particular, if W (B, t) = ∅, then (55) implies that {n^{−1}An(nt)} ∩ K = ∅ for all large enough n.

Proof. Fix a sample point ω such that assumption (54) is valid, the conclusion of Lemma 8.1 is valid for all b ∈ Zd+, and we have the limits

(56) n^{−1}σn(nx, nt) → u(x, t) for all (x, t),


and

(57) n^{−1}H((ny, 0), (nx, nt)) → tg((x − y)/t) for all y, x, t.

Almost every ω satisfies these requirements, by the a.s. limits (50) and (15), by monotonicity, and by the continuity of the limiting functions. It suffices to prove (55) for this fixed ω.

To contradict (55), suppose there is a subsequence nj and points xj ∈ K such that njxj ∈ Anj(njt) but xj /∈ W (B, t)(ε). Note that this also contradicts {n^{−1}An(nt)} ∩ K = ∅ in case W (B, t) = ∅, so the empty set case is also proved by the contradiction we derive.

Let njyj ∈ Anj(0) be a maximizer for σnj(njxj, njt). Since the xj's are bounded, so are the yj's by Lemma 8.1, and we can pass to a subsequence (again denoted by {j}) such that the limits xj → x and yj → y exist. By the assumptions on xj, x /∈ W (B, t). For any ε > 0, yj ∈ B(ε) for large enough j, so y ∈ B by the closedness of B.

Fix points x′ < x′′ and y′ < y′′ so that x′ < x < x′′ and y′ < y < y′′ in the partial order of Rd. Then for large enough j, x′ < xj < x′′ and y′ < yj < y′′. By the choice of yj,

σnj(njxj, njt) = σnj(njyj, 0) + H((njyj, 0), (njxj, njt)),

from which follows, by the monotonicity of the processes,

nj^{−1}σnj(njx′, njt) ≤ nj^{−1}σnj(njxj, njt) ≤ nj^{−1}σnj(njy′′, 0) + nj^{−1}H((njy′, 0), (njx′′, njt)).

Now let nj → ∞ and use the limits (56) and (57) to obtain

u(x′, t) ≤ u0(y′′) + tg((x′′ − y′)/t).

We may let x′, x′′ → x and y′, y′′ → y, and then by continuity u(x, t) ≤ u0(y) + tg((x − y)/t). This is incompatible with having x /∈ W (B, t) and y ∈ B. This contradiction shows that, for the fixed ω, (55) holds.

We prove statement (29) of Theorem 5.1. The assumption is that

(58) B(−ε) ∩ K ⊆ {n^{−1}An(0)} ∩ K ⊆ B(ε) ∩ K for all large enough n.

We introduce an auxiliary process ξn(x, t). Initially set

(59) ξn(y, 0) = { σn(y, 0),  y /∈ An(0),
                  σn(y, 0) + 1,  y ∈ An(0).

ξn(y, 0) is a well-defined random element of the state space Σ because An(0) is defined by (27) in terms of ζn(y, 0), which lies in Σ. Couple the process ξn with σn and ζn through the common space-time Poisson points. Then

σn(x, t) ≤ ξn(x, t) ≤ σn(x, t) + 1.

By part (b) of Lemma 9.1, the set An(t) that satisfies (53) also satisfies

An(t) = {x : ξn(x, t) = σn(x, t) + 1}.


Then by part (c) of Lemma 9.1,

(60) An(t)c = {x : ξn(x, t) has a maximizer y ∈ An(0)c }.

The first inclusion of assumption (58) implies that n^{−1}An(0)c ∩ K ⊆ (Bc)(ε) ∩ K for large n. The processes ξn inherit all the hydrodynamic properties of the processes σn. Thus by (60) we may apply Lemma 9.2 to the sets An(nt)c and the processes ξn(nt) to get

(61) n^{−1}An(nt)c ∩ K ⊆ W (Bc, t)(δ) ∩ K

for large enough n. By (55) and (61),

bd{n^{−1}An(nt)} ∩ K ⊆ W (B, t)(δ) ∩ W (Bc, t)(δ) ∩ K

for large n. For small enough δ > 0, the set on the right is contained in [W (B, t) ∩ W (Bc, t)](ε) ∩ K = X(B, t)(ε) ∩ K. This proves (29).

To complete the proof of Theorem 5.1, it remains to prove

(62) W (B, t)(−ε) ∩ K ⊆ n^{−1}An(nt) ∩ K ⊆ W (B, t)(ε) ∩ K for all large enough n

under the further assumption that no point of W (Bc, t) is an interior point of W (B, t).

The second inclusion of (62) we already obtained in Lemma 9.2. (61) implies

[W (Bc, t)(δ)]c ∩ K ⊆ n^{−1}An(nt) ∩ K.

It remains to check that, given ε > 0,

W (B, t)(−ε) ∩ K ⊆ [W (Bc, t)(δ)]c ∩ K

for sufficiently small δ > 0. Suppose not, so that for a sequence δj ↘ 0 there exist xj ∈ W (B, t)(−ε) ∩ W (Bc, t)(δj) ∩ K. By Lemma 6.1 the set W (Bc, t) is closed. Hence passing to a convergent subsequence xj → x gives a point x ∈ W (Bc, t) which is an interior point of W (B, t), contrary to the hypothesis.

10. Technical appendix: the state space of the process

We develop the state space in two steps: first we describe the multidimensional Skorohod type metric we need, and then we amend the metric to provide control over the left tail of the height function. This Skorohod type space has been used earlier (see [5] and the references therein).

10.1. A Skorohod type space in multiple dimensions

Let (X, r) be a complete, separable metric space, with metric r(x, y) ≤ 1. Let D = D(Rd, X) denote the space of functions σ : Rd → X with this property: for every bounded rectangle [a, b) ⊆ Rd and ε > 0, there exist finite partitions

a_i = s_i^0 < s_i^1 < · · · < s_i^{m_i} = b_i


of each coordinate axis (1 ≤ i ≤ d) such that the variation of σ in the partition rectangles is at most ε: for each k = (k1, k2, . . . , kd) ∈ ∏_{i=1}^d {0, 1, 2, . . . , mi − 1},

(63) sup{ r(σ(x), σ(y)) : s_i^{ki} ≤ xi, yi < s_i^{ki+1} (1 ≤ i ≤ d) } ≤ ε.

Note that the partition rectangles are closed on the left. This implies that σ is continuous from above: σ(y) → σ(x) as y → x in Rd so that y ≥ x; and limits exist from strictly below: lim σ(y) exists as y → x in Rd so that y < x (strict inequality for each coordinate).

We shall employ this notation for truncation in Rd: for real u > 0 and x = (x1, . . . , xd) ∈ Rd,

[x]u = ( (x1 ∧ u) ∨ (−u), (x2 ∧ u) ∨ (−u), . . . , (xd ∧ u) ∨ (−u) ).

Let Λ be the collection of bijective, strictly increasing Lipschitz functions λ : Rd → Rd that satisfy these requirements: λ is of the type λ(x1, . . . , xd) = (λ1(x1), . . . , λd(xd)) where each λi : R → R is bijective, strictly increasing and Lipschitz; and

γ(λ) = γ0(λ) + γ1(λ) < ∞,

where the quantities γ0(λ) and γ1(λ) are defined by

γ0(λ) = ∑_{i=1}^d sup_{s,t∈R, s≠t} | log [ (λi(t) − λi(s)) / (t − s) ] |

and

γ1(λ) = ∫_0^∞ e^{−u} ( 1 ∧ sup_{x∈Rd} |[λ(x)]u − [x]u|∞ ) du.

For ρ, σ ∈ D, λ ∈ Λ and u > 0, define

d(ρ, σ, λ, u) = sup_{x∈Rd} r( ρ([x]u), σ([λ(x)]u) ).

And then

(64) dS(ρ, σ) = inf_{λ∈Λ} { γ(λ) ∨ ∫_0^∞ e^{−u} d(ρ, σ, λ, u) du }.

The definition was arranged so that γ(λ^{−1}) = γ(λ) and γ(λ ◦ µ) ≤ γ(λ) + γ(µ), so the proof in [11], Section 3.5, can be repeated to show that dS is a metric.

It is clear that if a sequence of functions σn from D converges to an arbitrary function σ : Rd → X, and this convergence happens uniformly on compact subsets of Rd, then σ ∈ D. Furthermore, we also get convergence in the dS-metric, as the next lemma indicates. This lemma is needed in the proof that (D, dS) is complete.

Lemma 10.1. Suppose σn, σ ∈ D. Then dS(σn, σ) → 0 iff there exist λn ∈ Λ such that γ(λn) → 0 and

r( σn(x), σ(λn(x)) ) → 0

uniformly over x in compact subsets of Rd.

Proof. We prove dS(σn, σ) → 0 assuming the second condition, and leave the other direction to the reader. For each rectangle [−M1, M1), M = 1, 2, 3, . . . , and each ε = 1/K, K = 1, 2, 3, . . . , fix the partitions {s_i^k} that appear in the definition (63)

i } that appear in the definition (63)

Page 241: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

228 T. Seppalainen

of σ ∈ D. Pick a real u > 0 so that neither u nor −u is among these countablymany partition points.

d(σn, σ, λn, u) = sup_{x∈Rd} r( σn([x]u), σ([λn(x)]u) )
    ≤ sup_{x∈Rd} r( σn([x]u), σ(λn([x]u)) ) + sup_{x∈Rd} r( σ(λn([x]u)), σ([λn(x)]u) ).

The first term after the inequality vanishes as n → ∞, by assumption.

Let ε = 1/K > 0, pick a large rectangle [−M1, M1) that contains [−u1, u1] well inside its interior, and for this rectangle and ε pick the finite partitions that satisfy (63) for σ, and do not contain ±u. Let δ > 0 be such that none of these finitely many partition points lie in (±u − δ, ±u + δ). If n is large enough, then sup_{x∈[−M1,M1]} |λn(x) − x| < δ, and one can check that λn([x]u) and [λn(x)]u lie in the same partition rectangle, for each x ∈ Rd. Thus

sup_{x∈Rd} r( σ(λn([x]u)), σ([λn(x)]u) ) ≤ ε.

We have shown that d(σn, σ, λn, u) → 0 for a.e. u > 0.

With this lemma, one can follow the proof in [11], page 121, to show that (D, dS) is complete. Separability of (D, dS) would also be easy to prove. Next, we take this Skorohod type space as starting point, and define the state space Σ for the height process.

10.2. The state space for the height process

In the setting of the previous subsection, take X = Z∗ = Z ∪ {±∞} with the discrete metric r(x, y) = 1{x ≠ y}. Let Σ be the space of functions σ ∈ D(Rd,Z∗) that are nondecreasing [σ(x) ≤ σ(y) if x ≤ y in Rd] and decay to −∞ sufficiently fast at −∞, namely

(65) for every b ∈ Rd, lim_{M→∞} sup{ |y|∞^{−d/(d+1)}σ(y) : y ≤ b, |y|∞ ≥ M } = −∞.

Condition (65) is not preserved by convergence in the dS metric, so we need to fix the metric.

For σ ∈ Σ, h ∈ Z, and b ∈ Rd, let yb,h(σ) be the maximal y ≤ b in Rd such that the rectangle [y, b] contains the set {x ≤ b : σ(x) ≥ h}. Condition (65) guarantees that such a finite yb,h(σ) exists. In fact, (65) is equivalent to

(66) for every b ∈ Rd, lim_{h→−∞} |h|^{−(d+1)/d}|yb,h(σ)|∞ = 0.

For ρ, σ ∈ Σ and b ∈ Rd, define

θb(ρ, σ) = sup_{h≤−1} |h|^{−(d+1)/d} · |yb,h(ρ) − yb,h(σ)|∞

and

Θ(ρ, σ) = ∫_{Rd} e^{−|b|∞}(1 ∧ θb(ρ, σ)) db.

Θ(ρ, σ) satisfies the triangle inequality, is symmetric, and Θ(σ, σ) = 0, so we can define a metric on Σ by

dΣ(ρ, σ) = Θ(ρ, σ) + dS(ρ, σ).

The effect of the Θ(ρ, σ) term in the metric is the following.


Lemma 10.2. Suppose dS(σn, σ) → 0. Then dΣ(σn, σ) → 0 iff for every b ∈ Rd,

(67) lim_{h→−∞} sup_n |h|^{−(d+1)/d}|yb,h(σn)|∞ = 0,

or equivalently, for every b ∈ Rd,

(68) lim_{M→∞} sup_n sup_{y≤b, |y|∞≥M} σn(y)/|y|∞^{d/(d+1)} = −∞.

We leave the proof of the above lemma to the reader. Lemmas 10.1 and 10.2 together give a natural characterization of convergence in Σ.

Lemma 10.3. The Borel σ-field BΣ is the same as the σ-field F generated by the coordinate projections σ ↦ σ(x).

Proof. The sets {σ : σ(x) ≥ h} are closed, so the functions σ ↦ σ(x) are upper semicontinuous. This implies F ⊆ BΣ.

For the other direction one shows that for a fixed ρ ∈ Σ, the function σ ↦ dΣ(ρ, σ) is F-measurable. This implies that the balls {σ ∈ Σ : dΣ(ρ, σ) < r} are F-measurable. Once we argue below that Σ is separable, this suffices for BΣ ⊆ F.

To show the F-measurability of σ ↦ dS(ρ, σ) one can adapt the argument from page 128 of [11]. To show the F-measurability of σ ↦ Θ(ρ, σ), one can start by arguing the joint BRd ⊗ F-measurability of the map (b, σ) ↦ yb,h(σ) from Rd × Σ into Rd. We leave the details to the reader.

The remaining work is to check that (Σ, dΣ) is a complete separable metric space.

Proposition 10.4. The space (Σ, dΣ) is complete.

We prove this proposition in several stages. Let {σn} be a Cauchy sequence in the dΣ metric. By the completeness of (D, dS), we already know there exists a σ ∈ D(Rd,Z∗) such that dS(σn, σ) → 0. We need to show that (i) σ ∈ Σ and (ii) Θ(σn, σ) → 0.

Following the completeness proof for Skorohod space in [11], page 121, we may extract a subsequence, denoted again by σn, together with a sequence of Lipschitz functions ψn ∈ Λ (actually labeled µn^{−1} in [11]), such that

(69) γ(ψn) < 2^{1−n}

and

(70) σn(ψn(x)) → σ(x) uniformly on compact sets.

Step 1. σ ∈ Σ.

Fix b ∈ Rd, for which we shall show (66). It suffices to consider b > 0. Let bk = b + k1. By passing to a further subsequence we may assume Θ(σn, σn+1) < e^{−n^2}. Fix n0 so that

(71) exp( |b2|∞ + d(n + 1) − n^2 ) < 2^{−n} for all n ≥ n0.

Lemma 10.5. For n ≥ n0 there exist points βn in Rd such that b1 < βn+1 < βn < b2, and θβn(σn, σn+1) < 2^{−n}.


Proof. Let αn = b1 + e^{−n} · 1 in Rd. Then

e^{−n^2} ≥ Θ(σn, σn+1) ≥ inf_{x∈(αn+1,αn)} {1 ∧ θx(σn, σn+1)} · e^{−|b2|∞} · Lebd{x : αn+1 < x < αn},

where

Lebd{x : αn+1 < x < αn} = (e^{−n} − e^{−n−1})^d ≥ e^{−d(n+1)}

is the d-dimensional Lebesgue measure of the open rectangle (αn+1, αn). This implies it is possible to choose a point βn ∈ (αn+1, αn) so that θβn(σn, σn+1) < 2^{−n}.

βn+1 < βn implies yβn+1,h(σn+1) ≥ yβn,h(σn+1) − (βn − βn+1). For each fixed h ≤ −1, applying the above lemma inductively gives, for n ≥ n0,

yβn+1,h(σn+1) ≥ yβn,h(σn+1) − (βn − βn+1)
    ≥ yβn,h(σn) − |h|^{(d+1)/d}2^{−n} · 1 − (βn − βn+1)
    ≥ · · · ≥ yβn0,h(σn0) − |h|^{(d+1)/d} ∑_{k=n0}^n 2^{−k} · 1 − (βn0 − βn+1),

from which then

(72) inf_{n≥n0} yb1,h(σn) ≥ yb2,h(σn0) − |h|^{(d+1)/d}2^{1−n0} · 1 − (b2 − b1).

Now fix h ≤ −1 for the moment. By (72) we may fix a rectangle [y1, b1] that contains the sets {x ≤ b1 : σn(x) ≥ h} for all n ≥ n0. Let Q = [y1 − 1, b1 + 1] be a larger rectangle such that each point in [y1, b1] is at least distance 1 from Qc. By (69) and (70) we may pick n large enough so that |ψn(x) − x| < 1/4 and σn(ψn(x)) = σ(x) for x ∈ Q. [Equality because Z∗ has the discrete metric.]

We can now argue that if x ≤ b and σ(x) ≥ h, then necessarily x ∈ Q, ψn(x) ≤ b1 and σn(ψn(x)) ≥ h, which implies by (72) that

x ≥ ψn(x) − (1/4)1 ≥ yb1,h(σn) − (1/4)1 ≥ yb2,h(σn0) − (5/4 + |h|^{(d+1)/d}2^{1−n0}) · 1.

This can be repeated for each h ≤ −1, with n0 fixed. Thus for all h ≤ −1,

|yb,h(σ)| ≤ |b| ∨ ( |yb2,h(σn0)| + 5/4 + |h|^{(d+1)/d}2^{1−n0} ),

and then, since σn0 ∈ Σ,

lim_{h→−∞} |h|^{−(d+1)/d}|yb,h(σ)| ≤ 2^{1−n0}.

Since n0 can be taken arbitrarily large, (66) follows for σ, and thereby σ ∈ Σ.

Step 2. Θ(σn, σ) → 0.

As for Step 1, let us assume that we have picked a subsequence σn that satisfies (69) and (70) and Θ(σn+1, σn) < e^{−n^2}. Let φn = ψn^{−1}. If we prove Θ(σn, σ) → 0 along this subsequence, then the Cauchy assumption and triangle inequality give it for the full sequence.


Fix an arbitrary index n1 and a small 0 < ε0 < 1. Fix also β ∈ Rd. For each h ≤ −1, fix a rectangle [yh, β] that contains the sets {x ≤ β : σn(x) ≥ h} for each σn for n ≥ n1, and also for σ, which Step 1 just showed lies in Σ. This can be done for each fixed h because by (72) there exists n0 = n0(β) defined by (71) so that the points yβ,h(σn) are bounded below for n ≥ n0. Then if necessary decrease yh further so that

yh ≤ yβ,h(σ1) ∧ yβ,h(σ2) ∧ · · · ∧ yβ,h(σn0−1).

Let Qh,k = [yh − k1, β + k1] be larger rectangles.

On the rectangles Qh,2, h ≤ −1, construct the finite partition for σ which satisfies (63) for ε = 1/2, so that the discrete metric forces σ to be constant on the partition rectangles. Consider a point b = (b1, b2, . . . , bd) < β with the property that no coordinate of b equals any one of the (countably many) partition points. This restriction excludes only a Lebesgue null set of points b.

Find ε1 = ε1(β, b, h) > 0 such that the intervals (bi − ε1, bi + ε1) contain none of the finitely many partition points that pertain to the rectangle Qh,2. Pick n = n(β, b, h) > n1 such that σn(ψn(x)) = σ(x) and |ψn(x) − x| < (ε0 ∧ ε1)/4 for x ∈ Qh,2. Since the maps ψ, φ do not carry any points of [yh, β] out of Qh,1, yb,h(σn) = yb,h(σ ◦ φn). It follows that

|yb,h(σ) − yb,h(σn)| = |yb,h(σ) − yb,h(σ ◦ φn)| < ε0.

The last inequality above is justified as follows: the only way it could fail is that σ (or σ ◦ φn) has a point x ≤ b with height ≥ h, and σ ◦ φn (respectively, σ) does not. These cannot happen because the maps ψ, φ cannot carry a partition point from one side of bi to the other side, along any coordinate direction i.

Now we have for a.e. b < β and each h ≤ −1, with n = n(β, b, h) > n1:

|h|^{−(d+1)/d}|yb,h(σ) − yb,h(σn1)|
    ≤ |h|^{−(d+1)/d}|yb,h(σ) − yb,h(σn)| + |h|^{−(d+1)/d}|yb,h(σn) − yb,h(σn1)|
    ≤ ε0 + θb(σn, σn1)
    ≤ ε0 + sup_{m>n1} θb(σm, σn1).

The last line has no more dependence on β or h. Since β was arbitrary, this holds for a.e. b ∈ Rd. Take supremum over h ≤ −1 on the left, to get

1 ∧ θb(σ, σn1) ≤ ε0 + sup_{m>n1} {1 ∧ θb(σm, σn1)} for a.e. b.

Integrate to get

Θ(σ, σn1) = ∫_{Rd} e^{−|b|∞}{1 ∧ θb(σ, σn1)} db
    ≤ Cε0 + ∫_{Rd} e^{−|b|∞} sup_{m>n1} {1 ∧ θb(σm, σn1)} db
    ≤ Cε0 + ∫_{Rd} e^{−|b|∞} sup_{m>n1} { ∑_{k=n1}^{m−1} 1 ∧ θb(σk+1, σk) } db
    = Cε0 + ∑_{k=n1}^∞ ∫_{Rd} e^{−|b|∞}{1 ∧ θb(σk+1, σk)} db
    = Cε0 + ∑_{k=n1}^∞ Θ(σk+1, σk) ≤ Cε0 + ∑_{k=n1}^∞ e^{−k^2},


where C = ∫_{Rd} e^{−|b|∞} db. Since n1 was an arbitrary index, we have

lim sup_{n1→∞} Θ(σn1, σ) ≤ Cε0.

Since ε0 was arbitrary, Step 2 is completed, and Proposition 10.4 is thereby proved.

We outline how to construct a countable dense set in (Σ, dΣ). Fix a < b in Zd. In the rectangle [a, b] ⊆ Rd, consider the (countably many) finite rational partitions of each coordinate axis. For each such partition of [a, b] into rectangles, consider all the nondecreasing assignments of values from Z∗ to the rectangles. Extend the functions σ̄ thus defined to all of Rd in some fashion, but so that they are nondecreasing and Z∗-valued. Repeat this for all rectangles [a, b] with integer corners. This gives a countable set D̄ of elements of D(Rd,Z∗). Finally, each such σ̄ ∈ D̄ yields countably many elements σ ∈ Σ by setting

σ(x) = { −∞,  σ̄(x) < h,
         σ̄(x),  σ̄(x) ≥ h,

for all h ∈ Z. All these σ together form a countable set Σ̄ ⊆ Σ.

Now given an arbitrary σ ∈ Σ, it can be approximated by an element σ̄ ∈ Σ̄ arbitrarily closely (in the sense that σ = σ̄ ◦ φ for a map φ ∈ Λ close to the identity) on any given finite rectangle [−β, β], and so that yb,h(σ̄) is close to yb,h(σ) for all b in this rectangle, for any given range h0 ≤ h ≤ −1. Since |h|^{−(d+1)/d}|yβ,h(σ)| < ε for h ≤ h0 for an appropriately chosen h0, this suffices to make both d(σ, σ̄, φ, u) and θb(σ, σ̄) small for a range of u > 0 and b ∈ Rd. To get close under the metric dΣ it suffices to approximate in a bounded set of u's and b's, so it can be checked that Σ̄ is dense in Σ.

Acknowledgments. The author thanks Tom Kurtz for valuable suggestions and anonymous referees for careful readings of the manuscript. Hermann Rost has also studied the process described here but has not published his results.

References

[1] Aldous, D. and Diaconis, P. (1995). Hammersley's interacting particle process and longest increasing subsequences. Probab. Theory Related Fields 103 199–213.

[2] Aldous, D. and Diaconis, P. (1999). Longest increasing subsequences: from patience sorting to the Baik-Deift-Johansson theorem. Bull. Amer. Math. Soc. (N.S.) 36 413–432.

[3] Baik, J., Deift, P. and Johansson, K. (1999). On the distribution of the length of the longest increasing subsequence of random permutations. J. Amer. Math. Soc. 12 1119–1178.

[4] Balázs, M., Cator, E. and Seppäläinen, T. (2006). Cube root fluctuations for the corner growth model associated to the exclusion process. Electron. J. Probab. 11 1094–1132.

[5] Bickel, P. J. and Wichura, M. J. (1971). Convergence criteria for multiparameter stochastic processes and some applications. Ann. Math. Statist. 42 1656–1670.

[6] Bollobás, B. and Brightwell, G. (1992). The height of a random partial order: Concentration of measure. Ann. Appl. Probab. 2 1009–1018.

[7] Bollobás, B. and Winkler, P. (1988). The longest chain among random points in Euclidean space. Proc. Amer. Math. Soc. 103 347–353.

[8] Cator, E. and Groeneboom, P. (2006). Second class particles and cube root asymptotics for Hammersley's process. Ann. Probab. 34 1273–1295.

[9] Dafermos, C. M. (1977). Generalized characteristics and the structure of solutions of hyperbolic conservation laws. Indiana Univ. Math. J. 26 1097–1119.

[10] Evans, L. C. (1998). Partial Differential Equations. Amer. Math. Soc., Providence, RI.

[11] Ethier, S. N. and Kurtz, T. G. (1986). Markov Processes: Characterization and Convergence. Wiley, New York.

[12] Ferrari, P. A. (1992). Shock fluctuations in asymmetric simple exclusion. Probab. Theory Related Fields 91 81–110.

[13] Ferrari, P. A. and Kipnis, C. (1995). Second class particles in the rarefaction fan. Ann. Inst. H. Poincaré Probab. Statist. 31 143–154.

[14] Groeneboom, P. (2002). Hydrodynamical methods for analyzing longest increasing subsequences. J. Comput. Appl. Math. 142 83–105. Special issue on Probabilistic Methods in Combinatorics and Combinatorial Optimization.

[15] Hammersley, J. M. (1972). A few seedlings of research. Proc. Sixth Berkeley Symp. Math. Stat. Probab. I 345–394.

[16] Liggett, T. M. (1999). Stochastic Interacting Systems: Contact, Voter and Exclusion Processes. Springer, Berlin.

[17] Logan, B. F. and Shepp, L. A. (1977). A variational problem for random Young tableaux. Adv. Math. 26 206–222.

[18] Rezakhanlou, F. (1995). Microscopic structure of shocks in one conservation laws. Ann. Inst. H. Poincaré Anal. Non Linéaire 12 119–153.

[19] Rezakhanlou, F. (2002). Continuum limit for some growth models I. Stochastic Process. Appl. 101 1–41.

[20] Rezakhanlou, F. (2001). Continuum limit for some growth models II. Ann. Probab. 29 1329–1372.

[21] Seppäläinen, T. (1996). A microscopic model for the Burgers equation and longest increasing subsequences. Electron. J. Probab. 1 1–51.

[22] Seppäläinen, T. (1998). Large deviations for increasing sequences on the plane. Probab. Theory Related Fields 112 221–244.

[23] Seppäläinen, T. (1999). Existence of hydrodynamics for the totally asymmetric simple K-exclusion process. Ann. Probab. 27 361–415.

[24] Seppäläinen, T. (2000). Strong law of large numbers for the interface in ballistic deposition. Ann. Inst. H. Poincaré Probab. Statist. 36 691–736.

[25] Seppäläinen, T. (2001). Perturbation of the equilibrium for a totally asymmetric stick process in one dimension. Ann. Probab. 29 176–204.

[26] Seppäläinen, T. (2001). Second-class particles as microscopic characteristics in totally asymmetric nearest-neighbor K-exclusion processes. Trans. Amer. Math. Soc. 353 4801–4829.

[27] Seppäläinen, T. (2002). Diffusive fluctuations for one-dimensional totally asymmetric interacting random dynamics. Comm. Math. Phys. 229 141–182.

[28] Smythe, R. T. and Wierman, J. C. (1978). First-Passage Percolation on the Square Lattice. Lecture Notes in Math. 671. Springer, Berlin.

[29] Veršik, A. M. and Kerov, S. V. (1977). Asymptotic behavior of the Plancherel measure of the symmetric group and the limit form of Young tableaux. (Russian) Dokl. Akad. Nauk SSSR 233 1024–1027. English translation: Soviet Math. Dokl. 18 527–531.

[30] Winkler, P. (1985). Random orders. Order 1 317–331.

Page 247: Asymptotics: particles, processes and inverse problems. Festschrift for Piet Groeneboom

IMS Lecture Notes–Monograph Series
Asymptotics: Particles, Processes and Inverse Problems
Vol. 55 (2007) 234–252
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000382

Empirical processes indexed by estimated functions

Aad W. van der Vaart¹ and Jon A. Wellner²,*

Vrije Universiteit Amsterdam and University of Washington

Abstract: We consider the convergence of empirical processes indexed by functions that depend on an estimated parameter η and give several alternative conditions under which the “estimated parameter” η̂n can be replaced by its natural limit η0 uniformly in some other indexing set Θ. In particular we reconsider some examples treated by Ghoudi and Rémillard [Asymptotic Methods in Probability and Statistics (1998) 171–197, Fields Inst. Commun. 44 (2004) 381–406]. We recast their examples in terms of empirical process theory, and provide an alternative general view which should be of wide applicability.

1. Introduction

Let X1, …, Xn be i.i.d. random elements in a measurable space (X, A) with law P, and for a measurable function f : X → R let the expectation, empirical measure and empirical process at f be denoted by
\[
Pf = \int f\,dP, \qquad P_n f = \frac1n \sum_{i=1}^n f(X_i), \qquad G_n f = \sqrt{n}\,(P_n - P)f.
\]
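As a purely numerical aside (a minimal sketch, not part of the original development; the law P, the function f and the sample size are our own illustrative choices), these three objects can be computed directly for a simulated sample:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000
X = rng.standard_normal(n)            # i.i.d. sample from P = N(0, 1)

def f(x):                             # f = indicator of (-infinity, 0.5]
    return (x <= 0.5).astype(float)

Pf = norm.cdf(0.5)                    # exact expectation Pf
Pnf = f(X).mean()                     # empirical measure P_n f
Gnf = np.sqrt(n) * (Pnf - Pf)         # empirical process G_n f
print(Pnf, Gnf)                       # G_n f is approximately N(0, Pf(1 - Pf))

For a single indicator f the normal limit in the last comment is just the central limit theorem; the empirical process theory below concerns such statements uniformly over classes of functions f.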

Given a collection {fθ,η : θ ∈ Θ, η ∈ H} of measurable functions fθ,η : X → R indexed by sets Θ and H and “estimators” η̂n, we wish to prove that, as n → ∞,
\[
\sup_{\theta\in\Theta} \bigl| G_n (f_{\theta,\hat\eta_n} - f_{\theta,\eta_0}) \bigr| \to_p 0. \tag{1}
\]

Here an “estimator” η̂n is a random element with values in H defined on the same probability space as X1, …, Xn, and η0 ∈ H is a fixed element, which is typically a limit in probability of the sequence η̂n.

The result (1) is interesting for several applications. A direct application is to the estimation of the functional θ ↦ Pfθ,η. If the parameter η is unknown, we may replace it by an estimator η̂n and use the empirical estimator Pn fθ,η̂n. The result (1) helps to derive the limit behaviour of this estimator, as we can decompose
\[
\sqrt{n}\,(P_n f_{\theta,\hat\eta_n} - P f_{\theta,\eta_0})
= G_n (f_{\theta,\hat\eta_n} - f_{\theta,\eta_0}) + G_n f_{\theta,\eta_0}
+ \sqrt{n}\, P (f_{\theta,\hat\eta_n} - f_{\theta,\eta_0}). \tag{2}
\]

If (1) holds, then the first term on the right converges to zero in probability. Under appropriate conditions on the functions fθ,η0, the second term on the right will converge to a Gaussian process by the (functional) central limit theorem. The behavior of the third term depends on the estimators η̂n, and would typically follow from an application of the (functional) delta-method, applied to the map η ↦ (Pfθ,η : θ ∈ Θ).

* Supported in part by NSF Grant DMS-05-03822, NIAID Grant 2R01 AI291968-04, and by grant B62-596 of the Netherlands Organisation for Scientific Research (NWO).
¹ Section Stochastics, Department of Mathematics, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, e-mail: [email protected]
² University of Washington, Department of Statistics, Box 354322, Seattle, Washington 98195-4322, USA, e-mail: [email protected]
AMS 2000 subject classifications: 62G07, 62G08, 62G20, 62F05, 62F15.
Keywords and phrases: delta-method, Donsker class, entropy integral, pseudo observation.

In an interesting particular case of this situation, the functions fθ,η take the form
\[
f_{\theta,\eta}(x) = \theta\bigl(\eta(x)\bigr),
\]
for maps θ : R^d → R and each η ∈ H being a map η : X → R^d. The realizations of the estimators η̂n are then functions x ↦ η̂n(x) = η̂n(x; X1, …, Xn) on the sample space X and can be evaluated at the observations to obtain the random vectors η̂n(X1), …, η̂n(Xn) in R^d. The process {Pn fθ,η̂n : θ ∈ Θ} is the empirical measure of these vectors indexed by the functions θ. For instance, if Θ consists of the indicator functions 1(−∞,θ] for θ ∈ R^d, then this measure is the empirical distribution function
\[
\theta \mapsto P_n f_{\theta,\hat\eta_n} = \frac1n \sum_{i=1}^n 1\{\hat\eta_n(X_i) \le \theta\}
\]
of the random vectors η̂n(X1), …, η̂n(Xn). The properties of such empirical processes were studied in some generality and for examples of particular interest in Ghoudi and Rémillard [6, 7]. Ghoudi and Rémillard [6] apparently coined the name “pseudo-observations” for the vectors η̂n(X1), …, η̂n(Xn). The examples include, for instance, regression residuals, Kendall's dependence process, and copula processes; see the end of Section 2 for explicit formulation of these three particular examples. One purpose of the present paper is to extend the results in these papers also to other index classes Θ besides the class of indicator functions. Another purpose is to recast their results in terms of empirical process theory, which leads to simplification and alternative conditions.

A different, indirect application of (1) is to the derivation of the asymptotic distribution of Z-estimators. A Z-estimator for θ might be defined as the solution θ̂n of the equation Pn fθ,η̂n = 0, where again an unknown “nuisance” parameter η is replaced by an estimator η̂n. In this case (1) shows that
\[
P_n f_{\hat\theta_n,\hat\eta_n} - P_n f_{\hat\theta_n,\eta_0}
= P (f_{\hat\theta_n,\hat\eta_n} - f_{\hat\theta_n,\eta_0}) + o_P(1/\sqrt{n}),
\]
so that the limit behavior of θ̂n can be derived by comparison with the estimating equation defined by Pn fθ,η0 (with η0 substituted for η̂n). The “drift” sequence P(fθ̂n,η̂n − fθ̂n,η0), which will typically be equivalent to P(fθ0,η̂n − fθ0,η0) up to order oP(1/√n), may give rise to an additional component in the limit distribution.

The paper is organized as follows. In Section 2 we derive general conditions for the validity of (1) and formulate several particular examples to be considered in more detail in the sequel. In Section 3 we specialize the general results to composition maps. In Section 4 we combine these results with results on Hadamard differentiability to obtain the asymptotic distribution of empirical processes indexed by pseudo observations. Finally in Section 5 we formulate our results for several of the particular examples mentioned above and at the end of Section 2.

2. General result

In many situations we wish to establish (1) without knowing much about the nature of the estimators η̂n, beyond possibly that they are consistent for some value η0.


For instance, this is true if (1) is used as a step in the derivation of M- or Z-estimators. (Cf. Van der Vaart and Wellner [12] and Van der Vaart [11].) Then an appropriate method of establishing (1) is through a Donsker or entropy condition, as in the following theorems. Proofs of the Theorems 2.1 and 2.2 can be found in the mentioned references.

Both theorems assume that η̂n is “consistent for η0” in the sense that
\[
\sup_{\theta\in\Theta} P (f_{\theta,\hat\eta_n} - f_{\theta,\eta_0})^2 \to_p 0. \tag{3}
\]

Theorem 2.1. Suppose that H0 is a fixed subset of H such that Pr(η̂n ∈ H0) → 1 and suppose that the class of functions {fθ,η : θ ∈ Θ, η ∈ H0} is P-Donsker. If (3) holds, then (1) is valid.

For the second theorem, let N(ε, F, L2(P)) and N[ ](ε, F, L2(P)) be the ε-covering and ε-bracketing numbers of a class F of measurable functions (cf. Pollard [8] and van der Vaart and Wellner [12]) and define entropy integrals by
\[
J(\delta, \mathcal F, L_2) = \int_0^\delta \sup_Q \sqrt{\log N\bigl(\varepsilon \|F\|_{Q,2}, \mathcal F, L_2(Q)\bigr)}\; d\varepsilon, \tag{4}
\]
\[
J_{[\,]}(\delta, \mathcal F, L_2(P)) = \int_0^\delta \sqrt{\log N_{[\,]}\bigl(\varepsilon \|F\|_{P,2}, \mathcal F, L_2(P)\bigr)}\; d\varepsilon. \tag{5}
\]

Here F is an arbitrary, measurable envelope function for the class F: a measurable function F : X → R such that |f(x)| ≤ F(x) for every f ∈ F and x ∈ X. We say that a sequence Fn of envelope functions satisfies the Lindeberg condition if PF_n² = O(1) and PF_n² 1{F_n ≥ ε√n} → 0 for every ε > 0.
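For orientation, here is one standard computation with the uniform entropy integral (a sketch under the classical VC bound, cf. [12], Section 2.6): if F is a VC-class of index V with envelope F, then log N(ε‖F‖Q,2, F, L2(Q)) ≲ V log(1/ε) uniformly in Q, so that
\[
J(\delta, \mathcal F, L_2) \lesssim \int_0^\delta \sqrt{V \log(1/\varepsilon)}\; d\varepsilon < \infty,
\]
because √(log(1/ε)) grows slower than any negative power of ε and is therefore integrable at 0. In particular J(δn, F, L2) → 0 as δn → 0 for any VC-class.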

Theorem 2.2. Suppose that Hn are subsets of H such that Pr(η̂n ∈ Hn) → 1 and such that the classes of functions Fn = {fθ,η : θ ∈ Θ, η ∈ Hn} satisfy either J[ ](δn, Fn, L2(P)) → 0, or J(δn, Fn, L2) → 0, for every sequence δn → 0, relative to envelope functions that satisfy the Lindeberg condition. In the second case also assume that the classes Fn are suitably measurable (e.g. countable). If (3) holds, then (1) is valid.

Because there are many techniques to verify that a given class of functions is Donsker, or to compute bounds on its entropy integrals, the preceding theorems give quick results, if they apply. Furthermore, they appear to be close to best possible unless more information about the estimators η̂n can be brought in, or explicit computations are possible for the functions fθ,η.

In some applications the estimators η̂n are known to converge at a certain rate and/or known to possess certain regularity properties (e.g. uniformly bounded derivatives). Such knowledge cannot be exploited in Theorem 2.1, but could be used for the choice of the sets Hn in Theorem 2.2. We now discuss an alternative approach which can be used if the estimators η̂n are also known to converge in distribution, if properly rescaled.

Let H be a Banach space, and suppose that the sequence √n(η̂n − η0) converges in distribution to a tight, Borel-measurable random element in H. The “convergence in distribution” may be understood in the sense of Hoffmann-Jørgensen, so that η̂n need not be Borel-measurable itself.

The tight limit of the sequence √n(η̂n − η0) takes its values in a σ-compact subset H0 ⊂ H. For θ ∈ Θ, h0 ∈ H0, and δ > 0 define a sequence of classes of functions by
\[
\mathcal F_n(\theta, h_0, \delta) = \bigl\{ f_{\theta,\eta_0 + n^{-1/2}h} - f_{\theta,\eta_0 + n^{-1/2}h_0} : h \in H, \|h - h_0\| < \delta \bigr\}. \tag{6}
\]


Let Fn(θ, h0, δ) be arbitrary measurable envelope functions for these classes.

Theorem 2.3. Suppose that the sequence √n(η̂n − η0) converges in distribution to a tight random element with values in a given σ-compact subset H0 of H. Suppose that

(i) supθ |Gn(fθ,η0+n^{−1/2}h0 − fθ,η0)| →p 0 for every h0 ∈ H0;
(ii) supθ |Gn Fn(θ, h0, δ)| →p 0 for every δ > 0 and every h0 ∈ H0;
(iii) supθ suph0∈K √n P Fn(θ, h0, δn) → 0 for every δn → 0 and every compact K ⊂ H0.

Then (1) is valid.

Proof. Suppose that √n(η̂n − η0) ⇒ Z and let ε > 0 be fixed. There exists a compact set K ⊂ H0 with P(Z ∈ K) > 1 − ε and hence, for every δ > 0, with K^δ the set of all points at distance less than δ to K,
\[
\liminf_{n\to\infty} \Pr\bigl( \sqrt{n}(\hat\eta_n - \eta_0) \in K^{\delta/2} \bigr) > 1 - \varepsilon.
\]
In view of the compactness of K there exist finitely many elements h1, …, hp ∈ K ⊂ H0 (with p = p(δ) depending on δ) such that the balls of radius δ/2 around these points cover K. Then K^{δ/2} is contained in the union of the balls of radius δ, by the triangle inequality. Thus, with B(h, δ) denoting the ball of radius δ around h in the space H,
\[
\bigl\{ \sqrt{n}(\hat\eta_n - \eta_0) \in K^{\delta/2} \bigr\} \subset \bigcup_{i=1}^{p(\delta)} \bigl\{ \hat\eta_n \in B(\eta_0 + n^{-1/2}h_i, \delta) \bigr\}.
\]
It follows that with probability at least 1 − ε, as n → ∞,
\[
\begin{split}
\sup_\theta \bigl| G_n (f_{\theta,\hat\eta_n} - f_{\theta,\eta_0}) \bigr|
&\le \sup_\theta \max_i \sup_{\|h-h_i\|<\delta} \bigl| G_n (f_{\theta,\eta_0+n^{-1/2}h} - f_{\theta,\eta_0}) \bigr| \\
&\le \sup_\theta \max_i \sup_{\|h-h_i\|<\delta} \Bigl[ \bigl| G_n (f_{\theta,\eta_0+n^{-1/2}h} - f_{\theta,\eta_0+n^{-1/2}h_i}) \bigr|
+ \bigl| G_n (f_{\theta,\eta_0+n^{-1/2}h_i} - f_{\theta,\eta_0}) \bigr| \Bigr] \\
&\le \sup_\theta \max_i \bigl| G_n F_n(\theta, h_i, \delta) \bigr| + 2 \sup_\theta \sup_{h_0\in K} \sqrt{n}\, P F_n(\theta, h_0, \delta)
+ \sup_\theta \max_i \bigl| G_n (f_{\theta,\eta_0+n^{-1/2}h_i} - f_{\theta,\eta_0}) \bigr|,
\end{split}
\]
where in the last step we use the inequality |Gn f| ≤ |Gn F| + 2√n PF, valid for any functions f and F with |f| ≤ F. The maxima in the display are over the finite set i = 1, …, p(δ), and the elements h1, …, hp(δ) ∈ K depend on δ. By assumptions (i) and (ii) the first and third terms converge to zero as n → ∞, for every fixed δ. It follows that there exists δn ↓ 0 such that these terms with δn substituted for δ converge to 0. For this δn, all three terms converge to zero in probability as n → ∞.

The rate of convergence √n in the preceding theorem may be replaced by another rate, with appropriate changes in the conditions, but the rate √n appears natural in the following context. For more general metrizable topological vector spaces similar, but less attractive, results are possible.


The two conditions (i), (ii) of Theorem 2.3 concern the empirical process indexed by the classes of functions
\[
\{ f_{\theta,\eta_0+n^{-1/2}h_0} - f_{\theta,\eta_0} : \theta \in \Theta \}, \tag{7}
\]
\[
\{ F_n(\theta, h_0, \delta) : \theta \in \Theta \}. \tag{8}
\]

These classes are indexed by Θ only, and hence Theorem 2.3, if applicable, avoids conditions for (1) that involve measures of the complexity of the class {fθ,η : θ ∈ Θ, η ∈ H} due to the parameter η ∈ H.

Condition (iii) of Theorem 2.3 involves the mean of the envelopes of the classes Fn(θ, h0, δ). For the minimal envelopes this condition takes the form
\[
\sup_\theta \sup_{h_0\in K} \sqrt{n}\; P \sup_{\|h-h_0\|<\delta_n} \bigl| f_{\theta,\eta_0+n^{-1/2}h} - f_{\theta,\eta_0+n^{-1/2}h_0} \bigr| \to 0 \tag{9}
\]
for all δn ↓ 0. This is an “integrated uniform local Lipschitz assumption” on the dependence η ↦ fθ,η. In some applications it may be useful not to use the minimal envelope functions. The theorem is valid for any envelope functions, as long as the same envelopes are used in both (ii) and (iii).

The set K in (iii) or (9) is a compact set in the support of the limit distribution of the sequence √n(η̂n − η0). In some cases condition (iii) may be valid for any compact K ⊂ H, whereas in other cases more precise information about the limit process must be exploited. For instance, if the sequence √n(η̂n − η0) converges in distribution to a tight zero-mean Gaussian process G in the space H = ℓ∞(T) of bounded functions on some set T, then K may be taken to be a set of functions z : T → R that is uniformly bounded and uniformly equicontinuous relative to the semimetric with square d²(s, t) = E(Gs − Gt)² (and T will be totally bounded for d). Cf. e.g. van der Vaart and Wellner [12], page 39.

Condition (iii) is an analytical condition, whereas conditions (i) and (ii) are empirical process conditions. In many cases the latter pair of conditions can be verified by standard empirical process type arguments. For reference we quote two lemmas that allow handling the empirical process indexed by a sequence of classes, as in (8) or (7). (For proofs see e.g. van der Vaart [10, 11].) Both lemmas apply to classes Fn of measurable functions f : X → R such that
\[
\sup_{f\in\mathcal F_n} P f^2 \to 0. \tag{10}
\]

Lemma 2.1. Suppose that the class of functions ∪n Fn is P-Donsker. If (10) holds, then sup_{f∈Fn} |Gn(f)| →p 0.

Lemma 2.2. Suppose that either J[ ](δn, Fn, L2(P)) → 0 or J(δn, Fn, L2) → 0 for all δn ↓ 0, relative to envelope functions Fn that satisfy the Lindeberg condition. In the second case also assume that each class Fn is suitably measurable. If (10) holds, then sup_{f∈Fn} |Gn(f)| →p 0.

Example 1 (Regression residual processes). Suppose that (X1, Y1), …, (Xn, Yn) are a random sample distributed according to the regression model Y = gη(X) + e. For given estimators η̂n we can form the residuals êi = Yi − gη̂n(Xi) and may be interested in the empirical process corresponding to ê1, …, ên, i.e. for a collection Θ of functions θ : R → R we consider the process {n^{−1} Σ_{i=1}^n θ(êi) : θ ∈ Θ}. This fits the general set-up with the functions fθ,η defined as fθ,η(x, y) = θ(y − gη(x)).


In many cases it will be possible to apply Theorem 2.1. For instance, if x ∈ R^d, gη is a polynomial in x, and Θ is the class of indicator functions 1(−∞,θ] for θ ∈ R, then the functions fθ,η are the indicator functions of the sets {(x, y) : y − gη(x) − θ ≤ 0}. Because the set of functions (x, y) ↦ y − gη(x) − θ is contained in a finite-dimensional vector space, it is a VC-class, and hence so are their negativity sets (e.g. van der Vaart and Wellner [12], Lemma 2.6.18). Thus the class of functions fθ,η is Donsker, and Theorem 2.1 can be applied directly.
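A minimal simulation sketch of this example (every modelling choice here, the linear regression function gη(x) = ηx, the least-squares estimator and the error law, is our own hypothetical illustration rather than part of the example above):

import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X = rng.uniform(-1.0, 1.0, n)
Y = 2.0 * X + rng.standard_normal(n)   # model Y = g_eta0(X) + e with g_eta(x) = eta * x

eta_hat = (X @ Y) / (X @ X)            # least-squares estimator of eta0 = 2
resid = Y - eta_hat * X                # residuals e_i = Y_i - g_{eta_hat}(X_i)

# residual empirical process at indicators: theta -> n^{-1} sum_i 1{e_i <= theta}
thetas = np.linspace(-3.0, 3.0, 13)
F_resid = (resid[None, :] <= thetas[:, None]).mean(axis=1)
print(np.column_stack([thetas, F_resid]))

With indicator indexing, F_resid is exactly the process Pn fθ,η̂n of the example, evaluated on a grid of θ's.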

Example 2 (Kendall's process). Let η̂n be the empirical distribution function of a random sample X1, …, Xn from a distribution η0 on R^d. Barbe, Genest, Ghoudi and Rémillard [2] and Ghoudi and Rémillard [6] study the behavior of the empirical distribution function Kn of the pseudo-observations η̂n(Xi),
\[
K_n(\theta) = \frac1n \sum_{i=1}^n 1\{\hat\eta_n(X_i) \le \theta\}, \qquad \theta \in [0, 1],
\]
and the resulting Kendall's process
\[
\sqrt{n}\,\bigl(K_n(\theta) - K(\theta)\bigr), \qquad \theta \in [0, 1], \tag{11}
\]
where K(θ) = P(η0(X) ≤ θ). This fits the general set-up with fθ,η the composition function fθ,η = θ∘η, and θ the indicator function 1(−∞,θ] (where we abuse notation by using the symbol θ in two different ways).

An attempt to apply Theorem 2.1 to this problem would lead to the consideration of the class of all indicator functions of sets of the form {x ∈ R^d : η(x) ≤ θ} for η ranging over the cumulative distribution functions on R^d and θ ∈ [0, 1]. This class is similar to the collection of all “lower layers” in R^d, and, unfortunately, fails to be Donsker for most distributions (cf. Dudley [3], pages 264, 373, or Dudley [4]). In this case it appears to be necessary to exploit the limit behaviour of the sequence √n(η̂n − η0). Ghoudi and Rémillard [6] have shown that (1) is valid in this case, under some strong smoothness assumptions on the underlying measure η0. In Sections 4 and 5 we rederive some of their results by empirical process methods using Theorem 2.3.

We also consider the empirical process of the variables η̂n(X1), …, η̂n(Xn) indexed by classes of functions other than the indicators 1(−∞,θ]. If the indexing functions are smooth, then this empirical process will converge even without smoothness conditions on η0. A proof can be based on Theorem 2.3.
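To make Kn concrete, a short simulation sketch (the bivariate sampling distribution, correlated normals, is our own hypothetical choice):

import numpy as np

rng = np.random.default_rng(2)
n = 2_000
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)

# pseudo-observations: the multivariate empirical cdf evaluated at the data,
# W_i = eta_hat_n(X_i) = n^{-1} #{j : X_j <= X_i coordinatewise}
W = (X[None, :, :] <= X[:, None, :]).all(axis=2).mean(axis=1)

# K_n(theta) = n^{-1} sum_i 1{eta_hat_n(X_i) <= theta}
thetas = np.linspace(0.0, 1.0, 11)
K_n = (W[None, :] <= thetas[:, None]).mean(axis=1)
print(np.column_stack([thetas, K_n]))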

Example 3 (Copula processes). Suppose that X1, …, Xn are a sample from a distribution η0 on R^d. Write Xi = (Xi,1, …, Xi,d) and let η0,1, …, η0,d be the marginal distributions. The copula function C associated with η0 is the distribution function of the vector (η0,1(X1,1), …, η0,d(X1,d)), i.e. with η0,j^{−1}(u) = inf{x : η0,j(x) ≥ u} for u ∈ [0, 1],
\[
C(u_1, \dots, u_d) = \eta_0\bigl( \eta_{0,1}^{-1}(u_1), \dots, \eta_{0,d}^{-1}(u_d) \bigr)
\]
for (u1, …, ud) ∈ [0, 1]^d. For j = 1, …, d let η̂n,j be the empirical distribution function of X1,j, …, Xn,j (on R), and let η̂n be the empirical distribution function of X1, …, Xn (on R^d). Then a natural estimator Cn of C is given by
\[
C_n(u) = \frac1n \sum_{i=1}^n 1\{\hat e_{n,i} \le u\}, \qquad u \in [0, 1]^d,
\]


for the “pseudo-observations” ên,i = (η̂n,1(Xi,1), …, η̂n,d(Xi,d)). The resulting “copula processes”,
\[
\sqrt{n}\,\bigl( C_n(u) - C(u) \bigr), \qquad u \in [0, 1]^d, \tag{12}
\]
have been considered by Stute [9], Gänssler and Stute [5], and Ghoudi and Rémillard [7]. This example can be treated using Theorem 2.3, but also with the more straightforward Theorem 2.1, or even by employing the theory of Hadamard differentiability, as in Chapter 3.9 of Van der Vaart and Wellner [12].
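A minimal sketch of Cn (the sampling distribution is again a hypothetical choice; the marginal empirical cdfs are evaluated at the data themselves):

import numpy as np

rng = np.random.default_rng(3)
n = 2_000
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)

# pseudo-observations e_{n,i}: each coordinate transformed by its own marginal ecdf,
# U[i, j] = eta_hat_{n,j}(X_{i,j}) = n^{-1} #{k : X_{k,j} <= X_{i,j}}
U = (X[None, :, :] <= X[:, None, :]).mean(axis=1)

def C_n(u):
    # empirical copula C_n(u) = n^{-1} sum_i 1{e_{n,i} <= u}
    return (U <= np.asarray(u)).all(axis=1).mean()

print(C_n([0.5, 0.5]), C_n([0.9, 0.9]))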

3. Composition

In this section we consider the case where the functions fθ,η take the form
\[
f_{\theta,\eta}(x) = \theta(\eta(x)), \tag{13}
\]
for θ ranging over a class Θ of functions θ : R^d → R and η ranging over a class H of measurable functions η : X → R^d. We first give general conditions for the validity of condition (i) of Theorem 2.3, and next consider also the conditions (ii) and (iii) for the special cases of functions θ that are Lipschitz and monotone, respectively. We develop these results for the case that the sequence √n(η̂n − η0) converges in distribution in the space H = ℓ∞(X, R^d) of uniformly bounded functions z : X → R^d, equipped with the uniform norm ‖z‖ = sup_{x∈X} ‖z(x)‖. (Variations of these results are possible. For instance, R^d could be replaced by a more general Banach space, and H could be equipped with a weighted uniform norm.)

3.1. Condition (i)

For fθ,η taking the form (13), condition (i) of Theorem 2.3 takes the form
\[
\sup_{f\in\mathcal F_n} |G_{n,Q} f| \to_p 0 \tag{14}
\]
for Q = P∘(η0, h0)^{−1}, Gn,Q the empirical process of a random sample from the measure Q, and Fn the class of functions
\[
\mathcal F_n = \bigl\{ (y, z) \mapsto \theta(y + n^{-1/2}z) - \theta(y) : \theta \in \Theta \bigr\}. \tag{15}
\]
Condition (i) requires that (14) is valid for every fixed choice of h0 ∈ H0, i.e. for every measure Q determined as the law of (η0(X), h0(X)) for some h0 ∈ H0 and X distributed according to P.

This situation is of the form considered in Lemmas 2.1 and 2.2, and both lemmas may be applicable in a given setting. It is not especially helpful to restate these lemmas for the present special situation. Instead, we give one easy-to-check set of sufficient conditions. This covers VC-classes Θ, and much more.

If Θenv : R^d → R is an envelope function for Θ, then
\[
F_n(y, z) = \Theta_{\mathrm{env}}(y + n^{-1/2}z) + \Theta_{\mathrm{env}}(y) \tag{16}
\]
is an envelope function for Fn. (A crude one, because we do not exploit that the functions in Fn are differences.)


Lemma 3.1. Suppose that J(1, Θ, L2) < ∞, that Θ is suitably measurable, that P(Θenv∘η0)² < ∞, and that the functions Θenv∘(η0 + n^{−1/2}h0) satisfy the Lindeberg condition in L2(P), for every h0 ∈ H0. If
\[
\sup_{\theta\in\Theta} P\bigl( \theta\circ(\eta_0 + n^{-1/2}h_0) - \theta\circ\eta_0 \bigr)^2 \to 0
\]
for every h0 ∈ H0, then condition (i) of Theorem 2.3 is satisfied for the functions fθ,η given by (13).

Proof. It suffices to prove (14) for the classes Fn given in (15). The class Fn is contained in the difference of the classes
\[
\mathcal F'_n = \{ (y, z) \mapsto \theta(y + n^{-1/2}z) : \theta \in \Theta \}, \qquad
\mathcal F'' = \{ (y, z) \mapsto \theta(y) : \theta \in \Theta \}.
\]
These classes possess envelope functions F'_n and F'' defined by
\[
F'_n(y, z) = \Theta_{\mathrm{env}}(y + n^{-1/2}z), \qquad F''(y, z) = \Theta_{\mathrm{env}}(y).
\]
The uniform entropy of F'' relative to F'' is finite by assumption. The uniform entropy of F'_n relative to F'_n is exactly the same, as the law of Y + n^{−1/2}Z runs through all possible laws on R^d if the law of (Y, Z) runs through all possible laws on R^d × R^d. The uniform entropy of Fn relative to Fn is bounded by the sum of the uniform entropies of F'_n and F''. (Cf. e.g. Theorem 2.10.20 of van der Vaart and Wellner [12].) Now apply Lemma 2.2.

3.2. Lipschitz functions θ

Assume that every function θ : R^d → R in the class Θ is uniformly Lipschitz in that
\[
|\theta(r_1) - \theta(r_2)| \le \|r_1 - r_2\|. \tag{17}
\]
Then, for every x ∈ X,
\[
\Bigl| \theta\bigl(\eta_0(x) + n^{-1/2}h(x)\bigr) - \theta\bigl(\eta_0(x) + n^{-1/2}h_0(x)\bigr) \Bigr| \le \frac{\|h(x) - h_0(x)\|}{\sqrt{n}}.
\]
The norm on the right side is bounded by the supremum norm ‖h − h0‖ on ℓ∞(X, R^d). It follows that the classes Fn(θ, h0, δ) as in (6) possess envelope functions
\[
F_n(\theta, h_0, \delta) = \delta/\sqrt{n}. \tag{18}
\]

Theorem 3.1. If Θ is a suitably measurable collection of uniformly bounded, uniformly Lipschitz functions θ : R^d → R such that J(1, Θ, L2) < ∞ (relative to a constant envelope function), η0 ∈ ℓ∞(X, R^d), and the sequence √n(η̂n − η0) converges weakly in ℓ∞(X, R^d) to a tight random element, then
\[
\sup_{\theta\in\Theta} \bigl| G_n\bigl( \theta(\hat\eta_n) - \theta(\eta_0) \bigr) \bigr| \to_p 0.
\]


Proof. With the envelope functions Fn(θ, h0, δ) as defined in (18), condition (ii) of Theorem 2.3 is trivially satisfied because the envelopes are actually constants, and the validity of condition (iii) is immediate.

By assumption we can choose the envelope function of Θ equal to a constant, and J(1, Θ, L2) < ∞. This suffices for the verification of most of the conditions of Lemma 3.1. Finally, it suffices to note that
\[
P\bigl( \theta\circ(\eta_0 + n^{-1/2}h_0) - \theta\circ\eta_0 \bigr)^2 \le \|h_0\|^2/n.
\]
By Lemma 3.1 we conclude that condition (i) of Theorem 2.3 is also satisfied, whence the theorem follows from Theorem 2.3.

For the verification of condition (i) of Theorem 2.3 it suffices to consider the functions θ on the range of the functions η0 + h0/√n for a fixed h0 in the support of the limit distribution of the sequence √n(η̂n − η0). Thus we may restrict the functions θ to a subset of R^d that contains the ranges of these functions and interpret the condition J(1, Θ, L2) < ∞ in Lemma 3.1 accordingly. In particular, in Theorem 3.1 we may replace this condition by the condition that J(1, ΘK, L2) < ∞ for every norm-bounded subset K ⊂ R^d, where ΘK is the collection of restrictions θ : K → R of the functions θ ∈ Θ.

Any collection of uniformly bounded, Lipschitz functions θ : K → R on a compact interval K satisfies J(1, Θ, L2) < ∞. (Cf. e.g. van der Vaart and Wellner [12], page 157.) Thus in the case that d = 1 the assertion of the preceding theorem is true for any collection of uniformly bounded Lipschitz functions.
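To see where the last claim comes from (a standard covering-number computation, quoted here as a sketch, cf. [12], Theorem 2.7.1): bounded Lipschitz functions on a compact interval K satisfy log N(ε, Θ, ‖·‖∞) ≤ C/ε for a constant C depending on K, and the sup-norm covering bound dominates the L2(Q) covering bound for every Q, so with a constant envelope
\[
J(1, \Theta, L_2) \le \int_0^1 \sqrt{C/\varepsilon}\; d\varepsilon = 2\sqrt{C} < \infty.
\]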

For d > 1 further restrictions on the class Θ may be necessary. For instance, any subset of the unit ball in the Hölder space C^α(K) for a compact interval K ⊂ R^d possesses a finite uniform entropy integral provided α > d/2. (Cf. e.g. van der Vaart and Wellner [12], page 157.) The assertion of the preceding theorem is also true for such a class.

There are many other examples of classes of Lipschitz functions with finite uniform entropy integrals, for instance VC-classes of Lipschitz functions.

3.3. Monotone functions θ

Assume that every function θ : R^d → R in Θ is the survival function θ(x) = ∫ 1[x,∞) dθ of a subprobability measure on R^d. Then each θ is nonincreasing in each of its arguments. If H = ℓ∞(X, R^d) is equipped with the uniform norm relative to the max-norm on R^d, then
\[
\bigl| \theta(\eta_0 + n^{-1/2}h) - \theta(\eta_0 + n^{-1/2}h_0) \bigr|
\le \theta\bigl(\eta_0 + n^{-1/2}h_0 - n^{-1/2}\|h - h_0\|\bigr) - \theta\bigl(\eta_0 + n^{-1/2}h_0 + n^{-1/2}\|h - h_0\|\bigr).
\]
It follows that the classes Fn(θ, h0, δ) possess envelope functions, with δ the vector (δ, …, δ),
\[
F_n(\theta, h_0, \delta) = \theta(\eta_0 + n^{-1/2}h_0 - n^{-1/2}\delta) - \theta(\eta_0 + n^{-1/2}h_0 + n^{-1/2}\delta).
\]

In order to verify condition (iii) of Theorem 2.3, we assume that for given (possibly infinite) a < b in R^d and every δn ↓ 0 and compact set K ⊂ H0 ∪ {0},
\[
\sup_{t\in R^d,\, a\le t\le b}\; \sup_{h_0\in K} \sqrt{n}\, P\bigl( 1_{\eta_0 + n^{-1/2}h_0 \le t + n^{-1/2}\delta_n} - 1_{\eta_0 + n^{-1/2}h_0 \le t} \bigr) \to 0. \tag{19}
\]


Theorem 3.2. Let Θ be a collection of survival functions θ : R^d → [0, 1] of subprobability measures supported on an interval (a, b) ⊂ R^d. If the sequence √n(η̂n − η0) converges in distribution in ℓ∞(X, R^d) to a tight Borel measure concentrating on the σ-compact set H0, and (19) holds for every δn ↓ 0 and every compact K ⊂ H0 ∪ {0}, then
\[
\sup_\theta \bigl| G_n\bigl( \theta(\hat\eta_n) - \theta(\eta_0) \bigr) \bigr| \to_p 0.
\]

Proof. The survival functions of subprobability measures are in the convex hull of the set of indicator functions 1[t,∞), which is a VC-class. Therefore the entropy integral J(1, Θ, L2) of Θ relative to a constant envelope is finite. (Cf. e.g. van der Vaart and Wellner [12], page 145.)

Defining Fn(θ, h0, δ) as in the display preceding the theorem, we can write
\[
\begin{split}
P F_n(\theta, h_0, \delta)
&= \int P\bigl( 1_{(-\infty,s]}(\eta_0 + n^{-1/2}h_0 - n^{-1/2}\delta) - 1_{(-\infty,s]}(\eta_0 + n^{-1/2}h_0) \bigr)\, d\theta(s) \\
&\le \|\theta\| \sup_{a\le s\le b} P\bigl( 1_{(-\infty,s]}(\eta_0 + n^{-1/2}h_0 - n^{-1/2}\delta) - 1_{(-\infty,s]}(\eta_0 + n^{-1/2}h_0) \bigr).
\end{split}
\]
By assumption (19) the right side converges to zero faster than 1/√n, for every δ = δn ↓ 0, uniformly in h0 ∈ K, and uniformly in θ because the total variation norms ‖θ‖ are uniformly bounded. This verifies condition (iii) of Theorem 2.3.

Because θ is monotone with range contained in [0, 1],
\[
P\bigl( \theta(\eta_0 + n^{-1/2}\delta) - \theta(\eta_0) \bigr)^2
\le \sup_{a\le s\le b} P\bigl( 1_{(-\infty,s]}(\eta_0 - n^{-1/2}\delta) - 1_{(-\infty,s]}(\eta_0) \bigr).
\]
By assumption (19) with h0 = 0 this converges to zero faster than 1/√n for every sequence δn ↓ 0. This can be seen to imply that the expression in the display (which does not have the leading √n) converges to zero also for fixed δ. By monotonicity of θ we can bound |θ(η0 + h0/√n) − θ(η0)| by |θ(η0 − δ/√n) − θ(η0)| for δ = ‖h0‖. By Lemma 3.1 we now conclude that condition (i) of Theorem 2.3 is satisfied.

In the present case the envelope functions Fn(θ, h0, δ) are equal to differences fθ,η0+n^{−1/2}(h0−δ) − fθ,η0+n^{−1/2}(h0+δ) of functions of the type considered for condition (i). Therefore, the validity of condition (ii) of Theorem 2.3 follows by the same arguments as used for the validity of condition (i).

Condition (19) is a uniform Lipschitz condition on the distribution functions of the variables η0(X) + h0(X)/√n. If the distribution of η0(X) is smooth, then we might expect that the distribution functions of the perturbed variables η0(X) + h0(X)/√n will be smooth as well. However, this appears not to be true in general, and it will usually be necessary to exploit some information about the functions h0. (We need to consider functions in the support of the limit measure of the sequence √n(η̂n − η0).) In this respect the conditions of Theorem 3.2 for composition with monotone functions are much more stringent than the conditions of Theorem 3.1 for the composition with Lipschitz functions.

The condition (19) is in terms of the indicator functions 1(−∞,s], and would have exactly the same form if we considered only indicator functions θ = 1(−∞,θ], rather than general monotone functions. Thus the restrictive condition is connected to studying the classical empirical process.

The following lemma allows the verification of condition (19) in many cases. It will also be used in the next section to prove applicability of the delta-method. The lemma is similar to Lemma 5.1 of Ghoudi and Rémillard [6].


Lemma 3.2. Suppose that X, Y, Yt (with t > 0) are real-valued random variables on a common probability space such that

(i) X possesses a Lebesgue density f that is continuous in a neighbourhood of x;
(ii) ‖Yt − Y‖∞ → 0 and ‖Y‖∞ < ∞;
(iii) the conditional distribution of Y given X = s can be represented by a Markov kernel K(s, ·) such that the map s ↦ K(s, ·) is continuous at x for the weak topology.

Then for every continuous function g : R → R and all converging sequences xt → x, at → a and 0 ≤ bt → b, as t → 0,
\[
\frac1t E\, g(Y_t)\, 1_{x_t < X + t a_t Y_t \le x_t + t b_t} \to b \int g(y)\, K(x, dy)\; f(x).
\]

Proof. First consider the case that Yt = Y for every t. By the definitions of K and f, we can write
\[
\begin{split}
\frac1t E\, g(Y)\, 1_{x_t < X + t a_t Y \le x_t + t b_t}
&= \frac1t \int E\bigl( g(Y)\, 1_{(x_t-s)/t < a_t Y \le (x_t-s)/t + b_t} \mid X = s \bigr) f(s)\, ds \\
&= \int\!\!\int g(y)\, 1_{u < a_t y \le u + b_t}\, K(x_t - ut, dy)\, f(x_t - ut)\, du.
\end{split}
\]
The inner integral is equal to E g(Yt) 1{u < at Yt ≤ u + bt} for Yt possessing the law K(xt − ut, ·). By assumption at Yt converges in distribution to the law of aY for Y possessing the law K(x, ·). It follows that the inner integral converges to ∫ g(y) 1{u < ay ≤ u + b} K(x, dy) for any (u, b) such that u and u + b are not among the atoms of the law of aY. This includes almost every u for every fixed b. Because Y has bounded range, the double integral can be restricted to y in a compact set and hence u in a compact set; the argument xt − ut of f is then restricted to a neighbourhood of x. Therefore, we can apply the dominated convergence theorem to see that the right side of the display converges to
\[
\int\!\!\int g(y)\, 1_{u < ay \le u + b}\, K(x, dy)\, f(x)\, du.
\]
This reduces to b ∫ g(y) K(x, dy) f(x) by Fubini's theorem. This concludes the proof of the lemma for the case that Yt = Y.

Because g is continuous, Y possesses bounded range and ‖Yt − Y‖∞ → 0, we have that ‖g(Yt) − g(Y)‖∞ → 0. Therefore
\[
\frac1t E \bigl| g(Y_t) - g(Y) \bigr| 1_{x_t < X + t a_t Y_t \le x_t + t b_t}
\le o(1)\, \frac1t \Pr\bigl( x_t - t a_t \|Y_t\|_\infty < X < x_t + t b_t + t a_t \|Y_t\|_\infty \bigr).
\]
This converges to zero as X has a density that is bounded on bounded intervals.

Finally the difference 1{xt < X + tatYt ≤ xt + tbt} − 1{xt < X + tatY ≤ xt + tbt} is nonzero only if X + tatY is in a union of intervals of total length bounded by tat‖Yt − Y‖∞ in a neighbourhood of x. By the lemma with Yt = Y and g = 1, which is already proved, t^{−1} Pr(xt < X + tatY ≤ xt + tct) → c f(x) for any sequences xt → x and ct → c. Hence this probability converges to zero for ct = at‖Yt − Y‖∞, which satisfies ct → 0. We conclude from this that
\[
\frac1t \Bigl| E\, g(Y) \bigl( 1_{x_t < X + t a_t Y_t \le x_t + t b_t} - 1_{x_t < X + t a_t Y \le x_t + t b_t} \bigr) \Bigr|
\le \|g(Y)\|_\infty \frac1t E \bigl| 1_{x_t < X + t a_t Y_t \le x_t + t b_t} - 1_{x_t < X + t a_t Y \le x_t + t b_t} \bigr|
\]
converges to zero. The proof of the lemma is complete upon combining the preceding.
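For orientation, the simplest instance of the lemma (a degenerate special case we add for illustration: g ≡ 1 and Yt = Y ≡ 0, so that at plays no role and hypotheses (ii)-(iii) are trivial) is just the continuity of the density at x:
\[
\frac1t \Pr(x_t < X \le x_t + t b_t) = \frac1t \int_{x_t}^{x_t + t b_t} f(s)\, ds \to b f(x),
\]
since the interval of integration has length t bt and shrinks to the point x.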

In order to verify (19) with K = {h0} a single function we can apply the preceding lemma with (X, Y) equal to the pair of variables (η0(X), h0(X)), bt = δn, and t = 1/√n. Then the conditions of the lemma require that the variable η0(X) possesses a continuous density, and that the conditional distribution of h0(X) given η0(X) = z depends continuously on the value of z. The second condition is clearly unpleasant, but appears to be natural in the present situation. It will involve a closer analysis of the support of the limit distribution of the sequence √n(η̂n − η0).

To verify (19) with a general compact set K ⊂ ℓ∞(X, R) we simply note that in view of the compactness it suffices to verify that for every sequence hn such that ‖hn − h0‖∞ → 0 for some h0 ∈ H0,
\[
\sup_{t\in R^d,\, a\le t\le b} \sqrt{n}\, P\bigl( 1_{\eta_0 + n^{-1/2}h_n \le t + n^{-1/2}\delta_n} - 1_{\eta_0 + n^{-1/2}h_n \le t} \bigr) \to 0.
\]
Thus we can apply the preceding lemma with the variables (X, Y, Yt) equal to (η0(X), h0(X), hn(X)) and t = 1/√n.

Example 2, continued. Suppose that η̂n is the empirical distribution of a random sample from the cumulative distribution function η0 on R^d. Then the limit distribution of the sequence √n(η̂n − η0) is the d-dimensional η0-Brownian sheet on R^d.

If d = 1, then the Brownian sheet is a Brownian bridge and can be represented as B∘η0 for B a standard Brownian bridge on the unit interval. A typical function in the support of the limit distribution of the sequence √n(η̂n − η0) can be represented as h0 = h∘η0 for some function h : [0, 1] → R. The conditional law of the variable h0(X) given η0(X) = z is the Dirac measure at h(z). Because the standard Brownian bridge is continuous, the function h can be taken continuous and hence the corresponding Markov kernels K(z, ·) = δh(z)(·) are weakly continuous in z, as required by the preceding lemma.

If d > 1, then we can, without loss of generality, suppose that η0 is a distribution function on [0, 1]^d with uniform marginal distributions (i.e. a copula function). Then the conditioning event η0(X) = z will typically restrict X to a one-dimensional curve in [0, 1]^d. Under sufficient smoothness of η0, this curve will vary continuously with z, and under smoothness conditions on the law of X, the conditional distribution of h0(X) given η0(X) = z for a continuous function h0 will vary continuously as well. Ghoudi and Rémillard [6] give sufficient conditions for this continuity in a number of examples.

The preceding lemma can be extended to the case of multidimensional variables. For simplicity we only consider the two-dimensional case.

Lemma 3.3. Suppose that X, Y, Yt (with t > 0) are random variables in R² defined on a common probability space such that

(i) X possesses a Lebesgue density f that has continuous conditional densities;
(ii) ‖Yt − Y‖∞ → 0 and ‖Y‖∞ < ∞;
(iii) the conditional distribution of Y given X = s can be represented by a Markov kernel K(s, ·) such that the map s ↦ K(s, ·) is continuous at x for the weak topology.

Then for every continuous function g : R² → R and all converging sequences xt → x, at → a and bt → b > 0, as t → 0,
\[
\begin{split}
\frac1t E\, g(Y_t) \bigl( 1_{X + t a_t Y_t \le x_t + t b_t} - 1_{X + t a_t Y_t \le x_t} \bigr)
&\to b_1 \int_{-\infty}^{x_2} \int g(y)\, K\bigl( (x_1, s_2), dy \bigr)\, f(x_1, s_2)\, ds_2 \\
&\quad + b_2 \int_{-\infty}^{x_1} \int g(y)\, K\bigl( (s_1, x_2), dy \bigr)\, f(s_1, x_2)\, ds_1.
\end{split}
\]

Proof. The event {X + tatYt ≤ xt + tbt} ∩ {X + tatYt ≤ xt}^c can be decomposed into the three events
\[
\begin{split}
I &= \{ x_{1t} < X_1 + t a_t Y_{1t} \le x_{1t} + t b_{1t},\; X_2 + t a_t Y_{2t} \le x_{2t} \}, \\
II &= \{ x_{1t} < X_1 + t a_t Y_{1t} \le x_{1t} + t b_{1t},\; x_{2t} < X_2 + t a_t Y_{2t} \le x_{2t} + t b_{2t} \}, \\
III &= \{ X_1 + t a_t Y_{1t} \le x_{1t},\; x_{2t} < X_2 + t a_t Y_{2t} \le x_{2t} + t b_{2t} \}.
\end{split}
\]
In view of the boundedness of the Yt, the event II is contained in an event of the form {X ∈ Bt} for Bt rectangles of area O(t²). Therefore, this event does not contribute to the limit.

The contribution of the event I with Yt = Y can be written
\[
\int\!\!\int\!\!\int g(y)\, 1_{u_1 < a_t y_1 \le u_1 + b_{1t}}\, 1_{s_2 + t a_t y_2 \le x_{2t}}\, K\bigl( (x_{1t} - u_1 t, s_2), dy \bigr)\, f(x_{1t} - u_1 t, s_2)\, du_1\, ds_2.
\]
By arguments as given previously this can be shown to converge to
\[
\int\!\!\int\!\!\int g(y)\, 1_{u_1 < a y_1 \le u_1 + b_1}\, K\bigl( (x_1, s_2), dy \bigr)\, 1_{s_2 \le x_2}\, f(x_1, s_2)\, du_1\, ds_2.
\]
The integral with respect to u1 can be computed explicitly and the expression reduces to the first term on the right of the lemma. The contribution of the event III gives the second term.

We can replace Yt by Y by similar arguments as in the one-dimensional case. (In fact, bound x2t by ∞ and use exactly the same arguments.)

4. Pseudo observations

In this section we consider the asymptotic behaviour of the process {√n(Pn θ∘η̂n − Pθ∘η0) : θ ∈ Θ} for a given class Θ of functions θ : R^d → R. The set-up is the same as in Section 3. As explained in the introduction we can decompose this process as
\[
G_n(\theta\circ\hat\eta_n - \theta\circ\eta_0) + G_n\,\theta\circ\eta_0 + \sqrt{n}\, P(\theta\circ\hat\eta_n - \theta\circ\eta_0).
\]
Under the conditions of Theorem 3.1 or Theorem 3.2, Theorem 2.3, or their extensions, the first term will converge to zero in probability in ℓ∞(Θ). The second term will converge in distribution to a Gaussian process in this space if and only if the class of functions Θ is Donsker for the law P∘η0^{−1}. If the third term also converges in distribution, then the sum of the three processes is asymptotically tight, and it will usually be straightforward to deduce its limit distribution from consideration of the marginal distributions.

The behaviour of the third term will follow by the (functional) delta-method if the sequence √n(η̂n − η0) converges in distribution in the Banach space H and the map η ↦ (Pθ∘η : θ ∈ Θ) from H to ℓ∞(Θ) is suitably differentiable. If the limit distribution of the sequence √n(η̂n − η0) concentrates on the space H0 ⊂ H, then it suffices that the map η ↦ (Pθ∘η : θ ∈ Θ) be “Hadamard differentiable tangentially to H0”, i.e. for every converging sequence ht → h0 ∈ H0 ⊂ H,
\[
\frac1t P\bigl( \theta\circ(\eta_0 + t h_t) - \theta\circ\eta_0 \bigr) \to L(h_0)(\theta),
\]
uniformly in θ ∈ Θ, for a continuous linear map L : lin H0 → ℓ∞(Θ). Under the additional condition that L is defined on all of H, this implies
\[
\sqrt{n}\, P(\theta\circ\hat\eta_n - \theta\circ\eta_0) = L\bigl( \sqrt{n}(\hat\eta_n - \eta_0) \bigr)(\theta) + o_P(1).
\]
Cf. van der Vaart and Wellner [12], page 374.

As in the preceding section we consider the cases that the functions θ are smooth or of bounded variation separately. In the former case the differentiability is relative to a weak norm on H (and is easy to prove), but for discontinuous functions θ, such as the indicator functions 1(−∞,θ], the differentiability requires a strong norm on H and some conditions on the underlying distribution.

4.1. Smooth functions θ

If the functions θ are differentiable with bounded derivatives, then the Hadamard differentiability is true for H equipped with the L1(P)-norm.

Lemma 4.1. Let the functions θ : R^d → R in Θ be continuously differentiable with derivative θ̇ such that ‖θ̇(x)‖ ≤ 1 for every x ∈ R^d. Then the map η ↦ (Pθ∘η : θ ∈ Θ) from L1(X, A, P) to ℓ∞(Θ) is Hadamard differentiable at η0 with derivative h ↦ (P θ̇(η0)h : θ ∈ Θ).

Proof. Given a sequence ht with P|ht − h0| → 0 we can write, by Fubini's theorem,
\[
\begin{split}
\Bigl| \frac1t P\bigl( \theta(\eta + t h_t) - \theta(\eta) \bigr) - P\,\dot\theta(\eta) h_0 \Bigr|
&= \Bigl| \int_0^1 P\bigl( \dot\theta(\eta + s t h_t) h_t - \dot\theta(\eta) h_0 \bigr)\, ds \Bigr| \\
&\le \int_0^1 P \bigl\| \dot\theta(\eta + s t h_t) - \dot\theta(\eta) \bigr\| \|h_t\|\, ds + P \|\dot\theta(\eta)\| \|h_t - h_0\|.
\end{split}
\]
The second term on the right is bounded above by P|ht − h0| and converges to zero by assumption. The first term on the right converges to zero by the dominated convergence theorem.
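As a sanity check on the derivative formula one can compute the difference quotient numerically (a sketch only; the choices θ(x) = sin x, η0(x) = x, h(x) = x² and P the uniform law on [0, 1] are our own illustrations, with the fixed direction h playing the role of ht = h0):

import numpy as np

x = np.linspace(0.0, 1.0, 100_001)   # dense grid; means approximate integrals under P = U[0, 1]
theta, theta_dot = np.sin, np.cos    # |theta'(x)| = |cos x| <= 1, as in Lemma 4.1
eta0, h = x, x**2

deriv = (theta_dot(eta0) * h).mean()                         # P theta_dot(eta_0) h
for t in (1e-1, 1e-2, 1e-3):
    quot = ((theta(eta0 + t * h) - theta(eta0)) / t).mean()  # t^{-1} P(theta(eta_0 + t h) - theta(eta_0))
    print(t, quot, deriv)                                    # quot approaches deriv as t -> 0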

4.2. Functions θ of bounded variation

In the second result we let Θ be a set of functions of bounded variation on a bounded interval in R, and consider the Hadamard differentiability of the map η ↦ (Pθ∘η : θ ∈ Θ) as a map from ℓ∞(X) to ℓ∞(Θ). For simplicity of notation, let X be a random variable with law P.

Lemma 4.2. Let the functions θ ∈ Θ be distribution functions of subprobability measures supported on a compact interval I ⊂ R. Suppose that the variable η0(X) possesses a Lebesgue density f that is continuous on a neighbourhood of I. Then the map η ↦ (Pθ∘η : θ ∈ Θ) from ℓ∞(X) to ℓ∞(Θ) is Hadamard differentiable at η0 tangentially to the set of all h0 such that there exists a version of the conditional distribution of h0(X) given η0(X) = s that is weakly continuous in s ∈ I. The derivative is given by
\[
h_0 \mapsto \Bigl( \int E\bigl( h_0(X) \mid \eta_0(X) = s \bigr)\, f(s)\, d\theta(s) : \theta \in \Theta \Bigr).
\]

Proof. Let h0 be as given and suppose ht → h0 in ℓ∞(X).

For given s ∈ R and u > 0 let χs,u be the continuous function that takes the value 0 on (−∞, s − u], takes the value 1 on [s, ∞) and is linear on the interval [s − u, s]. Then 1[s,∞) ≤ χs,u, and hence
\[
P\bigl( 1_{s \le \eta_0 + t h_t} - 1_{s \le \eta_0} \bigr)
\le P\bigl( \chi_{s,u}(\eta_0 + t h_t) - \chi_{s,u}(\eta_0) \bigr) + P\bigl( \chi_{s,u}(\eta_0) - 1_{[s,\infty)}(\eta_0) \bigr).
\]
Because η0(X) possesses a Lebesgue density that is bounded on a neighbourhood of I and 1[s,∞) − χs,u vanishes off the set (s − u, s), the second term on the right is bounded in absolute value by a multiple of u, uniformly in s ranging through I, for small u. By choosing u = δt this term divided by t can be made arbitrarily small by choice of δ.

Because χs,u is absolutely continuous with derivative 1/u on (s − u, s) and 0 elsewhere, the first term on the right divided by t can be written in the form
\[
\int_0^1 \frac1u\, P\bigl( h_t\, 1_{s-u < \eta_0 + v t h_t \le s} \bigr)\, dv.
\]
For u = δt this converges to E(h0(X) | η0(X) = s) f(s), by Lemma 3.2, uniformly in s ranging over I.

It follows that, uniformly in s ranging over I,
\[
\limsup_{t\downarrow 0} \Bigl( \frac1t P\bigl( 1_{s \le \eta_0 + t h_t} - 1_{s \le \eta_0} \bigr) - E\bigl( h_0(X) \mid \eta_0(X) = s \bigr) f(s) \Bigr) \le 0.
\]
A similar argument using the functions χs+u,u instead of χs,u gives a corresponding lower bound, whence the expression in brackets converges to zero, uniformly in s ranging through compacta. This concludes the proof of the lemma for Θ equal to the set of functions 1[s,∞) with s in a compact interval.

For a general collection Θ of functions of bounded variation we can write
\[
P\bigl( \theta(\eta_0 + t h_t) - \theta(\eta_0) \bigr) = \int P\bigl( 1_{s \le \eta_0 + t h_t} - 1_{s \le \eta_0} \bigr)\, d\theta(s).
\]
Next we use the assumption that the functions θ ∈ Θ are supported on the compact interval I with total variation bounded by 1.

The applicability of the second lemma depends on whether the set H0 of functions such that the conditional distribution of h0(X) given η0(X) = s is weakly continuous in s is large enough to support the limit distribution of the sequence √n(η̂n − η0). As noted in the preceding section, under some smoothness conditions on η0 and on the distribution of X, the set H0 typically contains all continuous functions. Then it suffices that the sequence possesses a continuous weak limiting process.


5. Examples: completion

In this section we return to two of the three examples discussed at the end of Section 2, Example 2 and Example 3. We give the theorems (and proofs) resulting from our approach. The general theme here is that the traditional results given in Corollaries 5.1 and 5.3 for indicator functions involve non-trivial restrictions on the underlying distribution η0 of the data, while the results for indexing by Lipschitz functions given in Corollaries 5.2 and 5.4 involve almost no restrictions on η0 (but significantly smoother indexing functions θ).

5.1. Two corollaries for Kendall processes

For the Kendall process, Example 2, it suffices to consider the case in which η0 is concentrated on [0, 1]^d and has uniform marginal distributions (i.e. is a copula function), as noted by Ghoudi and Rémillard [6]. We first give a corollary for indexing by indicator functions, and then a corollary for indexing by Lipschitz functions.

Corollary 5.1. Suppose that for a given interval [a, b] ⊂ (0, 1):

(i) The variable η0(X) possesses a density k with respect to Lebesgue measure that is continuous on a neighbourhood of [a, b].
(ii) The conditional distribution of X given η0(X) = s has a regular version representable as a Markov kernel K(s, ·) such that s ↦ K(s, ·) is continuous on [a, b] for the weak topology.

Then the sequence of processes √n(Kn − K) as in (11) tends in ℓ∞[a, b] in distribution to the process (Gη0 fθ : θ ∈ [a, b]) for Gη0 an η0-Brownian bridge process and fθ : [0, 1]^d → R defined as
\[
f_\theta(x) = 1_{\eta_0(x) \le \theta} - k(\theta)\, E\bigl[ 1_{x \le X} \mid \eta_0(X) = \theta \bigr].
\]

Corollary 5.2 (Kendall processes, Example 2, indexed by Lipschitz functions). Suppose that Θ is a suitably measurable collection of continuously differentiable functions θ : [0, 1] → [−1, 1] with derivatives θ̇ satisfying |θ̇(x)| ≤ 1 for every x ∈ [0, 1]. Then the sequence of processes n^{−1/2} Σ_{i=1}^n (θ(η̂n(Xi)) − Pθ(η0)) tends in distribution in ℓ∞(Θ) to the process (Gη0 fθ : θ ∈ Θ) for Gη0 an η0-Brownian bridge process in ℓ∞(Θ) and fθ : [0, 1]^d → R defined as
\[
f_\theta(x) = \theta\bigl(\eta_0(x)\bigr) - P\,\dot\theta(\eta_0)\, 1_{x \le X}.
\]

Proof of Corollary 5.1. We apply the decomposition (2) with fθ,η(x) = 1{η(x) ≤ θ}, for distribution functions η on [0, 1]^d, θ ∈ [0, 1] and x ∈ [0, 1]^d.

As discussed following the proof of Lemma 3.2, hypotheses (i) and (ii) imply that condition (19) for Theorem 3.2 (with d = 1) holds by way of Lemma 3.2, and hence the first term on the right side of (2) converges in probability to 0 uniformly in θ ∈ [a, b].

The second term is simply the usual empirical process for the i.i.d. one-dimensional random variables η0(X1), …, η0(Xn), and hence it converges weakly as claimed by standard theory.

To handle the third term, note that (i) and (ii) imply that the hypotheses of Lemma 4.2 hold, and hence that the map η ↦ {Pfθ,η : θ ∈ Θ} from ℓ∞(X) to ℓ∞([a, b]) is Hadamard differentiable tangentially to C([0, 1]^d) with derivative L : C([0, 1]^d) → ℓ∞([a, b]) given by
\[
L(h_0)(\theta) = -E\bigl( h_0(X) \mid \eta_0(X) = \theta \bigr)\, k(\theta).
\]
Weak convergence of the third term then follows from van der Vaart and Wellner [12], Theorem 3.9.5, page 375.

The joint limit law of the second and third term can be determined from the marginals, and the limit of the sum of the two terms can be represented in the form as given. An insightful way to derive this is from asymptotic linearity of the two terms, as follows. The second term is already linear with influence functions x ↦ 1{η0(x) ≤ θ}. The third term can be approximated by L(√n(η̂n − η0)), where (η̂n − η0)(x) = n^{−1} Σ_{i=1}^n (1_{Xi≤x} − η0(x)), so that L(√n(η̂n − η0)) = n^{−1/2} Σ_{i=1}^n L(1[Xi,1] − η0). The terms in the latter sum should be understood as L acting on the functions x ↦ 1[Xi,1](x) − η0(x) for fixed Xi. We thus obtain that
\[
L\bigl( \sqrt{n}(\hat\eta_n - \eta_0) \bigr) = n^{-1/2} \sum_{i=1}^n L(1_{[X_i,1]}) - \sqrt{n}\, L(\eta_0) = G_{\eta_0}\, L(1_{[X_i,1]}).
\]
The representation of the limit process as given in the corollary follows.

For many distribution functions η0 the corresponding density k of K is unbounded at 0 and hence not continuous on [0, 1]. See Barbe et al. [2], pages 202–208, for a number of explicit examples. In particular this is true even when η0 is the uniform distribution on [0, 1]^d. For such distributions the preceding corollary does not yield convergence of Kendall's process in the space ℓ∞([0, 1]). However, this convergence may be valid even when k is unbounded. Barbe et al. [2] show that under the growth condition
\[
k(t) = o\bigl( t^{-1/2} (\log(1/t))^{-1/2-\varepsilon} \bigr), \qquad t \downarrow 0,\ \varepsilon > 0,
\]
convergence in the full domain still holds. They achieve this using results of Alexander [1] to show that the empirical process √n(η̂n − η0)1{η0 ≥ an} converges in the weighted metric ‖ · /q(η0)‖∞ if q(t) = t^{1/2}(log(1/t))^p for some 1/2 < p < r/2 and an = n^{−1}(log n)^r. This strengthening of the convergence of √n(η̂n − η0) then compensates for the growth of k at 0.

Proof of Corollary 5.2. This follows by combining Theorem 3.1 and Lemma 4.1 with the fact that F = {θ∘η0 : θ ∈ Θ} is Donsker.

5.2. Two corollaries for copula processes

For the copula processes (12) in Example 3 it again suffices to consider the case in which η0 = C, so that all the marginal distributions η0,j, j = 1, …, d, are Uniform(0, 1). The first of the following two corollaries was obtained in Stute [9] and Ghoudi and Rémillard [7].

Corollary 5.3. Suppose that:

(i) η0 = C is continuous.
(ii) The copula function η0 = C is continuously differentiable on [0, 1]^d with gradient ∇C(u).

Then the sequence of copula processes √n(Cn − C) given in (12) converges in distribution in ℓ∞([0, 1]^d) to the process (Gη0 fu : u ∈ [0, 1]^d) for Gη0 an η0-Brownian bridge process, and fu : [0, 1]^d → R defined as
\[
f_u(x) = 1_{x \le u} - \nabla C(u)' \bigl( 1_{x_1 \le u_1}, \dots, 1_{x_d \le u_d} \bigr).
\]

Corollary 5.4 (Copula processes, Example 3, indexed by Lipschitz functions). Suppose that Θ is a suitably measurable collection of continuously differentiable functions θ : [0, 1]^d → R with derivative θ̇ satisfying ‖θ̇(x)‖ ≤ 1 for every x ∈ [0, 1]^d, and satisfying J(1, Θ, L2) < ∞. Then the sequence of processes n^{−1/2} Σ_{i=1}^n (θ(η̂n(Xi)) − Pθ(η0)) tends in distribution in ℓ∞(Θ) to the process (Gη0 fθ : θ ∈ Θ) for Gη0 an η0-Brownian bridge process in ℓ∞(Θ) and fθ : [0, 1]^d → R defined as
\[
f_\theta(x) = \theta(x) - P\,\dot\theta' \bigl( 1_{x_1 \le X_1}, \dots, 1_{x_d \le X_d} \bigr).
\]

Proof of Corollary 5.3. We apply the decomposition (2) with
\[
f_{\theta,\eta}(x) = 1\{ \eta_1(x_1) \le \theta_1, \dots, \eta_d(x_d) \le \theta_d \}
\]
for θ = (θ1, …, θd) ∈ [0, 1]^d, x = (x1, …, xd) ∈ [0, 1]^d, and ηj the jth one-dimensional marginal distribution function on [0, 1] of the distribution function η (so ηj(uj) = η(1, …, 1, uj, 1, …, 1)).

To show that the first term in (2) converges to zero uniformly in θ ∈ [0, 1]^d, we can apply Theorem 2.1. The class of functions
\[
f_{\theta,\eta}(x) = 1\{ x_1 \le \eta_1^{-1}(\theta_1), \dots, x_d \le \eta_d^{-1}(\theta_d) \}
\]
is a class of indicators of a Vapnik-Chervonenkis class of sets. Thus Theorem 2.1 applies if we show that (3) holds. But this is easily verified by the assumed continuity of η0 = C and the uniform consistency of the empirical quantile functions η̂n,j^{−1} for j = 1, …, d. Thus (1) holds.

The second term in (2) is simply the classical empirical process of the random vectors X1, …, Xn in [0, 1]^d, and converges weakly by classical theory.

Finally, the third term in (2) converges weakly to ∇C(u)′ · Gη0(v(X, u)), for v(x, u) = (1_{x1≤u1}, …, 1_{xd≤ud}), by the delta-method for the map η ↦ Pfu,η. This map can be decomposed as
\[
\eta \mapsto \bigl( \eta_1^{-1}(u_1), \dots, \eta_d^{-1}(u_d) \bigr)
\mapsto \Bigl( C\bigl( \eta_1^{-1}(u_1), \dots, \eta_d^{-1}(u_d) \bigr) : u \in [0, 1]^d \Bigr),
\]
and can be shown to be Hadamard differentiable from the domain of distribution functions in ℓ∞(X) = ℓ∞([0, 1]^d) to ℓ∞(Θ) = ℓ∞([0, 1]^d) by the chain rule, using the continuity of ∇C and the fact that the quantile transformation is Hadamard differentiable.

It is possible to extend Corollary 5.3 to the case in which ∇C is continuous on (0, 1)^d but satisfies certain growth restrictions at 0 and/or 1. Then weighted metrics are involved in the proof.

Proof of Corollary 5.4. This follows by combining Theorem 3.1 and Lemma 4.1 with the fact that F = {θ : θ ∈ Θ} is Donsker, together with the delta-method, e.g. van der Vaart and Wellner [12], Theorem 3.9.5, page 375.


References

[1] Alexander, K. (1987). The central limit theorem for weighted empirical processes indexed by sets. J. Mult. Anal. 22 313–339.

[2] Barbe, P., Genest, C., Ghoudi, K. and Rémillard, B. (1996). On Kendall's process. J. Mult. Anal. 58 197–229.

[3] Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge Univ. Press, Cambridge.

[4] Dudley, R. M. (1984). A Course on Empirical Processes. École d'Été de St. Flour 1982. Springer, New York.

[5] Gänssler, P. and Stute, W. (1987). Seminar on Empirical Processes. DMV Seminar. Birkhäuser, Basel.

[6] Ghoudi, K. and Rémillard, B. (1998). Empirical processes based on pseudo-observations. In Asymptotic Methods in Probability and Statistics (Ottawa, ON, 1997) 171–197. North-Holland, Amsterdam.

[7] Ghoudi, K. and Rémillard, B. (2004). Empirical processes based on pseudo-observations. II. The multivariate case. In Asymptotic Methods in Stochastics 381–406. Fields Inst. Commun. 44. Amer. Math. Soc., Providence, RI.

[8] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.

[9] Stute, W. (1984). The oscillation behavior of empirical processes: The multivariate case. Ann. Probab. 12 361–379.

[10] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.

[11] van der Vaart, A. W. (2002). Semiparametric statistics. École d'Été de St. Flour 1999 331–457. Springer, New York.

[12] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

