
Accepted Manuscript

Nonparametric tilted density function estimation: A cross-validation criterion

Hassan Doosti, Peter Hall, Jorge Mateu

PII: S0378-3758(17)30215-X
DOI: https://doi.org/10.1016/j.jspi.2017.12.003
Reference: JSPI 5626

To appear in: Journal of Statistical Planning and Inference

Received date: 18 July 2017
Revised date: 18 December 2017
Accepted date: 21 December 2017

Please cite this article as: Doosti H., Hall P., Mateu J., Nonparametric tilted density function estimation: A cross-validation criterion. J. Statist. Plann. Inference (2018), https://doi.org/10.1016/j.jspi.2017.12.003



Nonparametric Tilted Density Function Estimation: A Cross-validation Criterion

Hassan Doosti1

Macquarie University, Sydney, Australia.

Peter Hall 2

The University of Melbourne, Melbourne, Australia.

Jorge Mateu 3

Universitat Jaume I, Castellon, Spain.

Abstract

In this paper, we propose a tilted estimator for nonparametric estimation of a

density function. We use a cross-validation criterion to choose both the band-

width and the tilted estimator parameters. We demonstrate theoretically that

our proposed estimator provides a convergence rate which is strictly faster than

the usual rate attained using a conventional kernel estimator with a positive

kernel. We investigate the performance through both theoretical and numerical

studies.

Keywords: Cross validation function, Non-parametric density function

estimation, Rate of convergence, Tilted estimators

2010 MSC: 62G07, 62G20

1. Introduction

Motivation. Doosti and Hall (2016) introduced new high-order, non-parametric

density estimators based on data perturbation, e.g. by tilting or data sharp-

ening. They proposed an approach to choose the parameters to minimise the

[email protected] (Corresponding Author)
2 Deceased 9 January
[email protected]

Preprint submitted to Journal of Multivariate Analysis December 18, 2017


integrated squared distance between the new estimators and an adaptive estimator, based for example on the sinc kernel or a flat-top kernel. The new estimators are more accurate than the high-order methods because they remove the negative parts of the estimator, which always penalise performance. On the other hand, the new estimators are slow to compute. In the current paper we introduce a cross-validation function that is suited to tilted estimators, and show that minimising the corresponding criterion yields estimators with a fast rate of convergence. Theoretically, we prove that their rate of convergence is faster than the usual rate. In practice, the new approach substantially reduces the computational labour. This work was initiated by Peter Hall before he fell ill, and he did not have a chance to see more than a first draft of the paper. We dedicate this manuscript to Peter Hall.

On the density function estimation by tilting. The first version of a tilting-based approach was suggested by Grenander (1956), for enforcing constraints. Chen (1997), Zhang (1998), Müller et al. (2005) and Schick and Wefelmeyer (2009) employed empirical likelihood, introduced by Owen (1988, 1990, 2001), to find probability weights by profiling a multinomial likelihood under a set of constraints. Instead of empirical likelihood-based methods, some authors have used a variety of distance measures in this context. In the latter approach the tilted estimator is constructed by minimising these distances, subject to the constraints. Hall and Presnell (1999), Hall and Huang (2001, 2002), Carroll et al. (2011) and Doosti and Hall (2016) used this methodology to find their tilted estimators.

The novelty of our approach, compared with existing approaches, is that it exploits a cross-validation criterion. Furthermore, we propose a new algorithm that helps to find the amount of tilt.

Assume we are given a random sample $\mathcal{X} = \{X_1, \ldots, X_n\}$, and wish to estimate the density, $f$, of the distribution from which the data were drawn. To define a tilted kernel estimator, let $p = (p_1, \ldots, p_n)$ denote a probability distribution on $n$ points, so that each $p_i \ge 0$ and

$$\sum_{i=1}^{n} p_i = 1 . \tag{1.1}$$

Given a random sample $\{X_1, \ldots, X_n\}$, drawn from a distribution with density $f$, a bandwidth $h$ and a kernel $K$, we define the tilted kernel density estimator as

$$\hat f(x \mid h, p) = \frac{1}{h} \sum_{i=1}^{n} p_i \, K\!\left(\frac{x - X_i}{h}\right). \tag{1.2}$$

The conventional kernel estimator, $\tilde f$ say, is recovered by taking $p = p_0 \equiv (n^{-1}, \ldots, n^{-1})$, the uniform multinomial distribution:

$$\tilde f(x \mid h) = \hat f(x \mid h, p_0) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \tag{1.3}$$

It is common to take $K$ to be a bounded, symmetric probability density, and in that case the estimators at (1.2) and (1.3) are both nonnegative and integrate to 1, and so are themselves proper probability densities. However, the convergence rate of the estimator at (1.3) is severely restricted; it is $O_p(n^{-4/5})$ if $f$ has two bounded derivatives, and generally cannot be improved beyond that point, even if $f$ is infinitely differentiable, without using a kernel $K$ that takes negative values in some parts of its support.
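As a minimal sketch of (1.2) and (1.3), assuming a standard normal kernel, the tilted estimator can be evaluated pointwise as follows. The function names `gauss_kernel` and `tilted_kde`, the sample and the tilted weights are illustrative and not taken from the paper; uniform weights recover the conventional estimator.

```python
import math

def gauss_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def tilted_kde(x, data, h, p):
    """Tilted kernel estimate (1.2): (1/h) * sum_i p_i K((x - X_i)/h)."""
    if any(pi < 0 for pi in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector, see (1.1)")
    return sum(pi * gauss_kernel((x - xi) / h) for pi, xi in zip(p, data)) / h

data = [-1.2, -0.3, 0.1, 0.8, 1.9]        # hypothetical sample
n = len(data)
p0 = [1.0 / n] * n                         # uniform weights give (1.3)
p_tilt = [0.10, 0.25, 0.30, 0.25, 0.10]    # hypothetical tilted weights
conventional = tilted_kde(0.0, data, 0.5, p0)
tilted = tilted_kde(0.0, data, 0.5, p_tilt)
```

Because the weights are nonnegative and sum to one and the kernel is a density, the tilted estimate is itself a proper density, whatever tilt is chosen.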

In principle we would like to choose $p$ to minimise the integrated squared error,

$$\mathrm{ISE}(h, p) = \int \bigl\{\hat f(x \mid h, p) - f(x)\bigr\}^2 w(x)\, dx , \tag{1.4}$$

where $w$ denotes a nonnegative weight function, and minimisation is undertaken subject to the constraint that $p$ is a proper probability distribution. In practice, however, ISE depends on the unknown density $f$, so in this paper we use cross-validation to choose $h$ and $p$.

Paper organisation. The rest of this paper is organised as follows. In Section 2, we present our cross-validation function. The main theoretical results are described in Section 3, and Section 4 is devoted to the numerical performance of our estimator. The proofs of the technical results appear in an Appendix at the end of the paper.

2. Cross-validation function

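The cross-validation criterion that we use is given by

$$CV(h, p) = \int \hat f(x \mid h, p)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) , \tag{2.1}$$

where

$$\hat f_{-i}(x \mid h, p) = \frac{1}{h} \sum_{j \,:\, j \ne i} p_j \, K\!\left(\frac{x - X_j}{h}\right) \tag{2.2}$$

and, as before, $p = (p_1, \ldots, p_n)$ is a probability measure on $n$ points.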

The density estimator $\hat f_{-i}(x \mid h, p)$, defined at (2.2), is a leave-one-out version of $\hat f(x \mid h, p)$, computed using the dataset $\mathcal{X} \setminus \{X_i\}$. However, the vector $p$ used in (2.2) is not re-standardised so that $\sum_{j \,:\, j \ne i} p_j = 1$, since the latter step turns out to be unnecessary. This reduces computational labour.
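The criterion (2.1) can be evaluated exactly when the kernel is Gaussian, since the integral of the squared estimator then has a closed form. The following is a sketch under that assumption (the paper's theory does not require a Gaussian kernel, and the name `cv_criterion` is illustrative). Note that the leave-one-out sum uses the weights as they are, without renormalising.

```python
import math

def gauss(u, s=1.0):
    """Density of N(0, s^2) at u; gauss(u, h) equals (1/h) K(u/h) for normal K."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cv_criterion(data, h, p):
    """CV(h, p) of (2.1), assuming a standard normal kernel."""
    n = len(data)
    # Closed form for a Gaussian kernel:
    # int fhat^2 dx = sum_i sum_j p_i p_j phi_{sqrt(2) h}(X_i - X_j).
    s = math.sqrt(2.0) * h
    int_f2 = sum(p[i] * p[j] * gauss(data[i] - data[j], s)
                 for i in range(n) for j in range(n))
    # Leave-one-out terms (2.2); the weights p_j are not renormalised.
    loo = [sum(p[j] * gauss(data[i] - data[j], h) for j in range(n) if j != i)
           for i in range(n)]
    return int_f2 - 2.0 * sum(loo) / n

value = cv_criterion([0.0, 1.0], 1.0, [0.5, 0.5])
```

Each evaluation costs on the order of n squared kernel evaluations, which is the quantity a numerical optimiser over h and p would call repeatedly.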

Recall the definition of the integrated squared error, ISE, at (1.4), and that $p_0$ denotes the uniform probability measure. For simplicity, take the weight function in (1.4) equal to 1. It is well known that if $f$ and $f''$ are bounded and square integrable, then, under mild assumptions on $h$ and $K$, the following property of integrated squared error holds:

$$\mathrm{ISE}(h, p_0) \asymp_p (nh)^{-1} + h^4 , \tag{2.3}$$

where the notation $A_n(h) \asymp_p B_n(h)$ means that, whenever $h = h(n) \to 0$ such that $nh \to \infty$, the ratio $R_n = A_n(h)/B_n(h)$ satisfies

$$\lim_{C \to \infty} \liminf_{n \to \infty} P\bigl(C^{-1} \le R_n \le C\bigr) = 1 .$$
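To make the balance between the two terms in (2.3) concrete, the following worked calculation minimises the bound over $h$. It is an illustration with all constants set to 1, which the paper does not assume.

```latex
\begin{align*}
g(h) &= (nh)^{-1} + h^{4}, \qquad
g'(h) = -n^{-1}h^{-2} + 4h^{3} = 0 \iff h_{*} = (4n)^{-1/5},\\
g(h_{*}) &= 4^{1/5}\,n^{-4/5} + 4^{-4/5}\,n^{-4/5} = 5\cdot 4^{-4/5}\, n^{-4/5}.
\end{align*}
```

The bound is therefore of order $n^{-4/5}$, attained for bandwidths of order $n^{-1/5}$.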

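In Section 3 we shall state a theorem which asserts that, for a quantity $Q_n$ not depending on $h$ or $p$, and for a class $\mathcal{H}$ of bandwidths $h$ and a class $\mathcal{P}$ of probability measures $p$,

$$\sup_{p \in \mathcal{P}} \left| \frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) - \int \hat f(x \mid h, p)\, f(x)\, dx - Q_n \right| = o_p(1) . \tag{2.4}$$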

It follows from (2.3) and (2.4) that, since

$$\mathrm{ISE}(h, p) = \int \hat f(x \mid h, p)^2 \, dx - 2 \int \hat f(x \mid h, p)\, f(x)\, dx + \int f(x)^2 \, dx , \tag{2.5}$$

then by minimising the cross-validation criterion over $h$ and $p$ together, rather than over $h$ alone as in conventional applications of cross-validation, we can find a bandwidth $h$ and a multinomial probability distribution $p$ that allow us to reduce ISE to a quantity of strictly smaller order than $\min_h \{(nh)^{-1} + h^4\} \asymp n^{-4/5}$. That is, $\mathrm{ISE}(h, p) = o_p(n^{-4/5})$ holds. In view of (2.3), $n^{-4/5}$ is the order of ISE for the conventional estimator $\tilde f(\,\cdot \mid h)$, and so by using cross-validation to choose $p$ we are able to strictly improve the performance of $\tilde f(\,\cdot \mid h)$ as measured by ISE.
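The link between the criterion (2.1) and ISE can be written as an exact identity. In the sketch below, $\Delta_n(h, p)$ is notation introduced here for illustration (it is not used in the paper) and denotes the quantity inside the absolute value in (2.4); the weight function is taken equal to 1.

```latex
\begin{align*}
\Delta_n(h,p) &:= \frac{1}{n}\sum_{i=1}^{n} \hat f_{-i}(X_i \mid h,p)
  - \int \hat f(x \mid h,p)\, f(x)\,dx - Q_n ,\\
CV(h,p) &= \int \hat f(x \mid h,p)^2\,dx - 2\int \hat f(x \mid h,p)\, f(x)\,dx
  - 2Q_n - 2\Delta_n(h,p)\\
&= \mathrm{ISE}(h,p) - \int f(x)^2\,dx - 2Q_n - 2\Delta_n(h,p).
\end{align*}
```

Since neither $\int f^2$ nor $Q_n$ depends on $(h, p)$, minimising $CV$ differs from minimising ISE only through $\Delta_n(h, p)$, which is the term that (2.4) controls.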

In practice, we choose $(h, p)$ sequentially as follows. First we take $p = p_0$, choose $h = \hat h$ to minimise $CV(h, p_0)$, and put $(h^{(1)}, p^{(1)}) = (\hat h, p_0)$. This is the conventional cross-validation approach to bandwidth choice. At the $r$th step in the algorithm, if $(h^{(r)}, p^{(r)})$ has been selected already, and $p^{(r)}$ involves $m = r$ in the sparse interpolation algorithm (see Algorithm A below), choose $(h^{(r+1)}, p^{(r+1)})$ to minimise $CV(h, p)$, subject to

$$h^{(r+1)} \ge \rho \, h^{(r)} , \tag{2.6}$$

to $p$ having $m = r + 1$ in Algorithm A below, and to $p^{(r+1)}$ being a proper probability measure. Here, $\rho \in (0, 1)$ is a predetermined constant, and its range reflects the fact that, in asymptotic terms, the optimal bandwidth increases as the role of bias decreases.

Algorithm A is a piecewise-constant sparse interpolation algorithm.

Algorithm A (AA): Let $X_{(1)} \le \ldots \le X_{(n)}$ denote an ordering of the data $X_1, \ldots, X_n$, and, given an integer $m$ in the range $2 \le m < n$, let $1 < i_1 < \ldots < i_m = n$ be approximately regularly spaced between 0 and $n$, in the sense that, for $1 \le j \le m$, we have $C_1\, n/m \le i_j - i_{j-1} \le C_2\, n/m$, where $i_0 = 0$ and $X_{(i_0)} = -\infty$. Take $p_i$ to be constant, in particular to have the value $q_j$, say, for $i_{j-1} < i \le i_j$, where $1 \le j \le m$.

Here and below we write $C_1, C_2, \ldots$ for positive constants. Of course, modifications of AA are possible, including those where the weights are distributed approximately symmetrically in the two tails. It follows from AA that the constraints $\sum_i p_i = 1$ and $p_i \ge 0$ are equivalent to

$$\sum_{j=1}^{m} (i_j - i_{j-1})\, q_j = 1 , \qquad q_1, \ldots, q_m \ge 0 . \tag{2.7}$$
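Algorithm A and the constraint (2.7) can be sketched as follows. The integer-division rule for placing the endpoints is one possible way to obtain approximately regular spacing and is an assumption made here, as are the function names; the weights returned are in the order of the order statistics.

```python
def block_endpoints(n, m):
    """Indices 0 = i_0 < i_1 < ... < i_m = n, approximately regularly spaced."""
    return [(j * n) // m for j in range(m + 1)]

def weights_from_blocks(q, ends):
    """Piecewise-constant weights: p_(i) = q_j for i_{j-1} < i <= i_j."""
    p = []
    for j in range(1, len(ends)):
        p.extend([q[j - 1]] * (ends[j] - ends[j - 1]))
    return p

def satisfies_27(q, ends, tol=1e-12):
    """Check (2.7): sum_j (i_j - i_{j-1}) q_j = 1 and all q_j >= 0."""
    total = sum((ends[j] - ends[j - 1]) * q[j - 1] for j in range(1, len(ends)))
    return all(qj >= 0 for qj in q) and abs(total - 1.0) <= tol

ends = block_endpoints(10, 4)        # block sizes 2, 3, 2, 3
q = [0.05, 0.15, 0.15, 0.05]         # hypothetical block values
p = weights_from_blocks(q, ends)
ok = satisfies_27(q, ends)
```

The optimisation over $p$ thus reduces from $n$ free weights to the $m$ values $q_1, \ldots, q_m$, with one linear constraint.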

In practice it is often desirable to choose $i_1, \ldots, i_m$ so that some $X_{(i_j)}$ are close to each mode of $f$. The modes can be located approximately by constructing a pilot estimator of $f$, for example $\tilde f$.

Algorithm A can be replaced by its obvious linear interpolation counterpart, with almost no impact on numerical results and no impact on the theoretical properties that we shall discuss in Section 3.
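The first two steps of the sequential choice of $(h, p)$ can be sketched with a crude grid search. This is only an illustration: the Gaussian kernel, the grids, the split of the ordered sample into two equal halves for $m = 2$, and the value $\rho = 0.8$ are all assumptions made here, and a numerical optimiser would normally replace the grids.

```python
import math
import random

def gauss(u, s):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cv(xs, h, p):
    """CV(h, p) of (2.1) with a Gaussian kernel; xs sorted, p in the same order."""
    n = len(xs)
    s2 = math.sqrt(2.0) * h
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = xs[i] - xs[j]
            total += p[i] * p[j] * gauss(d, s2)          # int fhat^2 term
            if j != i:
                total -= 2.0 * p[j] * gauss(d, h) / n    # leave-one-out term
    return total

def two_block_weights(n, n1, q1):
    """m = 2: value q1 on the first n1 order statistics, q2 fixed by (2.7)."""
    q2 = (1.0 - n1 * q1) / (n - n1)
    return [q1] * n1 + [q2] * (n - n1)

def select(xs, rho=0.8):
    n = len(xs)
    h_grid = [0.1 * 1.3 ** k for k in range(12)]
    p0 = [1.0 / n] * n
    # Step 1: conventional cross-validation with uniform weights.
    cv1, h1 = min((cv(xs, h, p0), h) for h in h_grid)
    # Step 2: joint search over h >= rho * h1, see (2.6), and q1 (m = 2).
    n1 = n // 2
    q_grid = [k / (10.0 * n) for k in range(5, 16)]      # q1 in [0.5/n, 1.5/n]
    best = (cv1, h1, p0)
    for h in h_grid:
        if h < rho * h1:
            continue
        for q1 in q_grid:
            p = two_block_weights(n, n1, q1)
            c = cv(xs, h, p)
            if c < best[0]:
                best = (c, h, p)
    return (cv1, h1), best

rng = random.Random(0)
xs = sorted(rng.gauss(0.0, 1.0) for _ in range(30))     # hypothetical sample
step1, step2 = select(xs)
```

Because the step-2 search region contains the step-1 solution, the criterion cannot increase from one step to the next in this sketch.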

3. Theoretical properties

In this section we state assumptions under which (2.4) holds. Given constants $c_1, c_2 > 0$ we define

$$\mathcal{H} = \bigl[\, n^{-(1/5) - c_1} ,\; n^{-(1/5) + c_1} \,\bigr] , \tag{3.1}$$

and we take $\mathcal{P}$ to be the class of all probability measures $p = (p_1, \ldots, p_n)$ that satisfy

$$\max_{1 \le i \le n} \bigl| p_i - n^{-1} \bigr| \le h^2 \, n^{c_2} \tag{3.2}$$

and are constructed by sparse interpolation (see AA) from values of $p_i$, for $i$ in a sequence of integers $i_1, \ldots, i_m$. The manner in which the definition of $\mathcal{H}$ at (3.1) involves $n^{-1/5}$ reflects the fact that we permit $\mathcal{H}$ to contain bandwidths that are of both smaller and larger magnitude than the optimal size, $n^{-1/5}$, which minimises the order of magnitude of the integrated squared error of $\tilde f(\,\cdot \mid h)$ (see (2.3)). Note that the presence of the factor $n^{c_2}$ in (3.2) acknowledges that, for some densities $f$, we can choose $p_i$ with more flexibility than in the formulae proposed by Doosti and Hall (2016).

With $p = (p_1, \ldots, p_n)$ defined in terms of $q = (q_1, \ldots, q_m)$ as in AA, we choose $h$ and $q$ together to minimise $CV(h, p)$, subject to (2.7).

In addition to AA, we impose the following condition:

Condition B (CB): (a) $f$, $|f'|$ and $|f''|$ are uniformly bounded; (b) $K$ is a bounded, symmetric, compactly supported and Hölder continuous probability density; (c) $\max_{1 \le i \le n} p_i \le C_3\, n^{-1}$; and (d) $m \le n^{c_3}$ where $c_3 < 1/2$.

Assumptions (a) and (b) are conventional in kernel density estimation, assumption (c) simply ensures that the constant $c_2$ in (3.2) is not so large that the $p_i$s can be of larger order than $n^{-1}$, and assumption (d) bounds the rate at which the number of distinct $p_i$s can grow.

Theorem 1. Assume that $\mathcal{H}$ is as in (3.1), and that $\mathcal{P}$ is the class of probability measures $p$ satisfying AA and CB. Then the constants $c_1, c_2, c_3 > 0$, in (3.1), (3.2) and CB, respectively, can be chosen so that (2.4) holds, and in fact so that the right-hand side of (2.4) equals $O_p[\{(nh)^{-1} + h^4\}\, n^{-c_4}]$, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, where $c_4 > 0$.

A proof of Theorem 1 is given in the Appendix at the end of the paper.

4. Simulation Study

4.1 Summary of methods used. A simulation study is carried out to compare our estimators with those employed by Doosti and Hall (2016). We follow the first three examples in Section 4 of Doosti and Hall (2016), comparing the proposed estimator with the competitors considered there, and we use the same setup for the selection of parameters and datasets. We have three different approaches to tilting, referred to below as (I)–(III). Techniques (I) and (II), proposed by Doosti and Hall (2016), choose $p$ to minimise the $L_2$ distance to estimators computed using the sinc kernel or the trapezoidal kernel, respectively, where the bandwidth for the latter technique is chosen by cross-validation. Method (III), our newly proposed estimator, chooses $p$ and the bandwidth directly, using the cross-validation function at (2.1). The case $m = 2$ was also explored. The following competitors also contribute to the simulation study: (i) a conventional second-order kernel estimator, (ii) the sinc kernel estimator, (iii) the trapezoidal kernel estimator, and (iv) the diffusion estimator. For the definitions of the competitor estimators and the setup for choosing their parameters, see Section 4 of Doosti and Hall (2016). In each case the standard normal kernel was used to construct the tilted estimator. The performance depends to some extent on the way the values of the $p_i$s are distributed. For example, when $m = 2$ and the density being estimated is unimodal, it is desirable to have one of the two groups of equal $p_i$s approximately in the middle nine-tenths of the distribution. Analogous arrangements are appropriate for multimodal densities, or for unimodal densities when $m > 2$, and in each case the appropriate distribution of the groups of equal $p_i$ values can be determined empirically by cross-validation.

4.2. Simulations. For the first example, we employed the same datasets as in Figure 1 in Doosti and Hall (2016). Figure 1, from the top panel to the bottom panel, shows estimators of Beta (3, 3), Beta (5, 3), Beta (3, 6) and Beta (6, 6) densities, respectively. As we pass through the sequence of Beta densities, from top to bottom, the densities become successively smoother at the boundaries. Each panel contains eight curves, representing the true density, the conventional kernel estimator, the sinc and trapezoidal kernel estimators, the diffusion estimator, and tilted estimators (I)–(III).

The following properties can be deduced from the plots in Figure 1, which are given in the case $m = n$: (a) the conventional and diffusion estimators perform similarly, with the former typically higher than the latter in the vicinity of the mode, except in the Beta (3, 6) case where the diffusion estimator performs poorly due to its bimodality; (b) the sinc and trapezoidal kernel estimators suffer from negative side lobes; (c) apart from the previous problem, as expected the trapezoidal kernel estimator performs similarly to the tilting method (II); and (d) overall, the tilting method (III) arguably is the most satisfactory, since it best captures the peak of the true density, it does not suffer from negative side lobes, and more generally, it is not misled into producing a spurious mode. Results when $m = 2$ or 3, for methods (I)–(III), are similar.


Figure 1 Beta density function estimation. See text for distribution types. In

each panel the true density is represented by a bold curve, the conventional

kernel estimator by a blue dashed curve, the diffusion estimator by a black line

with triangles, the sinc kernel estimator by a red dotted curve, the trapezoidal

estimator by a green dot-dashed curve, and the tilted estimators (I), (II) and

(III) by red lines with squares, green lines with crosses and blue lines with

circles, respectively. Sample size is n = 50.


Figure 2 is based on the datasets in the second example in Doosti and Hall (2016). In the case of Figure 2 the distributions are all infinitely supported. From the top panel to the bottom panel, it shows estimators of Student's $t_3$, lognormal, separated bimodal and skewed bimodal densities, respectively. In the latter two cases the densities were #7 and #8 of Marron and Wand (1992).

Table 1: Approximation of MISE. The sample size is 100 and the number of replications is 500.

Density  Tilted I      Tilted I      Tilted II     Tilted II     Tilted III    Tilted III    Conventional  Sinc          Trapezoidal   Diffusion
         m = 3         m = n         m = 3         m = n         m = 2         m = 3         estimator     estimator     estimator     estimator
Gau      0.0538        0.0487        0.0520        0.0661        0.0296        0.0283        0.0476        0.0579        0.0674        0.0511
SkU      0.6309        0.6181        0.6583        0.6550        0.5429        0.5203        0.5329        0.6249        0.6573        0.5769
StS      5.0323        5.0375        5.0527        5.0741        4.9725        4.5596        4.5732        5.0947        5.0997        4.6107
KtU      0.3751        0.5133        0.3501        0.4440        0.5042        0.5064        1.0277        0.5174        0.4460        0.4429
Out      0.4287        0.5007        0.4580        0.5972        0.5349        0.5247        0.5369        0.5171        0.5662        0.7243
Bim      0.0777        0.0858        0.0746        0.0982        0.0660        0.0772        0.0885        0.0905        0.0997        0.0731
SeB      0.0928        0.0997        0.0918        0.1207        0.1207        0.1348        0.5456        0.1092        0.1249        0.0974
SkB      0.1076        0.1251        0.1051        0.1270        0.0837        0.1018        0.1164        0.1319        0.1301        0.0972

When describing the results in Figure 2 it is convenient to treat by itself the lognormal density $f_{\mathrm{lgn}}$, for which the results are shown in the second panel from the top. Density estimation there is challenged seriously by the sharp decrease to zero of $f_{\mathrm{lgn}}(x)$ as $x \downarrow 0$. Although $f_{\mathrm{lgn}}$ is infinitely differentiable on the real line, in several respects estimators of $f_{\mathrm{lgn}}$ behave as though the density had a jump discontinuity at the origin. This is particularly true for the conventional and diffusion estimators, which perform almost identically and very poorly. The tilted estimators (I) and (II) are higher in the vicinity of the mode than the tilted estimator (III), but the latter is smoother than (I) and (II) and so is arguably more attractive. The tilted sinc kernel estimator (I) tracks its untilted form (ii) very closely, and both have several spurious modes in the right-hand tail, although of course only the sinc kernel estimator takes negative values there. More generally, the closeness of (I) and (ii) can be seen in all the panels in Figure 2.

Figure 2 Estimation of densities with infinite support. See text for distribution types. Line types are as in Figure 1, and sample size is n = 100.

Next we discuss the other three panels of Figure 2. In the top panel, tilting methods (II) and (III) do particularly well and about equally, while in the third panel from the top the tilted estimator (II) arguably performs best. This estimator is very close to its untilted form (iii), except in the right tail, where the untilted form is unattractive because it takes negative values. In the bottom panel, our tilted estimators succeed in estimating the right peak but do not perform so well for the left peak. All in all, tilted estimator (III) performs best out of the seven methods.

Next we assess the performance of all seven estimators in terms of mean integrated squared error, $\mathrm{MISE} = \int E\{\hat f(x) - f(x)\}^2 \, dx$. We treated the following eight densities, for which names and formulae are as in Table 1 of Marron and Wand (1992): Gaussian (Gau), skewed unimodal (SkU), strongly skewed (StS), kurtotic unimodal (KtU), outlier (Out), bimodal (Bim), separated bimodal (SeB) and skewed bimodal (SkB). Each of these densities has either one or two modes. Our Table 1 gives values of MISE, computed from 500 simulations in each instance for these cases, when sample size is $n = 100$. The lowest value of MISE in each row is printed in bold. In the cases of tilted estimators (I) and (II) we give results both for $m = n$, meaning that the probabilities $p_i$ are permitted to be all different from one another, and for $m = 3$, where the $p_i$s take only three distinct values, for neighbouring values of $i$. For the tilted estimator (III) we consider only $m = 2$ and $m = 3$. Note that, since we use the same datasets, some columns coincide exactly with the corresponding columns in Table 1 in Doosti and Hall (2016).
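A Monte Carlo approximation of MISE of the kind reported in Table 1 can be sketched as follows. For brevity the sketch covers only the conventional Gaussian-kernel estimator with a fixed bandwidth and standard normal data; the bandwidth, grid, number of replications and function names are assumptions made here and do not reproduce the settings of the study.

```python
import math
import random

def gauss(u, s=1.0):
    """Density of N(0, s^2) at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def kde(x, data, h):
    """Conventional kernel estimate with a Gaussian kernel and uniform weights."""
    return sum(gauss(x - xi, h) for xi in data) / len(data)

def mc_mise(n, h, reps, rng, lo=-5.0, hi=5.0, steps=200):
    """Average over replications of ISE, integrated by the midpoint rule."""
    dx = (hi - lo) / steps
    grid = [lo + (k + 0.5) * dx for k in range(steps)]
    truth = [gauss(g) for g in grid]                  # true N(0, 1) density
    total = 0.0
    for _ in range(reps):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ise = sum((kde(g, data, h) - t) ** 2 for g, t in zip(grid, truth)) * dx
        total += ise
    return total / reps

mise_hat = mc_mise(n=50, h=0.4, reps=20, rng=random.Random(1))
```

Replacing `kde` by a tilted estimator whose $(h, p)$ are chosen afresh in each replication would give the corresponding entry for a tilted method.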

Perhaps the first thing to note from Table 1 is that, in mean squared error terms, our tilted estimators perform similarly to their conventional counterparts, for example kernel methods, when the true density is relatively simple, but they perform much better than conventional estimators, and similarly to the sinc estimator, in the case of complex densities, for example KtU and SeB.

Note too that, when we allow all the $p_i$s to be distinct (i.e. when $m = n$), the tilted versions of the sinc and trapezoidal kernel estimators always have lower MISE than the native forms of those estimators. Moreover, excepting the case of the distribution SkU and the tilted estimator (III), even taking $m = 3$ produces a reduction in MISE.

Overall the tilted estimator based on minimisation of the cross-validation function (2.1) gives the best performance, having the least MISE in five out of the eight cases. In two cases the tilted trapezoidal kernel estimator (II), and in one case the tilted estimator (I), offer a slight improvement over method (III).

4.3. Real Data. Finally we illustrate, in Figure 3, the performance of competing estimators in the case of the Old Faithful geyser dataset (see e.g. Doosti and Hall (2016)), where the $X_i$s represent the duration, in minutes, of 107 eruptions of the Old Faithful geyser in Yellowstone National Park.

The upper panel of Figure 3 shows estimators (i)–(iv) introduced in the current section, using the same respective line types as in Figures 1 and 2. The lower panel of Figure 3 shows the tilted estimators of types (I)–(III), and a histogram of the data. As can be seen, all three tilted estimators are bimodal, although, reflecting our experience in Figure 2, type (II) captures bimodality more sharply. The tilted estimators are of course always positive. The conventional kernel estimator, in the upper panel, shows the least amount of “enthusiasm” for bimodality. As indicated in Figure 2 and Table 1, this estimator performs poorly when used to estimate separated bimodal densities.

Figure 11 in Sain and Scott (1996) and Figure 7 in Reynaud-Bouret et al. (2011) suggest that the true density should be bimodal too, but the left peak obtained in Sain and Scott (1996) is sharper than those found in our study and that obtained by Reynaud-Bouret et al. (2011). However, in the problem of estimating the right peak all the estimators perform similarly. The estimator proposed in Reynaud-Bouret et al. (2011) is zero in an interval located between the two peaks, and in this case it follows from their algorithm that their estimator is not as smooth as ours.

Figure 3 Density estimators computed from the Old Faithful geyser dataset. Upper panel shows estimators (i)–(iv), and lower panel shows estimators (I)–(III) and a histogram. Line types are as in Figure 1.

5. Discussion. We have proposed a new cross-validation function that can be used to choose the bandwidth and the tilting parameters. We have shown that the proposed density estimator performs better than conventional kernel-based estimators. It outperforms the estimators of Doosti and Hall (2016) in both accuracy and computation time. Globally, the rate of convergence is faster than the usual one. Our new estimator can readily be employed in a wide range of applied fields. In particular, the authors are now exploring tilted estimators for presence-absence of spatially correlated data. In addition, one can also consider a generalised version of the cross-validation function that imposes constraints on the density function, e.g. bimodality, making it more realistic in real scenarios.

References

[1] Carroll, R. J., Delaigle, A. and Hall, P. (2011) Testing and estimating shape-constrained nonparametric density and regression in the presence of measurement error. J. Am. Statist. Ass., 106, 191-202.

[2] Chen, S. X. (1997) Empirical likelihood-based kernel density estimation. Aust. J. Statist., 39, 47-56.

[3] Doosti, H. and Hall, P. (2016) Making a non-parametric density estimator more attractive, and more accurate, by data perturbation. J. R. Statist. Soc. B, 78(2), 445-462.

[4] Grenander, U. (1956) On the theory of mortality measurement II. Skand. Akt., 39, 125-153.

[5] Hall, P. and Huang, L. S. (2001) Nonparametric kernel regression subject to monotonicity constraints. Ann. Statist., 29, 624-647.

[6] Hall, P. and Huang, L. S. (2002) Unimodal density estimation using kernel methods. Statist. Sin., 12, 965-990.

[7] Hall, P. and Presnell, B. (1999) Intentionally biased bootstrap methods. J. R. Statist. Soc. B, 61, 143-158.

[8] Marron, J. S. and Wand, M. P. (1992) Exact mean integrated squared error. Ann. Statist., 20, 712-736.

[9] Müller, U. U., Schick, A. and Wefelmeyer, W. G. (2005) Weighted residual-based density estimators for nonlinear autoregressive models. Statist. Sin., 15, 177-195.

[10] Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75, 237-249.

[11] Reynaud-Bouret, P., Rivoirard, V. and Tuleau-Malot, C. (2011) Adaptive density estimation: a curse of support? J. Statist. Planng Inf., 141, 115-139.

[12] Sain, S. R. and Scott, D. W. (1996) On locally adaptive density estimation. J. Am. Statist. Ass., 91, 1525-1534.

[13] Schick, A. and Wefelmeyer, W. G. (2009) Improved density estimators for invertible linear processes. Communs Statist. Theor. Meth., 38, 3123-3147.

[14] Weisberg, S. (1985) Applied Linear Regression. New York: Wiley.

[15] Zhang, B. (1998) A note on kernel density estimation with auxiliary information. Communs Statist. Theor. Meth., 27, 1-11.

APPENDIX : Proof of Theorem 1

Step 1: Initial approximation to $n^{-1} \sum_i \hat f_{-i}(X_i \mid h, p)$. The approximation is given at (B.3). Throughout our proof, in this step and subsequent ones, we assume that $c_1$ in (3.1) satisfies $0 < c_1 < \tfrac{1}{5}$. This is appropriate since the theorem is asserted only for sufficiently small $c_1$.

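To derive the approximation, let $p_{(i)}$ denote the concomitant of $X_{(i)}$ in the pair $(p_i, X_i)$, and observe that

$$
\begin{aligned}
h \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p)
&= \sum_{i=1}^{n} \sum_{j \,:\, j \ne i} p_j \, K\!\left(\frac{X_i - X_j}{h}\right) \\
&= \sum_{i=1}^{n} \sum_{j \,:\, j \ne i} p_{(j)} \, K\!\left(\frac{X_{(i)} - X_{(j)}}{h}\right)
= T_1 + \ldots + T_4 ,
\end{aligned}
\tag{B.1}
$$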

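where, defining $I(\mathcal{E})$ to be the indicator function of an event $\mathcal{E}$,

$$
\begin{aligned}
T_1 &= \sum_{k_1=1}^{m} \sum_{k_2 \,:\, k_2 \ne k_1} p_{(i_{k_2})} \, K\!\left(\frac{X_{(i_{k_1})} - X_{(i_{k_2})}}{h}\right), \\
T_2 &= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i_{k_1} \ne j)\, p_{(j)} \, K\!\left(\frac{X_{(i_{k_1})} - X_{(j)}}{h}\right), \\
T_3 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} p_{(i_{k_2})}\, I(i_{k_2} \ne i)\, K\!\left(\frac{X_{(i)} - X_{(i_{k_2})}}{h}\right), \\
T_4 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \ne j)\, p_{(j)} \, K\!\left(\frac{X_{(i)} - X_{(j)}}{h}\right).
\end{aligned}
\tag{B.2}
$$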

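Since $p_i \le C_3\, n^{-1}$, uniformly in $i$ (see CB(c)), then

$$
\begin{aligned}
T_1 &\le \frac{C_3}{n} \sum_{k_1=1}^{m} \sum_{k_2 \,:\, k_2 \ne k_1} K\!\left(\frac{X_{(i_{k_1})} - X_{(i_{k_2})}}{h}\right) \le C_3\, n^{-1} m^2 \sup K , \\
T_2 + T_3 &\le \frac{2 C_3}{n} \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} K\!\left(\frac{X_{(i_{k_1})} - X_{(j)}}{h}\right)
\le 2 C_3\, m h \sup_{-\infty < x < \infty} f(x) = O_p(mh) ,
\end{aligned}
$$

uniformly in $h \in \mathcal{H}$. (Here we have used the assumption that $0 < c_1 < \tfrac{1}{5}$.) Combining these bounds with (B.1) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$,

$$\frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) = \frac{T_4}{nh} + O_p\!\left(\frac{m^2}{n^2 h} + \frac{m}{n}\right). \tag{B.3}$$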
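The two remainder terms in (B.3) come from dividing the bounds on $T_1$ and $T_2 + T_3$ by $nh$, as (B.1) requires. Spelled out, as a sketch of that arithmetic only:

```latex
\begin{align*}
\frac{T_1}{nh} &\le \frac{C_3\, n^{-1} m^{2} \sup K}{nh} = O\!\left(\frac{m^{2}}{n^{2}h}\right), &
\frac{T_2 + T_3}{nh} &= \frac{O_p(mh)}{nh} = O_p\!\left(\frac{m}{n}\right).
\end{align*}
```

Only $T_4$, which involves pairs of observations that are both strictly inside the blocks, contributes to the leading term.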

Step 2: Hoeffding decomposition of $T_4$. The quantity $T_4$ is not unlike a $U$-statistic, and in this section we represent $T_4$ as a Hoeffding-like linear projection of $T_4$, which we denote by $T_8$, plus a remainder term, $T_9$. See (B.7) and (B.8). All expectations involved are conditional on $\mathcal{F}$, which we define to be the sigma-field generated by $X_{(i_j)}$ for $1 \le j \le m$.

Recall from AA that $X_{(1)} \le \ldots \le X_{(n)}$ are the order statistics of the data $X_1, \ldots, X_n$, and that $p_i = q_j$ (the latter depending only on $j$) for $i_{j-1} < i \le i_j$, where $1 \le j \le m$ and $i_0 = 0$. Conditional on $\mathcal{F}$, the random variables $X_{(i)}$, for $i_{j-1} < i \le i_j$, are independent and identically distributed with density $f / \{F(X_{(i_j)}) - F(X_{(i_{j-1})})\}$, supported on the interval $[X_{(i_{j-1})}, X_{(i_j)}]$. Moreover, the $p_i$s are $\mathcal{F}$-measurable functions of $q_1, \ldots, q_m$.

arise if, in the quadruple series defining T4 at (B.2), we take the expected value

of the summands conditional on both X(j) and F , and the expected value con-

ditional on both X(i) and F , respectively. Let T7, an F-measurable random

variable, be the quantity obtained if the conditional expectation is on F alone:

T5 =m∑

k1=1

ik1−1∑

i=ik1−1+1

m∑

k2=1

ik2−1∑

j=ik2−1+1

I(i 6= j) p(j)

× E{K

(X(i) −X(j)

h

) ∣∣∣∣ X(j),F}, (B.4)

T6 =m∑

k1=1

ik1−1∑

i=ik1−1+1

m∑

k2=1

ik2−1∑

j=ik2−1+1

I(i 6= j) p(j)

× E{K

(X(i) −X(j)

h

) ∣∣∣∣ X(i),F}, (B.5)

T7 =

m∑

k1=1

ik1−1∑

i=ik1−1+1

m∑

k2=1

ik2−1∑

j=ik2−1+1

I(i 6= j) p(j)E

{K

(X(i) −X(j)

h

) ∣∣∣∣ F}. (B.6)

In this notation we write

$$T_4 = T_8 + T_9 , \tag{B.7}$$

where

$$T_8 = T_4 - (T_5 + T_6) + T_7 , \qquad T_9 = T_5 + T_6 - T_7 . \tag{B.8}$$

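Step 3: Approximation to $T_5$. The approximation is given at (B.23). Observe that, starting from (B.4),

$$
\begin{aligned}
T_5 &= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \; \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} p_{(j)}\, E\left\{ K\!\left(\frac{X_{(i)} - X_{(j)}}{h}\right) \,\middle|\, X_{(j)}, \mathcal{F} \right\} \\
&\quad + \sum_{k=1}^{m} \; \mathop{\sum\sum}_{i_{k-1}+1 \le i \ne j \le i_k - 1} p_{(j)}\, E\left\{ K\!\left(\frac{X_{(i)} - X_{(j)}}{h}\right) \,\middle|\, X_{(j)}, \mathcal{F} \right\} \\
&= \mathop{\sum\sum}_{1 \le k_1 \ne k_2 \le m} \; \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} \frac{(i_{k_1} - i_{k_1-1} - 1)\, q_{k_2}}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\!\left(\frac{X_{(j)} - x}{h}\right) f(x)\, dx \\
&\quad + \sum_{k=1}^{m} \; \sum_{j=i_{k-1}+1}^{i_k-1} \frac{(i_k - i_{k-1} - 2)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\!\left(\frac{X_{(j)} - x}{h}\right) f(x)\, dx \\
&= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \; \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} \frac{(i_{k_1} - i_{k_1-1} - 1)\, q_{k_2}}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\!\left(\frac{X_{(j)} - x}{h}\right) f(x)\, dx \\
&\quad - \sum_{k=1}^{m} \; \sum_{j=i_{k-1}+1}^{i_k-1} \frac{q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\!\left(\frac{X_{(j)} - x}{h}\right) f(x)\, dx \\
&= T_{51} - T_{52} ,
\end{aligned}
\tag{B.9}
$$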

say. Since $q_k = q_k(n)$ is bounded above by $C_3\, n^{-1}$, uniformly in $k$ and $n$ (see CB(c)), and since $\int K = 1$, then

$$0 \le T_{52} \le C_3\, n^{-1} h \sum_{k=1}^{m} \sum_{j=i_{k-1}+1}^{i_k-1} \frac{\sup f}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} . \tag{B.10}$$

Writing $\hat F$ for the empirical distribution function for the sample $X_1, \ldots, X_n$, we have $\sup |\hat F - F| = O_p(n^{-1/2})$. Therefore, since $i_k = n \hat F(X_{(i_k)})$ for each $k$, then $F(X_{(i_k)}) - F(X_{(i_{k-1})}) = n^{-1} (i_k - i_{k-1}) + O_p(n^{-1/2})$, uniformly in $1 \le k \le m$ for choices of $m \le n$. Moreover, by AA, $C_1 n/m \le i_k - i_{k-1} \le C_2 n/m$. Hence, since $m = o(n^{1/2})$ (see CB(d)),

\[
\max_{1 \le k \le m} \{F(X_{(i_k)}) - F(X_{(i_{k-1})})\}^{-1} = O_p(m). \tag{B.11}
\]

Combining (B.10) and (B.11) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$,

\[
T_{52} = O_p(n^{-1} h\, m\, n) = O_p(mh). \tag{B.12}
\]
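The order of magnitude in (B.11) can be illustrated by a small simulation. The following is a minimal sketch, not part of the original argument: it assumes Uniform(0,1) data (so that $F(x) = x$), an equally spaced choice $i_k = kn/m$ of the indices, and a hypothetical helper name `max_inverse_spacing`.

```python
import random

def max_inverse_spacing(n, m, seed=0):
    """Return max_k 1 / {F(X_(i_k)) - F(X_(i_{k-1}))} for Uniform(0,1) data.

    Assumptions for illustration only: F(x) = x, and i_k = k * n / m.
    """
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    # i_0 = 0 is represented by the left end point 0 of the support.
    knots = [0.0] + [xs[k * n // m - 1] for k in range(1, m + 1)]
    spacings = [b - a for a, b in zip(knots, knots[1:])]
    return max(1.0 / s for s in spacings)

n, m = 10_000, 20          # m is small relative to n ** 0.5 = 100
ratio = max_inverse_spacing(n, m) / m
print(f"max inverse spacing divided by m: {ratio:.3f}")
```

Under these assumptions the printed ratio stays close to 1, which is what a bound of order $O_p(m)$ for the maximal inverse spacing suggests.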

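Next, replacing $(k_1, k_2)$ by $(k, k_1)$, we approximate $T_{51}$, indicated in (B.9):

\[
\begin{aligned}
T_{51} &= \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{k_1=1}^{m} q_{k_1} \sum_{j=i_{k_1-1}+1}^{i_{k_1}-1} K\Big(\frac{X_{(j)} - x}{h}\Big) \Big\} f(x)\, dx \\
&= \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{i=1}^{n} p_i\, K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx \\
&\quad - \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{k_1=1}^{m} q_{k_1}\, K\Big(\frac{X_{(i_{k_1})} - x}{h}\Big) \Big\} f(x)\, dx \\
&= T_{511} - T_{512},
\end{aligned}
\tag{B.13}
\]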

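say. Now,

\[
T_{511} = h \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \hat f(x \mid h, p)\, f(x)\, dx, \tag{B.14}
\]

and, noting (B.11) and taking $C_2$ to be as in (2.24),

\[
\begin{aligned}
T_{512} &\le C_2\, n^{-1} \sum_{k=1}^{m} \frac{C_2\, n/m}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int \Big\{ \sum_{k_1=1}^{m} K\Big(\frac{X_{(i_{k_1})} - x}{h}\Big) \Big\} f(x)\, dx \\
&= O_p\{ n^{-1} m^2\, (n/m)\, m h \} = O_p(m^2 h).
\end{aligned}
\tag{B.15}
\]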

To simplify the formula for $T_{511}$ at (B.14), note first that, since $\hat F(X_{(i_k)}) = i_k/n$ for each $k$,

\[
\begin{aligned}
\frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}
&= \frac{i_k - i_{k-1} - 1}{\hat F(X_{(i_k)}) - \hat F(X_{(i_{k-1})}) + O_p(n^{-1/2})} \\
&= n\, \frac{i_k - i_{k-1} - 1}{i_k - i_{k-1} + O_p(n^{1/2})} = n + O_p(m n^{1/2}),
\end{aligned}
\tag{B.16}
\]

uniformly in $1 \le k \le m$. The $O_p(m n^{1/2})$ remainder in (B.16) does not depend on $h$ or $p$, and will be denoted below by $Q_{1nk}$. Define

\[
D_s(h, p) = \int \big| \hat f(x \mid h, p) - f(x) \big|^s\, dx. \tag{B.17}
\]

Moment methods can be used to prove that, for all $r, s \ge 1$,

\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} E\{ D_s(h, p)^r \} = O\big[ \{ (nh)^{-1} + h^4 \}^{rs/2} \big].
\]

This property, Markov's inequality, and a lattice argument which we shall introduce in step 5 and develop further in step 8, can be used to show that, for each $\varepsilon > 0$,

\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} D_s(h, p) = O_p\big[ \{ (nh)^{-1} + h^4 \}^{s/2}\, n^{2\varepsilon} \big]. \tag{B.18}
\]

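Hence,

\[
\max_k \bigg| \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \big\{ \hat f(x \mid h, p) - f(x) \big\} f(x)\, dx \bigg|^2 \le D_2(h, p) \int f^2 = O_p\big[ \{ (nh)^{-1} + h^4 \}\, n^{2\varepsilon} \big]
\]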

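uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. Therefore, defining

\[
Q_{1n} = \frac{1}{n} \sum_{k=1}^{m} Q_{1nk} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)^2\, dx,
\]

which does not depend on $h$ or $p$, we have:

\[
\begin{aligned}
\sum_{k=1}^{m} Q_{1nk} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \hat f(x \mid h, p)\, f(x)\, dx
&= n Q_{1n} + O_p\Big[ \sum_{k=1}^{m} |Q_{1nk}|\, \{ (nh)^{-1/2} + h^2 \}\, n^{\varepsilon} \Big] \\
&= n Q_{1n} + O_p\big[ m^2 n^{1/2}\, \{ (nh)^{-1/2} + h^2 \}\, n^{\varepsilon} \big],
\end{aligned}
\tag{B.19}
\]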

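uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. Combining (B.14), (B.16) and (B.19) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and for each $\varepsilon > 0$,

\[
\begin{aligned}
(nh)^{-1}\, T_{511} &= \int_{-\infty}^{X_{(n)}} \hat f(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\big[ m^2 n^{-1/2}\, \{ (nh)^{-1/2} + h^2 \}\, n^{\varepsilon} \big] \\
&= \int \hat f(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\big[ m^2 n^{-1/2}\, \{ (nh)^{-1/2} + h^2 \}\, n^{\varepsilon} \big].
\end{aligned}
\tag{B.20}
\]

The last identity above follows from the fact that, for each $\varepsilon > 0$,

\[
\int_{X_{(n)}}^{\infty} \hat f(x \mid h, p)\, f(x)\, dx = O_p(n^{\varepsilon - 1}). \tag{B.21}
\]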

To derive this result, let $g_\delta(x) = \{1 - F(x)\}^{(1/2) - \delta}$ and note that, since $|\hat F(x) - F(x)| / g_\delta(x) = O_p(n^{-1/2})$ uniformly in $x$, for each $\delta \in (0, \tfrac12)$, then, taking $x_n = X_{(n)}$, we have

\[
1 - F(x_n) = 1 - \hat F(x_n) + O_p\{ n^{-1/2}\, g_\delta(x_n) \}.
\]

Hence, since $1 - \hat F(x_n) = n^{-1}$,

\[
1 - F(x_n) = 1 - \hat F(x_n) + O_p\big[ n^{-1/2}\, \{1 - F(x_n)\}^{(1/2) - \delta} \big] = n^{-1} + O_p(n^{\delta - 1}) = O_p(n^{\delta - 1}).
\]

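Therefore, defining $D_s(h, p)$ as at (B.17), we deduce that if $s, t > 0$ satisfy $s^{-1} + t^{-1} = 1$, then for all $\delta, \varepsilon > 0$,

\[
\begin{aligned}
\int_{X_{(n)}}^{\infty} \hat f(x \mid h, p)\, f(x)\, dx
&\le \int_{x_n}^{\infty} f(x)^2\, dx + \int_{x_n}^{\infty} \big| \hat f(x \mid h, p) - f(x) \big|\, f(x)\, dx \\
&\le (\sup f)\, \{1 - F(x_n)\} + D_s(h, p)^{1/s} \Big\{ \int_{x_n}^{\infty} f(x)^t\, dx \Big\}^{1/t} \\
&= O_p\Big[ n^{\delta - 1} + \{ (nh)^{-1} + h^4 \}^{1/2}\, n^{2\varepsilon/s}\, (n^{\delta - 1})^{1/t} \Big],
\end{aligned}
\tag{B.22}
\]

where we have used (B.18) to derive the last identity. Choosing $s$ arbitrarily large (i.e. $t > 1$ arbitrarily close to 1) in (B.22), and $\varepsilon, \delta$ arbitrarily small, we deduce that (B.21) holds for all $\varepsilon > 0$.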
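The key ingredient of (B.21) is that the probability mass of $f$ to the right of the sample maximum is of order $n^{-1}$ up to an arbitrarily small power of $n$. A minimal simulation sketch, assuming Uniform(0,1) data so that $1 - F(X_{(n)}) = 1 - X_{(n)}$ (the helper name `scaled_tail_mass` is hypothetical):

```python
import random

def scaled_tail_mass(n, rng):
    """Return n * {1 - F(X_(n))} for one Uniform(0,1) sample, where F(x) = x."""
    return n * (1.0 - max(rng.random() for _ in range(n)))

rng = random.Random(1)
n, reps = 500, 1000
values = [scaled_tail_mass(n, rng) for _ in range(reps)]
mean_value = sum(values) / reps
print(f"average of n * (1 - F(X_(n))) over {reps} samples: {mean_value:.3f}")
```

In this uniform setting the scaled tail mass has expectation $n/(n+1)$, so the average stays bounded as $n$ grows, in line with the $O_p(n^{\delta - 1})$ bound used above.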

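Combining (B.9), (B.12), (B.13) and (B.15) we deduce that, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$,

\[
T_5 = T_{51} - T_{52} = T_{511} - T_{512} - T_{52} = T_{511} + O_p(m^2 h).
\]

Hence, by (B.20), uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and for each $\varepsilon > 0$,

\[
(nh)^{-1}\, T_5 = \int \hat f(x \mid h, p)\, f(x)\, dx + Q_{1n} + O_p\big[ m^2 n^{-1/2}\, \{ (nh)^{-1/2} + h^2 \}\, n^{\varepsilon} \big]. \tag{B.23}
\]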

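Step 4: Approximation to $T_6$. The approximation is given at (B.33). The first step in deriving it, starting from (B.5), is to approximate $T_6$:

\[
\begin{aligned}
T_6 &= \sum\sum_{1 \le k_1 \neq k_2 \le m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \; \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(i)}, \mathcal{F} \Big\} \\
&\quad + \sum_{k=1}^{m} \; \sum\sum_{i_{k-1}+1 \le i \neq j \le i_k - 1} p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, X_{(i)}, \mathcal{F} \Big\} \\
&= \sum\sum_{1 \le k_1 \neq k_2 \le m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \frac{(i_{k_2} - i_{k_2-1} - 1)\, q_{k_2}}{F(X_{(i_{k_2})}) - F(X_{(i_{k_2-1})})} \int_{X_{(i_{k_2-1})}}^{X_{(i_{k_2})}} K\Big(\frac{X_{(i)} - x}{h}\Big) f(x)\, dx \\
&\quad + \sum_{k=1}^{m} \; \sum_{i=i_{k-1}+1}^{i_k-1} \frac{(i_k - i_{k-1} - 2)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_{(i)} - x}{h}\Big) f(x)\, dx \\
&= \sum_{k_1=1}^{m} \sum_{k_2=1}^{m} \frac{(i_{k_2} - i_{k_2-1} - 1)\, q_{k_2}}{F(X_{(i_{k_2})}) - F(X_{(i_{k_2-1})})} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \int_{X_{(i_{k_2-1})}}^{X_{(i_{k_2})}} K\Big(\frac{X_{(i)} - x}{h}\Big) f(x)\, dx \\
&\quad - \sum_{k=1}^{m} \; \sum_{i=i_{k-1}+1}^{i_k-1} \frac{q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_{(i)} - x}{h}\Big) f(x)\, dx \\
&= T_{61} - T_{62},
\end{aligned}
\tag{B.24}
\]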

say. Since $T_{52} = T_{62}$ then, by (B.12),

\[
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} T_{62} = O_p(mh). \tag{B.25}
\]

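Next we develop an approximation to $T_{61}$, which is given equivalently by

\[
T_{61} = \sum_{k=1}^{m} \sum_{k_1=1}^{m} \frac{(i_k - i_{k-1} - 1)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} K\Big(\frac{X_{(i)} - x}{h}\Big) \Big\} f(x)\, dx.
\]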

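Therefore,

\[
\begin{aligned}
T_{10} &\equiv T_{61} - E(T_{61} \mid \mathcal{F}) \\
&= \sum_{k=1}^{m} \frac{(i_k - i_{k-1} - 1)\, q_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \sum_{k_1=1}^{m} \bigg\{ \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_{(i)} - x}{h}\Big) f(x)\, dx \\
&\qquad - \frac{i_{k_1} - i_{k_1-1} - 1}{F(X_{(i_{k_1})}) - F(X_{(i_{k_1-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)\, dx \int_{X_{(i_{k_1-1})}}^{X_{(i_{k_1})}} K\Big(\frac{y - x}{h}\Big) f(y)\, dy \bigg\} \\
&= h \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \big[ \hat f_{k_1}(x) - E\{ \hat f_{k_1}(x) \mid \mathcal{F} \} \big] f_k(x)\, dx,
\end{aligned}
\tag{B.26}
\]

where $f_k(x) = 0$ if $x \notin I_k \equiv [X_{(i_{k-1})}, X_{(i_k)}]$,

\[
f_k(x) = \frac{f(x)}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \ \text{ if } x \in I_k, \qquad
\hat f_k(x) = \frac{1}{(i_k - i_{k-1} - 1)\, h} \sum_{i=i_{k-1}+1}^{i_k-1} K\Big(\frac{X_{(i)} - x}{h}\Big).
\]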

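In what follows it is convenient to indicate, in notation for $T_{10}$, $T_{101}$ and $T_{102}$, the extent to which these quantities depend on $h$ or $p$. Therefore we write them as $T_{10}(h, p)$, $T_{101}(h)$ and $T_{102}(h, p)$, respectively. Defining $r_k$ by $q_k = n^{-1} (1 + h^2 r_k)$, we decompose $T_{10}$ into two parts:

\[
T_{10}(h, p) = T_{101}(h) + T_{102}(h, p), \tag{B.27}
\]

where

\[
\begin{aligned}
T_{101}(h) &= \frac{h}{n} \sum_{k=1}^{m} (i_k - i_{k-1} - 1) \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \big[ \hat f_{k_1}(x) - E\{ \hat f_{k_1}(x) \mid \mathcal{F} \} \big] f_k(x)\, dx, \\
T_{102}(h, p) &= \frac{h^3}{n} \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, r_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \big[ \hat f_{k_1}(x) - E\{ \hat f_{k_1}(x) \mid \mathcal{F} \} \big] f_k(x)\, dx.
\end{aligned}
\]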

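Given a random variable $V$, write $E_{\mathcal{F}}(V)$ for $E(V \mid \mathcal{F})$, let $S = \{1, \ldots, n\} \setminus \{i_1, \ldots, i_m\}$, and note that

\[
h \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1)\, \hat f_{k_1}(x) = \sum_{k=1}^{m} \; \sum_{i=i_{k-1}+1}^{i_k-1} K\Big(\frac{X_{(i)} - x}{h}\Big) = \sum_{i \in S} K\Big(\frac{X_i - x}{h}\Big). \tag{B.28}
\]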

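Therefore,

\[
h \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \big[ \hat f_{k_1}(x) - E\{ \hat f_{k_1}(x) \mid \mathcal{F} \} \big] = \sum_{i \in S} (1 - E_{\mathcal{F}})\, K\Big(\frac{X_i - x}{h}\Big),
\]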

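and so

\[
\begin{aligned}
T_{101}(h) &= \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} (i_k - i_{k-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f_k(x)\, dx \\
&= \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx \\
&= \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx + O_p\big( m^2 n^{\varepsilon - (1/2)} h \big),
\end{aligned}
\tag{B.29}
\]

\[
\begin{aligned}
T_{102}(h, p) &= \frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f_k(x)\, dx \\
&= \frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx \\
&= O_p\big( m\, n^{(1/2) + \varepsilon}\, h^3 \sup_k |r_k| \big),
\end{aligned}
\tag{B.30}
\]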

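uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. (The arguments leading to the remainder terms in each of (B.29)–(B.31), uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, will be given in detail in step 5.) Similarly it can be proved that, for each $\varepsilon > 0$,

\[
\frac{1}{nh} \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx = Q_{2n} + O_p\big( n^{\varepsilon - (1/2)} h^2 \big), \tag{B.31}
\]

uniformly in $h \in \mathcal{H}$, where $Q_{2n}$ does not depend on $h$. Hence, combining (B.27) and (B.29)–(B.31), we deduce that

\[
(nh)^{-1} \{ T_{61} - E(T_{61} \mid \mathcal{F}) \} = (nh)^{-1}\, T_{10}(h, p) = Q_{2n} + O_p\big( m^2 n^{\varepsilon - (3/2)} + m\, n^{\varepsilon - (1/2)} h^2 \sup_k |r_k| \big). \tag{B.32}
\]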

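In view of (B.24) and (B.25), $T_{61} = T_6 + O_p(mh)$, uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, and so by (B.32),

\[
(nh)^{-1} \{ T_6 - E(T_{61} \mid \mathcal{F}) \} = Q_{2n} + O_p\big( m^2 n^{\varepsilon - (3/2)} + m\, n^{\varepsilon - (1/2)} h^2 \sup_k |r_k| + m n^{-1} \big), \tag{B.33}
\]

uniformly in the same sense.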

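Step 5: Proofs of (B.29)–(B.31). First we develop a bound for

\[
s(h) = \sum_{k=1}^{m} |s_k(h)|,
\]

where

\[
s_k(h) = \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx.
\]

Now,

\[
\int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ \sum_{i \in S} K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx = h \sum_{i \in S} \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du,
\]

and so by Rosenthal's inequality, for each $r \ge 1$,

\[
\begin{aligned}
h^{-2r}\, E\{ s_k(h)^{2r} \mid \mathcal{F} \}
&\le C(r) \bigg( \bigg[ \sum_{i \in S} \operatorname{var}\bigg\{ \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du \,\bigg|\, \mathcal{F} \bigg\} \bigg]^r \\
&\qquad + \sum_{i \in S} E\bigg\{ \bigg| (1 - E_{\mathcal{F}}) \int_{(X_i - X_{(i_k)})/h}^{(X_i - X_{(i_{k-1})})/h} K(u)\, f(X_i - hu)\, du \bigg|^{2r} \,\bigg|\, \mathcal{F} \bigg\} \bigg) \\
&\le 2^{2r+1}\, C(r)\, (1 + \sup f)^{2r}\, n^r,
\end{aligned}
\]

where the constant $C(r)$ depends only on $r$. Therefore,

\[
E\{ s(h)^{2r} \} \le m^{2r} \sum_{k=1}^{m} E\big[ E\{ s_k(h)^{2r} \mid \mathcal{F} \} \big] \le 2^{2r+1}\, C(r)\, (1 + \sup f)^{2r}\, (m^2 n h^2)^r.
\]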
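The passage from the moment bound for each $s_k(h)$ to the moment bound for $s(h) = \sum_k |s_k(h)|$ rests on an elementary convexity step. A minimal sketch, assuming the standard inequality $(\sum_{k \le m} |a_k|)^{2r} \le m^{2r-1} \sum_{k \le m} |a_k|^{2r}$, which in particular implies the same bound with the larger factor $m^{2r}$; the helper name `power_mean_gap` and the test values are hypothetical.

```python
import random

def power_mean_gap(values, r):
    """Return (lhs, rhs) for (sum |a_k|)^(2r) <= m^(2r-1) * sum |a_k|^(2r)."""
    m = len(values)
    lhs = sum(abs(a) for a in values) ** (2 * r)
    rhs = m ** (2 * r - 1) * sum(abs(a) ** (2 * r) for a in values)
    return lhs, rhs

rng = random.Random(2)
checks = []
for r in (1, 2, 3):
    # Arbitrary real numbers stand in for the quantities s_k(h).
    values = [rng.gauss(0.0, 1.0) for _ in range(8)]
    lhs, rhs = power_mean_gap(values, r)
    checks.append(lhs <= rhs)
print(checks)
```

Equality holds when all the $|a_k|$ coincide, so the factor $m^{2r-1}$ cannot be improved in general.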

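Hence, by Markov's inequality, for each $\varepsilon > 0$ and each $B_1 > 0$,

\[
\sup_{h \in \mathcal{H}} P\big\{ s(h) > m\, n^{(1/2) + \varepsilon} h \big\} = O(n^{-B_1}),
\]

and so, if $\mathcal{H}_n \subseteq \mathcal{H}$ contains no more than $O(n^{B_2})$ elements for some $B_2 > 0$,

\[
P\Big\{ \sup_{h \in \mathcal{H}_n} s(h) > m\, n^{(1/2) + \varepsilon} h \Big\} = O(n^{-B_3}),
\]

for all $B_3 > 0$. Taking $\mathcal{H}_n$ to be a regular lattice in $\mathcal{H}$, and noting the Hölder continuity assumed of $K$, we can extend the bound above from $\mathcal{H}_n$ to all of $\mathcal{H}$:

\[
P\Big\{ \sup_{h \in \mathcal{H}} s(h) > m\, n^{(1/2) + \varepsilon} h \Big\} = O(n^{-B_3}), \tag{B.34}
\]

for all $B_3 > 0$.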

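To derive (B.29), note that, uniformly in $1 \le k \le m$,

\[
\frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} = \frac{(i_k - i_{k-1})\, \{1 + O(m/n)\}}{n^{-1} (i_k - i_{k-1})\, \{1 + O_p(m/n^{1/2})\}} = n + O_p(m n^{1/2}).
\]

Therefore the third identity in (B.29) can be written as

\[
\begin{aligned}
&\frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx \\
&\qquad = \frac{1}{n} \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx \\
&\qquad = \frac{1}{n} \sum_{k=1}^{m} \big\{ n + O_p(m n^{1/2}) \big\}\, s_k(h) \\
&\qquad = \sum_{i \in S} (1 - E_{\mathcal{F}}) \int K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx + O_p\big( m^2 n^{\varepsilon - (1/2)} h \big),
\end{aligned}
\]

uniformly in $h \in \mathcal{H}$ for each $\varepsilon > 0$, where we derived the last identity using (B.34). This establishes (B.29).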

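To derive (B.30) we replace the series of steps at (B.34) by

\[
\begin{aligned}
&\frac{h^2}{n} \sum_{i \in S} (1 - E_{\mathcal{F}}) \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{F(X_{(i_k)}) - F(X_{(i_{k-1})})}\, r_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx \\
&\qquad = \frac{h^2}{n} \sum_{k=1}^{m} \frac{(i_k - i_{k-1} - 1)\, r_k}{F(X_{(i_k)}) - F(X_{(i_{k-1})})} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} \Big\{ (1 - E_{\mathcal{F}}) \sum_{i \in S} K\Big(\frac{X_i - x}{h}\Big) \Big\} f(x)\, dx \\
&\qquad = \frac{h^2}{n} \sum_{k=1}^{m} \big\{ n + O_p(m n^{1/2}) \big\}\, r_k\, s_k(h) \\
&\qquad = O_p\big\{ h^2 n^{-1} (n + m n^{1/2})\, (\sup_k |r_k|)\, s(h) \big\} = O_p\big( m\, n^{(1/2) + \varepsilon}\, h^3 \sup_k |r_k| \big),
\end{aligned}
\]

where again the bound applies uniformly in $h \in \mathcal{H}$ for each $\varepsilon > 0$, and we obtained the last identity by using (B.34). This implies (B.30).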

Finally in this step, to establish (B.31) note that, using the exact formula for the remainder in a Taylor expansion,

\[
\begin{aligned}
\frac{1}{h} \int K\Big(\frac{X_i - x}{h}\Big) f(x)\, dx &= \int K(u)\, f(X_i - hu)\, du \\
&= \int K(u) \Big\{ f(X_i) - hu\, f'(X_i) + \int_0^{hu} (hu - v)\, f''(X_i - v)\, dv \Big\}\, du \\
&= f(X_i) + h^2 g(X_i, h),
\end{aligned}
\tag{B.35}
\]

where

\[
g(x, h) = \frac{1}{h^2} \int K(u)\, du \int_0^{hu} (hu - v)\, f''(x - v)\, dv
\]

and $g$ has the properties (i) $|g| \le B_1$ where $B_1 = \tfrac12 (\sup |f''|) \int u^2 K(u)\, du$, and (ii)

\[
\sup_x \big| g(x, h_1) - g(x, h_2) \big| \le B_2 \max\bigg[ \frac{|h_1 - h_2|}{\min(h_1, h_2)},\ \frac{|h_1 - h_2|^2}{\{\min(h_1, h_2)\}^2} \bigg].
\]
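Property (i) can be checked numerically in a case where the convolution in (B.35) has a closed form. The following is a minimal sketch under assumptions that are not part of the original argument: both $K$ and $f$ are taken to be the standard normal density, so that $\int K(u) f(x - hu)\, du$ is the $N(0, 1 + h^2)$ density, $\int u^2 K(u)\, du = 1$, and $\sup |f''| = f(0)$. The function names are hypothetical.

```python
import math

def normal_pdf(x, sd=1.0):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def g(x, h):
    """g(x, h) defined through f(x) + h^2 g(x, h) = int K(u) f(x - h u) du.

    Assumes K and f are both standard normal, so the integral equals
    the N(0, 1 + h^2) density evaluated at x.
    """
    smoothed = normal_pdf(x, math.sqrt(1.0 + h * h))
    return (smoothed - normal_pdf(x)) / (h * h)

# B1 = (1/2) sup|f''| int u^2 K(u) du. For the standard normal f,
# f''(x) = (x^2 - 1) f(x), whose absolute value is largest at x = 0.
B1 = 0.5 * normal_pdf(0.0)

grid = [i / 50.0 for i in range(-300, 301)]
worst = max(abs(g(x, h)) for h in (0.05, 0.1, 0.3, 0.6, 1.0) for x in grid)
print(f"max |g| on the grid = {worst:.5f}, bound B1 = {B1:.5f}")
```

In this Gaussian setting the largest value of $|g|$ occurs at $x = 0$ and approaches $B_1$ from below as $h$ decreases, so the bound in (i) is essentially attained.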

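Property (i) of $g$ uses the assumption, in CB(a), that $|f''|$ is bounded, and permits it to be proved that, for each $r \ge 1$,

\[
\sup_{h \in \mathcal{H}} E\bigg[ \Big\{ \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \Big\}^{2r} \bigg] = O(n^r).
\]

Therefore, using Markov's inequality, it can be shown that for each $\varepsilon, C > 0$,

\[
\sup_{h \in \mathcal{H}} P\bigg\{ \bigg| \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \bigg| > n^{(1/2) + \varepsilon} \bigg\} = O(n^{-C}).
\]

Hence, if $\mathcal{H}_n$ is any subset of $\mathcal{H}$ containing only polynomially many points (as a function of $n$), then for each $\varepsilon, C > 0$,

\[
P\bigg\{ \sup_{h \in \mathcal{H}_n} \bigg| \sum_{i \in S} (1 - E_{\mathcal{F}})\, g(X_i, h) \bigg| > n^{(1/2) + \varepsilon} \bigg\} = O(n^{-C}). \tag{B.36}
\]

Property (ii) of $g$ permits us to extend (B.36) from $\mathcal{H}_n$ to $\mathcal{H}$. Combining this result with (B.35) we deduce that (B.31) holds with

\[
Q_{2n} = \frac{1}{n} \sum_{i \in S} (1 - E_{\mathcal{F}})\, f(X_i).
\]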

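Step 6: Bound to $E(T_{61} \mid \mathcal{F}) - T_7$. The bound is given at (B.39). Observe from (B.28), and the argument leading to (B.26), that

\[
\begin{aligned}
E(T_{61} \mid \mathcal{F}) &= h \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \sum_{k_1=1}^{m} (i_{k_1} - i_{k_1-1} - 1) \int_{X_{(i_{k-1})}}^{X_{(i_k)}} E\{ \hat f_{k_1}(x) \mid \mathcal{F} \}\, f_k(x)\, dx \\
&= \frac{1}{n} \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x) \bigg[ \sum_{k_1=1}^{m} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} E\Big\{ K\Big(\frac{X_{(i)} - x}{h}\Big) \,\Big|\, \mathcal{F} \Big\} \bigg]\, dx.
\end{aligned}
\]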

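Note too, from (B.6), that

\[
\begin{aligned}
T_7 &= \sum_{k_1=1}^{m} \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} \sum_{k_2=1}^{m} \sum_{j=i_{k_2-1}+1}^{i_{k_2}-1} I(i \neq j)\, p_{(j)}\, E\Big\{ K\Big(\frac{X_{(i)} - X_{(j)}}{h}\Big) \,\Big|\, \mathcal{F} \Big\} \\
&= \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, (i_k - i_{k-1} - 2)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(y)\, K\Big(\frac{y - x}{h}\Big)\, dy \\
&\quad + \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x) \bigg[ \sum_{k_1 :\, k_1 \neq k} \; \sum_{i=i_{k_1-1}+1}^{i_{k_1}-1} E\Big\{ K\Big(\frac{X_{(i)} - x}{h}\Big) \,\Big|\, \mathcal{F} \Big\} \bigg]\, dx \\
&= E(T_{61} \mid \mathcal{F}) - \sum_{k=1}^{m} (i_k - i_{k-1} - 1)\, q_k \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f_k(y)\, K\Big(\frac{y - x}{h}\Big)\, dy.
\end{aligned}
\tag{B.37}
\]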

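Since $q_k \le C_3 n^{-1}$ (see CB(c)) then the subtracted term on the far right-hand side is bounded above by

\[
\frac{C_3}{n} \sum_{k=1}^{m} \frac{i_k - i_{k-1} - 1}{\{F(X_{(i_k)}) - F(X_{(i_{k-1})})\}^2} \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(x)\, dx \int_{X_{(i_{k-1})}}^{X_{(i_k)}} f(y)\, K\Big(\frac{y - x}{h}\Big)\, dy = O_p(m^3 h), \tag{B.38}
\]

uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. (Here we have used (B.11).) Therefore, combining (B.37) and (B.38),

\[
E(T_{61} \mid \mathcal{F}) - T_7 = O_p(m^3 h), \tag{B.39}
\]

uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$.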

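Step 7: Approximation to $T_9$. Recall that $T_9$ is defined at (B.8). Combining (B.23), (B.33) and (B.39) we deduce that

\[
(nh)^{-1}\, T_9 = \int \hat f(x \mid h, p)\, f(x)\, dx + Q_{1n} + Q_{2n} + O_p\big( m^2 n^{\varepsilon - (3/2)} + m\, n^{\varepsilon - (1/2)} h^2 \sup_k |r_k| + m n^{-1} \big), \tag{B.40}
\]

uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$.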

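Step 8: Bound to $T_8$. The bound is given at (B.48). Recall that $T_8$ was defined at (B.8). In all the arguments in this step, including the randomisation in the next paragraph, we condition on $\mathcal{F}$.

Randomise the order statistics $X_{(i)}$ for $i_{k-1} + 1 \le i \le i_k - 1$, obtaining $X_{k1}, \ldots, X_{k\nu_k}$, say, where $\nu_k = i_k - i_{k-1} - 1$; and do this independently for $k = 1, \ldots, m$. Write $(k_1, j_1) \prec (k_2, j_2)$ if either $k_1 < k_2$, or $k_1 = k_2$ and $j_1 < j_2$. In this way we can order each of the $\nu = \sum_{k \le m} \nu_k$ pairs $(k, j)$. Let $\pi(\ell)$ denote the $\ell$th pair, satisfying $\pi(\ell - 1) \prec \pi(\ell) \prec \pi(\ell + 1)$. Define

\[
\begin{aligned}
Z_{\pi(\ell_1), \pi(\ell_2)} = (q_{k_1} + q_{k_2}) \bigg[ & K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) - E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F}, X_{k_1 j_1} \Big\} \\
& - E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F}, X_{k_2 j_2} \Big\} + E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F} \Big\} \bigg]
\end{aligned}
\]

when $(k_1, j_1)$ and $(k_2, j_2)$ are the $\ell_1$th and $\ell_2$th pairs, respectively, and write

\[
Z_\ell = \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell), \pi(\ell_1)}.
\]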

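In this notation,

\[
\begin{aligned}
T_8 &= T_4 - (T_5 + T_6) + T_7 \\
&= \sum\sum_{(k_1, j_1) \prec (k_2, j_2)} (q_{k_1} + q_{k_2}) \bigg[ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) - E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F}, X_{k_1 j_1} \Big\} \\
&\qquad - E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F}, X_{k_2 j_2} \Big\} + E\Big\{ K\Big(\frac{X_{k_1 j_1} - X_{k_2 j_2}}{h}\Big) \,\Big|\, \mathcal{F} \Big\} \bigg] \\
&= \sum_{\ell_2 = 2}^{\nu} \sum_{\ell_1 = 1}^{\ell_2 - 1} Z_{\pi(\ell_1), \pi(\ell_2)} = \sum_{\ell = 2}^{\nu} Z_\ell.
\end{aligned}
\]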

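We write $\pi(\ell_1) \preceq \pi(\ell_2)$ if either $\pi(\ell_1) \prec \pi(\ell_2)$ or $\pi(\ell_1) = \pi(\ell_2)$. Suppose $\pi(\ell) = (k_0, j_0)$; here, $1 \le k_0 \le m$ and $1 \le j_0 \le \nu_{k_0}$. Define $\mathcal{F}_\ell$ to be the intersection of $\mathcal{F}$ with the smallest $\sigma$-field generated by $X_{kj}$ for all pairs $(k, j)$ satisfying $(k, j) \preceq (k_0, j_0)$. Then, $Z_\ell$ is measurable in $\mathcal{F}_\ell$, and $E(Z_\ell \mid \mathcal{F}_{\ell - 1}) = 0$. Therefore, $Z_1, \ldots, Z_\nu$ is a sequence of zero-mean martingale differences, adapted to the $\sigma$-fields $\mathcal{F}_1, \ldots, \mathcal{F}_\nu$. Hence, using Rosenthal's inequality, we have for each $r \ge 1$

\[
E_{\mathcal{F}}\big( T_8^{2r} \big) \le C(r) \bigg[ \Big\{ \sum_{\ell = 2}^{\nu} E_{\mathcal{F}}\big( Z_\ell^2 \big) \Big\}^r + \sum_{\ell = 2}^{\nu} E_{\mathcal{F}}\big( Z_\ell^{2r} \big) \bigg], \tag{B.41}
\]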

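where the constant $C(r) \ge 1$ depends only on $r$, and all the bounds in this paragraph hold uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$. If $\pi(\ell) = (k, j)$, write simply $X_{\pi(\ell)}$ for $X_{kj}$. Conditional on $\mathcal{F}$ and $X_{\pi(\ell)}$, the random variables $Z_{\pi(\ell_1), \pi(\ell)}$, for $1 \le \ell_1 \le \ell - 1$, are independent and have zero mean. Therefore,

\[
\begin{aligned}
E_{\mathcal{F}}\big( Z_\ell^2 \big) &= E\bigg\{ \Big( \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell_1), \pi(\ell)} \Big)^2 \,\bigg|\, \mathcal{F} \bigg\}
= E\bigg[ E\bigg\{ \Big( \sum_{\ell_1 = 1}^{\ell - 1} Z_{\pi(\ell_1), \pi(\ell)} \Big)^2 \,\bigg|\, \mathcal{F}, X_{\pi(\ell)} \bigg\} \,\bigg|\, \mathcal{F} \bigg] \\
&= E\bigg\{ \sum_{\ell_1 = 1}^{\ell - 1} E\big( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F}, X_{\pi(\ell)} \big) \,\bigg|\, \mathcal{F} \bigg\}
= \sum_{\ell_1 = 1}^{\ell - 1} E\big( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F} \big) \\
&\le B_1 n^{-2} \sum_{\ell_1 = 1}^{\ell - 1} E\bigg\{ K\Big(\frac{X_{\pi(\ell_1)} - X_{\pi(\ell)}}{h}\Big)^2 \,\bigg|\, \mathcal{F} \bigg\}
\le B_2 n^{-2}\, n h = B_2 n^{-1} h.
\end{aligned}
\tag{B.42}
\]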

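A similar bound for $E_{\mathcal{F}}(Z_\ell^{2r} \mid \mathcal{F})$ can be derived using Rosenthal's inequality again:

\[
\begin{aligned}
E_{\mathcal{F}}\big( Z_\ell^{2r} \mid \mathcal{F} \big) &\le C(r) \bigg[ \Big\{ \sum_{\ell_1 = 1}^{\ell - 1} E\big( Z_{\pi(\ell_1), \pi(\ell)}^2 \mid \mathcal{F} \big) \Big\}^r + \sum_{\ell_1 = 1}^{\ell - 1} E\big( |Z_{\pi(\ell_1), \pi(\ell)}|^{2r} \mid \mathcal{F} \big) \bigg] \\
&\le C(r) \big\{ (B_2 n^{-1} h)^r + B_3^r\, n^{-2r}\, n h \big\} \le C(r)\, (B_4 n^{-1} h)^r,
\end{aligned}
\tag{B.43}
\]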

where, here and below, $B_j$ does not depend on $h$, $m$, $n$ or $r$. Combining (B.41)–(B.43), and recalling from AA that $i_k - i_{k-1} \le C_2 n/m$, we deduce that

\[
E_{\mathcal{F}}\big( T_8^{2r} \big) \le C(r)^2\, (B_5\, \nu\, n^{-1} h)^r \le C(r)^2\, (B_5 C_2\, m^{-1} h)^r. \tag{B.44}
\]

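It follows from Hitczenko (1990) that $C(r)^2 (B_5 C_2)^r \le (B_6\, r / \log r)^{B_7 r}$. Therefore, provided that

\[
m^{-1} h\, n^{2\delta} \le n^{-B_8}, \tag{B.45}
\]

where $B_8 > 0$, we have: $E\{ (n^{\delta} T_8)^{2r} \} \le (B_6\, r / n^{B_8})^{B_7 r}$. The right-hand side here is minimised by taking $r = (B_6 e)^{-1} n^{B_8}$, in which case it equals $\exp(-B_9 n^{B_8})$, where $B_9 = B_7 / (B_6 e)$. Therefore, by Markov's inequality,

\[
\begin{aligned}
\sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} P\big\{ \big| (nh)^{-1} T_8(h, p) \big| > (nh)^{-1} n^{-\delta} \big\}
&\le \sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} E\big\{ (n^{\delta} T_8)^{2r} \big\} \\
&\le \exp(-B_9 n^{B_8}).
\end{aligned}
\tag{B.46}
\]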
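The choice of $r$ leading to (B.46) can be verified directly: the logarithm of $(B_6 r / n^{B_8})^{B_7 r}$ is convex in $r$ and its derivative vanishes at $r = (B_6 e)^{-1} n^{B_8}$. A minimal numerical sketch follows; the values of $n$, $B_6$, $B_7$ and $B_8$ are hypothetical and chosen only for illustration.

```python
import math

def log_bound(r, n, B6, B7, B8):
    """Logarithm of (B6 * r / n**B8) ** (B7 * r)."""
    return B7 * r * math.log(B6 * r / n ** B8)

# Hypothetical constants, used only to illustrate the minimisation.
n, B6, B7, B8 = 10 ** 6, 2.0, 1.5, 0.3

r_star = n ** B8 / (B6 * math.e)   # claimed minimiser
B9 = B7 / (B6 * math.e)            # claimed constant in the exponent

at_minimum = log_bound(r_star, n, B6, B7, B8)
nearby = [log_bound(r_star * c, n, B6, B7, B8) for c in (0.5, 0.9, 1.1, 2.0)]

print("log of the bound at r*:", at_minimum)
print("-B9 * n**B8:           ", -B9 * n ** B8)
```

The two printed values agree up to rounding, and every value in `nearby` is larger than `at_minimum`, consistent with $r = (B_6 e)^{-1} n^{B_8}$ being the minimiser.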

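Let $B_{10}, B_{11} > 0$ be fixed but arbitrarily large. If $c_3$, in CB(d), satisfies $c_3 < B_8$, if $\mathcal{H}_n$ consists of a lattice of $n^{B_{10}}$ distinct values of $h \in \mathcal{H}$, and if $\mathcal{P}_n$ consists of probability distributions $p \in \mathcal{P}$, defined as at AA, when there are just $m \le n^{c_3}$ distinct $p_i$s and each of these takes at most $n^{B_{11}}$ possible values, all satisfying (3.2), then the number of pairs $(h, p) \in \mathcal{H}_n \times \mathcal{P}_n$ is at most $n^{B_{10} + m B_{11}}$, and so (B.46) implies that

\[
\begin{aligned}
P\Big\{ \sup_{h \in \mathcal{H}_n} \sup_{p \in \mathcal{P}_n} \big| (nh)^{-1} T_8(h, p) \big| > (nh)^{-1} n^{-\delta} \Big\}
&= O\big\{ n^{B_{10} + m B_{11}} \exp(-B_9 n^{B_8}) \big\} \\
&= O\big\{ \exp(-n^{B_8 - \eta}) \big\},
\end{aligned}
\tag{B.47}
\]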

for all $\eta > 0$, provided that $c_3 < B_8$. In this case, choosing $B_{10}$ and $B_{11}$ arbitrarily large, (B.47) can be extended from $(h, p) \in \mathcal{H}_n \times \mathcal{P}_n$, inside the probability statement on the left-hand side, to all $(h, p) \in \mathcal{H} \times \mathcal{P}$. It therefore follows that, provided $\delta > 0$ is chosen so small that (B.45) holds, which in turn requires only that $c_1 - (1/5) + 2\delta < -B_8$, we have

\[
P\Big\{ \sup_{h \in \mathcal{H}} \sup_{p \in \mathcal{P}} \big| (nh)^{-1} T_8(h, p) \big| > (nh)^{-1} n^{-\delta} \Big\} \to 0. \tag{B.48}
\]

Step 9: Completion. Write $Q_n = Q_{1n} + Q_{2n}$. In (3.1), choose $c_1 \in (0, \tfrac{1}{30})$; let $c_2$, in (3.2), be in the interval $(0, c_1)$; choose $B_8 \in (0, \tfrac{1}{10} - 2 c_1 - c_2)$; let $c_3$, in CB(d), lie in the interval $(0, \tfrac{1}{10} - 2 c_1 - c_2 - B_8)$; and in (B.45)–(B.48), choose $\delta \in (0, (\tfrac15 - c_1 - B_8)/2)$. These choices are appropriate for the argument in the last paragraph of the previous step. Combining (B.3), (B.7), (B.40) and (B.48), we deduce that, for some $\eta > 0$,

\[
\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i \mid h, p) &= \int \hat f(x \mid h, p)\, f(x)\, dx + Q_n + O_p\big( m^2 n^{\varepsilon - (3/2)} + m^2 n^{-2} h^{-1} \\
&\qquad + (nh)^{-1} n^{-\delta} + m\, n^{\varepsilon - (1/2)} h^2 \sup_k |r_k| + m n^{-1} \big) \\
&= \int \hat f(x \mid h, p)\, f(x)\, dx + Q_n + O_p\big[ \{ (nh)^{-1} + h^4 \}\, n^{-\eta} \big],
\end{aligned}
\]

uniformly in $h \in \mathcal{H}$ and $p \in \mathcal{P}$, where the first identity holds for all $\varepsilon > 0$, and the last holds if $\varepsilon$ is sufficiently small. The theorem follows directly from this property.
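The nested intervals for the constants in Step 9 are non-empty precisely because $c_1 < 1/30$ and $c_2 < c_1$ give $\tfrac{1}{10} - 2c_1 - c_2 > 0$. A minimal sketch that picks one admissible set of constants with exact rational arithmetic and checks the condition $c_1 - (1/5) + 2\delta < -B_8$ required for (B.45); the function name `step9_constants` and the rule of taking a fixed share of each interval are hypothetical.

```python
from fractions import Fraction as Fr

def step9_constants(c1, share=Fr(1, 2)):
    """Pick (c2, B8, c3, delta) inside the open intervals listed in Step 9.

    `share` in (0, 1) is a hypothetical device: each constant is set to
    that fraction of the upper end point of its admissible interval.
    """
    assert 0 < c1 < Fr(1, 30) and 0 < share < 1
    c2 = share * c1                                   # c2 in (0, c1)
    B8 = share * (Fr(1, 10) - 2 * c1 - c2)            # B8 in (0, 1/10 - 2 c1 - c2)
    c3 = share * (Fr(1, 10) - 2 * c1 - c2 - B8)       # c3 in (0, 1/10 - 2 c1 - c2 - B8)
    delta = share * (Fr(1, 5) - c1 - B8) / 2          # delta in (0, (1/5 - c1 - B8)/2)
    return c2, B8, c3, delta

c1 = Fr(1, 40)
c2, B8, c3, delta = step9_constants(c1)
print("c2 =", c2, " B8 =", B8, " c3 =", c3, " delta =", delta)
print("condition for (B.45) holds:", c1 - Fr(1, 5) + 2 * delta < -B8)
```

Any other value of `share` in $(0, 1)$, or any other $c_1 \in (0, \tfrac{1}{30})$, yields a different admissible set of constants in the same way.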

REFERENCES

HITCZENKO, P. (1990). Best constants in martingale version of Rosenthal's inequality. Ann. Probab. 18, 1656–1668.

Highlights:

A cross-validation criterion is proposed for choosing both the bandwidth and the tilted estimator parameters.

It is demonstrated theoretically that the proposed estimator provides a convergence rate which is strictly faster than the usual rate attained using a conventional kernel estimator.

The performance of the proposed tilted estimator is investigated through both theoretical and numerical studies.
