
The Annals of Statistics, 1994, Vol. 22, No. 3, 1371-1385

ON THE STRONG UNIVERSAL CONSISTENCY OF NEAREST NEIGHBOR REGRESSION FUNCTION ESTIMATES¹

BY LUC DEVROYE, LASZLO GYORFI, ADAM KRZYZAK AND GABOR LUGOSI

McGill University, Technical University of Budapest, Concordia University and Technical University of Budapest

Two results are presented concerning the consistency of the k-nearest neighbor regression estimate. We show that all modes of convergence in $L_1$ (in probability, almost sure, complete) are equivalent if the regression variable is bounded. Under the additional condition $k/\log n \to \infty$ we also obtain the strong universal consistency of the estimate.

1. Introduction. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent observations of an $\mathbb{R}^d \times \mathbb{R}$-valued random vector $(X, Y)$. Denote the probability measure of $X$ by $\mu$. The regression function $m(x) = E(Y \mid X = x)$ can be estimated by the kernel estimate

$$m_n(x) = \frac{\sum_{i=1}^n Y_i K_h(x - X_i)}{\sum_{i=1}^n K_h(x - X_i)},$$

where $h > 0$ is a smoothing factor depending upon $n$; $K$ is an absolutely integrable function (the kernel); and $K_h(x) = K(x/h)$ [Nadaraya (1964, 1970), Watson (1964)]. Alternatively, one can use the k-nearest neighbor estimate,

$$m_n(x) = \sum_{i=1}^n W_{ni}(x; X_1, \ldots, X_n)\, Y_i,$$

and $W_{ni}(x; X_1, \ldots, X_n)$ is $1/k$ if $X_i$ is one of the $k$ nearest neighbors of $x$ among $X_1, \ldots, X_n$, and $W_{ni}$ is zero otherwise. Note in particular that $\sum_{i=1}^n W_{ni} = 1$. The k-nearest neighbor estimate was studied by Cover (1968). For a survey of other estimates, see, for example, Collomb (1981, 1985) or Gyorfi (1981).
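To make the definition concrete, here is a small Python sketch (our illustration, not part of the paper) of the plain k-nearest neighbor regression estimate above, with distance ties broken by index order; the function name and the toy data are hypothetical.

```python
import numpy as np

def knn_regression(x, X, Y, k):
    """k-nearest neighbor regression estimate m_n(x).

    Distance ties are broken by index order (stable sort): the point with
    the smaller index is declared "closer"."""
    dists = np.linalg.norm(X - x, axis=1)        # ||x - X_i||
    idx = np.argsort(dists, kind="stable")[:k]   # indices of the k nearest neighbors
    return Y[idx].mean()                         # weights W_ni = 1/k on those neighbors

# Toy example: m(x) = E(Y | X = x) = sin(2*pi*x) plus noise.
rng = np.random.default_rng(0)
n, k = 1000, 25
X = rng.uniform(size=(n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)
print(knn_regression(np.array([0.25]), X, Y, k))  # close to sin(pi/2) = 1
```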

We are concerned with the $L_1$ convergence of $m_n$ to $m$ as measured by $J_n = \int |m_n(x) - m(x)| \,\mu(dx)$, where $\mu$ is the (unknown) probability measure for $X$. This quantity is particularly important in discrimination based on the kernel rule [see Devroye and Wagner (1980) or Stone (1977)]. Stone (1977) first pointed out that there exist estimators for which $J_n \to 0$ in probability for all distributions of $(X, Y)$ with $E|Y| < \infty$. This included the nearest neighbor and histogram estimates. For example, for the k nearest neighbors, it suffices to ask that

$$k \to \infty, \qquad k/n \to 0, \tag{1}$$

Received August 1992; revised November 1994.
¹Supported in part by NSERC Grants A3456 and A0270, and FCAR Grants EQ-1679 and EQ-2904.
AMS 1991 subject classifications. Primary 62G05.
Key words and phrases. Regression function, nonparametric estimation, consistency, strong convergence, nearest neighbor estimate.


provided that ties among points at equal distance from $x$ are adequately taken care of. These conditions are the best possible. Devroye and Wagner (1980) and, independently, Spiegelman and Sacks (1980) showed that this is also the case for the kernel estimate with smoothing factor $h$ provided that $K$ is a bounded nonnegative function with compact support such that, for a small fixed sphere $S$ centered at the origin, $\inf_{x \in S} K(x) > 0$, and that

$$\lim_{n\to\infty} h = 0, \qquad \lim_{n\to\infty} n h^d = \infty. \tag{2}$$

These results were extended and complemented by Greblicki, Krzyzak and Pawlak (1984), Krzyzak (1986) and Krzyzak and Pawlak (1984).

Interestingly, it turns out that the conditions for the "in probability" convergence of $J_n$ are also sufficient for the strong convergence of $J_n$, thus rendering all modes of convergence equivalent. Difficulties arise when the $X$-variable does not have an absolutely continuous distribution. We summarize what is known in this respect:

1. For the k-nearest neighbor estimates, $J_n \to 0$ almost surely under condition (1) whenever $X$ has a density and $Y$ is bounded [Devroye and Gyorfi (1985), Chapter 10, and Zhao (1987)]. Beck (1979) showed this result earlier under the additional constraint that $m$ has a continuous version.

2. For the k-nearest neighbor estimate, $J_n \to 0$ almost surely for all distributions of $(X, Y)$ with $Y$ bounded, provided that $k/n \to 0$ and $k/\log\log n \to \infty$ [Devroye (1982)]. The unnatural condition on $k$ arises from the proof method: the convergence of $J_n$ to 0 is obtained by first establishing the pointwise convergence (i.e., $m_n - m \to 0$ almost surely) at almost all $x$ ($\mu$) and then moving on to $L_1$ convergence via a result of Glick (1974).

3. Devroye and Gyorfi (1983) obtained the equivalence for all distributions of $(X, Y)$ with $|Y| \le M < \infty$ for the histogram regression estimate. Gyorfi (1991) has pointed out that $J_n \to 0$ almost surely for a modification of partitioning estimates whenever $E|Y| < \infty$, provided that a bin width condition similar to (2) [with the additional condition $nh^d/\log(n) \to \infty$] is satisfied.

4. Assuming that $Y$ is uniformly bounded, the kernel estimate is strongly consistent if (2) holds, $K$ is a Riemann integrable kernel and $K \ge a I_S$, where $a > 0$ is a constant and $S$ is a ball centered at the origin that has a positive radius [Devroye and Krzyzak (1989)].

The purpose of the present paper is twofold. First we explain a simple technique based upon exponential martingale inequalities for proving the equivalence of all modes of convergence of $J_n$ for the k-nearest neighbor estimate under no conditions on the distribution of $(X, Y)$ other than the boundedness of $Y$. Thus, Stone's conditions on the relative sizes of $k$ and $n$ are strong enough to imply complete and almost sure convergence. Our other result is the strong universal convergence of $J_n$, that is, we can replace the boundedness assumption by the natural condition $E|Y| < \infty$. Here we need the additional condition $k/\log n \to \infty$ on $k$.


Before we can state the main results, we have to take care of the messy problem of distance ties ($\|x - X_i\| = \|x - X_j\|$). The exponential inequality used here and in Devroye and Krzyzak (1989) is basically useful whenever the removal of one data point has a limited effect on the error. Also, our covering lemma requires some sort of duality that states that if $X_i$ is one of the near neighbors of $X_j$, then roughly speaking $X_j$ should be one of the near neighbors of $X_i$. Next we list three of the possible tie-breaking methods:

1. Tie breaking by indices. If $X_i$ and $X_j$ are equidistant from $x$, then $X_i$ is declared "closer" if $i < j$. This method has some undesirable properties. For example, if $X$ is monatomic, then $X_1$ is the nearest neighbor of all $X_j$'s, $j > 1$, but $X_j$ is only the $(j-1)$st nearest neighbor of $X_1$. The influence of $X_1$ in such a situation is too great, making the estimate very unstable and thus undesirable. In this case, Devroye and Gyorfi [(1985), Chapter 10] pointed out that, when $Y$ is not degenerate and $\varepsilon > 0$ is small enough,

$$P\left\{\int |m_n(x) - m(x)| \,\mu(dx) > \varepsilon\right\} \ge \exp(-ck)$$

for some $c > 0$; here $m_n(x)$ is the k-nearest neighbor regression estimate defined by tie breaking by indices. This is in contrast to Theorem 1(a) below.

2. Stone's tie breaking. Stone (1977) introduced a nearest neighbor rule which is not a k-nearest neighbor rule in a strict sense, for his estimate, in general, uses more than $k$ neighbors. If we denote the distance of the $k$th nearest neighbor to $x$ by $R_n(x)$ (note that it is unique), then Stone's estimate is the following:

$$\tilde{m}_n(x) = \frac{1}{k}\left(\sum_{i:\,\|x - X_i\| < R_n(x)} Y_i \;+\; \frac{k - \#\{i: \|x - X_i\| < R_n(x)\}}{\#\{i: \|x - X_i\| = R_n(x)\}} \sum_{i:\,\|x - X_i\| = R_n(x)} Y_i\right). \tag{3}$$

3. Tie breaking by randomization. This is the method that we will consider. We assume that $(X, Z)$ is a random vector independent of the data, where $Z$ is independent of $X$ and uniformly distributed on $[0,1]$. The latter assumption may be replaced by the weaker assumption that $Z$ has a density; however, as it is up to us to generate $Z$, we may as well pick a uniform random variable. We also artificially enlarge the data by introducing $Z_1, Z_2, \ldots, Z_n$, where the $Z_i$'s are i.i.d. uniform $[0,1]$ as well. Thus, each $(X_i, Z_i)$ is distributed as $(X, Z)$. The probability measure induced by $(X, Z)$ is denoted by $\nu$. Given $(x, z)$, we define

$$m_n(x, z) = \frac{1}{k}\sum_{i=1}^{k} Y_{(i)},$$


where $(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$ is a reordering of the data according to increasing values of $\|x - X_{(i)}\|$. In case of distance ties, we declare $(X_i, Z_i)$ closer to $(x, z)$ than $(X_j, Z_j)$ provided that

$$|Z_i - z| < |Z_j - z|.$$

The criterion is

$$J_n = E\big\{|m_n(X, Z) - m(X)| \,\big|\, X_1, Z_1, Y_1, \ldots, X_n, Z_n, Y_n\big\} = \int_0^1\!\!\int |m_n(x, z) - m(x)| \,\mu(dx)\, dz = \int |m_n(x, z) - m(x)| \,\nu(d(x, z)).$$

The main difference between Stone's tie-breaking policy and the one based on randomization is that Stone's method takes into account all points whose distance to $x$ equals that of the $k$th nearest neighbor, while the method based on randomization picks one of these randomly and neglects the others. We will see in the proof of Theorem 1 that $EJ_n$ cannot be smaller than the expected $L_1$-error of Stone's estimate. It should be stressed that if $\mu$ has a density, then tie breaking is needed with zero probability and becomes therefore irrelevant.
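The randomized tie-breaking scheme is straightforward to implement. The following Python sketch (ours, not the authors' code) augments each observation with an auxiliary uniform label $Z_i$ and orders points lexicographically by distance and then by label distance, as described above; the function name and toy example are hypothetical.

```python
import numpy as np

def knn_randomized_ties(x, z, X, Z, Y, k):
    """k-NN regression estimate m_n(x, z) with tie breaking by randomization.

    Every data point X_i carries an auxiliary label Z_i ~ uniform[0, 1] and the
    query carries its own label z; points are ordered by ||x - X_i|| first and,
    among points at equal distance, by |z - Z_i|."""
    dist = np.linalg.norm(X - x, axis=1)   # primary key: distance to x
    tie = np.abs(Z - z)                    # secondary key: distance between labels
    order = np.lexsort((tie, dist))        # lexicographic sort: dist, then tie
    return Y[order[:k]].mean()

# Example with a purely atomic X (all points equal), so every distance is tied.
rng = np.random.default_rng(1)
n, k = 500, 20
X = np.zeros((n, 1))
Z = rng.uniform(size=n)
Y = rng.normal(loc=2.0, scale=1.0, size=n)
print(knn_randomized_ties(np.zeros(1), rng.uniform(), X, Z, Y, k))  # roughly 2.0
```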

2. The equivalence theorem. The purpose of this section is to prove the following result.

THEOREM 1. Let $m_n(x, z)$ be the k-nearest neighbor estimate defined above. Then the following statements are equivalent:

(a) For every distribution of $(X, Y)$ with $|Y| \le M < \infty$ and $\varepsilon > 0$, there is a positive integer $n_0$ such that, for $n \ge n_0$,

$$P\{J_n > \varepsilon\} \le \exp\big(-n\varepsilon^2/(8M^2\gamma_d^2)\big),$$

where the constant $\gamma_d$ is the minimal number of cones centered at the origin of angle $\pi/6$ that cover $\mathbb{R}^d$.

(b) For every distribution of $(X, Y)$ with $|Y| \le M < \infty$, $J_n \to 0$ with probability 1 as $n \to \infty$.

(c) For every distribution of $(X, Y)$ with $|Y| \le M < \infty$, $J_n \to 0$ in probability as $n \to \infty$.

(d) $\lim_{n\to\infty} k = \infty$ and $\lim_{n\to\infty} k/n = 0$.

REMARK 1 (A curiosity). It is interesting that we can find sequences $k$ for which $J_n \to 0$ almost surely for all distributions of $(X, Y)$ with bounded $Y$, yet $m_n$ does not tend to $m$ in the almost sure pointwise sense [take $k \sim \log\log\log(n)$, and note that $k/\log\log(n) \to \infty$ is necessary for the almost sure pointwise convergence of the kernel estimate whenever $X$ has a density and $m$ is twice continuously differentiable with $m'' \not\equiv 0$; Devroye (1982)].


REMARK 2 (Necessity of the conditions). It is not true that, when $J_n \to 0$ in probability for one distribution of $(X, Y)$, the conditions (1) on $k$ follow: just consider the case that $Y = 0$ with probability 1. Of course, the implication is true for "most" distributions of $(X, Y)$.

REMARK 3 (General estimates). We will not consider smoothed versions of the k-nearest neighbor method here. For example, as in Devroye (1982), one might consider attaching weight $U_{ni}$ to the $i$th nearest neighbor, where $U_{n1} \ge U_{n2} \ge \cdots \ge U_{nn} \ge 0$ and the weights sum to 1 for every $n$. Such methods were first proposed by Royall (1966).

REMARK 4 (Other references). For other results on k-nearest neighbor convergence, see, for example, Collomb (1979, 1980), Mack (1981), Devroye (1978, 1981, 1982), Stute (1984) and Bhattacharya and Mack (1987).

REMARK 5 (Random $k$). If $k$ is replaced by a random variable $K$ that is independent of the data and satisfies $K/n \to 0$ and $K \to \infty$ almost surely, then $J_n \to 0$ almost surely. Such data-based choices can be obtained by splitting the data, for example.

REMARK 6 (Discrimination). The conditional probability of error, of the k-nearest neighbor rule in discrimination, given the data [Cover and Hart (1967)], converges completely and strongly to the Bayes probability of error as $n \to \infty$ for all distributions of the data whenever (1) holds. This result strengthens the universal weak convergence results of Stone (1977) and Devroye and Wagner (1980).
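For illustration only (not taken from the paper), the k-nearest neighbor discrimination rule referred to here can be sketched as a majority vote among the $k$ nearest neighbors; the function name and toy data below are hypothetical.

```python
import numpy as np

def knn_classify(x, X, labels, k):
    """k-nearest neighbor discrimination rule: majority vote among the k
    nearest neighbors of x (ties at equal distance broken by index)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = np.bincount(labels[nearest])
    return int(np.argmax(votes))

# Toy two-class problem.
rng = np.random.default_rng(2)
n, k = 400, 15
X = rng.normal(size=(n, 2))
labels = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0).astype(int)
print(knn_classify(np.array([1.0, 0.0]), X, labels, k))  # expected label: 1
```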

REMARK 7 ($L_p$-consistency). By the boundedness of $Y$ it is easy to see that the $L_p$-error

$$\left(\int_0^1\!\!\int |m_n(x, z) - m(x)|^p \,\mu(dx)\, dz\right)^{1/p}, \qquad 1 \le p < \infty,$$

converges to zero if and only if the $L_1$-error $J_n$ does; therefore the results of Theorem 1 remain valid for $L_p$-errors.

REMARK 8 (Inequalities). The inequality of Theorem 1(a) is less useful in practice, as it is only valid for $n \ge n_0$, where $n_0$ depends upon $\varepsilon$.

PROOF OF THEOREM 1. Clearly, (a) implies (b) and (b) implies (c). Part (c) implies that $EJ_n \to 0$; therefore, by Jensen's inequality,

$$EJ_n = E\int |m_n(x,z) - m(x)| \,\nu(d(x,z)) \ge E\int \left| E\left\{\int_0^1 m_n(x,z)\,dz \,\Big|\, X_1, Y_1, \ldots, X_n, Y_n\right\} - m(x)\right| \mu(dx) = E\int |\tilde m_n(x) - m(x)| \,\mu(dx) \to 0,$$

where $\tilde m_n$ is Stone's estimate defined by (3); but this implies (d) by the results of Stone (1977). The novelty in this paper is the proof that condition (d) implies (a).


We begin with an exponential inequality generalizing inequalities due to Hoeffding (1963). The generalization due to Azuma (1967) [see Stout (1974)] has led to interesting applications in combinatorics and the theory of random graphs [for a survey, see McDiarmid (1989)]. We have used it in density estimation [Devroye (1988, 1991)].

LEMMA 1 [McDiarmid (1989)]. Let $X_1, \ldots, X_n$ be independent random variables taking values in a set $A$, and assume that $f: A^n \to \mathbb{R}$ satisfies

$$\sup_{x_1,\ldots,x_n,\,x_i' \in A} \big| f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n) \big| \le c_i, \qquad 1 \le i \le n.$$

Then

$$P\big\{ | f(X_1,\ldots,X_n) - Ef(X_1,\ldots,X_n) | > t \big\} \le 2\exp\left( \frac{-2t^2}{\sum_{i=1}^n c_i^2} \right).$$
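As a quick illustration of Lemma 1 (ours, not the paper's), the sample mean of $[0,1]$-valued variables changes by at most $c_i = 1/n$ when one coordinate is replaced, so the lemma yields the familiar Hoeffding bound $2\exp(-2nt^2)$; the short Python check below compares that bound with an empirical tail probability.

```python
import numpy as np

# f(x_1, ..., x_n) = (1/n) * sum(x_i) with x_i in [0, 1]: replacing a single
# coordinate changes f by at most c_i = 1/n, so Lemma 1 gives
#   P{|f - Ef| > t} <= 2 * exp(-2 t^2 / (n * (1/n)^2)) = 2 * exp(-2 n t^2).
rng = np.random.default_rng(3)
n, t, reps = 200, 0.05, 20_000

samples = rng.uniform(size=(reps, n))       # X_1, ..., X_n i.i.d. uniform [0, 1]
f = samples.mean(axis=1)                    # f(X_1, ..., X_n), with Ef = 0.5
empirical = np.mean(np.abs(f - 0.5) > t)    # empirical tail probability
bound = 2 * np.exp(-2 * n * t**2)           # bounded-differences (Hoeffding) bound

print(f"empirical tail: {empirical:.4f}, bound: {bound:.4f}")  # empirical <= bound
```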

The other tool needed for our proof is exploiting some geometric properties of the "metric" defined by the tie-breaking rule. In order to make it more transparent, we recall Lemma 10.1 from Devroye and Gyorfi (1985), which was used in the proof of complete consistency of k-nearest neighbor estimates if $\mu$ has a density. Let $S_{x,r}$ and $\bar S_{x,r}$ denote the open and closed balls of radius $r$ centered at $x$, respectively.

LEMMA 2 [Devroye and Gyorfi (1985)]. Let $\mu$ be an absolutely continuous probability measure on $\mathbb{R}^d$. Define

$$B_a(x) = \{x' : \mu(\bar S_{x',\,\|x - x'\|}) \le a\}.$$

Then, for all $x \in \mathbb{R}^d$,

$$\mu(B_a(x)) \le \gamma_d\, a.$$

Since Devroye and Gyorfi assumed the existence of a density, they did not have to worry about tie breaking. In order to generalize Lemma 2 to our case, we need some notation. For $x \in \mathbb{R}^d$ let $C(x) \subset \mathbb{R}^d$ be a cone of angle $\pi/6$ centered at $x$. The cone consists of all $y$ with the property that either $y = x$ or $\mathrm{angle}(y - x, s) < \pi/6$, where $s$ is a fixed direction. If $y, y' \in C(x)$ and $\|x - y\| < \|x - y'\|$, then $\|y - y'\| < \|x - y'\|$. Furthermore, if $\|x - y\| \le \|x - y'\|$, then $\|y - y'\| \le \|x - y'\|$. This follows from a simple geometric argument in the vector space spanned by $x$, $y$ and $y'$.

For $(x, z) \in \mathbb{R}^d \times [0, 1]$ define $C_0(x, z)$, $C_1(x, z)$, $S_{(x,z),(r,b)} \subset \mathbb{R}^d \times [0, 1]$ as

$$C_0(x, z) = C(x) \times [0, z], \qquad C_1(x, z) = C(x) \times [z, 1]$$


and

$$S_{(x,z),(r,b)} = S_{x,r} \times [0,1] \;\cup\; \{(x', z') : \|x' - x\| = r,\; |z' - z| < b\}.$$

Clearly, $\mathbb{R}^d \times [0,1]$ can be covered by $2\gamma_d$ sets of type $C_0(x,z)$ and $C_1(x,z)$. The property that we need is the following.

LEMMA 3. Let

$$B_a(x,z) = \big\{(x',z') : \nu\big(S_{(x',z'),(\|x-x'\|,\,|z-z'|)}\big) \le a\big\}.$$

Then, for all $(x, z) \in \mathbb{R}^d \times [0,1]$,

$$\nu(B_a(x, z)) \le 2\gamma_d\, a.$$

First of all, we prove a covering lemma, the key property of the "cones" $C_0(x, z)$ and $C_1(x, z)$.

LEMMA 4. If $(x', z') \in C_0(x, z)$, then

$$C_0(x,z) \cap S_{(x,z),(\|x-x'\|,\,|z-z'|)} \subset S_{(x',z'),(\|x-x'\|,\,|z-z'|)},$$

and if $(x', z') \in C_1(x, z)$, then

$$C_1(x,z) \cap S_{(x,z),(\|x-x'\|,\,|z-z'|)} \subset S_{(x',z'),(\|x-x'\|,\,|z-z'|)}.$$

PROOF. Because of symmetry it is enough to prove one of the statements. We have to show that $(\hat x, \hat z) \in C_0(x, z) \cap S_{(x,z),(\|x-x'\|,\,|z-z'|)}$ implies $(\hat x, \hat z) \in S_{(x',z'),(\|x-x'\|,\,|z-z'|)}$.

If $\hat x \in C(x) \cap S_{x,\|x-x'\|}$, then from the well-known property of the cone $\hat x \in S_{x',\|x-x'\|}$ follows, so it is enough to deal with pairs $(\hat x, \hat z)$ where $\|x - \hat x\| = \|x - x'\|$. Since $\hat x \in C(x)$, the only case when $\hat x \notin S_{x',\|x-x'\|}$ is if

$$\|x - \hat x\| = \|x - x'\| = \|x' - \hat x\|.$$

Denote the set of such $\hat x$'s by $H$. Thus, it is enough to deal with pairs in $H \times [0,1]$. Intersecting this set with the left- and right-hand sides of the statement, we get

$$H \times [0,1] \cap C_0(x,z) \cap S_{(x,z),(\|x-x'\|,\,|z-z'|)} = H \times \big([0,z] \cap \{\hat z : |\hat z - z| < |z - z'|\}\big) = H \times [z', z]$$

and

$$H \times [0,1] \cap S_{(x',z'),(\|x-x'\|,\,|z-z'|)} = H \times \{\hat z : |\hat z - z'| < |z - z'|\},$$

respectively. Clearly, however,

$$[z', z] \subset \{\hat z : |\hat z - z'| < |z - z'|\},$$

which completes the proof. □


PROOF OF LEMMA 3. The proof is similar to that of Lemma 2. Let $C_1, \ldots, C_{2\gamma_d}$ be a collection of sets of form $C_0(x, z)$ and $C_1(x, z)$ that cover $\mathbb{R}^d \times [0,1]$. Then

$$\nu(B_a(x, z)) \le \sum_{s=1}^{2\gamma_d} \nu\big(C_s \cap B_a(x, z)\big).$$

Let $(x', z') \in C_s \cap B_a(x, z)$. Then from Lemma 4 we have

$$\nu\big(C_s \cap S_{(x,z),(\|x-x'\|,\,|z-z'|)} \cap B_a(x, z)\big) \le \nu\big(S_{(x',z'),(\|x-x'\|,\,|z-z'|)}\big) \le a,$$

where we used the fact that $(x', z') \in B_a(x, z)$. Since $(x', z')$ was arbitrary,

$$\nu\big(C_s \cap B_a(x, z)\big) \le a,$$

which completes the proof of the lemma. □

Now we are equipped to prove that (d) implies (a) in Theorem 1. Set $r_n = r_n(x, z)$ and $b_n = b_n(x, z)$ to satisfy

$$\nu\big(S_{(x,z),(r_n, b_n)}\big) = \frac{k}{n}.$$

Note that the solution always exists, by the absolute continuity of the distribution of $Z$ and its independence from $X$. Also define

$$\hat m_n(x, z) = \frac{1}{k}\sum_{j=1}^n Y_j\, I\{(X_j, Z_j) \in S_{(x,z),(r_n, b_n)}\}.$$

Obviously,

$$|m(x) - m_n(x, z)| \le |m(x) - E(\hat m_n(x, z))| + |E(\hat m_n(x, z)) - \hat m_n(x, z)| + |\hat m_n(x, z) - m_n(x, z)|. \tag{4}$$

The first term on the right-hand side is a deterministic "bias"-type term, whose integral will be shown to converge to zero. The second and third terms are random; they can be considered as "variation" terms. We will obtain exponential probability inequalities for these terms that are valid for large $n$'s.

The condition $k/n \to 0$ implies that $r_n(x, z) \to 0$, so for the first term we have, by Lebesgue's density theorem [see Wheeden and Zygmund (1977)], that

$$E(\hat m_n(x, z)) = \frac{1}{\nu\big(S_{(x,z),(r_n, b_n)}\big)}\int_{S_{(x,z),(r_n, b_n)}} E\big(Y \mid (X, Z) = (x', z')\big) \,\nu(d(x', z')) \to E\big(Y \mid (X, Z) = (x, z)\big) = m(x)$$

for almost all $x$ mod $\mu$. By the boundedness of $Y$, the dominated convergence theorem implies that

$$\int |m(x) - E(\hat m_n(x, z))| \,\nu(d(x, z)) \to 0.$$

Turning to the second term in (4), first we get an exponential bound for

$$\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) - E\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z))$$

by Lemma 1. Fix the data and replace $(x_i, z_i, y_i)$ by $(\hat x_i, \hat z_i, \hat y_i)$, changing the value of $\hat m_n(x, z)$ to $\hat m_{ni}(x, z)$. Then

$$\left|\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) - \int |E(\hat m_n(x, z)) - \hat m_{ni}(x, z)| \,\nu(d(x, z))\right| \le \int |\hat m_n(x, z) - \hat m_{ni}(x, z)| \,\nu(d(x, z));$$

but $|\hat m_n(x, z) - \hat m_{ni}(x, z)|$ is bounded by $2M/k$ and can differ from zero only if $(x_i, z_i) \in S_{(x,z),(r_n, b_n)}$ or $(\hat x_i, \hat z_i) \in S_{(x,z),(r_n, b_n)}$. Observe that $(x_i, z_i) \in S_{(x,z),(r_n, b_n)}$ if and only if $\nu\big(S_{(x,z),(\|x - x_i\|,\,|z - z_i|)}\big) \le k/n$. However, the measure of such $(x, z)$ pairs is bounded by $2\gamma_d k/n$, by Lemma 3; therefore,

$$\sup_{x_1, y_1, z_1, \ldots, x_n, y_n, z_n,\, \hat x_i, \hat z_i, \hat y_i} \int |\hat m_n(x, z) - \hat m_{ni}(x, z)| \,\nu(d(x, z)) \le \frac{2M}{k}\cdot\frac{2\gamma_d k}{n} = \frac{4M\gamma_d}{n},$$

and, by Lemma 1,

$$P\left\{\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) - E\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) > \varepsilon\right\} \le 2e^{-n\varepsilon^2/(8M^2\gamma_d^2)}. \tag{5}$$

So we have to show that $E\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) \to 0$. However, using the Cauchy-Schwarz inequality, we have

$$E\int |E(\hat m_n(x, z)) - \hat m_n(x, z)| \,\nu(d(x, z)) \le \sqrt{E\int |E(\hat m_n(x, z)) - \hat m_n(x, z)|^2 \,\nu(d(x, z))} = \sqrt{\int \frac{n}{k^2}\,\mathrm{Var}\big(Y I\{(X, Z) \in S_{(x,z),(r_n, b_n)}\}\big)\,\nu(d(x, z))} \le \sqrt{\int \frac{M^2}{k^2}\, n\,\nu\big(S_{(x,z),(r_n, b_n)}\big)\,\nu(d(x, z))} = \sqrt{\frac{M^2}{k}} \to 0.$$


Finally, denoting $R_n = \|X_{(k)} - x\|$ and $B_n = |Z_{(k)} - z|$, write the third term in (4) as

$$|\hat m_n(x, z) - m_n(x, z)| = \frac{1}{k}\left|\sum_{j=1}^n Y_j\, I\{(X_j, Z_j) \in S_{(x,z),(r_n, b_n)}\} - \sum_{j=1}^n Y_j\, I\{(X_j, Z_j) \in S_{(x,z),(R_n, B_n)}\}\right| \le \frac{M}{k}\left|\sum_{j=1}^n I\{(X_j, Z_j) \in S_{(x,z),(r_n, b_n)}\} - \sum_{j=1}^n I\{(X_j, Z_j) \in S_{(x,z),(R_n, B_n)}\}\right| = M\,|\bar m_n(x, z) - E\bar m_n(x, z)|,$$

where $\bar m_n$ is defined as $\hat m_n$ with $Y$ replaced by the constant random variable $Y = 1$. Therefore the bound of (5) applies for the third term, too, and the proof is complete. □

3. Strong universal consistency. In this section we demonstrate that the k-nearest neighbor regression estimate is consistent even if $Y$ is not bounded, if $k$ is chosen to satisfy $k/\log(n) \to \infty$ and $k/n \to 0$. More precisely, we prove the following theorem.

THEOREM 2. If

$$\lim_{n\to\infty} k/\log(n) = \infty \qquad\text{and}\qquad \lim_{n\to\infty} k/n = 0,$$

then $J_n \to 0$ with probability 1 for all distributions of $(X, Y)$ satisfying $E|Y| < \infty$.

Gyorfi (1991) gave conditions for the strong universal consistency of a regression estimate. Translating his result to our case, we get the following lemma.

LEMMA 5 [Gyorfi (1991), Theorem 2]. Consider the k-nearest neighbor regression estimate $m_n(x, z)$. Then the $L_1$-error of the estimate $J_n$ converges to zero almost surely for all distributions of $(X, Y)$ satisfying $E|Y| < \infty$ if the following two conditions are satisfied:

(a) $J_n \to 0$ almost surely for all distributions of $(X, Y)$ with bounded $Y$.

(b) There is a constant $c > 0$ such that, for all distributions of $(X, Y)$ satisfying $E|Y| < \infty$,

$$\limsup_{n\to\infty} \frac{1}{k}\int_0^1\!\!\int \sum_{j=1}^n |Y_j|\, I\{(X_j, Z_j) \in S_{(x,z),(R_n, B_n)}\}\,\mu(dx)\,dz \le c\,E|Y| \qquad \text{a.s.}$$

Clearly, condition (a) is satisfied by Theorem 1, so we only have to check (b). In order to do so, we need some notation. Let $A_i$ be the collection of all $(x, z)$


that are such that $(X_i, Z_i)$ is one of its $k$ nearest neighbors. Here, we use some geometric arguments similar to those in the proof of Theorem 1. Similarly, let us define a cone $C(x, \theta, s)$, where $x$ defines the top of the cone, $s$ is a vector indicating a direction in $\mathbb{R}^d$ and $\theta \in (0, \pi)$ is an angle. The cone consists of all $y$ with the property that either $y = x$ or $\mathrm{angle}(y - x, s) < \theta$. For any fixed $\theta$, there exists a finite collection $S$ of directions such that

$$\bigcup_{s \in S} C(x, \theta, s) = \mathbb{R}^d$$

regardless of how $x$ is picked. The cardinality of this set is denoted by $|S|$ and depends upon both $\theta$ and $d$. If $\theta < \pi/6$ and if $y, y' \in C(x, \theta, s)$ and $\|x - y\| < \|x - y'\|$, then $\|y - y'\| < \|x - y'\|$. Furthermore, if $\|x - y\| \le \|x - y'\|$, then $\|y - y'\| \le \|x - y'\|$. We fix $\theta \in (0, \pi/6)$ and $S$ as indicated above. In the space $\mathbb{R}^d \times [0, 1]$, define the sets

$$C_{i,s} = C(X_i, \theta, s) \times [0,1].$$

Let $B_{i,s}$ be the subset of $C_{i,s}$ consisting of all $(x, z)$ that are among the $k$ nearest neighbors of $(X_i, Z_i)$ in the set

$$\{(X_1, Z_1), \ldots, (X_{i-1}, Z_{i-1}), (X_{i+1}, Z_{i+1}), \ldots, (X_n, Z_n), (x, z)\} \cap C_{i,s}$$

when distance tie breaking is done in the described fashion. [If $C_{i,s}$ contains fewer than $k - 1$ of the $(X_j, Z_j)$ pairs, $j \ne i$, then $B_{i,s} = C_{i,s}$.] Equivalently, $B_{i,s}$ is the subset of $C_{i,s}$ consisting of all $(x, z)$ that are closer to $(X_i, Z_i)$ than the $k$th nearest neighbor of $(X_i, Z_i)$ in $C_{i,s}$, when distance tie breaking is done in the described fashion.

LEMMA 6. If $(x, z) \in A_i$, then $(x, z) \in \bigcup_{s \in S} B_{i,s}$, and thus

$$\nu(A_i) \le \sum_{s \in S} \nu(B_{i,s}).$$

PROOF. To prove this claim, take $(x, z) \in A_i$. Then locate an $s \in S$ for which $(x, z) \in C_{i,s}$. We have to show that $(x, z) \in B_{i,s}$ to conclude the proof. Thus, we need to show that $(x, z)$ is one of the $k$ nearest neighbors of $(X_i, Z_i)$ in the set

$$\{(X_1, Z_1), \ldots, (X_{i-1}, Z_{i-1}), (X_{i+1}, Z_{i+1}), \ldots, (X_n, Z_n), (x, z)\} \cap C_{i,s}$$

when distance tie breaking is done appropriately. Take $(X_j, Z_j)$ closer to $(X_i, Z_i)$ than $(x, z)$ in $C_{i,s}$. If $\|X_j - X_i\| < \|x - X_i\|$, we recall that by the property of our cones $\|x - X_j\| < \|x - X_i\|$, and thus $(X_j, Z_j)$ is one of the $k - 1$ nearest neighbors of $(x, z)$ in $\mathbb{R}^d$. If on the other hand $\|X_j - X_i\| = \|x - X_i\|$, and $z$ is further from $Z_i$ than $Z_j$, then by the property of the cone, $\|x - X_j\| \le \|x - X_i\|$, which shows again that $(X_j, Z_j)$ is one of the $k - 1$ nearest neighbors of $(x, z)$ in $\mathbb{R}^d$. This shows that in $C_{i,s}$ there are at most $k - 1$ points $(X_j, Z_j)$ closer to $(X_i, Z_i)$ than $(x, z)$.


Thus, with the same tie-breaking policy, $(x, z)$ is one of the $k$ nearest neighbors of $(X_i, Z_i)$ in the set

$$\{(X_1, Z_1), \ldots, (X_{i-1}, Z_{i-1}), (X_{i+1}, Z_{i+1}), \ldots, (X_n, Z_n), (x, z)\} \cap C_{i,s}.$$

This concludes the proof of the claim. □

LEMMA 7 (An inequality for binomial random variables). Let $B$ be a binomial random variable with parameters $n$ and $p$. Then

$$P\{B \ge \varepsilon\} \le \exp\left[\varepsilon - np - \varepsilon\log\frac{\varepsilon}{np}\right], \qquad \varepsilon \ge np,$$

$$P\{B \le \varepsilon\} \le \exp\left[\varepsilon - np - \varepsilon\log\frac{\varepsilon}{np}\right], \qquad \varepsilon \le np.$$

PROOF. We proceed by Chernoff's exponential bounding method [Chernoff (1952)]. In particular, for arbitrary $\lambda > 0$,

$$P\{B \ge \varepsilon\} \le E\{\exp(\lambda B - \lambda\varepsilon)\} = \exp(-\lambda\varepsilon)\big((\exp\lambda)\,p + 1 - p\big)^n \le \exp\big[-\lambda\varepsilon + np\big((\exp\lambda) - 1\big)\big].$$

The right-hand side is minimal for $\lambda = \log(\varepsilon/np)$. Resubstitution of this value gives the first bound. The proof of the other bound is similar. □
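As a numerical sanity check (ours, not part of the paper), the following Python snippet compares the upper-tail bound of Lemma 7 with the exact binomial tail computed from the probability mass function.

```python
import math

def binom_upper_tail(n, p, e):
    """Exact P{B >= e} for B ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(e, n + 1))

def chernoff_upper_bound(n, p, e):
    """Lemma 7 bound exp[e - np - e*log(e/(np))], valid for e >= np."""
    return math.exp(e - n * p - e * math.log(e / (n * p)))

n, p = 1000, 0.02            # np = 20
for e in (30, 40, 60):       # thresholds above the mean
    print(f"e={e}: exact={binom_upper_tail(n, p, e):.3e}  "
          f"bound={chernoff_upper_bound(n, p, e):.3e}")   # exact <= bound in each case
```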

LEMMA 8 [Property of $\nu(B_{i,s})$]. If $k/\log(n) \to \infty$ and $k/n \to 0$, then

$$\limsup_{n\to\infty} \frac{n}{k}\max_{1\le i\le n} \nu(B_{i,s}) \le 2 \qquad \text{a.s.}$$

PROOF. We prove that, for every $s \in S$,

$$\sum_n P\left\{\frac{n}{k}\max_{1\le i\le n} \nu(B_{i,s}) > 2\right\} < \infty.$$

In order to do this we give a bound for

$$P\{\nu(B_{i,s}) > \varepsilon \mid X_i, Z_i\}.$$

If $\nu(C_{i,s}) < \varepsilon$, then, since $B_{i,s} \subset C_{i,s}$, we have $P\{\nu(B_{i,s}) > \varepsilon \mid X_i, Z_i\} = 0$; therefore we assume that $\nu(C_{i,s}) \ge \varepsilon$. Fix $X_i$ and $Z_i$. The distance-ordering and tie-breaking method induces a total ordering of all $(x, z)$ with respect to closeness to $(X_i, Z_i)$. Find a pair $(x, z) \in C_{i,s}$ such that if $B_\varepsilon$ is the collection of all $(x', z') \in C_{i,s}$ that are nearer to $(X_i, Z_i)$ than $(x, z)$, then $\nu(B_\varepsilon) = \varepsilon$. By our method of tie breaking, such a pair $(x, z)$ exists. We have the following dual relationship:

$$P\{\nu(B_{i,s}) > \varepsilon \mid X_i, Z_i\} = P\{B_\varepsilon \text{ captures fewer than } k \text{ of the points } (X_j, Z_j),\; j \ne i \mid X_i, Z_i\}.$$

However, if $B$ is a binomial $(n - 1, \varepsilon)$ random variable, then the last probability is equal to

$$P\{B < k\} \le \exp\left[k - (n - 1)\varepsilon - k\log\frac{k}{(n - 1)\varepsilon}\right] \qquad \text{if } k < (n - 1)\varepsilon.$$

Finally, with $\varepsilon = 2k/n$ we have $k < (n - 1)\varepsilon$, therefore

$$P\left\{\max_{1\le i\le n} \nu(B_{i,s}) > \varepsilon\right\} \le nP\{\nu(B_{i,s}) > \varepsilon\} \le n\exp\left[k - (n - 1)\varepsilon - k\log\frac{k}{(n - 1)\varepsilon}\right] = n\exp\left[k - \frac{2k(n - 1)}{n} - k\log\frac{n}{2(n - 1)}\right] \le n\exp\left(-k + \frac{2k}{n} + k\log 2\right),$$

which is summable in $n$ when $k \ge [2/(1 - \log 2)]\log n$. □

PROOF OF THEOREM 2. By Lemma 5 and Theorem 1, it is enough to prove that there is a constant $c > 0$ such that

$$\limsup_{n\to\infty} \frac{1}{k}\int_0^1\!\!\int \sum_{j=1}^n |Y_j|\, I\{(X_j, Z_j) \in S_{(x,z),(R_n, B_n)}\}\,\mu(dx)\,dz \le c\,E|Y| \qquad \text{a.s.}$$

Observe that

$$\frac{1}{k}\int_0^1\!\!\int \sum_{j=1}^n |Y_j|\, I\{(X_j, Z_j) \in S_{(x,z),(R_n, B_n)}\}\,\mu(dx)\,dz = \frac{1}{k}\sum_{i=1}^{n} |Y_i|\,\nu(A_i).$$

If we can show that

$$\limsup_{n\to\infty} \frac{n}{k}\max_{1\le i\le n}\nu(A_i) \le c \qquad \text{a.s.} \tag{6}$$

for some constant $c > 0$, then, by the law of large numbers,

$$\limsup_{n\to\infty} \frac{1}{k}\sum_{i=1}^{n} |Y_i|\,\nu(A_i) \le \limsup_{n\to\infty} \frac{n}{k}\max_{1\le i\le n}\nu(A_i)\,\frac{1}{n}\sum_{i=1}^{n} |Y_i| \le \limsup_{n\to\infty} c\,\frac{1}{n}\sum_{i=1}^{n} |Y_i| = c\,E|Y| \qquad \text{a.s.,}$$

so we have to prove (6). However, by Lemma 6,

$$\nu(A_i) \le \sum_{s\in S}\nu(B_{i,s});$$

therefore, Lemma 8 implies that (6) is satisfied with $c = 2|S|$, so the proof of the theorem is complete. □


Acknowledgments. We would like to thank three referees.

REFERENCES

AZUMA, K. (1967). Weighted sums of certain dependent random variables. Tohoku Math. J. 37 357-367.
BECK, J. (1979). The exponential rate of convergence of error for k_n-NN nonparametric regression and decision. Problems Control Inform. Theory 8 303-311.
BHATTACHARYA, P. K. and MACK, Y. P. (1987). Weak convergence of k-NN density and regression estimators with varying k and applications. Ann. Statist. 15 976-994.
CHERNOFF, H. (1952). A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23 493-507.
COLLOMB, G. (1979). Estimation de la régression par la méthode des k points les plus proches: propriétés de convergence ponctuelle. C. R. Acad. Sci. Paris 289 245-247.
COLLOMB, G. (1980). Estimation de la régression par la méthode des k points les plus proches avec noyau. Statistique non Paramétrique Asymptotique. Lecture Notes in Math. 821 159-175. Springer, Berlin.
COLLOMB, G. (1981). Estimation non paramétrique de la régression: revue bibliographique. Internat. Statist. Rev. 49 75-93.
COLLOMB, G. (1985). Nonparametric regression: an up-to-date bibliography. Statistics 16 300-324.
COVER, T. M. (1968). Estimation by the nearest neighbor rule. IEEE Trans. Inform. Theory IT-14 50-55.
COVER, T. M. and HART, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory IT-13 21-27.
DEVROYE, L. (1978). A universal k-nearest neighbor procedure in discrimination. In Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing 142-147. IEEE, New York.
DEVROYE, L. (1981). On the almost everywhere convergence of nonparametric regression function estimates. Ann. Statist. 9 1310-1319.
DEVROYE, L. (1982). Necessary and sufficient conditions for the almost everywhere convergence of nearest neighbor regression function estimates. Z. Wahrsch. Verw. Gebiete 61 467-481.
DEVROYE, L. (1983). The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. Ann. Statist. 11 896-904.
DEVROYE, L. (1988). The kernel estimate is relatively stable. Probab. Theory Related Fields 77 521-536.
DEVROYE, L. (1991). Exponential inequalities in nonparametric estimation. In Nonparametric Functional Estimation (G. Roussas, ed.) 31-44. Springer, Berlin.
DEVROYE, L. and GYORFI, L. (1983). Distribution-free exponential bound on the L1 error of partitioning estimates of a regression function. In Proceedings of the Fourth Pannonian Symposium on Mathematical Statistics (F. Konecny, J. Mogyorodi and W. Wertz, eds.) 67-76. Akadémiai Kiadó, Budapest.
DEVROYE, L. and GYORFI, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
DEVROYE, L. and KRZYZAK, A. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. J. Statist. Plann. Inference 23 71-82.
DEVROYE, L. and WAGNER, T. J. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8 231-239.
GLICK, N. (1974). Consistency conditions for probability estimators and integrals of density estimators. Utilitas Math. 6 61-74.
GREBLICKI, W., KRZYZAK, A. and PAWLAK, M. (1984). Distribution-free pointwise consistency of kernel regression estimate. Ann. Statist. 12 1570-1575.
GYORFI, L. (1981). Recent results on nonparametric regression estimate and multiple classification. Problems Control Inform. Theory 10 43-52.
GYORFI, L. (1991). Universal consistencies of a regression estimate for unbounded regression functions. In Nonparametric Functional Estimation (G. Roussas, ed.) 329-338. Springer, Berlin.
HART, P. E. (1968). The condensed nearest neighbor rule. IEEE Trans. Inform. Theory IT-14 515-516.
HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13-30.
KRZYZAK, A. (1986). The rates of convergence of kernel regression estimates and classification rules. IEEE Trans. Inform. Theory IT-32 668-679.
KRZYZAK, A. and PAWLAK, M. (1984). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Trans. Inform. Theory IT-30 78-81.
MACK, Y. P. (1981). Local properties of k-nearest neighbor regression estimates. SIAM Journal on Algebraic and Discrete Methods 2 311-323.
MCDIARMID, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989. London Mathematical Society Lecture Note Series 141 148-188. Cambridge Univ. Press.
NADARAYA, E. A. (1964). On estimating regression. Theory Probab. Appl. 9 141-142.
NADARAYA, E. A. (1970). Remarks on nonparametric estimates for density functions and regression curves. Theory Probab. Appl. 15 134-137.
ROYALL, R. M. (1966). A class of nonparametric estimators of a smooth regression function. Ph.D. dissertation, Stanford Univ.
SPIEGELMAN, C. and SACKS, J. (1980). Consistent window estimation in nonparametric regression. Ann. Statist. 8 240-246.
STONE, C. J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist. 5 595-645.
STOUT, W. F. (1974). Almost Sure Convergence. Academic, New York.
STUTE, W. (1984). Asymptotic normality of nearest neighbor regression function estimates. Ann. Statist. 12 917-926.
WATSON, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A 26 359-372.
WHEEDEN, R. L. and ZYGMUND, A. (1977). Measure and Integral. Dekker, New York.
ZHAO, L. C. (1987). Exponential bounds of mean error for the nearest neighbor estimates of regression functions. J. Multivariate Anal. 21 168-178.

LUC DEVROYE
SCHOOL OF COMPUTER SCIENCE
MCGILL UNIVERSITY
MONTREAL
CANADA H3A 2A7

ADAM KRZYZAK
DEPARTMENT OF COMPUTER SCIENCE
CONCORDIA UNIVERSITY
1455 DE MAISONNEUVE WEST
MONTREAL
CANADA H3G 1M8

LASZLO GYORFI
GABOR LUGOSI
DEPARTMENT OF MATHEMATICS
TECHNICAL UNIVERSITY OF BUDAPEST
1521 STOCZEK U. 2
BUDAPEST
HUNGARY

