Chapter 3

NONPARAMETRIC CURVE ESTIMATION

1. INTRODUCTION

Reproducing kernels are often found in nonparametric curve estimation in connection with the use of spline functions, which were popularized by Wahba in the statistics literature in the 1970s. A brief introduction to the theory of splines is presented in Section 2. Sections 4 and 5 are devoted to the use of splines in nonparametric estimation of density and regression functions. Sections 6 and 7 briefly present the application of reproducing kernels to the problem of shape constraints and unbiasedness. For different purposes, kernels (in the sense of Parzen and Rosenblatt) with vanishing moments have been introduced in the literature. In Section 8 we state the link between those kernels (called higher order kernels) and reproducing kernels. Section 9 provides some background on local approximation of functions in view of application to local polynomial smoothing of statistical functionals, presented in Section 10. A wide variety of functionals and their derivatives can be treated (density, hazard rate, mean residual time, Lorenz curve, spectral density, quantile function, ...). The examples show the practical interest of kernels of order (m, p) (kernels of order p for estimating derivatives of order m). Their properties and the definition of hierarchies of higher order kernels are further developed in Section 11. Indeed, hierarchies of kernels offer large possibilities for optimizing smoothers, such as in cross-validation techniques, double or multiple kernel procedures, multiparameter kernel estimation, reduction of kernel complexity and many others which are far from being fully investigated. They can also be used as dictionaries of kernels in Support Vector Functional Estimation (see Chapter 5).


Like Chapter 2, this chapter will be another opportunity to study the links between optimal linear approximation in numerical analysis and bayesian models on functional spaces through the work of Larkin in a series of papers (Larkin (1970, 1972, 1980, 1983) and Kuelbs, Larkin and Williamson (1972)). To emphasize the interest of such an approach, Larkin (1983) says: "the reproducing kernel function often plays a key role in bridging the gulf separating the abstract formalism of functional analysis from the computational applications".

2. A BRIEF INTRODUCTION TO SPLINES

Many books are devoted to the theory of splines, e.g. Ahlberg et al (1967), de Boor (1978), Schumaker (1981), Laurent (1972), Atteia (1992), Bezhaev and Vasilenko (1993) in the numerical analysis literature, or Wahba (1990), Eubank (1988), Green and Silverman (1994) in the statistical one. Our goal here is just to give the reader the basic tools to understand the role of reproducing kernel Hilbert spaces in this theory and to allow him to handle the use of splines in density or regression estimation or, more generally, functional parameter estimation. Splines and reproducing kernels also appear in other areas of statistics: in filtering, as we saw in Chapter 2, but also in factor analysis (van der Linde, 1988) and principal components analysis (Besse and Ramsay, 1986). Spline functions form a very large family that can be approached in different ways. Some splines belong to the variational theory, where a spline is presented as a solution of an optimization problem in a Hilbert space, while others are defined as piecewise functions with continuity conditions. To illustrate this, let us introduce the classical polynomial splines from the piecewise point of view.

DEFINITION 20 Given an integer r and a set of points $z_1 < \ldots < z_k$, called knots, in an interval (a, b), a polynomial spline of order r with simple knot sequence $z_1 < \ldots < z_k$ is a function on (a, b) which

• is continuously differentiable up to order r - 2 on (a, b), and

• coincides with a polynomial of degree less than or equal to r - 1 on each of the subintervals $(a, z_1), \ldots, (z_i, z_{i+1}), \ldots, (z_k, b)$.

The space $S(z_1, \ldots, z_k)$ of polynomial splines of order r with simple knot sequence $z_1 < \ldots < z_k$ is a vector space of dimension r + k (see Exercise 2). A simple basis of this space is given by the polynomials $1, x, \ldots, x^{r-1}$ and the k functions $\{(x - z_i)_+^{r-1};\ i = 1, \ldots, k\}$. For computational purposes, the B-spline basis (see Schumaker, 1981) is preferred for the following reason: their being compactly supported improves the invertibility of the matrices involved in the least squares minimization steps.


In the family of polynomial splines of even order, say 2m, a spline is called natural when it coincides with a polynomial of degree less than or equal to m - 1 (instead of 2m - 1) on the boundary intervals $(a, z_1)$ and $(z_k, b)$. Now if one measures the exact values $a_i$ of an unknown smooth function f at a set of points $\{t_i, i = 1, \ldots, n\}$ in an interval (a, b), it is customary to approximate f by minimizing
$$\int_a^b \left(g^{(m)}(t)\right)^2 d\lambda(t)$$
in the set of interpolating functions (i.e. $g(t_i) = a_i$, $i = 1, \ldots, n$) of the Sobolev space $H^m(a, b)$. This integral represents a measure of roughness of the functions in this space. It turns out that the solution is a natural polynomial spline of order 2m with knots $t_1, \ldots, t_n$, called the interpolating $D^m$ spline for the points $(t_i, a_i)$. This remarkable fact will be proved in Section 2.4. Justifications for this specific measure of roughness are its relationship with total curvature and with elastic potential energy (see Champion et al, 1996), from which the term spline was coined. Our general presentation will follow the variational approach first promoted by French mathematicians (Atteia (1970), Laurent (1972), etc.). We will define abstract splines before presenting the concrete classical families of $D^m$ splines, L-splines and thin plate splines, which can be approached both ways. The piecewise nature of these classical splines will be demonstrated. We will first focus on interpolating and smoothing splines before giving some hints about mixed splines and partial (or inf-convolution) splines.
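To make this concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available; the data are invented for illustration) of the case m = 2: the interpolating $D^2$ spline is the natural cubic spline, which SciPy provides directly.

    import numpy as np
    from scipy.interpolate import CubicSpline

    # Exact values a_i of an unknown smooth function measured at points t_i.
    t = np.array([0.0, 0.3, 0.5, 0.7, 1.0])
    a = np.sin(2 * np.pi * t)

    # bc_type='natural' zeroes the second derivative at the boundary, giving
    # the minimizer of the roughness integral of (g'')^2 among H^2 interpolants.
    g = CubicSpline(t, a, bc_type='natural')

    x = np.linspace(0.0, 1.0, 7)
    print(g(x))        # interpolant values
    print(g(x, 2))     # second derivative, zero at the two boundary knots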

2.1. ABSTRACT INTERPOLATING SPLINES

An abstract interpolating spline is an element in a Hilbert space which minimizes an "energy" measure given some "interpolating" conditions. In practice, the interpolating conditions are given by some measurements (location, area, volume, etc.) and the energy measure is chosen to quantify the smoothness among solutions of the interpolating equations.

DEFINITION 21 Given three Hilbert spaces $\mathcal{H}$, $\mathcal{A}$, $\mathcal{B}$, two bounded linear operators $A : \mathcal{H} \to \mathcal{A}$ and $B : \mathcal{H} \to \mathcal{B}$, and an element $a \in \mathcal{A}$, an interpolating spline corresponding to the data a, the measurement operator A and the energy operator B is any element $\sigma$ of $\mathcal{H}$ minimizing the energy $\|B\sigma\|^2_{\mathcal{B}}$ among elements satisfying the interpolating equations $A\sigma = a$.


The interpolating equations properly bear their name when the measurement operator involves point evaluation functionals, i.e. when $\mathcal{H}$ is a Hilbert space of functions on T and there exist n points $\{t_i, i = 1, \ldots, n\}$ in T such that $Au = (u(t_1), \ldots, u(t_n))'$. In this case, we will call A the evaluation operator at $t_1, \ldots, t_n$. Another type of measurement operator is $Au = (\int_{t_i}^{t_{i+1}} u(t)\, d\lambda(t),\ i = 1, \ldots, n)$. Examples of applications with measurement operators other than the evaluation operator are in Nychka et al (1984), Wahba (1977, 1990). It is particularly in the case of an evaluation operator that reproducing kernel Hilbert spaces come into play, since such an operator cannot be bounded for any choice of $t_i$ unless $\mathcal{H}$ is a reproducing kernel Hilbert space. However, we will state the existence and uniqueness theorem in its full generality before specializing to the reproducing kernel Hilbert space context, and refer the reader to Laurent (1972), Atteia (1992), or Bezhaev and Vasilenko (1993) for a proof.

THEOREM 57 If (i) $Ker(B) \cap Ker(A) = \{0\}$ and (ii) $Ker(A) + Ker(B)$ is closed in $\mathcal{H}$, then for any $a \in \mathcal{A}$ such that $\{u : Au = a\} \neq \emptyset$, there exists a unique interpolating spline corresponding to the data a, the measurement operator A and the energy operator B.

Condition (ii) can be replaced equivalently by "$B(Ker(A))$ is closed in $\mathcal{B}$", or by "the range of B is closed in $\mathcal{B}$ and Ker(B) is finite dimensional". The particular case when $\mathcal{H}$ is a reproducing kernel Hilbert space, A is an evaluation operator and B is the identity operator ($\mathcal{B} = \mathcal{H}$) is an important one in applications. In this context, the following theorem, found in Shapiro (1971), attests that the connection between reproducing kernels and minimum norm solutions of interpolation problems has long been recognized. For these reasons, we will state it and prove it independently of the next result. Let $\mathcal{H}$ be a RKHS of functions on T, $T_0$ be a subset of T and $u_0$ be a function defined on $T_0$. Let $\mathcal{H}_0$ be the set of functions of $\mathcal{H}$ which coincide with $u_0$ on the elements of $T_0$. Let $S_0$ be the closed span of $\{K(t, \cdot);\ t \in T_0\}$.

THEOREM 58 $\mathcal{H}_0$ is non-empty if and only if $\mathcal{H}_0 \cap S_0$ is non-empty. In that case, the unique element of this intersection is the element of minimal norm in $\mathcal{H}_0$.

Proof. Let us first prove that $\mathcal{H}_0 \cap S_0$ contains at most one element. If it contained two distinct elements, their difference would belong to $S_0$. It would also vanish on $T_0$ and hence belong to $S_0^\perp$, the orthogonal complement of $S_0$, and therefore be a non-zero element of $S_0 \cap S_0^\perp$, which is impossible. Now if $\mathcal{H}_0$ is not empty, since it is closed and convex, it contains a unique element of minimal norm $u^*$. For any $v \in S_0^\perp$, v vanishes on $T_0$, and hence $\|u^* + v\| \geq \|u^*\|$. This implies that $u^*$ is orthogonal to $S_0^\perp$, and hence that it belongs to $S_0$ since $S_0$ is closed. ∎
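As noted just below, when $T_0$ is finite the minimum norm interpolant can be computed by solving a single linear system with the Gram matrix $K(t_i, t_j)$. Here is a minimal Python sketch (the Gaussian kernel and the data are illustrative assumptions, not choices made in the text):

    import numpy as np

    def K(s, t, scale=0.3):
        # An illustrative Gaussian reproducing kernel (any RKHS kernel works).
        return np.exp(-((s - t) ** 2) / (2 * scale ** 2))

    t0 = np.array([0.1, 0.4, 0.6, 0.9])        # finite set T_0
    u0 = np.array([1.0, 0.5, -0.2, 0.7])       # prescribed values on T_0

    # The minimum norm interpolant lies in span{K(t, .) : t in T_0}:
    # solve the n x n system with matrix K(t_i, t_j).
    c = np.linalg.solve(K(t0[:, None], t0[None, :]), u0)

    def u_star(x):
        return K(x[:, None], t0[None, :]) @ c

    print(np.allclose(u_star(t0), u0))          # True: u* coincides with u0 on T_0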

We note that when $T_0$ consists of a finite number of points $\{t_i, i = 1, \ldots, n\}$, the solution of the interpolation problem reduces to a linear system of n equations, with matrix $K(t_i, t_j)$, which is non-singular if the evaluation functionals corresponding to the $t_i$ are linearly independent. We now turn to the case when $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel K, the operator A is of the form
$$Au = (\langle h_1, u\rangle, \ldots, \langle h_n, u\rangle)'$$
for a finite number n of linearly independent elements $\{h_i, i = 1, \ldots, n\}$ in $\mathcal{H}$, and B has a finite dimensional null space. Let us first give a characterization of the spline in terms of K. It turns out that the natural tool to express this solution is rather a semi-kernel (see Section 6 of Chapter 1) associated with the semi-norm, as in Laurent (1981, 1986, 1991). A more general formulation can be found in Bezhaev and Vasilenko (1993), without the restriction to a finite number of interpolating conditions, but for the sake of simplicity we prefer to present this simplest but very common case. Let us assume that the semi-norm induced by B confers to $\mathcal{H}$ a structure of semi-hilbertian space. Let $\mathbb{K}$ be a semi-kernel operator of the semi-hilbertian space $\mathcal{H}$, endowed with the semi-norm $\|Bu\|^2_{\mathcal{B}}$.

THEOREM 59 The interpolating spline corresponding to the data a, the measurement operator A and the energy operator B has the following form:
$$\sigma = \sum_{i=1}^n \lambda_i \mathbb{K}(h_i) + q,$$
where $q \in Ker(B)$ and $\sum_{i=1}^n \lambda_i h_i \in Ker(B)^\perp$. Under the additional assumptions of Theorem 57, q and $\lambda$ are uniquely determined by the n interpolating equations $\langle \sigma, h_i\rangle_{\mathcal{H}} = a_i$. If we denote by $p_1, \ldots, p_m$ a basis of Ker(B), and write $q = \sum_{j=1}^m \gamma_j p_j$, then the vectors $\lambda = (\lambda_1, \ldots, \lambda_n)'$ and $\gamma = (\gamma_1, \ldots, \gamma_m)'$ satisfy the following linear system of n + m equations
$$\begin{cases} \Sigma\lambda + T\gamma = a \\ T'\lambda = 0 \end{cases} \qquad (3.1)$$
where $\Sigma$ is the n by n matrix with elements $\langle \mathbb{K}(h_i), h_j\rangle_{\mathcal{H}}$, and T is the n by m matrix with elements $\langle p_k, h_i\rangle_{\mathcal{H}}$.


Proof. Introducing n Lagrange multipliers $\lambda_1, \ldots, \lambda_n$, we must optimize the Lagrangian
$$L(u, \lambda) = \|Bu\|^2_{\mathcal{B}} + 2\sum_{i=1}^n \lambda_i(-\langle u, h_i\rangle_{\mathcal{H}} + a_i).$$
Let $(\sigma, \lambda)$ denote the optimizing point. L being quadratic in u, it is easy to see that we must have
$$B^*B\sigma - \sum_{i=1}^n \lambda_i h_i = 0.$$
Now if q is any element of Ker(B), we have
$$\langle B^*B\sigma, q\rangle_{\mathcal{H}} = \langle B\sigma, Bq\rangle_{\mathcal{B}} = \langle B\sigma, 0\rangle_{\mathcal{B}} = 0.$$
Therefore $B^*B\sigma = \sum_{i=1}^n \lambda_i h_i$ and belongs to $Ker(B)^\perp$. Hence, by the semi-reproducing property of $\mathbb{K}$ (see 8 in Chapter 1), we have for any $x \in \mathcal{H}$,
$$\langle x, B^*B\sigma\rangle_{\mathcal{H}} = \langle x, \sum_{i=1}^n \lambda_i h_i\rangle_{\mathcal{H}} = \langle Bx, B\mathbb{K}(\sum_{i=1}^n \lambda_i h_i)\rangle_{\mathcal{B}}.$$
Since for all $x \in \mathcal{H}$,
$$\langle Bx, B\mathbb{K}(\sum_{i=1}^n \lambda_i h_i)\rangle_{\mathcal{B}} = \langle x, B^*B(\sum_{i=1}^n \lambda_i \mathbb{K}(h_i))\rangle_{\mathcal{H}},$$
we have
$$\langle x, B^*B(\sigma - \sum_{i=1}^n \lambda_i \mathbb{K}(h_i))\rangle_{\mathcal{H}} = 0.$$
We then conclude that $\sigma - \sum_{i=1}^n \lambda_i \mathbb{K}(h_i) \in Ker(B) = Ker(B^*B)$ with $\sum_{i=1}^n \lambda_i h_i \in Ker(B)^\perp$, which proves the first statement. The first equation of system (3.1) comes from the interpolating conditions and the second equation from the orthogonality condition $\sum_{i=1}^n \lambda_i h_i \in Ker(B)^\perp$. Let us directly check that this system has a unique solution. It is enough to prove that the corresponding homogeneous system has $\lambda = 0$ and $\gamma = 0$ as its only solution. Indeed, if $\Sigma\lambda = -T\gamma$ and $T'\lambda = 0$, then
$$\lambda'\Sigma\lambda = 0 \quad\text{and}\quad T'\lambda = 0. \qquad (3.2)$$


On the other hand,
$$\lambda'\Sigma\lambda = \langle \sum_{i=1}^n \lambda_i \mathbb{K}(h_i), \sum_{i=1}^n \lambda_i h_i\rangle_{\mathcal{H}}.$$
Using the semi-reproducing property again, we get
$$\lambda'\Sigma\lambda = \| B(\sum_{i=1}^n \lambda_i \mathbb{K}(h_i)) \|^2_{\mathcal{B}} = \| B\sigma \|^2_{\mathcal{B}}.$$
Finally, combining this result with (3.2), we get $B^*B\sigma = 0 = \sum_{i=1}^n \lambda_i h_i$. From the linear independence of the $h_i$ it follows that $\lambda = 0$. Coming back to (3.1), we then have $T\gamma = 0$, which is equivalent to $q \in Ker(A)$. Therefore, by the assumptions of Theorem 57, we conclude that $q = 0$. ∎

Note that Theorem 58, when $T_0$ is finite, is a corollary of Theorem 59 for B equal to the identity operator. Another relationship between these two theorems is that since $\phi(\sigma) = \|A\sigma\|^2_{\mathcal{A}} + \|B\sigma\|^2_{\mathcal{B}}$ defines a norm in $\mathcal{H}$, and since minimizing $\phi(\sigma)$ on the set $\{\sigma : A\sigma = a\}$ is equivalent to minimizing $\|B\sigma\|^2_{\mathcal{B}}$ on that same set, an interpolating spline can always be regarded as a solution to a minimum norm problem. Finally, note that the elements of the null space Ker(B) are exactly reproduced by the spline interpolation process because their energy is zero. We will now state an optimality property of interpolating splines, called optimality in the sense of Sard (1949) in the approximation literature. Given a bounded linear functional $L_0$ on $\mathcal{H}$, with representer $l_0$ (i.e. $L_0(h) = \langle h, l_0\rangle$, $\forall h \in \mathcal{H}$), we consider a linear approximation of $L_0(h)$ by values of h at distinct design points $t_1, \ldots, t_n$. The approximation error $L_0(h) - \sum_{i=1}^n w_i h(t_i)$ can be easily bounded:
$$\left| L_0(h) - \sum_{i=1}^n w_i h(t_i) \right| = \left| \langle l_0 - \sum_{i=1}^n w_i K(t_i, \cdot),\ h \rangle \right| \leq \|h\|\ \left\| l_0 - \sum_{i=1}^n w_i K(t_i, \cdot) \right\|.$$
The element $L_0(h)^* = \sum_{i=1}^n w_i h(t_i)$ is then called optimal if it minimizes the bound on the approximation error when h ranges in $\mathcal{H}$. The minimization can be over the set of weights $w_i$, over the set of design points $t_i$, or both. For fixed and distinct design points, quadratic optimization shows that the optimal weights are given by $w^* = G^{-1}a_0$, where G is the Gram matrix of the evaluations $K(t_i, \cdot)$ and $a_0 = (l_0(t_1), \ldots, l_0(t_n))'$. By Theorem 59, $\sum_{i=1}^n w_i^* h(t_i)$ then coincides with $L_0(h^*)$, where $h^*$ is the interpolating spline described in the following theorem, which we have just proved (see Larkin, 1972).

THEOREM 60 For fixed and distinct design points $t_i$, we have
$$L_0(h)^* = L_0(h^*)$$
where $h^*$ is the interpolating spline corresponding to the data $\{h(t_i), i = 1, \ldots, n\}$, the measurement operator consisting in the evaluations at the design points, and the energy operator being the identity on $\mathcal{H}$.

2.2. ABSTRACT SMOOTHING SPLINES

When the data are noisy, interpolation is no longer a good solution and it is replaced by the minimization of a criterion which balances the roughness measure on one side and the goodness of fit to the data on the other. For $D^m$ splines, it amounts to minimizing
$$\rho \int_a^b (f^{(m)}(t))^2\, d\lambda(t) + \sum_{i=1}^n (f(t_i) - a_i)^2$$
for f ranging in $H^m(a, b)$ and $\rho > 0$, where $\rho$ controls the trade-off. When $\rho$ is small, faithfulness to the data is preferred to smoothness, and conversely when $\rho$ is large.

DEFINITION 22 Given three Hilbert spaces $\mathcal{H}$, $\mathcal{A}$, $\mathcal{B}$, two bounded linear operators $A : \mathcal{H} \to \mathcal{A}$ and $B : \mathcal{H} \to \mathcal{B}$, an element $a \in \mathcal{A}$ and a positive real parameter $\rho$, a smoothing spline corresponding to the data a, the measurement operator A, the energy operator B and the parameter $\rho$ is any element $\sigma$ of $\mathcal{H}$ minimizing $\|A\sigma - a\|^2_{\mathcal{A}} + \rho\|B\sigma\|^2_{\mathcal{B}}$.

We first establish a link between interpolating splines and smoothing splines, which will be used in the forthcoming characterization but also in many other situations.

THEOREM 61 Under the assumptions of Theorem 57, the smoothing spline $\sigma_\rho$ corresponding to the data a, the measurement operator A, the energy operator B and the parameter $\rho$ is equal to the interpolating spline $\sigma_0$ corresponding to the same energy and measurement operators and to the data $a^* = A\sigma_\rho$.

Proof. Let $\phi(s)$ be the objective function that is minimized to find $\sigma_\rho$, i.e. $\phi(s) = \|As - a\|^2_{\mathcal{A}} + \rho\|Bs\|^2_{\mathcal{B}}$. Let us denote by $\sigma_0$ the interpolating spline corresponding to the data $a^*$, the measurement operator A and the energy operator B, where $a^* = A\sigma_\rho$. The definition of $\sigma_\rho$ implies that $\phi(\sigma_\rho) \leq \phi(\sigma_0)$. On the other hand, by definition of $\sigma_0$, we have $\|B\sigma_0\|^2_{\mathcal{B}} \leq \|B\sigma_\rho\|^2_{\mathcal{B}}$, since $\sigma_\rho$ satisfies the same interpolating conditions as $\sigma_0$ by definition of $a^*$. This inequality implies that $\phi(\sigma_0) \leq \phi(\sigma_\rho)$, which in turn implies equality. It is then enough to use the uniqueness of the optimizing element. ∎

Weinert, Byrd and Sidhu (1980) present an arbitrary smoothing spline as an interpolating spline minimizing a norm in an augmented space. Existence and uniqueness conditions for the smoothing splines are the same as conditions (i) and (ii) for interpolating splines (Theorem 57). As previously, we now concentrate on the case when $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel K, the operator A is of the form $Au = (\langle h_1, u\rangle, \ldots, \langle h_n, u\rangle)'$ for a finite number n of linearly independent elements $\{h_i, i = 1, \ldots, n\}$ in $\mathcal{H}$, and B has a finite dimensional null space. $\mathcal{A}$ is $\mathbb{R}^n$ and its norm is defined by the symmetric positive definite matrix W, which may account for example for unequal variances in the measurement process. Let $w^{ij}$ denote the (i, j)-th element of the inverse matrix $W^{-1}$. Once again we use the semi-kernel to characterize the smoothing spline, assuming that the semi-norm induced by B confers to $\mathcal{H}$ a structure of semi-hilbertian space. Let $\mathbb{K}$ be a semi-kernel operator of the semi-hilbertian space $\mathcal{H}$, endowed with the semi-norm $\|Bu\|^2_{\mathcal{B}}$.

THEOREM 62 The smoothing spline corresponding to the data a, the measurement operator A and the energy operator B has the following form:
$$\sigma_\rho = \sum_{i=1}^n \lambda_i \mathbb{K}(h_i) + q,$$
where $q \in Ker(B)$, $\sum_{i=1}^n \lambda_i h_i \in Ker(B)^\perp$ and $\rho W^{-1}\lambda + A\sigma_\rho = a$. Under the additional assumptions of Theorem 57, if we denote by $p_1, \ldots, p_m$ a basis of Ker(B), and write $q = \sum_{j=1}^m \gamma_j p_j$, then q and $\lambda$ are uniquely determined by the following system of equations
$$\begin{cases} \Sigma_\rho \lambda + T\gamma = a \\ T'\lambda = 0 \end{cases} \qquad (3.3)$$
where $\Sigma_\rho$ is the n by n matrix with elements $\langle \mathbb{K}(h_i), h_j\rangle_{\mathcal{H}} + \rho w^{ij}$, and T is the n by m matrix with elements $\langle p_k, h_i\rangle_{\mathcal{H}}$.

Proof. Theorem 61 and Theorem 59 prove that $\sigma_\rho$ has the form $\sigma_\rho = \sum_{i=1}^n \lambda_i \mathbb{K}(h_i) + q$ where $q \in Ker(B)$ and $\sum \lambda_i h_i \in Ker(B)^\perp$. It is then straightforward to see that
$$\|A\sigma - a\|^2_{\mathcal{A}} + \rho\,\|B\sigma\|^2_{\mathcal{B}} = \|\Sigma\lambda + T\gamma - a\|^2_{\mathcal{A}} + \rho\,\lambda'\Sigma\lambda$$


and to minimize this quadratic form. The solution satisfies $\Sigma\lambda + T\gamma + \rho W^{-1}\lambda = a$, which is equivalent to $A\sigma_\rho + \rho W^{-1}\lambda = a$. It is then clear that system (3.3) is satisfied. ∎

Another link between smoothing and interpolating splines is the following, obtained by taking the limit when $\rho$ tends to 0 in the system of equations.

THEOREM 63 When $\rho$ tends to 0, the pointwise limit of the smoothing spline $\sigma_\rho$ is the interpolating spline $\sigma_0$.

As for interpolating splines, it is important to consider the particular case when A is an evaluation operator and B is the identity operator ($\mathcal{B} = \mathcal{H}$). In that case, the semi-kernel is just a regular kernel and the solution is of the form $\sigma_\rho(t) = \sum_{i=1}^n \lambda_i K(t_i, t)$, where $\lambda = (\Sigma + \rho I_n)^{-1} a$. This is the result we have used in Theorem 48 of Chapter 2. It is worth mentioning that the smoothing spline corresponding to the data a, the measurement operator A, the energy operator B and the parameter $\rho$ can alternatively be defined as the minimizer of $\|Bs\|^2_{\mathcal{B}}$ among elements of $\mathcal{H}$ satisfying $\|As - a\|^2_{\mathcal{A}} \leq C_\rho$ for a constant $C_\rho$ depending on the smoothing parameter. From this angle, smoothing splines appear as an extension of interpolating splines where the interpolating conditions are relaxed.
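For the evaluation-operator case, the computation reduces to one linear solve. Below is a minimal Python sketch (the Gaussian kernel, the data and the value of $\rho$ are illustrative assumptions):

    import numpy as np

    def gaussian_kernel(s, t, scale=0.2):
        # One of many possible reproducing kernels, chosen only for illustration.
        return np.exp(-((s - t) ** 2) / (2 * scale ** 2))

    rng = np.random.default_rng(1)
    t = np.sort(rng.uniform(size=20))
    a = np.cos(3 * t) + 0.05 * rng.standard_normal(20)   # noisy data
    rho = 1e-2                                           # smoothing parameter

    Sigma = gaussian_kernel(t[:, None], t[None, :])
    lam = np.linalg.solve(Sigma + rho * np.eye(len(t)), a)   # (Sigma + rho I)^{-1} a

    def sigma_rho(x):
        # Smoothing spline sigma_rho(x) = sum_i lam_i K(t_i, x).
        return gaussian_kernel(x[:, None], t[None, :]) @ lam

    print(sigma_rho(np.linspace(0, 1, 5)))
    # Letting rho -> 0 recovers the minimal norm interpolant (Theorem 63).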

2.3. PARTIAL AND MIXED SPLINES

As suggested by their name, mixed splines are a mixture of smoothing and interpolating splines, in the sense that some interpolating conditions are strict while others are relaxed. More precisely, given two measurement operators $A_1$ and $A_2$, respectively from $\mathcal{H}$ to $\mathcal{A}_1$ and $\mathcal{A}_2$, and given two data vectors $a_1 \in \mathcal{A}_1$ and $a_2 \in \mathcal{A}_2$, the mixed spline is defined as the minimizer of $\|A_1\sigma - a_1\|^2_{\mathcal{A}_1} + \rho\|B\sigma\|^2_{\mathcal{B}}$ among elements satisfying the interpolating equations $A_2\sigma = a_2$. We refer the reader to Bezhaev and Vasilenko (1993) for more details. Inf-convolution splines were introduced in the 1980s by Laurent (1981) for the approximation or interpolation of functions presenting singularities, like discontinuities of the function or its derivatives, peaks, cliffs, etc. They have been used in statistics under the name partial splines, as in Wahba (1984, 1986, 1991). The first name is related to their connection with the inf-convolution operation in convex analysis, but we will not give extensive details about their most general form (see Laurent, 1991). We will restrict instead to the following framework.

DEFINITION 23 Given three Hilbert spaces $\mathcal{H} \subset \mathbb{R}^T$, $\mathcal{A}$, $\mathcal{B}$, two bounded linear operators $A : \mathbb{R}^T \to \mathcal{A}$ and $B : \mathcal{H} \to \mathcal{B}$, an element $a \in \mathcal{A}$ and a finite dimensional subspace V of $\mathbb{R}^T$, a partial interpolating spline corresponding to the data a, the measurement operator A, the energy operator B and the singularities subspace V is a function $\sigma + d$, with $\sigma \in \mathcal{H}$ and $d \in V$, minimizing $\|B\sigma\|^2_{\mathcal{B}}$, where $\sigma$ varies in $\mathcal{H}$, d in V and $\sigma + d$ satisfies the interpolating equations $A(\sigma + d) = a$. A partial smoothing spline corresponding to the data a, the measurement operator A, the energy operator B, the real parameter $\rho$ and the singularities subspace V is a function $\sigma + d$, with $\sigma \in \mathcal{H}$ and $d \in V$, minimizing $\|A(\sigma + d) - a\|^2_{\mathcal{A}} + \rho\|B\sigma\|^2_{\mathcal{B}}$, where $\sigma$ varies in $\mathcal{H}$ and d in V.

The function $\sigma$ can be thought of as the smooth part of the spline and d as the singularity part. In the case when $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel K, the operator A is of the form $A\sigma = (L_1(\sigma), \ldots, L_n(\sigma))'$ for a finite number n of independent bounded linear functionals $\{L_i, i = 1, \ldots, n\}$ on $\mathbb{R}^T$, and B has a finite dimensional null space, Laurent (1991) gives conditions for existence and uniqueness of the partial spline, as well as a characterization of the solution in terms of the semi-kernel again.

THEOREM 64 The minimization problem defining the partial interpolating spline has a unique solution if and only if
$$\{s \in Ker(B),\ d \in V \ \text{and}\ A(s + d) = 0\} \Rightarrow s = d = 0.$$

Let $\mathbb{K}$ be a semi-kernel operator for the semi-hilbertian space $\mathcal{H}$, with $\mathcal{A} = \mathbb{R}^n$, endowed with the semi-norm $\|Bu\|^2_{\mathcal{B}}$. Let $(Ker(B) + V)^0$ be the set of linear functionals that vanish on $Ker(B) + V$.

THEOREM 65 Under the conditions of Theorem 64, the partial interpolating spline corresponding to the data a, the measurement operator A, the energy operator B and the singularities subspace V has the following form:
$$\sigma + d = \sum_{i=1}^n \lambda_i \mathbb{K}(L_i) + q + d,$$
where $q \in Ker(B)$, $d \in V$, and $\sum \lambda_i L_i \in (Ker(B) + V)^0$. Under the additional assumptions of Theorem 57, if we denote by $p_1, \ldots, p_m$ a basis of Ker(B), by $d_1, \ldots, d_l$ a basis of V, and write $q = \sum_{j=1}^m \gamma_j p_j$ and $d = \sum_{j=1}^l \delta_j d_j$, then $\lambda$, $\gamma$ and $\delta$ are uniquely determined by the following system of n + m + l equations
$$\begin{cases} \Sigma\lambda + T\gamma + D\delta = a \\ T'\lambda = 0 \\ D'\lambda = 0 \end{cases}$$


where $\Sigma$ is the n by n matrix with elements $L_j(\mathbb{K}(L_i))$, T is the n by m matrix with elements $L_i(p_k)$, and D is the n by l matrix with elements $L_i(d_k)$.

THEOREM 66 Under the conditions of Theorem 64, the partial smoothing spline corresponding to the data a, the measurement operator A, the energy operator B, the singularities subspace V and the smoothing parameter $\rho$ has the following form:
$$\sigma_\rho + d = \sum_{i=1}^n \lambda_i \mathbb{K}(L_i) + q + d, \qquad (3.5)$$
where $q \in Ker(B)$, $d \in V$, $\sum \lambda_i L_i \in (Ker(B) + V)^0$ and $\rho W^{-1}\lambda + A(\sigma_\rho + d) = a$. Under the additional assumptions of Theorem 57, if we denote by $p_1, \ldots, p_m$ a basis of Ker(B), by $d_1, \ldots, d_l$ a basis of V, and write $q = \sum_{j=1}^m \gamma_j p_j$ and $d = \sum_{j=1}^l \delta_j d_j$, then $\lambda$, $\gamma$ and $\delta$ are uniquely determined by the following system of n + m + l equations
$$\begin{cases} \Sigma_\rho\lambda + T\gamma + D\delta = a \\ T'\lambda = 0 \\ D'\lambda = 0 \end{cases}$$
where $\Sigma_\rho$ is the n by n matrix with elements $L_j(\mathbb{K}(L_i)) + \rho w^{ij}$, T is the n by m matrix with elements $L_i(p_k)$, and D is the n by l matrix with elements $L_i(d_k)$.

As previously, it is important to consider the particular case when A is an evaluation operator and B is the identity operator ($\mathcal{B} = \mathcal{H}$). In that case, the semi-kernel is just a regular kernel and the solution is of the form
$$\sigma_\rho(t) = \sum_{i=1}^n \lambda_i K(t_i, t) + \sum_{j=1}^l \delta_j d_j(t),$$
where $\lambda$ and $\delta$ solve the system
$$\begin{cases} \Sigma_\rho\lambda + D\delta = a \\ D'\lambda = 0. \end{cases}$$
This is the result we have used in Theorem 49 of Chapter 2. Let us prove that Theorem 66 yields the result used in Theorem 51 of Chapter 2. Recall that B was the orthogonal projection $\Pi$ onto the orthogonal complement N of $\mathbb{P}_{m-1}$ in $\mathcal{H}_{K_G}$. To apply Theorem 66 in this context, one needs to exhibit a semi-kernel. Using the notations and assumptions of the Kriging model, we have the following theorem.

THEOREM 67 The function $G^*(t, s) = G(t - s) - \sum_{j=1}^n p_j(t)\, G(x_j - s)$ is a semi-kernel for $\mathcal{H}_{K_G}$ with the semi-norm $\|\Pi u\|^2_{K_G}$.


Proof. It is first easy to check that $K_G(x_i, t) = p_i(t)$. Therefore the subspace N is equal to
$$N = \{u : \langle u, p_i\rangle = 0,\ 1 \leq i \leq n\} = \{u : u(x_i) = 0,\ 1 \leq i \leq n\}.$$
Since $G^*(x_i, t_k) = 0$, the functions $G^*(\cdot, t_k)$ belong to N. It is also easy to check that if $\lambda$ is a generalized increment of order m - 1 (i.e. $\lambda$ is such that $\sum \lambda_i L_i \in Ker(\Pi)^0$), we have $\sum_{k=1}^n \lambda_k K_G(\cdot, t_k) = \sum_{k=1}^n \lambda_k G^*(\cdot, t_k)$. Therefore for any $u \in \mathcal{H}_{K_G}$, we have
$$\langle u, \sum_{k=1}^n \lambda_k K_G(\cdot, t_k)\rangle_{K_G} = \langle \Pi u, \Pi \sum_{k=1}^n \lambda_k G^*(\cdot, t_k)\rangle_{K_G},$$
which is the semi-reproducing property. ∎

The raw application of Theorem 66 proves that the spline minimizing (2.18) is of the form (3.5) where $\mathbb{K}(L_i) = G^*(\cdot, t_i)$. It is then easy to check that the solution is unchanged if one replaces $\sum \lambda_i G^*(\cdot, t_i)$ by $\sum \lambda_i G(\cdot - t_i)$ in (3.5), provided one replaces $L_j(\mathbb{K}(L_i))$ by $G(t_i - t_j)$ in the definition of $\Sigma_\rho$.

2.4. SOME CONCRETE SPLINES

2.4.1 $D^m$ SPLINES

First introduced by Schoenberg (1946), interpolating $D^m$ splines correspond to $\mathcal{H} = H^m(a, b)$, $Au = (u(t_1), \ldots, u(t_n))$, and the operator B from $\mathcal{H}$ to $L^2(a, b)$ given by $Bu = u^{(m)}$. Therefore $h_i = K(t_i, \cdot)$, where K is the reproducing kernel of the Sobolev space $H^m(a, b)$ endowed with the norm $\|u\|^2 = \sum_{k=0}^{m-1} u^{(k)}(a)^2 + \|Bu\|^2_{L^2(a,b)}$. In fact the first part of the norm is irrelevant (because it does not appear in the optimizing problem) and could be replaced, for example, by $\sum_{i=1}^m u^2(x_i)$ where $\{x_i\}$ is any unisolvent set in (a, b) (see Chapter 6), to yield a uniformly equivalent norm (see Exercise 14). That is why it may appear in different ways in the presentation of $D^m$ splines. Using Theorem 67, a semi-kernel is given by $E_m(s - t)$ where $E_m$ is a fundamental solution of the m-th iterated Laplacian (see Chapter 6). The following theorem, which establishes the piecewise nature of $D^m$ splines, is then an easy consequence of Theorem 59. It is generally attributed to Holladay (1957) for cubic splines and to de Boor (1963) for the general case.

THEOREM 68 Given n distinct points $t_1, \ldots, t_n$ in (a, b), n reals $a_1, \ldots, a_n$ and an integer m with $n \geq m$, the interpolating $D^m$ spline for the points $(t_i, a_i)$, $i = 1, \ldots, n$, is of the form
$$s(t) = \sum_{i=1}^n \lambda_i E_m(t - t_i) + \sum_{j=1}^{m} \gamma_j p_j(t)$$
where the $p_j$, $j = 1, \ldots, m$, form a basis of $\mathbb{P}_{m-1}$, $\lambda$ is a generalized increment of order m - 1, and $\lambda$ and $\gamma$ are solution to system (3.1) with $\Sigma_{ij} = E_m(t_i - t_j)$ and $T_{ij} = p_j(t_i)$. The solution is therefore a natural polynomial spline of order 2m.

Generalized increments have been introduced in Chapter 2, Definition 11. The fact that this polynomial spline is natural is demonstrated in Exercise 3 without using Theorem 59. Of course, more direct and simpler proofs of this fact can be found, as in Wahba (1991). Ahlberg et al (1967) or Green and Silverman (1994) use very elementary tools and do not mention reproducing kernels. The most frequent polynomial spline is the cubic spline of order 4, obtained for m = 2. The simplest form of $D^m$ smoothing splines is the minimizer of
$$\sum_{i=1}^n (f(t_i) - a_i)^2 + \rho \int_a^b (f^{(m)}(t))^2\, d\lambda(t) \qquad (3.6)$$
in $H^m(a, b)$, and the solution is characterized similarly.
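As an illustration, the following Python sketch builds system (3.1) for m = 2 (the cubic case), using the fundamental solution $E_2(t) = |t|^3/12$ of the twice-iterated Laplacian in one dimension; the data are made up for the example:

    import numpy as np

    def E2(t):
        # Fundamental solution of the twice-iterated Laplacian on R (m = 2).
        return np.abs(t) ** 3 / 12.0

    t = np.array([0.0, 0.2, 0.5, 0.8, 1.0])      # distinct knots t_i
    a = np.array([0.0, 0.9, 0.2, -0.7, 0.3])     # values a_i to interpolate
    n, m = len(t), 2

    Sigma = E2(t[:, None] - t[None, :])          # Sigma_ij = E_m(t_i - t_j)
    T = np.vander(t, m, increasing=True)         # T_ij = p_j(t_i), basis 1, t of P_1

    # Block system (3.1): Sigma lambda + T gamma = a, T' lambda = 0.
    top = np.hstack([Sigma, T])
    bottom = np.hstack([T.T, np.zeros((m, m))])
    sol = np.linalg.solve(np.vstack([top, bottom]), np.concatenate([a, np.zeros(m)]))
    lam, gam = sol[:n], sol[n:]

    def s(x):
        # Natural cubic interpolating spline of order 2m = 4.
        return E2(x[:, None] - t[None, :]) @ lam + np.vander(x, m, increasing=True) @ gam

    print(np.allclose(s(t), a))                  # True: interpolation conditions hold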

2.4.2 PERIODIC $D^m$ SPLINES

Also called trigonometric splines, they were first studied by Schoenberg (1964) and used in a statistical model by Wahba (1975c). Let $H^m_{per}(0, 1)$ be the periodic Sobolev space on (0, 1) (see Appendix). If $(u_k)$ denote the Fourier coefficients of a function $u \in H^m_{per}(0, 1)$, using the Plancherel formula and the formula relating the Fourier coefficients of a function to those of its derivatives (see Appendix), the $D^m$ energy measure can be written $\sum_{k=-\infty}^{+\infty} |(2\pi k)^m u_k|^2$. For the classical norm $\|u\|^2 = |u_0|^2 + \sum_{k=-\infty}^{+\infty} |(2\pi k)^m u_k|^2$, the space $H^m_{per}(0, 1)$ is a reproducing kernel Hilbert space with a translation invariant kernel (see Chapter 7). An interesting aspect of the corresponding smoothing splines is that in the case of an equispaced design, one can find closed form formulas for the Fourier coefficients of the estimates using finite Fourier transform tools (see Thomas-Agnan, 1990). Moreover, the eigenvalues and eigenfunctions defined in Exercise 1 can be computed explicitly, and one can show that the spline acts as a filter by downweighting the k-th frequency coefficient by the factor $(1 + \rho(2\pi k)^{2m})^{-1}$. Splines on the sphere can be defined in a similar fashion (see Wahba, 1981a) and are used in meteorology, where the sphere is the earth, as well as in medicine, where the sphere is the skull.
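The filtering interpretation is easy to demonstrate numerically. Here is a small Python sketch (equispaced design, illustrative two-frequency signal) applying the downweighting factor $(1 + \rho(2\pi k)^{2m})^{-1}$ to the discrete Fourier coefficients:

    import numpy as np

    n, m, rho = 128, 2, 1e-6
    t = np.arange(n) / n
    y = np.sin(2 * np.pi * t) + 0.5 * np.sin(2 * np.pi * 10 * t)

    k = np.fft.fftfreq(n, d=1.0 / n)            # integer frequencies
    filt = 1.0 / (1.0 + rho * (2 * np.pi * k) ** (2 * m))

    y_smooth = np.real(np.fft.ifft(filt * np.fft.fft(y)))
    # High frequencies are damped much more strongly than low ones:
    print(filt[np.abs(k) == 1][0], filt[np.abs(k) == 10][0])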


2.4.3 L SPLINES

The operator $D^m$ involved in the energy operator of the $D^m$ splines can be replaced by a more general differential operator L, yielding the energy measure $\int_a^b (Lf(t))^2\, d\lambda(t)$. Their piecewise nature can be proved, with each piece being a solution of the differential equation $L^*Lf = 0$. Details about the corresponding reproducing kernel Hilbert space and its kernel are found in Chapter 7. Some references for these splines are Schultz and Varga (1967), Kimeldorf and Wahba (1971), Schumaker (1981), Heckman (1997). Kimeldorf and Wahba (1970b) treat the particular case of a differential operator with constant coefficients. Heckman and Ramsay (2000) demonstrate the usefulness of using differential operators other than $D^m$.

2.4.4 α-SPLINES, THIN PLATE SPLINES AND DUCHON'S ROTATION INVARIANT SPLINES

The fact that $D^m$ interpolating or smoothing splines are natural polynomial splines of order 2m implies that the energy semi-norm can be rewritten as $\int_{-\infty}^{+\infty} (D^m f(t))^2\, d\lambda(t)$, since the m-th derivative of the spline is zero outside the interval defined by the boundary knots. One then needs to extend the functions of $H^m(a, b)$ outside (a, b), but it turns out that the adequate space is not $H^m(\mathbb{R})$, since natural splines are not square integrable on $\mathbb{R}$, but rather a space of the Beppo Levi type, denoted by $BL^m(L^2(\mathbb{R}))$. These spaces, introduced by Deny and Lions (1954), have been used in Duchon (1976, 1977) for thin plate splines and rotation invariant splines, and in Thomas-Agnan (1987) for α-splines. Their elements are tempered distributions. We refer the reader to Chapter 6 for details. The theory of the Fourier transform of tempered distributions allows to further transform the energy semi-norm into
$$\int_{-\infty}^{+\infty} \left| (2\pi i\omega)^m \hat{f}(\omega) \right|^2 d\lambda(\omega).$$
The same construction can be extended to dimension d. It is then natural to generalize this energy measure by replacing $(2\pi i\omega)^m$ by a more general weight of the form $\alpha(\omega)$, with suitable assumptions, and to adapt the definition of the Beppo Levi space accordingly. This space is constructed in Chapter 6 and leads to α-splines (Thomas-Agnan, 1987, 1991). The idea of building a smoothness measure on the asymptotic behavior of the Fourier transform is also found in Klonias (1984). Thin plate splines correspond to the case $\alpha(\omega) = 1$, and their energy semi-norm in $\mathbb{R}^d$ can also be written as
$$\sum_{|\beta| = m} \frac{m!}{\beta!}\, \| D^\beta f \|^2_{L^2(\mathbb{R}^d)},$$
where for a multi-index $\beta \in \mathbb{N}^d$ we denote by $|\beta|$ the sum $\sum_{i=1}^d \beta_i$, and by $D^\beta$ the differential operator $\frac{\partial^{|\beta|}}{\partial x_1^{\beta_1} \cdots \partial x_d^{\beta_d}}$. An example of their application

to meteorology is presented in Wahba and Wendelberger (1980). With a similar construction, Duchon (1977) defines rotation invariant splines. They would correspond to a weight function $\alpha(\omega) = |\omega|^r$ for a real $r > \frac{d}{2} - m$, but the α function is not allowed to vanish in the theory of α-splines. The advantage of that weight is that the corresponding interpolating method commutes with similarities, translations and rotations in $\mathbb{R}^d$. According to Laurent (1986), a semi-kernel for Duchon's splines is given by
$$K(s, t) = C\, \| t - s \|^{2m + 2r - d} \log(\| t - s \|)$$
if 2m + 2r - d is an even integer, and by
$$K(s, t) = C'\, \| t - s \|^{2m + 2r - d}$$
otherwise. C and C' are constants depending on m, r and d which are actually irrelevant, since they may be included in the coefficients $\lambda_i$. For example, for d = m = 2 and $r = \frac{1}{2}$, the spline can be written $\sigma(t) = \sum_{i=1}^n \lambda_i \| t - t_i \|^3 + p(t)$, where p is a polynomial of degree less than or equal to 1. Duchon's splines include as a special case pseudo-polynomial splines for $r = \frac{d-1}{2}$, with multi-conic functions for m = 1.

2.4.5 OTHER SPLINES

Schoenberg (1968) introduced interpolation conditions on the derivatives, with varying order from knot to knot, called Hermite splines or g-splines. Jerome and Schumaker (1969) combine L-splines with general interpolation conditions, including at the same time $D^m$-splines, L-splines and Hermite splines as special cases. Madych and Nelson (1988, 1990) introduce a method of multivariate interpolation in which the interpolants are linear combinations of translates of a prescribed continuous function on $\mathbb{R}^d$, conditionally of positive type of a given order. Their theory extends Duchon's theory and also includes thin plate splines and multiquadric surfaces as special cases. The construction shares many aspects with that of α-splines, but the precise relationships between them remain to be investigated. The description of the function spaces in terms of the weight function α seems to us easier to handle than in terms of the corresponding conditionally positive function. Spline functions can be constrained to satisfy such restrictions as monotonicity, convexity, or piecewise combinations of the latter. Several approaches are possible, e.g. Wahba (1973), Laurent (1980), Wright and Wegman (1980), Utreras (1985, 1987), Villalobos and Wahba (1987), Micchelli and Utreras (1988), Elfving and Anderson (1988), Delecroix et al (1995, 1996). A short overview of these methods can be found in Delecroix and Thomas-Agnan (2000).

3. RANDOM INTERPOLATING SPLINES

Note that in statistical applications the data a will usually be modelled as a random variable $a(\omega)$ defined on some probability space $(\Omega, \mathcal{E}, P)$. With the notations of Subsection 2.1, for each $\omega$ in $\Omega$ there exists by Theorem 57 a unique interpolating spline $S(\omega)$ such that
$$A(S(\omega)) = a(\omega).$$
The restriction $\tilde{A}$ of A to the set $\mathcal{S}$ of interpolating splines is linear, continuous and one-to-one from $\mathcal{S}$ to $A(\mathcal{S})$. Therefore its inverse $\tilde{A}^{-1}$ is continuous. Now
$$\forall \omega \in \Omega, \quad S(\omega) = \tilde{A}^{-1}(a(\omega)),$$
so that S defines a random variable on $(\Omega, \mathcal{E}, P)$ with values in the Hilbert space $\mathcal{H}$ endowed with its Borel σ-algebra. Using properties of weak or strong integrals (see Chapter 4), it is easy to see that the map S is P-integrable if a is P-integrable. The expectation of the random spline S is given by
$$E(S) = \tilde{A}^{-1}(E(a)).$$
Convergence theorems for spline functions will therefore imply asymptotic unbiasedness of spline estimates whenever suitable observations are available. Similar remarks can be made about other types of splines.

4. SPLINE REGRESSION ESTIMATION

Several types of regression models are considered in the literature. They are meant to describe the relationship between a response variable Y and an explanatory variable T. The variable T may be random, random with known marginal distribution, or deterministic. It is in the context of a deterministic explanatory variable that the properties of the spline regression estimates have been studied. We will restrict attention to the case when T is 1-dimensional. In the alternative, several estimation procedures also involve reproducing kernel Hilbert spaces and splines (thin plate splines, ANOVA). The observations $(t_i, Y_i)$ arise from the model $Y_i = r(t_i) + \epsilon_i$, where r is the regression function and $\epsilon_i$ is the residual. Assumptions on the residuals may vary, but the common base is that they have zero expectation, so that the regression function represents the mean of the response, and that they are uncorrelated. Two types of spline estimates have been introduced in that case: one is called least squares splines, or sometimes regression splines (a confusing vocabulary), and the other is called smoothing splines. A third, less frequent type is sometimes referred to as hybrid splines. Classical (parametric) regression imposes rigid constraints on the regression function, namely its belonging to a finite dimensional vector space, and then fits the data to a member of this class by least squares. Nonparametric regression models only make light assumptions on the regression function, generally of a smoothness nature. Without loss of generality, we may assume that the design points $t_i$ lie in a compact interval, say (0, 1). Repetitions (i.e. several observations with the same value of the design point $t_i$) are possible, in which case we will use the notation $Y_{ij}$ for the j-th observation at the i-th design point and $n_i$ for the number of repetitions at the i-th design point. It is easy to see that a naive estimator consisting in interpolating the empirical means of the response at the distinct values of the design would be mean square inconsistent unless the number of repetitions tends to infinity. Some smoothing device is necessary in the no (or little) repetition case, the repetitions at a point being replaced by the neighboring values, which are alike by a continuity assumption.

4.1. LEAST SQUARES SPLINE ESTIMATORS

Least squares (LS hereafter) spline estimators make use of spline functions from the piecewise approach and hence make little use of reproducing kernel Hilbert space theory. However, we describe them shortly to dispel the confusion often encountered between them and smoothing splines. They can be derived from polynomial regression by replacing, in the least squares principle, spaces of polynomials by spaces of splines, which present a better local sensitivity to coefficient values. To define the least squares spline estimator, one must choose an integer p (the order), an integer k and a set of k knots $z_j$ such that $0 < z_1 < \ldots < z_k < 1$. The least squares spline estimator is then the spline s in $S_p(z_1, \ldots, z_k)$ that best fits the data by least squares. The smoothing parameter of this procedure is complex in the sense that it comprises the order p, the number of knots k and the position of the knots $z_1, \ldots, z_k$. The sensitivity to the order of the spline is not prevalent, so that in general cubic splines are used, which means p = 4. If the knots are positioned regularly (either linearly spaced or at the quantiles of the design points), the only parameter left is the number of knots. A more complex approach consists in optimizing the position of the knots with respect to some criterion (de Boor, 1978). For k = 0, the estimator coincides with a polynomial of degree p - 1. For the largest possible value k = n - p, the estimator is a spline interpolant. The LS spline estimator is consistent and achieves the best possible rates of convergence with respect to integrated mean square error in the Sobolev class of mean functions, provided the number of knots tends to infinity (Agarwal and Studden, 1980). This optimal rate is $n^{-\frac{2m}{2m+1}}$ for a mean function r in $H^m(0, 1)$. Multivariate Adaptive Regression Splines (Friedman, 1991) generalize this tool to the case of a multidimensional explanatory variable.
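A minimal Python sketch of an LS spline fit (SciPy's B-spline least squares routine; the data and the knot placement at design quantiles are illustrative):

    import numpy as np
    from scipy.interpolate import LSQUnivariateSpline

    rng = np.random.default_rng(2)
    t = np.sort(rng.uniform(size=200))
    y = np.sin(4 * np.pi * t) + 0.3 * rng.standard_normal(200)

    # Cubic case p = 4 (SciPy degree k = 3) with interior knots at quantiles.
    knots = np.quantile(t, [0.2, 0.4, 0.6, 0.8])
    s = LSQUnivariateSpline(t, y, knots, k=3)

    print(s(np.linspace(0.05, 0.95, 5)))   # fitted LS spline values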

4.2. SMOOTHING SPLINE ESTIMATORS

If the finite dimensional spaces of polynomials or of polynomial splines are replaced by an infinite dimensional space in the least squares principle, it is easy to see that the solution interpolates the points, and we have already dismissed that kind of estimator. It is then natural to penalize the LS criterion, which leads to smoothing splines as defined in Subsection 2.2. More particularly, it is classical to penalize the least squares term $\sum_{i=1}^n (Y_i - f(t_i))^2$ by $\int_0^1 f^{(m)}(t)^2\, d\lambda(t)$, and this leads to natural polynomial spline estimators as in Section 2.4.1 (Wahba, 1990) when there are no repetitions and when the variance of the residual is constant. In case of heteroscedasticity or even correlations among residuals, a positive definite matrix W reflecting the correlation structure can be incorporated in the least squares term (Kimeldorf and Wahba, 1970b). Exercise 4 is a proof of Theorem 62 adapted to that case with the analog of system (3.3). In case of repetitions, Theorem 68 does not apply any more because of the assumption of distinct design points, but it is easy to see that the problem of minimizing
$$\sum_{i} \sum_{j=1}^{n_i} (Y_{ij} - f(t_i))^2 + \rho \int_0^1 (f^{(m)}(t))^2\, d\lambda(t)$$
is then equivalent to the problem of minimizing
$$\sum_{i} n_i (\bar{Y}_i - f(t_i))^2 + \rho \int_0^1 (f^{(m)}(t))^2\, d\lambda(t),$$
where $\bar{Y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$. It is therefore enough to use a diagonal weight matrix as mentioned above. When the smoothing parameter $\rho$ tends to 0, the smoothing spline estimator converges pointwise a.s. to the interpolating spline, and when the smoothing parameter $\rho$ tends to infinity, the smoothing spline estimator converges pointwise a.s. to the polynomial estimator of degree less than or equal to m - 1. Speckman (1985) proves that under some regularity conditions on the asymptotic design, the Mean Integrated Square Error weighted by the density of the design converges to 0 provided the smoothing parameter $\rho$ tends to 0 and $n\rho$ tends to infinity, and that for a suitable sequence of smoothing parameters $\rho$, the smoothing spline estimator achieves the optimal rate of convergence in the Sobolev class of regression functions. The choice of the smoothing parameter can be made by the method of generalized cross-validation, and the asymptotic properties of this procedure are established (Wahba, 1977). Several authors compare the smoothing spline estimator to a Nadaraya-Watson type kernel estimator and prove their "asymptotic equivalence" for a specific kernel: Cogburn and Davis (1974) for periodic L-splines, Silverman (1984) for $D^m$ splines, Thomas-Agnan (1990) for periodic α-splines. This kernel is given by the Fourier transform of $(1 + \omega^{2m})^{-1}$ for $D^m$ splines and of $(1 + \lambda\alpha^2(\omega))^{-1}$ for α-splines.
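The equivalence in the repetition case rests on the identity $\sum_j (Y_{ij} - f(t_i))^2 = \sum_j (Y_{ij} - \bar{Y}_i)^2 + n_i(\bar{Y}_i - f(t_i))^2$, whose first term does not involve f. A two-line numerical check (made-up numbers):

    import numpy as np

    Yi = np.array([1.3, 0.7, 1.1])            # n_i = 3 repetitions at one design point
    f_ti = 0.9                                # any candidate value f(t_i)
    lhs = np.sum((Yi - f_ti) ** 2)
    rhs = np.sum((Yi - Yi.mean()) ** 2) + len(Yi) * (Yi.mean() - f_ti) ** 2
    print(np.isclose(lhs, rhs))               # True, independently of f_ti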

4.3. HYBRID SPLINES

In the least squares spline approach, the amount of smoothing is tuned by the number of knots and their location, whereas in the smoothing spline approach, it is performed by the parameter $\rho$. In the first approach with random knots, computational cost is important and there is a chance of missing the optimal location with an inappropriate initial position choice. A hybrid formulation consists in minimizing the objective function defining the smoothing spline in an approximating subspace of splines with given knots, which are a priori different and fewer than the design points. These so-called hybrid splines have been used by Kelly and Rice (1990), Luo and Wahba (1997). They allow an automatic allocation of knots, with more knots near sharp features as well as where there are more observations.


4.4. BAYESIAN MODELS


It is possible to interpret the smoothing spline regression estimator as a bayesian estimate when the mean function r(·) is given an improper prior distribution. As in Wahba (1990), assume that in the regression model of Section 4, the $r(t_i)$ are in fact realizations of a random process $X_t$ at the design points, and model $X_t$ as
$$X_t = \sum_{k=1}^m \theta_k t^{k-1} + \frac{\sigma}{\sqrt{n\gamma}}\, Z_t,$$
where $\gamma > 0$, $Z_t$ is the (m-1)-fold integrated Wiener process (see (2.22) in Chapter 1) with proper covariance function given by
$$R(s, t) = \int_0^1 \frac{(s - u)_+^{m-1}}{(m-1)!}\, \frac{(t - u)_+^{m-1}}{(m-1)!}\, d\lambda(u),$$
and $\theta = (\theta_1, \ldots, \theta_m)$ is a gaussian random vector with mean zero, covariance matrix $aI_m$, and independent from $\epsilon$ and from Z. Given this a priori distribution on $X_t$, the bayesian predictor for $X_t$ based on $Y = (Y_1, \ldots, Y_n)$ is given by $E(X_t \mid Y)$, and the following lemma shows that when a tends to infinity, this predictor, as a function of t, converges pointwise to a spline.

THEOREM 69 Under the above assumptions,
$$\lim_{a \to \infty} E(X_t \mid Y) = s_\gamma(t),$$
where $s_\gamma(t)$ is the spline solution of
$$\min_{f \in H^m(0,1)}\ \frac{1}{n}\sum_{i=1}^n (Y_i - f(t_i))^2 + \gamma \int_0^1 (f^{(m)}(t))^2\, d\lambda(t). \qquad (3.7)$$

Proof. Let T be the n × m matrix with elements $t_i^{k-1}$ for $k = 1, \ldots, m$ and $i = 1, \ldots, n$. Let $\Sigma$ be the n × n matrix with elements $R(t_i, t_j)$, $i, j = 1, \ldots, n$. Since $(X_t, Y)$ is a gaussian vector, we have that
$$E(X_t \mid Y) = E(X_t) + Cov(X_t, Y)\, \mathrm{var}(Y)^{-1}(Y - E(Y)),$$
where
$$\mathrm{var}(Y) = aTT' + \frac{\sigma^2}{n\gamma}\Sigma + \sigma^2 I_n.$$
Let $M = \Sigma + n\gamma I_n$ and let $\Sigma_t = (R(t, t_1), \ldots, R(t, t_n))'$. Then
$$E(X_t \mid Y) = \left[\frac{\sigma^2}{n\gamma}\Sigma_t' + a(1, t, \ldots, t^{m-1})\, T'\right]\left(aTT' + \frac{\sigma^2}{n\gamma}M\right)^{-1} Y$$
$$= \frac{an\gamma}{\sigma^2}(1, t, \ldots, t^{m-1})\, T'\left(\frac{an\gamma}{\sigma^2}TT' + M\right)^{-1} Y + \Sigma_t'\left(\frac{an\gamma}{\sigma^2}TT' + M\right)^{-1} Y.$$
Let $\eta = \frac{an\gamma}{\sigma^2}$. It is easy to see that
$$(\eta TT' + M)^{-1} = M^{-1} - M^{-1}T(T'M^{-1}T)^{-1}\left(I + \frac{1}{\eta}(T'M^{-1}T)^{-1}\right)^{-1} T'M^{-1}.$$
On the other hand, by Exercise 3, the spline solution of (3.7) is given by
$$s_\gamma(t) = (1, t, \ldots, t^{m-1})\, d + \Sigma_t' c,$$
with $c = M^{-1}(Y - Td)$ and $d = (T'M^{-1}T)^{-1}T'M^{-1}Y$. Therefore it is enough to prove that $\lim_{\eta \to \infty} \eta T'(\eta TT' + M)^{-1} = (T'M^{-1}T)^{-1}T'M^{-1}$ and that $\lim_{\eta \to \infty} (\eta TT' + M)^{-1} = M^{-1}(I_n - T(T'M^{-1}T)^{-1}T'M^{-1})$. This results from Taylor expansions with respect to $1/\eta$ of these expressions in the neighborhood of 0. ∎
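The limit in Theorem 69 is easy to check numerically. The following Python sketch (data, γ and the large value of a are all invented, $\sigma^2$ is set to 1, and the closed form of R for m = 2 is used) compares the bayesian predictor at the design points with the spline values $\Sigma c + Td$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, gamma = 8, 2, 1e-3
    t = np.sort(rng.uniform(size=n))
    y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

    def R(s, u):
        # Covariance of the once-integrated Wiener process (m = 2):
        # R(s,u) = min(s,u)^2 * (3*max(s,u) - min(s,u)) / 6.
        lo, hi = np.minimum(s, u), np.maximum(s, u)
        return lo ** 2 * (3 * hi - lo) / 6.0

    Sigma = R(t[:, None], t[None, :])
    T = np.vander(t, m, increasing=True)          # columns 1, t
    M = Sigma + n * gamma * np.eye(n)

    # Spline coefficients c and d from the proof of Theorem 69.
    d = np.linalg.solve(T.T @ np.linalg.solve(M, T), T.T @ np.linalg.solve(M, y))
    c = np.linalg.solve(M, y - T @ d)

    # Bayesian predictor at the design points for a very large prior variance a.
    eta = 1e9 * n * gamma                         # eta = a * n * gamma / sigma^2
    post = (Sigma + eta * T @ T.T) @ np.linalg.solve(eta * T @ T.T + M, y)
    print(np.max(np.abs(post - (Sigma @ c + T @ d))))   # tiny: the limits agree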

Wahba (1981b) derives confidence intervals for smoothing spline estimates based on the posterior covariance in this bayesian model. As we saw in Section 2.4.1, since the smoothing spline is independent of the particular inner product chosen on $\mathbb{P}_{m-1}$, it is interesting to ask whether this invariance property carries over to the bayesian interpretation. Van der Linde (1992) proves that the usual statistical inferences, and particularly the smoothing error, are invariant with respect to this choice. As Huang and Lu (2001) point out, the mean square error in Wahba's approach is averaged over the distribution of the $\theta$ parameter, whereas in the BLUP approach, conditioning is done on a fixed value of $\theta$. Coming back to the problem of approximation of a bounded linear functional on a RKHS $\mathcal{H}$, Larkin (1972) shows that optimal approximation can be interpreted as maximum likelihood in a Hilbert space of normally distributed functions. Before giving details about his model, let us describe the general philosophy of his approach. The first idea is to try to define a gaussian distribution on $\mathcal{H}$ such that small norm elements are a priori more likely to be chosen than those of large norm. Considering then the joint distribution of the known and the unknown quantities, it is possible to derive the conditional distribution of the required values given the data. The posterior mode then provides the desired approximation. The advantage of this approach, compared to the traditional and formally equivalent approximation approach, is that one gets simultaneously error bounds based on the a posteriori dispersion. These error bounds turn out to be more operational than the traditional ones (see the hypercircle inequality in Larkin (1972)) because they are in terms of computable quantities. However, he encounters a difficulty in the first step of the procedure. The Gaussian measure that he can define by formula (3.8) below on the ring of cylinder sets of an infinite dimensional Hilbert space $\mathcal{H}$ does not admit any countably additive extension to the Borel σ-algebra of $\mathcal{H}$. This is the reason for the qualifier weak gaussian distribution used in that case. More details about this problem will be given in Chapter 4. For the moment, ignoring this difficulty, let us define a cylinder set measure ν by
$$\nu(\{h \in \mathcal{H} : (\langle h_1, h\rangle, \ldots, \langle h_n, h\rangle) \in E\}) = \left(\frac{\rho}{\pi}\right)^{n/2} |G|^{-1/2} \int_E \exp(-\rho\, y'G^{-1}y)\, d\lambda(y), \qquad (3.8)$$
where n is any positive integer, $h_1, \ldots, h_n$ are n linearly independent elements of $\mathcal{H}$ with Gram matrix G, E is any Borel subset of $\mathbb{R}^n$, and $\rho > 0$ is an a priori dispersion parameter. The cylinder set measure ν induces a joint gaussian distribution on any finite set of values $Y = (Y_j = \langle h_j, h\rangle)_{j=1,\ldots,n}$, with realizations $y_j$, the density of Y being given by
$$p(y) = \left(\frac{\rho}{\pi}\right)^{n/2} |G|^{-1/2} \exp(-\rho\, y'G^{-1}y) = \left(\frac{\rho}{\pi}\right)^{n/2} |G|^{-1/2} \exp(-\rho\, \|h^*\|^2),$$

where $h^*$ is the interpolating spline corresponding to the data y, the measurement operator $Ah = (\langle h_1, h\rangle, \ldots, \langle h_n, h\rangle)'$ and the energy norm $\|h\|_{\mathcal{H}}$.

THEOREM 70 Given $h_0 \in \mathcal{H}$, the conditional density function of $Y_0 = \langle h_0, h\rangle$ given $Y = y$ is then
$$p(y_0 \mid y_1, \ldots, y_n) = \left(\frac{\rho}{\pi}\right)^{1/2} \|\hat{h}\|^{-1} \exp\{-\rho\, \|\hat{h}\|^{-2} (y_0 - \langle h_0, h^*\rangle)^2\}, \qquad (3.9)$$
where $h^*$ is the optimal approximant of $Y_0$ based on Ah, and $\hat{h}$ is the element of least norm in $\mathcal{H}$ satisfying the interpolating conditions
$$\langle h_i, \hat{h}\rangle = 0 \ \text{for } i = 1, \ldots, n, \quad\text{and}\quad \langle h_0, \hat{h}\rangle = 1.$$


The definition of optimal approximant was given in Section 2.1. Let us just outline the proof of this result. Using the density formula (3.8), it is possible to write the joint density of $(Y_0, Y_1, \ldots, Y_n)$. By classical properties of gaussian distributions, it is then possible to write the conditional density of $Y_0$ given $Y = y$. Straightforward calculations together with results about the optimal approximant and the interpolating splines lead to the given form of this conditional density. Confidence intervals on the optimal approximant are derived with the following result, which is an application of Cochran's theorem.

THEOREM 71 With the above assumptions, the quantity
$$t = \frac{n^{1/2}}{\|\hat{h}\|\ \|h^*\|}\ |Y_0 - \langle h_0, h^*\rangle|$$
is distributed as Student's t with n degrees of freedom.
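In a finite dimensional toy case these objects are easy to compute. The sketch below (all numbers invented) takes $\mathcal{H} = \mathbb{R}^5$ with the Euclidean inner product, builds $h^*$ and $\hat{h}$ as minimum norm solutions of their respective interpolating conditions, and forms the Student ratio of Theorem 71:

    import numpy as np

    rng = np.random.default_rng(4)
    H = rng.standard_normal((3, 5))      # rows: h_1, h_2, h_3 (n = 3)
    h0 = rng.standard_normal(5)
    y = np.array([0.4, -1.0, 0.8])       # observed values <h_i, h>

    def min_norm(A, b):
        # Minimum norm x with A x = b: x = A'(AA')^{-1} b.
        return A.T @ np.linalg.solve(A @ A.T, b)

    h_star = min_norm(H, y)                              # interpolating spline h*
    h_hat = min_norm(np.vstack([H, h0]),                 # <h_i,.> = 0, <h0,.> = 1
                     np.array([0.0, 0.0, 0.0, 1.0]))

    y0 = 0.1                                             # candidate value of <h0, h>
    t_stat = np.sqrt(len(y)) / (np.linalg.norm(h_hat) * np.linalg.norm(h_star)) \
             * abs(y0 - h0 @ h_star)
    print(t_stat)    # compare with Student t quantiles with n degrees of freedom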

This model can be extended to encompass the case of noisy observations and allows one to give a probabilistic interpretation to Tychonov regularization. Let us also mention that Larkin (1983) derives from this a method for the choice of the smoothing parameter based on the maximum likelihood principle.

5. SPLINE DENSITY ESTIMATIONSpline density estimation is the problem of estimating the value f(x)

of an unknown continuous density f at a point x, from a sample Xl, .. . ,X n from the distribution with density f with respect to Lebesgue mea­sure. Spline density estimation, although less popular than spline re­gression estimation, has developed in several directions: histosplines,maximum penalized likelihood (MPL), logsplines . .. .Boneva, Kendall and Stefanov (1971) introduce the histospline densityestimate based on the idea of interpolating the sample cumulative dis­tribution function by interpolating splines. Variants of this estimateare also studied in Wahba (1975a), Berlinet (1979, 1981). Let h > 0be a smoothing parameter satisfying *= 1+ 1 where 1 is a positiveinteger. For the sake of simplicity, we restrict attention to a densityf with support in the interval (0,1) . Let Nj be the fraction of ele­ments in the sample falling between jh and (j + 1)h. The BKS es­timate of the density is defined to be the unique function 9 in theSobolev space H1(0, 1) which minimizes Jd g'(t)2d>.(t) under the con-

straints Jj~+l)hg(t)d>..(t) = hilj = 0, . .. , 1. The corresponding cumu­lative distribution function is then the unique function G in H2(0 , 1)which minimizes Jd G"(t)2d>.(t) under the constraints G(O) = 0, and


$G(jh) = \sum_{i=0}^{j-1} N_i$, $j = 1, \ldots, l+1$.
Wahba (1975a) introduces additional constraints on $G'$ to improve boundary behavior. Wahba (1975b) shows that the resulting estimate achieves the optimal rate of convergence in the Sobolev class, which is $O(n^{-\frac{2m-2}{2m-1}})$.
An application of Theorem 59 (combined with Theorem 9 of Chapter 1) allows one to establish the existence, uniqueness and nature (parabolic splines) of the Wahba estimate, for which explicit expressions are developed in Exercise 5. Berlinet (1979, 1981) proves for his estimate several convergence results including uniform almost complete convergence with bounds on the rate of convergence and asymptotic normality. Substituting the empirical cumulative distribution function by its spline interpolant results in a loss of information, unless the knots correspond to the sample points. This last case involves difficult theoretical computations due to the random nature of the knots and yields poor practical results in some cases due to extreme distances between knots. In general, the corresponding spline interpolant estimate of a density is not a density function and may be negative on large intervals.
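The BKS estimate is thus the solution of a quadratic program: a quadratic roughness functional minimized under linear bin constraints. The following numerical sketch (ours, not from the book; the grid discretization, bin count and test sample are illustrative) solves a discretized version of this program through its KKT system:

```python
import numpy as np

def histospline(sample, n_bins=10, n_grid=501):
    """Discretized BKS histospline on (0,1): minimize int g'^2 subject to
    int_{jh}^{(j+1)h} g = N_j (fraction of the sample falling in bin j)."""
    h = 1.0 / n_bins
    t = np.linspace(0.0, 1.0, n_grid)
    dt = t[1] - t[0]
    N = np.histogram(sample, bins=n_bins, range=(0.0, 1.0))[0] / len(sample)
    # first-difference matrix approximating g'
    D = (np.eye(n_grid, k=1)[:-1] - np.eye(n_grid)[:-1]) / dt
    # Riemann weights for the bin constraints A g = N
    A = np.zeros((n_bins, n_grid))
    for j in range(n_bins):
        A[j] = ((t >= j * h) & (t < (j + 1) * h)) * dt
    A[-1, -1] += dt                      # include the endpoint t = 1
    Q = D.T @ D * dt                     # quadratic form for int g'^2
    # KKT system: [2Q  A'; A  0] [g; lam] = [0; N]
    K = np.block([[2 * Q, A.T], [A, np.zeros((n_bins, n_bins))]])
    rhs = np.concatenate([np.zeros(n_grid), N])
    return t, np.linalg.solve(K, rhs)[:n_grid]

t, g = histospline(np.random.default_rng(0).beta(2.0, 3.0, size=400))
```

Nothing in the program forces the solution to be nonnegative; as noted above, on regions with few observations the estimate may well go below zero.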

Log-spline models correct this last problem by using finite spaces of cubic splines to model the log-density instead of the density itself (Stone and Koo, 1986). Kooperberg and Stone (1991) introduce refinements of this method: in order to avoid spurious details in the tails, they use splines which are linear outside the design interval, resulting in an exponential fit in the tails, and they propose a knot selection procedure based on a variant of the AIC criterion.
Maximum penalized likelihood (MPL) density estimation was introduced by Good and Gaskins (1971). Starting from the fact that maximum likelihood in infinite dimensional spaces results in an unsmooth solution made of Dirac spikes, a natural idea is to penalize the log-likelihood. Assuming again that the unknown density $f$ has compact support $(0,1)$, for a class of smooth functions $\mathcal{F}$, a positive real parameter $\lambda$, and a penalty functional $\Phi : \mathcal{F} \to \mathbb{R}^+$, a general MPL estimator is a minimizer of $\sum_{i=1}^n -\log(g(X_i)) + \lambda\Phi(g)$ among density functions $g$ in $\mathcal{F}$. The Good and Gaskins estimator enters in this framework as an estimator of the square root of the density for a particular choice of space $\mathcal{F}$ and of penalty functional $\Phi$ and results in a positive exponential spline (see Schumaker, 1981) with knots at the data points (see Tapia and Thompson (1978) and Exercise 12). De Montricher, Tapia and Thompson (1975) propose an alternative choice of $\mathcal{F}$ and $\Phi$ which results in a polynomial spline with knots at the sample points (see Tapia and Thompson, 1978). Log-spline models can also be penalized as in Silverman (1982). Wahba, Lin and Leng (2001) extend this idea to the multi-


variate case using a hybrid spline approach and ANOVA decomposition of multivariate functions (see Chapter 5), the interesting feature being that the presence or absence of interaction terms determines the conditional dependencies. Gu (1995) extends the method of penalized likelihood density estimation to the estimation of conditional densities, exploiting again the ANOVA decomposition of multivariate functions (see Chapter 5). Maechler (1996) introduces an original roughness penalty aimed at restricting the number of modes and inflection points.
Before concluding this section, let us mention that splines have also been used for other functional parameters, as in Wahba and Wold (1975) for the log-spectral density.

6. SHAPE RESTRICTIONS IN CURVEESTIMATION

The problem of taking into account shape restrictions such as monotonicity or convexity arises in a variety of models. For example, in econometrics, cost functions and production functions are known to be concave from economic theory. Delecroix and Thomas-Agnan (2000) review the shape restricted estimators based on kernels or splines. Reproducing kernel Hilbert spaces appear naturally here with the use of splines, but also in a different way in the projection method suggested by Delecroix et al (1995 and 1996).
It is easy to incorporate shape restrictions in the minimization problem defining the smoothing splines. If the restriction is described by a cone $C \subset H^m(a,b)$, as is the case for monotonicity or convexity, it is enough to minimize (3.6) in the cone instead of minimizing it in the whole space $H^m(a,b)$. From the theoretical point of view, the problem of existence and uniqueness of the solutions is in general not so hard (see for example Utreras, 1985 and 1991), but the complexity of the actual computing algorithms, when they exist, drastically depends upon whether the number of restrictions is infinite or has been discretized to a finite number. We refer the reader to Delecroix and Thomas-Agnan (2000) for more details and references.
Delecroix et al (1995 and 1996) introduce a two-step procedure with a smoothing step followed by a projection step. It applies to any shape restriction described by a closed and convex cone in a given Hilbert space. The smoothing step must result in an initial estimate that belongs to that Hilbert space and is preferably consistent in the sense of its norm. The principle of the method is that projecting the initial estimate onto the cone then yields a restricted and consistent estimate. A practical implementation relies on a choice of space, norm and initial estimate and an algorithm for computing the projection. They propose to settle the


last point by discretizing the cone and then solving a quadratic optimization problem with linear constraints, as can be seen in Exercise 11. For the former point, they give an operational choice of space and estimator by working in a Sobolev space $H^m(0,1)$ endowed with a non classical norm that will be studied in more detail in Chapter 6, Section 1.6.1. The initial estimate can then be chosen as a convolution type estimator as well as a smoothing spline estimator. Mammen and Thomas-Agnan (1999) prove that constrained smoothing splines as in Utreras (1985) achieve optimal rates in shape restricted Sobolev classes and that projecting ordinary smoothing splines as in Delecroix et al (1995 and 1996) is asymptotically equivalent to using constrained smoothing splines.

7. UNBIASED DENSITY ESTIMATION

We consider here the problem of estimating the value $f(x)$ of an unknown continuous density $f$ at a point $x$, from independent identically distributed observations $X_1, X_2, \ldots, X_n$ having density $f$ with respect to the Lebesgue measure $\lambda$. We know from the Bickel-Lehmann theorem (1969) that if it were possible to estimate $f(x)$ unbiasedly then an unbiased estimate based on one observation would exist (it is a consequence of the linearity of the derivation of measures). This means that there would exist a function $K(\cdot, x)$ such that

$$E\,K(X, x) = \int K(y, x) f(y)\, d\lambda(y) = f(x),$$

where $X$ has density $f$. In other words the function $K$ would have a reproducing property in the set of possible densities. More precisely we have the following theorem (Bosq, 1977a, 1977b, Bosq and Lecoutre, 1987).

THEOREM 72 Suppose that the vector space $\mathcal{H}$ spanned by the set $\mathcal{D}$ of possible densities with respect to the measure $\nu$ is included in $L^2(\nu)$. In order that there exists an estimate $K(X,x)$ of $f(x)$ satisfying

$$\forall x \in \mathbb{R}, \quad K(\cdot, x) \in \mathcal{H} \quad\text{and}\quad E(K(X,x)) = f(x), \qquad (3.10)$$

where $X$ has density $f$, it is necessary and sufficient that $\mathcal{H}$, endowed with the inner product of $L^2(\nu)$, be a pre-Hilbert space with reproducing kernel $K$.

Proof. If $\mathcal{H}$ is a pre-Hilbert subspace of $L^2(\nu)$ with reproducing kernel $K$, then

$$E(K(X,x)) = \int K(y,x) f(y)\, d\nu(y) = \langle f, K(\cdot,x)\rangle_{\mathcal{H}} = f(x)$$


and (3.10) is satisfied. Conversely, if (3.10) is satisfied we have the above equalities for any element $f$ of $\mathcal{D}$ and, by linearity,

$$\forall \varphi \in \mathcal{H}, \quad \langle \varphi, K(\cdot,x)\rangle_{\mathcal{H}} = \varphi(x) = \int K(y,x)\,\varphi(y)\, d\nu(y)$$

which gives the conclusion. •

8. KERNELS AND HIGHER ORDER KERNELS

In nonparametric curve estimation a kernel is usually understood as a bounded measurable function integrating to one. The role of the kernel is to smooth the data, the degree of smoothness varying with some real parameter $h$ called bandwidth or window-width. Indeed the smoothing "parameter" is the couple $(K,h)$, where $K$ is the kernel. The real $h$ depends on the sample size. Both $K$ and $h$ may depend on the data or/and on the point of estimation. To be more precise consider the simple example of density estimation from a sequence $(X_i)_{i\in\mathbb{N}}$ of real-valued independent random variables with common unknown density $f$. Consider the standard Akaike-Parzen-Rosenblatt kernel estimate (Akaike (1954), Parzen (1962b), Rosenblatt (1956))

$$f_n(x) = \frac{1}{n h_n} \sum_{i=1}^n K\left(\frac{x - X_i}{h_n}\right),$$

where $(h_n)_{n\in\mathbb{N}}$ is a sequence of positive real numbers tending to zero and $K$ is a bounded measurable function integrating to one. The expectation of $f_n(x)$ is

$$E f_n(x) = \frac{1}{h_n} \int_{\mathbb{R}} K\left(\frac{x - v}{h_n}\right) f(v)\, d\lambda(v),$$

hence, by a change of variable and the fact that $K$ integrates to one, one gets the bias

$$E f_n(x) - f(x) = \int_{\mathbb{R}} [f(x - h_n u) - f(x)]\, K(u)\, d\lambda(u).$$

If the $p$th order derivative of $f$ ($p \geq 2$) exists and if $K$ has finite moments up to order $p$, a Taylor series expansion gives

$$E f_n(x) - f(x) = \sum_{k=1}^{p-1} h_n^k\, \frac{(-1)^k}{k!}\, f^{(k)}(x) \int_{\mathbb{R}} u^k K(u)\, d\lambda(u) + O(h_n^p). \qquad (3.11)$$


Formula (3.11) shows that the asymptotic bias is reduced whenever the first moments of $K$ vanish. This motivates the following definition of higher order kernels.

DEFINITION 24 Let $p \geq 2$. A bounded measurable function $K$ integrating to one is said to be a kernel of order $p$ if and only if

$$\int_{\mathbb{R}} u K(u)\, d\lambda(u) = \int_{\mathbb{R}} u^2 K(u)\, d\lambda(u) = \cdots = \int_{\mathbb{R}} u^{p-1} K(u)\, d\lambda(u) = 0$$

and $\int_{\mathbb{R}} u^p K(u)\, d\lambda(u)$ is finite and non null.

For a kernel of order $p$ the bias in formula (3.11) reduces to $O(h_n^p)$.
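As an illustration, here is a minimal sketch (ours, not from the book; the order-4 Gram-Charlier kernel used below is introduced later in this section, and all names are our own): it checks the moment conditions of Definition 24 numerically and plugs the higher order kernel into the Akaike-Parzen-Rosenblatt estimate.

```python
import numpy as np
from scipy.integrate import quad

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # normal density
K4  = lambda u: (3 - u**2) / 2 * phi(u)                   # kernel of order 4

# Definition 24: int K = 1, moments 1..3 vanish, moment 4 finite and non null
for j in range(5):
    m_j, _ = quad(lambda u: u**j * K4(u), -np.inf, np.inf)
    print(f"moment {j}: {m_j:+.6f}")     # +1, 0, 0, 0, -3 (approximately)

def f_n(x, sample, h, K=K4):
    """Akaike-Parzen-Rosenblatt estimate (1/nh) sum_i K((x - X_i)/h)."""
    x = np.atleast_1d(x)[:, None]
    return K((x - sample[None, :]) / h).mean(axis=1) / h

X = np.random.default_rng(0).normal(size=500)
print(f_n([0.0, 1.0], X, h=0.5))
```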

Many kernels used in curve estimation are symmetric densities with finite moment of order 2. Such densities are kernels of order 2. Now let us see how higher order kernels are connected with reproducing kernels. For this a new characterization of kernels of order $p$ is useful. Recall that, $r$ being a nonnegative integer, we denote by $\mathbb{P}_r$ the space of polynomials of degree at most $r$.

LEMMA 17 A bounded measurable function $K$ integrating to one is a kernel of order $p$ if and only if

$$\forall P \in \mathbb{P}_{p-1}, \quad \int_{\mathbb{R}} P(u) K(u)\, d\lambda(u) = P(0)$$

and $\int_{\mathbb{R}} u^p K(u)\, d\lambda(u) = C_p \neq 0$.

The first property above tells that, by means of higher order kernels, one can represent the evaluation functional

$$\mathbb{P}_{p-1} \to \mathbb{R}, \quad P \mapsto P(0)$$

as an integral functional.
Proof of Lemma 17. Let $P \in \mathbb{P}_{p-1}$. Let us suppose that $K$ has finite moments up to order $p$ and expand $P$ in Taylor series. This gives

$$\int P(u) K(u)\, d\lambda(u) = \sum_{i=0}^{p-1} \frac{P^{(i)}(0)}{i!} \int u^i K(u)\, d\lambda(u).$$

The last sum is equal to $P(0)$ whenever $K$ is a kernel of order $p$. To prove the converse take $P$ equal to the monomial $u^i$, $1 \leq i \leq (p-1)$.


Now let us prove that standard kernels are products of densities with polynomials. For this a definition is needed.

DEFINITION 25 A real function $g$ is said to have a change of sign at a point $z$ if there is $\eta > 0$ such that $g(x)$ does not vanish and keeps a fixed sign on $]z - \eta, z[$, and $g(x)$ does not vanish and keeps the opposite sign on $]z, z + \eta[$.

THEOREM 73 Let $K$ be an integrable function (non equal to 0 almost everywhere) with a finite number $N \geq 1$ of sign changes at distinct (ordered) points $z_1, z_2, \ldots, z_N$ at which it vanishes and is differentiable. If $K$ keeps a fixed sign on each of the intervals $]-\infty, z_1[$, $]z_1, z_2[$, \ldots, $]z_N, \infty[$ then there is a constant $A$ and a density $K_0$ such that

$$\forall x \in \mathbb{R}, \quad K(x) = A\, K_0(x) \prod_{i=1}^N (x - z_i).$$

Proof. On the intervals $]-\infty, z_1[$, $]z_1, z_2[$, \ldots, $]z_N, \infty[$, the function $K$ and the polynomial $\prod_{i=1}^N (x - z_i)$ have either the same sign or the opposite sign. So we can choose $\varepsilon$ in $\{-1, 1\}$ so that

$$\varepsilon\, K(x) \prod_{i=1}^N (x - z_i)$$

be a nonnegative function. Now let $H$ be the function defined as follows:

$$H(x) = \varepsilon\, K(x) \prod_{i=1}^N (x - z_i)^{-1}, \quad x \notin \{z_1, \ldots, z_N\},$$

$$H(z_j) = \varepsilon\, K'(z_j) \prod_{1 \leq i \leq N,\ i \neq j} (z_j - z_i)^{-1} \quad\text{for } j = 1, \ldots, N,$$

where $K'$ is the derivative of $K$. $H$ is nonnegative on $\mathbb{R} \setminus \{z_1, z_2, \ldots, z_N\}$. It is continuous at the points $z_1, z_2, \ldots, z_N$ where $K$ vanishes since

$$\forall j \in \{1, \ldots, N\}, \quad \lim_{x \to z_j} \frac{K(x)}{x - z_j} = K'(z_j).$$

Moreover $K$ is integrable and, for $|x|$ large enough, the function

$$x^N \prod_{i=1}^N (x - z_i)^{-1}$$


is bounded. Hence $H$ has a finite moment of order $N$. The integral of $H$ cannot be 0, otherwise $K$ would be 0 almost everywhere. Thus

$$K_0 = \frac{H}{\int_{\mathbb{R}} H\, d\lambda}$$

is a density and

$$\forall x \in \mathbb{R}, \quad K(x) = \varepsilon\, K_0(x) \prod_{i=1}^N (x - z_i) \int_{\mathbb{R}} H\, d\lambda.$$

•
We have just proved that any reasonable kernel to be used in curve estimation can be written as a product

$$P(x)\, K_0(x) \qquad (3.12)$$

where $P$ is a polynomial and $K_0$ a density. Let us now characterize kernels of order $(r+1)$ among kernels of the form (3.12).

THEOREM 74 Let $P$ be a polynomial of degree at most $r$, $r \geq 1$, let $K_0$ be a density with finite moments up to order $(2r+1)$ and $\mathcal{K}_r$ be the reproducing kernel of $\mathbb{P}_r$ in $L^2(K_0\lambda)$. Then $P(x)K_0(x)$ is a kernel of order $(r+1)$ if and only if

$$\forall x \in \mathbb{R},\ P(x) = \mathcal{K}_r(x, 0) \quad\text{and}\quad \int_{\mathbb{R}} x^{r+1} P(x) K_0(x)\, d\lambda(x) = C_{r+1} \neq 0.$$

Proof. Suppose that $P(x)K_0(x)$ is a kernel of order $(r+1)$ and let $R$ be a polynomial in $\mathbb{P}_r$. Applying Lemma 17 one gets

$$\int R(x) P(x) K_0(x)\, d\lambda(x) = R(0) = \int R(x)\, \mathcal{K}_r(x,0)\, K_0(x)\, d\lambda(x),$$

hence

$$\int R(x)\, [P(x) - \mathcal{K}_r(x,0)]\, K_0(x)\, d\lambda(x) = 0.$$

Thus $[P(\cdot) - \mathcal{K}_r(\cdot,0)]$ is orthogonal to $\mathbb{P}_r$ and the necessary condition follows. The converse is obvious by Lemma 17.

•
The following important property of higher order kernels can be derived from Theorem 73 and Theorem 74.


Kernels of order $(r+1)$, $r \geq 1$, can be written as products

$$\mathcal{K}_r(x,0)\, K_0(x)$$

where $K_0$ is a probability density function and $\mathcal{K}_r(\cdot,\cdot)$ is the reproducing kernel of the subspace $\mathbb{P}_r$ of $L^2(K_0\lambda)$.

This property will be extended in Section 11.
Examples of higher order kernels

• The favourite density in smoothing problems is the normal density

$$K_0(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).$$

The associated higher order kernels are called Gram-Charlier kernels. Here are the polynomials which can be multiplied by $K_0$ to get the kernels of order 2 to 12. They are easily deduced from standard Hermite polynomials. Figures 3.1 and 3.2 show the graphs of the first three of them.

orders       polynomials
2            $1$
3 and 4      $(3 - x^2)/2$
5 and 6      $(15 - 10x^2 + x^4)/8$
7 and 8      $(105 - 105x^2 + 21x^4 - x^6)/48$
9 and 10     $(945 - 1260x^2 + 378x^4 - 36x^6 + x^8)/384$
11 and 12    $(10395 - 17325x^2 + 6930x^4 - 990x^6 + 55x^8 - x^{10})/3840$
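These polynomials can be generated mechanically. The sketch below (ours; it assumes, as in Wand and Schucany's gaussian-based kernels, that the order-$(2k+2)$ polynomial equals $\sum_{s=0}^{k} (-1)^s (2^s s!)^{-1} He_{2s}(x)$, where $He_j$ are the probabilists' Hermite polynomials) reproduces the table entries:

```python
from math import factorial
import numpy as np
from numpy.polynomial import hermite_e as H, polynomial as P

def gram_charlier_poly(order):
    """Power-basis coefficients (increasing powers of x) of the polynomial
    multiplying the normal density in the kernel of the given even order."""
    k = order // 2 - 1
    coefs = np.zeros(1)
    for s in range(k + 1):
        he_2s = H.herme2poly([0.0] * (2 * s) + [1.0])  # He_{2s} in power basis
        coefs = P.polyadd(coefs, (-1) ** s / (2 ** s * factorial(s)) * he_2s)
    return coefs

print(gram_charlier_poly(4))   # [ 1.5   0.  -0.5 ]             -> (3 - x^2)/2
print(gram_charlier_poly(6))   # [ 1.875 0. -1.25 0. 0.125 ]    -> (15 - 10x^2 + x^4)/8
```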


Figure 3.1: The first three Gram-Charlier kernels over $[-3,3]$.

Figure 3.2: The first three Gram-Charlier kernels over [2.5,4].

• Another widely used family of kernels is the one associated with the Epanechnikov density. They are polynomials restricted to $[-1,1]$. The first six are listed in the following table and the first three plotted in Figure 3.3.

orders       Epanechnikov higher order kernels
2            $3(1 - x^2)/4$
3 and 4      $(45 - 150x^2 + 105x^4)/32$
5 and 6      $(525 - 3675x^2 + 6615x^4 - 3465x^6)/256$
7 and 8      $(11025 - 132300x^2 + 436590x^4 - 540540x^6 + 225225x^8)/4096$
9 and 10     $(218295 - 4002075x^2 + 20810790x^4 - 44594550x^6 + 42117075x^8 - 14549535x^{10})/65536$
11 and 12    $(2081079 - 54108054x^2 + 405810405x^4 - 1314052740x^6 + 2080583505x^8 - 1588809222x^{10} + 468495027x^{12})/524288$


Figure 3.3: The first three Epanechnikov kernels.

To save computation time it is interesting to consider piecewise linear or quadratic higher order kernels (Berlinet and Devroye, 1994). Such a kernel of order 4 is plotted in Figure 3.4. Its graph is included in the union of two parabolas (see Exercise 20).

Figure 3.4: A piecewise quadratic kernel of order 4 and the rescaled Epanechnikov kernel of the same order.


9. LOCAL APPROXIMATION OF FUNCTIONS

To analyze the behavior of a function $\varphi$ in the neighborhood of some fixed point $x$ one usually attempts to approximate it by means of simple functions such as polynomials, logarithms, exponentials, etc. For instance polynomial functions appear in truncated Taylor series expansions. The problem is to write $\varphi(x+u)$ as a function of $u$, say $f_x(u)$, plus a remainder on which we have good enough information (for instance we know that it tends to 0 at some rate when $u$ tends to 0). The function $f_x$ belongs to some given family $\mathcal{F}$ of functions. One can try to minimize some distance or divergence between $\varphi(x+\cdot)$ and $f_x(\cdot)$. We will consider here a local $L^2$ distance. The criterion to minimize will be of the form

$$\int_{-h}^{h} (\varphi(x+u) - f_x(u))^2\, d\lambda(u)$$

or equivalently

$$\int \mathbb{1}_{(-1,1)}\left(\frac{u}{h}\right) (\varphi(x+u) - f_x(u))^2\, d\lambda(u)$$

where the positive real number $h$ determines the neighborhood of $x$ on which the deviation between $\varphi(x+\cdot)$ and $f_x$ is taken into account in the $L^2$ sense. For this reason, $h$ is called the approximation bandwidth or window-width. More generally one can choose as a weight function a probability density $K_0$ and consider the criterion

$$\int K_0\left(\frac{u}{h}\right) (\varphi(x+u) - f_x(u))^2\, d\lambda(u)$$

also equal to

$$\int K_0\left(\frac{z-x}{h}\right) (\varphi(z) - f_x(z-x))^2\, d\lambda(z)$$

and, up to the factor $h$, to

$$\int K_0(v)\, (\varphi(x+hv) - f_x(hv))^2\, d\lambda(v). \qquad (3.13)$$

Suppose now that $\varphi(x+hv)$, as a function of $v$, belongs to the space $L^2(K_0\lambda)$ of square integrable functions with respect to the measure $K_0\lambda$, that is the set of measurable functions $f$ such that $\int f^2 K_0\, d\lambda$ is finite, endowed with the inner product

$$\langle f, g\rangle = \int f g\, K_0\, d\lambda.$$


Suppose moreover that the family $\mathcal{F}$ of possible functions

$$\mathbb{R} \to \mathbb{R}, \quad v \mapsto f_x(hv)$$

is equal to a hilbertian subspace $V$ of $L^2(K_0\lambda)$ with reproducing kernel $K$ (spaces of polynomials or of trigonometric polynomials are often chosen as space $V$). Then, there is a unique element $f_x(h\cdot)$ in $V$ minimizing criterion (3.13); it is the projection $\Pi_V(\varphi(x+h\cdot))$ of the function $\varphi(x+h\cdot)$ onto $V$. One of the major interests of our framework is the possibility of giving explicit representations for the values of the optimizer $f_x(h\cdot)$. As we have

$$\varphi(x+h\cdot) = \Pi_V(\varphi(x+h\cdot)) + \big(\varphi(x+h\cdot) - \Pi_V(\varphi(x+h\cdot))\big)$$

we can write, by orthogonality,

$$\forall v, \quad \langle \varphi(x+h\cdot), K(\cdot,v)\rangle = \langle \Pi_V(\varphi(x+h\cdot)), K(\cdot,v)\rangle = \Pi_V(\varphi(x+h\cdot))(v) = f_x(hv).$$

Thus,

$$\forall v, \quad f_x(hv) = \int \varphi(x+hu)\, K(u,v)\, K_0(u)\, d\lambda(u) \qquad (3.14)$$
$$\qquad\qquad = \int \Pi_V(\varphi(x+h\cdot))(u)\, K(u,v)\, K_0(u)\, d\lambda(u).$$

When the functions of $V$ have derivatives of order $m$, $K(\cdot,v)$ as a member of $V$ obviously shares this property. Assuming that it is possible to interchange derivation and integration (for this it is sufficient that $K(\cdot,v)$ and its derivatives up to order $m$ be bounded), we have for any $v$

$$\frac{d^m}{dv^m} f_x(hv) = \int \varphi(x+hu)\, \frac{\partial^m K(u,v)}{\partial v^m}\, K_0(u)\, d\lambda(u) \qquad (3.15)$$
$$\qquad\qquad = \int \Pi_V(\varphi(x+h\cdot))(u)\, \frac{\partial^m K(u,v)}{\partial v^m}\, K_0(u)\, d\lambda(u).$$

This extends the reproducing property to evaluation of derivatives. Let us summarize the beginning of the present section into the following theorem.

THEOREM 75 Let $K_0$ be a probability density function, let $h > 0$ and $x$ be fixed real numbers and let $\varphi$ be a function such that $\varphi(x+h\cdot)$ belongs to $L^2(K_0\lambda)$. Let $V$ be a hilbertian subspace of $L^2(K_0\lambda)$ with reproducing kernel $K$. Then, the minimization problem

$$\min_{f_x(h\cdot) \in V} \int K_0(v)\, (\varphi(x+hv) - f_x(hv))^2\, d\lambda(v)$$

has a unique solution. Moreover this solution satisfies (3.14) and, if the ad hoc conditions are satisfied, it also satisfies (3.15).

The particular case where $V$ is equal to the space $\mathbb{P}_r$ ($r \geq 0$) of polynomials of degree at most $r$ has been widely investigated. Indeed, local estimation of a function by low order polynomials arises in many applied sciences. An early reference for example is Woolhouse (1870). With polynomial spaces we can go further with representation (3.15) by using Taylor expansions.
If $K_0$ has finite moments up to order $2r$, then $\mathbb{P}_r$ is a reproducing kernel Hilbert subspace of $L^2(K_0\lambda)$, just like any finite dimensional subspace of functions.
Let $\mathcal{K}_r^{(0)}(\cdot,\cdot)$ be the reproducing kernel of $\mathbb{P}_r$. Let $m \in \{0, \ldots, r\}$ and let

$$\mathcal{K}_r^{(m)}(x,y) = \frac{\partial^m \mathcal{K}_r^{(0)}(x,y)}{\partial y^m}.$$

Then for any sequence $(P_i)_{0 \leq i \leq r}$ of $(r+1)$ orthonormal polynomials in $L^2(K_0\lambda)$ ($P_i$ being of exact degree $i$), we have by Theorem 14

$$\mathcal{K}_r^{(0)}(x,y) = \sum_{i=0}^r P_i(x) P_i(y)$$

and therefore

$$\mathcal{K}_r^{(m)}(x,y) = \sum_{i=0}^r P_i(x)\, P_i^{(m)}(y) = \sum_{i=m}^r P_i(x)\, P_i^{(m)}(y)$$

where $P_i^{(m)}$ is the derivative of order $m$ of $P_i$. The second expression above follows from the fact that each $P_i$ is of exact degree $i$. The polynomial $\mathcal{K}_r^{(m)}(\cdot,y)$ represents in $\mathbb{P}_r$ the derivation of order $m$. We have even more, as stated in the following theorem.

THEOREM 76

$$\forall \varphi \in L^2(K_0\lambda), \quad \int_{\mathbb{R}} \varphi(x)\, \mathcal{K}_r^{(m)}(x,y)\, K_0(x)\, d\lambda(x) = \frac{d^m \big(\Pi_r(\varphi)\big)}{dy^m}(y)$$

where $\Pi_r$ is the projection from $L^2(K_0\lambda)$ onto $\mathbb{P}_r$.


Proof. Let $Q(x) = \sum_{i=0}^r a_i P_i(x)$ be any polynomial of degree at most $r$. We have

$$\int \left(\sum_{i=0}^r a_i P_i(x)\right) \left(\sum_{j=0}^r P_j^{(m)}(y) P_j(x)\right) K_0(x)\, d\lambda(x) = \sum_{i=0}^r \sum_{j=0}^r a_i\, P_j^{(m)}(y) \int P_i(x) P_j(x) K_0(x)\, d\lambda(x)$$
$$= \sum_{i=0}^r a_i\, P_i^{(m)}(y) = Q^{(m)}(y).$$

Now, let $\varphi \in L^2(K_0\lambda)$ and $\Pi_r(\varphi)$ be the projection of $\varphi$ onto $\mathbb{P}_r$. As $\mathcal{K}_r^{(m)}(\cdot,y)$ lies in $\mathbb{P}_r$,

$$\int \varphi(x)\, \mathcal{K}_r^{(m)}(x,y)\, K_0(x)\, d\lambda(x) = \int \Pi_r(\varphi)(x)\, \mathcal{K}_r^{(m)}(x,y)\, K_0(x)\, d\lambda(x) = \frac{d^m \big(\Pi_r(\varphi)\big)}{dy^m}(y).$$
•
Using Theorem 76 it is now easy to particularize Theorem 75 to local polynomial $L^2$-approximation.

THEOREM 77 Let $K_0$ be a probability density function with finite moments up to order $2r$, let $h > 0$ and $x$ be fixed real numbers and let $\varphi$ be a function such that $\varphi(x+h\cdot)$ belongs to $L^2(K_0\lambda)$. Then, the minimization problem

$$\min_{f_x(h\cdot) \in \mathbb{P}_r} \int K_0(v)\, (\varphi(x+hv) - f_x(hv))^2\, d\lambda(v)$$

has a unique solution. Moreover this solution $f_x(h\cdot)$ is such that

$$\forall v, \quad \frac{d^m}{dv^m} f_x(hv) = \int \varphi(x+hu)\, \mathcal{K}_r^{(m)}(u,v)\, K_0(u)\, d\lambda(u), \quad 0 \leq m \leq r.$$

Therefore, setting $a^{(m)} = f_x^{(m)}(0)$,


the polynomial $f_x$ can be expanded as

$$f_x(t) = a^{(0)} + \sum_{m=1}^r a^{(m)}\, \frac{t^m}{m!}$$

with

$$a^{(m)} = \frac{1}{h^m} \int \varphi(x+hu)\, \mathcal{K}_r^{(m)}(u,0)\, K_0(u)\, d\lambda(u), \quad 0 \leq m \leq r.$$


When the function $\varphi$ has derivatives up to order $(r+1)$ in a neighborhood of $x$, a Taylor expansion

$$\varphi(x+t) = \varphi(x) + \sum_{m=1}^r \varphi^{(m)}(x)\, \frac{t^m}{m!} + O(t^{r+1})$$

shows that every coefficient $a^{(m)}$ in the expansion of $f_x$ should be very close to $\varphi^{(m)}(x)$. This property will be used in the next section to consistently estimate derivatives of functionals of distribution functions.
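Numerically, the coefficients $a^{(m)}$ come out of a weighted least squares fit. The sketch below (ours; the grid, the normal weight and the choice $r = 3$ are illustrative) discretizes the criterion of Theorem 77 and recovers the first derivatives of a smooth function:

```python
import numpy as np
from math import factorial

def local_poly_derivatives(phi, x, h, r=3, half_width=5.0, n_grid=2001):
    """Minimize sum_i K0(v_i) (phi(x + h v_i) - f_x(h v_i))^2 over P_r,
    with f_x(t) = sum_m a_m t^m / m!; returns (a_0, ..., a_r)."""
    v = np.linspace(-half_width, half_width, n_grid)
    w = np.sqrt(np.exp(-v**2 / 2))           # sqrt of the weight K0 (normal)
    V = np.column_stack([(h * v) ** m / factorial(m) for m in range(r + 1)])
    a, *_ = np.linalg.lstsq(V * w[:, None], phi(x + h * v) * w, rcond=None)
    return a                                  # a[m] should be close to phi^(m)(x)

print(local_poly_derivatives(np.sin, x=0.3, h=0.2))
# approximately [sin(0.3), cos(0.3), -sin(0.3), -cos(0.3)]
```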

The kernels $\mathcal{K}_r^{(m)}$ appearing in the expression of the solution to the above minimization problem generalize the higher order kernels defined in Section 8. To see this let us extend Definition 24 in the following way.

DEFINITION 26 Let $p \geq 2$ and $m \leq (p-2)$. A measurable function $K$ is said to be a kernel of order $(m,p)$ if and only if

$$\int x^j K(x)\, d\lambda(x) = \begin{cases} 0 & \text{for } j \in \{0, \ldots, p-1\} \text{ and } j \neq m \\ m! & \text{for } j = m \\ C \neq 0 & \text{for } j = p \end{cases}$$

The methodology developed in the present section, and therefore kernels of order $(m,p)$, are used to estimate $m$th derivatives of statistical functionals with a reduced bias (typically of order $h^p$, $h$ being the window-width). This will be made clear in the next section. A kernel of order $(0,p)$ is simply a kernel of order $p$ in the sense of Definition 24. Now, taking $\varphi$ equal to the monomial $x^j$ in Theorem 76 one gets

$$\int x^j K_r^{(m)}(x)\, d\lambda(x) = \int x^j\, \mathcal{K}_r^{(m)}(x,0)\, K_0(x)\, d\lambda(x) = \left.\frac{d^m x^j}{dx^m}\right|_{x=0}.$$

As we have

$$\frac{d^m x^j}{dx^m} = \begin{cases} 0 & \text{if } j < m \\ m! & \text{if } j = m \\ \frac{j!}{(j-m)!}\, x^{j-m} & \text{if } j > m \end{cases}$$


we get, for $1 \leq m \leq r-1$ and $0 \leq j \leq r$,

$$\int x^j K_r^{(m)}(x)\, d\lambda(x) = \begin{cases} 0 & \text{if } j < m \\ m! & \text{if } j = m \\ 0 & \text{if } j > m \end{cases}$$

Hence, if $m \leq (r-1)$ and if

$$\int x^{r+1} K_r^{(m)}(x)\, d\lambda(x) \neq 0$$

then $K_r^{(m)}$ is a kernel of order $(m, r+1)$; $K_r^{(0)}$ is a kernel of order $(0, r+1)$, or simply of order $(r+1)$. If moreover

$$\int x^{r+1} K_r^{(r)}(x)\, d\lambda(x) = 0$$

then $K_r^{(r)}$ is a kernel of order $(r, q)$ if $q$ is the smallest integer greater than $(r+1)$ such that $\int x^q K_r^{(r)}(x)\, d\lambda(x)$ is finite and non null. Such an integer $q$ does not necessarily exist. Properties of kernels of order $(m,p)$ are studied in Section 11.
Let us first see how local approximation of functions can be applied to statistical functionals of interest.

10. LOCAL POLYNOMIAL SMOOTHING OF STATISTICAL FUNCTIONALS

The present section follows the main lines of the first part of a paper by Abdous, Berlinet and Hengartner (2002). We show that most of the kernel estimates (including the standard Akaike-Parzen-Rosenblatt density estimate, at the origin of an impressive literature in the second half of the twentieth century) are solutions to local polynomial smoothing problems. Indeed we present a general framework for estimating smooth functionals of probability distribution functions, such as the density, the hazard rate function, the mean residual time, the Lorenz curve, the spectral density, the tail index, the quantile function and many others.
For any probability distribution function $F$ on $\mathbb{R}$ denote by $\Phi(x,F)$ the collection of functionals of interest, indexed by $x \in \mathbb{R}$. We assume that for each fixed $x$, the functional $\Phi(x,F)$ is defined for all distribution functions $F$, so that we can estimate $\Phi(x,F)$ by substituting an estimator $F_n$ for $F$. In many applications,

$$F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{(-\infty,x]}(X_i)$$


is the empirical distribution based on an $n$-sample $X_1, \ldots, X_n$ from $F$. If for a fixed distribution $F$, the function $\Phi(\cdot,F)$ has $r$ continuous derivatives, we propose to estimate them by using the local approximation methodology developed in the preceding section, replacing the function $\varphi$ with the function $\Phi(\cdot,F_n)$ in Theorem 77. The criterion to minimize is

$$\int K_0\left(\frac{z-x}{h}\right) \left\{\Phi(z,F_n) - \sum_{m=0}^r \frac{a_m(x)}{m!}\, (z-x)^m\right\}^2 d\lambda(z), \qquad (3.16)$$

where $K_0$ is a given probability density. Indeed, in a broad variety of examples, the minimizing vector

$$(\hat{a}_0(x), \hat{a}_1(x), \ldots, \hat{a}_r(x))$$

of (3.16) can be shown to estimate consistently the vector

$$(\Phi(x,F), \Phi^{(1)}(x,F), \Phi^{(2)}(x,F), \ldots, \Phi^{(r)}(x,F)).$$

Cleveland (1979) introduced local linear regression smoothers as a data visualization tool. Recent work has exhibited many desirable statistical properties of these smoothers and has contributed to an increase of their popularity. For example, Lejeune (1985) showed that these regression smoothers do not suffer from edge effects and Fan (1992, 1993) showed them to be not only design adaptive, but also essentially minimax among all linear smoothers. We refer to the book of Fan and Gijbels (1997) for a nice introduction to these smoothers and their statistical properties.
Density estimation is a special case of considerable interest which corresponds to taking

$$\Phi(x,F) = F(x).$$

Given $n$ independent identically distributed observations, let $F_n(x)$ denote the empirical distribution and consider the minimization of

$$\int K_0\left(\frac{z-x}{h}\right) \left\{F_n(z) - \sum_{m=0}^r \frac{a_m(x)}{m!}\, (z-x)^m\right\}^2 d\lambda(z). \qquad (3.17)$$

Estimates for the density are obtained by setting $\hat{f}(x) = \hat{a}_1(x)$ and correspond to a Parzen-Rosenblatt kernel density estimator, with the kernel belonging to the hierarchy of higher order kernels introduced in the above section.
Others have considered fitting local polynomials for estimating densities. Hjort and Jones (1996) and Loader (1996) fit local polynomials to


the logarithm of the density, and Cheng et al (1997) have proposed to estimate the probability density by local polynomial regression to histograms. Closer in spirit to our formulation is the paper by Lejeune and Sarda (1992) who fitted local polynomials to the empirical distribution to estimate smooth distribution functions.
Our formulation also covers estimation of the hazard rate function, which is obtained by setting

$$\Phi(x,F) = -\log(1 - F(x)).$$

Let $S_n = 1 - F_n$ be an estimate of the survival function. We propose estimating the hazard rate function

$$\lambda(x) = \frac{f(x)}{1 - F(x)}$$

by minimizing

$$\int K_0\left(\frac{z-x}{h}\right) \left\{-\log S_n(z) - \sum_{m=0}^r \frac{a_m(x)}{m!}\, (z-x)^m\right\}^2 d\lambda(z) \qquad (3.18)$$

and setting $\hat{\lambda}(x) = \hat{a}_1(x)$. This point of view was adopted by Brunel (1999).
Let us now describe more precisely some statistical problems where our general formulation can be successfully applied. Some of the obtained estimators are known, others are new.

10.1. DENSITY ESTIMATION IN SELECTION BIAS MODELS.

WEIGHTED DISTRIBUTIONS.
Weighted distributions or selection biased models arise in many fields, e.g., missing data, survey sampling, damaged observations, sociological studies, reliability theory, economics, etc (see Patil et al (1988) and references therein). Let $Y$ be a nonnegative random variable with distribution function $F$ and probability density $f$. Suppose that we do not observe $Y$ but rather a different random variable $X$ with distribution function $G$ and density function $g$ related to $f$ as follows

$$g(x) = \frac{w(x) f(x)}{\mu_w}, \quad x > 0,$$

where $w(x) > 0$ is known, and

$$\mu_w = \int_0^{\infty} w(x) f(x)\, d\lambda(x) < \infty.$$


The aim is to estimate the density $f$ from a random sample $X_1, \ldots, X_n$ from $G$. For this, we set $\Phi(x,F) = F(x)$ and consider the estimate of the distribution function $F$

$$F_n(x) = \hat{\mu}_w\, \frac{1}{n} \sum_{i=1}^n \frac{\mathbb{1}_{(X_i \leq x)}}{w(X_i)},$$

where

$$\hat{\mu}_w = \left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(X_i)}\right)^{-1}.$$

Since $\Phi(x,F_n) = F_n(x)$, an estimate for the density $f$ and higher order derivatives are obtained by minimizing

$$\int \frac{1}{h}\, K_0\left(\frac{z-x}{h}\right) \left\{F_n(z) - \sum_{k=0}^r a_k(x)\, (z-x)^k\right\}^2 d\lambda(z)$$

and setting

$$\hat{f}^{(k)}(x) = (k+1)!\, \hat{a}_{k+1}(x).$$

Following Theorem 77, these estimators are of the form

$$\hat{f}^{(k)}(x) = \frac{\hat{\mu}_w}{n\, h^{k+1}} \sum_{i=1}^n \frac{1}{w(X_i)}\, \bar{K}^{[k+1,r]}\left(\frac{X_i - x}{h}\right),$$

where

$$\bar{K}^{[m,r]}(u) = \int_u^{\infty} K^{[m,r]}(v)\, dv$$

and $K^{[m,r]}(v) = \mathcal{K}_r^{(m)}(v,0)\,K_0(v)$ denotes the kernel of order $(m, r+1)$ of the $K_0$-based hierarchy.

DENSITY ESTIMATION.
A special case of particular interest is when $w(x) = 1$, that is, we have direct observations from the distribution $F$. In this case, $F_n$ is the empirical distribution function and the density estimator

$$\hat{f}(x) = \frac{1}{n} \sum_{j=1}^n \frac{1}{h}\, \bar{K}^{[1,r]}\left(\frac{X_j - x}{h}\right) \qquad (3.19)$$

is the well known Parzen-Rosenblatt estimator.
NONPARAMETRIC RATIO ESTIMATION.
Let $G$ be a known distribution with density $g(x) > 0$. We are interested


in estimating the ratio of the densities $f/g$ from an i.i.d. sample from $F$. For this, consider the family of functionals

$$\Phi(x,F) = \int_{-\infty}^x g^{-1}(z)\, dF(z).$$

If $F_n$ is the empirical distribution function, then

$$\Phi(x,F_n) = \frac{1}{n} \sum_{i=1}^n \frac{\mathbb{1}(Y_i \leq x)}{g(Y_i)},$$

and following Theorem 77, an estimator for the ratio $f/g = \Phi'(x,F)$ is

$$\widehat{\frac{f}{g}}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{g(Y_i)}\, \frac{1}{h}\, \bar{K}^{[1,r]}\left(\frac{Y_i - x}{h}\right)$$

where

$$\bar{K}^{[1,r]}(u) = \int_u^{\infty} K^{[1,r]}(v)\, dv.$$

Multiplying the latter by $g(x)$ produces the nonparametric density estimator using a parametric start of Hjort and Glad (1995). This estimator has smaller mean squared error than the usual Parzen-Rosenblatt kernel smoother for distributions $F$ that are close to $G$.

10.2. HAZARD FUNCTIONS

Hazard functions are of prime importance in survival analysis and reliability. Their estimation is easily handled by our procedure. Let the survival time $X$ and the censoring time $Y$ be independent with distributions $F_0$ and $H$, respectively. We observe $\Delta = \mathbb{1}_{(X \leq Y)}$ and $Z = \min(X,Y)$, the survival function of the latter being

$$S(x) = (1 - F_0(x))(1 - H(x)).$$

Functions of the form

$$\eta(x) = \frac{(1 - H(x))\, f_0(x)}{Q(x)},$$

where $f_0(x)$ is the probability density of $X$ and $Q(x)$ is some positive function, are interesting target functions. For instance, the classical hazard function (or failure rate) corresponds to the particular case in which

$$Q(x) = S(x) = (1 - F_0(x))(1 - H(x))$$


is the survival function of $Z$, while for $Q(x) = (1 - H(x))$ one retrieves a density estimation problem. As suggested by Patil et al (1994), consider the cumulative target function

$$\Phi(x; F_0, H) = \Phi(x; \bar{F}, Q) = \int_0^x \eta(u)\, d\lambda(u) = \int_0^x \frac{d\bar{F}(u)}{Q(u)},$$

with $\bar{F}(x) = P[Z \leq x, \Delta = 1]$, so that $d\bar{F}(u) = (1 - H(u))\, dF_0(u)$. Given the sample $(Z_1,\Delta_1), \ldots, (Z_n,\Delta_n)$, let $\bar{F}_n$ and $Q_n$ be the empirical counterparts to $\bar{F}$ and $Q$, respectively, so that

$$\Phi(x; \bar{F}_n, Q_n) = \int_0^x \frac{d\bar{F}_n(u)}{Q_n(u)}$$

and Theorem 77 can be used to derive local polynomial estimates of

$$\int_0^x \eta(u)\, d\lambda(u)$$

and its derivatives.

and its derivatives.For the hazard function , one has that

j x dR (u)<I> (x; Fo, H) = _ ~ ( )d'x'(u) = -log(l - Fo(x) .

-00 1 0 u

The last function is the cumulative hazard function denoted by $\Lambda(x)$. We estimate it by

$$\Lambda_n(x) = -\log(1 - \tilde{F}_n(x)),$$

where, to avoid indeterminate evaluations of the logarithm, we use a slight modification of the empirical distribution function:

$$\tilde{F}_n(x) = \frac{1}{n+1} \sum_{i=1}^n \mathbb{1}_{(X_i \leq x)}.$$

Local polynomial estimates of the cumulative hazard rate function $\Lambda(x)$ and its derivatives are of the form

$$\widehat{\Lambda}^{(m)}(x) = -\frac{1}{h^{m+1}} \int \log(1 - \tilde{F}_n(z))\, K_r^{(m)}\left(\frac{z-x}{h}\right) d\lambda(z).$$

Let $X_{(1)}, \ldots, X_{(n)}, X_{(n+1)} = \infty$ denote the order statistics. As

$$\tilde{F}_n(z) = \frac{i}{n+1} \quad\text{if}\quad X_{(i)} \leq z < X_{(i+1)},$$


we have

$$\widehat{\Lambda}^{(m)}(x) = -\frac{1}{h^{m+1}} \sum_{i=1}^n \left[\log\left(1 - \frac{i}{n+1}\right) \int_{X_{(i)}}^{X_{(i+1)}} K_r^{(m)}\left(\frac{z-x}{h}\right) d\lambda(z)\right],$$

the last expression being valid even when equality occurs between elements of the order statistics. Setting

$$\bar{K}_r^{(m)}(t) = \int_t^{\infty} K_r^{(m)}(u)\, d\lambda(u)$$

we can write, for $1 \leq i \leq (n-1)$,

$$\frac{1}{h} \int_{X_{(i)}}^{X_{(i+1)}} K_r^{(m)}\left(\frac{z-x}{h}\right) d\lambda(z) = \bar{K}_r^{(m)}\left(\frac{X_{(i)} - x}{h}\right) - \bar{K}_r^{(m)}\left(\frac{X_{(i+1)} - x}{h}\right).$$

Therefore $\widehat{\Lambda}^{(m)}(x)$ can be written as

$$-\frac{1}{h^m} \left\{\sum_{i=2}^n \left[\log\left(\frac{n+1-i}{n+1}\right) - \log\left(\frac{n+2-i}{n+1}\right)\right] \bar{K}_r^{(m)}\left(\frac{X_{(i)} - x}{h}\right) + \log\left(\frac{n}{n+1}\right) \bar{K}_r^{(m)}\left(\frac{X_{(1)} - x}{h}\right)\right\}.$$

Finally

$$\widehat{\Lambda}^{(m)}(x) = -\frac{1}{h^m} \sum_{i=1}^n \log\left(\frac{n+1-i}{n+2-i}\right) \bar{K}_r^{(m)}\left(\frac{X_{(i)} - x}{h}\right).$$

These estimates generalize the estimates of the hazard function introduced by Rice and Rosenblatt in 1976, but note that the kernel $\bar{K}_r^{(m)}$ is not necessarily integrable.

10.3. RELIABILITY AND ECONOMETRIC FUNCTIONS.

Let $X$ be a nonnegative random variable with a continuous distribution function $F$ and a finite mean $\mu$. There are various transforms of $F$ which are of great importance in industrial reliability, biomedical science, life insurance, demography, econometric studies, etc. Among these transforms are

• the mean residual life function $M$ defined, for $x \geq 0$, by

$$M(x) = E(X - x \mid X > x) = \begin{cases} \dfrac{\int_x^{\infty} (1 - F(y))\, d\lambda(y)}{1 - F(x)} & \text{if } (1 - F(x)) > 0, \\[4pt] 0 & \text{otherwise,} \end{cases}$$


• the Lorenz curve, defined in the cartesian plane by the parametric form $(F(x), L_F(x))$ with

$$L_F(x) = \frac{1}{\mu} \int_0^x s\, dF(s), \quad x > 0,$$

• and the scaled total time on test function $T$ (or total time on test transform) defined in the cartesian plane by the parametric form $(F(x), T_F(x))$ with

$$T_F(x) = \frac{\int_0^x (1 - F(s))\, ds}{\int_0^{\infty} (1 - F(s))\, ds}, \quad x > 0.$$

Motivations and more information about these functionals can be found in Shorack and Wellner (1986, page 775). Empirical estimates of the functionals $M$, $L_F$ and $T_F$ are easily obtained by substituting for $F$ the empirical distribution function $F_n$. Local polynomial estimates of these functionals together with their derivatives follow from Theorem 77.
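For illustration, here is a minimal sketch (ours; the function names and the exponential test sample are illustrative, not from the book) of the plug-in empirical versions of the three transforms:

```python
import numpy as np

def mean_residual_life(sample, x):
    """Empirical M(x) = E(X - x | X > x)."""
    tail = sample[sample > x]
    return tail.mean() - x if tail.size else 0.0

def lorenz(sample, x):
    """Empirical L_F(x) = (1/mu) int_0^x s dF(s)."""
    return sample[sample <= x].sum() / sample.sum()

def total_time_on_test(sample, x, n_grid=2001):
    """Empirical T_F(x); the denominator int_0^inf (1 - F_n) is the mean."""
    s = np.sort(sample)
    grid = np.linspace(0.0, x, n_grid)
    surv = 1.0 - np.searchsorted(s, grid, side="right") / s.size
    return surv.mean() * x / sample.mean()     # Riemann approximation

X = np.random.default_rng(1).exponential(size=2000)
# for the exponential law: M(1) = 1, L_F(1) = 1 - 2/e, T_F(1) = 1 - 1/e
print(mean_residual_life(X, 1.0), lorenz(X, 1.0), total_time_on_test(X, 1.0))
```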

11. KERNELS OF ORDER (M, P)

We have seen in the last two sections how kernels of order $(m,p)$ ($p \geq 2$, $m \leq (p-2)$, see Definition 26) naturally appear in local approximation of functions. The aim of the present section is to extend Lemma 17, Theorems 73 and 74 to these kernels and to investigate further their properties.

The first consequence of the theory introduced hereafter (Berlinet, 1993) is that kernels of order $(m,p)$ can be grouped into hierarchies with the following property: each hierarchy is identified by a density $K_0$ belonging to it and contains kernels of order 2, 3, 4, ... which are products of polynomials with $K_0$. Examples of hierarchies and algorithms for computing each element of a hierarchy from the "basic kernel" $K_0$ are presented in Subsection 11.2. Subsection 11.3 gives a convergence result about sequences of hierarchies which is useful when approximating general kernels with compactly supported kernels. Subsection 11.4 is devoted to properties of roots of higher order kernels and to optimality properties.
Let us now suppose, as in the introduction of Section 8, that we want to use a kernel of order $p$ to reduce the asymptotic bias, but that we also want to minimize the asymptotic variance, which is equivalent (Singh, 1979) to minimizing


$$\int_{\mathbb{R}} K^2(x)\, d\lambda(x).$$

We have to choose $K$ of order $p$ so as to minimize this criterion, with some additional conditions that remove degenerate cases. Our description of finite order kernels provides a powerful portmanteau theory for such optimization problems:

• it suffices to solve the problem for the basic kernel $K_0$ in order to obtain a hierarchy in which every kernel will optimize the criterion at its own order. We recall that $K_0$ is a density, thus a positive function, which makes the problem easy to solve.

• our proofs explain why a kernel has optimal properties: we write the value of the criterion for this kernel as the difference between the value for a general kernel of the same order and an explicit positive functional.

The multiple kernel method, which can be applied in any context of kernel estimation (e.g. probability density, spectral density, regression, hazard rate, intensity functions, ...), is described in Subsection 11.5. It provides an estimate minimizing a criterion over the smoothing parameter $h$ and the order of the kernel. We come back to the example of density estimation in Subsection 11.6. Let us now turn to a more general and technical setting. When smoothing data by a kernel-type method two parameters have to be specified: a kernel $K$ and a window-width $h$. As far as only positive kernels are concerned, it is known that their shape is not of crucial importance whenever $h$ is chosen accurately. Unfortunately curve estimates built from positive kernels are usually highly biased and the improvement of kernel-type estimates requires kernels of order $r$ (Schucany and Sommers, 1977; Schucany, 1989). When fitting data with such kernels the problems facing us will be:

• How to choose simultaneously the order of the kernel and the window-width?

• How to choose the shape of higher order kernels?

To deal with these practical questions we first address the following theoretical one:

• Is it possible to build hierarchies of kernels of increasing order associated with an "initial shape" from which they inherit their properties?

The answer is affirmative: the initial shape will be determined by a density $K_0$ and each kernel of the hierarchy will be of the form

$$K_r(x) = \mathcal{K}_r(x,0)\, K_0(x)$$

where $\mathcal{K}_r$ is the reproducing kernel of the space of polynomials of degree at most $r$ imbedded in $L^2(K_0\lambda)$. It is equally easy to deal with kernels


$K_r^{(m)}$ of order $(m,r)$, i.e. kernels of order $r$ for estimating derivatives of order $m$ (as defined in Section 2); they also can be written as products of $K_0$ with polynomials and therefore inherit the properties of $K_0$: any choice of shape, support, regularity conditions (such as continuity, differentiability, etc.) or tail heaviness is possible. This possibility of choice is one of the main points of the present theory, in accordance with papers that try to dismiss the commonly held idea that, practically, the kernel characteristics are at best secondary. In particular some asymmetric kernels are known to overcome boundary effects (Gasser and Muller, 1979; Muller, 1991). Our framework provides easy ways of solving optimization problems about kernels. We give two examples: minimum variance kernels and minimum MISE kernels, for which calculus of variations is not understood at a comfortable intuitive level. We show how the old results can be thought of as simple projection plus remainder in $L^2$ space and extend them to any order. Indeed if $K_0$ is optimal in a certain sense, each kernel of the hierarchy has an optimality property at its own order. Two hierarchies have already appeared in the literature: the Legendre and Gram-Charlier hierarchies studied by Deheuvels in 1977. The latter has been reexamined by Wand and Schucany (1990), under the name of gaussian-based kernels; a paper by Granovsky and Muller (1991) shows that they can be interpreted as limiting cases of some optimal kernels. We extend this last property. A natural extension of the concept of positivity to higher order kernels is the notion of minimal number of sign changes. This has been introduced by Gasser and Muller (1979) to remove degenerate solutions in some optimization problems. Keeping the initial density $K_0$ unspecified we give in Subsection 11.4 very general properties about the number and the multiplicity of roots of our kernels. It turns out that kernels of order $(0,r)$ and $(1,r)$ defined from a non-vanishing density $K_0$ have only real roots of multiplicity one. Up to now the methods for building kernels used some specific arguments based on moment relationships and gave no natural link between the initial kernel and the higher order ones. This is the case for the following properties:

• if $K(x)$ is of order 2, $(3K(x) + xK'(x))/2$ is of order 4 (Schucany and Sommers, 1977; Silverman, 1986). This has been generalized by Jones (1990).

• Twicing and other methods (Stuetzle and Mittal, 1979; Devroye, 1989): if $K(x)$ is of order $s$, $2K(x) - (K * K)(x)$ is of order $2s$ and $3K(x) - 3(K * K)(x) + (K * K * K)(x)$ is of order $3s$.

On the contrary, our framework makes clear the relationships between kernels of different orders in the same hierarchy. The relevant computational questions are easy to solve: two kernels of the same hierarchy differ by a product of $K_0$ and a linear combination of polynomials which are orthonormal in


$L^2(K_0\lambda)$ and are therefore very easy to compute. When the Fourier transform is used, choosing $K_0$ in a clever way may considerably reduce computational costs. The selection of the order of a Parzen-Rosenblatt kernel was first considered by Hall and Marron (1988) in the case of density estimation. By performing a mean integrated squared error analysis of the problem, they investigated theoretical properties of kernels with Fourier transform $\exp(-|t|^p)$ and proposed cross-validation as a method for choosing the kernel order and the smoothing parameter. We define here a multi-stage procedure for constructing curve estimates based on increasing order kernels and leading to a data-driven choice of both the order and the window-width which applies to a wide variety of smoothing problems. In the last part we will focus on density estimates based on hierarchies of kernels for which strong consistency results are available (Berlinet, 1991). The interpretation of these estimates by means of projections provides exponential upper bounds for the probability of deviation. Such upper bounds can be extended to the estimates defined in Section 10.

11.1. DEFINITION OF $K_0$-BASED HIERARCHIES

A common construction of finite order kernels is obtained through piecewise polynomials (Singh (1979), Muller (1984), Gasser et al (1985), Berlinet and Devroye (1994)) or Fourier transforms (Hall and Marron (1988) and Devroye (1989)). We shall be mainly concerned here with products of polynomials and densities; it turns out that almost all reasonable kernels are of this type. As usual we will denote by $\mathbb{P}_r$ ($r \geq 0$) the space of polynomials of degree at most $r$. Unless otherwise stated integrals will be taken with respect to the Lebesgue measure on $\mathbb{R}$.
A very useful characterization of the order $(m,p)$ of a kernel, generalizing Lemma 17, is given in the following lemma by means of evaluation maps for derivatives in function space.

LEMMA 18 A function $K$ is a kernel of order $(m,p)$ if and only if

$$\forall P \in \mathbb{P}_{p-1}, \quad \int_{\mathbb{R}} P(x) K(x)\, d\lambda(x) = P^{(m)}(0)$$

and $\int_{\mathbb{R}} x^p K(x)\, d\lambda(x) = C_p \neq 0$.
Proof. See Exercise 15.
•

In other words if $K$ is a kernel of order $(m,p)$ the linear form on $\mathbb{P}_{p-1}$

$$P \mapsto \int_{\mathbb{R}} P(x) K(x)\, d\lambda(x)$$


is nothing else than the evaluation of $P^{(m)}$ at the point zero.
A convenient general structure for the construction of hierarchies of higher order kernels can be established from RKHS theory, through using a succession of reproducing kernels applied to a "basic kernel". Let us now extend Theorem 74 to kernels of order $(m,p)$.

THEOREM 78 Let $P$ be a polynomial of degree at most $r$, $K_0$ be a density with finite moments up to order $(2r+1)$ and $\mathcal{K}_r$ be the reproducing kernel of $\mathbb{P}_r$ in $L^2(K_0\lambda)$. Then $P(x)K_0(x)$ is a kernel of order $(m, r+1)$ if and only if

$$\forall x \in \mathbb{R},\ P(x) = \mathcal{K}_r^{(m)}(x,0) \quad\text{and}\quad \int_{\mathbb{R}} x^{r+1} P(x) K_0(x)\, d\lambda(x) = C_{r+1} \neq 0.$$

Proof. See Exercise 16.

Theorems 73 and 78 show that the product

$$\mathcal{K}_r^{(m)}(x,0)\, K_0(x),$$

where $\mathcal{K}_r^{(m)}$ is derived from the reproducing kernel of $\mathbb{P}_r$ in $L^2(K_0\lambda)$, is precisely the form under which any reasonable kernel of order $(m, r+1)$ can be written. This suggests the following definition.

DEFINITION 27 (HIERARCHY OF KERNELS) Let $K_0$ be a density and $(P_i)_{i \in I \subset \mathbb{N}}$ be a sequence of orthonormal polynomials in $L^2(K_0\lambda)$, $(P_i)$ being of exact degree $i$. The hierarchy of kernels associated with $K_0$ is the family of kernels

$$\mathcal{K}_r^{(m)}(x,0)\, K_0(x) = \sum_{i=m}^r P_i^{(m)}(0)\, P_i(x)\, K_0(x), \quad (r,m) \in I^2,\ r \geq m.$$

The property that $\mathbb{P}_r$ is embedded in $L^2(K_0\lambda)$ and $\mathcal{K}_r^{(m)}(\cdot,0)$ is well defined holds if and only if $K_0$ has finite moments up to order $2r$. The set $I$ may be reduced to $\{0\}$, as is the case when $K_0$ is the Cauchy density. $I$ is always equal to an interval of $\mathbb{N}$ with lower bound equal to zero.
Each kernel $\mathcal{K}_r^{(m)}(x,0)K_0(x)$ with finite and non null moment of order $(r+1)$ is a kernel of order $(m, r+1)$.
We actually obtain a hierarchy of sets of kernels, the initial set being

the set of densities

$$\left(\frac{1}{h}\, K_0\left(\frac{\cdot}{h}\right)\right), \quad h > 0,$$

and a rescaling of the initial kernel does not affect this hierarchy, as stated in the next theorem.

THEOREM 79 $K(\cdot)$ is a kernel of order $(m,p)$ with $p$th moment equal to $C_p$ if and only if for any $h > 0$,

$$\frac{1}{h^{m+1}}\, K\left(\frac{\cdot}{h}\right)$$

is a kernel of order $(m,p)$ whose $p$th moment is equal to $h^{p-m} C_p$. Let $(K_r^{(m)}(\cdot))$ be the hierarchy of kernels associated with $K_0(\cdot)$. Then, the hierarchy associated with $\frac{1}{h} K_0(\frac{\cdot}{h})$ is the family of kernels

$$\left(\frac{1}{h^{m+1}}\right) K_r^{(m)}\left(\frac{\cdot}{h}\right).$$

Proof. The first assertion follows from the following equality:

$$\forall j \in \{0, \ldots, p\}, \quad \int_{\mathbb{R}} x^j\, \frac{1}{h^{m+1}}\, K\left(\frac{x}{h}\right) d\lambda(x) = h^{j-m} \int_{\mathbb{R}} x^j K(x)\, d\lambda(x);$$

the second one from the fact that, for any polynomial $P$ of degree at most $r$, we have

$$\int_{\mathbb{R}} P(x)\, \frac{1}{h^{m+1}}\, \mathcal{K}_r^{(m)}\left(\frac{x}{h}, 0\right) K_0\left(\frac{x}{h}\right) d\lambda(x) = \frac{1}{h^m} \int_{\mathbb{R}} P(hu)\, \mathcal{K}_r^{(m)}(u,0)\, K_0(u)\, d\lambda(u) = P^{(m)}(0).$$

Each kernel used in data smoothing is determined by $K_0$, $h$, $p$ and $m$. To choose the shape (for instance following optimality arguments) and the smoothing parameter one chooses a suitably rescaled version of $K_0$. To choose $(m,p)$ one moves along the hierarchy. The order of these operations has no importance.

11.2. COMPUTATIONAL ASPECTS

Only straightforward methods of numerical analysis are needed to calculate these kernels and the associated curve estimates. The orthonormal polynomials can be computed by means of the following relationships:

$$P_n(x) = \frac{Q_n(x)}{\|Q_n\|}, \quad\text{where } \forall n \in \mathbb{N},\ \|Q_n\| = \left(\int_{\mathbb{R}} Q_n^2(x)\, K_0(x)\, d\lambda(x)\right)^{1/2};$$


$$Q_0(x) = 1; \qquad Q_1(x) = x - \int_{\mathbb{R}} x\, K_0(x)\, d\lambda(x);$$

$$Q_n(x) = (x - \alpha_n)\, Q_{n-1}(x) - \beta_n\, Q_{n-2}(x), \quad n \geq 2,$$

with

$$\alpha_n = \frac{\int_{\mathbb{R}} x\, Q_{n-1}^2(x)\, K_0(x)\, d\lambda(x)}{\int_{\mathbb{R}} Q_{n-1}^2(x)\, K_0(x)\, d\lambda(x)} \quad\text{and}\quad \beta_n = \frac{\int_{\mathbb{R}} Q_{n-1}^2(x)\, K_0(x)\, d\lambda(x)}{\int_{\mathbb{R}} Q_{n-2}^2(x)\, K_0(x)\, d\lambda(x)}.$$

The associated kernel $K_r^{(m)}$ of order $(m, r+1)$ is given by

$$K_r^{(m)}(x) = \sum_{i=m}^r P_i(x)\, P_i^{(m)}(0)\, K_0(x).$$
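The recurrence translates directly into code. The following numerical sketch (ours; the quadrature grid and the Epanechnikov example are illustrative) builds the orthonormal polynomials and assembles $K_r^{(m)}$:

```python
import numpy as np
from math import sqrt
from numpy.polynomial import polynomial as P

def hierarchy_kernel(K0, m, r, grid):
    """Return x -> K_r^{(m)}(x) = sum_{i=m}^r P_i(x) P_i^{(m)}(0) K0(x),
    with the P_i orthonormal w.r.t. K0(x) dx (integrals by Riemann sums)."""
    dx = grid[1] - grid[0]
    w = K0(grid)
    dot = lambda p, q: np.sum(P.polyval(grid, p) * P.polyval(grid, q) * w) * dx
    X = np.array([0.0, 1.0])                        # the monomial x
    Q = [np.array([1.0])]
    if r >= 1:
        Q.append(P.polysub(X, [dot(X, Q[0])]))      # Q1 = x - int x K0
    for n in range(2, r + 1):
        a = dot(P.polymul(X, Q[n-1]), Q[n-1]) / dot(Q[n-1], Q[n-1])
        b = dot(Q[n-1], Q[n-1]) / dot(Q[n-2], Q[n-2])
        Q.append(P.polysub(P.polymul(P.polysub(X, [a]), Q[n-1]), b * Q[n-2]))
    Ps = [q / sqrt(dot(q, q)) for q in Q]
    poly = np.zeros(1)
    for p in Ps[m:]:                                # sum of P_i^{(m)}(0) P_i
        poly = P.polyadd(poly, P.polyval(0.0, P.polyder(p, m)) * p)
    return lambda x: P.polyval(x, poly) * K0(x)

K0 = lambda x: 0.75 * (1 - x**2) * (np.abs(x) <= 1)       # Epanechnikov
K4 = hierarchy_kernel(K0, m=0, r=3, grid=np.linspace(-1, 1, 4001))
print(K4(np.array([0.0, 0.5])))   # ~ (45 - 150x^2 + 105x^4)/32 on [-1, 1]
```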

When $K_0$ is symmetric, we have

$$Q_0(x) = 1; \quad Q_1(x) = x; \quad Q_n(x) = x\, Q_{n-1}(x) - \beta_n\, Q_{n-2}(x), \quad n \geq 2,$$

and $\forall n \in \mathbb{N}$, $Q_{2n}$ is even and $Q_{2n+1}$ is odd. Therefore, in that case, the condition

$$\int_{\mathbb{R}} x^{r+1} K_r^{(m)}(x)\, d\lambda(x) = C_{r+1} \neq 0$$

can be satisfied only if $(r+m)$ is odd; this last condition entails that

$$P_r^{(m)}(0) = 0 \quad\text{and}\quad \mathcal{K}_r^{(m)}(x,0) = \mathcal{K}_{r-1}^{(m)}(x,0).$$

The reproducing kernel can be computed either iteratively or by means of the Christoffel-Darboux formulas, when the $Q_i$'s are known explicitly:

$$\forall x \neq y, \quad \mathcal{K}_r(x,y) = \sum_{i=0}^r P_i(x) P_i(y) = \frac{1}{\|Q_r\|^2}\, \frac{Q_{r+1}(x) Q_r(y) - Q_{r+1}(y) Q_r(x)}{x - y},$$

$$\forall x, \quad \mathcal{K}_r(x,x) = \sum_{i=0}^r [P_i(x)]^2 = \frac{1}{\|Q_r\|^2}\, \big(Q_{r+1}'(x) Q_r(x) - Q_{r+1}(x) Q_r'(x)\big).$$

DETERMINANTAL EXPRESSIONS.
To give an explicit formula for $K_r^{(m)}$, we introduce some notation. For


$k \geq 1$ and any sequence $\mu = (\mu_i)$ of real numbers, let us denote by $M_k^q$ the Hankel matrix of order $k$ built from $\mu_q, \mu_{q+1}, \ldots, \mu_{q+2k-2}$, and by $H_k^q$ its determinant:

$$M_k^q = \begin{pmatrix} \mu_q & \mu_{q+1} & \cdots & \mu_{q+k-1} \\ \mu_{q+1} & & & \vdots \\ \vdots & & & \vdots \\ \mu_{q+k-1} & \cdots & & \mu_{q+2k-2} \end{pmatrix}$$

and

$$H_k^q = \det(M_k^q).$$

Finally, let $H_{k,m}^q(x)$, $x \in \mathbb{R}$, $m \in \{1, \ldots, k\}$, be the determinant of the matrix obtained from $M_k^q$ by replacing the $m$th line by $1, x, x^2, \ldots, x^{k-1}$. We will suppose that all the principal minors of $M_{r+1}^0$ are different from zero.

THEOREM 80 Let $\mu = (\mu_i)_{0 \leq i \leq 2s}$ be the sequence of the $(2s+1)$ first moments of $K_0$. Then

$$\forall x \in \mathbb{R},\ \forall k \in \mathbb{N}, \quad Q_k(x) = H_{k+1,k+1}^0(x)/H_k^0,$$
$$\forall k \in \mathbb{N}, \quad \|Q_k\| = (H_{k+1}^0/H_k^0)^{1/2},$$
$$\forall k \in \mathbb{N}, \quad \beta_k = H_k^0\, H_{k-2}^0/(H_{k-1}^0)^2,$$
$$\forall k \in \mathbb{N}, \quad P_k(x) = H_{k+1,k+1}^0(x)\, (H_k^0\, H_{k+1}^0)^{-1/2},$$
$$\forall x \in \mathbb{R}, \quad K_s^{(m)}(x) = m!\, H_{s+1,m+1}^0(x)\, K_0(x)/H_{s+1}^0.$$

Proof. The first four equalities are well known and easy to check (Brezinski, 1980). Now, writing $\mathcal{K}_s^{(m)}(x,0)$ in the basis $1, x, x^2, \ldots, x^s$ and applying the definition of a kernel of order $(m,p)$ yields a linear system in the coefficients of $K_s^{(m)}(x)$ with matrix $M_{s+1}^0$. Straightforward algebra gives the result. •
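The linear system mentioned in the proof is easy to set up directly. A sketch (ours; the normal moments are used only as an example): solve $\sum_k c_k \mu_{j+k} = m!\,\mathbb{1}_{\{j=m\}}$, $j = 0, \ldots, s$, for the coefficients of the polynomial multiplying $K_0$:

```python
import numpy as np
from math import factorial
from scipy.linalg import hankel

def kernel_coefficients(mu, m, s):
    """mu: moments mu_0..mu_{2s} of K0.  Returns the power-basis
    coefficients of the polynomial P such that P * K0 = K_s^{(m)}."""
    M = hankel(mu[:s + 1], mu[s:2 * s + 1])   # M[j, k] = mu_{j+k}
    rhs = np.zeros(s + 1)
    rhs[m] = factorial(m)
    return np.linalg.solve(M, rhs)

# moments of the standard normal: mu_{2k} = (2k-1)!!, odd moments vanish
mu = [1.0, 0.0, 1.0, 0.0, 3.0, 0.0, 15.0, 0.0, 105.0]
print(kernel_coefficients(mu, m=0, s=3))  # [1.5, 0, -0.5, 0] -> (3 - x^2)/2
```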

The determinantal form of $K_r^{(m)}(x)$ can be used either in practical computations with small values of $r$, or in theoretical considerations, for instance to show that the kernels derived in Gasser et al (1985) are those of the Legendre and Epanechnikov hierarchies (we give a direct proof in Subsection 11.4 below).
EXAMPLES.
Any choice of $K_0$, with finite moments up to order $2r$ ($r \geq 1$), provides a sequence of kernels $K_r^{(m)}(x) = \mathcal{K}_r^{(m)}(x,0)K_0(x)$. This choice, possibly made from the observations, has to be further investigated, especially when information is available on the support of $f$. As we shall see in


Subsection 11.4, optimal densities (in a sense to be defined) give rise to optimal hierarchies.

• (a) $K_0(x) = \frac{1}{2}\mathbb{1}_{[-1,1]}(x)$ leads to piecewise polynomial kernels, the Legendre kernels.

• (b) $K_0(x) = \frac{3}{4}(1 - x^2)_+$ is the basic kernel of the Epanechnikov hierarchy. (a) and (b) are particular cases of the Jacobi hierarchies obtained with $K_0(x) = A\,(1-x)_+^{\alpha}(1+x)_+^{\beta}$.

• (c) $K_0(x) = \big(2^{1+1/(2k)}\,\Gamma(1 + 1/(2k))\big)^{-1} \exp(-x^{2k}/2)$ gives rise to Gram-Charlier kernels when $k = 1$. The derivatives of the orthonormal polynomials are in this case linear combinations of a bounded number of polynomials of the same family.

• (d) $K_0(x) = \big(\frac{2}{\beta}\,\Gamma(\frac{\alpha+1}{\beta})\big)^{-1} |x|^{\alpha} \exp(-|x|^{\beta})$ gives rise to Laguerre kernels when $\alpha = 0$ and $\beta = 1$.

• (e) $K_0(x) = A \exp\big(-\frac{\alpha}{4}(x-\beta)^4 - \frac{\gamma}{2}(x-\beta)^2\big)$ ($\alpha \geq 0$; $\gamma > 0$ if $\alpha = 0$). This family of distributions is characterized by the same property as in (c) with a number of polynomials less than or equal to two.

Some of these kernels have been discussed in detail in the literature (Deheuvels, 1977). Numerous results concerning orthogonal polynomials with weights, such as those given above and many others, can be found in Freud (1973), Nevai (1973a, 1973b, 1979), Brezinski (1980).
From kernels $K_1, K_2, \ldots, K_d$ belonging to hierarchies of univariate kernels one can build $d$-dimensional product kernels

$$K(x_1, \ldots, x_d) = \prod_{i=1}^d K_i(x_i)$$

defined on $\mathbb{R}^d$ with desirable shape, support, regularity and optimality conditions, tail heaviness and moment properties. In Figures 3.5 and 3.6 are respectively plotted the Gram-Charlier kernel of order $(4,6)$ given by

$$K_5^{(4)}(x) = \frac{3 - 6x^2 + x^4}{\sqrt{2\pi}}\, \exp\left(-\frac{x^2}{2}\right)$$

and the product kernel $K_5^{(4)}(x_1)\, K_5^{(4)}(x_2)$.

Figure 3.5: The Gram-Charlier kernel $K_5^{(4)}(x)$ of order $(4,6)$ over $[-5,5]$.

Figure 3.6: The product kernel $K_5^{(4)}(x_1) K_5^{(4)}(x_2)$ over $[-4,4]^2$.


On a semi-metric space $(E, d)$ one can define a kernel $\mathcal{K}$ by combining any univariate kernel $K$ with the semi-metric $d$, i.e. by setting

$$\mathcal{K}(x) = K(d(x, y))$$

where $y$ is any fixed point in $E$. In Figure 3.7 is plotted the kernel

$$\mathcal{K}(x_1, x_2) = K_5^{(4)}\big((x_1^2 + x_2^2)^{1/2}\big)$$

obtained by combining the Gram-Charlier kernel of order $(4,6)$ with the Euclidean norm on $\mathbb{R}^2$. Of course such a kernel has to be rescaled (or truncated) to integrate to 1.

Figure 3.7: The kernel $K_5^{(4)}\big((x_1^2 + x_2^2)^{1/2}\big)$ over $[-4,4]^2$.

11.3. SEQUENCES OF HIERARCHIES

Let us now study how some families of densities and hierarchies of kernels approximate each other.
Let $K_0$ and $K_{0,\ell}$, $\ell \in \mathbb{N}$, be densities associated with families of orthonormal polynomials $(P_i)_{i \in I}$ and $(P_{i,\ell})_{i \in I}$. From Theorem 4 it is clear that the convergence, as $\ell$ tends to infinity, of the moments of $K_{0,\ell}$ to the corresponding moments of $K_0$ entails the convergence of the coefficients of $P_{i,\ell}$ to the coefficients of $P_i$, and therefore each element of the $K_0$-hierarchy can appear as a limiting case of the $K_{0,\ell}$-hierarchies. From the


Lebesgue dominated convergence theorem it follows that the condition of convergence of the moments is fulfilled provided that the functions $K_{0,\ell}(x)$, $\ell \in \mathbb{N}$, are bounded by a function with corresponding finite moments and $K_{0,\ell}$ tends to $K_0$ almost surely. As an example, Theorem 81 below shows that a number of hierarchies with unbounded support can appear as limiting cases of hierarchies with compact support.

THEOREM 81 Let $(K_{r,\ell}^{(m)})$ be the hierarchy of kernels associated with the density

$$K_{0,\ell}(x) = A_\ell\, |x|^{\alpha} \left(1 - \frac{\varphi(x)}{\ell}\right)^{\ell} \mathbb{1}_{\{\varphi(x) \leq \ell\}}$$

where $\varphi$ is a positive function such that $\exp(-\varphi(x))$ has finite moments of any order. Then, for all $(r,m)$ and all $x$,

$$\lim_{\ell \to \infty} K_{r,\ell}^{(m)}(x) = K_r^{(m)}(x),$$

where $(K_r^{(m)})$ is the hierarchy associated with the density $K_0(x) = A\, |x|^{\alpha} \exp(-\varphi(x))$.

Proof. The key idea is that for any positive function $\varphi$ we have, $\forall x \in \mathbb{R}$, $\forall \ell \geq 1$,

$$0 \leq \exp(-\varphi(x)) - \left(1 - \frac{\varphi(x)}{\ell}\right)^{\ell} \mathbb{1}_{\{\varphi(x) \leq \ell\}} \leq \frac{(\varphi(x))^{x_\ell}}{\ell}\, \exp(-\varphi(x)),$$

where $(x_\ell)$ lies in $]1, 2[$ and satisfies $\lim_{\ell \to \infty} x_\ell = 2$. Therefore, if $\varphi$ is such that $\exp(-\varphi(x))$ has moments of any order, the conclusion follows from the Lebesgue theorem. •

Application of Theorem 81 to Example (c) above and its extension to Example (d) are straightforward. A particular case is the Gauss hierarchy with initial kernel $(2\pi)^{-1/2} \exp(-x^2/2)$, which is the limit, as $\ell$ tends to infinity, of the hierarchies associated with the densities

$$K_{0,\ell}(x) = A_\ell \left(1 - \frac{x^2}{2\ell}\right)^{\ell} \mathbb{1}_{\{x^2 \leq 2\ell\}}.$$

Indeed, Theorem 81 makes a wide family of analytical kernels appear as limiting cases of compactly supported kernels with attractive properties (Granovsky and Muller, 1991).


11.4. OPTIMALITY PROPERTIES OF HIGHER ORDER KERNELS

ROOTS OF HIGHER ORDER KERNELS.
A natural extension of the concept of positivity to higher order kernels is the concept of minimal number of sign changes. This has been introduced by Gasser and Muller (1979) to remove degenerate solutions in some optimization problems. They have proved that kernels of examples (a) and (b) have a minimal number of sign changes ($(p-2)$ for a kernel of order $p$). Mimicking their proof, such results can be extended to all commonly used hierarchies, once $K_0$ has been specified. The polynomials $\mathcal{K}_r^{(m)}(x,0)$ do have orthogonality properties, but with respect to not necessarily positive definite functionals, and the classical properties of roots of orthogonal polynomials cannot be carried over. Letting $K_0$ be unspecified we give hereafter very general properties about the number and the multiplicity of roots of our kernels. Theorems 82 and 83 are technical. Their corollary states that kernels of order $(0,r)$ and $(1,r)$ defined from a non-vanishing density $K_0$ only have real roots of multiplicity one.

THEOREM 82 Let $K_0$ be a probability density, let $r \ge 2$, $m \in [0, r-1]$ and $(P_i)_{0\le i\le r}$ be the sequence of the first $(r+1)$ orthonormal polynomials in $L^2(K_0\lambda)$. The polynomial

$$K_r^{(m)}(x,0) = \sum_{i=m}^{r} P_i^{(m)}(0)\,P_i(x)$$

(of degree $d \in [1, r]$) has at least one real root of odd multiplicity.

Proof. As $K_0$ is a probability density, the equalities

$$\int_{\mathbb{R}} K_r^{(m)}(x,0)\,K_0(x)\,d\lambda(x) = 0 \quad (m > 0)$$

and

$$\int_{\mathbb{R}} x^2\,K_r^{(0)}(x,0)\,K_0(x)\,d\lambda(x) = x^2\big|_{x=0} = 0$$

show that $K_r^{(m)}(x,0)$ has at least one real root where it changes sign. •

THEOREM 83 Let $r_i$ be the multiplicity of each real root $z_i$ of $K_r^{(m)}(x,0)$ and let $q_0$ be the sum of the numbers $[r_i/2]$ (brackets denote the integer part). Then

$$\begin{cases}\text{either } m \text{ is even, } 2m < r \text{ and } 2q_0 = d + m + 1 - r,\\[2pt] \text{or } 2q_0 < \min(d + 1 - m,\ d + m + 1 - r).\end{cases}$$


Proof. $K_r^{(m)}(x,0) = u(x)v(x)$ where $u(x) = \prod_i (x - z_i)^{2[r_i/2]}$ and $v(x)$ are polynomials of degrees $2q_0$ and $(d - 2q_0)$ respectively. We have, for $q \in \mathbb{N}$,

$$\int_{\mathbb{R}} x^{2q}\,v(x)\,K_r^{(m)}(x,0)\,K_0(x)\,d\lambda(x) = \int_{\mathbb{R}} x^{2q}\,u(x)\,v^2(x)\,K_0(x)\,d\lambda(x) > 0.$$

The first integral would vanish if we had $2q + d - 2q_0 \le r$ and ($m \le 2q - 1$ or $m > 2q + d - 2q_0$). Therefore no integer $q \ge 0$ satisfies

$$m + 1 \le 2q \le r + 2q_0 - d \quad\text{or}\quad 2q < \min(r + 2q_0 - d + 1,\ m + 2q_0 - d).$$

The first condition is equivalent to

$$\begin{cases}(m \text{ is even and } m + 1 = r + 2q_0 - d)\\ \text{or}\\ (r + 2q_0 - d < m + 1)\end{cases}$$

while the second one is equivalent to

$$\begin{cases}(r + 2q_0 - d + 1 \le 0)\\ \text{or}\\ (m + 2q_0 - d \le 0).\end{cases}$$

Since $r \ge d$ and $q_0 \ge 0$ the condition $(r + 2q_0 - d + 1 \le 0)$ cannot be satisfied. The conclusion follows. •

COROLLARY 11 If $m \in \{0, 1\}$, $K_r^{(m)}(x,0)$ only has real roots of multiplicity one.

Proof. If $m \in \{0, 1\}$, $2q_0 \le d + 1 - r \le 1$, thus $q_0 = 0$. •
Note that kernels of order $(0,r)$ and $(1,r)$ may have roots with multiplicity higher than one if $K_0$ has such roots or if $K_r^{(m)}(x,0)$ and $K_0(x)$ have roots in common. An example of a kernel of order $(0,3)$ with a root of order two has been presented by Mammitzsch (1989).
TWO OPTIMAL HIERARCHIES.
Our description of finite order kernels turns out to be a powerful tool in the search for asymptotically optimal kernels. It enables very short proofs and the confirmation of a conjecture stated by Gasser et al (1985). The functionals to be minimized are the same in almost all nonparametric estimation problems (cumulative distribution function, density, regression, spectral density, hazard function, ... and derivatives) and lead to two important families of kernels: minimum variance


and minimum MISE hierarchies.
Minimum variance hierarchy. Minimum variance kernels of order $(m, r+1)$ on $[-1,1]$ are solutions to the following variational problem:

$$(P1)\quad\begin{cases} W(K) = \displaystyle\int_{-1}^{1} K^2(x)\,d\lambda(x) \text{ is minimized subject to}\\[6pt] \forall P \in \mathbb{P}_r,\ \displaystyle\int_{-1}^{1} P(x)K(x)\,d\lambda(x) = P^{(m)}(0).\end{cases}$$

They are known to be uniquely defined polynomials of degree $(r-1)$ with $(r-1)$ real roots in $[-1,1]$, symmetric for $m$ even and antisymmetric for $m$ odd. Explicit formulas have been derived for their coefficients in Gasser et al (1985), as mentioned above. We show that the minimum variance family of order $(m, r+1)$ kernels is identical to the hierarchy associated with the density $K_0(x) = \frac{1}{2}\mathbf{1}_{[-1,1]}(x)$, which is the minimum variance kernel of order $(0,2)$.

THEOREM 84 The solution to problem (P1) is given by

$$K_r^{(m)}(x) = \sum_{i=m}^{r} P_i^{(m)}(0)\,P_i(x)\,\mathbf{1}_{[-1,1]}(x)$$

where the $P_i$'s are the orthonormal polynomials in $L^2(\mathbf{1}_{[-1,1]}\lambda)$, i.e. the Legendre polynomials.

Proof. Let

$$K_r^{(m)}(x,0) = \sum_{i=m}^{r} \sqrt{2}\,P_i^{(m)}(0)\,\sqrt{2}\,P_i(x)$$

and $K_0(x) = \frac{1}{2}\mathbf{1}_{[-1,1]}(x)$. Then, by Theorem 78,

$$K_r^{(m)}(x) = K_r^{(m)}(x,0)\,K_0(x)$$

is a kernel of order $(m, r+1)$. Let $K$ be another polynomial kernel on $[-1,1]$ of order $(m, r+1)$. $K$ necessarily has a degree $d$ greater than $r$ and has the same first $(r+1)$ coordinates as $K_r^{(m)}(x,0)$. Thus

$$K(x) = \left(K_r^{(m)}(x,0) + \sum_{i=r+1}^{d} a_i P_i(x)\right) K_0(x)$$

and

$$W(K) = W\!\left(K_r^{(m)}\right) + W\!\left(\sum_{i=r+1}^{d} a_i P_i(x)\,K_0(x)\right).$$

This shows that $K_r^{(m)}(x)$ is the unique solution to problem (P1). •
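As a check on Theorem 84, the following sketch (an illustration of ours, not code from the text) builds $K_r^{(m)}$ from orthonormal Legendre polynomials with `numpy.polynomial.legendre` and verifies the moment conditions of a kernel of order $(m, r+1)$:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def min_var_kernel(m, r):
    """Minimum variance kernel of Theorem 84:
    K_r^(m)(x) = sum_{i=m}^{r} P_i^(m)(0) P_i(x) on [-1, 1],
    with P_i the Legendre polynomials orthonormal in L^2(1_[-1,1])."""
    coef = np.zeros(r + 1)  # Legendre-basis coefficients of the kernel
    for i in range(m, r + 1):
        e = np.zeros(i + 1)
        e[i] = np.sqrt((2 * i + 1) / 2)          # orthonormal P_i
        d = e if m == 0 else leg.legder(e, m)    # coefficients of P_i^(m)
        coef[: i + 1] += leg.legval(0.0, d) * e  # P_i^(m)(0) * P_i
    return lambda x: np.where(np.abs(x) <= 1, leg.legval(x, coef), 0.0)

K = min_var_kernel(0, 3)  # order (0,4): K(x) = (9 - 15 x^2)/8 on [-1, 1]
x = np.linspace(-1, 1, 200001)
for p in range(4):
    mom = np.sum(x**p * K(x)) * (x[1] - x[0])
    print(p, round(float(mom), 4))  # 1, 0, 0, 0 = value of (x^p) at 0
```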


Minimum MISE hierarchy. Gasser et al introduced polynomial kernels for which they proved optimality up to order 5 and conjectured the same property for any order. This conjecture can now be proved using the unifying variational principle introduced in Granovsky and Müller (1991). We give here a very simple general proof. The minimum MISE family of order $(m,p)$ kernels is identical to the hierarchy associated with the Epanechnikov density $(3/4)(1-x^2)_+$, which is the minimum MISE kernel of order $(0,2)$.
Minimum MISE kernels of order $(m, r+1)$ on $[-1,1]$ are solutions to the following variational problem ($(r+m)$ is supposed to be odd):

$$(P2)\quad\begin{cases} T(K) = \left(\displaystyle\int_{-1}^{1} K^2(x)\,d\lambda(x)\right)^{r+1-m}\left|\displaystyle\int_{-1}^{1} x^{r+1}K(x)\,d\lambda(x)\right|^{2m+1} \text{ is minimized, subject to}\\[6pt] \forall P \in \mathbb{P}_r,\ \displaystyle\int_{-1}^{1} P(x)K(x)\,d\lambda(x) = P^{(m)}(0).\end{cases}$$

THEOREM 85 The polynomial solution to problem (P2) vanishing at the end points of $[-1,1]$ is given by

$$K_r^{(m)}(x) = \sum_{i=m}^{r} P_i^{(m)}(0)\,P_i(x)\,(3/4)(1-x^2)_+$$

where the $P_i$'s are the orthonormal polynomials in $L^2(K_0\lambda)$ with $K_0(x) = (3/4)(1-x^2)_+$.

Proof. Obviously, $K_r^{(m)}$ satisfies the condition. The functional $T$ is invariant under scale transformations

$$K(\cdot) \mapsto \frac{1}{h^{m+1}}\,K\!\left(\frac{\cdot}{h}\right).$$

Therefore we have to compare $W(K_r^{(m)})$ with $W(RK_0)$ where $R$ is a polynomial such that

$$\begin{cases}\displaystyle\int_{-1}^{1} x^{r+1}R(x)K_0(x)\,d\lambda(x) = \int_{-1}^{1} x^{r+1}K_r^{(m)}(x)\,d\lambda(x)\\[6pt] \forall P \in \mathbb{P}_r,\ \displaystyle\int_{-1}^{1} P(x)R(x)K_0(x)\,d\lambda(x) = P^{(m)}(0).\end{cases}$$

It turns out that $(R - K_r^{(m)}(\cdot,0))$ is orthogonal to $\mathbb{P}_{r+1}$ in $L^2(K_0\lambda)$. Now,

$$W(RK_0) = \int\left(RK_0 - K_r^{(m)}\right)^2 d\lambda + W\!\left(K_r^{(m)}\right) + 2\int K_r^{(m)}(x)\left(R(x) - K_r^{(m)}(x,0)\right)K_0(x)\,d\lambda(x).$$


As $K_0$ is symmetric, $K_r^{(m)}$ is a polynomial of degree $(r+1)$ at most. Thus

$$W(RK_0) = \int\left(RK_0 - K_r^{(m)}\right)^2 d\lambda + W\!\left(K_r^{(m)}\right)$$

and the conclusion follows. •

Granovsky and Müller (1989) proved that $K_r^{(m)}$ minimizes the same criterion over the set of square integrable kernels of order $(m,p)$ with a fixed number $(p-2)$ of sign changes on $\mathbb{R}$.
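The Epanechnikov hierarchy has no ready-made polynomial basis in numpy, so the following sketch (our illustration; names and grid sizes are arbitrary assumptions) orthonormalizes the monomials in $L^2(K_0\lambda)$ by Gram-Schmidt and assembles $K_r^{(m)}$ numerically, checking the order-(0,4) moment conditions:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def epanechnikov_hierarchy(m, r, n_grid=200001):
    """K_r^(m)(x) = sum_{i=m}^r P_i^(m)(0) P_i(x) K_0(x), with K_0 the
    Epanechnikov density and P_i orthonormal in L^2(K_0 lambda) (Theorem 85)."""
    x = np.linspace(-1.0, 1.0, n_grid)
    w = 0.75 * (1.0 - x**2)          # K_0 on [-1, 1]
    dx = x[1] - x[0]
    dot = lambda f, g: float(np.sum(f * g * w) * dx)
    vals, coefs = [], []             # P_i on the grid / in the monomial basis
    for i in range(r + 1):
        f, c = x**i, np.zeros(r + 1)
        c[i] = 1.0
        for g, cg in zip(vals, coefs):     # Gram-Schmidt step
            s = dot(f, g)
            f, c = f - s * g, c - s * cg
        nrm = np.sqrt(dot(f, f))
        vals.append(f / nrm); coefs.append(c / nrm)
    K = sum(P.polyval(0.0, P.polyder(c, m) if m else c) * g
            for g, c in zip(vals[m:], coefs[m:]))
    return x, K * w

x, K = epanechnikov_hierarchy(0, 3)   # should match (15/32)(1 - x^2)(3 - 7x^2)
dx = x[1] - x[0]
print([round(float(np.sum(x**p * K) * dx), 4) for p in range(4)])  # [1, 0, 0, 0]
```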

11.5. THE MULTIPLE KERNEL METHOD
Let us suppose that a function $f$ (e.g. a probability density function, a spectral density function, a regression function, an intensity function, etc.) has to be estimated from a sample of points and that a criterion $C$ has been chosen to judge the accuracy of any kernel estimate $f_n$. $C$ is a score function of the sample estimating some measure of deviation between $f_n$ and the true unknown function $f$. Once the sample is given, $C$ is a function of the rescaled kernel

$$\frac{1}{h^{m+1}}\,K_r^{(m)}\!\left(\frac{\cdot}{h}\right).$$

The initial kernel $K_0$ is chosen regarding the asymptotic behavior of $C$. As an example one can think of the problem of density estimation from a sample $X_1,\dots,X_n$ of independent random variables with common density $f$. If the criterion is the MISE (Mean Integrated Squared Error), equal to

$$E\int_{\mathbb{R}}\left(f_n(x) - f(x)\right)^2 d\lambda(x),$$

where $f_n(x)$ is the standard Parzen-Rosenblatt kernel estimate

$$f_n(x) = \frac{1}{nh}\sum_{j=1}^{n} K_r\!\left(\frac{X_j - x}{h}, 0\right) K_0\!\left(\frac{X_j - x}{h}\right)$$

built from the sample, a natural choice for $K_0$ is the Epanechnikov optimal kernel, or a nearly optimal kernel (under suitable assumptions on $f$, see Epanechnikov (1969)). A natural choice for $C$ is the $L^2$ cross-validation criterion

$$\int_{\mathbb{R}} f_n^2(x)\,d\lambda(x) - \frac{2}{n}\sum_{i=1}^{n} f_{n,i}(X_i)$$

where $f_{n,i}$ is the kernel estimate based on the $(n-1)$ observations different from $X_i$. For relevant discussion and references, see Berlinet and


Devroye (1989). Once $K_0$ has been chosen one can compute for any order $r$ the value $h_r$ of the smoothing parameter optimizing (at least over a grid $G$)

$$C\left(\frac{1}{h^{m+1}}\,K_r^{(m)}\!\left(\frac{\cdot}{h}\right)\right).$$

Let $C_r^*$ be the value of $C$ at the optimal $h_r$. Then the optimal order $\hat r$ in a bounded interval $[0,R]$ is defined so as to optimize $C_r^*$ over $[0,R]$, and the corresponding rescaled kernel

$$\frac{1}{h_{\hat r}^{m+1}}\,K_{\hat r}^{(m)}\!\left(\frac{\cdot}{h_{\hat r}}\right)$$

is used to build $f_n$. The multiple kernel method can also be used with estimates $f_{n,r}$ and $f_{n,s}$ built from kernels of different orders $r$ and $s$ to provide best smoothing parameters $h_r$ and $h_s$ at those orders, as proposed by Devroye (1989): $h_r$ and $h_s$ are chosen so as to minimize, for instance, the $L^1$ distance between $f_{n,r}$ and $f_{n,s}$. It also allows one to simultaneously numerically optimize with respect to $K_0$, $h$ and $r$. The study of the asymptotic behavior of the multiple kernel method was carried out by Vieu (1999), who proved the asymptotic optimality of the method in the general framework set at the beginning of this section. A paper by Horová, Vieu and Zelinka (2003) deals with derivatives of the density.
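The following schematic sketch (ours, with illustrative helper names, not the authors' implementation) implements the selection loop just described for density estimation: for each order in a hierarchy it minimizes the $L^2$ cross-validation score over a bandwidth grid, then keeps the best (order, bandwidth) pair:

```python
import numpy as np

def lscv_score(sample, kernel, h):
    """L2 cross-validation criterion int f_n^2 - (2/n) sum_i f_{n,i}(X_i)
    for the estimate built from a (possibly higher order) kernel."""
    n = len(sample)
    grid = np.linspace(sample.min() - 3 * h, sample.max() + 3 * h, 2048)
    fn = kernel((grid[:, None] - sample[None, :]) / h).sum(axis=1) / (n * h)
    int_fn2 = float(np.sum(fn**2) * (grid[1] - grid[0]))
    pair = kernel((sample[:, None] - sample[None, :]) / h)
    # leave-one-out sum: drop the n diagonal terms kernel(0)
    loo_sum = (pair.sum() - n * float(kernel(np.zeros(1))[0])) / ((n - 1) * h)
    return int_fn2 - 2.0 * loo_sum / n

def multiple_kernel_select(sample, hierarchy, bandwidths):
    """hierarchy: dict mapping a kernel order p to a vectorized kernel;
    returns the (score, p, h) minimizing the criterion over the grid."""
    return min((lscv_score(sample, K, h), p, h)
               for p, K in hierarchy.items() for h in bandwidths)

# usage sketch with two Gaussian-based kernels of orders (0,2) and (0,4)
phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
hierarchy = {2: phi, 4: lambda u: (3 - u**2) / 2 * phi(u)}
rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(multiple_kernel_select(X, hierarchy, np.geomspace(0.1, 1.0, 15)))
```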

11.6. THE ESTIMATION PROCEDURE FOR THE DENSITY AND ITS DERIVATIVES
As in Subsection 5 above, let $X_1,\dots,X_n$ be independent random variables with common unknown density $f$ and cumulative distribution function $F$. We give in this subsection some specific properties of estimates of $f$, $F$ and of derivatives of $f$ based on higher order kernels. These estimates can be interpreted by means of projections in $L^2$ spaces, as seen in Section 10. Let

$$f_n(x) = \frac{1}{nh}\sum_{j=1}^{n} K_r\!\left(\frac{x - X_j}{h}\right)$$

be the standard kernel estimate of $f$ built from the kernel

$$K_r(x) = K_r(x,0)\,K_0(x).$$

Let $\mu_n$ be the measure with density $f_n$ and $\nu_n$ be the empirical measure associated with the sample. Theorem 86 particularizes general results given in Section 10. It shows that estimating the measure $\mu(A)$ of a Borel


set $A$ with a kernel like $K_r$ and smoothing parameter $h$ is nothing else than deriving the best $L^2$-approximation with weight $K_0$ of the function $\nu_n(A - h\,\cdot)$ by a polynomial $\Pi_A$ of degree at most $r$ and taking $\Pi_A(0)$ as an estimate of $\mu(A)$:

THEOREM 86 For any Borel set $A$, we have $\mu_n(A) = \Pi_A(0)$ where

$$\Pi_A = \arg\min_{P \in \mathbb{P}_r} \int_{\mathbb{R}}\left(P(u) - \nu_n(A - hu)\right)^2 K_0(u)\,du.$$

Proof. We have

$$\mu_n(A) = \int_{\mathbb{R}} \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}_A(X_j + hv)\,K_r(v,0)\,K_0(v)\,d\lambda(v).$$

The above integral is the value at 0 of the projection of

$$v \mapsto \nu_n(A - hv) = \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}_A(X_j + hv)$$

on the subspace $\mathbb{P}_r$, i.e. the solution of the following problem: find $\pi$ in $\mathbb{P}_r$ minimizing the norm of

$$\left(\pi(\cdot) - \nu_n(A - h\,\cdot)\right)$$

and evaluate $\pi$ at the point 0. The conclusion follows. •

Now let us see how to handle the deviation

$$f^{(m)}(x) - f_n^{(m)}(x)$$

between the $m$th derivative of $f$ and its standard kernel estimate. Let us suppose, as is usually the case, that the function

$$d(\cdot) = f(x - h\,\cdot)$$

belongs to $L^2(K_0\lambda)$. Theorem 87 gives the relationship between the expectation of $f_n^{(m)}(x)$ and the function $d$ and provides an exponential upper bound for the probability of deviation:

THEOREM 87 Let

$$f_n^{(m)}(x) = \frac{1}{nh^{m+1}}\sum_{j=1}^{n} K_r^{(m)}\!\left(\frac{x - X_j}{h}, 0\right) K_0\!\left(\frac{x - X_j}{h}\right)$$


be the standard kernel estimate of the $m$th derivative of $f$. Suppose that the function $d(\cdot) = f(x - h\,\cdot)$ belongs to $L^2(K_0\lambda)$. Then the expectation of $f_n^{(m)}(x)$ is the value at 0 of the $m$th derivative of the polynomial $P_h$ such that $P_h(h\,\cdot)$ is the projection of $d$ on $\mathbb{P}_r$. If moreover $|K_r^{(m)}|$ is bounded by the constant $M(m,r)$ we have

$$P\left(\left|f_n^{(m)}(x) - Ef_n^{(m)}(x)\right| \ge \varepsilon\right) \le 2\exp\left(-\frac{n\,h^{2(m+1)}\,\varepsilon^2}{2\,M^2(m,r)}\right).$$

Proof.

$$Ef_n^{(m)}(x) = \left(\frac{1}{h^{m+1}}\,K_r^{(m)}\!\left(\frac{-\,\cdot}{h}\right) * f\right)(x) = \frac{1}{h^{m}}\int_{\mathbb{R}} f(x - hv)\,K_r^{(m)}(v,0)\,K_0(v)\,d\lambda(v) = \frac{1}{h^{m}}\,\frac{d^{m}}{dv^{m}}\left(P_h(hv)\right)\bigg|_{v=0} = P_h^{(m)}(0).$$

The inequality is a consequence of Lemma (1.2) in (McDiarmid, 1989). •

We have a similar result for $F_n(x)$ when the function $F(x - h\,\cdot)$ belongs to $L^2(K_0\lambda)$. Now, once $K_0$ is specified, deterministic approximation theorems in $L^2(K_0\lambda)$ give the behavior of $(f^{(m)}(x) - Ef_n^{(m)}(x))$. Thus weak or strong (using the Borel-Cantelli lemma) convergence theorems can be easily derived for $f_n^{(m)}(x)$. Strong consistency results covering a wide class of density estimates were given in (Berlinet, 1991). They can be applied in the framework of this section to hierarchies of density estimates.
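As a small illustration (an assumption-laden sketch of ours, not the book's code), the estimator of Theorem 87 can be coded directly; here it is evaluated with $m = 0$ and the Gauss hierarchy, for which $K_3^{(0)}(u,0) = (3-u^2)/2$ gives the classical fourth order Gaussian kernel:

```python
import numpy as np

def hierarchy_estimate(x, sample, h, Krm0, K0, m):
    """f_n^(m)(x) = 1/(n h^(m+1)) sum_j K_r^(m)((x - X_j)/h, 0) K_0((x - X_j)/h)
    (Theorem 87); Krm0 and K0 are vectorized callables."""
    u = (np.asarray(x, dtype=float)[..., None] - np.asarray(sample)) / h
    return (Krm0(u) * K0(u)).sum(axis=-1) / (len(sample) * h ** (m + 1))

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # K_0 = N(0,1)
rng = np.random.default_rng(1)
X = rng.normal(size=5000)
pts = np.array([-1.0, 0.0, 1.0])
est = hierarchy_estimate(pts, X, h=0.5, Krm0=lambda u: (3 - u**2) / 2, K0=phi, m=0)
print(np.round(est, 3), np.round(phi(pts), 3))  # estimate vs true N(0,1) density
```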


12. EXERCISES
1 With the notations of Section 1.2, assume that the measurement operator is the evaluation operator at $t_1,\dots,t_n$. Let $s_0$ (respectively $s_p$) denote the interpolating spline (respectively smoothing spline) corresponding to the data $a$, the measurement operator $A$, and the energy operator $B$ (respectively, and the smoothing parameter $p$).
a- Prove that there exists a matrix $\Omega$ such that $\|Bs_0\|^2 = a'\Omega a$ and express it in terms of $\Sigma$ and $T$.
Answer: $\Omega = \Sigma^{-1}(I_n - T(T'\Sigma^{-1}T)^{-1}T'\Sigma^{-1})$.
b- If $a^* = As_p$, prove that $a^* = (I_n + p\Omega)^{-1}a$.
Let $\mu_k$ (respectively $v_k$) for $k = 1,\dots,n$ be the eigenvalues (respectively the eigenvectors) of $\Omega$. Let $m$ be the number of null eigenvalues and assume without loss of generality that $\mu_1 = \dots = \mu_m = 0$.
c- Prove that $\Omega$ and $\Sigma^{-1}$ coincide on the set of $a$ such that the interpolating spline corresponding to $a$ belongs to $\mathrm{Ker}(B)^{\perp}$, and that for $k > m$, $\mu_k$ is the inverse of an eigenvalue of $\Sigma$.
d- Calculate $s_p$ as a linear combination of the interpolating splines of the data $v_k$. One can prove that these splines display an oscillatory behavior increasing with $k$ when the eigenvalues are ordered increasingly, and therefore that this decomposition of the smoothing spline underlines the tapering effect of the smoothing parameter (see Exercise 9).

2 Let $z_1 < \dots < z_k$ be $k$ reals of an interval $(a,b)$ and let $S_r(z_1,\dots,z_k)$ denote the space of polynomial splines of order $r$ with simple knots $z_1 < \dots < z_k$ on $(a,b)$. The aim of this exercise is to prove that the dimension of $S_r(z_1,\dots,z_k)$ is $r+k$.
a- Prove that the $r+k$ functions $1, x, \dots, x^{r-1}, (x-z_1)_+^{r-1}, \dots, (x-z_k)_+^{r-1}$ are linearly independent and belong to $S_r(z_1,\dots,z_k)$.
b- Let $s$ be a given element of $S_r(z_1,\dots,z_k)$ and $p_i$ denote the polynomial of $\mathbb{P}_{r-1}$ which coincides with $s$ on $(z_i, z_{i+1})$. Prove that there exist constants $c_i$ such that $p_{i+1}(x) - p_i(x) = c_i(x - z_i)^{r-1}$, and conclude that $s$ is a linear combination of the $r+k$ functions of the first question.

3 This exercise is an alternative proof of Theorem 68 without using Theorem 59. Without loss of generality we assume that $a = 0$ and $b = 1$. Let $t_1 < \dots < t_n$ be $n$ distinct points in $(0,1)$. Let $Y = (y_1,\dots,y_n)'$ belong to $\mathbb{R}^n$ and $m$ be an integer less than or equal to $n$. As in Section 1.6.2 of Chapter 6, $H^m(0,1)$ is endowed with the initial value operator norm (6.39) and $K_0$ (respectively $K_1$) denotes the reproducing kernel of $\mathbb{P}_m$ (respectively of the null space of the initial


value operator).
a- Prove that the interpolating $D^m$ spline $s^*$ for the points $(t_i, y_i)$ $(i = 1,\dots,n)$ belongs to the finite dimensional subspace of $H^m(0,1)$ generated by $K_0(t_i,\cdot)$ and $K_1(t_i,\cdot)$ for $i = 1,\dots,n$.
b- Prove that the $n \times n$ matrix $\Sigma = (K_1(t_i,t_j))$ for $i,j = 1,\dots,n$ is positive definite.
c- Let $T$ be the $n \times m$ matrix $(t_i^{k-1})$ for $i = 1,\dots,n$ and $k = 1,\dots,m$. Check that the rank of $T'\Sigma^{-1}T$ is $m$.
d- Prove that

$$s^*(t) = \sum_{k=1}^{m} d_k t^{k-1} + \sum_{i=1}^{n} c_i K_1(t_i,t) \tag{3.20}$$

where $c = (c_1,\dots,c_n)' \in \mathbb{R}^n$ and $d = (d_1,\dots,d_m)' \in \mathbb{R}^m$ are solution to

$$\begin{cases}\min c'\Sigma c\\ \Sigma c + Td = Y.\end{cases} \tag{3.21}$$

e- Prove that the solution of (3.21) is given by

$$\begin{cases}c^* = \Sigma^{-1}(Y - Td^*)\\ d^* = (T'\Sigma^{-1}T)^{-1}T'\Sigma^{-1}Y.\end{cases}$$

f- Prove that $c^*, d^*$ is also the unique solution of

$$\begin{cases}\Sigma c + Td = Y\\ T'c = 0.\end{cases}$$

g- Using formula (6.40) of Chapter 6, prove that $s^*$ coincides with a polynomial of degree less than or equal to $2m-1$ on each interval $(t_i, t_{i+1})$ and with a polynomial of degree less than or equal to $m-1$ on $(0, t_1)$ and $(t_n, 1)$. Conclude that $s^*$ is a natural spline of order $2m$.
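Numerically, the system of question f is a symmetric bordered linear system that can be solved directly. The sketch below (our illustration, under the assumption $m = 2$ with $K_1(s,t) = \int_0^{\min(s,t)}(s-u)(t-u)\,du$, cf. Exercise 5b) returns the coefficients $c^*, d^*$ of (3.20):

```python
import numpy as np

def dm_spline_coefficients(t, y, m=2):
    """Solve Sigma c + T d = Y, T'c = 0 (question f) for the D^m interpolating
    spline (3.20), with Sigma_ij = K_1(t_i, t_j) and T_ik = t_i^(k-1).
    Here m = 2 and K_1(s, t) = s t mn - (s + t) mn^2/2 + mn^3/3, mn = min(s, t)."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    n = len(t)
    s, u = np.meshgrid(t, t, indexing="ij")
    mn = np.minimum(s, u)
    Sigma = s * u * mn - (s + u) * mn**2 / 2 + mn**3 / 3
    T = np.vander(t, m, increasing=True)              # columns 1, t, ..., t^(m-1)
    A = np.block([[Sigma, T], [T.T, np.zeros((m, m))]])
    b = np.concatenate([y, np.zeros(m)])
    cd = np.linalg.solve(A, b)
    return cd[:n], cd[n:]                             # c*, d*

c, d = dm_spline_coefficients([0.1, 0.3, 0.6, 0.9], [1.0, 0.5, 0.2, 0.4])
print(np.round(c, 3), np.round(d, 3))
```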

4 We use the same notations as in Exercise 3. Let $p$ be a positive real and for a function $f$ in $H^m(0,1)$, let $F = (f(t_1),\dots,f(t_n))'$. Let $W$ be a positive definite matrix and for $X \in \mathbb{R}^n$, let $\|X\|_W^2 = X'WX$ be the corresponding norm. We consider the problem of minimizing

$$\|Y - F\|_W^2 + p\int_0^1 \left(f^{(m)}(t)\right)^2 d\lambda(t). \tag{3.22}$$

a- Prove that (3.22) has a unique minimizer $s_p$ which is a natural spline of order $2m$.


b- Prove that $s_p$ is of the form (3.20) where $c^* \in \mathbb{R}^n$ and $d^* \in \mathbb{R}^m$ minimize $\|Y - Td - \Sigma c\|_W^2 + p\,c'\Sigma c$.
c- Prove that $c^*, d^*$ is also the unique solution to

$$\begin{cases}T'W(Y - \Sigma c - Td) = 0\\ (W\Sigma + pI_n)c = W(Y - Td)\end{cases} \tag{3.23}$$

or equivalently of

$$\begin{cases}(\Sigma + pW^{-1})c + Td = Y\\ T'Wc = 0.\end{cases}$$

5 a- Prove that

$$\|g\|^2 = g(0)^2 + g'(0)^2 + \int_0^1 g''(t)^2\,d\lambda(t)$$

defines a norm on the Sobolev space $H^2(0,1)$.
b- Prove that $Q(s,t) = 1 + st + \int_0^{\min(s,t)}(s-u)(t-u)\,d\lambda(u)$ is the reproducing kernel of $H^2(0,1)$ endowed with this norm.
c- Let $(t_1,\dots,t_n)$ be $n$ distinct points in $(0,1)$ and let $t_0 = 0$, $t_{n+1} = 1$. Given $n+4$ reals $(a, b, y_0,\dots,y_{n+1})$, we consider the problem of minimizing $\int_0^1 g''(t)^2\,d\lambda(t)$ under the interpolation constraints $g(t_i) = y_i$ and $g'(0) = a$, $g'(1) = b$. Write this optimization problem in the framework of abstract interpolating splines (in particular specify the spaces $\mathcal{H}$, $\mathcal{A}$, $\mathcal{B}$ and the operators $A$ and $B$) and conclude that it has a unique solution $G^*$.
d- Prove that this solution belongs to the subspace generated by the functions $Q(t_i,\cdot)$ $(i = 0,\dots,n+1)$ and $\frac{\partial Q}{\partial t}(\cdot,t)\big|_{t=0}$, $\frac{\partial Q}{\partial t}(\cdot,t)\big|_{t=1}$.
e- Prove that the solution is a cubic spline and describe its computation.
f- In which sense can we say that this spline is less smooth than the natural cubic spline interpolating the points $(t_i, y_i)$, $(i = 0,\dots,n+1)$?
g- Show that $G^*$ is the projection of any function of $H^2(0,1)$ satisfying the interpolation constraints of question c.

6 The Sobolev space $H^1(0,1)$ is endowed with the norm

$$\|g\|^2 = g(0)^2 + \int_0^1 g'(t)^2\,d\lambda(t).$$

Let $H$ be the subspace of functions $g \in H^1(0,1)$ such that $g(0) = 0$.
a- Prove that $H$ endowed with the induced norm is a reproducing kernel Hilbert space of $\mathbb{R}^{[0,1]}$ with reproducing kernel given by

$$R(s,t) = \min(s,t).$$


b- Let $p$ be a positive real, $(t_1,\dots,t_n)$ be $n$ distinct points in $(0,1)$ and $(y_1,\dots,y_n)$ be $n$ reals. Prove that there exists a unique minimizer of

$$\sum_{i=1}^{n}\left(y_i - f(t_i)\right)^2 + p\int_0^1 f'(t)^2\,d\lambda(t) \tag{3.24}$$

when $f$ ranges in $H$.
c- Write this solution $f^*$ in terms of $Y = (y_1,\dots,y_n)'$, of the functions $R(t_i,\cdot)$ and of the $n \times n$ matrix $\Sigma$ with elements $R(t_i,t_j)$.
d- Prove that $f^*$ is a polynomial spline. Specify its order. Is it a natural spline?
e- Let $(X_t,\ t \in (0,1))$ be a zero mean Gaussian process with $E(X_t - X_s)^2 = |t - s|$. Let $\varepsilon_i$, $i = 1,\dots,n$, be i.i.d. random variables with mean zero and variance $\sigma^2$, independent of the process $X_t$. $y_i$ denotes a realization of the random variable $Y_i = X_{t_i} + \varepsilon_i$. Prove that the BLUP of $X_t$ based on $(Y_1,\dots,Y_n)$ is of the form $f^*(t)$ for some value of $p$ when one substitutes the random variables $Y_i$ for their realizations $y_i$. See Chapter 2.

7 The Sobolev space $H^1(0,1)$ is endowed with the norm

$$\|g\|^2 = \int_0^1 g(t)^2\,d\lambda(t) + \int_0^1 g'(t)^2\,d\lambda(t). \tag{3.25}$$

a- Prove that $H^1(0,1)$ endowed with this norm is a reproducing kernel Hilbert space whose reproducing kernel is given by

$$R(s,t) = \frac{\cosh(s-1)\cosh(t)}{\sinh(1)}$$

for $t \le s$, and $R(s,t) = R(t,s)$ for $s < t$. See Chapter 6.
b- Prove that there exists an element $g$ of minimal norm in $H^1(0,1)$ such that $g(0) = 0$ and $g(1) = 1$ and compute it.

8 a- Given $n$ points $(t_i, y_i) \in \mathbb{R}^2$ with distinct abscissae, prove that if there exists a polynomial $P$ of degree less than or equal to $m-1$ such that $y_i = P(t_i)$, $i = 1,\dots,n$, then the $D^m$ interpolating spline of these points coincides with $P$. Prove that the same is true for the $D^m$ smoothing spline for any value of the smoothing parameter.
b- Under the same assumptions, and if the points $(t_i, y_i)$ result from a standard nonparametric regression model, prove that the $D^m$ smoothing spline estimator is unbiased.
c- Under the same assumptions, specify which of the least squares spline estimators are unbiased.


9 We use the same notations as in Exercises 1, 3 and 4.
a- Prove that $\Omega$ has at least $m$ zero eigenvalues (one may use the first question of Exercise 8) and prove that the trace of $(I_n + p\Omega)^{-1}$ is greater than or equal to $m$.
b- If $(z_k)$, $(k = 1,\dots,n)$, denotes an orthonormal basis of eigenvectors of $\Omega$ with corresponding eigenvalues $\mu_k$, and if $Y = \sum_{k=1}^{n}\gamma_k z_k$, prove that

$$s_p = \sum_{k=1}^{n}\frac{\gamma_k}{1 + p\mu_k}\,\phi_k \tag{3.26}$$

where $\phi_k$ is the interpolating spline corresponding to the data $z_k$.

10 The aim of this exercise is to compute explicitly the periodic smoothing spline in the case of equispaced design. Let $t_i = (i-1)/n$ for $i = 1,\dots,n$. Let us denote by $Y = (y_1,\dots,y_n)'$ a vector of data values and for a function $f \in H^m_{per}(0,1)$, let $F = (f(t_1),\dots,f(t_n))'$ be the vector of values of $f$ at the design points. For a sequence $\rho_k$, $k \in \mathbb{Z}$, let $\rho_k^{(n)}$ be the periodic sequence defined by

$$\rho_k^{(n)} = \sum_{l=-\infty}^{+\infty} \rho_{k+ln}.$$

Let $W$ denote the (finite Fourier transform) matrix with elements $W_{kl} = n^{-1/2}\exp\left(\frac{2i\pi kl}{n}\right)$ for $k$ and $l$ in $\{0,1,\dots,n-1\}$.

a- Check that $f(t_l) = \sqrt{n}\,\sum_{k=0}^{n-1} f_k^{(n)} W_{k(l-1)}$ and that $f_{n-k}^{(n)} = \overline{f_k^{(n)}}$.
b- If $F^{(n)}$ denotes the vector $(f_0^{(n)},\dots,f_{n-1}^{(n)})'$, check that $\frac{1}{\sqrt n}F = WF^{(n)}$ and that $F^{(n)} = \frac{1}{\sqrt n}\,\overline{W}'F$.
c- Let $Y^{(n)} = \frac{1}{\sqrt n}\,\overline{W}'Y = (y_0^{(n)},\dots,y_{n-1}^{(n)})'$. Let $s$ be the periodic smoothing $D^m$ spline minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(t_i)\right)^2 + p\int_0^1 \left(f^{(m)}(t)\right)^2 d\lambda(t)$$

when $f$ ranges in $H^m_{per}(0,1)$. Prove that the faithfulness-to-the-data measure can be written $\|F^{(n)} - Y^{(n)}\|^2$ and that the energy measure can be written $\sum_{k=-\infty}^{+\infty}|2\pi k|^{2m}|f_k|^2$.
d- Use the calculus of variations to prove that the Fourier coefficients of the solution $s$ satisfy $s_k^{(n)} - y_k^{(n)} + p\,(2\pi(k+ln))^{2m}\,s_{k+ln} = 0$ for $k = 0,\dots,n-1$ and $l \in \mathbb{Z}$.
e- If $b_k = (2\pi k)^{-2m}$, prove that the solution is given by


$$s_{k+ln} = \frac{b_{k+ln}}{p + b_k^{(n)}}\,y_k^{(n)} \quad\text{for } k + ln \neq 0, \qquad s_0 = y_0^{(n)}.$$

f- Prove that the eigenvalues defined in Exercise 1 are given by $\mu_0 = 0$ and $\mu_k = 1/b_k^{(n)}$ for $1 \le k \le n-1$.

11 This exercise details the computations of the projected estimates of Delecroix et al (1995). Let $H$ be a reproducing kernel Hilbert space, $l_1,\dots,l_p$ be $p$ given elements of $H$ and let $C$ be the cone $C = \{u \in H : \langle u, l_j\rangle_H \le 0,\ j = 1,\dots,p\}$. It is recalled that the polar cone of $C$ is given by $C^- = \{\sum_{j=1}^{p} a_j l_j,\ a_j \ge 0,\ j = 1,\dots,p\}$ and that if $P$ (respectively $P^-$) denotes the projection onto $C$ (respectively $C^-$), then $u = P(u) + P^-(u)$.
a- Prove that $P(u) = u - \sum_{j=1}^{p} a_j l_j$ where the $p$ coefficients satisfy the following quadratic optimization program

$$\min_{a_j \ge 0}\ \Big\|u - \sum_{j=1}^{p} a_j l_j\Big\|_H^2.$$

b- Prove that the previous optimization problem is equivalent to

$$\min_{a_j \ge 0}\ \sum_{i=1}^{p}\sum_{j=1}^{p} a_i a_j \langle l_i, l_j\rangle_H - 2\sum_{j=1}^{p} a_j \langle u, l_j\rangle_H.$$

c- Assume that $H = H^2(\mathbb{R})$ and

$$C = \{u \in H : u'(t_j) \le 0,\ j = 1,\dots,p\}.$$

Let $d_{t_j}^1$ denote the representer of derivation at $t_j$ in this space. If $u_n$ denotes an initial estimate of a functional parameter $u \in H$, prove that the projected estimate onto $C$ is given by $\tilde u_n = u_n - \sum_{j=1}^{p} a_j d_{t_j}^1$ where $(a_j)$ solve the optimization program

$$\min_{a_j \ge 0}\ \Big\|u_n - \sum_{j=1}^{p} a_j d_{t_j}^1\Big\|_H^2.$$

d- Use Chapter 6 to explain how to compute the terms $\langle d_{t_i}^1, d_{t_j}^1\rangle$.

12 Let $X_1,\dots,X_n$ be a sample from the unknown density $f$. Let $\alpha$ and $p$ be two positive real numbers. Assume that there exists a density $f \in H^1(\mathbb{R})$ such that $f(x_i) > 0$ for all $i$. The derivation of the Good and Gaskins (1971) estimator is linked to the following optimization


problem (see Tapia and Thompson, 1978):

$$\max_{f \in H^1(\mathbb{R}),\ f(x_i)\ge 0\ (i=1,\dots,n)}\ \prod_{i=1}^{n} f(x_i)\,\exp(-\Phi(f)),$$

where $\Phi(f)$ is the following squared norm on $H^1(\mathbb{R})$:

$$\Phi(f) = 2\alpha\int_{-\infty}^{+\infty} f'(t)^2\,d\lambda(t) + p\int_{-\infty}^{+\infty} f(t)^2\,d\lambda(t).$$


1) Prove that this optimization problem has a unique solution $f^*$.
2) Prove (see Chapter 6) that the reproducing kernel of $H^1(\mathbb{R})$ with the norm $\Phi(f)$ is given by

$$K(s,t) = \frac{1}{2(2\alpha p)^{1/2}}\,\exp\left(-\left(\frac{p}{2\alpha}\right)^{1/2}|t - s|\right).$$

3) Prove that the objective function of the above optimization problem (penalized log-likelihood) can be written

$$-\log L(f) = -\sum_{i=1}^{n}\log\langle f, K(x_i,\cdot)\rangle + \Phi(f).$$

4) Conclude from 3) that $f^*$ satisfies

$$f^* = \frac{1}{2}\sum_{i=1}^{n}\frac{K(x_i,\cdot)}{f^*(x_i)}.$$

5) Conclude from 2) and 4) that $f^*$ is an exponential spline (see Schumaker, 1981).

13 In the Sobolev space $H^2(0,1)$, consider the inner product

$$\langle f, g\rangle = \alpha f(0)g(0) + \beta f'(0)g'(0) + \int_0^1 f''(t)g''(t)\,d\lambda(t).$$

Denote by $K$ its reproducing kernel (see Chapter 7 for the formula).
1) Check that the linear functional $L(f) = \int_0^1 f(t)\,d\lambda(t)$ is continuous and find its representer $l$.
Answer: $l(t) = \dfrac{t^4}{24} - \dfrac{t^3}{6} + \dfrac{t^2}{4} + \dfrac{t}{2\beta} + \dfrac{1}{\alpha}$.
2) Find the optimal weights $w_i$ and the optimal design $t_i \in (0,1)$ so that $\sum_{i=1}^{n} w_i f(t_i)$ is the best approximation in the sense of Sard of


$L(f)$ given $\{f(t_i),\ i = 1,\dots,n\}$.
Answer: $t_1 = \frac12 - (n-1)h/2$, $t_j = t_1 + (j-1)h$, $t_n = \frac12 + (n-1)h/2$ and $w_1 = \frac12 - (n/2 - 1)h = w_n$, $w_j = h$, $j = 2,\dots,n-1$, where $h = \left(n - 1 + \left(\frac12\right)^{1/2}\right)^{-1}$.

14 Let the Sobolev space $H^2(0,1)$ be endowed with the following two norms

$$\|f\|_1^2 = \int_0^1 f(t)^2\,d\lambda(t) + \int_0^1 f'(t)^2\,d\lambda(t) + \int_0^1 f''(t)^2\,d\lambda(t)$$

and

$$\|f\|_2^2 = f(0)^2 + f'(0)^2 + \int_0^1 f''(t)^2\,d\lambda(t).$$

The aim of this exercise is to prove that these two norms are equivalent, more precisely that

$$\|f\|_1^2 \le 9\,\|f\|_2^2 \le 45\,\|f\|_1^2.$$

1) By the absolute continuity of $f$, write a Taylor expansion of $f$ in the neighborhood of 0 (for $x \in [0,1]$) with integral remainder and conclude that

$$f(0)^2 \le 2f(x)^2 + 2\int_0^1 f'(t)^2\,d\lambda(t).$$

By integrating this inequality on $[0,1]$, conclude that $f(0)^2 \le 2\|f\|_1^2$.
2) By the same arguments applied to $f'$, conclude that $f'(0)^2 \le 2\|f\|_1^2$, and hence that $\|f\|_2^2 \le 5\|f\|_1^2$.
3) By similar arguments as in 1) prove that

$$f(x)^2 \le 2f(0)^2 + 2\int_0^1 f'(t)^2\,d\lambda(t)$$

and by integrating this inequality conclude that

$$\int_0^1 f(t)^2\,d\lambda(t) \le 2f(0)^2 + 2\int_0^1 f'(t)^2\,d\lambda(t).$$

4) Applying the same technique as in 3) to $f'$, show that

$$\int_0^1 f'(t)^2\,d\lambda(t) \le 2\left(f'(0)^2 + \int_0^1 f''(t)^2\,d\lambda(t)\right) \le 2\|f\|_2^2,$$


and hence that $\|f\|_1^2 \le 9\|f\|_2^2$.


15 Extend the proof of Lemma 17 to get Lemma 18. For this, use derivatives of order $m$ of monomials.

16 Extend the proof of Theorem 74 to get Theorem 78. For this, use Lemma 18 and Theorem 76.

17 Prove the determinantal expressions given in Theorem 80.

18 Particularize the algorithms given in Subsection 11.2 to the case where the density $K_0$ is piecewise polynomial.

19 Prove the inequalities used in the proof of Theorem 81. Apply this theorem to Examples c) and d) in Subsection 11.2.

20 Let $K_0$ be the real function supported by $[-1,1]$ defined by

$$K_0(x) = \begin{cases} 1 & \text{if } -0.5 \le x \le 0.5,\\[4pt] \dfrac{1 - x^2}{31x^2 - 7} & \text{if } 0.5 \le |x| \le 1. \end{cases}$$

$K_0$ is continuous. It may appear as a regularization of the uniform density over $[-1,1]$. Find a primitive of $K_0$ and the constant $C$ such that $CK_0$ is a probability density function. Compute the kernels of order 4 and 6 in the hierarchy with basic density $CK_0$.

21 Extend Theorem 87 to the general estimates obtained in Section 10.

22 By computing its moments, check that the function

$$\frac{3 - 6x^2 + x^4}{\sqrt{2\pi}}\,\exp\left(-\frac{x^2}{2}\right)$$

is the Gram-Charlier kernel of order (4,6).
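For readers who want to check this numerically, here is a quick verification sketch (ours; the grid is an arbitrary choice) integrating the first moments: those of order 0 to 3 and 5 vanish, while the fourth equals $4! = 24$, as required for a kernel of order (4,6):

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 2_000_001)
K = (3 - 6 * x**2 + x**4) * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dx = x[1] - x[0]
for p in range(7):
    print(p, round(float(np.sum(x**p * K) * dx), 6))
# expected: 0, 0, 0, 0, 24, 0, 360 -> moments 0..3 and 5 vanish, moment 4 is 4!
```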

