To appear in the Journal of Nonparametric Statistics, Vol. 00, No. 00, Month 20XX, 1–18

Robust Non-Parametric Curve Estimation using Density Power Divergences

Arun Kumar Kuchibhotla and Ayanendranath Basu*

Indian Statistical Institute, Kolkata, India.

*Corresponding author. Email: [email protected]

(Received 00 Month 20XX; accepted 00 Month 20XX)

The family of density power divergences proposed by Basu, Harris, Hjort, and Jones (1998) has been extensively studied in the literature for various parameter estimation problems including censored models, independent but non-identical data (linear regression) and generalized linear models. It has been shown to be a robust and reasonably efficient alternative to the maximum likelihood estimation procedure. In this paper, we extend the application of the density power divergence to the problem of estimating the non-parametric regression function. We show by simulations that this method is robust against outliers in the response variable in non-parametric regression problems.

Keywords: Density Power Divergence; Non-Parametric Regression; Average Squared Error; Smoothing Spline; Local Linear Regression; Nearest Neighbour.

AMS Subject Classification: 62G08, 62G35.

1. Introduction

We consider the non-parametric regression problem where the regression function given the explanatory variable X is denoted by m0(X). We assume a parametric model for the response variable Y given the explanatory variable X. There are two existing paradigms in the literature of non-parametric regression, namely, regression estimation with a fixed design and with a random design. For a comparison of these two paradigms see Section 1.9 of Györfi, Kohler, Krzyżak, and Walk (2002).

In this paper, we consider the case where the explanatory variable X is random. The standard non-parametric regression model can then be represented as

$$Y = m_0(X) + \varepsilon, \qquad E(\varepsilon \mid X) = 0.$$

There are many existing methods in the literature for estimating the unknown function m0 based on a random sample Dn = {(X1, Y1), . . . , (Xn, Yn)}. To build our method, we briefly describe two of the most prominent existing function estimation techniques. One of the natural ideas in constructing an estimate of m0 is to use local modelling; historically this gave rise to the first non-parametric regression function estimate, the regressogram of Tukey (1947). The idea was further generalized by several authors, e.g., Stone (1977), leading to the estimator

$$m_{n1}(x) := \operatorname*{argmin}_{t\in\mathbb{R}} \sum_{i=1}^{n} W_{ni}(x)\,[Y_i - t]^2, \qquad (1)$$

where the Wni(x) represent non-negative weights such that Wni(x) is large if x is close to Xi; thus the estimator is a local averaging based estimator. Usually these weights are based on kernels or the nearest neighbour method, examples of which can be found in Chaudhuri and Dewanji (1995). This corresponds to weighted likelihood estimation of m0 if we assume that the errors have a normal distribution with mean zero. This method was also extended to include other error models in Tibshirani and Hastie (1987), constructing the local likelihood based regression function estimator. By direct differentiation, it is easy to see that the analytic form of mn1(x) is given by

$$m_{n1}(x) = \sum_{i=1}^{n} W_{ni}(x)\,Y_i \Big/ \sum_{i=1}^{n} W_{ni}(x).$$

Most of the asymptotic properties of mn1(x) can be obtained from this form. See Györfi et al. (2002, Chapter 4) and references therein for more details on the asymptotic properties. Also, see Staniswalis (1989), Chaudhuri and Dewanji (1995) and Fan, Gasser, Gijbels, Brockmann, and Engel (1997). There are other generalizations of this method available in the literature which consider local polynomial fits. We will refer to the approach described in Equation (1) as the first approach.

Another natural idea is to minimize the sum of squares (as in linear regression) together with a roughness penalty. To avoid over-fitting we need to suitably restrict the class of functions we consider; usually this is done by adding a penalty based on the second derivative to the least squares objective function. This approach gives a global smoothness property, while the approach described in Equation (1) gives local smoothness. This estimator is given by

$$m_{n2}(\cdot) := \operatorname*{argmin}_{f\in C^{k}(\mathbb{R})}\; \frac{1}{n}\sum_{i=1}^{n} [Y_i - f(X_i)]^2 + \lambda_n \int \{f''\}^2, \qquad (2)$$

where λn > 0 represents the smoothing parameter and C^k(R^d) represents the class of all functions from R^d to R which are (k−1) times continuously differentiable. This estimator is called the smoothing spline. We will refer to the approach given in Equation (2) as the second approach. If λn tends to ∞, then the estimator generated by the second approach is the linear regression estimator; if λn is close to zero, the estimator is very rough and will fit the data very closely, leading to over-fitting. It is well-known in the literature that the estimator (minimizer) in Equation (2) is unique and is a natural spline of degree (2k−1). This approach also corresponds to the penalized likelihood approach for normal errors. General likelihood based extensions of this penalization approach can be found in Eggermont and LaRiccia (2009) and Cox and O'Sullivan (1990), among others. Most of the asymptotic properties are derived from the objective function and we refer the interested reader to Györfi et al. (2002, Chapters 20, 21), Eubank (1999) and references therein for more details, including the extension to the case of multivariate covariates. Also see Silverman (1985), Wahba (1990), and van de Geer (1990).

The estimators described above and their likelihood generalizations are very sensitive to outliers (in both covariates and response), just as in the case of parametric estimation using the likelihood. In the literature, robust versions are available for the standard non-parametric regression model. Robust alternatives to the first approach given in Equation (1) include the estimators

$$m_{n1}(x) := \operatorname*{argmin}_{t\in\mathbb{R}} \sum_{i=1}^{n} W_{ni}(x)\,\rho(Y_i - t),$$

for some suitable function ρ; for example, one can use a ρ which leads to Huber's ψ-function or Tukey's biweight function. These estimators, along with their asymptotic and robustness properties, were extensively studied in Cleveland (1979), Härdle (1984), Härdle and Tsybakov (1988) and Boente and Fraiman (1989), to mention a few. Robust alternatives to the second approach include estimators of the type

$$m_{n2}(\cdot) := \operatorname*{argmin}_{f\in C^{k}(\mathbb{R})}\; \frac{1}{n}\sum_{i=1}^{n} \rho(Y_i - f(X_i)) + \lambda_n \int \{f''\}^2,$$

for appropriately chosen functions ρ. These estimators were extensively studied in terms of asymptotic properties in Cox and O'Sullivan (1990), Cox and O'Sullivan (1996), Cox (1983) and Oh, Nychka, and Lee (2007).

These robust alternatives to least squares are adaptations of M-estimators of location to the non-parametric regression problem. Few estimators take a general robustified likelihood approach with explicit consideration of the structure of the conditional distribution. Here, we provide a flexible class of non-parametric regression function estimators which includes the likelihood based regression function estimate as a special case, while the others are robust estimators of the regression function at different levels of robustness. This class is derived from density power divergences (Basu et al. (1998)) parametrized by a scalar tuning parameter α ≥ 0 that can be tuned as needed. This family avoids the use of non-parametric smoothing in density-based minimum distance estimation, which is unavoidably necessary in the approaches of Beran (1977), Basu and Lindsay (1994) and Park and Basu (2004). As a function of the tuning parameter (α ≥ 0), the divergences within this family are given by

$$\phi_\alpha(g, f) = \int f^{1+\alpha} - \frac{1+\alpha}{\alpha}\int g\,f^{\alpha} + \frac{1}{\alpha}\int g^{1+\alpha}. \qquad (3)$$

By a simple application of Jensen's inequality, it is easy to see that φα(g, f) ≥ 0. For α = 0, it is defined by taking the limit of φα(g, f) as α → 0, which coincides with the Kullback-Leibler divergence; in the minimum distance literature, the latter divergence leads to the maximum likelihood procedure. As α increases from zero, the asymptotic efficiency of the minimum divergence estimators based on Equation (3) decreases and the robustness properties become more pronounced. We define the best fitting parameter by

$$\theta_g := \operatorname*{argmin}_{\theta}\, \phi_\alpha(g, f_\theta),$$

given the target density g and the parametric family of densities F = {fθ : θ ∈ Θ ⊂ R^p}. Suppose we have independent and identically distributed random variables X1, X2, . . . , Xn with common probability density g. In order to get an estimate of θg, we can minimize a sample based version of φα(g, fθ). To get this empirical version we replace the second term of the divergence in (3), which is an expectation of $f_\theta^\alpha$ with respect to g, with the sample mean of $f_\theta^\alpha(X_i)$, 1 ≤ i ≤ n. As the third term is independent of θ, a reasonable estimate of θg is given by

$$\theta_n := \operatorname*{argmin}_{\theta}\; \int f_\theta^{1+\alpha} - \frac{1+\alpha}{\alpha}\,\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i).$$

Here we note that in the case g = fθ0 for some θ0 ∈ Θ, we get θg = θ0. Asymptotic properties of these estimators (for any α ≥ 0) were studied in Basu et al. (1998) and, for the linear regression set up, in Ghosh and Basu (2013).
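As an illustration of the estimating procedure just described (not taken from the paper), the following minimal Python sketch computes the minimum density power divergence estimate of a normal mean with known σ by direct numerical minimization of the empirical objective above; the function names and the use of scipy.optimize are our own choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dpd_objective(theta, x, alpha, sigma=1.0):
    """Empirical DPD objective for the N(theta, sigma^2) location family:
    integral of f^{1+alpha} minus ((1+alpha)/alpha) times the sample mean of f^alpha
    (the third, theta-free term of the divergence is dropped)."""
    c = (2.0 * np.pi * sigma ** 2) ** (-alpha / 2.0)
    int_f = c / np.sqrt(1.0 + alpha)            # closed form of the integral term
    f_alpha = c * np.exp(-alpha * (x - theta) ** 2 / (2.0 * sigma ** 2))
    return int_f - (1.0 + alpha) / alpha * np.mean(f_alpha)

def mdpd_location(x, alpha, sigma=1.0):
    """Minimum DPD estimate of the mean; requires alpha > 0 (alpha -> 0 recovers the MLE)."""
    res = minimize_scalar(dpd_objective, bounds=(np.min(x), np.max(x)),
                          args=(x, alpha, sigma), method="bounded")
    return res.x

# toy check: 10% of the sample is shifted far to the right
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 90), rng.normal(8.0, 1.0, 10)])
print(np.mean(x), mdpd_location(x, alpha=0.5))   # the robust estimate stays near 0
```

The same construction, with the conditional density fθ(y|x) in place of fθ, leads to the regression objectives introduced next.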

The idea on which we base our estimation procedure is as follows. Suppose we consider the parametric model F = {fθ(y|x) : θ ∈ Θ} for the conditional distribution of Y given X = x, where θ = θ(x) might be a vector of functions. We consider

$$\rho(\theta) := \int \phi_\alpha\big(g(\cdot\,|\,x),\, f_\theta(\cdot\,|\,x)\big)\, h(dx),$$

as a distance measure, where h is the density of X. As before, define θg as a minimizer of ρ(θ) over all θ ∈ Θ; an estimator $\hat\rho(\theta)$ of ρ(θ) is given by

$$\hat\rho(\theta) := \frac{1}{n}\sum_{i=1}^{n}\left\{\int f_\theta^{1+\alpha}(y\,|\,X_i)\,dy - \frac{1+\alpha}{\alpha}\, f_\theta^{\alpha}(Y_i\,|\,X_i)\right\}.$$

In order to control the roughness of the estimate, we define the estimator θn,α as

$$\theta_{n,\alpha} := \operatorname*{argmin}_{\theta\in\Theta}\; \hat\rho(\theta) + \lambda_n J_k^2(\theta), \qquad (4)$$

where $J_k^2(f) = \int \{f^{(k)}\}^2$ and $\lambda_n J_k^2(f)$ is the roughness penalty. In the case of a vector of functions to be estimated, for example in the simultaneous estimation of the mean and variance functions, one should add penalties for each of these functions. Here we note that λn has to converge to zero as n → ∞ to get a consistent estimator.

Our second approach to get a smooth robust regression function estimate minimizes

$$\frac{1}{n}\sum_{i=1}^{n} W_{ni}(x)\left\{\int f_\theta^{1+\alpha}(y\,|\,x)\,dy - \frac{1+\alpha}{\alpha}\, f_\theta^{\alpha}(Y_i\,|\,x)\right\}, \qquad (5)$$

over all θ(x) to get θα(x). Note that for each x in the support of X, θ(x) belongs to R^q for some q ≥ 1 given by the number of functions in the vector θ. For multidimensional covariates, this approach is much simpler to apply in practice than the former; this is because the former approach requires differentiability of the functions to be estimated and, for the penalty term, we have to consider partial derivatives of different orders, while the latter approach does not require differentiability of the function to be estimated.

In this section, we have introduced two different families of non-parametric function estimators using density power divergences under a "parametric" model for the conditional distribution. The rest of the paper is organized as follows. In Section 2, we consider the special case of function estimation based on the penalized density power divergence method given by Equation (4), where the conditional distribution belongs to the normal location family, and study the properties of the estimator. In Section 3, we provide the corresponding derivation for the estimation method given by Equation (5). In Section 4, we numerically investigate the performance of the proposed methods, and compare them with some existing estimators. In Section 5, we discuss a simple extension of the location model to the location-scale model. In Section 6, we elaborate on the need for these new estimators. In Section 7, we briefly discuss the issue of choosing the tuning parameters. Section 8 provides concluding remarks, including the scope for future extensions.

2. Penalized Density Power Divergence in Gaussian Errors Model

2.1. Computational Aspects

Consider the standard non-parametric regression model

Y = m0(X) + ε,

where ε given X has a N(0, σ²) distribution. Let there be n iid observations from this model given by Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. For simplicity, we will deal with the case where X is a bounded univariate random variable. Without loss of generality, we also assume that X ∈ [0, 1] a.s. The penalized density power divergence regression function estimator is given by (assuming that σ is known and k = 2)

$$m_n(\cdot) := \operatorname*{argmin}_{f\in C^{2}(\mathbb{R})}\; -\frac{1}{n\alpha}\sum_{i=1}^{n}\exp\!\left(-\frac{\alpha}{2}[Y_i - f(X_i)]^2\right) + \frac{1}{\alpha} + \frac{\lambda_n}{2}J_2^2(f). \qquad (6)$$

We note here that as α tends to 0, the objective function coincides with that of the smoothing spline corresponding to least squares. Without loss of generality, assume that the Xi's are in ascending order. Define δy(x) = 1{x=y}. One can prove from the Euler-Lagrange equations (see Ng (1994)), by rewriting the objective function as

$$\frac{1}{\alpha} - \int\left\{\frac{1}{n\alpha}\sum_{i=1}^{n}\exp\!\left(-\frac{\alpha}{2}[Y_i - f(x)]^2\right)\delta_{X_i}(x) - \frac{\lambda_n}{2}\{f''(x)\}^2\right\}dx,$$

that mn is a natural spline of degree 3 satisfying

$$m_n^{(3)}(X_i^{+}) - m_n^{(3)}(X_i^{-}) = \sum_{j:X_j = X_i}\exp\!\left(-\frac{\alpha}{2}[Y_j - m_n(X_i)]^2\right)\frac{Y_j - m_n(X_i)}{n\lambda_n}.$$

Here the summation on the right hand side arises because of the possibility that there can be repetitions in the X observations. Similar adjustments have to be made throughout the paper, but for simplicity we will, from now on, assume that the X observations are all distinct. The theory goes through all the same even in the case when there are repetitions. Once we know that the estimator is a spline, we can alternatively get hold of the estimator by minimizing

$$T(a) := \frac{1}{\alpha} - \frac{1}{n\alpha}\sum_{i=1}^{n}\exp\!\left(-\frac{\alpha}{2}[Y_i - a_i]^2\right) + \frac{\lambda_n}{2}\,a^{\top} Q R^{-1} Q^{\top} a,$$

where a = (a1, a2, . . . , an)⊤ and Q is a lower triangular banded matrix of order n × (n − 2), while R is a tridiagonal matrix of order (n − 2) × (n − 2). These matrices are very well known in the spline interpolation literature; see, e.g., Ng (1994). Also, the vector a represents the function values of the spline at the Xi's. If a vector c is chosen to denote the second derivative values of the spline at the (n − 2) interior points, then it is well-known that $Q^{\top}a = Rc$ and $J_2^2(f) = c^{\top}Rc$ hold. See, for example, Green and Silverman (1994, Section 2.1.2). Having the minimizer aα of T(a), one can construct the spline by interpolating the points (Xi, aα,i) for 1 ≤ i ≤ n.

In the case α = 0, which corresponds to the smoothing spline, the estimator a0 can be written explicitly as

$$a_0 = (I_n + n\lambda_n Q R^{-1} Q^{\top})^{-1} Y,$$

where In is the identity matrix of order n and Y = (Y1, Y2, . . . , Yn). However, this equation is not useful for the numerical computation of the estimator; see Green and Silverman (1994, Section 2.3.3). Reinsch (1967) proposed an algorithm which is numerically more stable. A similar algorithm, with iterations until convergence, can be given for α > 0 in our case. It is easy to see by differentiating T(a) that aα satisfies

$$a = Y - \lambda_n D(a)\, Q R^{-1} Q^{\top} a = Y - \lambda_n D(a)\, Q c,$$

where c represents the corresponding vector of second derivative values at the interior points and D(a) is a diagonal matrix with its ith entry given by $n\exp(\alpha[Y_i - a_i]^2/2)$. Now, multiplying both sides by Q⊤, we get

$$\big(R + \lambda_n Q^{\top} D(a)\, Q\big)\,c = Q^{\top} Y.$$

Using this equation along with the relation $Q^{\top}a = Rc$, we get an iterative algorithm with steps similar to those in the Reinsch algorithm. In actual implementation, this iterative algorithm is seen to be numerically more stable than the reweighted iterative algorithm which uses the minimizer of

$$\sum_{i=1}^{n} w_i\,(Y_i - f(X_i))^2 + \lambda_n \int \{f''\}^2.$$

See Eubank (1999, Theorem 5.3).
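To make the computational recipe concrete, here is a short Python sketch of the iteration described above. The construction of Q and R follows Green and Silverman (1994, Section 2.1.2); everything else (function names, starting value, stopping rule) is our own illustration rather than the authors' implementation, and the scaling of λn and of D(a) simply follows the text above, so it may need adjusting against one's own derivation.

```python
import numpy as np

def qr_matrices(x):
    """Banded matrices Q (n x (n-2)) and R ((n-2) x (n-2)) of Green and Silverman
    (1994, Sec. 2.1.2), so that Q'a = Rc and J_2^2(f) = c'Rc for a natural cubic spline
    with knots at the sorted, distinct design points x."""
    n = len(x)
    h = np.diff(x)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q, R

def penalized_dpd_spline(x, y, alpha, lam, n_iter=100, tol=1e-8):
    """Reinsch-type iteration sketched above: given a, form D(a), solve
    (R + lam * Q' D Q) c = Q' y for c, and update a = y - lam * D Q c.
    Returns the fitted values a at the design points; the full spline is the
    natural cubic interpolant of (x_i, a_i)."""
    n = len(x)
    Q, R = qr_matrices(x)
    a = y.copy()                                              # start from the data
    for _ in range(n_iter):
        D = np.diag(n * np.exp(alpha * (y - a) ** 2 / 2.0))   # weights from the text above
        c = np.linalg.solve(R + lam * Q.T @ D @ Q, Q.T @ y)
        a_new = y - lam * D @ (Q @ c)
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a
```

At α = 0 the weights D(a) no longer depend on a, so a single pass reproduces the ordinary smoothing spline fit; for α > 0 the weights change with the current fit and the steps are repeated until convergence.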

2.2. Asymptotic Properties

Most of the asymptotic properties of the general penalized estimator in (6) follow from the results of Cox (1983). In particular, he proves consistency and the optimal rate of convergence for the estimator; see Cox (1983, Theorem 1). His proofs are complicated, and involve restrictive conditions to accommodate the very general setting that he considers. We will provide a substantially simpler proof of consistency of a truncated version of our penalized estimator, without any distributional assumptions on X except that X is bounded almost surely, for the particular setting considered by us. We hope to come up with a similarly simple proof of the rate of convergence result in the future. The approach for proving this is based on Kohler and Krzyżak (2001). For this theorem, we first present some notation. Suppose we have the data Dn containing independent and identically distributed (Xi, Yi) for 1 ≤ i ≤ n. Consider

$$\tilde m_n(\cdot) := \operatorname*{argmin}_{f\in C^{k}(\mathbb{R}^d)}\; -\frac{1}{n\alpha}\sum_{i=1}^{n}\exp\!\left(-\frac{\alpha}{2}[Y_i - f(X_i)]^2\right) + \frac{1}{\alpha} + \frac{\lambda_n}{2}J_k^2(f).$$

For a function f : R^d → R and L > 0, define TLf : R^d → R by

$$T_L f(x) = \begin{cases} L, & \text{if } f(x) > L,\\ f(x), & \text{if } |f(x)| \le L,\\ -L, & \text{if } f(x) < -L.\end{cases}$$

Set $m_n(\cdot) := T_{\log n}\,\tilde m_n(\cdot)$. For any real number x, TLx is defined as TLf where f is identically equal to x. Let ‖·‖2 denote the Euclidean norm on R^d and let m0(·) denote the conditional expectation of Y given X, i.e., m0(x) = E(Y | X = x). Let the distribution of X be denoted by µ and let N represent the set of all natural numbers. The following theorem proves the L2(µ) consistency of our function estimator mn.

Theorem 2.1 Let k ∈ N with 2k > d. Depending on the data, choose λn = λn(Dn) > 0 such that λn → 0 almost surely as n → ∞ and

$$\frac{n\,\lambda_n^{d/(2k)}}{\log^{7} n} \to \infty \quad \text{as } n\to\infty \text{ a.s.} \qquad (7)$$

Then

$$\int |m_n(x) - m_0(x)|^2\,\mu(dx) \to 0 \quad \text{as } n\to\infty \text{ a.s.},$$

for every distribution of (X, Y) with ‖X‖2 bounded almost surely and EY² < ∞.

Proof. Our proof is basically an adaptation of the proof of Theorem 1 of Kohler and Krzyżak (2001) to our objective function, and so we omit some details which can be found in the above paper.

Since X is bounded, without loss of generality we assume that X ∈ [0, 1]^d almost surely. Let L, ε > 0 and choose gε ∈ C^k(R^d) such that

$$\int |m_0(x) - g_\varepsilon(x)|^2\,\mu(dx) < \varepsilon, \qquad\text{and}\qquad J_k^2(g_\varepsilon) < \infty.$$

By definition $\tilde m_n$ satisfies, with $\psi_\alpha(x) = \exp(-\alpha x/2)/\alpha$,

$$-\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big) + \lambda_n J_k^2(\tilde m_n) \;\le\; -\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha(Y_i^2) + \lambda_n J_k^2(0) \;\le\; 0,$$

with 0 in $J_k^2(0)$ representing the identically zero function. This implies, using $\psi_\alpha(t) \le 1/\alpha$ for all $t \ge 0$, that

$$\lambda_n J_k^2(\tilde m_n) \le \frac{1}{\alpha} \;\;\Rightarrow\;\; J_k^2(\tilde m_n) \le \frac{1}{\alpha\,\lambda_n}. \qquad (8)$$

Now notice that,

$$\int \{m_n(x) - m_0(x)\}^2\,\mu(dx) = E\{[m_n(X) - Y]^2 \mid D_n\} - E\{[m_0(X) - Y]^2\} =: \sum_{j=1}^{8} T_{j,n},$$

where

\begin{align*}
T_{1,n} &= E\{|m_n(X) - Y|^2 \mid D_n\} - (1+\varepsilon)\,E\{|m_n(X) - Y_L|^2 \mid D_n\},\\
T_{2,n} &= (1+\varepsilon)\left[E\{|m_n(X) - Y_L|^2 \mid D_n\} - \frac{1}{n}\sum_{i=1}^{n}|m_n(X_i) - Y_{i,L}|^2\right],\\
T_{3,n} &= (1+\varepsilon)\left[\frac{1}{n}\sum_{i=1}^{n}|m_n(X_i) - Y_{i,L}|^2 - \frac{1}{n}\sum_{i=1}^{n}|\tilde m_n(X_i) - Y_{i,L}|^2\right],\\
T_{4,n} &= (1+\varepsilon)\,\frac{1}{n}\sum_{i=1}^{n}|\tilde m_n(X_i) - Y_{i,L}|^2 - (1+\varepsilon)^2\,\frac{1}{n}\sum_{i=1}^{n}|\tilde m_n(X_i) - Y_i|^2,\\
T_{5,n} &= (1+\varepsilon)^2\left[\frac{1}{n}\sum_{i=1}^{n}|\tilde m_n(X_i) - Y_i|^2 - \frac{1}{n}\sum_{i=1}^{n}|g_\varepsilon(X_i) - Y_i|^2\right],\\
T_{6,n} &= (1+\varepsilon)^2\left[\frac{1}{n}\sum_{i=1}^{n}|g_\varepsilon(X_i) - Y_i|^2 - E|g_\varepsilon(X) - Y|^2\right],\\
T_{7,n} &= (1+\varepsilon)^2\left[E|g_\varepsilon(X) - Y|^2 - E|m_0(X) - Y|^2\right],\\
T_{8,n} &= \{(1+\varepsilon)^2 - 1\}\,E|m_0(X) - Y|^2,
\end{align*}

with Yi,L = TLYi. Our goal now is to prove that Tj,n → 0 almost surely as n → ∞ for 1 ≤ j ≤ 8. Almost sure convergence to zero of T1,n and T4,n (under the limit L → ∞) follows from the proof of Theorem 1 of Kohler and Krzyżak (2001). If x, y ∈ R with |y| ≤ log n and z = $T_{\log n}x$, then |z − y| ≤ |x − y|, which implies that T3,n ≤ 0 for all sufficiently large n. From the definition, it readily follows that

$$-\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big) + \lambda_n J_k^2(\tilde m_n) \;\le\; -\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big) + \lambda_n J_k^2(g_\varepsilon).$$

Hence, we get

\begin{align*}
&-\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big) + \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big)\\
&\quad\le\; -\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big) + \lambda_n J_k^2(\tilde m_n) + \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big)\\
&\quad\le\; -\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big) + \lambda_n J_k^2(g_\varepsilon) + \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big)\\
&\quad=\; \lambda_n J_k^2(g_\varepsilon). \qquad (9)
\end{align*}

By assumption λn → 0 as n→∞ almost surely and so we get that

$$\limsup_{n\to\infty}\left\{-\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big) + \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big)\right\} \le 0.$$

Noting that ψα is convex on R+, we get that for all p, q > 0, $\psi_\alpha(p) - \psi_\alpha(q) \ge \psi'_\alpha(q)(p - q)$, where $\psi'_\alpha(q)$ represents the derivative of ψα evaluated at q. From the definition of ψα, $\psi'_\alpha(t) = -\exp(-\alpha t/2)/2$. Now, taking $p_i = [Y_i - g_\varepsilon(X_i)]^2$ and $q_i = [Y_i - \tilde m_n(X_i)]^2$ and adding up over all i, we get

\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big) - \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big)\\
&\quad\ge\; -\frac{1}{2n}\sum_{i=1}^{n}\exp\!\left(-\frac{\alpha}{2}[Y_i - \tilde m_n(X_i)]^2\right)\Big\{[Y_i - g_\varepsilon(X_i)]^2 - [Y_i - \tilde m_n(X_i)]^2\Big\}.
\end{align*}

Since $\exp\!\big(-\alpha[Y_i - \tilde m_n(X_i)]^2/2\big) \le 1$, we get

\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - g_\varepsilon(X_i)]^2\big) - \frac{1}{n}\sum_{i=1}^{n}\psi_\alpha\big([Y_i - \tilde m_n(X_i)]^2\big)\\
&\quad\ge\; -\frac{1}{2n}\sum_{i=1}^{n}\Big\{[Y_i - g_\varepsilon(X_i)]^2 - [Y_i - \tilde m_n(X_i)]^2\Big\}. \qquad (10)
\end{align*}

By inequalities (9) and (10), we have,

$$\frac{1}{2n}\sum_{i=1}^{n}\Big\{[Y_i - \tilde m_n(X_i)]^2 - [Y_i - g_\varepsilon(X_i)]^2\Big\} \;\le\; \lambda_n J_k^2(g_\varepsilon),$$

and hence,

$$\limsup_{n\to\infty}\,\frac{1}{n}\sum_{i=1}^{n}\Big\{[Y_i - \tilde m_n(X_i)]^2 - [Y_i - g_\varepsilon(X_i)]^2\Big\} \;\le\; 0.$$

From the arguments above, lim sup T5,n ≤ 0, and by the strong law of large numbers, T6,n → 0 a.s. as n → ∞. By the definition of gε, T7,n converges to 0 almost surely as n → ∞. T8,n converges to zero by letting ε tend to zero. The proof of the almost sure convergence of T2,n to 0 is based on Lemma 1 of Kohler and Krzyżak (2001), which exploits the bound in Equation (8). This completes the proof of the L2(µ) consistency of the truncated version $m_n$ of $\tilde m_n$ for all distributions of (X, Y) with X ∈ [0, 1]^d a.s. □

Remark 1 We emphasize once more that there are no distributional assumptions on X; not even the existence of a density with respect to the Lebesgue measure is necessary. By contrast, the proofs in Cox (1983) require a density that is bounded away from zero. Also see the remarks following Theorem 1 in Kohler and Krzyżak (2001) about relaxing the assumptions in our Theorem 2.1.

3. Weighted Density Power Divergence in Gaussian Errors Model

3.1. Computational Aspects

The weighted density power divergence estimator mn(·) in the Gaussian error model at a point x0 is obtained as a minimizer of

$$\sum_{i=1}^{n} W_{ni}(x_0)\left[\frac{1}{\alpha} - \frac{1}{\alpha}\exp\!\left(-\frac{\alpha}{2}[y_i - a]^2\right)\right],$$

over all a ∈ R. A general M-estimation version of this, which uses the objective function $\sum_{i=1}^{n} W_{ni}(x)\,\rho(Y_i - a)$, was studied, among others, by Cleveland (1979), Härdle and Tsybakov (1988) and Boente and Fraiman (1989). It is clear from the objective function that the minimizer satisfies

$$m_n(x)\sum_{i=1}^{n} W_{ni}(x)\,\phi_\alpha\big([Y_i - m_n(x)]^2\big) = \sum_{i=1}^{n} W_{ni}(x)\,\phi_\alpha\big([Y_i - m_n(x)]^2\big)\,Y_i \quad \text{for all } x,$$

where $\phi_\alpha(\cdot) = \alpha\,\psi_\alpha(\cdot)$. This equation naturally leads to a fixed point algorithm, which is seen to be a useful computational tool in the simulation studies. Observe that if α = 0, then the iterative algorithm converges in one step, leading to a linear estimator based on the weights. This estimator is given by

$$m_n^{(0)}(x) = \sum_{i=1}^{n} W_{ni}(x)\,Y_i \Big/ \sum_{i=1}^{n} W_{ni}(x),$$

which coincides with the kernel based regression estimator or the nearest neighbour based regression function estimator for the corresponding weights.
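The fixed point iteration described above is straightforward to implement; the sketch below is our own code (with the Epanechnikov kernel weights used later in Section 4.2) and evaluates the weighted DPD fit at a single point x0.

```python
import numpy as np

def epanechnikov_weights(x0, X, h):
    """Kernel weights W_ni(x0) with the Epanechnikov kernel and bandwidth h."""
    u = (X - x0) / h
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def weighted_dpd_fit(x0, X, Y, alpha, h, n_iter=100, tol=1e-8):
    """Fixed-point iteration for the weighted DPD estimate at x0:
    m <- sum_i W_i phi_alpha((Y_i - m)^2) Y_i / sum_i W_i phi_alpha((Y_i - m)^2),
    with phi_alpha(t) = exp(-alpha t / 2).  At alpha = 0 this is the kernel
    regression (Nadaraya-Watson) estimate and converges in one step."""
    W = epanechnikov_weights(x0, X, h)
    m = np.sum(W * Y) / np.sum(W)          # start from the kernel regression fit
    for _ in range(n_iter):
        phi = np.exp(-alpha * (Y - m) ** 2 / 2.0)
        m_new = np.sum(W * phi * Y) / np.sum(W * phi)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# example use: evaluate the robust fit on a grid of x-values
# grid = np.linspace(-1, 1, 101)
# fit = [weighted_dpd_fit(x0, X, Y, alpha=0.5, h=0.2) for x0 in grid]
```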

3.2. Asymptotic Properties

Various asymptotic aspects of the general estimators of the weighted type have been studied in the literature. Härdle (1984) proved weak consistency, strong consistency and asymptotic normality of these regression function estimators, along with minimax robustness of the estimators with decreasing and non-redescending ψ-function, albeit under somewhat restrictive conditions which were relaxed by Boente and Fraiman (1989). Härdle and Luckhaus (1984) proved uniform consistency of these estimators under both random design and fixed design models. Boente and Fraiman (1989) and Härdle (1984) prove finite dimensional asymptotic normality of the estimator mn, i.e., they prove asymptotic normality of (mn(t1), mn(t2), . . . , mn(tk)), suitably normalized, as n → ∞, which can be used to provide confidence bands for these estimators. All these results are specific to the objective function involving ρ(Yi − a) (i.e., the location model). One can derive similar results for a general likelihood based weighted regression function estimator, as was done in Chaudhuri and Dewanji (1995). It is straightforward to extend these results to the case of the weighted divergence estimator, and so these proofs are omitted. Finite dimensional convergence also follows by using the Cramér-Wold technique. Here we mention the consistency theorem with an appropriate adaptation of the conditions of Chaudhuri and Dewanji (1995); the proof is similar to the one given by these authors, which is a standard proof of asymptotic normality of a Euclidean parameter. Here is the notation for the theorem. The estimator θn(x) is defined as a minimizer of

$$H_n(t) = \sum_{i=1}^{n} W_{ni}(x)\,\rho(y_i \mid t). \qquad (11)$$

The assumptions on ρ and f are as follows:

(A1) The support of f(y|t) is the same for all t ∈ J ⊂ R^d, and for each y in that support ∇ρ(y|t) and f(y|t) are thrice continuously differentiable with respect to t for all t ∈ J.

(A2) E[∇ρ(Yi|t)] = 0, E[ρjk(Yi|t)] = Vjk(t), and V(t) is continuous and negative definite for all t ∈ J.

(A3) There exist functions Mjkl such that |∇jklρ(Y|t)| ≤ Mjkl(Y) for all t ∈ J, and E[Mjkl] = mjkl < ∞ for all j, k, l.

(A4) θ(x) is continuous in x.

The conditions on the weights Wni, which are assumed to depend only on the Xi's, are as follows:

(W1) For any x in the domain of θ, $\sum_{i=1}^{n} W_{ni}^2(x) \to 0$ in probability as n → ∞.

(W2) There exists a sequence {δn} (either random or deterministic) such that δn > 0 for all n ≥ 1, δn tends to zero in probability as n goes to infinity, and

$$\lim_{n\to\infty} P\left\{\max_{1\le i\le n;\; |x - X_i|\ge \delta_n} W_{ni}(x) = 0\right\} = 1.$$

Theorem 3.1 Under the regularity conditions (A1)-(A4) and (W1)-(W2), there exists θn(x) which minimizes Hn(t) in the limit and is a consistent estimator of θ(x), the minimizer of the limit of Hn(·).

4. Simulation Studies

In this section, we provide numerical evidence through simulations about the usefulness of our proposed non-parametric estimators under pure and contaminated data.

4.1. Minimum Penalized Density Power Divergence

The set up used for simulation in this case is as follows. Let X ∼ U(−1, 1) and, independently, let ε ∼ (1 − ε)N(0, 0.1) + εN(2, 0.1) with ε = 0.00, 0.05, 0.10, 0.15. Subsequently, we compute Y = X^2 + ε. The estimators of the non-parametric regression function based on the penalized density power divergence for different choices of α, for a particular sample from the above model, are shown in Figure 1. It is clear from the figure that in the absence of contamination all the values of α give similar and very close results. However, under contamination, the least squares estimator (corresponding to α = 0) is very severely affected by the outliers; with increasing α, the effect of outliers on the estimator becomes less pronounced. Thus larger values of α appear to afford a high degree of robustness with minimal loss in efficiency.
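For readers who wish to reproduce this set-up, a minimal data-generating sketch is given below; reading the 0.1 in N(0, 0.1) as a variance is our assumption.

```python
import numpy as np

def simulate_data(n=100, contamination=0.10, rng=None):
    """Draws (X, Y) from the Section 4.1 model: X ~ U(-1, 1) and Y = X^2 + e,
    where e is N(0, 0.1) with probability 1 - contamination and N(2, 0.1)
    with probability contamination (0.1 interpreted as a variance here)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.uniform(-1.0, 1.0, n)
    outlier = rng.random(n) < contamination
    e = np.where(outlier,
                 rng.normal(2.0, np.sqrt(0.1), n),
                 rng.normal(0.0, np.sqrt(0.1), n))
    return X, X ** 2 + e
```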

Figure 1. Regression Fits using the Penalized Density Power Divergence

4.2. Weighted Density Power Divergence

Here we use the same set up as in Subsection 4.1. The plots of the regression function estimates obtained using the weighted divergences are shown in Figure 2. Here we used weights from kernel regression with the Epanechnikov kernel. Our observations in this case are similar to those in the previous set of simulations. Large values of α are again seen to provide a high degree of robustness with reasonably good efficiency. We remark here that the estimator can be seen to have some boundary effects, which are well-known in the kernel regression literature and may need some correction.

Figure 2. Regression Fits using the Weighted Density Power Divergence

4.3. Some Comparison with Existing Estimators

In order to numerically evaluate the performance of the non-parametric regression function estimators discussed in the previous sections, define the average squared error as

$$\mathrm{ASE}(f) = \frac{1}{n}\sum_{i=1}^{n}\{f(X_i) - m_0(X_i)\}^2.$$
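A small Monte Carlo driver of the kind used to fill the tables below might look as follows; the names and the separation into a data generator and a fitting routine are our own, and any of the fitting sketches given earlier can be plugged in.

```python
import numpy as np

def ase(fitted, X, m0):
    """Average squared error of fitted values against the true regression function m0."""
    return np.mean((fitted - m0(X)) ** 2)

def monte_carlo_ase(fit_fn, generate_fn, m0, n_rep=100, seed=0):
    """Mean ASE over n_rep replications.  fit_fn(X, Y) must return fitted values at
    the design points; generate_fn(rng) must return one simulated sample (X, Y)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_rep):
        X, Y = generate_fn(rng)
        vals.append(ase(fit_fn(X, Y), X, m0))
    return float(np.mean(vals))
```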

Table 1 provides the observed average squared error values for the penalized density power divergence based regression function estimator over different contamination levels and different α values, over 100 replications each containing n = 100 observations from the model described in Subsection 4.1.

Table 1. ASEs of the Penalized Density Power Divergence Estimators of the Regression Function

Error (ε)   α = 0.00    α = 0.25    α = 0.35    α = 0.50    α = 0.75    α = 1.00
0.00        0.000636    0.000637    0.000638    0.000638    0.000639    0.000640
0.05        0.017501    0.008522    0.006283    0.003981    0.002000    0.001215
0.10        0.049678    0.026087    0.019367    0.011927    0.005010    0.002223
0.15        0.103761    0.060111    0.046230    0.029779    0.012915    0.005334
0.20        0.188599    0.123647    0.100143    0.069475    0.033543    0.015394
0.25        0.289391    0.207209    0.174300    0.127634    0.064893    0.027836

Table 2. ASEs of the Penalized Huber Proposal based Estimators of the Regression Function

Error (ε)   δ = 2.000   δ = 1.345   δ = 1.000   δ = 0.600   δ = 0.300
0.00        0.000637    0.000638    0.000640    0.000647    0.000677
0.05        0.010512    0.007204    0.005055    0.002644    0.001365
0.10        0.031426    0.021662    0.014889    0.006931    0.002610
0.15        0.070197    0.050280    0.035525    0.016842    0.005743
0.20        0.138882    0.105306    0.078007    0.040067    0.014648
0.25        0.226739    0.179690    0.137880    0.072942    0.025036

From the values given in the table, it may be observed that under pure data the performance of the smoothing spline estimator based on the squared error loss is only marginally better than that of the divergence based estimators for α > 0. As the error contamination increases, the performance of the smoothing spline deteriorates sharply, but the divergence based estimators corresponding to larger values of α show far greater relative stability. Clearly we need large values of α in order to get good performance for higher contamination proportions.

In order to understand the improvement in the performance of the proposed estimator, we compare it with a well-known penalized M-estimator obtained using the pseudo-Huber loss function given by

$$L_\delta(a) = \delta^2\left(\left\{1 + \frac{a^2}{\delta^2}\right\}^{1/2} - 1\right).$$
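For reference, the two per-observation losses being compared here can be written side by side; the small sketch below (our own naming) makes the qualitative difference explicit: the pseudo-Huber loss grows linearly in the tails, while the DPD loss is bounded and hence has a redescending influence function.

```python
import numpy as np

def pseudo_huber_loss(r, delta):
    """L_delta(r) = delta^2 * (sqrt(1 + r^2/delta^2) - 1): quadratic near 0, linear in the tails."""
    return delta ** 2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)

def dpd_loss(r, alpha):
    """Per-observation DPD loss (1 - exp(-alpha r^2 / 2)) / alpha: bounded in r."""
    return (1.0 - np.exp(-alpha * r ** 2 / 2.0)) / alpha

r = np.linspace(-5.0, 5.0, 11)
print(pseudo_huber_loss(r, delta=1.345))
print(dpd_loss(r, alpha=0.5))
```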

The performance of the penalized M-estimator of the regression function for various values of the tuning parameter δ is recorded in Table 2. We note from the tables that, for comparable levels of stability under contamination, the density power divergence based estimator performs better than the Huber loss based estimator in pure data, exhibiting better relative efficiency. Compare the results corresponding to the pairs (0.25, 2), (0.35, 1.345), (0.5, 1), (0.75, 0.6) and (1.0, 0.3) of (α, δ).

Table 3. ASEs of the Weighted Density Power Divergence Estimators of the Regression Function

Error (ε)   α = 0.00    α = 0.25    α = 0.35    α = 0.50    α = 0.75    α = 1.00
0.00        0.004345    0.004341    0.004339    0.004337    0.004333    0.004329
0.05        0.019386    0.010564    0.008458    0.006364    0.004657    0.004033
0.10        0.052575    0.028652    0.021991    0.014729    0.008098    0.005422
0.15        0.110860    0.066292    0.052019    0.034905    0.017111    0.008836
0.20        0.189815    0.125002    0.101691    0.071243    0.035425    0.017672
0.25        0.283315    0.205487    0.174932    0.131986    0.074504    0.041607

Table 4. ASEs of the Weighted Huber Proposal based Estimators of the Regression Function

Error (ε)   δ = 2.000   δ = 1.345   δ = 1.000   δ = 0.500   δ = 0.250
0.00        0.004341    0.004336    0.004330    0.004307    0.004293
0.05        0.012540    0.009421    0.007438    0.004813    0.003992
0.10        0.034205    0.024626    0.018049    0.008689    0.005583
0.15        0.076924    0.056799    0.041755    0.018238    0.009772
0.20        0.140797    0.108292    0.081914    0.036288    0.018499
0.25        0.224703    0.182207    0.145075    0.073187    0.041585

Under exactly the same conditions as in Tables 1 and 2, we compute the ASEs of the weighted divergence estimators for different choices of α and different contamination levels; these are given in Table 3. Comparing the ASE values obtained in this case with those for the penalized divergence estimator, we note that the weighted divergence estimator is less accurate than the penalized one. This deficiency must be due, at least in part, to the boundary effect. The performance of the weighted M-estimator of the regression function for various values of δ based on the pseudo-Huber loss function is given in Table 4. Here too our conclusions are similar to those in the case of the penalized estimators. In this case, compare the pairs (0.25, 2), (0.35, 1.345), (0.50, 1.00), (0.75, 0.5) and (1.0, 0.25). Here we have different (but close) sets of pairs, as the calibration appears to be slightly different in the weighted case compared to the penalized case.

The above comparison shows that our estimators are competitive with, or better than, the M-estimator based on the pseudo-Huber loss function in the location case studied here. As this is the specific case for which this M-estimator is designed, our results indicate that the proposed estimators match the M-estimators in the latter's ballpark. On the other hand, the proposed estimators are much more general and may apply in many other scenarios where there are no well-established competitors.

5. Extensions of the Model

One can generalize the results of Section 2 by allowing heteroscedastic error models, i.e., by taking the conditional distribution of ε given X = x to be the N(µ(x), σ²(x)) distribution and determining the function estimators for both the location µ(·) and the scale σ(·). In this case, one needs to consider constrained optimization, as we require σ(·) > 0. One way to circumvent this problem is to define another parameter function θ(x) = log σ(x), optimize the objective function in terms of µ(·) and θ(·), and then recover an estimate of σ from that of θ. Even in this case, one can prove that the estimators of µ and θ are splines of the appropriate order. See, for example, Yuan and Wahba (2004).

In the case of the weighted density power divergence discussed in Section 3, one can generalize the model by allowing the distribution of Y given X = x to be the N(µ(x), σ²(x)) distribution. This location-scale model in the context of M-estimation was extensively studied by Härdle and Tsybakov (1988). They prove pointwise consistency, asymptotic normality and the optimal rate of convergence of the estimator thus obtained. In this case, one need not reparametrize the scale curve and can get the estimator easily using the iterative algorithm mentioned earlier. Finally, we remark that the weighted density power divergence approach is much simpler than the penalized density power divergence approach, in that it can be easily extended to the case of higher dimensional covariates, while the penalized approach becomes computationally more cumbersome: in that case one has to consider penalties over different partial derivatives, and the computation of the estimators is not as easy as in the univariate case due to the lack of a natural ordering.
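For the normal location-scale model the integral term of the divergence is available in closed form, so the weighted objective of Equation (5) can be written out explicitly; the display below is our own evaluation of the standard normal DPD expressions and is not quoted from the paper. With fθ(y|x) the N(µ(x), σ²(x)) density, $\int f_\theta^{1+\alpha}(y\,|\,x)\,dy = (2\pi\sigma^2(x))^{-\alpha/2}(1+\alpha)^{-1/2}$, and the objective at each x becomes

\[
\frac{1}{n}\sum_{i=1}^{n} W_{ni}(x)\,\big(2\pi\sigma^2(x)\big)^{-\alpha/2}
\left\{\frac{1}{\sqrt{1+\alpha}}
      -\frac{1+\alpha}{\alpha}\exp\!\left(-\frac{\alpha\,[Y_i-\mu(x)]^2}{2\sigma^2(x)}\right)\right\},
\]

to be minimized jointly over µ(x) and σ(x) > 0.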

6. Why New Estimators?

There are many regression function estimators available in the literature which are similar to the class of M-estimators of location. So it is natural to ask why we need new estimators of regression which, in the case of location models, belong to the class of existing M-estimators. Notice that the estimators proposed in this paper are derived from the minimization of a bona fide measure of divergence between the empirical and the model density. As a result we get a particular class of M-estimators which directly involves the form of the model density and includes the optimal estimator as a special case, irrespective of the model. In this sense, the non-parametric function estimators based on the density power divergence clearly stand out, just as the parametric estimators based on the density power divergence (Basu et al. (1998)) stand out within the class of M-estimators in the case of independent and identically distributed data.

Another important feature of these estimators is that they take into account the conditional distribution of the errors. In particular, our approach covers the case of non-additive errors and the standard models which fall into this category, including the logistic regression model, the Poisson regression model, the gamma regression model and all generalized linear models. Our estimators are parametrized by a scalar parameter α; whenever α is close to zero, the performance of the estimator will be close to that of the maximum likelihood estimator. However, as we have seen, often we get fairly robust estimators for small to moderate values of α. Thus, an adaptive choice of α can allow us to choose between a more efficient solution and a more robust solution based on the proportion of contamination in the Y-direction.

The reason why we observe robustness in the Y-direction is that we are down-weighting outlying observations in this direction; however, we are giving equal weights to all the observations in the X-direction, as we estimate the expectation with respect to X by the mean over these observations. Thus we do not expect the estimator to be particularly robust in the X-direction. As the model under consideration is a random design, it may be of interest to consider the robustness issue in the X-direction and refine the procedure to suitably integrate this element in the functioning of the method. We hope to study this aspect in the future.

7. Choosing Tuning Parameters

In standard regression problems, one common and very popular method of choosing tuning parameters like λ and k ∈ N in the penalized case, or h (or k ∈ N) in the kernel (or k-NN) regression case, is to use cross-validation or generalized cross-validation. In the case of contamination, it would be unwise to consider a non-robust criterion to choose among different robust alternatives. One has to choose the tuning parameters based on a robust criterion such as robustified cross-validation or generalized cross-validation, for which there are several options available in the literature; see, for example, Ronchetti, Field, and Blanchard (1997) and Cantoni and Ronchetti (2001). One also has to choose α along with the parameters λ and h; all three parameters can be chosen by minimizing a robust model selection criterion. In some rare cases, it is possible that the chosen values of the tuning parameters violate condition (7) in Theorem 2.1, in which case other refinements will be necessary.

One such possible refinement could be to use the Bayesian information criterion for the selection of the parameter λn, with a subsequent choice of α by a robust cross-validation criterion as before. This would, to some extent, be in the spirit of Wang, Jiang, Huang, and Zhang (2013b), where the authors use the Bayesian information criterion to select the regularization parameters and then adaptively choose the robustness tuning parameter to preserve a high breakdown point and high efficiency in the variable selection problem. Indeed, it may be noted that the objective function that Wang et al. (2013b) have used is the objective function for the density power divergence in the case of parametric Gaussian error regression. In their case it was also shown that their adaptive choice of tuning parameters leads to an estimator with asymptotic breakdown point 1/2, and the estimator also satisfies the oracle property. Currently, we are exploring possible applications of similar techniques to our problem.
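As one concrete, and purely illustrative, possibility, the sketch below selects (α, h) for the weighted estimator by leave-one-out cross-validation in which the held-out residuals are scored with Huber's ρ instead of the square, in the spirit of the robustified criteria cited above; none of this is taken from the paper, and the function fit_at is a placeholder for any of the fitting routines sketched earlier.

```python
import numpy as np
from itertools import product

def huber_rho(r, c=1.345):
    """Huber's rho with a fixed constant, used only to score held-out residuals."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def robust_loo_score(X, Y, fit_at, alpha, h):
    """Leave-one-out prediction error scored with Huber's rho instead of the square,
    so that a few outlying Y_i cannot dictate the choice of tuning parameters.
    fit_at(x0, X, Y, alpha, h) returns the fitted value at x0."""
    resid = [Y[i] - fit_at(X[i], np.delete(X, i), np.delete(Y, i), alpha, h)
             for i in range(len(X))]
    return float(np.mean(huber_rho(np.array(resid))))

def select_tuning(X, Y, fit_at, alphas=(0.25, 0.5, 1.0), bandwidths=(0.1, 0.2, 0.4)):
    """Grid search over (alpha, h) minimizing the robust leave-one-out score."""
    return min(product(alphas, bandwidths),
               key=lambda ab: robust_loo_score(X, Y, fit_at, *ab))
```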

8. Conclusions

In this paper, two classes of robust non-parametric regression function estimators are introduced which take into account the distribution of the errors. Consistency results are proved for both these estimators under general set-ups, with special focus on the Gaussian regression model. These estimators are shown to be robust in simulation studies that compare their performance in terms of the ASE over different contamination levels and different parameter values.

The methods described in this paper can be generalized in various ways. There are many extensions of the density power divergences which can lead to a flexible family of regression estimators to choose from; see Jones, Hjort, Harris, and Basu (2001). Another way to generalize these estimators in the penalized case is to consider other roughness penalties, such as the total variation penalty (Mammen and van de Geer (1997)) or the variable bandwidth penalty (Wang, Du, and Shen (2013a)), which might lead to improved performance and better adaptivity.

In this paper, we considered only the case of regression; the methods can be extended to include additive models, single index models or multiple index models, and also shape restricted inference. For shape restricted non-parametric inference in the context of least squares, likelihood or divergences, we do not need to add penalization: in these cases we can get a reasonable estimate without penalization, and adding the penalty leads to an estimate with more smoothness properties (differentiability). Robust shape restricted inference has not been sufficiently explored in the literature. For other possible extensions, we refer the reader to Green and Silverman (1994).

Acknowledgement

We thank Elvezio Ronchetti and Anand Vidyashankar for helpful discussions.

References

Basu, A., and Lindsay, B.G. (1994), 'Minimum disparity estimation for continuous models: efficiency, distributions and robustness', Ann. Inst. Statist. Math., 46, 683–705.
Basu, A., Harris, I.R., Hjort, N.L., and Jones, M.C. (1998), 'Robust and efficient estimation by minimising a density power divergence', Biometrika, 85, 549–559.
Beran, R. (1977), 'Minimum Hellinger distance estimates for parametric models', Ann. Statist., 5, 445–463.
Boente, G., and Fraiman, R. (1989), 'Robust nonparametric regression estimation', J. Multivariate Anal., 29, 180–198.
Cantoni, E., and Ronchetti, E. (2001), 'Resistant selection of the smoothing parameter for smoothing splines', Stat. Comput., 11, 141–146.
Chaudhuri, P., and Dewanji, A. (1995), 'On a likelihood-based approach in nonparametric smoothing and cross-validation', Statist. Probab. Lett., 22, 7–15.
Cleveland, W.S. (1979), 'Robust locally weighted regression and smoothing scatterplots', J. Amer. Statist. Assoc., 74, 829–836.
Cox, D.D. (1983), 'Asymptotics for M-type smoothing splines', Ann. Statist., 11, 530–551.
Cox, D.D., and O'Sullivan, F. (1990), 'Asymptotic analysis of penalized likelihood and related estimators', Ann. Statist., 18, 1676–1695.
Cox, D.D., and O'Sullivan, F. (1996), 'Penalized likelihood-type estimators for generalized nonparametric regression', J. Multivariate Anal., 56, 185–206.
Eggermont, P.P.B., and LaRiccia, V.N. (2009), Maximum penalized likelihood estimation. Volume II: Regression, Springer Series in Statistics, Springer, Dordrecht.
Eubank, R.L. (1999), Nonparametric regression and spline smoothing, Statistics: Textbooks and Monographs, Vol. 157, 2nd ed., Marcel Dekker, Inc., New York.
Fan, J., Gasser, T., Gijbels, I., Brockmann, M., and Engel, J. (1997), 'Local polynomial regression: optimal kernels and asymptotic minimax efficiency', Ann. Inst. Statist. Math., 49, 79–99.
Ghosh, A., and Basu, A. (2013), 'Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression', Electron. J. Stat., 7, 2420–2456.
Green, P.J., and Silverman, B.W. (1994), Nonparametric regression and generalized linear models: a roughness penalty approach, Monographs on Statistics and Applied Probability, Vol. 58, Chapman & Hall, London.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002), A distribution-free theory of nonparametric regression, Springer Series in Statistics, Springer-Verlag, New York.
Härdle, W. (1984), 'Robust regression function estimation', J. Multivariate Anal., 14, 169–180.
Härdle, W., and Luckhaus, S. (1984), 'Uniform consistency of a class of regression function estimators', Ann. Statist., 12, 612–623.
Härdle, W., and Tsybakov, A.B. (1988), 'Robust nonparametric regression with simultaneous scale curve estimation', Ann. Statist., 16, 120–135.
Jones, M.C., Hjort, N.L., Harris, I.R., and Basu, A. (2001), 'A comparison of related density-based minimum divergence estimators', Biometrika, 88, 865–873.
Kohler, M., and Krzyżak, A. (2001), 'Nonparametric regression estimation using penalized least squares', IEEE Trans. Inform. Theory, 47, 3054–3058.
Mammen, E., and van de Geer, S. (1997), 'Locally adaptive regression splines', Ann. Statist., 25, 387–413.
Ng, P.T. (1994), 'Smoothing spline score estimation', SIAM J. Sci. Comput., 15, 1003–1025.
Oh, H.S., Nychka, D.W., and Lee, T.C.M. (2007), 'The role of pseudo data for robust smoothing with application to wavelet regression', Biometrika, 94, 893–904.
Park, C., and Basu, A. (2004), 'Minimum disparity estimation: asymptotic normality and breakdown point results', Bull. Inform. Cybernet., 36, 19–33.
Reinsch, C.H. (1967), 'Smoothing by spline functions. I, II', Numer. Math., 10, 177–183; ibid. 16 (1970/71), 451–454.
Ronchetti, E., Field, C., and Blanchard, W. (1997), 'Robust linear model selection by cross-validation', J. Amer. Statist. Assoc., 92, 1017–1023.

Silverman, B.W. (1985), 'Some aspects of the spline smoothing approach to nonparametric regression curve fitting', J. Roy. Statist. Soc. Ser. B, 47, 1–52. With discussion.
Staniswalis, J.G. (1989), 'The kernel estimate of a regression function in likelihood-based models', J. Amer. Statist. Assoc., 84, 276–283.
Stone, C.J. (1977), 'Consistent nonparametric regression', Ann. Statist., 5, 595–645. With discussion and a reply by the author.
Tibshirani, R., and Hastie, T. (1987), 'Local likelihood estimation', J. Amer. Statist. Assoc., 82, 559–567.
Tukey, J.W. (1947), 'Non-parametric estimation. II. Statistically equivalent blocks and tolerance regions–the continuous case', Ann. Math. Statistics, 18, 529–539.
van de Geer, S. (1990), 'Estimating a regression function', Ann. Statist., 18, 907–924.
Wahba, G. (1990), Spline models for observational data, CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.
Wang, X., Du, P., and Shen, J. (2013a), 'Smoothing splines with varying smoothing parameter', Biometrika, 100, 955–970.
Wang, X., Jiang, Y., Huang, M., and Zhang, H. (2013b), 'Robust variable selection with exponential squared loss', J. Amer. Statist. Assoc., 108, 632–643.
Yuan, M., and Wahba, G. (2004), 'Doubly penalized likelihood estimator in heteroscedastic regression', Statist. Probab. Lett., 69, 11–20.
