ON NONPARAMETRIC INTERVAL ESTIMATION OF A REGRESSION FUNCTION BASED ON THE RESAMPLING


ALEXANDER ANDRONOV

Riga Technical University

Riga, Latvia

OUTLINE

1. Introduction
2. Averaging Method
3. Median Smoothing
4. Case Study
References

1. INTRODUCTION

We consider the nonparametric regression

$$Y = m(x) + \varepsilon, \tag{1.1}$$

where $Y$ is a dependent variable, $m(\cdot)$ is an unknown regression function, $x$ is a $d$-dimensional vector of independent variables (regressors), and $\varepsilon$ is a random term. It is supposed that the random term has zero expectation ($E\varepsilon = 0$) and variance $\operatorname{Var}\varepsilon = \sigma^2 w(x)$, where $\sigma^2$ is an unknown constant and $w(x)$ is a known weight function. Furthermore, we have a sequence of independent observations $(Y_i, x_i)$, $i = 1, 2, \ldots, n$. On that basis we need to construct an upper confidence bound $\tilde m(x)$ for $m(x)$ at the point $x$ corresponding to probability $\gamma$:

$$P\{m(x) \le \tilde m(x)\} = \gamma. \tag{1.2}$$

The usual way [DiCiccio and Efron, 1996] consists of using a consistent and asymptotically normally distributed estimate $\hat m(x)$ of $m(x)$. The final expression contains the derivatives $m'(x)$, $m''(x)$ and the variance $\sigma^2$, which are replaced by corresponding estimators.

The resampling approach [Wu, 1986; Andronov and Afanasyeva, 2004] gives an alternative that can be described as follows. For a fixed point $x$ we take the $k$ nearest neighbors $x_1, x_2, \ldots, x_k$ of $x$ among $x_1, x_2, \ldots, x_n$ (in some sense, for example using a kernel function $K_H(x - x_i)$, the Mahalanobis distance or another distance):

$$\{x_1, x_2, \ldots, x_k\} = \{x_i : i \in I_c(x)\},$$

$$I_c(x) = \{i : x_i \text{ is one of the } k \text{ nearest neighbors of } x \text{ among } x_1, x_2, \ldots, x_n\}.$$

Now we have the sample $(x_1, Y_1), (x_2, Y_2), \ldots, (x_k, Y_k)$ instead of $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$.

Then we draw a sample without replacement $\{i_1, i_2, \ldots, i_r\}$ of size $r$ ($r < k$) from the set $\{1, 2, \ldots, k\}$, form the resample $(x_1^*, Y_1^*), (x_2^*, Y_2^*), \ldots, (x_r^*, Y_r^*)$, where $x_j^* = x_{i_j}$ and $Y_j^* = Y_{i_j}$, and calculate the estimate $m^*(x)$ of our function of interest $m(x)$.

Then we return all selected elements to the initial sample and repeat this procedure $R$ times. As a result we obtain the sequence of estimators $m_1^*(x), m_2^*(x), \ldots, m_R^*(x)$. After ordering we have the sequence $m_{(1)}^*(x), m_{(2)}^*(x), \ldots, m_{(R)}^*(x)$, where $m_{(i)}^*(x) \le m_{(i+1)}^*(x)$.

Let the number $R$ be selected so that $\gamma R$ is an integer. Then we set $\tilde m(x) = m_{(\gamma R)}^*(x)$.
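The procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `resampling_upper_bound`, the argument name `gamma` for the confidence probability, the one-dimensional regressor and the plain absolute distance used to pick the neighbors are all choices made here, not taken from the paper. The `estimator` argument is any function of the point and the resample (for example a kernel average or a median).

```python
import numpy as np

def resampling_upper_bound(x, xs, ys, estimator, k, r, R, gamma, rng=None):
    """Return the (gamma*R)-th smallest of R resample estimates at the point x."""
    rng = np.random.default_rng() if rng is None else rng
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    order = round(gamma * R)
    if not (1 <= order <= R) or abs(gamma * R - order) > 1e-9:
        raise ValueError("gamma * R must be an integer between 1 and R")
    # Assumption: 1-D regressor, neighbors chosen by absolute distance.
    nearest = np.argsort(np.abs(xs - x), kind="stable")[:k]
    estimates = np.empty(R)
    for j in range(R):
        # Draw r of the k neighbors without replacement; every draw starts
        # from the full neighbor set, i.e. the elements are "returned".
        idx = rng.choice(nearest, size=r, replace=False)
        estimates[j] = estimator(x, xs[idx], ys[idx])
    estimates.sort()
    return float(estimates[order - 1])
```

With `estimator=lambda x, xr, yr: float(np.median(yr))` this gives the median-smoothing variant; a kernel estimator can be plugged in the same way.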

The averaging method and the median smoothing method for calculating $m^*(x)$ are considered. Our main aim is to elaborate a numerical method for calculating the cover probability

$$\mathrm{Pr}(x) = P\{m(x) \le \tilde m(x)\}. \tag{1.3}$$

This means that we need to know the distribution of the $\gamma R$-th order statistic $m_{(\gamma R)}^*(x)$. That is the main problem to be solved.

2. AVERAGING METHOD

At first we consider the method of kernel regression estimation [Härdle et al., 2004]. Let $K_H(\cdot)$ be any kernel function (Epanechnikov, quartic and so on). Then the Nadaraya-Watson point estimator $m^*(x)$ is calculated by the formula

$$m^*(x) = \frac{\sum_{i=1}^{r} K_H(x - x_i^*)\, Y_i^*}{\sum_{i=1}^{r} K_H(x - x_i^*)}, \tag{2.1}$$

where $x_i^*$ and $Y_i^*$ are the vector of independent variables and the dependent variable for the $i$-th element of the resample, $i = 1, 2, \ldots, r$.

The resampling procedure gives us the sequence $m_1^*(x), m_2^*(x), \ldots, m_R^*(x)$,

$$m_j^*(x) = \frac{\sum_{i=1}^{r} K_H(x - x_i(j))\, Y_i(j)}{\sum_{i=1}^{r} K_H(x - x_i(j))},$$

where $x_i(j)$ and $Y_i(j)$ are the vector of independent variables and the dependent variable for the $i$-th element of the $j$-th resample, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, R$.
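A minimal sketch of the Nadaraya-Watson estimate for one resample is given below. It assumes a one-dimensional regressor, an Epanechnikov kernel and a scalar bandwidth `h`; the function names and the bandwidth parameter are illustrative and do not come from the paper.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nadaraya_watson(x, x_res, y_res, h):
    """Kernel-weighted average of the resampled responses at the point x."""
    x_res = np.asarray(x_res, dtype=float)
    y_res = np.asarray(y_res, dtype=float)
    weights = epanechnikov((x - x_res) / h) / h   # K_H(x - x_i) for scalar h
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no resample point falls inside the kernel window")
    return float(np.dot(weights, y_res) / total)
```

Because the estimate is a ratio of two sums over the same random resample, its exact moments require enumeration over all resamples, which is what makes the exact calculation expensive.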

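With respect to (1.1) we have:

$$E\bigl(m_j^*(x) \mid x(j)\bigr) = \frac{\sum_{i=1}^{r} K_H(x - x_i(j))\, m(x_i(j))}{\sum_{i=1}^{r} K_H(x - x_i(j))},$$

$$\operatorname{Var}\bigl(m_j^*(x) \mid x(j)\bigr) = \sigma^2\, \frac{\sum_{i=1}^{r} K_H^2(x - x_i(j))\, w(x_i(j))}{\bigl(\sum_{i=1}^{r} K_H(x - x_i(j))\bigr)^2},$$

where $x(j) = (x_1(j), x_2(j), \ldots, x_r(j))$.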

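Averaging over all possible resamples gives the unconditional expectation

$$E\,m^*(x) = \binom{k}{r}^{-1} \sum_{z \in \Omega} \frac{\sum_{z_j \in z} K_H(x - z_j)\, m(z_j)}{\sum_{z_j \in z} K_H(x - z_j)}, \tag{2.3}$$

where the outer sum is taken over $\Omega$, the set of all $r$-samples $z = (z_1, z_2, \ldots, z_r)$ without replacement from $\{x_1, x_2, \ldots, x_k\}$.

An analogous expression can be written down for the unconditional variance.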

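At first let us calculate the second moment. With the sums over $z$ taken over the set $\Omega$ of all $r$-samples $z = (z_1, \ldots, z_r)$ without replacement from $\{x_1, \ldots, x_k\}$, and with $Y_i$ denoting the response observed at $z_i$,

$$E\bigl(m^*(x)\bigr)^2 = \binom{k}{r}^{-1} \sum_{z \in \Omega} \Bigl(\sum_{i=1}^{r} K_H(x - z_i)\Bigr)^{-2} E\Bigl(\sum_{i=1}^{r} K_H(x - z_i)\, Y_i\Bigr)^2$$

$$= \binom{k}{r}^{-1} \sum_{z \in \Omega} \Bigl(\sum_{i=1}^{r} K_H(x - z_i)\Bigr)^{-2} \Bigl[\sum_{i=1}^{r} K_H^2(x - z_i)\bigl(\sigma^2 w(z_i) + m^2(z_i)\bigr) + 2 \sum_{i=1}^{r-1} \sum_{j=i+1}^{r} K_H(x - z_i) K_H(x - z_j)\, m(z_i)\, m(z_j)\Bigr].$$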

Now the variance can be calculated by the formula

$$\operatorname{Var} m^*(x) = E\bigl(m^*(x)\bigr)^2 - \bigl(E\,m^*(x)\bigr)^2. \tag{2.4}$$

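Now we need to calculate the covariance between two different estimates $m_j^*(x)$ and $m_{j'}^*(x)$. We have for $j \ne j'$:

$$\operatorname{Cov}\bigl(m_j^*(x), m_{j'}^*(x)\bigr) = E\bigl[(m_j^*(x) - E\,m_j^*(x))(m_{j'}^*(x) - E\,m_{j'}^*(x))\bigr] = E\bigl(m_j^*(x)\, m_{j'}^*(x)\bigr) - \bigl(E\,m^*(x)\bigr)^2.$$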

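Therefore

$$\operatorname{Cov}\bigl(m_j^*(x), m_{j'}^*(x)\bigr) = \binom{k}{r}^{-2} \sum_{z} \sum_{v} \frac{\sigma^2 \sum_{i=1}^{r} \sum_{l=1}^{r} K_H(x - z_i)\, K_H(x - v_l)\, w(z_i)\, \mathbf{1}\{z_i = v_l\}}{\sum_{i=1}^{r} K_H(x - z_i)\, \sum_{l=1}^{r} K_H(x - v_l)},$$

where $z = (z_1, \ldots, z_r)$ and $v = (v_1, \ldots, v_r)$ each run over all $r$-samples without replacement from $\{x_1, \ldots, x_k\}$. Only observations that belong to both resamples contribute, because the random terms of different observations are independent.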

To avoid computational difficulties, it is possible to consider the following estimate instead of (2.1):

$$m^*(x) = \frac{1}{r} \sum_{i=1}^{r} K_H(x - x_i^*)\, Y_i^* \tag{2.8}$$

and the corresponding sequence $m_1^*(x), m_2^*(x), \ldots, m_R^*(x)$.

Expectations, variances and covariance matrix for this sequence of random variables can be determined using the following lemmas.

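Lemma 1. Let $Z_1, Z_2, \ldots, Z_k$ be independent random variables with expectations $\mu_1, \mu_2, \ldots, \mu_k$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$. Let $Z_1^*, Z_2^*, \ldots, Z_r^*$ be a random sample of size $r$ from $Z_1, Z_2, \ldots, Z_k$ without replacement and $S$ be their sum: $S = Z_1^* + Z_2^* + \ldots + Z_r^*$. Then

$$E\,S = \frac{r}{k}\,(\mu_1 + \mu_2 + \ldots + \mu_k), \tag{2.9}$$

$$\operatorname{Var} S = \frac{r}{k} \sum_{j=1}^{k} \Bigl[\sigma_j^2 + \Bigl(1 - \frac{r}{k}\Bigr)\mu_j^2\Bigr] - \frac{r(k - r)}{k^2 (k - 1)} \sum_{i \ne j} \mu_i \mu_j. \tag{2.10}$$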

Lemma 2. Under the conditions of the previous Lemma, let the sample $Z_1^*, Z_2^*, \ldots, Z_r^*$ be returned into the set $Z_1, Z_2, \ldots, Z_k$ and the described procedure be repeated, so that we have a new sample $Z_1', Z_2', \ldots, Z_r'$ and a corresponding sum $S' = Z_1' + Z_2' + \ldots + Z_r'$. Then the covariance between $S$ and $S'$ is calculated by the formula

$$\operatorname{Cov}(S, S') = \Bigl(\frac{r}{k}\Bigr)^2 \sum_{i=1}^{k} \sigma_i^2. \tag{2.11}$$
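The moments in the two lemmas can be checked exactly for small $k$ by enumerating every $r$-subset. The sketch below is only an illustrative check: the function names are made up here, and the closed forms in `lemma_moments` are written out as they follow from the indicator argument in the proofs (expectation $\frac{r}{k}\sum\mu_i$, covariance $(r/k)^2\sum\sigma_i^2$, and the variance assembled from the per-term variances and pairwise covariances). Inputs are assumed to be integers or `Fraction` values so that the comparison is exact.

```python
from fractions import Fraction
from itertools import combinations

def exact_moments(mu, var, r):
    """E S, Var S and Cov(S, S') by brute-force enumeration of all r-subsets."""
    k = len(mu)
    subsets = list(combinations(range(k), r))
    n = Fraction(len(subsets))
    cond_mean = [sum(mu[i] for i in a) for a in subsets]   # E(S | subset)
    cond_var = [sum(var[i] for i in a) for a in subsets]   # Var(S | subset)
    e_s = sum(cond_mean) / n
    # Law of total variance: mean of conditional variances + variance of conditional means.
    var_s = sum(cond_var) / n + sum(m * m for m in cond_mean) / n - e_s ** 2
    # Two independent draws are correlated only through the Z_i they share.
    shared = sum(sum(var[i] for i in set(a) & set(b)) for a in subsets for b in subsets)
    return e_s, var_s, shared / n ** 2

def lemma_moments(mu, var, r):
    """Closed forms for the same three quantities."""
    k = len(mu)
    p = Fraction(r, k)
    sum_sq = sum(m * m for m in mu)
    cross = sum(mu) ** 2 - sum_sq                          # sum over i != j of mu_i * mu_j
    var_s = p * sum(var) + p * (1 - p) * sum_sq - Fraction(r * (k - r), k * k * (k - 1)) * cross
    return p * sum(mu), var_s, p * p * sum(var)
```

For example, `exact_moments([1, 2, 3, 4, 5], [1, 4, 9, 16, 25], 2)` and `lemma_moments` with the same arguments return identical triples.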

3. MEDIAN SMOOTHING

As noted by [Härdle et al., 2004], “Median smoothing may be described as the nearest-neighbor technique to solve the problem of estimating the conditional median function, rather than the conditional expectation function … . The conditional median function $\operatorname{med}(Y \mid X = x)$ is more robust to outliers than the conditional expectation $m(x) = E(Y \mid X = x)$.”

For the resample $\{(x_i(j), Y_i(j)) : i = 1, 2, \ldots, r\}$ the median smoother is defined as

$$m_j^*(x) = \operatorname{med}\{Y_1(j), Y_2(j), \ldots, Y_r(j)\}. \tag{3.1}$$

To evaluate the cover probability (1.3) we apply an approach elaborated by [Andronov and Afanasyeva, 2004]. The idea of this approach is as follows: any median estimator coincides with some element of the initial sample. Therefore we need to calculate the corresponding probability for this element.

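At first we consider the following problem. Let $Y_{(1)} \le Y_{(2)} \le \ldots \le Y_{(k)}$ be the order statistics of the initial dependent variables $Y_1, Y_2, \ldots, Y_k$ and $Q(x, d)$ be the probability of the event $Y_{(d)} \le m(x) < Y_{(d+1)}$:

$$Q(x, d) = P\{Y_{(d)} \le m(x) < Y_{(d+1)}\}, \quad d = 0, 1, \ldots, k, \tag{3.2}$$

where $Y_{(0)} = -\infty$, $Y_{(k+1)} = +\infty$.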

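Formally, this probability can be expressed by the formula

$$Q(x, d) = \sum_{z \in \Omega(x, d)}\ \prod_{v \in z} F\bigl(m(x) - m(v)\bigr) \prod_{v \notin z} \bigl[1 - F\bigl(m(x) - m(v)\bigr)\bigr], \tag{3.3}$$

where $\Omega(x, d)$ is the set of all $d$-samples $z = (x_1, x_2, \ldots, x_d)$ without replacement from $\{x_1, x_2, \ldots, x_k\}$, and $F(t) = P\{\varepsilon \le t\}$ is the distribution function of the random term from formula (1.1).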
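Summing over all $d$-samples directly is expensive, but $Q(x, d)$ is simply the distribution of the number of neighbors whose response does not exceed $m(x)$, that is, a count of independent events with unequal probabilities. A hedged sketch of an equivalent recursive computation follows; the function names and the normal error distribution in `normal_cdf` are assumptions made for illustration.

```python
from math import erf, sqrt

def normal_cdf(t, sigma=1.0):
    """Distribution function of a zero-mean normal random term (assumed here)."""
    return 0.5 * (1.0 + erf(t / (sigma * sqrt(2.0))))

def count_below_distribution(p):
    """Given p[v] = P(Y_v <= m(x)) for each neighbor v, return the list
    [Q(x, 0), ..., Q(x, k)], where Q(x, d) is the probability that exactly
    d of the k independent responses do not exceed m(x)."""
    dist = [1.0]                       # with no neighbors processed, d = 0 surely
    for pv in p:
        nxt = [0.0] * (len(dist) + 1)
        for d, mass in enumerate(dist):
            nxt[d] += mass * (1.0 - pv)    # this response is above m(x)
            nxt[d + 1] += mass * pv        # this response is at or below m(x)
        dist = nxt
    return dist
```

For a neighbor $v$ the input probability would be `normal_cdf(m_x - m_v, sigma)`, matching $F(m(x) - m(v))$; the recursion costs on the order of $k^2$ operations instead of enumerating all subsets.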

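Furthermore, for a concrete $r$-resample $x(j) = (x_1(j), x_2(j), \ldots, x_r(j))$ and the corresponding median smoother (3.1), we are able to calculate the probability that the calculated median $m_j^*(x)$ is greater than $m(x)$. If $d$ is fixed, we have the case of the hypergeometric distribution:

$$q(x, d) = P\{m_j^*(x) > m(x) \mid d\} = \sum_{i=0}^{(r-1)/2} \frac{\binom{d}{i} \binom{k-d}{r-i}}{\binom{k}{r}}, \tag{3.4}$$

where $i$ counts the resampled responses that do not exceed $m(x)$ (recall that $r$ is an odd number).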
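The hypergeometric probability can be evaluated directly. The sketch below assumes that $d$ denotes the number of the $k$ neighbor responses that do not exceed $m(x)$, so that the median of an odd-sized resample exceeds $m(x)$ exactly when at most $(r-1)/2$ of the drawn responses come from that group; the function name is illustrative.

```python
from math import comb

def prob_median_above(k, r, d):
    """P(median of r draws without replacement > m(x)) when d of the k
    responses are at or below m(x) and k - d are above it."""
    if r % 2 == 0:
        raise ValueError("r must be odd so that the median is a sample element")
    favourable = sum(comb(d, i) * comb(k - d, r - i) for i in range((r - 1) // 2 + 1))
    return favourable / comb(k, r)
```

With $k = 9$ and $r = 5$, `prob_median_above(9, 5, 3)` equals $111/126$, and the value drops to $0$ once $d \ge 7$ because a resample of five then cannot contain three responses above $m(x)$.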

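The conditional probability of interest (1.3) given fixed $d$ is calculated in the following way:

$$\mathrm{Pr}(x, d) = P\{m(x) \le \tilde m(x) \mid d\} = \sum_{i=(1-\gamma)R+1}^{R} \binom{R}{i}\, q(x, d)^i\, \bigl(1 - q(x, d)\bigr)^{R-i}. \tag{3.5}$$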

Finally we get the cover probability of interest:

$$\mathrm{Pr}(x) = P\{m(x) \le \tilde m(x)\} = \sum_{d=0}^{k} Q(x, d)\, \mathrm{Pr}(x, d). \tag{3.6}$$
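The last two steps can be combined in a few lines. The sketch rests on one assumption that should be checked against the original formulas: given $d$, the $R$ resample medians exceed $m(x)$ independently with the same probability $q(x, d)$, so the bound $\tilde m(x)$ (the $\gamma R$-th smallest estimate) covers $m(x)$ exactly when at least $R - \gamma R + 1$ of the estimates are not below $m(x)$. Function and argument names are illustrative.

```python
from math import comb

def cover_given_d(q, R, gamma):
    """Binomial tail: P(at least R - gamma*R + 1 of R estimates exceed m(x))."""
    order = round(gamma * R)           # index of the order statistic used as the bound
    return sum(comb(R, i) * q ** i * (1.0 - q) ** (R - i)
               for i in range(R - order + 1, R + 1))

def cover_probability(Q, q, R, gamma):
    """Mix the conditional cover probabilities with the weights Q[d], d = 0..k."""
    return sum(Qd * cover_given_d(qd, R, gamma) for Qd, qd in zip(Q, q))
```

Here `Q` and `q` are lists indexed by $d = 0, \ldots, k$ holding the probabilities of the event in (3.2) and the per-resample probabilities that the median exceeds $m(x)$.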

4. CASE STUDY

We consider a numerical example that illustrates median smoothing. Let the regression function be

$$m(x) = 1 - x + 0.1 x^2, \quad -\infty < x < \infty. \tag{4.1}$$

Of course it is unknown to us, but we wish to calculate the true cover probability for this regression when the median smoothing described above is used. The following assumptions are made: the random term has the normal distribution with zero mean and variance $\sigma^2 = \operatorname{Var}\varepsilon = 4$; the number of observations (the size of the initial sample) is $n = 22$; the values of the independent variable are $x_i = i - 4$, $i = 1, 2, \ldots, 22$.

Further we assume that the confidence probability is $\gamma = 0.8$, the number of considered nearest neighbors is $k = 9$, and the resample size is $r = 5$.

The results are presented in Table 1, which contains the values of the regression function m(x) and the cover probabilities Pr(x). The latter have been calculated by the procedure described above.

Table 1. The cover probabilities

x       -3     -2     -1     0      1      2      3
m(x)    4.9    3.4    2.1    1      0.1    -0.6   -1.1
Pr(x)   0.932  0.925  0.921  0.920  0.922  0.928  0.934

x       4      5      6      7      8      10
m(x)    -1.4   -1.5   -1.4   -1.1   -0.6   1
Pr(x)   0.939  0.941  0.939  0.934  0.928  0.920
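The m(x) row of Table 1 can be checked against the quadratic regression function of the case study, $m(x) = 1 - x + 0.1x^2$. The short sketch below only verifies that row; the cover probabilities themselves depend on the number of resamples $R$, which is not stated, so they are not recomputed here. Variable names are illustrative.

```python
def m(x):
    """Regression function of the case study."""
    return 1.0 - x + 0.1 * x * x

# Values copied from Table 1.
table_x = [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 10]
table_m = [4.9, 3.4, 2.1, 1, 0.1, -0.6, -1.1, -1.4, -1.5, -1.4, -1.1, -0.6, 1]

# Largest absolute difference between the formula and the tabulated values.
max_gap = max(abs(m(x) - v) for x, v in zip(table_x, table_m))
```

The tabulated values agree with the formula up to floating-point rounding, and the minimum $m(5) = -1.5$ sits at the point where the reported cover probability is largest.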

The presented results show that the considered median smoothing approach is a conservative estimation method: the cover probabilities it yields are higher than the nominal confidence probability 0.8. We would also like to remark that the cover probability Pr(x) is neither a monotone nor a convex function of the argument x.

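Proofs of the Lemmas

Proof of Lemma 1

Let $\xi_j = 1$ if the random variable $Z_j$ belongs to the sample $Z_1^*, Z_2^*, \ldots, Z_r^*$ and $\xi_j = 0$ otherwise. Of course $\xi_1, \xi_2, \ldots, \xi_k$ are dependent random variables because $\xi_1 + \xi_2 + \ldots + \xi_k = r$. We have: $P\{\xi_j = 1\} = r/k$, $P\{\xi_j = 0\} = 1 - r/k$, $E\,\xi_j = r/k$, $\operatorname{Var}\xi_j = (r/k)(1 - r/k)$, $E\,\xi_i \xi_j = P\{\xi_i = 1, \xi_j = 1\} = \dfrac{r(r-1)}{k(k-1)}$ for $i \ne j$. Furthermore

$$S = \sum_{i=1}^{k} \xi_i Z_i.$$

Random variables $\xi_i$ and $Z_i$ are independent, therefore

$$E\,S = \sum_{i=1}^{k} E\,\xi_i\, E\,Z_i = \frac{r}{k} \sum_{i=1}^{k} \mu_i,$$

$$E(\xi_i^2 Z_i^2) = E\,\xi_i^2\, E\,Z_i^2 = \frac{r}{k}\,(\sigma_i^2 + \mu_i^2),$$

$$\operatorname{Var}(\xi_i Z_i) = E(\xi_i^2 Z_i^2) - \bigl(E\,\xi_i\, E\,Z_i\bigr)^2 = \frac{r}{k}\,(\sigma_i^2 + \mu_i^2) - \Bigl(\frac{r}{k}\Bigr)^2 \mu_i^2. \tag{A.1}$$

Random variables $Z_i$, $Z_j$ and $\xi_i \xi_j$ for $i \ne j$ are independent too, therefore

$$E(\xi_i Z_i\, \xi_j Z_j) = E(\xi_i \xi_j)\, E\,Z_i\, E\,Z_j = \frac{r(r-1)}{k(k-1)}\, \mu_i \mu_j,$$

$$\operatorname{Cov}(\xi_i Z_i, \xi_j Z_j) = \Bigl(\frac{r(r-1)}{k(k-1)} - \frac{r^2}{k^2}\Bigr) \mu_i \mu_j = -\frac{r(k-r)}{k^2(k-1)}\, \mu_i \mu_j. \tag{A.2}$$

Formulas (A.1) and (A.2) give formula (2.10).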

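Proof of Lemma 2

Let $\xi_i$ and $\xi_i'$ be the indicators of the events that $Z_i$ belongs to the first and to the second sample respectively, so that $S = \sum_i \xi_i Z_i$ and $S' = \sum_i \xi_i' Z_i$. Then

$$\operatorname{Cov}(S, S') = \sum_{i=1}^{k} \sum_{j=1}^{k} \operatorname{Cov}(\xi_i Z_i, \xi_j' Z_j) = \sum_{i=1}^{k} \operatorname{Cov}(\xi_i Z_i, \xi_i' Z_i) + \sum_{i=1}^{k} \sum_{j \ne i} \operatorname{Cov}(\xi_i Z_i, \xi_j' Z_j).$$

For $i \ne j$ the random variables $\xi_i, \xi_j', Z_i, Z_j$ are independent, therefore $\operatorname{Cov}(\xi_i Z_i, \xi_j' Z_j) = 0$. Further

$$\operatorname{Cov}(\xi_i Z_i, \xi_i' Z_i) = E(\xi_i \xi_i' Z_i^2) - E(\xi_i Z_i)\, E(\xi_i' Z_i) = E\,\xi_i\, E\,\xi_i'\, E\,Z_i^2 - \Bigl(\frac{r}{k}\Bigr)^2 \mu_i^2 = \Bigl(\frac{r}{k}\Bigr)^2 \sigma_i^2.$$

Therefore

$$\operatorname{Cov}(S, S') = \Bigl(\frac{r}{k}\Bigr)^2 \sum_{i=1}^{k} \sigma_i^2.$$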

REFERENCES

[1] Andronov A., Afanasyeva H. Resampling-based nonparametric statistical inferences about the distributions of order statistics. In: Transactions of XXIV International Seminar on Stability Problems for Stochastic Models, Transport and Telecommunication Institute, Riga, Latvia, 2004, pp. 300–307.

[2] DiCiccio T.J., Efron B. Bootstrap confidence intervals. Statistical Science, Vol. 11, No. 3, 1996, pp. 189–228.

[3] Härdle W., Müller M., Sperlich S., Werwatz A. Nonparametric and Semiparametric Models. Springer, Berlin, 2004.

[4] Wu C.F.J. Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, Vol. 14, No. 3, 1986, pp. 1261–1295.

Thank You for your attention
