
Linear Algebra and its Applications 321 (2000) 365–385, www.elsevier.com/locate/laa

Foundations of multivariate inference using modern computers

H.D. Vinod¹

Economics Department, Fordham University, Bronx, New York, NY 10458, USA

Received 12 March 1999; accepted 26 May 2000

Dedicated to T.W. Anderson on the occasion of his 80th birthday

Submitted by H.J. Werner

Abstract

Fisher suggested in the 1930s algebraically structured pivot functions (PFs) whose distribution does not depend on unknown parameters. These pivots provided a foundation for (asymptotic) statistical inference. T.W. Anderson [Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958, p. 116] introduced the concept of a critical function of observables, which finds the rejection probability of a test for Fisher's pivot. H.D. Vinod [J. Econometrics 86 (1998) 387] shows that V.P. Godambe's [Biometrika 72 (1985) 419] pivot function (GPF), based on Godambe–Durbin 'estimating functions' (EFs) from [Ann. Math. Statist. 31 (1960) 1208], is particularly robust compared to the pivots by B. Efron and D.V. Hinkley [Biometrika 65 (1978) 457] and R.M. Royall [Internat. Statist. Rev. 54 (2) (1986) 221]. Vinod argues that numerically computed algebraic roots of GPFs based on algebraically scaled score functions can fill a long-standing need of the bootstrap literature for robust pivots. This paper considers D.R. Cox's [Biometrika 62 (1975) 269] example in detail and reports on a simulation for it. This paper also discusses new pivots for the Poisson mean, binomial probability and normal standard deviation. We propose inference methods for a modified standard deviation designed to represent financial risk. In the context of regression problems, we propose and discuss Godambe-type multivariate pivots (denoted by $\mathrm{GPF}^2$) which are asymptotically $\chi^2$. © 2000 Elsevier Science Inc. All rights reserved.

AMS classification: 62; 65; 90

Keywords: Bootstrap; Regression; Fisher information; Robustness; Pivot; Double bootstrap

E-mail address: [email protected] (H.D. Vinod); Web: www.fordham.edu/economics/vinod.
¹ Present address: H.D. Vinod, 92 Hillside Avenue, Tenafly, NJ 07670, USA.

0024-3795/00/$ - see front matter © 2000 Elsevier Science Inc. All rights reserved. PII: S0024-3795(00)00185-3


1. Introduction and how estimating functions evolve into Godambe's pivot functions

Sir R.A. Fisher, Neyman, Pearson, and others developed the basic framework of asymptotic and small-sample statistical inference. It relies on asymptotic normality of the estimator $\hat\theta$ of $\theta$. Anderson [1, p. 116] introduced the concept of a critical function of observables, which finds the rejection probability of a test. Mittelhammer [24, Chapters 9–11] discusses this material in modern notation and explains the duality between confidence and critical regions, whereby uniformly most powerful (UMP) tests lead to uniformly most accurate confidence regions. These methods are the foundations of statistical inference, which relies on parametric modeling using (asymptotic) normality. We show that statistical inference can be made more robust and nonparametric by exploiting the power of computers, unavailable at the time when the pioneers did their work.

Let $\hat\theta$ be a univariate estimator, SE be its standard error, and let FPF denote Fisher's PF. Typical bootstraps resample only older Wald-type statistics, $\mathrm{FPF} = (\hat\theta - \theta)/\mathrm{SE}$ (see [18, p. 128]). Greene [17, p. 153] defines the pivot as a function of $\hat\theta$ and $\theta$, $f_p(\hat\theta, \theta)$, with a known distribution. Actually, it would be a useless PF if the 'known distribution' depends on the unknown parameter $\theta$. The distribution of a valid PF must be independent of $\theta$. If $\hat\theta$ is a biased estimator of $\theta$, where the bias depends on $\theta$, then the distribution of FPF obviously depends on $\theta$. Note that FPF is a statistic for the estimator $\hat\theta$. FPF can be an invalid pivot whenever $\hat\theta$ is a biased estimator, such as a ridge regression estimator [31]. Recent bootstrap literature (see, e.g., [6,18,30]) recognizes that reliable inference from bootstraps needs valid pivot functions (PFs), whose distribution does not depend on unknown parameters $\theta$.

Let $y$ denote the data, $T$ the number of observations, $K$ a constant (often zero) and $S_t$ the $t$th scaled quasi-likelihood score function (QSF). Godambe's [13] pivot function is $\mathrm{GPF}(y, \theta) = \sum_{t=1}^{T} S_t$. Numerically solving $\mathrm{GPF}(y, \theta) = K$ for $\theta$ yields $\hat\theta$, though GPF itself is not a function of $\hat\theta$. Solving $\mathrm{GPF} = K$ for any function $f(\theta)$ yields $f(\hat\theta)$. The roots of $\mathrm{GPF} = K$ are called GPF-roots. Since the GPF is defined as a sum of $T$ items $S_t$, the central limit theorem (CLT) assures us that GPF converges to unit normality or $N(0, I)$ directly. Since the $N(0, I)$ never depends on $\theta$, it is a valid pivot for any $\theta$ or $f(\theta)$. Thus, GPF can fill a long-standing need in the bootstrap literature for valid pivots. We often construct confidence intervals (CIs) by using $J = 999$ GPF-roots.
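To fix ideas, here is a minimal Python sketch (our own illustration, not code from the paper) of solving $\mathrm{GPF}(y,\theta)=K$ numerically for a scalar $\theta$ by bracketing and bisection; the score $S_t = y_t - \theta$ for a normal mean with unit variance is purely illustrative, and any quasi-score can be substituted.

```python
import numpy as np
from scipy.optimize import brentq

def gpf(theta, y):
    # Sum of scaled scores; illustrative score S_t = y_t - theta
    # (normal mean, unit variance). Substitute any quasi-score here.
    s = y - theta
    return s.sum() / np.sqrt((s ** 2).sum())

def gpf_root(y, K):
    # Solve GPF(y, theta) = K for theta; the bracket is generous
    # because GPF is monotone in theta and bounded by +/- sqrt(T).
    lo, hi = y.min() - 10 * y.std(), y.max() + 10 * y.std()
    return brentq(lambda th: gpf(th, y) - K, lo, hi)

y = np.random.default_rng(0).normal(5.0, 1.0, size=80)
ci95 = gpf_root(y, 1.96), gpf_root(y, -1.96)   # lower, upper GPF-roots
```

Because this GPF is decreasing in $\theta$, the root at $K = +1.96$ is the lower CI limit and the root at $K = -1.96$ the upper limit.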

Efron and Hinkley [11] and Royall [27] inject robustness into the FPF for nonnormal situations. Vinod [35] shows that there is an important link between robustness, nonnormality and the so-called 'information matrix equality' $(I_F = I_{2\mathrm{op}} = I_{\mathrm{opg}})$ between the Fisher information matrix $(I_F)$, a matrix of second-order partials $(I_{2\mathrm{op}})$ of the log-likelihood, and the matrix of outer products of gradients $(I_{\mathrm{opg}})$ of the log-likelihood. Vinod's [35] Proposition 1 formally proves that GPF-roots yield more robust pivots than the Efron–Hinkley–Royall pivots. The robustness of GPF-roots is achieved by allowing $I_{2\mathrm{op}} \ne I_{\mathrm{opg}}$, which occurs when the skewness and excess kurtosis are nonzero. The numerical GPF-roots are robust because they avoid the 'Wald-type' statistic $W = [f(\hat\theta) - f(\theta)]/\mathrm{SE}(f)$ altogether. One need not find the potentially 'mixture' sampling distribution of $W$, denoted by $f_w(W)$. There is no need to make sure that $f_w(W)$ does not depend on the unknown $\theta$. In fact, one need not even find the standard error $\mathrm{SE}(f)$ for each $f(\hat\theta)$. Our numerical GPF-root method simply needs a reliable computer algorithm for solving nonlinear equations. Dufour [8] discusses inference for some ill-behaved $f(\theta)$ called 'locally almost unidentified' (LAU) functions, $\mathrm{lau}(\theta)$. He shows that LAU functions are quite common in applications, that $\mathrm{lau}(\theta)$ may have unbounded CIs, and that the usual CIs can have zero coverage probability. The limiting normality of GPFs avoids the problematic sampling distribution of $\mathrm{lau}(\hat\theta)$ altogether.

An introduction to the estimating function (EF) literature is given in [9,16,19,21,34,35]. The EF estimators date back to 1960 and are defined as roots of a function $g(y, \theta) = 0$ of data and parameters. If the EF estimator does not coincide with the maximum likelihood (ML) estimator $\hat\theta_{ml}$ of the parameter $\theta$, the EF is shown to be superior in that it attains the Cramér–Rao lower bound on its variance. See [19] for nonexponential family extensions of EFs and a cogent discussion of the advantages of EFs over traditional methods.

Remark 1. An important lesson from the EF theory is that an indirect approach is good for estimation. One should choose the best available EF (e.g., an unbiased EF attaining the Cramér–Rao bound) to indirectly ensure the best available estimators, which are defined as roots of $\mathrm{EF} = K$ (constant). In many examples, the familiar direct method seeking the best properties of the roots (estimators) themselves can end up failing to achieve them.

Vinod [35] proposes a similar lesson for statistical inference based on numerical GPF-roots. The GPFs, as functions having desirable properties (asymptotic normality), indirectly do achieve desirable properties (e.g., short CIs conditional on coverage) of the (numerical) GPF-roots. This paper discusses Cox's [4] example, the Poisson mean, binomial probability, normal standard deviation and some multivariate pivots in the context of regression problems omitted in [35]. This paper also derives a second kind of GPF, called $\mathrm{GPF}^2$, where the asymptotic distribution is $\chi^2$, again arising from CLT-type arguments. Section 2 relates FPFs to CIs. For Cox's example, Section 3 derives the GPF and Section 4 develops the CIs from bootstraps and simulates Cox's example. Section 5 gives more GPF examples. Section 6 discusses two types of GPFs for regressions. Section 7 has a summary and conclusions.

2. Traditional confidence intervals (CIs) from FPFs

Let $\hat\theta = \{\hat\theta_i\}$ be a vector estimate of $\theta = \{\theta_i\}$ such that its asymptotic distribution is normal, $\hat\theta \xrightarrow{d} N(\theta, \mathrm{Var}(\hat\theta))$. The normality yields the information matrix equality noted earlier and $\mathrm{Var}(\hat\theta) = I_F^{-1}$. Let $\mathrm{ASE}_i$ denote the asymptotic standard errors from the diagonals of $\mathrm{Var}(\hat\theta)$. The usual asymptotic 95% CI (CI95) for the $i$th element of $\theta$ is simply

$$\big[\hat\theta_i - 1.96\,\mathrm{ASE}_i,\ \hat\theta_i + 1.96\,\mathrm{ASE}_i\big]. \qquad (1)$$

There is no loss of generality in discussing the 95% $= 100(1-\alpha)\%$ CIs, since one can readily modify $\alpha$ ($= 0.05$) to any desired significance level $\alpha < 1/2$. Rescaling and recentering the $i$th element of $\hat\theta$, one defines the Wald-type FPF, which obviously converges to the unit normal:

$$\mathrm{FPF} = Z_F = (\hat\theta_i - \theta_i)/\mathrm{ASE}_i \;\xrightarrow{d}\; N(0, 1). \qquad (2)$$

One can derive the CI95 in (1) by algebraically 'inverting Fisher's PF', i.e., by solving $\mathrm{FPF} = K = z_\alpha = \pm 1.96$. In general, consider a two-sided interval given a left-hand-tail probability $\alpha_L$ and a distinct right-hand-tail probability $\alpha_U$, leading to distinct quantiles $z_L$ and $z_U$ from normal tables that satisfy

$$\Pr[z_L \le z \le z_U] = 1 - \alpha_L - \alpha_U. \qquad (3)$$

Now define FPF roots as solutions of $\mathrm{FPF} = K = z_L$ and $\mathrm{FPF} = z_U$, which yield the lower and upper limits of a more general CI. In finite samples, when standard errors $\mathrm{SE}_i$ are estimated from the data, Student's $t$ distribution generally means a slightly larger constant than 1.96. In the sequel, we avoid notational clutter by using CI95 as a generic interval, 1.96 as a generic value of $z$ from normal or $t$ tables, and ASE or $\mathrm{SE}(\cdot)$ as standard errors from the square root of the matrix $\mathrm{Var}(\hat\theta)$, usually based on Fisher's information matrix $I_F$. The FPF in (2) is a test statistic and the CI in (1) is said to be obtained by inverting that statistic. We shall see that $\mathrm{GPF}(y, \theta)$ is not a test statistic and contains all information in the sample. However, we can still numerically solve $\mathrm{GPF} = K$ to obtain a CI95 by analogy with the solutions of $\mathrm{FPF} = K$.

3. Derivation of the GPF for Cox’s example

Cox's [4] example was used by Efron, Hinkley and Royall to improve the FPF. It has a univariate $\theta$ from $y_{it} \sim N(\theta, \sigma_i^2)$ ($i = 1, 2$ and $t = 1, \ldots, T$), with known dichotomous random variances. One imagines two (independent) gadgets with distinct known measurement error variances. The choice of the gadget $i$ depends on the outcome of a toss of an unbiased coin. The subscripts of $y_{it}$ imply that a record is kept of both the gadget number $i$ and the $t$th measurement. Denote by $T_1$ the total number of heads, by $T_2$ the total tails, and by $\bar y_i$ the mean of the $i$th sample. The log-likelihood function is:

$$L = \text{constant} - \sum_{t=1}^{T}\sum_{i=1}^{2}\log\sigma_i - (1/2)\sum_{t=1}^{T}\sum_{i=1}^{2}(y_{it} - \theta)^2/\sigma_i^2.$$

The score is

$$S = \partial L/\partial\theta = (L - \theta R),$$

where

$$L = \Big(\sum_{t=1}^{T} y_{1t}/\sigma_1^2\Big) + \Big(\sum_{t=1}^{T} y_{2t}/\sigma_2^2\Big) = \big(T_1\bar y_1/\sigma_1^2\big) + \big(T_2\bar y_2/\sigma_2^2\big)$$

and

$$R = \big(T_1/\sigma_1^2\big) + \big(T_2/\sigma_2^2\big).$$

The ML estimator of $\theta$ is the root of the score equation $S = 0$ solved for $\theta$ and denoted by $\hat\theta_{ml}$ or simply as $\hat\theta$:

$$\hat\theta_{ml} = L/R = \big[\big(T_1\bar y_1/\sigma_1^2\big) + \big(T_2\bar y_2/\sigma_2^2\big)\big]\big/\big[\big(T_1/\sigma_1^2\big) + \big(T_2/\sigma_2^2\big)\big]. \qquad (4)$$

Fisher's PF here is $\mathrm{FPF} = (\hat\theta_{ml} - \theta)/\mathrm{SE}$, where SE is the square root of $\mathrm{Var}(\hat\theta) = T/I_F = 2[(1/\sigma_1^2) + (1/\sigma_2^2)]^{-1}$, with $I_{2\mathrm{op}} = I_F = \mathrm{Var}(S) = -E(\partial S/\partial\theta) = (T/2)\big[\sum_{i=1}^{2}\sigma_i^{-2}\big]$. Efron and Hinkley [11] suggest replacing the true $I_F$ by the 'observed Fisher information' $\hat I_F = R$ defined in (4). They explain that the $I_F$ formula having the $(T/2)$ factor above is not robust, since it uses $T = T_1 + T_2$. The realized number of heads $(T_1)$ need not equal the number of tails $(T_2)$. Efron and Hinkley [11] replace the SE in FPF by the square root of

$$T/\hat I_F = T\big[\big(T_1/\sigma_1^2\big) + \big(T_2/\sigma_2^2\big)\big]^{-1}. \qquad (5)$$

The CIs from (5) will also be more robust. So far, we are accepting on faith that the $\sigma_i^2$ are known. To inject further robustness, Royall [27] allows for errors in the supposedly 'known' $\sigma_i^2$. Royall's $\mathrm{Var}(\hat\theta)$ is

$$K = T\, I_{2\mathrm{op}}^{-1} I_{\mathrm{opg}} I_{2\mathrm{op}}^{-1}. \qquad (6)$$

If $I_{2\mathrm{op}} = I_{\mathrm{opg}}$, (6) becomes $T I_{2\mathrm{op}}^{-1}$, which equals $T/\hat I_F$ of (5) for Cox's [4] example. Royall [27] replaces the SE in FPF by the square root of

$$T\left\{\sum_{t=1}^{T}(y_{1t} - \hat\theta)^2/\sigma_1^4 + \sum_{t=1}^{T}(y_{2t} - \hat\theta)^2/\sigma_2^4\right\}\big(\hat I_F\big)^{-2}. \qquad (7)$$

He proves that (7) is more robust than Efron and Hinkley's [11] $T/\hat I_F$ of (5), because it offers protection against errors in the assumed variances $\sigma_i^2$. Another way of thinking about the robustness of Royall's pivot is that (6) reduces to $T/\hat I_F$ only if $I_{2\mathrm{op}} = I_{\mathrm{opg}}$, i.e., the 'information matrix equality' holds, which in turn implies the restrictive assumption of zero skewness and zero excess kurtosis.

Before we can write the GPF for Cox's example, we need to develop the optimal estimating function (EF) for Cox's example. From the mean, $\theta$, and variance $\sigma_i^2$ of $y_{it}$, the quasi-likelihood function is available. The quasi-score equation $S = 0$ is the optimal EF, whose root is the EF-theory 'point estimate' of $\theta$, which coincides with the ML estimator of (4) above for Cox's example.

3.1. Digression

Let us exploit the context of Cox's simple example to describe a general method for obtaining optimal EFs. From the mean and variance of $y_{it}$, define $g^*_{it} = \sigma_i^{-2}(y_{it} - \theta)$, where we are standardizing with $\sigma_i^{-2}$ instead of the usual $\sigma_i$. The $g^*_{it}$ is unbiased, since $E(g^*_{it}) = 0$ for each $i = 1, 2$. Since each equates the mean of $y_{it}$ to $\theta$, it is a 'moment condition' and $\theta$ is 'overidentified' in econometric terminology. The generalized method of moments (GMM) has a certain way of combining moment conditions. The theory of estimating functions involves ensuring the orthogonality $E g^*_{1t} g^*_{2t} = 0$ and uses weights coming from Godambe's criterion for combining them into one optimal EF for $\theta$ as

$$\sum_{t=1}^{T} g^*_{1t}\,\frac{E(\partial g^*_{1t}/\partial\theta)}{E(g^*_{1t})^2} + \sum_{t=1}^{T} g^*_{2t}\,\frac{E(\partial g^*_{2t}/\partial\theta)}{E(g^*_{2t})^2} = 0, \qquad (8)$$

where $E\,\partial g^*_{it}/\partial\theta = -(\sigma_i)^{-2}$ and $E g^{*2}_{it} = E(\sigma_i)^{-4}(y_{it} - \theta)^2 = (\sigma_i)^{-2}$. Hence, the weights on $g^*_{it}$ in (8) are simply $-1$. Now denote the quasi-score $S_t = g^*_t = \sum_{i=1}^{2} g^*_{it}$. The EF of (8) for Cox's example becomes $\sum_{t=1}^{T} S_t$, a sum of quasi-scores. It is optimal because it is unbiased with zero mean and its variance is minimal. When we solve (8) for $\theta$, the solution (denoted by a subscript ef) for this example coincides with the ML estimator:

$$\hat\theta_{ef} = L/R = \hat\theta_{ml}, \qquad (9)$$

where the $L$ and $R$ are from (4). In general, optimal EF, ML and GMM can yield distinct solutions.

Now first define $\bar S_t = S_t\big\{\sum_{t=1}^{T}(S_t)^2\big\}^{-0.5}$ and then define Godambe's [13] pivot GPF as

$$\mathrm{GPF} = z_G = \sum_{t=1}^{T} g^*_t \Big/ \sqrt{\textstyle\sum_{t=1}^{T} g^{*2}_t} = \sum_{t=1}^{T} \bar S_t \;\xrightarrow{d}\; N(0, 1), \qquad (10)$$

where the convergence to $N(0,1)$ is easy to verify by using the central limit theorem (CLT). After all, it is a sum of $T$ items $\bar S_t$ rescaled to have unit variance. The GPF-roots are defined as solutions of $\mathrm{GPF} = K$.

To review, we have defined GPF in (10) as a sum of $T$ 'scaled quasi-scores', which is pivotal because $N(0,1)$, its asymptotic sampling distribution, does not depend on the unknown $\theta$. Without the $(\hat\theta - \theta)$ term, (10) does not look like a typical Fisherian pivot (FPF) in (2). However, we claim four advantages:

(i) The very absence of $(\hat\theta - \theta)$ is an important asset of GPFs for robust inference.

(ii) Biased estimators have $E(\hat\theta) \ne \theta$. Yet, $E(\mathrm{GPF}) = 0$ can hold.

(iii) Numerical roots of $\mathrm{GPF} = K$ yield a CI95 for the unknown $\theta$, associated with EF-estimators, while avoiding explicit studies of the sampling distributions of $\hat\theta$. The usual $\mathrm{FPF} = (\hat\theta - \theta)/\mathrm{SE}$ can involve a mixture of the distribution of $\hat\theta$ and the distribution of the estimated SE in the denominator. No mixture distributions are needed to assert (10), since its normality is based on the CLT.

(iv) Heyde [19, p. 62] proves that CIs from 'asymptotic normal' GPFs are shorter than CIs from 'locally asymptotic mixed normal' pivots.

We are now ready to place GPFs in the context of the bootstrap for robust computer-intensive inference.

4. Computation of GPF roots, CIs from bootstraps and a simulation for Cox's example

Two analytical solutions of $\mathrm{FPF} = (\hat\theta - \theta)/\mathrm{SE} = z_\alpha = \pm 1.96$ give the limits of the CI95 of (1). From (10) it is clear that we cannot, in general, hope to solve $\mathrm{GPF} = K = \pm 1.96$ analytically. Hence, one needs numerical methods to estimate the limits of a CI95 from the GPF. For better behavior of numerical algorithms, let us rewrite (10) without the reciprocals of square roots as follows:

$$K\left\{\sum_{t=1}^{T}\sum_{i=1}^{2}\sigma_i^{-4}(y_{it} - \theta)^2\right\}^{0.5} = \sum_{t=1}^{T}\sum_{i=1}^{2}\sigma_i^{-2}(y_{it} - \theta). \qquad (11)$$

The choice $K = z_\alpha = \pm 1.96$ depends on the normality in (10), and yields the asymptotic CI95 for the GPF.

We now briefly describe the bootstrap CI95 for the GPF. We first resample the $\bar S_t$ of (10) with replacement. This amounts to using a nonparametric distribution induced by the empirical distribution function (EDF) of a large number $(J = 999)$ of solutions of $\mathrm{GPF} = \sum \bar S_t = 0$. If the sampling distribution of the FPF depends on $\theta$ (e.g., if the bias $E(\hat\theta) - \theta$ depends on $\theta$), then FPF is an invalid pivot. A bootstrap using an invalid pivot may need considerable adjustment, if not complete rejection. An adjustment for ridge regression is discussed in [31]. This problem is avoided by our GPF bootstraps.
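The root finding itself is routine with a modern solver. Below is a minimal Python sketch of solving (11), assuming known $\sigma_i$ as in the text; the paper's own computations used the NLSYS library in GAUSS, and the bracket endpoints here are arbitrary but generous.

```python
import numpy as np
from scipy.optimize import brentq

def eq11(theta, y1, y2, s1, s2, K):
    # Residual of (11): weighted deviations minus K times their
    # root-sum-of-squares; a zero of this function is a GPF-root.
    dev = np.concatenate([(y1 - theta) / s1**2, (y2 - theta) / s2**2])
    return dev.sum() - K * np.sqrt((dev**2).sum())

def ci95(y1, y2, s1, s2):
    lo = min(y1.min(), y2.min()) - 10.0    # generous, arbitrary bracket
    hi = max(y1.max(), y2.max()) + 10.0
    root = lambda K: brentq(eq11, lo, hi, args=(y1, y2, s1, s2, K))
    return root(1.96), root(-1.96)         # lower and upper CI95 limits
```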

4.1. Refinement of K

The parametric choice $K = z_\alpha = \pm 1.96$ can be refined by using the bootstrap to construct the sampling distribution of the scaled sum of scores $\bar S_t$. Although asymptotically normal, its quantiles need not coincide with $\pm 1.96$. The simple idea is to use bootstrap quantiles as a refinement. We can use the ML estimates $(= \hat\theta_{ef})$ of (9) to generate scaled scores $\bar S_t$ for each $t$. Let us denote the estimated $t$th scaled score as

$$\mathrm{Est}(\bar S_t) = z_{Gt} = \left\{\sum_{t=1}^{T}\sum_{i=1}^{2}\sigma_i^{-4}(y_{it} - \hat\theta_{ef})^2\right\}^{-0.5}\sum_{i=1}^{2}\sigma_i^{-2}(y_{it} - \hat\theta_{ef}). \qquad (12)$$

Next, we shuffle with replacement these $(t = 1, \ldots, T)$ scaled scores $J (= 999)$ times. This creates $J$ replications of GPFs from their own EDF. Solving (12) numerically for each replicate yields $J$ estimates of GPF-roots, to be analyzed by descriptive and order statistics as follows. Let $\bar z_{Gj}$ and $\mathrm{sd}_j(\cdot)$ denote the sample mean and standard deviation over $T$ values for each $j = 1, \ldots, J$. Although $E(\mathrm{GPF}) = 0$ holds for large $T$, for relatively small $T$ the observed mean $\bar z_{Gj}$ may be nonzero. However, if we can assume that any discrepancy between $\bar z_{Gj}$ and zero does not depend on the unknown $\theta$, the following FPF remains valid: $z^*_j = (z_{Gtj} - \bar z_{Gj})/\mathrm{sd}_j(z_{Gt})$, where $z_{Gtj}$ denotes the scaled score from (12) for the $j$th replicate. When the $z^*_j$ are not $N(0,1)$, the 2.5% and 97.5% quantiles refine the parametric choice $\mp 1.96$.
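One reading of this refinement, sketched in Python (our illustration, not the paper's code): resample the $T$ estimated scaled scores of (12) with replacement, center and studentize each replicate's sum, and take the empirical 2.5% and 97.5% quantiles in place of $\mp 1.96$.

```python
import numpy as np

def refined_cutoffs(scaled_scores, J=999, seed=7):
    # Bootstrap refinement of the parametric cutoffs -1.96/+1.96:
    # resample the T scaled scores of (12) with replacement and read
    # quantiles off the centered, studentized replicate sums.
    rng = np.random.default_rng(seed)
    T = scaled_scores.size
    sums = np.array([rng.choice(scaled_scores, T, replace=True).sum()
                     for _ in range(J)])
    z = (sums - sums.mean()) / sums.std(ddof=1)
    return np.quantile(z, [0.025, 0.975])   # refined z_L and z_U for (3)
```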

4.2. Nonparametric bootstrap

We solve (11) numerically for $\theta$ by substituting the refined $z_\alpha$. For each $j = 1, \ldots, J (= 999)$ we find the GPF-roots, denoted by $\theta^*_j$. Arranging the roots in increasing order yields 'order statistics' denoted by $\theta^*_{(j)}$. A possibly nonsymmetric (single) bootstrap naive nonparametric CI95 is given by

$$\big[\theta^*_{(25)},\ \theta^*_{(975)}\big]. \qquad (13)$$

If any CI procedure insists that the CI must always be symmetric around $\hat\theta_{ef}$, it is intuitively obvious that it will not be robust. After all, it is not hard to construct examples where a nonsymmetric CI is superior. Efron and Hinkley's [11] and Royall's [27] CI95s (based on (5) or (7), respectively) are prima facie not fully robust, simply because they retain the symmetric structure. For the $\mathrm{FPF} = (\hat\theta - \theta)/\mathrm{SE}$, Hall [18, pp. 111–113] proves that symmetric CIs are asymptotically superior to their equal-tailed counterparts. However, Hall does not consider GPFs, and his example shows that his superiority result depends on the confidence level. For example, it holds true for a 90% interval, but not for a CI95. In finite samples, symmetric CIs are not robust in general.

4.3. Parametric bootstrap

A parametric bootstrap uses $N(0, 1)$ to generate $z_{Gj}$ for $j = 1, 2, \ldots, J$ and substitutes them for $z_\alpha$ in (11) to obtain $J$ numerical solutions. The asymptotic choice of $-1.96$ and $1.96$ needed only two solutions. Let the GPF-roots of (11) be denoted by $\theta_{efj}$. The appropriate order statistics yield the following CI95 from our GPF-$N(0, I)$ algorithm:

$$\big[\theta_{ef(25)},\ \theta_{ef(975)}\big]. \qquad (14)$$

Remark 2. This remark is a digression from the main theme. If computational resources are limited (e.g., if $T$ is very large), the following CI95 remains available for some problems. First, find an unbiased estimate $\theta_{unb}$, which can be from the mean of a small simulation. Next, compute $\delta = \theta_{unb}/\hat\theta_{ef}$ and an 'unbiased estimate of squared bias', $U = (\hat\theta_{ef} - \theta_{unb})^2$. Adding the estimated variance to $U$ yields an unbiased estimate of the mean squared error (UMSE), and its square root $\sqrt{\mathrm{UMSE}}$. Vinod [28] derives the sampling distribution of a generalized $t$ ratio, where ASE is replaced by $\sqrt{\mathrm{UMSE}}$, as a ratio of weighted $\chi^2$ random variables. He indicates approximate methods for obtaining appropriate constants like the generic 1.96 for a numerically 'known' bias factor $\delta$.

4.4. A simulation of Cox’s example

We now discuss our simulation of Cox's example with $T_1 = 50$, $T_2 = 30$, $\sigma_1 = 1$ and $\sigma_2 = 2$. We generate $y_i \sim N(5, \sigma_i^2)$ for $i = 1, 2$. Our $\bar y_i = (5.2787, 4.8654)$. The ML estimator is $\hat\theta_{ml} = 5.1833$ with an ASE of 1.1547, and the ML CI95 is $[2.9201, 7.4465]$. Efron and Hinkley's [11] estimate of the ASE is 1.1094 with a CI95 of $[3.0089, 7.3577]$. Royall's [27] ASE $= 1.0706$ gives a shorter CI95: $[3.0848, 7.2817]$.

The GPF defined in (10) implies (11) for Cox's example. We solve (11) $J = 999$ times with the help of the NLSYS library in the GAUSS computer language. The nonparametric CI95 in (13) is $[4.9572, 5.4924]$, with a standard deviation of 0.1283 over the 999 realizations. The parametric algorithm uses unit normal deviates in GAUSS and rank-orders the roots $\theta_j$, leading to the CI95 of (14). The smallest among the 999 estimates, $\min_j(\theta_j)$, is 4.7761, and the mean$(\theta_j)$ is 5.1835. The median and maximum are, respectively, 5.1885 and 5.5408. The standard deviation is 0.1263. Since the median is slightly larger than the mean, the approximate sampling distribution is slightly skewed to the left. Otherwise, the sampling distribution is quite tight and fairly well behaved, with a remarkably short CI95 from (14): $[4.9312, 5.4373]$. Note that in this simulation we know that the true value of $\theta$ is 5, and we can compare the widths of intervals given that the true value 5 is inside the CIs (conditional on coverage). The widths (CIWs) in decreasing order are: 4.53 for classical ML, 4.35 for Efron and Hinkley, 4.20 for Royall, 0.51 for our parametric version, and 0.53 for our nonparametric version. Hence, we can conclude that the parametric CI95 from simulated (14) is the best for this example, with the shortest CIW. The CIW of 0.53 for the nonparametric (13) is almost as low, and the difference may be due to random variation.

Thus, for Cox's example, used by others in the present context, our simulation shows that the GPFs provide a superior alternative. This supports Godambe and Heyde's [15] property (iii) and our discussion.
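For readers wishing to reproduce the design of this experiment, the Python sketch below mirrors the parametric GPF-$N(0, 1)$ algorithm; since the seed and bracket are arbitrary and the paper used GAUSS, the numbers it produces will not match those reported above.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)                    # arbitrary seed
T1, T2, s1, s2, theta0 = 50, 30, 1.0, 2.0, 5.0    # design of Section 4.4
y1 = rng.normal(theta0, s1, T1)
y2 = rng.normal(theta0, s2, T2)

def gpf(theta):
    # Pivot (10) for Cox's example, i.e., the ratio form of (11)
    dev = np.concatenate([(y1 - theta) / s1**2, (y2 - theta) / s2**2])
    return dev.sum() / np.sqrt((dev**2).sum())

J = 999
roots = np.sort([brentq(lambda th: gpf(th) - z, -50.0, 60.0)
                 for z in rng.standard_normal(J)])  # one root per deviate
ci95 = roots[24], roots[974]   # order statistics theta_(25), theta_(975) as in (14)
```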

Vinod [35, Section 2] describes the sequence of robust improvements from Fisher's pivot (FPF) to Efron and Hinkley's and Royall's pivots. All these pivots contain an expression similar to $(\hat\theta_i - \theta_i)$ in (2). Let $L = \sum_{t=1}^{T} L_t$ denote the log of the (quasi-)likelihood function, where $L_t = \ln f_t(y_t; \theta)$. Godambe's [13,14] PF in (10), with the score written as $\partial L_t/\partial\theta$, can also be written as

$$\mathrm{GPF} = z_G = \sum_{t=1}^{T}\frac{\partial L_t}{\partial\theta}\left\{\sum_{t=1}^{T}\Big[\frac{\partial L_t}{\partial\theta}\Big]^2\right\}^{-0.5} = \sum_{t=1}^{T}\bar S_t. \qquad (15)$$

For Cox's example, (13) and (14) give CI95's from (10) or (15). We have seen that despite the missing $(\hat\theta_i - \theta_i)$ we can give CI95's from its numerical GPF-roots. To the best of my knowledge, (15) has not been implemented in the literature for Cox's or other examples. A multivariate extension of (15) is given later in (26). For easy reference, we restate a proposition from [35]:

Proposition 1. Let the scaled quasi-scores $\bar S_t$ satisfy McLeish's [22] assumptions. Then, by his CLT, their partial sum $\mathrm{GPF} = z_G = \sum_{t=1}^{T}\bar S_t \xrightarrow{d} N(0, 1)$, i.e., it converges in distribution to the unit normal as $T \to \infty$. Defining robustness as the absence of additional assumptions, such as asymptotic normality of a root $\hat\theta$, $z_G$ is more robust than the Fisher, Efron–Hinkley and Royall pivots.

5. Further GPF examples from the exponential family

In this section, we illustrate our proposal for various well-known exponential family distributions and postpone the normal regression application.

5.1. Poisson mean

If the $y_t$ are i.i.d. Poisson random variables, the ML estimator of the mean is $\bar y$. The variance of $y_t$ is also estimated by $\bar y$, and $\mathrm{Var}(\bar y) = (\bar y/T)$. Hence, the CI95 is $[\bar y \mp 1.96\sqrt{\bar y/T}]$ for both Fisher and Efron–Hinkley. Royall's $\hat\lambda = \widehat{\mathrm{Var}}(y_t) = \sum_{t=1}^{T}(y_t - \bar y)^2/T$, and his CI95 is $[\bar y \mp 1.96\sqrt{\hat\lambda/T}]$. If the Poisson model is precisely correct, $\hat\lambda = \bar y$ and then Royall's CI95 coincides with Fisher's. The Poisson likelihood function is $f_t(y_t; \theta) = (\theta^{y_t}/y_t!)\exp(-\theta)$. The score function is $\partial L_t/\partial\theta = (y_t/\theta) - 1 = \theta^{-1}(y_t - \theta)$, and the GPF is obtained algebraically by scaling these scores as

$$\mathrm{GPF} = z_G = \sum_{t=1}^{T}(y_t - \theta)\Big/\sqrt{\textstyle\sum_{t=1}^{T}(y_t - \theta)^2}. \qquad (16)$$

In general, $\mathrm{GPF} = K$ needs to be solved numerically for either a parametric or a bootstrap choice of the constant $K$. Thus one can obtain a robust CI95 for the Poisson mean $\theta$ or any function $f(\theta)$ thereof.
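A compact Python sketch (our illustration, on arbitrary simulated data) of solving (16); the common $\theta^{-1}$ score factor cancels after scaling, and the last line shows how a CI for a monotone function $f(\theta)$ follows by transforming the GPF-roots, as noted in Section 1.

```python
import numpy as np
from scipy.optimize import brentq

# Pivot (16): after scaling, the theta^{-1} factor cancels, leaving
# sum(y - theta) = K * sqrt(sum((y - theta)^2)) to be solved for theta.
y = np.random.default_rng(3).poisson(4.0, size=60)   # illustrative data
g = lambda th, K: (y - th).sum() - K * np.sqrt(((y - th) ** 2).sum())
root = lambda K: brentq(g, 1e-6, y.max() + 50.0, args=(K,))
lo, hi = root(1.96), root(-1.96)        # robust CI95 for the Poisson mean
# CI for a monotone function f(theta), e.g. f(theta) = exp(-theta) = P(y=0):
f_lo, f_hi = np.exp(-hi), np.exp(-lo)
```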

5.2. Binomial probability

Let $y$ be independently distributed as binomial, $\binom{k}{y}\theta^y(1 - \theta)^{k-y}$ for $y = 0, 1, \ldots, k$. Now the ML estimator is $\hat\theta = y/k$, and both the Fisher and the Efron–Hinkley variance estimates are $(y/k)[1 - (y/k)]/k$. Royall shows that $\hat\lambda = \sum_{t=1}^{T}(y_t - \bar y)^2/Tk^2$ is robust in the sense that if the actual distribution is hypergeometric, his $\hat\lambda$ converges to the true variance, unlike Fisher's or Efron and Hinkley's. Since the score is $\partial L_t/\partial\theta = y_t\theta^{-1}(1 - \theta)^{-1} - k(1 - \theta)^{-1} = (1 - \theta)^{-1}\theta^{-1}(y_t - k\theta)$, the GPF-roots need numerical solutions, as above. We have

$$\mathrm{GPF} = z_G = \sum_{t=1}^{T}(1 - \theta)^{-1}\theta^{-1}(y_t - k\theta) \times \left\{\sum_{t=1}^{T}(1 - \theta)^{-2}\theta^{-2}(y_t - k\theta)^2\right\}^{-0.5}. \qquad (17)$$

5.3. Normal standard deviation

For $y_t$ independent normal, $N(\mu, \sigma^2)$, the ML estimate of $\sigma$ is $\hat\sigma = \sqrt{\sum_{t=1}^{T}(y_t - \bar y)^2/T}$. The asymptotic variance is $\sigma^2/2$, whether one uses Fisher's or Efron and Hinkley's estimate. Royall's $\hat\lambda = (1/4\sigma^2)\big[\sum_{t=1}^{T}\{(y_t - \bar y)^2 - \hat\sigma^2\}^2/T\big]$, and his asymptotic CI95 is robust against nonnormality. Since $\partial L_t/\partial\sigma = \sigma^{-3}(y_t - \mu)^2 - \sigma^{-1}$, we need numerical solutions of $\mathrm{GPF} = K$:

$$\mathrm{GPF} = z_G = \left\{\sum_{t=1}^{T}(y_t - \mu)^2 - T\sigma^2\right\} \times \left\{\sum_{t=1}^{T}(y_t - \mu)^4 - 2\sigma^2\sum_{t=1}^{T}(y_t - \mu)^2 + T\sigma^4\right\}^{-0.5}. \qquad (18)$$

This GPF is robust against nonnormality, similar to (7), by protecting against nonzero skewness and excess kurtosis. Again, $\mathrm{GPF} = K$ needs to be solved numerically for various bootstraps to obtain a robust CI95 for $\sigma$. In finance, $\sigma$ is used to represent risk or volatility. Unfortunately, $\sigma$ is not a satisfactory measure of risk, since the financial loss occurs on the down side only. The sign of $(y_t - \bar y)$, or the sign of $(y_t - y^*)$ where $y^*$ is a risk-free return, should matter. We suggest a modified $\sigma^*$ with the trader's own sign-sensitive weight function $w_t$, to better represent risk, by $\sigma^* = \sqrt{\sum_{t=1}^{T} w_t(y_t - \bar y)^2/T}$. The inference for $\sigma^*$ by inserting $w_t$ in (18) with appropriate modifications can make risk in finance theory more realistic.
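A small Python sketch of the suggested $\sigma^*$; since the text leaves the sign-sensitive weight function $w_t$ to the trader, the benchmark default and weight values below are our own assumptions.

```python
import numpy as np

def sigma_star(y, benchmark=None, w_down=1.0, w_up=0.0):
    # sigma* = sqrt( sum_t w_t (y_t - ybar)^2 / T ) with sign-sensitive
    # weights; the benchmark defaults to the sample mean, and a risk-free
    # return y* is the other choice named in the text. The particular
    # weight values here are illustrative assumptions.
    ybar = y.mean()
    b = ybar if benchmark is None else benchmark
    w = np.where(y < b, w_down, w_up)
    return np.sqrt((w * (y - ybar) ** 2).mean())

returns = np.random.default_rng(5).normal(0.01, 0.05, size=250)
downside_risk = sigma_star(returns)   # w_up = 0 keeps only downside moves
```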

6. The GPFs for regressions

In this section, we develop two types of GPFs for the regression problems in (25) and (26). We start with the familiar case and show that GPFs are intuitive and yet can handle considerable generality with flexibility and robustness. Consider the usual regression model with $T$ observations and $p$ regressors: $y = X\beta + \varepsilon$, $E(\varepsilon) = 0$, $E\varepsilon\varepsilon' = \sigma^2 I$, where $I$ is the identity matrix.

The $t$th term of the log-likelihood is

$$L_t = (-0.5)\log(2\pi\sigma^2) - 0.5\,\sigma^{-2}\,[y_t - (X\beta)_t]'[y_t - (X\beta)_t], \qquad (19)$$

where $(X\beta)_t$ denotes the $t$th element of the $T\times 1$ vector $X\beta$ and $\varepsilon_t = y_t - (X\beta)_t$. Let $X_t = (x_{t1}, \ldots, x_{tp})$ denote a row vector and $(X_t'X_t)$ be $p\times p$ matrices. Now $\partial L_t/\partial\beta$ is proportional to $[X_t'y_t - X_t'X_t\beta]$, as $p$ equations for each $t = 1, \ldots, T$. The quasi-score function from (19) is $S = \sum_{t=1}^{T}(\partial L_t/\partial\beta) = 0$, which is the optimal EF $g^*$. The EF solution here is simply $\hat\beta_{ef} \equiv \hat\beta_{ols} = (X'X)^{-1}X'y$, the ordinary least-squares (OLS) estimator. Hence, the OLS estimator is equivalent to the root of the following optimal EF or 'normal equations':

$$g_{ols} = g^* = X'(y - X\beta) = \sum_{t=1}^{T} X_t'(y_t - X_t\beta) = 0. \qquad (20)$$

In the OLS regression with $E\varepsilon\varepsilon' = \sigma^2 I$, a $100(1 - \alpha)\%$ confidence region for $\beta$ contains those values of $\beta$ which satisfy the inequality [7]

$$(y - X\beta)'(y - X\beta) - \big(y - X\hat\beta_{ols}\big)'\big(y - X\hat\beta_{ols}\big) \le s^2\, p\, F_{p, T-p, 1-\alpha},$$

where $s^2 = (y - X\hat\beta_{ols})'(y - X\hat\beta_{ols})/(T - p)$, and where $F_{p, T-p, 1-\alpha}$ denotes the upper $100(1 - \alpha)\%$ quantile of the $F$ distribution with $p$ and $T - p$ degrees of freedom in the numerator and denominator, respectively. It is equivalently written as

$$\big(\hat\beta_{ols} - \beta\big)'X'X\big(\hat\beta_{ols} - \beta\big) \le s^2\, p\, F_{p, T-p, 1-\alpha},$$

which shows that the shape will be ellipsoidal. When $y_t$ is subjected to a transformation $\tau(y_t)$, Davidson and MacKinnon (DM) [5, p. 489] show that (19) has an additional Jacobian term. If $\tau(y_t) = \log(y_t)$, then $\partial\tau(y_t)/\partial y_t = 1/y_t$ and the Jacobian term is the log of its absolute value, $-\log(|y_t|)$.

An important generality is achieved by considering nonlinear regressions, $y = x(\beta) + \varepsilon$. Let us write a $T\times p$ matrix $X(\beta)$, similar to $X$ above, containing the partials of an element of $x(\beta)$ with respect to an element of $\beta$. We also let the covariances depend on additional parameters $\phi$, with $E(\varepsilon\varepsilon') = \sigma^2\Omega(\phi)$. If $\Omega$ is known, we have the generalized least-squares (GLS) estimator, which coincides with the ML estimator under normality. The 'normal equations' are written in [35] as a sum of $T$ scores:

$$g^* = X'\Omega^{-1}y - X'\Omega^{-1}X\beta = X'\Omega^{-1}(y - X\beta) = \sum_{t=1}^{T} H_t'\varepsilon_t = \sum_{t=1}^{T} S_t = 0. \qquad (21)$$

In (21) we can properly ignore $\sigma^2$; $H_t$ denotes the $t$th row of $\Omega^{-1}X$, and the score $S_t = H_t'\varepsilon_t$ is $p\times 1$. We view $g^* = 0$ as a Godambe-type optimal estimating function. This finishes our derivation of the optimal EF for the regression problem as a sum of $T$ scores in a potentially very general setting.

Our next task is to construct the GPF as a scaled sum of scores. This needs a derivation of flexible and general scale factors based on variances. The usual asymptotic theory for the nonlinear regression implies the following general expression for the variances [5, p. 290]:

$$\sqrt{T}\big(\hat\beta - \beta\big) \;\xrightarrow{d}\; N\Big(0,\ \operatorname*{plim}_{T\to\infty}\big[T^{-1}X(\beta)'\,\sigma^{-2}\Omega(\phi)^{-1}X(\beta)\big]^{-1}\Big). \qquad (22)$$

Writing $X = X(\beta)$ and $\Omega = \Omega(\phi)$ for clarity, the Fisher information matrix $I_F$ is $[X'\Omega^{-1}X]/\sigma^2$, where $(T - p)\hat\sigma^2 = (y - X\hat\beta)'\Omega^{-1}(y - X\hat\beta)$. Now its inverse, $I_F^{-1}$, is the asymptotic covariance matrix $(\mathrm{ASE})^2$.

The usual $t$ or $F$ tests for regressions are based on Fisher's pivot $z_F = (\hat\beta - \beta)(\mathrm{ASE})^{-1}$. Replacing the expected information matrix by the observed one obtains the Efron–Hinkley pivot, $(\hat\beta - \beta)(\widehat{\mathrm{ASE}})^{-1}$. If $X$ is nonstochastic, $E[X'\Omega^{-1}X] = [X'\Omega^{-1}X]$ and $\widehat{\mathrm{ASE}}$ equals ASE. A simple binary variable regression where this makes a difference and is intuitively sensible is given by Davidson and MacKinnon [5, p. 267]. For the special case of heteroskedasticity, where $\Omega$ is a known diagonal matrix, Royall [27] suggests the following robust estimator of the covariance matrix:

$$K_1 = T\big[X'\Omega^{-1}X\big]^{-1}X'\Omega^{-1}\operatorname{diag}\big(\varepsilon_t^2\big)\Omega^{-1}X\big[X'\Omega^{-1}X\big]^{-1}, \qquad (23)$$

where $\varepsilon_t = y_t - (X\beta)_t$ is the residual and $\operatorname{diag}(\cdot)$ denotes a diagonal matrix. When $\Omega$ is unknown, Royall's covariance matrix for regressions is complicated; see [35, p. 392]. If $\Omega$ is the identity matrix, and if $\operatorname{diag}(\varepsilon_t^2) = \hat\Omega$, then (23) reduces to what is known in econometrics texts [5, p. 553] as the Eicker–White heteroskedasticity consistent (HC) covariance matrix estimator: $T(X'X)^{-1}X'\hat\Omega X(X'X)^{-1}$. Let $h_t$ denote the diagonal element of the hat matrix $X(X'X)^{-1}X'$. The econometric literature considers four choices for the diagonal elements of $\hat\Omega$, denoted by HC0 to HC3, as

$$\mathrm{HC0} = \varepsilon_t^2, \qquad \mathrm{HC1} = \varepsilon_t^2\, T/(T - p), \qquad \mathrm{HC2} = \varepsilon_t^2/(1 - h_t), \qquad \mathrm{HC3} = \varepsilon_t^2/(1 - h_t)^2. \qquad (24)$$

Among these, Davidson and MacKinnon [5] recommend HC3 by alluding to Monte Carlo studies. If $E\varepsilon\varepsilon' = \Omega$, a $p$-variate version of (15) is $\mathrm{GPF} = [X'\Omega^{-1}E\varepsilon\varepsilon'\Omega^{-1}X]^{-0.5}X'\Omega^{-1}\varepsilon$, which has $p$ equations in the $p$ unknown coefficients $\beta$. If we insert $\sigma^2$ in $E\varepsilon\varepsilon' = \sigma^2\Omega$ and assume that $\sigma^2\Omega$ is known,

$$\mathrm{GPF} = \sum_{t=1}^{T}\bar S_t = \big[X'\Omega^{-1}X\big]^{-0.5}X'\Omega^{-1}(\varepsilon/\sigma). \qquad (25)$$

Recall from (21) that $X'\Omega^{-1}\varepsilon = \sum H_t'\varepsilon_t = \sum S_t$ is a sum of scores. In (25) this sum of scores is premultiplied by a scaling matrix and $\sigma$ is no longer ignored. Thus, our GPF for regressions in (25) is a sum of $T$ scaled scores. The OLS case has $\Omega = I$ and $\mathrm{GPF} = [X'X]^{-0.5}X'(\varepsilon/\sigma)$. This finishes the derivation of the GPF for regression as a sum of scaled quasi-scores, which is asymptotically $N(0, I)$ by the CLT.
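A Python sketch (our illustration, with simulated data) of the OLS special case of (25); the matrix square root is computed by eigendecomposition, and a standard multivariate root finder solves the $p$ equations $\mathrm{GPF}(\beta) = K$.

```python
import numpy as np
from scipy.optimize import fsolve

def inv_sqrt(M):
    # Symmetric inverse square root via eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def gpf25(beta, y, X, sigma=1.0):
    # Pivot (25) in the OLS case Omega = I:
    # GPF = [X'X]^{-0.5} X' (eps / sigma), asymptotically N(0, I_p)
    eps = y - X @ beta
    return inv_sqrt(X.T @ X) @ X.T @ (eps / sigma)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

# GPF-roots: p equations GPF(beta) = K in p unknowns; K = +/-1.96 * ones(p)
# gives the parametric limits discussed in the text.
K = 1.96 * np.ones(3)
beta_root = fsolve(lambda b: gpf25(b, y, X) - K, x0=np.zeros(3))
```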

Assuming the linear case $X(\beta) = X$ and a known and fixed $\Omega$ matrix, a bootstrap of (25) in [35] shuffles (and sometimes also studentizes) $J (= 999)$ times with replacement the $T$ scaled scores $\bar S_t$. The GPF-$N(0, I)$ bootstrap algorithm makes $J$ evaluations of $\mathrm{GPF} = \pm 1.96$ for any $\beta$ or any scalar function $f(\beta)$ to construct a CI95 for inference. This paper considers flexible choices $\Omega = \Omega(\phi)$, where the $\phi$ parameters can represent the autocorrelation and/or heteroskedasticity among the $\varepsilon$ errors. Babu [3] studies the 'breakdown point' of bootstraps and favors winsorization (replacing a certain percentage of 'extreme' values by the nearest nonextreme values) before resampling. Since the $\bar S_t$ are $p\times 1$ vectors, we must either winsorize each dimension separately or winsorize a vector norm $|\bar S_t|$. Now the numerical roots of the shuffled $\mathrm{GPF} = \sum\bar S_t = K$ will estimate $\beta$ for each shuffle, leading to a bootstrap CI95. Further research and simulations are needed to assess the robustness and widths of the CI95.

This paper also proposes a new multivariate pivot (denoted by a superscript 2 on GPF) using the property that a sum of squares of $p$ unit normals is $\chi^2(p)$. The sum of squares from (25) is

$$\mathrm{GPF}^2 = (\varepsilon/\sigma)'\Omega^{-1}X\big[X'\Omega^{-1}X\big]^{-1}X'\Omega^{-1}(\varepsilon/\sigma) \;\xrightarrow{d}\; \chi^2(p). \qquad (26)$$

If $\sigma$ is assumed to be known, the limiting distribution $\chi^2$ in (26) has zero noncentrality and does not depend on unknown parameters. If we estimate $\sigma^2$ by $s^2 = (\varepsilon'\varepsilon)/(T - p)$, where the $\varepsilon$ vector contains the regression residuals, we have $F_{p, T-p, 1-\alpha}$ as the limiting distribution instead of $\chi^2$.

Remark 3. How do we bootstrap using (26)? We need algebraic manipulations to write $\mathrm{GPF}^2$ as a sum of $T$ items which can be shuffled. Note that (26) is a quadratic form $\varepsilon'A\varepsilon$, and we can decompose the $T\times T$ matrix $A$ as $G\Lambda G'$, where $G$ is orthogonal and $\Lambda$ is diagonal. Let us define a $T\times 1$ vector $\tilde\varepsilon = \Lambda^{0.5}G'\varepsilon$ and note that $\varepsilon'A\varepsilon = \tilde\varepsilon'\tilde\varepsilon = \sum(\tilde\varepsilon_t)^2$, using the elements of $\tilde\varepsilon$. Now our bootstrap shuffles the $T$ values of $(\tilde\varepsilon_t)^2$ with replacement. Since the $(\tilde\varepsilon_t)^2$ values are scalar, they do not need any norms before winsorization. Unlike the GPF-$N(0, I)$ algorithm above, the $\mathrm{GPF}^2$s yield ellipsoidal regions rather than convenient upper and lower limits of the usual CIs. The limits of these regions come from the tabulated $T_\alpha$ upper 95% value of the $\chi^2$ or $F$. If we are interested in inference about a scalar function $f(\beta)$ of the $\beta$ vector, we can construct a CI95 by shuffling $(\tilde\varepsilon_t)^2$ and $J$ times solving $\mathrm{GPF}^2 = T_\alpha$ for $f(\beta)$. Simultaneous CIs for $\beta$ are more difficult. We have to numerically maximize and minimize each element of $\beta$ subject to the inequality constraint $\mathrm{GPF}^2 \le T_\alpha$.
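The decomposition step of Remark 3 in a short Python sketch (our illustration), including a numerical check that the $T$ scalar pieces add up to the quadratic form:

```python
import numpy as np

def squared_pieces(A, eps):
    # Remark 3: A = G Lambda G' (A symmetric PSD), tilde_eps = Lambda^0.5 G' eps,
    # so that eps' A eps = sum of the T scalars (tilde_eps_t)^2.
    lam, G = np.linalg.eigh(A)
    tilde = np.sqrt(np.clip(lam, 0.0, None)) * (G.T @ eps)
    return tilde ** 2          # shuffle these with replacement

# sanity check on a random positive semidefinite A
rng = np.random.default_rng(4)
B = rng.normal(size=(6, 6))
A = B @ B.T
eps = rng.normal(size=6)
assert np.isclose(squared_pieces(A, eps).sum(), eps @ A @ eps)
```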

The next few paragraphs discuss five robust choices of $\Omega(\phi)$ for both GPFs of (25) and (26).

(i) First, under heteroskedasticity, $\Omega = \operatorname{diag}(\varepsilon_t^2)$, and we replace $\Omega$ by a consistent estimate based on HC0–HC3 defined in (24).


(ii) Second, under autocorrelation alone, Vinod [32,35] suggests using the i.i.d. 'recursive residuals' for time series bootstraps. Note that one retains the (first) $p$ residuals unchanged and constructs $T - p$ i.i.d. recursive residuals for shuffling. The key advantage of this method is that it is valid for arbitrary error autocorrelation structures. This is a nonparametric choice of $\Omega$ without any $\phi$.

(iii) Third, under first-order autoregression, AR(1), among the regression errors, we have the following model at time $t$: $y_t = X_t'\beta + \varepsilon_t$, where $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$. Under normality of the errors the likelihood function may be written as

$$f(y_1, y_2, \ldots, y_T) = f(y_1)\,f(y_2 \mid y_1)\,f(y_3 \mid y_2)\cdots f(y_T \mid y_{T-1}).$$

Next we use the so-called quasi-first difference (e.g., $y_t - \rho y_{t-1}$). It is well known that here one treats the first observation differently from the others [17, p. 600]. For this model the log-likelihood function is available in textbooks, and the corresponding score function equals the partials of the log-likelihood. The partial derivative formula is different for $t = 1$ compared to all other $t$ values. The partials with respect to (w.r.t.) $\beta$ are

$$\big(1/\sigma_u^2\big)\sum_{t=1}^{T} u_t X_t^{*\prime}, \qquad \text{where } u_1 = (1 - \rho^2)^{0.5}(y_1 - X_1\beta)$$

and, for $t = 2, 3, \ldots, T$, $u_t = (y_t - \rho y_{t-1}) - (X_t - \rho X_{t-1})\beta$, where we have defined the $1\times p$ vectors $X_1^* = (1 - \rho^2)^{0.5}X_1$ and $X_t^* = (X_t - \rho X_{t-1})$ for $t = 2, 3, \ldots, T$. Collecting all $u_t$ we have the $u$ vector of dimension $(T\times 1)$. Similarly, we have the $T\times p$ matrix of quasi-differenced regressor data $X^*$ satisfying the definitional relation $(X^{*\prime}X^*) = X'\Omega^{-1}X$. The partial w.r.t. $\sigma_u^2$ is

$$\big(-T/2\sigma_u^2\big) + \big(1/2\sigma_u^4\big)\sum_{t=1}^{T} u_t^2.$$

Finally, the partial derivative w.r.t. $\rho$ is

$$\big(1/\sigma_u^2\big)\sum_{t=2}^{T} u_t\varepsilon_{t-1} + \big(\rho\varepsilon_1^2/\sigma_u^2\big) - \big(\rho/[1 - \rho^2]\big),$$

where $\varepsilon_t = y_t - X_t\beta$. Thus, for the parameter vector $\theta = (\beta, \sigma_u^2, \rho)$, we have analytical expressions for the score vector, which is the optimal EF. Simultaneous solution of the $p + 2$ equations in the parameter vector $\theta$ gives the usual ML estimator, which coincides with the EF estimator here.

Instead of simultaneous solutions, Durbin's [10] seminal paper starting the whole EF literature also suggested the following two-step OLS estimator for $\rho$. Regress $y_t$ on $y_{t-1}$, $X_t'$ and $(-X_{t-1}')$, and use the coefficient of $y_{t-1}$ as $\hat\rho$. Also, one can use a consistent estimate $s_u^2$ of $\sigma_u^2$ and simplify the score for $\beta$ as $(1/s_u^2)\sum_{t=1}^{T} u_t X_t^{*\prime} = (1/s_u^2)X^{*\prime}u$. Using rescaled scores, we have $\mathrm{GPF} = z_G = [X^{*\prime}X^*]^{-0.5}X^{*\prime}u$, and the analogous $\mathrm{GPF}^2$ is the sum of squares $(z_G)'z_G$.

(iv) Fourth, we consider $\Omega(\phi)$ induced by a mixed autoregressive moving average (ARMA) error process. Note that any 'stable and invertible' dynamic error process can be approximated by an ARMA$(q, q-1)$ process. Vinod [29] provides an algorithm for exact ML estimation of the regression coefficients $\beta$ when the model errors are ARMA$(q, q-1)$. His approximation is based on analytically known eigenvalues and eigenvectors of tri-diagonal matrices, with an explicit derivation for the ARMA(2, 1). In general, this regression error process would imply that $\Omega(\phi)$ is a function of $2q - 1$ elements of the $\phi$ vector, representing $q$ AR-side parameters and $(q - 1)$ MA-side parameters.

(v) Our fifth robust choice of $\Omega$ uses the heteroskedasticity and autocorrelation consistent (HAC) estimators discussed in [5, pp. 553, 613]. Assume that both heteroskedasticity and general autocorrelation among the regression errors are present, and that we are not willing to assume any parametric specification $\Omega(\phi)$. Instead of $\Omega$, we denote a nonparametric HAC covariance matrix as $E\varepsilon\varepsilon' = C$, to emphasize its nonzero off-diagonals due to autocorrelations and nonconstant diagonal elements due to heteroskedasticity. Practical construction of $C$ requires the following smoothing and truncation adjustments using the (quasi-)score functions $S_t$ defined in (21). Define our basic building block as a $p\times p$ matrix of autocovariances, $\hat\Omega_j = (1/T)\sum_{t=j+1}^{T} S_t(S_{t-j})'$. We smooth them by using $[\hat\Omega_j + \hat\Omega_j']$ to guarantee that we have a symmetric matrix. We further assume that the autocovariances die down after $m$ lags, with known $m$, and truncate the sum after $m$ terms. After such truncation and smoothing, we construct $C$ as a nonsingular symmetric matrix proposed by Newey and West [25]:

$$C = \hat\Omega_0 + \sum_{j=1}^{m} w(j, m)\big[\hat\Omega_j + \hat\Omega_j'\big], \qquad \text{where } w(j, m) = 1 - j(m + 1)^{-1}. \qquad (27)$$
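A direct Python transcription (our sketch) of (27), building $C$ from the $T$ score vectors with Bartlett weights:

```python
import numpy as np

def newey_west(S, m):
    # Eq. (27): C = Omega_0 + sum_{j=1}^m w(j,m) [Omega_j + Omega_j'],
    # Bartlett weights w(j,m) = 1 - j/(m+1); S is the T x p array whose
    # row t holds the score vector S_t'.
    T, _ = S.shape
    C = (S.T @ S) / T                       # Omega_0
    for j in range(1, m + 1):
        Oj = (S[j:].T @ S[:-j]) / T         # Omega_j = (1/T) sum_t S_t S_{t-j}'
        C += (1.0 - j / (m + 1.0)) * (Oj + Oj.T)
    return C
```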

The $w(j, m)$ in (27) are Bartlett's window weights, familiar from spectral analysis, declining linearly as $j$ increases. One can refine (27) by using the pre-whitened HAC estimator proposed by Andrews and Monahan [2]. Here we do not merely use (27) in the usual fashion as a HAC estimator of variance. We are also extending the EF-theory lesson mentioned in Remark 1 to construct a robust (HAC) estimator of the variance of the underlying score function itself, which permits construction of the scaled scores $\bar S_t$ needed to define the GPF in (25). Thus, upon smoothing and truncation, we propose the GPF as $[X'C^{-1}X]^{-0.5}X'C^{-1}(\varepsilon/\sigma)$ with nonparametric $C$. Similar to (26) we also have

$$\mathrm{GPF}^2 = \varepsilon'C^{-1}X\big[X'C^{-1}X\big]^{-1}X'C^{-1}\varepsilon \;\xrightarrow{d}\; \sigma^2\chi^2(p). \qquad (28)$$

If we use $s^2$ to estimate $\sigma^2$ in $E\varepsilon\varepsilon' = \sigma^2 C$, the limiting distribution in (28) becomes $F_{p, T-p, 1-\alpha}$. The difficulty mentioned in Remark 3 about simultaneous CIs for $\beta$ is relevant for (28) also. Since (28) is one equation in the $p$ unknowns of $\beta$, we must focus on one function $f(\beta)$ or one element of $\beta$, say $\beta_1$. Next, we show that this focus entails no loss of generality by rearranging the model in terms of a revised set of $p$ parameters. Let us still denote them by $\beta$ to avoid notational clutter. The trick is to use the Frisch–Waugh theorem [17, p. 247] to focus on one parameter at a time without loss of generality, as follows. First, rewrite our model after partitioning as $y = X_1\beta_1 + X_2\beta_2$, where the focus is on one parameter $\beta_1$ and we have combined all other parameters into the vector $\beta_2$. Now construct a vector $y^*$ of residuals from the regression of $y$ on $X_2$, and also create the regressor columns from the residuals of the regression of $X_1$ on $X_2$. The Frisch–Waugh theorem guarantees that this gives exactly the same regression coefficient as the original model. Since $\beta_1$ could be any one of the $p$ elements of $\beta$ by rearrangement, there is no loss of generality in this method, and the $\chi^2(p)$ in (28) becomes $\chi^2(1)$ with only one unknown $\beta_1$. Since $E(\chi^2(1)) = 1$, one tempting possibility is to solve the equation $\mathrm{GPF}^2 = s^2$ for $\beta_1$ some $J = 999$ times to construct a CI95 for $\beta_1$, and eventually for all parameters of $\beta$. With an estimated $s^2$ one can use numerical evaluation of the inequality $\mathrm{GPF}^2 \le F_{p, T-p, 1-\alpha}$. One can further correct any discrepancy between the theoretically correct $F$ values and the observed density of the $\mathrm{GPF}^2$s in resamples by using the upper $(1 - \alpha)$ quantile from the observed order statistics instead of $F_{p, T-p, 1-\alpha}$. One can also use the double bootstrap (d-boot) refinement if adequate computer resources are available. These extensions are left for future work.

Practitioners often need to test some (economic) theory implying algebraic restrictions on $\beta$. For example, if $y$ is consumption, $X_1$ is income and $X_2$ is lagged consumption in $y = \beta_1 + \beta_2 X_1 + \beta_3 X_2$ (a consumption function), some Keynesian theorists claim that the long-run marginal propensity to consume, $\beta_2/(1 - \beta_3)$, must be unity [17, Chapter 7]. In general, we want to test $m$ such restrictions rewritten as $m$ nonlinear equations $C(\beta) = 0$. Econometricians usually linearize the nonlinear restrictions as $C(\beta) = R\beta - q$, where $R$ is an $m\times p$ matrix and $q$ is an $m\times 1$ vector. Thus, $C(\beta) = 0$ under the null hypothesis that the theory is true. From (25), our GPF-roots are found by solving $\mathrm{GPF}(y, X, \beta) = [X'\Omega^{-1}X]^{-0.5}X'\Omega^{-1}(y - X\beta) = K$. These are $p$ nonlinear equations in the $p$ unknown elements of $\beta$. In most cases, we can construct bootstrap CIs for $C(\beta)$ without linearization. If the theory is supported by the data, each row of $C(\beta)$ will be close to zero except for random variation. If the observed 95% confidence interval (CI95) for any row of $C(\beta)$ does not contain zero, we shall reject the theory. The ill-behaved LAU functions $\mathrm{lau}(\beta)$ mentioned in Section 1 can be viewed as special cases of $C(\beta)$. A practical disadvantage of the $\mathrm{GPF}^2$s (quadratic forms in GPFs) mentioned in Remark 3 applies here, and it may be somewhat difficult to use (26) to construct $\mathrm{GPF}^2$ confidence sets for certain ill-behaved $C(\beta)$. However, we can always use (25).

Proposition 2. Let $z_G$ denote a sum of scaled scores. Assume that a nonsingular estimate of $E z_G z_G'$ is available (possibly after smoothing and truncation), and that we are interested in inference on Dufour's 'locally almost unidentified' functions $\mathrm{lau}(\beta)$. By choosing a GPF as in (25) or (28), which does not explicitly involve the scalar $\mathrm{lau}(\beta)$ at all, we can construct valid confidence intervals.

Proof. Dufour's [8] result is that CIs obtained by inverting Wald-type statistics $[\mathrm{lau}(\hat\beta) - \mathrm{lau}(\beta)]/\mathrm{SE}(\widehat{\mathrm{lau}})$ can have zero coverage probability for $\mathrm{lau}(\beta)$. One problem is that the Wald-type statistic is not a valid pivotal quantity when the sampling distribution of $\mathrm{lau}(\hat\beta)$ depends on unknown nuisance parameters. Another problem is that the covariance matrix needed for $\mathrm{SE}(\widehat{\mathrm{lau}})$ can be singular. The GPF of (25) is asymptotically normal by Proposition 1. Such GPFs are certainly not Wald-type, since they do not even contain the expression $\mathrm{lau}(\beta)$. Assuming a nonsingular $E z_G z_G'$ is less stringent than assuming a nonsingular matrix of variances of $\mathrm{lau}(\hat\beta)$ for each choice of $\mathrm{lau}(\beta)$. Let $\iota$ denote a $p\times 1$ vector of ones. We obtain a CI95 for a function $f(\beta)$ of the regression parameters by numerically solving for $f(\beta)$ a system of $p$ equations $\mathrm{GPF}(y, \beta) = \pm 1.96(\iota) = z_\alpha$. For example, if $f(\beta) = \mathrm{lau}(\beta)$ is a ratio of two regression coefficients, then its denominator can be zero, and the variance of such an $f(\beta)$ can have obvious difficulties. Rao [26, Section 4(b)] discusses an ingenious squaring method for finding the CIs of ratios. Our approach is more general and simple to program. In the absence of software, Rao's squaring method is rarely implemented. By contrast, our GPF is readily implemented in [33,36] for functions (ratios) of regression coefficients. Given a scalar function $\mathrm{lau}(\beta)$, we simply replace one of the $p$ parameters in $\beta$ by $\mathrm{lau}(\beta)$, the parametric function of interest, and solve $\mathrm{GPF} = \text{(constant)}$ to construct a valid CI for the $\mathrm{lau}(\beta)$. For further improvements, we suggest the bootstrap. □

The bootstrap resampling of scaled scores to compute CIs is attractive in light of the limited simulations in [35]. To confirm it we would have to consider a wide range of actual and nominal 'test sizes and powers' for a wide range of data sets. The GPF-$N(0, I)$ algorithm is a parametric bootstrap; relaxing the parametric assumptions of $N(0, I)$ and $\Omega(\phi)$ leads to nonparametric and double bootstrap (GPF-d-boot) algorithms. McCullough and Vinod [23] offer practical details about implementing the d-boot. Although the d-boot imposes a heavy computational burden and cannot be readily simulated, Letson and McCullough [20] do report encouraging results from their d-boot simulations. Also difficult to simulate are the algorithms developed here for the new $\mathrm{GPF}^2$s. Our limited experiments with the new algorithms show that $\mathrm{GPF}^2$s are quite feasible, but have the following drawback: $\mathrm{GPF}^2$s cannot be readily solved for arbitrary nonlinear vector functions of $\beta$. Our recent experiments with nonnormal, nonparametric, nonspherical error GPF algorithms indicate that their 95% CIs can be wider than the usual intervals. We have not attempted winsorization, since it offers too rich a variety of options, including a choice of the winsorized percentage. From Babu's [3] theory it is clear that greater robustness can be achieved by winsorization, and we have shown how shorter bootstrap CIs can arise by removing extreme values.


7. Summary and conclusions

The influential parametric asymptotic inference based on the early work of Fisher, Anderson and others was developed before the era of modern computers. We mention the early idea of focusing on functions, as in Anderson's critical functions. The estimating functions (EFs) were developed by Godambe and Durbin in 1960. The main lesson of EF theory (in Remark 1) is that good EFs automatically lead to good EF-estimators (= EF-roots). Similarly, good pivots (GPFs), which contain all information in the sample, can lead to useful GPF-roots for reliable inference. The $\mathrm{GPF} = \sum\bar S_t$, a sum of $T$ items, converges to $N(0, I)$ by the central limit theorem. We provide details on the bootstrap shuffling of the scaled quasi-scores $\bar S_t$ to construct confidence intervals for statistical inference. For regression coefficients $\beta$, when $\Omega(\phi)$ is the matrix of error variances, depending on the particular application at hand, we suggest explicit bootstrap algorithms for five robust choices of $\Omega$.

Although the GPF appears in [13], the use of its numerical roots is first proposed in [35]. Vinod [35] claims that the GPFs fill a long-standing need of the bootstrap literature for robust pivots and enable robust statistical inference in many general situations. This paper supports the claim by using Cox's simple example studied by Efron, Hinkley and Royall. For that, we derive Fisher's highly structured pivot, its modification by Efron and Hinkley [11], and a further modification by Royall [27] to inject robustness. The GPF for Cox's example has a simpler algebraic form than Royall's pivot. We use numerical GPF-roots for robust statistical inference.

We test the quality of inference by simulating all these pivots for Cox's example. A parametric GPF-$N(0, I)$ bootstrap algorithm uses about a thousand standard normal deviates to simulate the sampling distribution. A nonparametric (single) bootstrap algorithm uses the empirical distribution function for robustness. For Cox's univariate example, our simulation shows that GPFs yield short and robust CIs, without having to use the double bootstrap (d-boot). The width of the traditional interval is 4.53, whereas the width of the GPF intervals is only about 0.53. This is obviously a major reduction in the confidence interval (CI) width, which is predicted by the asymptotic property (iii) given by Godambe and Heyde [15]. Thus, we have demonstrated that for univariate problems our bootstrap methods based on GPFs offer superior statistical inference.

This paper discusses new GPF formulas for some exponential family members, including the Poisson mean, binomial probability, and normal standard deviation, where a weighted version is designed to better represent financial risk. For regressions, this paper also derives from a quadratic form of the GPF a new second type, $\mathrm{GPF}^2$, which is asymptotically $\chi^2$ or $F$. The five robust choices of $\Omega(\phi)$ mentioned above are shown to be available for $\mathrm{GPF}^2$ also. We discuss inference problems for some ill-behaved functions, where traditional CIs can have zero coverage probability. Our solution to this problem, stated as Proposition 2, is to numerically solve an equation involving the GPF. Since Dufour [8] shows that such ill-behaved functions are ubiquitous in applications (e.g., computation of the long-run multiplier), our solution is of considerable practical interest in econometrics and other fields when regressions are used to test nonlinear theoretical propositions, especially those involving ratios of random variables.

Heyde [19] and Davison and Hinkley [6] offer formal proofs showing that GPFs and d-boots, respectively, are powerful tools for the construction of short and robust CIs. We have shown a way of combining the two. By defining the tail areas as rejection regions, our CIs can obviously be used for significance testing. The CIs from GPF-roots can serve as a foundation for further research on asymptotic inference in an era of powerful computing. For example, numerical pivots may help extend the well-developed EF theory for nuisance parameters [21]. The potential of EFs and GPFs for semi-parametric and semi-martingale models with nuisance parameters is indicated by Heyde [19]. Of course, these ideas need to be developed, and we need greater practical experience with many more examples. We have shown that our proposal can potentially simplify, robustify and improve the asymptotic inference methods currently used in statistics and econometrics.

Acknowledgement

I thank Professor Godambe of the University of Waterloo for important suggestions. A version of this paper was circulated as a Fordham University Economics Department Discussion Paper dated 12 June 1996, and was presented at Columbia University in April 1999.

References

[1] T.W. Anderson, Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958.
[2] D.W.K. Andrews, J.C. Monahan, An improved heteroscedasticity and autocorrelation consistent covariance matrix estimator, Econometrica 60 (4) (1992) 953–966.
[3] G.J. Babu, Breakdown theory for estimators based on bootstrap and other resampling schemes, Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA, 1997.
[4] D.R. Cox, Partial likelihood, Biometrika 62 (1975) 269–276.
[5] R. Davidson, J.G. MacKinnon, Estimation and Inference in Econometrics, Oxford University Press, New York, 1993.
[6] A.C. Davison, D.V. Hinkley, Bootstrap Methods and their Application, Cambridge University Press, New York, 1997.
[7] J.R. Donaldson, R.B. Schnabel, Computational experience with confidence regions and confidence intervals for nonlinear least-squares, Technometrics 29 (1987) 67–82.
[8] J.M. Dufour, Some impossibility theorems in econometrics with applications to structural and dynamic models, Econometrica 65 (1997) 1365–1387.
[9] D.D. Dunlop, Regression for longitudinal data: a bridge from least-squares regression, Amer. Statist. 48 (1994) 299–303.
[10] J. Durbin, Estimation of parameters in time-series regression models, J. Roy. Statist. Soc. Ser. B 22 (1960) 139–153.
[11] B. Efron, D.V. Hinkley, Assessing the accuracy of maximum likelihood estimation: observed versus expected information, Biometrika 65 (1978) 457–482.
[12] V.P. Godambe, An optimum property of regular maximum likelihood estimation, Ann. Math. Statist. 31 (1960) 1208–1212.
[13] V.P. Godambe, The foundations of finite sample estimation in stochastic processes, Biometrika 72 (1985) 419–428.
[14] V.P. Godambe, Orthogonality of estimating functions and nuisance parameters, Biometrika 78 (1991) 143–151.
[15] V.P. Godambe, C.C. Heyde, Quasilikelihood and optimal estimation, Internat. Statist. Rev. 55 (1987) 231–244.
[16] V.P. Godambe, B.K. Kale, Estimating functions: an overview, in: V.P. Godambe (Ed.), Estimating Functions, Clarendon Press, Oxford, 1991 (Chapter 1).
[17] W.H. Greene, Econometric Analysis, third ed., Prentice-Hall, New York, 1997.
[18] P. Hall, The Bootstrap and Edgeworth Expansion, Springer, New York, 1992.
[19] C.C. Heyde, Quasi-likelihood and its Applications, Springer, New York, 1997.
[20] D. Letson, B.D. McCullough, Better confidence intervals: the double bootstrap with no pivot, Am. J. Agri. Econom. 80 (3) (1998) 552–559.
[21] K. Liang, S.L. Zeger, Inference based on estimating functions in the presence of nuisance parameters, Statist. Sci. 10 (1995) 158–173.
[22] D.L. McLeish, Dependent central limit theorems and invariance principles, Ann. Probability 2 (4) (1974) 620–628.
[23] B.D. McCullough, H.D. Vinod, Implementing the double bootstrap, Comput. Econom. 12 (1998) 79–95.
[24] R.C. Mittelhammer, Mathematical Statistics for Economics and Business, Springer, New York, 1996.
[25] W.K. Newey, K.D. West, A simple positive semi-definite, heteroscedasticity and autocorrelation consistent covariance matrix, Econometrica 55 (3) (1987) 703–708.
[26] C.R. Rao, Linear Statistical Inference and its Applications, Wiley, New York, 1973.
[27] R.M. Royall, Model robust confidence intervals using maximum likelihood estimators, Internat. Statist. Rev. 54 (2) (1986) 221–226.
[28] H.D. Vinod, Distribution of a generalized t ratio for biased estimators, Econ. Lett. 14 (1984) 43–52.
[29] H.D. Vinod, Exact maximum likelihood regression estimation with ARMA(n, n−1) errors, Econ. Lett. 17 (1985) 335–358.
[30] H.D. Vinod, Bootstrap methods: applications in econometrics, in: G.S. Maddala, C.R. Rao, H.D. Vinod (Eds.), Handbook of Statistics: Econometrics, vol. 11, North-Holland, New York, 1993, pp. 629–661 (Chapter 23).
[31] H.D. Vinod, Double bootstrap for shrinkage estimators, J. Econometrics 68 (2) (1995) 287–302.
[32] H.D. Vinod, Comments on bootstrapping time series data, Econom. Rev. 15 (2) (1996) 183–190.
[33] H.D. Vinod, Concave consumption, Euler equation and inference using estimating functions, in: 1997 Proceedings of the Business and Economic Statistics Section of the American Statistical Association, Alexandria, VA, 1998, pp. 118–123.
[34] H.D. Vinod, Using Godambe–Durbin estimating functions in econometrics, in: Basawa, Godambe, Taylor (Eds.), Selected Proceedings of the Symposium on Estimating Equations, IMS Lecture Notes, Monograph Series, vol. 32, Hayward, CA, USA, 1997, pp. 215–238.
[35] H.D. Vinod, Foundations of statistical inference based on numerical roots of robust pivot functions (Fellow's Corner), J. Econometrics 86 (1998) 387–396.
[36] H.D. Vinod, P. Samanta, Forecasting exchange rate dynamics using GMM, estimating functions and numerical conditional variance methods, presented at the 17th Annual International Symposium on Forecasting, Barbados, June 1997.

