CHAPTER 19
Forecasting Binary Outcomes
Kajal Lahiri and Liu Yang
University at Albany: SUNY

Contents

1. Introduction 1026
2. Probability Predictions 1027
   2.1. Model-Based Probability Predictions 1028
        2.1.1. Parametric Approach 1029
        2.1.2. Non-Parametric Approach 1036
        2.1.3. Semi-Parametric Approach 1037
        2.1.4. Bayesian Approach 1039
        2.1.5. Empirical Illustration 1041
        2.1.6. Probability Predictions in Panel Data Models 1044
   2.2. Non-Model-Based Probability Predictions 1049
3. Evaluation of Binary Event Predictions 1051
   3.1. Evaluation of Probability Predictions 1051
        3.1.1. Evaluation of Forecast Skill 1051
        3.1.2. Evaluation of Forecast Value 1065
   3.2. Evaluation of Point Predictions 1070
        3.2.1. Skill Measures for Point Forecasts 1070
        3.2.2. Statistical Inference Based on Contingency Tables 1072
        3.2.3. Evaluation of Forecast Value 1076
4. Binary Point Predictions 1077
   4.1. Two-Step Approach 1078
   4.2. One-Step Approach 1079
   4.3. Empirical Illustration 1083
   4.4. Classification Models in Statistical Learning 1086
        4.4.1. Linear Discriminant Analysis 1086
        4.4.2. Classification Trees 1088
        4.4.3. Neural Networks 1091
5. Improving Binary Predictions 1092
   5.1. Combining Binary Predictions 1094
   5.2. Bootstrap Aggregating 1096
6. Conclusion 1097
Acknowledgments 1099
References 1099

Abstract

Binary events are involved in many economic decision problems. In recent years, considerable progress has been made in diverse disciplines in developing models for forecasting binary outcomes. We

Handbook of Economic Forecasting, Volume 2B © 2013 Elsevier B.V. ISSN 1574-0706, http://dx.doi.org/10.1016/B978-0-444-62731-5.00019-1 All rights reserved. 1025


distinguish between two types of forecasts for binary events that are generally obtained as the output of regression models: probability forecasts and point forecasts. We summarize specification, estimation, and evaluation of binary response models for the purpose of forecasting in a unified framework that is characterized by the joint distribution of forecasts and actuals, and a general loss function. Analysis of both the skill and the value of probability and point forecasts can be carried out within this framework. Parametric, semi-parametric, non-parametric, and Bayesian approaches are covered. The emphasis is on the basic intuitions underlying each methodology, abstracting away from the mathematical details.

Keywords

Probability prediction, Point prediction, Skill, Value, Joint distribution, Loss function

1. INTRODUCTION

The need for accurate prediction of events with binary outcomes, like loan defaults, occurrence of recessions, passage of specific legislation, etc., arises often in economics and numerous other areas of decision making. For example, a firm may base its production decisions on macroeconomic prospects; a bank manager may decide whether to extend a loan to an individual depending on the risk of default; and the propensity of a worker to apply for disability benefits is partially determined by the probability of being approved.

How should one characterize a good forecast in these situations? Take the loan offer as an example: a skilled bank manager with professional experience, after observing all relevant personal characteristics of the applicant, is probably able to guess the odds that an applicant will default. However, this ability does not necessarily translate into a good decision because the ultimate payoff also depends on the accurate assessment of the cost and benefit associated with a decision. The cost of an incorrect approval of the loan can be larger than that of an incorrect denial, such that an optimal decision will depend on how large this cost differential is. A manager, who may otherwise be a skillful forecaster, is unable to make an optimal decision unless he is aware of the costs and benefits associated with each of the binary outcomes. The value of a forecast can only be evaluated in a decision-making context.

It is useful to distinguish between two types of forecasts for binary outcomes: probability forecasts and point forecasts. The former belongs to the broader category of density forecasts, since knowing the probability of a binary event is equivalent to knowing the entire density for the binary variable. Growing interest in probability forecasts has mainly been dictated by the desire of the professional forecasting community to quantify forecast uncertainty, which is often ignored in making point forecasts. After all, a primary purpose of forecasting is to reduce uncertainty. In practice, a set of covariates is available for predicting the binary outcome under consideration. In this setting, probability forecasts only describe the objective statistical properties of the joint distribution between the event and covariates, and thus can be analyzed first without considering forecast value. By contrast, a binary point forecast, always being either 0 or 1, cannot logically be issued


in isolation of the loss function implicit in the underlying decision-making problem. In this sense, probability forecasts are more fundamental in nature. Because a point forecast is a mixture of the objective joint distribution between the event and the covariates, and the loss function, we will defer an in-depth discussion of binary point forecasts until some important concepts regarding forecast value have been introduced.

Given the importance of density and point forecasts for other types of target variables such as GDP growth and inflation rates, one may wonder what feature of a binary outcome necessitates a separate analysis and evaluation of its forecasts. It is the discrete support space of the dependent variable that makes forecasting binary outcomes distinctive, and this restriction should be taken into account in the specification, estimation, and evaluation exercises. For probability forecasts, any hypothesized model ignoring this feature may lead to serious bias in forecasts. This, however, is not necessarily the case in making binary point forecasts, where the working model may violate this restriction, cf. Elliott and Lieli (2013). Due to the nature of a binary event, its joint distribution and loss function are of special forms, which can be used to design a wide array of tools for forecast evaluation and combination. For most of these procedures, it is hard to find comparable counterparts in forecasting other types of target variables.

This chapter summarizes a substantial body of literature on forecasting binary outcomes in a unified framework that has been developed in a number of disciplines such as biostatistics, computer science, econometrics, mathematics, medical imaging, meteorology, and psychology. We cover only those models and techniques that are common across these disciplines, with a focus on their applications in economic forecasting. Nevertheless, we give references to some of the methods excluded from this analysis.

The outline of this chapter is as follows. In Section 2, we present methods for forecasting binary outcomes that have been developed primarily by econometricians in the framework of binary regressions. Section 3 is concerned with the evaluation methodologies for assessing binary forecast skill and forecast value, most of which have been developed in meteorology and psychology. Section 4 is built upon the previous two sections; it consists of models especially designed for binary point predictions. We discuss two alternative methodologies to improve binary forecasts in Section 5. Section 6 closes this chapter by underscoring the unified framework at the core of this literature, which provides coherence to the diversity of issues and their generic solutions.

2. PROBABILITY PREDICTIONS

This section addresses the issue of modeling the conditional probability of a binary event given an information set available at the time of prediction. It is a special form of density prediction since, for a Bernoulli distribution, knowing the conditional probability is equivalent to knowing the density. Four classical binary response models developed in econometrics along with an empirical illustration will come first, followed by


generalizations to panel data forecasting. Sometimes, forecasts are not derived from any estimated econometric model, but are completely subjective or judgmental. These will be introduced briefly in Section 2.2.

2.1. Model-Based Probability Predictions

For the purpose of probability predictions, the forecaster often has an information set (denoted by Ω) that includes all variables relevant to the occurrence of a binary event. Incorporation of a particular variable into Ω is justified either by economic theory or by the variable's historical forecasting performance. Suppose the dependent variable Y equals 1 when the target event occurs and 0 otherwise. The question to be answered in this section is how to model the conditional probability of Y = 1 given Ω, viz., P(Y = 1|Ω). The formulation of binary probability prediction in this manner is sufficiently general to nest nearly all specific models that follow. For instance, if Ω contains lagged dependent variables, then we have a dynamic model commonly used in macroeconomic forecasting. When it comes to the functional form of the conditional probability, we can identify three broad approaches: (i) a parametric model, which imposes a very strong assumption on P(Y = 1|Ω) such that the only unknown is a finite dimensional parameter vector; (ii) a non-parametric model, which does not constrain P(Y = 1|Ω) beyond certain regular properties such as smoothness; and (iii) a semi-parametric model, which lies between these two extremes in that it restricts some elements of P(Y = 1|Ω) and yet allows flexible specification of other elements. If Ω contains prior knowledge on the parameters, P(Y = 1|Ω) is a Bayesian model that integrates the prior with sample information to yield the posterior predictive probability. Before examining each specific model in detail, we will offer motivations as to why special care must be taken when the dependent variable is binary.

For modeling a binary event, a natural question is whether we can treat it as an ordinary dependent variable and assume a linear structure for P(Y = 1|Ω). In a linear probability model, for example, the conditional probability of Y = 1 depends on a k-dimensional vector X in a linear way, that is,

P(Y = 1|Ω) = Xβ, (1)

where Ω = X and β is a parameter vector conforming in dimension with X. However, this model may not be suitable for the binary response case. As noted by Maddala (1983), for some range of covariates X, Xβ may fall outside of [0, 1]. This is not permissible given that a conditional probability must be a number between zero and one. Consequently, discreteness of binary dependent variables calls for non-linear econometric models, and the selected specification must tackle this issue properly.
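A minimal numerical sketch (with made-up coefficients and covariate values, purely for illustration) shows how the linear specification (1) can produce fitted "probabilities" outside the unit interval:

```python
import numpy as np

# Hypothetical fitted linear probability model: P(Y=1|X) = Xb.
# Coefficients and covariate values are invented for illustration.
beta = np.array([0.1, 0.08])          # intercept, slope
X = np.array([[1.0, -5.0],            # covariate well below its mean
              [1.0, 5.0],             # moderate covariate value
              [1.0, 15.0]])           # covariate well above its mean
p = X @ beta
print(p)   # first and last entries fall outside [0, 1]
```

The middle observation yields a legitimate probability, but the extreme covariate values do not, which is exactly the defect the non-linear link functions below are designed to remove.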

The common approach to overcome the drawback associated with the linear model involves a non-linear link function taking values within [0, 1]. One well-known example is the cumulative distribution function for any random variable. Often, restrictions on P(Y = 1|Ω) are imposed within the framework of the following latent dependent variable


form (with Ω = X):

Y∗ = G(X) + ε, ε is distributed as F(·),
Y = 1 if Y∗ > 0, otherwise Y = 0. (2)

Here, Y∗ is a hypothesized latent variable with conditional expectation G(X), called the index function. ε is a random error with cumulative distribution function F(·) and is independent of X. The observed binary variable Y is generated according to (2). By design, the conditional probability of Y = 1 given X must be a number between zero and one, as shown below:

E(Y|X) = P(Y = 1|X) = P(Y∗ > 0|X)
       = P(ε > −G(X)|X)
       = 1 − F(−G(X)). (3)

Regardless of X, F(−G(X)) always lies inside [0, 1], and so does the conditional expectation itself. In a parametric model, the functional form of F(·) is known, whereas the index G(·) is specified up to a finite dimensional parameter vector β, that is, G(·) = G0(·, β) and the functional form of G0(·, ·) is known. As mentioned earlier, a non-parametric model does not impose stringent restrictions on the functional forms of F(·) and G(·) besides some regular smoothness conditions. If either F(·) or G(·) is flexible but the other is subject to specification, a semi-parametric model results.
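The mechanics of the latent-variable formulation (2) and the implied probability (3) can be checked by simulation. The sketch below uses illustrative parameter values and a standard normal ε (so F is the probit link), and compares the empirical frequency of Y = 1 near a fixed covariate value with 1 − F(−G(X)):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Latent-variable data-generating process (2) with linear index G(X) = Xb.
beta = np.array([0.5, 1.0])                 # illustrative parameter values
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)                    # F = standard normal
y_star = X @ beta + eps
Y = (y_star > 0).astype(int)

# Check (3) at a fixed covariate value x0:
# P(Y=1|x0) = 1 - F(-G(x0)) = Phi(b0 + b1*x0)
x0 = 1.0
mask = np.abs(X[:, 1] - x0) < 0.05          # observations with X near x0
empirical = Y[mask].mean()
theoretical = 1.0 - Phi(-(beta[0] + beta[1] * x0))
print(empirical, theoretical)               # both close to Phi(1.5)
```

The empirical conditional frequency matches the model-implied probability, which by construction can never leave the unit interval.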

2.1.1. Parametric Approach

Two prime parametric binary response models assume the index function to be linear, that is, G0(X, β) = Xβ. If F is the distribution function of a standard normal variate, that is,

F(u) = ∫_{−∞}^{u} (1/√(2π)) e^{−t²/2} dt, (4)

then we have the probit model. Alternatively, if F is the logistic distribution function, that is,

F(u) = e^u / (1 + e^u), (5)

we have the logit model. These are the two most popular parametric binary response models in econometrics. By symmetry of their density functions around zero, the conditional probability of Y = 1 reduces to the simple form F(Xβ). Note that the index function does not have to be linear; it could be any non-linear function of β. In addition, the link function F(·) need not be (4) or (5); it could be any other distribution function. One of the possibilities is the extreme value distribution:

F(u) = e^{−e^{−u}}. (6)


Nevertheless, the key point in parametric models is that the functional forms for the link and index, irrespective of how complex they are, should be specified up to a finite dimensional parameter vector.
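The three links (4)–(6) can be compared directly. The probit and logit links are symmetric around zero, F(u) = 1 − F(−u), which is what allows the probability of Y = 1 to reduce to F(Xβ); the extreme value link is asymmetric. A small sketch:

```python
import math

def probit(u):
    """Standard normal CDF, link (4)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def logit(u):
    """Logistic CDF, link (5)."""
    return math.exp(u) / (1.0 + math.exp(u))

def extreme_value(u):
    """Extreme value CDF, link (6)."""
    return math.exp(-math.exp(-u))

for u in (-2.0, 0.0, 2.0):
    print(f"u={u:+.1f}  probit={probit(u):.3f}  "
          f"logit={logit(u):.3f}  extreme={extreme_value(u):.3f}")
```

At u = 0 the probit and logit links both return 0.5, while the extreme value link returns e^{−1} ≈ 0.368, reflecting its skewness.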

Koenker and Yoon (2009) introduced two wider classes of parametric link functions for binary response models: the Gosset link based on the Student t-distribution for ε, and the Pregibon link based on the generalized Tukey λ family. The probit and logit links are nested within the Gosset and Pregibon classes, respectively. For example, when the degrees of freedom for the Student t-distribution are large, it can be very close to the standard normal distribution. For the generalized Tukey λ link with two parameters controlling the tail behavior and skewness, the logit link is obtained by setting these two parameters to zero. Based on these observations, Koenker and Yoon (2009) compared and contrasted the Bayesian and asymptotic chi-squared tests for the suitability of the probit or logit link within these more general families. One primary objective of their paper was to correct the misperception that all links are essentially indistinguishable. They argued that misspecification of the link function may lead to a severe estimation bias, even when the index is correctly specified. The binary response model with Gosset or Pregibon as link offers a relatively simple compromise between the conventional probit or logit specification and the semi-parametric counterpart to be introduced in Section 2.1.3.

Train (2003) discussed various identification issues in parametric binary response models. For the purpose of prediction, we care about the predicted probabilities instead of parameters, implying that we have no preference over two models generating identical predicted probabilities, even though one of them is not fully identified. For this reason, identification is often not an issue, and unidentified or partially identified models may be valuable in forecasting.

Once the parametric model is specified and identification conditions are recognized, the remaining job is to estimate β, given a sample. Amongst a number of methods, maximum likelihood (ML) yields an asymptotically efficient estimator, provided the model is correctly specified. Suppose the index is linear. The logarithm of the conditional likelihood function given a sample {Yt, Xt} with t = 1, . . . , T is

l(β|{Yt, Xt}) ≡ Σ_{t=1}^{T} [Yt ln F(Xtβ) + (1 − Yt) ln(1 − F(Xtβ))], (7)

and ML maximizes (7) over the parameter space. Amemiya (1985) derived consistency and asymptotic normality of the maximum likelihood estimator for this model, and established the global concavity of the likelihood function in the logit and probit cases. This means that the Newton–Raphson iterative procedure will converge to the unique maximizer of (7), no matter what the starting values are. For details regarding the iterative procedure to calculate the ML estimator in these models, see Amemiya (1985). Statistical inference on the parameters, predicted probabilities, marginal effects, and interaction effects can be conducted in a straightforward way, provided the sample is independently and identically


distributed (i.i.d.) or stationary and ergodic (in addition to satisfying certain moment conditions). These, however, may not always hold. Park and Phillips (2000) developed the limiting distribution theory of the ML estimator in parametric binary choice models with non-stationary integrated explanatory variables, which was extended further to multinomial responses by Hu and Phillips (2004a,b).
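A compact Newton–Raphson implementation of ML estimation of (7) for the logit link, exploiting the global concavity noted above, can be sketched as follows (simulated data, illustrative parameter values):

```python
import numpy as np

def logit_mle(X, y, tol=1e-10, max_iter=100):
    """Maximize the log-likelihood (7) for the logit link by Newton-Raphson.
    The objective is globally concave (Amemiya, 1985), so convergence does
    not depend on the starting values."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted P(Y=1|X)
        score = X.T @ (y - p)                     # gradient of (7)
        W = p * (1.0 - p)
        hessian = -(X * W[:, None]).T @ X
        step = np.linalg.solve(hessian, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    # Asymptotic covariance: inverse of the estimated information matrix
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)
    cov = np.linalg.inv((X * W[:, None]).T @ X)
    return beta, cov

rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.3, 0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.uniform(size=n) < p_true).astype(float)

beta_hat, cov = logit_mle(X, y)
print(beta_hat)   # close to the true values [-0.3, 0.8]
```

The inverse Hessian returned here is the standard estimated asymptotic covariance of the ML estimator for the logit model.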

In dynamic binary response models, the information set Ω may include unobserved variables. Chauvet and Potter (2005) incorporated the lagged latent variable, together with exogenous regressors, in Ω. A practical difficulty with these models is that the likelihood function involves an intractable multiple integral over the latent variable. One way to circumvent this problem is to use a Bayesian computational technique based on a Markov chain Monte Carlo algorithm. See the technical appendix in Monokroussos (2011) for implementation details. Kauppi and Saikkonen (2008) examined the predictive performance of various dynamic probit models in which the lagged indicator of economic recession, or the conditional mean of the latent variable, is used to forecast recessions. Their dynamic formulations are much easier to implement by applying standard numerical methods, and iterated multi-period forecasts can be generated. For a general treatment of multiple forecasts over multiple horizons in dynamic models, see Teräsvirta et al. (2010), where four iterative procedures are outlined and assessed in terms of their forecast accuracy. Hao and Ng (2011) evaluated the predictive ability of four probit model specifications proposed by Kauppi and Saikkonen (2008) to forecast Canadian recessions, and found that dynamic models with the actual recession indicator as an explanatory variable were better in predicting the duration of recessions, whereas the addition of the lagged latent variable helped in forecasting the peaks of business cycles.

In macroeconomic and financial time series, the probability law underlying the whole sequence of 0's and 1's is often not fixed, but characterized by long repetitive cycles with different periodicities. Exogenous shocks and sudden policy changes can lead to a sudden or gradual change in regime. If the model ignores this possibility, chances are high that the resulting forecasts will be off the mark. Hamilton (1989, 1990) developed a flexible Markov switching model to analyze a time series subject to changes in regime, where an underlying unobserved binary state variable st governs the behavior of the observed time series Yt. The change of regime in Yt is simply due to the change of st from one state to the other. It is called a Markov regime-switching model because the probability law of st is hypothesized to be a discrete time two-state Markov chain. The advantage of this model is that it does not require prior knowledge of regime separation at each time. Instead, such information can be inferred from the observed data Yt. For this reason, one can take advantage of this model to get the predicted probability of a binary state even if it cannot be observed directly. For a comprehensive survey of this model, see Hamilton (1993, 1994). Lahiri and Wang (1994) utilized this model for estimating recession probabilities using the index of leading indicators (LEI), circumventing the use of ad hoc filter rules such as three consecutive declines in LEI as the recession predictor.


Unlike benchmark probit and logit models, a number of parametric binary response models may be derived from other target objects. The autoregressive conditional hazard (ACH) model in Hamilton and Jordà (2002) serves as a good example. The original target to be predicted is the length of time between events, such as the duration between two successive changes of the federal funds rate in the United States. For this purpose, Engle (2000) and Engle and Russell (1997, 1998) developed an autoregressive conditional duration (ACD) model where the conditional expectation of the present duration was specified to be a linear function of past observed durations and their conditional expectations. Hamilton and Jordà (2002) considered the hazard rate defined as the conditional probability of a change in the federal funds rate, given the latest information Ω. The ACH model is implied by the ACD model since the expected duration between two successive changes is the inverse of the hazard rate. They also generalized this simple specification by adding a vector of exogenous variables to represent new information relevant for predicting the probability of the next target change. The discreteness of observed target rate changes along with the potential dynamic structure are dealt with simultaneously in this framework. See Grammig and Kehrle (2008), Scotti (2011) and Kauppi (2012) for further applications and extensions.
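As a stylized illustration of the duration–hazard identity behind the ACH model, the sketch below runs an ACD(1,1)-type recursion in which the conditional expected duration is a linear function of the last observed duration and the last conditional expectation, as described above. All parameter values and durations are invented for illustration; this is not the estimated specification of any of the cited papers:

```python
import numpy as np

# Stylized ACD(1,1)-type recursion: psi_i = omega + a*d_{i-1} + b*psi_{i-1}.
# Parameter values are illustrative; a + b < 1 keeps the recursion stable.
omega, a, b = 0.2, 0.1, 0.8
durations = np.array([3.0, 1.0, 4.0, 2.0, 5.0])   # made-up event durations

psi = np.empty(len(durations))
psi[0] = omega / (1.0 - a - b)          # unconditional mean as start value
for i in range(1, len(durations)):
    psi[i] = omega + a * durations[i - 1] + b * psi[i - 1]

# ACH insight: the hazard rate -- the conditional probability that the
# event (e.g., a funds-rate change) occurs -- is the inverse of the
# conditional expected duration.
hazard = 1.0 / psi
print(psi)
print(hazard)
```

The binary-event probability is thus obtained from a duration model rather than from a probit/logit link, which is the essential point of the ACH construction.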

Instead of predicting a single binary event, it is often useful to forecast multiple binary responses jointly. For instance, we may like to predict the direction-of-change in several financial markets at a future date given current information. A special issue arises in this context as these multiple binary dependent variables may be intercorrelated, even after controlling for all independent variables. One way to model this contemporaneous correlation is based on copulas, which decompose the joint modeling approach into two separate steps. The power of a copula is that for multivariate distributions, the univariate marginals and the dependence structure can be isolated, and all dependence information is contained in the copula. While modeling the marginal, one can proceed as if the current binary event is the only concern, which means that all previously discussed methodologies including dynamic models can be directly applied. After this step, we may consider modeling the dependence structure by using a copula.1 Patton (2006) and Scotti (2011) used this approach in forecasting. Anatolyev (2009) suggested a more interpretable measure, called dependence ratios, for the purpose of directional forecasts (DF) in a number of financial markets. Both marginal Bernoulli distributions and dependence ratios are parameterized as functions of the direction of past changes. By exploiting the information contained in this contemporaneous dependence structure, it is expected that this multivariate model will produce higher quality out-of-sample DF than its univariate counterparts.

Cramer (1999) considered the predictive performance of the logit model in unbalanced samples in which one event is more prevalent than the other. Denote the in-sample

1 In the binary case, the copula is characterized by a few parameters and thus is simple to model; see Tajar et al. (2001).


estimated probabilities of Yt = 1 and Yt = 0 by P̂t and 1 − P̂t, respectively. By the property of logit models, the sample average of P̂t always equals the in-sample proportion of Yt = 1, which is denoted by α. Cramer proved that the average of P̂t over the subsample of Yt = 1 cannot be less than the average of 1 − P̂t over the subsample of Yt = 0 if α ≥ 0.5. Thus, in unbalanced samples, the average predicted probability of Yt = 1 when Yt = 1 is greater than or equal to the average predicted probability of Yt = 0 when Yt = 0. As a result, Cramer pointed out that estimated probabilities are a poor measure of in-sample predictive performance. Using estimated probabilities leads to the absurd conclusion that success is predicted more accurately than failure even though the two outcomes are complementary.
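Cramer's observations are easy to reproduce by simulation. The sketch below fits a logit model (with an intercept) to an unbalanced sample: the intercept score equation forces the mean fitted probability to equal α, and the subsample averages behave as described. The data-generating values are illustrative:

```python
import numpy as np

def fit_logit(X, y, iters=50):
    """Minimal Newton-Raphson logit fit (illustrative, no safeguards)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)
        b = b + np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return b

rng = np.random.default_rng(2)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Unbalanced design: roughly 80% ones.
p_true = 1.0 / (1.0 + np.exp(-(1.5 + 0.5 * X[:, 1])))
y = (rng.uniform(size=n) < p_true).astype(float)

b = fit_logit(X, y)
P = 1.0 / (1.0 + np.exp(-X @ b))

alpha = y.mean()                        # in-sample proportion of ones
print(alpha, P.mean())                  # equal, by the logit score equations
print(P[y == 1].mean(), (1 - P)[y == 0].mean())   # former >= latter here
```

With α well above 0.5, the average fitted probability of the prevalent outcome exceeds that of the rare one, exactly the distortion Cramer warned against.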

King and Zeng (2001) investigated the use of a logit model in situations where the event of interest is rare. With the typical sample proportion of the event less than 5%, they showed that the logit model performs well asymptotically provided it is correctly specified. However, in small samples, the logit estimator is biased. In these cases, efficient competing estimators with smaller mean squared errors do exist. This point has been noticed by statisticians but has not attracted much attention in the applied literature; see Bull et al. (1997).

The estimated asymptotic covariance matrix of the logit estimator is the inverse of the estimated information matrix, that is,

V(β̂) = [ Σ_{t=1}^{T} P̂t(1 − P̂t) x′t xt ]^{−1}, (8)

where β̂ is the logit ML estimator, and P̂t is the fitted conditional probability for observation t, which is 1/(1 + e^{−xtβ̂}). King and Zeng (2001) pointed out that in logit models, P̂t for the subsample for which the rare event occurred would usually be large and close to 0.5. This is because probabilities reported in studies of rare events are generally very small compared to those in balanced samples. Consequently, the contribution of this value to the information matrix would also be relatively large. This argument implies that for rare event data, observations with Y = 1 have more information content than those with Y = 0. In this situation, random samples that are often used in microeconometrics no longer provide efficient estimates. Drawing more observations from Y = 1, relative to what can be obtained in a random sampling scheme, could effectively yield variance reduction. This is called choice-based or, more generally, endogenously stratified sampling, in which a random sample of pre-assigned size is drawn from each stratum based on the values of Y. This non-random design deliberately oversamples from the subpopulation (that is, Y = 1) that leads to variance reduction. King and Zeng (2001) suggested a sequential procedure to determine the sample size for Y = 0 based on the estimation accuracy of each previously selected sample.
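The role of P̂t(1 − P̂t) in (8) is easy to see numerically: the term is maximized at P̂t = 0.5 and vanishes at the extremes, so in rare-event data the few Y = 1 observations, whose fitted probabilities lie relatively closer to 0.5, dominate the estimated information matrix:

```python
import numpy as np

# The information matrix in (8) weights each observation by Pt(1 - Pt),
# which peaks at Pt = 0.5 and vanishes as Pt approaches 0 or 1.
p = np.array([0.01, 0.05, 0.20, 0.50, 0.80, 0.99])
print(p * (1 - p))
```

Observations with fitted probabilities near zero, typical of the Y = 0 stratum in rare-event data, contribute almost nothing, which is the motivation for oversampling the ones.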


The statistical procedures valid for random samples need to be adjusted as well in order to accommodate this choice-based sampling scheme. Maddala and Lahiri (2009) included some preliminary discussions on this issue. Manski and Lerman (1977) proposed two modifications of the usual maximum likelihood estimation. The first one involves computing a logistic estimate and correcting it according to prior information about the fraction of ones in the population, say τ, and the observed fraction of ones in the sample, say Ȳ. For the logit model, the estimator of the slope coefficient β1 is consistent in both sampling designs. The estimator of the intercept β0 in the choice-based sample should be corrected as:

β̂0 − ln[ ((1 − τ)/τ) (Ȳ/(1 − Ȳ)) ], (9)

where β̂0 is the ML estimate of β0. For the random sample, τ = Ȳ, and thus there is no need to adjust β̂0. However, in a choice-based sample with more observations on 1's, we must have τ < Ȳ, and the corrected estimate is less than β̂0 accordingly. The prior correction is easy to implement and only requires knowledge of τ, which is often available from census data. However, in the case of a misspecified parametric model, prior correction may not work. Given the prevalence of misspecification in economic applications, more robust correction procedures are called for. Another limitation of this prior correction procedure is that it may not be applicable for other parametric specifications, such as the probit model, for which the inconsistency of the ML estimator may take a more complex form (unlike in the logit case).
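The prior correction (9) is a one-line adjustment. A minimal sketch (the function name and all numbers are ours, for illustration only):

```python
import math

def prior_corrected_intercept(b0_hat, tau, ybar):
    """Intercept correction (9) for a logit fit on a choice-based sample.
    tau:  population fraction of ones (e.g., from census data)
    ybar: observed fraction of ones in the sample"""
    return b0_hat - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Random sample: tau == ybar, so the correction vanishes.
print(prior_corrected_intercept(-1.0, 0.05, 0.05))     # unchanged: -1.0

# Choice-based sample oversampling ones: ybar = 0.5 but tau = 0.05,
# so the corrected intercept is pulled down, as the text describes.
print(prior_corrected_intercept(-1.0, 0.05, 0.5))
```

With τ = 0.05 and Ȳ = 0.5 the correction subtracts ln(19) ≈ 2.94 from the estimated intercept, illustrating how strongly oversampling the rare event inflates the uncorrected intercept.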

Manski and Lerman's (1977) second approach – the weighted exogenous sampling maximum likelihood estimator – is robust even when the functional form of the logit model is incorrect; see Xie and Manski (1989). Instead of maximizing the logarithm of the likelihood function of the usual form, it maximizes the following weighted version:

lw(β|{Yt, Xt}) ≡ − Σ_{t=1}^{T} wt ln(1 + e^{(1−2Yt)Xtβ}), (10)

The weight function wt is w1Yt + w0(1 − Yt), where w1 = τ/Ȳ and w0 = (1 − τ)/(1 − Ȳ). As noted by Scott and Wild (1986) and Amemiya and Vuong (1987), in the case of correct specification, the weighting approach is asymptotically less efficient than prior correction, but the difference is not very large. However, if model misspecification is suspected, weighting is a robust alternative. Unlike prior correction, the weighted estimator can be applied equally well to other parametric specifications. The only knowledge required for its implementation is τ, the population probability of the rare event. Manski and Lerman (1977) proved that the weighted estimator for any correctly specified model is consistent given the true τ. However, this estimator may not be asymptotically efficient. The intuition behind the lack of efficiency is that, unlike in a random sample, the knowledge of τ must contain additional restrictions on the unknown parameters β in a choice-based sample. Failure to exploit this additional information makes the resulting estimator


inefficient. Imbens (1992) and Imbens and Lancaster (1996) examined how to efficientlyestimate β in an endogenously stratified sample.Their estimator based on the generalized-method-of-moment (GMM) reformulation does not require prior knowledge of τ andthe marginal distribution of regressors. Instead,τ can be treated as an additional parameterthat is estimated by GMM jointly with β. They have shown that this estimator achievesthe semi-parametric efficiency bound given all available information. For an excellentsurvey on estimation in endogenously stratified samples, see Cosslett (1993).
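The weighted log-likelihood in (10) is straightforward to maximize numerically. Below is a minimal sketch in Python; the synthetic data and function name are illustrative, and scipy's general-purpose BFGS optimizer stands in for a dedicated routine:

```python
import numpy as np
from scipy.optimize import minimize

def weighted_logit(y, X, tau):
    """Weighted exogenous sampling ML estimator (Manski and Lerman, 1977).

    y   : (T,) array of 0/1 outcomes from a choice-based sample
    X   : (T, k) regressor matrix (first column is a constant)
    tau : population probability of Y = 1
    """
    ybar = y.mean()
    w = np.where(y == 1, tau / ybar, (1 - tau) / (1 - ybar))  # w1*Y + w0*(1 - Y)

    def neg_loglik(beta):
        # minus the weighted log-likelihood in Eq. (10); logaddexp avoids overflow
        return np.sum(w * np.logaddexp(0.0, (1 - 2 * y) * (X @ beta)))

    return minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS").x

# Illustrative use on synthetic logit data with beta = (-1, 1)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ np.array([-1.0, 1.0])))).astype(float)
beta_hat = weighted_logit(y, X, tau=y.mean())  # tau = ybar: reduces to plain ML
```

When τ equals the sample mean Ȳ, all weights are one and the estimator collapses to ordinary ML, which makes the weighting logic easy to verify.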

One interesting point in the context of choice-based sampling is that the logit model can sometimes be consistently estimated when the original data come exclusively from one of the strata. This problem was investigated by Steinberg and Cardell (1992), who showed how to pool an appropriate supplementary sample, which can often be found in general-purpose public use surveys such as the U.S. Census, with the original data to estimate the parameters of interest. The supplementary sample can be drawn from the marginal distribution of the covariates without having any information on Y. This estimator is algebraically similar to the above weighted ML estimator and hence can be implemented in conventional statistical packages. Only the logit model is analyzed in this paper, due to the existence of an analytic solution. In principle, the analysis can be generalized to other parametric binary response models.

In finite samples, however, all of the above statistical procedures are subject to bias even when the model is correctly specified. King and Zeng (2001) pointed out that such bias may be amplified in the case of rare events. They proposed two methods to correct for the finite sample bias in the estimation of the parameters and the probabilities. For the parameters, they derived an approximate expression for the bias in the usual ML estimator, viz., (X′WX)⁻¹(X′Wξ), where ξt = 0.5Qtt[(1 + w1)Pt − w1], Qtt is the diagonal element of Q = X(X′WX)⁻¹X′, and W = diag{Pt(1 − Pt)wt}. This bias term is easy to estimate, since it is just the weighted least squares estimate from regressing ξ on X with W as the weight. The bias-corrected estimator of β is β̃ = β̂ − (X′WX)⁻¹(X′Wξ), with the approximate variance V(β̃) = (T/(T + k))²V(β̂), where k is the dimension of β. Observe that T/(T + k) < 1 for all sample sizes. The bias-corrected estimator is not only unbiased but also has smaller variance, and thus has a smaller mean squared error than the usual ML estimator in finite samples. When it comes to the predicted probabilities, a possible solution is to replace the unknown parameters β in 1/(1 + e^(−xtβ)) with the bias-corrected estimator β̃. The problem is that a non-linear function of β̃ may not be unbiased. King and Zeng (2001) developed the approximate Bayesian estimator based on the approximation of the following estimator, after averaging out the uncertainty due to the estimation of β:

$$P(Y = 1 \mid X = x_o) = \int \frac{1}{1 + e^{-x_o\beta^*}}\, P(\beta^*)\, d\beta^*. \tag{11}$$

They stated that ignoring the estimation uncertainty of β would lead to underestimation of the true probability in a rare event situation. From a Bayesian viewpoint, P(β*), which summarizes such uncertainty, is interpreted as the posterior density of β, that is, N(β̃, V(β̃)). Computation of this approximate Bayesian estimator and its associated standard deviation can be carried out in a straightforward way. The pitfall of this estimator is that it is not unbiased in general, even though it often has a small mean squared error in finite samples. King and Zeng (2001) therefore proposed another competing estimator, viz., "the approximate unbiased estimator," which, as its name suggests, is unbiased.
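The bias correction above amounts to a single weighted least squares pass, so it is cheap to compute once the ML fit is in hand. A sketch, assuming the caller supplies the fitted probabilities and the weights defined earlier (the function name and the illustrative inputs are ours):

```python
import numpy as np

def rare_event_bias_correction(beta_hat, X, P, w, w1):
    """Approximate finite-sample bias correction for logit ML (King and Zeng, 2001).

    beta_hat : (k,) ML estimates
    X        : (T, k) regressor matrix
    P        : (T,) fitted probabilities 1 / (1 + exp(-X @ beta_hat))
    w        : (T,) weights w1*Y + w0*(1 - Y); all ones for a random sample
    w1       : tau / ybar (equal to 1 for a random sample)
    """
    W = P * (1 - P) * w                                  # diagonal of W
    XWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
    Q_tt = np.einsum("ti,ij,tj->t", X, XWX_inv, X)       # diag of X(X'WX)^-1 X'
    xi = 0.5 * Q_tt * ((1 + w1) * P - w1)
    bias = XWX_inv @ (X.T @ (W * xi))                    # WLS fit of xi on X
    T, k = X.shape
    return beta_hat - bias, (T / (T + k)) ** 2           # beta_tilde, variance scale

# Illustrative use with a random sample (all weights equal to one)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta_hat = np.array([-2.0, 1.0])                         # pretend ML estimates
P = 1 / (1 + np.exp(-X @ beta_hat))
beta_tilde, vscale = rare_event_bias_correction(beta_hat, X, P, np.ones(100), 1.0)
```

Note that the variance scale (T/(T + k))² is strictly below one, which is the source of the mean-squared-error gain discussed above.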

2.1.2. Non-Parametric Approach

As mentioned at the beginning of Section 2.1, the non-parametric approach is the most robust way to model the conditional probability, in that both the link and the index can be rather flexible. Non-parametric regression often deals with continuous responses with well-behaved density functions, but the theory does not explicitly rule out other possibilities like a binary dependent variable. All extant non-parametric regression methods, after minor modifications, can be used to model binary dependent variables as well.

The most well-known non-parametric regression estimator of the conditional expectation is the so-called local polynomial estimator. For the univariate case, the pth local polynomial estimator solves the following weighted least squares problem given a sample {Yt, Xt} with t = 1, . . . , T:

$$\min_{b_0, b_1, \ldots, b_p} \sum_{t=1}^{T} \left(Y_t - b_0 - b_1(X_t - x) - \cdots - b_p(X_t - x)^p\right)^2 K\left(\frac{x - X_t}{h_T}\right), \tag{12}$$

where hT is the selected bandwidth, possibly depending on the sample, and K(·) is the kernel function. When p = 0, it reduces to the local constant or Nadaraya–Watson estimator; when p = 1, it is the local linear estimator. In either case, the conditional probability P(Y = 1|X = x) can be estimated using b̂0, the solution to (12). However, this fitted probability may exceed the feasible range [0, 1] for some values of x, since there is no such implicit constraint underlying this model. An immediate solution in practice would be to cap the estimates at 0 and 1 when the fitted values fall beyond this range. The problem is that there is no strong theoretical support for doing so, and the modified fitted probability is likely to assume these boundary values for a large number of values of x, so the estimated marginal effect at these values must be zero as well. As with probit or logit transformations in the parametric model, we can make use of the same technique here. The only difference is that we fit the model locally by kernel smoothing. Specifically, let g(x, βx) be such a transformation function with unknown coefficient vector βx. The conditional probability is modeled as:

P(Y = 1|X = x) = g(x, βx). (13)

In contrast to a parametric model, the coefficient βx is allowed to vary with the evaluation point x. In the present context, the local logit is a sensible choice, in which g(x, βx) = 1/(1 + e^(−xβx)). Generally speaking, any distribution function can be taken as g. Currently, there are three approaches to estimating βx, and thus P(Y = 1|X = x) in (13); see Gozalo and Linton (2000), Tibshirani and Hastie (1987), and Carroll et al. (1998).
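For the p = 0 case, the Nadaraya–Watson estimate of P(Y = 1|X = x) with a Gaussian kernel takes only a few lines; the clipping step implements the ad hoc capping discussed above. The bandwidth and synthetic data are illustrative:

```python
import numpy as np

def nw_probability(x_eval, X, y, h):
    """Nadaraya-Watson (p = 0) estimate of P(Y = 1 | X = x) with a Gaussian
    kernel, capped at the feasible range [0, 1]."""
    K = np.exp(-0.5 * ((x_eval - X) / h) ** 2)   # kernel weights
    return float(np.clip(np.sum(K * y) / np.sum(K), 0.0, 1.0))

# Illustrative data: P(Y = 1 | x) = 1/(1 + exp(2x)), decreasing in x
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 3.0, size=400)
y = (rng.random(400) < 1 / (1 + np.exp(2 * X))).astype(float)
p_low, p_high = nw_probability(-0.5, X, y, 0.3), nw_probability(2.5, X, y, 0.3)
```

Because the estimator is a weighted average of 0/1 outcomes, the raw fit already lies in [0, 1] here; the clip matters for local linear and higher-order fits.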

Another way to get the fitted probabilities within [0, 1] non-parametrically is simply by noting that

$$p(y \mid x) = \frac{p(y, x)}{p(x)}, \tag{14}$$

where p(y|x), p(y, x), and p(x) are the conditional, joint, and marginal densities, respectively. A non-parametric conditional density estimator is obtained by replacing p(y, x) and p(x) in (14) by their kernel estimates. When Y is a binary variable, p(1|x) = P(Y = 1|X = x). A technical difficulty is that ordinary kernel smoothing implicitly assumes that the underlying density function is continuous, which is not true for a binary variable. Li and Racine (2006) provide a comprehensive treatment of several ways to cope with this problem based on generalized kernels.

A number of papers have compared non-parametric binary models with the familiar parametric benchmarks. Frölich (2006) applied local logit regression to analyze the dependence of Portuguese women's labor supply on family size, especially on the number of children. For the parametric logit estimator, the estimated employment effects of children never changed sign in the population. However, the non-parametric estimator was able to detect a larger heterogeneity of marginal effects, in that the estimated effects were negative for some women but positive for others. Bontemps et al. (2009) compared non-parametric conditional density estimation with a conventional parametric probit model in terms of out-of-sample binary forecast performance using bootstrap resampling. They found that the non-parametric method was significantly better behaved according to the "revealed performance" test proposed by Racine and Parmeter (2009). Harding and Pagan (2011) considered a non-parametric regression model using constructed binary time series. They argued that, due to the complex scheme of transformation, the true data generating process governing an observed binary sequence is often not described well by a parametric specification, say, the static or dynamic probit model. Their dynamic non-parametric model was then applied to U.S. recession data using the lagged yield spread to predict recessions. They compared the fitted probabilities from the probit model with those based on the Nadaraya–Watson estimator, and concluded that the parametric probit specification could not characterize the true relationship between recessions and the yield spread over some range. The gap between the two specifications was statistically significant and economically substantial.

2.1.3. Semi-Parametric Approach

The semi-parametric model consists of both parametric and non-parametric components. Compared with the two extremes, a semi-parametric model has its own strengths. It is not only more robust than a parametric one, because of the flexibility of its non-parametric part, but also reduces the risk of the "curse of dimensionality" and data "sparseness" associated with its non-parametric counterpart. Various semi-parametric models for binary responses have emerged in the last few decades. We will briefly review some of the important developments in this area.

Recall that the link function is assumed to be known in the parametric model. Suppose this assumption is relaxed while keeping the index unchanged. We then have the following single-index model:

E(Y |X) = P(Y = 1|X) = F(G(X)). (15)

Generally speaking, the index G(X) does not have to be linear, as in the parametric model. We only consider the case where G(X) = Xβ for the sake of simplicity. The only difference from the parametric model is that the functional form of F(·) is unknown here and thus needs to be estimated. By allowing for a flexible link function, greater robustness is achieved, provided the index has been correctly specified. Horowitz (2009) discussed identification issues for various sub-cases of (15). Generally speaking, the simplest identified specification can be used without worrying about other possibilities, provided that the alternative models are observationally equivalent from the standpoint of forecasting.

For the single-index model, once a consistent estimator of β is available, F can be estimated using a non-parametric regression with β replaced by its estimator. There are three suggested estimators for β. Horowitz (2009) categorized them according to whether a non-linear optimization problem has to be solved. Two estimators obtained as the solution of a non-linear optimization problem are the semi-parametric weighted non-linear least squares estimator due to Ichimura (1993), and the semi-parametric maximum likelihood estimator proposed by Klein and Spady (1993). A direct estimator not involving optimization is the average derivative estimator; see Stoker (1986, 1991a,b), Härdle and Stoker (1989), Powell et al. (1989), and Hristache et al. (2001).
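As an illustration of the optimization-based route, a bare-bones semi-parametric least squares estimator in the spirit of Ichimura (1993) minimizes the sum of squared residuals from a leave-one-out kernel fit of Y on the index; the first coefficient is normalized to one, a scale normalization needed because F is left unspecified. The bandwidth, data, and names below are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def loo_nw(v, y, h):
    """Leave-one-out Nadaraya-Watson fit of y on the scalar index v."""
    K = np.exp(-0.5 * ((v[:, None] - v[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)              # drop own observation
    return (K @ y) / K.sum(axis=1)

def single_index_b2(X, y, h=0.3):
    """Semi-parametric least squares for a two-regressor single-index model,
    with the coefficient on the first regressor normalized to one."""
    def sse(b2):
        v = X[:, 0] + b2 * X[:, 1]
        return np.sum((y - loo_nw(v, y, h)) ** 2)
    return minimize_scalar(sse, bounds=(-5.0, 5.0), method="bounded").x

# Illustrative data: logistic link, true index X1 + 0.5 * X2
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (rng.random(300) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(float)
b2_hat = single_index_b2(X, y)
```

Once b2_hat is in hand, F itself is recovered by a final kernel regression of Y on the fitted index, exactly as described above.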

Another semi-parametric model suitable for binary responses is the non-parametric additive model, where the link is given but the index contains non-parametric additive elements:

P(Y = 1|X = x) = F(μ + m1(x1) + . . . + mk(xk)). (16)

Here, X is a k-dimensional random vector and the function F(·) is known prior to estimation, although the univariate function mj(·) for each j needs to be estimated. The model is semi-parametric in nature, as it contains both the parametric component F(·), along with the additive structure, and the non-parametric components mj(·). Note that this non-parametric additive model does not overlap with the single-index model, in the sense that there is at least one single-index model that cannot be rewritten in the form of a non-parametric additive model, and vice versa. Like the single-index model, the non-parametric additive model relaxes restrictions on model specification to some extent, thereby reducing the risk of misspecification as compared with the parametric approach. Furthermore, it overcomes the "curse of dimensionality" associated with a typical multivariate non-parametric regression by assuming each additive component to be a univariate function. Often, a cumulative distribution function with range between 0 and 1 is a sensible choice for F(·). To ensure consistency of the estimation methodology, F(·) has to be correctly specified. Horowitz and Mammen (2004) described estimation of this additive model. The basic idea is to estimate each mj(·) by series approximation. A natural generalization is to allow for an unknown F(·). This more general specification nests (15) and (16) as two special cases. Horowitz and Mammen (2007) developed a penalized-least-squares estimator for this model, which does not suffer from the "curse of dimensionality" and achieves the optimal one-dimensional non-parametric rate of convergence.

2.1.4. Bayesian Approach

In contrast to the frequentist approach, the Bayesian approach treats the probability of a binary event as a random variable instead of a fixed value. Combining prior information with the likelihood using Bayes' rule, it obtains the posterior distribution of the parameters of interest. By the property of a binary variable, each 0/1-valued Yt must be distributed as Bernoulli with probability p. The likelihood function for a random sample would take the following form:

$$\frac{T!}{T_1!\,T_0!}\, p^{T_1}(1-p)^{T_0}, \tag{17}$$

where T1 and T0 are the total numbers of observations with Yt = 1 and Yt = 0, respectively, and T = T1 + T0. A conjugate prior for the parameter p is Beta(α, β), where both α and β are non-negative real numbers. According to Bayes' rule, the posterior is Beta(α + T1, β + T0) with mean:

$$E(p \mid Y) = \lambda p_o + (1 - \lambda)\frac{T_1}{T}, \tag{18}$$

where po = α/(α + β) is the prior mean, T1/T is the sample mean, and λ = (α + β)/(α + β + T) is the weight assigned to the prior mean. If α = β = 1 in the above Beta-Binomial model, that is, when a non-informative prior is used, the posterior distribution is dominated by the likelihood, and (18) gets close to the sample mean provided T is sufficiently large. In other words, the Bayesian approach nests the frequentist approach as a special case. However, this flexibility comes at the cost of robustness, as the posterior relies on the prior, which, to some extent, is thought of as arbitrary and subject to choice by the analyst. This deficiency can be alleviated by checking the sensitivity of the posterior to multiple priors, or by using empirical Bayes methods. For the former, if different priors produce similar posteriors, the result obtained under a particular prior is robust. In the latter approach, the prior is determined by other data sets, such as those examined in previous studies. For instance, we can match the prior mean and variance with their sample counterparts to determine the two parameters α and β in the above Beta-Binomial model. This is a natural way to update the information from previous studies. Once the posterior density is known, the predicted probability can be obtained under a suitable loss function. For example, the posterior mean is the optimal choice under quadratic loss.
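The posterior mean in (18) is simple enough to compute by hand; a small sketch makes the prior–sample compromise explicit (the function name is ours):

```python
def beta_binomial_posterior_mean(alpha, beta, t1, t0):
    """Posterior mean of p under a Beta(alpha, beta) prior with T1 ones and
    T0 zeros, written in the shrinkage form of Eq. (18)."""
    T = t1 + t0
    lam = (alpha + beta) / (alpha + beta + T)   # weight on the prior mean
    prior_mean = alpha / (alpha + beta)
    return lam * prior_mean + (1 - lam) * (t1 / T)

# Uniform prior (alpha = beta = 1), 3 recession months out of 20:
p_hat = beta_binomial_posterior_mean(1, 1, 3, 17)   # (1 + 3)/(2 + 20) = 2/11
```

As T grows with α and β fixed, λ shrinks toward zero and the posterior mean converges to the sample frequency, illustrating how the likelihood dominates a fixed prior.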

Up to this point, only the information contained in the prior distribution and past Y is utilized for generating probability forecasts. Usually in practice, a set of covariates X is available for use. In line with our general formulation at the beginning of this section, only the prior distribution and past Y are incorporated into the information set Ω in the Beta-Binomial model. Let us now consider how to incorporate X into Ω within the framework of (2). There are two approaches to doing this. The first one is conceptually simple, in that only Bayes' rule is involved. The prior density of the parameters, π(β), multiplied by the conditional sampling density of Y given X, generates the posterior in the following way:

$$p(\beta \mid Y, X) = \pi(\beta)\prod_{t=1}^{T} F(G_0(X_t, \beta))^{Y_t}\left(1 - F(G_0(X_t, \beta))\right)^{1-Y_t}\Big/\, C, \tag{19}$$

where C is a constant which equals

$$C = \int \pi(\beta)\prod_{t=1}^{T} F(G_0(X_t, \beta))^{Y_t}\left(1 - F(G_0(X_t, \beta))\right)^{1-Y_t}\, d\beta. \tag{20}$$

The Metropolis–Hastings algorithm can draw samples from this distribution directly. Alternatively, we can use Monte Carlo integration to approximate the constant C. Albert and Chib (1993) developed the second method using the idea of data augmentation. The parametric model F(G0(Xt, β)) is seen to have an underlying regression structure on the latent continuous data; see (2). Without loss of generality, we only consider the case where G0(Xt, β) = Xtβ, and ε has the standard normal distribution, that is, F(·) = Φ(·), where Φ(·) is the standard normal distribution function with φ(·) as its density.

If the latent data {Y*t} were known, then the posterior distribution of the parameters could be computed using standard results for normal linear models; see Koop (2003) for more details. Values of the latent variable are drawn from the following truncated normal distributions:

$$p(Y_t^* \mid Y_t, X_t, \beta) \propto \begin{cases} \phi(Y_t^* - X_t\beta)\, I(Y_t^* > 0) & \text{if } Y_t = 1; \\ \phi(Y_t^* - X_t\beta)\, I(Y_t^* \le 0) & \text{otherwise}, \end{cases} \tag{21}$$

where ∝ means "is proportional to". Draws from the posterior distribution are then used to sample new latent data, and the process is iterated with Gibbs sampling, given all conditional densities. The distribution of the predicted probability can be obtained as follows. Given an evaluation point x, the conditional probability is Φ(xβ), which is random in the Bayesian framework. When a sufficiently large sample is generated from the posterior p(β|Y, X), the distribution of Φ(xβ) can be approximated arbitrarily well by evaluating Φ(xβ) at each sample point. As before, when only a point estimate is desired, we can derive it given a specified loss function.
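A compact version of the Albert–Chib sampler under a flat prior on β illustrates the two Gibbs steps: drawing the latent Y*t from the truncated normals in (21), then drawing β from its normal full conditional. This is a sketch rather than production code; names, settings, and the synthetic data are illustrative:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def albert_chib_gibbs(y, X, n_draws=1000, seed=0):
    """Gibbs sampler for the probit model via data augmentation (Albert and
    Chib, 1993), under a flat prior on beta and unit error variance."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(k)
    draws = np.empty((n_draws, k))
    for i in range(n_draws):
        mu = X @ beta
        # Step 1: Y*_t ~ N(mu_t, 1) truncated to (0, inf) if y_t = 1, else (-inf, 0]
        lo = np.where(y == 1, -mu, -np.inf)   # standardized lower bounds
        hi = np.where(y == 1, np.inf, -mu)    # standardized upper bounds
        ystar = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 2: beta | Y* ~ N((X'X)^-1 X'Y*, (X'X)^-1)
        beta = XtX_inv @ (X.T @ ystar) + chol @ rng.standard_normal(k)
        draws[i] = beta
    return draws

# Illustrative use on synthetic probit data with beta = (-0.5, 1)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(150), rng.normal(size=150)])
y = (rng.random(150) < norm.cdf(X @ np.array([-0.5, 1.0]))).astype(float)
draws = albert_chib_gibbs(y, X, n_draws=500, seed=2)
p_at_x = norm.cdf(np.array([1.0, 0.5]) @ draws[250:].T)  # posterior draws of Φ(xβ)
```

Each retained draw of β yields a draw of Φ(xβ), so the posterior distribution of the predicted probability comes essentially for free, as described above.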

Albert and Chib (1993) also pointed out a number of advantages of Bayesian estimation over a frequentist approach. First, frequentist ML relies on asymptotic theory, and its estimator may not perform satisfactorily in finite samples. Indeed, Griffiths et al. (1987) found that the ML estimator could have significant bias in small samples, while the Bayesian estimator permits exact inference even in these cases. Second, the Bayesian approach based on the latent variable formulation is computationally attractive. Third, Gibbs sampling draws samples mainly from several standard distributions, and is therefore simple to implement. Finally, we can easily extend this model to deal with sampling densities for the latent variables other than the present multivariate normal density. As a cautionary note, some diagnostic methods have to be used to ensure that the generated Markov chain has reached its equilibrium distribution. For applications of this general approach in other binary response models, see Koenker and Yoon (2009), Lieli and Springborn (forthcoming), and Scotti (2011).

2.1.5. Empirical Illustration

In this part, we present an empirical example that illustrates the application of the methodologies covered so far. The task is to generate the probabilities of future U.S. economic recessions. The monthly data we use consist of 624 observations on the difference between 10-year and 3-month Treasury rates, and NBER-dated recession indicators from January 1960 to December 2011.2 The binary target event is the recession indicator, which is one if a recession occurred, and zero otherwise. The sample proportion of months that were in recession is about 14.9%, indicating that it is a relatively uncommon event. The independent variables are the yield spread, i.e., the difference between 10-year and 3-month Treasury rates, and the lagged recession indicator. Estrella and Mishkin (1998) found that the best fit occurred when the yield spread is lagged 12 months. We maintain this assumption here. Figure 19.1 shows the frequency distribution of the yield spread in our sample period. The three tallest bars show that the value of the spread was between 0 and 1.5 percentage points in about 42.6% of the cases. The distribution is heavily skewed toward positive values. All our fitted models with the yield spread as the explanatory variable reveal a very strong serial correlation in the residuals. As a result, a dynamic specification involving the one-month-lagged indicator as an additional regressor is used here. We implement parametric, semi-parametric, and non-parametric approaches on this dataset, and summarize the fitted curves in a single graph. For the Bayesian approach, we use the R code provided by Albert (2009) to simulate the posterior distributions under different priors.

2 Downloaded from http://www.financeecon.com/ycestimates1.html.


Figure 19.1 Frequency distribution of the yield spread.

Figure 19.2 presents three fitted curves generated using a parametric probit model, a semi-parametric single-index model, and the non-parametric conditional density estimator of Section 2.1.2, given the value of the lagged indicator. Both the probit and the single-index models contain the linear index.3 In the top panel of Figure 19.2, which is conditional on being in recession in the last month, we find the estimated conditional probabilities to be very close to each other, except for values of the yield spread larger than 2.5%. Despite the divergence between them at the right end, both are downward-sloping. In contrast, the relationship, as estimated by the non-parametric model, is not monotonic, in that the probability surprisingly rises when the spread increases from −1% to 0. However, this finding is hard to explain given the prototypical negative correlation between them. We ascribe this to the data "sparseness" exhibited in Figure 19.1; namely, the non-parametric estimates at these values are not reliable. In the bottom panel, which is conditional on not being in recession in the last month, there is no substantial difference among the three models, and all of them are decreasing over the entire range. Again, the precision of the non-parametric estimates at both ends is relatively low for the same reason as before. An interesting issue that arises as one compares the two panels is that the estimated probabilities when the lagged recession occurs are uniformly larger than those when it does not. Actually, the probabilities in the bottom panel are nearly zero in magnitude no matter how small the spread is. This could be true if there is a strong serial correlation in recessions identified by the NBER, as shown in our probit model, which has a highly significant coefficient estimate for the lagged indicator.
For this reason, the information contained in the current macroeconomic state, which is related to the occurrence of future recessions, is far more important than that given by the spread. This example, at first sight, seems to be evidence against the predictive power of the yield spread. However,

3 The single-index model is estimated by the Klein–Spady approach with a carefully selected bandwidth; see Section 2.1.3.


Figure 19.2 Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel).

it is not the case, given the fact that the 1-month-lagged recession indicator is unavailable at the date of forecasting. The autocorrelation among recession indicators shrinks toward zero as the forecast horizon increases. The yield spread stands out only in these longer-horizon forecasts, where few competing predictors of good quality exist.

To apply the Bayesian approach, we need some prior information. Suppose the coefficient vector β is assigned a multivariate normal prior with mean βo and covariance matrix Vo. For βo, we assume the prior means of the intercept, the coefficient of the spread, and that of the lagged indicator to be −1, −1, and 1, respectively. As for Vo, three cases are examined: the non-informative prior corresponding to infinitely large Vo, and a variation of Zellner's g informative priors4 with large and small precisions. Figure 19.3 summarizes the simulated posterior means for the conditional probabilities as well as the probit curves from Figure 19.2. For comparison purposes, we also plot a curve replacing the unknown

4 See Albert (2009) for an explanation of the g informative prior.


Figure 19.3 Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel).

β by its prior mean βo. In both panels, the Bayesian fitted curves are sensitive to the prior involved. For the non-informative prior and the informative prior with small precision, these curves are almost identical to the probit curves, reflecting the dominance of the sample information over the priors. The reverse pattern appears in the other two curves. When the prior precision is extremely large, the forecasters' beliefs about the true relationship between the spread and future recessions are so firm that they are unlikely to be affected by the observed sample. That is why the simulated curves under this sharp prior almost overlap with the curves implied by βo alone. To summarize, the Bayesian approach is a compromise between prior and sample information, and the degree of compromise crucially depends on their relative informativeness.

2.1.6. Probability Predictions in Panel Data Models

Panel data consist of repeated observations for a given sample of cross-sectional units, such as individuals, households, companies, and countries. In empirical microeconomics, a typical panel has a small number of observations along the time dimension but a very large number of cross-sectional units. The opposite scenario is generally true in macroeconomics. In this section, we consider a micropanel environment with small or moderate T and large N. Many estimation and inference methods developed for micropanels can be adapted to binary probability prediction. For ease of exposition, only balanced panels with an equal number of repeated observations for each unit will be discussed.

The basic linear static panel data model can be written in the following form:

$$Y_{it} = X_{it}\beta + c_i + \varepsilon_{it}, \quad i = 1, \ldots, N,\; t = 1, \ldots, T, \tag{22}$$

where Yit and Xit are the dependent and k-dimensional independent variables, respectively, for unit i and period t. One of the crucial features that distinguishes panel data models from cross-sectional and univariate time series models is the presence of the unobserved ci, the time-invariant individual effect. In more general unobserved effects models, time effects λt are also included. εit is the idiosyncratic error varying with i and t, and is often assumed to be i.i.d. and independent of the other model components. The benefits of using panel data mainly come from its larger flexibility in specification, as it allows the unobserved effect to be correlated with the regressors. In a cross-sectional context without further information (such as the availability of valid instruments), parameters such as β cannot be identified. Even if ci is uncorrelated with the regressors, the panel data estimator is generally more efficient relative to those obtained in cross-sectional models. Baltagi (forthcoming) covers many aspects of forecasting in panel data models with continuous response variables.

When Yit is binary, the linear panel data model, like the linear probability model, is no longer adequate. Again, we rewrite it in latent variable form. The unobserved latent dependent variable Y*it satisfies:

$$Y_{it}^* = X_{it}\beta + c_i + \varepsilon_{it}, \quad i = 1, \ldots, N,\; t = 1, \ldots, T. \tag{23}$$

Instead of knowing Y*it, only its sign Yit = I(Y*it > 0) is observed. In order to get the conditional probability of Yit = 1, certain distributional assumptions concerning εit and ci have to be made. For example, when εit is i.i.d. with distribution function F(·) and ci has G(·) as its marginal distribution, the conditional probability of Yit = 1 given Xi = (X′i1, X′i2, . . ., X′iT)′ and ci is

$$P(Y_{it} = 1 \mid X_i, c_i) = 1 - F(-X_{it}\beta - c_i). \tag{24}$$

The problem with this conditional probability is that ci is unobserved, so P(Yit = 1|Xi, ci) cannot be estimated directly except for large T. In a micropanel, the solution, without estimating ci, is to compute P(Yit = 1|Xi), that is, to integrate ci out of P(Yit = 1|Xi, ci). If the conditional density of ci given Xi is denoted by g(·|·), then the conditional probability is:

$$P(Y_{it} = 1 \mid X_i) = \int \left(1 - F(-X_{it}\beta - c)\right) g(c \mid X_i)\, dc, \tag{25}$$


which is a function of Xi alone, and thus can be estimated by replacing β with its estimate, provided that the functional forms of F(·) and g(·|·) are known.

In general, the function g(·|·) is unknown. The usual practice is to make some assumptions about it. One such assumption is that ci is independent of Xi, so

$$g(c \mid X_i) = g(c) \equiv \frac{dG(c)}{dc}. \tag{26}$$

This leads to the random effects model. Given this specification, β and the other parameters in g(·) and F(·) can be estimated jointly and efficiently by maximum likelihood. For some parametric specifications of g(·) and F(·), such as normal distributions, identification often requires further restrictions on their parameters; see Lechner et al. (2008). In general, the conditional likelihood function for each unit i is computed as below, by noting that the idiosyncratic error is i.i.d. across t:

$$L_i(Y_i \mid X_i) = \int \prod_{t=1}^{T}\left[1 - F(-X_{it}\beta - c)\right]^{Y_{it}} F(-X_{it}\beta - c)^{1-Y_{it}}\, g(c)\, dc. \tag{27}$$

If both G(·) and F(·) are zero-mean normal distributions with variances σc² and σε², respectively, then σc² + σε² = 1 is often needed to identify all parameters. In general, G(·) or F(·) may be any cumulative distribution function. Multiplying the conditional likelihood functions Li(Yi|Xi) for each i and taking logarithms, we get the conditional log-likelihood function for the whole sample:

$$l(Y \mid X) = \sum_{i=1}^{N} \ln L_i(Y_i \mid X_i). \tag{28}$$

The ML estimate is defined as the global maximizer of l(Y|X) over the parameter space, and the estimated conditional probability is thus

$$\widehat{P}(Y = 1 \mid x) = \int \left(1 - \widehat{F}(-x\hat{\beta} - c)\right) \hat{g}(c)\, dc, \tag{29}$$

where β̂ is the ML estimate of β, and ĝ(·) and F̂(·) are the density of c and the distribution of ε with unknown parameters replaced by their ML estimates. The predicted probability is evaluated at x.
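When both G(·) and F(·) are normal, the integral in (29) is one-dimensional and is routinely evaluated by Gauss–Hermite quadrature. A minimal sketch (the function name and parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def re_probit_probability(x_beta, sigma_c, n_nodes=20):
    """P(Y = 1 | x) = ∫ Φ(xβ + c) g(c) dc for c ~ N(0, sigma_c^2), as in
    Eq. (29), evaluated by Gauss-Hermite quadrature (probabilists' weight)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    c = sigma_c * nodes                          # rescale nodes to N(0, sigma_c^2)
    return np.sum(weights * norm.cdf(x_beta + c)) / np.sqrt(2 * np.pi)

p = re_probit_probability(x_beta=0.0, sigma_c=0.8)  # symmetry gives 0.5
```

For the probit case the quadrature can be checked against the closed form E[Φ(xβ + c)] = Φ(xβ/√(1 + σc²)).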

The above framework can be extended to a general case where the covariance matrix of the errors is not restricted to have the conventional component structure. Let Y*i = (Y*i1, Y*i2, . . ., Y*iT)′ and ui = (ui1, ui2, . . ., uiT)′ be the stacked vectors of Y* and u for unit i. The latent variable linear panel data model can be rewritten in the following compact form:

$$Y_i^* = X_i\beta + u_i. \tag{30}$$


We consider the case where Xi is independent of ui, with the latter having a T-dimensional multivariate joint distribution Fu. Note that when uit = ci + εit for each t, (30) reduces to the random effects model discussed above. Given data (Yi, Xi) for i = 1, . . ., N, the likelihood function for unit i is

$$L_i(Y_i \mid X_i) = \int_{D_i} dF_u, \tag{31}$$

where

$$D_i = \{u \in \mathbb{R}^T : I(X_{it}\beta + u_t > 0) = Y_{it} \text{ for } t = 1, \ldots, T\}. \tag{32}$$

The log-likelihood for the whole sample is thus l(Y|X) = ∑_{i=1}^N ln L_i(Y_i|X_i). Denote the ML estimate by β̂. The predicted probability at point x is then

P̂(Y = 1|x) = P(xβ̂ + u_x > 0|x) = P(u_x > −xβ̂|x) = ∫_{u_x > −xβ̂} dF̂_o, (33)

where F̂_o is the estimated joint distribution function of (u_i, u_x). Here, u_x is the latent error term corresponding to the point x, and (33) is for unit i. In general, it is hard to specify a particular form for F_o without further knowledge of the serial dependence among the u_i. Additional conditions, such as serial independence, are needed to make (33) tractable.

In practice, this general framework is hard to implement due to the presence of the multiple integral in the likelihood function. Numerous methods of overcoming this technical difficulty have been developed in the last few decades. Most of them are based on a stochastic approximation of the multiple integral by simulation; see Lee (1992), Gourieroux and Monfort (1993), and Train (2003) for more details on these simulation-based estimators and their asymptotic properties.
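As a minimal illustration of the simulation idea, the crude frequency simulator below approximates the unit likelihood (31) by the fraction of draws u ~ N(0, Σ) whose implied sign pattern matches the observed Y_i. In practice smooth simulators such as GHK are preferred because the frequency simulator is not differentiable in β; the function name and test values here are our own assumptions:

```python
import numpy as np
from scipy.stats import norm

def simulated_unit_likelihood(y, X, beta, Sigma, R=100_000, seed=0):
    """Crude frequency simulator for L_i = ∫_{D_i} dF_u in (31): draw
    u ~ N(0, Sigma) and record how often I(X_t'β + u_t > 0) = Y_t for all t."""
    rng = np.random.default_rng(seed)
    u = rng.multivariate_normal(np.zeros(len(y)), Sigma, size=R)
    hits = ((X @ beta + u > 0).astype(int) == y).all(axis=1)
    return hits.mean()

# Benchmark with independent errors, where the likelihood factors into Φ terms
y = np.array([1, 0])
X = np.array([[0.5], [1.0]])
beta = np.array([0.8])
exact = norm.cdf(0.4) * (1.0 - norm.cdf(0.8))  # Φ(0.4)·Φ(−0.8)
print(simulated_unit_likelihood(y, X, beta, np.eye(2)), exact)
```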

We can generalize the above model further to deal with the case where u_i depends on X_i in a known form. Similar to the linear panel data model, Chamberlain (1984) relaxed the assumption that the individual effect c_i is independent of the regressors. Let the linear projection of c_i on X_i be of the following form:

c_i = X_iγ + η_i. (34)

For simplicity, η_i is assumed to be independent of X_i. After plugging X_iγ + η_i into (23), we get the following equation free of c_i:

Y_{it}^* = X_iγ_t + η_i + ε_{it}, (35)

where γ_t = γ + β ⊗ e_t, and e_t is a T-dimensional column vector with one for the t-th element and zero for the others. The composite error η_i + ε_{it} is independent of X_i.


If we know the distributions of η_i and ε_{it}, the above likelihood-based framework can be applied here in the same manner. Note that for making probability predictions, we are not interested in β in (23); the reduced form parameter γ_t in (35) is sufficient. To summarize, in parametric panel data models, as long as the conditional distribution of the error given X_i is correctly specified, the predicted probability at evaluation point x is obtained by replacing unknown parameters by their maximum likelihood estimates. The parametric approach is efficient but not robust. In the panel data context, it is hard to ensure that all stochastic components of the model are correctly specified. If one of them is misspecified, the resulting estimator is in general not consistent. More robust estimation approaches that do not require full specification of the random components have been proposed, such as the well-known conditional logit model, which allows for an arbitrary relationship between the individual effect and the regressors; see Andersen (1970), Chamberlain (1980, 1984), and Hsiao (1996). Unfortunately, these approaches cannot be used to get probability forecasts. Given that the conditional probability P(Y = 1|x) depends on both β and the distribution function that transforms an index into a number between zero and one, consistency of the parameter estimator is not enough. When parametric models fail, the semi-parametric or non-parametric approach may be an obvious choice; see Ai and Li (2008). However, most of the semi-parametric and non-parametric panel data models focus on how to estimate β instead of the predicted probabilities.

In a dynamic binary panel data model, the latent variable in period t depends on the lagged observed binary event, as shown below:

Y_{it}^* = Y_{i,t−1}α + X_{it}β + c_i + ε_{it}. (36)

The dynamic model is useful in some cases, as it accounts explicitly for the state dependence of the binary choice. Consider consumers' brand choice as an example. The unobserved indirect utility over a brand is likely to be correlated with past purchasing behavior, as most consumers tend to buy the same brand if it has been tried before and was satisfactory. Presence of the lagged endogenous variable Y_{i,t−1} on the right-hand side of (36) complicates the estimation due to the correlation between c_i and Y_{i,t−1}. In dynamic panel data models, the initial value Y_{i0} is not observed by the econometrician. Therefore, another issue is how to deal with this initial condition in order to get a valid likelihood function for estimation and inference; see Heckman (1981), Wooldridge (2005), and Arellano and Carrasco (2003) for alternative solutions. Lechner et al. (2008) provided an outstanding overview of several dynamic binary panel data models.

The Bayesian approach in the panel data context shares much similarity with its counterpart in the single equation case. Chib (2008) considered a general latent variable model in which both slope and intercept exhibit heterogeneity. This random coefficient model is shown below:

Y_{it}^* = X_{it}β + W_{it}b_i + ε_{it}, (37)


where W_{it} is the subvector of X_{it} whose marginal effects on Y_{it}^*, captured by b_i, are unit specific, and where ε_{it} follows the standard normal distribution. The probability of the binary response given this formulation is P(Y_{it} = 1|X_{it}, b_i) = Φ(X_{it}β + W_{it}b_i). b_i is assumed to be a multivariate normal random vector N(0, D). Again, data augmentation with the latent continuous response is suggested to facilitate computation of the posterior distribution; see Chib (2008) for more details.

2.2. Non-Model-Based Probability Predictions

The methodologies covered so far rely crucially on alternative econometric binary response models. In practice, researchers sometimes are confronted with binary probability predictions that may or may not come from any econometric model. Instead, the predicted probabilities are issued by a number of market experts following their professional judgments and experience. These are non-model-based probability predictions, or judgmental forecasts in psychological parlance; see, for instance, Lawrence et al. (2006). The Surveys of Professional Forecasters (SPF) conducted by the Federal Reserve Bank of Philadelphia and by the European Central Bank (ECB) are leading examples of non-model-based probability predictions in economics. Other forecasting organizations like the Blue Chip Surveys, Bloomberg, and many central banks also report probability forecasts from time to time. Given the high reputation and widespread use of the U.S. SPF data in academia and industry, this section will give a brief introduction to this survey, focusing on probability forecasts for real GDP declines. See Croushore (1993) for a general introduction to the SPF, and Lahiri and Wang (2013) for these probability forecasts.

The SPF is the oldest quarterly survey of macroeconomic forecasts in the United States. It began in 1968 and was conducted by the American Statistical Association and the National Bureau of Economic Research. The Federal Reserve Bank of Philadelphia took over the survey in 1990. Currently, the dataset contains over 30 economic variables. In every quarter, the questionnaire is distributed to selected individual forecasters, who are asked for their expectations about a number of economic and business indicators, such as real GDP, CPI, and the unemployment rate in the current and next few quarters. For real GDP, the GDP price deflator, and unemployment, density forecasts are also collected, viz., the predicted probability of annual percent change falling in each prescribed interval for the current and the next four quarters. Furthermore, the survey asks forecasters for their predicted probabilities of declines in real GDP in the quarter in which the survey is conducted and each of the following four quarters. For any target year, there are five forecasts from an individual forecaster, each corresponding to a different quarterly forecast horizon. By investigating the time series of individual forecasts for a given target, we can study how their subjective judgments evolve over time and how useful they are. The SPF also reports aggregate data summarizing responses from all forecasters, including their mean, median, and cross-sectional dispersion. Note that the dataset is not balanced, as individual forecasters enter or exit the survey in any quarter for a number of reasons. Also, some


forecasters may not report their predictions for some variables or horizons. Given the novelty and quality of this dataset, the SPF is extensively used in macroeconomics. For our purpose, probability forecasts of a binary economic event can also be easily constructed from the subjective density forecasts. Galbraith and van Norden (2012) used the Bank of England's forecast densities to calculate the forecast probability that the annual rates of change of inflation and output growth exceed given threshold values. For instance, if the target event is a GDP decline in the current year, then the constructed probability of this event is the sum of probabilities in each interval with negative values. For quarterly GDP declines, however, this probability is readily available in the U.S. SPF and can be analyzed directly. Clements (2006) found some internal inconsistency between these probability and density forecasts, whereas Lahiri and Wang (2006) found that the probability forecasts for real GDP declines have no significant skill beyond the second quarter.
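Constructing an event probability from a histogram-type density forecast is simple arithmetic: sum the probabilities attached to the intervals lying below the threshold. The bin values below are made up for illustration and are not actual SPF responses:

```python
# Histogram-type density forecast: (lower, upper, probability) per bin.
# These numbers are hypothetical, not actual SPF responses.
bins = [(-3.0, -2.0, 0.05), (-2.0, -1.0, 0.07), (-1.0, 0.0, 0.13),
        (0.0, 1.0, 0.30), (1.0, 2.0, 0.28), (2.0, 3.0, 0.17)]

# P(decline) = total probability attached to intervals below zero
p_decline = sum(p for lo, hi, p in bins if hi <= 0.0)
print(round(p_decline, 2))  # 0.25
```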

A commonly cited SPF indicator is the anxious index. It is defined as the probability of a decline in real GDP in the next quarter. For example, in the survey taken in the fourth quarter of 2011, the anxious index was 16.6 percent, which means that forecasters on average believed that there was a 16.6 percent chance that real GDP would decline during the first quarter of 2012. Figure 19.4 illustrates the path of the anxious index over time, beginning in the fourth quarter of 1968, along with the shaded NBER-dated recessions. The fluctuations in the probabilities seem roughly coincident with the NBER-defined peaks and troughs of the U.S. business cycle since 1968. Rudebusch and Williams (2009) compared the economic downturn forecast accuracy of the SPF and a simple binary probit model using the yield spread as regressor, finding that in terms of alternative measures of forecasting performance, the former wins for the current quarter, but the difference is not statistically significant. Its advantage over the latter deteriorates as the forecast horizon increases. Given the widespread recognition of the enduring role of the yield spread in

Figure 19.4 The anxious index (probability of real GDP decline, in percent, by survey date) from 1968:Q4 to 2011:Q4. (Source: SPF website.)


predicting contractions during the past 20 years, the fact that professional forecasters do not seem to incorporate this readily available information on the yield spread in forecasting real GDP downturns appears to be puzzling; see Lahiri et al. (2013) for further analysis of the issue. A number of papers have studied the properties of the SPF data; see, for example, Braun and Yaniv (1992), Clements (2008, 2011), Lahiri et al. (1988), and Lahiri and Wang (2013), to name a few.

Engelberg et al. (2011) called attention to the problem of changing panel composition in surveys of forecasters and illustrated this problem using SPF data. They warned that traditional aggregate analyses of SPF time series conflate changes in the expectations of individual forecasters with changes in the composition of the panel. Instead of aggregating individual forecasts by mean or median, as reported by the Federal Reserve Bank of Philadelphia, they suggested putting more emphasis on the analysis of the time series of predictions made by each individual forecaster. Aggregation, as a simplifying device, should only be applied to subpanels with fixed composition.

3. EVALUATION OF BINARY EVENT PREDICTIONS

Given a sequence of predicted values for a binary event, which may come from an estimated model or from subjective judgments by individual forecasters as in the SPF, we can evaluate their accuracy empirically. For example, it is desirable to verify whether the prediction is associated well with the realized event. An important issue here is how to compare the performance of two or more forecasting systems predicting the same event, and whether a particular forecasting system is valuable from the perspective of end users. In this section, we shall summarize many important and useful evaluation methodologies developed in diverse fields in a coherent fashion. There are two types of binary predictions: probability prediction, discussed thoroughly in Section 2, and point prediction, which will be covered in the next section. The evaluation of probability predictions is discussed first.

3.1. Evaluation of Probability Predictions

We can roughly classify the extant methodologies on binary forecast evaluation into two categories. The first one measures forecast skill, which describes how the forecast is related to the actual, while the second measures forecast value, which emphasizes the usefulness of a forecast from the viewpoint of an end user. Skill and value are two facets of a forecasting system; a skillful forecast may or may not be valuable. We will first review the evaluation of forecast skill and then move to forecast value, where the optimal forecasts are defined in the context of a two-state, two-action decision problem.

3.1.1. Evaluation of Forecast Skill

The econometric literature contains many alternative measures of goodness of fit analogous to the R2 in conventional regressions, which can be related to various re-scalings of


functions of the likelihood ratio statistics for testing that all slope coefficients of the model are zero.5 These measures, though useful in many situations, are not directly oriented towards measuring forecast skill, and are often unsatisfactory in gauging the usefulness of the fitted model in either identifying a relatively uncommon or rare event in the sample or forecasting out-of-sample. Most methods of skill evaluation for binary probability predictions were developed in meteorology without emphasizing model fit. Murphy and Winkler (1984) provide a historical review of probability predictions in meteorology from both theoretical and practical perspectives. Given the prevalence of binary events in economics, such as economic recessions and stock market crashes, existing economic probability forecasts should be evaluated carefully, whether they are generated by models or judgments.

Murphy and Winkler (1987) described a general framework of forecast skill evaluation with binary probability forecasts as a special case. The basis for their framework is the joint distribution of forecasts and observations, which contains all of the relevant statistical information. Let Y be the binary event to be predicted and P be the predicted probability of Y = 1 based on a forecasting system. The joint distribution of (Y, P) is denoted by f(Y, P), a bivariate distribution when only one forecasting system is involved. Murphy and Winkler (1987) suggested two alternative factorizations of the joint distribution. Consider the calibration-refinement factorization first. f(Y, P) can be decomposed into the product of two distributions: the marginal distribution of P and the conditional distribution of Y given P, that is, f(Y, P) = f(P)f(Y|P). For perfect forecasts, f(1|P = 1) = 1 and f(1|P = 0) = 0, i.e., the conditional probability of Y = 1 given the forecast is exactly equal to the predicted value. In general, it is natural to require f(1|P) = P almost surely over P, and this property is called calibration in the statistics literature; see Dawid (1984). A well-calibrated probability forecast implies that the actual frequency of the event given each forecast value should be close to the forecast itself, and the user will not commit a large error by taking the face value of the probability forecast as the true value. Given a sample {Yt, Pt} of actuals and forecasts, we can plot the observed sample fraction of Y = 1 against P, the so-called attribute diagram, to check calibration graphically. The ideal situation is that all pairs of (Yt, Pt) concentrate around the diagonal line; this corresponds to the so-called Mincer–Zarnowitz regression in a rational expectations framework, cf. Lovell (1986).
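The attribute diagram can be computed from a sample by binning the forecasts and pairing each bin's mean forecast with the observed frequency of Y = 1; points near the 45-degree line indicate good calibration. A minimal sketch (our own function name and toy data):

```python
import numpy as np

def reliability_points(y, p, n_bins=10):
    """Bin forecasts on [0,1] and pair each bin's mean forecast with the
    observed frequency of Y = 1 (the points of an attribute diagram)."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    pts = []
    for j in range(n_bins):
        mask = idx == j
        if mask.any():
            pts.append((float(p[mask].mean()), float(y[mask].mean())))
    return pts

# Toy sample calibrated by construction: 2 of 10 events at P=0.2, 8 of 10 at P=0.8
p = np.array([0.2] * 10 + [0.8] * 10)
y = np.array([0] * 8 + [1] * 2 + [1] * 8 + [0] * 2)
print(reliability_points(y, p, n_bins=5))  # points sit on the diagonal
```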
Seillier-Moiseiwitsch and Dawid (1993) proposed a test to determine whether, in finite samples, the difference between the actuals and the probability forecasts is purely due to sampling uncertainty. This test is based on an asymptotic approximation using the martingale central limit theorem, and is consistent in spirit with the prequential principle of Dawid (1984), which states that any assessment of a series of probability forecasts should not depend on the way the forecasts are generated. The strength of the prequential

5 Estrella (1998) and Windmeijer (1995) contain critical analyses and comparisons of most of these goodness-of-fit measures.


principle is that it allows for a unified test for calibration regardless of the probability law underlying a particular forecasting system.

The Seillier-Moiseiwitsch and Dawid (1993) calibration test groups a sequence of probability forecasts into a small number of cells, say J cells, with the midpoint P_j as the estimate of the probability in each cell. Given a sample {Y_t, P_t}, the number of events Y_t = 1 in the j-th cell is counted and denoted by N_j. The corresponding expected count under the predicted probability is P_jT_j, where T_j is the number of observations in the j-th cell. The calibration test for cell j becomes straightforward by constructing the test statistic Z_j = (N_j − P_jT_j)/√w_j, where w_j = T_jP_j(1 − P_j) is the weight for cell j. Under the null hypothesis of calibration for cell j, Z_j is asymptotically normally distributed with zero mean and unit variance, and should not lie too far out in the tail of this distribution. The overall calibration test for all cells is then conducted using the statistic ∑_{j=1}^J Z_j^2, which asymptotically has a χ2 distribution with J degrees of freedom; there is strong evidence against overall calibration if it exceeds the critical value at a given significance level. As an example, Lahiri and Wang (2013) find that for the current quarter aggregate SPF forecasts of GDP declines introduced in Section 2.2, the calculated χ2 value is 8.01, which is significant at the 5% level. Thus, even at this short horizon, the recorded forecasts are not calibrated.
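The test is straightforward to compute once forecasts have been grouped into cells. The sketch below uses hypothetical cell counts, not the SPF data analyzed in the text:

```python
import numpy as np
from scipy.stats import chi2

def sd_calibration_test(cells):
    """Seillier-Moiseiwitsch–Dawid test. cells: (P_j, T_j, N_j) tuples of
    cell probability, cell size and event count. Returns Σ_j Z_j² and the
    asymptotic p-value from χ²(J)."""
    stat = 0.0
    for p, t, n in cells:
        w = t * p * (1.0 - p)             # w_j = T_j P_j (1 − P_j)
        stat += (n - p * t) ** 2 / w      # Z_j² with Z_j = (N_j − P_j T_j)/√w_j
    return stat, chi2.sf(stat, df=len(cells))

# Hypothetical cells with midpoints 0.1, 0.3, 0.5
stat, pval = sd_calibration_test([(0.1, 100, 12), (0.3, 60, 15), (0.5, 40, 24)])
print(stat, pval)  # ≈ 2.76, well below the 5% χ²(3) critical value of 7.81
```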

Calibration measures the predictive performance of probability forecasts against observed binary outcomes. However, it is not the only criterion of primary concern in practice. Consider the naive forecast that always predicts the marginal probability P(Y = 1). Since f(1|P) = P(Y = 1|P(Y = 1)) = P(Y = 1), it is necessarily calibrated. Generally speaking, any conditional probability forecast P(Y = 1|Ω) for some information set Ω has to be calibrated, since

P(Y = 1|P(Y = 1|Ω)) = E(E(Y|Ω)|P(Y = 1|Ω)) = P(Y = 1|Ω), (38)

by applying the law of iterated expectations. The naive forecast P(Y = 1) is a special case of this conditional probability forecast, with Ω containing only the constant term. However, forecasting with the long-run probability P(Y = 1) is typically not a good option, as it does not distinguish those observations when Y = 1 from those when Y = 0. This latter property is better characterized by the marginal distribution f(P), which is a measure of the refinement of probability forecasts and indicates how often different forecast values are used. For the naive forecast, f(P) is a degenerate distribution with all probability mass at P = P(Y = 1), and the forecast is said to be not refined, or not sharp. A perfectly refined forecasting system tends to predict 0 and 1 in each case. According to these definitions, the aforementioned perfect forecast is not only perfectly calibrated but also perfectly refined. In contrast, the naive forecast is perfectly calibrated but not refined at all. Any forecasting system that predicts 1 when Y = 0 and 0 when Y = 1 is still perfectly refined but not calibrated at all. Given that perfect forecasts do not exist in reality, Gneiting et al. (2007) developed a paradigm of maximizing sharpness subject to calibration; see also Murphy and Winkler (1987).

Page 30: [Handbook of Economic Forecasting]  Volume 2 || Forecasting Binary Outcomes

1054 Kajal Lahiri et al.

The second way of factorizing f(Y, P) is to write it as the product of f(P|Y) and f(Y), called the likelihood-base rate factorization, which corresponds to Edwin Mills' Implicit Expectations hypothesis; see Lovell (1986). Given a binary event Y, we have two conditional distributions, namely, f(P|Y = 1) and f(P|Y = 0). The former is the conditional distribution of predicted probabilities in the case of Y = 1, while the latter is the distribution for Y = 0. We would hope that f(P|Y = 1) puts more density on higher values of P, and the opposite for f(P|Y = 0). These two distributions are the conditional likelihoods associated with the forecast P. For perfect forecasts, f(P|Y = 1) and f(P|Y = 0) degenerate at P = 1 and P = 0, respectively. Conversely, if f(P|Y = 0) = f(P|Y = 1) for all P, the forecasts are said not to be discriminatory at all between the two events and provide no useful information about the occurrence of the event. The forecast is perfectly discriminatory if f(P|Y = 1) and f(P|Y = 0) are two distinct degenerate densities, in which case, after observing the value of P, we are sure which event will occur. Based on this idea, Cramer (1999) suggested using the difference in the means of these two conditional densities as a measure of goodness of fit. Since each mean is taken over the respective sub-sample, this measure is not unduly influenced by the success rate in the more prevalent outcome group.

Figure 19.5 shows these two empirical likelihoods for the current quarter forecasts based on SPF data; cf. Lahiri and Wang (2013). This diagram shows that the current quarter probability forecasts discriminate between the two events fairly well, and f(P|Y = 0) puts more weight on the lower probability values than f(P|Y = 1) does. However, not enough weight is associated with higher probability values when GDP does decline, and so the SPF forecasters appear to be somewhat conservative in this sense.

Figure 19.5 Likelihoods f(P|Y = 0) and f(P|Y = 1) for quarter 0, plotted against the forecast probability P. (Source: Lahiri and Wang (2013).)


In the likelihood-base rate factorization, f(Y) is the unconditional probability of each event. In weather forecasting, this is called the base rate or sample climatology and represents the long-run frequency of the target event. Since it is only a description of the forecasting situation, it is fully independent of the forecasting system. Murphy and Winkler (1987) took f(Y) as the probability forecast in the absence of any forecasting system and f(P|Y) as the new information beyond the base rate contributed by a forecasting system P. They emphasized the central role of the joint distribution of forecasts and observations in any forecast evaluation, and discussed the close link between their general framework and some popular evaluation procedures widely used in practice. For example, Brier's (1950) score can be calculated as the sample mean squared error of forecasts and actuals, (1/T)∑_{t=1}^T (Y_t − P_t)^2, which has a range between zero and one. Perfect forecasts have a zero Brier score, and a smaller value of the Brier score indicates better predictive performance. The population mean squared error is E(Y_t − P_t)^2 = Var(Y_t − P_t) + [E(Y_t) − E(P_t)]^2, where the first term is the variance of the forecast errors and the second is the square of the forecast bias. Murphy and Winkler (1987) expressed this score in terms of population moments as follows:

E(Y_t − P_t)^2 = Var(P_t) + Var(Y_t) − 2Cov(Y_t, P_t) + [E(Y_t) − E(P_t)]^2. (39)

This decomposition reaffirms the previous statement that all evaluation procedures are based on the joint distribution of Y and P. It shows that the performance, as measured by the mean squared error, is affected not only by the covariance Cov(Y_t, P_t) (a larger value means better performance), but also by the marginal moments of forecasts and actuals. Suppose Y is a relatively rare event with E(Y_t) close to zero. The optimal forecast minimizing (39) is close to the constant E(Y_t), which is the naive forecast having no skill at all. In practice, the skill score defined below, which measures the relative skill over the naive forecast, is often used in this context:

skill score ≡ 1 − [∑_{t=1}^T (Y_t − P_t)^2] / [∑_{t=1}^T (Y_t − E(Y_t))^2]. (40)

The reference naive forecast has no skill, in that its skill score is zero, whereas a skillful forecast is rewarded by a positive skill score. The larger the skill score, the higher the skill of the forecast. For the current quarter forecasts from the SPF, Lahiri and Wang (2013) calculated the Brier score and skill score as 0.0668 and 0.45, respectively, which can be seen as impressive.
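Both the Brier score and the skill score (40) are one-liners given a sample of actuals and forecasts; the toy data below are illustrative:

```python
import numpy as np

def brier_and_skill(y, p):
    """Brier score (1/T)Σ(Y_t − P_t)² and the skill score (40), which
    benchmarks against the naive climatology forecast Ȳ."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    brier = np.mean((y - p) ** 2)
    skill = 1.0 - brier / np.mean((y - y.mean()) ** 2)
    return brier, skill

# Toy data: eight outcomes and probability forecasts
y = [1, 0, 0, 1, 0, 0, 0, 0]
p = [0.8, 0.1, 0.2, 0.6, 0.1, 0.3, 0.2, 0.1]
brier, skill = brier_and_skill(y, p)
print(brier, skill)  # ≈ 0.05 and ≈ 0.73
```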

Murphy (1973) decomposed the Brier score in terms of the two factorizations of f(Y, P). In light of the calibration-refinement factorization, it can be rewritten as

E(Y_t − P_t)^2 = Var(Y_t) + E_P[P_t − E(Y_t|P_t)]^2 − E_P[E(Y_t|P_t) − E(Y_t)]^2, (41)

where E_P(·) is the expectation operator with respect to the marginal distribution of P. This decomposition summarizes the features in the two marginal distributions and f(Y|P).

Page 32: [Handbook of Economic Forecasting]  Volume 2 || Forecasting Binary Outcomes

1056 Kajal Lahiri et al.

The second term is a measure of calibration, as it is a weighted average of the discrepancy between the face value of the probability forecast and the actual probability of the realization given the forecast. The third term is a measure of the difference between conditional and unconditional probabilities of Y = 1. This attribute is called resolution by Murphy and Daan (1985). In terms of the likelihood-base rate factorization, the Brier score can be alternatively decomposed as

E(Y_t − P_t)^2 = Var(P_t) + E_Y[Y_t − E(P_t|Y_t)]^2 − E_Y[E(P_t|Y_t) − E(P_t)]^2, (42)

where E_Y(·) is the expectation operator with respect to the marginal distribution of Y. Instead of using information in f(Y|P), (42) exploits information in the likelihood f(P|Y) in addition to the two marginal distributions. The second term is a weighted average of the squared difference between the observation and the mean forecast given the observation, and is supposed to be small for a good forecast. The third term is a weighted average of the squared difference between the mean forecast given the observation and the overall mean forecast, and measures the discriminatory power of forecasts between the two events. These two decompositions summarize different aspects of f(Y, P), and their sample analogues can be computed straightforwardly.
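The sample analogue of the calibration-refinement decomposition (41) can be computed by grouping observations on the distinct forecast values, in which case the identity Brier = uncertainty + reliability − resolution holds exactly. A sketch with made-up data:

```python
import numpy as np

def murphy_decomposition(y, p):
    """Sample analogue of (41), grouping on the distinct forecast values:
    returns (uncertainty, reliability, resolution) with
    Brier = uncertainty + reliability − resolution."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    ybar = y.mean()
    uncertainty = ybar * (1.0 - ybar)               # Var(Y_t)
    reliability = resolution = 0.0
    for v in np.unique(p):
        mask = p == v
        w, ycond = mask.mean(), y[mask].mean()
        reliability += w * (v - ycond) ** 2         # E_P[P_t − E(Y_t|P_t)]²
        resolution += w * (ycond - ybar) ** 2       # E_P[E(Y_t|P_t) − E(Y_t)]²
    return uncertainty, reliability, resolution

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.7, 0.7, 0.7, 0.3, 0.3, 0.3, 0.3, 0.3])
unc, rel, res = murphy_decomposition(y, p)
print(np.mean((y - p) ** 2), unc + rel - res)  # the two values coincide
```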

Yates (1982) suggested an alternative decomposition of the Brier score which isolates individual components capturing distinct features of f(Y, P), in the same spirit as Murphy and Winkler's general framework. Yates' decomposition, popular in psychology, is derived from the usual interpretation of the mean squared error (39) in terms of variance and squared bias. Note that Var(Y_t) = E(Y_t)[1 − E(Y_t)] and Cov(Y_t, P_t) = [E(P_t|Y_t = 1) − E(P_t|Y_t = 0)]E(Y_t)[1 − E(Y_t)]. We get Yates' covariance decomposition by plugging these into (39), using the definition Var_{P,min}(P_t) ≡ [E(P_t|Y_t = 1) − E(P_t|Y_t = 0)]^2 E(Y_t)[1 − E(Y_t)], and obtain

E(Y_t − P_t)^2 = E(Y_t)[1 − E(Y_t)] + ΔVar(P_t) + Var_{P,min}(P_t) − 2Cov(Y_t, P_t) + [E(Y_t) − E(P_t)]^2, (43)

where ΔVar(P_t) ≡ Var(P_t) − Var_{P,min}(P_t) by definition. The first term E(Y_t)[1 − E(Y_t)] is the variance of the binary event and thus is independent of forecasts. It is close to zero when either E(Y_t) or 1 − E(Y_t) is very small. Given this property, a comparison across several forecasts with different targets based on the overall Brier score may be misleading, because two target events tend to have different marginal distributions, and the discrepancy of the scores is likely to solely reflect the differential of the marginal distributions, thus saying nothing about the real skill. Yates regarded E(Y_t)[1 − E(Y_t)] as the Brier score of the naive forecast mentioned before, and showed that it is the minimal achievable value for a constant probability forecast. It is the remaining part, that is, E(Y_t − P_t)^2 − E(Y_t)[1 − E(Y_t)], that matters for evaluation purposes.

The term [E(Y_t) − E(P_t)]^2 measures the magnitude of the global forecast bias and is zero for unbiased forecasts. In contrast to perfect calibration, which requires the conditional probability to be equal to the face value almost surely, Yates called this

Page 33: [Handbook of Economic Forecasting]  Volume 2 || Forecasting Binary Outcomes

Forecasting Binary Outcomes 1057

calibration-in-the-large. It says that the unconditional probability of Y = 1 should match the average predicted value. Cov(Y_t, P_t) describes how responsive a forecast is to the occurrence of the target event, both in terms of direction and magnitude. A skillful forecast ought to identify and exploit this information in a sensitive and correct manner. It is apparent that a small Var(P_t) is desired, but this is not everything. A typical example is the naive forecast, with zero variance but no skill. Var_{P,min}(P_t) is the minimum variance of P_t given any value of the covariance Cov(Y_t, P_t), and ΔVar(P_t) is the excess variance, which should be minimized. The minimal variance Var_{P,min}(P_t) is achieved only when ΔVar(P_t) = 0, for which P_t = P_1 on all occasions of Y_t = 1 and P_t = P_0 on other occasions, so that the variation of forecasts is due solely to the event's occurrence. In this sense, Yates called ΔVar(P_t) the excess variability of forecasts; it is not zero when the forecast is responsive to information that is not related to the event's occurrence. Using the current quarter SPF forecasts, Lahiri and Wang (2013) found that the excess variability was 53% of the total forecast variance of 0.569. For longer horizons, excess variability increases rapidly, indicating an interesting characteristic of these forecasts. Overall, Yates' decomposition stipulates that a skillful forecast is expected to be unbiased and highly sensitive to relevant information, but insensitive to irrelevant information. Yates (1982) emphasized the essence of resolution instead of the conventional focus on calibration in probability forecast evaluation; see also Toth et al. (2003).
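Yates' covariance decomposition (43) is likewise easy to compute from sample moments; the sketch below verifies the identity numerically on toy data:

```python
import numpy as np

def yates_decomposition(y, p):
    """Sample analogue of Yates' covariance decomposition (43):
    Brier = Var(Y) + ΔVar(P) + Var_min(P) − 2Cov(Y,P) + bias²."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    ybar, pbar = y.mean(), p.mean()
    var_y = ybar * (1.0 - ybar)
    bias2 = (ybar - pbar) ** 2
    cov = np.mean((y - ybar) * (p - pbar))
    slope = p[y == 1].mean() - p[y == 0].mean()    # E(P|Y=1) − E(P|Y=0)
    var_min = slope ** 2 * var_y                   # Var_{P,min}(P_t)
    dvar = np.mean((p - pbar) ** 2) - var_min      # excess variability ΔVar(P_t)
    return var_y, dvar, var_min, cov, bias2

y = np.array([1, 0, 1, 0, 0, 1])
p = np.array([0.6, 0.2, 0.8, 0.4, 0.1, 0.5])
var_y, dvar, var_min, cov, bias2 = yates_decomposition(y, p)
print(np.mean((y - p) ** 2), var_y + dvar + var_min - 2 * cov + bias2)
```

The two printed values coincide, since ΔVar(P) + Var_min(P) reproduces Var(P) by construction.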

Although the Brier score is extensively used in probability forecast evaluation, it is not the only choice. Alternative scores characterizing other features of the joint distribution exist. Two leading examples are the average absolute deviation, E(|Y_t − P_t|), and the logarithmic score, −E(Y_t log(P_t) + (1 − Y_t) log(1 − P_t)). In general, any function with (Y_t, P_t) as arguments can be taken as a score. In the theoretical literature, a subclass called proper scoring rules is comprised of functions satisfying

E(S(Y_t, P_t^*)) ≤ E(S(Y_t, P_t)), ∀P_t ∈ [0, 1], (44)

where S(·, ·) is the score function with the observation as the first argument and the forecast as the second, and P_t^* is the underlying true conditional probability. If P_t^* is the unique minimizer of the expected score, S(·, ·) is called a strictly proper scoring rule. It can be easily shown that the Brier score and the logarithmic score are proper, while the absolute deviation is not. Gneiting and Raftery (2007) pointed out the importance of using proper scores for evaluation purposes and provided an example to demonstrate the problem associated with improper scores. Schervish (1989) developed an intuitive way of constructing a proper scoring rule that has a natural economic interpretation in terms of the loss associated with a decision problem based on forecasts. He also generated a proper scoring rule that is equal to the integral of the expected loss function evaluated at the threshold value with respect to a measure defined on the unit interval, and discussed the connection between calibration and a proper scoring rule. Gneiting (2011) argued that a consistent scoring function or an elicitable target functional (the mean in our context)

Page 34: [Handbook of Economic Forecasting]  Volume 2 || Forecasting Binary Outcomes

1058 Kajal Lahiri et al.

ought to be specified ex ante if forecasts are to be issued and evaluated. Thus, it does not make sense to evaluate probability forecasts using the absolute deviation, which is not consistent for the mean.
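To make the notion of propriety concrete, here is a small numerical check (our own illustration, not from the chapter; all names are ours): for a binary outcome with true probability p∗ = 0.3, the expected Brier score is minimized by reporting p = p∗, while the expected absolute deviation is minimized by a degenerate report of zero.

```python
import numpy as np

def expected_score(score, p_true, p):
    # E[S(Y, p)] when Y ~ Bernoulli(p_true)
    return p_true * score(1, p) + (1 - p_true) * score(0, p)

brier = lambda y, p: (y - p) ** 2
absdev = lambda y, p: abs(y - p)

p_true = 0.3
grid = np.linspace(0.0, 1.0, 1001)

brier_argmin = grid[np.argmin([expected_score(brier, p_true, p) for p in grid])]
absdev_argmin = grid[np.argmin([expected_score(absdev, p_true, p) for p in grid])]

print(brier_argmin)    # approximately 0.3: reporting the truth minimizes the Brier score
print(absdev_argmin)   # 0.0: absolute deviation rewards a degenerate forecast
```

The same grid search confirms that the logarithmic score is also minimized at p∗, in line with the propriety claims above.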

Up to this point, all evaluations are carried out through a number of proper scoring rules. If we have more than one competing forecasting model targeting the same event and a large sample tracking the forecasts, scores can be calculated and compared. For example, in terms of the Brier score (1/T)∑_{t=1}^{T}(Yt − Pt)², model A with a larger score is considered to be a worse performer than model B. Lopez (2001), based on Diebold and Mariano (1995), proposed a new test constructed from the sample difference between two scores, allowing for asymmetric scores, non-Gaussian and non-zero mean forecast errors, serial correlation among observations, and contemporaneous correlation between forecasts. Here we replace the objective function of Diebold and Mariano (1995) by a generic proper scoring rule. Let S(Yt, Pti) be the score value of the ith (i = 1 or 2) model for observation t. It is often assumed to be a function of the forecast error defined by eti ≡ Yt − Pti, that is, S(Yt, Pti) = f(eti). The method works equally well for more general cases where the functional form of S(·, ·) is not restricted in this way. In addition, let dt = f(et1) − f(et2) be the score differential between models 1 and 2. The null hypothesis of no skill differential is stated as E(dt) = 0.

Suppose the score differential series {dt} is covariance stationary and has short memory. The standard central limit theorem for dependent data can be used to establish the asymptotic distribution of the test statistic under E(dt) = 0 as

√T(d̄ − E(dt)) →d N(0, 2πfd(0)), (45)

where

d̄ = (1/T)∑_{t=1}^{T} dt (46)

is the sample mean of score differentials,

fd(0) = (1/2π)∑_{τ=−∞}^{∞} γd(τ) (47)

is the spectral density of dt at frequency zero, and γd(τ) = E(dt − E(dt))(dt−τ − E(dt)) is the autocovariance of dt with the τth lag. The t statistic is thus

t = d̄ / √(2πf̂d(0)/T), (48)

where f̂d(0) is a consistent estimator of fd(0). Estimation of fd(0) based on lag truncation methods is quite standard in time series econometrics; see Diebold and Mariano (1995) for more details. The key idea is that only very weak assumptions about the data generating process are imposed, and neither serial nor contemporaneous correlation is ruled out by these assumptions. Implementation of this procedure is quite easy, as it is simply the standard t test of a zero mean for a single population after adjusting for serial correlation. Thus, when comparing the current quarter SPF forecasts with the naive constant forecast given by the sample proportion, Lahiri and Wang (2013) found the Lopez t statistic to be −2.564, suggesting that the former has a significantly lower Brier score than the naive forecast at the usual 5% level.
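The test is straightforward to implement. The sketch below (function and variable names are ours; the Bartlett kernel is one standard lag-truncation choice for estimating 2πfd(0)) compares the Brier scores of a calibrated and a naive constant forecast on simulated data:

```python
import numpy as np

def lopez_dm_stat(y, p1, p2, max_lag=None):
    """t statistic for H0: E(d_t) = 0, where d_t is the Brier score
    differential, with a Bartlett-weighted long-run variance estimate."""
    d = (y - p1) ** 2 - (y - p2) ** 2
    T = len(d)
    if max_lag is None:
        max_lag = int(T ** (1 / 3))
    u = d - d.mean()
    lrv = u @ u / T                           # gamma_d(0)
    for tau in range(1, max_lag + 1):
        w = 1 - tau / (max_lag + 1)           # Bartlett weights
        lrv += 2 * w * (u[tau:] @ u[:-tau]) / T
    return d.mean() / np.sqrt(lrv / T)

rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, 2000)
y = (rng.uniform(size=2000) < p_true).astype(float)
calibrated = p_true                           # well-calibrated forecast
naive = np.full(2000, y.mean())               # constant climatological forecast
t_stat = lopez_dm_stat(y, calibrated, naive)
print(t_stat < -1.96)                         # True: calibrated beats naive
```

A strongly negative statistic, as here, rejects equal expected Brier scores in favor of the calibrated forecast, mirroring the SPF-versus-naive comparison above.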

West (1996) developed procedures for asymptotic inference about the moments of a smooth score based on out-of-sample prediction errors. If predictions are generated by econometric models, these procedures adjust for errors in the estimation of the model parameters. Conditions are also given under which ignoring this estimation error would not affect out-of-sample inference. This framework is neither more general than, nor a special case of, the Diebold-Mariano approach and thus should be viewed as complementary. Note that the Diebold-Mariano test is not applicable when two competing forecasts cannot be treated as coming from two non-nested models. However, if we think of the null hypothesis as the two forecast series having equal finite sample forecast accuracy, then the Diebold-Mariano test statistic with a standard normal approximation gives a reasonably-sized test of the null in both nested and non-nested cases, provided that the long-run variances are estimated properly and the small-sample adjustment of Harvey et al. (1997) is employed; see Clark and McCracken (forthcoming).

Another useful tool for probability forecast evaluation, popular in medical imaging, meteorology, and psychology, which has not received much attention in economics, is the Receiver Operating Characteristic (ROC) analysis; see Berge and Jordà (2011) for a recent exception. Given the joint distribution f(Y, P) and a threshold value, which is a number between zero and one, we can calculate two conditional probabilities: the hit rate and the false alarm rate. Let P∗ be a threshold, and Ŷt = 1 is predicted if and only if Pt ≥ P∗; that is, P∗ transforms a continuous probability forecast into a binary point forecast. Table 19.1 presents the joint distribution of this forecast and realization under a generic P∗. In this 2 × 2 contingency table, πij is the joint probability of (Ŷ = i, Y = j), while πi. and π.j are the marginal probabilities of Ŷ = i and Y = j, respectively. The hit rate (H) is the conditional probability of Ŷ = 1 given Y = 1, that is, H ≡ P(Ŷ = 1|Y = 1) = π11/π.1, and it tells the chance that Y = 1 is correctly predicted when it does happen.

Table 19.1 Joint Distribution of Binary Point Forecast Ŷ and Observation Y

               Y = 1    Y = 0    Row total
Ŷ = 1          π11      π10      π1.
Ŷ = 0          π01      π00      π0.
Column total   π.1      π.0      1


In contrast, the false alarm rate (F) is the conditional probability of Ŷ = 1 given Y = 0, that is, F ≡ P(Ŷ = 1|Y = 0) = π10/π.0, and it measures the fraction of incorrect forecasts when Y = 1 does not occur. Although these two probabilities appear to be constant for a given sample, they are actually functions of P∗. If P∗ = 0 ≤ Pt for all t, then Ŷ = 1 would always be predicted. As a result, both the hit and false alarm rates equal one. Conversely, only Ŷ = 0 would be issued and both probabilities are zero when P∗ = 1. For interior values of P∗, H and F fall within [0, 1]. Their relationship due to the variation of P∗ can be depicted by tracing out all possible pairs of (F(P∗), H(P∗)) for P∗ ∈ [0, 1]. This graph, plotted with the false alarm rate on the horizontal axis and the hit rate on the vertical axis, is called the Receiver Operating Characteristic curve. Its typical shape for a skillful probability forecast is shown in Figure 19.6.

In categorical data analysis, H is often called the sensitivity and 1 − F = P(Ŷ = 0|Y = 0) is the specificity. Both measure the fraction of correct forecasts and are expected to be high for skillful forecasts. (F(P∗), H(P∗)), corresponding to a particular threshold P∗, is only one point on the ROC curve, which consists of all such points for possible values of P∗.
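These conditional rates are easy to compute from a 2 × 2 table. A minimal sketch (helper name ours), using the counts that appear later in Table 19.3:

```python
def conditional_rates(n11, n10, n01, n00):
    """Hit rate H = P(Yhat=1 | Y=1) and false alarm rate F = P(Yhat=1 | Y=0)
    from cell counts n_ij = #(Yhat = i, Y = j)."""
    H = n11 / (n11 + n01)        # sensitivity
    F = n10 / (n10 + n00)        # 1 - specificity
    return H, F

# Counts for forecaster B in Table 19.3: n11=40, n10=197, n01=3, n00=900
H, F = conditional_rates(40, 197, 3, 900)
print(round(H, 3), round(F, 3))  # 0.93 0.18
```

Forecaster B thus catches about 93% of the events at the cost of raising an alarm for about 18% of the non-events.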

The ROC curve can be constructed in an alternative way based on the likelihood-base rate factorization f(Y, P) = f(P|Y)f(Y). Given a threshold P∗, H is the integral of f(P|Y = 1),

H = ∫_{P∗}^{1} f(P|Y = 1)dP, (49)

and F is the integral of f(P|Y = 0) over the same domain,

F = ∫_{P∗}^{1} f(P|Y = 0)dP. (50)

Figure 19.7 illustrates these two densities along with three values of P∗.

Figure 19.6 A typical ROC curve.


Figure 19.7 f(P|Y = 1) (right), f(P|Y = 0) (left), and three values of P∗; the corresponding (hit, false alarm) pairs are (97.5%, 84%), (84%, 50%), and (50%, 16%).

In this graph, H is the area of f(P|Y = 1) to the right of P∗, while F is the same area for f(P|Y = 0). As the vertical line shifts rightward from top to bottom, both areas shrink, and both H and F decline. In one extreme, where P∗ = 0, both areas equal one; in the other extreme, where P∗ = 1, they equal zero. Figure 19.7 reveals the tradeoff between H and F: they move together in the same direction as P∗ varies, and the scenario (H = 1, F = 0) is generally unobtainable unless the forecast is perfect. This relationship is also apparent from the upward sloping ROC curve in Figure 19.6. Deriving the ROC curve from the likelihood-base rate factorization is in the same spirit as Murphy and Winkler's general framework. To see this, consider the likelihoods of two systems (A and B) predicting the same event; see Figure 19.8 below.

Let us assume that the likelihoods when Y = 1 are exactly the same for both A and B, while the likelihoods when Y = 0 share the same shape but center at different locations. The likelihood f(P|Y = 0) for A is symmetric around a value that is less than the corresponding value for B. In the terminology of the likelihood-base rate factorization, A is said to have a higher discriminatory ability than B because its f(P|Y = 0) is farther apart from f(P|Y = 1) and is thus more likely to distinguish the two cases. Consequently, A has a higher forecast skill, which should be reflected by its ROC curve as well. This result is supported by considering any threshold value represented by a vertical line in this graph. As discussed before, the area of f(P|Y = 0) for A lying to the right of the threshold (A's false alarm rate) is always smaller than that for B, and this is true for any threshold. On the other hand, since f(P|Y = 1) is identical for both A and B, the hit rates, defined as the area of f(P|Y = 1) to the right of the vertical line, are the same for both. Therefore, A is more skillful than B, which is shown in Figure 19.9, where the ROC curve of A always lies to the left of that of B for any fixed H.


Figure 19.8 Likelihoods for forecasts A and B with a common threshold.

The ROC curve is a convenient graphical tool to evaluate forecast skill and can be used to facilitate comparison among competing forecasting systems. To see this, consider three special curves in the unit box. The first one is the 45 degree diagonal line on which H = F. A probability forecast with an ROC curve of this type is a random forecast that is statistically independent of the observation. As a result, H and F are identical, and both equal the integral of the marginal density of the probability forecast over the domain [P∗, 1]. One example is the naive forecast. Probability forecasts whose ROC curve is the diagonal line have no skill and are often taken as the benchmark against which other forecasts of interest are compared. For a perfect forecast, the corresponding ROC curve consists of the left and upper boundaries of the unit box. Most probability forecasts in real-life situations fall in between, and their ROC curves lie in the upper triangle, like the one shown in Figure 19.6. Since a higher hit rate and a lower false alarm rate are always desired, an ROC curve lying farther from the diagonal line indicates higher skill. A curve in the lower triangle appears to be even worse than the random forecast at first sight, but it can potentially be relabeled to be useful.

Figure 19.9 ROC curves for A and B with different skills.

Given a sample, there are two methods of plotting the ROC curve: parametric and non-parametric. In the parametric approach, some distributional assumptions about the likelihoods f(P|Y = 1) and f(P|Y = 0) are necessary. A typical example is the normal distribution. However, it is not a sensible choice given that the range of P is limited. Nevertheless, we can always transform P into a variable with unlimited range; for instance, the inverse function of any normal distribution suffices for this purpose. The parameters of this distribution are estimated from a sample, and the fitted ROC curve can be plotted by varying the threshold in the same way as when deriving the population curve. This approach, however, is subject to misspecification like any parametric method. In contrast, non-parametric estimation does not need such stringent assumptions and can be carried out based on data alone. Fawcett (2006) provides an illustrative example with computational details. Fortunately, most current commercial statistical packages like Stata have built-in procedures for generating ROC graphs.

Sometimes, a single statistic summarizing the information contained in an ROC curve is warranted. There are two alternatives: one measures the local skill for a threshold of primary interest, while the other measures global skill over all thresholds. For the former, two statistics are most commonly used. The first is the smallest Euclidean distance between the point (0, 1) and the ROC curve, motivated by the observation that the ROC curve of a more skillful probability forecast is often closer to (0, 1). The second statistic is called the Youden index, which is the maximal vertical gap between the diagonal and the ROC curve (or hit rate minus false alarm rate). The global measure is the area under the ROC curve (AUC). For random forecasts, the AUC is one half, while it is one for perfect forecasts. A larger AUC thus implies higher forecast skill.
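The empirical versions of these summaries are simple to compute. A sketch (names ours) that traces the empirical ROC curve over all observed thresholds and reports the trapezoidal AUC and the Youden index:

```python
import numpy as np

def roc_summaries(y, p):
    """Empirical ROC points, trapezoidal AUC, and Youden index (max H - F)."""
    thresholds = np.r_[np.inf, np.sort(np.unique(p))[::-1], -np.inf]
    H = np.array([(p[y == 1] >= t).mean() for t in thresholds])  # hit rates
    F = np.array([(p[y == 0] >= t).mean() for t in thresholds])  # false alarms
    auc = float(np.sum((F[1:] - F[:-1]) * (H[1:] + H[:-1]) / 2))  # trapezia
    youden = float((H - F).max())
    return auc, youden

# A forecast that separates the classes completely: AUC = 1, Youden = 1
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.8, 0.9])
print(roc_summaries(y, p))  # (1.0, 1.0)
```

Sweeping thresholds from +∞ down to −∞ generates the step-function ROC curve from (0, 0) to (1, 1), whose summed trapezia give the non-parametric AUC described above.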


Figure 19.10 ROC curves for two forecasts: A and B (with dA > dB).

Calculation of the AUC proceeds in two ways depending on the approach used to estimate the ROC curve. For parametric estimation, the AUC is the integral of a smooth curve over the domain [0, 1]. For non-parametric estimation, the empirical ROC curve is a step function, and its integral is obtained by summing the areas of a finite number of trapezia. If the underlying ROC curve is smooth and concave, the AUC computed in this way is bound to underestimate the true value in a finite sample. Note that these two measures may not concord with each other, in the sense that they may give conflicting judgments regarding forecast skill. Figure 19.10 illustrates such a situation.

In this graph, dA and dB are the local skill statistics for A and B, respectively, and A is slightly less skillful in terms of this criterion. However, the AUC of A is larger than that of B. Conflict between the two raises a question in practice as to which one should be used. Often there is no universal answer, and it depends on the adopted loss function. Mason and Graham (2002), Mason (2003), Cortes and Mohri (2005), Faraggi and Reiser (2002), and Liu et al. (2005), among others, proposed and compared estimation and inference methods concerning the AUC in large data sets. These include, but are not limited to, the traditional test based on the Mann-Whitney U-statistic, an asymptotic t-test, and bootstrap-based tests. Using these procedures in large samples, we can answer questions like: "Does a forecasting system have any skill?", "Is its AUC significantly larger than 1/2?", or "Is the AUC of forecast A significantly larger than that of B in the population?"
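The connection to the Mann-Whitney U-statistic is direct: the non-parametric AUC equals the probability that a randomly chosen Y = 1 case receives a higher forecast than a randomly chosen Y = 0 case, with ties counted half. A sketch of this rank-based computation (names ours):

```python
import numpy as np

def auc_mann_whitney(y, p):
    """AUC as the normalized Mann-Whitney U-statistic:
    P(p_i > p_j) + 0.5 * P(p_i = p_j) over (Y=1, Y=0) pairs."""
    p1, p0 = p[y == 1], p[y == 0]
    wins = (p1[:, None] > p0[None, :]).sum()   # concordant pairs
    ties = (p1[:, None] == p0[None, :]).sum()  # tied pairs
    return (wins + 0.5 * ties) / (p1.size * p0.size)

y = np.array([1, 0, 1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.3])
print(auc_mann_whitney(y, p))  # 7.5 / 9, i.e. about 0.833
```

This equivalence is what allows the usual rank-based inference for the U-statistic to be carried over to tests on the AUC.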

ROC analysis was initially developed in the field of signal detection theory, where it was used to evaluate the ability of a binary detection system to discriminate between two clearly-defined possibilities: signal plus noise and noise only. Thereafter, it has gained increasing popularity in many other related fields. For a general treatment of ROC analysis, readers are referred to Egan (1975), Swets (1996), Zhou et al. (2002), Wickens (2001), and Krzanowski and Hand (2009), just to name a few. For economic forecasts, Lahiri and Wang (2013) evaluated the SPF probability forecasts of real GDP declines for


Figure 19.11 ROC curve with 95% confidence band for quarter 0. (Source: Lahiri and Wang (2013)).

the U.S. economy using the ROC curve. Figure 19.11, taken from this paper for the current quarter forecasts, shows that at least for the current quarter, the SPF is skillful.

3.1.2. Evaluation of Forecast Value

For calculating the forecast value, one needs more information than what is contained in the measures of association between forecasts and realizations. Let L(a, Y) be the loss of a decision maker when (s)he takes the action a and the event Y is realized in the future. Here, as in the banker's problem, only the scenario with two possible actions (e.g., making a loan or not) coupled with a binary event (e.g., default or not) is considered. It is simple, yet fits a large number of real-life decision-making scenarios in economics.

First, we need to show that a separate analysis of forecast value is necessary. The following example suffices to this end. Suppose A and B are two forecasts targeting the same binary event Y. The following tables summarize the predictive performances of both models (see Tables 19.2 and 19.3).

Here A and B are 0/1 binary point forecasts. If forecast skill is measured by the Brier score, then A performs better than B, since its Brier score is about 10.79%, less than B's

Table 19.2 Contingency Table Cross-Classifying Forecasts of A and Observations Y

               Y = 1    Y = 0    Row total
Ŷ = 1          20       100      120
Ŷ = 0          23       997      1020
Column total   43       1097     1140


Table 19.3 Contingency Table Cross-Classifying Forecasts of B and Observations Y

               Y = 1    Y = 0    Row total
Ŷ = 1          40       197      237
Ŷ = 0          3        900      903
Column total   43       1097     1140

Table 19.4 Loss Function Associated with the 2 × 2 Decision Problem

         Y = 1    Y = 0
a = 1    0        10
a = 0    5000     0

score of 17.54%. Consequently, A is superior to B in terms of forecast skill measured by the Brier score. Does the same conclusion hold in terms of forecast value? To answer this question, we have to specify the loss function L(a, Y) first. Without loss of generality, suppose the decision rule is given by a = 1 if Y = 1 is predicted and a = 0 otherwise. The loss is described in Table 19.4.

This loss function has some special features: it is zero when the event is correctly predicted, and the losses associated with incorrect forecasts are not symmetric, in that the loss for a = 0 when the event Y = 1 occurs is much larger than that when a = 1 and the event Y = 1 does not occur. Loss functions of this type are typical when the target event Y = 1 is rare but people incur a substantial loss once it takes place, such as a dam collapse or a financial crisis. The overall loss of A is 10 × 100 + 5000 × 23 = 116,000, which is much larger than that of B (10 × 197 + 5000 × 3 = 16,970). This example shows that the superiority of A in terms of skill does not imply its usefulness from the standpoint of a forecast user. An evaluation of forecast value needs to be carried out separately.
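The calculation behind this example is worth making explicit. A sketch (function name ours) reproducing the losses and Brier scores from Tables 19.2-19.4; note that for 0/1 point forecasts the Brier score reduces to the misclassification rate, since (Yt − Ŷt)² is either zero or one:

```python
def total_loss(n11, n10, n01, n00, loss):
    """Total loss given counts n_ij = #(a = i, Y = j) and loss[a][y]."""
    return (n11 * loss[1][1] + n10 * loss[1][0]
            + n01 * loss[0][1] + n00 * loss[0][0])

loss = {1: {1: 0, 0: 10}, 0: {1: 5000, 0: 0}}    # Table 19.4
loss_A = total_loss(20, 100, 23, 997, loss)      # Table 19.2 counts
loss_B = total_loss(40, 197, 3, 900, loss)       # Table 19.3 counts
brier_A = (100 + 23) / 1140                      # misclassified cells of A
brier_B = (197 + 3) / 1140
print(loss_A, loss_B)                            # 116000 16970
print(round(brier_A, 4), round(brier_B, 4))      # 0.1079 0.1754
```

A's lower Brier score and far higher total loss fall straight out of the arithmetic, confirming that skill and value rankings can diverge.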

Thompson and Brier (1955) and Mylne (1999) examined forecast values in the simple cost/loss decision context in which L(1, 1) = L(1, 0) = C > 0, L(0, 1) = L > 0, and L(0, 0) = 0. C is the cost and L is the loss. This model simplifies the analysis by summarizing the loss function into two values, cost and loss, and its result can consequently be conveyed visually. Loss functions of this type are suitable in a context such as the decision to purchase insurance by a consumer, where the two actions are "buy insurance" or "do not buy insurance," which lead to different losses depending on whether the adverse event occurs in the future. If one buys the insurance (a = 1), (s)he is able to protect against


the effects of the adverse event by paying a cost C, whereas occurrence of the adverse event without the benefit of this protection results in a loss L. If the consumer knows the marginal probability that the adverse event will occur at the moment of decision, the problem boils down to comparing the expected losses of the two actions. On the one hand, (s)he has to pay C irrespective of the event if (s)he decides to buy the insurance; her/his expected loss would equal PL if (s)he does not do so, where P is the marginal probability of Y = 1 perceived by the consumer. The optimal decision rule is thus a = 1 if and only if P ≥ C/L, and the lowest expected loss resulting from this rule is min(PL, C), denoted by ELclim. Now, suppose the consumer has access to perfect forecasts. Then the minimum expected loss would be ELperf ≡ PC, which is smaller in magnitude than ELclim given that P ∈ [0, 1] and C ≤ L. The difference ELclim − ELperf measures the gain of a perfect forecast relative to the naive forecast. The more realistic situation is that the probability forecast under consideration improves upon the naive forecast, but is not perfect. Wilks (2001) suggested the value score (VS) to measure the value of a forecasting system, where

VS = (ELclim − ELP)/(ELclim − ELperf), (51)

and ELP denotes the expected loss of the forecasting system P. The value score defined in this way can be interpreted as the expected economic value of the forecasts of interest as a fraction of the value of perfect forecasts relative to naive forecasts. Its value lies in (−∞, 1], and it is positively oriented in the sense that a higher VS means a larger forecast value. Naive forecasts and perfect forecasts have VS of 0 and 1, respectively. Note that VS may be negative, indicating that it is better to use the naive forecast of no skill in these cases. However, Murphy (1977) demonstrated that VS must be non-negative for any forecasting system with perfect calibration; thus any perfectly calibrated probability forecast is at least as useful as the naive forecast. This illustrates the interplay between forecast skill and forecast value.

Given a probability forecast Pt, VS can be calculated from f(Pt, Yt), the joint distribution of forecasts and observations, and the loss function. To accomplish this, the joint distribution of (a, Y) must be derived first, where the optimal action depends on the consumer's knowledge of f(Pt, Yt). Given the forecast Pt, the conditional probability of the event is f(Yt = 1|Pt), which corresponds to the second element in the calibration-refinement factorization of f(Pt, Yt), and the optimal decision rule takes the form specified above: a = 1 if and only if P(Yt = 1|Pt) ≥ C/L. Therefore, the cost/loss ratio C/L is the optimal threshold for translating a continuous probability P(Yt = 1|Pt) into a binary action. Given C/L, the joint probability of (a = 1, Y = 1) is thus equal to π11 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L)f(Pt, Yt = 1)dPt, where I(·) is the indicator function, which is one only when the condition in (·) is met. Likewise, we can calculate the other three joint probabilities, listed as follows:

π10 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L)f(Pt, Yt = 0)dPt;

π01 ≡ ∫ I(P(Yt = 1|Pt) < C/L)f(Pt, Yt = 1)dPt;

π00 ≡ ∫ I(P(Yt = 1|Pt) < C/L)f(Pt, Yt = 0)dPt. (52)

Based on these results, the expected loss ELP is the weighted average of L(a, Y) with the above probabilities πij as weights:

ELP = (π11 + π10)C + π01L, (53)

which is then plugged into (51) to get VS. Note that in this derivation, not only is the information contained in f(Pt, Yt) used, but the cost/loss ratio, which is user-specific, plays a role as well. This observation reconfirms our previous argument that forecast value is a mixture of objective skill and subjective loss. If f(Pt, Yt) is fixed, ELP is a function of C and L. Wilks (2001) proved a stronger result that VS is a function of C/L only, so that only the ratio matters. For this reason, we can plot VS against the cost/loss ratio in a simple 2-dimensional diagram. In other decision problems, where the loss function takes a more general form rather than the current cost/loss form, VS can be calculated in the same fashion as before, but the resulting VS as a function of four loss values cannot be shown in a 2- or 3-dimensional diagram.
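As a sketch of the whole calculation (names ours), the value score can be computed from a sample of forecast-realization pairs. For simplicity the forecast is taken at face value here, i.e., assumed perfectly calibrated so that P(Yt = 1|Pt) = Pt, making the decision rule a = 1 iff Pt ≥ C/L:

```python
import numpy as np

def value_score(y, p, c_over_l):
    """VS of (51), with losses measured in units of L (so C = c_over_l, L = 1)
    and the forecast taken at face value: a = 1 iff p >= C/L."""
    C = c_over_l
    base = y.mean()                           # climatological P(Y = 1)
    el_clim = min(base, C)                    # best constant action
    el_perf = base * C                        # pay C only when Y = 1 occurs
    a = (p >= C).astype(float)
    el_p = np.mean(a * C + (1 - a) * y)       # sample analogue of (53)
    return (el_clim - el_p) / (el_clim - el_perf)

rng = np.random.default_rng(1)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(float)  # perfectly calibrated by design
vs = value_score(y, p, 0.3)
print(0.0 < vs <= 1.0)  # True: a calibrated forecast is at least as useful as naive
```

The positive VS here is an instance of Murphy's (1977) result that perfectly calibrated forecasts cannot have negative value.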

Figure 19.12 plots VS against the cost/loss ratio of a probability forecast. Note that the domain of interest is the unit interval between zero and one, as the non-negative cost C is assumed to be less than the loss L. The two points (0, 0) and (1, 0) must lie on the VS curve, because when C/L = 0, a = 1 is adopted, resulting in ELclim = ELP = VS = 0; on the other hand, when C/L = 1, a = 0 with ELclim = ELP = PC, which again implies zero VS. In this graph, the probability forecast is not calibrated, as the VS curve lies beneath zero for some cost/loss ratios. Krzysztofowicz (1992) and Krzysztofowicz

Figure 19.12 An artificial value score curve.


and Long (1990) showed that recalibration (i.e., relabeling) of such forecasts will not change the refinement but can improve the value score over the entire range of cost/loss ratios, which again is evidence that forecast skill affects forecast value. For the ROC curve, however, Wilks (2001) demonstrated that even with such recalibration, the ROC curve will not change. Wilks (2001) hence concluded that "the ROC curve is best interpreted as reflecting potential rather than actual skill," as it is insensitive to calibration improvement. Further details on the interaction of skill and value measured by other criteria are available in Richardson (2003).

The value score curve lends support to the use of probability forecasts instead of binary point forecasts. For the latter, only 0/1 values are issued without any measure of uncertainty. Suppose there is a community populated by more than one forecast user, and each has his own cost/loss ratio. Initially, the single forecaster serving the community produces a probability forecast Pt, and then changes it into a 0/1 prediction by using a threshold P∗, which is announced to the community. The threshold P∗ determines a unique 2 × 2 contingency table, and the value score for any given C and L can be calculated. As a result, the value score curve as a function of the cost/loss ratio can be plotted as well. Richardson (2003) pointed out that this VS curve is never located higher than the one generated by the probability forecasts Pt for any cost/loss ratio on [0, 1]. This result is obvious, since the optimal P∗ for the community as a whole may not be optimal for all users. If the forecaster provides a probability forecast Pt instead of a binary point forecast, each user has greater flexibility to choose an action according to his/her own cost/loss ratio, and this would minimize the individual expected loss. A single forecaster, without knowing the distribution of cost/loss ratios across individuals, is likely to give a sub-optimal 0/1 forecast for the whole community.

Similar to the ROC analysis, we often need a single quantity like the AUC to measure the overall value of a probability forecast. A natural choice is the integral of the VS curve over [0, 1]. This may be justified by a uniform distribution of cost/loss ratios, which means that forecast values are equally weighted for all cost/loss ratios. Wilks (2001) proved that this integral is equivalent to the Brier score. This is a special case where forecast value is completely determined by forecast skill, which may not be true in general. Wilks (2001) suggested using a beta distribution on the domain [0, 1], with two parameters (α, β), to describe the distribution of cost/loss ratios, as it allows for a very flexible representation of how C/L spreads across individuals by specifying only two parameters. For example, α = β = 1 yields the uniform distribution with equal weights. The weighted average of value scores (WVS) is

WVS ≡ ∫_0^1 VS(C/L) b(C/L; α, β) d(C/L), (54)


where VS(C/L) is the value score as a function of the cost/loss ratio and b(C/L; α, β) is the beta density with parameters α and β. Wilks (2001) found that this overall measure of forecast value is very sensitive to the choice of parameters. In practice, it is impossible for a forecaster to know this distribution exactly, since the cost/loss ratio is user-dependent and may involve cost and loss in some mental or utility unit. Therefore, the application of WVS in forecast evaluation practice calls for extra caution. Moreover, even if one has perfect awareness of the cost/loss distribution and ranks a collection of competing forecasts by WVS, this ranking cannot be interpreted from the perspective of a particular end user. After all, WVS is only an overall measure, and the good forecasts identified by WVS may not be equally good in the eyes of a particular user, who will re-evaluate each forecast according to his own cost/loss ratio.
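Computationally, (54) is just a one-dimensional weighted integral. A numerical sketch (names ours; midpoint rule on (0, 1)), using placeholder VS functions since the true VS curve depends on the forecast being evaluated:

```python
import numpy as np
from math import gamma

def beta_pdf(x, a, b):
    # Beta(a, b) density b(x; a, b) on (0, 1)
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

def weighted_value_score(vs, a, b, n=2000):
    """WVS of (54): beta-weighted average of VS over cost/loss ratios,
    approximated by the midpoint rule on (0, 1)."""
    mid = (np.arange(n) + 0.5) / n
    return float(np.mean([vs(c) * beta_pdf(c, a, b) for c in mid]))

# Sanity checks: the weights integrate to one, and uniform weights (a = b = 1)
# reduce WVS to the plain integral of the VS curve.
print(round(weighted_value_score(lambda c: 1.0, 2.0, 5.0), 3))  # 1.0
print(round(weighted_value_score(lambda c: c, 1.0, 1.0), 3))    # 0.5
```

Re-running the same integral with different (α, β) pairs makes Wilks' sensitivity point easy to reproduce: the ranking of forecasts can shift with the assumed spread of cost/loss ratios.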

Although the value score provides a general framework to evaluate the usefulness of probability forecasts in terms of economic cost and loss, it has its own drawbacks. In the derivation of the value score, we have used the conditional probability P(Yt = 1|Pt), which is unknown in practice and needs to be estimated from a sample. For a user without much professional knowledge, this is highly infeasible. Richardson (2003) simplified the derivation by assuming the forecast is perfectly calibrated (P(Yt = 1|Pt) = Pt), so that a user can take the face value Pt as the truth. All empirical value score curves presented in Richardson (2003) are generated under this assumption. However, the assumption may not hold for a given probability forecast, and deriving the VS curve and conducting statistical inference in such a situation become much more challenging.

3.2. Evaluation of Point Predictions

Compared to probability forecasts, only 0/1 values are issued in binary point predictions, which will be discussed in depth in Section 4. For binary forecasts of this type, the 2 × 2 contingency tables, cross-classifying forecasts and actuals, completely characterize the joint distribution, and thus are convenient tools from which a variety of evaluation measures of skill and value can be constructed. We will introduce the usual skill measures based on contingency tables; see Stephenson (2000) and Mason (2003) as well. Statistical inference on a contingency table, especially the independence test under two sampling designs, and the measure of forecast value are then briefly reviewed.

3.2.1. Skill Measures for Point Forecasts

Although there are four cells in a contingency table (Table 19.1), only three quantities are sufficient to describe it completely. The first one is the bias (B), which is defined as the ratio of the two marginal probabilities, π1./π.1. For an unbiased forecasting system, B is one and E(Ŷ) = E(Y). Note that B summarizes the marginal distributions of forecasts and observations, and thus does not tell anything about the association between them. For example, independence of Ŷ and Y is possible for any value of the bias. The unbiased random forecasts are often taken as having no skill in this context, and all other forecasts


are assessed relative to this benchmark. Two other measures necessary to characterize the forecast errors are the hit rate (H) and the false alarm rate (F), which are the two basic building blocks of a ROC curve. Note that for random forecasts of no skill, both H and F are equal to the marginal probability P(Ŷ = 1) due to independence. For forecasts of positive skill, H is expected to exceed F. Given B, H, and F, any joint probability πij in Table 19.1 is uniquely determined, verifying that only three degrees of freedom are needed for a 2 × 2 contingency table. The false alarm ratio is defined through 1 − H′ ≡ P(Y = 0|Ŷ = 1), while the conditional miss rate is F′ ≡ P(Y = 1|Ŷ = 0). Using Bayes' rule to connect the two factorizations, Stephenson (2000) derived the following relationships among these conditional measures:

H′ = H/B,   F′ = F(1 − H)/[F − H + B(1 − F)].   (55)
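These identities are easy to check numerically. The short Python sketch below uses a purely hypothetical joint distribution πij (illustrative numbers, not data from this chapter) and verifies that H′ and F′ computed directly from the joint probabilities agree with (55).

```python
# Hypothetical joint probabilities pi[(f, y)] = P(Yhat = f, Y = y).
pi = {(1, 1): 0.20, (1, 0): 0.10, (0, 1): 0.05, (0, 0): 0.65}

p_f1 = pi[(1, 1)] + pi[(1, 0)]      # P(Yhat = 1), marginal of the forecasts
p_y1 = pi[(1, 1)] + pi[(0, 1)]      # P(Y = 1), marginal of the observations

B = p_f1 / p_y1                      # bias
H = pi[(1, 1)] / p_y1                # hit rate P(Yhat = 1 | Y = 1)
F = pi[(1, 0)] / (1 - p_y1)          # false alarm rate P(Yhat = 1 | Y = 0)

# Conditional measures taken directly from the joint distribution ...
H_prime = pi[(1, 1)] / p_f1          # P(Y = 1 | Yhat = 1)
F_prime = pi[(0, 1)] / (1 - p_f1)    # P(Y = 1 | Yhat = 0)

# ... agree with the identities in Eq. (55).
assert abs(H_prime - H / B) < 1e-12
assert abs(F_prime - F * (1 - H) / (F - H + B * (1 - F))) < 1e-12
```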

Other measures of forecast skill can be constructed from the above three elementary but sufficient statistics. The first one is the odds ratio (OR), defined as the ratio of two odds:

OR ≡ [H/(1 − H)] / [F/(1 − F)],   (56)

which is positively oriented in that it equals 1 for random forecasts and exceeds 1 for forecasts of positive skill. Indeed, OR is often taken as a measure of association between the rows and columns of any contingency table, and equals one if and only if they are independent; see Agresti (2007). Note that OR is a function of H and F alone, both of which summarize the conditional distributions; as a result, OR does not rely on the marginal information. Another measure, parallel to the Brier score, is the probability of correct forecasts, defined as

πcorr ≡ 1 − E(Y − Ŷ)² = π11 + π00 = [FH + (1 − F)(B − H)]/(B − H + F),   (57)

which depends on B and the marginal information as well. In rare event cases where the unconditional probability of Y = 1 is close to zero, πcorr would be very high even for random forecasts of no skill. This is easily seen by observing that H = F = P(Ŷ = 1) = P(Y = 1) and B = 1 for such forecasts. Substituting these into πcorr, we get

πcorr = [FH + (1 − F)(B − H)]/(B − H + F) = 2P²(Y = 1) − 2P(Y = 1) + 1,   (58)

1072 Kajal Lahiri et al.

and the minimum is obtained when P(Y = 1) = 0.5, that is, when the event is balanced. In contrast, it achieves its maximum when P(Y = 1) = 1 or P(Y = 1) = 0. For rare events where P(Y = 1) is close to zero, πcorr is near one, and this leads to the misconception that the random forecasts perform exceptionally well, as nearly 100% of cases are correctly predicted. Even if there is no association between forecasts and observations, this score could be very high. For this reason, Gandin and Murphy (1992) regarded πcorr as "inequitable" in the sense of encouraging hedging. In contrast, the OR, which does not depend on B, does not have this flaw and hence is a reliable measure in rare event cases. Often we take the logarithm of OR to transform its range into the whole real line, and statistical inference based on log OR is much simpler to conduct than inference based on OR itself, as shown in Section 3.2.2.6 Alternatively, we can use the improvement of πcorr

relative to random forecasts of no skill to measure forecast skill. This is the Heidke skill score (HSS):

HSS = (πcorr − π°corr)/(1 − π°corr),   (59)

where π°corr is the value of πcorr for random forecasts. According to Stephenson (2000), HSS is a more reliable score to use than πcorr, albeit it also depends on B.

The second widely used skill score that gets rid of the marginal information is the Peirce skill score (PSS), or Kuipers score, which is defined as the hit rate minus the false alarm rate, cf. Peirce (1884). Like OR, forecasts of higher skill are rewarded with a larger PSS. One advantage of PSS over OR is that it is a linear function of H and F, and thus is well-defined for virtually all contingency tables, whereas OR is not defined when H and F are both zero. Stephenson (2000) evaluated the performance of these scores in terms of complement and transpose symmetry properties, and their encouragement of hedging behavior. His conclusion is that the OR is generally a useful measure of skill for binary point forecasts: it is easy to compute and to base inference on; moreover, it is independent of the marginal totals and is both complement and transpose symmetric. Mason (2003) provided a more comprehensive survey of various scores built on contingency tables and established five criteria for screening these measures, namely, equitability, propriety, consistency, sufficiency, and regularity.
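To make the definitions concrete, the following Python sketch computes OR, πcorr, HSS, and PSS for a hypothetical joint distribution; all numbers are purely illustrative.

```python
# Hypothetical joint probabilities: pi11 = P(Yhat=1, Y=1), etc.
pi11, pi10, pi01, pi00 = 0.20, 0.10, 0.05, 0.65

p_f1, p_y1 = pi11 + pi10, pi11 + pi01               # marginals of Yhat and Y
H = pi11 / p_y1                                     # hit rate
F = pi10 / (1 - p_y1)                               # false alarm rate

OR = (H / (1 - H)) / (F / (1 - F))                  # odds ratio, Eq. (56)
pi_corr = pi11 + pi00                               # prob. of correct forecasts
pi_corr0 = p_f1 * p_y1 + (1 - p_f1) * (1 - p_y1)    # random forecasts, same marginals
HSS = (pi_corr - pi_corr0) / (1 - pi_corr0)         # Heidke skill score, Eq. (59)
PSS = H - F                                         # Peirce skill score

print(round(OR, 2), round(HSS, 3), round(PSS, 3))
```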

3.2.2. Statistical Inference Based on Contingency Tables

So far, all scores have been calculated using population contingency tables, and nearly all of them are functions of the four joint probabilities. In practice, only a sample {Yt, Ŷt} for t = 1, . . . , T is available, which may or may not be generated from the models in Section 4. We have to use this sample to construct score estimates. This is made simple by noticing that any score, denoted by f(π11, π10, π01), is a function of three probabilities πij.

6 Another transformation of OR is the so-called Yule's Q or Odds Ratio Skill Score (ORSS), defined as (OR − 1)/(OR + 1). Unlike OR, ORSS ranges from −1 to 1 and is conventionally recognized as a measure of association in contingency tables.

The estimator is obtained by replacing each πij by the sample proportion pij. Statistical inference is therefore based on maximum likelihood theory when the sample size is sufficiently large. For simplicity, let us consider the random sampling scheme where {Yt, Ŷt} is i.i.d. The objective is to find the asymptotic distribution of an empirical score, which is a function of the sample proportions, denoted by f(p11, p10, p01).

Taking each (Yt, Ŷt) as a random draw from the joint distribution of forecasts and observations, we have four possible outcomes for each draw: (1, 1), (1, 0), (0, 1), and (0, 0), with corresponding probabilities π11, π10, π01, and π00, respectively. Under the assumption of independence across draws, the sampling distribution of {Yt, Ŷt} is multinomial with four outcomes, each with probability πij. The likelihood as a function of πij is thus

L({πij}|{Yt, Ŷt}) = [T!/(n11! n10! n01! n00!)] π11^{n11} π10^{n10} π01^{n01} π00^{n00},   (60)

where nij is the number of observations in cell (i, j), and T = Σi Σj nij. The maximum likelihood estimator is obtained by maximizing (60) over πij, subject to the natural constraint Σi Σj πij = 1. Agresti (2007) showed that the ML estimator is simply pij = nij/T, the sample proportion of outcome (i, j). By maximum likelihood theory, pij is consistent and asymptotically normally distributed, that is,

√T (p − π) →d N(0, V),   (61)

where p = (p11, p10, p01)′, π = (π11, π10, π01)′, and V is the 3 × 3 asymptotic covariance matrix, which can be estimated by the inverse of the negative Hessian of the log-likelihood evaluated at p. The asymptotic distribution of f(p11, p10, p01) can be derived by the delta method, provided f is differentiable in a neighborhood of π, obtaining

√T ( f(p11, p10, p01) − f(π11, π10, π01)) →d N(0, (∂f/∂π)′ V (∂f/∂π)),   (62)

where ∂f/∂π is the gradient vector of f evaluated at π, which can be estimated by replacing π with p. Asymptotic confidence intervals for any score defined above can be obtained from (62); see Stephenson (2000) and Mason (2003).
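As a concrete case, the delta method applied to log OR yields the standard result that the asymptotic standard error of the empirical log odds ratio is sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00); see Agresti (2007). The sketch below builds an asymptotic 95% interval from hypothetical cell counts.

```python
import math

# Hypothetical cell counts; the empirical odds ratio is (n11 * n00) / (n10 * n01).
n11, n10, n01, n00 = 40, 20, 10, 130

log_or = math.log((n11 * n00) / (n10 * n01))
se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)   # delta-method SE of log OR
ci = (log_or - 1.96 * se, log_or + 1.96 * se)            # asymptotic 95% CI

# Zero outside the interval => reject independence (log OR = 0) at the 5% level.
print(round(log_or, 3), tuple(round(b, 3) for b in ci))
```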

In small samples, the above asymptotic approximation is no longer valid. A rule of thumb is that the number of observations in each cell should be at least 5 for the approximation to hold. In real samples, one or more cells may contain no observations at all, and some measures, such as OR, cannot be calculated. The Bayesian approach with a reasonable prior can work in these situations. As shown above, the sample is drawn from a multinomial distribution. Albert (2009) showed that the conjugate

prior for π is the so-called Dirichlet distribution with four parameters (α11, α10, α01, α00) and density

p(π) = [Γ(Σi Σj αij) / Πi Πj Γ(αij)] π11^{α11−1} π10^{α10−1} π01^{α01−1} π00^{α00−1},   (63)

where Σi Σj πij = 1 and Γ(·) is the Gamma function. A natural choice is the non-informative prior, in which all αij equal one and all values of π are equally likely. Albert (2009) showed that the posterior distribution is also Dirichlet, with updated parameters (α11 + n11, α10 + n10, α01 + n01, α00 + n00). A random sample of size M from this posterior distribution, denoted {πm} for m = 1, . . . , M, can be used to obtain a sequence of scores {f(πm)}. For the purpose of inference, the resulting highest posterior density (HPD) credible set Cα at a given significance level α can be treated in the same way as a confidence interval in non-Bayesian analysis. Note that the strength of the Bayesian approach in the present situation is that the score can be calculated even when some of the nij are zero.
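The posterior simulation is straightforward to sketch. The hypothetical example below draws from the Dirichlet posterior via Gamma variates (Gamma(α, 1) draws normalized to sum to one) and reports an equal-tailed 95% interval for the PSS, used here as a simpler stand-in for the HPD set.

```python
import random

# Flat Dirichlet(1,1,1,1) prior + hypothetical counts -> Dirichlet posterior.
random.seed(0)
n = {"11": 40, "10": 20, "01": 10, "00": 130}
post = {k: 1 + v for k, v in n.items()}          # posterior parameters alpha + n

def draw_pss():
    g = {k: random.gammavariate(a, 1.0) for k, a in post.items()}
    s = sum(g.values())
    p = {k: v / s for k, v in g.items()}         # one posterior draw of {pi_ij}
    hit = p["11"] / (p["11"] + p["01"])          # H for this draw
    fal = p["10"] / (p["10"] + p["00"])          # F for this draw
    return hit - fal                             # PSS = H - F

draws = sorted(draw_pss() for _ in range(5000))
lo, hi = draws[int(0.025 * 5000)], draws[int(0.975 * 5000)]
print(round(lo, 2), round(hi, 2))                # equal-tailed 95% credible interval
```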

Testing independence between the rows and columns of a contingency table is very important for forecast evaluation. As shown above, independent forecasts would not be credited with a high value by any score. Merton (1981) proposed a statistic to measure the market timing skill of a DF. According to Merton (1981), a DF has no value if, and only if,

HM ≡ P(Ŷt = 1|Yt = 1) + P(Ŷt = 0|Yt = 0) = 1,   (64)

where Yt = 1 means the variable has moved upward. In our terminology, this means that

P(Ŷt = 1|Yt = 1) − P(Ŷt = 1|Yt = 0) = 0.   (65)

Note that P(Ŷt = 1|Yt = 1) is the hit rate and P(Ŷt = 1|Yt = 0) is the false alarm rate. As a result, the DF under consideration has no market timing skill in the sense of Merton (1981) if, and only if, the PSS is zero. Blaskowitz and Herwartz (2008) derived an alternative expression for the HM statistic in terms of the covariance of realized and forecasted directions:

HM − 1 = Cov(Yt, Ŷt)/Var(Yt).   (66)

HM = 1 if, and only if, Cov(Yt, Ŷt) is zero, which is equivalent to independence between Yt and Ŷt in the case of binary variables. Interestingly, a large number of papers investigating DF use symmetric loss functions of various forms, which amounts to taking the percentage of correct forecasts as the score; see Leitch and Tanner (1995), Greer (2005), Blaskowitz and Herwartz (2009), Swanson and White (1995, 1997a,b), Gradojevic and Yang (2006), and Diebold (2006), to name a few. Pesaran and Skouras (2002) linked the HM statistic with a loss function in a decision-based forecast evaluation framework.
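The relation in (66) holds for sample moments as well: the sample PSS equals the sample covariance of (Yt, Ŷt) divided by the sample variance of Yt. The Python sketch below checks this identity on a hypothetical simulated sample.

```python
import random

# Simulated (hypothetical) forecasts: yhat copies y with prob. 0.7, else a coin flip.
random.seed(1)
T = 1000
y = [1 if random.random() < 0.3 else 0 for _ in range(T)]
yhat = [yt if random.random() < 0.7 else (1 if random.random() < 0.5 else 0)
        for yt in y]

my, mf = sum(y) / T, sum(yhat) / T
cov = sum(a * b for a, b in zip(y, yhat)) / T - my * mf       # sample Cov(Y, Yhat)
var = my * (1 - my)                                           # sample Var(Y)

H = sum(a * b for a, b in zip(y, yhat)) / sum(y)              # hit rate
F = sum(b for a, b in zip(y, yhat) if a == 0) / (T - sum(y))  # false alarm rate

# Sample analog of Eq. (66): PSS = H - F = Cov(Y, Yhat) / Var(Y) = HM - 1.
assert abs((H - F) - cov / var) < 1e-9
```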

Since testing market timing skill is equivalent to the independence test in contingency tables, let us look at this test a bit more closely. The independence test under random sampling is much simpler than the test in the presence of serial correlation. As a matter of fact, all of the above frequentist and Bayesian tests are applicable in this situation. Take the PSS as an example: we can construct an asymptotic confidence interval for PSS based on a large sample and then check whether zero is included in the confidence interval. Besides these, two additional asymptotic tests exist, namely, the likelihood ratio and the Pearson chi-squared tests. The former is constructed as

LR ≡ 2(ln L({π*ij}|{Yt, Ŷt}) − ln L({π̃ij}|{Yt, Ŷt})),   (67)

where π*ij is the unrestricted ML estimate, whereas π̃ij is the restricted one under the restrictions πij = πi.π.j for all i and j. Under the null hypothesis of independence, LR asymptotically follows a chi-squared distribution with one degree of freedom, and the null should be rejected if and only if LR is larger than the critical value at a pre-assigned significance level. The Pearson chi-squared statistic is

χ² ≡ Σi Σj (nij − n̂ij)²/n̂ij,   (68)

where nij is the observed cell count, n̂ij = T pi. p.j is the expected cell count under independence, pi. is the marginal sample proportion of the ith row, and p.j is that of the jth column. If the rows and columns are independent, this statistic is expected to be small. It also has an asymptotic chi-squared distribution with one degree of freedom and the same rejection region. Both tests are valid and equivalent in large samples. In finite samples, where one or more cell counts are smaller than 5, Fisher's exact test is preferred, under the assumption that the row and column totals are fixed. The null distribution of the Fisher test statistic is not valid if these marginal counts are not fixed, as is often the case in random sampling. Specifically, the probability of the first count n11, given the marginal totals and independence, is

P(n11) = [n1.!/(n11! n10!)] [n0.!/(n01! n00!)] / [T!/(n.1! n.0!)],   (69)

which is the hypergeometric distribution for any sample size. This test was proposed by Fisher in 1934 and is widely used to test independence in I × J contingency tables under the random sampling design. Here only the simple case with I = J = 2 is considered, and readers are referred to Agresti (2007) for further discussion of this exact test. Another way of testing independence in general I × J contingency tables is the asymptotic test of the ANOVA coefficients of ln(πij), that is, the significance test of the relevant coefficients in the log-linear model, which is popular in statistics and biostatistics but rarely used by econometricians. This test makes use of the fact that the ANOVA coefficients of ln(πij) must meet

certain conditions under independence. One of the conditions is that the coefficient of any interaction term must be zero. The test proceeds by checking whether the maximum likelihood estimates support these implied values via three standard procedures, that is, the Wald, likelihood ratio, and Lagrange multiplier tests. In econometrics, Pesaran and Timmermann (1992) proposed an asymptotic test (PT92) based on the difference between P(Y = 1, Ŷ = 1) + P(Y = 0, Ŷ = 0) and P(Y = 1)P(Ŷ = 1) + P(Y = 0)P(Ŷ = 0), which should be close to zero under independence. A large deviation of the sample estimate from zero is thus a signal for rejection. In 2 × 2 contingency tables, the ANOVA and PT92 tests are asymptotically equivalent to the classical χ² test.
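For a 2 × 2 table, the two classical tests and the exact test of (69) each fit in a few lines. The sketch below uses hypothetical counts; the 1-df tail probability relies on the identity P(χ²₁ > x) = erfc(√(x/2)), and the exact test sums the hypergeometric upper tail.

```python
import math

# Hypothetical 2x2 table n[f][y]: Pearson chi-squared and LR independence tests.
n = [[40, 20], [10, 130]]
T = sum(map(sum, n))
row = [sum(r) for r in n]                    # forecast marginals
col = [sum(c) for c in zip(*n)]              # outcome marginals

chi2 = lr = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / T              # expected count under independence
        chi2 += (n[i][j] - e) ** 2 / e
        lr += 2 * n[i][j] * math.log(n[i][j] / e)

# One degree of freedom: P(chi2_1 > x) = erfc(sqrt(x / 2)).
pval = math.erfc(math.sqrt(chi2 / 2))

# Fisher's exact test for a small hypothetical table with fixed margins:
# P(n11 = k) is hypergeometric, as in Eq. (69); sum the upper tail.
n11, n10, n01, n00 = 6, 1, 2, 5
r1, r0, c1 = n11 + n10, n01 + n00, n11 + n01

def p_hyper(k):
    return math.comb(r1, k) * math.comb(r0, c1 - k) / math.comb(r1 + r0, c1)

p_exact = sum(p_hyper(k) for k in range(n11, min(r1, c1) + 1))
print(round(chi2, 2), round(lr, 2), pval < 0.05, round(p_exact, 4))
```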

In reality, especially for macroeconomic forecasts, Yt and Ŷt are likely to be serially correlated. All of the above test statistics can still be used, but their null distributions are going to change. For example, Tavaré and Altham (1983) examined the performance of the usual χ² test when both the row and column variables are characterized by two-state Markov chains, and concluded that the χ² statistic does not have the χ² distribution with one degree of freedom, as it does in the case of random samples. Before drawing any meaningful conclusions from these classical tests, serial correlation needs to be tackled properly.

Blaskowitz and Herwartz (2008) provided a summary of the testing methodologies in the presence of serial correlation in Yt and Ŷt. These include a covariance test based on the covariance of observations and events, a static/dynamic regression approach adjusted for serial correlation by calculating Newey–West corrected t-statistics, and the Pesaran and Timmermann (2009) test based on the canonical correlations from dynamically augmented reduced rank regressions, specialized to the binary case. They found that all of these tests based on asymptotic approximations tend to produce incorrect empirical size in finite samples, and suggested a circular bootstrap approach to improve their finite sample performance. Bootstrap-based tests are found to have smaller size distortion in small samples without much sacrifice of power, while those that do not take care of serial correlation tend to generate inflated test size in finite samples.

Dependence between forecasts and observations is necessary for a forecasting system to have positive skill. However, it is only a minimal requirement for good forecasts. It is not unusual for the performance of a forecasting system to be worse than that of random forecasts of no skill in terms of some specific criterion. Donkers and Melenberg (2002) proposed a test of relative forecasting performance over this benchmark by comparing the difference in the percentage of correct forecasts. In a real-life example, they found that their proposed test and the PT92 test differ dramatically in the estimation and evaluation samples.

3.2.3. Evaluation of Forecast Value

Most evaluation methodologies focus on the skill of binary point forecasts. As argued by Diebold and Mariano (1995) and Granger and Pesaran (2000a,b), however, the end user often finds measures of economic value more useful than the usual mean squared error or other statistical scores. We have emphasized this point in the context of probability

forecasts, in which the cost/loss ratio is important for value evaluation in a forecast-based decision problem. In a 2 × 2 payoff matrix (e.g., Table 19.4), each cell corresponds to the loss associated with a possible combination of action and realization, and is not limited to the specific cost/loss structure. Blaskowitz and Herwartz (2011) proposed a general loss function suitable for DF in economics and finance, which takes into account both the realized sign and the magnitude of the directional movement of the target economic variable. They regarded this general loss function as an alternative to the commonly used mean squared error for forecast evaluation.

As indicated before, Richardson (2003) analyzed the relationship between skill and value in the context of cost/loss decision problems. Note that for probability forecasts, any user, faced with a probability value, decides whether or not to take some action according to his optimal threshold. For binary point predictions, we can also calculate the value score, defined as a function of the cost/loss ratio. The resulting VS curve would lie below the one generated by probability forecasts. Richardson (2003) proved that the particular cost/loss ratio that maximizes VS is equal to the marginal probability of Y = 1, and that the highest achievable value score is simply the PSS. Granger and Pesaran (2000b) derived a very similar result. Consequently, the maximum economic value is related to forecast skill, and PSS can be taken as a measure of the potential forecast value as well as skill. However, for a specific user with a cost/loss ratio different from the marginal probability P(Y = 1), this maximum value is not attainable; PSS thus gives only the possible maximum rather than the actual value achievable by any particular user. On the other hand, Stephenson (2000) argued that in order to have a positive value score for at least one cost/loss ratio, the OR has to exceed one. That is, forecasts and observations have to depend on each other; otherwise, nobody benefits from the forecasts and one would rather use random forecasts with no skill. This observation provides another example where forecast value is influenced by forecast skill: only those forecasts satisfying minimal skill requirements can be economically valuable.

4. BINARY POINT PREDICTIONS

In some circumstances, especially in two-state, two-action decision problems, one has to make a binary decision according to the predicted probability of a future event. This can be done by transforming a continuous probability into a 0/1 point prediction, as we will discuss in this section. Unlike probability forecasts, binary point forecasts cannot be isolated from an underlying loss function. For this reason, we deferred a detailed examination of the topic until after forecast evaluation under a general loss function was reviewed in Section 3. The plan of this section is as follows: Section 4.1 considers ways to transform predicted probabilities into point forecasts – the so-called "two-step approach." Manski (1975, 1985) generalized this transformation procedure to cases where no probability prediction is given as prior knowledge, and the optimal forecasting rule is

obtained through a one-step approach. This will be addressed in Section 4.2, followed by an empirical illustration in Section 4.3. A set of binary classification techniques used primarily in the statistical learning literature are briefly introduced in Section 4.4. These include discriminant analysis, classification trees, and neural networks.

4.1. Two-Step Approach

In the two-step approach, the first step consists of generating binary probability predictions, as reviewed in Section 2, while in the second step a threshold is employed to translate these probabilities into 0/1 point predictions. In the cost/loss decision problem, the optimal threshold is based on the cost/loss ratio. For a general loss function L(Ŷ, Y), the optimal threshold minimizing the expected loss can be found by comparing two quantities, namely, the expected loss of Ŷ = 1 and that of Ŷ = 0. Denote the former by EL1 = P(Y = 1|P)L(1, 1) + (1 − P(Y = 1|P))L(1, 0) and the latter by EL0 = P(Y = 1|P)L(0, 1) + (1 − P(Y = 1|P))L(0, 0). Ŷ = 1 is optimal if and only if EL1 ≤ EL0, or

P(Y = 1|P) ≥ [L(1, 0) − L(0, 0)] / [L(1, 0) − L(0, 0) + L(0, 1) − L(1, 1)] ≡ P*.   (70)

Here we assume that making a correct forecast is beneficial and making a false forecast is costly, that is, L(0, 0) < L(1, 0) and L(1, 1) < L(0, 1). P* defined above is the optimal threshold, which is a function of the losses, and can be interpreted as the fraction of the gain from getting the forecast right when Y = 0 over the total gain of correct forecasts. Given P*, the optimal decision (or forecasting) rule is Ŷ = I(P(Y = 1|P) ≥ P*). In general, P(Y = 1|P) is unknown, and this rule is infeasible. However, suppose P is generated by one of the models in Section 2 that is correctly specified in the sense that P = P(Y = 1|Ω) for the conditioning information Ω. The law of iterated expectations then implies that P(Y = 1|P) = P, that is, P is perfectly calibrated, and so the decision rule reduces to Ŷ = I(P ≥ P*). Given a sequence of probability forecasts {Pt} of this type, this rule says that we can generate another sequence of 0/1 point forecasts {Ŷt} by simply comparing each Pt with P*. In reality, what we know is not P but its estimate P̂ from a particular binary response model, say a probit or single-index model, evaluated at a particular covariate value x. Once this model is correctly specified, the decision rule using P̂ in place of P is asymptotically optimal as well, and both yield the same expected loss as the sample size approaches infinity. Figure 19.13 illustrates a decision rule based on the probit model with threshold 0.4. From this figure, Ŷ = 1 is predicted for any observation with Φ(Xβ) ≥ 0.4, that is, for those on the right-hand side of the vertical line.
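The two-step rule can be sketched in a few lines of Python; the losses, probit coefficients, and covariate values below are all hypothetical.

```python
import math

# Step 2 threshold: losses L[(yhat, y)] with L(0,0) < L(1,0) and L(1,1) < L(0,1).
L = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 4.0, (1, 1): 0.0}
p_star = (L[(1, 0)] - L[(0, 0)]) / (
    L[(1, 0)] - L[(0, 0)] + L[(0, 1)] - L[(1, 1)])       # Eq. (70): here 0.2

def norm_cdf(z):                                          # probit link Phi(z)
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Step 1: fitted probabilities from (hypothetically estimated) probit coefficients;
# Step 2: threshold them at P*.
beta0, beta1 = -1.0, 0.8
x_values = [-1.0, 0.5, 1.5, 3.0]
forecasts = [int(norm_cdf(beta0 + beta1 * x) >= p_star) for x in x_values]
print(p_star, forecasts)
```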

Figure 19.13 Probit and linear probability models with threshold 0.4.

4.2. One-Step Approach

Manski (1975, 1985) developed a semi-parametric estimator for the binary response model, the so-called maximum score estimator (MSCORE). This differs from the other semi-parametric estimators in Section 2.1.3 in terms of the imposed assumptions. Both single-index and non-parametric additive models assume that the error in (2) is stochastically independent of X. In contrast, MSCORE only assumes that the conditional median of this error is zero, that is, med(ε|X) = 0, or median independence, which is much weaker. Manski assumed the index function to be linear in the unknown parameters β, so the full specification is akin to the parametric model in Section 2.1.1, but he relaxed the independence and distributional assumptions. Compared with other binary response models, the salient feature of Manski's semi-parametric estimator is its weak distributional assumptions. As a result, however, the conditional probability P(Y = 1|X) cannot be estimated – the price one has to pay for using less information. This is the reason why we did not discuss this model in Section 2 under "Probability Predictions."

The maximum score estimator β̂ solves the following maximization problem based on a sample {Yt, Xt}:

max_{β∈B, |β1|=1} Sms(β) ≡ (1/T) Σ_{t=1}^{T} (2Yt − 1)(2I(Xtβ ≥ 0) − 1),   (71)

where B is the permissible parameter space, |β1| is set to 1 for identification (β is identified only up to scale), and Sms(·) is the score function. Note that when Yt = 1 and Xtβ ≥ 0, or Yt = 0 and Xtβ < 0, we have (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = 1; otherwise, (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = −1. Interpreting this as the problem of using X to predict Y, the rule predicts Ŷ = 1 if, and only if, the linear predictor Xtβ is non-negative. Whenever the predicted and observed values agree, the score rises by 1/T; otherwise, it falls by the same amount. By this observation, MSCORE

attempts to estimate the optimal linear forecasting rule of the form Xβ, which maximizes the percentage of correct forecasts.
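In low dimensions, the score function (71) can be maximized by brute force. The sketch below uses simulated, hypothetical data and a plain grid search (not the algorithms used in the cited studies) to estimate a rule I(b0 + s·x ≥ 0) with the scale normalization s = ±1.

```python
import random

# Simulated data: Y = I(x + eps > 0.5), so the optimal rule is I(x >= 0.5),
# i.e. b0 = -0.5 and s = +1 under the normalization |beta_1| = 1.
random.seed(2)
T = 400
x = [random.uniform(-2, 2) for _ in range(T)]
y = [int(xt + random.gauss(0, 1) > 0.5) for xt in x]

def score(b0, s):
    # Sample score function Sms of Eq. (71) for beta = (b0, s).
    return sum((2 * yt - 1) * (2 * (b0 + s * xt >= 0) - 1)
               for xt, yt in zip(x, y)) / T

candidates = ((k / 100, s) for k in range(-200, 201) for s in (-1, 1))
best = max(candidates, key=lambda b: score(*b))
print(best)   # should pick s = +1 with b0 somewhere near -0.5
```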

Manski (1985) established strong consistency of the maximum score estimator. The rate of convergence and the asymptotic distribution were analyzed by Cavanagh (1987) and Kim and Pollard (1990), respectively. However, the score function is not continuous in the parameters, and thus the limiting distribution is too complex for carrying out statistical inference. Manski and Thompson (1986) suggested using a bootstrap to conduct inference for MSCORE, which was critically evaluated by Abrevaya and Huang (2005). Delgado et al. (2001) discussed the use of non-replacement subsampling to approximate the distribution of MSCORE. Furthermore, the convergence rate of MSCORE is T^{1/3}, which is slower than the usual √T. All of these issues restrict the application of MSCORE in empirical studies. To overcome the problem resulting from discontinuity, Horowitz (1992) proposed a smoothed version of the score function using a differentiable kernel. The resulting smoothed MSCORE is consistent and asymptotically normal with a convergence rate of at least T^{2/5}, which can be made arbitrarily close to √T under some assumptions.

Horowitz (2009) also discussed extensions of MSCORE to choice-based samples, panel data, and ordered-response models. Caudill (2003) illustrated the use of MSCORE in forecasting, taking seeding as a predictor of winning in the men's NCAA basketball tournament. He found that MSCORE tends to outperform parametric probit models in both in-sample and out-of-sample forecasts.

Manski and Thompson (1989) investigated one-step analog estimation of optimal predictors of binary response under much relaxed parametric assumptions on the response process. The loss functions they considered are quite general. The first is the class of asymmetric absolute loss functions, under which the optimal forecasting rule takes the same form as Ŷ = I(P ≥ P*). The second is the class of asymmetric square loss functions, and the last is the logarithmic loss function. Under the last two losses, however, the optimal forecasts are not 0/1-valued and thus are omitted here. A natural estimation strategy is to estimate P first, and then to obtain the point forecasts using the optimal rule, as explained in Section 4.1. Manski and Thompson (1989) instead suggested estimating the optimal binary point forecasts directly by the analogy principle, viz., the estimates of the best predictors are obtained by solving sample analogs of the prediction problem, without the need to estimate P first. The potential benefit of this one-step procedure is that it allows for a certain degree of misspecification of P. They discussed this issue in two specific binary response models, "isotonic" and "single-crossing," finding that the analog estimators for a large class of predictors are algebraically equivalent to MSCORE, and hence consistent.

Elliott and Lieli (2013) followed the same one-step approach under a general loss function. They extended Manski and Thompson's analog estimator to allow the best predictor to be non-linear in β. In MSCORE, the "rule of thumb" threshold for transforming Xβ into 0/1 binary point forecasts is 0. Note that Xβ is not the conditional probability of Y = 1 given X. However, this threshold may not be optimal for a

particular decision problem under consideration. Elliott and Lieli (2013) derived an optimal threshold based on a general utility function, which may depend on the covariates X as well. Their motivation can be explained in terms of Figure 19.13.

Suppose the true model is the probit model, but a linear probability model is fitted instead, with the fitted line shown in Figure 19.13. According to the analysis in Section 2.1.1, the estimated β is generally not consistent, and so the linear probability model would be viewed as a bad choice. Elliott and Lieli (2013) argued, however, that this need not be the case, at least in this example. Rather than concentrating on β, what matters is the optimal forecasting rule, and two different models may yield the same forecasting rule. In Figure 19.13, the optimal forecasting rule determined by the true model is: Ŷ = 1 is predicted if, and only if, X lies on the right-hand side of the vertical line – the very rule we get by using the linear predictor Xβ̂. This finding highlights the point that we do not require the model to be correctly specified in order to obtain an optimal forecasting rule. As a result, modeling binary responses for point predictions becomes much more flexible than for probability predictions. However, this gain in specification flexibility should not be overstated, since not every misspecified model will work. The key requirement is that both the working model and the true model cross the optimal threshold level at exactly the same cutoff point. The working model can behave arbitrarily elsewhere, where its predictions can even go beyond [0, 1].7 Therefore, a good working model need not be the true conditional probability model and need not have any structural interpretation. For example, β in the linear probability model in Figure 19.13 does not give the marginal effect of X on the probability of Y = 1. Elliott and Lieli concluded that the usual two-step estimation procedures, such as maximum likelihood estimation, fit the working model globally, so that the fitted model is close to the true model over the whole range of covariate values. This is unnecessary, since goodness of fit in the neighborhood of the cutoff point is all that matters. In other words, all we need is a potentially misspecified working model that fits well locally instead of globally.

To overcome this drawback of the two-step estimation approach, Elliott and Lieli (2013) incorporated utility into the estimation stage – the one-step approach initially proposed by Manski and Thompson (1989). The population problem involves maximizing expected utility by choosing a binary optimal action as a function of X, namely,

max_{a(·)} E(U(a(X), Y, X)),   (72)

where U(a, Y, X) is the utility function, depending on the binary action a (itself a function of X), the realized event Y, and the covariates X.8 After some algebraic

7 Another non-trivial requirement is that the working model must be above (below) the cutoff whenever the true model is above (below) it.

8 Elliott and Lieli suggested empirical examples where X enters into the utility function.

manipulations, (72) can be rewritten as

max_{g∈G} E(b(X)[Y + 1 − 2c(X)] sign[g(X)]),   (73)

where b(X) = U(1, 1, X) − U(−1, 1, X) + U(−1, −1, X) − U(1, −1, X) > 0, c(X) is the optimal threshold expressed as a function of utility, a(X) = sign[g(X)], and G is a collection of all measurable functions from R^k to R (note X is k-dimensional). The so-called Maximum Utility Estimator (MUE) is then obtained by solving the sample version of (73):

max_{g∈G} (1/T) ∑_{t=1}^T b(X_t)[Y_t + 1 − 2c(X_t)] sign[g(X_t)].   (74)

For implementation, g needs to be parameterized, that is, only a subclass of G is considered to reduce the estimation dimension. The estimator β̂, which maximizes the objective function

max_{β∈B} (1/T) ∑_{t=1}^T b(X_t)[Y_t + 1 − 2c(X_t)] sign[h(X_t, β)]   (75)

produces the empirical forecasting rule sign[h(X_t, β̂)].9 Under weak conditions, this empirical forecasting rule converges to the theoretically optimal rule given the model specification h(x, β). If, in addition, the model h(x, β) satisfies the stated condition for correct specification, the constrained optimal forecast is also the globally optimal forecast for all possible values of the predictors. They recommended a finite order polynomial for use in practice.

The identification issues in the Manski and Elliott and Lieli approaches are less important for prediction purposes than for structural analysis. The estimation proceeds without much worry about identification, provided alternative identification restrictions yield the same forecasting rules. Their statistical inference is built on the optimand function instead of the usual focus on β. One difficulty comes from the discontinuity of the objective function, meaning that maximization in practice cannot be undertaken by the usual gradient-based numerical optimization techniques. Elliott and Lieli employed the simulated annealing algorithm in their Monte Carlo studies, while mixed integer programming was suggested by Florios and Skouras (2007) to solve the optimization problem.
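Since the sample objective (74)–(75) is a step function of the parameters, a coarse global search illustrates the idea. The sketch below uses made-up data, a linear index h(x, β) = β_0 + β_1 x, a constant benefit b(·) ≡ 1 and threshold c(·) ≡ 0.5, so the objective reduces to the maximum score criterion mentioned in footnote 9; simulated annealing or mixed integer programming would replace the grid search in serious applications.

```python
import numpy as np

def mue_objective(beta, y, x, b=1.0, c=0.5):
    """Sample maximum-utility objective (75): the mean of
    b(X)[Y + 1 - 2c(X)] * sign(h(X, beta)), with Y coded in {-1, +1}."""
    h = beta[0] + beta[1] * x              # linear index h(x, beta)
    return np.mean(b * (y + 1 - 2 * c) * np.sign(h))

def mue_grid(y, x, grid):
    """Brute-force maximizer over a coarse grid: a stand-in for the
    simulated annealing / mixed integer programming used in practice."""
    best_beta, best_val = None, -np.inf
    for b0 in grid:
        for b1 in grid:
            val = mue_objective(np.array([b0, b1]), y, x)
            if val > best_val:
                best_beta, best_val = np.array([b0, b1]), val
    return best_beta, best_val

# made-up data: Y = 1 roughly when x > 0, coded in {-1, +1}
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.where(x + 0.5 * rng.normal(size=500) > 0, 1, -1)

beta_hat, val = mue_grid(y, x, np.linspace(-2.0, 2.0, 41))
```

Note that the objective depends on β only through sign[h(X, β)], so it is flat almost everywhere; this is why gradient-based optimizers fail and global search methods are needed.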

Lieli and Springborn (forthcoming) assessed the predictive ability of three procedures (two-step maximum likelihood, two-step Bayesian, and one-step maximum utility estimation) in deciding whether to allow novel imported goods, which may be accompanied by undesirable side effects, such as biological invasion. They used Australian data to demonstrate that a maximum utility method is likely to offer significant

9 Note that (75) with constant b(X_t), c(X_t) = 0.5, and h(X_t, β) = X_t β is equivalent to the maximum score problem. Therefore, MSCORE is a special case of this general estimator.


incremental gains relative to the other alternatives, and estimated this annual value to be $34–$49 million (AU$) under their specific loss function. This paper also extends the maximum utility model to address an endogenously stratified sample where the uncommon event is over-represented in the sample relative to the population rate, as discussed in Section 2.1.1.

Lieli and Nieto-Barthaburu (2010) generalized the above approach with a single decision maker to a more complex context where a group of decision makers has heterogeneous utility functions. They considered a public forecaster serving all decision makers by maximizing a weighted sum of individual (expected) utilities. The maximum welfare estimator was then defined through the forecaster's maximization problem, and its properties were explored. The conditions under which the traditional binary prediction methods can be interpreted asymptotically as socially optimal were given, even when the estimated model was misspecified.

4.3. Empirical Illustration

To illustrate the difference between the one-step and two-step approaches in terms of their forecasting performance, the data in Section 2.1.5 involving yield spreads and recession indicators are used here. For simplicity, the lagged indicator is removed; that is, only static models with the yield spread as the only regressor are fitted. It is well known that the best model for fitting the data is not always the best model for forecasting. The whole sample is, therefore, split into two groups. The first group, covering the period from January 1960 to December 1979, is for estimation use, while the second one, including all remaining observations, is for out-of-sample evaluation. For the conventional two-step approach, we fit a parametric probit model with a linear index. A recession for month t is predicted if and only if

Φ(β̂_0 + β̂_1 YS_{t−12}) ≥ optimal threshold,   (76)

where Φ(·) is the standard normal distribution function, YS_{t−12} is the 12-month lagged yield spread, and β̂_j, for j = 0 and 1, are the maximum likelihood estimates. For the purpose of comparison, the same model specification in (76) is fitted by Elliott and Lieli's approach under alternative loss functions. In this case, we use the same forecasting rule (76) with the β̂_j replaced by the maximum utility estimates. Two particular loss functions are analyzed here: the percentage of correct forecasts and the PSS, with 0.5 and the population probability of recession as the optimal thresholds, respectively. We take the sample proportion as the estimate of the population probability. Note that these are also the two most commonly used thresholds to translate a probability into a 0/1 value in empirical studies; see Greene (2011). The maximum utility estimates are computed using the OPTMODEL procedure in SAS 9.2.
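A minimal sketch of the two-step procedure behind (76), using simulated data in place of the actual yield-spread series (the names ys and rec are illustrative stand-ins): step one estimates the probit coefficients by maximum likelihood; step two compares the fitted probability with the chosen threshold.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_negloglik(beta, y, x):
    """Negative log-likelihood of the probit model P(Y=1|x) = Phi(b0 + b1*x)."""
    p = norm.cdf(beta[0] + beta[1] * x)
    p = np.clip(p, 1e-10, 1 - 1e-10)       # guard the logarithms
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# simulated stand-ins for the lagged yield spread and the recession indicator
rng = np.random.default_rng(1)
ys = rng.normal(size=400)
rec = ((-1.0 - 1.5 * ys + rng.normal(size=400)) > 0).astype(float)

# step one: maximum likelihood estimation of (b0, b1)
res = minimize(probit_negloglik, x0=np.zeros(2), args=(rec, ys), method="BFGS")
b0_hat, b1_hat = res.x

# step two: forecast a recession iff Phi(b0 + b1*ys) >= the chosen threshold
threshold = rec.mean()                      # sample proportion, as for the PSS rule
forecast = (norm.cdf(b0_hat + b1_hat * ys) >= threshold).astype(float)
```

Replacing `threshold` with 0.5 gives the rule associated with the percentage of correct forecasts.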


Figure 19.14 One-step vs. two-step fitted curves.

Figure 19.14 presents these fitted curves using the estimation sample, together with the two optimal thresholds.10 In contrast to the two-step maximum likelihood approach, the one-step estimates depend on the loss function of interest. When the PSS is maximized, instead of the percentage of correct forecasts, both the intercept and slope estimates change, making the fitted curve shift rightward. One noteworthy result is that both the one-step and two-step fitted curves of maximizing PSS touch the optimal threshold (0.15) in roughly the same region, despite their large gap when the yield spread is negative. According to Elliott and Lieli (2013), this implies that both are expected to yield the same forecasting rule, and thus the same value for the PSS. For the percentage of correct forecasts, the fitted curves from these two approaches are also very close to each other in the critical region, where the curves touch the optimal threshold (0.5). These results are confirmed in Table 19.5, where we summarize the in-sample goodness of fit for all fitted models. As expected, it makes no difference in terms of the objectives they attempt to maximize. For instance, the maximized PSS is 0.4882 for both the probit and MPSS. One possible reason for their equivalence in this particular example could be the correct specification of (76), i.e., that the true data generating process can be represented correctly by the probit model.11 Note that in Table 19.5, the PSS of MPC is significantly lower than those for the other two; so is the percentage of correct forecasts for MPSS. This is not surprising, as the one-step semi-parametric model is not designed to maximize it.

To correct for possible in-sample overfitting, we evaluate the fitted models using the second sample, with the results summarized in Table 19.6. Both tables convey similar information pertaining to the forecasting performance of the one-step and two-step models. In Table 19.6, the probit model still performs admirably well. In terms of the percentage of correct forecasts, it even outperforms MPC, which is constructed to maximize this

10 In Figure 19.14, MPC is the fitted curve for the maximum percentage of correct forecasts, while MPSS is the maximum PSS fitted curve.

11 In fact, a non-parametric specification test shows that the functional form in (76) cannot be rejected by the sample. Thus, the fitted probit model serves as a proxy for the unknown data generating process.


Table 19.5 In-Sample Goodness of Fit for One-Step vs. Two-Step Models

             PC        PSS
Probit     0.8625    0.4882
MPC        0.8625    0.1744
MPSS       0.7167    0.4882

Table 19.6 Out-of-Sample Evaluation for One-Step vs. Two-Step Models

             PC        PSS
Probit     0.8672    0.5854
MPC        0.8542    0.1333
MPSS       0.8229    0.5854

criterion. Given that the probit model is correctly specified, the slight superiority of the two-step approach may be due to sampling variability or to structural differences between the estimation and evaluation samples.

The relative flexibility of the one-step approach, as emphasized in Section 4.2, is that it allows for some types of misspecification which are not allowed in the two-step approach. In order to highlight this point, we fit the linear probability model (1) instead of the probit model (76). For the two-step approach, a recession for month t is predicted if and only if

β̂_0^OLS + β̂_1^OLS YS_{t−12} ≥ optimal threshold,   (77)

where β̂_j^OLS, for j = 0 and 1, are the OLS estimates. For the one-step approach, these parameters are estimated by the Elliott and Lieli method. Figure 19.15 illustrates some interesting results in this setting. Compared with the probit fitted curve, the OLS fitted line is dramatically different. However, the MUE fitted lines, based on PC and PSS, intersect the MUE fitted curves of (76) at their associated threshold values (0.5 and 0.15, respectively). Thus, MUE produces the same binary point forecasts even when the working model (77) is misspecified. Figure 19.15 shows that the lines estimated by MUE do not fit the data generating process globally very well, yet are capable of producing correct point predictions. Given that a global fit is less important than the localized problem of identifying the cutoff in the present binary point forecast context, the one-step approach with better local fit should be preferred.12

12 When we implemented the in-sample and out-of-sample evaluation exercises for the linear specification (77), we found that the linear model fitted by OLS performed worse than its MUE counterparts.


Figure 19.15 One-step vs. two-step linear fitted lines.

4.4. Classification Models in Statistical Learning

Supervised statistical learning theory is mainly concerned with predicting the value of a response variable using a few input variables (or covariates), which is similar to forecasting models in econometrics. Many binary point prediction models have been proposed in the supervised learning literature, where they are called binary classification models. This section serves as a brief introduction to a few classical classification models among them.

4.4.1. Linear Discriminant Analysis

As stated above, an optimal threshold is needed to transform the conditional probability P(Y = 1|X) into a 0/1 point prediction. The most widely used threshold is 1/2, which corresponds to a symmetric loss function, as given by Mason (2003). Given this threshold, classification simply involves a comparison of two conditional probabilities, P(Y = 1|X) and P(Y = 0|X), and the event with the larger probability is predicted accordingly. Linear discriminant analysis follows this rule but obtains P(Y = 1|X) in a different way than the usual regression-based approach. The analysis assumes that we know the marginal probability P(Y = 1) and the conditional density f(X|Y). By Bayes' rule, the conditional probability is given by

P(Y = 1|X) = P(Y = 1) f(X|Y = 1) / [P(Y = 1) f(X|Y = 1) + P(Y = 0) f(X|Y = 0)].   (78)

To simplify the analysis, hereafter a parametric assumption is imposed on the conditional density f(X|Y). The usual practice, when X is continuous, is to assume that both f(X|Y = 1) and f(X|Y = 0) are multivariate normal with different means but a common covariance matrix Σ, that is,

f(x|Y = j) = (1 / ((2π)^{k/2} |Σ|^{1/2})) exp(−(1/2)(x − μ_j) Σ^{−1} (x − μ_j)′),   (79)


where j = 1 or 0. Under this assumption, the log odds in terms of the conditional probabilities is

ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = ln [f(x|Y = 1) / f(x|Y = 0)] + ln [P(Y = 1) / P(Y = 0)]
= ln [P(Y = 1) / P(Y = 0)] − (1/2)(μ_1 + μ_0) Σ^{−1} (μ_1 − μ_0)′ + x Σ^{−1} (μ_1 − μ_0)′,   (80)

which is an equation linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic parts in the exponents. The previous classification rule amounts to determining whether (80) is positive for a given x. The decision boundary, given by setting (80) to zero, is a hyperplane in R^k, dividing the whole space into two disjoint subsets. Any given x in R^k must fall exclusively into one subset, and the classification follows in a straightforward way. To make this rule work in practice, four blocks of parameters have to be estimated from the sample: P(Y = 1), μ_1, μ_0 and Σ. This can be done easily by using their sample counterparts. To be specific, P̂(Y = 1) = T_1/T, μ̂_j = ∑_{t: Y_t = j} X_t / T_j for j = 0, 1, and Σ̂ = (∑_{t: Y_t = 1} (X_t − μ̂_1)′(X_t − μ̂_1) + ∑_{t: Y_t = 0} (X_t − μ̂_0)′(X_t − μ̂_0)) / (T − 2), where P̂ is the estimate of P and T_j is the number of observations with Y_t = j. Substituting the parameters with their estimates in the decision boundary yields the empirical classification rule. The method is called linear discriminant analysis simply because the resulting decision boundary is a hyperplane in the input vector space, which again is a consequence of the imposed assumptions. Hastie et al. (2001) derived a decision boundary described by a quadratic equation under the normality assumption with distinct covariance matrices, that is, Σ_1 ≠ Σ_0. They also extended this simplest case by considering other distributional assumptions leading to more complex decision boundaries.

Another point worth mentioning is that the log odds generated by linear discriminant analysis takes the form of a logistic specification. Specifically, the linear logistic model by construction has the linear logit

ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = β_0 + x β_1,   (81)

which is akin to (80) if

β_0 ≡ ln [P(Y = 1) / P(Y = 0)] − (1/2)(μ_1 + μ_0) Σ^{−1} (μ_1 − μ_0)′   (82)

and

β_1 ≡ Σ^{−1} (μ_1 − μ_0)′.   (83)

Therefore, the assumptions in linear discriminant analysis induce the logistic regression model, which can be estimated by maximum likelihood to get estimates for β_0 and β_1.


In this sense, both models generate the same classification rules asymptotically, in spite of the difference in their estimation methods. However, the joint distribution of Y and X is used in discriminant analysis, whereas logistic regression only uses the conditional distribution of Y given X, leaving the marginal distribution of X not explicitly specified. As a consequence, linear discriminant analysis, by relying on the additional model assumptions, is more efficient but less robust when the assumed conditional density of X given Y is not true. In situations where some of the components of X are discrete, logistic regression is a safer, more robust choice.

Maddala (1983) followed an alternative way to derive the linear discriminant boundary, which provides deep insight into what discriminant analysis actually does. Suppose that only a linear boundary is considered, for simplicity. Without loss of generality, denote it by Xλ = 0, and Y = 1 is predicted if and only if Xλ ≥ 0. What discriminant analysis does is to find the optimal value of λ according to a certain criterion. Fisher initially posed this problem as finding the λ such that the between-class variance is maximized relative to the within-class variance. The between-class variance measures how far away from each other the means of Xλ are for the two classes (Y = 1 and Y = 0, where Y is the binary point prediction); it should be maximized subject to the constraint that the variance of Xλ within each class is fixed. This makes intuitive sense in the context of classification. If the dispersion of the two means is small, or the two distributions of Xλ overlap to a large extent, it is hard to distinguish one from the other. In other words, a large proportion of observations could be misclassified. Alternatively, even if the means of the two distributions are far away from each other, they cannot be sharply distinguished unless both distributions have small variances. The optimal λ solving Fisher's problem gives the best linear decision boundary, whose analytical form is given in Maddala (1983). Mardia et al. (1979) offered a concise discussion of linear discriminant analysis. Michie et al. (1994) compared a large number of popular classifiers on benchmark datasets. Linear discriminant analysis is a simple classification model with a linear decision boundary, and subsequent developments have extended it in various directions; see Hastie et al. (2001) for details.

4.4.2. Classification Trees

As with discriminant analysis, methods based on classification trees partition the input vector space into a number of subsets on which 0/1 binary point predictions are made. Consider the case with two input variables, X_1 and X_2, both of which take values in the unit interval. Figure 19.16 presents a particular partition of the unit box.

First, subset R_1 is obtained if X_1 < t_1. For the remaining part, check whether X_2 < t_2; if so, we get R_2. Otherwise, check whether X_1 < t_3; if so, we get R_3. Otherwise, check whether X_2 < t_4; if so, we get R_4. Otherwise, we take the remainder as R_5. This process can be represented by the classification tree in Figure 19.17.

Each node on the tree represents a stage in the partition, and the number of final subsets equals the number of terminal nodes. The branch connecting two nodes gives the condition under


Figure 19.16 Partition of the unit box.

Figure 19.17 The classification tree associated with Figure 19.16.

which the upper node transits to the lower one. For example, condition X_1 < t_1 must be satisfied in order to get R_1 from the initial node. The tree shown in Figure 19.17 can be expanded further to incorporate more terminal nodes when the partition ends up with more final subsets. In general, suppose we have M subsets, R_1, R_2, …, R_M, to each of which we have assigned a unique probability, denoted p_j for j = 1, …, M. Using the optimal threshold 1/2, Y = 1 should be predicted on subset j if and only if p_j ≥ 0.5. Hence, the classification boils down to how to divide the input vector space into disjoint


subsets as shown in Figure 19.16 (or how to generate a classification tree like the one in Figure 19.17), and how to assign probabilities to them.

To introduce an algorithm to grow a classification tree, we define X as a k-dimensional input vector, with X_j as its jth element, and

R_1(j, s) ≡ {X | X_j ≤ s}, and R_2(j, s) ≡ {X | X_j > s}.   (84)

Given a sample {Y_t, X_t}, the optimal splitting variable j and split point s solve the following problem:

min_{j,s} [ min_{c_1} ∑_{X_t ∈ R_1(j,s)} (Y_t − c_1)² + min_{c_2} ∑_{X_t ∈ R_2(j,s)} (Y_t − c_2)² ].   (85)

For any fixed j and s, the optimal c_i (for i = 1 or 2) that minimizes the mean squared error is the sample proportion of Y_t = 1 within the class {X_t : X_t ∈ R_i(j, s)}. Computation of the optimal j and s can be carried out in most statistical packages without much difficulty. Having found the best split, the whole input space is divided into two subsets according to whether X_{j*} ≤ s*, where j* and s* are the optimal solutions to (85). The whole procedure is then iterated on each subset to get finer subsets, which can be partitioned further as before. In principle, this process can be repeated infinitely many times, but we have to stop it when a certain criterion is met. To this end, we define the cost complexity criterion function

C_α(T) ≡ ∑_{m=1}^{|T|} ∑_{X_t ∈ R_m} (Y_t − Ȳ_m)² + α|T|,   (86)

where T is a subtree of T_0 (a very large tree), |T| is the number of terminal nodes of T, each of which is indexed by R_m for m = 1, …, |T|, and Ȳ_m is the sample proportion of Y_t = 1 within subset R_m. The criterion is a function of α, a non-negative tuning parameter to be specified by the user. The optimal subtree T, which depends on α, should minimize C_α(T). If α = 0, the optimal T should be as large as possible and equals the upper bound T_0. Conversely, an infinitely large α forces T to be very small. This result is very intuitive. When the partition gets finer and finer, fewer and fewer observations fall into each subset. In the limit, each one would contain at most one observation, so that Ȳ_m = Y_t for each m, and the first term of C_α(T) would vanish. This also shows that, without any other constraint, the optimal partition rule tends to overfit the in-sample data. Such a rule is very unstable and inaccurate in the sense that it is sensitive to even a slight change in the sample. The optimal subtree should balance the tradeoff between stability and in-sample goodness of fit. This balance is controlled by the parameter α. Breiman et al. (1984) and Ripley (1996) outlined the details of obtaining the optimal subtree for a given α, which is determined by cross-validation.
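The split search in (85), the building block of the tree-growing algorithm described above, can be sketched as an exhaustive scan over splitting variables and observed split points (simulated data; the constant minimizing each within-subset sum of squares is the subset mean of Y):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for (j*, s*) minimizing (85); the optimal constant
    on each side is the subset mean of Y (the proportion of Y = 1)."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue                       # skip degenerate splits
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

# simulated data whose true rule splits on variable 0 at 0.6
rng = np.random.default_rng(3)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.6).astype(float)

j_star, s_star, sse = best_split(X, y)
```

Growing a tree then consists of applying this search recursively to each resulting subset, and pruning with (86) afterwards.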


Hastie et al. (2001) recommended using other measures of goodness of fit in the complexity criterion function, instead of the sample mean squared error in (86), for binary classification purposes, including the misclassification error, the Gini index, and the cross-entropy. They compared these in terms of their sensitivity to changes in the node probabilities. They also discussed cases with categorical predictors and asymmetric loss functions. For an initial introduction to classification trees, see Morgan and Sonquist (1963). Breiman et al. (1984) and Quinlan (1992) contain a general treatment of this topic.

4.4.3. Neural Networks

The neural network model is a highly non-linear supervised learning model, which seeks to approximate the regression function by combining a k-dimensional input vector in a hierarchical way via multiple hidden layers. To outline its basic idea, only a single hidden layer neural network is considered here.

As before, Y is a binary response, and X is a k-dimensional input vector to be used for classification. Let Z_1, …, Z_M be unobserved hidden units that depend on X via Z_m = σ(α_{0m} + Xα_m), for m = 1, …, M, where σ(·) is a known link function. A typical choice is σ(v) = 1/(1 + e^{−v}). Then the neural network, with Z_1, …, Z_M as the only hidden layer, can be written as

T_k = β_{0k} + Zβ_k, k = 0, 1,
P(Y = 1|X) = g(T),   (87)

where T = (T_0, T_1), Z = (Z_1, …, Z_M), P(Y = 1|X) is the conditional probability of Y = 1 given X, and g is a known function with two arguments. For a binary response, g(T) = e^{T_1}/(e^{T_0} + e^{T_1}) is often used. The above model structure is presented in Figure 19.18.

In general, there may be more than one hidden layer, so that Y depends on X in a more complex way. The model therefore allows for enhanced specification flexibility and reduced risk of misspecification. Note that there are M(k + 1) + 2(M + 1) parameters in this model that need to be estimated, and some of them may not be identified when both M and k are large. In other words, the specification is too rich to be identified. For this reason, instead of fitting the full model, only a nested model, with some parameters fixed, is estimated given a sample {Y_t, X_t}. Despite its complex structure, it is still a parametric model because the functional forms of g and σ are known a priori and only a finite set of parameters is estimated. The usual non-linear least squares, or maximum likelihood, method is used to get a consistent estimator. For the former, the objective function to be minimized is the forecast mean squared error

R(θ) = ∑_{t=1}^T (Y_t − P(Y = 1|X_t))²,   (88)


Figure 19.18 Neural networks with a single hidden layer.

whereas the likelihood function for the latter is

R(θ) = ∏_{t=1}^T P(Y = 1|X_t)^{Y_t} (1 − P(Y = 1|X_t))^{1−Y_t},   (89)

where θ is the vector of all parameters. The classification rule is that Y = 1 is predicted if, and only if, the fitted probability P̂(Y = 1|X) is no less than 0.5. Typically, the global solutions of the above problems are not desirable in that they tend to overfit the model in-sample but perform poorly out-of-sample. So, one can obtain a suboptimal solution either directly, through a penalty term added to either of the above objective functions, or indirectly, by early stopping. For computational details on neural networks, see Hastie et al. (2001), Parker (1985), and Rumelhart et al. (1986). A general introduction to neural networks is given by Ripley (1996), Hertz et al. (1991), and Bishop (1995). For a useful review of neural networks from an econometric point of view, see Kuan and White (1994). Refenes and White (1998), Stock and Watson (1999), Abu-Mostafa et al. (2001), Marcellino (2004), and Teräsvirta et al. (2005) applied neural networks in time series econometrics and forecasting.
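A minimal sketch of a single-hidden-layer network for a binary response, fitted by penalized non-linear least squares in the spirit of (88) (the ridge penalty is one common form of the penalty term mentioned above; the sketch collapses the two output units of (87) into one logistic output, and all names and the simulated data are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_prob(theta, X, M):
    """Single hidden layer: Z_m = sigma(a_0m + X a_m); the two output
    units of (87) are collapsed into one logistic output unit."""
    k = X.shape[1]
    A = theta[:M * (k + 1)].reshape(M, k + 1)   # hidden-layer weights
    b = theta[M * (k + 1):]                     # output weights (b_0, ..., b_M)
    Z = sigmoid(A[:, 0] + X @ A[:, 1:].T)       # T x M matrix of hidden units
    return sigmoid(b[0] + Z @ b[1:])

def penalized_sse(theta, X, y, M, lam=1e-3):
    """Forecast sum of squared errors as in (88) plus a ridge penalty."""
    p = nn_prob(theta, X, M)
    return np.sum((y - p) ** 2) + lam * np.sum(theta ** 2)

# simulated data with a non-linear (circular) class boundary
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1.5).astype(float)

M = 4                                           # number of hidden units
theta0 = rng.normal(scale=0.5, size=M * (X.shape[1] + 1) + M + 1)
res = minimize(penalized_sse, theta0, args=(X, y, M), method="BFGS")
yhat = (nn_prob(res.x, X, M) >= 0.5).astype(float)  # classify at threshold 1/2
```

The objective is non-convex, so the solution depends on the starting value; multiple restarts are commonly used in practice.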

5. IMPROVING BINARY PREDICTIONS

Till now, all binary probability and point predictions have been constructed from a single training sample {Y_t, X_t}, and the resulting predictions are thus subject to sampling variability. We say a binary probability/point prediction Q(x) evaluated at x is unstable if its value is sensitive to even a slight change in the training sample from which it is derived. The lack of stability is especially severe in cases of small training samples and


highly non-linear forecasting models. If Q(x) varies a lot, it is hardly reliable, as one may get a completely different predicted value when a different training sample is used. In other words, the variance of the forecast error would be extremely large for an unstable prediction. To improve forecast performance and reduce the uncertainty associated with an unstable binary forecast, combining multiple individual forecasts of the same event has been suggested; see Bates and Granger (1969), Deutsch et al. (1994), Granger and Jeon (2004), Stock and Watson (1999, 2005), Yang (2004), and Timmermann (2006). The motivation for forecast combination is much like the use of the sample mean, instead of a single observation, as an unbiased estimator of the population mean: taking an average reduces the variance without affecting unbiasedness. Let us consider the usual criterion of mean squared error for forecast evaluation. Denote an individual binary forecast by Q(x, L), where x is the evaluation point of interest and L is the training sample {Y_t, X_t} (for t = 1, …, T) from which Q(x, L) is constructed. The mean squared error of an individual forecast is

e_l ≡ E_L E_{Y,X} (Y − Q(X, L))².   (90)

Suppose we can draw N random samples {L_i}, each of size T, from the joint distribution f(Y, X). Then the combined forecast Q_A(x) ≡ (1/N) ∑_{i=1}^N Q(x, L_i) gets close to the population average when N is very large, that is,

Q_A(x) ≈ E_L Q(x, L).   (91)

The mean squared error associated with this combined forecast is thus

e_a ≡ E_{Y,X} (Y − Q_A(X))².   (92)

Now using Jensen’s inequality, we have

e_l = E_{Y,X} Y² − 2 E_{Y,X} Y Q_A(X) + E_{Y,X} E_L (Q(X, L))²
≥ E_{Y,X} Y² − 2 E_{Y,X} Y Q_A(X) + E_{Y,X} (Q_A(X))²
= E_{Y,X} (Y − Q_A(X))²
= e_a.   (93)

Thus, the combined forecast has a lower mean squared error than any individual forecast, and the magnitude of improvement depends on E_L(Q(X, L))² − (E_L Q(X, L))² = Var_L(Q(X, L)), which is the variance of the individual forecasts due to the uncertainty of the training sample and measures forecast stability. Substantial instability leaves more room for improvement through forecast combination. Generally speaking, small training samples and high non-linearity in the forecasting model are the two main sources of instability, and forecast combination can help a lot under these circumstances. Section 5.1 deals with the case where multiple binary forecasts of the same event are available and the combination to be carried out is straightforward. The bootstrap aggregating technique is used when we only have a single training set.
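The inequality e_l ≥ e_a in (90)–(93) can be checked by simulation when, as assumed above, many training samples can be drawn from the joint distribution. The sketch below uses a deliberately unstable individual forecast (a 1-nearest-neighbor rule on a small sample); the empirical counterpart of (93) then holds exactly, by the same variance decomposition:

```python
import numpy as np

rng = np.random.default_rng(5)
T, N, n_eval = 25, 200, 2000                     # small T makes Q(x, L) unstable

# evaluation draws from the joint distribution f(Y, X)
x_eval = rng.normal(size=n_eval)
y_eval = (x_eval + rng.normal(size=n_eval) > 0).astype(float)

def q_forecast(y_train, x_train, x0):
    """Q(x, L): a 1-nearest-neighbour probability forecast."""
    idx = np.argmin(np.abs(x_train[None, :] - x0[:, None]), axis=1)
    return y_train[idx]

preds = np.empty((N, n_eval))
for i in range(N):                               # N independent training samples L_i
    x_tr = rng.normal(size=T)
    y_tr = (x_tr + rng.normal(size=T) > 0).astype(float)
    preds[i] = q_forecast(y_tr, x_tr, x_eval)

e_l = np.mean((y_eval[None, :] - preds) ** 2)    # average individual MSE, as in (90)
q_a = preds.mean(axis=0)                         # combined forecast Q_A, as in (91)
e_a = np.mean((y_eval - q_a) ** 2)               # combined MSE, as in (92)
```

With a single training set, the same averaging is done over bootstrap resamples, which is the bagging idea of Section 5.2.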


5.1. Combining Binary Predictions

Sometimes more than one binary prediction is available for the same target. A typical example is the SPF probability forecasts of real GDP declines, where approximately 40−50 individual forecasters issue their subjective probability judgements in each survey about real GDP declines in the current and each of the next four quarters. In these instances, individual forecasters might give diverse probability assessments of a future event, but none of them makes effective use of all available information. Besides, the forecasts are likely to fluctuate over time and across individuals. Stimulated by concerns about instability, a number of combination methods have been suggested. However, the combination methods should not be arbitrary or simplistic. Cases of combined forecasts that have performed worse than individual forecasts have been documented in the literature; see Ranjan and Gneiting (2010) for a good example. In this light, an effort to search for the optimal combination method is desired. Here, the main focus is on combining probability forecasts instead of point forecasts. As for the latter, there is already a large literature in computer science under the title of multiple classifier systems (MCS); see Kuncheva (2004) for a textbook treatment.

The optimal combination of probability forecasts is discussed in a probabilistic context where the joint distribution of the observation and the multiple individual forecasts is

f(Y, P_1, P_2, …, P_M),   (94)

where P_m for m = 1, …, M is the mth individual probability forecast of the binary event Y. The derivation of the optimal combination in the framework of the joint distribution unifies various separate combination techniques in that it allows for more general assumptions on observations and forecasts. For example, the P_m may be contemporaneously correlated with each other, which is very common, as individual forecasts are often based on similar information sets. Serial correlation of observations and forecasts is also allowed. Moreover, individual forecasts may come from econometric models, subjective judgements, or both. As shown in Section 3, there are many competing criteria or scores to measure the skill or accuracy of probability forecasts. As a consequence, one may expect that optimal combination rules depend on the adopted score, and thereby that no universal combination rule exists. Fortunately, the situation is not as hopeless as it seems, as long as the score is proper. Denote the proper score by S(Y, P), a function of the realized event and the probability forecast, and the conditional probability of Y = 1, given all individual forecasts, by P ≡ P(Y = 1|P_1, P_2, …, P_M). Ranjan and Gneiting (2010) proved that P, as a function of the individual forecasts, is the optimal combined forecast in the sense that its expected score is the smallest among all candidates, provided the score is proper. To see this, note that the expected score of P is given by

E(S(Y, P)) = E(E(S(Y, P)|P_1, P_2, …, P_M))
= E(P S(1, P) + (1 − P) S(0, P))
≤ E(P S(1, f(P_1, P_2, …, P_M)) + (1 − P) S(0, f(P_1, P_2, …, P_M)))
= E(E(S(Y, f(P_1, P_2, …, P_M))|P_1, P_2, …, P_M))
= E(S(Y, f(P_1, P_2, …, P_M))),   (95)

where f(P1, P2, . . . , PM) is any measurable function of (P1, P2, . . . , PM), that is, an alternative combined forecast. The inequality above uses the fact that S(Y, P) is a negatively oriented proper scoring rule. This result says that taking P as the combined forecast always wins, irrespective of the possible dependence structures. A specific combination rule, such as the widely used linear opinion pool (OLP), in which f(P1, P2, . . . , PM) = ∑_{m=1}^{M} wm Pm and the wm are non-negative weights satisfying ∑_{m=1}^{M} wm = 1,13 performs well only if it is close to the optimal P. A large number of specific rules have been developed, each of which is valid under its own assumptions. As a result, a specific rule may succeed if its assumptions roughly hold in practice, but fail when the data generating process violates these assumptions. For example, a rule ignoring the dependence structure among individual forecasts may perform poorly if they are highly correlated with each other. For details of various specific combination rules, see Genest and Zidek (1986), Clemen (1989), Diebold and Lopez (1997), Graham (1996), Wallsten et al. (1997), Clemen and Winkler (1986, 1999, 2007), Timmermann (2006), and Primo et al. (2009).
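The inequality in (95) can be checked numerically. The following sketch (a synthetic simulation with the Brier score as the proper score; the simulation design is an illustrative assumption, not from the chapter) restricts two forecasters to a coarse probability grid so that the conditional probability P(Y = 1|P1, P2) can be estimated by within-cell averages, and compares its score with that of an equally weighted linear opinion pool:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

theta = rng.uniform(size=T)                    # latent P(Y = 1) for each occasion
y = (rng.uniform(size=T) < theta).astype(float)

# Two correlated forecasters reporting on the grid {0.05, 0.1, ..., 0.9, 0.95}
p1 = np.clip(np.round((theta + rng.normal(0, 0.10, T)) * 10) / 10, 0.05, 0.95)
p2 = np.clip(np.round((theta + rng.normal(0, 0.20, T)) * 10) / 10, 0.05, 0.95)

def brier(p):
    """Mean Brier score, a negatively oriented proper score."""
    return np.mean((y - p) ** 2)

olp = 0.5 * p1 + 0.5 * p2                      # linear opinion pool, equal weights

# Empirical P(Y = 1 | P1, P2): average outcome within each (p1, p2) cell
key = p1 * 100 + p2                            # unique code per forecast pair
cond = np.empty(T)
for k in np.unique(key):
    m = key == k
    cond[m] = y[m].mean()

print(brier(olp), brier(cond))                 # conditional probability scores no worse
```

Because the pool is constant within each (p1, p2) cell and the cell mean minimizes within-cell squared error, brier(cond) ≤ brier(olp) holds by construction in this sample, mirroring the derivation of (95).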

In general, the functional form of this conditional probability P is unknown and needs to be estimated from the sample {Yt, P1t, P2t, . . . , PMt} for t = 1, . . . , T, which is the usual practice in econometrics, noting that P is nothing more than a conditional probability. All methods covered in Section 2 will work here. The most robust estimation approach is non-parametric regression, even though it is subject to the “curse of dimensionality” when a large number of individual forecasts need to be combined. Ranjan and Gneiting (2010) recommended the beta-transformed linear opinion pool (BLP) to reduce the estimation dimension while retaining flexibility in the specification. BLP is akin to the parametric model (2) with a linear index and the beta distribution as its link function, that is,

P(Y = 1|P1, P2, . . . , PM) = Bα,β(∑_{m=1}^{M} wm Pm),   (96)

where Bα,β(·) is the distribution function of the beta density with two parameters α > 0 and β > 0. The number of unknown parameters, including α and β, is M + 2. Ranjan and Gneiting (2010) showed that BLP reduces to OLP when α = β = 1. All parameters can be estimated by maximum likelihood given a sample, and the validity of OLP can thus be verified by a likelihood ratio test. They examined the properties of BLP, comparing it with OLP and each individual forecast in terms of calibration and refinement. They found that a correctly specified BLP, necessarily calibrated by construction, is a recalibration of OLP, which may not be calibrated even if the individual forecasts are. The empirical version of BLP, based on a sample, performs equally well compared with the optimal P. Using SPF forecasts, Lahiri et al. (2012) find that the procedure works reasonably well in practice.

13 That is, f(P1, P2, . . . , PM) is a convex combination of individual forecasts. Note that the linearity of P is possible as each Pm lies in the unit interval, and so does the convex combination.
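A minimal maximum-likelihood sketch of the BLP model (96) might look as follows. The softmax reparameterization of the weights and the derivative-free optimizer are implementation choices made here for convenience, not part of the original proposal; the function name fit_blp and the synthetic data are likewise illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from scipy.stats import beta

def fit_blp(y, P):
    """Fit P(Y=1 | P1..PM) = B_{a,b}(sum_m w_m P_m) by maximum likelihood.
    y: (T,) array of 0/1 outcomes; P: (T, M) array of individual forecasts."""
    T, M = P.shape

    def negloglik(theta):
        w = softmax(theta[:M])                         # weights on the unit simplex
        a, b = np.exp(theta[M]), np.exp(theta[M + 1])  # alpha, beta > 0
        q = np.clip(beta.cdf(P @ w, a, b), 1e-10, 1 - 1e-10)
        return -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))

    theta0 = np.zeros(M + 2)                           # start at OLP: alpha = beta = 1
    res = minimize(negloglik, theta0, method="Nelder-Mead",
                   options={"maxiter": 5000})
    return softmax(res.x[:M]), np.exp(res.x[M]), np.exp(res.x[M + 1])

# Synthetic check with two individual forecasts
rng = np.random.default_rng(1)
theta = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < theta).astype(float)
P = np.clip(np.column_stack([theta + rng.normal(0, 0.1, 5000),
                             theta + rng.normal(0, 0.2, 5000)]), 0.01, 0.99)
w, a, b = fit_blp(y, P)
print(w, a, b)
```

Since α = β = 1 reduces (96) to OLP, the fitted (a, b) can be compared against (1, 1) with a likelihood ratio test, as noted above.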

5.2. Bootstrap Aggregating

Bootstrap aggregating, or bagging, is a forecast combination approach proposed by Breiman (1996) in the machine learning literature for the case when only a single training sample is available. The basic intuition is to average the individual predictions generated from each bootstrap sample so as to reduce the variance of the unbagged prediction without affecting its bias. Like the usual forecast combination approach, bagging is useful only if the sample size is not large and the forecasting model is highly non-linear. Typical examples where forecasts can be improved significantly by bagging include classification trees and neural networks. But bagging does not seem to work well in linear discriminant analysis and k-nearest neighbor methods; see Friedman and Hall (2007), Buja and Stuetzle (2006), and Bühlmann and Yu (2002) for further discussion of this issue. A striking result is that bagged predictors can perform even worse than unbagged predictors in terms of certain criteria, as shown in Hastie et al. (2001). Though it is not useful for all problems at hand, its ability to stabilize a binary classifier has been supported in the machine learning literature, as documented by Bauer and Kohavi (1999), Kuncheva and Whitaker (2003), and Evgeniou et al. (2004). Lee and Yang (2006) demonstrated that bagged predictors outperform unbagged predictors even under asymmetric loss functions, instead of the usual mean squared error. They also established the conditions under which bagging is successful.

Bootstrap aggregating starts by resampling {Yt, Xt} via bootstrap to get B bootstrap samples. The binary forecasts, with fixed evaluation point x, are then constructed from each bootstrap sample to get a set {Q(x, Li)}, where Li is the ith bootstrap sample. The bagged predictor is calculated as the weighted average of {Q(x, Li)}, where

Qb(x, L) ≡ (1/B) ∑_{i=1}^{B} wi Q(x, Li),   (97)

and wi is the non-negative weight attached to the ith bootstrap sample Li, satisfying the usual constraint ∑_{i=1}^{B} wi = 1. The bagged predictor Qb(x, L) depends on the original sample L, as resampling is based on the empirical distribution of L. There are a few points to be clarified for its implementation. First, an appropriate bootstrap method should be used depending on the context. For example, the non-parametric bootstrap is the natural choice for independent data, and the parametric bootstrap is more efficient when the data generating process of L is known up to a finite-dimensional parameter vector. For time series or other dependent data, the block bootstrap can provide a sound simulation sample, as illustrated by Lee and Yang (2006). Second, for probability prediction, the predictor Qb(x, L) is directly usable, as its value must be between zero and one if each Q(x, Li) is.


However, this is not the case for binary point prediction, as Qb(x, L) is not 0/1-valued even if each Q(x, Li) is. In this context, a usual rule is so-called majority voting, where Qb(x, L) predicts whichever outcome is predicted more often in {Q(x, Li)}. This is equivalent to taking 1/2 as the threshold, that is, using I(Qb(x, L) ≥ 1/2) as the bagged predictor.14 Third, the BLP combination method in Section 5.1 can be used here, provided its parameters can be estimated from the bootstrap samples. Finally, the choice of B depends on the original sample size, computational capacity, and model structure in a complex way. Lee and Yang (2006) showed that B = 50 is more than sufficient to get a stable bagged predictor, and even B = 20 is good enough in some cases in their empirical example. For other applications of bootstrap aggregating in econometrics, interested readers are referred to Kitamura (2001), Inoue and Kilian (2008), and Stock and Watson (2005).
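The bagging recipe above can be sketched for independent data with equal weights wi = 1/B and majority voting. The linear probability base classifier below is purely illustrative (the chapter's leading examples are trees and neural networks), and the function names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def base_predict(X_train, y_train, x_eval):
    """Unstable 0/1 base classifier: threshold a fitted linear probability rule."""
    Xd = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(Xd, y_train, rcond=None)
    return float(np.concatenate(([1.0], x_eval)) @ coef >= 0.5)

def bagged_predict(X, y, x_eval, B=50):
    """Bagged binary point prediction at x_eval via (97) with wi = 1/B."""
    T = len(y)
    votes = np.empty(B)
    for i in range(B):
        idx = rng.integers(0, T, size=T)     # non-parametric bootstrap sample L_i
        votes[i] = base_predict(X[idx], y[idx], x_eval)
    qb = votes.mean()                        # Qb(x, L), the vote share
    return int(qb >= 0.5), qb                # majority voting: I(Qb >= 1/2)

# Synthetic training sample
X = rng.normal(size=(200, 2))
y = ((X @ np.array([1.0, -1.0]) + rng.normal(0, 1, 200)) > 0).astype(float)
label, share = bagged_predict(X, y, x_eval=np.array([2.0, -2.0]))
print(label, share)
```

For probability prediction, the vote share qb itself (here an equally weighted average of 0/1 predictions) can be reported directly, in line with the discussion above.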

6. CONCLUSION

In this chapter, we discussed the specification, estimation, and evaluation of binary response models in a unified framework from the standpoint of forecasting. In a stochastic setting, generating the probability of the occurrence of an event with binary outcomes boils down to the specification and estimation of the conditional expectation or the regression function. In this process, conventional non-linear econometric modeling approaches play a dominant role. Specification designed for the limited range of the response distinguishes models for binary dependent variables from those for continuous predictands. Therefore, the validity of transformations like the probit link function becomes an issue in modeling binary events for forecasting.

Two types of forecasts for binary events are distinguished in this chapter: probability forecasts and point forecasts. There is no universal answer as to which one is better. The value score analysis in Section 3.1.2 justifies the use of probability forecasts, as it allows for heterogeneity in the loss functions of the end users in decision making. However, if the working model is misspecified, the point forecast based on a one-step approach that integrates estimation and forecasting may be superior, provided a loss function has been properly chosen. Moreover, in many regulatory environments, there are mandates for the issuance of only binary forecasts.

The joint distribution of forecasts and actuals embodies the basic ingredients required for the evaluation of forecast skill. All existing scoring rules and graphical approaches essentially reflect certain attributes of this joint distribution. Since no single evaluation tool provides a complete measure of skill for forecasting binary events, the use of a battery of such measures is recommended to assess the skill more comprehensively.

14 Hastie et al. (2001) suggested another way to make a binary point prediction when a probability prediction can be obtained at evaluation point x: the bagged probability predictor is derived by (97) and then transformed to a 0/1 value according to the threshold. They argued that, compared with the first procedure, this approach ends up with a bagged predictor having lower variance, especially for small B.


As a general rule, measures not influenced by the marginal information regarding the actuals are preferred. Many examples fall into this category, such as the OR, PSS, or ROC. Compared with those commonly used in practice, the tools within this category are more likely to capture the true forecast skill. In circumstances where the event under consideration is rare or relatively uncommon, the marginal probability of the occurrence of the event may confound the true skill if it is not isolated from the score. The usual methods for assessing the goodness of fit of a binary regression model, such as the pseudo R2 or the percentage of correct predictions, do not adjust for the asymmetry of the response variable. We have also emphasized the need for reporting sampling errors of these statistics. In this regard, there is substantial room for improvement in current econometric practice.

Given that we have introduced a wide range of models and methods for forecasting binary outcomes, a natural question is which ones should be used in a particular situation. It appears that complex models that often fit better in-sample tend not to do well out-of-sample. The three classification models in Section 4.4 illustrate this point well. Simple models like discriminant analysis with a linear boundary or neural networks with a single hidden layer often do very well in out-of-sample forecasting exercises. This also explains why forecast combination usually works when the individual forecasts come from complex non-linear models. When multiple forecasts of the same binary event are available, the skill performance of any single forecast can potentially be improved when it is combined with other individual forecasts efficiently. Here again, the optimal combination scheme should be derived from the joint distribution of forecasts and actuals. When only a single training sample is available and the individual forecasts based on it are highly unstable, bagging is an attractive way to reduce the forecast variance and improve the forecast skill.

It is virtually impossible that a forecast with extremely low skill would satisfy the needs of a forecast user. Only forecasts that enjoy at least a moderate amount of skill can be of some value in guiding the decision-making process. It is also possible that a forecast that is skillful on the basis of a particular criterion may not be useful at all in another decision-making context. Knowing the joint distribution is not enough for the purpose of evaluating the usefulness of a forecast from the perspective of a user – the loss function connecting forecasts and realizations needs to be considered as well. The binary point prediction discussed in Section 4 is a prime example, where a 0/1 forecast is made by implicitly or explicitly relying on a threshold value that is determined by a presumed loss function. In some specific contexts, certain skill scores are directly linked to the value of the end user. One such example is that, under certain circumstances, the highest achievable value score is the PSS, as shown in Section 3.2.3. Without any knowledge of the joint distribution of forecasts and realizations, we do not know the nature of the uncertainty facing us. However, even with knowledge of the joint distribution, without information regarding the loss function, we would not know how to balance the expected gains and losses under different forecasting scenarios for making decisions under uncertainty. For a truly successful forecasting system, we need both.

ACKNOWLEDGMENTS

We are indebted to the Editors, two anonymous referees, and the participants of the Handbook Conference at the St. Louis Fed for their constructive comments on an earlier version of this chapter. We are also grateful to Antony Davies, Arturo Estrella, Terry Kinal, Massimiliano Marcellino, and Yongchen Zhao for their help. Much of the revision of this chapter was completed when Kajal Lahiri was visiting the European University Institute as a Fernand Braudel Senior Fellow during 2012. The responsibility for all remaining errors and omissions is ours.

REFERENCES

Abrevaya, J., Huang, J., 2005. On the bootstrap of the maximum score estimator. Econometrica 73, 1175–1204.
Abu-Mostafa, Y.S., Atiya, A.F., Magdon-Ismail, M., White, H., 2001. Introduction to the special issue on neural networks in financial engineering. IEEE Transactions on Neural Networks 12, 653–656.
Agresti, A., 2007. An Introduction to Categorical Data Analysis. John Wiley & Sons.
Ai, C., Li, Q., 2008. Semi-parametric and non-parametric methods in panel data models. In: Mátyás, L., Sevestre, P. (Eds.), The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice. Springer, pp. 451–478.
Albert, J., 2009. Bayesian Computation with R. Springer.
Albert, J.H., Chib, S., 1993. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679.
Amemiya, T., 1985. Advanced Econometrics. Harvard University Press.
Amemiya, T., Vuong, Q.H., 1987. A comparison of two consistent estimators in the choice-based sampling qualitative response model. Econometrica 55, 699–702.
Anatolyev, S., 2009. Multi-market direction-of-change modeling using dependence ratios. Studies in Nonlinear Dynamics & Econometrics 13 (Article 5).
Andersen, E.B., 1970. Asymptotic properties of conditional maximum-likelihood estimators. Journal of the Royal Statistical Society, Series B 32, 283–301.
Arellano, M., Carrasco, R., 2003. Binary choice panel data models with predetermined variables. Journal of Econometrics 115, 125–157.
Baltagi, B.H., forthcoming. Panel data forecasting. In: Timmermann, A., Elliott, G. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam.
Bates, J.M., Granger, C.W.J., 1969. The combination of forecasts. Operational Research Quarterly 20, 451–468.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36, 105–139.
Berge, T.J., Jordà, Ò., 2011. Evaluating the classification of economic activity into recessions and expansions. American Economic Journal: Macroeconomics 3, 246–277.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
Blaskowitz, O., Herwartz, H., 2008. Testing Directional Forecast Value in the Presence of Serial Correlation. Humboldt University, Collaborative Research Center 649, SFB 649 Discussion Papers.
Blaskowitz, O., Herwartz, H., 2009. Adaptive forecasting of the EURIBOR swap term structure. Journal of Forecasting 28, 575–594.
Blaskowitz, O., Herwartz, H., 2011. On economic evaluation of directional forecasts. International Journal of Forecasting 27, 1058–1065.
Bontemps, C., Racine, J.S., Simioni, M., 2009. Non-Parametric vs Parametric Binary Choice Models: An Empirical Investigation. Toulouse School of Economics, TSE Working Paper 09-126.


Braun, P.A., Yaniv, I., 1992. A case study of expert judgment: economists' probabilities versus base-rate model forecasts. Journal of Behavioral Decision Making 5, 217–231.
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123–140.
Breiman, L., Friedman, J., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Chapman & Hall.
Brier, G.W., 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1–3.
Bühlmann, P., Yu, B., 2002. Analyzing bagging. Annals of Statistics 30, 927–961.
Buja, A., Stuetzle, W., 2006. Observations on bagging. Statistica Sinica 16, 323–351.
Bull, S.B., Greenwood, C.M.T., Hauck, W.W., 1997. Jackknife bias reduction for polychotomous logistic regression. Statistics in Medicine 16, 545–560.
Carroll, R.J., Ruppert, D., Welsh, A.H., 1998. Local estimating equations. Journal of the American Statistical Association 93, 214–227.
Caudill, S.B., 2003. Predicting discrete outcomes with the maximum score estimator: the case of the NCAA Men's Basketball Tournament. International Journal of Forecasting 19, 313–317.
Cavanagh, C.L., 1987. Limiting Behavior of Estimators Defined by Optimization. Unpublished manuscript, Department of Economics, Harvard University.
Chamberlain, G., 1980. Analysis of covariance with qualitative data. Review of Economic Studies 47, 225–238.
Chamberlain, G., 1984. Panel data. In: Griliches, Z., Intrilligator, M.D. (Eds.), Handbook of Econometrics. North-Holland, Amsterdam, pp. 1248–1318.
Chauvet, M., Potter, S., 2005. Forecasting recessions using the yield curve. Journal of Forecasting 24, 77–103.
Chib, S., 2008. Panel data modeling and inference: a Bayesian primer. In: Mátyás, L., Sevestre, P. (Eds.), The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice. Springer, pp. 479–515.
Clark, T.E., McCracken, M.W., forthcoming. Advances in forecast evaluation. In: Timmermann, A., Elliott, G. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam.
Clemen, R.T., 1989. Combining forecasts: a review and annotated bibliography. International Journal of Forecasting 5, 559–583.
Clemen, R.T., Winkler, R.L., 1986. Combining economic forecasts. Journal of Business & Economic Statistics 4, 39–46.
Clemen, R.T., Winkler, R.L., 1999. Combining probability distributions from experts in risk analysis. Risk Analysis 19, 187–203.
Clemen, R.T., Winkler, R.L., 2007. Aggregating probability distributions. In: Edwards, W., Miles, R.F., von Winterfeldt, D. (Eds.), Advances in Decision Analysis: From Foundations to Applications. Cambridge University Press, pp. 154–176.
Clements, M.P., 2006. Evaluating the Survey of Professional Forecasters probability distributions of expected inflation based on derived event probability forecasts. Empirical Economics 31, 49–64.
Clements, M.P., 2008. Consensus and uncertainty: using forecast probabilities of output declines. International Journal of Forecasting 24, 76–86.
Clements, M.P., 2011. An empirical investigation of the effects of rounding on the SPF probabilities of decline and output growth histograms. Journal of Money, Credit and Banking 43, 207–220.
Cortes, C., Mohri, M., 2005. Confidence intervals for the area under the ROC curve. Advances in Neural Information Processing Systems (NIPS 2004).
Cosslett, S.R., 1993. Estimation from endogenously stratified samples. In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics 11 (Econometrics). North-Holland, Amsterdam, pp. 1–44.
Cramer, J.S., 1999. Predictive performance of the binary logit model in unbalanced samples. Journal of the Royal Statistical Society, Series D 48, 85–94.
Croushore, D., 1993. Introducing: the Survey of Professional Forecasters. Federal Reserve Bank of Philadelphia Business Review, November/December, 3–13.
Dawid, A.P., 1984. Present position and potential developments: some personal views: statistical theory: the prequential approach. Journal of the Royal Statistical Society, Series A 147, 278–292.
Delgado, M.A., Rodríguez-Póo, J.M., Wolf, M., 2001. Subsampling inference in cube root asymptotics with an application to Manski's maximum score estimator. Economics Letters 73, 241–250.


Deutsch, M., Granger, C.W.J., Teräsvirta, T., 1994. The combination of forecasts using changing weights. International Journal of Forecasting 10, 47–57.
Diebold, F.X., 2006. Elements of Forecasting. South-Western College.
Diebold, F.X., Lopez, J.A., 1997. Forecast evaluation and combination. In: Maddala, G.S., Rao, C.R. (Eds.), Handbook of Statistics 14 (Statistical Methods in Finance). North-Holland, Amsterdam, pp. 241–268.
Diebold, F.X., Mariano, R.S., 1995. Comparing predictive accuracy. Journal of Business & Economic Statistics 13, 253–263.
Donkers, B., Melenberg, B., 2002. Testing Predictive Performance of Binary Choice Models. Erasmus School of Economics, Econometric Institute Research Papers.
Egan, J.P., 1975. Signal Detection Theory and ROC Analysis. Academic Press.
Elliott, G., Lieli, R.P., 2013. Predicting binary outcomes. Journal of Econometrics 174, 15–26.
Engelberg, J., Manski, C.F., Williams, J., 2011. Assessing the temporal variation of macroeconomic forecasts by a panel of changing composition. Journal of Applied Econometrics 26, 1059–1078.
Engle, R.F., 2000. The econometrics of ultra-high-frequency data. Econometrica 68, 1–22.
Engle, R.F., Russell, J.R., 1997. Forecasting the frequency of changes in quoted foreign exchange prices with the ACD model. Journal of Empirical Finance 12, 187–212.
Engle, R.F., Russell, J.R., 1998. Autoregressive conditional duration: a new model for irregularly spaced transaction data. Econometrica 66, 1127–1162.
Estrella, A., 1998. A new measure of fit for equations with dichotomous dependent variables. Journal of Business & Economic Statistics 16, 198–205.
Estrella, A., Mishkin, F.S., 1998. Predicting US recessions: financial variables as leading indicators. The Review of Economics and Statistics 80, 45–61.
Evgeniou, T., Pontil, M., Elisseeff, A., 2004. Leave one out error, stability, and generalization of voting combinations of classifiers. Machine Learning 55, 71–97.
Faraggi, D., Reiser, B., 2002. Estimation of the area under the ROC curve. Statistics in Medicine 21, 3093–3106.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 861–874.
Florios, K., Skouras, S., 2007. Computation of Maximum Score Type Estimators by Mixed Integer Programming. Department of International and European Economic Studies, Athens University of Economics and Business, Working Paper.
Friedman, J.H., Hall, P., 2007. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference 137, 669–683.
Frölich, M., 2006. Non-parametric regression for binary dependent variables. Econometrics Journal 9, 511–540.
Galbraith, J.W., van Norden, S., 2012. Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts. Journal of the Royal Statistical Society, Series A 175, 713–727.
Gandin, L.S., Murphy, A.H., 1992. Equitable skill scores for categorical forecasts. Monthly Weather Review 120, 361–370.
Genest, C., Zidek, J.V., 1986. Combining probability distributions: a critique and an annotated bibliography. Statistical Science 1, 114–135.
Gneiting, T., 2011. Making and evaluating point forecasts. Journal of the American Statistical Association 106, 746–762.
Gneiting, T., Balabdaoui, F., Raftery, A.E., 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society, Series B 69, 243–268.
Gneiting, T., Raftery, A.E., 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
Gourieroux, C., Monfort, A., 1993. Simulation-based inference: a survey with special reference to panel data models. Journal of Econometrics 59, 5–33.
Gozalo, P., Linton, O., 2000. Local nonlinear least squares: using parametric information in nonparametric regression. Journal of Econometrics 99, 63–106.
Gradojevic, N., Yang, J., 2006. Non-linear, non-parametric, non-fundamental exchange rate forecasting. Journal of Forecasting 25, 227–245.
Graham, J.R., 1996. Is a group of economists better than one? Than none? Journal of Business 69, 193–232.


Grammig, J., Kehrle, K., 2008. A new marked point process model for the federal funds rate target: methodology and forecast evaluation. Journal of Economic Dynamics and Control 32, 2370–2396.
Granger, C.W.J., Jeon, Y., 2004. Thick modeling. Economic Modeling 21, 323–343.
Granger, C.W.J., Pesaran, M.H., 2000a. A decision-theoretic approach to forecast evaluation. In: Chan, W.S., Li, W.K., Tong, H. (Eds.), Statistics and Finance: An Interface. Imperial College Press, pp. 261–278.
Granger, C.W.J., Pesaran, M.H., 2000b. Economic and statistical measures of forecast accuracy. Journal of Forecasting 19, 537–560.
Greene, W.H., 2011. Econometric Analysis. Prentice Hall.
Greer, M.R., 2005. Combination forecasting for directional accuracy: an application to survey interest rate forecasts. Journal of Applied Statistics 32, 607–615.
Griffiths, W.E., Hill, R.C., Pope, P.J., 1987. Small sample properties of probit model estimators. Journal of the American Statistical Association 82, 929–937.
Hamilton, J.D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57, 357–384.
Hamilton, J.D., 1990. Analysis of time series subject to changes in regime. Journal of Econometrics 45, 39–70.
Hamilton, J.D., 1993. Estimation, inference and forecasting of time series subject to changes in regime. In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics 11 (Econometrics). North-Holland, Amsterdam, pp. 231–260.
Hamilton, J.D., 1994. Time Series Analysis. Princeton University Press.
Hamilton, J.D., Jordà, Ò., 2002. A model of the federal funds rate target. Journal of Political Economy 110, 1135–1167.
Hao, L., Ng, E.C.Y., 2011. Predicting Canadian recessions using dynamic probit modelling approaches. Canadian Journal of Economics 44, 1297–1330.
Harding, D., Pagan, A., 2011. An econometric analysis of some models for constructed binary time series. Journal of Business & Economic Statistics 29, 86–95.
Härdle, W., Stoker, T.M., 1989. Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association 84, 986–995.
Harvey, D., Leybourne, S., Newbold, P., 1997. Testing the equality of prediction mean squared errors. International Journal of Forecasting 13, 281–291.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Heckman, J.J., 1981. The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete data stochastic process and some Monte-Carlo evidence. In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data. MIT Press, pp. 179–195.
Hertz, J., Krogh, A., Palmer, R.G., 1991. Introduction to the Theory of Neural Computation. Westview Press.
Horowitz, J.L., 1992. A smoothed maximum score estimator for the binary response model. Econometrica 60, 505–531.
Horowitz, J.L., 2009. Semi-Parametric and Nonparametric Methods in Econometrics. Springer.
Horowitz, J.L., Mammen, E., 2004. Non-parametric estimation of an additive model with a link function. Annals of Statistics 32, 2412–2443.
Horowitz, J.L., Mammen, E., 2007. Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Annals of Statistics 35, 2589–2619.
Hristache, M., Juditsky, A., Spokoiny, V., 2001. Direct estimation of the index coefficient in a single-index model. Annals of Statistics 29, 595–623.
Hsiao, C., 1996. Logit and probit models. In: Mátyás, L., Sevestre, P. (Eds.), The Econometrics of Panel Data: Handbook of Theory and Applications. Kluwer Academic Publishers, pp. 410–428.
Hu, L., Phillips, P.C.B., 2004a. Dynamics of the federal funds target rate: a nonstationary discrete choice approach. Journal of Applied Econometrics 19, 851–867.
Hu, L., Phillips, P.C.B., 2004b. Non-stationary discrete choice. Journal of Econometrics 120, 103–138.
Ichimura, H., 1993. Semi-parametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics 58, 71–120.


Imbens, G.W., 1992. An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60, 1187–1214.
Imbens, G.W., Lancaster, T., 1996. Efficient estimation and stratified sampling. Journal of Econometrics 74, 289–318.
Inoue, A., Kilian, L., 2008. How useful is bagging in forecasting economic time series? A case study of US CPI inflation. Journal of the American Statistical Association 103, 511–522.
Kauppi, H., 2012. Predicting the direction of the Fed's target rate. Journal of Forecasting 31, 47–67.
Kauppi, H., Saikkonen, P., 2008. Predicting US recessions with dynamic binary response models. The Review of Economics and Statistics 90, 777–791.
Kim, J., Pollard, D., 1990. Cube root asymptotics. Annals of Statistics 18, 191–219.
King, G., Zeng, L., 2001. Logistic regression in rare events data. Political Analysis 9, 137–163.
Kitamura, Y., 2001. Predictive Inference and the Bootstrap. Working Paper, Yale University.
Klein, R.W., Spady, R.H., 1993. An efficient semiparametric estimator for binary response models. Econometrica 61, 387–421.
Koenker, R., Yoon, J., 2009. Parametric links for binary choice models: a Fisherian-Bayesian colloquy. Journal of Econometrics 152, 120–130.
Koop, G., 2003. Bayesian Econometrics. John Wiley & Sons.
Krzanowski, W.J., Hand, D.J., 2009. ROC Curves for Continuous Data. Chapman & Hall.
Krzysztofowicz, R., 1992. Bayesian correlation score: a utilitarian measure of forecast skill. Monthly Weather Review 120, 208–219.
Krzysztofowicz, R., Long, D., 1990. Fusion of detection probabilities and comparison of multisensor systems. IEEE Transactions on Systems, Man, and Cybernetics 20, 665–677.
Kuan, C.M., White, H., 1994. Artificial neural networks: an econometric perspective. Econometric Reviews 13, 1–91.
Kuncheva, L.I., 2004. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons.
Kuncheva, L.I., Whitaker, C.J., 2003. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51, 181–207.
Lahiri, K., Monokroussos, G., Zhao, Y., 2013. The yield spread puzzle and the information content of SPF forecasts. Economics Letters 118, 219–221.
Lahiri, K., Peng, H., Zhao, Y., 2012. Evaluating the value of probability forecasts in the sense of Merton. Paper presented at the 7th New York Camp Econometrics.
Lahiri, K., Teigland, C., Zaporowski, M., 1988. Interest rates and the subjective probability distribution of inflation forecasts. Journal of Money, Credit and Banking 20, 233–248.
Lahiri, K., Wang, J.G., 1994. Predicting cyclical turning points with leading index in a Markov switching model. Journal of Forecasting 13, 245–263.
Lahiri, K., Wang, J.G., 2006. Subjective probability forecasts for recessions: evaluation and guidelines for use. Business Economics 41, 26–37.
Lahiri, K., Wang, J.G., 2013. Evaluating probability forecasts for GDP declines using alternative methodologies. International Journal of Forecasting 29, 175–190.
Lawrence, M., Goodwin, P., O'Connor, M., Önkal, D., 2006. Judgmental forecasting: a review of progress over the last 25 years. International Journal of Forecasting 22, 493–518.
Lechner, M., Lollivier, S., Magnac, T., 2008. Parametric binary choice models. In: Mátyás, L., Sevestre, P. (Eds.), The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice. Springer, pp. 215–245.
Lee, L.F., 1992. On efficiency of methods of simulated moments and maximum simulated likelihood estimation of discrete response models. Econometric Theory 8, 518–552.
Lee, T.H., Yang, Y., 2006. Bagging binary and quantile predictors for time series. Journal of Econometrics 135, 465–497.
Leitch, G., Tanner, J., 1995. Professional economic forecasts: are they worth their costs? Journal of Forecasting 14, 143–157.
Li, Q., Racine, J.S., 2006. Non-Parametric Econometrics: Theory and Practice. Princeton University Press.
Lieli, R.P., Nieto-Barthaburu, A., 2010. Optimal binary prediction for group decision making. Journal of Business & Economic Statistics 28, 308–319.

Page 80: [Handbook of Economic Forecasting]  Volume 2 || Forecasting Binary Outcomes

1104 Kajal Lahiri et al.

Lieli, R.P., Springborn, M., forthcoming. Closing the gap between risk estimation and decision-making: efficient management of trade-related invasive species risk. Review of Economics and Statistics.
Liu, H., Li, G., Cumberland, W.G., Wu, T., 2005. Testing statistical significance of the area under a receiving operating characteristics curve for repeated measures design with bootstrapping. Journal of Data Science 3, 257–278.
Lopez, J.A., 2001. Evaluating the predictive accuracy of volatility models. Journal of Forecasting 20, 87–109.
Lovell, M.C., 1986. Tests of the rational expectations hypothesis. The American Economic Review 76, 110–124.
Maddala, G.S., 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press.
Maddala, G.S., Lahiri, K., 2009. Introduction to Econometrics. John Wiley & Sons.
Manski, C.F., 1975. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3, 205–228.
Manski, C.F., 1985. Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator. Journal of Econometrics 27, 313–333.
Manski, C.F., Lerman, S.R., 1977. The estimation of choice probabilities from choice based samples. Econometrica 45, 1977–1988.
Manski, C.F., Thompson, T.S., 1986. Operational characteristics of maximum score estimation. Journal of Econometrics 32, 85–108.
Manski, C.F., Thompson, T.S., 1989. Estimation of best predictors of binary response. Journal of Econometrics 40, 97–123.
Marcellino, M., 2004. Forecasting EMU macroeconomic variables. International Journal of Forecasting 20, 359–372.
Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis. Academic Press.
Mason, I.B., 2003. Binary events. In: Jolliffe, I.T., Stephenson, D.B. (Eds.), Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons, pp. 37–76.
Mason, S.J., Graham, N.E., 2002. Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society 128, 2145–2166.
Merton, R.C., 1981. On market timing and investment performance. I. An equilibrium theory of value for market forecasts. Journal of Business 54, 363–406.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Prentice Hall.
Monokroussos, G., 2011. Dynamic limited dependent variable modeling and US monetary policy. Journal of Money, Credit and Banking 43, 519–534.
Morgan, J.N., Sonquist, J.A., 1963. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58, 415–434.
Murphy, A.H., 1973. A new vector partition of the probability score. Journal of Applied Meteorology 12, 595–600.
Murphy, A.H., 1977. The value of climatological, categorical and probabilistic forecasts in the cost-loss situation. Monthly Weather Review 105, 803–816.
Murphy, A.H., Daan, H., 1985. Forecast evaluation. In: Murphy, A.H., Katz, R.W. (Eds.), Probability, Statistics, and Decision Making in the Atmospheric Sciences. Westview Press, pp. 379–437.
Murphy, A.H., Winkler, R.L., 1984. Probability forecasting in meteorology. Journal of the American Statistical Association 79, 489–500.
Murphy, A.H., Winkler, R.L., 1987. A general framework for forecast verification. Monthly Weather Review 115, 1330–1338.
Mylne, K.R., 1999. The use of forecast value calculations for optimal decision-making using probability forecasts. In: 17th Conference on Weather Analysis and Forecasting. American Meteorological Society, Boston, Massachusetts, pp. 235–239.
Park, J.Y., Phillips, P.C.B., 2000. Non-stationary binary choice. Econometrica 68, 1249–1280.
Parker, D.B., 1985. Learning logic. Technical Report TR-47. MIT Center for Computational Research in Economics and Management Science, Cambridge, MA.


Patton, A.J., 2006. Modelling asymmetric exchange rate dependence. International Economic Review 47, 527–556.
Peirce, C.S., 1884. The numerical measure of the success of predictions. Science 4, 453–454.
Pesaran, M.H., Skouras, S., 2002. Decision-based methods for forecast evaluation. In: Clements, M.P., Hendry, D.F. (Eds.), A Companion to Economic Forecasting. Wiley-Blackwell, pp. 241–267.
Pesaran, M.H., Timmermann, A., 1992. A simple nonparametric test of predictive performance. Journal of Business & Economic Statistics 10, 461–465.
Pesaran, M.H., Timmermann, A., 2009. Testing dependence among serially correlated multi-category variables. Journal of the American Statistical Association 104, 325–337.
Powell, J.L., Stock, J.H., Stoker, T.M., 1989. Semiparametric estimation of index coefficients. Econometrica 57, 1403–1430.
Primo, C., Ferro, C.A.T., Jolliffe, I.T., Stephenson, D.B., 2009. Combination and calibration methods for probabilistic forecasts of binary events. Monthly Weather Review 137, 1142–1149.
Quinlan, J.R., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Racine, J.S., Parmeter, C.F., 2009. Data-driven model evaluation: a test for revealed performance. McMaster University Working Papers.
Ranjan, R., Gneiting, T., 2010. Combining probability forecasts. Journal of the Royal Statistical Society, Series B 72, 71–91.
Refenes, A.P., White, H., 1998. Neural networks and financial economics. International Journal of Forecasting 17, 347–495.
Richardson, D.S., 2003. Economic value and skill. In: Jolliffe, I.T., Stephenson, D.B. (Eds.), Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons, pp. 165–187.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press.
Rudebusch, G.D., Williams, J.C., 2009. Forecasting recessions: the puzzle of the enduring power of the yield curve. Journal of Business and Economic Statistics 27, 492–503.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, pp. 318–362.
Schervish, M.J., 1989. A general method for comparing probability assessors. Annals of Statistics 17, 1856–1879.
Scott, A.J., Wild, C.J., 1986. Fitting logistic models under case-control or choice based sampling. Journal of the Royal Statistical Society, Series B 48, 170–182.
Scotti, C., 2011. A bivariate model of Federal Reserve and ECB main policy rates. International Journal of Central Banking 7, 37–78.
Seillier-Moiseiwitsch, F., Dawid, A.P., 1993. On testing the validity of sequential probability forecasts. Journal of the American Statistical Association 88, 355–359.
Steinberg, D., Cardell, N.S., 1992. Estimating logistic regression models when the dependent variable has no variance. Communications in Statistics – Theory and Methods 21, 423–450.
Stephenson, D.B., 2000. Use of the "odds ratio" for diagnosing forecast skill. Weather and Forecasting 15, 221–232.
Stock, J.H., Watson, M.W., 1999. A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. In: Engle, R.F., White, H. (Eds.), Cointegration, Causality, and Forecasting: A Festschrift in Honor of Clive W.J. Granger. Oxford University Press, pp. 1–44.
Stock, J.H., Watson, M.W., 2005. An Empirical Comparison of Methods for Forecasting Using Many Predictors. Working Paper, Harvard University and Princeton University.
Stoker, T.M., 1986. Consistent estimation of scaled coefficients. Econometrica 54, 1461–1481.
Stoker, T.M., 1991a. Equivalence of direct, indirect and slope estimators of average derivatives. In: Barnett, W.A., Powell, J., Tauchen, G. (Eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press, pp. 99–118.
Stoker, T.M., 1991b. Lectures on Semiparametric Econometrics. CORE Foundation, Louvain-la-Neuve, Belgium.
Swanson, N.R., White, H., 1995. A model selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics 13, 265–275.


Swanson, N.R., White, H., 1997a. Forecasting economic time series using flexible versus fixed specification and linear versus nonlinear econometric models. International Journal of Forecasting 13, 439–461.
Swanson, N.R., White, H., 1997b. A model selection approach to real-time macroeconomic forecasting using linear models and artificial neural networks. The Review of Economics and Statistics 79, 540–550.
Swets, J.A., 1996. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates.
Tajar, A., Denuit, M., Lambert, P., 2001. Copula-Type Representation for Random Couples with Bernoulli Margins. Discussion Paper 0118, Université Catholique de Louvain.
Tavaré, S., Altham, P.M.E., 1983. Dependence in goodness of fit tests and contingency tables. Biometrika 70, 139–144.
Teräsvirta, T., Tjøstheim, D., Granger, C.W.J., 2010. Modelling Nonlinear Economic Time Series. Oxford University Press.
Teräsvirta, T., van Dijk, D., Medeiros, M.C., 2005. Smooth transition autoregressions, neural networks, and linear models in forecasting macroeconomic time series: a re-examination. International Journal of Forecasting 21, 755–774.
Thompson, J.C., Brier, G.W., 1955. The economic utility of weather forecasts. Monthly Weather Review 83, 249–254.
Tibshirani, R., Hastie, T., 1987. Local likelihood estimation. Journal of the American Statistical Association 82, 559–567.
Timmermann, A., 2006. Forecast combinations. In: Elliott, G., Granger, C.W.J., Timmermann, A. (Eds.), Handbook of Economic Forecasting. North-Holland, Amsterdam, pp. 135–196.
Toth, Z., Talagrand, O., Candille, G., Zhu, Y., 2003. Probability and ensemble forecasts. In: Jolliffe, I.T., Stephenson, D.B. (Eds.), Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons, pp. 137–163.
Train, K.E., 2003. Discrete Choice Methods with Simulation. Cambridge University Press.
Wallsten, T.S., Budescu, D.V., Erev, I., Diederich, A., 1997. Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making 10, 243–268.
West, K.D., 1996. Asymptotic inference about predictive ability. Econometrica 64, 1067–1084.
Wickens, T.D., 2001. Elementary Signal Detection Theory. Oxford University Press.
Wilks, D.S., 2001. A skill score based on economic value for probability forecasts. Meteorological Applications 8, 209–219.
Windmeijer, F.A.G., 1995. Goodness-of-fit measures in binary choice models. Econometric Reviews 14, 101–116.
Wooldridge, J.M., 2005. Simple solutions to the initial conditions problem in dynamic nonlinear panel data models with unobserved heterogeneity. Journal of Applied Econometrics 20, 39–54.
Xie, Y., Manski, C.F., 1989. The logit model and response-based samples. Sociological Methods and Research 17, 283–302.
Yang, Y., 2004. Combining forecasting procedures: some theoretical results. Econometric Theory 20, 176–222.
Yates, J.F., 1982. External correspondence: decompositions of the mean probability score. Organizational Behavior and Human Performance 30, 132–156.
Zhou, X.H., Obuchowski, N.A., McClish, D.K., 2002. Statistical Methods in Diagnostic Medicine. John Wiley & Sons.

